Gene Function Classification Using Bayesian Models with Hierarchy-Based Priors

Babak Shahbaba, Dept. of Public Health Sciences, University of Toronto
Radford M. Neal, Dept. of Statistics and Dept. of Computer Science, University of Toronto

We investigate the application of hierarchical classification schemes to the annotation of gene function based on several characteristics of protein sequences, including phylogenic descriptors, sequence-based attributes, and predicted secondary structure. We discuss three Bayesian models and compare their predictive accuracy. These models are the ordinary multinomial logit (MNL) model, a hierarchical model based on a set of nested MNL models, and an MNL model with a prior that introduces correlations between the parameters for classes that are nearby in the hierarchy. We also provide a new scheme for combining different sources of information. We use these models to predict the functional class of Open Reading Frames (ORFs) from the E. coli genome. The results from all three models show substantial improvement over previous methods, which were based on the C5 algorithm. The MNL model using a prior based on the hierarchy outperforms both the non-hierarchical MNL model and the nested MNL model. In contrast to previous attempts at combining these sources of information, our approach achieves higher accuracy than models that use each data source alone. Together, these results show that gene function can be predicted with higher accuracy than previously achieved, using Bayesian models that incorporate suitable prior information.
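As a rough illustration of how a prior can introduce correlations between the parameters of classes that are nearby in a hierarchy, the sketch below builds each class's MNL coefficient vector as a sum of independent Gaussian branch parameters along the path from the root to that class. This is a minimal sketch under stated assumptions, not the report's implementation: the toy hierarchy `PATHS`, the fixed prior `scale`, and the function names are all hypothetical, and no model fitting is performed.

```python
import numpy as np

# Hypothetical toy hierarchy: each class is listed with the branches on its
# path from the root. Classes c1 and c2 share branch b1; c3 and c4 share b2.
PATHS = {
    "c1": ["b1", "b1.1"],
    "c2": ["b1", "b1.2"],
    "c3": ["b2", "b2.1"],
    "c4": ["b2", "b2.2"],
}


def sample_class_coefficients(paths, n_features, rng, scale=1.0):
    """Draw class coefficients from a hierarchy-based prior (assumed form).

    Every branch gets an independent N(0, scale^2) parameter vector, and a
    class's coefficient vector is the sum over the branches on its path.
    Classes that share a branch therefore have correlated coefficients.
    """
    branches = sorted({b for path in paths.values() for b in path})
    phi = {b: rng.normal(0.0, scale, size=n_features) for b in branches}
    return {c: np.sum([phi[b] for b in path], axis=0) for c, path in paths.items()}


def mnl_probabilities(x, betas):
    """Multinomial logit class probabilities for one feature vector x."""
    classes = list(betas)
    scores = np.array([x @ betas[c] for c in classes])
    scores -= scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores)
    return dict(zip(classes, weights / weights.sum()))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Treat each of many features as an independent draw from the prior,
    # so correlations across features estimate the prior correlations.
    betas = sample_class_coefficients(PATHS, 20000, rng)
    print("siblings (c1, c2):", np.corrcoef(betas["c1"], betas["c2"])[0, 1])
    print("different branches (c1, c3):", np.corrcoef(betas["c1"], betas["c3"])[0, 1])
```

With equal branch variances, as assumed here, coefficients of sibling classes have a prior correlation of about 0.5, while classes under different top-level branches are uncorrelated. An ordinary MNL model with independent priors would correspond to giving every class its own single branch.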

Technical Report No. 0606, Dept. of Statistics, University of Toronto (May 2006), 14 pages: postscript, pdf.

Also available from arXiv.org.


Associated references: A revised version of this technical report has been published:
Shahbaba, B. and Neal, R. M. (2006) "Gene function classification using Bayesian models with hierarchy-based priors", BMC Bioinformatics, 7:448, 9 pages: abstract, pdf, html, associated references.

The following technical report introduced the class of models that are used here for gene function classification:
Shahbaba, B. and Neal, R. M. (2005) "Improving classification when a class hierarchy is available using a hierarchy-based prior", Technical Report No. 0510, Dept. of Statistics, 11 pages: abstract, postscript, pdf, associated references.