
Characterization and Prediction of Macromolecular (Protein, DNA, RNA) Sequence-Structure-Function-Interaction Relationships

 

Project Personnel

  • Dr. Vasant Honavar, Professor and Edward Frymoyer Chair of Information Sciences and Technology, Principal Investigator.
  • Dr. Drena Dobbs, Professor of Genetics, Cell, and Developmental Biology, Collaborator.
  • Dr. Robert Jernigan, Professor, Department of Biochemistry, Biophysics, and Molecular Biology, Collaborator.
  • Yasser El-Manzalawy, Postdoctoral Research Associate, Computer Science.
  • Carson Andorf, US Department of Agriculture, Collaborator.
  • Rafael Jordan, Ph.D. Student, Computer Science.
  • Rasna Walia, Ph.D. Student, Bioinformatics and Computational Biology.
  • Li Xue, Ph.D. Student, Bioinformatics and Computational Biology.

Alumni

  • Fadi Towfic, Ph.D., Bioinformatics and Computational Biology, 2010.
  • Cornelia Caragea, Ph.D., Computer Science, 2009.
  • Yasser El-Manzalawy, Ph.D., Computer Science, 2008.
  • Diane Schroeder, Undergraduate student, Computer Science and Genetics, 2002.
  • Michael Terribilini, Ph.D., Bioinformatics and Computational Biology, 2008.
  • Feihong Wu, Ph.D., Bioinformatics and Computational Biology, 2008.
  • Changhui Yan, Ph.D., Bioinformatics and Computational Biology, 2005.

Project Summary

Assigning putative functions from sequences remains one of the most challenging problems in functional genomics. Improvements in annotating protein sequences can be expected to yield significant improvements in gene annotations. Protein-protein, protein-DNA, and protein-RNA interactions play a pivotal role in protein function. Experimental detection of residues in macromolecular interaction surfaces must come from determination of the structure of protein-protein, protein-DNA, and protein-RNA complexes. However, experimental determination of such complexes lags far behind the number of known protein sequences. Hence, there is a need for reliable computational methods for identifying protein-protein, protein-RNA, and protein-DNA interface residues. Identification of macromolecular interaction sites and detection of specific amino acid residues that contribute to the specificity and strength of such interactions is an important problem with broad applications ranging from rational drug design to the analysis of metabolic and signal transduction networks. Against this background, this project is aimed at developing, systematically evaluating, and disseminating computational methods for discovering sequence and structural correlates of protein function by analyzing large data sets derived from multiple information sources (e.g., protein sequences, protein structures, protein-protein interaction data, gene expression data), from multiple perspectives, based on different views of structure and function. Some specific aims of this research are:

  • To develop, implement, and evaluate novel machine learning algorithms for assigning proteins to structural and functional families that address specific limitations of existing data mining algorithms for computational characterization of protein sequence-structure-function relationships, including, in particular, algorithms for exploiting prior knowledge (e.g., hierarchical taxonomies of attributes) and for multi-label classification tasks (e.g., assigning a protein to several, not necessarily disjoint, classes of function based on existing models of biological function such as those captured by the Gene Ontology (GO) classifications); probabilistic graphical models and probabilistic language models for sequence classification; and efficient algorithms for learning from multiple tables, which provide a natural way to incorporate information (data and knowledge) from multiple sources in analysis of protein structure and function from multiple perspectives.
  • To develop, implement, and systematically evaluate machine learning approaches for characterization and prediction of protein-protein, protein-DNA, and protein-RNA interaction residues and other functionally important sites (e.g., B-cell and T-cell epitopes, glycosylation and phosphorylation sites) primarily from protein sequence data (but utilizing other sources of data when available, including predicted or known structures of the protein but not the complex, evolutionary profiles, etc.). As part of this effort, we have assembled a comprehensive database of protein-protein interface residues and non-redundant datasets of protein-RNA and protein-DNA interfaces and B-cell and T-cell epitopes, and developed novel machine learning approaches to prediction of functionally important sites using a variety of sequence and structure-derived features of proteins.

This project is funded in part by grants from the National Institutes of Health (GM066387 to Vasant Honavar, GM081680 to Andrzej Kloczkowski).

This project has resulted in:

  • Application of classifiers trained using machine learning to discover a large set of incorrect Gene Ontology annotations in an experimentally well-studied family of proteins, mouse kinases (Vasant Honavar with Ph.D. student Carson Andorf)
  • Development and applications of probabilistic graphical models and related methods for assigning protein sequences to functional families, predicting protein subcellular localization, etc. (Vasant Honavar with Ph.D. students Carson Andorf and Cornelia Caragea)
  • Construction and analysis of PPIDB, a comprehensive database of protein-protein interfaces (Vasant Honavar with Ph.D. students Feihong Wu, Rafael Jordan and collaborator Drena Dobbs)
  • Development of machine learning approaches to prediction of protein-protein interface residues from amino acid sequence, evolutionary, and, when available, structural information (Vasant Honavar with Ph.D. student Changhui Yan and collaborators Drena Dobbs and Robert Jernigan)
  • Demonstration of the pitfalls of commonly used windows-based cross-validation for sequence-based classification tasks (e.g., phosphorylation site prediction, DNA-binding site prediction) (Vasant Honavar with Ph.D. student Cornelia Caragea)
  • Development of machine learning approaches and implementation of online servers for prediction of protein-RNA interface residues from amino acid sequence and, when available, structural information (Vasant Honavar with Ph.D. students Michael Terribilini, Cornelia Caragea, and collaborator Drena Dobbs)
  • Development of machine learning approaches and implementation of online servers for prediction of protein-DNA interface residues from amino acid sequence and, when available, structural information (Vasant Honavar with Ph.D. students Changhui Yan, Cornelia Caragea, and collaborator Drena Dobbs)
  • Structural characterization of protein-protein and protein-RNA interfaces (Vasant Honavar with Ph.D. students Feihong Wu and Fadi Towfic)
  • Development of machine learning methods and online servers for identification of post-translational modification sites (e.g., glycosylation sites) in amino acid sequences (Vasant Honavar with Ph.D. student Cornelia Caragea)
  • Development of kernel-based methods for predicting B-cell epitopes from amino acid sequences (Vasant Honavar with Ph.D. student Yasser El-Manzalawy and collaborator Drena Dobbs)
  • Demonstrations of the pitfalls of commonly used benchmark datasets for evaluating the performance of machine learning approaches to epitope prediction (Vasant Honavar with Ph.D. student Yasser El-Manzalawy)
  • Development of some of the state-of-the-art machine learning methods for predicting variable length linear B-cell epitopes and conformational B-cell epitopes (Vasant Honavar with Ph.D. student Yasser El-Manzalawy)
  • Development and characterization of the effectiveness of sequence homology based methods for predicting protein-protein interaction sites (Vasant Honavar with Ph.D. student Li Xue and collaborator Drena Dobbs)
  • Prediction of the designability of binary (H-P) protein sequences (Robert Jernigan with Ph.D. student Myron Peto and collaborators Andrzej Kloczkowski and Vasant Honavar)
  • Prediction of protein and RNA binding sites in recalcitrant (with regard to attempts at structure determination) proteins, e.g., HIV-1 and EIAV, and experimental confirmation of the predictions (Drena Dobbs with Ph.D. students Jae-Hyung Lee, Michael Terribilini and collaborators Vasant Honavar and Susan Carpenter)
  • Development and application of an approach to combining homology modeling and structure prediction methods with machine learning to predict sequence and structural correlates of functionally important sites of telomerase (RNA, DNA, and protein binding sites) (Drena Dobbs with Ph.D. students Michael Terribilini, Jae-Hyung Lee, Cornelia Caragea, and collaborator Vasant Honavar)
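
The contrast between window-based and sequence-based cross-validation mentioned in the list above can be illustrated with a small sketch. It is not the project's code: the function names, the toy sequences and labels, and the window half-width are all hypothetical and chosen only for illustration. It shows that when overlapping residue windows are shuffled individually into folds, windows from the same sequence can land in both training and test folds, whereas assigning whole sequences to folds avoids this.

```python
import random


def make_windows(seq_id, sequence, labels, half_width=2):
    """Return one (seq_id, window, label) tuple per residue.

    The sequence is padded with '-' so every residue gets a full-width window.
    """
    padded = "-" * half_width + sequence + "-" * half_width
    width = 2 * half_width + 1
    return [(seq_id, padded[i:i + width], labels[i]) for i in range(len(sequence))]


def window_based_folds(windows, k, seed=0):
    """Shuffle individual windows into k folds, ignoring their source sequence."""
    shuffled = list(windows)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def sequence_based_folds(windows, k, seed=0):
    """Assign whole sequences to folds so no sequence is split across folds."""
    seq_ids = sorted({w[0] for w in windows})
    random.Random(seed).shuffle(seq_ids)
    fold_of = {sid: i % k for i, sid in enumerate(seq_ids)}
    folds = [[] for _ in range(k)]
    for w in windows:
        folds[fold_of[w[0]]].append(w)
    return folds


def leaked_sequences(folds):
    """Count sequences whose windows appear in more than one fold."""
    seen = {}
    for idx, fold in enumerate(folds):
        for seq_id, _, _ in fold:
            seen.setdefault(seq_id, set()).add(idx)
    return sum(1 for fold_ids in seen.values() if len(fold_ids) > 1)


# Hypothetical toy data: three 8-residue sequences with made-up 0/1 site labels.
toy = {
    "seqA": ("MKTAYIAK", "00110000"),
    "seqB": ("GSHMLEDP", "01100000"),
    "seqC": ("QVNRWFCT", "00001100"),
}
windows = []
for sid, (seq, lab) in toy.items():
    windows.extend(make_windows(sid, seq, [int(c) for c in lab]))

print("window-based leaks:  ", leaked_sequences(window_based_folds(windows, 2)))
print("sequence-based leaks:", leaked_sequences(sequence_based_folds(windows, 2)))
```

Because neighboring windows from one sequence share most of their residues, a split that separates them lets near-duplicates of the test examples appear in the training data, which can make the estimated performance look better than it would be on unseen sequences.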

Software, Databases, and Servers

Publications

  1. El-Manzalawy, Y., Dobbs, D., and Honavar, V. (2011). Predicting MHC-II binding affinity using multiple instance regression. IEEE/ACM Transactions on Computational Biology and Bioinformatics. DOI: 10.1109/TCBB.2010.94

  2. Lewis, B.A., Walia, R.R., Terribilini, M., Ferguson, J., Zheng, C., Honavar, V., and Dobbs, D. (2011). PRIDB: A Protein-RNA Interface Database. Nucleic Acids Research. D277-282. DOI: 10.1093/nar/gkq1108.

  3. Xue, L., Dobbs, D., and Honavar, V. (2011). HomPPI: A Class of Sequence Homology Based Protein-Protein Interface Prediction Methods. BMC Bioinformatics 12:244. doi:10.1186/1471-2105-12-244

  4. Xue, L., Jordan, R., El-Manzalawy, Y., Dobbs, D., and Honavar, V. (2011). Sequence Based Partner-Specific Prediction of Protein-Protein Interfaces and its Application in Ranking Docked Models. In: ACM Conference on Bioinformatics and Computational Biology. ACM Press.

  5. Caragea, C., Silvescu, A., Caragea, D. and Honavar, V. (2010). Abstraction-Augmented Markov Models. In: Proceedings of the IEEE Conference on Data Mining (ICDM 2010). IEEE Press. pp. 68-77.

  6. Caragea, C., Silvescu, A., Caragea, D., and Honavar, V. (2010). Semi-supervised prediction of protein subcellular localization using abstraction augmented Markov models. BMC Bioinformatics. doi: 10.1186/1471-2105-11-S8-S6.

  7. Caragea, C., Silvescu, A., Caragea, D., and Honavar, V. (2010). Semi-Supervised Sequence Classification Using Abstraction Augmented Markov Models. In: Proceedings of the ACM Conference on Bioinformatics and Computational Biology. pp. 257-264, doi: 10.1145/1854776.1854813. ACM Press.

  8. El-Manzalawy, Y. and Honavar, V. (2010). Recent Advances in B-Cell Epitope Prediction Methods. Immunome Research Suppl. 2:S2.

  9. Koul, N., Bui, N., and Honavar, V. (2010). Scalable, Updatable Predictive Models for Sequence Data. In: Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2010).

  10. Koul, N. and Honavar, V. (2010). Learning in the Presence of Ontology Mapping Errors. In: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. pp. 291-296. ACM Press.

  11. Towfic, F., Caragea, C., Dobbs, D., and Honavar, V. (2010). Struct-NB: Predicting protein-RNA binding sites using structural features. International Journal of Data Mining and Bioinformatics. Vol. 4. pp. 21-43.

  12. Towfic, F., VanderPlas, S., Oliver, C.A., Couture, O., Tuggle, C.K., Greenlee, M.H.W., and Honavar, V. (2010). Detection of gene orthology from gene co-expression and protein interaction networks. BMC Bioinformatics, 11(Suppl 3):S7.

  13. Tuggle, C.K., Bearson, S.M.D., Huang, T.H., Couture, O., Wang, Y., Kuhar, D., Lunney, J.K., and Honavar, V. (2010). Methods for transcriptomic analyses of the porcine host immune response: Application to Salmonella infection using microarrays. Veterinary Immunology and Immunopathology. Vol. 138. pp. 282-291.

  14. Caragea, C., Sinapov, J., Dobbs, D., and Honavar, V. (2009). Mixture of experts models to exploit global sequence similarity on biomolecular sequence labeling. BMC Bioinformatics. doi:10.1186/1471-2105-10-S4-S4

  15. Couture, O., Callenberg, K., Koul, N., Pandit, S., Younes, J., Hu, Z-L., Dekkers, J., Reecy, J., Honavar, V., and Tuggle, C. (2009). ANEXdb: An Integrated Animal ANnotation and Microarray EXpression Database. Mammalian Genome. DOI 10.1007/s00335-009-9234-1

  16. Silvescu, A., Caragea, C. and Honavar, V. (2009). Combining Super-structuring and Abstraction on Sequence Classification. IEEE Conference on Data Mining (ICDM 2009).

  17. Towfic, F., Greenlee, H., and Honavar, V. (2009). Aligning Biomolecular Networks Using Modular Graph Kernels. In: Proceedings of the 9th Workshop on Algorithms in Bioinformatics (WABI 2009). Berlin: Springer-Verlag: LNBI Vol. 5724, pp. 345-361.

  18. Towfic, F., Greenlee, H., and Honavar, V. (2009). Detecting Orthologous Genes Based on Protein-Protein Interaction Networks. In: Proceedings of the IEEE Conference on Bioinformatics and Biomedicine (BIBM 2009). IEEE Press.

  19. Caragea, C. and Honavar, V. (2008). Machine Learning in Computational Biology. In: Encyclopedia of Database Systems, (Raschid, L., Editor), Springer.

  20. Caragea, C., Sinapov, J., Dobbs, D., and Honavar, V. (2008). Using Global Sequence Similarity to Enhance Macromolecular Sequence Labeling. IEEE Conference on Bioinformatics and Biomedicine, IEEE Press, pp. 104-111.

  21. El-Manzalawy, Y., Dobbs, D., and Honavar, V. (2008). On Evaluating MHC-II Binding Peptide Prediction Methods. PLoS One, 3(9): e3268. doi:10.1371/journal.pone.0003268

  22. El-Manzalawy, Y., Dobbs, D., and Honavar, V. (2008). Predicting Flexible Length Linear B-cell Epitopes, 7th International Conference on Computational Systems Bioinformatics, Stanford, CA. Singapore: World Scientific.

  23. El-Manzalawy, Y., Dobbs, D., and Honavar, V. (2008). Predicting linear B-cell epitopes using string kernels. Journal of Molecular Recognition, DOI: 10.1002/jmr.893

  24. El-Manzalawy, Y., Dobbs, D., and Honavar, V. (2008). Predicting Protective Linear B-cell Epitopes using Evolutionary Information. IEEE Conference on Bioinformatics and Biomedicine, pp. 289-292, IEEE Press.

  25. Lee, J-H., Hamilton, M., Gleeson, C., Caragea, C., Zaback, P., Sander, J., Lee, X., Wu, F., Terribilini, M., Honavar, V. and Dobbs, D. Striking Similarities in Diverse Telomerase Proteins Revealed by Combining Structure Prediction and Machine Learning Approaches. In: Proceedings of the Pacific Symposium on Biocomputing (PSB 2008). Vol. 13. pp. 501-512, 2008.

  26. Peto, M., Kloczkowski, A., Honavar, V., and Jernigan, R.L. (2008). Use of machine learning algorithms to classify binary protein sequences as highly-designable or poorly-designable. BMC Bioinformatics, 9:487.

  27. Yan, C., Dobbs, D., Jernigan, R., and Honavar, V. (2008). Characterization of Protein-Protein Interfaces. The Protein Journal. doi:10.1007/s10930-007-9108-x

  28. Andorf, C., Dobbs, D. and Honavar, V. (2007). Exploring Inconsistencies in Genome Wide Protein Function Annotations: A Machine Learning Approach. BMC Bioinformatics 8:284 doi:10.1186/1471-2105-8-284

  29. Caragea, C., Sinapov, J., Dobbs, D., and Honavar, V. (2007). Assessing the Performance of Macromolecular Sequence Classifiers, In: Proceedings of the IEEE Conference on Bioinformatics and Bioengineering (BIBE 2007). pp. 320-326, 2007.

  30. Caragea, C., Sinapov, J., Silvescu, A., Dobbs, D. and Honavar, V. (2007). Glycosylation Site Prediction Using Ensembles of Support Vector Machine Classifiers. BMC Bioinformatics. doi:10.1186/1471-2105-8-438.

  31. Terribilini, M., Sander, J.D., Lee, J-H., Zaback, P., Jernigan, R.L., Honavar, V. and Dobbs, D. (2007). RNABindR: A Server for Analyzing and Predicting RNA Binding Sites in Proteins. Nucleic Acids Research. doi:10.1093/nar/gkm294

  32. Towfic, F., Gemperline, D.C., Caragea, C., Wu, F., Dobbs, D., and Honavar, V. (2007). Structural Characterization of RNA-Binding Sites of Proteins: Preliminary Results. In: Computational Structural Bioinformatics Workshop, IEEE International Conference on Bioinformatics and Biomedicine.

  33. Wu, F., Towfic, F., Dobbs, D. and Honavar, V. (2007). Analysis of Protein Protein Dimeric Interfaces. IEEE International Conference on Bioinformatics and Biomedicine.

  34. Kang, D-K., Silvescu, A. and Honavar, V. (2006). RNBL-MN: A Recursive Naive Bayes Learner for Sequence Classification. Proceedings of the Tenth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2006). Lecture Notes in Computer Science, Berlin: Springer-Verlag. pp. 45-54, 2006.

  35. Terribilini, M., Lee, J-H., Yan, C., Carpenter, S., Jernigan, R., Honavar, V. and Dobbs, D. Identifying interaction sites in recalcitrant proteins: predicted protein and RNA binding sites in HIV-1 and EIAV agree with experimental data. Pacific Symposium on Biocomputing, Hawaii, World Scientific. Vol. 11. pp. 415-426, 2006.

  36. Terribilini, M., Lee, J.-H., Yan, C., Jernigan, R. L., Honavar, V. and Dobbs, D. (2006). Predicting RNA-binding Sites from Amino Acid Sequence. In: RNA Journal. Vol. 12. pp. 1450-1462.

  37. Wu, F., Olson, B., Dobbs, D., and Honavar, V. (2006). Using Kernel Methods to Predict Protein-Protein Interaction Sites from Sequence. IEEE Joint Conference on Neural Networks, Vancouver, Canada, IEEE Press.

  38. Yan, C., Terribilini, M., Wu, F., Jernigan, R.L., Dobbs, D. and Honavar, V. (2006). Identifying amino acid residues involved in protein-DNA interactions from sequence. BMC Bioinformatics, 2006.

  39. Caragea, D., Silvescu, A., Pathak, J., Bao, J., Andorf, C., Dobbs, D., and Honavar, V. Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources. Data Integration in Life Sciences (DILS 2005) Springer-Verlag Lecture Notes in Computer Science, San Diego, Berlin: Springer-Verlag. Vol. 3615. pp. 175-190, 2005.

  40. Yakhnenko, O., Silvescu, A., and Honavar, V. Discriminatively Trained Markov Model for Sequence Classification. IEEE Conference on Data Mining (ICDM 2005), Houston, Texas, IEEE Press, 2005.

  41. Wu, F., Zhang, J., and Honavar, V. Learning Classifiers Using Hierarchically Structured Class Taxonomies. Proceedings of the Symposium on Abstraction, Reformulation, and Approximation (SARA 2005), Edinburgh, Berlin, Springer-Verlag. Vol. 3607. pp. 313-320, 2005.

  42. Sen, T.Z., Kloczkowski, A., Jernigan, R.L., Yan, C., Honavar, V., Ho, K-M., Wang, C-Z., Ihm, Y., Cao, H., Gu, X., and Dobbs, D. Predicting Binding Sites of Protease-Inhibitor Complexes by Combining Multiple Methods. BMC Bioinformatics. Vol. 5. pp. 205, 2004.

  43. Yan, C., Dobbs, D., and Honavar, V. A Two-Stage Classifier for Identification of Protein-Protein Interface Residues. Proceedings of the Conference on Intelligent Systems in Molecular Biology (ISMB 2004). 2004.

  44. Yan, C., Dobbs, D., and Honavar, V. A Two-Stage Classifier for Identification of Protein-Protein Interface Residues. Bioinformatics. Vol. 20. pp. i371-378, 2004.

  45. Yan, C., Dobbs, D., and Honavar, V. Identifying Protein-Protein Interaction Sites from Surface Residues – A Support Vector Machine Approach. Neural Computing Applications. Vol. 13. pp. 123-129, 2004.

  46. Wang, X., Schroeder, D., Dobbs, D., and Honavar, V. (2003). Automated Data-Driven Discovery of Motif-Based Protein Function Classifiers. Information Sciences. Vol. 155. pp. 1-18.

Presentations

  1. Honavar, V. (2009). Invited Keynote Talk, Aligning Macromolecular Networks. Sixth International Biotechnology and Bioinformatics Symposium (BIOT 2009), Lincoln, Nebraska, October 2009.

  2. Honavar, V. (2009). Invited Talk, From Annotating Sequences to Aligning Networks. Sixth Annual Computation and Informatics in Biology and Medicine Retreat, University of Wisconsin, Madison, October 2009.

  3. Honavar, V. (2009). Invited Colloquium, Transforming Biology From a Descriptive Science into a Predictive Science, Indian Institute of Information Technology, Bangalore, India, January 2009.

  4. Honavar, V. (2008). Invited Colloquium, Transforming Biology From a Descriptive Science into a Predictive Science: Predictive Models of Macromolecular Function and Interaction. Bioinformatics Center, University of Pune, India, December 2008.

  5. Honavar, V. (2008). Invited Colloquium, Semantics-Enabled Infrastructure for Collaborative, Integrative e‐Science. School of Information Technology, Jawaharlal Nehru University, New Delhi, India, December 2008.

  6. Honavar, V. (2008). Invited Talk, Computational Sciences. High Performance Computing Center, Jawaharlal Nehru University, New Delhi, India, December 2008.

  7. Honavar, V. (2008). Invited Plenary Talk, Machine Learning in Bioinformatics, Annual Conference of the Italian Association for Artificial Intelligence (AI*IA 2008), Cagliari, Italy, September 2008.

  8. Honavar, V. (2008). Keynote Talk, International Congress on Pervasive Computing and Management (ICPCM 2008), New Delhi, India, December 2008.

  9. Honavar, V. (2008). Invited Talk, Telluride Meeting on Characterizing the Landscape From Biomolecules to Cellular Networks, Telluride, Colorado, July 2008.

  10. Honavar, V. (2008). Invited Colloquium, Semantics-Enabled infrastructure for collaborative, integrative e-science. Yahoo!, Bangalore, India, January 2008.

  11. Honavar, V. (2007). Semantic Web for Collaborative Knowledge Acquisition. HP Research Labs, Bangalore, India.

  12. Honavar, V. (2007). Keynote Talk, Computational Structural Bioinformatics Workshop, IEEE Conference on Bioinformatics and Biomedicine, Silicon Valley, 2007.

  13. Honavar, V. (2007). Invited Talk, Making Biology and Medicine a Predictive Science. NSF Workshop on Biomedical Informatics. Oregon, 2007.

  14. Honavar, V. (2007). Invited Talk, Knowledge Acquisition from Semantically Disparate Distributed Data. NSF Workshop on Next Generation Data Mining and Cyber‐Enabled Discovery, Baltimore, Maryland, 2007.

  15. Honavar, V. (2006). Keynote Talk, Semantic Web for Collaborative e-Science, International Conference on Intelligent Sensing and Information Processing, Bangalore, India.

  16. Honavar, V. (2006). Invited Colloquium, Algorithms and Software for Knowledge Acquisition from Semantically Heterogeneous, Distributed Data Sources. Dept. of Electrical and Computer Engineering. University of Iowa.

  17. Honavar, V. (2006). Invited Colloquium, Algorithms and Software for Collaborative Discovery in Systems Biology. Dept. Biostatistics, Bioinformatics and Epidemiology. Medical University of South Carolina.

  18. Honavar, V. (2005). Invited Talk, Algorithms and Software for Knowledge Acquisition from Semantically Heterogeneous, Distributed, Autonomous Information Sources. Google Research.

  19. Honavar, V. (2002). Invited Colloquium, Computational Discovery of Protein Sequence-Structure-Function Relationships: Bioinformatics Infrastructure and Sample Applications. University of Wisconsin-Madison Biostatistics and Medical Informatics Department. 2002.