Data-Driven Discovery of Macromolecular Sequence-Structure-Function-Interaction-Expression Relationships

(In collaboration with Drena Dobbs and Robert Jernigan; funded in part by National Institutes of Health grant 5R21GM066387)

Assigning putative functions from sequences remains one of the most challenging problems in functional genomics. Improvements in annotating protein sequences can be expected to yield significant improvements in gene annotations. Protein-protein, protein-DNA, and protein-RNA interactions play a pivotal role in protein function. Experimental identification of the residues in these interaction surfaces requires determining the structures of protein-protein, protein-DNA, and protein-RNA complexes. However, the number of experimentally determined complexes lags far behind the number of known protein sequences. Hence, there is a need for reliable computational methods for identifying protein-protein interface residues. Identifying protein-protein interaction sites, and detecting the specific amino acid residues that contribute to the specificity and strength of such interactions, is an important problem with broad applications ranging from rational drug design to the analysis of metabolic and signal transduction networks.

Against this background, this project aims to develop and systematically evaluate computational methods for discovering sequence and structural correlates of protein function by analyzing large data sets derived from multiple information sources (e.g., protein sequences, protein structures, protein-protein interaction data, gene expression data), from multiple perspectives, based on different views of structure and function. Some specific aims of this research are:

  1. To develop, implement, and evaluate novel data mining algorithms for assigning proteins to structural and functional families. These algorithms are intended to address specific limitations of existing data mining algorithms for computational characterization of protein sequence-structure-function relationships. They include, in particular, probabilistic graphical models and probabilistic language models for sequence classification, and new algorithms for learning from network data, which provide a natural way to incorporate information (data and knowledge) from multiple sources when analyzing protein structure and function from multiple perspectives.
  2. To develop, implement, and systematically evaluate data mining approaches for characterization and prediction of protein-protein, protein-DNA, and protein-RNA interaction residues and other functionally important sites (e.g., B-cell and T-cell epitopes, glycosylation and phosphorylation sites), primarily from protein sequence data but utilizing other sources of data when available (including predicted or known structures of the protein but not the complex, evolutionary profiles, etc.).

Some of the results to date include:

  • Comprehensive Database of Protein-protein Interfaces (Jordan et al., 2012) and of Protein-RNA Interfaces (Lewis et al., 2010).
  • Development of a state-of-the-art approach to predicting protein-RNA interfaces.
  • Development of sequence homology-based methods and online servers for protein interface prediction (Xue et al., 2011), including non-partner-specific methods for predicting obligate interfaces and interfaces of disordered proteins, and partner-specific methods for predicting transient interfaces.
  • Development of sequence-based machine learning methods for predicting the approximate number of putative interaction partners of a protein (Andorf et al., 2013).
  • Development of a novel approach and online server for scoring docked protein-protein complex conformations using predicted partner-specific protein-protein interfaces (Xue et al., 2011; 2012).
  • Demonstration of the pitfalls of commonly used window-based cross-validation for sequence-based classification tasks (e.g., phosphorylation site prediction, DNA-binding site prediction) (Caragea et al., 2009).
  • Application of classifiers trained using machine learning to discover a large set of incorrect Gene Ontology annotations in an experimentally well-studied family of proteins, the mouse kinases (Andorf et al., 2007).
  • Development of machine learning approaches and online servers for prediction of protein-DNA interface residues from amino acid sequence, and when available, structural information (Yan et al., 2006).
  • Structural characterization of protein-protein and protein-RNA interfaces (Towfic et al., 2011).
  • Development of machine learning methods and online servers for identification of posttranslational modification sites (e.g., phosphorylation sites, glycosylation sites) in amino acid sequences (Caragea et al., 2007).
  • Development of machine learning methods and online servers for predicting linear and B-cell epitopes from amino acid sequences (El-Manzalawy et al., 2008), including methods for predicting variable-length and conformational B-cell epitopes.
  • Demonstrations of the pitfalls of commonly used benchmark datasets for evaluating the performance of machine learning approaches to MHC-II binding site prediction (El-Manzalawy, 2008).
  • Prediction of the designability of binary (H-P) protein sequences (Peto et al., 2008).
  • Prediction of protein- and RNA-binding sites in proteins that are recalcitrant to structure determination (e.g., proteins from HIV-1 and EIAV), and experimental confirmation of the predictions (with Lee et al., 2008).
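
The cross-validation pitfall listed among the results above (Caragea et al., 2009) can be illustrated with a minimal sketch. It assumes that each residue is represented by a fixed-width window of neighboring residues, and it contrasts dealing individual windows into folds at random with assigning whole sequences to folds. The function names, toy sequences, and window width are hypothetical choices made for illustration only; this is not the project's code or evaluation protocol.

```python
import random


def extract_windows(seq_id, sequence, half_width=2, pad="-"):
    """Return one (seq_id, position, window) tuple per residue."""
    padded = pad * half_width + sequence + pad * half_width
    width = 2 * half_width + 1
    return [(seq_id, i, padded[i:i + width]) for i in range(len(sequence))]


def window_level_folds(windows, k, seed=0):
    """Window-based CV: shuffle all windows and deal them into k folds."""
    shuffled = list(windows)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def sequence_level_folds(windows, k, seed=0):
    """Sequence-based CV: every window of a sequence lands in one fold."""
    seq_ids = sorted({w[0] for w in windows})
    random.Random(seed).shuffle(seq_ids)
    fold_of = {sid: i % k for i, sid in enumerate(seq_ids)}
    folds = [[] for _ in range(k)]
    for w in windows:
        folds[fold_of[w[0]]].append(w)
    return folds


def sequences_shared_across_folds(folds):
    """Count sequences that contribute windows to more than one fold."""
    seen = {}
    for idx, fold in enumerate(folds):
        for seq_id, _, _ in fold:
            seen.setdefault(seq_id, set()).add(idx)
    return sum(1 for fold_ids in seen.values() if len(fold_ids) > 1)


# Hypothetical toy sequences (not real proteins).
toy = {
    "protA": "MKTAYIAKQR",
    "protB": "GSHMLEDPVA",
    "protC": "TTQLNWRKEG",
    "protD": "AVLIMCFYWH",
}
all_windows = [w for sid, s in toy.items() for w in extract_windows(sid, s)]

window_folds = window_level_folds(all_windows, k=2)
sequence_folds = sequence_level_folds(all_windows, k=2)

print("shared (window-level):  ", sequences_shared_across_folds(window_folds))
print("shared (sequence-level):", sequences_shared_across_folds(sequence_folds))
```

In the sequence-level split, no sequence appears in more than one fold by construction. In the window-level split, overlapping windows from the same sequence typically end up in both training and test folds, which is one plausible way such a procedure could give optimistic performance estimates.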

The online web servers can be found at (