
Learning Predictive Models from Richly Structured Data


Many applications, e.g., biomolecular sequence analysis, image classification, text classification, and social network analysis, require methods for classifying structured data. Of particular interest are topologically structured data, i.e., data whose topology reflects intrinsic dependencies among their constituent elements. Learning from topologically or relationally structured data presents several challenges:

  • Data representation: The representation presented to a learner must be rich enough to capture the distinctions that matter for learning, yet not so rich that learning becomes harder due to over-fitting.
  • Sparsity of labeled data: In many applications, e.g., image annotation, sequence annotation, and social network analysis, the available data (sequences, images, etc.) are only partially labeled. Hence there is a need for methods that can exploit vast amounts of unlabeled or partially labeled data together with limited amounts of labeled data.

To cope with these challenges, we have developed a novel approach to learning compact yet accurate predictive models from topologically structured data. Our approach exploits the complementary strengths of super-structuring (constructing complex features by combining existing features) and abstraction (grouping similar features to generate more abstract features). Super-structuring provides a way to increase the predictive accuracy of the learned models by enriching the data representation (hence, super-structuring increases the complexity of the learned models), whereas abstraction helps reduce the number of model parameters by simplifying the data representation (Silvescu et al., 2011). Some results of this work to date include:

  1. Abstraction-Augmented Markov Models (AAMMs). AAMMs generalize standard Markov models (MMs): they simplify the data representation used by MMs by grouping similar subsequences into an abstraction hierarchy (Caragea et al., 2010a, 2010b, 2010c); a minimal sketch of the two underlying operations appears after this list. Experimental results on text document classification and protein subcellular localization show that adapting the data representation by combining super-structuring and abstraction makes it possible to construct predictive models that use a substantially smaller number of features (by one to three orders of magnitude) than models obtained using super-structuring alone (whose size grows exponentially with the length of the direct dependencies). The combined super-structuring-and-abstraction models are competitive with, and in some cases outperform, models that use only super-structuring. Our experiments have also demonstrated the promise of AAMMs for learning sequence classifiers in a semi-supervised setting where only some of the sequences are labeled.
  2. Development of Abstraction-Super-structuring Normal Forms (Silvescu and Honavar, 2011), which offer a general theoretical framework for the structural (as opposed to parametric) aspects of induction using abstraction (grouping of similar entities) and super-structuring (combining topologically close entities), and exploration of their relation to ideas such as radical positivism in the philosophy of science (with PhD student Adrian Silvescu).
  3. Development of discriminatively trained probabilistic models for sequence classification (Yakhnenko et al., 2005) and generalized multiple-instance learning algorithms with applications in bioinformatics, text, and image analysis (El-Manzalawy et al., 2009).
  4. Development of multi-relational learning algorithms (Atramentov et al., 2003), relational Bayesian classifiers from RDF data, independence-based Markov network learning algorithms (Bromberg et al., 2009), and recursive Naïve Bayes learning algorithms with applications to sequence classification (Kang et al., 2006).
  5. Theoretical characterization of the independence and decomposability of functions that take values in an Abelian group, including probability distributions, energy functions, value functions, fitness functions, and relations (Silvescu and Honavar, 2006); a simple illustration appears below.
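
To make the two operations behind AAMMs (item 1) concrete, here is a minimal sketch in Python. It is a toy illustration under our own simplifying assumptions, not the published implementation: super-structuring builds k-gram features from adjacent symbols, and abstraction greedily merges k-grams whose estimated next-symbol distributions are close. The function names, the Euclidean distance between profiles, and the unweighted averaging of merged profiles are choices made here for brevity; the cited papers use more carefully chosen similarity measures.

    from collections import Counter, defaultdict
    from itertools import combinations

    def super_structure(sequence, k):
        # Super-structuring: combine k adjacent symbols into one complex feature (a k-gram).
        return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

    def next_symbol_profiles(sequences, k):
        # For each k-gram, estimate the distribution of the symbol that follows it;
        # k-grams with similar profiles are candidates for abstraction.
        counts = defaultdict(Counter)
        for s in sequences:
            grams = super_structure(s, k)
            for i, gram in enumerate(grams[:-1]):  # the last k-gram has no successor
                counts[gram][s[i + k]] += 1
        alphabet = sorted({c for s in sequences for c in s})
        return {gram: [ctr[a] / sum(ctr.values()) for a in alphabet]
                for gram, ctr in counts.items()}

    def abstract(profiles, num_abstractions):
        # Abstraction: greedily merge the two groups with the closest mean profiles
        # until only num_abstractions groups remain.
        groups = {frozenset([g]): p for g, p in profiles.items()}
        while len(groups) > num_abstractions:
            a, b = min(combinations(groups, 2),
                       key=lambda pair: sum((x - y) ** 2 for x, y in
                                            zip(groups[pair[0]], groups[pair[1]])))
            merged = [(x + y) / 2 for x, y in zip(groups.pop(a), groups.pop(b))]
            groups[a | b] = merged
        return {gram: idx for idx, members in enumerate(groups) for gram in members}

    sequences = ["ACGTACGT", "ACGTTGCA", "TTGCAACG"]  # toy data
    mapping = abstract(next_symbol_profiles(sequences, k=2), num_abstractions=3)
    print(mapping)  # each 2-gram is mapped to one of 3 abstract features

A Markov model can then be trained over the abstract features instead of the raw k-grams, which is how the approach shrinks the parameter count while retaining most of the predictive signal.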
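
As a simple illustration of the decomposability studied in item 5 (a generic example in our own notation, not reproduced from the cited paper): a function f taking values in an Abelian group (G, ⊕) is decomposable with respect to the variable subsets {x1, x2} and {x2, x3} if it can be written as

    f(x1, x2, x3) = g(x1, x2) ⊕ h(x2, x3).

Taking G to be the positive reals under multiplication recovers the familiar factorization of probability distributions, p(x1, x2, x3) ∝ g(x1, x2) · h(x2, x3), i.e., conditional independence of x1 and x3 given x2; taking the reals under addition yields additively decomposable energy or fitness functions.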

Work in progress is aimed at extending this approach to learning predictive models from richly structured data at multiple levels of abstraction, including multi-modal data (images and text), social networks and social media, linked open data, and biomolecular interaction networks.