
Algorithms and Software for Knowledge Acquisition from Semantically Heterogeneous, Distributed Data

(funded in part by grants from the National Science Foundation)

Recent developments in high-throughput data acquisition technologies in a number of domains (e.g., biological sciences, atmospheric sciences, commerce), together with advances in digital storage, computing, and communications technologies, have resulted in a proliferation of physically distributed data repositories created and maintained by autonomous entities (e.g., scientists, organizations). The resulting increasingly data-rich domains offer unprecedented opportunities for knowledge acquisition from data (e.g., discovery of a priori unknown complex relationships, construction of predictive models). However, realizing these opportunities presents several challenges in practice: data repositories are autonomously designed and operated, large, physically distributed, and differ in structure, organization, semantics, and query and processing capabilities.

Our research, aimed at addressing some of these challenges, has led to:

  1. The development of a general theoretical framework for learning predictive models (e.g., classifiers) from large, physically distributed data sources where it is neither desirable nor feasible to gather all of the data in a centralized location for analysis. This framework [Caragea et al., 2001; 2003; 2004a] offers a general recipe for the design of algorithms for learning from distributed data that are provably exact with respect to their centralized counterparts, in the sense that the model constructed from a collection of physically distributed data sets is provably identical to the one obtained when the learning algorithm has access to the entire data set. A key feature of our approach is a clear separation of concerns between two tasks: hypothesis construction, and the extraction and refinement of the sufficient statistics that the learning algorithm needs from the data. This separation reduces the problem of learning from distributed data to the problem of decomposing a query for sufficient statistics across multiple data sources and combining the answers they return into the answer to the original query. Our work has resulted in provably exact algorithms (relative to their centralized counterparts) for learning decision trees, neural networks, support vector machines, and Bayesian networks from distributed data.
  2. The development of theoretically sound yet practical variants of a large class of algorithms [Caragea et al., 2001; 2003; 2004a] for learning predictive models (classifiers) from distributed data sources under a variety of assumptions, motivated by practical applications, concerning the nature of data fragmentation and the query capabilities and operations permitted by the data sources (e.g., execution of user-supplied procedures). This work includes a precise characterization of the complexity (computation, memory, and communication requirements) of the resulting algorithms relative to their centralized counterparts.
  3. Development of a scalable statistical query based approach to learning and updating sequence classifiers from very large sequence data sets (Koul et al., 2010).
  4. The development of a theoretically sound approach to the formulation and execution of statistical queries across semantically heterogeneous data sources [Caragea et al., 2004b; 2005; 2006; 2007a; 2007b; Bao et al., 2007d]. This work has shown how user-specified semantic correspondences and mappings, from the terms and relationships among terms in a user ontology to the terms and relations in data-source-specific ontologies, can be used to construct a sound procedure for answering queries for the sufficient statistics needed for learning classifiers from semantically heterogeneous data. An important component of this work is the development of statistically sound approaches to handling data specified at different levels of abstraction across different data sources.
  5. Abstraction-Driven Algorithms for Building Compact yet Accurate Classifiers. We have developed a general approach for exploiting attribute value hierarchies (AVH), which group the values of attributes, to learn compact yet accurate predictive models from data specified at different levels of abstraction. Instantiations of this approach in the case of Naïve Bayes (Zhang et al., 2004; 2006) and Decision Trees (Zhang et al., 2003) show that the resulting algorithms yield predictive models that are more compact than those produced by counterparts without access to AVH, with no loss in the quality of the predictors.
  6. Development of a statistical query based approach to learning classifiers from semantically disparate multi-relational data (Caragea et al., 2010).
  7. Demonstration of the theoretical equivalence of a certain class of inter-ontology mapping errors and noise models, and hence the reduction of the problem of learning in the presence of mapping errors from semantically disparate data to the problem of learning from noisy data (Koul et al., 2010, 2012).
  8. The design and implementation of INDUS, a modular, extensible, open source software toolkit for data-driven knowledge acquisition from large, distributed, autonomous, semantically heterogeneous data sources.
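
The sufficient-statistics idea behind items 1 and 4 can be illustrated with a minimal Python sketch. It is not the INDUS implementation or the published algorithms: it assumes horizontally fragmented data, a Naive Bayes learner whose sufficient statistics are simple counts, and a flat term-to-term mapping from source vocabulary to user-ontology terms. All names here (`local_counts`, `combine`, `predict`, the toy sites and mapping) are hypothetical.

```python
import math
from collections import Counter


def local_counts(rows, mapping=None):
    """Runs at one data source and returns only counts, never raw rows.

    rows: list of (attribute_dict, class_label) pairs.
    mapping: optional dict from source-specific terms to user-ontology terms
             (an assumed, simplified stand-in for inter-ontology mappings).
    """
    mapping = mapping or {}
    class_counts, joint_counts = Counter(), Counter()
    for attrs, label in rows:
        class_counts[label] += 1
        for attr, value in attrs.items():
            joint_counts[(attr, mapping.get(value, value), label)] += 1
    return class_counts, joint_counts


def combine(stats):
    """Adds up the per-source counts to answer the original count query."""
    total_class, total_joint = Counter(), Counter()
    for class_counts, joint_counts in stats:
        total_class.update(class_counts)
        total_joint.update(joint_counts)
    return total_class, total_joint


def predict(class_counts, joint_counts, attrs, values_per_attr):
    """Naive Bayes prediction (Laplace smoothing) computed from counts alone."""
    n = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / n)
        for attr, value in attrs.items():
            num = joint_counts[(attr, value, label)] + 1
            den = class_counts[label] + values_per_attr[attr]
            score += math.log(num / den)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Two toy sources; site_b uses its own terms for the "outlook" values.
site_a = [
    ({"outlook": "sunny", "wind": "weak"}, "play"),
    ({"outlook": "rain", "wind": "strong"}, "stay"),
    ({"outlook": "sunny", "wind": "strong"}, "play"),
]
site_b = [
    ({"outlook": "clear", "wind": "weak"}, "play"),
    ({"outlook": "wet", "wind": "strong"}, "stay"),
]
mapping_b = {"clear": "sunny", "wet": "rain"}

# Distributed setting: each source answers the count query locally.
distributed = combine([local_counts(site_a), local_counts(site_b, mapping_b)])

# Centralized reference: pool the (translated) rows and count once.
pooled = site_a + [
    ({a: mapping_b.get(v, v) for a, v in attrs.items()}, label)
    for attrs, label in site_b
]
centralized = local_counts(pooled)

values_per_attr = {"outlook": 2, "wind": 2}
prediction = predict(*distributed, {"outlook": "sunny", "wind": "weak"}, values_per_attr)
```

Because counts are additive across horizontal fragments, `distributed` and `centralized` hold the same statistics in this toy case, so any model built from them is the same. Learners with richer sufficient statistics would need a different decomposition of the query.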

Research in progress is aimed at:

  1. Extension of the statistical query based learning framework to learning predictive models from Linked Open (RDF) Data, e.g., algorithms for learning Relational Bayesian Classifiers from RDF data in settings where the learner can access the RDF data only through a restricted set of queries against an access interface (Lin et al., 2011).
  2. Extension of the statistical query based learning framework to learning predictive models from network data.
  3. Applications of the resulting algorithms to social network and social media analytics, and to the analysis and prediction of biomolecular interactions.
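
The restricted-access setting in item 1 above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Relational Bayesian Classifier algorithm of Lin et al.: RDF triples are modeled as plain Python tuples, every subject is assumed to have a single value per predicate, and the access interface is a hypothetical `CountOnlyEndpoint` class that answers only count queries.

```python
class CountOnlyEndpoint:
    """Hypothetical access interface: triples stay private, only counts come out."""

    def __init__(self, triples):
        self._by_subject = {}
        for subject, predicate, obj in triples:
            self._by_subject.setdefault(subject, set()).add((predicate, obj))

    def count_subjects(self, *constraints):
        """Number of subjects having every (predicate, object) pair given."""
        wanted = set(constraints)
        return sum(1 for pairs in self._by_subject.values() if wanted <= pairs)


def learn_conditionals(endpoint, class_pred, classes, attr_pred, values):
    """Estimates P(attr = value | class) with Laplace smoothing.

    The learner never sees a triple; it only issues count queries.
    """
    table = {}
    for c in classes:
        n_c = endpoint.count_subjects((class_pred, c))
        for v in values:
            n_cv = endpoint.count_subjects((class_pred, c), (attr_pred, v))
            table[(v, c)] = (n_cv + 1) / (n_c + len(values))
    return table


# Toy data: four subjects, each with one genre and one rating.
triples = [
    ("m1", "genre", "drama"), ("m1", "rating", "high"),
    ("m2", "genre", "drama"), ("m2", "rating", "low"),
    ("m3", "genre", "comedy"), ("m3", "rating", "high"),
    ("m4", "genre", "drama"), ("m4", "rating", "high"),
]
endpoint = CountOnlyEndpoint(triples)
table = learn_conditionals(
    endpoint, "rating", ["high", "low"], "genre", ["drama", "comedy"]
)
```

With a real RDF store, each `count_subjects` call would presumably correspond to an aggregate count query against the store's interface. Multi-valued predicates would need an aggregation step that this sketch omits.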