Topics in Grammatical Inference and Computational Learning Theory


Grammatical inference, variously referred to as automata induction, grammar induction, or automatic language acquisition, is the process of learning grammars and languages from data. Machine learning of grammars finds a variety of applications in syntactic pattern recognition, adaptive intelligent agents, diagnosis, computational biology, systems modeling, prediction, natural language acquisition, data mining, and knowledge discovery.

Our work on learning regular grammars demonstrates the feasibility of learning regular languages from examples under additional assumptions about the distribution from which the examples are drawn. Against the background of strong negative results on the feasibility of learning regular grammars within the standard PAC learning framework, this addresses the question posed by Pitt in his seminal paper: “Are DFA PAC-identifiable if examples are drawn from the uniform distribution, or some other known simple distribution?”

  1. The class of simple DFA (i.e., DFA whose canonical representations have logarithmic Kolmogorov complexity) is efficiently PAC learnable under the Solomonoff-Levin universal distribution (Parekh and Honavar, 1999).
  2. If the examples are sampled at random according to the universal distribution by a teacher that is knowledgeable about the target concept, the entire class of DFA is efficiently PAC learnable under the universal distribution; that is, DFA are efficiently learnable under the PACS model (Parekh and Honavar, 1999; Parekh and Honavar, 2001).
  3. Any concept that is learnable under Gold’s model for learning from characteristic samples, Goldman and Mathias’ polynomial teachability model, or the model for learning from example-based queries is also learnable under the PACS model (Parekh and Honavar, 2000; 2001).
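To make the learning setting in these results concrete, the sketch below shows the kind of object being learned (a DFA) and the kind of data the learner receives (labeled strings drawn from some distribution). The even-parity target language and the uniform sampling scheme here are illustrative assumptions, not taken from the papers cited above.

```python
import random

# Hypothetical target DFA: binary strings containing an even number of 1s.
DFA = {
    "start": "even",
    "accept": {"even"},
    "delta": {("even", "0"): "even", ("even", "1"): "odd",
              ("odd", "0"): "odd", ("odd", "1"): "even"},
}

def accepts(dfa, s):
    """Run the DFA on string s and report whether it is accepted."""
    state = dfa["start"]
    for ch in s:
        state = dfa["delta"][(state, ch)]
    return state in dfa["accept"]

def sample_labeled_examples(dfa, n, max_len=8, seed=0):
    """Draw n (string, label) pairs with length and symbols chosen
    uniformly at random -- a stand-in for the distributional
    assumptions that the results above study and relax."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        length = rng.randint(0, max_len)
        s = "".join(rng.choice("01") for _ in range(length))
        examples.append((s, accepts(dfa, s)))
    return examples
```

A PAC learner in this setting sees only the output of `sample_labeled_examples` and must return a hypothesis DFA whose error under the same distribution is small with high probability.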

Related work has led to the development of polynomial-time algorithms for learning regular languages from examples and membership queries (Nichitiu et al., 2000). Our work on learning grammars used to model natural languages (in particular, dependency grammars and stochastic context-free grammars) has led to:

  1. The development of a novel regularization scheme, namely unambiguity regularization, which favors grammars that yield unambiguous parses and which includes as special cases, and improves upon, the standard expectation-maximization (EM), Viterbi EM, and softmax EM algorithms for unsupervised learning of grammars (Tu and Honavar, 2012).
  2. Demonstration (and explanation) of the benefits of curricula (i.e., orderings of the training samples presented to the learner in an inductive learning setting) using an incremental construction hypothesis, which asserts (loosely speaking) that when the target of learning is a structure (e.g., a grammar) that can be decomposed into a set of substructures (e.g., grammar rules), an ideal curriculum is one that gradually emphasizes data samples that help the learner successively discover new substructures (Tu and Honavar, 2011).
  3. An iterative bi-clustering approach to learning probabilistic context free grammars (Tu et al., 2008, 2011).
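One common way to interpolate between standard EM (soft posteriors) and Viterbi EM (hard, unambiguous assignments) is to raise the E-step posteriors to a power and renormalize, so that larger exponents concentrate probability on the single best analysis. The toy below applies that idea to a two-component Bernoulli mixture; it is a simplified stand-in for the softmax-EM family, not the exact unambiguity-regularization algorithm of Tu and Honavar (2012), and all names and data here are illustrative.

```python
def annealed_em(data, beta, iters=50):
    """Fit a toy two-component Bernoulli mixture with a sharpened E-step:
    responsibilities are raised to the power beta and renormalized.
    beta = 1 recovers standard EM; beta -> infinity approaches Viterbi
    (hard-assignment) EM.

    data: list of (heads, tails) counts, one pair per observed sequence.
    Returns (head probabilities, mixing weights) for the two components.
    """
    p = [0.3, 0.7]   # initial head probabilities for the two components
    w = [0.5, 0.5]   # initial mixing weights
    for _ in range(iters):
        # E-step: component responsibilities, sharpened by beta.
        resp = []
        for h, t in data:
            lik = [w[k] * (p[k] ** h) * ((1 - p[k]) ** t) for k in range(2)]
            sharp = [v ** beta for v in lik]
            z = sum(sharp)
            resp.append([s / z for s in sharp])
        # M-step: re-estimate mixing weights and head probabilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            heads = sum(r[k] * h for r, (h, t) in zip(resp, data))
            total = sum(r[k] * (h + t) for r, (h, t) in zip(resp, data))
            p[k] = heads / total if total > 0 else p[k]
    return p, w
```

For grammar learning, the analogous move sharpens the posterior over parse trees in the E-step, biasing the learned grammar toward producing a single unambiguous parse per sentence.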

Work in progress is aimed at extending the theoretical foundations and algorithms for grammatical inference to settings that call for learning from multimodal data (e.g., combinations of words and pictures). Results to date include an algorithm for learning a multimodal hierarchical Dirichlet process model for annotating images from partially labeled data (Yakhnenko and Honavar, 2009).