Learning Predictive Models from Multi-Modal Data


Recent years have witnessed rapid advances in our ability to acquire and store massive amounts of data across different modalities (such as text, speech, and images). The availability of such data presents challenges of information organization, such as retrieving images that match a user's text query or identifying regions of interest in an image based on user text. Scene understanding requires determining the objects that appear in a scene, their locations relative to each other, and so on. Of particular interest is image annotation, the task of assigning keywords to an image based on its contents. Traditional machine learning approaches to image annotation require large amounts of labeled data. This requirement is often unrealistic, because obtaining labeled data is, in general, expensive and time consuming. In contrast, weakly or partially labeled data (e.g., tagged images) are plentiful. This presents us with the challenge of learning predictive models from partially labeled data. We have formulated image annotation as a multiple instance, multiple label (MIML) learning problem. MIML learning is a generalization of supervised learning in which the training examples are bags of instances and each bag is labeled with a set of labels. We have explored two learning frameworks, generative and discriminative, and proposed models within each framework for assigning text keywords to images:

  • Multimodal Hierarchical Dirichlet Process, a non-parametric generalization of hierarchical mixture models. Our experimental evaluation shows that the performance of this model does not depend on the number of mixture components, unlike the standard mixture model, which suffers from over-fitting (Yakhnenko and Honavar, 2009).
  • A discriminative model that, given an input bag of instances, predicts the most likely assignment of labels to the bag. We learn as many classifiers as there are possible labels and force the classifiers to share weights using trace-norm regularization. Our experimental results on standard benchmark datasets show that the performance of this model is comparable to state-of-the-art multiple instance, multiple label classifiers and that, unlike some state-of-the-art models, it is scalable and practical for datasets with a large number of training instances and possible labels (Yakhnenko and Honavar, 2010).
  • A generalization of the discriminative model to a semi-supervised setting, which allows the model to take advantage of both labeled and unlabeled data. We assume that the data lie on a low-dimensional manifold and introduce a penalty that encourages the classifiers to assign similar labels to similar instances (i.e., instances that are nearby on the manifold induced by the samples). Our experimental results show the effectiveness of this approach in learning to annotate images from partially labeled data (Yakhnenko and Honavar, 2010).
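The weight sharing in the discriminative model above can be illustrated with a small sketch. This is not the authors' actual model: it uses a squared loss and proximal gradient descent purely for illustration, and all function names and parameters (`fit_multilabel`, `lam`, `lr`, `steps`) are assumptions. The key idea it shows is that stacking the per-label classifiers as columns of one weight matrix and penalizing that matrix's trace norm (sum of singular values) couples the classifiers through a shared low-rank structure.

```python
import numpy as np

def trace_norm_prox(W, tau):
    """Proximal step for the trace norm: soft-threshold the singular values of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    return (U * s) @ Vt

def fit_multilabel(X, Y, lam=0.01, lr=0.05, steps=500):
    """One linear classifier per label (columns of W), coupled by a shared
    trace-norm penalty, trained with proximal gradient descent.
    X: (n, d) instance features; Y: (n, k) label targets (one column per label)."""
    n, d = X.shape
    k = Y.shape[1]
    W = np.zeros((d, k))
    for _ in range(steps):
        grad = X.T @ (X @ W - Y) / n        # gradient of the squared loss
        W = trace_norm_prox(W - lr * grad, lr * lam)
    return W
```

Because the trace-norm penalty shrinks singular values rather than individual weights, labels that depend on similar feature directions end up reusing the same few directions, which is one way to make learning scale to many labels.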
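The manifold penalty used in the semi-supervised extension can likewise be sketched. A standard way to realize it (a minimal sketch, not necessarily the authors' exact construction; `knn_laplacian` and `manifold_penalty` are illustrative names) is to build a k-nearest-neighbor graph over all samples, labeled and unlabeled, and penalize the graph-Laplacian quadratic form of the classifier outputs, which is small exactly when nearby instances receive similar label scores.

```python
import numpy as np

def knn_laplacian(X, k=5):
    """Unnormalized graph Laplacian L = D - W of a symmetrized k-NN graph over rows of X."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]  # nearest neighbors, skipping the point itself
        W[i, nbrs] = 1.0
    W = np.maximum(W, W.T)                 # symmetrize the adjacency
    return np.diag(W.sum(axis=1)) - W

def manifold_penalty(F, L):
    """tr(F^T L F) = 0.5 * sum_ij W_ij ||F_i - F_j||^2, where row F_i holds the
    label scores for sample i; small when neighbors get similar scores."""
    return np.trace(F.T @ L @ L0) if (L0 := None) else np.trace(F.T @ L @ F)
```

Adding this term to the discriminative objective pulls the decision functions toward smoothness along the data manifold, so unlabeled samples constrain the classifiers even though they contribute no label information.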