This is Penn State

Generative Models of Complex Data (supported in part by grants from the National Science Foundation and the National Institutes of Health)
 

Generative models trained on large data sets have had astounding success at generating text, images, video, and even molecular structures. We are developing computationally efficient yet accurate learning algorithms for a variety of applications of generative models. Our recent work has resulted in:

  • GraphECL, a simple contrastive learning algorithm for fast and accurate inference on graphs. Graph contrastive learning offers a promising approach to applications where task-specific labels are scarce. However, existing graph contrastive learning methods incur significant computational overhead due to their reliance on message passing, which makes them unsuitable for latency-constrained applications. GraphECL does away with the need for expensive message passing during inference. It introduces a novel coupling of an MLP and a GNN, where the former learns to efficiently mimic the computations performed by the latter. We provide a theoretical analysis showing why an MLP can capture essential structural information about a node's neighbors well enough to match the performance of a GNN on downstream tasks. Extensive evaluations on widely used real-world benchmarks show that GraphECL achieves superior performance and inference efficiency compared to state-of-the-art graph contrastive learning (GCL) methods on both homophilous and heterophilous graphs.
  • Multi-modal generative models for molecular structure generation, molecular property prediction, and related tasks. Foundation models trained on large data sets of molecules (e.g., proteins, material structures) have proven useful for applications such as drug discovery and materials discovery. Generative models trained on molecular structures are quite effective at capturing structural information about the molecules. However, they fail to take advantage of the rich information available in textual descriptions of molecular properties, or of information available through other modalities. To address this, we extracted and curated 200,000 pairs of molecular structures and their descriptions drawn from biomedical texts to obtain the PubChem3D dataset. Generative models trained on the resulting data set demonstrate superior performance on several tasks, including molecular property prediction, zero-shot text-guided molecule retrieval, and 3D molecule description. In related work, we have developed a multi-modal representation learning framework over natural language text, chemical descriptions of molecules, and 3D binding pockets, among other modalities. To facilitate direct experimental evaluation of the resulting framework, we assembled a high-quality data set and trained and evaluated a multimodal generative model on several cross-modal retrieval tasks: Molecule to Language (M2L), Language to Molecule (L2M), Language to Conformation (L2C), and Conformation to Language (C2L).
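
The central idea behind GraphECL, training an MLP that needs no graph structure at inference time to mimic a message-passing encoder, can be illustrated with a minimal numpy sketch. Everything here is a hypothetical stand-in, not the actual GraphECL objective or architecture: the toy graph, the one-round mean-aggregation "GNN", the single-layer MLP, and the InfoNCE-style loss that pulls each node's MLP view toward its GNN view are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 4-node graph and node features (hypothetical data for illustration).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = rng.normal(size=(4, 8))

def gnn_embed(A, X):
    """One round of mean-neighbor aggregation (stand-in for a GNN encoder)."""
    deg = A.sum(axis=1, keepdims=True)
    return (A @ X) / np.maximum(deg, 1.0)

def mlp_embed(X, W):
    """Structure-free encoder: each node sees only its own features."""
    return np.tanh(X @ W)

def contrastive_loss(Z_mlp, Z_gnn, tau=0.5):
    """InfoNCE-style loss: each node's MLP view should be closest to
    its own GNN view among all nodes in the batch."""
    Zm = Z_mlp / np.linalg.norm(Z_mlp, axis=1, keepdims=True)
    Zg = Z_gnn / np.linalg.norm(Z_gnn, axis=1, keepdims=True)
    sim = (Zm @ Zg.T) / tau
    sim -= sim.max(axis=1, keepdims=True)          # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

W = 0.1 * rng.normal(size=(8, 8))                  # untrained MLP weights
loss = contrastive_loss(mlp_embed(X, W), gnn_embed(A, X))
# Training would minimize `loss` w.r.t. W; at inference time only
# mlp_embed is needed -- no adjacency matrix, no message passing.
```

The point of the coupling is visible in the last lines: the adjacency matrix appears only on the GNN side of the loss, so once training ends the MLP can serve latency-constrained queries on its own.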
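
The cross-modal retrieval tasks above (M2L, L2M, L2C, C2L) all reduce to nearest-neighbor search in a shared embedding space. The following sketch shows that retrieval step only, under stated assumptions: the embeddings are random stand-ins for the outputs of trained multimodal encoders (which are not reproduced here), with each "description" deliberately constructed as a small perturbation of its paired "molecule" so that matched pairs are close.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical pre-computed embeddings in a shared space: 3 molecules and
# their 3 paired text descriptions (perturbed copies, for illustration only).
mol = rng.normal(size=(3, 16))
text = mol + 0.05 * rng.normal(size=(3, 16))

def normalize(Z):
    """Project rows onto the unit sphere so dot products are cosines."""
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def retrieve(queries, keys):
    """Rank keys by cosine similarity; return the best key per query.
    With text queries and molecule keys this is L2M-style retrieval."""
    sim = normalize(queries) @ normalize(keys).T
    return sim.argmax(axis=1)

matches = retrieve(text, mol)   # best molecule index for each description
```

Swapping the roles of `queries` and `keys` gives the reverse direction (M2L), and substituting conformation embeddings gives the L2C and C2L variants; only the encoders producing the embeddings change.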
Work in progress aims to incorporate physics-based constraints into large-scale data-driven generative models, including those for materials discovery and drug design.