
Virtual Data Collaboratory: A Regional Cyberinfrastructure for Collaborative Data Intensive Science

(funded in part by grants from the National Science Foundation)
 


Scientific progress in many disciplines is increasingly enabled by our ability to examine natural phenomena through the computational lens (e.g., using algorithmic abstractions of the underlying processes) and our ability to acquire, share, integrate, and analyze disparate types of data. However, realizing the full potential of data to accelerate science calls for significant advances in data and computational infrastructure to support collaborative data-intensive science by teams of researchers that transcend institutional and disciplinary boundaries.


This project aims to conceptualize, design, and implement a Virtual Data Collaboratory (VDC) to support collaborative, data-intensive research by multidisciplinary teams drawn from multiple institutions. Specifically, the project aims to design VDC as a federated infrastructure that integrates state-of-the-art data-intensive computing platforms, storage, and networking with an innovative data services layer across Rutgers University, Pennsylvania State University, and several other institutions in the region, interconnected through a high-speed network, with the potential to expand to incorporate academic and research institutions across the United States. VDC will leverage existing national, international, and regional data repositories (including NSF-funded repositories such as the Ocean Observatories Initiative (OOI) and the Protein Data Bank (PDB)) as well as existing investments in advanced cyberinfrastructure, such as the NSF-funded Big Data Regional Hubs, XSEDE, and OSG, among others. VDC will provide the collaborative infrastructure and platform for developing and integrating algorithmic abstractions of scientific domains (e.g., biology), coupled with methods and tools for data analytics, modeling, and simulation, and cognitive tools (representations, processes, protocols, workflows, software) to advance science. VDC will support reproducible, sharable, and reconfigurable data-intensive scientific workflows [Parashar et al., 2019].


The project employs several collaborative science use cases to develop and evaluate the VDC infrastructure. For example, one use case involves a collaboration between Vasant Honavar and Helen Berman, a Rutgers structural biologist, the founder of the Nucleic Acid Database (NDB), and former director of the Protein Data Bank (PDB), a widely used archival database of curated protein structures. The team will use VDC to assemble carefully curated data sets of protein-DNA and protein-RNA complexes and interfaces, and to develop machine learning and other computational methods and tools for reliable prediction of protein-RNA and protein-DNA interfaces. The team will also use VDC to establish shared data and computational infrastructure, complete with workflows for documenting, comparing, and reproducing computational analyses and predictions of protein-RNA complexes and interfaces. In addition to helping develop and evaluate the VDC infrastructure, the results of this effort will advance our understanding of the molecular mechanisms by which proteins recognize and bind to DNA and RNA, and their role in a variety of important biological processes that orchestrate development, aging, disease, etc.