Virtual Data Collaboratory: A Regional Cyberinfrastructure for Collaborative Data Intensive Science
(funded in part by grants from the National Science Foundation)
Scientific progress in many disciplines is increasingly enabled by our ability to examine
natural phenomena through the computational lens (e.g., using algorithmic abstractions of the
underlying processes) and our ability to acquire, share, integrate, and analyze disparate types of data.
However, realizing the full potential of data to accelerate science calls for significant advances in data
and computational infrastructure to support collaborative data-intensive science by teams of
researchers that transcend institutional and disciplinary boundaries.
This project aims to conceptualize, design, and implement a Virtual Data Collaboratory (VDC), to
support collaborative, data-intensive science research by multi-disciplinary teams drawn from multiple
institutions. Specifically, the project aims to design VDC, a federated infrastructure that integrates
state-of-the-art data-intensive computing platforms, storage, and networking with an innovative data
services layer across Rutgers University, Pennsylvania State University, and several other institutions
in the region, interconnected through a high-speed network, with the potential to expand to
incorporate academic/research institutions across the United States. VDC will leverage existing
national/international and regional data repositories (including NSF-funded repositories such as the Ocean
Observatories Initiative (OOI) and the Protein Data Bank (PDB)) as well as existing investments in advanced
cyberinfrastructure, such as the NSF-funded Big Data Regional Hubs, XSEDE, and OSG, among others.
VDC will provide the collaborative infrastructure and platform for developing and integrating
algorithmic abstractions of scientific domains (e.g., biology), coupled with methods and tools for data
analytics, modeling, and simulation, and cognitive tools (representations, processes, protocols, workflows,
software) to advance science. VDC will support reproducible, sharable, and reconfigurable data-intensive
scientific workflows [Parashar et al., 2019].
The project employs several collaborative science use cases to develop and evaluate the VDC
infrastructure. For example, in one use case, Vasant Honavar and Helen Berman, a Rutgers structural
biologist, the founder of the Nucleic Acid Database (NDB), and former director of the Protein Data Bank
(PDB), a widely used archival database of curated protein structures, will use VDC to assemble carefully
curated data sets of protein-DNA and protein-RNA complexes and interfaces, and to develop machine
learning and other computational methods and tools for reliable prediction of protein-RNA and
protein-DNA interfaces. The team will use VDC to establish shared data and computational
infrastructure, complete with workflows for documenting, comparing, and reproducing computational
analyses and predictions of protein-RNA complexes and interfaces. In addition
to helping develop and evaluate the VDC infrastructure, the results of this effort will advance our
understanding of the molecular mechanisms by which proteins recognize and bind to DNA and RNA,
and their role in a variety of important biological processes that orchestrate development, aging,
and disease, among others.