Computational Infrastructure for Sensitive Data Analysis
The project aims to develop a novel framework for analysis of sensitive data that complies with data access and use policies (DAUPs). The framework will support (i) querying and retrieval of information from the data store that is permitted by the user- or project-specific DAUP, such as the schema of the data store and metadata that specify the variables and their domains and ranges; (ii) execution of system- or user-supplied implementations of algorithms for constructing predictive or causal models or visualizations from the data in the data store; (iii) evaluation of the predictive performance of the resulting models on benchmark data or user-provided data; (iv) deployment of the validated models in the form of web servers that provide predictions or visualizations over user-submitted data or over results of user-defined queries against the data store; and (v) publication of reusable analytics workflows. This exploratory project seeks to test the feasibility of the framework using predictive and causal modeling of data from an online health community as a test case. A major outcome of this project is an open source software infrastructure for facilitating analysis and visualization of sensitive data. This research will: (i) fill a major gap in infrastructure for predictive modeling from sensitive data; (ii) significantly lower the barrier to entry into domains that involve sensitive data (e.g., health, education) for researchers with deep expertise in analytics; (iii) improve the accuracy of assessments of the state of the art in predictive and causal modeling in such domains by facilitating rigorous comparison of algorithms; and (iv) facilitate reproducible analysis of sensitive data.
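As an illustration of capability (i) only, the retrieval of DAUP-permitted metadata might be sketched as follows. This is a minimal sketch under stated assumptions, not the project's design: the class names, the `permitted_metadata` function, and the example schema are all hypothetical, and a real DAUP would likely express far richer conditions than a list of visible variables.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class VariableMetadata:
    """Metadata for one variable in the data store (hypothetical structure)."""
    name: str
    domain: str          # e.g. "integer", "categorical"
    value_range: tuple   # e.g. (0, 120) or ("A", "B", "C")


@dataclass(frozen=True)
class DataAccessUsePolicy:
    """Hypothetical DAUP: the variables a given user or project may see."""
    holder: str
    visible_variables: frozenset = field(default_factory=frozenset)


def permitted_metadata(schema, policy):
    """Return only the metadata entries that the policy allows to be disclosed."""
    return [v for v in schema if v.name in policy.visible_variables]


# Made-up schema and policy, for illustration only.
schema = [
    VariableMetadata("age", "integer", (0, 120)),
    VariableMetadata("diagnosis", "categorical", ("A", "B", "C")),
    VariableMetadata("post_text", "free text", ()),
]
policy = DataAccessUsePolicy("example-project", frozenset({"age", "diagnosis"}))

visible = permitted_metadata(schema, policy)
print([v.name for v in visible])
```

In this sketch a policy that lists no variables discloses nothing, so access is denied by default and must be granted explicitly.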
This research will (i) yield a prototype open source software infrastructure to support analysis and visualization of sensitive data; (ii) accelerate data-driven advances in domains that involve sensitive data (e.g., health, education) through broad engagement of talent in developing better algorithms; and (iii) support incorporation of hands-on experience with such applications into data science education through hackathons and competitions organized around specific sensitive data sets.