Center for Complexity and Self-Management of Chronic Disease (CSCD): Core 2: Methods and Analytics Progress (2018-2019)

In 2018-2019, investigators in the CSCD Methods and Analytics core continued to make significant advances. First, we developed a protocol that enables data sharing and aggregation via statistical obfuscation. Second, we introduced a new compressive big data analytics (CBDA) framework for analyzing biomedical and health studies. Third, we deployed an advanced visualization webapp for interrogating extremely high-dimensional data. This brief progress report outlines some of these developments.

I. Support for Open Data-Sharing

In support of open science, we developed a novel statistical approach that enables the harmonization, merging, and sharing of complex datasets without compromising sensitive information such as personally identifiable elements (NIHMSID 1012970, DOI: 10.1080/00949655.2018.1545228). The DataSifter provides on-the-fly de-identification of structured and unstructured sensitive high-dimensional data, such as clinical data from electronic health records (EHR). The technique gives data administrators complete control over the balance between the risk of data re-identification and the preservation of the information content of the data. Our simulation results suggest that the DataSifter can provide privacy protection while maintaining data utility for different types of outcomes of interest. The application of the DataSifter to a large autism dataset provides a realistic demonstration of its promise for practical applications.
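
The privacy-utility balance described above can be illustrated with a toy sketch. This is not the DataSifter algorithm itself: the function names (toy_obfuscate, fraction_changed) and the single tuning parameter eta are hypothetical, introduced only for illustration. The sketch assumes the simplest possible obfuscation, random within-column value swapping, where eta = 0 leaves the data untouched and larger values perturb more cells while preserving each column's marginal distribution.

```python
import random


def toy_obfuscate(records, eta, seed=0):
    """Toy obfuscation (NOT the DataSifter method): with probability eta,
    swap each cell with the same column's value in another random record."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must be in [0, 1]")
    rng = random.Random(seed)
    out = [list(r) for r in records]  # copy, so the input is left untouched
    n = len(out)
    if n < 2:
        return out
    n_cols = len(out[0])
    for i in range(n):
        for j in range(n_cols):
            if rng.random() < eta:
                k = rng.randrange(n - 1)
                if k >= i:
                    k += 1  # choose a partner record other than record i
                out[i][j], out[k][j] = out[k][j], out[i][j]
    return out


def fraction_changed(original, obfuscated):
    """Share of cells whose value differs from the original (a crude utility-loss proxy)."""
    total = sum(len(r) for r in original)
    changed = sum(
        a != b for r, s in zip(original, obfuscated) for a, b in zip(r, s)
    )
    return changed / total


if __name__ == "__main__":
    data = [[i, 10 * i, 100 * i] for i in range(50)]
    for eta in (0.0, 0.25, 1.0):
        print(eta, fraction_changed(data, toy_obfuscate(data, eta)))
```

Because values are only swapped within a column, column-level summaries (means, histograms) are unchanged at every eta, while record-level linkage degrades as eta grows. That is the kind of trade-off an administrator would tune.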

II. Application of Compressive Big Data Analytics (CBDA) in Biomedical and Health Studies

We introduced a scalable computational statistics method that addresses some of the challenges of handling and analyzing complex, incongruent, incomplete, and multi-source data. The CBDA mathematical framework enables the study of the ergodic properties and the asymptotics of specific statistical inference approaches. We implemented the high-throughput CBDA method in pure R and validated it using several simulated datasets as well as a real case study on the neuroimaging-genetics of Alzheimer's disease (PMCID: PMC6116997, DOI: 10.1371/journal.pone.0202674).
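
As a rough illustration of the general idea behind this family of methods, the following is a minimal sketch that repeatedly subsamples cases and features, fits a simple predictor on each subsample, and ranks features by how often they appear in the best-performing models. It is not the published R implementation: the nearest-centroid classifier, the function names, the toy data generator, and all parameter defaults are assumptions made for this example.

```python
import random
from collections import Counter


def nearest_centroid_accuracy(X, y, train, test, cols):
    """Fit class centroids on the `train` rows and `cols` features; score on `test`."""
    centroids = {}
    for label in set(y[i] for i in train):
        members = [i for i in train if y[i] == label]
        centroids[label] = [sum(X[i][c] for i in members) / len(members) for c in cols]
    correct = 0
    for i in test:
        point = [X[i][c] for c in cols]
        predicted = min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(point, centroids[lab])),
        )
        correct += predicted == y[i]
    return correct / len(test)


def subsample_and_rank(X, y, n_iter=200, case_frac=0.5, n_feat=2, top_frac=0.2, seed=0):
    """Count how often each feature appears among the top-scoring subsample models."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    runs = []
    for _ in range(n_iter):
        train = rng.sample(range(n), int(case_frac * n))  # random subset of cases
        cols = rng.sample(range(p), n_feat)               # random subset of features
        chosen = set(train)
        test = [i for i in range(n) if i not in chosen]   # held-out cases
        runs.append((nearest_centroid_accuracy(X, y, train, test, cols), cols))
    runs.sort(key=lambda r: r[0], reverse=True)           # best models first
    top = runs[: max(1, int(top_frac * n_iter))]
    return Counter(c for _, cols in top for c in cols)


def make_toy_data(n=100, p=6, seed=1):
    """Hypothetical data: feature 0 tracks the label, features 1..p-1 are pure noise."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[y[i] + rng.gauss(0, 0.1)] + [rng.random() for _ in range(p - 1)] for i in range(n)]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print(subsample_and_rank(X, y).most_common())
```

On the toy data, the informative feature should dominate the counts because every model that includes it predicts the held-out cases well. Each iteration touches only a fraction of the cases and features, which is what makes this style of analysis tractable on large or incomplete datasets.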

III. Visualization of High-dimensional Diabetes Data

We developed a distributed webapp for visually interrogating complex data archives. It allows users to address driving healthcare questions such as: Do patient phenotypes (e.g., race, gender, and age), clinical settings (e.g., admission type, time in hospital, medical specialty of admitting physician), and treatment regimens (e.g., number of lab tests performed, HbA1c test result, diagnosis, number of medications, diabetes medications, number of outpatient, inpatient, and emergency visits in the year before the hospitalization) affect diabetes treatment outcomes?
This open-science project provides a low-cost solution for interactive visual analytics, hypothesis generation, and pattern identification for complex biomedical and healthcare case studies. It does not require any special software or licensing, supports evidence-based discovery science, and provides semi-automated clinical decision support for health practitioners.
CSCD High-Dimensional t-SNE Data Viz Webapp
Figure: Visualization of the high-dimensional diabetes data in 3D, using t-SNE. This CSCD TensorBoard application provides a tutorial and a high-end visualization protocol for interrogating extremely high-dimensional data. Users may upload their own data into the webapp and interrogate them.

IV. Curricular Developments