Center for Complexity and Self-Management of Chronic Disease (CSCD): Core 2: Methods and Analytics Progress (2017-2018)

The Methods and Analytics core made significant advances in 2017-2018. This progress is summarized below:

I. Predictive big data analytics study of Amyotrophic Lateral Sclerosis (ALS)

We developed a new non-parametric method for estimating the prognosis of ALS patients using survival analysis (PMC5749893). Our survival ranking technique transforms patients' survival data into a linear space of hazard ranks, which enables subsequent machine learning prediction of neurodegenerative progression. This technique received the top ranking in the DREAM Amyotrophic Lateral Sclerosis (ALS) Stratification Challenge. As an application, we identified salient features that are important in ALS diagnosis and prognosis.
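The idea of mapping censored survival data onto a single rank-like scale can be illustrated with a minimal sketch. This is not the published method; it is a simple pairwise-comparison score, assumed here purely for illustration, and the function name hazard_rank_scores is hypothetical. Each patient gains credit for every patient known to have outlived them and loses credit for every patient known to have died earlier, so censored patients are only compared where the ordering is certain.

```python
def hazard_rank_scores(times, events):
    """Illustrative rank-like hazard score from right-censored survival data.

    times[i]  : follow-up time for patient i
    events[i] : 1 if death was observed at times[i], 0 if censored
    Returns scores in [-1, 1]; higher means higher apparent hazard.
    """
    n = len(times)
    denom = max(n - 1, 1)  # avoid division by zero for a single patient
    scores = []
    for i in range(n):
        # Patients who certainly outlived i: i died, and j was still
        # under observation (alive or censored later) after times[i].
        outlived_i = sum(
            1 for j in range(n)
            if events[i] == 1 and times[i] < times[j]
        )
        # Patients whom i certainly outlived: j died before times[i].
        outlived_by_i = sum(
            1 for j in range(n)
            if events[j] == 1 and times[j] < times[i]
        )
        scores.append((outlived_i - outlived_by_i) / denom)
    return scores


# Toy cohort: patient 1 is censored at time 5, the others died.
example_scores = hazard_rank_scores([2, 5, 5, 8], [1, 0, 1, 1])
print(example_scores)
```

Scores of this kind are ordinary real numbers, so they can serve as a regression target for a standard machine learning model, which is the general motivation for working in a rank space.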

II. Genomic Data Analysis

We introduced a new theoretical model for analyzing genetic sequence data (PMC5361063). We compared our approach to other techniques for quantifying sequence distances and variability. Most alignment-free methods rely on counting words, which are small contiguous fragments of the genome. Our approach instead considers the locations of nucleotides in the sequences and relies more on appropriate statistical distributions. We reported results of extracting this information and of comparing matching fidelity and location regularization information to classify mutation sequences.
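The contrast between word counting and location-based summaries can be sketched as follows. This is only an illustration under simple assumptions, not the model from the paper: kmer_counts implements plain word (k-mer) counting, and position_summary is a hypothetical helper that records, for each nucleotide, its count together with the mean and variance of its positions.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count all overlapping words (k-mers) of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def position_summary(seq):
    """For each nucleotide, return (count, mean position, position variance)."""
    positions = {}
    for idx, base in enumerate(seq):
        positions.setdefault(base, []).append(idx)
    summary = {}
    for base, idxs in positions.items():
        mean = sum(idxs) / len(idxs)
        var = sum((p - mean) ** 2 for p in idxs) / len(idxs)
        summary[base] = (len(idxs), mean, var)
    return summary


# Two toy sequences with identical single-letter word counts
# but different nucleotide locations.
a, b = "AACC", "CCAA"
print(kmer_counts(a, 1) == kmer_counts(b, 1))
print(position_summary(a))
print(position_summary(b))
```

In this toy case the 1-mer counts cannot tell the two sequences apart, while the location statistics can, which is the kind of information a position-aware approach retains.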

III. Visualization of Extremely High-dimensional Data

The CSCD analytics team developed a new demonstration of modeling, simplifying, and visualizing extremely high-dimensional data. Many datasets have millions of observations and attributes/features. Datasets with many dimensions/features are subject to what is colloquially known as the curse of dimensionality. For instance, medical images generate thousands of features and are difficult to integrate with clinical and phenotypic information. We utilized a novel manifold-based statistical technique, t-distributed stochastic neighbor embedding (t-SNE), to reduce 3,000-dimensional data for 10,000 volunteers into a 3D space.
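A reduction of this kind can be sketched with the t-SNE implementation in scikit-learn. The sketch below is an assumption-laden illustration, not the team's pipeline: it uses small synthetic data (120 samples, 50 features) in place of the 10,000 by 3,000 study data, and the function name embed_3d, the perplexity, and the random seed are all hypothetical choices.

```python
import numpy as np
from sklearn.manifold import TSNE


def embed_3d(X, perplexity=30.0, seed=0):
    """Project the rows of X into 3 dimensions with t-SNE.

    perplexity must be smaller than the number of samples.
    """
    tsne = TSNE(
        n_components=3,        # target 3D space, as in the report
        perplexity=perplexity,
        init="pca",            # PCA initialization for a stabler layout
        random_state=seed,     # fixed seed so repeated runs match
    )
    return tsne.fit_transform(X)


# Synthetic stand-in data: two shifted Gaussian clusters in 50 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(60, 50)),
    rng.normal(4.0, 1.0, size=(60, 50)),
])
embedding = embed_3d(X)
print(embedding.shape)
```

The resulting three columns can be passed to any 3D scatter plot. For data at the scale described above, runtime and the choice of perplexity would need separate tuning.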

IV. Curricular Developments