SOCR ≫ DSPA ≫ Topics ≫

1 DSPA Mission and Objectives

This textbook is based on the HS650: Data Science and Predictive Analytics (DSPA) course I teach at the University of Michigan. These materials collectively aim to provide learners with a solid foundation of the challenges, opportunities, and strategies for designing, collecting, managing, processing, interrogating, analyzing and interpreting complex health and biomedical datasets. Readers that finish the textbook and successfully complete the examples and assignments will gain unique skills and acquire a tool-chest of methods, software tools, and protocols that can be applied to a broad spectrum of Big Data problems.

  • Vision: Enable active-learning by integrating driving motivational challenges with mathematical foundations, computational statistics, and modern scientific inference
  • Values: Effective, reliable, reproducible, and transformative data-driven discovery supporting open-science
  • Strategic priorities: Trainees will develop scientific intuition, computational skills, and data-wrangling abilities to tackle Big biomedical and health data problems. Instructors will provide well-documented R-scripts and software recipes implementing atomic data-filters as well as complex end-to-end predictive big data analytics solutions.

Before diving into the mathematical algorithms, statistical computing methods, software tools, and health analytics covered in the remaining chapters, we will discuss several driving motivational problems. These will ground all the subsequent scientific discussions, data modeling, and computational approaches.

2 Examples of driving motivational problems and challenges

For each of the studies below, we illustrate several clinically-relevant scientific questions, identify appropriate data sources, describe the types of data elements, and pinpoint various complexity challenges.

2.1 Alzheimer’s Disease

  • Identify the relation between observed clinical phenotypes and expected behavior
  • Prognosticate future cognitive decline (3-12 months prospectively) as a function of imaging data and clinical assessment (both model-based and model-free machine learning prediction methods will be used)
  • Derive and interpret the classifications of subjects into clusters using the harmonized and aggregated data from multiple sources
Data Source Sample Size/Data Type Summary
ADNI Archive Clinical data: demographics, clinical assessments, cognitive assessments; Imaging data: sMRI, fMRI, DTI, PiB/FDG PET; Genetics data: Ilumina SNP genotyping; Chemical biomarker: lab tests, proteomics. Each data modality comes with a different number of cohorts. Generally, \(200\le N \le 1200\). For instance, previously conducted ADNI studies with N>500 [ doi: 10.3233/JAD-150335, doi: 10.1111/jon.12252, doi: 10.3389/fninf.2014.00041] ADNI provides interesting data modalities, multiple cohorts (e.g., early-onset, mild, and severe dementia, controls) that allow effective model training and validation NACC Archive

2.2 Parkinson’s Disease

  • Predict the clinical diagnosis of patients using all available data (with and without the UPDRS clinical assessment, which is the basis of the clinical diagnosis by a physician)
  • Compute derived neuroimaging and genetics biomarkers that can be used to model the disease progression and provide automated clinical decisions support
  • Generate decision trees for numeric and categorical responses (representing clinically relevant outcome variables) that can be used to suggest an appropriate course of treatment for specific clinical phenotypes
Data Source Sample Size/Data Type Summary
PPMI Archive Demographics: age, medical history, sex; Clinical data: physical, verbal learning and language, neurological and olfactory (University of Pennsylvania Smell Identification Test, UPSIT) tests), vital signs, MDS-UPDRS scores (Movement Disorder; Society-Unified Parkinson’s Disease Rating Scale), ADL (activities of daily living), Montreal Cognitive Assessment (MoCA), Geriatric Depression Scale (GDS-15); Imaging data: structural MRI; Genetics data: llumina ImmunoChip (196,524 variants) and NeuroX (covering 240,000 exonic variants) with 100% sample success rate, and 98.7% genotype success rate genotyped for APOE e2/e3/e4. Three cohorts of subjects; Group 1 = {de novo PD Subjects with a diagnosis of PD for two years or less who are not taking PD medications}, N1 = 263; Group 2 = {PD Subjects with Scans without Evidence of a Dopaminergic Deficit (SWEDD)}, N2 = 40; Group 3 = {Control Subjects without PD who are 30 years or older and who do not have a first degree blood relative with PD}, N3 = 127 The longitudinal PPMI dataset including clinical, biological and imaging data (screening, baseline, 12, 24, and 48 month follow-ups) may be used conduct model-based predictions as well as model-free classification and forecasting analyses

2.3 Drug and substance use

  • Is the Risk for Alcohol Withdrawal Syndrome (RAWS) screen a valid and reliable tool for predicting alcohol withdrawal in an adult medical inpatient population?
  • What is the optimal cut-off score from the AUDIT-C to predict alcohol withdrawal based on RAWS screening?
  • Should any items be deleted from, or added to, the RAWS screening tool to enhance its performance in predicting the emergence of alcohol withdrawal syndrome in an adult medical inpatient population?
Data Source Sample Size/Data Type Summary
MAWS Data / UMHS EHR / WHO AWS Data Scores from Alcohol Use Disorders Identification Test-Consumption (AUDIT-C) [49], including dichotomous variables for any current alcohol use (AUDIT-C, question 1), total AUDIT-C score > 8, and any positive history of alcohol withdrawal syndrome (HAWS) ~1,000 positive cases per year among 10,000 adult medical inpatients, % RAWS screens completed, % positive screens, % entered into MAWS protocol who receive pharmacological treatment for AWS, % entered into MAWS protocol without a completed RAWS screen

2.4 Amyotrophic lateral sclerosis

  • Identify the most highly-significant variables that have power to jointly predict the progression of ALS (in terms of clinical outcomes like ALSFRS and muscle function)
  • Provide a decision tree prediction of adverse events based on subject phenotype and 0-3 month clinical assessment changes
Data Source Sample Size/Data Type Summary
ProAct Archive Over 100 clinical variables are recorded for all subjects including: Demographics: age, race, medical history, sex; Clinical data: Amyotrophic Lateral Sclerosis Functional Rating Scale (ALSFRS), adverse events, onset_delta, onset_site, drugs use (riluzole) The PRO-ACT training dataset contains clinical and lab test information of 8,635 patients. Information of 2,424 study subjects with valid gold standard ALSFRS slopes will be used in out processing, modeling and analysis The time points for all longitudinally varying data elements will be aggregated into signature vectors. This will facilitate the modeling and prediction of ALSFRS slope changes over the first three months (baseline to month 3)

3 Normal Brain Visualization

The SOCR Brain Visualization tool has preloaded sMRI, ROI labels, and fiber track models for a normal brain. It also allows users to drag-and-drop their data into the browser to visualize and navigate through the stereotactic data (including imaging, parcellations and tractography).

4 Neurodegeneration

A recent study of Structural Neuroimaging in Alzheimer’s Disease illustrates the Big Data challenges in modeling complex neuroscientific data. Specifically, 808 ADNI subjects were divided into 3 groups: 200 subjects with Alzheimer’s disease (AD), 383 subjects with mild cognitive impairment (MCI), and 225 asymptomatic normal controls (NC). Their sMRI data were parcellated using BrainParser, and the 80 most important neuroimaging biomarkers were extracted using the global shape analysis Pipeline workflow. Using a pipeline implementation of Plink, the authors obtained 80 SNPs highly-associated with the imaging biomarkers. The authors observed significant correlations between genetic and neuroimaging phenotypes in the 808 ADNI subjects. These results suggest that differences between AD, MCI, and NC cohorts may be examined by using powerful joint models of morphometric, imaging and genotypic data.