Learning Modules

Class Notes » R Code » Assignment »

Examples of Big Biomedical Challenges (AD, PD, ALS, AWD)

Brain Visualization

Neurodegeneration

Genomics computing

Neuroimaging-genetics

Common Characteristics of Big (Biomedical and Health) Data

High-throughput Big Data Analytics

Statistical Software – Pros/Cons Comparison  

Getting started  

Install Basic Shell-based R  

GUI based R Invocation (RStudio)

RStudio GUI Layout  

Help

Simple Long-to-Wide Data format translation  

Data generation

I/O  

Slicing and extracting data  

Variable conversion  

Variable information  

Data selection and manipulation  

Math Functions  

Matrix Operations  

Advanced Data Processing

Strings

Plotting

QQ Normal Probability Plots

Low-level plotting Commands  

Graphics parameters  

Optimization and model Fitting  

Statistics  

Distributions  

Programming

Data Simulation Primer


Managing Data in R

Saving and Loading R Data Structures  

Importing and Saving Data from CSV Files  

Exploring the Structure of Data  

Exploring Numeric Variables  

Measuring the Central Tendency - mean and median  

Measuring Spread - quartiles and the five-number summary  

Visualizing Numeric Variables - boxplots

Visualizing Numeric Variables - Histograms  

Understanding Numeric Data - uniform and normal distributions  

Measuring Spread - variance and standard deviation  

Exploring Categorical Variables  

Measuring the Central Tendency - the mode

Exploring Relationships Between Variables

Imputation of Missing Data

Parsing web pages and visualizing tabular HTML data

Cohort-Rebalancing (for Imbalanced Groups)

Classification of visualization methods  

Composition  

Histograms and density plots  

Pie Chart  

Heat map  

Comparison

Paired Scatter Plots  

Barplots  

Trees and Graphs  

Correlation Plots  

Relationships

Line plots using ggplot  

Density Plots

Distributions  

2D Kernel Density and 3D Surface Plots

Jitter plot  

Appendix  

Hands-on Activity (Health Behavior Risks)  


Linear Algebra & Matrix Computing  

Building Matrices  

Create matrices  

Adding columns and rows  

Matrix subscripts  

Matrix Operations  

Addition  

Subtraction  

Multiplication  

Elementwise multiplication  

Matrix multiplication  

Division  

Transpose  

Inverse

Matrix Operations  

Matrix Algebra Notation  

Matrix Notation  

Solving Systems of Equations  

The identity matrix  

Vectors, Matrices, and Scalars  

Sample Statistics  

Mean  

Variance  

Applications of Matrix Algebra: Linear modeling  

Finding function extrema (min/max) using calculus  

Least Square Estimation  

The R lm Function  

Eigenvalues and Eigenvectors  

Other important functions  

Matrix notation  

Linear regression

Sample covariance matrix  

Simple linear regression  

Ordinary least squares estimation

Correlations  

Multiple Linear Regression

Case Study 1: Baseball Players

Step 2 - exploring and preparing the data

Step 3 - training a model on the data  

Step 4 - evaluating model performance  

Step 5 - improving model performance  

Regression trees and model trees

Heart Attack Data


Principal Component Analysis (PCA)

Independent Component Analysis (ICA)

Factor Analysis (FA)

Singular Value Decomposition (SVD)

t-Distributed Stochastic Neighbor Embedding (t-SNE)

Uniform Manifold Approximation and Projection (UMAP)


Understanding classification using nearest neighbors

The kNN algorithm

Calculating distance

Choosing an appropriate k

Preparing data for use with kNN

Why is the kNN algorithm lazy?

Predictive Diagnostics

Probabilistic Learning - the Naive Bayes Algorithm

Assumptions

Bayes Formula

The Laplace Estimator  

Case Study: Head and Neck Cancer Medication  

Understanding decision trees

Divide and conquer

The C5.0 decision tree algorithm

Choosing the best split

Pruning the decision tree

Boosting the accuracy of decision trees

Making some mistakes more costly than others

Understanding classification rules

Separate and conquer

The One Rule algorithm

The RIPPER algorithm

Rules from decision trees


Neural Networks  

Network topology  

Training neural networks with backpropagation

Case Study 1: Google Trends and the Stock Market

Support Vector Machines (SVM)

Case Study 2: Optical Character Recognition (OCR)

Case Study 3: Iris Flowers


Association Rules  

Rule support and confidence

Case Study 1: Head and Neck Cancer Medications

Case Study 2: Groceries

Case Study 3: Survival of Titanic Passengers

Term Frequency (TF), Inverse Document Frequency (IDF)

Document Term Matrix (DTM)

Case-Study: Job ranking

NLP

Cosine similarity

Sentiment Analysis


Clustering as a machine learning task  

The k-Means Clustering Algorithm  

Case Study 1: Divorce and Consequences on Young Adults  

Case Study 2: Pediatric Trauma

Practice Problem: Youth Development

Spectral Clustering

Gaussian Mixture Modeling


Measuring performance for classification  

Working with classification prediction data

Evaluation: Confusion matrices

Other performance measures

Visualizing performance tradeoffs

Estimating future performance (internal statistical validation)

The holdout method

Tuning stock models for better performance

Using caret for automated parameter tuning

Creating a simple tuned model

Customizing the tuning process

Improving model performance with meta-learning

Understanding ensembles

Bagging

Boosting

Random forests

Training random forests

Evaluating random forest performance

Forecasting types and assessment approaches

Overfitting

Internal Statistical Cross-validation is an iterative process

Example (Linear Regression)

Cross-validation methods

Case-Studies

Summary of CS output

Alternative predictor functions

Prediction Models

Appendix: R Debugging


Working with specialized data and databases

Querying data in SQL databases

Downloading the complete text of web pages

Web-page Data Scraping

Parsing JSON from web APIs

Reading and writing Microsoft Excel spreadsheets using XLSX

Visualizing network data

Data Streams and Streaming Classification

Optimization and improving the computational performance

Generalizing tabular data structures with dplyr

Parallel computing

GPU computing

R integration with Python, C/C++, Java, etc.


Variable selection methods

Filtering-based, wrapper-based, and embedded methods

Comparing random forest classification, recursive feature elimination, and stepwise variable selection

Case Study - ALS

Evaluating model performance

Regularized Linear Modeling  

Ridge Regression  

Least Absolute Shrinkage and Selection Operator (LASSO) Regression  

Linear Regression  

Assessing Prediction Accuracy  

Estimating Prediction Error

Improving Prediction Accuracy

General Regularization Framework

Example: Neuroimaging-genetics study of Parkinson's Disease Dataset

Computational Complexity

n-Fold Cross Validation

Controlled Variable Selection: Knockoff Filtering: Simulated Example

PD Neuroimaging-genetics Case-Study

Visualization  


Time series analysis

Identifying the Diff, AR and MA parameters

Structural Equation Modeling (SEM)

Case study - Parkinson's Disease (PD)  

Linear Mixed model  

GLMM and GEE Longitudinal data analysis

Recurrent Neural Networks (RNN)

Multi-covariate Long Short-Term Memory (LSTM) Networks


Free (unconstrained) optimization

Constrained Optimization

Equality and Inequality constraints

Lagrange Multipliers

Linear and Quadratic Programming

Manual vs. Automated Lagrange Multiplier Optimization

Data Denoising


Perceptrons

Biological Relevance

Simple Neural Net Examples XOR and NAND Operators

Sonar data example

Schizophrenia Neuroimaging Study

Spirals 2D Data

IBS Study

Country QoL Ranking Data

Handwritten Digits Classification

Classifying Real-World Images

Classifying Real-World Images using Tensorflow and Keras

Data Generation: Simulating Synthetic Data/Images

Generative Adversarial networks (GANs)

CIFAR10 10-Class Image Archive

Transfer Learning

Text Classification

2D Brain Tumor Image Classification

The author is profoundly indebted to all his direct mentors, past and current advisors for nurturing his curiosity, inspiring his studies, guiding the course of his career, and providing constructive and critical feedback throughout. Among these scholars are Gencho Skordev (Sofia University), Kenneth Kuttler (Michigan Tech University), De Witt L. Sumners and Fred Huffer (Florida State University), Jan de Leeuw, Nicolas Christou, and Michael Mega (UCLA), Arthur Toga (USC), Brian Athey, Kathleen Potempa, Janet Larson and Gilbert Omenn (University of Michigan).

Many colleagues, students, researchers, and fellows have shared their expertise, creativity, valuable time, and critical assessment for generating, validating, and enhancing the DSPA open-science resources. Among these are Christopher Aakre, Simeone Marino, Jiachen Xu, Ming Tang, Nina Zhou, Chao Gao, Alexandr Kalinin, Syed Husain, Brady Zhu, Farshid Sepehrband, Lu Zhao, Sam Hobel, Hanbo Sun, Tuo Wang, and many others. Researchers in the Statistics Online Computational Resource (SOCR), the Big Data for Discovery Science (BDDS) Center, and the Michigan Institute for Data Science (MIDAS) provided encouragement and valuable suggestions.

The development of the DSPA materials was partially supported by the US National Science Foundation (grants 1916425, 1734853, 1636840, 1416953, 0716055 and 1023115), US National Institutes of Health (grants P20 NR015331, U54 EB020406, P50 NS091856, P30 DK089503, P30AG053760, UL1TR002240, R01CA233487, R01MH121079, R01MH126137, T32GM141746), and the Elsie Andresen Fiske Research Fund.
