Upon successful completion of this course, students are expected to have moderate competency in at least two of the competencies within each of the three areas listed below:
Areas  Competency  Expectation  Notes  
Algorithms and Applications  Tools  Working knowledge of basic software tools (command-line, GUI-based, or web services)  Familiarity with statistical programming languages, e.g., R or SciKit/Python, and database querying languages, e.g., SQL or NoSQL  
Algorithms  Knowledge of core principles of scientific computing, applications programming, APIs, algorithm complexity, and data structures  Best practices for scientific and application programming, efficient implementation of matrix linear algebra and graphics, elementary notions of computational complexity, user-friendly interfaces, string matching  
Application Domain  Data analysis experience from at least one application area, e.g., through coursework, an internship, or a research project  Applied domain examples include: computational social sciences, health sciences, business and marketing, learning sciences, transportation sciences, engineering and physical sciences  
Data Management  Data validation & visualization  Curation, Exploratory Data Analysis (EDA) and visualization  Data provenance, validation, visualization via histograms, QQ plots, scatterplots (ggplot, Dashboard, D3.js)  
Data wrangling  Skills for data normalization, data cleaning, data aggregation, and data harmonization/registration  Data imperfections include missing values, inconsistent string formatting (‘20160101’ vs. ‘01/01/2016’), PC/Mac/Linux time vs. timestamps, and structured vs. unstructured data  
Data infrastructure  Handling databases, web services, Hadoop, multi-source data  Data structures, SOAP protocols, ontologies, XML, JSON, streaming  
Analysis Methods  Statistical inference  Basic understanding of bias and variance, principles of (non)parametric statistical inference, and (linear) modeling  Biological variability vs. technological noise, parametric (likelihood) vs. nonparametric (rank order statistics) procedures, point vs. interval estimation, hypothesis testing, regression  
Study design and diagnostics  Design of experiments, power calculations and sample sizing, strength of evidence, p-values, False Discovery Rates  Multi-stage testing, variance-normalizing transforms, histogram equalization, goodness-of-fit tests, model overfitting, model reduction  
Machine Learning  Dimensionality reduction, k-nearest neighbors, random forests, AdaBoost, kernelization, SVM, ensemble methods, CNN  Empirical risk minimization. Supervised, semi-supervised, and unsupervised learning. Transfer learning, active learning, reinforcement learning, multi-view learning, instance learning
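
As an illustration of the data wrangling competency in the table above, the following is a minimal Python sketch of harmonizing the two date formats named in the Notes column (‘20160101’ vs. ‘01/01/2016’) and handling missing values. The function name harmonize_date, the KNOWN_FORMATS tuple, and the sample records are hypothetical and not part of the course materials. The sketch also assumes that slash-separated dates are month/day/year.

```python
from datetime import datetime

# Hypothetical list of formats expected in the raw data; extend as needed.
# Assumption: slash-separated dates are month/day/year.
KNOWN_FORMATS = ("%Y%m%d", "%m/%d/%Y")


def harmonize_date(raw):
    """Return an ISO 8601 date string, or None for missing or unparseable input."""
    if raw is None or not raw.strip():
        return None  # treat empty strings and None as missing values
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None  # no known format matched


# Made-up sample records mixing both formats with missing and invalid entries.
records = ["20160101", "01/01/2016", "", None, "not a date"]
cleaned = [harmonize_date(r) for r in records]
print(cleaned)
```

Returning None for unparseable entries keeps missing and malformed values explicit, so they can be counted or imputed in a later validation step rather than silently dropped.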