#' ---
#' title: "Data Science and Predictive Analytics (UMich HS650)"
#' subtitle: "__Dimensionality Reduction__

"
#' author: "### SOCR/MIDAS (Ivo Dinov)

"
#' date: "`r format(Sys.time(), '%B %Y')`"
#' tags: [DSPA, SOCR, MIDAS, Big Data, Predictive Analytics]
#' output:
#' html_document:
#' theme: spacelab
#' highlight: tango
#' includes:
#' before_body: SOCR_header.html
#' after_body: SOCR_footer_tracker.html
#' toc: true
#' number_sections: true
#' toc_depth: 2
#' toc_float:
#' collapsed: false
#' smooth_scroll: true
#' ---
#'
#' Now that we have most of the fundamentals covered in the previous chapters, we can delve into the first data analytic method, *dimension reduction*, which reduces the number of features when dealing with a very large number of variables. Dimension reduction can help us extract a set of "uncorrelated" principal variables and reduce the complexity of the data. We are not simply picking some of the original variables. Rather, we are constructing new "uncorrelated" variables as functions of the old features.
#'
#' Dimensionality reduction techniques enable exploratory data analyses by reducing the complexity of the dataset, still approximately preserving important properties, such as retaining the distances between cases or subjects. If we are able to reduce the complexity down to a few dimensions, we can then plot the data and untangle its intrinsic characteristics.
#'
#' We will (1) start with a synthetic example demonstrating the reduction of a 2D data into 1D, (2) explain the notion of rotation matrices, (3) show examples of principal component analysis (PCA), singular value decomposition (SVD), independent component analysis (ICA) and factor analysis (FA), and (4) present a Parkinson's disease case-study at the end.
#'
#' # Example: Reducing 2D to 1D
#' We consider an example looking at twin heights. Suppose we simulate 1,000 2D points that representing `normalized` individual heights, i.e., number of standard deviations from the mean height. Each 2D point represents a pair of twins. We will simulate this scenario using [Bivariate Normal Distribution](http://socr.umich.edu/HTML5/BivariateNormal/).
#'
#'
library(MASS)
set.seed(1234)
n <- 1000
y=t(mvrnorm(n, c(0, 0), matrix(c(1, 0.95, 0.95, 1), 2, 2)))
#'
#'
#' $Twin1_{Height}$ | $Twin2_{Height}$
#' -----------------|-----------------
#' $y[1,1]$ | $y[1,2]$
#' $y[2,1]$ | $y[2,2]$
#' $y[3,1]$ | $y[3,2]$
#' $\cdots$ | $\cdots$
#' $y[500,1]$ | $y[500,2]$
#'
#' $$ y^T_{2\times500} = \begin{bmatrix}
#' y[1, ]=Twin1_{Height} \\
#' y[2, ]=Twin2_{Height}
#' \end{bmatrix}=BVN \left ( \mu= \begin{bmatrix}
#' Twin1_{Height} \\
#' Twin2_{Height}
#' \end{bmatrix} , \Sigma=\begin{bmatrix}
#' 1 & 0.95 \\
#' 0.95 & 1
#' \end{bmatrix}
#' \right ) .$$
#'
#'
plot(y[1, ], y[2, ], xlab="Twin 1 (standardized height)",
ylab="Twin 2 (standardized height)", xlim=c(-3, 3), ylim=c(-3, 3))
points(y[1, 1:2], y[2, 1:2], col=2, pch=16)
#'
#'
#' These data may represent a fraction of the information included in a high-throughput neuroimaging genetics study of twins. You can see [one example of such pediatric study here](http://wiki.socr.umich.edu/index.php/SOCR_Data_Oct2009_ID_NI).
#'
#' Tracking the distances between any two samples can be accomplished using the `dist` function. For example, here is the distance between the two RED points in the figure above:
#'
#'
d=dist(t(y))
as.matrix(d)[1, 2]
#'
#'
#' To reduce the 2D data to a simpler 1D plot we can reduce the data to a 1D matrix (vector) preserving (approximately) the distances between the 2D points.
#'
#' The 2D plot shows the Euclidean distance between the pair of RED points, the length of this line is the distance between the 2 points. In 2D, these lines tend to go along the direction of the diagonal. If we `rotate` the plot so that the diagonal is in the x-axis:
#'
#'
z1 = (y[1, ]+y[2, ])/2 # the sum (actually average)
z2 = (y[1, ]-y[2, ]) # the difference
z = rbind( z1, z2) # matrix z has the same dimension as y
thelim <- c(-3, 3)
# par(mar=c(1, 2))
# par(mfrow=c(2,1))
plot(y[1, ], y[2, ], xlab="Twin 1 (standardized height)",
ylab="Twin 2 (standardized height)",
xlim=thelim, ylim=thelim)
points(y[1, 1:2], y[2, 1:2], col=2, pch=16)
plot(z[1, ], z[2, ], xlim=thelim, ylim=thelim, xlab="Average height", ylab="Difference in height")
points(z[1, 1:2], z[2, 1:2], col=2, pch=16)
par(mfrow=c(1,1))
#'
#'
#' Of course, matrix linear algebra notation can be used to represent this affine transformation of the data. Here we can see that to get `z` we multiplied `y` by the matrix:
#'
#' $$ A = \begin{pmatrix}
#' 1/2&1/2\\
#' 1&-1\\
#' \end{pmatrix}
#' \implies z = A \times y $$
#'
#' We can invert this transform by multiplying the result by the `inverse` (rotation matrix) $A^{-1}$ as follows:
#'
#' $$ A^{-1} = \begin{pmatrix}
#' 1&1/2\\
#' 1&-1/2\\
#' \end{pmatrix}
#' \implies y = A^{-1} \times z $$
#'
#' You can try this in `R`:
#'
A <- matrix(c(1/2, 1, 1/2, -1), nrow=2, ncol=2); A # define a matrix
A_inv <- solve(A); A_inv # inverse
A %*% A_inv # Verify result
#'
#'
#' Note that this matrix transformation did not preserve distances, i.e., it's not a simple rotation in 2D:
#'
#'
d=dist(t(y)); as.matrix(d)[1, 2] # distance between first two points in Y
d1=dist(t(z)); as.matrix(d1)[1, 2] # distance between first two points in Z=A*Y
#'
#'
#' # Matrix Rotations
#'
#' One important question is how to identify transformations that distances unchanged. In mathematics, transformations between metric spaces that are distance-preserving are called isometries (or congruence, or congruent transformation).
#'
#' First, let's test the MA transformation we used above:
#' $$M=Y_1 - Y_2 \\ A=\frac{Y_1+Y_2}{2}.$$
#'
#'
MA <- matrix(c(1/2, 1, 1/2, -1), 2, 2)
MA_z <- MA%*%y
d <- dist(t(y))
d_MA <- dist(t(MA_z))
plot(as.numeric(d), as.numeric(d_MA))
abline(0, 1, col=2)
#'
#'
#' Observe that this MA transformation is not an isometry - the distances are not preserved. Here is one example with $v1=\begin{bmatrix} v1_x=0 \\ v1_y=1 \end{bmatrix}$, $v2=\begin{bmatrix} v2_x=1 \\ v2_y=0 \end{bmatrix}$, which are distance $\sqrt{2}$ apart in their native space, but separated further by the transformation $MA$, $d(MA(v1),MA(v2))=2$.
#'
#'
MA; t(MA); solve(MA); t(MA) - solve(MA)
v1 <- c(0,1); v2 <- c(01,0); rbind(v1,v2)
euc.dist <- function(x1, x2) sqrt(sum((x1 - x2) ^ 2))
euc.dist(v1,v2)
v1_t <- MA %*% v1; v2_t <- MA %*% v2
euc.dist(v1_t,v2_t)
#'
#'
#' More generally, if
#' $$ \begin{pmatrix}
#' Y_1\\
#' Y_2\\
#' \end{pmatrix}
#' \sim N \left( \begin{pmatrix}
#' \mu_1\\
#' \mu_2\\
#' \end{pmatrix},
#' \begin{pmatrix}
#' \sigma_1^2&\sigma_{12}\\
#' \sigma_{12}&\sigma_2^2\\
#' \end{pmatrix}
#' \right)$$
#' Then,
#' $$ Z = AY + \eta \sim BVN(\eta + A\mu,A\Sigma A^{T}).$$
#' Where BVN denotes bivariates normal distribution, $A = \begin{pmatrix}a&b\\c&d\\ \end{pmatrix}$, $Y=(Y_1,Y_2)^T$, $\mu = (\mu_1,\mu_2)$, $\Sigma = \begin{pmatrix} \sigma_1^2&\sigma_{12}\\ \sigma_{12}&\sigma_2^2\\ \end{pmatrix}$.
#'
#' You can verify this by using the [change of variable theorem](https://en.wikipedia.org/wiki/Change_of_variables). Thus, affine transformations preserve bivariate normality. However, there is by no mean to guarantee an isometry.
#'
#' The question now is under what additional conditions for the transformation matrix $A$, can we guarantee an isometry.
#'
#' Notice that,
#' $$ d^2(P_i,P_j) =\sum_{k=1}^{T} (P_{jk}-P_{ik})^2 = ||P||^2 = P^TP,$$
#'
#' where $P = (P_{j,1}-P_{i,1},...,P_{j,T}-P_{i,T})^T$, $P_i$ and $P_j$ is any two points in $T$ dimensions.
#'
#' Thus, the only requirement we need is $(AY)^T(AY)=Y^TY$, $i.e\ A^TA=I$, which implies that $A$ is an orthogonal (rotational) matrix.
#'
#' Let's use a two dimension orthogonal matrix to illustrate this.
#'
#' Set $A = \frac{1}{\sqrt{2}} \begin{pmatrix}1&1\\1&-1\\ \end{pmatrix}$. It's easy to verify that A is an orthogonal (2D rotation) matrix.
#'
#' The simplest way to test the isometry is to perform the linear transformation directly as follow.
#'
#'
A <- 1/sqrt(2)*matrix(c(1, 1, 1, -1), 2, 2)
z <- A%*%y
d <- dist(t(y))
d2 <- dist(t(z))
plot(as.numeric(d), as.numeric(d2))
abline(0, 1, col=2)
#'
#'
#' We can observe that distance computed from original data and after rotation is the same. This transformation is called a rotation (isometry) of $y$.
#'
#' The alternative method is to simulate from joint distribution of $Z = (Z_1,Z_2)^T$.
#'
#' As we have mentioned above: $Z = AY + \eta \sim BVN(\eta + A\mu,A\Sigma A^{T})$.
#'
#' where $\eta = (0,0)^T$, $\Sigma = \begin{pmatrix} 1&0.95\\0.95&1\\ \end{pmatrix}$, $A = \frac{1}{\sqrt{2}} \begin{pmatrix}1&1\\1&-1\\ \end{pmatrix}$.
#'
#' We can compute $A\Sigma A^{T}$ by hand or using matrix multiplication in `R`:
#'
#'
sig <- matrix(c(1,0.95,0.95,1),nrow=2)
A%*%sig%*%t(A)
#'
#'
#' $A\Sigma A^{T}$ represents the variance-covariance matrix, $cov(z_1,z_2)=0$. We can simulate $z_1$, $z_2$ independently from $z_1\sim N(0,1.95)$ and $z_2 \sim N(0,0.05)$ (Note: independence and uncorrelation are equivalent for bivariate normal distribution).
#'
#'
set.seed(2017)
zz1 = rnorm(1000,0,sd = sqrt(1.95))
zz2 = rnorm(1000,0,sd = sqrt(0.05))
zz = rbind(zz1,zz2)
d3 = dist(t(zz))
qqplot(d,d3)
abline(a = 0,b=1,col=2)
#'
#'
#' We can observe that distance computed from original data and the simulated data is the same.
#'
#'
thelim <- c(-3, 3)
#par(mfrow=c(2,1))
plot(y[1, ], y[2, ], xlab="Twin 1 (standardized height)",
ylab="Twin 2 (standardized height)",
xlim=thelim, ylim=thelim)
points(y[1, 1:2], y[2, 1:2], col=2, pch=16)
plot(z[1, ], z[2, ], xlim=thelim, ylim=thelim, xlab="Average height", ylab="Difference in height")
points(z[1, 1:2], z[2, 1:2], col=2, pch=16)
par(mfrow=c(1,1))
#'
#'
#' We applied this transformation and observed that the distances between points were unchanged after the rotation. This rotation achieves the goals of:
#'
#' * preserving the distances between points, and
#' * reducing the dimensionality of the data (see plot reducing 2D to 1D).
#'
#' Removing the second dimension and recomputing the distances, we get:
#'
#'
d4 = dist(z[1, ]) ##distance computed using just first dimension
plot(as.numeric(d), as.numeric(d4))
abline(0, 1)
#'
#'
#' The 1D distance provides a very good approximation to the actual 2D distance. This first dimension of the transformed data is called the first `principal component`. In general, this idea motivates the use of principal component analysis (PCA) and the singular value decomposition (SVD) to achieve dimension reduction.
#'
#' # Notation
#'
#' In the notation above, the rows represent variables and columns represent cases. In general, rows represent cases and columns represent variables. Hence, in our example shown here, $Y$ would be transposed to be an $N \times 2$ matrix. This is the most common way to represent the data: individuals in the rows, features are columns. In genomics, it is more common to represent subjects/SNPs/genes in the columns. For example, genes are rows and samples are columns. The sample covariance matrix usually denoted with
#' $\mathbf{X}^\top\mathbf{X}$ and has cells representing covariance between two units. Yet for this to be the case, we need the rows of $\mathbf{X}$ to represents units. Here, we have to compute, $\mathbf{Y}\mathbf{Y}^\top$ instead following the rescaling.
#'
#' # Summary (PCA vs. ICA vs. FA)
#'
#' Principle Component Analysis (PCA), Independent Component Analysis (ICA), and Factor Analysis (FA) are similar strategies, seeking to identify a new basis (vectors representing the principal directions) that the data is projected against to maximize certain (specific to each technique) objective function. These basis functions, or vectors, are just linear combinations of the original features in the data/signal.
#'
#' The Singular value decomposition (SVD), discussed later in this chapter, provides a specific matrix factorization algorithm that can be employed in various techniques to decompose a data matrix $X_{m\times n}$ as ${U\Sigma V^{T}}$, where ${U}$ is an $m \times m$ real or complex unitary matrix (${U^TU=UU^TI}$, i.e., $|\det(U)|=1$), ${\Sigma }$ is a $m\times n$ rectangular diagonal matrix of *singular values*, representing non-negative values on the diagonal, and ${V}$ is an $n\times n$ unitary matrix.
#'
#' Method | Assumptions | Cost Function Optimization | Applications
#' -------|-------------|----------------------------|-------------
#' *PCA* | Gaussian signals | Aims to explain the variance in the original signal. Minimizes the covariance of the data and yields high-energy orthogonal vectors in terms of the signal variance. PCA looks for an orthogonal linear transformation that maximizes the variance of the variables | Relies on $1^{st}$ and $2^{nd}$ moments of the measured data, which makes it useful when data features are close to Gaussian
#' *ICA* | No Gaussian signal assumptions | Minimizes higher-order statistics (e.g., $3^{rd}$ and $4^{th}$ order skewness and kurtosis), effectively minimizing the *mutual information* of the transformed output. ICA seeks a linear transformation where the basis vectors are statistically independent, but neither Gaussian, orthogonal or ranked in order | Applicable for non-Gaussian, very noisy, or mixture processes composed of simultaneous input from multiple sources
#' *FA* | Approximately Gaussian data | Objective function relies on second order moments to compute likelihood ratios. FA *factors* are linear combinations that maximize the shared portion of the variance underlying *latent variables*, which may use a variety of optimization strategies (e.g., maximum likelihood) | PCA-generalization used to test a theoretical model of latent factors causing the observed features
#'
#' #Principal Component Analysis (PCA)
#'
#' PCA (principal component analysis) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables through a process known as orthogonal transformation.
#'
#' **Principal Components**
#'
#' Let's consider the simplest situation where we have *n* observations $\{p_1, p_2, ..., p_n\}$ with 2 features $p_i=(x_i, x_2)$. When we draw them on a plot, we use x-axis and y-axis for positioning. However, we can make our own coordinate system by principal components.
#'
#'
ex<-data.frame(x=c(1, 3, 5, 6, 10, 16, 50), y=c(4, 6, 5, 7, 10, 13, 12))
reg1<-lm(y~x, data=ex)
plot(ex)
abline(reg1, col='red', lwd=4)
text(40, 10.5, "pc1")
segments(10.5, 11, 15, 7, lwd=4)
text(11, 7, "pc2")
#'
#'
#' Illustrated on the graph, the first PC, $pc_1$ is a minimum distance fit in the feature space. The second PC is a minimum distance fit to a line perpendicular to the first PC. Similarly, the third PC would be a minimum distance fit to all previous PCs. In our case of a 2D space, two PC is the most we can have. In higher dimensional spaces, we need to consider how many PCs do we need to make the best performance.
#'
#' In general, the formula for the first PC is $pc_1=a_1^TX=\sum_{i=1}^N a_{i, 1}X_i$ where $X_i$ is a $n\times 1$ vector representing a column of the matrix $X$ (representing a total of n observations and N features). Weights $a_1=\{a_{1, 1}, a_{2, 1}, ..., a_{N, 1}\}$ are chosen to maximize the variance of $pc_1$. According to this rule, the *k*th PC is $pc_k=a_k^TX=\sum_{i=1}^N a_{i, k}X_i$. $a_k=\{a_{1, k}, a_{2, k}, ..., a_{N, k}\}$ has to be constrained by more conditions:
#'
#' 1. Variance of $pc_k$ is maximized
#' 2. $Cov(pc_k, pc_l)=0$, $\forall 1\leq l0$ columns of $U$ may be trivial (zeros). It's customary to drop the zero columns of $U$ for $n>>p$ to avid dealing with unnecessarily large (trivial) matrices.
#'
#' # Case Study for dimension reduction
#'
#' **Step 1: : Collecting Data**
#'
#' The data we will be using in this case study is the Clinical, Genetic and Imaging Data for Parkinson's Disease in the SOCR website. A detailed data explanation is on the following link [PD data](http://wiki.socr.umich.edu/index.php/SOCR_Data_PD_BiomedBigMetadata). Let's import the data into R.
#'
#'
# Loading required package: xml2
wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_PD_BiomedBigMetadata")
html_nodes(wiki_url, "#content")
pd_data <- html_table(html_nodes(wiki_url, "table")[[1]])
head(pd_data); summary(pd_data)
#'
#'
#' **Step 2: Exploring and preparing the data**
#'
#' To make sure that the data is ready for further modeling, we need to fix a few things. Firstly, the `Dx` variable or diagnosis is a factor. We need to change it to a numeric variable. Second, we don't need the patient ID and time variable in the dimension reduction procedures.
#'
pd_data$Dx <- gsub("PD", 1, pd_data$Dx)
pd_data$Dx <- gsub("HC", 0, pd_data$Dx)
pd_data$Dx <- gsub("SWEDD", 0, pd_data$Dx)
pd_data$Dx <- as.numeric(pd_data$Dx)
attach(pd_data)
pd_data<-pd_data[, -c(1, 33)]
#'
#'
#' **Step 3 - training a model on the data**
#'
#' **1. PCA**
#'
#' Now we start the process of fitting a PCA model. Here we will use the `princomp()` function and use the correlation rather than the covariance matrix for calculation.
#'
#'
pca.model <- princomp(pd_data, cor=TRUE)
summary(pca.model) # pc loadings (i.e., eigenvector columns)
plot(pca.model)
biplot(pca.model)
fviz_pca_biplot(pca.model, axes = c(1, 2), geom = "point",
col.ind = "black", col.var = "steelblue", label = "all",
invisible = "none", repel = F, habillage = pd_data$Sex,
palette = NULL, addEllipses = TRUE, title = "PCA - Biplot")
#'
#'
#' We can see that in real world examples PCs do not necessarily have a "elbow" plot. In our model, each PC explains about the same amount of variation. Thus, it is hard to tell how many PCs or factors we need to pick. This would be an ad hoc decision. We can understand this better after understanding the following FA model.
#'
#' **2. FA**
#'
#' Let's set up an Cattel's Scree test to determine the number of factors first.
#'
ev <- eigen(cor(pd_data)) # get eigenvalues
ap <- parallel(subject=nrow(pd_data), var=ncol(pd_data), rep=100, cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
summary(nS)
#'
#'
#' Although the Cattel's Scree test suggest that we should use 14 factors, the real fit shows 14 is not enough. Previous PCA results suggest we need around 20 PCs to obtain a cumulative variance of 0.6. After a few trials we find that 19 factors can pass the chi square test for sufficient number of factors at $0.05$ level.
#'
#'
fa.model<-factanal(pd_data, 19, rotation="varimax")
fa.model
#'
#'
#' This data matrix has relatively low correlation. Thus, it is not suitable for ICA.
#'
#'
cor(pd_data)[1:10, 1:10]
#'