0$ columns of $U$ may be trivial (zeros). It's customary to drop the zero columns of $U$ for $n>>p$ to avid dealing with unnecessarily large (trivial) matrices. #' #' # t-SNE (t-distributed Stochastic Neighbor Embedding) #' #' The t-SNE technique represents a recent machine learning strategy for nonlinear dimensionality reduction that is useful for embedding (e.g., scatter-plotting) of high-dimensional data into lower-dimensional (1D, 2D, 3D) spaces. For each object (point in the high-dimensional space), the method models *similar objects* using nearby and *dissimilar objects* using remote distant objects. The two steps in t-SNE include (1) construction of a probability distribution over pairs of the original high-dimensional objects where that similar objects have a high probability of being picked and correspondingly, dissimilar objects have a small probability of being selected; and (2) defining a similar probability distribution over the points in the derived low-dimensional embedding minimizing the [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) between the high- and low-dimensional distributions relative to the locations of the objects in the embedding map. Either Euclidean or non-Euclidean distance measures between objects may be used as similarity metrics. #' #' ## t-SNE Formulation #' Suppose we have dimensional data ($N$D): $x_1, x_2,..., x_N$. In step 1, for each pair ($x_i, x_j$), t-SNE estimates the probabilities $p_{i,j}$ that are proportional to their corresponding similarities, $p_{j | i}$: #' #' $$p_{j | i} = \frac{\exp\left (\frac{-||x_i - x_j||^2}{2\sigma_i^2} \right )}{\sum_{k \neq i} \exp\left (\frac{-||x_i - x_k||^2}{2\sigma_i^2} \right )}.$$ #' #' The similarity between $x_j$ and $x_i$ may be thought of as the conditional probability, $p_{j | i}$. That is, assuming $N$D Gaussian distributions centered at each point $x_i$, neighbors are selected based on a probability distribution (proportion of their probability density), which represents the chance that $x_i$ may select $x_j$ as its neighbor, $p_{i,j} = \frac{p_{j | i} + p_{i |j}}{2N}$. #' #' The **perplexity** ($perp$) of a discrete probability distribution, $p$, is defined as an exponential function of the entropy, $H(p)$, over all discrete events: $perp(x)=2^{H(p)}=2^{-\sum _{x}p(x)\log_{2}p(x)}$. t-SNE performs a binary search for the value $\sigma_i$ that produces a predefined value $perp$. The simple interpretation of the perplexity at a data point $x_i$, $2^{H(p_i)}$, is as a smooth measure of the effective number of points in the $x_i$ neighborhood. The performance of t-SNE may vary with the perplexity value, which is typically specified by the user, e.g., between $5\leq perp\leq 50$. #' #' Then, the precision (variance, $\sigma_i$) of the local Gaussian kernels may be chosen to ensure that the *perplexity* of the conditional distribution equals a specified perplexity. This allows adapting the kernel bandwidth to the sample data density with smaller $\sigma_i$ fitted in denser areas of the sample data space. A particular value of $\sigma_i$ yields a probability distribution, $p_i$, over all of other data points, which has an increasing entropy as $\sigma_i$ increases. #' #' t-SNE learns a mapping $f: \{x_1, x_2, ..., x_N\} \longrightarrow \{y_1, y_2, ..., y_d\}$, where $x_i\in R^N$ and $y_i \in R^d$ ($n\gg d$) that resembles closely the *original similarities*, $p_{i,j}$ and represents the *derived similarities*, $q_{i,j}$ between pairs of embedded points $y_i,y_j$, defined by: #' #' $$q_{i,j} = \frac{(1 + ||y_i - y_j||^2)^{-1}}{\sum_{k \neq i} (1 + ||y_i - y_k||^2)^{-1}}.$$ #' #' The `t-distributed` reference in t-SNE refers to the heavy-tailed *Student-t distribution* ($t_{df=1}$) which [councides](http://wiki.socr.umich.edu/index.php/AP_Statistics_Curriculum_2007_StudentsT) with [Cauchy distribution](http://wiki.socr.umich.edu/index.php/AP_Statistics_Curriculum_2007_Cauchy), $f(z)=\frac{1}{1+z^2}$. It is used to model and measure similarities between closer points in the embedded low-dimensional space, as well as dissimilarities of objects that map far apart in the embedded space. #' The rationale for using *Student t distribution* for mapping the points is based on the fact that the volume of the $N$D ball of radius $r$, $B^N$, is proportional to $r^N$. Specifically, $V_N(r) = \frac{\pi^\frac{N}{2}}{\Gamma\left(\frac{N}{2} + 1\right)}r^N$, where $\Gamma()$ is the [Euler's gamma function](https://en.wikipedia.org/wiki/Gamma_function), which is an extension of the factorial function to non-integer arguments. For large $N$, when we select uniformly random points inside $B^N$, most points will be expected to be close to the ball surface, $S^{N-1}$, and few will be expected near the $B^N$ center, as half the volume of $B^N$ is included in the hyper-area *inside* $B^N$ and *outside* a ball of radius $r_1=\frac{1}{\sqrt[N]{2}}\times r \sim r$. You can try this with $N=2$, $\{x\in R^2 |\ ||x||\leq r\}$, representing a disk in a 2D plane. #' #' When reducing the dimensionality of a dataset, if we used the Gaussian distribution for the mapping embedding into the lower dimensional space, there will be a distortion of the distribution of the distances between neighboring objects. This is simply because the *distribution* of the distances is much different between the original (high-dimensional) and a the map-transformed low-dimensional spaces. t-SNE tries to (approximately) preserve the distances in the two spaces to avoid imbalances that may lead to biases due to excessive attraction-repulsion forces. Using Student t distribution $df=1$ (aka Cauchy distribution) for mapping the points preserves (to some extent) the distance similarity distribution, because of the heavier tails of $t$ compared to the Gaussian distribution. For a given similarity between a pair of data points, the two corresponding map points will need to be much further apart in order for their similarity to match the data similarity. #' #' A minimization process with respect to the objects $y_i$ using gradient descent of a (non-symmetric) objective function, *Kullback-Leibler divergence* between the distributions $Q$ and $P$ , is used to determine the object locations $y_i$ in the map, i.e., #' #' $$KL(P || Q) = \sum_{i \neq j} p_{i,j} \log \frac{p_{i,j}}{q_{i,j}}.$$ #' #' The minimization of the KL objective function by gradient descent may be analytically represented by: #' #' $$\frac{\partial {KL(P||Q)}}{\partial {y_i}}= \sum_{j}{(p_{i,j}-q_{i,j})f(|x_i-x_j|) u_{i,j}},$$ #' where $f(z)=\frac{z}{1+z^2}$ and $u_{i,j}$ is a unit vector from $y_j$ to $y_i$. This gradient represents the aggregate sum of all spring forces applied to map point $x_i$. #' #' This optimization leads to an embedding mapping that "preserves" the object (data point) similarities of the original high-dimensional inputs into the lower dimensional space. Note that the data similarity matrix ($p_{i,j}$) is fixed, whereas its counterpart, the map similarity matrix ($q_{i,j}$) depends on the embedding map. Of course, we want these two distance matrices to be as close as possible, implying that similar data points in the original space yield similar map-points in the reduced dimension. #' #' ## t-SNE Example: Hand-written Digit Recognition #' #' Later, in [Chapter 10](http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/10_ML_NN_SVM_Class.html) and [Chapter 22](http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/22_DeepLearning.html), we will present the Optical Character Recognition (OCR) and analysis of hand-written notes (unstructured text). #' #' Below, we show a simple example of generating a 2D embedding of the [hand-written digits dataset](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/data/DigitRecognizer_TrainingData.zip) using t-SNE. #' #' # install.packages("tsne"); library (tsne) # install.packages("Rtsne") library(Rtsne) # Download the hand-written digits data pathToZip <- tempfile() download.file("http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/data/DigitRecognizer_TrainingData.zip", pathToZip) train <- read.csv(unzip(pathToZip)) dim(train) unlink(pathToZip) # identify the label-nomenclature - digits 0, 1, 2, ..., 9 - and map to diff colors colMap <- function(x){ cols <- rainbow(length(x))[order(order(x))] #special indexing } train.labels<-train$label train$label<-as.factor(train$label) train.labels.colors <- colMap(train.labels) names(train.labels.colors) <- train$label # unique(train$label) # May need to check and increase the RAM allocation memory.limit() memory.limit(50000) # Remove the labels (column 1) and Scale the image intensities to [0; 1] train <- data.matrix(train[, -1]); dim(train) train <- t(train/255) # Visualize some of the images library("imager") # first convert the CSV data (one row per image, 42,000 rows) array_3D <- array(train[ , ], c(28, 28, 42000)) mat_2D <- matrix(array_3D[,,1], nrow = 28, ncol = 28) plot(as.cimg(mat_2D)) N <- 42000 img_3D <- as.cimg(array_3D[,,], 28, 28, N) # plot the k-th image (1<=k<=N) k <- 5; plot(img_3D, k) k <- 6; plot(img_3D, k) k <- 7; plot(img_3D, k) pretitle <- function(index) bquote(bold("Image: "~.(index))) #layout(t(1:2)) op <- par(mfrow = c(2,2), oma = c(5,4,0,0) + 0.1, mar = c(0,0,1,1) + 0.1) for (k in 1:4) { plot(img_3D, k, xlim = c(0,28), ylim = c(28,0), axes=F, ann=T, main=pretitle(k)) } # Run the t-SNE, tracking the execution time (artificially reducing the sample-size to get reasonable calculation time) execTime_tSNE <- system.time(tsne_digits <- Rtsne(t(train)[1:10000 , ], dims = 2, perplexity=30, verbose=TRUE, max_iter = 500)); execTime_tSNE # Full dataset(42K * 1K) execution may take over 5-mins # execTime_tSNE <- system.time(tsne_digits <- Rtsne(train[ , ], dims = 2, perplexity=30, verbose=TRUE, max_iter = 500)); execTime_tSNE # Plot the result 2D map embedding of the data with digits or solid discs par(mfrow=c(1,1)) # reset the plot canvas # Plot only first 1,000 cases (to avoid clutter) plot(tsne_digits$Y[1:1000, ], t='n', main="t-SNE") # don't plot the points to avoid clutter text(tsne_digits$Y[1:1000, ], labels=names(train.labels.colors)[1:1000], col=train.labels.colors[1:1000]) # plot all cases as solid discs with colors corresponding to each of the 10 numbers plot(tsne_digits$Y, main="t-SNE Clusters", col=train.labels.colors, pch = 19) legend("topright", unique(names(train.labels.colors)), fill=unique(train.labels.colors), bg='gray90', cex=0.5) #' #' #' # Dimensionality Reduction Case Study (Parkinson's Disease) #' #' ## Step 1: Collecting Data #' #' The data we will be using in this case study is the Clinical, Genetic and Imaging Data for Parkinson's Disease in the SOCR website. A detailed data explanation is on the following link [PD data](http://wiki.socr.umich.edu/index.php/SOCR_Data_PD_BiomedBigMetadata). Let's import the data into R. #' #' # Loading required package: xml2 wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_PD_BiomedBigMetadata") html_nodes(wiki_url, "#content") pd_data <- html_table(html_nodes(wiki_url, "table")[[1]]) head(pd_data); summary(pd_data) #' #' #' ## Step 2: Exploring and preparing the data #' #' To make sure that the data is ready for further modeling, we need to fix a few things. Firstly, the `Dx` variable or diagnosis is a factor. We need to change it to a numeric variable. Second, we don't need the patient ID and time variable in the dimension reduction procedures. #' #' pd_data$Dx <- gsub("PD", 1, pd_data$Dx) pd_data$Dx <- gsub("HC", 0, pd_data$Dx) pd_data$Dx <- gsub("SWEDD", 0, pd_data$Dx) pd_data$Dx <- as.numeric(pd_data$Dx) attach(pd_data) pd_data<-pd_data[, -c(1, 33)] #' #' #' ## PCA #' #' Now we start the process of fitting a PCA model. Here we will use the `princomp()` function and use the correlation rather than the covariance matrix for calculation. #' #' pca.model <- princomp(pd_data, cor=TRUE) summary(pca.model) # pc loadings (i.e., eigenvector columns) plot(pca.model) biplot(pca.model) fviz_pca_biplot(pca.model, axes = c(1, 2), geom = "point", col.ind = "black", col.var = "steelblue", label = "all", invisible = "none", repel = F, habillage = pd_data$Sex, palette = NULL, addEllipses = TRUE, title = "PCA - Biplot") #' #' #' We can see that in real world examples PCs do not necessarily have a "elbow" plot. In our model, each PC explains about the same amount of variation. Thus, it is hard to tell how many PCs or factors we need to pick. This would be an ad hoc decision. We can understand this better after understanding the following FA model. #' #' ## FA #' #' Let's set up an Cattel's Scree test to determine the number of factors first. #' #' ev <- eigen(cor(pd_data)) # get eigenvalues ap <- parallel(subject=nrow(pd_data), var=ncol(pd_data), rep=100, cent=.05) nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea) summary(nS) #' #' #' Although the Cattel's Scree test suggest that we should use 14 factors, the real fit shows 14 is not enough. Previous PCA results suggest we need around 20 PCs to obtain a cumulative variance of 0.6. After a few trials we find that 19 factors can pass the chi square test for sufficient number of factors at $0.05$ level. #' #' fa.model<-factanal(pd_data, 19, rotation="varimax") fa.model #' #' #' This data matrix has relatively low correlation. Thus, it is not suitable for ICA. #' #' cor(pd_data)[1:10, 1:10] #' #' #' ## t-SNE #' #' Finally, let's try the t-Distributed Stochastic Neighbor Embedding method on the [PD data](http://wiki.socr.umich.edu/index.php/SOCR_Data_PD_BiomedBigMetadata). #' #' # install.packages("Rtsne") library(Rtsne) # If working with post-processed PD data above: remove duplicates (after stripping time) # pd_data <- unique(pd_data[,]) # If wqorking with raw PD data: reload it pd_data <- html_table(html_nodes(wiki_url, "table")[[1]]) # Run the t-SNE, tracking the execution time (artificially reducing the sample-size to get reasonable calculation time) execTime_tSNE <- system.time(tsne_digits <- Rtsne(pd_data, dims = 2, perplexity=30, verbose=TRUE, max_iter = 1000)); execTime_tSNE # Plot the result 2D map embedding of the data # table(pd_data$Sex) # plot(tsne_digits$Y, main="t-SNE Clusters", col=rainbow(length(unique(pd_data$Sex))), pch = 1) #legend("topright", c("Male", "Female"), fill=rainbow(length(unique(pd_data$Sex))), bg='gray90', cex=0.5) table(pd_data$Dx) # Either use the DX label column to set the colors col = as.factor(pd_data$Dx) #plot(tsne_digits$Y, main="t-SNE Clusters", col=as.factor(pd_data$Dx), pch = 15) #legend("topright", c("HC", "PD", "SWEDD"), fill=unique(as.factor(pd_data$Dx)), bg='gray90', cex=0.5) # Or to set the colors explicitly CharToColor = function(input_char){ mapping = c("HC"="blue", "PD"="red", "SWEDD"="yellow") mapping[input_char] } pd_data$Dx.col = sapply(pd_data$Dx, CharToColor) plot(tsne_digits$Y, main="t-SNE Clusters", col=pd_data$Dx.col, pch = 15) legend("topright", c("HC", "PD", "SWEDD"), fill=unique(pd_data$Dx.col), bg='gray90', cex=0.5) #' #' #' The results of the PCA, ICA, FA, and t-SNE methods on the [PD data](http://wiki.socr.umich.edu/index.php/SOCR_Data_PD_BiomedBigMetadata) imply that the data is complex and intrinsically high-dimensional, which prevents explicit embeddings into a low-dimensional (e.g., 2D) space. More advanced methods to interrogate this dataset will be demonstrated later. #' #' #'

#'
#'
#'
#'
#'
#'
#'
#'
#'
#'
#'
#'
#'

#'