#' ---
#' title: "Data Science and Predictive Analytics (UMich HS650)"
#' subtitle: "Deep Learning, Neural Networks"
#' author: "SOCR/MIDAS (Ivo Dinov)"
#' date: "`r format(Sys.time(), '%B %Y')`"
#' tags: [DSPA, SOCR, MIDAS, Big Data, Predictive Analytics]
#' output:
#'   html_document:
#'     theme: spacelab
#'     highlight: tango
#'     includes:
#'       before_body: SOCR_header.html
#'       after_body: SOCR_footer_tracker.html
#'     toc: true
#'     number_sections: true
#'     toc_depth: 2
#'     toc_float:
#'       collapsed: false
#'       smooth_scroll: true
#' ---
#'
#' [Deep learning](https://en.wikipedia.org/wiki/Deep_learning#Deep_neural_network_architectures) is a special branch of machine learning that uses a collage of algorithms to model high-level motifs in data. Deep learning resembles the biological communication between brain neurons in the central nervous system (CNS), where synthetic graphs represent the CNS network as pairs of nodes and connections/edges between them. For instance, in a simple synthetic network consisting of a pair of connected nodes, an output sent by one node is received by the other as an input signal. When more nodes are present in the network, they may be arranged in multiple levels (like a multiscale object), where the $i^{th}$ layer output serves as the input of the next $(i+1)^{st}$ layer. The signal is manipulated at each layer, sent as that layer's output downstream, and interpreted as an input to the next, $(i+1)^{st}$, layer, and so forth. Deep learning relies on many layers of nodes and many edges linking them, forming input/output (I/O) layered grids that represent a multiscale processing network. At each layer, linear and non-linear transformations convert inputs into outputs.
#'
#' In this chapter, we explore the R deep learning package `MXNetR` and demonstrate state-of-the-art deep learning models utilizing CPU and GPU for fast training (learning) and testing (validation). Other powerful deep learning frameworks include *TensorFlow*, *Theano*, *Caffe*, *Torch*, *CNTK* and *Keras*.
#'
#' *Neural Networks vs. Deep Learning*: Deep Learning is a machine learning strategy that learns a deep multi-level hierarchical representation of the affinities and motifs in the dataset. Machine learning Neural Nets tend to use shallower network models. Although there are no formal restrictions on the depth of the layers in a Neural Net, few layers are commonly utilized. Recent methodological, algorithmic, computational, infrastructure and service advances overcome previous limitations. In addition, the rise of *Big Data* accelerated the evolution of *classical Neural Nets* to *Deep Neural Nets*, which can now handle many layers and many hidden nodes per layer. The former is a precursor to the latter; however, there are also *non-neural* deep learning techniques. For example, *syntactic pattern recognition* and *grammar induction* methods discover hierarchies.
#'
#' # Deep Learning Training
#'
#' Review [Chapter 10 (Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines)](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/10_ML_NN_SVM_Class.html) prior to proceeding.
#'
#' ## Perceptrons
#' A **perceptron** is an artificial analogue of a neuronal brain cell that calculates a *weighted sum of the input values* and *outputs a thresholded version of that result*. For a bivariate perceptron, $P$, having two inputs, $(X,Y)$, denote the weights of the inputs by $A$ and $B$, respectively.
#' Then, the weighted sum could be represented as: $$W = A X + B Y.$$
#'
#' At each layer $l$, the weight matrix, $W^{(l)}$, has the following properties:
#'
#' * The number of rows of $W^{(l)}$ equals the number of nodes/units in the previous $(l-1)^{st}$ layer, and
#' * The number of columns of $W^{(l)}$ equals the number of units in the current $l^{th}$ layer.
#'
#' Neuronal cells fire depending on the presynaptic inputs to the cell, which cause constant fluctuations of the neuronal membrane potential - depolarizing or hyperpolarizing, i.e., the cell membrane potential rises or falls. Similarly, perceptrons rely on thresholding of the weight-averaged input signal, which for biological cells corresponds to the voltage increase passing a critical threshold. Perceptrons output non-zero values only when the weighted sum exceeds a certain threshold $C$. In terms of its input vector, $(X,Y)$, we can describe the output of each perceptron ($P$) by:
#'
#' $$
#' Output(P) =
#' \left\{
#' \begin{array}{ll}
#' 1, & \text{if } A X + B Y > C \\
#' 0, & \text{if } A X + B Y \leq C
#' \end{array}
#' \right. .
#' $$
#'
#' Feed-forward networks are constructed as layers of perceptrons, where the first layer ingests the inputs and the last layer generates the network outputs. The intermediate (internal) layers are not directly connected to the external world and are called hidden layers. In *fully connected networks*, each perceptron in one layer is connected to every perceptron in the next layer, enabling information to be "fed forward" from one layer to the next. There are no connections between perceptrons in the same layer.
#'
#' Multilayer perceptrons (fully-connected feed-forward neural networks) consist of several fully-connected layers representing an `input matrix` $X_{n \times m}$ and a generated `output matrix` $Y_{n \times k}$. The input $X_{n\times m}$ is a matrix encoding the $n$ cases and $m$ features per case. The weight matrix $W_{m\times k}^{(l)}$ for layer $l$ has rows ($i$) corresponding to the weights leading from all the units $i$ in the previous layer to all of the units $j$ in the current layer. The product matrix $X\times W$ has dimensions $n\times k$.
#'
#' The hidden size parameter $k$, the weight matrix $W_{m \times k}$, and the bias vector $b_{k \times 1}$ are used to compute the outputs at each layer:
#'
#' $$Y_{n \times k}^{(l)} =f_k^{(l)}\left ( X_{n \times m} W_{m \times k}+b_{k \times 1} \right ).$$
#'
#' The role of the bias parameter is similar to the intercept term in linear regression; it helps improve the accuracy of prediction by shifting the decision boundary along the $Y$ axis. The outputs are fully-connected layers that feed into an `activation layer` to perform element-wise operations. Examples of **activation functions** that transform real numbers to probability-like values include:
#'
#' * the [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function), a special case of the [logistic function](https://en.wikipedia.org/wiki/Logistic_function), which converts real numbers to probabilities,
#' * the rectifier (`relu`, [Rectified Linear Unit](https://en.wikipedia.org/wiki/Rectifier_(neural_networks))) function, which outputs $\max(0, input)$,
#' * the [tanh (hyperbolic tangent function)](https://en.wikipedia.org/wiki/Hyperbolic_function#Hyperbolic_tangent).
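#'
#' As a concrete illustration of the thresholding rule above, here is a minimal R sketch of the bivariate perceptron; the weights $A=B=1$ and the threshold $C=1$ are illustrative choices, not values prescribed in the text.
#'
# A bivariate perceptron: a thresholded weighted sum of two inputs
perceptron <- function(X, Y, A = 1, B = 1, C = 1) {
  W <- A*X + B*Y         # weighted sum of the inputs
  ifelse(W > C, 1, 0)    # fire (output 1) only when the sum exceeds the threshold C
}
# With A=B=1 and C=1, this perceptron acts as a logical AND on binary inputs
perceptron(c(0, 0, 1, 1), c(0, 1, 0, 1))   # 0 0 0 1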
#'
#'
sigmoid <- function(x) { 1 / (1 + exp(-x)) }
relu <- function(y) { pmin(pmax(0, y), 1) }   # clipped to [0,1] so it fits the common probability scale of the plot
x_Llimit <- -10; x_Rlimit <- 10; y_Llimit <- 0; y_Rlimit <- 1
x <- seq(x_Llimit, x_Rlimit, 0.01)
plot(c(x_Llimit, x_Rlimit), c(y_Llimit, y_Rlimit), type="n", xlab="Input/Domain", ylab="Output/Range")
lines(x, sigmoid(x), col="blue", lwd=3)
lines(x, (1/2)*(tanh(x)+1), col="green", lwd=3)   # tanh rescaled from [-1,1] to [0,1]
lines(x, relu(x), col="red", lwd=3)
legend("left", title="Probability Transform Functions", c("sigmoid", "tanh", "relu"),
       fill=c("blue", "green", "red"), ncol = 1, cex = 0.75)
#'
#'
#' The final fully-connected layer may have a hidden size equal to the number of classes in the dataset and may be followed by a `softmax` layer mapping its input into probability scores. For example, if a size ${n\times m}$ input is denoted by $X_{n\times m}$, then the probability scores may be obtained by the `softmax` transformation function, which maps real valued vectors to vectors of probabilities:
#'
#' $$\left ( \frac{e^{x_{i,1}}}{\displaystyle\sum_{j=1}^m e^{x_{i,j}}},\ldots, \frac{e^{x_{i,m}}}{\displaystyle\sum_{j=1}^m e^{x_{i,j}}}\right ).$$
#'
#' Below is a schematic of a fully-connected feed-forward neural network of nodes
#' $$ \{ a_{j=node\ index, l=layer\ index} \}_{j=1, l=1}^{n_j, 4}.$$
#'
# install.packages("GGally"); install.packages("sna"); install.packages("network")
# library("GGally"); library(network); library(sna); library(ggplot2)
library('igraph')
# define node names
nodes <- c('a(1,1)','a(2,1)','a(3,1)','a(4,1)','a(5,1)','a(6,1)','a(7,1)',
           'a(1,2)','a(2,2)','a(3,2)','a(4,2)','a(5,2)',
           'a(1,3)','a(2,3)','a(3,3)',
           'a(1,4)','a(2,4)')
# define node x,y coordinates
x <- c(rep(0,7), rep(2,5), rep(4,3), rep(6,2))
y <- c(1:7, 2:6, 3:5, 4:5)
# x; y   # print x & y coordinates of all nodes
# define edges: each layer-1 node feeds each layer-2 node, and so on
from <- c(rep(c('a(1,1)','a(2,1)','a(3,1)','a(4,1)','a(5,1)','a(6,1)','a(7,1)'), 5),
          rep(c('a(1,2)','a(2,2)','a(3,2)','a(4,2)','a(5,2)'), 3),
          rep(c('a(1,3)','a(2,3)','a(3,3)'), 2))
to <- c(rep('a(1,2)',7), rep('a(2,2)',7), rep('a(3,2)',7), rep('a(4,2)',7), rep('a(5,2)',7),
        rep('a(1,3)',5), rep('a(2,3)',5), rep('a(3,3)',5),
        rep('a(1,4)',3), rep('a(2,4)',3))
# one label per edge (56 edges in total); most labels are left blank
edge_names <- c("w(i=1,k=1,l=2)","","","","","","",
                "","","","","","","",
                "","","","w(i=4,k=3,l=2)","","","",
                "","","","","","","",
                "","","","","","","w(i=5,k=7,l=2)",
                "","","","","",
                "","w(i=2,k=3,l=3)","","","",
                "","","","","",
                "w(i=1,k=1,l=4)","","",
                "","","w(i=2,k=3,l=4)")
NodeList <- data.frame(nodes, x, y)
EdgeList <- data.frame(from, to, edge_names)
nn.graph <- graph_from_data_frame(vertices = NodeList, d = EdgeList, directed = TRUE) %>%
  set_edge_attr("label", value = edge_names)
# plot(nn.graph)
#'
#'
# igraph examples: http://michael.hahsler.net/SMU/LearnROnYourOwn/code/igraph.html
map <- function(x, range = c(0,1), from.range=NA) {
  if(any(is.na(from.range))) from.range <- range(x, na.rm=TRUE)
  ## check if all values are the same
  if(!diff(from.range)) return(
    matrix(mean(range), ncol=ncol(x), nrow=nrow(x), dimnames = dimnames(x)))
  ## map to [0,1]
  x <- (x - from.range[1])
  x <- x/diff(from.range)
  ## handle single values
  if(diff(from.range) == 0) x <- 0
  ## map from [0,1] to [range]
  if (range[1] > range[2]) x <- 1-x
  x <- x*(abs(diff(range))) + min(range)
  x[x < min(range) | x > max(range)] <- NA   # drop values outside the target range
  return(x)
}
col <- rep("gray", length(V(nn.graph)))   # hidden layers stay gray
col[c(1:7, 16:17)] <- "lightblue"         # input layer (nodes 1:7)
col[16:17] <- "lightgreen"                # output layer (nodes 16:17, overrides blue)
plot(nn.graph, vertex.color=col, vertex.size=map(betweenness(nn.graph), c(25,30)),
     ylab="Input Layer", xlab="A schematic of fully-connected feed-forward neural network")
# plot(nn.graph, vertex.size=map(betweenness(nn.graph), c(20,30)),
#      edge.width=map(edge.betweenness(nn.graph), c(1,5)))
#'
#'
col <- rep("gray", length(V(nn.graph)))   # hidden layers
col[c(1:7, 16:17)] <- "lightblue"         # input layer
col[16:17] <- "lightgreen"                # output layer
plot(nn.graph, vertex.color=col)
title(main="A schematic of fully-connected feed-forward neural network")
mtext("Input Layer (j=1)", side=2, line=-1, col="blue")
mtext(" Output Layer (j=4)", side=4, line=-2, col="green")
mtext("Hidden Layers (j=2,3)", side=1, line=0, col="gray")
#'
#'
#' The plot above illustrates the key elements in the action potential, or activation function, calculations and the corresponding training parameters:
#' $$a_{node=k,layer=l} = f\left ( \displaystyle\sum_{i}{w_{k,i}^l \times a_{i}^{l-1} +b_{k}^l} \right ),$$
#' where:
#'
#' * $f$ is the *activation function*, e.g., the [logistic function](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/17_RegularizedLinModel_KnockoffFilter.html) $f(x) = \frac{1}{1+e^{-x}}$, which converts the aggregate weights at each node to probability-like values,
#' * $w_{k,i}^l$ is the weight carried from the $i^{th}$ element of the $(l-1)^{st}$ layer to the $k^{th}$ element of the current $l^{th}$ layer, and
#' * $b_{k}^l$ is the (residual) bias present in the $k^{th}$ element of the $l^{th}$ layer. This is effectively the information not explained by the training model.
#'
#' These parameters may be estimated using different techniques (e.g., using least squares, or stochastically using steepest descent methods) based on the training data.
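#'
#' The layer propagation formula $Y_{n \times k}^{(l)} = f\left(X_{n \times m} W_{m \times k} + b_{k \times 1}\right)$ can be traced with a minimal one-layer sketch; the dimensions and the random weights below are illustrative assumptions, not values estimated from data.
#'
set.seed(123)
n <- 4; m <- 3; k <- 2                         # cases, input features, hidden units
X <- matrix(rnorm(n*m), nrow = n, ncol = m)    # input matrix X_{n x m}
W <- matrix(rnorm(m*k), nrow = m, ncol = k)    # weight matrix W_{m x k}
b <- rnorm(k)                                  # bias vector b_{k x 1}
f <- function(x) 1/(1 + exp(-x))               # sigmoid activation
# Y_{n x k} = f(XW + b); sweep() adds the bias b to each row of XW
Y <- f(sweep(X %*% W, 2, b, "+"))
dim(Y)   # n x k = 4 x 2 layer output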
#'
#' # Biological Relevance
#'
#' There are parallels between biology (neuronal cells) and the mathematical models (perceptrons) for neural network representation. The human brain contains about $10^{11}$ neuronal cells connected by approximately $10^{15}$ synapses, forming the basis of our functional phenotypes. The schematic below illustrates some of the parallels between brain biology and the mathematical representation using synthetic neural nets. Every neuronal cell receives multi-channel (afferent) input from its dendrites, generates output signals, and disseminates the results via its (efferent) axonal and synaptic connections to the dendrites of other neurons.
#'
#' The perceptron is a mathematical model of a neuronal cell that allows us to explicitly determine algorithmic and computational protocols transforming input signals into output actions. For instance, a signal arriving through axon $x_0$ is modulated by some prior weight, e.g., synaptic strength, $w_0\times x_0$. Internally, within the neuronal cell, this input is aggregated (summed, or weight-averaged) with the inputs from all other axons.
#' Brain plasticity suggests that `synaptic strengths` (weight coefficients $w$) are strengthened by training and prior experience. This learning process controls the direction and strength of influence of neurons on other neurons. Either excitatory ($w>0$) or inhibitory ($w\leq 0$) influences are possible. Dendrites and axons carry signals to and from neurons, where the aggregate responses are computed and transmitted downstream. Neuronal cells fire only if the action potential exceeds a certain threshold. In this situation, a signal is transmitted downstream through the axons. The neuron remains silent if the summed signal is below the critical threshold.
#'
#' Timing of events is important in biological networks. In the computational perceptron model, a first-order approximation may ignore the timing of neuronal firing (spike events) and only focus on the frequency of the firing. The firing rate of a neuron with an activation function $f$ represents the frequency of the spikes along the axon. We saw some examples of activation functions earlier.
#'
#' The diagram below illustrates the parallels between the brain network-synaptic organization and an artificial synthetic neural network.
#'
#' [![](http://wiki.socr.umich.edu/images/a/a6/DSPA_22_DeepLearning_Fig1.png)](http://wiki.socr.umich.edu/images/a/a6/DSPA_22_DeepLearning_Fig1.png)
#'
#'
#' # Simple Neural Net Examples
#'
#' Before we look at some examples of deep learning algorithms applied to model observed natural phenomena, we will develop a couple of simple networks for computing fundamental Boolean operations.
#'
#' ## Exclusive OR (XOR) Operator
#' The [exclusive OR (XOR) operator](https://en.wikipedia.org/wiki/Exclusive_or) works as a bivariate binary-outcome function, mapping pairs of false (0) and true (1) values to dichotomous false (0) and true (1) outcomes.
#'
#' We can design a simple two-layer neural network that calculates `XOR`. The **value within each neuron represents its explicit threshold**, which can be normalized so that all neurons utilize the same threshold (typically $1$). The **value labels associated with network connections (edges) represent the weights of the inputs**. When the threshold is not reached, the output is $0$, and when the threshold is reached, the output is correspondingly $1$.
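#'
#' Before drawing the network, we can express it directly in R; this is a minimal sketch, assuming each unit fires (outputs $1$) when its weighted input reaches its threshold:
#'
# step activation: fires when the weighted input reaches the threshold
fire <- function(w_sum, threshold) ifelse(w_sum >= threshold, 1, 0)
xor_net <- function(X, Y) {
  h1 <- fire(1*X,       1)        # hidden unit with threshold 1, fed by X
  h2 <- fire(1*X + 1*Y, 2)        # hidden unit with threshold 2, fed by X and Y
  h3 <- fire(1*Y,       1)        # hidden unit with threshold 1, fed by Y
  fire(1*h1 - 2*h2 + 1*h3, 1)     # output unit Z with edge weights 1, -2, 1
}
# Reproduce the XOR truth table below
data.frame(X = c(0,0,1,1), Y = c(0,1,0,1), Z = xor_net(c(0,0,1,1), c(0,1,0,1)))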
#'
#'
# install.packages("network")
library('igraph')
# define node names
nodes <- c('X', 'Y',                               # Inputs (0,1)
           'H_1_1_th1', 'H_1_2_th2', 'H_1_3_th1',  # Hidden Layer
           'Z')                                    # Output Layer (0,1)
# define node x,y coordinates
x <- c(rep(0,2), rep(2,3), 4)
y <- c(c(1.5, 2.5), 1:3, 2)
## x; y   # print x & y coordinates of all nodes
# define edges
from <- c('X', 'X', 'Y', 'Y', 'H_1_1_th1', 'H_1_2_th2', 'H_1_3_th1', 'Z')
to   <- c('H_1_1_th1', 'H_1_2_th2', 'H_1_2_th2', 'H_1_3_th1', 'Z', 'Z', 'Z', 'Z')
edge_names <- c('1', '1', '1', '1', '1', '-2', '1', '1')
NodeList <- data.frame(nodes, x, y)
EdgeList <- data.frame(from, to, edge_names)
nn.graph <- graph_from_data_frame(vertices = NodeList, d = EdgeList, directed = TRUE) %>%
  set_edge_attr("label", value = edge_names)
# plot(nn.graph)
# See the igraph specs here: http://kateto.net/networks-r-igraph
col <- rep("gray", length(V(nn.graph)))   # input layer
col[c(3:5)] <- "lightblue"                # hidden layer
col[6] <- "lightgreen"                    # output layer
plot(nn.graph, vertex.color=col, vertex.shape="sphere",
     vertex.size=c(35, 35, 80, 80, 80, 35), edge.label.cex=1.5)
title(main="XOR Operator")
mtext("Input Layer: Bivariate (0,1)", side=2, line=-1, col="blue")
mtext(" Output Layer (XOR)", side=4, line=-2, col="green")
mtext("Hidden Layers (Neurons)", side=1, line=0, col="gray")
#'
#'
#' Let's work out manually the 4 possibilities:
#'
#' InputX|InputY|XOR Output(Z)
#' ------|------|-------------
#' 0 | 0 | 0
#' 0 | 1 | 1
#' 1 | 0 | 1
#' 1 | 1 | 0
#'
#' We can validate that this network indeed represents an XOR operator by plugging in all four possible input combinations and confirming the expected results at the end.
#'
par(mfrow=c(2,2))   # 2 x 2 design
nodes <- c('X=0', 'Y=0',                           # Inputs
           'H_1_1_th1', 'H_1_2_th2', 'H_1_3_th1',  # Hidden Layer
           'Z=0')                                  # Output Layer (0,1)
# define node x,y coordinates
x <- c(rep(0,2), rep(2,3), 4)
y <- c(c(1.5, 2.5), 1:3, 2)
# x; y   # print x & y coordinates of all nodes
######### (0,0)
# define edges
from <- c('X=0', 'X=0', 'Y=0', 'Y=0', 'H_1_1_th1', 'H_1_2_th2', 'H_1_3_th1', 'Z=0')
to   <- c('H_1_1_th1', 'H_1_2_th2', 'H_1_2_th2', 'H_1_3_th1', 'Z=0', 'Z=0', 'Z=0', 'Z=0')
edge_names <- c('1', '1', '1', '1', '1', '-2', '1', '1')
NodeList <- data.frame(nodes, x, y)
EdgeList <- data.frame(from, to, edge_names)
nn.graph <- graph_from_data_frame(vertices = NodeList, d = EdgeList, directed = TRUE) %>%
  set_edge_attr("label", value = edge_names)
col <- rep("gray", length(V(nn.graph)))   # input layer
col[c(3:5)] <- "lightblue"                # hidden layer
col[6] <- "lightgreen"                    # output layer
plot(nn.graph, vertex.color=col, vertex.shape="pie",
     vertex.size=c(30, 30, 40, 40, 40, 30), edge.label.cex=2, edge.arrow.size=0.25)
title(main="XOR Operator")
mtext("Input Layer: Bivariate (X=0,Y=0)", side=2, line=-1, col="blue")
mtext(" Output Layer (XOR), Z=0", side=4, line=-2, col="green")
mtext("Hidden Layers (Neurons)", side=1, line=0, col="gray")
######### (0,1)
nodes <- c('X=0', 'Y=1',                           # Inputs
           'H_1_1_th1', 'H_1_2_th2', 'H_1_3_th1',  # Hidden Layer
           'Z=1')                                  # Output Layer (0,1)
# define edges
from <- c('X=0', 'X=0', 'Y=1', 'Y=1', 'H_1_1_th1', 'H_1_2_th2', 'H_1_3_th1', 'Z=1')
to   <- c('H_1_1_th1', 'H_1_2_th2', 'H_1_2_th2', 'H_1_3_th1', 'Z=1', 'Z=1', 'Z=1', 'Z=1')
NodeList <- data.frame(nodes, x, y)
EdgeList <- data.frame(from, to, edge_names)
nn.graph <- graph_from_data_frame(vertices = NodeList, d = EdgeList, directed = TRUE) %>%
  set_edge_attr("label", value = edge_names)
col <- rep("gray", length(V(nn.graph)))   # input layer
col[c(3:5)] <- "lightblue"                # hidden layer
col[6] <- "lightgreen"                    # output layer
plot(nn.graph, vertex.color=col, vertex.shape="pie",
     vertex.size=c(30, 30, 40, 40, 40, 30), edge.label.cex=1, edge.arrow.size=0.25)
title(main="XOR Operator")
mtext("Input Layer: Bivariate (X=0,Y=1)", side=2, line=-1, col="blue")
mtext(" Output Layer (XOR), Z=1", side=4, line=-2, col="green")
mtext("Hidden Layers (Neurons)", side=1, line=0, col="gray")
######### (1,0)
nodes <- c('X=1', 'Y=0',                           # Inputs
           'H_1_1_th1', 'H_1_2_th2', 'H_1_3_th1',  # Hidden Layer
           'Z=1')                                  # Output Layer (0,1)
# define edges
from <- c('X=1', 'X=1', 'Y=0', 'Y=0', 'H_1_1_th1', 'H_1_2_th2', 'H_1_3_th1', 'Z=1')
to   <- c('H_1_1_th1', 'H_1_2_th2', 'H_1_2_th2', 'H_1_3_th1', 'Z=1', 'Z=1', 'Z=1', 'Z=1')
NodeList <- data.frame(nodes, x, y)
EdgeList <- data.frame(from, to, edge_names)
nn.graph <- graph_from_data_frame(vertices = NodeList, d = EdgeList, directed = TRUE) %>%
  set_edge_attr("label", value = edge_names)
col <- rep("gray", length(V(nn.graph)))   # input layer
col[c(3:5)] <- "lightblue"                # hidden layer
col[6] <- "lightgreen"                    # output layer
plot(nn.graph, vertex.color=col, vertex.shape="pie",
     vertex.size=c(30, 30, 40, 40, 40, 30), edge.label.cex=1, edge.arrow.size=0.25)
title(main="XOR Operator")
mtext("Input Layer: Bivariate (X=1,Y=0)", side=2, line=-1, col="blue")
mtext(" Output Layer (XOR), Z=1", side=4, line=-2, col="green")
mtext("Hidden Layers (Neurons)", side=1, line=0, col="gray")
######### (1,1)
nodes <- c('X=1', 'Y=1',                           # Inputs
           'H_1_1_th1', 'H_1_2_th2', 'H_1_3_th1',  # Hidden Layer
           'Z=0')                                  # Output Layer (0,1)
# define edges
from <- c('X=1', 'X=1', 'Y=1', 'Y=1', 'H_1_1_th1', 'H_1_2_th2', 'H_1_3_th1', 'Z=0')
to   <- c('H_1_1_th1', 'H_1_2_th2', 'H_1_2_th2', 'H_1_3_th1', 'Z=0', 'Z=0', 'Z=0', 'Z=0')
NodeList <- data.frame(nodes, x, y)
EdgeList <- data.frame(from, to, edge_names)
nn.graph <- graph_from_data_frame(vertices = NodeList, d = EdgeList, directed = TRUE) %>%
  set_edge_attr("label", value = edge_names)
col <- rep("gray", length(V(nn.graph)))   # input layer
col[c(3:5)] <- "lightblue"                # hidden layer
col[6] <- "lightgreen"                    # output layer
plot(nn.graph, vertex.color=col, vertex.shape="pie",
     vertex.size=c(30, 30, 40, 40, 40, 30), edge.label.cex=1.5, edge.arrow.size=0.25)
title(main="XOR Operator")
mtext("Input Layer: Bivariate (X=1,Y=1)", side=2, line=-1, col="blue")
mtext(" Output Layer (XOR), Z=0", side=4, line=-2, col="green")
mtext("Hidden Layers (Neurons)", side=1, line=0, col="gray")
#'
#'
#'
#' ## NAND Operator
#' Another binary operator is `NAND` ([negative AND, Sheffer stroke](https://en.wikipedia.org/wiki/NAND_gate)), which produces a false (0) output if and only if both of its operands are true (1), and generates true (1) otherwise. Below is the `NAND` input-output table.
#'
#' InputX|InputY|NAND Output(Z)
#' ------|------|--------------
#' 0 | 0 | 1
#' 0 | 1 | 1
#' 1 | 0 | 1
#' 1 | 1 | 0
#'
#' Similarly to the `XOR` operator, we can design a one-layer neural network (with a single hidden node) that calculates `NAND`. The **value within each neuron represents its explicit threshold**, which can be normalized so that all neurons utilize the same threshold (typically $1$). The **value labels associated with network connections (edges) represent the weights of the inputs**. When the threshold is not reached, the output is trivial ($0$), and when the threshold is reached, the output is correspondingly $1$. Here is a shorthand analytic expression for the `NAND` calculation:
#'
#' $$NAND(X,Y) = 1.3 - (1\times X + 1\times Y).$$
#' Check that, after thresholding this expression at $0$ (output $1$ when positive, $0$ otherwise), $NAND(X,Y)=0$ if and only if $X=1$ and $Y=1$; otherwise it equals $1$.
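#'
#' A quick numeric check of this shorthand expression (a sketch; the threshold-at-zero step is the one implied by the text):
#'
nand <- function(X, Y) ifelse(1.3 - (1*X + 1*Y) > 0, 1, 0)   # threshold the linear expression at 0
data.frame(X = c(0,0,1,1), Y = c(0,1,0,1), Z = nand(c(0,0,1,1), c(0,1,0,1)))
# NAND outputs 0 only for X=1, Y=1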
#'
#'
# install.packages("network")
library('igraph')
# define node names
nodes <- c('X', 'Y',             # Inputs (0,1)
           'H_th=1.3-(X+Y)?0',   # Hidden Layer
           'Z')                  # Output Layer (0,1)
# define node x,y coordinates
x <- c(rep(0,2), 2, 4)
y <- c(c(1.5, 2.5), 2, 2)
# x; y   # print x & y coordinates of all nodes
# define edges
from <- c('X', 'Y', 'H_th=1.3-(X+Y)?0')
to   <- c('H_th=1.3-(X+Y)?0', 'H_th=1.3-(X+Y)?0', 'Z')
edge_names <- c('-1', '-1', '1')
NodeList <- data.frame(nodes, x, y)
EdgeList <- data.frame(from, to, edge_names)
nn.graph <- graph_from_data_frame(vertices = NodeList, d = EdgeList, directed = TRUE) %>%
  set_edge_attr("label", value = edge_names)
# plot(nn.graph)
# See the igraph specs here: http://kateto.net/networks-r-igraph
col <- rep("gray", length(V(nn.graph)))   # input layer
col[c(3)] <- "lightblue"                  # hidden layer
col[4] <- "lightgreen"                    # output layer
plot(nn.graph, vertex.color=col, vertex.shape="sphere",
     vertex.size=c(35, 35, 80, 35), edge.label.cex=2)
title(main="NAND Operator")
mtext("Input Layer: Bivariate (X,Y)", side=2, line=-1, col="blue")
mtext(" Output Layer (NAND)", side=4, line=-2, col="green")
mtext("Hidden Layers (Neurons)", side=1, line=0, col="gray")
#'
#'
#' ## Complex networks designed using simple building blocks
#'
#' Observe that stringing some of these primitive networks together, and/or increasing the number of hidden layers, allows us to model problems of exponentially increasing complexity. For instance, constructing a 4-input `NAND` function would simply require repeating several of our 2-input `NAND` operators. This increases the space of possible input combinations from $2^2$ to $2^4$. Of course, introducing more depth in the **hidden layers** further expands the complexity of the problems that can be modeled using neural nets.
#'
#' You can interactively manipulate the [Google TensorFlow Deep Neural Network Webapp](http://playground.tensorflow.org) to gain additional intuition and experience with the various components of deep learning networks.
#'
#' The [ConvnetJS demo](http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html) provides another hands-on example, using a 2-layer neural network for 2D classification.
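#'
#' To make the building-block idea above concrete, here is an illustrative sketch (not part of the networks drawn above) assembling a 4-input `NAND` from 2-input `NAND` units:
#'
nand2 <- function(x, y) ifelse(1.3 - (x + y) > 0, 1, 0)       # the 2-input NAND from above
and2  <- function(x, y) nand2(nand2(x, y), nand2(x, y))       # AND built from NAND units (NAND followed by a NAND-based NOT)
nand4 <- function(a, b, c, d) nand2(and2(a, b), and2(c, d))   # NOT(a AND b AND c AND d)
nand4(1, 1, 1, 1)   # 0 - false only when all four inputs are true
nand4(1, 1, 1, 0)   # 1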
#'
#' # Classification
#' In [MXNet](http://mxnet.io/api/r/), a [multilayer perceptron (`MLP`)](https://github.com/dmlc/mxnet/blob/986b736b816018b96e9d1e2c358bb7665b80944d/example/image-classification/train_mnist.R) may be defined by:
#'
#' * Creating a placeholder variable for the input data, `data = mx.sym.Variable('data')`
#' * Flattening the data from 4D shape space (width, height, batch_size, num_channel) into 2D (num_channel\*width\*height, batch_size), `data = mx.sym.Flatten(data=data)`
#' * And iterating over the fully-connected layers:
#' + First layer, `fc1 = mx.sym.FullyConnected(data=data, name='fc1', num_hidden=128)`
#' + Apply the `relu` function to the output of the first fully-connected layer, `act1 = mx.sym.Activation(data=fc1, name='relu1', act_type="relu")`
#' + Generate the second fully-connected layer and apply the activation function, `fc2 = mx.sym.FullyConnected(data=act1, name='fc2', num_hidden = 64); act2 = mx.sym.Activation(data=fc2, name='relu2', act_type="relu")`
#' + Generate the third/final fully-connected layer, with a hidden size $k=10$, which in digit recognition tasks corresponds to the number of unique digits $0$-$9$, `fc3 = mx.sym.FullyConnected(data=act2, name='fc3', num_hidden=10)`
#'
#' * Finally, mapping the input into a probability score using the softmax and loss layer, `mlp = mx.sym.SoftmaxOutput(data=fc3, name='softmax')`. See the [mlp R source code here](https://raw.githubusercontent.com/dmlc/mxnet/e7514fe1b3265aaf15870b124bb6ed0edd82fa76/R-package/R/mlp.R).
#'
#' ## Sonar data example
#'
#' Let's load the `mlbench` and `mxnet` packages and demonstrate the basic invocation of `mxnet`. The [Sonar data](https://www.rdocumentation.org/packages/mlbench/versions/2.1-1/topics/Sonar) `mlbench::Sonar` includes sonar signals bouncing off a metal cylinder or a roughly cylindrical rock. Each of the 208 patterns includes a set of 60 numbers (features) in the range 0.0 to 1.0, and a label (M or R). Each feature represents the energy within a particular frequency band, integrated over a certain period of time. The M and R labels indicate whether each record corresponds to a rock (R) or a mine (M, metal cylinder). The numbers in the labels are in increasing order of aspect angle, but they do not encode the angle directly.
#'
# Load the required packages: mlbench and mxnet
# install.packages("mlbench"); install.packages("mxnet")   # Note mxnet requires "visNetwork"
# If it doesn't work, you may need the following lines:
# install.packages("drat", repos="https://cran.rstudio.com")
# drat:::addRepo("dmlc")
# install.packages("mxnet")
require(mlbench)
require(mxnet)
data(Sonar, package="mlbench")
table(Sonar[,61])
Sonar[,61] = as.numeric(Sonar[,61])-1   # "R" = 1, "M" = 0
set.seed(123)
train.ind = sample(1:nrow(Sonar), 0.7*nrow(Sonar))
train.x = data.matrix(Sonar[train.ind, 1:60])
train.y = Sonar[train.ind, 61]
test.x = data.matrix(Sonar[-train.ind, 1:60])
test.y = Sonar[-train.ind, 61]
#'
#'
#' Let's start by using a **multi-layer perceptron** as a classifier. The `mxnet` function `mx.mlp` builds a general multi-layer neural network that can be utilized to do classification or regression graph modeling.
#' It relies on the following parameters:
#'
#' * Training data and labels
#' * Number of hidden nodes in each hidden layer
#' * Number of nodes in the output layer
#' * Type of activation
#' * Type of output loss
#' * The device to train on (GPU or CPU)
#' * Additional optional parameters, see `mx.model.FeedForward.create`
#'
#' Here is one example using the *training* and *testing* data we defined above:
#'
mx.set.seed(1234)
model.mx <- mx.mlp(train.x, train.y, hidden_node=8, out_node=2, out_activation="softmax",
                   num.round=200, array.batch.size=15, learning.rate=0.1, momentum=0.9,
                   eval.metric=mx.metric.accuracy, verbose=F)
# calculate prediction sensitivity & specificity
preds = predict(model.mx, test.x)   # these are probabilities
# You can inspect the test labels vs. assigned probabilities by
# View(data.frame(test.y, preds[2,]))
preds1 <- ifelse(preds[2,] <= 0.5, 0, 1)   # dichotomize to labels
table(preds1)
pred.label = t(preds1)
table(pred.label, test.y)
library("caret")
sensitivity(factor(preds1), factor(as.numeric(test.y)), positive = 1)
specificity(factor(preds1), factor(as.numeric(test.y)), negative = 0)
#'
#'
#' We can also use `crossval::diagnosticErrors()` and `crossval::confusionMatrix()` to get more detailed evaluations. Similarly to `sensitivity()` and `specificity()`, we should specify the *negative* and *positive* labels.
#'
#' *Note* that you have to explicitly specify `crossval::confusionMatrix()` if you also have the `caret` package loaded, as `caret` also has a function called `confusionMatrix()`.
#'
library("crossval")
diagnosticErrors(crossval::confusionMatrix(preds1, test.y, negative = 0))
#'
#'
#' Now, let's compare the results for different numbers of training rounds.
#'
mx.set.seed(1234)
get_pred = function(n){
  model.mx <- mx.mlp(train.x, train.y, hidden_node=8, out_node=2, out_activation="softmax",
                     num.round=n, array.batch.size=15, learning.rate=0.1, momentum=0.9,
                     eval.metric=mx.metric.accuracy, verbose=F)
  preds = predict(model.mx, test.x)
}
preds100 = get_pred(100)
preds50 = get_pred(50)
preds10 = get_pred(10)
#'
#'
#' We can plot the ROC curves and calculate the AUC (Area Under the Curve):
#'
# install.packages("pROC"); install.packages("plotROC"); install.packages("reshape2")
library(pROC); library(plotROC); library(reshape2)
# compute AUC
get_roc = function(preds){
  roc_obj <- roc(test.y, preds[2,])
  auc(roc_obj)
}
get_roc(preds)
get_roc(preds100)
get_roc(preds50)
get_roc(preds10)
# plot ROC
dt <- data.frame(test.y, preds[2,], preds100[2,], preds50[2,], preds10[2,])
colnames(dt) <- c("D","rounds=200","rounds=100","rounds=50","rounds=10")
dt <- melt(dt, id.vars = "D")
basicplot <- ggplot(dt, aes(d = D, m = value, colour=variable)) +
  geom_roc(labels = FALSE, size = 0.5, alpha.line = 0.6, linejoin = "mitre") +
  theme_bw() + coord_fixed(ratio = 1) + style_roc() + ggtitle("ROC CURVE") +
  annotate("rect", xmin = 0.4, xmax = 1, ymin = 0.2, ymax = 0.75, alpha = .2) +
  annotate("text", x = 0.7, y = 0.5, size = 3,
           label = "AUC: \n rounds=200: 0.9209\n rounds=100: 0.9128\n rounds=50: 0.8824\n rounds=10: 0.8022\n ")
basicplot
#'
#'
#' The plot suggests that the results stabilize after 100 iterations.
#'
#' Let's look at some visualizations of the real labels of the test data (`test.y`) and their corresponding ML-derived classification labels (`preds[2,]`) using 200 iterations.
#'
#'
graph.viz(model.mx$symbol)
hist(preds10[2,], main = "rounds=10")
hist(preds50[2,], main = "rounds=50")
hist(preds100[2,], main = "rounds=100")
hist(preds[2,], main = "rounds=200")
#'
#'
#' We see a significant bimodal trend as the number of rounds increases. Another plot may be generated by:
#'
library(ggplot2)
get_gghist = function(preds){
  ggplot(data.frame(test.y, preds), aes(x=preds, group=test.y, fill=as.factor(test.y))) +
    geom_histogram(position="dodge", binwidth=0.25) + theme_bw()
}
df = data.frame(preds[2,], preds100[2,], preds50[2,], preds10[2,])
p <- lapply(df, get_gghist)
require(gridExtra)   # used to arrange ggplots
grid.arrange(p$preds10.2..., p$preds50.2..., p$preds100.2..., p$preds.2...)
#'
#'
#' ## MXNet Notes
#'
#' * The `mx.mlp()` function is a proxy to the more complex and laborious process of defining a neural network by using **MXNet**'s `Symbol` interface. For instance, the call `model.mx <- mx.mlp(train.x, train.y, hidden_node=8, out_node=2, out_activation="softmax", num.round=20, array.batch.size=15, learning.rate=0.1, momentum=0.9, eval.metric=mx.metric.accuracy)` would be equivalent to a symbolic network definition like: `data <- mx.symbol.Variable("data"); fc1 <- mx.symbol.FullyConnected(data, num_hidden=128); act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu"); fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64); act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu"); fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=2); lro <- mx.symbol.SoftmaxOutput(fc3, name="sm"); model2 <- mx.model.FeedForward.create(lro, X=train.x, y=train.y, ctx=mx.cpu(), num.round=100, array.batch.size=15, learning.rate=0.07, momentum=0.9)` (see the example with linear regression below).
#'
#' * Layer-by-layer definitions translate inputs into outputs. At each level, the network allows for a different number of neurons and alternative activation functions. Other options can be specified by using `mx.symbol`:
#' * `mx.symbol.Convolution` applies convolution to the input and then adds a bias. It can create convolutional neural networks.
#' * `mx.symbol.Deconvolution` does the opposite and can be used in segmentation networks along with `mx.symbol.UpSampling`, e.g., to reconstruct the pixel-wise classification of an image.
#' * `mx.symbol.Pooling` reduces the data by selecting signals with the highest response.
#' * `mx.symbol.Flatten` links convolutional and pooling layers to form a fully connected network.
#' * `mx.symbol.Dropout` attempts to cope with the overfitting problem.
#'
#' The function `mx.mlp()` is a wrapper for quick design of standard multi-layer perceptrons. For more extensive experiments, customized symbolic representations can be explicitly specified using combinations of the above methods.
#'
#'
# Example of the **LeNet** network for recognizing handwritten digits:
data <- mx.symbol.Variable('data')
conv1 <- mx.symbol.Convolution(data=data, kernel=c(5,5), num_filter=20)
tanh1 <- mx.symbol.Activation(data=conv1, act_type="tanh")
pool1 <- mx.symbol.Pooling(data=tanh1, pool_type="max", kernel=c(2,2), stride=c(2,2))
conv2 <- mx.symbol.Convolution(data=pool1, kernel=c(5,5), num_filter=50)
tanh2 <- mx.symbol.Activation(data=conv2, act_type="tanh")
pool2 <- mx.symbol.Pooling(data=tanh2, pool_type="max", kernel=c(2,2), stride=c(2,2))
flatten <- mx.symbol.Flatten(data=pool2)
fc1 <- mx.symbol.FullyConnected(data=flatten, num_hidden=500)
tanh3 <- mx.symbol.Activation(data=fc1, act_type="tanh")
fc2 <- mx.symbol.FullyConnected(data=tanh3, num_hidden=10)
lenet <- mx.symbol.SoftmaxOutput(data=fc2)
# Note: device.cpu is defined below, in the "Compare the training speed" section (device.cpu <- mx.cpu())
model <- mx.model.FeedForward.create(lenet, X=train.x, y=train.y, ctx=device.cpu,
                                     num.round=5, array.batch.size=100,
                                     learning.rate=0.05, momentum=0.9)
#'
#'
#' To allow smooth, fast, and consistent operation on CPU and GPU, the generic R mechanism for controlling the reproducibility of stochastic results is overridden in `mxnet` by `mx.set.seed`. So, use `mx.set.seed()` (rather than `set.seed()`) to control random numbers in **MXNet**.
#'
#' To examine the accuracy of the `model.mx` learner (trained on the training data), we can make predictions (on the testing data) and evaluate the results using the provided testing labels (reporting the confusion matrix).
#'
preds = predict(model.mx, test.x)
pred.label = max.col(t(preds)) - 1
table(pred.label, test.y)
#'
#'
#' For multi-class predictions, `mxnet` outputs an $n\ (classes) \times m\ (examples)$ probability matrix, where each row holds the probability of the corresponding class for each (column-defined) example.
#'
#' # Case-Studies
#'
#' Let's demonstrate regression and prediction deep learning examples using several complementary datasets.
#'
#'
#'
#'
#' ## ALS regression example
#' Let's first demonstrate a deep learning regression using the [ALS data](https://umich.instructure.com/files/1789624/download?download_frd=1) to predict `ALSFRS_slope`.
#'
als <- read.csv("https://umich.instructure.com/files/1789624/download?download_frd=1")
als <- scale(als[, -c(1,7)])
train.ind = sample(1:nrow(als), 0.7*nrow(als))
train.x = data.matrix(als[train.ind, -c(1,7)])
train.y = als[train.ind, 7]
test.x = data.matrix(scale(als[-train.ind, -c(1,7)]))
test.y = als[-train.ind, 7]
# Define the input data
data <- mx.symbol.Variable("data")
# A fully connected hidden layer
# data: input source
# num_hidden: number of neurons in this hidden layer
fc1 <- mx.symbol.FullyConnected(data, num_hidden=1)
# Use linear regression for the output layer
lro <- mx.symbol.LinearRegressionOutput(fc1)
mx.set.seed(1234)
# Create an MXNet Feedforward neural net model with the specified training data
model <- mx.model.FeedForward.create(lro, X=train.x, y=train.y, ctx=mx.cpu(),
                                     num.round=1000, array.batch.size=20,
                                     learning.rate=2e-6, momentum=0.9,
                                     eval.metric=mx.metric.rmse, verbose = F)
#'
#'
#' The option `verbose = F` suppresses messages, including the training accuracy report in each iteration. You may rerun the code with `verbose = T` to examine the rate of decrease of the *train error* against the *number of iterations*.
#'
#' You must `scale` the data before inputting it into `mxnet`, which expects the training and testing sets to be normalized to the **same scale**. There are two strategies to scale the data.
#'
#' * Either scale the complete data simultaneously and then split it into training and testing sets, or
#' * Scale only the training set to enable model-training, but save your data normalization protocol, as new data (testing, validation) will need to be (pre)processed the same way as the training data.
#'
#' Have a look at the [Google TensorFlow playground](http://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.81083&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false). It shows the importance of the *learning rate* and the *number of rounds*. You should test different sets of parameters.
#'
#' * Too small a *learning rate* may lead to long computations.
#' * Too large a *learning rate* may cause the algorithm to fail to converge, as a large step size (learning rate) may by-pass the optimal solution and then oscillate or even diverge.
#'
preds = predict(model, test.x)
sqrt(mean((preds-test.y)^2))
range(test.y)
#'
#'
#' We can see that the *RMSE* on the test set is pretty small. To get a visual representation of the deep learning network, we can also display this relatively simple computation graph:
#'
graph.viz(model$symbol)
#'
#'
#' ## Spirals 2D Data
#' We can again use the `mx.mlp` wrapper to construct the learning network, but we can also use a more flexible way to construct and configure the multi-layer network in `mxnet`. This configuration is done by using the `Symbol` call, which specifies the links among network nodes, the activation function, dropout ratio, and so on.
#'
#' Below we show the configuration of a perceptron with one hidden layer.
#'
########### Network configuration
# variables
act <- mx.symbol.Variable("data")
# affine transformation
fc <- mx.symbol.FullyConnected(act, num.hidden = 10)
# non-linear activation
act <- mx.symbol.Activation(data = fc, act_type = "relu")
# affine transformation
fc <- mx.symbol.FullyConnected(act, num.hidden = 2)
# softmax output and cross-entropy loss
mlp <- mx.symbol.SoftmaxOutput(fc)
#### Preparing the data
set.seed(2235)
############ spirals dataset
s <- sample(x = c("train", "test"), size = 1000, prob = c(.8,.2), replace = TRUE)
dta <- mlbench.spirals(n = 1000, cycles = 1.2, sd = .03)
dta <- cbind(dta[["x"]], as.integer(dta[["classes"]]) - 1)
colnames(dta) <- c("x","y","label")
######### train, validate, test
dta.train <- dta[s == "train",]
dta.test <- dta[s == "test",]
#'
#'
#' Let's display the data and examine its structure.
#'
dt <- as.data.frame(dta); dt[,3] <- as.factor(dt[,3])
dt.train <- dt[s == "train",]
dt.test <- dt[s == "test",]
p1 <- ggplot(dt, aes(x = x, y = y, color=label)) + geom_point() + ggtitle("Whole data structure")
p2 <- ggplot(dt.train, aes(x = x, y = y, color=label)) + geom_point() + ggtitle("Train data structure")
p3 <- ggplot(dt.test, aes(x = x, y = y, color=label)) + geom_point() + ggtitle("Test data structure")
grid.arrange(p1, p2, p3, nrow=3)
#'
#'
#' # Network training
#' Feed-forward networks may be trained using iterative gradient descent algorithms. A **batch** is a subset of the data that is used during a single forward pass of the algorithm. An **epoch** represents one step of the iterative process that is repeated until all training examples are used.
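#'
#' As a small illustration of this terminology (the batch size below is an assumption chosen only for the arithmetic; `mx.model.FeedForward.create` uses its own default when `array.batch.size` is not supplied):
#'
n.train <- sum(s == "train")     # number of spiral training examples
batch.size <- 100                # an assumed batch size, for illustration only
ceiling(n.train / batch.size)    # batches processed in one epoch (one full pass over the training data)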
############ basic spiral-data training
mx.set.seed(2235)
model <- mx.model.FeedForward.create(
  symbol = mlp,
  X = dta.train[, c("x", "y")],
  y = dta.train[, c("label")],
  num.round = 500,
  array.layout = "rowmajor",
  learning.rate = 1,
  eval.metric = mx.metric.accuracy, verbose = F)
preds = predict(model, dta.test[, c(1:2)])
pred.label = max.col(t(preds)) - 1
table(pred.label, dta.test[,3])
#'
#'
#' The prediction result is close to perfect, and we can inspect the results more deeply using `crossval::confusionMatrix`.
#'
library("crossval")
diagnosticErrors(crossval::confusionMatrix(pred.label, dta.test[,3], negative = 0))
#'
#'
ggplot(data.frame(dta.test[,3], preds[2,]),
       aes(x=preds[2,], group=dta.test[,3], fill=as.factor(dta.test[,3]))) +
  geom_histogram(position="dodge", binwidth=0.25) + theme_bw()
#'
#'
#' Once we fit a model (like the binary label classification below), we can:
#'
#' * Visually inspect the quality of the ML classification.
#' * Display the structure of the labeled test-data objects.
#'
# Define a custom call-back, which stops the training process when the progress
# in accuracy falls below a certain level of tolerance. The call is made after
# every epoch to check the convergence status of the algorithm.
mx.callback.train.stop <- function(tol = 1e-3, mean.n = 1e2, period = 100, min.iter = 100) {
  function(iteration, nbatch, env, verbose = TRUE) {
    if (nbatch == 0 & !is.null(env$metric)) {
      continue <- TRUE
      acc.train <- env$metric$get(env$train.metric)$value
      if (is.null(env$acc.log)) {
        env$acc.log <- acc.train
      } else {
        if ((abs(acc.train - mean(tail(env$acc.log, mean.n))) < tol &
             abs(acc.train - max(env$acc.log)) < tol &
             iteration > min.iter) | acc.train == 1) {
          cat("Training finished with final accuracy: ",
              round(acc.train * 100, 2), " %\n", sep = "")
          continue <- FALSE
        }
        env$acc.log <- c(env$acc.log, acc.train)
      }
    }
    if (iteration %% period == 0) {
      cat("[", iteration, "]", " training accuracy: ",
          round(acc.train * 100, 2), " %\n", sep = "")
    }
    return(continue)
  }
}
###### training with custom stopping
mx.set.seed(2235)
model <- mx.model.FeedForward.create(
  symbol = mlp,
  X = dta.train[, c("x", "y")],
  y = dta.train[, c("label")],
  num.round = 2000,
  array.layout = "rowmajor",
  learning.rate = 1,
  epoch.end.callback = mx.callback.train.stop(),
  eval.metric = mx.metric.accuracy,
  verbose = FALSE
)
labeled_spiral_data <- as.data.frame(cbind(dta.test[, c("x", "y")], as.factor(pred.label)))
colnames(labeled_spiral_data) <- c("x", "y", "label")
labeled_spiral_data$label <- as.factor(labeled_spiral_data$label)
p4 <- ggplot(labeled_spiral_data, aes(x = x, y = y, color=label)) + geom_point() +
  ggtitle("Structure of Predicted-Labels on Test Data")
p4
#'
#'
#' ## IBS Study
#' Let's try another example using the [IBS NI Study data](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_April2011_NI_IBS_Pain).
#'
#'
# IBS NI Data
# UCLA Data
library(rvest)   # needed for read_html()/html_nodes()/html_table()
wiki_url <- read_html("http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_April2011_NI_IBS_Pain")
IBSData <- html_table(html_nodes(wiki_url, "table")[[2]])   # table 2
set.seed(1234)
test.ind = sample(1:354, 50, replace = F)   # select 50/354 cases for testing, train on the remaining (354-50)/354 cases
# UMich Data (includes MISSING data): use `mice` (predictive mean matching) to impute the missing values:
# newData <- mice(data, m=5, maxit=50, meth='pmm', seed=500); summary(newData)
# wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_April2011_NI_IBS_Pain")
# IBSData <- html_table(html_nodes(wiki_url, "table")[[1]])   # load Table 1
# set.seed(1234)
# test.ind = sample(1:337, 50, replace = F)   # select 50/337 cases for testing, train on the remaining (337-50)/337 cases
# summary(IBSData); IBSData[IBSData=="."] <- NA
# newData <- mice(IBSData, m=5, maxit=50, meth='pmm', seed=500); summary(newData)
html_nodes(wiki_url, "#content")
# View(IBSData); dim(IBSData): Select an outcome response: "DX" (column 3), "FS_IQ" (column 5)
# scale/normalize all input variables
IBSData <- na.omit(IBSData)
IBSData[, 4:66] <- scale(IBSData[, 4:66])   # scale the entire dataset
train.x = data.matrix(IBSData[-test.ind, c(4:66)])   # exclude the outcome
train.y = IBSData[-test.ind, 3] - 1
test.x = data.matrix(IBSData[test.ind, c(4:66)])
test.y = IBSData[test.ind, 3] - 1
# View(data.frame(train.x, train.y))
# View(data.frame(test.x, test.y))
# table(test.y); table(train.y)
# num.round - number of iterations to train the model
act <- mx.symbol.Variable("data")
fc <- mx.symbol.FullyConnected(act, num.hidden = 10)
act <- mx.symbol.Activation(data = fc, act_type = "sigmoid")
fc <- mx.symbol.FullyConnected(act, num.hidden = 2)
mlp <- mx.symbol.SoftmaxOutput(fc)
mx.set.seed(2235)
model <- mx.model.FeedForward.create(
  symbol = mlp,
  array.batch.size = 20,
  X = train.x, y = train.y,
  num.round = 200,
  array.layout = "rowmajor",
  learning.rate = exp(-1),
  eval.metric = mx.metric.accuracy, verbose=FALSE)
preds = predict(model, test.x)
pred.label = max.col(t(preds)) - 1
table(pred.label, test.y)
library("crossval")
diagnosticErrors(crossval::confusionMatrix(pred.label, test.y, negative = 0))
#'
#'
ggplot(data.frame(test.y, preds[2,]), aes(x=preds[2,], group=test.y, fill=as.factor(test.y))) +
  geom_histogram(position="dodge", binwidth=0.25) + theme_bw()
# convert the predicted probabilities to binary classes, using the threshold 0.3
bin_preds <- ifelse(preds[2,] < 0.3, 0, 1)
# get a factor variable comparing binary test-labels vs. predicted labels
label_match <- as.factor(ifelse(test.y==bin_preds, 0, 1))
p5 <- ggplot(data.frame(test.y, preds[2,]), aes(x = test.y, y = preds[2,], color=label_match)) +
  geom_point() + ggtitle("Match between Test Data Labels and Predicted Labels")
p5
#'
#'
#' This histogram plot suggests that the classification is not good.
#'
#' ## Country QoL Ranking Data
#'
#' Another case-study we have seen before is the [country quality of life (QoL) dataset](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_2008_World_CountriesRankings). Let's try to estimate a network model and use it to predict the overall QoL.
#'
#'
wiki_url <- read_html("http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_2008_World_CountriesRankings")
html_nodes(wiki_url, "#content")
CountryRankingData <- html_table(html_nodes(wiki_url, "table")[[2]])
# View(CountryRankingData); dim(CountryRankingData): Select an appropriate
# outcome "OA": Overall country ranking (column 13)
# Dichotomize the outcome: top countries OA<50, bottom countries OA>=50
set.seed(1234)
test.ind = sample(1:100, 30, replace = F)   # select 30/100 cases for testing, train on the remaining 70/100 cases
CountryRankingData[, c(8:12,14)] <- scale(CountryRankingData[, c(8:12,14)])   # scale/normalize all input variables
train.x = data.matrix(CountryRankingData[-test.ind, c(8:12,14)])   # exclude the outcome
train.y = ifelse(CountryRankingData[-test.ind, 13] < 50, 1, 0)
test.x = data.matrix(CountryRankingData[test.ind, c(8:12,14)])
test.y = ifelse(CountryRankingData[test.ind, 13] < 50, 1, 0)   # developed (high OA rank) country
# View(data.frame(train.x, train.y)); View(data.frame(test.x, test.y))
# View(data.frame(CountryRankingData, ifelse(CountryRankingData[,13] < 20, 1, 0)))
act <- mx.symbol.Variable("data")
fc <- mx.symbol.FullyConnected(act, num.hidden = 10)
act <- mx.symbol.Activation(data = fc, act_type = "sigmoid")
fc <- mx.symbol.FullyConnected(act, num.hidden = 2)
mlp <- mx.symbol.SoftmaxOutput(fc)
mx.set.seed(2235)
model <- mx.model.FeedForward.create(
  symbol = mlp,
  array.batch.size = 10,
  X = train.x, y = train.y,
  num.round = 15,
  array.layout = "rowmajor",
  learning.rate = exp(-1),
  eval.metric = mx.metric.accuracy)
preds = predict(model, test.x); preds
pred.label = max.col(t(preds)) - 1
table(pred.label, test.y)
#'
#' We only need 15 rounds to achieve 97% accuracy.
#'
ggplot(data.frame(test.y, preds[2,]), aes(x=preds[2,], group=test.y, fill=as.factor(test.y))) +
  geom_histogram(position="dodge", binwidth=0.25) + theme_bw()
# calculate sensitivity & specificity and more
library("crossval")
diagnosticErrors(crossval::confusionMatrix(pred.label, test.y, negative = 0))
# convert the predicted probabilities to binary classes, using the threshold 0.5
bin_preds <- ifelse(preds[2,] < 0.5, 0, 1)
# get a factor variable comparing binary test-labels vs. predicted labels
label_match <- as.factor(ifelse(test.y==bin_preds, 0, 1))
p6 <- ggplot(data.frame(test.y, preds[2,]), aes(x = test.y, y = preds[2,], color=label_match)) +
  geom_point() + ggtitle("Match between Test Data Labels and Predicted Labels")
p6
#'
#'
#' ## Handwritten Digits Classification
#'
#' In [Chapter 10 (ML, NN, SVM Classification)](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/10_ML_NN_SVM_Class.html) we discussed Optical Character Recognition (OCR). Specifically, we analyzed handwritten notes (unstructured text) and converted them to printed text.
#'
#' [MNIST](http://yann.lecun.com/exdb/mnist/) includes a large set of human-annotated/labeled handwritten digit images. Every digit is represented by a $28\times 28$ thumbnail image. You can download the training and testing data from [Kaggle](https://www.kaggle.com/c/digit-recognizer/data).
#'
#' The `train.csv` and `test.csv` data files contain gray-scale images of hand-drawn digits, $0, 1, 2, \ldots, 9$. Each 2D image is $28\times 28$ in size, and each of the $784$ pixels has a single pixel-intensity representing the lightness or darkness of that pixel (stored as a 1-byte integer in $[0,255]$). Higher intensities correspond to darker pixels.
#'
#' The training data, `train.csv`, has 785 columns, where the first column, **label**, codes the actual digit drawn by the user.
#' The remaining $784$ columns contain the $28\times 28=784$ pixel-intensities of the associated 2D image. The columns in the training set have $pixel_K$ names, where $0\leq K\leq 783$. To reconstruct a 2D image out of each row in the training data, we use this relation between the pixel-index ($K$) and the $X,Y$ image coordinates:
#' $$K = Y \times 28 + X,$$
#' where $0\leq X, Y\leq 27$. Thus, $pixel_K$ is located on row $Y$ and column $X$ of the corresponding 2D image of size $28\times 28$. For instance, $pixel_{60=(2\times 28 + 4)} \longleftrightarrow (X=4,Y=2)$ represents the pixel on the 3-rd row and 5-th column in the image. Diagrammatically, omitting the "pixel" prefix, the pixels may be ordered to reconstruct the 2D image as follows:
#'
#' Row | Col0 | Col1 | Col2 | Col3 | **Col4** | ... | Col26 | Col27
#' -----|------|------|------|------|------|------|------|-----
#' Row0 | 000 | 001 | 002 | 003 | 004 | ... | 026 | 027
#' Row1 | 028 | 029 | 030 | 031 | 032 | ... | 054 | 055
#' **Row2** | 056 | 057 | 058 | 059 | **060** | ... | 082 | 083
#' RowK | ... | ... | ... | ... | ... | ... | ... | ...
#' Row26 | 728 | 729 | 730 | 731 | 732 | ... | 754 | 755
#' Row27 | 756 | 757 | 758 | 759 | 760 | ... | 782 | 783
#'
#'
#' Note that the point-to-pixelID transformation ($K = Y \times 28 + X$) may easily be inverted as a pixelID-to-point mapping: $X = K\ mod\ 28$ (the remainder of the integer division $K/28$; in R, `K %% 28`) and $Y = K\ div\ 28$ (the integer part of the division $K/28$; in R, `K %/% 28`). For example:
#'
K <- 60
X <- K %% 28    # X = K mod 28, the remainder of the integer division 60/28
Y <- K %/% 28   # the integer part of the division
# This validates that the application of both the back and forth transformations leads to an identity
K; X; Y; Y * 28 + X
#'
#'
#' The `test data` (test.csv) has the same organization as the training data, except that it does not contain the first **label** column. It includes 28,000 images, and we can predict image labels that can be stored as $(ImageId, Label)$ pairs, which can be visually compared to the 2D images for validation/inspection.
#'
require(mxnet)
# train.csv
pathToZip <- tempfile()
download.file("http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/data/DigitRecognizer_TrainingData.zip", pathToZip)
train <- read.csv(unzip(pathToZip))
dim(train)
unlink(pathToZip)
# test.csv
pathToZip <- tempfile()
download.file("http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/data/DigitRecognizer_TestingData.zip", pathToZip)
test <- read.csv(unzip(pathToZip))
dim(test)
unlink(pathToZip)
train <- data.matrix(train)
test <- data.matrix(test)
train.x <- train[,-1]
train.y <- train[,1]
# Scaling will be discussed below
train.x <- t(train.x/255)
test <- t(test/255)
#'
#'
#' Let's look at some of these example images:
#'
library("imager")
# first convert the CSV data (one row per image, 28,000 rows) to a 3D image array
array_3D <- array(test, c(28, 28, 28000))
mat_2D <- matrix(array_3D[,,1], nrow = 28, ncol = 28)
plot(as.cimg(mat_2D))
# extract all N=28,000 images
N <- 28000
img_3D <- as.cimg(array_3D[,,], 28, 28, N)
# plot the k-th image (1<=k<=N)
k <- 5
plot(img_3D, k)
image_2D <- function(img, index) {
  img[,,index,, drop=FALSE]
}
plot(image_2D(img_3D, 1))
# Plot a collage of the first 4 images
imappend(list(image_2D(img_3D, 1), image_2D(img_3D, 2),
              image_2D(img_3D, 3), image_2D(img_3D, 4)), "y") %>% plot
# img <- image_2D(img_3D, 1)
# for (i in 10:20) { imappend(list(img, image_2D(img_3D, i)), "x") }
#'
#'
#' In these CSV data files, each $28\times 28$ image is represented as a single row.
#' Greyscale pixel intensities are stored as $1$-byte values in the range $[0, 255]$, which we linearly transformed into $[0,1]$. Note that we only scale the $X$ input, not the output (labels). Also, we don't have manual gold-standard validation labels for the testing data, i.e., `test.y` is not available for the handwritten digits data.
#'
# We already scaled earlier:
# train.x <- t(train.x/255)
# test <- t(test/255)
#'
#'
#' Next, we transpose the input matrix to $n\ (pixels) \times m\ (examples)$, as this column-major format is required by `mxnet`. The image labels are evenly distributed:
#'
table(train.y); prop.table(table(train.y))
#'
#'
#' The majority class (`1`) in the training set includes 11.2% of the observations.
#'
#' ### Configuring the Neural Network
#'
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)
act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64)
act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=10)
softmax <- mx.symbol.SoftmaxOutput(fc3, name="sm")
#'
#'
#' **data <- mx.symbol.Variable("data")** represents the input layer. The first hidden layer, set by **fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)**, takes the data as its input, a name, and the number of hidden neurons, and generates an output layer.
#'
#' **act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")** sets the activation function, which takes the output from the first hidden layer "fc1" and generates an output that is fed into the second hidden layer "fc2", which uses fewer hidden neurons (64).
#'
#' The process repeats with the second activation "act2", resembling "act1" but with a different input source and name. As there are only 10 digits ($0, 1, ..., 9$), in the last layer "fc3" we set the number of neurons to 10. At the end, we set the activation to `softmax` to obtain a probabilistic prediction.
#'
#' ### Training
#'
#' We are almost ready for the training process. Before we start the computation, let's decide which device we should use.
#'
devices <- mx.cpu()
#'
#'
#' Here we assign the CPU to `mxnet`. After all this preparation, you can run the following command to train the neural network. Note again that in `mxnet` the correct function to control the random process is `mx.set.seed`.
#'
mx.set.seed(1234)
model <- mx.model.FeedForward.create(softmax, X=train.x, y=train.y,
                                     ctx=devices, num.round=10,
                                     array.batch.size=100,
                                     learning.rate=0.07, momentum=0.9,
                                     eval.metric=mx.metric.accuracy,
                                     initializer=mx.init.uniform(0.07),
                                     epoch.end.callback=mx.callback.log.train.metric(100))
#'
#'
#' After 10 rounds, the training accuracy already exceeds 99%; it may not be worthwhile trying 100 rounds, as this would substantially increase the computational complexity.
#'
#' ### Forecasting
#'
#' Next, we generate forecasts using the model on the testing data and evaluate the prediction performance. The `preds` matrix has $10$ rows (one per class) and $28,000$ columns (one per testing image), containing the desired classification probabilities from the `output layer` of the neural net.
To extract the most likely label for each testing case, we can use `max.col`:
#' 
# evaluate: "preds" is the matrix of probabilities for each of the 10 digits
preds <- predict(model, test)
pred.label <- max.col(t(preds)) - 1
table(pred.label)

# For binary problems, one could instead dichotomize the probabilities and,
# given gold-standard labels (test.y), compute sensitivity and specificity:
# preds1 <- ifelse(preds[2,] <= 0.5, 0, 1)   # dichotomize to labels
# pred.label <- t(preds1)
# table(pred.label, test.y)
# sensitivity(factor(preds1), factor(as.numeric(test.y)), positive = 1)
# specificity(factor(preds1), factor(as.numeric(test.y)), negative = 0)
#' 
#' For binary classification, `mxnet` outputs two prediction classes, whereas for multi-class prediction it outputs an $n\ (classes) \times m\ (examples)$ matrix, where each *row* corresponds to a class and each *column* to an example, so all column sums add up to $1.0$.
#' 
#' Thus, the transposed predictions, `t(preds)`, form a $28,000\ (rows) \times 10\ (columns)$ matrix of classification probabilities from the output layer, and the R `max.col` function extracts the most likely label for each row.
#' 
pred.label <- max.col(t(preds)) - 1
table(pred.label)
#' 
#' We can save the predicted labels of the testing handwritten digits to CSV:
#' 
predicted_lables <- data.frame(ImageId=1:ncol(test), Label=pred.label)
write.csv(predicted_lables, file='predicted_lables.csv', row.names=FALSE, quote=FALSE)
#' 
#' We can open the `predicted_lables.csv` file and inspect the ML-labels (saved in the 2-column `ImageId`/`Label` CSV format) assigned to the $28,000$ manually drawn digits. As the **testing handwritten digits data** do not have human-provided labels, we can't quantitatively assess the validity of the algorithm on the testing data. However, we can visually inspect random handwritten digit instances ($7$ in the example below, image indices $4:10$) against their predictions and gain intuition about the accuracy of the ML classifier.
#' 
table(train.y)
table(predicted_lables[,2])
# Plot the relative frequencies of the train.y labels (in the range 0-9)
# against the predicted labels of the testing data.
# These are not directly related (training.y vs. predicted testing labels)!
# Remember - we don't have gold-standard test.y labels!
# Generally speaking, points closer to the diagonal suggest expected good classification,
# whereas off-diagonal points may suggest less effective labeling.
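# (A hedged aside, assuming the objects computed above: the same train-vs-predicted
#  comparison can be summarized numerically by juxtaposing the two relative-frequency
#  distributions of the 10 digit labels.)
round(rbind(train = prop.table(table(train.y)),
            predicted = prop.table(table(predicted_lables[,2]))), 3)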
label.names <- c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9")
plot(ftable(train.y)[c(1:10)], ftable(predicted_lables[,2])[c(1:10)])
text(ftable(train.y)[c(1:10)]+20, ftable(predicted_lables[,2])[c(1:10)], labels=label.names, cex=1.2)

# For example, the ML-classification labels assigned to the first 7 images
# (from the 28,000 testing data collection) are:
head(predicted_lables, n = 7L)
library(knitr)
kable(head(predicted_lables, n = 7L), format = "markdown")

# initialize a list of m=7 images from the N=28,000 available images
m_start <- 4
m_end <- 10
if (m_end <= m_start) { m_end = m_start+1 }   # check that m_end > m_start
label_Ypositons <- vector()   # initialize the array of label positions on the plot
for (i in m_start:m_end) {
  if (i==m_start) { img1 <- image_2D(img_3D, m_start) }
  else img1 <- imappend(list(img1, image_2D(img_3D, i)), "y")
  label.names[i+1-m_start] <- predicted_lables[i,2]
  label_Ypositons[i+1-m_start] <- 15 + 28*(i-m_start)
}
plot(img1, axes=FALSE)
text(40, label_Ypositons, labels=label.names[1:(m_end+1-m_start)], cex=1.2, col="blue")
mtext(paste((m_end+1-m_start), " Random Images \n Indices (m_start=", m_start, " : m_end=", m_end, ")"), side=2, line=-6, col="black")
mtext("ML Classification Labels", side=4, line=-5, col="blue")
#' 
table(ftable(train.y)[c(1:10)], ftable(predicted_lables[,2])[c(1:10)])
#' 
#' ### Examining the Network Structure using `LeNet`
#' 
#' We can use the `LeNet` [convolutional neural network (CNN) protocol](http://yann.lecun.com/exdb/lenet/), available in the `mxnet` package, for learning the network.
#' 
#' #### Construct the network:
#' 
# input
data <- mx.symbol.Variable('data')
# first conv
conv1 <- mx.symbol.Convolution(data=data, kernel=c(5,5), num_filter=20)
tanh1 <- mx.symbol.Activation(data=conv1, act_type="tanh")
pool1 <- mx.symbol.Pooling(data=tanh1, pool_type="max", kernel=c(2,2), stride=c(2,2))
# second conv
conv2 <- mx.symbol.Convolution(data=pool1, kernel=c(5,5), num_filter=50)
tanh2 <- mx.symbol.Activation(data=conv2, act_type="tanh")
pool2 <- mx.symbol.Pooling(data=tanh2, pool_type="max", kernel=c(2,2), stride=c(2,2))
# first fullc
flatten <- mx.symbol.Flatten(data=pool2)
fc1 <- mx.symbol.FullyConnected(data=flatten, num_hidden=500)
tanh3 <- mx.symbol.Activation(data=fc1, act_type="tanh")
# second fullc
fc2 <- mx.symbol.FullyConnected(data=tanh3, num_hidden=10)
# loss
lenet <- mx.symbol.SoftmaxOutput(data=fc2)
#' 
#' #### Reshape the matrices into arrays:
#' 
train.array <- train.x
dim(train.array) <- c(28, 28, 1, ncol(train.x))
test.array <- test
dim(test.array) <- c(28, 28, 1, ncol(test))
#' 
#' #### Compare the training speed on different devices
#' 
#' Start by defining the devices:
#' 
n.gpu <- 1
device.cpu <- mx.cpu()
device.gpu <- lapply(0:(n.gpu-1), function(i) { mx.gpu(i) })
#' 
#' Passing a list of devices is useful for high-end computational platforms (e.g., multi-GPU systems); `mxnet` can train on multiple GPUs or CPUs.
#' 
#' To train using the CPU, try fewer iterations, as this protocol is computationally very intensive.
#' 
mx.set.seed(1234)
tic <- proc.time()
model <- mx.model.FeedForward.create(lenet, X=train.array, y=train.y,
              ctx=device.cpu, num.round=1, array.batch.size=100,
              learning.rate=0.05, momentum=0.9, wd=0.00001,
              eval.metric=mx.metric.accuracy,
              epoch.end.callback=mx.callback.log.train.metric(100))
print(proc.time() - tic)
#' 
#' The corresponding training on GPU is similar, but it requires a separate GPU-enabled compilation of `mxnet` (/mxnet/src/storage/storage.cc:78) with `USE_CUDA=1`.
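#' 
#' Before launching the longer training runs, it may help to verify the tensor shapes flowing through the `LeNet` symbol. Here is a minimal sketch; we assume `mx.symbol.infer.shape` from the `mxnet` R API returns the argument and output shapes (the batch size 100 matches `array.batch.size` used in the training calls):
#' 
# infer the shapes of all layer parameters and outputs for a batch of 100 images
shapes <- mx.symbol.infer.shape(lenet, data = c(28, 28, 1, 100))
shapes$arg.shapes   # e.g., convolution kernels and fully-connected weight matrices
shapes$out.shapes   # the softmax output: 10 class-probabilities per each of the 100 batch cases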
#' 
mx.set.seed(1234)
tic <- proc.time()
model <- mx.model.FeedForward.create(lenet, X=train.array, y=train.y,
              ctx=device.gpu, num.round=5, array.batch.size=100,
              learning.rate=0.05, momentum=0.9, wd=0.00001,
              eval.metric=mx.metric.accuracy,
              epoch.end.callback=mx.callback.log.train.metric(100))
print(proc.time() - tic)
#' 
#' GPU training is typically much faster than CPU training. Everyone can [submit a new classification result to Kaggle](https://www.kaggle.com/c/digit-recognizer#evaluation) and see a ranking result for their classifier. Make sure you follow the specific result-file submission format.
#' 
preds <- predict(model, test.array)
pred.label <- max.col(t(preds)) - 1
submission <- data.frame(ImageId=1:ncol(test), Label=pred.label)
write.csv(submission, file='submission.csv', row.names=FALSE, quote=FALSE)
#' 
#' # Classifying Real-World Images
#' 
#' A real-world example of deep learning is the classification of 2D images (pictures) or 3D volumes (e.g., neuroimages).
#' 
#' This example shows the use of a pre-trained **Inception-BatchNorm Network** to predict the class of a real-world image. The [network architecture is described in this paper](https://arxiv.org/abs/1502.03167). The [pre-trained Inception-BatchNorm network is available here](http://webdocs.cs.ualberta.ca/~bx3/data/Inception.zip) and [here](https://github.com/dmlc/mxnet/blob/master/R-package/vignettes/classifyRealImageWithPretrainedModel.Rmd). This advanced model gives state-of-the-art prediction accuracy on imaging data. We also need the `imager` package to load and preprocess the images in R.
#' 
# install.packages("imager")
require(mxnet)
require(imager)
#' 
#' ## Load the Pre-trained Model
#' 
#' [Download and unzip the pre-trained model](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/data/Inception.zip) to a working folder and load the model and the mean image (used for preprocessing, loaded via `mx.nd.load`) into R. This download can either be done manually or automated, as shown below.
#' 
pathToZip <- tempfile()
download.file("http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/data/Inception.zip", pathToZip)
model_file <- unzip(pathToZip)
# setwd(paste(getwd(), "results", sep='/'))
model <- mx.model.load(paste(getwd(), "Inception_BN", sep='/'), iteration=39)
mean.img <- as.array(mx.nd.load(paste(getwd(), "mean_224.nd", sep='/'))[["mean_img"]])
dim(mean.img)
# plot(mean.img)
#' 
#' ## Load and Preprocess a New Image
#' 
#' To classify a new image, select an image and load it in. Below, we show the classification of several alternative images.
#' 
library("imager")
# One should be able to load an image directly from the web, but sometimes this
# fails, in which case we first download the image and then load it in R:
# im <- imager::load.image("http://wiki.socr.umich.edu/images/6/69/DataManagementFig1.png")

# download the file to the local working directory; use "wb" mode to avoid problems
download.file("http://wiki.socr.umich.edu/images/6/69/DataManagementFig1.png", paste(getwd(), "results/image.png", sep="/"), mode = 'wb')
# report the downloaded image path
paste(getwd(), "results/image.png", sep="/")
img <- load.image(paste(getwd(), "results/image.png", sep="/"))
dim(img)
plot(img)
#' 
#' Before feeding the image to the deep learning network for classification, we need some preprocessing to make it fit the network's input requirements. This image preprocessing (cropping and subtraction of the mean image) can be done directly in R.
#' 
preproc.image <- function(im, mean.image) {
  # crop the image to a square with side length equal to the shorter edge
  shape <- dim(im)
  short.edge <- min(shape[1:2])
  xx <- floor((shape[1] - short.edge) / 2)
  yy <- floor((shape[2] - short.edge) / 2)
  cropped <- crop.borders(im, xx, yy)
  # resize to 224 x 224, as required by the model input
  resized <- resize(cropped, 224, 224)
  plot(resized)
  # convert to array (x, y, channel) and rescale intensities back to [0, 255]
  arr <- as.array(resized[,,,c(1:3)]) * 255
  plot(as.cimg(arr))
  dim(arr) <- c(224, 224, 3)
  # subtract the mean image (note: use the function argument, not a global object)
  normed <- arr - mean.image
  # reshape to the format needed by mxnet (width, height, channel, num)
  dim(normed) <- c(224, 224, 3, 1)
  return(normed)
}
#' 
#' Call the preprocessing function to normalize the image:
#' 
normed <- preproc.image(img, mean.img)
# plot(normed)
#' 
#' ## Image Classification
#' 
#' Use the `predict` function to get the probabilities over all (learned) classes.
#' 
prob <- predict(model, X=normed)
dim(prob)
#' 
#' The `prob` prediction generates a $1000 \times 1$ array representing the probabilities that the input image resembles (is classified as) each of the 1,000 known image categories. We can report the indices of the top-10 image classes closest to the input image:
#' 
max.idx <- order(prob[,1], decreasing = TRUE)[1:10]
max.idx
#' 
#' Alternatively, we can map these top-10 indices into named `image-classes`.
#' 
synsets <- readLines("synset.txt")
print(paste0("Top Predicted Image-Label Classes: Name=", synsets[max.idx], "; Probability: ", prob[max.idx]))
#' 
#' Clearly, this US weather pattern image is not well classified. The top prediction suggests it may be a *theater curtain*, however, the confidence is very low, $Prob\sim 0.049$. None of the other top-10 classes capture the type of the actual image either.
#' 
#' ## Additional Image Classification Examples
#' 
#' The machine learning image classification results won't always be this poor. Let's try classifying several alternative images.
#' 
#' ### Lake Mapourika, New Zealand
#' 
#' Let's try the automated image classification of this lakeside panorama.
#' 
download.file("https://upload.wikimedia.org/wikipedia/commons/2/23/Lake_mapourika_NZ.jpeg", paste(getwd(), "results/image.png", sep="/"), mode = 'wb')
im <- load.image(paste(getwd(), "results/image.png", sep="/"))
plot(im)
normed <- preproc.image(im, mean.img)
prob <- predict(model, X=normed)
max.idx <- order(prob[,1], decreasing = TRUE)[1:10]
print(paste0("Top Predicted Image-Label Classes: Name=", synsets[max.idx], "; Probability: ", prob[max.idx]))
#' 
#' This photo does represent a lakeside, which is reflected by the top three class labels:
#' 
#' * breakwater, groin, groyne, mole, bulwark, seawall, jetty.
#' * dock, dockage, docking facility.
#' * lakeside, lakeshore.
#' 
#' ### Beach
#' 
#' Another coastal boundary between water and land is represented in this [beach image](https://upload.wikimedia.org/wikipedia/commons/9/90/Holloways_beach_1920x1080.jpg).
#' 
download.file("https://upload.wikimedia.org/wikipedia/commons/9/90/Holloways_beach_1920x1080.jpg", paste(getwd(), "results/image.png", sep="/"), mode = 'wb')
im <- load.image(paste(getwd(), "results/image.png", sep="/"))
plot(im)
normed <- preproc.image(im, mean.img)
prob <- predict(model, X=normed)
max.idx <- order(prob[,1], decreasing = TRUE)[1:10]
print(paste0("Top Predicted Image-Label Classes: Name=", synsets[max.idx], "; Probability: ", prob[max.idx]))
#' 
#' This photo was classified appropriately and with high confidence as:
#' 
#' * sandbar, sand bar.
#' * lakeside, lakeshore.
#' * seashore, coast, seacoast, sea-coast.
#' 
#' ### Volcano
#' 
#' Here is another natural image representing the [Mount St. Helens volcano](https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/MSH82_st_helens_plume_from_harrys_ridge_05-19-82.jpg/1200px-MSH82_st_helens_plume_from_harrys_ridge_05-19-82.jpg).
#' 
download.file("https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/MSH82_st_helens_plume_from_harrys_ridge_05-19-82.jpg/1200px-MSH82_st_helens_plume_from_harrys_ridge_05-19-82.jpg", paste(getwd(), "results/image.png", sep="/"), mode = 'wb')
im <- load.image(paste(getwd(), "results/image.png", sep="/"))
plot(im)
normed <- preproc.image(im, mean.img)
prob <- predict(model, X=normed)
max.idx <- order(prob[,1], decreasing = TRUE)[1:10]
print(paste0("Top Predicted Image-Label Classes: Name=", synsets[max.idx], "; Probability: ", prob[max.idx]))
#' 
#' The predicted top class labels for this image are perfect:
#' 
#' * volcano.
#' * geyser.
#' * alp.
#' 
#' ### Brain Surface
#' 
#' The next image represents a 2D snapshot of a 3D shape reconstruction of a brain cortical surface. This image is particularly difficult to classify automatically because (1) few people have ever seen a real brain, (2) the mathematical and computational models used to obtain the 2D manifold representing the brain surface vary, and (3) the patterns of sulcal folds and gyral crests are quite inconsistent between people.
#' 
download.file("http://wiki.socr.umich.edu/images/e/ea/BrainCortex2.png", paste(getwd(), "results/image.png", sep="/"), mode = 'wb')
im <- load.image(paste(getwd(), "results/image.png", sep="/"))
plot(im)
normed <- preproc.image(im, mean.img)
prob <- predict(model, X=normed)
max.idx <- order(prob[,1], decreasing = TRUE)[1:10]
print(paste0("Top Predicted Image-Label Classes: Name=", synsets[max.idx], "; Probability: ", prob[max.idx]))
#' 
#' The top class labels for the brain image are:
#' 
#' * brain coral
#' * mushroom
#' * hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa
#' * jigsaw puzzle
#' 
#' Imagine if we could train a brain image classifier that labels individuals (volunteers or patients) solely based on their brain scans into different classes reflecting their development, clinical phenotypes, disease states, or aging profiles. This would require a substantial amount of expert-labeled brain scans, model training, and extensive validation. However, any progress in this direction would lead to effective computational clinical decision support systems that can assist physicians with diagnosis, tracking, and prognostication of brain growth and aging in health and disease.
#' 
#' ### Face Mask
#' 
#' The last example is a synthetic computer-generated image representing a cartoon face, or a mask.
#' 
download.file("http://wiki.socr.umich.edu/images/f/fb/FaceMask1.png", paste(getwd(), "results/image.png", sep="/"), mode = 'wb')
im <- load.image(paste(getwd(), "results/image.png", sep="/"))
plot(im)
normed <- preproc.image(im, mean.img)
prob <- predict(model, X=normed)
max.idx <- order(prob[,1], decreasing = TRUE)[1:10]
print(paste0("Top Predicted Image-Label Classes: Name=", synsets[max.idx], "; Probability: ", prob[max.idx]))
#' 
#' The top class labels for the face mask are:
#' 
#' * mask
#' * ski mask
#' * analog clock
#' 
#' You can easily test the same image classifier on your own images and identify classes of pictures that are either well or poorly classified by the deep-learning-based model.
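#' 
#' Since each of the examples above repeats the same download-preprocess-predict sequence, the steps can be wrapped in a small convenience function. This is a minimal sketch, assuming the `model`, `mean.img`, `synsets`, and `preproc.image` objects defined above (the function name `classify_image_url` is ours, not part of `mxnet`):
#' 
classify_image_url <- function(url, model, mean.img, synsets, top = 10) {
  dest <- paste(getwd(), "results/image.png", sep = "/")
  download.file(url, dest, mode = 'wb')       # fetch the image
  im <- load.image(dest)
  plot(im)
  normed <- preproc.image(im, mean.img)       # crop, resize, mean-subtract
  prob <- predict(model, X = normed)
  max.idx <- order(prob[, 1], decreasing = TRUE)[1:top]
  data.frame(Class = synsets[max.idx], Probability = prob[max.idx])
}
# e.g., re-run the volcano example with a single call:
# classify_image_url("https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/MSH82_st_helens_plume_from_harrys_ridge_05-19-82.jpg/1200px-MSH82_st_helens_plume_from_harrys_ridge_05-19-82.jpg", model, mean.img, synsets)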
#' 
#' # References
#' 
#' * [MXNet R Tutorial](http://mxnet.io/tutorials/r/fiveMinutesNeuralNetwork.html)
#' * [Deep Learning with MXNetR](https://www.r-bloggers.com/deep-learning-with-mxnetr/)
#' * [Deep Neural Networks](https://github.com/ledell/useR-machine-learning-tutorial/blob/master/deep-neural-networks.Rmd)
#' * [Google's TensorFlow Playground](http://playground.tensorflow.org)