---
title: "Data Science and Predictive Analytics (UMich HS650)"
subtitle: "Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines"
author: "SOCR/MIDAS (Ivo Dinov)"
date: "`r format(Sys.time(), '%B %Y')`"
tags: [DSPA, SOCR, MIDAS, Big Data, Predictive Analytics]
output:
  html_document:
    theme: spacelab
    highlight: tango
    includes:
      before_body: SOCR_header.html
    toc: true
    number_sections: true
    toc_depth: 2
    toc_float:
      collapsed: false
      smooth_scroll: true
---

In this chapter, we cover two very powerful machine-learning algorithms. These techniques have complex mathematical formulations; however, efficient algorithms and reliable software packages have been developed to apply them to many practical problems. We will (1) describe neural networks as analogues of biological neurons, (2) develop, hands-on, a neural net that can be trained to compute the square-root function, (3) describe support vector machine (SVM) classification, and (4) complete several case studies, including optical character recognition (OCR), the Iris flowers, Google Trends and the stock market, and quality of life in chronic disease.

Later, in [Chapter 22](http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/22_DeepLearning.html), we will provide more details and additional examples of deep neural network learning. For now, let's start by exploring the *magic* inside the machine-learning black box.

# Understanding Neural Networks

## From biological to artificial neurons

An Artificial Neural Network (ANN) model mimics the biological brain's response to multisource (sensory-motor) stimuli (inputs). ANNs simulate the brain by interconnecting many neuron cells into a massively parallel processor; of course, they use networks of artificial nodes, not brain cells, to learn from data.

When we have three signals (or inputs) $x_1$, $x_2$ and $x_3$, the first step is weighting the features ($w$'s) according to their importance.
Then, the weighted signals are summed by the "neuron cell" and this sum is passed on through an **activation function**, denoted by $f$. The last step generates an output $y$ at the end of the process. A typical output has the following mathematical relationship to the inputs:
$$y(x)=f\left (\sum_{i=1}^n w_ix_i\right ).$$

There are three important components for building a neural network:

* **Activation function**: transforms the weighted and summed inputs into an output.
* **Network topology**: describes the number of "neuron cells", the number of layers, and the manner in which the cells are connected.
* **Training algorithm**: how to determine the weights $w_i$.

Let's look at each of these components one by one.

## Activation functions

One of these functions is known as the threshold activation function, which emits an output signal only once a specified input threshold has been attained.

```r
x <- runif(1000, -10, 10)
y <- ifelse(x >= 0, 1, 0)
plot(x, y, t="n", xlab = "Sum of input signals", ylab = "Output signal",
     main = "Threshold Activation (th=0.0)")
segments(0, -0.1, 0, 1, col="black", lty=2)
lines(x[order(x)], y[order(x)], pch=16, col="black", lwd=4)
```

$$f(x)= \left\{
\begin{array}{ll}
0 & x<0 \\
1 & x\geq 0 \\
\end{array}
\right. .$$

This is the simplest form of activation function and it is rarely used in real-world situations. The most commonly used alternative is the sigmoid activation function, $f(x)=\frac{1}{1+e^{-x}}$, where the *Euler number* $e$ is defined as the limit $\displaystyle\lim_{n\longrightarrow\infty}{\left ( 1+\frac{1}{n}\right )^n}$. The output signal is no longer binary, but can be any real number ranging from 0 to 1.
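The weighted-sum-and-activation relationship $y(x)=f\left (\sum_{i=1}^n w_ix_i\right )$ can be sketched directly in R. In this minimal sketch, the sigmoid serves as the activation function and the weights are arbitrary illustrative values, not fitted ones:

```r
# A single artificial neuron: a weighted sum of the inputs passed through
# an activation function f (here, the sigmoid)
sigmoid <- function(x) 1 / (1 + exp(-x))

neuron <- function(x, w, f = sigmoid) {
  f(sum(w * x))              # y(x) = f( sum_i w_i * x_i )
}

x <- c(0.5, -1.2, 0.3)       # three input signals x1, x2, x3
w <- c(0.8, 0.4, -0.6)       # their weights (chosen arbitrarily here)
neuron(x, w)                 # a single output y in (0, 1)
```

Swapping `f` for any of the activation functions below changes the codomain of the output without changing the weighted-sum structure.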
```r
x <- runif(100000, -10, 10)
y <- 1/(1+exp(-x))
plot(x, y, xlab = "Sum of input signals", ylab = "Output signal",
     main = "Sigmoid Activation (th1=0; th2=2)")
segments(0, -0.1, 0, 0.5, col="black", lty=2)
segments(2, -0.1, 2, 1/(1+exp(-2)), col="black", lty=2)
```

Other activation functions may also be useful:

```r
x <- runif(100000, -10, 10)
y1 <- x
y2 <- ifelse(x <= -5, -5, ifelse(x >= 5, 5, x))
y3 <- (exp(x)-exp(-x))/(exp(x)+exp(-x))
y4 <- exp(-x^2/2)
par(mfrow=c(2, 2))
plot(x, y1, main="Linear", xlab="", ylab="")
plot(x, y2, main="Saturated Linear", xlab="", ylab="")
plot(x, y3, main="Hyperbolic tangent", xlab="", ylab="")
plot(x, y4, main="Gaussian", xlab="", ylab="")
```

Depending on the specific problem, we can choose a proper activation function based on its codomain. For example, with the hyperbolic tangent activation function the outputs always range from -1 to 1, regardless of the input, whereas with linear functions the outputs can range from $-\infty$ to $+\infty$. The Gaussian activation function yields a model called a *Radial Basis Function* network.

## Network topology

The number of layers: the $x$'s, or features, in the dataset enter through the **input nodes**, while the predicted values correspond to the **output nodes**. Multilayer networks include one or more hidden layers. The following graph represents a two-layer neural network:

![](http://wiki.socr.umich.edu/images/3/36/DSPA_Figs_TwoLayer_Chapter6.png)
$$\text{Two-layer network.}$$

When we have multiple layers, the information flow can be complicated.

## The direction of information travel

The arrows in the last graph (with multiple layers) indicate a feed-forward network. In such networks, we can also model multiple outcomes simultaneously.
![](http://wiki.socr.umich.edu/images/0/07/DSPA_Figs_multioutput_Chapter6.png)
$$\text{Multi-output network.}$$

Alternatively, in a recurrent (feedback) network, information can also travel backwards in loops (or with delays). This is illustrated in the following graph.

![](http://wiki.socr.umich.edu/images/f/f9/DSPA_Figs_delay_chapter6.png)
$$\text{Delay (feedback) network.}$$

This short-term memory increases the power of recurrent networks dramatically. However, in practice, recurrent networks are less commonly used.

## The number of nodes in each layer

The numbers of input nodes and output nodes are predetermined by the dataset and the predictive variables. What we can tune is the number of hidden nodes in the model. Our goal is to use as few hidden nodes as possible, keeping the model simple, while still achieving satisfactory performance.

## Training neural networks with backpropagation

The backpropagation algorithm determines the weights in the model using a strategy of back-propagating errors. First, we assign random initial weights (all of which must be non-trivial, i.e., $\not= 0$). For example, we can use a normal distribution, or any other random process, to assign initial weights (priors). Then, we adjust the weights iteratively, repeating the process until a convergence or stopping criterion is met. Each iteration contains two phases:

* Forward phase: propagate from the input layer to the output layer using the current weights; outputs are produced at the end of this phase, and
* Backward phase: compare the outputs to the true target values (training validation). If the difference is significant, change the weights and go through the forward phase again.

In the end, we pick the set of weights minimizing the total aggregate error as the weights of the specific neural network.
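The forward and backward phases above can be sketched for a single sigmoid neuron trained by gradient descent on toy data. This is only an illustration of the idea, not the algorithm used internally by any particular package; the learning rate and iteration count are arbitrary choices:

```r
# One-neuron gradient-descent sketch of the forward/backward training phases
sigmoid <- function(x) 1 / (1 + exp(-x))

set.seed(42)
x      <- runif(50)                    # toy inputs
target <- sigmoid(3 * x - 1)           # toy targets generated by known weights (3, -1)
w      <- rnorm(2)                     # random non-zero initial weights (slope, bias)
lr     <- 0.5                          # learning rate

mse <- function(w) mean((sigmoid(w[1] * x + w[2]) - target)^2)
mse_start <- mse(w)

for (iter in 1:5000) {
  y   <- sigmoid(w[1] * x + w[2])      # forward phase: outputs from current weights
  err <- y - target                    # backward phase: compare outputs to targets...
  w   <- w - lr * c(mean(err * y * (1 - y) * x),   # ...and adjust the weights along
                    mean(err * y * (1 - y)))       #    the squared-error gradient
}
mse_end <- mse(w)
c(mse_start, mse_end)                  # the aggregate error decreases substantially
```

The final weights approach the values that generated the toy targets, which is exactly the "pick the weights minimizing the total aggregate error" step described above.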
# Case Study 1: Google Trends and the Stock Market - **Regression**

## Step 1 - collecting data

In this case study, we are going to use the [Google Trends and stock market dataset](http://wiki.socr.umich.edu/index.php/SOCR_Data_GoogleTrends_2005_2011). A [doc file with the meta-data](https://umich.instructure.com/files/416273/download?download_frd=1) and the [CSV data](https://umich.instructure.com/files/416274/download?download_frd=1) are available on the [Case-Studies Canvas Site](https://umich.instructure.com/courses/38100/files/folder/Case_Studies). These daily data (between 2008 and 2009) can be used to examine the associations between Google search trends and the daily stock market index - the Dow Jones Industrial Average.

### Variables

* **Index**: Time index of the observation.
* **Date**: Date of the observation (format: YYYY-MM-DD).
* **Unemployment**: The Google Unemployment Index tracks queries related to "unemployment, social, social security, unemployment benefits" and so on.
* **Rental**: The Google Rental Index tracks queries related to "rent, apartments, for rent, rentals", etc.
* **RealEstate**: The Google Real Estate Index tracks queries related to "real estate, mortgage, rent, apartments" and so on.
* **Mortgage**: The Google Mortgage Index tracks queries related to "mortgage, calculator, mortgage calculator, mortgage rates".
* **Jobs**: The Google Jobs Index tracks queries related to "jobs, city, job, resume, career, monster" and so forth.
* **Investing**: The Google Investing Index tracks queries related to "stock, finance, capital, yahoo finance, stocks", etc.
* **DJI_Index**: The Dow Jones Industrial (DJI) index. These data are interpolated from 5 records per week (Dow Jones stocks are traded on week-days only) to 7 days per week, to match the constant 7-day records of the Google Trends data.
* **StdDJI**: The standardized DJI index, computed by StdDJI = 3 + (DJI - 11091)/1501, where m=11091 and s=1501 are the approximate mean and standard deviation of the DJI for the period 2005-2011.
* **30-Day Moving Average Data Columns**: The 8 variables below are the 30-day moving averages of the 8 corresponding (raw) variables above: *Unemployment30MA, Rental30MA, RealEstate30MA, Mortgage30MA, Jobs30MA, Investing30MA, DJI_Index30MA*, and *StdDJI_30MA*.
* **180-Day Moving Average Data Columns**: The 8 variables below are the 180-day moving averages of the 8 corresponding (raw) variables: *Unemployment180MA, Rental180MA, RealEstate180MA, Mortgage180MA, Jobs180MA, Investing180MA, DJI_Index180MA*, and *StdDJI_180MA*.

Here we use RealEstate as our dependent variable. Let's see if the Google Real Estate Index can be predicted by the other variables in the dataset.

## Step 2 - exploring and preparing the data

First things first, we need to load the dataset into R.

```r
google <- read.csv("https://umich.instructure.com/files/416274/download?download_frd=1", stringsAsFactors = F)
```

Let's delete the first two columns, since the only goal is to predict the Google Real Estate Index from the other indices and the DJI.

```r
google <- google[, -c(1, 2)]
str(google)
```

As we can see from the structure of the data, these indices and the DJI have different ranges. We should rescale the data to make all features unitless and, therefore, comparable. In [Chapter 6](http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/06_LazyLearning_kNN.html), we learned that normalizing these features using our own normalize() function could fix the problem. We can use lapply() to apply the normalize() function to each column.

```r
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
google_norm <- as.data.frame(lapply(google, normalize))
summary(google_norm$RealEstate)
```

The last step clearly normalizes all feature vectors into the 0 to 1 range.
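Since the model will be trained on normalized features, its predictions are also on the $[0, 1]$ scale. To report predictions in the original units, we can invert the min-max transform using the raw variable's minimum and maximum. Here denormalize() is our own helper (not part of any package), shown on a small illustrative vector:

```r
# Invert min-max normalization: map values from [0, 1] back to the
# original scale, using the raw variable's minimum and maximum
denormalize <- function(x_norm, x_raw) {
  x_norm * (max(x_raw) - min(x_raw)) + min(x_raw)
}

raw <- c(10, 25, 40, 55, 70)                      # illustrative raw values
nrm <- (raw - min(raw)) / (max(raw) - min(raw))   # forward (normalize) transform
denormalize(nrm, raw)                             # recovers 10 25 40 55 70
```

Note that denormalizing test-set predictions requires the min and max of the corresponding *raw* column, so those two numbers should be kept alongside the normalized data.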
The next step is to split the google dataset into *training* and *testing* subsets. This time we will use the sample() and floor() functions to separate the training and test datasets (75:25). The sample() function creates a set of (random) indicators for row indices; we can subset the original dataset with these random row indicators. The floor() function takes a number *x* and returns the largest integer that does not exceed *x*.

sample(row, floor(size))

* *row*: the rows in the dataset you want to select from. To select all rows, you can use nrow(data) or 1:nrow(data) (for a single number or a vector).
* *size*: how many rows you want in your subset.

```r
sub <- sample(nrow(google_norm), floor(nrow(google_norm)*0.75))
google_train <- google_norm[sub, ]
google_test <- google_norm[-sub, ]
```

We are good to go and can move forward to the neural network training phase.

## Step 3 - training a model on the data

Here, we use the function neuralnet::neuralnet(), which returns an NN object containing:

* *call*: the matched call.
* *response*: extracted from the data argument.
* *covariate*: the variables extracted from the data argument.
* *model.list*: a list containing the covariates and the response variables extracted from the formula argument.
* *err.fct* and *act.fct*: the error and activation functions.
* *net.result*: a list containing the overall result of the neural network for every repetition.
* *weights*: a list containing the fitted weights of the neural network for every repetition.
* *result.matrix*: a matrix containing the reached threshold, needed steps, error, AIC, BIC, and weights for every repetition. Each column represents one repetition.

The basic call is m <- neuralnet(target ~ predictors, data=mydata, hidden=1), where:

* *target*: the variable we want to predict.
* *predictors*: the predictors we want to use. Note that we cannot use "." to denote all the variables in this function; we have to add all predictors to the model one by one.
* *data*: the training dataset.
* *hidden*: the number of hidden nodes we want to use in the model; by default, it is set to one.

```r
# install.packages("neuralnet")
library(neuralnet)
google_model <- neuralnet(RealEstate~Unemployment+Rental+Mortgage+Jobs+Investing+DJI_Index+StdDJI, data=google_train)
plot(google_model)
```

![](http://wiki.socr.umich.edu/images/e/e5/DSPA_Figs_plot_google_model_Ch6.png)

The above graph shows that we have only one hidden node. Error represents the aggregate sum of squared errors, and Steps is how many iterations the model has gone through. Note that these outputs may differ when you run the exact same code twice, because the weights are stochastically estimated.

Also, *bias nodes* (the blue singletons in the graph) may be added to feed-forward neural networks, acting like intermediate input nodes that produce constant values, e.g., 1. They are not connected to nodes in previous layers, yet they generate biased activation. Bias nodes are not required, but are helpful in some neural networks because they allow offsetting the activation functions.

## Step 4 - evaluating model performance

Similar to the predict() function mentioned in previous chapters, compute() is an alternative method that generates the model predictions:

p <- compute(m, test)

* *m*: a trained neural network model.
* *test*: the test dataset. This dataset should contain only the same type of predictors used in the neural network model.

Our model used Unemployment, Rental, Mortgage, Jobs, Investing, DJI_Index, and StdDJI as predictors. So, we need to find the corresponding column numbers in the test dataset (1, 2, 4, 5, 6, 7, and 8, respectively).
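Hard-coded column indices are fragile if the data layout ever changes; the same predictor columns can also be selected by name. A small sketch, using a toy data frame that mimics the column layout of google_test (with the real data, google_test[, predictors] can be passed to compute() in place of an index-based subset):

```r
# Select predictor columns by name rather than by position
predictors <- c("Unemployment", "Rental", "Mortgage", "Jobs",
                "Investing", "DJI_Index", "StdDJI")

# toy data frame with the same column layout as google_test
toy_test <- data.frame(Unemployment=0.1, Rental=0.2, RealEstate=0.3,
                       Mortgage=0.4, Jobs=0.5, Investing=0.6,
                       DJI_Index=0.7, StdDJI=0.8)

test_covariates <- toy_test[, predictors]  # same columns as toy_test[, c(1:2, 4:8)]
names(test_covariates)
```

Name-based selection also makes it explicit that the response column, RealEstate, is excluded from the covariates handed to compute().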
```r
google_pred <- compute(google_model, google_test[, c(1:2, 4:8)])
pred_results <- google_pred$net.result
cor(pred_results, google_test$RealEstate)
```

As mentioned in [Chapter 9](http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/09_RegressionForecasting.html), we can use the correlation between the predicted results and the observed Real Estate Index to evaluate the algorithm. A correlation over 0.9 is very good for real-world datasets. Could this be improved further?

## Step 5 - improving model performance

This time we will include 4 hidden nodes in the NN model. Let's see what results we can get from this more complicated model.

```r
google_model2 <- neuralnet(RealEstate~Unemployment+Rental+Mortgage+Jobs+Investing+DJI_Index+StdDJI, data=google_train, hidden = 4)
plot(google_model2)
```

![](http://wiki.socr.umich.edu/images/d/d1/DSPA_Figs_plot_google_model2_ch6.png)

Although the graph looks more complicated than the previous neural network, we obtain a smaller Error, i.e., sum of squared errors. Neural network models may be used both for *classification* and *regression*, as we will see in the next part. Let's first try regression.

```r
google_pred2 <- compute(google_model2, google_test[, c(1:2, 4:8)])
pred_results2 <- google_pred2$net.result
cor(pred_results2, google_test$RealEstate)
```

We get an even higher correlation. This is almost an ideal result! The predicted and observed RealEstate indices have a strong linear relationship. Nevertheless, too many hidden nodes may sometimes decrease the correlation between the predicted and observed values, which will be examined in the practice problems later in this chapter.

## Step 6 - adding additional layers

We observe an even lower Error by using three hidden layers with $4, 3, 3$ nodes, respectively. However, this complicates the interpretation of the resulting neural network.
```r
google_model2 <- neuralnet(RealEstate~Unemployment+Rental+Mortgage+Jobs+Investing+DJI_Index+StdDJI, data=google_train, hidden = c(4,3,3))
google_pred2 <- compute(google_model2, google_test[, c(1:2, 4:8)])
pred_results2 <- google_pred2$net.result
cor(pred_results2, google_test$RealEstate)
# plot(google_model2)
```

# Simple NN demo - learning to compute $\sqrt {\ \ \ }$

This simple example demonstrates the foundation of neural network prediction of a basic mathematical function: $\sqrt {\ \ \ }: \Re^+ \longrightarrow \Re^+$.

```r
# generate random training data: 1,000 |X_i|, where X_i ~ Uniform(0,100), or perhaps ~ N(0,1)
rand_data <- abs(runif(1000, 0, 100))
# create a 2-column data frame (rand_data, sqrt_data)
sqrt_df <- data.frame(rand_data, sqrt_data=sqrt(rand_data))
plot(rand_data, sqrt_df$sqrt_data)
```

```r
# train the neural net
set.seed(1234)
net.sqrt <- neuralnet(sqrt_data ~ rand_data, sqrt_df, hidden=10, threshold=0.1)
```

```r
# report the NN
# print(net.sqrt)
# generate testing data: seq(from=0, to=N, by=0.1)
N <- 200  # the out-of-range interval (100:200) is also included in the testing!
test_data <- seq(0, N, 0.1); test_data_sqrt <- sqrt(test_data)
# try to predict the square-root values using the 10 hidden nodes
# compute() uses the trained neural net (net.sqrt)
# to estimate the square roots of the testing data
pred_sqrt <- compute(net.sqrt, test_data)$net.result
# compare the real (test_data_sqrt) and NN-predicted (pred_sqrt) square roots of test_data
plot(pred_sqrt, test_data_sqrt, xlim=c(0, 12), ylim=c(0, 12)); abline(0, 1, col="red", lty=2)
legend("bottomright", c("Pred vs. Actual SQRT", "Pred=Actual Line"),
       cex=0.8, lty=c(1,2), lwd=c(2,2), col=c("black","red"))
compare_df <- data.frame(pred_sqrt, test_data_sqrt)
# compare_df
plot(test_data, test_data_sqrt)
lines(test_data, pred_sqrt, pch=22, col="red", lty=2)
legend("bottomright", c("Actual SQRT", "Predicted SQRT"),
       lty=c(1,2), lwd=c(2,2), col=c("black","red"))
```

We observe that the NN, net.sqrt, actually learns the square-root function and predicts it quite closely. Of course, everyone's results may vary, because we randomly generate the training data (rand_data) and the NN construction (net.sqrt) is also stochastic.

# Case Study 2: Google Trends and the Stock Market - **Classification**

In practice, an NN may be even more useful as a classifier. Let's demonstrate this using the stock market data again. We label the samples according to their RealEstate values: those above the 75th percentile get label 0; those below the 25th percentile get label 2; the rest get label 1. Note that even in the classification setting, the responses still must be numeric.

google_class = google_norm
id1 = which(google_class$RealEstate>quantile(google_class$RealEstate,0.75))
id2 = which(google_class$RealEstate