In this chapter, we are going to cover two very powerful machine-learning algorithms. These techniques have complex mathematical formulations, however, efficient algorithms and reliable software packages have been developed to utilize them for various practical applications. We will (1) describe Neural Networks as analogues of biological neurons, (2) develop hands-on a neural net that can be trained to compute the square-root function, (3) describe support vector machine (SVM) classification, and (4) complete several case-studies, including optical character recognition (OCR), the Iris flowers, Google Trends and the Stock Market, and Quality of Life in chronic disease.

Later, in Chapter 22, we will provide more details and additional examples of deep neural network learning. For now, let’s start by exploring the magic inside the machine learning black box.

1 Understanding Neural Networks

1.1 From biological to artificial neurons

An Artificial Neural Network (ANN) model mimics the biological brain response to multisource (sensory-motor) stimuli (inputs). ANN simulate the brain using a network of interconnected neuron cells to create a massive parallel processor. Of course, it uses a network of artificial nodes, not brain cells, to train data.

When we have three signals (or inputs) $x_1$, $x_2$ and $x_3$, the first step is weighting the features ($w$’s) according to their importance. Then, the weighted signals are summed by the “neuron cell” and this sum is passed on according to an activation function denoted by f. The last step is generating an output y at the end of the process. A typical output will have the following mathematical relationship to the inputs. The weights $\{w_i\}_{i\ge 1}$’s control the weight-averaging of the inputs, $\{x_i\}$’s, used to assess the activation function. The constant factor weight $w_o$ and the corresponding bias term $b$ allow us to shift or offset the entire activation function (left or right). \[y(x)=f\left (w_o \underbrace{b}_{bias}+\sum_{i=1}^n \overbrace{w_i}^{weights} \underbrace{x_i}_{inputs}\right ).\]

There are three important components for building a neural network:

Activation function: transforms weighted and summed inputs to output.
Network topology: describes the number of “neuron cells”, the number of layers and manner in which the cells are connected.
Training algorithm: how to determine weights $w_i$

Let’s look at each of these components one by one.

1.2 Activation functions

One of the functions is known as threshold activation function that results in an output signal once a specified input threshold has been attained.

\[ f(x)= \left\{ \begin{array}{ll} 0 & x<0 \\ 1 & x\geq 0 \\ \end{array} \right. . \]

This is the simplest form for activation functions. It may be rarely used in real world situations. Most commonly used alternative is the sigmoid activation function where $f(x)=\frac{1}{1+e^{-x}}$. The Euler number e is defined as the limit of $\displaystyle\lim_{n\longrightarrow\infty}{\left ( 1+\frac{1}{n}\right )^n}$. The output signal is no longer binary but can be any real number ranging from 0 to 1.

Other activation functions might also be useful:

Depending on the specific problem, we can chose a proper activation function based on the corresponding codomain of the function. For example, with hyperbolic tangent activation function, we can only have outputs ranging from -1 to 1 regardless of what input do we have. With linear functions we can go from $-\infty$ to $+\infty$. The Gaussian activation function yields a model called Radial Basis Function network.

1.3 Network topology

The number of layers: The $x$’s, or features, in the dataset are called input nodes while the predicted values are called the output nodes. Multilayer networks include multiple hidden layers. The following graph represents a two-layer neural network:

\[\text{Two Layer network.}\]

When we have multiple layers, the information flow could be complicated.

1.4 The direction of information travel

The arrows in the last graph (with multiple layers) suggest a feed forward network. In such networks, we can also have multiple outcomes modeled simultaneously.

\[\text{Multi-output}\]

Alternatively, in a recurrent network (feedback network), information can also travel backwards in loops (or delays). This is illustrated in the following graph.

\[\text{Delay(feedback network)}\]

This short-term memory increases the power of recurrent networks dramatically. However, in practice, recurrent networks are not commonly used.

1.5 The number of nodes in each layer

Number of input nodes and output nodes are predetermined by the dataset and predictive variables. The number we can edit is the hidden nodes in the model. Our goal is to add fewer hidden nodes as possible to simplify the model when the model performance is pleasant.

1.6 Training neural networks with backpropagation

This algorithm could determine the weights in the model using a strategy of back-propagating errors. First, we assign random numbers for weights (but all weights must be non-trivial, i.e., $\not= 0$). For example, we can use normal distribution, or any other random process, to assign initial weights (priors). Then, we can adjust the weights iteratively by repeating the process until certain convergence or stopping criterion is met. Each iteration contains two phases.

Forward phase: from input layer to output layer using current weights. Outputs are produced at the end of this phase, and
Backward phase: compare the outputs and true target values (training validation). If the difference is significant, we change the weights and go through the forward phase, again.

In the end, we pick a set of weights minimizing the total aggregate error to be the weights of the specific neural network.

2 Case Study 1: Google Trends and the Stock Market - Regression

2.1 Step 1 - collecting data

In this case study, we are going to use the Google trends and stock market dataset. A doc file with the meta-data and the CSV data are available on the Case-Studies Canvas Site. These daily data (between 2008 and 2009) can be used to examine the associations between Google search trends and the daily stock market index - Dow Jones Industrial average.

2.1.1 Variables

Index: Time Index of the Observation
Date: Date of the observation (Format: YYYY-MM-DD)
Unemployment: The Google Unemployment Index tracks queries related to “unemployment, social, social security, unemployment benefits” and so on.
Rental: The Google Rental Index tracks queries related to “rent, apartments, for rent, rentals”, etc.
RealEstate: The Google Real Estate Index tracks queries related to “real estate, mortgage, rent, apartments” and so on.
Mortgage: The Google Mortgage Index tracks queries related to “mortgage, calculator, mortgage calculator, mortgage rates”.
Jobs: The Google Jobs Index tracks queries related to “jobs, city, job, resume, career, monster” and so forth.
Investing: The Google Investing Index tracks queries related to “stock, finance, capital, yahoo finance, stocks”, etc.
DJI_Index: The Dow Jones Industrial (DJI) index. These data are interpolated from 5 records per week (Dow Jones stocks are traded on week-days only) to 7 days per week to match the constant 7-day records of the Google-Trends data.
StdDJI: The standardized-DJI Index computed by: StdDJI = 3+(DJI-11091)/1501, where m=11091 and s=1501 are the approximate mean and standard-deviation of the DJI for the period (2005-2011).
30-Day Moving Average Data Columns: The 8 variables below are the 30-day moving averages of the 8 corresponding (raw) variables above: Unemployment30MA, Rental30MA, RealEstate30MA, Mortgage30MA, Jobs30MA, Investing30MA, DJI_Index30MA, and StdDJI_30MA.
180-Day Moving Average Data Columns: The 8 variables below are the 180-day moving averages of the 8 corresponding (raw) variables: Unemployment180MA, Rental180MA, RealEstate180MA, Mortgage180MA, Jobs180MA, Investing180MA, DJI_Index180MA, and StdDJI_180MA.

Here we use the RealEstate as our dependent variable. Let’s see if Google Real Estate Index could be predicted by other variables in the dataset.

2.2 Step 2 - exploring and preparing the data

First things first, we need to load the dataset into R.

google<-read.csv("https://umich.instructure.com/files/416274/download?download_frd=1", stringsAsFactors = F)

Let’s delete the first two columns, since the only goal is to predict Google Real Estate Index with other indexes and DJI.

google<-google[, -c(1, 2)]
str(google)

## 'data.frame':    731 obs. of  24 variables:
##  $ Unemployment      : num  1.54 1.56 1.59 1.62 1.64 1.64 1.71 1.85 1.82 1.78 ...
##  $ Rental            : num  0.88 0.9 0.92 0.92 0.94 0.96 0.99 1.02 1.02 1.01 ...
##  $ RealEstate        : num  0.79 0.81 0.82 0.82 0.83 0.84 0.86 0.89 0.89 0.89 ...
##  $ Mortgage          : num  1 1.05 1.07 1.08 1.1 1.11 1.15 1.22 1.23 1.24 ...
##  $ Jobs              : num  0.99 1.05 1.1 1.14 1.17 1.2 1.3 1.41 1.43 1.44 ...
##  $ Investing         : num  0.92 0.94 0.96 0.98 0.99 0.99 1.02 1.09 1.1 1.1 ...
##  $ DJI_Index         : num  13044 13044 13057 12800 12827 ...
##  $ StdDJI            : num  4.3 4.3 4.31 4.14 4.16 4.16 4.16 4 4.1 4.17 ...
##  $ Unemployment_30MA : num  1.37 1.37 1.38 1.38 1.39 1.4 1.4 1.42 1.43 1.44 ...
##  $ Rental_30MA       : num  0.72 0.72 0.73 0.73 0.74 0.75 0.76 0.77 0.78 0.79 ...
##  $ RealEstate_30MA   : num  0.67 0.67 0.68 0.68 0.68 0.69 0.7 0.7 0.71 0.72 ...
##  $ Mortgage_30MA     : num  0.98 0.97 0.97 0.97 0.98 0.98 0.98 0.99 0.99 1 ...
##  $ Jobs_30MA         : num  1.06 1.06 1.05 1.05 1.05 1.05 1.05 1.06 1.07 1.08 ...
##  $ Investing_30MA    : num  0.99 0.98 0.98 0.98 0.98 0.97 0.97 0.97 0.98 0.98 ...
##  $ DJI_Index_30MA    : num  13405 13396 13390 13368 13342 ...
##  $ StdDJI_30MA       : num  4.54 4.54 4.53 4.52 4.5 4.48 4.46 4.44 4.41 4.4 ...
##  $ Unemployment_180MA: num  1.44 1.44 1.44 1.44 1.44 1.44 1.44 1.44 1.44 1.44 ...
##  $ Rental_180MA      : num  0.87 0.87 0.87 0.87 0.87 0.87 0.86 0.86 0.86 0.86 ...
##  $ RealEstate_180MA  : num  0.89 0.89 0.88 0.88 0.88 0.88 0.88 0.88 0.88 0.87 ...
##  $ Mortgage_180MA    : num  1.18 1.18 1.18 1.18 1.17 1.17 1.17 1.17 1.17 1.17 ...
##  $ Jobs_180MA        : num  1.24 1.24 1.24 1.24 1.24 1.24 1.24 1.24 1.24 1.24 ...
##  $ Investing_180MA   : num  1.04 1.04 1.04 1.04 1.04 1.04 1.04 1.04 1.04 1.04 ...
##  $ DJI_Index_180MA   : num  13493 13492 13489 13486 13482 ...
##  $ StdDJI_180MA      : num  4.6 4.6 4.6 4.6 4.59 4.59 4.59 4.58 4.58 4.58 ...

As we can see from the structure of the data, these indexes and DJI have different ranges. We should rescale the data to make all features unitless, and therefore, comparable. In Chapter 6, we learned that normalizing these features using our own normalize() function could fix the problem. We can use lapply() to apply the normalize() function to each column.

normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
google_norm<-as.data.frame(lapply(google, normalize))
summary(google_norm$RealEstate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.4615  0.6731  0.6292  0.8077  1.0000

The last step clearly normalizes all feature vectors into the 0 to 1 range.

The next step would be to split the google dataset into training and testing subsets. This time we will use the sample() and floor() function to separate training and test dataset (75:25). The sample() function creates a set of (random) indicators for row indices. We can subset the original dataset with random rows using these indicators. The floor() function takes a number x and returns the closest integer to x.

sample(row, floor(size))

row: rows in the dataset that you want to select from. If you want to select all the rows, you can use $nrow(data)$ or $1:nrow(data)$ (for a single number or a vector).
size: how many rows you want for your subset.

sub<-sample(nrow(google_norm), floor(nrow(google_norm)*0.75))
google_train<-google_norm[sub, ]
google_test<-google_norm[-sub, ]

We are good to go and can move forward to the neural network training phase.

2.3 Step 3 - training a model on the data

Here, we use the function neuralnet::neuralnet(), which returns a NN object containing:

call; the matched call.
response; extracted from the data argument.
covariate; the variables extracted from the data argument.
model.list; a list containing the covariates and the response variables extracted from the formula argument.
err.fct and act.fct; the error and activation functions.
net.result; a list containing the overall result of the neural network for every repetition.
weights; a list containing the fitted weights of the neural network for every repetition.
result.matrix; a matrix containing the reached threshold, needed steps, error, AIC, BIC, and weights for every repetition. Each column represents one repetition.

m <- neuralnet(target ~ predictors, data=mydata, hidden=1), where:

target: variable we want to predict.
predictors: predictors we want to use. Note that we cannot use “.” to denote all the variables in this function. We have to add all predictors one by one to the model.
data: training dataset.
hidden: number of hidden nodes that we want to use in the model. By default, it is set to one.

# install.packages("neuralnet")
library(neuralnet)
google_model<-neuralnet(RealEstate~Unemployment+Rental+Mortgage+Jobs+Investing+DJI_Index+StdDJI, data=google_train)
plot(google_model)  # neuralnet::plot.nn

The above graph shows that we have only one hidden node. Error represents the aggregate sum of squared errors and Steps is how many iterations the model went through. Note that these outputs could be different when you run exact same codes twice because the weights are stochastically estimated.

Also, bias nodes (blue singletons in the graph) may be added to feedforward neural networks acting like intermediate input nodes that produce constant values, e.g., 1. They are not connected to nodes in previous layers, yet they generate biased activation. Bias nodes are not required but are helpful in some neural networks as they allow offsetting activation functions.

2.4 Step 4 - evaluating model performance

Similar to the predict() function that we have mentioned in previous chapters, compute() is an alternative method that could help us to generate the model predictions.

p<-compute(m, test)

m: a trained neural networks model.
test: the test dataset. This dataset should only contain the same type of predictors in the neural network model.

Our model used Unemployment, Rental, Mortgage, Jobs, Investing, DJI_Index, StdDJI as predictors. So, we need to find these corresponding column numbers in the test dataset (1, 2, 4, 5, 6, 7, 8 respectively).

google_pred<-compute(google_model, google_test[, c(1:2, 4:8)])
pred_results<-google_pred$net.result
cor(pred_results, google_test$RealEstate)

##           [,1]
## [1,] 0.9774946

As mentioned in Chapter 9, we can still use the correlation between predicted results and observed Real Estate Index to evaluate the algorithm. A correlation over 0.9 is very good for real world datasets. Could this be improved further?

2.5 Step 5 - improving model performance

This time we will include 4 hidden nodes in the NN model. Let’s see what results we can get from this more complicated model.

google_model2<-neuralnet(RealEstate~Unemployment+Rental+Mortgage+Jobs+Investing+DJI_Index+StdDJI, data=google_train, hidden = 4)
plot(google_model2)

Although the graph looks more complicated than the previous neural network, we have smaller Error, i.e., sum of squared errors. The neural network models may be used both for classification and regression, which we will see in the next part. Let’s first try regression.

google_pred2<-compute(google_model2, google_test[, c(1:2, 4:8)])
pred_results2<-google_pred2$net.result
cor(pred_results2, google_test$RealEstate)

##           [,1]
## [1,] 0.9906595

plot_ly() %>%
  add_markers(x=pred_results2, y=google_test$RealEstate, 
              name="Data Scatter", type="scatter", mode="markers") %>%
  add_trace(x = c(0,1), y = c(0,1), type="scatter", mode="lines",
        line = list(width = 4), name="Ideal Agreement") %>%
  layout(title=paste0('Scatterplot (Normalized) Observed vs. Predicted Real Estate Values, Cor(Obs,Pred)=',
                      round(cor(pred_results2, google_test$RealEstate), 2)),
           xaxis = list(title="NN (hidden=4) Real Estate Predictions"),
           yaxis = list(title="(Normalized) Observed Real Estate"),
           legend = list(orientation = 'h'))

We get an even higher correlation. This is almost an ideal result! The predicted and observed RealEstate indices have a strong linear relationship. Nevertheless, too many hidden nodes might sometimes decrease the correlation between predicted and observed values, which will be examined in the practice problems later in this chapter.

2.6 Step 6 - adding additional layers

We observe an even lower Error by use three hidden layers, each with $4,3,3$ nodes, respectively. However, this enhanced neural network may complicate the interpretation of the results (or may overfit the network to intrinsic noise in the data).

google_model2<-neuralnet(RealEstate~Unemployment+Rental+Mortgage+Jobs+Investing+DJI_Index+StdDJI, data=google_train, hidden = c(4,3,3))
google_pred2<-compute(google_model2, google_test[, c(1:2, 4:8)])
pred_results2<-google_pred2$net.result
cor(pred_results2, google_test$RealEstate)

##           [,1]
## [1,] 0.9908649

# plot(google_model2)

plot_ly() %>%
  add_markers(x=pred_results2, y=google_test$RealEstate, 
              name="Data Scatter", type="scatter", mode="markers") %>%
  add_trace(x = c(0,1), y = c(0,1), type="scatter", mode="lines",
        line = list(width = 4), name="Ideal Agreement") %>%
  layout(title=paste0('Scatterplot (Normalized) Observed vs. Predicted Real Estate Values, Cor(Obs,Pred)=',
                      round(cor(pred_results2, google_test$RealEstate), 2)),
           xaxis = list(title="NN (hidden=(4,3,3)) Real Estate Predictions"),
           yaxis = list(title="(Normalized) Observed Real Estate"),
           legend = list(orientation = 'h'))

3 Simple NN demo - learning to compute $\sqrt {\ \ \ }$

Neural networks can be used at the interface of experimental, theoretical, computational and data sciences. Here is one powerful example, Data science applications to string theory.

We will demonstrates the foundation of the neural network prediction to estimate a basic mathematical function: $\sqrt {\ \ \ }: \Re^+ \longrightarrow \Re^+$. First, plot the data.

# generate random training data: 1,000 |X_i|, where X_i ~ Uniform (0,100) or perhaps ~ N(0,1)
rand_data <- abs(runif(1000, 0, 100))
 
# create a 2 column data-frame (and_data, sqrt_data)
sqrt_df <- data.frame(rand_data, sqrt_data=sqrt(rand_data)) 
# plot(rand_data, sqrt_df$sqrt_data)

s <- seq(from=0, to=100, length.out=1000)
plot_ly(x = ~s, y = ~sqrt(s), type="scatter", mode = "lines") %>%
  layout(title='Square-root Funciton',
           xaxis = list(title="Input (x)", scaleanchor="y"),
           yaxis = list(title="Output (y=sqrt(x))", scaleanchor="x"),
           legend = list(orientation = 'h'))

Next, fit the NN model.

# Train the neural net
set.seed(1234)
net.sqrt <- neuralnet(sqrt_data ~ rand_data,  sqrt_df, hidden=10, threshold=0.1)

Examine the NN-prediction results on testing data.

# report the NN
# print(net.sqrt)
 
# generate testing data seq(from=0.1, to=N, step=0.1)
N <- 200 # out of range [100: 200] is also included in the testing!
test_data <- seq(0, N, 0.1); test_data_sqrt <- sqrt(test_data)
 
test_data.df <- data.frame(rand_data=test_data, sqrt_data=sqrt(test_data));
 # try to predict the square-root values using 10 hidden nodes
 # Compute or predict for test data input, test_data.df
pred_sqrt <- predict(net.sqrt, test_data.df)
 
 # compute uses the trained neural net (net.sqrt), 
 # to estimate the square-roots of the testing data 
 
# compare real (test_data_sqrt) and NN-predicted (pred_sqrt) square roots of test_data
# plot(pred_sqrt, test_data_sqrt, xlim=c(0, 12), ylim=c(0, 12)); 
# abline(0,1, col="red", lty=2)
# legend("bottomright",  c("Pred vs. Actual SQRT", "Pred=Actual Line"), cex=0.8, lty=c(1,2),
#         lwd=c(2,2),col=c("black","red"))

plot_ly(x = ~pred_sqrt[,1], y = ~test_data_sqrt, type = "scatter", mode="markers", name="scatter") %>%
  add_trace(x = c(0,14), y = c(0,14), mode="lines", line = list(width = 4), name="Ideal Agreement") %>%
  layout(title='Scatterplot Predicted vs. Actual SQRT',
           xaxis = list(title="NN Predicted", scaleanchor="y"),
           yaxis = list(title="Actual Value (y=sqrt(x))", scaleanchor="x"),
           legend = list(orientation = 'h'))

compare_df <-data.frame(pred_sqrt, test_data_sqrt); # compare_df
 
# plot(test_data, test_data_sqrt)
# lines(test_data, pred_sqrt, pch=22, col="red", lty=2)
# legend("bottomright",  c("Actual SQRT","Predicted SQRT"),
#        lty=c(1,2),lwd=c(2,2),col=c("black","red"))

plot_ly(x = ~test_data, y = ~test_data_sqrt,  type="scatter", mode="lines", name="SQRT") %>%
  add_trace(x = ~test_data, y = ~pred_sqrt, mode="markers", name="NN Model Prediction") %>%
  layout(title='Predicted vs. Actual SQRT',
           xaxis = list(title="Inputs"),
           yaxis = list(title="Outputs (y=sqrt(x))"),
           legend = list(orientation = 'h'))

We observe that the NN, net.sqrt actually learns and predicts pretty close the complex square root function. Of course, everyone’s results may vary as we randomly generate the training data (rand_data) and the NN construction (net.sqrt) is also stochastic.

4 Case Study 2: Google Trends and the Stock Market - Classification

In practice, NN may be more useful as a classifier. Let’s demonstrate this by using again the Stock Market data. We mark the samples according to their RealEstate categorization. For those higher than 75% percentile, we give them label 0; for those lower than 0.25 percentile, we label them as 2; otherwise, we label them 1. Note that even in the classification setting, the responses still must be numeric.

google_class = google_norm
id1 = which(google_class$RealEstate>quantile(google_class$RealEstate,0.75))
id2 = which(google_class$RealEstate<quantile(google_class$RealEstate,0.25))
id3 = setdiff(1:nrow(google_class),union(id1,id2))
google_class$RealEstate[id1]=0
google_class$RealEstate[id2]=1
google_class$RealEstate[id3]=2
summary(as.factor(google_class$RealEstate))

##   0   1   2 
## 179 178 374

Here, we divide the data to training and testing sets. We need 3 more column indicators, which correspond to the 3 outcomes labels.

set.seed(2017)
train = sample(1:nrow(google_class),0.7*nrow(google_class))
google_tr = google_class[train,]
google_ts = google_class[-train,]
train_x = google_tr[,c(1:2,4:8)]
train_y = google_tr[,3]
colnames(train_x)

## [1] "Unemployment" "Rental"       "Mortgage"     "Jobs"         "Investing"   
## [6] "DJI_Index"    "StdDJI"

test_x = google_ts[,c(1:2,4:8)]
test_y = google_ts[3]
train_y_ind = model.matrix(~factor(train_y)-1)
colnames(train_y_ind) = c("High","Median","Low")
train = cbind(train_x, train_y_ind)

We use non-linear output and display intermediate results every 5,000 iterations.

nn_single = neuralnet(High+Median+Low~Unemployment+Rental+Mortgage+Jobs+Investing+DJI_Index+StdDJI,
    data = train,
    hidden=4,
    linear.output=FALSE,
    lifesign='full', lifesign.step=5000)

Below is the prediction function using this neural network model to forecast RealEstate classification results.

pred = function(nn, dat) {
    # compute uses the trained neural net (nn=nn_single), and 
    # new testing data (dat=google_ts) to generate predictions (y_hat)
    # compute returns a list containing: 
    #     (1) neurons: a list of the neurons' output for each layer of the neural network, and
    #     (2) net.result: a matrix containing the overall result of the neural network.
    yhat = compute(nn, dat)$net.result
    
    # find the maximum in each row (1) in the net.result matrix
    # to determine the first occurrence of a specific element in each row (1)
    # we can use the apply function with which.max
    yhat = apply(yhat, 1, which.max)-1
    return(yhat)
}

mean(pred(nn_single, google_ts[,c(1:2,4:8)]) != as.factor(google_ts[,3]))

## [1] 0.01818182

Finally, report the confusion matrix illustrating the agreement/disagreement between the 3 observed RealEstate class labels and their (NN) predicted counterparts.

table(pred(nn_single, google_ts[,c(1:2,4:8)]), as.factor(google_ts[,3]))

##    
##       0   1   2
##   0  51   0   0
##   1   0  54   2
##   2   2   0 111

Now let’s inspect the structure of the resulting Neural Network.

plot(nn_single)

Similarly, we can change hidden to utilize multiple hidden layers, however, a more complicated model won’t necessarily guarantee an improved performance.

nn_single = neuralnet(High+Median+Low~Unemployment+Rental+Mortgage+Jobs+Investing+DJI_Index+StdDJI,
    data = train,
    hidden=c(4,5),
    linear.output=FALSE,
    lifesign='full', lifesign.step=5000)
mean(pred(nn_single, google_ts[,c(1:2,4:8)]) != as.factor(google_ts[,3]))

## [1] 0.02727273

5 Support Vector Machines (SVM)

5.1 Motivation

Recall that in Chapter 6 we presented Lazy machine learning methods, which assign class labels using geometrical distances of different features. In multidimensional feature spaces, we can utilize spheres, centered according to the training dataset, to assign testing data labels. What kinds of shapes may be embedded in $nD$ space to help with the classification process?

5.2 Classification with hyperplanes

The easiest shape would be an $(n-1)D$ plane embedded in $nD$, which splits the entire space into two parts. Support Vector Machines (SVM) can use hyperplanes to split the data into separate groups, or classes. Obviously, this may be useful for datasets that are linearly separable.

As an example, consider lines, $(n-1)D$ planes, embedded in $R^2$. Assume that we have only two features, will you choose line $A$ or line $B$ as the better hyperplane separating the data? Or even another plane $C$?

5.2.1 Finding the maximum margin

To answer the above question, we need to search for the Maximum Margin Hyperplane (MMH). That is the hyperplane that creates greatest separation between the two closest observations.

We define support vectors as the points from each class that are closest to MMH. Each class must have at least one observation as support vector.

Using support vectors alone is insufficient for finding the MMH. Although tricky mathematical calculations are involved, the fundamental of the SVM process is fairly simple. Let’s look at linearly separable data and non-linearly separable data individually.

5.2.2 Linearly separable data

If the dataset is linearly separable, we can find the outer boundaries of our two groups of data points. These boundaries are called convex hull (red lines in the following graph). The MMH (green solid line) is just the line that perpendicular to the shortest line (orange, dash) between the two convex hulls.

Mind the difference between convex hull and concave hull of a set of points.

# install.packages("alphahull")
library(alphahull)

# Define a convex spline polygon function
#   Input 'boundaryVertices' (n * 2 matrix) include the ordered X,Y coordinates of the boundary vertices
#   'vertexNumber' is the number of spline vertices to use; dim(boundaryVertices)[1] ... not all are necessary and some end vertices are clipped
#   'k' controls the smoothness of the  periodic spline, i.e., the number of points to wrap around the ends
#   Returns an array of points 
convexSplinePolygon <- function(boundaryVertices, vertexNumber, k=3) 
  {
    # Wrap k vertices around each end.
    n <- dim(boundaryVertices)[1]
    if (vertexNumber < n) {
      print("vertexNumber< n!!!")
      stop()
    }
    if (k >= 1) {
        data <- rbind(boundaryVertices[(n-k+1):n, ], boundaryVertices, boundaryVertices[1:k, ])
    } else {
        data <- boundaryVertices
    }

    # Spline-interpolate the x and y coordinates
    data.spline <- spline(1:(n+2*k), data[ , 1], n=vertexNumber)
    x <- data.spline$x
    x1 <- data.spline$y
    x2 <- spline(1:(n+2*k), data[,2], n=vertexNumber)$y

    # Keep only the middle part
    cbind(x1, x2)[k < x & x <= n+k, ]
}

# install.packages("alphahull")
# Concave hull (alpha-convex hull)
group1 <- list(x=A[6:9], y=B[6:9])
# if duplicate points are expected, remove them to prevent ahull() function errors
group2 <- lapply(group1, "[" ,which(!duplicated(as.matrix(as.data.frame(group1)))))
concaveHull1 <- ahull(group2, alpha=6)

# plot(concaveHull1, add=FALSE, col="blue", wpoints=FALSE, xlim=c(0,10),ylim=c(0,10))
# points(group2, pch=19)

library(alphahull)
# Convex hull
group3 <- list(x=A[1:5], y=B[1:5])
# points(group3, pch=19)
convHull2 <- lapply(group3, "[", chull(group3))

# polygon(convHull2, lty=2, border="gray", lwd=2)
# polygon(convexSplinePolygon(as.matrix(as.data.frame(convHull2)), 100),border="red",lwd=2)
# legend("topleft",  c("Convex Hull", "Convex Spline Hull", "Concave Hull"), lty=c(2,1,1), lwd=c(2,2,2),col=c("gray","red", "blue"), cex=0.8)
# text(5,2, "group 2", col="red"); text(8,6, "group 1", col="blue")

library(sp)
SpP = SpatialPolygons(list(Polygons(list(Polygon(group2)),ID="s1"))) 
# plot(SpP)
# points(XY)
x1 <- SpP@polygons[[1]]@Polygons[[1]]@coords[,1]
y1 <- SpP@polygons[[1]]@Polygons[[1]]@coords[,2]
df1 <- convexSplinePolygon(as.matrix(as.data.frame(convHull2)), 100)

plot_ly() %>%
  add_trace(x=df1[,1], y=df1[,2], type="scatter", mode="lines", name="Convex Hull", line=list(color="lightblue")) %>%
  add_lines(x = x1, y = y1, type="scatter", mode="lines", name="Concave Region", line=list(color="orange")) %>%
  add_segments(x = df1[1,1],  xend=df1[dim(df1)[1],1], y = df1[1,2], yend = df1[dim(df1)[1],2], type="scatter", 
            mode="lines", name="", line=list(color="gray"), showlegend=F) %>%
  add_segments(x = x1[3],  xend=x1[4], y = y1[3], yend = y1[4], type="scatter", 
            mode="lines", name="Concave Region", line=list(color="orange"), showlegend=F) %>%
  add_lines(x = c(6,4), y = c(8,5), name="Shortest Line Between the Convex Clusters (A)", line=list(dash='dash'))  %>% 
  add_lines(x = c(10,2), y = c(3,8.7), mode="lines", name="MMH Line (B)")  %>% 
  add_segments(x=1, xend=4, y=1, yend = 5, line=list(color="gray", dash='dash'), showlegend=F) %>%
  add_segments(x=1, xend=4, y=1, yend = 3, line=list(color="gray", dash='dash'), showlegend=F) %>%
  add_segments(x=4, xend=4, y=3, yend = 5, line=list(color="gray", dash='dash'), showlegend=F) %>%
  add_segments(x=6, xend=10, y=8, yend = 7, line=list(color="gray", dash='dash'), showlegend=F) %>%
  add_segments(x=6, xend=10, y=8, yend = 7, line=list(color="gray", dash='dash'), showlegend=F) %>%
  add_segments(x=10, xend=9, y=7, yend = 10, line=list(color="gray", dash='dash'), showlegend=F) %>%
  add_markers(x = A, y = B, type="scatter", mode="markers", name="Data", marker=list(color="blue")) %>%
  add_markers(x = 4, y = 5, name="P1", 
              marker = list(size = 20, color = 'blue', line = list(color = 'yellow', width = 2))) %>%
  add_segments(x=6, xend=9, y=8, yend = 10, line=list(color="gray", dash='dash'), showlegend=F) %>%
  add_markers(x = 6, y = 8, name="P2", 
              marker = list(size = 20, color = 'blue', line = list(color = 'yellow', width = 2))) %>%
  # add_lines(x = df1[,1], y = df1[,2], type="scatter", mode="lines", name="Convex Hull")  %>%
  layout(title="Illustration of Hyperplane (line) Separation of 2D Data",
         xaxis=list(title="X", scaleanchor="y"),  # control the y:x axes aspect ratio
         yaxis = list(title="Y", scaleanchor  = "x"), legend = list(orientation = 'h'),
         annotations = list(text=modelLabels,  x=modelLabels.x, y=modelLabels.y, textangle=c(-40,0),
                            font=list(size=15, color=modelLabels.col), showarrow=FALSE))

#   # extract the row numbers of the boundary points, in convex order.
# indx=concaveHull1$arcs[,"end1"]  
# points <- df[indx,2:3]               # extract the boundary points from df
# points <- rbind(points,points[1,])   # add the closing point
# # create the SpatialPolygonsDataFrame
# SpP = SpatialPolygons(list(Polygons(list(Polygon(points)),ID="s1"))) 
# plot(SpP)
# plot(SpP@polygons[[1]]@Polygons[[1]]@coords[,1], SpP@polygons[[1]]@Polygons[[1]]@coords[,2])

An alternative way to linearly separate the data into (two) clusters is to find two parallel planes that can separate the data into two groups, and then increase the distance between the two planes as much as possible.

We can use vector notation to mathematically define planes. In n-dimensional space, a plane could be expressed by the following equation: \[\vec{w}\cdot\vec{x}+b=0,\] where $\vec{w}$ (weights) is the plane normal vector, $\vec{x}$ is the vector of unknowns, both have n coordinates, and $b$ is a constant scalar that completely determines the plane (as it specifies a point the plane goes through).

To clarify this notation let’s look at the situation in a 3D space where we can express (embed) 2D Euclidean planes using a point $(x_o,y_o,z_o)$ and normal-vector $(a,b,c)$ form. This is just a linear equation, where $d=-(ax_o + by_o + cz_0)$: \[ax + by + cz + d = 0,\] or equivalently \[w_1x_1+w_2x_2+w_3x_3+b=0\] We can see that it is equivalent to the vector notation

Using the vector notation, we can specify two hyperplanes as follows: \[\vec{w}\cdot\vec{x}+b\geq+1\] and \[\vec{w}\cdot\vec{x}+b\leq-1\] We require that all of the observations in the first class fall above the first plane and all observations in the other class fall below the second plane.

The distance between two planes is calculated as: \[\frac{2}{\lVert \vec{w}\rVert}\] where $\lVert . \rVert$ is the Euclidean norm. To maximize the distance, we need to minimize the Euclidean norm.

To sum up we are going to find $min\frac{\lVert \vec{w}\rVert}{2}$ with the following constrain: \[y_i(\vec{w}\cdot\vec{x}-b)\geq1, \, \forall\vec{x}_i\] where $\forall$ means “for all”.

We will see more about constrained and unconstrained optimization later in Chapter 21. For each nonlinear programming problem, the primal problem, there is related nonlinear programming problem, also known as the Lagrangian dual problem. Under certain assumptions for convexity and suitable constraints, the primal and dual problems have equal optimal objective values. Primal optimization problems are typically described as: \[\begin{array}{rcl} \min_x{f(x)} \\ \text{subject to} \\ g_i(x) \leq 0\\ h_j(x) = 0 \\ \end{array}.\]

Then the Lagrangian dual problem is defined as a parallel nonlinear programming problem: \[\begin{array}{rcl} & \min_{u,v}{\theta(u,v)} & \\ & \text{subject to} & \\ & u \geq 0\\ \end{array},\] where \[ \theta(u,v)= \inf_{x}{ \left ( f(x)+\displaystyle\sum_i {u_i g_i(x)} +\displaystyle\sum_j {v_j h_j(x)} \right )}.\]

Chapter 21 provides additional technical details about optimization duality.

Suppose the Lagrange primal is \[L_p = \frac{1}{2}||w||^2-\sum_{i=1}^{n}\alpha_i[y_i(w_0+x_i^{T}w)-1],\ \text{where}\ \alpha_i\geq 0.\]

To optimize that objective function, we can set the partial derivatives equal to zero: \[\frac{\partial}{\partial w}\ :\ w = \sum_{i=1}^{n}\alpha_iy_ix_i\]

\[\frac{\partial}{\partial b}\ :\ 0 = \sum_{i=1}^{n}\alpha_iy_i.\]

Substituting into the Lagrange primal, we obtain the Lagrange dual:

\[L_D = \sum_{i=1}^{n}\alpha_i - \frac{1}{2} \sum_{i=1}^{n}\alpha_i\alpha_i^{'}y_iy_i^{'}x_i^Tx_i^{'}.\]

Then, we maximize $L_D$ subject to $\alpha_i \geq 0$ and $\sum_{i=1}^{n}\alpha_iy_i =0$.

Since it follows the Karush-Kuhn-Tucker optimization conditions, we have $\hat\alpha[y_i(\hat{b}+x_i^T\hat{w})-1]=0.$

This implies that if $y_i \hat{f}(x_i)>1$, then $\hat{\alpha}_i=0$.

The support of a function ($f(x_i)=\hat{b}+x_i^T\hat{w}$) is the smallest subset of the domain containing only arguments ($x$) which are not mapped to zero ($f(x)\not=0$). In our case, the solution $\hat{w}$ is defined in terms of a linear combination of the support points:

\[\hat{f}(x)=w^Tx = w = \sum_{i=1}^{n}\alpha_iy_ix_i. \]

That’s where the name of Support Vector Machines (SVM) comes from.

5.2.3 Non-linearly separable data

For non-linearly separable data, we need to use a small trick. Still, we use a plane but allow some of the points to be misclassified into the wrong class. To penalize for that, we add a cost term after the Euclidean norm function that we need to minimize.

Therefore, the solution will optimize the following regularized objective (cost) function: \[\min \left (\frac{\lVert \vec{w}\rVert}{2} \right )+C\sum_{i=1}^{n} \xi_i\] \[\text{subject to}\] \[y_i(\vec{w}\cdot\vec{x}-b)\geq1, \, \forall\vec{x}_i, \, \xi_i\geq0,\] where hyperparameter $C$ controls the error term (regularization) penalty and $\xi_i$ is the distance between the misclassified observation i and the plane.

We have the following Lagrange primal problem: \[L_p = \frac{1}{2}||w||^2 + C\sum_{i=1}^{n}\xi_i-\sum_{i=1}^{n}\alpha_i[y_i(b+x_i^{T}w)-(1-\xi_i)] - \sum_{i=1}^{n}\gamma_i\xi_i,\] where \[\alpha_i,\gamma_i \geq 0.\]

Similar to what we did above for the linearly separable case, we can use the derivatives of the primal problem to solve the dual problem.

Notice the inner product in the final expression. We can replace this inner product with a kernel function that maps the feature space into a higher dimensional space (e.g., using a polynomial kernel) or an infinite dimensional space (e.g., using a Gaussian kernel).

5.2.4 Using kernels for non-linear spaces

An alternative way to solve for the non-linear separable is called the kernel trick. That is to add some dimensions (or features) to make these non-linear separable data to be separable in a higher dimensional space.

The solution of the quadratic optimization problem in this case involves regularized objective function: \[\min_{w, b} \left (\frac{\lVert \vec{w}\rVert}{2} \right )+C\sum_{i=1}^{n} \xi_i,\] \[\text{subject to}\] \[y_i(\vec{w}\cdot\phi (\vec{x})-b)\geq 1 -\xi_i, \, \forall\vec{x}_i, \, \xi_i\geq 0.\] Again, the hyperparameter $C$ controls the regularization penalty and $\xi_i$ are the slack variables introduced by lifting the initial low-dimensional (non-linear) problem to a new higher dimensional linear problem. The quadratic optimization of this (primal) higher-dimensional problem is similarly transformed into a Lagrangian dual problem:

\[L_p = \max_{\alpha} \min_{w, b} \left \{\frac{1}{2}||w||^2 + C\sum_{i=1}^{n}\alpha_i \left ( 1- w^T\phi(\vec{x}_i) +b \right )\right \},\] where \[0\leq \alpha_i \leq C, \forall i.\]

The solution to the Lagrange dual problem provides estimates of $\alpha_i$ and we can predict the class label of a new sample $x_{test}$ via:

\[y_{test}=sign(w^T \phi (x_{test})+b)=sign\left ( \sum_{i=1}^{n} \alpha_i y_i \ \underbrace{\phi(\vec{x}_{i,test})^T \phi(\vec{x}_{j,test})}_{kernel, K(\vec{x}_i,\vec{x}_j)=\phi(\vec{x}_{i,test}). \phi(\vec{x}_{j,test})} +b \right ).\]

Below is one example where the 2D data (mtcars, $n=32, k=10$ cars fuel consumption) doesn’t appear as linearly separable in its native 2D ($weight\times horsepower$), the binary colors correspond to V-shaped or Straight engine type.

library(plotly)

mtcars$vs[which(mtcars$vs == 0)] <- 'V-Shaped Engine'
mtcars$vs[which(mtcars$vs == 1)] <- 'Straight Engine'
mtcars$vs <- as.factor(mtcars$vs)

p_2D <- plot_ly(mtcars, x = ~wt, y = ~hp/10, color = ~vs, colors = c('blue', 'red'), name=~vs) %>%
  add_markers() %>%
  add_segments(x = 1, xend = 6, y = 8, yend = 18, colors="gray", opacity=0.2, 
               showlegend = FALSE) %>%
  layout(xaxis = list(title = 'Weight'), yaxis = list(title = 'Horsepower'), legend = list(orientation = 'h'),
         title="(mtcars) Automobile Weight vs. Horsepower Relation") %>% hide_colorbar()
p_2D

However, the data can be lifted in 3D where it’s is more clearly linearly separable (by engine type) via a 2D plane.

# library(plotly)
# p_3D <- plot_ly(mtcars, x = ~wt, y = ~hp, z = ~qsec, color = ~vs, colors = c('blue', 'red')) %>%
#   add_markers() %>%
#   layout(scene = list(xaxis = list(title = 'Weight'),
#                     yaxis = list(title = 'Horsepower'),
#                     zaxis = list(title = '1/4 mile time')))
#p_3D

# Compute the Normal to the 2D PC plane
normVec = c(1, 1.3, -3.0)
# Compute the 3D point of gravitational balance (Plane has to go through it)
dMean <- c(3.2, -280, 2)

d <- as.numeric((-1)*normVec %*% dMean)  # force the plane to go through the mean
x=mtcars$wt; y=mtcars$hp; z=mtcars$qsec; w=mtcars$vs   # define the x, y, z dimensions
w.col = ifelse(mtcars$vs=="Straight Engine", "blue", "red")

# Reparametrize the 2D (x,y) grid, and define the corresponding model values z on the grid. Recall z=-(d + ax+by)/c, where normVec=(a,b,c)
x.seq <- seq(min(x),max(x),length.out=100)
y.seq <- seq(min(y),max(y),length.out=100)
z.seq <- function(x,y) -(d + normVec[1]*x + normVec[2]*y)/normVec[3]
# define the values of z = z(x.seq, y.seq), as a Matrix of dimension c(dim(x.seq), dim(y.seq))
z1 <- t(outer(x.seq, y.seq, z.seq))/10; range(z1)  # we need to check this 10 correction, to ensure the range of z is appropriate!!!

## [1] 14.53043 26.92413

# Draw the 2D plane embedded in 3D, and then add points with "add_trace"
myPlotly <- plot_ly(x=~x.seq, y=~y.seq, z=~z1,
                    colors = "gray", type="surface", opacity=0.5, showlegend = FALSE) %>%
    add_trace(data=mtcars, x=x, y=y, z=mtcars$qsec, mode="markers", type="scatter3d", 
              marker = list(color=w.col, opacity=0.9, symbol=105)) %>%
    layout(showlegend = FALSE,  scene = list(
        aspectmode = "manual", aspectratio = list(x=1, y=1, z=1),
        xaxis = list(title = "Weight", range = c(min(x),max(x))),
        yaxis = list(title = "Horsepower", range = c(min(y),max(y))),
        zaxis = list(title = "1/4 mile time", range = c(14, 23)))
    ) %>% hide_colorbar()
myPlotly

How can we do that in practice? We transform our data using kernel functions. A general form for kernel functions would be: \[K(\vec{x_i}, \vec{x_j})=\phi(\vec{x_i})\cdot\phi(\vec{x_j}),\] where $\phi$ is a mapping of the data into another space.

The linear kernel would be the simplest one that is just the dot product of the features. \[K(\vec{x_i}, \vec{x_j})=\vec{x_i}\cdot\vec{x_j}.\] The polynomial kernel of degree d transform the data by adding a simple non-linear transformation of the data. \[K(\vec{x_i}, \vec{x_j})=(\vec{x_i}\cdot\vec{x_j}+1)^d.\] The sigmoid kernel is very similar to the neural networks approach. It uses a sigmoid activation function. \[K(\vec{x_i}, \vec{x_j})=tanh(k\vec{x_i}\cdot\vec{x_j}-\delta).\] The Gaussian radial basis function (RBF) kernel is similar to RBF neural network and may be a good place to start, in general. \[K(\vec{x_i}, \vec{x_j})=exp \left (\frac{-\lVert \vec{x_i}-\vec{x_j}\rVert^2}{2\sigma^2}\right ) .\]

6 Case Study 3: Optical Character Recognition (OCR)

In Chapter 5 we saw machine learning strategies for recognition of hand-written digits. We now want to expend that to character recognition. The following example illustrates management and transferring of handwritten notes (text) and converting it to typeset or printed text representing the characters in the original notes (unstructured image data).

Protocol:
Divides the image (typically optical image of handwritten notes on paper) into a fine grid where each cell contains 1 glyph (symbol, letter, number).
Match the glyph in each cell to 1 of the possible characters in a dictionary.
Combine individual characters together into words to reconstitute the digital representation of the optical image of the handwritten notes.

In this example, we use an optical document image (data) that has already been pre-partitioned into rectangular grid cells containing 1 character of the 26 English letters, A through Z.

The resulting gridded dataset is distributed by the UCI Machine Learning Data Repository. The dataset contains 20, 000 examples of 26 English capital letters printed using 20 different randomly reshaped and morphed fonts.

This figure show an example of the preprocessed gridded handwritten letters.

6.1 Step 1: Prepare and explore the data

Load the data and split it into training and testing sets.

# read in data and examine its structure
hand_letters <- read.csv("https://umich.instructure.com/files/2837863/download?download_frd=1", header = T)
str(hand_letters)

## 'data.frame':    20000 obs. of  17 variables:
##  $ letter: chr  "T" "I" "D" "N" ...
##  $ xbox  : int  2 5 4 7 2 4 4 1 2 11 ...
##  $ ybox  : int  8 12 11 11 1 11 2 1 2 15 ...
##  $ width : int  3 3 6 6 3 5 5 3 4 13 ...
##  $ height: int  5 7 8 6 1 8 4 2 4 9 ...
##  $ onpix : int  1 2 6 3 1 3 4 1 2 7 ...
##  $ xbar  : int  8 10 10 5 8 8 8 8 10 13 ...
##  $ ybar  : int  13 5 6 9 6 8 7 2 6 2 ...
##  $ x2bar : int  0 5 2 4 6 6 6 2 2 6 ...
##  $ y2bar : int  6 4 6 6 6 9 6 2 6 2 ...
##  $ xybar : int  6 13 10 4 6 5 7 8 12 12 ...
##  $ x2ybar: int  10 3 3 4 5 6 6 2 4 1 ...
##  $ xy2bar: int  8 9 7 10 9 6 6 8 8 9 ...
##  $ xedge : int  0 2 3 6 1 0 2 1 1 8 ...
##  $ xedgey: int  8 8 7 10 7 8 8 6 6 1 ...
##  $ yedge : int  0 4 3 2 5 9 7 2 1 1 ...
##  $ yedgex: int  8 10 9 8 10 7 10 7 7 8 ...

# divide into training (3/4) and testing (1/4) data
hand_letters_train <- hand_letters[1:15000, ]
hand_letters_test  <- hand_letters[15001:20000, ]

6.2 Step 2: Training an SVM model

We can specify vanilladot as a linear kernel, or alternatively:

rbfdot Radial Basis kernel i.e, “Gaussian”
polydot Polynomial kernel
tanhdot Hyperbolic tangent kernel
laplacedot Laplacian kernel
besseldot Bessel kernel
anovadot ANOVA RBF kernel
splinedot Spline kernel
stringdot String kernel

# begin by training a simple linear SVM
library(kernlab)
set.seed(123)
hand_letter_classifier <- ksvm(as.factor(letter) ~ ., data = hand_letters_train, kernel = "vanilladot")

##  Setting default kernel parameters

# look at basic information about the model
hand_letter_classifier

## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 6618 
## 
## Objective Function Value : -13.2947 -19.6051 -20.8982 -5.6651 -7.2092 -31.5151 -48.3253 -17.6236 -57.0476 -30.532 -15.7162 -31.49 -28.2706 -45.741 -11.7891 -33.3161 -28.2251 -16.5347 -13.2693 -30.88 -29.4259 -7.7099 -11.1685 -29.4289 -13.0857 -9.2631 -144.1105 -52.7747 -71.052 -109.7783 -158.3152 -51.2839 -39.6499 -67.0061 -23.8637 -27.6083 -26.3461 -35.2626 -38.6346 -116.8967 -173.8336 -214.2196 -20.7925 -10.3812 -53.1156 -12.228 -46.6132 -8.6867 -18.9108 -11.0535 -94.5751 -26.5689 -224.0215 -70.5714 -8.3232 -4.5265 -132.5431 -74.6876 -19.5742 -12.7352 -81.7894 -11.6983 -25.4835 -17.582 -23.934 -27.022 -50.7092 -10.9228 -4.3852 -13.7216 -3.8547 -3.5723 -8.419 -36.9773 -47.1418 -172.6874 -42.457 -44.0342 -42.7695 -13.0527 -16.7534 -78.7849 -101.8146 -32.1141 -30.3349 -104.0695 -32.1258 -24.6301 -32.6087 -17.0808 -5.1347 -40.5505 -6.684 -16.2962 -56.364 -147.3669 -49.0907 -37.8334 -32.8068 -73.248 -127.7819 -10.5342 -5.2495 -11.9568 -30.1631 -135.5915 -51.521 -176.2669 -99.0973 -10.295 -14.5906 -3.7822 -64.1452 -7.4813 -84.9109 -40.9146 -87.2437 -66.8629 -69.9932 -20.5294 -12.7577 -7.0328 -22.9219 -12.3975 -223.9411 -29.9969 -24.0552 -132.6252 -133.7033 -9.2959 -33.1873 -5.8016 -57.3392 -60.9046 -27.1766 -200.8554 -29.9334 -15.9359 -130.0183 -154.4587 -43.5779 -24.4852 -135.7896 -74.1531 -303.5043 -131.4741 -149.5403 -30.4917 -29.8086 -47.3454 -24.6204 -44.2792 -6.2064 -8.6708 -36.4412 -68.712 -179.7303 -44.7489 -84.8608 -136.6786 -569.3398 -113.0779 -138.435 -303.8556 -32.8011 -60.4546 -139.3525 -108.9841 -34.277 -64.9071 -38.6148 -7.5086 -204.222 -12.9572 -29.0252 -2.0352 -5.9916 -14.3706 -21.5773 -57.0064 -19.6546 -178.0543 -19.812 -4.145 -4.5318 -0.8101 -116.8649 -7.8269 -53.3445 -21.4812 -13.5066 -5.3881 -15.1061 -27.6061 -18.9239 -68.8104 -26.1223 -93.0231 -15.1693 -9.7999 -7.6137 -1.5301 -84.9531 -5.4551 -93.187 -93.4153 -43.8334 -23.6706 -59.1468 -22.0933 -47.8381 -219.9936 -39.5596 -47.2643 -34.0752 -20.2532 -11.239 -118.4152 -6.4126 -5.1846 -8.7272 -9.4584 -20.8522 -22.0878 -113.0806 -29.0912 -80.397 -29.6206 -13.7422 -8.9416 -3.0785 -79.842 -6.1869 -13.9663 -63.3665 -93.2067 -11.5593 -13.0449 -48.2558 -2.9343 -8.25 -76.4361 -33.5374 -109.112 -4.1731 -6.1978 -1.2664 -84.1287 -18.3054 -7.2209 -45.5509 -3.3567 -16.8612 -60.5094 -43.9956 -53.0592 -6.1407 -17.4499 -2.3741 -65.023 -102.1593 -103.4312 -23.1318 -17.3394 -50.6654 -31.4407 -57.6065 -19.6857 -5.2667 -4.1767 -55.8445 -30.92 -57.2396 -30.1101 -7.611 -47.7711 -12.1616 -19.1572 -53.5364 -3.8024 -53.124 -225.6075 -12.6791 -11.5852 -16.6614 -9.7186 -65.824 -16.3897 -42.3931 -50.513 -24.752 -14.513 -40.495 -16.5124 -57.1813 -4.7974 -5.2949 -81.7477 -3.272 -6.3448 -1.1259 -114.3256 -22.3232 -339.8619 -31.0491 -31.3872 -4.9625 -82.4936 -123.6225 -72.8463 -23.4836 -33.1608 -11.7133 -19.7607 -1.8599 -50.1148 -8.2868 -143.3592 -1.8508 -1.9699 -9.4175 -0.5202 -25.0654 -30.0489 -5.6248 
## Training error : 0.129733

6.3 Step 3: Evaluating model performance

# predictions on testing dataset
hand_letter_predictions <- predict(hand_letter_classifier, hand_letters_test)

head(hand_letter_predictions)

## [1] C U K U E I
## Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

# table(hand_letter_predictions, hand_letters_test$letter)

# look only at agreements vs. disagreements
# construct a vector of TRUE/FALSE indicating correct/incorrect predictions
agreement <- hand_letter_predictions == hand_letters_test$letter # check if characters agree
table(agreement)

## agreement
## FALSE  TRUE 
##   780  4220

prop.table(table(agreement))

## agreement
## FALSE  TRUE 
## 0.156 0.844

tab <- table(hand_letter_predictions, hand_letters_test$letter)
tab_df <- tidyr::spread(as.data.frame(tab), key = Var2, value = Freq)

sum(diag(table(hand_letter_predictions, hand_letters_test$letter)))

## [1] 4220

plot_ly(x = colnames(tab), y = colnames(tab), z = as.matrix(tab_df[, -1]), type = "heatmap")

6.4 Step 4: Improving model performance

Replacing the vanilladot linear kernel with rbfdot Radial Basis Function kernel, i.e., “Gaussian” kernel may improve the OCR prediction.

hand_letter_classifier_rbf <- ksvm(as.factor(letter) ~ ., data = hand_letters_train, kernel = "rbfdot")
hand_letter_predictions_rbf <- predict(hand_letter_classifier_rbf, hand_letters_test)

agreement_rbf <- hand_letter_predictions_rbf == hand_letters_test$letter
table(agreement_rbf)

## agreement_rbf
## FALSE  TRUE 
##   361  4639

prop.table(table(agreement_rbf))

## agreement_rbf
##  FALSE   TRUE 
## 0.0722 0.9278

Note the improvement of automated (SVM) classification accuracy ($0.928$) for rbfdot compared to the previous (vanilladot) result ($0.844$).

7 Case Study 4: Iris Flowers

Let’s have another look at the iris data that we saw in Chapter 2.

7.1 Step 1 - collecting data

SVM require all features to be numeric and each feature has to be scaled into a relative small interval. We are using the Edgar Anderson’s Iris Data in R for this case study. This dataset measures the length and width of sepals and petals from three Iris flower species.

7.2 Step 2 - exploring and preparing the data

Let’s load the data first. In this case study we want to explore the variable Species.

data(iris)
str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

table(iris$Species)

## 
##     setosa versicolor  virginica 
##         50         50         50

The data looks good. However, recall that we need a fairly normalized data. We could normalize the data by hand. Luckily, the R package we are going to use will normalized the dataset automatically.

Now we can separate the training and test dataset using 75%-25% rule.

sub<-sample(nrow(iris), floor(nrow(iris)*0.75))
iris_train<-iris[sub, ]
iris_test<-iris[-sub, ]

Let’s first try a toy (iris data) example.

library(e1071)
iris.svm_1 <- svm(Species~Petal.Length+Petal.Width, data=iris_train,
                kernel="linear", cost=1)
iris.svm_2 <- svm(Species~Petal.Length+Petal.Width, data=iris_train,
                kernel="radial", cost=1)
par(mfrow=c(2,1))
plot(iris.svm_1, iris[,c(5,3,4)], symbolPalette = rainbow(4), color.palette = terrain.colors)
legend("center", "Linear")

plot(iris.svm_2, iris[,c(5,3,4)], symbolPalette = rainbow(4), color.palette = terrain.colors)
legend("center", "Radial", )

7.3 Step 3 - training a model on the data

We are going to use kernlab for this case study. However other packages like e1071 and klaR are available if you are quite familiar with C++.

Let’s break down the function ksvm()

m<-ksvm(target~predictors, data=mydata, kernel="rbfdot", c=1)

target: the outcome variable that we want to predict.
predictors: features that the prediction based on. In this function we can use the “.” to represent all the variables in the dataset again.
data: the training dataset that the target and predictors can be find.
kernel: is the kernel mapping we want to use. By default, it is the radio basis function (rbfdot).
C is a number that specifies the cost of misclassification.

Let’s in stall the package and play with the data now.

# install.packages("kernlab")
library(kernlab)
iris_clas<-ksvm(Species~., data=iris_train, kernel="vanilladot")

##  Setting default kernel parameters

iris_clas

## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 24 
## 
## Objective Function Value : -0.8784 -0.3342 -12.9635 
## Training error : 0.026786

Here, we used all the variables other than the Species in the dataset as predictors. We also used kernel vanilladot that is the linear kernel in this model. We get a training error less than 0.02.

7.4 Step 4 - evaluating model performance

Our old friend predict() function is used again to make predictions. Here we have a factor outcome, so we need the command table() to show us how well do the predictions and actual data match.

iris.pred<-predict(iris_clas, iris_test)
table(iris.pred, iris_test$Species)

##             
## iris.pred    setosa versicolor virginica
##   setosa          9          0         0
##   versicolor      0         14         0
##   virginica       0          3        12

We can see that only a few cases of Iris versicolor flowers may be classified as virginica. The species of the majority of the flowers are all correctly identified.

To see the results more clearly, we can use the proportional table to show the agreements of the categories.

agreement<-iris.pred==iris_test$Species
prop.table(table(agreement))

## agreement
##      FALSE       TRUE 
## 0.07894737 0.92105263

Here == means “equal to”. Over 90% of predictions are correct. Nevertheless, is there any chance that we can improve the outcome? What if we try a Gaussian kernel?

7.5 Step 5 - RBF kernel function

Linear kernel is the simplest one but usually not the best one. Let’s try the RBF (Radial Basis “Gaussian” Function) kernel instead.

iris_clas1<-ksvm(Species~., data=iris_train, kernel="rbfdot")
iris_clas1

## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.812699787875342 
## 
## Number of Support Vectors : 51 
## 
## Objective Function Value : -4.2531 -4.7802 -16.0651 
## Training error : 0.017857

iris.pred1<-predict(iris_clas1, iris_test)
table(iris.pred1, iris_test$Species)

##             
## iris.pred1   setosa versicolor virginica
##   setosa          9          0         0
##   versicolor      0         15         0
##   virginica       0          2        12

agreement<-iris.pred1==iris_test$Species
prop.table(table(agreement))

## agreement
##      FALSE       TRUE 
## 0.05263158 0.94736842

The model performance did not drastically improve, compared to the previous linear kernel case (you might get slightly different results). This is because this Iris dataset has a mostly linear feature space separation. In practice, we could try alternative kernel functions and see which one fits the dataset the best.

7.6 Parameter Tuning

We can tune the SVM using the tune.svm function in the package e1071.

costs = exp(-5:8)
tune.svm(Species~., kernel = "radial", data = iris_train, cost = costs)

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##       cost
##  0.3678794
## 
## - best performance: 0.03560606

Further, we can draw a cv plot to gauge the model performance:

# install.packages("sparsediscrim") 
set.seed(2017)

# Install the Package sparsediscrim: https://cran.r-project.org/src/contrib/Archive/sparsediscrim/
# install.packages("corpcor", "bdsmatrix")
# install.packages("C:/Users/Dinov/Desktop/sparsediscrim_0.2.4.tar.gz", repos = NULL, type="source")
library(sparsediscrim)
library (reshape); library(ggplot2)
folds = cv_partition(iris$Species, num_folds = 5)
train_cv_error_svm = function(costC) {
  #Train
  ir.svm = svm(Species~., data=iris,
                 kernel="radial", cost=costC)
  train_error = sum(ir.svm$fitted != iris$Species) / nrow(iris)
  #Test
  test_error = sum(predict(ir.svm, iris_test) != iris_test$Species) / nrow(iris_test)
  #CV error
  ire.cverr = sapply(folds, function(fold) {
    svmcv = svm(Species~.,data = iris, kernel="radial", cost=costC, subset = fold$training)
    svmpred = predict(svmcv, iris[fold$test,])
    return(sum(svmpred != iris$Species[fold$test]) / length(fold$test))
  })
  cv_error = mean(ire.cverr)
  return(c(train_error, cv_error, test_error))
}

costs = exp(-5:8)
ir_cost_errors = sapply(costs, function(cost) train_cv_error_svm(cost))
df_errs = data.frame(t(ir_cost_errors), costs)
colnames(df_errs) = c('Train', 'CV', 'Test', 'Logcost')
dataL <- melt(df_errs, id="Logcost")

# ggplot(dataL, aes_string(x="Logcost", y="value", colour="variable",
#                          group="variable", linetype="variable", shape="variable")) +
#   geom_line(size=1) + labs(x = "Cost",
#                            y = "Classification error",
#                            colour="",group="",
#                            linetype="",shape="") + scale_x_log10()

plot_ly(dataL, x = ~log(Logcost), y = ~value, color = ~variable, 
        colors = c('blue', 'red', "green"), type="scatter", mode="lines") %>%
  layout(xaxis = list(title = 'log(Cost)'), yaxis = list(title = 'Classifier Error'), legend = list(orientation = 'h'),
         title="SVM CV-Plot of Model Performance (Iris Data)") %>% hide_colorbar()

7.7 Improving the performance of Gaussian kernels

Now, let’s attempt to improve the performance of a Gaussian kernel by tuning:

set.seed(2020)
gammas = exp(-5:5)
tune_g = tune.svm(Species~., kernel = "radial", data = iris_train, cost = costs, gamma = gammas)
tune_g

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##       gamma     cost
##  0.01831564 7.389056
## 
## - best performance: 0.01818182

We observe that the model achieves a better prediction now.

iris.svm_g <- svm(Species~., data=iris_train,
                kernel="radial", gamma=0.0183, cost=20)
table(iris_test$Species, predict(iris.svm_g, iris_test))

##             
##              setosa versicolor virginica
##   setosa          9          0         0
##   versicolor      0         15         2
##   virginica       0          0        12

agreement<-predict(iris.svm_g, iris_test)==iris_test$Species
prop.table(table(agreement))

## agreement
##      FALSE       TRUE 
## 0.05263158 0.94736842

Chapter 22 provides more details about neural networks and deep learning.

8 Practice

8.1 Problem 1: Google Trends and the Stock Market

Use the Google trend data. Fit a neural network model with the same training data as case study 1. This time use Investing as target and Unemployment, Rental, RealEstate, Mortgage, Jobs, DJI_Index, StdDJI as predictors. Use 3 hidden nodes.

Note: remember to change the columns you want to include in the test dataset when predicting.

The following number is the correlation between predicted and observed values.

google_model3<-neuralnet(Investing~Unemployment+Rental+RealEstate+Mortgage+Jobs+DJI_Index+StdDJI, data=google_train, hidden = 3)
plot(google_model3)
google_pred3<-compute(google_model3, google_test[, c(1:5, 7:8)])
pred_results3<-google_pred3$net.result
cor(pred_results3, google_test$Investing)

##           [,1]
## [1,] 0.9010635

You might get slightly different results since the weights are generated randomly.

8.2 Problem 2: Quality of Life and Chronic Disease

Use the same data we showed in Chapter 8, Quality of life and chronic disease and the corresponding meta-data doc.

Let’s load the data first. In this case study, we want to use the variable CHARLSONSCORE as our target variable.

qol<-read.csv("https://umich.instructure.com/files/481332/download?download_frd=1")
featureLength <- dim(qol)[2]
str(qol[, c((featureLength-3):featureLength)])

## 'data.frame':    2356 obs. of  4 variables:
##  $ TOS_Q_03           : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ TOS_Q_04           : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ CHARLSONSCORE      : int  2 2 3 1 0 0 2 8 0 1 ...
##  $ CHRONICDISEASESCORE: num  1.6 1.6 1.54 2.97 1.28 1.28 1.31 1.67 2.21 2.51 ...

Delete the first two columns (we don’t need ID variables) and rows that have missing values in CHARLSONSCORE(where CHARLSONSCOREequals “-9”) !qol$CHARLSONSCORE==-9 means we want all the rows that have CHARLSONSCORE not equal to -9. The exclamation sign (!) indicates “exclude”. Also, we need to convert our categorical variable CHARLSONSCORE into a factor.

qol<-qol[!qol$CHARLSONSCORE==-9 , -c(1, 2)]
qol$CHARLSONSCORE<-as.factor(qol$CHARLSONSCORE)
featureLength <- dim(qol)[2]
str(qol[, c((featureLength-3):featureLength)])

## 'data.frame':    2328 obs. of  4 variables:
##  $ TOS_Q_03           : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ TOS_Q_04           : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ CHARLSONSCORE      : Factor w/ 11 levels "0","1","2","3",..: 3 3 4 2 1 1 3 9 1 2 ...
##  $ CHRONICDISEASESCORE: num  1.6 1.6 1.54 2.97 1.28 1.28 1.31 1.67 2.21 2.51 ...

Now the dataset is ready. First, separate the dataset into training and test datasets using 75%-25% rule. Then, build a SVM model using all other variables in the dataset to be predictor variables. Try to add different cost of misclassification to the model. Rather than the default C=1 we use C=2 and C=3. See how the model behaves. Here we utilize the radio basis kernel.

Output for C=2.

sub<-sample(nrow(qol), floor(nrow(qol)*0.75))
qol_train<-qol[sub, ]
qol_test<-qol[-sub, ]
qol_clas2<-ksvm(CHARLSONSCORE~., data=qol_train, kernel="rbfdot", C=2)
qol_clas2

## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 2 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0179049133022324 
## 
## Number of Support Vectors : 1690 
## 
## Objective Function Value : -1791.529 -661.387 -281.6799 -54.6665 -16.8087 -9.365 -7.1712 -48.6125 -15.763 -3.0908 -694.6972 -294.81 -56.8338 -17.1704 -9.7008 -6.9953 -48.8671 -15.9895 -3.0122 -261.4046 -54.2496 -16.7967 -9.2232 -6.8109 -45.3683 -15.6672 -2.8534 -50.2201 -16.2147 -8.5662 -6.5326 -42.1534 -14.1415 -2.9874 -14.5834 -7.3738 -5.7942 -25.5704 -10.4334 -2.5111 -5.43 -4.7896 -12.3232 -7.1284 -2.0973 -3.79 -7.877 -6.1698 -1.8863 -6.4411 -3.8929 -2.1508 -9.8989 -2.6025 -2.1345 
## Training error : 0.313288

qol.pred2<-predict(qol_clas2, qol_test)

# table(qol.pred2, qol_test$CHARLSONSCORE)
agreement<-qol.pred2==qol_test$CHARLSONSCORE
prop.table(table(agreement))

## agreement
##     FALSE      TRUE 
## 0.4828179 0.5171821

tab <- table(qol.pred2, qol_test$CHARLSONSCORE)
tab_df <- tidyr::spread(as.data.frame(tab), key = Var2, value = Freq)

sum(diag(table(hand_letter_predictions, hand_letters_test$letter)))

## [1] 4220

plot_ly(x = colnames(tab), y = colnames(tab), z = as.matrix(tab_df[, -1]), type = "heatmap")

Output for C=3.

qol_clas3<-ksvm(CHARLSONSCORE~., data=qol_train, kernel="rbfdot", C=3)
qol_clas3

## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 3 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.01774204087375 
## 
## Number of Support Vectors : 1688 
## 
## Objective Function Value : -2414.278 -898.0229 -384.9233 -72.9281 -22.8802 -12.1196 -10.1545 -67.5478 -20.5772 -3.9696 -958.4005 -413.5135 -77.3647 -23.6938 -12.8706 -9.7629 -68.118 -21.0636 -3.793 -345.1383 -71.9275 -22.8552 -11.7983 -9.3501 -60.4221 -20.3684 -3.4363 -64.2546 -21.5556 -10.3295 -8.7248 -53.8136 -17.093 -3.7375 -18.3774 -8.2186 -7.1266 -29.4282 -10.8706 -2.7001 -5.5612 -5.4635 -14.3168 -7.2334 -2.1105 -3.8256 -8.9925 -6.2707 -1.893 -8.5234 -3.9642 -2.1748 -10.3427 -2.8782 -2.1534 
## Training error : 0.252005

qol.pred3<-predict(qol_clas3, qol_test)
# table(qol.pred3, qol_test$CHARLSONSCORE)
agreement<-qol.pred3==qol_test$CHARLSONSCORE
prop.table(table(agreement))

## agreement
##     FALSE      TRUE 
## 0.4965636 0.5034364

tab <- table(qol.pred3, qol_test$CHARLSONSCORE)
tab_df <- tidyr::spread(as.data.frame(tab), key = Var2, value = Freq)

sum(diag(table(hand_letter_predictions, hand_letters_test$letter)))

## [1] 4220

plot_ly(x = colnames(tab), y = colnames(tab), z = as.matrix(tab_df[, -1]), type = "heatmap")

Try to improve these performance metrics and results by preprocessing the data, parameter tuning, and ensembling methods.

Try to practice these techniques using other data from the list of our Case-Studies.

Data Science and Predictive Analytics (UMich HS650)

Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines

SOCR/MIDAS (Ivo Dinov)

October 2021