#' --- #' title: "Data Science and Predictive Analytics (UMich HS650)" #' subtitle: "

Evaluating Model Performance

" #' author: "

SOCR/MIDAS (Ivo Dinov)

" #' date: "`r format(Sys.time(), '%B %Y')`" #' tags: [DSPA, SOCR, MIDAS, Big Data, Predictive Analytics] #' output: #' html_document: #' theme: spacelab #' highlight: tango #' includes: #' before_body: SOCR_header.html #' after_body: SOCR_footer_tracker.html #' toc: true #' number_sections: true #' toc_depth: 2 #' toc_float: #' collapsed: false #' smooth_scroll: true #' --- #' #' In [previous chapters](http://dspa.predictive.space/), we used prediction accuracy to evaluate classification models. However, accurate predictions in one dataset does not necessarily imply that our model is perfect or that it will reproduce when tested on external data. We need additional metrics to evaluate the model performance to make sure it is robust, reproducible, reliable and unbiased. #' #' In this chapter, we will discuss (1) various evaluation strategies for prediction, clustering, classification, regression, and decision trees, (2) visualization of ROC curves and performance tradeoffs, and (3) estimation of future performance, internal statistical cross-validation and bootstrap sampling. #' #' # Measuring the performance of classification methods #' #' As mentioned previously, classification model performances could not be evaluated by prediction accuracy alone. We make different classification models for different purposes. For example, in newborns screening for genetic defects we want the model to have as few true negatives as possible. We don't want to classify anyone as "no defect" when they actually have a defect gene, since early treatment might alter the destiny of this newborn. #' #' We can use the following three types of data to evaluate the performance of a classifier model. #' #' * Actual class values (for supervised classification). #' * Predicted class values. #' * Estimated probability of the prediction. #' #' We are familiar with the first two cases. The last type of validation relies on the `predict(model, test_data)` function that we have talked about in previous classification and prediction chapters ([Chapter 6-8](http://dspa.predictive.space)). Let's revisit the model and test data we discussed in [Chapter 7](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/07_NaiveBayesianClass.html) - [Inpatient Head and Neck Cancer Medication data](https://umich.instructure.com/files/1614351/download?download_frd=1). We will demonstrate prediction probability estimation using this case-study [CaseStudy14_HeadNeck_Cancer_Medication.csv](https://umich.instructure.com/files/1614350/download?download_frd=1). #' #' hn_med<-read.csv("https://umich.instructure.com/files/1614350/download?download_frd=1", stringsAsFactors = FALSE) hn_med$seer_stage<-factor(hn_med$seer_stage) require(tm) hn_med_corpus<-Corpus(VectorSource(hn_med$MEDICATION_SUMMARY)) corpus_clean<-tm_map(hn_med_corpus, tolower) corpus_clean<-tm_map(corpus_clean, removePunctuation) corpus_clean <- tm_map(corpus_clean, stripWhitespace) corpus_clean <-tm_map(corpus_clean, removeNumbers) # corpus_clean <- tm_map(corpus_clean, PlainTextDocument) hn_med_dtm<-DocumentTermMatrix(corpus_clean) hn_med_train<-hn_med[1:562, ] hn_med_test<-hn_med[563:662, ] hn_med_dtm_train<-hn_med_dtm[1:562, ] hn_med_dtm_test<-hn_med_dtm[563:662, ] corpus_train<-corpus_clean[1:562] corpus_test<-corpus_clean[563:662] hn_med_train$stage<-hn_med_train$seer_stage %in% c(4, 5, 7) hn_med_train$stage<-factor(hn_med_train$stage, levels=c(F, T), labels = c("early_stage", "later_stage")) hn_med_test$stage<-hn_med_test$seer_stage %in% c(4, 5, 7) hn_med_test$stage<-factor(hn_med_test$stage, levels=c(F, T), labels = c("early_stage", "later_stage")) convert_counts <- function(x) { x <- ifelse(x > 0, 1, 0) x <- factor(x, levels = c(0, 1), labels = c("No", "Yes")) return(x) } hn_med_dict<-as.character(findFreqTerms(hn_med_dtm_train, 5)) hn_train<-DocumentTermMatrix(corpus_train, list(dictionary=hn_med_dict)) hn_test<-DocumentTermMatrix(corpus_test, list(dictionary=hn_med_dict)) hn_train <- apply(hn_train, MARGIN = 2, convert_counts) hn_test <- apply(hn_test, MARGIN = 2, convert_counts) library(e1071) hn_classifier <- naiveBayes(hn_train, hn_med_train$stage) #' #' #' pred_raw<-predict(hn_classifier, hn_test, type="raw") head(pred_raw) #' #' #' The above output includes the prediction probabilities for the first 6 rows of the data. This examples is based on Naive Bayes classifier, however the same approach works for any other machine learning classification or prediction technique. #' #' In addition, we can report the predicted probability with the outputs of the Naive Bayesian decision-support system (`hn_classifier <- naiveBayes(hn_train, hn_med_train$stage)`): #' #' pred_nb<-predict(hn_classifier, hn_test) head(pred_nb) #' #' #' The general `predict()` method automatically subclasses to the specific `predict.naiveBayes(object, newdata, type = c("class", "raw"), threshold = 0.001, ...)` call where `type="raw"` and `type = "class"` specify the output as the conditional a-posterior probabilities for each class or the class with maximal probability, respectively. Back in [Chapter 8](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/08_DecisionTreeClass.html) where we discussed the `C5.0` and the `randomForest` classifiers to predict the chronic disease score in a (different) [Quality of Life Study](https://umich.instructure.com/files/481332/download?download_frd=1). #' #' qol<-read.csv("https://umich.instructure.com/files/481332/download?download_frd=1") qol<-qol[!qol$CHRONICDISEASESCORE==-9, ] qol$cd<-qol$CHRONICDISEASESCORE>1.497 qol$cd<-factor(qol$cd, levels=c(F, T), labels = c("minor_disease", "severe_disease")) qol<-qol[order(qol$ID), ] # Remove ID (col=1) # the clinical Diagnosis (col=41) will be handled later qol <- qol[ , -1] # 80-20% training-testing data split set.seed(1234) train_index <- sample(seq_len(nrow(qol)), size = 0.8*nrow(qol)) qol_train<-qol[train_index, ] qol_test<-qol[-train_index, ] library(C50) set.seed(1234) qol_model <- C5.0(qol_train[,-c(40, 41)], qol_train$cd) #' #' #' Below are the (probability) results of the `C5.0` classification prediction: #' #' pred_prob<-predict(qol_model, qol_test, type="prob") head(pred_prob) #' #' #' These can be contrasted against the `C5.0` classification label results: #' #' pred_tree<-predict(qol_model, qol_test) head(pred_tree) #' #' #' The same complementary types of outputs can be reported for most machine learning classification and prediction approaches. #' #' # Evaluation strategies #' #' In [Chapter 6](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/06_LazyLearning_kNN.html), we saw an attempt to categorize the supervised classification and unsupervised clustering methods. Similarly, the *table* below summarizes the basic types of evaluation and validation strategies for different forecasting, prediction, ensembling, and clustering techniques. [(Internal) Statistical Cross Validation](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/20_PredictionCrossValidation.html) or external validation should always be applied to ensure reliability and reproducibility of the results. The SciKit [clustering performance evaluation](http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation) and [Classification metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) sections provide details about many alternative techniques and metrics for performance evaluation of clustering and classification methods. #' #' Inference | Outcome | Evaluation Metrics | R functions #' ----------|-------------|---------------------|------------------------ #' Classification & Prediction | Binary | Accuracy, Sensitivity, Specificity, PPV/Precision, NPV/Recall, LOR | `caret::confusionMatrix`, `gmodels::CrossTable`, `cluster::silhouette` #' Classification & Prediction | Categorical | Accuracy, Sensitivity/Specificity, PPV, NPV, LOR, Silhouette Coefficient | `caret::confusionMatrix`, `gmodels::CrossTable`, `cluster::silhouette` #' Regression Modeling | Real Quantitative | correlation coefficient, $R^2$, RMSE, Mutual Information, Homogeneity and Completeness Scores | `cor`, `metrics::mse` #' #' ## Binary outcomes #' More details about binary test assessment is available on the [Scientific Methods for Health Sciences (SMHS) EBook site](http://wiki.socr.umich.edu/index.php/SMHS_IntroEpi#Screening). The table below summarizes the key measures commonly used to evaluate the performance of binary tests, classifiers, or predictions. #' #' #' #' #' #' #' #' #' #' #' #' #' #' #' #' #' #' #' #' #' #' #' #' #' #' #' #' #' #'

		Actual Condition		Test Interpretation
		Absent ($H_0$ is true)	Present ($H_1$ is true)	Test Interpretation
Test Result	Negative (fail to reject $H_0$)	TN Condition absent + Negative result = True (accurate) Negative	FN Condition present + Negative result = False (invalid) Negative Type II error* (proportional to $\beta$)*	$NPV=Recall$ #' $=\frac{TN}{TN+FN}$	#'
Test Result	Positive (reject $H_0$)	FP Condition absent + Positive result = False Positive Type I error ($\alpha$)	TP Condition Present + Positive result = True Positive	$PPV=Precision$ #' $=\frac{TP}{TP+FP}$	#'
Test Interpretation	$Power =1-\beta$ #' $= 1-\frac{FN}{FN+TP}$	$Specificity=\frac{TN}{TN+FP}$	$Power=Sensitivity$ #' $=\frac{TP}{TP+FN}$	$LOR=\ln\left (\frac{S1/F1}{S2/F2}\right )$ #' $=\ln\left (\frac{S1\times F2}{S2\times F1}\right )$, S=success, F=failure for 2 binary variables, $1$ and $2$

#' #' See also [SMHS EBook, Power, Sensitivity and Specificity section](http://wiki.socr.umich.edu/index.php/SMHS_PowerSensitivitySpecificity). #' #' ## Confusion matrices #' #' We talked about this type of matrix in [Chapter 8](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/08_DecisionTreeClass.html). For binary classes, there will be a $2\times 2$ matrix. Each of the cells have a specific meaning. #' #' Graph $2\times 2$ table: #' #' require(knitr) item_table = data.frame(predict_T = c("TP","FP"),predict_F = c("TN","FN")) rownames(item_table) = c("TRUE","FALSE") kable(item_table,caption = "cross table") #' #' #' * **True Positive**(TP): Number of observations that correctly classified as "yes" or "success". #' #' * **True Negative**(TN): Number of observations that correctly classified as "no" or "failure". #' #' * **False Positive**(FP): Number of observations that incorrectly classified as "yes" or "success". #' #' * **False Negative**(FN): Number of observations that incorrectly classified as "no" or "failure". #' #' **Using confusion matrices to measure performance** #' #' The way we calculate accuracy using these four cells is summarized by the following formula: #' $$accuracy=\frac{TP+TN}{TP+TN+FP+FN}=\frac{TP+TN}{\text{Total number of observations}}$$ #' On the other hand, the error rate, or proportion of incorrectly classified observations is calculated using: #' $$error rate=\frac{FP+FN}{TP+TN+FP+FN}==\frac{FP+FN}{\text{Total number of observations}}=1-accuracy$$ #' If we look at the numerator and denominator carefully, we can see that the error rate and accuracy add up to 1. Therefore, a 95% accuracy means 5% error rate. #' #' In R, we have multiple ways to obtain confusion table. The simplest way would be `table()`. For example, in [Chapter 7](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/07_NaiveBayesianClass.html), to get a plain $2\times 2$ table reporting the agreement between the real clinical cancer labels and their machine learning predicted counterparts, we used: #' #' hn_test_pred<-predict(hn_classifier, hn_test) table(hn_test_pred, hn_med_test$stage) #' #' #' Then why did we use `CrossTable()` function back in [Chapter 7](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/07_NaiveBayesianClass.html)? Because it reports more useful information about the model performances. #' #' library(gmodels) CrossTable(hn_test_pred, hn_med_test$stage) #' #' #' With both tables, we can calculate accuracy and error rate by hand. #' #' accuracy<-(69+0)/100 accuracy error_rate<-(23+8)/100 error_rate 1-accuracy #' #' #' For matrices that are larger than $2\times 2$, all diagonal elements count the observations that are correctly classified and the off-diagonal elements represent incorrectly labeled cases. #' #' ## Other measures of performance beyond accuracy #' #' So far we discussed two performance methods - table and cross-table. A third function is `confusionMatrix()` which provides the easiest way to report model performance. Notice that the first argument is an *actual vector of the labels*, i.e., $Test\_Y$ and the second argument, of the same length, represents the *vector of predicted labels*. #' #' This example was presented as the first case-study in [Chapter 8](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/08_DecisionTreeClass.html). #' #' library(caret) qol_pred<-predict(qol_model, qol_test) confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease") #' #' #' ### The kappa ( $\kappa$ ) statistic #' #' The Kappa statistic was [originally developed to measure the reliability between two human raters](http://wiki.socr.umich.edu/index.php/SMHS_ReliabilityValidity). It can be harnessed in machine learning applications to compare the accuracy of a classifier, where `one rater` represents the ground truth (for labeled data, these are the actual values of each instance) and the `second rater` represents the results of the automated machine learning classifier. The order of listing the **raters** is irrelevant. #' #' Kappa statistic measures the **possibility of a correct prediction by chance alone** and answers the question of `How much better is the agreement (between the ground truth and the machine learning prediction) than would be expected by chance alone?` Its value is between $0$ and $1$. When $\kappa=1$, we have a perfect agreement between a **computed** prediction (typically the result of a model-based or model-free technique forecasting an outcome of interest) and an **expected** prediction (typically random, by-chance, prediction). A common interpretation of the Kappa statistics includes: #' #' * Poor agreement: less than 0.20 #' * Fair agreement: 0.20-0.40 #' * Moderate agreement: 0.40-0.60 #' * Good agreement: 0.60-0.80 #' * Very good agreement: 0.80-1 #' #' In the above `confusionMatrix` output, we have a fair agreement. For different problems, we may have different interpretations of Kappa statistics. #' #' To understand Kappa statistic better, let's look at its formula: #' #' $$kappa=\frac{P(a)-P(e)}{1-P(e)}$$ #' #' `P(a)` and `P(e)` simply denote probability of **actual** and **expected** agreement between the classifier and true values. Let's use the Quality of Life `cd` prediciton (qol_pred) for computing the kappa statistics by hand. #' #' table(qol_pred, qol_test$cd) #' #' #' According to above table, actual agreement is the accuracy: #' p_a<-(149+131)/(149+89+74+131) p_a #' #' #' The manually and automatically computed accuracies coincide (0.6321). It may be trickier to obtain the expected agreement. Probability rules tell us that the probability of two independent events occurring at the same time equals to the product of the individual (marginal) probabilities for these two events. Thus, we have: #' #' *P(expect agreement for minor_disease)=P(actual type is minor_disease) x P(predicted type is minor_disease)* #' #' Similarly: #' #' *P(expect agreement for severe_disease)=P(actual type is severe_disease) x P(predicted type is severe_disease)* #' #' In our case: #' #' p_e_minor<- (149+131)/(149+89+74+131)*(149+89)/(149+89+74+131) p_e_severe<-(89+74)/(149+89+74+131) * (74+131)/(149+89+74+131) p_e<-p_e_minor+p_e_severe p_e #' #' #' Plug `p_a` and `p_e` into the formula we get: #' #' kappa<-(p_a-p_e)/(1-p_e) kappa #' #' #' We get a similar value in the `confusionTable()` output. A more straight forward way of getting the Kappa statistics is by using `Kappa()` function in the `vcd` package. #' #' #install.packages(vcd) library(vcd) Kappa(table(qol_pred, qol_test$cd)) #' #' #' The combination of `Kappa()` and `table` function yields a $2\times 4$ matrix. The *Kappa statistic* is under the unweighted value. #' #' Generally speaking, predicting a severe disease outcome is a more critical problem than predicting a mild disease state. Thus, weighted Kappa is also useful. We give the severe disease a higher weight. The Kappa test result is not acceptable since the classifier may make too many mistakes for the severe disease cases. The Kappa value is only $-0.0714$. Notice that the range of Kappa is not [0,1] for weighted Kappa. #' #' Kappa(table(qol_pred, qol_test$cd),weights = matrix(c(1,10,1,10),nrow=2)) #' #' #' When the predicted value is the first argument, the row and column names represent the **true labels** and the **predicted labels**, respectively. #' #' table(qol_pred, qol_test$cd) #' #' #' #### Summary of the Kappa score for calculating prediction accuracy #' #' Kappa compares an **Observed classification accuracy** (output of our ML classifier) with an **Expected classification accuracy** (corresponding to random chance classification). It may be used to evaluate single classifiers and/or to compare among a set of different classifiers. It takes into account random chance (agreement with a random classifier). That makes **Kappa** more meaningful than simply using the **accuracy** as a single quality metric. For instance, the interpretation of an `Observed Accuracy of 80%` is **relative** to the `Expected Accuracy`. `Observed Accuracy of 80%` is more impactful for an `Expected Accuracy of 50%` compared to `Expected Accuracy of 75%`. #' #' #### Computation of Observed Accuracy and Expected Accuracy #' #' Consider the following example of a `classifier` generating the following `confusion matrix`. Columns represent the **true labels** and rows represent the **classifier-derived labels** for this binary prediction example. #' #' Class | True | False | Total #' ------|------|-------|------ #' T | 50 | 35 | 85 #' F | 25 | 40 | 65 #' Total | 75 | 75 | 150 #' #' In this example, there are a total of 150 observations instances total ($50+35+25+40$). In reality, 75 are labeled as **True** ($50 + 25$) and another 75 are labeled as **False** ($35 + 40$). The classifier labeled 85 as **True** ($50 + 35$) and the other 65 as **False** ($25 + 40$). #' #' * Observed Accuracy (OA) is the `proportion of instances` that were classified correctly throughout the entire confusion matrix: #' $$OA = \frac{50 + 40}{150} = 0.6$$ #' * Expected Accuracy (EA) is the accuracy that any random classifier would be expected to achieve based on the given confusion matrix. EA is the `proportion of instances` of each class (**True** and **False**), along with the number of instances that the automated classifier agreed with the ground truth label. The EA is calculated by multiplying the marginal frequencies of **True** for the true-state and the machine classified instances, and dividing by the total number of instances. The marginal frequency of **True** for the **true-state** is 75 ($50 + 25$) and for the corresponding ML classifier is 85 ($50 + 35$). Then, the expected accuracy for the **True** outcome is: #' $$EA(True) = \frac{75 \times 85}{150} = 42.5$$ #' #' We similarly compute the $EA(False)$ for the second, **False**, outcome, by using the marginal frequencies for the true-state ($(False|true-state) = 75 = 50 + 25$) and the ML classifier $(False|classifier) = 65 (40+25)$. Then, the expected accuracy for the **True** outcome is: #' #' $$EA(False) = \frac{75 \times 65}{150} = 32.5$$ #' #' Finally, the $EA = \frac{EA(True) + EA(False)}{150}$ #' #' $$Expected Accuracy (EA) = \frac{42.5 + 32.5}{150} = 0.5$$ #' #' Note that $EA = 0.5$ whenever the `true-state` binary classification is balanced (in reality, the frequencies of **True** and **False** are equal, in our case 75). #' #' The calculation of the **kappa statistic** relies on $OA=0.6$ and $EA=0.5$: #' $$ (Kappa)\text{ } \kappa = \frac{OA - EA}{1 - EA} = \frac{0.6-0.5}{1-0.5}=0.2.$$ #' #' ### Sensitivity and specificity #' #' Take a closer look at the `confusionMatrix()` output where we can find two important statistics - "sensitivity" and "specificity". #' #' Sensitivity or true positive rate measures the proportion of "success" observations that are correctly classified. #' $$sensitivity=\frac{TP}{TP+FN}.$$ #' Notice $TP+FN$ are the total number of true "success" observations. #' #' On the other hand, specificity or true negative rate measures the proportion of "failure" observations that are correctly classified. #' $$sensitivity=\frac{TN}{TN+FP}.$$ #' Accordingly, $TN+FP$ are the total number of true "failure" observations. #' #' In the QoL data, considering "severe_disease" as "success" and using the `table()` output we can manually compute the *sensitivity* and *specificity*, as well as *precision* and *recall* (below): #' #' sens<-131/(131+89) sens spec<-149/(149+74) spec #' #' #' Another R package `caret` also provides functions to directly calculate the sensitivity and specificity. #' #' library(caret) sensitivity(qol_pred, qol_test$cd, positive="severe_disease") # specificity(qol_pred, qol_test$cd) #' #' #' Sensitivity and specificity both range from 0 to 1. For either measure, a values of 1 imply that the positive and negative predictions are very accurate. However, simultaneously high sensitivity and specificity may not be attainable in real world situations. There is a tradeoff between sensitivity and specificity. To compromise, some studies loosen the demands on one and focus on achieving high values on the other. #' #' ###Precision and recall #' #' Very similar to sensitivity, *precision* measures the proportion of true "success" observations among predicted "success" observations. #' $$precision=\frac{TP}{TP+FP}.$$ #' *Recall* is the proportion of true "success" among all "success". It has the same formula with sensitivity but different meaning. Recall measures how complete the results are. A model with high recall captures most "interesting" cases. #' $$recall=\frac{TP}{TP+FN}.$$ #' Again, let's calculate these by hand for the QoL data: #' #' prec<-131/(131+74) prec recall<-131/(131+89) recall #' #' #' Another way to obtain *precision* would be `posPredValue()` under `caret` package. Remember to specify which one is the "success" class. #' #' posPredValue(qol_pred, qol_test$cd, positive="severe_disease") #' #' #' From the definitions of **precision** and **recall**, we can derive the type 1 error and type 2 errors as follow: #' #' $$error_1 = 1- Precision = \frac{FP}{TP+FP}.$$ #' #' $$error_2 = 1- Recall = \frac{FN}{TP+FN}.$$ #' #' Thus, we can compute the type 1 error ($0.36$) and type 2 error ($0.40$). #' #' error1<-74/(131+74) error2<-89/(131+89) error1; error2 #' #' #' ### The F-measure #' #' The F-measure, or F1-score, combines precision and recall using the harmonic mean assuming equal weights. High F1-score means high precision and high recall. This is a convenient way of measuring model performances and comparing models. #' $$F1=\frac{2\times precision\times recall}{recall+precision}=\frac{2\times TP}{2\times TP+FP+FN}$$ #' If calculating the F1-score by hand, using the Quality of Life prediction: #' #' f1<-(2*prec*recall)/(prec+recall) f1 #' #' #' The direct calculations of the F1-statistics can be obtained using `caret`: #' #' precision <- posPredValue(qol_pred, qol_test$cd, positive="severe_disease") recall <- sensitivity(qol_pred, qol_test$cd, positive="severe_disease") F1 <- (2 * precision * recall) / (precision + recall); F1 #' #' #' # Visualizing performance tradeoffs (ROC Curve) #' #' Another choice for evaluating classifiers' performance is by graphs rather than statistics. Graphs are usually more comprehensive than single statistics. #' #' In R there is a package providing user-friendly functions for visualizing model performance. Details could be find on the [ROCR website]( http://rocr.bioinf.mpi-sb.mpg.de). #' #' Here we evaluate the model performance for the Quality of Life case study in [Chapter 8](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/08_DecisionTreeClass.html). #' #' #install.packages("ROCR") library(ROCR) pred<-ROCR::prediction(predictions=pred_prob[, 2], labels=qol_test$cd) # avoid naming collision (ROCR::prediction), as # there is another prediction function in the neuralnet package. #' #' #' `pred_prob[, 2]` is the probability of classifying each observation as "severe_disease". The above code saved all the model prediction information into the object `pred`. #' #' The ROC (Receiver Operating Characteristic) curves are often used for examine the trade-off between detecting true positives and avoiding the false positives. #' #' curve(log(x), from=0, to=100, xlab="False Positive Rate", ylab="True Positive Rate", main="ROC curve", col="green", lwd=3, axes=F) Axis(side=1, at=c(0, 20, 40, 60, 80, 100), labels = c("0%", "20%", "40%", "60%", "80%", "100%")) Axis(side=2, at=0:5, labels = c("0%", "20%", "40%", "60%", "80%", "100%")) segments(0, 0, 110, 5, lty=2, lwd=3) segments(0, 0, 0, 4.7, lty=2, lwd=3, col="blue") segments(0, 4.7, 107, 4.7, lty=2, lwd=3, col="blue") text(20, 4, col="blue", labels = "Perfect Classifier") text(40, 3, col="green", labels = "Test Classifier") text(70, 2, col="black", labels= "Classifier with no predictive value") #' #' #' The blue line in the above graph means the perfect classifier where we have 0% false positive and 100% true positive. The middle green line is the test classifier. Most of our classifiers trained by real data will look like this. The black diagonal line illustrates a classifier with no predictive value predicts. We can see that it has same true positive rate and false positive rate. Thus, it cannot distinguish between the two. #' #' In terms of identifying true positive values, the ROC curve should be as near to the (blue) *perfect classifier* line. Thus, we measure the area under the ROC curve (abbreviated as AUC) as a proxy of how perfect the classifier is. To do this we have to change the scale of the graph above. Mapping 100% to 1, we have a $1\times1$ square. The area under perfect classifier would be 1 and area under classifier with no predictive value being 0.5. Then, 1 and 0.5 will be the upper and lower limits for our model ROC curve. We have the following scoring system (numbers indicate area under curve) for model ROC curves: #' #' * Outstanding: 0.9-1.0 #' * Excellent/good: 0.8-0.9 #' * Acceptable/fair: 0.7-0.8 #' * Poor: 0.6-0.7 #' * No discrimination: 0.5-0.6 #' #' Note that this rating system is somewhat subjective. We can use the `ROCR` package to draw ROC curves. #' #' roc<-performance(pred, measure="tpr", x.measure="fpr") #' #' #' By specify a "performance" object by providing `"tpr"` (True positive rate) and `"fpr"` (False positive rate): #' #' plot(roc, main="ROC curve for Quality of Life Model", col="blue", lwd=3) segments(0, 0, 1, 1, lty=2) #' #' #' The segments command give us the dash line representing classifier with no predictive value. #' #' To measure this quantitatively we need to create a new performance object with `measure="auc"` or area under the curve. #' #' roc_auc<-performance(pred, measure="auc") #' #' #' Now the `roc_auc` is stored as a [S4 object](http://adv-r.had.co.nz/OO-essentials.html). This is quite different than data frame and matrices. First, we can use `str()` function to see its structure. #' #' str(roc_auc) #' #' #' It has 6 members, or "slots". The AUC value is stored in `y.values`. To extract that we use `@` symbol according to `str()` output. #' #' roc_auc@y.values #' #' #' Thus, the obtained $AUC=0.65$, which suggests a fair classifier, according to the above scoring schema. #' #' # Estimating future performance (internal statistical validation) #' #' The evaluation method we have talked about are all measuring re-substitution error. That is building the model on training data and measuring the model error on test data. This is one way of dealing with unseen data. First, let's introduce this method in detail. #' #' ## The holdout method #' #' The idea is to partition one data into two separate datasets. Using one of them to create the model and the other to test the model performances. In practice, we usually use a fraction (e.g., $50\%$, or $\frac{2}{3}$) of our data for training the model, and reserve the rest (e.g., $50\%$, or $\frac{1}{3}$) for testing. Note that the testing data may also be further split into proportions for internal repeated (e.g., cross-validation) testing and final external (independent) testing. #' #' The partition has to be randomized. In R, the best way of doing this is to create a parameter that randomly draws numbers and use this parameter to extract random rows from the original dataset. In [Chapter 10](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/10_ML_NN_SVM_Class.html), we used this method to partition the *Google Trends* data. #' #' sub<-sample(nrow(google_norm), floor(nrow(google_norm)*0.75)) google_train<-google_norm[sub, ] google_test<-google_norm[-sub, ] #' #' #' Another way of partition is using `createDatePartition()` under `caret` package. Instead of using original dataset alone, `google_norm$RealEstate` or the independent variable column of the original dataset is #' #' sub<-createDataPartition(google_norm$RealEstate, p=0.75, list = F) google_train<-google_norm[sub, ] google_test<-google_norm[-sub, ] #' #' #' To make sure that the model can be applied to future datasets, we can partition the original dataset into three separate subsets. In this way, we have two subsets for testing. The additional validation dataset can alleviate the probability that we have a good model due to chance (non-representative subsets). A common split among training, test, and validation subsets would be 50%, 25%, and 25% respectively. #' #' google<-read.csv("https://umich.instructure.com/files/416274/download?download_frd=1", stringsAsFactors = F) google<-google[, -c(1, 2)] normalize <- function(x) { return((x - min(x)) / (max(x) - min(x))) } google_norm<-as.data.frame(lapply(google, normalize)) #' #' #' sub<-sample(nrow(google_norm), floor(nrow(google_norm)*0.50)) google_train<-google_norm[sub, ] google_test<-google_norm[-sub, ] sub1<-sample(nrow(google_test), floor(nrow(google_test)*0.5)) google_test1<-google_test[sub1, ] google_test2<-google_test[-sub1, ] nrow(google_norm) nrow(google_train) nrow(google_test1) nrow(google_test2) #' #' #' However, when we only have a very small dataset, it's difficult to split off too much data as this reduces the sample further. There are the following two options for evaluation of model performance with unseen data. These are implemented in the `caret` package. #' #' ## Cross-validation #' #' For complete details see the [DSPA Cross-Validation (Chapter 20)](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/20_PredictionCrossValidation.html). Below, we describe the fundamentals of cross-validation as an internal statistical validation technique. #' #' This technique is known as *k-fold cross-validation* or *k-fold CV*, which is a standard for estimating model performance. K-fold CV randomly dived the original data into *k* separate random subsets called folds. #' #' A common practice is to use `k=10` or 10-fold CV. That is to split the data into 10 different subsets. Each time using one of the subsets to be the test set and the rest to build the model. #' `createFolds()` under `caret` package will help us to do so. `seet.seed()` insures the folds created are the same if you run the code line twice. `1234` is just a random number. You can use any number for `set.seed()`. We use the normalized Google Trend dataset in this section. #' #' library("caret") set.seed(1234) folds<-createFolds(google_norm$RealEstate, k=10) str(folds) #' #' #' Another way to cross-validate is to use `cv_partition()` in package `sparsediscrim`. #' #' # install.packages("sparsediscrim") require(sparsediscrim) folds2 = cv_partition(1:nrow(google_norm), num_folds=10) #' #' #' And the structure of folds may be reported by: #' #' str(folds2) #' #' #' Now, we have 10 different subsets in `folds` object. We can use `lapply()` to fit the model. 90% of data will be used for training so we use `[-x, ]` to represent all observations not in a specific fold. In [Chapter 10](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/10_ML_NN_SVM_Class.html) we showed building a neutral network model for the *Google Trends* data. We can do the same for each fold manually, train, test, aggregate the results, and report the agreement (correlations between the predicted and observed RealEstate values). #' #' library(neuralnet) fold_cv<-lapply(folds, function(x){ google_train<-google_norm[-x, ] google_test<-google_norm[x, ] google_model<-neuralnet(RealEstate~Unemployment+Rental+Mortgage+Jobs+Investing+DJI_Index+StdDJI, data=google_train) google_pred<-compute(google_model, google_test[, c(1:2, 4:8)]) pred_results<-google_pred$net.result pred_cor<-cor(google_test$RealEstate, pred_results) return(pred_cor) }) str(fold_cv) #' #' #' From the output, we know that in most of the folds the model predicts very well. In a typical run, one fold may yield bad results. We can use the *mean* of these 10 correlations to represent the *overall* model performance. But first, we need to use `unlist()` function to transform `fold_cv` into a vector. #' #' mean(unlist(fold_cv)) #' #' #' This correlation is high suggesting strong association between predicted and true values. Thus, the model is very good in terms of its prediction. #' #' ## Bootstrap sampling #' #' The second method is called *bootstrap sampling*. In k-fold CV, each observation can only be used once. However, bootstrap sampling is a sampling process *with replacement*. Before selecting a new sample, it recycles every observation so that each observation could be appear in multiple folders. #' #' A very special setting of bootstrap uses at each iteration 63.2% of the original data as our training dataset and the remaining 36.8% as the test dataset. Thus, compared to k-fold CV, bootstrap sampling is less representative of the full dataset. A special case of bootstrapping *0.632 bootstrap* address this issue with changing the final performance measurement to the following formula: #' $$error=0.632\times error_{test}+0.368\times error_{train}.$$ #' This synthesizes the optimistic model performance on training data with the pessimistic model performance on test data by weighting hte corresponding errors. This method is extremely good for small samples. #' #' To see the rationale behind *0.632 bootstrap*, consider a standard training set $T$ of cardinality $n$ where our bootstrapping sampling generates $m$ new training sets $T_i$, each of size $n'$. Sampling from $T$ is uniform *with replacement* suggesting that some observations may be repeated in each sample $T_i$. Suppose the size of the sub-samples are of the same order as $T$, i.e., $n'=n$, then for large $n$ the sample $D_{i}$ is *expected* to have $\left (1 - \frac{1}{e}\right ) \sim 0.632$ unique cases from the complete original collection $T$, the remaining proportion $0.368$ are expected to be repeated duplicates. Hence the name *0.632 bootstrap* sampling. In general, for large $n>>n'$, the sample $D_{i}$ is *expected* to have $n\left ( 1-e^{-n'/n})\right )$ unique cases, see [On Estimating the Size and Confidence of a Statistical Audit](http://people.csail.mit.edu/rivest/pubs/APR07.pdf)). #' #' Having the bootstrap samples, the $m$ models can be fitted (estimated) and aggregated, e.g., by averaging the outputs (for regression) or using voting methods (for classification). We will discuss this more in later chapters.