In previous chapters, we used prediction accuracy to evaluate classification models. However, accurate predictions in one dataset do not necessarily imply that our model is perfect or that it will reproduce when tested on external data. We need additional metrics to evaluate model performance and to make sure it is robust, reproducible, reliable, and unbiased.
In this chapter, we will discuss (1) various evaluation strategies for prediction, clustering, classification, regression, and decision trees, (2) visualization of ROC curves and performance tradeoffs, and (3) estimation of future performance, internal statistical cross-validation and bootstrap sampling.
Prediction accuracy represents just one view of the evaluation of classification model performance or the reliability of clustering methods. Different classification models and alternative clustering techniques may be appropriate for different situations. For example, when screening newborns for rare genetic defects, we may want the model to have as few false-negatives as possible. We don’t want to classify anyone as a “non-carrier” when they actually may have a defective gene, since early treatment might impact near- and long-term life experiences.
We can use the following three types of data to evaluate the performance of a classifier model.
We are familiar with the first two cases. The last type of validation relies on the predict(model, test_data) function that we discussed in the previous classification and prediction chapters (Chapters 6-8). Let’s revisit the model and test data we examined in Chapter 7 - the Inpatient Head and Neck Cancer Medication data. We will demonstrate prediction probability estimation using this case-study, CaseStudy14_HeadNeck_Cancer_Medication.csv.
hn_med <- read.csv("https://umich.instructure.com/files/1614350/download?download_frd=1", stringsAsFactors = FALSE)
hn_med$seer_stage <- factor(hn_med$seer_stage)
require(tm)
hn_med_corpus <- Corpus(VectorSource(hn_med$MEDICATION_SUMMARY))
corpus_clean <- tm_map(hn_med_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
hn_med_dtm <- DocumentTermMatrix(corpus_clean)
hn_med_train <- hn_med[1:562, ]
hn_med_test <- hn_med[563:662, ]
hn_med_dtm_train <- hn_med_dtm[1:562, ]
hn_med_dtm_test <- hn_med_dtm[563:662, ]
corpus_train <- corpus_clean[1:562]
corpus_test <- corpus_clean[563:662]
hn_med_train$stage <- hn_med_train$seer_stage %in% c(4, 5, 7)
hn_med_train$stage <- factor(hn_med_train$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
hn_med_test$stage <- hn_med_test$seer_stage %in% c(4, 5, 7)
hn_med_test$stage <- factor(hn_med_test$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
  return(x)
}
hn_med_dict <- as.character(findFreqTerms(hn_med_dtm_train, 5))
hn_train <- DocumentTermMatrix(corpus_train, list(dictionary=hn_med_dict))
hn_test <- DocumentTermMatrix(corpus_test, list(dictionary=hn_med_dict))
hn_train <- apply(hn_train, MARGIN = 2, convert_counts)
hn_test <- apply(hn_test, MARGIN = 2, convert_counts)
library(e1071)
hn_classifier <- naiveBayes(hn_train, hn_med_train$stage)
pred_raw <- predict(hn_classifier, hn_test, type="raw")
head(pred_raw)
## early_stage later_stage
## [1,] 0.8812634 0.11873664
## [2,] 0.8812634 0.11873664
## [3,] 0.8425828 0.15741724
## [4,] 0.9636007 0.03639926
## [5,] 0.8654298 0.13457022
## [6,] 0.8654298 0.13457022
The above output includes the prediction probabilities for the first 6 rows of the data. This example is based on the Naive Bayes classifier (naiveBayes); however, the same approach works for any other machine learning classification or prediction technique. The type="raw" argument indicates that the prediction call returns the conditional a posteriori probabilities for each class. If type="class", predict returns the class with maximal probability.
In addition, we can report the predicted class labels from the Naive Bayesian decision-support system (hn_classifier <- naiveBayes(hn_train, hn_med_train$stage)):
pred_nb <- predict(hn_classifier, hn_test)
head(stats::ftable(pred_nb))
##
## "pred_nb" "early_stage" "later_stage"
##
## 96 4
The general predict() method automatically dispatches to the class-specific predict.naiveBayes(object, newdata, type = c("class", "raw"), threshold = 0.001, ...) call, where type="raw" and type="class" specify the output as the conditional a posteriori probabilities for each class or the class with maximal probability, respectively. Back in Chapter 8, we discussed the C5.0 and randomForest classifiers to predict the chronic disease score in another case-study, Quality of Life (QoL).
Below are the (probability) results of the C5.0
classification tree model prediction:
pred_prob <- predict(qol_model, qol_test, type="prob")
head(pred_prob)
## minor_disease severe_disease
## 1 0.3762353 0.6237647
## 3 0.3762353 0.6237647
## 4 0.2942230 0.7057770
## 9 0.7376664 0.2623336
## 10 0.3762353 0.6237647
## 12 0.3762353 0.6237647
These can be contrasted against the C5.0
tree classification label results:
pred_tree <- predict(qol_model, qol_test)
head(pred_tree); head(stats::ftable(pred_tree))
## [1] severe_disease severe_disease severe_disease minor_disease severe_disease
## [6] severe_disease
## Levels: minor_disease severe_disease
##
## "pred_tree" "minor_disease" "severe_disease"
##
## 214 229
The same complementary types of outputs can be reported for most machine learning classification and prediction approaches.
In Chapter 6, we saw an attempt to categorize the supervised classification and unsupervised clustering methods. Similarly, the table below summarizes the basic types of evaluation and validation strategies for different forecasting, prediction, ensembling, and clustering techniques. (Internal) statistical cross-validation or external validation should always be applied to ensure the reliability and reproducibility of the results. The scikit-learn clustering performance evaluation and classification metrics sections provide details about many alternative techniques and metrics for performance evaluation of clustering and classification methods.
Inference | Outcome | Evaluation Metrics | R functions |
---|---|---|---|
Classification & Prediction | Binary | Accuracy, Sensitivity, Specificity, PPV/Precision, NPV/Recall, LOR | caret::confusionMatrix , gmodels::CrossTable , cluster::silhouette |
Classification & Prediction | Categorical | Accuracy, Sensitivity/Specificity, PPV, NPV, LOR, Silhouette Coefficient | caret::confusionMatrix , gmodels::CrossTable , cluster::silhouette |
Regression Modeling | Real Quantitative | correlation coefficient, \(R^2\), RMSE, Mutual Information, Homogeneity and Completeness Scores | cor , Metrics::mse |
More details about binary test assessment are available on the Scientific Methods for Health Sciences (SMHS) EBook site. The table below summarizes the key measures commonly used to evaluate the performance of binary tests, classifiers, or predictions.
Test Result \ Actual Condition | Absent (\(H_0\) is true) | Present (\(H_1\) is true) | |
---|---|---|---|
Negative (fail to reject \(H_0\)) | TN: Condition absent + Negative result = True (accurate) Negative | FN: Condition present + Negative result = False (invalid) Negative; Type II error (proportional to \(\beta\)) | \(NPV=\frac{TN}{TN+FN}\) |
Positive (reject \(H_0\)) | FP: Condition absent + Positive result = False Positive; Type I error (\(\alpha\)) | TP: Condition present + Positive result = True Positive | \(PPV=Precision=\frac{TP}{TP+FP}\) |
Test Interpretation | \(Specificity=\frac{TN}{TN+FP}\) | \(Power=Recall=Sensitivity=1-\beta=\frac{TP}{TP+FN}\) | \(LOR=\ln\left (\frac{TN\times TP}{FP\times FN}\right )\) |
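As a quick numeric illustration of these formulas, the following short sketch (in Python, for a language-agnostic check) evaluates each metric on the counts TP=157, TN=143, FP=72, FN=71, borrowed from the QoL confusion matrix that appears later in this chapter:

```python
import math

# Counts borrowed from the QoL confusion matrix (any 2x2 test outcome works)
TP, TN, FP, FN = 157, 143, 72, 71

sensitivity = TP / (TP + FN)   # power = recall = true positive rate
specificity = TN / (TN + FP)   # true negative rate
PPV = TP / (TP + FP)           # precision
NPV = TN / (TN + FN)
LOR = math.log((TN * TP) / (FP * FN))  # log odds-ratio

print(round(sensitivity, 4), round(specificity, 4))  # 0.6886 0.6651
print(round(PPV, 4), round(NPV, 4))                  # 0.6856 0.6682
```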
See also SMHS EBook, Power, Sensitivity and Specificity section.
We talked about this type of matrix in Chapter 8. For binary classes, the confusion matrix is a \(2\times 2\) matrix, and each of its cells has a specific meaning.
The generic \(2\times 2\) table has the following structure:
Actual \ Predicted | predict_T | predict_F |
---|---|---|
TRUE | TP | FN |
FALSE | FP | TN |
True Positive (TP): Number of observations correctly classified as “yes” or “success”.
True Negative (TN): Number of observations correctly classified as “no” or “failure”.
False Positive (FP): Number of observations incorrectly classified as “yes” or “success”.
False Negative (FN): Number of observations incorrectly classified as “no” or “failure”.
Using confusion matrices to measure performance
The way we calculate accuracy using these four cells is summarized by the following formula: \[accuracy=\frac{TP+TN}{TP+TN+FP+FN}=\frac{TP+TN}{\text{Total number of observations}}.\] On the other hand, the error rate, or the proportion of incorrectly classified observations, is calculated using: \[\text{error rate}=\frac{FP+FN}{TP+TN+FP+FN}=\frac{FP+FN}{\text{Total number of observations}}=1-accuracy.\] Looking at the numerator and denominator carefully, we can see that the error rate and accuracy add up to 1. Therefore, a 95% accuracy implies a 5% error rate.
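To make the complementarity of accuracy and error rate concrete, here is a minimal sketch (in Python, for a language-agnostic check) using the counts from the QoL confusion matrix that appears later in this chapter:

```python
# Counts from the QoL confusion matrix used later in this chapter
TP, TN, FP, FN = 157, 143, 72, 71

total = TP + TN + FP + FN
accuracy = (TP + TN) / total     # proportion of correct classifications
error_rate = (FP + FN) / total   # proportion of incorrect classifications

print(round(accuracy, 4))    # 0.6772
print(round(error_rate, 4))  # 0.3228, i.e., 1 - accuracy
```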
In R, we have multiple ways to obtain a confusion table; the simplest is table(). For example, in Chapter 7, to get a plain \(2\times 2\) table reporting the agreement between the real clinical cancer labels and their machine learning predicted counterparts, we used:
hn_test_pred <- predict(hn_classifier, hn_test)
table(hn_test_pred, hn_med_test$stage)
##
## hn_test_pred early_stage later_stage
## early_stage 73 23
## later_stage 4 0
The reason we sometimes use the gmodels::CrossTable() function (e.g., see Chapter 7) is that it reports additional information about the model performance.
library(gmodels)
CrossTable(hn_test_pred, hn_med_test$stage)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | hn_med_test$stage
## hn_test_pred | early_stage | later_stage | Row Total |
## -------------|-------------|-------------|-------------|
## early_stage | 73 | 23 | 96 |
## | 0.011 | 0.038 | |
## | 0.760 | 0.240 | 0.960 |
## | 0.948 | 1.000 | |
## | 0.730 | 0.230 | |
## -------------|-------------|-------------|-------------|
## later_stage | 4 | 0 | 4 |
## | 0.275 | 0.920 | |
## | 1.000 | 0.000 | 0.040 |
## | 0.052 | 0.000 | |
## | 0.040 | 0.000 | |
## -------------|-------------|-------------|-------------|
## Column Total | 77 | 23 | 100 |
## | 0.770 | 0.230 | |
## -------------|-------------|-------------|-------------|
##
##
The second entry in each cell of the CrossTable output reports the Chi-square contribution. This uses the standard Chi-square formula for computing the relative discrepancy between observed and expected counts. For instance, the Chi-square contribution of cell(1,1), (hn_med_test$stage=early_stage
and hn_test_pred=early_stage
), can be computed from the \(\frac{(Observed-Expected)^2}{Expected}\) formula. Assuming independence between the rows and columns (i.e., random classification), the expected cell(1,1) value is computed as the product of the corresponding row (96) and column (77) marginal counts, divided by the total, \(\frac{96\times 77}{100}\). Thus, the Chi-square value for cell(1,1) is:
\[\text{Chi-square cell(1,1)} = \frac{(Observed-Expected)^2}{Expected}=\] \[=\frac{\left (73-\frac{96\times 77}{100}\right ) ^2}{\frac{96\times 77}{100}}=0.01145022.\]
Note that each cell Chi-square value represents one of the four (in this case) components of the Chi-square test-statistics, which tries to answer the question if there is no association between observed and predicted class labels. That is, under the null-hypothesis there is no association between actual and observed counts for each level of the factor variable, which allows us to quantify whether the derived classification jibes with the real class annotations (labels). The aggregate sum of all Chi-square values represents the \(\chi_o^2 = \displaystyle\sum_{all-categories}{(O-E)^2 \over E} \sim \chi_{(df)}^2\) statistics, where \(df = (\# rows - 1)\times (\# columns - 1)\).
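The cell-by-cell contributions reported by CrossTable can be reproduced directly from the observed counts. The following sketch (in Python, for a language-agnostic check) recomputes each contribution and the aggregate \(\chi^2\) statistic for the table above:

```python
# Observed counts from the CrossTable above (rows: predicted, columns: actual)
obs = [[73, 23], [4, 0]]

n = sum(sum(row) for row in obs)                                 # 100
row_tot = [sum(row) for row in obs]                              # [96, 4]
col_tot = [sum(obs[r][c] for r in range(2)) for c in range(2)]   # [77, 23]

chi2 = 0.0
for r in range(2):
    for c in range(2):
        expected = row_tot[r] * col_tot[c] / n       # marginal product / total
        contrib = (obs[r][c] - expected) ** 2 / expected
        chi2 += contrib
        print(r + 1, c + 1, round(contrib, 3))       # matches the CrossTable entries

print(round(chi2, 3))  # aggregate chi-square statistic
```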
Using either table (CrossTable, confusionMatrix), we can calculate accuracy and error rate by hand.
accuracy <- (73+0)/100; accuracy
## [1] 0.73
error_rate <- (23+4)/100; error_rate
## [1] 0.27
1-accuracy
## [1] 0.27
For matrices that are larger than \(2\times 2\), all diagonal elements count the observations that are correctly classified and the off-diagonal elements represent incorrectly labeled cases.
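A minimal sketch of this multi-class generalization (in Python, for a language-agnostic check), using a hypothetical \(3\times 3\) confusion matrix whose counts are made up purely for illustration:

```python
# Hypothetical 3x3 confusion matrix (rows: actual, columns: predicted) - counts assumed
cm = [[50, 3, 2],
      [5, 40, 5],
      [2, 8, 35]]

# Diagonal elements are correctly classified; off-diagonal are mislabeled cases
correct = sum(cm[i][i] for i in range(3))
total = sum(sum(row) for row in cm)
accuracy = correct / total

print(round(accuracy, 3))  # 0.833
```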
So far, we have discussed two performance reporting methods - table() and CrossTable(). A third function, caret::confusionMatrix(), provides the easiest way to report model performance. Notice that in the call confusionMatrix(data, reference), the first argument is the vector of predicted labels and the second argument, of the same length, is the vector of actual labels (the reference, i.e., \(Test\_Y\)).
This example was presented as the first case-study in Chapter 8.
library(caret)
qol_pred <- predict(qol_model, qol_test)
confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease")
## Confusion Matrix and Statistics
##
##
## qol_pred minor_disease severe_disease
## minor_disease 143 71
## severe_disease 72 157
##
## Accuracy : 0.6772
## 95% CI : (0.6315, 0.7206)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : 3.001e-12
##
## Kappa : 0.3538
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.6886
## Specificity : 0.6651
## Pos Pred Value : 0.6856
## Neg Pred Value : 0.6682
## Prevalence : 0.5147
## Detection Rate : 0.3544
## Detection Prevalence : 0.5169
## Balanced Accuracy : 0.6769
##
## 'Positive' Class : severe_disease
##
In Chapter 12, we already saw the Silhouette coefficient, which captures the shape of the clustering boundaries. It is a function of the within-cluster and between-cluster distances of each sample \(i\) in the dataset. Suppose \(d_i\) is the average distance between sample \(i\) and all other points within its own cluster, and \(l_i\) is the lowest average distance from \(i\) to the points of any other (neighboring) cluster.
Then, the Silhouette coefficient for a sample point \(i\) is:
\[-1\leq Silhouette(i)=\frac{l_i-d_i}{\max(l_i,d_i)} \leq 1.\] For interpreting the Silhouette of point \(i\), we use:
\[Silhouette(i) \approx \begin{cases} -1 & \text{sample } i \text{ is closer to a neighboring cluster} \\ 0 & \text {the sample } i \text{ is near the border of its cluster, i.e., } i \text{ represents the closest point in its cluster to the rest of the dataset clusters} \\ 1 & \text {the sample } i \text{ is near the center of its cluster} \end{cases}.\]
The mean Silhouette value represents the arithmetic average of all Silhouette coefficients (either within a cluster, or overall) and represents the quality of the cluster (clustering). High mean Silhouette corresponds to compact clustering (dense and separated clusters), whereas low values represent more diffused clusters. The Silhouette value is useful when the number of predicted clusters is smaller than the number of samples.
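A minimal sketch of the Silhouette computation (in Python, for a language-agnostic check) for a toy 1-D configuration; the cluster memberships and coordinates below are assumptions chosen purely for illustration:

```python
# Toy 1-D example: the cluster assignments and coordinates are assumed for illustration
own_cluster = [1.0, 1.2, 1.4]        # the cluster containing sample x
other_clusters = [[5.0, 5.3, 9.0]]   # all remaining clusters

def silhouette(x, own, others):
    # d: average distance from x to the other members of its own cluster
    d = sum(abs(x - p) for p in own if p != x) / (len(own) - 1)
    # l: lowest average distance from x to the members of any other cluster
    l = min(sum(abs(x - p) for p in c) / len(c) for c in others)
    return (l - d) / max(l, d)

# A point deep inside a tight cluster scores close to 1
print(round(silhouette(1.2, own_cluster, other_clusters), 3))  # 0.962
```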
The Kappa statistic was originally developed to measure the reliability between two human raters. It can be harnessed in machine learning applications to compare the accuracy of a classifier, where one rater
represents the ground truth (for labeled data, these are the actual values of each instance) and the second rater
represents the results of the automated machine learning classifier. The order of listing the raters is irrelevant.
The Kappa statistic adjusts the observed accuracy for the possibility of a correct prediction by chance alone and answers the question: how much better is the agreement (between the ground truth and the machine learning prediction) than would be expected by chance alone? Its value typically lies between \(0\) and \(1\) (negative values indicate agreement worse than chance). When \(\kappa=1\), we have perfect agreement between the computed predictions and the actual class labels; when \(\kappa=0\), the observed agreement is no better than chance. One commonly used interpretation of the Kappa statistic is: below \(0.2\), poor agreement; \(0.2\)–\(0.4\), fair; \(0.4\)–\(0.6\), moderate; \(0.6\)–\(0.8\), good; and \(0.8\)–\(1.0\), very good agreement.
In the above confusionMatrix output, we have a fair agreement (\(\kappa \approx 0.35\)). For different problems, we may have different interpretations of the Kappa statistic.
To understand Kappa statistic better, let’s look at its definition.
Predicted \ Observed | Minor | Severe | Row Sum |
---|---|---|---|
Minor | \(A=143\) | \(B=71\) | \(A+B=214\) |
Severe | \(C=72\) | \(D=157\) | \(C+D=229\) |
Column Sum | \(A+C=215\) | \(B+D=228\) | \(A+B+C+D=443\) |
In this table, \(A=143, B=71, C=72, D=157\) denote the frequencies (counts) of cases within each of the cells in the \(2\times 2\) design. Then
\[ObservedAgreement = (A+D)=300.\]
\[ExpectedAgreement = \frac{(A+B)\times (A+C)+(C+D)\times (B+D)}{A+B+C+D}=221.72.\]
\[(Kappa)\ \kappa = \frac{ObservedAgreement - ExpectedAgreement} {(A+B+C+D) - ExpectedAgreement}=0.35.\]
In this manual calculation of the kappa statistic (\(\kappa\)), we used the corresponding values we saw earlier in the Quality of Life (QoL) case-study, where the chronic-disease binary outcome is defined by qol$cd <- qol$CHRONICDISEASESCORE>1.497, and cd is predicted by qol_pred.
table(qol_pred, qol_test$cd)
##
## qol_pred minor_disease severe_disease
## minor_disease 143 71
## severe_disease 72 157
According to the table above, we can compute the observed and the expected agreement:
A=143; B=71; C=72; D=157
# A+B+C+D # 443
EA=((A+B)*(A+C)+(C+D)*(B+D))/(A+B+C+D) # Expected agreement: 221.7201
OA=A+D; OA # Observed agreement
## [1] 300
k=(OA-EA)/(A+B+C+D - EA); k # 0.3537597
## [1] 0.3537597
# Compare against the official accuracy score
confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease")$overall[1] # report official Accuracy
## Accuracy
## 0.6772009
The manually computed values coincide with the official report: the observed agreement rate is \(OA/(A+B+C+D)=300/443 \approx 0.677\), matching the reported accuracy, and the manual \(\kappa \approx 0.35\) matches the official Kappa statistic.
Let’s now look at computing Kappa. It may be trickier to obtain the expected agreement: probability rules tell us that the probability of the union of two disjoint events equals the sum of the individual (marginal) probabilities of these two events. Thus, we have:
round(confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease")$overall[2], 2) # report official Kappa
## Kappa
## 0.35
We get the same value as in the confusionMatrix() output. A more straightforward way of obtaining the Kappa statistic is the Kappa() function in the vcd package.
# install.packages("vcd")
library(vcd)
## Loading required package: grid
Kappa(table(qol_pred, qol_test$cd))
## value ASE z Pr(>|z|)
## Unweighted 0.3538 0.04446 7.957 1.76e-15
## Weighted 0.3538 0.04446 7.957 1.76e-15
The combination of the Kappa() and table() functions yields a \(2\times 4\) results matrix; the Kappa statistic is reported as the unweighted value.
Generally speaking, predicting a severe disease outcome is a more critical problem than predicting a mild disease state, so weighted Kappa, which gives the severe disease cases a higher weight, is also useful. With the weights below, the Kappa result is not acceptable, since the classifier may make too many mistakes for the severe disease cases; the weighted Kappa value is \(-0.0044\). Notice that the range of weighted Kappa may extend beyond \([0,1]\).
Kappa(table(qol_pred, qol_test$cd),weights = matrix(c(1,10,1,10),nrow=2))
## value ASE z Pr(>|z|)
## Unweighted 0.353760 0.04446 7.95721 1.760e-15
## Weighted -0.004386 0.04667 -0.09397 9.251e-01
When the predicted value is the first argument of table(), the rows represent the predicted labels and the columns represent the true labels.
table(qol_pred, qol_test$cd)
##
## qol_pred minor_disease severe_disease
## minor_disease 143 71
## severe_disease 72 157
Kappa compares the Observed classification accuracy (the output of our ML classifier) with the Expected classification accuracy (corresponding to random-chance classification). It may be used to evaluate single classifiers and/or to compare among a set of different classifiers. Because it takes into account random chance (agreement with a random classifier), Kappa is more meaningful than simply using the accuracy as a single quality metric. For instance, the interpretation of an Observed Accuracy of 80% is relative to the Expected Accuracy: an Observed Accuracy of 80% is more impactful for an Expected Accuracy of 50% than for an Expected Accuracy of 75%.
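This effect can be verified with a one-line chance-correction calculation (a sketch in Python, for a language-agnostic check; the 80%, 50%, and 75% figures are the hypothetical accuracies from the paragraph above):

```python
# Chance-corrected agreement (Cohen's kappa expressed via overall accuracies)
def kappa(observed_acc, expected_acc):
    return (observed_acc - expected_acc) / (1 - expected_acc)

# Same observed accuracy, very different chance-corrected quality:
print(round(kappa(0.80, 0.50), 2))  # 0.6 - substantial improvement over chance
print(round(kappa(0.80, 0.75), 2))  # 0.2 - modest improvement over chance
```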
Take a closer look at the confusionMatrix() output, where we can find two important statistics - “sensitivity” and “specificity”.
Sensitivity, or the true positive rate, measures the proportion of “success” observations that are correctly classified. \[sensitivity=\frac{TP}{TP+FN}.\] Notice that \(TP+FN\) is the total number of true “success” observations.
On the other hand, specificity, or the true negative rate, measures the proportion of “failure” observations that are correctly classified. \[specificity=\frac{TN}{TN+FP}.\] Accordingly, \(TN+FP\) is the total number of true “failure” observations.
In the QoL data, considering “severe_disease” as “success” and using the table() output above, we can manually compute the sensitivity and specificity, as well as the precision and recall (below):
sens <- 157/(157+71); sens
## [1] 0.6885965
spec <- 143/(143+72); spec
## [1] 0.6651163
The caret package also provides functions to directly calculate the sensitivity and specificity.
library(caret)
sensitivity(qol_pred, qol_test$cd, positive="severe_disease")
## [1] 0.6885965
# specificity(qol_pred, qol_test$cd)
confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease")$byClass[1] # another way to report the sensitivity
## Sensitivity
## 0.6885965
# confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease")$byClass[2] # another way to report the specificity
Sensitivity and specificity both range from 0 to 1. For either measure, a value of 1 implies that the positive or negative predictions are perfectly accurate. However, simultaneously high sensitivity and specificity may not be attainable in real-world situations, as there is a tradeoff between the two. To compromise, some studies loosen the demands on one measure and focus on achieving high values on the other.
Very similar to sensitivity, precision measures the proportion of true “success” observations among the predicted “success” observations. \[precision=\frac{TP}{TP+FP}.\] Recall is the proportion of correctly identified “success” observations among all actual “success” cases; it coincides with sensitivity. A model with high recall captures most “interesting” cases. \[recall=\frac{TP}{TP+FN}.\] Again, let’s calculate these by hand for the QoL data:
prec <- 157/(157+72); prec
## [1] 0.6855895
recall <- 157/(157+71); recall
## [1] 0.6885965
We can also report the precision-recall performance curve (and, analogously, the ROC curve and its area under the curve, AUC) using the ROCR package:
library(ROCR)
qol_pred <- predict(qol_model, qol_test, type = 'prob')
pred <- prediction(qol_pred[,2], qol_test$cd)
PrecRec <- performance(pred, "prec", "rec")
plot(PrecRec)