#' ---
#' title: "Data Science and Predictive Analytics (UMich HS650)"
#' ---
#'
#' | Test Result \\ Actual Condition | Absent ($H_0$ is true) | Present ($H_1$ is true) | Test Interpretation |
#' |---|---|---|---|
#' | **Negative** (fail to reject $H_0$) | **TN**: Condition absent + Negative result = True (accurate) Negative | **FN**: Condition present + Negative result = False (invalid) Negative, Type II error (proportional to $\beta$) | $NPV=\frac{TN}{TN+FN}$ |
#' | **Positive** (reject $H_0$) | **FP**: Condition absent + Positive result = False Positive, Type I error ($\alpha$) | **TP**: Condition present + Positive result = True Positive | $PPV=Precision=\frac{TP}{TP+FP}$ |
#' | **Test Interpretation**: $Power=1-\beta=1-\frac{FN}{FN+TP}$ | $Specificity=\frac{TN}{TN+FP}$ | $Power=Recall=Sensitivity=\frac{TP}{TP+FN}$ | $LOR=\ln\left(\frac{TN\times TP}{FP\times FN}\right)$ |
#'
#' See also [SMHS EBook, Power, Sensitivity and Specificity section](https://wiki.socr.umich.edu/index.php/SMHS_PowerSensitivitySpecificity).
#'
#' ## Confusion matrices
#'
#' We talked about this type of matrix in [Chapter 8](https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/08_DecisionTreeClass.html). For binary classes, the confusion matrix is a $2\times 2$ table and each of its cells has a specific meaning.
#'
#' A schematic $2\times 2$ layout of the confusion matrix:
#'
#'
require(knitr)
# Schematic layout: rows = actual class (TRUE/FALSE), columns = predicted class
item_table = data.frame(predict_T = c("TP", "FP"), predict_F = c("FN", "TN"))
rownames(item_table) = c("TRUE", "FALSE")
kable(item_table, caption = "cross table (rows = actual, columns = predicted)")
#'
#'
#' * **True Positive** (TP): Number of observations correctly classified as "yes" or "success".
#'
#' * **True Negative** (TN): Number of observations correctly classified as "no" or "failure".
#'
#' * **False Positive** (FP): Number of observations incorrectly classified as "yes" or "success".
#'
#' * **False Negative** (FN): Number of observations incorrectly classified as "no" or "failure".
#'
#' **Using confusion matrices to measure performance**
#'
#' Using these four cell counts, the accuracy is calculated as:
#' $$accuracy=\frac{TP+TN}{TP+TN+FP+FN}=\frac{TP+TN}{\text{Total number of observations}}.$$
#' Conversely, the error rate, i.e., the proportion of incorrectly classified observations, is:
#' $$\text{error rate}=\frac{FP+FN}{TP+TN+FP+FN}=\frac{FP+FN}{\text{Total number of observations}}=1-accuracy.$$
#' Looking at the numerators and denominators, we can see that the error rate and the accuracy add up to 1. Therefore, a 95% accuracy implies a 5% error rate.
#'
#' In R, there are multiple ways to obtain a confusion table. The simplest is `table()`. For example, in [Chapter 7](https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/07_NaiveBayesianClass.html), to get a plain $2\times 2$ table reporting the agreement between the real clinical cancer labels and their machine learning predicted counterparts, we used:
#'
#'
hn_test_pred <- predict(hn_classifier, hn_test)   # predicted cancer-stage labels (Chapter 7)
table(hn_test_pred, hn_med_test$stage)            # rows = predicted, columns = actual stage
#'
#'
#' We sometimes use the `gmodels::CrossTable()` function, e.g., see [Chapter 7](https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/07_NaiveBayesianClass.html), because it reports additional information about the model performance.
#'
#'
library(gmodels)
CrossTable(hn_test_pred, hn_med_test$stage)
#'
#'
#' The second entry in each cell of the `CrossTable()` output reports that cell's Chi-square contribution. This uses the standard [Chi-Square formula](https://wiki.socr.umich.edu/index.php/AP_Statistics_Curriculum_2007_Contingency_Indep) for computing the relative discrepancy between *observed* and *expected* counts. For instance, the Chi-square contribution of *cell(1,1)* (`hn_med_test$stage=early_stage` and `hn_test_pred=early_stage`) can be computed from the $\frac{(Observed-Expected)^2}{Expected}$ formula. Assuming independence between the rows and columns (i.e., random classification), the *expected cell(1,1) count* is the product of the corresponding row (96) and column (77) marginal counts divided by the total count, $\frac{96\times 77}{100}$. Thus, the Chi-square value for cell(1,1) is:
#'
#' $$\text{Chi-square cell(1,1)} = \frac{(Observed-Expected)^2}{Expected}=$$
#' $$=\frac{\left (73-\frac{96\times 77}{100}\right ) ^2}{\frac{96\times 77}{100}}=0.01145022.$$
#'
#' Note that each cell's Chi-square value represents one of the four (in this case) components of the Chi-square test statistic, which assesses whether there is an association between the observed and predicted class labels. Under the null hypothesis, there is no association between the actual and predicted counts for each level of the factor variable, which allows us to quantify whether the derived classification agrees with the real class annotations (labels). The aggregate sum of all cell Chi-square values yields the $\chi_o^2 = \displaystyle\sum_{all\ cells}{(O-E)^2 \over E} \sim \chi_{(df)}^2$ statistic, where $df = (\# rows - 1)\times (\# columns - 1)$.
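#'
#' Below is a minimal sketch of this per-cell calculation, assuming the cell counts implied by the marginal totals above (73, 23, 4 and 0); the `later_stage` label is hypothetical:
#'
#'
# Per-cell Chi-square contributions (sketch using the counts implied above)
obs <- matrix(c(73, 4, 23, 0), nrow = 2,
              dimnames = list(predicted = c("early_stage", "later_stage"),
                              actual    = c("early_stage", "later_stage")))
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)  # (row marginal x column marginal)/N
chi_contrib <- (obs - expected)^2 / expected              # per-cell Chi-square contributions
chi_contrib[1, 1]                                         # ~0.01145, matching the manual value
sum(chi_contrib)                                          # aggregate Chi-square statistic
#'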
#'
#' Using either table (`CrossTable()` or `confusionMatrix()`), we can calculate the accuracy and error rate by hand.
#'
#'
accuracy<-(73+0)/100      # (diagonal cells)/(total count)
accuracy
error_rate<-(23+4)/100    # (off-diagonal cells)/(total count)
error_rate
1-accuracy                # equals the error rate
#'
#'
#' For confusion matrices larger than $2\times 2$, the diagonal elements count the correctly classified observations and the off-diagonal elements represent the incorrectly labeled cases.
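#'
#' For example, here is a minimal sketch with a hypothetical $3\times 3$ confusion table, where the accuracy is the sum of the diagonal counts divided by the total count:
#'
#'
# Hypothetical 3-class confusion table (rows = predicted, columns = actual)
multi_tab <- matrix(c(50,  3,  2,
                       4, 45,  6,
                       1,  7, 48), nrow = 3, byrow = TRUE,
                    dimnames = list(predicted = c("A", "B", "C"),
                                    actual    = c("A", "B", "C")))
sum(diag(multi_tab)) / sum(multi_tab)      # accuracy = correctly classified / total
1 - sum(diag(multi_tab)) / sum(multi_tab)  # error rate
#'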
#'
#' ## Other measures of performance beyond accuracy
#'
#' So far we have discussed two performance-reporting methods - `table()` and `CrossTable()`. A third function, `caret::confusionMatrix()`, provides perhaps the easiest way to report model performance. When supplying two vectors, the first argument is the *vector of predicted labels* and the second argument, of the same length, is the *vector of actual (reference) labels*; alternatively, a pre-computed contingency table may be passed, as in the example below.
#'
#' This example was presented as the first case-study in [Chapter 8](https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/08_DecisionTreeClass.html).
#'
#'
library(caret)
qol_pred<-predict(qol_model, qol_test)   # predicted class labels for the QoL test set
confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease")
#'
#'
#' ### Silhouette coefficient
#'
#' In [Chapter 12](https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/12_kMeans_Clustering.html#2_silhouette_plots) we already saw the *Silhouette coefficient*, which captures the shape of the clustering boundaries. It is a function of the *intracluster distance* of a sample in the dataset ($i$). Suppose:
#'
#' * $d_i$ is the average dissimilarity of point $i$ with all other data points within its cluster. Then, $d_i$ captures the quality of the assignment of $i$ to its current class label. Smaller or larger $d_i$ values suggest better or worse overall assignment for $i$ to its cluster, respectively. The average dissimilarity of $i$ to a cluster $C$ is the average distance between $i$ and all points in the cluster of points labeled $C$.
#' * $l_i$ is the lowest average dissimilarity of point $i$ to any other cluster that $i$ is not a member of. The cluster corresponding to $l_i$, the lowest average dissimilarity, is called the *neighboring cluster* of $i$, as it is the next best fit cluster for $i$.
#'
#' Then, the *Silhouette coefficient* for a sample point $i$ is:
#'
#' $$-1\leq Silhouette(i)=\frac{l_i-d_i}{\max(l_i,d_i)} \leq 1.$$
#' For interpreting the Silhouette of point $i$, we use:
#'
#' $$Silhouette(i) \approx
#' \begin{cases}
#' -1 & \text{sample } i \text{ is closer to a neighboring cluster} \\
#' 0 & \text {the sample } i \text{ is near the border of its cluster, i.e., } i \text{ represents the closest point in its cluster to the rest of the dataset clusters} \\
#' 1 & \text {the sample } i \text{ is near the center of its cluster}
#' \end{cases}.$$
#'
#' The *mean Silhouette value* represents the arithmetic average of all Silhouette coefficients (either within a cluster, or overall) and represents the quality of the cluster (clustering). High mean Silhouette corresponds to compact clustering (dense and separated clusters), whereas low values represent more diffused clusters. The Silhouette value is useful when the number of predicted clusters is smaller than the number of samples.
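#'
#' As a minimal illustration (assuming the `cluster` package is available), we can compute Silhouette coefficients for a simple $k$-means clustering of the built-in `iris` measurements and report the mean Silhouette value:
#'
#'
# Small Silhouette sketch (illustrative example, not part of the QoL case-study)
# install.packages("cluster")
library(cluster)
set.seed(1234)
iris_data <- scale(iris[, 1:4])                  # four numeric iris features, standardized
km <- kmeans(iris_data, centers = 3)             # k-means clustering with 3 clusters
sil <- silhouette(km$cluster, dist(iris_data))   # per-observation Silhouette coefficients
summary(sil)$avg.width                           # mean Silhouette value (clustering quality)
# plot(sil)                                      # optional Silhouette plot
#'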
#'
#' ### The kappa ( $\kappa$ ) statistic
#'
#' The Kappa statistic was [originally developed to measure the reliability between two human raters](https://wiki.socr.umich.edu/index.php/SMHS_ReliabilityValidity). It can be harnessed in machine learning applications to compare the accuracy of a classifier, where `one rater` represents the ground truth (for labeled data, these are the actual values of each instance) and the `second rater` represents the results of the automated machine learning classifier. The order of listing the **raters** is irrelevant.
#'
#' The Kappa statistic adjusts the raw accuracy to account for the **possibility of a correct prediction by chance alone** and answers the question *How much better is the agreement (between the ground truth and the machine learning prediction) than would be expected by chance alone?* Its value typically lies between $0$ and $1$ (negative values indicate worse-than-chance agreement). When $\kappa=1$, the **computed** predictions (typically the result of a model-based or model-free technique forecasting an outcome of interest) agree perfectly with the actual labels, whereas $\kappa=0$ indicates agreement no better than the **expected** (random, by-chance) prediction. A common interpretation of the Kappa statistic uses the bands below (a small coding sketch of these bands follows the list):
#'
#' * *Poor* agreement: less than 0.20
#' * *Fair* agreement: 0.20-0.40
#' * *Moderate* agreement: 0.40-0.60
#' * *Good* agreement: 0.60-0.80
#' * *Very good* agreement: 0.80-1
#'
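#' For quick reference, these qualitative bands can be encoded in a small (illustrative) helper function:
#'
#'
# Illustrative helper mapping a kappa value to the qualitative agreement bands above
kappa_band <- function(k) {
  cut(k, breaks = c(-Inf, 0.2, 0.4, 0.6, 0.8, 1.0),
      labels = c("Poor", "Fair", "Moderate", "Good", "Very good"))
}
kappa_band(0.35)    # "Fair" agreement
#'
#'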
#' In the above `confusionMatrix` output, we have a fair agreement. For different problems, we may have different interpretations of Kappa statistics.
#'
#' To understand the Kappa statistic better, let's look at its definition.
#'
#' | Predicted \\ Observed | Minor | Severe | Row Sum |
#' |---|---|---|---|
#' | Minor | $A=143$ | $B=71$ | $A+B=214$ |
#' | Severe | $C=72$ | $D=157$ | $C+D=229$ |
#' | Column Sum | $A+C=215$ | $B+D=228$ | $A+B+C+D=443$ |
#'
#'
#' In this table, $A=143, B=71, C=72, D=157$ denote the frequencies (counts) of cases within each of the cells in the $2\times 2$ design. Then
#'
#' $$ObservedAgreement = (A+D)=300.$$
#'
#' $$ExpectedAgreement = \frac{(A+B)\times (A+C)+(C+D)\times (B+D)}{A+B+C+D}=221.72.$$
#'
#' $$\kappa = \frac{ObservedAgreement - ExpectedAgreement}{(A+B+C+D) - ExpectedAgreement}=0.35.$$
#'
#' In this manual calculation of the kappa statistic ($\kappa$), we used the corresponding values from the Quality of Life (QoL) case-study, where the chronic-disease binary outcome is defined by `qol$cd <- qol$CHRONICDISEASESCORE > 1.497` and `qol_pred` is the corresponding `cd` prediction.
#'
#'
table(qol_pred, qol_test$cd)
#'
#'
#' Using the table above, we can manually compute the observed agreement, the expected agreement, and the kappa statistic:
#'
A=143; B=71; C=72; D=157
# A+B+C+D # 443 (total number of cases)
# ((A+B)*(A+C)+(C+D)*(B+D))/(A+B+C+D) # 221.7201
EA=((A+B)*(A+C)+(C+D)*(B+D))/(A+B+C+D) # Expected agreement (expected number of agreeing cases)
OA=A+D; OA # Observed agreement (number of agreeing cases)
k=(OA-EA)/(A+B+C+D - EA); k # 0.3537597
# Compare against the kappa reported by confusionMatrix()
confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease")$overall[2] # official Kappa
#'
#'
#' The manually and automatically computed kappa statistics coincide ($\sim 0.35$).
#'
#' The trickier part of this calculation is obtaining the expected agreement. [Probability rules](https://wiki.socr.umich.edu/index.php/EBook#Chapter_III:_Probability) tell us that the probability of the union of two *disjoint events* equals the sum of the individual (marginal) probabilities for these two events, which is why the expected agreement adds the chance-agreement contributions of the two classes. The officially reported Kappa is:
#'
#'
round(confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease")$overall[2], 2) # report official Kappa
#'
#'
#' This matches the manual calculation above. A more straightforward way of getting the Kappa statistic is the `Kappa()` function in the `vcd` package.
#'
#'
# install.packages("vcd")
library(vcd)
Kappa(table(qol_pred, qol_test$cd))
#'
#'
#' The combination of `Kappa()` and `table()` yields a $2\times 4$ output, with one row each for the unweighted and weighted versions. The *Kappa statistic* is the `value` entry in the `Unweighted` row.
#'
#' Generally speaking, predicting a severe disease outcome is a more critical problem than predicting a mild disease state, so a *weighted* Kappa, which assigns a higher weight to the severe-disease class, is also useful. With the weights below, the weighted Kappa value is $0.26374$, which is not acceptable, suggesting the classifier may make too many mistakes on the severe disease cases. Notice that the range of the weighted Kappa may exceed $[0,1]$.
#'
#'
# user-specified weight matrix giving the severe_disease row a higher weight (10)
Kappa(table(qol_pred, qol_test$cd), weights = matrix(c(1, 10, 1, 10), nrow = 2))
#'
#'
#' When the predicted values are supplied as the first argument to `table()`, the rows represent the **predicted labels** and the columns represent the **true labels**.
#'
#'
table(qol_pred, qol_test$cd)
#'
#'
#' #### Summary of the Kappa score for calculating prediction accuracy
#'
#' Kappa compares an **Observed classification accuracy** (output of our ML classifier) with an **Expected classification accuracy** (corresponding to random chance classification). It may be used to evaluate single classifiers and/or to compare among a set of different classifiers. It takes into account random chance (agreement with a random classifier). That makes **Kappa** more meaningful than simply using the **accuracy** as a single quality metric. For instance, the interpretation of an `Observed Accuracy of 80%` is **relative** to the `Expected Accuracy`. `Observed Accuracy of 80%` is more impactful for an `Expected Accuracy of 50%` compared to `Expected Accuracy of 75%`.
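#'
#' As a minimal numeric sketch of this point (using hypothetical observed and expected accuracy rates), the same $80\%$ observed accuracy yields very different kappa values depending on the expected (chance) accuracy:
#'
#'
# kappa computed from (hypothetical) observed and expected accuracy rates
kappa_from_rates <- function(observed_acc, expected_acc) {
  (observed_acc - expected_acc) / (1 - expected_acc)
}
kappa_from_rates(0.80, 0.50)   # 0.60, moderate-to-good agreement
kappa_from_rates(0.80, 0.75)   # 0.20, only poor-to-fair agreement
#'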
#'
#' ### Sensitivity and specificity
#'
#' Take a closer look at the `confusionMatrix()` output where we can find two important statistics - "sensitivity" and "specificity".
#'
#' Sensitivity, or the true positive rate, measures the proportion of "success" observations that are correctly classified.
#' $$sensitivity=\frac{TP}{TP+FN}.$$
#' Notice that $TP+FN$ is the total number of true "success" observations.
#'
#' On the other hand, specificity, or the true negative rate, measures the proportion of "failure" observations that are correctly classified.
#' $$specificity=\frac{TN}{TN+FP}.$$
#' Accordingly, $TN+FP$ is the total number of true "failure" observations.
#'
#' In the QoL data, considering "severe_disease" as "success" and using the `table()` counts above ($A=143, B=71, C=72, D=157$), we can manually compute the *sensitivity* and *specificity*, as well as the *precision* and *recall* (below):
#'
#'
sens<-157/(157+71)    # TP/(TP+FN) = D/(D+B)
sens
spec<-143/(143+72)    # TN/(TN+FP) = A/(A+C)
spec
#'
#'
#' The `caret` package also provides functions to directly calculate the sensitivity and specificity.
#'
#'
library(caret)
sensitivity(qol_pred, qol_test$cd, positive="severe_disease")
# specificity(qol_pred, qol_test$cd)
confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease")$byClass[1] # another way to report the sensitivity
# confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease")$byClass[2] # another way to report the specificity
#'
#'
#' Sensitivity and specificity both range from 0 to 1. For either measure, a value of 1 implies that the positive or negative predictions, respectively, are perfectly accurate. However, simultaneously high sensitivity and specificity may not be attainable in real-world situations. There is a tradeoff between sensitivity and specificity. To compromise, some studies loosen the demands on one and focus on achieving high values of the other.
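#'
#' To illustrate this tradeoff, below is a minimal sketch (assuming `qol_model` supports probabilistic predictions via `type='prob'` and that `"severe_disease"` is one of the two levels of `qol_test$cd`) that recomputes sensitivity and specificity for several probability cutoffs:
#'
#'
# Sensitivity-specificity tradeoff across probability cutoffs (illustrative sketch)
prob_severe <- predict(qol_model, qol_test, type = "prob")[, "severe_disease"]
neg_class <- setdiff(levels(qol_test$cd), "severe_disease")   # the non-severe class label
for (cutoff in c(0.3, 0.5, 0.7)) {
  pred_cut <- factor(ifelse(prob_severe >= cutoff, "severe_disease", neg_class),
                     levels = levels(qol_test$cd))
  cat(sprintf("cutoff=%.1f  sensitivity=%.3f  specificity=%.3f\n", cutoff,
              caret::sensitivity(pred_cut, qol_test$cd, positive = "severe_disease"),
              caret::specificity(pred_cut, qol_test$cd, negative = neg_class)))
}
#'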
#'
#' ### Precision and recall
#'
#' Very similar to sensitivity, *precision* measures the proportion of true "success" observations among the predicted "success" observations.
#' $$precision=\frac{TP}{TP+FP}.$$
#' *Recall* is the proportion of correctly identified "successes" among all true "success" observations (i.e., recall is the same as sensitivity). A model with high recall captures most "interesting" cases.
#' $$recall=\frac{TP}{TP+FN}.$$
#' Again, let's calculate these by hand for the QoL data:
#'
#'
prec<-157/(157+72); prec       # TP/(TP+FP) = D/(D+C)
recall<-157/(157+71); recall   # TP/(TP+FN) = D/(D+B), same as the sensitivity
#'
#'
#' Next, we can report the Area under the ROC Curve (AUC) and plot the precision-recall relationship.
#'
#'
library(ROCR)
library(plotly)
pred_prob <- predict(qol_model, qol_test, type = 'prob')   # class probabilities
pred <- prediction(pred_prob[, 2], qol_test$cd)            # ROCR prediction object
PrecRec <- performance(pred, "prec", "rec")                # precision-recall pairs
PrecRecAUC <- performance(pred, "auc")                     # area under the ROC curve
paste0("AUC=", round(as.numeric(PrecRecAUC@y.values), 2))
# plot(PrecRec)
plot_ly(x = ~PrecRec@x.values[[1]][2:length(PrecRec@x.values[[1]])],
y = ~PrecRec@y.values[[1]][2:length(PrecRec@y.values[[1]])],
name = 'Recall-Precision relation', type='scatter', mode='markers+lines') %>%
layout(title=paste0("Precision-Recall Plot, AUC=",
round(as.numeric(PrecRecAUC@y.values[[1]]), 2)),
xaxis=list(title="Recall"), yaxis=list(title="Precision"))
# PrecRecAUC <- performance(pred, "auc")
# paste0("AUC=", round(as.numeric(PrecRecAUC@y.values), 2))
#'
#'
#' Another way to obtain the *precision* is the `posPredValue()` function in the `caret` package. Remember to specify which class is the "success" class via the `positive` argument.
#'
#'
qol_pred<-predict(qol_model, qol_test)
posPredValue(qol_pred, qol_test$cd, positive="severe_disease")
#'
#'
#' From the definitions of **precision** and **recall**, we can express the type I and type II error rates as follows:
#'
#' $$error_1 = 1- Precision = \frac{FP}{TP+FP}.$$
#'
#' $$error_2 = 1- Recall = \frac{FN}{TP+FN}.$$
#'
#' Thus, we can compute the type I error ($\approx 0.31$) and the type II error ($\approx 0.31$).
#'
#'
error1<-1-prec; error1
error2<-1-recall; error2
#'
#'
#' ### The F-measure
#'
#' The F-measure, or F1-score, combines precision and recall using the [harmonic mean](https://wiki.socr.umich.edu/index.php/SMHS_CenterSpreadShape#Harmonic_Mean), assuming equal weights. A high F1-score means both high precision and high recall. This is a convenient way of measuring model performance and comparing models.
#' $$F1=\frac{2\times precision\times recall}{recall+precision}=\frac{2\times TP}{2\times TP+FP+FN}$$
#' Calculating the F1-score by hand for the Quality of Life prediction:
#'
#'
f1<-(2*prec*recall)/(prec+recall)
f1
#'
#'
#' The F1-score can also be computed directly using `caret`:
#'
#'
precision <- posPredValue(qol_pred, qol_test$cd, positive="severe_disease")
recall <- sensitivity(qol_pred, qol_test$cd, positive="severe_disease")
F1 <- (2 * precision * recall) / (precision + recall); F1
#'
#'
#' # Visualizing performance tradeoffs (ROC Curve)
#'
#' Another approach for evaluating classifier performance uses graphs rather than summary statistics; graphs usually provide a more comprehensive picture of performance than single statistics.
#'
#' The R package `ROCR` provides user-friendly functions for visualizing model performance. Details can be found on the [ROCR website](http://rocr.bioinf.mpi-sb.mpg.de).
#'
#' Here we evaluate the model performance for the Quality of Life case study in [Chapter 8](https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/08_DecisionTreeClass.html).
#'
#'
# install.packages("ROCR")
library(ROCR)
pred <- ROCR::prediction(predictions=pred_prob[, 2], labels=qol_test$cd)
# avoid naming collision (ROCR::prediction), as
# there is another prediction function in the neuralnet package.
#'
#'
#' `pred_prob[, 2]` is the probability of classifying each observation as "severe_disease". The above code saved all the model prediction information into the object `pred`.
#'
#' [Receiver Operating Characteristic (ROC)](https://wiki.socr.umich.edu/index.php/SMHS_ROC) curves are often used to examine the trade-off between detecting true positives and avoiding false positives.
#'
#'
# curve(log(x), from=0, to=100, xlab="False Positive Rate", ylab="True Positive Rate", main="ROC curve", col="green", lwd=3, axes=F)
# Axis(side=1, at=c(0, 20, 40, 60, 80, 100), labels = c("0%", "20%", "40%", "60%", "80%", "100%"))
# Axis(side=2, at=0:5, labels = c("0%", "20%", "40%", "60%", "80%", "100%"))
# segments(0, 0, 110, 5, lty=2, lwd=3)
# segments(0, 0, 0, 4.7, lty=2, lwd=3, col="blue")
# segments(0, 4.7, 107, 4.7, lty=2, lwd=3, col="blue")
# text(20, 4, col="blue", labels = "Perfect Classifier")
# text(40, 3, col="green", labels = "Test Classifier")
# text(70, 2, col="black", labels= "Classifier with no predictive value")
x <- seq(from=0, to=1.0, by=0.01) + 0.001
plot_ly(x = ~x, y = (log(100*x)+2.3)/(log(100*x[101])+2.3), line=list(color="lightgreen"),
name = 'Test Classifier', type='scatter', mode='lines', showlegend=T) %>%
add_lines(x=c(0,1), y=c(0,1), line=list(color="black", dash='dash'),
name="Classifier with no predictive value") %>%
add_segments(x=0, xend=0, y=0, yend = 1, line=list(color="blue"),
name="Perfect Classifier") %>%
add_segments(x=0, xend=1, y=1, yend = 1, line=list(color="blue"),
name="Perfect Classifier 2", showlegend=F) %>%
layout(title="ROC curve", legend = list(orientation = 'h'),
xaxis=list(title="False Positive Rate", scaleanchor="y", range=c(0,1)),
yaxis=list(title="True Positive Rate", scaleanchor="x"))
#'
#'
#' The