#' ---
#' title: "Data Science and Predictive Analytics (UMich HS650)"
#' subtitle: "Probabilistic Learning - Classification Using Naive Bayes"
#' author: "SOCR/MIDAS (Ivo Dinov)"
#' date: "`r format(Sys.time(), '%B %Y')`"
#' tags: [DSPA, SOCR, MIDAS, Big Data, Predictive Analytics]
#' output:
#'   html_document:
#'     theme: spacelab
#'     highlight: tango
#'     includes:
#'       before_body: SOCR_header.html
#'       after_body: SOCR_footer_tracker.html
#'     toc: true
#'     number_sections: true
#'     toc_depth: 2
#'     toc_float:
#'       collapsed: false
#'       smooth_scroll: true
#' ---
#'
#' Please review the introduction to [Chapter 6](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/06_LazyLearning_kNN.html), where we described the types of machine learning methods and presented lazy classification for numerical data. What about nominal features or textual data? In this chapter, we will begin to explore some classification techniques for categorical data. Specifically, we will (1) present the Naive Bayes algorithm, (2) review its assumptions, (3) discuss Laplace estimation, and (4) complete a Head and Neck Cancer Medication case-study.
#'
#' Later, in [Chapter 19](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/19_NLP_TextMining.html), we will also discuss text mining and natural language processing.
#'
#' # Overview of the Naive Bayes Algorithm
#'
#' Start by reviewing the [basics of probability theory and Bayesian inference](http://wiki.socr.umich.edu/index.php/AP_Statistics_Curriculum_2007_Prob_Rules).
#'
#' Bayes classifiers use training data to calculate an observed probability of each class based on all the features. This probability links feature values to classes like a map. When labeling the test data, we use the feature values of each test case together with this "map" to assign the most likely class. The idea is simple, but the corresponding algorithmic implementations may be quite sophisticated.
#'
#' The best scenario for accurately estimating the probability of an outcome-class map is when all features in the Bayes classifier contribute to the class simultaneously. The naive Bayes algorithm is frequently used for text classification. The maximum *a posteriori* assignment of the class label is based on obtaining the conditional probability density function for each feature given the value of the class variable.
#'
#' # Assumptions
#'
#' Naive Bayes is named for its "naive" assumptions. Its most important assumption is that all of the features are *equally important* and *independent*. This rarely happens in real-world data. However, even when these assumptions are violated, naive Bayes often still performs fairly accurately, particularly when the number of features $p$ is large. This is why the Naive Bayes algorithm can be used as a powerful text classifier.
#'
#' There are interesting [relations between QDA (Quadratic Discriminant Analysis), LDA (Linear Discriminant Analysis) and Naive Bayes classification](http://www.wikicoursenote.com/wiki/Stat841f10#Linear_and_Quadratic_Discriminant_Analysis). Additional information about [LDA and QDA is available here](http://wiki.socr.umich.edu/index.php/SMHS_BigDataBigSci_CrossVal_LDA_QDA).
#'
#' # Bayes Formula
#'
#' Let $B_1, B_2, ..., B_n$ be events corresponding to the observed values of the *n* features. If *A* is another event (e.g., a class label), the Bayes conditional probability formula is as follows:
#'
#' $$Posterior\, Probability=\frac{Likelihood\times Prior\, Probability}{Marginal\, Likelihood}$$
#'
#' Symbolically,
#'
#' $$P(A|B_1\cap B_2 \cap B_3\cap ...\cap B_n)=\frac{P(B_1\cap B_2 \cap B_3\cap ...\cap B_n|A)P(A)}{P(B_1\cap B_2 \cap B_3\cap ...\cap B_n)}$$
#'
#' When the $B_i$'s are independent (both marginally and conditionally on $A$, which is the "naive" assumption), the joint probabilities factorize, e.g., $P(B_1\cap B_2)=P(B_1)\times P(B_2)$, and similarly for any subset of the $B_i$'s. So we have:
#'
#' $$P(A|B_1\cap B_2 \cap B_3\cap ...\cap B_n)=\frac{P(B_1|A)\times P(B_2|A)\times ...\times P(B_n|A)\times P(A)}{P(B_1)\times P(B_2)\times P(B_3)\times ...\times P(B_n)}$$
#'
#' Essentially, the probability of level *L* for the class variable *C*, given independent features $F_1, ..., F_n$, is:
#'
#' $$P(C_L|F_1, ..., F_n)=\frac{P(C_L)\prod_{i=1}^nP(F_i|C_L)}{\prod_{i=1}^nP(F_i)},$$
#'
#' where the denominator, $\prod_{i=1}^nP(F_i)$, is a scaling factor that represents the marginal probability of observing all features jointly. Since it does not depend on the class, it can be ignored when we only need to rank the candidate classes.
#' # The Laplace Estimator
#'
#' If at least one $P(F_i|C_L)=0$, then $P(C_L|F_1, ..., F_n)=0$, which means the estimated probability of being in this class is zero. However, $P(F_i|C_L)=0$ could simply result from chance in the sampling of the training data.
#'
#' One solution to this scenario is **Laplace estimation**, also known as Laplace smoothing, which can be accomplished in two ways. One is to add a small number to each count in the class-feature frequency table, which guarantees that each class-feature combination occurs at least once in the training data. Then $P(F_i|C_L)>0$ for all $i$. Another strategy is to add some small value, $\epsilon$, to the numerator and denominator when calculating the posterior probability. Note that these small perturbations of the denominator should be larger than the changes in the numerator, to avoid producing a trivial ($0$) posterior for another class.
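#'
#' To illustrate the first strategy, here is a short R sketch with *hypothetical* counts (not drawn from our data) for a binary word indicator observed in two classes. Adding the smoothing constant to each cell of the class-feature frequency table removes the zero estimate:
word_yes    <- c(early = 0, later = 12)   # hypothetical documents containing the word
class_total <- c(early = 80, later = 40)  # hypothetical documents per class
laplace <- 1
# Without smoothing, P(word | early) = 0, which zeroes-out the entire posterior product
word_yes / class_total
# With smoothing: add laplace to each cell; the indicator has 2 levels (Yes/No),
# so each class total grows by 2*laplace
(word_yes + laplace) / (class_total + 2 * laplace)
#'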
#' # Case Study: Head and Neck Cancer Medication
#'
#' ## Step 1: Collecting Data
#'
#' We utilize the Inpatient Head and Neck Cancer Medication data for this case study, which is case-study 14 in our data archive.
#'
#' Variables:
#'
#' * **PID:** coded patient ID
#' * **ENC_ID:** coded encounter ID
#' * **Seer_stage:** SEER cancer stage (0=In situ, 1=Localized, 2=Regional by direct extension, 3=Regional to lymph nodes, 4=Regional (both codes 2 and 3), 5=Regional, NOS, 7=Distant metastases/systemic disease, 8=Not applicable, 9=Unstaged, unknown, or unspecified). See: http://seer.cancer.gov/tools/ssm.
#' * **Medication_desc:** description of the chemical composition of the medication
#' * **Medication_summary:** brief description of the medication brand and usage
#' * **Dose:** the dosage in the medication summary
#' * **Unit:** the unit of the dosage in the Medication_summary
#' * **Frequency:** the frequency of use in the Medication_summary
#' * **Total_dose_count:** total dosage count according to the Medication_summary
#'
#' ## Step 2: Exploring and preparing the data
#'
#' Let's load our data first.
hn_med <- read.csv("https://umich.instructure.com/files/1614350/download?download_frd=1", stringsAsFactors = FALSE)
str(hn_med)
#'
#' Change the `seer_stage` (cancer stage indicator) variable into a factor.
hn_med$seer_stage <- factor(hn_med$seer_stage)
str(hn_med$seer_stage)
table(hn_med$seer_stage)
#'
#' ### Data preparation - processing text data for analysis
#'
#' As you can see, the medication summary contains a great amount of free text. We should do some text mining to prepare the data for analysis. In R, the `tm` package is a good choice for text mining.
# install.packages("tm", repos = "http://cran.us.r-project.org")  # requires R V.3.3.1+
require(tm)
#'
#' The first step in text mining is to convert the text features (text elements) into a `corpus` object, which is a collection of text documents.
hn_med_corpus <- Corpus(VectorSource(hn_med$MEDICATION_SUMMARY))
print(hn_med_corpus)
#'
#' After constructing the `corpus` object, we can see that we have 662 documents. Each document represents one encounter (e.g., notes on medical treatment) for a patient.
inspect(hn_med_corpus[1:3])
hn_med_corpus[[1]]$content
hn_med_corpus[[2]]$content
hn_med_corpus[[3]]$content
#'
#' There is unwanted punctuation, and there are other symbols in the corpus documents, that we want to remove. We use the `tm_map()` function for the cleaning.
corpus_clean <- tm_map(hn_med_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
# corpus_clean <- tm_map(corpus_clean, PlainTextDocument)
#'
#' The above lines of code changed all characters to lower case, removed all punctuation and extra white spaces (typically created by deleting punctuation), and removed numbers (we could also convert the corpus to plain text).
inspect(corpus_clean[1:3])
corpus_clean[[1]]$content
corpus_clean[[2]]$content
corpus_clean[[3]]$content
#'
#' The `DocumentTermMatrix()` function tokenizes the medication summaries into words and counts the frequency of each term in each document of the corpus object.
hn_med_dtm <- DocumentTermMatrix(corpus_clean)
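#'
#' Before splitting the data, it may be worth a quick (optional) peek at the size and contents of the resulting document term matrix; `dim()` and `inspect()` are standard ways to do this:
dim(hn_med_dtm)                # number of documents x number of distinct terms
inspect(hn_med_dtm[1:3, 1:5])  # term counts for a few documents and terms
#'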
#' ### Data preparation - creating training and test datasets
#'
#' Just like in [Chapter 6](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/06_LazyLearning_kNN.html), we need to separate the dataset into training and test subsets. We have to subset all three representations consistently: the raw data (which carries the other features), the corpus object, and the document term matrix.
set.seed(12)
subset_int <- sample(nrow(hn_med), floor(nrow(hn_med)*0.8))  # 80% training + 20% testing
hn_med_train <- hn_med[subset_int, ]
hn_med_test <- hn_med[-subset_int, ]
hn_med_dtm_train <- hn_med_dtm[subset_int, ]
hn_med_dtm_test <- hn_med_dtm[-subset_int, ]
corpus_train <- corpus_clean[subset_int]
corpus_test <- corpus_clean[-subset_int]
# Alternatively, a fixed split:
# hn_med_train <- hn_med[1:562, ]
# hn_med_test <- hn_med[563:662, ]
# hn_med_dtm_train <- hn_med_dtm[1:562, ]
# hn_med_dtm_test <- hn_med_dtm[563:662, ]
# corpus_train <- corpus_clean[1:562]
# corpus_test <- corpus_clean[563:662]
#'
#' Let's examine the distribution of **seer stages** in the training and test datasets.
prop.table(table(hn_med_train$seer_stage))
prop.table(table(hn_med_test$seer_stage))
#'
#' We can separate (dichotomize) the [seer_stage](https://seer.cancer.gov/tools/ssm/intro.pdf) into two categories:
#'
#' * *no stage* or *early stage* cancer, and
#' * *later stage* cancer.
hn_med_train$stage <- hn_med_train$seer_stage %in% c(4, 5, 7)
hn_med_train$stage <- factor(hn_med_train$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
hn_med_test$stage <- hn_med_test$seer_stage %in% c(4, 5, 7)
hn_med_test$stage <- factor(hn_med_test$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
prop.table(table(hn_med_train$stage))
prop.table(table(hn_med_test$stage))
#'
#' ### Visualizing text data - word clouds
#'
#' A word cloud can help us visualize text data: more frequent words appear in larger fonts, while less common words appear in smaller fonts. The `wordcloud` package in R is commonly used for creating these figures.
# install.packages("wordcloud", repos = "http://cran.us.r-project.org")
library(wordcloud)
wordcloud(corpus_train, min.freq = 40, random.order = FALSE)
#'
#' The `random.order=FALSE` option places more frequent words near the center of the cloud. The `min.freq=40` option sets the cutoff word frequency to at least 40 occurrences in the corpus object. Therefore, a word must appear in at least 40 medication summaries to be shown in the graph.
#'
#' We can also visualize the differences between early stages and later stages using this type of graph.
early <- subset(hn_med_train, stage=="early_stage")
later <- subset(hn_med_train, stage=="later_stage")
wordcloud(early$MEDICATION_SUMMARY, max.words = 20)
wordcloud(later$MEDICATION_SUMMARY, max.words = 20)
#'
#' We can see that the frequent words in the medication summaries are somewhat different between early stage and later stage patients.
#'
#' ### Data preparation - creating indicator features for frequent words
#'
#' For simplicity, we utilize the medication summary as the only feature to classify cancer stages. You may recall that in [Chapter 6](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/06_LazyLearning_kNN.html) we used numeric features for classification. **In this study, we are going to turn the frequencies of words into features.**
summary(findFreqTerms(hn_med_dtm_train, 5))
hn_med_dict <- as.character(findFreqTerms(hn_med_dtm_train, 5))
hn_train <- DocumentTermMatrix(corpus_train, list(dictionary=hn_med_dict))
hn_test <- DocumentTermMatrix(corpus_test, list(dictionary=hn_med_dict))
#'
#' The above code restricts the document term matrix to words that appear in at least 5 different documents. This creates (about) 103 features for us to use.
#'
#' The Naive Bayes classifier trains on data with categorical features. Thus, we need to transform our word-count features into categorical data. One way to do this is to change each count into an indicator of whether the word appears at all. We can write our own function to handle this.
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
  return(x)
}
#'
#' We employ hard-thresholding here, `x <- ifelse(x > 0, 1, 0)`: if a count `x` is greater than 0, it is recoded as 1; otherwise it is set to 0.
#'
#' Now let's apply our own function `convert_counts()` to each column (`MARGIN=2`) of the training and testing datasets.
hn_train <- apply(hn_train, MARGIN = 2, convert_counts)
hn_test <- apply(hn_test, MARGIN = 2, convert_counts)
#'
#' So far, we have successfully created indicators for the words that appear in at least 5 different documents in the training data.
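#'
#' As a quick sanity check, we can peek at the transformed features. Note that `apply()` returns a character matrix of "No"/"Yes" indicators (the specific column names will depend on the dictionary extracted from your training split):
hn_train[1:3, 1:5]    # indicator values for the first few documents/terms
table(hn_train[, 1])  # distribution of the first word indicator
#'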
#' ## Step 3 - training a model on the data
#'
#' The package we will use for the Naive Bayes classifier is called `e1071`.
# install.packages("e1071", repos = "http://cran.us.r-project.org")
library(e1071)
#'
#' The function `naiveBayes()` has the following components:
#'
#' `m <- naiveBayes(train, class, laplace=0)`
#'
#' * train: data frame or matrix containing the training data (features)
#' * class: factor vector with the class for each row in the training data
#' * laplace: positive double controlling Laplace smoothing; the default is $0$, which disables Laplace smoothing
#'
#' Let's build our classifier first.
hn_classifier <- naiveBayes(hn_train, hn_med_train$stage)
#'
#' Then we can use the classifier to make predictions using `predict()`. Recall that when we presented the AdaBoost example in [Chapter 2](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/02_ManagingData.html), we saw the basic mechanism of machine-learning training, prediction and assessment.
#'
#' The function `predict()` has the following components:
#'
#' `p <- predict(m, test, type="class")`
#'
#' * m: classifier trained by `naiveBayes()`
#' * test: test data frame or matrix
#' * type: either `"class"` or `"raw"`, specifying whether the predictions should be the most likely class values or the raw predicted probabilities
hn_test_pred <- predict(hn_classifier, hn_test)
#'
#' ## Step 4 - evaluating model performance
#'
#' Similarly to the approach in [Chapter 6](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/06_LazyLearning_kNN.html), we use a *cross table* to compare the predicted classes with the true classes of the test dataset.
library(gmodels)
CrossTable(hn_test_pred, hn_med_test$stage)
#'
#' It may be worth skipping forward to [Chapter 13](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/13_ModelEvaluation.html#31_binary_outcomes), where we present a summary table of the key measures used to evaluate the performance of binary tests, classifiers, or predictions.
#'
#' The prediction accuracy is:
#'
#' $$ACC = \frac{TP + TN}{TP + FP + FN + TN} = \frac{93}{133}=0.70.$$
#'
#' From the cross table we can see that our prediction accuracy is $\frac{93}{133}=0.70$. However, the *later_stage* class receives only $3$ predicted counts. This might be due to the $P(F_i|C_L)\sim 0$ problem that we discussed earlier.
#'
#' ## Step 5 - improving model performance
#'
#' After setting `laplace=15`, the accuracy goes up to 76% ($acc=\frac{101}{133}$). Although this is only a small improvement in terms of overall accuracy, we now have a better chance of detecting *later stage* patients.
hn_classifier <- naiveBayes(hn_train, hn_med_train$stage, laplace = 15)
hn_test_pred <- predict(hn_classifier, hn_test)
CrossTable(hn_test_pred, hn_med_test$stage)
#'
#' ## Step 6 - compare Naive Bayesian vs. LDA
#'
#' As mentioned earlier, Naive Bayes with a normality assumption is a special case of Discriminant Analysis. It might be interesting to compare the prediction results of the *Naive Bayes* and *LDA* classifiers.
library(MASS)
df_hn_train = data.frame(lapply(as.data.frame(hn_train), as.numeric), stage = hn_med_train$stage)
df_hn_test = data.frame(lapply(as.data.frame(hn_test), as.numeric), stage = hn_med_test$stage)
hn_lda <- lda(data=df_hn_train, stage~.)
# hn_pred = predict(hn_lda, df_hn_test[, -104])
hn_pred = predict(hn_lda, df_hn_test)
CrossTable(hn_pred$class, df_hn_test$stage)
#'
#' Here, the Naive Bayes classifier outperforms LDA in terms of overall accuracy; however, LDA has a lower type II error ($\frac{22}{133}$), which is clinically important for avoiding the misclassification of later-stage cancer patients.
#'
#' In later chapters, we will go deeper into the space of classification problems and see more sophisticated approaches.
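#'
#' Instead of reading the performance measures off the cross tables, we can also compute them programmatically. Below is a small helper sketch (our own illustration, not part of `gmodels` or `e1071`), which treats "later_stage" as the positive class:
# Accuracy and sensitivity from a confusion table (rows = predicted, columns = actual)
perf <- function(pred, actual, positive = "later_stage") {
  tab <- table(pred, actual)
  acc <- sum(diag(tab)) / sum(tab)                        # (TP+TN)/(TP+FP+FN+TN)
  sens <- tab[positive, positive] / sum(tab[, positive])  # TP/(TP+FN)
  c(accuracy = acc, sensitivity = sens)
}
perf(hn_test_pred, hn_med_test$stage)  # Naive Bayes (laplace = 15)
perf(hn_pred$class, df_hn_test$stage)  # LDA
#'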
#' # Practice Problem
#'
#' In the previous case study, we classified the patients with `seer_stage` of "not applicable" (`seer_stage`=8) and "unstaged, unknown or unspecified" (`seer_stage`=9) as no cancer or early cancer stages. Let's remove these two categories and replicate the Naive Bayes classifier case study again.
hn_med1 <- hn_med[!hn_med$seer_stage %in% c(8, 9), ]
str(hn_med1); dim(hn_med1)
#'
#' Now we have only 580 observations. We can either use the first 480 of them as the training dataset and the last 100 as the test dataset, or select a random 80-20 (training-testing) split, and then evaluate the prediction accuracy (below we use `laplace=15`).
hn_med1_corpus <- Corpus(VectorSource(hn_med1$MEDICATION_SUMMARY))
corpus_clean1 <- tm_map(hn_med1_corpus, tolower)
corpus_clean1 <- tm_map(corpus_clean1, removePunctuation)
corpus_clean1 <- tm_map(corpus_clean1, stripWhitespace)
corpus_clean1 <- tm_map(corpus_clean1, removeNumbers)
# corpus_clean1 <- tm_map(corpus_clean1, PlainTextDocument)
hn_med_dtm1 <- DocumentTermMatrix(corpus_clean1)
set.seed(11)
subset_int1 <- sample(nrow(hn_med1), floor(nrow(hn_med1)*0.8))  # 80% training + 20% testing
hn_med_train1 <- hn_med1[subset_int1, ]
hn_med_test1 <- hn_med1[-subset_int1, ]
hn_med_dtm_train1 <- hn_med_dtm1[subset_int1, ]
hn_med_dtm_test1 <- hn_med_dtm1[-subset_int1, ]
corpus_train1 <- corpus_clean1[subset_int1]
corpus_test1 <- corpus_clean1[-subset_int1]
# Alternatively, a fixed split:
# hn_med_train1 <- hn_med1[1:480, ]
# hn_med_test1 <- hn_med1[481:580, ]
# hn_med_dtm_train1 <- hn_med_dtm1[1:480, ]
# hn_med_dtm_test1 <- hn_med_dtm1[481:580, ]
# corpus_train1 <- corpus_clean1[1:480]
# corpus_test1 <- corpus_clean1[481:580]
#'
#' We can use the same code as before for creating the classes in the training and test datasets. Since `seer_stage` 8 and 9 are no longer in the data, we classify `seer_stage` 0, 1, 2 or 3 as "early_stage" and `seer_stage` 4, 5 or 7 as "later_stage".
hn_med_train1$stage <- hn_med_train1$seer_stage %in% c(4, 5, 7)
hn_med_train1$stage <- factor(hn_med_train1$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
hn_med_test1$stage <- hn_med_test1$seer_stage %in% c(4, 5, 7)
hn_med_test1$stage <- factor(hn_med_test1$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
prop.table(table(hn_med_train1$stage))
prop.table(table(hn_med_test1$stage))
#'
#' Use terms that appear in at least 5 documents in the training dataset to build the document term matrix.
summary(findFreqTerms(hn_med_dtm_train1, 5))
hn_med_dict1 <- as.character(findFreqTerms(hn_med_dtm_train1, 5))
hn_train1 <- DocumentTermMatrix(corpus_train1, list(dictionary=hn_med_dict1))
hn_test1 <- DocumentTermMatrix(corpus_test1, list(dictionary=hn_med_dict1))
hn_train1 <- apply(hn_train1, MARGIN = 2, convert_counts)
hn_test1 <- apply(hn_test1, MARGIN = 2, convert_counts)
hn_classifier1 <- naiveBayes(hn_train1, hn_med_train1$stage, laplace = 15)
hn_test_pred1 <- predict(hn_classifier1, hn_test1)
CrossTable(hn_test_pred1, hn_med_test1$stage)
#'
#' $$ACC = \frac{TP + TN}{TP + FP + FN + TN} = \frac{86}{116}=0.74.$$
#'
#' Try to reproduce these results with some [new data from the list of our Case-Studies](https://umich.instructure.com/courses/38100/files/).
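#'
#' As a starting point for experimenting with new data, the text-preprocessing steps above can be wrapped into a single reusable function. This is just a minimal sketch (it reuses our `convert_counts()` helper and assumes the input is a character vector of free-text summaries):
# Build binary word-indicator features from raw text; reuse a training
# dictionary for the test set by passing it via the `dictionary` argument
text_to_features <- function(text, dictionary = NULL, min_docs = 5) {
  corp <- Corpus(VectorSource(text))
  corp <- tm_map(corp, tolower)
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, stripWhitespace)
  corp <- tm_map(corp, removeNumbers)
  if (is.null(dictionary)) {
    dictionary <- as.character(findFreqTerms(DocumentTermMatrix(corp), min_docs))
  }
  dtm <- DocumentTermMatrix(corp, list(dictionary = dictionary))
  list(features = apply(dtm, MARGIN = 2, convert_counts), dictionary = dictionary)
}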