Please review the introduction to Chapter 6, where we described the types of machine learning methods and presented lazy classification for numerical data. What about nominal features or textual data? In this chapter, we will begin to explore some classification techniques for categorical data. Specifically, we will (1) present the Naive Bayes algorithm, (2) review its assumptions, (3) discuss Laplace estimation, and (4) illustrate the naive Bayesian classifier on a Head and Neck Cancer Medication case-study.

Later, in Chapter 19, we will also discuss text mining and natural language processing of unstructured text data.

1 Overview of the Naive Bayes Algorithm

Start by reviewing the basics of probability theory and Bayesian inference.

Bayes classifiers use training data to calculate an observed probability of each class based on all the features. The probability links feature values to classes like a map. When labeling the test data, we utilize the feature values in the test data and the “map” to classify our test data with the most likely class. This idea seems simple but the corresponding algorithmic implementations might be very sophisticated.

The best scenario of accurately estimating the probability of an outcome-class map is when all features in Bayes classifiers attribute to the class simultaneously. The naive Bayes algorithm is frequently used for text classifications. The maximum a posteriori assignment to the class label is based on obtaining the conditional probability density function for each feature given the value of the class variable.

2 Assumptions

Naive Bayes is named for its “naive” assumptions. Its most important assumption is that all of the features are equally important and independent. This rarely happens in real world data. However, sometimes even when the assumptions are violated, naive Bayes still performs fairly accurate, particularly when the number of features \(p\) is large. This is why the Naive Bayes algorithm may be used as a powerful text classifier.

There are interesting relations between QDA (Quadratic Discriminant Analysis), LDA (Linear Discriminant Analysis) and Naive Bayes classification. Additional information about LDA and QDA is available here.

3 Bayes Formula

Let’s first define the set-theoretic Bayes formula. We assume that \(B_i\)’s are mutually exclusive events, for all \(i= 1, 2, ..., n\), where n represents the number of features. If \(A\) and \(B\) are two events, the Bayes conditional probability formula is as follows: \[Posterior\, Probability=\frac{likelihood\times Prior\, Probability}{Marginal\, Likelihood}\]

Symbolically,

\[P(A|B)=\frac{P(B|A)P(A)}{P(B)}.\] When \(B_i's\) represent a partition of the event space, \(S=\cup {B_i}\) and \(B_i\cap B_j = \emptyset,\ \forall i\not= j\). So we have:

\[P(A|B)=\frac{P(B|A)\times P(A)}{P(B|B_1)\times P(B_1) + P(B|B_2)\times P(B_2) + ... + P(B|B_n)\times P(B_n)}.\]

Now, let’s represent the Bayes formula in terms of classification using observed features. Having observed \(n\) features, \(F_i\), for each of \(K\) possible class outcomes, \(C_k\). The Bayesian model may be reformulate to make it more tractable using the Bayes’ theorem, by decomposing the conditional probability.

\[ P(C_k \mid F_1, \dots, F_n)= \frac{P(F_1, \dots, F_n|C_k)P(C_k)}{P(F_1, \dots, F_n)}.\]

In the above expression, only the numerator depends on the class label, \(C_k\), as the values of the features \(F_i\) are observed (or imputed) making the denominator constant. Let’s focus on the numerator.

The numerator essentially represents the joint probability model:

\[P(F_1, \dots, F_n|C_k)P(C_k) = \underbrace{P(F_1, ..., F_n, C_k)}_{\text{joint model}}.\]

Repeatedly using the chain rule and the definition of conditional probability simplifies this to:

\[P(F_1, \dots ,F_n, C_{k})=P(F_1| F_2, \dots ,F_n, C_{k})\times P(F_2, \dots ,F_n, C_{k})=\] \[=P(F_1| F_2, \dots ,F_n, C_{k})\times P(F_2|F_3, \dots ,F_n, C_{k})\times P(F_3, \dots ,F_n, C_{k})=\] \[=P(F_1| F_2, \dots ,F_n, C_{k})\times P(F_2|F_3, \dots ,F_n, C_{k})\times P(F_3|F_4, \dots ,F_n, C_{k})\times P(F_4, \dots ,F_n, C_{k})=\] \[=...=\] \[=P(F_1| F_2, \dots ,F_n, C_{k})\times P(F_2|F_3, \dots ,F_n, C_{k})\times P(F_3|F_4, \dots ,F_n, C_{k})\times \cdots \times P(F_n| C_{k})\times P(C_k)\]

Note that the “naive” qualifier in the Naive Bayes classifier name is attributed to the oversimplification of the conditional probability. Assuming each feature \(F_i\) is conditionally statistical independent of every other feature \(F_j, \ \forall j\neq i\), given the category \(C_k\), we get:

\[P(F_i | F_{i+1}, \dots , F_{n}, C_k ) = P(F_i | C_k).\]

This reduces the joint probability model to:

\[P(F_1, \dots ,F_n, C_{k})= P(F_1| C_{k})\times P(F_2| C_{k})\times P(F_3| C_{k})\times \cdots \times P(F_n| C_{k})\times P(C_k)\]

Therefore, the joint model is:

\[P(F_1, \dots ,F_n, C_{k})= P(C_k) \prod_{i=1}^n {P(F_i| C_{k})}\]

Essentially, we express the probability of class level \(L\) given an observation, represented as a set of independent features \(F_1, F_2, ..., F_n\). Then the posterior probability that the observation is in class \(L\) is equal to:

\[P(C_L|F_1, ..., F_n)=\frac{P(C_L)\prod_{i=1}^nP(F_i|C_L)}{\prod_{i=1}^nP(F_i)},\] where the denominator, \(\prod_{i=1}^nP(F_i)\), is a scaling factor that represents the marginal probability of observing all features jointly.

For a given case \(X=(F_1, F_2, ..., F_n)\), i.e., given vector of features, the naive Bayes classifier assigns the most likely class \(\hat{C}\) by calculating \(\frac{P(C_L)\prod_{i=1}^nP(F_i|C_L)}{\prod_{i=1}^nP(F_i)}\) for all class labels \(L\), and then assigning the class \(\hat{C}\) corresponding to the maximum posterior probability. Analytically, \(\hat{C}\) is defined by: \[ \hat{C} = \arg\max_L{\frac{P(C_L)\prod_{i=1}^nP(F_i|C_L)}{\prod_{i=1}^nP(F_i)}}.\]

As the denominator is static for \(L\), the posterior probability above is maximized when the numerator is maximized, i.e., \(\hat{C} = \arg\max_L {P(C_L)\prod_{i=1}^nP(F_i|C_L)}.\)

The contingency table below illustrates schematically how the Bayesian, marginal, conditional, and joint probabilities may be calculated for a finite number of features (columns) and classes (rows).

\(Features\\Classes\)	\(F_1\)	\(F_2\)	\(...\)	\(F_n\)	Total
\(C_1\)	\(...\)	\(...\)	\(...\)	\(...\)	Marginal \(P(C_1)\)
\(C_2\)	\(...\)	\(...\)	\(...\)	Joint \(P(C_2, F_n)\)	\(...\)
\(...\)	\(...\)	\(...\)	\(...\)	\(...\)	\(...\)
\(C_L\)	Conditional \(P(F_1\mid C_L)=\frac{P(F_1,C_L)}{P(C_L)}\)	\(...\)	\(...\)	\(...\)	\(...\)
Total		Marginal \(P(F_2)\)	\(...\)	\(...\)	\(N\)

In the DSPA Appendix, we provide additional technical details, code, and applications of Bayesian simulation, modeling and inference.

4 The Laplace Estimator

If at least one \(P(F_i|C_L)=0\) then \(P(C_L|F_1, ..., F_n)=0\), which means the probability of being in this class is 0. However, \(P(F_i|C_L)=0\) could happen by random chance, e.g., when selecting the training data.

One of the solutions to this scenario is Laplace estimation, also known as Laplace smoothing, which can be accomplished in two ways. One is to add small number to each counts in the frequency table, which allows each class-feature combination to be at least one in the training data. Then \(P(F_i|C_L)>0\) for all \(i\) and we avoid degenerate cases. Another strategy is to add some small value, \(\epsilon\), to the numerator and denominator when calculating the posterior probability. Note that these small perturbations of the denominator should be larger than the changes in the numerator to avoid trivial (\(0\)) posterior for another class.

5 Case Study: Head and Neck Cancer Medication

5.1 Step 1: Collecting Data

We utilize the Inpatient Head and Neck Cancer Medication data for this case study, which is the case study 14 in our data archive.

Variables:

PID: coded patient ID
ENC_ID: coded encounter ID
Seer_stage: SEER cancer stage (0 =In situ, 1=Localized, 2=Regional by direct extension, 3=Regional to lymph nodes, 4=Regional (both codes 2 and 3), 5=Regional, NOS, 7= Distant metastases/systemic disease, 8=Not applicable, 9=Unstaged, unknown, or unspecified). See: http://seer.cancer.gov/tools/ssm.
Medication_desc: description of the chemical composition of the medication
Medication_summary: brief description about medication brand and usage
Dose: the dosage in the medication summary
Unit: the unit for dosage in the Medication_summary
Frequency: the frequency of use in the Medication_summary
Total_dose_count: total dosage count according to the Medication_summary

5.2 Step 2: Exploring and preparing the data

Let’s first load the cancer dataset.

hn_med<-read.csv("https://umich.instructure.com/files/1614350/download?download_frd=1", stringsAsFactors = FALSE)
str(hn_med)

## 'data.frame':    662 obs. of  9 variables:
##  $ PID               : int  10000 10008 10029 10063 10071 10103 1012 10135 10136 10143 ...
##  $ ENC_ID            : int  46836 46886 47034 47240 47276 47511 3138 47739 47744 47769 ...
##  $ seer_stage        : int  1 1 4 1 9 1 1 1 9 1 ...
##  $ MEDICATION_DESC   : chr  "ranitidine" "heparin injection" "ampicillin/sulbactam IVPB UH" "fentaNYL injection UH" ...
##  $ MEDICATION_SUMMARY: chr  "(Zantac) 150 mg tablet oral two times a day" "5,000 unit subcutaneous three times a day" "(Unasyn) 15 g IV every 6 hours" "25 - 50 microgram IV every 5 minutes PRN severe pain\nMaximum dose 200 mcg  Per PACU protocol" ...
##  $ DOSE              : chr  "150" "5000" "1.5" "50" ...
##  $ UNIT              : chr  "mg" "unit" "g" "microgram" ...
##  $ FREQUENCY         : chr  "two times a day" "three times a day" "every 6 hours" "every 5 minutes" ...
##  $ TOTAL_DOSE_COUNT  : int  5 3 11 2 1 2 2 6 15 1 ...

Change the seer_stage (cancer stage indicator) variable into a factor.

hn_med$seer_stage <- factor(hn_med$seer_stage)
str(hn_med$seer_stage)

##  Factor w/ 9 levels "0","1","2","3",..: 2 2 5 2 9 2 2 2 9 2 ...

table(hn_med$seer_stage)

## 
##   0   1   2   3   4   5   7   8   9 
##  21 265  53  90  46  18  87  14  68

5.2.1 Data preparation - processing text data for analysis

As you can see, the medication_summary contains a great amount of text. We should do some text mining to prepare the data for analysis. In R, the tm package is a good choice for text mining.

# install.packages("tm", repos = "http://cran.us.r-project.org")
# requires R V.3.3.1 +
library(tm)

First step for text mining is to convert text features (text elements) into a corpus object, which is a collection of text documents.

hn_med_corpus<-Corpus(VectorSource(hn_med$MEDICATION_SUMMARY))
print(hn_med_corpus)

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 662

After we construct the corpus object, we could see that we have 662 documents. Each document represents an encounter (e.g., notes on medical treatment) for a patient.

inspect(hn_med_corpus[1:3])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1] (Zantac) 150 mg tablet oral two times a day
## [2] 5,000 unit subcutaneous three times a day  
## [3] (Unasyn) 15 g IV every 6 hours

hn_med_corpus[[1]]$content

## [1] "(Zantac) 150 mg tablet oral two times a day"

hn_med_corpus[[2]]$content

## [1] "5,000 unit subcutaneous three times a day"

hn_med_corpus[[3]]$content

## [1] "(Unasyn) 15 g IV every 6 hours"

There are unwanted punctuations and other symbols in the corpus document that we want to remove. We use tm_map() function for the cleaning.

corpus_clean<-tm_map(hn_med_corpus, tolower)

## Warning in tm_map.SimpleCorpus(hn_med_corpus, tolower): transformation drops
## documents

corpus_clean<-tm_map(corpus_clean, removePunctuation)

## Warning in tm_map.SimpleCorpus(corpus_clean, removePunctuation): transformation
## drops documents

# corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <-tm_map(corpus_clean, removeNumbers)

## Warning in tm_map.SimpleCorpus(corpus_clean, removeNumbers): transformation
## drops documents

corpus_clean <- tm_map(corpus_clean, stripWhitespace)

## Warning in tm_map.SimpleCorpus(corpus_clean, stripWhitespace): transformation
## drops documents

# corpus_clean <- tm_map(corpus_clean, PlainTextDocument)

The above lines of code changed all the characters to lower case, removed all punctuations and extra white spaces (typically created by deleting punctuations), and removed numbers (we could also convert the corpus to plain text).

inspect(corpus_clean[1:3])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1] zantac mg tablet oral two times a day  unit subcutaneous three times a day 
## [3] unasyn g iv every hours

corpus_clean[[1]]$content

## [1] "zantac mg tablet oral two times a day"

corpus_clean[[2]]$content

## [1] " unit subcutaneous three times a day"

corpus_clean[[3]]$content

## [1] "unasyn g iv every hours"

DocumentTermMatrix() function can successfully tokenize the medication summary into words. It can count frequent terms in each document in the corpus object.

hn_med_dtm<-DocumentTermMatrix(corpus_clean)

5.2.2 Data preparation - creating training and test datasets

Just like in Chapter 6, we need to separate the dataset into training and test subsets. We have to subset the raw data with other features, the corpus object and the document term matrix.

set.seed(12345)
subset_int <- sample(nrow(hn_med),floor(nrow(hn_med)*0.8))  # 80% training + 20% testing
hn_med_train<-hn_med[subset_int, ]
hn_med_test<-hn_med[-subset_int, ]
hn_med_dtm_train<-hn_med_dtm[subset_int, ]
hn_med_dtm_test<-hn_med_dtm[-subset_int, ]
corpus_train<-corpus_clean[subset_int]
corpus_test<-corpus_clean[-subset_int]

# hn_med_train<-hn_med[1:562, ]
#hn_med_test<-hn_med[563:662, ]
# hn_med_dtm_train<-hn_med_dtm[1:562, ]
# hn_med_dtm_test<-hn_med_dtm[563:662, ]
#corpus_train<-corpus_clean[1:562]
#corpus_test<-corpus_clean[563:662]

Let’s examine the distribution of seer stages in the training and test datasets.

prop.table(table(hn_med_train$seer_stage))

## 
##          0          1          2          3          4          5          7 
## 0.01890359 0.41965974 0.07561437 0.13988658 0.07561437 0.03213611 0.12098299 
##          8          9 
## 0.01890359 0.09829868

prop.table(table(hn_med_test$seer_stage))

## 
##           0           1           2           3           4           5 
## 0.082706767 0.323308271 0.097744361 0.120300752 0.045112782 0.007518797 
##           7           8           9 
## 0.172932331 0.030075188 0.120300752

We can separate (dichotomize) the seer_stage into two categories:

no stage or early stage cancer (seer in [0;4]), and
later stage cancer (seer in [5;9]).

Of course, other binarizations are possible as well. Note that a %in% b is an intuitive interface to match and acts as a binary operator returning a logical vector (T or F) indicating if there is a match between the left and right operands.

hn_med_train$stage<-hn_med_train$seer_stage %in% c(5:9)
hn_med_train$stage<-factor(hn_med_train$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
hn_med_test$stage<-hn_med_test$seer_stage %in% c(5:9)
hn_med_test$stage<-factor(hn_med_test$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
prop.table(table(hn_med_train$stage))

## 
## early_stage later_stage 
##   0.7296786   0.2703214

prop.table(table(hn_med_test$stage))

## 
## early_stage later_stage 
##   0.6691729   0.3308271

5.2.3 Visualizing text data - word clouds

A word cloud can help us visualize text data. More frequent words would have larger fonts in the figure, while less common words are appearing in smaller fonts. There is a wordcloud package in R that is commonly used for creating these figures.

# install.packages("wordcloud", repos = "http://cran.us.r-project.org")
library(wordcloud)

## Loading required package: RColorBrewer

wordcloud(corpus_train, min.freq = 40, random.order = FALSE, colors=brewer.pal(5, "Dark2"))

The random.order=FALSE option makes more frequent words appear in the center of the word cloud. The min.freq=40 option sets the cutoff word frequency to be at least 40 times in the corpus object. Therefore, the words must be appear in at least 40 medication summaries to be shown on the graph.

We can also visualize the difference between early stages and later stages using this type of graph.

early<-subset(hn_med_train, stage=="early_stage")
later<-subset(hn_med_train, stage=="later_stage")
wordcloud(early$MEDICATION_SUMMARY, max.words = 20, colors=brewer.pal(3, "Dark2"))

## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents

## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

wordcloud(later$MEDICATION_SUMMARY, max.words = 20, colors=brewer.pal(3, "Dark2"))

## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents

## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents

# tdm = as.matrix(hn_med_dtm)
# tdm1 <- tdm[nchar(rownames(tdm)) < 15, ]
# colnames(tdm1) <- hn_med_dtm$dimnames$Terms[1:14]
# comparison.cloud(tdm1, random.order=FALSE,
#                  colors = c("#00B2FF", "red", "#FF0099", "#6600CC", "green", "orange", "blue", "brown"),
#                  title.size=1, min.words=5, max.words=50, scale=c(1.5, 0.4),rot.per=0.4)

We can see that the frequent words are somewhat different in the medication summaries of the early and later stage patients.

5.2.4 Data preparation - creating indicator features for frequent words

For simplicity, we utilize the medication summary as the only feature to classify cancer stages. You may recall that in Chapter 6 we used features for classifications. In this study, we are going to make frequencies of words into features.

summary(findFreqTerms(hn_med_dtm_train, 5))

##    Length     Class      Mode 
##       108 character character

hn_med_dict<-as.character(findFreqTerms(hn_med_dtm_train, 5))
hn_train<-DocumentTermMatrix(corpus_train, list(dictionary=hn_med_dict))
hn_test<-DocumentTermMatrix(corpus_test, list(dictionary=hn_med_dict))

The above code limits the document term matrix with words that have appeared in at least 5 different documents. This created (about) 118 features for us to use.

The Naive Bayes classifier trains on data with categorical features, as it uses frequency tables for learning the data affinities. To create the combinations of class and feature values comprising the frequency-table (matrix), all feature must be categorical. For instance, numeric features have to be converted (binned) into categories. Thus, we need to transform our word count features into categorical data. One way to achieve this is to change the count into an indicator of whether this word appears in the document describing the patient encounter. We can create a simple function to convert presence of a specific word (column) in an encounter document (row) to a binary “Yes” (present) or “No” (not present).

convert_counts <- function(wordFreq) {
  wordFreq <- ifelse(wordFreq > 0, 1, 0)
  wordFreq <- factor(wordFreq, levels = c(0, 1), labels = c("No", "Yes"))
  return(wordFreq)
}

We employ hard-thresholding here x<-ifelse(x>0, 1, 0). This is saying that if we have an x that is greater than 0, we assign value 1 to it, otherwise the value is set to 0.

Now let’s apply our own function convert_counts() on each column (MARGIN=2) of the training and testing datasets.

hn_train <- apply(hn_train, MARGIN = 2, convert_counts)
hn_test <- apply(hn_test, MARGIN = 2, convert_counts)

# Check the structure of hn_train and hn_train:
# head(hn_train); dim(hn_train)

So far, we successfully created indicators for words that appeared at least in 5 different documents in the training data.

5.3 Step 3 - training a model on the data

The package we will use for Naive Bayes classifier is called e1071.

# install.packages("e1071", repos = "http://cran.us.r-project.org")
library(e1071)

The function naiveBayes() has following components:

m<-naiveBayes(train, class, laplace=0)

train: data frame containing numeric training data (features)
class: factor vector with the class for each row in the training data.
laplace: positive double controlling Laplace smoothing; default is \(0\) and disables Laplace smoothing,

Let’s build our classifier first.

hn_classifier <- naiveBayes(hn_train, hn_med_train$stage)

Then we can use the classifier to make predictions using predict(). Recall that when we presented the AdaBoost example in Chapter 2, we saw the basic mechanism of machine-learning training, prediction and assessment.

The function predict() has the following components:

p<-predict(m, test, type="class")

m: classifier trained by naiveBayes()
test: test data frame or matrix
type: either "class" or "raw" specifies whether the predictions should be the most likely class value or the raw predicted probabilities.

hn_test_pred<-predict(hn_classifier, hn_test)

5.4 Step 4 - evaluating model performance

Similarly to the approach in Chapter 6, we use cross table to compare predicted class and the true class of our test dataset.

library(gmodels)
CT <- CrossTable(hn_test_pred, hn_med_test$stage)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  133 
## 
##  
##              | hn_med_test$stage 
## hn_test_pred | early_stage | later_stage |   Row Total | 
## -------------|-------------|-------------|-------------|
##  early_stage |          64 |          30 |          94 | 
##              |       0.019 |       0.039 |             | 
##              |       0.681 |       0.319 |       0.707 | 
##              |       0.719 |       0.682 |             | 
##              |       0.481 |       0.226 |             | 
## -------------|-------------|-------------|-------------|
##  later_stage |          25 |          14 |          39 | 
##              |       0.046 |       0.093 |             | 
##              |       0.641 |       0.359 |       0.293 | 
##              |       0.281 |       0.318 |             | 
##              |       0.188 |       0.105 |             | 
## -------------|-------------|-------------|-------------|
## Column Total |          89 |          44 |         133 | 
##              |       0.669 |       0.331 |             | 
## -------------|-------------|-------------|-------------|
## 
##

CT

## $t
##              y
## x             early_stage later_stage
##   early_stage          64          30
##   later_stage          25          14
## 
## $prop.row
##              y
## x             early_stage later_stage
##   early_stage   0.6808511   0.3191489
##   later_stage   0.6410256   0.3589744
## 
## $prop.col
##              y
## x             early_stage later_stage
##   early_stage   0.7191011   0.6818182
##   later_stage   0.2808989   0.3181818
## 
## $prop.tbl
##              y
## x             early_stage later_stage
##   early_stage   0.4812030   0.2255639
##   later_stage   0.1879699   0.1052632

mod_TN <- CT$prop.row[1, 1]  
mod_FP <- CT$prop.row[1, 2]
mod_FN <- CT$prop.row[2, 1]
mod_TP <- CT$prop.row[2, 2]
# caret::confusionMatrix(hn_test_pred, hn_med_test$stage)
# CT$prop.row
library(plotly)

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

plot_ly(x = c("TN", "FN", "FP", "TP"),
  y = c(mod_TN, mod_FN, mod_FP, mod_TP),
  name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>% 
  layout(title="Consusion Matrix", 
           legend=list(title=list(text='<b> Metrics </b>')), 
           xaxis=list(title='Metrics'), yaxis=list(title='Probability'))

It may be worth quickly looking forward in Chapter 13 where we present a summary table for the key measures used to evaluate the performance of binary tests, classifiers, or predictions.

In this case, the prediction accuracy of the Naive Bayes classifier, assessed on the testing dataset, is: \[ ACC = \frac{T P + T N}{T P + F P + F N + T N } = \frac{78}{133}=0.59.\]

From the cross table we can see that our testing-data prediction accuracy is \(\frac{78}{133}=0.59\).

5.5 Step 5 - improving model performance

After setting laplace=1, the accuracy goes to \(acc=\frac{75}{133}=0.56\). Although this is a small improvement in terms of accuracy, yet, we still fail to identify true later stage patients in the testing dataset.

set.seed(1234)
hn_classifier1 <- naiveBayes(hn_train, hn_med_train$stage, laplace = 1, type = "class")
hn_test_pred1 <- predict(hn_classifier1, hn_test)
CrossTable(hn_test_pred1, hn_med_test$stage)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  133 
## 
##  
##               | hn_med_test$stage 
## hn_test_pred1 | early_stage | later_stage |   Row Total | 
## --------------|-------------|-------------|-------------|
##   early_stage |          58 |          27 |          85 | 
##               |       0.022 |       0.045 |             | 
##               |       0.682 |       0.318 |       0.639 | 
##               |       0.652 |       0.614 |             | 
##               |       0.436 |       0.203 |             | 
## --------------|-------------|-------------|-------------|
##   later_stage |          31 |          17 |          48 | 
##               |       0.039 |       0.079 |             | 
##               |       0.646 |       0.354 |       0.361 | 
##               |       0.348 |       0.386 |             | 
##               |       0.233 |       0.128 |             | 
## --------------|-------------|-------------|-------------|
##  Column Total |          89 |          44 |         133 | 
##               |       0.669 |       0.331 |             | 
## --------------|-------------|-------------|-------------|
## 
##

Accuracy is \(\frac{75}{133} = 0.564\). This is a degraded performance, compared to the prior (\(laplace=0\)) model.

5.6 Step 6 - compare Naive Bayesian vs. LDA

As mentioned earlier, Naive Bayes with normality assumption is a special case of Discriminant Analysis. It might be interesting to compare the prediction results of Naive Bayes and LDA classification.

Note: LDA assumes the predictors are jointly approximately normally distributed.

library(MASS)
# df_hn_train = data.frame(lapply(as.data.frame(hn_train),as.numeric), 
#                          stage = hn_med_train$stage)
# df_hn_test = data.frame(lapply(as.data.frame(hn_test),as.numeric), 
#                         stage = hn_med_test$stage)

library("dplyr")
binarizeFunction <- function(x) { ifelse(x=="Yes", 1,0) }

#  A function to Convert Categorical variables to numeric
cat2Numeric <- function (dfInput) {
  df = as.data.frame(lapply( as.data.frame(dfInput), factor)) %>%
                       mutate_all(binarizeFunction)
   return(df)
}

# define the numeric DF of predictors (X) and outcome (Y=stage)
df_hn_train = data.frame(cat2Numeric(hn_train), stage = as.numeric(hn_med_train$stage))
df_hn_test = data.frame(cat2Numeric(hn_test), stage = as.numeric(hn_med_test$stage))

# Remove the multicollinearity - this should be done via VIF assessment, 
#      but for now, just take the first few predictors
df_hn_train <- df_hn_train[ , c(1:34, 40:50, 60:70, 109)]

# Fit LDA
set.seed(1234)
hn_lda <- lda(data=df_hn_train, stage~.)
# hn_pred = predict(hn_lda, df_hn_test[,-104])
hn_pred = predict(hn_lda, df_hn_test)
CrossTable(hn_pred$class, df_hn_test$stage)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  133 
## 
##  
##               | df_hn_test$stage 
## hn_pred$class |         1 |         2 | Row Total | 
## --------------|-----------|-----------|-----------|
##             1 |        83 |        33 |       116 | 
##               |     0.372 |     0.753 |           | 
##               |     0.716 |     0.284 |     0.872 | 
##               |     0.933 |     0.750 |           | 
##               |     0.624 |     0.248 |           | 
## --------------|-----------|-----------|-----------|
##             2 |         6 |        11 |        17 | 
##               |     2.541 |     5.139 |           | 
##               |     0.353 |     0.647 |     0.128 | 
##               |     0.067 |     0.250 |           | 
##               |     0.045 |     0.083 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |        89 |        44 |       133 | 
##               |     0.669 |     0.331 |           | 
## --------------|-----------|-----------|-----------|
## 
##

There are differences in the performance of the Naive Bayesian (\(acc=\frac{87}{133}=0.65\)) and the LDA classifiers (\(acc=\frac{94}{133}=0.7\)) in terms of the overall accuracy. LDA has a lower type II error (\(\frac{33}{133}=0.25\)), which is clinically important to avoid missing later-stage cancer patients.

In later chapters, we will step deeper into the space of classification problems and see more sophisticated approaches, especially for difficult problems like cancer forecasting.

6 Practice Problems

6.1 Iris Species

The classification of the iris flowers represents an easy example of Naive Bayesian classifier.

data(iris)
nbc_model <- naiveBayes(Species ~ ., data = iris)
## alternatively:
nbc_model <- naiveBayes(iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")], iris[,"Species"])
predicted.nbcvalues <- predict(nbc_model, iris[,c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")])
table(predicted.nbcvalues, iris[, "Species"])

##                    
## predicted.nbcvalues setosa versicolor virginica
##          setosa         50          0         0
##          versicolor      0         47         3
##          virginica       0          3        47

6.2 Cancer Study

In the previous cancer case study, we classified the patients with seer_stage of “not applicable”(seer_stage=8) and “unstaged, unknown or unspecified” (seer_stage=9) as no cancer or early cancer stages. Let’s remove these two categories and replicate the Naive Bayes classifier case study again.

hn_med1<-hn_med[!hn_med$seer_stage %in% c(8, 9), ]
str(hn_med1); dim(hn_med1)

## 'data.frame':    580 obs. of  9 variables:
##  $ PID               : int  10000 10008 10029 10063 10103 1012 10135 10143 10152 10184 ...
##  $ ENC_ID            : int  46836 46886 47034 47240 47511 3138 47739 47769 47800 47938 ...
##  $ seer_stage        : Factor w/ 9 levels "0","1","2","3",..: 2 2 5 2 2 2 2 2 7 2 ...
##  $ MEDICATION_DESC   : chr  "ranitidine" "heparin injection" "ampicillin/sulbactam IVPB UH" "fentaNYL injection UH" ...
##  $ MEDICATION_SUMMARY: chr  "(Zantac) 150 mg tablet oral two times a day" "5,000 unit subcutaneous three times a day" "(Unasyn) 15 g IV every 6 hours" "25 - 50 microgram IV every 5 minutes PRN severe pain\nMaximum dose 200 mcg  Per PACU protocol" ...
##  $ DOSE              : chr  "150" "5000" "1.5" "50" ...
##  $ UNIT              : chr  "mg" "unit" "g" "microgram" ...
##  $ FREQUENCY         : chr  "two times a day" "three times a day" "every 6 hours" "every 5 minutes" ...
##  $ TOTAL_DOSE_COUNT  : int  5 3 11 2 2 2 6 1 24 2 ...

## [1] 580   9

Now we have only 580 observations. We can either use the first 480 of them as the training dataset and the last 100 as the test dataset, or select 80-20 (training-testing) split, and evaluate the prediction accuracy when laplace=1?

We can use the same code for creating the classes in training and test dataset. Since the seer_stage=8 or 9 is not in the data, we classify seer_stage=0, 1, 2 or 3 as “early_stage” and seer_stage=4, 5 or 7 as “later_stage”.

hn_med_train1$stage<-hn_med_train1$seer_stage %in% c(4, 5, 7)
hn_med_train1$stage<-factor(hn_med_train1$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
hn_med_test1$stage<-hn_med_test1$seer_stage %in% c(4, 5, 7)
hn_med_test1$stage<-factor(hn_med_test1$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
prop.table(table(hn_med_train1$stage))

## 
## early_stage later_stage 
##   0.7392241   0.2607759

prop.table(table(hn_med_test1$stage))

## 
## early_stage later_stage 
##   0.7413793   0.2586207

Use terms that have appeared at least 5 documents in the training dataset to build the document term matrix.

##    Length     Class      Mode 
##       116 character character

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  116 
## 
##  
##               | hn_med_test1$stage 
## hn_test_pred1 | early_stage | later_stage |   Row Total | 
## --------------|-------------|-------------|-------------|
##   later_stage |          86 |          30 |         116 | 
##               |       0.741 |       0.259 |             | 
## --------------|-------------|-------------|-------------|
##  Column Total |          86 |          30 |         116 | 
## --------------|-------------|-------------|-------------|
## 
##

\[ ACC = \frac{T P + T N}{T P + F P + F N + T N } = \frac{86}{116}=0.74.\] Note that this is a degenerate classifier as using \(laplace=15\) yields a single class prediction, later_stage.

6.3 Baseball Data

Use the MLB Data (01a_data.txt) to predict the Player's Position (or perhaps the player’s Team) using naiveBayes classifier. Compute and report the agreement between predicted and actual labels (for the player’s position). Below is some example code.

mydata <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1',as.is=T, header=T)  # 01a_data.txt
# mydata <- read.table('data.txt',as.is=T, header=T)
sample_size <- floor(0.75 * nrow(mydata))
## set the seed to make your partition reproductible
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = sample_size)
train <- mydata[train_ind, ]
# TESTING DATA
test <- mydata[-train_ind, ]
library("e1071")
nbc_model <- naiveBayes(train[ , c("Weight", "Height", "Age")], as.factor(train$Position), laplace = 15)
nbc_model

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = train[, c("Weight", "Height", "Age")], 
##     y = as.factor(train$Position), laplace = 15)
## 
## A-priori probabilities:
## as.factor(train$Position)
##           Catcher Designated_Hitter     First_Baseman        Outfielder 
##        0.07225806        0.01806452        0.05161290        0.19225806 
##    Relief_Pitcher    Second_Baseman         Shortstop  Starting_Pitcher 
##        0.29806452        0.05677419        0.04645161        0.22064516 
##     Third_Baseman 
##        0.04387097 
## 
## Conditional probabilities:
##                          Weight
## as.factor(train$Position)     [,1]     [,2]
##         Catcher           206.6786 15.21973
##         Designated_Hitter 222.5000 22.84311
##         First_Baseman     212.5750 20.87556
##         Outfielder        200.3289 18.47951
##         Relief_Pitcher    203.9913 21.84650
##         Second_Baseman    184.8864 10.67537
##         Shortstop         182.4444 13.55717
##         Starting_Pitcher  204.9825 22.58356
##         Third_Baseman     200.5588 19.18047
## 
##                          Height
## as.factor(train$Position)     [,1]     [,2]
##         Catcher           72.89286 1.865267
##         Designated_Hitter 74.57143 1.603567
##         First_Baseman     73.97500 1.941286
##         Outfielder        73.16779 2.021515
##         Relief_Pitcher    74.53247 2.160421
##         Second_Baseman    71.27273 1.703125
##         Shortstop         71.88889 1.720096
##         Starting_Pitcher  74.70760 2.187370
##         Third_Baseman     73.08824 2.287882
## 
##                          Age
## as.factor(train$Position)     [,1]     [,2]
##         Catcher           29.75571 4.445924
##         Designated_Hitter 31.47857 4.471898
##         First_Baseman     29.78425 5.219565
##         Outfielder        28.99342 4.255701
##         Relief_Pitcher    28.61420 4.160627
##         Second_Baseman    29.32386 4.616064
##         Shortstop         28.48972 3.919462
##         Starting_Pitcher  28.00661 4.401236
##         Third_Baseman     28.73353 3.798898

predicted.nbcvalues <- predict(nbc_model, as.data.frame(test))

# report results
tab <- table(predicted.nbcvalues, test$Position)
tab_df <- tidyr::spread(as.data.frame(tab), key = Var2, value = Freq)

sum(diag(table(predicted.nbcvalues, test$Position)))

## [1] 74

plot_ly(x = colnames(tab), y = colnames(tab), z = as.matrix(tab_df[, -1]), type = "heatmap")

# write.csv(file="./test.csv" , table(predicted.nbcvalues, test$Position))

6.4 Medical Specialty Text-Notes Classification

Let’s demonstrate text-classification using a clinical transcription text dataset, which consists of an index and 5 data elements - description, medical_specialty (prediction outcome target), sample_name, transcription, and keywords. Out task is to derive computed phenotypes automatically classifying the 40 different medical specialties using the clinical transcription text.

dataCT <- read.csv('https://umich.instructure.com/files/21152999/download?download_frd=1', header=T) 
str(dataCT)

## 'data.frame':    4999 obs. of  6 variables:
##  $ Index            : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ description      : chr  " A 23-year-old white female presents with complaint of allergies." " Consult for laparoscopic gastric bypass." " Consult for laparoscopic gastric bypass." " 2-D M-Mode. Doppler.  " ...
##  $ medical_specialty: chr  " Allergy / Immunology" " Bariatrics" " Bariatrics" " Cardiovascular / Pulmonary" ...
##  $ sample_name      : chr  " Allergic Rhinitis " " Laparoscopic Gastric Bypass Consult - 2 " " Laparoscopic Gastric Bypass Consult - 1 " " 2-D Echocardiogram - 1 " ...
##  $ transcription    : chr  "SUBJECTIVE:,  This 23-year-old white female presents with complaint of allergies.  She used to have allergies w"| __truncated__ "PAST MEDICAL HISTORY:, He has difficulty climbing stairs, difficulty with airline seats, tying shoes, used to p"| __truncated__ "HISTORY OF PRESENT ILLNESS: , I have seen ABC today.  He is a very pleasant gentleman who is 42 years old, 344 "| __truncated__ "2-D M-MODE: , ,1.  Left atrial enlargement with left atrial diameter of 4.7 cm.,2.  Normal size right and left "| __truncated__ ...
##  $ keywords         : chr  "allergy / immunology, allergic rhinitis, allergies, asthma, nasal sprays, rhinitis, nasal, erythematous, allegr"| __truncated__ "bariatrics, laparoscopic gastric bypass, weight loss programs, gastric bypass, atkin's diet, weight watcher's, "| __truncated__ "bariatrics, laparoscopic gastric bypass, heart attacks, body weight, pulmonary embolism, potential complication"| __truncated__ "cardiovascular / pulmonary, 2-d m-mode, doppler, aortic valve, atrial enlargement, diastolic function, ejection"| __truncated__ ...

# 1. EDA
library(dplyr)
mySummary <- dataCT %>%
  count(medical_specialty, sort = TRUE) 
mySummary

##                 medical_specialty    n
## 1                         Surgery 1103
## 2      Consult - History and Phy.  516
## 3      Cardiovascular / Pulmonary  372
## 4                      Orthopedic  355
## 5                       Radiology  273
## 6                General Medicine  259
## 7                Gastroenterology  230
## 8                       Neurology  223
## 9   SOAP / Chart / Progress Notes  166
## 10        Obstetrics / Gynecology  160
## 11                        Urology  158
## 12              Discharge Summary  108
## 13           ENT - Otolaryngology   98
## 14                   Neurosurgery   94
## 15          Hematology - Oncology   90
## 16                  Ophthalmology   83
## 17                     Nephrology   81
## 18         Emergency Room Reports   75
## 19          Pediatrics - Neonatal   70
## 20                Pain Management   62
## 21        Psychiatry / Psychology   53
## 22                   Office Notes   51
## 23                       Podiatry   47
## 24                    Dermatology   29
## 25     Cosmetic / Plastic Surgery   27
## 26                      Dentistry   27
## 27                        Letters   23
## 28      Physical Medicine - Rehab   21
## 29                 Sleep Medicine   20
## 30                  Endocrinology   19
## 31                     Bariatrics   18
## 32         IME-QME-Work Comp etc.   16
## 33                   Chiropractic   14
## 34           Diets and Nutritions   10
## 35                   Rheumatology   10
## 36              Speech - Language    9
## 37                        Autopsy    8
## 38       Lab Medicine - Pathology    8
## 39           Allergy / Immunology    7
## 40      Hospice - Palliative Care    6

plot_ly(dataCT, x = ~medical_specialty) %>%
  add_histogram()

# 2. Preprocess the medical clinical notes (transcription)
# library(tm)
dataCT_corpus <- Corpus(VectorSource(dataCT$transcription))
dataCT_corpus_clean <- tm_map(dataCT_corpus, tolower)
dataCT_corpus_clean <- tm_map(dataCT_corpus_clean, removePunctuation)
dataCT_corpus_clean <- tm_map(dataCT_corpus_clean, removeNumbers)
dataCT_corpus_clean <- tm_map(dataCT_corpus_clean, stripWhitespace)
dataCT_corpus_dtm <- DocumentTermMatrix(dataCT_corpus_clean)

set.seed(1234)
subset_train <- sample(nrow(dataCT),floor(nrow(dataCT)*0.8))  # 80% training + 20% testing
dataCT_train <- dataCT[subset_train, ]
dataCT_test <- dataCT[-subset_train, ]
dataCT_corpus_dtm_train <- dataCT_corpus_dtm[subset_train, ]
hn_med_dtm_test <- dataCT_corpus_dtm[-subset_train, ]
dataCT_corpus_train <- dataCT_corpus_clean[subset_train]
dataCT_corpus_test <- dataCT_corpus_clean[-subset_train]

dataCT_train$MedSpecFac <- factor(dataCT_train$medical_specialty)
dataCT_test$MedSpecFac <- factor(dataCT_test$medical_specialty)
# prop.table(table(dataCT_test$medical_specialty))
prop.table(table(dataCT_test$MedSpecFac))

## 
##           Allergy / Immunology                        Autopsy 
##                          0.002                          0.002 
##                     Bariatrics     Cardiovascular / Pulmonary 
##                          0.004                          0.070 
##                   Chiropractic     Consult - History and Phy. 
##                          0.003                          0.100 
##     Cosmetic / Plastic Surgery                      Dentistry 
##                          0.003                          0.004 
##                    Dermatology              Discharge Summary 
##                          0.003                          0.020 
##         Emergency Room Reports                  Endocrinology 
##                          0.024                          0.002 
##           ENT - Otolaryngology               Gastroenterology 
##                          0.016                          0.052 
##               General Medicine          Hematology - Oncology 
##                          0.051                          0.022 
##         IME-QME-Work Comp etc.       Lab Medicine - Pathology 
##                          0.003                          0.004 
##                        Letters                     Nephrology 
##                          0.001                          0.018 
##                      Neurology                   Neurosurgery 
##                          0.037                          0.021 
##        Obstetrics / Gynecology                   Office Notes 
##                          0.028                          0.009 
##                  Ophthalmology                     Orthopedic 
##                          0.020                          0.082 
##                Pain Management          Pediatrics - Neonatal 
##                          0.017                          0.012 
##      Physical Medicine - Rehab                       Podiatry 
##                          0.004                          0.012 
##        Psychiatry / Psychology                      Radiology 
##                          0.010                          0.058 
##                   Rheumatology                 Sleep Medicine 
##                          0.001                          0.004 
##  SOAP / Chart / Progress Notes              Speech - Language 
##                          0.030                          0.002 
##                        Surgery                        Urology 
##                          0.221                          0.028

summary(findFreqTerms(dataCT_corpus_dtm_train, 5))

##    Length     Class      Mode 
##     12630 character character

dataCT_corpus_dict <- as.character(findFreqTerms(dataCT_corpus_dtm_train, 5))

dataCT_train1 <- DocumentTermMatrix(dataCT_corpus_train, list(dictionary=dataCT_corpus_dict))
dataCT_test1 <- DocumentTermMatrix(dataCT_corpus_test, list(dictionary=dataCT_corpus_dict))

dataCT_train1 <- apply(dataCT_train1, MARGIN = 2, convert_counts)
dataCT_test1 <- apply(dataCT_test1, MARGIN = 2, convert_counts)

dataCT_classifier <- naiveBayes(dataCT_train1, dataCT_train$MedSpecFac, laplace = 0)
dataCT_pred <- predict(dataCT_classifier, dataCT_test1)
# table(dataCT_pred)

# report results
tab <- table(dataCT_pred, dataCT_test$MedSpecFac)
tab_df <- tidyr::spread(as.data.frame(tab), key = Var2, value = Freq)
# gmodels::CrossTable(dataCT_pred, dataCT_test$MedSpecFac)

sum(diag(table(dataCT_pred, dataCT_test$MedSpecFac)))

## [1] 51

plot_ly(x = colnames(tab), y = colnames(tab), z = as.matrix(tab_df[, -1]), type = "heatmap") %>%
  layout(title="Consusion Matrix of Naive-Bayesian Classification of 40 Medical Specialties using Text-Notes (Test Data)",
          xaxis=list(title='True Specialties'), yaxis=list(title='Derived NB Class Labels'))

Try to apply the Naive Bayesian classification techniques to some new data from the list of our Case-Studies.

Data Science and Predictive Analytics (UMich HS650)

Probabilistic Learning - Classification Using Naive Bayes

SOCR/MIDAS (Ivo Dinov)

August 2021