
Please review the introduction to Chapter 6, where we described the types of machine learning methods and presented lazy classification for numerical data. What about nominal features or textual data? In this chapter, we begin to explore classification techniques for categorical data. Specifically, we will (1) present the Naive Bayes algorithm, (2) review its assumptions, (3) discuss Laplace estimation, and (4) illustrate the naive Bayesian classifier on a Head and Neck Cancer Medication case-study.

Later, in Chapter 19, we will also discuss text mining and natural language processing of unstructured text data.

1 Overview of the Naive Bayes Algorithm

Start by reviewing the basics of probability theory and Bayesian inference.

Bayes classifiers use training data to estimate the observed probability of each class given the feature values. These estimated probabilities link feature values to classes like a map. When labeling the test data, we utilize the feature values in the test data together with this “map” to assign each test case to its most likely class. The idea seems simple, but the corresponding algorithmic implementations can be quite sophisticated.

The best scenario for accurately estimating this outcome-class map arises when all features contribute to the class jointly and the Bayes classifier uses them simultaneously. The naive Bayes algorithm is frequently used for text classification. The maximum a posteriori (MAP) assignment of the class label is based on obtaining the conditional probability (density) of each feature given the value of the class variable.

2 Assumptions

Naive Bayes is named for its “naive” assumptions. Its most important assumption is that all of the features are equally important and independent. This rarely holds in real-world data. However, even when the assumptions are violated, naive Bayes often still performs fairly accurately, particularly when the number of features \(p\) is large. This is why the Naive Bayes algorithm may be used as a powerful text classifier.

There are interesting relations between QDA (Quadratic Discriminant Analysis), LDA (Linear Discriminant Analysis) and Naive Bayes classification. Additional information about LDA and QDA is available here.

3 Bayes Formula

Assume the events \(B_i\), \(i= 1, 2, ..., n\), correspond to the observed values of the \(n\) features, and let \(A\) be another (class) event. The Bayes conditional probability formula is as follows: \[Posterior\, Probability=\frac{Likelihood\times Prior\, Probability}{Marginal\, Likelihood}\]

Symbolically,

\[P(A|B_1\cap B_2 \cap B_3\cap ...\cap B_n)=\frac{P(B_1\cap B_2 \cap B_3\cap ...\cap B_n|A)P(A)}{P(B_1\cap B_2 \cap B_3\cap ...\cap B_n)}\] When the \(B_i\)'s are independent (both marginally and conditionally on \(A\)), we have \(P(B_1\cap B_2)=P(B_1)\times P(B_2)\), and similarly for every other collection of distinct \(B_i\)'s. Therefore:

\[P(A|B_1\cap B_2 \cap B_3\cap ...\cap B_n)=\frac{P(B_1|A)\times P(B_2|A)\times ...\times P(B_n|A)\times P(A)}{P(B_1)\times P(B_2)\times P(B_3)\times ...\times P(B_n)}\]

Essentially, the probability of level \(L\) for the class \(C\), given independent features \(F_1\) to \(F_n\), is equal to:

\[P(C_L|F_1, ..., F_n)=\frac{P(C_L)\prod_{i=1}^nP(F_i|C_L)}{\prod_{i=1}^nP(F_i)},\] where the denominator, \(\prod_{i=1}^nP(F_i)\), is a scaling factor that represents the marginal probability of observing all features jointly (under the independence assumption). In the DSPA Appendix, we provide additional technical details, code, and applications of Bayesian simulation, modeling and inference.
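As a toy numerical illustration of this formula, consider a hypothetical two-class, two-feature setting (the probabilities below are made up, not derived from the case-study data). Since the denominator is the same for every class, comparing and normalizing the class-wise numerators suffices for classification:

prior <- c(early = 0.7, later = 0.3)           # P(C_L), the class priors
p_F1_given_C <- c(early = 0.2, later = 0.6)    # P(F_1 | C_L)
p_F2_given_C <- c(early = 0.4, later = 0.5)    # P(F_2 | C_L)

# unnormalized numerators P(C_L) * prod_i P(F_i | C_L)
num <- prior * p_F1_given_C * p_F2_given_C
# the scaling factor cancels, so normalizing across classes gives the posteriors
posterior <- num / sum(num)
round(posterior, 3)             # early 0.384, later 0.616
names(which.max(posterior))     # the MAP class label, here "later"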

4 The Laplace Estimator

If at least one \(P(F_i|C_L)=0\), then \(P(C_L|F_1, ..., F_n)=0\), which means the estimated probability of being in this class is 0. However, \(P(F_i|C_L)=0\) could simply result from random chance in picking the training data.

One of the solutions to this scenario is Laplace estimation, also known as Laplace smoothing, which can be accomplished in two ways. One is to add a small number to each count in the class-feature frequency table, which ensures that each class-feature combination occurs at least once in the training data; then \(P(F_i|C_L)>0\) for all \(i\). Another strategy is to add some small value, \(\epsilon\), to the numerator and denominator when calculating the posterior probability. Note that these small perturbations of the denominator should be larger than the changes in the numerator to avoid a trivial (\(0\)) posterior for another class.
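A minimal numeric sketch of the first strategy (hypothetical counts, not taken from the case study) shows how adding 1 to each cell removes the zero conditional probability:

counts <- c(Yes = 0, No = 25)    # hypothetical: word never co-occurs with this class
counts / sum(counts)             # without smoothing, P(word = Yes | class) = 0

laplace <- 1                     # Laplace smoothing: add 1 to each cell before normalizing
(counts + laplace) / sum(counts + laplace)   # both probabilities are now strictly positive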

5 Case Study: Head and Neck Cancer Medication

5.1 Step 1: Collecting Data

We utilize the Inpatient Head and Neck Cancer Medication data for this case study, which is Case Study 14 in our data archive.

Variables:

  • PID: coded patient ID
  • ENC_ID: coded encounter ID
  • Seer_stage: SEER cancer stage (0=In situ, 1=Localized, 2=Regional by direct extension, 3=Regional to lymph nodes, 4=Regional (both codes 2 and 3), 5=Regional, NOS, 7=Distant metastases/systemic disease, 8=Not applicable, 9=Unstaged, unknown, or unspecified). See: http://seer.cancer.gov/tools/ssm.
  • Medication_desc: description of the chemical composition of the medication
  • Medication_summary: brief description about medication brand and usage
  • Dose: the dosage in the medication summary
  • Unit: the unit for dosage in the Medication_summary
  • Frequency: the frequency of use in the Medication_summary
  • Total_dose_count: total dosage count according to the Medication_summary

5.2 Step 2: Exploring and preparing the data

Let’s load our data first.

hn_med<-read.csv("https://umich.instructure.com/files/1614350/download?download_frd=1", stringsAsFactors = FALSE)
str(hn_med)
## 'data.frame':    662 obs. of  9 variables:
##  $ PID               : int  10000 10008 10029 10063 10071 10103 1012 10135 10136 10143 ...
##  $ ENC_ID            : int  46836 46886 47034 47240 47276 47511 3138 47739 47744 47769 ...
##  $ seer_stage        : int  1 1 4 1 9 1 1 1 9 1 ...
##  $ MEDICATION_DESC   : chr  "ranitidine" "heparin injection" "ampicillin/sulbactam IVPB UH" "fentaNYL injection UH" ...
##  $ MEDICATION_SUMMARY: chr  "(Zantac) 150 mg tablet oral two times a day" "5,000 unit subcutaneous three times a day" "(Unasyn) 15 g IV every 6 hours" "25 - 50 microgram IV every 5 minutes PRN severe pain\nMaximum dose 200 mcg  Per PACU protocol" ...
##  $ DOSE              : chr  "150" "5000" "1.5" "50" ...
##  $ UNIT              : chr  "mg" "unit" "g" "microgram" ...
##  $ FREQUENCY         : chr  "two times a day" "three times a day" "every 6 hours" "every 5 minutes" ...
##  $ TOTAL_DOSE_COUNT  : int  5 3 11 2 1 2 2 6 15 1 ...

Change the seer_stage (cancer stage indicator) variable into a factor.

hn_med$seer_stage <- factor(hn_med$seer_stage)
str(hn_med$seer_stage)
##  Factor w/ 9 levels "0","1","2","3",..: 2 2 5 2 9 2 2 2 9 2 ...
table(hn_med$seer_stage)
## 
##   0   1   2   3   4   5   7   8   9 
##  21 265  53  90  46  18  87  14  68

5.2.1 Data preparation - processing text data for analysis

As you can see, the MEDICATION_SUMMARY variable contains a great amount of free text. We should do some text mining to prepare the data for analysis. In R, the tm package is a good choice for text mining.

# install.packages("tm", repos = "http://cran.us.r-project.org")
# requires R V.3.3.1 +
require(tm)
## Loading required package: tm
## Warning: package 'tm' was built under R version 3.4.2
## Loading required package: NLP

The first step in text mining is to convert the text features (text elements) into a corpus object, which is a collection of text documents.

hn_med_corpus<-Corpus(VectorSource(hn_med$MEDICATION_SUMMARY))
print(hn_med_corpus)
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 662

After constructing the corpus object, we can see that we have 662 documents. Each document represents an encounter (e.g., notes on medical treatment) for a patient.

inspect(hn_med_corpus[1:3])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1] (Zantac) 150 mg tablet oral two times a day
## [2] 5,000 unit subcutaneous three times a day  
## [3] (Unasyn) 15 g IV every 6 hours
hn_med_corpus[[1]]$content
## [1] "(Zantac) 150 mg tablet oral two times a day"
hn_med_corpus[[2]]$content
## [1] "5,000 unit subcutaneous three times a day"
hn_med_corpus[[3]]$content
## [1] "(Unasyn) 15 g IV every 6 hours"

There are unwanted punctuation marks and other symbols in the corpus documents that we want to remove. We use the tm_map() function for the cleaning.

corpus_clean<-tm_map(hn_med_corpus, tolower)
corpus_clean<-tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <-tm_map(corpus_clean, removeNumbers)
# corpus_clean <- tm_map(corpus_clean, PlainTextDocument)

The above lines of code changed all the characters to lower case, removed all punctuation and the extra white spaces (typically created by deleting punctuation), and removed numbers (we could also convert the corpus to plain text).

inspect(corpus_clean[1:3])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1] zantac  mg tablet oral two times a day
## [2]  unit subcutaneous three times a day  
## [3] unasyn  g iv every  hours
corpus_clean[[1]]$content
## [1] "zantac  mg tablet oral two times a day"
corpus_clean[[2]]$content
## [1] " unit subcutaneous three times a day"
corpus_clean[[3]]$content
## [1] "unasyn  g iv every  hours"

The DocumentTermMatrix() function tokenizes the medication summaries into words and counts the frequency of each term in each document of the corpus object.

hn_med_dtm<-DocumentTermMatrix(corpus_clean)
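Optionally, we can examine the resulting document-term matrix as a quick sanity check, e.g., its dimensions and a small slice (output not shown):

dim(hn_med_dtm)                 # number of documents x number of distinct terms
inspect(hn_med_dtm[1:3, 1:10])  # counts for the first 3 documents and first 10 terms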

5.2.2 Data preparation - creating training and test datasets

Just like in Chapter 6, we need to separate the dataset into training and test subsets. We have to apply the same subsetting to the raw data (with the other features), the corpus object, and the document term matrix.

set.seed(12)
subset_int <- sample(nrow(hn_med),floor(nrow(hn_med)*0.8))  # 80% training + 20% testing
hn_med_train<-hn_med[subset_int, ]
hn_med_test<-hn_med[-subset_int, ]
hn_med_dtm_train<-hn_med_dtm[subset_int, ]
hn_med_dtm_test<-hn_med_dtm[-subset_int, ]
corpus_train<-corpus_clean[subset_int]
corpus_test<-corpus_clean[-subset_int]

# hn_med_train<-hn_med[1:562, ]
#hn_med_test<-hn_med[563:662, ]
# hn_med_dtm_train<-hn_med_dtm[1:562, ]
# hn_med_dtm_test<-hn_med_dtm[563:662, ]
#corpus_train<-corpus_clean[1:562]
#corpus_test<-corpus_clean[563:662]

Let’s examine the distribution of seer stages in the training and test datasets.

prop.table(table(hn_med_train$seer_stage))
## 
##          0          1          2          3          4          5 
## 0.03024575 0.38374291 0.08317580 0.14555766 0.06616257 0.03402647 
##          7          8          9 
## 0.13421550 0.02268431 0.10018904
prop.table(table(hn_med_test$seer_stage))
## 
##          0          1          2          3          4          5 
## 0.03759398 0.46616541 0.06766917 0.09774436 0.08270677 0.00000000 
##          7          8          9 
## 0.12030075 0.01503759 0.11278195

We can separate (dichotomize) the seer_stage into two categories:

  • no stage or early stage cancer (seer_stage 0, 1, 2, 3, 8, or 9), and
  • later stage cancer (seer_stage 4, 5, or 7).
hn_med_train$stage<-hn_med_train$seer_stage %in% c(4, 5, 7)
hn_med_train$stage<-factor(hn_med_train$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
hn_med_test$stage<-hn_med_test$seer_stage %in% c(4, 5, 7)
hn_med_test$stage<-factor(hn_med_test$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
prop.table(table(hn_med_train$stage))
## 
## early_stage later_stage 
##   0.7655955   0.2344045
prop.table(table(hn_med_test$stage))
## 
## early_stage later_stage 
##   0.7969925   0.2030075

5.2.3 Visualizing text data - word clouds

A word cloud can help us visualize text data: more frequent words appear in larger fonts, while less common words appear in smaller fonts. The wordcloud package in R is commonly used for creating these figures.

# install.packages("wordcloud", repos = "http://cran.us.r-project.org")
library(wordcloud)
## Loading required package: RColorBrewer
wordcloud(corpus_train, min.freq = 40, random.order = FALSE)

The random.order=FALSE option places more frequent words near the middle of the cloud. The min.freq=40 option sets the cutoff word frequency to at least 40 occurrences in the corpus object. Therefore, a word must occur at least 40 times across the medication summaries to be shown on the graph.

We can also visualize the difference between early stages and later stages using this type of graph.

early<-subset(hn_med_train, stage=="early_stage")
later<-subset(hn_med_train, stage=="later_stage")
wordcloud(early$MEDICATION_SUMMARY, max.words = 20)

wordcloud(later$MEDICATION_SUMMARY, max.words = 20)

We can see that the frequent words are somewhat different in the medication summary between early stage and later stage patients.

5.2.4 Data preparation - creating indicator features for frequent words

For simplicity, we utilize the medication summary as the only feature to classify cancer stages. Recall that in Chapter 6 we used numerical features for classification. In this study, we are going to turn the frequencies of words into features.

summary(findFreqTerms(hn_med_dtm_train, 5))
##    Length     Class      Mode 
##       103 character character
hn_med_dict<-as.character(findFreqTerms(hn_med_dtm_train, 5))
hn_train<-DocumentTermMatrix(corpus_train, list(dictionary=hn_med_dict))
hn_test<-DocumentTermMatrix(corpus_test, list(dictionary=hn_med_dict))

The above code limits the document term matrices to the words that appear at least 5 times in the training documents. This created (about) 103 features for us to use.
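As an optional sanity check (not required for the analysis), we could peek at the first few dictionary terms and confirm the dimensions of the reduced document term matrices:

head(hn_med_dict, 10)           # the first few frequent terms retained in the dictionary
dim(hn_train); dim(hn_test)     # documents x dictionary terms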

The Naive Bayes classifier is trained on data with categorical features. Thus, we need to transform our word count features into categorical data. One way to do this is to change each count into an indicator of whether the word appears at all. We can create a function of our own to deal with this.

# Convert numeric word counts into a binary factor: "Yes" if the word appears, "No" otherwise
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
  return(x)
}

We employ hard-thresholding here, x <- ifelse(x > 0, 1, 0): if a count x is greater than 0, we assign it the value 1 (the word appears); otherwise the value is set to 0 (the word is absent).
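For instance, applying convert_counts() to a small hypothetical count vector illustrates the dichotomization:

convert_counts(c(0, 2, 0, 5, 1))
## [1] No  Yes No  Yes Yes
## Levels: No Yes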

Now let’s apply our own function convert_counts() to each column (MARGIN=2) of the training and testing document term matrices.

hn_train <- apply(hn_train, MARGIN = 2, convert_counts)
hn_test <- apply(hn_test, MARGIN = 2, convert_counts)

So far, we have successfully created indicators for the words that appear at least 5 times in the training documents.

5.3 Step 3 - training a model on the data

The package we will use for Naive Bayes classifier is called e1071.

# install.packages("e1071", repos = "http://cran.us.r-project.org")
library(e1071)

The function naiveBayes() has the following components:

m<-naiveBayes(train, class, laplace=0)

  • train: data frame or matrix containing the training data (numeric and/or factor features)
  • class: factor vector with the class label for each row in the training data
  • laplace: non-negative value controlling Laplace smoothing; the default is \(0\), which disables Laplace smoothing.

Let’s build our classifier first.

hn_classifier <- naiveBayes(hn_train, hn_med_train$stage)

Then we can use the classifier to make predictions using predict(). Recall that when we presented the AdaBoost example in Chapter 2, we saw the basic mechanism of machine-learning training, prediction and assessment.

The function predict() has the following components:

p<-predict(m, test, type="class")

  • m: classifier trained by naiveBayes()
  • test: test data frame or matrix
  • type: either "class" or "raw" specifies whether the predictions should be the most likely class value or the raw predicted probabilities.
hn_test_pred<-predict(hn_classifier, hn_test)

5.4 Step 4 - evaluating model performance

Similarly to the approach in Chapter 6, we use a cross table to compare the predicted and true classes of our test dataset.

library(gmodels)
CrossTable(hn_test_pred, hn_med_test$stage)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  133 
## 
##  
##              | hn_med_test$stage 
## hn_test_pred | early_stage | later_stage |   Row Total | 
## -------------|-------------|-------------|-------------|
##  early_stage |          90 |          24 |         114 | 
##              |       0.008 |       0.032 |             | 
##              |       0.789 |       0.211 |       0.857 | 
##              |       0.849 |       0.889 |             | 
##              |       0.677 |       0.180 |             | 
## -------------|-------------|-------------|-------------|
##  later_stage |          16 |           3 |          19 | 
##              |       0.049 |       0.190 |             | 
##              |       0.842 |       0.158 |       0.143 | 
##              |       0.151 |       0.111 |             | 
##              |       0.120 |       0.023 |             | 
## -------------|-------------|-------------|-------------|
## Column Total |         106 |          27 |         133 | 
##              |       0.797 |       0.203 |             | 
## -------------|-------------|-------------|-------------|
## 
## 

It may be worth skipping forward to Chapter 13 where we present a summary table for the key measures used to evaluate the performance of binary tests, classifiers, or predictions.

The prediction accuracy is: \[ ACC = \frac{TP + TN}{TP + FP + FN + TN} = \frac{93}{133}=0.70.\]

From the cross table we can see that our prediction accuracy is \(\frac{93}{133}=0.70\). However, the later stage class has only \(3\) correctly classified counts. This might be due to the \(P(F_i|C_L)\sim 0\) problem that we talked about earlier.
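The same accuracy can also be computed directly from the confusion matrix (an equivalent quick check):

conf <- table(hn_test_pred, hn_med_test$stage)   # rows = predictions, columns = true stages
sum(diag(conf)) / sum(conf)                      # (90 + 3)/133, approximately 0.70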

5.5 Step 5 - improving model performance

After setting laplace=15, the accuracy goes up to 76% (\(acc=\frac{101}{133}\)). Although this is only a small improvement in terms of overall accuracy, the later-stage predictions become somewhat more reliable (2 of 9 later-stage predictions are correct, compared to 3 of 19 before smoothing).

hn_classifier <- naiveBayes(hn_train, hn_med_train$stage, laplace = 15)
hn_test_pred<-predict(hn_classifier, hn_test)
CrossTable(hn_test_pred, hn_med_test$stage)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  133 
## 
##  
##              | hn_med_test$stage 
## hn_test_pred | early_stage | later_stage |   Row Total | 
## -------------|-------------|-------------|-------------|
##  early_stage |          99 |          25 |         124 | 
##              |       0.000 |       0.001 |             | 
##              |       0.798 |       0.202 |       0.932 | 
##              |       0.934 |       0.926 |             | 
##              |       0.744 |       0.188 |             | 
## -------------|-------------|-------------|-------------|
##  later_stage |           7 |           2 |           9 | 
##              |       0.004 |       0.016 |             | 
##              |       0.778 |       0.222 |       0.068 | 
##              |       0.066 |       0.074 |             | 
##              |       0.053 |       0.015 |             | 
## -------------|-------------|-------------|-------------|
## Column Total |         106 |          27 |         133 | 
##              |       0.797 |       0.203 |             | 
## -------------|-------------|-------------|-------------|
## 
## 

5.6 Step 6 - compare Naive Bayesian vs. LDA

As mentioned earlier, Naive Bayes with a normality assumption is a special case of Discriminant Analysis. It might be interesting to compare the prediction results of the Naive Bayes and LDA classifications.

library(MASS)
df_hn_train = data.frame(lapply(as.data.frame(hn_train),as.numeric), stage = hn_med_train$stage)
df_hn_test = data.frame(lapply(as.data.frame(hn_test),as.numeric), stage = hn_med_test$stage)

hn_lda <- lda(data=df_hn_train, stage~.)
## Warning in lda.default(x, grouping, ...): variables are collinear
# hn_pred = predict(hn_lda, df_hn_test[,-104])
hn_pred = predict(hn_lda, df_hn_test)
CrossTable(hn_pred$class, df_hn_test$stage)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  133 
## 
##  
##               | df_hn_test$stage 
## hn_pred$class | early_stage | later_stage |   Row Total | 
## --------------|-------------|-------------|-------------|
##   early_stage |          66 |          22 |          88 | 
##               |       0.244 |       0.957 |             | 
##               |       0.750 |       0.250 |       0.662 | 
##               |       0.623 |       0.815 |             | 
##               |       0.496 |       0.165 |             | 
## --------------|-------------|-------------|-------------|
##   later_stage |          40 |           5 |          45 | 
##               |       0.477 |       1.872 |             | 
##               |       0.889 |       0.111 |       0.338 | 
##               |       0.377 |       0.185 |             | 
##               |       0.301 |       0.038 |             | 
## --------------|-------------|-------------|-------------|
##  Column Total |         106 |          27 |         133 | 
##               |       0.797 |       0.203 |             | 
## --------------|-------------|-------------|-------------|
## 
## 

Here, the Naive Bayes classifier outperforms the LDA classifier in terms of overall accuracy; however, LDA has a lower type II error (it misses \(\frac{22}{27}\) of the later-stage cases versus \(\frac{25}{27}\) for Naive Bayes), which is clinically important, as we want to avoid missing later-stage cancer patients.
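To quantify this comparison, we can compute each classifier's later-stage miss rate (type II error) directly from the confusion matrices reported above (a sketch using the objects already in the workspace):

nb_conf  <- table(hn_test_pred, hn_med_test$stage)    # Naive Bayes with laplace = 15
lda_conf <- table(hn_pred$class, df_hn_test$stage)    # LDA
nb_conf["early_stage", "later_stage"]  / sum(nb_conf[, "later_stage"])    # 25/27
lda_conf["early_stage", "later_stage"] / sum(lda_conf[, "later_stage"])   # 22/27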

In later chapters, we will step deeper into the space of classification problems and see more sophisticated approaches.

6 Practice Problems

6.1 Iris Species

The classification of the iris flowers represents an easy example of the Naive Bayes classifier.

data(iris)
nbc_model <- naiveBayes(Species ~ ., data = iris)
## alternatively:
nbc_model <- naiveBayes(iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")], iris[,"Species"])
predicted.nbcvalues <- predict(nbc_model, iris[,c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")])
table(predicted.nbcvalues, iris[, "Species"])
##                    
## predicted.nbcvalues setosa versicolor virginica
##          setosa         50          0         0
##          versicolor      0         47         3
##          virginica       0          3        47
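From this confusion matrix, the (training-set) agreement is \(\frac{50+47+47}{150}=0.96\), which we can verify directly:

mean(predicted.nbcvalues == iris$Species)   # overall agreement, 0.96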

6.2 Cancer Study

In the previous case study, we classified the patients with seer_stage of “not applicable” (seer_stage=8) and “unstaged, unknown or unspecified” (seer_stage=9) as no cancer or early cancer stages. Let’s remove these two categories and replicate the Naive Bayes classifier case study again.

hn_med1<-hn_med[!hn_med$seer_stage %in% c(8, 9), ]
str(hn_med1); dim(hn_med1)
## 'data.frame':    580 obs. of  9 variables:
##  $ PID               : int  10000 10008 10029 10063 10103 1012 10135 10143 10152 10184 ...
##  $ ENC_ID            : int  46836 46886 47034 47240 47511 3138 47739 47769 47800 47938 ...
##  $ seer_stage        : Factor w/ 9 levels "0","1","2","3",..: 2 2 5 2 2 2 2 2 7 2 ...
##  $ MEDICATION_DESC   : chr  "ranitidine" "heparin injection" "ampicillin/sulbactam IVPB UH" "fentaNYL injection UH" ...
##  $ MEDICATION_SUMMARY: chr  "(Zantac) 150 mg tablet oral two times a day" "5,000 unit subcutaneous three times a day" "(Unasyn) 15 g IV every 6 hours" "25 - 50 microgram IV every 5 minutes PRN severe pain\nMaximum dose 200 mcg  Per PACU protocol" ...
##  $ DOSE              : chr  "150" "5000" "1.5" "50" ...
##  $ UNIT              : chr  "mg" "unit" "g" "microgram" ...
##  $ FREQUENCY         : chr  "two times a day" "three times a day" "every 6 hours" "every 5 minutes" ...
##  $ TOTAL_DOSE_COUNT  : int  5 3 11 2 2 2 6 1 24 2 ...
## [1] 580   9

Now we have only 580 observations. We can either use the first 480 of them as the training dataset and the last 100 as the test dataset, or select an 80-20 (training-testing) split. What is the prediction accuracy when laplace=1?

We can use the same code for creating the classes in the training and test datasets. Since seer_stage values of 8 and 9 are no longer in the data, we classify seer_stage = 0, 1, 2 or 3 as “early_stage” and seer_stage = 4, 5 or 7 as “later_stage”.
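One way to generate the training and test subsets is an 80-20 random split of hn_med1; below is a minimal sketch (the index name subset_int1 and the seed are illustrative choices, and the exact class proportions reported further down depend on the particular split):

set.seed(12)
subset_int1 <- sample(nrow(hn_med1), floor(nrow(hn_med1)*0.8))  # 80% training + 20% testing
hn_med_train1 <- hn_med1[subset_int1, ]
hn_med_test1  <- hn_med1[-subset_int1, ]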

hn_med_train1$stage<-hn_med_train1$seer_stage %in% c(4, 5, 7)
hn_med_train1$stage<-factor(hn_med_train1$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
hn_med_test1$stage<-hn_med_test1$seer_stage %in% c(4, 5, 7)
hn_med_test1$stage<-factor(hn_med_test1$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
prop.table(table(hn_med_train1$stage))
## 
## early_stage later_stage 
##   0.7392241   0.2607759
prop.table(table(hn_med_test1$stage))
## 
## early_stage later_stage 
##   0.7413793   0.2586207

Use terms that appear at least 5 times in the training documents to build the document term matrix.

##    Length     Class      Mode 
##       112 character character
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  116 
## 
##  
##               | hn_med_test1$stage 
## hn_test_pred1 | early_stage | later_stage |   Row Total | 
## --------------|-------------|-------------|-------------|
##   early_stage |          84 |          28 |         112 | 
##               |       0.011 |       0.032 |             | 
##               |       0.750 |       0.250 |       0.966 | 
##               |       0.977 |       0.933 |             | 
##               |       0.724 |       0.241 |             | 
## --------------|-------------|-------------|-------------|
##   later_stage |           2 |           2 |           4 | 
##               |       0.314 |       0.901 |             | 
##               |       0.500 |       0.500 |       0.034 | 
##               |       0.023 |       0.067 |             | 
##               |       0.017 |       0.017 |             | 
## --------------|-------------|-------------|-------------|
##  Column Total |          86 |          30 |         116 | 
##               |       0.741 |       0.259 |             | 
## --------------|-------------|-------------|-------------|
## 
## 

\[ ACC = \frac{TP + TN}{TP + FP + FN + TN} = \frac{86}{116}=0.74.\]
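The omitted steps mirror the earlier case study. The following sketch assumes the corpus and document term matrices are rebuilt from the MEDICATION_SUMMARY field of hn_med1 using the split from the sketch above (object names ending in 1 are illustrative, and the exact counts will depend on the split):

# rebuild and clean the corpus for the reduced dataset
hn_med_corpus1 <- Corpus(VectorSource(hn_med1$MEDICATION_SUMMARY))
corpus_clean1 <- tm_map(hn_med_corpus1, tolower)
corpus_clean1 <- tm_map(corpus_clean1, removePunctuation)
corpus_clean1 <- tm_map(corpus_clean1, stripWhitespace)
corpus_clean1 <- tm_map(corpus_clean1, removeNumbers)

# dictionary of terms appearing at least 5 times in the training documents
hn_med_dtm_train1 <- DocumentTermMatrix(corpus_clean1[subset_int1])
hn_med_dict1 <- findFreqTerms(hn_med_dtm_train1, 5)
hn_train1 <- DocumentTermMatrix(corpus_clean1[subset_int1],  list(dictionary = hn_med_dict1))
hn_test1  <- DocumentTermMatrix(corpus_clean1[-subset_int1], list(dictionary = hn_med_dict1))

# dichotomize the counts, train with laplace = 1, and cross-tabulate the predictions
hn_train1 <- apply(hn_train1, MARGIN = 2, convert_counts)
hn_test1  <- apply(hn_test1,  MARGIN = 2, convert_counts)
hn_classifier1 <- naiveBayes(hn_train1, hn_med_train1$stage, laplace = 1)
hn_test_pred1  <- predict(hn_classifier1, hn_test1)
CrossTable(hn_test_pred1, hn_med_test1$stage)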

6.3 Baseball Data

Use the MLB Data (01a_data.txt) to predict the Team each player belongs to using the naiveBayes classifier, and report the agreement between the predicted and actual team labels. Below is some example code.

mydata <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1',as.is=T, header=T)  # 01a_data.txt
# mydata <- read.table('data.txt',as.is=T, header=T)
sample_size <- floor(0.75 * nrow(mydata))
## set the seed to make your partition reproducible
set.seed(1234)
train_ind <- sample(seq_len(nrow(mydata)), size = sample_size)
train <- mydata[train_ind, ]
# TESTING DATA
test <- mydata[-train_ind, ]
library("e1071")
nbc_model <- naiveBayes(train[ , c("Weight", "Height", "Age")], as.factor(train$Team))
nbc_model
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = train[, c("Weight", "Height", "Age")], 
##     y = as.factor(train$Team))
## 
## A-priori probabilities:
## as.factor(train$Team)
##        ANA        ARZ        ATL        BAL        BOS        CHC 
## 0.03096774 0.02967742 0.04129032 0.03225806 0.03741935 0.03096774 
##        CIN        CLE        COL        CWS        DET        FLA 
## 0.03354839 0.03483871 0.02967742 0.02322581 0.03483871 0.03096774 
##        HOU         KC         LA        MIN        MLW        NYM 
## 0.03612903 0.03612903 0.02838710 0.03354839 0.03483871 0.03741935 
##        NYY        OAK        PHI        PIT         SD        SEA 
## 0.03483871 0.03612903 0.04000000 0.02967742 0.02838710 0.04000000 
##         SF        STL         TB        TEX        TOR        WAS 
## 0.03354839 0.02838710 0.03483871 0.03354839 0.03096774 0.03354839 
## 
## Conditional probabilities:
##                      Weight
## as.factor(train$Team)     [,1]     [,2]
##                   ANA 197.9583 23.65601
##                   ARZ 205.2609 24.65068
##                   ATL 200.7188 20.75857
##                   BAL 195.5600 18.32139
##                   BOS 203.6897 18.61087
##                   CHC 206.6250 19.22139
##                   CIN 203.6154 15.16595
##                   CLE 200.2963 21.29791
##                   COL 198.5217 14.54107
##                   CWS 209.3889 21.25514
##                   DET 203.4074 16.32426
##                   FLA 204.0417 21.55172
##                   HOU 196.0357 19.90809
##                   KC  197.2143 21.63392
##                   LA  205.3182 16.82511
##                   MIN 204.1154 22.18707
##                   MLW 203.9259 13.49053
##                   NYM 198.5517 18.14345
##                   NYY 207.7037 25.41322
##                   OAK 194.9643 18.58610
##                   PHI 196.8387 26.33514
##                   PIT 205.6522 18.95115
##                   SD  204.7273 18.64501
##                   SEA 199.2258 19.34030
##                   SF  203.3846 18.00684
##                   STL 201.7273 19.68365
##                   TB  196.3704 16.18439
##                   TEX 202.1538 21.23712
##                   TOR 204.8750 28.13216
##                   WAS 198.3846 26.40163
## 
##                      Height
## as.factor(train$Team)     [,1]     [,2]
##                   ANA 72.70833 2.312145
##                   ARZ 73.43478 2.063230
##                   ATL 73.96875 2.177367
##                   BAL 73.88000 2.521904
##                   BOS 73.96552 2.179167
##                   CHC 73.79167 1.864524
##                   CIN 73.34615 1.958414
##                   CLE 74.00000 1.961161
##                   COL 73.69565 2.457545
##                   CWS 74.11111 2.446820
##                   DET 73.85185 2.106833
##                   FLA 74.08333 2.205067
##                   HOU 73.00000 1.805342
##                   KC  73.53571 2.168497
##                   LA  73.40909 2.500649
##                   MIN 73.42308 2.023326
##                   MLW 73.81481 1.732873
##                   NYM 73.03448 2.499754
##                   NYY 74.29630 2.267069
##                   OAK 73.25000 1.818119
##                   PHI 73.45161 2.754859
##                   PIT 73.65217 1.849046
##                   SD  73.40909 2.039120
##                   SEA 73.54839 2.656297
##                   SF  73.80769 2.332711
##                   STL 73.72727 2.510584
##                   TB  73.55556 1.967688
##                   TEX 74.15385 2.033565
##                   TOR 74.12500 2.383138
##                   WAS 74.15385 2.663716
## 
##                      Age
## as.factor(train$Team)     [,1]     [,2]
##                   ANA 28.87875 4.612506
##                   ARZ 26.93348 3.024420
##                   ATL 28.48500 4.444819
##                   BAL 28.08120 3.913035
##                   BOS 29.98552 5.288319
##                   CHC 28.75958 3.370762
##                   CIN 29.88000 5.375042
##                   CLE 28.92148 4.746345
##                   COL 27.99739 4.078276
##                   CWS 28.67611 2.864865
##                   DET 29.28667 5.648195
##                   FLA 25.99542 2.890383
##                   HOU 30.06036 4.306209
##                   KC  29.19357 4.141790
##                   LA  31.03636 5.488794
##                   MIN 27.65385 3.740758
##                   MLW 29.30185 4.126257
##                   NYM 31.40793 6.173785
##                   NYY 29.73889 4.331266
##                   OAK 29.05536 3.729079
##                   PHI 29.66710 4.571378
##                   PIT 26.83304 2.471173
##                   SD  30.87045 4.944569
##                   SEA 27.78355 4.200894
##                   SF  30.42269 5.340946
##                   STL 29.95500 3.688636
##                   TB  26.90259 3.649058
##                   TEX 28.66654 4.709882
##                   TOR 29.13458 4.072320
##                   WAS 27.24077 2.209049
predicted.nbcvalues <- predict(nbc_model, as.data.frame(test))

# report results
table(predicted.nbcvalues, test$Team)
##                    
## predicted.nbcvalues ANA ARZ ATL BAL BOS CHC CIN CLE COL CWS DET FLA HOU KC
##                 ANA   0   0   0   0   0   0   0   0   0   0   0   1   0  0
##                 ARZ   0   0   0   0   0   0   0   0   0   0   0   0   0  0
##                 ATL   0   0   0   0   0   0   0   0   0   0   0   0   0  0
##                 BAL   0   0   0   0   0   0   1   0   0   0   0   0   0  0
##                 BOS   0   0   0   1   0   0   0   1   0   0   0   0   0  0
##                 CHC   0   0   1   0   1   0   0   0   0   0   0   0   0  0
##                 CIN   0   0   0   0   0   0   1   0   0   0   0   0   0  0
##                 CLE   0   0   0   0   0   0   0   0   0   0   0   0   0  0
##                 COL   0   0   0   0   0   0   0   0   0   0   0   0   0  0
##                 CWS   0   0   0   0   0   0   0   0   0   0   0   0   0  0
##                 DET   0   0   0   0   0   0   0   0   0   0   0   0   0  0
##                 FLA   1   0   0   0   2   0   2   0   0   4   1   1   1  0
##                 HOU   1   1   1   1   0   0   0   0   0   0   1   0   0  0
##                 KC    0   0   0   0   0   0   0   0   0   0   0   0   0  0
##                 LA    0   1   0   0   0   0   0   0   0   0   0   0   0  0
##                 MIN   0   0   0   0   0   0   0   0   0   0   0   0   0  0
##                 MLW   2   1   0   1   2   0   0   0   3   1   1   0   1  2
##                 NYM   0   0   0   3   0   1   1   0   0   0   1   0   1  0
##                 NYY   2   0   0   1   1   2   0   0   0   4   2   0   1  0
##                 OAK   1   0   1   2   0   1   2   0   1   1   1   0   1  2
##                 PHI   0   0   0   0   0   0   1   0   1   0   0   0   0  1
##                 PIT   0   0   0   0   0   1   0   1   0   2   1   2   0  2
##                 SD    0   0   0   0   0   0   0   0   0   0   0   0   0  0
##                 SEA   0   1   0   0   0   0   0   0   0   0   0   0   0  0
##                 SF    0   0   0   0   0   0   0   0   0   0   0   0   0  0
##                 STL   0   0   0   0   0   0   0   0   0   0   0   0   0  0
##                 TB    2   0   2   1   1   3   0   5   1   1   2   3   1  0
##                 TEX   0   0   0   0   0   0   0   0   0   0   0   0   0  0
##                 TOR   0   0   0   0   0   0   0   1   0   0   0   0   0  0
##                 WAS   2   1   0   0   0   4   2   0   6   2   0   1   0  0
##                    
## predicted.nbcvalues LA MIN MLW NYM NYY OAK PHI PIT SD SEA SF STL TB TEX
##                 ANA  0   0   0   1   0   0   0   0  0   0  1   0  0   0
##                 ARZ  0   0   1   0   0   0   0   0  0   0  0   0  0   0
##                 ATL  0   0   0   0   0   0   0   0  0   0  0   0  0   0
##                 BAL  0   0   0   0   0   0   0   0  0   0  0   0  0   0
##                 BOS  0   0   0   1   0   1   0   0  0   0  0   1  0   0
##                 CHC  0   0   0   0   1   0   0   0  0   0  0   0  0   1
##                 CIN  0   0   0   0   0   0   0   0  0   0  1   1  0   0
##                 CLE  0   0   0   0   0   0   0   0  0   0  0   0  0   0
##                 COL  0   0   0   0   0   0   0   0  0   0  0   0  0   0
##                 CWS  0   0   0   0   0   0   0   0  0   0  0   0  0   0
##                 DET  0   0   0   0   0   0   0   0  0   0  0   0  0   0
##                 FLA  1   0   0   1   1   0   0   2  0   0  0   0  2   0
##                 HOU  0   1   1   1   0   0   0   0  0   0  1   1  0   0
##                 KC   0   0   0   0   0   0   0   0  0   0  0   0  0   0
##                 LA   0   0   0   0   0   0   0   0  0   0  0   0  0   0
##                 MIN  0   0   0   0   0   0   0   0  0   0  0   0  0   0
##                 MLW  1   0   0   0   1   0   0   1  1   0  1   1  1   0
##                 NYM  0   1   0   0   1   0   0   0  0   0  0   1  0   0
##                 NYY  2   0   1   0   1   0   0   1  0   0  0   1  0   0
##                 OAK  1   2   1   1   0   1   1   2  2   0  0   0  0   1
##                 PHI  1   1   0   0   0   0   0   0  1   0  0   1  0   0
##                 PIT  0   0   1   1   0   2   0   2  3   1  1   1  0   4
##                 SD   0   0   0   0   0   0   0   0  0   0  0   0  0   0
##                 SEA  1   0   0   1   0   1   0   1  0   0  0   0  0   0
##                 SF   0   0   0   0   0   0   0   0  0   0  0   0  0   0
##                 STL  0   0   0   0   0   0   0   0  0   0  0   0  0   0
##                 TB   3   0   1   2   0   1   2   2  1   2  2   0  3   3
##                 TEX  0   0   0   0   0   0   0   0  0   0  0   0  0   0
##                 TOR  0   0   0   0   0   0   0   0  0   0  0   0  0   0
##                 WAS  1   2   2   0   0   3   2   1  3   0  1   2  0   0
##                    
## predicted.nbcvalues TOR WAS
##                 ANA   0   0
##                 ARZ   0   0
##                 ATL   0   0
##                 BAL   0   0
##                 BOS   0   0
##                 CHC   0   0
##                 CIN   0   0
##                 CLE   0   0
##                 COL   0   0
##                 CWS   0   0
##                 DET   0   0
##                 FLA   1   2
##                 HOU   1   0
##                 KC    0   0
##                 LA    0   0
##                 MIN   0   0
##                 MLW   0   1
##                 NYM   0   0
##                 NYY   2   1
##                 OAK   0   1
##                 PHI   1   0
##                 PIT   1   1
##                 SD    0   0
##                 SEA   0   0
##                 SF    0   0
##                 STL   0   0
##                 TB    3   3
##                 TEX   0   0
##                 TOR   0   0
##                 WAS   1   1
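To report the requested agreement between predicted and actual team labels, one simple summary is the overall proportion of matching labels (a sketch; the numeric value is not shown here):

# proportion of test players whose team is predicted correctly
mean(as.character(predicted.nbcvalues) == test$Team)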

Try to reproduce these results with some new data from the list of our Case-Studies.
