
As we have seen in the previous chapters, traditional statistical analyses and classical data modeling are applied to relational data where the observed information is represented by tables, vectors, arrays, tensors or data-frames containing binary, categorical, ordinal, or numerical values. Such representations provide incredible advantages (e.g., quick reference and de-reference of elements, search, discovery and navigation), but also limit the scope of applications. Relational data objects are quite effective for managing information that is based only on existing attributes. However, when data science inference needs to utilize attributes that are not included in the relational model, alternative non-relational representations are necessary. For instance, imagine that our data object includes a free text feature (e.g., physician/nurse clinical notes, biospecimen samples) that contains information about medical condition, treatment or outcome. It is very difficult, and sometimes even impossible, to include raw text in automated data analytics using the classical procedures and statistical models available for relational datasets.

Natural Language Processing (NLP) and Text Mining (TM) refer to automated machine-driven algorithms for semantically mapping, extracting information from, and understanding (natural) human language. Sometimes, this involves extracting salient information from large amounts of unstructured text. To do so, we need to build semantic and syntactic mapping algorithms for effective processing of large volumes of text. Related to NLP/TM, in Chapter 7 we built a text classifier using the naive Bayes algorithm.

In this chapter, we present more details about various text-processing strategies in R. Specifically, we use simulated and real examples to demonstrate text processing and the computation of term frequency (TF), inverse document frequency (IDF), and cosine-similarity transformations.

1 A simple NLP/TM example

Text mining or text analytics (TM/TA) examines large volumes of unstructured text (corpus) aiming to extract new information, discover context, identify linguistic motifs, or transform the text and derive quantitative data that can be further analyzed. Natural language processing (NLP) is one example of a TM analytical technique. Whereas TM’s goal is to discover relevant contextual information, which may be unknown, hidden, or obfuscated, NLP is focused on linguistic analysis that trains a machine to interpret voluminous textual content. To decipher the semantics and ambiguities in human-interpretable language, NLP employs automatic summarization, tagging, disambiguation, extraction of entities and relations, pattern recognition and frequency analyses. As of 2017, the total amount of information generated by the human race exceeds 5 zettabytes (\(1ZB=10^{21}\approx 2^{70}\) bytes), which is projected to top \(50ZB\) by 2020. The amount of data we obtain and record doubles every 12-14 months (Kryder’s law). A small fraction of this massive information (\(<0.0001\%\) or \(<1PB=10^{15}\) bytes) represents newly written or transcribed text, including code. However, it is impossible (cf. efficiency, time, resources) for humans to read, synthesize, curate, interpret, and react to all this information without direct assistance of TM/NLP.

The information content in text could be substantially higher than that of other information media. Remember that “a picture may be worth a thousand words”, yet, “a word may also be worth a thousand pictures”. As an example, the simple sentence “The data science and predictive analytics textbook includes 22 Chapters.” takes about 70 bytes to store as text; however, a color image showing this as printed text could reach 10 megabytes (MB), and an HD video of a speaker reading the same sentence could easily surpass 50MB. Text mining and natural language processing may be used to automatically analyze and interpret written, coded or transcribed content to assess news, moods, emotions, and biosocial trends related to specific topics.

In general, a text-analysis protocol involves:

  • Constructing a document-term matrix (DTM) from the input documents by vectorizing the text, e.g., mapping single words or n-grams into a vector space. In other words, we generate a vectorizer function mapping terms to indices (see the toy sketch after this list).

  • Applying a model-based statistical analysis or a model-free machine learning technique for prediction, clustering, classification, similarity search, network/sentiment analysis, or forecasting using the DTM. This step also includes tuning and internally validating the performance of the method.

  • Applying and evaluating the technique on new data.
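To make the first step concrete, here is a toy base-R sketch (independent of the tm workflow used later in this chapter) that vectorizes two short documents into a document-term count matrix; the variable names (toy_docs, vocab, toy_dtm) are purely illustrative.

toy_docs  <- c("data science is fun", "big data needs big compute")
toy_words <- strsplit(toy_docs, " ")                      # tokenize on spaces
vocab     <- sort(unique(unlist(toy_words)))              # vocabulary: term -> column index
toy_dtm   <- t(sapply(toy_words, function(w) table(factor(w, levels = vocab))))
rownames(toy_dtm) <- paste0("doc", seq_along(toy_docs))
toy_dtm                                                   # a 2 x |vocab| count matrix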

1.1 Define and load the unstructured-text documents

Let’s create some documents we can use to demonstrate the use of the tm package to do text mining. The 5 documents below represent portions of the syllabi of 5 recent courses taught by Ivo Dinov:

We import the syllabi as several separate text segments, each represented as a document.

  • As an exercise, try to use the rvest::read_html method to load the 5 course syllabi directly from the corresponding course websites; one possible starting point is sketched below.
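The following hedged sketch shows one way such scraping might look; the URL is hypothetical and should be replaced with the actual syllabus page, and the CSS selector ("p") may need to be adjusted to the structure of each page.

library(rvest)
syllabus_url  <- "https://www.example.edu/hs650/syllabus"   # hypothetical URL
syllabus_page <- read_html(syllabus_url)
doc1_scraped  <- paste(html_text(html_nodes(syllabus_page, "p")), collapse = " ")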
doc1 <- "HS650: The Data Science and Predictive Analytics (DSPA) course (offered as a massive open online course, MOOC, as well as a traditional University of Michigan class) aims to build computational abilities, inferential thinking, and practical skills for tackling core data scientific challenges. It explores foundational concepts in data management, processing, statistical computing, and dynamic visualization using modern programming tools and agile web-services. Concepts, ideas, and protocols are illustrated through examples of real observational, simulated and research-derived datasets. Some prior quantitative experience in programming, calculus, statistics, mathematical models, or linear algebra will be necessary. This open graduate course will provide a general overview of the principles, concepts, techniques, tools and services for managing, harmonizing, aggregating, preprocessing, modeling, analyzing and interpreting large, multi-source, incomplete, incongruent, and heterogeneous data (Big Data). The focus will be to expose students to common challenges related to handling Big Data and present the enormous opportunities and power associated with our ability to interrogate such complex datasets, extract useful information, derive knowledge, and provide actionable forecasting. Biomedical, healthcare, and social datasets will provide context for addressing specific driving challenges. Students will learn about modern data analytic techniques and develop skills for importing and exporting, cleaning and fusing, modeling and visualizing, analyzing and synthesizing complex datasets. The collaborative design, implementation, sharing and community validation of high-throughput analytic workflows will be emphasized throughout the course."
doc2 <- "Bootcamp: A week-long intensive Bootcamp focused on methods, techniques, tools, services and resources for big healthcare and biomedical data analytics using the open-source statistical computing software R. Morning sessions (3 hrs) will be dedicated to methods and technologies and applications. Afternoon sessions (3 hrs) will be for group-based hands-on practice and team work. Commitment to attend the full week of instruction (morning sessions) and self-guided work (afternoon sessions) is required. Certificates of completion will be issued only to trainees with perfect attendance that complete all work. This hands-on intensive graduate course (Bootcamp) will provide a general overview of the principles, concepts, techniques, tools and services for managing, harmonizing, aggregating, preprocessing, modeling, analyzing and interpreting large, multi-source, incomplete, incongruent, and heterogeneous data (Big Data). The focus will be to expose students to common challenges related to handling Big Data and present the enormous opportunities and power associated with our ability to interrogate such complex datasets, extract useful information, derive knowledge, and provide actionable forecasting. Biomedical, healthcare, and social datasets will provide context for addressing specific driving challenges. Students will learn about modern data analytic techniques and develop skills for importing and exporting, cleaning and fusing, modeling and visualizing, analyzing and synthesizing complex datasets. The collaborative design, implementation, sharing and community validation of high-throughput analytic workflows will be emphasized throughout the course."
doc3 <- "HS 853: This course covers a number of modern analytical methods for advanced healthcare research. Specific focus will be on reviewing and using innovative modeling, computational, analytic and visualization techniques to address concrete driving biomedical and healthcare applications. The course will cover the 5 dimensions of Big-Data (volume, complexity, multiple scales, multiple sources, and incompleteness). HS853 is a 4 credit hour course (3 lectures + 1 lab/discussion). Students will learn how to conduct research, employ and report on recent advanced health sciences analytical methods; read, comprehend and present recent reports of innovative scientific methods; apply a broad range of health problems; experiment with real Big-Data. Topics Covered include: Foundations of R, Scientific Visualization, Review of Multivariate and Mixed Linear Models, Causality/Causal Inference and Structural Equation Models, Generalized Estimating Equations, PCOR/CER methods Heterogeneity of Treatment Effects, Big-Data, Big-Science, Internal statistical cross-validation, Missing data, Genotype-Environment-Phenotype, associations, Variable selection (regularized regression and controlled/knockoff filtering), medical imaging, Databases/registries, Meta-analyses, classification methods, Longitudinal data and time-series analysis, Geographic Information Systems (GIS), Psychometrics and Rasch measurement model analysis, MCMC sampling for Bayesian inference, and Network Analysis"
doc4 <- "HS 851: This course introduces students to applied inference methods in studies involving multiple variables. Specific methods that will be discussed include linear regression, analysis of variance, and different regression models. This course will emphasize the scientific formulation, analytical modeling, computational tools and applied statistical inference in diverse health-sciences problems. Data interrogation, modeling approaches, rigorous interpretation and inference will be emphasized throughout. HS851 is a 4 credit hour course (3 lectures + 1 lab/discussion).  Students will learn how to: ,  Understand the commonly used statistical methods of published scientific papers , Conduct statistical calculations/analyses on available data , Use software tools to analyze specific case-studies data , Communicate advanced statistical concepts/techniques , Determine, explain and interpret assumptions and limitations. Topics Covered  include   Epidemiology , Correlation/SLR , and slope inference, 1-2 samples , ROC Curve , ANOVA , Non-parametric inference , Cronbach's $\alpha$, Measurement Reliability/Validity , Survival Analysis , Decision theory , CLT/LLNs - limiting results and misconceptions , Association Tests , Bayesian Inference , PCA/ICA/Factor Analysis , Point/Interval Estimation (CI) - MoM, MLE , Instrument performance Evaluation , Study/Research Critiques , Common mistakes and misconceptions in using probability and statistics, identifying potential assumption violations, and avoiding them."
doc5 <- "HS550: This course provides students with an introduction to probability reasoning and statistical inference. Students will learn theoretical concepts and apply analytic skills for collecting, managing, modeling, processing, interpreting and visualizing (mostly univariate) data. Students will learn the basic probability modeling and statistical analysis methods and acquire knowledge to read recently published health research publications. HS550 is a 4 credit hour course (3 lectures + 1 lab/discussion).  Students will learn how to:  Apply data management strategies to sample data files , Carry out statistical tests to answer common healthcare research questions using appropriate methods and software tools , Understand the core analytical data modeling techniques and their appropriate use  Examples of Topics Covered ,  EDA/Charts , Ubiquitous variation , Parametric inference , Probability Theory , Odds Ratio/Relative Risk , Distributions , Exploratory data analysis , Resampling/Simulation , Design of Experiments , Intro to Epidemiology , Estimation , Hypothesis testing , Experiments vs. Observational studies , Data management (tables, streams, cloud, warehouses, DBs, arrays, binary, ASCII, handling, mechanics) , Power, sample-size, effect-size, sensitivity, specificity , Bias/Precision , Association vs. Causality , Rate-of-change , Clinical vs. Stat significance , Statistical Independence Bayesian Rule."

1.2 Create a new VCorpus object

The VCorpus object includes all the text and some meta-data (e.g., indexing) about the text.

docs<-c(doc1, doc2, doc3, doc4, doc5)

class(docs)
## [1] "character"

Then let’s make a VCorpus object using the tm package. To complete this task, we need to know the source type. Here docs is a character vector, so we should use VectorSource(). If the text were stored in a data frame, we would use DataframeSource() instead. VCorpus() creates a volatile corpus, which is the data type used by the tm package for text mining.

library(tm)
## Loading required package: NLP
doc_corpus<-VCorpus(VectorSource(docs))
doc_corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 5
doc_corpus[[1]]$content
## [1] "HS650: The Data Science and Predictive Analytics (DSPA) course (offered as a massive open online course, MOOC, as well as a traditional University of Michigan class) aims to build computational abilities, inferential thinking, and practical skills for tackling core data scientific challenges. It explores foundational concepts in data management, processing, statistical computing, and dynamic visualization using modern programming tools and agile web-services. Concepts, ideas, and protocols are illustrated through examples of real observational, simulated and research-derived datasets. Some prior quantitative experience in programming, calculus, statistics, mathematical models, or linear algebra will be necessary. This open graduate course will provide a general overview of the principles, concepts, techniques, tools and services for managing, harmonizing, aggregating, preprocessing, modeling, analyzing and interpreting large, multi-source, incomplete, incongruent, and heterogeneous data (Big Data). The focus will be to expose students to common challenges related to handling Big Data and present the enormous opportunities and power associated with our ability to interrogate such complex datasets, extract useful information, derive knowledge, and provide actionable forecasting. Biomedical, healthcare, and social datasets will provide context for addressing specific driving challenges. Students will learn about modern data analytic techniques and develop skills for importing and exporting, cleaning and fusing, modeling and visualizing, analyzing and synthesizing complex datasets. The collaborative design, implementation, sharing and community validation of high-throughput analytic workflows will be emphasized throughout the course."

This is a list that contains the information for the 5 documents we have created. Now we can apply the tm_map() function to this object to edit the text. Similar to human semantic language understanding, the goal here is to algorithmically process the text and output (structured) quantitative information as a signature tensor representing the original (unstructured) text.
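As an aside, if the documents had been stored in a data frame instead of a character vector, the corpus construction might look like the following sketch (assuming a recent version of tm, where DataframeSource() expects doc_id and text columns).

docs_df <- data.frame(doc_id = paste0("doc", 1:5), text = docs, stringsAsFactors = FALSE)
df_corpus <- VCorpus(DataframeSource(docs_df))
# df_corpus should contain the same 5 documents as doc_corpus above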

1.3 To-lower case transformation

The text itself contains upper case letters as well as lower case letters. The first thing to do is to convert everything to lower case.

doc_corpus<-tm_map(doc_corpus, tolower)
doc_corpus[[1]]
## [1] "hs650: the data science and predictive analytics (dspa) course (offered as a massive open online course, mooc, as well as a traditional university of michigan class) aims to build computational abilities, inferential thinking, and practical skills for tackling core data scientific challenges. it explores foundational concepts in data management, processing, statistical computing, and dynamic visualization using modern programming tools and agile web-services. concepts, ideas, and protocols are illustrated through examples of real observational, simulated and research-derived datasets. some prior quantitative experience in programming, calculus, statistics, mathematical models, or linear algebra will be necessary. this open graduate course will provide a general overview of the principles, concepts, techniques, tools and services for managing, harmonizing, aggregating, preprocessing, modeling, analyzing and interpreting large, multi-source, incomplete, incongruent, and heterogeneous data (big data). the focus will be to expose students to common challenges related to handling big data and present the enormous opportunities and power associated with our ability to interrogate such complex datasets, extract useful information, derive knowledge, and provide actionable forecasting. biomedical, healthcare, and social datasets will provide context for addressing specific driving challenges. students will learn about modern data analytic techniques and develop skills for importing and exporting, cleaning and fusing, modeling and visualizing, analyzing and synthesizing complex datasets. the collaborative design, implementation, sharing and community validation of high-throughput analytic workflows will be emphasized throughout the course."

1.4 Text pre-processing

1.4.1 Remove Stopwords

These documents contain many “stopwords”, common words (e.g., articles, pronouns, prepositions) that appear frequently but carry low analytic value. We can remove them with the following commands.

stopwords("english")
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"
doc_corpus<-tm_map(doc_corpus, removeWords, stopwords("english"))
doc_corpus[[1]]
## [1] "hs650:  data science  predictive analytics (dspa) course (offered   massive open online course, mooc,  well   traditional university  michigan class) aims  build computational abilities, inferential thinking,  practical skills  tackling core data scientific challenges.  explores foundational concepts  data management, processing, statistical computing,  dynamic visualization using modern programming tools  agile web-services. concepts, ideas,  protocols  illustrated  examples  real observational, simulated  research-derived datasets.  prior quantitative experience  programming, calculus, statistics, mathematical models,  linear algebra will  necessary.  open graduate course will provide  general overview   principles, concepts, techniques, tools  services  managing, harmonizing, aggregating, preprocessing, modeling, analyzing  interpreting large, multi-source, incomplete, incongruent,  heterogeneous data (big data).  focus will   expose students  common challenges related  handling big data  present  enormous opportunities  power associated   ability  interrogate  complex datasets, extract useful information, derive knowledge,  provide actionable forecasting. biomedical, healthcare,  social datasets will provide context  addressing specific driving challenges. students will learn  modern data analytic techniques  develop skills  importing  exporting, cleaning  fusing, modeling  visualizing, analyzing  synthesizing complex datasets.  collaborative design, implementation, sharing  community validation  high-throughput analytic workflows will  emphasized throughout  course."

We removed all the stopwords included in the stopwords("english") list. You can always create your own stop-word list and apply it with doc_corpus<-tm_map(doc_corpus, removeWords, your_own_words_list).

From the output of doc1, we notice that the removal of stopwords may create extra blank spaces. Thus, the next step would be to remove them.

doc_corpus<-tm_map(doc_corpus, stripWhitespace)
doc_corpus[[1]]
## [1] "hs650: data science predictive analytics (dspa) course (offered massive open online course, mooc, well traditional university michigan class) aims build computational abilities, inferential thinking, practical skills tackling core data scientific challenges. explores foundational concepts data management, processing, statistical computing, dynamic visualization using modern programming tools agile web-services. concepts, ideas, protocols illustrated examples real observational, simulated research-derived datasets. prior quantitative experience programming, calculus, statistics, mathematical models, linear algebra will necessary. open graduate course will provide general overview principles, concepts, techniques, tools services managing, harmonizing, aggregating, preprocessing, modeling, analyzing interpreting large, multi-source, incomplete, incongruent, heterogeneous data (big data). focus will expose students common challenges related handling big data present enormous opportunities power associated ability interrogate complex datasets, extract useful information, derive knowledge, provide actionable forecasting. biomedical, healthcare, social datasets will provide context addressing specific driving challenges. students will learn modern data analytic techniques develop skills importing exporting, cleaning fusing, modeling visualizing, analyzing synthesizing complex datasets. collaborative design, implementation, sharing community validation high-throughput analytic workflows will emphasized throughout course."

1.4.2 Remove punctuation

Now we notice the irrelevant punctuation in the text, which can be removed using tm_map() with the removePunctuation() function.

doc_corpus<-tm_map(doc_corpus, removePunctuation)
doc_corpus[[2]]
## [1] "bootcamp weeklong intensive bootcamp focused methods techniques tools services resources big healthcare biomedical data analytics using opensource statistical computing software r morning sessions 3 hrs will dedicated methods technologies applications afternoon sessions 3 hrs will groupbased hands practice team work commitment attend full week instruction morning sessions selfguided work afternoon sessions required certificates completion will issued trainees perfect attendance complete work hands intensive graduate course bootcamp will provide general overview principles concepts techniques tools services managing harmonizing aggregating preprocessing modeling analyzing interpreting large multisource incomplete incongruent heterogeneous data big data focus will expose students common challenges related handling big data present enormous opportunities power associated ability interrogate complex datasets extract useful information derive knowledge provide actionable forecasting biomedical healthcare social datasets will provide context addressing specific driving challenges students will learn modern data analytic techniques develop skills importing exporting cleaning fusing modeling visualizing analyzing synthesizing complex datasets collaborative design implementation sharing community validation highthroughput analytic workflows will emphasized throughout course"

The above tm_map commands changed the structure of our doc_corpus object, because applying base functions like tolower returns plain character vectors rather than text documents. We can apply the PlainTextDocument function to convert it back to the original format.

doc_corpus<-tm_map(doc_corpus, PlainTextDocument)
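An alternative that avoids this conversion step is to wrap base functions in content_transformer(), which applies them to the document content while preserving the corpus structure and metadata. Below is a brief sketch of the same preprocessing pipeline using this approach (building a separate corpus, alt_corpus, so the object above is left untouched).

alt_corpus <- VCorpus(VectorSource(docs))
alt_corpus <- tm_map(alt_corpus, content_transformer(tolower))
alt_corpus <- tm_map(alt_corpus, removeWords, stopwords("english"))
alt_corpus <- tm_map(alt_corpus, removePunctuation)
alt_corpus <- tm_map(alt_corpus, stripWhitespace)
class(alt_corpus[[1]])   # still a PlainTextDocument, no explicit conversion needed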

1.4.3 Stemming: removal of plurals and action suffixes

Let’s inspect the first three documents. We notice that there are some words ending with “ing”, “es”, “s”.

doc_corpus[[1]]$content
## [1] "hs650 data science predictive analytics dspa course offered massive open online course mooc well traditional university michigan class aims build computational abilities inferential thinking practical skills tackling core data scientific challenges explores foundational concepts data management processing statistical computing dynamic visualization using modern programming tools agile webservices concepts ideas protocols illustrated examples real observational simulated researchderived datasets prior quantitative experience programming calculus statistics mathematical models linear algebra will necessary open graduate course will provide general overview principles concepts techniques tools services managing harmonizing aggregating preprocessing modeling analyzing interpreting large multisource incomplete incongruent heterogeneous data big data focus will expose students common challenges related handling big data present enormous opportunities power associated ability interrogate complex datasets extract useful information derive knowledge provide actionable forecasting biomedical healthcare social datasets will provide context addressing specific driving challenges students will learn modern data analytic techniques develop skills importing exporting cleaning fusing modeling visualizing analyzing synthesizing complex datasets collaborative design implementation sharing community validation highthroughput analytic workflows will emphasized throughout course"
doc_corpus[[2]]$content
## [1] "bootcamp weeklong intensive bootcamp focused methods techniques tools services resources big healthcare biomedical data analytics using opensource statistical computing software r morning sessions 3 hrs will dedicated methods technologies applications afternoon sessions 3 hrs will groupbased hands practice team work commitment attend full week instruction morning sessions selfguided work afternoon sessions required certificates completion will issued trainees perfect attendance complete work hands intensive graduate course bootcamp will provide general overview principles concepts techniques tools services managing harmonizing aggregating preprocessing modeling analyzing interpreting large multisource incomplete incongruent heterogeneous data big data focus will expose students common challenges related handling big data present enormous opportunities power associated ability interrogate complex datasets extract useful information derive knowledge provide actionable forecasting biomedical healthcare social datasets will provide context addressing specific driving challenges students will learn modern data analytic techniques develop skills importing exporting cleaning fusing modeling visualizing analyzing synthesizing complex datasets collaborative design implementation sharing community validation highthroughput analytic workflows will emphasized throughout course"
doc_corpus[[3]]$content
## [1] "hs 853 course covers number modern analytical methods advanced healthcare research specific focus will reviewing using innovative modeling computational analytic visualization techniques address concrete driving biomedical healthcare applications course will cover 5 dimensions bigdata volume complexity multiple scales multiple sources incompleteness hs853 4 credit hour course 3 lectures  1 labdiscussion students will learn conduct research employ report recent advanced health sciences analytical methods read comprehend present recent reports innovative scientific methods apply broad range health problems experiment real bigdata topics covered include foundations r scientific visualization review multivariate mixed linear models causalitycausal inference structural equation models generalized estimating equations pcorcer methods heterogeneity treatment effects bigdata bigscience internal statistical crossvalidation missing data genotypeenvironmentphenotype associations variable selection regularized regression controlledknockoff filtering medical imaging databasesregistries metaanalyses classification methods longitudinal data timeseries analysis geographic information systems gis psychometrics rasch measurement model analysis mcmc sampling bayesian inference network analysis"

If we have multiple terms that differ only in their endings (e.g., past, present, or present-perfect-continuous tense), the algorithm will treat them as distinct terms because it does not understand language semantics the way a human would. To make things easier for the computer, we can remove these endings by “stemming” the documents. Remember to load the SnowballC package before using the stemDocument() function. The earliest stemmer, written by Julie Beth Lovins in 1968, greatly influenced all subsequent work. Currently, one of the most popular stemming approaches, proposed by Martin Porter, is the one used by stemDocument().

# install.packages("SnowballC")
library(SnowballC)
doc_corpus<-tm_map(doc_corpus, stemDocument)
doc_corpus[[1]]$content
## [1] "hs650 data scienc predict analyt dspa cours offer massiv open onlin cours mooc well tradit univers michigan class aim build comput abil inferenti think practic skill tackl core data scientif challeng explor foundat concept data manag process statist comput dynam visual use modern program tool agil webservic concept idea protocol illustr exampl real observ simul researchderiv dataset prior quantit experi program calculus statist mathemat model linear algebra will necessari open graduat cours will provid general overview principl concept techniqu tool servic manag harmon aggreg preprocess model analyz interpret larg multisourc incomplet incongru heterogen data big data focus will expos student common challeng relat handl big data present enorm opportun power associ abil interrog complex dataset extract use inform deriv knowledg provid action forecast biomed healthcar social dataset will provid context address specif drive challeng student will learn modern data analyt techniqu develop skill import export clean fuse model visual analyz synthes complex dataset collabor design implement share communiti valid highthroughput analyt workflow will emphas throughout cours"

This stemming step has to be done after applying the PlainTextDocument function because stemDocument() can only be applied to plain text documents.
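For additional intuition, the SnowballC package also exposes the Porter stemmer directly on individual words through wordStem(); a quick sketch (run it to see how related word forms collapse to common stems):

library(SnowballC)
wordStem(c("analyze", "analyzing", "analyses", "model", "modeling", "models"), language = "porter")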

1.5 Bags of words

It’s very useful to be able to tokenize text documents into n-grams, i.e., sequences of n words; e.g., a 2-gram represents a two-word phrase appearing together in order. Tokenizing into n-grams allows a bag-of-words representation to retain some local word-ordering information. The bag-of-words model is a common way to represent documents in matrix form based on their term frequencies (TFs). We can construct an \(n\times t\) document-term matrix (DTM), where \(n\) is the number of documents and \(t\) is the number of unique terms. Each column in the DTM represents a unique term, and the \((i,j)^{th}\) cell records how many times term \(j\) appears in document \(i\).

The basic bag of words model is invariant to ordering of the words within a document. Once we compute the DTM, we can use machine learning techniques to interpret the derived signature information contained in the resulting matrices.
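For example, a bigram (2-gram) DTM can be built by supplying a custom tokenizer to DocumentTermMatrix(); the sketch below follows a commonly used approach (e.g., in the tm FAQ) and relies on the NLP package, which is loaded automatically with tm. The construction call is commented out, since the rest of this chapter works with single-word terms.

BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
# doc_dtm_2gram <- DocumentTermMatrix(doc_corpus, control = list(tokenize = BigramTokenizer))
# inspect(doc_dtm_2gram)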

1.6 Document-term matrix

Now the doc_corpus object is quite clean. Next, we can build a term-document matrix to explore all the terms in the 5 documents. Note that TermDocumentMatrix() produces a matrix with terms as rows and documents as columns (the transpose of a DocumentTermMatrix). Each cell of the matrix records how many times a given term appears in a specific document.

doc_dtm<-TermDocumentMatrix(doc_corpus)
doc_dtm
## <<TermDocumentMatrix (terms: 329, documents: 5)>>
## Non-/sparse entries: 540/1105
## Sparsity           : 67%
## Maximal term length: 27
## Weighting          : term frequency (tf)

The summary of the term-document matrix is informative. We have 329 different terms in the 5 documents. There are 540 non-zero and 1105 zero (sparse) entries. Thus, the sparsity is \(\frac{1105}{(540+1105)}\approx 67\%\), which measures the proportion of zero term-document entries. A high sparsity means that most terms appear in only a few of the documents.

Recall that we applied the PlainTextDocument function to the doc_corpus object, which removed the document metadata (including the document names). To relabel the documents in the term-document matrix, we can use the following commands:

doc_dtm$dimnames$Docs<-as.character(1:5)
inspect(doc_dtm)
## <<TermDocumentMatrix (terms: 329, documents: 5)>>
## Non-/sparse entries: 540/1105
## Sparsity           : 67%
## Maximal term length: 27
## Weighting          : term frequency (tf)
## Sample             :
##          Docs
## Terms     1 2 3 4 5
##   analyt  3 3 3 1 2
##   cours   4 2 3 3 2
##   data    7 5 2 3 6
##   infer   0 0 2 6 2
##   method  0 2 5 3 2
##   model   3 2 4 3 3
##   statist 2 1 1 5 4
##   student 2 2 1 2 4
##   use     2 2 1 3 2
##   will    6 8 3 4 3

We might want to find and report the frequent terms using this document-term matrix.

findFreqTerms(doc_dtm, lowfreq = 2)
##   [1] "abil"           "action"         "address"        "advanc"        
##   [5] "afternoon"      "aggreg"         "analysi"        "analyt"        
##   [9] "analyz"         "appli"          "applic"         "appropri"      
##  [13] "associ"         "assumpt"        "attend"         "bayesian"      
##  [17] "big"            "bigdata"        "biomed"         "bootcamp"      
##  [21] "challeng"       "clean"          "collabor"       "common"        
##  [25] "communiti"      "complet"        "complex"        "comput"        
##  [29] "concept"        "conduct"        "context"        "core"          
##  [33] "cours"          "cover"          "credit"         "data"          
##  [37] "dataset"        "deriv"          "design"         "develop"       
##  [41] "drive"          "emphas"         "enorm"          "epidemiolog"   
##  [45] "equat"          "estim"          "exampl"         "experi"        
##  [49] "export"         "expos"          "extract"        "focus"         
##  [53] "forecast"       "foundat"        "fuse"           "general"       
##  [57] "graduat"        "hand"           "handl"          "harmon"        
##  [61] "health"         "healthcar"      "heterogen"      "highthroughput"
##  [65] "hour"           "hrs"            "hs550"          "implement"     
##  [69] "import"         "includ"         "incomplet"      "incongru"      
##  [73] "infer"          "inform"         "innov"          "intens"        
##  [77] "interpret"      "interrog"       "knowledg"       "labdiscuss"    
##  [81] "larg"           "learn"          "lectur"         "limit"         
##  [85] "linear"         "manag"          "measur"         "method"        
##  [89] "misconcept"     "model"          "modern"         "morn"          
##  [93] "multipl"        "multisourc"     "observ"         "open"          
##  [97] "opportun"       "overview"       "power"          "practic"       
## [101] "preprocess"     "present"        "principl"       "probabl"       
## [105] "problem"        "process"        "program"        "provid"        
## [109] "publish"        "read"           "real"           "recent"        
## [113] "regress"        "relat"          "report"         "research"      
## [117] "review"         "sampl"          "scienc"         "scientif"      
## [121] "servic"         "session"        "share"          "skill"         
## [125] "social"         "softwar"        "specif"         "statist"       
## [129] "student"        "studi"          "synthes"        "techniqu"      
## [133] "test"           "theori"         "throughout"     "tool"          
## [137] "topic"          "understand"     "use"            "valid"         
## [141] "variabl"        "visual"         "will"           "work"          
## [145] "workflow"

This gives us the terms that appear at least twice across the 5 documents (i.e., have total frequency of at least 2). High-frequency stems like comput, statist, model, healthcar, and learn make perfect sense in this shortlist, as these courses cover modeling, statistical, and computational methods with applications to the health sciences.

The tm package also provides functionality to compute correlations between terms. Here is a mechanism to determine the terms that are highly correlated with the stem statist (\(\rho(statist, \cdot)\ge 0.8\)).

findAssocs(doc_dtm, "statist", corlimit = 0.8)
## $statist
## epidemiolog     publish       studi      theori  understand       appli 
##        0.95        0.95        0.95        0.95        0.95        0.83 
##        test 
##        0.80

2 Case-Study: Job ranking

Let’s explore some real datasets. First, we will import the 2011 USA Jobs Ranking Dataset from the SOCR data archive.

library(rvest)
wiki_url <- read_html("https://wiki.socr.umich.edu/index.php/SOCR_Data_2011_US_JobsRanking")
html_nodes(wiki_url, "#content")
## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body" role="main">\n\t\t\t<a id="top"></a>\n\ ...
job <- html_table(html_nodes(wiki_url, "table")[[1]])
head(job)
## # A tibble: 6 x 10
##   Index Job_Title   Overall_Score `Average_Income~ Work_Environment Stress_Level
##   <int> <chr>               <int>            <int>            <dbl>        <dbl>
## 1     1 Software_E~            60            87140            150           10.4
## 2     2 Mathematic~            73            94178             89.7         12.8
## 3     3 Actuary               123            87204            179.          16.0
## 4     4 Statistici~           129            73208             89.5         14.1
## 5     5 Computer_S~           147            77153             90.8         16.5
## 6     6 Meteorolog~           175            85210            180.          15.1
## # ... with 4 more variables: Stress_Category <int>, Physical_Demand <dbl>,
## #   Hiring_Potential <dbl>, Description <chr>

Note that lower job indices correspond to more desirable jobs, and higher indices to less desirable ones. Thus, in 2011, the most desirable job among the top 200 common jobs was Software Engineer. The aim of our study now is to explore the differences between the top 30 most desirable jobs and the bottom 100 jobs based on their textual job descriptions (JDs).

We will go through the same procedure as we did for the course syllabi example above. The documents we are using are the Description column (a text vector) in the dataset.

2.1 Step 1: make a VCorpus object

jobs <- as.list(job$Description)
jobCorpus <- VCorpus(VectorSource(jobs))

2.2 Step 2: clean the VCorpus object

jobCorpus <- tm_map(jobCorpus, tolower)
for(j in seq(jobCorpus)){
  jobCorpus[[j]] <- gsub("_", " ", jobCorpus[[j]])
}

Here we used a loop to substitute the underscores ("_") with blank spaces. This is necessary because removePunctuation() would simply delete the underscores and fuse the connected words into a single term, so we first replace them with spaces to keep the words separate. Global pattern substitution with gsub() finds and replaces the underscores.
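Alternatively, the same substitution can be kept inside the tm pipeline by wrapping gsub() in content_transformer(), which avoids the explicit loop. The line below (commented out because the loop above has already done the job) sketches this equivalent approach.

# jobCorpus <- tm_map(jobCorpus, content_transformer(function(x) gsub("_", " ", x)))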

jobCorpus<-tm_map(jobCorpus, removeWords, stopwords("english"))
jobCorpus<-tm_map(jobCorpus, removePunctuation)
jobCorpus<-tm_map(jobCorpus, stripWhitespace)
jobCorpus<-tm_map(jobCorpus, PlainTextDocument)
jobCorpus<-tm_map(jobCorpus, stemDocument)

2.3 Step 3: build document-term matrix

TermDocumentMatrix and DocumentTermMatrix objects (tm::TermDocumentMatrix, tm::DocumentTermMatrix) store a sparse term-document or document-term matrix together with the weighting attributes of the matrix.

First, let’s make sure that we have a clean VCorpus object.

jobCorpus[[1]]$content
## [1] "research design develop maintain softwar system along hardwar develop medic scientif industri purpos"

Then we can start to build the DTM and reassign labels to the Docs.

dtm<-DocumentTermMatrix(jobCorpus)
dtm
## <<DocumentTermMatrix (documents: 200, terms: 846)>>
## Non-/sparse entries: 1818/167382
## Sparsity           : 99%
## Maximal term length: 15
## Weighting          : term frequency (tf)
dtm$dimnames$Docs<-as.character(1:200)
inspect(dtm[1:10, 1:10])
## <<DocumentTermMatrix (documents: 10, terms: 10)>>
## Non-/sparse entries: 2/98
## Sparsity           : 98%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs 16wheel abnorm access accid accord account accur achiev act activ
##   1        0      0      0     0      0       0     0      0   0     0
##   10       0      0      0     0      0       0     0      0   0     0
##   2        0      0      0     0      0       0     0      0   0     0
##   3        0      0      0     1      0       0     0      0   0     0
##   4        0      0      0     0      0       0     0      0   0     0
##   5        0      0      0     0      0       0     0      0   0     0
##   6        0      0      0     0      0       0     0      0   0     0
##   7        0      0      0     0      0       0     0      0   0     0
##   8        0      0      0     0      1       0     0      0   0     0
##   9        0      0      0     0      0       0     0      0   0     0
# tdm <- TermDocumentMatrix(jobCorpus, control = list(removePunctuation = TRUE, stopwords = TRUE))
# dtm <- DocumentTermMatrix(jobCorpus, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE), stopwords = TRUE))

Let’s subset the dtm into top 30 jobs and bottom 100 jobs.

dtm_top30<-dtm[1:30, ]
dtm_bot100<-dtm[101:200, ]
dtm_top30
## <<DocumentTermMatrix (documents: 30, terms: 846)>>
## Non-/sparse entries: 293/25087
## Sparsity           : 99%
## Maximal term length: 15
## Weighting          : term frequency (tf)
dtm_bot100
## <<DocumentTermMatrix (documents: 100, terms: 846)>>
## Non-/sparse entries: 870/83730
## Sparsity           : 99%
## Maximal term length: 15
## Weighting          : term frequency (tf)

In this case, since the sparsity is very high, we can try to remove some of the words that rarely appear in the job descriptions.

dtms_top30<-removeSparseTerms(dtm_top30, 0.90)
dtms_top30
## <<DocumentTermMatrix (documents: 30, terms: 19)>>
## Non-/sparse entries: 70/500
## Sparsity           : 88%
## Maximal term length: 10
## Weighting          : term frequency (tf)
dtms_bot100<-removeSparseTerms(dtm_bot100, 0.94)
dtms_bot100
## <<DocumentTermMatrix (documents: 100, terms: 14)>>
## Non-/sparse entries: 122/1278
## Sparsity           : 91%
## Maximal term length: 10
## Weighting          : term frequency (tf)

For the top 30 jobs, instead of the initial 846 terms, we retain only 19 terms that appear in at least 10% of the job descriptions.

Similarly, for the bottom 100 jobs, instead of the initial 846 terms, we retain only 14 terms that appear in at least 6% of the job descriptions.

Similar to what we did in Chapter 7, term word-cloud visualizations are available in R by combining the tm and wordcloud packages. First, we count the term frequencies in the two document-term matrices.

library(plotly)
# Let's calculate the cumulative frequencies of words across documents and sort:
freq1<-sort(colSums(as.matrix(dtms_top30)), decreasing=T)
freq1
##    develop     assist      natur      studi     analyz    concern   individu 
##          6          5          5          5          4          4          4 
##   industri     physic       plan       busi     inform   institut    problem 
##          4          4          4          3          3          3          3 
##   research   scientif     theori  treatment understand 
##          3          3          3          3          3
freq2<-sort(colSums(as.matrix(dtms_bot100)), decreasing=T)
freq2
##       oper     repair    perform     instal      build     prepar       busi 
##         17         15         11          9          8          8          7 
##   commerci  construct   industri     machin manufactur    product  transport 
##          7          7          7          7          7          7          7
# Plot frequent words (for bottom 100 jobs)
wf2=data.frame(term=names(freq2), occurrences=freq2)
# library(ggplot2)
# p <- ggplot(subset(wf2, freq2>2), aes(term, occurrences))
# p <- p + geom_bar(stat="identity")
# p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
# p

df.freq2 <- subset(wf2, freq2>2)
plot_ly(data=df.freq2, x=~term, y=~occurrences, type="bar") %>%
  layout(title="Bottom 100 Job Descriptions (Frequent Terms)")
# Plot frequent words (for top 30 jobs)
wf1=data.frame(term=names(freq1), occurrences=freq1)
# p2 <- ggplot(subset(wf1, freq1>2), aes(term, occurrences, fill = freq1))
# p2 <- p2 + geom_bar(stat="identity")
# p2 <- p2 + theme(axis.text.x=element_text(angle=45, hjust=1))
# p2

df.freq1 <- subset(wf1, freq1>2)
plot_ly(data=df.freq1, x=~term, y=~occurrences, type="bar") %>%
  layout(title="Top 30 Job Descriptions (Frequent Terms)")
# what is common (frequently occurring words)
# in the description of the top 30 and the bottom 100 jobs?
intersect(df.freq1$term, df.freq2$term)
## [1] "industri" "busi"

Then we apply the wordcloud() function to these term-frequency vectors.

library(wordcloud)
set.seed(123)
wordcloud(names(freq1), freq1)

# Color code the frequencies using an appropriate color map:

# Sequential palettes names include: 
# Blues BuGn BuPu GnBu Greens Greys Oranges OrRd PuBu PuBuGn PuRd Purples RdPu Reds YlGn YlGnBu YlOrBr YlOrRd

# Diverging palettes include  
# BrBG PiYG PRGn PuOr RdBu RdGy RdYlBu RdYlGn Spectral
wordcloud(names(freq2), freq2, min.freq=5, colors=brewer.pal(6, "Spectral"))

#wordcloud(names(freq2), freq2, min.freq=5, colors=brewer.pal(length(names(subset(wf2, freq2>5))), "Dark2"))

It becomes apparent that the top 30 jobs tend to focus more on research and discovery, with frequent keywords (stems) like study, nature, and analyze. The bottom 100 jobs are more focused on the mechanistic operation of objects or equipment, with frequent keywords like operate, repair, and perform.

2.4 Area Under ROC Curve

In Chapter 13 we discussed the ROC curve. We can use the document-term matrix to build classifiers and use the area under the ROC curve (AUC) to evaluate them. Assume we want to predict whether a job ranks in the top 30 of the job list.

The first task is to create an indicator of high rank (the job is in the top 30 list). We can use the ifelse() function that we are already familiar with. (Note that the condition Index<30 below actually flags the 29 highest-ranked jobs; use Index<=30 to capture exactly the top 30.)

job$highrank <- ifelse(job$Index<30, 1, 0)

Next we load the glmnet package to help us build the prediction model and draw graphs.

# install.packages("glmnet")
library(glmnet)

The function we will be using is the cv.glmnet, where cv stands for cross-validation. Since the derived job ranking variable highrank is binary, we specify the option family='binomial'. Also, we want to use a 10-fold CV method for internal statistical (resampling-based) prediction validation.

set.seed(25)
fit <- cv.glmnet(x = as.matrix(dtm), y = job[['highrank']], 
                 family = 'binomial', 
                 # lasso penalty
                 alpha = 1, 
                 # interested in the area under ROC curve
                 type.measure = "auc", 
                 # 10-fold cross-validation
                 nfolds = 10, 
                 # high value is less accurate, but has faster training
                 thresh = 1e-3, 
                 # again lower number of iterations for faster training
                 maxit = 1e3)
# plot(fit)
print(paste("max AUC =", round(max(fit$cvm), 4)))
## [1] "max AUC = 0.6348"
plotCV.glmnet <- function(cv.glmnet.object, name="") {
  df <- as.data.frame(cbind(x=log(cv.glmnet.object$lambda), y=cv.glmnet.object$cvm, 
                            errorBar=cv.glmnet.object$cvsd, nzero=cv.glmnet.object$nzero))

  featureNum <- cv.glmnet.object$nzero
  xFeature <- log(cv.glmnet.object$lambda)
  yFeature <- max(cv.glmnet.object$cvm)+max(cv.glmnet.object$cvsd)
  dataFeature <- data.frame(featureNum, xFeature, yFeature)

  plot_ly(data = df) %>%
    # add error bars for each CV-mean at log(lambda)
    add_trace(x = ~x, y = ~y, type = 'scatter', mode = 'markers',
        name = 'CV MSE', error_y = ~list(array = errorBar)) %>% 
    # add the lambda-min and lambda 1SD vertical dash lines
    add_lines(data=df, x=c(log(cv.glmnet.object$lambda.min), log(cv.glmnet.object$lambda.min)), 
              y=c(min(cv.glmnet.object$cvm)-max(df$errorBar), max(cv.glmnet.object$cvm)+max(df$errorBar)),
              showlegend=F, line=list(dash="dash"), name="lambda.min", mode = 'lines+markers') %>%
    add_lines(data=df, x=c(log(cv.glmnet.object$lambda.1se), log(cv.glmnet.object$lambda.1se)), 
              y=c(min(cv.glmnet.object$cvm)-max(df$errorBar), max(cv.glmnet.object$cvm)+max(df$errorBar)), 
              showlegend=F, line=list(dash="dash"), name="lambda.1se") %>%
    # Add Number of Features Annotations on Top
    add_trace(dataFeature, x = ~xFeature, y = ~yFeature, type = 'scatter', name="Number of Features",
        mode = 'text', text = ~featureNum, textposition = 'middle right',
        textfont = list(color = '#000000', size = 9)) %>%
    # Add top x-axis (non-zero features)
    # add_trace(data=df, x=~c(min(cv.glmnet.object$nzero),max(cv.glmnet.object$nzero)),
    #           y=~c(max(y)+max(errorBar),max(y)+max(errorBar)), showlegend=F, 
    #           name = "Non-Zero Features", yaxis = "ax", mode = "lines+markers", type = "scatter") %>%
    layout(title = paste0("Cross-Validation MSE (", name, ")"),
                            xaxis = list(title=paste0("log(",TeX("\\lambda"),")"),  side="bottom", showgrid=TRUE), # type="log"
                            hovermode = "x unified", legend = list(orientation='h'),  # xaxis2 = ax,  
                            yaxis = list(title = cv.glmnet.object$name, side="left", showgrid = TRUE))
}

plotCV.glmnet(fit, "LASSO")

Here x is the model (design) matrix and y is the binary response variable. The graph shows the cross-validated AUC across the sequence of LASSO models (indexed by \(\lambda\)), and the print() statement reports the best AUC among them. The resulting \(AUC\approx 0.63\) represents a modest, better-than-random prediction model for this small sample size.
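To see which stemmed job-description terms drive the classification, we can inspect the non-zero LASSO coefficients at lambda.min; the sketch below does not show output because the retained terms vary between cross-validation runs.

coefs   <- coef(fit, s = "lambda.min")                       # sparse coefficient vector
nonzero <- coefs[coefs[, 1] != 0, , drop = FALSE]            # keep only non-zero coefficients
nonzero[order(abs(nonzero[, 1]), decreasing = TRUE), , drop = FALSE]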

3 TF-IDF

To enhance the performance of the DTM-based analysis, we introduce the TF-IDF (term frequency - inverse document frequency) weighting. Unlike raw frequency, TF-IDF measures the relative importance of a term. If a term appears in almost every document, it is considered common and receives a small weight. Conversely, terms that appear in only a few documents are considered more informative and receive larger weights.

3.1 Term Frequency (TF)

TF is a ratio \(\frac{\text{a term's occurrences in a document}}{\text{the number of occurrences of the most frequent word within the same document}}\).

\[TF(t,d)=\frac{f_d(t)}{\max_{w\in d}{f_d(w)}}.\]

3.2 Inverse Document Frequency (IDF)

The TF definition may allow high scores for irrelevant words that naturally show up often in a long text, even if these may have been triaged in a prior preprocessing step. The IDF attempts to rectify that. IDF represents the inverse of the share of documents in which the given term can be found. The lower the number of documents containing the term, relative to the size of the corpus, the higher this factor.

IDF involves a logarithm because otherwise the scoring would be too extreme: without it, the weight of a term found in just one document would be twice that of a term found in two documents. The \(\ln()\) function dampens this bias in favor of rare terms, even when their TF factors are high, since it is rather unlikely that a term’s relevance is high in only one document and negligible in all the others.

\[ IDF(t,D) = \ln\left ( \frac{|D|}{|\{ d\in D: t\in d\}|} \right ).\]

3.3 TF-IDF

Both TF and IDF yield high scores for highly relevant terms. TF relies on local information (a search within document \(d\)), whereas IDF incorporates a more global perspective (a search over the corpus \(D\)). The product \(TF\times IDF\) gives the classical TF-IDF score; however, alternative expressions that weight the TF and IDF factors differently may also be formulated.

\[TF\_IDF(t,d,D)= TF(t,d)\times IDF(t,D).\] An example of an alternative TF-IDF metric can be defined by: \[ TF\_IDF'(t,d,D) =\frac{IDF(t,D)}{|D|} + TF\_IDF(t,d,D).\]
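To make these formulas concrete, here is a small base-R sketch, a toy illustration rather than the tm implementation (tm's weightTfIdf uses a slightly different normalization), computing TF, IDF, and TF-IDF for a tiny term-count matrix with documents as rows.

tf_idf_demo <- function(counts) {
  tf  <- counts / apply(counts, 1, max)           # TF: count / max count within each document
  idf <- log(nrow(counts) / colSums(counts > 0))  # IDF: ln(|D| / #documents containing the term)
  sweep(tf, 2, idf, `*`)                          # TF-IDF = TF * IDF
}
toy_counts <- matrix(c(3, 0, 1,
                       1, 2, 0,
                       0, 1, 1),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(paste0("doc", 1:3), c("data", "model", "health")))
round(tf_idf_demo(toy_counts), 3)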

Let’s make another DTM with TF-IDF weights and compare the differences between the unweighted and weighted DTM.

dtm.tfidf <- DocumentTermMatrix(jobCorpus, control = list(weighting=weightTfIdf))
dtm.tfidf
## <<DocumentTermMatrix (documents: 200, terms: 846)>>
## Non-/sparse entries: 1818/167382
## Sparsity           : 99%
## Maximal term length: 15
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
dtm.tfidf$dimnames$Docs <- as.character(1:200)
inspect(dtm.tfidf[1:9, 1:10]) 
## <<DocumentTermMatrix (documents: 9, terms: 10)>>
## Non-/sparse entries: 2/88
## Sparsity           : 98%
## Maximal term length: 7
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample             :
##     Terms
## Docs 16wheel abnorm access     accid    accord account accur achiev act activ
##    1       0      0      0 0.0000000 0.0000000       0     0      0   0     0
##    2       0      0      0 0.0000000 0.0000000       0     0      0   0     0
##    3       0      0      0 0.5536547 0.0000000       0     0      0   0     0
##    4       0      0      0 0.0000000 0.0000000       0     0      0   0     0
##    5       0      0      0 0.0000000 0.0000000       0     0      0   0     0
##    6       0      0      0 0.0000000 0.0000000       0     0      0   0     0
##    7       0      0      0 0.0000000 0.0000000       0     0      0   0     0
##    8       0      0      0 0.0000000 0.4321928       0     0      0   0     0
##    9       0      0      0 0.0000000 0.0000000       0     0      0   0     0
inspect(dtm[1:9, 1:10]) 
## <<DocumentTermMatrix (documents: 9, terms: 10)>>
## Non-/sparse entries: 2/88
## Sparsity           : 98%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs 16wheel abnorm access accid accord account accur achiev act activ
##    1       0      0      0     0      0       0     0      0   0     0
##    2       0      0      0     0      0       0     0      0   0     0
##    3       0      0      0     1      0       0     0      0   0     0
##    4       0      0      0     0      0       0     0      0   0     0
##    5       0      0      0     0      0       0     0      0   0     0
##    6       0      0      0     0      0       0     0      0   0     0
##    7       0      0      0     0      0       0     0      0   0     0
##    8       0      0      0     0      1       0     0      0   0     0
##    9       0      0      0     0      0       0     0      0   0     0

Inspecting the two DTMs, we can see that TF-IDF does not merely count term frequencies; it also weights each term according to its importance within the corpus. Next, we are going to fit another model with this new DTM (dtm.tfidf).

set.seed(2)
fit1 <- cv.glmnet(x = as.matrix(dtm.tfidf), y = job[['highrank']], 
                 family = 'binomial', 
                 # lasso penalty
                 alpha = 1, 
                 # interested in the area under ROC curve
                 type.measure = "auc", 
                 # 10-fold cross-validation
                 nfolds = 10, 
                 # high value is less accurate, but has faster training
                 thresh = 1e-3, 
                 # again lower number of iterations for faster training
                 maxit = 1e3)
# plot(fit1)
print(paste("max AUC =", round(max(fit1$cvm), 4)))
## [1] "max AUC = 0.5738"
plotCV.glmnet(fit1, "LASSO")

This output is about the same as the previous job-ranking prediction classifier (based on the unweighted DTM). Due to random sampling, each run of the protocol may generate a slightly different result. The idea behind using TF-IDF is that it yields word-importance estimates that are less biased by ubiquitous terms. If the documents include stopwords, like “the” or “one”, a raw-count DTM may distort the results, whereas TF-IDF down-weights such terms.
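
For instance, a term that appears in every one of the 200 job descriptions receives an IDF of zero, so its TF-IDF weight vanishes no matter how frequently it occurs within any single document:

log(200 / 200)   # IDF of a term present in all 200 documents is ln(1) = 0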

Next, we can report a more intuitive representation of the job-ranking prediction: the agreement between the real and the predicted binary (top-30 or not) labels. Notice that this assessment applies only to the training data itself.

# Obtain the LASSO probability predictions (type = "response") and binarize them at 0.5
preffit1 <- predict(fit1, newx=as.matrix(dtm.tfidf), s="lambda.min", type = "response")
binPredfit1 <- ifelse(preffit1<0.5, 0, 1)
table(binPredfit1, job[['highrank']])
##            
## binPredfit1   0   1
##           0 171   0
##           1   0  29

Let’s try to predict the job ranking of a new (testing or validation) job description (JD). Many job descriptions are available online, and we can extract their text to predict the ranking of the corresponding positions. Trying several alternative job categories, e.g., some high-tech or fin-tech jobs alongside some manufacturing and construction jobs, may provide some intuition about the power of the job classifier we built. Below, we will compare the JDs for the positions of accountant, attorney, and machinist.

# install.packages("text2vec"); install.packages("data.table")
library(text2vec)
library(data.table)

# Choose the JD for a PUBLIC ACCOUNTANTS 1430 (https://www.bls.gov/ocs/ocsjobde.htm)
xTestAccountant <- "Performs professional auditing work in a public accounting firm. Work requires at least a bachelor's degree in accounting. Participates in or conducts audits to ascertain the fairness of financial representations made by client companies. May also assist the client in improving accounting procedures and operations. Examines financial reports, accounting records, and related documents and practices of clients. Determines whether all important matters have been disclosed and whether procedures are consistent and conform to acceptable practices. Samples and tests transactions, internal controls, and other elements of the accounting system(s) as needed to render the accounting firm's final written opinion. As an entry level public accountant, serves as a junior member of an audit team. Receives classroom and on-the-job training to provide practical experience in applying the principles, theories, and concepts of accounting and auditing to specific situations. (Positions held by trainee public accountants with advanced degrees, such as MBA's are excluded at this level.) Complete instructions are furnished and work is reviewed to verify its accuracy, conformance with required procedures and instructions, and usefulness in facilitating the accountant's professional growth. Any technical problems not covered by instructions are brought to the attention of a superior. Carries out basic audit tests and procedures, such as: verifying reports against source accounts and records; reconciling bank and other accounts; and examining cash receipts and disbursements, payroll records, requisitions, receiving reports, and other accounting documents in detail to ascertain that transactions are properly supported and recorded. Prepares selected portions of audit working papers"

xTestAttorney <- "Performs consultation, advisory and/or trail work and carries out the legal processes necessary to effect the rights, privileges, and obligations of the organization. The work performed requires completion of law school with an L.L.B. degree or J.D. degree and admission to the bar. Responsibilities or functions include one or more of the following or comparable duties:
1. Preparing and reviewing various legal instruments and documents, such as contracts, leases, licenses, purchases, sales, real estate, etc.;
2. Acting as agent of the organization in its transactions;
3. Examining material (e.g., advertisements, publications, etc.) for legal implications; advising officials of proposed legislation which might affect the organization;
4. Applying for patents, copyrights, or registration of the organization's products, processes, devices, and trademarks; advising whether to initiate or defend law suits;
5. Conducting pretrial preparations; defending the organization in lawsuits;
6. Prosecuting criminal cases for a local or state government or defending the general public (for example, public defenders and attorneys rendering legal services to students); or
7. Advising officials on tax matters, government regulations, and/or legal rights.
Attorney jobs are matched at one of six levels according to two factors:
1. Difficulty level of legal work; and
2. Responsibility level of job.
Attorney jobs which meet the above definitions are to be classified and coded in accordance with a chart available upon request.
Legal questions are characterized by: facts that are well-established; clearly applicable legal precedents; and matters not of substantial importance to the organization. (Usually relatively limited sums of money, e.g., a few thousand dollars, are involved.)
a. legal investigation, negotiation, and research preparatory to defending the organization in potential or actual lawsuits involving alleged negligence where the facts can be firmly established and there are precedent cases directly applicable to the situation;
b. searching case reports, legal documents, periodicals, textbooks, and other legal references, and preparing draft opinions on employee compensation or benefit questions where there is a substantial amount of clearly applicable statutory, regulatory, and case material;
c. drawing up contracts and other legal documents in connection with real property transactions requiring the development of detailed information but not involving serious questions regarding titles to property or other major factual or legal issues.
d. preparing routine criminal cases for trial when the legal or factual issues are relatively straight forward and the impact of the case is limited; and
e. advising public defendants in regard to routine criminal charges or complaints and representing such defendants in court when legal alternatives and facts are relatively clear and the impact of the outcome is limited primarily to the defendant.

Legal work is regularly difficult by reason of one or more of the following: the absence of clear and directly applicable legal precedents; the different possible interpretations that can be placed on the facts, the laws, or the precedents involved; the substantial importance of the legal matters to the organization (e.g., sums as large as $100,000 are generally directly or indirectly involved); or the matter is being strongly pressed or contested in formal proceedings or in negotiations by the individuals, corporations, or government agencies involved.
a. advising on the legal implications of advertising representations when the facts supporting the representations and the applicable precedent cases are subject to different interpretations;
b. reviewing and advising on the implications of new or revised laws affecting the organization;
c. presenting the organization's defense in court in a negligence lawsuit which is strongly pressed by counsel for an organized group;
d. providing legal counsel on tax questions complicated by the absence of precedent decisions that are directly applicable to the organization's situation;
e. preparing and prosecuting criminal cases when the facts of the cases are complex or difficult to determine or the outcome will have a significant impact within the jurisdiction; and
f. advising and representing public defendants in all phases of criminal proceedings when the facts of the case are complex or difficult to determine, complex or unsettled legal issues are involved, or the prosecutorial jurisdiction devotes substantial resources to obtaining a conviction."

xTestMachinist <- "Produces replacement parts and new parts in making repairs of metal parts of mechanical equipment. Work involves most of the following: interpreting written instructions and specifications; planning and laying out of work; using a variety of machinist's handtools and precision measuring instruments; setting up and operating standard machine tools; shaping of metal parts to close tolerances; making standard shop computations relating to dimensions of work, tooling, feeds, and speeds of machining; knowledge of the working properties of the common metals; selecting standard materials, parts, and equipment required for this work; and fitting and assembling parts into mechanical equipment. In general, the machinist's work normally requires a rounded training in machine-shop practice usually acquired through a formal apprenticeship or equivalent training and experience. Industrial machinery repairer. Repairs machinery or mechanical equipment. Work involves most of the following: examining machines and mechanical equipment to diagnose source of trouble; dismantling or partly dismantling machines and performing repairs that mainly involve the use of handtools in scraping and fitting parts; replacing broken or defective parts with items obtained from stock; ordering the production of a replacement part by a machine shop or sending the machine to a machine shop for major repairs; preparing written specifications for major repairs or for the production of parts ordered from machine shops; reassembling machines; and making all necessary adjustments for operation. In general, the work of a machinery maintenance mechanic requires rounded training and experience usually acquired through a formal apprenticeship or equivalent training and experience. Excluded from this classification are workers whose primary duties involve setting up or adjusting machines. Vehicle and mobile equipment mechanics and repairers. Repairs, rebuilds, or overhauls major assemblies of internal combustion automobiles, buses, trucks, or tractors. Work involves most of the following: Diagnosing the source of trouble and determining the extent of repairs required; replacing worn or broken parts such as piston rings, bearings, or other engine parts; grinding and adjusting valves; rebuilding carburetors; overhauling transmissions; and repairing fuel injection, lighting, and ignition systems. In general, the work of the motor vehicle mechanic requires rounded training and experience usually acquired through a formal apprenticeship or equivalent training and experience"

# Define the testing cases (1) list, (2) Y-category (top-30 or not), and (3) ID names
testJDs <- as.list(c(xTestAccountant, xTestAttorney, xTestMachinist))
testTop30 <- c(1,1,0)
testNames <- c("xTestAccountant", "xTestAttorney", "xTestMachinist")
  
### 1. Training Phase (labeled data)
# loop to substitute "_" with a blank space
for(j in seq(jobs)){
  jobs[[j]] <- gsub("_", " ", jobs[[j]])
}

prep_fun = tolower
tok_fun = word_tokenizer

trainIterator = itoken(unlist(jobs), 
             preprocessor = prep_fun, 
             tokenizer = tok_fun, 
             ids = rownames(jobs), 
             progressbar = FALSE)
vocab = create_vocabulary(trainIterator)
vocab
## Number of docs: 200 
## 0 stopwords:  ... 
## ngram_min = 1; ngram_max = 1 
## Vocabulary: 
##                term term_count doc_count
##    1:            16          1         1
##    2:      abnormal          1         1
##    3: abnormalities          1         1
##    4:        access          1         1
##    5:   accordingly          1         1
##   ---                                   
## 1109:            in         51        45
## 1110:            to         81        69
## 1111:           the         82        70
## 1112:            of        118        98
## 1113:           and        329       184
# the `text2vec` package has alternative tokenizer functions (?tokenizers); we can use simple wrappers of the `base::gsub()` function or
# write a new tokenizer. Let's construct a document-term matrix using the vocabulary of the trainIterator
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(trainIterator, vectorizer)
dim(dtm_train)
## [1]  200 1113
### 2. Fit LASSO model
set.seed(1234)
fit1 <- cv.glmnet(x = dtm_train, y = job[['highrank']], 
                   family = 'binomial', 
                 # lasso penalty
                 alpha = 1, 
                 # interested in the area under ROC curve
                 type.measure = "auc", 
                 # 10-fold cross-validation
                 nfolds = 10, 
                 # high value is less accurate, but has faster training
                 thresh = 1e-5, 
                 # again lower number of iterations for faster training
                 maxit = 1e4)
print(paste("max AUC =", round(max(fit1$cvm), 4)))
## [1] "max AUC = 0.6191"
#### 3. Testing Phase (new JDs)
testIterator = tok_fun(prep_fun(unlist(testJDs)))
# turn off progressbar because it won't look nice in rmd
testIterator = itoken(testIterator, ids = testNames, progressbar = FALSE)

dtm_test = create_dtm(testIterator, vectorizer)

predictedJDs = predict(fit1, dtm_test, type = 'response')[,1] # Type can be: "link", "response", "coefficients", "class", "nonzero"
predictedJDs
## xTestAccountant   xTestAttorney  xTestMachinist 
##       0.2312489       0.1504954       0.1355345
# glmnet:::auc(testTop30, predictedJDs)

Note that the results may change somewhat between runs (e.g., AUC ~ 0.62). Above, we assessed the JD predictive (LASSO) model using the three out-of-bag job descriptions. Below is a plot of the model’s cross-validated performance.

# plot(fit1)
# plot(fit1, xvar="lambda", label="TRUE")
# mtext("CV LASSO: Number of Nonzero (Active) Coefficients", side=3, line=2.5)

plotCV.glmnet(fit1, "LASSO")

The output of the predictions shows that:

  • On the training data, the predicted probabilities rapidly decrease with the job index, mirroring the overall job ranking (highly ranked/desired jobs are listed at the top).
  • On the three testing job descriptions (accountant, attorney, and machinist), there is a clear ranking difference between the machinist and the other two professions.

Also see the discussion in Chapter 17 about the different types of predictions that can be generated as outputs of cv.glmnet regularized forecasting methods.
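
For instance, here is a brief sketch (using the fit1 model and dtm_test matrix constructed above, and the common choice s = "lambda.min") of how the main prediction types differ:

predict(fit1, dtm_test, s = "lambda.min", type = "link")      # linear predictor (logit scale)
predict(fit1, dtm_test, s = "lambda.min", type = "response")  # predicted probabilities in [0, 1]
predict(fit1, dtm_test, s = "lambda.min", type = "class")     # hard 0/1 class labels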

4 Cosine similarity

As we mentioned above, text data are often transformed into Term Frequency-Inverse Document Frequency (TF-IDF) weights, which may be a better input than the raw frequencies for many text-mining methods. An alternative transformation relies on a different distance measure, the cosine distance, which is defined in terms of the cosine similarity:

\[ Similarity = \cos(\theta) = \frac{A\cdot B}{||A||_2||B||_2},\] \[ Cosine\_Distance = 1-Similarity = 1 - \frac{A\cdot B}{||A||_2||B||_2},\] where \(\theta\) represents the angle between two vectors \(A\) and \(B\) in the Euclidean space spanned by the DTM columns. Note that the cosine similarity of two text documents (more specifically, two DTM rows, with or without TF-IDF weights) will always be in the range \([0,1]\), since the term frequencies are non-negative. In other words, the angle between two term-frequency vectors cannot exceed \(90^o\), and therefore \(0\leq Cosine\_Distance\leq 1\), although the cosine distance is not a proper distance metric since it does not, in general, satisfy the triangle inequality. Mind the dimensions of the corresponding matrices: \(\dim(dtm)=200\times 846\) and \(\dim(dist\_cos)=200\times 200\).

cos_dist = function(mat){
  # pairwise dot products between the document (row) vectors
  numer = tcrossprod(mat)
  # Euclidean (L2) norm of each document (row) vector
  denom = sqrt(apply(mat, 1, crossprod))
  # cosine distance = 1 - cosine similarity
  1 - numer / outer(denom, denom)
}
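
As a quick sanity check of cos_dist(), consider a toy 2-document, 3-term matrix (names are made up); the off-diagonal entries should match the hand computation \(1-\frac{A\cdot B}{||A||_2||B||_2}\):

toy <- rbind(A = c(1, 2, 0), B = c(2, 1, 1))
cos_dist(toy)   # off-diagonal entries: 1 - 4/(sqrt(5)*sqrt(6)) ~ 0.27; the diagonal is 0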

# Recall
# dtm  <- DocumentTermMatrix(jobCorpus)
# jobs <- as.list(job$Description)
# trainIterator = itoken(unlist(jobs), ...)
# dtm_train = create_dtm(trainIterator, vectorizer)
# fit <- cv.glmnet(x = as.matrix(dtm), y = job[['highrank']], ...)
dist_cos = cos_dist(as.matrix(dtm))   # also try with dtm_train
                 
set.seed(1234)
fit_cos <- cv.glmnet(x = dist_cos, y = job[['highrank']], 
                 family = 'binomial', 
                 # lasso penalty
                 alpha = 1, 
                 # interested in the area under ROC curve
                 type.measure = "auc", 
                 # 10-fold cross-validation
                 nfolds = 10, 
                 # high value is less accurate, but has faster training
                 thresh = 1e-5, 
                 # again lower number of iterations for faster training
                 maxit = 1e5)
# plot(fit_cos)
plotCV.glmnet(fit_cos, "Cosine-transformed LASSO")
print(paste("max AUC =", round(max(fit_cos$cvm), 4)))
## [1] "max AUC = 0.8377"

The AUC is now significantly higher, about \(0.84\), which is a pretty good result, better than what we obtained from either the raw DTM or the TF-IDF weighted DTM. This suggests that the machine’s “understanding” of the content, i.e., the natural language processing, leads to a more acceptable content classifier.
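
Note that, unlike the earlier DTM-based models, fit_cos uses the \(200\times 200\) cosine-distance matrix as its design matrix, so scoring a new document requires its cosine distances to each of the 200 training documents. A minimal sketch, where newVec is a hypothetical term-frequency vector aligned to the columns of dtm:

cos_dist_to_train <- function(newVec, trainMat){
  # cosine distance between one new document vector and every training document (row)
  1 - as.vector(trainMat %*% newVec) /
      (sqrt(sum(newVec^2)) * sqrt(rowSums(trainMat^2)))
}
# newFeatures <- cos_dist_to_train(newVec, as.matrix(dtm))   # length-200 feature vector
# predict(fit_cos, newx = t(newFeatures), s = "lambda.min", type = "response")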

5 Sentiment analysis

Let’s use the text2vec::movie_review dataset, which consists of 5,000 movie reviews dichotomized as positive or negative. In the subsequent predictive analytics, this sentiment will represent our output feature: \[Y= Sentiment=\left\{ \begin{array}{ll} 0, & \quad negative \\ 1, & \quad positive \end{array} \right. .\]

5.1 Data Preprocessing

The data.table package will also be used for some data manipulation. Let’s start with splitting the data into training and testing sets.

# install.packages("text2vec"); install.packages("data.table")
library(text2vec)
library(data.table)

# Load the movie reviews data
data("movie_review")

# coerce the movie reviews data to a data.table (DT) object
setDT(movie_review)

# create a key for the movie-reviews data table
setkey(movie_review, id)

# View the data
# View(movie_review)
head(movie_review); dim(movie_review); colnames(movie_review)
##         id sentiment
## 1: 10000_8         1
## 2: 10001_4         0
## 3: 10004_3         0
## 4: 10004_8         1
## 5: 10006_4         0
## 6: 10008_7         1
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                review
## 1: Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. Most people think of the homeless as just a lost cause while worrying about things such as racism, the war on Iraq, pressuring kids to succeed, technology, the elections, inflation, or worrying if they'll be next to end up on the streets.<br /><br />But what if you were given a bet to live on the streets for a month without the luxuries you once had from a home, the entertainment sets, a bathroom, pictures on the wall, a computer, and everything you once treasure to see what it's like to be homeless? That is Goddard Bolt's lesson.<br /><br />Mel Brooks (who directs) who stars as Bolt plays a rich man who has everything in the world until deciding to make a bet with a sissy rival (Jeffery Tambor) to see if he can live in the streets for thirty days without the luxuries; if Bolt succeeds, he can do what he wants with a future project of making more buildings. The bet's on where Bolt is thrown on the street with a bracelet on his leg to monitor his every move where he can't step off the sidewalk. He's given the nickname Pepto by a vagrant after it's written on his forehead where Bolt meets other characters including a woman by the name of Molly (Lesley Ann Warren) an ex-dancer who got divorce before losing her home, and her pals Sailor (Howard Morris) and Fumes (Teddy Wilson) who are already used to the streets. They're survivors. Bolt isn't. He's not used to reaching mutual agreements like he once did when being rich where it's fight or flight, kill or be killed.<br /><br />While the love connection between Molly and Bolt wasn't necessary to plot, I found \\"Life Stinks\\" to be one of Mel Brooks' observant films where prior to being a comedy, it shows a tender side compared to his slapstick work such as Blazing Saddles, Young Frankenstein, or Spaceballs for the matter, to show what it's like having something valuable before losing it the next day or on the other hand making a stupid bet like all rich people do when they don't know what to do with their money. Maybe they should give it to the homeless instead of using it like Monopoly money.<br /><br />Or maybe this film will inspire you to help others.
## 2:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            This film lacked something I couldn't put my finger on at first: charisma on the part of the leading actress. This inevitably translated to lack of chemistry when she shared the screen with her leading man. Even the romantic scenes came across as being merely the actors at play. It could very well have been the director who miscalculated what he needed from the actors. I just don't know.<br /><br />But could it have been the screenplay? Just exactly who was the chef in love with? He seemed more enamored of his culinary skills and restaurant, and ultimately of himself and his youthful exploits, than of anybody or anything else. He never convinced me he was in love with the princess.<br /><br />I was disappointed in this movie. But, don't forget it was nominated for an Oscar, so judge for yourself.
## 3:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     \\"It appears that many critics find the idea of a Woody Allen drama unpalatable.\\" And for good reason: they are unbearably wooden and pretentious imitations of Bergman. And let's not kid ourselves: critics were mostly supportive of Allen's Bergman pretensions, Allen's whining accusations to the contrary notwithstanding. What I don't get is this: why was Allen generally applauded for his originality in imitating Bergman, but the contemporaneous Brian DePalma was excoriated for \\"ripping off\\" Hitchcock in his suspense/horror films? In Robin Wood's view, it's a strange form of cultural snobbery. I would have to agree with that.
## 4:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          This isn't the comedic Robin Williams, nor is it the quirky/insane Robin Williams of recent thriller fame. This is a hybrid of the classic drama without over-dramatization, mixed with Robin's new love of the thriller. But this isn't a thriller, per se. This is more a mystery/suspense vehicle through which Williams attempts to locate a sick boy and his keeper.<br /><br />Also starring Sandra Oh and Rory Culkin, this Suspense Drama plays pretty much like a news report, until William's character gets close to achieving his goal.<br /><br />I must say that I was highly entertained, though this movie fails to teach, guide, inspect, or amuse. It felt more like I was watching a guy (Williams), as he was actually performing the actions, from a third person perspective. In other words, it felt real, and I was able to subscribe to the premise of the story.<br /><br />All in all, it's worth a watch, though it's definitely not Friday/Saturday night fare.<br /><br />It rates a 7.7/10 from...<br /><br />the Fiend :.
## 5:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            I don't know who to blame, the timid writers or the clueless director. It seemed to be one of those movies where so much was paid to the stars (Angie, Charlie, Denise, Rosanna and Jon) that there wasn't enough left to really make a movie. This could have been very entertaining, but there was a veil of timidity, even cowardice, that hung over each scene. Since it got an R rating anyway why was the ubiquitous bubble bath scene shot with a 70-year-old woman and not Angie Harmon? Why does Sheen sleepwalk through potentially hot relationships WITH TWO OF THE MOST BEAUTIFUL AND SEXY ACTRESSES in the world? If they were only looking for laughs why not cast Whoopi Goldberg and Judy Tenuta instead? This was so predictable I was surprised to find that the director wasn't a five year old. What a waste, not just for the viewers but for the actors as well.
## 6:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       You know, Robin Williams, God bless him, is constantly shooting himself in the foot lately with all these dumb comedies he has done this decade (with perhaps the exception of \\"Death To Smoochy\\", which bombed when it came out but is now a cult classic). The dramas he has made lately have been fantastic, especially \\"Insomnia\\" and \\"One Hour Photo\\". \\"The Night Listener\\", despite mediocre reviews and a quick DVD release, is among his best work, period.<br /><br />This is a very chilling story, even though it doesn't include a serial killer or anyone that physically dangerous for that matter. The concept of the film is based on an actual case of fraud that still has yet to be officially confirmed. In high school, I read an autobiography by a child named Anthony Godby Johnson, who suffered horrific abuse and eventually contracted AIDS as a result. I was moved by the story until I read reports online that Johnson may not actually exist. When I saw this movie, the confused feelings that Robin Williams so brilliantly portrayed resurfaced in my mind.<br /><br />Toni Collette probably gives her best dramatic performance too as the ultimately sociopathic \\"caretaker\\". Her role was a far cry from those she had in movies like \\"Little Miss Sunshine\\". There were even times she looked into the camera where I thought she was staring right at me. It takes a good actress to play that sort of role, and it's this understated (yet well reviewed) role that makes Toni Collette probably one of the best actresses of this generation not to have even been nominated for an Academy Award (as of 2008). It's incredible that there is at least one woman in this world who is like this, and it's scary too.<br /><br />This is a good, dark film that I highly recommend. Be prepared to be unsettled, though, because this movie leaves you with a strange feeling at the end.
## [1] 5000    3
## [1] "id"        "sentiment" "review"
# Generate 80-20% training-testing split of the reviews
all_ids = movie_review$id
set.seed(1234)
train_ids = sample(all_ids, 5000*0.8)
test_ids = setdiff(all_ids, train_ids)
train = movie_review[train_ids, ]
test = movie_review[test_ids, ]

Next, we will vectorize the reviews by creating term-to-termID mappings. Note that terms may include arbitrary n-grams, not just single words. The set of reviews will be represented as a sparse matrix, with rows and columns corresponding to reviews and terms, respectively. This vectorization may be accomplished in several alternative ways, e.g., by using the corpus vocabulary, feature hashing, etc.

The vocabulary-based DTM, created by the create_vocabulary() function, relies on all unique terms from all reviews, where each term has a unique ID. In this example, we will create the review vocabulary using an iterator construct, which abstracts the input details and enables in-memory processing of the (training) data in chunks.

# define the text preprocessing 

# either a simple (lower-casing tolower) function
preproc_fun = tolower

# or a more elaborate "cleaning" function
preproc_fun = function(x)                    # text data
{ require("tm")
  x  =  gsub("<.*?>", " ", x)               # regex removing HTML tags
  x  =  iconv(x, "latin1", "ASCII", sub="") # remove non-ASCII characters
  x  =  gsub("[^[:alnum:]]", " ", x)        # remove non-alpha-numeric values
  x  =  tolower(x)                          # convert to lower case characters
  # x  =  removeNumbers(x)                  # removing numbers
  x  =  stripWhitespace(x)                  # removing white space
  x  =  gsub("^\\s+|\\s+$", "", x)          # remove leading and trailing white space
  return(x)
}
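
As a quick illustration, we can apply the cleaning function to the first raw review (loaded above) and inspect the beginning of the result:

substr(preproc_fun(movie_review$review[1]), 1, 80)   # first 80 characters after cleaning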

# define the tokenization function
token_fun = word_tokenizer

# iterator for both training and testing sets
iter_train = itoken(train$review, 
             preprocessor = preproc_fun, 
             tokenizer = token_fun, 
             ids = train$id, 
             progressbar = TRUE)

iter_test = itoken(test$review, 
             preprocessor = preproc_fun, 
             tokenizer = token_fun, 
             ids = test$id, 
             progressbar = TRUE)
reviewVocab = create_vocabulary(iter_train)

# report the head and tail of the reviewVocab
reviewVocab
## Number of docs: 4000 
## 0 stopwords:  ... 
## ngram_min = 1; ngram_max = 1 
## Vocabulary: 
##         term term_count doc_count
##     1: 00015          1         1
##     2:    03          1         1
##     3:   041          1         1
##     4:    05          1         1
##     5:    09          1         1
##    ---                           
## 35560:    to      22103      3788
## 35561:    of      23585      3791
## 35562:     a      26468      3871
## 35563:   and      26832      3864
## 35564:   the      54245      3966

The next step computes the document term matrix (DTM).

reviewVectorizer = vocab_vectorizer(reviewVocab)
t0 = Sys.time()
dtm_train = create_dtm(iter_train, reviewVectorizer)
dtm_test = create_dtm(iter_test, reviewVectorizer)
t1 = Sys.time()
print(difftime(t1, t0, units = 'sec'))
## Time difference of 2.577568 secs
# check the DTM dimensions
dim(dtm_train); dim(dtm_test)
## [1]  4000 35564
## [1]  1000 35564
# confirm that the training data review DTM dimensions are consistent 
# with training review IDs, i.e., #rows = number of documents, and
# #columns = number of unique terms (n-grams), dim(dtm_train)[[2]]
identical(rownames(dtm_train), train$id)
## [1] TRUE
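
As mentioned earlier, feature hashing is an alternative to the vocabulary-based vectorizer; it avoids storing a vocabulary altogether, at the cost of less interpretable (hashed) column names. A minimal sketch, assuming the same iter_train iterator and an illustrative hash size of \(2^{14}\):

# feature-hashing alternative (not used in the subsequent analyses)
hashVectorizer <- hash_vectorizer(hash_size = 2^14, ngram = c(1L, 1L))
dtm_train_hash <- create_dtm(iter_train, hashVectorizer)
dim(dtm_train_hash)   # 4000 reviews by 2^14 hashed feature columns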

5.2 NLP/TM Analytics

We can now fit statistical models or derive model-free machine learning predictions. Let’s start by using cv.glmnet() to fit a logistic (logit) model with LASSO (\(L_1\)) regularization and 10-fold cross-validation, see Chapter 17.

library(glmnet)
nFolds = 10
t0 = Sys.time()
glmnet_classifier = cv.glmnet(x = dtm_train, y = train[['sentiment']], 
        family = "binomial", 
        # LASSO L1 penalty
        alpha = 1,
        # interested in the area under ROC curve or MSE
        type.measure = "auc",
        # n-fold internal (training data) stats cross-validation
        nfolds = nFolds,
        # threshold: high value is less accurate / faster training
        thresh = 1e-2,
        # again lower number of iterations for faster training
        maxit = 1e3
      )

lambda.best <- glmnet_classifier$lambda.min
lambda.best
## [1] 0.009043106
# report execution time
t1 = Sys.time()
print(difftime(t1, t0, units = 'sec'))
## Time difference of 8.796872 secs
# some prediction plots
# plot(glmnet_classifier)
# # plot(glmnet_classifier, xvar="lambda", label="TRUE")
# mtext("CV LASSO: Number of Nonzero (Active) Coefficients", side=3, line=2.5)
plotCV.glmnet(glmnet_classifier)

Now let’s look at external validation, i.e., testing the model on the independent 20% of the reviews we kept aside. The performance of the binary sentiment prediction for these movie reviews on the test data is roughly the same as what we obtained from the internal 10-fold cross-validation.

# library('glmnet')
# report the mean internal cross-validated error
print(paste("max AUC =", round(max(glmnet_classifier$cvm), 4)))
## [1] "max AUC = 0.9221"
# report TESTING data prediction accuracy
xTest = dtm_test
yTest = test[['sentiment']]
predLASSO <- predict(glmnet_classifier, 
              s = glmnet_classifier$lambda.1se, newx = xTest)
# note: by default predict() returns values on the link (logit) scale,
# so this "MSE" is computed on the logit scale rather than on probabilities
testMSE_LASSO <- mean((predLASSO - yTest)^2); testMSE_LASSO
## [1] 2.482202
# Binarize the LASSO prediction (predLASSO is on the link scale, so the 0.5 cutoff is only an approximate probability threshold)
binPredLASSO <- ifelse(predLASSO<0.5, 0, 1)
table(binPredLASSO, yTest)
##             yTest
## binPredLASSO   0   1
##            0 449 181
##            1  38 332
# and testing data AUC
glmnet:::auc(yTest, predLASSO)
## [1] 0.906773
# summarize the (link-scale) predictions, then report the 20 most negative and most positive model coefficients; predict() dispatches to predict.cv.glmnet()
summary(predLASSO)
##        s1         
##  Min.   :-8.2990  
##  1st Qu.:-0.9875  
##  Median : 0.1347  
##  Mean   :-0.1050  
##  3rd Qu.: 0.9050  
##  Max.   : 5.8133
sort(predict(glmnet_classifier, s = lambda.best, type = "coefficients"))[1:20]
##  [1] -5.3773471 -2.4413170 -2.0576560 -1.8147407 -1.7930992 -1.6398103
##  [7] -1.5799485 -1.4489793 -1.3448756 -1.2109861 -1.2032443 -1.1045763
## [13] -1.0823501 -1.0753680 -1.0609291 -1.0525903 -1.0398367 -1.0111934
## [19] -0.9996941 -0.9048566
rev(sort(predict(glmnet_classifier, s = lambda.best, type = "coefficients")))[1:20]
##  [1] 2.6990847 2.0802809 1.4634998 1.0710930 1.0179944 0.9351993 0.8824431
##  [8] 0.8106505 0.7944252 0.7789477 0.7634964 0.7611448 0.7361530 0.7230270
## [15] 0.7225707 0.7190906 0.7075955 0.7073987 0.7038402 0.6936936
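
The two sorted lists above show only the coefficient values. To see which terms they correspond to, we can keep the coefficient names; a minimal sketch (coefVec is just an illustrative helper name):

coefMat <- predict(glmnet_classifier, s = lambda.best, type = "coefficients")
coefVec <- setNames(as.vector(coefMat), rownames(coefMat))
head(sort(coefVec), 20)                      # 20 most negative (negative-sentiment) terms
head(sort(coefVec, decreasing = TRUE), 20)   # 20 most positive (positive-sentiment) terms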

The (external) prediction performance, measured by AUC, on the testing data is about the same as the internal 10-fold stats cross-validation we reported above.
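
For completeness, the testing-data confusion matrix reported above implies the following summary rates (a sketch; recall that the 0.5 cutoff on the link scale is only an approximate probability threshold):

confTest <- table(binPredLASSO, yTest)
accuracy    <- sum(diag(confTest)) / sum(confTest)        # (449 + 332) / 1000
sensitivity <- confTest["1", "1"] / sum(confTest[, "1"])  # 332 / (181 + 332)
specificity <- confTest["0", "0"] / sum(confTest[, "0"])  # 449 / (449 + 38)
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)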

5.3 Prediction Optimization

Earlier we saw that we can also prune the vocabulary and perhaps improve prediction performance, e.g., by removing non-salient terms like stopwords and by using n-grams instead of single words.

reviewVocab = create_vocabulary(iter_train, stopwords=tm::stopwords("english"), ngram = c(1L, 2L))

prunedReviewVocab = prune_vocabulary(reviewVocab, 
                term_count_min = 10, 
                doc_proportion_max = 0.5,
                doc_proportion_min = 0.001)
prunedVectorizer = vocab_vectorizer(prunedReviewVocab)

t0 = Sys.time()
dtm_train = create_dtm(iter_train, prunedVectorizer)
dtm_test = create_dtm(iter_test, prunedVectorizer)
t1 = Sys.time()
print(difftime(t1, t0, units = 'sec'))
## Time difference of 2.891846 secs

Next, refit the model and report the performance. Did this yield an improvement in the prediction accuracy?

glmnet_prunedClassifier=cv.glmnet(x=dtm_train, 
        y=train[['sentiment']], 
        family = "binomial", 
        # LASSO L1 penalty
        alpha = 1,
        # interested in the area under ROC curve or MSE
        type.measure = "auc",
        # n-fold internal (training data) stats cross-validation
        nfolds = nFolds,
        # threshold: high value is less accurate / faster training
        thresh = 1e-4,
        # again lower number of iterations for faster training
        maxit = 1e5
      )

lambda.best <- glmnet_prunedClassifier$lambda.min
lambda.best
## [1] 0.007865232
# report execution time
t1 = Sys.time()
print(difftime(t1, t0, units = 'sec'))
## Time difference of 9.582903 secs
# some prediction plots
# plot(glmnet_prunedClassifier)
# mtext("Pruned-Model CV LASSO: Number of Nonzero (Active) Coefficients", side=3, line=2.5)
plotCV.glmnet(glmnet_prunedClassifier)
# report the mean internal cross-validated error
print(paste("max AUC =", round(max(glmnet_prunedClassifier$cvm), 4)))
## [1] "max AUC = 0.9301"
# report TESTING data prediction accuracy
xTest = dtm_test
yTest = test[['sentiment']]
predLASSO = predict(glmnet_prunedClassifier,
          dtm_test, type = 'response')[,1]

testMSE_LASSO <- mean((predLASSO - yTest)^2); testMSE_LASSO
## [1] 0.1232584
# Binarize the LASSO probability prediction 
binPredLASSO <- ifelse(predLASSO<0.5, 0, 1)
table(binPredLASSO, yTest)
##             yTest
## binPredLASSO   0   1
##            0 397  80
##            1  90 433
# and testing data AUC
glmnet:::auc(yTest, predLASSO)
## [1] 0.9181407
# report the 20 most negative and most positive model coefficients
summary(predLASSO)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000352 0.2256975 0.5240056 0.4943751 0.7449730 0.9985883
# note: these coefficients are extracted from the earlier (unpruned) classifier evaluated at
# the pruned model's lambda.best; substitute glmnet_prunedClassifier to inspect the pruned model
sort(predict(glmnet_classifier, s = lambda.best, type = "coefficients"))[1:20]
## <sparse>[ <logic> ] : .M.sub.i.logical() maybe inefficient
##  [1] -5.996667 -2.917531 -2.570840 -2.129353 -2.011366 -1.924831 -1.905654
##  [8] -1.672907 -1.623986 -1.536025 -1.525004 -1.430932 -1.363108 -1.330195
## [15] -1.274768 -1.258665 -1.176855 -1.170610 -1.146671 -1.136090
rev(sort(predict(glmnet_classifier, s = lambda.best, type = "coefficients")))[1:20]
## <sparse>[ <logic> ] : .M.sub.i.logical() maybe inefficient
##  [1] 2.9487927 2.5906712 1.6395806 1.3466682 1.2665379 1.1657157 1.1609455
##  [8] 1.0732492 1.0469346 1.0196068 1.0174890 1.0051070 0.9562316 0.9241320
## [15] 0.9162853 0.8817781 0.8803928 0.8668196 0.8604375 0.8561196

Pruning the vocabulary and adding 2-grams improved the sentiment prediction model a bit; the internal CV AUC increased from about 0.92 to 0.93 and the testing-data AUC from about 0.91 to 0.92.

Try applying these NLP/TM techniques to other unstructured text corpora.
