#' --- #' title: "Data Science and Predictive Analytics (UMich HS650)" #' subtitle: "

Natural Language Processing/Text Mining

" #' author: "

SOCR/MIDAS (Ivo Dinov)

" #' date: "`r format(Sys.time(), '%B %Y')`" #' tags: [DSPA, SOCR, MIDAS, Big Data, Predictive Analytics] #' output: #' html_document: #' theme: spacelab #' highlight: tango #' includes: #' before_body: SOCR_header.html #' after_body: SOCR_footer_tracker.html #' toc: true #' number_sections: true #' toc_depth: 2 #' toc_float: #' collapsed: false #' smooth_scroll: true #' --- #' #' As we have seen in the previous chapters, traditional statistical analyses and classical data modeling are applied to *relational data* where the observed information is represented by tables, vectors, arrays, tensors or data-frames containing binary, categorical, original, or numerical values. Such representations provide incredible advantages (e.g., quick reference and de-reference of elements, search, discovery and navigation), but also limit the scope of applications. Relational data objects are quite effective for managing information that is based only on existing attributes. However, when data science inference needs to utilize attributes that are not included in the relational model, alternative non-relational representations are necessary. For instance, imagine that our data object includes a free text feature (e.g., physician/nurse clinical notes, biospecimen samples) that contains information about medical condition, treatment or outcome. It's very difficult, or sometimes even impossible, to include the raw text into the automated data analytics, using classical procedures and statistical models available for relational datasets. #' #' Natural Language Processing (NLP) and Text Mining (TM) refer to automated machine-driven algorithms for semantically mapping, extracting information, and understanding of (natural) human language. Sometimes, this involves extracting salient information from large amounts of unstructured text. To do so, we need to build a semantic and syntactic mapping algorithm for effective processing of heavy text. Related to NLP/TM, the work we did in [Chapter 7](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/07_NaiveBayesianClass.html) showed a powerful text classifier using the naive Bayes algorithm. #' #' In this Chapter, we will present more details about various text processing strategies in R. Specifically, we will present simulated and real examples of text processing and computing document term frequency (TF), inverse document frequency (IDF), and cosine similarity transformation. #' #' # A simple NLP/TM example #' #' Text mining or text analytics (TM/TA) examines large volumes of unstructured text (corpus) aiming to extract new information, discover context, identify linguistic motifs, or transform the text and derive quantitative data that can be further analyzed. Natural language processing (NLP) is one example of a TM analytical technique. Whereas TM's goal is to discover relevant contextual information, which may be unknown, hidden, or obfuscated, NLP is focused on linguistic analysis that trains a machine interpret voluminous textual content. To decipher the semantics and ambiguities in human-interpretable language, NLP employs automatic summarization, tagging, disambiguation, extraction of entities and relations, pattern recognition and frequency analyses. [As of 2017](https://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mind-boggling-facts-everyone-must-read/#5400d3a617b1), the total amount of information generated by the human race exceeds 5 zettabytes ($1ZB=10^{21}=2^{70}$ bytes), which is projected to top $50ZB$ by 2020. 
The amount of data we obtain and record doubles every 12-14 months (Kryder's law). A small fraction of this massive information ($<0.0001\%$ or $<1PB=10^{15}$ bytes) represents newly written or transcribed text, including code. However, it is impossible (cf. efficiency, time, resources) for humans to read, synthesize, interpret and react to all this information without direct assistance of TM/NLP. The information content in text could be substantially higher than that of other information media. Remember that "a picture may be worth a thousand words", yet, "a word may also be worth a thousand pictures". As an example, the simple sentence "The data science and predictive analytics textbook includes 22 Chapters." takes 63 bytes to store as text, however, a color image showing this as printed text could reach 10 megabytes (MB), and an HD video of a speaker reading the same sentence could easily surpass 50MB. Text mining and natural language processing may be used to automatically analyze and interpret written, coded or transcribed content to assess news, moods, emotions, and biosocial trends related to specific topics. #' #' In general, text analysis protocol involves: #' #' * Construction of a document-term matrix (DTM) from the input documents, vectorizing the text, e.g., creating a map of single words or `n-grams` into a vector space. That is, the *vectorizer is a function mapping terms to indices*. #' #' * Apply a model-based statistical analysis or a model-free machine learning techniques for prediction, clustering, classification, similarity search, network/sentiment analysis, or forecasting using the DTM. This step also includes tuning and internally validating the performance of the method. #' * Apply and evaluate the technique to new data. #' #' ## Define and load the unstructured-text documents #' Let's create some documents we can use to demonstrate the use of the `tm` package to do text mining. The 5 documents below represent portions of the syllabi of 5 recent courses taught by [Ivo Dinov](http://www.umich.edu/~dinov): #' #' * [HS650: Data Science and Predictive Analytics (DSPA)](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650) #' * [Bootcamp: Predictive Big Data Analytics using R](http://www.socr.umich.edu/people/dinov/2017/Spring/PBDA_R_Bootcamp) #' * [HS 853: Scientific Methods for Health Sciences: Special Topics](http://www.socr.umich.edu/people/dinov/2016/Fall/HS853) #' * [HS851: Scientific Methods for Health Sciences: Applied Inference](http://www.socr.umich.edu/people/dinov/2014/Fall/HS851), and #' * [HS550: Scientific Methods for Health Sciences: Fundamentals](http://www.socr.umich.edu/people/dinov/2014/Fall/HS550) #' #' We import the syllabi into several separate segments represented as `documents`. #' #' * As an *exercise*, try to use the `rvest::read_html` method to load in the 5 course syllabi directly from the course websites listed above. #' #' doc1 <- "HS650: The Data Science and Predictive Analytics (DSPA) course (offered as a massive open online course, MOOC, as well as a traditional University of Michigan class) aims to build computational abilities, inferential thinking, and practical skills for tackling core data scientific challenges. It explores foundational concepts in data management, processing, statistical computing, and dynamic visualization using modern programming tools and agile web-services. Concepts, ideas, and protocols are illustrated through examples of real observational, simulated and research-derived datasets. 
Some prior quantitative experience in programming, calculus, statistics, mathematical models, or linear algebra will be necessary. This open graduate course will provide a general overview of the principles, concepts, techniques, tools and services for managing, harmonizing, aggregating, preprocessing, modeling, analyzing and interpreting large, multi-source, incomplete, incongruent, and heterogeneous data (Big Data). The focus will be to expose students to common challenges related to handling Big Data and present the enormous opportunities and power associated with our ability to interrogate such complex datasets, extract useful information, derive knowledge, and provide actionable forecasting. Biomedical, healthcare, and social datasets will provide context for addressing specific driving challenges. Students will learn about modern data analytic techniques and develop skills for importing and exporting, cleaning and fusing, modeling and visualizing, analyzing and synthesizing complex datasets. The collaborative design, implementation, sharing and community validation of high-throughput analytic workflows will be emphasized throughout the course." #' #' doc2 <- "Bootcamp: A week-long intensive Bootcamp focused on methods, techniques, tools, services and resources for big healthcare and biomedical data analytics using the open-source statistical computing software R. Morning sessions (3 hrs) will be dedicated to methods and technologies and applications. Afternoon sessions (3 hrs) will be for group-based hands-on practice and team work. Commitment to attend the full week of instruction (morning sessions) and self-guided work (afternoon sessions) is required. Certificates of completion will be issued only to trainees with perfect attendance that complete all work. This hands-on intensive graduate course (Bootcamp) will provide a general overview of the principles, concepts, techniques, tools and services for managing, harmonizing, aggregating, preprocessing, modeling, analyzing and interpreting large, multi-source, incomplete, incongruent, and heterogeneous data (Big Data). The focus will be to expose students to common challenges related to handling Big Data and present the enormous opportunities and power associated with our ability to interrogate such complex datasets, extract useful information, derive knowledge, and provide actionable forecasting. Biomedical, healthcare, and social datasets will provide context for addressing specific driving challenges. Students will learn about modern data analytic techniques and develop skills for importing and exporting, cleaning and fusing, modeling and visualizing, analyzing and synthesizing complex datasets. The collaborative design, implementation, sharing and community validation of high-throughput analytic workflows will be emphasized throughout the course." #' #' doc3 <- "HS 853: This course covers a number of modern analytical methods for advanced healthcare research. Specific focus will be on reviewing and using innovative modeling, computational, analytic and visualization techniques to address concrete driving biomedical and healthcare applications. The course will cover the 5 dimensions of Big-Data (volume, complexity, multiple scales, multiple sources, and incompleteness). HS853 is a 4 credit hour course (3 lectures + 1 lab/discussion). 
Students will learn how to conduct research, employ and report on recent advanced health sciences analytical methods; read, comprehend and present recent reports of innovative scientific methods; apply a broad range of health problems; experiment with real Big-Data. Topics Covered include: Foundations of R, Scientific Visualization, Review of Multivariate and Mixed Linear Models, Causality/Causal Inference and Structural Equation Models, Generalized Estimating Equations, PCOR/CER methods Heterogeneity of Treatment Effects, Big-Data, Big-Science, Internal statistical cross-validation, Missing data, Genotype-Environment-Phenotype, associations, Variable selection (regularized regression and controlled/knockoff filtering), medical imaging, Databases/registries, Meta-analyses, classification methods, Longitudinal data and time-series analysis, Geographic Information Systems (GIS), Psychometrics and Rasch measurement model analysis, MCMC sampling for Bayesian inference, and Network Analysis" #' #' doc4 <- "HS 851: This course introduces students to applied inference methods in studies involving multiple variables. Specific methods that will be discussed include linear regression, analysis of variance, and different regression models. This course will emphasize the scientific formulation, analytical modeling, computational tools and applied statistical inference in diverse health-sciences problems. Data interrogation, modeling approaches, rigorous interpretation and inference will be emphasized throughout. HS851 is a 4 credit hour course (3 lectures + 1 lab/discussion). Students will learn how to: , Understand the commonly used statistical methods of published scientific papers , Conduct statistical calculations/analyses on available data , Use software tools to analyze specific case-studies data , Communicate advanced statistical concepts/techniques , Determine, explain and interpret assumptions and limitations. Topics Covered include Epidemiology , Correlation/SLR , and slope inference, 1-2 samples , ROC Curve , ANOVA , Non-parametric inference , Cronbach's $\alpha$, Measurement Reliability/Validity , Survival Analysis , Decision theory , CLT/LLNs - limiting results and misconceptions , Association Tests , Bayesian Inference , PCA/ICA/Factor Analysis , Point/Interval Estimation (CI) - MoM, MLE , Instrument performance Evaluation , Study/Research Critiques , Common mistakes and misconceptions in using probability and statistics, identifying potential assumption violations, and avoiding them." #' #' doc5 <- "HS550: This course provides students with an introduction to probability reasoning and statistical inference. Students will learn theoretical concepts and apply analytic skills for collecting, managing, modeling, processing, interpreting and visualizing (mostly univariate) data. Students will learn the basic probability modeling and statistical analysis methods and acquire knowledge to read recently published health research publications. HS550 is a 4 credit hour course (3 lectures + 1 lab/discussion). 
Students will learn how to: Apply data management strategies to sample data files , Carry out statistical tests to answer common healthcare research questions using appropriate methods and software tools , Understand the core analytical data modeling techniques and their appropriate use Examples of Topics Covered , EDA/Charts , Ubiquitous variation , Parametric inference , Probability Theory , Odds Ratio/Relative Risk , Distributions , Exploratory data analysis , Resampling/Simulation , Design of Experiments , Intro to Epidemiology , Estimation , Hypothesis testing , Experiments vs. Observational studies , Data management (tables, streams, cloud, warehouses, DBs, arrays, binary, ASCII, handling, mechanics) , Power, sample-size, effect-size, sensitivity, specificity , Bias/Precision , Association vs. Causality , Rate-of-change , Clinical vs. Stat significance , Statistical Independence Bayesian Rule." #' #' #' ## Create a new **VCorpus** object #' #' The `VCorpus` object includes all the text and some meta-data (e.g., indexing) about the text. #' #' docs<-c(doc1, doc2, doc3, doc4, doc5) class(docs) #' #' #' Then let's make a `VCorpus` object using `tm` package. To complete this task, we need to know the source type. Here `docs` has a vector with "character" class so we should use `VectorSource()`. If it is a dataframe, we should use `DataframeSource()` instead. `VCorpus()` creates a *volatile corpus*, which is the data type used by the `tm` package for text mining. #' #' library(tm) doc_corpus<-VCorpus(VectorSource(docs)) doc_corpus doc_corpus[[1]]$content #' #' #' This is a list that contains the information for the 5 documents we have created. Now we can apply `tm_map()` function on this object to edit the text. The goal is to let the computer to understand the text and output information we desired. #' #' ## To-lower case transformation #' #' The text itself contains upper case letters as well as lower case letters. The first thing to do is to convert everything to lower case. #' #' doc_corpus<-tm_map(doc_corpus, tolower) doc_corpus[[1]] #' #' #' ## Text pre-processing #' ### Remove Stopwords #' These documents contains a lot of "stopwords" or common words that have important semantic meaning but low analytic value. We can remove these by the following command. #' #' stopwords("english") doc_corpus<-tm_map(doc_corpus, removeWords, stopwords("english")) doc_corpus[[1]] #' #' #' We removed all the stopwards in the `stopwords("english")` list. You can always make your own stop-word list and just use `doc_corpus<-tm_map(doc_corpus, removeWords, your_own_words_list)` to apply this list. #' #' From the output of `doc1` we notice the removal of stopwords creates extra blank space. Thus, the next step would be to remove them. #' #' doc_corpus<-tm_map(doc_corpus, stripWhitespace) doc_corpus[[1]] #' #' #' ### Remove punctuation #' Now we notice the irrelevant punctuation in the text, which can be removed by using a combination of `tm_map()` and `removePunctuation()` functions. #' #' doc_corpus<-tm_map(doc_corpus, removePunctuation) doc_corpus[[2]] #' #' #' The above `tm_map` commands changed the structure of our `doc_corpus` object. We can apply `PlainTextDocument` function to convert it back to the original format. #' #' doc_corpus<-tm_map(doc_corpus, PlainTextDocument) #' #' #' ### Stemming: removal of plurals and action suffixes #' Let's inspect the first three documents. We notice that there are some words ending with "ing", "es", "s". 
#' 
#' 
doc_corpus[[1]]$content
doc_corpus[[2]]$content
doc_corpus[[3]]$content
#' 
#' 
#' If we have multiple terms that differ only in their endings (e.g., past, present, present-perfect-continuous tense), the algorithm will treat them as distinct because it does not understand language semantics the way a human would. To make things easier for the computer, we can strip these endings by "stemming" the documents. Remember to load the package `SnowballC` before using the function `stemDocument()`. The earliest stemmer was written by Julie Beth Lovins in 1968 and had great influence on all subsequent work. Currently, one of the most popular stemming approaches, proposed by Martin Porter, is used in `stemDocument()`, [more on Porter's algorithm](https://tartarus.org/martin/PorterStemmer/).
#' 
# install.packages("SnowballC")
library(SnowballC)
doc_corpus<-tm_map(doc_corpus, stemDocument)
doc_corpus[[1]]$content
#' 
#' 
#' This stemming step has to be done after the `PlainTextDocument` conversion because `stemDocument` can only be applied to plain text documents.
#' 
#' ## Bags of words
#' 
#' It's very useful to be able to tokenize text documents into `n-grams`, sequences of words, e.g., a `2-gram` represents two-word phrases that appear together in order. This allows us to form bags of words, or of `n-grams`, that retain some local word-ordering information. The *bag of words model* is a common way to represent documents in matrix form based on their term frequencies (TFs). We can construct an $n\times t$ document-term matrix (DTM), where $n$ is the number of documents and $t$ is the number of unique terms. Each column in the DTM represents a unique term, and the $(i,j)^{th}$ cell records how many times term $j$ appears in document $i$.
#' 
#' The basic bag of words model is invariant to the ordering of the words within a document. Once we compute the DTM, we can use machine learning techniques to interpret the derived signature information contained in the resulting matrices.
#' 
#' ## Document-term matrix
#' 
#' Now the `doc_corpus` object is quite clean. Next, we can make a document-term matrix to explore all the terms in the 5 documents. Each cell of the document-term matrix counts how many times a specific term appears in a specific document.
#' 
doc_dtm<-TermDocumentMatrix(doc_corpus)
doc_dtm
#' 
#' 
#' The summary of the document-term matrix is informative. We have 329 different terms in the 5 documents. There are 540 non-zero and 1105 sparse (zero) entries. Thus, the sparsity is $\frac{1105}{(540+1105)}\approx 67\%$, which measures the term sparsity across documents. A high sparsity means terms are not repeated often among different documents.
#' 
#' Recall that we applied the `PlainTextDocument` function to the `doc_corpus` object. This removes all document meta data. To relabel the documents in the document-term matrix we can use the following commands:
#' 
doc_dtm$dimnames$Docs<-as.character(1:5)
inspect(doc_dtm)
#' 
#' 
#' We might want to find and report the frequent terms using this document-term matrix.
#' 
findFreqTerms(doc_dtm, lowfreq = 2)
#' 
#' 
#' This gives us the terms that appear at least twice across the 5 documents. High-frequency terms like `comput`, `statist`, `model`, `healthcar`, `learn` make perfect sense to be included, as these courses cover modeling, statistical and computational methods with applications to health sciences.
#' 
#' The `tm` package also provides term-term correlations. Here is a mechanism to determine the words that are highly correlated with `statist`, ($\rho(statist, ?)\ge 0.8$).
#' 
#' 
findAssocs(doc_dtm, "statist", corlimit = 0.8)
#' 
#' 
#' # Case-Study: Job ranking
#' 
#' Let's explore some real datasets. First, we will import the [2011 USA Jobs Ranking Dataset](http://wiki.socr.umich.edu/index.php/SOCR_Data_2011_US_JobsRanking) from the [SOCR data archive](http://wiki.socr.umich.edu/index.php/SOCR_Data).
#' 
library(rvest)
wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_2011_US_JobsRanking")
html_nodes(wiki_url, "#content")
job <- html_table(html_nodes(wiki_url, "table")[[1]])
head(job)
#' 
#' 
#' Note that a lower index represents a more desirable job in 2011; thus, the most desirable job among the top 200 common jobs in 2011 was Software Engineer. The aim of our case study is to explore the differences between the top 30 desirable jobs and the last 100 jobs in the list.
#' 
#' We will go through the same procedure as we did for the simple example. The documents we are using are the entries of the `Description` column (a character vector) in the dataset.
#' 
#' ## Step 1: make a VCorpus object
#' 
jobCorpus<-VCorpus(VectorSource(job[, 10]))
#' 
#' 
#' ## Step 2: clean the VCorpus object
#' 
jobCorpus<-tm_map(jobCorpus, tolower)
for(j in seq(jobCorpus)){
  jobCorpus[[j]]<-gsub("_", " ", jobCorpus[[j]])
}
#' 
#' 
#' Here we used a loop to substitute "_" with a blank space. This is because `removePunctuation` would simply delete the underscores, leaving no separation between the terms they joined. In this situation, `gsub` is the best choice to use.
#' 
jobCorpus<-tm_map(jobCorpus, removeWords, stopwords("english"))
jobCorpus<-tm_map(jobCorpus, removePunctuation)
jobCorpus<-tm_map(jobCorpus, stripWhitespace)
jobCorpus<-tm_map(jobCorpus, PlainTextDocument)
jobCorpus<-tm_map(jobCorpus, stemDocument)
#' 
#' 
#' ## Step 3: build document-term matrix
#' 
#' Document-term matrix objects (`tm::DocumentTermMatrix`) contain a sparse term-document or document-term matrix along with the term-weighting attributes of the matrix.
#' 
#' First make sure that we have a clean VCorpus object.
jobCorpus[[1]]$content
#' 
#' 
#' Then we can start to build the DTM and reassign labels to the `Docs`.
#' 
dtm<-DocumentTermMatrix(jobCorpus)
dtm
dtm$dimnames$Docs<-as.character(1:200)
inspect(dtm[1:10, 1:10])
#' 
#' 
#' Let's subset the `dtm` into the top 30 jobs and the bottom 100 jobs.
#' 
dtm_top30<-dtm[1:30, ]
dtm_bot100<-dtm[101:200, ]
dtm_top30
dtm_bot100
#' 
#' 
#' In this case, since the sparsity is very high, we can try to remove some of the words that seldom appear in the job descriptions.
#' 
dtms_top30<-removeSparseTerms(dtm_top30, 0.90)
dtms_top30
dtms_bot100<-removeSparseTerms(dtm_bot100, 0.94)
dtms_bot100
#' 
#' 
#' For the top 30 jobs, instead of the initial 846 terms, we only keep the 19 terms appearing in at least 10% of the job descriptions.
#' 
#' Similarly, for the bottom 100 jobs, instead of the initial 846 terms, we only keep the 14 terms appearing in at least 6% of those job descriptions.
#' 
#' Similar to what we did in [Chapter 7](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/07_NaiveBayesianClass.html), visualization of term word-clouds is available in R by combining the `tm` and `wordcloud` packages. First, we can count the term frequencies in the two document-term matrices.
#' #' # Let's calculate the cumulative frequencies of words across documents and sort: freq1<-sort(colSums(as.matrix(dtms_top30)), decreasing=T) freq1 freq2<-sort(colSums(as.matrix(dtms_bot100)), decreasing=T) freq2 # Plot wf=data.frame(term=names(freq2), occurrences=freq2) library(ggplot2) p <- ggplot(subset(wf, freq2>2), aes(term, occurrences)) p <- p + geom_bar(stat="identity") p <- p + theme(axis.text.x=element_text(angle=45, hjust=1)) p #' #' Then we apply `wordcloud` function to the `freq` dataset. #' #' library(wordcloud) set.seed(123) wordcloud(names(freq1), freq1) # Color code the frequencies using an appropriate color map: # Sequential palettes names include: # Blues BuGn BuPu GnBu Greens Greys Oranges OrRd PuBu PuBuGn PuRd Purples RdPu Reds YlGn YlGnBu YlOrBr YlOrRd # Diverging palettes include # BrBG PiYG PRGn PuOr RdBu RdGy RdYlBu RdYlGn Spectral wordcloud(names(freq2), freq2, min.freq=5, colors=brewer.pal(6, "Spectral")) #' #' #' It is apparent that top 30 jobs focus more on researching or discover new things with frequent keywords like "study", "nature", "analyze". The bottom 100 jobs are focused on operating on existing objects with frequent keywords like "operation", "repair", "perform". #' #' ## Area Under ROC Curve #' #' In [Chapter 13](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/13_ModelEvaluation.html) we talked about the ROC curve. We can use document-term matrix to build classifiers and use the *area under ROC curve* to evaluate those classifiers. Assume we want to predict whether a job ranks top 30 in the job list. #' #' The first task would be to create an indicator of high rank (job is in the top 30 list). We can use the `ifelse()` function that we are already familiar with. #' #' job$highrank<-ifelse(job$Index<30, 1, 0) #' #' #' Next we load the `glmnet` package to help us build the prediction model and draw graphs. #' #' #install.packages("glmnet") library(glmnet) #' #' #' The function we will be using is the `cv.glmnet`. `cv` stands for cross-validation. Since the `highrank` variable is binary, we use option `family='binomial'`. Also, we want to use 10-fold CV method for internal statistical (resampling-based) prediction validation. #' #' set.seed(25) fit <- cv.glmnet(x = as.matrix(dtm), y = job[['highrank']], family = 'binomial', # lasso penalty alpha = 1, # interested in the area under ROC curve type.measure = "auc", # 10-fold cross-validation nfolds = 10, # high value is less accurate, but has faster training thresh = 1e-3, # again lower number of iterations for faster training maxit = 1e3) plot(fit) print(paste("max AUC =", round(max(fit$cvm), 4))) #' #' #' Here `x` is a matrix and `y` is the response variable. The graph is showing all the AUC we got from models we created. The last line of code help us select the best AUC among all models. The resulting $AUC\sim 0.73$ represents a relatively good prediction model for this small sample size. #' #' # TF-IDF #' #' To enhance the performance of DTM matrix, we introduce the **TF-IDF (term frequency - inverse document frequency)** concept. Unlike pure frequency, TF-IDF measures the relative importance of a term. If a term appears in almost every document, the term will be considered common with a small weight. Alternatively, the rare terms would be considered more informational. #' #' ## Term Frequency (TF) #' **TF** is a ratio $\frac{\text{a term's occurrences in a document}}{\text{the number of occurrences of the most frequent word within the same document}}$. 
#' 
#' $$TF(t,d)=\frac{f_d(t)}{\max_{w\in d}{f_d(w)}}.$$
#' 
#' ## Inverse Document Frequency (IDF)
#' The **TF** definition may allow high scores for irrelevant words that naturally show up often in a long text, even if these may have been triaged in a prior preprocessing step. The **IDF** attempts to rectify that. **IDF** represents the inverse of the share of the documents in which the regarded term can be found. The lower the number of documents containing the term, relative to the size of the corpus, the higher the term factor.
#' 
#' IDF involves a logarithm because otherwise the effective scoring penalty for showing up in two documents would be too extreme. Without the logarithm, the IDF of a term found in just one document would be twice the IDF of a term found in two documents. The $\ln()$ function tempers this bias of ranking in favor of rare terms, even when the TF-factor is high, since it is rather unlikely that a term's relevance is high in only one document and not in any of the others.
#' 
#' $$ IDF(t,D) = \ln\left ( \frac{|D|}{|\{ d\in D: t\in d\}|} \right ).$$
#' 
#' ## TF-IDF
#' Both TF and IDF yield high scores for highly relevant terms. TF relies on local information (search over $d$), whereas IDF incorporates a more global perspective (search over $D$). Their product, $TF\times IDF$, gives the classical **TF-IDF** formula. However, alternative univariate expressions may be formulated using different weights for TF and IDF.
#' 
#' $$TF\_IDF(t,d,D)= TF(t,d)\times IDF(t,D).$$
#' An example of an alternative TF-IDF metric can be defined by:
#' $$ TF\_IDF'(t,d,D) =\frac{IDF(t,D)}{|D|} + TF\_IDF(t,d,D).$$
#' 
#' Let's make another DTM with TF-IDF weights and compare the differences between the *unweighted* and *weighted* DTMs.
#' 
dtm.tfidf<-DocumentTermMatrix(jobCorpus, control = list(weighting=weightTfIdf))
dtm.tfidf
dtm.tfidf$dimnames$Docs <- as.character(1:200)
inspect(dtm.tfidf[1:9, 1:10])
inspect(dtm[1:9, 1:10])
#' 
#' 
#' From the inspections of the two different DTMs we can see that TF-IDF does not merely count term frequencies; it also assigns each term a weight according to the importance of the term. Next, we are going to fit another model with this new DTM (`dtm.tfidf`).
#' 
set.seed(2)
fit1 <- cv.glmnet(x = as.matrix(dtm.tfidf), y = job[['highrank']], 
              family = 'binomial', 
              # lasso penalty
              alpha = 1,
              # interested in the area under ROC curve
              type.measure = "auc",
              # 10-fold cross-validation
              nfolds = 10,
              # high value is less accurate, but has faster training
              thresh = 1e-3,
              # again lower number of iterations for faster training
              maxit = 1e3)
plot(fit1)
print(paste("max AUC =", round(max(fit1$cvm), 4)))
#' 
#' 
#' This output is about the same as the previous job-ranking prediction classifier (based on the unweighted DTM). Due to random sampling, each run of the protocol may generate a slightly different result. The idea behind using TF-IDF is that one would expect to get less biased estimates of word importance. If the documents include stopwords, like "the" or "one", the raw-count DTM may distort the results, but TF-IDF can help fix this problem.
#' 
#' Next we can report a more intuitive representation of the job ranking prediction reflecting the agreement of the binary (top-30 or not) classification between the real labels and the predicted labels. Notice that this applies only to the training data itself.
#' #' # Binarize the LASSO probability prediction preffit1 <- predict(fit1, newx=as.matrix(dtm.tfidf), s="lambda.min", type = "class") binPredfit1 <- ifelse(preffit1<0.5, 0, 1) table(binPredfit1, job[['highrank']]) #' #' #' Let's try to predict the job ranking of a new (testing or validation) job description (JD). There are [many job descriptions provided online](https://www.bls.gov/ocs/ocsjobde.htm) that we can extract text from to predict the job ranking of the corresponding positions. Trying several alternative job categories, e.g., some high-tech or fin-tech and some manufacturing and construction jobs, may provide some intuition to the power of the jobs-classifier we built. Below, we will compare the JDs for the positions of accountant, attorney, and machinist. #' #' # install.packages("text2vec"); install.packages("data.table") library(text2vec) library(data.table) # Choose the JD for a PUBLIC ACCOUNTANTS 1430 (https://www.bls.gov/ocs/ocsjobde.htm) xTestAccountant <- "Performs professional auditing work in a public accounting firm. Work requires at least a bachelor's degree in accounting. Participates in or conducts audits to ascertain the fairness of financial representations made by client companies. May also assist the client in improving accounting procedures and operations. Examines financial reports, accounting records, and related documents and practices of clients. Determines whether all important matters have been disclosed and whether procedures are consistent and conform to acceptable practices. Samples and tests transactions, internal controls, and other elements of the accounting system(s) as needed to render the accounting firm's final written opinion. As an entry level public accountant, serves as a junior member of an audit team. Receives classroom and on-the-job training to provide practical experience in applying the principles, theories, and concepts of accounting and auditing to specific situations. (Positions held by trainee public accountants with advanced degrees, such as MBA's are excluded at this level.) Complete instructions are furnished and work is reviewed to verify its accuracy, conformance with required procedures and instructions, and usefulness in facilitating the accountant's professional growth. Any technical problems not covered by instructions are brought to the attention of a superior. Carries out basic audit tests and procedures, such as: verifying reports against source accounts and records; reconciling bank and other accounts; and examining cash receipts and disbursements, payroll records, requisitions, receiving reports, and other accounting documents in detail to ascertain that transactions are properly supported and recorded. Prepares selected portions of audit working papers" xTestAttorney <- "Performs consultation, advisory and/or trail work and carries out the legal processes necessary to effect the rights, privileges, and obligations of the organization. The work performed requires completion of law school with an L.L.B. degree or J.D. degree and admission to the bar. Responsibilities or functions include one or more of the following or comparable duties: 1. Preparing and reviewing various legal instruments and documents, such as contracts, leases, licenses, purchases, sales, real estate, etc.; 2. Acting as agent of the organization in its transactions; 3. Examining material (e.g., advertisements, publications, etc.) for legal implications; advising officials of proposed legislation which might affect the organization; 4. 
Applying for patents, copyrights, or registration of the organization's products, processes, devices, and trademarks; advising whether to initiate or defend law suits; 5. Conducting pretrial preparations; defending the organization in lawsuits; 6. Prosecuting criminal cases for a local or state government or defending the general public (for example, public defenders and attorneys rendering legal services to students); or 7. Advising officials on tax matters, government regulations, and/or legal rights. Attorney jobs are matched at one of six levels according to two factors: 1. Difficulty level of legal work; and 2. Responsibility level of job. Attorney jobs which meet the above definitions are to be classified and coded in accordance with a chart available upon request. Legal questions are characterized by: facts that are well-established; clearly applicable legal precedents; and matters not of substantial importance to the organization. (Usually relatively limited sums of money, e.g., a few thousand dollars, are involved.) a. legal investigation, negotiation, and research preparatory to defending the organization in potential or actual lawsuits involving alleged negligence where the facts can be firmly established and there are precedent cases directly applicable to the situation; b. searching case reports, legal documents, periodicals, textbooks, and other legal references, and preparing draft opinions on employee compensation or benefit questions where there is a substantial amount of clearly applicable statutory, regulatory, and case material; c. drawing up contracts and other legal documents in connection with real property transactions requiring the development of detailed information but not involving serious questions regarding titles to property or other major factual or legal issues. d. preparing routine criminal cases for trial when the legal or factual issues are relatively straight forward and the impact of the case is limited; and e. advising public defendants in regard to routine criminal charges or complaints and representing such defendants in court when legal alternatives and facts are relatively clear and the impact of the outcome is limited primarily to the defendant. Legal work is regularly difficult by reason of one or more of the following: the absence of clear and directly applicable legal precedents; the different possible interpretations that can be placed on the facts, the laws, or the precedents involved; the substantial importance of the legal matters to the organization (e.g., sums as large as $100,000 are generally directly or indirectly involved); or the matter is being strongly pressed or contested in formal proceedings or in negotiations by the individuals, corporations, or government agencies involved. a. advising on the legal implications of advertising representations when the facts supporting the representations and the applicable precedent cases are subject to different interpretations; b. reviewing and advising on the implications of new or revised laws affecting the organization; c. presenting the organization's defense in court in a negligence lawsuit which is strongly pressed by counsel for an organized group; d. providing legal counsel on tax questions complicated by the absence of precedent decisions that are directly applicable to the organization's situation; e. 
preparing and prosecuting criminal cases when the facts of the cases are complex or difficult to determine or the outcome will have a significant impact within the jurisdiction; and f. advising and representing public defendants in all phases of criminal proceedings when the facts of the case are complex or difficult to determine, complex or unsettled legal issues are involved, or the prosecutorial jurisdiction devotes substantial resources to obtaining a conviction." xTestMachinist <- "Produces replacement parts and new parts in making repairs of metal parts of mechanical equipment. Work involves most of the following: interpreting written instructions and specifications; planning and laying out of work; using a variety of machinist's handtools and precision measuring instruments; setting up and operating standard machine tools; shaping of metal parts to close tolerances; making standard shop computations relating to dimensions of work, tooling, feeds, and speeds of machining; knowledge of the working properties of the common metals; selecting standard materials, parts, and equipment required for this work; and fitting and assembling parts into mechanical equipment. In general, the machinist's work normally requires a rounded training in machine-shop practice usually acquired through a formal apprenticeship or equivalent training and experience. Industrial machinery repairer. Repairs machinery or mechanical equipment. Work involves most of the following: examining machines and mechanical equipment to diagnose source of trouble; dismantling or partly dismantling machines and performing repairs that mainly involve the use of handtools in scraping and fitting parts; replacing broken or defective parts with items obtained from stock; ordering the production of a replacement part by a machine shop or sending the machine to a machine shop for major repairs; preparing written specifications for major repairs or for the production of parts ordered from machine shops; reassembling machines; and making all necessary adjustments for operation. In general, the work of a machinery maintenance mechanic requires rounded training and experience usually acquired through a formal apprenticeship or equivalent training and experience. Excluded from this classification are workers whose primary duties involve setting up or adjusting machines. Vehicle and mobile equipment mechanics and repairers. Repairs, rebuilds, or overhauls major assemblies of internal combustion automobiles, buses, trucks, or tractors. Work involves most of the following: Diagnosing the source of trouble and determining the extent of repairs required; replacing worn or broken parts such as piston rings, bearings, or other engine parts; grinding and adjusting valves; rebuilding carburetors; overhauling transmissions; and repairing fuel injection, lighting, and ignition systems. 
In general, the work of the motor vehicle mechanic requires rounded training and experience usually acquired through a formal apprenticeship or equivalent training and experience" testJDs <- c(xTestAccountant, xTestAttorney, xTestMachinist) # define the preprocessing (tolower case) function preproc_fun = tolower # define the tokenization function token_fun = text2vec::word_tokenizer # loop to substitute "_" with blank space for(j in seq(job[, 10])){ job[j, 10] <- gsub("_", " ", job[j, 10]) } # iterator for Job training and testing JDs iter_Jobs = itoken(job[, 10], preprocessor = preproc_fun, tokenizer = token_fun, progressbar = TRUE) iter_testJDs = itoken(testJDs, preprocessor = preproc_fun, tokenizer = token_fun, progressbar = TRUE) jobs_Vocab = create_vocabulary(iter_Jobs, stopwords=tm::stopwords("english"), ngram = c(1L, 2L)) jobsVectorizer = vocab_vectorizer(jobs_Vocab) dtm_jobsTrain = create_dtm(iter_Jobs, jobsVectorizer) dtm_testJDs = create_dtm(iter_testJDs, jobsVectorizer) dim(dtm_jobsTrain); dim(dtm_testJDs) set.seed(2) fit1 <- cv.glmnet(x = as.matrix(dtm_jobsTrain), y = job[['highrank']], family = 'binomial', # lasso penalty alpha = 1, # interested in the area under ROC curve type.measure = "auc", # 10-fold cross-validation nfolds = 10, # high value is less accurate, but has faster training thresh = 1e-3, # again lower number of iterations for faster training maxit = 1e3) print(paste("max AUC =", round(max(fit1$cvm), 4))) #' #' #' Note that we improved somewhat the $AUC \sim 0.79$. Below, we will assess the JD predictive model using the three out of bag job descriptions. #' #' plot(fit1) # plot(fit1, xvar="lambda", label="TRUE") mtext("CV LASSO: Number of Nonzero (Active) Coefficients", side=3, line=2.5) predTestJDs <- predict(fit1, s = fit1$lambda.1se, newx = dtm_testJDs, type="response"); predTestJDs predTrainJDs <- predict(fit1, s = fit1$lambda.1se, newx = dtm_jobsTrain, type="response"); predTrainJDs # Type can be: "link", "response", "coefficients", "class", "nonzero" #' #' #' The output of the predictions shows that: #' #' * On the *training data*, the predicted probabilities rapidly decrease with the indexing of the jobs, corresponding to the *overall job ranking* (highly ranked/desired jobs are listed on the top). #' * On the three *testing job description data* (accountant, attorney, and machinist), there is a clear ranking difference between the machinist and the other two professions. #' #' Also see the discussion in [Chapter 17](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/17_RegularizedLinModel_KnockoffFilter.html#88_lasso_10-fold_cross_validation) about the different *types of predictions* that can be generated as outputs of `cv.glmnet` regularized forecasting methods. #' #' # Cosine similarity #' #' As we mentioned above, text data are often *transformed* to be in terms of Term Frequency-Inverse Document Frequency (TF-IDF), which of better input than raw frequencies for many text-mining methods. An alternative transformation can be represented as a different distance measure such as *cosine distance*, which is defined by: #' #' $$ similarity = \cos(\theta) = \frac{A\cdot B}{||A||_2||B||_2},$$ #' #' where $\theta$ represents the angle between two vectors $A$ and $B$ in Euclidean space spanned by the DTM matrix. 
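#' 
#' For instance, for two (hypothetical) term-count vectors $A=(2,1,0)$ and $B=(1,1,1)$, representing two rows of a small DTM,
#' $$\cos(\theta) = \frac{2\times 1 + 1\times 1 + 0\times 1}{\sqrt{2^2+1^2+0^2}\ \sqrt{1^2+1^2+1^2}} = \frac{3}{\sqrt{5}\sqrt{3}} \approx 0.77,$$
#' and the corresponding *cosine distance* is $1-0.77=0.23$. The helper function below applies the same computation to all pairs of rows of the DTM.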
#' #' cos_dist = function(mat){ numer = tcrossprod(mat) denom1 = sqrt(apply(mat, 1, crossprod)) denom2 = sqrt(apply(mat, 1, crossprod)) 1 - numer / outer(denom1,denom2) } dist_cos = cos_dist(as.matrix(dtm)) set.seed(2000) fit_cos <- cv.glmnet(x = dist_cos, y = job[['highrank']], family = 'binomial', # lasso penalty alpha = 1, # interested in the area under ROC curve type.measure = "auc", # 10-fold cross-validation nfolds = 10, # high value is less accurate, but has faster training thresh = 1e-3, # again lower number of iterations for faster training maxit = 1e3) plot(fit_cos) print(paste("max AUC =", round(max(fit_cos$cvm), 4))) #' #' #' The AUC now is greater than $0.8$, which is a pretty good result even better than what we obtained from DTM or TF-IDF. This suggests that our machine "understanding" of the content, i.e., the natural language processing, leads to a more acceptable content classifier. #' #' # Sentiment analysis #' #' Let's use the `text2vec::movie_review` dataset, which consists of 5,000 movie reviews dichotomized as `positive` or `negative`. In the subsequent predictive analytics, this *sentiment* will represent our output feature: #' $$Y= Sentiment=\left\{ #' \begin{array}{ll} #' 0, & \quad negative \\ #' 1, & \quad positive #' \end{array} #' \right. .$$ #' #' ## Data Preprocessing #' #' The `data.table` package will also be used for some data manipulation. Let's start with splitting the data into *training* and *testing* sets. #' #' # install.packages("text2vec"); install.packages("data.table") library(text2vec) library(data.table) # Load the movie reviews data data("movie_review") # coerce the movie reviews data to a data.table (DT) object setDT(movie_review) # create a key for the movie-reviews data table setkey(movie_review, id) # View the data # View(movie_review) head(movie_review); dim(movie_review); colnames(movie_review) # Generate 80-20% training-testing split of the reviews all_ids = movie_review$id set.seed(1234) train_ids = sample(all_ids, 5000*0.8) test_ids = setdiff(all_ids, train_ids) train = movie_review[train_ids, ] test = movie_review[test_ids, ] #' #' #' Next, we will vectorize the reviews by creating terms to *termID* mappings. Note that terms may include arbitrary *n-grams*, not just single words. The set of reviews will be represented as a sparse matrix, with rows and columns corresponding to reviews/reviewers and terms, respectively. This vectorization may be accomplished in several alternative ways, e.g., by using the corpus vocabulary, feature hashing, etc. #' #' The vocabulary-based DTM, created by `create_vocabulary() function`, relies on all unique terms from all reviews, where each term has a unique ID. In this example, we will create the review vocabulary using an *iterator* construct abstracting the input details and enabling *in memory* processing of the (training) data by chunks. 
#' #' # define the test preprocessing # either a simple (tolower case) function preproc_fun = tolower # or a more elaborate "cleaning" function preproc_fun = function(x) # text data { require("tm") x = gsub("<.*?>", " ", x) # regex removing HTML tags x = iconv(x, "latin1", "ASCII", sub="") # remove non-ASCII characters x = gsub("[^[:alnum:]]", " ", x) # remove non-alpha-numeric values x = tolower(x) # convert to lower case characters # x = removeNumbers(x) # removing numbers x = stripWhitespace(x) # removing white space x = gsub("^\\s+|\\s+$", "", x) # remove leading and trailing white space return(x) } # define the tokenization function token_fun = word_tokenizer # iterator for both training and testing sets iter_train = itoken(train$review, preprocessor = preproc_fun, tokenizer = token_fun, ids = train$id, progressbar = TRUE) iter_test = itoken(test$review, preprocessor = preproc_fun, tokenizer = token_fun, ids = test$id, progressbar = TRUE) reviewVocab = create_vocabulary(iter_train) # report the head and tail of the reviewVocab reviewVocab #' #' #' Next, we can compute the *document term matrix* (DTM). #' #' reviewVectorizer = vocab_vectorizer(reviewVocab) t0 = Sys.time() dtm_train = create_dtm(iter_train, reviewVectorizer) dtm_test = create_dtm(iter_test, reviewVectorizer) t1 = Sys.time() print(difftime(t1, t0, units = 'sec')) # check the DTM dimensions dim(dtm_train); dim(dtm_test) # confirm that the training data review DTM dimensions are consistent # with training review IDs, i.e., #rows = number of documents, and # #columns = number of unique terms (n-grams), dim(dtm_train)[[2]] identical(rownames(dtm_train), train$id) #' #' #' ## NLP/TM Analytics #' #' We can now fit statistical models or derive machine learning model-free predictions. Let's start by using `glmnet()` to fit a *logit model* with LASSO ($L_1$) regularization and 10-fold cross-validation, see [Chapter 17](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/17_RegularizedLinModel_KnockoffFilter.html). #' #' library(glmnet) nFolds = 10 t0 = Sys.time() glmnet_classifier = cv.glmnet(x = dtm_train, y = train[['sentiment']], family = "binomial", # LASSO L1 penalty alpha = 1, # interested in the area under ROC curve or MSE type.measure = "auc", # n-fold internal (training data) stats cross-validation nfolds = nFolds, # threshold: high value is less accurate / faster training thresh = 1e-2, # again lower number of iterations for faster training maxit = 1e3 ) lambda.best <- glmnet_classifier$lambda.min lambda.best # report execution time t1 = Sys.time() print(difftime(t1, t0, units = 'sec')) # some prediciton plots plot(glmnet_classifier) # plot(glmnet_classifier, xvar="lambda", label="TRUE") mtext("CV LASSO: Number of Nonzero (Active) Coefficients", side=3, line=2.5) #' #' #' Now let's look at external validation, i.e., testing the model on the independent 20% of the reviews we kept aside. The performance of the binary prediction (binary sentiment analysis of these movie reviews) on the test data is roughly the same as we had from the internal statistical 10-fold cross-validation. 
#' #' # report the mean internal cross-validated error print(paste("max AUC =", round(max(glmnet_classifier$cvm), 4))) # report TESTING data prediction accuracy xTest = dtm_test yTest = test[['sentiment']] predLASSO <- predict(glmnet_classifier, s = glmnet_classifier$lambda.1se, newx = xTest) testMSE_LASSO <- mean((predLASSO - yTest)^2); testMSE_LASSO # Binarize the LASSO probabiliuty prediction binPredLASSO <- ifelse(predLASSO<0.5, 0, 1) table(binPredLASSO, yTest) # and testing data AUC glmnet:::auc(yTest, predLASSO) # report the top 20 negative and positive predictive terms summary(predLASSO) sort(predict.cv.glmnet(glmnet_classifier, s = lambda.best, type = "coefficients"))[1:20] rev(sort(predict.cv.glmnet(glmnet_classifier, s = lambda.best, type = "coefficients")))[1:20] #' #' #' The (external) prediction performance, measured by AUC, on the testing data is about the same as the internal 10-fold stats cross-validation we reported above. #' #' ## Prediction Optimization #' #' Earlier we saw that we can also prune the vocabulary and perhaps improve prediction performance, e.g., by removing non-salient terms like stopwords and by using *n-grams* instead of single words. #' #' reviewVocab = create_vocabulary(iter_train, stopwords=tm::stopwords("english"), ngram = c(1L, 2L)) prunedReviewVocab = prune_vocabulary(reviewVocab, term_count_min = 10, doc_proportion_max = 0.5, doc_proportion_min = 0.001) prunedVectorizer = vocab_vectorizer(prunedReviewVocab) t0 = Sys.time() dtm_train = create_dtm(iter_train, prunedVectorizer) dtm_test = create_dtm(iter_test, prunedVectorizer) t1 = Sys.time() print(difftime(t1, t0, units = 'sec')) #' #' #' Next refit the model and report the performance. Did we make improvement in the prediction accuracy? #' #' glmnet_prunedClassifier=cv.glmnet(x=dtm_train, y=train[['sentiment']], family = "binomial", # LASSO L1 penalty alpha = 1, # interested in the area under ROC curve or MSE type.measure = "auc", # n-fold internal (training data) stats cross-validation nfolds = nFolds, # threshold: high value is less accurate / faster training thresh = 1e-2, # again lower number of iterations for faster training maxit = 1e3 ) lambda.best <- glmnet_prunedClassifier$lambda.min lambda.best # report execution time t1 = Sys.time() print(difftime(t1, t0, units = 'sec')) # some prediction plots plot(glmnet_prunedClassifier) mtext("Pruned-Model CV LASSO: Number of Nonzero (Active) Coefficients", side=3, line=2.5) # report the mean internal cross-validated error print(paste("max AUC =", round(max(glmnet_prunedClassifier$cvm), 4))) # report TESTING data prediction accuracy xTest = dtm_test yTest = test[['sentiment']] predLASSO = predict(glmnet_prunedClassifier, dtm_test, type = 'response')[,1] testMSE_LASSO <- mean((predLASSO - yTest)^2); testMSE_LASSO # Binarize the LASSO probabiliuty prediction binPredLASSO <- ifelse(predLASSO<0.5, 0, 1) table(binPredLASSO, yTest) # and testing data AUC glmnet:::auc(yTest, predLASSO) # report the top 20 negative and positive predictive terms summary(predLASSO) sort(predict.cv.glmnet(glmnet_classifier, s = lambda.best, type = "coefficients"))[1:20] rev(sort(predict.cv.glmnet(glmnet_classifier, s = lambda.best, type = "coefficients")))[1:20] # Binarize the LASSO probability prediction # and construct an approximate confusion matrix binPredLASSO <- ifelse(predLASSO<0.5, 0, 1) table(binPredLASSO, yTest) #' #' #' Using `n-grams` improved a bit the sentiment prediction model. 
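#' 
#' Beyond AUC, the confusion table above can be summarized by a few standard accuracy metrics. Below is a minimal sketch (assuming the `binPredLASSO` and `yTest` objects from the preceding chunk are still in the workspace, with positive sentiment coded as 1) that computes the overall accuracy, sensitivity and specificity of the binarized sentiment predictions.
#' 
# confusion matrix of binarized predictions vs. true test labels
cm <- table(predicted = factor(binPredLASSO, levels = c(0, 1)),
            actual    = factor(yTest, levels = c(0, 1)))
# overall accuracy: (TP + TN) / N
accuracy <- sum(diag(cm)) / sum(cm)
# sensitivity (true-positive rate): TP / (TP + FN)
sensitivity <- cm["1", "1"] / sum(cm[, "1"])
# specificity (true-negative rate): TN / (TN + FP)
specificity <- cm["0", "0"] / sum(cm[, "0"])
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)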
#' 
#' 
#' Try applying these NLP techniques to:
#' 
#' * MIMIC-III, a freely accessible critical care database (Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016). DOI: 10.1038/sdata.2016.35. Available from: http://www.nature.com/articles/sdata201635).
#' * [Other data from the list of our Case-Studies](https://umich.instructure.com/courses/38100/files/).
#' * Your own free text, e.g., using the sketch below.
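#' 
#' As a starting point for the last suggestion, here is a minimal sketch of scoring a new piece of free text with the sentiment classifier trained above. It assumes that `preproc_fun`, `token_fun`, `prunedVectorizer` and `glmnet_prunedClassifier` from the sentiment-analysis section are still in the workspace; the review string itself is made up purely for illustration.
#' 
# a hypothetical, made-up review used only to illustrate the scoring pipeline
myText <- "The plot was predictable, but the acting and the soundtrack were wonderful."
# reuse the same preprocessing, tokenization and (pruned) vectorizer as for the training reviews
iter_myText <- itoken(myText, preprocessor = preproc_fun, tokenizer = token_fun, progressbar = FALSE)
dtm_myText  <- create_dtm(iter_myText, prunedVectorizer)
# predicted probability that the sentiment is positive (Y = 1)
predict(glmnet_prunedClassifier, dtm_myText, type = "response")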