1 Mining Twitter Corpus

Use the R Data Mining Twitter data to apply natural language processing/text mining (NLP/TM) methods and investigate the Twitter corpus.

Construct a VCorpus object for RDataMining-Tweets-20160212.rds.
Clean the VCorpus object.
Build document-term matrix (DTM).
Compute the TF-IDF (term frequency-inverse document frequency).
Use the DTM to construct a word cloud.
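
The steps above can be sketched in R with the tm package. This is a minimal illustration under stated assumptions, not a reference solution: the internal structure of RDataMining-Tweets-20160212.rds is not described here, so a hypothetical three-element character vector `tweets` stands in for the tweet texts, and the particular cleaning steps are one reasonable choice among many.

```r
library(tm)  # VCorpus, tm_map, DocumentTermMatrix, weightTfIdf

# Hypothetical stand-in for the tweet texts (an assumption). With the real
# file one would start from readRDS("RDataMining-Tweets-20160212.rds") and
# extract the text of each tweet first.
tweets <- c("Data mining with R: text mining of Twitter data",
            "Slides on social network analysis with R",
            "R and data mining: examples and case studies")

# Construct the VCorpus object
corpus <- VCorpus(VectorSource(tweets))

# Clean the corpus: lower-case, drop punctuation, numbers, stop words, extra spaces
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Document-term matrix (rows = documents, columns = terms) and its TF-IDF weighting
dtm <- DocumentTermMatrix(corpus)
dtm_tfidf <- weightTfIdf(dtm)

# Word cloud from the term totals; drawn only if the wordcloud package is installed
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
if (requireNamespace("wordcloud", quietly = TRUE)) {
  set.seed(1234)
  wordcloud::wordcloud(names(freq), freq, min.freq = 1, random.order = FALSE)
}
```

Note that DocumentTermMatrix() by default keeps only terms of at least three characters, so a token such as "r" is dropped unless the wordLengths control option is changed.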

2 Mining Cancer Clinical Notes

Use the Head and Neck Cancer Medication Data to apply NLP/TM methods and investigate the corpus. You have already explored these data in Chapter 7; now we need to go a step further.

Use the MEDICATION_SUMMARY to construct a VCorpus object.
Clean the VCorpus object.
Build a document-term matrix (DTM).
Add a column indicating early versus later stage, based on seer_stage (refer to Chapter 7).
Use the DTM to construct a word cloud for early stage, later stage and the complete archive.
Interpret the results of the three generated word clouds.
Compute the TF-IDF (term frequency-inverse document frequency).
Apply LASSO on the unweighted and weighted DTM respectively and evaluate the results according to AUC.
Try cosine similarity transformation, apply LASSO and compare the results.
Try other performance measures in cv.glmnet(), such as type.measure = "class" (misclassification error).
Does it appear that these automated machine learning methods understand human language well?
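
The modeling steps in this list can be sketched with glmnet on synthetic data. Everything below is an assumption made for illustration: the matrix `X` is a simulated stand-in for a DTM, `y` is a simulated binary stage label, and "cosine similarity transformation" is read here as scaling each document row to unit length (so that inner products between rows equal cosine similarities). The chapter may define that transformation differently.

```r
library(glmnet)

set.seed(2017)

# Synthetic stand-in for a DTM: 200 documents x 20 term counts (hypothetical data)
n <- 200; p <- 20
X <- matrix(rpois(n * p, lambda = 2), nrow = n,
            dimnames = list(NULL, paste0("term", 1:p)))

# Hypothetical binary label (e.g. early vs. later stage), driven by the first two terms
y <- rbinom(n, 1, plogis(0.8 * X[, 1] - 0.8 * X[, 2]))

# Cosine (unit-length) normalization of each document row
row_norm <- sqrt(rowSums(X^2))
row_norm[row_norm == 0] <- 1   # guard against empty documents
X_cos <- X / row_norm          # row i is divided by row_norm[i]

# LASSO (alpha = 1) logistic regression with 10-fold cross-validation,
# scored by AUC and by misclassification error
fit_auc   <- cv.glmnet(X, y, family = "binomial", alpha = 1, type.measure = "auc")
fit_class <- cv.glmnet(X, y, family = "binomial", alpha = 1, type.measure = "class")
fit_cos   <- cv.glmnet(X_cos, y, family = "binomial", alpha = 1, type.measure = "auc")

# Best cross-validated AUC (higher is better) for raw vs. cosine-normalized input
c(raw = max(fit_auc$cvm), cosine = max(fit_cos$cvm))

# Lowest cross-validated misclassification rate (lower is better)
min(fit_class$cvm)
```

With the real data, `X` would be replaced by as.matrix() of the unweighted or TF-IDF-weighted DTM and `y` by the stage indicator. The same comparison then shows whether weighting or normalization changes the cross-validated AUC.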

Data Science and Predictive Analytics (UMich HS650)

Assessment: 19. Natural Language Processing/Text Mining

SOCR/MIDAS (Ivo Dinov)

June 2017