1 Mining Twitter Corpus

Use the R Data Mining Twitter data to apply natural language processing and text mining (NLP/TM) methods and investigate the Twitter corpus.

  • Construct a VCorpus object for RDataMining-Tweets-20160212.rds.
  • Clean the VCorpus object.
  • Build a document-term matrix (DTM).
  • Compute the TF-IDF (term frequency - inverse document frequency).
  • Use the DTM to construct a word cloud.
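The cleaning, DTM, and TF-IDF steps above can be sketched in a few lines. The exercise itself is meant to be solved in R on the tweet archive; the block below is only a minimal, language-agnostic illustration in plain Python of what each step computes. The toy documents, the tiny stop-word list, and the TF-IDF variant (term frequency normalized by document length, times log2(N/df)) are all assumptions made for illustration, and other weightings are in common use.

```python
import math
import re
from collections import Counter

# Hypothetical toy documents standing in for tweets (not taken from the dataset).
docs = [
    "R data mining with text mining examples",
    "Text mining of Twitter data in R",
    "Word cloud of frequent terms",
]

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"with", "of", "in", "the", "a"}


def clean(text):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


tokenized = [clean(d) for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

# Document-term matrix: one row per document, one column per vocabulary term.
counts = [Counter(doc) for doc in tokenized]
dtm = [[c[t] for t in vocab] for c in counts]

# Document frequency: number of documents that contain each term.
n_docs = len(docs)
df = [sum(1 for row in dtm if row[j] > 0) for j in range(len(vocab))]
idf = [math.log2(n_docs / d) for d in df]

# TF-IDF: length-normalized term frequency times inverse document frequency.
tfidf = [
    [(row[j] / sum(row)) * idf[j] for j in range(len(vocab))]
    for row in dtm
]

# Total term frequencies are the quantities a word cloud would size words by.
term_totals = {t: sum(row[j] for row in dtm) for j, t in enumerate(vocab)}
top_terms = sorted(term_totals.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
print(top_terms)
```

A term that appears in every document receives an IDF of zero under this variant, so TF-IDF down-weights ubiquitous words that would otherwise dominate a word cloud built from raw counts.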

2 Mining Cancer Clinical Notes

Use the Head and Neck Cancer Medication Data to apply NLP/TM methods and investigate the corpus. You have already explored these data in Chapter 7; now we go a step further.

  • Use the MEDICATION_SUMMARY to construct a VCorpus object.
  • Clean the VCorpus object.
  • Build a document-term matrix (DTM).
  • Add a column indicating early versus later stage according to seer_stage (refer to Chapter 7).
  • Use the DTM to construct word clouds for the early-stage notes, the later-stage notes, and the complete archive.
  • Interpret the results of the three generated word clouds.
  • Compute the TF-IDF (term frequency - inverse document frequency).
  • Apply LASSO to the unweighted and the TF-IDF-weighted DTM, respectively, and evaluate the results using AUC.
  • Try a cosine similarity transformation, apply LASSO, and compare the results.
  • Use other performance measures in cv.glmnet(), such as type.measure = "class".
  • Does it appear that these automated machine learning methods understand human language well?
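Two quantities in the steps above, cosine similarity between documents and AUC, are easy to state concretely. The following is a minimal sketch in plain Python, not the R workflow the exercise asks for. The function names, the toy DTM rows, and the toy labels and scores are hypothetical; AUC is computed here as the fraction of positive-negative pairs ranked correctly, with ties counted as one half.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two term-count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_matrix(rows):
    """Pairwise document-by-document cosine similarities for the rows of a DTM."""
    return [[cosine_similarity(u, v) for v in rows] for u in rows]


def auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy DTM: rows 0 and 1 use the same terms in the same proportions.
toy_dtm = [[2, 1, 0], [4, 2, 0], [0, 0, 3]]
sim = cosine_matrix(toy_dtm)

# Hypothetical labels (1 = later stage) and predicted scores from some classifier.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]
toy_auc = auc(labels, scores)
print(sim[0][1], sim[0][2], toy_auc)
```

Because cosine similarity depends only on the direction of a count vector, a long note and a short note with the same term proportions are treated as identical, which is one reason the transformation can change what LASSO selects compared with raw counts.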
