
1 Explain these two concepts

  • Bayes Theorem
  • Laplace Estimation
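
For reference, the standard textbook statements of the two concepts can be written as below. The notation is an assumption made here, not taken from the assignment: \(A\) and \(B\) are events with \(P(B) > 0\); \(n_{v,C}\) is the number of training cases in class \(C\) whose feature takes the value \(v\); \(n_C\) is the number of training cases in class \(C\); \(k\) is the number of distinct values the feature can take; and \(\alpha > 0\) is the smoothing constant (\(\alpha = 1\) is the usual Laplace choice).

\[
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},
\qquad
\hat{P}(v \mid C) = \frac{n_{v,C} + \alpha}{n_C + \alpha k}.
\]

With \(\alpha = 0\) the second expression reduces to the raw relative frequency, which becomes zero for any feature value never observed in class \(C\); a positive \(\alpha\) keeps every estimate strictly positive.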

2 Processing text data for analysis

Load the SOCR 2011 US Job Satisfaction data. The last column (Description) contains free text describing each job type. Note that spaces in this text are replaced by underscores (_). To mine the text field and suggest some meta-data analytics, construct an R protocol that carries out the following steps:

  • Convert the textual meta-data into a corpus object.
  • Remove irrelevant punctuation and other symbols from the corpus documents, convert all text to lower case, etc.
  • Tokenize the job descriptions into words. Examine the distributions of Stress_Category and Hiring_Potential.
  • Classify the job Stress into two categories.
  • Generate a word cloud to visualize the job description text.
  • Graphically visualize the difference between the low and high Stress_Category groups.
  • Transform the word count features into categorical data.
  • Ignore low-frequency words and report the sparsity of your categorical data matrix with and without deleting those low-frequency words. Note that the sparsity of a matrix is the fraction of its elements that are zero: \(Sparsity(A) =\frac{\text{number of zero-valued elements}}{\text{total number of matrix elements (} m\times n\text{)}}\).
  • Apply the Naive Bayes classifier to the original matrix and to the lower-dimensional matrix. What do you observe?
  • Apply and compare LDA and Naive Bayes classifiers with respect to the error, specificity and sensitivity.
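
The text-cleaning, tokenization, sparsity, and count-to-category steps above can be illustrated on a toy example. The following is a minimal base-R sketch, not the intended solution: the four job descriptions, the `min_docs` threshold, and the helper names `sparsity` and `to_factor` are all hypothetical, and a full protocol would more likely rely on a text-mining package applied to the real Description column.

```r
# Toy job descriptions (hypothetical), with underscores in place of spaces
desc <- c("Design_and_test_software_systems.",
          "Teach_students;_grade_exams.",
          "Test_software_and_write_reports.",
          "Manage_students_and_staff.")

# Clean: underscores -> spaces, lower case, drop punctuation
clean <- tolower(gsub("_", " ", desc, fixed = TRUE))
clean <- gsub("[[:punct:]]", "", clean)

# Tokenize each description on whitespace
tokens <- strsplit(trimws(clean), "[[:space:]]+")

# Document-term count matrix: rows = jobs, columns = words
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))

# Sparsity(A) = number of zero-valued elements / (m * n)
sparsity <- function(A) sum(A == 0) / length(A)

# Keep only words appearing in at least min_docs descriptions
# (min_docs is an assumed, illustrative threshold)
min_docs <- 2
keep <- colSums(dtm > 0) >= min_docs
dtm_reduced <- dtm[, keep, drop = FALSE]

# Word counts -> categorical presence indicators ("No" / "Yes")
to_factor <- function(A) {
  as.data.frame(lapply(as.data.frame(A), function(col) {
    factor(col > 0, levels = c(FALSE, TRUE), labels = c("No", "Yes"))
  }))
}
dtm_cat <- to_factor(dtm_reduced)

# Compare sparsity before and after dropping low-frequency words
print(sparsity(dtm))
print(sparsity(dtm_reduced))
```

The resulting `dtm_cat` data frame holds one factor column per retained word, which is the kind of categorical input a Naive Bayes classifier expects. The same `sparsity` helper can be applied to both the full and the reduced matrix to report the two values the exercise asks for.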
