1 Traumatic Brain Injury

Use the kNN algorithm to classify the TBI SOCR data. Dichotomize the field.gcs outcome variable using the threshold field.gcs>=7. Determine an appropriate \(k\), then train the classification model and evaluate its performance on the data. Report model quality statistics for several values of \(k\), and use them to rank-order the models (and perhaps plot their classification results).
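
The workflow above can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated because the TBI file is not loaded here, the predictors x1 and x2 are hypothetical stand-ins for the real TBI covariates, and min-max normalization and the candidate values of \(k\) are arbitrary choices. It relies on knn() from the class package.

```r
library(class)  # provides knn()

set.seed(1234)
n <- 200

# Simulated stand-in for the TBI data. Only field.gcs is named in the exercise;
# the predictors x1 and x2 are hypothetical.
field.gcs <- sample(3:15, n, replace = TRUE)
tbi <- data.frame(
  field.gcs = field.gcs,
  x1 = field.gcs + rnorm(n, sd = 3),
  x2 = rnorm(n)
)

# Dichotomize the outcome: 1 if field.gcs >= 7, otherwise 0
tbi$outcome <- factor(as.integer(tbi$field.gcs >= 7), levels = c(0, 1))

# Min-max normalize the predictors so each contributes comparably to distances
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
X <- as.data.frame(lapply(tbi[, c("x1", "x2")], normalize))

# 2:1 train/test split
train_idx <- sample(seq_len(n), size = round(2 * n / 3))
X_train <- X[train_idx, ]
X_test  <- X[-train_idx, ]
y_train <- tbi$outcome[train_idx]
y_test  <- tbi$outcome[-train_idx]

# Evaluate a few values of k and collect quality statistics from the confusion matrix
ks <- c(1, 5, 11, 21)
results <- data.frame(k = ks, accuracy = NA_real_,
                      sensitivity = NA_real_, specificity = NA_real_)
for (i in seq_along(ks)) {
  pred <- knn(train = X_train, test = X_test, cl = y_train, k = ks[i])
  cm <- table(predicted = pred, actual = y_test)
  results$accuracy[i]    <- sum(diag(cm)) / sum(cm)
  results$sensitivity[i] <- cm["1", "1"] / sum(cm[, "1"])
  results$specificity[i] <- cm["0", "0"] / sum(cm[, "0"])
}

# Rank-order the models by test accuracy
results[order(-results$accuracy), ]
```

With the real data, the simulated data frame would be replaced by the loaded TBI table and the hypothetical predictors by the covariates you choose to include.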

2 Parkinson’s Disease

Use 05_PPMI_top_UPDRS_Integrated_LongFormat data to practice kNN classification.

2.1 kNN Classification in a High-Dimensional Space

  • Process data: delete the Index, FID_IID, and VisitID columns; convert the response variable ResearchGroup to a binary factor (treat SWEDD as disease); detect NA values (impute if necessary).
  • Summarize the dataset: use at least str, summary, cor, and ggpairs.
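  • Data transformation: in addition to the original data, create a scaled/normalized version, a log-transformed version using log(x-min(x)), and a binary version with each value discretized to either 0 or 1.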
  • Randomly partition the data into training and testing sets: use set.seed and random sample, \(train:test = 2:1\).
  • Select an optimal \(k\) for each of the transformed datasets above: show an error plot against \(k\) with three lines: training error, cross-validation error, and test error.
  • What is the impact of \(k\)? Formulate a hypothesis about the relation between \(k\) and the error rates. You can use e1071::tune.knn or caret::train to verify the results (Hint: use the same folds, or you may get slightly different results).
  • Interpret the result: if we construct a hypercube neighborhood of \(x\) that captures a fraction \(\rho\) of the observations, how big should the cube be when the data dimension is 1 and when it is 10? What is the importance of normalization?
  • Report the error rate for both the training and the testing sets. What do you find?
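
The partitioning, error-curve, and hypercube steps above can be sketched as follows. This is a sketch under assumptions, not the intended solution: the predictors and labels are simulated stand-ins for the PPMI features and ResearchGroup, the cross-validation error is the leave-one-out estimate from class::knn.cv (other fold schemes are equally valid), and the hypercube formula assumes observations spread uniformly over a unit hypercube.

```r
library(class)  # provides knn() and knn.cv()

set.seed(2024)
n <- 300
p <- 10

# Simulated predictors and binary label (stand-ins for the PPMI data)
X <- matrix(rnorm(n * p), nrow = n)
y <- factor(ifelse(X[, 1] + X[, 2] + rnorm(n) > 0, "PD", "Control"))

# 2:1 train/test split
train_idx <- sample(seq_len(n), size = round(2 * n / 3))
y_train <- y[train_idx]
y_test  <- y[-train_idx]

# Standardize with training-set statistics, then apply the same shift and scale to the test set
X_train <- scale(X[train_idx, ])
X_test  <- scale(X[-train_idx, ],
                 center = attr(X_train, "scaled:center"),
                 scale  = attr(X_train, "scaled:scale"))

# Training, leave-one-out CV, and test error for odd k from 1 to 41
ks <- seq(1, 41, by = 2)
errors <- t(sapply(ks, function(k) {
  c(train = mean(knn(X_train, X_train, y_train, k = k) != y_train),
    cv    = mean(knn.cv(X_train, y_train, k = k) != y_train),
    test  = mean(knn(X_train, X_test, y_train, k = k) != y_test))
}))

# One line per error type
matplot(ks, errors, type = "l", lty = 1, col = 1:3,
        xlab = "k", ylab = "error rate")
legend("topright", legend = colnames(errors), lty = 1, col = 1:3)

# Choose k by the CV error only; the test error is reserved for the final report
best_k <- ks[which.min(errors[, "cv"])]

# Edge length of a hypercube capturing a fraction rho of uniformly
# distributed observations in d dimensions: rho^(1/d)
edge_length <- function(rho, d) rho^(1 / d)
edge_length(0.1, c(1, 10))  # 0.1 for d = 1, about 0.794 for d = 10
```

Note that the training error at \(k=1\) is zero here because each training point is its own nearest neighbor, which is one reason the training curve alone is a poor guide for choosing \(k\).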

2.2 kNN Classification in a Lower-Dimensional Space

Repeat all of the steps above, but use only the following variables as predictors: UPDRS_Part_I_Summary_Score_Baseline, UPDRS_Part_I_Summary_Score_Month_24, UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline, UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_24, UPDRS_Part_III_Summary_Score_Baseline, and UPDRS_Part_III_Summary_Score_Month_24. Which \(k\) do you select now, and what are the error rates for each version of the data (original, normalized, log-transformed, and binary)? Comment on any interesting observations.
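
A comparison across the four versions of the data might be organized as in the sketch below. Everything here is illustrative: the six columns are simulated non-negative counts standing in for the UPDRS summary scores, the +1 inside the logarithm is an added assumption (log(x-min(x)) alone is -Inf at the column minimum), the median split used for the binary version is an assumed discretization rule, and \(k\) is chosen by leave-one-out cross-validation on the training set.

```r
library(class)  # provides knn() and knn.cv()

set.seed(42)
n <- 240

# Simulated label and six non-negative score columns (stand-ins for the UPDRS scores)
y <- factor(sample(c("Control", "PD"), n, replace = TRUE))
shift <- ifelse(y == "PD", 4, 0)
X <- as.data.frame(replicate(6, rpois(n, lambda = 5 + shift)))

versions <- list(
  original   = X,
  normalized = as.data.frame(lapply(X, function(x) (x - min(x)) / (max(x) - min(x)))),
  # +1 is an assumption to avoid log(0) at the column minimum
  log        = as.data.frame(lapply(X, function(x) log(x - min(x) + 1))),
  # median split is an assumed rule for discretizing to 0 or 1
  binary     = as.data.frame(lapply(X, function(x) as.integer(x > median(x))))
)

# Same 2:1 split for every version so the results are comparable
train_idx <- sample(seq_len(n), size = round(2 * n / 3))
y_train <- y[train_idx]
y_test  <- y[-train_idx]
ks <- seq(1, 31, by = 2)

summary_tab <- do.call(rbind, lapply(names(versions), function(v) {
  d_train <- versions[[v]][train_idx, ]
  d_test  <- versions[[v]][-train_idx, ]
  # Leave-one-out CV error on the training set for each candidate k
  cv_err <- sapply(ks, function(k) mean(knn.cv(d_train, y_train, k = k) != y_train))
  k_best <- ks[which.min(cv_err)]
  data.frame(
    data        = v,
    k           = k_best,
    train_error = mean(knn(d_train, d_train, y_train, k = k_best) != y_train),
    test_error  = mean(knn(d_train, d_test, y_train, k = k_best) != y_test)
  )
}))
summary_tab
```

Using one shared split for all four versions means any difference in the selected \(k\) or in the error rates reflects the transformation rather than the partition.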
