As we learned in Chapters 6-8, classification can help us make predictions on new observations. However, classification requires predefined (human-supervised) class labels. What if we are in the early phases of a study and/or don’t have the resources required to manually define, derive, or generate these class labels?

Clustering can help us explore the dataset and separate cases into groups representing similar traits or characteristics. Each group could be a potential candidate for a class. Clustering is used for exploratory data analytics, i.e., as unsupervised learning, rather than for confirmatory analytics or for predicting specific outcomes.

In this chapter, we will (1) present clustering as a machine learning task, (2) introduce silhouette plots for assessing the reliability of clustering, (3) describe the k-Means clustering algorithm and how to tune it, (4) work through several interesting case-studies, including Divorce and Consequences on Young Adults, Pediatric Trauma, and Youth Development, (5) demonstrate hierarchical clustering, (6) show spectral clustering, and (7) present Gaussian mixture modeling.

1 Clustering as a machine learning task

As we mentioned earlier, clustering is a machine learning technique that bundles unlabeled cases into groups. The scatter plots we saw in previous chapters provide a simple illustration of the clustering process. Let’s start with a hot dogs example. Assume we don’t know much about the ingredients of frankfurter hot dogs and we look at the following graph.

In terms of calories and sodium, these hot dogs are clearly separated into three different clusters. Cluster 1 has hot dogs with low calories and medium sodium content; Cluster 2 has both calories and sodium at medium levels; Cluster 3 has both sodium and calories at high levels. We can make a bold guess about the ingredients used in the hot dogs in these three clusters. Cluster 1 could be mostly chicken meat, since it has low calories. The second cluster might be beef and the third one is likely to be pork, because beef hot dogs have considerably fewer calories and less salt than pork hot dogs. However, this is just guessing. Some hot dogs have a mixture of two or three types of meat. The real situation is somewhat similar to what we guessed, but with some random noise, especially in cluster 2.
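To make the idea of grouping unlabeled cases concrete, here is a minimal Python sketch that clusters a few (calories, sodium) pairs into three groups. The numbers are hypothetical values invented for illustration, not the hot dog data plotted in this chapter. The helper names (`standardize`, `farthest_point_init`, `kmeans`) are also illustrative assumptions. The sketch uses a bare-bones k-means with a simple deterministic initialization; the k-Means algorithm itself is discussed later in the chapter.

```python
import math

# Hypothetical (calories, sodium in mg) pairs, invented for illustration only;
# these are NOT the hot dog measurements shown in the chapter's plots.
hotdogs = [
    (100, 380), (105, 400), (110, 420), (95, 390),    # low calories, medium sodium
    (140, 400), (145, 420), (150, 410), (155, 430),   # medium calories, medium sodium
    (180, 560), (185, 580), (190, 600), (195, 570),   # high calories, high sodium
]

def standardize(points):
    """Rescale each feature to mean 0 and standard deviation 1,
    so that calories and sodium contribute on a comparable scale."""
    cols = list(zip(*points))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
           for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(p, means, sds)) for p in points]

def farthest_point_init(points, k):
    """Deterministic starting centers: begin with the first point, then
    repeatedly add the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers

def kmeans(points, k, max_iter=100):
    """Alternate between assigning points to the nearest center and
    recomputing each center as the mean of its assigned points."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:      # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:               # keep the old center if a cluster is empty
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centers

labels, centers = kmeans(standardize(hotdogs), k=3)
for (cal, sod), lab in zip(hotdogs, labels):
    print(f"calories={cal:3d}  sodium={sod:3d}  cluster={lab}")
```

The cluster numbers returned are arbitrary identifiers. As in the hot dog example, interpreting each group (for instance, guessing the type of meat) is a separate step that the algorithm does not perform.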

The following pair of plots shows the primary type of meat used for each hot dog, labeled by name (top) and color-coded (bottom).