# Chapter 10 Clustering

## 10.1 Overview

As part of exploratory data analysis, it is often helpful to see if there are meaningful subgroups (or clusters) in the data; this grouping can be used for many purposes, such as generating new questions or improving predictive analyses. This chapter provides an introduction to clustering using the K-means algorithm, including techniques to choose the number of clusters.

## 10.2 Chapter learning objectives

By the end of the chapter, students will be able to:

• Describe a case where clustering is appropriate, and what insight it might extract from the data.
• Explain the K-means clustering algorithm.
• Interpret the output of a K-means analysis.
• Identify when it is necessary to scale variables before clustering, and do this using R.
• Perform K-means clustering in R using kmeans.
• Use the elbow method to choose the number of clusters for K-means.
• Visualize the output of K-means clustering in R using coloured scatter plots.
• Describe the advantages, limitations and assumptions of the K-means clustering algorithm.

## 10.3 Clustering

Clustering is a data analysis task involving separating a data set into subgroups of related data. For example, we might use clustering to separate a data set of documents into groups that correspond to topics, a data set of human genetic information into groups that correspond to ancestral subpopulations, or a data set of online customers into groups that correspond to purchasing behaviours. Once the data are separated we can, for example, use the subgroups to generate new questions about the data and follow up with a predictive modelling exercise. In this course, clustering will be used only for exploratory analysis, i.e., uncovering patterns in the data that we have.

Note that clustering is a fundamentally different kind of task than classification or regression. Most notably, both classification and regression are supervised tasks where there is a predictive target (a class label or value), and we have examples of past data with labels/values that help us predict those of future data. By contrast, clustering is an unsupervised task, as we are trying to understand and examine the structure of data without any labels to help us. This approach has both advantages and disadvantages. Clustering requires no additional annotation or input on the data; for example, it would be nearly impossible to annotate all the articles on Wikipedia with human-made topic labels, but we can still cluster the articles without this information to automatically find groupings corresponding to topics.

However, because there is no predictive target, it is not as easy to evaluate the “quality” of a clustering. With classification, we are able to use a test data set to assess prediction performance. In clustering, there is not a single good choice for evaluation. In this book, we will use visualization to ascertain the quality of a clustering, and leave rigorous evaluation for more advanced courses.

There are also so-called semisupervised tasks, where only some of the data come with labels / annotations, but the vast majority don’t. The goal is to try to uncover underlying structure in the data that allows one to guess the missing labels. This sort of task is very useful, for example, when one has an unlabelled data set that is too large to manually label, but one is willing to provide a few informative example labels as a “seed” to guess the labels for all the data.

An illustrative example

Here we will present an illustrative example using a simulated data set. Suppose we have data with two variables measuring customer loyalty and satisfaction, and we want to learn whether there are distinct “types” of customer. Understanding this might help us come up with better products or promotions to improve our business in a data-driven way.

marketing_data
## # A tibble: 19 x 2
##    loyalty  csat
##      <dbl> <dbl>
##  1     7    1
##  2     7.5  1
##  3     8    2
##  4     7    2
##  5     8    3
##  6     1.5  1.75
##  7     1    3
##  8     0.5  4
##  9     2    4
## 10     7    6
## 11     6    6
## 12     7    7
## 13     6    7
## 14     5    7
## 15     9.5  8
## 16     7    8
## 17     8.3  9
## 18     4    8
## 19     2    3

Based on this visualization, we might suspect there are a few subtypes of customer, selected from combinations of high/low satisfaction and high/low loyalty. How do we find this grouping automatically, and how do we pick the number of subtypes to look for? The way to rigorously separate the data into groups is to use a clustering algorithm. In this chapter, we will focus on the K-means algorithm, a widely-used and often very effective clustering method, combined with the elbow method for selecting the number of clusters. This procedure will separate the data into the following groups denoted by colour:

What are the labels for these groups? Unfortunately, we don’t have any. K-means, like almost all clustering algorithms, just outputs meaningless “cluster labels” that are typically whole numbers: 1, 2, 3, etc. But in a simple case like this, where we can easily visualize the clusters on a scatter plot, we can give human-made labels to the groups using their positions on the plot:

• low loyalty and low satisfaction (blue cluster),
• high loyalty and low satisfaction (pink cluster),
• and high loyalty and high satisfaction (green cluster).

Once we have made these determinations, we can use them to inform our future business decisions, or to ask further questions about our data. For example, here we might notice based on our clustering that there aren’t any customers with high satisfaction but low loyalty, and generate new analyses or business strategies based on this information.

## 10.4 K-means

### 10.4.1 Measuring cluster quality

The K-means algorithm is a procedure that groups data into K clusters. It starts with an initial clustering of the data, and then iteratively improves it by making adjustments to the assignment of data to clusters until it cannot improve any further. But how do we measure the “quality” of a clustering, and what does it mean to improve it? In K-means clustering, we measure the quality of a cluster by its within-cluster sum-of-squared-distances (WSSD). Computing this involves two steps. First, we find the cluster centers by computing the mean of each variable over data points in the cluster. For example, suppose we have a cluster containing 3 observations, and we are using two variables, $$x$$ and $$y$$, to cluster the data. Then we would compute the $$x$$ and $$y$$ variables, $$\mu_x$$ and $$\mu_y$$, of the cluster center via

$\mu_x = \frac{1}{3}(x_1+x_2+x_3) \quad \mu_y = \frac{1}{3}(y_1+y_2+y_3).$

In the first cluster from the customer satisfaction/loyalty example, there are 5 data points. These are shown with their cluster center (csat = 1.8 and loyalty = 7.5) highlighted below.

The second step in computing the WSSD is to add up the squared distance between each point in the cluster and the cluster center. We use the straight-line / Euclidean distance formula that we learned about in the classification chapter. In the 3-observation cluster example above, we would compute the WSSD $$S^2$$ via

$S^2 = \left((x_1 - \mu_x)^2 + (y_1 - \mu_y)^2\right) + \left((x_2 - \mu_x)^2 + (y_2 - \mu_y)^2\right) +\left((x_3 - \mu_x)^2 + (y_3 - \mu_y)^2\right).$

These distances are denoted by lines for the first cluster of the customer satisfaction/loyalty data example below.

The larger the value of $$S^2$$, the more spread-out the cluster is, since large $$S^2$$ means that points are far away from the cluster center. Note, however, that “large” is relative to both the scale of the variables for clustering and the number of points in the cluster; a cluster where points are very close to the center might still have a large $$S^2$$ if there are many data points in the cluster.

### 10.4.2 The clustering algorithm

The K-means algorithm is quite simple. We begin by picking K, and uniformly randomly assigning data to the K clusters. Then K-means consists of two major steps that attempt to minimize the sum of WSSDs over all the clusters, i.e. the total WSSD:

1. Center update: Compute the center of each cluster.
2. Label update: Reassign each data point to the cluster with the nearest center.
These two steps are repeated until the cluster assignments no longer change. For example, in the customer data example from earlier, our initialization might look like this:
And the first four iterations of K-means would look like (each row corresponds to an iteration, where the left column depicts the center update, and the right column depicts the reassignment of data to clusters):
Center Update                            Label Update

Note that at this point we can terminate the algorithm, since none of the assignments changed in the fourth iteration; both the centers and labels will remain the same from this point onward.

Is K-means guaranteed to stop at some point, or could it iterate forever? As it turns out, the answer is thankfully that K-means is guaranteed to stop after some number of iterations. For the interested reader, the logic for this has three steps: (1) both the label update and the center update decrease total WSSD in each iteration, (2) the total WSSD is always greater than or equal to 0, and (3) there are only a finite number of possible ways to assign the data to clusters. So at some point, the total WSSD must stop decreasing, which means none of the assignments are changing and the algorithm terminates.

### 10.4.3 Random restarts

K-means, unlike the classification and regression models we studied in previous chapters, can get “stuck” in a bad solution. For example, if we were unlucky and initialized K-means with the following labels:
Then the iterations of K-means would look like:
Center Update                            Label Update

This looks like a relatively bad clustering of the data, but K-means cannot improve it. To solve this problem when clustering data using K-means, we should randomly re-initialize the labels a few times, run K-means for each initialization, and pick the clustering that has the lowest final total WSSD.

### 10.4.4 Choosing K

In order to cluster data using K-means, we also have to pick the number of clusters, K. But unlike in classification, we have no data labels and cannot perform cross-validation with some measure of model prediction error. Further, if K is chosen too small, then multiple clusters get grouped together; if K is too large, then clusters get subdivided. In both cases, we will potentially miss interesting structure in the data. For example, take a look below at the K-means clustering of our customer satisfaction and loyalty data for a number of clusters ranging from 1 to 9.

If we set K less than 3, then the clustering merges separate groups of data; this causes a large total WSSD, since the cluster center (denoted by an “x”) is not close to any of the data in the cluster. On the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still decrease the total WSSD, but by only a diminishing amount. If we plot the total WSSD versus the number of clusters, we see that the decrease in total WSSD levels off (or forms an “elbow shape”) when we reach roughly the right number of clusters.

## 10.5 Data pre-processing for K-means

Similar to K-nearest neighbours classification and regression, K-means clustering uses straight-line distance to decide which points are similar to each other. Therefore, the scale of each of the variables in the data will influence which cluster data points end up being assigned to. Variables that have a large scale will have a much larger effect on deciding cluster assignment than variables with a small scale. To address this problem, we typically standardize our data before clustering, which ensures that each variable has a mean of 0 and standard deviation of 1. The scale function in R can be used to do this. We show an example of how to use this function below using the data in this chapter:

scaled_data <- map_df(unscaled_data, scale)
scaled_data
## # A tibble: 19 x 2
##    loyalty[,1] csat[,1]
##          <dbl>    <dbl>
##  1       0.541   -1.40
##  2       0.720   -1.40
##  3       0.899   -1.03
##  4       0.541   -1.03
##  5       0.899   -0.659
##  6      -1.43    -1.12
##  7      -1.61    -0.659
##  8      -1.79    -0.288
##  9      -1.25    -0.288
## 10       0.541    0.454
## 11       0.183    0.454
## 12       0.541    0.826
## 13       0.183    0.826
## 14      -0.175    0.826
## 15       1.44     1.20
## 16       0.541    1.20
## 17       1.01     1.57
## 18      -0.533    1.20
## 19      -1.25    -0.659

## 10.6 K-means in R

To peform K-means clustering in R, we use the kmeans function. It takes at least two arguments: the data frame containing the data you wish to cluster, and K, the number of clusters (here we choose K = 3). Note that since the K-means algorithm uses a random initialization of assignments, we need to set the random seed to make the clustering reproducible.

set.seed(1234)
marketing_clust <- kmeans(marketing_data, centers = 3)
marketing_clust
## K-means clustering with 3 clusters of sizes 9, 5, 5
##
## Cluster means:
##    loyalty     csat    label
## 1 6.644444 7.333333 1.777778
## 2 7.500000 1.800000 3.000000
## 3 1.400000 3.150000 3.000000
##
## Clustering vector:
##  [1] 2 2 2 2 2 3 3 3 3 1 1 1 1 1 1 1 1 1 3
##
## Within cluster sum of squares by cluster:
## [1] 31.35778  3.80000  5.15000
##  (between_SS / total_SS =  85.6 %)
##
## Available components:
##
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"
## [7] "size"         "iter"         "ifault"

As you can see above, the clustering object returned by kmeans has a lot of information that can be used to visualize the clusters, pick K, and evaluate the total WSSD. To obtain this information in a tidy format, we will call in help from the broom package. Let’s start by visualizing the clustering as a coloured scatter plot. To do that we use the augment function, which takes in the model and the original data frame, and returns a data frame with the data and the cluster assignments for each point:

clustered_data <- augment(marketing_clust, marketing_data)
clustered_data
## # A tibble: 19 x 4
##    loyalty  csat label .cluster
##      <dbl> <dbl> <chr> <fct>
##  1     7    1    3     2
##  2     7.5  1    3     2
##  3     8    2    3     2
##  4     7    2    3     2
##  5     8    3    3     2
##  6     1.5  1.75 3     3
##  7     1    3    3     3
##  8     0.5  4    3     3
##  9     2    4    3     3
## 10     7    6    2     1
## 11     6    6    2     1
## 12     7    7    2     1
## 13     6    7    2     1
## 14     5    7    1     1
## 15     9.5  8    2     1
## 16     7    8    2     1
## 17     8.3  9    2     1
## 18     4    8    1     1
## 19     2    3    3     3

Now that we have this information in a tidy data frame, we can make a visualization of the cluster assignments for each point:

cluster_plot <- ggplot(clustered_data,
aes(x = csat, y = loyalty, colour = .cluster),
size=2) +
geom_point() +
labs(x = 'Customer satisfaction', y = 'Loyalty', colour = 'Cluster')
cluster_plot

As mentioned above, we also need to select K by finding where the “elbow” occurs in the plot of total WSSD versus number of clusters. We can obtain the total WSSD (tot.withinss) from our clustering using broom’s glance function. For example:

glance(marketing_clust)
## # A tibble: 1 x 4
##   totss tot.withinss betweenss  iter
##   <dbl>        <dbl>     <dbl> <int>
## 1  280.         40.3      239.     2

To calculate the total WSSD for a variety of Ks, we will create a data frame with a column named k with rows containing each value of K we want to run K-means with (here, 1 to 9).

marketing_clust_ks <- tibble(k = 1:9)
marketing_clust_ks
## # A tibble: 9 x 1
##       k
##   <int>
## 1     1
## 2     2
## 3     3
## 4     4
## 5     5
## 6     6
## 7     7
## 8     8
## 9     9

Then we use map to apply the kmeans function to each K. However, we need to use map a little bit differently than we have before. This is because we need to iterate over k, which is the second argument to the kmeans function. In the past, we have used map only to iterate over values of the first argument of a function. Since that is the default, we could simply write map(data_frame, function_name). This won’t work here; we need to provide our data frame as the first argument to the kmeans function.

The solution is to create something called an anonymous function. An anonymous function is a function that has no name, unlike other functions you’ve seen so far (kmeans, select, etc). To do this we will write our map statement like this:

map(marketing_clust_ks, function(k) kmeans(marketing_data, k))

The anonymous function in the above call is function(k) kmeans(marketing_data, k). This function takes a single argument (k) and evaluates kmeans(marketing_data, k). Since k is the first (and only) argument to the function, we can use map just like we did before! The rest of the call above does just that – it passes each row of marketing_clust_ks to our anonymous function.

Below, we execute this map call inside of a mutate call on the marketing_clust_ks data frame and get a list column that contains a K-means clustering object for each value of K we had:

marketing_clust_ks <- tibble(k = 1:9) %>%
mutate(marketing_clusts = map(k, function(ks) kmeans(marketing_data, ks)))
marketing_clust_ks
## # A tibble: 9 x 2
##       k marketing_clusts
##   <int> <list>
## 1     1 <kmeans>
## 2     2 <kmeans>
## 3     3 <kmeans>
## 4     4 <kmeans>
## 5     5 <kmeans>
## 6     6 <kmeans>
## 7     7 <kmeans>
## 8     8 <kmeans>
## 9     9 <kmeans>

Next, we use map again to apply glance to each of the K-means clustering objects to get the clustering statistics (including WSSD). The output of glance is a data frame, and so we get another list column. This results in a complex data frame with 3 columns, one for K, one for the K-means clustering objects, and one for the clustering statistics:

marketing_clust_ks <- tibble(k = 1:9) %>%
mutate(marketing_clusts = map(k, ~kmeans(marketing_data, .x)),
glanced = map(marketing_clusts, glance))
marketing_clust_ks
## # A tibble: 9 x 3
##       k marketing_clusts glanced
##   <int> <list>           <list>
## 1     1 <kmeans>         <tibble [1 × 4]>
## 2     2 <kmeans>         <tibble [1 × 4]>
## 3     3 <kmeans>         <tibble [1 × 4]>
## 4     4 <kmeans>         <tibble [1 × 4]>
## 5     5 <kmeans>         <tibble [1 × 4]>
## 6     6 <kmeans>         <tibble [1 × 4]>
## 7     7 <kmeans>         <tibble [1 × 4]>
## 8     8 <kmeans>         <tibble [1 × 4]>
## 9     9 <kmeans>         <tibble [1 × 4]>

Finally we extract the total WSSD from the glanced column. Given that each item in this column is a data frame, we will need to use the unnest function to unpack the data frames into simpler column data types.

clustering_statistics <- marketing_clust_ks %>%
unnest(glanced)

clustering_statistics
## # A tibble: 9 x 6
##       k marketing_clusts totss tot.withinss betweenss  iter
##   <int> <list>           <dbl>        <dbl>     <dbl> <int>
## 1     1 <kmeans>          280.       280.   -5.68e-14     1
## 2     2 <kmeans>          280.       138.    1.42e+ 2     1
## 3     3 <kmeans>          280.        40.3   2.39e+ 2     2
## 4     4 <kmeans>          280.        37.8   2.42e+ 2     2
## 5     5 <kmeans>          280.        15.2   2.64e+ 2     2
## 6     6 <kmeans>          280.        35.7   2.44e+ 2     2
## 7     7 <kmeans>          280.        12.0   2.68e+ 2     2
## 8     8 <kmeans>          280.         8.98  2.71e+ 2    11
## 9     9 <kmeans>          280.         8.70  2.71e+ 2     2

Now that we have tot.withinss and k as columns in a data frame, we can make a line plot and search for the “elbow” to find which value of K to use.

elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
geom_point() +
geom_line() +
xlab("K") +
ylab("Total within-cluster sum of squares")+
scale_x_continuous(breaks = 1:9)
elbow_plot

It looks like 3 clusters is the right choice for this data. But why is there a “bump” in the total WSSD plot here? Shouldn’t total WSSD always decrease as we add more clusters? Technically yes, but remember: K-means can get “stuck” in a bad solution. Unfortunately, for K = 6 we had an unlucky initialization and found a bad clustering! We can help prevent finding a bad clustering by trying a few different random initializations via the nstart argument (here we use 10 restarts).

marketing_clust_ks <- tibble(k = 1:9) %>%
mutate(marketing_clusts = map(k, ~kmeans(marketing_data, nstart = 10, .x)),
glanced = map(marketing_clusts, glance))

clustering_statistics <- marketing_clust_ks %>%
unnest(glanced)

elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
geom_point() +
geom_line() +
xlab("K") +
ylab("Total within-cluster sum of squares")+
scale_x_continuous(breaks = 1:9)
elbow_plot