# Chapter 10 Clustering

## 10.1 Overview

As part of exploratory data analysis, it is often helpful to see if there are
meaningful subgroups (or *clusters*) in the data; this grouping can be used
for many purposes, such as generating new questions or improving predictive analyses.
This chapter provides an introduction to clustering using the *K-means* algorithm,
including techniques to choose the number of clusters.

## 10.2 Chapter learning objectives

By the end of the chapter, students will be able to:

- Describe a case where clustering is appropriate, and what insight it might extract from the data.
- Explain the K-means clustering algorithm.
- Interpret the output of a K-means analysis.
- Identify when it is necessary to scale variables before clustering, and do this using R.
- Perform K-means clustering in R using
`kmeans`

. - Use the elbow method to choose the number of clusters for K-means.
- Visualize the output of K-means clustering in R using coloured scatter plots.
- Describe the advantages, limitations and assumptions of the K-means clustering algorithm.

## 10.3 Clustering

Clustering is a data analysis task involving separating a data set into subgroups of related data. For example, we might use clustering to separate a data set of documents into groups that correspond to topics, a data set of human genetic information into groups that correspond to ancestral subpopulations, or a data set of online customers into groups that correspond to purchasing behaviours. Once the data are separated we can, for example, use the subgroups to generate new questions about the data and follow up with a predictive modelling exercise. In this course, clustering will be used only for exploratory analysis, i.e., uncovering patterns in the data that we have.

Note that clustering is a fundamentally different kind of task than
classification or regression. Most notably, both classification and regression
are *supervised tasks* where there is a *predictive target* (a class label or
value), and we have examples of past data with labels/values that help us
predict those of future data. By contrast, clustering is an *unsupervised
task*, as we are trying to understand and examine the structure of data without
any labels to help us. This approach has both advantages and disadvantages.
Clustering requires no additional annotation or input on the data; for example,
it would be nearly impossible to annotate all the articles on Wikipedia with
human-made topic labels, but we can still cluster the articles without this
information to automatically find groupings corresponding to topics.

However, because there is no predictive target, it is not as easy to evaluate the “quality” of a clustering. With classification, we are able to use a test data set to assess prediction performance. In clustering, there is not a single good choice for evaluation. In this book, we will use visualization to ascertain the quality of a clustering, and leave rigorous evaluation for more advanced courses.

There are also so-called

semisupervisedtasks, where only some of the data come with labels / annotations, but the vast majority don’t. The goal is to try to uncover underlying structure in the data that allows one to guess the missing labels. This sort of task is very useful, for example, when one has an unlabelled data set that is too large to manually label, but one is willing to provide a few informative example labels as a “seed” to guess the labels for all the data.

**An illustrative example**

Here we will present an illustrative example using a simulated data set. Suppose we have data with two variables measuring customer loyalty and satisfaction, and we want to learn whether there are distinct “types” of customer. Understanding this might help us come up with better products or promotions to improve our business in a data-driven way.

```
## # A tibble: 19 x 2
## loyalty csat
## <dbl> <dbl>
## 1 7 1
## 2 7.5 1
## 3 8 2
## 4 7 2
## 5 8 3
## 6 1.5 1.75
## 7 1 3
## 8 0.5 4
## 9 2 4
## 10 7 6
## 11 6 6
## 12 7 7
## 13 6 7
## 14 5 7
## 15 9.5 8
## 16 7 8
## 17 8.3 9
## 18 4 8
## 19 2 3
```

Based on this visualization, we might suspect there are a few subtypes of customer,
selected from combinations of high/low satisfaction and high/low loyalty. How
do we find this grouping automatically, and how do we pick the number of subtypes
to look for? The
way to rigorously separate the data into groups is to use a clustering algorithm.
In this chapter, we will focus on the *K-means* algorithm, a widely-used
and often very effective clustering method, combined with the *elbow method* for
selecting the number of clusters. This procedure will separate the data into
the following groups denoted by colour:

What are the labels for these groups? Unfortunately, we don’t have any. K-means, like almost all clustering algorithms, just outputs meaningless “cluster labels” that are typically whole numbers: 1, 2, 3, etc. But in a simple case like this, where we can easily visualize the clusters on a scatter plot, we can give human-made labels to the groups using their positions on the plot:

- low loyalty and low satisfaction (blue cluster),
- high loyalty and low satisfaction (pink cluster),
- and high loyalty and high satisfaction (green cluster).

Once we have made these determinations, we can use them to inform our future business decisions, or to ask further questions about our data. For example, here we might notice based on our clustering that there aren’t any customers with high satisfaction but low loyalty, and generate new analyses or business strategies based on this information.

## 10.4 K-means

### 10.4.1 Measuring cluster quality

The K-means algorithm is a procedure that groups data into K clusters.
It starts with an initial clustering of the data, and then iteratively
improves it by making adjustments to the assignment of data
to clusters until it cannot improve any further. But how do we measure
the “quality” of a clustering, and what does it mean to improve it?
In K-means clustering, we measure the quality of a cluster by its
*within-cluster sum-of-squared-distances* (WSSD). Computing this involves two steps.
First, we find the cluster centers by computing the mean of each variable
over data points in the cluster. For example, suppose we have a
cluster containing 3 observations, and we are using two variables, \(x\) and \(y\), to cluster the data.
Then we would compute the \(x\) and \(y\) variables, \(\mu_x\) and \(\mu_y\), of the cluster center via

\[\mu_x = \frac{1}{3}(x_1+x_2+x_3) \quad \mu_y = \frac{1}{3}(y_1+y_2+y_3).\]

In the first cluster from the customer satisfaction/loyalty example, there
are 5 data points. These are shown with their cluster center
(`csat = 1.8`

and `loyalty = 7.5`

) highlighted below.

The second step in computing the WSSD is to add up the squared distance between each point in the cluster and the cluster center. We use the straight-line / Euclidean distance formula that we learned about in the classification chapter. In the 3-observation cluster example above, we would compute the WSSD \(S^2\) via

\[S^2 = \left((x_1 - \mu_x)^2 + (y_1 - \mu_y)^2\right) + \left((x_2 - \mu_x)^2 + (y_2 - \mu_y)^2\right) +\left((x_3 - \mu_x)^2 + (y_3 - \mu_y)^2\right).\]

These distances are denoted by lines for the first cluster of the customer satisfaction/loyalty data example below.

The larger the value of \(S^2\), the more spread-out the cluster is, since large \(S^2\) means that points are far away from the cluster center.
Note, however, that “large” is relative to *both* the scale of the variables for clustering *and* the number of points in the cluster; a
cluster where points are very close to the center might still have a large \(S^2\) if there are many data points in the cluster.

### 10.4.2 The clustering algorithm

The K-means algorithm is quite simple. We begin by picking K, and uniformly randomly assigning data to the K clusters.
Then K-means consists of two major steps that attempt to minimize the
sum of WSSDs over all the clusters, i.e. the *total WSSD*:

**Center update:**Compute the center of each cluster.**Label update:**Reassign each data point to the cluster with the nearest center.

**Center Update**

**Label Update**

Note that at this point we can terminate the algorithm, since none of the assignments changed in the fourth iteration; both the centers and labels will remain the same from this point onward.

Is K-means

guaranteedto stop at some point, or could it iterate forever? As it turns out, the answer is thankfully that K-means is guaranteed to stop aftersomenumber of iterations. For the interested reader, the logic for this has three steps: (1) both the label update and the center update decrease total WSSD in each iteration, (2) the total WSSD is always greater than or equal to 0, and (3) there are only a finite number of possible ways to assign the data to clusters. So at some point, the total WSSD must stop decreasing, which means none of the assignments are changing and the algorithm terminates.

### 10.4.3 Random restarts

K-means, unlike the classification and regression models we studied in previous chapters, can get “stuck” in a bad solution. For example, if we were unlucky and initialized K-means with the following labels:**Center Update**

**Label Update**

This looks like a relatively bad clustering of the data, but K-means cannot improve it. To solve this problem when clustering data using K-means, we should randomly re-initialize the labels a few times, run K-means for each initialization, and pick the clustering that has the lowest final total WSSD.

### 10.4.4 Choosing K

In order to cluster data using K-means, we also have to pick the number of clusters, K. But unlike in classification, we have no data labels and cannot perform cross-validation with some measure of model prediction error. Further, if K is chosen too small, then multiple clusters get grouped together; if K is too large, then clusters get subdivided. In both cases, we will potentially miss interesting structure in the data. For example, take a look below at the K-means clustering of our customer satisfaction and loyalty data for a number of clusters ranging from 1 to 9.

*diminishing amount*. If we plot the total WSSD versus the number of clusters, we see that the decrease in total WSSD levels off (or forms an “elbow shape”) when we reach roughly the right number of clusters.

## 10.5 Data pre-processing for K-means

Similar to K-nearest neighbours classification and regression, K-means
clustering uses straight-line distance to decide which points are similar to
each other. Therefore, the *scale* of each of the variables in the data
will influence which cluster data points end up being assigned to.
Variables that have a large scale will have a much larger
effect on deciding cluster assignment than variables with a small scale.
To address this problem, we typically standardize our data before clustering,
which ensures that each variable has a mean of 0 and standard deviation of 1.
The `scale`

function in R can be used to do this.
We show an example of how to use this function
below using the data in this chapter:

```
## # A tibble: 19 x 2
## loyalty[,1] csat[,1]
## <dbl> <dbl>
## 1 0.541 -1.40
## 2 0.720 -1.40
## 3 0.899 -1.03
## 4 0.541 -1.03
## 5 0.899 -0.659
## 6 -1.43 -1.12
## 7 -1.61 -0.659
## 8 -1.79 -0.288
## 9 -1.25 -0.288
## 10 0.541 0.454
## 11 0.183 0.454
## 12 0.541 0.826
## 13 0.183 0.826
## 14 -0.175 0.826
## 15 1.44 1.20
## 16 0.541 1.20
## 17 1.01 1.57
## 18 -0.533 1.20
## 19 -1.25 -0.659
```

## 10.6 K-means in R

To peform K-means clustering in R, we use the `kmeans`

function. It takes at
least two arguments: the data frame containing the data you wish to cluster,
and K, the number of clusters (here we choose K = 3). Note that since the K-means
algorithm uses a random initialization of assignments, we need to set the random
seed to make the clustering reproducible.

```
## K-means clustering with 3 clusters of sizes 9, 5, 5
##
## Cluster means:
## loyalty csat label
## 1 6.644444 7.333333 1.777778
## 2 7.500000 1.800000 3.000000
## 3 1.400000 3.150000 3.000000
##
## Clustering vector:
## [1] 2 2 2 2 2 3 3 3 3 1 1 1 1 1 1 1 1 1 3
##
## Within cluster sum of squares by cluster:
## [1] 31.35778 3.80000 5.15000
## (between_SS / total_SS = 85.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss"
## [7] "size" "iter" "ifault"
```

As you can see above, the clustering object returned by `kmeans`

has a lot of information
that can be used to visualize the clusters, pick K, and evaluate the total WSSD.
To obtain this information in a tidy format, we will call in help
from the `broom`

package. Let’s start by visualizing the clustering
as a coloured scatter plot. To do that
we use the `augment`

function, which takes in the model and the original data
frame, and returns a data frame with the data and the cluster assignments for
each point:

```
## # A tibble: 19 x 4
## loyalty csat label .cluster
## <dbl> <dbl> <chr> <fct>
## 1 7 1 3 2
## 2 7.5 1 3 2
## 3 8 2 3 2
## 4 7 2 3 2
## 5 8 3 3 2
## 6 1.5 1.75 3 3
## 7 1 3 3 3
## 8 0.5 4 3 3
## 9 2 4 3 3
## 10 7 6 2 1
## 11 6 6 2 1
## 12 7 7 2 1
## 13 6 7 2 1
## 14 5 7 1 1
## 15 9.5 8 2 1
## 16 7 8 2 1
## 17 8.3 9 2 1
## 18 4 8 1 1
## 19 2 3 3 3
```

Now that we have this information in a tidy data frame, we can make a visualization of the cluster assignments for each point:

```
cluster_plot <- ggplot(clustered_data,
aes(x = csat, y = loyalty, colour = .cluster),
size=2) +
geom_point() +
labs(x = 'Customer satisfaction', y = 'Loyalty', colour = 'Cluster')
cluster_plot
```

As mentioned above, we also need to select K by finding
where the “elbow” occurs in the plot of total WSSD versus number of clusters.
We can obtain the total WSSD (`tot.withinss`

) from our
clustering using `broom`

’s `glance`

function. For example:

```
## # A tibble: 1 x 4
## totss tot.withinss betweenss iter
## <dbl> <dbl> <dbl> <int>
## 1 280. 40.3 239. 2
```

To calculate the total WSSD for a variety of Ks, we will
create a data frame with a column named `k`

with rows containing
each value of K we want to run K-means with (here, 1 to 9).

```
## # A tibble: 9 x 1
## k
## <int>
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
```

Then we use `map`

to apply the `kmeans`

function to each
K. However, we need to use `map`

a little bit differently than we have
before. This is because we need to iterate over `k`

, which is the *second argument*
to the `kmeans`

function. In the past, we have used `map`

only to iterate over values of the
*first argument* of a function. Since that is the default, we could simply
write `map(data_frame, function_name)`

. This won’t work here; we need to
provide our data frame as the first argument to the `kmeans`

function.

The solution is to create something called an *anonymous function*.
An anonymous function is a function that has no name,
unlike other functions you’ve seen so far (`kmeans`

, `select`

, etc).
To do this we will write our map statement like this:

`map(marketing_clust_ks, function(k) kmeans(marketing_data, k))`

The anonymous function in the above call is `function(k) kmeans(marketing_data, k)`

.
This function takes a single argument (`k`

) and evaluates `kmeans(marketing_data, k)`

.
Since `k`

is the *first* (and only) argument to the function, we can use `map`

just
like we did before! The rest of the call above does just that – it passes each row of
`marketing_clust_ks`

to our anonymous function.

Below, we execute this `map`

call inside of a `mutate`

call on the
`marketing_clust_ks`

data frame and get a list column that contains a K-means
clustering object for each value of K we had:

```
marketing_clust_ks <- tibble(k = 1:9) %>%
mutate(marketing_clusts = map(k, function(ks) kmeans(marketing_data, ks)))
marketing_clust_ks
```

```
## # A tibble: 9 x 2
## k marketing_clusts
## <int> <list>
## 1 1 <kmeans>
## 2 2 <kmeans>
## 3 3 <kmeans>
## 4 4 <kmeans>
## 5 5 <kmeans>
## 6 6 <kmeans>
## 7 7 <kmeans>
## 8 8 <kmeans>
## 9 9 <kmeans>
```

Next, we use `map`

again to apply `glance`

to each of the K-means
clustering objects to get the clustering statistics (including WSSD). The output
of glance is a data frame, and so we get another list column. This results in a
complex data frame with 3 columns, one for K, one for the
K-means clustering objects, and one for the clustering statistics:

```
marketing_clust_ks <- tibble(k = 1:9) %>%
mutate(marketing_clusts = map(k, ~kmeans(marketing_data, .x)),
glanced = map(marketing_clusts, glance))
marketing_clust_ks
```

```
## # A tibble: 9 x 3
## k marketing_clusts glanced
## <int> <list> <list>
## 1 1 <kmeans> <tibble [1 × 4]>
## 2 2 <kmeans> <tibble [1 × 4]>
## 3 3 <kmeans> <tibble [1 × 4]>
## 4 4 <kmeans> <tibble [1 × 4]>
## 5 5 <kmeans> <tibble [1 × 4]>
## 6 6 <kmeans> <tibble [1 × 4]>
## 7 7 <kmeans> <tibble [1 × 4]>
## 8 8 <kmeans> <tibble [1 × 4]>
## 9 9 <kmeans> <tibble [1 × 4]>
```

Finally we extract the total WSSD from the `glanced`

column. Given that each
item in this column is a data frame, we will need to use the `unnest`

function
to unpack the data frames into simpler column data types.

```
## # A tibble: 9 x 6
## k marketing_clusts totss tot.withinss betweenss iter
## <int> <list> <dbl> <dbl> <dbl> <int>
## 1 1 <kmeans> 280. 280. -5.68e-14 1
## 2 2 <kmeans> 280. 138. 1.42e+ 2 1
## 3 3 <kmeans> 280. 40.3 2.39e+ 2 2
## 4 4 <kmeans> 280. 37.8 2.42e+ 2 2
## 5 5 <kmeans> 280. 15.2 2.64e+ 2 2
## 6 6 <kmeans> 280. 35.7 2.44e+ 2 2
## 7 7 <kmeans> 280. 12.0 2.68e+ 2 2
## 8 8 <kmeans> 280. 8.98 2.71e+ 2 11
## 9 9 <kmeans> 280. 8.70 2.71e+ 2 2
```

Now that we have `tot.withinss`

and `k`

as columns in a data frame, we can make a line plot
and search for the “elbow” to find which value of K to use.

```
elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
geom_point() +
geom_line() +
xlab("K") +
ylab("Total within-cluster sum of squares")+
scale_x_continuous(breaks = 1:9)
elbow_plot
```

It looks like 3 clusters is the right choice for this data.
But why is there a “bump” in the total WSSD plot here? Shouldn’t total WSSD always
decrease as we add more clusters? Technically yes, but remember: K-means can
get “stuck” in a bad solution. Unfortunately, for K = 6 we had an unlucky initialization
and found a bad clustering! We can help prevent finding a bad clustering by trying a
few different random initializations via the `nstart`

argument (here we use 10 restarts).

```
marketing_clust_ks <- tibble(k = 1:9) %>%
mutate(marketing_clusts = map(k, ~kmeans(marketing_data, nstart = 10, .x)),
glanced = map(marketing_clusts, glance))
clustering_statistics <- marketing_clust_ks %>%
unnest(glanced)
elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
geom_point() +
geom_line() +
xlab("K") +
ylab("Total within-cluster sum of squares")+
scale_x_continuous(breaks = 1:9)
elbow_plot
```

## 10.7 Additional resources

For more about clustering and K-means, refer to pages 385-390 and 404-405 of Introduction to Statistical Learning with Applications in R (2013). We have also linked a helpful companion video below:

### References

Fripp, Geoff. 2020. “Using Cluster Analysis for Market Segmentation.” 2020. http://www.segmentationstudyguide.com/using-cluster-analysis-for-market-segmentation.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. *An Introduction to Statistical Learning*. Vol. 112. Springer.