Chapter 6 Classification I: training & predicting

6.1 Overview

Up until this point, we have focused solely on descriptive and exploratory questions about data. This chapter and the next together serve as our first foray into answering predictive questions about data. In particular, we will focus on the problem of classification, i.e., using one or more quantitative variables to predict the value of another, categorical variable. This chapter will cover the basics of classification, how to preprocess data to make it suitable for use in a classifier, and how to use our observed data to make predictions. The next will focus on how to evaluate how accurate the predictions from our classifier are, as well as how to improve our classifier (where possible) to maximize its accuracy.

6.2 Chapter learning objectives

  • Recognize situations where a classifier would be appropriate for making predictions
  • Describe what a training data set is and how it is used in classification
  • Interpret the output of a classifier
  • Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two explanatory variables/predictors
  • Explain the K-nearest neighbour classification algorithm
  • Perform K-nearest neighbour classification in R using tidymodels
  • Explain why one should center, scale, and balance data in predictive modelling
  • Preprocess data to center, scale, and balance a dataset using a recipe
  • Combine preprocessing and model training using a tidymodels workflow

6.3 The classification problem

In many situations, we want to make predictions based on the current situation as well as past experiences. For instance, a doctor may want to diagnose a patient as either diseased or healthy based on their symptoms and the doctor’s past experience with patients; an email provider might want to tag a given email as “spam” or “not spam” depending on past email text data; or an online store may want to predict whether an order is fraudulent or not.

These tasks are all examples of classification, i.e., predicting a categorical class (sometimes called a label) for an observation given its other quantitative variables (sometimes called features). Generally, a classifier assigns an observation (e.g. a new patient) to a class (e.g. diseased or healthy) on the basis of how similar it is to other observations for which we know the class (e.g. previous patients with known diseases and symptoms). These observations with known classes that we use as a basis for prediction are called a training set. We call them a “training set” because we use these observations to train, or teach, our classifier so that we can use it to make predictions on new data that we have not seen previously.

There are many possible classification algorithms that we could use to predict a categorical class/label for an observation. In addition, there are many variations on the basic classification problem, e.g., binary classification, where only two classes are involved (e.g., diseased or healthy patient), or multiclass classification, which involves assigning an object to one of several classes (e.g., private, public, or not-for-profit organization). Here we will focus on the simple, widely used K-nearest neighbours algorithm for the binary classification problem. Other examples you may encounter in future courses include decision trees, support vector machines (SVMs), logistic regression, and neural networks.

6.4 Exploring a labelled data set

In this chapter and the next, we will study a data set of digitized breast cancer image features, created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at the University of Wisconsin, Madison. Each row in the data set represents an image of a tumour sample, including the diagnosis (benign or malignant) and several other measurements (e.g., nucleus texture, perimeter, area, etc.). Diagnosis for each image was conducted by physicians.

As with all data analyses, we first need to formulate a precise question that we want to answer. Here, the question is predictive: can we use the tumour image measurements available to us to predict whether a future tumour image (with unknown diagnosis) shows a benign or malignant tumour? Answering this question is important because traditional, non-data-driven methods for tumour diagnosis are quite subjective and dependent upon how skilled and experienced the diagnosing physician is. Furthermore, benign tumours are not normally dangerous; the cells stay in the same place and the tumour stops growing before it gets very large. By contrast, in malignant tumours, the cells invade the surrounding tissue and spread into nearby organs where they can cause serious damage. Thus, it is important to quickly and accurately diagnose the tumour type to guide patient treatment.

Loading the data

Our first step is to load, wrangle, and explore the data using visualizations in order to better understand the data we are working with. We start by loading the necessary packages for our analysis. Below you’ll see (in addition to the usual tidyverse) a new package: forcats. The forcats package enables us to easily manipulate factors in R; factors are a special categorical type of variable in R that are often used for class label data.

library(tidyverse)
library(forcats)

In this case, the file containing the breast cancer data set is a simple .csv file with headers. We’ll use the read_csv function with no additional arguments, and then inspect its contents:

cancer <- read_csv("data/wdbc.csv")
cancer
## # A tibble: 569 x 12
##        ID Class Radius Texture Perimeter   Area Smoothness Compactness Concavity Concave_Points
##     <dbl> <chr>  <dbl>   <dbl>     <dbl>  <dbl>      <dbl>       <dbl>     <dbl>          <dbl>
##  1 8.42e5 M      1.10   -2.07     1.27    0.984      1.57       3.28      2.65            2.53 
##  2 8.43e5 M      1.83   -0.353    1.68    1.91      -0.826     -0.487    -0.0238          0.548
##  3 8.43e7 M      1.58    0.456    1.57    1.56       0.941      1.05      1.36            2.04 
##  4 8.43e7 M     -0.768   0.254   -0.592  -0.764      3.28       3.40      1.91            1.45 
##  5 8.44e7 M      1.75   -1.15     1.78    1.82       0.280      0.539     1.37            1.43 
##  6 8.44e5 M     -0.476  -0.835   -0.387  -0.505      2.24       1.24      0.866           0.824
##  7 8.44e5 M      1.17    0.161    1.14    1.09      -0.123      0.0882    0.300           0.646
##  8 8.45e7 M     -0.118   0.358   -0.0728 -0.219      1.60       1.14      0.0610          0.282
##  9 8.45e5 M     -0.320   0.588   -0.184  -0.384      2.20       1.68      1.22            1.15 
## 10 8.45e7 M     -0.473   1.10    -0.329  -0.509      1.58       2.56      1.74            0.941
## # … with 559 more rows, and 2 more variables: Symmetry <dbl>, Fractal_Dimension <dbl>

Variable descriptions

Breast tumours can be diagnosed by performing a biopsy, a process where tissue is removed from the body and examined for the presence of disease. Traditionally these procedures were quite invasive; modern methods such as fine needle aspiration, used to collect the present data set, extract only a small amount of tissue and are less invasive. Based on a digital image of each breast tissue sample collected for this data set, 10 different variables were measured for each cell nucleus in the image (3-12 below), and then the mean for each variable across the nuclei was recorded. As part of the data preparation, these values have been scaled; we will discuss what this means and why we do it later in this chapter. Each image was additionally given a unique ID and a diagnosis (benign or malignant) by a physician. Therefore, the total set of variables per image in this data set is:

  1. ID number
  2. Class: the diagnosis of Malignant or Benign
  3. Radius: the mean of distances from center to points on the perimeter
  4. Texture: the standard deviation of gray-scale values
  5. Perimeter: the length of the surrounding contour
  6. Area: the area inside the contour
  7. Smoothness: the local variation in radius lengths
  8. Compactness: the ratio of squared perimeter and area
  9. Concavity: severity of concave portions of the contour
  10. Concave Points: the number of concave portions of the contour
  11. Symmetry
  12. Fractal Dimension

Figure 6.1: A malignant breast fine needle aspiration image. Source: https://pubsonline.informs.org/doi/abs/10.1287/opre.43.4.570

Below we use glimpse to preview the data frame. This function can make it easier to inspect the data when we have a lot of columns:

glimpse(cancer)
## Rows: 569
## Columns: 12
## $ ID                <dbl> 842302, 842517, 84300903, 84348301, 84358402, 843786, 844359, 84458202…
## $ Class             <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", …
## $ Radius            <dbl> 1.0960995, 1.8282120, 1.5784992, -0.7682333, 1.7487579, -0.4759559, 1.…
## $ Texture           <dbl> -2.0715123, -0.3533215, 0.4557859, 0.2535091, -1.1508038, -0.8346009, …
## $ Perimeter         <dbl> 1.26881726, 1.68447255, 1.56512598, -0.59216612, 1.77501133, -0.386807…
## $ Area              <dbl> 0.98350952, 1.90703027, 1.55751319, -0.76379174, 1.82462380, -0.505205…
## $ Smoothness        <dbl> 1.56708746, -0.82623545, 0.94138212, 3.28066684, 0.28012535, 2.2354545…
## $ Compactness       <dbl> 3.28062806, -0.48664348, 1.05199990, 3.39991742, 0.53886631, 1.2432415…
## $ Concavity         <dbl> 2.65054179, -0.02382489, 1.36227979, 1.91421287, 1.36980615, 0.8655400…
## $ Concave_Points    <dbl> 2.53024886, 0.54766227, 2.03543978, 1.45043113, 1.42723695, 0.82393067…
## $ Symmetry          <dbl> 2.215565542, 0.001391139, 0.938858720, 2.864862154, -0.009552062, 1.00…
## $ Fractal_Dimension <dbl> 2.25376381, -0.86788881, -0.39765801, 4.90660199, -0.56195552, 1.88834…

We can see from the summary of the data above that Class is of type character (denoted by <chr>). Since we are going to be working with Class as a categorical statistical variable, we will convert it to factor using the function as_factor.

cancer <- cancer %>%
  mutate(Class = as_factor(Class))
glimpse(cancer)
## Rows: 569
## Columns: 12
## $ ID                <dbl> 842302, 842517, 84300903, 84348301, 84358402, 843786, 844359, 84458202…
## $ Class             <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, B, B, B, M, M…
## $ Radius            <dbl> 1.0960995, 1.8282120, 1.5784992, -0.7682333, 1.7487579, -0.4759559, 1.…
## $ Texture           <dbl> -2.0715123, -0.3533215, 0.4557859, 0.2535091, -1.1508038, -0.8346009, …
## $ Perimeter         <dbl> 1.26881726, 1.68447255, 1.56512598, -0.59216612, 1.77501133, -0.386807…
## $ Area              <dbl> 0.98350952, 1.90703027, 1.55751319, -0.76379174, 1.82462380, -0.505205…
## $ Smoothness        <dbl> 1.56708746, -0.82623545, 0.94138212, 3.28066684, 0.28012535, 2.2354545…
## $ Compactness       <dbl> 3.28062806, -0.48664348, 1.05199990, 3.39991742, 0.53886631, 1.2432415…
## $ Concavity         <dbl> 2.65054179, -0.02382489, 1.36227979, 1.91421287, 1.36980615, 0.8655400…
## $ Concave_Points    <dbl> 2.53024886, 0.54766227, 2.03543978, 1.45043113, 1.42723695, 0.82393067…
## $ Symmetry          <dbl> 2.215565542, 0.001391139, 0.938858720, 2.864862154, -0.009552062, 1.00…
## $ Fractal_Dimension <dbl> 2.25376381, -0.86788881, -0.39765801, 4.90660199, -0.56195552, 1.88834…

Factors have what are called “levels”, which you can think of as categories. We can ask for the levels from the Class column by using the levels function. This function should return the name of each category in that column. Given that we only have 2 different values in our Class column (B and M), we only expect to get two names back. Note that the levels function requires a vector argument, while the select function outputs a data frame; so we use the pull function, which converts a single column of a data frame into a vector.

cancer %>%
  select(Class) %>%
  pull() %>% # turns a data frame into a vector
  levels()
## [1] "M" "B"

Exploring the data

Before we start doing any modelling, let’s explore our data set. Below we use the group_by + summarize code pattern we used before to see that we have 357 (63%) benign and 212 (37%) malignant tumour observations.

num_obs <- nrow(cancer)
cancer %>%
  group_by(Class) %>%
  summarize(
    n = n(),
    percentage = n() / num_obs * 100
  )
## # A tibble: 2 x 3
##   Class     n percentage
##   <fct> <int>      <dbl>
## 1 M       212       37.3
## 2 B       357       62.7

Next, let’s draw a scatter plot to visualize the relationship between the perimeter and concavity variables. Rather than use ggplot's default palette, we define our own here (cbPalette) and pass it as the values argument to the scale_color_manual function. We also make the category labels (“B” and “M”) more readable by changing them to “Benign” and “Malignant” using the labels argument.

# colour palette
cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#999999")

perim_concav <- cancer %>%
  ggplot(aes(x = Perimeter, y = Concavity, color = Class)) +
  geom_point(alpha = 0.5) +
  labs(color = "Diagnosis") +
  scale_color_manual(labels = c("Malignant", "Benign"), values = cbPalette)
perim_concav

Figure 6.2: Scatterplot of concavity versus perimeter coloured by diagnosis label

In this visualization, we can see that malignant observations typically fall in the upper right-hand corner of the plot area. By contrast, benign observations typically fall in the lower left-hand corner of the plot. Suppose we obtain a new observation not in the current data set that has all the variables measured except the label (i.e., an image without the physician’s diagnosis for the tumour class). We could compute the perimeter and concavity values, resulting in values of, say, 1 and 1. Could we use this information to classify that observation as benign or malignant? What about a new observation with perimeter value of -1 and concavity value of -0.5? What about 0 and 1? It seems like the prediction of an unobserved label might be possible, based on our visualization. In order to actually do this computationally in practice, we will need a classification algorithm; here we will use the K-nearest neighbour classification algorithm.

6.5 Classification with K-nearest neighbours

To predict the label of a new observation, i.e., classify it as either benign or malignant, the K-nearest neighbour classifier generally finds the \(K\) “nearest” or “most similar” observations in our training set, and then uses their diagnoses to make a prediction for the new observation’s diagnosis. To illustrate this concept, we will walk through an example. Suppose we have a new observation, with perimeter of 2 and concavity of 4 (labelled in red on the scatterplot), whose diagnosis “Class” is unknown.

Figure 6.3: Scatterplot of concavity versus perimeter with new observation labelled in red

We see that the nearest point to this new observation is malignant and located at the coordinates (2.1, 3.6). The idea here is that if a point is close to another in the scatterplot, then the perimeter and concavity values are similar, and so we may expect that they would have the same diagnosis.

Figure 6.4: Scatterplot of concavity versus perimeter where the nearest neighbour to the new observation is malignant

Suppose we have another new observation with perimeter 0.2 and concavity of 3.3. Looking at the scatterplot below, how would you classify this red observation? The nearest neighbour to this new point is a benign observation at (0.2, 2.7). Does this seem like the right prediction to make? Probably not, if you consider the other nearby points…

Figure 6.5: Scatterplot of concavity versus perimeter where the nearest neighbour to this new observation is benign

So instead of just using the one nearest neighbour, we can consider several neighbouring points, say \(K = 3\), that are closest to the new red observation to predict its diagnosis class. Among those 3 closest points, we use the majority class as our prediction for the new observation. In this case, we see that the diagnoses of 2 of the 3 nearest neighbours to our new observation are malignant. Therefore we take majority vote and classify our new red observation as malignant.

Figure 6.6: Scatterplot of concavity versus perimeter with three nearest neighbours

Here we chose the \(K=3\) nearest observations, but there is nothing special about \(K=3\). We could have used \(K=4, 5\) or more (though we may want to choose an odd number to avoid ties). We will discuss more about choosing \(K\) in the next chapter.

Distance between points

How do we decide which points are the \(K\) “nearest” to our new observation? We can compute the distance between any pair of points using the following formula:

\[\mathrm{Distance} = \sqrt{(x_a -x_b)^2 + (y_a - y_b)^2}\]

This formula – sometimes called the Euclidean distance – is simply the straight line distance between two points on the x-y plane with coordinates \((x_a, y_a)\) and \((x_b, y_b)\).
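
For example, we can check this formula quickly in R. The two points below are made up purely for illustration:

# two hypothetical points: (x_a, y_a) = (2, 4) and (x_b, y_b) = (5, 8)
x_a <- 2
y_a <- 4
x_b <- 5
y_b <- 8

# straight-line (Euclidean) distance between the two points
sqrt((x_a - x_b)^2 + (y_a - y_b)^2)
## [1] 5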

Suppose we want to classify a new observation with perimeter of 0 and concavity of 3.5. Let’s calculate the distances between our new point and each of the observations in the training set to find the \(K=5\) observations in the training data that are nearest to our new point.

Figure 6.7: Scatterplot of concavity versus perimeter with new observation labelled in red

new_obs_Perimeter <- 0
new_obs_Concavity <- 3.5
cancer %>%
  select(ID, Perimeter, Concavity, Class) %>%
  mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 + (Concavity - new_obs_Concavity)^2)) %>%
  arrange(dist_from_new) %>%
  slice(1:5) # subset the first 5 rows
## # A tibble: 5 x 5
##        ID Perimeter Concavity Class dist_from_new
##     <dbl>     <dbl>     <dbl> <fct>         <dbl>
## 1   86409     0.241      2.65 B             0.881
## 2  887181     0.750      2.87 M             0.980
## 3  899667     0.623      2.54 M             1.14 
## 4  907914     0.417      2.31 M             1.26 
## 5 8710441    -1.16       4.04 B             1.28

From this, we see that 3 of the 5 nearest neighbours to our new observation are malignant, so we classify our new observation as malignant. We circle those 5 observations in the plot below:

Figure 6.8: Scatterplot of concavity versus perimeter with 5 nearest neighbours circled

It can sometimes be difficult to read code as math, so here we show the distance calculation explicitly for each of the 5 closest points.

Perimeter  Concavity  Distance                                            Class
 0.24       2.65      \(\sqrt{(0 - 0.241)^2 + (3.5 - 2.65)^2} = 0.88\)    B
 0.75       2.87      \(\sqrt{(0 - 0.750)^2 + (3.5 - 2.87)^2} = 0.98\)    M
 0.62       2.54      \(\sqrt{(0 - 0.623)^2 + (3.5 - 2.54)^2} = 1.14\)    M
 0.42       2.31      \(\sqrt{(0 - 0.417)^2 + (3.5 - 2.31)^2} = 1.26\)    M
-1.16       4.04      \(\sqrt{(0 - (-1.16))^2 + (3.5 - 4.04)^2} = 1.28\)  B

More than two explanatory variables

Although the above description is directed toward two explanatory variables / predictors, exactly the same K-nearest neighbour algorithm applies when you have a higher number of explanatory variables (i.e., a higher-dimensional predictor space). Each explanatory variable/predictor can give us new information to help create our classifier. The only difference is the formula for the distance between points. In particular, let’s say we have \(m\) predictor variables for two observations \(u\) and \(v\), i.e., \(u = (u_{1}, u_{2}, \dots, u_{m})\) and \(v = (v_{1}, v_{2}, \dots, v_{m})\). Before, we added up the squared difference between each of our (two) variables, and then took the square root; now we will do the same, except for all of our \(m\) variables. In other words, the distance formula becomes

\[\mathrm{Distance} = \sqrt{(u_{1} - v_{1})^2 + (u_{2} - v_{2})^2 + \dots + (u_{m} - v_{m})^2}\]

Figure 6.9: 3D scatterplot of symmetry, concavity and perimeter

Note that in general we recommend against using 3D visualizations; here we show the data in 3D only to illustrate what “higher dimensions” look like for learning purposes.
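
As a sketch of how this extends in code, we could repeat the earlier distance calculation with three predictors (Perimeter, Concavity, and Symmetry) for a hypothetical new observation. The Symmetry value of 1 below is made up for illustration:

new_obs_Perimeter <- 0
new_obs_Concavity <- 3.5
new_obs_Symmetry <- 1 # hypothetical value, for illustration only

cancer %>%
  select(ID, Perimeter, Concavity, Symmetry, Class) %>%
  mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 +
                              (Concavity - new_obs_Concavity)^2 +
                              (Symmetry - new_obs_Symmetry)^2)) %>%
  arrange(dist_from_new) %>%
  slice(1:5) # the 5 nearest neighbours in three dimensions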

Summary

In order to classify a new observation using a K-nearest neighbour classifier, we have to:

  1. Compute the distance between the new observation and each observation in the training set
  2. Sort the data table in ascending order according to the distances
  3. Choose the top \(K\) rows of the sorted table
  4. Classify the new observation based on a majority vote of the neighbour classes
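
As a minimal sketch, these four steps can be written as a single dplyr pipeline for the perimeter/concavity example above (reusing new_obs_Perimeter and new_obs_Concavity defined earlier):

k <- 5
cancer %>%
  # 1. compute the distance to the new observation
  mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 +
                              (Concavity - new_obs_Concavity)^2)) %>%
  # 2. sort in ascending order of distance
  arrange(dist_from_new) %>%
  # 3. keep the K nearest rows
  slice(1:k) %>%
  # 4. take the majority vote of the neighbour classes
  count(Class) %>%
  arrange(desc(n)) %>%
  slice(1) %>%
  pull(Class)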

6.6 K-nearest neighbours with tidymodels

Coding the K-nearest neighbour algorithm in R ourselves would get complicated, especially if we wanted to predict the label/class for multiple new observations, or handle multiple classes and more than two predictor variables. Thankfully, in R, the K-nearest neighbour algorithm is implemented in the parsnip package included in the tidymodels package collection, along with many other models that you will encounter in this and future courses. The tidymodels collection provides tools to help make and use models, such as classifiers. Using the packages in this collection will help keep our code simple, readable and accurate; the less we have to code ourselves, the fewer mistakes we are likely to make. We start off by loading tidymodels:

library(tidymodels)

Let’s again suppose we have a new observation with perimeter 0 and concavity 3.5, but its diagnosis is unknown (as in our example above). Suppose we want to use the perimeter and concavity explanatory variables/predictors to predict the diagnosis class of this observation. Let’s pick out our 2 desired predictor variables and the class label, and store them as a new data set named cancer_train:

cancer_train <- cancer %>%
  select(Class, Perimeter, Concavity)
cancer_train
## # A tibble: 569 x 3
##    Class Perimeter Concavity
##    <fct>     <dbl>     <dbl>
##  1 M        1.27      2.65  
##  2 M        1.68     -0.0238
##  3 M        1.57      1.36  
##  4 M       -0.592     1.91  
##  5 M        1.78      1.37  
##  6 M       -0.387     0.866 
##  7 M        1.14      0.300 
##  8 M       -0.0728    0.0610
##  9 M       -0.184     1.22  
## 10 M       -0.329     1.74  
## # … with 559 more rows

Next, we create a model specification for K-nearest neighbours classification by calling the nearest_neighbor function, specifying that we want to use \(K = 5\) neighbours (we will discuss how to choose \(K\) in the next chapter) and the straight-line distance (weight_func = "rectangular"). The weight_func argument controls how neighbours vote when classifying a new observation; by setting it to "rectangular", each of the \(K\) nearest neighbours gets exactly 1 vote as described above. Other choices, which weight each neighbour’s vote differently, can be found on the tidymodels website. We specify the particular computational engine (in this case, the kknn engine) for training the model with the set_engine function. Finally we specify that this is a classification problem with the set_mode function.

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("classification")
knn_spec
## K-Nearest Neighbor Model Specification (classification)
## 
## Main Arguments:
##   neighbors = 5
##   weight_func = rectangular
## 
## Computational engine: kknn

In order to fit the model on the breast cancer data, we need to pass the model specification and the data set to the fit function. We also need to specify what variables to use as predictors and what variable to use as the target. Below, the Class ~ . argument specifies that Class is the target variable (the one we want to predict), and . (everything except Class) is to be used as the predictors.

knn_fit <- knn_spec %>%
  fit(Class ~ ., data = cancer_train)
knn_fit
## parsnip model object
## 
## Fit time:  44ms 
## 
## Call:
## kknn::train.kknn(formula = formula, data = data, ks = ~5, kernel = ~"rectangular")
## 
## Type of response variable: nominal
## Minimal misclassification: 0.07557118
## Best kernel: rectangular
## Best k: 5

Here you can see the final trained model summary. It confirms that the computational engine used to train the model was kknn::train.kknn. It also shows the fraction of errors made by the nearest neighbour model, but we will ignore this for now and discuss it in more detail in the next chapter. Finally it shows (somewhat confusingly) that the “best” weight function was “rectangular” and “best” setting of \(K\) was 5; but since we specified these earlier, R is just repeating those settings to us here. In the next chapter, we will actually let R tune the model for us.

Finally, we make the prediction on the new observation by calling the predict function, passing both the fit object we just created and the new observation itself. Just as when we ran the K-nearest neighbours classification algorithm manually above, the knn_fit object classifies the new observation as malignant (“M”). Note that the predict function outputs a data frame with a single variable named .pred_class.

new_obs <- tibble(Perimeter = 0, Concavity = 3.5)
predict(knn_fit, new_obs)
## # A tibble: 1 x 1
##   .pred_class
##   <fct>      
## 1 M
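
The same predict call also works for several new observations at once. As a small sketch, we reuse the two hypothetical observations discussed earlier in this chapter (perimeter 0 / concavity 3.5, and perimeter -1 / concavity -0.5) and attach the predictions to the observations they correspond to with bind_cols:

# two hypothetical new observations
new_obs_many <- tibble(Perimeter = c(0, -1), Concavity = c(3.5, -0.5))

# predict() returns one .pred_class per row; attach them to the inputs
bind_cols(new_obs_many, predict(knn_fit, new_obs_many))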

6.7 Data preprocessing with tidymodels

6.7.1 Centering and scaling

When using K-nearest neighbour classification, the scale of each variable (i.e., its size and range of values) matters. Since the classifier predicts classes by identifying the observations nearest to the new observation, any variables that have a large scale will have a much larger effect on the distances than variables with a small scale. But just because a variable has a large scale doesn’t mean that it is more important for making accurate predictions. For example, suppose you have a data set with two attributes, salary (in dollars) and years of education, and you want to predict the corresponding type of job. When we compute the neighbour distances, a difference of $1000 in salary is huge compared to a difference of 10 years of education. But in terms of the problem we are trying to answer, the opposite is true: a difference of 10 years of education is huge compared to a difference of $1000 in yearly salary!
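
To see this concretely, here is a minimal sketch with made-up salary and education values for two hypothetical people:

# two hypothetical people (values made up for illustration)
jobs <- tibble(
  salary    = c(50000, 51000), # a $1000 difference
  education = c(2, 12)         # a 10-year difference
)

# Euclidean distance on the raw scales: dominated by the salary difference
sqrt(diff(jobs$salary)^2 + diff(jobs$education)^2)

# after standardizing each variable, both differences contribute equally
jobs_std <- jobs %>%
  mutate(across(everything(), ~ (.x - mean(.x)) / sd(.x)))
sqrt(diff(jobs_std$salary)^2 + diff(jobs_std$education)^2)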

In many other predictive models, the center of each variable (e.g., its mean) matters as well. For example, if we had a data set with a temperature variable measured in degrees Kelvin, and the same data set with temperature measured in degrees Celsius, the two variables would differ by a constant shift of 273 (even though they contain exactly the same information). Likewise, in our hypothetical job classification example, we would likely see that the center of the salary variable is in the tens of thousands, while the center of the years of education variable is in the single digits. Although this doesn’t affect the K-nearest neighbour classification algorithm, this large shift can change the outcome of using many other predictive models.

Standardization: when all variables in a data set have a mean (center) of 0 and a standard deviation (scale) of 1, we say that the data have been standardized.

To illustrate the effect that standardization can have on the K-nearest neighbour algorithm, we will read in the original, unscaled Wisconsin breast cancer data set; we have been using a standardized version of the data set up until now. To keep things simple, we will just use the Area, Smoothness, and Class variables:

unscaled_cancer <- read_csv("data/unscaled_wdbc.csv") %>%
  mutate(Class = as_factor(Class)) %>%
  select(Class, Area, Smoothness)
unscaled_cancer
## # A tibble: 569 x 3
##    Class  Area Smoothness
##    <fct> <dbl>      <dbl>
##  1 M     1001      0.118 
##  2 M     1326      0.0847
##  3 M     1203      0.110 
##  4 M      386.     0.142 
##  5 M     1297      0.100 
##  6 M      477.     0.128 
##  7 M     1040      0.0946
##  8 M      578.     0.119 
##  9 M      520.     0.127 
## 10 M      476.     0.119 
## # … with 559 more rows

Looking at the unscaled and uncentered data above, you can see that the differences between the values of the area measurements are much larger than those of the smoothness measurements, and that the mean of the area variable appears to be much larger too. Will this affect predictions? In order to find out, we will create a scatter plot of these two predictors (coloured by diagnosis) for both the unstandardized data we just loaded, and the standardized version of that same data.

In the tidymodels framework, all data preprocessing happens using a recipe. Here we will initialize a recipe for the unscaled_cancer data above, specifying that the Class variable is the target, and all other variables are predictors:

uc_recipe <- recipe(Class ~ ., data = unscaled_cancer)
print(uc_recipe)
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          2

So far, there is not much in the recipe; just a statement about the number of targets and predictors. Let’s add scaling (step_scale) and centering (step_center) steps for all of the predictors so that they each have a mean of 0 and standard deviation of 1. The prep function finalizes the recipe by using the data (here, unscaled_cancer) to compute anything necessary to run the recipe (in this case, the column means and standard deviations):

uc_recipe <- uc_recipe %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors()) %>%
  prep()
uc_recipe
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          2
## 
## Training data contained 569 data points and no missing data.
## 
## Operations:
## 
## Scaling for Area, Smoothness [trained]
## Centering for Area, Smoothness [trained]

You can now see that the recipe includes a scaling and centering step for all predictor variables. Note that when you add a step to a recipe, you must specify what columns to apply the step to. Here we used the all_predictors() function to specify that each step should be applied to all predictor variables. However, there are a number of different arguments one could use here, as well as naming particular columns with the same syntax as the select function. For example:

  • all_nominal() and all_numeric(): specify all categorical or all numeric variables
  • all_predictors() and all_outcomes(): specify all predictor or all target variables
  • Area, Smoothness: specify both the Area and Smoothness variables
  • -Class: specify everything except the Class variable
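
For instance, the following sketches specify the same centering and scaling steps for this data set in three equivalent ways; none of them have been prepped yet, they only illustrate the selector syntax:

# using the all_predictors() selector (as above)
recipe(Class ~ ., data = unscaled_cancer) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

# naming the columns directly
recipe(Class ~ ., data = unscaled_cancer) %>%
  step_scale(Area, Smoothness) %>%
  step_center(Area, Smoothness)

# selecting everything except the Class column
recipe(Class ~ ., data = unscaled_cancer) %>%
  step_scale(-Class) %>%
  step_center(-Class)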

You can find a full set of all the steps and variable selection functions on the recipes home page. We finally use the bake function to apply the recipe.

scaled_cancer <- bake(uc_recipe, unscaled_cancer)
scaled_cancer
## # A tibble: 569 x 3
##      Area Smoothness Class
##     <dbl>      <dbl> <fct>
##  1  0.984      1.57  M    
##  2  1.91      -0.826 M    
##  3  1.56       0.941 M    
##  4 -0.764      3.28  M    
##  5  1.82       0.280 M    
##  6 -0.505      2.24  M    
##  7  1.09      -0.123 M    
##  8 -0.219      1.60  M    
##  9 -0.384      2.20  M    
## 10 -0.509      1.58  M    
## # … with 559 more rows

Now let’s generate the two scatter plots, one for unscaled_cancer and one for scaled_cancer, and show them side-by-side. Each has the same new observation annotated with its \(K=3\) nearest neighbours:

Figure 6.10: Comparison of K = 3 nearest neighbours with standardized and unstandardized data

In the plot for the nonstandardized original data, you can see some odd choices for the three nearest neighbours. In particular, the “neighbours” are visually well within the cloud of benign observations, and the neighbours are all nearly vertically aligned with the new observation (which is why it looks like there is only one black line on this plot). Here the computation of nearest neighbours is dominated by the much larger-scale area variable. On the right, the plot for standardized data shows a much more intuitively reasonable selection of nearest neighbours. Thus, standardizing the data can change things in an important way when we are using predictive algorithms. As a rule of thumb, standardizing your data should be a part of the preprocessing you do before any predictive modelling / analysis.

6.7.2 Balancing

Another potential issue in a data set for a classifier is class imbalance, i.e., when one label is much more common than another. Since classifiers like the K-nearest neighbour algorithm use the labels of nearby points to predict the label of a new point, if there are many more data points with one label overall, the algorithm is more likely to pick that label in general (even if the “pattern” of data suggests otherwise). Class imbalance is actually quite a common and important problem: from rare disease diagnosis to malicious email detection, there are many cases in which the “important” class to identify (presence of disease, malicious email) is much rarer than the “unimportant” class (no disease, normal email).

To better illustrate the problem, let’s revisit the breast cancer data; except now we will remove many of the observations of malignant tumours, simulating what the data would look like if the cancer was rare. We will do this by picking only 3 observations randomly from the malignant group, and keeping all of the benign observations.

set.seed(3)
rare_cancer <- bind_rows(
  filter(cancer, Class == "B"),
  cancer %>% filter(Class == "M") %>% sample_n(3)
) %>%
  select(Class, Perimeter, Concavity)

rare_plot <- rare_cancer %>%
  ggplot(aes(x = Perimeter, y = Concavity, color = Class)) +
  geom_point(alpha = 0.5) +
  labs(color = "Diagnosis") +
  scale_color_manual(labels = c("Malignant", "Benign"), values = cbPalette)
rare_plot

Figure 6.11: Imbalanced data

Note: You will see in the code above that we use the set.seed function. This is because we are using sample_n to artificially pick only 3 of the malignant tumour observations, which uses random sampling to choose which rows will be in the training set. In order to make the code reproducible, we use set.seed to specify where the random number generator starts for this process, which then guarantees the same result, i.e., the same choice of 3 observations, each time the code is run. In general, when your code involves random numbers, if you want the same result each time, you should use set.seed; if you want a different result each time, you should not.
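
As a small illustration, drawing the same random sample twice after setting the same seed gives the same rows each time:

set.seed(3)
sample_n(cancer, 3) %>% pull(ID)

set.seed(3)
sample_n(cancer, 3) %>% pull(ID) # same seed, so the same 3 IDs as above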

Suppose we now decided to use \(K = 7\) in K-nearest neighbour classification. With only 3 observations of malignant tumours, the classifier will always predict that the tumour is benign, no matter what its concavity and perimeter are! This is because in a majority vote of 7 observations, at most 3 will be malignant (we only have 3 total malignant observations), so at least 4 must be benign, and the benign vote will always win. For example, look what happens for a new tumour observation that is quite close to two that were tagged as malignant:

Figure 6.12: Imbalanced data with K = 7

And if we set the background colour of each area of the plot to the decision the K-nearest neighbour classifier would make, we can see that the decision is always “benign,” corresponding to the blue colour:

Figure 6.13: Imbalanced data with background colour indicating the decision of the K-nn classifier

Despite the simplicity of the problem, solving it in a statistically sound manner is actually fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this textbook. For the present purposes, it will suffice to rebalance the data by oversampling the rare class. In other words, we will replicate rare observations multiple times in our data set to give them more voting power in the K-nearest neighbour algorithm. In order to do this, we will create a new recipe for the rare_cancer data that includes an oversampling step, using the step_upsample function. We show below how to do this, and also use the group_by + summarize pattern we’ve seen before to see that our classes are now balanced:

ups_recipe <- recipe(Class ~ ., data = rare_cancer) %>%
  step_upsample(Class, over_ratio = 1, skip = FALSE) %>%
  prep()
ups_recipe
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          2
## 
## Training data contained 360 data points and no missing data.
## 
## Operations:
## 
## Up-sampling based on Class [trained]
upsampled_cancer <- bake(ups_recipe, rare_cancer)

upsampled_cancer %>%
  group_by(Class) %>%
  summarize(n = n())
## # A tibble: 2 x 2
##   Class     n
##   <fct> <int>
## 1 M       357
## 2 B       357

Now suppose we train our K-nearest neighbour classifier with \(K=7\) on this balanced data. Setting the background colour of each area of our scatter plot to the decision the K-nearest neighbour classifier would make, we can see that the decision is more reasonable; when the points are close to those labelled malignant, the classifier predicts a malignant tumour, and vice versa when they are closer to the benign tumour observations:

Figure 6.14: K-nn after upsampling
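
The code used to produce this classifier is not shown above; a minimal sketch of fitting it on the balanced data might look like the following (the new observation's values below are made up for illustration):

# specify a K-nearest neighbours model with K = 7
knn_spec_balanced <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# train on the upsampled (balanced) data
knn_fit_balanced <- knn_spec_balanced %>%
  fit(Class ~ ., data = upsampled_cancer)

# predict the class of a hypothetical new observation
predict(knn_fit_balanced, tibble(Perimeter = 1, Concavity = 1))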

6.8 Putting it together in a workflow

The tidymodels package collection also provides the workflow, a simple way to chain together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps. To illustrate the whole pipeline, let’s start from scratch with the unscaled_wdbc.csv data. First we will load the data, create a model, and specify a recipe for how the data should be preprocessed:

# load the unscaled cancer data and make sure the target Class variable is a factor
unscaled_cancer <- read_csv("data/unscaled_wdbc.csv") %>%
  mutate(Class = as_factor(Class))

# create the KNN model
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# create the centering / scaling recipe
uc_recipe <- recipe(Class ~ Area + Smoothness, data = unscaled_cancer) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

Note that each of these steps is exactly the same as earlier, except for one major difference: we did not use the select function to extract the relevant variables from the data frame, and instead simply specified the relevant variables to use via the formula Class ~ Area + Smoothness (instead of Class ~ .) in the recipe. You will also notice that we did not call prep() on the recipe; this is unnecessary when it is placed in a workflow.

We will now place these steps in a workflow using the add_recipe and add_model functions, and finally we will use the fit function to run the whole workflow on the unscaled_cancer data. Note another difference from earlier here: we do not include a formula in the fit function. This is again because we included the formula in the recipe, so there is no need to respecify it:

knn_fit <- workflow() %>%
  add_recipe(uc_recipe) %>%
  add_model(knn_spec) %>%
  fit(data = unscaled_cancer)
knn_fit
## ══ Workflow [trained] ═════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: nearest_neighbor()
## 
## ── Preprocessor ───────────────────────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## ● step_scale()
## ● step_center()
## 
## ── Model ──────────────────────────────────────────────────────────────────────────────────────────
## 
## Call:
## kknn::train.kknn(formula = formula, data = data, ks = ~7, kernel = ~"rectangular")
## 
## Type of response variable: nominal
## Minimal misclassification: 0.112478
## Best kernel: rectangular
## Best k: 7

As before, the fit object lists the function that trains the model as well as the “best” settings for the number of neighbours and weight function (for now, these are just the values we chose manually when we created knn_spec above). But now the fit object also includes information about the overall workflow, including the centering and scaling preprocessing steps.

Let’s visualize the predictions that this trained K-nearest neighbour model will make on new observations. Below you will see how to make the coloured prediction map plots from earlier in this chapter. The basic idea is to create a grid of synthetic new observations using the expand.grid function, predict the label of each, and visualize the predictions with a coloured scatter having a very high transparency (low alpha value) and large point radius. We include the code here as a learning challenge; see if you can figure out what each line is doing!

# create the grid of area/smoothness vals, and arrange in a data frame
are_grid <- seq(min(unscaled_cancer$Area), max(unscaled_cancer$Area), length.out = 100)
smo_grid <- seq(min(unscaled_cancer$Smoothness), max(unscaled_cancer$Smoothness), length.out = 100)
asgrid <- as_tibble(expand.grid(Area = are_grid, Smoothness = smo_grid))

# use the fit workflow to make predictions at the grid points
knnPredGrid <- predict(knn_fit, asgrid)

# bind the predictions as a new column with the grid points
prediction_table <- bind_cols(knnPredGrid, asgrid) %>% rename(Class = .pred_class)

# plot:
# 1. the coloured scatter of the original data
# 2. the faded coloured scatter for the grid points
wkflw_plot <-
  ggplot() +
  geom_point(data = unscaled_cancer, mapping = aes(x = Area, y = Smoothness, color = Class), alpha = 0.75) +
  geom_point(data = prediction_table, mapping = aes(x = Area, y = Smoothness, color = Class), alpha = 0.02, size = 5.) +
  labs(color = "Diagnosis") +
  scale_color_manual(labels = c("Malignant", "Benign"), values = cbPalette)
wkflw_plot

Figure 6.15: Scatterplot of smoothness versus area where background colour indicates the decision of the classifier
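
One convenience of using a workflow is that new observations can be given on the original measurement scales; the recipe's centering and scaling are applied automatically before the K-nearest neighbours prediction is made. A minimal sketch, with made-up Area and Smoothness values:

# a hypothetical new observation on the original (unscaled) measurement scales
new_obs_unscaled <- tibble(Area = 500, Smoothness = 0.075)

# the trained workflow applies the centering/scaling recipe, then predicts
predict(knn_fit, new_obs_unscaled)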