Chapter 6 Classification I: training & predicting
6.1 Overview
Up until this point, we have focused solely on descriptive and exploratory questions about data. This chapter and the next together serve as our first foray into answering predictive questions about data. In particular, we will focus on the problem of classification, i.e., using one or more quantitative variables to predict the value of another, categorical variable. This chapter will cover the basics of classification, how to preprocess data to make it suitable for use in a classifier, and how to use our observed data to make predictions. The next will focus on evaluating how accurate our classifier’s predictions are, as well as how to improve our classifier (where possible) to maximize its accuracy.
6.2 Chapter learning objectives
- Recognize situations where a classifier would be appropriate for making predictions
- Describe what a training data set is and how it is used in classification
- Interpret the output of a classifier
- Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two explanatory variables/predictors
- Explain the K-nearest neighbour classification algorithm
- Perform K-nearest neighbour classification in R using tidymodels
- Explain why one should center, scale, and balance data in predictive modelling
- Preprocess data to center, scale, and balance a dataset using a recipe
- Combine preprocessing and model training using a tidymodels workflow
6.3 The classification problem
In many situations, we want to make predictions based on the current situation as well as past experiences. For instance, a doctor may want to diagnose a patient as either diseased or healthy based on their symptoms and the doctor’s past experience with patients; an email provider might want to tag a given email as “spam” or “not spam” depending on past email text data; or an online store may want to predict whether an order is fraudulent or not.
These tasks are all examples of classification, i.e., predicting a categorical class (sometimes called a label) for an observation given its other quantitative variables (sometimes called features). Generally, a classifier assigns an observation (e.g. a new patient) to a class (e.g. diseased or healthy) on the basis of how similar it is to other observations for which we know the class (e.g. previous patients with known diseases and symptoms). These observations with known classes that we use as a basis for prediction are called a training set. We call them a “training set” because we use these observations to train, or teach, our classifier so that we can use it to make predictions on new data that we have not seen previously.
There are many possible classification algorithms that we could use to predict a categorical class/label for an observation. In addition, there are many variations on the basic classification problem, e.g., binary classification, where only two classes are involved (e.g., a diseased or healthy patient), or multiclass classification, which involves assigning an object to one of several classes (e.g., a private, public, or not-for-profit organization). Here we will focus on the simple, widely used K-nearest neighbours algorithm for the binary classification problem. Other examples you may encounter in future courses include decision trees, support vector machines (SVMs), logistic regression, and neural networks.
6.4 Exploring a labelled data set
In this chapter and the next, we will study a data set of digitized breast cancer image features, created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at the University of Wisconsin, Madison. Each row in the data set represents an image of a tumour sample, including the diagnosis (benign or malignant) and several other measurements (e.g., nucleus texture, perimeter, area, etc.). Diagnosis for each image was conducted by physicians.
As with all data analyses, we first need to formulate a precise question that we want to answer. Here, the question is predictive: can we use the tumour image measurements available to us to predict whether a future tumour image (with unknown diagnosis) shows a benign or malignant tumour? Answering this question is important because traditional, non-data-driven methods for tumour diagnosis are quite subjective and dependent upon how skilled and experienced the diagnosing physician is. Furthermore, benign tumours are not normally dangerous; the cells stay in the same place and the tumour stops growing before it gets very large. By contrast, in malignant tumours, the cells invade the surrounding tissue and spread into nearby organs where they can cause serious damage (learn more about cancer here). Thus, it is important to quickly and accurately diagnose the tumour type to guide patient treatment.
Loading the data
Our first step is to load, wrangle, and explore the data using visualizations
in order to better understand the data we are working with. We start by
loading the necessary packages for our analysis. Below you’ll see (in addition to the usual tidyverse) a new package: forcats. The forcats package enables us to easily manipulate factors in R; factors are a special categorical type of variable in R that are often used for class label data.
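A minimal sketch of that loading step, assuming both packages are already installed:
library(tidyverse)
library(forcats)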
In this case, the file containing the breast cancer data set is a simple .csv file with headers. We’ll use the read_csv function with no additional arguments, and then inspect its contents:
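A sketch of that step is shown below. The data frame name cancer matches the code later in the chapter, but the file path data/wdbc.csv is an assumption; substitute the actual location of the file.
# read the comma-separated file; read_csv treats the first row as column headers by default
cancer <- read_csv("data/wdbc.csv")   # path is an assumption
cancer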
## # A tibble: 569 x 12
## ID Class Radius Texture Perimeter Area Smoothness Compactness Concavity Concave_Points
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 8.42e5 M 1.10 -2.07 1.27 0.984 1.57 3.28 2.65 2.53
## 2 8.43e5 M 1.83 -0.353 1.68 1.91 -0.826 -0.487 -0.0238 0.548
## 3 8.43e7 M 1.58 0.456 1.57 1.56 0.941 1.05 1.36 2.04
## 4 8.43e7 M -0.768 0.254 -0.592 -0.764 3.28 3.40 1.91 1.45
## 5 8.44e7 M 1.75 -1.15 1.78 1.82 0.280 0.539 1.37 1.43
## 6 8.44e5 M -0.476 -0.835 -0.387 -0.505 2.24 1.24 0.866 0.824
## 7 8.44e5 M 1.17 0.161 1.14 1.09 -0.123 0.0882 0.300 0.646
## 8 8.45e7 M -0.118 0.358 -0.0728 -0.219 1.60 1.14 0.0610 0.282
## 9 8.45e5 M -0.320 0.588 -0.184 -0.384 2.20 1.68 1.22 1.15
## 10 8.45e7 M -0.473 1.10 -0.329 -0.509 1.58 2.56 1.74 0.941
## # … with 559 more rows, and 2 more variables: Symmetry <dbl>, Fractal_Dimension <dbl>
Variable descriptions
Breast tumours can be diagnosed by performing a biopsy, a process where tissue is removed from the body and examined for the presence of disease. Traditionally these procedures were quite invasive; modern methods such as fine needle aspiration, used to collect the present data set, extract only a small amount of tissue and are less invasive. Based on a digital image of each breast tissue sample collected for this data set, 10 different variables were measured for each cell nucleus in the image (items 3-12 of the list below), and then the mean for each variable across the nuclei was recorded. As part of the data preparation, these values have been scaled; we will discuss what this means and why we do it later in this chapter. Each image was additionally given a unique ID and a diagnosis (benign or malignant) by a physician. Therefore, the total set of variables per image in this data set is:
- ID number
- Class: the diagnosis of Malignant or Benign
- Radius: the mean of distances from center to points on the perimeter
- Texture: the standard deviation of gray-scale values
- Perimeter: the length of the surrounding contour
- Area: the area inside the contour
- Smoothness: the local variation in radius lengths
- Compactness: the ratio of squared perimeter and area
- Concavity: severity of concave portions of the contour
- Concave Points: the number of concave portions of the contour
- Symmetry
- Fractal Dimension
Figure 6.1: A malignant breast fine needle aspiration image. Source
Below we use glimpse to preview the data frame. This function can make it easier to inspect the data when we have a lot of columns:
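The call producing the output below is simply (assuming the data frame is named cancer, as in the rest of the chapter):
glimpse(cancer)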
## Rows: 569
## Columns: 12
## $ ID <dbl> 842302, 842517, 84300903, 84348301, 84358402, 843786, 844359, 84458202…
## $ Class <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", …
## $ Radius <dbl> 1.0960995, 1.8282120, 1.5784992, -0.7682333, 1.7487579, -0.4759559, 1.…
## $ Texture <dbl> -2.0715123, -0.3533215, 0.4557859, 0.2535091, -1.1508038, -0.8346009, …
## $ Perimeter <dbl> 1.26881726, 1.68447255, 1.56512598, -0.59216612, 1.77501133, -0.386807…
## $ Area <dbl> 0.98350952, 1.90703027, 1.55751319, -0.76379174, 1.82462380, -0.505205…
## $ Smoothness <dbl> 1.56708746, -0.82623545, 0.94138212, 3.28066684, 0.28012535, 2.2354545…
## $ Compactness <dbl> 3.28062806, -0.48664348, 1.05199990, 3.39991742, 0.53886631, 1.2432415…
## $ Concavity <dbl> 2.65054179, -0.02382489, 1.36227979, 1.91421287, 1.36980615, 0.8655400…
## $ Concave_Points <dbl> 2.53024886, 0.54766227, 2.03543978, 1.45043113, 1.42723695, 0.82393067…
## $ Symmetry <dbl> 2.215565542, 0.001391139, 0.938858720, 2.864862154, -0.009552062, 1.00…
## $ Fractal_Dimension <dbl> 2.25376381, -0.86788881, -0.39765801, 4.90660199, -0.56195552, 1.88834…
We can see from the summary of the data above that Class is of type character (denoted by <chr>). Since we are going to be working with Class as a categorical statistical variable, we will convert it to a factor using the function as_factor.
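A minimal sketch of that conversion, overwriting the Class column and previewing the result again with glimpse:
cancer <- cancer %>%
  mutate(Class = as_factor(Class))   # convert the character column to a factor
glimpse(cancer)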
## Rows: 569
## Columns: 12
## $ ID <dbl> 842302, 842517, 84300903, 84348301, 84358402, 843786, 844359, 84458202…
## $ Class <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, B, B, B, M, M…
## $ Radius <dbl> 1.0960995, 1.8282120, 1.5784992, -0.7682333, 1.7487579, -0.4759559, 1.…
## $ Texture <dbl> -2.0715123, -0.3533215, 0.4557859, 0.2535091, -1.1508038, -0.8346009, …
## $ Perimeter <dbl> 1.26881726, 1.68447255, 1.56512598, -0.59216612, 1.77501133, -0.386807…
## $ Area <dbl> 0.98350952, 1.90703027, 1.55751319, -0.76379174, 1.82462380, -0.505205…
## $ Smoothness <dbl> 1.56708746, -0.82623545, 0.94138212, 3.28066684, 0.28012535, 2.2354545…
## $ Compactness <dbl> 3.28062806, -0.48664348, 1.05199990, 3.39991742, 0.53886631, 1.2432415…
## $ Concavity <dbl> 2.65054179, -0.02382489, 1.36227979, 1.91421287, 1.36980615, 0.8655400…
## $ Concave_Points <dbl> 2.53024886, 0.54766227, 2.03543978, 1.45043113, 1.42723695, 0.82393067…
## $ Symmetry <dbl> 2.215565542, 0.001391139, 0.938858720, 2.864862154, -0.009552062, 1.00…
## $ Fractal_Dimension <dbl> 2.25376381, -0.86788881, -0.39765801, 4.90660199, -0.56195552, 1.88834…
Factors have what are called “levels”, which you can think of as categories. We can ask for the levels from the Class column by using the levels function. This function should return the name of each category in that column. Given that we only have 2 different values in our Class column (B and M), we only expect to get two names back. Note that the levels function requires a vector argument, while the select function outputs a data frame; so we use the pull function, which converts a single column of a data frame into a vector.
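The corresponding code, a minimal sketch using pull and levels as just described:
cancer %>%
  pull(Class) %>%   # extract the Class column as a vector
  levels()          # list the category names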
## [1] "M" "B"
Exploring the data
Before we start doing any modelling, let’s explore our data set. Below we use the group_by + summarize code pattern we used before to see that we have 357 (63%) benign and 212 (37%) malignant tumour observations.
num_obs <- nrow(cancer)
cancer %>%
group_by(Class) %>%
summarize(
n = n(),
percentage = n() / num_obs * 100
)
## # A tibble: 2 x 3
## Class n percentage
## <fct> <int> <dbl>
## 1 M 212 37.3
## 2 B 357 62.7
Next, let’s draw a scatter plot to visualize the relationship between the perimeter and concavity variables. Rather than use ggplot's default palette, we define our own here (cbPalette) and pass it as the values argument to the scale_color_manual function. We also make the category labels (“B” and “M”) more readable by changing them to “Benign” and “Malignant” using the labels argument.
# colour palette
cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#999999")
perim_concav <- cancer %>%
ggplot(aes(x = Perimeter, y = Concavity, color = Class)) +
geom_point(alpha = 0.5) +
labs(color = "Diagnosis") +
scale_color_manual(labels = c("Malignant", "Benign"), values = cbPalette)
perim_concav

Figure 6.2: Scatterplot of concavity versus perimeter coloured by diagnosis label
In this visualization, we can see that malignant observations typically fall in the upper right-hand corner of the plot area. By contrast, benign observations typically fall in the lower left-hand corner of the plot. Suppose we obtain a new observation not in the current data set that has all the variables measured except the label (i.e., an image without the physician’s diagnosis for the tumour class). We could compute the perimeter and concavity values, resulting in values of, say, 1 and 1. Could we use this information to classify that observation as benign or malignant? What about a new observation with perimeter value of -1 and concavity value of -0.5? What about 0 and 1? It seems like the prediction of an unobserved label might be possible, based on our visualization. In order to actually do this computationally in practice, we will need a classification algorithm; here we will use the K-nearest neighbour classification algorithm.
6.5 Classification with K-nearest neighbours
To predict the label of a new observation, i.e., classify it as either benign or malignant, the K-nearest neighbour classifier generally finds the \(K\) “nearest” or “most similar” observations in our training set, and then uses their diagnoses to make a prediction for the new observation’s diagnosis. To illustrate this concept, we will walk through an example. Suppose we have a new observation, with perimeter of 2 and concavity of 4 (labelled in red on the scatterplot), whose diagnosis “Class” is unknown.

Figure 6.3: Scatterplot of concavity versus perimeter with new observation labelled in red
We see that the nearest point to this new observation is malignant and located at the coordinates (2.1, 3.6). The idea here is that if a point is close to another in the scatterplot, then the perimeter and concavity values are similar, and so we may expect that they would have the same diagnosis.

Figure 6.4: Scatterplot of concavity versus perimeter where the nearest neighbour to the new observation is malignant
Suppose we have another new observation with perimeter 0.2 and concavity of 3.3. Looking at the scatterplot below, how would you classify this red observation? The nearest neighbour to this new point is a benign observation at (0.2, 2.7). Does this seem like the right prediction to make? Probably not, if you consider the other nearby points…

Figure 6.5: Scatterplot of concavity versus perimeter where the nearest neighbour to this new observation is benign
So instead of just using the one nearest neighbour, we can consider several neighbouring points, say \(K = 3\), that are closest to the new red observation to predict its diagnosis class. Among those 3 closest points, we use the majority class as our prediction for the new observation. In this case, we see that the diagnoses of 2 of the 3 nearest neighbours to our new observation are malignant. Therefore we take the majority vote and classify our new red observation as malignant.

Figure 6.6: Scatterplot of concavity versus perimeter with three nearest neighbours
Here we chose the \(K=3\) nearest observations, but there is nothing special about \(K=3\). We could have used \(K=4, 5\) or more (though we may want to choose an odd number to avoid ties). We will discuss more about choosing \(K\) in the next chapter.
Distance between points
How do we decide which points are the \(K\) “nearest” to our new observation? We can compute the distance between any pair of points using the following formula:
\[\mathrm{Distance} = \sqrt{(x_a -x_b)^2 + (y_a - y_b)^2}\]
This formula – sometimes called the Euclidean distance – is simply the straight line distance between two points on the x-y plane with coordinates \((x_a, y_a)\) and \((x_b, y_b)\).
Suppose we want to classify a new observation with perimeter of 0 and concavity of 3.5. Let’s calculate the distances between our new point and each of the observations in the training set to find the \(K=5\) observations in the training data that are nearest to our new point.

Figure 6.7: Scatterplot of concavity versus perimeter with new observation labelled in red
new_obs_Perimeter <- 0
new_obs_Concavity <- 3.5
cancer %>%
select(ID, Perimeter, Concavity, Class) %>%
mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 + (Concavity - new_obs_Concavity)^2)) %>%
arrange(dist_from_new) %>%
slice(1:5) # subset the first 5 rows
## # A tibble: 5 x 5
## ID Perimeter Concavity Class dist_from_new
## <dbl> <dbl> <dbl> <fct> <dbl>
## 1 86409 0.241 2.65 B 0.881
## 2 887181 0.750 2.87 M 0.980
## 3 899667 0.623 2.54 M 1.14
## 4 907914 0.417 2.31 M 1.26
## 5 8710441 -1.16 4.04 B 1.28
From this, we see that 3 of the 5 nearest neighbours to our new observation are malignant, so we classify our new observation as malignant. We circle those 5 in the plot below:

Figure 6.8: Scatterplot of concavity versus perimeter with 5 nearest neighbours circled
It can be difficult sometimes to read code as math, so here we mathematically show the calculation of distance for each of the 5 closest points.
Perimeter | Concavity | Distance | Class |
---|---|---|---|
0.24 | 2.65 | \(\sqrt{(0 - 0.241)^2 + (3.5 - 2.65)^2} =\) 0.88 | B |
0.75 | 2.87 | \(\sqrt{(0 - 0.750)^2 + (3.5 - 2.87)^2} =\) 0.98 | M |
0.62 | 2.54 | \(\sqrt{(0 - 0.623)^2 + (3.5 - 2.54)^2} =\) 1.14 | M |
0.42 | 2.31 | \(\sqrt{(0 - 0.417)^2 + (3.5 - 2.31)^2} =\) 1.26 | M |
-1.16 | 4.04 | \(\sqrt{(0 - (-1.16))^2 + (3.5 - 4.04)^2} =\) 1.28 | B |
More than two explanatory variables
Although the above description is directed toward two explanatory variables / predictors, exactly the same K-nearest neighbour algorithm applies when you have a higher number of explanatory variables (i.e., a higher-dimensional predictor space). Each explanatory variable/predictor can give us new information to help create our classifier. The only difference is the formula for the distance between points. In particular, let’s say we have \(m\) predictor variables for two observations \(u\) and \(v\), i.e., \(u = (u_{1}, u_{2}, \dots, u_{m})\) and \(v = (v_{1}, v_{2}, \dots, v_{m})\). Before, we added up the squared difference between each of our (two) variables, and then took the square root; now we will do the same, except for all of our \(m\) variables. In other words, the distance formula becomes
\[\mathrm{Distance} = \sqrt{(u_{1} - v_{1})^2 + (u_{2} - v_{2})^2 + \dots + (u_{m} - v_{m})^2}\]
Figure 6.9: 3D scatterplot of symmetry, concavity and perimeter
Click and drag the plot above to rotate it, and scroll to zoom. Note that in general we recommend against using 3D visualizations; here we show the data in 3D only to illustrate what “higher dimensions” look like for learning purposes.
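For example, here is a small sketch of the same distance computation in R for two hypothetical observations u and v with three predictors each (the particular numbers are made up for illustration):
u <- c(0, 3.5, 1)        # hypothetical new observation: three predictor values
v <- c(0.24, 2.65, 0.9)  # hypothetical training observation
sqrt(sum((u - v)^2))     # sum the squared differences, then take the square root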
Summary
In order to classify a new observation using a K-nearest neighbour classifier, we have to:
- Compute the distance between the new observation and each observation in the training set
- Sort the data table in ascending order according to the distances
- Choose the top \(K\) rows of the sorted table
- Classify the new observation based on a majority vote of the neighbour classes
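Putting these four steps together, a minimal sketch of the manual procedure in R, reusing the cancer data frame and the new observation (perimeter 0, concavity 3.5) from above with \(K = 5\), might look like this:
new_obs <- tibble(Perimeter = 0, Concavity = 3.5)
K <- 5
cancer %>%
  mutate(dist_from_new = sqrt((Perimeter - new_obs$Perimeter)^2 +
                              (Concavity - new_obs$Concavity)^2)) %>% # 1. compute distances
  arrange(dist_from_new) %>%                                          # 2. sort by distance
  slice(1:K) %>%                                                      # 3. keep the top K rows
  group_by(Class) %>%                                                 # 4. majority vote
  summarize(votes = n()) %>%
  arrange(desc(votes)) %>%
  slice(1)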
6.6 K-nearest neighbours with tidymodels
Coding the K-nearest neighbour algorithm in R ourselves would get complicated, especially if we had to predict the label/class for multiple new observations, or when there are multiple classes and more than two variables. Thankfully, in R, the K-nearest neighbour algorithm is implemented in the parsnip package included in the tidymodels package collection, along with many other models that you will encounter in this and future classes. The tidymodels collection provides tools to help make and use models, such as classifiers. Using the packages in this collection will help keep our code simple, readable and accurate; the less we have to code ourselves, the fewer mistakes we are likely to make. We start off by loading tidymodels:
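That step is just the single call below.
library(tidymodels)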
Let’s again suppose we have a new observation with perimeter 0 and concavity 3.5, but its diagnosis is unknown (as in our example above). Suppose we want to use the perimeter and concavity explanatory variables/predictors to predict the diagnosis class of this observation. Let’s pick out our 2 desired predictor variables and class label and store them as a new data set named cancer_train:
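A sketch of that selection step, assuming the standardized data frame is still named cancer:
cancer_train <- cancer %>%
  select(Class, Perimeter, Concavity)   # keep only the label and the two predictors
cancer_train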
## # A tibble: 569 x 3
## Class Perimeter Concavity
## <fct> <dbl> <dbl>
## 1 M 1.27 2.65
## 2 M 1.68 -0.0238
## 3 M 1.57 1.36
## 4 M -0.592 1.91
## 5 M 1.78 1.37
## 6 M -0.387 0.866
## 7 M 1.14 0.300
## 8 M -0.0728 0.0610
## 9 M -0.184 1.22
## 10 M -0.329 1.74
## # … with 559 more rows
Next, we create a model specification for K-nearest neighbours classification by calling the nearest_neighbor function, specifying that we want to use \(K = 5\) neighbours (we will discuss how to choose \(K\) in the next chapter) and the straight-line distance (weight_func = "rectangular"). The weight_func argument controls how neighbours vote when classifying a new observation; by setting it to "rectangular", each of the \(K\) nearest neighbours gets exactly 1 vote as described above. Other choices, which weight each neighbour’s vote differently, can be found on the tidymodels website. We specify the particular computational engine (in this case, the kknn engine) for training the model with the set_engine function. Finally we specify that this is a classification problem with the set_mode function.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) %>%
set_engine("kknn") %>%
set_mode("classification")
knn_spec
## K-Nearest Neighbor Model Specification (classification)
##
## Main Arguments:
## neighbors = 5
## weight_func = rectangular
##
## Computational engine: kknn
In order to fit the model on the breast cancer data, we need to pass the model specification and the data set to the fit function. We also need to specify what variables to use as predictors and what variable to use as the target. Below, the Class ~ . argument specifies that Class is the target variable (the one we want to predict), and . (everything except Class) is to be used as the predictors.
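A minimal sketch of the fitting step (the object name knn_fit matches the prediction code that follows):
knn_fit <- knn_spec %>%
  fit(Class ~ ., data = cancer_train)   # train the K-nearest neighbour classifier
knn_fit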
## parsnip model object
##
## Fit time: 44ms
##
## Call:
## kknn::train.kknn(formula = formula, data = data, ks = ~5, kernel = ~"rectangular")
##
## Type of response variable: nominal
## Minimal misclassification: 0.07557118
## Best kernel: rectangular
## Best k: 5
Here you can see the final trained model summary. It confirms that the computational engine used to train the model was kknn::train.kknn. It also shows the fraction of errors made by the nearest neighbour model, but we will ignore this for now and discuss it in more detail in the next chapter.
Finally it shows (somewhat confusingly) that the “best” weight function
was “rectangular” and “best” setting of \(K\) was 5; but since we specified these earlier,
R is just repeating those settings to us here. In the next chapter, we will actually
let R tune the model for us.
Finally, we make the prediction on the new observation by calling the predict function, passing the fit object we just created. Just as when we ran the K-nearest neighbours classification algorithm manually above, the knn_fit object classifies the new observation as malignant (“M”). Note that the predict function outputs a data frame with a single variable named .pred_class.
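A sketch of that call, first constructing the new observation as a one-row data frame whose column names match the predictors:
new_obs <- tibble(Perimeter = 0, Concavity = 3.5)
predict(knn_fit, new_obs)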
## # A tibble: 1 x 1
## .pred_class
## <fct>
## 1 M
6.7 Data preprocessing with tidymodels
6.7.1 Centering and scaling
When using K-nearest neighbour classification, the scale of each variable (i.e., its size and range of values) matters. Since the classifier predicts classes by identifying observations that are nearest to it, any variables that have a large scale will have a much larger effect than variables with a small scale. But just because a variable has a large scale doesn’t mean that it is more important for making accurate predictions. For example, suppose you have a data set with two attributes, salary (in dollars) and years of education, and you want to predict the corresponding type of job. When we compute the neighbour distances, a difference of $1000 is huge compared to a difference of 10 years of education. But for our conceptual understanding and answering of the problem, it’s the opposite; 10 years of education is huge compared to a difference of $1000 in yearly salary!
In many other predictive models, the center of each variable (e.g., its mean) matters as well. For example, if we had a data set with a temperature variable measured in kelvin, and the same data set with temperature measured in degrees Celsius, the two variables would differ by a constant shift of 273 (even though they contain exactly the same information). Likewise in our hypothetical job classification example, we would likely see that the center of the salary variable is in the tens of thousands, while the center of the years of education variable is in the single digits. Although this doesn’t affect the K-nearest neighbour classification algorithm, this large shift can change the outcome of using many other predictive models.
Standardization: when all variables in a data set have a mean (center) of 0 and a standard deviation (scale) of 1, we say that the data have been standardized.
To illustrate the effect that standardization can have on the K-nearest neighbour algorithm, we will read in the original, unscaled Wisconsin breast cancer data set; we have been using a standardized version of the data set up until now. To keep things simple, we will just use the Area, Smoothness, and Class variables:
unscaled_cancer <- read_csv("data/unscaled_wdbc.csv") %>%
mutate(Class = as_factor(Class)) %>%
select(Class, Area, Smoothness)
unscaled_cancer
## # A tibble: 569 x 3
## Class Area Smoothness
## <fct> <dbl> <dbl>
## 1 M 1001 0.118
## 2 M 1326 0.0847
## 3 M 1203 0.110
## 4 M 386. 0.142
## 5 M 1297 0.100
## 6 M 477. 0.128
## 7 M 1040 0.0946
## 8 M 578. 0.119
## 9 M 520. 0.127
## 10 M 476. 0.119
## # … with 559 more rows
Looking at the unscaled / uncentered data above, you can see that the differences between the values of the area measurements are much larger than those of the smoothness measurements, and the mean of the area variable appears to be much larger too. Will this affect predictions? In order to find out, we will create a scatter plot of these two predictors (coloured by diagnosis) for both the unstandardized data we just loaded, and the standardized version of that same data.
In the tidymodels framework, all data preprocessing happens using a recipe. Here we will initialize a recipe for the unscaled_cancer data above, specifying that the Class variable is the target, and all other variables are predictors:
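The corresponding code; the formula Class ~ . declares Class as the outcome and the remaining columns (Area and Smoothness) as predictors, matching the recipe summary printed below:
uc_recipe <- recipe(Class ~ ., data = unscaled_cancer)
uc_recipe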
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 2
So far, there is not much in the recipe; just a statement about the number of targets and predictors. Let’s add scaling (step_scale) and centering (step_center) steps for all of the predictors so that they each have a mean of 0 and standard deviation of 1. The prep function finalizes the recipe by using the data (here, unscaled_cancer) to compute anything necessary to run the recipe (in this case, the column means and standard deviations):
uc_recipe <- uc_recipe %>%
step_scale(all_predictors()) %>%
step_center(all_predictors()) %>%
prep()
uc_recipe
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 2
##
## Training data contained 569 data points and no missing data.
##
## Operations:
##
## Scaling for Area, Smoothness [trained]
## Centering for Area, Smoothness [trained]
You can now see that the recipe includes a scaling and centering step for all predictor variables. Note that when you add a step to a recipe, you must specify what columns to apply the step to. Here we used the all_predictors() function to specify that each step should be applied to all predictor variables. However, there are a number of different arguments one could use here, as well as naming particular columns with the same syntax as the select function. For example:
- all_nominal() and all_numeric(): specify all categorical or all numeric variables
- all_predictors() and all_outcomes(): specify all predictor or all target variables
- Area, Smoothness: specify both the Area and Smoothness variables
- -Class: specify everything except the Class variable
You can find a full set of all the steps and variable selection functions on the recipes home page.
We finally use the bake function to apply the recipe.
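A sketch of that step, storing the standardized data under the name scaled_cancer used for the plots below:
scaled_cancer <- bake(uc_recipe, unscaled_cancer)   # apply the trained recipe to the data
scaled_cancer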
## # A tibble: 569 x 3
## Area Smoothness Class
## <dbl> <dbl> <fct>
## 1 0.984 1.57 M
## 2 1.91 -0.826 M
## 3 1.56 0.941 M
## 4 -0.764 3.28 M
## 5 1.82 0.280 M
## 6 -0.505 2.24 M
## 7 1.09 -0.123 M
## 8 -0.219 1.60 M
## 9 -0.384 2.20 M
## 10 -0.509 1.58 M
## # … with 559 more rows
Now let’s generate the two scatter plots, one for unscaled_cancer and one for scaled_cancer, and show them side-by-side. Each has the same new observation annotated with its \(K=3\) nearest neighbours:

Figure 6.10: Comparison of K = 3 nearest neighbours with standardized and unstandardized data
In the plot for the nonstandardized original data, you can see some odd choices for the three nearest neighbours. In particular, the “neighbours” are visually well within the cloud of benign observations, and the neighbours are all nearly vertically aligned with the new observation (which is why it looks like there is only one black line on this plot). Here the computation of nearest neighbours is dominated by the much larger-scale area variable. On the right, the plot for standardized data shows a much more intuitively reasonable selection of nearest neighbours. Thus, standardizing the data can change things in an important way when we are using predictive algorithms. As a rule of thumb, standardizing your data should be a part of the preprocessing you do before any predictive modelling / analysis.
6.7.2 Balancing
Another potential issue in a data set for a classifier is class imbalance, i.e., when one label is much more common than another. Since classifiers like the K-nearest neighbour algorithm use the labels of nearby points to predict the label of a new point, if there are many more data points with one label overall, the algorithm is more likely to pick that label in general (even if the “pattern” of data suggests otherwise). Class imbalance is actually quite a common and important problem: from rare disease diagnosis to malicious email detection, there are many cases in which the “important” class to identify (presence of disease, malicious email) is much rarer than the “unimportant” class (no disease, normal email).
To better illustrate the problem, let’s revisit the breast cancer data; except now we will remove many of the observations of malignant tumours, simulating what the data would look like if the cancer was rare. We will do this by picking only 3 observations randomly from the malignant group, and keeping all of the benign observations.
set.seed(3)
rare_cancer <- bind_rows(
filter(cancer, Class == "B"),
cancer %>% filter(Class == "M") %>% sample_n(3)
) %>%
select(Class, Perimeter, Concavity)
rare_plot <- rare_cancer %>%
ggplot(aes(x = Perimeter, y = Concavity, color = Class)) +
geom_point(alpha = 0.5) +
labs(color = "Diagnosis") +
scale_color_manual(labels = c("Malignant", "Benign"), values = cbPalette)
rare_plot

Figure 6.11: Imbalanced data
Note: You will see in the code above that we use the set.seed function. This is because we are using sample_n to artificially pick only 3 of the malignant tumour observations, which uses random sampling to choose which rows will be in the training set. In order to make the code reproducible, we use set.seed to specify where the random number generator starts for this process, which then guarantees the same result, i.e., the same choice of 3 observations, each time the code is run. In general, when your code involves random numbers, if you want the same result each time, you should use set.seed; if you want a different result each time, you should not.
Suppose we now decided to use \(K = 7\) in K-nearest neighbour classification. With only 3 observations of malignant tumours, the classifier will always predict that the tumour is benign, no matter what its concavity and perimeter are! This is because in a majority vote of 7 observations, at most 3 will be malignant (we only have 3 total malignant observations), so at least 4 must be benign, and the benign vote will always win. For example, look what happens for a new tumour observation that is quite close to two that were tagged as malignant:

Figure 6.12: Imbalanced data with K = 7
And if we set the background colour of each area of the plot to the decision the K-nearest neighbour classifier would make, we can see that the decision is always “benign,” corresponding to the blue colour:

Figure 6.13: Imbalanced data with background colour indicating the decision of the K-nn classifier
Despite the simplicity of the problem, solving it in a statistically sound manner is actually fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this textbook. For the present purposes, it will suffice to rebalance the data by oversampling the rare class. In other words, we will replicate rare observations multiple times in our data set to give them more voting power in the K-nearest neighbour algorithm. In order to do this, we will add an oversampling step to a new recipe for the rare_cancer data with the step_upsample function. We show below how to do this, and also use the group_by + summarize pattern we’ve seen before to see that our classes are now balanced:
ups_recipe <- recipe(Class ~ ., data = rare_cancer) %>%
step_upsample(Class, over_ratio = 1, skip = FALSE) %>%
prep()
ups_recipe
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 2
##
## Training data contained 360 data points and no missing data.
##
## Operations:
##
## Up-sampling based on Class [trained]
upsampled_cancer <- bake(ups_recipe, rare_cancer)
upsampled_cancer %>%
group_by(Class) %>%
summarize(n = n())
## # A tibble: 2 x 2
## Class n
## <fct> <int>
## 1 M 357
## 2 B 357
Now suppose we train our K-nearest neighbour classifier with \(K=7\) on this balanced data. Setting the background colour of each area of our scatter plot to the decision the K-nearest neighbour classifier would make, we can see that the decision is more reasonable; when the points are close to those labelled malignant, the classifier predicts a malignant tumour, and vice versa when they are closer to the benign tumour observations:

Figure 6.14: K-nn after upsampling
6.8 Putting it together in a workflow
The tidymodels package collection also provides the workflow, a simple way to chain together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps. To illustrate the whole pipeline, let’s start from scratch with the unscaled_wdbc.csv data. First we will load the data, create a model, and specify a recipe for how the data should be preprocessed:
# load the unscaled cancer data and make sure the target Class variable is a factor
unscaled_cancer <- read_csv("data/unscaled_wdbc.csv") %>%
mutate(Class = as_factor(Class))
# create the KNN model
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) %>%
set_engine("kknn") %>%
set_mode("classification")
# create the centering / scaling recipe
uc_recipe <- recipe(Class ~ Area + Smoothness, data = unscaled_cancer) %>%
step_scale(all_predictors()) %>%
step_center(all_predictors())
Note that each of these steps is exactly the same as earlier, except for one major difference: we did not use the select function to extract the relevant variables from the data frame, and instead simply specified the relevant variables to use via the formula Class ~ Area + Smoothness (instead of Class ~ .) in the recipe. You will also notice that we did not call prep() on the recipe; this is unnecessary when it is placed in a workflow.
We will now place these steps in a workflow using the add_recipe and add_model functions, and finally we will use the fit function to run the whole workflow on the unscaled_cancer data. Note another difference from earlier here: we do not include a formula in the fit function. This is again because we included the formula in the recipe, so there is no need to respecify it:
knn_fit <- workflow() %>%
add_recipe(uc_recipe) %>%
add_model(knn_spec) %>%
fit(data = unscaled_cancer)
knn_fit
## ══ Workflow [trained] ═════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: nearest_neighbor()
##
## ── Preprocessor ───────────────────────────────────────────────────────────────────────────────────
## 2 Recipe Steps
##
## ● step_scale()
## ● step_center()
##
## ── Model ──────────────────────────────────────────────────────────────────────────────────────────
##
## Call:
## kknn::train.kknn(formula = formula, data = data, ks = ~7, kernel = ~"rectangular")
##
## Type of response variable: nominal
## Minimal misclassification: 0.112478
## Best kernel: rectangular
## Best k: 7
As before, the fit object lists the function that trains the model as well as the “best” settings for the number of neighbours and weight function (for now, these are just the values we chose manually when we created knn_spec above). But now the fit object also includes information about the overall workflow, including the centering and scaling preprocessing steps.
Let’s visualize the predictions that this trained K-nearest neighbour model will make on new observations. Below you will see how to make the coloured prediction map plots from earlier in this chapter. The basic idea is to create a grid of synthetic new observations using the expand.grid function, predict the label of each, and visualize the predictions with a coloured scatter having a very high transparency (low alpha value) and large point radius. We include the code here as a learning challenge; see if you can figure out what each line is doing!
# create the grid of area/smoothness vals, and arrange in a data frame
are_grid <- seq(min(unscaled_cancer$Area), max(unscaled_cancer$Area), length.out = 100)
smo_grid <- seq(min(unscaled_cancer$Smoothness), max(unscaled_cancer$Smoothness), length.out = 100)
asgrid <- as_tibble(expand.grid(Area = are_grid, Smoothness = smo_grid))
# use the fit workflow to make predictions at the grid points
knnPredGrid <- predict(knn_fit, asgrid)
# bind the predictions as a new column with the grid points
prediction_table <- bind_cols(knnPredGrid, asgrid) %>% rename(Class = .pred_class)
# plot:
# 1. the coloured scatter of the original data
# 2. the faded coloured scatter for the grid points
wkflw_plot <-
ggplot() +
geom_point(data = unscaled_cancer, mapping = aes(x = Area, y = Smoothness, color = Class), alpha = 0.75) +
geom_point(data = prediction_table, mapping = aes(x = Area, y = Smoothness, color = Class), alpha = 0.02, size = 5.) +
labs(color = "Diagnosis") +
scale_color_manual(labels = c("Malignant", "Benign"), values = cbPalette)

Figure 6.15: Scatterplot of smoothness versus area where background colour indicates the decision of the classifier