DSCI 100
Is our model any good? How do we evaluate it?
How do we choose k in K-nearest neighbours classification?
Accuracy
\[Accuracy = \dfrac{\#\; correct\; predictions}{\#\; total\; predictions}\]
Notes:
Here is an example of confusion matrix with cancer diagnosis data we’ve seen before.
| Truly Malignant | Truly Benign | |
|---|---|---|
| Predicted Malignant | 1 | 4 |
| Predicted Benign | 3 | 57 |
Notes:
Typically we consider one of the class labels as “positive” - in this case the “Malignant” status is more interesting to researchers, hence we consider that label as “positive”.
Relabeling the above confusion matrix:
| Truly Positive | Truly Negative | |
|---|---|---|
| Predicted Positive | 1 | 4 |
| Predicted Negative | 3 | 57 |
Note that:
Top left cell = # correct positive predictions.
Top row = # total positive predictions.
Left column = # truly positive observations.
Notes:
\[ {Precision} = \dfrac{{\#\; correct\; positive\; predictions}}{{\#\; total\; positive\; predictions}} \]
\[ {Recall} = \dfrac{{\#\; correct\; positive\; predictions}}{{\#\; total\; truly\; positive\; observations}} \]
Recall:
| Truly Positive | Truly Negative | |
|---|---|---|
| Predicted Positive | 1 | 4 |
| Predicted Negative | 3 | 57 |
Here, precision = 1/(1+4) and recall = 1/(1+3).
Notes:
It is application/context dependent. Use your judgement/logical reasoning.
For example:
a 99% accuracy on cancer prediction may not be very useful. Why?
If we need patients with truly malignant cancer to be diagnosed correctly, what metric should we prioritize?
What if a classifier never guess positive except for the very few observations it is super confident in? What metric is affected?
Notes:
To add evaluation into our classification pipeline, we:
We’ll now talk about each step individually.
Notes:
The role of the training and test sets
Notes:
Don’t use your testing data to train your model!
Showing your classifier the labels of evaluation data is like cheating on a test; it’ll look more accurate than it really is
Notes:
There are two important things to do when splitting data.

Why?
Notes:
(tidymodels thankfully automatically does both of these things)
Choosing K is part of training. We want to choose K to maximize performance, but: - we can’t use test data to evaluate performance (cheating!) - we can’t use training data to evaluate performance (that’s what we trained with, so poor evaluation of true performance)
Solution: Split the training data further into training data and validation data sets
2a. Choose some candidate values of K
2b. Split the training data into two sets - one called the training set, another called the validation set
2c. For each K, train the model using training set only
2d. Evaluate accuracy (and/or other metrics of performance) for each using validation set only
2e. Pick the K that maximizes validation performance
But what if we get a bad training set? Just by chance?
Notes:
We can get a better estimate of performance by splitting multiple ways and averaging. Here’s an example:
Notes:
Overfitting: when your model is too sensitive to your training data; noise can influence predictions!
Underfitting: when your model isn’t sensitive enough to training data; useful information is ignored!
Source: http://kerckhoffs.schaathun.net/FPIA/Slides/09OF.pdf
Notes:
Which of these are under-, over-, and good fits?
Notes:
For KNN: small K overfits, large K underfits, and both cause lower accuracy
Notes:
IMPORTANT: We compute the metrics by predicting labels on testing data only
Recall the role of the test data
Notes:
Notes:
Let’s finish our iClicker problems and start on this week’s worksheet.
There is no tutorial this week.