Chapter 9 Regression II: linear regression

9.1 Overview

This chapter provides an introduction to linear regression models in a predictive context, focusing primarily on the case of a single predictor and a single response variable, and on how linear regression compares to K-nearest neighbours (K-NN) regression. The chapter concludes with a discussion of linear regression with multiple predictors.

9.2 Chapter learning objectives

By the end of the chapter, students will be able to:

  • Perform linear regression in R using tidymodels and evaluate it on a test dataset.
  • Compare and contrast predictions obtained from K-nearest neighbour regression to those obtained using simple ordinary least squares regression from the same dataset.
  • In R, overlay regression lines from geom_smooth on a single plot.

9.3 Simple linear regression

K-NN is not the only type of regression; another quite useful, and arguably the most common, type of regression is called simple linear regression. Simple linear regression is similar to K-NN regression in that the target/response variable is quantitative. However, it differs in how the training data are used to predict a value for a new observation. Instead of looking at the \(K\)-nearest neighbours and averaging over their values for a prediction, in simple linear regression all the training data points are used to create a straight line of best fit, and then the line is used to “look up” the predicted value.

Note: for simple linear regression there is only one response variable and only one predictor. Later in this chapter we introduce the more general linear regression case where more than one predictor can be used.

For example, let’s revisit the smaller version of the Sacramento housing data set. Recall that we have come across a new 2,000-square foot house we are interested in purchasing with an advertised list price of $350,000. Should we offer the list price, or is that over/undervalued?

To answer this question using simple linear regression, we use the data we have to draw the straight line of best fit through our existing data points:

Figure 9.1: Scatter plot of price (USD) versus house size (square footage) with line of best fit for subset of the Sacramento housing data set

The equation for the straight line is:

\[\text{house price} = \beta_0 + \beta_1 \cdot (\text{house size}),\] where

  • \(\beta_0\) is the vertical intercept of the line (the value where the line cuts the vertical axis)
  • \(\beta_1\) is the slope of the line

Therefore using the data to find the line of best fit is equivalent to finding coefficients \(\beta_0\) and \(\beta_1\) that parametrize (correspond to) the line of best fit. Once we have the coefficients, we can use the equation above to evaluate the predicted price given the value we have for the predictor/explanatory variable—here 2,000 square feet.
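Concretely, once we have estimates for \(\beta_0\) and \(\beta_1\), evaluating a prediction is just arithmetic. Below is a minimal sketch with made-up placeholder coefficient values, purely for illustration; the real estimates come from fitting the model later in this chapter.

# A purely illustrative sketch: predicting with a fitted line by hand.
# The coefficient values below are placeholders, not estimates from the data.
beta0 <- 20000   # hypothetical intercept
beta1 <- 100     # hypothetical slope (price increase per square foot)
beta0 + beta1 * 2000
## [1] 220000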

Figure 9.2: Scatter plot of price (USD) versus house size (square footage) with line of best fit and predicted price for a 2000 square foot home represented as a red dot

## [1] 287961.9

By using simple linear regression on this small data set to predict the sale price for a 2,000 square foot house, we get a predicted value of $287962. But wait a minute…how exactly does simple linear regression choose the line of best fit? Many different lines could be drawn through the data points. We show some examples below:

Figure 9.3: Scatter plot of price (USD) versus house size (square footage) with many possible lines that could be drawn through the data points

Simple linear regression chooses the straight line of best fit by choosing the line that minimizes the average squared vertical distance between itself and each of the observed data points. From the lines shown above, that is the blue line. What exactly do we mean by the vertical distance between the predicted values (which fall along the line of best fit) and the observed data points? We illustrate these distances in the plot below with a red line:

Figure 9.4: Scatter plot of price (USD) versus house size (square footage) with the vertical distances between the predicted values and the observed data points

To assess the predictive accuracy of a simple linear regression model, we use RMSPE—the same measure of predictive performance we used with K-NN regression.
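As a reminder of how that computation works, here is a minimal sketch; test_predictions is a hypothetical data frame with the observed prices in a price column and the model's predictions in a .pred column.

# A sketch of the RMSPE calculation itself (test_predictions is hypothetical):
# square the prediction errors, average them, and take the square root.
rmspe <- test_predictions %>%
  mutate(squared_error = (price - .pred)^2) %>%
  summarize(rmspe = sqrt(mean(squared_error))) %>%
  pull(rmspe)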

9.4 Linear regression in R

We can perform simple linear regression in R using tidymodels in a very similar manner to how we performed K-NN regression. To do this, instead of creating a nearest_neighbor model specification with the kknn engine, we use a linear_reg model specification with the lm engine. Another difference is that we do not need to choose \(K\) in the context of linear regression, and so we do not need to perform cross-validation. Below we illustrate how we can use the usual tidymodels workflow to predict house sale price given house size with a simple linear regression approach, using the full Sacramento real estate data set.

An additional difference that you will notice below is that we do not standardize (i.e., scale and center) our predictors. In K-nearest neighbours models, recall that the model fit changes depending on whether we standardize first or not. In linear regression, standardization does not affect the fit (it does affect the coefficients in the equation, though!). So you can standardize if you want—it won’t hurt anything—but if you leave the predictors in their original form, the best fit coefficients are usually easier to interpret afterward.
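To illustrate this point, the sketch below (which assumes the sacramento_train data frame created just below) adds standardization steps to the recipe. A workflow fit with this recipe would report different coefficients, but would produce exactly the same predictions as the unstandardized fit.

# A sketch only: standardizing the predictor before fitting linear regression.
# (Assumes sacramento_train, which is created in the next code chunk.)
standardized_recipe <- recipe(price ~ sqft, data = sacramento_train) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())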

As usual, we start by putting some test data away in a lock box that we can come back to after we choose our final model. Let’s take care of that now.

# split the data: 60% training, 40% testing, stratified by price
set.seed(1234) # makes the random split reproducible
sacramento_split <- initial_split(sacramento, prop = 0.6, strata = price)
sacramento_train <- training(sacramento_split)
sacramento_test <- testing(sacramento_split)

Now that we have our training data, we will create the model specification and recipe, and fit our simple linear regression model:

# model specification: linear regression, fit with ordinary least squares (lm)
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# recipe: predict price using house size
lm_recipe <- recipe(price ~ sqft, data = sacramento_train)

# combine the recipe and model specification, then fit on the training data
lm_fit <- workflow() %>%
  add_recipe(lm_recipe) %>%
  add_model(lm_spec) %>%
  fit(data = sacramento_train)
lm_fit
## ══ Workflow [trained] ═════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ───────────────────────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ──────────────────────────────────────────────────────────────────────────────────────────
## 
## Call:
## stats::lm(formula = formula, data = data)
## 
## Coefficients:
## (Intercept)         sqft  
##       15059          138

Our coefficients are (intercept) \(\beta_0=\) 15059 and (slope) \(\beta_1=\) 138. This means that the equation of the line of best fit is \[\text{house price} = 15059 + 138\cdot (\text{house size}),\] and that the model predicts that houses start at $15059 for 0 square feet, and that every extra square foot increases the cost of the house by $138. Finally, we predict on the test data set to assess how well our model does:

lm_test_results <- lm_fit %>%
  predict(sacramento_test) %>%               # predict prices for the test set
  bind_cols(sacramento_test) %>%             # attach the observed values
  metrics(truth = price, estimate = .pred)   # compute RMSPE (rmse), R-squared, and MAE
lm_test_results
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard   85161.   
## 2 rsq     standard       0.572
## 3 mae     standard   62608.

Our final model’s test error as assessed by RMSPE is 85161. Remember that this is in units of the target/response variable, and here that is US Dollars (USD). Does this mean our model is “good” at predicting house sale price based on the predictor of home size? Answering this is tricky and requires using domain knowledge and thinking about the application for which you are using the prediction.

To visualize the simple linear regression model, we can plot the predicted house price across all possible house sizes we might encounter superimposed on a scatter plot of the original housing price data. There is a plotting function in the tidyverse, geom_smooth, that allows us to do this easily by adding a layer on our plot with the simple linear regression predicted line of best fit. By default, geom_smooth also adds a plausible range around this line that we are not interested in at this point, so to avoid plotting it, we provide the argument se = FALSE in our call to geom_smooth.

lm_plot_final <- ggplot(sacramento_train, aes(x = sqft, y = price)) +
  geom_point(alpha = 0.4) +
  xlab("House size (square footage)") +
  ylab("Price (USD)") +
  scale_y_continuous(labels = dollar_format()) +
  geom_smooth(method = "lm", se = FALSE)
lm_plot_final

Figure 9.5: Scatter plot of price (USD) versus house size (square footage) with line of best fit for complete Sacramento housing data set

We can extract the coefficients from our model by accessing the fit object that is output by the fit function; we first have to extract it from the workflow using the pull_workflow_fit function, and then apply the tidy function to convert the result into a data frame:

coeffs <- tidy(pull_workflow_fit(lm_fit))
coeffs
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   15059.   8745.        1.72 8.56e-  2
## 2 sqft            138.      4.77     28.9  3.13e-113
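
With the coefficients in a data frame, we can also reproduce a prediction by hand. Here is a minimal sketch (using the coeffs data frame above together with dplyr's filter and pull) that predicts the price of a 2,000 square foot house directly from the line equation:

# A sketch: reconstructing a prediction from the extracted coefficients.
beta0 <- coeffs %>% filter(term == "(Intercept)") %>% pull(estimate)
beta1 <- coeffs %>% filter(term == "sqft") %>% pull(estimate)
beta0 + beta1 * 2000   # predicted price (USD) for a 2,000 square foot house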

9.5 Comparing simple linear and K-NN regression

Now that we have a general understanding of both simple linear and K-NN regression, we can start to compare and contrast these methods as well as the predictions made by them. To start, let’s look at the visualization of the simple linear regression model predictions for the Sacramento real estate data (predicting price from house size) and the “best” K-NN regression model obtained from the same problem:

Figure 9.6: Comparison of simple linear regression and K-NN regression

What differences do we observe from the visualization above? One obvious difference is the shape of the blue lines. In simple linear regression we are restricted to a straight line, whereas in K-NN regression our line is much more flexible and can be quite wiggly. But there is a major interpretability advantage in limiting the model to a straight line. A straight line can be defined by two numbers, the vertical intercept and the slope. The intercept tells us what the prediction is when all of the predictors are equal to 0; and the slope tells us by how much we predict the target/response variable will change for a one-unit increase in the predictor/explanatory variable. K-NN regression, as simple as it is to implement and understand, has no such interpretability from its wiggly line.

There can, however, also be a disadvantage to using a simple linear regression model in some cases, particularly when the relationship between the target and the predictor is not linear, but instead some other shape (e.g., curved or oscillating). In these cases the simple linear regression model will underfit (have high bias), meaning that the model's predicted values do not match the actual observed values very well. Such a model would likely have a quite high RMSE when assessing model goodness of fit on the training data and a quite high RMSPE when assessing model prediction quality on a test data set. On such a data set, K-NN regression may fare better. Additionally, there are other types of regression you can learn about in future courses that may do even better at predicting with such data.

How do these two models compare on the Sacramento house prices data set? On the visualizations above we also printed the RMSPE as calculated from predicting on the test data set that was not used to train/fit the models. The RMSPE for the simple linear regression model is slightly lower than the RMSPE for the K-NN regression model. Considering that the simple linear regression model is also more interpretable, if we were comparing these in practice we would likely choose to use the simple linear regression model.

Finally, note that the K-NN regression model becomes “flat” at the left and right boundaries of the data, while the linear model predicts a constant slope. Predicting outside the range of the observed data is known as extrapolation; K-NN and linear models behave quite differently when extrapolating. Depending on the application, the flat or constant slope trend may make more sense. For example, if our housing data were slightly different, the linear model might have predicted a negative price for a small house (if the intercept \(\beta_0\) were negative), which obviously does not match reality. On the other hand, the trend of increasing house size corresponding to increasing house price probably continues for large houses, so the “flat” extrapolation of K-NN likely does not match reality.
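
To probe the linear model's extrapolation behaviour directly, we can ask the fitted workflow to predict at house sizes outside the range of the training data. Here is a minimal sketch using the lm_fit object fit earlier; the house sizes below are made up for illustration.

# A sketch: predicting at house sizes well outside the observed data range
# to illustrate extrapolation (the sizes below are made up for illustration).
extrapolation_sizes <- tibble(sqft = c(50, 500, 10000))
predict(lm_fit, extrapolation_sizes)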

9.6 Multivariate linear regression

As in K-NN classification and K-NN regression, we can move beyond the simple case of one response variable and only one predictor and perform multivariate linear regression where we can have multiple predictors. In this case we fit a plane to the data, as opposed to a straight line.

To do this, we follow a very similar approach to what we did for K-NN regression; but recall that we do not need to use cross-validation to choose any parameters, nor do we need to standardize (i.e., center and scale) the data for linear regression. We demonstrate how to do this below using the Sacramento real estate data with both house size (measured in square feet) as well as number of bedrooms as our predictors, and continue to use house sale price as our outcome/target variable that we are trying to predict. We will start by changing the formula in the recipe to include both the sqft and beds variables as predictors:

lm_recipe <- recipe(price ~ sqft + beds, data = sacramento_train)

Now we can build our workflow and fit the model:

lm_fit <- workflow() %>%
  add_recipe(lm_recipe) %>%
  add_model(lm_spec) %>%
  fit(data = sacramento_train)
lm_fit
## ══ Workflow [trained] ═════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ───────────────────────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ──────────────────────────────────────────────────────────────────────────────────────────
## 
## Call:
## stats::lm(formula = formula, data = data)
## 
## Coefficients:
## (Intercept)         sqft         beds  
##     52690.1        154.8     -20209.4

And finally, we predict on the test data set to assess how well our model does:

lm_mult_test_results <- lm_fit %>%
  predict(sacramento_test) %>%
  bind_cols(sacramento_test) %>%
  metrics(truth = price, estimate = .pred)
lm_mult_test_results
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard   82835.   
## 2 rsq     standard       0.596
## 3 mae     standard   61008.

In the case of two predictors, our linear regression creates a plane of best fit, shown below:

Figure 9.7: Multivariate linear regression model’s predictions represented as a plane overlaid on top of the data, with price as the response and house size and the number of bedrooms as the two predictors

We see that the predictions from linear regression with two predictors form a flat plane. This is the hallmark of linear regression, and differs from the wiggly, flexible surface we get from other methods such as K-NN regression. As discussed, this is advantageous in one aspect: linear regression gives us an intercept and a slope for each predictor, so we can describe the plane mathematically. We can extract those slope values from our model object as shown below:

coeffs <- tidy(pull_workflow_fit(lm_fit))
coeffs
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   52690.  13745.        3.83 1.41e- 4
## 2 sqft            155.      6.72     23.0  4.46e-83
## 3 beds         -20209.   5734.       -3.52 4.59e- 4

And then use those slopes to write a mathematical equation to describe the prediction plane:

\[\text{house price} = \beta_0 + \beta_1\cdot(\text{house size}) + \beta_2\cdot(\text{number of bedrooms}),\] where:

  • \(\beta_0\) is the vertical intercept of the hyperplane (the value where it cuts the vertical axis)
  • \(\beta_1\) is the slope for the first predictor (house size)
  • \(\beta_2\) is the slope for the second predictor (number of bedrooms)

Finally, we can fill in the values for \(\beta_0\), \(\beta_1\) and \(\beta_2\) from the model output above to create the equation of the plane of best fit to the data:

\[\text{house price} = 52690 + 155\cdot (\text{house size}) -20209 \cdot (\text{number of bedrooms})\]
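
For example, plugging a hypothetical 2,000 square foot, 3 bedroom house into this equation (using the rounded coefficients above) gives a predicted price of

\[52690 + 155 \cdot 2000 - 20209 \cdot 3 = 302063,\]

or roughly $302,000.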

This model is more interpretable than the multivariate K-NN regression model; we can write a mathematical equation that explains how each predictor is affecting the predictions. But as always, we should look at the test error and ask whether linear regression is doing a better job of predicting compared to K-NN regression in this multivariate regression case. To do that we can use this linear regression model to predict on the test data to get our test error.

lm_mult_test_results
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard   82835.   
## 2 rsq     standard       0.596
## 3 mae     standard   61008.

The RMSPE for the multivariate linear regression model is 82835.42. This prediction error is less than the prediction error for the multivariate K-NN regression model, indicating that we should likely choose linear regression for predictions of house price on this data set. But we should also ask if this more complex model is doing a better job of predicting compared to our simple linear regression model with only a single predictor (house size). Revisiting the last section, we see that the RMSPE for our simple linear regression model with only a single predictor was 85160.85, which is slightly higher than that of our more complex model. Our model with two predictors provided a slightly better fit on test data than our model with just one.

But should we always end up choosing a model with more predictors rather than fewer? The answer is no; you never know which model will be the best until you go through the process of comparing their performance on held-out test data. Exploratory data analysis can give you some hints, but until you look at the prediction errors to compare the models you don’t really know. Additionally, here we compare test errors purely for the purposes of teaching. In practice, when you want to compare several regression models with differing numbers of predictor variables, you should use cross-validation on the training set only; in this case choosing the model is part of tuning, so you cannot use the test data. There are several well-known and more advanced methods to do this that are beyond the scope of this course; they include backward or forward selection, and L1 or L2 regularization (also known as Lasso and ridge regression, respectively).
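
As a concrete illustration of that workflow, the sketch below compares the one- and two-predictor linear models using 5-fold cross-validation on the training set only. It assumes the lm_spec object and the sacramento_train data frame from earlier; the recipe and result names are ours.

# A sketch: comparing the two linear models with cross-validation on the
# training set only (no test data involved in the comparison).
sacr_vfold <- vfold_cv(sacramento_train, v = 5, strata = price)

recipe_one <- recipe(price ~ sqft, data = sacramento_train)
recipe_two <- recipe(price ~ sqft + beds, data = sacramento_train)

results_one <- workflow() %>%
  add_recipe(recipe_one) %>%
  add_model(lm_spec) %>%
  fit_resamples(resamples = sacr_vfold) %>%
  collect_metrics()

results_two <- workflow() %>%
  add_recipe(recipe_two) %>%
  add_model(lm_spec) %>%
  fit_resamples(resamples = sacr_vfold) %>%
  collect_metrics()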

9.7 The other side of regression

So far in this textbook we have used regression only in the context of prediction. However, regression is also a powerful method to understand and/or describe the relationship between a quantitative response variable and one or more explanatory variables. Extending the case we have been working with in this chapter (where we are interested in house price as the outcome/response variable), we might also be interested in describing the individual effects of house size and the number of bedrooms on house price, quantifying how big each of these effects are, and assessing how accurately we can estimate each of these effects. This side of regression is the topic of many follow-on statistics courses and beyond the scope of this course.

9.8 Additional resources