20 R testing example

20.0.1 Example of workflow for writing functions and tests for data science

Let’s say we want to write a function for a task we repeatedly perform in our data analysis, for example, summarizing the number of observations in each class. This is a common task performed for almost every classification problem: we examine how many classes there are to understand whether we are facing a binary or multi-class classification problem, as well as whether there are any class imbalances that we may need to deal with before tuning our models.

1. Write the function specifications and documentation - but do not implement the function:

The first thing we should do is write the function specifications and documentation. This can be effectively represented by an empty function and roxygen2-styled documentation in R, as shown below:

#' Count class observations
#'
#' Creates a new data frame with two columns,
#' listing the classes present in the input data frame,
#' and the number of observations for each class.
#'
#' @param data_frame A data frame or data frame extension (e.g. a tibble).
#' @param class_col unquoted column name of column containing class labels
#'
#' @return A data frame with two columns.
#' The first column (named class) lists the classes from the input data frame.
#' The second column (named count) lists the number of observations
#' for each class from the input data frame.
#' It will have one row for each class present in input data frame.
#'
#' @export
#' @examples
#' count_classes(mtcars, am)
count_classes <- function(data_frame, class_col) {
  # returns a data frame with two columns: class and count
}
2. Plan the test cases and document them:
Next, we should plan out our test cases and start to document them. At this point we can sketch out a skeleton for our test cases with code, but we are not yet ready to write them, as we first will need to reproducibly create test data that is useful for assessing whether your function works as expected. So considering our function specifications, some kinds of input we might anticipate our function may receive, and correspondingly what it should return is listed below:
Simple expected use test case #1
- Dataframe with 2 classes, with 2 observations per class

Function input:
- dataframe:
    class_labels values
  1       class1    0.2
  2       class2    0.5
  3       class1    0.8
  4       class2    0.5
- unquoted column name: class_labels

Expected function output:
- Dataframe (or tibble):
     class count
  1 class1     2
  2 class2     2
Simple expected use test case #2
- Dataframe with 2 classes, with 2 observations for one class, and only one observation in the other

Function input:
- dataframe:
    class_labels values
  1       class1    1.0
  2       class1    0.9
  3       class2    0.9
- unquoted column name: class_labels

Expected function output:
- Dataframe (or tibble):
     class count
  1 class1     2
  2 class2     1
Edge test case #1
- Dataframe with 1 class, with 2 observations for that class

Function input:
- dataframe:
    class_labels values
  1       class1    0.7
  2       class1    0.5
- unquoted column name: class_labels

Expected function output:
- Dataframe (or tibble):
     class count
  1 class1     2
Edge test case #2
- Dataframe with no class observations

Function input:
- dataframe:
    class_labels values
- unquoted column name: class_labels

Expected function output:
- Dataframe (or tibble):
    class count
20.0.2 Error test case #1
- A list with 2 classes, with 2 observations for each class

Function input:
- list:
  $class_labels
  [1] "class1" "class2" "class1" "class2"

  $values
  [1] 0.4 0.7 0.0 0.6
- unquoted list element name: class_labels

Expected function output:
- Error:
  Error: `data_frame` should be a dataframe or dataframe extension (e.g. a tibble)
Next, we sketch out a skeleton for the unit tests. For R, we will use the well-maintained and popular testthat R package for writing our tests. For extra resources on testthat beyond what is demonstrated here, we recommend reading:
- the testthat documentation
- the Testing chapter of the R packages book

With testthat we create a test_that statement for each related group of tests for a function. For our example, we will create the three test_that statements shown below:
library(testthat)
test_that("`count_classes` should return a data frame, or tibble,
with the number of rows corresponding to the number of unique classes
in the `class_col` from the original dataframe. The new dataframe
will have a `class column` whose values are the unique classes,
and a `count` column, whose values will be the number of observations
for each class", {
# "expected use cases" tests to be added here
})
test_that("`count_classes` should return an empty data frame, or tibble,
if the input to the function is an empty data frame", {
# "edge cases" test to be added here
})
test_that("`count_classes` should throw an error when incorrect types
are passed to the `data_frame` argument", {
# "error" tests to be added here
})
3. Create test data that is useful for assessing whether your function works as expected:
Now that we have a plan, we can create reproducible test data for that plan! When we do this, we want to keep our data as small and tractable as possible. We want to test things we know the answer to, or can at a minimum calculate by hand. We will use R code to reproducibly create the test data. We will need to do this for the data we will feed in as inputs to our function in the tests, as well as the data we expect our function to return.
library(dplyr)
set.seed(2024)
# test input data
two_classes_2_obs <- tibble(class_labels = rep(c("class1", "class2"), 2),
                            values = round(runif(4), 1))
two_classes_2_and_1_obs <- tibble(class_labels = c(rep("class1", 2), "class2"),
                                  values = round(runif(3), 1))
one_class_2_obs <- tibble(class_labels = c("class1", "class1"),
                          values = round(runif(2), 1))
empty_df <- tibble(class_labels = character(0),
                   values = double(0))
two_classes_two_obs_as_list <- list(class_labels = rep(c("class1", "class2"), 2),
                                    values = round(runif(4), 1))

# expected test outputs
two_classes_2_obs_output <- tibble(class = c("class1", "class2"),
                                   count = c(2, 2))
two_classes_2_and_1_obs_output <- tibble(class = c("class1", "class2"),
                                         count = c(2, 1))
one_class_2_obs_output <- tibble(class = "class1",
                                 count = 2)
empty_df_output <- tibble(class = character(0),
                          count = numeric(0))
4. Write the tests to evaluate your function based on the planned test cases and test data:
Now that we have the skeletons for our tests, and our reproducible test data, we can actually write the internals for our tests! We will do this by using expect_* functions from the testthat package. The table below shows some of the most commonly used expect_* functions; however, there are many more that can be found in the testthat expectations reference documentation.

testthat test structure:

test_that("Message to print if test fails", expect_*(...))

Common expect_* statements for use with test_that:
Is the object equal to a value?
- expect_identical - test two objects for being exactly equal
- expect_equal - compare R objects x and y testing ‘near equality’ (can set a tolerance)
- expect_equivalent - compare R objects x and y testing ‘near equality’ (can set a tolerance), without assessing attributes

Does code produce an output/message/warning/error?
- expect_error - tests if an expression throws an error
- expect_warning - tests whether an expression outputs a warning
- expect_output - tests that print output matches a specified value

Is the object true/false?
These are fall-back expectations that you can use when none of the other more specific expectations apply. The disadvantage is that you may get a less informative error message.
- expect_true - tests if the object returns TRUE
- expect_false - tests if the object returns FALSE
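To make these expectations concrete, here are a few standalone examples (with toy values chosen purely for illustration) that can be run directly in an R session:

library(testthat)

expect_equal(sum(c(1, 2, 3)), 6)            # passes: values are (near) equal
expect_identical(c("a", "b"), c("a", "b"))  # passes: objects are exactly equal
expect_error(stop("something went wrong"))  # passes: the expression throws an error
expect_true(is.data.frame(mtcars))          # passes: the expression returns TRUE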
test_that("`count_classes` should return a data frame, or tibble,
with the number of rows corresponding to the number of unique classes
in the `class_col` from the original dataframe. The new dataframe
will have a `class column` whose values are the unique classes,
and a `count` column, whose values will be the number of observations
for each class", {
  expect_s3_class(count_classes(two_classes_2_obs, class_labels),
                  "data.frame")
  expect_equal(count_classes(two_classes_2_obs, class_labels),
               two_classes_2_obs_output,
               ignore_attr = TRUE)
  expect_equal(count_classes(two_classes_2_and_1_obs, class_labels),
               two_classes_2_and_1_obs_output,
               ignore_attr = TRUE)
})
── Failure: `count_classes` should return a data frame, or tibble,
with the number of rows corresponding to the number of unique classes
in the `class_col` from the original dataframe. The new dataframe
will have a `class column` whose values are the unique classes,
and a `count` column, whose values will be the number of observations
for each class ──
count_classes(two_classes_2_obs, class_labels) is not an S3 object
── Failure: `count_classes` should return a data frame, or tibble,
with the number of rows corresponding to the number of unique classes
in the `class_col` from the original dataframe. The new dataframe
will have a `class column` whose values are the unique classes,
and a `count` column, whose values will be the number of observations
for each class ──
count_classes(two_classes_2_obs, class_labels) not equal to `two_classes_2_obs_output`.
target is NULL, current is tbl_df
── Failure: `count_classes` should return a data frame, or tibble,
with the number of rows corresponding to the number of unique classes
in the `class_col` from the original dataframe. The new dataframe
will have a `class column` whose values are the unique classes,
and a `count` column, whose values will be the number of observations
for each class ──
count_classes(two_classes_2_and_1_obs, class_labels) not equal to `two_classes_2_and_1_obs_output`.
target is NULL, current is tbl_df
Error:
! Test failed
test_that("`count_classes` should return an empty data frame, or tibble,
if the input to the function is an empty data frame", {
  expect_equal(count_classes(one_class_2_obs, class_labels),
               one_class_2_obs_output,
               ignore_attr = TRUE)
  expect_equal(count_classes(empty_df, class_labels),
               empty_df_output,
               ignore_attr = TRUE)
})
── Failure: `count_classes` should return an empty data frame, or tibble,
if the input to the function is an empty data frame ──
count_classes(one_class_2_obs, class_labels) not equal to `one_class_2_obs_output`.
target is NULL, current is tbl_df
── Failure: `count_classes` should return an empty data frame, or tibble,
if the input to the function is an empty data frame ──
count_classes(empty_df, class_labels) not equal to `empty_df_output`.
target is NULL, current is tbl_df
Error:
! Test failed
test_that("`count_classes` should throw an error when incorrect types
are passed to the `data_frame` argument", {
expect_error(count_classes(two_classes_two_obs_as_list, class_labels))
})
── Failure: `count_classes` should throw an error when incorrect types
are passed to the `data_frame` argument ──
`count_classes(two_classes_two_obs_as_list, class_labels)` did not throw an error.
Error:
! Test failed
Wait, what??? Most of our tests fail…
Yes, we expect that; we haven’t written our function body yet!
5. Implement the function by writing the needed code in the function body to pass the tests:
FINALLY!! We can write the function body for our function! And then call our tests to see if they pass!
#' Count class observations
#'
#' Creates a new data frame with two columns,
#' listing the classes present in the input data frame,
#' and the number of observations for each class.
#'
#' @param data_frame A data frame or data frame extension (e.g. a tibble).
#' @param class_col unquoted column name of column containing class labels
#'
#' @return A data frame with two columns.
#' The first column (named class) lists the classes from the input data frame.
#' The second column (named count) lists the number of observations for each class from the input data frame.
#' It will have one row for each class present in input data frame.
#'
#' @export
#'
#' @examples
#' count_classes(mtcars, am)
count_classes <- function(data_frame, class_col) {
  if (!is.data.frame(data_frame)) {
    stop("`data_frame` should be a data frame or data frame extension (e.g. a tibble)")
  }
  data_frame |>
    dplyr::group_by({{ class_col }}) |>
    dplyr::summarize(count = dplyr::n()) |>
    dplyr::rename("class" = {{ class_col }})
}
Note: we recommend using the PACKAGE_NAME::FUNCTION() syntax when writing functions that will be sourced into other files in R, to make it explicitly clear what external packages they depend on. This becomes even more important when we create R packages from our functions later. Also note that group_by will throw a fairly useful error message if class_col is not found in data_frame, so we can let group_by handle that error case instead of writing our own exception for it.
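For illustration (this call is hypothetical and not part of our test suite), passing a column name that does not exist in the data frame is caught inside dplyr::group_by() rather than by a check we wrote ourselves:

# `not_a_column` does not exist in `mtcars`, so dplyr::group_by()
# (called inside count_classes) signals an informative error
count_classes(mtcars, not_a_column)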
test_that("`count_classes` should return a data frame, or tibble,
with the number of rows corresponding to the number of unique classes
in the `class_col` from the original dataframe. The new dataframe
will have a `class column` whose values are the unique classes,
and a `count` column, whose values will be the number of observations
for each class", {
  expect_s3_class(count_classes(two_classes_2_obs, class_labels),
                  "data.frame")
  expect_equal(count_classes(two_classes_2_obs, class_labels),
               two_classes_2_obs_output,
               ignore_attr = TRUE)
  expect_equal(count_classes(two_classes_2_and_1_obs, class_labels),
               two_classes_2_and_1_obs_output,
               ignore_attr = TRUE)
})
Test passed 😸
test_that("`count_classes` should return an empty data frame, or tibble,
if the input to the function is an empty data frame", {
  expect_equal(count_classes(one_class_2_obs, class_labels),
               one_class_2_obs_output,
               ignore_attr = TRUE)
  expect_equal(count_classes(empty_df, class_labels),
               empty_df_output,
               ignore_attr = TRUE)
})
Test passed 🥇
test_that("`count_classes` should throw an error when incorrect types
are passed to the `data_frame` argument", {
expect_error(count_classes(two_classes_two_obs_as_list, class_labels))
})
Test passed 🎉
The tests passed!
Are we done? For the purposes of this demo, yes! However, in practice you would usually cycle through steps 2-5 two to three more times to further improve your tests and function!
Discussion: Does test-driven development afford testability? How might it do so? Let’s discuss controllability, observability, isolateability, and automatability in our case study of test-driven development of count_classes.
20.0.3 Where do the function and test files go?
In the workflow above, we skipped over where to put the functions we will use in our data analyses, where to put the tests for those functions, and how to call those tests!
We summarize the answer to these questions below, but highly recommend you explore and test out our demonstration GitHub repository that has a minimal working example of this: https://github.com/ttimbers/demo-tests-ds-analysis
Where does the function go?
In R, functions should be abstracted to R scripts (plain text files that end in .R) which live in the project’s R directory. Commonly we name the R script with the same name as the function (however, we might choose a more general name if the R script contains many functions).

In the analysis file where we call the function (e.g. eda.ipynb) we need to call source("PATH_TO_FILE_CONTAINING_FUNCTION") before we are able to use the function(s) contained in that R script inside our analysis file.
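For our count_classes example, assuming the function lives in R/count_classes.R and the analysis file sits at the project root, this might look like the sketch below:

# near the top of the analysis file (e.g. eda.ipynb)
source("R/count_classes.R")

# the sourced function can now be used in the analysis
count_classes(mtcars, am)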
Where do the tests go?
The tests for the function should live in tests/testthat/test-FUNCTION_NAME.R, and the code to reproducibly generate helper data for the tests lives in tests/testthat/helper-FUNCTION_NAME.R. The test suite can be run via testthat::test_dir("tests/testthat"), which first runs any files that begin with helper* and then any files that begin with test*.
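Putting these conventions together for our count_classes example, the project might be laid out roughly as follows (file names are illustrative):

project/
├── R/
│   └── count_classes.R              # function definition
├── tests/
│   └── testthat/
│       ├── helper-count_classes.R   # reproducibly generates test data
│       └── test-count_classes.R     # test_that() statements
└── eda.ipynb                        # analysis file that source()s the function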
Convenience functions for setting this up
Several usethis R package functions can be used to set up the file and directory structure needed for this:
- usethis::use_r("FUNCTION_NAME") can be used to create the R script file the function should live in, inside the R directory
- usethis::use_testthat() can be used to create the necessary test directories to use testthat’s automated test suite execution function (testthat::test_dir("tests/testthat"))
- usethis::use_test("FUNCTION_NAME") can be used to create the test file for each function

Note: tests/testthat/helper-FUNCTION_NAME.R needs to be created manually, as there is no usethis function to automate this.
20.1 Reproducibly generating test data
As highlighted above, wherever possible we should use code to generate reproducible, simple, and tractable helper data for our tests. When using the testthat R package to automate the running of the test suite, the convention is to put such code in a file named helper-FUNCTION_NAME.R, which should live in the tests/testthat directory.
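For our count_classes example, the helper file (here assumed to be tests/testthat/helper-count_classes.R) would simply contain the reproducible test data code from step 3 above, for instance:

# tests/testthat/helper-count_classes.R
library(dplyr)
set.seed(2024)

# test input data
two_classes_2_obs <- tibble(class_labels = rep(c("class1", "class2"), 2),
                            values = round(runif(4), 1))

# expected test output
two_classes_2_obs_output <- tibble(class = c("class1", "class2"),
                                   count = c(2, 2))

The remaining test inputs and expected outputs from step 3 would be added to the same file in the same way.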
20.2 Common types of test levels in data science
- Unit tests - exercise individual components, usually methods or functions, in isolation. This kind of testing is usually quick to write and the tests incur low maintenance effort since they touch such small parts of the system. They typically ensure that the unit fulfills its contract, making test failures more straightforward to understand. This is the kind of test we wrote for our count_classes example above.
- Integration tests - exercise groups of components to ensure that their contained units interact correctly together. Integration tests touch much larger pieces of the system and are more prone to spurious failure. Since these tests validate many different units in concert, identifying the root cause of a specific failure can be difficult. In data science, this might be testing whether several functions that call each other, or run in sequence, work as expected (e.g., tests for a tidymodels workflow function); a rough sketch is shown after this list.
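As a rough sketch only (clean_data is a hypothetical helper, not a function defined in this chapter), an integration-style test might check that two functions behave correctly when run in sequence:

test_that("`clean_data` and `count_classes` should work together in sequence", {
  # `clean_data` is a hypothetical function that, e.g., drops rows with missing values
  cleaned <- clean_data(two_classes_2_obs)
  result <- count_classes(cleaned, class_labels)
  expect_s3_class(result, "data.frame")
  expect_equal(nrow(result), 2)
})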
20.3 Observability of unit outputs in data science
Observability is defined as the extent to which the response of the code under test (here our functions) to a test can be verified.
Questions we should ask when trying to understand how observable our tests are:
- What do we have to do to identify pass/fail?
- How expensive is it to do this?
- Can we extract the result from the code under test?
- Do we know enough to identify pass/fail?
Source: CPSC 410 class notes from Reid Holmes, UBC
These questions are easier to answer and address for code that creates simpler data science objects such as data frames, as in the example above. However, when our code under test does something more complex, such as create a plot object, these questions are harder to answer, or can be answered less fully…
Let’s talk about how we might test code to create plots!
20.3.1 Visual regression testing
When we use certain data visualization libraries, we might think that we can test all code that generates data visualizations similar to code that generates more traditional data objects, such as data frames.
For example, when we create a scatter plot object with ggplot2, we can easily observe many of its values and attributes. We show an example below:
options(repr.plot.width = 4, repr.plot.height = 4)

cars_ggplot_scatter <- ggplot2::ggplot(mtcars, ggplot2::aes(hp, mpg)) +
  ggplot2::geom_point()

cars_ggplot_scatter$layers[[1]]$geom
<ggproto object: Class GeomPoint, Geom, gg>
aesthetics: function
default_aes: uneval
draw_group: function
draw_key: function
draw_layer: function
draw_panel: function
extra_params: na.rm
handle_na: function
non_missing_aes: size shape colour
optional_aes:
parameters: function
rename_size: FALSE
required_aes: x y
setup_data: function
setup_params: function
use_defaults: function
super: <ggproto object: Class Geom, gg>
cars_ggplot_scatter$mapping$x
<quosure>
expr: ^hp
env: global
And so we could write some tests for a function that created a ggplot2
object like so:
#' scatter2d
#'
#' A short-cut function for creating 2 dimensional scatterplots via ggplot2.
#'
#' @param data data.frame or tibble
#' @param x unquoted column name to plot on the x-axis from data data.frame or tibble
#' @param y unquoted column name to plot on the y-axis from data data.frame or tibble
#'
#' @return A ggplot2 scatter plot object.
#' @export
#'
#' @examples
#' scatter2d(mtcars, hp, mpg)
scatter2d <- function(data, x, y) {
  ggplot2::ggplot(data, ggplot2::aes(x = {{ x }}, y = {{ y }})) +
    ggplot2::geom_point()
}
helper_data <- dplyr::tibble(x_vals = c(2, 4, 6),
                             y_vals = c(2, 4, 6))
helper_plot2d <- scatter2d(helper_data, x_vals, y_vals)
test_that('Plot should use geom_point and map x to x-axis, and y to y-axis.', {
expect_true("GeomPoint" %in% c(class(helper_plot2d$layers[[1]]$geom)))
expect_true("x_vals" == rlang::get_expr(helper_plot2d$mapping$x))
expect_true("y_vals" == rlang::get_expr(helper_plot2d$mapping$y))
})
Test passed 🌈
However, when we create a similar plot object using base R, we do not get an object back at all…
cars_scatter <- plot(mtcars$hp, mtcars$mpg)
typeof(cars_scatter)
[1] "NULL"
class(cars_scatter)
[1] "NULL"
So as you can see, testing plot objects can be more challenging. In the case of several commonly used plotting functions and packages in R and Python, the objects created are not rich objects with attributes that can be easily accessed (or accessed at all). Plotting packages like ggplot2 (R) and altair (Python), which do create rich objects with observable values and attributes, appear to be exceptions rather than the rule. Thus, regression testing against an image generated by the plotting function is often the “best we can do”, or, because of this history, what is commonly done.
Regression testing is defined as tests that check that recent changes to the code base do not break already implemented features.
Thus, once a desired plot is generated from the plotting function, visual regression tests can be used to ensure that further code refactoring does not change the plot function. Tools for this exist for R in the vdiffr package. Matplotlib uses visual regression testing as well; you can see the docs for examples of this here.
Visual regression testing with vdiffr
Say you have a function that creates a nicely formatted scatter plot using ggplot2
, such as the one shown below:
pretty_scatter <- function(.data, x_axis_col, y_axis_col) {
  ggplot2::ggplot(data = .data,
                  ggplot2::aes(x = {{ x_axis_col }}, y = {{ y_axis_col }})) +
    ggplot2::geom_point(alpha = 0.8, colour = "steelblue", size = 3) +
    ggplot2::theme_bw() +
    ggplot2::theme(text = ggplot2::element_text(size = 14))
}
library(palmerpenguins)
library(ggplot2)
penguins_scatter <- pretty_scatter(penguins, bill_length_mm, bill_depth_mm) +
  labs(x = "Bill length (mm)", y = "Bill depth (mm)")
penguins_scatter
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
What is so pretty about this scatter plot? Compared to the default settings of a scatter plot created in ggplot2, this scatter plot has a white instead of grey background, has blue points instead of black, has larger points, and the points have a bit of transparency so you can see some overlapping data points.
Now, say that you want to write tests to make sure that as you further develop and refactor your data visualization code, you do not break it or change the plot (because you have decided you are happy with what it looks like). You can use the vdiffr visual regression testing package to do this. First, you need to abstract the function to an R script that lives in the R directory. For this case, we would create a file called R/pretty_scatter.R that houses the pretty_scatter function shown above.
Then you need to set up a tests directory and a test file that work with the testthat framework to house your tests (we recommend using usethis::use_testthat() and usethis::use_test("FUNCTION_NAME") to do this).
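Assuming the function file is named R/pretty_scatter.R, that setup might look like the following, run once from the R console at the project root:

# create the tests/testthat/ directory structure for testthat
usethis::use_testthat()

# create tests/testthat/test-pretty_scatter.R to hold the tests
usethis::use_test("pretty_scatter")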
Finally, add an expectation with vdiffr::expect_doppelganger to your test_that statement:
library(palmerpenguins)
library(ggplot2)
library(vdiffr)
source("../../R/pretty_scatter.R")
penguins_scatter <- pretty_scatter(penguins, bill_length_mm, bill_depth_mm) +
  labs(x = "Bill length (mm)", y = "Bill depth (mm)")
penguins_scatter
test_that("refactoring our code should not change our plot", {
expect_doppelganger("pretty scatter", penguins_scatter)
})
Then, when you run testthat::test_dir("tests/testthat") to run your test suite for the first time, it will take a snapshot of the figure created in your test for that visualization and save it to tests/testthat/_snaps/EXPECT_DOPPELGANGER_TITLE.svg. Then, as you refactor your code and run testthat::test_dir("tests/testthat"), it will compare a new snapshot of the figure with the existing one. If they differ, the tests will fail. You can then run testthat::snapshot_review() to get an interactive viewer that lets you compare the two data visualizations. From there you can either accept the new snapshot (if you wish to include the changes to the data visualization as part of your code revision and refactoring), or stop the app and revert/fix some of your code changes so that the data visualization is not unintentionally changed.
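In practice, this cycle (assuming the tests live in tests/testthat) boils down to re-running two commands as you refactor:

# re-run the test suite; snapshot tests fail if the newly generated figure
# differs from the stored snapshot in tests/testthat/_snaps/
testthat::test_dir("tests/testthat")

# if a snapshot test fails, open the interactive viewer to compare the old
# and new figures, then accept or reject the new snapshot
testthat::snapshot_review()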
Below we show an example of running testthat::snapshot_review() after we made our tests fail by removing alpha = 0.8 from our pretty_scatter function source.

vdiffr demo
In this GitHub repository, we have created a vdiffr demo based on the case above: https://github.com/ttimbers/vdiffr-demo
To get experience and practice using this, we recommend forking the repository and then cloning your fork so that you can try it out and build off it locally.
1. Inside RStudio, run testthat::test_dir("tests/testthat") to ensure you can get the tests to pass as they exist in the demo.
2. Change something about the code in R/pretty_scatter.R that will change what the plot looks like (text size, point colour, type of geom used, etc.).
3. Run testthat::test_dir("tests/testthat") and see if the tests fail. If they do, run testthat::snapshot_review() to explore the differences in the two image snapshots. You may be prompted to install a couple of R packages to get this working.
4. Add another plot function to the project and create a test for it using testthat and vdiffr.
20.4 Attribution:
- Advanced R by Hadley Wickham
- The Tidynomicon by Greg Wilson
- CPSC 310 and CPSC 410 class notes by Reid Holmes, UBC