17 R Data Validation: pointblank

See this workshop repo for reference: https://github.com/posit-conf-2024/ds-workflows-r
Specifically this slide deck: https://posit-conf-2024.github.io/ds-workflows-r/slides/03_data_validation_and_alerting.html#/

library(pointblank)
data(small_table)
small_table

# A tibble: 13 × 8
   date_time           date           a b             c      d e     f    
   <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
 1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-345     3  3423. TRUE  high 
 2 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low  
 3 2016-01-05 13:32:00 2016-01-05     6 8-kdg-938     3  2343. TRUE  high 
 4 2016-01-06 17:23:00 2016-01-06     2 5-jdo-903    NA  3892. FALSE mid  
 5 2016-01-09 12:36:00 2016-01-09     8 3-ldm-038     7   284. TRUE  low  
 6 2016-01-11 06:15:00 2016-01-11     4 2-dhe-923     4  3291. TRUE  mid  
 7 2016-01-15 18:46:00 2016-01-15     7 1-knw-093     3   843. TRUE  high 
 8 2016-01-17 11:27:00 2016-01-17     4 5-boe-639     2  1036. FALSE low  
 9 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
10 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
11 2016-01-26 20:07:00 2016-01-26     4 2-dmx-010     7   834. TRUE  low  
12 2016-01-28 02:51:00 2016-01-28     2 7-dmx-010     8   108. FALSE low  
13 2016-01-30 11:23:00 2016-01-30     1 3-dka-303    NA  2230. TRUE  high

17.1 Validation Rules

All the validation rule functions begin with col_*(): https://rstudio.github.io/pointblank/reference/index.html#validation-expectation-and-test-functions

Here we want to check that all values in the a column are less than 10. We can use the col_vals_lt()

small_table |>
  col_vals_lt(a, value = 10)

# A tibble: 13 × 8
   date_time           date           a b             c      d e     f    
   <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
 1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-345     3  3423. TRUE  high 
 2 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low  
 3 2016-01-05 13:32:00 2016-01-05     6 8-kdg-938     3  2343. TRUE  high 
 4 2016-01-06 17:23:00 2016-01-06     2 5-jdo-903    NA  3892. FALSE mid  
 5 2016-01-09 12:36:00 2016-01-09     8 3-ldm-038     7   284. TRUE  low  
 6 2016-01-11 06:15:00 2016-01-11     4 2-dhe-923     4  3291. TRUE  mid  
 7 2016-01-15 18:46:00 2016-01-15     7 1-knw-093     3   843. TRUE  high 
 8 2016-01-17 11:27:00 2016-01-17     4 5-boe-639     2  1036. FALSE low  
 9 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
10 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
11 2016-01-26 20:07:00 2016-01-26     4 2-dmx-010     7   834. TRUE  low  
12 2016-01-28 02:51:00 2016-01-28     2 7-dmx-010     8   108. FALSE low  
13 2016-01-30 11:23:00 2016-01-30     1 3-dka-303    NA  2230. TRUE  high

If the table passes the rule, you get a table back. Otherwise you will get an error:

small_table |>
  col_vals_lt(a, value = 5)

Error: Exceedance of failed test units where values in `a` should have been < `5`.
The `col_vals_lt()` validation failed beyond the absolute threshold level (1).
* failure level (3) >= failure threshold (1)

This allows you to chain validation rules.

small_table |>
  col_vals_lt(a, value = 10) |>
  col_vals_between(d, left = 0, right = 5000) |>
  col_vals_in_set(f, set = c("low", "mid", "high")) |>
  col_vals_regex(b, regex = "^[0-9]-[a-z]{3}-[0-9]{3}$")

Error: Exceedance of failed test units where values in `d` should have been between `0` and `5000`.
The `col_vals_between()` validation failed beyond the absolute threshold level (1).
* failure level (1) >= failure threshold (1)

We can fix this by either fixing the data, or the actual test.

small_table |>
  col_vals_lt(a, value = 10) |>
  col_vals_between(d, left = 0, right = 10000) |>
  col_vals_in_set(f, set = c("low", "mid", "high")) |>
  col_vals_regex(b, regex = "^[0-9]-[a-z]{3}-[0-9]{3}$")

# A tibble: 13 × 8
   date_time           date           a b             c      d e     f    
   <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
 1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-345     3  3423. TRUE  high 
 2 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low  
 3 2016-01-05 13:32:00 2016-01-05     6 8-kdg-938     3  2343. TRUE  high 
 4 2016-01-06 17:23:00 2016-01-06     2 5-jdo-903    NA  3892. FALSE mid  
 5 2016-01-09 12:36:00 2016-01-09     8 3-ldm-038     7   284. TRUE  low  
 6 2016-01-11 06:15:00 2016-01-11     4 2-dhe-923     4  3291. TRUE  mid  
 7 2016-01-15 18:46:00 2016-01-15     7 1-knw-093     3   843. TRUE  high 
 8 2016-01-17 11:27:00 2016-01-17     4 5-boe-639     2  1036. FALSE low  
 9 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
10 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
11 2016-01-26 20:07:00 2016-01-26     4 2-dmx-010     7   834. TRUE  low  
12 2016-01-28 02:51:00 2016-01-28     2 7-dmx-010     8   108. FALSE low  
13 2016-01-30 11:23:00 2016-01-30     1 3-dka-303    NA  2230. TRUE  high

17.2 Validation Table

From the docs:

There are three things that should be noted here:

Validation steps: each step is a separate test on the table, focused on a certain aspect of the table.

Validation rules: the validation type is provided here along with key constraints.

Validation results: interrogation results are provided here, with a breakdown of test units (total, passing, and failing), threshold flags, and more.

Create the agent, apply validation rules, then interrogate() it.

agent <- small_table |>
  create_agent() |>
  col_vals_lt(a, value = 10) |>
  col_vals_between(d, left = 0, right = 5000) |>
  col_vals_in_set(f, set = c("low", "mid", "high")) |>
  col_vals_regex(b, regex = "^[0-9]-[a-z]{3}-[0-9]{3}$")

Running the interrogate() on the agent, will print out the validation table, but also

agent |>
  interrogate()

Pointblank Validation
[2025-03-21\|11:38:14] tibble small_table
		STEP	COLUMNS	VALUES	TBL	EVAL	UNITS	PASS	FAIL	W	S	N	EXT
	1	`col_vals_lt()`	`▮a`	`10`		✓	`13`	`13` `1`	`0` `0`	—	—	—	—
	2	`col_vals_between()`	`▮d`	`[0, 5,000]`		✓	`13`	`12` `0.92308`	`1` `0.07692`	—	—	—
	3	`col_vals_in_set()`	`▮f`	`low, mid, high`		✓	`13`	`13` `1`	`0` `0`	—	—	—	—
	4	`col_vals_regex()`	`▮b`	`^[0-9]-[a-z]{3}-[0-9]{3}$`		✓	`13`	`13` `1`	`0` `0`	—	—	—	—
2025-03-21 11:38:14 UTC < 1 s 2025-03-21 11:38:14 UTC

17.3 Post-interrogation

There are a few post-interrogation steps you can do with your agents. One of the more useful ones may be separately looking at the passing and failing data.

The get_sundered_data() and get_data_extracts() are a few useful functions to accomplish this goal.

Here is our agent that has failing validation checks.

Important

Don’t forget to interrogate() your agent before running post-interrogation functions.

agent <- small_table |>
  create_agent() |>
  col_vals_lt(a, value = 10) |>
  col_vals_between(d, left = 0, right = 5000) |>
  col_vals_in_set(f, set = c("low", "mid", "high")) |>
  col_vals_regex(b, regex = "^[0-9]-[a-z]{3}-[0-9]{3}$") |>
  interrogate()

We can get a separate or combined version of our passing and failing observations.

get_sundered_data(agent, type = "pass")

# A tibble: 12 × 8
   date_time           date           a b             c     d e     f    
   <dttm>              <date>     <int> <chr>     <dbl> <dbl> <lgl> <chr>
 1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-345     3 3423. TRUE  high 
 2 2016-01-05 13:32:00 2016-01-05     6 8-kdg-938     3 2343. TRUE  high 
 3 2016-01-06 17:23:00 2016-01-06     2 5-jdo-903    NA 3892. FALSE mid  
 4 2016-01-09 12:36:00 2016-01-09     8 3-ldm-038     7  284. TRUE  low  
 5 2016-01-11 06:15:00 2016-01-11     4 2-dhe-923     4 3291. TRUE  mid  
 6 2016-01-15 18:46:00 2016-01-15     7 1-knw-093     3  843. TRUE  high 
 7 2016-01-17 11:27:00 2016-01-17     4 5-boe-639     2 1036. FALSE low  
 8 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9  838. FALSE high 
 9 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9  838. FALSE high 
10 2016-01-26 20:07:00 2016-01-26     4 2-dmx-010     7  834. TRUE  low  
11 2016-01-28 02:51:00 2016-01-28     2 7-dmx-010     8  108. FALSE low  
12 2016-01-30 11:23:00 2016-01-30     1 3-dka-303    NA 2230. TRUE  high

get_sundered_data(agent, type = "fail")

# A tibble: 1 × 8
  date_time           date           a b             c      d e     f    
  <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
1 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low

The combined option creates a .ph_combined column that you can use in a downstream process.

get_sundered_data(agent, type = "combined")

# A tibble: 13 × 9
   date_time           date           a b             c      d e     f    
   <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
 1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-345     3  3423. TRUE  high 
 2 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low  
 3 2016-01-05 13:32:00 2016-01-05     6 8-kdg-938     3  2343. TRUE  high 
 4 2016-01-06 17:23:00 2016-01-06     2 5-jdo-903    NA  3892. FALSE mid  
 5 2016-01-09 12:36:00 2016-01-09     8 3-ldm-038     7   284. TRUE  low  
 6 2016-01-11 06:15:00 2016-01-11     4 2-dhe-923     4  3291. TRUE  mid  
 7 2016-01-15 18:46:00 2016-01-15     7 1-knw-093     3   843. TRUE  high 
 8 2016-01-17 11:27:00 2016-01-17     4 5-boe-639     2  1036. FALSE low  
 9 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
10 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
11 2016-01-26 20:07:00 2016-01-26     4 2-dmx-010     7   834. TRUE  low  
12 2016-01-28 02:51:00 2016-01-28     2 7-dmx-010     8   108. FALSE low  
13 2016-01-30 11:23:00 2016-01-30     1 3-dka-303    NA  2230. TRUE  high 
# ℹ 1 more variable: .pb_combined <chr>

You can also use the get_data_extracts() function to get the values in a list.

get_data_extracts(agent)

$`2`
# A tibble: 1 × 8
  date_time           date           a b             c      d e     f    
  <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
1 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low

17.4 Pipeline Data Validation

When you want to run your validation checks non-interactively, you may want to use {pointblank} in the Pipeline Data Validation Workflow.

In this workflow we do not need to create an agent object, and rely on the actual warning or failures from the validation checks.

small_table %>%
  col_is_posix(date_time) %>%
  col_vals_in_set(f, set = c("low", "mid", "high")) %>%
  col_vals_lt(a, value = 10) %>%
  col_vals_regex(b, regex = "^[0-9]-[a-z]{3}-[0-9]{3}$") %>%
  col_vals_between(d, left = 0, right = 5000)

Error: Exceedance of failed test units where values in `d` should have been between `0` and `5000`.
The `col_vals_between()` validation failed beyond the absolute threshold level (1).
* failure level (1) >= failure threshold (1)

You can also set thresholds for when validation checks throw errors or warnings: https://rstudio.github.io/pointblank/articles/VALID-II.html#using-warn_on_fail-and-stop_on_fail-functions-to-generate-simple-action_levels