17  R Data Validation: pointblank

library(pointblank)
data(small_table)
small_table
# A tibble: 13 ร— 8
   date_time           date           a b             c      d e     f    
   <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
 1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-345     3  3423. TRUE  high 
 2 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low  
 3 2016-01-05 13:32:00 2016-01-05     6 8-kdg-938     3  2343. TRUE  high 
 4 2016-01-06 17:23:00 2016-01-06     2 5-jdo-903    NA  3892. FALSE mid  
 5 2016-01-09 12:36:00 2016-01-09     8 3-ldm-038     7   284. TRUE  low  
 6 2016-01-11 06:15:00 2016-01-11     4 2-dhe-923     4  3291. TRUE  mid  
 7 2016-01-15 18:46:00 2016-01-15     7 1-knw-093     3   843. TRUE  high 
 8 2016-01-17 11:27:00 2016-01-17     4 5-boe-639     2  1036. FALSE low  
 9 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
10 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
11 2016-01-26 20:07:00 2016-01-26     4 2-dmx-010     7   834. TRUE  low  
12 2016-01-28 02:51:00 2016-01-28     2 7-dmx-010     8   108. FALSE low  
13 2016-01-30 11:23:00 2016-01-30     1 3-dka-303    NA  2230. TRUE  high 

17.1 Validation Rules

All the validation rule functions begin with col_*(): https://rstudio.github.io/pointblank/reference/index.html#validation-expectation-and-test-functions

Here we want to check that all values in the a column are less than 10. We can use the col_vals_lt()

small_table |>
  col_vals_lt(a, value = 10)
# A tibble: 13 ร— 8
   date_time           date           a b             c      d e     f    
   <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
 1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-345     3  3423. TRUE  high 
 2 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low  
 3 2016-01-05 13:32:00 2016-01-05     6 8-kdg-938     3  2343. TRUE  high 
 4 2016-01-06 17:23:00 2016-01-06     2 5-jdo-903    NA  3892. FALSE mid  
 5 2016-01-09 12:36:00 2016-01-09     8 3-ldm-038     7   284. TRUE  low  
 6 2016-01-11 06:15:00 2016-01-11     4 2-dhe-923     4  3291. TRUE  mid  
 7 2016-01-15 18:46:00 2016-01-15     7 1-knw-093     3   843. TRUE  high 
 8 2016-01-17 11:27:00 2016-01-17     4 5-boe-639     2  1036. FALSE low  
 9 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
10 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
11 2016-01-26 20:07:00 2016-01-26     4 2-dmx-010     7   834. TRUE  low  
12 2016-01-28 02:51:00 2016-01-28     2 7-dmx-010     8   108. FALSE low  
13 2016-01-30 11:23:00 2016-01-30     1 3-dka-303    NA  2230. TRUE  high 

If the table passes the rule, you get a table back. Otherwise you will get an error:

small_table |>
  col_vals_lt(a, value = 5)
Error: Exceedance of failed test units where values in `a` should have been < `5`.
The `col_vals_lt()` validation failed beyond the absolute threshold level (1).
* failure level (3) >= failure threshold (1)

This allows you to chain validation rules.

small_table |>
  col_vals_lt(a, value = 10) |>
  col_vals_between(d, left = 0, right = 5000) |>
  col_vals_in_set(f, set = c("low", "mid", "high")) |>
  col_vals_regex(b, regex = "^[0-9]-[a-z]{3}-[0-9]{3}$")
Error: Exceedance of failed test units where values in `d` should have been between `0` and `5000`.
The `col_vals_between()` validation failed beyond the absolute threshold level (1).
* failure level (1) >= failure threshold (1)

We can fix this by either fixing the data, or the actual test.

small_table |>
  col_vals_lt(a, value = 10) |>
  col_vals_between(d, left = 0, right = 10000) |>
  col_vals_in_set(f, set = c("low", "mid", "high")) |>
  col_vals_regex(b, regex = "^[0-9]-[a-z]{3}-[0-9]{3}$")
# A tibble: 13 ร— 8
   date_time           date           a b             c      d e     f    
   <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
 1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-345     3  3423. TRUE  high 
 2 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low  
 3 2016-01-05 13:32:00 2016-01-05     6 8-kdg-938     3  2343. TRUE  high 
 4 2016-01-06 17:23:00 2016-01-06     2 5-jdo-903    NA  3892. FALSE mid  
 5 2016-01-09 12:36:00 2016-01-09     8 3-ldm-038     7   284. TRUE  low  
 6 2016-01-11 06:15:00 2016-01-11     4 2-dhe-923     4  3291. TRUE  mid  
 7 2016-01-15 18:46:00 2016-01-15     7 1-knw-093     3   843. TRUE  high 
 8 2016-01-17 11:27:00 2016-01-17     4 5-boe-639     2  1036. FALSE low  
 9 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
10 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
11 2016-01-26 20:07:00 2016-01-26     4 2-dmx-010     7   834. TRUE  low  
12 2016-01-28 02:51:00 2016-01-28     2 7-dmx-010     8   108. FALSE low  
13 2016-01-30 11:23:00 2016-01-30     1 3-dka-303    NA  2230. TRUE  high 

17.2 Validation Table

From the docs:

There are three things that should be noted here:

  • Validation steps: each step is a separate test on the table, focused on a certain aspect of the table.
  • Validation rules: the validation type is provided here along with key constraints.
  • Validation results: interrogation results are provided here, with a breakdown of test units (total, passing, and failing), threshold flags, and more.

Create the agent, apply validation rules, then interrogate() it.

agent <- small_table |>
  create_agent() |>
  col_vals_lt(a, value = 10) |>
  col_vals_between(d, left = 0, right = 5000) |>
  col_vals_in_set(f, set = c("low", "mid", "high")) |>
  col_vals_regex(b, regex = "^[0-9]-[a-z]{3}-[0-9]{3}$")

Running the interrogate() on the agent, will print out the validation table, but also

agent |>
  interrogate()
Pointblank Validation
[2025-03-21|11:38:14]

tibble small_table
STEP COLUMNS VALUES TBL EVAL UNITS PASS FAIL W S N EXT

1
col_vals_lt
 col_vals_lt()

โ–ฎa

10

โœ“ 13 13
1
0
0
โ€” โ€” โ€” โ€”

2
col_vals_between
 col_vals_between()

โ–ฎd

[0, 5,000]

โœ“ 13 12
0.92308
1
0.07692
โ€” โ€” โ€”

3
col_vals_in_set
 col_vals_in_set()

โ–ฎf

low, mid, high

โœ“ 13 13
1
0
0
โ€” โ€” โ€” โ€”

4
col_vals_regex
 col_vals_regex()

โ–ฎb

^[0-9]-[a-z]{3}-[0-9]{3}$

โœ“ 13 13
1
0
0
โ€” โ€” โ€” โ€”
2025-03-21 11:38:14 UTC < 1 s 2025-03-21 11:38:14 UTC

17.3 Post-interrogation

There are a few post-interrogation steps you can do with your agents. One of the more useful ones may be separately looking at the passing and failing data.

The get_sundered_data() and get_data_extracts() are a few useful functions to accomplish this goal.

Here is our agent that has failing validation checks.

Important

Donโ€™t forget to interrogate() your agent before running post-interrogation functions.

agent <- small_table |>
  create_agent() |>
  col_vals_lt(a, value = 10) |>
  col_vals_between(d, left = 0, right = 5000) |>
  col_vals_in_set(f, set = c("low", "mid", "high")) |>
  col_vals_regex(b, regex = "^[0-9]-[a-z]{3}-[0-9]{3}$") |>
  interrogate()

We can get a separate or combined version of our passing and failing observations.

get_sundered_data(agent, type = "pass")
# A tibble: 12 ร— 8
   date_time           date           a b             c     d e     f    
   <dttm>              <date>     <int> <chr>     <dbl> <dbl> <lgl> <chr>
 1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-345     3 3423. TRUE  high 
 2 2016-01-05 13:32:00 2016-01-05     6 8-kdg-938     3 2343. TRUE  high 
 3 2016-01-06 17:23:00 2016-01-06     2 5-jdo-903    NA 3892. FALSE mid  
 4 2016-01-09 12:36:00 2016-01-09     8 3-ldm-038     7  284. TRUE  low  
 5 2016-01-11 06:15:00 2016-01-11     4 2-dhe-923     4 3291. TRUE  mid  
 6 2016-01-15 18:46:00 2016-01-15     7 1-knw-093     3  843. TRUE  high 
 7 2016-01-17 11:27:00 2016-01-17     4 5-boe-639     2 1036. FALSE low  
 8 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9  838. FALSE high 
 9 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9  838. FALSE high 
10 2016-01-26 20:07:00 2016-01-26     4 2-dmx-010     7  834. TRUE  low  
11 2016-01-28 02:51:00 2016-01-28     2 7-dmx-010     8  108. FALSE low  
12 2016-01-30 11:23:00 2016-01-30     1 3-dka-303    NA 2230. TRUE  high 
get_sundered_data(agent, type = "fail")
# A tibble: 1 ร— 8
  date_time           date           a b             c      d e     f    
  <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
1 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low  

The combined option creates a .ph_combined column that you can use in a downstream process.

get_sundered_data(agent, type = "combined")
# A tibble: 13 ร— 9
   date_time           date           a b             c      d e     f    
   <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
 1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-345     3  3423. TRUE  high 
 2 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low  
 3 2016-01-05 13:32:00 2016-01-05     6 8-kdg-938     3  2343. TRUE  high 
 4 2016-01-06 17:23:00 2016-01-06     2 5-jdo-903    NA  3892. FALSE mid  
 5 2016-01-09 12:36:00 2016-01-09     8 3-ldm-038     7   284. TRUE  low  
 6 2016-01-11 06:15:00 2016-01-11     4 2-dhe-923     4  3291. TRUE  mid  
 7 2016-01-15 18:46:00 2016-01-15     7 1-knw-093     3   843. TRUE  high 
 8 2016-01-17 11:27:00 2016-01-17     4 5-boe-639     2  1036. FALSE low  
 9 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
10 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
11 2016-01-26 20:07:00 2016-01-26     4 2-dmx-010     7   834. TRUE  low  
12 2016-01-28 02:51:00 2016-01-28     2 7-dmx-010     8   108. FALSE low  
13 2016-01-30 11:23:00 2016-01-30     1 3-dka-303    NA  2230. TRUE  high 
# โ„น 1 more variable: .pb_combined <chr>

You can also use the get_data_extracts() function to get the values in a list.

get_data_extracts(agent)
$`2`
# A tibble: 1 ร— 8
  date_time           date           a b             c      d e     f    
  <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
1 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low  

17.4 Pipeline Data Validation

When you want to run your validation checks non-interactively, you may want to use {pointblank} in the Pipeline Data Validation Workflow.

In this workflow we do not need to create an agent object, and rely on the actual warning or failures from the validation checks.

small_table %>%
  col_is_posix(date_time) %>%
  col_vals_in_set(f, set = c("low", "mid", "high")) %>%
  col_vals_lt(a, value = 10) %>%
  col_vals_regex(b, regex = "^[0-9]-[a-z]{3}-[0-9]{3}$") %>%
  col_vals_between(d, left = 0, right = 5000)
Error: Exceedance of failed test units where values in `d` should have been between `0` and `5000`.
The `col_vals_between()` validation failed beyond the absolute threshold level (1).
* failure level (1) >= failure threshold (1)

You can also set thresholds for when validation checks throw errors or warnings: https://rstudio.github.io/pointblank/articles/VALID-II.html#using-warn_on_fail-and-stop_on_fail-functions-to-generate-simple-action_levels