Here we want to check that all values in the a column are less than 10. We can use the col_vals_lt()
small_table |>col_vals_lt(a, value =10)
# A tibble: 13 ร 8
date_time date a b c d e f
<dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-345 3 3423. TRUE high
2 2016-01-04 00:32:00 2016-01-04 3 5-egh-163 8 10000. TRUE low
3 2016-01-05 13:32:00 2016-01-05 6 8-kdg-938 3 2343. TRUE high
4 2016-01-06 17:23:00 2016-01-06 2 5-jdo-903 NA 3892. FALSE mid
5 2016-01-09 12:36:00 2016-01-09 8 3-ldm-038 7 284. TRUE low
6 2016-01-11 06:15:00 2016-01-11 4 2-dhe-923 4 3291. TRUE mid
7 2016-01-15 18:46:00 2016-01-15 7 1-knw-093 3 843. TRUE high
8 2016-01-17 11:27:00 2016-01-17 4 5-boe-639 2 1036. FALSE low
9 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
10 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
11 2016-01-26 20:07:00 2016-01-26 4 2-dmx-010 7 834. TRUE low
12 2016-01-28 02:51:00 2016-01-28 2 7-dmx-010 8 108. FALSE low
13 2016-01-30 11:23:00 2016-01-30 1 3-dka-303 NA 2230. TRUE high
If the table passes the rule, you get a table back. Otherwise you will get an error:
small_table |>col_vals_lt(a, value =5)
Error: Exceedance of failed test units where values in `a` should have been < `5`.
The `col_vals_lt()` validation failed beyond the absolute threshold level (1).
* failure level (3) >= failure threshold (1)
This allows you to chain validation rules.
small_table |>col_vals_lt(a, value =10) |>col_vals_between(d, left =0, right =5000) |>col_vals_in_set(f, set =c("low", "mid", "high")) |>col_vals_regex(b, regex ="^[0-9]-[a-z]{3}-[0-9]{3}$")
Error: Exceedance of failed test units where values in `d` should have been between `0` and `5000`.
The `col_vals_between()` validation failed beyond the absolute threshold level (1).
* failure level (1) >= failure threshold (1)
We can fix this by either fixing the data, or the actual test.
small_table |>col_vals_lt(a, value =10) |>col_vals_between(d, left =0, right =10000) |>col_vals_in_set(f, set =c("low", "mid", "high")) |>col_vals_regex(b, regex ="^[0-9]-[a-z]{3}-[0-9]{3}$")
# A tibble: 13 ร 8
date_time date a b c d e f
<dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-345 3 3423. TRUE high
2 2016-01-04 00:32:00 2016-01-04 3 5-egh-163 8 10000. TRUE low
3 2016-01-05 13:32:00 2016-01-05 6 8-kdg-938 3 2343. TRUE high
4 2016-01-06 17:23:00 2016-01-06 2 5-jdo-903 NA 3892. FALSE mid
5 2016-01-09 12:36:00 2016-01-09 8 3-ldm-038 7 284. TRUE low
6 2016-01-11 06:15:00 2016-01-11 4 2-dhe-923 4 3291. TRUE mid
7 2016-01-15 18:46:00 2016-01-15 7 1-knw-093 3 843. TRUE high
8 2016-01-17 11:27:00 2016-01-17 4 5-boe-639 2 1036. FALSE low
9 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
10 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
11 2016-01-26 20:07:00 2016-01-26 4 2-dmx-010 7 834. TRUE low
12 2016-01-28 02:51:00 2016-01-28 2 7-dmx-010 8 108. FALSE low
13 2016-01-30 11:23:00 2016-01-30 1 3-dka-303 NA 2230. TRUE high
17.2 Validation Table
From the docs:
There are three things that should be noted here:
Validation steps: each step is a separate test on the table, focused on a certain aspect of the table.
Validation rules: the validation type is provided here along with key constraints.
Validation results: interrogation results are provided here, with a breakdown of test units (total, passing, and failing), threshold flags, and more.
Create the agent, apply validation rules, then interrogate() it.
agent <- small_table |>create_agent() |>col_vals_lt(a, value =10) |>col_vals_between(d, left =0, right =5000) |>col_vals_in_set(f, set =c("low", "mid", "high")) |>col_vals_regex(b, regex ="^[0-9]-[a-z]{3}-[0-9]{3}$")
Running the interrogate() on the agent, will print out the validation table, but also
agent |>interrogate()
Pointblank Validation
[2025-03-21|11:38:14]
tibblesmall_table
STEP
COLUMNS
VALUES
TBL
EVAL
UNITS
PASS
FAIL
W
S
N
EXT
1
col_vals_lt()
โฎa
10
โ
13
13 1
0 0
โ
โ
โ
โ
2
col_vals_between()
โฎd
[0, 5,000]
โ
13
12 0.92308
1 0.07692
โ
โ
โ
3
col_vals_in_set()
โฎf
low, mid, high
โ
13
13 1
0 0
โ
โ
โ
โ
4
col_vals_regex()
โฎb
^[0-9]-[a-z]{3}-[0-9]{3}$
โ
13
13 1
0 0
โ
โ
โ
โ
2025-03-21 11:38:14 UTC< 1 s2025-03-21 11:38:14 UTC
17.3 Post-interrogation
There are a few post-interrogation steps you can do with your agents. One of the more useful ones may be separately looking at the passing and failing data.
The get_sundered_data() and get_data_extracts() are a few useful functions to accomplish this goal.
Here is our agent that has failing validation checks.
Important
Donโt forget to interrogate() your agent before running post-interrogation functions.
agent <- small_table |>create_agent() |>col_vals_lt(a, value =10) |>col_vals_between(d, left =0, right =5000) |>col_vals_in_set(f, set =c("low", "mid", "high")) |>col_vals_regex(b, regex ="^[0-9]-[a-z]{3}-[0-9]{3}$") |>interrogate()
We can get a separate or combined version of our passing and failing observations.
get_sundered_data(agent, type ="pass")
# A tibble: 12 ร 8
date_time date a b c d e f
<dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-345 3 3423. TRUE high
2 2016-01-05 13:32:00 2016-01-05 6 8-kdg-938 3 2343. TRUE high
3 2016-01-06 17:23:00 2016-01-06 2 5-jdo-903 NA 3892. FALSE mid
4 2016-01-09 12:36:00 2016-01-09 8 3-ldm-038 7 284. TRUE low
5 2016-01-11 06:15:00 2016-01-11 4 2-dhe-923 4 3291. TRUE mid
6 2016-01-15 18:46:00 2016-01-15 7 1-knw-093 3 843. TRUE high
7 2016-01-17 11:27:00 2016-01-17 4 5-boe-639 2 1036. FALSE low
8 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
9 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
10 2016-01-26 20:07:00 2016-01-26 4 2-dmx-010 7 834. TRUE low
11 2016-01-28 02:51:00 2016-01-28 2 7-dmx-010 8 108. FALSE low
12 2016-01-30 11:23:00 2016-01-30 1 3-dka-303 NA 2230. TRUE high
get_sundered_data(agent, type ="fail")
# A tibble: 1 ร 8
date_time date a b c d e f
<dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
1 2016-01-04 00:32:00 2016-01-04 3 5-egh-163 8 10000. TRUE low
The combined option creates a .ph_combined column that you can use in a downstream process.
get_sundered_data(agent, type ="combined")
# A tibble: 13 ร 9
date_time date a b c d e f
<dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-345 3 3423. TRUE high
2 2016-01-04 00:32:00 2016-01-04 3 5-egh-163 8 10000. TRUE low
3 2016-01-05 13:32:00 2016-01-05 6 8-kdg-938 3 2343. TRUE high
4 2016-01-06 17:23:00 2016-01-06 2 5-jdo-903 NA 3892. FALSE mid
5 2016-01-09 12:36:00 2016-01-09 8 3-ldm-038 7 284. TRUE low
6 2016-01-11 06:15:00 2016-01-11 4 2-dhe-923 4 3291. TRUE mid
7 2016-01-15 18:46:00 2016-01-15 7 1-knw-093 3 843. TRUE high
8 2016-01-17 11:27:00 2016-01-17 4 5-boe-639 2 1036. FALSE low
9 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
10 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
11 2016-01-26 20:07:00 2016-01-26 4 2-dmx-010 7 834. TRUE low
12 2016-01-28 02:51:00 2016-01-28 2 7-dmx-010 8 108. FALSE low
13 2016-01-30 11:23:00 2016-01-30 1 3-dka-303 NA 2230. TRUE high
# โน 1 more variable: .pb_combined <chr>
You can also use the get_data_extracts() function to get the values in a list.
get_data_extracts(agent)
$`2`
# A tibble: 1 ร 8
date_time date a b c d e f
<dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
1 2016-01-04 00:32:00 2016-01-04 3 5-egh-163 8 10000. TRUE low
17.4 Pipeline Data Validation
When you want to run your validation checks non-interactively, you may want to use {pointblank} in the Pipeline Data Validation Workflow.
In this workflow we do not need to create an agent object, and rely on the actual warning or failures from the validation checks.
small_table %>%col_is_posix(date_time) %>%col_vals_in_set(f, set =c("low", "mid", "high")) %>%col_vals_lt(a, value =10) %>%col_vals_regex(b, regex ="^[0-9]-[a-z]{3}-[0-9]{3}$") %>%col_vals_between(d, left =0, right =5000)
Error: Exceedance of failed test units where values in `d` should have been between `0` and `5000`.
The `col_vals_between()` validation failed beyond the absolute threshold level (1).
* failure level (1) >= failure threshold (1)