Tutorial 8
In this tutorial you’ll get more practice with data validation via a simulated scenario.
Scenario
You are the lead data scientist for a clinical trial testing a new therapeutic drug. The data engineering team just handed you the first batch of patient data (clinical_trials_batch1.csv).
You can download the data here: https://github.com/UBC-DSCI/dsci-310-student/blob/main/example_files/clinical_trials_batch1.csv
Before you can begin any exploratory analysis or modeling, you must validate the data. If invalid data makes it into your downstream processes, it could compromise the trial’s integrity.
Your goal is to build an automated validation pipeline using either Python (Pandera) or R (pointblank). Your pipeline must catch all errors, isolate the bad rows for the engineering team to review, and keep the good rows for your analysis.
The Data Dictionary
You must translate the following rules into your validation schema or agent:
- `patient_id`: Must be a string. Must be entirely unique (no duplicates). Cannot contain missing/null values.
- `age`: Must be an integer between 18 and 100 (inclusive).
- `treatment_arm`: Must exactly match one of “Placebo”, “Drug_A”, or “Drug_B” (watch out for typos and case sensitivity).
- `biomarker_score`: Must be a positive number (strictly greater than 0).
- `outcome`: Must be either 0 or 1.
Task 1: Load the Data
- Set up your script and load the necessary libraries (pandas and pandera for Python; tidyverse and pointblank for R).
- Read clinical_trials_batch1.csv into a dataframe.
- Print the total number of rows so you have a baseline count.
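A minimal sketch of this step in Python. The inline three-row sample is hypothetical (it stands in for the real file so the pattern runs anywhere); in your own script, replace the `StringIO` stand-in with `pd.read_csv("clinical_trials_batch1.csv")`.

```python
import io

import pandas as pd

# Hypothetical stand-in for clinical_trials_batch1.csv, so this snippet runs
# without the downloaded file. In your script, read the real CSV instead:
#   df = pd.read_csv("clinical_trials_batch1.csv")
csv_text = """patient_id,age,treatment_arm,biomarker_score,outcome
P001,45,Placebo,1.2,0
P002,67,Drug_A,0.8,1
P003,17,Drg_B,-2.0,1
"""
df = pd.read_csv(io.StringIO(csv_text))

# Baseline row count to compare against after validation.
print(f"Baseline row count: {len(df)}")
```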
Task 2: Define the Schema or Agent
Translate the Data Dictionary into programmatic rules.
- Python Users: Create a `pa.DataFrameSchema`. Map each column to its expected data type and apply the appropriate `pa.Check` functions (e.g., `isin()`, `between()`). Ensure you set `nullable=False` where appropriate.
- R Users: Initialize an agent using create_agent(), then pipe your dataframe through the appropriate col_vals_* functions (e.g., col_vals_between(), col_vals_in_set()).
- Important Note regarding Uniqueness: You might be tempted to use rows_distinct() to check for duplicate patient_ids. However, rows_distinct() is a table-level check, meaning it won’t flag the specific individual rows that failed. This will break our ability to extract the bad rows later. Instead, use this row-level expression to check for uniqueness:
```r
col_vals_expr(expr = ~ !patient_id %in% patient_id[duplicated(patient_id)])
```
Task 3: Interrogation and Handling
Execute your validation checks in a way that catches all errors rather than stopping execution at the very first failure.
- Python Users:
  - Wrap your execution in a `try...except` block.
  - Call `schema.validate()` and make sure to pass `lazy=True`.
  - If validation fails, catch the `pa.errors.SchemaErrors` exception. Extract the `failure_cases` dataframe from the error object, isolate the unique row indices that failed, and use those indices to split your original dataframe into `clean_data` and `quarantined_data`.
- R Users:
  - Run `interrogate()` on your agent.
  - Use the post-interrogation functions (like `get_sundered_data()`) to extract the passing rows into a `clean_data` object and the failing rows into a `quarantined_data` object.
Task 4: Export the Quarantined Data
The data engineering team needs to see exactly which rows failed so they can fix their extraction process.
- Check if your `quarantined_data` dataframe contains any rows.
- If it does, export it to a new file named `quarantined_data.csv`.
- Print a final summary to the console showing the number of rows in your `clean_data` and the number of rows in your `quarantined_data`.
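A sketch of the export step. The two frames here are hypothetical stand-ins for the Task 3 outputs; in your pipeline they come from the validation split.

```python
import pandas as pd

# Hypothetical stand-ins for the Task 3 outputs.
clean_data = pd.DataFrame({"patient_id": ["P001", "P002"], "age": [45, 67]})
quarantined_data = pd.DataFrame({"patient_id": ["P003"], "age": [17]})

# Only write the file if there is actually something to hand back.
if len(quarantined_data) > 0:
    quarantined_data.to_csv("quarantined_data.csv", index=False)

print(f"clean rows: {len(clean_data)}, quarantined rows: {len(quarantined_data)}")
```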