Tutorial 8

In this tutorial you’ll get more practice with data validation via a simulated scenario.

Scenario

You are the lead data scientist for a clinical trial testing a new therapeutic drug. The data engineering team just handed you the first batch of patient data (clinical_trials_batch1.csv).

You can download the data here: https://github.com/UBC-DSCI/dsci-310-student/blob/main/example_files/clinical_trials_batch1.csv

Before you can begin any exploratory analysis or modeling, you must validate the data. If invalid data makes it into your downstream processes, it could compromise the trial’s integrity.

Your goal is to build an automated validation pipeline using either Python (Pandera) or R (pointblank). Your pipeline must catch all errors, isolate the bad rows for the engineering team to review, and keep the good rows for your analysis.

The Data Dictionary

You must translate the following rules into your validation schema or agent:

  • patient_id: Must be a string. Must be entirely unique (no duplicates). Cannot contain missing/null values.
  • age: Must be an integer between 18 and 100 (inclusive).
  • treatment_arm: Must exactly match one of “Placebo”, “Drug_A”, or “Drug_B” (watch out for typos and case sensitivity).
  • biomarker_score: Must be a positive number (strictly greater than 0).
  • outcome: Must be either 0 or 1.

Task 1: Load the Data

  1. Set up your script and load the necessary libraries (pandas and pandera for Python; tidyverse and pointblank for R).
  2. Read clinical_trials_batch1.csv into a dataframe.
  3. Print the total number of rows so you have a baseline count.
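The steps above can be sketched as follows. In your actual script you would read the downloaded clinical_trials_batch1.csv directly; the tiny inline sample below is hypothetical and stands in for the file so the sketch runs anywhere:

```python
import io

import pandas as pd

# Real pipeline:
#   df = pd.read_csv("clinical_trials_batch1.csv")
# Hypothetical stand-in data so the example is self-contained:
csv_text = """patient_id,age,treatment_arm,biomarker_score,outcome
P001,34,Placebo,1.2,0
P002,61,Drug_A,0.8,1
"""
df = pd.read_csv(io.StringIO(csv_text))

# Baseline row count before any validation
print(f"Rows loaded: {len(df)}")
```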

Task 2: Define the Schema or Agent

Translate the Data Dictionary into programmatic rules.

  • Python Users: Create a pa.DataFrameSchema. Map each column to its expected data type and apply the appropriate pa.Check functions (e.g., isin(), in_range()). Ensure you set nullable=False where appropriate.
  • R Users: Initialize an agent using create_agent(), then pipe your dataframe through the appropriate col_vals_* functions (e.g., col_vals_between(), col_vals_in_set()).
    • Important Note regarding Uniqueness: You might be tempted to use rows_distinct() to check for duplicate patient_ids. However, rows_distinct() is a table-level check, meaning it won’t flag the specific individual rows that failed. This will break our ability to extract the bad rows later. Instead, use this row-level expression to check for uniqueness:
    col_vals_expr(expr = ~ !patient_id %in% patient_id[duplicated(patient_id)])

Task 3: Interrogation and Handling

Execute your validation checks in a way that catches all errors rather than stopping execution at the very first failure.

  • Python Users:
    • Wrap your execution in a try...except block.
    • Call schema.validate() and make sure to pass lazy=True.
    • If validation fails, catch the pa.errors.SchemaErrors exception. Extract the failure_cases dataframe from the error object, isolate the unique row indices that failed, and use those indices to split your original dataframe into clean_data and quarantined_data.
  • R Users:
    • Run interrogate() on your agent.
    • Use the post-interrogation functions (like get_sundered_data()) to extract the passing rows into a clean_data object and the failing rows into a quarantined_data object.

Task 4: Export the Quarantined Data

The data engineering team needs to see exactly which rows failed so they can fix their extraction process.

  • Check if your quarantined_data dataframe contains any rows.
  • If it does, export it to a new file named quarantined_data.csv.
  • Print a final summary to the console showing the number of rows in your clean_data and the number of rows in your quarantined_data.
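A minimal sketch of this final step is below, assuming clean_data and quarantined_data were produced in Task 3 (the two one-row dataframes here are hypothetical placeholders):

```python
import pandas as pd

# Hypothetical results from Task 3
clean_data = pd.DataFrame({"patient_id": ["P001"], "age": [34]})
quarantined_data = pd.DataFrame({"patient_id": ["P002"], "age": [150]})

# Only write the file if something actually failed
if len(quarantined_data) > 0:
    # Keep the index so the engineering team can trace rows back to the source file
    quarantined_data.to_csv("quarantined_data.csv", index=True)

print(f"Clean rows: {len(clean_data)} | Quarantined rows: {len(quarantined_data)}")
```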