UBC DSCI 200 – data-privacy-1

Data Privacy I

DSCI 200

Katie Burak, Gabriela V. Cohen Freue

Last modified – 26 March 2026


Attribution

This material is adapted from the following sources:

Learning Objectives

By the end of today’s lesson, you should be able to:

Understand key terms in data privacy, including PII, pseudonymization, and anonymization
Identify direct and indirect identifiers in sample data sets
Explain why de-identification is challenging and context-dependent
Apply basic de-identification techniques (e.g., suppression, top-coding, permutation) using R
Recognize the tradeoff between data utility and privacy risk

Even something as simple as your Facebook “likes” can reveal a lot more than you think…
Researchers at Cambridge showed that algorithms could predict:
- Sexual orientation with up to 88% accuracy
- Race with 95% accuracy
- Political affiliation with 85% accuracy
All from analyzing the pages and posts you “liked” (no profile bio or messages needed)!

https://www.cam.ac.uk/research/news/digital-records-could-expose-intimate-details-and-personality-traits-of-millions

What Happens to Your Data?

Every time you use an app, visit a website, click on a link, fill out a survey or even just scroll on your device, your data is being:

Collected - What you click, search, watch, like or buy
Analyzed - Used to predict your behaviour, interests or identity
Shared or Sold - Passed to advertisers, data brokers or other companies

Why Does This Matter?

You may be targeted with ads, content and potentially misinformation
You could be judged or profiled based on your data (even if it’s not accurate)
You rarely know who has your data (or what they’re doing with it)
So what does this mean for us? Let’s explore how data can be used, what makes certain information sensitive and why it matters.

Personally Identifiable Information (PII)

PII refers to any data that can be used to identify a specific individual.
Direct identifiers: These clearly and uniquely point to a person.
- Examples: name, social security number, patient ID
Indirect identifiers: These don’t identify someone on their own, but could when combined.
- Examples: age, DOB, postal code, race, sex

Personal Data

Data can be identifiable when:

They contain directly identifying information.
It is possible to single out an individual.
It is possible to infer information about an individual from other information in the dataset.
It is possible to link records relating to an individual.
De-identification can be reversed.
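One way to check whether individuals can be singled out is to count how many records share each combination of indirect identifiers. The sketch below uses a small made-up table (the `records` data, its column names and its values are all hypothetical). Any combination that appears only once points to a single person.

```r
library(dplyr)

# Hypothetical records that contain only indirect identifiers
records <- tibble(
  age_group   = c("20-29", "20-29", "30-39", "30-39", "30-39"),
  postal_area = c("V6T", "V6T", "V6T", "V5K", "V5K"),
  sex         = c("F", "F", "M", "F", "F")
)

# Count how many records share each combination of indirect identifiers
group_sizes <- records |>
  count(age_group, postal_area, sex, name = "n_records")

# Combinations held by exactly one record can be singled out
singled_out <- group_sizes |>
  filter(n_records == 1)

singled_out
```

Here only the 30-39 / V6T / M combination is unique, so that record is the one at risk of being singled out.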

Scenario: Can This Data Identify You?

A fitness app shares anonymized data with researchers. The dataset includes:

Step count per day
General location (postal code)
Age
Time of day the user exercises
Health conditions

Separately, a publicly available dataset includes information from a local running club: names, age groups and 5K race times.

The Mosaic Effect

The “Mosaic Effect” happens when separate pieces of data, which alone don’t identify anyone, are combined from different sources to reveal personal information or identify an individual.
A study published in 2000 estimated that 87% of the United States population could be uniquely identified using only the combination of their ZIP code, gender and date of birth.

https://dataprivacylab.org/projects/identifiability/paper1.pdf
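The fitness app scenario can be sketched as a join between two tables. Both tables below are invented for illustration, and we assume (hypothetically) that the public club list happens to share postal code and age with the app data. Neither table alone ties a name to a health condition, but the join does.

```r
library(dplyr)

# Hypothetical "de-identified" fitness app data (no names)
fitness <- tibble(
  postal_code = c("V6T", "V6T", "V5K"),
  age         = c(34, 61, 34),
  condition   = c("asthma", "none", "diabetes")
)

# Hypothetical public running club list (names, but no health data)
club <- tibble(
  name        = c("Runner A", "Runner B"),
  postal_code = c("V6T", "V5K"),
  age         = c(61, 34)
)

# Joining on the shared indirect identifiers attaches names to conditions
linked <- inner_join(club, fitness, by = c("postal_code", "age"))

linked
```

Each club member matches exactly one app record, so the "anonymized" health conditions are now linked to names.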

Pseudonymization and Anonymization

Pseudonymization and anonymization are techniques to de-identify personal data
Goal: reduce linkability of data to individuals
We will now define each of these terms

Pseudonymization

Reduces linkability of data to individuals
Data cannot identify individuals without additional information
Often done by replacing direct identifiers with pseudonyms
Link between real identifiers and pseudonyms is stored separately
Re-identification remains possible!

Anonymization

Data are anonymized when no individual is identifiable (directly or indirectly)
This applies even to the data controller
Fully anonymized data are no longer personal data
Anonymization is difficult to achieve in practice

Identifiability Spectrum

Identifiability is a spectrum
More de-identified data = closer to anonymized
Lower identifiability = lower re-identification risk

https://www.kdnuggets.com/2020/08/anonymous-anonymized-data.html

When Are Data Truly Anonymous?

Only if re-identification would require unreasonable effort (factors include cost, time and available technology)
Data are not anonymous if:

Direct identifiers are present
Individuals can be singled out from a group
Re-identification is possible by linking datasets (mosaic effect)
Inference about identity is possible (e.g., through different variables)
De-identification can be reversed

Context Matters

Whether data are anonymous depends on:
- The context of the research
- Available external information
- Future data uses

De-identification Techniques

Techniques to de-identify your data include:

Suppression
Generalization
Replacement
Top- and bottom-coding
Adding noise
Permutation

We will talk about each of these techniques individually.

First, let’s generate some data we can use to help illustrate these concepts.

library(tidyverse)

df <- tibble(
  name = c("Joel Miller", "Ellie Williams", "Tommy Miller", "Abby Anderson"),
  age = c(52, 19, 48, 28),
  height_cm = c(182, 160, 185, 173)
)

df

# A tibble: 4 × 3
  name             age height_cm
  <chr>          <dbl>     <dbl>
1 Joel Miller       52       182
2 Ellie Williams    19       160
3 Tommy Miller      48       185
4 Abby Anderson     28       173

Suppression

Remove entire variables, values or records
Used to eliminate highly identifying or unnecessary data
Examples:
- Names, contact details, social security numbers
- GPS metadata, IP addresses, neuroimaging facial features
- Outliers or unique participants

Suppression Example

df_suppressed <- df |>
  select(-name)

df_suppressed

# A tibble: 4 × 2
    age height_cm
  <dbl>     <dbl>
1    52       182
2    19       160
3    48       185
4    28       173
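Suppression can also target individual values or whole records rather than a full column. The sketch below rebuilds the example data so it runs on its own. The age cut-off of 20 and the choice to drop the tallest participant are assumptions made purely for illustration.

```r
library(dplyr)

df <- tibble(
  name = c("Joel Miller", "Ellie Williams", "Tommy Miller", "Abby Anderson"),
  age = c(52, 19, 48, 28),
  height_cm = c(182, 160, 185, 173)
)

# Value suppression: blank out ages below a hypothetical threshold of 20
df_value_suppressed <- df |>
  mutate(age = if_else(age < 20, NA_real_, age))

# Record suppression: drop the row with the most extreme (tallest) height
df_record_suppressed <- df |>
  filter(height_cm != max(height_cm))

df_value_suppressed
df_record_suppressed
```

Value suppression keeps the row but hides one cell, while record suppression removes the person from the dataset entirely.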

Generalization

Reduces detail or granularity in the data
Makes individuals harder to single out
Examples:
- Convert date of birth to age, or group into ranges
- Replace address with town or region
- Recategorize rare labels into “other” or “missing”
- Abstract people or places in qualitative data (e.g., “Bob” to “[colleague]”)

Generalization Example

Here we will show an example of generalization on the age column:

df_generalized <- df |>
  mutate(age_group = case_when(
    age < 30 ~ "under 30",
    TRUE     ~ "30+"
  )) |>
  select(-age)

df_generalized

# A tibble: 4 × 3
  name           height_cm age_group
  <chr>              <dbl> <chr>    
1 Joel Miller          182 30+      
2 Ellie Williams       160 under 30 
3 Tommy Miller         185 30+      
4 Abby Anderson        173 under 30
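Generalization also applies to categorical variables. As a rough sketch, the code below recodes rare labels into "other". The `survey` data and the minimum group size of 2 are hypothetical choices, not part of the example data above.

```r
library(dplyr)

# Hypothetical survey responses that include two rare occupations
survey <- tibble(
  occupation = c("teacher", "nurse", "teacher", "astronaut", "nurse", "falconer")
)

# Recode any occupation reported by fewer than 2 people as "other"
survey_generalized <- survey |>
  add_count(occupation, name = "n_people") |>
  mutate(occupation = if_else(n_people < 2, "other", occupation)) |>
  select(-n_people)

survey_generalized
```

The two unique occupations would have singled out their respondents; after recoding they fall into the same "other" group.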

Replacement

Swap identifying info with less informative alternatives
Examples:
- Use pseudonyms for names (with securely stored keyfile)
- Replace with placeholders (e.g., “[redacted]”)
- Rounding numeric values
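The placeholder and rounding options can be sketched as follows. The example data are rebuilt so the block runs on its own, and rounding to the nearest 5 cm is an arbitrary choice for illustration.

```r
library(dplyr)

df <- tibble(
  name = c("Joel Miller", "Ellie Williams", "Tommy Miller", "Abby Anderson"),
  age = c(52, 19, 48, 28),
  height_cm = c(182, 160, 185, 173)
)

# Replace every name with the same placeholder
df_redacted <- df |>
  mutate(name = "[redacted]")

# Round heights to the nearest 5 cm (an assumed level of precision)
df_rounded <- df |>
  mutate(height_cm = round(height_cm / 5) * 5)

df_redacted
df_rounded
```

Rounding keeps the variable usable for analysis while making exact-value matches against other datasets less likely.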

Creating Pseudonyms

Pseudonyms should reveal nothing about the subject
Good pseudonyms:
- Are random or meaningless strings/numbers
- Are securely managed (e.g., encrypted keyfile)
Can be generated using tools in Excel, R, Python, SPSS
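A minimal sketch of random pseudonyms with a separate key table is shown below. The "P" prefix, the six-digit codes and the object names `key` and `df_shared` are assumptions for illustration. The seed is set only so the sketch is reproducible; in real use a fixed, known seed would let others regenerate the codes.

```r
library(dplyr)

df <- tibble(
  name = c("Joel Miller", "Ellie Williams", "Tommy Miller", "Abby Anderson"),
  age = c(52, 19, 48, 28),
  height_cm = c(182, 160, 185, 173)
)

set.seed(200)  # for a reproducible illustration only

# Draw distinct random six-digit codes, one per person (sampling without replacement)
codes <- sample(100000:999999, size = nrow(df))

# Key table linking names to pseudonyms: store securely, apart from the shared data
key <- tibble(name = df$name, pseudonym = paste0("P", codes))

# Shared data: the pseudonym replaces the name
df_shared <- df |>
  left_join(key, by = "name") |>
  select(pseudonym, age, height_cm)

df_shared
```

Unlike sequential IDs, these codes reveal nothing about the order of the rows, and only someone holding `key` can map them back to names.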

Replacement with Pseudonyms

df_pseudonymized <- df |>
  mutate(pseudonym = paste0("ID", row_number())) |>
  select(pseudonym, everything(), -name)

df_pseudonymized

# A tibble: 4 × 3
  pseudonym   age height_cm
  <chr>     <dbl>     <dbl>
1 ID1          52       182
2 ID2          19       160
3 ID3          48       185
4 ID4          28       173

Hashing

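Hashing converts names into fixed-length strings using a one-way function.
Unlike pseudonyms, which can be looked up in a keyfile, hashed values cannot be reversed directly. However, names are easy to guess: anyone can hash a list of candidate names and compare the results, so hashing alone does not anonymize the data.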
In R, we can use the digest package (and function) to hash.

library(digest) 

df_hashed <- df |>
  rowwise() |>
  mutate(name_hash = digest(name)) |>
  select(name_hash, everything(), -name)

df_hashed

# A tibble: 4 × 3
# Rowwise: 
  name_hash                          age height_cm
  <chr>                            <dbl>     <dbl>
1 4a3e0ee26ab3fb1338e893f4d4e7244b    52       182
2 201943dd66d423ed3cce2242a75736d4    19       160
3 81699ec9483bad176eed57ee43ffa010    48       185
4 046dff9ba9cf33573396f4de8c0c0e0b    28       173

What happens if we just apply digest to the name vector without using rowwise?

df_hashed <- df |>
  mutate(name_hash = digest(name)) |>
  select(name_hash, everything(), -name)

digest hashed the entire name column as a single object (it’s not vectorized), so mutate recycled the same hash to every row (which is not what we want).
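An alternative to rowwise is to apply digest to each element explicitly. The sketch below uses base R's vapply with a small helper (`hash_one` is a name invented here) and hashes the raw string with SHA-256 rather than the default settings, so its hashes differ from the output shown earlier. It also checks how easily a guessed name can be matched against the hashed column.

```r
library(dplyr)
library(digest)

df <- tibble(
  name = c("Joel Miller", "Ellie Williams", "Tommy Miller", "Abby Anderson"),
  age = c(52, 19, 48, 28),
  height_cm = c(182, 160, 185, 173)
)

# Helper (hypothetical name): hash a single string with SHA-256
hash_one <- function(x) digest(x, algo = "sha256", serialize = FALSE)

# vapply calls hash_one once per name, giving one hash per row
df_hashed <- df |>
  mutate(name_hash = vapply(name, hash_one, character(1), USE.NAMES = FALSE)) |>
  select(name_hash, everything(), -name)

# Guessing attack: hash a candidate name and look for it in the data
guess <- hash_one("Ellie Williams")
guess %in% df_hashed$name_hash  # TRUE: the same input always gives the same hash
```

Because identical inputs always produce identical hashes, anyone with a list of plausible names can re-identify rows this way.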

Top- and Bottom-Coding

Limits extreme values in quantitative data
Recode all values above or below a threshold
Example: all incomes above $150,000 become $150,000
Preserves much of the dataset, but distorts distribution tails

Top-coding example

Suppose 6 ft (182.88 cm) is our maximum height threshold.

df_top_coded <- df |>
  mutate(height_cm = if_else(height_cm > 182.88, 182.88, height_cm))

df_top_coded

# A tibble: 4 × 3
  name             age height_cm
  <chr>          <dbl>     <dbl>
1 Joel Miller       52      182 
2 Ellie Williams    19      160 
3 Tommy Miller      48      183.
4 Abby Anderson     28      173
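Top- and bottom-coding can be combined in one step. The sketch below clamps age to a range using pmin and pmax. The bounds of 20 and 50 are hypothetical thresholds chosen to fit the example data.

```r
library(dplyr)

df <- tibble(
  name = c("Joel Miller", "Ellie Williams", "Tommy Miller", "Abby Anderson"),
  age = c(52, 19, 48, 28),
  height_cm = c(182, 160, 185, 173)
)

# pmin caps values at the upper threshold, pmax raises values to the lower one
df_coded <- df |>
  mutate(age = pmax(pmin(age, 50), 20))

df_coded
```

Ages inside the range are untouched, while 52 becomes 50 and 19 becomes 20, so only the tails of the distribution are distorted.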

Adding Noise

Introduces randomness to protect sensitive info
Examples:
- Add a small random amount to numeric values
- Blur images or alter voices
- Use differential privacy algorithms (advanced)

Adding Noise to Height

This adds random noise drawn from a normal distribution ($\mu=0$, $\sigma=2$) to the height variable, which makes it harder to re-identify someone by matching their exact height.

set.seed(200) 

df_noisy <- df |>
  mutate(height_cm_noisy = height_cm + rnorm(n(), mean = 0, sd = 2)) |>
  select(-height_cm)

df_noisy

# A tibble: 4 × 3
  name             age height_cm_noisy
  <chr>          <dbl>           <dbl>
1 Joel Miller       52            182.
2 Ellie Williams    19            160.
3 Tommy Miller      48            186.
4 Abby Anderson     28            174.
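The amount of noise controls how much the data are distorted. As a rough sketch, the code below compares two noise levels by scaling a single draw of standard normal noise. Reusing the same draw is a simplification that makes the comparison exact, and the standard deviation of 10 is an arbitrary second value.

```r
library(dplyr)

df <- tibble(
  name = c("Joel Miller", "Ellie Williams", "Tommy Miller", "Abby Anderson"),
  age = c(52, 19, 48, 28),
  height_cm = c(182, 160, 185, 173)
)

set.seed(200)
z <- rnorm(nrow(df))  # one draw of standard normal noise, reused for both scales

df_noise_compare <- df |>
  mutate(
    height_sd2  = height_cm + 2 * z,   # noise with standard deviation 2
    height_sd10 = height_cm + 10 * z   # noise with standard deviation 10
  )

# Average absolute distortion of the heights at each noise level
noise_error <- df_noise_compare |>
  summarise(
    error_sd2  = mean(abs(height_sd2 - height_cm)),
    error_sd10 = mean(abs(height_sd10 - height_cm))
  )

noise_error
```

More noise makes exact matching harder but moves the values further from the truth, which is the privacy versus utility tradeoff in miniature.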

Permutation

Swap values between individuals
Makes linking variables across a record more difficult
Maintains distributions, but breaks correlations
Can limit the types of analyses possible

Permutation of Height Values

Here, the height_cm values are shuffled between individuals, preserving the overall distribution but breaking the link between person and value.

set.seed(200)

df_permuted <- df |>
  mutate(height_cm_permuted = sample(height_cm)) |>
  select(-height_cm)

df_permuted

# A tibble: 4 × 3
  name             age height_cm_permuted
  <chr>          <dbl>              <dbl>
1 Joel Miller       52                160
2 Ellie Williams    19                173
3 Tommy Miller      48                182
4 Abby Anderson     28                185
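To see what permutation preserves and what it breaks, the sketch below keeps both the original and shuffled columns side by side (the object names `df_check` and `same_values` are invented here) and compares them.

```r
library(dplyr)

df <- tibble(
  name = c("Joel Miller", "Ellie Williams", "Tommy Miller", "Abby Anderson"),
  age = c(52, 19, 48, 28),
  height_cm = c(182, 160, 185, 173)
)

set.seed(200)

# Keep the original column this time so the two can be compared
df_check <- df |>
  mutate(height_cm_permuted = sample(height_cm))

# The shuffled column holds exactly the same values, so its distribution is unchanged
same_values <- identical(sort(df_check$height_cm), sort(df_check$height_cm_permuted))
same_values  # TRUE

# The relationship with other variables (here, age) is generally not preserved
cor(df_check$age, df_check$height_cm)
cor(df_check$age, df_check$height_cm_permuted)
```

Summaries of height alone (mean, range, histogram) are unaffected, but any analysis relating height to age or to the person is no longer trustworthy.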

Privacy vs. Utility Tradeoff

https://www.researchgate.net/figure/Trade-off-between-privacy-level-and-utility-level-of-data_fig1_357987903

Key Takeaways

Data exist on a spectrum of identifiability
Even seemingly anonymous data can often be re-identified (e.g., mosaic effect)
Different techniques offer varying levels of protection and utility
Context, external data and technological capabilities all affect re-identification risk
Responsible data handling requires both technical skill and ethical awareness