The Non-Insured Health Benefits (NIHB) database contains sensitive health data on First Nations use of services like prescriptions, dental care, and medical devices.
In 2001, Health Canada began releasing de-identified NIHB pharmacy claims data to Brogan Inc., a private health consulting firm.
Though personal identifiers were removed, community identifiers remained, and First Nations were not informed until 2007.
Brogan sold the data to pharmaceutical companies for commercial research and marketing.
Health Canada justified the release by claiming no privacy interests remained since personally identifying information had been removed.
Kukutai, T., & Taylor, J. (2016). Indigenous data sovereignty: Toward an agenda. ANU Press.
Discussion
Take 5 minutes to discuss this case in groups of 2-3, considering the following questions:
Was the data truly de-identified?
Should de-identified data still require community consent before being shared or sold?
What are the limits of simply removing names and IDs from a dataset?
How can we measure whether a dataset is truly “safe” to release?
Learning Objectives
By the end of today’s lesson, you should be able to:
Define identifiers, quasi-identifiers, and sensitive attributes in datasets
Explain the limitations of basic de-identification methods
Describe the concepts of \(k\)-anonymity and \(l\)-diversity
Apply \(k\)-anonymity and \(l\)-diversity to de-identify data
Understand the basic idea of differential privacy and its significance
Why basic de-identification isn’t always enough
Last class, we introduced some de-identification techniques such as suppression and generalization.
However, individuals can often be re-identified using other information.
As datasets become more detailed and linkable, privacy risks increase.
Statistical methods are needed to ensure meaningful de-identification while preserving data utility.
Statistical approaches to de-identification
\(k\)-anonymity
\(l\)-diversity
Differential privacy (advanced)
Overview of privacy models
\(k\)-anonymity and \(l\)-diversity are statistical approaches that quantify the level of identifiability within a tabular dataset.
They focus on how combinations of variables can lead to identification.
These approaches are complementary: a dataset can be simultaneously \(k\)-anonymous and \(l\)-diverse, where \(k\) and \(l\) represent numeric thresholds.
\(k\)-anonymity and \(l\)-diversity are typically used to de-identify tabular datasets before sharing.
They work best on relatively large datasets, where enough observations are present to preserve useful detail while still protecting privacy.
Identifiers, Quasi-Identifiers, and Sensitive Attributes
Privacy models distinguish between three types of variables:
Identifiers: Direct identifiers such as names, student numbers, email addresses.
Quasi-Identifiers: Indirect identifiers that can lead to identification when combined with other quasi-identifiers or external data.
Examples: age, sex, place of residence, physical characteristics, timestamps, etc.
Sensitive Attributes: Variables that need protection and cannot be altered, since they are the key outcomes of interest.
Examples: medical condition, income, etc.
Importance of Correct Variable Categorization
Correctly categorizing variables into identifiers, quasi-identifiers, and sensitive attributes is crucial.
This categorization determines how to de-identify your dataset effectively using \(k\)-anonymity and \(l\)-diversity.
Now, let’s discuss each of these techniques in detail…
\(k\)-anonymity
A data set is \(k\)-anonymous if each observation cannot be distinguished from at least \(k-1\) other observations based on the quasi-identifiers.
This can be achieved through generalization, suppression and sometimes top- or bottom-coding of data values.
Applying \(k\)-anonymity makes it more difficult for an attacker to single out or re-identify specific individuals.
It also helps reduce the risk of the mosaic effect, where combining data points could lead to identification.
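As a concrete check, the \(k\) of a table is simply the size of its smallest quasi-identifier group. A minimal sketch in Python (the function name and toy records are illustrative, not from a particular library):

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the k of a dataset: the size of the smallest group of
    records sharing the same combination of quasi-identifier values."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values())

# Toy records: age and city are the quasi-identifiers.
rows = [
    {"age": 38, "city": "Calgary"},
    {"age": 37, "city": "Toronto"},
    {"age": 38, "city": "Calgary"},
]
print(k_anonymity(rows, ["age", "city"]))  # → 1: the (37, Toronto) record is unique
```

Because one record is unique on its quasi-identifiers, this toy table is only \(1\)-anonymous, i.e. not protected at all.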
Making a data set \(k\)-anonymous
Identify variables as identifiers, quasi-identifiers and sensitive attributes.
Choose a value for \(k\).
Aggregate or transform the data so each combination of quasi-identifiers occurs at least \(k\) times.
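The steps above can be sketched in Python on the example data from the slides, generalizing exact ages into decade ranges (the helper name is illustrative):

```python
from collections import Counter

def generalize_age(age):
    """Generalize an exact age to a decade range, e.g. 38 -> '30-39'."""
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

# The (age, city) quasi-identifiers from the example dataset.
records = [(38, "Calgary"), (37, "Toronto"), (31, "Vancouver"),
           (48, "Calgary"), (39, "Vancouver"), (37, "Calgary"),
           (34, "Toronto"), (33, "Vancouver"), (32, "Toronto"),
           (45, "Calgary")]

# Step 3: transform the data, then verify the smallest group size.
generalized = [(generalize_age(age), city) for age, city in records]
k = min(Counter(generalized).values())
print(k)  # → 2: every (age range, city) combination now occurs at least twice
```

Generalizing age is enough here; with a different dataset, city might also need generalization or suppression to reach the chosen \(k\).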
Choosing \(k\)
There is no single correct value for \(k\)!
Higher \(k\) increases privacy, but reduces data detail and utility.
The choice depends on promises made to data subjects and acceptable risk levels.
Age and city are quasi-identifiers, and salary is considered a sensitive attribute.
| Age | City | Salary |
|-----|------|--------|
| 38 | Calgary | 90,000–99,999 |
| 37 | Toronto | 90,000–99,999 |
| 31 | Vancouver | 80,000–89,999 |
| 48 | Calgary | 110,000–119,999 |
| 39 | Vancouver | 110,000–119,999 |
| 37 | Calgary | 90,000–99,999 |
| 34 | Toronto | 90,000–99,999 |
| 33 | Vancouver | 80,000–89,999 |
| 32 | Toronto | 100,000–109,999 |
| 45 | Calgary | 90,000–99,999 |
\(k=2\)
| Age Range | City | Salary Range |
|-----------|------|--------------|
| 30–39 | Calgary | 90,000–99,999 |
| 30–39 | Toronto | 90,000–99,999 |
| 30–39 | Vancouver | 80,000–89,999 |
| 40–49 | Calgary | 110,000–119,999 |
| 30–39 | Vancouver | 110,000–119,999 |
| 30–39 | Calgary | 90,000–99,999 |
| 30–39 | Toronto | 90,000–99,999 |
| 30–39 | Vancouver | 80,000–89,999 |
| 30–39 | Toronto | 100,000–109,999 |
| 40–49 | Calgary | 90,000–99,999 |
\(l\)-diversity
\(l\)-diversity is an extension of \(k\)-anonymity that ensures sufficient variation in a sensitive attribute.
This is important because if all individuals within a group share the same sensitive value, there is still a risk of inference.
Although these data are \(2\)-anonymous, we can still infer that any 30–39-year-old participant from Calgary earns between $90,000 and $99,999.
| Age Range | City | Salary Range |
|-----------|------|--------------|
| 30–39 | Calgary | 90,000–99,999 |
| 30–39 | Toronto | 90,000–99,999 |
| 30–39 | Vancouver | 80,000–89,999 |
| 40–49 | Calgary | 110,000–119,999 |
| 30–39 | Vancouver | 110,000–119,999 |
| 30–39 | Calgary | 90,000–99,999 |
| 30–39 | Toronto | 90,000–99,999 |
| 30–39 | Vancouver | 80,000–89,999 |
| 30–39 | Toronto | 100,000–109,999 |
| 40–49 | Calgary | 90,000–99,999 |
\(l\)-diversity
The approach requires at least \(l\) different values for the sensitive attribute within each combination of quasi-identifiers.
Again, there is no perfect value for \(l\) (typically \(1< l \leq k\)).
With \(l=2\), that means that for each combination of Age Range and City, there are at least 2 distinct Salary Ranges.
| Age Range | City | Salary Range |
|-----------|------|--------------|
| 30–39 | - | 90,000–99,999 |
| 30–39 | - | 90,000–99,999 |
| 30–39 | - | 80,000–89,999 |
| 40–49 | Calgary | 110,000–119,999 |
| 30–39 | - | 110,000–119,999 |
| 30–39 | - | 90,000–99,999 |
| 30–39 | - | 90,000–99,999 |
| 30–39 | - | 80,000–89,999 |
| 30–39 | - | 100,000–109,999 |
| 40–49 | Calgary | 90,000–99,999 |
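This can also be checked mechanically: the \(l\) of a table is the smallest number of distinct sensitive values within any quasi-identifier group. A minimal Python sketch using the \(2\)-anonymous table from the earlier slide (the function name is illustrative):

```python
from collections import defaultdict

def l_diversity(rows, quasi_identifiers, sensitive):
    """Return the l of a dataset: the smallest number of distinct
    sensitive values found within any quasi-identifier group."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].add(row[sensitive])
    return min(len(values) for values in groups.values())

# The 2-anonymous table before suppressing city.
rows = [
    {"age": "30-39", "city": "Calgary",   "salary": "90,000-99,999"},
    {"age": "30-39", "city": "Toronto",   "salary": "90,000-99,999"},
    {"age": "30-39", "city": "Vancouver", "salary": "80,000-89,999"},
    {"age": "40-49", "city": "Calgary",   "salary": "110,000-119,999"},
    {"age": "30-39", "city": "Vancouver", "salary": "110,000-119,999"},
    {"age": "30-39", "city": "Calgary",   "salary": "90,000-99,999"},
    {"age": "30-39", "city": "Toronto",   "salary": "90,000-99,999"},
    {"age": "30-39", "city": "Vancouver", "salary": "80,000-89,999"},
    {"age": "30-39", "city": "Toronto",   "salary": "100,000-109,999"},
    {"age": "40-49", "city": "Calgary",   "salary": "90,000-99,999"},
]
print(l_diversity(rows, ["age", "city"], "salary"))  # → 1: all 30-39 Calgary records share one salary range
```

An \(l\) of 1 confirms the inference problem: \(2\)-anonymity alone does not stop an attacker from learning the salary range of every 30–39-year-old Calgary participant.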
There are still issues…
Even though the data is de-identified, some sensitive patterns can still leak through.
Consider the two 40–49 Calgary individuals from the example: they are grouped into the same age range and city.
While they fall in different salary ranges and exact values are hidden, those ranges are still quite narrow.
Because the salary ranges are so similar, one can still infer that both individuals earn between $90,000 and $119,999.
| Age Range | City | Salary Range |
|-----------|------|--------------|
| 40–49 | Calgary | 110,000–119,999 |
| 40–49 | Calgary | 90,000–99,999 |
Differential privacy
So, we may need more sophisticated tools to protect the privacy of our data…
Differential privacy is a mathematical approach to protecting privacy.
It ensures that the results of an analysis are nearly the same whether or not any one person’s data is included.
This makes it hard to tell whether any individual’s data is in the dataset, which protects individuals’ information (even those with unusual or unique data).