Data Privacy II

DSCI 200

Katie Burak, Gabriela V. Cohen Freue

Last modified – 31 March 2026


Attribution



This material is adapted from the following sources:

Case Study: Brogan Inc. and NIHB Data

  • The Non-Insured Health Benefits (NIHB) database contains sensitive health data on First Nations use of services like prescriptions, dental care, and medical devices.
  • In 2001, Health Canada began releasing de-identified NIHB pharmacy claims data to Brogan Inc., a private health consulting firm.
  • Though personal identifiers were removed, community identifiers remained, and First Nations were not informed until 2007.
  • Brogan sold the data to pharmaceutical companies for commercial research and marketing.
  • Health Canada justified the release by claiming no privacy interests remained since personally identifying information had been removed.

Kukutai, T., & Taylor, J. (2016). Indigenous data sovereignty: Toward an agenda. ANU press.

Discussion

Take 5 minutes to discuss this case in groups of 2–3, using the following questions to guide your reflection:

  • Was the data truly de-identified?
  • Should de-identified data still require community consent before being shared or sold?
  • What are the limits of simply removing names and IDs from a dataset?
  • How can we measure whether a dataset is truly “safe” to release?

Learning Objectives

By the end of today’s lesson, you should be able to:

  • Define identifiers, quasi-identifiers and sensitive attributes in data sets
  • Explain the limitations of basic deidentification methods
  • Describe the concepts of \(k\)-anonymity and \(l\)-diversity
  • Apply \(k\)-anonymity and \(l\)-diversity to de-identify data
  • Understand the basic idea of differential privacy and its significance

Why basic deidentification isn’t always enough

  • Last class, we introduced some techniques for deidentification such as suppression and generalization.

  • However, individuals can often be re-identified using other information.

  • As datasets become more detailed and linkable, privacy risks increase.

  • Statistical methods are needed to ensure meaningful deidentification while preserving data utility.

Statistical approaches to deidentification



  • \(k\)-anonymity
  • \(l\)-diversity
  • Differential privacy (advanced)

Overview of privacy models

  • \(k\)-anonymity and \(l\)-diversity are statistical approaches that quantify the level of identifiability within a tabular dataset.
  • They focus on how variables combined can lead to identification.
  • These approaches are complementary: a dataset can be simultaneously \(k\)-anonymous and \(l\)-diverse, where \(k\) and \(l\) represent numeric thresholds.
  • \(k\)-anonymity and \(l\)-diversity are typically used to de-identify tabular datasets before sharing.
  • They work best on relatively large datasets, where enough observations are present to preserve useful detail while still protecting privacy.

Identifiers, Quasi-Identifiers, and Sensitive Attributes

Privacy models distinguish between three types of variables:

  • Identifiers: Direct identifiers such as names, student numbers, email addresses.

  • Quasi-Identifiers: Indirect identifiers that can lead to identification when combined with other quasi-identifiers or external data.

    • Examples: age, sex, place of residence, physical characteristics, timestamps, etc.
  • Sensitive Attributes: Variables of interest that need protection and cannot be altered as they are key outcomes.

    • Examples: Medical condition, Income, etc.
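For a dataset like the salary example used later in these slides, this categorization might be recorded as a simple mapping before de-identification begins. A minimal sketch (the column names and role labels are illustrative, not from the slides):

```python
# Illustrative column-to-role mapping (column names are ours).
roles = {
    "name": "identifier",        # remove before release
    "email": "identifier",
    "age": "quasi-identifier",   # generalize, e.g. into 10-year ranges
    "city": "quasi-identifier",
    "salary": "sensitive",       # protect, but keep as the outcome of interest
}

# Collect the quasi-identifiers, which k-anonymity will operate on.
quasi_ids = [col for col, role in roles.items() if role == "quasi-identifier"]
print(quasi_ids)  # ['age', 'city']
```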

Importance of Correct Variable Categorization

  • Correctly categorizing variables into identifiers, quasi-identifiers, and sensitive attributes is crucial.
  • This categorization determines how to de-identify your dataset effectively using \(k\)-anonymity and \(l\)-diversity.
  • Now, let’s discuss each of these techniques in detail…

\(k\)-anonymity

  • A data set is \(k\)-anonymous if each observation cannot be distinguished from at least \(k-1\) other observations based on the quasi-identifiers.
  • This can be achieved through generalization, suppression and sometimes top- or bottom-coding of data values.
  • Applying \(k\)-anonymity makes it more difficult for an attacker to single out or re-identify specific individuals.
  • It also helps reduce the risk of the mosaic effect, where combining data points could lead to identification.

Making a data set \(k\)-anonymous

  1. Identify variables as identifiers, quasi-identifiers and sensitive attributes.
  2. Choose a value for \(k\).
  3. Aggregate or transform the data so that each combination of quasi-identifiers occurs at least \(k\) times.
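The steps above can be sketched as a simple check in Python (the function name and toy rows are ours, not from the slides):

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """True if every combination of quasi-identifier values appears
    in at least k rows -- the k-anonymity condition."""
    counts = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return all(c >= k for c in counts.values())

# With exact ages, every quasi-identifier combination is unique,
# so this raw table is only 1-anonymous.
raw = [
    {"age": 38, "city": "Calgary"},
    {"age": 37, "city": "Toronto"},
    {"age": 31, "city": "Vancouver"},
    {"age": 48, "city": "Calgary"},
]
print(is_k_anonymous(raw, ["age", "city"], 2))  # False
```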

Choosing \(k\)

  • There is no single correct value for \(k\)!
  • Higher \(k\) increases privacy, but reduces data detail and utility.
  • The choice depends on promises made to data subjects and acceptable risk levels.

Source: k2view.com

Example data

  • Age and city are quasi-identifiers, and salary is considered a sensitive attribute.
Age City Salary
38 Calgary 90,000–99,999
37 Toronto 90,000–99,999
31 Vancouver 80,000–89,999
48 Calgary 110,000–119,999
39 Vancouver 110,000–119,999
37 Calgary 90,000–99,999
34 Toronto 90,000–99,999
33 Vancouver 80,000–89,999
32 Toronto 100,000–109,999
45 Calgary 90,000–99,999

\(k=2\)

Age Range City Salary Range
30–39 Calgary 90,000–99,999
30–39 Toronto 90,000–99,999
30–39 Vancouver 80,000–89,999
40–49 Calgary 110,000–119,999
30–39 Vancouver 110,000–119,999
30–39 Calgary 90,000–99,999
30–39 Toronto 90,000–99,999
30–39 Vancouver 80,000–89,999
30–39 Toronto 100,000–109,999
40–49 Calgary 90,000–99,999
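The generalization above can be produced programmatically. A minimal sketch using the ten rows from the example (the helper name is ours):

```python
from collections import Counter

def age_range(age, width=10):
    """Generalize an exact age to a decade range such as '30-39'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

rows = [(38, "Calgary"), (37, "Toronto"), (31, "Vancouver"),
        (48, "Calgary"), (39, "Vancouver"), (37, "Calgary"),
        (34, "Toronto"), (33, "Vancouver"), (32, "Toronto"),
        (45, "Calgary")]

# Count how often each (age range, city) combination occurs.
counts = Counter((age_range(a), city) for a, city in rows)
smallest_group = min(counts.values())
print(smallest_group)  # 2, so the generalized table is 2-anonymous
```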

\(l\)-diversity

  • \(l\)-diversity is an extension of \(k\)-anonymity that ensures sufficient variation in a sensitive attribute.
  • This is important because if all individuals within a group share the same sensitive value, there is still a risk of inference.

  • Although these data are \(2\)-anonymous, we can still infer that any 30–39-year-old from Calgary who participated earns between $90,000 and $99,999.
Age Range City Salary Range
30–39 Calgary 90,000–99,999
30–39 Toronto 90,000–99,999
30–39 Vancouver 80,000–89,999
40–49 Calgary 110,000–119,999
30–39 Vancouver 110,000–119,999
30–39 Calgary 90,000–99,999
30–39 Toronto 90,000–99,999
30–39 Vancouver 80,000–89,999
30–39 Toronto 100,000–109,999
40–49 Calgary 90,000–99,999

\(l\)-diversity

  • The approach requires at least \(l\) different values for the sensitive attribute within each combination of quasi-identifiers.
  • Again, there is no single correct value for \(l\) (typically \(1 < l \leq k\)).

  • With \(l=2\), each combination of Age Range and City contains at least 2 distinct Salary Ranges.
Age Range City Salary Range
30–39 - 90,000–99,999
30–39 - 90,000–99,999
30–39 - 80,000–89,999
40–49 Calgary 110,000–119,999
30–39 - 110,000–119,999
30–39 - 90,000–99,999
30–39 - 90,000–99,999
30–39 - 80,000–89,999
30–39 - 100,000–109,999
40–49 Calgary 90,000–99,999
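Checking this condition mirrors the \(k\)-anonymity check: collect the distinct sensitive values within each quasi-identifier group. A minimal sketch (function and variable names are ours):

```python
from collections import defaultdict

def is_l_diverse(rows, quasi_ids, sensitive, l):
    """True if every quasi-identifier group contains at least l
    distinct values of the sensitive attribute."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[q] for q in quasi_ids)].add(row[sensitive])
    return all(len(vals) >= l for vals in groups.values())

# The 40-49 Calgary group from the example has two distinct
# salary ranges, so it satisfies l = 2 (but not l = 3).
rows = [
    {"age_range": "40-49", "city": "Calgary", "salary": "110,000-119,999"},
    {"age_range": "40-49", "city": "Calgary", "salary": "90,000-99,999"},
]
print(is_l_diverse(rows, ["age_range", "city"], "salary", 2))  # True
```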

There are still issues…

  • Even though the data are de-identified, some sensitive patterns can still leak through.

  • In the example we discussed, both individuals are grouped into the same age range and city.

  • While they are in different salary ranges and exact values are hidden, the range is still quite narrow.

  • Due to the similarity of the salary ranges, one can still infer that both individuals earn between $90,000 and $119,999.

Age Range City Salary Range
40–49 Calgary 110,000–119,999
40–49 Calgary 90,000–99,999

Differential privacy

  • So, we may need more sophisticated tools to privatize our data…
  • Differential privacy is a mathematical approach to protecting privacy
  • It ensures that an algorithm’s output is nearly the same whether or not any one person’s data is included
  • This makes it hard to tell whether any individual’s data is in the dataset, protecting individuals’ information (even when their data are unusual or unique)
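One common building block (not covered in depth in this course) is the Laplace mechanism: answer a numeric query, then add noise calibrated to how much a single person can change the answer. A minimal sketch for a count query, assuming a count has sensitivity 1 (the function name is ours):

```python
import random

def private_count(true_count, epsilon):
    """Release a count with Laplace noise of scale 1/epsilon.
    A count changes by at most 1 when one person's data is added
    or removed (sensitivity 1), giving epsilon-differential privacy."""
    # The difference of two iid Exponential(epsilon) draws follows
    # a Laplace distribution with scale 1/epsilon.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Smaller epsilon = more noise = stronger privacy, lower accuracy.
random.seed(0)
print(private_count(100, epsilon=1.0))
```

Averaged over many releases the noise cancels out, which is why aggregate statistics stay useful even though any single release hides whether a given individual contributed.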

Differential Privacy Example


Source: https://medium.com/data-science/a-differential-privacy-example-for-beginners-ef3c23f69401

  • Differential privacy is a complex topic and goes beyond the scope of this course
  • For a clear and accessible explanation, check out this short video:

iClicker Question 1

Given the data, which field(s) could you generalize to help achieve \(k = 3\) anonymity?

Age ZIP Code Disease
29 13053 Flu
27 13068 Flu
28 13068 Cold
45 14853 Diabetes
46 14853 Diabetes
47 14853 Cancer
  • A. Generalize Age into age ranges (e.g., 20–29, 40–49)
  • B. Suppress Disease entirely
  • C. Generalize ZIP Code to first 3 digits (e.g., 130, 148)
  • D. Generalize Age into age ranges (e.g., 20–29, 40–49) and ZIP code to first 3 digits (e.g., 130, 148)
  • E. It’s already \(k=3\) anonymous

iClicker Question 2

Which of the following datasets violates \(k = 2\) anonymity?

Option A

Age Sex ZIP
34 M 02138
34 M 02138
34 F 02139

Option B

Age Sex ZIP
22 F 10011
22 F 10011
22 F 10011

Option C

Age Range Sex ZIP Prefix
30–39 * 021**
30–39 * 021**
30–39 * 021**
  • A. Only A
  • B. Only B
  • C. Only C
  • D. A and B

iClicker Question 3

Consider this \(3\)-anonymous dataset. Is it also \(2\)-diverse with respect to “Condition”?

Age Range ZIP Prefix Condition
20–29 130** Flu
20–29 130** Flu
20–29 130** Flu
30–39 148** Cold
30–39 148** Cold
30–39 148** Cancer
  • A. Yes, both groups have 2 or more different values
  • B. No, one group violates \(l\)-diversity
  • C. Yes, because the dataset is already \(k\)-anonymous
  • D. No, both groups have only one distinct value

Key Takeaways

  • Removing direct identifiers alone does not guarantee privacy
  • Quasi-identifiers can lead to re-identification if not protected
  • \(k\)-anonymity makes each record indistinguishable from at least \(k - 1\) others
  • \(l\)-diversity improves protection by promoting diversity in sensitive attributes
  • Differential privacy offers mathematical privacy guarantees
  • Choosing privacy parameters involves balancing risk and data utility