The Non-Insured Health Benefits (NIHB) database contains sensitive health data on First Nations use of services like prescriptions, dental care, and medical devices.
In 2001, Health Canada began releasing de-identified NIHB pharmacy claims data to Brogan Inc., a private health consulting firm.
Though personal identifiers were removed, community identifiers remained, and First Nations were not informed until 2007.
Brogan sold the data to pharmaceutical companies for commercial research and marketing.
Health Canada justified the release by claiming no privacy interests remained since personally identifying information had been removed.
Kukutai, T., & Taylor, J. (2016). Indigenous data sovereignty: Toward an agenda. ANU Press.
Discussion
Take 5 minutes to discuss this case in groups of 2-3, considering the following questions:
Was the data truly de-identified?
Should de-identified data still require community consent before being shared or sold?
What are the limits of simply removing names and IDs from a dataset?
How can we measure whether a dataset is truly “safe” to release?
Learning Objectives
By the end of today’s lesson, you should be able to:
Define identifiers, quasi-identifiers, and sensitive attributes in datasets
Explain the limitations of basic de-identification methods
Describe the concepts of \(k\)-anonymity and \(l\)-diversity
Apply \(k\)-anonymity and \(l\)-diversity to de-identify data
Understand the basic idea of differential privacy and its significance
Why basic de-identification isn’t always enough
Last class, we introduced some de-identification techniques such as suppression and generalization.
However, individuals can often be re-identified using other information.
As datasets become more detailed and linkable, privacy risks increase.
Statistical methods are needed to ensure meaningful de-identification while preserving data utility.
Statistical approaches to de-identification
\(k\)-anonymity
\(l\)-diversity
Differential privacy (advanced)
Overview of privacy models
\(k\)-anonymity and \(l\)-diversity are statistical approaches that quantify the level of identifiability within a tabular dataset.
They focus on how combinations of variables can lead to identification.
These approaches are complementary: a dataset can be simultaneously \(k\)-anonymous and \(l\)-diverse, where \(k\) and \(l\) represent numeric thresholds.
\(k\)-anonymity and \(l\)-diversity are typically used to de-identify tabular datasets before sharing.
They work best on relatively large datasets, where enough observations are present to preserve useful detail while still protecting privacy.
Identifiers, Quasi-Identifiers, and Sensitive Attributes
Privacy models distinguish between three types of variables:
Identifiers: Direct identifiers such as names, student numbers, email addresses.
Quasi-Identifiers: Indirect identifiers that can lead to identification when combined with other quasi-identifiers or external data.
Examples: age, sex, place of residence, physical characteristics, timestamps, etc.
Sensitive Attributes: Variables that need protection and cannot be altered, since they are the key outcomes of interest.
Examples: medical condition, income, etc.
Importance of Correct Variable Categorization
Correctly categorizing variables into identifiers, quasi-identifiers, and sensitive attributes is crucial.
This categorization determines how to de-identify your dataset effectively using \(k\)-anonymity and \(l\)-diversity.
Now, let’s discuss each of these techniques in detail…
\(k\)-anonymity
A data set is \(k\)-anonymous if each observation cannot be distinguished from at least \(k-1\) other observations based on the quasi-identifiers.
This can be achieved through generalization, suppression and sometimes top- or bottom-coding of data values.
Applying \(k\)-anonymity makes it more difficult for an attacker to single out or re-identify specific individuals.
It also helps reduce the risk of the mosaic effect, where combining data points could lead to identification.
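As a concrete check, the \(k\) of a table is simply the size of its smallest quasi-identifier group. A minimal sketch in Python (the function name and toy records are illustrative, not from a particular library):

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the k of a dataset: the size of the smallest group of
    records sharing the same combination of quasi-identifier values."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values())

# Toy records: age and city are the quasi-identifiers.
rows = [
    {"age": 38, "city": "Calgary"},
    {"age": 37, "city": "Toronto"},
    {"age": 38, "city": "Calgary"},
]
print(k_anonymity(rows, ["age", "city"]))  # → 1: the (37, Toronto) record is unique
```

Because one record is unique on its quasi-identifiers, this toy table is only \(1\)-anonymous, i.e. not protected at all.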
Making a data set \(k\)-anonymous
Identify variables as identifiers, quasi-identifiers and sensitive attributes.
Choose a value for \(k\).
Aggregate or transform the data so each combination of quasi-identifiers occurs at least \(k\) times.
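The steps above can be sketched in Python on the example data from the slides, generalizing exact ages into decade ranges (the helper name is illustrative):

```python
from collections import Counter

def generalize_age(age):
    """Generalize an exact age to a decade range, e.g. 38 -> '30-39'."""
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

# The (age, city) quasi-identifiers from the example dataset.
records = [(38, "Calgary"), (37, "Toronto"), (31, "Vancouver"),
           (48, "Calgary"), (39, "Vancouver"), (37, "Calgary"),
           (34, "Toronto"), (33, "Vancouver"), (32, "Toronto"),
           (45, "Calgary")]

# Step 3: transform the data, then verify the smallest group size.
generalized = [(generalize_age(age), city) for age, city in records]
k = min(Counter(generalized).values())
print(k)  # → 2: every (age range, city) combination now occurs at least twice
```

Generalizing age is enough here; with a different dataset, city might also need generalization or suppression to reach the chosen \(k\).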
Choosing \(k\)
There is no single correct value for \(k\)!
Higher \(k\) increases privacy, but reduces data detail and utility.
The choice depends on promises made to data subjects and acceptable risk levels.
Age and city are quasi-identifiers, and salary is considered a sensitive attribute.
| Age | City | Salary |
|-----|------|--------|
| 38 | Calgary | 90,000–99,999 |
| 37 | Toronto | 90,000–99,999 |
| 31 | Vancouver | 80,000–89,999 |
| 48 | Calgary | 110,000–119,999 |
| 39 | Vancouver | 110,000–119,999 |
| 37 | Calgary | 90,000–99,999 |
| 34 | Toronto | 90,000–99,999 |
| 33 | Vancouver | 80,000–89,999 |
| 32 | Toronto | 100,000–109,999 |
| 45 | Calgary | 90,000–99,999 |
\(k=2\)
| Age Range | City | Salary Range |
|-----------|------|--------------|
| 30–39 | Calgary | 90,000–99,999 |
| 30–39 | Toronto | 90,000–99,999 |
| 30–39 | Vancouver | 80,000–89,999 |
| 40–49 | Calgary | 110,000–119,999 |
| 30–39 | Vancouver | 110,000–119,999 |
| 30–39 | Calgary | 90,000–99,999 |
| 30–39 | Toronto | 90,000–99,999 |
| 30–39 | Vancouver | 80,000–89,999 |
| 30–39 | Toronto | 100,000–109,999 |
| 40–49 | Calgary | 90,000–99,999 |
\(l\)-diversity
\(l\)-diversity is an extension of \(k\)-anonymity that ensures sufficient variation in a sensitive attribute.
This is important because if all individuals within a group share the same sensitive value, there is still a risk of inference.
Although these data are \(2\)-anonymous, we can still infer that any 30–39-year-old participant from Calgary earns between $90,000 and $99,999.
| Age Range | City | Salary Range |
|-----------|------|--------------|
| 30–39 | Calgary | 90,000–99,999 |
| 30–39 | Toronto | 90,000–99,999 |
| 30–39 | Vancouver | 80,000–89,999 |
| 40–49 | Calgary | 110,000–119,999 |
| 30–39 | Vancouver | 110,000–119,999 |
| 30–39 | Calgary | 90,000–99,999 |
| 30–39 | Toronto | 90,000–99,999 |
| 30–39 | Vancouver | 80,000–89,999 |
| 30–39 | Toronto | 100,000–109,999 |
| 40–49 | Calgary | 90,000–99,999 |
\(l\)-diversity
The approach requires at least \(l\) different values for the sensitive attribute within each combination of quasi-identifiers.
Again, there is no perfect value for \(l\) (typically \(1< l \leq k\)).
With \(l=2\), that means that for each combination of Age Range and City, there are at least 2 distinct Salary Ranges.
| Age Range | City | Salary Range |
|-----------|------|--------------|
| 30–39 | - | 90,000–99,999 |
| 30–39 | - | 90,000–99,999 |
| 30–39 | - | 80,000–89,999 |
| 40–49 | Calgary | 110,000–119,999 |
| 30–39 | - | 110,000–119,999 |
| 30–39 | - | 90,000–99,999 |
| 30–39 | - | 90,000–99,999 |
| 30–39 | - | 80,000–89,999 |
| 30–39 | - | 100,000–109,999 |
| 40–49 | Calgary | 90,000–99,999 |
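This can also be checked mechanically: the \(l\) of a table is the smallest number of distinct sensitive values within any quasi-identifier group. A minimal Python sketch using the \(2\)-anonymous table from the earlier slide (the function name is illustrative):

```python
from collections import defaultdict

def l_diversity(rows, quasi_identifiers, sensitive):
    """Return the l of a dataset: the smallest number of distinct
    sensitive values found within any quasi-identifier group."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].add(row[sensitive])
    return min(len(values) for values in groups.values())

# The 2-anonymous table before suppressing city.
rows = [
    {"age": "30-39", "city": "Calgary",   "salary": "90,000-99,999"},
    {"age": "30-39", "city": "Toronto",   "salary": "90,000-99,999"},
    {"age": "30-39", "city": "Vancouver", "salary": "80,000-89,999"},
    {"age": "40-49", "city": "Calgary",   "salary": "110,000-119,999"},
    {"age": "30-39", "city": "Vancouver", "salary": "110,000-119,999"},
    {"age": "30-39", "city": "Calgary",   "salary": "90,000-99,999"},
    {"age": "30-39", "city": "Toronto",   "salary": "90,000-99,999"},
    {"age": "30-39", "city": "Vancouver", "salary": "80,000-89,999"},
    {"age": "30-39", "city": "Toronto",   "salary": "100,000-109,999"},
    {"age": "40-49", "city": "Calgary",   "salary": "90,000-99,999"},
]
print(l_diversity(rows, ["age", "city"], "salary"))  # → 1: all 30-39 Calgary records share one salary range
```

An \(l\) of 1 confirms the inference problem: \(2\)-anonymity alone does not stop an attacker from learning the salary range of every 30–39-year-old Calgary participant.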
There are still issues…
Even though the data is de-identified, some sensitive patterns can still leak through.
Consider the two 40–49 Calgary individuals from the example: they are grouped into the same age range and city.
While they fall in different salary ranges and exact values are hidden, those ranges are still quite narrow.
Because the salary ranges are so similar, one can still infer that both individuals earn between $90,000 and $119,999.
| Age Range | City | Salary Range |
|-----------|------|--------------|
| 40–49 | Calgary | 110,000–119,999 |
| 40–49 | Calgary | 90,000–99,999 |
Differential privacy
So, we may need more sophisticated tools to protect the privacy of our data…
Differential privacy is a mathematical approach to protecting privacy.
It ensures that the results of an analysis are nearly the same whether or not any one person’s data is included.
This makes it hard to tell whether any individual’s data is in the dataset, which protects individuals’ information (even those with unusual or unique data).