Data Privacy: The Netflix Challenge
A case study in how "anonymous" data isn't anonymous
How the Attack Worked
In 2006, Netflix published roughly 100 million "anonymized" movie ratings as training data for its recommendation-algorithm prize. Researchers Narayanan and Shmatikov showed that by matching rating values and dates against public IMDb reviews, many subscribers could be re-identified — the release was anonymous in name only.
Privacy Attacks
| Attack Type | How It Works | Example | Prevention |
|---|---|---|---|
| Linkage | Combine anonymous with public data | Netflix + IMDB | Avoid releasing individual records |
| Inference | Unique attribute combinations | DOB + Gender + ZIP = 87% unique | Generalize attributes |
| Location | Unique movement patterns | 4 locations = 95% unique | Aggregate or add noise |
| Temporal | Time patterns reveal identity | Login times, activity patterns | Round timestamps |
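The linkage attack in the first row is the one used against Netflix: join the "anonymous" table to a public source on shared attributes until only one candidate remains. A minimal sketch in Python, using made-up data (all names, IDs, and ratings below are hypothetical):

```python
# Linkage attack sketch: re-identify an "anonymized" ratings table by
# joining it with public reviews on quasi-identifiers (movie, rating, date).

# "Anonymized" release: names removed, but ratings and dates remain.
anonymous_ratings = [
    {"user_id": "u1", "movie": "Heat", "rating": 5, "date": "2006-03-01"},
    {"user_id": "u1", "movie": "Brazil", "rating": 4, "date": "2006-03-04"},
    {"user_id": "u2", "movie": "Heat", "rating": 2, "date": "2006-05-11"},
]

# Public data: the same ratings posted under a real name.
public_reviews = [
    {"name": "Alice", "movie": "Heat", "rating": 5, "date": "2006-03-01"},
    {"name": "Alice", "movie": "Brazil", "rating": 4, "date": "2006-03-04"},
]

def link(anon, public):
    """Return names linked to exactly one anonymous user ID."""
    matches = {}
    for p in public:
        for a in anon:
            if (a["movie"], a["rating"], a["date"]) == (p["movie"], p["rating"], p["date"]):
                matches.setdefault(p["name"], set()).add(a["user_id"])
    # A name matching exactly one user_id is a re-identification.
    return {name: ids.pop() for name, ids in matches.items() if len(ids) == 1}

print(link(anonymous_ratings, public_reviews))  # {'Alice': 'u1'}
```

The defense column follows directly: if individual records (rows) are never released, there is nothing to join against.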
Stay Out of Jail: Legal & Trust Requirements
| Regulation | Scope | Key Requirement | Penalty |
|---|---|---|---|
| GDPR | Data subjects in the EU | Data stays in EU/adequate countries | Up to 4% of global annual revenue |
| CCPA | California residents | Right to delete personal data | Up to $7,500 per intentional violation |
| HIPAA | Healthcare | Protect electronic health records (ePHI) | $2M + criminal charges |
Defense Methods
Synthetic Data
Create fake data that mirrors the statistical properties of real data without containing any actual individual records. This is useful for development, debugging, testing, and basic machine learning workflows.
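A minimal sketch of the idea in Python: sample each column from the marginal distributions of the real data. The column names and frequencies below are assumptions for illustration, and note that independent per-column sampling preserves marginals but not cross-column correlations — real synthetic-data tools model those too.

```python
import random

# Marginal distributions measured from a (hypothetical) real dataset.
real_marginals = {
    "age_group": {"18-29": 0.3, "30-44": 0.4, "45+": 0.3},
    "plan":      {"basic": 0.5, "premium": 0.5},
}

def synthesize(n, marginals, seed=0):
    """Generate n fake rows matching the given per-column frequencies."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {
            col: rng.choices(list(dist), weights=list(dist.values()))[0]
            for col, dist in marginals.items()
        }
        rows.append(row)
    return rows

synthetic = synthesize(1000, real_marginals)
```

No row in `synthetic` corresponds to a real individual, so the result can be shared with developers and test pipelines freely.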
Differential Privacy
Introduce calibrated noise to query results to safeguard individual privacy while preserving aggregate accuracy. Ideal for production machine learning workflows. Here's how privacy and utility trade off:
| Epsilon (ε) | Privacy Level | Noise Amount | Use Case |
|---|---|---|---|
| 10 | Low | ±1-2% | Internal analytics |
| 1.0 | Balanced | ±10% | Research datasets |
| 0.1 | High | ±100% | Public release |
| 0.01 | Maximum | ±1000% | Highly sensitive |
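The standard mechanism behind these numbers is Laplace noise. A COUNT(*) query has sensitivity 1 (one person changes the count by at most 1), so adding noise drawn from Laplace(1/ε) satisfies ε-differential privacy. A minimal Python sketch (the seeding and counts are illustrative):

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) via the inverse CDF."""
    u = rng.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

def private_count(true_count, epsilon, seed=0):
    """Return a count satisfying epsilon-DP for a sensitivity-1 query."""
    rng = random.Random(seed)
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Smaller epsilon -> larger noise scale -> more privacy, less accuracy.
for eps in (10, 1.0, 0.1):
    print(eps, private_count(1000, eps))
```

This makes the table's tradeoff concrete: the noise scale is 1/ε, so dropping ε from 10 to 0.1 multiplies the expected error by 100.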
```sql
-- BigQuery Differential Privacy (production-ready)
SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS(
    epsilon = 1.0,               -- Privacy budget (lower = more private)
    delta = 1e-5,                -- Probability bound
    max_groups_contributed = 1   -- Limit contribution per user
  )
  diagnosis,
  COUNT(*) AS noisy_count
FROM medical_records
GROUP BY diagnosis;
```
```sql
-- Aggregation defense (k-anonymity, k = 5)
SELECT
  age_group,              -- Generalized: not exact age
  zip_prefix,             -- First 3 digits of ZIP only
  COUNT(*) AS user_count
FROM users
GROUP BY age_group, zip_prefix
HAVING COUNT(*) >= 5;     -- Suppress groups smaller than k
```
Key Takeaways
- Your data is a fingerprint - Just 8 movie ratings can uniquely identify you.
- Anonymous ≠ Private - Public data sources enable re-identification.
- Combinations matter - DOB + Gender + ZIP identifies 87% of Americans.
- Math protects privacy - Differential privacy provides proven guarantees.
- Always a tradeoff - More privacy means less accuracy. Choose wisely.