Data Privacy: The Netflix Challenge

Group 3: Safely Using Data - The Linkage Attack

A case study in how "anonymous" data isn't anonymous

The Netflix Challenge: A Privacy Case Study

2006: Netflix releases "anonymous" data. 2007: Researchers break its anonymity.

The decision: release 100M ratings from 500K users for a $1M machine-learning competition. Netflix had three options:

Option A: Full release (names + movies + ratings). Real names plus movie lists is an immediate breach: everyone's viewing history is exposed.

Option B: "Anonymous" release (random ID + movies + ratings). Netflix's thinking: an ID like A7X9B2 plus a movie list is safe, because there is "no way to connect to real people."

Option C: Don't release. The data stays protected, but there is no competition and no $1M prize.

Netflix chose Option B.

How the Attack Worked

How UT Austin Researchers Broke Netflix's "Anonymous" Data (2007)

The linkage attack that exposed 500,000 users' viewing histories.

What Netflix thought: the released dataset contained only an opaque ID per user, for example User A7X9B2 with Movies [Shrek, Matrix, Titanic, ...], Ratings [4, 5, 2, ...], and Dates [2006-01-15, ...]. Netflix's assumption was that "random IDs cannot be linked back to real people" and that "movie preferences alone aren't identifying information," so the data was considered de-identified and safe: no connection between an anonymous ID and a real identity was thought possible.

What actually happened: the linkage attack joined the "anonymous" Netflix records against public IMDB reviews. If Netflix user A7X9B2 rated Shrek (4) on 2006-01-15 and The Matrix (5) on 2006-01-20, and Alice Smith's public IMDB reviews list the same movies, ratings, and dates, then A7X9B2 is Alice Smith with about 99% confidence.

-- The attack query (simplified)
SELECT n.*, 'Alice Smith' AS revealed_identity
FROM netflix n
JOIN imdb i
  ON  n.movie = i.movie
  AND n.rating = i.rating
WHERE i.user = 'Alice Smith';
-- 99% match: all 200 of the user's movies exposed, including sensitive viewing preferences
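
The exact-match join above is a simplification. A closer sketch of the published attack tolerates small differences in rating dates and only declares a match once several ratings line up; the table and column names here (netflix, imdb, user_id, rated_on) are illustrative, not the researchers' actual code:

-- Sketch: fuzzy linkage that allows rating dates to differ by up to 14 days
-- and requires at least 8 matching ratings before claiming a re-identification
SELECT
  n.user_id AS netflix_id,
  i.user    AS imdb_identity,
  COUNT(*)  AS matching_ratings
FROM netflix n
JOIN imdb i
  ON  n.movie = i.movie
  AND n.rating = i.rating
  AND ABS(DATE_DIFF(n.rated_on, i.rated_on, DAY)) <= 14
GROUP BY n.user_id, i.user
HAVING COUNT(*) >= 8;

Narayanan and Shmatikov reported that roughly eight ratings, with dates known only to within about two weeks, are enough to uniquely identify the vast majority of records in the dataset.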

Privacy Attacks

Attack Type | How It Works | Example | Prevention
Linkage | Combine anonymous data with public data | Netflix + IMDB | Avoid releasing individual records
Inference | Unique attribute combinations | DOB + Gender + ZIP = 87% unique | Generalize attributes
Location | Unique movement patterns | 4 locations = 95% unique | Aggregate or add noise
Temporal | Time patterns reveal identity | Login times, activity patterns | Round timestamps
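
The inference row above can be checked directly against a dataset: count how many people share each quasi-identifier combination. A minimal sketch, assuming a hypothetical users table with dob, gender, and zip columns:

-- Sketch: what fraction of users is uniquely identified by DOB + gender + ZIP?
SELECT
  COUNTIF(group_size = 1) AS users_uniquely_identified,  -- a group of size 1 is exactly one person
  SUM(group_size)         AS total_users,
  ROUND(100 * COUNTIF(group_size = 1) / SUM(group_size), 1) AS pct_identifiable
FROM (
  SELECT dob, gender, zip, COUNT(*) AS group_size
  FROM users
  GROUP BY dob, gender, zip
);

Running the same query with generalized columns (age bands instead of exact DOB, three-digit ZIP prefixes) shows how much the prevention step in the table reduces uniqueness.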

Stay Out of Jail: Legal & Trust Requirements

Regulation | Scope | Key Requirement | Penalty
GDPR | People in the EU | Personal data stays in the EU or countries with adequate protection | Up to 4% of global annual revenue
CCPA | California residents | Right to delete personal data | Up to $7,500 per violation
HIPAA | US healthcare | Protect electronic health records (ePHI) | Up to ~$2M plus criminal charges

Defense Methods

Synthetic Data

Create fake data that mirrors the statistical properties of real data without containing any actual individual records. This is useful for development, debugging, testing, and basic machine learning workflows.
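
A minimal sketch of the idea in BigQuery SQL, assuming a hypothetical users table with an age column: compute aggregate statistics from the real data, then generate fake rows from those statistics instead of copying any real record.

-- Sketch: 1,000 synthetic ages that roughly match the real age distribution
-- (only the real table's mean and standard deviation leave the database)
WITH stats AS (
  SELECT AVG(age) AS mean_age, STDDEV(age) AS sd_age
  FROM users
)
SELECT
  -- a sum of three uniforms approximates a bell curve; the 2 * sd_age factor matches the spread
  CAST(mean_age + 2 * sd_age * (RAND() + RAND() + RAND() - 1.5) AS INT64) AS synthetic_age
FROM stats, UNNEST(GENERATE_ARRAY(1, 1000)) AS row_num;

Production synthetic-data tools model joint distributions and correlations across many columns; this sketch preserves only one column's rough shape, but the principle is the same: statistics leave the database, individual rows do not.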

Differential Privacy

Introduce calibrated noise to query results to safeguard individual privacy while preserving aggregate accuracy. Ideal for production machine learning workflows. Here's how privacy and utility trade off:

Epsilon (ε) | Privacy Level | Noise Amount | Use Case
10 | Low | ±1-2% | Internal analytics
1.0 | Balanced | ±10% | Research datasets
0.1 | High | ±100% | Public release
0.01 | Maximum | ±1000% | Highly sensitive
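
These noise figures are rough guides, but they follow from the calibration itself: under the standard Laplace mechanism, a count query (sensitivity 1) receives noise with scale 1/ε, so halving ε roughly doubles the typical error. Below is a hand-rolled illustration of that calibration, for intuition only; the production-ready BigQuery clause that follows does this, plus the required per-user bookkeeping, for you.

-- Sketch: manual Laplace noise with scale 1/epsilon (here epsilon = 1.0)
-- Illustration only; use the built-in DIFFERENTIAL_PRIVACY clause in practice
WITH grouped AS (
  SELECT diagnosis, COUNT(*) AS raw_count, RAND() - 0.5 AS u  -- one uniform draw per group
  FROM medical_records
  GROUP BY diagnosis
)
SELECT
  diagnosis,
  -- inverse-CDF sample from Laplace(scale = 1 / 1.0), added to the true count
  raw_count - (1 / 1.0) * SIGN(u) * LN(GREATEST(1 - 2 * ABS(u), 1e-12)) AS noisy_count
FROM grouped;
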
-- BigQuery Differential Privacy (Production Ready)
SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS(
    epsilon = 1.0,                    -- Privacy budget (lower = more private)
    delta = 1e-5,                     -- Probability bound on a privacy failure
    max_groups_contributed = 1,       -- Limit each user's contribution to one group
    privacy_unit_column = patient_id  -- Column identifying each user (assumed column name)
  )
  diagnosis,
  COUNT(*) AS noisy_count
FROM medical_records
GROUP BY diagnosis;

-- Aggregation Defense (k-anonymity, k = 5)
SELECT
  age_group,   -- generalized: age bands, not exact age
  zip_prefix,  -- generalized: first 3 ZIP digits only
  COUNT(*) AS group_size
FROM users
GROUP BY age_group, zip_prefix
HAVING COUNT(*) >= 5;  -- suppress groups smaller than k = 5

Key Takeaways

  1. Your data is a fingerprint - Just 8 movie ratings can uniquely identify you.

  2. Anonymous ≠ Private - Public data sources enable re-identification.

  3. Combinations matter - DOB + Gender + ZIP identifies 87% of Americans.

  4. Math protects privacy - Differential privacy provides proven guarantees.

  5. Always a tradeoff - More privacy means less accuracy. Choose wisely.