Data Privacy: The Netflix Challenge

A case study in how "anonymous" data isn't anonymous

In 2006, Netflix released an "anonymous" dataset of 100 million ratings from roughly 500,000 users to power a $1M machine-learning competition. In 2007, researchers broke the anonymization. The decision Netflix faced looked like this:

Option A: Full release (names + movies + ratings). Real names plus viewing lists is an immediate breach: everyone's viewing history is exposed.

Option B: "Anonymous" release (random ID + movies + ratings). Netflix's thinking: an ID like A7X9B2 plus a movie list is safe, because there is "no way to connect to real people."

Option C: Don't release anything. The data stays protected, but there is no competition and no $1M prize.

Netflix chose Option B.

How the Attack Worked

In 2007, UT Austin researchers showed how a linkage attack could expose the viewing histories of the 500,000 users in Netflix's "anonymous" dataset.

What Netflix thought. The released records looked like this:

  User: A7X9B2
  Movies: [Shrek, Matrix, Titanic, ...]
  Ratings: [4, 5, 2, ...]
  Dates: [2006-01-15, ...]

Netflix's assumptions: random IDs cannot be linked back to real people, and movie preferences alone aren't identifying information. Conclusion: the data is "de-identified" and safe, with no connection possible between an anonymous ID and a real identity.

What actually happened: the linkage attack. The researchers joined the "anonymous" Netflix records against public IMDB reviews. If anonymous user A7X9B2 rated Shrek 4 stars on 2006-01-15 and The Matrix 5 stars on 2006-01-20, and a public IMDB profile belonging to Alice Smith posted the same ratings on the same dates, then A7X9B2 is Alice Smith with roughly 99% confidence. Conceptually, the attack is a single join:

-- The attack query (conceptual)
SELECT
  n.*,
  'Alice Smith' AS revealed_identity
FROM netflix n
JOIN imdb i
  ON n.movie = i.movie
 AND n.rating = i.rating
WHERE i.user = 'Alice Smith';  -- ~99% match

Result: all 200 of that user's rated movies are exposed, including sensitive viewing preferences.

Privacy Attacks

Attack Type | How It Works                  | Example                         | Prevention
Linkage     | Join anonymous + public data  | Netflix + IMDB                  | Don't release individual records
Inference   | Unique attribute combinations | DOB + Gender + ZIP = 87% unique | Generalize attributes (see the sketch below)
Location    | Movement patterns are unique  | 4 locations = 95% unique        | Aggregate or add noise
Temporal    | Time patterns reveal identity | Login times, activity patterns  | Round timestamps
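
The inference row can be made concrete: before releasing a table, measure how many people are uniquely pinned down by a combination of quasi-identifiers. A minimal sketch in the same BigQuery-style SQL used later in this section, assuming a users table with dob, gender, and zip columns (those column names are illustrative, not taken from any dataset above):

-- Re-identification risk check: how many users share their exact
-- (dob, gender, zip) combination with nobody else?
WITH combos AS (
  SELECT
    dob,
    gender,
    zip,
    COUNT(*) AS group_size        -- users sharing this exact combination
  FROM users
  GROUP BY dob, gender, zip
)
SELECT
  COUNTIF(group_size = 1) AS uniquely_identifiable_users,  -- k = 1: re-identifiable
  COUNT(*)                AS distinct_combinations
FROM combos;

If a large share of users falls in groups of size 1, the "generalize attributes" prevention applies: coarsen date of birth to an age bracket and ZIP to a prefix before release, as in the k-anonymity example further down.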

Stay Out of Jail: Legal & Trust Requirements

Regulation | Scope                | Key Requirement                           | Penalty
GDPR       | People in the EU     | Data stays in EU/adequate countries       | Up to 4% of global revenue
CCPA       | California residents | Right to delete personal data             | $7,500 per violation
HIPAA      | US healthcare        | Protect electronic health records (ePHI)  | Up to $2M + criminal charges

Defense Methods

Synthetic Data

Generate fake data that mimics the statistical properties of real data without containing any real individual's records. Goal: use for development, debugging, testing, and basic ML workflows; a minimal sketch follows below.
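
A deliberately simple way to see the idea: shuffle each column independently, so per-column statistics are preserved but no output row corresponds to a real person. Real synthetic-data generators instead model the joint distribution, correlations included. A minimal sketch in BigQuery-style SQL over the users table from the k-anonymity example below (age_group and zip_prefix); purely illustrative:

-- Minimal synthetic-data sketch: shuffle columns independently.
-- Each column keeps its real distribution, but cross-column links are broken,
-- so no output row matches a real individual.
WITH shuffled_age AS (
  SELECT age_group, ROW_NUMBER() OVER (ORDER BY RAND()) AS rn
  FROM users
),
shuffled_zip AS (
  SELECT zip_prefix, ROW_NUMBER() OVER (ORDER BY RAND()) AS rn
  FROM users
)
SELECT
  a.age_group,
  z.zip_prefix
FROM shuffled_age a
JOIN shuffled_zip z USING (rn);

This is good enough for exercising schemas, pipelines, and dashboards in dev and test, but because cross-column correlations are destroyed it is a poor stand-in for real data when a model depends on them.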

Differential Privacy

Add calibrated noise to query results to protect individual privacy while preserving aggregate accuracy. Goal: use for production ML and analytics workflows. The table below shows the privacy vs. utility tradeoff, followed by a sketch of the underlying noise mechanism.

Epsilon (ε) | Privacy Level | Noise Amount | Use Case
10          | Low           | ±1-2%        | Internal analytics
1.0         | Balanced      | ±10%         | Research datasets
0.1         | High          | ±100%        | Public release
0.01        | Maximum       | ±1000%       | Highly sensitive
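
Where do these noise amounts come from? For a count query, the classic Laplace mechanism adds noise drawn from Laplace(0, sensitivity / ε). A count has sensitivity 1 (adding or removing one person changes it by at most 1), so a smaller ε means a wider noise distribution. A hand-rolled sketch of that sampling step against the same medical_records table used in the managed query below; conceptual only, not a substitute for a vetted DP implementation:

-- Laplace mechanism, by hand: noisy_count = true_count + Laplace(0, sensitivity / epsilon).
-- Sampling trick: the difference of two independent Exponential(b) draws is Laplace(0, b),
-- which works out to b * LN(u1 / u2) for two independent uniforms.
WITH counts AS (
  SELECT diagnosis, COUNT(*) AS true_count
  FROM medical_records
  GROUP BY diagnosis
)
SELECT
  diagnosis,
  true_count,
  -- scale b = sensitivity / epsilon = 1 / 1.0
  true_count + (1.0 / 1.0) * LN(RAND() / RAND()) AS noisy_count
FROM counts;

Note what this sketch ignores: it still reveals exactly which diagnosis values exist, and it does nothing to limit how many groups one person contributes to. The managed clause below addresses both concerns, which is why it is the one to use in production.
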
-- BigQuery Differential Privacy (production-ready, managed)
SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS(
    epsilon = 1.0,                    -- Privacy budget (lower = more private)
    delta = 1e-5,                     -- Probability bound on privacy failure
    max_groups_contributed = 1,       -- Limit each user's contribution to one group
    privacy_unit_column = patient_id  -- Column identifying one person (column name assumed here)
  )
  diagnosis,
  COUNT(*) AS noisy_count
FROM medical_records
GROUP BY diagnosis;

-- Aggregation defense (k-anonymity)
SELECT
  age_group,   -- Not exact age
  zip_prefix,  -- First 3 digits only
  COUNT(*) AS user_count
FROM users
GROUP BY age_group, zip_prefix
HAVING COUNT(*) >= 5;  -- Minimum group size: every released row covers at least k = 5 people

Key Takeaways

  1. Your data is a fingerprint - As few as 8 movie ratings (with approximate dates) can uniquely identify you

  2. Anonymous ≠ Private - Public data sources enable re-identification

  3. Combinations matter - DOB + Gender + ZIP identifies 87% of Americans

  4. Math protects privacy - Differential privacy provides proven guarantees

  5. Always a tradeoff - More privacy = less accuracy, choose wisely