Data Privacy: The Netflix Challenge

A case study in how "anonymous" data isn't anonymous

2006: Netflix releases an "anonymous" dataset. 2007: researchers break the anonymity.

The decision: release 100M ratings from ~500K users to power a $1M machine-learning competition. Three options were on the table:

  ❌ Option A - Full release (names + movies + ratings): personal data plus movie history is an instant privacy breach.
  ⚠️ Option B - "Anonymous" release (random ID + movies + ratings): what Netflix thought was safe.
  🔒 Option C - Don't release: the data stays protected, but there is no $1M competition.

The attack: in 2007, UT Austin researchers broke the "anonymous" release by joining it against public IMDB reviews. A Netflix record (user ID A7X9B2, 200 dated movie ratings) lined up with Alice Smith's public IMDB profile (50 dated ratings); a handful of overlapping ratings was enough for a 99%-confidence match, exposing all 200 of her private Netflix ratings.

-- The linkage attack, in spirit
SELECT n.user_id, n.all_movie_history
FROM netflix_anonymous n
JOIN imdb_public i
  ON n.movie = i.movie
 AND n.rating = i.rating
 AND n.date = i.date
WHERE i.username = 'Alice';
-- Just 8 matches = 99% confidence!

Key insights:

  - 8 movie ratings form a near-unique fingerprint
  - 87% of Americans are identified by DOB + ZIP + gender
  - 4 location points make 95% of people unique
  - The fix: differential privacy

Lesson: "Anonymous" data is rarely anonymous - it often takes just one JOIN to break privacy.
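To make the linkage concrete, here is a minimal Python sketch of the same idea. All names, IDs, and ratings below are invented; the real study matched on approximate dates and tolerated a few wrong ratings rather than requiring exact equality.

# Minimal linkage-attack sketch with made-up data
import pandas as pd

netflix = pd.DataFrame({            # the "anonymous" release
    "user_id": ["A7X9B2", "A7X9B2", "A7X9B2"],
    "movie":   ["Movie X", "Movie Y", "Movie Z"],
    "rating":  [5, 3, 4],
    "date":    ["2005-03-01", "2005-03-02", "2005-04-10"],
})
imdb = pd.DataFrame({               # public reviews posted under a real name
    "username": ["Alice", "Alice", "Alice"],
    "movie":    ["Movie X", "Movie Y", "Movie Z"],
    "rating":   [5, 3, 4],
    "date":     ["2005-03-01", "2005-03-02", "2005-04-10"],
})

# Join on (movie, rating, date): enough overlapping rows pins the
# pseudonymous ID to a named person, exposing their full rating history.
matches = netflix.merge(imdb, on=["movie", "rating", "date"])
print(matches.groupby(["user_id", "username"]).size())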

Privacy Attack & Defense Patterns

Build attacks from data components; defend with mathematical guarantees.

Building blocks (they combine to attack or to defend):

  - Data types: personal, anonymous, public, metadata
  - Attack ingredients: JOIN/LINK operations, unique identifier combinations (DOB + ZIP + gender), public profiles (Reddit/IMDB/Spotify), location history
  - Defense methods: differential privacy, synthetic data, aggregation (k ≥ 5)

Attack patterns - how privacy gets broken:

Linkage attack: join anonymous data with public sources. Anonymous + public = identity revealed.

-- Netflix + IMDB attack
SELECT a.user_id, a.private_data
FROM anonymous_dataset a
JOIN public_dataset p ON a.pattern = p.pattern
WHERE p.name = 'target';  -- 99% match!

Inference attack: unique attribute combinations reveal identity. DOB + gender + ZIP is unique for 87% of Americans.

-- 87% rule: Sweeney's attack
SELECT COUNT(*) AS matches
FROM voter_records
WHERE dob = '1990-01-15'
  AND gender = 'M'
  AND zip = '94301';  -- Returns 1!

Defense patterns - mathematical privacy guarantees:

Differential privacy: add calibrated noise to protect individuals. True count + Noise(ε) = protected result.

-- BigQuery differential privacy (abridged; full query in the Quick Reference below)
SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS(epsilon=1.0, delta=1e-5, max_groups_contributed=1)
  condition,
  COUNT(*) AS noisy_count

Synthetic data: fit a model on the real data, then release generated records that preserve its statistical patterns. Real data → ML model → synthetic dataset.

# Generate synthetic data (sdv 0.x API)
from sdv.tabular import GaussianCopula

model = GaussianCopula()
model.fit(real_data)
synthetic = model.sample(1000)  # safe to share
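A toy illustration of why the inference attack works, sketched in Python: the people and columns below are invented, but the check is the same one you would run on a real dataset to estimate how many rows a quasi-identifier combination pins down.

# Even in a tiny table, (dob, gender, zip) singles most people out.
import pandas as pd

people = pd.DataFrame({
    "name":   ["Ann", "Bob", "Cat", "Dan", "Eve"],
    "dob":    ["1990-01-15", "1990-01-15", "1985-06-02", "1979-11-30", "1990-01-15"],
    "gender": ["F", "M", "F", "M", "F"],
    "zip":    ["94301", "94301", "94110", "10001", "94301"],
})

group_sizes = people.groupby(["dob", "gender", "zip"]).size()
uniquely_identified = (group_sizes == 1).sum()   # rows sitting alone in their group
print(f"{uniquely_identified / len(people):.0%} of rows are unique on dob+gender+zip")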

Stay Out of Jail: Legal & Trust Requirements

Regulatory Compliance

Regulation | Scope | Key Requirement | Penalty
GDPR | People in the EU | Data stays in the EU or adequate countries | Up to 4% of global annual revenue
CCPA | California residents | Right to delete personal data | Up to $7,500 per violation
HIPAA | US healthcare | Protect electronic health records (ePHI) | Up to ~$2M per year in fines + criminal charges

Build User Trust

-- Bad: Release first, apologize later
SELECT * FROM user_data;  -- ❌ Netflix 2006

-- Good: Privacy by design
SELECT WITH DIFFERENTIAL_PRIVACY  -- ✅ Google/Apple approach
  OPTIONS(epsilon=1.0)
  COUNT(*) AS aggregated_insights
FROM user_data;

Action Framework

  1. Before you build: Incorporate privacy principles from day 1

  2. Before you release: Consult privacy experts for complex cases

  3. After deployment: Keep learning - this field evolves rapidly

  4. If things go wrong: expect legal AND reputational consequences

Remember: Compliance keeps you out of court. Trust keeps users coming back.

Quick Reference

Privacy Attacks

Attack Type | How It Works | Example | Prevention
Linkage | Join anonymous + public data | Netflix + IMDB | Don't release individual records
Inference | Unique attribute combinations | DOB + Gender + ZIP = 87% unique | Generalize attributes
Location | Movement patterns are unique | 4 locations = 95% unique | Aggregate or add noise
Temporal | Time patterns reveal identity | Login times, activity patterns | Round timestamps
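The prevention column mostly boils down to coarsening data before release. A hedged Python sketch of that idea follows; the column names, bin edges, and rounding granularity are illustrative choices, not fixed rules.

# Generalize quasi-identifiers: exact age -> age band, full ZIP -> 3-digit
# prefix, precise timestamp -> hour. Coarser values mean bigger groups.
import pandas as pd

users = pd.DataFrame({
    "age":       [23, 37, 41, 23],
    "zip":       ["94301", "94305", "10001", "94301"],
    "last_seen": pd.to_datetime(["2024-05-01 09:17:43", "2024-05-01 22:03:11",
                                 "2024-05-02 07:45:00", "2024-05-01 09:20:05"]),
})

released = pd.DataFrame({
    "age_group":  pd.cut(users["age"], bins=[0, 30, 50, 120], labels=["<30", "30-50", "50+"]),
    "zip_prefix": users["zip"].str[:3],               # first 3 digits only
    "last_seen":  users["last_seen"].dt.floor("h"),   # round timestamps to the hour
})
print(released)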

Defense Methods

-- BigQuery Differential Privacy (Production Ready)
SELECT WITH DIFFERENTIAL_PRIVACY 
  OPTIONS(
    epsilon = 1.0,              -- Privacy budget (lower = more private)
    delta = 1e-5,               -- Probability bound
    max_groups_contributed = 1  -- Limit contribution per user
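    -- you may also need privacy_unit_column = <per-user ID column>,
    -- unless the table or view already defines a privacy policy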
  )
  diagnosis,
  COUNT(*) AS noisy_count
FROM medical_records
GROUP BY diagnosis;

-- Aggregation Defense (k-anonymity)
SELECT 
  age_group,  -- Not exact age
  zip_prefix, -- First 3 digits only
  COUNT(*) AS group_size
FROM users
GROUP BY age_group, zip_prefix
HAVING COUNT(*) >= 5;  -- Minimum group size
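As a companion to the SQL above, a small Python check (a sketch only; the file name and columns are assumed to match the query's users table) that reports the effective k of a candidate release before you publish it:

# Pre-release k-anonymity check: the smallest group over the quasi-identifiers
# is the k your release actually provides. Assumes a users.csv with the same
# age_group / zip_prefix columns as the SQL example.
import pandas as pd

users = pd.read_csv("users.csv")

group_sizes = users.groupby(["age_group", "zip_prefix"]).size()
print(f"release is {int(group_sizes.min())}-anonymous over (age_group, zip_prefix)")

# Keep only groups meeting the threshold, mirroring the HAVING clause
safe = group_sizes[group_sizes >= 5].rename("group_size").reset_index()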

Privacy vs Utility Tradeoff

Epsilon (ε) | Privacy Level | Noise Amount | Use Case
10 | Low | ±1-2% | Internal analytics
1.0 | Balanced | ±10% | Research datasets
0.1 | High | ±100% | Public release
0.01 | Maximum | ±1000% | Highly sensitive data
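The noise percentages above are illustrative: they depend on the size of the count being protected. A short sketch of the Laplace mechanism (sensitivity 1, noise scale 1/ε) reproduces their rough magnitude for a small count of 10; for larger counts the relative error shrinks proportionally.

# Illustrative sketch (not a production DP library): Laplace mechanism for a
# count query with sensitivity 1. Typical relative error ~ 1 / (epsilon * count),
# so the table's percentages correspond to protecting a count of roughly 10.
import numpy as np

def typical_relative_error(epsilon: float, true_count: int, trials: int = 100_000) -> float:
    noise = np.random.laplace(scale=1.0 / epsilon, size=trials)
    return float(np.mean(np.abs(noise)) / true_count)

for eps in (10, 1.0, 0.1, 0.01):
    err = typical_relative_error(eps, true_count=10)
    print(f"epsilon={eps:>5}: ~{err:.0%} typical error on a count of 10")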

Key Takeaways

  1. Your data is a fingerprint - Just 8 movie ratings can identify you uniquely

  2. Anonymous ≠ Private - Public data sources enable re-identification

  3. Combinations matter - DOB + Gender + ZIP identifies 87% of Americans

  4. Math protects privacy - Differential privacy provides proven guarantees

  5. Always a tradeoff - More privacy = less accuracy, choose wisely