Data Privacy: The Netflix Challenge

A case study in how "anonymous" data isn't anonymous

2006: Netflix releases an "anonymous" dataset. 2007: Researchers break its anonymity.

The decision Netflix faced: release 100M ratings from 500K users as the basis for a $1M machine-learning competition. There were three options.

Option A: Full release (names + movies + ratings). Real names plus a movie list is an obvious breach: everyone's viewing history is exposed.

Option B: "Anonymous" release (random ID + movies + ratings). Netflix's thinking: an ID like A7X9B2 plus a movie list is safe, because there is "no way to connect the data to real people."

Option C: Don't release. The data stays protected, but there is no competition and no $1M prize.

Netflix chose Option B.

How the Attack Worked

In 2007, researchers at UT Austin (Arvind Narayanan and Vitaly Shmatikov) showed that Netflix's "anonymous" dataset could be linked back to real people, exposing the viewing histories of its 500,000 users.

What Netflix Thought

The released records looked like this:

  User: A7X9B2
  Movies: [Shrek, Matrix, Titanic, ...]
  Ratings: [4, 5, 2, ...]
  Dates: [2006-01-15, ...]

Netflix's assumptions: random IDs cannot be linked back to real people, and movie preferences alone are not identifying information. Conclusion: the data is "de-identified" and safe, because no connection from an anonymous ID to a real identity is possible.

What Actually Happened: The Linkage Attack

Many Netflix subscribers had also rated the same movies publicly on IMDB under their real names. Matching titles, ratings, and dates across the two sources links the "anonymous" ID to a real person:

  Netflix ("anonymous")           IMDB (public)
  User A7X9B2                     Alice Smith
  Shrek (4) on 2006-01-15         Shrek (4) on 2006-01-15
  Matrix (5) on 2006-01-20        Matrix (5) on 2006-01-20

  Result: A7X9B2 = Alice Smith (99% confidence). All 200 of her rated movies are exposed, including sensitive viewing preferences.

-- The attack, expressed as a query
SELECT n.*, 'Alice Smith' AS revealed_identity
FROM netflix n
JOIN imdb i
  ON n.movie = i.movie AND n.rating = i.rating
WHERE i.user = 'Alice Smith';  -- 99% match!
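The same idea can be sketched in a few lines of pandas. This is an illustration of the linkage technique, not the researchers' actual code; the DataFrames and column names are hypothetical.

# Minimal linkage-attack sketch with pandas (hypothetical data and column names).
import pandas as pd

# "Anonymous" release: random IDs only
netflix = pd.DataFrame({
    "anon_id": ["A7X9B2", "A7X9B2", "Q1Z8K4"],
    "movie":   ["Shrek", "The Matrix", "Titanic"],
    "rating":  [4, 5, 2],
    "date":    ["2006-01-15", "2006-01-20", "2006-02-03"],
})

# Public IMDB reviews posted under real names
imdb = pd.DataFrame({
    "real_name": ["Alice Smith", "Alice Smith"],
    "movie":     ["Shrek", "The Matrix"],
    "rating":    [4, 5],
    "date":      ["2006-01-15", "2006-01-20"],
})

# Join on the overlapping attributes; a handful of matches is enough to pin
# an anonymous ID to a real identity.
linked = netflix.merge(imdb, on=["movie", "rating", "date"])
matches = linked.groupby(["anon_id", "real_name"]).size().reset_index(name="overlap")
print(matches)  # A7X9B2 <-> Alice Smith, 2 overlapping ratings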

Privacy Attack & Defense Patterns

Privacy attacks are built from ordinary data components; defenses rest on mathematical guarantees.

Building Blocks

  Data types: personal, anonymous, public, metadata
  Attack material: JOIN/LINK against public sources (Reddit, IMDB, Spotify), unique identifier combinations (DOB + ZIP + gender), location history
  Defense methods: differential privacy, synthetic data, aggregation (k > 5)

Attack Patterns: How Privacy Gets Broken

Linkage attack: join anonymous data with public sources. Anonymous + Public = Identity revealed.

-- Netflix + IMDB attack
SELECT a.user_id, a.private_data
FROM anonymous_dataset a
JOIN public_dataset p
  ON a.pattern = p.pattern;  -- 99% match!

Inference attack: a unique combination of attributes reveals identity. DOB + gender + ZIP is unique for roughly 87% of Americans (Sweeney's result).

-- The 87% rule: Sweeney's attack
SELECT COUNT(*) AS matches
FROM voter_records
WHERE dob = '1990-01-15' AND gender = 'M' AND zip = '94301';
-- Returns 1!

Defense Patterns: Mathematical Privacy Guarantees

Differential privacy: add calibrated noise so that no individual's presence changes the result much. True count + Noise(ε) = protected result.

-- BigQuery differential privacy
SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS(epsilon = 1.0, delta = 1e-5)
  condition,
  COUNT(*) AS noisy_count
FROM patient_records   -- hypothetical table; see the fuller example in Quick Reference
GROUP BY condition;

Synthetic data: fit a generative model to the real data and release model-generated records that preserve the real patterns. Real data -> ML model -> synthetic dataset.

# Generate synthetic data (legacy SDV "tabular" API; newer SDV releases use
# sdv.single_table.GaussianCopulaSynthesizer)
from sdv.tabular import GaussianCopula
model = GaussianCopula()
model.fit(real_data)
synthetic = model.sample(1000)  # safe to share

Lesson: build privacy into your systems from day one. Retrofitting is expensive and often ineffective.
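The "calibrated noise" in differential privacy is typically drawn from a Laplace distribution with scale sensitivity/ε. Here is a minimal sketch of that mechanism in Python; the function and the example count are illustrative, not BigQuery's implementation.

# Laplace mechanism for a counting query (illustrative sketch).
import numpy as np

def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Differentially private count: true count plus Laplace noise.

    For a counting query one person changes the result by at most 1,
    so sensitivity = 1 and the noise scale is sensitivity / epsilon.
    """
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

print(dp_count(1_000, epsilon=1.0))   # close to 1000, off by a few
print(dp_count(1_000, epsilon=0.1))   # noisier, more private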

Stay Out of Jail: Legal & Trust Requirements

Regulatory Compliance

Regulation | Scope         | Key Requirement                            | Penalty
GDPR       | EU residents  | Data stays in the EU or adequate countries | Up to 4% of global revenue
CCPA       | California    | Right to delete personal data              | Up to $7,500 per violation
HIPAA      | US healthcare | Protect electronic health records (ePHI)   | Fines up to ~$2M, plus criminal charges

Build User Trust

-- Bad: release first, apologize later
SELECT * FROM user_data;  -- ❌ Netflix 2006: raw individual records

-- Good: privacy by design
SELECT WITH DIFFERENTIAL_PRIVACY  -- ✅ Google/Apple approach
  OPTIONS(epsilon = 1.0, delta = 1e-5)
  user_segment,              -- hypothetical grouping column
  COUNT(*) AS noisy_count    -- only noisy aggregates leave the system
FROM user_data
GROUP BY user_segment;
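Privacy by design can also be enforced in application code: raw rows never leave the data layer, and callers only get noisy aggregates. A minimal sketch with hypothetical class and column names; the noise is the same Laplace mechanism used by differential privacy.

# Sketch of a data-access layer that refuses row-level reads by design.
import numpy as np
import pandas as pd

class PrivateStore:
    def __init__(self, df: pd.DataFrame, epsilon: float = 1.0):
        self._df = df              # raw data stays inside this object
        self._epsilon = epsilon

    def raw_rows(self):
        raise PermissionError("Row-level access is disabled by design")

    def noisy_count(self, column: str, value) -> float:
        true_count = int((self._df[column] == value).sum())
        return true_count + np.random.laplace(scale=1.0 / self._epsilon)

store = PrivateStore(pd.DataFrame({"plan": ["free", "pro", "pro"]}))
print(store.noisy_count("plan", "pro"))   # allowed: noisy aggregate
# store.raw_rows()                        # raises PermissionError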

Action Framework

  1. Before you build: Incorporate privacy principles from day 1

  2. Before you release: Consult privacy experts for complex cases

  3. After deployment: Keep learning - this field evolves rapidly

  4. If things go wrong: expect both legal AND reputational consequences

Remember: Compliance keeps you out of court. Trust keeps users coming back.

Quick Reference

Privacy Attacks

Attack Type | How It Works                  | Example                         | Prevention
Linkage     | Join anonymous + public data  | Netflix + IMDB                  | Don't release individual records
Inference   | Unique attribute combinations | DOB + Gender + ZIP = 87% unique | Generalize attributes
Location    | Movement patterns are unique  | 4 locations = 95% unique        | Aggregate or add noise
Temporal    | Time patterns reveal identity | Login times, activity patterns  | Round timestamps
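The "generalize attributes" and "round timestamps" defenses amount to coarsening the quasi-identifiers before release. A minimal pandas sketch, with hypothetical column names:

# Generalizing quasi-identifiers before release (hypothetical column names).
import pandas as pd

users = pd.DataFrame({
    "dob":        pd.to_datetime(["1990-01-15", "1987-06-03"]),
    "zip":        ["94301", "10027"],
    "login_time": pd.to_datetime(["2024-05-01 09:13:42", "2024-05-01 22:47:05"]),
})

released = pd.DataFrame({
    "birth_year": users["dob"].dt.year,               # DOB generalized to year
    "zip_prefix": users["zip"].str[:3],               # only the first 3 ZIP digits
    "login_hour": users["login_time"].dt.floor("h"),  # timestamps rounded to the hour
})
print(released)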

Defense Methods

-- BigQuery Differential Privacy (Production Ready)
SELECT WITH DIFFERENTIAL_PRIVACY 
  OPTIONS(
    epsilon = 1.0,              -- Privacy budget (lower = more private)
    delta = 1e-5,               -- Probability bound
    max_groups_contributed = 1  -- Limit contribution per user
  )
  diagnosis,
  COUNT(*) AS noisy_count
FROM medical_records
GROUP BY diagnosis;

-- Aggregation Defense (k-anonymity)
SELECT
  age_group,   -- not exact age
  zip_prefix,  -- first 3 digits only
  COUNT(*) AS group_count
FROM users
GROUP BY age_group, zip_prefix
HAVING COUNT(*) >= 5;  -- minimum group size k = 5
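Before releasing such a table, it is worth checking that every combination of quasi-identifiers really does meet the minimum group size. A minimal Python sketch, with hypothetical column names:

# k-anonymity check: every quasi-identifier combination appears at least k times.
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list, k: int = 5) -> bool:
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

release = pd.DataFrame({
    "age_group":  ["30-39", "30-39", "40-49", "40-49", "40-49"],
    "zip_prefix": ["943",   "943",   "100",   "100",   "100"],
})
print(is_k_anonymous(release, ["age_group", "zip_prefix"], k=2))  # True
print(is_k_anonymous(release, ["age_group", "zip_prefix"], k=3))  # False: the 943 group has only 2 rows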

Privacy vs Utility Tradeoff

Epsilon (ε) | Privacy Level | Typical Noise (relative) | Use Case
10          | Low           | ±1-2%                    | Internal analytics
1.0         | Balanced      | ±10%                     | Research datasets
0.1         | High          | ±100%                    | Public release
0.01        | Maximum       | ±1000%                   | Highly sensitive data
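For a counting query with sensitivity 1, the Laplace noise scale is 1/ε, so the relative error also depends on how large the true count is; the percentages above are rough guidelines. A small sketch of how the noise grows as ε shrinks:

# Noise scale vs epsilon for a counting query (sensitivity = 1).
import numpy as np

true_count = 1_000
for epsilon in (10, 1.0, 0.1, 0.01):
    scale = 1.0 / epsilon                          # Laplace scale = sensitivity / epsilon
    noisy = true_count + np.random.laplace(scale=scale)
    print(f"epsilon={epsilon:>5}: scale={scale:>6.1f}, one noisy draw = {noisy:.1f}")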

Key Takeaways

  1. Your data is a fingerprint - Just 8 movie ratings can identify you uniquely

  2. Anonymous ≠ Private - Public data sources enable re-identification

  3. Combinations matter - DOB + Gender + ZIP identifies 87% of Americans

  4. Math protects privacy - Differential privacy provides proven guarantees

  5. Always a tradeoff - More privacy = less accuracy, choose wisely