Data Privacy: The Netflix Challenge

Group 3: Safely Using Data - The Linkage Attack

A case study in how "anonymous" data isn't anonymous

The Netflix Challenge: A Privacy Case Study

2006: Netflix releases "anonymous" data. 2007: Researchers break its anonymity.

The decision: release 100M ratings from 500K users for a $1M machine-learning competition. Netflix had three options:

Option A: Full release (names + movies + ratings). Real names plus movie lists is an immediate breach: everyone's viewing history is exposed.

Option B: "Anonymous" release (random ID + movies + ratings). Netflix's thinking: an ID like A7X9B2 plus a movie list is safe, because there is "no way to connect to real people."

Option C: Don't release. The data stays protected, but there is no competition and no $1M prize.

Netflix chose Option B.

How the Attack Worked

How UT Austin Researchers Broke Netflix's "Anonymous" Data (2007)

The linkage attack that exposed 500,000 users' viewing histories.

What Netflix thought: the released dataset contained only an opaque ID per user, for example User A7X9B2 with Movies [Shrek, Matrix, Titanic, ...], Ratings [4, 5, 2, ...], and Dates [2006-01-15, ...]. Netflix's assumption was that "random IDs cannot be linked back to real people" and that "movie preferences alone aren't identifying information," so the data was considered de-identified and safe: no connection between an anonymous ID and a real identity was thought possible.

What actually happened: the linkage attack joined the "anonymous" Netflix records against public IMDB reviews. If Netflix user A7X9B2 rated Shrek (4) on 2006-01-15 and The Matrix (5) on 2006-01-20, and Alice Smith's public IMDB reviews list the same movies, ratings, and dates, then A7X9B2 is Alice Smith with about 99% confidence.

-- The attack query (simplified)
SELECT n.*, 'Alice Smith' AS revealed_identity
FROM netflix n
JOIN imdb i
  ON  n.movie = i.movie
  AND n.rating = i.rating
WHERE i.user = 'Alice Smith';
-- 99% match: all 200 of the user's movies exposed, including sensitive viewing preferences
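
The exact-match join above is a simplification. A closer sketch of the published attack tolerates small differences in rating dates and only declares a match once several ratings line up; the table and column names here (netflix, imdb, user_id, rated_on) are illustrative, not the researchers' actual code:

-- Sketch: fuzzy linkage that allows rating dates to differ by up to 14 days
-- and requires at least 8 matching ratings before claiming a re-identification
SELECT
  n.user_id AS netflix_id,
  i.user    AS imdb_identity,
  COUNT(*)  AS matching_ratings
FROM netflix n
JOIN imdb i
  ON  n.movie = i.movie
  AND n.rating = i.rating
  AND ABS(DATE_DIFF(n.rated_on, i.rated_on, DAY)) <= 14
GROUP BY n.user_id, i.user
HAVING COUNT(*) >= 8;

Narayanan and Shmatikov reported that roughly eight ratings, with dates known only to within about two weeks, are enough to uniquely identify the vast majority of records in the dataset.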

Privacy Attacks

Attack Type | How It Works | Example | Prevention
Linkage | Combine anonymous data with public data | Netflix + IMDB | Avoid releasing individual records
Inference | Unique attribute combinations | DOB + Gender + ZIP = 87% unique | Generalize attributes
Location | Unique movement patterns | 4 locations = 95% unique | Aggregate or add noise
Temporal | Time patterns reveal identity | Login times, activity patterns | Round timestamps
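
The inference row above can be checked directly against a dataset: count how many people share each quasi-identifier combination. A minimal sketch, assuming a hypothetical users table with dob, gender, and zip columns:

-- Sketch: what fraction of users is uniquely identified by DOB + gender + ZIP?
SELECT
  COUNTIF(group_size = 1) AS users_uniquely_identified,  -- a group of size 1 is exactly one person
  SUM(group_size)         AS total_users,
  ROUND(100 * COUNTIF(group_size = 1) / SUM(group_size), 1) AS pct_identifiable
FROM (
  SELECT dob, gender, zip, COUNT(*) AS group_size
  FROM users
  GROUP BY dob, gender, zip
);

Running the same query with generalized columns (age bands instead of exact DOB, three-digit ZIP prefixes) shows how much the prevention step in the table reduces uniqueness.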

Stay Out of Jail: Legal & Trust Requirements

Regulation | Scope | Key Requirement | Penalty
GDPR | People in the EU | Personal data stays in the EU or countries with adequate protection | Up to 4% of global annual revenue
CCPA | California residents | Right to delete personal data | Up to $7,500 per violation
HIPAA | US healthcare | Protect electronic health records (ePHI) | Up to ~$2M plus criminal charges

Defense Methods

Synthetic Data

Create fake data that mirrors the statistical properties of real data without containing any actual individual records. This is useful for development, debugging, testing, and basic machine learning workflows.
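
A minimal sketch of the idea in BigQuery SQL, assuming a hypothetical users table with an age column: compute aggregate statistics from the real data, then generate fake rows from those statistics instead of copying any real record.

-- Sketch: 1,000 synthetic ages that roughly match the real age distribution
-- (only the real table's mean and standard deviation leave the database)
WITH stats AS (
  SELECT AVG(age) AS mean_age, STDDEV(age) AS sd_age
  FROM users
)
SELECT
  -- a sum of three uniforms approximates a bell curve; the 2 * sd_age factor matches the spread
  CAST(mean_age + 2 * sd_age * (RAND() + RAND() + RAND() - 1.5) AS INT64) AS synthetic_age
FROM stats, UNNEST(GENERATE_ARRAY(1, 1000)) AS row_num;

Production synthetic-data tools model joint distributions and correlations across many columns; this sketch preserves only one column's rough shape, but the principle is the same: statistics leave the database, individual rows do not.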

Differential Privacy

Introduce calibrated noise to query results to safeguard individual privacy while preserving aggregate accuracy. Ideal for production machine learning workflows. Here's how privacy and utility trade off:

Epsilon (ε) | Privacy Level | Noise Amount | Use Case
10 | Low | ±1-2% | Internal analytics
1.0 | Balanced | ±10% | Research datasets
0.1 | High | ±100% | Public release
0.01 | Maximum | ±1000% | Highly sensitive
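
These noise figures are rough guides, but they follow from the calibration itself: under the standard Laplace mechanism, a count query (sensitivity 1) receives noise with scale 1/ε, so halving ε roughly doubles the typical error. Below is a hand-rolled illustration of that calibration, for intuition only; the production-ready BigQuery clause that follows does this, plus the required per-user bookkeeping, for you.

-- Sketch: manual Laplace noise with scale 1/epsilon (here epsilon = 1.0)
-- Illustration only; use the built-in DIFFERENTIAL_PRIVACY clause in practice
WITH grouped AS (
  SELECT diagnosis, COUNT(*) AS raw_count, RAND() - 0.5 AS u  -- one uniform draw per group
  FROM medical_records
  GROUP BY diagnosis
)
SELECT
  diagnosis,
  -- inverse-CDF sample from Laplace(scale = 1 / 1.0), added to the true count
  raw_count - (1 / 1.0) * SIGN(u) * LN(GREATEST(1 - 2 * ABS(u), 1e-12)) AS noisy_count
FROM grouped;
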
-- BigQuery Differential Privacy (Production Ready)
SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS(
    epsilon = 1.0,                    -- Privacy budget (lower = more private)
    delta = 1e-5,                     -- Probability bound on a privacy failure
    max_groups_contributed = 1,       -- Limit each user's contribution to one group
    privacy_unit_column = patient_id  -- Column identifying each user (assumed column name)
  )
  diagnosis,
  COUNT(*) AS noisy_count
FROM medical_records
GROUP BY diagnosis;

-- Aggregation Defense (k-anonymity, k = 5)
SELECT
  age_group,   -- generalized: age bands, not exact age
  zip_prefix,  -- generalized: first 3 ZIP digits only
  COUNT(*) AS group_size
FROM users
GROUP BY age_group, zip_prefix
HAVING COUNT(*) >= 5;  -- suppress groups smaller than k = 5

Key Takeaways

  1. Your data is a fingerprint - Just 8 movie ratings can uniquely identify you.

  2. Anonymous ≠ Private - Public data sources enable re-identification.

  3. Combinations matter - DOB + Gender + ZIP identifies 87% of Americans.

  4. Math protects privacy - Differential privacy provides proven guarantees.

  5. Always a tradeoff - More privacy means less accuracy. Choose wisely.