Data Privacy: The Netflix Challenge

A case study in how "anonymous" data isn't anonymous

In 2006, Netflix released an "anonymous" dataset of 100 million ratings from roughly 500,000 users to power a $1M machine-learning competition. In 2007, researchers broke the anonymization. The decision Netflix faced looked like this:

Option A: Full release (names + movies + ratings). Real names plus viewing lists is an immediate breach: everyone's viewing history is exposed.

Option B: "Anonymous" release (random ID + movies + ratings). Netflix's thinking: an ID like A7X9B2 plus a movie list is safe, because there is "no way to connect to real people."

Option C: Don't release anything. The data stays protected, but there is no competition and no $1M prize.

Netflix chose Option B.

How the Attack Worked

In 2007, UT Austin researchers showed how a linkage attack could expose the viewing histories of the 500,000 users in Netflix's "anonymous" dataset.

What Netflix thought. The released records looked like this:

  User: A7X9B2
  Movies: [Shrek, Matrix, Titanic, ...]
  Ratings: [4, 5, 2, ...]
  Dates: [2006-01-15, ...]

Netflix's assumptions: random IDs cannot be linked back to real people, and movie preferences alone aren't identifying information. Conclusion: the data is "de-identified" and safe, with no connection possible between an anonymous ID and a real identity.

What actually happened: the linkage attack. The researchers joined the "anonymous" Netflix records against public IMDB reviews. If anonymous user A7X9B2 rated Shrek 4 stars on 2006-01-15 and The Matrix 5 stars on 2006-01-20, and a public IMDB profile belonging to Alice Smith posted the same ratings on the same dates, then A7X9B2 is Alice Smith with roughly 99% confidence. Conceptually, the attack is a single join:

-- The attack query (conceptual)
SELECT
  n.*,
  'Alice Smith' AS revealed_identity
FROM netflix n
JOIN imdb i
  ON n.movie = i.movie
 AND n.rating = i.rating
WHERE i.user = 'Alice Smith';  -- ~99% match

Result: all 200 of that user's rated movies are exposed, including sensitive viewing preferences.

Privacy Attacks

Attack Type | How It Works                  | Example                         | Prevention
Linkage     | Join anonymous + public data  | Netflix + IMDB                  | Don't release individual records
Inference   | Unique attribute combinations | DOB + Gender + ZIP = 87% unique | Generalize attributes (see the sketch below)
Location    | Movement patterns are unique  | 4 locations = 95% unique        | Aggregate or add noise
Temporal    | Time patterns reveal identity | Login times, activity patterns  | Round timestamps
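
The inference row can be made concrete: before releasing a table, measure how many people are uniquely pinned down by a combination of quasi-identifiers. A minimal sketch in the same BigQuery-style SQL used later in this section, assuming a users table with dob, gender, and zip columns (those column names are illustrative, not taken from any dataset above):

-- Re-identification risk check: how many users share their exact
-- (dob, gender, zip) combination with nobody else?
WITH combos AS (
  SELECT
    dob,
    gender,
    zip,
    COUNT(*) AS group_size        -- users sharing this exact combination
  FROM users
  GROUP BY dob, gender, zip
)
SELECT
  COUNTIF(group_size = 1) AS uniquely_identifiable_users,  -- k = 1: re-identifiable
  COUNT(*)                AS distinct_combinations
FROM combos;

If a large share of users falls in groups of size 1, the "generalize attributes" prevention applies: coarsen date of birth to an age bracket and ZIP to a prefix before release, as in the k-anonymity example further down.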

Stay Out of Jail: Legal & Trust Requirements

Regulation | Scope                | Key Requirement                           | Penalty
GDPR       | People in the EU     | Data stays in EU/adequate countries       | Up to 4% of global revenue
CCPA       | California residents | Right to delete personal data             | $7,500 per violation
HIPAA      | US healthcare        | Protect electronic health records (ePHI)  | Up to $2M + criminal charges

Defense Methods

Synthetic Data

Generate fake data that mimics the statistical properties of real data without containing any real individual's records. Goal: use for development, debugging, testing, and basic ML workflows; a minimal sketch follows below.
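
A deliberately simple way to see the idea: shuffle each column independently, so per-column statistics are preserved but no output row corresponds to a real person. Real synthetic-data generators instead model the joint distribution, correlations included. A minimal sketch in BigQuery-style SQL over the users table from the k-anonymity example below (age_group and zip_prefix); purely illustrative:

-- Minimal synthetic-data sketch: shuffle columns independently.
-- Each column keeps its real distribution, but cross-column links are broken,
-- so no output row matches a real individual.
WITH shuffled_age AS (
  SELECT age_group, ROW_NUMBER() OVER (ORDER BY RAND()) AS rn
  FROM users
),
shuffled_zip AS (
  SELECT zip_prefix, ROW_NUMBER() OVER (ORDER BY RAND()) AS rn
  FROM users
)
SELECT
  a.age_group,
  z.zip_prefix
FROM shuffled_age a
JOIN shuffled_zip z USING (rn);

This is good enough for exercising schemas, pipelines, and dashboards in dev and test, but because cross-column correlations are destroyed it is a poor stand-in for real data when a model depends on them.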

Differential Privacy

Add calibrated noise to query results to protect individual privacy while preserving aggregate accuracy. Goal: use for production ML and analytics workflows. The table below shows the privacy vs. utility tradeoff, followed by a sketch of the underlying noise mechanism.

Epsilon (ε) | Privacy Level | Noise Amount | Use Case
10          | Low           | ±1-2%        | Internal analytics
1.0         | Balanced      | ±10%         | Research datasets
0.1         | High          | ±100%        | Public release
0.01        | Maximum       | ±1000%       | Highly sensitive
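
Where do these noise amounts come from? For a count query, the classic Laplace mechanism adds noise drawn from Laplace(0, sensitivity / ε). A count has sensitivity 1 (adding or removing one person changes it by at most 1), so a smaller ε means a wider noise distribution. A hand-rolled sketch of that sampling step against the same medical_records table used in the managed query below; conceptual only, not a substitute for a vetted DP implementation:

-- Laplace mechanism, by hand: noisy_count = true_count + Laplace(0, sensitivity / epsilon).
-- Sampling trick: the difference of two independent Exponential(b) draws is Laplace(0, b),
-- which works out to b * LN(u1 / u2) for two independent uniforms.
WITH counts AS (
  SELECT diagnosis, COUNT(*) AS true_count
  FROM medical_records
  GROUP BY diagnosis
)
SELECT
  diagnosis,
  true_count,
  -- scale b = sensitivity / epsilon = 1 / 1.0
  true_count + (1.0 / 1.0) * LN(RAND() / RAND()) AS noisy_count
FROM counts;

Note what this sketch ignores: it still reveals exactly which diagnosis values exist, and it does nothing to limit how many groups one person contributes to. The managed clause below addresses both concerns, which is why it is the one to use in production.
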
-- BigQuery Differential Privacy (production-ready, managed)
SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS(
    epsilon = 1.0,                    -- Privacy budget (lower = more private)
    delta = 1e-5,                     -- Probability bound on privacy failure
    max_groups_contributed = 1,       -- Limit each user's contribution to one group
    privacy_unit_column = patient_id  -- Column identifying one person (column name assumed here)
  )
  diagnosis,
  COUNT(*) AS noisy_count
FROM medical_records
GROUP BY diagnosis;

-- Aggregation defense (k-anonymity)
SELECT
  age_group,   -- Not exact age
  zip_prefix,  -- First 3 digits only
  COUNT(*) AS user_count
FROM users
GROUP BY age_group, zip_prefix
HAVING COUNT(*) >= 5;  -- Minimum group size: every released row covers at least k = 5 people

Key Takeaways

  1. Your data is a fingerprint - As few as 8 movie ratings (with approximate dates) can uniquely identify you

  2. Anonymous ≠ Private - Public data sources enable re-identification

  3. Combinations matter - DOB + Gender + ZIP identifies 87% of Americans

  4. Math protects privacy - Differential privacy provides proven guarantees

  5. Always a tradeoff - More privacy = less accuracy, choose wisely