Data Privacy: The Netflix Challenge
A case study in how "anonymous" data isn't anonymous
Privacy Attack & Defense Patterns
Stay Out of Jail: Legal & Trust Requirements
Regulatory Compliance
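Regulation | Scope | Key Requirement | Penalty |
---|---|---|---|
GDPR | Individuals in the EU | Data stays in the EU or countries deemed adequate (or moves under approved safeguards) | Up to €20M or 4% of global annual revenue |
CCPA | California residents | Right to delete personal data | Up to $7,500 per intentional violation |
HIPAA | US healthcare data | Protect electronic protected health information (ePHI) | Up to about $2M per year + criminal charges |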
Build User Trust
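-- Bad: Release first, apologize later
SELECT * FROM user_data; -- ❌ Netflix 2006
-- Good: Privacy by design (release noisy aggregates, never raw rows)
SELECT WITH DIFFERENTIAL_PRIVACY -- ✅ Google/Apple approach
  OPTIONS(epsilon = 1.0, delta = 1e-5, privacy_unit_column = user_id) -- user_id is an assumed column name
  COUNT(*) AS noisy_row_count -- stands in for your aggregated insights
FROM user_data;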
Action Framework
- Before you build: Incorporate privacy principles from day 1
- Before you release: Consult privacy experts for complex cases
- After deployment: Keep learning - this field evolves rapidly
- If things go wrong: Legal AND reputational consequences
Remember: Compliance keeps you out of court. Trust keeps users coming back.
Quick Reference
Privacy Attacks
Attack Type | How It Works | Example | Prevention |
---|---|---|---|
Linkage | Join "anonymous" records with public data | Netflix ratings + IMDb reviews | Don't release individual records |
Inference | Combinations of ordinary attributes are unique | DOB + gender + ZIP uniquely identifies 87% of Americans | Generalize attributes |
Location | Movement patterns are unique | 4 time-stamped locations identify 95% of people | Aggregate or add noise |
Temporal | Time patterns reveal identity | Login times, activity patterns | Round timestamps |
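To make the linkage row concrete, here is a minimal sketch of how such a join could work. All records, names, and thresholds (`max_days`, `min_matches`) are hypothetical and chosen only for illustration; this is not the method used in the actual Netflix study.

```python
from collections import Counter
from datetime import date

# Hypothetical "anonymized" release: user IDs replaced by opaque tokens.
anonymized_ratings = [
    ("u_901", "Movie A", 5, "2005-03-01"),
    ("u_901", "Movie B", 1, "2005-03-04"),
    ("u_901", "Movie C", 4, "2005-04-10"),
    ("u_377", "Movie A", 5, "2005-06-20"),
    ("u_377", "Movie D", 3, "2005-06-21"),
]

# Hypothetical public reviews posted under real names.
public_reviews = [
    ("alice", "Movie A", 5, "2005-03-02"),
    ("alice", "Movie B", 1, "2005-03-04"),
    ("alice", "Movie C", 4, "2005-04-11"),
    ("bob", "Movie D", 2, "2005-01-15"),
]


def link_records(anonymized, public, max_days=3, min_matches=3):
    """Return {token: name} pairs that agree on at least min_matches ratings."""
    matches = Counter()
    for token, title, rating, day in anonymized:
        for name, pub_title, pub_rating, pub_day in public:
            gap = abs((date.fromisoformat(day) - date.fromisoformat(pub_day)).days)
            # A match needs the same title, the same rating, and nearby dates.
            if title == pub_title and rating == pub_rating and gap <= max_days:
                matches[(token, name)] += 1
    return {token: name for (token, name), n in matches.items() if n >= min_matches}


print(link_records(anonymized_ratings, public_reviews))
```

In this toy data, token `u_901` agrees with `alice` on three ratings within the date tolerance, so the token is tied to a name even though the release never contained one.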
Defense Methods
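-- BigQuery Differential Privacy (Production Ready)
SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS(
    epsilon = 1.0, -- Privacy budget (lower = more private, more noise)
    delta = 1e-5, -- Small probability that the epsilon guarantee does not hold
    max_groups_contributed = 1, -- Each user counts toward at most 1 diagnosis group
    privacy_unit_column = patient_id -- Assumed name of the column identifying each user
  )
  diagnosis,
  COUNT(*) AS noisy_count
FROM medical_records
GROUP BY diagnosis;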
-- Aggregation Defense (k-anonymity style, k = 5)
SELECT
  age_group, -- Not exact age
  zip_prefix, -- First 3 digits only
  COUNT(*) AS user_count
FROM users
GROUP BY age_group, zip_prefix
HAVING COUNT(*) >= 5; -- Suppress groups with fewer than 5 users
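The aggregation query assumes `age_group` and `zip_prefix` already exist as columns. As a rough sketch of the whole pipeline, the snippet below generalizes raw values first and then applies the same minimum group size. The field names `age` and `zip`, the 10-year bands, and the sample users are assumptions made for illustration.

```python
from collections import Counter


def generalize(record):
    """Coarsen quasi-identifiers: 10-year age band and 3-digit ZIP prefix."""
    decade = (record["age"] // 10) * 10
    return (f"{decade}-{decade + 9}", record["zip"][:3])


def k_anonymous_counts(records, k=5):
    """Count records per generalized group, dropping groups smaller than k."""
    counts = Counter(generalize(r) for r in records)
    return {group: n for group, n in counts.items() if n >= k}


# Hypothetical users: five in one group, two in another.
users = (
    [{"age": 30 + i, "zip": "94110"} for i in range(5)]
    + [{"age": 41, "zip": "10001"}, {"age": 47, "zip": "10002"}]
)

print(k_anonymous_counts(users, k=5))
```

Only the group with five members survives; the two-person group is suppressed because publishing it would narrow those users down too far. Note that a threshold on group size limits what aggregates reveal, but it does not by itself give the formal guarantee that differential privacy does.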
Privacy vs Utility Tradeoff
Epsilon (ε) | Privacy Level | Noise Amount | Use Case |
---|---|---|---|
10 | Low | ±1-2% | Internal analytics |
1.0 | Balanced | ±10% | Research datasets |
0.1 | High | ±100% | Public release |
0.01 | Maximum | ±1000% | Highly sensitive |
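The table does not say what the noise percentages are relative to. One reading that reproduces them, offered here as an assumption rather than the source's stated method, is the classic Laplace mechanism on a count query: noise has scale b = sensitivity / ε, the expected absolute noise equals b, and the true count is about 10. Production systems such as BigQuery may use different noise and thresholding, so treat this as a sketch of the trend only.

```python
import random


def laplace_scale(epsilon, sensitivity=1.0):
    """Scale b of the Laplace noise; the expected absolute noise equals b."""
    return sensitivity / epsilon


def noisy_count(true_count, epsilon, rng=random):
    """Add Laplace(0, b) noise, sampled as the difference of two exponentials."""
    b = laplace_scale(epsilon)
    return true_count + rng.expovariate(1 / b) - rng.expovariate(1 / b)


ASSUMED_TRUE_COUNT = 10  # assumption: baseline chosen so the percentages match the table

for eps in (10, 1.0, 0.1, 0.01):
    relative = laplace_scale(eps) / ASSUMED_TRUE_COUNT
    print(f"epsilon={eps}: typical noise about ±{relative:.0%} of the count")

print(noisy_count(ASSUMED_TRUE_COUNT, epsilon=1.0))  # one random draw
```

The same absolute noise matters far less on large counts: with ε = 1.0 the typical error is about 1 either way, which is 10% of a count of 10 but only 0.1% of a count of 1,000.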
Key Takeaways
- Your data is a fingerprint - In the Netflix Prize data, just 8 movie ratings with approximate dates were enough to single out most users
- Anonymous ≠ Private - Public data sources enable re-identification
- Combinations matter - DOB + Gender + ZIP uniquely identifies 87% of Americans
- Math protects privacy - Differential privacy provides provable guarantees
- Always a tradeoff - More privacy = less accuracy, choose wisely
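The "combinations matter" point can be checked on any dataset by measuring how many records are unique for a given set of attributes. The sketch below uses four made-up people and a hypothetical helper `unique_fraction`; the numbers it prints describe only this toy data, not the 87% figure.

```python
from collections import Counter


def unique_fraction(records, keys):
    """Fraction of records whose values for `keys` appear exactly once."""
    def signature(record):
        return tuple(record[k] for k in keys)

    counts = Counter(signature(r) for r in records)
    return sum(1 for r in records if counts[signature(r)] == 1) / len(records)


# Hypothetical records, for illustration only.
people = [
    {"dob": "1980-01-01", "gender": "F", "zip": "94110"},
    {"dob": "1980-01-01", "gender": "M", "zip": "94110"},
    {"dob": "1975-05-05", "gender": "F", "zip": "94110"},
    {"dob": "1975-05-05", "gender": "F", "zip": "10001"},
]

for keys in (["gender"], ["dob", "gender"], ["dob", "gender", "zip"]):
    print(keys, unique_fraction(people, keys))
```

Each attribute on its own leaves most of these records ambiguous, while the three together make every record unique. Running the same check before a release is a cheap way to see which columns need generalizing.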