DB Design for Big Tech: When You Actually Need Scale

The Three-Tier Architecture That Powers Netflix, Spotify, and Uber

OLTP for users, OLAP for analytics, ETL to connect them - each optimized for its specific workload

The Three Worlds of Big Data Systems

OLTP (Online Transaction Processing)
Spotify examples: user plays a song, add to playlist, update user profile, like/unlike a track, follow an artist
Pattern: single-row operations, processed in real time
Workload: 1000s of queries/sec, < 100ms latency, 24/7 availability
Scale: single machine 1K-10K TPS (PostgreSQL ~5K TPS typical); distributed 10K-1M+ TPS (Spanner ~50K/region, Aurora ~200K writes/s)

OLAP (Online Analytical Processing)
Spotify analytics: top songs by country, genre trends per month, user retention cohorts, revenue by segment, A/B test results
Pattern: scan billions of rows
Workload: 10s of queries/hour, 1-60 second queries, historical analysis
Scale: always distributed past ~1TB (BigQuery, Snowflake, Redshift)

ETL Pipeline (Extract → Transform → Load)
Nightly batch: extract from OLTP (~1 hr) → clean & aggregate (~2 hr) → load to OLAP (~30 min)
Flow: raw events in, daily aggregates out
Tools: Apache Spark, Airflow

Scale & Economics at Big Tech
OLTP: 100M-5B transactions/day; ~$3K/month covers 100M txns/day
ETL: 100TB processed/day; ~$15K/month
OLAP: 1PB warehouse; ~$20K/month for storage + queries
Total: ~$38K/month for the basic data stack, before the data engineering and operations team that runs it

OLTP: User-Facing Speed

Every click, update, "add to cart" - that's OLTP. Here's Spotify handling a play:

-- User plays "Anti-Hero" by Taylor Swift
BEGIN TRANSACTION;
INSERT INTO listens (user_id, song_id, timestamp) 
VALUES (123, 789, NOW());
UPDATE users SET last_active = NOW() WHERE user_id = 123;
UPDATE songs SET play_count = play_count + 1 WHERE song_id = 789;
COMMIT;

The OLTP Formula: < 100ms latency, 1000s queries/sec, single-row updates

Scale Limits:

Single machine: 1K-10K TPS (PostgreSQL ~5K TPS typical)
Distributed: 10K-1M+ TPS (Spanner ~50K writes/sec per region, Aurora ~200K writes/sec)
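
To see why those limits matter, run the arithmetic on the daily volumes quoted above. A rough sketch in Python (the 10x peak-to-average ratio is an assumption, not a measured figure):

# Rough OLTP capacity check: does 100M transactions/day fit on one primary?
txns_per_day = 100_000_000
avg_tps = txns_per_day / 86_400      # seconds in a day -> ~1,157 TPS average
peak_tps = avg_tps * 10              # assumed 10x peak-to-average ratio
print(f"average: {avg_tps:,.0f} TPS, assumed peak: {peak_tps:,.0f} TPS")
# ~11.6K TPS at peak sits right at the top of the single-machine range,
# which is why read replicas and sharding show up at exactly this scale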

OLAP: Analytics at Scale

"What's trending? How are we doing?" - that's OLAP. Here's Spotify finding hot genres:

-- Which genres are growing fastest this quarter?
SELECT 
    s.genre,
    DATE_TRUNC('week', l.timestamp) AS week,
    COUNT(DISTINCT l.user_id) AS unique_listeners,
    COUNT(*) AS total_plays
FROM listens l
JOIN songs s ON l.song_id = s.song_id
WHERE l.timestamp > '2024-01-01'
GROUP BY s.genre, week
ORDER BY week, unique_listeners DESC;

Reality Check: This scans 5TB (50B rows) in 15 seconds on BigQuery. PostgreSQL? 2+ hours.

Why OLAP Needs Distribution:

1 machine scanning 1TB: 200 seconds (SSD @ 5GB/s)
100 machines parallel: 2 seconds (each scans 10GB)
Result: 100x faster, same cost
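
That claim is easy to check by hand. A minimal back-of-the-envelope sketch, using the same 5 GB/s per-node read throughput assumed above and ignoring coordination overhead:

# Back-of-the-envelope scan times (assumes 5 GB/s sequential reads per node,
# perfect parallelism, and zero network/coordination overhead)
def scan_seconds(data_tb: float, machines: int, gb_per_sec: float = 5.0) -> float:
    gb_per_machine = data_tb * 1000 / machines
    return gb_per_machine / gb_per_sec

print(scan_seconds(1, 1))      # 1 TB, 1 machine    -> 200.0 seconds
print(scan_seconds(1, 100))    # 1 TB, 100 machines -> 2.0 seconds
print(scan_seconds(5, 100))    # 5 TB, 100 machines -> 10.0 seconds of pure I/O;
                               # real-world overhead pushes this toward the ~15s above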

ETL: Moving Data Between Worlds

The nightly job that feeds analytics. Here's Spotify's ETL:

# Simplified Apache Spark ETL job
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spotify_daily_etl").getOrCreate()

def spotify_daily_etl(songs_dim, users_dim):
    # songs_dim / users_dim: pre-loaded dimension DataFrames
    # 1. EXTRACT from OLTP (PostgreSQL read replica, never the primary; credentials omitted)
    raw_listens = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://replica.spotify.com/listens")
        .option("query",
                "SELECT * FROM listens "
                "WHERE timestamp >= CURRENT_DATE - 1 AND timestamp < CURRENT_DATE")
        .load()
        .withColumn("date", F.to_date("timestamp"))
    )

    # 2. TRANSFORM - enrich with the dimension tables, then aggregate
    daily_stats = (
        raw_listens
        .join(songs_dim, "song_id")
        .join(users_dim, "user_id")
        .groupBy("date", "genre", "country", "age_group")
        .agg(
            F.count("*").alias("play_count"),
            F.countDistinct("user_id").alias("unique_users"),
            F.avg("listen_duration").alias("avg_duration"),
        )
    )

    # 3. LOAD to OLAP (BigQuery, via the spark-bigquery connector)
    (
        daily_stats.write
        .format("bigquery")
        .option("table", "spotify_warehouse.daily_stats")
        .option("writeMethod", "direct")   # direct write, no temporary GCS bucket needed
        .mode("append")
        .save()
    )

ETL Reality: Daily batches, GB to TB, minutes to hours. Tools: Spark + Airflow + dbt.
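
In practice a job like this is kicked off by a scheduler rather than run by hand. A minimal Airflow sketch (the DAG id, schedule, and direct call to spotify_daily_etl are illustrative assumptions; production setups usually submit the job with spark-submit or SparkSubmitOperator):

# Minimal Airflow DAG that runs the ETL job above once per night (names illustrative)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="spotify_daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",   # 02:00 nightly, after the day's events have landed
    catchup=False,
) as dag:
    run_etl = PythonOperator(
        task_id="run_spark_etl",
        python_callable=spotify_daily_etl,  # the Spark job defined above
    )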

Kafka: The Reliable Messenger

Nothing gets lost. Kafka (or Pub/Sub) sits between the user-facing OLTP systems and the analytics pipeline, buffering events and guaranteeing delivery.
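
On the producer side that guarantee mostly comes down to configuration. A minimal sketch with the kafka-python client (the broker addresses, the "listens" topic, and the event fields are assumptions for illustration):

# Producer side: every play event goes to Kafka before anything else touches it
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],  # assumed broker addresses
    acks="all",           # wait for all in-sync replicas: acknowledged events survive a broker crash
    retries=5,            # retry transient broker failures automatically
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("listens", {"user_id": 123, "song_id": 789, "event": "play"})
producer.flush()  # block until the broker has durably acknowledged the event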

The Complete Picture

Users → API → PostgreSQL (OLTP: Handle clicks)
         ↓
      Kafka (Queue events reliably)  
         ↓
      Spark (ETL: Transform daily)
         ↓
    BigQuery (OLAP: Answer "what's trending?")
         ↓
    Dashboards (Executives see pretty charts)
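
At the far end, a dashboard tile is usually just one small query against the warehouse table the ETL job produced. A sketch with the google-cloud-bigquery client (table and column names come from the ETL example above; credentials and project setup are omitted):

# Dashboard side: "what's trending this week?" against the OLAP warehouse
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT genre, SUM(play_count) AS plays
    FROM `spotify_warehouse.daily_stats`
    WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    GROUP BY genre
    ORDER BY plays DESC
    LIMIT 10
"""
for row in client.query(sql).result():
    print(row.genre, row.plays)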

Key Takeaways

  1. Split into Three: OLTP for users, OLAP for analytics, ETL to connect them

  2. Kafka Ships the Data: Message queues (or publish-subscribe) ensure nothing gets lost between systems

  3. SQL Runs Everything: PostgreSQL (OLTP), Spark SQL (ETL), BigQuery (OLAP) - it's all SQL

  4. The Big Tech (Not So) Secret: Don't build one database to rule them all - decompose problems and use the right system for each job