DB Design for Big Tech Scale
Data Lakes (TB - PB)
The three-tier architecture that powers Big Tech: separate OLTP for transactions and OLAP for analytics, and load all your raw data into a sprawling, cheap Data Lake.
The Analytical Shift: Previously, we pushed our startup Postgres to its limits. But eventually a startup becomes an enterprise, and you never run petabyte-scale analytics on your operational database (OLTP): a complex analytical query will lock up your transaction engine and crash the app for users. Instead, you dump raw data into a massive, dirt-cheap Data Lake (out-of-core object storage) for independent analysis. From there, classic Extract, Transform, Load (ETL) pipelines selectively move and clean data into specialized analytical warehouses (OLAP) for fast querying.
The Three Parts of Big Data Systems
OLTP: User-Facing Speed
Every click, update, "add to cart" - that's OLTP. Here's Spotify handling a play:
-- User plays "Anti-Hero" by Taylor Swift
BEGIN TRANSACTION;
INSERT INTO listens (user_id, song_id, timestamp)
VALUES (123, 789, NOW());
UPDATE users SET last_active = NOW() WHERE user_id = 123;
UPDATE songs SET play_count = play_count + 1 WHERE song_id = 789;
COMMIT;
The OLTP Formula: < 100ms latency, 1000s queries/sec, single-row updates
Scale Limits:
- Single PostgreSQL: 5K transactions/sec ($500/month)
- Need more? → Aurora (200K/sec) or Spanner (50K/sec per region)
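The same play transaction can be driven from application code. A minimal sketch, using Python's built-in sqlite3 in place of PostgreSQL so it runs standalone (table and column names mirror the SQL above):

```python
# Illustrative only: sqlite3 stands in for PostgreSQL so the sketch is
# self-contained; the real system would use a Postgres driver and pool.
import sqlite3

def record_play(conn, user_id, song_id):
    """Record one play as a single atomic OLTP transaction."""
    with conn:  # commits on success, rolls back on any exception
        conn.execute(
            "INSERT INTO listens (user_id, song_id, timestamp) "
            "VALUES (?, ?, datetime('now'))",
            (user_id, song_id),
        )
        conn.execute(
            "UPDATE users SET last_active = datetime('now') WHERE user_id = ?",
            (user_id,),
        )
        conn.execute(
            "UPDATE songs SET play_count = play_count + 1 WHERE song_id = ?",
            (song_id,),
        )

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE listens (user_id INT, song_id INT, timestamp TEXT);
    CREATE TABLE users (user_id INT PRIMARY KEY, last_active TEXT);
    CREATE TABLE songs (song_id INT PRIMARY KEY, play_count INT DEFAULT 0);
    INSERT INTO users (user_id) VALUES (123);
    INSERT INTO songs (song_id, play_count) VALUES (789, 0);
""")
record_play(conn, 123, 789)
```

The point is the transaction boundary: either all three writes land or none do, and each statement touches a single row, which is what keeps latency under that 100ms budget.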
OLAP: Analytics at Scale
"What's trending? How are we doing?" - that's OLAP. Here's Spotify finding hot genres:
-- Which genres are growing fastest this quarter?
SELECT
s.genre,
DATE_TRUNC('week', l.timestamp) AS week,
COUNT(DISTINCT l.user_id) AS unique_listeners,
COUNT(*) AS total_plays
FROM listens l
JOIN songs s ON l.song_id = s.song_id
WHERE l.timestamp > '2024-01-01'
GROUP BY s.genre, week
ORDER BY week, unique_listeners DESC;
Reality Check: This scans 5TB (50B rows) in 15 seconds on BigQuery. PostgreSQL? 2+ hours.
Why OLAP Needs Distribution:
1 machine scanning 1TB: 200 seconds (SSD @ 5GB/s)
100 machines parallel: 2 seconds (each scans 10GB)
Result: 100x faster, same cost
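The arithmetic above is just data size divided by aggregate read throughput. A quick sanity check, using the same numbers as the text:

```python
# Back-of-the-envelope scan time: data is split evenly across machines,
# each reading sequentially at SSD throughput (5 GB/s by default).
def scan_seconds(data_gb, machines, throughput_gb_per_s=5):
    per_machine_gb = data_gb / machines
    return per_machine_gb / throughput_gb_per_s

print(scan_seconds(1000, 1))    # 1 TB on 1 machine   -> 200.0 seconds
print(scan_seconds(1000, 100))  # 10 GB per machine   -> 2.0 seconds
```

This is also why distribution is roughly cost-neutral: you rent 100 machines for 1/100th of the time.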
ETL/ELT: Moving Data Between Worlds
The nightly job that feeds analytics. Here's Spotify's ETL:
# Simplified Apache Spark ETL job (PySpark)
from pyspark.sql import functions as F

def spotify_daily_etl():
    # 1. EXTRACT from OLTP (PostgreSQL read replica)
    raw_listens = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://replica.spotify.com/spotify")
        .option("query", "SELECT * FROM listens WHERE date = current_date - 1")
        .load()
    )

    # 2. TRANSFORM - aggregate AND enrich with dimension tables
    # (songs_dim and users_dim are assumed to be loaded elsewhere)
    daily_stats = (
        raw_listens
        .join(songs_dim, "song_id")
        .join(users_dim, "user_id")
        .groupBy("date", "genre", "country", "age_group")
        .agg(
            F.count("*").alias("play_count"),
            F.countDistinct("user_id").alias("unique_users"),
            F.avg("listen_duration").alias("avg_duration"),
        )
    )

    # 3. LOAD to OLAP (BigQuery, via the spark-bigquery connector)
    (
        daily_stats.write
        .mode("append")
        .format("bigquery")
        .option("table", "spotify_warehouse.daily_stats")
        .save()
    )
ETL Reality: Extract-Transform-Load: Daily batches, GB to TB, minutes to hours. Tools: Spark + Airflow.
ELT Alternative: Extract-Load-Transform: load raw data into the data lake first, transform it into tables later. Tools: Spark + Airflow + dbt. Idea: remove the ETL bottleneck and let analytics teams transform data as needed.
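Under the hood, the Transform step is just a group-and-aggregate. Here is the same computation as the Spark job, reduced to plain Python over a few made-up sample rows (field names mirror the job above):

```python
# Plain-Python version of the TRANSFORM step: group enriched listen events
# by (genre, country, age_group) and aggregate. Sample rows are invented.
from collections import defaultdict

def transform(enriched_listens):
    groups = defaultdict(list)
    for row in enriched_listens:
        key = (row["genre"], row["country"], row["age_group"])
        groups[key].append(row)
    stats = []
    for (genre, country, age_group), rows in groups.items():
        durations = [r["listen_duration"] for r in rows]
        stats.append({
            "genre": genre, "country": country, "age_group": age_group,
            "play_count": len(rows),
            "unique_users": len({r["user_id"] for r in rows}),
            "avg_duration": sum(durations) / len(durations),
        })
    return stats

sample = [
    {"user_id": 1, "genre": "pop", "country": "US", "age_group": "18-24", "listen_duration": 180},
    {"user_id": 1, "genre": "pop", "country": "US", "age_group": "18-24", "listen_duration": 200},
    {"user_id": 2, "genre": "pop", "country": "US", "age_group": "18-24", "listen_duration": 160},
]
print(transform(sample))
# one group: play_count=3, unique_users=2, avg_duration=180.0
```

Spark's value is not the logic, which fits in twenty lines, but running it in parallel across billions of rows.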
The Complete Picture
Users → API → PostgreSQL (OLTP: Handle clicks)
↓
Kafka (Queue events reliably)
↓
Spark (ETL: Transform daily)
↓
BigQuery (OLAP: Answer "what's trending?")
↓
Dashboards (Executives see pretty charts)
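The Kafka hop in the middle is what lets the OLTP and ETL sides run at different speeds. A toy stand-in using only the stdlib (real Kafka adds durability, partitioning, and replay, but the decoupling idea is the same):

```python
# Toy stand-in for the Kafka hop: producers append events to a queue,
# and a batch consumer drains it on its own schedule. NOT real Kafka -
# just an illustration of producer/consumer decoupling.
from queue import Queue

events = Queue()

def produce(event):
    """OLTP side: fire-and-forget, never blocks on the consumer."""
    events.put(event)

def consume_batch(max_events):
    """ETL side: drain up to max_events whenever the batch job runs."""
    batch = []
    while not events.empty() and len(batch) < max_events:
        batch.append(events.get())
    return batch

for song_id in (789, 790, 791):
    produce({"user_id": 123, "song_id": song_id})

print(consume_batch(10))  # all three events, in FIFO order
```

If the Spark job is down for an hour, events simply pile up in the queue and nothing is lost, which is exactly the guarantee the pipeline diagram relies on.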
Key Takeaways
- Split into Three: OLTP for users, OLAP for analytics, ETL/ELT to connect them
- Kafka Ships the Data: Reliable message queues (or publish-subscribe) ensure nothing gets lost between systems
- SQL Runs Everything: PostgreSQL (OLTP), Spark SQL (ETL or ELT), BigQuery (OLAP) - it's all SQL
- The Big Tech (Not So) Secret: Don't build one database to rule them all - decompose problems and use the right data system for each job