DB Design for Big Tech Scale

Data Lakes (TB - PB)

The three-tier architecture that powers Big Tech: a separate OLTP system for transactions, an OLAP system for analytics, and a sprawling, cheap Data Lake that holds all your raw data.

The Analytical Shift: We pushed our startup Postgres to its limits previously, but eventually a startup becomes an enterprise. You never run petabyte-scale analytics on your operational (OLTP) database: one complex analytical query can hog CPU, evict the hot working set from memory, and hold locks long enough to take the app down for users. Instead, you dump raw data into a massive, dirt-cheap Data Lake (out-of-core object storage) for independent analysis. From there, classic Extract, Transform, Load (ETL) pipelines selectively move and clean data into specialized analytical warehouses (OLAP) for fast querying.
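
To make the "dump it in the lake" step concrete, here is a minimal, hypothetical sketch of landing a raw play event in object storage. The bucket name, key layout, and event shape are assumptions for illustration, not Spotify's actual pipeline:

# Hypothetical sketch: land raw events in a data lake bucket, untouched
import json
import datetime
import boto3  # AWS SDK; any object store (S3, GCS, ...) works the same way

s3 = boto3.client("s3")

def land_raw_event(event: dict) -> None:
    # Partition keys by date so downstream batch jobs can read one day at a time
    today = datetime.date.today().isoformat()
    key = f"raw/listens/date={today}/{event['user_id']}-{event['timestamp']}.json"
    s3.put_object(
        Bucket="spotify-data-lake",  # assumed bucket name
        Key=key,
        Body=json.dumps(event).encode("utf-8"),
    )

In practice events would be batched on the way in (for example via Kafka and a sink job) rather than written one object per play, but the principle holds: raw data lands cheap and untouched, and analysis happens elsewhere.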

The Three Parts of Big Data Systems

OLTP (Online Transaction Processing)
• Spotify examples: user plays a song, add to playlist, update user profile, like/unlike a track, follow an artist
• Pattern: single-row operations
• Real-time processing: 1000s of queries/sec, < 100ms latency, 24/7 availability
• Scale: single machine 1K-10K TPS (PostgreSQL ~5K TPS typical); distributed 10K-1M+ TPS (Spanner ~50K/region, Aurora ~200K writes/s)

OLAP (Online Analytical Processing)
• Spotify analytics: top songs by country, genre trends per month, user retention cohorts, revenue by segment, A/B test results
• Pattern: scan billions of rows
• Batch analytics: 10s of queries/hour, 1-60 second queries, historical analysis
• Scale: always distributed above ~1TB (BigQuery, Snowflake, Redshift)

ETL Pipeline (Extract → Transform → Load)
• Nightly batch processing: 1) extract from OLTP (~1hr), 2) clean & aggregate (~2hr), 3) load to OLAP (~30min)
• Pattern: raw events in, daily aggregates out
• Tools: Apache Spark, Airflow

Scale & Economics at Big Tech
• 100M - 5B transactions/day: ~$3K/month handles 100M txns/day
• 100TB processed/day: ~$15K/month
• 1PB warehouse: ~$20K/month for storage + queries
• Total: ~$38K/month for the basic data stack, plus the cost of the engineering/operations team to run it

OLTP: User-Facing Speed

Every click, update, "add to cart" - that's OLTP. Here's Spotify handling a play:

-- User plays "Anti-Hero" by Taylor Swift
BEGIN TRANSACTION;
INSERT INTO listens (user_id, song_id, timestamp) 
VALUES (123, 789, NOW());
UPDATE users SET last_active = NOW() WHERE user_id = 123;
UPDATE songs SET play_count = play_count + 1 WHERE song_id = 789;
COMMIT;

The OLTP Formula: < 100ms latency, 1000s queries/sec, single-row updates

Scale Limits: a single PostgreSQL node tops out around 1K-10K TPS (~5K is typical); beyond that you shard or move to a distributed OLTP system (Spanner: ~50K TPS per region, Aurora: ~200K writes/s).
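
A quick sanity check (the peak factor is an assumption, the other numbers come from the figures above) shows why 100M transactions/day still fits on one well-tuned box while 5B/day does not:

# Back-of-the-envelope: when does one Postgres box stop being enough?
TXNS_PER_DAY = 100_000_000      # 100M plays/day (from the scale figures above)
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 3                 # assumed: peak traffic ~3x the daily average

avg_tps = TXNS_PER_DAY / SECONDS_PER_DAY   # ~1,157 TPS average
peak_tps = avg_tps * PEAK_FACTOR           # ~3,500 TPS at peak

SINGLE_NODE_TPS = 5_000                    # typical tuned PostgreSQL
print(f"avg={avg_tps:.0f} TPS, peak~{peak_tps:.0f} TPS, "
      f"single node OK: {peak_tps < SINGLE_NODE_TPS}")
# At 5B txns/day the same math gives ~58K TPS average: distributed territory.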

OLAP: Analytics at Scale

"What's trending? How are we doing?" - that's OLAP. Here's Spotify finding hot genres:

-- Which genres are growing fastest this quarter?
SELECT 
    s.genre,
    DATE_TRUNC('week', l.timestamp) AS week,
    COUNT(DISTINCT l.user_id) AS unique_listeners,
    COUNT(*) AS total_plays
FROM listens l
JOIN songs s ON l.song_id = s.song_id
WHERE l.timestamp > '2024-01-01'
GROUP BY s.genre, week
ORDER BY week, unique_listeners DESC;

Reality Check: This scans 5TB (50B rows) in 15 seconds on BigQuery. PostgreSQL? 2+ hours.

Why OLAP Needs Distribution:

1 machine scanning 1TB: 200 seconds (SSD @ 5GB/s)
100 machines parallel: 2 seconds (each scans 10GB)
Result: 100x faster, same cost
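
The same arithmetic in a few lines of Python, so you can plug in your own data size and fleet (the 5 GB/s sequential-read figure is the assumption from above):

# Why scanning in parallel wins: scan time = bytes / (machines * throughput)
SSD_GBPS = 5  # assumed sequential read throughput per machine

def scan_seconds(data_tb: float, machines: int, gbps: float = SSD_GBPS) -> float:
    total_gb = data_tb * 1000
    return total_gb / (machines * gbps)

print(scan_seconds(1, 1))    # 200.0 s - one machine scans the whole TB
print(scan_seconds(1, 100))  # 2.0 s   - each of 100 machines scans ~10 GB
# Total machine-seconds is identical (~200), which is why the cost stays the same.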

ETL/ELT: Moving Data Between Worlds

The nightly job that feeds analytics. Here's Spotify's ETL:

# Simplified Apache Spark ETL job (PySpark)
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, count, countDistinct

spark = SparkSession.builder.appName("spotify_daily_etl").getOrCreate()

def spotify_daily_etl(songs_dim, users_dim):
    # 1. EXTRACT from OLTP (a PostgreSQL read replica) over JDBC
    raw_listens = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://replica.spotify.com/listens")
        .option("driver", "org.postgresql.Driver")
        .option("query", "SELECT * FROM listens WHERE date = CURRENT_DATE - 1")
        .load()
    )

    # 2. TRANSFORM - join the pre-loaded dimension tables, then aggregate
    daily_stats = (
        raw_listens
        .join(songs_dim, "song_id")
        .join(users_dim, "user_id")
        .groupBy("date", "genre", "country", "age_group")
        .agg(
            count("*").alias("play_count"),
            countDistinct("user_id").alias("unique_users"),
            avg("listen_duration").alias("avg_duration"),
        )
    )

    # 3. LOAD into OLAP (BigQuery via the spark-bigquery connector, into a
    #    date-partitioned table; connector config omitted for brevity)
    (
        daily_stats.write
        .format("bigquery")
        .option("table", "spotify_warehouse.daily_stats")
        .mode("append")
        .save()
    )

ETL Reality: Extract → Transform → Load. Daily batches, GB to TB per run, minutes to hours of runtime. Tools: Spark + Airflow.

ELT Alternative: Extract → Load → Transform. Load raw data straight into the data lake, then transform it into tables later. Tools: Spark + Airflow + dbt. The idea: remove the ETL bottleneck and let analytics teams transform the data as they need it.
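
A minimal ELT sketch of the same daily job, under two stated assumptions: raw listen events already sit as JSON in the lake, and they carry genre/country denormalized (the paths and schema are illustrative, not Spotify's):

# Minimal ELT sketch: load raw data first, transform later with SQL in the lake
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spotify_daily_elt").getOrCreate()

# 1. EXTRACT + LOAD: copy yesterday's raw events into the lake untouched
raw = spark.read.json("s3a://spotify-raw-events/listens/2024-01-01/")
raw.write.mode("append").parquet("s3a://spotify-lake/raw/listens/")

# 2. TRANSFORM on demand, as plain SQL over the raw data
#    (in practice this is where a scheduled dbt model would run)
spark.read.parquet("s3a://spotify-lake/raw/listens/") \
    .createOrReplaceTempView("listens_raw")
daily_stats = spark.sql("""
    SELECT genre, country,
           COUNT(*)                AS play_count,
           COUNT(DISTINCT user_id) AS unique_users
    FROM listens_raw
    GROUP BY genre, country
""")
daily_stats.write.mode("overwrite").parquet("s3a://spotify-lake/marts/daily_stats/")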

The Complete Picture

Users → API → PostgreSQL (OLTP: Handle clicks)
         ↓
      Kafka (Queue events reliably)  
         ↓
      Spark (ETL: Transform daily)
         ↓
    BigQuery (OLAP: Answer "what's trending?")
         ↓
    Dashboards (Executives see pretty charts)
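
The hand-off in that picture starts with the API publishing each event to Kafka so the analytics side can consume it independently of the OLTP write. A hedged sketch using the kafka-python client; the broker address, topic name, and event shape are assumptions:

# Hypothetical sketch: the API publishes each play event to Kafka
import json
from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(
    bootstrap_servers="kafka.spotify.internal:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",  # wait for full acknowledgement so events aren't lost
)

def on_song_played(user_id: int, song_id: int, timestamp: str) -> None:
    # The play is written to PostgreSQL in the request path (as above) AND
    # published to the "listen-events" topic for the ETL/analytics pipeline.
    event = {"user_id": user_id, "song_id": song_id, "timestamp": timestamp}
    producer.send("listen-events", value=event)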

Key Takeaways

  1. Split into Three: OLTP for users, OLAP for analytics, ETL/ELT to connect them

  2. Kafka Ships the Data: Reliable message queues (or publish-subscribe) ensure nothing gets lost between systems

  3. SQL Runs Everything: PostgreSQL (OLTP), Spark SQL (ETL or ELT), BigQuery (OLAP) - it's all SQL

  4. The Big Tech (Not So) Secret: Don't build one database to rule them all - decompose problems and use the right data system for each job