Database Systems: How to Deploy

80% of database internals share the same core algorithms

OLTP uses B-Trees, OLAP uses columnar storage, ETL uses hash partitions - but they all rely on the same fundamental patterns

Start Simple: The Reality of Database Scale

Real-World Validation:

Stack Overflow: 4 SQL Servers handle 1.3B pageviews/month
Basecamp: 3M+ accounts on simple Rails + MySQL architecture
Notion: PostgreSQL with read replicas at $10B valuation
Discord: Started simple, moved to Cassandra only after 1M+ concurrent users

When You Actually Need Distribution:

Your managed database vendor says you've hit hard limits
You have a dedicated database team (5+ engineers)
Global latency requirements (< 50ms everywhere)
Handling 100K+ requests per second sustained

OLTP, OLAP, and ETL - For Scale, Split Responsibilities

OLTP: The Speed Demon

What is OLTP?

Online Transaction Processing - The workhorse of user-facing applications. Every click, every update, every "add to cart" is an OLTP operation.

Spotify OLTP in Action

-- User plays "Anti-Hero" by Taylor Swift
BEGIN TRANSACTION;
INSERT INTO listens (user_id, song_id, timestamp) 
VALUES (123, 789, NOW());
UPDATE users SET last_active = NOW() WHERE user_id = 123;
UPDATE songs SET play_count = play_count + 1 WHERE song_id = 789;
COMMIT;

Characteristics:

Latency: < 100ms per query
Volume: 1000s of queries/second
Pattern: Single row or small set operations
Storage: Row-based (PostgreSQL, MySQL)

Single Machine vs Distributed OLTP

Single Machine

1K-10K TPS

Beefy Server

128GB RAM, 32 cores

PostgreSQL

5K TPS typical

$500/month

Cloud cost

When to distribute OLTP:

More than 10K transactions/second
Global users requiring < 50ms latency
Need 99.99% availability
Examples: Google Spanner (50K+ TPS), Aurora (200K writes/sec)

OLAP: The Data Cruncher

What is OLAP?

Online Analytical Processing - Built for business intelligence, data science, and executive dashboards. Answers questions like "What's trending?" and "How are we doing?"

Spotify Analytics Query

-- Which genres are growing fastest this quarter?
SELECT 
    s.genre,
    DATE_TRUNC('week', l.timestamp) AS week,
    COUNT(DISTINCT l.user_id) AS unique_listeners,
    COUNT(*) AS total_plays
FROM listens l
JOIN songs s ON l.song_id = s.song_id
WHERE l.timestamp > '2024-01-01'
GROUP BY s.genre, week
ORDER BY week, unique_listeners DESC;

This query:

Scans 5TB of listening data
Processes 50 billion rows
Takes 15 seconds on BigQuery
Would take 2 hours on PostgreSQL

Why OLAP is Almost Always Distributed

The Math Doesn't Work on Single Machine:

Data Size: 1TB+ common (won't fit in RAM)
Scan Speed: SSD reads at 5GB/s → 200 seconds for 1TB
CPU Bound: Aggregations need parallel processing
Cost: One huge machine vs 100 small machines

Distributed OLAP Architecture:

100 machines × 10GB each = 1TB total
Parallel scan: 1TB / 100 = 10GB per machine
Time: 10GB / 5GB/s = 2 seconds (vs 200s)

ETL: The Data Highway

Extract, Transform, Load

The bridge between OLTP and OLAP worlds. Takes operational data and prepares it for analysis.

Spotify's Nightly ETL Pipeline

# Simplified Apache Spark ETL job
def spotify_daily_etl():
    # 1. EXTRACT FROM OLTP (PostgreSQL replicas)
    raw_listens = spark.read.jdbc(
        "postgresql://replica.spotify.com/listens",
        query="SELECT * FROM listens WHERE date = YESTERDAY"
    )

    # 2. TRANSFORM - Aggregate AND enrich
    daily_stats = raw_listens \
        .JOIN(songs_dim, "song_id") \
        .JOIN(users_dim, "user_id") \
        .groupBy("genre", "country", "age_group") \
        .agg(
            COUNT("*").alias("play_count"),
            countDistinct("user_id").alias("unique_users"),
            AVG("listen_duration").alias("avg_duration")
        )

    # 3. LOAD to OLAP (BigQuery)
    daily_stats.write \
        .mode("append") \
        .partitionBy("date") \
        .bigquery("spotify_warehouse.daily_stats")

ETL Characteristics

Schedule: Daily, hourly, or streaming
Duration: Minutes to hours
Volume: GB to TB per run
Tools: Apache Spark, Airflow, dbt
Pattern: Batch processing

The Decision Framework

When to Use What?

Question	OLTP	OLAP	ETL
Is this user-facing?	✅ Yes	❌ No	❌ No
< 100ms response needed?	✅ Yes	❌ No	❌ No
Scanning > 1M rows?	❌ No	✅ Yes	✅ Yes
Complex aggregations?	❌ No	✅ Yes	✅ Yes
Real-time updates?	✅ Yes	❌ No	❌ No
Historical analysis?	❌ No	✅ Yes	❌ No

Real-World Architecture: Spotify Scale

The Numbers

OLTP: 500M users × 10 requests/day = 5B transactions/day
OLAP: 1PB data warehouse, 10K queries/day
ETL: 100TB/day processed

The Stack

Users → Load Balancers → API Servers
           ↓
    PostgreSQL Shards (OLTP)
           ↓
    Kafka (Event Stream)
           ↓
    Spark (ETL Processing)
           ↓
    BigQuery (OLAP Warehouse)
           ↓
    Tableau (Dashboards)

Key Takeaways

Three Different Worlds: OLTP, OLAP, and ETL solve different problems
Storage Matters: Row storage for OLTP, columnar for OLAP
Start Simple: Single machine PostgreSQL handles more than you think
Distribute When Necessary: Let performance metrics drive the decision
ETL Connects Everything: The critical bridge between operational and analytical

Next Up: Now that we understand WHEN to use different systems, let's dive into HOW they work, starting with Storage & Paging on a single machine.