Database Systems: How to Deploy

80% of database internals share the same core algorithms

OLTP uses B-Trees, OLAP uses columnar storage, ETL uses hash partitions - but they all rely on the same fundamental patterns

Start Simple: The Reality of Database Scale

Start Simple: Database Reality Check Most applications never need distributed databases 95% of Apps Managed PostgreSQL < 100GB Data < 1000 requests/sec PostgreSQL 🌍 Data Residency Solved EU (GDPR) • US • Asia-Pacific Just select region in console Providers: Supabase • Neon • Vercel AWS RDS • Google AlloyDB $25 - $100/month Zero ops, full compliance Top 5% of Apps Scaled PostgreSQL 100GB - 10TB Data 1K - 10K requests/sec Primary + Replicas Read 1 Read 2 Cache Example: Notion PostgreSQL + replicas at $10B valuation Solutions: Aurora • AlloyDB • Citus Still single logical database $500 - $5K/month Some ops, mostly managed Top 1% Only Truly Distributed > 10TB Data > 10K requests/sec Shard 1 Shard 2 Shard 3 ...N ⚠️ High operational complexity Actually needed when: • Global < 50ms latency required • Database team of 5+ engineers Examples: Spanner • DynamoDB • Cassandra $10K+/month Plus dedicated ops team 💡 Pro tip: Let managed services handle regions, compliance, and scaling. Focus on your product.
Real-World Validation:
  • Stack Overflow: 4 SQL Servers handle 1.3B pageviews/month
  • Basecamp: 3M+ accounts on simple Rails + MySQL architecture
  • Notion: PostgreSQL with read replicas at $10B valuation
  • Discord: Started simple, moved to Cassandra only after 1M+ concurrent users
When You Actually Need Distribution:
  • Your managed database vendor says you've hit hard limits
  • You have a dedicated database team (5+ engineers)
  • Global latency requirements (< 50ms everywhere)
  • Handling 100K+ requests per second sustained

💰 The $120B Database Market (2024) $50B Single Instance PostgreSQL • MySQL RDS • Azure SQL • Cloud SQL Millions of apps $70B Distributed/Scale Aurora • Spanner • DynamoDB BigQuery • Snowflake • Cosmos DB Trillions in transactions 95% of apps 90% of revenue Top 7 vendors control 91% of $120B market AWS $27B Microsoft $25B Oracle $18B Google $10B Snow $4B IBM $4B DBX $3B 100+ Others ~$10B The paradox: 95% of apps run on single instances, but the top 1% drive most database revenue Source: Gartner 2023 Database Market Analysis

OLTP, OLAP, and ETL - For Scale, Split Responsibilities

The Three Worlds of Big Data Systems OLTP Online Transaction Processing Spotify Examples • User plays a song • Add to playlist • Update user profile • Like/unlike track • Follow artist Pattern: Single row operations Characteristics: • 1000s queries/sec • < 100ms latency Single Machine: 1K-10K TPS PostgreSQL: 5K TPS typical Distributed: 10K-1M+ TPS Spanner: 50K/region, Aurora: 200K writes/s OLAP Online Analytical Processing Spotify Analytics • Top songs by country • Genre trends/month • User retention cohorts • Revenue by segment • A/B test results Pattern: Scan billions of rows Characteristics: • 10s of queries/hour • 1-60 second queries Always Distributed: > 1TB BigQuery, Snowflake, Redshift ETL Pipeline Extract → Transform → Load Nightly Job: 1. Extract from OLTP (1hr) 2. Clean & aggregate (2hr) 3. Load to OLAP (30min) Tools: Apache Spark, Airflow Batch processing at scale Raw events Aggregates Real-time Batch

OLTP: The Speed Demon

What is OLTP?

Online Transaction Processing - The workhorse of user-facing applications. Every click, every update, every "add to cart" is an OLTP operation.

Spotify OLTP in Action

-- User plays "Anti-Hero" by Taylor Swift
BEGIN TRANSACTION;
INSERT INTO listens (user_id, song_id, timestamp) 
VALUES (123, 789, NOW());
UPDATE users SET last_active = NOW() WHERE user_id = 123;
UPDATE songs SET play_count = play_count + 1 WHERE song_id = 789;
COMMIT;

Characteristics:

Single Machine vs Distributed OLTP

Single Machine
1K-10K TPS
Beefy Server
128GB RAM, 32 cores
PostgreSQL
5K TPS typical
$500/month
Cloud cost

When to distribute OLTP:


OLAP: The Data Cruncher

What is OLAP?

Online Analytical Processing - Built for business intelligence, data science, and executive dashboards. Answers questions like "What's trending?" and "How are we doing?"

Spotify Analytics Query

-- Which genres are growing fastest this quarter?
SELECT 
    s.genre,
    DATE_TRUNC('week', l.timestamp) AS week,
    COUNT(DISTINCT l.user_id) AS unique_listeners,
    COUNT(*) AS total_plays
FROM listens l
JOIN songs s ON l.song_id = s.song_id
WHERE l.timestamp > '2024-01-01'
GROUP BY s.genre, week
ORDER BY week, unique_listeners DESC;

This query:

Why OLAP is Almost Always Distributed

The Math Doesn't Work on Single Machine:
  • Data Size: 1TB+ common (won't fit in RAM)
  • Scan Speed: SSD reads at 5GB/s → 200 seconds for 1TB
  • CPU Bound: Aggregations need parallel processing
  • Cost: One huge machine vs 100 small machines
Distributed OLAP Architecture:
100 machines × 10GB each = 1TB total
Parallel scan: 1TB / 100 = 10GB per machine
Time: 10GB / 5GB/s = 2 seconds (vs 200s)

ETL: The Data Highway

Extract, Transform, Load

The bridge between OLTP and OLAP worlds. Takes operational data and prepares it for analysis.

Spotify's Nightly ETL Pipeline

# Simplified Apache Spark ETL job
def spotify_daily_etl():
    # 1. EXTRACT FROM OLTP (PostgreSQL replicas)
    raw_listens = spark.read.jdbc(
        "postgresql://replica.spotify.com/listens",
        query="SELECT * FROM listens WHERE date = YESTERDAY"
    )

    # 2. TRANSFORM - Aggregate AND enrich
    daily_stats = raw_listens \
        .JOIN(songs_dim, "song_id") \
        .JOIN(users_dim, "user_id") \
        .groupBy("genre", "country", "age_group") \
        .agg(
            COUNT("*").alias("play_count"),
            countDistinct("user_id").alias("unique_users"),
            AVG("listen_duration").alias("avg_duration")
        )

    # 3. LOAD to OLAP (BigQuery)
    daily_stats.write \
        .mode("append") \
        .partitionBy("date") \
        .bigquery("spotify_warehouse.daily_stats")

ETL Characteristics


The Decision Framework

When to Use What?

Question OLTP OLAP ETL
Is this user-facing? ✅ Yes ❌ No ❌ No
< 100ms response needed? ✅ Yes ❌ No ❌ No
Scanning > 1M rows? ❌ No ✅ Yes ✅ Yes
Complex aggregations? ❌ No ✅ Yes ✅ Yes
Real-time updates? ✅ Yes ❌ No ❌ No
Historical analysis? ❌ No ✅ Yes ❌ No

Real-World Architecture: Spotify Scale

The Numbers

The Stack

Users → Load Balancers → API Servers
           ↓
    PostgreSQL Shards (OLTP)
           ↓
    Kafka (Event Stream)
           ↓
    Spark (ETL Processing)
           ↓
    BigQuery (OLAP Warehouse)
           ↓
    Tableau (Dashboards)

Key Takeaways

  1. Three Different Worlds: OLTP, OLAP, and ETL solve different problems

  2. Storage Matters: Row storage for OLTP, columnar for OLAP

  3. Start Simple: Single machine PostgreSQL handles more than you think

  4. Distribute When Necessary: Let performance metrics drive the decision

  5. ETL Connects Everything: The critical bridge between operational and analytical

Next Up: Now that we understand WHEN to use different systems, let's dive into HOW they work, starting with Storage & Paging on a single machine.