Optional 8: Modern Big Data Economics

Reading Time: 5 mins

This is an optional exploration of how Big Tech pays for scale. Recommended after completing the SQL Introduction.

Beyond the Syntax: The Economics of Scale

Once you master SQL, the game changes from "How do I write this?" to "How do I pay for this?" Here are some current trends in the industry (2020s to now).


Trend 1: Scaling Economics (2020s - now)


Trend 2: The Extract-Load-Transform (ELT) Flip (2020s - now)

The Concept: Land the raw, "dirty" data in a cheap cloud "bucket" (S3/GCS) immediately, then use SQL to clean it up later. Historically, the industry did the opposite (ETL: transform before loading).

The Why:

  1. Raw data is your most valuable asset. If your cleaning logic has a bug, you can re-run the transformation from the raw source. Store first, ask questions later.

  2. The flip is a game-changer of the 2020s. It is only possible because the cost of storing and managing terabytes of data has dropped by orders of magnitude in the past decade. For example, BigQuery's long-term storage runs on the order of $10/TB/month.

  3. Storing the stream first also makes streaming data much easier to work with: land the events, then process them with SQL in near real time.
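The ELT flip above can be sketched in a few lines. This is a minimal, illustrative example using SQLite's JSON1 functions (the `raw_events` table and its fields are made-up names, and a real pipeline would land data in S3/GCS rather than an in-memory database):

```python
import sqlite3

# Minimal ELT sketch (table and field names are invented for illustration).
conn = sqlite3.connect(":memory:")

# Extract + Load: land the raw, dirty payloads as-is, no cleaning yet.
conn.execute("CREATE TABLE raw_events (payload TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?)",
    [
        ('{"user": "ada",   "amount": "19.99"}',),
        ('{"user": "grace", "amount": "5.00"}',),
        ('{"user": "ada",   "amount": "oops"}',),  # dirty row, kept on purpose
    ],
)

# Transform: clean with SQL *after* loading. If this query has a bug,
# just rewrite it and re-run it against raw_events.
clean = conn.execute("""
    SELECT json_extract(payload, '$.user') AS user,
           CAST(json_extract(payload, '$.amount') AS REAL) AS amount
    FROM raw_events
    WHERE json_extract(payload, '$.amount') GLOB '[0-9]*'
""").fetchall()
print(clean)  # the dirty "oops" row is filtered out
```

Note that the raw table is never mutated: every "transform" is just another query, which is what makes re-runs cheap.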


Trend 3: SQL plus JSONB (over pure NoSQL) (2020s - now)

The Concept: Use SQL databases with JSONB columns to store semi-structured data.

The Why:

  1. JSONB is a first-class citizen in modern SQL databases (PostgreSQL, for example, supports indexes and operators over JSONB columns). It's fast, flexible, and easy to use.

  2. Pure NoSQL databases are easy to start with (no schema required), but they tend to have consistency challenges and lack the declarative power of SQL.
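The hybrid pattern looks like this: structured columns for what you always have, a JSON column for what varies. In PostgreSQL you would declare the column as `JSONB` and query it with operators like `->>`; the sketch below uses SQLite's JSON1 functions instead so it is self-contained, and the `products` table and its fields are made-up names:

```python
import sqlite3

# Relational columns for the stable schema + a JSON column for the
# flexible parts (names invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,   -- structured, always present
        attrs TEXT             -- semi-structured, varies per product
    )
""")
conn.executemany(
    "INSERT INTO products (name, attrs) VALUES (?, ?)",
    [
        ("laptop", '{"ram_gb": 16, "color": "gray"}'),
        ("mug",    '{"color": "red", "oz": 12}'),
    ],
)

# Declarative SQL over the semi-structured part:
rows = conn.execute("""
    SELECT name
    FROM products
    WHERE json_extract(attrs, '$.color') = 'red'
""").fetchall()
print(rows)  # → [('mug',)]
```

The point is that the flexible data still lives inside the relational engine, so joins, constraints, and transactions keep working around it.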


Trend 4: Split "Storage" from "Compute" (Apache Iceberg, 2020s - now)

The Problem: Legacy architectures mandate that if you want to store 100 TB, you must pay for a massive, always-on server cluster to house it, even if you only query that data once a week. You are paying a premium for RAM/CPU that sits idle.

The Solution: Separate storage from compute.

The Why:

  1. Why pay for a 24/7 supercomputer just to hold files? By splitting the two, you dump your raw data into dirt-cheap object storage (like AWS S3) and only spin up the expensive "Compute" engines (the CPU/RAM) for the 5 minutes you actually run your query.

  2. This decouples scale: you can scale storage to infinity for pennies, and scale compute independently based on how fast you need answers.

  3. This is the foundation of modern Data Lakes and of table formats like Apache Iceberg.
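The economics above are easy to check with back-of-envelope arithmetic. Every price in this sketch is an assumption chosen for illustration, not a quote from any vendor:

```python
# Back-of-envelope cost comparison for the 100 TB, query-once-a-week
# scenario. All rates below are illustrative assumptions.
TB = 100

# Coupled architecture: an always-on cluster big enough to hold the data.
cluster_rate = 10.0                       # assumed $/hour for the cluster
coupled_monthly = 24 * 30 * cluster_rate  # running 24/7

# Decoupled architecture: cheap object storage, plus compute billed
# only for the hours queries actually run.
storage_rate = 23.0                       # assumed $/TB/month (S3-like)
query_hours = 8                           # assumed compute hours per month
decoupled_monthly = TB * storage_rate + query_hours * cluster_rate

print(coupled_monthly, decoupled_monthly)
```

Under these assumed rates the always-on cluster costs $7,200/month while storage-plus-on-demand-compute costs $2,380/month, and the gap widens as the data grows faster than the query load.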

Trend 5: The ORM "Scale Wall" (A Trade-off)

The Concept: Many developers start with an ORM (Object-Relational Mapper) such as SQLAlchemy. It lets you write Python objects that "magically" become database rows, with no SQL required.

The Trade-off: ORMs are productive at small scale, but at scale you need a lot more control than they give you.

The Why: We focus on raw SQL in CS145. Once you know how it works under the hood, you can find the right balance of raw SQL and ORM.
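One concrete face of the "scale wall" is per-row round trips. The sketch below uses sqlite3 with invented table names; the Python loop stands in for the per-object traffic a naive ORM pattern generates, next to the single set-based statement raw SQL allows:

```python
import sqlite3

# Illustrative "scale wall" sketch (table and column names are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts (balance) VALUES (?)", [(100.0,)] * 5)

# ORM-style: load every row, then issue one UPDATE per object.
# With N rows this is N statements (and, over a network, N round trips).
for (acct_id,) in conn.execute("SELECT id FROM accounts").fetchall():
    conn.execute("UPDATE accounts SET balance = balance * 1.01 WHERE id = ?",
                 (acct_id,))

# Raw-SQL style: one statement, no matter how many rows there are.
conn.execute("UPDATE accounts SET balance = balance * 1.01")

total = conn.execute(
    "SELECT ROUND(SUM(balance), 2) FROM accounts"
).fetchone()[0]
print(total)
```

Modern ORMs can emit bulk statements too; the point is that you have to know the SQL to recognize when the "magic" version is the slow one.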