Case Study 1.6: Modern Trends in Industry
This is an optional deep-dive into how Big Tech pays for scale. Recommended after completing the SQL Introduction.
Beyond the Syntax: The Economics of Scale
Once you master SQL, the game changes from "How do I write this?" to "How do I pay for this?" Here are current trends in the industry (2020s-now)
Trend 1: The Scale Paradox (2020s - now)
-
Startup: Use "Boring" Postgres. Itβs free and stable.
-
Big Tech: You pay a "Specialization Tax" (Managed Services, Custom Engines) because the cost of doing it manually would bankrupt you.
Trend 2: The Extract-Load-Transform (ELT) Flip (2020s - now)
The Concept: Get the raw, "dirty" data into a cheap cloud "bucket" (S3/GCS) immediately. Then, use SQL to clean it up later. In the past, we used to do the opposite (ETL).
The Why:
-
Raw data is your most valuable asset. If your cleaning logic has a bug, re-run the process from the raw source. Store first, ask questions later.
-
The flip is a game-changer for 2020s. It's only possible because the cost of storing and managing TBs of data has dropped by orders of magnitude in the past decade. For ex, in 2020s in BigQuery, it'll cost you < $10/TB/month.
-
Also, it's a lot easier to work with streaming data when you store it first. And you can use SQL to process it in real-time.
Trend 3: SQL plus JSONB (over pure NoSQL ) (2020s - now)
The Concept: Use SQL databases with JSONB columns to store semi-structured data.
The Why:
-
JSONB is a first-class citizen in modern SQL databases. It's fast, flexible, and easy to use.
-
Pure NoSQL databases are easy to start with (don't need to structure anything), but they tend to have consistency challenges and don't have the declarative power of SQL.
Trend 4: Split "Storage" from "Compute" (Apache Iceberg, 2020s - now)
The Problem: In many managed services, storage and compute were glued together. The Solution: Separating Storage from Compute.
The Why:
-
Why pay for a 24/7 supercomputer when you only run your analytics for 5 minutes a day? Splitting the storage and compute lets you scale storage to infinity while keeping your "Compute" bill tiny.
-
Or the other way around, when you have small data and want to run analytics on it all the time.
-
Or you need the flexibility for a custom engine for your use case. E.g., graph databases, time series databases, etc.
Trend 5: The ORM "Scale Wall" (A Trade-off)
The Concept: Many developers start with an ORM (Object-Relational Mapper) like SQLAlchemy.org. It lets you write Python objects that "magically" become database rows. No SQL required.
The Trade-off:
-
Startup Phase (The Win): ORMs are a good deal. They let you build features in minutes. They also hide a lot of details.
-
Scale Phase (The Disaster): As you get more users and machines (Distributed Systems), the "hidden details" can cause silent failures that corrupt data.
The Why: We focus on Raw SQL in CS145. When you know how it works, you can find the right balance of raw SQL and ORM. At scale, you need a lot more control.