Case Study 3.1: How BigQuery scaled to trillions of rows at Google
Goal: Learn why column stores are preferred for large-scale analytics.
Google’s BigQuery is a fully managed, cloud-native data warehouse that enables super-fast SQL queries using the processing power of Google's infrastructure. It allows users to analyze petabyte-scale datasets with standard SQL, without needing to manage infrastructure.
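To make the "standard SQL, no infrastructure to manage" point concrete, here is a minimal sketch using the google-cloud-bigquery Python client. It assumes the client library is installed and credentials are configured; the project, dataset, and table names are illustrative placeholders, not real datasets.

```python
from google.cloud import bigquery

# Assumes Application Default Credentials are set up for a real project.
client = bigquery.Client()

# Standard SQL; selecting only the columns you need matters because
# BigQuery scans (and bills) by column, not by row.
query = """
    SELECT date, COUNT(*) AS pages
    FROM `my-project.web.crawl`   -- hypothetical table, for illustration only
    GROUP BY date
    ORDER BY pages DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.date, row.pages)
```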
The Origin: A Frustrated Engineer and a Foosball Table
In 2005, Andrey Gubarev, a star engineer at Google, was annoyed. He wanted to analyze a web dataset to gain insights into web page trends and patterns, but even Google’s then state-of-the-art infrastructure was painfully slow.
His queries were taking hours. He wanted to spend less time "babysitting" queries and more time playing foosball! This frustration led him to dig into why queries were slow and how to fix them. He didn’t imagine his project would scale to trillions of rows, become the core of BigQuery, and win multiple industry awards for its radical impact on data research.
The Problem: Row-Based Bottlenecks
Google stored billions of web documents with information like:
- URL, Title, Date, Keywords
- Body (HTML): the heaviest part!
Row-Based Storage Approach
In a traditional row-based layout, to analyze web page trends by Date, the database must read the entire row—including the massive HTML body—even if only the date column is needed.
- The Cost: Reading a few bytes (the Date) means reading megabytes (the HTML body) for every single record.
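A toy Python sketch makes the asymmetry visible. In-memory dictionaries stand in for on-disk rows, and the record counts and body sizes are made up; the point is only the ratio between the bytes a row scan touches and the bytes the Date values actually occupy.

```python
# Toy "web documents": a handful of small fields plus a heavy HTML body.
rows = [
    {
        "url": f"https://example.com/page{i}",
        "title": f"Page {i}",
        "date": "2005-08-{:02d}".format(i % 28 + 1),
        "keywords": "news,web,google",
        "body": "<html>" + "x" * 100_000 + "</html>",  # the heavy part
    }
    for i in range(1_000)
]

# Row layout: answering "pages per date?" still reads every field, body included.
row_scan_bytes = sum(len(str(v).encode()) for r in rows for v in r.values())

# Column layout: only the Date values are read.
date_scan_bytes = sum(len(r["date"].encode()) for r in rows)

print(f"row scan:    {row_scan_bytes:>12,} bytes")
print(f"date column: {date_scan_bytes:>12,} bytes")
print(f"reduction:   ~{row_scan_bytes // date_scan_bytes:,}x less data touched")
```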
The Solution: Columnar Storage
With a columnar approach, all data for the URL column is stored together, all Date data is stored together, and so on.
- Skip the Junk: Queries only read the columns they actually need. If you query by Date, the database never even touches the HTML Body on disk.
- Efficient Compression: Because data in a single column is highly similar (e.g., all dates look alike), it compresses much better than mixed row data.
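A small sketch of the compression point, with zlib standing in for the real columnar codecs and made-up values: the same dates compress dramatically better when stored contiguously than when interleaved with other fields.

```python
import zlib

dates = ["2005-08-{:02d}".format(i % 28 + 1) for i in range(10_000)]
titles = [f"Page {i} - assorted title text" for i in range(10_000)]

# Columnar: all dates stored contiguously.
date_column = "\n".join(dates).encode()

# Row-ish: dates interleaved with other fields, breaking up the repetition.
mixed_rows = "\n".join(f"{t}|{d}" for t, d in zip(titles, dates)).encode()

for label, blob in [("date column", date_column), ("mixed rows", mixed_rows)]:
    packed = zlib.compress(blob)
    print(f"{label:12s} {len(blob):>9,} B -> {len(packed):>8,} B "
          f"(ratio {len(blob) / len(packed):.1f}x)")
```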
Summary: The Impact
When each column is stored in its own pages and compressed, the database reads far fewer pages from disk. This leads to smaller indexes, fewer I/O reads, and queries that finish in seconds instead of hours.
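As a back-of-envelope illustration (the figures below are assumptions chosen for easy arithmetic, not Google's actual numbers): suppose the full crawl is 1 PB, the Date column is roughly 0.01% of it, and the system can scan 10 GB/s in aggregate.

```python
# Assumed, illustrative numbers only.
PB = 10**15
full_scan_bytes = 1 * PB                          # every column of every row
date_column_bytes = int(full_scan_bytes * 1e-4)   # only the Date column
bandwidth = 10 * 10**9                            # 10 GB/s aggregate scan rate

print(f"full row scan: {full_scan_bytes / bandwidth / 3600:.1f} hours")
print(f"date column  : {date_column_bytes / bandwidth:.0f} seconds")
```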
Takeaway: By skipping irrelevant data and leveraging compression, we can analyze the world's data at a scale that was previously impossible.