Case Study 3.1: How BigQuery scaled to trillions of rows at Google

Case Study 3.1 Reading Time: 8 mins

Goal: Learn why column stores are preferred for big-scale analytics.

Google’s BigQuery is a fully managed, cloud-native data warehouse that enables super-fast SQL queries using the processing power of Google's infrastructure. It lets users analyze petabyte-scale datasets with standard SQL, without managing any infrastructure of their own.

The Origin: A Frustrated Engineer and a Foosball Table

In 2005, Andrey Gubarev, a super-star engineer at Google, was annoyed. He wanted to analyze a web dataset to gain insights into web page trends and patterns, but Google’s (then state-of-the-art) infrastructure was super-slow.

His queries were taking hours. He wanted to spend less time "babysitting" queries and more time playing foosball! This frustration led him to dig into why queries were slow and how to fix them. He didn’t imagine his project would scale to trillions of rows, become the core of BigQuery, and win multiple industry awards for its radical impact on data research.


The Problem: Row-Based Bottlenecks

Google stored billions of web documents, each with fields like the page URL, the crawl Date, and the full HTML Body of the page.

Row-Based Storage Approach

In a traditional row-based layout, to analyze web page trends by Date, the database must read the entire row—including the massive HTML body—even if only the date column is needed.
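To make the cost concrete, here is a minimal Python sketch of a row-based layout. It is illustrative only, not BigQuery's actual storage format, and the field names (url, date, html_body) and sizes are invented for the example.

```python
import json

# Hypothetical sample rows (invented fields and sizes); html_body dominates each row.
rows = [
    {
        "url": f"https://example.com/page/{i}",
        "date": f"2005-06-{i % 9 + 1:02d}",
        "html_body": "<html>" + "x" * 50_000 + "</html>",
    }
    for i in range(100)
]

# Row-based layout: each record is serialized as one contiguous blob.
row_blobs = [json.dumps(r).encode() for r in rows]

# To pull out just the dates, every full row has to be read and parsed,
# including the large html_body field the query never asked for.
bytes_read = sum(len(blob) for blob in row_blobs)
dates = [json.loads(blob)["date"] for blob in row_blobs]
print(f"Row layout: read {bytes_read:,} bytes to extract {len(dates)} dates")
```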


The Solution: Columnar Storage

With a columnar approach, all data for the URL column is stored together, all Date data is stored together, and so on.

  1. Skip the Junk: Queries only read the columns they actually need. If you query by Date, the database never even touches the HTML Body on disk.

  2. Efficient Compression: Because data in a single column is highly similar (e.g., all dates look alike), it can be compressed much better than mixed row data. Both ideas are sketched in the code after this list.
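Here is the same hypothetical dataset laid out column by column. Again, this is a simplified sketch rather than BigQuery's real on-disk format, but it shows both ideas: the date query never touches html_body, and even a naive run-length encoding collapses the highly similar date values.

```python
from itertools import groupby

# The same hypothetical dataset, stored column by column instead of row by row.
n = 100
columns = {
    "url": [f"https://example.com/page/{i}" for i in range(n)],
    "date": [f"2005-06-{i % 9 + 1:02d}" for i in range(n)],
    "html_body": ["<html>" + "x" * 50_000 + "</html>" for _ in range(n)],
}

# 1. Skip the junk: a query over dates reads only the date column;
#    the html_body column is never touched on disk.
date_column = columns["date"]
bytes_read = sum(len(d.encode()) for d in date_column)
print(f"Column layout: read {bytes_read:,} bytes for the same {len(date_column)} dates")

# 2. Efficient compression: values within one column are highly similar,
#    so a simple run-length encoding of the sorted dates shrinks it dramatically.
rle = [(value, len(list(run))) for value, run in groupby(sorted(date_column))]
print(f"Run-length encoded dates: {len(rle)} (value, count) pairs "
      f"instead of {len(date_column)} raw values")
```

Production columnar engines use far more sophisticated encodings, but the payoff is the same: each query reads only the columns it needs, and each column compresses far better than interleaved row data.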


Summary: The Impact

When each column is stored in its own set of pages and compressed, the database reads far fewer pages from disk. This means smaller indexes, fewer I/O operations, and queries that finish in seconds instead of hours.
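As a rough illustration of why this matters at Google's scale, the back-of-envelope calculation below uses invented numbers (1 billion documents, roughly 50 KB of HTML per page, 10-byte dates), not measured BigQuery figures.

```python
# Back-of-envelope I/O estimate with assumed sizes (not BigQuery measurements):
# 1 billion documents, ~50 KB of HTML per row, ~10 bytes per date value.
docs = 1_000_000_000
full_row_bytes = 50_000 + 100 + 10        # html_body + url + date
date_bytes = 10

row_scan_tb = docs * full_row_bytes / 1e12     # row layout: scan every full row
column_scan_gb = docs * date_bytes / 1e9       # column layout: scan only the dates

print(f"Row scan:    ~{row_scan_tb:,.0f} TB of reads")
print(f"Column scan: ~{column_scan_gb:,.0f} GB of reads (before compression)")
```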

Takeaway: By skipping irrelevant data and leveraging compression, we can analyze the world's data at a scale that was previously impossible.