Case Study 3.1: How BigQuery Scaled to Trillions of Rows at Google

Reading time: 8 minutes


Goal: Understand why column stores are the go-to for large-scale analytics.

Google's BigQuery is a cloud-native data warehouse that transforms massive datasets into actionable insights. It runs SQL queries on petabyte-scale data without the typical server management hassles.

The Origin: An Engineer's Annoyance

In 2005, Andrey Gubarev, a Google engineer, faced a problem. He needed insights from web datasets, but the existing infrastructure was sluggish. Queries dragged on for hours, interfering with his foosball time. Frustrated, he dug into the bottlenecks. This investigation laid the groundwork for BigQuery, a tool that would reshape large-scale data analysis and earn industry acclaim.


The Problem: Row-Based Storage Bottlenecks

Google's web document storage included data points like:

  - URL
  - Date crawled
  - HTML Body

Row-Based Storage Woes

In a row-based system, querying trends by Date required processing entire rows, including the bulky HTML body, even if only the date was needed.
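To make the cost concrete, here is a minimal sketch of row-based layout, using hypothetical example data (the URLs, dates, and field names are illustrative, not Google's actual schema). Because each record stores all of its fields together, a date-only query still drags the full HTML body through I/O:

```python
# Hypothetical sketch: in row-based storage, each document's fields live
# together, so a query that only needs the date still reads the bulky body.
rows = [
    {"url": "https://example.com/a", "date": "2005-01-01", "html_body": "<html>...</html>" * 1000},
    {"url": "https://example.com/b", "date": "2005-01-02", "html_body": "<html>...</html>" * 1000},
]

# Counting January fetches means scanning every full row, body included.
bytes_scanned = sum(len(str(row)) for row in rows)
january_hits = sum(1 for row in rows if row["date"].startswith("2005-01"))
```

Even though the answer depends only on a 10-byte date field, the scan touches tens of kilobytes per row.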


The Solution: Columnar Storage

Columnar storage changes the approach. Store all URL data together, all Date data together, and so on.

  1. Targeted Reads: Queries only access the columns you request. Query by Date, and the database ignores the HTML Body.

  2. Superior Compression: Uniform data in a single column (e.g., dates) allows for better compression than mixed row data.
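The two benefits above can be sketched with the same hypothetical data rearranged column-by-column (again, the field names and values are illustrative):

```python
# Hypothetical sketch: the same records stored as columns. A date query
# touches only the "date" column; the HTML bodies are never read.
columns = {
    "url":       ["https://example.com/a", "https://example.com/b"],
    "date":      ["2005-01-01", "2005-01-02"],
    "html_body": ["<html>...</html>" * 1000, "<html>...</html>" * 1000],
}

date_column = columns["date"]
bytes_scanned = sum(len(d) for d in date_column)  # 20 bytes, not kilobytes
january_hits = sum(1 for d in date_column if d.startswith("2005-01"))
```

The query reads 20 bytes instead of the tens of kilobytes a row scan would touch, and the uniform date column is also an ideal input for compression.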


Summary: The Impact

By storing each column separately and compressing them, the database significantly reduces the number of pages it reads from disk. This leads to smaller indices, fewer I/O operations, and queries that complete in seconds instead of hours.
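One way to see why uniform columns compress so well is run-length encoding, shown here as a simple sketch (real systems such as BigQuery use more sophisticated encodings; the data below is hypothetical). Many crawled pages share the same fetch date, so long runs collapse into (value, count) pairs:

```python
from itertools import groupby

# Hypothetical sketch: 1,000 fetch dates, but only two distinct values.
dates = ["2005-01-01"] * 500 + ["2005-01-02"] * 500

# Run-length encode: each run of identical values becomes (value, count).
rle = [(value, len(list(run))) for value, run in groupby(dates)]

raw_size = sum(len(d) for d in dates)               # 10,000 bytes uncompressed
compressed_size = sum(len(v) + 4 for v, _ in rle)   # ~28 bytes with 4-byte counts
```

A column that would occupy 10,000 bytes raw shrinks to roughly 28 bytes, which is what makes disk reads and indices so much smaller.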

Takeaway: Direct your I/O at only the columns you need, compress everything, and that's how you process petabytes in seconds.