Case Study 3.1: BigQuery, From Web Crawl to Petabyte Analytics

Concept. BigQuery scales to trillions of rows by storing data in columnar format and decoupling storage from compute, so queries spin up thousands of workers on demand and only read the columns they touch.

Intuition. When Spotify queries a 100-billion-row Listens table for "average rating per genre," BigQuery reads only the rating and genre columns, skipping user_id, listen_time, and song_id. Then it parallelizes the scan across thousands of workers that exist only for the seconds the query runs.
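This read path can be sketched in a few lines of Python. The data is a hypothetical in-memory toy table; column names follow the example above, and real systems would read column files from disk rather than lists:

```python
from collections import defaultdict

# Toy listens table stored column-by-column: each column is its own array,
# so a query reads only the columns it touches. (Hypothetical data.)
columns = {
    "user_id":     [1, 2, 3, 4],
    "song_id":     [10, 11, 10, 12],
    "genre":       ["rock", "jazz", "rock", "jazz"],
    "rating":      [4, 5, 2, 3],
    "listen_time": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
}

def avg_rating_per_genre(cols):
    """Average rating per genre, touching only 2 of the 5 columns."""
    totals, counts = defaultdict(int), defaultdict(int)
    for genre, rating in zip(cols["genre"], cols["rating"]):
        totals[genre] += rating
        counts[genre] += 1
    return {g: totals[g] / counts[g] for g in totals}

print(avg_rating_per_genre(columns))  # {'rock': 3.0, 'jazz': 4.0}
```

Note that user_id, song_id, and listen_time are never accessed; in a row layout, every query would have to step over them.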


The Origin: An Engineer's Foosball Problem

In 2005, Andrey Gubarev, a Google engineer, faced an annoyance. He needed insights from Google's web datasets, but the existing infrastructure was sluggish. Queries dragged on for hours, eating into his foosball time. Frustrated, he dug into the bottlenecks. That investigation laid the groundwork for Dremel, the system that would later ship as BigQuery and change large-scale data analysis.

The Bottleneck: Row Storage on Web Documents

Google's web document storage held data points like:

  • URL

  • Title

  • Date

  • Keywords

  • Body (HTML). The heavyweight, often megabytes per row.

In a row-based system, querying trends by Date required processing the entire row, including the bulky HTML body, even though only one tiny field was needed. Fetching a few bytes (Date) involved dragging along megabytes (Body) per record. At Google scale, that was hours of wasted I/O.
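A back-of-envelope calculation makes the waste concrete. The field sizes below are illustrative assumptions, not Google's actual numbers:

```python
# Back-of-envelope I/O for "trends by Date" on a row store vs a column store.
# Field sizes are illustrative assumptions, not Google's actual numbers.
BODY_BYTES  = 2 * 1024**2   # ~2 MiB of HTML per document
OTHER_BYTES = 300           # URL, title, keywords combined
DATE_BYTES  = 8             # the one field the query actually needs
DOCS        = 1_000_000

row_bytes_per_doc = BODY_BYTES + OTHER_BYTES + DATE_BYTES

row_scan = DOCS * row_bytes_per_doc   # row store: drag the whole row along
col_scan = DOCS * DATE_BYTES          # column store: read only the Date column

print(f"row store:    {row_scan / 1024**3:,.0f} GiB read")
print(f"column store: {col_scan / 1024**2:,.1f} MiB read")
print(f"wasted I/O factor: ~{row_scan // col_scan:,}x")
```

Under these assumptions the row store reads roughly a quarter-million times more bytes than the query needs, which is exactly the gap Dremel set out to close.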

The Solution: Columnar Storage via Dremel

Dremel flipped the layout. Store all URL values together, all Date values together, all Body values together, each column on its own pages.

  1. Targeted reads. Query by Date? Only the Date column is touched. The HTML Body is never read.

  2. Better compression. A column of dates compresses far better than a row of mixed types, because uniform data has uniform patterns.

  3. Tree-based parallel execution. A query is split across thousands of leaf workers; each scans a slice of one column; results bubble up the tree.

The result: queries that took hours on row storage finished in seconds.
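The leaf-then-merge pattern behind pillar 3 can be sketched with a two-level tree, using thread workers as stand-ins for Dremel's leaf servers:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def leaf_scan(chunk):
    """A leaf worker scans its slice of one column, returning a partial aggregate."""
    return Counter(chunk)

def tree_count(column, workers=4):
    """Root: shard the column, fan out to leaves, merge results bubbling up."""
    chunks = [column[i::workers] for i in range(workers)]
    merged = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(leaf_scan, chunks):
            merged += partial  # a real tree merges at intermediate nodes too
    return merged

dates = ["2024-01", "2024-02", "2024-01"] * 1000
assert tree_count(dates) == Counter(dates)  # same answer as a serial scan
```

Because counting is associative, partial counts from any number of leaves merge into the same final answer, which is what lets Dremel fan a query out across thousands of machines.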

Real Numbers: A Spotify-Scale Query

-- Analytics across 5 billion music listens
SELECT artist, COUNT(*) AS plays
FROM spotify.listens_1tb       -- 1 TB table, 20 columns
WHERE listen_time > '2024-01-01'
GROUP BY artist
ORDER BY plays DESC
LIMIT 100;

The query touches only 2 of 20 columns (artist and listen_time). Row storage reads every byte of every row; BigQuery scans just those two columns, which are stored compressed on disk:

Approach                                        Bytes read      Time          Cost
Row storage (full table scan, all 20 columns)   1 TB            10+ minutes   baseline
BigQuery (2 columns, compressed, distributed)   ~50 GB          ~15 seconds   20× cheaper
Improvement                                     20× less I/O    40× faster
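The ratios in the table follow directly from its own numbers (taking 1 TB as 1000 GB and "10+ minutes" as 600 seconds):

```python
# Deriving the table's improvement figures.
full_scan_gb, bq_scan_gb = 1000, 50   # 1 TB vs ~50 GB read
row_time_s, bq_time_s = 600, 15       # 10 minutes vs ~15 seconds

io_reduction = full_scan_gb / bq_scan_gb  # columnar pruning + compression
wall_speedup = row_time_s / bq_time_s     # pruning + compression + parallelism

print(io_reduction, wall_speedup)  # 20.0 40.0
```

The wall-clock speedup (40×) exceeds the I/O reduction (20×) because distribution adds its own multiplier on top of reading fewer bytes.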

Why It Works: Three Pillars

Columnar layout. Only artist and listen_time files are read; the other 18 columns are skipped.

Compression. Dictionary encoding on artist (top-1K artists cover most plays) plus delta encoding on listen_time shrink the read by another order of magnitude.

Distribution. The scan is sharded across ~100 workers, each processing ~500 MB. Workers exist only for the seconds the query runs.

Together these three turn "scan a petabyte" into "scan a few GB on hundreds of machines in parallel."
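The two encodings named above can be sketched in a few lines. These are simplified stand-ins for illustration, not BigQuery's actual on-disk formats:

```python
def dictionary_encode(values):
    """Map repeated strings to small integer codes; effective when a
    top-1K set of artists covers most plays."""
    dictionary, codes = {}, []
    for v in values:
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, codes

def delta_encode(ints):
    """Store the first value plus successive differences; near-sorted
    timestamps yield tiny deltas that compress well."""
    return ints[:1] + [b - a for a, b in zip(ints, ints[1:])]

artists = ["Drake", "Taylor Swift", "Drake", "Drake", "Taylor Swift"]
dictionary, codes = dictionary_encode(artists)
print(codes)  # [0, 1, 0, 0, 1] -- each long string became one small int

listen_times = [1704067200, 1704067205, 1704067207, 1704067260]
print(delta_encode(listen_times))  # [1704067200, 5, 2, 53]
```

Both encodings exploit the uniformity that columnar layout creates: a column of artists has few distinct values, and a column of timestamps is nearly sorted, so each shrinks dramatically once stored on its own.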

The Paper

BigQuery is Google's commercial offering built on Dremel, first described in the 2010 VLDB paper "Dremel: Interactive Analysis of Web-Scale Datasets." The paper introduced interactive querying over web-scale datasets via nested columnar storage and tree-based query execution.

Modern Heirs

The columnar-plus-compression-plus-distribution pattern now powers most of the analytics ecosystem:

  • Google BigQuery. Petabyte-scale data warehouse.

  • Amazon Redshift. Cloud data warehouse.

  • Snowflake. Modern cloud analytics platform.

  • Apache Parquet. Open-source columnar file format.

  • ClickHouse. Real-time analytics database.

Key Takeaways

  1. Theory turns into practice. A 2010 research paper (Dremel) directly enabled a commercial breakthrough (BigQuery).

  2. Columnar plus compression plus distribution equals performance. Each pillar multiplies the others, and together they turn hours into seconds.

  3. Workload-driven design. BigQuery makes opposite trade-offs from PostgreSQL, and that's the point. Pick the storage layout that matches the query pattern.


Next

Hybrid Storage → When you need both OLTP-speed inserts and OLAP-speed analytics, you mix the two layouts instead of picking one.