Case Study 3.1: BigQuery, From Web Crawl to Petabyte Analytics

Concept. BigQuery scales to trillions of rows by storing data in columnar format and decoupling storage from compute, so queries spin up thousands of workers on demand and only read the columns they touch.

Intuition. When Spotify queries a 100-billion-row Listens table for "average rating per genre," BigQuery reads only the rating and genre columns, skipping user_id, listen_time, and song_id. Then it parallelizes the scan across thousands of workers that exist only for the seconds the query runs.
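This read path can be sketched in a few lines of Python. The data is a hypothetical in-memory toy table; column names follow the example above, and real systems would read column files from disk rather than lists:

```python
from collections import defaultdict

# Toy listens table stored column-by-column: each column is its own array,
# so a query reads only the columns it touches. (Hypothetical data.)
columns = {
    "user_id":     [1, 2, 3, 4],
    "song_id":     [10, 11, 10, 12],
    "genre":       ["rock", "jazz", "rock", "jazz"],
    "rating":      [4, 5, 2, 3],
    "listen_time": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
}

def avg_rating_per_genre(cols):
    """Average rating per genre, touching only 2 of the 5 columns."""
    totals, counts = defaultdict(int), defaultdict(int)
    for genre, rating in zip(cols["genre"], cols["rating"]):
        totals[genre] += rating
        counts[genre] += 1
    return {g: totals[g] / counts[g] for g in totals}

print(avg_rating_per_genre(columns))  # {'rock': 3.0, 'jazz': 4.0}
```

Note that user_id, song_id, and listen_time are never accessed; in a row layout, every query would have to step over them.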


The Origin: An Engineer's Foosball Problem

In 2005, Andrey Gubarev, a Google engineer, faced an annoyance. He needed insights from Google's web datasets, but the existing infrastructure was sluggish. Queries dragged on for hours, eating into his foosball time. Frustrated, he dug into the bottlenecks. That investigation laid the groundwork for Dremel, the system that would later ship as BigQuery and change large-scale data analysis.

The Bottleneck: Row Storage on Web Documents

Google's web document storage held data points like:

  • URL

  • Title

  • Date

  • Keywords

  • Body (HTML). The heavyweight, often megabytes per row.

In a row-based system, querying trends by Date required processing the entire row, including the bulky HTML body, even though only one tiny field was needed. Fetching a few bytes (Date) involved dragging along megabytes (Body) per record. At Google scale, that was hours of wasted I/O.
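A back-of-envelope calculation makes the waste concrete. The field sizes below are illustrative assumptions, not Google's actual numbers:

```python
# Back-of-envelope I/O for "trends by Date" on a row store vs a column store.
# Field sizes are illustrative assumptions, not Google's actual numbers.
BODY_BYTES  = 2 * 1024**2   # ~2 MiB of HTML per document
OTHER_BYTES = 300           # URL, title, keywords combined
DATE_BYTES  = 8             # the one field the query actually needs
DOCS        = 1_000_000

row_bytes_per_doc = BODY_BYTES + OTHER_BYTES + DATE_BYTES

row_scan = DOCS * row_bytes_per_doc   # row store: drag the whole row along
col_scan = DOCS * DATE_BYTES          # column store: read only the Date column

print(f"row store:    {row_scan / 1024**3:,.0f} GiB read")
print(f"column store: {col_scan / 1024**2:,.1f} MiB read")
print(f"wasted I/O factor: ~{row_scan // col_scan:,}x")
```

Under these assumptions the row store reads roughly a quarter-million times more bytes than the query needs, which is exactly the gap Dremel set out to close.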

The Solution: Columnar Storage via Dremel

Dremel flipped the layout. Store all URL values together, all Date values together, all Body values together, each column on its own pages.

  1. Targeted reads. Query by Date? Only the Date column is touched. The HTML Body is never read.

  2. Better compression. A column of dates compresses far better than a row of mixed types, because uniform data has uniform patterns.

  3. Tree-based parallel execution. A query is split across thousands of leaf workers; each scans a slice of one column; results bubble up the tree.

The result: queries that took hours on row storage finished in seconds.
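The leaf-then-merge pattern behind pillar 3 can be sketched with a two-level tree, using thread workers as stand-ins for Dremel's leaf servers:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def leaf_scan(chunk):
    """A leaf worker scans its slice of one column, returning a partial aggregate."""
    return Counter(chunk)

def tree_count(column, workers=4):
    """Root: shard the column, fan out to leaves, merge results bubbling up."""
    chunks = [column[i::workers] for i in range(workers)]
    merged = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(leaf_scan, chunks):
            merged += partial  # a real tree merges at intermediate nodes too
    return merged

dates = ["2024-01", "2024-02", "2024-01"] * 1000
assert tree_count(dates) == Counter(dates)  # same answer as a serial scan
```

Because counting is associative, partial counts from any number of leaves merge into the same final answer, which is what lets Dremel fan a query out across thousands of machines.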

Real Numbers: A Spotify-Scale Query

-- Analytics across 5 billion music listens
SELECT artist, COUNT(*) AS plays
FROM spotify.listens_1tb       -- 1 TB table, 20 columns
WHERE listen_time > '2024-01-01'
GROUP BY artist
ORDER BY plays DESC
LIMIT 100;

The query touches only 2 of 20 columns (artist and listen_time). Row storage reads every byte of every row; BigQuery scans just those two columns, which are stored compressed on disk:

Approach                                        Bytes read      Time          Cost
Row storage (full table scan, all 20 columns)   1 TB            10+ minutes   baseline
BigQuery (2 columns, compressed, distributed)   ~50 GB          ~15 seconds   20× cheaper
Improvement                                     20× less I/O    40× faster
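The ratios in the table follow directly from its own numbers (taking 1 TB as 1000 GB and "10+ minutes" as 600 seconds):

```python
# Deriving the table's improvement figures.
full_scan_gb, bq_scan_gb = 1000, 50   # 1 TB vs ~50 GB read
row_time_s, bq_time_s = 600, 15       # 10 minutes vs ~15 seconds

io_reduction = full_scan_gb / bq_scan_gb  # columnar pruning + compression
wall_speedup = row_time_s / bq_time_s     # pruning + compression + parallelism

print(io_reduction, wall_speedup)  # 20.0 40.0
```

The wall-clock speedup (40×) exceeds the I/O reduction (20×) because distribution adds its own multiplier on top of reading fewer bytes.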

Why It Works: Three Pillars

Columnar layout. Only artist and listen_time files are read; the other 18 columns are skipped.

Compression. Dictionary encoding on artist (top-1K artists cover most plays) plus delta encoding on listen_time shrink the read by another order of magnitude.

Distribution. The scan is sharded across ~100 workers, each processing ~500 MB. Workers exist only for the seconds the query runs.

Together these three turn "scan a petabyte" into "scan a few GB on hundreds of machines in parallel."
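The two encodings named above can be sketched in a few lines. These are simplified stand-ins for illustration, not BigQuery's actual on-disk formats:

```python
def dictionary_encode(values):
    """Map repeated strings to small integer codes; effective when a
    top-1K set of artists covers most plays."""
    dictionary, codes = {}, []
    for v in values:
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, codes

def delta_encode(ints):
    """Store the first value plus successive differences; near-sorted
    timestamps yield tiny deltas that compress well."""
    return ints[:1] + [b - a for a, b in zip(ints, ints[1:])]

artists = ["Drake", "Taylor Swift", "Drake", "Drake", "Taylor Swift"]
dictionary, codes = dictionary_encode(artists)
print(codes)  # [0, 1, 0, 0, 1] -- each long string became one small int

listen_times = [1704067200, 1704067205, 1704067207, 1704067260]
print(delta_encode(listen_times))  # [1704067200, 5, 2, 53]
```

Both encodings exploit the uniformity that columnar layout creates: a column of artists has few distinct values, and a column of timestamps is nearly sorted, so each shrinks dramatically once stored on its own.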

The Paper

BigQuery is Google's commercial offering built on Dremel, first described in the 2010 VLDB paper "Dremel: Interactive Analysis of Web-Scale Datasets." The paper introduced interactive querying over web-scale datasets via nested columnar storage and tree-based query execution.

Modern Heirs

The columnar-plus-compression-plus-distribution pattern now powers most of the analytics ecosystem:

  • Google BigQuery. Petabyte-scale data warehouse.

  • Amazon Redshift. Cloud data warehouse.

  • Snowflake. Modern cloud analytics platform.

  • Apache Parquet. Open-source columnar file format.

  • ClickHouse. Real-time analytics database.

Key Takeaways

  1. Theory turns into practice. A 2010 research paper (Dremel) directly enabled a commercial breakthrough (BigQuery).

  2. Columnar plus compression plus distribution equals performance. Each pillar multiplies the others, and together they turn hours into seconds.

  3. Workload-driven design. BigQuery makes opposite trade-offs from PostgreSQL, and that's the point. Pick the storage layout that matches the query pattern.


Next

Hybrid Storage → When you need both OLTP-speed inserts and OLAP-speed analytics, you mix the two layouts instead of picking one.