Index Fundamentals

Concept. An index is an auxiliary data structure that maps a key (e.g. artist) to the page where matching rows live, turning a full-table scan into a one-IO lookup.

Intuition. Without an index, WHERE artist = 'Beatles' scans all 160,000 pages of the 10 TB Songs table. With an index on artist, the database looks up "Beatles lives on page 47," reads page 47, and returns the matching songs. Same answer, 80,000× fewer IOs.

This page defines what every index stores. The next four pages each show a different way of organizing those (key, page#) pairs: hash indexes for equality, B+Trees for ranges, SSTables for immutable writes, and LSM trees for write-heavy workloads.

160,000

Pages in the Songs table

2

IOs with an index

80,000×

Speedup for equality queries


What an Index Stores

Data and index live on disk in the same 64 MB pages but hold very different things. A data page holds whole rows: a Songs row with its audio and metadata is roughly 10 MB, so 1 M songs is 10 TB, about 160,000 pages. An index page holds only (search_key, page#) pairs of 8 to 20 bytes, so 1 M of them is 20 MB, a single page. The pointer, the Page 47 half of each pair, is the compact coordinate that jumps from the index straight to the row.

Same data, two footprints. The Songs table is 10 TB spread over 160,000 pages; the artist index is 20 MB on a single page, 500,000 times smaller.

Figure 1. The same information in two footprints: the Songs table stores whole 10 MB rows (10 TB, 160,000 pages), the artist index stores only 20-byte pointers (20 MB, one page), 500,000 times smaller. That gap is the whole point. WHERE artist = 'Beatles' without an index scans all 160,000 pages; with one the database reads a single index page to learn Beatles is on page 47, then reads page 47: two IOs instead of 160,000, 80,000 times fewer for the same answer.


When to Add an Index, and When to Skip It

An index is a tradeoff: faster reads in exchange for extra storage and slower writes. It pays off on columns the query engine actually uses, and that narrow the result set down to a few rows.

Good candidates

  • Primary keys. Indexed automatically in most databases.
  • Foreign keys. Joins walk these constantly.
  • High-selectivity columns with many unique values, like email or user_id.
  • Frequently filtered columns that WHERE clauses target.

Poor candidates

  • Rarely queried columns. You pay the cost without getting the benefit.
  • Low-selectivity columns such as a boolean, where the index points at half the table.
  • Frequently updated columns. Every write rewrites the index too.
  • Small tables (under about 1,000 rows), where a scan is already fast.

Next

Hash Indexes → Constant-time lookup for equality queries. The simplest way to organize (key, page#) pairs.