B+Tree Indexes
Concept. A B+Tree index keeps keys sorted in a balanced tree, supporting O(log N) point lookups and range scans (WHERE rating BETWEEN 4.0 AND 5.0) by walking the leaf pages in order.
Intuition. When you query "find Listens with rating between 4.0 and 5.0," a B+Tree on rating walks down the tree to the first 4.0, then reads leaf pages in order until it passes 5.0. That is one descent followed by a scan, not a tree walk per row.
Prerequisites. Index Fundamentals introduces (key, page#) pairs and pointers.
What a B+Tree Stores
A B+Tree sits on top of the table's data pages. Three layers of index nodes (root, internal, leaf) route a lookup down to the right page, and the leaves are linked left-to-right so a range scan can stream across them without going back to the root.
Figure 1. The data pages on disk hold the actual Songs rows; the B+Tree on disk sits on top, its root and internal nodes routing a lookup down to the sorted, left-linked leaves that store (key, data-page pointer) pairs. Tracing release_date = 12890 ("As It Was"): the root sends it into 5000..14999, the internal node into 11234..13788, the leaf resolves 12890 → Page 3, and reading Page 3 finds the row, four page reads total.
Fanout here is only 4 for clarity (3 keys plus 4 child pointers per node). Real databases use fanout in the thousands to millions depending on page size, so with fanout about 1,000 a billion-key table is only three levels deep: height = ⌈log_fanout(N)⌉.
Every layer is one IO. A billion-row table with fanout around 1,000 fits in three levels (root, internal, leaf). Add one more IO to read the data page itself, and a point lookup is roughly 4 IOs regardless of table size.
Walking a Range Query
The B+Tree finds the starting leaf in O(log N) IOs, then streams across leaves until it passes the end key. The trace below walks 12000 ≤ release_date ≤ 14000.
Figure 2. The same B+Tree answering the range query 12000 ≤ release_date ≤ 14000: the query visits the solid green nodes, skips the dashed grey ones, and fetches only the data pages with matching rows (Page 3 and Page 4). The descent runs root → middle internal → Leaf 3, then walks the linked leaves (Leaf 3 yields 12890 and 13456, Leaf 4 yields 13789 and 13980), and the peek at Leaf 5 stops the scan because its first key 14456 exceeds 14000. That is 5 index IOs plus 2 data-page IOs, 7 total, versus roughly 160,000 for a full scan, about 23,000× faster.
The result is mostly sequential IO once the descent reaches the starting leaf, which is what makes range scans cheap on a B+Tree.
Why B+Trees Are Shallow: Fanout
Fanout is how many children an internal node has, and it's set by how many (key, pointer) pairs fit on one page. High fanout means a shallow tree, and a shallow tree means few IOs per lookup.
Fanout = ⌊page size / (key size + pointer size)⌋
Height = ⌈log_fanout(number of keys)⌉
A 4 KB page with 4-byte keys and 4-byte pointers gives a fanout of 512. An 8 KB page gives roughly 1,024. At fanout 1,000, a billion keys need only three levels. That is why B+Tree lookups feel constant-time even as tables grow.
How Page and Key Size Drive Tree Height
One trillion keys stored four different ways. Read down the Page-size column to see the effect of moving from a 64 KB to a 64 MB page.
| Page size | Key size | Fanout | Height | IOs / lookup |
|---|---|---|---|---|
| 64 KB | 8 B | 4,096 | 4 | 4 |
| 64 KB | 1 KB | 63 | 7 | 7 |
| 64 MB | 8 B | 4,194,304 | 2 | 2 |
| 64 MB | 1 KB | 65,027 | 3 | 3 |
A trillion-key index with 1 KB search keys goes from 7 IOs on a 64 KB page to just 3 IOs on a 64 MB page. Same tree structure, different arithmetic. The tradeoff is that larger pages use more memory per node, but they give you a far shallower tree. Real systems pick the page size that fits their workload: PostgreSQL ships with 8 KB, MySQL InnoDB with 16 KB, and analytical engines like BigQuery and Snowflake use MB-sized columnar blocks.
What B+Trees Handle Well, and What They Don't
Handles well
WHERE user_id = 42
WHERE rating BETWEEN 4.0 AND 5.0
WHERE name LIKE 'Ali%'
ORDER BY created_at
Equality, ranges, prefix matching, and sorted scans. Anything that benefits from keys being in order.
Doesn't help
WHERE LOWER(name) = 'alice'
WHERE name LIKE '%son'
WHERE age != 30
Functions on the indexed column (the index is on name, not LOWER(name)), suffix matches, and non-equality !=. The tree can't narrow the search space, so it falls back to a scan.
Next
SSTables → The immutable sorted file format that write-heavy stores use instead of a B+Tree on disk.