Case Study 3.2: How does Spotify support text search?
Concept. Spotify supports text search across 100 million tracks using an inverted index, a per-word lookup table that maps each search term to the list of song_ids containing it.
Intuition. Without an index, a search for "Yesterday" would read every row in a 100-million-song Songs table. An inverted index looks up "Yesterday" once and returns a short list of matching song_ids, so the database only reads those few rows.
The Solution: An Inverted Index
An inverted index looks like the index at the back of a textbook. Each term (word) is a key, and its value is a list of locations (track IDs) where that word appears. The figure walks the live "Crazy Heart" example.
Figure 1. Inverted index over the Songs table: a term dictionary on disk (a B+Tree of every unique word) and packed postings lists on disk (the sorted song IDs per word), plus a runtime bitwise-AND intersection done in RAM. For the query "Crazy Heart" the engine looks up each word, fetches their postings ([12, 45, 108, 145, 201, …] and [8, 12, 88, 108, 180, …]), intersects the two sorted lists in memory to {12, 108}, and reads just those 2 data pages, about 10 IOs total. A full-table scan would read all 100 million rows, so the index turns text search into a B+Tree descent plus a sorted-list intersection, roughly 10,000,000× fewer reads.
Why a B+Tree for the Term Dictionary
The caption above lays out the structure and the live example. One detail worth pulling out: the term dictionary is a B+Tree, not a hash index. Users often type prefixes ("Craz" before they finish typing "Crazy"), and B+Trees support both exact and prefix lookups. A hash index would scatter neighbouring terms to random buckets and could not auto-complete.
Beyond Keyword Matching
Inverted indexes get you the candidate songs fast. Real search engines then apply two more layers on top:
-
Query expansion. A search for "happy" also checks related terms like "joyful" and "cheerful", broadening recall without forcing the user to guess synonyms.
-
Ranking. Not all matches are equally relevant. Spotify scores each match on three signals: where the term appeared (title beats tag), overall popularity (a global hit beats a deep cut), and personal history (an artist you listen to often moves up).
Takeaway: An inverted index turns an O(N) row scan into an O(log N) term-dictionary lookup (a B+Tree, for exact and prefix), intersecting postings in RAM so the expensive random IO only fires for the rows you actually need.
Next
Case Study 3.3: Tracking High-Volume Activity → When the workload flips from read-heavy search to write-heavy event ingestion, the index type has to flip with it.