Case Study 3.2: How does Spotify support text search?

Concept. Spotify supports text search across 100 million tracks using an inverted index, a per-word lookup table that maps each search term to the list of song_ids containing it.

Intuition. Without an index, a search for "Yesterday" would read every row in a 100-million-song Songs table. An inverted index looks up "Yesterday" once and returns a short list of matching song_ids, so the database only reads those few rows.

Case Study 3.2 Reading Time: 6 mins

The Solution: An Inverted Index

An inverted index looks like the index at the back of a textbook. Each term (word) is a key, and its value is a list of locations (track IDs) where that word appears. The figure walks the live "Crazy Heart" example.

Figure 1. Inverted index over the Songs table: a term dictionary on disk (a B+Tree of every unique word) and packed postings lists on disk (the sorted song IDs per word), plus a runtime bitwise-AND intersection done in RAM. For the query "Crazy Heart" the engine looks up each word, fetches their postings ([12, 45, 108, 145, 201, …] and [8, 12, 88, 108, 180, …]), intersects the two sorted lists in memory to {12, 108}, and reads just those 2 data pages, about 10 IOs total. A full-table scan would read all 100 million rows, so the index turns text search into a B+Tree descent plus a sorted-list intersection, roughly 10,000,000× fewer reads.

Why a B+Tree for the Term Dictionary

The caption above lays out the structure and the live example. One detail worth pulling out: the term dictionary is a B+Tree, not a hash index. Users often type prefixes ("Craz" before they finish typing "Crazy"), and B+Trees support both exact and prefix lookups. A hash index would scatter neighbouring terms to random buckets and could not auto-complete.

Beyond Keyword Matching

Inverted indexes get you the candidate songs fast. Real search engines then apply two more layers on top:

Query expansion. A search for "happy" also checks related terms like "joyful" and "cheerful", broadening recall without forcing the user to guess synonyms.
Ranking. Not all matches are equally relevant. Spotify scores each match on three signals: where the term appeared (title beats tag), overall popularity (a global hit beats a deep cut), and personal history (an artist you listen to often moves up).

Takeaway: An inverted index turns an O(N) row scan into an O(log N) term-dictionary lookup (a B+Tree, for exact and prefix), intersecting postings in RAM so the expensive random IO only fires for the rows you actually need.

Case Study 3.3: Tracking High-Volume Activity → When the workload flips from read-heavy search to write-heavy event ingestion, the index type has to flip with it.

Case Study 3.2: How does Spotify support text search?

The Solution: An Inverted Index

Why a B+Tree for the Term Dictionary

Beyond Keyword Matching

Next