Case Study 3.2: How does Spotify support search on text?
Concept. Spotify supports text search across 100 million tracks using inverted indexes β a per-word lookup table that maps each search term to the list of song_ids containing it.
Intuition. Without an index, "find songs whose title contains 'Yesterday'" scans 100 million Songs rows. With an inverted index, the database looks up Yesterday once, finds 50,000 matching song_ids, and reads only those β turning 100 million IOs into ~10.
How does Spotify support search on text?
Goal: Learn how to build indices on text data for fast text search.
When you type a song name into Spotify, it doesn't sift through millions of tracks one by one. Instead, it employs an Inverted Index.
The Constraint: $O(N)$ Sequential Scans
A naive sequential scan evaluates search predicates against every row. Executing full-table scans across 100 million metadata records for wildcard strings operates far outside acceptable sub-100ms latency limits.
The Solution: The Inverted Index
An inverted index functions like a textbook index. Each term (word) is a key, and its value is a list of locations (track IDs) where that word appears.

(Click image to view full size)
In our Songs database, suppose there are hundreds of songs, and 3 have βcrazyβ in their titles. An inverted index connects terms like "crazy" to the songs that contain them.
The Physical Layout: Term Dictionary + Postings Lists
To satisfy strict disk I/O budgets, the inverted index must minimize random page fetches:
-
The Term Dictionary (B+Tree): A highly compressed B+Tree containing every unique word across all 100 million songs. We use a B+Tree here because users might type "Craz", and we need fast prefix lookups ($O(1)$ IO).
-
The Postings List (Contiguous Pages): When you find the word "Crazy" in the dictionary, it gives you a pointer to a Postings Listβa contiguously packed list of
song_ids.
The I/O Math of a Multi-Word Search
What happens when a user types "Crazy Heart"?
-
Database looks up "Crazy" in the Term Dictionary (~3 IOs) -> Returns Postings List pointer.
-
Reads "Crazy" Postings List from disk:
[12, 45, 108, 992, 1045, ...](~1 IO) -
Looks up "Heart" in the Term Dictionary (~3 IOs) -> Returns pointer.
-
Reads "Heart" Postings List from disk:
[12, 88, 108, 304, ...](~1 IO)
Instead of loading millions of 10KB song rows from disk, the database performs a Bitwise AND intersection in RAM between the two lists: [12, 108].
It only incurs the massive random IO cost to fetch the full row data for song_ids 12 and 108 to display to the user. Total Cost: ~10 IOs instead of 100,000,000 IOs.
Beyond Basic Search: Ranking & Expansion
Keyword matching is just the start. Spotify uses advanced techniques to refine results:
-
Query Expansion: Search for "happy" and the system might also check for "joyful" or "cheerful" using synonyms, increasing your chances of finding the right track.
-
Ranking: Not all results are equal. Spotify scores results based on:
- Match Quality: Does the term appear in the Title (high score) or just a Tag (lower score)?
- Popularity: Is this a global hit or a niche track?
- History: Have you listened to this artist before?
Takeaway: Inverted indices bypass $O(N)$ sequential table scans via $O(\log N)$ term dictionary lookups. Intersecting contiguous memory mappings bounded by the posting lists in RAM mitigates random disk I/O, guaranteeing strict sub-millisecond retrieval.