Case Study 3.2: How does Spotify support search on text?
Goal: Learn how to build indices on text data for fast text search.
When you type a song name into Spotify, it doesn't sift through millions of tracks one by one. Instead, it employs an Inverted Index.
The Problem: Scanning is Slow
Traditional search methods scan each document for a term. Doing this for 100 million songs is impractical. Checking every track's metadata for the word "Crazy" would waste time and resources.
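To make the cost concrete, here is a minimal sketch of the scanning approach in plain Python. The catalog and the function name scan_search are hypothetical, made up for illustration; nothing here reflects Spotify's actual code.

```python
# Hypothetical sample catalog: track ID -> title (illustrative data only).
tracks = {
    1: "Crazy in Love",
    2: "Let's Go Crazy",
    3: "Crazy Train",
    4: "Purple Rain",
    5: "Halo",
}

def scan_search(tracks, term):
    """Return IDs of tracks whose title contains the term, checking every track."""
    term = term.lower()
    matches = []
    for track_id, title in tracks.items():
        # Every title is split and inspected, even when it cannot match.
        if term in title.lower().split():
            matches.append(track_id)
    return matches

print(scan_search(tracks, "crazy"))  # [1, 2, 3]
```

The loop touches all five titles to find three matches. With 100 million tracks, every query would repeat that full pass.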
The Solution: The Inverted Index
An inverted index functions like a textbook index. Each term (word) is a key, and its value is a list of locations (track IDs) where that word appears.
In our Songs database, suppose there are hundreds of songs, and 3 have "crazy" in their titles. An inverted index connects terms like "crazy" to the songs that contain them.
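The idea can be sketched with an ordinary Python dictionary before reaching for a library. The sample titles, track IDs, and the helper names build_inverted_index and lookup are assumptions for illustration, and splitting on whitespace is a simplification of real tokenization.

```python
from collections import defaultdict

# Hypothetical sample catalog: track ID -> title (illustrative data only).
tracks = {
    1: "Crazy in Love",
    2: "Let's Go Crazy",
    3: "Crazy Train",
    4: "Purple Rain",
}

def build_inverted_index(tracks):
    """Map each lowercased word to the set of track IDs whose title contains it."""
    index = defaultdict(set)
    for track_id, title in tracks.items():
        for term in title.lower().split():
            index[term].add(track_id)
    return index

def lookup(index, term):
    """Return the sorted track IDs for a term, or an empty list if it is unknown."""
    return sorted(index.get(term.lower(), set()))

index = build_inverted_index(tracks)
print(lookup(index, "crazy"))  # [1, 2, 3]
print(lookup(index, "rain"))   # [4]
```

The index is built once, up front. After that, each query is a single dictionary lookup and never revisits the titles.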
Python implementation: Search Engine with Whoosh
Here, we use the Whoosh library to create an inverted index with a single field called "content".
Part 1: Building the Index
First, define the schema and add song titles as documents.
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser
import os
# Define the schema: a single text field whose original value is stored
schema = Schema(content=TEXT(stored=True))
# Create the index directory if it does not exist yet
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
# Create the index
ix = create_in("indexdir", schema)
# Add documents
writer = ix.writer()
writer.add_document(content=u"Crazy in Love")
writer.add_document(content=u"Let's Go Crazy")
writer.add_document(content=u"Crazy Train")
writer.commit()
Part 2: Searching the Index
Once the index is built, perform efficient lookups using a QueryParser.
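from whoosh.index import open_dir
# Reopen the index created in Part 1
ix = open_dir("indexdir")
# Parse the query and search the "content" field
with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("crazy")
    results = searcher.search(query)
    # Print the stored title of each matching document
    for result in results:
        print(result['content'])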
Beyond Basic Search: Ranking & Expansion
Keyword matching is just the start. Spotify uses advanced techniques to refine results:
- Query Expansion: Search for "happy" and the system might also check for "joyful" or "cheerful" using synonyms, increasing your chances of finding the right track.
- Ranking: Not all results are equal. Spotify scores results based on:
  - Match Quality: Does the term appear in the Title (high score) or just a Tag (lower score)?
  - Popularity: Is this a global hit or a niche track?
  - History: Have you listened to this artist before?
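The two ideas above can be sketched together in a few lines of Python. Everything in this example is an assumption made for illustration: the synonym table, the sample tracks and artist names, the weights, and the function names. It shows the shape of the technique and does not represent Spotify's actual scoring.

```python
# Hypothetical synonym table used for query expansion.
SYNONYMS = {"happy": ["joyful", "cheerful"]}

# Hypothetical weights: a title match outranks a tag match.
TITLE_WEIGHT = 2.0
TAG_WEIGHT = 1.0
HISTORY_BONUS = 1.0

# Hypothetical catalog; popularity is a made-up value between 0 and 1.
tracks = [
    {"title": "Happy Days", "tags": ["pop"], "popularity": 0.9, "artist": "Artist A"},
    {"title": "Joyful Noise", "tags": ["gospel"], "popularity": 0.4, "artist": "Artist B"},
    {"title": "Morning Run", "tags": ["cheerful"], "popularity": 0.7, "artist": "Artist C"},
    {"title": "Night Drive", "tags": ["moody"], "popularity": 0.8, "artist": "Artist D"},
]

def expand_query(term):
    """Return the term followed by any synonyms listed for it."""
    term = term.lower()
    return [term] + SYNONYMS.get(term, [])

def score(track, terms, listened_artists):
    """Score one track; 0.0 means none of the terms matched."""
    title_words = track["title"].lower().split()
    tags = [tag.lower() for tag in track["tags"]]
    if any(term in title_words for term in terms):
        total = TITLE_WEIGHT          # match quality: title
    elif any(term in tags for term in terms):
        total = TAG_WEIGHT            # match quality: tag only
    else:
        return 0.0
    total += track["popularity"]      # popularity
    if track["artist"] in listened_artists:
        total += HISTORY_BONUS        # listening history
    return total

def search(tracks, term, listened_artists=frozenset()):
    """Return matching titles, highest score first."""
    terms = expand_query(term)
    scored = [(score(track, terms, listened_artists), track) for track in tracks]
    matched = [pair for pair in scored if pair[0] > 0]
    matched.sort(key=lambda pair: pair[0], reverse=True)
    return [track["title"] for _, track in matched]

print(search(tracks, "happy"))
print(search(tracks, "happy", listened_artists={"Artist B"}))
```

In the first call, "Joyful Noise" and "Morning Run" are found only because the query was expanded, and the title matches rank above the tag match. In the second call, the history bonus moves "Joyful Noise" to the top.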
Takeaway: Inverted indices turn a scan over every track into a dictionary lookup on the search term, so a query goes straight to the matching songs instead of checking millions of titles one by one.