Case Study 3.2: How does Spotify support search on text?

Case Study 3.2 Reading Time: 6 mins


Goal: Learn how to build indices on text data for fast text search.

When you type a song name into Spotify, it doesn't sift through millions of tracks one by one. Instead, it employs an Inverted Index.

The Problem: Scanning is Slow

Traditional search methods scan each document for a term. Doing this for 100 million songs is impractical. Checking every track's metadata for the word "Crazy" would waste time and resources.
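To make the cost concrete, here is a minimal sketch of the scanning approach. The function name `linear_scan` and the extra title "Yesterday" are illustrative assumptions, not part of any real system; the point is only that every title is inspected on every query.

```python
def linear_scan(titles, term):
    """Return every title containing `term` by checking each title in turn."""
    term = term.lower()
    matches = []
    for title in titles:
        # Every title is split and compared, whether or not it matches.
        if term in title.lower().split():
            matches.append(title)
    return matches


# Hypothetical mini-catalog for illustration.
titles = ["Crazy in Love", "Let's Go Crazy", "Crazy Train", "Yesterday"]
print(linear_scan(titles, "crazy"))
```

The work grows with the size of the catalog: with 100 million titles, each query repeats 100 million checks.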


The Solution: The Inverted Index

An inverted index functions like a textbook index. Each term (word) is a key, and its value is a list of locations (track IDs) where that word appears.

Inverted Index Diagram


In our Songs database, suppose there are hundreds of songs, and three have "crazy" in their titles. An inverted index connects terms like "crazy" to the songs that contain them.
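Before turning to a library, the idea can be sketched with a plain Python dictionary. This is a simplified illustration, not how a production index is stored: the track IDs, the helper names `build_inverted_index` and `lookup`, and the whitespace tokenizer are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(songs):
    """Map each lowercase word to the set of track IDs whose title contains it."""
    index = defaultdict(set)
    for track_id, title in songs.items():
        for word in title.lower().split():
            index[word].add(track_id)
    return index


def lookup(index, term):
    """Return the track IDs for `term`, or an empty set if the term is unknown."""
    return index.get(term.lower(), set())


# Hypothetical track IDs and titles for illustration.
songs = {1: "Crazy in Love", 2: "Let's Go Crazy", 3: "Crazy Train", 4: "Yesterday"}
index = build_inverted_index(songs)
print(sorted(lookup(index, "crazy")))
```

The index is built once, up front. After that, a search for "crazy" is a single dictionary lookup that returns the matching track IDs without touching the other titles.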


Python implementation: Search Engine with Whoosh

Here, we use the Whoosh library to create an inverted index with a single field called "content".

Part 1: Building the Index

First, define the schema and add song titles as documents.

from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser
import os

# Define schema: one full-text field, stored so search results can return it
schema = Schema(content=TEXT(stored=True))

# Create index directory
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")

# Create index
ix = create_in("indexdir", schema)

# Add documents
writer = ix.writer()
writer.add_document(content=u"Crazy in Love")
writer.add_document(content=u"Let's Go Crazy")
writer.add_document(content=u"Crazy Train")
writer.commit()

Part 2: Searching the Index

Once the index is built, perform efficient lookups using a QueryParser.

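from whoosh.index import open_dir

# Open the index stored in "indexdir"
ix = open_dir("indexdir")

# Parse the query, search the "content" field, and print each matching title
with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("crazy")
    results = searcher.search(query)
    for result in results:
        print(result['content'])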

Beyond Basic Search: Ranking & Expansion

Keyword matching is just the start. Spotify uses advanced techniques to refine results:

  1. Query Expansion: Search for "happy" and the system might also check for "joyful" or "cheerful" using synonyms, increasing your chances of finding the right track.

  2. Ranking: Not all results are equal. Spotify scores results based on:

    • Match Quality: Does the term appear in the Title (high score) or just a Tag (lower score)?
    • Popularity: Is this a global hit or a niche track?
    • History: Have you listened to this artist before?
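The two ideas above can be sketched together in a few lines. This is a toy model under stated assumptions, not Spotify's actual scoring: the synonym table, the field weights, the 0.5 history bonus, and the sample tracks are all hypothetical values chosen for illustration.

```python
# Hypothetical synonym table and weights, for illustration only.
SYNONYMS = {"happy": ["joyful", "cheerful"]}
FIELD_WEIGHTS = {"title": 2.0, "tags": 1.0}  # a title match outranks a tag match
HISTORY_BONUS = 0.5


def expand_query(term):
    """Return the term plus any known synonyms."""
    term = term.lower()
    return [term] + SYNONYMS.get(term, [])


def score(track, terms, followed_artists):
    """Combine match quality, popularity, and listening history into one score."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = track[field].lower().split()
        if any(t in words for t in terms):
            total += weight
    if total == 0.0:
        return 0.0  # no text match, so the track is not a result
    total += track["popularity"]  # assumed to be normalized to the range 0..1
    if track["artist"] in followed_artists:
        total += HISTORY_BONUS
    return total


def search(tracks, term, followed_artists=frozenset()):
    """Return matching titles, highest score first."""
    terms = expand_query(term)
    scored = [(score(t, terms, followed_artists), t["title"]) for t in tracks]
    hits = [(s, title) for s, title in scored if s > 0]
    return [title for s, title in sorted(hits, key=lambda pair: -pair[0])]


# Hypothetical catalog entries.
tracks = [
    {"title": "Happy", "tags": "pop upbeat", "artist": "Artist A", "popularity": 0.9},
    {"title": "Joyful Noise", "tags": "gospel", "artist": "Artist B", "popularity": 0.5},
    {"title": "Sunday Morning", "tags": "cheerful acoustic", "artist": "Artist C", "popularity": 0.4},
    {"title": "Crazy Train", "tags": "metal", "artist": "Artist D", "popularity": 0.8},
]
print(search(tracks, "happy"))
print(search(tracks, "happy", followed_artists={"Artist B"}))
```

In this sketch, a search for "happy" also finds "Joyful Noise" and the "cheerful" tag through expansion, and the second call shows how listening history can lift a track above a more popular one.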

Takeaway: Inverted indices turn a scan over millions of songs into a dictionary lookup on the search term, so the work per query depends on the number of matching tracks rather than on the size of the catalog.