Index Problem Solving: Design Patterns

Interview Gold: Index Selection Questions

One line: When you see "optimize queries," "scale writes," or "range scans" in a problem, think about the right index type.

Hash
O(1) exact match
B+Tree
Range queries
LSM
Write optimization
Hybrid
Mix and match

12 Interview Problems: Organized by Pattern

Category 1: Basic Lookups - Single & Multiple Index Strategies

1M Scale

🔐 User Authentication Service

Problem: Login service handling 1M authentication requests/second with email/username lookups
Index Type: Hash Index
Design: Hash index on email and username • O(1) exact match lookups • Shard by email domain
    Performance: 1-2 IOs per lookup, scales horizontally
    Why: Login is always exact match (email = 'user@example.com'), never ranges. Hash beats B+Tree's 3-4 IOs.

100M Scale

⭐ Song Rating Lookup

Problem: Spotify needs fast lookups: "Did user X already rate song Y?" Need exact (user_id, song_id) → rating lookup for 100M ratings
Index Type: Composite B+Tree vs Concatenated Hash
Design: Option 1: B+Tree on (user_id, song_id) - allows "all ratings by user" queries • Option 2: Hash on concat(user_id + "_" + song_id) - faster exact match only
    Performance: B+Tree: 3-4 IOs but flexible • Hash: 1-2 IOs but only exact match
    Why: If you only need "did they rate it?", use hash. If you need "show all user's ratings" or "top-rated songs", use B+Tree for range support.

10M Scale

🛒 E-commerce Price Filtering

Problem: Filter products by price ranges ($50-$200, Under $100) with sorting support
Index Type: B+Tree Index
Design: B+Tree on price column • Compound index on (category, price) • Sorted leaves for ORDER BY
    Performance: 3-4 IOs + sequential scan for ranges
    Why: Hash cannot do BETWEEN or < > operators. B+Tree's sorted structure enables range queries and ORDER BY without separate sort.

50M Scale

📊 Song Analytics Dashboard

Problem: Query 50M songs by: song_id (primary key), release_date (time ranges), popularity_score (top charts), genre (exact match). Can't sort data multiple ways!
Index Type: Multiple Indexes (Primary + Secondary)
Design: Primary B+Tree clustered on song_id (data sorted by ID) • Secondary B+Tree on release_date (pointers to primary) • Secondary B+Tree on popularity_score DESC (for TOP N) • Hash on genre (exact match only)
    Performance: Primary: 3 IOs • Secondary: 4-5 IOs (extra hop through primary) • Genre: 1-2 IOs via hash
    Why: Data physically sorted only one way (by song_id). Secondary indexes store (key, pointer) pairs pointing back to primary. Trade-off: extra IO hop vs maintaining multiple sorted copies.

10M Users

🔄 User Activity Tracking

Problem: Track 10M user sessions. Queries: session_id (current session), user_id + timestamp (history), last_active (find dormant), "has user ever logged in?"
Index Type: Mixed Index Types for Different Patterns
Design: Hash on session_id (O(1) current session) • Composite B+Tree on (user_id, timestamp) for timelines • B+Tree on last_active for inactivity queries • Bloom filter for "exists" checks
    Performance: Session: 1-2 IOs • History: 3-4 IOs + scan • Dormant: range scan • Exists: 0 IOs (memory bloom filter)
    Why: Match index to query pattern. Hash for point lookups, B+Tree for ranges/sorting, Bloom for existence. One size doesn't fit all!

Category 2: Search & Discovery - Text, Images, and Feeds

100M Scale

🔍 Artist Search & Autocomplete

Problem: Search 100M songs by artist name. Support: exact match ("Drake"), prefix ("Dra*"), fuzzy match ("Drak" → "Drake")
Index Type: Inverted Index + B+Tree
Design: Inverted index maps tokens → song_ids • B+Tree on artist names for prefix search • N-gram index (3-grams: "Drake" → ["Dra", "rak", "ake"]) for fuzzy matching • Cache popular searches in hash table
    Performance: Exact: 2-3 IOs via inverted index • Prefix: B+Tree range scan • Fuzzy: n-gram overlap scoring (e.g., "Drak" and "Drake" share 2 of 3 3-grams)
    Why: Text search needs specialized structures. Hash can't do prefix/fuzzy. B+Tree alone can't efficiently find "songs with 'love' in title".

1B+ Scale

📷 Similar Image Search

Problem: Pinterest-style "find similar images" from 1B+ images. Also detect exact duplicates before storing new uploads
Index Type: LSH + Hash Index (Hybrid)
Design: LSH (Locality Sensitive Hashing) for similarity: maps images to buckets where similar images collide • Hash index on image fingerprints for exact duplicate detection • Two-stage: Check hash for duplicates, then LSH for similar
    Performance: Exact duplicate: O(1) via hash • Similar images: O(k) where k = average bucket size in LSH
    Why: High-dimensional (pixels/features) data. B+Tree can't handle 1000+ dimensions. LSH reduces dimensionality while preserving similarity.

600M Scale

📸 Instagram Feed

Problem: Show each user's Instagram feed (posts from accounts they follow). Cristiano Ronaldo has 600M followers - can't write to 600M feeds when he posts.
Index Type: B+Tree + Hash (Hybrid Fanout)
Design: Two tables: posts (B+Tree on timestamp) and timelines (B+Tree on user_id, timestamp) • Celebrities: Query posts table on-demand • Regular users: Pre-populate timelines table
    Performance: Celebrity: JOIN query with B+Tree. Regular: Single B+Tree lookup of pre-computed timeline.
    Why: Can't fanout Ronaldo's post to 600M timeline rows. When you open Instagram, it JOINs celebrity posts with your follows list using B+Tree indexes.

Category 3: High-Volume Streaming & Updates - Write-Optimized Systems

1M/sec Scale

🎵 Real-time Event Stream

Problem: Spotify logging 1M song plays/second (user_id, song_id, timestamp, duration)
Index Type: LSM Tree
Design: LSM with size-tiered compaction • Time-based partitioning • Secondary LSM indexes on user_id
    Performance: 0 IOs for writes (memory), K×2 IOs for reads
    Why: Writes go to memory first. B+Tree would need 4M IOPS vs LSM's batched flushes. Recent data stays in memory for fast queries.

100K Scale

🌡️ IoT Time-Series Data

Problem: 100K sensors sending temperature readings/minute, query by sensor_id and time range
Index Type: LSM Tree + Bloom Filters
Design: LSM with leveled compaction • (sensor_id, timestamp) primary key • Monthly partitions • Bloom filters per SSTable
    Performance: 0 write IOs, efficient range scans with bloom filter pruning
    Why: Append-only pattern perfect for LSM. Bloom filters eliminate 99% of unnecessary SSTable checks. RLE compression for repeated values.

10M Scale

📍 User Location Tracking

Problem: Uber tracks driver locations every 5 seconds. Only need most recent location per driver, not history.
Index Type: LSM with Keep-Latest Merge
Design: LSM tree with custom merge: keep only newest timestamp per driver_id • Writes to memory • During compaction, discard old locations
    Performance: 0 write IOs, compaction reduces data 100x by keeping only latest
    Why: LSM merge function can implement "keep-latest" logic. B+Tree would need expensive updates. Old locations automatically garbage collected during compaction.

1K Users

🎵 Collaborative Playlist

Problem: Spotify collaborative playlist shared by 1000 friends for a party. Everyone adding songs simultaneously causes lock contention.
Index Type: LSM with Append Operations
Design: Each "add song" is just appended to LSM log • No locking needed • Merge during read combines all additions • Duplicates handled by merge function
    Performance: 1000 concurrent adds without blocking. B+Tree would be expensive.
    Why: LSM just logs "user X added song Y" - no conflicts. Merge creates final playlist on read.

Hybrid Systems: Real Databases Mix Indexes

Production systems use multiple index types:

B+Tree for primary data • LSM for write-ahead logs • Hash for caches

PostgreSQL: B+Tree + Hash + GiST • Cassandra: LSM + Bloom filters • MongoDB: B+Tree for data, LSM for oplog


Interview Cheat Sheet: Patterns

If You See... Think... Because...
"Check if exists" Bloom filter + Hash Negative lookup optimization
"Geographic queries" R-Tree/GiST Spatial indexing
"Full-text search" Inverted index Text tokenization
"Prefix/Autocomplete" B+Tree range scan Sorted prefix matching
"Similar images/vectors" LSH (Locality Sensitive) High-dimensional similarity
"Exact duplicates" Hash fingerprints O(1) duplicate detection
"Graph relationships" Graph index Connected data traversal
"Fuzzy matching" N-gram index (2,3,4-grams) Breaks text into overlapping chunks for fuzzy match
"Composite key lookup" B+Tree vs Hash Trade flexibility for speed

Key Takeaways

  1. Consider the whole system - Cache layers, partitioning, and replication matter as much as index choice

  2. Measure, don't guess - Profile actual workloads to choose indexes (theory ≠ practice)