Index Problem Solving: Design Patterns
When you see "optimize queries," "scale writes," or "range scans" in a problem, think about the right index type.
Problem 1: User Lookup Service (1M Scale)
Problem: User lookup service handling 1M user requests/second with email/username lookups. Make it fast.
🔍 Show Solution
-
Concept: Hash Index
-
Design: Hash index on email and username • O(1) exact match lookups
-
Rough Performance: 1-2 IOs per lookup, scales horizontally
-
The "Why": Login is always exact match (email = 'user@example.com'), never ranges. Hash beats B+Tree's 3-4 IOs.
Problem 2: Song Rating Lookup (100M Scale)
Problem: Spotify needs fast lookups: "Did user X already rate song Y?" Need exact (user_id, song_id) → rating lookup for 100M ratings. What's a good design?
🔍 Show Solution
-
Concept: Composite B+Tree vs Concatenated Hash
-
Design: Option 1: B+Tree on (user_id, song_id) - allows "all ratings by user" queries • Option 2: Hash on concat(user_id + "_" + song_id) - faster exact match only
-
Rough Performance: B+Tree: 3-4 IOs but flexible • Hash: 1-2 IOs but only exact match
-
The "Why": If you only need "did they rate it?", use hash. Maybe combine with bloom filter for existence checks.
Problem 3: E-commerce Price Filtering (10M Scale)
Problem: Filter products by price ranges ($50-$200, Under $100). Make it fast.
🔍 Show Solution
-
Concept: B+Tree Index
-
Design: B+Tree on price column • Compound index on (category, price)
-
Rough Performance: 3-4 IOs + sequential scan for ranges
-
The "Why": Hash cannot do BETWEEN or < > operators. B+Tree's sorted structure enables range queries and ORDER BY without separate sort.
Problem 4: Song Analytics Dashboard (50M Scale)
Problem: Query 50M songs by: song_id (primary key), release_date (time ranges), popularity_score (top charts), genre (exact match). Can't sort data multiple ways! What's a good design?
🔍 Show Solution
-
Concept: Combine Multiple Indexes (Primary + Secondary)
-
Design: Primary B+Tree clustered on song_id (data sorted by ID) • Secondary B+Tree on release_date (pointers to primary) • (Another) Secondary B+Tree on popularity_score DESC (for TOP N) • Hash on genre (exact match only)
-
Rough Performance: Primary: 3 IOs • Secondary: 4-5 IOs (extra hop through primary) • Genre: 1-2 IOs via hash
-
The "Why": Data physically sorted only one way (by song_id). Secondary indexes store (key, pointer) pairs pointing back to primary. Trade-off: extra IO hop vs maintaining multiple sorted copies.
Problem 5: User Activity Tracking (10M Scale)
Problem: Track 10M user sessions. Queries: session_id (current session), user_id + timestamp (history), last_active (find dormant), "has user ever logged in?". Make it fast.
🔍 Show Solution
-
Concept: Mixed Index Types for Different Patterns
-
Design: Hash on session_id (O(1) current session) • Composite B+Tree on (user_id, timestamp) for timelines • B+Tree on last_active for inactivity queries • Bloom filter for "exists" checks
-
Rough Performance: Session: 1-2 IOs • History: 3-4 IOs + scan • Dormant: range scan • Exists: 0 IOs (memory bloom filter)
-
The "Why": Match index to query pattern. Hash for point lookups, B+Tree for ranges/sorting, Bloom for existence.
Problem 6: Similar Image Search (1B+ Scale)
Problem: Use case 1: Pinterest-style "find similar images" from 1B+ images. Use case 2: detect exact duplicates before storing new uploads. Make both fast.
🔍 Show Solution
-
Concept: LSH + Hash Index (Hybrid)
-
Design: LSH (Locality Sensitive Hashing) for similarity: maps images to buckets where similar images collide • Hash index on image fingerprints for exact duplicate detection • Two-stage: Check hash for duplicates, then LSH for similar
-
Rough Performance: Exact duplicate: O(1) via hash • Similar images: O(k) where k = average bucket size in LSH
-
The "Why": High-dimensional (pixels/features) data. LSH reduces dimensionality while preserving similarity. Keeping hash index for exact duplicates for speedup for use case 2.
Problem 7: Instagram Feed (600M Scale)
Problem: Show each user's Instagram feed (posts from accounts they follow). Cristiano Ronaldo has 600M followers - can't write to 600M feeds when he posts.
🔍 Show Solution
-
Concept: B+Tree + Hash (Hybrid Fanout)
-
Design: Two tables: posts (B+Tree on timestamp) and timelines (B+Tree on user_id, timestamp) • Celebrities: Query posts table on-demand • Regular users: Pre-populate timelines table
-
Rough Performance: Celebrity: JOIN query with B+Tree. Regular: Single B+Tree lookup of pre-computed timeline.
-
The "Why": Can't fanout Ronaldo's post to 600M timeline rows. When you open Instagram, it JOINs celebrity posts with your follows list using B+Tree indexes.
Problem 8: Real-time Event Stream (1M/sec Scale)
Problem: Spotify logging 1M song plays/second (user_id, song_id, timestamp, duration).
🔍 Show Solution
-
Concept: LSM Tree
-
Design: LSM with size-tiered compaction • Time-based partitioning • Secondary LSM indexes on user_id
-
Rough Performance: 0 IOs for writes (memory), K×2 IOs for reads
-
The "Why": Writes go to memory first. B+Tree would need 4M IOPS vs LSM's batched flushes. Recent data stays in memory for fast queries.
Problem 9: IoT Time-Series Data (100K Scale)
Problem: 100K sensors sending temperature readings/minute, query by sensor_id and time range.
🔍 Show Solution
-
Concept: LSM Tree + Bloom Filters
-
Design: LSM with leveled compaction • (sensor_id, timestamp) primary key • Monthly partitions • Bloom filters per SSTable
-
Rough Performance: 0 write IOs, efficient range scans with bloom filter pruning
-
The "Why": Append-only pattern perfect for LSM. Bloom filters eliminate 99% of unnecessary SSTable checks. RLE compression for repeated values.
Problem 10: User Location Tracking (10M Scale)
Problem: Uber tracks driver locations every 5 seconds. Only need most recent location per driver, not history.
🔍 Show Solution
-
Concept: LSM with Keep-Latest Merge
-
Design: LSM tree with custom merge: keep only newest timestamp per driver_id • Writes to memory • During compaction, discard old locations
-
Rough Performance: 0 write IOs, compaction reduces data 100x by keeping only latest
-
The "Why": LSM merge function can implement "keep-latest" logic. B+Tree would need expensive updates. Old locations automatically garbage collected during compaction.