Index Problem Solving: Design Patterns

When tasked with "optimize queries," "scale writes," or "range scans," the right index type isn't just a technical choice—it's a business decision.

Hash

O(1) exact match

B+Tree

Range queries

LSM

Write optimization

Hybrid

Mix and match

Problem 1: User Lookup Service (1M Scale)

Problem: Handle 1M user requests per second with email/username lookups. Speed is non-negotiable.

🔍 Show Solution

Concept: Hash Index
Design: Hash index on email and username for O(1) exact match lookups
Rough Performance: 1-2 IOs per lookup, scales horizontally
The "Why": Logins are exact matches (email = 'user@example.com'). Hash index outperforms B+Tree's 3-4 IOs for this use case.

Problem 2: Song Rating Lookup (100M Scale)

Problem: Spotify needs to know if "User X already rated Song Y?" for 100M ratings. Efficiency is key.

🔍 Show Solution

Concept: Composite B+Tree vs Concatenated Hash
Design: Option 1: B+Tree on (user_id, song_id) for flexibility • Option 2: Hash on concat(user_id + "_" + song_id) for speed
Rough Performance: B+Tree: 3-4 IOs but flexible • Hash: 1-2 IOs for exact match
The "Why": If the sole need is "did they rate it?", go with hash. Consider a bloom filter for existence checks.

Problem 3: E-commerce Price Filtering (10M Scale)

Problem: Filter products by price ranges swiftly.

🔍 Show Solution

Concept: B+Tree Index
Design: B+Tree on price column • Compound index on (category, price)
Rough Performance: 3-4 IOs + sequential scan for ranges
The "Why": Hash can't handle BETWEEN or < > operators. B+Tree's sorted nature is ideal for range queries and ORDER BY operations.

Problem 4: Song Analytics Dashboard (50M Scale)

Problem: Query 50M songs by various criteria. Sorting data multiple ways isn't feasible.

🔍 Show Solution

Concept: Combine Multiple Indexes (Primary + Secondary)
Design: Primary B+Tree clustered on song_id • Secondary B+Tree on release_date • Another Secondary B+Tree on popularity_score DESC • Hash on genre
Rough Performance: Primary: 3 IOs • Secondary: 4-5 IOs • Genre: 1-2 IOs via hash
The "Why": Data is physically sorted by song_id. Secondary indexes point back to primary. The trade-off is an extra IO hop versus maintaining multiple sorted copies.

Problem 5: User Activity Tracking (10M Scale)

Problem: Track 10M user sessions with various query patterns.

🔍 Show Solution

Concept: Mixed Index Types for Different Patterns
Design: Hash on session_id • Composite B+Tree on (user_id, timestamp) • B+Tree on last_active • Bloom filter for "exists" checks
Rough Performance: Session: 1-2 IOs • History: 3-4 IOs + scan • Dormant: range scan • Exists: 0 IOs
The "Why": Match index to query pattern. Hash for point lookups, B+Tree for ranges/sorting, Bloom for existence.

Problem 6: Similar Image Search (1B+ Scale)

Problem: Pinterest-style "find similar images" and detect exact duplicates efficiently.

🔍 Show Solution

Concept: LSH + Hash Index (Hybrid)
Design: LSH for similarity • Hash index on image fingerprints for duplicates • Two-stage: Check hash, then LSH
Rough Performance: Exact duplicate: O(1) • Similar images: O(k) where k = average bucket size in LSH
The "Why": LSH handles high-dimensional data by reducing dimensionality while preserving similarity. Hash index speeds up exact duplicate checks.

Problem 7: Instagram Feed (600M Scale)

Problem: Deliver Instagram feeds without writing to 600M feeds for a single celebrity post.

🔍 Show Solution

Concept: B+Tree + Hash (Hybrid Fanout)
Design: Two tables: posts (B+Tree on timestamp) and timelines (B+Tree on user_id, timestamp) • Celebrities: Query posts table on-demand • Regular users: Pre-populate timelines
Rough Performance: Celebrity: JOIN query with B+Tree. Regular: Single B+Tree lookup.
The "Why": Avoid fanning out a celebrity post to 600M timelines. Instagram JOINs celebrity posts with your follows list using B+Tree indexes.

Problem 8: Real-time Event Stream (1M/sec Scale)

Problem: Spotify logs 1M song plays per second.

🔍 Show Solution

Concept: LSM Tree
Design: LSM with size-tiered compaction • Time-based partitioning • Secondary LSM indexes on user_id
Rough Performance: 0 IOs for writes, K×2 IOs for reads
The "Why": Writes hit memory first. B+Tree would need 4M IOPS, while LSM's batched flushes handle it efficiently. Recent data stays in memory for fast access.

Problem 9: IoT Time-Series Data (100K Scale)

Problem: 100K sensors sending readings, query by sensor_id and time range.

🔍 Show Solution

Concept: LSM Tree + Bloom Filters
Design: LSM with leveled compaction • (sensor_id, timestamp) primary key • Monthly partitions • Bloom filters per SSTable
Rough Performance: 0 write IOs, efficient range scans with bloom filter pruning
The "Why": Append-only pattern suits LSM. Bloom filters cut unnecessary SSTable checks. RLE compression for repeated values.

Problem 10: User Location Tracking (10M Scale)

Problem: Uber tracks driver locations every 5 seconds, only needing the latest location per driver.

🔍 Show Solution

Concept: LSM with Keep-Latest Merge
Design: LSM tree with custom merge: keep only newest timestamp per driver_id • Writes to memory • During compaction, discard old locations
Rough Performance: 0 write IOs, compaction reduces data 100x by keeping only latest
The "Why": LSM merge function implements "keep-latest" logic. B+Tree would require costly updates. Old locations are automatically discarded during compaction.