Index Problem Solving: Design Patterns
When tasked with "optimize queries," "scale writes," or "range scans," the right index type isn't just a technical choiceβit's a business decision.
Problem 1: User Lookup Service (1M Scale)
Problem: Handle 1M user requests per second with email/username lookups. Speed is non-negotiable.
π Show Solution
-
Concept: Hash Index
-
Design: Hash index on email and username for O(1) exact match lookups
-
Rough Performance: 1-2 IOs per lookup, scales horizontally
-
The "Why": Logins are exact matches (email = 'user@example.com'). Hash index outperforms B+Tree's 3-4 IOs for this use case.
Problem 2: Song Rating Lookup (100M Scale)
Problem: Spotify needs to know if "User X already rated Song Y?" for 100M ratings. Efficiency is key.
π Show Solution
-
Concept: Composite B+Tree vs Concatenated Hash
-
Design: Option 1: B+Tree on (user_id, song_id) for flexibility β’ Option 2: Hash on concat(user_id + "_" + song_id) for speed
-
Rough Performance: B+Tree: 3-4 IOs but flexible β’ Hash: 1-2 IOs for exact match
-
The "Why": If the sole need is "did they rate it?", go with hash. Consider a bloom filter for existence checks.
Problem 3: E-commerce Price Filtering (10M Scale)
Problem: Filter products by price ranges swiftly.
π Show Solution
-
Concept: B+Tree Index
-
Design: B+Tree on price column β’ Compound index on (category, price)
-
Rough Performance: 3-4 IOs + sequential scan for ranges
-
The "Why": Hash can't handle BETWEEN or < > operators. B+Tree's sorted nature is ideal for range queries and ORDER BY operations.
Problem 4: Song Analytics Dashboard (50M Scale)
Problem: Query 50M songs by various criteria. Sorting data multiple ways isn't feasible.
π Show Solution
-
Concept: Combine Multiple Indexes (Primary + Secondary)
-
Design: Primary B+Tree clustered on song_id β’ Secondary B+Tree on release_date β’ Another Secondary B+Tree on popularity_score DESC β’ Hash on genre
-
Rough Performance: Primary: 3 IOs β’ Secondary: 4-5 IOs β’ Genre: 1-2 IOs via hash
-
The "Why": Data is physically sorted by song_id. Secondary indexes point back to primary. The trade-off is an extra IO hop versus maintaining multiple sorted copies.
Problem 5: User Activity Tracking (10M Scale)
Problem: Track 10M user sessions with various query patterns.
π Show Solution
-
Concept: Mixed Index Types for Different Patterns
-
Design: Hash on session_id β’ Composite B+Tree on (user_id, timestamp) β’ B+Tree on last_active β’ Bloom filter for "exists" checks
-
Rough Performance: Session: 1-2 IOs β’ History: 3-4 IOs + scan β’ Dormant: range scan β’ Exists: 0 IOs
-
The "Why": Match index to query pattern. Hash for point lookups, B+Tree for ranges/sorting, Bloom for existence.
Problem 6: Similar Image Search (1B+ Scale)
Problem: Pinterest-style "find similar images" and detect exact duplicates efficiently.
π Show Solution
-
Concept: LSH + Hash Index (Hybrid)
-
Design: LSH for similarity β’ Hash index on image fingerprints for duplicates β’ Two-stage: Check hash, then LSH
-
Rough Performance: Exact duplicate: O(1) β’ Similar images: O(k) where k = average bucket size in LSH
-
The "Why": LSH handles high-dimensional data by reducing dimensionality while preserving similarity. Hash index speeds up exact duplicate checks.
Problem 7: Instagram Feed (600M Scale)
Problem: Deliver Instagram feeds without writing to 600M feeds for a single celebrity post.
π Show Solution
-
Concept: B+Tree + Hash (Hybrid Fanout)
-
Design: Two tables: posts (B+Tree on timestamp) and timelines (B+Tree on user_id, timestamp) β’ Celebrities: Query posts table on-demand β’ Regular users: Pre-populate timelines
-
Rough Performance: Celebrity: JOIN query with B+Tree. Regular: Single B+Tree lookup.
-
The "Why": Avoid fanning out a celebrity post to 600M timelines. Instagram JOINs celebrity posts with your follows list using B+Tree indexes.
Problem 8: Real-time Event Stream (1M/sec Scale)
Problem: Spotify logs 1M song plays per second.
π Show Solution
-
Concept: LSM Tree
-
Design: LSM with size-tiered compaction β’ Time-based partitioning β’ Secondary LSM indexes on user_id
-
Rough Performance: 0 IOs for writes, KΓ2 IOs for reads
-
The "Why": Writes hit memory first. B+Tree would need 4M IOPS, while LSM's batched flushes handle it efficiently. Recent data stays in memory for fast access.
Problem 9: IoT Time-Series Data (100K Scale)
Problem: 100K sensors sending readings, query by sensor_id and time range.
π Show Solution
-
Concept: LSM Tree + Bloom Filters
-
Design: LSM with leveled compaction β’ (sensor_id, timestamp) primary key β’ Monthly partitions β’ Bloom filters per SSTable
-
Rough Performance: 0 write IOs, efficient range scans with bloom filter pruning
-
The "Why": Append-only pattern suits LSM. Bloom filters cut unnecessary SSTable checks. RLE compression for repeated values.
Problem 10: User Location Tracking (10M Scale)
Problem: Uber tracks driver locations every 5 seconds, only needing the latest location per driver.
π Show Solution
-
Concept: LSM with Keep-Latest Merge
-
Design: LSM tree with custom merge: keep only newest timestamp per driver_id β’ Writes to memory β’ During compaction, discard old locations
-
Rough Performance: 0 write IOs, compaction reduces data 100x by keeping only latest
-
The "Why": LSM merge function implements "keep-latest" logic. B+Tree would require costly updates. Old locations are automatically discarded during compaction.