Index Problem Solving: Design Patterns

When you see "optimize queries," "scale writes," or "range scans" in a problem, think about the right index type.

  • Hash: O(1) exact match
  • B+Tree: range queries
  • LSM: write optimization
  • Hybrid: mix and match

Problem 1: User Lookup Service (1M Scale)

Problem: User lookup service handling 1M user requests/second with email/username lookups. Make it fast.

🔍 Show Solution
  • Concept: Hash Index

  • Design: Hash index on email and username • O(1) exact match lookups

  • Rough Performance: 1-2 IOs per lookup, scales horizontally

  • The "Why": Login is always an exact match (email = 'user@example.com'), never a range, so a hash index's 1-2 IOs beat a B+Tree's 3-4 (see the sketch below).
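
A minimal in-memory sketch of the idea, with a Python dict standing in for the persistent hash index (the UserStore class and field names are illustrative, not from the original):

```python
# Minimal sketch: a hash index is essentially a persistent dict.
class UserStore:
    def __init__(self):
        self.by_email = {}     # hash index: email -> user record
        self.by_username = {}  # hash index: username -> user record

    def add(self, user):
        self.by_email[user["email"]] = user
        self.by_username[user["username"]] = user

    def lookup_email(self, email):
        # O(1) expected: hash the key, jump straight to the bucket.
        return self.by_email.get(email)

store = UserStore()
store.add({"id": 1, "email": "user@example.com", "username": "user1"})
print(store.lookup_email("user@example.com"))  # exact match only; no range support
```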


Problem 2: Song Rating Lookup (100M Scale)

Problem: Spotify needs fast lookups: "Did user X already rate song Y?" Need exact (user_id, song_id) → rating lookup for 100M ratings. What's a good design?

🔍 Show Solution
  • Concept: Composite B+Tree vs Concatenated Hash

  • Design: Option 1: B+Tree on (user_id, song_id) - allows "all ratings by user" queries • Option 2: Hash on concat(user_id + "_" + song_id) - faster exact match only

  • Rough Performance: B+Tree: 3-4 IOs but flexible • Hash: 1-2 IOs but only exact match

  • The "Why": If you only ever need "did they rate it?", the hash wins; a Bloom filter in front answers most "no" cases without touching the index at all. If you also need "all ratings by user X", the composite B+Tree earns its extra IOs (see the sketch of the hash + Bloom option below).
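
A sketch of the concatenated-hash option with a Bloom filter in front, using in-memory stand-ins for the real store (the Bloom filter sizing and helper names are illustrative):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may say 'maybe', never a false 'no'."""
    def __init__(self, size_bits=1 << 20, hashes=3):
        self.size, self.hashes = size_bits, hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        for i in range(self.hashes):
            h = int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big")
            yield h % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

ratings = {}        # hash index: "user_song" -> rating
seen = BloomFilter()

def rate(user_id, song_id, rating):
    key = f"{user_id}_{song_id}"
    ratings[key] = rating
    seen.add(key)

def has_rated(user_id, song_id):
    key = f"{user_id}_{song_id}"
    if not seen.might_contain(key):
        return False           # definite no: zero index lookups
    return key in ratings      # possible yes: confirm in the hash index

rate(42, 1001, 5)
print(has_rated(42, 1001), has_rated(42, 9999))  # True False
```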


Problem 3: E-commerce Price Filtering (10M Scale)

Problem: Filter products by price ranges ($50-$200, Under $100). Make it fast.

🔍 Show Solution
  • Concept: B+Tree Index

  • Design: B+Tree on price column • Compound index on (category, price)

  • Rough Performance: 3-4 IOs to find the start of the range, then a sequential leaf scan

  • The "Why": Hash indexes cannot answer BETWEEN, <, or > predicates. A B+Tree's sorted structure enables range queries and ORDER BY without a separate sort (see the sketch below).
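
A sketch of why sorted order matters, using a sorted Python list plus binary search to mimic a B+Tree's descent-then-leaf-scan (the product data and names are illustrative):

```python
import bisect

price_index = []  # sorted list of (price, product_id), like B+Tree leaf entries

def add_product(product_id, price):
    bisect.insort(price_index, (price, product_id))  # keep entries sorted by price

def price_range(low, high):
    # Binary search to the start of the range (the tree descent's 3-4 IOs),
    # then scan sequentially until we pass the upper bound.
    start = bisect.bisect_left(price_index, (low, ""))
    end = bisect.bisect_right(price_index, (high, chr(0x10FFFF)))
    return price_index[start:end]

for pid, price in [("a", 30), ("b", 75), ("c", 150), ("d", 250)]:
    add_product(pid, price)

print(price_range(50, 200))  # [(75, 'b'), (150, 'c')]
```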


Problem 4: Song Analytics Dashboard (50M Scale)

Problem: Query 50M songs by: song_id (primary key), release_date (time ranges), popularity_score (top charts), genre (exact match). Can't sort data multiple ways! What's a good design?

🔍 Show Solution
  • Concept: Combine Multiple Indexes (Primary + Secondary)

  • Design: Primary B+Tree clustered on song_id (data sorted by ID) • Secondary B+Tree on release_date (pointers to primary) • (Another) Secondary B+Tree on popularity_score DESC (for TOP N) • Hash on genre (exact match only)

  • Rough Performance: Primary: 3 IOs • Secondary: 4-5 IOs (extra hop through primary) • Genre: 1-2 IOs via hash

  • The "Why": Data can be physically sorted only one way (here by song_id). Secondary indexes store (key, pointer) pairs pointing back to the primary. Trade-off: an extra IO hop per match vs maintaining multiple sorted copies of the data (see the sketch below).
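
A sketch of the primary-plus-secondary layout, with in-memory structures standing in for the indexes (field names and structures are illustrative):

```python
import bisect

primary = {}            # clustered primary: song_id -> full row (data stored once)
by_release_date = []    # secondary: sorted (release_date, song_id) pointers
by_popularity = []      # secondary: sorted (popularity, song_id) pointers
by_genre = {}           # hash: genre -> set of song_ids

def insert(song):
    primary[song["song_id"]] = song
    bisect.insort(by_release_date, (song["release_date"], song["song_id"]))
    bisect.insort(by_popularity, (song["popularity"], song["song_id"]))
    by_genre.setdefault(song["genre"], set()).add(song["song_id"])

def released_between(start, end):
    lo = bisect.bisect_left(by_release_date, (start, -1))
    hi = bisect.bisect_right(by_release_date, (end, float("inf")))
    # The extra hop: each secondary match costs another lookup in the primary.
    return [primary[song_id] for _, song_id in by_release_date[lo:hi]]

def top_n(n):
    return [primary[song_id] for _, song_id in by_popularity[-n:][::-1]]

insert({"song_id": 1, "release_date": "2024-01-05", "popularity": 87, "genre": "pop"})
insert({"song_id": 2, "release_date": "2023-06-10", "popularity": 95, "genre": "rock"})
print(released_between("2024-01-01", "2024-12-31"))
print(top_n(1))
```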


Problem 5: User Activity Tracking (10M Scale)

Problem: Track 10M user sessions. Queries: session_id (current session), user_id + timestamp (history), last_active (find dormant), "has user ever logged in?". Make it fast.

🔍 Show Solution
  • Concept: Mixed Index Types for Different Patterns

  • Design: Hash on session_id (O(1) current session) • Composite B+Tree on (user_id, timestamp) for timelines • B+Tree on last_active for inactivity queries • Bloom filter for "exists" checks

  • Rough Performance: Session: 1-2 IOs • History: 3-4 IOs + scan • Dormant: range scan on last_active • Exists: ~0 IOs (in-memory Bloom filter; rare false positives fall through to a real lookup)

  • The "Why": Match the index to the query pattern: hash for point lookups, B+Tree for ranges and sorting, Bloom filter for existence checks (see the sketch below).
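
A sketch of matching structures to query shapes; a plain set stands in for the Bloom filter (see Problem 2's sketch for a real one), and the names are illustrative:

```python
import bisect, time

sessions = {}           # hash: session_id -> session record (point lookup)
history = []            # sorted (user_id, timestamp, session_id) for timelines
ever_logged_in = set()  # existence check: "has user ever logged in?"

def start_session(session_id, user_id, ts=None):
    ts = ts or time.time()
    sessions[session_id] = {"user_id": user_id, "started": ts, "last_active": ts}
    bisect.insort(history, (user_id, ts, session_id))
    ever_logged_in.add(user_id)

def user_history(user_id, since, until):
    # Composite key (user_id, timestamp): one user's rows are contiguous.
    lo = bisect.bisect_left(history, (user_id, since, ""))
    hi = bisect.bisect_right(history, (user_id, until, chr(0x10FFFF)))
    return history[lo:hi]

start_session("s1", 7, ts=1_700_000_000)
start_session("s2", 7, ts=1_700_000_600)
print(sessions["s2"])                                  # O(1) point lookup
print(user_history(7, 1_700_000_000, 1_700_001_000))   # range scan on composite key
print(9 in ever_logged_in)                             # False, nothing else touched
```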


Problem 6: Similar Image Search (1B+ Scale)

Problem: Use case 1: Pinterest-style "find similar images" from 1B+ images. Use case 2: detect exact duplicates before storing new uploads. Make both fast.

🔍 Show Solution
  • Concept: LSH + Hash Index (Hybrid)

  • Design: LSH (Locality Sensitive Hashing) for similarity: maps images to buckets where similar images collide • Hash index on image fingerprints for exact duplicate detection • Two-stage: Check hash for duplicates, then LSH for similar

  • Rough Performance: Exact duplicate: O(1) via hash • Similar images: O(k) where k = average bucket size in LSH

  • The "Why": Image data is high-dimensional (pixels/features). LSH reduces dimensionality while preserving similarity, so similar images land in the same buckets. The separate hash index on exact fingerprints keeps use case 2 (duplicate detection) at O(1) (see the sketch below).
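
A sketch of the two-stage design using random-hyperplane LSH plus an exact fingerprint hash; the feature dimension, number of hyperplanes, and names are illustrative choices, not from the original:

```python
import hashlib, random

DIM, PLANES = 64, 16
random.seed(0)
hyperplanes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(PLANES)]

exact_index = {}   # sha256(image bytes) -> image_id (exact duplicate check)
lsh_buckets = {}   # LSH signature -> list of image_ids (similarity candidates)

def lsh_signature(features):
    # One bit per hyperplane: which side of the plane the feature vector falls on.
    # Nearby vectors tend to fall on the same sides, so they collide.
    bits = 0
    for plane in hyperplanes:
        dot = sum(a * b for a, b in zip(plane, features))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

def ingest(image_id, image_bytes, features):
    fp = hashlib.sha256(image_bytes).hexdigest()
    if fp in exact_index:
        return exact_index[fp]          # stage 1: exact duplicate, reuse existing id
    exact_index[fp] = image_id
    lsh_buckets.setdefault(lsh_signature(features), []).append(image_id)
    return image_id

def similar(features):
    # stage 2: candidates are whatever collided into the same LSH bucket
    return lsh_buckets.get(lsh_signature(features), [])

img = bytes([1, 2, 3])
feats = [random.gauss(0, 1) for _ in range(DIM)]
ingest("img-1", img, feats)
print(ingest("img-2", img, feats))   # exact duplicate: returns "img-1"
print(similar(feats))                # ['img-1']
```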


Problem 7: Instagram Feed (600M Scale)

Problem: Show each user's Instagram feed (posts from accounts they follow). Cristiano Ronaldo has 600M followers, so you can't write to 600M feeds when he posts.

🔍 Show Solution
  • Concept: B+Tree + Hash (Hybrid Fanout)

  • Design: Two tables: posts (B+Tree on (author_id, timestamp)) and timelines (B+Tree on (user_id, timestamp)) • Celebrities: query the posts table on-demand at read time • Regular users: pre-populate the timelines table on write

  • Rough Performance: Celebrity: JOIN query with B+Tree. Regular: Single B+Tree lookup of pre-computed timeline.

  • The "Why": You can't fan out Ronaldo's post to 600M timeline rows on write. When a follower opens Instagram, the app merges their pre-computed timeline with an on-demand B+Tree lookup of recent posts from the celebrities they follow (see the sketch below).
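
A sketch of the hybrid fanout decision, with in-memory stand-ins for the posts and timelines tables (CELEBRITY_THRESHOLD and the helper names are illustrative):

```python
import bisect

CELEBRITY_THRESHOLD = 10_000

posts = []        # sorted (author_id, timestamp, post) ~ B+Tree on (author_id, timestamp)
timelines = {}    # user_id -> list of (timestamp, post): pre-computed feed
followers = {}    # author_id -> set of follower ids
following = {}    # user_id -> set of followed author ids

def publish(author_id, ts, post):
    bisect.insort(posts, (author_id, ts, post))
    fans = followers.get(author_id, set())
    if len(fans) <= CELEBRITY_THRESHOLD:
        for f in fans:                                  # fanout-on-write: cheap for normal accounts
            timelines.setdefault(f, []).append((ts, post))
    # celebrities: no fanout now, resolved at read time

def feed(user_id):
    items = list(timelines.get(user_id, []))            # single lookup of pre-computed timeline
    for author in following.get(user_id, set()):
        if len(followers.get(author, set())) > CELEBRITY_THRESHOLD:
            # fanout-on-read: range scan the celebrity's recent posts
            lo = bisect.bisect_left(posts, (author, float("-inf"), ""))
            hi = bisect.bisect_right(posts, (author, float("inf"), chr(0x10FFFF)))
            items.extend((ts, p) for _, ts, p in posts[lo:hi])
    return sorted(items, reverse=True)

followers["ronaldo"] = set(range(20_000))   # celebrity
followers["friend"] = {"me"}
following["me"] = {"ronaldo", "friend"}
publish("friend", 100, "hi")
publish("ronaldo", 101, "goal!")
print(feed("me"))   # friend's post from the timeline, ronaldo's fetched at read time
```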


Problem 8: Real-time Event Stream (1M/sec Scale)

Problem: Spotify logs 1M song plays/second (user_id, song_id, timestamp, duration). Make writes fast.

🔍 Show Solution
  • Concept: LSM Tree

  • Design: LSM with size-tiered compaction • Time-based partitioning • Secondary LSM indexes on user_id

  • Rough Performance: ~0 IOs per write (absorbed in memory), ~2×K IOs per read, where K = number of SSTables/levels that must be checked

  • The "Why": Writes go to memory first and are flushed sequentially in batches. A B+Tree would need roughly 4M random IOPS (1M writes/sec × ~4 IOs per insert); the LSM pays that cost later, sequentially, during flush and compaction. Recent data stays in memory for fast queries (see the sketch below).
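
A minimal LSM sketch showing the write path (memtable, then a sorted flush) and the read path (memtable, then SSTables newest-first); the memtable limit and names are illustrative, and a real system also writes a sequential WAL before acknowledging each write:

```python
MEMTABLE_LIMIT = 4

memtable = {}    # key -> value, absorbed in memory (no random IO)
sstables = []    # list of immutable sorted runs, newest last

def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()

def flush():
    global memtable
    sstables.append(sorted(memtable.items()))  # one big sequential write
    memtable = {}

def get(key):
    if key in memtable:
        return memtable[key]
    for run in reversed(sstables):             # newest SSTable first
        for k, v in run:                       # real runs use binary search + Bloom filters
            if k == key:
                return v
    return None

for i in range(10):
    put(f"play:{i}", {"user": i % 3, "song": i})
print(get("play:2"), len(sstables), len(memtable))
```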


Problem 9: IoT Time-Series Data (100K Scale)

Problem: 100K sensors send temperature readings every minute; queries are by sensor_id and time range.

🔍 Show Solution
  • Concept: LSM Tree + Bloom Filters

  • Design: LSM with leveled compaction • (sensor_id, timestamp) primary key • Monthly partitions • Bloom filters per SSTable

  • Rough Performance: ~0 write IOs; range reads prune most SSTables via the per-SSTable Bloom filters on sensor_id

  • The "Why": The append-only pattern is a perfect fit for LSM. Per-SSTable Bloom filters eliminate ~99% of unnecessary SSTable checks, and run-length encoding compresses the highly repetitive readings (see the sketch below).
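
A sketch of per-SSTable Bloom-filter pruning for the sensor workload; the Bloom sizing, row format, and names are illustrative:

```python
import bisect, hashlib

class TinyBloom:
    def __init__(self, bits=8192, hashes=3):
        self.n, self.k, self.bits = bits, hashes, bytearray(bits // 8)
    def _pos(self, key):
        for i in range(self.k):
            h = int.from_bytes(hashlib.md5(f"{i}:{key}".encode()).digest()[:8], "big")
            yield h % self.n
    def add(self, key):
        for p in self._pos(key):
            self.bits[p // 8] |= 1 << (p % 8)
    def maybe(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._pos(key))

sstables = []   # list of (Bloom filter over sensor_ids, sorted [(sensor_id, ts, value)])

def flush(memtable_rows):
    bloom = TinyBloom()
    for sensor_id, _, _ in memtable_rows:
        bloom.add(sensor_id)
    sstables.append((bloom, sorted(memtable_rows)))

def readings(sensor_id, t_start, t_end):
    out = []
    for bloom, run in sstables:
        if not bloom.maybe(sensor_id):
            continue                      # pruned: no IO spent on this SSTable
        lo = bisect.bisect_left(run, (sensor_id, t_start, float("-inf")))
        hi = bisect.bisect_right(run, (sensor_id, t_end, float("inf")))
        out.extend(run[lo:hi])
    return sorted(out)

flush([("s1", 1, 21.5), ("s2", 1, 19.0)])
flush([("s1", 2, 21.7), ("s3", 2, 18.2)])
flush([("s2", 3, 19.4), ("s3", 3, 18.1)])   # contains no s1: its Bloom filter prunes it
print(readings("s1", 0, 10))
```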


Problem 10: User Location Tracking (10M Scale)

Problem: Uber tracks driver locations every 5 seconds. Only need most recent location per driver, not history.

🔍 Show Solution
  • Concept: LSM with Keep-Latest Merge

  • Design: LSM tree with custom merge: keep only newest timestamp per driver_id • Writes to memory • During compaction, discard old locations

  • Rough Performance: ~0 write IOs; compaction shrinks the data roughly 100x by keeping only the latest location per driver

  • The "Why": The LSM merge function can implement "keep latest" logic, so old locations are garbage-collected automatically during compaction. A B+Tree would pay a random in-place update for every ping (see the sketch below).
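
A sketch of a "keep latest" compaction merge; the row format and names are illustrative:

```python
def compact_keep_latest(sstables):
    """Merge many sorted runs into one, keeping only the newest location per driver."""
    latest = {}   # driver_id -> (timestamp, lat, lon)
    for run in sstables:
        for driver_id, ts, lat, lon in run:
            if driver_id not in latest or ts > latest[driver_id][0]:
                latest[driver_id] = (ts, lat, lon)
    # Emit a single sorted run; the older history has been garbage-collected.
    return sorted((d, *v) for d, v in latest.items())

old_runs = [
    [("d1", 100, 37.77, -122.42), ("d2", 100, 40.71, -74.00)],
    [("d1", 105, 37.78, -122.41)],                 # newer ping from d1
]
print(compact_keep_latest(old_runs))
# [('d1', 105, 37.78, -122.41), ('d2', 100, 40.71, -74.0)]
```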