Index Problem Solving: Design Patterns

When tasked with "optimize queries," "scale writes," or "range scans," the right index type isn't just a technical choiceβ€”it's a business decision.

Hash
O(1) exact match
B+Tree
Range queries
LSM
Write optimization
Hybrid
Mix and match

Problem 1: User Lookup Service (1M Scale)

Problem: Handle 1M user requests per second with email/username lookups. Speed is non-negotiable.

πŸ” Show Solution
  • Concept: Hash Index

  • Design: Hash index on email and username for O(1) exact match lookups

  • Rough Performance: 1-2 IOs per lookup, scales horizontally

  • The "Why": Logins are exact matches (email = 'user@example.com'). Hash index outperforms B+Tree's 3-4 IOs for this use case.


Problem 2: Song Rating Lookup (100M Scale)

Problem: Spotify needs to know if "User X already rated Song Y?" for 100M ratings. Efficiency is key.

πŸ” Show Solution
  • Concept: Composite B+Tree vs Concatenated Hash

  • Design: Option 1: B+Tree on (user_id, song_id) for flexibility β€’ Option 2: Hash on concat(user_id + "_" + song_id) for speed

  • Rough Performance: B+Tree: 3-4 IOs but flexible β€’ Hash: 1-2 IOs for exact match

  • The "Why": If the sole need is "did they rate it?", go with hash. Consider a bloom filter for existence checks.


Problem 3: E-commerce Price Filtering (10M Scale)

Problem: Filter products by price ranges swiftly.

πŸ” Show Solution
  • Concept: B+Tree Index

  • Design: B+Tree on price column β€’ Compound index on (category, price)

  • Rough Performance: 3-4 IOs + sequential scan for ranges

  • The "Why": Hash can't handle BETWEEN or < > operators. B+Tree's sorted nature is ideal for range queries and ORDER BY operations.


Problem 4: Song Analytics Dashboard (50M Scale)

Problem: Query 50M songs by various criteria. Sorting data multiple ways isn't feasible.

πŸ” Show Solution
  • Concept: Combine Multiple Indexes (Primary + Secondary)

  • Design: Primary B+Tree clustered on song_id β€’ Secondary B+Tree on release_date β€’ Another Secondary B+Tree on popularity_score DESC β€’ Hash on genre

  • Rough Performance: Primary: 3 IOs β€’ Secondary: 4-5 IOs β€’ Genre: 1-2 IOs via hash

  • The "Why": Data is physically sorted by song_id. Secondary indexes point back to primary. The trade-off is an extra IO hop versus maintaining multiple sorted copies.


Problem 5: User Activity Tracking (10M Scale)

Problem: Track 10M user sessions with various query patterns.

πŸ” Show Solution
  • Concept: Mixed Index Types for Different Patterns

  • Design: Hash on session_id β€’ Composite B+Tree on (user_id, timestamp) β€’ B+Tree on last_active β€’ Bloom filter for "exists" checks

  • Rough Performance: Session: 1-2 IOs β€’ History: 3-4 IOs + scan β€’ Dormant: range scan β€’ Exists: 0 IOs

  • The "Why": Match index to query pattern. Hash for point lookups, B+Tree for ranges/sorting, Bloom for existence.


Problem 6: Similar Image Search (1B+ Scale)

Problem: Pinterest-style "find similar images" and detect exact duplicates efficiently.

πŸ” Show Solution
  • Concept: LSH + Hash Index (Hybrid)

  • Design: LSH for similarity β€’ Hash index on image fingerprints for duplicates β€’ Two-stage: Check hash, then LSH

  • Rough Performance: Exact duplicate: O(1) β€’ Similar images: O(k) where k = average bucket size in LSH

  • The "Why": LSH handles high-dimensional data by reducing dimensionality while preserving similarity. Hash index speeds up exact duplicate checks.


Problem 7: Instagram Feed (600M Scale)

Problem: Deliver Instagram feeds without writing to 600M feeds for a single celebrity post.

πŸ” Show Solution
  • Concept: B+Tree + Hash (Hybrid Fanout)

  • Design: Two tables: posts (B+Tree on timestamp) and timelines (B+Tree on user_id, timestamp) β€’ Celebrities: Query posts table on-demand β€’ Regular users: Pre-populate timelines

  • Rough Performance: Celebrity: JOIN query with B+Tree. Regular: Single B+Tree lookup.

  • The "Why": Avoid fanning out a celebrity post to 600M timelines. Instagram JOINs celebrity posts with your follows list using B+Tree indexes.


Problem 8: Real-time Event Stream (1M/sec Scale)

Problem: Spotify logs 1M song plays per second.

πŸ” Show Solution
  • Concept: LSM Tree

  • Design: LSM with size-tiered compaction β€’ Time-based partitioning β€’ Secondary LSM indexes on user_id

  • Rough Performance: 0 IOs for writes, KΓ—2 IOs for reads

  • The "Why": Writes hit memory first. B+Tree would need 4M IOPS, while LSM's batched flushes handle it efficiently. Recent data stays in memory for fast access.


Problem 9: IoT Time-Series Data (100K Scale)

Problem: 100K sensors sending readings, query by sensor_id and time range.

πŸ” Show Solution
  • Concept: LSM Tree + Bloom Filters

  • Design: LSM with leveled compaction β€’ (sensor_id, timestamp) primary key β€’ Monthly partitions β€’ Bloom filters per SSTable

  • Rough Performance: 0 write IOs, efficient range scans with bloom filter pruning

  • The "Why": Append-only pattern suits LSM. Bloom filters cut unnecessary SSTable checks. RLE compression for repeated values.


Problem 10: User Location Tracking (10M Scale)

Problem: Uber tracks driver locations every 5 seconds, only needing the latest location per driver.

πŸ” Show Solution
  • Concept: LSM with Keep-Latest Merge

  • Design: LSM tree with custom merge: keep only newest timestamp per driver_id β€’ Writes to memory β€’ During compaction, discard old locations

  • Rough Performance: 0 write IOs, compaction reduces data 100x by keeping only latest

  • The "Why": LSM merge function implements "keep-latest" logic. B+Tree would require costly updates. Old locations are automatically discarded during compaction.