Practice Problems for Basic Systems Design

Hints: hashing, Bloom filters, locality-sensitive hashing, and compression techniques.

Problem 1: Startup API Cache (1M Scale)

Problem: A music streaming API serves 1M song metadata requests (artist, album, genre, duration). Heavy queries hit multiple DB tables during peak hours, causing high latency. How do you improve latency?

πŸ” Show Solution
  • Concept: Hash Table

  • Design: In-memory HashMap: request_id β†’ cached_response β€’ Expire periodically β€’ Evict least-recently used entries

  • Rough Performance: 100MB memory, 50K lookups/sec, 100% accuracy

  • The "Why": Database queries take 50-200ms vs 0.01ms hash lookup. Caching "popular" results avoids expensive DB connections and queries.


Problem 2: Photo Similarity Detection (1M Scale)

Problem: A photo app needs to detect duplicate or near-similar images in a user's library to suggest cleanup and save space. What's a good design?

πŸ” Show Solution
  • Concept: Locality-Sensitive Hashing (LSH)

  • Design: Convert images to 512-dim feature vectors β€’ Generate similarity-preserving hash codes β€’ Group photos in same buckets

  • Rough Performance: 200MB memory, 10K photos/sec, 95% accuracy

  • The "Why": Exact pixel comparison is $O(n^2)$ across 1M photos = 1 trillion operations. LSH allows finding near-matches in near $O(n)$ time.


Problem 3: CDN Cache Check (10M Scale)

Problem: A Content Delivery Network (CDN) like Akamai needs to check if a requested URL is already cached on a specific edge server to speed up user requests. What's a good design?

πŸ” Show Solution
  • Concept: Bloom Filter

  • Design: In-memory bit-array for quick "definitely not cached" checks β€’ Only check disk if filter returns "maybe cached"

  • Rough Performance: 20MB memory, 100K checks/sec, ~95% space savings vs a hash table

  • The "Why": Disk seeks take 10ms vs 0.01ms memory lookup. A hash table for 10M URLs would need 400MB+ RAM; a Bloom filter needs ~20MB.


Problem 4: Social Media Recommendations (100M Scale)

Problem: Instagram needs to recommend "similar users" to follow based on a user's current interests, follow list, and hashtags. What's a good design?

πŸ” Show Solution
  • Concept: Locality-Sensitive Hashing (LSH)

  • Design: Represent users as sparse feature vectors (likes, follows) β€’ Hash similar users to same buckets β€’ Pull recommendations from the same bucket

  • Rough Performance: 2GB memory, 50K users/sec, 90% relevance

  • The "Why": Computing exact similarity between 100M users requires $10^{16}$ comparisons. LSH reduces the search space to a localized bucket of likely candidates.


Problem 5: Web Crawler Deduplication (1B Scale)

Problem: A web crawler (like Googlebot) needs to keep track of billions of visited URLs to avoid re-crawling the same pages and wasting bandwidth. What's a good design?

πŸ” Show Solution
  • Concept: Bloom Filter

  • Design: A large Bloom filter tracks all visited URLs β€’ Check URL before crawling β€’ 1% false positive rate is acceptable (skips 1% of new pages)

  • Rough Performance: 1.2GB memory, 1M URL checks/sec, no missed duplicates (the ~1% false positives only skip new pages)

  • The "Why": A standard Hash Table for 1B URLs would require 32GB+ RAM. A Bloom filter fits comfortably in the memory of a single machine.


Problem 6: Username Availability (1B Scale)

Problem: When a user signs up, you want to show "Username Taken" instantly as they type, without sending a network request to the server for every keystroke. What's a good design?

πŸ” Show Solution
  • Concept: Bloom Filter

  • Design: Periodically download a compressed Bloom filter of taken usernames to the client/app β€’ Perform local "taken" check

  • Rough Performance: ~10MB client download (a partial or higher-error-rate filter; covering all 1B names at 1% FPR would be ~1.2GB), instant client-side checks

  • The "Why": Replaces a 100ms API call with a 0.01ms local check. A false positive simply tells the user a name is taken when it's availableβ€”a minor ux friction compared to lag.


Problem 7: Document Near-Duplicates (10B Scale)

Problem: A search engine needs to filter out millions of pages that are "near-duplicates" (identical content with different ads or headers) from its search results. What's a good design?

πŸ” Show Solution
  • Concept: Locality-Sensitive Hashing (LSH / MinHash)

  • Design: Convert docs to sets of words (shingles) β€’ Create MinHash fingerprints β€’ Group near-identical fingerprints

  • Rough Performance: 40GB memory, 500K docs/sec, 85% detection rate

  • The "Why": Comparing the full text of 10B documents is physically impossible. MinHash/LSH compresses the comparison into a small set of integer hashes.


Problem 8: Log Storage Compression (100M Scale)

Problem: A system generates 100M log entries daily, but 99% of them are the same 1,000 "error" or "info" strings repeated with different timestamps. What's a good design?

πŸ” Show Solution
  • Concept: Dictionary Encoding

  • Design: Create a mapping of 1,000 unique strings β†’ 10-bit IDs β€’ Store only IDs in the main log table

  • Rough Performance: 40GB raw logs β†’ 2GB stored (95% reduction)

  • The "Why": Storing the string "ConnectionTimeoutError" (22 bytes) millions of times is wasteful. Storing a 2-byte ID achieves massive space savings.


Problem 9: Time Series Analytics (1B Scale)

Problem: An IoT platform tracks temperature sensors every second. Timestamps are almost always one second apart (12:00:01, 12:00:02, etc.). What's a good design?

πŸ” Show Solution
  • Concept: Delta Encoding

  • Design: Store a "Base" timestamp (e.g., midnight) β€’ Store only the +1s or +2s difference (delta) for each subsequent row

  • Rough Performance: ~8GB of raw 8-byte timestamps → ~1GB of single-byte deltas

  • The "Why": A standard 8-byte timestamp is redundant when values are predictable. Deltas can be further compressed using bit-packing for even smaller footprints.


Problem 10: E-commerce User Behavior (10M Scale)

Problem: An e-commerce site tracks user clicks. A typical session looks like: view_item, view_item, view_item, add_to_cart, view_item, view_item. What's a good design to store this data?

πŸ” Show Solution
  • Concept: Run-Length Encoding (RLE) + Dictionary Encoding

  • Design: Dictionary encode the action types β€’ Use RLE for consecutive repeated actions (e.g., (view_item, 3))

  • Rough Performance: 10x compression ratio for session data

  • The "Why": User behavior is repetitive. RLE eliminates the redundancy of sequential identical events.


Problem 11: Rating System Storage (100M Scale)

Problem: You need to store 100M ratings for a recommendation engine. Ratings are integers from 1 to 5. What's a good design?

πŸ” Show Solution
  • Concept: Bit Packing

  • Design: A number from 1-5 only needs 3 bits ($2^3 = 8$) β€’ Pack multiple 3-bit ratings into a single 32-bit or 64-bit word

  • Rough Performance: 400MB raw integers β†’ 50MB packed storage

  • The "Why": Using a full 32-bit integer for a value that only goes up to 5 wastes 90% of the allocated space. Bit packing maximizes density.


Key insight: Scale numbers (millions vs billions) + problem type ("lookup," "similarity," "storage") immediately point to the right data structure family. Real systems combine multiple techniques for 10-100x improvements.