Practice Problems for Basic Systems Design
Hints: hashing, Bloom filters, locality-sensitive hashing, and compression techniques.
Problem 1: Startup API Cache (1M Scale)
Problem: A music streaming API serves 1M song metadata requests (artist, album, genre, duration). Heavy queries hit multiple DB tables during peak hours, causing high latency. How do you improve latency?
Solution
- Concept: Hash Table
- Design: In-memory HashMap mapping request_id → cached_response • Expire entries periodically • Evict least-recently-used entries
- Rough Performance: 100MB memory, 50K lookups/sec, 100% accuracy
- The "Why": Database queries take 50-200ms vs 0.01ms for a hash lookup. Caching "popular" results avoids expensive DB connections and queries.
Problem 2: Photo Similarity Detection (1M Scale)
Problem: A photo app needs to detect duplicate or near-similar images in a user's library to suggest cleanup and save space. What's a good design?
Solution
- Concept: Locality-Sensitive Hashing (LSH)
- Design: Convert images to 512-dim feature vectors • Generate similarity-preserving hash codes • Group photos that land in the same buckets
- Rough Performance: 200MB memory, 10K photos/sec, 95% accuracy
- The "Why": Exact pixel comparison is $O(n^2)$: across 1M photos that is about 1 trillion operations. LSH allows finding near-matches in near $O(n)$ time.
Problem 3: CDN Cache Check (10M Scale)
Problem: A Content Delivery Network (CDN) like Akamai needs to check whether a requested URL is already cached on a specific edge server in order to speed up user requests. What's a good design?
Solution
- Concept: Bloom Filter
- Design: In-memory bit array for quick "definitely not cached" checks • Only check disk if the filter returns "maybe cached"
- Rough Performance: 20MB memory, 100K checks/sec, ~95% space savings vs a hash map
- The "Why": Disk seeks take 10ms vs 0.01ms for a memory lookup. A hash table for 10M URLs would need 400MB+ RAM; a Bloom filter needs ~20MB.
Problem 4: Social Media Recommendations (100M Scale)
Problem: Instagram needs to recommend "similar users" to follow based on a user's current interests, follow list, and hashtags. What's a good design?
Solution
- Concept: Locality-Sensitive Hashing (LSH)
- Design: Represent users as sparse feature vectors (likes, follows) • Hash similar users into the same buckets • Pull recommendations from the same bucket
- Rough Performance: 2GB memory, 50K users/sec, 90% relevance
- The "Why": Computing exact similarity between 100M users requires $10^{16}$ comparisons. LSH reduces the search space to a localized bucket of likely candidates.
Problem 5: Web Crawler Deduplication (1B Scale)
Problem: A web crawler (like Googlebot) needs to keep track of billions of visited URLs to avoid re-crawling the same pages and wasting bandwidth. What's a good design?
Solution
- Concept: Bloom Filter
- Design: A large Bloom filter tracks all visited URLs • Check each URL before crawling • A 1% false-positive rate is acceptable (the crawler skips ~1% of new pages)
- Rough Performance: 1.2GB memory, 1M URLs/sec, 99% deduplication
- The "Why": A standard hash table for 1B URLs would require 32GB+ of RAM. A Bloom filter fits comfortably in the memory of a single machine.
Problem 6: Username Availability (1B Scale)
Problem: When a user signs up, you want to show "Username Taken" instantly as they type, without sending a network request to the server for every keystroke. What's a good design?
Solution
- Concept: Bloom Filter
- Design: Periodically download a compressed Bloom filter of taken usernames to the client/app • Perform the "taken" check locally
- Rough Performance: 10MB download, instant client-side checks
- The "Why": This replaces a ~100ms API call with a 0.01ms local check. A false positive simply tells the user a name is taken when it is actually available, which is minor UX friction compared to lag.
Problem 7: Document Near-Duplicates (10B Scale)
Problem: A search engine needs to filter out millions of pages that are "near-duplicates" (identical content with different ads or headers) from its search results. What's a good design?
Solution
- Concept: Locality-Sensitive Hashing (LSH / MinHash)
- Design: Convert docs into sets of word shingles • Create MinHash fingerprints • Group near-identical fingerprints
- Rough Performance: 40GB memory, 500K docs/sec, 85% detection rate
- The "Why": Comparing the full text of 10B documents is physically impossible. MinHash/LSH compresses the comparison into a small set of integer hashes.
Problem 8: Log Storage Compression (100M Scale)
Problem: A system generates 100M log entries daily, but 99% of them are the same 1,000 "error" or "info" strings repeated with different timestamps. What's a good design?
Solution
- Concept: Dictionary Encoding
- Design: Create a mapping of the 1,000 unique strings → 10-bit IDs • Store only the IDs in the main log table
- Rough Performance: 40GB raw logs → 2GB stored (95% reduction)
- The "Why": Storing the string "ConnectionTimeoutError" (22 bytes) millions of times is wasteful. Storing a 2-byte ID achieves massive space savings.
Problem 9: Time Series Analytics (1B Scale)
Problem: An IoT platform tracks temperature sensors every second. Timestamps are almost always one second apart (12:00:01, 12:00:02, etc.). What's a good design?
Solution
- Concept: Delta Encoding
- Design: Store a "base" timestamp (e.g., midnight) • Store only the +1s or +2s difference (delta) for each subsequent row
- Rough Performance: 32GB raw timestamps → 4GB stored
- The "Why": A standard 8-byte timestamp is redundant when values are predictable. Deltas can be further compressed using bit packing for an even smaller footprint.
Problem 10: E-commerce User Behavior (10M Scale)
Problem: An e-commerce site tracks user clicks. A typical session looks like: view_item, view_item, view_item, add_to_cart, view_item, view_item. What's a good design to store this data?
Solution
- Concept: Run-Length Encoding (RLE) + Dictionary Encoding
- Design: Dictionary-encode the action types • Use RLE for consecutive repeated actions (e.g., (view_item, 3))
- Rough Performance: ~10x compression ratio for session data
- The "Why": User behavior is repetitive. RLE eliminates the redundancy of sequential identical events.
Problem 11: Rating System Storage (100M Scale)
Problem: You need to store 100M ratings for a recommendation engine. Ratings are integers from 1 to 5. What's a good design?
Solution
- Concept: Bit Packing
- Design: A number from 1-5 only needs 3 bits ($2^3 = 8$ possible values) • Pack multiple 3-bit ratings into a single 32-bit or 64-bit word
- Rough Performance: 400MB raw integers → 50MB packed storage
- The "Why": Using a full 32-bit integer for a value that only goes up to 5 wastes about 90% of the allocated space. Bit packing maximizes density.
Key insight: Scale numbers (millions vs billions) + problem type ("lookup," "similarity," "storage") immediately point to the right data structure family. Real systems combine multiple techniques for 10-100x improvements.