Practice Problems for Basic Systems Design
Hints: hashing, Bloom filters, locality-sensitive hashing, and compression data structures.
Problem 1: Startup API Cache (1M Scale)
Problem: A music streaming API is buckling under the weight of 1M song metadata requests. Each query drags through multiple database tables during peak hours, and latency is the unwelcome guest. How do you cut the wait?
Solution:
- Concept: Hash Table
- Design: In-memory HashMap mapping request_id → cached_response • Expire entries periodically • Evict least-recently-used entries
- Rough Performance: 100MB memory, 50K lookups/sec, 100% accuracy
- The "Why": Database queries take 50-200ms vs a 0.01ms hash lookup. Caching "popular" results avoids expensive DB connections and queries.
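A minimal Python sketch of this design, using `collections.OrderedDict` for LRU ordering; the capacity and TTL values are illustrative assumptions, not figures from the problem:

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """In-memory cache: LRU eviction plus per-entry expiry."""

    def __init__(self, capacity=100_000, ttl_seconds=300):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self._store = OrderedDict()  # request_id -> (response, expires_at)

    def get(self, request_id):
        entry = self._store.get(request_id)
        if entry is None:
            return None
        response, expires_at = entry
        if time.monotonic() > expires_at:    # expired: drop and report a miss
            del self._store[request_id]
            return None
        self._store.move_to_end(request_id)  # mark as most recently used
        return response

    def put(self, request_id, response):
        if request_id in self._store:
            self._store.move_to_end(request_id)
        self._store[request_id] = (response, time.monotonic() + self.ttl)
        if len(self._store) > self.capacity: # evict least-recently-used entry
            self._store.popitem(last=False)
```

On a cache miss the caller falls through to the database and calls `put` with the result; hot songs then stay pinned near the "recent" end of the map.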
Problem 2: Photo Similarity Detection (1M Scale)
Problem: Your photo app is drowning in duplicates. It needs to sift through a user's library, flagging near-similar images to suggest a cleanup. What's the move?
Solution:
- Concept: Locality-Sensitive Hashing (LSH)
- Design: Convert images to 512-dim feature vectors • Generate similarity-preserving hash codes • Group photos that land in the same buckets
- Rough Performance: 200MB memory, 10K photos/sec, 95% accuracy
- The "Why": Exact pixel comparison across 1M photos is $O(n^2)$, on the order of a trillion operations. LSH finds near-matches in roughly $O(n)$ time.
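One common way to build similarity-preserving hash codes is random-hyperplane (SimHash-style) LSH: each hyperplane contributes one signature bit, and vectors at a small angle to each other tend to get identical signatures. A sketch, with the dimension, bit count, and random feature vectors as stand-in assumptions:

```python
import random
from collections import defaultdict

def lsh_signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0
                 for plane in hyperplanes)

# 16 random hyperplanes in a 512-dim feature space (sizes are illustrative)
rng = random.Random(0)
DIM, BITS = 512, 16
planes = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]

# Bucket photos by signature; only photos sharing a bucket need an exact check.
buckets = defaultdict(list)
for photo_id in range(1000):
    features = [rng.gauss(0, 1) for _ in range(DIM)]  # stand-in for real features
    buckets[lsh_signature(features, planes)].append(photo_id)
```

The expensive pairwise comparison then runs only inside each bucket, which is what drops the overall cost from quadratic to near-linear.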
Problem 3: CDN Cache Check (10M Scale)
Problem: A CDN like Akamai needs a quick check to see if a URL is already cached on an edge server. Speed is of the essence. What's the play?
Solution:
- Concept: Bloom Filter
- Design: In-memory bit array for quick "definitely not cached" checks • Only hit disk if the filter returns "maybe cached"
- Rough Performance: 20MB memory, 100K checks/sec, ~95% space savings vs a hash table
- The "Why": A disk seek takes ~10ms vs a 0.01ms memory lookup. A hash table for 10M URLs would need 400MB+ of RAM; a Bloom filter needs ~20MB.
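A minimal bit-array Bloom filter along these lines; the sizing is illustrative, and `hashlib` is used here only as a convenient way to derive k independent hash positions:

```python
import hashlib

class BloomFilter:
    """Answers 'definitely not present' or 'maybe present' from a bit array."""

    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive k positions by salting the item with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

An edge server would call `might_contain(url)` first and pay for a disk lookup only on "maybe"; a `False` answer is guaranteed correct, which is what makes the skip safe.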
Problem 4: Social Media Recommendations (100M Scale)
Problem: Instagram wants to suggest "similar users" to follow. It's all about interests, follow lists, and hashtags. How do you make it happen?
Solution:
- Concept: Locality-Sensitive Hashing (LSH)
- Design: Represent users as sparse feature vectors (likes, follows) • Hash similar users to the same buckets • Pull recommendations from a user's own bucket
- Rough Performance: 2GB memory, 50K users/sec, 90% relevance
- The "Why": Computing exact similarity between all pairs of 100M users requires $10^{16}$ comparisons. LSH reduces the search space to a localized bucket of likely candidates.
Problem 5: Web Crawler Deduplication (1B Scale)
Problem: A web crawler like Googlebot needs to avoid re-crawling billions of URLs. It's not just about saving bandwidth; it's about efficiency. What's the strategy?
Solution:
- Concept: Bloom Filter
- Design: One large Bloom filter tracks all visited URLs • Check each URL before crawling • A 1% false-positive rate is acceptable (it skips ~1% of new pages)
- Rough Performance: 1.2GB memory, 1M URLs/sec, 99% deduplication
- The "Why": A standard hash table for 1B URLs would require 32GB+ of RAM. A Bloom filter fits comfortably in the memory of a single machine.
Problem 6: Username Availability (1B Scale)
Problem: As users type their desired username, you want to instantly flag "Username Taken" without pestering the server with every keystroke. What's the plan?
Solution:
- Concept: Bloom Filter
- Design: Periodically ship a compressed Bloom filter of taken usernames to the client/app • Perform the "taken" check locally
- Rough Performance: 10MB download, instant client-side checks
- The "Why": Replaces a ~100ms API call with a 0.01ms local check. A false positive simply tells the user a name is taken when it's actually available, a minor UX friction compared to lag.
Problem 7: Document Near-Duplicates (10B Scale)
Problem: A search engine needs to weed out near-duplicate pagesβsame content, different ads. How do you keep the results clean?
Solution:
- Concept: Locality-Sensitive Hashing (LSH / MinHash)
- Design: Convert docs to sets of word shingles • Create MinHash fingerprints • Group near-identical fingerprints
- Rough Performance: 40GB memory, 500K docs/sec, 85% detection rate
- The "Why": Comparing the full text of every pair of 10B documents is computationally infeasible. MinHash/LSH compresses each comparison into a small set of integer hashes.
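The MinHash trick rests on one property: the probability that two sets get the same minimum hash value under a random hash function equals their Jaccard similarity. A small sketch (shingle size and signature length are illustrative choices):

```python
import hashlib

def shingles(text, k=3):
    """The set of k-word shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash(shingle_set, num_hashes=64):
    """MinHash signature: for each salted hash function, keep the minimum
    hash value over the set. P(positions agree) ~ Jaccard similarity."""
    sig = []
    for i in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.sha256(f"{i}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature positions."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Each document shrinks to 64 integers regardless of its length, and "same content, different ads" pages end up with signatures that agree in most positions.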
Problem 8: Log Storage Compression (100M Scale)
Problem: Your system churns out 100M log entries daily, but it's the same 1,000 strings on repeat. How do you cut the bloat?
Solution:
- Concept: Dictionary Encoding
- Design: Map the 1,000 unique strings to 10-bit IDs • Store only the IDs in the main log table
- Rough Performance: 40GB raw logs → 2GB stored (95% reduction)
- The "Why": Storing the string "ConnectionTimeoutError" (22 bytes) millions of times is wasteful. Storing a 2-byte ID achieves massive space savings.
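The encoding side is just a lookup table built on first sight of each string; a minimal sketch with made-up log messages:

```python
def dictionary_encode(messages):
    """Replace repeated strings with small integer IDs plus a decode table."""
    table = {}   # string -> ID, assigned on first appearance
    ids = []
    for msg in messages:
        if msg not in table:
            table[msg] = len(table)
        ids.append(table[msg])
    decode = {i: s for s, i in table.items()}
    return ids, decode

logs = ["ConnectionTimeoutError", "OK", "OK", "ConnectionTimeoutError", "OK"]
ids, decode = dictionary_encode(logs)
# Each log row now stores a small int; the full strings live once in `decode`.
assert [decode[i] for i in ids] == logs
```

With 1,000 distinct strings the IDs fit in 10 bits, so they can also be bit-packed (as in Problem 11) rather than stored as 2-byte shorts.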
Problem 9: Time Series Analytics (1B Scale)
Problem: An IoT platform tracks temperature sensors every second. Timestamps are predictable. How do you store them efficiently?
Solution:
- Concept: Delta Encoding
- Design: Store a base timestamp (e.g., midnight) • For each subsequent row, store only the +1s or +2s difference (delta)
- Rough Performance: 32GB raw timestamps → 4GB stored
- The "Why": A full 8-byte timestamp per row is redundant when values are predictable. Deltas can be further compressed using bit packing for an even smaller footprint.
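A round-trip sketch of the scheme, with hypothetical Unix timestamps:

```python
def delta_encode(timestamps):
    """Store the first value plus successive differences."""
    base = timestamps[0]
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return base, deltas

def delta_decode(base, deltas):
    """Rebuild the original series by running-summing the deltas."""
    out = [base]
    for d in deltas:
        out.append(out[-1] + d)
    return out

ts = [1700000000, 1700000001, 1700000002, 1700000004]
base, deltas = delta_encode(ts)  # deltas = [1, 1, 2]: tiny ints vs 8-byte stamps
assert delta_decode(base, deltas) == ts
```

Since the deltas for a once-per-second sensor are almost always 1 or 2, they compress down to a few bits each.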
Problem 10: E-commerce User Behavior (10M Scale)
Problem: An e-commerce site tracks user clicks. Sessions are repetitive. How do you store this data without redundancy?
Solution:
- Concept: Run-Length Encoding (RLE) + Dictionary Encoding
- Design: Dictionary-encode the action types • Use RLE for consecutive repeated actions (e.g., (view_item, 3))
- Rough Performance: 10x compression ratio for session data
- The "Why": User behavior is repetitive. RLE eliminates the redundancy of sequential identical events.
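The RLE half can be written directly with `itertools.groupby`; the session events here are made-up examples:

```python
from itertools import groupby

def rle_encode(events):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(events)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the event stream."""
    return [value for value, count in pairs for _ in range(count)]

session = ["view_item", "view_item", "view_item", "add_to_cart", "view_item"]
assert rle_encode(session) == [("view_item", 3), ("add_to_cart", 1), ("view_item", 1)]
```

In the full design the string actions would first be dictionary-encoded, so each stored pair is just two small integers.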
Problem 11: Rating System Storage (100M Scale)
Problem: You're storing 100M ratings for a recommendation engine. Ratings are 1 to 5. How do you store them efficiently?
Solution:
- Concept: Bit Packing
- Design: A value from 1-5 needs only 3 bits ($2^3 = 8$ possible values) • Pack multiple 3-bit ratings into a single 32-bit or 64-bit word
- Rough Performance: 400MB raw integers → 50MB packed storage
- The "Why": Using a full 32-bit integer for a value that never exceeds 5 wastes ~90% of the allocated space. Bit packing maximizes density.
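A sketch of the packing arithmetic using Python's arbitrary-precision ints (a real store would pack into fixed 32- or 64-bit words):

```python
def pack_ratings(ratings, bits=3):
    """Pack small integers into one integer, `bits` bits per value."""
    packed = 0
    for i, r in enumerate(ratings):
        packed |= r << (i * bits)  # each rating occupies its own 3-bit slot
    return packed

def unpack_ratings(packed, count, bits=3):
    """Read `count` values back out by masking each slot."""
    mask = (1 << bits) - 1
    return [(packed >> (i * bits)) & mask for i in range(count)]

ratings = [5, 1, 3, 4, 2]
packed = pack_ratings(ratings)  # 15 bits total instead of 5 * 32 bits
assert unpack_ratings(packed, len(ratings)) == ratings
```

Shift-and-mask extraction keeps lookups O(1), so the compression costs almost nothing at read time.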
Key insight: the scale (millions vs. billions) plus the problem type ("lookup," "similarity," "storage") immediately points to the right data-structure family. Real systems combine multiple techniques for 10-100x improvements.