Data Structures: Problem-Solving Patterns

Interview Gold: 50% of System Design Questions

One line: When you see "fast lookups," "data patterns," or "storage optimization" in a problem, reach for hashing data structures and compression techniques.

O(1): average lookup time
Deduplication: remove duplicates fast
Cache check: is it worth checking disk?
Compression: 10-100x space savings

When to Use What: Scale-Driven Decisions

Hashing

Scale       Use Case                     Tool
1M items    Fast lookups                 Hash Table
1B+ items   Space-efficient membership   Bloom Filter
Any scale   Similarity search            LSH

Compression

Pattern                   Ratio     Tool
Repeated values           10-100x   RLE
Low-cardinality strings   5-20x     Dictionary
Sequential data           5-10x     Delta
Small integers            4-8x      Bit Packing

11 Real-World Problems

Hashing Problems (1M → 10B Scale)

1M Scale

🚀 Startup API Cache

Problem: A music streaming API serves metadata (artist, album, genre, duration) for 1M songs; heavy queries hit multiple database tables during peak hours.
Concept: Hash Table
Design: In-memory HashMap: request_id → cached_response • TTL expiration • LRU eviction
    Rough Performance: 100MB memory, 50K lookups/sec, 100% accuracy
    Why: Database queries take 50-200ms vs 1ms hash lookup. Storing full responses avoids expensive DB connections.
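A minimal sketch of this design, assuming a single-process Python service; the class and parameter names (TTLLRUCache, capacity, ttl_seconds) are illustrative, not from any specific framework:

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """Hash-table cache: request_id -> cached_response, with TTL expiry and LRU eviction."""

    def __init__(self, capacity=100_000, ttl_seconds=300):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.store = OrderedDict()                # key -> (expires_at, value)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None                           # miss: caller falls back to the database
        expires_at, value = entry
        if time.time() > expires_at:
            del self.store[key]                   # expired: treat as a miss
            return None
        self.store.move_to_end(key)               # mark as most recently used
        return value

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = (time.time() + self.ttl, value)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)        # evict least recently used entry
```

On a miss you run the database query once and put() the response; every later hit skips the 50-200ms round trip.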

1M Scale

📸 Photo Similarity Detection

Problem: Photo app detects duplicate/similar images using visual features
Concept: LSH (Locality-Sensitive Hashing)
Design: Convert images to 512-dim vectors • Generate similarity-preserving hash codes • Group similar photos in same buckets
    Rough Performance: 200MB memory, 10K photos/sec, 95% accuracy
    Why: Exact pixel comparison is O(n²) across 1M photos = 1 trillion operations. Computing full similarity matrices would need 1TB+ RAM.
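One common way to get "similarity-preserving hash codes" for dense vectors is random-hyperplane LSH; a sketch assuming the 512-dim embeddings already exist, with the bit width chosen arbitrarily:

```python
import numpy as np

class RandomHyperplaneLSH:
    """Hash dense vectors so that vectors with high cosine similarity collide often."""

    def __init__(self, dim=512, n_bits=16, seed=42):
        rng = np.random.default_rng(seed)
        # Each row is a random hyperplane; the sign pattern of the projections is the hash code.
        self.planes = rng.standard_normal((n_bits, dim))

    def hash(self, vector):
        bits = (self.planes @ vector) >= 0
        return int("".join("1" if b else "0" for b in bits), 2)

def build_buckets(vectors, lsh):
    """Group photo indices by hash code; duplicate candidates share a bucket."""
    buckets = {}
    for idx, vec in enumerate(vectors):
        buckets.setdefault(lsh.hash(vec), []).append(idx)
    return buckets
```

Real systems typically keep several independent hash tables and verify candidates with an exact distance check, trading a little memory for higher recall.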

10M Scale

🌐 CDN Cache Check

Problem: Akamai checks if URL is cached on edge server before checking disk
Concept: Bloom Filter
Design: In-memory filter for quick "definitely not cached" checks • Only check disk if "maybe cached" • False positives = unnecessary disk check (acceptable)
    Rough Performance: 20MB memory, 100K checks/sec, ~95% space savings vs a hash table
    Why: Disk seeks take 10ms vs 0.01ms memory lookup. Hash table for 10M URLs needs 400MB RAM vs 20MB Bloom filter.
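A minimal Bloom filter sketch in pure Python, sized with the standard formulas; the sizing parameters are illustrative:

```python
import hashlib
import math

class BloomFilter:
    """Answers 'definitely not cached' or 'maybe cached' for membership checks."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing: m bits and k hash functions for the target false-positive rate.
        self.m = math.ceil(-expected_items * math.log(false_positive_rate) / (math.log(2) ** 2))
        self.k = max(1, round((self.m / expected_items) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1      # odd step for double hashing
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

If might_contain(url) is False, serve from origin without touching disk; only a "maybe" answer costs a disk lookup, and the occasional false positive just wastes one seek.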

100M Scale

📱 Social Media Recommendations

Problem: Instagram recommends similar users based on interests/behavior
Concept: LSH
Design: Represent users as sparse feature vectors (likes, follows, hashtags) • Hash similar users to same buckets • Recommend from same/nearby buckets
    Performance: 2GB memory, 50K users/sec, 90% relevance
    Why: Exact similarity calculation across 100M users = 10^16 comparisons. A full user similarity matrix would require tens of petabytes of storage.
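For sparse sets (liked posts, followed accounts, hashtags), MinHash is the usual LSH family; a rough sketch where the signature length and band count are arbitrary choices, and each user is assumed to have a non-empty item set:

```python
import hashlib

def minhash_signature(items, num_hashes=64):
    """Approximate Jaccard similarity: similar item sets produce similar signatures."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
            for item in items
        ))
    return signature

def lsh_band_keys(signature, bands=16):
    """Split the signature into bands; users sharing any band key become candidates."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]

# Candidate generation: index every user under each band key,
# then recommend from users that share at least one key.
```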

1B Scale

šŸ•·ļø Web Crawler Deduplication

Problem: Google's crawler avoids re-crawling already visited web pages
Concept: Bloom Filter
Design: Bloom filter tracks all crawled URLs • Check before crawling new URL • 1% false positive = skip 1% of new pages (acceptable)
    Performance: 1.2GB memory, 1M URLs/sec, 99% deduplication
    Why: Hash table for 1B URLs needs 32GB+ RAM vs 1.2GB Bloom filter. Database lookups add network latency to every crawl decision.
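The 1.2GB figure follows directly from the standard Bloom filter sizing formula m = -n * ln(p) / (ln 2)^2; a quick back-of-the-envelope check:

```python
import math

n = 1_000_000_000      # crawled URLs
p = 0.01               # 1% false-positive rate

bits = -n * math.log(p) / (math.log(2) ** 2)   # optimal filter size in bits
hashes = (bits / n) * math.log(2)              # optimal number of hash functions

print(f"{bits / 8 / 1024**3:.2f} GiB, {hashes:.1f} hash functions")
# roughly 1.1 GiB and ~7 hash functions, versus 32GB+ for a full hash table
```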

1B Scale

👤 Username Availability

Problem: Instagram provides instant feedback on username availability as user types
Concept: Bloom Filter
Design: Download a Bloom filter of taken usernames to the client (sharded, e.g., by username prefix, to keep the download small) • Real-time checks without server round trips • False positive shows "taken" for an available name (user tries another)
    Performance: ~10MB download per shard, instant client checks, ~99% accuracy
    Why: Server API calls add 50-100ms network latency per keystroke. A single filter covering all 1B usernames at 1% false positives would be ~1.2GB, and a hash table would need 32GB+ RAM per client.

10B Scale

📄 Document Near-Duplicates

Problem: Search engine detects near-duplicate web pages for ranking/filtering
Concept: LSH
Design: Convert documents to hash fingerprints • Similar documents hash to same values • Group near-duplicates for content quality scoring
    Performance: 40GB memory, 500K docs/sec, 85% duplicate detection
    Why: Full-text comparison across 10B documents = O(n²) operations. Storing complete document similarity matrices would need petabytes of storage.
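A common fingerprinting choice for near-duplicate text is SimHash over token hashes (one option among several, not necessarily what any particular engine uses); a compact sketch with deliberately naive tokenization and no term weighting:

```python
import hashlib

def simhash(text, bits=64):
    """Near-duplicate documents get fingerprints with small Hamming distance."""
    weights = [0] * bits
    for token in text.lower().split():                       # naive whitespace tokenizer
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

# Pages whose 64-bit fingerprints differ in only a few bits are near-duplicate
# candidates; splitting the fingerprint into bands gives the LSH bucketing above.
```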

Compression Problems (10M → 1B Scale)

100M Scale

šŸ“ Log Storage Compression

Problem: Server logs with repeated error messages need efficient storage
Concept: Dictionary Encoding
Design: Dictionary maps error messages to IDs • Store message ID instead of full text • Compress with zstd
    Performance: 40GB logs → 2GB storage, 95% space savings
    Why: Storing full error messages wastes space. 1000 unique errors × 100 chars = 100KB dict vs 40GB full text storage.
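A dictionary-encoding sketch; the field layout is illustrative, and a real pipeline would store the ID stream in a columnar format before applying zstd:

```python
def dictionary_encode(messages):
    """Replace repeated error strings with small integer IDs."""
    dictionary, encoded = {}, []
    for msg in messages:
        if msg not in dictionary:
            dictionary[msg] = len(dictionary)   # first occurrence defines the ID
        encoded.append(dictionary[msg])
    return dictionary, encoded

def dictionary_decode(dictionary, encoded):
    reverse = {v: k for k, v in dictionary.items()}
    return [reverse[i] for i in encoded]

# With ~1000 unique errors, each log line stores a 2-byte ID instead of a
# ~100-char string; zstd on the ID stream then squeezes out remaining repetition.
```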

1B Scale

📊 Time Series Analytics

Problem: IoT sensor data with sequential timestamps needs fast compression
Concept: Delta Encoding
Design: Store base timestamp • Compress deltas between readings • Use bit packing for small deltas
    Performance: 32GB raw data → 4GB compressed, 1M points/sec ingestion
    Why: Full timestamps take 8 bytes each vs 1-2 bytes for deltas. Sequential patterns compress 10-20x better.
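A delta-encoding sketch for sorted timestamps; varint-style byte packing here stands in for the bit packing mentioned above (the encoding is illustrative, not a specific wire format), and readings are assumed to arrive in order so deltas are non-negative:

```python
def delta_encode(timestamps):
    """Store the base timestamp plus small gaps instead of full 8-byte values."""
    base = timestamps[0]
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return base, deltas

def varint_pack(deltas):
    """Pack each delta into one or more bytes, 7 bits at a time (LEB128-style)."""
    out = bytearray()
    for d in deltas:
        while True:
            byte = d & 0x7F
            d >>= 7
            out.append(byte | (0x80 if d else 0))
            if not d:
                break
    return bytes(out)

# Readings one second apart pack into a single byte each,
# versus 8 bytes for a raw 64-bit epoch timestamp.
```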

10M Scale

🛒 E-commerce User Behavior

Problem: Track user clickstreams with repeated actions across sessions
Concept: RLE + Dictionary Encoding
Design: Dictionary for action types • RLE for consecutive same actions • Bit pack user IDs
    Performance: 800MB sessions → 80MB storage, 10x compression ratio
    Why: Users repeat actions (scroll, scroll, click). Full action strings waste space vs 1-byte action codes + run counts.
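A sketch combining dictionary codes with run-length encoding for one user's session; the action names and their codes are made up for illustration:

```python
from itertools import groupby

ACTION_CODES = {"scroll": 0, "click": 1, "view": 2, "add_to_cart": 3}   # 1-byte dictionary

def rle_encode(actions):
    """Collapse runs of the same action into (code, run_length) pairs."""
    return [(ACTION_CODES[a], sum(1 for _ in run)) for a, run in groupby(actions)]

session = ["scroll", "scroll", "scroll", "click", "view", "scroll", "scroll"]
print(rle_encode(session))
# [(0, 3), (1, 1), (2, 1), (0, 2)]  -> 4 small pairs instead of 7 full strings
```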

100M Scale

⭐ Rating System Storage

Problem: Store 1-5 star ratings efficiently for a recommendation engine
Concept: Bit Packing
Design: Pack 8 ratings per 32-bit word • Group by user for cache locality • Use RLE for rating patterns
    Performance: 400MB raw ratings → 50MB storage, 8x compression
    Why: Each rating needs only 3 bits to cover the values 1-5 (4 bits when packing 8 per word) vs a 32-bit integer. Bit packing plus run patterns deliver the 8x savings.
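A bit-packing sketch matching the "8 ratings per 32-bit word" layout, using 4 bits per rating; 3 bits is the theoretical minimum, but 4 keeps the packing word-aligned:

```python
def pack_ratings(ratings):
    """Pack 1-5 star ratings, 4 bits each, 8 per 32-bit word."""
    words = []
    for i in range(0, len(ratings), 8):
        word = 0
        for j, rating in enumerate(ratings[i:i + 8]):
            word |= (rating & 0xF) << (4 * j)
        words.append(word)
    return words

def unpack_ratings(words, count):
    ratings = []
    for word in words:
        for j in range(8):
            ratings.append((word >> (4 * j)) & 0xF)
    return ratings[:count]

print(pack_ratings([5, 4, 4, 3, 5, 1, 2, 5]))   # one 32-bit word instead of eight ints
```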


Interview Recognition Patterns

Hashing Triggers

"Have we seen this before?" / "Fast lookups" / "Membership testing"

Compression Triggers

"Storage costs too much" / "Data has patterns" / "Query performance"

Interview Framework

  1. Scale: 1M vs 100M vs 1B items determines tool choice

  2. Accuracy: 100% exact vs ~99% approximate trade-offs

  3. Memory: RAM abundant vs space-constrained environments

  4. Patterns: Identify repeated, sequential, or categorical data

  5. Combine: Chain hashing + compression for optimal results

Key insight: Scale numbers (millions vs billions) + problem type ("lookup," "similarity," "storage") immediately point to the right data structure family. Real systems combine multiple techniques for 10-100x improvements.