Practice Problems for Basic Systems Design
Hints: hashing, Bloom filters, locality-sensitive hashing, and compression data structures.
Problem 1: Startup API Cache (1M Scale)
Problem: A music streaming API is buckling under the weight of 1M song metadata requests. Each query drags through multiple database tables during peak hours, and latency is the unwelcome guest. How do you cut the wait?
Solution:
- Concept: Hash Table
- Design: In-memory HashMap mapping request_id → cached_response • Expire entries periodically • Evict least-recently-used entries
- Rough Performance: 100MB memory, 50K lookups/sec, 100% accuracy
- The "Why": Database queries take 50-200ms vs a 0.01ms hash lookup. Caching "popular" results avoids expensive DB connections and queries.
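A minimal Python sketch of this design, using `collections.OrderedDict` for LRU ordering; the capacity and TTL values are illustrative assumptions, not figures from the problem:

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """In-memory cache: LRU eviction plus per-entry expiry."""

    def __init__(self, capacity=100_000, ttl_seconds=300):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self._store = OrderedDict()  # request_id -> (response, expires_at)

    def get(self, request_id):
        entry = self._store.get(request_id)
        if entry is None:
            return None
        response, expires_at = entry
        if time.monotonic() > expires_at:    # expired: drop and report a miss
            del self._store[request_id]
            return None
        self._store.move_to_end(request_id)  # mark as most recently used
        return response

    def put(self, request_id, response):
        if request_id in self._store:
            self._store.move_to_end(request_id)
        self._store[request_id] = (response, time.monotonic() + self.ttl)
        if len(self._store) > self.capacity: # evict least-recently-used entry
            self._store.popitem(last=False)
```

On a cache miss the caller falls through to the database and calls `put` with the result; hot songs then stay pinned near the "recent" end of the map.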
Problem 2: Photo Similarity Detection (1M Scale)
Problem: Your photo app is drowning in duplicates. It needs to sift through a user's library, flagging near-similar images to suggest a cleanup. What's the move?
Solution:
- Concept: Locality-Sensitive Hashing (LSH)
- Design: Convert images to 512-dim feature vectors • Generate similarity-preserving hash codes • Group photos that land in the same buckets
- Rough Performance: 200MB memory, 10K photos/sec, 95% accuracy
- The "Why": Exact pixel comparison across 1M photos is $O(n^2)$, on the order of a trillion operations. LSH finds near-matches in roughly $O(n)$ time.
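One common way to build similarity-preserving hash codes is random-hyperplane (SimHash-style) LSH: each hyperplane contributes one signature bit, and vectors at a small angle to each other tend to get identical signatures. A sketch, with the dimension, bit count, and random feature vectors as stand-in assumptions:

```python
import random
from collections import defaultdict

def lsh_signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0
                 for plane in hyperplanes)

# 16 random hyperplanes in a 512-dim feature space (sizes are illustrative)
rng = random.Random(0)
DIM, BITS = 512, 16
planes = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]

# Bucket photos by signature; only photos sharing a bucket need an exact check.
buckets = defaultdict(list)
for photo_id in range(1000):
    features = [rng.gauss(0, 1) for _ in range(DIM)]  # stand-in for real features
    buckets[lsh_signature(features, planes)].append(photo_id)
```

The expensive pairwise comparison then runs only inside each bucket, which is what drops the overall cost from quadratic to near-linear.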
Problem 3: CDN Cache Check (10M Scale)
Problem: A CDN like Akamai needs a quick check to see if a URL is already cached on an edge server. Speed is of the essence. What's the play?
Solution:
- Concept: Bloom Filter
- Design: In-memory bit array for quick "definitely not cached" checks • Only hit disk if the filter returns "maybe cached"
- Rough Performance: 20MB memory, 100K checks/sec, ~95% space savings vs a hash table
- The "Why": A disk seek takes ~10ms vs a 0.01ms memory lookup. A hash table for 10M URLs would need 400MB+ of RAM; a Bloom filter needs ~20MB.
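A minimal bit-array Bloom filter along these lines; the sizing is illustrative, and `hashlib` is used here only as a convenient way to derive k independent hash positions:

```python
import hashlib

class BloomFilter:
    """Answers 'definitely not present' or 'maybe present' from a bit array."""

    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive k positions by salting the item with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

An edge server would call `might_contain(url)` first and pay for a disk lookup only on "maybe"; a `False` answer is guaranteed correct, which is what makes the skip safe.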
Problem 4: Social Media Recommendations (100M Scale)
Problem: Instagram wants to suggest "similar users" to follow. It's all about interests, follow lists, and hashtags. How do you make it happen?
Solution:
- Concept: Locality-Sensitive Hashing (LSH)
- Design: Represent users as sparse feature vectors (likes, follows) • Hash similar users to the same buckets • Pull recommendations from a user's own bucket
- Rough Performance: 2GB memory, 50K users/sec, 90% relevance
- The "Why": Computing exact similarity between all pairs of 100M users requires $10^{16}$ comparisons. LSH reduces the search space to a localized bucket of likely candidates.
Problem 5: Web Crawler Deduplication (1B Scale)
Problem: A web crawler like Googlebot needs to avoid re-crawling billions of URLs. It's not just about saving bandwidth; it's about efficiency. What's the strategy?
Solution:
- Concept: Bloom Filter
- Design: One large Bloom filter tracks all visited URLs • Check each URL before crawling • A 1% false-positive rate is acceptable (it skips ~1% of new pages)
- Rough Performance: 1.2GB memory, 1M URLs/sec, 99% deduplication
- The "Why": A standard hash table for 1B URLs would require 32GB+ of RAM. A Bloom filter fits comfortably in the memory of a single machine.
Problem 6: Username Availability (1B Scale)
Problem: As users type their desired username, you want to instantly flag "Username Taken" without pestering the server with every keystroke. What's the plan?
Solution:
- Concept: Bloom Filter
- Design: Periodically ship a compressed Bloom filter of taken usernames to the client/app • Perform the "taken" check locally
- Rough Performance: 10MB download, instant client-side checks
- The "Why": Replaces a ~100ms API call with a 0.01ms local check. A false positive simply tells the user a name is taken when it's actually available, a minor UX friction compared to lag.
Problem 7: Document Near-Duplicates (10B Scale)
Problem: A search engine needs to weed out near-duplicate pagesβsame content, different ads. How do you keep the results clean?
Solution:
- Concept: Locality-Sensitive Hashing (LSH / MinHash)
- Design: Convert docs to sets of word shingles • Create MinHash fingerprints • Group near-identical fingerprints
- Rough Performance: 40GB memory, 500K docs/sec, 85% detection rate
- The "Why": Comparing the full text of every pair of 10B documents is computationally infeasible. MinHash/LSH compresses each comparison into a small set of integer hashes.
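The MinHash trick rests on one property: the probability that two sets get the same minimum hash value under a random hash function equals their Jaccard similarity. A small sketch (shingle size and signature length are illustrative choices):

```python
import hashlib

def shingles(text, k=3):
    """The set of k-word shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash(shingle_set, num_hashes=64):
    """MinHash signature: for each salted hash function, keep the minimum
    hash value over the set. P(positions agree) ~ Jaccard similarity."""
    sig = []
    for i in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.sha256(f"{i}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature positions."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Each document shrinks to 64 integers regardless of its length, and "same content, different ads" pages end up with signatures that agree in most positions.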
Problem 8: Log Storage Compression (100M Scale)
Problem: Your system churns out 100M log entries daily, but it's the same 1,000 strings on repeat. How do you cut the bloat?
Solution:
- Concept: Dictionary Encoding
- Design: Map the 1,000 unique strings to 10-bit IDs • Store only the IDs in the main log table
- Rough Performance: 40GB raw logs → 2GB stored (95% reduction)
- The "Why": Storing the string "ConnectionTimeoutError" (22 bytes) millions of times is wasteful. Storing a 2-byte ID achieves massive space savings.
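The encoding side is just a lookup table built on first sight of each string; a minimal sketch with made-up log messages:

```python
def dictionary_encode(messages):
    """Replace repeated strings with small integer IDs plus a decode table."""
    table = {}   # string -> ID, assigned on first appearance
    ids = []
    for msg in messages:
        if msg not in table:
            table[msg] = len(table)
        ids.append(table[msg])
    decode = {i: s for s, i in table.items()}
    return ids, decode

logs = ["ConnectionTimeoutError", "OK", "OK", "ConnectionTimeoutError", "OK"]
ids, decode = dictionary_encode(logs)
# Each log row now stores a small int; the full strings live once in `decode`.
assert [decode[i] for i in ids] == logs
```

With 1,000 distinct strings the IDs fit in 10 bits, so they can also be bit-packed (as in Problem 11) rather than stored as 2-byte shorts.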
Problem 9: Time Series Analytics (1B Scale)
Problem: An IoT platform tracks temperature sensors every second. Timestamps are predictable. How do you store them efficiently?
Solution:
- Concept: Delta Encoding
- Design: Store a base timestamp (e.g., midnight) • For each subsequent row, store only the +1s or +2s difference (delta)
- Rough Performance: 32GB raw timestamps → 4GB stored
- The "Why": A full 8-byte timestamp per row is redundant when values are predictable. Deltas can be further compressed using bit packing for an even smaller footprint.
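A round-trip sketch of the scheme, with hypothetical Unix timestamps:

```python
def delta_encode(timestamps):
    """Store the first value plus successive differences."""
    base = timestamps[0]
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return base, deltas

def delta_decode(base, deltas):
    """Rebuild the original series by running-summing the deltas."""
    out = [base]
    for d in deltas:
        out.append(out[-1] + d)
    return out

ts = [1700000000, 1700000001, 1700000002, 1700000004]
base, deltas = delta_encode(ts)  # deltas = [1, 1, 2]: tiny ints vs 8-byte stamps
assert delta_decode(base, deltas) == ts
```

Since the deltas for a once-per-second sensor are almost always 1 or 2, they compress down to a few bits each.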
Problem 10: E-commerce User Behavior (10M Scale)
Problem: An e-commerce site tracks user clicks. Sessions are repetitive. How do you store this data without redundancy?
Solution:
- Concept: Run-Length Encoding (RLE) + Dictionary Encoding
- Design: Dictionary-encode the action types • Use RLE for consecutive repeated actions (e.g., (view_item, 3))
- Rough Performance: 10x compression ratio for session data
- The "Why": User behavior is repetitive. RLE eliminates the redundancy of sequential identical events.
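The RLE half can be written directly with `itertools.groupby`; the session events here are made-up examples:

```python
from itertools import groupby

def rle_encode(events):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(events)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the event stream."""
    return [value for value, count in pairs for _ in range(count)]

session = ["view_item", "view_item", "view_item", "add_to_cart", "view_item"]
assert rle_encode(session) == [("view_item", 3), ("add_to_cart", 1), ("view_item", 1)]
```

In the full design the string actions would first be dictionary-encoded, so each stored pair is just two small integers.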
Problem 11: Rating System Storage (100M Scale)
Problem: You're storing 100M ratings for a recommendation engine. Ratings are 1 to 5. How do you store them efficiently?
Solution:
- Concept: Bit Packing
- Design: A value from 1-5 needs only 3 bits ($2^3 = 8$ possible values) • Pack multiple 3-bit ratings into a single 32-bit or 64-bit word
- Rough Performance: 400MB raw integers → 50MB packed storage
- The "Why": Using a full 32-bit integer for a value that never exceeds 5 wastes ~90% of the allocated space. Bit packing maximizes density.
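A sketch of the packing arithmetic using Python's arbitrary-precision ints (a real store would pack into fixed 32- or 64-bit words):

```python
def pack_ratings(ratings, bits=3):
    """Pack small integers into one integer, `bits` bits per value."""
    packed = 0
    for i, r in enumerate(ratings):
        packed |= r << (i * bits)  # each rating occupies its own 3-bit slot
    return packed

def unpack_ratings(packed, count, bits=3):
    """Read `count` values back out by masking each slot."""
    mask = (1 << bits) - 1
    return [(packed >> (i * bits)) & mask for i in range(count)]

ratings = [5, 1, 3, 4, 2]
packed = pack_ratings(ratings)  # 15 bits total instead of 5 * 32 bits
assert unpack_ratings(packed, len(ratings)) == ratings
```

Shift-and-mask extraction keeps lookups O(1), so the compression costs almost nothing at read time.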
Key insight: the scale (millions vs. billions) plus the problem type ("lookup," "similarity," "storage") immediately points to the right data-structure family. Real systems combine multiple techniques for 10-100x improvements.