Data Structures: Problem Solving Patterns
Interview Gold: 50% of System Design Questions
One line: When you see "fast lookups," "data patterns," or "storage optimization" in a problem, think hashing and compression data structures.
When to Use What: Scale-Driven Decisions
Hashing
| Scale | Use Case | Tool |
|---|---|---|
| 1M items | Fast lookups | Hash Table |
| 1B+ items | Space-efficient | Bloom Filter |
| Any scale | Similarity search | LSH |
Compression
| Pattern | Ratio | Tool |
|---|---|---|
| Repeated values | 10-100x | RLE |
| Unique strings | 5-20x | Dictionary |
| Sequential data | 5-10x | Delta |
| Small integers | 4-8x | Bit Packing |
11 Real-World Problems
Hashing Problems (1M → 10B Scale)
Startup API Cache
Rough Performance: 100MB memory, 50K lookups/sec, 100% accuracy
Why: Database queries take 50-200ms vs 1ms hash lookup. Storing full responses avoids expensive DB connections.
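A minimal sketch of the idea, assuming a hypothetical fetch_from_db query function and a 60-second TTL:

```python
import time

class ResponseCache:
    """In-memory hash-table cache: O(1) average lookup vs a 50-200ms DB query."""

    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, response)

    def get(self, key, fetch_fn):
        entry = self.store.get(key)
        if entry and entry[0] > time.time():
            return entry[1]                       # cache hit: microseconds
        response = fetch_fn(key)                  # cache miss: hit the database
        self.store[key] = (time.time() + self.ttl, response)
        return response

# Usage (fetch_from_db stands in for the real query):
# cache = ResponseCache(ttl_seconds=60)
# data = cache.get("/api/users/42", fetch_from_db)
```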
Photo Similarity Detection
Rough Performance: 200MB memory, 10K photos/sec, 95% accuracy
Why: Exact pixel comparison is O(n²) across 1M photos = 1 trillion operations. Computing full similarity matrices would need 1TB+ RAM.
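One common way to bucket near-identical images (an illustrative choice, not the only hashing option) is a perceptual hash such as dHash; the 9×8 grayscale grid and 5-bit threshold below are assumptions:

```python
def dhash(pixels):
    """Difference hash over a 9x8 grayscale grid -> 64-bit fingerprint.
    Near-identical photos produce fingerprints that differ in only a few bits."""
    bits = 0
    for row in pixels:                     # pixels: 8 rows of 9 brightness values (0-255)
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left < right else 0)
    return bits

def hamming(a, b):
    return bin(a ^ b).count("1")

# Photos whose hashes differ by <= 5 bits are likely near-duplicates, so bucketing
# by fingerprint replaces O(n^2) pixel-by-pixel comparison across the whole library.
```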
CDN Cache Check
Rough Performance: 20MB memory, 100K checks/sec, 99% space savings
Why: Disk seeks take 10ms vs 0.01ms memory lookup. Hash table for 10M URLs needs 400MB RAM vs 20MB Bloom filter.
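A minimal Bloom filter sketch using hashlib; the bit-array size and hash count below are illustrative:

```python
import hashlib

class BloomFilter:
    """Bit array + k hash functions: answers 'definitely not cached' or 'probably cached'."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# Roughly 10 bits per URL gives about a 1% false-positive rate:
# bf = BloomFilter(num_bits=10 * 10_000_000, num_hashes=7)
# bf.add("https://cdn.example.com/img/1.png")
# bf.might_contain("https://cdn.example.com/img/1.png")  # True
# A "False" answer means the object is definitely not cached, so go straight to origin.
```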
Social Media Recommendations
Performance: 2GB memory, 50K users/sec, 90% relevance
Why: Exact similarity calculation across 100M users = 10^16 comparisons. Full user similarity matrices would require 40PB+ storage.
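A hedged MinHash + LSH banding sketch; the signature length, band size, and liked-post sets are illustrative, not a production configuration:

```python
import hashlib

NUM_HASHES = 16   # signature length (real systems use 100+)
BAND_SIZE = 4     # 4 bands of 4 values each

def minhash_signature(item_ids):
    """One min-hash per seed: signatures approximate Jaccard similarity of the item sets."""
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
                       for item in item_ids))
    return sig

def lsh_buckets(sig):
    """Users sharing any band bucket become candidate neighbors; all other pairs are skipped."""
    return [hash(tuple(sig[i:i + BAND_SIZE])) for i in range(0, NUM_HASHES, BAND_SIZE)]

# liked = {"user_a": {"post1", "post2", "post3"}, "user_b": {"post2", "post3", "post4"}}
# Candidates = users colliding in at least one band, instead of 10^16 pairwise checks.
```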
Web Crawler Deduplication
Performance: 1.2GB memory, 1M URLs/sec, 99% deduplication
Why: Hash table for 1B URLs needs 32GB+ RAM vs 1.2GB Bloom filter. Database lookups add network latency to every crawl decision.
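The ~1.2GB figure falls out of the standard Bloom filter sizing formulas; a quick back-of-envelope check:

```python
import math

def bloom_size(n_items, false_positive_rate):
    """Standard Bloom filter sizing: m bits and k hash functions for n items at rate p."""
    m = -n_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    k = (m / n_items) * math.log(2)
    return int(m), round(k)

bits, hashes = bloom_size(1_000_000_000, 0.01)
print(bits / 8 / 1024 ** 3)   # ~1.1 GiB of bit array for 1B URLs at a 1% false-positive rate
print(hashes)                 # ~7 hash functions
```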
Username Availability
Performance: 10MB download, instant client checks, 99% accuracy
Why: Server API calls add 50-100ms network latency per keystroke. Hash table for 1B usernames needs 32GB+ RAM per client.
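A sketch of the client-side flow, assuming a downloaded Bloom filter of taken usernames (like the one above) and a hypothetical check_with_server fallback:

```python
def username_available(name, taken_filter, check_with_server):
    """taken_filter is a Bloom filter of existing usernames shipped to the client."""
    if not taken_filter.might_contain(name):
        return True                    # definitely unused: answer instantly, no network call
    return check_with_server(name)     # "probably taken": rare server round-trip confirms
```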
Document Near-Duplicates
Performance: 40GB memory, 500K docs/sec, 85% duplicate detection
Why: Full-text comparison across 10B documents = O(n²) operations. Storing complete document similarity matrices would need petabytes of storage.
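One way to fingerprint documents for near-duplicate detection is SimHash (an illustrative choice, not necessarily what a given system uses); the 3-bit distance threshold is an assumption:

```python
import hashlib

def simhash(text, bits=64):
    """Weighted vote per bit position over token hashes -> one 64-bit fingerprint per document."""
    counts = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint

def near_duplicates(fp_a, fp_b, max_distance=3):
    return bin(fp_a ^ fp_b).count("1") <= max_distance

# Store only the 8-byte fingerprints (indexed by bit bands) instead of
# comparing 10B full documents pairwise.
```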
Compression Problems (100M → 1B+ Scale)
Log Storage Compression
Performance: 40GB logs → 2GB storage, 95% space savings
Why: Storing full error messages wastes space. 1000 unique errors × 100 chars = 100KB dict vs 40GB full text storage.
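A minimal dictionary-encoding sketch with made-up log messages:

```python
def dictionary_encode(messages):
    """Replace repeated error strings with small integer codes + one shared dictionary."""
    codes, encoded = {}, []
    for msg in messages:
        if msg not in codes:
            codes[msg] = len(codes)      # first occurrence: assign the next code
        encoded.append(codes[msg])
    return codes, encoded

logs = ["disk full", "timeout", "disk full", "disk full", "timeout"]
dictionary, encoded = dictionary_encode(logs)
# dictionary = {'disk full': 0, 'timeout': 1}; encoded = [0, 1, 0, 0, 1]
# 1000 unique errors store their text once; every repeat costs an integer, not ~100 chars.
```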
Time Series Analytics
Performance: 32GB raw data → 4GB compressed, 1M points/sec ingestion
Why: Full timestamps take 8 bytes each vs 1-2 bytes for deltas. Sequential patterns compress 10-20x better.
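A minimal delta-encoding sketch; the epoch-second timestamps are made up:

```python
def delta_encode(timestamps):
    """Store the first value plus small differences instead of full 8-byte timestamps."""
    deltas = [timestamps[0]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        deltas.append(cur - prev)
    return deltas

def delta_decode(deltas):
    values, total = [], 0
    for d in deltas:
        total += d
        values.append(total)
    return values

ts = [1700000000, 1700000001, 1700000002, 1700000004]   # readings 1-2 seconds apart
print(delta_encode(ts))   # [1700000000, 1, 1, 2] -> deltas fit in 1-2 bytes with a varint
```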
E-commerce User Behavior
Performance: 800MB sessions → 80MB storage, 10x compression ratio
Why: Users repeat actions (scroll, scroll, click). Full action strings waste space vs 1-byte action codes + run counts.
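A minimal run-length encoding sketch over a made-up session:

```python
def rle_encode(actions):
    """Collapse repeated actions into (action, run_length) pairs."""
    runs = []
    for a in actions:
        if runs and runs[-1][0] == a:
            runs[-1][1] += 1
        else:
            runs.append([a, 1])
    return runs

session = ["scroll"] * 40 + ["click"] + ["scroll"] * 25 + ["purchase"]
print(rle_encode(session))
# [['scroll', 40], ['click', 1], ['scroll', 25], ['purchase', 1]]
# 67 events -> 4 pairs; with 1-byte action codes that is ~8 bytes instead of hundreds.
```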
Rating System Storage
Performance: 400MB raw ratings → 50MB storage, 8x compression
Why: Each rating needs only 3 bits (1-5 values) vs 32-bit integers. Bit packing + patterns achieve massive savings.
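A minimal bit-packing sketch (3 bits per rating); the rating values are made up:

```python
def pack_ratings(ratings, bits_per_value=3):
    """Pack 1-5 star ratings into 3 bits each instead of a 32-bit int per rating."""
    packed, buffer, filled = bytearray(), 0, 0
    for r in ratings:
        buffer |= r << filled
        filled += bits_per_value
        while filled >= 8:
            packed.append(buffer & 0xFF)   # flush a full byte
            buffer >>= 8
            filled -= 8
    if filled:
        packed.append(buffer & 0xFF)       # flush any remaining bits
    return bytes(packed)

ratings = [5, 4, 4, 3, 5, 1, 2, 5]
print(len(pack_ratings(ratings)))   # 3 bytes vs 32 bytes as 32-bit integers (~10x smaller)
```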
Interview Recognition Patterns
Hashing Triggers
"Have we seen this before?" / "Fast lookups" / "Membership testing"
- Cache lookups, duplicate detection, set operations
- <100M items: Hash Table (exact, fast)
- >1B items: Bloom Filter (approximate, space-efficient)
- Any scale + similarity: LSH (near-matches, recommendations)
Compression Triggers
"Storage costs too much" / "Data has patterns" / "Query performance"
- Repeated values → RLE (10-100x compression)
- Limited unique strings → Dictionary (5-20x compression)
- Sequential data → Delta Encoding (5-10x compression)
- Small integers → Bit Packing (4-8x compression)
- Analytical queries → Columnar + combined techniques (95%+ savings)
Interview Framework
- Scale: 1M vs 100M vs 1B items determines tool choice
- Accuracy: 100% exact vs ~99% approximate trade-offs
- Memory: RAM abundant vs space-constrained environments
- Patterns: Identify repeated, sequential, or categorical data
- Combine: Chain hashing + compression for optimal results
Key insight: Scale numbers (millions vs billions) + problem type ("lookup," "similarity," "storage") immediately point to the right data structure family. Real systems combine multiple techniques for 10-100x improvements.