Data Structures: Problem-Solving Patterns

Interview Gold: 50% of System Design Questions

One line: When you see "fast lookups," "data patterns," or "storage optimization" in a problem, reach for hashing data structures and compression techniques.

O(1): average lookup time
Deduplication: remove duplicates fast
Cache check: is it worth checking disk?
Compression: 10-100x space savings

When to Use What: Scale-Driven Decisions

Hashing

Scale       Use Case                     Tool
1M items    Fast lookups                 Hash Table
1B+ items   Space-efficient membership   Bloom Filter
Any scale   Similarity search            LSH

Compression

Pattern                   Ratio     Tool
Repeated values           10-100x   RLE
Low-cardinality strings   5-20x     Dictionary
Sequential data           5-10x     Delta
Small integers            4-8x      Bit Packing

11 Real-World Problems

Hashing Problems (1M → 10B Scale)

1M Scale

🚀 Startup API Cache

Problem: A music streaming API serves metadata (artist, album, genre, duration) for 1M songs; heavy queries hit multiple database tables during peak hours.
Concept: Hash Table
Design: In-memory HashMap: request_id → cached_response • TTL expiration • LRU eviction
    Rough Performance: 100MB memory, 50K lookups/sec, 100% accuracy
    Why: Database queries take 50-200ms vs 1ms hash lookup. Storing full responses avoids expensive DB connections.
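A minimal sketch of this design, assuming a single-process Python service; the class and parameter names (TTLLRUCache, capacity, ttl_seconds) are illustrative, not from any specific framework:

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """Hash-table cache: request_id -> cached_response, with TTL expiry and LRU eviction."""

    def __init__(self, capacity=100_000, ttl_seconds=300):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.store = OrderedDict()                # key -> (expires_at, value)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None                           # miss: caller falls back to the database
        expires_at, value = entry
        if time.time() > expires_at:
            del self.store[key]                   # expired: treat as a miss
            return None
        self.store.move_to_end(key)               # mark as most recently used
        return value

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = (time.time() + self.ttl, value)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)        # evict least recently used entry
```

On a miss you run the database query once and put() the response; every later hit skips the 50-200ms round trip.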

1M Scale

📸 Photo Similarity Detection

Problem: Photo app detects duplicate/similar images using visual features
Concept: LSH (Locality-Sensitive Hashing)
Design: Convert images to 512-dim vectors • Generate similarity-preserving hash codes • Group similar photos in same buckets
    Rough Performance: 200MB memory, 10K photos/sec, 95% accuracy
    Why: Exact pixel comparison is O(n²) across 1M photos = 1 trillion operations. Computing full similarity matrices would need 1TB+ RAM.
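One common way to get "similarity-preserving hash codes" for dense vectors is random-hyperplane LSH; a sketch assuming the 512-dim embeddings already exist, with the bit width chosen arbitrarily:

```python
import numpy as np

class RandomHyperplaneLSH:
    """Hash dense vectors so that vectors with high cosine similarity collide often."""

    def __init__(self, dim=512, n_bits=16, seed=42):
        rng = np.random.default_rng(seed)
        # Each row is a random hyperplane; the sign pattern of the projections is the hash code.
        self.planes = rng.standard_normal((n_bits, dim))

    def hash(self, vector):
        bits = (self.planes @ vector) >= 0
        return int("".join("1" if b else "0" for b in bits), 2)

def build_buckets(vectors, lsh):
    """Group photo indices by hash code; duplicate candidates share a bucket."""
    buckets = {}
    for idx, vec in enumerate(vectors):
        buckets.setdefault(lsh.hash(vec), []).append(idx)
    return buckets
```

Real systems typically keep several independent hash tables and verify candidates with an exact distance check, trading a little memory for higher recall.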

10M Scale

🌐 CDN Cache Check

Problem: Akamai checks if URL is cached on edge server before checking disk
Concept: Bloom Filter
Design: In-memory filter for quick "definitely not cached" checks • Only check disk if "maybe cached" • False positives = unnecessary disk check (acceptable)
    Rough Performance: 20MB memory, 100K checks/sec, ~95% space savings vs a hash table
    Why: Disk seeks take 10ms vs 0.01ms memory lookup. Hash table for 10M URLs needs 400MB RAM vs 20MB Bloom filter.
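A minimal Bloom filter sketch in pure Python, sized with the standard formulas; the sizing parameters are illustrative:

```python
import hashlib
import math

class BloomFilter:
    """Answers 'definitely not cached' or 'maybe cached' for membership checks."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing: m bits and k hash functions for the target false-positive rate.
        self.m = math.ceil(-expected_items * math.log(false_positive_rate) / (math.log(2) ** 2))
        self.k = max(1, round((self.m / expected_items) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1      # odd step for double hashing
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

If might_contain(url) is False, serve from origin without touching disk; only a "maybe" answer costs a disk lookup, and the occasional false positive just wastes one seek.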

100M Scale

📱 Social Media Recommendations

Problem: Instagram recommends similar users based on interests/behavior
Concept: LSH
Design: Represent users as sparse feature vectors (likes, follows, hashtags) • Hash similar users to same buckets • Recommend from same/nearby buckets
    Performance: 2GB memory, 50K users/sec, 90% relevance
    Why: Exact similarity calculation across 100M users = 10^16 comparisons. A full user similarity matrix would require tens of petabytes of storage.
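For sparse sets (liked posts, followed accounts, hashtags), MinHash is the usual LSH family; a rough sketch where the signature length and band count are arbitrary choices, and each user is assumed to have a non-empty item set:

```python
import hashlib

def minhash_signature(items, num_hashes=64):
    """Approximate Jaccard similarity: similar item sets produce similar signatures."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
            for item in items
        ))
    return signature

def lsh_band_keys(signature, bands=16):
    """Split the signature into bands; users sharing any band key become candidates."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]

# Candidate generation: index every user under each band key,
# then recommend from users that share at least one key.
```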

1B Scale

šŸ•·ļø Web Crawler Deduplication

Problem: Google's crawler avoids re-crawling already visited web pages
Concept: Bloom Filter
Design: Bloom filter tracks all crawled URLs • Check before crawling new URL • 1% false positive = skip 1% of new pages (acceptable)
    Performance: 1.2GB memory, 1M URLs/sec, 99% deduplication
    Why: Hash table for 1B URLs needs 32GB+ RAM vs 1.2GB Bloom filter. Database lookups add network latency to every crawl decision.
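The 1.2GB figure follows directly from the standard Bloom filter sizing formula m = -n * ln(p) / (ln 2)^2; a quick back-of-the-envelope check:

```python
import math

n = 1_000_000_000      # crawled URLs
p = 0.01               # 1% false-positive rate

bits = -n * math.log(p) / (math.log(2) ** 2)   # optimal filter size in bits
hashes = (bits / n) * math.log(2)              # optimal number of hash functions

print(f"{bits / 8 / 1024**3:.2f} GiB, {hashes:.1f} hash functions")
# roughly 1.1 GiB and ~7 hash functions, versus 32GB+ for a full hash table
```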

1B Scale

👤 Username Availability

Problem: Instagram provides instant feedback on username availability as user types
Concept: Bloom Filter
Design: Download a Bloom filter of taken usernames to the client (sharded, e.g., by username prefix, to keep the download small) • Real-time checks without server round trips • False positive shows "taken" for an available name (user tries another)
    Performance: ~10MB download per shard, instant client checks, ~99% accuracy
    Why: Server API calls add 50-100ms network latency per keystroke. A single filter covering all 1B usernames at 1% false positives would be ~1.2GB, and a hash table would need 32GB+ RAM per client.

10B Scale

📄 Document Near-Duplicates

Problem: Search engine detects near-duplicate web pages for ranking/filtering
Concept: LSH
Design: Convert documents to hash fingerprints • Similar documents hash to same values • Group near-duplicates for content quality scoring
    Performance: 40GB memory, 500K docs/sec, 85% duplicate detection
    Why: Full-text comparison across 10B documents = O(n²) operations. Storing complete document similarity matrices would need petabytes of storage.
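A common fingerprinting choice for near-duplicate text is SimHash over token hashes (one option among several, not necessarily what any particular engine uses); a compact sketch with deliberately naive tokenization and no term weighting:

```python
import hashlib

def simhash(text, bits=64):
    """Near-duplicate documents get fingerprints with small Hamming distance."""
    weights = [0] * bits
    for token in text.lower().split():                       # naive whitespace tokenizer
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

# Pages whose 64-bit fingerprints differ in only a few bits are near-duplicate
# candidates; splitting the fingerprint into bands gives the LSH bucketing above.
```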

Compression Problems (10M → 1B Scale)

100M Scale

šŸ“ Log Storage Compression

Problem: Server logs with repeated error messages need efficient storage
Concept: Dictionary Encoding
Design: Dictionary maps error messages to IDs • Store message ID instead of full text • Compress with zstd
    Performance: 40GB logs → 2GB storage, 95% space savings
    Why: Storing full error messages wastes space. 1000 unique errors × 100 chars = 100KB dict vs 40GB full text storage.
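A dictionary-encoding sketch; the field layout is illustrative, and a real pipeline would store the ID stream in a columnar format before applying zstd:

```python
def dictionary_encode(messages):
    """Replace repeated error strings with small integer IDs."""
    dictionary, encoded = {}, []
    for msg in messages:
        if msg not in dictionary:
            dictionary[msg] = len(dictionary)   # first occurrence defines the ID
        encoded.append(dictionary[msg])
    return dictionary, encoded

def dictionary_decode(dictionary, encoded):
    reverse = {v: k for k, v in dictionary.items()}
    return [reverse[i] for i in encoded]

# With ~1000 unique errors, each log line stores a 2-byte ID instead of a
# ~100-char string; zstd on the ID stream then squeezes out remaining repetition.
```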

1B Scale

📊 Time Series Analytics

Problem: IoT sensor data with sequential timestamps needs fast compression
Concept: Delta Encoding
Design: Store base timestamp • Compress deltas between readings • Use bit packing for small deltas
    Performance: 32GB raw data → 4GB compressed, 1M points/sec ingestion
    Why: Full timestamps take 8 bytes each vs 1-2 bytes for deltas. Sequential patterns compress 10-20x better.
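A delta-encoding sketch for sorted timestamps; varint-style byte packing here stands in for the bit packing mentioned above (the encoding is illustrative, not a specific wire format), and readings are assumed to arrive in order so deltas are non-negative:

```python
def delta_encode(timestamps):
    """Store the base timestamp plus small gaps instead of full 8-byte values."""
    base = timestamps[0]
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return base, deltas

def varint_pack(deltas):
    """Pack each delta into one or more bytes, 7 bits at a time (LEB128-style)."""
    out = bytearray()
    for d in deltas:
        while True:
            byte = d & 0x7F
            d >>= 7
            out.append(byte | (0x80 if d else 0))
            if not d:
                break
    return bytes(out)

# Readings one second apart pack into a single byte each,
# versus 8 bytes for a raw 64-bit epoch timestamp.
```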

10M Scale

🛒 E-commerce User Behavior

Problem: Track user clickstreams with repeated actions across sessions
Concept: RLE + Dictionary Encoding
Design: Dictionary for action types • RLE for consecutive same actions • Bit pack user IDs
    Performance: 800MB sessions → 80MB storage, 10x compression ratio
    Why: Users repeat actions (scroll, scroll, click). Full action strings waste space vs 1-byte action codes + run counts.
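A sketch combining dictionary codes with run-length encoding for one user's session; the action names and their codes are made up for illustration:

```python
from itertools import groupby

ACTION_CODES = {"scroll": 0, "click": 1, "view": 2, "add_to_cart": 3}   # 1-byte dictionary

def rle_encode(actions):
    """Collapse runs of the same action into (code, run_length) pairs."""
    return [(ACTION_CODES[a], sum(1 for _ in run)) for a, run in groupby(actions)]

session = ["scroll", "scroll", "scroll", "click", "view", "scroll", "scroll"]
print(rle_encode(session))
# [(0, 3), (1, 1), (2, 1), (0, 2)]  -> 4 small pairs instead of 7 full strings
```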

100M Scale

⭐ Rating System Storage

Problem: Store 1-5 star ratings efficiently for a recommendation engine
Concept: Bit Packing
Design: Pack 8 ratings per 32-bit word • Group by user for cache locality • Use RLE for rating patterns
    Performance: 400MB raw ratings → 50MB storage, 8x compression
    Why: Each rating needs only 3 bits to cover the values 1-5 (4 bits when packing 8 per word) vs a 32-bit integer. Bit packing plus run patterns deliver the 8x savings.
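A bit-packing sketch matching the "8 ratings per 32-bit word" layout, using 4 bits per rating; 3 bits is the theoretical minimum, but 4 keeps the packing word-aligned:

```python
def pack_ratings(ratings):
    """Pack 1-5 star ratings, 4 bits each, 8 per 32-bit word."""
    words = []
    for i in range(0, len(ratings), 8):
        word = 0
        for j, rating in enumerate(ratings[i:i + 8]):
            word |= (rating & 0xF) << (4 * j)
        words.append(word)
    return words

def unpack_ratings(words, count):
    ratings = []
    for word in words:
        for j in range(8):
            ratings.append((word >> (4 * j)) & 0xF)
    return ratings[:count]

print(pack_ratings([5, 4, 4, 3, 5, 1, 2, 5]))   # one 32-bit word instead of eight ints
```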


Interview Recognition Patterns

Hashing Triggers

"Have we seen this before?" / "Fast lookups" / "Membership testing"

Compression Triggers

"Storage costs too much" / "Data has patterns" / "Query performance"

Interview Framework

  1. Scale: 1M vs 100M vs 1B items determines tool choice

  2. Accuracy: 100% exact vs ~99% approximate trade-offs

  3. Memory: RAM abundant vs space-constrained environments

  4. Patterns: Identify repeated, sequential, or categorical data

  5. Combine: Chain hashing + compression for optimal results

Key insight: Scale numbers (millions vs billions) + problem type ("lookup," "similarity," "storage") immediately point to the right data structure family. Real systems combine multiple techniques for 10-100x improvements.