Compression Basics: Making Data Smaller

The Magic of Patterns: 10× Smaller, Same Information

Four Ways to Compress Spotify Data

Run-Length Encoding
Original user_ids (72 bytes): [42,42,42,42,89,89,89,17,17]
Compressed (27 bytes): [(42,4), (89,3), (17,2)]
63% smaller

Dictionary Encoding
Original (54 bytes): ["Willow","Willow","Evermore"]
Dict + Indices (20 bytes): {0:"Willow",1:"Evermore"} [0,0,1]
63% smaller

Delta Encoding
Timestamps (32 bytes):
1704067200 (Jan 1, 00:00)
1704067260 (Jan 1, 00:01)
Base + Deltas (12 bytes): Base: 1704067200, Deltas: [0, 60]
62% smaller

Bit Packing
Ratings (36 bytes): [4,5,3,4,5,3,4,5,3]
3 bits each (4 bytes): 100 101 011 100 101 011 100 101 011
89% smaller

Real Spotify Data Compression

User Listening Pattern
user_id column (1M rows): [42,42,42,42,89,89,89,17,17,17,17,17...] → RLE: 4MB → 200KB (95% reduction)
song_id (limited catalog): → Dictionary: 4MB → 1MB (75% reduction)

Time Series Data
listen_time (sequential): 14:35:00, 14:35:03, 14:35:07... → Delta: 8MB → 2MB (75% reduction)
play_duration (seconds): → Delta from song length: 90% reduction

Combined Techniques
Original column: 1M ratings × 4 bytes = 4MB
Step 1: Bit pack (1-5 needs 3 bits) → 375KB
Step 2: RLE on patterns → 150KB
Step 3: Compress with zstd → 75KB
Final: 4MB → 75KB (98% reduction!)
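
The four encodings illustrated above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular storage engine implements them; the function names (rle_encode, dict_encode, delta_encode, bit_pack) are made up for this example, and the delta encoder assumes differences between consecutive values with a leading 0, which matches the two-timestamp example.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encoding: collapse each run into a (value, count) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dict_encode(values):
    """Dictionary encoding: store each distinct value once, plus small indices."""
    index_of = {}
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(index_of)  # next free index
        indices.append(index_of[value])
    dictionary = {i: value for value, i in index_of.items()}
    return dictionary, indices


def delta_encode(values):
    """Delta encoding: keep the first value as the base, then the differences
    between consecutive values (the first delta is 0 by convention here)."""
    base = values[0]
    deltas = [0] + [b - a for a, b in zip(values, values[1:])]
    return base, deltas


def bit_pack(values, bits):
    """Bit packing: store each small non-negative integer in `bits` bits,
    most significant bits first, zero-padded to a whole number of bytes."""
    acc = 0
    for value in values:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{value} does not fit in {bits} bits")
        acc = (acc << bits) | value
    total_bits = bits * len(values)
    n_bytes = (total_bits + 7) // 8
    acc <<= n_bytes * 8 - total_bits  # pad the last byte with zeros
    return acc.to_bytes(n_bytes, "big")


print(rle_encode([42, 42, 42, 42, 89, 89, 89, 17, 17]))
# [(42, 4), (89, 3), (17, 2)]
print(dict_encode(["Willow", "Willow", "Evermore"]))
# ({0: 'Willow', 1: 'Evermore'}, [0, 0, 1])
print(delta_encode([1704067200, 1704067260]))
# (1704067200, [0, 60])
print(len(bit_pack([4, 5, 3, 4, 5, 3, 4, 5, 3], bits=3)))
# 4
```

The byte counts in the figure depend on how each value is stored (for example 8 bytes per integer); the sketch only shows the transformations, not a particular on-disk layout.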

When to Use What?

Technique     Best For                                                       Compression Ratio   Speed
RLE           Repeated values, sorted data (e.g., user_id after GROUP BY)    10-100×             Very Fast
Dictionary    Low-cardinality strings (e.g., country, genre)                 5-20×               Fast
Delta         Timestamps (always increasing), sequences (e.g., IDs)          5-10×               Fast
Bit Packing   Small integers (e.g., ratings 1-5, boolean flags)              4-8×                Very Fast
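
One way to read the table is as a decision procedure over simple column statistics. The sketch below is a hypothetical heuristic written for illustration only: the function name suggest_encoding, the order of the checks, and every threshold (runs at most half the row count, values below 256, distinct values at most half the row count) are assumptions, not rules taken from a real system.

```python
def suggest_encoding(values):
    """Suggest an encoding for a column from a few cheap statistics.

    Illustrative only: the thresholds are assumptions, not tuned values.
    """
    n = len(values)
    if n == 0:
        return "none"

    # Number of runs of equal adjacent values (few runs favors RLE).
    runs = 1 + sum(1 for a, b in zip(values, values[1:]) if a != b)
    if runs <= n // 2:
        return "rle"

    all_ints = all(isinstance(v, int) for v in values)

    # Small non-negative integers fit in a handful of bits each.
    if all_ints and min(values) >= 0 and max(values) < 256:
        return "bit_packing"

    # Non-decreasing integers (timestamps, IDs) have small deltas.
    if all_ints and all(b >= a for a, b in zip(values, values[1:])):
        return "delta"

    # Few distinct values relative to rows favors a dictionary.
    if len(set(values)) <= n // 2:
        return "dictionary"

    return "none"


print(suggest_encoding([42, 42, 42, 42, 89, 89, 89, 17, 17]))       # rle
print(suggest_encoding([4, 5, 3, 4, 5, 3, 4, 5, 3]))                # bit_packing
print(suggest_encoding([1704067200, 1704067260, 1704067320]))       # delta
print(suggest_encoding(["US", "SE", "US", "US", "SE", "BR"]))       # dictionary
```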

Combining Techniques

The real magic happens when you chain compressions:

Example: 1M User Listening Sessions
Original user_id column:     4MB (1M × 4 bytes)
↓
After RLE (sorted):          200KB (runs of same user)
↓  
After Dictionary:            100KB (only 500K unique users)
↓
After zstd compression:       50KB (general compression)

Final: 4MB → 50KB = 98.75% reduction!
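
The chaining idea can be tried on synthetic data. The sketch below is an approximation under stated assumptions: it uses Python's standard zlib module as a stand-in for zstd (which is not in the standard library), skips the dictionary step, and generates random user IDs, so the sizes it reports will not match the figures above. The helper name chain_compress is made up for this example.

```python
import random
import struct
import zlib
from itertools import groupby


def chain_compress(user_ids):
    """Sort -> RLE -> general-purpose compression.

    Returns the size in bytes after each stage. Values and run counts are
    stored as 4-byte unsigned integers (an assumption of this sketch).
    """
    raw = struct.pack(f"<{len(user_ids)}I", *user_ids)

    # Sorting groups equal IDs together so that RLE sees long runs.
    runs = [(uid, len(list(group))) for uid, group in groupby(sorted(user_ids))]
    flat = [x for pair in runs for x in pair]
    rle = struct.pack(f"<{len(flat)}I", *flat)

    # zlib stands in for zstd here; both are general-purpose compressors.
    packed = zlib.compress(rle, 9)

    return {"raw": len(raw), "rle": len(rle), "rle+zlib": len(packed)}


random.seed(0)
# Synthetic data: 100,000 listening sessions spread over 1,000 users.
ids = [random.randrange(1000) for _ in range(100_000)]
sizes = chain_compress(ids)
for stage, size in sizes.items():
    print(f"{stage:>9}: {size:,} bytes")
```

Sorting changes the row order, so a real columnar store would sort the whole table by this column to keep the other columns aligned with it.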