Compression Basics: Making Data Smaller
The Magic of Patterns: 10× Smaller, Same Information
When to Use What?
| Technique | Best For | Compression Ratio | Speed |
|---|---|---|---|
| RLE | Repeated values, sorted data (e.g., user_id after ORDER BY) | 10-100× | Very Fast |
| Dictionary | Low-cardinality strings (e.g., country, genre) | 5-20× | Fast |
| Delta | Timestamps (always increasing), sequences (e.g., IDs) | 5-10× | Fast |
| Bit Packing | Small integers (e.g., ratings 1-5, boolean flags) | 4-8× | Very Fast |
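To make the four rows of the table concrete, here is a minimal Python sketch of each technique. The function names (`rle_encode`, `dictionary_encode`, `delta_encode`, `bit_pack`) are hypothetical, chosen for illustration only, and real columnar formats use more compact layouts than these Python lists.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encoding: collapse each run into (value, run_length)."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def dictionary_encode(values):
    """Dictionary encoding: store each distinct value once, then small integer codes."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in values]


def delta_encode(values):
    """Delta encoding: keep the first value, then the difference to the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def bit_pack(values, bits):
    """Bit packing: store each small non-negative integer in exactly `bits` bits."""
    packed = 0
    for position, value in enumerate(values):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{value} does not fit in {bits} bits")
        packed |= value << (position * bits)
    n_bytes = (len(values) * bits + 7) // 8  # round up to whole bytes
    return packed.to_bytes(n_bytes, "little")


print(rle_encode([7, 7, 7, 9, 9, 4]))               # runs of equal values
print(dictionary_encode(["US", "DE", "US", "FR"]))  # codes into a sorted dictionary
print(delta_encode([1000, 1005, 1007, 1012]))       # small gaps instead of big numbers
print(len(bit_pack([1, 5, 3, 4, 2], bits=3)))       # five 3-bit ratings, in bytes
```

Each encoder only pays off when the data matches the "Best For" column: RLE on unsorted, high-cardinality data, for example, can produce output larger than the input.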
Combining Techniques
The real magic happens when you chain compression techniques:
Example: 1M User Listening Sessions
Original user_id column: 4MB (1M × 4 bytes)
↓
After RLE (sorted): 200KB (runs of same user)
↓
After Dictionary: 100KB (only 500K unique users)
↓
After zstd compression: 50KB (general compression)
Final: 4MB → 50KB = 98.75% reduction!
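The chain above can be tried on synthetic data with a short Python sketch. Everything here is an assumption made for illustration: the data is randomly generated (100,000 sessions from 1,000 users, not the 1M-row example), the dictionary step is omitted, and zlib stands in for zstd so that the sketch needs only the standard library. The sizes it prints will not match the figures above.

```python
import random
import struct
import zlib
from itertools import groupby


def pack_u32(values):
    """Serialize integers as fixed-width 4-byte little-endian values."""
    return struct.pack(f"<{len(values)}I", *values)


def reduction(original_size, final_size):
    """Percentage by which final_size is smaller than original_size."""
    return 100 * (1 - final_size / original_size)


# Hypothetical column: 100,000 sessions from 1,000 users, sorted by user_id.
random.seed(0)
user_ids = sorted(random.randrange(1000) for _ in range(100_000))

# Stage 0: the raw column, 4 bytes per value.
raw = pack_u32(user_ids)

# Stage 1: RLE on the sorted column, stored as (user_id, run_length) pairs.
runs = [(value, len(list(group))) for value, group in groupby(user_ids)]
rle = pack_u32([field for run in runs for field in run])

# Stage 2: general-purpose compression on top (zlib here, in place of zstd).
compressed = zlib.compress(rle, 9)

for label, data in [("raw", raw), ("after RLE", rle), ("after zlib", compressed)]:
    print(f"{label:>10}: {len(data):>7} bytes ({reduction(len(raw), len(data)):.2f}% reduction)")
```

The same `reduction` helper reproduces the arithmetic in the example: `reduction(4_000_000, 50_000)` gives 98.75 (up to floating-point rounding), since 50KB is 1.25% of 4MB.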