Compression Basics: Making Data Smaller
The Magic of Patterns: 10× Smaller, Same Information
When to Use What?
| Technique | Best For | Compression Ratio | Speed |
|---|---|---|---|
| RLE | Repeated values, sorted data (e.g., user_id after ORDER BY) | 10-100× | Very Fast |
| Dictionary | Low cardinality strings (e.g., country, genre) | 5-20× | Fast |
| Delta | Timestamps (always increasing), sequences (e.g., IDs) | 5-10× | Fast |
| Bit Packing | Small integers (e.g., ratings 1-5, boolean flags) | 4-8× | Very Fast |
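To make the table concrete, here is a minimal sketch of the four techniques in plain Python. The function names (`rle_encode`, `dictionary_encode`, `delta_encode`, `bit_pack`) and the sample values are made up for illustration; real columnar formats implement these encodings with more compact binary layouts.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encoding: collapse each run of equal values into (value, run_length)."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def dictionary_encode(values):
    """Dictionary encoding: store each distinct value once, then one small code per row."""
    dictionary = {}
    codes = []
    for value in values:
        # A value seen for the first time gets the next free code.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return list(dictionary), codes


def delta_encode(values):
    """Delta encoding: keep the first value, then only the difference to the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def bit_pack(values, bits):
    """Bit packing: store each small non-negative integer in exactly `bits` bits."""
    packed = 0
    for i, value in enumerate(values):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{value} does not fit in {bits} bits")
        packed |= value << (i * bits)
    n_bytes = (len(values) * bits + 7) // 8  # round up to whole bytes
    return packed.to_bytes(n_bytes, "little")


print(rle_encode([7, 7, 7, 9, 9, 4]))                     # [(7, 3), (9, 2), (4, 1)]
print(dictionary_encode(["US", "DE", "US", "US", "FR"]))  # (['US', 'DE', 'FR'], [0, 1, 0, 0, 2])
print(delta_encode([1000, 1003, 1004, 1010]))             # [1000, 3, 1, 6]
print(len(bit_pack([1, 5, 3, 4, 2, 5, 1, 3], bits=3)))    # 3
```

In the last line, eight ratings in the range 1-5 need 3 bits each, so they fit in 3 bytes instead of the 32 bytes that eight 4-byte integers would take.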
Combining Techniques
The real magic is in the mix. Each technique works on the output of the previous one, so stacking them compounds the savings:
Example: 1M User Listening Sessions
Original user_id column: 4MB (1M × 4 bytes)
↓
After RLE (sorted): 200KB (runs of same user)
↓
After Dictionary: 100KB (only 500K unique users)
↓
After zstd compression: 50KB (general compression)
Final: 4MB → 50KB = 98.75% reduction (80× smaller)!
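The stacking idea can be sketched on a small scale as follows. This is a sketch under stated assumptions, not a reproduction of the figures above: the data is synthetic (100,000 sessions from 2,000 hypothetical users), the dictionary step is skipped, and `zlib` from the Python standard library stands in for zstd as the general-purpose compressor. The exact sizes depend on the data.

```python
import random
import struct
import zlib
from itertools import groupby

random.seed(0)

# Hypothetical data: 100,000 listening sessions from 2,000 users, sorted by user_id.
user_ids = sorted(random.randrange(2_000) for _ in range(100_000))

# Stage 0: raw column, 4 bytes per value.
raw = struct.pack(f"<{len(user_ids)}I", *user_ids)

# Stage 1: RLE on the sorted column, stored as (user_id, run_length) pairs of 4 bytes each.
runs = [(uid, sum(1 for _ in group)) for uid, group in groupby(user_ids)]
rle = b"".join(struct.pack("<II", uid, length) for uid, length in runs)

# Stage 2: general-purpose compression of the RLE bytes (zlib here, standing in for zstd).
compressed = zlib.compress(rle, 9)

print(f"raw:        {len(raw):>8} bytes")
print(f"after RLE:  {len(rle):>8} bytes")
print(f"after zlib: {len(compressed):>8} bytes")

# Round trip: undo both stages and confirm that no information was lost.
decoded = []
for uid, length in struct.iter_unpack("<II", zlib.decompress(compressed)):
    decoded.extend([uid] * length)
assert decoded == user_ids
```

Sorting matters here: on the unsorted column each user's sessions would be scattered, runs would be close to length 1, and the RLE stage would make the column larger rather than smaller.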