Compression Basics: Making Data Smaller
Concept. Columnar compression (run-length, dictionary, delta) exploits the fact that values in a single column repeat heavily, typically shrinking on-disk size 10× and cutting IO by the same factor.
Intuition. Mickey appears in 3 of 9 Listens rows. At a billion rows, his user_id might appear 100 million times in a stretch. Run-length encoding stores "Mickey × 100M" instead of repeating the string; Listens shrinks 10× and a scan reads 10× fewer pages.
When to Use What?
| Technique | Best For | Compression Ratio | Speed |
|---|---|---|---|
| RLE | Repeated values, sorted data (e.g., user_id after GROUP BY) | 10-100× | Very Fast |
| Dictionary | Low cardinality strings (e.g., country, genre) | 5-20× | Fast |
| Delta | Timestamps (always increasing), sequences (e.g., IDs) | 5-10× | Fast |
| Bit Packing | Small integers (e.g., ratings 1-5, boolean flags) | 4-8× | Very Fast |
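Bit packing deserves a quick illustration since it never gets its own section below. The idea: a rating of 1-5 fits in 3 bits, so eight ratings pack into 3 bytes instead of 8. This is a minimal sketch, not a real codec; the function names are made up for illustration:

```python
def pack_3bit(values):
    """Pack small ints (0-7) into 3 bits each; returns (bytes, count)."""
    bits, n, out = 0, 0, bytearray()
    for v in values:
        bits = (bits << 3) | v
        n += 3
        while n >= 8:          # flush whole bytes as they fill up
            n -= 8
            out.append((bits >> n) & 0xFF)
    if n:                      # pad the final partial byte with zeros
        out.append((bits << (8 - n)) & 0xFF)
    return bytes(out), len(values)

def unpack_3bit(data, count):
    vals, bits, n = [], 0, 0
    for byte in data:
        bits = (bits << 8) | byte
        n += 8
        while n >= 3 and len(vals) < count:  # stop before padding bits
            n -= 3
            vals.append((bits >> n) & 0b111)
    return vals

ratings = [5, 1, 3, 2, 4, 5, 5, 1]
packed, count = pack_3bit(ratings)
print(len(packed))  # 3 bytes instead of 8
```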
Run-Length Encoding (RLE)
The cleanest case. A sorted column with long runs of the same value. Instead of storing Mickey, Mickey, Mickey, Mickey ... a hundred million times, store Mickey × 100M. One value plus a count. Compression ratios of 10× to 100× are normal on user_id after GROUP BY user_id, on country in a sorted log, on any column where the same value lingers.
Raw: [Mickey, Mickey, Mickey, Daffy, Daffy, Daffy, Daffy]
RLE: [(Mickey, 3), (Daffy, 4)]
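The transformation above can be sketched in a few lines of Python. This is illustrative, not a production codec; real columnar engines operate on binary buffers, not Python lists:

```python
def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((v, 1))              # start a new run
    return runs

def rle_decode(runs):
    out = []
    for v, count in runs:
        out.extend([v] * count)
    return out

raw = ["Mickey", "Mickey", "Mickey", "Daffy", "Daffy", "Daffy", "Daffy"]
print(rle_encode(raw))  # [('Mickey', 3), ('Daffy', 4)]
```

Note that RLE only pays off when equal values sit next to each other, which is why sorting (or a `GROUP BY`) first matters so much.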
Dictionary Encoding
When a column has a small set of distinct values, replace each value with a small integer index into a dictionary. Genre has maybe 50 distinct values; storing each as a 1-byte code instead of a 16-byte string is a 16× win. Country, currency, and status flags are all good targets.
Dictionary: {0: "rock", 1: "pop", 2: "jazz"}
Raw: ["rock", "pop", "rock", "jazz", "rock"]
Encoded: [0, 1, 0, 2, 0]
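A minimal sketch of dictionary encoding, assuming codes are assigned in first-seen order (which reproduces the example above):

```python
def dict_encode(values):
    """Map each distinct value to a small integer code, first seen = 0."""
    dictionary = {}
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return dictionary, codes

def dict_decode(dictionary, codes):
    reverse = {code: v for v, code in dictionary.items()}
    return [reverse[c] for c in codes]

raw = ["rock", "pop", "rock", "jazz", "rock"]
dictionary, codes = dict_encode(raw)
print(dictionary)  # {'rock': 0, 'pop': 1, 'jazz': 2}
print(codes)       # [0, 1, 0, 2, 0]
```

The win comes from the codes being far smaller than the original strings; with only 50 genres, each code fits in a single byte.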
Delta Encoding
For sorted numeric columns, store the difference between consecutive values rather than the values themselves. Timestamps arriving microseconds apart turn into tiny deltas that pack into 1-byte ints instead of 8-byte longs. Auto-increment IDs and sensor readings benefit the same way.
Raw: [1700000000, 1700000001, 1700000003, 1700000004]
Delta: [1700000000, +1, +2, +1]
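The same idea in Python, as an illustrative sketch (a real encoder would then bit-pack the small deltas, which is where the savings actually land):

```python
def delta_encode(values):
    """Store the first value as-is, then differences between neighbors."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)  # running sum reconstructs originals
    return out

raw = [1700000000, 1700000001, 1700000003, 1700000004]
print(delta_encode(raw))  # [1700000000, 1, 2, 1]
```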
Combining Techniques
The real magic is in the mix. Stack these techniques, and you start to see serious space savings:
Example: 1M User Listening Sessions
Original user_id column: 4MB (1M × 4 bytes)
↓
After RLE (sorted): 200KB (runs of same user)
↓
After Dictionary: 100KB (only 500K unique users)
↓
After zstd compression: 50KB (general compression)
Final: 4MB → 50KB = 98.75% reduction!
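A toy version of that pipeline, stacking dictionary encoding, RLE, and a final general-purpose pass. zstd is not in Python's standard library, so this sketch substitutes zlib as the stand-in compressor; the function name and the synthetic column are made up for illustration:

```python
import json
import zlib

def compress_column(values):
    # Step 1: dictionary-encode strings into small integer codes.
    dictionary = {}
    codes = [dictionary.setdefault(v, len(dictionary)) for v in values]
    # Step 2: RLE the codes (sorted input produces long runs).
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    # Step 3: general-purpose compression over the encoded form.
    payload = json.dumps({"dict": list(dictionary), "runs": runs}).encode()
    return zlib.compress(payload)

# Sorted column with long runs: 100 listens per user.
column = [f"user_{i // 100}" for i in range(10_000)]
blob = compress_column(column)
raw_size = sum(len(v) for v in column)
print(f"{raw_size} bytes -> {len(blob)} bytes")
```

The exact ratio depends on the data, but the shape of the win is the same as in the diagram above: each stage shrinks what the next stage has to work on.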