Compression Basics: Making Data Smaller

Concept. Columnar compression (run-length, dictionary, delta) exploits the fact that values in a single column repeat heavily, typically shrinking on-disk size 10× and cutting IO by the same factor.

Intuition. Mickey appears in 3 of 9 Listens rows. At a billion rows, his user_id might appear 100 million times in a stretch. Run-length encoding stores "Mickey × 100M" instead of repeating the string: Listens shrinks 10× and reads 10× fewer pages.

When to Use What?

Technique     Best For                                                     Compression Ratio   Speed
RLE           Repeated values, sorted data (e.g., user_id after GROUP BY)  10-100×             Very fast
Dictionary    Low-cardinality strings (e.g., country, genre)               5-20×               Fast
Delta         Timestamps (always increasing), sequences (e.g., IDs)        5-10×               Fast
Bit packing   Small integers (e.g., ratings 1-5, boolean flags)            4-8×                Very fast
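
Bit packing gets no section of its own below, so here is a minimal sketch of the idea for the ratings case in the table: values 1-5 fit in 3 bits, so roughly 2.6 ratings pack into each byte instead of one rating per 4-byte int. The function names and format are illustrative, not any engine's real layout.

```python
def pack_3bit(values):
    """Pack small ints (0-7) into bytes, 3 bits per value."""
    bits, nbits, out = 0, 0, bytearray()
    for v in values:
        bits |= v << nbits          # append 3 new bits above the pending ones
        nbits += 3
        while nbits >= 8:           # flush whole bytes as they fill up
            out.append(bits & 0xFF)
            bits >>= 8
            nbits -= 8
    if nbits:                       # flush any leftover partial byte
        out.append(bits & 0xFF)
    return bytes(out)

def unpack_3bit(data, count):
    """Recover `count` 3-bit values from packed bytes."""
    bits, nbits, out = 0, 0, []
    it = iter(data)
    for _ in range(count):
        while nbits < 3:            # refill the bit buffer from the next byte
            bits |= next(it) << nbits
            nbits += 8
        out.append(bits & 0b111)    # peel off the lowest 3 bits
        bits >>= 3
        nbits -= 3
    return out

ratings = [5, 3, 4, 1, 2, 5, 5, 3]
packed = pack_3bit(ratings)
assert unpack_3bit(packed, len(ratings)) == ratings
assert len(packed) == 3             # 3 bytes instead of 32 (8 × 4-byte ints)
```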

Run-Length Encoding (RLE)

The cleanest case. A sorted column with long runs of the same value. Instead of storing Mickey, Mickey, Mickey, Mickey ... a hundred million times, store Mickey × 100M. One value plus a count. Compression ratios of 10× to 100× are normal on user_id after GROUP BY user_id, on country in a sorted log, on any column where the same value lingers.

Raw:        [Mickey, Mickey, Mickey, Daffy, Daffy, Daffy, Daffy]
RLE:        [(Mickey, 3), (Daffy, 4)]
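
The transformation above can be sketched in a few lines of Python (illustrative, not a real engine's on-disk format):

```python
from itertools import groupby

def rle_encode(values):
    """Collapse consecutive runs into (value, count) pairs."""
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]

def rle_decode(pairs):
    """Expand (value, count) pairs back into the raw column."""
    return [v for v, n in pairs for _ in range(n)]

raw = ["Mickey", "Mickey", "Mickey", "Daffy", "Daffy", "Daffy", "Daffy"]
encoded = rle_encode(raw)        # [("Mickey", 3), ("Daffy", 4)]
assert rle_decode(encoded) == raw
```

Note that RLE only pays off on *consecutive* runs, which is why sort order matters so much: the same column shuffled randomly compresses far worse than the same column sorted.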

Dictionary Encoding

When a column has a small set of distinct values, replace each value with a small integer index into a dictionary. Genre has maybe 50 distinct values; storing each as a 1-byte code instead of a 16-byte string is a 16× win. Country, currency, and status flags are all good targets.

Dictionary: {0: "rock", 1: "pop", 2: "jazz"}
Raw:        ["rock", "pop", "rock", "jazz", "rock"]
Encoded:    [0, 1, 0, 2, 0]
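
A minimal sketch of the same encoding (illustrative; real systems store the codes bit-packed and the dictionary in a page header):

```python
def dict_encode(values):
    """Assign each distinct value a small integer code, in first-seen order."""
    dictionary, codes = {}, []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return dictionary, codes

def dict_decode(dictionary, codes):
    """Invert the dictionary and map codes back to values."""
    reverse = {code: v for v, code in dictionary.items()}
    return [reverse[c] for c in codes]

raw = ["rock", "pop", "rock", "jazz", "rock"]
dictionary, codes = dict_encode(raw)
# dictionary == {"rock": 0, "pop": 1, "jazz": 2}; codes == [0, 1, 0, 2, 0]
assert dict_decode(dictionary, codes) == raw
```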

Delta Encoding

For sorted numeric columns, store the difference between consecutive values rather than the values themselves. Timestamps arriving microseconds apart turn into tiny deltas that pack into 1-byte ints instead of 8-byte longs. Auto-increment IDs and sensor readings benefit the same way.

Raw:    [1700000000, 1700000001, 1700000003, 1700000004]
Delta:  [1700000000, +1, +2, +1]
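
In code, the encode/decode pair is a one-liner each (illustrative; real formats then bit-pack the small deltas):

```python
def delta_encode(values):
    """Store the first value, then the difference to each previous value."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Running sum rebuilds the original values."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

raw = [1700000000, 1700000001, 1700000003, 1700000004]
deltas = delta_encode(raw)       # [1700000000, 1, 2, 1]
assert delta_decode(deltas) == raw
```

The savings come from the second step the comment hints at: once every value after the first fits in a byte, the column drops from 8 bytes per timestamp to roughly 1.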

Combining Techniques

The real magic is in the mix. Stack these techniques, and you start to see serious space savings:

Example: 1M User Listening Sessions

Original user_id column:     4MB (1M × 4 bytes)
↓
After RLE (sorted):          200KB (runs of same user)
↓  
After Dictionary:            100KB (only 500K unique users)
↓
After zstd compression:       50KB (general compression)

Final: 4MB → 50KB = 98.75% reduction!
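
A sketch of stacking the stages end to end. This is illustrative only: the function name and serialization format are invented for the example, and zlib from the standard library stands in for zstd, which needs a third-party package.

```python
import struct
import zlib
from itertools import groupby

def compress_column(user_ids):
    """RLE the sorted ids, dictionary-encode the run values, then zlib the bytes."""
    # Stage 1: RLE — collapse consecutive runs into (value, count) pairs
    runs = [(v, sum(1 for _ in g)) for v, g in groupby(user_ids)]
    # Stage 2: dictionary — map each distinct run value to a small code
    dictionary = {}
    for v, _ in runs:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
    # Serialize (code, run_length) pairs as two little-endian uint32s each
    payload = b"".join(struct.pack("<II", dictionary[v], n) for v, n in runs)
    # Stage 3: general-purpose compression over the already-encoded bytes
    return dictionary, zlib.compress(payload)

# 1M rows, heavily repeated user ids, sorted so runs are long
column = sorted(i % 1000 for i in range(1_000_000))
dictionary, blob = compress_column(column)
print(f"4,000,000 raw bytes -> {len(blob)} compressed bytes")
```

The ordering matters: RLE and dictionary encoding expose the column's structure first, so the general-purpose compressor at the end works on a few kilobytes of regular pairs instead of 4MB of raw integers.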