Compression Basics: Making Data Smaller

The Magic of Patterns: 10× Smaller, Same Information

[Overview figure: "Four Ways to Compress Spotify Data". Panels: Run-Length Encoding (72 → 27 bytes, 63% smaller), Dictionary Encoding (54 → 20 bytes, 63% smaller), Delta Encoding (32 → 12 bytes, 62% smaller), Bit Packing (36 → 4 bytes, 89% smaller). Also shown: real Spotify columns (user_id via RLE: 4MB → 200KB; song_id via dictionary: 4MB → 1MB; listen_time via delta: 8MB → 2MB) and combined techniques taking a 4MB ratings column down to 75KB, a 98% reduction.]

Run-Length Encoding (RLE)

The Pattern: Repeated Values

When you see the same value multiple times in a row, just count them!

What is Categorical Data? Data that falls into distinct categories or groups, like user IDs, song titles, genres, or star ratings.

Spotify Example: User Listening Sessions

Original:  [42, 42, 42, 42, 89, 89, 89, 17, 17]  // 72 bytes (9 × 8 bytes)
           (user 42 listens to 4 songs, then user 89 listens to 3, etc.)

Encoded:   [(42, 4), (89, 3), (17, 2)]           // 27 bytes
           3 pairs × (8 bytes for user_id + 1 byte for count)
           (user_id stays 8 bytes, count only needs 1 byte for small values)

Savings:   63% smaller (45 bytes saved)

Best for: long runs of repeated values, especially sorted or grouped columns.

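The encode/decode round trip is only a few lines. Here is a minimal Python sketch; the function names rle_encode and rle_decode are illustrative, not from any particular library:

```python
def rle_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            # Same value as the previous run: extend its count.
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            # New value: start a fresh run.
            runs.append((v, 1))
    return runs

def rle_decode(runs):
    """Expand (value, count) pairs back into the original list."""
    return [v for v, count in runs for _ in range(count)]
```

For the session example above, rle_encode([42, 42, 42, 42, 89, 89, 89, 17, 17]) returns [(42, 4), (89, 3), (17, 2)], and rle_decode reverses it exactly.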

Dictionary Encoding

The Pattern: Limited Unique Values

Replace repeated strings/values with small integer lookups.

Spotify Example: Song Titles

Original:  ["Flowers", "Flowers", "Anti-Hero", "Flowers", "Anti-Hero"]  // 45 bytes

Dictionary: {0: "Flowers", 1: "Anti-Hero"}                              // 20 bytes
Indices:    [0, 0, 1, 0, 1]                                            // 5 bytes
Total:      25 bytes (44% smaller)

Best for: low-cardinality columns, where a few unique strings repeat many times.

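A minimal Python sketch of the same idea (the helper names dict_encode and dict_decode are made up for this example):

```python
def dict_encode(values):
    """Assign each distinct value a small integer; return (lookup, indices)."""
    dictionary = {}
    indices = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)  # next free index
        indices.append(dictionary[v])
    # Invert for decoding: index -> value.
    lookup = {i: v for v, i in dictionary.items()}
    return lookup, indices

def dict_decode(lookup, indices):
    """Replace each index with its original value."""
    return [lookup[i] for i in indices]
```

Running dict_encode on the song titles above yields the dictionary {0: "Flowers", 1: "Anti-Hero"} and the indices [0, 0, 1, 0, 1].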

Delta Encoding

The Pattern: Sequential Values

Store the first value, then only differences.

Spotify Example: Timestamps

Original times:     2024-01-01 14:35:00  (1704119700)
                    2024-01-01 14:35:03  (1704119703)
                    2024-01-01 14:35:07  (1704119707)

Delta encoded:      Base: 1704119700
                    Deltas: [+3, +4]  (each delta is the gap from the previous value)

Savings:            12 bytes → 6 bytes (50% smaller)
                    (three 4-byte timestamps → one 4-byte base + two 1-byte deltas)

Best for: time series and other sequential values where consecutive differences are small.

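A Python sketch of the common variant that stores each value's difference from the previous one (function names are illustrative):

```python
def delta_encode(values):
    """Return (base, deltas) where each delta is value[i] - value[i-1]."""
    if not values:
        return None, []
    base = values[0]
    deltas = [b - a for a, b in zip(values, values[1:])]
    return base, deltas

def delta_decode(base, deltas):
    """Rebuild the original sequence by running-summing the deltas."""
    out = [base]
    for d in deltas:
        out.append(out[-1] + d)
    return out
```

For the timestamps above, delta_encode returns base 1704119700 with deltas [3, 4], and delta_decode restores the original three timestamps.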

Bit Packing

The Pattern: Small Range Numbers

Don't use 32 bits when 3 will do!

Spotify Example: Ratings (1-5 stars)

Original:   [4, 5, 3, 4, 5]  // 5 × 32 bits = 20 bytes

Bit packed: 100 101 011 100 101  // 5 × 3 bits = 15 bits ≈ 2 bytes

Savings:    90% smaller

Best for: small integers with a known, limited range (ratings, flags, enum codes).

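The bit-level bookkeeping can be done with shifts and masks. A minimal Python sketch, assuming the values already fit in the given bit width (bitpack and bitunpack are illustrative names):

```python
def bitpack(values, bits):
    """Pack integers into bytes, using `bits` bits per value."""
    buf, nbits, out = 0, 0, bytearray()
    for v in values:
        buf = (buf << bits) | v          # append value's bits to the buffer
        nbits += bits
        while nbits >= 8:                # emit every complete byte
            nbits -= 8
            out.append((buf >> nbits) & 0xFF)
            buf &= (1 << nbits) - 1      # keep only the unwritten low bits
    if nbits:                            # flush a final partial byte, left-aligned
        out.append((buf << (8 - nbits)) & 0xFF)
    return bytes(out)

def bitunpack(data, bits, count):
    """Read `count` values of `bits` bits each back out of the bytes."""
    buf, nbits, values = 0, 0, []
    it = iter(data)
    for _ in range(count):
        while nbits < bits:              # pull in bytes until we have enough bits
            buf = (buf << 8) | next(it)
            nbits += 8
        nbits -= bits
        values.append((buf >> nbits) & ((1 << bits) - 1))
    return values
```

With the ratings above, bitpack([4, 5, 3, 4, 5], 3) produces just 2 bytes, and bitunpack recovers the original five values.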

Combining Techniques

The real magic happens when you chain compressions:

Example: 1M User Listening Sessions
Original user_id column:     4MB (1M × 4 bytes)
↓
After RLE (sorted):          200KB (runs of same user)
↓  
After Dictionary:            100KB (500K unique users fit in ~19-bit indices instead of 32)
↓
After zstd compression:       50KB (general compression)

Final: 4MB → 50KB = 98.75% reduction!
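
Chaining can be sketched end to end in Python. This is a minimal illustration, not production code: it bit-packs 3-bit ratings by hand, and the standard library's zlib stands in for zstd (which ships as a separate package). The function name compress_ratings is made up for this example:

```python
import zlib

def compress_ratings(ratings):
    """Chain two techniques: bit packing, then general-purpose compression."""
    # Step 1: bit pack -- each rating 1-5 fits in 3 bits instead of 32.
    buf, nbits, packed = 0, 0, bytearray()
    for r in ratings:
        buf = (buf << 3) | r
        nbits += 3
        while nbits >= 8:
            nbits -= 8
            packed.append((buf >> nbits) & 0xFF)
            buf &= (1 << nbits) - 1      # keep only the unwritten low bits
    if nbits:                            # flush any leftover bits
        packed.append((buf << (8 - nbits)) & 0xFF)
    # Step 2: a general-purpose compressor squeezes out remaining repetition
    # (its LZ77 matching also covers what an explicit RLE pass would find here).
    return zlib.compress(bytes(packed), level=9)

ratings = [4, 5, 3] * 100_000      # 300K ratings with a repeating pattern
raw_size = len(ratings) * 4        # 1.2MB as plain 32-bit integers
compressed = compress_ratings(ratings)
print(f"{raw_size} bytes -> {len(compressed)} bytes")
```

The exact ratio depends on the data and the compressor, but on repetitive input like this the chained result is a small fraction of both the raw column and the bit-packed intermediate.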


When to Use What?

Technique     Best For                       Compression Ratio   Speed
RLE           Repeated values, sorted data   10-100×             Very Fast
Dictionary    Low-cardinality strings        5-20×               Fast
Delta         Time series, sequences         5-10×               Fast
Bit Packing   Small integers                 4-8×                Very Fast

Key Insights

Patterns:   Data has patterns worth exploiting
10-100×:    Typical compression ratios in practice
Columnar:   Columnar layout groups similar values, creating better patterns
Chain:      Combining techniques multiplies the savings

The Big Idea: Compression finds and eliminates redundancy. Columnar storage creates more redundancy to eliminate!