Compression Basics: Making Data Smaller
The Magic of Patterns: 10× Smaller, Same Information
Run-Length Encoding (RLE)
The Pattern: Repeated Values
When you see the same value multiple times in a row, just count them!
What is Categorical Data? Data that falls into distinct categories or groups, like:
-
User IDs (user 42, user 89, user 17)
-
Device types (mobile, desktop, smart_tv)
-
Countries (US, UK, JP)
-
Boolean flags (true/false)
Spotify Example: User Listening Sessions
Original: [42, 42, 42, 42, 89, 89, 89, 17, 17] // 72 bytes (9 × 8 bytes)
(user 42 listens to 4 songs, then user 89 listens to 3, etc.)
Encoded: [(42, 4), (89, 3), (17, 2)] // 27 bytes
3 pairs × (8 bytes for user_id + 1 byte for count)
(user_id stays 8 bytes, count only needs 1 byte for small values)
Savings: 63% smaller (45 bytes saved)
Best for:
-
Sorted data (user_id after GROUP BY)
-
Categorical data with runs (same category repeating)
-
Boolean flags (long runs of true/false)
Dictionary Encoding
The Pattern: Limited Unique Values
Replace repeated strings/values with small integer lookups.
Spotify Example: Song Titles
Original: ["Flowers", "Flowers", "Anti-Hero", "Flowers", "Anti-Hero"] // 45 bytes
Dictionary: {0: "Flowers", 1: "Anti-Hero"} // 20 bytes
Indices: [0, 0, 1, 0, 1] // 5 bytes
Total: 25 bytes (44% smaller)
Best for:
-
String columns with < 1000 unique values
-
Categories (genre, country, device_type)
-
Repeated text values
Delta Encoding
The Pattern: Sequential Values
Store the first value, then only differences.
Spotify Example: Timestamps
Original times: 2024-01-01 14:35:00 (1704120900)
2024-01-01 14:35:03 (1704120903)
2024-01-01 14:35:07 (1704120907)
Delta encoded: Base: 1704120900
Deltas: [0, 3, 7]
Savings: 12 bytes → 5 bytes (58% smaller)
Best for:
-
Timestamps (always increasing)
-
Sequential IDs
-
Cumulative metrics
Bit Packing
The Pattern: Small Range Numbers
Don't use 32 bits when 3 will do!
Spotify Example: Ratings (1-5 stars)
Original: [4, 5, 3, 4, 5] // 5 × 32 bits = 20 bytes
Bit packed: 100 101 011 100 101 // 5 × 3 bits = 15 bits ≈ 2 bytes
Savings: 90% smaller
Best for:
-
Ratings (1-5 or 1-10)
-
Small integers
-
Boolean arrays
Combining Techniques
The real magic happens when you chain compressions:
Original user_id column: 4MB (1M × 4 bytes)
↓
After RLE (sorted): 200KB (runs of same user)
↓
After Dictionary: 100KB (only 500K unique users)
↓
After zstd compression: 50KB (general compression)
Final: 4MB → 50KB = 98.75% reduction!When to Use What?
| Technique | Best For | Compression Ratio | Speed |
|---|---|---|---|
| RLE | Repeated values, sorted data | 10-100× | Very Fast |
| Dictionary | Low cardinality strings | 5-20× | Fast |
| Delta | Time series, sequences | 5-10× | Fast |
| Bit Packing | Small integers | 4-8× | Very Fast |
Key Insights
The Big Idea: Compression finds and eliminates redundancy. Columnar storage creates more redundancy to eliminate!