SSTables: Sorted String Tables

Concept. An SSTable is an immutable file of key-value pairs sorted by key, written once and never modified — the building block for write-heavy stores like LSM trees.

Intuition. When Spotify needs to record 1 million listen events per second, writing each one to a sorted on-disk B+Tree is too slow (every write triggers a rebalance). Instead, buffer writes in RAM until you have a chunk, then flush a sorted immutable file to disk — that file is an SSTable.

Immutable Files That Power Modern Databases

Immutable

Never modified after write

Sorted by Key

Enables efficient merging

Sequential Write

Created by memory flush

Self-Contained

Index + data in one file


Visual: Anatomy of an SSTable

SSTable: Structure and Operations 1. Single SSTable File Structure Bloom Filter Index Block Data Blocks (Sorted) Bloom Filter: • Quick "not exists" • 1% false positive • Saves disk reads Index: • Key → Offset • Binary search • Sparse entries Data: • Key-value pairs • Sorted by key • Compressed blocks 2. Multiple SSTables (Overlapping Keys) sstable_3.db (newest) [user_100:v3, user_150:v1, user_200:v2, .., user_400:v1] 10MB, 1hr ago sstable_2.db [user_050:v1, user_100:v2, user_200:v1,.., user_350:v1] 10MB, 2hr ago sstable_1.db (oldest) [user_001:v1, user_100:v1, user_150:v1, .. user_600:v1] 10MB, 3hr ago

Pseudocode

// Create SSTable
Flush(memtable):
    sorted = sort(memtable)        // Sort by key
    write(bloom_filter)             // For quick "not exists"
    write(sparse_index)             // Every 100th key → offset
    write(sorted_data)              // The actual key-values
    write(footer)                   // min_key, max_key, size

// Read from SSTable
Read(sstable, key):
    if not bloom_filter.contains(key):
        return None                 // Quick reject

    offset = binary_search(index, key)
    block = read_block(offset)
    return linear_search(block, key)

// Merge SSTables (Compaction)
Compact(sstables):
    // K-way merge, keep newest version
    merged = k_way_merge(sstables)
    return write_new_sstable(merged)

Key Takeaway

  • SSTables turn random writes into sequential writes by batching updates in memory and flushing sorted, immutable files to disk.

  • This simple idea powers Google's Spanner, Google's Bigtable, Amazon's DynamoDB, and Apache Cassandra.