SSTables: Sorted String Tables

Immutable Files That Power Modern Databases

Core idea: Never update files in-place. Instead, write new sorted files and merge them later.

Immutable
Never modified after write
Sorted by Key
Enables efficient merging
Sequential Write
Created by memory flush
Self-Contained
Index + data in one file

Visual: Anatomy of an SSTable

SSTable: Structure and Operations 1. Single SSTable File Structure Bloom Filter Index Block Data Blocks (Sorted) Bloom Filter: • Quick "not exists" • 1% false positive • Saves disk reads Index: • Key → Offset • Binary search • Sparse entries Data: • Key-value pairs • Sorted by key • Compressed blocks Footer: min_key="user_001", max_key="user_999", size=10MB, timestamp=1234567890 2. Multiple SSTables (Overlapping Keys) sstable_3.db (newest) [user_100:v3, user_150:v1, user_200:v2, user_300:v1, user_400:v1] 10MB, 1hr ago sstable_2.db [user_050:v1, user_100:v2, user_200:v1, user_250:v1, user_350:v1] 10MB, 2hr ago sstable_1.db (oldest) [user_001:v1, user_100:v1, user_150:v1, user_500:v1, user_600:v1] 10MB, 3hr ago Query: Get("user_100") 1. Check sstable_3 → Found v3 ✓ (newest wins!) 2. Skip older files (no need to check) 3. Compaction: Merging SSTables Before Compaction File1: [A:1, C:1, E:1] File2: [B:1, C:2, F:1] File3: [A:2, D:1, E:2] 3 files, duplicate keys, 3 reads needed K-way Merge (like MergeSort) After Compaction Merged: [A:2, B:1, C:2, D:1, E:2, F:1] Latest values only, sorted, no duplicates 1 file, no duplicates, 1 read needed Compaction Benefits ✓ Fewer files to search ✓ Reclaim space from old versions ✓ Better read performance Why SSTables Are Brilliant Sequential Writes: • Append-only to disk • No random seeks • 100x faster than B+Tree Efficient Merging: • Already sorted • Linear scan merge • Like merge sort Crash Recovery: • Immutable = no corruption • Just replay from WAL • Simple and robust Compression: • Block compression • Prefix compression • 3-10x space savings




Pseudocode

// Create SSTable
Flush(memtable):
    sorted = sort(memtable)        // Sort by key
    write(bloom_filter)             // For quick "not exists"
    write(sparse_index)             // Every 100th key → offset
    write(sorted_data)              // The actual key-values
    write(footer)                   // min_key, max_key, size

// Read from SSTable
Read(sstable, key):
    if not bloom_filter.contains(key):
        return None                 // Quick reject

    offset = binary_search(index, key)
    block = read_block(offset)
    return linear_search(block, key)

// Merge SSTables (Compaction)
Compact(sstables):
    // K-way merge, keep newest version
    merged = k_way_merge(sstables)
    return write_new_sstable(merged)

Real-World Systems Using SSTables

Distributed SQL Databases

Google Spanner

CockroachDB

TiDB

NoSQL Systems

Apache Cassandra

HBase (Hadoop Database)

RocksDB (Facebook)

Google Bigtable



Key Takeaway

SSTables turn random writes into sequential writes by batching updates in memory and flushing sorted, immutable files to disk. This simple idea powers Spanner, CockroachDB, Cassandra, and many other modern databases.