SSTables: Sorted String Tables

Concept. An SSTable is an immutable file of key-value pairs sorted by key, written once and never modified. It's the building block for write-heavy stores like LSM trees.

Intuition. When Spotify needs to record 1 million listen events per second, writing each one to a sorted on-disk B+Tree is too slow, because every write triggers a rebalance. Instead, the system buffers writes in RAM until it has a chunk, then flushes a sorted immutable file to disk. That file is an SSTable.


Anatomy of an SSTable

SSTable structure. One file with three sections (bloom filter, index block, data blocks), all in pastel green (everything lives on disk). Bottom: three SSTable files coexist on disk, with the user_100 key appearing as v1, v2, v3 across them in pastel purple. The newest wins on read.

Figure 1. SSTable anatomy. One immutable file, plus how many files coexist on disk.

  • Bloom filter (in the SSTable file, cached in RAM): a compact bitmap. Answers "is key K in this file?" in O(1): definitely-no, or maybe-yes. About 1 KB per million keys, lets the database skip a wasted disk read.
  • Index block (disk): sparse, with every Nth key plus its byte offset. Binary-search to find which data block holds your key.
  • Data blocks (disk): the actual rows, sorted by key, compressed. Read by following the byte offset from the index.
  • Multiple SSTables coexist: the same key can live in several files. Live example: user100 appears in all three with versions v1, v2, v3. The newest (v3 in sstable_3.db) wins on read.

Each SSTable is a single file with three sections laid out in order. The caption above lists what each section does. (Recall Bloom filters from Module 2 for how they work and their false-positive behavior.)

Multiple SSTables coexist on disk. A background compaction process eventually merges older files together to throw away obsolete versions.


How Databases Use SSTables

Three operations show up in any system that uses SSTables:

  • Flush. Convert an in-memory buffer of pending writes into a sorted, immutable file on disk. One sequential write per flush.

  • Read. Consult the bloom filter, binary-search the sparse index, and read the target data block. Three steps, mostly from cache.

  • Compact. K-way merge a set of older SSTables into one larger file, keeping the newest version of each key and dropping deleted entries.

Some databases use SSTables as a standalone storage format (notably older versions of HBase and many analytics stores). Others layer them into a larger structure, which is where LSM trees come in on the next page.


Algorithm

Algorithm · SSTable: Flush, Read, Compact

Flush(memtable):                    // RAM buffer is full
    sorted = sort(memtable)         // sort by key
    write(bloom_filter)             // for quick "not exists"
    write(sparse_index)             // every Nth key → offset
    write(sorted_data)              // the actual key-values
    write(footer)                   // min_key, max_key, size

Read(sstable, key):
    if not bloom_filter.contains(key):
        return None                 // quick reject, no disk read
    offset = binary_search(index, key)
    block  = read_block(offset)
    return linear_search(block, key)

Compact(sstables):
    merged = k_way_merge(sstables)  // keep newest version per key
    return write_new_sstable(merged)

Where SSTables Run in Production

The same file format powers Google's Bigtable and Spanner, Amazon DynamoDB, Apache Cassandra, and RocksDB. RocksDB is the storage engine underneath Kafka Streams, CockroachDB, TiKV, and many other systems. If you've used a modern key-value store, you've almost certainly written an SSTable.


Next

LSM Tree Indexes → How a cluster of SSTables behaves as a single log-structured index, and how reads and writes flow through it.