
This post describes my implementation of Flushable SSTables in Pebble during my time at Cockroach Labs in 2022.

 

High-throughput distributed databases face a fundamental challenge: how to efficiently ingest large amounts of external data without disrupting ongoing operations. During my internship at Cockroach Labs, I tackled one such bottleneck in Pebble's SSTable ingestion system, where SSTables overlapping with memtables could force the entire ingestion process to stall, waiting for the memtables to flush to disk.

This post details how I redesigned Pebble's ingestion pipeline to use a lazy evaluation approach, transforming ingested SSTables into special "flushable" entities that integrate seamlessly with the existing flush infrastructure. The result: up to 75% faster ingestion with no compromise on durability or consistency guarantees.

 

Where's the ingestion bottleneck?

SSTable ingestion is a critical operation in distributed databases. CockroachDB uses it for:

  • Bulk data imports from external systems
  • Range rebalancing across cluster nodes
  • Backup restoration operations
  • Schema migrations requiring data reorganization

The original Pebble implementation followed a straightforward but problematic approach:

The Sequence Number Dilemma

Every write in Pebble receives a monotonically increasing sequence number that establishes total ordering. When ingesting an SSTable:

  1. Sequence Number Assignment: The ingested SSTable receives sequence numbers newer than all existing data
  2. Visibility Constraint: The SSTable cannot become visible until all overlapping data with older sequence numbers is persisted to L0
  3. Forced Synchronization: If a memtable overlaps with the ingested SSTable's key range, ingestion must wait for memtable flush completion

// Original ingestion flow (simplified)
func (d *DB) ingest(paths []string) error {
    // Analyze overlap with memtables
    if d.hasMemtableOverlap(paths) {
        // BLOCKING: Wait for memtable flush to complete
        if err := d.forceFlushMemtable(); err != nil {
            return err
        }
        // Wait for flush job completion and manifest update
        d.waitForFlushCompletion()
    }

    // Only now can we add SSTable to LSM
    return d.addToLSM(paths)
}

This synchronous approach created the following performance pathologies:

Write Stalls: New writes, which require sequence numbers higher than the ingested SSTable's, were blocked until ingestion completed, leaving CPU and I/O bandwidth idle. In high-throughput scenarios, this could pause all writes for 200-500ms.

The problem compounded with multiple overlapping memtables—each required individual flush completion before ingestion could proceed.

 

The lazy ingestion approach

The solution involved rearchitecting the ingestion pipeline to treat ingested SSTables as equal partners with memtables in the flush process. Transformed into special "flushable" entities, ingested SSTables slot seamlessly into the existing flush machinery.


High-level design

The lazy ingestion approach is based on three core principles:

An ingested SSTable is made visible to reads immediately, but its placement in the LSM is deferred to a later background flush, after the overlapping memtables ahead of it have been written out. This allows bulk data imports and range rebalancing to proceed without blocking ongoing writes.

Both memtables and ingested SSTables flow through the same flush infrastructure, ensuring consistency and strong durability guarantees.

Finally, the design must ensure that all operations can survive system crashes with the same durability guarantees as regular writes.
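To make these principles concrete, here is a rough sketch of what the rewritten ingest entry point looks like under this design. It is an illustration rather than Pebble's exact code: ingestLazy and handleIngestAsFlushable are stand-in names, while loadSSTableMetadata, analyzeIngestOverlaps, and addToLSM are the simplified helpers used elsewhere in this post.

// Lazy ingestion flow (sketch; names are illustrative)
func (d *DB) ingestLazy(paths []string) error {
    // Load key bounds and sizes for the SSTables being ingested
    var metas []*fileMetadata
    for _, path := range paths {
        meta, err := d.loadSSTableMetadata(path)
        if err != nil {
            return err
        }
        metas = append(metas, meta)
    }

    // Detect overlap with the mutable and immutable memtables (see Step 1 below)
    overlaps, err := d.analyzeIngestOverlaps(metas)
    if err != nil {
        return err
    }

    if len(overlaps) > 0 {
        // NEW: instead of waiting for the overlapping memtables to flush, wrap
        // the SSTables as a flushable, queue them behind those memtables, and
        // let the background flush place them in the LSM later
        return d.handleIngestAsFlushable(metas)
    }

    // No memtable overlap: add the files directly to the LSM, as before
    return d.addToLSM(paths)
}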


The flushable Interface

Pebble's existing flush system was built around the flushable interface:

type flushable interface {
    newIter(o *IterOptions) internalIterator
    newFlushIter(o *IterOptions, bytesFlushed *uint64) internalIterator
    newRangeDelIter(o *IterOptions) keyspan.FragmentIterator
    inuseBytes() uint64
    totalBytes() uint64
    readyForFlush() bool
}

The key realization was that ingested SSTables already share many properties with flushables, such as:

  • being sorted
  • being immutable
  • having well-defined iterators.

By implementing the flushable interface, ingested SSTables could be seamlessly integrated with the existing flush infrastructure.


Implementation Strategy

// ingestedFlushable wraps ingested SSTables as flushable entities
type ingestedFlushable struct {
    files        []*fileMetadata    // Metadata for ingested files
    cmp          Compare            // Key comparator
    newIters     tableNewIters      // Iterator factory
    slice        manifest.LevelSlice // Virtual level slice
    logNum       base.DiskFileNum   // Associated WAL number
    exciseSpan   KeyRange           // Excised key range (if any)
}

func (f *ingestedFlushable) newIter(o *IterOptions) internalIterator {
    // Create a level iterator over the ingested files
    return newLevelIter(o, f.cmp, f.newIters, f.slice.Iter())
}

func (f *ingestedFlushable) readyForFlush() bool {
    // Always ready - files are already on disk
    return true
}

func (f *ingestedFlushable) inuseBytes() uint64 {
    // Return total size of ingested files
    var size uint64
    for _, file := range f.files {
        size += file.Size
    }
    return size
}

This design provided several key advantages:

  • Immediate visibility: Reads could access ingested data immediately through the flushable's iterator interface
  • Zero copy: No data movement required—the SSTable files were already on disk
  • Ordering preservation: The flush queue maintained proper sequence number ordering (see the sketch below)
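To make the ordering guarantee concrete, here is a sketch of what queueing an overlapping ingest might look like. createIngestedFlushable, addToFlushQueue, and currentLogNum appear elsewhere in this post; rotateMemtable and maybeScheduleFlush are assumed stand-ins for Pebble's memtable-rotation and flush-scheduling paths, not its exact API.

// Queueing an overlapping ingest as a flushable (sketch; helper names assumed)
func (d *DB) handleIngestAsFlushable(metas []*fileMetadata) error {
    d.mu.Lock()
    defer d.mu.Unlock()

    // Seal the current mutable memtable: its entries carry sequence numbers
    // older than the ingested SSTables, so it must sit ahead of them in the
    // flush queue
    d.rotateMemtable()

    // Wrap the ingested files and append them behind the sealed memtables.
    // From this point on, reads merge the ingested data via the queue
    f := d.createIngestedFlushable(metas, d.currentLogNum())
    d.addToFlushQueue(f)

    // Kick off a background flush; the ingest call itself returns immediately
    d.maybeScheduleFlush()
    return nil
}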

Implementation Deep Dive

The lazy ingestion system required careful coordination between multiple subsystems. Let's examine the technical implementation through a practical example.

Scenario: Overlapping Ingestion

We start with a memtable containing keys H-I, and we need to ingest two SSTables covering the ranges A-B and E-L.

Step 1: Overlap Detection

The ingestion process begins with precise overlap detection:

func (d *DB) analyzeIngestOverlaps(meta []*fileMetadata) (overlaps []overlapInfo, err error) {
    d.mu.Lock()
    defer d.mu.Unlock()

    for i, file := range meta {
        // Check overlap with mutable memtable
        if d.hasMemtableOverlap(file, d.mu.mem.mutable) {
            overlaps = append(overlaps, overlapInfo{
                fileIndex: i,
                memtable:  d.mu.mem.mutable,
                keyRange:  file.UserKeyBounds(),
            })
        }

        // Check overlap with immutable memtables
        for _, imm := range d.mu.mem.queue {
            if d.hasMemtableOverlap(file, imm.flushable) {
                overlaps = append(overlaps, overlapInfo{
                    fileIndex: i,
                    memtable:  imm.flushable,
                    keyRange:  file.UserKeyBounds(),
                })
            }
        }
    }

    return overlaps, nil
}

func (d *DB) hasMemtableOverlap(file *fileMetadata, mem flushable) bool {
    iter := mem.newIter(nil)
    if iter == nil {
        return false
    }
    defer iter.Close()

    // Range overlap test: the file overlaps the memtable iff
    // file.Smallest <= memLargest AND file.Largest >= memSmallest
    first, _ := iter.First()
    if first == nil {
        return false // empty memtable
    }
    memSmallest := first.Clone() // copy before repositioning the iterator
    memLargest, _ := iter.Last()
    return d.cmp(file.Smallest.UserKey, memLargest.UserKey) <= 0 &&
        d.cmp(file.Largest.UserKey, memSmallest.UserKey) >= 0
}

Step 2: WAL-Based Durability

Before any state changes, the ingestion is logged for crash recovery:

func (d *DB) recordIngestOperation(paths []string, seqNum uint64) error {
    // Encode each ingested file path as a special batch entry so that WAL
    // replay can reconstruct the ingested flushable after a crash
    batch := d.newBatch()
    for _, path := range paths {
        batch.logData(InternalKeyKindIngestSST, []byte(path), nil)
    }

    // Force a WAL sync so the ingestion record is durable before the
    // ingested keys become visible
    opts := &writeOptions{Sync: true}
    return d.apply(batch, opts)
}

Even if the system crashes between ingestion and flush completion, the WAL replay will reconstruct the ingested flushables and complete the LSM integration process.


The WAL replay system handles ingested SSTables transparently:

func (d *DB) replayWAL() error {
    for {
        record, err := d.readWALRecord()
        if err == io.EOF {
            break
        }
        if err != nil {
            return err
        }

        switch record.Kind {
        case InternalKeyKindIngestSST:
            // Reconstruct ingested SSTable from WAL record
            path := string(record.Key)
            if err := d.recreateIngestedFlushable(path); err != nil {
                return err
            }
        default:
            // Handle regular writes
            if err := d.applyWALRecord(record); err != nil {
                return err
            }
        }
    }
    return nil
}

func (d *DB) recreateIngestedFlushable(path string) error {
    // Re-read SSTable metadata from disk
    meta, err := d.loadSSTableMetadata(path)
    if err != nil {
        return err
    }

    // Recreate the flushable entry and add it back to the flush queue
    f := d.createIngestedFlushable([]*fileMetadata{meta}, d.currentLogNum())
    d.addToFlushQueue(f)

    return nil
}

Step 3: ingestedFlushable homogeneity

Our main change: transforming SSTables into flushable objects that integrate seamlessly into the flushables queue.

func (d *DB) createIngestedFlushable(meta []*fileMetadata, logNum base.DiskFileNum) *ingestedFlushable {
    // Sort files by smallest key for efficient iteration
    sort.Slice(meta, func(i, j int) bool {
        return d.cmp(meta[i].Smallest.UserKey, meta[j].Smallest.UserKey) < 0
    })

    return &ingestedFlushable{
        files:    meta,
        cmp:      d.cmp,
        newIters: d.newIters,
        logNum:   logNum,
        slice:    manifest.NewLevelSliceKeySorted(d.cmp, meta),
    }
}

func (f *ingestedFlushable) newIter(o *IterOptions) internalIterator {
    // Fuller version of the newIter sketch above: build a merging iterator
    // over all ingested files, cleaning up any already-opened iterators on
    // error. (A levelIter over the sorted file slice is an equivalent way to
    // provide the same unified view.)
    iters := make([]internalIterator, len(f.files))
    for i, file := range f.files {
        iter, err := f.newIters(file, o, internalIterOpts{})
        if err != nil {
            // Clean up already opened iterators
            for j := 0; j < i; j++ {
                iters[j].Close()
            }
            return &errorIter{err: err}
        }
        iters[i] = iter
    }

    // Return a merging iterator that provides unified view
    return newMergingIter(f.cmp, iters...)
}

Sequence Number Management: The ingested SSTable receives sequence number 15, newer than the memtable's entries (10-14). This ensures proper read ordering while allowing immediate visibility.
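A minimal sketch of that assignment, assuming a logSeqNum counter that hands out sequence numbers and a visibleSeqNum that gates read visibility (the field and function names here are illustrative, not Pebble's exact internals):

// Stamp ingested files with a sequence number newer than all buffered writes,
// then publish it so reads can see the ingested keys (sketch; names assumed)
func (d *DB) assignIngestSeqNum(files []*fileMetadata) uint64 {
    // Reserve the next sequence number; every memtable entry written so far
    // carries a smaller one
    seqNum := atomic.AddUint64(&d.logSeqNum, 1)

    for _, f := range files {
        f.SmallestSeqNum = seqNum
        f.LargestSeqNum = seqNum
    }

    // Publishing the sequence number makes the ingested keys visible to new
    // reads, even though the files have not yet been placed into the LSM
    atomic.StoreUint64(&d.visibleSeqNum, seqNum)
    return seqNum
}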


Asynchronous Flush

Ingestion now returns immediately; a background flush later drains the flushable queue (memtables and ingested SSTables alike) into the LSM.


Step 4: Queue-ordered processing

The flush system processes entries in strict sequence number order, ensuring consistency:

func (d *DB) processFlushQueue() error {
    d.mu.Lock()
    defer d.mu.Unlock()

    for len(d.mu.mem.queue) > 0 {
        entry := d.mu.mem.queue[0]

        // Determine flush strategy based on flushable type
        var err error
        switch f := entry.flushable.(type) {
        case *memTable:
            // Standard memtable flush to L0
            err = d.flushMemtable(f, entry.logNum)
        case *ingestedFlushable:
            // Special handling for ingested SSTables
            err = d.flushIngestedSST(f)
        default:
            err = errors.New("unknown flushable type")
        }
        if err != nil {
            return err
        }

        // Pop the flushed entry and continue in sequence-number order
        d.mu.mem.queue = d.mu.mem.queue[1:]
    }
    return nil
}

Step 5: LSM Placement

(Note: 000005.sst and 000004.sst are ingested SSTables)


First: The original memtable is flushed to L0, creating SSTable 000007.sst with keys H-I.

Second: The ingested SSTables are placed optimally:

  • A-B range: No overlaps detected, placed directly in L6 (zero-copy)
  • E-L range: Overlaps with flushed memtable, must go to L0

This differential placement is a key optimization – files from the same ingestion batch can end up at different levels based on overlap analysis.


Unlike memtables that always flush to L0, ingested SSTables can be placed at their optimal level:

func (d *DB) flushIngestedSST(f *ingestedFlushable) error {
    ve := &versionEdit{DeletedFiles: make(map[deletedFileEntry]*fileMetadata)}

    for _, file := range f.files {
        // Analyze the LSM structure to find the deepest safe placement.
        // targetLevel == 0 means the file overlaps shallower data and must
        // land in L0; deeper levels give the zero-copy fast path
        targetLevel := d.pickIngestLevel(file)
        ve.NewFiles = append(ve.NewFiles, newFileEntry{
            Level: targetLevel,
            Meta:  file,
        })
    }

    // Atomically update the LSM manifest
    return d.mu.versions.logAndApply(ve)
}

func (d *DB) pickIngestLevel(file *fileMetadata) int {
    // The ingested file carries the newest sequence numbers for its keys, so
    // it must sit above (shallower than) any existing data it overlaps. Walk
    // from the top of the LSM downward and stop just above the first level
    // that overlaps the file's key range. (Checks for in-progress compactions
    // are elided here.)
    targetLevel := 0
    for level := 0; level < numLevels; level++ {
        if d.mu.versions.current.overlaps(level, d.cmp,
            file.Smallest.UserKey, file.Largest.UserKey, false) {
            // Overlap at this level: the deepest safe placement is the level
            // above it (L0 tolerates overlap via sequence numbers)
            break
        }
        targetLevel = level
    }
    return targetLevel
}

Benchmarking

Measuring ingestion performance requires careful benchmarking that captures real-world scenarios where memtable overlaps create bottlenecks.

I created synthetic workloads that stressed the ingestion system:

func BenchmarkIngestOverlappingMemtable(b *testing.B) {
    // Create overlapping memtables with varying degrees of key overlap
    for _, numMemtables := range []int{1, 2, 3} {
        b.Run(fmt.Sprintf("memtables=%d", numMemtables), func(b *testing.B) {
            db := setupTestDB(b)

            // Fill memtables with overlapping key ranges
            for i := 0; i < numMemtables; i++ {
                fillMemtableWithKeys(db, 10000, i*5000, (i+2)*5000)
            }

            // Create SSTable that overlaps with all memtables
            sstPath := createOverlappingSSTables(b, 1000, 7500)

            b.ResetTimer()
            for i := 0; i < b.N; i++ {
                err := db.Ingest([]string{sstPath})
                require.NoError(b, err)
            }
        })
    }
}

The performance improvements were promising and scaled with the degree of overlap:

Scenario                   Baseline   Flushable SSTables   Improvement
1 Overlapping Memtable     237µs      126µs                46.73%
2 Overlapping Memtables    615µs      230µs                62.69%
3 Overlapping Memtables    1.66ms     0.41ms               75.45%

How did we achieve this?

  1. Eliminated blocking: Ingestion no longer waits for memtable flush completion (200-500ms saved per memtable)
  2. Reduced system calls: No forced fsync operations on the ingestion critical path
  3. Improved parallelism: Flush operations happen concurrently with ongoing writes

Trade-offs:

By transforming ingested SSTables into flushable entities, we effectively shifted the workload from the ingestion critical path to the flush process. This allowed us to eliminate blocking and reduce system calls during ingestion, while introducing a small overhead during the actual flush process.

Scenario      Baseline   Flushable SST Flush   Overhead
1 SSTable     68.5µs     103.5µs               +51.21%
2 SSTables    68.6µs     106.7µs               +55.59%
3 SSTables    68.2µs     109.0µs               +59.89%

However, this flush overhead is asynchronous, amortized, and minimal (40µs additional latency vs. 1000µs+ savings during ingestion).

In production, this optimization reduced bulk import time by 60%, eliminated write stalls during bulk loading, and improved cluster stability and elasticity during rebalancing.


Broader Implications

The flushable SSTables project demonstrates how thoughtful systems design can yield substantial performance improvements without sacrificing correctness. By understanding the fundamental constraints (sequence number ordering, durability requirements, and LSM invariants), we found a solution that improved performance while actually simplifying the operational model.