
This post describes my implementation of Flushable SSTables in Pebble during my time at Cockroach Labs in 2022.


I previously wrote about my work on concurrent manual compactions in Pebble, where I also covered the fundamentals of storage engines. Building on that foundation, this post explores another performance enhancement I implemented: Flushable SSTables.


The Problem with Ingestion

When an external SSTable is added (ingested) to Pebble's LSM, the ingestion stalls if the table being ingested overlaps with a memtable: it must wait for that memtable to be flushed before it can proceed.

This waiting is necessary because:

  1. Ingested SSTables are assigned sequence numbers higher than those of the entries already in the memtable
  2. We cannot add the ingested table to the LSM until overlapping entries with older sequence numbers are written to L0

This creates a bottleneck that blocks not only the ingestion itself but also all subsequent writes, which are assigned sequence numbers higher than the ingested SSTables'. Those writes cannot become visible until the ingest completes, and the ingest is in turn blocked on the memtable flush and the manifest update.


The Solution: Flushable SSTables

The key insight is to lazily add ingested SSTables to the LSM. Instead of forcing a flush when an overlap is detected, we can:

  1. Add the SSTable to the flushable queue as a fake memtable
  2. Schedule a flush that will eventually add the SSTs to the LSM
  3. Return from the ingestion without waiting for the flush to complete

This allows the ingestion to complete without waiting, improving performance significantly.


Let's walk through how this works in practice, splitting the process into an ingestion phase and a flush phase.


Ingestion Phase

Step 0: Initial State

We start with a memtable containing keys H-I (in grey) and want to ingest an SSTable with keys A-B and E-L (in blue).

Step 1: Lazy Ingestion

First, Pebble detects that the SSTables to ingest overlap with the current memtable (keys H-I overlap with E-L).

Previously, we would:

  • Force flush the memtable to L0
  • Wait for flush completion
  • Set sequence numbers on ingested SST
  • Add SST to appropriate level
  • Return from ingestion

With the new approach, we:

  1. Add the SST to the flushable queue as a pseudo-memtable
  2. Create an iterator over the SST that's compatible with memtable iterators
  3. Insert a new mutable memtable for upcoming writes
  4. Record ingested SST details in the WAL for crash recovery
  5. Schedule an async flush, serving reads/writes while the flush is in progress
  6. Return from ingestion

Because the ingestion is recorded in the WAL (step 4), system integrity is preserved even if a crash occurs before the SSTs are durably added to the LSM: on restart, WAL replay reconstructs the ingested SSTs' place in the flushable queue, maintaining data consistency.


Flush Phase

Step 2: Async Flush Processing

While the ingestion has already returned, the flush process continues in the background:

  • The flushable queue now contains both the memtable and the ingested SSTable
  • The flush job processes them according to their position in the queue

First, the original memtable is flushed to L0, creating a new SSTable (000007.sst).

Step 3: Placing the Ingested SSTables

The ingested SSTables are then incorporated into the LSM at their optimal levels. The A-B range, which overlaps nothing in the existing tree, is placed in L6; since the file already exists on disk, this requires no data copying. The E-L range, which overlapped the memtable, is placed in L0, joining the memtable's freshly flushed output.

Step 4: Final State

Upon completion of both flushes, the LSM tree now includes the flushed memtable in L0 and the ingested SSTable distributed across its optimal levels. The memtable is cleared, and the ingested SSTable data is efficiently positioned within the LSM structure.

With this approach, the ingestion returns much faster, and the actual flush happens asynchronously in the background.


Implementation Details

Let's explore the key components of this implementation:

1. Detecting Memtable Overlap

First, we need to determine if there's an overlap between the SSTable and any memtable:

// Check whether the flushable (e.g. a memtable) contains any key within
// the ingested SSTable's [Smallest, Largest] bounds.
func ingestMemtableOverlaps(cmp Compare, mem flushable, meta *fileMetadata) bool {
    iter := mem.newIter(nil)
    defer iter.Close()

    // Seek to the first key >= the SSTable's smallest bound; the ranges
    // overlap iff such a key exists and is <= the SSTable's largest bound.
    key, _ := iter.SeekGE(meta.Smallest.UserKey)
    return key != nil && cmp(key.UserKey, meta.Largest.UserKey) <= 0
}

2. Creating a Flushable Wrapper

The core of the implementation is a new type that implements the flushable interface:

// Wraps ingested SSTables as a flushable
type ingestedFlushable struct {
    files        []*fileMetadata
    cmp          Compare
    newIters     tableNewIters
    slice        manifest.LevelSlice
}

// Implements the flushable interface
func (s *ingestedFlushable) newIter(o *IterOptions) internalIterator {
    return newLevelIter(o, s.cmp, s.newIters, s.slice.Iter())
}

// Always ready for flush since files are already on disk
func (s *ingestedFlushable) readyForFlush() bool {
    return true
}
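The flushable interface has several more methods (the exact set depends on the Pebble version). For an ingestedFlushable most are trivial or deliberately unimplemented, because the data already lives in immutable files on disk; the sketch below assumes that interface shape:

// No memtable arena memory is consumed: the data lives in files on disk.
func (s *ingestedFlushable) inuseBytes() uint64 { return 0 }
func (s *ingestedFlushable) totalBytes() uint64 { return 0 }

// Ingested files are never rewritten at flush time; the flush job moves
// them into the LSM with a version edit (see section 5 below), so a flush
// iterator should never be requested.
func (s *ingestedFlushable) newFlushIter(o *IterOptions, bytesFlushed *uint64) internalIterator {
    panic("pebble: cannot flush-iterate an ingestedFlushable")
}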

3. Recording in the WAL for Crash Recovery

We need to record the ingestion in the Write-Ahead Log to survive crashes:

// Add a new batch type for ingested SSTables
func (b *Batch) ingestSST(path []byte) {
    if b.Empty() {
        b.ingestedSSTBatch = true
    } else if !b.ingestedSSTBatch {
        panic("pebble: invalid call to ingestSST")
    }
    
    // Record the SSTable path in the WAL
    b.prepareDeferredKeyRecord(len(path), InternalKeyKindIngestSST)
    copy(b.deferredOp.Key, path)
}
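InternalKeyKindIngestSST is a new record kind, so it has to be registered alongside Pebble's existing internal key kinds; the concrete value below is illustrative:

// New WAL record kind tagging ingested-SSTable paths, so replay can tell
// ingestion records apart from ordinary writes. The value just extends
// the existing InternalKeyKind enum (22 here is illustrative).
const InternalKeyKindIngestSST InternalKeyKind = 22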

4. Adding to the Flushable Queue

The main function that handles the ingestion as a flushable:

func (d *DB) handleIngestAsFlushable(
    meta []*fileMetadata, paths []string, seqNum uint64,
) error {
    // 1. Record the ingested files in the WAL.
    b := d.NewBatch()
    for _, path := range paths {
        b.ingestSST([]byte(path))
    }
    b.setSeqNum(seqNum)
    // (The batch is then written to the WAL; omitted here.)

    // 2. Rotate to a new WAL.
    logNum := d.newLogNum()

    // 3. Wrap the ingested files as a flushable.
    entry := d.newIngestedFlushableEntry(meta, seqNum, logNum)

    // 4. Add it to the flushable queue and install a new mutable memtable.
    d.mu.mem.queue = append(d.mu.mem.queue, entry)
    d.makeRoomForWrite(nil)

    // 5. Schedule an async flush, but don't wait for it.
    d.maybeScheduleFlush()
    return nil
}
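The newIngestedFlushableEntry helper referenced above bundles the files into an ingestedFlushable and wraps it in a queue entry. A sketch, with illustrative field names (flushableEntry is the queue's wrapper type):

// Build the queue entry for an ingestion. The entry remembers which WAL
// recorded the ingestion and the sequence number assigned to its keys.
func (d *DB) newIngestedFlushableEntry(
    meta []*fileMetadata, seqNum uint64, logNum FileNum,
) *flushableEntry {
    f := &ingestedFlushable{
        files:    meta,
        cmp:      d.cmp,
        newIters: d.newIters,
        slice:    manifest.NewLevelSliceKeySorted(d.cmp, meta),
    }
    return &flushableEntry{
        flushable: f,
        logNum:    logNum,
        logSeqNum: seqNum,
    }
}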

5. Handling During Flush

When it's time to flush, we handle the ingested SSTable differently:

func (d *DB) flushIngestedSSTables(c *compaction) (*versionEdit, error) {
    // Get the flushable to be flushed
    f := c.flushing[0].flushable.(*ingestedFlushable)
    
    // Create version edit to place files in the LSM
    ve := &versionEdit{}
    
    for _, file := range f.files {
        // Determine optimal level for the SSTable
        level := d.pickLevelForIngest(file)
        ve.NewFiles = append(ve.NewFiles, newFileEntry{
            Level: level,
            Meta:  file,
        })
    }
    
    return ve, nil
}
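pickLevelForIngest stands in for Pebble's target-level logic: push each file as deep into the LSM as possible, stopping above the first level that already holds overlapping keys, since older overlapping data must end up below the newer ingested data. A simplified sketch (overlapsLevel is a hypothetical helper; the real logic also accounts for in-progress compactions):

// Choose the lowest level whose existing files don't overlap the ingested
// file's key range. A file overlapping nothing can go straight to L6; one
// that overlaps the newest data stays in L0.
func (d *DB) pickLevelForIngest(meta *fileMetadata) int {
    v := d.mu.versions.currentVersion()
    target := 0
    for level := 1; level < numLevels; level++ {
        if overlapsLevel(v, level, meta.Smallest.UserKey, meta.Largest.UserKey) {
            break
        }
        target = level
    }
    return target
}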


WAL Replay for Crash Recovery

To ensure durability, we need to handle crash recovery by replaying WAL entries:

func (d *DB) replayWAL() error {
    // ...for each batch decoded from the WAL...
    // Ingestion records are identified by their key kind.
    if kind == InternalKeyKindIngestSST {
        // Extract the SSTable paths recorded by ingestSST.
        paths := extractSSTPathsFromBatch(batch)

        // Re-load the SSTable metadata from the files on disk.
        meta := loadSSTMetadata(paths)

        // Recreate the flushable entry exactly as the ingestion did.
        entry := d.newIngestedFlushableEntry(meta, seqNum, logNum)

        // In read-only mode the entry simply serves reads from the queue;
        // otherwise recovery schedules a flush to move it into the LSM.
        d.mu.mem.queue = append(d.mu.mem.queue, entry)
        if !d.opts.ReadOnly {
            d.maybeScheduleFlush()
        }
    }
    // ...remaining replay logic...
    return nil
}
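extractSSTPathsFromBatch and loadSSTMetadata above are stand-ins. The first is straightforward with Pebble's batch reader; a sketch, assuming the reader yields (kind, key, value, ok) tuples:

// Walk the replayed batch's records and collect every path that was
// written by ingestSST during the original ingestion.
func extractSSTPathsFromBatch(batch *Batch) []string {
    var paths []string
    r := batch.Reader()
    for {
        kind, key, _, ok := r.Next()
        if !ok {
            break
        }
        if kind == InternalKeyKindIngestSST {
            paths = append(paths, string(key))
        }
    }
    return paths
}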


Performance Impact

I ran detailed benchmarks to measure the impact of this optimization:

Test                                    Original Time   With Flushable SSTs   Improvement
IngestOverlappingMemtable/memtables=1   237µs           126µs                 46.73%
IngestOverlappingMemtable/memtables=2   615µs           230µs                 62.69%
IngestOverlappingMemtable/memtables=3   1.66ms          0.41ms                75.45%


As expected, the improvement becomes more dramatic as the number of overlapping memtables increases. With three memtables, we see a 75% reduction in ingestion time!
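The numbers come from standard Go sub-benchmarks; their shape is roughly the following (a sketch of the harness, not the actual Pebble test code):

func BenchmarkIngestOverlappingMemtable(b *testing.B) {
    for _, memtables := range []int{1, 2, 3} {
        b.Run(fmt.Sprintf("memtables=%d", memtables), func(b *testing.B) {
            for i := 0; i < b.N; i++ {
                b.StopTimer()
                // ...open a fresh DB, write overlapping keys into
                // `memtables` memtables, and build the SSTable to ingest...
                b.StartTimer()
                // ...time only the Ingest call...
            }
        })
    }
}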

There is a small performance cost on the flush path itself:

Test                              Original Time   With Flushable SSTs   Change
FlushOverlappingMemtable/ssts=1   68.5µs          103.5µs               +51.21%
FlushOverlappingMemtable/ssts=2   68.6µs          106.7µs               +55.59%
FlushOverlappingMemtable/ssts=3   68.2µs          109.0µs               +59.89%


However, this tradeoff is well worth it, as the flush performance impact is significantly smaller than the ingestion performance gain, and flushes happen asynchronously in the background.


Conclusion

The flushable SSTables optimization provides a significant performance improvement for Pebble's ingest operations. By avoiding the forced wait for memtable flushes, we achieved up to 75% faster ingestion while maintaining all the necessary guarantees for crash recovery and sequence ordering.

This implementation demonstrates a classic performance pattern: moving work off the critical path. By deferring the actual flush operation to the background, we allow the ingestion to return much faster, improving overall system responsiveness.

This optimization is particularly valuable for operations that involve bulk ingestion, such as imports, restores, and data rebalancing operations in distributed databases like CockroachDB.