This post describes my implementation of Flushable SSTables in Pebble during my time at Cockroach Labs in 2022.
I previously wrote about my work on concurrent manual compactions in Pebble, where I also covered the fundamentals of storage engines. Building on that foundation, this post explores another performance enhancement I implemented: Flushable SSTables.
When an external SSTable needs to be added (ingested) to Pebble's LSM, it experiences a hiccup if the table being ingested overlaps with a memtable. In this scenario, the ingestion needs to wait for the memtable to be flushed before proceeding.
This waiting is necessary because the ingested SSTable is assigned a sequence number newer than every overlapping key already sitting in the memtable. The LSM must preserve that ordering: if the ingested table were installed while the older, overlapping entries were still in the memtable, a later flush would place older data on top of newer data. Forcing the flush first keeps the per-key sequence ordering intact.

This creates a bottleneck that blocks not only the ingestion itself but also any subsequent writes, since those writes receive sequence numbers higher than the ingested tables' and cannot become visible until the ingest completes; the ingest, in turn, is stuck waiting on the memtable flush and the manifest update.
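To make the old behavior concrete, here is a rough sketch of the blocking path an ingest used to take when it overlapped a memtable. This is illustrative pseudocode, not Pebble's actual ingest code; the helper names (`memtableQueue`, `overlapsAny`, `flushMemtableAndWait`, `applyIngestVersionEdit`) are hypothetical.

```go
// Hypothetical sketch of the pre-optimization ingest path.
func ingestBlocking(d *DB, tables []*fileMetadata) error {
	// If any memtable in the queue overlaps the incoming tables, the
	// ingest cannot proceed until that memtable has reached L0.
	for _, mem := range d.memtableQueue() { // oldest to newest; illustrative accessor
		if overlapsAny(mem, tables) {
			// Force a flush and block until it completes. Every write
			// issued after this point queues up behind the ingest.
			if err := d.flushMemtableAndWait(mem); err != nil {
				return err
			}
		}
	}
	// Only now can the tables be linked into the LSM via a version edit.
	return d.applyIngestVersionEdit(tables)
}
```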
The key insight is to lazily add ingested SSTables to the LSM. Instead of forcing a flush when an overlap is detected, we can:

- Wrap the ingested SSTables in a new type that implements Pebble's `flushable` interface, the same interface memtables satisfy.
- Record the ingestion in the WAL so it survives a crash.
- Append the wrapper to the memtable (flushable) queue behind the overlapping memtable, and let a background flush move the files into the LSM later.
This allows the ingestion to complete without waiting, improving performance significantly.
Let's walk through how this works in practice, separating into the ingestion and flush phases.
We start with a memtable containing keys H-I (in grey) and want to ingest two SSTables covering the key ranges A-B and E-L (in blue).
First, Pebble detects that one of the SSTables to ingest overlaps the current memtable (the E-L range overlaps H-I).
Previously, we would:

- Detect the overlap and force the memtable to flush.
- Block the ingestion, and every write issued after it, until that flush and the subsequent manifest update completed.
- Only then add the SSTables to the LSM and make the ingest visible.

With the new approach, we:

- Write a record to the WAL containing the paths of the ingested SSTables.
- Wrap the SSTables' metadata in an `ingestedFlushable` and append it to the flushable queue, directly behind the overlapping memtable.
- Switch to a fresh memtable and WAL for incoming writes, schedule a flush, and return from the ingestion immediately.
Because the ingestion is recorded in the WAL, this approach remains crash-safe even if the process dies before the SSTs are permanently added to the LSM: on restart, WAL replay reconstructs the ingested-SST entry in the flushable queue, so no acknowledged ingest is lost.
While the ingestion has already returned, the flush process continues in the background:
First, the original memtable is flushed to L0, creating a new SSTable (000007.sst).
The ingested SSTables are then incorporated into the LSM at their target levels. The A-B table, which overlaps nothing else in the tree, goes straight to L6; this requires no data copying, since the file already exists on disk. The E-L table, which overlapped the memtable, is placed in L0 alongside the freshly flushed memtable data.

Once both flushes complete, the LSM contains the flushed memtable data in L0 and the ingested SSTables at their target levels. The old memtable is released, and the ingested data sits exactly where it belongs in the LSM structure.
With this approach, the ingestion returns much faster, and the actual flush happens asynchronously in the background.
Let's explore the key components of this implementation:
First, we need to determine if there's an overlap between the SSTable and any memtable:
```go
// Check if a memtable overlaps with the SST being ingested.
func ingestMemtableOverlaps(cmp Compare, mem flushable, meta *fileMetadata) bool {
	iter := mem.newIter(nil)
	if iter == nil {
		return false
	}
	defer iter.Close()

	// Simple range check between the memtable's and the SSTable's bounds.
	return cmp(meta.Smallest.UserKey, iter.Largest().UserKey) <= 0 &&
		cmp(meta.Largest.UserKey, iter.Smallest().UserKey) >= 0
}
```
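During an ingest, a check along these lines runs for every table being ingested against every entry still in the flushable queue. The sketch below is illustrative (the loop structure and the `flushable` field on queue entries are simplified for this post):

```go
// Illustrative: does any table in the ingest overlap any memtable (or other
// flushable) currently in the queue?
func ingestOverlapsAnyMemtable(d *DB, cmp Compare, metas []*fileMetadata) bool {
	for _, entry := range d.mu.mem.queue { // oldest to newest
		for _, meta := range metas {
			if ingestMemtableOverlaps(cmp, entry.flushable, meta) {
				return true
			}
		}
	}
	return false
}
```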
The core of the implementation is a new type that implements the `flushable` interface:
```go
// Wraps ingested SSTables as a flushable.
type ingestedFlushable struct {
	files    []*fileMetadata
	cmp      Compare
	newIters tableNewIters
	slice    manifest.LevelSlice
}

// Implements the flushable interface.
func (s *ingestedFlushable) newIter(o *IterOptions) internalIterator {
	return newLevelIter(o, s.cmp, s.newIters, s.slice.Iter())
}

// Always ready for flush, since the files are already on disk.
func (s *ingestedFlushable) readyForFlush() bool {
	return true
}
```
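The entry that actually goes into the flushable queue is constructed by `newIngestedFlushableEntry`, which is used later but not shown in this post. A minimal sketch of what it might look like, assuming a `flushableEntry` wrapper with `flushable`, `logNum`, and `seqNum` fields and a key-sorted `LevelSlice` constructor (both assumptions, and reference counting is ignored):

```go
// Hypothetical sketch: wrap ingested file metadata as a queue entry.
func (d *DB) newIngestedFlushableEntry(
	meta []*fileMetadata, seqNum uint64, logNum FileNum,
) *flushableEntry {
	f := &ingestedFlushable{
		files:    meta,
		cmp:      d.cmp,
		newIters: d.newIters,
		// Assumed constructor: build a level slice ordered by user key.
		slice: manifest.NewLevelSliceKeySorted(d.cmp, meta),
	}
	return &flushableEntry{
		flushable: f,
		logNum:    logNum, // WAL containing the ingestSST records
		seqNum:    seqNum, // sequence number assigned to the ingest
	}
}
```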
We need to record the ingestion in the Write-Ahead Log to survive crashes:
```go
// Add a new batch type for ingested SSTables.
func (b *Batch) ingestSST(path []byte) {
	if b.Empty() {
		b.ingestedSSTBatch = true
	} else if !b.ingestedSSTBatch {
		panic("pebble: invalid call to ingestSST")
	}

	// Record the SSTable path in the WAL.
	b.prepareDeferredKeyRecord(len(path), InternalKeyKindIngestSST)
	copy(b.deferredOp.Key, path)
}
```
The main function that handles the ingestion as a flushable:
```go
func (d *DB) handleIngestAsFlushable(
	meta []*fileMetadata, paths []string, seqNum uint64,
) error {
	// 1. Record the ingested SSTable paths in the WAL
	//    (writing the batch to the new WAL is elided here).
	b := d.NewBatch()
	for _, path := range paths {
		b.ingestSST([]byte(path))
	}
	b.setSeqNum(seqNum)

	// 2. Create a new WAL.
	logNum := d.newLogNum()

	// 3. Create the flushable wrapper.
	entry := d.newIngestedFlushableEntry(meta, seqNum, logNum)

	// 4. Add it to the flushable queue and create a new memtable.
	d.mu.mem.queue = append(d.mu.mem.queue, entry)
	d.makeRoomForWrite(nil)

	// 5. Schedule a flush, but don't wait for it.
	d.maybeScheduleFlush()
	return nil
}
```
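For context, this function would be reached from the ingest path once an overlap with the flushable queue has been detected. The call site below is a simplified, hypothetical sketch (the real ingest path has more checks, and `ingestApplyVersionEdit` stands in for the pre-existing non-overlapping path):

```go
// Illustrative call site inside the ingest path.
func (d *DB) ingestMaybeAsFlushable(
	meta []*fileMetadata, paths []string, seqNum uint64,
) error {
	if ingestOverlapsAnyMemtable(d, d.cmp, meta) {
		// New behavior: enqueue the SSTables as a flushable and return
		// immediately; a background flush will place them in the LSM.
		return d.handleIngestAsFlushable(meta, paths, seqNum)
	}
	// No overlap: add the files to the LSM directly, as before.
	return d.ingestApplyVersionEdit(meta, seqNum) // hypothetical helper
}
```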
When it's time to flush, we handle the ingested SSTable differently:
```go
func (d *DB) flushIngestedSSTables(c *compaction) (*versionEdit, error) {
	// Get the flushable to be flushed.
	f := c.flushing[0].flushable.(*ingestedFlushable)

	// Create a version edit that places the files into the LSM.
	ve := &versionEdit{}
	for _, file := range f.files {
		// Determine the optimal level for this SSTable.
		level := d.pickLevelForIngest(file)
		ve.NewFiles = append(ve.NewFiles, newFileEntry{
			Level: level,
			Meta:  file,
		})
	}
	return ve, nil
}
```
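`pickLevelForIngest` isn't shown in this post. The idea behind such a level picker is to push each file as deep into the tree as possible without breaking ordering invariants: the ingested data is the newest version of its keys, so it must sit above any level that already contains overlapping keys, and it cannot overlap other files within its own level (for L1 and below). A rough sketch of that logic, with an illustrative `levelOverlaps` helper and ignoring in-progress compactions:

```go
// Illustrative level picker: return the deepest level at which the file can
// live without sitting below (or inside) a level containing overlapping keys.
func (d *DB) pickLevelForIngest(file *fileMetadata) int {
	target := 0 // L0 is always a safe fallback
	for level := 1; level < numLevels; level++ {
		if d.levelOverlaps(level, file.Smallest.UserKey, file.Largest.UserKey) {
			// Overlapping, older data lives here; we must stay above it.
			break
		}
		target = level
	}
	return target
}
```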
To ensure durability, we need to handle crash recovery by replaying WAL entries:
```go
func (d *DB) replayWAL() error {
	// ... iterate over the WAL's batches and their records (elided) ...

	// During replay, detect InternalKeyKindIngestSST records.
	if kind == InternalKeyKindIngestSST {
		// Extract the SSTable paths recorded by the ingest.
		paths := extractSSTPathsFromBatch(batch)
		// Load the SSTable metadata from disk.
		meta := loadSSTMetadata(paths)
		// Recreate the flushable entry.
		entry := d.newIngestedFlushableEntry(meta, seqNum, logNum)
		// Keep it in the queue or flush it, depending on the DB mode.
		if d.opts.ReadOnly {
			d.mu.mem.queue = append(d.mu.mem.queue, entry)
		} else {
			// Simplified: the real path routes this through the flush
			// machinery shown in flushIngestedSSTables above.
			d.flushIngestedSSTables(entry)
		}
	}

	// ... continue replaying the remaining records (elided) ...
	return nil
}
```
I ran detailed benchmarks to measure the impact of this optimization:
| Test | Original Time | With Flushable SSTs | Improvement |
|---|---|---|---|
| IngestOverlappingMemtable/memtables=1 | 237µs | 126µs | 46.73% |
| IngestOverlappingMemtable/memtables=2 | 615µs | 230µs | 62.69% |
| IngestOverlappingMemtable/memtables=3 | 1.66ms | 0.41ms | 75.45% |
As expected, the improvement becomes more dramatic as the number of overlapping memtables increases. With three memtables, we see a 75% reduction in ingestion time!
There is a small performance cost when actually flushing the memtables:
| Test | Original Time | With Flushable SSTs | Change |
|---|---|---|---|
| FlushOverlappingMemtable/ssts=1 | 68.5µs | 103.5µs | +51.21% |
| FlushOverlappingMemtable/ssts=2 | 68.6µs | 106.7µs | +55.59% |
| FlushOverlappingMemtable/ssts=3 | 68.2µs | 109.0µs | +59.89% |
However, this tradeoff is well worth it, as the flush performance impact is significantly smaller than the ingestion performance gain, and flushes happen asynchronously in the background.
The flushable SSTables optimization provides a significant performance improvement for Pebble's ingest operations. By avoiding the forced wait for memtable flushes, we achieved up to 75% faster ingestion while maintaining all the necessary guarantees for crash recovery and sequence ordering.
This implementation demonstrates a classic performance pattern: moving work off the critical path. By deferring the actual flush operation to the background, we allow the ingestion to return much faster, improving overall system responsiveness.
This optimization is particularly valuable for operations that involve bulk ingestion, such as imports, restores, and data rebalancing operations in distributed databases like CockroachDB.