This post describes my implementation of Flushable SSTables in Pebble during my time at Cockroach Labs in 2022.
High-throughput distributed databases face a fundamental challenge: how to efficiently ingest large amounts of external data without disrupting ongoing operations. During my internship at Cockroach Labs, I tackled one such bottleneck in Pebble's SSTable ingestion system, where SSTables overlapping with memtables could force the entire ingestion process to stall, waiting for the memtables to flush to disk.
This post details how I redesigned Pebble's ingestion pipeline to use a lazy evaluation approach, transforming ingested SSTables into special "flushable" entities that integrate seamlessly with the existing flush infrastructure. The result: up to 75% faster ingestion with no compromise on durability or consistency guarantees.
SSTable ingestion is a critical operation in distributed databases; CockroachDB relies on it for bulk data imports and for range rebalancing, both of which hand Pebble large, pre-built SSTables to add to the store.
The original Pebble implementation followed a straightforward but problematic approach:
Every write in Pebble receives a monotonically increasing sequence number that establishes a total ordering, and an ingested SSTable is assigned a sequence number in the same way. The original ingestion flow looked like this:
```go
// Original ingestion flow (simplified)
func (d *DB) ingest(paths []string) error {
	// Load table metadata for the files at paths (details omitted)
	ssts := d.loadIngestMetadata(paths)

	// Analyze overlap with memtables
	if d.hasMemtableOverlap(ssts) {
		// BLOCKING: Wait for memtable flush to complete
		if err := d.forceFlushMemtable(); err != nil {
			return err
		}
		// Wait for flush job completion and manifest update
		d.waitForFlushCompletion()
	}

	// Only now can we add the SSTables to the LSM
	return d.addToLSM(ssts)
}
```
This synchronous approach created the following performance pathologies:
Write Stalls: New writes, whose sequence numbers had to be published after the ingestion's, were blocked until the ingestion completed, leaving CPU and I/O bandwidth idle. In high-throughput scenarios, this could pause all writes for 200-500ms.
The problem compounded with multiple overlapping memtables: each one had to complete its own flush before ingestion could proceed, as the sketch below illustrates.
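To make the compounding effect concrete, here is a rough sketch of what the synchronous path amounted to. The helper names (`overlapsIngest`, the per-entry `flushed` channel) are illustrative stand-ins rather than Pebble's actual API:

```go
// Hypothetical sketch of the old behavior: every immutable memtable in the
// queue that overlapped the ingested keys had to complete its own flush
// before ingestion could move on, so the stall grew with the number of
// overlapping memtables.
func (d *DB) waitForOverlappingFlushes(ssts []*fileMetadata) {
	for _, entry := range d.mu.mem.queue {
		if d.overlapsIngest(entry, ssts) { // stand-in overlap check
			d.maybeScheduleFlush()
			// Block until this particular memtable has been written to L0.
			<-entry.flushed
		}
	}
}
```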
The solution involved rearchitecting the ingestion pipeline to treat ingested SSTables as equal partners with memtables in the flush process. By transforming ingested SSTables into special "flushable" entities, they could be seamlessly integrated with the existing flush pattern.
The lazy ingestion approach is based on three core principles:
- Lazy placement: An ingested SSTable is made visible to reads immediately, but its placement in the LSM is deferred until the flush that drains the memtables queued ahead of it. This allows efficient range rebalancing and bulk data imports without blocking ongoing writes (see the sketch after this list).
- Unified flush path: Both memtables and ingested SSTables flow through the same flush infrastructure, ensuring consistency and strong durability guarantees.
- Crash safety: All operations must survive system crashes with the same durability guarantees as regular writes.
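Taken together, these principles collapse the ingest path into a short, non-blocking sequence. The sketch below is illustrative rather than Pebble's literal code; helpers such as `pathsOf`, `nextSeqNum`, and `publishSeqNum` are stand-ins for the real plumbing:

```go
// Sketch of the lazy ingestion path (illustrative helper names). Instead of
// waiting for overlapping memtables to flush, the ingested SSTables are
// wrapped in an ingestedFlushable, queued behind the memtables, and made
// visible to reads; their LSM placement happens later, during flush.
func (d *DB) ingestLazy(meta []*fileMetadata) error {
	d.mu.Lock()
	defer d.mu.Unlock()

	// 1. Record the ingestion in the WAL so a crash can replay it.
	if err := d.recordIngestOperation(pathsOf(meta), d.nextSeqNum()); err != nil {
		return err
	}

	// 2. Wrap the ingested files as a flushable and append it to the
	//    flushable queue; no flush is forced here.
	f := d.createIngestedFlushable(meta, d.currentLogNum())
	d.addToFlushQueue(f)

	// 3. Publish the sequence number so reads see the ingested keys
	//    immediately, then let the background flush do the LSM placement.
	d.publishSeqNum()
	d.maybeScheduleFlush()
	return nil
}
```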
### The flushable Interface

Pebble's existing flush system was built around the flushable interface:
```go
type flushable interface {
	newIter(o *IterOptions) internalIterator
	newFlushIter(o *IterOptions, byteInterval keyspan.Interval) internalIterator
	newRangeDelIter(o *IterOptions) keyspan.FragmentIterator
	inuseBytes() uint64
	totalBytes() uint64
	readyForFlush() bool
}
```
The key realization was that ingested SSTables already share the essential properties of a flushable: they are immutable, their contents can be iterated over in key order, and they are already durable on disk, so they are always ready to "flush".
By implementing the flushable interface, ingested SSTables could be seamlessly integrated with the existing flush infrastructure.
```go
// ingestedFlushable wraps ingested SSTables as flushable entities
type ingestedFlushable struct {
	files      []*fileMetadata     // Metadata for ingested files
	cmp        Compare             // Key comparator
	newIters   tableNewIters       // Iterator factory
	slice      manifest.LevelSlice // Virtual level slice
	logNum     base.DiskFileNum    // Associated WAL number
	exciseSpan KeyRange            // Excised key range (if any)
	readTime   time.Time           // Ingestion time, tracked for metrics
}

func (f *ingestedFlushable) newIter(o *IterOptions) internalIterator {
	// Create a level iterator over the ingested files
	return newLevelIter(o, f.cmp, f.newIters, f.slice.Iter())
}

func (f *ingestedFlushable) readyForFlush() bool {
	// Always ready - the files are already on disk
	return true
}

func (f *ingestedFlushable) inuseBytes() uint64 {
	// Return the total size of the ingested files
	var size uint64
	for _, file := range f.files {
		size += file.Size
	}
	return size
}
```
This design provided several key advantages: ingestion no longer had to wait for memtable flushes, the ingested files were never rewritten or copied, and the existing flush scheduling, WAL, and manifest machinery could be reused unchanged.
The lazy ingestion system required careful coordination between multiple subsystems. Let's examine the technical implementation through a practical example.
We start with a memtable containing keys H-I and we need to ingest an SSTable with ranges A-B and E-L.
The ingestion process begins with precise overlap detection:
```go
func (d *DB) analyzeIngestOverlaps(meta []*fileMetadata) (overlaps []overlapInfo, err error) {
	d.mu.Lock()
	defer d.mu.Unlock()

	for i, file := range meta {
		// Check overlap with the mutable memtable
		if d.hasMemtableOverlap(file, d.mu.mem.mutable) {
			overlaps = append(overlaps, overlapInfo{
				fileIndex: i,
				memtable:  d.mu.mem.mutable,
				keyRange:  file.UserKeyBounds(),
			})
		}

		// Check overlap with the immutable memtables
		for _, imm := range d.mu.mem.queue {
			if d.hasMemtableOverlap(file, imm.flushable) {
				overlaps = append(overlaps, overlapInfo{
					fileIndex: i,
					memtable:  imm.flushable,
					keyRange:  file.UserKeyBounds(),
				})
			}
		}
	}
	return overlaps, nil
}

func (d *DB) hasMemtableOverlap(file *fileMetadata, mem flushable) bool {
	iter := mem.newIter(nil)
	if iter == nil {
		return false
	}
	defer iter.Close()

	// Range overlap test using the memtable's smallest and largest keys:
	// the ranges overlap iff file.Smallest <= mem.Largest and
	// file.Largest >= mem.Smallest.
	first, _ := iter.First()
	if first == nil {
		return false // an empty memtable cannot overlap
	}
	last, _ := iter.Last()
	return d.cmp(file.Smallest.UserKey, last.UserKey) <= 0 &&
		d.cmp(file.Largest.UserKey, first.UserKey) >= 0
}
```
Before any state changes, the ingestion is logged for crash recovery:
```go
func (d *DB) recordIngestOperation(paths []string, seqNum uint64) error {
	// Create a specialized WAL record for the ingested SSTables
	record := &ingestRecord{
		paths:     paths,
		seqNum:    seqNum,
		timestamp: time.Now().UnixNano(),
	}

	// Encode the record as special batch entries, one per ingested file
	batch := d.newBatch()
	for _, path := range record.paths {
		batch.logData(InternalKeyKindIngestSST, []byte(path), nil)
	}

	// Force a WAL sync for durability
	opts := &writeOptions{Sync: true}
	return d.apply(batch, opts)
}
```
Even if the system crashes between ingestion and flush completion, the WAL replay will reconstruct the ingested flushables and complete the LSM integration process.
The WAL replay system handles ingested SSTables transparently:
```go
func (d *DB) replayWAL() error {
	for {
		record, err := d.readWALRecord()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}

		switch record.Kind {
		case InternalKeyKindIngestSST:
			// Reconstruct the ingested SSTable from the WAL record
			path := string(record.Key)
			if err := d.recreateIngestedFlushable(path); err != nil {
				return err
			}
		default:
			// Handle regular writes
			if err := d.applyWALRecord(record); err != nil {
				return err
			}
		}
	}
	return nil
}

func (d *DB) recreateIngestedFlushable(path string) error {
	// Re-read the SSTable metadata from disk
	meta, err := d.loadSSTableMetadata(path)
	if err != nil {
		return err
	}

	// Recreate the flushable and add it back to the flushable queue
	flushable := d.createIngestedFlushable([]*fileMetadata{meta}, d.currentLogNum())
	d.addToFlushQueue(flushable)
	return nil
}
```
### ingestedFlushable homogeneity

Our main change: transforming ingested SSTables into flushable objects that integrate seamlessly into the flushable queue.
```go
func (d *DB) createIngestedFlushable(meta []*fileMetadata, logNum FileNum) *ingestedFlushable {
	// Sort files by smallest key for efficient iteration
	sort.Slice(meta, func(i, j int) bool {
		return d.cmp(meta[i].Smallest.UserKey, meta[j].Smallest.UserKey) < 0
	})

	return &ingestedFlushable{
		files:    meta,
		cmp:      d.cmp,
		newIters: d.newIters,
		logNum:   logNum,
		slice:    manifest.NewLevelSliceKeySorted(d.cmp, meta),
		readTime: time.Now(), // Track ingestion time for metrics
	}
}

func (f *ingestedFlushable) newIter(o *IterOptions) internalIterator {
	// Open an iterator per ingested file
	iters := make([]internalIterator, len(f.files))
	for i, file := range f.files {
		iter, err := f.newIters(file, o, internalIterOpts{})
		if err != nil {
			// Clean up any iterators we already opened
			for j := 0; j < i; j++ {
				iters[j].Close()
			}
			return &errorIter{err: err}
		}
		iters[i] = iter
	}

	// Return a merging iterator that provides a unified view of all files
	return newMergingIter(f.cmp, iters...)
}
```
Sequence Number Management: The ingested SSTable receives sequence number 15, newer than the memtable's entries (10-14). This ensures proper read ordering while allowing immediate visibility.
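A minimal sketch of that sequence number bookkeeping is shown below; the field and helper names are illustrative rather than Pebble's exact API:

```go
// Sketch of sequence number assignment for an ingestion (illustrative names).
// Every key in the ingested files shares one sequence number, allocated after
// all keys already in the memtables, so reads order them correctly.
func (d *DB) assignIngestSeqNum(meta []*fileMetadata) uint64 {
	// In the running example the memtable holds entries 10-14, so the
	// ingestion is assigned sequence number 15.
	seqNum := d.mu.versions.logSeqNum.Add(1) - 1

	for _, file := range meta {
		// The file's keys are interpreted as having this sequence number.
		file.SmallestSeqNum = seqNum
		file.LargestSeqNum = seqNum
	}

	// Make the ingested keys visible to new reads immediately.
	d.mu.versions.visibleSeqNum.Store(seqNum + 1)
	return seqNum
}
```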
While ingestion returns immediately, the flush is the background process that actually moves entries from the flushable queue (both memtables and ingested SSTables) into the LSM.
The flush system processes entries in strict sequence number order, ensuring consistency:
```go
func (d *DB) processFlushQueue() error {
	d.mu.Lock()
	defer d.mu.Unlock()

	for len(d.mu.mem.queue) > 0 {
		entry := d.mu.mem.queue[0]

		// Determine the flush strategy based on the flushable type
		var err error
		switch f := entry.flushable.(type) {
		case *memTable:
			// Standard memtable flush to L0
			err = d.flushMemtable(f, entry.logNum)
		case *ingestedFlushable:
			// Special handling for ingested SSTables: no data is rewritten,
			// the files are linked into the LSM at their target levels
			_, err = d.flushIngestedSST(f)
		default:
			err = errors.New("unknown flushable type")
		}
		if err != nil {
			return err
		}

		// Pop the flushed entry and continue with the next one
		d.mu.mem.queue = d.mu.mem.queue[1:]
	}
	return nil
}
```
(Note: 000005.sst and 000004.sst are ingested SSTables)
First: The original memtable is flushed to L0, creating SSTable 000007.sst with keys H-I.
Second: The ingested ranges are placed at their optimal levels: the A-B range overlaps nothing else in the LSM and can be linked directly into a deep level such as L6, while the E-L range overlaps the freshly flushed keys H-I and therefore lands in L0, above 000007.sst.
This differential placement is a key optimization: different files from the same ingestion can end up at different levels based on overlap analysis.
Unlike memtables that always flush to L0, ingested SSTables can be placed at their optimal level:
```go
func (d *DB) flushIngestedSST(f *ingestedFlushable) (*versionEdit, error) {
	ve := &versionEdit{DeletedFiles: make(map[deletedFileEntry]*fileMetadata)}

	for _, file := range f.files {
		// Analyze the LSM structure to find the optimal placement for this
		// file. Level 0 is the fallback when the file overlaps shallower
		// data; otherwise the file is linked directly at its target level
		// (zero-copy optimization).
		targetLevel := d.pickIngestLevel(file)
		ve.NewFiles = append(ve.NewFiles, newFileEntry{
			Level: targetLevel,
			Meta:  file,
		})
	}

	// Atomically update the LSM manifest
	return ve, d.mu.versions.logAndApply(ve)
}

func (d *DB) pickIngestLevel(file *fileMetadata) int {
	// The ingested file holds the newest data for its key range, so it must
	// sit at a shallower level than any overlapping existing data. Walk from
	// L1 downward and return the deepest level reachable without hitting an
	// overlap; default to L0 if even L1 overlaps.
	// (The real implementation also avoids levels with conflicting
	// in-progress compactions.)
	targetLevel := 0
	for level := 1; level < numLevels; level++ {
		if d.mu.versions.current.overlaps(level, d.cmp,
			file.Smallest.UserKey, file.Largest.UserKey, false) {
			break
		}
		targetLevel = level
	}
	return targetLevel
}
```
Measuring ingestion performance requires careful benchmarking that captures real-world scenarios where memtable overlaps create bottlenecks.
I created synthetic workloads that stressed the ingestion system:
```go
func BenchmarkIngestOverlappingMemtable(b *testing.B) {
	// Create overlapping memtables with varying degrees of key overlap
	for _, numMemtables := range []int{1, 2, 3} {
		b.Run(fmt.Sprintf("memtables=%d", numMemtables), func(b *testing.B) {
			db := setupTestDB(b)

			// Fill the memtables with overlapping key ranges
			for i := 0; i < numMemtables; i++ {
				fillMemtableWithKeys(db, 10000, i*5000, (i+2)*5000)
			}

			// Create an SSTable that overlaps with all of the memtables
			sstPath := createOverlappingSSTables(b, 1000, 7500)

			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				err := db.Ingest([]string{sstPath})
				require.NoError(b, err)
			}
		})
	}
}
```
The performance improvements were promising and scaled with the degree of overlap:
| Scenario | Baseline | Flushable SSTables | Improvement |
|---|---|---|---|
| 1 Overlapping Memtable | 237µs | 126µs | 46.73% |
| 2 Overlapping Memtables | 615µs | 230µs | 62.69% |
| 3 Overlapping Memtables | 1.66ms | 0.41ms | 75.45% |
Trade-offs:
By transforming ingested SSTables into flushable entities, we effectively shifted the workload from the ingestion critical path to the flush process. This allowed us to eliminate blocking and reduce system calls during ingestion, while introducing a small overhead during the actual flush process.
| Scenario | Baseline | Flushable SST Flush | Overhead |
|---|---|---|---|
| 1 SSTable | 68.5µs | 103.5µs | +51.21% |
| 2 SSTables | 68.6µs | 106.7µs | +55.59% |
| 3 SSTables | 68.2µs | 109.0µs | +59.89% |
However, this flush overhead is asynchronous, amortized, and small: roughly 35-40µs of additional flush latency versus more than 1ms saved on the ingestion path in the three-memtable case, a net saving of about 1.2ms per ingestion.
In production, this optimization reduced bulk import time by 60%, eliminated write stalls during bulk loading, and improved cluster stability and elasticity during rebalancing.
The flushable SSTables project demonstrates how thoughtful systems design can yield substantial performance improvements without sacrificing correctness. By working within the fundamental constraints (sequence number ordering, durability requirements, and LSM invariants), we found a solution that improved performance while actually simplifying the operational model.