High-level overview of the storage engine's design.
┌─────────────────────────────────────────────┐
│ Client │
│ (shell, bench, test, SQL*) │
├─────────────────────────────────────────────┤
│ BPlusTree │
│ Insert · Search · Delete · RangeQuery │
├─────────────────────────────────────────────┤
│ Page Abstraction Layer │
│ LeafPage · InternalPage │
├─────────────────────────────────────────────┤
│ Buffer Pool Manager (Phase 2) │
│ LRU eviction · pin/unpin │
├───────────────────────┬─────────────────────┤
│ Write-Ahead Log │ DiskManager │
│ crash recovery (WAL) │ mmap · alloc · sync│
├───────────────────────┴─────────────────────┤
│ Linux Kernel / FS │
│ (ext4 / xfs / tmpfs) │
└─────────────────────────────────────────────┘
Manages a single memory-mapped index file.
- File I/O: `open()` → `mmap()` → `munmap()` lifecycle via RAII.
- Capacity growth: doubles the file via `ftruncate` + re-`mmap` when the next allocated page would exceed the current mapping.
- Page allocation: returns a byte offset into the mapped region; newly allocated pages are zeroed.
- Metadata page (page 0): stores `root_offset` and `next_page_offset`.
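The grow-and-remap path can be sketched as follows. This is an illustrative standalone class, not the real `DiskManager` (which also manages the metadata page and free list); the names and error handling are simplified.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

static constexpr int64_t kPageSize = 4096;

// Sketch: open() -> mmap() -> munmap() via RAII, doubling the file with
// ftruncate + re-mmap when the next page would exceed the mapping.
class DiskManagerSketch {
 public:
  explicit DiskManagerSketch(const char* path) {
    fd_ = open(path, O_RDWR | O_CREAT, 0644);
    file_size_ = kPageSize;                 // page 0: metadata
    (void)ftruncate(fd_, file_size_);
    base_ = static_cast<char*>(
        mmap(nullptr, file_size_, PROT_READ | PROT_WRITE, MAP_SHARED, fd_, 0));
    next_page_offset_ = kPageSize;
  }
  ~DiskManagerSketch() {                    // RAII teardown
    munmap(base_, file_size_);
    close(fd_);
  }
  int64_t AllocatePage() {
    if (next_page_offset_ + kPageSize > file_size_) Grow();
    int64_t off = next_page_offset_;
    next_page_offset_ += kPageSize;
    std::memset(base_ + off, 0, kPageSize); // new pages are zeroed
    return off;
  }
  char* GetPage(int64_t off) { return base_ + off; }

 private:
  void Grow() {                             // double the file, then re-mmap
    munmap(base_, file_size_);
    file_size_ *= 2;
    (void)ftruncate(fd_, file_size_);
    base_ = static_cast<char*>(
        mmap(nullptr, file_size_, PROT_READ | PROT_WRITE, MAP_SHARED, fd_, 0));
  }
  int fd_;
  char* base_;
  int64_t file_size_;
  int64_t next_page_offset_;
};
```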
Zero-overhead typed views over raw char* page buffers.
| Class | Layout |
|---|---|
| LeafPage | [num_keys(4) \| is_leaf=1(4) \| next_leaf(8) \| records…] |
| InternalPage | [num_keys(4) \| is_leaf=0(4) \| slots…] |
Each leaf record: [key(4 B) \| data(100 B)] = 104 bytes × 35 = 3640 B + 16 B header = 3656 B (fits in 4096 B page).
Each internal slot: [child_ptr(8 B) \| key(4 B)] = 12 bytes × 101 = 1212 B + 8 B header = 1220 B.
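A zero-overhead view can be sketched like this. The field offsets follow the leaf layout above; the accessor names are illustrative, not necessarily the real API.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr int kKeySize = 4, kDataSize = 100, kRecordSize = 104;
constexpr int kLeafHeaderSize = 16;  // num_keys(4) + is_leaf(4) + next_leaf(8)
constexpr int kMaxLeafKeys = 35;
static_assert(kLeafHeaderSize + kMaxLeafKeys * kRecordSize <= 4096,
              "35 records + header must fit in a 4096 B page");

// Illustrative typed view over a raw 4096-byte page buffer: no copies,
// just reinterpreted offsets into the mapped page.
class LeafPageView {
 public:
  explicit LeafPageView(char* buf) : buf_(buf) {}
  int32_t& num_keys()  { return *reinterpret_cast<int32_t*>(buf_); }
  int32_t& is_leaf()   { return *reinterpret_cast<int32_t*>(buf_ + 4); }
  int64_t& next_leaf() { return *reinterpret_cast<int64_t*>(buf_ + 8); }
  int32_t key_at(int i) const {
    int32_t k;
    std::memcpy(&k, buf_ + kLeafHeaderSize + i * kRecordSize, kKeySize);
    return k;
  }
  void set_record(int i, int32_t key, const char* data) {
    char* rec = buf_ + kLeafHeaderSize + i * kRecordSize;
    std::memcpy(rec, &key, kKeySize);        // key(4 B)
    std::memcpy(rec + kKeySize, data, kDataSize);  // data(100 B)
  }
 private:
  char* buf_;
};
```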
The core index.
- Insert: recursive descent to the target leaf → insert in sorted order → split upward if necessary → create new root on root split.
- Search: traverse internal nodes → binary-ish scan in the leaf.
- Range Query: locate starting leaf → follow the `next_leaf` linked list → collect matching records.
- Delete: recursive descent to target leaf → remove key → if underfull, try to redistribute from a sibling, otherwise merge → propagate underflow upward through internal nodes → shrink root when empty.
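The range-query walk over the leaf chain reduces to a simple loop. The sketch below uses simplified in-memory leaves (the real ones are on-disk pages reached via `next_leaf` offsets), and the function name is hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified in-memory leaf: sorted keys plus a pointer to the next leaf,
// mirroring the on-disk next_leaf chain.
struct Leaf {
  std::vector<int32_t> keys;
  Leaf* next = nullptr;
};

// Walk the leaf chain from the starting leaf, collecting keys in [lo, hi].
// Because keys are globally sorted across leaves, the scan can stop as
// soon as a key exceeds hi.
std::vector<int32_t> RangeScan(Leaf* start, int32_t lo, int32_t hi) {
  std::vector<int32_t> out;
  for (Leaf* l = start; l != nullptr; l = l->next) {
    for (int32_t k : l->keys) {
      if (k > hi) return out;       // past the range: done
      if (k >= lo) out.push_back(k);
    }
  }
  return out;                       // ran off the last leaf
}
```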
Lightweight result type inspired by LevelDB. Avoids `exit(1)` or exceptions
for expected errors like "key not found".
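A minimal sketch of such a type, assuming just two states; the real class may carry more error codes:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Illustrative LevelDB-style Status: expected failures are values the
// caller inspects, not exceptions or process exits.
class Status {
 public:
  static Status OK() { return Status(true, "OK"); }
  static Status NotFound(const std::string& key) {
    return Status(false, "NotFound: " + key);
  }
  bool ok() const { return ok_; }
  const std::string& ToString() const { return msg_; }

 private:
  Status(bool ok, std::string msg) : ok_(ok), msg_(std::move(msg)) {}
  bool ok_;
  std::string msg_;
};
```

A caller then branches on `s.ok()` instead of wrapping lookups in try/catch.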
┌────────────────────┐ offset 0
│ Metadata Page │ root_offset (int64) + next_page_off (int64)
├────────────────────┤ offset 4096
│ Page 1 │ leaf or internal node
├────────────────────┤ offset 8192
│ Page 2 │
├────────────────────┤
│ ... │
└────────────────────┘
The file grows in 4 KB increments. All writes go through mmap (`MAP_SHARED`),
so the kernel handles write-back to disk. `msync` is called on metadata
changes and periodically via `SyncAsync()`.
Implements an LRU page cache sitting between the BPlusTree and the DiskManager.
BPlusTree ──→ BufferPool ──→ DiskManager
↕ LRU cache of PageFrame slots
↕ pin count tracking
↕ dirty flag → flush on eviction
- Configurable pool size (default 1024 frames = 4 MB)
- Pin / unpin semantics: callers call `FetchPage()` to pin a page and must call `UnpinPage()` when done. Only unpinned frames are eviction candidates.
- LRU eviction: doubly-linked list (front = oldest). On a fetch miss, the least-recently-used unpinned frame is evicted. Dirty frames are flushed to disk before reuse.
- NewPage / DeletePage: allocates via `DiskManager::AllocatePage()` (which tries the free-page list first); deletion removes the frame and pushes the page onto the disk free-list.
- Statistics: hit count, miss count, hit rate exposed to the tree and benchmark tool.
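The LRU bookkeeping described above can be sketched in isolation. This is not the real buffer pool; the list holds page ids only, and the class name is made up:

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <list>
#include <unordered_map>

// Frame metadata as described above: pin count gates eviction, the dirty
// flag would force a flush before the slot is reused.
struct Frame {
  int64_t page_id;
  int pin_count;
  bool dirty;
};

// Illustrative LRU list: front = oldest. Touch() moves a page to the MRU
// end; PickVictim() scans from the oldest and skips pinned frames.
class LruListSketch {
 public:
  void Touch(int64_t page_id) {
    auto it = pos_.find(page_id);
    if (it != pos_.end()) order_.erase(it->second);
    order_.push_back(page_id);
    pos_[page_id] = std::prev(order_.end());
  }
  // Returns the oldest unpinned page id, or -1 if everything is pinned.
  int64_t PickVictim(const std::unordered_map<int64_t, Frame>& frames) {
    for (int64_t id : order_) {
      if (frames.at(id).pin_count == 0) return id;
    }
    return -1;
  }

 private:
  std::list<int64_t> order_;
  std::unordered_map<int64_t, std::list<int64_t>::iterator> pos_;
};
```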
Deleted pages are pushed onto a singly-linked list stored in the metadata page
(offset 16: `free_list_head`). Each freed page stores the previous head at
offset 0. `AllocatePage()` tries `ReclaimPage()` before growing the file,
so disk space is recycled.
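The intrusive free list can be sketched as two functions over the raw mapping (`base` points at file offset 0). The function names and the use of 0 as the empty-list sentinel are assumptions; the offsets follow the text above:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr int64_t kPageSize = 4096;
constexpr int64_t kFreeListHeadOffset = 16;  // free_list_head in the metadata page
constexpr int64_t kInvalidOffset = 0;        // page 0 is metadata, so 0 = empty list

// Push a freed page: it stores the previous head at its own offset 0,
// then becomes the new head.
void FreePage(char* base, int64_t page_offset) {
  int64_t old_head;
  std::memcpy(&old_head, base + kFreeListHeadOffset, sizeof(old_head));
  std::memcpy(base + page_offset, &old_head, sizeof(old_head));
  std::memcpy(base + kFreeListHeadOffset, &page_offset, sizeof(page_offset));
}

// Pop the head, or return kInvalidOffset so the caller grows the file.
int64_t ReclaimPage(char* base) {
  int64_t head;
  std::memcpy(&head, base + kFreeListHeadOffset, sizeof(head));
  if (head == kInvalidOffset) return kInvalidOffset;
  int64_t next;
  std::memcpy(&next, base + head, sizeof(next));
  std::memcpy(base + kFreeListHeadOffset, &next, sizeof(next));
  return head;
}
```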
A redo-only WAL that guarantees crash recovery for the index file.
- Before the buffer pool flushes any dirty page to the mmap region, the page's full after-image (4096 bytes) is appended to the WAL file and fsync'd.
- On clean shutdown, a checkpoint is written and the WAL is truncated.
- On crash recovery (next startup), all `PAGE_WRITE` records after the last completed checkpoint are replayed to the data file, restoring any pages that were logged but never made it to disk.
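The redo pass reduces to: find the last completed checkpoint, then re-apply every `PAGE_WRITE` after-image that follows. An in-memory sketch with deliberately simplified records (the real ones carry a full 4096-byte image plus a CRC):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

enum class RecType { kPageWrite, kCheckpointBegin, kCheckpointEnd };

// Simplified log record: after_image stands in for the 4096-byte page copy.
struct Rec {
  uint64_t lsn;
  RecType type;
  int64_t page_id;
  int after_image;
};

// Redo-only recovery: replay everything after the last *completed*
// checkpoint into the page store. Replaying is idempotent, since each
// record simply overwrites the page with its after-image.
void Recover(const std::vector<Rec>& log, std::map<int64_t, int>& pages) {
  size_t start = 0;
  for (size_t i = 0; i < log.size(); ++i)
    if (log[i].type == RecType::kCheckpointEnd) start = i + 1;
  for (size_t i = start; i < log.size(); ++i)
    if (log[i].type == RecType::kPageWrite)
      pages[log[i].page_id] = log[i].after_image;
}
```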
┌──────────────────────┐ offset 0
│ WALFileHeader (16B) │ magic "WAL1" · version · checkpoint_lsn
├──────────────────────┤
│ LogRecordHeader │ lsn · type · page_id · data_len · crc32
│ + page data (4096B) │ (only for PAGE_WRITE records)
├──────────────────────┤
│ LogRecordHeader │
│ + page data │
├──────────────────────┤
│ ... │
└──────────────────────┘
- Record types: `PAGE_WRITE`, `CHECKPOINT_BEGIN`, `CHECKPOINT_END`
- Checksum: CRC32 over the header (with the checksum field zeroed) XOR'd with the CRC32 of the payload. Corrupt or truncated records are treated as end-of-log.
- LSN (log sequence number): monotonically increasing 64-bit counter.
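The checksum rule can be sketched as below. The struct mirrors the header fields listed in the layout; the exact field widths and the bitwise (table-less) CRC32 are assumptions for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Bitwise CRC32 (reflected polynomial 0xEDB88320); slow but compact.
uint32_t Crc32(const uint8_t* data, size_t len) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int b = 0; b < 8; ++b)
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
  }
  return ~crc;
}

struct LogRecordHeader {  // field order as in the layout above
  uint64_t lsn;
  uint32_t type;
  int64_t page_id;
  uint32_t data_len;
  uint32_t crc32;
};

// Checksum rule from the text: CRC32 of the header with the checksum field
// zeroed, XOR'd with CRC32 of the payload. A zeroed copy is hashed so that
// struct padding bytes are deterministic.
uint32_t RecordChecksum(const LogRecordHeader& h, const uint8_t* payload) {
  LogRecordHeader tmp;
  std::memset(&tmp, 0, sizeof(tmp));  // crc32 (and padding) stay zero
  tmp.lsn = h.lsn;
  tmp.type = h.type;
  tmp.page_id = h.page_id;
  tmp.data_len = h.data_len;
  uint32_t c = Crc32(reinterpret_cast<const uint8_t*>(&tmp), sizeof(tmp));
  return c ^ Crc32(payload, h.data_len);
}
```

On read-back, a mismatch between the stored and recomputed value marks the record (and everything after it) as end-of-log.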
- `BufferPool::FlushPage()`, `FlushAllPages()`, and `EvictFrame()` log via `WriteAheadLog::LogPageWrite()` before writing to the mmap.
- The `BPlusTree` constructor creates the WAL, runs `Recover()`, then attaches the WAL to the buffer pool.
- The `BPlusTree` destructor and `Checkpoint()` write a checkpoint record and truncate the WAL.
- WAL can be disabled with `enable_wal=false` for backward compatibility.