A log-structured key-value storage engine in Go, modeled after the Bitcask paper from Basho.
Built as the storage layer of a larger distributed systems portfolio — exposed via HTTP and gRPC servers, and used as the state machine backend for a Raft consensus implementation.
put → append to log → update in-memory index → O(1)
get → lookup index → single disk seek → O(1)
del → write tombstone → remove from index → O(1)
Most KV store implementations skip the hard parts: crash recovery, compaction, CRC-validated binary encoding, hint files, and the tradeoffs that make each decision non-obvious. This one doesn’t.
Every design decision is documented in ARCHITECTURE.md
with the reasoning and the alternatives explicitly rejected. The engine is
built to be read, not just used.
Every record written to the log has a fixed binary header:
┌──────────┬───────────┬──────────────┬──────────────┬─────────────────────┐
│ CRC32 │ Timestamp │ Key Size │ Val Size† │ Key │ Value │
│ 4 bytes │ 8 bytes │ 4 bytes │ 4 bytes │ ... │ ... │
└──────────┴───────────┴──────────────┴──────────────┴─────────────────────┘
† Bit 31 of Val Size is reserved as the tombstone flag.
CRC32 is stored first so corruption is detected before allocating memory for the key or value. Timestamp enables compaction to resolve conflicts between duplicate keys across segments.
The keydir is an in-memory hash map that stores the exact disk location of every live key:
type KeydirEntry struct {
SegmentPath string // which segment file
Offset int64 // byte offset of the record header
ValueSize uint32 // how many bytes to read
Timestamp int64 // used during compaction
}A Get never scans the log. It does exactly one map lookup and one disk
seek. This is the core performance property of the Bitcask model.
Tradeoff: the keydir must fit entirely in RAM. This is a deliberate
design boundary — the Bitcask model trades bounded key-space for O(1) read
latency. For datasets where the key space exceeds available RAM, an LSM-tree
or B-tree structure is more appropriate. See ARCHITECTURE.md.
Delete does not remove data from disk. It appends a tombstone record —
a header-only entry with bit 31 of ValSize set — and removes the key
from the keydir. Disk space is reclaimed during compaction.
This design guarantees that a delete survives a crash: if the process dies
after writing the tombstone, the key remains absent after recovery. Using
bit 31 as the tombstone flag (rather than a zero-length value convention)
preserves the ability to store []byte{} as a valid value.
On Open, the engine scans all segment files in chronological order and
reconstructs the keydir. For each segment it first attempts the fast path:
loading from a hint file (which omits value bytes — reads ~20% of the data).
If the hint file is missing or its CRC is invalid, it falls back to a full
record-by-record scan with CRC validation. Corrupted or partial records
cause the segment to be truncated to the last valid offset.
When the active segment reaches MaxFileSize, it is sealed (reopened as
O_RDONLY) and a new active segment is created. Segment names encode a
unix timestamp, pid, and atomic counter — lexicographic sort equals
chronological order, which recovery relies on.
import "github.com/siluk00/gocask/internal"
store, err := gocask.Open(gocask.Config{
Dir: "/tmp/mystore",
MaxFileSize: 1 << 30, // 1GB segments
Policy: 0, // no SyncOnWrite
})
if err != nil {
log.Fatal(err)
}
defer store.Close()
// Write
if err := store.Put([]byte("user:1001"), []byte(`{"name":"felipe"}`)); err != nil {
log.Fatal(err)
}
// Read
val, err := store.Get([]byte("user:1001"))
if err != nil {
log.Fatal(err) // gocask.ErrKeyNotFound if absent
}
// Delete (writes tombstone, reclaimed at compaction)
if err := store.Delete([]byte("user:1001")); err != nil {
log.Fatal(err)
}The store is safe for concurrent use. Multiple goroutines may call Get
simultaneously. Put and Delete are serialized internally.
type Config struct {
Dir string // store directory (required)
MaxFileSize uint32 // rotate active segment at this size in bytes
Policy ConfigFlags // bitmask of behavioral flags
}
const (
SyncOnWrite ConfigFlags = 1 << iota // fsync after every Put (~97x write latency cost)
)SyncOnWrite off is the correct default for most workloads. It means
the OS may buffer up to ~30s of writes before flushing to disk. A process
crash loses at most that window. A machine crash (power loss) loses more
without SyncOnWrite.
gocask/
├── internal/
│ ├── gocask.go # Store implementation + Bitcask interface
│ ├── storage/
│ │ ├── segment.go # Append-only log segment
│ │ ├── keydir.go # In-memory index
│ │ ├── record.go # Binary encoding: records, hints, metadata
│ │ └── recovery.go # Crash recovery + hint file loading
│ ├── bench_test.go # Throughput benchmarks
│ ├── highload_test.go # Concurrent correctness under rotation
│ ├── profiling_test.go # CPU, memory, block profiling entry points
│ └── store_test.go # Unit + integration tests
├── cmd/
│ ├── server-http/ # HTTP server [TODO]
│ └── server-grpc/ # gRPC server [TODO]
├── proto/ # Protobuf definitions [TODO]
├── profiles/ # Generated profile files (gitignored)
├── ARCHITECTURE.md
├── Makefile
└── go.mod
# Tests with race detector
make test
# All benchmarks
make bench
# CPU profile — sequential Put (opens browser)
make profile-cpu-put
# Memory profile — Put allocations (opens browser)
make profile-mem-put
# Block profile — lock contention under parallel writes (opens browser)
make profile-block
# Recovery benchmark
make profile-recovery
# Lint
make lintRequirements: Go 1.22+, golangci-lint, graphviz (for profiling)
These are intentional design boundaries, not missing features:
Keydir must fit in RAM. Each entry costs ~80 bytes. 1M keys ≈ 80MB.
No range queries. Hash index — exact-key lookup only. Range scans require a full log scan and are not a supported use case.
Single active writer. Concurrent writes are serialized via mutex.
Concurrent reads scale without contention via sync.RWMutex.
Values are opaque bytes. Serialization and schema management are the caller’s responsibility.
No transactions. Per-key atomicity only.
- Sheehy & Smith — Bitcask: A Log-Structured Hash Table for Fast Key/Value Data (Basho, 2010)
- Kleppmann — Designing Data-Intensive Applications, chapters 3 and 5
- LevelDB implementation notes
MIT