
Commit 30a89da

release: v5.0.0 — M10 Hydra
Content-defined chunking (CDC) for dramatically better dedup on versioned files. Buzhash rolling hash engine limits the blast radius of edits to 1–2 chunks (98.4% chunk reuse vs 32% for fixed-size chunking). New hexagonal ChunkingPort with FixedChunker and CdcChunker adapters. CasService refactored to delegate chunking to the port. Facade accepts declarative chunking config. ManifestSchema extended with optional chunking field (backward compatible). 709 tests, 0 lint errors.
1 parent 3612d05 commit 30a89da

24 files changed

Lines changed: 2112 additions & 264 deletions

CHANGELOG.md

Lines changed: 22 additions & 0 deletions
@@ -5,6 +5,28 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [5.0.0] — Hydra (2026-02-28)
+
+### Breaking Changes
+- **`CasService` constructor accepts `chunker` port** — a new optional `ChunkingPort` parameter controls chunking strategy. Existing code that does not pass `chunker` is unaffected (defaults to `FixedChunker`).
+- **Major version bump** — new hexagonal port (`ChunkingPort`) and manifest schema extension warrant a semver-major release for downstream tooling awareness.
+
+### Added
+- **Content-defined chunking (CDC)** — Buzhash rolling-hash engine with configurable `minChunkSize` (64 KiB), `maxChunkSize` (1 MiB), and `targetChunkSize` (256 KiB). CDC limits the dedup blast radius to 1–2 chunks on incremental edits vs. total invalidation with fixed-size chunking. Benchmarked at 265 MB/s and 98.4% chunk reuse on small edits.
+- **`ChunkingPort`** — new hexagonal port (`src/ports/ChunkingPort.js`) with `async *chunk(source)`, `strategy`, and `params`. Abstracts chunking behind a pluggable interface.
+- **`FixedChunker`** — adapter wrapping existing fixed-size buffer slicing behind `ChunkingPort`.
+- **`CdcChunker`** — adapter wrapping the buzhash CDC engine behind `ChunkingPort`.
+- **`chunking` manifest field** — optional `{ strategy: 'fixed' | 'cdc', params: {...} }` metadata in manifests. Fixed-strategy manifests omit the field for full backward compatibility.
+- **`ChunkingSchema`** — Zod discriminated union (`FixedChunkingSchema` + `CdcChunkingSchema`) for manifest validation.
+- **`INVALID_CHUNKING_STRATEGY` error code** — thrown when an unrecognized chunking strategy is encountered in a manifest.
+- **Facade `chunking` config** — `ContentAddressableStore` constructor accepts `chunking: { strategy, ... }` declarative config or a raw `chunker` port instance.
+- **CDC benchmarks** (`test/benchmark/chunking.bench.js`) — throughput and dedup efficiency comparison.
+- 90 new unit tests (709 total).
+
+### Changed
+- `CasService._chunkAndStore()` refactored to delegate to `ChunkingPort` instead of inline buffer slicing.
+- `ChunkingPort`, `FixedChunker`, `CdcChunker` exported from the main entry point.
+
 ## [4.0.1] — M8 Spit Shine + M9 Cockpit (2026-02-28)

 ### Added
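
The `chunking` manifest field and the `INVALID_CHUNKING_STRATEGY` error code listed in this changelog can be pictured with a small dependency-free sketch. The project validates with a Zod discriminated union, so this is only a stand-in: the function name `validateChunking`, the `min`/`target`/`max` parameter names (borrowed from the roadmap wording), the fallback to `fixed` when the field is absent, and the way `code` is attached to the error are all assumptions made for illustration.

```js
// Hypothetical stand-in for ChunkingSchema; the real project uses Zod.
function validateChunking(manifest) {
  const { chunking } = manifest;

  // Assumption: a manifest without the field is treated as fixed-size,
  // since fixed-strategy manifests omit it for backward compatibility.
  if (chunking === undefined) return { strategy: 'fixed' };

  if (chunking.strategy === 'fixed') return chunking;

  if (chunking.strategy === 'cdc') {
    // Parameter names are assumed; the shipped schema may differ.
    const { min, target, max } = chunking.params ?? {};
    const ok =
      [min, target, max].every((n) => Number.isInteger(n)) &&
      min > 0 && min <= target && target <= max;
    if (!ok) throw new TypeError('Invalid CDC chunking params');
    return chunking;
  }

  // Assumption: the error code is exposed as a `code` property.
  const err = new Error(`Unrecognized chunking strategy: ${chunking.strategy}`);
  err.code = 'INVALID_CHUNKING_STRATEGY';
  throw err;
}
```

Treating a missing field as `fixed` keeps older manifests readable without migration, which appears to be the intent of making the field optional.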

COMPLETED_TASKS.md

Lines changed: 13 additions & 0 deletions
@@ -4,6 +4,19 @@ Task cards moved here from ROADMAP.md after completion. Organized by milestone.

 ---

+# M10 — Hydra (v5.0.0) ✅ CLOSED
+
+**Theme:** Content-defined chunking for dramatically better dedup on versioned files.
+
+**Completed:** v5.0.0 (2026-02-28)
+
+- **Task 10.1:** Buzhash rolling hash + CDC chunking engine — standalone module (`src/infrastructure/chunkers/CdcChunker.js`) with 256-entry deterministic byte table, configurable min/max/target chunk sizes, streaming async generator. Benchmarked at 265 MB/s, 98.4% chunk reuse on small edits.
+- **Task 10.2:** `ChunkingPort` abstraction — new hexagonal port with `async *chunk(source)`, `strategy`, and `params`. `FixedChunker` and `CdcChunker` adapters. `CasService` refactored to delegate chunking to the port. Facade accepts `chunking` config and raw `chunker` option.
+- **Task 10.3:** CDC manifest metadata + backward compatibility — `ChunkingSchema` (Zod discriminated union), optional `chunking` field in `ManifestSchema`, `INVALID_CHUNKING_STRATEGY` error code. Old manifests remain valid.
+- **Task 10.4:** CDC benchmarks + dedup efficiency comparison (`test/benchmark/chunking.bench.js`) — throughput and dedup tables comparing CDC vs fixed chunking.
+
+---
+
 # M8 — Spit Shine (v4.0.1) ✅ CLOSED

 **Theme:** Polish and harden based on code review findings. Fix asymmetries, eliminate duplication, improve docs. No new features.
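
To make the Task 10.1 description more concrete, here is a minimal sketch of buzhash-based content-defined chunking. It is not the project's `CdcChunker`: the window size, the xorshift seed, and the names `makeTable`, `rotl`, and `cdcChunks` are assumptions, and it works synchronously on an in-memory `Buffer` rather than as a streaming async generator. The defaults mirror the documented 64 KiB / 256 KiB / 1 MiB sizes.

```js
// Hypothetical buzhash CDC sketch; not the project's CdcChunker.
const WINDOW = 48; // rolling window in bytes (assumed value)

// 256-entry table filled by a seeded xorshift32 PRNG, so it is deterministic.
function makeTable(seed = 0x9e3779b9) {
  const table = new Uint32Array(256);
  let x = seed >>> 0;
  for (let i = 0; i < 256; i++) {
    x ^= x << 13;
    x ^= x >>> 17;
    x ^= x << 5;
    x >>>= 0;
    table[i] = x;
  }
  return table;
}

const TABLE = makeTable();

// 32-bit rotate left, for 0 < n < 32.
const rotl = (v, n) => ((v << n) | (v >>> (32 - n))) >>> 0;

function* cdcChunks(buf, { min = 65536, max = 1048576, target = 262144 } = {}) {
  if (min > max || (target & (target - 1)) !== 0) {
    throw new RangeError('min must be <= max and target must be a power of two');
  }
  const mask = target - 1;
  let start = 0;
  let hash = 0;
  for (let i = 0; i < buf.length; i++) {
    const len = i - start + 1; // bytes in the current chunk, including buf[i]
    hash = rotl(hash, 1) ^ TABLE[buf[i]];
    // Once the window is full, cancel the byte that slides out of it.
    if (len > WINDOW) hash ^= rotl(TABLE[buf[i - WINDOW]], WINDOW % 32);
    hash >>>= 0;
    const natural = len >= min && (hash & mask) === 0;
    if (natural || len >= max) {
      yield buf.subarray(start, i + 1);
      start = i + 1;
      hash = 0;
    }
  }
  if (start < buf.length) yield buf.subarray(start); // final, possibly short, chunk
}
```

Because the hash depends only on the last `WINDOW` bytes once a chunk is longer than the window, boundaries after an edit tend to fall back onto the same content positions, which is what keeps most downstream chunks unchanged.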

README.md

Lines changed: 15 additions & 0 deletions
@@ -44,6 +44,21 @@ We use the object database.

 See [CHANGELOG.md](./CHANGELOG.md) for the full list of changes.

+## What's new in v5.0.0
+
+**Content-defined chunking (CDC)** — Fixed-size chunking invalidates every chunk after an edit. CDC uses a buzhash rolling hash to find natural boundaries, limiting the blast radius to 1–2 chunks. Benchmarked at 98.4% chunk reuse on small edits vs 32% for fixed.
+
+```js
+const cas = new ContentAddressableStore({
+  plumbing,
+  chunking: { strategy: 'cdc', targetChunkSize: 262144, minChunkSize: 65536, maxChunkSize: 1048576 },
+});
+```
+
+**`ChunkingPort`** — new hexagonal port abstracts chunking strategy. `FixedChunker` and `CdcChunker` adapters ship out of the box. Bring your own chunker by extending `ChunkingPort`.
+
+See [CHANGELOG.md](./CHANGELOG.md) for the full list of changes.
+
 ## What's new in v4.0.1

 **`git cas verify`** — verify stored asset integrity from the CLI without restoring (`git cas verify --slug my-asset`).
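
The README's "bring your own chunker" note can be illustrated with a rough sketch of a class that follows the port shape described in this commit (`async *chunk(source)`, `strategy`, `params`). The `ChunkingPort` base class below is a local stand-in, because the real import path is not shown here. `SimpleFixedChunker`, its constructor argument, and the use of getters for `strategy` and `params` are likewise assumptions.

```js
// Local stand-in for the real ChunkingPort export; shape assumed from the changelog.
class ChunkingPort {
  get strategy() { throw new Error('strategy not implemented'); }
  get params() { throw new Error('params not implemented'); }
  // eslint-disable-next-line require-yield
  async *chunk(_source) { throw new Error('chunk() not implemented'); }
}

// Hypothetical adapter: re-slices an AsyncIterable<Buffer> into fixed-size chunks.
class SimpleFixedChunker extends ChunkingPort {
  constructor(chunkSize = 262144) {
    super();
    this.chunkSize = chunkSize;
  }

  get strategy() { return 'fixed'; }
  get params() { return { chunkSize: this.chunkSize }; }

  async *chunk(source) {
    let pending = Buffer.alloc(0);
    for await (const piece of source) {
      pending = Buffer.concat([pending, piece]);
      // Emit every full chunk that is available so far.
      while (pending.length >= this.chunkSize) {
        yield pending.subarray(0, this.chunkSize);
        pending = pending.subarray(this.chunkSize);
      }
    }
    if (pending.length > 0) yield pending; // trailing partial chunk
  }
}
```

Assuming the facade accepts a raw port instance as the changelog states, such an object would be passed through the `chunker` option instead of the declarative `chunking` config.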

ROADMAP.md

Lines changed: 6 additions & 236 deletions
@@ -189,7 +189,7 @@ Return and throw semantics for every public method (current and planned).
 | v4.0.1 | M8+M9 | Spit Shine + Cockpit | CryptoPort refactor, verify, --json, error handler, vault list ||
 | v4.0.0 | M14 | Conduit | Streaming I/O, observability, parallel chunks ||
 | v3.1.0 | M13 | Bijou | TUI dashboard & progress ||
-| v5.0.0 | M10 | Hydra | Content-defined chunking | |
+| v5.0.0 | M10 | Hydra | Content-defined chunking | |
 | v5.1.0 | M11 | Locksmith | Multi-recipient encryption | |
 | v5.2.0 | M12 | Carousel | Key rotation | |

@@ -203,7 +203,7 @@ M13 Bijou (v3.1.0) ✅
 M14 Conduit (v4.0.0) ✅
 M8 Spit Shine + M9 Cockpit (v4.0.1) ✅

-M10 Hydra ──────────── (independent)
+M10 Hydra ──────────── ✅ v5.0.0
 M11 Locksmith ──────── (independent)
 └──► M12 Carousel ── (needs M11)
 ```
@@ -246,239 +246,9 @@ All tasks completed (9.2–9.5). See [COMPLETED_TASKS.md](./COMPLETED_TASKS.md).

 ---

-# M10 — Hydra (v3.0.0)
-**Theme:** Content-defined chunking for dramatically better dedup on versioned files. Fixed-size chunking invalidates every chunk after an edit; CDC limits the blast radius to 1–2 chunks. Major version bump for new chunking port and manifest metadata.
+# M10 — Hydra (v5.0.0) ✅ CLOSED

----
-
-## Task 10.1: Buzhash rolling hash + CDC chunking engine
-
-**User Story**
-As a developer storing versioned files, I want content-defined chunk boundaries so incremental changes don't invalidate every chunk downstream of the edit point.
-
-**Requirements**
-- R1: Implement Buzhash rolling hash algorithm with a 256-entry random byte table (deterministic seed).
-- R2: Implement CDC chunker that uses rolling hash to find chunk boundaries.
-- R3: Configurable parameters: `minChunkSize` (default 64 KiB), `maxChunkSize` (default 1 MiB), `targetChunkSize` (default 256 KiB).
-- R4: Chunk boundary determined when `(hash & mask) === 0`, where mask is derived from `targetChunkSize` (e.g., `targetChunkSize - 1` for power-of-2 targets).
-- R5: Force boundary at `maxChunkSize` if no natural boundary found (prevent unbounded chunks).
-- R6: Force minimum chunk size: never split below `minChunkSize` (prevent tiny chunks).
-- R7: Deterministic: same input always produces same chunks regardless of runtime.
-- R8: Streaming: operates on `AsyncIterable<Buffer>` with O(1) memory.
-
-**Acceptance Criteria**
-- AC1: CDC chunker produces variable-size chunks bounded by min/max.
-- AC2: Identical input always produces identical chunks (deterministic).
-- AC3: Inserting 10 bytes in the middle of a 1MB file changes only 1–2 chunks (not all downstream chunks).
-- AC4: Average chunk size approximates `targetChunkSize`.
-- AC5: No chunk smaller than `minChunkSize` (except final chunk of file).
-- AC6: No chunk larger than `maxChunkSize`.
-
-**Scope**
-- In scope: Rolling hash + CDC chunker implementation + unit tests.
-- Out of scope: Integration with CasService (Task 10.2), Rabin fingerprinting (Buzhash is simpler and sufficient), gear-based CDC.
-
-**Est. Complexity (LoC)**
-- Prod: ~200 (Buzhash table + rolling hash + CDC logic)
-- Tests: ~150 (determinism, boundary detection, size bounds, dedup)
-- Total: ~350
-
-**Est. Human Working Hours**
-- ~12h
-
-**Test Plan**
-- Golden path:
-  - 1MB buffer → produces ~4 chunks (target 256KB).
-  - Same buffer → same chunks every time.
-  - Modify 10 bytes at offset 500KB → only 1–2 chunks differ vs. original.
-- Failures:
-  - minChunkSize > maxChunkSize → throws configuration error.
-  - targetChunkSize outside [min, max] → throws.
-- Edges:
-  - File smaller than minChunkSize → single chunk.
-  - File exactly maxChunkSize → single chunk.
-  - All-zero file (degenerate hash behavior) → chunks bounded by max.
-  - File = 1 byte → single chunk.
-- Fuzz/stress:
-  - 100 random buffers (1KB–10MB, seeded): verify all chunks satisfy min/max bounds.
-  - Determinism: chunk same buffer 100 times, assert identical output.
-  - Dedup test: insert/delete 1–100 bytes at random offsets, measure % of chunks unchanged (expect >80% for small edits).
-
-**Definition of Done**
-- DoD1: Buzhash + CDC chunker implemented as standalone module under `src/infrastructure/chunkers/`.
-- DoD2: All boundary and determinism tests pass.
-- DoD3: Performance: >100 MB/s throughput on chunking alone (no I/O).
-
-**Blocking**
-- Blocks: Task 10.2, Task 10.4
-
-**Blocked By**
-- Blocked by: None
-
----
-
-## Task 10.2: ChunkingPort abstraction
-
-**User Story**
-As an architect, I want chunking strategy behind a port so fixed-size and CDC can be swapped without modifying the domain service.
-
-**Requirements**
-- R1: Add `src/ports/ChunkingPort.js` with abstract method `chunk(source: AsyncIterable<Buffer>): AsyncIterable<Buffer>`.
-- R2: Implement `FixedChunker` adapter wrapping existing `_chunkAndStore` buffer-slicing logic.
-- R3: Implement `CdcChunker` adapter wrapping Task 10.1's CDC engine.
-- R4: `CasService` constructor accepts optional `chunker` port. Defaults to `FixedChunker(chunkSize)`.
-- R5: Refactor `CasService._chunkAndStore()` to use the chunking port instead of inline buffer slicing.
-- R6: `ContentAddressableStore` constructor accepts optional `chunking` config: `{ strategy: 'fixed' | 'cdc', …params }`.
-
-**Acceptance Criteria**
-- AC1: `CasService({ chunker: new CdcChunker(…) })` uses CDC.
-- AC2: Default behavior (no chunker specified) is identical to current fixed-size chunking.
-- AC3: All existing store/restore tests pass without modification.
-- AC4: CDC chunker plugs in and produces valid manifests that restore correctly.
-
-**Scope**
-- In scope: Port + 2 adapters + CasService refactor + facade config.
-- Out of scope: Additional chunking strategies, auto-detection of optimal strategy.
-
-**Est. Complexity (LoC)**
-- Prod: ~80 (port + 2 adapters + service refactor + facade config)
-- Tests: ~40 (port contract tests, integration with both chunkers)
-- Total: ~120
-
-**Est. Human Working Hours**
-- ~4h
-
-**Test Plan**
-- Golden path:
-  - Store with FixedChunker → same behavior as before (byte-identical manifests).
-  - Store with CdcChunker → valid manifest, restore succeeds.
-- Failures:
-  - Chunker that yields empty buffers → handled gracefully (skip empty).
-- Edges:
-  - Switch chunker between store and restore → restore still works (chunking strategy doesn't affect restore — chunks are self-describing via manifest).
-- Fuzz/stress:
-  - 50 random files stored with both chunkers → all restore correctly.
-
-**Definition of Done**
-- DoD1: ChunkingPort, FixedChunker, CdcChunker implemented.
-- DoD2: CasService uses chunking port.
-- DoD3: All existing tests pass (no regression).
-
-**Blocking**
-- Blocks: Task 10.3
-
-**Blocked By**
-- Blocked by: Task 10.1
-
----
-
-## Task 10.3: CDC manifest metadata + backward compatibility
-
-**User Story**
-As a user, I want CDC manifests to record their chunking strategy so future tools can understand or reproduce the chunk boundaries.
-
-**Requirements**
-- R1: Add optional `chunking` field to ManifestSchema: `{ strategy: 'fixed' | 'cdc', params: { … } }`.
-- R2: Fixed-size manifests omit the field (backward compatible with all existing manifests).
-- R3: CDC manifests include `{ strategy: 'cdc', params: { target: N, min: N, max: N } }`.
-- R4: `readManifest()` handles manifests with or without `chunking` field.
-- R5: v1 and v2 manifests remain valid (no migration required).
-- R6: Add `INVALID_CHUNKING_STRATEGY` error code for unrecognized strategies.
-
-**Acceptance Criteria**
-- AC1: CDC store produces manifest with `chunking` field.
-- AC2: Fixed-size store produces manifests without `chunking` field (backward compatible).
-- AC3: Old manifests (no `chunking` field) read correctly on new code.
-- AC4: Unrecognized strategy in manifest throws `INVALID_CHUNKING_STRATEGY`.
-
-**Scope**
-- In scope: Schema extension, backward compat, error code.
-- Out of scope: Migration tooling for old manifests, manifest version bump (chunking field is additive).
-
-**Est. Complexity (LoC)**
-- Prod: ~40 (schema + Manifest value object + error code)
-- Tests: ~60 (round-trip, backward compat, unknown strategy)
-- Total: ~100
-
-**Est. Human Working Hours**
-- ~3h
-
-**Test Plan**
-- Golden path:
-  - CDC store → manifest includes `chunking.strategy === 'cdc'`.
-  - Fixed store → manifest has no `chunking` field.
-  - Read old manifest without `chunking` → works fine.
-- Failures:
-  - Manifest with `chunking.strategy === 'unknown'` → throws INVALID_CHUNKING_STRATEGY.
-- Edges:
-  - v1 manifest with compression + encryption + no chunking field → still valid.
-  - v2 merkle manifest with CDC → both `subManifests` and `chunking` fields present.
-- Fuzz/stress:
-  - Generate 100 manifests with random valid/invalid chunking fields → validate schema behavior.
-
-**Definition of Done**
-- DoD1: ManifestSchema extended with optional chunking field.
-- DoD2: Backward compatibility verified across v1/v2 manifests.
-- DoD3: Error code registered and tested.
-
-**Blocking**
-- Blocks: None
-
-**Blocked By**
-- Blocked by: Task 10.2
-
----
-
-## Task 10.4: CDC benchmarks + dedup efficiency comparison
-
-**User Story**
-As a maintainer, I want empirical data comparing CDC vs fixed chunking so I can document trade-offs and tune defaults.
-
-**Requirements**
-- R1: Add benchmark suite comparing fixed vs CDC chunking across file sizes (1MB, 10MB, 100MB).
-- R2: Measure chunking throughput (MB/s) for both strategies.
-- R3: Measure dedup efficiency: for a file modified by N random byte insertions, what % of chunks remain unchanged?
-- R4: Output results as a comparison table (console).
-
-**Acceptance Criteria**
-- AC1: Benchmark suite runs without errors.
-- AC2: CDC shows significantly better dedup for incrementally modified files (>80% chunk reuse for small edits vs. ~0% for fixed).
-- AC3: CDC throughput is within 2× of fixed chunking (rolling hash overhead is bounded).
-
-**Scope**
-- In scope: Synthetic benchmarks with in-memory data.
-- Out of scope: CI benchmark tracking, real-world file corpus, regression detection.
-
-**Est. Complexity (LoC)**
-- Prod: ~0
-- Tests/Bench: ~120
-- Total: ~120
-
-**Est. Human Working Hours**
-- ~3h
-
-**Test Plan**
-- Golden path:
-  - Bench suite completes and prints results table.
-- Failures:
-  - N/A (benchmarks are informational).
-- Edges:
-  - Include 0-byte and 1-byte files in benchmark.
-- Fuzz/stress:
-  - Run 3 times; verify <20% variance in throughput measurements.
-
-**Definition of Done**
-- DoD1: Benchmark suite added to `test/benchmark/`.
-- DoD2: Results documented in commit message or GUIDE.md addendum.
-- DoD3: Default CDC parameters tuned based on results if needed.
-
-**Blocking**
-- Blocks: None
-
-**Blocked By**
-- Blocked by: Task 10.1
-
----
+All tasks completed (10.1–10.4). See [COMPLETED_TASKS.md](./COMPLETED_TASKS.md).

 # M11 — Locksmith (v3.1.0)
 **Theme:** Multi-recipient encryption via envelope encryption (DEK/KEK model). Each file is encrypted with a random Data Encryption Key; the DEK is wrapped per-recipient. Adding or removing access never re-encrypts the data.
@@ -961,7 +731,7 @@ Competitive landscape for content-addressed storage, encrypted binary assets, an
 |---|---|---|---|---|---|---|---|---|---|---|
 | Content-addressed storage | ✅ SHA-256 || ✅ SHA-256 | ✅ SHA-256/512 | ✅ SHA-256 || ✅ MD5 | Dedup, integrity, immutability | git-cas is Git-native; others use separate object stores ||
 | Fixed-size chunking | ✅ 256 KiB default, configurable ||| ⚠️ Special remotes only |||| Break large files into stable blobs | Simple and deterministic; poor dedup on edits ||
-| Content-defined chunking (CDC) | | 🗓 M10 Hydra ||| ✅ Rabin fingerprint, 512K–8M ||| Sub-file dedup on versioned data | Only Restic offers this today; dramatically better dedup | Buzhash engine + ChunkingPort. ~350 LoC, ~12h (Task 10.1) |
+| Content-defined chunking (CDC) | ✅ v5.0.0 Buzhash | ||| ✅ Rabin fingerprint, 512K–8M ||| Sub-file dedup on versioned data | Buzhash CDC engine with 98% chunk reuse on small edits | |
 | Sub-file deduplication | ✅ Via chunking | ✅ Via CDC || ⚠️ Chunk-level only | ✅ Via CDC ||| Avoid storing redundant bytes | Fixed chunks dedup exact matches; CDC handles shifted content | CDC (M10) improves from exact-match to shift-tolerant |
 | File-level deduplication | ✅ Git ODB ||||||| Identical files stored once | All CAS systems get this for free ||
 | Git-native storage (ODB) | ✅ Blobs + trees || ❌ Separate LFS store | ⚠️ Pointers in ODB, content in annex | ❌ Custom format || ❌ Cache dir | Inspectable via `git log`, replicable via `git push` | Unique to git-cas. Competitors use custom storage layers ||
@@ -1005,7 +775,7 @@ Competitive landscape for content-addressed storage, encrypted binary assets, an
 | Codec pluggability | ✅ JsonCodec, CborCodec ||||||| Choose manifest format per use case | Extensible via CodecPort. No other tool offers this ||
 | Merkle tree manifests | ✅ v2 auto-split ||||||| Scale manifests for millions of chunks | Auto-splits at threshold (default 1000). Transparent reconstitution ||
 | Vault / ref-based indexing | ✅ refs/cas/vault ||| ✅ git-annex branch |||| GC-safe asset index that survives `git gc` | CAS semantics with retry. Unique among Git-native tools ||
-| Manifest versioning | ✅ v1 flat, v2 Merkle | 🗓 M10 adds chunking field | Pointer v1 only ||||| Evolve format without breaking old manifests | Full backward compat: v2 code reads v1 manifests | Additive schema fields for CDC metadata (Task 10.3) |
+| Manifest versioning | ✅ v1 flat, v2 Merkle + chunking || Pointer v1 only ||||| Evolve format without breaking old manifests | Full backward compat: v2 code reads v1 manifests | |

 ---
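
The chunk-reuse percentages quoted in this commit can be read as "what fraction of the edited file's chunks already exist in the original's chunk set". The sketch below shows one way such a metric might be computed. It is not taken from `test/benchmark/chunking.bench.js`, and the helper names `fixedChunks` and `chunkReuse` are hypothetical. Chunks are keyed by their base64 content to stay dependency-free; a real store would compare content hashes instead.

```js
// Fixed-size slicing, used here only as the baseline being measured.
function fixedChunks(buf, size) {
  const out = [];
  for (let i = 0; i < buf.length; i += size) out.push(buf.subarray(i, i + size));
  return out;
}

// Fraction of edited chunks whose exact bytes also appear among the original chunks.
function chunkReuse(originalChunks, editedChunks) {
  if (editedChunks.length === 0) return 1;
  const seen = new Set(originalChunks.map((c) => c.toString('base64')));
  const reused = editedChunks.filter((c) => seen.has(c.toString('base64'))).length;
  return reused / editedChunks.length;
}

const original = Buffer.from('aaaabbbbccccdddd');
const prepended = Buffer.concat([Buffer.from('X'), original]); // edit at the start
const appended = Buffer.concat([original, Buffer.from('X')]);  // edit at the end

const base = fixedChunks(original, 4);
const prependReuse = chunkReuse(base, fixedChunks(prepended, 4)); // 0
const appendReuse = chunkReuse(base, fixedChunks(appended, 4));   // 0.8
```

The prepend case shows why fixed-size chunking dedups poorly on edits: one inserted byte shifts every later boundary, so no chunk matches. Content-defined boundaries are meant to avoid exactly that shift.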
