Content-defined chunking (CDC) for dramatically better dedup on versioned
files. Buzhash rolling hash engine limits the blast radius of edits to
1–2 chunks (98.4% chunk reuse vs 32% for fixed-size chunking).
New hexagonal ChunkingPort with FixedChunker and CdcChunker adapters.
CasService refactored to delegate chunking to the port. Facade accepts
declarative chunking config. ManifestSchema extended with optional
chunking field (backward compatible). 709 tests, 0 lint errors.
CHANGELOG.md (22 additions, 0 deletions)
@@ -5,6 +5,28 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [5.0.0] — Hydra (2026-02-28)
+
+### Breaking Changes
+- **`CasService` constructor accepts `chunker` port** — a new optional `ChunkingPort` parameter controls chunking strategy. Existing code that does not pass `chunker` is unaffected (defaults to `FixedChunker`).
+- **Major version bump** — new hexagonal port (`ChunkingPort`) and manifest schema extension warrant a semver-major release for downstream tooling awareness.
+
+### Added
+- **Content-defined chunking (CDC)** — Buzhash rolling-hash engine with configurable `minChunkSize` (64 KiB), `maxChunkSize` (1 MiB), and `targetChunkSize` (256 KiB). CDC limits the dedup blast radius to 1–2 chunks on incremental edits vs. total invalidation with fixed-size chunking. Benchmarked at 265 MB/s and 98.4% chunk reuse on small edits.
+- **`ChunkingPort`** — new hexagonal port (`src/ports/ChunkingPort.js`) with `async *chunk(source)`, `strategy`, and `params`. Abstracts chunking behind a pluggable interface.
+- **`CdcChunker`** — adapter wrapping the buzhash CDC engine behind `ChunkingPort`.
+- **`chunking` manifest field** — optional `{ strategy: 'fixed' | 'cdc', params: {...} }` metadata in manifests. Fixed-strategy manifests omit the field for full backward compatibility.
+- **`ChunkingSchema`** — Zod discriminated union (`FixedChunkingSchema` + `CdcChunkingSchema`) for manifest validation.
+- **`INVALID_CHUNKING_STRATEGY` error code** — thrown when an unrecognized chunking strategy is encountered in a manifest.
+- **Facade `chunking` config** — `ContentAddressableStore` constructor accepts `chunking: { strategy, ... }` declarative config or a raw `chunker` port instance.
+- **CDC benchmarks** (`test/benchmark/chunking.bench.js`) — throughput and dedup efficiency comparison.
+- 90 new unit tests (709 total).
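For readers who want a feel for the optional `chunking` manifest field listed above, the following is a rough hand-rolled sketch of the check it implies. It is not the project's `ChunkingSchema`, which is Zod-based; the function name `validateChunking`, the returned default for a missing field, and the min/target/max ordering rule are assumptions made for illustration only.

```javascript
// Hypothetical sketch: plain-JS stand-in for the Zod-based ChunkingSchema.
const KNOWN_STRATEGIES = new Set(['fixed', 'cdc']);

function validateChunking(manifest) {
  // Manifests without the field are treated as fixed-size (assumed default).
  if (manifest.chunking === undefined) {
    return { strategy: 'fixed', params: {} };
  }
  const { strategy, params = {} } = manifest.chunking;
  if (!KNOWN_STRATEGIES.has(strategy)) {
    const err = new Error(`Unrecognized chunking strategy: ${strategy}`);
    err.code = 'INVALID_CHUNKING_STRATEGY'; // error code named in the changelog
    throw err;
  }
  if (strategy === 'cdc') {
    // Assumed sanity rule: min <= target <= max. Missing values fail the check.
    const { minChunkSize, targetChunkSize, maxChunkSize } = params;
    const ordered =
      minChunkSize <= targetChunkSize && targetChunkSize <= maxChunkSize;
    if (!ordered) {
      throw new RangeError('expected minChunkSize <= targetChunkSize <= maxChunkSize');
    }
  }
  return { strategy, params };
}
```

Treating a missing field as fixed-size is what keeps older manifests readable without migration, which matches the backward-compatibility note in the entry.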
+
+### Changed
+- `CasService._chunkAndStore()` refactored to delegate to `ChunkingPort` instead of inline buffer slicing.
+- `ChunkingPort`, `FixedChunker`, `CdcChunker` exported from the main entry point.
COMPLETED_TASKS.md (13 additions, 0 deletions)
@@ -4,6 +4,19 @@ Task cards moved here from ROADMAP.md after completion. Organized by milestone.
 
 ---
 
+# M10 — Hydra (v5.0.0) ✅ CLOSED
+
+**Theme:** Content-defined chunking for dramatically better dedup on versioned files.
+
+**Completed:** v5.0.0 (2026-02-28)
+
+- **Task 10.1:** Buzhash rolling hash + CDC chunking engine — standalone module (`src/infrastructure/chunkers/CdcChunker.js`) with 256-entry deterministic byte table, configurable min/max/target chunk sizes, streaming async generator. Benchmarked at 265 MB/s, 98.4% chunk reuse on small edits.
+- **Task 10.2:** `ChunkingPort` abstraction — new hexagonal port with `async *chunk(source)`, `strategy`, and `params`. `FixedChunker` and `CdcChunker` adapters. `CasService` refactored to delegate chunking to the port. Facade accepts `chunking` config and raw `chunker` option.
+- **Task 10.3:** CDC manifest metadata + backward compatibility — `ChunkingSchema` (Zod discriminated union), optional `chunking` field in `ManifestSchema`, `INVALID_CHUNKING_STRATEGY` error code. Old manifests remain valid.
+- **Task 10.4:** CDC benchmarks + dedup efficiency comparison (`test/benchmark/chunking.bench.js`) — throughput and dedup tables comparing CDC vs fixed chunking.
+
+---
+
 # M8 — Spit Shine (v4.0.1) ✅ CLOSED
 
 **Theme:** Polish and harden based on code review findings. Fix asymmetries, eliminate duplication, improve docs. No new features.
README.md (15 additions, 0 deletions)
@@ -44,6 +44,21 @@ We use the object database.
 
 See [CHANGELOG.md](./CHANGELOG.md) for the full list of changes.
 
+## What's new in v5.0.0
+
+**Content-defined chunking (CDC)** — Fixed-size chunking invalidates every chunk after an edit. CDC uses a buzhash rolling hash to find natural boundaries, limiting the blast radius to 1–2 chunks. Benchmarked at 98.4% chunk reuse on small edits vs 32% for fixed.
+**`ChunkingPort`** — new hexagonal port abstracts chunking strategy. `FixedChunker` and `CdcChunker` adapters ship out of the box. Bring your own chunker by extending `ChunkingPort`.
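To make the port shape concrete, here is a minimal sketch of a custom chunker exposing the three documented members (`async *chunk(source)`, `strategy`, `params`). It is written as a standalone class so it runs without the library; in real use it would presumably extend `ChunkingPort`. The class name `FixedSizeSketchChunker` and the `chunkSize` option are assumptions for illustration, and this is not the shipped `FixedChunker`.

```javascript
// Hypothetical sketch: mirrors the documented port members without importing
// the real ChunkingPort base class.
class FixedSizeSketchChunker {
  constructor({ chunkSize = 256 * 1024 } = {}) {
    this.chunkSize = chunkSize;
  }

  get strategy() {
    return 'fixed';
  }

  get params() {
    return { chunkSize: this.chunkSize };
  }

  // Re-buffers an (async) iterable of Buffers into fixed-size chunks.
  async *chunk(source) {
    let pending = Buffer.alloc(0);
    for await (const piece of source) {
      pending = Buffer.concat([pending, piece]);
      while (pending.length >= this.chunkSize) {
        yield pending.subarray(0, this.chunkSize);
        pending = pending.subarray(this.chunkSize);
      }
    }
    if (pending.length > 0) {
      yield pending; // final chunk may be shorter than chunkSize
    }
  }
}
```

Because every boundary here is a fixed offset from the start of the stream, inserting a single byte near the beginning shifts all later boundaries, which is the invalidation problem the CDC paragraph above describes.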
+
+See [CHANGELOG.md](./CHANGELOG.md) for the full list of changes.
+
 ## What's new in v4.0.1
 
 **`git cas verify`** — verify stored asset integrity from the CLI without restoring (`git cas verify --slug my-asset`).
@@ -246,239 +246,9 @@ All tasks completed (9.2–9.5). See [COMPLETED_TASKS.md](./COMPLETED_TASKS.md).
 
 ---
 
-# M10 — Hydra (v3.0.0)
-**Theme:** Content-defined chunking for dramatically better dedup on versioned files. Fixed-size chunking invalidates every chunk after an edit; CDC limits the blast radius to 1–2 chunks. Major version bump for new chunking port and manifest metadata.
+# M10 — Hydra (v5.0.0) ✅ CLOSED
 
----
-
-## Task 10.1: Buzhash rolling hash + CDC chunking engine
-
-**User Story**
-As a developer storing versioned files, I want content-defined chunk boundaries so incremental changes don't invalidate every chunk downstream of the edit point.
-
-**Requirements**
-- R1: Implement Buzhash rolling hash algorithm with a 256-entry random byte table (deterministic seed).
-- R2: Implement CDC chunker that uses rolling hash to find chunk boundaries.
-- R4: Chunk boundary determined when `hash & mask === 0`, where mask is derived from `targetChunkSize` (e.g., `targetChunkSize - 1` for power-of-2 targets).
-- R5: Force boundary at `maxChunkSize` if no natural boundary found (prevent unbounded chunks).
-- R6: Force minimum chunk size: never split below `minChunkSize` (prevent tiny chunks).
-- R7: Deterministic: same input always produces same chunks regardless of runtime.
-- R8: Streaming: operates on `AsyncIterable<Buffer>` with O(1) memory.
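The boundary rules in R1 and R4 to R7 can be sketched as follows. This is a simplified illustration and not the project's engine: it works on a single in-memory `Buffer` rather than a stream (so it does not attempt R8), and the xorshift32 seed, the 48-byte window, and the names `makeTable`, `rotl32`, and `cutPoints` are all assumptions. Note that in JavaScript the boundary test needs parentheses, `(hash & mask) === 0`, because `===` binds more tightly than `&`.

```javascript
// Hypothetical sketch of buzhash-based cut-point selection.

// R1: 256-entry table from a fixed-seed xorshift32 PRNG (seed is an assumption).
function makeTable(seed = 0x9e3779b9) {
  const table = new Uint32Array(256);
  let x = seed >>> 0;
  for (let i = 0; i < 256; i++) {
    x ^= x << 13;
    x ^= x >>> 17;
    x ^= x << 5;
    table[i] = x; // stored as an unsigned 32-bit value
  }
  return table;
}

const TABLE = makeTable();

// 32-bit left rotation; the rotation count is taken modulo 32.
const rotl32 = (v, n) => {
  const r = n & 31;
  return ((v << r) | (v >>> (32 - r))) >>> 0;
};

// Returns the end offsets (exclusive) of each chunk in buf.
function cutPoints(buf, { minChunkSize, targetChunkSize, maxChunkSize, windowSize = 48 }) {
  const mask = targetChunkSize - 1; // R4: only meaningful for power-of-two targets
  const cuts = [];
  let start = 0;
  let hash = 0;
  for (let i = 0; i < buf.length; i++) {
    const len = i - start + 1;
    // Roll the new byte in...
    hash = rotl32(hash, 1) ^ TABLE[buf[i]];
    // ...and roll the byte leaving the window out.
    if (len > windowSize) {
      hash ^= rotl32(TABLE[buf[i - windowSize]], windowSize);
    }
    hash >>>= 0;
    const natural = len >= minChunkSize && (hash & mask) === 0; // R4 + R6
    if (natural || len >= maxChunkSize) {                        // R5
      cuts.push(i + 1);
      start = i + 1;
      hash = 0; // restart the window at the new chunk
    }
  }
  if (start < buf.length) cuts.push(buf.length); // trailing partial chunk
  return cuts;
}
```

Since the table and the scan depend only on the input bytes and fixed constants, the same input yields the same cut points on every run, which is the property R7 asks for.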
-
-**Acceptance Criteria**
-- AC1: CDC chunker produces variable-size chunks bounded by min/max.
-As a maintainer, I want empirical data comparing CDC vs fixed chunking so I can document trade-offs and tune defaults.
-
-**Requirements**
-- R1: Add benchmark suite comparing fixed vs CDC chunking across file sizes (1MB, 10MB, 100MB).
-- R2: Measure chunking throughput (MB/s) for both strategies.
-- R3: Measure dedup efficiency: for a file modified by N random byte insertions, what % of chunks remain unchanged?
-- R4: Output results as a comparison table (console).
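One possible reading of the dedup-efficiency measure in R3 is the share of chunks in the edited file whose digests already exist in the original file's chunk set. The sketch below assumes that definition; the function name `chunkReusePercent` and the use of digest strings as input are hypothetical, and the project's benchmark may compute the figure differently.

```javascript
// Hypothetical sketch: percentage of new chunks already present in the old set.
function chunkReusePercent(oldDigests, newDigests) {
  if (newDigests.length === 0) {
    return 0; // nothing to reuse (assumed convention for empty input)
  }
  const known = new Set(oldDigests);
  const reused = newDigests.filter((digest) => known.has(digest)).length;
  return (reused / newDigests.length) * 100;
}
```

For example, if an edit changes one chunk out of four, `chunkReusePercent(['a', 'b', 'c', 'd'], ['a', 'x', 'c', 'd'])` evaluates to 75.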
-
-**Acceptance Criteria**
-- AC1: Benchmark suite runs without errors.
-- AC2: CDC shows significantly better dedup for incrementally modified files (>80% chunk reuse for small edits vs. ~0% for fixed).
-- AC3: CDC throughput is within 2× of fixed chunking (rolling hash overhead is bounded).
-
-**Scope**
-- In scope: Synthetic benchmarks with in-memory data.
-- Out of scope: CI benchmark tracking, real-world file corpus, regression detection.
-
-**Est. Complexity (LoC)**
-- Prod: ~0
-- Tests/Bench: ~120
-- Total: ~120
-
-**Est. Human Working Hours**
-- ~3h
-
-**Test Plan**
-- Golden path:
-  - Bench suite completes and prints results table.
-- Failures:
-  - N/A (benchmarks are informational).
-- Edges:
-  - Include 0-byte and 1-byte files in benchmark.
-- Fuzz/stress:
-  - Run 3 times; verify <20% variance in throughput measurements.
-
-**Definition of Done**
-- DoD1: Benchmark suite added to `test/benchmark/`.
-- DoD2: Results documented in commit message or GUIDE.md addendum.
-- DoD3: Default CDC parameters tuned based on results if needed.
-
-**Blocking**
-- Blocks: None
-
-**Blocked By**
-- Blocked by: Task 10.1
-
----
+All tasks completed (10.1–10.4). See [COMPLETED_TASKS.md](./COMPLETED_TASKS.md).
 
 # M11 — Locksmith (v3.1.0)
 **Theme:** Multi-recipient encryption via envelope encryption (DEK/KEK model). Each file is encrypted with a random Data Encryption Key; the DEK is wrapped per-recipient. Adding or removing access never re-encrypts the data.
@@ -961,7 +731,7 @@ Competitive landscape for content-addressed storage, encrypted binary assets, an
+| Content-defined chunking (CDC) | ✅ v5.0.0 Buzhash | — | ❌ | ❌ | ✅ Rabin fingerprint, 512K–8M | ❌ | ❌ | Sub-file dedup on versioned data | Buzhash CDC engine with 98% chunk reuse on small edits | — |
 | Sub-file deduplication | ✅ Via chunking | ✅ Via CDC | ❌ | ⚠️ Chunk-level only | ✅ Via CDC | ❌ | ❌ | Avoid storing redundant bytes | Fixed chunks dedup exact matches; CDC handles shifted content | CDC (M10) improves from exact-match to shift-tolerant |
 | File-level deduplication | ✅ Git ODB | — | ✅ | ✅ | ✅ | ❌ | ✅ | Identical files stored once | All CAS systems get this for free | — |
 | Git-native storage (ODB) | ✅ Blobs + trees | — | ❌ Separate LFS store | ⚠️ Pointers in ODB, content in annex | ❌ Custom format | ❌ | ❌ Cache dir | Inspectable via `git log`, replicable via `git push` | Unique to git-cas. Competitors use custom storage layers | — |
@@ -1005,7 +775,7 @@ Competitive landscape for content-addressed storage, encrypted binary assets, an
 | Codec pluggability | ✅ JsonCodec, CborCodec | — | ❌ | ❌ | ❌ | ❌ | ❌ | Choose manifest format per use case | Extensible via CodecPort. No other tool offers this | — |
 | Merkle tree manifests | ✅ v2 auto-split | — | ❌ | ❌ | ❌ | ❌ | ❌ | Scale manifests for millions of chunks | Auto-splits at threshold (default 1000). Transparent reconstitution | — |
 | Vault / ref-based indexing | ✅ refs/cas/vault | — | ❌ | ✅ git-annex branch | ❌ | ❌ | ❌ | GC-safe asset index that survives `git gc` | CAS semantics with retry. Unique among Git-native tools | — |
-| Manifest versioning | ✅ v1 flat, v2 Merkle | 🗓 M10 adds chunking field | Pointer v1 only | ❌ | ❌ | ❌ | ❌ | Evolve format without breaking old manifests | Full backward compat: v2 code reads v1 manifests | Additive schema fields for CDC metadata (Task 10.3) |
+| Manifest versioning | ✅ v1 flat, v2 Merkle + chunking | — | Pointer v1 only | ❌ | ❌ | ❌ | ❌ | Evolve format without breaking old manifests | Full backward compat: v2 code reads v1 manifests | — |