Commit b5b87a7
feat(dag/walker): opt-in BloomTracker to avoid duplicated walks (#1124)
* feat(dag/walker): add VisitedTracker with BloomTracker and MapTracker
VisitedTracker interface for memory-efficient DAG traversal dedup.
BloomTracker uses a scalable bloom filter chain (~4 bytes/CID vs ~75
for a map), enabling dedup on repos with tens of millions of CIDs.
- BloomTracker: auto-scaling chain, configurable FP rate via BloomParams,
unique random SipHash keys per instance (uncorrelated FPs across nodes)
- MapTracker: exact dedup for tests and small datasets
- *cid.Set satisfies the interface for drop-in compatibility
- go.mod: update ipfs/bbloom to master (for NewWithKeys)
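For a sense of scale, at the quoted ~4 vs ~75 bytes per CID, 10 million CIDs come to roughly 40 MB for the bloom chain versus 750 MB for a map. The shape of the tracker contract can be sketched as below. This is a minimal illustration, not the boxo code: the method set is inferred from the note that *cid.Set satisfies the interface, keys are plain strings instead of CIDs, and NewMapTracker is a hypothetical constructor name.

```go
package main

import "fmt"

// VisitedTracker is a hypothetical stand-in for the interface described
// above. Keys are plain strings here; the real interface operates on CIDs.
type VisitedTracker interface {
	// Visit records key and reports whether it was seen for the first time.
	Visit(key string) bool
	// Has reports whether key was recorded earlier.
	Has(key string) bool
}

// MapTracker is an exact (no false positives) tracker backed by a map.
type MapTracker struct {
	seen map[string]struct{}
}

// NewMapTracker returns an empty exact tracker.
func NewMapTracker() *MapTracker {
	return &MapTracker{seen: make(map[string]struct{})}
}

// Visit returns true only the first time a key is presented.
func (t *MapTracker) Visit(key string) bool {
	if _, ok := t.seen[key]; ok {
		return false
	}
	t.seen[key] = struct{}{}
	return true
}

// Has is a read-only membership check.
func (t *MapTracker) Has(key string) bool {
	_, ok := t.seen[key]
	return ok
}

func main() {
	var t VisitedTracker = NewMapTracker()
	fmt.Println(t.Visit("a"), t.Visit("a"), t.Has("b")) // true false false
}
```

A probabilistic implementation would keep the same two methods but may occasionally answer "already seen" for a key it never recorded.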
* feat(dag/walker): add WalkDAG with codec-agnostic link extraction
iterative DFS walker that integrates VisitedTracker dedup directly
into the traversal loop, skipping entire subtrees in O(1).
- LinksFetcherFromBlockstore: extracts links from any codec registered
in the global multicodec registry (dag-pb, dag-cbor, raw, etc.)
- ~2x faster than legacy go-ipld-prime selector traversal (no selector
machinery, simpler decoding, fewer allocations)
- WithLocality option for MFS providers to skip non-local blocks
- best-effort error handling: fetch failures log and skip, do not mark
the CID as visited (allows retry via another pin or next cycle)
- benchmarks comparing BlockAll vs WalkDAG across dag-pb, dag-cbor,
and mixed-codec DAGs
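The dedup and best-effort behavior described for WalkDAG can be sketched with a toy iterative DFS. Everything here is an assumption made for illustration: string keys replace CIDs, a plain map replaces VisitedTracker, and walk, fetchLinks, and mapFetcher are hypothetical names rather than the boxo API.

```go
package main

import (
	"errors"
	"fmt"
)

// fetchLinks is a hypothetical fetcher: it returns the child keys of a block.
type fetchLinks func(key string) ([]string, error)

// walk is a toy iterative DFS. seen plays the role of the visited tracker.
// emit returns false to stop the walk early.
func walk(root string, fetch fetchLinks, seen map[string]bool, emit func(string) bool) {
	stack := []string{root}
	for len(stack) > 0 {
		key := stack[len(stack)-1]
		stack = stack[:len(stack)-1]

		// One lookup skips the node and its whole subtree.
		if seen[key] {
			continue
		}
		children, err := fetch(key)
		if err != nil {
			// Best effort: log and skip without marking the key as
			// visited, so another path or a later cycle can retry it.
			fmt.Printf("skipping %s: %v\n", key, err)
			continue
		}
		seen[key] = true
		if !emit(key) {
			return
		}
		// Push in reverse so the first link is popped first.
		for i := len(children) - 1; i >= 0; i-- {
			stack = append(stack, children[i])
		}
	}
}

// mapFetcher builds a fetcher over an in-memory DAG; unknown keys fail.
func mapFetcher(dag map[string][]string) fetchLinks {
	return func(key string) ([]string, error) {
		links, ok := dag[key]
		if !ok {
			return nil, errors.New("block not found")
		}
		return links, nil
	}
}

func main() {
	dag := map[string][]string{
		"root":   {"a", "missing", "b"},
		"a":      {"shared"},
		"b":      {"shared"},
		"shared": {},
	}
	seen := map[string]bool{}
	walk("root", mapFetcher(dag), seen, func(k string) bool {
		fmt.Println("emit", k)
		return true
	})
}
```

In this toy DAG the node "shared" is reachable from both "a" and "b" but is emitted once, and "missing" stays unmarked after its fetch fails.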
* feat(dag/walker): add WalkEntityRoots for entity-aware traversal
emits entity roots (files, directories, HAMT shards) skipping
internal file chunks. core of the +entities provide strategy.
- NodeFetcherFromBlockstore: detects UnixFS entity type from the
ipld-prime decoded node's Data field
- directories and HAMT shards: emit and recurse into children
- non-UnixFS codecs (dag-cbor, dag-json): emit and follow links
- same options as WalkDAG: WithVisitedTracker, WithLocality
- tests: dag-pb, raw, dag-cbor, mixed codecs, HAMT, dedup,
error handling, stop conditions
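The emit and descend rules for entity roots can be illustrated with a small model. This is a sketch under stated assumptions, not the WalkEntityRoots implementation: node kinds are given directly instead of being detected from UnixFS data, and the kind constants, node type, and walkEntityRoots function are hypothetical.

```go
package main

import "fmt"

type kind int

const (
	kindFile  kind = iota // file root: emit, do not descend into chunks
	kindDir               // directory or HAMT shard: emit and recurse
	kindOther             // non-UnixFS node: emit and follow links
)

// node is a toy block: its entity kind plus outgoing links.
type node struct {
	kind  kind
	links []string
}

// walkEntityRoots emits every entity root reachable from root, in DFS order.
func walkEntityRoots(root string, dag map[string]node, emit func(string)) {
	seen := map[string]bool{}
	stack := []string{root}
	for len(stack) > 0 {
		key := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[key] {
			continue
		}
		n, ok := dag[key]
		if !ok {
			continue // missing block: skip, best effort
		}
		seen[key] = true
		emit(key)
		if n.kind == kindFile {
			continue // children of a file are internal chunks
		}
		for i := len(n.links) - 1; i >= 0; i-- {
			stack = append(stack, n.links[i])
		}
	}
}

func main() {
	dag := map[string]node{
		"dir":    {kindDir, []string{"file1", "file2"}},
		"file1":  {kindFile, []string{"chunk1", "chunk2"}},
		"file2":  {kindFile, nil},
		"chunk1": {kindOther, nil},
		"chunk2": {kindOther, nil},
	}
	walkEntityRoots("dir", dag, func(k string) { fmt.Println(k) })
}
```

Here the two chunks under "file1" are never fetched or emitted; only "dir", "file1", and "file2" reach the emit callback.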
* test(dag/walker): add BloomTracker FP rate regression tests
catch unexpected regressions in ipfs/bbloom behavior or BloomParams
derivation that would silently degrade the false positive rate.
- measurable rate (1/1000): 100K probes produce observable FPs,
asserts rate is within 5x of target
- default rate (1/4.75M): 100K probes must produce exactly 0 FPs
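The two thresholds are easier to read next to the expected false positive count, which is simply probes multiplied by the rate when every probe is a key that was never inserted. The helper below is a hypothetical illustration of that arithmetic and is not part of the test suite.

```go
package main

import "fmt"

// expectedFalsePositives returns probes * rate, the mean number of false
// positives when every probe is a key that was never inserted.
func expectedFalsePositives(probes int, rate float64) float64 {
	return float64(probes) * rate
}

func main() {
	// Measurable rate: about 100 expected hits in 100K probes.
	fmt.Printf("%.3f\n", expectedFalsePositives(100_000, 1.0/1000))
	// Default rate: about 0.021 expected hits in 100K probes.
	fmt.Printf("%.3f\n", expectedFalsePositives(100_000, 1.0/4_750_000))
}
```

At the measurable rate a regression shows up as a count far from 100, while at the default rate even a single hit is already well above the expected 0.021.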
* fix(provider): stream error continues to next, add NewConcatProvider
- NewPrioritizedProvider: stream init error no longer stops remaining
streams (e.g. MFS flush error does not prevent pinned content from
being provided)
- NewConcatProvider: concatenates pre-deduplicated streams without
its own visited set, for use with shared VisitedTracker
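The two provider behaviors above, continuing past a stream init error and concatenating streams without a visited set, can be sketched with channels. The stream type and the concat, fromSlice, failing, and collect helpers are hypothetical and only model the described control flow; they are not the boxo provider API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// stream is a hypothetical key source: it starts producing keys on a
// channel, or fails during initialization.
type stream func(ctx context.Context) (<-chan string, error)

// fromSlice returns a stream that emits the given keys in order.
func fromSlice(keys ...string) stream {
	return func(ctx context.Context) (<-chan string, error) {
		ch := make(chan string)
		go func() {
			defer close(ch)
			for _, k := range keys {
				select {
				case ch <- k:
				case <-ctx.Done():
					return
				}
			}
		}()
		return ch, nil
	}
}

// failing models a stream whose initialization fails.
func failing(ctx context.Context) (<-chan string, error) {
	return nil, errors.New("init failed")
}

// concat drains each stream in turn. It keeps no visited set of its own,
// so duplicates pass through unless the inputs were already deduplicated.
func concat(streams ...stream) stream {
	return func(ctx context.Context) (<-chan string, error) {
		out := make(chan string)
		go func() {
			defer close(out)
			for _, s := range streams {
				ch, err := s(ctx)
				if err != nil {
					continue // an init error skips this stream only
				}
			drain:
				for {
					select {
					case <-ctx.Done():
						return
					case k, ok := <-ch:
						if !ok {
							break drain // move on to the next stream
						}
						select {
						case out <- k:
						case <-ctx.Done():
							return
						}
					}
				}
			}
		}()
		return out, nil
	}
}

// collect drains a stream into a slice using a background context.
func collect(s stream) []string {
	ch, err := s(context.Background())
	if err != nil {
		return nil
	}
	var keys []string
	for k := range ch {
		keys = append(keys, k)
	}
	return keys
}

func main() {
	fmt.Println(collect(concat(fromSlice("a", "b"), failing, fromSlice("b", "c"))))
}
```

The duplicate "b" in the output is intentional: dedup is expected to happen upstream, for example through a tracker shared by the walks that feed the streams.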
* feat(pinner): add NewUniquePinnedProvider and NewPinnedEntityRootsProvider
NewUniquePinnedProvider: emits all pinned blocks with cross-pin
dedup via shared VisitedTracker (bloom or map). walks recursive pin
DAGs first, then direct pins.
NewPinnedEntityRootsProvider: same structure but uses WalkEntityRoots,
emitting only entity roots and skipping internal file chunks.
existing NewPinnedProvider is unchanged.
* test: add PrioritizedProvider error-continue regression test
- remove unused daggen variable in uniquepinprovider_test.go
* refactor(provider): use labeled break in NewConcatProvider for consistency
match the defensive read-side ctx.Done select pattern already used by
NewPrioritizedProvider in the same file
* refactor(dag/walker): extract shared linkSystemForBlockstore helper
- deduplicate LinkSystem construction used by both
LinksFetcherFromBlockstore and NodeFetcherFromBlockstore
- wrap blockstore with NewIdStore so identity CIDs (multihash 0x00,
data inline in the CID) are decoded without a datastore lookup
* fix(dag/walker): skip emitting identity CIDs, add tests
identity CIDs (multihash 0x00) embed data inline, so providing them
to the DHT is wasteful. the walker now traverses through identity
CIDs (following their links) but never emits them.
- add isIdentityCID check to WalkDAG and WalkEntityRoots
- simplify WalkEntityRoots emit/descend logic
- tests for identity raw leaf, identity dag-pb directory with normal
children, normal directory with identity child
* test(dag/walker): add symlink entity detection tests
* refactor: consolidate identity CID tests, filter direct pins
- inline identity CID check (c.Prefix().MhType == mh.IDENTITY) in all
emit paths: WalkDAG, WalkEntityRoots, and direct pin loops in both
NewUniquePinnedProvider and NewPinnedEntityRootsProvider
- move all identity CID tests to dag/walker/identity_test.go
- add provider-level identity tests for direct pins and recursive DAGs
* fix(dag/walker): visit siblings in left-to-right link order
the stack-based DFS was pushing children in link order, causing
the last child to be popped first (right-to-left). reverse children
before pushing so the first link is on top and gets visited first.
this matches the legacy fetcherhelpers.BlockAll selector traversal
(ipld-prime iterates list/map entries in insertion order) and the
conventional DFS order described in IPIP-0412.
- walker.go, entity.go: slices.Reverse(children) before stack push
- walker.go: document traversal order in WalkDAG godoc
- entity.go: document order parity in WalkEntityRoots godoc
- walker_test.go, entity_test.go: add sibling order regression tests
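The effect of reversing before the push can be checked in isolation. The popOrder helper below is hypothetical; it only models how a LIFO stack orders siblings with and without slices.Reverse.

```go
package main

import (
	"fmt"
	"slices"
)

// popOrder pushes children onto a stack, optionally reversed first, and
// returns the order in which they are popped.
func popOrder(children []string, reverse bool) []string {
	stack := slices.Clone(children)
	if reverse {
		slices.Reverse(stack)
	}
	var order []string
	for len(stack) > 0 {
		order = append(order, stack[len(stack)-1])
		stack = stack[:len(stack)-1]
	}
	return order
}

func main() {
	links := []string{"first", "second", "third"}
	fmt.Println(popOrder(links, false)) // [third second first]
	fmt.Println(popOrder(links, true))  // [first second third]
}
```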
* fix(pinner): continue on pin iteration error in unique providers
a corrupted pin entry was stopping the entire provide cycle because
the goroutine returned on RecursiveKeys/DirectKeys error. change to
continue so remaining pins are still provided (best-effort).
the error from the pinner iterator already contains context (bad CID
bytes, datastore key, etc.) -- sc.Pin.Key is zero-value on error so
including it in the log would be noise.
matches the best-effort pattern used in WalkDAG/WalkEntityRoots
where fetch errors are logged and skipped.
* docs(dag/walker): document implicit behaviors
- collectLinks: note that map keys are not recursed (no known codec
uses link-typed map keys)
- detectEntityType: extract c.Prefix() once for readability
- grow: document MinBloomCapacity invariant that prevents small-bitset
FP rate issues in grown blooms
* fix: address review feedback from gammazero
uniquepinprovider: use skip-early style for tracker.Visit
in direct pin loops (clearer control flow)
visited.go: document that VisitedTracker implementations may
be probabilistic, and must keep FP rate negligible or allow
callers to adjust it
* feat(walker): log bloom tracker creation and autoscaling
log capacity, FP rate, and hash parameters on creation.
log previous/new capacity and chain length on autoscale.
helps operators understand bloom sizing and detect unexpected
growth during reprovide cycles.
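The autoscaling being logged can be pictured as a chain of fixed-size filters that gains a new, larger filter whenever the newest one fills up. The sketch below is a simplified model and not the BloomTracker code: it uses hash/maphash with random seeds instead of SipHash keys, doubles capacity on growth, and keeps the same target rate for every filter, all of which are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"hash/maphash"
	"math"
)

// bloom is one fixed-size filter in the chain.
type bloom struct {
	bits         []uint64
	nbits, k     uint64
	seed1, seed2 maphash.Seed // random per instance
	count        int
	capacity     int
}

// newBloom sizes a filter with the standard formulas
// m = -n*ln(p)/ln(2)^2 bits and k = (m/n)*ln(2) hash probes.
func newBloom(capacity int, fpRate float64) *bloom {
	n := float64(capacity)
	m := math.Ceil(-n * math.Log(fpRate) / (math.Ln2 * math.Ln2))
	k := math.Max(1, math.Round(m/n*math.Ln2))
	return &bloom{
		bits:     make([]uint64, (uint64(m)+63)/64),
		nbits:    uint64(m),
		k:        uint64(k),
		seed1:    maphash.MakeSeed(),
		seed2:    maphash.MakeSeed(),
		capacity: capacity,
	}
}

// hashes returns the two base hashes used for double hashing.
func (b *bloom) hashes(key string) (uint64, uint64) {
	return maphash.String(b.seed1, key), maphash.String(b.seed2, key) | 1
}

func (b *bloom) has(key string) bool {
	h1, h2 := b.hashes(key)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.nbits
		if b.bits[pos/64]&(uint64(1)<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}

func (b *bloom) add(key string) {
	h1, h2 := b.hashes(key)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.nbits
		b.bits[pos/64] |= uint64(1) << (pos % 64)
	}
	b.count++
}

// BloomChain is a hypothetical auto-scaling tracker over string keys.
type BloomChain struct {
	filters []*bloom
	fpRate  float64
}

func NewBloomChain(capacity int, fpRate float64) *BloomChain {
	return &BloomChain{filters: []*bloom{newBloom(capacity, fpRate)}, fpRate: fpRate}
}

// Has checks every filter in the chain.
func (c *BloomChain) Has(key string) bool {
	for _, f := range c.filters {
		if f.has(key) {
			return true
		}
	}
	return false
}

// Visit adds key to the newest filter, growing the chain when it is full.
func (c *BloomChain) Visit(key string) bool {
	if c.Has(key) {
		return false
	}
	last := c.filters[len(c.filters)-1]
	if last.count >= last.capacity {
		grown := newBloom(last.capacity*2, c.fpRate)
		fmt.Printf("bloom autoscale: capacity %d -> %d, chain length %d\n",
			last.capacity, grown.capacity, len(c.filters)+1)
		c.filters = append(c.filters, grown)
		last = grown
	}
	last.add(key)
	return true
}

func main() {
	c := NewBloomChain(100, 0.001)
	for i := 0; i < 1000; i++ {
		c.Visit(fmt.Sprintf("key-%d", i))
	}
	fmt.Println("filters in chain:", len(c.filters))
}
```

Because a lookup consults every filter, the combined false positive rate in this simplified model rises with chain length, which is one reason unexpected growth is worth surfacing in logs.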
* feat(walker): add Deduplicated() to BloomTracker and MapTracker
counts Visit() calls that returned false (CID already seen).
callers can log this after a reprovide cycle to show how much
dedup the bloom filter achieved.
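A dedup counter of this kind amounts to incrementing on every Visit that returns false. The sketch below uses a hypothetical map-backed countingTracker with string keys; it is not safe for concurrent use and is not the boxo implementation.

```go
package main

import "fmt"

// countingTracker is a hypothetical exact tracker that also counts how
// many Visit calls were rejected as duplicates.
type countingTracker struct {
	seen         map[string]struct{}
	deduplicated uint64
}

func newCountingTracker() *countingTracker {
	return &countingTracker{seen: make(map[string]struct{})}
}

// Visit returns true for a new key and false (counted) for a repeat.
func (t *countingTracker) Visit(key string) bool {
	if _, ok := t.seen[key]; ok {
		t.deduplicated++
		return false
	}
	t.seen[key] = struct{}{}
	return true
}

// Deduplicated reports how many Visit calls returned false so far.
func (t *countingTracker) Deduplicated() uint64 {
	return t.deduplicated
}

func main() {
	t := newCountingTracker()
	for _, k := range []string{"a", "b", "a", "c", "a"} {
		t.Visit(k)
	}
	fmt.Println("deduplicated:", t.Deduplicated()) // deduplicated: 2
}
```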
* chore: update ipfs/bbloom to v0.1.0
* docs(walker): document bloom iteration order tradeoff
Addresses review feedback:
#1124 (comment)
#1124 (comment)
* docs(walker): remove Kubo-specific 22h reprovide interval
The reprovide interval is configured by the caller, not by boxo.
Addresses review feedback:
#1124 (comment)
* refactor(walker): consolidate WalkDAG and WalkEntityRoots
Extract shared walkLoop for the common iterative DFS logic (stack
management, tracker dedup, locality check, identity CID skip, emit).
WalkDAG and WalkEntityRoots now differ only in their fetch callback.
Addresses review feedback:
#1124 (comment)
* refactor(pinner): consolidate pin provider functions
Extract shared newPinnedProvider for the common goroutine, emit, pin
iteration, and direct-pin logic. NewUniquePinnedProvider and
NewPinnedEntityRootsProvider now differ only in the walk callback.
Addresses review feedback:
#1124 (comment)
* docs: move pin provider entry to Added in CHANGELOG
These are new functions, not fixes to existing ones.
Addresses review feedback:
#1124 (comment)
1 parent c0f7759 commit b5b87a7
18 files changed
Lines changed: 3331 additions & 8 deletions
File tree
- dag/walker
- examples
- pinning/pinner/dspinner
- provider