feat(retrieval): branch piece-missing on PDP pieceLive instead of always cleanup#561
Conversation
…ays cleanup
Pre-flight piece-status probe used to mark any deal cleaned_up whenever the
SP returned 404. That conflated two scenarios: pieces SPs legitimately
purged (terminated datasets, hard-removed pieces) and pieces SPs should
still serve (live datasets, scheduled-but-not-finalized removals). The
latter was being silently swept under skipped.piece_missing instead of
surfacing as a failed retrieval.
New flow in RetrievalService.performAllRetrievals:
1. Chain pre-check via PDPVerifier.pieceLive(dataSetId, pieceId).
- false: dataset terminated, piece never created, or piece hard-removed.
Mark deal cleaned_up, emit skipped.piece_missing, no SP probe needed.
- true: piece should be retrievable. Proceed.
2. SP HEAD probe on /pdp/piece/:pieceCid/status (HEAD with GET fallback on 405).
- 404 after pieceLive=true: SP failure. Emit failure.other, persist a
failed Retrieval row, keep deal in candidate pool for re-probing.
- 200/other: proceed to full retrieval.
Probe was GET, now HEAD; no body required.
Extracts isDataSetLive + its FWSS/SP-HTTP probes into a new
DatasetLivenessService (apps/backend/src/dataset-liveness/) so
RetrievalService doesn't depend on DealService. DealService.isDataSetLive
becomes a thin proxy. New isPieceLive method on the service reads
PDPVerifier.pieceLive via viem.
Updates docs/checks/events-and-metrics.md to document the new branching
behavior of skipped.piece_missing vs failure.other on the Retrieval path.
Probe-level tests for isDataSetLive in deal.service.spec.ts are skipped
with a TODO; coverage to be re-added in a new dataset-liveness spec.
Refs: #556
- New DatasetLivenessService spec ports the FWSS/SP-HTTP probe matrix
previously in deal.service.spec.ts and adds isPieceLive coverage:
happy/false/no-client/RPC error/abort.
- Retrieval pre-check now treats missing dataSetId/pieceId as an error
(emits failure.other and bails) instead of falling through to the SP
probe. Backfill of legacy null-ID deals is a prerequisite before this
branch deploys. The error path is loud enough to spot any stragglers.
- Add retrieval spec coverage for the missing-IDs branch and the HEAD
with GET-on-405 fallback.
- Delete the stale describe.skip("isDataSetLive") block from
deal.service.spec.ts; coverage now lives in the new spec.
Refs: #561 follow-up commit addressing peer-review consensus.
There was a problem hiding this comment.
Pull request overview
This PR refines the Retrieval pre-flight logic to avoid incorrectly marking deals as cleaned_up when an SP returns /pdp/piece/:pieceCid/status 404 for a piece that is still expected to be live on-chain. It also extracts PDP dataset/piece liveness probing into a dedicated DatasetLivenessService so the logic can be reused outside DealService.
Changes:
- Add a retrieval pre-check pipeline that first consults on-chain
pieceLive(dataSetId, pieceId)and only treats SP 404 as cleanup when the piece is not live. - Record an explicit failed
Retrievalrow (withfailure.other) when the chain reports the piece live but the SP reports it missing. - Refactor dataset liveness probes out of
DealServiceinto a newDatasetLivenessService+ module, and update tests/docs accordingly.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| docs/checks/events-and-metrics.md | Updates retrievalStatus metric documentation to describe the new pre-flight branching behavior. |
| apps/backend/src/retrieval/retrieval.service.ts | Implements chain pieceLive gating and changes SP-404 handling to record failures instead of always cleaning up. |
| apps/backend/src/retrieval/retrieval.service.spec.ts | Extends retrieval tests to cover the new pre-check branches and failure recording. |
| apps/backend/src/retrieval/retrieval.module.ts | Wires DatasetLivenessModule into the retrieval module for DI. |
| apps/backend/src/deal/deal.service.ts | Replaces inlined dataset liveness probing with a proxy call to DatasetLivenessService. |
| apps/backend/src/deal/deal.service.spec.ts | Updates deal-service tests to mock DatasetLivenessService instead of the removed inlined probes. |
| apps/backend/src/deal/deal.module.ts | Wires DatasetLivenessModule into the deal module for DI. |
| apps/backend/src/dataset-liveness/dataset-liveness.service.ts | New service encapsulating FWSS+SP dataset liveness checks and on-chain pieceLive reads. |
| apps/backend/src/dataset-liveness/dataset-liveness.service.spec.ts | New unit tests for dataset liveness + piece liveness probing behavior. |
| apps/backend/src/dataset-liveness/dataset-liveness.module.ts | New Nest module exporting DatasetLivenessService. |
BigLep
left a comment
There was a problem hiding this comment.
PR description looks good/right to me.
I do want to see "A separate backfill script must run against staging and mainnet before this PR merges. Tracking issue + script PR will be linked here."
Updated PR description, backfill done. results at https://gist.github.com/SgtPooki/9fcf39ad2e95e022930ec3fbbbd139b1#file-backfill-results-md |
silent-cipher
left a comment
There was a problem hiding this comment.
Looks good to me!
What changed
Retrieval pre-flight now distinguishes "SP correctly purged a piece that should be gone" from "SP failed to serve a piece it should still have."
Before this PR, a 404 from
/pdp/piece/:pieceCid/statusalways marked the dealcleaned_up, masking real SP failures as routine cleanups.New flow
In
RetrievalService.performAllRetrievals:deal.dataSetIdordeal.pieceIdis null, emitfailure.otherwith log eventretrieval_missing_chain_idsand bail. Backfill is a deploy prerequisite (see below).PDPVerifier.pieceLive(dataSetId, pieceId).false: markcleaned_up, emitskipped.piece_missing, return. No SP probe.true: proceed.HEAD /pdp/piece/:pieceCid/status(GET fallback on 405).404: emitfailure.other, persist a failedRetrievalrow witherrorMessage: "SP reports piece missing but PDP pieceLive=true"and log eventretrieval_failed_piece_missing_live. Deal stays in the candidate pool for re-probing.pieceLivesource: PDPVerifier.sol:288. Pieces scheduled for removal but not yet finalized returntrue, since the SP is still on the hook for challenges.The documented
retrievalStatusenum (success | failure.timedout | failure.other | skipped.piece_missing) is preserved. The new failure modes reusefailure.other; dashboards slice via log event orerrorMessage.Refactor
isDataSetLiveand its FWSS + SP-HTTP probes move out ofDealServiceinto a newDatasetLivenessService, which also ownsisPieceLive.DealService.isDataSetLiveis now a thin proxy.Backfill (completed prior to merge)
Legacy deals from before 2026-05-07 have null
piece_idand/ordata_set_id. Backfill was run against both environments using a one-off script (kept out of the repo). Script + run results + chain-verification spot checks: https://gist.github.com/SgtPooki/9fcf39ad2e95e022930ec3fbbbd139b1How to verify
After deploy:
failure.otheron the Retrieval path should pick up SP-side piece misses that previously hid inskipped.piece_missing.retrieval_missing_chain_idsevents should stay at zero.Notes
checkPieceLivereturnstrueon RPC error so a chain outage cannot cascade into bulk cleanups.HEAD(GET fallback on 405) instead ofGET. No body required.