fix(retrieval): mark deals cleaned_up when SP reports piece missing #556
Pre-flight SP `/pdp/piece/:pieceCid/status` before IPNI verify + block fetch in the retrieval check path. On 404 mark the deal cleaned_up and skip — saves the 30s IPNI timeout plus downstream block-fetch 404 noise per stale candidate. Closes #555
Pull request overview
This PR adds a pre-flight probe to the retrieval pipeline so that if a Storage Provider reports a piece is missing (/pdp/piece/:pieceCid/status returns 404), the corresponding deal is marked cleaned_up=true and the retrieval (including IPNI verification and transport) is skipped. This reduces wasted work and removes stale deals from future retrieval candidate selection.
Changes:
- Add SP piece-status pre-flight in `RetrievalService.performAllRetrievals`; on 404, update the deal to `cleanedUp=true` and return `[]`.
- Introduce `probeSpPieceStatus` helper to perform the `/pdp/piece/:pieceCid/status` request with a 10s timeout.
- Add unit tests covering the 404/200/network-error/no-serviceUrl cases.
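A minimal sketch of what such a helper could look like, assuming an injected fetch-like function so it can be exercised with a stub. The `PieceStatus` labels, the `FetchLike` type, and the parameter order are illustrative assumptions; only the endpoint path, the special handling of 404, and the 10s timeout come from the summary above.

```typescript
// Hypothetical result labels for this sketch.
type PieceStatus = "present" | "missing" | "unknown";

// Minimal fetch shape (an assumption) so the helper can be tested with a stub.
type FetchLike = (
  url: string,
  init: { method: string; signal: AbortSignal },
) => Promise<{ status: number }>;

async function probeSpPieceStatus(
  serviceUrl: string,
  pieceCid: string,
  fetchImpl: FetchLike,
  timeoutMs = 10_000,
): Promise<PieceStatus> {
  // Abort the request ourselves once the probe budget is spent.
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(`${serviceUrl}/pdp/piece/${pieceCid}/status`, {
      method: "GET",
      signal: controller.signal,
    });
    if (res.status === 404) return "missing";
    if (res.status === 200) return "present";
    // Any other status gives no verdict about the piece.
    return "unknown";
  } catch {
    // Timeout or network error: no verdict, so the caller can proceed as before.
    return "unknown";
  } finally {
    clearTimeout(timer);
  }
}
```

Only the `"missing"` result would justify marking a deal cleaned up; every other outcome leaves the existing retrieval path untouched.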
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| apps/backend/src/retrieval/retrieval.service.ts | Adds SP piece-status probing and marks deals cleaned up when the SP reports the piece is missing. |
| apps/backend/src/retrieval/retrieval.service.spec.ts | Adds tests validating the new pre-flight behavior and cleanup update call. |
- Build probe URL via WHATWG URL + encodeURIComponent (handles trailing slash + non-trivial pieceCid chars)
- Re-throw outer-signal aborts; only probe-timeout/network fall through
- Post-probe signal.throwIfAborted() to stop promptly on cancellation
- 10s -> 5s probe timeout; safer pre-flight that beats 30s IPNI path
- Record retrievalStatus="skipped.piece_missing" + log statusUrl/statusCode/probeDurationMs
- User-Agent header on probe; unstubAllGlobals in tests for isolation
- 2 new tests: outer-signal abort re-thrown; pieceCid URL-encoded
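The abort handling in this commit separates two cases: a cancellation from the caller must propagate, while the probe's own timeout or a network error should fall through to the normal retrieval path. A rough sketch of that distinction follows; `probeWithCancellation` and its signature are hypothetical, not the code from this PR.

```typescript
// Hypothetical wrapper: runs `run` under a probe timeout that is linked
// to the caller's (outer) abort signal.
async function probeWithCancellation<T>(
  outerSignal: AbortSignal,
  timeoutMs: number,
  run: (signal: AbortSignal) => Promise<T>,
): Promise<T | "unknown"> {
  const controller = new AbortController();
  const onOuterAbort = () => controller.abort();
  outerSignal.addEventListener("abort", onOuterAbort, { once: true });
  // Cover the case where the caller cancelled before we started.
  if (outerSignal.aborted) controller.abort();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await run(controller.signal);
  } catch (err) {
    // The caller cancelled: re-throw so the whole job stops promptly.
    if (outerSignal.aborted) throw err;
    // Probe timeout or network failure: no verdict, fall through.
    return "unknown";
  } finally {
    clearTimeout(timer);
    outerSignal.removeEventListener("abort", onOuterAbort);
  }
}
```

Checking `outerSignal.aborted` in the catch block, rather than inspecting the error type, keeps the decision independent of how a particular fetch implementation reports aborts.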
Address Copilot review:
- Cancel fetch response body to release the undici socket back to the pool
- Capture dealRepository.update affected count; log distinguishes "marked cleaned_up" from "concurrent writer already cleaned"
affected=0 today is effectively unreachable. Forward-looking branch was weakly justified. Log affected count and let readers interpret without speculating about cause.
Synapse-sdk and dealbot's ipni.strategy + pull-check use direct concat on serviceUrl. Match that. Base paths are not part of the SP serviceUrl contract.
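To make the trade-off concrete, here is a small comparison of the two URL-building styles mentioned in this thread. Both function names are hypothetical and the exact code in either revision is not shown here, so treat this as a sketch of the behaviours rather than the PR's implementation.

```typescript
// Direct concatenation: assumes serviceUrl has no trailing slash or base path.
function concatStatusUrl(serviceUrl: string, pieceCid: string): string {
  return `${serviceUrl}/pdp/piece/${pieceCid}/status`;
}

// WHATWG URL + encodeURIComponent: tolerates a trailing slash and
// escapes characters in pieceCid that are not path-safe.
function whatwgStatusUrl(serviceUrl: string, pieceCid: string): string {
  const base = serviceUrl.endsWith("/") ? serviceUrl : `${serviceUrl}/`;
  return new URL(`pdp/piece/${encodeURIComponent(pieceCid)}/status`, base).toString();
}

// Same result when serviceUrl is a bare origin:
console.log(concatStatusUrl("https://sp.example", "bafkTest"));
console.log(whatwgStatusUrl("https://sp.example", "bafkTest"));

// They differ only on inputs outside the assumed contract:
console.log(concatStatusUrl("https://sp.example/", "bafkTest")); // double slash
console.log(whatwgStatusUrl("https://sp.example/", "bafkTest")); // normalised
```

If SP service URLs are always bare origins, as the commit message argues, concatenation yields the same URL with less code and matches the other call sites.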
silent-cipher left a comment
Overall looks good to me! Should we also document the new retrievalStatus state, skipped.piece_missing?
Per PR review on #556. Adds a row note on retrievalStatus in events-and-metrics.md and a subsection in retrievals.md explaining when the status is emitted and how it differs from a transport failure.
Pre-deploy (~16:09–17:24, old code, v with 68 failure.other + 110 success/min):
Post-deploy (17:31+):
…ays cleanup (#561)

* feat(retrieval): branch piece-missing on PDP pieceLive instead of always cleanup

  Pre-flight piece-status probe used to mark any deal cleaned_up whenever the SP returned 404. That conflated two scenarios: pieces SPs legitimately purged (terminated datasets, hard-removed pieces) and pieces SPs should still serve (live datasets, scheduled-but-not-finalized removals). The latter was being silently swept under skipped.piece_missing instead of surfacing as a failed retrieval.

  New flow in RetrievalService.performAllRetrievals:

  1. Chain pre-check via PDPVerifier.pieceLive(dataSetId, pieceId).
     - false: dataset terminated, piece never created, or piece hard-removed. Mark deal cleaned_up, emit skipped.piece_missing, no SP probe needed.
     - true: piece should be retrievable. Proceed.
  2. SP HEAD probe on /pdp/piece/:pieceCid/status (HEAD with GET fallback on 405).
     - 404 after pieceLive=true: SP failure. Emit failure.other, persist a failed Retrieval row, keep deal in candidate pool for re-probing.
     - 200/other: proceed to full retrieval.

  Probe was GET, now HEAD; no body required.

  Extracts isDataSetLive + its FWSS/SP-HTTP probes into a new DatasetLivenessService (apps/backend/src/dataset-liveness/) so RetrievalService doesn't depend on DealService. DealService.isDataSetLive becomes a thin proxy. New isPieceLive method on the service reads PDPVerifier.pieceLive via viem.

  Updates docs/checks/events-and-metrics.md to document the new branching behavior of skipped.piece_missing vs failure.other on the Retrieval path.

  Probe-level tests for isDataSetLive in deal.service.spec.ts are skipped with a TODO; coverage to be re-added in a new dataset-liveness spec.

  Refs: #556

* test(retrieval): port liveness probe tests + add isPieceLive coverage

  - New DatasetLivenessService spec ports the FWSS/SP-HTTP probe matrix previously in deal.service.spec.ts and adds isPieceLive coverage: happy/false/no-client/RPC error/abort.
  - Retrieval pre-check now treats missing dataSetId/pieceId as an error (emits failure.other and bails) instead of falling through to the SP probe. Backfill of legacy null-ID deals is a prerequisite before this branch deploys. The error path is loud enough to spot any stragglers.
  - Add retrieval spec coverage for the missing-IDs branch and the HEAD with GET-on-405 fallback.
  - Delete the stale describe.skip("isDataSetLive") block from deal.service.spec.ts; coverage now lives in the new spec.

  Refs: #561 follow-up commit addressing peer-review consensus.

* docs(retrieval): drop change-oriented comments
* refactor(retrieval): drop HEAD fallback, GET only
* test(retrieval): trim backfill note from test name
* test+docs: cover pieceLive RPC throw, fix metric desc
* refactor: hoist SynapseViemClient, drop bigint cast
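The branching described in these commits can be summarised as a pure decision function. This is a sketch under assumptions: the `PreflightInput` and `PreflightDecision` shapes and the `decidePreflight` name are invented for illustration, and the real service performs the chain read and the SP probe asynchronously (probing only when `pieceLive` is true).

```typescript
// Hypothetical inputs: results of the chain pre-check and the SP probe.
interface PreflightInput {
  hasChainIds: boolean; // both dataSetId and pieceId are present on the deal
  pieceLive: boolean; // result of PDPVerifier.pieceLive(dataSetId, pieceId)
  spStatusCode: number | null; // SP probe status; null if not probed or no verdict
}

type PreflightDecision =
  | { action: "skip"; retrievalStatus: "skipped.piece_missing"; markCleanedUp: true }
  | { action: "fail"; retrievalStatus: "failure.other"; markCleanedUp: false }
  | { action: "proceed" };

function decidePreflight(input: PreflightInput): PreflightDecision {
  // Missing IDs are treated as an error rather than falling through to the probe.
  if (!input.hasChainIds) {
    return { action: "fail", retrievalStatus: "failure.other", markCleanedUp: false };
  }
  // Chain says the piece is gone: legitimate cleanup, no SP probe needed.
  if (!input.pieceLive) {
    return { action: "skip", retrievalStatus: "skipped.piece_missing", markCleanedUp: true };
  }
  // Chain says live but the SP returns 404: an SP failure, deal stays in the pool.
  if (input.spStatusCode === 404) {
    return { action: "fail", retrievalStatus: "failure.other", markCleanedUp: false };
  }
  // 200 or anything else: run the full retrieval.
  return { action: "proceed" };
}
```

Keeping the decision separate from the I/O makes the four branches easy to cover in unit tests without stubbing the chain client or the HTTP layer.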
What changed

Pre-flight SP `/pdp/piece/:pieceCid/status` in `RetrievalService.performAllRetrievals`. On HTTP 404 mark the deal `cleaned_up=true` and return `[]`; skip IPNI verify + transport entirely.

Why

Staging telemetry showed ~33% of retrieval IPNI verifications timing out at 30s, all on aged deals whose SPs had silently dropped the piece (covered in #555). SP probe is the authoritative signal: if the SP itself says it doesn't have the piece, dealbot has no way to retrieve it. Marking `cleaned_up=true` removes the deal from `selectRandomSuccessfulDealForProvider`'s candidate pool.

`getDataSet`-based termination (#548) is orthogonal: it covers dataset-level termination. This fix covers per-piece removal while the dataset is still live.

How to verify

All backend tests pass, including the `RetrievalService SP piece status pre-flight` describe block:

`retrievalStatus="skipped.piece_missing"`

In staging post-deploy, watch for `retrieval_skipped_piece_missing` log events and a drop in `ipni_verification_timed_out` on the retrieval path.

Notes / risks

- `SP_PIECE_STATUS_PROBE_TIMEOUT_MS`). Beats the 30s IPNI fallthrough; on timeout the probe returns `"unknown"` and retrieval proceeds as today; no false-positive cleanup.
- `fetch` for status classification without throwing on 404. Response body is cancelled to release the undici socket.
- `dealRepository.update({id, cleanedUp:false}, ...)` is atomic + idempotent. Safe to race with `dataset_cleanup_sweep` from feat(jobs): add dataset_cleanup_sweep global job #548; last writer wins on a monotonic field.
- `retrieval_skipped_piece_missing` warn log includes `statusUrl`, `statusCode`, `probeDurationMs`, and `affected` for audit/correlation.
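The idempotency claim for the conditional update can be illustrated with an in-memory stand-in for the repository. The `Deal` shape and `markCleanedUp` helper below are hypothetical; the point is only that filtering on `cleanedUp: false` makes a second writer a no-op and that the affected count reveals which writer won.

```typescript
// Hypothetical, simplified deal row.
interface Deal {
  id: string;
  cleanedUp: boolean;
}

// In-memory stand-in for an update filtered on { id, cleanedUp: false }.
// Returns the number of rows changed, like an "affected" count would.
function markCleanedUp(deals: Map<string, Deal>, id: string): number {
  const deal = deals.get(id);
  // No matching row, or another writer already cleaned it: nothing to do.
  if (!deal || deal.cleanedUp) return 0;
  deal.cleanedUp = true;
  return 1;
}

const deals = new Map<string, Deal>([["deal-1", { id: "deal-1", cleanedUp: false }]]);
const first = markCleanedUp(deals, "deal-1"); // this writer flips the flag
const second = markCleanedUp(deals, "deal-1"); // a racing writer finds nothing to change
console.log({ first, second });
```

Because the flag only ever moves from false to true, either ordering of the two writers leaves the row in the same final state.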