Skip to content

fix(retrieval): mark deals cleaned_up when SP reports piece missing#556

Merged
SgtPooki merged 7 commits into
mainfrom
fix/retrieval-cleanup-missing-pieces
May 20, 2026
Merged

fix(retrieval): mark deals cleaned_up when SP reports piece missing#556
SgtPooki merged 7 commits into
mainfrom
fix/retrieval-cleanup-missing-pieces

Conversation

@SgtPooki
Copy link
Copy Markdown
Collaborator

@SgtPooki SgtPooki commented May 20, 2026

What changed

Pre-flight SP /pdp/piece/:pieceCid/status in RetrievalService.performAllRetrievals. On HTTP 404 mark the deal cleaned_up=true and return [] — skip IPNI verify + transport entirely.

Why

Staging telemetry showed ~33% of retrieval IPNI verifications timing out at 30s, all on aged deals whose SPs had silently dropped the piece (covered in #555). SP probe is the authoritative signal — if the SP itself says it doesn't have the piece, dealbot has no way to retrieve it. Marking cleaned_up=true removes the deal from selectRandomSuccessfulDealForProvider's candidate pool.

getDataSet-based termination (#548) is orthogonal: it covers dataset-level termination. This fix covers per-piece removal while the dataset is still live.

How to verify

pnpm test

All backend tests pass, including the RetrievalService SP piece status pre-flight describe block:

  • 404 → marks cleaned_up + skips retrieval + records retrievalStatus="skipped.piece_missing"
  • 200 → proceeds normally
  • Network error → proceeds (treated as unknown, no false-positive cleanup)
  • No serviceUrl → skips probe + proceeds
  • Outer signal aborts during probe → re-thrown, no DB write
  • Trailing slash on serviceUrl + URL-encoded pieceCid → constructs correct probe URL

In staging post-deploy, watch for retrieval_skipped_piece_missing log events and a drop in ipni_verification_timed_out on the retrieval path.

Notes / risks

  • Probe timeout 5s (SP_PIECE_STATUS_PROBE_TIMEOUT_MS). Beats the 30s IPNI fallthrough; on timeout the probe returns "unknown" and retrieval proceeds as today — no false-positive cleanup.
  • Probe uses native fetch for status classification without throwing on 404. Response body is cancelled to release the undici socket.
  • dealRepository.update({id, cleanedUp:false}, ...) is atomic + idempotent. Safe to race with dataset_cleanup_sweep from feat(jobs): add dataset_cleanup_sweep global job #548 — last writer wins on a monotonic field.
  • retrieval_skipped_piece_missing warn log includes statusUrl, statusCode, probeDurationMs, and affected for audit/correlation.
  • Adds one HTTP call per scheduled retrieval. Negligible against the IPNI + transport work it replaces on dead candidates.
  • Closes retrieval: mark deals cleaned_up when SP returns 404 for piece status #555.

Pre-flight SP `/pdp/piece/:pieceCid/status` before IPNI verify + block fetch in
the retrieval check path. On 404 mark the deal cleaned_up and skip — saves the
30s IPNI timeout plus downstream block-fetch 404 noise per stale candidate.

Closes #555
Copilot AI review requested due to automatic review settings May 20, 2026 15:33
@FilOzzy FilOzzy added this to FOC May 20, 2026
@github-project-automation github-project-automation Bot moved this to 📌 Triage in FOC May 20, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a pre-flight probe to the retrieval pipeline so that if a Storage Provider reports a piece is missing (/pdp/piece/:pieceCid/status returns 404), the corresponding deal is marked cleaned_up=true and the retrieval (including IPNI verification and transport) is skipped. This reduces wasted work and removes stale deals from future retrieval candidate selection.

Changes:

  • Add SP piece-status pre-flight in RetrievalService.performAllRetrievals; on 404, update the deal to cleanedUp=true and return [].
  • Introduce probeSpPieceStatus helper to perform the /pdp/piece/:pieceCid/status request with a 10s timeout.
  • Add unit tests covering the 404/200/network-error/no-serviceUrl cases.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
apps/backend/src/retrieval/retrieval.service.ts Adds SP piece-status probing and marks deals cleaned up when the SP reports the piece is missing.
apps/backend/src/retrieval/retrieval.service.spec.ts Adds tests validating the new pre-flight behavior and cleanup update call.

Comment thread apps/backend/src/retrieval/retrieval.service.ts Outdated
Comment thread apps/backend/src/retrieval/retrieval.service.ts Outdated
Comment thread apps/backend/src/retrieval/retrieval.service.spec.ts
Copy link
Copy Markdown
Collaborator Author

@SgtPooki SgtPooki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

- Build probe URL via WHATWG URL + encodeURIComponent (handles trailing
  slash + non-trivial pieceCid chars)
- Re-throw outer-signal aborts; only probe-timeout/network fall through
- Post-probe signal.throwIfAborted() to stop promptly on cancellation
- 10s -> 5s probe timeout; safer pre-flight that beats 30s IPNI path
- Record retrievalStatus="skipped.piece_missing" + log statusUrl/statusCode/probeDurationMs
- User-Agent header on probe; unstubAllGlobals in tests for isolation
- 2 new tests: outer-signal abort re-thrown; pieceCid URL-encoded
@SgtPooki SgtPooki self-assigned this May 20, 2026
Address Copilot review:
- Cancel fetch response body to release the undici socket back to the pool
- Capture dealRepository.update affected count; log distinguishes "marked
  cleaned_up" from "concurrent writer already cleaned"
@SgtPooki SgtPooki moved this from 📌 Triage to ⌨️ In Progress in FOC May 20, 2026
SgtPooki added 2 commits May 20, 2026 11:50
affected=0 today is effectively unreachable. Forward-looking branch was
weakly justified. Log affected count and let readers interpret without
speculating about cause.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread apps/backend/src/retrieval/retrieval.service.ts Outdated
Comment thread apps/backend/src/retrieval/retrieval.service.ts
Comment thread apps/backend/src/retrieval/retrieval.service.spec.ts
Synapse-sdk and dealbot's ipni.strategy + pull-check use direct concat on
serviceUrl. Match that. Base paths are not part of the SP serviceUrl
contract.
Copy link
Copy Markdown
Collaborator

@silent-cipher silent-cipher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me! Should we also document the new retrievalStatus state, skipped.piece_missing?

@github-project-automation github-project-automation Bot moved this from ⌨️ In Progress to ✔️ Approved by reviewer in FOC May 20, 2026
Per PR review on #556. Adds a row note on retrievalStatus in
events-and-metrics.md and a subsection in retrievals.md explaining when
the status is emitted and how it differs from a transport failure.
@SgtPooki SgtPooki merged commit c3be5e3 into main May 20, 2026
9 checks passed
@SgtPooki SgtPooki deleted the fix/retrieval-cleanup-missing-pieces branch May 20, 2026 17:10
@github-project-automation github-project-automation Bot moved this from ✔️ Approved by reviewer to 🎉 Done in FOC May 20, 2026
@SgtPooki
Copy link
Copy Markdown
Collaborator Author

Pre-deploy (~16:09–17:24, old code, v with 68 failure.other + 110 success/min):

  • failure.other: 68/min
  • success: 110/min
  • skipped.piece_missing: 0/min (label didn't exist)

Post-deploy (17:31+):

  • failure.other: 0–2/min (~97% drop)
  • success: 30–40/min
  • skipped.piece_missing: 6→22/min (ramping as candidate pool surfaces dead pieces)

SgtPooki added a commit that referenced this pull request May 22, 2026
…ays cleanup (#561)

* feat(retrieval): branch piece-missing on PDP pieceLive instead of always cleanup

Pre-flight piece-status probe used to mark any deal cleaned_up whenever the
SP returned 404. That conflated two scenarios: pieces SPs legitimately
purged (terminated datasets, hard-removed pieces) and pieces SPs should
still serve (live datasets, scheduled-but-not-finalized removals). The
latter was being silently swept under skipped.piece_missing instead of
surfacing as a failed retrieval.

New flow in RetrievalService.performAllRetrievals:

  1. Chain pre-check via PDPVerifier.pieceLive(dataSetId, pieceId).
     - false: dataset terminated, piece never created, or piece hard-removed.
       Mark deal cleaned_up, emit skipped.piece_missing, no SP probe needed.
     - true: piece should be retrievable. Proceed.
  2. SP HEAD probe on /pdp/piece/:pieceCid/status (HEAD with GET fallback on 405).
     - 404 after pieceLive=true: SP failure. Emit failure.other, persist a
       failed Retrieval row, keep deal in candidate pool for re-probing.
     - 200/other: proceed to full retrieval.

Probe was GET, now HEAD; no body required.

Extracts isDataSetLive + its FWSS/SP-HTTP probes into a new
DatasetLivenessService (apps/backend/src/dataset-liveness/) so
RetrievalService doesn't depend on DealService. DealService.isDataSetLive
becomes a thin proxy. New isPieceLive method on the service reads
PDPVerifier.pieceLive via viem.

Updates docs/checks/events-and-metrics.md to document the new branching
behavior of skipped.piece_missing vs failure.other on the Retrieval path.

Probe-level tests for isDataSetLive in deal.service.spec.ts are skipped
with a TODO; coverage to be re-added in a new dataset-liveness spec.

Refs: #556

* test(retrieval): port liveness probe tests + add isPieceLive coverage

- New DatasetLivenessService spec ports the FWSS/SP-HTTP probe matrix
  previously in deal.service.spec.ts and adds isPieceLive coverage:
  happy/false/no-client/RPC error/abort.
- Retrieval pre-check now treats missing dataSetId/pieceId as an error
  (emits failure.other and bails) instead of falling through to the SP
  probe. Backfill of legacy null-ID deals is a prerequisite before this
  branch deploys. The error path is loud enough to spot any stragglers.
- Add retrieval spec coverage for the missing-IDs branch and the HEAD
  with GET-on-405 fallback.
- Delete the stale describe.skip("isDataSetLive") block from
  deal.service.spec.ts; coverage now lives in the new spec.

Refs: #561 follow-up commit addressing peer-review consensus.

* docs(retrieval): drop change-oriented comments

* refactor(retrieval): drop HEAD fallback, GET only

* test(retrieval): trim backfill note from test name

* test+docs: cover pieceLive RPC throw, fix metric desc

* refactor: hoist SynapseViemClient, drop bigint cast
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 🎉 Done

Development

Successfully merging this pull request may close these issues.

retrieval: mark deals cleaned_up when SP returns 404 for piece status

4 participants