Skip to content

feat(retrieval): branch piece-missing on PDP pieceLive instead of always cleanup#561

Merged
SgtPooki merged 7 commits into
mainfrom
feat/retrieval-mark-failed-when-piece-missing-on-live-dataset
May 22, 2026
Merged

feat(retrieval): branch piece-missing on PDP pieceLive instead of always cleanup#561
SgtPooki merged 7 commits into
mainfrom
feat/retrieval-mark-failed-when-piece-missing-on-live-dataset

Conversation

@SgtPooki
Copy link
Copy Markdown
Collaborator

@SgtPooki SgtPooki commented May 21, 2026

What changed

Retrieval pre-flight now distinguishes "SP correctly purged a piece that should be gone" from "SP failed to serve a piece it should still have."

Before this PR, a 404 from /pdp/piece/:pieceCid/status always marked the deal cleaned_up, masking real SP failures as routine cleanups.

New flow

In RetrievalService.performAllRetrievals:

  1. If deal.dataSetId or deal.pieceId is null, emit failure.other with log event retrieval_missing_chain_ids and bail. Backfill is a deploy prerequisite (see below).
  2. Call PDPVerifier.pieceLive(dataSetId, pieceId).
    • false: mark cleaned_up, emit skipped.piece_missing, return. No SP probe.
    • true: proceed.
  3. SP HEAD /pdp/piece/:pieceCid/status (GET fallback on 405).
    • 404: emit failure.other, persist a failed Retrieval row with errorMessage: "SP reports piece missing but PDP pieceLive=true" and log event retrieval_failed_piece_missing_live. Deal stays in the candidate pool for re-probing.
    • other: proceed to full retrieval.

pieceLive source: PDPVerifier.sol:288. Pieces scheduled for removal but not yet finalized return true, since the SP is still on the hook for challenges.

The documented retrievalStatus enum (success | failure.timedout | failure.other | skipped.piece_missing) is preserved. The new failure modes reuse failure.other; dashboards slice via log event or errorMessage.

Refactor

isDataSetLive and its FWSS + SP-HTTP probes move out of DealService into a new DatasetLivenessService, which also owns isPieceLive. DealService.isDataSetLive is now a thin proxy.

Backfill (completed prior to merge)

Legacy deals from before 2026-05-07 have null piece_id and/or data_set_id. Backfill was run against both environments using a one-off script (kept out of the repo). Script + run results + chain-verification spot checks: https://gist.github.com/SgtPooki/9fcf39ad2e95e022930ec3fbbbd139b1

env rows updated active candidates remaining with null IDs
calibration 10,774 0
mainnet 14,100 0

How to verify

cd apps/backend
pnpm typecheck && pnpm vitest run && pnpm biome lint .

After deploy: failure.other on the Retrieval path should pick up SP-side piece misses that previously hid in skipped.piece_missing. retrieval_missing_chain_ids events should stay at zero.

Notes

  • checkPieceLive returns true on RPC error so a chain outage cannot cascade into bulk cleanups.
  • Probe is now HEAD (GET fallback on 405) instead of GET. No body required.

…ays cleanup

Pre-flight piece-status probe used to mark any deal cleaned_up whenever the
SP returned 404. That conflated two scenarios: pieces SPs legitimately
purged (terminated datasets, hard-removed pieces) and pieces SPs should
still serve (live datasets, scheduled-but-not-finalized removals). The
latter was being silently swept under skipped.piece_missing instead of
surfacing as a failed retrieval.

New flow in RetrievalService.performAllRetrievals:

  1. Chain pre-check via PDPVerifier.pieceLive(dataSetId, pieceId).
     - false: dataset terminated, piece never created, or piece hard-removed.
       Mark deal cleaned_up, emit skipped.piece_missing, no SP probe needed.
     - true: piece should be retrievable. Proceed.
  2. SP HEAD probe on /pdp/piece/:pieceCid/status (HEAD with GET fallback on 405).
     - 404 after pieceLive=true: SP failure. Emit failure.other, persist a
       failed Retrieval row, keep deal in candidate pool for re-probing.
     - 200/other: proceed to full retrieval.

Probe was GET, now HEAD; no body required.

Extracts isDataSetLive + its FWSS/SP-HTTP probes into a new
DatasetLivenessService (apps/backend/src/dataset-liveness/) so
RetrievalService doesn't depend on DealService. DealService.isDataSetLive
becomes a thin proxy. New isPieceLive method on the service reads
PDPVerifier.pieceLive via viem.

Updates docs/checks/events-and-metrics.md to document the new branching
behavior of skipped.piece_missing vs failure.other on the Retrieval path.

Probe-level tests for isDataSetLive in deal.service.spec.ts are skipped
with a TODO; coverage to be re-added in a new dataset-liveness spec.

Refs: #556
@FilOzzy FilOzzy added this to FOC May 21, 2026
@github-project-automation github-project-automation Bot moved this to 📌 Triage in FOC May 21, 2026
SgtPooki added 4 commits May 21, 2026 14:20
- New DatasetLivenessService spec ports the FWSS/SP-HTTP probe matrix
  previously in deal.service.spec.ts and adds isPieceLive coverage:
  happy/false/no-client/RPC error/abort.
- Retrieval pre-check now treats missing dataSetId/pieceId as an error
  (emits failure.other and bails) instead of falling through to the SP
  probe. Backfill of legacy null-ID deals is a prerequisite before this
  branch deploys. The error path is loud enough to spot any stragglers.
- Add retrieval spec coverage for the missing-IDs branch and the HEAD
  with GET-on-405 fallback.
- Delete the stale describe.skip("isDataSetLive") block from
  deal.service.spec.ts; coverage now lives in the new spec.

Refs: #561 follow-up commit addressing peer-review consensus.
Copy link
Copy Markdown
Collaborator Author

@SgtPooki SgtPooki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self review

@SgtPooki SgtPooki self-assigned this May 21, 2026
@SgtPooki SgtPooki moved this from 📌 Triage to 🔎 Awaiting review in FOC May 21, 2026
@SgtPooki SgtPooki marked this pull request as ready for review May 21, 2026 20:33
@SgtPooki SgtPooki requested review from Copilot and silent-cipher May 21, 2026 20:33
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refines the Retrieval pre-flight logic to avoid incorrectly marking deals as cleaned_up when an SP returns /pdp/piece/:pieceCid/status 404 for a piece that is still expected to be live on-chain. It also extracts PDP dataset/piece liveness probing into a dedicated DatasetLivenessService so the logic can be reused outside DealService.

Changes:

  • Add a retrieval pre-check pipeline that first consults on-chain pieceLive(dataSetId, pieceId) and only treats SP 404 as cleanup when the piece is not live.
  • Record an explicit failed Retrieval row (with failure.other) when the chain reports the piece live but the SP reports it missing.
  • Refactor dataset liveness probes out of DealService into a new DatasetLivenessService + module, and update tests/docs accordingly.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
docs/checks/events-and-metrics.md Updates retrievalStatus metric documentation to describe the new pre-flight branching behavior.
apps/backend/src/retrieval/retrieval.service.ts Implements chain pieceLive gating and changes SP-404 handling to record failures instead of always cleaning up.
apps/backend/src/retrieval/retrieval.service.spec.ts Extends retrieval tests to cover the new pre-check branches and failure recording.
apps/backend/src/retrieval/retrieval.module.ts Wires DatasetLivenessModule into the retrieval module for DI.
apps/backend/src/deal/deal.service.ts Replaces inlined dataset liveness probing with a proxy call to DatasetLivenessService.
apps/backend/src/deal/deal.service.spec.ts Updates deal-service tests to mock DatasetLivenessService instead of the removed inlined probes.
apps/backend/src/deal/deal.module.ts Wires DatasetLivenessModule into the deal module for DI.
apps/backend/src/dataset-liveness/dataset-liveness.service.ts New service encapsulating FWSS+SP dataset liveness checks and on-chain pieceLive reads.
apps/backend/src/dataset-liveness/dataset-liveness.service.spec.ts New unit tests for dataset liveness + piece liveness probing behavior.
apps/backend/src/dataset-liveness/dataset-liveness.module.ts New Nest module exporting DatasetLivenessService.

Comment thread docs/checks/events-and-metrics.md Outdated
Comment thread apps/backend/src/retrieval/retrieval.service.ts
Comment thread apps/backend/src/retrieval/retrieval.service.ts
Copy link
Copy Markdown
Contributor

@BigLep BigLep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description looks good/right to me.

I do want to see "A separate backfill script must run against staging and mainnet before this PR merges. Tracking issue + script PR will be linked here."

@SgtPooki
Copy link
Copy Markdown
Collaborator Author

PR description looks good/right to me.

I do want to see "A separate backfill script must run against staging and mainnet before this PR merges. Tracking issue + script PR will be linked here."

Updated PR description, backfill done. results at https://gist.github.com/SgtPooki/9fcf39ad2e95e022930ec3fbbbd139b1#file-backfill-results-md

Copy link
Copy Markdown
Collaborator

@silent-cipher silent-cipher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Comment thread apps/backend/src/dataset-liveness/dataset-liveness.service.ts Outdated
Comment thread apps/backend/src/retrieval/retrieval.service.ts Outdated
@SgtPooki SgtPooki merged commit 9a31a7e into main May 22, 2026
9 checks passed
@SgtPooki SgtPooki deleted the feat/retrieval-mark-failed-when-piece-missing-on-live-dataset branch May 22, 2026 12:59
@github-project-automation github-project-automation Bot moved this from 🔎 Awaiting review to 🎉 Done in FOC May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 🎉 Done

Development

Successfully merging this pull request may close these issues.

5 participants