feat: anon piece selection and retrieval by dennis-tra · Pull Request #487 · FilOzone/dealbot

dennis-tra · 2026-04-28T12:37:09Z

Hi folks,

this PR adds the anon retrieval flow from #427. It is a follow-up of:

feat: dealbot-specific subgraph #469 - added the subgraph
feat(retrieval-anon): anon piece selection and retrieval #459 - the original large PR which contained the subgraph and anon retrieval flow (I've closed that PR).

The main logic is in ./apps/backend/src/retrieval-anon:

anon-retrieval.service.ts - "called" on a schedule as a job. It starts the retrieval process (select piece, fetch piece, validate car, store result)
anon-piece-selector.service.ts - implements the subgraph query logic
car-validation.service.ts - parses the piece bytes as a CAR, checks IPNI availability, fetches k blocks from that car and validates their hashes
piece-retrieval.service.ts - implements the HTTP request to download the piece and CommP validation
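
The "fetches k blocks" step needs a way to choose which blocks to check. A minimal sketch of one option, a partial Fisher-Yates shuffle, is shown here. The function name `sampleBlocks` and the injectable `rng` parameter are hypothetical, and the actual selection in car-validation.service.ts may work differently.

```typescript
// Pick up to k distinct items uniformly at random (partial Fisher-Yates shuffle).
// `rng` returns a float in [0, 1); it is injectable so the sketch can be tested.
// Hypothetical helper, not taken from the PR.
function sampleBlocks<T>(
  blocks: readonly T[],
  k: number,
  rng: () => number = Math.random,
): T[] {
  const pool = [...blocks];
  const n = Math.min(Math.max(k, 0), pool.length);
  for (let i = 0; i < n; i++) {
    // Swap position i with a random position in [i, pool.length).
    const j = i + Math.floor(rng() * (pool.length - i));
    const tmp = pool[i] as T;
    pool[i] = pool[j] as T;
    pool[j] = tmp;
  }
  return pool.slice(0, n);
}
```

If the CAR holds fewer than k blocks, the sketch simply returns all of them.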

The anonymous piece selection logic works as follows:

The retrievalAnon check probes an SP for non-dealbot pieces so we can check whether SPs behave well even when the teacher is not watching. To do this fairly, the piece selection should satisfy the following requirements:

Uniform random across all of the SP's active pieces (not biased toward recent writes, specific payers, or specific sizes).
Prefer withIPFSIndexing pieces (so CAR/IPNI validation has something to check) but still exercise non-indexed pieces so an SP can't optimise only its CAR corpus.
Cover a realistic spread of piece sizes: big enough for useful bandwidth measurements, not so big that SPs with only small deals are skipped.
Avoid immediately re-testing the same piece across consecutive checks.

How it works in practice:

Every Root entity in the subgraph carries a sampleKey = keccak256(setId-rootId) populated once at insert time. Because keccak256 is uniform over 256 bits and independent of creation order/size/dataset,
sampleKey sorts roots into a uniform random permutation that is stable across queries.

This is necessary because you cannot just select a random element from a range query in GraphQL. If we knew the total number of pieces we could pick a random skip value, but skip is capped at 5000 and I've read that it becomes very inefficient at higher values. It would also require non-trivial bookkeeping of active piece/dataset counts. The sampleKey is much easier.
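
The lookup that sampleKey enables can be sketched locally as "find the smallest stored key that is greater than or equal to a random probe". This is only an in-memory illustration under assumptions: keys are modelled as equal-length lowercase hex strings, the helper names `firstAtOrAfter` and `randomHexKey` are hypothetical, and `Math.random` stands in for whatever random source the backend uses. In the PR the equivalent lookup is done by the subgraph.

```typescript
// Returns the index of the smallest key >= probe, or -1 if no such key exists.
// Keys are equal-length lowercase hex strings, so string order matches numeric order.
function firstAtOrAfter(sortedKeys: readonly string[], probe: string): number {
  let lo = 0;
  let hi = sortedKeys.length; // search window is [lo, hi)
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if ((sortedKeys[mid] as string) < probe) lo = mid + 1;
    else hi = mid;
  }
  return lo < sortedKeys.length ? lo : -1;
}

// Hypothetical helper: a random probe key rendered as fixed-width hex.
function randomHexKey(bytes: number, rng: () => number = Math.random): string {
  let out = "";
  for (let i = 0; i < bytes; i++) {
    out += Math.floor(rng() * 256).toString(16).padStart(2, "0");
  }
  return out;
}

// Usage with toy one-byte keys (real keys would be 32 bytes).
const exampleKeys = ["0f", "3c", "a0", "e7"];
const exampleIndex = firstAtOrAfter(exampleKeys, randomHexKey(1));
```

A probe larger than every stored key finds nothing, which the sketch reports as -1.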

Drawing a sample looks like this:

Pick a size bucket (small < 20 MiB, medium 20 MiB to 100 MiB, large 100 MiB to 500 MiB) by weighted random (weights 20% / 50% / 30% respectively).
Pick the pool: withIPFSIndexing: true with probability 80%; otherwise no filter.
Generate 32 random bytes as $sampleKey and query:

query randomPiece {
  roots( # <- piece
    first: 1
    orderBy: sampleKey
    orderDirection: asc
    where: {
      sampleKey_gte: $sampleKey
      removed: false
      rawSize_gte: $sizeBucket_lo
      rawSize_lte: $sizeBucket_hi
      proofSet_: { # <- dataset
        fwssServiceProvider: $sp
        fwssPayer_not: $dealbotPayer
        isActive: true
        withIPFSIndexing: $pool
      }
    }
  )
}

This returns the root with the smallest sampleKey >= $sampleKey which is effectively a uniform random pick, in O(log N).
Drop it if pdpPaymentEndEpoch has already passed the latest indexed block, or if its CID appears in the last 500 anonymous retrievals (so we don't sample the same block twice in fast succession). On a miss, it redraws once with a fresh $sampleKey.
Falls back through: (same bucket, opposite pool) -> (any bucket, indexed) -> (any bucket, any) before giving up.
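
The bucket and pool choices and the fallback order above can be sketched as follows. The weights (20% / 50% / 30%), the 80% indexed probability, and the fallback sequence come from the description; the type and function names (`Bucket`, `Pool`, `pickBucket`, `pickPool`, `attemptOrder`) are hypothetical and not taken from anon-piece-selector.service.ts.

```typescript
type Bucket = "small" | "medium" | "large";
type Pool = "indexed" | "any"; // "any" means no withIPFSIndexing filter

interface Attempt {
  bucket: Bucket | "any";
  pool: Pool;
}

const BUCKET_WEIGHTS: ReadonlyArray<[Bucket, number]> = [
  ["small", 0.2],
  ["medium", 0.5],
  ["large", 0.3],
];

// Weighted random pick: walk the cumulative weights until the draw is covered.
function pickBucket(rng: () => number = Math.random): Bucket {
  let r = rng();
  for (const [bucket, weight] of BUCKET_WEIGHTS) {
    if (r < weight) return bucket;
    r -= weight;
  }
  return "large"; // guards against floating-point drift
}

// Indexed pool with probability 80%, otherwise no filter.
function pickPool(rng: () => number = Math.random): Pool {
  return rng() < 0.8 ? "indexed" : "any";
}

// The first draw followed by the three fallbacks, in the order described.
function attemptOrder(bucket: Bucket, pool: Pool): Attempt[] {
  const opposite: Pool = pool === "indexed" ? "any" : "indexed";
  return [
    { bucket, pool },
    { bucket, pool: opposite },
    { bucket: "any", pool: "indexed" },
    { bucket: "any", pool: "any" },
  ];
}
```

Keeping the fallback order as plain data makes it straightforward to unit test that the selector gives up after the last attempt.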

Subgraph

Note

The deployed subgraphs don't contain the latest changes from the recent PR review. They should still work for testing.

I have deployed the new subgraphs:

A deployment looks like this from within the subgraph folder (prerequisite is a call to goldsky login):

pnpm run codegen && pnpm run build:calibration && VERSION=0.3.0 pnpm run deploy:calibration
pnpm run codegen && pnpm run build:mainnet && VERSION=0.3.0 pnpm run deploy:mainnet

Comments

Timeout handling was a bit tricky because we have 1) job timeouts, 2) a connect timeout, and 3) a transfer timeout. The connect and transfer timeouts were shared between the basic and anon retrievals, but because anon retrievals may download larger files they were too short. I've set the anon retrieval job timeout to 5 minutes (it should also take the job rate into account but doesn't at the moment), and the HTTP transfer timeout is set to the maximum of the basic and anon retrieval job timeouts, because both code paths use the same HTTP client.
If an http2 retrieval times out but has received partial data, it returns partial information (ttfb, retrieved bytes, etc). I've only added this to http2.
This PR clashes with @iand's feat: add retrieval_type column to retrieval_checks clickhouse table #485. I have created a separate table for the anon retrievals because I figured that the overlap between both types was too small.
Do we want to keep calling it anon or rather something like sampled?
I don't know how much the subgraph will cost
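
The timeout derivation from the first comment can be sketched as below. The interface and field names are hypothetical, and the 60-second basic timeout in the usage line is only a placeholder value; only the "maximum of both job timeouts" rule and the 5-minute anon timeout come from the description.

```typescript
// Hypothetical config shape; the real fields live in app.config.ts and may differ.
interface JobTimeouts {
  retrievalJobTimeoutSeconds: number; // basic retrieval job timeout
  anonRetrievalJobTimeoutSeconds: number; // anon retrieval job timeout
}

// The shared HTTP client must not cut off the longer-running job,
// so its transfer timeout follows the larger of the two job timeouts.
function httpTransferTimeoutMs(t: JobTimeouts): number {
  return Math.max(t.retrievalJobTimeoutSeconds, t.anonRetrievalJobTimeoutSeconds) * 1000;
}

// Usage: placeholder 60 s basic timeout, 5 min anon timeout.
const exampleTransferTimeoutMs = httpTransferTimeoutMs({
  retrievalJobTimeoutSeconds: 60,
  anonRetrievalJobTimeoutSeconds: 300,
});
```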

iand · 2026-04-29T14:11:27Z

To me, sample/sampled better conveys what is happening.

iand · 2026-04-29T14:21:33Z

+    first_byte_ms              Nullable(Float64),                 -- time to first response byte
+    last_byte_ms               Nullable(Float64),                 -- time to last response byte
+    bytes_retrieved            Nullable(UInt64),                  -- bytes received from /piece/{cid}
+    throughput_bps             Nullable(UInt64),                  -- effective throughput, bytes per second


This is data that can be easily derived. Also is it (ttlb-ttfb)/bytes or simply ttlb/bytes?

It's http response size / total time of the HTTP request

yeah it could easily be derived, I agree
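
A minimal sketch of that derivation, reading "total time of the HTTP request" as last_byte_ms and assuming the result is rounded to the nearest integer (the rounding is an assumption, and `throughputBps` is a hypothetical name). The two example rows posted later in this thread are consistent with this formula.

```typescript
// Effective throughput in bytes per second: response size / total request time.
// Returns null when no time has elapsed, mirroring the Nullable column.
function throughputBps(bytesRetrieved: number, lastByteMs: number): number | null {
  if (lastByteMs <= 0) return null;
  return Math.round((bytesRetrieved * 1000) / lastByteMs);
}

// Usage with the values from the first example row in this thread.
const exampleThroughput = throughputBps(10486897, 5701); // 1839484
```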

dennis-tra · 2026-04-30T08:15:02Z

One example data point in my local clickhouse db:

timestamp:                  2026-04-30 08:02:34.818
probe_location:             unknown
sp_address:                 0xa3971A7234a3379A1813d9867B531e7EeB20ae07
sp_id:                      9
sp_name:                    ezpdpz-calib
retrieval_id:               488b4334-0bdf-4dae-a299-4c11869f49cc
piece_cid:                  bafkzcibfr737oaqtnih5h25biyipqjbrbv4ibik37mpbjrepqzz7irsgqp6ffszolmga
data_set_id:                6142
piece_id:                   2684
raw_size:                   10486897 -- 10.49 million
with_ipfs_indexing:         true
ipfs_root_cid:              bafybeiehkmndajr3dy5524mnglj3cqbljemwj5f3ldot4ulizzmw4sp4mu
service_type:               direct_sp
retrieval_endpoint:         https://calib.ezpdpz.net/piece/bafkzcibfr737oaqtnih5h25biyipqjbrbv4ibik37mpbjrepqzz7irsgqp6ffszolmga
piece_fetch_status:         success
http_response_code:         200
first_byte_ms:              347
last_byte_ms:               5701
bytes_retrieved:            10486897 -- 10.49 million
throughput_bps:             1839484 -- 1.84 million
commp_valid:                true
car_parseable:              true
car_block_count:            12
block_fetch_endpoint:       https://calib.ezpdpz.net/ipfs/
block_fetch_valid:          true
block_fetch_sampled_count:  5
block_fetch_failed_count:   0
ipni_status:                valid
ipni_verify_ms:             1087
ipni_verified_cids_count:   6
ipni_unverified_cids_count: 0
error_message:              ᴺᵁᴸᴸ

dennis-tra · 2026-04-30T09:23:56Z

A discussion that just came up with @iand:

This anon retrieval here persists data only in Clickhouse. AFAIU the basic retrieval persists data in postgres, exposes aggregated metrics via prometheus, and, if the CLICKHOUSE_URL is configured, also writes a subset of the postgres data to clickhouse.

My initial understanding was that metrics data will exclusively live in Clickhouse while Postgres will handle job queues/orchestration/keeping deal state/etc.

To be consistent with the basic retrieval flow, I'll change this PR to store data primarily in Postgres and on the write path, if CLICKHOUSE_URL is configured, also store relevant metrics in Clickhouse.

dennis-tra · 2026-04-30T11:19:21Z

@iand implemented in 6824f75

new Clickhouse row:

Row 1:
──────
timestamp:                  2026-04-30 11:15:04.945
probe_location:             unknown
sp_address:                 0xCb9e86945cA31E6C3120725BF0385CBAD684040c
sp_id:                      4
sp_name:                    infrafolio-calib
retrieval_id:               ebe6c618-868f-42e9-a70b-f37f21b56f1d
raw_size:                   26196779 -- 26.20 million
with_ipfs_indexing:         true
service_type:               direct_sp
piece_fetch_status:         success
http_response_code:         200
first_byte_ms:              515
last_byte_ms:               5898
bytes_retrieved:            26196779 -- 26.20 million
throughput_bps:             4441638 -- 4.44 million
commp_valid:                true
car_parseable:              true
car_block_count:            275
block_fetch_valid:          true
block_fetch_sampled_count:  5
block_fetch_failed_count:   0
ipni_status:                valid
ipni_verify_ms:             3410
ipni_verified_cids_count:   6
ipni_unverified_cids_count: 0

new Postgres row:

-[ RECORD 3 ]--------------+--------------------------------------------------------------------------------------------------------------------
id                         | ebe6c618-868f-42e9-a70b-f37f21b56f1d
started_at                 | 2026-04-30 11:15:04.945+00
probe_location             | unknown
sp_address                 | 0xCb9e86945cA31E6C3120725BF0385CBAD684040c
sp_id                      | 4
sp_name                    | infrafolio-calib
piece_cid                  | bafkzcibf2we3cayu64rkggkzjhdeqrzrhyoouoaweemqd76nx57jsmjuizzayvpwhafq
data_set_id                | 13199
piece_id                   | 283
raw_size                   | 26196779
with_ipfs_indexing         | t
ipfs_root_cid              | bafkreigd5yjmb4mf5luayeac3danaoglqeqqeqfqmbpq2fhrjyuj2fftam
service_type               | direct_sp
retrieval_endpoint         | https://caliberation-pdp.infrafolio.com/piece/bafkzcibf2we3cayu64rkggkzjhdeqrzrhyoouoaweemqd76nx57jsmjuizzayvpwhafq
piece_fetch_status         | success
http_response_code         | 200
first_byte_ms              | 515
last_byte_ms               | 5898
bytes_retrieved            | 26196779
throughput_bps             | 4441638
commp_valid                | t
car_parseable              | t
car_block_count            | 275
block_fetch_endpoint       | https://caliberation-pdp.infrafolio.com/ipfs/
block_fetch_valid          | t
block_fetch_sampled_count  | 5
block_fetch_failed_count   | 0
ipni_status                | valid
ipni_verify_ms             | 3410
ipni_verified_cids_count   | 6
ipni_unverified_cids_count | 0
error_message              |
created_at                 | 2026-04-30 11:15:04.945+00

SgtPooki · 2026-04-30T12:10:01Z

@dennis-tra

To be consistent with the basic retrieval flow, I'll change this PR to store data primarily in Postgres and on the write path, if CLICKHOUSE_URL is configured, also store relevant metrics in Clickhouse.

idk if we need to store retrieval++ data in postgres, we should probably just store in prometheus + clickhouse

dennis-tra · 2026-04-30T12:11:39Z

I'm fine either way. I can easily revert the latest commit. Just note that currently for the basic retrieval we store the same data in both db systems and one could consider it inconsistent if we did it differently here. @iand please chime in here.

Copilot

Pull request overview

Adds a new “anonymous retrieval” check flow to the backend, enabling scheduled sampling of non-dealbot pieces via the subgraph, retrieval via /piece/{cid}, optional CAR/IPNI validation, and persistence of results/metrics.

Changes:

Introduces retrieval-anon module/services (piece selection, retrieval + CommP, CAR/IPNI validation) and wires it into pg-boss scheduling.
Replaces the PDP-subgraph client with a unified SubgraphService and renames env config to SUBGRAPH_ENDPOINT.
Adds new Prometheus metrics and a ClickHouse table (anon_retrieval_checks) plus Postgres schema/entity for anon retrieval results; adjusts HTTP/2 timeout/partial-download behavior.

Reviewed changes

Copilot reviewed 44 out of 47 changed files in this pull request and generated 7 comments.


| File | Description |
| --- | --- |
| pnpm-lock.yaml | Dependency lock updates (oclif/minimatch patch bumps). |
| kustomize/overlays/local/backend-configmap-local.yaml | Renames local env var to `SUBGRAPH_ENDPOINT`. |
| docs/environment-variables.md | Documents `SUBGRAPH_ENDPOINT` + anon retrieval job timeout. |
| docs/checks/production-configuration-and-approval-methodology.md | Updates production config reference to `SUBGRAPH_ENDPOINT`. |
| docs/checks/data-retention.md | Updates data-retention docs to reference `SubgraphService`/`SUBGRAPH_ENDPOINT`. |
| apps/backend/src/wallet-sdk/wallet-sdk.service.spec.ts | Updates config shape to `subgraphEndpoint`. |
| apps/backend/src/subgraph/types.ts | Adds anon piece sampling types + CID decoding + response validator. |
| apps/backend/src/subgraph/types.spec.ts | Adds unit tests for subgraph response validators. |
| apps/backend/src/subgraph/subgraph.service.ts | Renames/extends subgraph client; adds `sampleAnonPiece()` and generic query helper. |
| apps/backend/src/subgraph/subgraph.service.spec.ts | Updates tests and adds coverage for `sampleAnonPiece()`. |
| apps/backend/src/subgraph/subgraph.module.ts | New Nest module exporting `SubgraphService`. |
| apps/backend/src/subgraph/queries.ts | New subgraph query definitions incl. anon sampling query builder. |
| apps/backend/src/retrieval-anon/types.ts | New anon retrieval domain result types. |
| apps/backend/src/retrieval-anon/retrieval-anon.module.ts | New Nest module wiring anon retrieval dependencies. |
| apps/backend/src/retrieval-anon/piece-retrieval.service.ts | Implements `/piece/{cid}` download + CommP validation. |
| apps/backend/src/retrieval-anon/car-validation.service.ts | Implements CAR parsing, IPNI verification, sampled block fetch+hash verification. |
| apps/backend/src/retrieval-anon/anon-retrieval.service.ts | Orchestrates anon selection → retrieval → validation → persistence + metrics. |
| apps/backend/src/retrieval-anon/anon-retrieval.service.spec.ts | Adds unit tests for persistence/metrics behavior (including abort/partial). |
| apps/backend/src/retrieval-anon/anon-piece-selector.service.ts | Implements bucketed/pool sampling + dedup + fallback strategy. |
| apps/backend/src/retrieval-anon/anon-piece-selector.service.spec.ts | Adds unit tests for sampling/fallback/dedup/termination behavior. |
| apps/backend/src/pdp-subgraph/queries.ts | Removes legacy PDP-subgraph queries (migrated to `subgraph/`). |
| apps/backend/src/pdp-subgraph/pdp-subgraph.module.ts | Removes legacy PDP subgraph module (replaced by `SubgraphModule`). |
| apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts | Registers new anon retrieval Prometheus metrics + provider. |
| apps/backend/src/metrics-prometheus/check-metrics.service.ts | Adds `AnonRetrievalCheckMetrics` helper class. |
| apps/backend/src/metrics-prometheus/check-metric-labels.ts | Adds `anon_retrieval` to the `CheckType` union. |
| apps/backend/src/jobs/jobs.service.ts | Adds `retrieval_anon` job type, queue, scheduler, and timeout handling. |
| apps/backend/src/jobs/jobs.service.spec.ts | Updates tests for new dependency ordering and new schedule rows. |
| apps/backend/src/jobs/jobs.module.ts | Imports `RetrievalAnonModule` for job execution. |
| apps/backend/src/jobs/job-queues.ts | Adds `RETRIEVAL_ANON_QUEUE`. |
| apps/backend/src/ipni/ipni-verification.service.ts | Changes IPNI verification to per-CID checks with per-CID failure tracking/counts. |
| apps/backend/src/http-client/types.ts | Extends request result with `aborted`/`abortReason` for partial HTTP/2 downloads. |
| apps/backend/src/http-client/http-client.service.ts | Reworks HTTP/2 timeout handling and returns partial bytes+metrics on abort mid-download. |
| apps/backend/src/http-client/http-client.service.spec.ts | Adds tests for headersTimeout mapping, signal behavior, partial-download returns, and rethrowing non-abort errors. |
| apps/backend/src/database/types.ts | Adds `PieceFetchStatus` and `IpniCheckStatus` enums for anon retrievals. |
| apps/backend/src/database/migrations/1776300000000-CreateAnonRetrievals.ts | Adds Postgres schema (table + enums + indexes) for anon retrievals. |
| apps/backend/src/database/entities/job-schedule-state.entity.ts | Adds `retrieval_anon` to scheduled job type union. |
| apps/backend/src/database/entities/anon-retrieval.entity.ts | Adds TypeORM entity mapping for `anon_retrievals`. |
| apps/backend/src/database/database.module.ts | Registers `AnonRetrieval` entity and exports it for injection. |
| apps/backend/src/data-retention/data-retention.service.ts | Switches data retention polling to `SubgraphService` and `subgraphEndpoint`. |
| apps/backend/src/data-retention/data-retention.service.spec.ts | Updates tests to mock `SubgraphService` and renamed config field. |
| apps/backend/src/data-retention/data-retention.module.ts | Switches module import from legacy PDP subgraph module to `SubgraphModule`. |
| apps/backend/src/config/app.config.ts | Renames env var to `SUBGRAPH_ENDPOINT`; adds anon retrieval rates/timeouts/block-sample config; derives HTTP timeouts from max job timeout. |
| apps/backend/src/clickhouse/clickhouse.schema.ts | Adds ClickHouse `anon_retrieval_checks` table schema. |
| apps/backend/src/app.module.ts | Imports `RetrievalAnonModule`. |
| apps/backend/README.md | Updates env var docs to `SUBGRAPH_ENDPOINT`. |
| apps/backend/.env.example | Renames env var, adds anon retrieval env vars, updates timeout guidance. |
| .gitignore | Ignores `.tool-versions`. |

Files not reviewed (1)

pnpm-lock.yaml: Language not supported


SgtPooki · 2026-04-30T12:27:14Z

I'm fine either way. I can easily revert the latest commit. Just note that currently for the basic retrieval we store the same data in both db systems and one could consider it inconsistent if we did it differently here. @iand please chime in here.

we have been slowly attempting to move away from storing data in postgres, instead leaning on prometheus/clickhouse + logs, so not adding another dependency on the database would be ideal. If it's necessary, then I don't want to push back too hard, but it requires us to manage cleaning up the DB and other various concerns that are more easily handled with prom/clickhouse/betterstack expiry config.

Is there a reason we need to store it in postgres table instead of just piping output to our metrics/logs services?

iand · 2026-04-30T12:56:46Z

I'm fine either way. I can easily revert the latest commit. Just note that currently for the basic retrieval we store the same data in both db systems and one could consider it inconsistent if we did it differently here. @iand please chime in here.

This is my basic point. Clickhouse is an optional component so without storing in postgres there is no record of the retrieval.

iand · 2026-04-30T12:58:08Z

I'm fine either way. I can easily revert the latest commit. Just note that currently for the basic retrieval we store the same data in both db systems and one could consider it inconsistent if we did it differently here. @iand please chime in here.

we have been slowly attempting to move away from storing data in postgres, instead leaning on prometheus/clickhouse + logs, so not adding another dependency on the database would be ideal. If it's necessary, then I don't want to push back too hard, but it requires us to manage cleaning up the DB and other various concerns that are more easily handled with prom/clickhouse/betterstack expiry config.

Is there a reason we need to store it in postgres table instead of just piping output to our metrics/logs services?

This is a valid point too. It really boils down to what the purpose of postgres is: whether it is to simply record working state or to provide diagnostic data.

dennis-tra · 2026-04-30T14:05:36Z

If it's necessary, then I don't want to push back too hard [...].

It's not necessary. Everything worked fine before the last commit 👍 Then Clickhouse becomes a hard dependency (which I think is fine for the same reason as you enumerated @SgtPooki ) but IIUC this was the point to pause for @iand.

So what do we do? Revert to CH only?

BigLep · 2026-05-01T17:50:06Z

Wow - good stuff! A few things from driving by. Let me know if it's key that I look more closely:

I think we're fine to use clickhouse only (assuming we also have prometheus so we can see jobs running, completing).
"Drop it ... if its CID appears in the last 500 anonymous retrievals (so we don't sample the same block twice in fast succession)." I don't know how important this is. Keeping it simple, where each run we just pick a random sample, is fine. If we randomly pick the same CID a couple of times in a row, I don't think it's a big deal. Also, what do you do when an SP has fewer than 500 CIDs?
Documentation: I realize New check: retrieval++ #427 wasn't explicit about this (my bad), but I'd like https://github.com/FilOzone/dealbot/tree/main/docs/checks to stay updated in terms of events, metrics, and check descriptions. Basically I want to have a place where a human can reason about our checks and our metrics.

SgtPooki · 2026-05-01T18:00:59Z

@dennis-tra

I think we're fine to use clickhouse only (assuming we also have prometheus so we can see jobs running, completing).

Yea, we should pump to clickhouse and prometheus. so i think the going consensus is: we should not be adding info to postgres unless it's required to read/infer state. we do need that for deals/data-storage checks, we dont need it for retrievals but it's still there because we haven't removed it.

we can eventually migrate to no data-storage/deal table in the future, but that would require a lot more RPC calls and chain walking to figure out what exists and where we want to upload things to. this isn't priority, but keeping new data out of postgres where possible is (a priority).

SgtPooki

reviewed some and have a few high level questions:

why change PDP_SUBGRAPH_ENDPOINT env var? seems unnecessary and will cause issues with our deployed version, and potentially old stale code as well
why move subgraph code to /subgraph instead of just leaving in /pdp-subgraph? -- this is a large PR and keeping to existing norms rather than overwriting things would be ideal
changing how we do ipni verification can have drastic impact on current metrics.. switching to serial individual CID checks could cause already slow IPNI verification for some SPs to start fully failing.

SgtPooki

two last things besides the one we discussed on slack!

BigLep

I have a few comments here. The only reason I didn't give a full approval is that I thought it would be good to confirm things look good after merge, and there is the feedback item on why we're deviating from how "data storage check" does sub-statuses. If we are going to use "skipped", then use it throughout all the statuses (including CAR parsing).

dennis-tra · 2026-05-27T10:41:34Z

@BigLep your comments regarding the status led me to restructure a few things (commit aa19374) because you were totally right regarding some of the skipped vs. error status remarks.

I've changed the CarValidationService to become the PieceValidationService
Each check (CAR parsing / IPNI verification / Block fetching) is called separately by the AnonRetrievalService - previously it was just a single validate call which returned a big result object.
This change allowed for easier error handling and tracking of when tests were skipped.
Notably
1. anonCarParseStatus received a skipped sub-status
2. For anonIpniStatus, the semantics of when skipped vs. error is emitted have been tightened to accommodate the comments.
3. anonBlockFetchStatus sub-statuses have changed from valid/invalid to success/failure (valid/invalid sounded weird in that context).

For reference here are all new sub-status meanings

| anonPieceRetrievalStatus | Meaning |
| --- | --- |
| `success` | `GET /piece/{pieceCid}` returned HTTP 2xx and the response bytes hashed to the declared CommP. |
| `skipped` (changed from `failure.no_piece`) | The subgraph returned no candidate piece for the SP after all selection fallbacks. No HTTP request was attempted. |
| `failure.http` | Piece fetch did not return HTTP 2xx, or the request failed at the transport layer (DNS, TLS, connection reset, etc.). |
| `failure.commp` | Piece fetch returned HTTP 2xx, but the response bytes hashed to a different CID than `pieceCid`. The bytes are discarded, and downstream CAR / IPNI / block-fetch validation is skipped to avoid amplifying a misbehaving SP. |
| `failure.timedout` | The job-level `AbortSignal` fired (most often `ANON_RETRIEVAL_JOB_TIMEOUT_SECONDS`). Partial timing/byte evidence is still persisted. |
| `failure.other` | Catch-all for retrieval failures that do not match any of the categories above. |

| anonCarParseStatus | Meaning |
| --- | --- |
| `parseable` | The fetched piece bytes were successfully parsed as a CAR by `@ipld/car`. |
| `not_parseable` | The fetched piece bytes could not be parsed as a CAR (malformed header, truncated content, unexpected encoding, or parser threw an error). |
| `skipped` (NEW) | CAR parsing was not attempted: piece fetch failed, the piece does not advertise IPFS indexing, or the job aborted before parsing. |

| anonIpniStatus | Meaning |
| --- | --- |
| `valid` | filecoinpin.contact returned the SP as a provider for the root CID and every sampled child CID within `IPNI_VERIFICATION_TIMEOUT_MS`. |
| `invalid` | IPNI was queried but at least one CID never resolved to the SP under test before the timeout (or the timeout fired with unresolved CIDs). |
| `skipped` (semantic changed) | IPNI verification was not attempted: piece fetch failed, the piece does not advertise IPFS indexing, CAR parsing returned `not_parseable`, the root CID itself failed to parse, or the job aborted. |
| `error` (semantic changed) | IPNI verification was attempted and `IpniVerificationService.verify` threw unexpectedly (transport error, service down, etc.). |

| anonBlockFetchStatus | Meaning |
| --- | --- |
| `success` (RENAMED) | Every sampled CID was fetched via `GET {spBaseUrl}/ipfs/{cid}?format=raw` and the response bytes hash-verified against the declared CID. |
| `failure` (RENAMED) | At least one sampled block fetch failed: non-2xx HTTP, hash mismatch, unsupported codec, unsupported hash, or transport error. Each failed sample counts as one failed block. |
| `skipped` | Block-fetch sampling was not attempted: piece fetch failed, the piece does not advertise IPFS indexing, CAR parsing returned `not_parseable`, or the job aborted. |
| `error` | Block-fetch sampling was attempted but the loop threw unexpectedly outside the per-block try/catch. |
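
The `skipped` conditions listed above can be sketched as two small predicates. The status string values come from the tables; the `Preconditions` shape and the function names are hypothetical and are not the types used by AnonRetrievalService.

```typescript
type PieceRetrievalStatus =
  | "success"
  | "skipped"
  | "failure.http"
  | "failure.commp"
  | "failure.timedout"
  | "failure.other";

type CarParseStatus = "parseable" | "not_parseable" | "skipped";

// Hypothetical bundle of the facts known before the downstream checks run.
interface Preconditions {
  pieceStatus: PieceRetrievalStatus;
  withIpfsIndexing: boolean;
  carStatus: CarParseStatus;
  rootCidParses: boolean;
  aborted: boolean;
}

// Block-fetch sampling is skipped unless the piece was fetched, advertises
// IPFS indexing, parsed as a CAR, and the job is still running.
function blockFetchSkipped(p: Preconditions): boolean {
  return (
    p.pieceStatus !== "success" ||
    !p.withIpfsIndexing ||
    p.carStatus !== "parseable" ||
    p.aborted
  );
}

// IPNI verification has one extra precondition: the root CID must parse.
function ipniSkipped(p: Preconditions): boolean {
  return blockFetchSkipped(p) || !p.rootCidParses;
}
```

Treating any CAR status other than `parseable` as a skip covers both `not_parseable` and a CAR step that was itself skipped.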

dennis-tra · 2026-05-27T11:21:16Z

The typecheck CI failure also happens on main and is unrelated to the changes here.

I wanted to test and run the latest changes but nestjs is refusing to start due to the type errors:

Output:

> nest start

src/deal/deal.service.ts:403:20 - error TS2678: Type '"stored"' is not comparable to type '"onStored" | "onPiecesAdded" | "onPiecesConfirmed" | "onCopyComplete" | "onCopyFailed" | "onPullProgress" | "onProviderSelected" | "onDataSetResolved" | "ipniProviderResults.retryUpdate" | "ipniProviderResults.complete" | "ipniProviderResults.failed"'.

403               case "stored": {
                       ~~~~~~~~
src/deal/deal.service.ts:451:39 - error TS2339: Property 'data' does not exist on type 'never'.

451                 deal.pieceCid = event.data.pieceCid.toString();
                                          ~~~~
src/deal/deal.service.ts:452:49 - error TS2339: Property 'data' does not exist on type 'never'.

452                 dealLogContext.pieceCid = event.data.pieceCid.toString();
                                                    ~~~~
src/deal/deal.service.ts:470:20 - error TS2678: Type '"piecesAdded"' is not comparable to type '"onStored" | "onPiecesAdded" | "onPiecesConfirmed" | "onCopyComplete" | "onCopyFailed" | "onPullProgress" | "onProviderSelected" | "onDataSetResolved" | "ipniProviderResults.retryUpdate" | "ipniProviderResults.complete" | "ipniProviderResults.failed"'.

470               case "piecesAdded":
                       ~~~~~~~~~~~~~
src/deal/deal.service.ts:475:33 - error TS2339: Property 'data' does not exist on type 'never'.

475                   txHash: event.data.txHash,
                                    ~~~~
src/deal/deal.service.ts:478:27 - error TS2339: Property 'data' does not exist on type 'never'.

478                 if (event.data.txHash != null) {
                              ~~~~
src/deal/deal.service.ts:479:48 - error TS2339: Property 'data' does not exist on type 'never'.

479                   deal.transactionHash = event.data.txHash as Hex;
                                                   ~~~~
src/deal/deal.service.ts:495:20 - error TS2678: Type '"piecesConfirmed"' is not comparable to type '"onStored" | "onPiecesAdded" | "onPiecesConfirmed" | "onCopyComplete" | "onCopyFailed" | "onPullProgress" | "onProviderSelected" | "onDataSetResolved" | "ipniProviderResults.retryUpdate" | "ipniProviderResults.complete" | "ipniProviderResults.failed"'.

495               case "piecesConfirmed":
                       ~~~~~~~~~~~~~~~~~
src/deal/deal.service.ts:500:35 - error TS2339: Property 'data' does not exist on type 'never'.

500                   pieceIds: event.data.pieceIds,
                                      ~~~~
src/deal/deal.service.ts:502:27 - error TS2339: Property 'data' does not exist on type 'never'.

502                 if (event.data.pieceIds.length > 1) {
                              ~~~~
src/deal/deal.service.ts:507:37 - error TS2339: Property 'data' does not exist on type 'never'.

507                     pieceIds: event.data.pieceIds,
                                        ~~~~
src/deal/deal.service.ts:510:27 - error TS2339: Property 'data' does not exist on type 'never'.

510                 if (event.data.pieceIds.length > 0) {
                              ~~~~
src/deal/deal.service.ts:511:47 - error TS2339: Property 'data' does not exist on type 'never'.

511                   deal.pieceId = Number(event.data.pieceIds[0]);
                                                  ~~~~
src/deal/deal.service.ts:925:20 - error TS2678: Type '"stored"' is not comparable to type '"onStored" | "onPiecesAdded" | "onPiecesConfirmed" | "onCopyComplete" | "onCopyFailed" | "onPullProgress" | "onProviderSelected" | "onDataSetResolved" | "ipniProviderResults.retryUpdate" | "ipniProviderResults.complete" | "ipniProviderResults.failed"'.

925               case "stored":
                       ~~~~~~~~
src/deal/deal.service.ts:926:34 - error TS2339: Property 'data' does not exist on type 'never'.

926                 pieceCid = event.data.pieceCid.toString();
                                     ~~~~
src/deal/deal.service.ts:936:20 - error TS2678: Type '"piecesAdded"' is not comparable to type '"onStored" | "onPiecesAdded" | "onPiecesConfirmed" | "onCopyComplete" | "onCopyFailed" | "onPullProgress" | "onProviderSelected" | "onDataSetResolved" | "ipniProviderResults.retryUpdate" | "ipniProviderResults.complete" | "ipniProviderResults.failed"'.

936               case "piecesAdded":
                       ~~~~~~~~~~~~~
src/deal/deal.service.ts:944:33 - error TS2339: Property 'data' does not exist on type 'never'.

944                   txHash: event.data.txHash ?? "unknown",
                                    ~~~~
src/deal/deal.service.ts:947:20 - error TS2678: Type '"piecesConfirmed"' is not comparable to type '"onStored" | "onPiecesAdded" | "onPiecesConfirmed" | "onCopyComplete" | "onCopyFailed" | "onPullProgress" | "onProviderSelected" | "onDataSetResolved" | "ipniProviderResults.retryUpdate" | "ipniProviderResults.complete" | "ipniProviderResults.failed"'.

947               case "piecesConfirmed":
                       ~~~~~~~~~~~~~~~~~
src/deal/deal.service.ts:955:35 - error TS2339: Property 'data' does not exist on type 'never'.

955                   pieceIds: event.data.pieceIds,
                                      ~~~~
src/pull-check/pull-check.module.ts:2:33 - error TS2307: Cannot find module '@nestjs/throttler' or its corresponding type declarations.

2 import { ThrottlerModule } from "@nestjs/throttler";
                                  ~~~~~~~~~~~~~~~~~~~
src/pull-check/pull-check.service.ts:3:22 - error TS2305: Module '"@filoz/synapse-core/sp"' has no exported member 'waitForPullPieces'.

3 import { pullPieces, waitForPullPieces } from "@filoz/synapse-core/sp";
                       ~~~~~~~~~~~~~~~~~
src/pull-check/pull-check.service.ts:137:58 - error TS2345: Argument of type 'string | null' is not assignable to parameter of type 'string'.
  Type 'null' is not assignable to type 'string'.

137       this.pullCheckMetrics.recordProviderStatus(labels, finalProviderStatus);
                                                             ~~~~~~~~~~~~~~~~~~~
src/pull-check/pull-piece.controller.ts:5:32 - error TS2307: Cannot find module '@nestjs/throttler' or its corresponding type declarations.

5 import { ThrottlerGuard } from "@nestjs/throttler";
                                 ~~~~~~~~~~~~~~~~~~~
src/wallet-sdk/wallet-sdk.service.ts:428:21 - error TS2339: Property 'extraCapabilities' does not exist on type 'PDPOffering'.

428     return info.pdp.extraCapabilities?.serviceStatus === DEV_TAG;
                        ~~~~~~~~~~~~~~~~~

Found 24 error(s).
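The paired TS2678 / TS2339 errors above follow the pattern tsc reports when a `switch` over a discriminated union uses a case label that is not a member of the union: the label is "not comparable", and inside that case the variable narrows to `never`, so `.data` no longer exists. Below is a minimal sketch of the pattern and of a switch that compiles. It assumes a hypothetical `UploadEvent` type with only two members and made-up payload shapes; it is not the actual synapse event type, and the thread does not state what caused the mismatch.

```typescript
// Hypothetical event union; the member names mirror two of the literals in the
// compiler output, but the payload shapes are assumptions for illustration.
type UploadEvent =
  | { type: "onStored"; data: { pieceCid: string } }
  | { type: "onPiecesAdded"; data: { txHash?: string } };

// Switching on literals that are in the union lets TypeScript narrow `event`
// per case. Writing `case "stored":` instead would raise TS2678, and
// `event.data` inside that case would raise TS2339 because `event` is `never`.
function describeEvent(event: UploadEvent): string {
  switch (event.type) {
    case "onStored":
      return `stored ${event.data.pieceCid}`;
    case "onPiecesAdded":
      return `added in tx ${event.data.txHash ?? "unknown"}`;
    default: {
      // Exhaustiveness check: stops compiling if a union member is unhandled.
      const unreachable: never = event;
      return unreachable;
    }
  }
}

// TS2345 ('string | null' passed where 'string' is expected) needs an explicit
// fallback or guard before the call; "unknown" is an assumed placeholder value.
function toStatusLabel(status: string | null): string {
  return status ?? "unknown";
}
```

The `never` assignment in the `default` branch is optional, but it turns a future change to the union into a compile error at the switch rather than a silently skipped case.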

SgtPooki · 2026-05-27T21:19:15Z

I wanted to test and run the latest changes but nestjs is refusing to start due to the type errors:

Looks like you got this fixed. lmk if you need further help there.

BigLep

Arg, apologies for the delay. I didn't realize I hadn't finished this off.

No blocking comments from me, although looking at this again with eyes that spent too long today looking at our metric documentation, I was wondering whether we want to introduce terms for IPNI validation other than success and failure. Your call.

BigLep · 2026-05-29T16:19:48Z

Things look good to me from a docs perspective. Looks like we just need code signoff from @SgtPooki (who I know is juggling other things for GA).

BigLep · 2026-06-05T21:14:13Z

@SgtPooki: are you planning to look at this more, or is this good to merge?

FilOzzy added this to FOC Apr 28, 2026

github-project-automation Bot moved this to 📌 Triage in FOC Apr 28, 2026

dennis-tra force-pushed the anon-retrieval branch 2 times, most recently from e40d010 to 444a79b April 28, 2026 12:42

BigLep moved this from 📌 Triage to ⌨️ In Progress in FOC Apr 28, 2026

BigLep assigned dennis-tra Apr 29, 2026

dennis-tra mentioned this pull request Apr 29, 2026

feat(retrieval-anon): anon piece selection and retrieval #459

Closed

iand reviewed Apr 29, 2026

dennis-tra marked this pull request as ready for review April 30, 2026 08:04

dennis-tra requested a review from iand April 30, 2026 08:04

iand mentioned this pull request Apr 30, 2026

feat: add retrieval_type column to retrieval_checks clickhouse table #485

Closed

dennis-tra mentioned this pull request Apr 30, 2026

New check: retrieval++ #427

Open

SgtPooki requested review from SgtPooki, Copilot and silent-cipher April 30, 2026 12:08

Copilot started reviewing on behalf of SgtPooki April 30, 2026 12:09

Copilot AI reviewed Apr 30, 2026

SgtPooki requested changes May 1, 2026

SgtPooki requested changes May 26, 2026

Comment thread apps/backend/src/retrieval-anon/car-validation.service.ts Outdated

Comment thread apps/backend/src/retrieval-anon/anon-retrieval.service.ts

SgtPooki reviewed May 26, 2026

Comment thread apps/backend/src/retrieval-anon/anon-retrieval.service.ts Outdated

BigLep reviewed May 26, 2026

dennis-tra mentioned this pull request May 27, 2026

feat(subgraph): mark pieces as inactive when the pdpPaymentEndEpoch has passed #579

Open

dennis-tra added 5 commits May 27, 2026 10:27

fix(retrieval-anon): wrap-around sampleKey subgraph query

d098e37

fix(retrieval-anon): handle aborted signal in car response

70686e1

fix(retrieval-anon): correctly interpret operator-driven cancellation

007cfd7

docs(anon-retrievals): incorporate pr feedback

9611e6c

refactor(anon-retrieval): decouple CAR parse / IPNI verify / Block Fetch

aa19374

dennis-tra added 2 commits May 27, 2026 12:49

docs: claude review

24ef113

Merge branch 'main' into anon-retrieval

afbf4ea

dennis-tra force-pushed the anon-retrieval branch from c6116f6 to afbf4ea May 27, 2026 11:13

docs: claude consolidation

9099370

dennis-tra added 3 commits May 27, 2026 13:22

refactor: put newest anon retrieval table last

8d4c1f8

fix(retrieval-anon): mark job as failed if no piece was found

2066a31

fix: job service dep indices

24c9f55

BigLep moved this from ⌨️ In Progress to 🔎 Awaiting review in FOC May 27, 2026

Merge branch 'main' into anon-retrieval

74a6c6d

BigLep approved these changes May 29, 2026

Comment thread docs/checks/anon-retrievals.md Outdated

Comment thread docs/checks/anon-retrievals.md Outdated

Comment thread docs/checks/anon-retrievals.md Outdated

Update docs/checks/anon-retrievals.md

4714bf8

Co-authored-by: Steve Loeppky <stvn@loeppky.com>

This was referenced May 29, 2026

feat(subgraph): track payment-active state for cheaper anon-retrieval queries #582

Draft

refactor(retrieval-anon): filter expired PDP payments at the subgraph probe-lab/dealbot#12

Draft

dennis-tra added 2 commits May 29, 2026 09:40

refactor(retrieval-anon): align check statuses to success/failure.* convention

b3d7731

docs: clarify currently unused probe location field

4666e0c

BigLep requested a review from SgtPooki May 29, 2026 16:19
