feat: anon piece selection and retrieval#487
Conversation
e40d010 to
444a79b
Compare
|
To be sample/sampled better conveys what is happening. |
| first_byte_ms Nullable(Float64), -- time to first response byte | ||
| last_byte_ms Nullable(Float64), -- time to last response byte | ||
| bytes_retrieved Nullable(UInt64), -- bytes received from /piece/{cid} | ||
| throughput_bps Nullable(UInt64), -- effective throughput, bytes per second |
There was a problem hiding this comment.
This is data that can be easily derived. Also is it (ttlb-ttfb)/bytes or simply ttlb/bytes?
There was a problem hiding this comment.
It's http response size / total time of the HTTP request
There was a problem hiding this comment.
yeah it could easily be derived, I agree
|
One example data point in my local clickhouse db: |
|
A discussion that just came up with @iand: This anon retrieval here persists data only in Clickhouse. AFAIU the basic retrieval persists data in postgres, exposes aggregated metrics via prometheus, and if the My initial understanding was that metrics data will exclusively live in Clickhouse while Postgres will handle job queues/orchestration/keeping deal state/etc. To be consistent with the basic retrieval flow, I'll change this PR to store data primarily in Postgres and on the write path, if |
|
new Clickhouse row: new Postgres row: |
idk if we need to store retrieval++ data in postgres, we should probably just store in prometheus + clickhouse |
|
I'm fine either way. I can easily revert the latest commit. Just note that currently for the basic retrieval we store the same data in both db systems and one could consider it inconsistent if we did it differently here. @iand please chime in here. |
There was a problem hiding this comment.
Pull request overview
Adds a new “anonymous retrieval” check flow to the backend, enabling scheduled sampling of non-dealbot pieces via the subgraph, retrieval via /piece/{cid}, optional CAR/IPNI validation, and persistence of results/metrics.
Changes:
- Introduces
retrieval-anonmodule/services (piece selection, retrieval + CommP, CAR/IPNI validation) and wires it into pg-boss scheduling. - Replaces the PDP-subgraph client with a unified
SubgraphServiceand renames env config toSUBGRAPH_ENDPOINT. - Adds new Prometheus metrics and a ClickHouse table (
anon_retrieval_checks) plus Postgres schema/entity for anon retrieval results; adjusts HTTP/2 timeout/partial-download behavior.
Reviewed changes
Copilot reviewed 44 out of 47 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| pnpm-lock.yaml | Dependency lock updates (oclif/minimatch patch bumps). |
| kustomize/overlays/local/backend-configmap-local.yaml | Renames local env var to SUBGRAPH_ENDPOINT. |
| docs/environment-variables.md | Documents SUBGRAPH_ENDPOINT + anon retrieval job timeout. |
| docs/checks/production-configuration-and-approval-methodology.md | Updates production config reference to SUBGRAPH_ENDPOINT. |
| docs/checks/data-retention.md | Updates data-retention docs to reference SubgraphService/SUBGRAPH_ENDPOINT. |
| apps/backend/src/wallet-sdk/wallet-sdk.service.spec.ts | Updates config shape to subgraphEndpoint. |
| apps/backend/src/subgraph/types.ts | Adds anon piece sampling types + CID decoding + response validator. |
| apps/backend/src/subgraph/types.spec.ts | Adds unit tests for subgraph response validators. |
| apps/backend/src/subgraph/subgraph.service.ts | Renames/extends subgraph client; adds sampleAnonPiece() and generic query helper. |
| apps/backend/src/subgraph/subgraph.service.spec.ts | Updates tests and adds coverage for sampleAnonPiece(). |
| apps/backend/src/subgraph/subgraph.module.ts | New Nest module exporting SubgraphService. |
| apps/backend/src/subgraph/queries.ts | New subgraph query definitions incl. anon sampling query builder. |
| apps/backend/src/retrieval-anon/types.ts | New anon retrieval domain result types. |
| apps/backend/src/retrieval-anon/retrieval-anon.module.ts | New Nest module wiring anon retrieval dependencies. |
| apps/backend/src/retrieval-anon/piece-retrieval.service.ts | Implements /piece/{cid} download + CommP validation. |
| apps/backend/src/retrieval-anon/car-validation.service.ts | Implements CAR parsing, IPNI verification, sampled block fetch+hash verification. |
| apps/backend/src/retrieval-anon/anon-retrieval.service.ts | Orchestrates anon selection → retrieval → validation → persistence + metrics. |
| apps/backend/src/retrieval-anon/anon-retrieval.service.spec.ts | Adds unit tests for persistence/metrics behavior (including abort/partial). |
| apps/backend/src/retrieval-anon/anon-piece-selector.service.ts | Implements bucketed/pool sampling + dedup + fallback strategy. |
| apps/backend/src/retrieval-anon/anon-piece-selector.service.spec.ts | Adds unit tests for sampling/fallback/dedup/termination behavior. |
| apps/backend/src/pdp-subgraph/queries.ts | Removes legacy PDP-subgraph queries (migrated to subgraph/). |
| apps/backend/src/pdp-subgraph/pdp-subgraph.module.ts | Removes legacy PDP subgraph module (replaced by SubgraphModule). |
| apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts | Registers new anon retrieval Prometheus metrics + provider. |
| apps/backend/src/metrics-prometheus/check-metrics.service.ts | Adds AnonRetrievalCheckMetrics helper class. |
| apps/backend/src/metrics-prometheus/check-metric-labels.ts | Adds anon_retrieval to the CheckType union. |
| apps/backend/src/jobs/jobs.service.ts | Adds retrieval_anon job type, queue, scheduler, and timeout handling. |
| apps/backend/src/jobs/jobs.service.spec.ts | Updates tests for new dependency ordering and new schedule rows. |
| apps/backend/src/jobs/jobs.module.ts | Imports RetrievalAnonModule for job execution. |
| apps/backend/src/jobs/job-queues.ts | Adds RETRIEVAL_ANON_QUEUE. |
| apps/backend/src/ipni/ipni-verification.service.ts | Changes IPNI verification to per-CID checks with per-CID failure tracking/counts. |
| apps/backend/src/http-client/types.ts | Extends request result with aborted/abortReason for partial HTTP/2 downloads. |
| apps/backend/src/http-client/http-client.service.ts | Reworks HTTP/2 timeout handling and returns partial bytes+metrics on abort mid-download. |
| apps/backend/src/http-client/http-client.service.spec.ts | Adds tests for headersTimeout mapping, signal behavior, partial-download returns, and rethrowing non-abort errors. |
| apps/backend/src/database/types.ts | Adds PieceFetchStatus and IpniCheckStatus enums for anon retrievals. |
| apps/backend/src/database/migrations/1776300000000-CreateAnonRetrievals.ts | Adds Postgres schema (table + enums + indexes) for anon retrievals. |
| apps/backend/src/database/entities/job-schedule-state.entity.ts | Adds retrieval_anon to scheduled job type union. |
| apps/backend/src/database/entities/anon-retrieval.entity.ts | Adds TypeORM entity mapping for anon_retrievals. |
| apps/backend/src/database/database.module.ts | Registers AnonRetrieval entity and exports it for injection. |
| apps/backend/src/data-retention/data-retention.service.ts | Switches data retention polling to SubgraphService and subgraphEndpoint. |
| apps/backend/src/data-retention/data-retention.service.spec.ts | Updates tests to mock SubgraphService and renamed config field. |
| apps/backend/src/data-retention/data-retention.module.ts | Switches module import from legacy PDP subgraph module to SubgraphModule. |
| apps/backend/src/config/app.config.ts | Renames env var to SUBGRAPH_ENDPOINT; adds anon retrieval rates/timeouts/block-sample config; derives HTTP timeouts from max job timeout. |
| apps/backend/src/clickhouse/clickhouse.schema.ts | Adds ClickHouse anon_retrieval_checks table schema. |
| apps/backend/src/app.module.ts | Imports RetrievalAnonModule. |
| apps/backend/README.md | Updates env var docs to SUBGRAPH_ENDPOINT. |
| apps/backend/.env.example | Renames env var, adds anon retrieval env vars, updates timeout guidance. |
| .gitignore | Ignores .tool-versions. |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
we have been slowly attempting to move away from storing data in postgres, instead leaning on prometheus/clickhouse + logs, so not adding another dependency on the database would be ideal. If it's necessary, than I don't want to push back too hard, but it requires us to manage cleaning up the DB and other various concerns that are more easily handled with prom/clickhouse/betterstack expiry config. Is there a reason we need to store it in postgres table instead of just piping output to our metrics/logs services? |
This is my basic point. Clickhouse is an optional component so without storing in postgres there is no record of the retrieval. |
This is a valid point too. It's really boils down to what the purpose of postgres is, whether it is to simply record working state or if it's to provide any diagnostic data. |
It's not necessary. Everything worked fine before the last commit 👍 Then Clickhouse becomes a hard dependency (which I think is fine for the same reason as you enumerated @SgtPooki ) but IIUC this was the point to pause for @iand. So what do we do? Revert to CH only? |
|
Wow - good stuff! A few things from driving by. Let me know if it's key that I look more closely:
|
Yea, we should pump to clickhouse and prometheus. so i think the going consensus is: we should not be adding info to postgres unless it's required to read/infer state. we do need that for deals/data-storage checks, we dont need it for retrievals but it's still there because we haven't removed it. we can eventually migrate to no data-storage/deal table in the future, but that would require a lot more RPC calls and chain walking to figure out what exists and where we want to upload things to. this isn't priority, but keeping new out of postgres where possible is (a priority). |
SgtPooki
left a comment
There was a problem hiding this comment.
reviewed some and have a few high level questions:
- why change PDP_SUBGRAPH_ENDPOINT env var? seems unnecessary and will cause issues with our deployed version, and potentially old stale code as well
- why move subgraph code to /subgraph instead of just leaving in /pdp-subgraph? -- this is a large PR and keeping to existing norms rather than overwriting things would be ideal
- changing how we do ipni verification can have drastic impact on current metrics.. switching to serial individual CID checks could cause already slow IPNI verification for some SPs to start fully failing.
SgtPooki
left a comment
There was a problem hiding this comment.
two last things besides the one we discussed on slack!
BigLep
left a comment
There was a problem hiding this comment.
I have a few comments here. The only reason I didn't give a full approval was because I thought it would be good to confirm things look good after merge and there is the feedback item on why we're deviating from how "data storage check" does sub-statuses if we ware going to use "skipped" then use it throughout all the statuses (including car parsing).
|
@BigLep your comments regarding the status lead me down to restructure a few things (commit aa19374) because you were totally right regarding some skipped vs. error status remarks.
For reference here are all new sub-status meanings
|
c6116f6 to
afbf4ea
Compare
|
The typecheck CI failure also happens on I wanted to test and run the latest changes but nestjs is refusing to start due to the type errors: Output: |
looks like you got this fixed.. lmk if you need further help there. |
BigLep
left a comment
There was a problem hiding this comment.
Arg, apologies for the delay. I didn't realize I hadn't finished htis off.
No blocking comments from me, although looking at this again with eyes that spent too long today looking at our metric documentation, I was wonder if we want to be introducing other terms for ipni validation rather than success and failure. Your call.
Co-authored-by: Steve Loeppky <stvn@loeppky.com>
|
Things are good from a docs regard to me. Looks like just need code signoff from @SgtPooki (who I know is juggling other things for GA). |
|
@SgtPooki : are you planning to look at this more, or is this good to merge? |
Hi folks,
this PR adds the anon retrieval flow from #427. It is a follow-up of:
The main logic is in
./apps/backend/retrieval-anon:anon-retrieval.service.ts- "called" on a schedule as a job. It starts the retrieval process (select piece, fetch piece, validate car, store result)anon-piece-selector.service.ts- implements the subgraph query logiccar-validation.service.ts- parses the piece bytes as a CAR, checks IPNI availability, fetcheskblocks from that car and validates their hashespiece-retreival.service.ts- implements the HTTP request to download the piece and CommP validationThe anonymous piece selection logic works as follows:
The
retrievalAnoncheck probes an SP for non-dealbot pieces so we can detect SPs that behave well even if the teacher is not watching. To do this fairly, the piece selection should satisfy the following requirements:withIPFSIndexingpieces (so CAR/IPNI validation has something to check) but still exercise non-indexed pieces so an SP can't optimise only its CAR corpus.How it works in practice:
Every Root entity in the subgraph carries a
sampleKey = keccak256(setId-rootId)populated once at insert time. Becausekeccak256is uniform over 256 bits and independent of creation order/size/dataset,sampleKeysorts roots into a uniform random permutation that is stable across queries.This is necessary because you cannot just select a random element from a range query in GraphQL. If we knew the total number of pieces we could define a random
skipvalue but this is also capped at 5000. I've read that it becomes very inefficient at higher values. This would also require a non-trivial bookkeeping of active pieces/datasets counts. ThesampleKeyis much easier.Drawing a sample looks like this:
withIPFSIndexing: truewith probability 80%; otherwise no filter.$sampleKeyand query:$sampleKeywhich is effectively a uniform random pick, in O(log N).pdpPaymentEndEpochhas already passed the latest indexed block, or if its CID appears in the last 500 anonymous retrievals (so we don't sample the same block twice in fast succession). On a miss, redraws once with a fresh$sampleKey.Subgraph
Note
The deployed subgraphs don't contain the latest changes from the recent PR review. They should still work for testing.
I have deployed the new subgraphs:
mainnet: https://api.goldsky.com/api/public/project_cmo9sxe5xd4ai01x8cpageyid/subgraphs/dealbot-mainnet/0.3.0/gncalibration: https://api.goldsky.com/api/public/project_cmo9sxe5xd4ai01x8cpageyid/subgraphs/dealbot-calibration/0.3.0/gnA deployment looks like this from within the
subgraphfolder (prerequisite is a call togoldsky login):Comments
anonor rather something likesampled?