From 1308ab39565ec3762c2e78770e67eaf24cb3ad0a Mon Sep 17 00:00:00 2001 From: silent-cipher Date: Fri, 24 Apr 2026 21:24:41 +0530 Subject: [PATCH 1/2] docs: clarify data retention metrics units and calculation details --- docs/checks/data-retention.md | 52 ++++++++++++++++++++----------- docs/checks/events-and-metrics.md | 4 +-- 2 files changed, 35 insertions(+), 21 deletions(-) diff --git a/docs/checks/data-retention.md b/docs/checks/data-retention.md index 804190c6..4bce77bc 100644 --- a/docs/checks/data-retention.md +++ b/docs/checks/data-retention.md @@ -14,9 +14,11 @@ Every data retention check cycle, dealbot: 1. Queries the [PDP subgraph](https://docs.filecoin.io/smart-contracts/advanced/proof-of-data-possession) for provider-level challenge statistics 2. Computes confirmed successful proving periods from the subgraph totals with estimated overdue periods for real-time monitoring -3. Calculates deltas since the last poll +3. Calculates proving-period deltas since the last poll and converts them to challenge counts 4. Records metrics to track provider reliability over time +**Provider selection**: Only providers returned by `WalletSdkService.getTestingProviders()` are polled, minus any matching the `spBlocklists` configuration (via `isSpBlocked`). + ## How It Works ### 1. Query PDP Subgraph @@ -56,10 +58,11 @@ Dealbot uses the subgraph-confirmed totals directly for cumulative counters: confirmedTotalSuccess = totalProvingPeriods - totalFaultedPeriods ``` -Additionally, dealbot calculates **estimated overdue periods** for real-time monitoring via a separate gauge metric. For each proof set where the deadline has passed (`nextDeadline < currentBlock`): +Additionally, dealbot calculates **estimated overdue periods** for real-time monitoring via a separate gauge metric. The value is the **sum across all of the provider's overdue proof sets** (those where `nextDeadline < currentBlock`); proof sets with `maxProvingPeriod === 0` are skipped: ``` -estimatedOverduePeriods = (currentBlock - (nextDeadline + 1)) / maxProvingPeriod +estimatedOverduePeriods = sum over overdue proofSets of: + (currentBlock - (nextDeadline + 1)) / maxProvingPeriod ``` This gauge provides immediate visibility into providers that are behind on submitting proofs, even before the subgraph confirms the faults. The gauge naturally resets to 0 when providers submit their proofs and the subgraph catches up. @@ -68,16 +71,18 @@ This gauge provides immediate visibility into providers that are behind on submi ### 3. Calculate Deltas -To avoid double-counting, dealbot maintains a baseline of cumulative totals for each provider. On each poll, it computes the delta (change) since the last poll: +To avoid double-counting, dealbot maintains a baseline of cumulative **proving-period** totals for each provider. On each poll, it computes the period delta since the last poll and converts it to a challenge count using a fixed multiplier (`CHALLENGES_PER_PROVING_PERIOD = 5`, sourced from the `FilecoinWarmStorageService` contract): ``` -faultedDelta = totalFaultedPeriods - previousTotalFaulted -successDelta = confirmedTotalSuccess - previousTotalSuccess +faultedChallengesDelta = (totalFaultedPeriods - previousTotalFaulted) * 5 +successChallengesDelta = (confirmedTotalSuccess - previousTotalSuccess) * 5 ``` +Baselines are stored and compared in **periods**; the `dataSetChallengeStatus` counter is incremented in **challenges**. + **First-seen provider handling**: When a provider has no prior baseline (fresh deploy or newly added provider), dealbot initializes the baseline to the current cumulative totals **without emitting any counters**. This prevents dumping the provider's full cumulative history as a single metric spike. Metrics for that provider will begin accumulating from the next poll onward. -**Negative delta handling**: If deltas are negative (due to chain reorgs, subgraph corrections, or data inconsistencies), the baseline is reset to current values without incrementing counters. This prevents stalled metrics. +**Negative delta handling**: If either challenge delta is negative (due to chain reorgs, subgraph corrections, or data inconsistencies), the baseline is reset to current values without incrementing counters. This prevents stalled metrics. **Baseline persistence**: Baselines are persisted to the `data_retention_baselines` database table after each successful poll. On service restart, baselines are reloaded from the database to prevent metric inflation. @@ -172,22 +177,31 @@ Source: [`pdp-subgraph.service.ts` (`enforceRateLimit`)](../../apps/backend/src/ See [`dataSetChallengeStatus`](./events-and-metrics.md#dataSetChallengeStatus) for more info. +**Unit**: challenges (period delta × `CHALLENGES_PER_PROVING_PERIOD = 5`). + +**`value` label**: + +- `success` — challenges in successfully-proven periods (`totalProvingPeriods - totalFaultedPeriods`) +- `failure` — challenges in faulted periods (`totalFaultedPeriods`) + **Increment behavior**: -- Only increments when positive deltas are detected -- Increments by the delta amount (not always 1) -- Handles large values (>MAX_SAFE_INTEGER) via chunked increments +- Only increments when the challenge delta is strictly positive +- Increments by the full challenge delta (not always 1) +- For deltas exceeding `Number.MAX_SAFE_INTEGER`, `safeIncrementCounter` splits the increment into `MAX_SAFE_INTEGER`-sized chunks to preserve precision + +### Gauge: `pdp_provider_estimated_overdue_periods` -### Gauge: `pdp_provider_overdue_periods` +See [`pdp_provider_estimated_overdue_periods`](./events-and-metrics.md#pdp_provider_estimated_overdue_periods) for more info. -See [`pdp_provider_overdue_periods`](./events-and-metrics.md#pdp_provider_overdue_periods) for more info. +**Unit**: proving periods (sum across the provider's overdue proof sets). **Emission behavior**: -- Emitted on every poll, independent of counter deltas +- Emitted on every poll for every processed provider, independent of counter deltas and independent of baseline state (emitted even on first-seen providers) - Reflects estimated unrecorded overdue proving periods in real-time - Naturally resets to 0 when providers submit proofs and the subgraph catches up -- Handles large values (>MAX_SAFE_INTEGER) via chunked `.inc()` calls +- For values exceeding `Number.MAX_SAFE_INTEGER`, `safeSetGauge` **clamps** the gauge to `Number.MAX_SAFE_INTEGER` and logs an `overdue_periods_overflow` warning (it does **not** chunk) ## Configuration @@ -300,20 +314,20 @@ Poll 1 (fresh start, no DB baseline): Poll 2: Subgraph: faulted=1005, success=9005 - Memory baseline: 1000, 9000 → Delta: 5, 5 - Emit: +5 faulted, +5 success + Memory baseline: 1000, 9000 → Period delta: 5, 5 (× 5 challenges/period) + Emit: +25 faulted challenges, +25 success challenges --- SERVICE RESTARTS --- Poll 3 (after restart): Subgraph: faulted=1005, success=9005 - DB baseline: 1005, 9005 (loaded) → Delta: 0, 0 + DB baseline: 1005, 9005 (loaded) → Period delta: 0, 0 Emit: nothing (no new challenges) Poll 4: Subgraph: faulted=1008, success=9012 - Memory baseline: 1005, 9005 → Delta: 3, 7 - Emit: +3 faulted, +7 success + Memory baseline: 1005, 9005 → Period delta: 3, 7 + Emit: +15 faulted challenges, +35 success challenges ``` If the database is unavailable on startup, the poll is aborted to prevent emitting inflated values. The service will retry on the next scheduled poll. diff --git a/docs/checks/events-and-metrics.md b/docs/checks/events-and-metrics.md index 8ededa34..05153fe3 100644 --- a/docs/checks/events-and-metrics.md +++ b/docs/checks/events-and-metrics.md @@ -104,5 +104,5 @@ sequenceDiagram | `ipfsRetrievalHttpResponseCode` | Data Storage, Retrieval | [`ipfsRetrievalLastByteReceived`](#ipfsRetrievalLastByteReceived) | `200`, `500`, `2xxSuccess`, `4xxClientError`, `5xxServerError`, `otherHttpStatusCodes`, `failure` | [`retrieval.service.ts`](../../apps/backend/src/retrieval/retrieval.service.ts) | | `retrievalStatus` | Data Storage, Retrieval | [`ipfsRetrievalIntegrityChecked`](#ipfsRetrievalIntegrityChecked) | `success`, `failure.timedout`, `failure.other` from [Data Storage Sub-status meanings](./data-storage.md#sub-status-meanings). | | | `dataSetCreationStatus` | Data-Set Creation | Not tied to an [event above](#event-list) but rather to data-set creation start (`pending`) and completion (`success`/`failure.*`) | `pending`, `success`, `failure.timedout`, `failure.other` | [`deal.service.ts`](../../apps/backend/src/deal/deal.service.ts) | -| `dataSetChallengeStatus` | Data Retention | Not tied to an [event above](#event-list) but rather to the periodic chain-checking done in the [Data Retention Check](./data-retention.md) | `success`, `failure` | [`data-retention.service.ts`](../../apps/backend/src/data-retention/data-retention.service.ts) | -| `pdp_provider_overdue_periods` | Data Retention | Emitted on every poll | Gauge value (estimated overdue periods) | [`data-retention.service.ts`](../../apps/backend/src/data-retention/data-retention.service.ts) | +| `dataSetChallengeStatus` | Data Retention | Emitted on each [Data Retention Check](./data-retention.md)poll when a provider's confirmed proving-period totals advance (strictly positive deltas). Unit: **challenges** (period delta × `CHALLENGES_PER_PROVING_PERIOD = 5`). | `success` (challenges in successfully-proven periods), `failure` (challenges in faulted periods) | [`data-retention.service.ts`](../../apps/backend/src/data-retention/data-retention.service.ts) | +| `pdp_provider_estimated_overdue_periods` | Data Retention | Emitted on every [Data Retention Check](./data-retention.md) poll for every successfully processed provider. | Gauge value in proving periods (non-negative integer) | [`data-retention.service.ts`](../../apps/backend/src/data-retention/data-retention.service.ts) | From 63dd6b3cf7a6ffd2b23a8afc39c347fd9ccd3280 Mon Sep 17 00:00:00 2001 From: silent-cipher Date: Fri, 24 Apr 2026 21:30:09 +0530 Subject: [PATCH 2/2] chore: missing space --- docs/checks/events-and-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/checks/events-and-metrics.md b/docs/checks/events-and-metrics.md index 05153fe3..a31cad06 100644 --- a/docs/checks/events-and-metrics.md +++ b/docs/checks/events-and-metrics.md @@ -104,5 +104,5 @@ sequenceDiagram | `ipfsRetrievalHttpResponseCode` | Data Storage, Retrieval | [`ipfsRetrievalLastByteReceived`](#ipfsRetrievalLastByteReceived) | `200`, `500`, `2xxSuccess`, `4xxClientError`, `5xxServerError`, `otherHttpStatusCodes`, `failure` | [`retrieval.service.ts`](../../apps/backend/src/retrieval/retrieval.service.ts) | | `retrievalStatus` | Data Storage, Retrieval | [`ipfsRetrievalIntegrityChecked`](#ipfsRetrievalIntegrityChecked) | `success`, `failure.timedout`, `failure.other` from [Data Storage Sub-status meanings](./data-storage.md#sub-status-meanings). | | | `dataSetCreationStatus` | Data-Set Creation | Not tied to an [event above](#event-list) but rather to data-set creation start (`pending`) and completion (`success`/`failure.*`) | `pending`, `success`, `failure.timedout`, `failure.other` | [`deal.service.ts`](../../apps/backend/src/deal/deal.service.ts) | -| `dataSetChallengeStatus` | Data Retention | Emitted on each [Data Retention Check](./data-retention.md)poll when a provider's confirmed proving-period totals advance (strictly positive deltas). Unit: **challenges** (period delta × `CHALLENGES_PER_PROVING_PERIOD = 5`). | `success` (challenges in successfully-proven periods), `failure` (challenges in faulted periods) | [`data-retention.service.ts`](../../apps/backend/src/data-retention/data-retention.service.ts) | +| `dataSetChallengeStatus` | Data Retention | Emitted on each [Data Retention Check](./data-retention.md) poll when a provider's confirmed proving-period totals advance (strictly positive deltas). Unit: **challenges** (period delta × `CHALLENGES_PER_PROVING_PERIOD = 5`). | `success` (challenges in successfully-proven periods), `failure` (challenges in faulted periods) | [`data-retention.service.ts`](../../apps/backend/src/data-retention/data-retention.service.ts) | | `pdp_provider_estimated_overdue_periods` | Data Retention | Emitted on every [Data Retention Check](./data-retention.md) poll for every successfully processed provider. | Gauge value in proving periods (non-negative integer) | [`data-retention.service.ts`](../../apps/backend/src/data-retention/data-retention.service.ts) |