diff --git a/docs/checks/data-retention.md b/docs/checks/data-retention.md
index 804190c6..4bce77bc 100644
--- a/docs/checks/data-retention.md
+++ b/docs/checks/data-retention.md
@@ -14,9 +14,11 @@ Every data retention check cycle, dealbot:
1. Queries the [PDP subgraph](https://docs.filecoin.io/smart-contracts/advanced/proof-of-data-possession) for provider-level challenge statistics
2. Computes confirmed successful proving periods from the subgraph totals with estimated overdue periods for real-time monitoring
-3. Calculates deltas since the last poll
+3. Calculates proving-period deltas since the last poll and converts them to challenge counts
4. Records metrics to track provider reliability over time
+**Provider selection**: Only providers returned by `WalletSdkService.getTestingProviders()` are polled, minus any matching the `spBlocklists` configuration (via `isSpBlocked`).
+
## How It Works
### 1. Query PDP Subgraph
@@ -56,10 +58,11 @@ Dealbot uses the subgraph-confirmed totals directly for cumulative counters:
confirmedTotalSuccess = totalProvingPeriods - totalFaultedPeriods
```
-Additionally, dealbot calculates **estimated overdue periods** for real-time monitoring via a separate gauge metric. For each proof set where the deadline has passed (`nextDeadline < currentBlock`):
+Additionally, dealbot calculates **estimated overdue periods** for real-time monitoring via a separate gauge metric. The value is the **sum across all of the provider's overdue proof sets** (those where `nextDeadline < currentBlock`); proof sets with `maxProvingPeriod === 0` are skipped:
```
-estimatedOverduePeriods = (currentBlock - (nextDeadline + 1)) / maxProvingPeriod
+estimatedOverduePeriods = sum over overdue proofSets of:
+ (currentBlock - (nextDeadline + 1)) / maxProvingPeriod
```
This gauge provides immediate visibility into providers that are behind on submitting proofs, even before the subgraph confirms the faults. The gauge naturally resets to 0 when providers submit their proofs and the subgraph catches up.
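The summation described above can be sketched as follows. This is a minimal illustration, not the service's actual code: the `ProofSet` shape and the `estimateOverduePeriods` name are hypothetical, and truncating `bigint` division is an assumption about how fractional periods are handled.

```typescript
// Hypothetical shape of one proof set; field names mirror the formula above.
interface ProofSet {
  nextDeadline: bigint;
  maxProvingPeriod: bigint;
}

// Sums the estimated overdue periods across a provider's overdue proof sets.
function estimateOverduePeriods(proofSets: ProofSet[], currentBlock: bigint): bigint {
  let total = 0n;
  for (const proofSet of proofSets) {
    // Skip proof sets with no proving period to avoid dividing by zero.
    if (proofSet.maxProvingPeriod === 0n) continue;
    // Only proof sets whose deadline has already passed are overdue.
    if (proofSet.nextDeadline >= currentBlock) continue;
    // bigint division truncates toward zero (assumed rounding behavior).
    total += (currentBlock - (proofSet.nextDeadline + 1n)) / proofSet.maxProvingPeriod;
  }
  return total;
}
```

With `currentBlock = 131n`, a proof set with `nextDeadline = 100n` and `maxProvingPeriod = 10n` contributes 3 periods under these assumptions, while one whose deadline is still in the future contributes nothing.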
@@ -68,16 +71,18 @@ This gauge provides immediate visibility into providers that are behind on submi
### 3. Calculate Deltas
-To avoid double-counting, dealbot maintains a baseline of cumulative totals for each provider. On each poll, it computes the delta (change) since the last poll:
+To avoid double-counting, dealbot maintains a baseline of cumulative **proving-period** totals for each provider. On each poll, it computes the period delta since the last poll and converts it to a challenge count using a fixed multiplier (`CHALLENGES_PER_PROVING_PERIOD = 5`, sourced from the `FilecoinWarmStorageService` contract):
```
-faultedDelta = totalFaultedPeriods - previousTotalFaulted
-successDelta = confirmedTotalSuccess - previousTotalSuccess
+faultedChallengesDelta = (totalFaultedPeriods - previousTotalFaulted) * 5
+successChallengesDelta = (confirmedTotalSuccess - previousTotalSuccess) * 5
```
+Baselines are stored and compared in **periods**; the `dataSetChallengeStatus` counter is incremented in **challenges**.
+
**First-seen provider handling**: When a provider has no prior baseline (fresh deploy or newly added provider), dealbot initializes the baseline to the current cumulative totals **without emitting any counters**. This prevents dumping the provider's full cumulative history as a single metric spike. Metrics for that provider will begin accumulating from the next poll onward.
-**Negative delta handling**: If deltas are negative (due to chain reorgs, subgraph corrections, or data inconsistencies), the baseline is reset to current values without incrementing counters. This prevents stalled metrics.
+**Negative delta handling**: If either challenge delta is negative (due to chain reorgs, subgraph corrections, or data inconsistencies), the baseline is reset to current values without incrementing counters. This prevents stalled metrics.
**Baseline persistence**: Baselines are persisted to the `data_retention_baselines` database table after each successful poll. On service restart, baselines are reloaded from the database to prevent metric inflation.
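The delta rules above (period baselines, challenge-unit deltas, first-seen and negative-delta handling) can be sketched as a single pure function. The `PeriodTotals`, `PollResult`, and `processPoll` names are hypothetical and chosen for illustration only; database persistence is omitted.

```typescript
// Fixed multiplier quoted in this document.
const CHALLENGES_PER_PROVING_PERIOD = 5n;

// Hypothetical baseline shape: cumulative totals stored in proving periods.
interface PeriodTotals {
  faulted: bigint;
  success: bigint;
}

// Hypothetical result: the new baseline plus the challenge counts to emit.
interface PollResult {
  baseline: PeriodTotals;
  faultedChallengesDelta: bigint;
  successChallengesDelta: bigint;
}

function processPoll(
  previous: PeriodTotals | undefined,
  totalProvingPeriods: bigint,
  totalFaultedPeriods: bigint,
): PollResult {
  const current: PeriodTotals = {
    faulted: totalFaultedPeriods,
    success: totalProvingPeriods - totalFaultedPeriods,
  };
  const resetOnly: PollResult = {
    baseline: current,
    faultedChallengesDelta: 0n,
    successChallengesDelta: 0n,
  };

  // First-seen provider: record the baseline, emit nothing.
  if (previous === undefined) return resetOnly;

  const faultedChallengesDelta =
    (current.faulted - previous.faulted) * CHALLENGES_PER_PROVING_PERIOD;
  const successChallengesDelta =
    (current.success - previous.success) * CHALLENGES_PER_PROVING_PERIOD;

  // Negative delta on either side: reset the baseline, emit nothing.
  if (faultedChallengesDelta < 0n || successChallengesDelta < 0n) return resetOnly;

  return { baseline: current, faultedChallengesDelta, successChallengesDelta };
}
```

Keeping the baseline in periods and multiplying only at emission time means a change to the multiplier would not invalidate stored baselines.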
@@ -172,22 +177,31 @@ Source: [`pdp-subgraph.service.ts` (`enforceRateLimit`)](../../apps/backend/src/
See [`dataSetChallengeStatus`](./events-and-metrics.md#dataSetChallengeStatus) for more info.
+**Unit**: challenges (period delta × `CHALLENGES_PER_PROVING_PERIOD = 5`).
+
+**`value` label**:
+
+- `success` — challenges in successfully-proven periods (`totalProvingPeriods - totalFaultedPeriods`)
+- `failure` — challenges in faulted periods (`totalFaultedPeriods`)
+
**Increment behavior**:
-- Only increments when positive deltas are detected
-- Increments by the delta amount (not always 1)
-- Handles large values (>MAX_SAFE_INTEGER) via chunked increments
+- Only increments when the challenge delta is strictly positive
+- Increments by the full challenge delta (not always 1)
+- For deltas exceeding `Number.MAX_SAFE_INTEGER`, `safeIncrementCounter` splits the increment into `MAX_SAFE_INTEGER`-sized chunks to preserve precision
+
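The chunked increment described above could look roughly like the sketch below. It is not the actual `safeIncrementCounter`: the `CounterLike` interface and the `incrementInChunks` name are hypothetical stand-ins, and the real helper's signature may differ.

```typescript
// Hypothetical minimal counter interface (a metrics counter that accepts a number).
interface CounterLike {
  inc(value: number): void;
}

const MAX_CHUNK = BigInt(Number.MAX_SAFE_INTEGER);

// Increments a counter by a bigint delta without losing precision.
function incrementInChunks(counter: CounterLike, delta: bigint): void {
  // Only strictly positive deltas are emitted.
  if (delta <= 0n) return;

  let remaining = delta;
  // Emit full MAX_SAFE_INTEGER-sized chunks while the remainder is too large
  // to convert to a number exactly.
  while (remaining > MAX_CHUNK) {
    counter.inc(Number.MAX_SAFE_INTEGER);
    remaining -= MAX_CHUNK;
  }
  // The remainder now fits in a safe integer.
  counter.inc(Number(remaining));
}
```

For typical deltas this results in a single `inc` call; the loop only runs for values that could not be represented exactly as a JavaScript number.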
+### Gauge: `pdp_provider_estimated_overdue_periods`
-### Gauge: `pdp_provider_overdue_periods`
+See [`pdp_provider_estimated_overdue_periods`](./events-and-metrics.md#pdp_provider_estimated_overdue_periods) for more info.
-See [`pdp_provider_overdue_periods`](./events-and-metrics.md#pdp_provider_overdue_periods) for more info.
+**Unit**: proving periods (sum across the provider's overdue proof sets).
**Emission behavior**:
-- Emitted on every poll, independent of counter deltas
+- Emitted on every poll for every processed provider, regardless of counter deltas or baseline state (including first-seen providers)
- Reflects estimated unrecorded overdue proving periods in real-time
- Naturally resets to 0 when providers submit proofs and the subgraph catches up
-- Handles large values (>MAX_SAFE_INTEGER) via chunked `.inc()` calls
+- For values exceeding `Number.MAX_SAFE_INTEGER`, `safeSetGauge` **clamps** the gauge to `Number.MAX_SAFE_INTEGER` and logs an `overdue_periods_overflow` warning (it does **not** chunk)
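The clamping behavior described for the gauge can be sketched as below. This is an illustration rather than the actual `safeSetGauge`: the `GaugeLike` interface, the `setGaugeClamped` name, and the `warn` callback are hypothetical, and only the `overdue_periods_overflow` event name comes from this document.

```typescript
// Hypothetical minimal gauge interface (a metrics gauge that accepts a number).
interface GaugeLike {
  set(value: number): void;
}

// Sets a gauge from a bigint, clamping instead of chunking on overflow.
function setGaugeClamped(
  gauge: GaugeLike,
  value: bigint,
  warn: (event: string) => void,
): void {
  if (value > BigInt(Number.MAX_SAFE_INTEGER)) {
    // A gauge holds one value, so it cannot be split across calls like a counter.
    warn("overdue_periods_overflow");
    gauge.set(Number.MAX_SAFE_INTEGER);
    return;
  }
  gauge.set(Number(value));
}
```

Clamping suits a gauge because `set` overwrites the previous value, so repeated chunked calls would leave only the last chunk visible.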
## Configuration
@@ -300,20 +314,20 @@ Poll 1 (fresh start, no DB baseline):
Poll 2:
Subgraph: faulted=1005, success=9005
- Memory baseline: 1000, 9000 → Delta: 5, 5
- Emit: +5 faulted, +5 success
+ Memory baseline: 1000, 9000 → Period delta: 5, 5 (× 5 challenges/period)
+ Emit: +25 faulted challenges, +25 success challenges
--- SERVICE RESTARTS ---
Poll 3 (after restart):
Subgraph: faulted=1005, success=9005
- DB baseline: 1005, 9005 (loaded) → Delta: 0, 0
+ DB baseline: 1005, 9005 (loaded) → Period delta: 0, 0
Emit: nothing (no new challenges)
Poll 4:
Subgraph: faulted=1008, success=9012
- Memory baseline: 1005, 9005 → Delta: 3, 7
- Emit: +3 faulted, +7 success
+ Memory baseline: 1005, 9005 → Period delta: 3, 7
+ Emit: +15 faulted challenges, +35 success challenges
```
If the database is unavailable on startup, the poll is aborted to prevent emitting inflated values. The service will retry on the next scheduled poll.
diff --git a/docs/checks/events-and-metrics.md b/docs/checks/events-and-metrics.md
index 1fc0f2fe..45c5423e 100644
--- a/docs/checks/events-and-metrics.md
+++ b/docs/checks/events-and-metrics.md
@@ -104,8 +104,8 @@ sequenceDiagram
| `ipfsRetrievalHttpResponseCode` | Data Storage, Retrieval | [`ipfsRetrievalLastByteReceived`](#ipfsRetrievalLastByteReceived) | `200`, `500`, `2xxSuccess`, `4xxClientError`, `5xxServerError`, `otherHttpStatusCodes`, `failure` | [`retrieval.service.ts`](../../apps/backend/src/retrieval/retrieval.service.ts) |
| `retrievalStatus` | Data Storage, Retrieval | [`ipfsRetrievalIntegrityChecked`](#ipfsRetrievalIntegrityChecked) | `success`, `failure.timedout`, `failure.other` from [Data Storage Sub-status meanings](./data-storage.md#sub-status-meanings). | |
| `dataSetCreationStatus` | Data-Set Creation | Not tied to an [event above](#event-list) but rather to data-set creation start (`pending`) and completion (`success`/`failure.*`) | `pending`, `success`, `failure.timedout`, `failure.other` | [`deal.service.ts`](../../apps/backend/src/deal/deal.service.ts) |
-| `dataSetChallengeStatus` | Data Retention | Not tied to an [event above](#event-list) but rather to the periodic chain-checking done in the [Data Retention Check](./data-retention.md) | `success`, `failure` | [`data-retention.service.ts`](../../apps/backend/src/data-retention/data-retention.service.ts) |
-| `pdp_provider_overdue_periods` | Data Retention | Emitted on every poll | Gauge value (estimated overdue periods) | [`data-retention.service.ts`](../../apps/backend/src/data-retention/data-retention.service.ts) |
+| `dataSetChallengeStatus` | Data Retention | Emitted on each [Data Retention Check](./data-retention.md) poll when a provider's confirmed proving-period totals advance (strictly positive deltas). Unit: **challenges** (period delta × `CHALLENGES_PER_PROVING_PERIOD = 5`). | `success` (challenges in successfully-proven periods), `failure` (challenges in faulted periods) | [`data-retention.service.ts`](../../apps/backend/src/data-retention/data-retention.service.ts) |
+| `pdp_provider_estimated_overdue_periods` | Data Retention | Emitted on every [Data Retention Check](./data-retention.md) poll for every successfully processed provider. | Gauge value in proving periods (non-negative integer) | [`data-retention.service.ts`](../../apps/backend/src/data-retention/data-retention.service.ts) |
## ClickHouse Tables