From d71b72145cb583b6b7893ec75dee63b5c9e20ba1 Mon Sep 17 00:00:00 2001 From: silent-cipher Date: Mon, 1 Jun 2026 17:53:58 +0530 Subject: [PATCH 01/16] docs: add data-set-creation job design documentation --- docs/data-set-creation.md | 164 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 164 insertions(+) create mode 100644 docs/data-set-creation.md diff --git a/docs/data-set-creation.md b/docs/data-set-creation.md new file mode 100644 index 00000000..992ef0c1 --- /dev/null +++ b/docs/data-set-creation.md @@ -0,0 +1,164 @@ +# Data Set Creation Job + +This doc explains the design of the `data_set_creation` job: why it exists, how dealbot schedules it, how each run decides what to do, and how it interacts with on-chain dataset state. + +## Summary + +- `data_set_creation` was originally added so dealbot could maintain enough datasets per provider for data-retention sampling and FWSS approval evaluation. +- It now also serves as the repair path for terminated datasets that are still resolved by metadata but are no longer live. +- Operationally, it ensures each active storage provider has at least `MIN_NUM_DATASETS_FOR_CHECKS` live datasets available for checks. +- The scheduler creates one `job_schedule_state` row per `` when `MIN_NUM_DATASETS_FOR_CHECKS >= 1`. +- Each run inspects dataset slots in order and handles **at most one** non-live slot per invocation. +- Missing datasets are provisioned by uploading a minimal seed piece. +- Terminated datasets are repaired first; replacement is deferred to a later run. +- The job runs on the shared `sp.work` queue with the same per-SP singleton behavior as `deal`, `retrieval`, `piece_cleanup`, and `pull_check`. + +## Why it was added and why it still matters + +This job was originally added so dealbot could maintain enough datasets per provider for `data_retention` to accumulate enough samples to evaluate FWSS approval criteria. 
In practice, `MIN_NUM_DATASETS_FOR_CHECKS` increases the number of datasets for which each provider submits on-chain proofs, which generates enough samples for the data-retention check. + +That original motivation is different from the job's current operational role. Today, `data_set_creation` is also the repair path for terminated datasets that still resolve via metadata in `createContext(...)` but are no longer usable because they are suffering unrecoverable proving failures on the SP side. + +Those terminated datasets are often first surfaced by the `deal` job. When that happens, `deal` does not repair them inline; it defers repair to `data_set_creation`, which then reconciles the slot incrementally. + +This job remains intentionally separate from `deal`: + +- `deal` focuses on running a data-storage check and may detect a terminated dataset. +- `data_set_creation` focuses on maintaining dataset inventory over time and repairing terminated slots. +- Keeping the repair path separate avoids mixing normal check execution with dataset lifecycle recovery. + +## Scheduling and queueing + +The scheduler creates per-provider `data_set_creation` schedules in the same loop that upserts `deal` and `retrieval` schedules. + +- Schedule creation is gated by `MIN_NUM_DATASETS_FOR_CHECKS >= 1`. +- The interval is derived from `DATASET_CREATIONS_PER_SP_PER_HOUR`. +- The initial `next_run_at` uses the same phase offset logic as other SP jobs. +- Enqueued jobs use payload `{ jobType: 'data_set_creation', spAddress, intervalSeconds }`. +- Jobs go to the shared `sp.work` pg-boss queue. +- The queue send path assigns `singletonKey=spAddress`, so only one SP-scoped job can be active for a provider at a time across all workers. + +That singleton behavior is important because `data_set_creation` mutates provider dataset state and should not race with the provider's other scheduled jobs. 
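The scheduling rules above can be sketched as follows. This is a minimal illustration, not the code in `jobs.service.ts`: the helper names `intervalSecondsFromRate` and `buildDataSetCreationJob` are hypothetical, and the `3600 / rate` conversion is an assumption about how the interval is derived from `DATASET_CREATIONS_PER_SP_PER_HOUR`.

```typescript
// Hypothetical types and helpers for illustration only.
type SpJobType = "deal" | "retrieval" | "data_set_creation";

interface SpJobPayload {
  jobType: SpJobType;
  spAddress: string;
  intervalSeconds: number;
}

// Assumption: a per-hour rate maps to an interval of 3600 / rate seconds,
// rounded to the nearest second and never below one second.
function intervalSecondsFromRate(perHour: number): number {
  if (!(perHour > 0)) {
    throw new Error(`rate must be a positive number, got ${perHour}`);
  }
  return Math.max(1, Math.round(3600 / perHour));
}

// Returns null when the MIN_NUM_DATASETS_FOR_CHECKS >= 1 gate is closed.
function buildDataSetCreationJob(
  spAddress: string,
  minDataSets: number,
  creationsPerHour: number,
): { payload: SpJobPayload; singletonKey: string } | null {
  if (minDataSets < 1) {
    return null;
  }
  return {
    payload: {
      jobType: "data_set_creation",
      spAddress,
      intervalSeconds: intervalSecondsFromRate(creationsPerHour),
    },
    // Same key for every SP-scoped job, so jobs for one provider never overlap.
    singletonKey: spAddress,
  };
}
```

Keying the singleton on `spAddress` alone, rather than on `spAddress` plus job type, is what serializes `data_set_creation` against the provider's other jobs.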
+ +## Preconditions and skip conditions + +Before the handler does any provisioning work, it applies the same operational guards used by other SP-scoped jobs: + +- **Maintenance windows**: if a maintenance window is active, the job is deferred instead of running. +- **SP blocklists**: if the provider is blocked by configured blocklists, the job logs a skip and exits. +- **Timeouts**: the handler runs under an `AbortController` timeout based on `DATA_SET_CREATION_JOB_TIMEOUT_SECONDS`, with an effective floor of 120 seconds. + +## Dataset slot model + +The job treats required datasets as numbered slots from `0` to `MIN_NUM_DATASETS_FOR_CHECKS - 1`. + +For each slot, it computes deterministic metadata: + +- Base metadata comes from `DealService.getBaseDataSetMetadata()`. +- Base metadata always includes the IPNI metadata key (`withIpniIndexing=""`). +- If `DEALBOT_DATASET_VERSION` is configured, base metadata also includes `dealbotDataSetVersion`. +- Slot `0` uses only the base metadata. +- Slots `1+` add `dealbotDS: String(index)`. + +This makes each slot addressable and idempotent: repeated runs look up the same logical dataset slot using the same metadata. + +## Handler algorithm + +For one provider, one invocation of `data_set_creation` works like this: + +1. Resolve the provider context and verify the provider is runnable. +2. Read `MIN_NUM_DATASETS_FOR_CHECKS` and the base dataset metadata. +3. Iterate slots from `0` upward. +4. For each slot, call `DealService.getDataSetProvisioningStatus(spAddress, metadata, signal)`. +5. If the slot is `live`, continue to the next slot. +6. If the slot is `terminated`, repair it and stop for this tick. +7. If the slot is `missing`, create it and stop for this tick. +8. If every slot is `live`, log completion and exit. + +The key design choice is **incremental provisioning**: one run repairs or creates at most one slot. 
That keeps runtime bounded and spreads background load across scheduler ticks instead of attempting a full reconciliation in one execution. + +## How slot status is determined + +`DealService.getDataSetProvisioningStatus()` classifies a slot as `missing`, `live`, or `terminated`. + +The lookup flow is: + +1. Resolve provider info from the wallet/registry layer. +2. Call Synapse `storage.createContext({ providerId, metadata })`. +3. If no `dataSetId` is present in the context, the slot is `missing`. +4. If a `dataSetId` is present, probe liveness through `DatasetLivenessService`. +5. Return `live` if the probes succeed, otherwise return `terminated`. + +`terminated` means the dataset identifier still resolves from metadata, but liveness checks say the dataset is no longer usable. + +## Missing dataset flow + +When the first missing slot is found, the job provisions exactly one dataset by calling `DealService.createDataSetWithPiece()`. + +That method: + +- Resolves the provider from the registry. +- Creates the dataset using the same `createContext + executeUpload` path used by data-storage checks. +- Uploads a minimal seed piece so the dataset is non-empty. +- Does **not** persist a `Deal` row. +- Does **not** emit data-storage-check success or failure metrics. +- Does **not** perform retrieval checks or IPNI verification steps after upload. + +The goal is only to ensure the dataset exists and can later be used by the real checks. + +## Terminated dataset repair flow + +When the first terminated slot is found, the job calls `DealService.repairTerminatedDataSet()` and then exits without creating a replacement in the same run. + +Repair is intentionally idempotent: + +1. Read the FWSS dataset state. +2. If `pdpEndEpoch` is already non-zero, skip the terminate transaction. +3. Otherwise call `terminateDataSet`. +4. Wait for the transaction receipt when possible. +5. Poll FWSS until `pdpEndEpoch != 0`. +6. 
Mark existing `Deal` rows for that `dataSetId` as `cleanedUp=true` in one transaction. + +After that repair completes, the next scheduled run will see the slot as `missing` and provision a replacement dataset. + +This two-step approach avoids mixing termination cleanup and replacement provisioning into one long, failure-prone handler execution. + +## Interaction with other jobs + +- `deal` depends on these datasets being available. +- If a `deal` job hits a terminated dataset, it logs that the dataset is PDP-terminated and waits for `data_set_creation` repair. +- Because all SP-scoped jobs share the same singleton queue key, `data_set_creation` cannot overlap with the same provider's `deal`, `retrieval`, `piece_cleanup`, or `pull_check` job. + +## Configuration + +The main controls for this job are: + +- `MIN_NUM_DATASETS_FOR_CHECKS`: required number of live dataset slots per provider. +- [`DATASET_CREATIONS_PER_SP_PER_HOUR`](./environment-variables.md#dataset_creations_per_sp_per_hour): scheduling rate for reconciliation runs. +- `DATA_SET_CREATION_JOB_TIMEOUT_SECONDS`: job timeout before the abort signal fires. +- [`DEALBOT_DATASET_VERSION`](./environment-variables.md#dealbot_dataset_version): optional version tag added to base dataset metadata. +- `USE_ONLY_APPROVED_PROVIDERS`: indirectly affects which providers receive schedules. + +## Observability + +The job is observable in two layers: + +- **Structured logs** for provisioning, repair, aborts, failures, maintenance deferrals, and completed reconciliation. +- **Generic job metrics** through the shared Prometheus job counters and duration histogram (`jobs_started_total`, `jobs_completed_total`, `job_duration_seconds`). 
+ +Common log events include: + +- `creating_provisioned_data_set` +- `data_set_provisioning_progress` +- `data_sets_provisioning_completed` +- `dataset_terminated_detected` +- `data_set_repair_completed` +- `data_set_creation_job_aborted` +- `data_set_creation_job_failed` + +## Source of truth + +- Scheduler and handler: [`apps/backend/src/jobs/jobs.service.ts`](../apps/backend/src/jobs/jobs.service.ts) +- Incremental provisioning logic: [`apps/backend/src/jobs/data-set-creation.handler.ts`](../apps/backend/src/jobs/data-set-creation.handler.ts) +- Dataset provisioning and repair: [`apps/backend/src/deal/deal.service.ts`](../apps/backend/src/deal/deal.service.ts) +- Job system overview: [`docs/jobs.md`](./jobs.md) From 45a3e90c4ef8f35e39d9e2d1634c1104f9208098 Mon Sep 17 00:00:00 2001 From: silent-cipher Date: Mon, 1 Jun 2026 23:50:35 +0530 Subject: [PATCH 02/16] docs: add data-set-deletion job design documentation --- docs/data-set-deletion.md | 197 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 197 insertions(+) create mode 100644 docs/data-set-deletion.md diff --git a/docs/data-set-deletion.md b/docs/data-set-deletion.md new file mode 100644 index 00000000..608f5acd --- /dev/null +++ b/docs/data-set-deletion.md @@ -0,0 +1,197 @@ +# Data Set Deletion Job + +This doc proposes a calibration-only `data_set_deletion` job that periodically deletes a managed dataset so the existing `data_set_creation` job naturally recreates it. The goal is to keep dealbot continuously exercising the on-chain `createDataSet` lifecycle instead of only creating datasets until a steady-state cap is reached. + +## Problem Context + +During the PDPVerifier v3.4.0 rollout, `createDataSet` broke on calibration and mainnet. Dealbot did not detect the calibration outage because providers had already reached the steady-state cap of managed datasets (`MIN_NUM_DATASETS_FOR_CHECKS`). 
Once the cap was reached, `data_set_creation` stopped exercising the on-chain create path, so the canary value of that job disappeared. + +The missing capability is not more creation logic. The missing capability is a controlled way to create fresh demand for creation again. + +## Goals + +- Continuously exercise the calibration `createDataSet → deleteDataSet → createDataSet` lifecycle. +- Reuse the existing `data_set_creation` job as the replenishment mechanism. +- Minimize disruption to ongoing deal and retrieval checks. +- Make delete cadence explicitly configurable so the expected create cadence can be reasoned about. +- Ensure the job cannot run on mainnet. +- Expose enough metrics and logs to extend the existing BetterStack dashboards. + +## Proposed job + +Introduce a new SP-scoped job type: `data_set_deletion`. + +The job should: + +- run only on calibration +- run on a configurable cadence +- delete at most one safe managed dataset per provider per invocation +- rely on the existing `data_set_creation` job to recreate the missing slot on a later tick + +This keeps deletion simple and keeps creation logic centralized in the existing job. 
+ +### Configuration + +The initial design adds these controls, which follow the same naming pattern as the creation job: + +- `DATASET_DELETIONS_PER_SP_PER_HOUR` + - mirrors the existing rate-based job controls + - converted internally to `intervalSeconds` + - used to reason about expected delete frequency + +- `DATA_SET_DELETION_JOB_TIMEOUT_SECONDS` + - max runtime for one deletion job invocation + +- `DATA_SET_DELETION_MIN_INDEX` + - the lowest slot index eligible for deletion (inclusive) + - default: `1` — only the baseline slot (index `0`) is protected + - slots `0` through `DATA_SET_DELETION_MIN_INDEX - 1` are never touched by this job + - example: `MIN_NUM_DATASETS_FOR_CHECKS = 10`, `DATA_SET_DELETION_MIN_INDEX = 5` → slots 0–4 are stable, slots 5–9 cycle as the canary window + - set to `MIN_NUM_DATASETS_FOR_CHECKS` to disable deletion entirely — the canary window becomes empty and no schedule is created + - must be `>= 1` and `<= MIN_NUM_DATASETS_FOR_CHECKS`; violating either constraint crashes the application on startup + +### Scheduling and queueing + +The scheduling model mirrors `data_set_creation`: + +- queue: shared `sp.work` +- `singletonKey=spAddress` + +Sharing the singleton with other SP jobs prevents deletion from racing with a `deal` or `retrieval` job for the same provider. + +The schedule is only upserted when all of the following are true: + +- `NETWORK=calibration` +- `MIN_NUM_DATASETS_FOR_CHECKS - DATA_SET_DELETION_MIN_INDEX > 0` + +The second condition covers the `DATA_SET_DELETION_MIN_INDEX = MIN_NUM_DATASETS_FOR_CHECKS` case (empty canary window, deletion effectively off) without crashing. It also handles the case where `MIN_NUM_DATASETS_FOR_CHECKS` is later lowered to meet `DATA_SET_DELETION_MIN_INDEX` — no schedule is created without requiring a config change. + +### Proposed handler algorithm + +For one provider, one invocation of `data_set_deletion` works like this: + +1. Check that the network is calibration. 
If not, log skip and exit. +2. Apply the same maintenance-window and SP-blocklist rules used by other SP jobs. +3. Create an `AbortController` using `DATA_SET_DELETION_JOB_TIMEOUT_SECONDS`. +4. Read `MIN_NUM_DATASETS_FOR_CHECKS` and base dataset metadata. +5. Scan slots from `minDataSets - 1` down to `DATA_SET_DELETION_MIN_INDEX`. For each slot: + - a. Build its metadata using the same logic as `data_set_creation`. + - b. Classify it via `getDataSetProvisioningStatus()`. + - c. Skip if `missing` — nothing to delete. + - d. Skip if `terminated` — this means Synapse returned a `dataSetId` (`pdpEndEpoch === 0`) but liveness probes are failing. `data_set_creation` owns repair of these slots via `repairTerminatedDataSet`. Note: a slot whose `terminateService` was already called will never appear as `terminated` here — the Synapse SDK filters datasets with `pdpEndEpoch !== 0` from metadata lookups, so it shows as `missing` instead. + - e. Skip if `live` but has any deal row with `cleaned_up = false` — the deal job is still tracking it as active. +6. Call the delete flow on the first slot that passes all skip conditions (reaches step 5e without being skipped). +7. After successful deletion, mark `deals.cleaned_up = true` for all deal rows associated with that `dataSetId` in a single transaction. +8. Log the outcome and exit for this tick. +9. If no eligible slot is found after the full scan, log `skipped.no_candidate` and exit. This is expected when `data_set_creation` has not yet replenished a previously deleted slot. + +As with `data_set_creation`, the job performs **at most one state-changing action per invocation**. + +### Proposed delete flow + +The deletion flow should be implemented in a dedicated service method rather than inline in `JobsService`. + +1. Resolve provider info and the target `dataSetId`. +2. Call the on-chain `terminateService` path through Synapse. +3. Wait for transaction receipt. +4. Poll until `pdpEndEpoch !== 0`. 
A live dataset has `pdpEndEpoch === 0`; once `terminateService` confirms, `pdpEndEpoch !== 0` is set on-chain. The Synapse SDK filters datasets with `pdpEndEpoch !== 0` from metadata lookups, so `getDataSetProvisioningStatus()` will return `missing` for this slot from this point on. +5. Once `pdpEndEpoch !== 0` is observed, the deletion job's work is done. `data_set_creation` will see the slot as `missing` on its next run and provision a replacement directly. + +Polling until the chain confirms termination is important because the canary value comes from the full on-chain lifecycle, not just submitting a transaction. + +#### Idempotency + +The delete flow must tolerate races and retries: + +- If `pdpEndEpoch !== 0` is already set when the job starts (slot is already terminated), skip the `terminateService` call and treat the run as a no-op success. +- If `terminateService` reverts with an already-terminated error (for example, `"service already terminated"` or `"dataset not active"`), treat it as idempotent success and proceed to the polling step. +- If the transaction confirms but `pdpEndEpoch` does not become non-zero before the abort signal fires, treat the run as `failure.timedout` and let pg-boss retry on the next tick. + +### Metrics and BetterStack dashboards + +The deletion job has two distinct observability concerns: is the trigger firing, and is the canary signal it produces showing up in creation metrics. Creation metrics are the primary signal; deletion metrics are only there to confirm the trigger is working. + +All metrics carry the standard label set defined in [`checks/events-and-metrics.md`](../checks/events-and-metrics.md#metrics): +`network`, `checkType`, `providerId`, `providerName`, `providerStatus`. + +For deletion metrics, `checkType=dataSetDeletion`. For creation metrics referenced below, `checkType=dataSetCreation`. 
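The delete flow and idempotency rules described earlier in this section can be sketched as a single function that maps onto the proposed status values. This is only a sketch under stated assumptions: `TerminationDeps` and `terminateAndConfirm` are hypothetical names, the chain and Synapse calls are injected rather than real SDK calls, and pg-boss retry behavior is not modeled.

```typescript
type DeletionOutcome = "success" | "failure.timedout" | "failure.other";

// Hypothetical dependency seam; real code would wrap Synapse and FWSS reads.
interface TerminationDeps {
  readPdpEndEpoch(dataSetId: number): Promise<number>;
  // Assumed to resolve once the transaction receipt is available.
  terminateService(dataSetId: number): Promise<void>;
  isAlreadyTerminatedError(err: unknown): boolean;
  sleep(ms: number): Promise<void>;
}

async function terminateAndConfirm(
  dataSetId: number,
  deps: TerminationDeps,
  signal: { readonly aborted: boolean }, // an AbortSignal satisfies this shape
  pollMs: number = 5000,
): Promise<DeletionOutcome> {
  try {
    // Already terminated: skip the transaction and report a no-op success.
    if ((await deps.readPdpEndEpoch(dataSetId)) !== 0) {
      return "success";
    }
    try {
      await deps.terminateService(dataSetId);
    } catch (err) {
      // An already-terminated revert is treated as success and falls through to polling.
      if (!deps.isAlreadyTerminatedError(err)) {
        return "failure.other";
      }
    }
    // Poll until the chain reports a non-zero pdpEndEpoch or the timeout fires.
    while (!signal.aborted) {
      if ((await deps.readPdpEndEpoch(dataSetId)) !== 0) {
        return "success";
      }
      await deps.sleep(pollMs);
    }
    return "failure.timedout";
  } catch {
    return "failure.other";
  }
}
```

Injecting the chain reads and the sleep function keeps the flow testable without a network, which is one possible way to keep this logic out of `JobsService`.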
+ +#### Creation metrics (primary signal) + +These already exist and are defined in [`events-and-metrics.md`](../checks/events-and-metrics.md). `data_set_deletion` creates the conditions for them to fire — if they stay silent after deletion is running, something is wrong with creation. + +| Metric | `value` labels | What to watch for | +|--------|---------------|-------------------| +| [`dataSetCreationStatus`](../checks/events-and-metrics.md#dataSetCreationStatus) | `pending`, `success`, `failure.timedout`, `failure.other` | `success` count should rise in the interval after each deletion; persistent `failure.*` after a deletion indicates a `createDataSet` regression | +| [`dataSetCreationMs`](../checks/events-and-metrics.md#dataSetCreationMs) | — | Latency histogram for `createDataSetWithPiece`; spikes after deletion may indicate on-chain congestion | + +#### Deletion metrics (trigger health) + +New metrics proposed here. These confirm deletion is producing the conditions for creation to run. If deletion metrics look healthy but creation metrics are silent, the loop is broken somewhere between the two jobs. + +| Metric | `value` labels | What to watch for | +|--------|---------------|-------------------| +| `dataSetDeletionStatus` | `success`, `failure.timedout`, `failure.other`, `skipped.no_candidate` | `success` per provider confirms the trigger is firing; persistent `skipped.no_candidate` means `data_set_creation` is not replenishing fast enough | +| `dataSetDeletionMs` | — | Histogram from `terminateService` call to `pdpEndEpoch !== 0` confirmed; emitted on `success` and `failure.timedout` only. Analogous to [`dataSetCreationMs`](../checks/events-and-metrics.md#dataSetCreationMs) | + +#### Dashboard questions + +The BetterStack dashboards should make it easy to answer: + +- are `dataSetDeletionStatus{value="success"}` counts rising per provider on calibration? 
+- are `dataSetDeletionStatus{value="skipped.no_candidate"}` runs persisting longer than one creation interval, indicating `data_set_creation` is not replenishing? +- does `dataSetCreationStatus{value="success"}` follow `dataSetDeletionStatus{value="success"}` within the expected interval? +- are `dataSetCreationStatus{value="failure.*"}` counts rising after deletions, indicating a regression in `createDataSet`? + +## Relationship to `data_set_creation` + +The two jobs form a bounded loop. + +`data_set_deletion` only deletes datasets that correspond to dealbot-managed metadata slots. `data_set_creation` detects the resulting `missing` slot through its normal metadata lookup and recreates it without needing any new cross-job state. + +Expected healthy behavior: + +1. `data_set_deletion` calls `terminateService` and polls until `pdpEndEpoch !== 0`. +2. `data_set_creation` runs next. The Synapse SDK filters the terminated dataset from metadata lookups, so the slot resolves as `missing` immediately — not `terminated`. `data_set_creation` provisions a replacement dataset directly in this run. +3. Existing creation metrics and alerts resume acting as the canary. + +**Rate constraint:** `DATASET_CREATIONS_PER_SP_PER_HOUR` should be **greater than or equal to** `DATASET_DELETIONS_PER_SP_PER_HOUR`. If deletion runs faster than creation, the missing-slot backlog accumulates and the system stops behaving like a simple steady-state canary. The scheduler should emit a startup warning log when this constraint is violated so the misconfiguration is visible without a dashboard. + +**Canary window size:** The number of slots eligible for deletion is `MIN_NUM_DATASETS_FOR_CHECKS - DATA_SET_DELETION_MIN_INDEX`. A canary window of `1` means a single slot cycles continuously; a larger window gives deletion more candidates when one slot has active deals blocking it. In practice, a window of `2`–`3` is usually enough buffer. 
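The schedule gate, canary window arithmetic, and rate constraint above can be combined in one startup check. The sketch below is illustrative only: `DeletionConfig`, `DeletionPlan`, and `planDeletion` are hypothetical names, and throwing an `Error` stands in for whatever startup validation the application actually uses.

```typescript
// Hypothetical config shape; field comments name the env vars they mirror.
interface DeletionConfig {
  network: string;           // NETWORK
  minDataSets: number;       // MIN_NUM_DATASETS_FOR_CHECKS
  deletionMinIndex: number;  // DATA_SET_DELETION_MIN_INDEX
  creationsPerHour: number;  // DATASET_CREATIONS_PER_SP_PER_HOUR
  deletionsPerHour: number;  // DATASET_DELETIONS_PER_SP_PER_HOUR
}

interface DeletionPlan {
  scheduleEnabled: boolean;
  canaryWindow: number;
  eligibleSlots: number[]; // in scan order, highest index first
  warnings: string[];
}

function planDeletion(cfg: DeletionConfig): DeletionPlan {
  const { minDataSets, deletionMinIndex } = cfg;
  // The doc requires 1 <= DATA_SET_DELETION_MIN_INDEX <= MIN_NUM_DATASETS_FOR_CHECKS.
  if (deletionMinIndex < 1 || deletionMinIndex > minDataSets) {
    throw new Error(
      `DATA_SET_DELETION_MIN_INDEX must be between 1 and ${minDataSets}, got ${deletionMinIndex}`,
    );
  }
  const canaryWindow = minDataSets - deletionMinIndex;
  const eligibleSlots: number[] = [];
  for (let slot = minDataSets - 1; slot >= deletionMinIndex; slot--) {
    eligibleSlots.push(slot);
  }
  const warnings: string[] = [];
  if (cfg.deletionsPerHour > cfg.creationsPerHour) {
    warnings.push("deletion rate exceeds creation rate; missing slots will accumulate");
  }
  return {
    // Both conditions from the scheduling section must hold.
    scheduleEnabled: cfg.network === "calibration" && canaryWindow > 0,
    canaryWindow,
    eligibleSlots,
    warnings,
  };
}
```

With `minDataSets = 10` and `deletionMinIndex = 5`, this yields a canary window of 5 and eligible slots 9 down to 5, matching the example in the configuration section.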
+ +**Interaction with `piece_cleanup`:** Marking deal rows `cleaned_up = true` after a successful deletion removes those rows from `piece_cleanup`'s candidate pool (which filters on `cleaned_up = false`). This is correct behavior — the data was intentionally deleted, not just freed up for quota management. It also means a deletion-heavy configuration will naturally reduce the number of pieces available for quota-driven cleanup, which operators should account for when sizing `MAX_DATASET_STORAGE_SIZE_BYTES`. + + +## FAQ + +### What happens on-chain after `terminateService` is called? + +`terminateService` does not delete a dataset instantly. It starts a multi-step on-chain sequence that plays out over roughly 30 days. Understanding this is important because the deletion job only needs the first step to complete before it can exit and let `data_set_creation` replenish the slot. + +**Step 1 — terminateService tx confirms** + +`terminateService` calls `FilecoinPay.terminateRail(pdpRailId)`, which sets `endEpoch = block.number + lockupPeriod` on the PDP rail. The FWSS `railTerminated` callback fires in the same transaction, stores `info.pdpEndEpoch`, and emits `PDPPaymentsTerminated` and `ServiceTerminated`. + +This is the point the deletion job polls for: `pdpEndEpoch !== 0`. Once this is set, `data_set_creation` will classify the slot as `missing` and begin the replenishment sequence. The deletion job's work is done here. + +**Step 2 — rail finalization (~30 days later)** + +When the PDP rail's `settledUpTo` reaches `endEpoch`, `finalizeTerminatedRail` fires atomically inside the same settle transaction. The rail is zeroed, `RailFinalized` is emitted, and any unused `lockupFixed` balance is returned to the payer. + +**Step 3 — dataset deletion at PDPVerifier (SP-initiated, after step 2)** + +After the rail finalizes, the storage provider calls `PDPVerifier.deleteDataSet`. This is an SP-only operation at the PDPVerifier layer. 
It clears the dataset's header state and invokes `FWSS.dataSetDeleted`, which verifies the rail has finalized and the lockup has elapsed before wiping FWSS-side state. Note that PDPVerifier's per-piece mappings are not cleared by this call. + +**Why the deletion job only waits for step 1** + +Step 2 happens when `settleRail` is called and the rail's `settledUpTo` reaches `endEpoch`. Step 3 requires the SP to call `PDPVerifier.deleteDataSet` after the rail finalizes. The deletion job does not need to wait for either — the slot is considered missing for dealbot's purposes as soon as `pdpEndEpoch !== 0` is set. Waiting for full finalization would mean waiting ~30 days per invocation, which defeats the purpose of a canary cycle. + +## Source of truth + +- Dataset creation design: [`docs/data-set-creation.md`](./data-set-creation.md) +- Job system overview: [`docs/jobs.md`](./jobs.md) +- Metrics and event definitions: [`docs/checks/events-and-metrics.md`](./checks/events-and-metrics.md) +- Scheduler and workers: [`apps/backend/src/jobs/jobs.service.ts`](../apps/backend/src/jobs/jobs.service.ts) +- Dataset creation handler: [`apps/backend/src/jobs/data-set-creation.handler.ts`](../apps/backend/src/jobs/data-set-creation.handler.ts) +- Deal service dataset logic: [`apps/backend/src/deal/deal.service.ts`](../apps/backend/src/deal/deal.service.ts) +- Piece cleanup reference: [`apps/backend/src/piece-cleanup/piece-cleanup.service.ts`](../apps/backend/src/piece-cleanup/piece-cleanup.service.ts) From e654660ea2dcff37d715d26192cb2d584064373a Mon Sep 17 00:00:00 2001 From: silent-cipher Date: Tue, 2 Jun 2026 00:07:10 +0530 Subject: [PATCH 03/16] docs: simplify terminated slot skip --- docs/data-set-deletion.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/data-set-deletion.md b/docs/data-set-deletion.md index 608f5acd..0298f4d4 100644 --- a/docs/data-set-deletion.md +++ b/docs/data-set-deletion.md @@ -78,7 +78,7 @@ For one provider, one invocation of 
`data_set_deletion` works like this: - a. Build its metadata using the same logic as `data_set_creation`. - b. Classify it via `getDataSetProvisioningStatus()`. - c. Skip if `missing` — nothing to delete. - - d. Skip if `terminated` — this means Synapse returned a `dataSetId` (`pdpEndEpoch === 0`) but liveness probes are failing. `data_set_creation` owns repair of these slots via `repairTerminatedDataSet`. Note: a slot whose `terminateService` was already called will never appear as `terminated` here — the Synapse SDK filters datasets with `pdpEndEpoch !== 0` from metadata lookups, so it shows as `missing` instead. + - d. Skip if `terminated` — `data_set_creation` owns repair of these slots. - e. Skip if `live` but has any deal row with `cleaned_up = false` — the deal job is still tracking it as active. 6. Call the delete flow on the first slot that passes all skip conditions (reaches step 5e without being skipped). 7. After successful deletion, mark `deals.cleaned_up = true` for all deal rows associated with that `dataSetId` in a single transaction. From 6af976bb7848a511366e7c4249dbed0a6ff17da6 Mon Sep 17 00:00:00 2001 From: silent-cipher Date: Wed, 3 Jun 2026 13:32:49 +0530 Subject: [PATCH 04/16] docs: rename + more explanations --- docs/data-set-deletion.md | 197 -------------------------------- docs/data-set-termination.md | 216 +++++++++++++++++++++++++++++++++++ 2 files changed, 216 insertions(+), 197 deletions(-) delete mode 100644 docs/data-set-deletion.md create mode 100644 docs/data-set-termination.md diff --git a/docs/data-set-deletion.md b/docs/data-set-deletion.md deleted file mode 100644 index 0298f4d4..00000000 --- a/docs/data-set-deletion.md +++ /dev/null @@ -1,197 +0,0 @@ -# Data Set Deletion Job - -This doc proposes a calibration-only `data_set_deletion` job that periodically deletes a managed dataset so the existing `data_set_creation` job naturally recreates it. 
The goal is to keep dealbot continuously exercising the on-chain `createDataSet` lifecycle instead of only creating datasets until a steady-state cap is reached. - -## Problem Context - -During the PDPVerifier v3.4.0 rollout, `createDataSet` broke on calibration and mainnet. Dealbot did not detect the calibration outage because providers had already reached the steady-state cap of managed datasets (`MIN_NUM_DATASETS_FOR_CHECKS`). Once the cap was reached, `data_set_creation` stopped exercising the on-chain create path, so the canary value of that job disappeared. - -The missing capability is not more creation logic. The missing capability is a controlled way to create fresh demand for creation again. - -## Goals - -- Continuously exercise the calibration `createDataSet → deleteDataSet → createDataSet` lifecycle. -- Reuse the existing `data_set_creation` job as the replenishment mechanism. -- Minimize disruption to ongoing deal and retrieval checks. -- Make delete cadence explicitly configurable so the expected create cadence can be reasoned about. -- Ensure the job cannot run on mainnet. -- Expose enough metrics and logs to extend the existing BetterStack dashboards. - -## Proposed job - -Introduce a new SP-scoped job type: `data_set_deletion`. - -The job should: - -- run only on calibration -- run on a configurable cadence -- delete at most one safe managed dataset per provider per invocation -- rely on the existing `data_set_creation` job to recreate the missing slot on a later tick - -This keeps deletion simple and keeps creation logic centralized in the existing job. 
- -### Configuration - -The initial design adds these controls, which follow the same naming pattern as the creation job: - -- `DATASET_DELETIONS_PER_SP_PER_HOUR` - - mirrors the existing rate-based job controls - - converted internally to `intervalSeconds` - - used to reason about expected delete frequency - -- `DATA_SET_DELETION_JOB_TIMEOUT_SECONDS` - - max runtime for one deletion job invocation - -- `DATA_SET_DELETION_MIN_INDEX` - - the lowest slot index eligible for deletion (inclusive) - - default: `1` — only the baseline slot (index `0`) is protected - - slots `0` through `DATA_SET_DELETION_MIN_INDEX - 1` are never touched by this job - - example: `MIN_NUM_DATASETS_FOR_CHECKS = 10`, `DATA_SET_DELETION_MIN_INDEX = 5` → slots 0–4 are stable, slots 5–9 cycle as the canary window - - set to `MIN_NUM_DATASETS_FOR_CHECKS` to disable deletion entirely — the canary window becomes empty and no schedule is created - - must be `>= 1` and `<= MIN_NUM_DATASETS_FOR_CHECKS`; violating either constraint crashes the application on startup - -### Scheduling and queueing - -The scheduling model mirrors `data_set_creation`: - -- queue: shared `sp.work` -- `singletonKey=spAddress` - -Sharing the singleton with other SP jobs prevents deletion from racing with a `deal` or `retrieval` job for the same provider. - -The schedule is only upserted when all of the following are true: - -- `NETWORK=calibration` -- `MIN_NUM_DATASETS_FOR_CHECKS - DATA_SET_DELETION_MIN_INDEX > 0` - -The second condition covers the `DATA_SET_DELETION_MIN_INDEX = MIN_NUM_DATASETS_FOR_CHECKS` case (empty canary window, deletion effectively off) without crashing. It also handles the case where `MIN_NUM_DATASETS_FOR_CHECKS` is later lowered to meet `DATA_SET_DELETION_MIN_INDEX` — no schedule is created without requiring a config change. - -### Proposed handler algorithm - -For one provider, one invocation of `data_set_deletion` works like this: - -1. Check that the network is calibration. 
If not, log skip and exit. -2. Apply the same maintenance-window and SP-blocklist rules used by other SP jobs. -3. Create an `AbortController` using `DATA_SET_DELETION_JOB_TIMEOUT_SECONDS`. -4. Read `MIN_NUM_DATASETS_FOR_CHECKS` and base dataset metadata. -5. Scan slots from `minDataSets - 1` down to `DATA_SET_DELETION_MIN_INDEX`. For each slot: - - a. Build its metadata using the same logic as `data_set_creation`. - - b. Classify it via `getDataSetProvisioningStatus()`. - - c. Skip if `missing` — nothing to delete. - - d. Skip if `terminated` — `data_set_creation` owns repair of these slots. - - e. Skip if `live` but has any deal row with `cleaned_up = false` — the deal job is still tracking it as active. -6. Call the delete flow on the first slot that passes all skip conditions (reaches step 5e without being skipped). -7. After successful deletion, mark `deals.cleaned_up = true` for all deal rows associated with that `dataSetId` in a single transaction. -8. Log the outcome and exit for this tick. -9. If no eligible slot is found after the full scan, log `skipped.no_candidate` and exit. This is expected when `data_set_creation` has not yet replenished a previously deleted slot. - -As with `data_set_creation`, the job performs **at most one state-changing action per invocation**. - -### Proposed delete flow - -The deletion flow should be implemented in a dedicated service method rather than inline in `JobsService`. - -1. Resolve provider info and the target `dataSetId`. -2. Call the on-chain `terminateService` path through Synapse. -3. Wait for transaction receipt. -4. Poll until `pdpEndEpoch !== 0`. A live dataset has `pdpEndEpoch === 0`; once `terminateService` confirms, `pdpEndEpoch !== 0` is set on-chain. The Synapse SDK filters datasets with `pdpEndEpoch !== 0` from metadata lookups, so `getDataSetProvisioningStatus()` will return `missing` for this slot from this point on. -5. Once `pdpEndEpoch !== 0` is observed, the deletion job's work is done. 
`data_set_creation` will see the slot as `missing` on its next run and provision a replacement directly. - -Polling until the chain confirms termination is important because the canary value comes from the full on-chain lifecycle, not just submitting a transaction. - -#### Idempotency - -The delete flow must tolerate races and retries: - -- If `pdpEndEpoch !== 0` is already set when the job starts (slot is already terminated), skip the `terminateService` call and treat the run as a no-op success. -- If `terminateService` reverts with an already-terminated error (for example, `"service already terminated"` or `"dataset not active"`), treat it as idempotent success and proceed to the polling step. -- If the transaction confirms but `pdpEndEpoch` does not become non-zero before the abort signal fires, treat the run as `failure.timedout` and let pg-boss retry on the next tick. - -### Metrics and BetterStack dashboards - -The deletion job has two distinct observability concerns: is the trigger firing, and is the canary signal it produces showing up in creation metrics. Creation metrics are the primary signal; deletion metrics are only there to confirm the trigger is working. - -All metrics carry the standard label set defined in [`checks/events-and-metrics.md`](../checks/events-and-metrics.md#metrics): -`network`, `checkType`, `providerId`, `providerName`, `providerStatus`. - -For deletion metrics, `checkType=dataSetDeletion`. For creation metrics referenced below, `checkType=dataSetCreation`. - -#### Creation metrics (primary signal) - -These already exist and are defined in [`events-and-metrics.md`](../checks/events-and-metrics.md). `data_set_deletion` creates the conditions for them to fire — if they stay silent after deletion is running, something is wrong with creation. 
- -| Metric | `value` labels | What to watch for | -|--------|---------------|-------------------| -| [`dataSetCreationStatus`](../checks/events-and-metrics.md#dataSetCreationStatus) | `pending`, `success`, `failure.timedout`, `failure.other` | `success` count should rise in the interval after each deletion; persistent `failure.*` after a deletion indicates a `createDataSet` regression | -| [`dataSetCreationMs`](../checks/events-and-metrics.md#dataSetCreationMs) | — | Latency histogram for `createDataSetWithPiece`; spikes after deletion may indicate on-chain congestion | - -#### Deletion metrics (trigger health) - -New metrics proposed here. These confirm deletion is producing the conditions for creation to run. If deletion metrics look healthy but creation metrics are silent, the loop is broken somewhere between the two jobs. - -| Metric | `value` labels | What to watch for | -|--------|---------------|-------------------| -| `dataSetDeletionStatus` | `success`, `failure.timedout`, `failure.other`, `skipped.no_candidate` | `success` per provider confirms the trigger is firing; persistent `skipped.no_candidate` means `data_set_creation` is not replenishing fast enough | -| `dataSetDeletionMs` | — | Histogram from `terminateService` call to `pdpEndEpoch !== 0` confirmed; emitted on `success` and `failure.timedout` only. Analogous to [`dataSetCreationMs`](../checks/events-and-metrics.md#dataSetCreationMs) | - -#### Dashboard questions - -The BetterStack dashboards should make it easy to answer: - -- are `dataSetDeletionStatus{value="success"}` counts rising per provider on calibration? -- are `dataSetDeletionStatus{value="skipped.no_candidate"}` runs persisting longer than one creation interval, indicating `data_set_creation` is not replenishing? -- does `dataSetCreationStatus{value="success"}` follow `dataSetDeletionStatus{value="success"}` within the expected interval? 
-- are `dataSetCreationStatus{value="failure.*"}` counts rising after deletions, indicating a regression in `createDataSet`? - -## Relationship to `data_set_creation` - -The two jobs form a bounded loop. - -`data_set_deletion` only deletes datasets that correspond to dealbot-managed metadata slots. `data_set_creation` detects the resulting `missing` slot through its normal metadata lookup and recreates it without needing any new cross-job state. - -Expected healthy behavior: - -1. `data_set_deletion` calls `terminateService` and polls until `pdpEndEpoch !== 0`. -2. `data_set_creation` runs next. The Synapse SDK filters the terminated dataset from metadata lookups, so the slot resolves as `missing` immediately — not `terminated`. `data_set_creation` provisions a replacement dataset directly in this run. -3. Existing creation metrics and alerts resume acting as the canary. - -**Rate constraint:** `DATASET_CREATIONS_PER_SP_PER_HOUR` should be **greater than or equal to** `DATASET_DELETIONS_PER_SP_PER_HOUR`. If deletion runs faster than creation, the missing-slot backlog accumulates and the system stops behaving like a simple steady-state canary. The scheduler should emit a startup warning log when this constraint is violated so the misconfiguration is visible without a dashboard. - -**Canary window size:** The number of slots eligible for deletion is `MIN_NUM_DATASETS_FOR_CHECKS - DATA_SET_DELETION_MIN_INDEX`. A canary window of `1` means a single slot cycles continuously; a larger window gives deletion more candidates when one slot has active deals blocking it. In practice, a window of `2`–`3` is usually enough buffer. - -**Interaction with `piece_cleanup`:** Marking deal rows `cleaned_up = true` after a successful deletion removes those rows from `piece_cleanup`'s candidate pool (which filters on `cleaned_up = false`). This is correct behavior — the data was intentionally deleted, not just freed up for quota management. 
It also means a deletion-heavy configuration will naturally reduce the number of pieces available for quota-driven cleanup, which operators should account for when sizing `MAX_DATASET_STORAGE_SIZE_BYTES`. - - -## FAQ - -### What happens on-chain after `terminateService` is called? - -`terminateService` does not delete a dataset instantly. It starts a multi-step on-chain sequence that plays out over roughly 30 days. Understanding this is important because the deletion job only needs the first step to complete before it can exit and let `data_set_creation` replenish the slot. - -**Step 1 — terminateService tx confirms** - -`terminateService` calls `FilecoinPay.terminateRail(pdpRailId)`, which sets `endEpoch = block.number + lockupPeriod` on the PDP rail. The FWSS `railTerminated` callback fires in the same transaction, stores `info.pdpEndEpoch`, and emits `PDPPaymentsTerminated` and `ServiceTerminated`. - -This is the point the deletion job polls for: `pdpEndEpoch !== 0`. Once this is set, `data_set_creation` will classify the slot as `missing` and begin the replenishment sequence. The deletion job's work is done here. - -**Step 2 — rail finalization (~30 days later)** - -When the PDP rail's `settledUpTo` reaches `endEpoch`, `finalizeTerminatedRail` fires atomically inside the same settle transaction. The rail is zeroed, `RailFinalized` is emitted, and any unused `lockupFixed` balance is returned to the payer. - -**Step 3 — dataset deletion at PDPVerifier (SP-initiated, after step 2)** - -After the rail finalizes, the storage provider calls `PDPVerifier.deleteDataSet`. This is an SP-only operation at the PDPVerifier layer. It clears the dataset's header state and invokes `FWSS.dataSetDeleted`, which verifies the rail has finalized and the lockup has elapsed before wiping FWSS-side state. Note that PDPVerifier's per-piece mappings are not cleared by this call. 
- -**Why the deletion job only waits for step 1** - -Step 2 happens when `settleRail` is called and the rail's `settledUpTo` reaches `endEpoch`. Step 3 requires the SP to call `PDPVerifier.deleteDataSet` after the rail finalizes. The deletion job does not need to wait for either — the slot is considered missing for dealbot's purposes as soon as `pdpEndEpoch !== 0` is set. Waiting for full finalization would mean waiting ~30 days per invocation, which defeats the purpose of a canary cycle. - -## Source of truth - -- Dataset creation design: [`docs/data-set-creation.md`](./data-set-creation.md) -- Job system overview: [`docs/jobs.md`](./jobs.md) -- Metrics and event definitions: [`docs/checks/events-and-metrics.md`](./checks/events-and-metrics.md) -- Scheduler and workers: [`apps/backend/src/jobs/jobs.service.ts`](../apps/backend/src/jobs/jobs.service.ts) -- Dataset creation handler: [`apps/backend/src/jobs/data-set-creation.handler.ts`](../apps/backend/src/jobs/data-set-creation.handler.ts) -- Deal service dataset logic: [`apps/backend/src/deal/deal.service.ts`](../apps/backend/src/deal/deal.service.ts) -- Piece cleanup reference: [`apps/backend/src/piece-cleanup/piece-cleanup.service.ts`](../apps/backend/src/piece-cleanup/piece-cleanup.service.ts) diff --git a/docs/data-set-termination.md b/docs/data-set-termination.md new file mode 100644 index 00000000..83a68f75 --- /dev/null +++ b/docs/data-set-termination.md @@ -0,0 +1,216 @@ +# Data Set Termination Job + +This doc proposes a calibration-only `data_set_termination` job that periodically terminates a dealbot managed dataset so the existing `data_set_creation` job naturally recreates it. The goal is to keep dealbot continuously exercising the on-chain `createDataSet` lifecycle instead of only creating datasets until a steady-state cap is reached. + +## Summary + +- `data_set_termination` is a calibration-only job that periodically terminates one managed dataset slot per provider. 
+- Together with [`data_set_creation`](./data-set-creation.md), the two jobs form a bounded loop that keeps the `createDataSet` on-chain path continuously exercised as a canary. +- The job terminates **at most one dataset per invocation**; `data_set_creation` handles replenishment on its next scheduled tick. +- It runs on the shared `sp.work` queue with `singletonKey=spAddress`, so it cannot race with `deal`, `retrieval`, `piece_cleanup`, `pull_check`, or `data_set_creation` for the same provider. +- Schedule creation is gated by `NETWORK=calibration` and a non-empty canary window (`MIN_NUM_DATASETS_FOR_CHECKS - DATA_SET_TERMINATION_MIN_INDEX > 0`). +- Slots below `DATA_SET_TERMINATION_MIN_INDEX` are never touched, keeping a stable baseline for ongoing checks. + +## Problem Context + +During the PDPVerifier v3.4.0 rollout, `createDataSet` broke on calibration and mainnet. Dealbot did not detect the calibration outage because providers had already reached the steady-state cap of managed datasets ([`MIN_NUM_DATASETS_FOR_CHECKS`](./environment-variables.md#min_num_datasets_for_checks)). Once the cap was reached, `data_set_creation` stopped exercising the on-chain create path, so the canary value of that job disappeared. + +The missing capability is not more creation logic. The missing capability is a controlled way to create fresh demand for creation again. + +## Goals + +- Continuously exercise the calibration `createDataSet → terminateService → createDataSet` lifecycle. +- Reuse the existing `data_set_creation` job as the replenishment mechanism. +- Minimize disruption to ongoing deal and retrieval checks. +- Make termination cadence explicitly configurable so the expected create cadence can be reasoned about. +- Ensure the job cannot run on mainnet. +- Expose enough metrics and logs to extend the existing BetterStack dashboards. + +## Proposed job + +Introduce a new SP-scoped job type: `data_set_termination`. 
+ +The job should: + +- run only on calibration +- run on a configurable cadence +- terminate at most one safe managed dataset per provider per invocation +- rely on the existing `data_set_creation` job to recreate the missing slot on a later tick + +This keeps termination simple and keeps creation logic centralized in the existing job. + +### Configuration + +The initial design adds these controls, which follow the same naming pattern as the creation job: + +- `DATASET_TERMINATIONS_PER_SP_PER_HOUR` + - mirrors the existing rate-based job controls + - converted internally to `intervalSeconds` + - used to reason about expected termination frequency + +- `DATA_SET_TERMINATION_JOB_TIMEOUT_SECONDS` + - max runtime for one termination job invocation + +- `DATA_SET_TERMINATION_MIN_INDEX` + - the lowest slot index eligible for termination (inclusive) + - default: `1` — only the baseline slot (index `0`) is protected + - slots `0` through `DATA_SET_TERMINATION_MIN_INDEX - 1` are never touched by this job + - example: `MIN_NUM_DATASETS_FOR_CHECKS = 10`, `DATA_SET_TERMINATION_MIN_INDEX = 5` → slots 0–4 are stable, slots 5–9 cycle as the canary window + - set to `MIN_NUM_DATASETS_FOR_CHECKS` to disable termination entirely — the canary window becomes empty and no schedule is created + - must be `>= 1` and `<=` [`MIN_NUM_DATASETS_FOR_CHECKS`](./environment-variables.md#min_num_datasets_for_checks); violating either constraint crashes the application on startup + +### Scheduling and queueing + +The scheduling model mirrors `data_set_creation`: + +- queue: shared `sp.work` +- `singletonKey=spAddress` + +Sharing the singleton with other SP jobs prevents termination from racing with a `deal`, `retrieval`, `pull_check`, `piece_cleanup`, or `data_set_creation` job for the same provider. 
+ +The schedule is only upserted when all of the following are true: + +- `NETWORK=calibration` +- `MIN_NUM_DATASETS_FOR_CHECKS - DATA_SET_TERMINATION_MIN_INDEX > 0` + +The second condition covers the `DATA_SET_TERMINATION_MIN_INDEX = MIN_NUM_DATASETS_FOR_CHECKS` case (empty canary window, termination effectively off) without crashing. It also handles the case where `MIN_NUM_DATASETS_FOR_CHECKS` is later lowered to equal `DATA_SET_TERMINATION_MIN_INDEX`: the schedule is simply not created, and no further config change is required. + +### Proposed handler algorithm + +For one provider, one invocation of `data_set_termination` works like this: + +1. Check that the network is calibration. If not, log the skip and exit. +2. Apply the same maintenance-window and SP-blocklist rules used by other SP jobs. +3. Create an `AbortController` using `DATA_SET_TERMINATION_JOB_TIMEOUT_SECONDS`. +4. Read `MIN_NUM_DATASETS_FOR_CHECKS` (referred to as `minDataSets` below) and base dataset metadata. +5. Scan slots from `minDataSets - 1` down to `DATA_SET_TERMINATION_MIN_INDEX`. For each slot: + - a. Build its metadata using the same logic as `data_set_creation`. + - b. Classify it via `getDataSetProvisioningStatus()`. + - c. Skip if `missing`: there is nothing to terminate. + - d. Skip if `terminated`: `data_set_creation` owns repair of these slots. + - e. Skip if `live` but has any deal row with `cleaned_up = false`: the deal job is still tracking it as active. +6. Call the termination flow on the first slot that passes all skip conditions (that is, it clears checks 5c through 5e without being skipped). +7. Log the outcome and exit for this tick. +8. If no eligible slot is found after the full scan, log `skipped.no_candidate` and exit. This is expected when `data_set_creation` has not yet replenished a previously terminated slot. + +As with `data_set_creation`, the job performs **at most one state-changing action per invocation**. 
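The slot scan in step 5 of the handler algorithm can be sketched as a pure function. This is a minimal illustration, not the handler's actual code: `SlotView`, `hasUncleanedDeals`, and `pickTerminationCandidate` are hypothetical names, and the sketch assumes each slot's status has already been classified (for example via `getDataSetProvisioningStatus()`).

```typescript
// Hypothetical shapes for illustration; the real handler's types may differ.
type SlotStatus = "live" | "terminated" | "missing";

interface SlotView {
  index: number;
  status: SlotStatus; // assumed to come from getDataSetProvisioningStatus()
  hasUncleanedDeals: boolean; // true if any deal row has cleaned_up = false
}

/**
 * Scans from the highest slot index down to minIndex (inclusive) and returns
 * the first slot index that is safe to terminate, or null when no slot
 * qualifies (the `skipped.no_candidate` case).
 */
function pickTerminationCandidate(
  slots: ReadonlyMap<number, SlotView>,
  minDataSets: number,
  minIndex: number,
): number | null {
  for (let index = minDataSets - 1; index >= minIndex; index--) {
    const slot = slots.get(index);
    // 5c: nothing to terminate for a missing slot.
    if (!slot || slot.status === "missing") continue;
    // 5d: data_set_creation owns repair of terminated slots.
    if (slot.status === "terminated") continue;
    // 5e: the deal job is still tracking this dataset as active.
    if (slot.hasUncleanedDeals) continue;
    return index;
  }
  return null;
}
```

Keeping the selection free of I/O makes the "at most one state-changing action per invocation" rule easy to test: the handler would call the termination flow only when this function returns a non-null index.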
+ +### Proposed termination flow + +The termination flow should be implemented in a dedicated service method rather than inline in `JobsService`. + +1. Resolve provider info from cache and the target `dataSetId` using synapse-sdk by building slot dataset metadata. +2. Call the on-chain `terminateService` path through Synapse. +3. Wait for transaction receipt. +4. Poll until `pdpEndEpoch !== 0`. A live dataset has `pdpEndEpoch === 0`; once `terminateService` confirms, `pdpEndEpoch !== 0` is set on-chain. The Synapse SDK filters datasets with `pdpEndEpoch !== 0` from metadata lookups, so `getDataSetProvisioningStatus()` will return `missing` for this slot from this point on. +5. Once `pdpEndEpoch !== 0` is observed, the termination flow's work is done. `data_set_creation` will see the slot as `missing` on its next run and provision a replacement directly. + +Polling until the chain confirms termination is important because the canary value comes from the full on-chain lifecycle, not just submitting a transaction. + +#### Idempotency + +The termination flow must tolerate races and retries: + +- If `pdpEndEpoch !== 0` is already set when the job starts (slot is already terminated), skip the `terminateService` call and treat the run as a no-op success. +- If `terminateService` reverts with an already-terminated error (for example, `"service already terminated"` or `"dataset not active"`), treat it as idempotent success and proceed to the polling step. +- If the transaction confirms but `pdpEndEpoch` does not become non-zero before the abort signal fires, treat the run as `failure.timedout` and let pg-boss retry on the next tick. + +### Metrics and BetterStack dashboards + +The termination job has two distinct observability concerns: is the trigger firing, and is the canary signal it produces showing up in creation metrics. Creation metrics are the primary signal; termination metrics are only there to confirm the trigger is working. 
+ +All metrics carry the standard label set defined in [`checks/events-and-metrics.md`](./checks/events-and-metrics.md#metrics): +`checkType`, `providerId`, `providerName`, `providerStatus`. + +For termination metrics, `checkType=dataSetTermination`. For creation metrics referenced below, `checkType=dataSetCreation`. + +#### Creation metrics (primary signal) + +| Metric | `value` labels | What to watch for | +|--------|---------------|-------------------| +| `dataSetCreationStatus`| `pending`, `success`, `failure.timedout`, `failure.other`, `skipped.no_candidate` | `success` count should rise in the interval after each termination; persistent `failure.*` after a termination indicates a `createDataSet` regression | +| `dataSetCreationMs`| — | Latency histogram for `createDataSetWithPiece`; spikes after termination may indicate on-chain congestion | + +#### Termination metrics (trigger health) + +New metrics proposed here. These confirm termination is producing the conditions for creation to run. If termination metrics look healthy but creation metrics are silent, the loop is broken somewhere between the two jobs. + +| Metric | `value` labels | What to watch for | +|--------|---------------|-------------------| +| `dataSetTerminationStatus` | `success`, `failure.timedout`, `failure.other`, `skipped.no_candidate` | `success` per provider confirms the trigger is firing; persistent `skipped.no_candidate` means `data_set_creation` is not replenishing fast enough | +| `dataSetTerminationMs` | — | Histogram from `terminateService` call to `pdpEndEpoch !== 0` confirmed; emitted on `success` and `failure.timedout` only. Analogous to `dataSetCreationMs` | + +#### Dashboard questions + +The BetterStack dashboards should make it easy to answer: + +- are `dataSetTerminationStatus{value="success"}` counts rising per provider on calibration? 
+- are `dataSetTerminationStatus{value="skipped.no_candidate"}` runs persisting longer than one creation interval, indicating `data_set_creation` is not replenishing? +- does `dataSetCreationStatus{value="success"}` follow `dataSetTerminationStatus{value="success"}` within the expected interval? +- are `dataSetCreationStatus{value="failure.*"}` counts rising after terminations, indicating a regression in `createDataSet`? + +## Relationship to `data_set_creation` + +The two jobs form a bounded loop. + +`data_set_termination` only terminates datasets that correspond to dealbot-managed metadata slots. `data_set_creation` detects the resulting `missing` slot through its normal metadata lookup and recreates it without needing any new cross-job state. + +Expected healthy behavior: + +1. `data_set_termination` calls `terminateService` and polls until `pdpEndEpoch !== 0`. +2. `data_set_creation` runs next. The Synapse SDK filters the terminated dataset from metadata lookups, so the slot resolves as `missing` immediately. `data_set_creation` provisions a replacement dataset directly in this run. +3. Existing creation metrics and alerts resume acting as the canary. + +**Rate constraint:** `DATASET_CREATIONS_PER_SP_PER_HOUR` should be **greater than or equal to** `DATASET_TERMINATIONS_PER_SP_PER_HOUR`. If termination runs faster than creation, the missing-slot backlog accumulates and the system stops behaving like a simple steady-state canary. The scheduler should emit a startup warning log when this constraint is violated so the misconfiguration is visible without a dashboard. + +**Canary window size:** The number of slots eligible for termination is `MIN_NUM_DATASETS_FOR_CHECKS - DATA_SET_TERMINATION_MIN_INDEX`. A canary window of `1` means a single slot cycles continuously; a larger window gives termination more candidates when one slot has active deals blocking it. In practice, a window of `2`–`3` is usually enough buffer. 
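The schedule gate, canary window size, and rate constraint described above can be sketched as small helpers. This is an illustration under stated assumptions: the field names mirror the config interfaces added in this PR, but `network`, `CanaryConfig`, and all function names are hypothetical, and the `3600 / rate` conversion to `intervalSeconds` is an assumed formula rather than one stated in the doc.

```typescript
// Hypothetical config view; only the fields needed for this sketch.
interface CanaryConfig {
  network: string; // assumed field carrying the NETWORK value
  minNumDataSetsForChecks: number;
  dataSetTerminationMinIndex: number;
  dataSetCreationsPerSpPerHour: number;
  dataSetTerminationsPerSpPerHour: number;
}

/** Number of slots eligible for termination. */
function canaryWindowSize(cfg: CanaryConfig): number {
  return cfg.minNumDataSetsForChecks - cfg.dataSetTerminationMinIndex;
}

/** Gate as written in the design doc: calibration only, non-empty window. */
function shouldScheduleTermination(cfg: CanaryConfig): boolean {
  return cfg.network === "calibration" && canaryWindowSize(cfg) > 0;
}

/** Assumed conversion from a per-hour rate to a schedule interval. */
function rateToIntervalSeconds(perHour: number): number {
  return Math.round(3600 / perHour);
}

/** Returns a warning message when terminations can outpace creations. */
function rateConstraintWarning(cfg: CanaryConfig): string | null {
  if (cfg.dataSetTerminationsPerSpPerHour <= cfg.dataSetCreationsPerSpPerHour) {
    return null;
  }
  return (
    `DATASET_TERMINATIONS_PER_SP_PER_HOUR (${cfg.dataSetTerminationsPerSpPerHour}) ` +
    `exceeds DATASET_CREATIONS_PER_SP_PER_HOUR (${cfg.dataSetCreationsPerSpPerHour}); ` +
    `missing slots may accumulate`
  );
}
```

With `MIN_NUM_DATASETS_FOR_CHECKS = 10` and `DATA_SET_TERMINATION_MIN_INDEX = 5`, the window is five slots (5 through 9). Raising the minimum index to 10 empties the window, so no schedule would be created.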
+ +## Open Questions + +### Should `terminated` be renamed in `getDataSetProvisioningStatus`? + +The `terminated` status returned by `getDataSetProvisioningStatus` means: the Synapse SDK resolved a `dataSetId` from the metadata fingerprint but liveness probes failed. This is distinct from a dataset that has `pdpEndEpoch !== 0` on-chain (which the SDK filters out entirely, causing the slot to resolve as `missing`). + +The name `terminated` is already used for both the on-chain lifecycle concept and this SDK liveness-probe failure state, which causes confusion. Candidate replacements: `irrecoverable` or `missing.sp`. This rename would affect `data_set_creation`'s handler and repair path as well. + +### Should `data_set_termination` absorb the repair path from `data_set_creation`? + +Currently, `data_set_creation` owns two distinct responsibilities: +1. Repairing `terminated` slots (liveness-probe failures) via `repairTerminatedDataSet`. +2. Provisioning `missing` slots via `createDataSetWithPiece`. + +Once `data_set_termination` exists and calls `terminateService` directly, it handles on-chain termination for managed slots. The question is whether `data_set_creation` should be simplified to only own replenishment, with all termination (including repair) moving to `data_set_termination`. This is left open pending implementation experience. + + +## FAQ + +### What happens on-chain after `terminateService` is called? + +`terminateService` does not delete a dataset instantly. It starts a multi-step on-chain sequence that plays out over roughly 30 days. Understanding this is important because the termination job only needs the first step to complete before it can exit and let `data_set_creation` replenish the slot. + +**Step 1 — terminateService tx confirms** + +`terminateService` calls `FilecoinPay.terminateRail(pdpRailId)`, which sets `endEpoch = block.number + lockupPeriod` on the PDP rail. 
The FWSS `railTerminated` callback fires in the same transaction, stores `info.pdpEndEpoch`, and emits `PDPPaymentsTerminated` and `ServiceTerminated`. + +This is the point the termination job polls for: `pdpEndEpoch !== 0`. Once this is set, `data_set_creation` will classify the slot as `missing` and begin the replenishment sequence. The termination job's work is done here. + +**Step 2 — rail finalization (~30 days later)** + +When the PDP rail's `settledUpTo` reaches `endEpoch`, `finalizeTerminatedRail` fires atomically inside the same settle transaction. The rail is zeroed, `RailFinalized` is emitted, and any unused `lockupFixed` balance is returned to the payer. + +**Step 3 — dataset deletion at PDPVerifier (SP-initiated, after step 2)** + +After the rail finalizes, the storage provider calls `PDPVerifier.deleteDataSet`. This is an SP-only operation at the PDPVerifier layer. It clears the dataset's header state and invokes `FWSS.dataSetDeleted`, which verifies the rail has finalized and the lockup has elapsed before wiping FWSS-side state. Note that PDPVerifier's per-piece mappings are not cleared by this call. + +**Why the termination job only waits for step 1** + +Step 2 happens when `settleRail` is called and the rail's `settledUpTo` reaches `endEpoch`. Step 3 requires the SP to call `PDPVerifier.deleteDataSet` after the rail finalizes. The termination job does not need to wait for either — the slot is considered missing for dealbot's purposes as soon as `pdpEndEpoch !== 0` is set. Waiting for full finalization would mean waiting ~30 days per invocation, which defeats the purpose of a canary cycle. 
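The "wait for step 1 only" behaviour can be sketched as a bounded poll. This is a minimal sketch, not the service's implementation: `waitForTermination`, `readPdpEndEpoch`, and the poll interval are hypothetical, and the epoch is modelled as a `number` for simplicity even though the real SDK value may be a different numeric type.

```typescript
type PollOutcome = "success" | "failure.timedout";

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

/**
 * Polls until pdpEndEpoch becomes non-zero or the abort signal fires.
 * `readPdpEndEpoch` is an injected, hypothetical reader for the on-chain value.
 */
async function waitForTermination(
  readPdpEndEpoch: () => Promise<number>,
  signal: AbortSignal,
  pollIntervalMs = 5000,
): Promise<PollOutcome> {
  while (!signal.aborted) {
    // A live dataset reports 0; any non-zero value means step 1 has completed.
    if ((await readPdpEndEpoch()) !== 0) return "success";
    await sleep(pollIntervalMs);
  }
  // The job timeout fired before termination was observed on-chain.
  return "failure.timedout";
}
```

In this simple form an abort is noticed up to one poll interval late. A production version might race the sleep against the signal, but the outcome mapping to `success` and `failure.timedout` would stay the same.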
+ +## Source of truth + +- Dataset creation design: [`docs/data-set-creation.md`](./data-set-creation.md) +- Job system overview: [`docs/jobs.md`](./jobs.md) +- Metrics and event definitions: [`docs/checks/events-and-metrics.md`](./checks/events-and-metrics.md) +- Scheduler and workers: [`apps/backend/src/jobs/jobs.service.ts`](../apps/backend/src/jobs/jobs.service.ts) +- Dataset creation handler: [`apps/backend/src/jobs/data-set-creation.handler.ts`](../apps/backend/src/jobs/data-set-creation.handler.ts) +- Deal service dataset logic (including `getDataSetProvisioningStatus`, `repairTerminatedDataSet`): [`apps/backend/src/deal/deal.service.ts`](../apps/backend/src/deal/deal.service.ts) From c5375095d78b0067638b580d337ceb0ae6a3dcf9 Mon Sep 17 00:00:00 2001 From: silent-cipher Date: Wed, 3 Jun 2026 13:45:45 +0530 Subject: [PATCH 05/16] chore: address pr comments --- docs/data-set-termination.md | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/docs/data-set-termination.md b/docs/data-set-termination.md index 83a68f75..3c71e5fe 100644 --- a/docs/data-set-termination.md +++ b/docs/data-set-termination.md @@ -2,6 +2,8 @@ This doc proposes a calibration-only `data_set_termination` job that periodically terminates a dealbot managed dataset so the existing `data_set_creation` job naturally recreates it. The goal is to keep dealbot continuously exercising the on-chain `createDataSet` lifecycle instead of only creating datasets until a steady-state cap is reached. +> **Note**: this design does **not** attempt `PDPVerifier.deleteDataSet`, which is SP-initiated; if `deleteDataSet` canary coverage is required for #586, that would need a different approach. + ## Summary - `data_set_termination` is a calibration-only job that periodically terminates one managed dataset slot per provider. 
@@ -100,7 +102,7 @@ As with `data_set_creation`, the job performs **at most one state-changing actio The termination flow should be implemented in a dedicated service method rather than inline in `JobsService`. 1. Resolve provider info from cache and the target `dataSetId` using synapse-sdk by building slot dataset metadata. -2. Call the on-chain `terminateService` path through Synapse. +2. Call the on-chain `terminateService` path through Synapse (`await synapse.storage.terminateDataSet({ dataSetId })`). 3. Wait for transaction receipt. 4. Poll until `pdpEndEpoch !== 0`. A live dataset has `pdpEndEpoch === 0`; once `terminateService` confirms, `pdpEndEpoch !== 0` is set on-chain. The Synapse SDK filters datasets with `pdpEndEpoch !== 0` from metadata lookups, so `getDataSetProvisioningStatus()` will return `missing` for this slot from this point on. 5. Once `pdpEndEpoch !== 0` is observed, the termination flow's work is done. `data_set_creation` will see the slot as `missing` on its next run and provision a replacement directly. @@ -126,10 +128,12 @@ For termination metrics, `checkType=dataSetTermination`. For creation metrics re #### Creation metrics (primary signal) +These already exist and are defined in [`events-and-metrics.md`](./checks/events-and-metrics.md). `data_set_termination` creates the conditions for them to fire — if they stay silent after termination is running, something is wrong with creation. 
+ | Metric | `value` labels | What to watch for | |--------|---------------|-------------------| -| `dataSetCreationStatus`| `pending`, `success`, `failure.timedout`, `failure.other`, `skipped.no_candidate` | `success` count should rise in the interval after each termination; persistent `failure.*` after a termination indicates a `createDataSet` regression | -| `dataSetCreationMs`| — | Latency histogram for `createDataSetWithPiece`; spikes after termination may indicate on-chain congestion | +| [`dataSetCreationStatus`](./checks/events-and-metrics.md#dataSetCreationStatus) | `pending`, `success`, `failure.timedout`, `failure.other` | `success` count should rise in the interval after each termination; persistent `failure.*` after a termination indicates a `createDataSet` regression | +| [`dataSetCreationMs`](./checks/events-and-metrics.md#dataSetCreationMs) | — | Latency histogram for `createDataSetWithPiece`; spikes after termination may indicate on-chain congestion | #### Termination metrics (trigger health) From fac137f6b34bb640e11f6578306ebc489eca44a1 Mon Sep 17 00:00:00 2001 From: silent-cipher Date: Thu, 4 Jun 2026 15:54:24 +0530 Subject: [PATCH 06/16] feat: add data_set_termination canary job --- apps/backend/.env.example | 10 + apps/backend/src/config/app.config.ts | 64 ++++++ .../entities/job-schedule-state.entity.ts | 1 + apps/backend/src/deal/deal.service.spec.ts | 90 ++++++++ apps/backend/src/deal/deal.service.ts | 196 ++++++++++++++---- .../jobs/data-set-termination.handler.spec.ts | 152 ++++++++++++++ .../src/jobs/data-set-termination.handler.ts | 122 +++++++++++ apps/backend/src/jobs/jobs.service.spec.ts | 171 ++++++++++++++- apps/backend/src/jobs/jobs.service.ts | 180 +++++++++++++++- .../repositories/job-schedule.repository.ts | 28 ++- .../metrics-prometheus/check-metric-labels.ts | 8 +- .../check-metrics.service.ts | 27 +++ .../metrics-prometheus.module.ts | 16 ++ .../src/wallet-sdk/wallet-sdk.service.spec.ts | 1 + 14 files changed, 1021 
insertions(+), 45 deletions(-) create mode 100644 apps/backend/src/jobs/data-set-termination.handler.spec.ts create mode 100644 apps/backend/src/jobs/data-set-termination.handler.ts diff --git a/apps/backend/.env.example b/apps/backend/.env.example index e614e6f0..3b3cf272 100644 --- a/apps/backend/.env.example +++ b/apps/backend/.env.example @@ -31,6 +31,11 @@ PDP_SUBGRAPH_ENDPOINT=https://api.thegraph.com/subgraphs/filecoin/pdp # Minimum number of datasets per SP (default: 1). When > 1, a separate data_set_creation job provisions extra datasets. MIN_NUM_DATASETS_FOR_CHECKS=1 +# Lowest dataset slot index the data_set_termination canary may terminate (inclusive). +# Slots 0..(index-1) are never touched. Must be >= 1 and <= MIN_NUM_DATASETS_FOR_CHECKS. +# Equal to MIN_NUM_DATASETS_FOR_CHECKS disables termination (empty canary window). +DATA_SET_TERMINATION_MIN_INDEX=1 + # Dataset Versioning (optional) # Uncomment and set to enable dataset versioning (e.g., "dealbot-v1", "dealbot-v2") # This allows creating new logical datasets without changing wallet addresses @@ -55,6 +60,11 @@ DEALBOT_MAINTENANCE_WINDOW_MINUTES=20 DEALS_PER_SP_PER_HOUR=2 DATASET_CREATIONS_PER_SP_PER_HOUR=1 RETRIEVALS_PER_SP_PER_HOUR=1 +# data_set_termination canary (defaults: enabled on calibration, disabled on mainnet). +# Keep DATASET_TERMINATIONS_PER_SP_PER_HOUR <= DATASET_CREATIONS_PER_SP_PER_HOUR. 
+# DATASET_TERMINATION_ENABLED=true +DATASET_TERMINATIONS_PER_SP_PER_HOUR=1 +DATA_SET_TERMINATION_JOB_TIMEOUT_SECONDS=300 # 5m: Max runtime for data_set_termination jobs PG_BOSS_LOCAL_CONCURRENCY=20 JOB_SCHEDULER_POLL_SECONDS=300 JOB_WORKER_POLL_SECONDS=60 diff --git a/apps/backend/src/config/app.config.ts b/apps/backend/src/config/app.config.ts index e2d69088..e4ebabb4 100644 --- a/apps/backend/src/config/app.config.ts +++ b/apps/backend/src/config/app.config.ts @@ -58,6 +58,22 @@ export const configValidationSchema = Joi.object({ DEALBOT_DATASET_VERSION: Joi.string().optional(), MIN_NUM_DATASETS_FOR_CHECKS: Joi.number().integer().min(1).default(1), PDP_SUBGRAPH_ENDPOINT: Joi.string().uri().optional().allow(""), + // Lowest dataset slot index eligible for termination (inclusive). Slots 0..(index-1) + // are never touched. Must be >= 1 and <= MIN_NUM_DATASETS_FOR_CHECKS; equal disables + // termination (empty canary window). See docs/data-set-termination.md. + DATA_SET_TERMINATION_MIN_INDEX: Joi.number() + .integer() + .min(1) + .default(1) + .custom((value, helpers) => { + const minDataSets = helpers.state.ancestors?.[0]?.MIN_NUM_DATASETS_FOR_CHECKS; + if (minDataSets != null && value > minDataSets) { + return helpers.error("any.invalid", { + message: `DATA_SET_TERMINATION_MIN_INDEX (${value}) must be <= MIN_NUM_DATASETS_FOR_CHECKS (${minDataSets})`, + }); + } + return value; + }, "min index <= min datasets validation"), // Scheduling PROVIDERS_REFRESH_INTERVAL_SECONDS: Joi.number().default(4 * 3600), @@ -80,7 +96,12 @@ export const configValidationSchema = Joi.object({ // Per-hour limits are guardrails to avoid excessive background load. 
DEALS_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).default(4), DATASET_CREATIONS_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).default(1), + DATASET_TERMINATIONS_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).default(1), RETRIEVALS_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).default(2), + // Enables the data_set_termination canary job. The network-dependent default (true on + // calibration, false on mainnet) is resolved in loadConfig; here we only validate the + // type when explicitly set. See docs/data-set-termination.md. + DATASET_TERMINATION_ENABLED: Joi.boolean().optional(), // Polling interval for pg-boss scheduler (lower = more responsive, higher = less DB chatter). JOB_SCHEDULER_POLL_SECONDS: Joi.number().min(60).default(300), JOB_WORKER_POLL_SECONDS: Joi.number().min(5).default(60), @@ -93,6 +114,7 @@ export const configValidationSchema = Joi.object({ DEAL_JOB_TIMEOUT_SECONDS: Joi.number().min(120).default(360), // 6 minutes max runtime for data storage jobs (TODO: reduce default to 3 minutes) RETRIEVAL_JOB_TIMEOUT_SECONDS: Joi.number().min(60).default(60), // 1 minute max runtime for retrieval jobs (TODO: reduce default to 30 seconds) DATA_SET_CREATION_JOB_TIMEOUT_SECONDS: Joi.number().min(60).default(300), // 5 minutes max runtime for dataset creation jobs + DATA_SET_TERMINATION_JOB_TIMEOUT_SECONDS: Joi.number().min(60).default(300), // 5 minutes max runtime for dataset termination jobs // Seconds to hold the process alive after pg-boss drain completes, so Prometheus // captures at least one scrape of the terminal counter increments emitted during // shutdown. Default 35 covers the 30s ServiceMonitor interval plus a 5s buffer. @@ -199,6 +221,12 @@ export interface IBlockchainConfig { useOnlyApprovedProviders: boolean; dealbotDataSetVersion?: string; minNumDataSetsForChecks: number; + /** + * Lowest dataset slot index eligible for the `data_set_termination` job (inclusive). 
+ * Slots `0..(dataSetTerminationMinIndex - 1)` are never terminated. Guaranteed to be + * `>= 1` and `<= minNumDataSetsForChecks` by config validation. + */ + dataSetTerminationMinIndex: number; pdpSubgraphEndpoint?: string; } @@ -226,6 +254,21 @@ export interface IJobsConfig { * Target number of dataset creation runs per storage provider per hour. */ dataSetCreationsPerSpPerHour: number; + /** + * Enables the calibration-focused `data_set_termination` canary job. + * + * Defaults to true on calibration and false on mainnet. Even when enabled, a + * schedule is only created when the canary window + * (`minNumDataSetsForChecks - dataSetTerminationMinIndex`) is non-empty. + */ + dataSetTerminationEnabled: boolean; + /** + * Target number of dataset termination runs per storage provider per hour. + * + * Should be <= `dataSetCreationsPerSpPerHour` so creation can replenish terminated + * slots without backlog. A startup warning is logged when this constraint is violated. + */ + dataSetTerminationsPerSpPerHour: number; /** * How often the scheduler polls Postgres for due jobs (seconds). * @@ -284,6 +327,13 @@ export interface IJobsConfig { * Uses AbortController to actively cancel job execution. */ dataSetCreationJobTimeoutSeconds: number; + /** + * Maximum runtime (seconds) for data-set termination jobs before forced abort. + * + * Bounds the terminateService call plus the `pdpEndEpoch != 0` confirmation poll. + * Uses AbortController to actively cancel job execution. + */ + dataSetTerminationJobTimeoutSeconds: number; /** * Maximum runtime (seconds) for retrieval jobs before forced abort. 
* @@ -458,6 +508,7 @@ export function loadConfig(): IConfig { useOnlyApprovedProviders: process.env.USE_ONLY_APPROVED_PROVIDERS !== "false", dealbotDataSetVersion: process.env.DEALBOT_DATASET_VERSION, minNumDataSetsForChecks: Number.parseInt(process.env.MIN_NUM_DATASETS_FOR_CHECKS || "1", 10), + dataSetTerminationMinIndex: Number.parseInt(process.env.DATA_SET_TERMINATION_MIN_INDEX || "1", 10), pdpSubgraphEndpoint: process.env.PDP_SUBGRAPH_ENDPOINT || "", }, scheduling: { @@ -473,6 +524,15 @@ export function loadConfig(): IConfig { dealsPerSpPerHour: Number.parseFloat(process.env.DEALS_PER_SP_PER_HOUR || "4"), retrievalsPerSpPerHour: Number.parseFloat(process.env.RETRIEVALS_PER_SP_PER_HOUR || "2"), dataSetCreationsPerSpPerHour: Number.parseFloat(process.env.DATASET_CREATIONS_PER_SP_PER_HOUR || "1"), + dataSetTerminationEnabled: (() => { + const raw = process.env.DATASET_TERMINATION_ENABLED; + if (raw == null || raw.trim().length === 0) { + // Default: enabled on calibration, disabled on mainnet. 
+ return (process.env.NETWORK || "calibration") === "calibration"; + } + return raw === "true"; + })(), + dataSetTerminationsPerSpPerHour: Number.parseFloat(process.env.DATASET_TERMINATIONS_PER_SP_PER_HOUR || "1"), schedulerPollSeconds: Number.parseInt(process.env.JOB_SCHEDULER_POLL_SECONDS || "300", 10), workerPollSeconds: Number.parseInt(process.env.JOB_WORKER_POLL_SECONDS || "60", 10), pgbossLocalConcurrency: Number.parseInt(process.env.PG_BOSS_LOCAL_CONCURRENCY || "20", 10), @@ -484,6 +544,10 @@ export function loadConfig(): IConfig { dealJobTimeoutSeconds: Number.parseInt(process.env.DEAL_JOB_TIMEOUT_SECONDS || "360", 10), retrievalJobTimeoutSeconds: Number.parseInt(process.env.RETRIEVAL_JOB_TIMEOUT_SECONDS || "60", 10), dataSetCreationJobTimeoutSeconds: Number.parseInt(process.env.DATA_SET_CREATION_JOB_TIMEOUT_SECONDS || "300", 10), + dataSetTerminationJobTimeoutSeconds: Number.parseInt( + process.env.DATA_SET_TERMINATION_JOB_TIMEOUT_SECONDS || "300", + 10, + ), shutdownFinalScrapeDelaySeconds: Number.parseInt(process.env.SHUTDOWN_FINAL_SCRAPE_DELAY_SECONDS || "35", 10), pieceCleanupPerSpPerHour: Number.parseFloat(process.env.JOB_PIECE_CLEANUP_PER_SP_PER_HOUR || String(1 / 24)), maxPieceCleanupRuntimeSeconds: Number.parseInt(process.env.MAX_PIECE_CLEANUP_RUNTIME_SECONDS || "300", 10), diff --git a/apps/backend/src/database/entities/job-schedule-state.entity.ts b/apps/backend/src/database/entities/job-schedule-state.entity.ts index 4d801d2a..a42851eb 100644 --- a/apps/backend/src/database/entities/job-schedule-state.entity.ts +++ b/apps/backend/src/database/entities/job-schedule-state.entity.ts @@ -4,6 +4,7 @@ export type JobType = | "deal" | "retrieval" | "data_set_creation" + | "data_set_termination" | "pull_check" | "providers_refresh" | "data_retention_poll" diff --git a/apps/backend/src/deal/deal.service.spec.ts b/apps/backend/src/deal/deal.service.spec.ts index 0672a7a2..0508779b 100644 --- a/apps/backend/src/deal/deal.service.spec.ts +++ 
b/apps/backend/src/deal/deal.service.spec.ts @@ -17,6 +17,7 @@ import { DealAddonsService } from "../deal-addons/deal-addons.service.js"; import { DealPreprocessingResult } from "../deal-addons/types.js"; import { DataSetCreationCheckMetrics, + DataSetTerminationCheckMetrics, DataStorageCheckMetrics, RetrievalCheckMetrics, } from "../metrics-prometheus/check-metrics.service.js"; @@ -169,6 +170,10 @@ describe("DealService", () => { observeCheckDuration: vi.fn(), recordStatus: vi.fn(), }; + const mockDataSetTerminationMetrics = { + observeCheckDuration: vi.fn(), + recordStatus: vi.fn(), + }; beforeEach(async () => { const module: TestingModule = await Test.createTestingModule({ @@ -184,6 +189,7 @@ describe("DealService", () => { { provide: DataStorageCheckMetrics, useValue: mockDataStorageMetrics }, { provide: RetrievalCheckMetrics, useValue: mockRetrievalMetrics }, { provide: DataSetCreationCheckMetrics, useValue: mockDataSetCreationMetrics }, + { provide: DataSetTerminationCheckMetrics, useValue: mockDataSetTerminationMetrics }, { provide: ClickhouseService, useValue: { insert: vi.fn(), probeLocation: "test" } }, { provide: DatasetLivenessService, useValue: mockDatasetLivenessService }, ], @@ -1068,6 +1074,7 @@ describe("DealService", () => { { provide: DataStorageCheckMetrics, useValue: mockDataStorageMetrics }, { provide: RetrievalCheckMetrics, useValue: mockRetrievalMetrics }, { provide: DataSetCreationCheckMetrics, useValue: mockDataSetCreationMetrics }, + { provide: DataSetTerminationCheckMetrics, useValue: mockDataSetTerminationMetrics }, { provide: ClickhouseService, useValue: { insert: vi.fn(), probeLocation: "test" } }, { provide: DatasetLivenessService, useValue: mockDatasetLivenessService }, ], @@ -1445,6 +1452,89 @@ describe("DealService", () => { }); }); + describe("terminateManagedDataSet", () => { + beforeEach(() => { + vi.spyOn(mockWalletSdkService, "getProviderInfo").mockReturnValue({ + id: 1n, + name: "sp", + isApproved: true, + } as any); + }); + 
+ it("terminates, marks deals cleaned up, and records success + duration metrics", async () => { + const terminateMock = vi.fn().mockResolvedValue("0xhash"); + const synapseMock = { + storage: { terminateDataSet: terminateMock }, + client: { waitForTransactionReceipt: vi.fn().mockResolvedValue({ status: "success" }) }, + }; + vi.spyOn(service as any, "createSynapseInstance").mockImplementation(() => synapseMock as unknown as Synapse); + + mockWarmStorageService.getDataSet.mockResolvedValueOnce({ pdpEndEpoch: 0n }); + mockWarmStorageService.getDataSet.mockResolvedValueOnce({ pdpEndEpoch: 4321n }); + + const updateFn = vi.fn().mockResolvedValue({ affected: 3 }); + const transactionMock = vi.fn(async (cb: any) => cb({ getRepository: () => ({ update: updateFn }) })); + Object.defineProperty(dealRepoMock, "manager", { + configurable: true, + value: { transaction: transactionMock }, + }); + + const result = await service.terminateManagedDataSet("0xaaa", 9n, undefined, 5_000); + + expect(terminateMock).toHaveBeenCalledWith({ dataSetId: 9n }); + expect(updateFn).toHaveBeenCalledWith( + { dataSetId: 9n, cleanedUp: false }, + expect.objectContaining({ cleanedUp: true }), + ); + expect(result).toEqual({ dealsAffected: 3, pdpEndEpoch: 4321n }); + expect(mockDataSetTerminationMetrics.recordStatus).toHaveBeenCalledWith( + expect.objectContaining({ checkType: "dataSetTermination" }), + "success", + ); + expect(mockDataSetTerminationMetrics.observeCheckDuration).toHaveBeenCalledWith( + expect.objectContaining({ checkType: "dataSetTermination" }), + expect.any(Number), + ); + }); + + it("records failure.timedout and rethrows when the signal is already aborted", async () => { + const terminateMock = vi.fn(); + const synapseMock = { + storage: { terminateDataSet: terminateMock }, + client: { waitForTransactionReceipt: vi.fn() }, + }; + vi.spyOn(service as any, "createSynapseInstance").mockImplementation(() => synapseMock as unknown as Synapse); + + const controller = new 
AbortController(); + controller.abort(new Error("Data set termination job timeout (300s)")); + + await expect(service.terminateManagedDataSet("0xaaa", 9n, controller.signal, 5_000)).rejects.toThrow(); + + expect(terminateMock).not.toHaveBeenCalled(); + expect(mockDataSetTerminationMetrics.recordStatus).toHaveBeenCalledWith( + expect.objectContaining({ checkType: "dataSetTermination" }), + "failure.timedout", + ); + }); + }); + + describe("recordDataSetTerminationSkipped", () => { + it("records skipped.no_candidate with dataSetTermination labels", () => { + vi.spyOn(mockWalletSdkService, "getProviderInfo").mockReturnValue({ + id: 7n, + name: "sp", + isApproved: false, + } as any); + + service.recordDataSetTerminationSkipped("0xaaa"); + + expect(mockDataSetTerminationMetrics.recordStatus).toHaveBeenCalledWith( + expect.objectContaining({ checkType: "dataSetTermination", providerId: "7", providerStatus: "unapproved" }), + "skipped.no_candidate", + ); + }); + }); + describe("createDeal isLive guard", () => { it("throws DealJobTerminatedDataSetError when data set is PDP-terminated; no metrics or save", async () => { const providerInfo: PDPProviderEx = { diff --git a/apps/backend/src/deal/deal.service.ts b/apps/backend/src/deal/deal.service.ts index df06ed9c..711e26a7 100644 --- a/apps/backend/src/deal/deal.service.ts +++ b/apps/backend/src/deal/deal.service.ts @@ -31,6 +31,7 @@ import type { DealPreprocessingResult } from "../deal-addons/types.js"; import { buildCheckMetricLabels, classifyFailureStatus } from "../metrics-prometheus/check-metric-labels.js"; import { DataSetCreationCheckMetrics, + DataSetTerminationCheckMetrics, DataStorageCheckMetrics, RetrievalCheckMetrics, } from "../metrics-prometheus/check-metrics.service.js"; @@ -69,6 +70,7 @@ export class DealService implements OnModuleInit, OnModuleDestroy { private readonly dataStorageMetrics: DataStorageCheckMetrics, private readonly retrievalMetrics: RetrievalCheckMetrics, private readonly 
dataSetCreationMetrics: DataSetCreationCheckMetrics, + private readonly dataSetTerminationMetrics: DataSetTerminationCheckMetrics, private readonly clickhouseService: ClickhouseService, private readonly datasetLivenessService: DatasetLivenessService, ) { @@ -732,7 +734,9 @@ export class DealService implements OnModuleInit, OnModuleDestroy { } /** - * Repair a PDP-terminated dataset (FWSS may or may not have flipped pdpEndEpoch). + * Terminate a dataset on-chain (if needed) and wait for FWSS to confirm + * `pdpEndEpoch != 0`. Shared by the `data_set_creation` repair path and the + * `data_set_termination` canary job. * * Idempotent sequence: * 1. Read FWSS pdpEndEpoch. If already non-zero, skip the on-chain call. @@ -740,74 +744,96 @@ export class DealService implements OnModuleInit, OnModuleDestroy { * FWSS pdpEndEpoch until non-zero. A revert that matches a known * already-terminated message is treated as a no-op and falls through * to the poll, so a partially-completed prior run can complete. - * 3. Mark every Deal row with this dataSetId as cleaned up in a single - * transaction (filtered on cleaned_up=false, so re-runs do not double-write). + * + * Returns the confirmed non-zero `pdpEndEpoch`. Throws on abort or poll timeout. */ - async repairTerminatedDataSet( + private async ensureDataSetTerminated( providerAddress: string, dataSetId: bigint, signal?: AbortSignal, pollTimeoutMs = 60_000, - ): Promise<{ dealsAffected: number; pdpEndEpoch: bigint }> { + ): Promise<bigint> { signal?.throwIfAborted(); const synapse = this.sharedSynapse ??
(await this.createSynapseInstance()); - const providerInfo = this.walletSdkService.getProviderInfo(providerAddress); const { warmStorageService } = this.walletSdkService.getWalletServices(); - let pdpEndEpoch: bigint; const existing = await awaitWithAbort(warmStorageService.getDataSet({ dataSetId }), signal); if (existing != null && existing.pdpEndEpoch !== 0n) { - pdpEndEpoch = existing.pdpEndEpoch; this.logger.log({ event: "dataset_already_terminated", message: "FWSS pdpEndEpoch already set; skipping terminateDataSet", providerAddress, dataSetId: dataSetId.toString(), - pdpEndEpoch: pdpEndEpoch.toString(), + pdpEndEpoch: existing.pdpEndEpoch.toString(), }); - } else { - let txHash: `0x${string}` | undefined; + return existing.pdpEndEpoch; + } + + let txHash: `0x${string}` | undefined; + try { + txHash = await awaitWithAbort(synapse.storage.terminateDataSet({ dataSetId }), signal); + } catch (error) { + if (signal?.aborted) throw error; + const message = error instanceof Error ? error.message : String(error); + if (!/already.*terminat|service.*terminated|pdpEndEpoch.*set/i.test(message)) { + throw error; + } + this.logger.warn({ + event: "dataset_terminate_already_handled", + message: "terminateDataSet reverted as already-terminated; continuing to poll", + providerAddress, + dataSetId: dataSetId.toString(), + revert: message, + }); + } + signal?.throwIfAborted(); + if (txHash != null) { try { - txHash = await awaitWithAbort(synapse.storage.terminateDataSet({ dataSetId }), signal); + await awaitWithAbort(synapse.client.waitForTransactionReceipt({ hash: txHash }), signal); } catch (error) { if (signal?.aborted) throw error; - const message = error instanceof Error ? 
error.message : String(error); - if (!/already.*terminat|service.*terminated|pdpEndEpoch.*set/i.test(message)) { - throw error; - } this.logger.warn({ - event: "dataset_terminate_already_handled", - message: "terminateDataSet reverted as already-terminated; continuing to poll", + event: "dataset_terminate_receipt_wait_failed", + message: "Receipt wait failed; falling back to FWSS state poll", providerAddress, dataSetId: dataSetId.toString(), - revert: message, + txHash, + error: toStructuredError(error), }); } - signal?.throwIfAborted(); - if (txHash != null) { - try { - await awaitWithAbort(synapse.client.waitForTransactionReceipt({ hash: txHash }), signal); - } catch (error) { - if (signal?.aborted) throw error; - this.logger.warn({ - event: "dataset_terminate_receipt_wait_failed", - message: "Receipt wait failed; falling back to FWSS state poll", - providerAddress, - dataSetId: dataSetId.toString(), - txHash, - error: toStructuredError(error), - }); - } - } - pdpEndEpoch = await this.waitForPdpEndEpoch(dataSetId, pollTimeoutMs, signal); } + return this.waitForPdpEndEpoch(dataSetId, pollTimeoutMs, signal); + } - const result = await this.dealRepository.manager.transaction(async (manager) => { + /** + * Mark every Deal row with `dataSetId` as cleaned up in a single transaction. + * Filtered on cleaned_up=false so re-runs do not double-write. Returns affected count. + */ + private async markDataSetDealsCleanedUp(dataSetId: bigint): Promise<number> { + return this.dealRepository.manager.transaction(async (manager) => { const update = await manager .getRepository(Deal) .update({ dataSetId, cleanedUp: false }, { cleanedUp: true, cleanedUpAt: new Date() }); return update.affected ?? 0; }); + } + + /** + * Repair a PDP-terminated dataset (FWSS may or may not have flipped pdpEndEpoch). + * + * Idempotent sequence: + * 1-2. Terminate on-chain and confirm pdpEndEpoch != 0 (see ensureDataSetTerminated). + * 3. Mark every Deal row with this dataSetId as cleaned up.
+ */ + async repairTerminatedDataSet( + providerAddress: string, + dataSetId: bigint, + signal?: AbortSignal, + pollTimeoutMs = 60_000, + ): Promise<{ dealsAffected: number; pdpEndEpoch: bigint }> { + const providerInfo = this.walletSdkService.getProviderInfo(providerAddress); + const pdpEndEpoch = await this.ensureDataSetTerminated(providerAddress, dataSetId, signal, pollTimeoutMs); + const dealsAffected = await this.markDataSetDealsCleanedUp(dataSetId); this.logger.log({ event: "dataset_terminated_repaired", @@ -816,10 +842,102 @@ export class DealService implements OnModuleInit, OnModuleDestroy { providerId: providerInfo?.id, dataSetId: dataSetId.toString(), pdpEndEpoch: pdpEndEpoch.toString(), - dealsAffected: result, + dealsAffected, + }); + + return { dealsAffected, pdpEndEpoch }; + } + + /** + * Terminate a single dealbot-managed dataset slot and confirm on-chain, emitting + * data-set termination metrics. Used by the `data_set_termination` canary job. + * + * Mechanically identical to {@link repairTerminatedDataSet} (terminate -> confirm + * `pdpEndEpoch != 0` -> mark deals cleaned up) but additionally records + * `dataSetTerminationStatus` and `dataSetTerminationMs` so the canary trigger is + * observable. An abort (job timeout) or an internal poll timeout is classified as + * `failure.timedout`; pg-boss retries on the next scheduled tick. 
+ */ + async terminateManagedDataSet( + providerAddress: string, + dataSetId: bigint, + signal?: AbortSignal, + pollTimeoutMs = 60_000, + ): Promise<{ dealsAffected: number; pdpEndEpoch: bigint }> { + const providerInfo = this.walletSdkService.getProviderInfo(providerAddress); + const labels = buildCheckMetricLabels({ + checkType: "dataSetTermination", + providerId: providerInfo?.id, + providerName: providerInfo?.name, + providerIsApproved: providerInfo?.isApproved, }); - return { dealsAffected: result, pdpEndEpoch }; + const startedAt = Date.now(); + this.logger.log({ + event: "dataset_termination_started", + message: "Starting managed data-set termination", + providerAddress, + providerId: providerInfo?.id, + providerName: providerInfo?.name, + dataSetId: dataSetId.toString(), + }); + + try { + const pdpEndEpoch = await this.ensureDataSetTerminated(providerAddress, dataSetId, signal, pollTimeoutMs); + const dealsAffected = await this.markDataSetDealsCleanedUp(dataSetId); + const durationMs = Date.now() - startedAt; + + this.dataSetTerminationMetrics.observeCheckDuration(labels, durationMs); + this.dataSetTerminationMetrics.recordStatus(labels, "success"); + this.logger.log({ + event: "dataset_termination_succeeded", + message: "Terminated managed data-set; data_set_creation will replenish the slot", + providerAddress, + providerId: providerInfo?.id, + providerName: providerInfo?.name, + dataSetId: dataSetId.toString(), + pdpEndEpoch: pdpEndEpoch.toString(), + dealsAffected, + durationMs, + }); + return { dealsAffected, pdpEndEpoch }; + } catch (error) { + const durationMs = Date.now() - startedAt; + // An abort (job-level timeout) or an internal poll timeout both count as failure.timedout. + const status = signal?.aborted ? 
"failure.timedout" : classifyFailureStatus(error); + if (status === "failure.timedout") { + this.dataSetTerminationMetrics.observeCheckDuration(labels, durationMs); + } + this.dataSetTerminationMetrics.recordStatus(labels, status); + this.logger.error({ + event: "dataset_termination_failed", + message: "Managed data-set termination failed", + providerAddress, + providerId: providerInfo?.id, + providerName: providerInfo?.name, + dataSetId: dataSetId.toString(), + durationMs, + status, + error: toStructuredError(error), + }); + throw error; + } + } + + /** + * Record a `skipped.no_candidate` data-set termination outcome for a provider. + * Emitted by the termination handler when every eligible slot resolves as `missing` + * (nothing live/terminated to act on this tick). + */ + recordDataSetTerminationSkipped(providerAddress: string): void { + const providerInfo = this.walletSdkService.getProviderInfo(providerAddress); + const labels = buildCheckMetricLabels({ + checkType: "dataSetTermination", + providerId: providerInfo?.id, + providerName: providerInfo?.name, + providerIsApproved: providerInfo?.isApproved, + }); + this.dataSetTerminationMetrics.recordStatus(labels, "skipped.no_candidate"); } /** diff --git a/apps/backend/src/jobs/data-set-termination.handler.spec.ts b/apps/backend/src/jobs/data-set-termination.handler.spec.ts new file mode 100644 index 00000000..3d1eba95 --- /dev/null +++ b/apps/backend/src/jobs/data-set-termination.handler.spec.ts @@ -0,0 +1,152 @@ +import { describe, expect, it, vi } from "vitest"; +import type { ProviderJobContext } from "../common/logging.js"; +import { terminateNextDataSet } from "./data-set-termination.handler.js"; + +const logContext: ProviderJobContext = { + jobId: "job-term", + providerAddress: "0xaaa", + providerId: 1n, + providerName: "sp", +}; + +const makeLogger = () => ({ log: vi.fn(), warn: vi.fn(), error: vi.fn(), debug: vi.fn() }) as any; + +const POLL_TIMEOUT_MS = 60_000; + +describe("terminateNextDataSet", () => { 
+ it("terminates a live slot in the canary window and stops", async () => { + const dealService = { + getDataSetProvisioningStatus: vi.fn(async () => ({ status: "live" as const, dataSetId: 42n })), + terminateManagedDataSet: vi.fn(async () => ({ dealsAffected: 2, pdpEndEpoch: 10n })), + recordDataSetTerminationSkipped: vi.fn(), + }; + + // window [1, 2): only slot index 1 is eligible + await terminateNextDataSet({ dealService, logger: makeLogger() }, "0xaaa", 2, 1, {}, logContext, POLL_TIMEOUT_MS); + + expect(dealService.getDataSetProvisioningStatus).toHaveBeenCalledWith("0xaaa", { dealbotDS: "1" }, undefined); + expect(dealService.terminateManagedDataSet).toHaveBeenCalledWith("0xaaa", 42n, undefined, POLL_TIMEOUT_MS); + expect(dealService.recordDataSetTerminationSkipped).not.toHaveBeenCalled(); + }); + + it("skips missing slots and terminates the live/terminated one regardless of scan order", async () => { + // window [1, 3): index 1 is missing, index 2 is terminated -> index 2 must be the one terminated + const dealService = { + getDataSetProvisioningStatus: vi.fn(async (_sp: string, metadata: Record<string, string>) => { + if (metadata.dealbotDS === "2") { + return { status: "terminated" as const, dataSetId: 99n }; + } + return { status: "missing" as const }; + }), + terminateManagedDataSet: vi.fn(async () => ({ dealsAffected: 0, pdpEndEpoch: 5n })), + recordDataSetTerminationSkipped: vi.fn(), + }; + + await terminateNextDataSet({ dealService, logger: makeLogger() }, "0xaaa", 3, 1, {}, logContext, POLL_TIMEOUT_MS); + + expect(dealService.terminateManagedDataSet).toHaveBeenCalledTimes(1); + expect(dealService.terminateManagedDataSet).toHaveBeenCalledWith("0xaaa", 99n, undefined, POLL_TIMEOUT_MS); + expect(dealService.recordDataSetTerminationSkipped).not.toHaveBeenCalled(); + }); + + it("records skipped.no_candidate when every candidate slot is missing", async () => { + const dealService = { + getDataSetProvisioningStatus: vi.fn(async () => ({ status: "missing" as const })), +
terminateManagedDataSet: vi.fn(), + recordDataSetTerminationSkipped: vi.fn(), + }; + + await terminateNextDataSet({ dealService, logger: makeLogger() }, "0xaaa", 4, 1, {}, logContext, POLL_TIMEOUT_MS); + + expect(dealService.terminateManagedDataSet).not.toHaveBeenCalled(); + expect(dealService.recordDataSetTerminationSkipped).toHaveBeenCalledWith("0xaaa"); + // All three candidate slots (1,2,3) were probed. + expect(dealService.getDataSetProvisioningStatus).toHaveBeenCalledTimes(3); + }); + + it("never probes slots below minIndex", async () => { + const probed: Array = []; + const dealService = { + getDataSetProvisioningStatus: vi.fn(async (_sp: string, metadata: Record) => { + probed.push(metadata.dealbotDS); + return { status: "missing" as const }; + }), + terminateManagedDataSet: vi.fn(), + recordDataSetTerminationSkipped: vi.fn(), + }; + + // window [3, 5): only slots 3 and 4 are eligible; 0,1,2 are protected + await terminateNextDataSet({ dealService, logger: makeLogger() }, "0xaaa", 5, 3, {}, logContext, POLL_TIMEOUT_MS); + + expect(probed.sort()).toEqual(["3", "4"]); + }); + + it("never probes the baseline slot 0 even if minIndex is misconfigured below 1", async () => { + const probed: Array = []; + const dealService = { + getDataSetProvisioningStatus: vi.fn(async (_sp: string, metadata: Record) => { + probed.push(metadata.dealbotDS); + return { status: "missing" as const }; + }), + terminateManagedDataSet: vi.fn(), + recordDataSetTerminationSkipped: vi.fn(), + }; + + // minIndex=0 should be clamped to 1; slot 0 (no dealbotDS tag) must never be probed + await terminateNextDataSet({ dealService, logger: makeLogger() }, "0xaaa", 3, 0, {}, logContext, POLL_TIMEOUT_MS); + + expect(probed.sort()).toEqual(["1", "2"]); + expect(probed).not.toContain(undefined); + }); + + it("merges base dataset metadata into each slot's lookup", async () => { + const dealService = { + getDataSetProvisioningStatus: vi.fn(async () => ({ status: "live" as const, dataSetId: 1n })), + 
terminateManagedDataSet: vi.fn(async () => ({ dealsAffected: 0, pdpEndEpoch: 1n })), + recordDataSetTerminationSkipped: vi.fn(), + }; + + await terminateNextDataSet( + { dealService, logger: makeLogger() }, + "0xaaa", + 2, + 1, + { withIPFSIndexing: "", dealbotDataSetVersion: "v1" }, + logContext, + POLL_TIMEOUT_MS, + ); + + expect(dealService.getDataSetProvisioningStatus).toHaveBeenCalledWith( + "0xaaa", + { withIPFSIndexing: "", dealbotDataSetVersion: "v1", dealbotDS: "1" }, + undefined, + ); + }); + + it("stops immediately and does not terminate when the signal is already aborted", async () => { + const dealService = { + getDataSetProvisioningStatus: vi.fn(async () => ({ status: "live" as const, dataSetId: 1n })), + terminateManagedDataSet: vi.fn(), + recordDataSetTerminationSkipped: vi.fn(), + }; + + const controller = new AbortController(); + controller.abort(new Error("Job timed out")); + + await expect( + terminateNextDataSet( + { dealService, logger: makeLogger() }, + "0xaaa", + 2, + 1, + {}, + logContext, + POLL_TIMEOUT_MS, + controller.signal, + ), + ).rejects.toThrow("Job timed out"); + + expect(dealService.terminateManagedDataSet).not.toHaveBeenCalled(); + expect(dealService.recordDataSetTerminationSkipped).not.toHaveBeenCalled(); + }); +}); diff --git a/apps/backend/src/jobs/data-set-termination.handler.ts b/apps/backend/src/jobs/data-set-termination.handler.ts new file mode 100644 index 00000000..3828ba9f --- /dev/null +++ b/apps/backend/src/jobs/data-set-termination.handler.ts @@ -0,0 +1,122 @@ +import type { Logger } from "@nestjs/common"; +import type { DataSetLogContext, ProviderJobContext } from "../common/logging.js"; +import type { DealService } from "../deal/deal.service.js"; + +export interface DataSetTerminationDeps { + dealService: Pick< + DealService, + "getDataSetProvisioningStatus" | "terminateManagedDataSet" | "recordDataSetTerminationSkipped" + >; + logger: Logger; +} + +/** + * Returns a randomly shuffled copy of the candidate slot 
indices `[start, minDataSets)`, + * where `start = max(1, minIndex)`. Fisher-Yates. + * + * The lower bound is clamped to `1` so slot `0` data-set is never a candidate. + */ +function shuffledCandidateIndices(minIndex: number, minDataSets: number): number[] { + const start = Math.max(1, minIndex); + const indices: number[] = []; + for (let i = start; i < minDataSets; i++) { + indices.push(i); + } + for (let i = indices.length - 1; i > 0; i--) { + const j = Math.floor(Math.random() * (i + 1)); + [indices[i], indices[j]] = [indices[j], indices[i]]; + } + return indices; +} + +/** + * Terminates at most one managed data-set slot per invocation (the canary trigger). + * + * Scans the canary window `[minIndex, minDataSets)` in random order and acts on the + * first slot that is `live` or `terminated`: + * - terminate it on-chain and wait for FWSS `pdpEndEpoch != 0`, marking its deals + * cleaned up. `data_set_creation` recreates the resulting `missing` slot on a + * later tick. + * `missing` slots are skipped (nothing to terminate; a replacement is already pending + * in `data_set_creation`). If every candidate is `missing`, emits `skipped.no_candidate`. + * + * Slots `0..(minIndex - 1)` are never touched, and slot `0` (the baseline data-set) is + * always protected because the canary window starts at `max(1, minIndex)`. Every + * candidate index is therefore `>= 1` and is tagged with `{ dealbotDS: String(i) }`, + * matching the slot metadata produced by `data_set_creation`. 
+ */ +export async function terminateNextDataSet( + deps: DataSetTerminationDeps, + spAddress: string, + minDataSets: number, + minIndex: number, + baseDataSetMetadata: Record<string, string>, + dataSetLogContext: ProviderJobContext, + pollTimeoutMs: number, + signal?: AbortSignal, +): Promise<void> { + const { dealService, logger } = deps; + + const candidates = shuffledCandidateIndices(minIndex, minDataSets); + let skippedMissingCount = 0; + + for (const i of candidates) { + signal?.throwIfAborted(); + + // Candidates are always >= 1 (slot 0 is the protected baseline), so every slot in + // the canary window carries its dealbotDS tag, matching data_set_creation's slots. + const metadata: Record<string, string> = { + ...baseDataSetMetadata, + dealbotDS: String(i), + }; + + const logContext: DataSetLogContext = { + ...dataSetLogContext, + metadata, + dataSetIndex: i, + }; + + const status = await dealService.getDataSetProvisioningStatus(spAddress, metadata, signal); + + if (status.status === "missing") { + skippedMissingCount++; + logger.debug({ + ...logContext, + event: "data_set_termination_slot_skipped_missing", + message: "Slot is missing; nothing to terminate (data_set_creation will replenish it)", + }); + continue; + } + + logger.log({ + ...logContext, + event: "terminating_data_set", + message: "Terminating managed data-set slot", + slotStatus: status.status, + dataSetId: status.dataSetId.toString(), + }); + const result = await dealService.terminateManagedDataSet(spAddress, status.dataSetId, signal, pollTimeoutMs); + logger.log({ + ...logContext, + event: "data_set_termination_completed", + message: "Terminated managed data-set; deferring recreation to data_set_creation", + dataSetId: status.dataSetId.toString(), + dealsAffected: result.dealsAffected, + skippedMissingCount, + }); + return; + } + + // Every candidate slot resolved as `missing`: nothing to terminate this tick. This is + // expected right after a termination when data_set_creation has not yet replenished + // the slot.
Persistent skips indicate creation is lagging behind termination. + dealService.recordDataSetTerminationSkipped(spAddress); + logger.log({ + ...dataSetLogContext, + event: "data_set_termination_skipped_no_candidate", + message: "No eligible slot to terminate; all candidate slots are missing", + minDataSets, + minIndex, + skippedMissingCount, + }); +} diff --git a/apps/backend/src/jobs/jobs.service.spec.ts b/apps/backend/src/jobs/jobs.service.spec.ts index b25cd552..2a12fad8 100644 --- a/apps/backend/src/jobs/jobs.service.spec.ts +++ b/apps/backend/src/jobs/jobs.service.spec.ts @@ -25,6 +25,7 @@ describe("JobsService schedule rows", () => { let jobScheduleRepositoryMock: { upsertSchedule: ReturnType; deleteSchedulesForInactiveProviders: ReturnType; + deleteSchedulesByJobType: ReturnType; countPausedSchedules: ReturnType; findDueSchedulesWithManager: ReturnType; runTransaction: ReturnType; @@ -86,6 +87,7 @@ describe("JobsService schedule rows", () => { jobScheduleRepositoryMock = { upsertSchedule: vi.fn(), deleteSchedulesForInactiveProviders: vi.fn(async () => []), + deleteSchedulesByJobType: vi.fn(async () => 0), countPausedSchedules: vi.fn(async () => []), findDueSchedulesWithManager: vi.fn(), runTransaction: vi.fn(async (callback: (manager: unknown) => Promise) => { @@ -123,7 +125,11 @@ describe("JobsService schedule rows", () => { baseConfigValues = { app: { runMode: "both" } as IConfig["app"], - blockchain: { useOnlyApprovedProviders: false, minNumDataSetsForChecks: 1 } as IConfig["blockchain"], + blockchain: { + useOnlyApprovedProviders: false, + minNumDataSetsForChecks: 1, + dataSetTerminationMinIndex: 1, + } as IConfig["blockchain"], scheduling: { providersRefreshIntervalSeconds: 4 * 3600, dataRetentionPollIntervalSeconds: 3600, @@ -139,6 +145,9 @@ describe("JobsService schedule rows", () => { dealJobTimeoutSeconds: 360, retrievalJobTimeoutSeconds: 60, dataSetCreationJobTimeoutSeconds: 300, + dataSetTerminationEnabled: false, + 
dataSetTerminationsPerSpPerHour: 1, + dataSetTerminationJobTimeoutSeconds: 300, shutdownFinalScrapeDelaySeconds: 35, pieceCleanupPerSpPerHour: 1, maxPieceCleanupRuntimeSeconds: 300, @@ -1250,6 +1259,166 @@ describe("JobsService schedule rows", () => { expect(dealService.createDataSetWithPiece).not.toHaveBeenCalled(); }); + it("data_set_termination job skips when disabled", async () => { + baseConfigValues = { + ...baseConfigValues, + blockchain: { + ...baseConfigValues.blockchain, + minNumDataSetsForChecks: 3, + dataSetTerminationMinIndex: 1, + } as IConfig["blockchain"], + jobs: { ...baseConfigValues.jobs, dataSetTerminationEnabled: false } as IConfig["jobs"], + }; + configService = { + get: vi.fn((key: keyof IConfig) => baseConfigValues[key]), + } as unknown as JobsServiceDeps[0]; + + const dealService = { + getBaseDataSetMetadata: vi.fn(() => ({})), + getDataSetProvisioningStatus: vi.fn(), + terminateManagedDataSet: vi.fn(), + recordDataSetTerminationSkipped: vi.fn(), + }; + const walletSdkService = { getProviderInfo: vi.fn(() => ({ id: 1, name: "test-provider" })) }; + + service = buildService({ + configService, + dealService: dealService as unknown as ConstructorParameters[3], + walletSdkService: walletSdkService as unknown as ConstructorParameters[5], + }); + + await callPrivate(service, "handleDataSetTerminationJob", { + id: "job-term-1", + data: { jobType: "data_set_termination", spAddress: "0xaaa", intervalSeconds: 3600 }, + }); + + expect(dealService.getDataSetProvisioningStatus).not.toHaveBeenCalled(); + expect(dealService.terminateManagedDataSet).not.toHaveBeenCalled(); + }); + + it("data_set_termination job skips when canary window is empty", async () => { + baseConfigValues = { + ...baseConfigValues, + blockchain: { + ...baseConfigValues.blockchain, + minNumDataSetsForChecks: 2, + dataSetTerminationMinIndex: 2, // window = 2 - 2 = 0 + } as IConfig["blockchain"], + jobs: { ...baseConfigValues.jobs, dataSetTerminationEnabled: true } as IConfig["jobs"], 
+ }; + configService = { + get: vi.fn((key: keyof IConfig) => baseConfigValues[key]), + } as unknown as JobsServiceDeps[0]; + + const dealService = { + getBaseDataSetMetadata: vi.fn(() => ({})), + getDataSetProvisioningStatus: vi.fn(), + terminateManagedDataSet: vi.fn(), + recordDataSetTerminationSkipped: vi.fn(), + }; + const walletSdkService = { getProviderInfo: vi.fn(() => ({ id: 1, name: "test-provider" })) }; + + service = buildService({ + configService, + dealService: dealService as unknown as ConstructorParameters[3], + walletSdkService: walletSdkService as unknown as ConstructorParameters[5], + }); + + await callPrivate(service, "handleDataSetTerminationJob", { + id: "job-term-2", + data: { jobType: "data_set_termination", spAddress: "0xaaa", intervalSeconds: 3600 }, + }); + + expect(dealService.getDataSetProvisioningStatus).not.toHaveBeenCalled(); + expect(dealService.terminateManagedDataSet).not.toHaveBeenCalled(); + }); + + it("data_set_termination job terminates a slot in the canary window when enabled", async () => { + baseConfigValues = { + ...baseConfigValues, + blockchain: { + ...baseConfigValues.blockchain, + minNumDataSetsForChecks: 2, + dataSetTerminationMinIndex: 1, // window = [1, 2) + } as IConfig["blockchain"], + jobs: { ...baseConfigValues.jobs, dataSetTerminationEnabled: true } as IConfig["jobs"], + }; + configService = { + get: vi.fn((key: keyof IConfig) => baseConfigValues[key]), + } as unknown as JobsServiceDeps[0]; + + const dealService = { + getBaseDataSetMetadata: vi.fn(() => ({})), + getDataSetProvisioningStatus: vi.fn(async () => ({ status: "live" as const, dataSetId: 55n })), + terminateManagedDataSet: vi.fn(async () => ({ dealsAffected: 1, pdpEndEpoch: 9n })), + recordDataSetTerminationSkipped: vi.fn(), + }; + const walletSdkService = { getProviderInfo: vi.fn(() => ({ id: 1, name: "test-provider" })) }; + + service = buildService({ + configService, + dealService: dealService as unknown as ConstructorParameters[3], + 
walletSdkService: walletSdkService as unknown as ConstructorParameters[5], + }); + + await callPrivate(service, "handleDataSetTerminationJob", { + id: "job-term-3", + data: { jobType: "data_set_termination", spAddress: "0xaaa", intervalSeconds: 3600 }, + }); + + expect(dealService.getDataSetProvisioningStatus).toHaveBeenCalledWith( + "0xaaa", + { dealbotDS: "1" }, + expect.any(AbortSignal), + ); + expect(dealService.terminateManagedDataSet).toHaveBeenCalledWith( + "0xaaa", + 55n, + expect.any(AbortSignal), + expect.any(Number), + ); + }); + + it("creates data_set_termination schedules when enabled with a non-empty window", async () => { + baseConfigValues = { + ...baseConfigValues, + blockchain: { + ...baseConfigValues.blockchain, + minNumDataSetsForChecks: 3, + dataSetTerminationMinIndex: 1, + } as IConfig["blockchain"], + jobs: { ...baseConfigValues.jobs, dataSetTerminationEnabled: true } as IConfig["jobs"], + }; + configService = { + get: vi.fn((key: keyof IConfig) => baseConfigValues[key]), + } as unknown as JobsServiceDeps[0]; + service = buildService({ configService }); + + storageProviderRepositoryMock.find.mockResolvedValueOnce([{ address: "0xaaa" }]); + + await callPrivate(service, "ensureScheduleRows"); + + const terminationUpserts = jobScheduleRepositoryMock.upsertSchedule.mock.calls.filter( + (call) => call[0] === "data_set_termination", + ); + expect(terminationUpserts).toHaveLength(1); + expect(terminationUpserts[0][1]).toBe("0xaaa"); + expect(jobScheduleRepositoryMock.deleteSchedulesByJobType).not.toHaveBeenCalled(); + }); + + it("removes data_set_termination schedules when disabled", async () => { + // base config has dataSetTerminationEnabled=false + storageProviderRepositoryMock.find.mockResolvedValueOnce([{ address: "0xaaa" }]); + + await callPrivate(service, "ensureScheduleRows"); + + const terminationUpserts = jobScheduleRepositoryMock.upsertSchedule.mock.calls.filter( + (call) => call[0] === "data_set_termination", + ); + 
expect(terminationUpserts).toHaveLength(0); + expect(jobScheduleRepositoryMock.deleteSchedulesByJobType).toHaveBeenCalledWith("data_set_termination"); + }); + it("sets active, inactive, and tested provider gauge values after refresh", async () => { storageProviderRepositoryMock.count .mockResolvedValueOnce(10) // totalProviders diff --git a/apps/backend/src/jobs/jobs.service.ts b/apps/backend/src/jobs/jobs.service.ts index 957ce65a..e1d349a0 100644 --- a/apps/backend/src/jobs/jobs.service.ts +++ b/apps/backend/src/jobs/jobs.service.ts @@ -19,6 +19,7 @@ import { PullCheckService } from "../pull-check/pull-check.service.js"; import { RetrievalService } from "../retrieval/retrieval.service.js"; import { WalletSdkService } from "../wallet-sdk/wallet-sdk.service.js"; import { provisionNextMissingDataSet } from "./data-set-creation.handler.js"; +import { terminateNextDataSet } from "./data-set-termination.handler.js"; import { DATA_RETENTION_POLL_QUEUE, PROVIDERS_REFRESH_QUEUE, @@ -27,11 +28,12 @@ import { } from "./job-queues.js"; import { JobScheduleRepository } from "./repositories/job-schedule.repository.js"; -type SpJobType = "deal" | "retrieval" | "data_set_creation" | "piece_cleanup" | "pull_check"; +type SpJobType = "deal" | "retrieval" | "data_set_creation" | "data_set_termination" | "piece_cleanup" | "pull_check"; const SP_JOB_TYPES: ReadonlySet = new Set([ "deal", "retrieval", "data_set_creation", + "data_set_termination", "piece_cleanup", "pull_check", ]); @@ -168,12 +170,38 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { return; } + this.warnIfTerminationRateExceedsCreation(); + await this.tick(); this.schedulerInterval = setInterval(() => { void this.tick(); }, this.schedulerPollMs()); } + /** + * Emits a startup warning when the termination rate exceeds the creation rate. 
+ * + * If `data_set_termination` runs faster than `data_set_creation`, the missing-slot + * backlog accumulates and the loop stops behaving like a simple steady-state canary. + * See docs/data-set-termination.md#relationship-to-data_set_creation. + */ + private warnIfTerminationRateExceedsCreation(): void { + const jobs = this.configService.get("jobs"); + if (!jobs.dataSetTerminationEnabled) { + return; + } + if (jobs.dataSetTerminationsPerSpPerHour > jobs.dataSetCreationsPerSpPerHour) { + this.logger.warn({ + event: "data_set_termination_rate_exceeds_creation", + message: + "DATASET_TERMINATIONS_PER_SP_PER_HOUR exceeds DATASET_CREATIONS_PER_SP_PER_HOUR; " + + "terminations may outpace creation and accumulate a missing-slot backlog.", + dataSetTerminationsPerSpPerHour: jobs.dataSetTerminationsPerSpPerHour, + dataSetCreationsPerSpPerHour: jobs.dataSetCreationsPerSpPerHour, + }); + } + } + /** * Cleans up resources on shutdown. * Stops the polling loop and gracefully stops pg-boss. @@ -204,6 +232,7 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { jobs.dealJobTimeoutSeconds, jobs.retrievalJobTimeoutSeconds, jobs.dataSetCreationJobTimeoutSeconds, + jobs.dataSetTerminationJobTimeoutSeconds, pullPiece.pullCheckJobTimeoutSeconds, ); const stopTimeoutMs = (longestJobTimeoutSec + 60) * 1000; @@ -333,6 +362,10 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { await this.handleDataSetCreationJob(job); return; } + if (job.data.jobType === "data_set_termination") { + await this.handleDataSetTerminationJob(job); + return; + } if (job.data.jobType === "piece_cleanup") { await this.handlePieceCleanupJob(job); return; @@ -919,6 +952,112 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { }); } + /** + * Handles one `data_set_termination` invocation for a provider. 
+ * + * Terminates at most one managed dataset slot in the canary window so the existing + * `data_set_creation` job recreates it, keeping the on-chain createDataSet path + * continuously exercised. Gated by `DATASET_TERMINATION_ENABLED` and a non-empty + * canary window (`MIN_NUM_DATASETS_FOR_CHECKS - DATA_SET_TERMINATION_MIN_INDEX > 0`). + */ + private async handleDataSetTerminationJob(job: SpJob): Promise<void> { + const data = job.data; + const spAddress = data.spAddress; + const now = new Date(); + const maintenance = this.getMaintenanceWindowStatus(now); + if (maintenance.active) { + this.logMaintenanceSkip(`data_set_termination job for ${spAddress}`, maintenance.window?.label, { + jobId: job.id, + providerAddress: spAddress, + providerId: this.walletSdkService.getProviderInfo(spAddress)?.id, + providerName: this.walletSdkService.getProviderInfo(spAddress)?.name, + }); + await this.deferJobForMaintenance("data_set_termination", data, maintenance, now); + return; + } + + const blockchain = this.configService.get("blockchain"); + const jobsConfig = this.configService.get("jobs"); + const minDataSets = blockchain.minNumDataSetsForChecks; + const minIndex = blockchain.dataSetTerminationMinIndex; + // Defensive gate: schedules are only created when enabled with a non-empty canary + // window, but a stale enqueued job (e.g. after disabling) must still no-op safely. 
+ if (!jobsConfig.dataSetTerminationEnabled || minDataSets - minIndex <= 0) { + this.logger.log({ + jobId: job.id, + providerAddress: spAddress, + providerId: this.walletSdkService.getProviderInfo(spAddress)?.id, + providerName: this.walletSdkService.getProviderInfo(spAddress)?.name, + event: "data_set_termination_job_disabled", + message: "Data set termination job skipped: disabled or empty canary window", + enabled: jobsConfig.dataSetTerminationEnabled, + minDataSets, + minIndex, + }); + return; + } + + const baseDataSetMetadata = this.dealService.getBaseDataSetMetadata(); + + // Create AbortController for job timeout enforcement + const abortController = new AbortController(); + const timeoutSeconds = jobsConfig.dataSetTerminationJobTimeoutSeconds; + const timeoutMs = Math.max(60000, timeoutSeconds * 1000); + const effectiveTimeoutSeconds = Math.round(timeoutMs / 1000); + const abortReason = new Error(`Data set termination job timeout (${effectiveTimeoutSeconds}s) for ${spAddress}`); + const timeoutId = setTimeout(() => { + abortController.abort(abortReason); + }, timeoutMs); + + await this.recordJobExecution("data_set_termination", async () => { + const dataSetLogContext = await this.resolveRunnableProviderJobContext( + "data_set_termination", + spAddress, + job.id, + "Data set termination job skipped: provider is blocked for scheduled data-storage checks", + ); + if (dataSetLogContext == null) { + clearTimeout(timeoutId); + return "success"; + } + try { + await terminateNextDataSet( + { dealService: this.dealService, logger: this.logger }, + spAddress, + minDataSets, + minIndex, + baseDataSetMetadata, + dataSetLogContext, + timeoutMs, + abortController.signal, + ); + return "success"; + } catch (error) { + if (abortController.signal.aborted) { + const reason = abortController.signal.reason; + const reasonMessage = reason instanceof Error ? reason.message : String(reason ?? 
""); + this.logger.error({ + ...dataSetLogContext, + event: "data_set_termination_job_aborted", + message: reasonMessage || "Data set termination job aborted after timeout", + timeoutSeconds: effectiveTimeoutSeconds, + error: toStructuredError(reason ?? error), + }); + return "aborted"; + } + this.logger.error({ + ...dataSetLogContext, + event: "data_set_termination_job_failed", + message: "Data set termination job failed", + error: toStructuredError(error), + }); + throw error; + } finally { + clearTimeout(timeoutId); + } + }); + } + private maintenanceResumeAt(now: Date, maintenance: ReturnType): Date | null { if (!maintenance.active || !maintenance.window) { return null; @@ -1009,6 +1148,7 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { dealIntervalSeconds: number; retrievalIntervalSeconds: number; dataSetCreationIntervalSeconds: number; + dataSetTerminationIntervalSeconds: number; dataRetentionPollIntervalSeconds: number; providersRefreshIntervalSeconds: number; pieceCleanupIntervalSeconds: number; @@ -1022,12 +1162,14 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { const dealsPerHour = jobsConfig.dealsPerSpPerHour; const retrievalsPerHour = jobsConfig.retrievalsPerSpPerHour; const dataSetCreationsPerHour = jobsConfig.dataSetCreationsPerSpPerHour; + const dataSetTerminationsPerHour = jobsConfig.dataSetTerminationsPerSpPerHour; const pieceCleanupPerHour = jobsConfig.pieceCleanupPerSpPerHour; const pullChecksPerHour = pullPieceConfig.pullChecksPerSpPerHour; const dealIntervalSeconds = Math.max(1, Math.round(3600 / dealsPerHour)); const retrievalIntervalSeconds = Math.max(1, Math.round(3600 / retrievalsPerHour)); const dataSetCreationIntervalSeconds = Math.max(1, Math.round(3600 / dataSetCreationsPerHour)); + const dataSetTerminationIntervalSeconds = Math.max(1, Math.round(3600 / dataSetTerminationsPerHour)); const pieceCleanupIntervalSeconds = Math.max(1, Math.round(3600 / pieceCleanupPerHour)); const 
pullCheckIntervalSeconds = Math.max(1, Math.round(3600 / pullChecksPerHour)); const dataRetentionPollIntervalSeconds = scheduling.dataRetentionPollIntervalSeconds; @@ -1038,6 +1180,7 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { dealIntervalSeconds, retrievalIntervalSeconds, dataSetCreationIntervalSeconds, + dataSetTerminationIntervalSeconds, dataRetentionPollIntervalSeconds, providersRefreshIntervalSeconds, pieceCleanupIntervalSeconds, @@ -1059,6 +1202,7 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { dealIntervalSeconds, retrievalIntervalSeconds, dataSetCreationIntervalSeconds, + dataSetTerminationIntervalSeconds, dataRetentionPollIntervalSeconds, providersRefreshIntervalSeconds, pieceCleanupIntervalSeconds, @@ -1078,10 +1222,17 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { const dealStartAt = new Date(now.getTime() + phaseMs); const retrievalStartAt = new Date(now.getTime() + phaseMs); const dataSetCreationStartAt = new Date(now.getTime() + phaseMs); + const dataSetTerminationStartAt = new Date(now.getTime() + phaseMs); const dataRetentionPollStartAt = new Date(now.getTime() + phaseMs); const providersRefreshStartAt = new Date(now.getTime() + phaseMs); - const minDataSets = this.configService.get("blockchain").minNumDataSetsForChecks; + const blockchainCfg = this.configService.get("blockchain"); + const minDataSets = blockchainCfg.minNumDataSetsForChecks; + // Termination schedules are only created when enabled with a non-empty canary window + // (slots [DATA_SET_TERMINATION_MIN_INDEX, MIN_NUM_DATASETS_FOR_CHECKS)). 
+ const terminationScheduleEnabled = + this.configService.get("jobs").dataSetTerminationEnabled && + minDataSets - blockchainCfg.dataSetTerminationMinIndex > 0; const cleanupStartAt = new Date(now.getTime() + phaseMs); const pullCheckStartAt = new Date(now.getTime() + phaseMs); @@ -1109,6 +1260,14 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { dataSetCreationStartAt, ); } + if (terminationScheduleEnabled) { + await this.jobScheduleRepository.upsertSchedule( + "data_set_termination", + address, + dataSetTerminationIntervalSeconds, + dataSetTerminationStartAt, + ); + } await this.jobScheduleRepository.upsertSchedule( "piece_cleanup", address, @@ -1140,6 +1299,19 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { }); } + // When termination is disabled (or the canary window is empty), remove any stale + // data_set_termination schedules so they stop enqueuing no-op jobs. + if (!terminationScheduleEnabled) { + const removed = await this.jobScheduleRepository.deleteSchedulesByJobType("data_set_termination"); + if (removed > 0) { + this.logger.warn({ + event: "data_set_termination_schedules_removed", + message: "Removed data_set_termination schedules because the job is disabled or the canary window is empty", + removed, + }); + } + } + // Global job schedules (sp_address = '') await this.jobScheduleRepository.upsertSchedule( "data_retention_poll", @@ -1251,6 +1423,8 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { return SP_WORK_QUEUE; case "data_set_creation": return SP_WORK_QUEUE; + case "data_set_termination": + return SP_WORK_QUEUE; case "piece_cleanup": return SP_WORK_QUEUE; case "pull_check": @@ -1273,6 +1447,7 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { row.job_type === "deal" || row.job_type === "retrieval" || row.job_type === "data_set_creation" || + row.job_type === "data_set_termination" || row.job_type === "piece_cleanup" || row.job_type === 
"pull_check" ) { @@ -1346,6 +1521,7 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { "deal", "retrieval", "data_set_creation", + "data_set_termination", "piece_cleanup", "pull_check", "data_retention_poll", diff --git a/apps/backend/src/jobs/repositories/job-schedule.repository.ts b/apps/backend/src/jobs/repositories/job-schedule.repository.ts index 6411a3b1..86336b5c 100644 --- a/apps/backend/src/jobs/repositories/job-schedule.repository.ts +++ b/apps/backend/src/jobs/repositories/job-schedule.repository.ts @@ -71,7 +71,7 @@ export class JobScheduleRepository { const [rows] = (await this.dataSource.query( ` DELETE FROM job_schedule_state - WHERE job_type IN ('deal', 'retrieval', 'data_set_creation', 'piece_cleanup', 'pull_check') + WHERE job_type IN ('deal', 'retrieval', 'data_set_creation', 'data_set_termination', 'piece_cleanup', 'pull_check') AND sp_address <> '' RETURNING sp_address `, @@ -82,7 +82,7 @@ export class JobScheduleRepository { const [rows] = (await this.dataSource.query( ` DELETE FROM job_schedule_state - WHERE job_type IN ('deal', 'retrieval', 'data_set_creation', 'piece_cleanup', 'pull_check') + WHERE job_type IN ('deal', 'retrieval', 'data_set_creation', 'data_set_termination', 'piece_cleanup', 'pull_check') AND sp_address <> '' AND sp_address <> ALL($1::text[]) RETURNING sp_address @@ -100,6 +100,30 @@ export class JobScheduleRepository { } } + /** + * Deletes all per-provider schedule rows for a given job type. + * + * Used to stop a job entirely when it is disabled by config (for example the + * `data_set_termination` canary when `DATASET_TERMINATION_ENABLED=false` or the + * canary window is empty), so stale schedules do not keep enqueuing no-op jobs. + * + * @param jobType - The job type whose per-provider schedules should be removed. + * @returns Number of schedule rows deleted. 
+ */ + async deleteSchedulesByJobType(jobType: JobType): Promise<number> { + const result = await this.dataSource.query( + ` + DELETE FROM job_schedule_state + WHERE job_type = $1 + AND sp_address <> '' + `, + [jobType], + ); + // dataSource.query (TypeORM over node-postgres) resolves DELETE statements to + // [rows, rowCount]; rowCount is the number of deleted rows. + const rowCount = Array.isArray(result) ? result[1] : undefined; + return typeof rowCount === "number" ? rowCount : 0; + } + /** * Counts manually paused jobs by type. */ diff --git a/apps/backend/src/metrics-prometheus/check-metric-labels.ts b/apps/backend/src/metrics-prometheus/check-metric-labels.ts index 07415d45..5a764a6e 100644 --- a/apps/backend/src/metrics-prometheus/check-metric-labels.ts +++ b/apps/backend/src/metrics-prometheus/check-metric-labels.ts @@ -1,4 +1,10 @@ -export type CheckType = "dataStorage" | "retrieval" | "dataRetention" | "dataSetCreation" | "pullCheck"; +export type CheckType = + | "dataStorage" + | "retrieval" + | "dataRetention" + | "dataSetCreation" + | "dataSetTermination" + | "pullCheck"; export type ProviderStatus = "approved" | "unapproved"; export type CheckMetricLabels = { diff --git a/apps/backend/src/metrics-prometheus/check-metrics.service.ts b/apps/backend/src/metrics-prometheus/check-metrics.service.ts index 7afd9935..63e2b7b2 100644 --- a/apps/backend/src/metrics-prometheus/check-metrics.service.ts +++ b/apps/backend/src/metrics-prometheus/check-metrics.service.ts @@ -285,6 +285,33 @@ export class DataSetCreationCheckMetrics { } } +@Injectable() +export class DataSetTerminationCheckMetrics { + constructor( + @InjectMetric("dataSetTerminationMs") + private readonly dataSetTerminationMs: Histogram, + @InjectMetric("dataSetTerminationStatus") + private readonly dataSetTerminationStatusCounter: Counter, + ) {} + + /** + * Observe the time from the `terminateService` call to `pdpEndEpoch != 0` confirmation. + * Emitted on `success` and `failure.timedout` only (analogous to `dataSetCreationMs`). 
+ */ + observeCheckDuration(labels: CheckMetricLabels, value: number | null | undefined): void { + observePositive(this.dataSetTerminationMs, labels, value); + } + + /** + * Record data-set termination status. + * Values: `success`, `failure.timedout`, `failure.other`, `skipped.no_candidate`. + * See docs/data-set-termination.md#termination-metrics-trigger-health. + */ + recordStatus(labels: CheckMetricLabels, value: string): void { + this.dataSetTerminationStatusCounter.inc({ ...labels, value }); + } +} + @Injectable() export class PullCheckCheckMetrics { constructor( diff --git a/apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts b/apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts index a27e945a..753b57b3 100644 --- a/apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts +++ b/apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts @@ -9,6 +9,7 @@ import { import { WalletSdkModule } from "../wallet-sdk/wallet-sdk.module.js"; import { DataSetCreationCheckMetrics, + DataSetTerminationCheckMetrics, DataStorageCheckMetrics, DiscoverabilityCheckMetrics, PullCheckCheckMetrics, @@ -154,6 +155,13 @@ const metricProviders = [ labelNames: ["checkType", "providerId", "providerName", "providerStatus"] as const, buckets: [100, 500, 1000, 2000, 5000, 10000, 30000, 60000, 120000, 300000, 600000], }), + makeHistogramProvider({ + // docs/checks/events-and-metrics.md#dataSetTerminationMs + name: "dataSetTerminationMs", + help: "Duration from terminateService call to pdpEndEpoch != 0 confirmation (ms)", + labelNames: ["checkType", "providerId", "providerName", "providerStatus"] as const, + buckets: [100, 500, 1000, 2000, 5000, 10000, 30000, 60000, 120000, 300000, 600000], + }), // Sub-status metrics (docs/checks/data-storage.md) makeCounterProvider({ // docs/checks/data-storage.md#sub-status-meanings (Upload Status) @@ -203,6 +211,12 @@ const metricProviders = [ help: "Data-set creation status counts", labelNames: ["checkType", 
"providerId", "providerName", "providerStatus", "value"] as const, }), + makeCounterProvider({ + // docs/checks/events-and-metrics.md#dataSetTerminationStatus + name: "dataSetTerminationStatus", + help: "Data-set termination status counts (success | failure.timedout | failure.other | skipped.no_candidate)", + labelNames: ["checkType", "providerId", "providerName", "providerStatus", "value"] as const, + }), // Pull check metrics (docs/checks/pull-check.md) makeHistogramProvider({ name: "pullRequestAcknowledgementLatencyMs", @@ -375,6 +389,7 @@ const metricProviders = [ RetrievalCheckMetrics, DiscoverabilityCheckMetrics, DataSetCreationCheckMetrics, + DataSetTerminationCheckMetrics, PullCheckCheckMetrics, WalletBalanceCollector, // HTTP metrics interceptor @@ -390,6 +405,7 @@ const metricProviders = [ RetrievalCheckMetrics, DiscoverabilityCheckMetrics, DataSetCreationCheckMetrics, + DataSetTerminationCheckMetrics, PullCheckCheckMetrics, WalletBalanceCollector, ], diff --git a/apps/backend/src/wallet-sdk/wallet-sdk.service.spec.ts b/apps/backend/src/wallet-sdk/wallet-sdk.service.spec.ts index d6613a31..e62691d8 100644 --- a/apps/backend/src/wallet-sdk/wallet-sdk.service.spec.ts +++ b/apps/backend/src/wallet-sdk/wallet-sdk.service.spec.ts @@ -18,6 +18,7 @@ const baseConfig: IBlockchainConfig = { checkDatasetCreationFees: false, useOnlyApprovedProviders: false, minNumDataSetsForChecks: 1, + dataSetTerminationMinIndex: 1, pdpSubgraphEndpoint: "https://api.thegraph.com/subgraphs/filecoin/pdp", }; From 91ffcdcba8008a6a53eac091645f18a873d9e2a9 Mon Sep 17 00:00:00 2001 From: silent-cipher Date: Thu, 4 Jun 2026 15:56:29 +0530 Subject: [PATCH 07/16] doc: udpate docs --- docs/checks/events-and-metrics.md | 4 +- docs/environment-variables.md | 73 ++++++++++++++++++++++++++++++- docs/jobs.md | 3 +- 3 files changed, 77 insertions(+), 3 deletions(-) diff --git a/docs/checks/events-and-metrics.md b/docs/checks/events-and-metrics.md index 2d4a2b29..b1abdfc4 100644 --- 
a/docs/checks/events-and-metrics.md +++ b/docs/checks/events-and-metrics.md @@ -100,7 +100,7 @@ sequenceDiagram * They are exported via Prometheus. * All Prometheus/OpenTelemetry metrics have label/attributes for: - `network=calibration|mainnet` - - `checkType=dataStorage|retrieval|dataRetention|dataSetCreation|pullCheck` — attribute metrics to a particular check/job + - `checkType=dataStorage|retrieval|dataRetention|dataSetCreation|dataSetTermination|pullCheck` — attribute metrics to a particular check/job - `providerId` — attribute metrics to a particular SP - `providerName` — human-readable name of the SP (defaults to `"unknown"` when not available) - `providerStatus=approved|unapproved` — attribute metrics to only approved SPs for example @@ -126,6 +126,7 @@ sequenceDiagram | `dataStorageCheckMs` | Data Storage | [`uploadToSpStart`](#uploadToSpStart) | [`ipfsRetrievalIntegrityChecked`](#ipfsRetrievalIntegrityChecked) | Duration of a Data Storage check | | | `retrievalCheckMs` | Retrieval | Retrieval check start | [`ipfsRetrievalIntegrityChecked`](#ipfsRetrievalIntegrityChecked) | Duration of a Retrieval check | | | `dataSetCreationMs` | Data-Set Creation | Data-set creation uploadToSpStart | Data-set creation pieceConfirmed | Duration of one data-set creation with confirmed piece (all using `createDataSetWithPiece`) | [`deal.service.ts`](../../apps/backend/src/deal/deal.service.ts) | +| `dataSetTerminationMs` | Data-Set Termination | `terminateService` call | FWSS `pdpEndEpoch != 0` confirmed | Duration of one managed data-set termination (`terminateManagedDataSet`). Emitted on `success` and `failure.timedout` only. See [data-set-termination.md](../data-set-termination.md). 
| [`deal.service.ts`](../../apps/backend/src/deal/deal.service.ts) | | `pullRequestAcknowledgementLatencyMs` | Pull | [`pullRequestSubmittedToSp`](#pullRequestSubmittedToSp) | [`pullRequestAcknowledgedBySp`](#pullRequestAcknowledgedBySp) | Time from `pullPieces` submission to SP request acknowledgement. | [`pull-check.service.ts`](../../apps/backend/src/pull-check/pull-check.service.ts) | | `pullRequestStartedMs` | Pull | [`pullRequestSubmittedToSp`](#pullRequestSubmittedToSp) | [`pullRequestStartedBySp`](#pullRequestStartedBySp) | Time from `pullPieces` submission to the SP reading the first byte of `/api/piece/{pieceCid}`. Skipped (no observation) when the SP never fetches from dealbot. | [`pull-check.service.ts`](../../apps/backend/src/pull-check/pull-check.service.ts), [`pull-piece.controller.ts`](../../apps/backend/src/pull-check/pull-piece.controller.ts) | | `pullRequestCompletionLatencyMs` | Pull | [`pullRequestSubmittedToSp`](#pullRequestSubmittedToSp) | [`pullRequestIsTerminal`](#pullRequestIsTerminal) | Time from `pullPieces` submission to terminal SP pull status. Emitted once for the check, either on success or failure. | [`pull-check.service.ts`](../../apps/backend/src/pull-check/pull-check.service.ts) | @@ -150,6 +151,7 @@ sequenceDiagram | `ipfsRetrievalHttpResponseCode` | Data Storage, Retrieval | [`ipfsRetrievalLastByteReceived`](#ipfsRetrievalLastByteReceived) | `200`, `500`, `2xxSuccess`, `4xxClientError`, `5xxServerError`, `otherHttpStatusCodes`, `failure` | | 1 | [`retrieval.service.ts`](../../apps/backend/src/retrieval/retrieval.service.ts) | | `retrievalStatus` | Data Storage, Retrieval | [`ipfsRetrievalIntegrityChecked`](#ipfsRetrievalIntegrityChecked) | `success`, `failure.timedout`, `failure.other` from [Data Storage Sub-status meanings](./data-storage.md#sub-status-meanings). | On the Retrieval path, the pre-flight branches on the on-chain `PDPVerifier.pieceLive(dataSetId, pieceId)` result. 
When `pieceLive=false` (dataset terminated, piece never created, or piece hard-removed), `skipped.piece_missing` is emitted and the deal is marked `cleaned_up=true`; no SP probe runs. When `pieceLive=true` and the SP returns 404 on `/pdp/piece/:pieceCid/status`, `failure.other` is emitted and a failed retrieval row is recorded (deal stays in the candidate pool for re-probing). | 1 | | | `dataSetCreationStatus` | Data-Set Creation | Not tied to an [event above](#event-list) but rather to data-set creation start (`pending`) and completion (`success`/`failure.*`) | `pending`, `success`, `failure.timedout`, `failure.other` | | 1 | [`deal.service.ts`](../../apps/backend/src/deal/deal.service.ts) | +| `dataSetTerminationStatus` | Data-Set Termination | When a `data_set_termination` invocation finishes acting on a slot (or finds none) | `success`, `failure.timedout`, `failure.other`, `skipped.no_candidate` | `success` confirms a slot was terminated (FWSS `pdpEndEpoch != 0`). `skipped.no_candidate` means every slot in the canary window was already `missing`; persistent skips indicate creation is lagging. See [data-set-termination.md](../data-set-termination.md). | 1 | [`deal.service.ts`](../../apps/backend/src/deal/deal.service.ts), [`data-set-termination.handler.ts`](../../apps/backend/src/jobs/data-set-termination.handler.ts) | | `dataSetChallengeStatus` | Data Retention | Emitted on each [Data Retention Check](./data-retention.md) poll when a provider's confirmed proving-period totals advance (strictly positive deltas since the last poll). | `success` (challenges in newly confirmed successful proving periods), `failure` (challenges in newly confirmed faulted periods) | | Counter increment = **period delta × 5** (`CHALLENGES_PER_PROVING_PERIOD`). Period delta is the increase in subgraph-confirmed proving periods since the previous poll for that provider (not "challenges per poll" in the abstract). See [data-retention.md §3](./data-retention.md#3-calculate-deltas). 
| [`data-retention.service.ts`](../../apps/backend/src/data-retention/data-retention.service.ts) | | `pullRequestProviderStatus` | Pull | When the SP reports a terminal pull status via `waitForPullPieces`. Recorded exactly once per check (intermediate poll statuses are not counted). | Raw SP-reported pull status, for example `complete`, `failed`, `not_found`. Use this to separate SP-side pull failures from dealbot-side validation failures. | | 1 | [`pull-check.service.ts`](../../apps/backend/src/pull-check/pull-check.service.ts) | | `pullCheckStatus` | Pull | When the [Pull Check](./pull-check.md) terminates (success after direct piece validation, or any failure). Recorded exactly once per check. | `success`, `failure.timedout`, `failure.other` from [Pull Check Status](./pull-check.md#pull-check-status). | | 1 | [`pull-check.service.ts`](../../apps/backend/src/pull-check/pull-check.service.ts) | diff --git a/docs/environment-variables.md b/docs/environment-variables.md index a5432879..61e0fe94 100644 --- a/docs/environment-variables.md +++ b/docs/environment-variables.md @@ -11,7 +11,7 @@ This document provides a comprehensive guide to all environment variables used b | [Blockchain](#blockchain-configuration) | `NETWORK`, `RPC_URL`, `WALLET_ADDRESS`, `WALLET_PRIVATE_KEY`, `SESSION_KEY_PRIVATE_KEY`, `CHECK_DATASET_CREATION_FEES`, `USE_ONLY_APPROVED_PROVIDERS`, `PDP_SUBGRAPH_ENDPOINT` | | [Dataset Versioning](#dataset-versioning) | `DEALBOT_DATASET_VERSION` | | [Scheduling](#scheduling-configuration) | `PROVIDERS_REFRESH_INTERVAL_SECONDS`, `DATA_RETENTION_POLL_INTERVAL_SECONDS`, `DEALBOT_MAINTENANCE_WINDOWS_UTC`, `DEALBOT_MAINTENANCE_WINDOW_MINUTES` | -| [Jobs (pg-boss)](#jobs-pg-boss) | `DEALBOT_PGBOSS_SCHEDULER_ENABLED`, `DEALBOT_PGBOSS_POOL_MAX`, `DEALS_PER_SP_PER_HOUR`, `MIN_NUM_DATASETS_FOR_CHECKS`, `DATASET_CREATIONS_PER_SP_PER_HOUR`, `RETRIEVALS_PER_SP_PER_HOUR`, `JOB_SCHEDULER_POLL_SECONDS`, `JOB_WORKER_POLL_SECONDS`, `PG_BOSS_LOCAL_CONCURRENCY`, 
`JOB_CATCHUP_MAX_ENQUEUE`, `JOB_SCHEDULE_PHASE_SECONDS`, `JOB_ENQUEUE_JITTER_SECONDS`, `DATA_SET_CREATION_JOB_TIMEOUT_SECONDS`, `DEAL_JOB_TIMEOUT_SECONDS`, `RETRIEVAL_JOB_TIMEOUT_SECONDS`, `SHUTDOWN_FINAL_SCRAPE_DELAY_SECONDS`, `IPFS_BLOCK_FETCH_CONCURRENCY` | +| [Jobs (pg-boss)](#jobs-pg-boss) | `DEALBOT_PGBOSS_SCHEDULER_ENABLED`, `DEALBOT_PGBOSS_POOL_MAX`, `DEALS_PER_SP_PER_HOUR`, `MIN_NUM_DATASETS_FOR_CHECKS`, `DATA_SET_TERMINATION_MIN_INDEX`, `DATASET_CREATIONS_PER_SP_PER_HOUR`, `DATASET_TERMINATION_ENABLED`, `DATASET_TERMINATIONS_PER_SP_PER_HOUR`, `RETRIEVALS_PER_SP_PER_HOUR`, `JOB_SCHEDULER_POLL_SECONDS`, `JOB_WORKER_POLL_SECONDS`, `PG_BOSS_LOCAL_CONCURRENCY`, `JOB_CATCHUP_MAX_ENQUEUE`, `JOB_SCHEDULE_PHASE_SECONDS`, `JOB_ENQUEUE_JITTER_SECONDS`, `DATA_SET_CREATION_JOB_TIMEOUT_SECONDS`, `DATA_SET_TERMINATION_JOB_TIMEOUT_SECONDS`, `DEAL_JOB_TIMEOUT_SECONDS`, `RETRIEVAL_JOB_TIMEOUT_SECONDS`, `SHUTDOWN_FINAL_SCRAPE_DELAY_SECONDS`, `IPFS_BLOCK_FETCH_CONCURRENCY` | | [Dataset](#dataset-configuration) | `DEALBOT_LOCAL_DATASETS_PATH`, `RANDOM_PIECE_SIZES` | | [ClickHouse](#clickhouse-configuration) | `CLICKHOUSE_URL`, `CLICKHOUSE_BATCH_SIZE`, `CLICKHOUSE_FLUSH_INTERVAL_MS`, `DEALBOT_PROBE_LOCATION` | | [Timeouts](#timeout-configuration) | `CONNECT_TIMEOUT_MS`, `HTTP_REQUEST_TIMEOUT_MS`, `HTTP2_REQUEST_TIMEOUT_MS`, `IPNI_VERIFICATION_TIMEOUT_MS`, `IPNI_VERIFICATION_POLLING_MS` | @@ -662,6 +662,28 @@ rate-based (per hour) and persisted in Postgres so restarts do not reset timing. --- +### `DATA_SET_TERMINATION_MIN_INDEX` + +- **Type**: `number` (integer) +- **Required**: No +- **Default**: `1` +- **Minimum**: `1` +- **Maximum**: `MIN_NUM_DATASETS_FOR_CHECKS` +- **Enforced**: Yes (config validation; violating either bound crashes the application on startup) + +**Role**: The lowest dataset slot index (inclusive) the `data_set_termination` canary may terminate. Slots `0..(DATA_SET_TERMINATION_MIN_INDEX - 1)` are never touched, keeping a stable baseline for ongoing checks. 
The canary window is `[DATA_SET_TERMINATION_MIN_INDEX, MIN_NUM_DATASETS_FOR_CHECKS)`. + +**When to update**: + +- Increase to protect more low-index slots from termination. +- Set equal to `MIN_NUM_DATASETS_FOR_CHECKS` to disable termination entirely (the canary window becomes empty and no schedule is created). + +**Example**: `MIN_NUM_DATASETS_FOR_CHECKS=10`, `DATA_SET_TERMINATION_MIN_INDEX=5` → slots 0–4 are stable, slots 5–9 cycle as the canary window. + +**See also**: [`docs/data-set-termination.md`](./data-set-termination.md) + +--- + ### `DATASET_CREATIONS_PER_SP_PER_HOUR` - **Type**: `number` @@ -676,6 +698,34 @@ rate-based (per hour) and persisted in Postgres so restarts do not reset timing. --- +### `DATASET_TERMINATION_ENABLED` + +- **Type**: `boolean` +- **Required**: No +- **Default**: `true` on calibration, `false` on mainnet + +**Role**: Enables the `data_set_termination` canary job, which periodically terminates one managed dataset slot per provider so `data_set_creation` recreates it, keeping the on-chain `createDataSet` lifecycle continuously exercised. + +**Notes**: Even when enabled, a schedule is only created when the canary window is non-empty (`MIN_NUM_DATASETS_FOR_CHECKS - DATA_SET_TERMINATION_MIN_INDEX > 0`). The default-empty window with `MIN_NUM_DATASETS_FOR_CHECKS=1` means termination is effectively off until you raise `MIN_NUM_DATASETS_FOR_CHECKS`. + +**See also**: [`docs/data-set-termination.md`](./data-set-termination.md) + +--- + +### `DATASET_TERMINATIONS_PER_SP_PER_HOUR` + +- **Type**: `number` +- **Required**: No +- **Default**: `1` + +**Role**: Target dataset termination rate per storage provider for the `data_set_termination` canary. + +**Limits**: Config schema caps this at 20. + +**Notes**: Should be **less than or equal to** `DATASET_CREATIONS_PER_SP_PER_HOUR` so creation can replenish terminated slots without backlog. A startup warning is logged if this constraint is violated. Fractional values are supported. 
+ +--- + ### `JOB_SCHEDULER_POLL_SECONDS` - **Type**: `number` @@ -810,6 +860,27 @@ Use this to stagger multiple dealbot deployments that are not sharing a database --- +### `DATA_SET_TERMINATION_JOB_TIMEOUT_SECONDS` + +- **Type**: `number` +- **Required**: No +- **Default**: `300` (5 minutes) +- **Minimum**: `60` (1 minute) +- **Enforced**: Yes (config validation, effective floor applied at runtime) + +**Role**: Maximum runtime for `data_set_termination` jobs before forced abort via `AbortController`. Bounds the slot scan, the `terminateService` call, and the `pdpEndEpoch != 0` confirmation poll. + +**When to update**: + +- Increase if `pdpEndEpoch` confirmation consistently times out on slow networks. +- Decrease for faster fail-fast behavior during testing. + +**Note**: If the configured value is below 60 seconds, the runtime silently raises it to 60 seconds as an effective floor. An abort due to this timeout (or an internal poll timeout) is recorded as `dataSetTerminationStatus{value="failure.timedout"}` and retried on the next scheduled tick. + +**See also**: [`docs/data-set-termination.md`](./data-set-termination.md) + +--- + ### `DEAL_JOB_TIMEOUT_SECONDS` - **Type**: `number` diff --git a/docs/jobs.md b/docs/jobs.md index e1e778fc..e2f60f3a 100644 --- a/docs/jobs.md +++ b/docs/jobs.md @@ -15,7 +15,7 @@ This doc explains what a "job" is in dealbot, how jobs are defined, how they're | --- | --- | --- | | `job_schedule_state` | One per `` plus global rows | Schedule state owned by dealbot. | | Storage provider (SP) | One per SP in registry | Filtered by `USE_ONLY_APPROVED_PROVIDERS` when enabled. | -| Job type | `deal`, `retrieval`, `data_set_creation`, `piece_cleanup`, `pull_check`, `providers_refresh`, `data_retention_poll` | `deal` corresponds to "data storage check" externally; we keep `deal` in code/DB for compatibility. 
| +| Job type | `deal`, `retrieval`, `data_set_creation`, `data_set_termination`, `piece_cleanup`, `pull_check`, `providers_refresh`, `data_retention_poll` | `deal` corresponds to "data storage check" externally; we keep `deal` in code/DB for compatibility. | | pg-boss queue | `sp.work`, `providers.refresh`, `data.retention.poll` | `sp.work` is a singleton queue. | | Dealbot scheduler | One per process (when enabled) | Runs the scheduling loop. | | Dealbot worker process | One Node.js process with `DEALBOT_RUN_MODE=worker` or `both` | Hosts pg-boss workers. | @@ -37,6 +37,7 @@ This doc explains what a "job" is in dealbot, how jobs are defined, how they're | `piece_cleanup` | `sp.work` | [`JobsService.handlePieceCleanupJob`](../apps/backend/src/jobs/jobs.service.ts) | `{ jobType: 'piece_cleanup', spAddress, intervalSeconds }` | — | | `pull_check` | `sp.work` | [`JobsService.handlePullCheckJob`](../apps/backend/src/jobs/jobs.service.ts) | `{ jobType: 'pull_check', spAddress, intervalSeconds }` | [pull check](./checks/pull-check.md) | | `data_set_creation` | `sp.work` | [`JobsService.handleDataSetCreationJob`](../apps/backend/src/jobs/jobs.service.ts) | `{ jobType: 'data_set_creation', spAddress, intervalSeconds }` | [data-set-creation](./data-set-creation.md) | +| `data_set_termination` | `sp.work` | [`JobsService.handleDataSetTerminationJob`](../apps/backend/src/jobs/jobs.service.ts) | `{ jobType: 'data_set_termination', spAddress, intervalSeconds }` | [data-set-termination](./data-set-termination.md) | `sp.work` is created with `policy=singleton`, and jobs set `singletonKey=spAddress` so only one active job per SP can run at a time. 
From 484996e4f322164c86716e3eaea203f84a727f40 Mon Sep 17 00:00:00 2001 From: silent-cipher Date: Fri, 5 Jun 2026 00:58:41 +0530 Subject: [PATCH 08/16] refactor: pivot to data_set_lifecycle_check --- apps/backend/.env.example | 15 +- apps/backend/src/config/app.config.ts | 69 ++-- .../entities/job-schedule-state.entity.ts | 2 +- apps/backend/src/deal/deal.service.spec.ts | 90 ++--- apps/backend/src/deal/deal.service.ts | 358 ++++++++++-------- .../jobs/data-set-termination.handler.spec.ts | 152 -------- .../src/jobs/data-set-termination.handler.ts | 122 ------ apps/backend/src/jobs/jobs.service.spec.ts | 133 ++----- apps/backend/src/jobs/jobs.service.ts | 172 ++++----- .../repositories/job-schedule.repository.ts | 8 +- .../metrics-prometheus/check-metric-labels.ts | 2 +- .../check-metrics.service.ts | 23 +- .../metrics-prometheus.module.ts | 18 +- .../src/wallet-sdk/wallet-sdk.service.spec.ts | 1 - docs/checks/README.md | 1 + docs/checks/data-set-lifecycle-check.md | 118 ++++++ docs/checks/events-and-metrics.md | 6 +- docs/data-set-termination.md | 220 ----------- docs/environment-variables.md | 50 +-- docs/jobs.md | 4 +- 20 files changed, 536 insertions(+), 1028 deletions(-) delete mode 100644 apps/backend/src/jobs/data-set-termination.handler.spec.ts delete mode 100644 apps/backend/src/jobs/data-set-termination.handler.ts create mode 100644 docs/checks/data-set-lifecycle-check.md delete mode 100644 docs/data-set-termination.md diff --git a/apps/backend/.env.example b/apps/backend/.env.example index 3b3cf272..a3c69b10 100644 --- a/apps/backend/.env.example +++ b/apps/backend/.env.example @@ -31,11 +31,6 @@ PDP_SUBGRAPH_ENDPOINT=https://api.thegraph.com/subgraphs/filecoin/pdp # Minimum number of datasets per SP (default: 1). When > 1, a separate data_set_creation job provisions extra datasets. MIN_NUM_DATASETS_FOR_CHECKS=1 -# Lowest dataset slot index the data_set_termination canary may terminate (inclusive). -# Slots 0..(index-1) are never touched. 
Must be >= 1 and <= MIN_NUM_DATASETS_FOR_CHECKS. -# Equal to MIN_NUM_DATASETS_FOR_CHECKS disables termination (empty canary window). -DATA_SET_TERMINATION_MIN_INDEX=1 - # Dataset Versioning (optional) # Uncomment and set to enable dataset versioning (e.g., "dealbot-v1", "dealbot-v2") # This allows creating new logical datasets without changing wallet addresses @@ -60,11 +55,11 @@ DEALBOT_MAINTENANCE_WINDOW_MINUTES=20 DEALS_PER_SP_PER_HOUR=2 DATASET_CREATIONS_PER_SP_PER_HOUR=1 RETRIEVALS_PER_SP_PER_HOUR=1 -# data_set_termination canary (defaults: enabled on calibration, disabled on mainnet). -# Keep DATASET_TERMINATIONS_PER_SP_PER_HOUR <= DATASET_CREATIONS_PER_SP_PER_HOUR. -# DATASET_TERMINATION_ENABLED=true -DATASET_TERMINATIONS_PER_SP_PER_HOUR=1 -DATA_SET_TERMINATION_JOB_TIMEOUT_SECONDS=300 # 5m: Max runtime for data_set_termination jobs +# data_set_lifecycle_check canary: creates a throwaway data set and terminates it each tick +# (defaults: enabled on calibration, disabled on mainnet). +# DATASET_LIFECYCLE_CHECK_ENABLED=true +DATASET_LIFECYCLE_CHECKS_PER_SP_PER_HOUR=1 +DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS=600 # 10m: create + upload + terminate + pdpEndEpoch poll PG_BOSS_LOCAL_CONCURRENCY=20 JOB_SCHEDULER_POLL_SECONDS=300 JOB_WORKER_POLL_SECONDS=60 diff --git a/apps/backend/src/config/app.config.ts b/apps/backend/src/config/app.config.ts index e4ebabb4..134da33e 100644 --- a/apps/backend/src/config/app.config.ts +++ b/apps/backend/src/config/app.config.ts @@ -58,22 +58,6 @@ export const configValidationSchema = Joi.object({ DEALBOT_DATASET_VERSION: Joi.string().optional(), MIN_NUM_DATASETS_FOR_CHECKS: Joi.number().integer().min(1).default(1), PDP_SUBGRAPH_ENDPOINT: Joi.string().uri().optional().allow(""), - // Lowest dataset slot index eligible for termination (inclusive). Slots 0..(index-1) - // are never touched. Must be >= 1 and <= MIN_NUM_DATASETS_FOR_CHECKS; equal disables - // termination (empty canary window). See docs/data-set-termination.md. 
- DATA_SET_TERMINATION_MIN_INDEX: Joi.number() - .integer() - .min(1) - .default(1) - .custom((value, helpers) => { - const minDataSets = helpers.state.ancestors?.[0]?.MIN_NUM_DATASETS_FOR_CHECKS; - if (minDataSets != null && value > minDataSets) { - return helpers.error("any.invalid", { - message: `DATA_SET_TERMINATION_MIN_INDEX (${value}) must be <= MIN_NUM_DATASETS_FOR_CHECKS (${minDataSets})`, - }); - } - return value; - }, "min index <= min datasets validation"), // Scheduling PROVIDERS_REFRESH_INTERVAL_SECONDS: Joi.number().default(4 * 3600), @@ -96,12 +80,12 @@ export const configValidationSchema = Joi.object({ // Per-hour limits are guardrails to avoid excessive background load. DEALS_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).default(4), DATASET_CREATIONS_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).default(1), - DATASET_TERMINATIONS_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).default(1), + DATASET_LIFECYCLE_CHECKS_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).default(1), RETRIEVALS_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).default(2), - // Enables the data_set_termination canary job. The network-dependent default (true on + // Enables the data_set_lifecycle_check canary job. The network-dependent default (true on // calibration, false on mainnet) is resolved in loadConfig; here we only validate the - // type when explicitly set. See docs/data-set-termination.md. - DATASET_TERMINATION_ENABLED: Joi.boolean().optional(), + // type when explicitly set. See docs/checks/data-set-lifecycle-check.md. + DATASET_LIFECYCLE_CHECK_ENABLED: Joi.boolean().optional(), // Polling interval for pg-boss scheduler (lower = more responsive, higher = less DB chatter). 
JOB_SCHEDULER_POLL_SECONDS: Joi.number().min(60).default(300), JOB_WORKER_POLL_SECONDS: Joi.number().min(5).default(60), @@ -114,7 +98,7 @@ export const configValidationSchema = Joi.object({ DEAL_JOB_TIMEOUT_SECONDS: Joi.number().min(120).default(360), // 6 minutes max runtime for data storage jobs (TODO: reduce default to 3 minutes) RETRIEVAL_JOB_TIMEOUT_SECONDS: Joi.number().min(60).default(60), // 1 minute max runtime for retrieval jobs (TODO: reduce default to 30 seconds) DATA_SET_CREATION_JOB_TIMEOUT_SECONDS: Joi.number().min(60).default(300), // 5 minutes max runtime for dataset creation jobs - DATA_SET_TERMINATION_JOB_TIMEOUT_SECONDS: Joi.number().min(60).default(300), // 5 minutes max runtime for dataset termination jobs + DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS: Joi.number().min(60).default(600), // 10 minutes: covers create + seed-piece upload + terminate + pdpEndEpoch poll // Seconds to hold the process alive after pg-boss drain completes, so Prometheus // captures at least one scrape of the terminal counter increments emitted during // shutdown. Default 35 covers the 30s ServiceMonitor interval plus a 5s buffer. @@ -221,12 +205,6 @@ export interface IBlockchainConfig { useOnlyApprovedProviders: boolean; dealbotDataSetVersion?: string; minNumDataSetsForChecks: number; - /** - * Lowest dataset slot index eligible for the `data_set_termination` job (inclusive). - * Slots `0..(dataSetTerminationMinIndex - 1)` are never terminated. Guaranteed to be - * `>= 1` and `<= minNumDataSetsForChecks` by config validation. - */ - dataSetTerminationMinIndex: number; pdpSubgraphEndpoint?: string; } @@ -255,20 +233,16 @@ export interface IJobsConfig { */ dataSetCreationsPerSpPerHour: number; /** - * Enables the calibration-focused `data_set_termination` canary job. + * Enables the calibration-focused `data_set_lifecycle_check` canary job, which + * creates a throwaway data set and immediately terminates it in a single tick. 
* - * Defaults to true on calibration and false on mainnet. Even when enabled, a - * schedule is only created when the canary window - * (`minNumDataSetsForChecks - dataSetTerminationMinIndex`) is non-empty. + * Defaults to true on calibration and false on mainnet. */ - dataSetTerminationEnabled: boolean; + dataSetLifecycleCheckEnabled: boolean; /** - * Target number of dataset termination runs per storage provider per hour. - * - * Should be <= `dataSetCreationsPerSpPerHour` so creation can replenish terminated - * slots without backlog. A startup warning is logged when this constraint is violated. + * Target number of dataset lifecycle check runs per storage provider per hour. */ - dataSetTerminationsPerSpPerHour: number; + dataSetLifecycleChecksPerSpPerHour: number; /** * How often the scheduler polls Postgres for due jobs (seconds). * @@ -328,12 +302,12 @@ export interface IJobsConfig { */ dataSetCreationJobTimeoutSeconds: number; /** - * Maximum runtime (seconds) for data-set termination jobs before forced abort. + * Maximum runtime (seconds) for data-set lifecycle check jobs before forced abort. * - * Bounds the terminateService call plus the `pdpEndEpoch != 0` confirmation poll. - * Uses AbortController to actively cancel job execution. + * Bounds the create-with-seed-piece upload, the terminateService call, and the + * `pdpEndEpoch != 0` confirmation poll. Uses AbortController to actively cancel execution. */ - dataSetTerminationJobTimeoutSeconds: number; + dataSetLifecycleCheckJobTimeoutSeconds: number; /** * Maximum runtime (seconds) for retrieval jobs before forced abort. 
* @@ -508,7 +482,6 @@ export function loadConfig(): IConfig { useOnlyApprovedProviders: process.env.USE_ONLY_APPROVED_PROVIDERS !== "false", dealbotDataSetVersion: process.env.DEALBOT_DATASET_VERSION, minNumDataSetsForChecks: Number.parseInt(process.env.MIN_NUM_DATASETS_FOR_CHECKS || "1", 10), - dataSetTerminationMinIndex: Number.parseInt(process.env.DATA_SET_TERMINATION_MIN_INDEX || "1", 10), pdpSubgraphEndpoint: process.env.PDP_SUBGRAPH_ENDPOINT || "", }, scheduling: { @@ -524,15 +497,17 @@ export function loadConfig(): IConfig { dealsPerSpPerHour: Number.parseFloat(process.env.DEALS_PER_SP_PER_HOUR || "4"), retrievalsPerSpPerHour: Number.parseFloat(process.env.RETRIEVALS_PER_SP_PER_HOUR || "2"), dataSetCreationsPerSpPerHour: Number.parseFloat(process.env.DATASET_CREATIONS_PER_SP_PER_HOUR || "1"), - dataSetTerminationEnabled: (() => { - const raw = process.env.DATASET_TERMINATION_ENABLED; + dataSetLifecycleCheckEnabled: (() => { + const raw = process.env.DATASET_LIFECYCLE_CHECK_ENABLED; if (raw == null || raw.trim().length === 0) { // Default: enabled on calibration, disabled on mainnet. 
return (process.env.NETWORK || "calibration") === "calibration"; } return raw === "true"; })(), - dataSetTerminationsPerSpPerHour: Number.parseFloat(process.env.DATASET_TERMINATIONS_PER_SP_PER_HOUR || "1"), + dataSetLifecycleChecksPerSpPerHour: Number.parseFloat( + process.env.DATASET_LIFECYCLE_CHECKS_PER_SP_PER_HOUR || "1", + ), schedulerPollSeconds: Number.parseInt(process.env.JOB_SCHEDULER_POLL_SECONDS || "300", 10), workerPollSeconds: Number.parseInt(process.env.JOB_WORKER_POLL_SECONDS || "60", 10), pgbossLocalConcurrency: Number.parseInt(process.env.PG_BOSS_LOCAL_CONCURRENCY || "20", 10), @@ -544,8 +519,8 @@ export function loadConfig(): IConfig { dealJobTimeoutSeconds: Number.parseInt(process.env.DEAL_JOB_TIMEOUT_SECONDS || "360", 10), retrievalJobTimeoutSeconds: Number.parseInt(process.env.RETRIEVAL_JOB_TIMEOUT_SECONDS || "60", 10), dataSetCreationJobTimeoutSeconds: Number.parseInt(process.env.DATA_SET_CREATION_JOB_TIMEOUT_SECONDS || "300", 10), - dataSetTerminationJobTimeoutSeconds: Number.parseInt( - process.env.DATA_SET_TERMINATION_JOB_TIMEOUT_SECONDS || "300", + dataSetLifecycleCheckJobTimeoutSeconds: Number.parseInt( + process.env.DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS || "600", 10, ), shutdownFinalScrapeDelaySeconds: Number.parseInt(process.env.SHUTDOWN_FINAL_SCRAPE_DELAY_SECONDS || "35", 10), diff --git a/apps/backend/src/database/entities/job-schedule-state.entity.ts b/apps/backend/src/database/entities/job-schedule-state.entity.ts index a42851eb..06aa72c5 100644 --- a/apps/backend/src/database/entities/job-schedule-state.entity.ts +++ b/apps/backend/src/database/entities/job-schedule-state.entity.ts @@ -4,7 +4,7 @@ export type JobType = | "deal" | "retrieval" | "data_set_creation" - | "data_set_termination" + | "data_set_lifecycle_check" | "pull_check" | "providers_refresh" | "data_retention_poll" diff --git a/apps/backend/src/deal/deal.service.spec.ts b/apps/backend/src/deal/deal.service.spec.ts index 0508779b..cac3fa73 100644 --- 
a/apps/backend/src/deal/deal.service.spec.ts +++ b/apps/backend/src/deal/deal.service.spec.ts @@ -17,7 +17,7 @@ import { DealAddonsService } from "../deal-addons/deal-addons.service.js"; import { DealPreprocessingResult } from "../deal-addons/types.js"; import { DataSetCreationCheckMetrics, - DataSetTerminationCheckMetrics, + DataSetLifecycleCheckMetrics, DataStorageCheckMetrics, RetrievalCheckMetrics, } from "../metrics-prometheus/check-metrics.service.js"; @@ -170,7 +170,7 @@ describe("DealService", () => { observeCheckDuration: vi.fn(), recordStatus: vi.fn(), }; - const mockDataSetTerminationMetrics = { + const mockDataSetLifecycleCheckMetrics = { observeCheckDuration: vi.fn(), recordStatus: vi.fn(), }; @@ -189,7 +189,7 @@ describe("DealService", () => { { provide: DataStorageCheckMetrics, useValue: mockDataStorageMetrics }, { provide: RetrievalCheckMetrics, useValue: mockRetrievalMetrics }, { provide: DataSetCreationCheckMetrics, useValue: mockDataSetCreationMetrics }, - { provide: DataSetTerminationCheckMetrics, useValue: mockDataSetTerminationMetrics }, + { provide: DataSetLifecycleCheckMetrics, useValue: mockDataSetLifecycleCheckMetrics }, { provide: ClickhouseService, useValue: { insert: vi.fn(), probeLocation: "test" } }, { provide: DatasetLivenessService, useValue: mockDatasetLivenessService }, ], @@ -1074,7 +1074,7 @@ describe("DealService", () => { { provide: DataStorageCheckMetrics, useValue: mockDataStorageMetrics }, { provide: RetrievalCheckMetrics, useValue: mockRetrievalMetrics }, { provide: DataSetCreationCheckMetrics, useValue: mockDataSetCreationMetrics }, - { provide: DataSetTerminationCheckMetrics, useValue: mockDataSetTerminationMetrics }, + { provide: DataSetLifecycleCheckMetrics, useValue: mockDataSetLifecycleCheckMetrics }, { provide: ClickhouseService, useValue: { insert: vi.fn(), probeLocation: "test" } }, { provide: DatasetLivenessService, useValue: mockDatasetLivenessService }, ], @@ -1452,7 +1452,7 @@ describe("DealService", () => { 
}); }); - describe("terminateManagedDataSet", () => { + describe("runDataSetLifecycleCheck", () => { beforeEach(() => { vi.spyOn(mockWalletSdkService, "getProviderInfo").mockReturnValue({ id: 1n, @@ -1461,80 +1461,74 @@ describe("DealService", () => { } as any); }); - it("terminates, marks deals cleaned up, and records success + duration metrics", async () => { + it("creates a throwaway data set, terminates it, and records only lifecycle metrics", async () => { const terminateMock = vi.fn().mockResolvedValue("0xhash"); const synapseMock = { - storage: { terminateDataSet: terminateMock }, + storage: { + createContext: vi.fn().mockResolvedValue({ dataSetId: 9n }), + terminateDataSet: terminateMock, + }, client: { waitForTransactionReceipt: vi.fn().mockResolvedValue({ status: "success" }) }, }; vi.spyOn(service as any, "createSynapseInstance").mockImplementation(() => synapseMock as unknown as Synapse); + (executeUpload as Mock).mockImplementation(async (_s, _d, _r, options) => { + await triggerUploadProgress(options?.onProgress); + return { pieceCid: "bafk-seed", pieceId: 1, transactionHash: "0xhash" }; + }); + // getDataSet: first probe inside ensureDataSetTerminated, then the confirmation poll. 
mockWarmStorageService.getDataSet.mockResolvedValueOnce({ pdpEndEpoch: 0n }); mockWarmStorageService.getDataSet.mockResolvedValueOnce({ pdpEndEpoch: 4321n }); - const updateFn = vi.fn().mockResolvedValue({ affected: 3 }); - const transactionMock = vi.fn(async (cb: any) => cb({ getRepository: () => ({ update: updateFn }) })); - Object.defineProperty(dealRepoMock, "manager", { - configurable: true, - value: { transaction: transactionMock }, - }); - - const result = await service.terminateManagedDataSet("0xaaa", 9n, undefined, 5_000); + const result = await service.runDataSetLifecycleCheck( + "0xaaa", + { dealbotLifecycleCheck: "nonce-1" }, + undefined, + 5_000, + ); - expect(terminateMock).toHaveBeenCalledWith({ dataSetId: 9n }); - expect(updateFn).toHaveBeenCalledWith( - { dataSetId: 9n, cleanedUp: false }, - expect.objectContaining({ cleanedUp: true }), + expect(synapseMock.storage.createContext).toHaveBeenCalledWith( + expect.objectContaining({ metadata: { dealbotLifecycleCheck: "nonce-1" } }), ); - expect(result).toEqual({ dealsAffected: 3, pdpEndEpoch: 4321n }); - expect(mockDataSetTerminationMetrics.recordStatus).toHaveBeenCalledWith( - expect.objectContaining({ checkType: "dataSetTermination" }), + expect(terminateMock).toHaveBeenCalledWith({ dataSetId: 9n }); + expect(result).toEqual({ dataSetId: 9n, pdpEndEpoch: 4321n }); + expect(mockDataSetLifecycleCheckMetrics.recordStatus).toHaveBeenCalledWith( + expect.objectContaining({ checkType: "dataSetLifecycleCheck" }), "success", ); - expect(mockDataSetTerminationMetrics.observeCheckDuration).toHaveBeenCalledWith( - expect.objectContaining({ checkType: "dataSetTermination" }), + expect(mockDataSetLifecycleCheckMetrics.observeCheckDuration).toHaveBeenCalledWith( + expect.objectContaining({ checkType: "dataSetLifecycleCheck" }), expect.any(Number), ); + // The create step must NOT record dataSetCreation metrics (those belong to data_set_creation). 
+ expect(mockDataSetCreationMetrics.recordStatus).not.toHaveBeenCalled(); + expect(mockDataSetCreationMetrics.observeCheckDuration).not.toHaveBeenCalled(); + // No Deal rows exist for the throwaway set, so no cleanup is attempted. + expect(dealRepoMock.save).not.toHaveBeenCalled(); }); it("records failure.timedout and rethrows when the signal is already aborted", async () => { - const terminateMock = vi.fn(); + const createContextMock = vi.fn().mockResolvedValue({ dataSetId: 9n }); const synapseMock = { - storage: { terminateDataSet: terminateMock }, + storage: { createContext: createContextMock, terminateDataSet: vi.fn() }, client: { waitForTransactionReceipt: vi.fn() }, }; vi.spyOn(service as any, "createSynapseInstance").mockImplementation(() => synapseMock as unknown as Synapse); const controller = new AbortController(); - controller.abort(new Error("Data set termination job timeout (300s)")); + controller.abort(new Error("Data set lifecycle check job timeout (600s)")); - await expect(service.terminateManagedDataSet("0xaaa", 9n, controller.signal, 5_000)).rejects.toThrow(); + await expect( + service.runDataSetLifecycleCheck("0xaaa", { dealbotLifecycleCheck: "nonce-2" }, controller.signal, 5_000), + ).rejects.toThrow(); - expect(terminateMock).not.toHaveBeenCalled(); - expect(mockDataSetTerminationMetrics.recordStatus).toHaveBeenCalledWith( - expect.objectContaining({ checkType: "dataSetTermination" }), + expect(mockDataSetLifecycleCheckMetrics.recordStatus).toHaveBeenCalledWith( + expect.objectContaining({ checkType: "dataSetLifecycleCheck" }), "failure.timedout", ); }); }); - describe("recordDataSetTerminationSkipped", () => { - it("records skipped.no_candidate with dataSetTermination labels", () => { - vi.spyOn(mockWalletSdkService, "getProviderInfo").mockReturnValue({ - id: 7n, - name: "sp", - isApproved: false, - } as any); - - service.recordDataSetTerminationSkipped("0xaaa"); - - expect(mockDataSetTerminationMetrics.recordStatus).toHaveBeenCalledWith( - 
expect.objectContaining({ checkType: "dataSetTermination", providerId: "7", providerStatus: "unapproved" }), - "skipped.no_candidate", - ); - }); - }); - describe("createDeal isLive guard", () => { it("throws DealJobTerminatedDataSetError when data set is PDP-terminated; no metrics or save", async () => { const providerInfo: PDPProviderEx = { diff --git a/apps/backend/src/deal/deal.service.ts b/apps/backend/src/deal/deal.service.ts index 711e26a7..53cc7343 100644 --- a/apps/backend/src/deal/deal.service.ts +++ b/apps/backend/src/deal/deal.service.ts @@ -1,11 +1,11 @@ -import { randomUUID } from "node:crypto"; -import { setTimeout as setTimeoutAsync } from "node:timers/promises"; import { METADATA_KEYS, SIZE_CONSTANTS, Synapse } from "@filoz/synapse-sdk"; import { Injectable, Logger, type OnModuleDestroy, type OnModuleInit } from "@nestjs/common"; import { ConfigService } from "@nestjs/config"; import { InjectRepository } from "@nestjs/typeorm"; import { executeUpload } from "filecoin-pin"; import { CID } from "multiformats/cid"; +import { randomUUID } from "node:crypto"; +import { setTimeout as setTimeoutAsync } from "node:timers/promises"; import type { Repository } from "typeorm"; import { ClickhouseService } from "../clickhouse/clickhouse.service.js"; import { awaitWithAbort } from "../common/abort-utils.js"; @@ -24,14 +24,14 @@ import type { IBlockchainConfig, IConfig } from "../config/app.config.js"; import { Deal } from "../database/entities/deal.entity.js"; import { StorageProvider } from "../database/entities/storage-provider.entity.js"; import { DealStatus, IpniStatus, ServiceType } from "../database/types.js"; -import { DataSourceService } from "../dataSource/dataSource.service.js"; import { DatasetLivenessService } from "../dataset-liveness/dataset-liveness.service.js"; +import { DataSourceService } from "../dataSource/dataSource.service.js"; import { DealAddonsService } from "../deal-addons/deal-addons.service.js"; import type { DealPreprocessingResult 
} from "../deal-addons/types.js"; import { buildCheckMetricLabels, classifyFailureStatus } from "../metrics-prometheus/check-metric-labels.js"; import { DataSetCreationCheckMetrics, - DataSetTerminationCheckMetrics, + DataSetLifecycleCheckMetrics, DataStorageCheckMetrics, RetrievalCheckMetrics, } from "../metrics-prometheus/check-metrics.service.js"; @@ -70,7 +70,7 @@ export class DealService implements OnModuleInit, OnModuleDestroy { private readonly dataStorageMetrics: DataStorageCheckMetrics, private readonly retrievalMetrics: RetrievalCheckMetrics, private readonly dataSetCreationMetrics: DataSetCreationCheckMetrics, - private readonly dataSetTerminationMetrics: DataSetTerminationCheckMetrics, + private readonly dataSetLifecycleCheckMetrics: DataSetLifecycleCheckMetrics, private readonly clickhouseService: ClickhouseService, private readonly datasetLivenessService: DatasetLivenessService, ) { @@ -736,7 +736,7 @@ export class DealService implements OnModuleInit, OnModuleDestroy { /** * Terminate a dataset on-chain (if needed) and wait for FWSS to confirm * `pdpEndEpoch != 0`. Shared by the `data_set_creation` repair path and the - * `data_set_termination` canary job. + * `data_set_lifecycle_check` canary job. * * Idempotent sequence: * 1. Read FWSS pdpEndEpoch. If already non-zero, skip the on-chain call. @@ -849,24 +849,31 @@ export class DealService implements OnModuleInit, OnModuleDestroy { } /** - * Terminate a single dealbot-managed dataset slot and confirm on-chain, emitting - * data-set termination metrics. Used by the `data_set_termination` canary job. + * Run one data-set lifecycle check: create a throwaway data set with a seed piece, + * then immediately terminate it and confirm `pdpEndEpoch != 0` on-chain. Used by the + * `data_set_lifecycle_check` canary job to validate that an SP honours the full + * create -> `terminateService` lifecycle. 
* - * Mechanically identical to {@link repairTerminatedDataSet} (terminate -> confirm - * `pdpEndEpoch != 0` -> mark deals cleaned up) but additionally records - * `dataSetTerminationStatus` and `dataSetTerminationMs` so the canary trigger is - * observable. An abort (job timeout) or an internal poll timeout is classified as - * `failure.timedout`; pg-boss retries on the next scheduled tick. + * Self-contained: it never touches the managed check data sets and creates no `Deal` + * rows, so no Deal cleanup is performed. The throwaway set is created with caller-supplied + * `metadata` carrying the fixed `dealbotLifecycleCheck` marker key (a per-run nonce value + * forces a fresh set each tick); operators can list/sweep leaks by that key. + * + * Emits only `dataSetLifecycleCheckStatus` / `dataSetLifecycleCheckMs` — never the + * `dataSetCreation` metrics (those belong to the `data_set_creation` job). An abort + * (job timeout) or an internal poll timeout is classified as `failure.timedout`. If + * creation succeeds but + * termination fails the set leaks (accepted trade-off); pg-boss retries on the next tick. 
*/ - async terminateManagedDataSet( + async runDataSetLifecycleCheck( providerAddress: string, - dataSetId: bigint, + metadata: Record<string, string>, signal?: AbortSignal, pollTimeoutMs = 60_000, - ): Promise<{ dealsAffected: number; pdpEndEpoch: bigint }> { + ): Promise<{ dataSetId: bigint; pdpEndEpoch: bigint }> { const providerInfo = this.walletSdkService.getProviderInfo(providerAddress); const labels = buildCheckMetricLabels({ - checkType: "dataSetTermination", + checkType: "dataSetLifecycleCheck", providerId: providerInfo?.id, providerName: providerInfo?.name, providerIsApproved: providerInfo?.isApproved, @@ -874,48 +881,56 @@ export class DealService implements OnModuleInit, OnModuleDestroy { const startedAt = Date.now(); this.logger.log({ - event: "dataset_termination_started", - message: "Starting managed data-set termination", + event: "dataset_lifecycle_check_started", + message: "Starting data-set lifecycle check (create then terminate)", providerAddress, providerId: providerInfo?.id, providerName: providerInfo?.name, - dataSetId: dataSetId.toString(), + metadata, }); + let dataSetId: bigint | undefined; try { + // 1. Create a fresh throwaway data set with a seed piece (no creation metrics). + ({ dataSetId } = await this.createDataSetWithPieceInternal(providerAddress, metadata, signal)); + if (!dataSetId) { + throw new Error("Data-set creation upload completed without resolving a dataSetId"); + } + // 2. Immediately terminate the exact set we just created and confirm on-chain. 
const pdpEndEpoch = await this.ensureDataSetTerminated(providerAddress, dataSetId, signal, pollTimeoutMs); - const dealsAffected = await this.markDataSetDealsCleanedUp(dataSetId); const durationMs = Date.now() - startedAt; - this.dataSetTerminationMetrics.observeCheckDuration(labels, durationMs); - this.dataSetTerminationMetrics.recordStatus(labels, "success"); + this.dataSetLifecycleCheckMetrics.observeCheckDuration(labels, durationMs); + this.dataSetLifecycleCheckMetrics.recordStatus(labels, "success"); this.logger.log({ - event: "dataset_termination_succeeded", - message: "Terminated managed data-set; data_set_creation will replenish the slot", + event: "dataset_lifecycle_check_succeeded", + message: "Data-set lifecycle check completed: created and terminated throwaway data set", providerAddress, providerId: providerInfo?.id, providerName: providerInfo?.name, dataSetId: dataSetId.toString(), pdpEndEpoch: pdpEndEpoch.toString(), - dealsAffected, durationMs, }); - return { dealsAffected, pdpEndEpoch }; + return { dataSetId, pdpEndEpoch }; } catch (error) { const durationMs = Date.now() - startedAt; // An abort (job-level timeout) or an internal poll timeout both count as failure.timedout. const status = signal?.aborted ? "failure.timedout" : classifyFailureStatus(error); if (status === "failure.timedout") { - this.dataSetTerminationMetrics.observeCheckDuration(labels, durationMs); + this.dataSetLifecycleCheckMetrics.observeCheckDuration(labels, durationMs); } - this.dataSetTerminationMetrics.recordStatus(labels, status); + this.dataSetLifecycleCheckMetrics.recordStatus(labels, status); this.logger.error({ - event: "dataset_termination_failed", - message: "Managed data-set termination failed", + event: "dataset_lifecycle_check_failed", + message: + dataSetId === undefined + ? 
"Data-set lifecycle check failed during creation" + : "Data-set lifecycle check failed during termination; throwaway data set may have leaked", providerAddress, providerId: providerInfo?.id, providerName: providerInfo?.name, - dataSetId: dataSetId.toString(), + dataSetId: dataSetId?.toString(), durationMs, status, error: toStructuredError(error), @@ -924,22 +939,6 @@ export class DealService implements OnModuleInit, OnModuleDestroy { } } - /** - * Record a `skipped.no_candidate` data-set termination outcome for a provider. - * Emitted by the termination handler when every eligible slot resolves as `missing` - * (nothing live/terminated to act on this tick). - */ - recordDataSetTerminationSkipped(providerAddress: string): void { - const providerInfo = this.walletSdkService.getProviderInfo(providerAddress); - const labels = buildCheckMetricLabels({ - checkType: "dataSetTermination", - providerId: providerInfo?.id, - providerName: providerInfo?.name, - providerIsApproved: providerInfo?.isApproved, - }); - this.dataSetTerminationMetrics.recordStatus(labels, "skipped.no_candidate"); - } - /** * Poll FWSS getDataSet({dataSetId}).pdpEndEpoch until non-zero. Exponential * backoff capped at 8s. Throws on timeout. @@ -963,7 +962,9 @@ export class DealService implements OnModuleInit, OnModuleDestroy { } /** - * Creates an on-chain data-set with a minimal 200 KiB piece for a provider. + * Creates an on-chain data-set with a minimal 200 KiB piece for a provider, + * recording `dataSetCreation` metrics. Used by the `data_set_creation` job. + * * Uses createContext + executeUpload (same flow as data storage check) instead of * PDPServer.createDataSet, since empty datasets are being removed from curio and synapse-sdk. 
* @@ -975,16 +976,12 @@ export class DealService implements OnModuleInit, OnModuleDestroy { metadata: Record<string, string>, signal?: AbortSignal, ): Promise<void> { - signal?.throwIfAborted(); const providerInfo = this.walletSdkService.getProviderInfo(providerAddress); - if (!providerInfo) { - throw new Error(`Provider ${providerAddress} not found in registry`); - } const labels = buildCheckMetricLabels({ checkType: "dataSetCreation", - providerId: providerInfo.id, - providerName: providerInfo.name, - providerIsApproved: providerInfo.isApproved, + providerId: providerInfo?.id, + providerName: providerInfo?.name, + providerIsApproved: providerInfo?.isApproved, }); const startedAt = Date.now(); @@ -993,114 +990,26 @@ export class DealService implements OnModuleInit, OnModuleDestroy { event: "dataset_creation_with_piece_started", message: "Starting data-set creation with piece", providerAddress, - providerId: providerInfo.id, - providerName: providerInfo.name, + providerId: providerInfo?.id, + providerName: providerInfo?.name, metadata, }); - let pieceAdded = false; - let piecesConfirmed = false; - let pieceCid: string | undefined; - let pieceId: number | undefined; - let transactionHash: string | undefined; - try { - const synapse = this.sharedSynapse ?? 
(await this.createSynapseInstance()); - signal?.throwIfAborted(); - - const DATA_SET_CREATION_PIECE_SIZE = 200 * 1024; // 200 KiB - const payload = Buffer.alloc(DATA_SET_CREATION_PIECE_SIZE, 0x61); - const dataFile = { - data: payload, - size: DATA_SET_CREATION_PIECE_SIZE, - name: "dataset-seed.bin", - }; - - const carResult = await buildUnixfsCar(dataFile, { signal }); - signal?.throwIfAborted(); - - const storage = await awaitWithAbort( - synapse.storage.createContext({ - providerId: providerInfo.id, - metadata, - }), - signal, - ); - signal?.throwIfAborted(); - - const filecoinPinLogger = createFilecoinPinLogger(this.logger); - - const uploadResult = (await awaitWithAbort( - executeUpload(synapse, carResult.carData, carResult.rootCID, { - logger: filecoinPinLogger, - contextId: providerAddress, - contexts: [storage], - pieceMetadata: {}, - ipniValidation: { enabled: false }, - // Must stay synchronous — see issue #446. - onProgress: (event) => { - switch (event.type) { - case "stored": - pieceCid = event.data.pieceCid.toString(); - this.logger.debug({ - event: "dataset_creation_stored", - message: "Data-set creation stored", - providerAddress, - providerId: providerInfo.id, - providerName: providerInfo.name, - pieceCid, - }); - break; - case "piecesAdded": - pieceAdded = true; - this.logger.debug({ - event: "dataset_creation_pieces_added", - message: "Data-set creation pieces added", - providerAddress, - providerId: providerInfo.id, - providerName: providerInfo.name, - txHash: event.data.txHash ?? "unknown", - }); - break; - case "piecesConfirmed": - piecesConfirmed = true; - this.logger.debug({ - event: "dataset_creation_pieces_confirmed", - message: "Data-set creation pieces confirmed", - providerAddress, - providerId: providerInfo.id, - providerName: providerInfo.name, - pieceIds: event.data.pieceIds, - }); - break; - } - }, - }), - signal, - )) as Partial | undefined; - - pieceCid = pieceCid ?? 
uploadResult?.pieceCid; - pieceId = uploadResult?.pieceId; - transactionHash = uploadResult?.transactionHash; - + const result = await this.createDataSetWithPieceInternal(providerAddress, metadata, signal); const durationMs = Date.now() - startedAt; this.dataSetCreationMetrics.observeCheckDuration(labels, durationMs); - - if (!pieceCid) { - throw new Error("Data-set creation upload completed without producing a pieceCid"); - } - this.dataSetCreationMetrics.recordStatus(labels, "success"); - if (!pieceAdded || !piecesConfirmed) { + if (!result.pieceAdded || !result.piecesConfirmed) { this.logger.warn({ event: "dataset_creation_missing_onchain_events", message: "Data-set creation succeeded without full on-chain progress events", providerAddress, - providerId: providerInfo.id, - providerName: providerInfo.name, - pieceAdded, - piecesConfirmed, + providerId: providerInfo?.id, + providerName: providerInfo?.name, + pieceAdded: result.pieceAdded, + piecesConfirmed: result.piecesConfirmed, }); } @@ -1108,15 +1017,15 @@ export class DealService implements OnModuleInit, OnModuleDestroy { event: "dataset_creation_with_piece_succeeded", message: "Data-set created with piece", providerAddress, - providerId: providerInfo.id, - providerName: providerInfo.name, + providerId: providerInfo?.id, + providerName: providerInfo?.name, durationMs, - dataSetId: storage.dataSetId ?? "unknown", - pieceCid: pieceCid ?? "unknown", - pieceId: pieceId ?? "unknown", - txHash: transactionHash ?? "unknown", - pieceAdded, - piecesConfirmed, + dataSetId: result.dataSetId ?? "unknown", + pieceCid: result.pieceCid, + pieceId: result.pieceId ?? "unknown", + txHash: result.transactionHash ?? 
"unknown", + pieceAdded: result.pieceAdded, + piecesConfirmed: result.piecesConfirmed, }); } catch (error) { const durationMs = Date.now() - startedAt; @@ -1126,20 +1035,135 @@ export class DealService implements OnModuleInit, OnModuleDestroy { event: "dataset_creation_with_piece_failed", message: "Data-set creation with piece failed", providerAddress, - providerId: providerInfo.id, - providerName: providerInfo.name, + providerId: providerInfo?.id, + providerName: providerInfo?.name, durationMs, - pieceAdded, - piecesConfirmed, - pieceCid, - pieceId, - transactionHash, error: toStructuredError(error), }); throw error; } } + /** + * Metrics-free creation of an on-chain data-set with a minimal 200 KiB seed piece. + * + * Performs createContext + executeUpload and returns the created `dataSetId` and upload + * summary. Records NO check metrics so callers can attribute the work to the right check + * (`data_set_creation` via {@link createDataSetWithPiece}, or `data_set_lifecycle_check` + * via {@link runDataSetLifecycleCheck}). Throws if the upload produced no `pieceCid` or + * the context resolved no `dataSetId` (we cannot operate on an unidentified set). + */ + private async createDataSetWithPieceInternal( + providerAddress: string, + metadata: Record, + signal?: AbortSignal, + ): Promise<{ + dataSetId?: bigint; + pieceCid: string; + pieceId: number | undefined; + transactionHash: string | undefined; + pieceAdded: boolean; + piecesConfirmed: boolean; + }> { + signal?.throwIfAborted(); + const providerInfo = this.walletSdkService.getProviderInfo(providerAddress); + if (!providerInfo) { + throw new Error(`Provider ${providerAddress} not found in registry`); + } + + let pieceAdded = false; + let piecesConfirmed = false; + let pieceCid: string | undefined; + + const synapse = this.sharedSynapse ?? 
(await this.createSynapseInstance()); + signal?.throwIfAborted(); + + const DATA_SET_CREATION_PIECE_SIZE = 200 * 1024; // 200 KiB + const payload = Buffer.alloc(DATA_SET_CREATION_PIECE_SIZE, 0x61); + const dataFile = { + data: payload, + size: DATA_SET_CREATION_PIECE_SIZE, + name: "dataset-seed.bin", + }; + + const carResult = await buildUnixfsCar(dataFile, { signal }); + signal?.throwIfAborted(); + + const storage = await awaitWithAbort( + synapse.storage.createContext({ + providerId: providerInfo.id, + metadata, + }), + signal, + ); + signal?.throwIfAborted(); + + const filecoinPinLogger = createFilecoinPinLogger(this.logger); + + const uploadResult = (await awaitWithAbort( + executeUpload(synapse, carResult.carData, carResult.rootCID, { + logger: filecoinPinLogger, + contextId: providerAddress, + contexts: [storage], + pieceMetadata: {}, + ipniValidation: { enabled: false }, + // Must stay synchronous — see issue #446. + onProgress: (event) => { + switch (event.type) { + case "stored": + pieceCid = event.data.pieceCid.toString(); + this.logger.debug({ + event: "dataset_creation_stored", + message: "Data-set creation stored", + providerAddress, + providerId: providerInfo.id, + providerName: providerInfo.name, + pieceCid, + }); + break; + case "piecesAdded": + pieceAdded = true; + this.logger.debug({ + event: "dataset_creation_pieces_added", + message: "Data-set creation pieces added", + providerAddress, + providerId: providerInfo.id, + providerName: providerInfo.name, + txHash: event.data.txHash ?? "unknown", + }); + break; + case "piecesConfirmed": + piecesConfirmed = true; + this.logger.debug({ + event: "dataset_creation_pieces_confirmed", + message: "Data-set creation pieces confirmed", + providerAddress, + providerId: providerInfo.id, + providerName: providerInfo.name, + pieceIds: event.data.pieceIds, + }); + break; + } + }, + }), + signal, + )) as Partial | undefined; + + pieceCid = pieceCid ?? 
uploadResult?.pieceCid; + if (!pieceCid) { + throw new Error("Data-set creation upload completed without producing a pieceCid"); + } + + return { + dataSetId: storage.dataSetId, + pieceCid, + pieceId: uploadResult?.pieceId, + transactionHash: uploadResult?.transactionHash, + pieceAdded, + piecesConfirmed, + }; + } + // ============================================================================ // Deal Creation Helpers // ============================================================================ diff --git a/apps/backend/src/jobs/data-set-termination.handler.spec.ts b/apps/backend/src/jobs/data-set-termination.handler.spec.ts deleted file mode 100644 index 3d1eba95..00000000 --- a/apps/backend/src/jobs/data-set-termination.handler.spec.ts +++ /dev/null @@ -1,152 +0,0 @@ -import { describe, expect, it, vi } from "vitest"; -import type { ProviderJobContext } from "../common/logging.js"; -import { terminateNextDataSet } from "./data-set-termination.handler.js"; - -const logContext: ProviderJobContext = { - jobId: "job-term", - providerAddress: "0xaaa", - providerId: 1n, - providerName: "sp", -}; - -const makeLogger = () => ({ log: vi.fn(), warn: vi.fn(), error: vi.fn(), debug: vi.fn() }) as any; - -const POLL_TIMEOUT_MS = 60_000; - -describe("terminateNextDataSet", () => { - it("terminates a live slot in the canary window and stops", async () => { - const dealService = { - getDataSetProvisioningStatus: vi.fn(async () => ({ status: "live" as const, dataSetId: 42n })), - terminateManagedDataSet: vi.fn(async () => ({ dealsAffected: 2, pdpEndEpoch: 10n })), - recordDataSetTerminationSkipped: vi.fn(), - }; - - // window [1, 2): only slot index 1 is eligible - await terminateNextDataSet({ dealService, logger: makeLogger() }, "0xaaa", 2, 1, {}, logContext, POLL_TIMEOUT_MS); - - expect(dealService.getDataSetProvisioningStatus).toHaveBeenCalledWith("0xaaa", { dealbotDS: "1" }, undefined); - expect(dealService.terminateManagedDataSet).toHaveBeenCalledWith("0xaaa", 42n, 
undefined, POLL_TIMEOUT_MS); - expect(dealService.recordDataSetTerminationSkipped).not.toHaveBeenCalled(); - }); - - it("skips missing slots and terminates the live/terminated one regardless of scan order", async () => { - // window [1, 3): index 1 is missing, index 2 is terminated -> index 2 must be the one terminated - const dealService = { - getDataSetProvisioningStatus: vi.fn(async (_sp: string, metadata: Record) => { - if (metadata.dealbotDS === "2") { - return { status: "terminated" as const, dataSetId: 99n }; - } - return { status: "missing" as const }; - }), - terminateManagedDataSet: vi.fn(async () => ({ dealsAffected: 0, pdpEndEpoch: 5n })), - recordDataSetTerminationSkipped: vi.fn(), - }; - - await terminateNextDataSet({ dealService, logger: makeLogger() }, "0xaaa", 3, 1, {}, logContext, POLL_TIMEOUT_MS); - - expect(dealService.terminateManagedDataSet).toHaveBeenCalledTimes(1); - expect(dealService.terminateManagedDataSet).toHaveBeenCalledWith("0xaaa", 99n, undefined, POLL_TIMEOUT_MS); - expect(dealService.recordDataSetTerminationSkipped).not.toHaveBeenCalled(); - }); - - it("records skipped.no_candidate when every candidate slot is missing", async () => { - const dealService = { - getDataSetProvisioningStatus: vi.fn(async () => ({ status: "missing" as const })), - terminateManagedDataSet: vi.fn(), - recordDataSetTerminationSkipped: vi.fn(), - }; - - await terminateNextDataSet({ dealService, logger: makeLogger() }, "0xaaa", 4, 1, {}, logContext, POLL_TIMEOUT_MS); - - expect(dealService.terminateManagedDataSet).not.toHaveBeenCalled(); - expect(dealService.recordDataSetTerminationSkipped).toHaveBeenCalledWith("0xaaa"); - // All three candidate slots (1,2,3) were probed. 
- expect(dealService.getDataSetProvisioningStatus).toHaveBeenCalledTimes(3); - }); - - it("never probes slots below minIndex", async () => { - const probed: Array = []; - const dealService = { - getDataSetProvisioningStatus: vi.fn(async (_sp: string, metadata: Record) => { - probed.push(metadata.dealbotDS); - return { status: "missing" as const }; - }), - terminateManagedDataSet: vi.fn(), - recordDataSetTerminationSkipped: vi.fn(), - }; - - // window [3, 5): only slots 3 and 4 are eligible; 0,1,2 are protected - await terminateNextDataSet({ dealService, logger: makeLogger() }, "0xaaa", 5, 3, {}, logContext, POLL_TIMEOUT_MS); - - expect(probed.sort()).toEqual(["3", "4"]); - }); - - it("never probes the baseline slot 0 even if minIndex is misconfigured below 1", async () => { - const probed: Array = []; - const dealService = { - getDataSetProvisioningStatus: vi.fn(async (_sp: string, metadata: Record) => { - probed.push(metadata.dealbotDS); - return { status: "missing" as const }; - }), - terminateManagedDataSet: vi.fn(), - recordDataSetTerminationSkipped: vi.fn(), - }; - - // minIndex=0 should be clamped to 1; slot 0 (no dealbotDS tag) must never be probed - await terminateNextDataSet({ dealService, logger: makeLogger() }, "0xaaa", 3, 0, {}, logContext, POLL_TIMEOUT_MS); - - expect(probed.sort()).toEqual(["1", "2"]); - expect(probed).not.toContain(undefined); - }); - - it("merges base dataset metadata into each slot's lookup", async () => { - const dealService = { - getDataSetProvisioningStatus: vi.fn(async () => ({ status: "live" as const, dataSetId: 1n })), - terminateManagedDataSet: vi.fn(async () => ({ dealsAffected: 0, pdpEndEpoch: 1n })), - recordDataSetTerminationSkipped: vi.fn(), - }; - - await terminateNextDataSet( - { dealService, logger: makeLogger() }, - "0xaaa", - 2, - 1, - { withIPFSIndexing: "", dealbotDataSetVersion: "v1" }, - logContext, - POLL_TIMEOUT_MS, - ); - - expect(dealService.getDataSetProvisioningStatus).toHaveBeenCalledWith( - "0xaaa", - { 
withIPFSIndexing: "", dealbotDataSetVersion: "v1", dealbotDS: "1" }, - undefined, - ); - }); - - it("stops immediately and does not terminate when the signal is already aborted", async () => { - const dealService = { - getDataSetProvisioningStatus: vi.fn(async () => ({ status: "live" as const, dataSetId: 1n })), - terminateManagedDataSet: vi.fn(), - recordDataSetTerminationSkipped: vi.fn(), - }; - - const controller = new AbortController(); - controller.abort(new Error("Job timed out")); - - await expect( - terminateNextDataSet( - { dealService, logger: makeLogger() }, - "0xaaa", - 2, - 1, - {}, - logContext, - POLL_TIMEOUT_MS, - controller.signal, - ), - ).rejects.toThrow("Job timed out"); - - expect(dealService.terminateManagedDataSet).not.toHaveBeenCalled(); - expect(dealService.recordDataSetTerminationSkipped).not.toHaveBeenCalled(); - }); -}); diff --git a/apps/backend/src/jobs/data-set-termination.handler.ts b/apps/backend/src/jobs/data-set-termination.handler.ts deleted file mode 100644 index 3828ba9f..00000000 --- a/apps/backend/src/jobs/data-set-termination.handler.ts +++ /dev/null @@ -1,122 +0,0 @@ -import type { Logger } from "@nestjs/common"; -import type { DataSetLogContext, ProviderJobContext } from "../common/logging.js"; -import type { DealService } from "../deal/deal.service.js"; - -export interface DataSetTerminationDeps { - dealService: Pick< - DealService, - "getDataSetProvisioningStatus" | "terminateManagedDataSet" | "recordDataSetTerminationSkipped" - >; - logger: Logger; -} - -/** - * Returns a randomly shuffled copy of the candidate slot indices `[start, minDataSets)`, - * where `start = max(1, minIndex)`. Fisher-Yates. - * - * The lower bound is clamped to `1` so slot `0` data-set is never a candidate. 
- */ -function shuffledCandidateIndices(minIndex: number, minDataSets: number): number[] { - const start = Math.max(1, minIndex); - const indices: number[] = []; - for (let i = start; i < minDataSets; i++) { - indices.push(i); - } - for (let i = indices.length - 1; i > 0; i--) { - const j = Math.floor(Math.random() * (i + 1)); - [indices[i], indices[j]] = [indices[j], indices[i]]; - } - return indices; -} - -/** - * Terminates at most one managed data-set slot per invocation (the canary trigger). - * - * Scans the canary window `[minIndex, minDataSets)` in random order and acts on the - * first slot that is `live` or `terminated`: - * - terminate it on-chain and wait for FWSS `pdpEndEpoch != 0`, marking its deals - * cleaned up. `data_set_creation` recreates the resulting `missing` slot on a - * later tick. - * `missing` slots are skipped (nothing to terminate; a replacement is already pending - * in `data_set_creation`). If every candidate is `missing`, emits `skipped.no_candidate`. - * - * Slots `0..(minIndex - 1)` are never touched, and slot `0` (the baseline data-set) is - * always protected because the canary window starts at `max(1, minIndex)`. Every - * candidate index is therefore `>= 1` and is tagged with `{ dealbotDS: String(i) }`, - * matching the slot metadata produced by `data_set_creation`. - */ -export async function terminateNextDataSet( - deps: DataSetTerminationDeps, - spAddress: string, - minDataSets: number, - minIndex: number, - baseDataSetMetadata: Record, - dataSetLogContext: ProviderJobContext, - pollTimeoutMs: number, - signal?: AbortSignal, -): Promise { - const { dealService, logger } = deps; - - const candidates = shuffledCandidateIndices(minIndex, minDataSets); - let skippedMissingCount = 0; - - for (const i of candidates) { - signal?.throwIfAborted(); - - // Candidates are always >= 1 (slot 0 is the protected baseline), so every slot in - // the canary window carries its dealbotDS tag, matching data_set_creation's slots. 
- const metadata: Record = { - ...baseDataSetMetadata, - dealbotDS: String(i), - }; - - const logContext: DataSetLogContext = { - ...dataSetLogContext, - metadata, - dataSetIndex: i, - }; - - const status = await dealService.getDataSetProvisioningStatus(spAddress, metadata, signal); - - if (status.status === "missing") { - skippedMissingCount++; - logger.debug({ - ...logContext, - event: "data_set_termination_slot_skipped_missing", - message: "Slot is missing; nothing to terminate (data_set_creation will replenish it)", - }); - continue; - } - - logger.log({ - ...logContext, - event: "terminating_data_set", - message: "Terminating managed data-set slot", - slotStatus: status.status, - dataSetId: status.dataSetId.toString(), - }); - const result = await dealService.terminateManagedDataSet(spAddress, status.dataSetId, signal, pollTimeoutMs); - logger.log({ - ...logContext, - event: "data_set_termination_completed", - message: "Terminated managed data-set; deferring recreation to data_set_creation", - dataSetId: status.dataSetId.toString(), - dealsAffected: result.dealsAffected, - skippedMissingCount, - }); - return; - } - - // Every candidate slot resolved as `missing`: nothing to terminate this tick. This is - // expected right after a termination when data_set_creation has not yet replenished - // the slot. Persistent skips indicate creation is lagging behind termination. 
- dealService.recordDataSetTerminationSkipped(spAddress); - logger.log({ - ...dataSetLogContext, - event: "data_set_termination_skipped_no_candidate", - message: "No eligible slot to terminate; all candidate slots are missing", - minDataSets, - minIndex, - skippedMissingCount, - }); -} diff --git a/apps/backend/src/jobs/jobs.service.spec.ts b/apps/backend/src/jobs/jobs.service.spec.ts index 2a12fad8..b62e5f4b 100644 --- a/apps/backend/src/jobs/jobs.service.spec.ts +++ b/apps/backend/src/jobs/jobs.service.spec.ts @@ -128,7 +128,6 @@ describe("JobsService schedule rows", () => { blockchain: { useOnlyApprovedProviders: false, minNumDataSetsForChecks: 1, - dataSetTerminationMinIndex: 1, } as IConfig["blockchain"], scheduling: { providersRefreshIntervalSeconds: 4 * 3600, @@ -145,9 +144,9 @@ describe("JobsService schedule rows", () => { dealJobTimeoutSeconds: 360, retrievalJobTimeoutSeconds: 60, dataSetCreationJobTimeoutSeconds: 300, - dataSetTerminationEnabled: false, - dataSetTerminationsPerSpPerHour: 1, - dataSetTerminationJobTimeoutSeconds: 300, + dataSetLifecycleCheckEnabled: false, + dataSetLifecycleChecksPerSpPerHour: 1, + dataSetLifecycleCheckJobTimeoutSeconds: 600, shutdownFinalScrapeDelaySeconds: 35, pieceCleanupPerSpPerHour: 1, maxPieceCleanupRuntimeSeconds: 300, @@ -1259,25 +1258,17 @@ describe("JobsService schedule rows", () => { expect(dealService.createDataSetWithPiece).not.toHaveBeenCalled(); }); - it("data_set_termination job skips when disabled", async () => { + it("data_set_lifecycle_check job skips when disabled", async () => { baseConfigValues = { ...baseConfigValues, - blockchain: { - ...baseConfigValues.blockchain, - minNumDataSetsForChecks: 3, - dataSetTerminationMinIndex: 1, - } as IConfig["blockchain"], - jobs: { ...baseConfigValues.jobs, dataSetTerminationEnabled: false } as IConfig["jobs"], + jobs: { ...baseConfigValues.jobs, dataSetLifecycleCheckEnabled: false } as IConfig["jobs"], }; configService = { get: vi.fn((key: keyof IConfig) => 
baseConfigValues[key]), } as unknown as JobsServiceDeps[0]; const dealService = { - getBaseDataSetMetadata: vi.fn(() => ({})), - getDataSetProvisioningStatus: vi.fn(), - terminateManagedDataSet: vi.fn(), - recordDataSetTerminationSkipped: vi.fn(), + runDataSetLifecycleCheck: vi.fn(), }; const walletSdkService = { getProviderInfo: vi.fn(() => ({ id: 1, name: "test-provider" })) }; @@ -1287,34 +1278,25 @@ describe("JobsService schedule rows", () => { walletSdkService: walletSdkService as unknown as ConstructorParameters[5], }); - await callPrivate(service, "handleDataSetTerminationJob", { - id: "job-term-1", - data: { jobType: "data_set_termination", spAddress: "0xaaa", intervalSeconds: 3600 }, + await callPrivate(service, "handleDataSetLifecycleCheckJob", { + id: "job-lc-1", + data: { jobType: "data_set_lifecycle_check", spAddress: "0xaaa", intervalSeconds: 3600 }, }); - expect(dealService.getDataSetProvisioningStatus).not.toHaveBeenCalled(); - expect(dealService.terminateManagedDataSet).not.toHaveBeenCalled(); + expect(dealService.runDataSetLifecycleCheck).not.toHaveBeenCalled(); }); - it("data_set_termination job skips when canary window is empty", async () => { + it("data_set_lifecycle_check job creates and terminates a throwaway data set when enabled", async () => { baseConfigValues = { ...baseConfigValues, - blockchain: { - ...baseConfigValues.blockchain, - minNumDataSetsForChecks: 2, - dataSetTerminationMinIndex: 2, // window = 2 - 2 = 0 - } as IConfig["blockchain"], - jobs: { ...baseConfigValues.jobs, dataSetTerminationEnabled: true } as IConfig["jobs"], + jobs: { ...baseConfigValues.jobs, dataSetLifecycleCheckEnabled: true } as IConfig["jobs"], }; configService = { get: vi.fn((key: keyof IConfig) => baseConfigValues[key]), } as unknown as JobsServiceDeps[0]; const dealService = { - getBaseDataSetMetadata: vi.fn(() => ({})), - getDataSetProvisioningStatus: vi.fn(), - terminateManagedDataSet: vi.fn(), - recordDataSetTerminationSkipped: vi.fn(), + 
runDataSetLifecycleCheck: vi.fn(async () => ({ dataSetId: 55n, pdpEndEpoch: 9n })), }; const walletSdkService = { getProviderInfo: vi.fn(() => ({ id: 1, name: "test-provider" })) }; @@ -1324,70 +1306,26 @@ describe("JobsService schedule rows", () => { walletSdkService: walletSdkService as unknown as ConstructorParameters[5], }); - await callPrivate(service, "handleDataSetTerminationJob", { - id: "job-term-2", - data: { jobType: "data_set_termination", spAddress: "0xaaa", intervalSeconds: 3600 }, + await callPrivate(service, "handleDataSetLifecycleCheckJob", { + id: "job-lc-2", + data: { jobType: "data_set_lifecycle_check", spAddress: "0xaaa", intervalSeconds: 3600 }, }); - expect(dealService.getDataSetProvisioningStatus).not.toHaveBeenCalled(); - expect(dealService.terminateManagedDataSet).not.toHaveBeenCalled(); - }); - - it("data_set_termination job terminates a slot in the canary window when enabled", async () => { - baseConfigValues = { - ...baseConfigValues, - blockchain: { - ...baseConfigValues.blockchain, - minNumDataSetsForChecks: 2, - dataSetTerminationMinIndex: 1, // window = [1, 2) - } as IConfig["blockchain"], - jobs: { ...baseConfigValues.jobs, dataSetTerminationEnabled: true } as IConfig["jobs"], - }; - configService = { - get: vi.fn((key: keyof IConfig) => baseConfigValues[key]), - } as unknown as JobsServiceDeps[0]; - - const dealService = { - getBaseDataSetMetadata: vi.fn(() => ({})), - getDataSetProvisioningStatus: vi.fn(async () => ({ status: "live" as const, dataSetId: 55n })), - terminateManagedDataSet: vi.fn(async () => ({ dealsAffected: 1, pdpEndEpoch: 9n })), - recordDataSetTerminationSkipped: vi.fn(), - }; - const walletSdkService = { getProviderInfo: vi.fn(() => ({ id: 1, name: "test-provider" })) }; - - service = buildService({ - configService, - dealService: dealService as unknown as ConstructorParameters[3], - walletSdkService: walletSdkService as unknown as ConstructorParameters[5], - }); - - await callPrivate(service, 
"handleDataSetTerminationJob", { - id: "job-term-3", - data: { jobType: "data_set_termination", spAddress: "0xaaa", intervalSeconds: 3600 }, - }); - - expect(dealService.getDataSetProvisioningStatus).toHaveBeenCalledWith( + expect(dealService.runDataSetLifecycleCheck).toHaveBeenCalledWith( "0xaaa", - { dealbotDS: "1" }, - expect.any(AbortSignal), - ); - expect(dealService.terminateManagedDataSet).toHaveBeenCalledWith( - "0xaaa", - 55n, + expect.objectContaining({ dealbotLifecycleCheck: expect.any(String) }), expect.any(AbortSignal), expect.any(Number), ); + // The fixed marker key is the only metadata; no base/slot metadata is attached. + const metadataArg = (dealService.runDataSetLifecycleCheck.mock.calls[0] as unknown[])[1] as Record; + expect(Object.keys(metadataArg)).toEqual(["dealbotLifecycleCheck"]); }); - it("creates data_set_termination schedules when enabled with a non-empty window", async () => { + it("creates data_set_lifecycle_check schedules when enabled", async () => { baseConfigValues = { ...baseConfigValues, - blockchain: { - ...baseConfigValues.blockchain, - minNumDataSetsForChecks: 3, - dataSetTerminationMinIndex: 1, - } as IConfig["blockchain"], - jobs: { ...baseConfigValues.jobs, dataSetTerminationEnabled: true } as IConfig["jobs"], + jobs: { ...baseConfigValues.jobs, dataSetLifecycleCheckEnabled: true } as IConfig["jobs"], }; configService = { get: vi.fn((key: keyof IConfig) => baseConfigValues[key]), @@ -1398,25 +1336,25 @@ describe("JobsService schedule rows", () => { await callPrivate(service, "ensureScheduleRows"); - const terminationUpserts = jobScheduleRepositoryMock.upsertSchedule.mock.calls.filter( - (call) => call[0] === "data_set_termination", + const lifecycleUpserts = jobScheduleRepositoryMock.upsertSchedule.mock.calls.filter( + (call) => call[0] === "data_set_lifecycle_check", ); - expect(terminationUpserts).toHaveLength(1); - expect(terminationUpserts[0][1]).toBe("0xaaa"); + expect(lifecycleUpserts).toHaveLength(1); + 
expect(lifecycleUpserts[0][1]).toBe("0xaaa"); expect(jobScheduleRepositoryMock.deleteSchedulesByJobType).not.toHaveBeenCalled(); }); - it("removes data_set_termination schedules when disabled", async () => { - // base config has dataSetTerminationEnabled=false + it("removes data_set_lifecycle_check schedules when disabled", async () => { + // base config has dataSetLifecycleCheckEnabled=false storageProviderRepositoryMock.find.mockResolvedValueOnce([{ address: "0xaaa" }]); await callPrivate(service, "ensureScheduleRows"); - const terminationUpserts = jobScheduleRepositoryMock.upsertSchedule.mock.calls.filter( - (call) => call[0] === "data_set_termination", + const lifecycleUpserts = jobScheduleRepositoryMock.upsertSchedule.mock.calls.filter( + (call) => call[0] === "data_set_lifecycle_check", ); - expect(terminationUpserts).toHaveLength(0); - expect(jobScheduleRepositoryMock.deleteSchedulesByJobType).toHaveBeenCalledWith("data_set_termination"); + expect(lifecycleUpserts).toHaveLength(0); + expect(jobScheduleRepositoryMock.deleteSchedulesByJobType).toHaveBeenCalledWith("data_set_lifecycle_check"); }); it("sets active, inactive, and tested provider gauge values after refresh", async () => { @@ -1678,9 +1616,10 @@ describe("JobsService schedule rows", () => { await vi.advanceTimersByTimeAsync(35_001); await shutdownPromise; - // Defaults: deal=360, retrieval=60, dataSetCreation=300, pullCheck=300 → max=360 → +60s buffer + // Defaults: deal=360, retrieval=60, dataSetCreation=300, dataSetLifecycleCheck=600, + // pullCheck=300 → max=600 → +60s buffer expect(bossMock.stop).toHaveBeenCalledTimes(1); - expect(bossMock.stop).toHaveBeenCalledWith({ graceful: true, timeout: 420_000 }); + expect(bossMock.stop).toHaveBeenCalledWith({ graceful: true, timeout: 660_000 }); }); it("picks the longest timeout across all job types, including pullCheck under pullPiece", async () => { diff --git a/apps/backend/src/jobs/jobs.service.ts b/apps/backend/src/jobs/jobs.service.ts index 
e1d349a0..42490d8b 100644 --- a/apps/backend/src/jobs/jobs.service.ts +++ b/apps/backend/src/jobs/jobs.service.ts @@ -19,7 +19,6 @@ import { PullCheckService } from "../pull-check/pull-check.service.js"; import { RetrievalService } from "../retrieval/retrieval.service.js"; import { WalletSdkService } from "../wallet-sdk/wallet-sdk.service.js"; import { provisionNextMissingDataSet } from "./data-set-creation.handler.js"; -import { terminateNextDataSet } from "./data-set-termination.handler.js"; import { DATA_RETENTION_POLL_QUEUE, PROVIDERS_REFRESH_QUEUE, @@ -28,12 +27,25 @@ import { } from "./job-queues.js"; import { JobScheduleRepository } from "./repositories/job-schedule.repository.js"; -type SpJobType = "deal" | "retrieval" | "data_set_creation" | "data_set_termination" | "piece_cleanup" | "pull_check"; +/** + * Fixed metadata marker key tagging every throwaway data set created by the + * `data_set_lifecycle_check` job. The value is a per-run nonce; the key is the stable + * handle operators use to list/sweep leaked sets (create-OK / terminate-failed runs). + */ +const LIFECYCLE_CHECK_METADATA_KEY = "dealbotLifecycleCheck"; + +type SpJobType = + | "deal" + | "retrieval" + | "data_set_creation" + | "data_set_lifecycle_check" + | "piece_cleanup" + | "pull_check"; const SP_JOB_TYPES: ReadonlySet = new Set([ "deal", "retrieval", "data_set_creation", - "data_set_termination", + "data_set_lifecycle_check", "piece_cleanup", "pull_check", ]); @@ -170,38 +182,12 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { return; } - this.warnIfTerminationRateExceedsCreation(); - await this.tick(); this.schedulerInterval = setInterval(() => { void this.tick(); }, this.schedulerPollMs()); } - /** - * Emits a startup warning when the termination rate exceeds the creation rate. - * - * If `data_set_termination` runs faster than `data_set_creation`, the missing-slot - * backlog accumulates and the loop stops behaving like a simple steady-state canary. 
- * See docs/data-set-termination.md#relationship-to-data_set_creation. - */ - private warnIfTerminationRateExceedsCreation(): void { - const jobs = this.configService.get("jobs"); - if (!jobs.dataSetTerminationEnabled) { - return; - } - if (jobs.dataSetTerminationsPerSpPerHour > jobs.dataSetCreationsPerSpPerHour) { - this.logger.warn({ - event: "data_set_termination_rate_exceeds_creation", - message: - "DATASET_TERMINATIONS_PER_SP_PER_HOUR exceeds DATASET_CREATIONS_PER_SP_PER_HOUR; " + - "terminations may outpace creation and accumulate a missing-slot backlog.", - dataSetTerminationsPerSpPerHour: jobs.dataSetTerminationsPerSpPerHour, - dataSetCreationsPerSpPerHour: jobs.dataSetCreationsPerSpPerHour, - }); - } - } - /** * Cleans up resources on shutdown. * Stops the polling loop and gracefully stops pg-boss. @@ -232,7 +218,7 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { jobs.dealJobTimeoutSeconds, jobs.retrievalJobTimeoutSeconds, jobs.dataSetCreationJobTimeoutSeconds, - jobs.dataSetTerminationJobTimeoutSeconds, + jobs.dataSetLifecycleCheckJobTimeoutSeconds, pullPiece.pullCheckJobTimeoutSeconds, ); const stopTimeoutMs = (longestJobTimeoutSec + 60) * 1000; @@ -362,8 +348,8 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { await this.handleDataSetCreationJob(job); return; } - if (job.data.jobType === "data_set_termination") { - await this.handleDataSetTerminationJob(job); + if (job.data.jobType === "data_set_lifecycle_check") { + await this.handleDataSetLifecycleCheckJob(job); return; } if (job.data.jobType === "piece_cleanup") { @@ -953,84 +939,79 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { } /** - * Handles one `data_set_termination` invocation for a provider. + * Handles one `data_set_lifecycle_check` invocation for a provider. 
* - Terminates at most one managed dataset slot in the canary window so the existing - `data_set_creation` job recreates it, keeping the on-chain createDataSet path - continuously exercised. Gated by `DATASET_TERMINATION_ENABLED` and a non-empty - canary window (`MIN_NUM_DATASETS_FOR_CHECKS - DATA_SET_TERMINATION_MIN_INDEX > 0`). + * Creates a throwaway data set with a seed piece, then immediately calls + * `terminateService` on it — exercising the full create -> terminate lifecycle in a + * single tick. The set carries a fixed `dealbotLifecycleCheck` marker key (with a + * per-run nonce value to force a fresh set), so any set leaked by a create-OK / + * terminate-failed run can be found and swept manually by that key. Gated by + * `DATASET_LIFECYCLE_CHECK_ENABLED`. */ - private async handleDataSetTerminationJob(job: SpJob): Promise<void> { + private async handleDataSetLifecycleCheckJob(job: SpJob): Promise<void> { const data = job.data; const spAddress = data.spAddress; const now = new Date(); const maintenance = this.getMaintenanceWindowStatus(now); if (maintenance.active) { - this.logMaintenanceSkip(`data_set_termination job for ${spAddress}`, maintenance.window?.label, { + this.logMaintenanceSkip(`data_set_lifecycle_check job for ${spAddress}`, maintenance.window?.label, { jobId: job.id, providerAddress: spAddress, providerId: this.walletSdkService.getProviderInfo(spAddress)?.id, providerName: this.walletSdkService.getProviderInfo(spAddress)?.name, }); - await this.deferJobForMaintenance("data_set_termination", data, maintenance, now); + await this.deferJobForMaintenance("data_set_lifecycle_check", data, maintenance, now); return; } - const blockchain = this.configService.get("blockchain"); - const jobsConfig = this.configService.get("jobs"); - const minDataSets = blockchain.minNumDataSetsForChecks; - const minIndex = blockchain.dataSetTerminationMinIndex; - // Defensive gate: schedules are only created when enabled with a non-empty canary - // window, but a stale
enqueued job (e.g. after disabling) must still no-op safely. - if (!jobsConfig.dataSetTerminationEnabled || minDataSets - minIndex <= 0) { + const jobsConfig = this.configService.get("jobs", { infer: true }); + // Defensive gate: schedules are only created when enabled, but a stale enqueued job + // (e.g. after disabling) must still no-op safely. + if (!jobsConfig.dataSetLifecycleCheckEnabled) { this.logger.log({ jobId: job.id, providerAddress: spAddress, providerId: this.walletSdkService.getProviderInfo(spAddress)?.id, providerName: this.walletSdkService.getProviderInfo(spAddress)?.name, - event: "data_set_termination_job_disabled", - message: "Data set termination job skipped: disabled or empty canary window", - enabled: jobsConfig.dataSetTerminationEnabled, - minDataSets, - minIndex, + event: "data_set_lifecycle_check_job_disabled", + message: "Data set lifecycle check job skipped: disabled", + enabled: jobsConfig.dataSetLifecycleCheckEnabled, }); return; } - const baseDataSetMetadata = this.dealService.getBaseDataSetMetadata(); + // Fixed marker key + per-run nonce value. The key is the manual-cleanup handle; the + // nonce forces createContext to provision a fresh set each tick instead of resolving + // a prior (possibly leaked) set. Intentionally excludes base data-set metadata. 
+ const metadata: Record<string, string> = { + [LIFECYCLE_CHECK_METADATA_KEY]: Date.now().toString(), + }; // Create AbortController for job timeout enforcement const abortController = new AbortController(); - const timeoutSeconds = jobsConfig.dataSetTerminationJobTimeoutSeconds; + const timeoutSeconds = jobsConfig.dataSetLifecycleCheckJobTimeoutSeconds; const timeoutMs = Math.max(60000, timeoutSeconds * 1000); const effectiveTimeoutSeconds = Math.round(timeoutMs / 1000); - const abortReason = new Error(`Data set termination job timeout (${effectiveTimeoutSeconds}s) for ${spAddress}`); + const abortReason = new Error( + `Data set lifecycle check job timeout (${effectiveTimeoutSeconds}s) for ${spAddress}`, + ); const timeoutId = setTimeout(() => { abortController.abort(abortReason); }, timeoutMs); - await this.recordJobExecution("data_set_termination", async () => { + await this.recordJobExecution("data_set_lifecycle_check", async () => { const dataSetLogContext = await this.resolveRunnableProviderJobContext( - "data_set_termination", + "data_set_lifecycle_check", spAddress, job.id, - "Data set termination job skipped: provider is blocked for scheduled data-storage checks", + "Data set lifecycle check job skipped: provider is blocked for scheduled data-storage checks", ); if (dataSetLogContext == null) { clearTimeout(timeoutId); return "success"; } try { - await terminateNextDataSet( - { dealService: this.dealService, logger: this.logger }, - spAddress, - minDataSets, - minIndex, - baseDataSetMetadata, - dataSetLogContext, - timeoutMs, - abortController.signal, - ); + await this.dealService.runDataSetLifecycleCheck(spAddress, metadata, abortController.signal, timeoutMs); return "success"; } catch (error) { if (abortController.signal.aborted) { @@ -1038,8 +1019,8 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { const reasonMessage = reason instanceof Error ? reason.message : String(reason ??
""); this.logger.error({ ...dataSetLogContext, - event: "data_set_termination_job_aborted", - message: reasonMessage || "Data set termination job aborted after timeout", + event: "data_set_lifecycle_check_job_aborted", + message: reasonMessage || "Data set lifecycle check job aborted after timeout", timeoutSeconds: effectiveTimeoutSeconds, error: toStructuredError(reason ?? error), }); @@ -1047,8 +1028,8 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { } this.logger.error({ ...dataSetLogContext, - event: "data_set_termination_job_failed", - message: "Data set termination job failed", + event: "data_set_lifecycle_check_job_failed", + message: "Data set lifecycle check job failed", error: toStructuredError(error), }); throw error; @@ -1148,7 +1129,7 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { dealIntervalSeconds: number; retrievalIntervalSeconds: number; dataSetCreationIntervalSeconds: number; - dataSetTerminationIntervalSeconds: number; + dataSetLifecycleCheckIntervalSeconds: number; dataRetentionPollIntervalSeconds: number; providersRefreshIntervalSeconds: number; pieceCleanupIntervalSeconds: number; @@ -1162,14 +1143,14 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { const dealsPerHour = jobsConfig.dealsPerSpPerHour; const retrievalsPerHour = jobsConfig.retrievalsPerSpPerHour; const dataSetCreationsPerHour = jobsConfig.dataSetCreationsPerSpPerHour; - const dataSetTerminationsPerHour = jobsConfig.dataSetTerminationsPerSpPerHour; + const dataSetLifecycleChecksPerHour = jobsConfig.dataSetLifecycleChecksPerSpPerHour; const pieceCleanupPerHour = jobsConfig.pieceCleanupPerSpPerHour; const pullChecksPerHour = pullPieceConfig.pullChecksPerSpPerHour; const dealIntervalSeconds = Math.max(1, Math.round(3600 / dealsPerHour)); const retrievalIntervalSeconds = Math.max(1, Math.round(3600 / retrievalsPerHour)); const dataSetCreationIntervalSeconds = Math.max(1, Math.round(3600 / 
dataSetCreationsPerHour)); - const dataSetTerminationIntervalSeconds = Math.max(1, Math.round(3600 / dataSetTerminationsPerHour)); + const dataSetLifecycleCheckIntervalSeconds = Math.max(1, Math.round(3600 / dataSetLifecycleChecksPerHour)); const pieceCleanupIntervalSeconds = Math.max(1, Math.round(3600 / pieceCleanupPerHour)); const pullCheckIntervalSeconds = Math.max(1, Math.round(3600 / pullChecksPerHour)); const dataRetentionPollIntervalSeconds = scheduling.dataRetentionPollIntervalSeconds; @@ -1180,7 +1161,7 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { dealIntervalSeconds, retrievalIntervalSeconds, dataSetCreationIntervalSeconds, - dataSetTerminationIntervalSeconds, + dataSetLifecycleCheckIntervalSeconds, dataRetentionPollIntervalSeconds, providersRefreshIntervalSeconds, pieceCleanupIntervalSeconds, @@ -1202,7 +1183,7 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { dealIntervalSeconds, retrievalIntervalSeconds, dataSetCreationIntervalSeconds, - dataSetTerminationIntervalSeconds, + dataSetLifecycleCheckIntervalSeconds, dataRetentionPollIntervalSeconds, providersRefreshIntervalSeconds, pieceCleanupIntervalSeconds, @@ -1222,17 +1203,14 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { const dealStartAt = new Date(now.getTime() + phaseMs); const retrievalStartAt = new Date(now.getTime() + phaseMs); const dataSetCreationStartAt = new Date(now.getTime() + phaseMs); - const dataSetTerminationStartAt = new Date(now.getTime() + phaseMs); + const dataSetLifecycleCheckStartAt = new Date(now.getTime() + phaseMs); const dataRetentionPollStartAt = new Date(now.getTime() + phaseMs); const providersRefreshStartAt = new Date(now.getTime() + phaseMs); - const blockchainCfg = this.configService.get("blockchain"); + const blockchainCfg = this.configService.get("blockchain", { infer: true }); const minDataSets = blockchainCfg.minNumDataSetsForChecks; - // Termination schedules are only created 
when enabled with a non-empty canary window - // (slots [DATA_SET_TERMINATION_MIN_INDEX, MIN_NUM_DATASETS_FOR_CHECKS)). - const terminationScheduleEnabled = - this.configService.get("jobs").dataSetTerminationEnabled && - minDataSets - blockchainCfg.dataSetTerminationMinIndex > 0; + // Lifecycle check schedules are only created when enabled explicitly + const lifecycleCheckScheduleEnabled = this.configService.get("jobs", { infer: true }).dataSetLifecycleCheckEnabled; const cleanupStartAt = new Date(now.getTime() + phaseMs); const pullCheckStartAt = new Date(now.getTime() + phaseMs); @@ -1260,12 +1238,12 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { dataSetCreationStartAt, ); } - if (terminationScheduleEnabled) { + if (lifecycleCheckScheduleEnabled) { await this.jobScheduleRepository.upsertSchedule( - "data_set_termination", + "data_set_lifecycle_check", address, - dataSetTerminationIntervalSeconds, - dataSetTerminationStartAt, + dataSetLifecycleCheckIntervalSeconds, + dataSetLifecycleCheckStartAt, ); } await this.jobScheduleRepository.upsertSchedule( @@ -1299,14 +1277,14 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { }); } - // When termination is disabled (or the canary window is empty), remove any stale - // data_set_termination schedules so they stop enqueuing no-op jobs. - if (!terminationScheduleEnabled) { - const removed = await this.jobScheduleRepository.deleteSchedulesByJobType("data_set_termination"); + // When the lifecycle check is disabled, remove any stale data_set_lifecycle_check + // schedules so they stop enqueuing no-op jobs. 
+ if (!lifecycleCheckScheduleEnabled) { + const removed = await this.jobScheduleRepository.deleteSchedulesByJobType("data_set_lifecycle_check"); if (removed > 0) { this.logger.warn({ - event: "data_set_termination_schedules_removed", - message: "Removed data_set_termination schedules because the job is disabled or the canary window is empty", + event: "data_set_lifecycle_check_schedules_removed", + message: "Removed data_set_lifecycle_check schedules because the job is disabled", removed, }); } @@ -1423,7 +1401,7 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { return SP_WORK_QUEUE; case "data_set_creation": return SP_WORK_QUEUE; - case "data_set_termination": + case "data_set_lifecycle_check": return SP_WORK_QUEUE; case "piece_cleanup": return SP_WORK_QUEUE; @@ -1447,7 +1425,7 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { row.job_type === "deal" || row.job_type === "retrieval" || row.job_type === "data_set_creation" || - row.job_type === "data_set_termination" || + row.job_type === "data_set_lifecycle_check" || row.job_type === "piece_cleanup" || row.job_type === "pull_check" ) { @@ -1521,7 +1499,7 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { "deal", "retrieval", "data_set_creation", - "data_set_termination", + "data_set_lifecycle_check", "piece_cleanup", "pull_check", "data_retention_poll", diff --git a/apps/backend/src/jobs/repositories/job-schedule.repository.ts b/apps/backend/src/jobs/repositories/job-schedule.repository.ts index 86336b5c..a2dda2f1 100644 --- a/apps/backend/src/jobs/repositories/job-schedule.repository.ts +++ b/apps/backend/src/jobs/repositories/job-schedule.repository.ts @@ -71,7 +71,7 @@ export class JobScheduleRepository { const [rows] = (await this.dataSource.query( ` DELETE FROM job_schedule_state - WHERE job_type IN ('deal', 'retrieval', 'data_set_creation', 'data_set_termination', 'piece_cleanup', 'pull_check') + WHERE job_type IN ('deal', 
'retrieval', 'data_set_creation', 'data_set_lifecycle_check', 'piece_cleanup', 'pull_check') AND sp_address <> '' RETURNING sp_address `, @@ -82,7 +82,7 @@ export class JobScheduleRepository { const [rows] = (await this.dataSource.query( ` DELETE FROM job_schedule_state - WHERE job_type IN ('deal', 'retrieval', 'data_set_creation', 'data_set_termination', 'piece_cleanup', 'pull_check') + WHERE job_type IN ('deal', 'retrieval', 'data_set_creation', 'data_set_lifecycle_check', 'piece_cleanup', 'pull_check') AND sp_address <> '' AND sp_address <> ALL($1::text[]) RETURNING sp_address @@ -104,8 +104,8 @@ export class JobScheduleRepository { * Deletes all per-provider schedule rows for a given job type. * * Used to stop a job entirely when it is disabled by config (for example the - * `data_set_termination` canary when `DATASET_TERMINATION_ENABLED=false` or the - * canary window is empty), so stale schedules do not keep enqueuing no-op jobs. + * `data_set_lifecycle_check` canary when `DATASET_LIFECYCLE_CHECK_ENABLED=false`), + * so stale schedules do not keep enqueuing no-op jobs. * * @param jobType - The job type whose per-provider schedules should be removed. * @returns Number of schedule rows deleted. 
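Review note (not part of the patch): the scheduling arithmetic that the `jobs.service.ts` hunks rely on can be checked in isolation. The sketch below is a minimal TypeScript restatement, assuming the three formulas are exactly as they appear in the diff (`Math.max(1, Math.round(3600 / perHour))`, `Math.max(60000, timeoutSeconds * 1000)`, and `(longestJobTimeoutSec + 60) * 1000`). The function names are hypothetical and do not exist in the codebase; the timeout values are the defaults quoted in the spec comment.

```typescript
// Hypothetical helper names; only the formulas are taken from the diff.

/** Converts a per-provider hourly rate into a schedule interval, never below 1 second. */
function intervalSecondsFromHourlyRate(perHour: number): number {
  return Math.max(1, Math.round(3600 / perHour));
}

/** Per-job abort timeout in milliseconds: the configured value, floored at 60 seconds. */
function jobTimeoutMs(timeoutSeconds: number): number {
  return Math.max(60000, timeoutSeconds * 1000);
}

/** Graceful stop timeout in milliseconds: longest job timeout plus a 60 second buffer. */
function stopTimeoutMs(jobTimeoutsSeconds: number[]): number {
  const longestJobTimeoutSec = Math.max(...jobTimeoutsSeconds);
  return (longestJobTimeoutSec + 60) * 1000;
}

// Defaults quoted in the spec comment:
// deal=360, retrieval=60, dataSetCreation=300, dataSetLifecycleCheck=600, pullCheck=300.
const defaultJobTimeoutsSeconds = [360, 60, 300, 600, 300];
console.log(stopTimeoutMs(defaultJobTimeoutsSeconds)); // 660000
```

With the lifecycle check's 600 second default now the longest timeout, the stop timeout moves from 420_000 to 660_000 ms, which matches the updated expectation in `jobs.service.spec.ts`.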
diff --git a/apps/backend/src/metrics-prometheus/check-metric-labels.ts b/apps/backend/src/metrics-prometheus/check-metric-labels.ts index 5a764a6e..5c02eff6 100644 --- a/apps/backend/src/metrics-prometheus/check-metric-labels.ts +++ b/apps/backend/src/metrics-prometheus/check-metric-labels.ts @@ -3,7 +3,7 @@ export type CheckType = | "retrieval" | "dataRetention" | "dataSetCreation" - | "dataSetTermination" + | "dataSetLifecycleCheck" | "pullCheck"; export type ProviderStatus = "approved" | "unapproved"; diff --git a/apps/backend/src/metrics-prometheus/check-metrics.service.ts b/apps/backend/src/metrics-prometheus/check-metrics.service.ts index 63e2b7b2..39f697f2 100644 --- a/apps/backend/src/metrics-prometheus/check-metrics.service.ts +++ b/apps/backend/src/metrics-prometheus/check-metrics.service.ts @@ -286,29 +286,30 @@ export class DataSetCreationCheckMetrics { } @Injectable() -export class DataSetTerminationCheckMetrics { +export class DataSetLifecycleCheckMetrics { constructor( - @InjectMetric("dataSetTerminationMs") - private readonly dataSetTerminationMs: Histogram, - @InjectMetric("dataSetTerminationStatus") - private readonly dataSetTerminationStatusCounter: Counter, + @InjectMetric("dataSetLifecycleCheckMs") + private readonly dataSetLifecycleCheckMs: Histogram, + @InjectMetric("dataSetLifecycleCheckStatus") + private readonly dataSetLifecycleCheckStatusCounter: Counter, ) {} /** - * Observe the time from the `terminateService` call to `pdpEndEpoch != 0` confirmation. + * Observe the end-to-end duration of one lifecycle check (create throwaway data set + * with a seed piece, then `terminateService` and confirm `pdpEndEpoch != 0`). * Emitted on `success` and `failure.timedout` only (analogous to `dataSetCreationMs`). 
*/ observeCheckDuration(labels: CheckMetricLabels, value: number | null | undefined): void { - observePositive(this.dataSetTerminationMs, labels, value); + observePositive(this.dataSetLifecycleCheckMs, labels, value); } /** - * Record data-set termination status. - * Values: `success`, `failure.timedout`, `failure.other`, `skipped.no_candidate`. - * See docs/data-set-termination.md#termination-metrics-trigger-health. + * Record data-set lifecycle check status. + * Values: `success`, `failure.timedout`, `failure.other`. + * See docs/checks/data-set-lifecycle-check.md. */ recordStatus(labels: CheckMetricLabels, value: string): void { - this.dataSetTerminationStatusCounter.inc({ ...labels, value }); + this.dataSetLifecycleCheckStatusCounter.inc({ ...labels, value }); } } diff --git a/apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts b/apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts index 753b57b3..5fe80fc7 100644 --- a/apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts +++ b/apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts @@ -9,7 +9,7 @@ import { import { WalletSdkModule } from "../wallet-sdk/wallet-sdk.module.js"; import { DataSetCreationCheckMetrics, - DataSetTerminationCheckMetrics, + DataSetLifecycleCheckMetrics, DataStorageCheckMetrics, DiscoverabilityCheckMetrics, PullCheckCheckMetrics, @@ -156,9 +156,9 @@ const metricProviders = [ buckets: [100, 500, 1000, 2000, 5000, 10000, 30000, 60000, 120000, 300000, 600000], }), makeHistogramProvider({ - // docs/checks/events-and-metrics.md#dataSetTerminationMs - name: "dataSetTerminationMs", - help: "Duration from terminateService call to pdpEndEpoch != 0 confirmation (ms)", + // docs/checks/events-and-metrics.md#dataSetLifecycleCheckMs + name: "dataSetLifecycleCheckMs", + help: "End-to-end data-set lifecycle check duration: create with seed piece then terminate and confirm pdpEndEpoch != 0 (ms)", labelNames: ["checkType", "providerId", "providerName", 
"providerStatus"] as const, buckets: [100, 500, 1000, 2000, 5000, 10000, 30000, 60000, 120000, 300000, 600000], }), @@ -212,9 +212,9 @@ const metricProviders = [ labelNames: ["checkType", "providerId", "providerName", "providerStatus", "value"] as const, }), makeCounterProvider({ - // docs/checks/events-and-metrics.md#dataSetTerminationStatus - name: "dataSetTerminationStatus", - help: "Data-set termination status counts (success | failure.timedout | failure.other | skipped.no_candidate)", + // docs/checks/events-and-metrics.md#dataSetLifecycleCheckStatus + name: "dataSetLifecycleCheckStatus", + help: "Data-set lifecycle check status counts (success | failure.timedout | failure.other)", labelNames: ["checkType", "providerId", "providerName", "providerStatus", "value"] as const, }), // Pull check metrics (docs/checks/pull-check.md) @@ -389,7 +389,7 @@ const metricProviders = [ RetrievalCheckMetrics, DiscoverabilityCheckMetrics, DataSetCreationCheckMetrics, - DataSetTerminationCheckMetrics, + DataSetLifecycleCheckMetrics, PullCheckCheckMetrics, WalletBalanceCollector, // HTTP metrics interceptor @@ -405,7 +405,7 @@ const metricProviders = [ RetrievalCheckMetrics, DiscoverabilityCheckMetrics, DataSetCreationCheckMetrics, - DataSetTerminationCheckMetrics, + DataSetLifecycleCheckMetrics, PullCheckCheckMetrics, WalletBalanceCollector, ], diff --git a/apps/backend/src/wallet-sdk/wallet-sdk.service.spec.ts b/apps/backend/src/wallet-sdk/wallet-sdk.service.spec.ts index e62691d8..d6613a31 100644 --- a/apps/backend/src/wallet-sdk/wallet-sdk.service.spec.ts +++ b/apps/backend/src/wallet-sdk/wallet-sdk.service.spec.ts @@ -18,7 +18,6 @@ const baseConfig: IBlockchainConfig = { checkDatasetCreationFees: false, useOnlyApprovedProviders: false, minNumDataSetsForChecks: 1, - dataSetTerminationMinIndex: 1, pdpSubgraphEndpoint: "https://api.thegraph.com/subgraphs/filecoin/pdp", }; diff --git a/docs/checks/README.md b/docs/checks/README.md index 903e543c..b083afc7 100644 --- 
a/docs/checks/README.md +++ b/docs/checks/README.md @@ -6,6 +6,7 @@ The files are: - [retrievals.md](./retrievals.md): Defines the "retrieval check" and how it is calculated. - [data-retention.md](./data-retention.md): Defines the "data retention check" and how it is calculated. - [pull-check.md](./pull-check.md): Defines the "pull check" and how it is calculated. +- [data-set-lifecycle-check.md](./data-set-lifecycle-check.md): Defines the `data_set_lifecycle_check` canary that creates and terminates a throwaway data set each tick. - [events-and-metrics.md](./events-and-metrics.md): Defines the events and metrics that are used to assess SP performance. diff --git a/docs/checks/data-set-lifecycle-check.md b/docs/checks/data-set-lifecycle-check.md new file mode 100644 index 00000000..099e394f --- /dev/null +++ b/docs/checks/data-set-lifecycle-check.md @@ -0,0 +1,118 @@ +# Data Set Lifecycle Check + +`data_set_lifecycle_check` is a calibration-focused canary job that, in a **single tick**, +creates a throwaway data set with a seed piece and immediately terminates it +(`terminateService`). It exists to continuously exercise the full on-chain +`createDataSet → terminateService` lifecycle so dealbot detects regressions in either path, +independent of how many managed check data sets a provider already has. + +> **Note**: this job does **not** attempt `PDPVerifier.deleteDataSet`, which is SP-initiated. +> See the FAQ for what happens on-chain after `terminateService`. + +## Summary + +- Self-contained: one invocation **creates one data set and terminates it**. It does not + touch the managed check data sets (slots `0..MIN_NUM_DATASETS_FOR_CHECKS-1`) and does not + depend on `data_set_creation` to replenish anything. +- Runs on the shared `sp.work` queue with `singletonKey=spAddress`, so it cannot race with + `deal`, `retrieval`, `piece_cleanup`, `pull_check`, or `data_set_creation` for the same provider. 
+- Schedule creation is gated by `DATASET_LIFECYCLE_CHECK_ENABLED` (default: true on + calibration, false on mainnet). +- The throwaway data set carries a single fixed metadata marker key, `dealbotLifecycleCheck`, + with a per-run nonce value. No base/slot metadata is attached. + +## Why a single-tick create + terminate + +The previous design terminated an existing managed slot and relied on `data_set_creation` to +recreate it on a later tick. That coupled the canary to `MIN_NUM_DATASETS_FOR_CHECKS`, a +min-index window, and the creation job's cadence. The lifecycle check collapses this into one +self-contained job: it always creates a fresh set and terminates it in the same run, so the +canary works regardless of provider state and needs no cross-job coordination. + +### Trade-off: leakage + +If creation succeeds but termination fails (process crash, job timeout, on-chain revert that +isn't an already-terminated no-op), the created data set **leaks** — it stays live on the SP. +This is an accepted trade-off for the job's simplicity. + +Because every set created by this job carries the fixed `dealbotLifecycleCheck` metadata key, +leaked sets are discoverable and can be swept manually (filter datasets by that metadata key). +If leakage grows significantly, that is the handle to clean up by. The +`dataset_lifecycle_check_failed` log line with `leakedDataSet: true` records the `dataSetId` +of each leak when it happens. + +## Configuration + +- [`DATASET_LIFECYCLE_CHECK_ENABLED`](../environment-variables.md#dataset_lifecycle_check_enabled) + — enables the job. Defaults to true on calibration, false on mainnet. When disabled, stale + schedules are removed so they stop enqueuing no-op jobs. +- [`DATASET_LIFECYCLE_CHECKS_PER_SP_PER_HOUR`](../environment-variables.md#dataset_lifecycle_checks_per_sp_per_hour) + — rate per provider, converted internally to `intervalSeconds`. Independent of + `DATASET_CREATIONS_PER_SP_PER_HOUR`. 
+- [`DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS`](../environment-variables.md#data_set_lifecycle_check_job_timeout_seconds) + — max runtime for one invocation. Bounds the seed-piece upload, the `terminateService` + call, and the `pdpEndEpoch != 0` confirmation poll. Default `600`. + +## Handler algorithm + +For one provider, one invocation of `data_set_lifecycle_check`: + +1. Apply the same maintenance-window and SP-blocklist rules used by other SP jobs. +2. If `DATASET_LIFECYCLE_CHECK_ENABLED` is false, log a disabled skip and exit (defensive gate + for stale enqueued jobs). +3. Build metadata `{ dealbotLifecycleCheck: "<nonce>" }`, where the nonce is a per-run value + (`Date.now().toString()` in `jobs.service.ts`). The fixed key is the + manual-cleanup handle; the per-run nonce value forces `createContext` to provision a fresh + set instead of resolving a prior (possibly leaked) set. +4. Create an `AbortController` from `DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS`. +5. Call `DealService.runDataSetLifecycleCheck(spAddress, metadata, signal, timeoutMs)`, which: + - a. Creates the data set with a 200 KiB seed piece (metrics-free; **no** `dataSetCreation` + metrics — those belong to `data_set_creation`). + - b. Calls `terminateService` on the created `dataSetId` and polls until FWSS confirms + `pdpEndEpoch != 0`. + - c. Records `dataSetLifecycleCheckStatus` / `dataSetLifecycleCheckMs`. + +### Idempotency / abort handling + +- An abort (job timeout) or internal poll timeout is classified as `failure.timedout`; + pg-boss does not retry (failures are handled by the next scheduled tick). +- The terminate step tolerates an already-terminated revert as a no-op and continues polling. + +## Metrics + +All metrics carry the standard label set (`checkType`, `providerId`, `providerName`, +`providerStatus`) with `checkType=dataSetLifecycleCheck`. See +[`events-and-metrics.md`](./events-and-metrics.md).
+ +| Metric | `value` labels | What to watch for | +|--------|---------------|-------------------| +| [`dataSetLifecycleCheckStatus`](./events-and-metrics.md#dataSetLifecycleCheckStatus) | `success`, `failure.timedout`, `failure.other` | `success` per provider confirms the full create→terminate lifecycle works on calibration; persistent `failure.*` indicates a `createDataSet` or `terminateService` regression | +| [`dataSetLifecycleCheckMs`](./events-and-metrics.md#dataSetLifecycleCheckMs) | — | End-to-end duration (create + upload + terminate + confirm); emitted on `success` and `failure.timedout` only | + +## FAQ + +### What happens on-chain after `terminateService` is called? + +`terminateService` does not delete a dataset instantly. It starts a multi-step on-chain +sequence that plays out over roughly 30 days. The lifecycle check only needs the first step to +complete before it exits. + +**Step 1 — terminateService tx confirms.** `terminateService` calls +`FilecoinPay.terminateRail(pdpRailId)`, which sets `endEpoch = block.number + lockupPeriod` on +the PDP rail. The FWSS `railTerminated` callback fires in the same transaction, stores +`info.pdpEndEpoch`, and emits `PDPPaymentsTerminated` and `ServiceTerminated`. This is the +point the job polls for: `pdpEndEpoch != 0`. + +**Step 2 — rail finalization (~30 days later).** When the PDP rail's `settledUpTo` reaches +`endEpoch`, `finalizeTerminatedRail` fires atomically inside the settle transaction. + +**Step 3 — dataset deletion at PDPVerifier (SP-initiated).** After the rail finalizes, the SP +may call `PDPVerifier.deleteDataSet`. The lifecycle check does not wait for steps 2 or 3 — +waiting ~30 days per invocation would defeat the purpose of a canary cycle. 
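
As a rough illustration of the handler steps and the Step 1 poll described above, the following TypeScript sketch runs create, terminate, and the `pdpEndEpoch != 0` confirmation against an injected dependency object. `LifecycleDeps`, `runLifecycleCheck`, `isAlreadyTerminated`, and the error-message match are hypothetical names chosen for this example, not dealbot's actual `DealService` API. An injected clock and deadline stand in for the `AbortController` so the sketch stays dependency-free.

```typescript
// Sketch only: every name below is a hypothetical stand-in, not dealbot's real API.
type LifecycleStatus = 'success' | 'failure.timedout' | 'failure.other';

interface LifecycleDeps {
  // Creates a throwaway data set with a seed piece and resolves to its dataSetId.
  createDataSet(spAddress: string, metadata: Record<string, string>): Promise<number>;
  // Submits terminateService; expected to throw on revert.
  terminateService(dataSetId: number): Promise<void>;
  // Reads pdpEndEpoch from FWSS; 0 means the data set is still live.
  getPdpEndEpoch(dataSetId: number): Promise<number>;
  sleep(ms: number): Promise<void>;
  now(): number;
}

interface LifecycleResult {
  status: LifecycleStatus;
  dataSetId: number | undefined;
  // True when creation succeeded but termination was never confirmed.
  leakedDataSet: boolean;
}

// Assumed matcher for the "already terminated" revert; the real error text may differ.
function isAlreadyTerminated(err: unknown): boolean {
  return err instanceof Error && /already terminated/i.test(err.message);
}

async function runLifecycleCheck(
  deps: LifecycleDeps,
  spAddress: string,
  metadata: Record<string, string>,
  timeoutMs: number,
  pollMs: number,
): Promise<LifecycleResult> {
  const deadline = deps.now() + timeoutMs;
  let dataSetId: number | undefined;
  let timedOut = false;
  try {
    dataSetId = await deps.createDataSet(spAddress, metadata);
    try {
      await deps.terminateService(dataSetId);
    } catch (err) {
      // An already-terminated revert is treated as a no-op; continue to polling.
      if (!isAlreadyTerminated(err)) throw err;
    }
    // Wait for step 1 only (pdpEndEpoch != 0), never for finalization or deletion.
    while ((await deps.getPdpEndEpoch(dataSetId)) === 0) {
      if (deps.now() >= deadline) {
        timedOut = true;
        throw new Error('pdpEndEpoch confirmation timed out');
      }
      await deps.sleep(pollMs);
    }
    return { status: 'success', dataSetId, leakedDataSet: false };
  } catch {
    const status: LifecycleStatus = timedOut ? 'failure.timedout' : 'failure.other';
    return { status, dataSetId, leakedDataSet: dataSetId !== undefined };
  }
}
```

Injecting `sleep` and `now` keeps the poll deterministic under test. A production version would presumably thread the job's abort signal through the upload and the on-chain calls as well, which this sketch does not attempt.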
+ +## Source of truth + +- Dataset creation design: [`docs/data-set-creation.md`](../data-set-creation.md) +- Job system overview: [`docs/jobs.md`](../jobs.md) +- Metrics and event definitions: [`docs/checks/events-and-metrics.md`](./events-and-metrics.md) +- Scheduler and workers: [`apps/backend/src/jobs/jobs.service.ts`](../../apps/backend/src/jobs/jobs.service.ts) +- Deal service dataset logic (`createDataSetWithPiece`, `runDataSetLifecycleCheck`): [`apps/backend/src/deal/deal.service.ts`](../../apps/backend/src/deal/deal.service.ts) diff --git a/docs/checks/events-and-metrics.md b/docs/checks/events-and-metrics.md index b1abdfc4..bf250c0e 100644 --- a/docs/checks/events-and-metrics.md +++ b/docs/checks/events-and-metrics.md @@ -100,7 +100,7 @@ sequenceDiagram * They are exported via Prometheus. * All Prometheus/OpenTelemetry metrics have label/attributes for: - `network=calibration|mainnet` - - `checkType=dataStorage|retrieval|dataRetention|dataSetCreation|dataSetTermination|pullCheck` — attribute metrics to a particular check/job + - `checkType=dataStorage|retrieval|dataRetention|dataSetCreation|dataSetLifecycleCheck|pullCheck` — attribute metrics to a particular check/job - `providerId` — attribute metrics to a particular SP - `providerName` — human-readable name of the SP (defaults to `"unknown"` when not available) - `providerStatus=approved|unapproved` — attribute metrics to only approved SPs for example @@ -126,7 +126,7 @@ sequenceDiagram | `dataStorageCheckMs` | Data Storage | [`uploadToSpStart`](#uploadToSpStart) | [`ipfsRetrievalIntegrityChecked`](#ipfsRetrievalIntegrityChecked) | Duration of a Data Storage check | | | `retrievalCheckMs` | Retrieval | Retrieval check start | [`ipfsRetrievalIntegrityChecked`](#ipfsRetrievalIntegrityChecked) | Duration of a Retrieval check | | | `dataSetCreationMs` | Data-Set Creation | Data-set creation uploadToSpStart | Data-set creation pieceConfirmed | Duration of one data-set creation with confirmed piece (all using 
`createDataSetWithPiece`) | [`deal.service.ts`](../../apps/backend/src/deal/deal.service.ts) | -| `dataSetTerminationMs` | Data-Set Termination | `terminateService` call | FWSS `pdpEndEpoch != 0` confirmed | Duration of one managed data-set termination (`terminateManagedDataSet`). Emitted on `success` and `failure.timedout` only. See [data-set-termination.md](../data-set-termination.md). | [`deal.service.ts`](../../apps/backend/src/deal/deal.service.ts) | +| `dataSetLifecycleCheckMs` | Data-Set Lifecycle Check | Data-set create with seed piece | FWSS `pdpEndEpoch != 0` confirmed | End-to-end duration of one lifecycle check: create a throwaway data set then terminate it (`runDataSetLifecycleCheck`). Emitted on `success` and `failure.timedout` only. See [data-set-lifecycle-check.md](./data-set-lifecycle-check.md). | [`deal.service.ts`](../../apps/backend/src/deal/deal.service.ts) | | `pullRequestAcknowledgementLatencyMs` | Pull | [`pullRequestSubmittedToSp`](#pullRequestSubmittedToSp) | [`pullRequestAcknowledgedBySp`](#pullRequestAcknowledgedBySp) | Time from `pullPieces` submission to SP request acknowledgement. | [`pull-check.service.ts`](../../apps/backend/src/pull-check/pull-check.service.ts) | | `pullRequestStartedMs` | Pull | [`pullRequestSubmittedToSp`](#pullRequestSubmittedToSp) | [`pullRequestStartedBySp`](#pullRequestStartedBySp) | Time from `pullPieces` submission to the SP reading the first byte of `/api/piece/{pieceCid}`. Skipped (no observation) when the SP never fetches from dealbot. | [`pull-check.service.ts`](../../apps/backend/src/pull-check/pull-check.service.ts), [`pull-piece.controller.ts`](../../apps/backend/src/pull-check/pull-piece.controller.ts) | | `pullRequestCompletionLatencyMs` | Pull | [`pullRequestSubmittedToSp`](#pullRequestSubmittedToSp) | [`pullRequestIsTerminal`](#pullRequestIsTerminal) | Time from `pullPieces` submission to terminal SP pull status. Emitted once for the check, either on success or failure. 
| [`pull-check.service.ts`](../../apps/backend/src/pull-check/pull-check.service.ts) | @@ -151,7 +151,7 @@ sequenceDiagram | `ipfsRetrievalHttpResponseCode` | Data Storage, Retrieval | [`ipfsRetrievalLastByteReceived`](#ipfsRetrievalLastByteReceived) | `200`, `500`, `2xxSuccess`, `4xxClientError`, `5xxServerError`, `otherHttpStatusCodes`, `failure` | | 1 | [`retrieval.service.ts`](../../apps/backend/src/retrieval/retrieval.service.ts) | | `retrievalStatus` | Data Storage, Retrieval | [`ipfsRetrievalIntegrityChecked`](#ipfsRetrievalIntegrityChecked) | `success`, `failure.timedout`, `failure.other` from [Data Storage Sub-status meanings](./data-storage.md#sub-status-meanings). | On the Retrieval path, the pre-flight branches on the on-chain `PDPVerifier.pieceLive(dataSetId, pieceId)` result. When `pieceLive=false` (dataset terminated, piece never created, or piece hard-removed), `skipped.piece_missing` is emitted and the deal is marked `cleaned_up=true`; no SP probe runs. When `pieceLive=true` and the SP returns 404 on `/pdp/piece/:pieceCid/status`, `failure.other` is emitted and a failed retrieval row is recorded (deal stays in the candidate pool for re-probing). | 1 | | | `dataSetCreationStatus` | Data-Set Creation | Not tied to an [event above](#event-list) but rather to data-set creation start (`pending`) and completion (`success`/`failure.*`) | `pending`, `success`, `failure.timedout`, `failure.other` | | 1 | [`deal.service.ts`](../../apps/backend/src/deal/deal.service.ts) | -| `dataSetTerminationStatus` | Data-Set Termination | When a `data_set_termination` invocation finishes acting on a slot (or finds none) | `success`, `failure.timedout`, `failure.other`, `skipped.no_candidate` | `success` confirms a slot was terminated (FWSS `pdpEndEpoch != 0`). `skipped.no_candidate` means every slot in the canary window was already `missing`; persistent skips indicate creation is lagging. See [data-set-termination.md](../data-set-termination.md). 
| 1 | [`deal.service.ts`](../../apps/backend/src/deal/deal.service.ts), [`data-set-termination.handler.ts`](../../apps/backend/src/jobs/data-set-termination.handler.ts) | +| `dataSetLifecycleCheckStatus` | Data-Set Lifecycle Check | When a `data_set_lifecycle_check` invocation finishes (create + terminate) | `success`, `failure.timedout`, `failure.other` | `success` confirms the full create→terminate lifecycle completed (FWSS `pdpEndEpoch != 0`). Persistent `failure.*` indicates a `createDataSet` or `terminateService` regression. See [data-set-lifecycle-check.md](./data-set-lifecycle-check.md). | 1 | [`deal.service.ts`](../../apps/backend/src/deal/deal.service.ts) | | `dataSetChallengeStatus` | Data Retention | Emitted on each [Data Retention Check](./data-retention.md) poll when a provider's confirmed proving-period totals advance (strictly positive deltas since the last poll). | `success` (challenges in newly confirmed successful proving periods), `failure` (challenges in newly confirmed faulted periods) | | Counter increment = **period delta × 5** (`CHALLENGES_PER_PROVING_PERIOD`). Period delta is the increase in subgraph-confirmed proving periods since the previous poll for that provider (not "challenges per poll" in the abstract). See [data-retention.md §3](./data-retention.md#3-calculate-deltas). | [`data-retention.service.ts`](../../apps/backend/src/data-retention/data-retention.service.ts) | | `pullRequestProviderStatus` | Pull | When the SP reports a terminal pull status via `waitForPullPieces`. Recorded exactly once per check (intermediate poll statuses are not counted). | Raw SP-reported pull status, for example `complete`, `failed`, `not_found`. Use this to separate SP-side pull failures from dealbot-side validation failures. | | 1 | [`pull-check.service.ts`](../../apps/backend/src/pull-check/pull-check.service.ts) | | `pullCheckStatus` | Pull | When the [Pull Check](./pull-check.md) terminates (success after direct piece validation, or any failure). 
Recorded exactly once per check. | `success`, `failure.timedout`, `failure.other` from [Pull Check Status](./pull-check.md#pull-check-status). | | 1 | [`pull-check.service.ts`](../../apps/backend/src/pull-check/pull-check.service.ts) | diff --git a/docs/data-set-termination.md b/docs/data-set-termination.md deleted file mode 100644 index 3c71e5fe..00000000 --- a/docs/data-set-termination.md +++ /dev/null @@ -1,220 +0,0 @@ -# Data Set Termination Job - -This doc proposes a calibration-only `data_set_termination` job that periodically terminates a dealbot managed dataset so the existing `data_set_creation` job naturally recreates it. The goal is to keep dealbot continuously exercising the on-chain `createDataSet` lifecycle instead of only creating datasets until a steady-state cap is reached. - -> **Note**: this design does **not** attempt `PDPVerifier.deleteDataSet`, which is SP-initiated; if `deleteDataSet` canary coverage is required for #586, that would need a different approach. - -## Summary - -- `data_set_termination` is a calibration-only job that periodically terminates one managed dataset slot per provider. -- Together with [`data_set_creation`](./data-set-creation.md), the two jobs form a bounded loop that keeps the `createDataSet` on-chain path continuously exercised as a canary. -- The job terminates **at most one dataset per invocation**; `data_set_creation` handles replenishment on its next scheduled tick. -- It runs on the shared `sp.work` queue with `singletonKey=spAddress`, so it cannot race with `deal`, `retrieval`, `piece_cleanup`, `pull_check`, or `data_set_creation` for the same provider. -- Schedule creation is gated by `NETWORK=calibration` and a non-empty canary window (`MIN_NUM_DATASETS_FOR_CHECKS - DATA_SET_TERMINATION_MIN_INDEX > 0`). -- Slots below `DATA_SET_TERMINATION_MIN_INDEX` are never touched, keeping a stable baseline for ongoing checks. 
- -## Problem Context - -During the PDPVerifier v3.4.0 rollout, `createDataSet` broke on calibration and mainnet. Dealbot did not detect the calibration outage because providers had already reached the steady-state cap of managed datasets ([`MIN_NUM_DATASETS_FOR_CHECKS`](./environment-variables.md#min_num_datasets_for_checks)). Once the cap was reached, `data_set_creation` stopped exercising the on-chain create path, so the canary value of that job disappeared. - -The missing capability is not more creation logic. The missing capability is a controlled way to create fresh demand for creation again. - -## Goals - -- Continuously exercise the calibration `createDataSet → terminateService → createDataSet` lifecycle. -- Reuse the existing `data_set_creation` job as the replenishment mechanism. -- Minimize disruption to ongoing deal and retrieval checks. -- Make termination cadence explicitly configurable so the expected create cadence can be reasoned about. -- Ensure the job cannot run on mainnet. -- Expose enough metrics and logs to extend the existing BetterStack dashboards. - -## Proposed job - -Introduce a new SP-scoped job type: `data_set_termination`. - -The job should: - -- run only on calibration -- run on a configurable cadence -- terminate at most one safe managed dataset per provider per invocation -- rely on the existing `data_set_creation` job to recreate the missing slot on a later tick - -This keeps termination simple and keeps creation logic centralized in the existing job. 
- -### Configuration - -The initial design adds these controls, which follow the same naming pattern as the creation job: - -- `DATASET_TERMINATIONS_PER_SP_PER_HOUR` - - mirrors the existing rate-based job controls - - converted internally to `intervalSeconds` - - used to reason about expected termination frequency - -- `DATA_SET_TERMINATION_JOB_TIMEOUT_SECONDS` - - max runtime for one termination job invocation - -- `DATA_SET_TERMINATION_MIN_INDEX` - - the lowest slot index eligible for termination (inclusive) - - default: `1` — only the baseline slot (index `0`) is protected - - slots `0` through `DATA_SET_TERMINATION_MIN_INDEX - 1` are never touched by this job - - example: `MIN_NUM_DATASETS_FOR_CHECKS = 10`, `DATA_SET_TERMINATION_MIN_INDEX = 5` → slots 0–4 are stable, slots 5–9 cycle as the canary window - - set to `MIN_NUM_DATASETS_FOR_CHECKS` to disable termination entirely — the canary window becomes empty and no schedule is created - - must be `>= 1` and `<=` [`MIN_NUM_DATASETS_FOR_CHECKS`](./environment-variables.md#min_num_datasets_for_checks); violating either constraint crashes the application on startup - -### Scheduling and queueing - -The scheduling model mirrors `data_set_creation`: - -- queue: shared `sp.work` -- `singletonKey=spAddress` - -Sharing the singleton with other SP jobs prevents termination from racing with a `deal`, `retrieval`, `pull_check`, `piece_cleanup`, or `data_set_creation` job for the same provider. - -The schedule is only upserted when all of the following are true: - -- `NETWORK=calibration` -- `MIN_NUM_DATASETS_FOR_CHECKS - DATA_SET_TERMINATION_MIN_INDEX > 0` - -The second condition covers the `DATA_SET_TERMINATION_MIN_INDEX = MIN_NUM_DATASETS_FOR_CHECKS` case (empty canary window, termination effectively off) without crashing. It also handles the case where `MIN_NUM_DATASETS_FOR_CHECKS` is later lowered to meet `DATA_SET_TERMINATION_MIN_INDEX` — no schedule is created without requiring a config change. 
- -### Proposed handler algorithm - -For one provider, one invocation of `data_set_termination` works like this: - -1. Check that the network is calibration. If not, log skip and exit. -2. Apply the same maintenance-window and SP-blocklist rules used by other SP jobs. -3. Create an `AbortController` using `DATA_SET_TERMINATION_JOB_TIMEOUT_SECONDS`. -4. Read `MIN_NUM_DATASETS_FOR_CHECKS` and base dataset metadata. -5. Scan slots from `minDataSets - 1` down to `DATA_SET_TERMINATION_MIN_INDEX`. For each slot: - - a. Build its metadata using the same logic as `data_set_creation`. - - b. Classify it via `getDataSetProvisioningStatus()`. - - c. Skip if `missing` — nothing to terminate. - - d. Skip if `terminated` — `data_set_creation` owns repair of these slots. - - e. Skip if `live` but has any deal row with `cleaned_up = false` — the deal job is still tracking it as active. -6. Call the termination flow on the first slot that passes all skip conditions (reaches step 5e without being skipped). -7. Log the outcome and exit for this tick. -8. If no eligible slot is found after the full scan, log `skipped.no_candidate` and exit. This is expected when `data_set_creation` has not yet replenished a previously terminated slot. - -As with `data_set_creation`, the job performs **at most one state-changing action per invocation**. - -### Proposed termination flow - -The termination flow should be implemented in a dedicated service method rather than inline in `JobsService`. - -1. Resolve provider info from cache and the target `dataSetId` using synapse-sdk by building slot dataset metadata. -2. Call the on-chain `terminateService` path through Synapse (`await synapse.storage.terminateDataSet({ dataSetId })`). -3. Wait for transaction receipt. -4. Poll until `pdpEndEpoch !== 0`. A live dataset has `pdpEndEpoch === 0`; once `terminateService` confirms, `pdpEndEpoch !== 0` is set on-chain. 
The Synapse SDK filters datasets with `pdpEndEpoch !== 0` from metadata lookups, so `getDataSetProvisioningStatus()` will return `missing` for this slot from this point on. -5. Once `pdpEndEpoch !== 0` is observed, the termination flow's work is done. `data_set_creation` will see the slot as `missing` on its next run and provision a replacement directly. - -Polling until the chain confirms termination is important because the canary value comes from the full on-chain lifecycle, not just submitting a transaction. - -#### Idempotency - -The termination flow must tolerate races and retries: - -- If `pdpEndEpoch !== 0` is already set when the job starts (slot is already terminated), skip the `terminateService` call and treat the run as a no-op success. -- If `terminateService` reverts with an already-terminated error (for example, `"service already terminated"` or `"dataset not active"`), treat it as idempotent success and proceed to the polling step. -- If the transaction confirms but `pdpEndEpoch` does not become non-zero before the abort signal fires, treat the run as `failure.timedout` and let pg-boss retry on the next tick. - -### Metrics and BetterStack dashboards - -The termination job has two distinct observability concerns: is the trigger firing, and is the canary signal it produces showing up in creation metrics. Creation metrics are the primary signal; termination metrics are only there to confirm the trigger is working. - -All metrics carry the standard label set defined in [`checks/events-and-metrics.md`](./checks/events-and-metrics.md#metrics): -`checkType`, `providerId`, `providerName`, `providerStatus`. - -For termination metrics, `checkType=dataSetTermination`. For creation metrics referenced below, `checkType=dataSetCreation`. - -#### Creation metrics (primary signal) - -These already exist and are defined in [`events-and-metrics.md`](./checks/events-and-metrics.md). 
`data_set_termination` creates the conditions for them to fire — if they stay silent after termination is running, something is wrong with creation. - -| Metric | `value` labels | What to watch for | -|--------|---------------|-------------------| -| [`dataSetCreationStatus`](./checks/events-and-metrics.md#dataSetCreationStatus) | `pending`, `success`, `failure.timedout`, `failure.other` | `success` count should rise in the interval after each termination; persistent `failure.*` after a termination indicates a `createDataSet` regression | -| [`dataSetCreationMs`](./checks/events-and-metrics.md#dataSetCreationMs) | — | Latency histogram for `createDataSetWithPiece`; spikes after termination may indicate on-chain congestion | - -#### Termination metrics (trigger health) - -New metrics proposed here. These confirm termination is producing the conditions for creation to run. If termination metrics look healthy but creation metrics are silent, the loop is broken somewhere between the two jobs. - -| Metric | `value` labels | What to watch for | -|--------|---------------|-------------------| -| `dataSetTerminationStatus` | `success`, `failure.timedout`, `failure.other`, `skipped.no_candidate` | `success` per provider confirms the trigger is firing; persistent `skipped.no_candidate` means `data_set_creation` is not replenishing fast enough | -| `dataSetTerminationMs` | — | Histogram from `terminateService` call to `pdpEndEpoch !== 0` confirmed; emitted on `success` and `failure.timedout` only. Analogous to `dataSetCreationMs` | - -#### Dashboard questions - -The BetterStack dashboards should make it easy to answer: - -- are `dataSetTerminationStatus{value="success"}` counts rising per provider on calibration? -- are `dataSetTerminationStatus{value="skipped.no_candidate"}` runs persisting longer than one creation interval, indicating `data_set_creation` is not replenishing? 
-- does `dataSetCreationStatus{value="success"}` follow `dataSetTerminationStatus{value="success"}` within the expected interval? -- are `dataSetCreationStatus{value="failure.*"}` counts rising after terminations, indicating a regression in `createDataSet`? - -## Relationship to `data_set_creation` - -The two jobs form a bounded loop. - -`data_set_termination` only terminates datasets that correspond to dealbot-managed metadata slots. `data_set_creation` detects the resulting `missing` slot through its normal metadata lookup and recreates it without needing any new cross-job state. - -Expected healthy behavior: - -1. `data_set_termination` calls `terminateService` and polls until `pdpEndEpoch !== 0`. -2. `data_set_creation` runs next. The Synapse SDK filters the terminated dataset from metadata lookups, so the slot resolves as `missing` immediately. `data_set_creation` provisions a replacement dataset directly in this run. -3. Existing creation metrics and alerts resume acting as the canary. - -**Rate constraint:** `DATASET_CREATIONS_PER_SP_PER_HOUR` should be **greater than or equal to** `DATASET_TERMINATIONS_PER_SP_PER_HOUR`. If termination runs faster than creation, the missing-slot backlog accumulates and the system stops behaving like a simple steady-state canary. The scheduler should emit a startup warning log when this constraint is violated so the misconfiguration is visible without a dashboard. - -**Canary window size:** The number of slots eligible for termination is `MIN_NUM_DATASETS_FOR_CHECKS - DATA_SET_TERMINATION_MIN_INDEX`. A canary window of `1` means a single slot cycles continuously; a larger window gives termination more candidates when one slot has active deals blocking it. In practice, a window of `2`–`3` is usually enough buffer. - -## Open Questions - -### Should `terminated` be renamed in `getDataSetProvisioningStatus`? 
- -The `terminated` status returned by `getDataSetProvisioningStatus` means: the Synapse SDK resolved a `dataSetId` from the metadata fingerprint but liveness probes failed. This is distinct from a dataset that has `pdpEndEpoch !== 0` on-chain (which the SDK filters out entirely, causing the slot to resolve as `missing`). - -The name `terminated` is already used for both the on-chain lifecycle concept and this SDK liveness-probe failure state, which causes confusion. Candidate replacements: `irrecoverable` or `missing.sp`. This rename would affect `data_set_creation`'s handler and repair path as well. - -### Should `data_set_termination` absorb the repair path from `data_set_creation`? - -Currently, `data_set_creation` owns two distinct responsibilities: -1. Repairing `terminated` slots (liveness-probe failures) via `repairTerminatedDataSet`. -2. Provisioning `missing` slots via `createDataSetWithPiece`. - -Once `data_set_termination` exists and calls `terminateService` directly, it handles on-chain termination for managed slots. The question is whether `data_set_creation` should be simplified to only own replenishment, with all termination (including repair) moving to `data_set_termination`. This is left open pending implementation experience. - - -## FAQ - -### What happens on-chain after `terminateService` is called? - -`terminateService` does not delete a dataset instantly. It starts a multi-step on-chain sequence that plays out over roughly 30 days. Understanding this is important because the termination job only needs the first step to complete before it can exit and let `data_set_creation` replenish the slot. - -**Step 1 — terminateService tx confirms** - -`terminateService` calls `FilecoinPay.terminateRail(pdpRailId)`, which sets `endEpoch = block.number + lockupPeriod` on the PDP rail. The FWSS `railTerminated` callback fires in the same transaction, stores `info.pdpEndEpoch`, and emits `PDPPaymentsTerminated` and `ServiceTerminated`. 
- -This is the point the termination job polls for: `pdpEndEpoch !== 0`. Once this is set, `data_set_creation` will classify the slot as `missing` and begin the replenishment sequence. The termination job's work is done here. - -**Step 2 — rail finalization (~30 days later)** - -When the PDP rail's `settledUpTo` reaches `endEpoch`, `finalizeTerminatedRail` fires atomically inside the same settle transaction. The rail is zeroed, `RailFinalized` is emitted, and any unused `lockupFixed` balance is returned to the payer. - -**Step 3 — dataset deletion at PDPVerifier (SP-initiated, after step 2)** - -After the rail finalizes, the storage provider calls `PDPVerifier.deleteDataSet`. This is an SP-only operation at the PDPVerifier layer. It clears the dataset's header state and invokes `FWSS.dataSetDeleted`, which verifies the rail has finalized and the lockup has elapsed before wiping FWSS-side state. Note that PDPVerifier's per-piece mappings are not cleared by this call. - -**Why the termination job only waits for step 1** - -Step 2 happens when `settleRail` is called and the rail's `settledUpTo` reaches `endEpoch`. Step 3 requires the SP to call `PDPVerifier.deleteDataSet` after the rail finalizes. The termination job does not need to wait for either — the slot is considered missing for dealbot's purposes as soon as `pdpEndEpoch !== 0` is set. Waiting for full finalization would mean waiting ~30 days per invocation, which defeats the purpose of a canary cycle. 
- -## Source of truth - -- Dataset creation design: [`docs/data-set-creation.md`](./data-set-creation.md) -- Job system overview: [`docs/jobs.md`](./jobs.md) -- Metrics and event definitions: [`docs/checks/events-and-metrics.md`](./checks/events-and-metrics.md) -- Scheduler and workers: [`apps/backend/src/jobs/jobs.service.ts`](../apps/backend/src/jobs/jobs.service.ts) -- Dataset creation handler: [`apps/backend/src/jobs/data-set-creation.handler.ts`](../apps/backend/src/jobs/data-set-creation.handler.ts) -- Deal service dataset logic (including `getDataSetProvisioningStatus`, `repairTerminatedDataSet`): [`apps/backend/src/deal/deal.service.ts`](../apps/backend/src/deal/deal.service.ts) diff --git a/docs/environment-variables.md b/docs/environment-variables.md index 61e0fe94..5dd4e395 100644 --- a/docs/environment-variables.md +++ b/docs/environment-variables.md @@ -11,7 +11,7 @@ This document provides a comprehensive guide to all environment variables used b | [Blockchain](#blockchain-configuration) | `NETWORK`, `RPC_URL`, `WALLET_ADDRESS`, `WALLET_PRIVATE_KEY`, `SESSION_KEY_PRIVATE_KEY`, `CHECK_DATASET_CREATION_FEES`, `USE_ONLY_APPROVED_PROVIDERS`, `PDP_SUBGRAPH_ENDPOINT` | | [Dataset Versioning](#dataset-versioning) | `DEALBOT_DATASET_VERSION` | | [Scheduling](#scheduling-configuration) | `PROVIDERS_REFRESH_INTERVAL_SECONDS`, `DATA_RETENTION_POLL_INTERVAL_SECONDS`, `DEALBOT_MAINTENANCE_WINDOWS_UTC`, `DEALBOT_MAINTENANCE_WINDOW_MINUTES` | -| [Jobs (pg-boss)](#jobs-pg-boss) | `DEALBOT_PGBOSS_SCHEDULER_ENABLED`, `DEALBOT_PGBOSS_POOL_MAX`, `DEALS_PER_SP_PER_HOUR`, `MIN_NUM_DATASETS_FOR_CHECKS`, `DATA_SET_TERMINATION_MIN_INDEX`, `DATASET_CREATIONS_PER_SP_PER_HOUR`, `DATASET_TERMINATION_ENABLED`, `DATASET_TERMINATIONS_PER_SP_PER_HOUR`, `RETRIEVALS_PER_SP_PER_HOUR`, `JOB_SCHEDULER_POLL_SECONDS`, `JOB_WORKER_POLL_SECONDS`, `PG_BOSS_LOCAL_CONCURRENCY`, `JOB_CATCHUP_MAX_ENQUEUE`, `JOB_SCHEDULE_PHASE_SECONDS`, `JOB_ENQUEUE_JITTER_SECONDS`, 
`DATA_SET_CREATION_JOB_TIMEOUT_SECONDS`, `DATA_SET_TERMINATION_JOB_TIMEOUT_SECONDS`, `DEAL_JOB_TIMEOUT_SECONDS`, `RETRIEVAL_JOB_TIMEOUT_SECONDS`, `SHUTDOWN_FINAL_SCRAPE_DELAY_SECONDS`, `IPFS_BLOCK_FETCH_CONCURRENCY` | +| [Jobs (pg-boss)](#jobs-pg-boss) | `DEALBOT_PGBOSS_SCHEDULER_ENABLED`, `DEALBOT_PGBOSS_POOL_MAX`, `DEALS_PER_SP_PER_HOUR`, `MIN_NUM_DATASETS_FOR_CHECKS`, `DATASET_CREATIONS_PER_SP_PER_HOUR`, `DATASET_LIFECYCLE_CHECK_ENABLED`, `DATASET_LIFECYCLE_CHECKS_PER_SP_PER_HOUR`, `RETRIEVALS_PER_SP_PER_HOUR`, `JOB_SCHEDULER_POLL_SECONDS`, `JOB_WORKER_POLL_SECONDS`, `PG_BOSS_LOCAL_CONCURRENCY`, `JOB_CATCHUP_MAX_ENQUEUE`, `JOB_SCHEDULE_PHASE_SECONDS`, `JOB_ENQUEUE_JITTER_SECONDS`, `DATA_SET_CREATION_JOB_TIMEOUT_SECONDS`, `DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS`, `DEAL_JOB_TIMEOUT_SECONDS`, `RETRIEVAL_JOB_TIMEOUT_SECONDS`, `SHUTDOWN_FINAL_SCRAPE_DELAY_SECONDS`, `IPFS_BLOCK_FETCH_CONCURRENCY` | | [Dataset](#dataset-configuration) | `DEALBOT_LOCAL_DATASETS_PATH`, `RANDOM_PIECE_SIZES` | | [ClickHouse](#clickhouse-configuration) | `CLICKHOUSE_URL`, `CLICKHOUSE_BATCH_SIZE`, `CLICKHOUSE_FLUSH_INTERVAL_MS`, `DEALBOT_PROBE_LOCATION` | | [Timeouts](#timeout-configuration) | `CONNECT_TIMEOUT_MS`, `HTTP_REQUEST_TIMEOUT_MS`, `HTTP2_REQUEST_TIMEOUT_MS`, `IPNI_VERIFICATION_TIMEOUT_MS`, `IPNI_VERIFICATION_POLLING_MS` | @@ -662,28 +662,6 @@ rate-based (per hour) and persisted in Postgres so restarts do not reset timing. --- -### `DATA_SET_TERMINATION_MIN_INDEX` - -- **Type**: `number` (integer) -- **Required**: No -- **Default**: `1` -- **Minimum**: `1` -- **Maximum**: `MIN_NUM_DATASETS_FOR_CHECKS` -- **Enforced**: Yes (config validation; violating either bound crashes the application on startup) - -**Role**: The lowest dataset slot index (inclusive) the `data_set_termination` canary may terminate. Slots `0..(DATA_SET_TERMINATION_MIN_INDEX - 1)` are never touched, keeping a stable baseline for ongoing checks. 
The canary window is `[DATA_SET_TERMINATION_MIN_INDEX, MIN_NUM_DATASETS_FOR_CHECKS)`. - -**When to update**: - -- Increase to protect more low-index slots from termination. -- Set equal to `MIN_NUM_DATASETS_FOR_CHECKS` to disable termination entirely (the canary window becomes empty and no schedule is created). - -**Example**: `MIN_NUM_DATASETS_FOR_CHECKS=10`, `DATA_SET_TERMINATION_MIN_INDEX=5` → slots 0–4 are stable, slots 5–9 cycle as the canary window. - -**See also**: [`docs/data-set-termination.md`](./data-set-termination.md) - ---- - ### `DATASET_CREATIONS_PER_SP_PER_HOUR` - **Type**: `number` @@ -698,31 +676,31 @@ rate-based (per hour) and persisted in Postgres so restarts do not reset timing. --- -### `DATASET_TERMINATION_ENABLED` +### `DATASET_LIFECYCLE_CHECK_ENABLED` - **Type**: `boolean` - **Required**: No - **Default**: `true` on calibration, `false` on mainnet -**Role**: Enables the `data_set_termination` canary job, which periodically terminates one managed dataset slot per provider so `data_set_creation` recreates it, keeping the on-chain `createDataSet` lifecycle continuously exercised. +**Role**: Enables the `data_set_lifecycle_check` canary job, which in a single tick creates a throwaway data set with a seed piece and immediately terminates it (`terminateService`), continuously exercising the on-chain `createDataSet → terminateService` lifecycle. -**Notes**: Even when enabled, a schedule is only created when the canary window is non-empty (`MIN_NUM_DATASETS_FOR_CHECKS - DATA_SET_TERMINATION_MIN_INDEX > 0`). The default-empty window with `MIN_NUM_DATASETS_FOR_CHECKS=1` means termination is effectively off until you raise `MIN_NUM_DATASETS_FOR_CHECKS`. +**Notes**: Self-contained — it does not touch the managed check data sets and does not depend on `data_set_creation`. When disabled, stale schedules are removed so they stop enqueuing no-op jobs. 
-**See also**: [`docs/data-set-termination.md`](./data-set-termination.md) +**See also**: [`docs/checks/data-set-lifecycle-check.md`](./checks/data-set-lifecycle-check.md) --- -### `DATASET_TERMINATIONS_PER_SP_PER_HOUR` +### `DATASET_LIFECYCLE_CHECKS_PER_SP_PER_HOUR` - **Type**: `number` - **Required**: No - **Default**: `1` -**Role**: Target dataset termination rate per storage provider for the `data_set_termination` canary. +**Role**: Target lifecycle check rate per storage provider for the `data_set_lifecycle_check` canary. Each run creates and terminates one throwaway data set. **Limits**: Config schema caps this at 20. -**Notes**: Should be **less than or equal to** `DATASET_CREATIONS_PER_SP_PER_HOUR` so creation can replenish terminated slots without backlog. A startup warning is logged if this constraint is violated. Fractional values are supported. +**Notes**: Independent of `DATASET_CREATIONS_PER_SP_PER_HOUR`. Fractional values are supported. --- @@ -860,24 +838,24 @@ Use this to stagger multiple dealbot deployments that are not sharing a database --- -### `DATA_SET_TERMINATION_JOB_TIMEOUT_SECONDS` +### `DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS` - **Type**: `number` - **Required**: No -- **Default**: `300` (5 minutes) +- **Default**: `600` (10 minutes) - **Minimum**: `60` (1 minute) - **Enforced**: Yes (config validation, effective floor applied at runtime) -**Role**: Maximum runtime for `data_set_termination` jobs before forced abort via `AbortController`. Bounds the slot scan, the `terminateService` call, and the `pdpEndEpoch != 0` confirmation poll. +**Role**: Maximum runtime for `data_set_lifecycle_check` jobs before forced abort via `AbortController`. Bounds the seed-piece upload, the `terminateService` call, and the `pdpEndEpoch != 0` confirmation poll. **When to update**: -- Increase if `pdpEndEpoch` confirmation consistently times out on slow networks. +- Increase if create-plus-terminate consistently times out on slow networks. 
- Decrease for faster fail-fast behavior during testing. -**Note**: If the configured value is below 60 seconds, the runtime silently raises it to 60 seconds as an effective floor. An abort due to this timeout (or an internal poll timeout) is recorded as `dataSetTerminationStatus{value="failure.timedout"}` and retried on the next scheduled tick. +**Note**: If the configured value is below 60 seconds, the runtime silently raises it to 60 seconds as an effective floor. An abort due to this timeout (or an internal poll timeout) is recorded as `dataSetLifecycleCheckStatus{value="failure.timedout"}` and retried on the next scheduled tick. -**See also**: [`docs/data-set-termination.md`](./data-set-termination.md) +**See also**: [`docs/checks/data-set-lifecycle-check.md`](./checks/data-set-lifecycle-check.md) --- diff --git a/docs/jobs.md b/docs/jobs.md index e2f60f3a..b29fc6b4 100644 --- a/docs/jobs.md +++ b/docs/jobs.md @@ -15,7 +15,7 @@ This doc explains what a "job" is in dealbot, how jobs are defined, how they're | --- | --- | --- | | `job_schedule_state` | One per `` plus global rows | Schedule state owned by dealbot. | | Storage provider (SP) | One per SP in registry | Filtered by `USE_ONLY_APPROVED_PROVIDERS` when enabled. | -| Job type | `deal`, `retrieval`, `data_set_creation`, `data_set_termination`, `piece_cleanup`, `pull_check`, `providers_refresh`, `data_retention_poll` | `deal` corresponds to "data storage check" externally; we keep `deal` in code/DB for compatibility. | +| Job type | `deal`, `retrieval`, `data_set_creation`, `data_set_lifecycle_check`, `piece_cleanup`, `pull_check`, `providers_refresh`, `data_retention_poll` | `deal` corresponds to "data storage check" externally; we keep `deal` in code/DB for compatibility. | | pg-boss queue | `sp.work`, `providers.refresh`, `data.retention.poll` | `sp.work` is a singleton queue. | | Dealbot scheduler | One per process (when enabled) | Runs the scheduling loop. 
| | Dealbot worker process | One Node.js process with `DEALBOT_RUN_MODE=worker` or `both` | Hosts pg-boss workers. | @@ -37,7 +37,7 @@ This doc explains what a "job" is in dealbot, how jobs are defined, how they're | `piece_cleanup` | `sp.work` | [`JobsService.handlePieceCleanupJob`](../apps/backend/src/jobs/jobs.service.ts) | `{ jobType: 'piece_cleanup', spAddress, intervalSeconds }` | — | | `pull_check` | `sp.work` | [`JobsService.handlePullCheckJob`](../apps/backend/src/jobs/jobs.service.ts) | `{ jobType: 'pull_check', spAddress, intervalSeconds }` | [pull check](./checks/pull-check.md) | | `data_set_creation` | `sp.work` | [`JobsService.handleDataSetCreationJob`](../apps/backend/src/jobs/jobs.service.ts) | `{ jobType: 'data_set_creation', spAddress, intervalSeconds }` | [data-set-creation](./data-set-creation.md) | -| `data_set_termination` | `sp.work` | [`JobsService.handleDataSetTerminationJob`](../apps/backend/src/jobs/jobs.service.ts) | `{ jobType: 'data_set_termination', spAddress, intervalSeconds }` | [data-set-termination](./data-set-termination.md) | +| `data_set_lifecycle_check` | `sp.work` | [`JobsService.handleDataSetLifecycleCheckJob`](../apps/backend/src/jobs/jobs.service.ts) | `{ jobType: 'data_set_lifecycle_check', spAddress, intervalSeconds }` | [data-set-lifecycle-check](./checks/data-set-lifecycle-check.md) | `sp.work` is created with `policy=singleton`, and jobs set `singletonKey=spAddress` so only one active job per SP can run at a time. 
From 701b463709f0c2fbeb2dfafa2133f3a716bc93f7 Mon Sep 17 00:00:00 2001 From: silent-cipher Date: Fri, 5 Jun 2026 00:59:36 +0530 Subject: [PATCH 09/16] chore: format --- apps/backend/src/deal/deal.service.ts | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/apps/backend/src/deal/deal.service.ts b/apps/backend/src/deal/deal.service.ts index 53cc7343..796dae05 100644 --- a/apps/backend/src/deal/deal.service.ts +++ b/apps/backend/src/deal/deal.service.ts @@ -1,11 +1,11 @@ +import { randomUUID } from "node:crypto"; +import { setTimeout as setTimeoutAsync } from "node:timers/promises"; import { METADATA_KEYS, SIZE_CONSTANTS, Synapse } from "@filoz/synapse-sdk"; import { Injectable, Logger, type OnModuleDestroy, type OnModuleInit } from "@nestjs/common"; import { ConfigService } from "@nestjs/config"; import { InjectRepository } from "@nestjs/typeorm"; import { executeUpload } from "filecoin-pin"; import { CID } from "multiformats/cid"; -import { randomUUID } from "node:crypto"; -import { setTimeout as setTimeoutAsync } from "node:timers/promises"; import type { Repository } from "typeorm"; import { ClickhouseService } from "../clickhouse/clickhouse.service.js"; import { awaitWithAbort } from "../common/abort-utils.js"; @@ -24,8 +24,8 @@ import type { IBlockchainConfig, IConfig } from "../config/app.config.js"; import { Deal } from "../database/entities/deal.entity.js"; import { StorageProvider } from "../database/entities/storage-provider.entity.js"; import { DealStatus, IpniStatus, ServiceType } from "../database/types.js"; -import { DatasetLivenessService } from "../dataset-liveness/dataset-liveness.service.js"; import { DataSourceService } from "../dataSource/dataSource.service.js"; +import { DatasetLivenessService } from "../dataset-liveness/dataset-liveness.service.js"; import { DealAddonsService } from "../deal-addons/deal-addons.service.js"; import type { DealPreprocessingResult } from "../deal-addons/types.js"; import { 
buildCheckMetricLabels, classifyFailureStatus } from "../metrics-prometheus/check-metric-labels.js"; From 7024e33068b7752c92e857c0e2652c51da7c984b Mon Sep 17 00:00:00 2001 From: silent-cipher Date: Fri, 5 Jun 2026 01:15:53 +0530 Subject: [PATCH 10/16] refactor: consolidate logging context + early checks --- apps/backend/src/deal/deal.service.ts | 86 +++++++++++++-------------- 1 file changed, 43 insertions(+), 43 deletions(-) diff --git a/apps/backend/src/deal/deal.service.ts b/apps/backend/src/deal/deal.service.ts index 796dae05..4936a24c 100644 --- a/apps/backend/src/deal/deal.service.ts +++ b/apps/backend/src/deal/deal.service.ts @@ -872,27 +872,34 @@ export class DealService implements OnModuleInit, OnModuleDestroy { pollTimeoutMs = 60_000, ): Promise<{ dataSetId: bigint; pdpEndEpoch: bigint }> { const providerInfo = this.walletSdkService.getProviderInfo(providerAddress); + if (!providerInfo) { + throw new Error(`Provider ${providerAddress} not found in registry`); + } + + const logContext = { + providerAddress, + providerName: providerInfo.name, + providerId: providerInfo.id, + }; const labels = buildCheckMetricLabels({ checkType: "dataSetLifecycleCheck", - providerId: providerInfo?.id, - providerName: providerInfo?.name, - providerIsApproved: providerInfo?.isApproved, + providerId: providerInfo.id, + providerName: providerInfo.name, + providerIsApproved: providerInfo.isApproved, }); const startedAt = Date.now(); this.logger.log({ event: "dataset_lifecycle_check_started", message: "Starting data-set lifecycle check (create then terminate)", - providerAddress, - providerId: providerInfo?.id, - providerName: providerInfo?.name, + ...logContext, metadata, }); let dataSetId: bigint | undefined; try { // 1. Create a fresh throwaway data set with a seed piece (no creation metrics). 
- ({ dataSetId } = await this.createDataSetWithPieceInternal(providerAddress, metadata, signal)); + ({ dataSetId } = await this.createDataSetWithPieceInternal(providerInfo, metadata, signal)); if (!dataSetId) { throw new Error("Data-set creation upload completed without resolving a dataSetId"); } @@ -905,9 +912,7 @@ export class DealService implements OnModuleInit, OnModuleDestroy { this.logger.log({ event: "dataset_lifecycle_check_succeeded", message: "Data-set lifecycle check completed: created and terminated throwaway data set", - providerAddress, - providerId: providerInfo?.id, - providerName: providerInfo?.name, + ...logContext, dataSetId: dataSetId.toString(), pdpEndEpoch: pdpEndEpoch.toString(), durationMs, @@ -927,9 +932,7 @@ export class DealService implements OnModuleInit, OnModuleDestroy { dataSetId === undefined ? "Data-set lifecycle check failed during creation" : "Data-set lifecycle check failed during termination; throwaway data set may have leaked", - providerAddress, - providerId: providerInfo?.id, - providerName: providerInfo?.name, + ...logContext, dataSetId: dataSetId?.toString(), durationMs, status, @@ -977,26 +980,33 @@ export class DealService implements OnModuleInit, OnModuleDestroy { signal?: AbortSignal, ): Promise { const providerInfo = this.walletSdkService.getProviderInfo(providerAddress); + if (!providerInfo) { + throw new Error(`Provider ${providerAddress} not found in registry`); + } + + const logContext = { + providerAddress, + providerId: providerInfo.id, + providerName: providerInfo.name, + }; const labels = buildCheckMetricLabels({ checkType: "dataSetCreation", - providerId: providerInfo?.id, - providerName: providerInfo?.name, - providerIsApproved: providerInfo?.isApproved, + providerId: providerInfo.id, + providerName: providerInfo.name, + providerIsApproved: providerInfo.isApproved, }); const startedAt = Date.now(); this.dataSetCreationMetrics.recordStatus(labels, "pending"); this.logger.log({ + ...logContext, event: 
"dataset_creation_with_piece_started", message: "Starting data-set creation with piece", - providerAddress, - providerId: providerInfo?.id, - providerName: providerInfo?.name, metadata, }); try { - const result = await this.createDataSetWithPieceInternal(providerAddress, metadata, signal); + const result = await this.createDataSetWithPieceInternal(providerInfo, metadata, signal); const durationMs = Date.now() - startedAt; this.dataSetCreationMetrics.observeCheckDuration(labels, durationMs); this.dataSetCreationMetrics.recordStatus(labels, "success"); @@ -1005,9 +1015,7 @@ export class DealService implements OnModuleInit, OnModuleDestroy { this.logger.warn({ event: "dataset_creation_missing_onchain_events", message: "Data-set creation succeeded without full on-chain progress events", - providerAddress, - providerId: providerInfo?.id, - providerName: providerInfo?.name, + ...logContext, pieceAdded: result.pieceAdded, piecesConfirmed: result.piecesConfirmed, }); @@ -1016,9 +1024,7 @@ export class DealService implements OnModuleInit, OnModuleDestroy { this.logger.log({ event: "dataset_creation_with_piece_succeeded", message: "Data-set created with piece", - providerAddress, - providerId: providerInfo?.id, - providerName: providerInfo?.name, + ...logContext, durationMs, dataSetId: result.dataSetId ?? "unknown", pieceCid: result.pieceCid, @@ -1034,9 +1040,7 @@ export class DealService implements OnModuleInit, OnModuleDestroy { this.logger.error({ event: "dataset_creation_with_piece_failed", message: "Data-set creation with piece failed", - providerAddress, - providerId: providerInfo?.id, - providerName: providerInfo?.name, + ...logContext, durationMs, error: toStructuredError(error), }); @@ -1054,7 +1058,7 @@ export class DealService implements OnModuleInit, OnModuleDestroy { * the context resolved no `dataSetId` (we cannot operate on an unidentified set). 
*/ private async createDataSetWithPieceInternal( - providerAddress: string, + providerInfo: PDPProviderEx, metadata: Record, signal?: AbortSignal, ): Promise<{ @@ -1066,11 +1070,13 @@ export class DealService implements OnModuleInit, OnModuleDestroy { piecesConfirmed: boolean; }> { signal?.throwIfAborted(); - const providerInfo = this.walletSdkService.getProviderInfo(providerAddress); - if (!providerInfo) { - throw new Error(`Provider ${providerAddress} not found in registry`); - } + const providerAddress = providerInfo.serviceProvider; + const logContext = { + providerAddress, + providerName: providerInfo.name, + providerId: providerInfo.id, + }; let pieceAdded = false; let piecesConfirmed = false; let pieceCid: string | undefined; @@ -1115,9 +1121,7 @@ export class DealService implements OnModuleInit, OnModuleDestroy { this.logger.debug({ event: "dataset_creation_stored", message: "Data-set creation stored", - providerAddress, - providerId: providerInfo.id, - providerName: providerInfo.name, + ...logContext, pieceCid, }); break; @@ -1126,9 +1130,7 @@ export class DealService implements OnModuleInit, OnModuleDestroy { this.logger.debug({ event: "dataset_creation_pieces_added", message: "Data-set creation pieces added", - providerAddress, - providerId: providerInfo.id, - providerName: providerInfo.name, + ...logContext, txHash: event.data.txHash ?? 
"unknown", }); break; @@ -1137,9 +1139,7 @@ export class DealService implements OnModuleInit, OnModuleDestroy { this.logger.debug({ event: "dataset_creation_pieces_confirmed", message: "Data-set creation pieces confirmed", - providerAddress, - providerId: providerInfo.id, - providerName: providerInfo.name, + ...logContext, pieceIds: event.data.pieceIds, }); break; From 4ee5714e227ed70fb5cc7d2c99fec87c42f93602 Mon Sep 17 00:00:00 2001 From: silent-cipher Date: Fri, 5 Jun 2026 15:14:34 +0530 Subject: [PATCH 11/16] fix: default job timeout to 6 mins --- apps/backend/src/common/constants.ts | 7 +++++++ apps/backend/src/config/app.config.ts | 8 ++++---- apps/backend/src/jobs/jobs.service.spec.ts | 13 +++++-------- apps/backend/src/jobs/jobs.service.ts | 16 +++------------- 4 files changed, 19 insertions(+), 25 deletions(-) diff --git a/apps/backend/src/common/constants.ts b/apps/backend/src/common/constants.ts index 57416ae0..ebcc3ef6 100644 --- a/apps/backend/src/common/constants.ts +++ b/apps/backend/src/common/constants.ts @@ -7,3 +7,10 @@ export const ZERO_ADDRESS = "0x0000000000000000000000000000000000000000"; export const MAX_BLOCK_SIZE = 5 * 1024 * 1024; export const DEV_TAG = stringToHex("dev"); + +/** + * Fixed metadata marker key tagging every throwaway data set created by the + * `data_set_lifecycle_check` job. The value is a per-run nonce; the key is the stable + * handle operators use to list/sweep leaked sets (create-OK / terminate-failed runs). 
+ */ +export const LIFECYCLE_CHECK_METADATA_KEY = "dealbotLifecycleCheck"; diff --git a/apps/backend/src/config/app.config.ts b/apps/backend/src/config/app.config.ts index 134da33e..b32c096b 100644 --- a/apps/backend/src/config/app.config.ts +++ b/apps/backend/src/config/app.config.ts @@ -98,7 +98,7 @@ export const configValidationSchema = Joi.object({ DEAL_JOB_TIMEOUT_SECONDS: Joi.number().min(120).default(360), // 6 minutes max runtime for data storage jobs (TODO: reduce default to 3 minutes) RETRIEVAL_JOB_TIMEOUT_SECONDS: Joi.number().min(60).default(60), // 1 minute max runtime for retrieval jobs (TODO: reduce default to 30 seconds) DATA_SET_CREATION_JOB_TIMEOUT_SECONDS: Joi.number().min(60).default(300), // 5 minutes max runtime for dataset creation jobs - DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS: Joi.number().min(60).default(600), // 10 minutes: covers create + seed-piece upload + terminate + pdpEndEpoch poll + DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS: Joi.number().min(60).default(360), // 6 minutes: covers create + seed-piece upload + terminate + pdpEndEpoch poll // Seconds to hold the process alive after pg-boss drain completes, so Prometheus // captures at least one scrape of the terminal counter increments emitted during // shutdown. Default 35 covers the 30s ServiceMonitor interval plus a 5s buffer. @@ -233,8 +233,8 @@ export interface IJobsConfig { */ dataSetCreationsPerSpPerHour: number; /** - * Enables the calibration-focused `data_set_lifecycle_check` canary job, which - * creates a throwaway data set and immediately terminates it in a single tick. + * Enables the `data_set_lifecycle_check` canary job, which creates a + * throwaway data set and immediately terminates it in a single tick. * * Defaults to true on calibration and false on mainnet. 
*/ @@ -520,7 +520,7 @@ export function loadConfig(): IConfig { retrievalJobTimeoutSeconds: Number.parseInt(process.env.RETRIEVAL_JOB_TIMEOUT_SECONDS || "60", 10), dataSetCreationJobTimeoutSeconds: Number.parseInt(process.env.DATA_SET_CREATION_JOB_TIMEOUT_SECONDS || "300", 10), dataSetLifecycleCheckJobTimeoutSeconds: Number.parseInt( - process.env.DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS || "600", + process.env.DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS || "360", 10, ), shutdownFinalScrapeDelaySeconds: Number.parseInt(process.env.SHUTDOWN_FINAL_SCRAPE_DELAY_SECONDS || "35", 10), diff --git a/apps/backend/src/jobs/jobs.service.spec.ts b/apps/backend/src/jobs/jobs.service.spec.ts index b62e5f4b..b0304097 100644 --- a/apps/backend/src/jobs/jobs.service.spec.ts +++ b/apps/backend/src/jobs/jobs.service.spec.ts @@ -125,10 +125,7 @@ describe("JobsService schedule rows", () => { baseConfigValues = { app: { runMode: "both" } as IConfig["app"], - blockchain: { - useOnlyApprovedProviders: false, - minNumDataSetsForChecks: 1, - } as IConfig["blockchain"], + blockchain: { useOnlyApprovedProviders: false, minNumDataSetsForChecks: 1 } as IConfig["blockchain"], scheduling: { providersRefreshIntervalSeconds: 4 * 3600, dataRetentionPollIntervalSeconds: 3600, @@ -146,7 +143,7 @@ describe("JobsService schedule rows", () => { dataSetCreationJobTimeoutSeconds: 300, dataSetLifecycleCheckEnabled: false, dataSetLifecycleChecksPerSpPerHour: 1, - dataSetLifecycleCheckJobTimeoutSeconds: 600, + dataSetLifecycleCheckJobTimeoutSeconds: 360, shutdownFinalScrapeDelaySeconds: 35, pieceCleanupPerSpPerHour: 1, maxPieceCleanupRuntimeSeconds: 300, @@ -1616,10 +1613,10 @@ describe("JobsService schedule rows", () => { await vi.advanceTimersByTimeAsync(35_001); await shutdownPromise; - // Defaults: deal=360, retrieval=60, dataSetCreation=300, dataSetLifecycleCheck=600, - // pullCheck=300 → max=600 → +60s buffer + // Defaults: deal=360, retrieval=60, dataSetCreation=300, dataSetLifecycleCheck=360, + 
// pullCheck=300 → max=360 → +60s buffer expect(bossMock.stop).toHaveBeenCalledTimes(1); - expect(bossMock.stop).toHaveBeenCalledWith({ graceful: true, timeout: 660_000 }); + expect(bossMock.stop).toHaveBeenCalledWith({ graceful: true, timeout: 420_000 }); }); it("picks the longest timeout across all job types, including pullCheck under pullPiece", async () => { diff --git a/apps/backend/src/jobs/jobs.service.ts b/apps/backend/src/jobs/jobs.service.ts index 42490d8b..c020a37a 100644 --- a/apps/backend/src/jobs/jobs.service.ts +++ b/apps/backend/src/jobs/jobs.service.ts @@ -5,6 +5,7 @@ import { InjectMetric } from "@willsoto/nestjs-prometheus"; import { type Job, PgBoss, type SendOptions } from "pg-boss"; import type { Counter, Gauge, Histogram } from "prom-client"; import type { Repository } from "typeorm"; +import { LIFECYCLE_CHECK_METADATA_KEY } from "../common/constants.js"; import { DealJobTerminatedDataSetError } from "../common/errors.js"; import { type JobLogContext, type ProviderJobContext, toStructuredError } from "../common/logging.js"; import { getMaintenanceWindowStatus } from "../common/maintenance-window.js"; @@ -27,13 +28,6 @@ import { } from "./job-queues.js"; import { JobScheduleRepository } from "./repositories/job-schedule.repository.js"; -/** - * Fixed metadata marker key tagging every throwaway data set created by the - * `data_set_lifecycle_check` job. The value is a per-run nonce; the key is the stable - * handle operators use to list/sweep leaked sets (create-OK / terminate-failed runs). - */ -const LIFECYCLE_CHECK_METADATA_KEY = "dealbotLifecycleCheck"; - type SpJobType = | "deal" | "retrieval" @@ -943,10 +937,7 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { * * Creates a throwaway data set with a seed piece, then immediately calls * `terminateService` on it — exercising the full create -> terminate lifecycle in a - * single tick. 
The set carries a fixed `dealbotLifecycleCheck` marker key (with a - * per-run nonce value to force a fresh set), so any set leaked by a create-OK / - * terminate-failed run can be found and swept manually by that key. Gated by - * `DATASET_LIFECYCLE_CHECK_ENABLED`. + * single tick. */ private async handleDataSetLifecycleCheckJob(job: SpJob): Promise { const data = job.data; @@ -1207,8 +1198,7 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { const dataRetentionPollStartAt = new Date(now.getTime() + phaseMs); const providersRefreshStartAt = new Date(now.getTime() + phaseMs); - const blockchainCfg = this.configService.get("blockchain", { infer: true }); - const minDataSets = blockchainCfg.minNumDataSetsForChecks; + const minDataSets = this.configService.get("blockchain", { infer: true }).minNumDataSetsForChecks; // Lifecycle check schedules are only created when enabled explicitly const lifecycleCheckScheduleEnabled = this.configService.get("jobs", { infer: true }).dataSetLifecycleCheckEnabled; const cleanupStartAt = new Date(now.getTime() + phaseMs); From 11b870129637eccfac28200036054a0fcc2a2c1d Mon Sep 17 00:00:00 2001 From: silent-cipher Date: Fri, 5 Jun 2026 15:15:04 +0530 Subject: [PATCH 12/16] docs: update data set lifecycle check doc --- docs/checks/data-set-lifecycle-check.md | 215 ++++++++++++------------ 1 file changed, 112 insertions(+), 103 deletions(-) diff --git a/docs/checks/data-set-lifecycle-check.md b/docs/checks/data-set-lifecycle-check.md index 099e394f..aa131cd5 100644 --- a/docs/checks/data-set-lifecycle-check.md +++ b/docs/checks/data-set-lifecycle-check.md @@ -1,118 +1,127 @@ # Data Set Lifecycle Check -`data_set_lifecycle_check` is a calibration-focused canary job that, in a **single tick**, -creates a throwaway data set with a seed piece and immediately terminates it -(`terminateService`). 
It exists to continuously exercise the full on-chain -`createDataSet → terminateService` lifecycle so dealbot detects regressions in either path, -independent of how many managed check data sets a provider already has. - -> **Note**: this job does **not** attempt `PDPVerifier.deleteDataSet`, which is SP-initiated. -> See the FAQ for what happens on-chain after `terminateService`. - -## Summary - -- Self-contained: one invocation **creates one data set and terminates it**. It does not - touch the managed check data sets (slots `0..MIN_NUM_DATASETS_FOR_CHECKS-1`) and does not - depend on `data_set_creation` to replenish anything. -- Runs on the shared `sp.work` queue with `singletonKey=spAddress`, so it cannot race with - `deal`, `retrieval`, `piece_cleanup`, `pull_check`, or `data_set_creation` for the same provider. -- Schedule creation is gated by `DATASET_LIFECYCLE_CHECK_ENABLED` (default: true on - calibration, false on mainnet). -- The throwaway data set carries a single fixed metadata marker key, `dealbotLifecycleCheck`, - with a per-run nonce value. No base/slot metadata is attached. - -## Why a single-tick create + terminate - -The previous design terminated an existing managed slot and relied on `data_set_creation` to -recreate it on a later tick. That coupled the canary to `MIN_NUM_DATASETS_FOR_CHECKS`, a -min-index window, and the creation job's cadence. The lifecycle check collapses this into one -self-contained job: it always creates a fresh set and terminates it in the same run, so the -canary works regardless of provider state and needs no cross-job coordination. - -### Trade-off: leakage - -If creation succeeds but termination fails (process crash, job timeout, on-chain revert that -isn't an already-terminated no-op), the created data set **leaks** — it stays live on the SP. -This is an accepted trade-off for the job's simplicity. 
- -Because every set created by this job carries the fixed `dealbotLifecycleCheck` metadata key, -leaked sets are discoverable and can be swept manually (filter datasets by that metadata key). -If leakage grows significantly, that is the handle to clean up by. The -`dataset_lifecycle_check_failed` log line with `leakedDataSet: true` records the `dataSetId` -of each leak when it happens. +This document is the **source of truth** for how dealbot's Data Set Lifecycle check works. + +Source code links throughout this document point to the current implementation. + +For event and metric definitions used by the dashboard, see [Dealbot Events & Metrics](./events-and-metrics.md). + +> **Note**: This check calls `terminateService` to start the on-chain termination sequence. It does **not** call `PDPVerifier.deleteDataSet`, which is SP-initiated. See the [FAQ](#what-happens-on-chain-after-terminateservice-is-called) for details on what happens after termination. + +## Overview + +A "data set lifecycle check" tests the full `createDataSet → terminateService` lifecycle for a storage provider. Dealbot creates a throwaway data set with a small seed piece and immediately terminates it in the same run. A successful check confirms both the `createDataSet` and `terminateService` paths work correctly on the SP. + +On every data set lifecycle check, dealbot: + +1. Creates a new data set with a 200 KiB seed piece, tagged with a `dealbotLifecycleCheck` metadata key so any leaked sets are discoverable later +2. Calls `terminateService` on the created data set +3. Polls FWSS until `pdpEndEpoch != 0`, confirming termination was recorded on-chain + +A successful check requires all [assertions in the table below](#what-gets-asserted) to pass. Failure occurs if any step fails or the check exceeds its max allowed time. 
+ +## What Gets Asserted + +Each data set lifecycle check asserts the following for every SP: + +| # | Assertion | How It's Checked | Relevant Metric | +|---|-----------|-----------------|-----------------| +| 1 | SP creates a data set with a seed piece | `createContext` + `executeUpload` call completes and returns a `dataSetId` | [`dataSetLifecycleCheckStatus`](./events-and-metrics.md#dataSetLifecycleCheckStatus) | +| 2 | `terminateService` succeeds on the created data set | `terminateService` call completes without error (already-terminated reverts are treated as success) | [`dataSetLifecycleCheckStatus`](./events-and-metrics.md#dataSetLifecycleCheckStatus) | +| 3 | Termination is confirmed on-chain | Dealbot polls FWSS until `pdpEndEpoch != 0` | [`dataSetLifecycleCheckMs`](./events-and-metrics.md#dataSetLifecycleCheckMs) | +| 4 | All steps complete within the timeout | Check is not marked successful until all steps pass within `DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS` | [`dataSetLifecycleCheckMs`](./events-and-metrics.md#dataSetLifecycleCheckMs) | + +## Data Set Lifecycle Check Lifecycle + +The dealbot scheduler triggers data set lifecycle check jobs at a configurable rate. + +```mermaid +flowchart TD + CreateDataSet["Create data set with 200 KiB seed piece"] --> Terminate["Call terminateService"] + Terminate --> Poll["Poll FWSS until pdpEndEpoch != 0"] + Poll -->|confirmed| Success["Mark check successful"] + Poll -->|timeout| Fail["Mark check failed (timedout)"] + Terminate -->|error| Fail + CreateDataSet -->|error| Fail +``` + +### 1. Apply job guards + +Dealbot applies the same maintenance-window and SP-blocklist rules used by all other SP jobs. If `DATASET_LIFECYCLE_CHECK_ENABLED` is `false`, the job logs a disabled skip and exits. + +### 2. Create the data set + +Dealbot creates a new data set with a 200 KiB seed piece. The data set is tagged with metadata `{ dealbotLifecycleCheck: "" }`. 
The fixed `dealbotLifecycleCheck` key is the handle for finding leaked sets later; the per-run value ensures a fresh data set is created on every invocation rather than resolving a prior one. See the [FAQ](#why-does-data-set-creation-use-a-seed-piece) for why we use a seed piece. + +This step does **not** emit `dataSetCreation` metrics — those belong to the `data_set_creation` job. + +Source: [`deal.service.ts` (`runDataSetLifecycleCheck`)](../../apps/backend/src/deal/deal.service.ts) + +### 3. Terminate the service + +Dealbot calls `synapse.storage.terminateDataSet` (aka `terminateService` at contract level) on the newly created `dataSetId`, which sets `pdpEndEpoch` to a near future epoch (~30 days). + +### 4. Wait for on-chain confirmation + +Dealbot polls FWSS until `pdpEndEpoch != 0`, confirming the termination was recorded on-chain. This is Step 1 of the [full on-chain termination sequence](#what-happens-on-chain-after-terminateservice-is-called). The job does not wait for the full ~30-day rail finalization. + +The entire check (creation + upload + termination + confirmation) is bounded by `DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS`. A timeout is classified as `failure.timedout`. + +## Check Status Progression + +A data set lifecycle check has a single terminal status, recorded once per check via [`dataSetLifecycleCheckStatus`](./events-and-metrics.md#dataSetLifecycleCheckStatus): + +| Overall Status | Meaning | +|--------|---------| +| `success` | All steps passed: data set created, service terminated, and termination confirmed on-chain. | +| `failure.timedout` | The job was aborted because it exceeded `DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS`. | +| `failure.other` | Any other failure: `createDataSet` failed, `terminateService` failed, or on-chain confirmation polling failed. | + +## Metrics Recorded + +Metric definitions live in [Dealbot Events & Metrics](./events-and-metrics.md). 
The metrics emitted by a data set lifecycle check are: + +- [`dataSetLifecycleCheckStatus`](./events-and-metrics.md#dataSetLifecycleCheckStatus) — `success`, `failure.timedout`, or `failure.other` per provider per run +- [`dataSetLifecycleCheckMs`](./events-and-metrics.md#dataSetLifecycleCheckMs) — end-to-end duration (create + upload + terminate + confirm); emitted on `success` and `failure.timedout` ## Configuration -- [`DATASET_LIFECYCLE_CHECK_ENABLED`](../environment-variables.md#dataset_lifecycle_check_enabled) - — enables the job. Defaults to true on calibration, false on mainnet. When disabled, stale - schedules are removed so they stop enqueuing no-op jobs. -- [`DATASET_LIFECYCLE_CHECKS_PER_SP_PER_HOUR`](../environment-variables.md#dataset_lifecycle_checks_per_sp_per_hour) - — rate per provider, converted internally to `intervalSeconds`. Independent of - `DATASET_CREATIONS_PER_SP_PER_HOUR`. -- [`DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS`](../environment-variables.md#data_set_lifecycle_check_job_timeout_seconds) - — max runtime for one invocation. Bounds the seed-piece upload, the `terminateService` - call, and the `pdpEndEpoch != 0` confirmation poll. Default `600`. - -## Handler algorithm - -For one provider, one invocation of `data_set_lifecycle_check`: - -1. Apply the same maintenance-window and SP-blocklist rules used by other SP jobs. -2. If `DATASET_LIFECYCLE_CHECK_ENABLED` is false, log a disabled skip and exit (defensive gate - for stale enqueued jobs). -3. Build metadata `{ dealbotLifecycleCheck: "-" }`. The fixed key is the - manual-cleanup handle; the per-run nonce value forces `createContext` to provision a fresh - set instead of resolving a prior (possibly leaked) set. -4. Create an `AbortController` from `DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS`. -5. Call `DealService.runDataSetLifecycleCheck(spAddress, metadata, signal, timeoutMs)`, which: - - a. 
Creates the data set with a 200 KiB seed piece (metrics-free; **no** `dataSetCreation` - metrics — those belong to `data_set_creation`). - - b. Calls `terminateService` on the created `dataSetId` and polls until FWSS confirms - `pdpEndEpoch != 0`. - - c. Records `dataSetLifecycleCheckStatus` / `dataSetLifecycleCheckMs`. - -### Idempotency / abort handling - -- An abort (job timeout) or internal poll timeout is classified as `failure.timedout`; - pg-boss does not retry (failures are handled by the next scheduled tick). -- The terminate step tolerates an already-terminated revert as a no-op and continues polling. - -## Metrics - -All metrics carry the standard label set (`checkType`, `providerId`, `providerName`, -`providerStatus`) with `checkType=dataSetLifecycleCheck`. See -[`events-and-metrics.md`](./events-and-metrics.md). - -| Metric | `value` labels | What to watch for | -|--------|---------------|-------------------| -| [`dataSetLifecycleCheckStatus`](./events-and-metrics.md#dataSetLifecycleCheckStatus) | `success`, `failure.timedout`, `failure.other` | `success` per provider confirms the full create→terminate lifecycle works on calibration; persistent `failure.*` indicates a `createDataSet` or `terminateService` regression | -| [`dataSetLifecycleCheckMs`](./events-and-metrics.md#dataSetLifecycleCheckMs) | — | End-to-end duration (create + upload + terminate + confirm); emitted on `success` and `failure.timedout` only | +Key environment variables that control data set lifecycle check behavior: + +| Variable | Description | +|----------|-------------| +| `DATASET_LIFECYCLE_CHECK_ENABLED` | Enables or disables the check. Defaults to `true` on calibration, `false` on mainnet. When disabled, stale schedules are removed so they stop enqueuing no-op jobs. | +| `DATASET_LIFECYCLE_CHECKS_PER_SP_PER_HOUR` | Per-SP check rate. Independent of `DATASET_CREATIONS_PER_SP_PER_HOUR`. 
| +| `DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS` | Max end-to-end job runtime before forced abort. Default `600`. | + +Source: [`apps/backend/src/config/app.config.ts`](../../apps/backend/src/config/app.config.ts) + +See also: [`docs/environment-variables.md`](../environment-variables.md) for the source-of-truth configuration reference. ## FAQ ### What happens on-chain after `terminateService` is called? -`terminateService` does not delete a dataset instantly. It starts a multi-step on-chain -sequence that plays out over roughly 30 days. The lifecycle check only needs the first step to -complete before it exits. +`terminateService` does not delete a data set instantly. It starts a multi-step on-chain sequence that plays out over roughly 30 days. The lifecycle check only waits for the first step before it exits. + +**Step 1 — terminateService confirms.** `terminateService` calls `FilecoinPay.terminateRail(pdpRailId)`, which sets `endEpoch = block.number + lockupPeriod` on the PDP rail. The FWSS `railTerminated` callback fires in the same transaction, stores `info.pdpEndEpoch`, and emits `PDPPaymentsTerminated` and `ServiceTerminated`. This is the point dealbot polls for: `pdpEndEpoch != 0`. + +**Step 2 — rail finalization (~30 days later).** When the PDP rail's `settledUpTo` reaches `endEpoch`, `finalizeTerminatedRail` fires atomically inside the settle transaction. + +**Step 3 — data set deletion at PDPVerifier (SP-initiated).** After the rail finalizes, the SP may call `PDPVerifier.deleteDataSet`. The lifecycle check does not wait for steps 2 or 3 — waiting ~30 days per invocation would defeat the purpose of a canary. + +### Why does data set creation use a seed piece? + +Data set creation goes through `createContext` + `executeUpload` (the same flow as the data storage check) rather than calling `PDPVerifier.createDataSet` directly, because support for empty data sets is being removed from Curio and `synapse-sdk`. 
+ +### What if creation succeeds but termination fails? -**Step 1 — terminateService tx confirms.** `terminateService` calls -`FilecoinPay.terminateRail(pdpRailId)`, which sets `endEpoch = block.number + lockupPeriod` on -the PDP rail. The FWSS `railTerminated` callback fires in the same transaction, stores -`info.pdpEndEpoch`, and emits `PDPPaymentsTerminated` and `ServiceTerminated`. This is the -point the job polls for: `pdpEndEpoch != 0`. +If creation succeeds but termination fails (process crash, job timeout, or an on-chain error that is not an already-terminated no-op), the created data set stays live on the SP. This is called a leak and is an accepted trade-off for keeping the job self-contained. -**Step 2 — rail finalization (~30 days later).** When the PDP rail's `settledUpTo` reaches -`endEpoch`, `finalizeTerminatedRail` fires atomically inside the settle transaction. +Leaked sets are discoverable by filtering data sets with the `dealbotLifecycleCheck` metadata key. Each leak is also recorded in the `dataset_lifecycle_check_failed` log line with `leakedDataSet: true` and the `dataSetId` for easy identification. -**Step 3 — dataset deletion at PDPVerifier (SP-initiated).** After the rail finalizes, the SP -may call `PDPVerifier.deleteDataSet`. The lifecycle check does not wait for steps 2 or 3 — -waiting ~30 days per invocation would defeat the purpose of a canary cycle. +### Why does the job create and terminate in the same run? -## Source of truth +An earlier design terminated an existing managed slot and relied on `data_set_creation` to recreate it on a later tick. That approach was coupled to `MIN_NUM_DATASETS_FOR_CHECKS`, a minimum-index window, and the creation job's schedule — making the canary sensitive to overall provider state. 
-- Dataset creation design: [`docs/data-set-creation.md`](../data-set-creation.md) -- Job system overview: [`docs/jobs.md`](../jobs.md) -- Metrics and event definitions: [`docs/checks/events-and-metrics.md`](./events-and-metrics.md) -- Scheduler and workers: [`apps/backend/src/jobs/jobs.service.ts`](../../apps/backend/src/jobs/jobs.service.ts) -- Deal service dataset logic (`createDataSetWithPiece`, `runDataSetLifecycleCheck`): [`apps/backend/src/deal/deal.service.ts`](../../apps/backend/src/deal/deal.service.ts) +The current design is self-contained: it always creates a fresh data set and terminates it in the same run. The check works regardless of provider state and needs no coordination with other jobs. From f2fa4cea5b65ce96257fbdcd637dd87fea3b9136 Mon Sep 17 00:00:00 2001 From: silent-cipher Date: Fri, 5 Jun 2026 23:38:31 +0530 Subject: [PATCH 13/16] chore: revert back deal service --- apps/backend/src/deal/deal.service.spec.ts | 84 ---- apps/backend/src/deal/deal.service.ts | 454 +++++++-------------- 2 files changed, 156 insertions(+), 382 deletions(-) diff --git a/apps/backend/src/deal/deal.service.spec.ts b/apps/backend/src/deal/deal.service.spec.ts index cac3fa73..0672a7a2 100644 --- a/apps/backend/src/deal/deal.service.spec.ts +++ b/apps/backend/src/deal/deal.service.spec.ts @@ -17,7 +17,6 @@ import { DealAddonsService } from "../deal-addons/deal-addons.service.js"; import { DealPreprocessingResult } from "../deal-addons/types.js"; import { DataSetCreationCheckMetrics, - DataSetLifecycleCheckMetrics, DataStorageCheckMetrics, RetrievalCheckMetrics, } from "../metrics-prometheus/check-metrics.service.js"; @@ -170,10 +169,6 @@ describe("DealService", () => { observeCheckDuration: vi.fn(), recordStatus: vi.fn(), }; - const mockDataSetLifecycleCheckMetrics = { - observeCheckDuration: vi.fn(), - recordStatus: vi.fn(), - }; beforeEach(async () => { const module: TestingModule = await Test.createTestingModule({ @@ -189,7 +184,6 @@ describe("DealService", () => { 
{ provide: DataStorageCheckMetrics, useValue: mockDataStorageMetrics }, { provide: RetrievalCheckMetrics, useValue: mockRetrievalMetrics }, { provide: DataSetCreationCheckMetrics, useValue: mockDataSetCreationMetrics }, - { provide: DataSetLifecycleCheckMetrics, useValue: mockDataSetLifecycleCheckMetrics }, { provide: ClickhouseService, useValue: { insert: vi.fn(), probeLocation: "test" } }, { provide: DatasetLivenessService, useValue: mockDatasetLivenessService }, ], @@ -1074,7 +1068,6 @@ describe("DealService", () => { { provide: DataStorageCheckMetrics, useValue: mockDataStorageMetrics }, { provide: RetrievalCheckMetrics, useValue: mockRetrievalMetrics }, { provide: DataSetCreationCheckMetrics, useValue: mockDataSetCreationMetrics }, - { provide: DataSetLifecycleCheckMetrics, useValue: mockDataSetLifecycleCheckMetrics }, { provide: ClickhouseService, useValue: { insert: vi.fn(), probeLocation: "test" } }, { provide: DatasetLivenessService, useValue: mockDatasetLivenessService }, ], @@ -1452,83 +1445,6 @@ describe("DealService", () => { }); }); - describe("runDataSetLifecycleCheck", () => { - beforeEach(() => { - vi.spyOn(mockWalletSdkService, "getProviderInfo").mockReturnValue({ - id: 1n, - name: "sp", - isApproved: true, - } as any); - }); - - it("creates a throwaway data set, terminates it, and records only lifecycle metrics", async () => { - const terminateMock = vi.fn().mockResolvedValue("0xhash"); - const synapseMock = { - storage: { - createContext: vi.fn().mockResolvedValue({ dataSetId: 9n }), - terminateDataSet: terminateMock, - }, - client: { waitForTransactionReceipt: vi.fn().mockResolvedValue({ status: "success" }) }, - }; - vi.spyOn(service as any, "createSynapseInstance").mockImplementation(() => synapseMock as unknown as Synapse); - (executeUpload as Mock).mockImplementation(async (_s, _d, _r, options) => { - await triggerUploadProgress(options?.onProgress); - return { pieceCid: "bafk-seed", pieceId: 1, transactionHash: "0xhash" }; - }); - - // 
getDataSet: first probe inside ensureDataSetTerminated, then the confirmation poll. - mockWarmStorageService.getDataSet.mockResolvedValueOnce({ pdpEndEpoch: 0n }); - mockWarmStorageService.getDataSet.mockResolvedValueOnce({ pdpEndEpoch: 4321n }); - - const result = await service.runDataSetLifecycleCheck( - "0xaaa", - { dealbotLifecycleCheck: "nonce-1" }, - undefined, - 5_000, - ); - - expect(synapseMock.storage.createContext).toHaveBeenCalledWith( - expect.objectContaining({ metadata: { dealbotLifecycleCheck: "nonce-1" } }), - ); - expect(terminateMock).toHaveBeenCalledWith({ dataSetId: 9n }); - expect(result).toEqual({ dataSetId: 9n, pdpEndEpoch: 4321n }); - expect(mockDataSetLifecycleCheckMetrics.recordStatus).toHaveBeenCalledWith( - expect.objectContaining({ checkType: "dataSetLifecycleCheck" }), - "success", - ); - expect(mockDataSetLifecycleCheckMetrics.observeCheckDuration).toHaveBeenCalledWith( - expect.objectContaining({ checkType: "dataSetLifecycleCheck" }), - expect.any(Number), - ); - // The create step must NOT record dataSetCreation metrics (those belong to data_set_creation). - expect(mockDataSetCreationMetrics.recordStatus).not.toHaveBeenCalled(); - expect(mockDataSetCreationMetrics.observeCheckDuration).not.toHaveBeenCalled(); - // No Deal rows exist for the throwaway set, so no cleanup is attempted. 
- expect(dealRepoMock.save).not.toHaveBeenCalled(); - }); - - it("records failure.timedout and rethrows when the signal is already aborted", async () => { - const createContextMock = vi.fn().mockResolvedValue({ dataSetId: 9n }); - const synapseMock = { - storage: { createContext: createContextMock, terminateDataSet: vi.fn() }, - client: { waitForTransactionReceipt: vi.fn() }, - }; - vi.spyOn(service as any, "createSynapseInstance").mockImplementation(() => synapseMock as unknown as Synapse); - - const controller = new AbortController(); - controller.abort(new Error("Data set lifecycle check job timeout (600s)")); - - await expect( - service.runDataSetLifecycleCheck("0xaaa", { dealbotLifecycleCheck: "nonce-2" }, controller.signal, 5_000), - ).rejects.toThrow(); - - expect(mockDataSetLifecycleCheckMetrics.recordStatus).toHaveBeenCalledWith( - expect.objectContaining({ checkType: "dataSetLifecycleCheck" }), - "failure.timedout", - ); - }); - }); - describe("createDeal isLive guard", () => { it("throws DealJobTerminatedDataSetError when data set is PDP-terminated; no metrics or save", async () => { const providerInfo: PDPProviderEx = { diff --git a/apps/backend/src/deal/deal.service.ts b/apps/backend/src/deal/deal.service.ts index 4936a24c..df06ed9c 100644 --- a/apps/backend/src/deal/deal.service.ts +++ b/apps/backend/src/deal/deal.service.ts @@ -31,7 +31,6 @@ import type { DealPreprocessingResult } from "../deal-addons/types.js"; import { buildCheckMetricLabels, classifyFailureStatus } from "../metrics-prometheus/check-metric-labels.js"; import { DataSetCreationCheckMetrics, - DataSetLifecycleCheckMetrics, DataStorageCheckMetrics, RetrievalCheckMetrics, } from "../metrics-prometheus/check-metrics.service.js"; @@ -70,7 +69,6 @@ export class DealService implements OnModuleInit, OnModuleDestroy { private readonly dataStorageMetrics: DataStorageCheckMetrics, private readonly retrievalMetrics: RetrievalCheckMetrics, private readonly dataSetCreationMetrics: 
DataSetCreationCheckMetrics, - private readonly dataSetLifecycleCheckMetrics: DataSetLifecycleCheckMetrics, private readonly clickhouseService: ClickhouseService, private readonly datasetLivenessService: DatasetLivenessService, ) { @@ -734,9 +732,7 @@ export class DealService implements OnModuleInit, OnModuleDestroy { } /** - * Terminate a dataset on-chain (if needed) and wait for FWSS to confirm - * `pdpEndEpoch != 0`. Shared by the `data_set_creation` repair path and the - * `data_set_lifecycle_check` canary job. + * Repair a PDP-terminated dataset (FWSS may or may not have flipped pdpEndEpoch). * * Idempotent sequence: * 1. Read FWSS pdpEndEpoch. If already non-zero, skip the on-chain call. @@ -744,96 +740,74 @@ export class DealService implements OnModuleInit, OnModuleDestroy { * FWSS pdpEndEpoch until non-zero. A revert that matches a known * already-terminated message is treated as a no-op and falls through * to the poll, so a partially-completed prior run can complete. - * - * Returns the confirmed non-zero `pdpEndEpoch`. Throws on abort or poll timeout. + * 3. Mark every Deal row with this dataSetId as cleaned up in a single + * transaction (filtered on cleaned_up=false, so re-runs do not double-write). */ - private async ensureDataSetTerminated( + async repairTerminatedDataSet( providerAddress: string, dataSetId: bigint, signal?: AbortSignal, pollTimeoutMs = 60_000, - ): Promise { + ): Promise<{ dealsAffected: number; pdpEndEpoch: bigint }> { signal?.throwIfAborted(); const synapse = this.sharedSynapse ?? 
(await this.createSynapseInstance()); + const providerInfo = this.walletSdkService.getProviderInfo(providerAddress); const { warmStorageService } = this.walletSdkService.getWalletServices(); + let pdpEndEpoch: bigint; const existing = await awaitWithAbort(warmStorageService.getDataSet({ dataSetId }), signal); if (existing != null && existing.pdpEndEpoch !== 0n) { + pdpEndEpoch = existing.pdpEndEpoch; this.logger.log({ event: "dataset_already_terminated", message: "FWSS pdpEndEpoch already set; skipping terminateDataSet", providerAddress, dataSetId: dataSetId.toString(), - pdpEndEpoch: existing.pdpEndEpoch.toString(), - }); - return existing.pdpEndEpoch; - } - - let txHash: `0x${string}` | undefined; - try { - txHash = await awaitWithAbort(synapse.storage.terminateDataSet({ dataSetId }), signal); - } catch (error) { - if (signal?.aborted) throw error; - const message = error instanceof Error ? error.message : String(error); - if (!/already.*terminat|service.*terminated|pdpEndEpoch.*set/i.test(message)) { - throw error; - } - this.logger.warn({ - event: "dataset_terminate_already_handled", - message: "terminateDataSet reverted as already-terminated; continuing to poll", - providerAddress, - dataSetId: dataSetId.toString(), - revert: message, + pdpEndEpoch: pdpEndEpoch.toString(), }); - } - signal?.throwIfAborted(); - if (txHash != null) { + } else { + let txHash: `0x${string}` | undefined; try { - await awaitWithAbort(synapse.client.waitForTransactionReceipt({ hash: txHash }), signal); + txHash = await awaitWithAbort(synapse.storage.terminateDataSet({ dataSetId }), signal); } catch (error) { if (signal?.aborted) throw error; + const message = error instanceof Error ? 
error.message : String(error); + if (!/already.*terminat|service.*terminated|pdpEndEpoch.*set/i.test(message)) { + throw error; + } this.logger.warn({ - event: "dataset_terminate_receipt_wait_failed", - message: "Receipt wait failed; falling back to FWSS state poll", + event: "dataset_terminate_already_handled", + message: "terminateDataSet reverted as already-terminated; continuing to poll", providerAddress, dataSetId: dataSetId.toString(), - txHash, - error: toStructuredError(error), + revert: message, }); } + signal?.throwIfAborted(); + if (txHash != null) { + try { + await awaitWithAbort(synapse.client.waitForTransactionReceipt({ hash: txHash }), signal); + } catch (error) { + if (signal?.aborted) throw error; + this.logger.warn({ + event: "dataset_terminate_receipt_wait_failed", + message: "Receipt wait failed; falling back to FWSS state poll", + providerAddress, + dataSetId: dataSetId.toString(), + txHash, + error: toStructuredError(error), + }); + } + } + pdpEndEpoch = await this.waitForPdpEndEpoch(dataSetId, pollTimeoutMs, signal); } - return this.waitForPdpEndEpoch(dataSetId, pollTimeoutMs, signal); - } - /** - * Mark every Deal row with `dataSetId` as cleaned up in a single transaction. - * Filtered on cleaned_up=false so re-runs do not double-write. Returns affected count. - */ - private async markDataSetDealsCleanedUp(dataSetId: bigint): Promise { - return this.dealRepository.manager.transaction(async (manager) => { + const result = await this.dealRepository.manager.transaction(async (manager) => { const update = await manager .getRepository(Deal) .update({ dataSetId, cleanedUp: false }, { cleanedUp: true, cleanedUpAt: new Date() }); return update.affected ?? 0; }); - } - - /** - * Repair a PDP-terminated dataset (FWSS may or may not have flipped pdpEndEpoch). - * - * Idempotent sequence: - * 1-2. Terminate on-chain and confirm pdpEndEpoch != 0 (see ensureDataSetTerminated). - * 3. Mark every Deal row with this dataSetId as cleaned up. 
- */ - async repairTerminatedDataSet( - providerAddress: string, - dataSetId: bigint, - signal?: AbortSignal, - pollTimeoutMs = 60_000, - ): Promise<{ dealsAffected: number; pdpEndEpoch: bigint }> { - const providerInfo = this.walletSdkService.getProviderInfo(providerAddress); - const pdpEndEpoch = await this.ensureDataSetTerminated(providerAddress, dataSetId, signal, pollTimeoutMs); - const dealsAffected = await this.markDataSetDealsCleanedUp(dataSetId); this.logger.log({ event: "dataset_terminated_repaired", @@ -842,104 +816,10 @@ export class DealService implements OnModuleInit, OnModuleDestroy { providerId: providerInfo?.id, dataSetId: dataSetId.toString(), pdpEndEpoch: pdpEndEpoch.toString(), - dealsAffected, - }); - - return { dealsAffected, pdpEndEpoch }; - } - - /** - * Run one data-set lifecycle check: create a throwaway data set with a seed piece, - * then immediately terminate it and confirm `pdpEndEpoch != 0` on-chain. Used by the - * `data_set_lifecycle_check` canary job to validate that an SP honours the full - * create -> `terminateService` lifecycle. - * - * Self-contained: it never touches the managed check data sets and creates no `Deal` - * rows, so no Deal cleanup is performed. The throwaway set is created with caller-supplied - * `metadata` carrying the fixed `dealbotLifecycleCheck` marker key (a per-run nonce value - * forces a fresh set each tick); operators can list/sweep leaks by that key. - * - * Emits only `dataSetLifecycleCheckStatus` / `dataSetLifecycleCheckMs` — never the - * `dataSetCreation` metrics (those belong to the `data_set_creation` job). An abort - * (job timeout) or an internal poll timeout is classified as `failure.timedout`. If - * creation succeeds but - * termination fails the set leaks (accepted trade-off); pg-boss retries on the next tick. 
- */ - async runDataSetLifecycleCheck( - providerAddress: string, - metadata: Record, - signal?: AbortSignal, - pollTimeoutMs = 60_000, - ): Promise<{ dataSetId: bigint; pdpEndEpoch: bigint }> { - const providerInfo = this.walletSdkService.getProviderInfo(providerAddress); - if (!providerInfo) { - throw new Error(`Provider ${providerAddress} not found in registry`); - } - - const logContext = { - providerAddress, - providerName: providerInfo.name, - providerId: providerInfo.id, - }; - const labels = buildCheckMetricLabels({ - checkType: "dataSetLifecycleCheck", - providerId: providerInfo.id, - providerName: providerInfo.name, - providerIsApproved: providerInfo.isApproved, - }); - - const startedAt = Date.now(); - this.logger.log({ - event: "dataset_lifecycle_check_started", - message: "Starting data-set lifecycle check (create then terminate)", - ...logContext, - metadata, + dealsAffected: result, }); - let dataSetId: bigint | undefined; - try { - // 1. Create a fresh throwaway data set with a seed piece (no creation metrics). - ({ dataSetId } = await this.createDataSetWithPieceInternal(providerInfo, metadata, signal)); - if (!dataSetId) { - throw new Error("Data-set creation upload completed without resolving a dataSetId"); - } - // 2. Immediately terminate the exact set we just created and confirm on-chain. 
- const pdpEndEpoch = await this.ensureDataSetTerminated(providerAddress, dataSetId, signal, pollTimeoutMs); - const durationMs = Date.now() - startedAt; - - this.dataSetLifecycleCheckMetrics.observeCheckDuration(labels, durationMs); - this.dataSetLifecycleCheckMetrics.recordStatus(labels, "success"); - this.logger.log({ - event: "dataset_lifecycle_check_succeeded", - message: "Data-set lifecycle check completed: created and terminated throwaway data set", - ...logContext, - dataSetId: dataSetId.toString(), - pdpEndEpoch: pdpEndEpoch.toString(), - durationMs, - }); - return { dataSetId, pdpEndEpoch }; - } catch (error) { - const durationMs = Date.now() - startedAt; - // An abort (job-level timeout) or an internal poll timeout both count as failure.timedout. - const status = signal?.aborted ? "failure.timedout" : classifyFailureStatus(error); - if (status === "failure.timedout") { - this.dataSetLifecycleCheckMetrics.observeCheckDuration(labels, durationMs); - } - this.dataSetLifecycleCheckMetrics.recordStatus(labels, status); - this.logger.error({ - event: "dataset_lifecycle_check_failed", - message: - dataSetId === undefined - ? "Data-set lifecycle check failed during creation" - : "Data-set lifecycle check failed during termination; throwaway data set may have leaked", - ...logContext, - dataSetId: dataSetId?.toString(), - durationMs, - status, - error: toStructuredError(error), - }); - throw error; - } + return { dealsAffected: result, pdpEndEpoch }; } /** @@ -965,9 +845,7 @@ export class DealService implements OnModuleInit, OnModuleDestroy { } /** - * Creates an on-chain data-set with a minimal 200 KiB piece for a provider, - * recording `dataSetCreation` metrics. Used by the `data_set_creation` job. - * + * Creates an on-chain data-set with a minimal 200 KiB piece for a provider. * Uses createContext + executeUpload (same flow as data storage check) instead of * PDPServer.createDataSet, since empty datasets are being removed from curio and synapse-sdk. 
* @@ -979,16 +857,11 @@ export class DealService implements OnModuleInit, OnModuleDestroy { metadata: Record, signal?: AbortSignal, ): Promise { + signal?.throwIfAborted(); const providerInfo = this.walletSdkService.getProviderInfo(providerAddress); if (!providerInfo) { throw new Error(`Provider ${providerAddress} not found in registry`); } - - const logContext = { - providerAddress, - providerId: providerInfo.id, - providerName: providerInfo.name, - }; const labels = buildCheckMetricLabels({ checkType: "dataSetCreation", providerId: providerInfo.id, @@ -999,39 +872,133 @@ export class DealService implements OnModuleInit, OnModuleDestroy { const startedAt = Date.now(); this.dataSetCreationMetrics.recordStatus(labels, "pending"); this.logger.log({ - ...logContext, event: "dataset_creation_with_piece_started", message: "Starting data-set creation with piece", + providerAddress, + providerId: providerInfo.id, + providerName: providerInfo.name, metadata, }); + let pieceAdded = false; + let piecesConfirmed = false; + let pieceCid: string | undefined; + let pieceId: number | undefined; + let transactionHash: string | undefined; + try { - const result = await this.createDataSetWithPieceInternal(providerInfo, metadata, signal); + const synapse = this.sharedSynapse ?? 
(await this.createSynapseInstance()); + signal?.throwIfAborted(); + + const DATA_SET_CREATION_PIECE_SIZE = 200 * 1024; // 200 KiB + const payload = Buffer.alloc(DATA_SET_CREATION_PIECE_SIZE, 0x61); + const dataFile = { + data: payload, + size: DATA_SET_CREATION_PIECE_SIZE, + name: "dataset-seed.bin", + }; + + const carResult = await buildUnixfsCar(dataFile, { signal }); + signal?.throwIfAborted(); + + const storage = await awaitWithAbort( + synapse.storage.createContext({ + providerId: providerInfo.id, + metadata, + }), + signal, + ); + signal?.throwIfAborted(); + + const filecoinPinLogger = createFilecoinPinLogger(this.logger); + + const uploadResult = (await awaitWithAbort( + executeUpload(synapse, carResult.carData, carResult.rootCID, { + logger: filecoinPinLogger, + contextId: providerAddress, + contexts: [storage], + pieceMetadata: {}, + ipniValidation: { enabled: false }, + // Must stay synchronous — see issue #446. + onProgress: (event) => { + switch (event.type) { + case "stored": + pieceCid = event.data.pieceCid.toString(); + this.logger.debug({ + event: "dataset_creation_stored", + message: "Data-set creation stored", + providerAddress, + providerId: providerInfo.id, + providerName: providerInfo.name, + pieceCid, + }); + break; + case "piecesAdded": + pieceAdded = true; + this.logger.debug({ + event: "dataset_creation_pieces_added", + message: "Data-set creation pieces added", + providerAddress, + providerId: providerInfo.id, + providerName: providerInfo.name, + txHash: event.data.txHash ?? "unknown", + }); + break; + case "piecesConfirmed": + piecesConfirmed = true; + this.logger.debug({ + event: "dataset_creation_pieces_confirmed", + message: "Data-set creation pieces confirmed", + providerAddress, + providerId: providerInfo.id, + providerName: providerInfo.name, + pieceIds: event.data.pieceIds, + }); + break; + } + }, + }), + signal, + )) as Partial | undefined; + + pieceCid = pieceCid ?? 
uploadResult?.pieceCid; + pieceId = uploadResult?.pieceId; + transactionHash = uploadResult?.transactionHash; + const durationMs = Date.now() - startedAt; this.dataSetCreationMetrics.observeCheckDuration(labels, durationMs); + + if (!pieceCid) { + throw new Error("Data-set creation upload completed without producing a pieceCid"); + } + this.dataSetCreationMetrics.recordStatus(labels, "success"); - if (!result.pieceAdded || !result.piecesConfirmed) { + if (!pieceAdded || !piecesConfirmed) { this.logger.warn({ event: "dataset_creation_missing_onchain_events", message: "Data-set creation succeeded without full on-chain progress events", - ...logContext, - pieceAdded: result.pieceAdded, - piecesConfirmed: result.piecesConfirmed, + providerAddress, + providerId: providerInfo.id, + providerName: providerInfo.name, + pieceAdded, + piecesConfirmed, }); } this.logger.log({ event: "dataset_creation_with_piece_succeeded", message: "Data-set created with piece", - ...logContext, + providerAddress, + providerId: providerInfo.id, + providerName: providerInfo.name, durationMs, - dataSetId: result.dataSetId ?? "unknown", - pieceCid: result.pieceCid, - pieceId: result.pieceId ?? "unknown", - txHash: result.transactionHash ?? "unknown", - pieceAdded: result.pieceAdded, - piecesConfirmed: result.piecesConfirmed, + dataSetId: storage.dataSetId ?? "unknown", + pieceCid: pieceCid ?? "unknown", + pieceId: pieceId ?? "unknown", + txHash: transactionHash ?? 
"unknown", + pieceAdded, + piecesConfirmed, }); } catch (error) { const durationMs = Date.now() - startedAt; @@ -1040,130 +1007,21 @@ export class DealService implements OnModuleInit, OnModuleDestroy { this.logger.error({ event: "dataset_creation_with_piece_failed", message: "Data-set creation with piece failed", - ...logContext, + providerAddress, + providerId: providerInfo.id, + providerName: providerInfo.name, durationMs, + pieceAdded, + piecesConfirmed, + pieceCid, + pieceId, + transactionHash, error: toStructuredError(error), }); throw error; } } - /** - * Metrics-free creation of an on-chain data-set with a minimal 200 KiB seed piece. - * - * Performs createContext + executeUpload and returns the created `dataSetId` and upload - * summary. Records NO check metrics so callers can attribute the work to the right check - * (`data_set_creation` via {@link createDataSetWithPiece}, or `data_set_lifecycle_check` - * via {@link runDataSetLifecycleCheck}). Throws if the upload produced no `pieceCid` or - * the context resolved no `dataSetId` (we cannot operate on an unidentified set). - */ - private async createDataSetWithPieceInternal( - providerInfo: PDPProviderEx, - metadata: Record, - signal?: AbortSignal, - ): Promise<{ - dataSetId?: bigint; - pieceCid: string; - pieceId: number | undefined; - transactionHash: string | undefined; - pieceAdded: boolean; - piecesConfirmed: boolean; - }> { - signal?.throwIfAborted(); - - const providerAddress = providerInfo.serviceProvider; - const logContext = { - providerAddress, - providerName: providerInfo.name, - providerId: providerInfo.id, - }; - let pieceAdded = false; - let piecesConfirmed = false; - let pieceCid: string | undefined; - - const synapse = this.sharedSynapse ?? 
(await this.createSynapseInstance()); - signal?.throwIfAborted(); - - const DATA_SET_CREATION_PIECE_SIZE = 200 * 1024; // 200 KiB - const payload = Buffer.alloc(DATA_SET_CREATION_PIECE_SIZE, 0x61); - const dataFile = { - data: payload, - size: DATA_SET_CREATION_PIECE_SIZE, - name: "dataset-seed.bin", - }; - - const carResult = await buildUnixfsCar(dataFile, { signal }); - signal?.throwIfAborted(); - - const storage = await awaitWithAbort( - synapse.storage.createContext({ - providerId: providerInfo.id, - metadata, - }), - signal, - ); - signal?.throwIfAborted(); - - const filecoinPinLogger = createFilecoinPinLogger(this.logger); - - const uploadResult = (await awaitWithAbort( - executeUpload(synapse, carResult.carData, carResult.rootCID, { - logger: filecoinPinLogger, - contextId: providerAddress, - contexts: [storage], - pieceMetadata: {}, - ipniValidation: { enabled: false }, - // Must stay synchronous — see issue #446. - onProgress: (event) => { - switch (event.type) { - case "stored": - pieceCid = event.data.pieceCid.toString(); - this.logger.debug({ - event: "dataset_creation_stored", - message: "Data-set creation stored", - ...logContext, - pieceCid, - }); - break; - case "piecesAdded": - pieceAdded = true; - this.logger.debug({ - event: "dataset_creation_pieces_added", - message: "Data-set creation pieces added", - ...logContext, - txHash: event.data.txHash ?? "unknown", - }); - break; - case "piecesConfirmed": - piecesConfirmed = true; - this.logger.debug({ - event: "dataset_creation_pieces_confirmed", - message: "Data-set creation pieces confirmed", - ...logContext, - pieceIds: event.data.pieceIds, - }); - break; - } - }, - }), - signal, - )) as Partial | undefined; - - pieceCid = pieceCid ?? 
uploadResult?.pieceCid; - if (!pieceCid) { - throw new Error("Data-set creation upload completed without producing a pieceCid"); - } - - return { - dataSetId: storage.dataSetId, - pieceCid, - pieceId: uploadResult?.pieceId, - transactionHash: uploadResult?.transactionHash, - pieceAdded, - piecesConfirmed, - }; - } - // ============================================================================ // Deal Creation Helpers // ============================================================================ From 81b049de7f775e7399cb28890430350b2d81a92e Mon Sep 17 00:00:00 2001 From: silent-cipher Date: Fri, 5 Jun 2026 23:41:07 +0530 Subject: [PATCH 14/16] refactor: create separate data set lifecycle service --- .../data-set-lifecycle.module.ts | 11 ++ .../data-set-lifecycle.service.spec.ts | 131 +++++++++++++++ .../data-set-lifecycle.service.ts | 149 ++++++++++++++++++ apps/backend/src/jobs/jobs.module.ts | 2 + apps/backend/src/jobs/jobs.service.spec.ts | 24 +-- apps/backend/src/jobs/jobs.service.ts | 4 +- 6 files changed, 308 insertions(+), 13 deletions(-) create mode 100644 apps/backend/src/data-set-lifecycle/data-set-lifecycle.module.ts create mode 100644 apps/backend/src/data-set-lifecycle/data-set-lifecycle.service.spec.ts create mode 100644 apps/backend/src/data-set-lifecycle/data-set-lifecycle.service.ts diff --git a/apps/backend/src/data-set-lifecycle/data-set-lifecycle.module.ts b/apps/backend/src/data-set-lifecycle/data-set-lifecycle.module.ts new file mode 100644 index 00000000..8d917f17 --- /dev/null +++ b/apps/backend/src/data-set-lifecycle/data-set-lifecycle.module.ts @@ -0,0 +1,11 @@ +import { Module } from "@nestjs/common"; +import { MetricsPrometheusModule } from "../metrics-prometheus/metrics-prometheus.module.js"; +import { WalletSdkModule } from "../wallet-sdk/wallet-sdk.module.js"; +import { DataSetLifecycleService } from "./data-set-lifecycle.service.js"; + +@Module({ + imports: [WalletSdkModule, MetricsPrometheusModule], + providers: 
[DataSetLifecycleService], + exports: [DataSetLifecycleService], +}) +export class DataSetLifecycleModule {} diff --git a/apps/backend/src/data-set-lifecycle/data-set-lifecycle.service.spec.ts b/apps/backend/src/data-set-lifecycle/data-set-lifecycle.service.spec.ts new file mode 100644 index 00000000..be6b57b2 --- /dev/null +++ b/apps/backend/src/data-set-lifecycle/data-set-lifecycle.service.spec.ts @@ -0,0 +1,131 @@ +import { beforeEach, describe, expect, it, vi } from "vitest"; +import { DataSetLifecycleCheckMetrics } from "../metrics-prometheus/check-metrics.service.js"; +import { WalletSdkService } from "../wallet-sdk/wallet-sdk.service.js"; +import { DataSetLifecycleService } from "./data-set-lifecycle.service.js"; + +vi.mock("@filoz/synapse-core/sp", () => ({ + createDataSet: vi.fn(), + waitForCreateDataSet: vi.fn(), +})); + +vi.mock("@filoz/synapse-core/warm-storage", () => ({ + terminateServiceSync: vi.fn(), +})); + +const { createDataSet, waitForCreateDataSet } = await import("@filoz/synapse-core/sp"); +const { terminateServiceSync } = await import("@filoz/synapse-core/warm-storage"); + +const mockClient = { account: { address: "0xwallet" } }; + +const mockProviderInfo = { + id: 1n, + name: "test-sp", + isApproved: true, + serviceProvider: "0xsp" as `0x${string}`, + payee: "0xpayee" as `0x${string}`, + pdp: { serviceURL: "https://sp.example.com" }, +}; + +const mockWalletSdkService = { + getProviderInfo: vi.fn(() => mockProviderInfo), + getSynapseClient: vi.fn(() => mockClient), +} as unknown as WalletSdkService; + +const mockMetrics = { + observeCheckDuration: vi.fn(), + recordStatus: vi.fn(), +} as unknown as DataSetLifecycleCheckMetrics; + +describe("DataSetLifecycleService", () => { + let service: DataSetLifecycleService; + + beforeEach(() => { + vi.clearAllMocks(); + service = new DataSetLifecycleService(mockWalletSdkService, mockMetrics); + }); + + it("creates an empty data set, waits for confirmation, terminates it, and records success", async () => 
{ + vi.mocked(createDataSet).mockResolvedValue({ txHash: "0xhash1", statusUrl: "https://sp.example.com/status/1" }); + vi.mocked(waitForCreateDataSet).mockResolvedValue({ + dataSetId: 42n, + dataSetCreated: true, + txStatus: "confirmed", + ok: true, + createMessageHash: "0xmsg", + service: "https://sp.example.com", + }); + vi.mocked(terminateServiceSync).mockResolvedValue({ receipt: {} as any, event: {} as any }); + + await service.runLifecycleCheck("0xsp", { dealbotLifecycleCheck: "nonce-123" }); + + expect(createDataSet).toHaveBeenCalledWith( + mockClient, + expect.objectContaining({ + cdn: false, + payee: "0xpayee", + serviceURL: "https://sp.example.com", + metadata: { dealbotLifecycleCheck: "nonce-123" }, + }), + ); + expect(waitForCreateDataSet).toHaveBeenCalledWith( + expect.objectContaining({ statusUrl: "https://sp.example.com/status/1" }), + ); + expect(terminateServiceSync).toHaveBeenCalledWith(mockClient, expect.objectContaining({ dataSetId: 42n })); + expect(mockMetrics.observeCheckDuration).toHaveBeenCalledOnce(); + expect(mockMetrics.recordStatus).toHaveBeenCalledWith(expect.any(Object), "success"); + }); + + it("records failure.timedout when signal is aborted before creation", async () => { + const controller = new AbortController(); + controller.abort(new Error("job timeout")); + + await expect( + service.runLifecycleCheck("0xsp", { dealbotLifecycleCheck: "nonce-456" }, controller.signal), + ).rejects.toThrow(); + + expect(createDataSet).not.toHaveBeenCalled(); + expect(mockMetrics.recordStatus).toHaveBeenCalledWith(expect.any(Object), "failure.timedout"); + }); + + it("records failure.other when creation rejects with a non-abort error", async () => { + vi.mocked(createDataSet).mockRejectedValue(new Error("SP unreachable")); + + await expect(service.runLifecycleCheck("0xsp", { dealbotLifecycleCheck: "nonce-789" })).rejects.toThrow( + "SP unreachable", + ); + + expect(terminateServiceSync).not.toHaveBeenCalled(); + 
expect(mockMetrics.recordStatus).toHaveBeenCalledWith(expect.any(Object), "failure.other"); + }); + + it("records failure.other when termination fails after creation, logging the dataSetId as leaked", async () => { + vi.mocked(createDataSet).mockResolvedValue({ txHash: "0xhash2", statusUrl: "https://sp.example.com/status/2" }); + vi.mocked(waitForCreateDataSet).mockResolvedValue({ + dataSetId: 99n, + dataSetCreated: true, + txStatus: "confirmed", + ok: true, + createMessageHash: "0xmsg2", + service: "https://sp.example.com", + }); + vi.mocked(terminateServiceSync).mockRejectedValue(new Error("terminate failed")); + + await expect(service.runLifecycleCheck("0xsp", { dealbotLifecycleCheck: "nonce-999" })).rejects.toThrow( + "terminate failed", + ); + + expect(mockMetrics.recordStatus).toHaveBeenCalledWith(expect.any(Object), "failure.other"); + }); + + it("throws when provider is not found in registry", async () => { + vi.mocked(mockWalletSdkService.getProviderInfo).mockReturnValueOnce(undefined); + + await expect(service.runLifecycleCheck("0xunknown", {})).rejects.toThrow("not found in registry"); + }); + + it("throws when synapse client is not initialized", async () => { + vi.mocked(mockWalletSdkService.getSynapseClient).mockReturnValueOnce(null); + + await expect(service.runLifecycleCheck("0xsp", {})).rejects.toThrow("not initialized"); + }); +}); diff --git a/apps/backend/src/data-set-lifecycle/data-set-lifecycle.service.ts b/apps/backend/src/data-set-lifecycle/data-set-lifecycle.service.ts new file mode 100644 index 00000000..421358f9 --- /dev/null +++ b/apps/backend/src/data-set-lifecycle/data-set-lifecycle.service.ts @@ -0,0 +1,149 @@ +import { createDataSet, waitForCreateDataSet } from "@filoz/synapse-core/sp"; +import { terminateServiceSync } from "@filoz/synapse-core/warm-storage"; +import { Injectable, Logger } from "@nestjs/common"; +import { awaitWithAbort } from "../common/abort-utils.js"; +import { toStructuredError } from "../common/logging.js"; 
+import { buildCheckMetricLabels, classifyFailureStatus } from "../metrics-prometheus/check-metric-labels.js"; +import { DataSetLifecycleCheckMetrics } from "../metrics-prometheus/check-metrics.service.js"; +import { WalletSdkService } from "../wallet-sdk/wallet-sdk.service.js"; + +@Injectable() +export class DataSetLifecycleService { + private readonly logger = new Logger(DataSetLifecycleService.name); + + constructor( + private readonly walletSdkService: WalletSdkService, + private readonly lifecycleCheckMetrics: DataSetLifecycleCheckMetrics, + ) {} + + /** + * Run one data-set lifecycle check: create an empty throwaway data set on the SP, + * wait for on-chain confirmation, then immediately terminate it. Used by the + * `data_set_lifecycle_check` canary job to validate that an SP honours the full + * create → terminate lifecycle. + * + * Never touches managed check data sets and creates no Deal rows. The throwaway set + * is identified by the fixed `dealbotLifecycleCheck` marker key in `metadata`; a + * per-run nonce value prevents accidentally reusing a prior leaked set. If creation + * succeeds but termination fails the set leaks (accepted trade-off); operators can + * sweep leaks by that key. + * + * Emits only `dataSetLifecycleCheckStatus` / `dataSetLifecycleCheckMs` metrics. 
+ */ + async runLifecycleCheck(spAddress: string, metadata: Record<string, string>, signal?: AbortSignal): Promise<void> { + const providerInfo = this.walletSdkService.getProviderInfo(spAddress); + if (!providerInfo) { + throw new Error(`Provider ${spAddress} not found in registry`); + } + + const client = this.walletSdkService.getSynapseClient(); + if (!client) { + throw new Error("Synapse client not initialized"); + } + + const labels = buildCheckMetricLabels({ + checkType: "dataSetLifecycleCheck", + providerId: providerInfo.id, + providerName: providerInfo.name, + providerIsApproved: providerInfo.isApproved, + }); + + const logContext = { + providerAddress: spAddress, + providerName: providerInfo.name, + providerId: providerInfo.id, + }; + + const startedAt = Date.now(); + this.logger.log({ + event: "dataset_lifecycle_check_started", + message: "Starting data-set lifecycle check", + ...logContext, + }); + + let dataSetId: bigint | undefined; + try { + signal?.throwIfAborted(); + + // 1. Request creation of an empty data set on the SP. + const createResult = await awaitWithAbort( + createDataSet(client, { + cdn: false, + payee: providerInfo.payee, + serviceURL: providerInfo.pdp.serviceURL, + metadata, + }), + signal, + ); + signal?.throwIfAborted(); + + this.logger.log({ + event: "dataset_lifecycle_check_creating", + message: "Empty data set creation submitted; waiting for SP confirmation", + ...logContext, + txHash: createResult.txHash, + }); + + // 2. Wait for the SP to confirm the data set is created and extract the dataSetId. + const confirmed = await awaitWithAbort(waitForCreateDataSet({ statusUrl: createResult.statusUrl }), signal); + dataSetId = confirmed.dataSetId; + signal?.throwIfAborted(); + + this.logger.log({ + event: "dataset_lifecycle_check_created", + message: "Empty data set created and confirmed on-chain", + ...logContext, + dataSetId: dataSetId.toString(), + }); + + // 3. Immediately terminate the throwaway data set. 
+ await awaitWithAbort( + terminateServiceSync(client, { + dataSetId, + onHash: (hash) => { + this.logger.log({ + event: "dataset_lifecycle_check_terminating", + message: "Terminate transaction submitted", + ...logContext, + dataSetId: (dataSetId as bigint).toString(), + txHash: hash, + }); + }, + }), + signal, + ); + + const durationMs = Date.now() - startedAt; + this.lifecycleCheckMetrics.observeCheckDuration(labels, durationMs); + this.lifecycleCheckMetrics.recordStatus(labels, "success"); + + this.logger.log({ + event: "dataset_lifecycle_check_succeeded", + message: "Data-set lifecycle check completed: created and terminated throwaway data set", + ...logContext, + dataSetId: dataSetId.toString(), + durationMs, + }); + } catch (error) { + const durationMs = Date.now() - startedAt; + const status = signal?.aborted ? "failure.timedout" : classifyFailureStatus(error); + if (status === "failure.timedout") { + this.lifecycleCheckMetrics.observeCheckDuration(labels, durationMs); + } + this.lifecycleCheckMetrics.recordStatus(labels, status); + this.logger.error({ + event: "dataset_lifecycle_check_failed", + message: + dataSetId === undefined + ? 
"Data-set lifecycle check failed during creation" + : "Data-set lifecycle check failed during termination; throwaway data set may have leaked", + ...logContext, + dataSetId: dataSetId?.toString(), + durationMs, + status, + error: toStructuredError(error), + }); + throw error; + } + } +} diff --git a/apps/backend/src/jobs/jobs.module.ts b/apps/backend/src/jobs/jobs.module.ts index 12328093..4fc3dfbf 100644 --- a/apps/backend/src/jobs/jobs.module.ts +++ b/apps/backend/src/jobs/jobs.module.ts @@ -1,6 +1,7 @@ import { Module } from "@nestjs/common"; import { TypeOrmModule } from "@nestjs/typeorm"; import { DataRetentionModule } from "../data-retention/data-retention.module.js"; +import { DataSetLifecycleModule } from "../data-set-lifecycle/data-set-lifecycle.module.js"; import { DatabaseModule } from "../database/database.module.js"; import { JobScheduleState } from "../database/entities/job-schedule-state.entity.js"; import { StorageProvider } from "../database/entities/storage-provider.entity.js"; @@ -16,6 +17,7 @@ import { JobScheduleRepository } from "./repositories/job-schedule.repository.js imports: [ DatabaseModule, TypeOrmModule.forFeature([StorageProvider, JobScheduleState]), + DataSetLifecycleModule, DealModule, RetrievalModule, WalletSdkModule, diff --git a/apps/backend/src/jobs/jobs.service.spec.ts b/apps/backend/src/jobs/jobs.service.spec.ts index b0304097..8101acdf 100644 --- a/apps/backend/src/jobs/jobs.service.spec.ts +++ b/apps/backend/src/jobs/jobs.service.spec.ts @@ -74,6 +74,7 @@ describe("JobsService schedule rows", () => { jobDuration: JobsServiceDeps[18]; storageProvidersActive: JobsServiceDeps[19]; storageProvidersTested: JobsServiceDeps[20]; + dataSetLifecycleService: JobsServiceDeps[21]; }>, ) => JobsService; @@ -197,6 +198,7 @@ describe("JobsService schedule rows", () => { overrides.jobDuration ?? metricsMocks.jobDuration, overrides.storageProvidersActive ?? metricsMocks.storageProvidersActive, overrides.storageProvidersTested ?? 
metricsMocks.storageProvidersTested, + overrides.dataSetLifecycleService ?? ({} as JobsServiceDeps[21]), ); service = buildService(); @@ -1264,14 +1266,12 @@ describe("JobsService schedule rows", () => { get: vi.fn((key: keyof IConfig) => baseConfigValues[key]), } as unknown as JobsServiceDeps[0]; - const dealService = { - runDataSetLifecycleCheck: vi.fn(), - }; + const dataSetLifecycleService = { runLifecycleCheck: vi.fn() }; const walletSdkService = { getProviderInfo: vi.fn(() => ({ id: 1, name: "test-provider" })) }; service = buildService({ configService, - dealService: dealService as unknown as ConstructorParameters<typeof JobsService>[3], + dataSetLifecycleService: dataSetLifecycleService as unknown as JobsServiceDeps[21], walletSdkService: walletSdkService as unknown as ConstructorParameters<typeof JobsService>[5], }); @@ -1280,7 +1280,7 @@ describe("JobsService schedule rows", () => { data: { jobType: "data_set_lifecycle_check", spAddress: "0xaaa", intervalSeconds: 3600 }, }); - expect(dealService.runDataSetLifecycleCheck).not.toHaveBeenCalled(); + expect(dataSetLifecycleService.runLifecycleCheck).not.toHaveBeenCalled(); }); it("data_set_lifecycle_check job creates and terminates a throwaway data set when enabled", async () => { @@ -1292,14 +1292,12 @@ describe("JobsService schedule rows", () => { get: vi.fn((key: keyof IConfig) => baseConfigValues[key]), } as unknown as JobsServiceDeps[0]; - const dealService = { - runDataSetLifecycleCheck: vi.fn(async () => ({ dataSetId: 55n, pdpEndEpoch: 9n })), - }; + const dataSetLifecycleService = { runLifecycleCheck: vi.fn(async () => undefined) }; const walletSdkService = { getProviderInfo: vi.fn(() => ({ id: 1, name: "test-provider" })) }; service = buildService({ configService, - dealService: dealService as unknown as ConstructorParameters<typeof JobsService>[3], + dataSetLifecycleService: dataSetLifecycleService as unknown as JobsServiceDeps[21], walletSdkService: walletSdkService as unknown as ConstructorParameters<typeof JobsService>[5], }); @@ -1308,14 +1306,16 @@ describe("JobsService 
schedule rows", () => { data: { jobType: "data_set_lifecycle_check", spAddress: "0xaaa", intervalSeconds: 3600 }, }); - expect(dealService.runDataSetLifecycleCheck).toHaveBeenCalledWith( + expect(dataSetLifecycleService.runLifecycleCheck).toHaveBeenCalledWith( "0xaaa", expect.objectContaining({ dealbotLifecycleCheck: expect.any(String) }), expect.any(AbortSignal), - expect.any(Number), ); // The fixed marker key is the only metadata; no base/slot metadata is attached. - const metadataArg = (dealService.runDataSetLifecycleCheck.mock.calls[0] as unknown[])[1] as Record<string, string>; + const metadataArg = (dataSetLifecycleService.runLifecycleCheck.mock.calls[0] as unknown[])[1] as Record< + string, + string + >; expect(Object.keys(metadataArg)).toEqual(["dealbotLifecycleCheck"]); }); diff --git a/apps/backend/src/jobs/jobs.service.ts b/apps/backend/src/jobs/jobs.service.ts index c020a37a..9233a19f 100644 --- a/apps/backend/src/jobs/jobs.service.ts +++ b/apps/backend/src/jobs/jobs.service.ts @@ -12,6 +12,7 @@ import { getMaintenanceWindowStatus } from "../common/maintenance-window.js"; import { isSpBlocked } from "../common/sp-blocklist.js"; import type { IConfig, ISpBlocklistConfig } from "../config/app.config.js"; import { DataRetentionService } from "../data-retention/data-retention.service.js"; +import { DataSetLifecycleService } from "../data-set-lifecycle/data-set-lifecycle.service.js"; import type { JobType } from "../database/entities/job-schedule-state.entity.js"; import { StorageProvider } from "../database/entities/storage-provider.entity.js"; import { DealService } from "../deal/deal.service.js"; @@ -107,6 +108,7 @@ export class JobsService implements OnModuleInit, OnApplicationShutdown { private readonly storageProvidersActive: Gauge, @InjectMetric("storage_providers_tested") private readonly storageProvidersTested: Gauge, + private readonly dataSetLifecycleService: DataSetLifecycleService, ) {} /** @@ -1002,7 +1004,7 @@ export class JobsService implements 
OnModuleInit, OnApplicationShutdown { return "success"; } try { - await this.dealService.runDataSetLifecycleCheck(spAddress, metadata, abortController.signal, timeoutMs); + await this.dataSetLifecycleService.runLifecycleCheck(spAddress, metadata, abortController.signal); return "success"; } catch (error) { if (abortController.signal.aborted) { From 980e6687fec0d72abf0393143d4c32430c8ebb64 Mon Sep 17 00:00:00 2001 From: silent-cipher Date: Sat, 6 Jun 2026 00:06:31 +0530 Subject: [PATCH 15/16] docs: update docs --- apps/backend/src/config/app.config.ts | 4 +- apps/backend/src/jobs/jobs.service.spec.ts | 8 ++-- docs/checks/data-set-lifecycle-check.md | 51 +++++++++++----------- docs/checks/events-and-metrics.md | 4 +- docs/environment-variables.md | 4 +- 5 files changed, 36 insertions(+), 35 deletions(-) diff --git a/apps/backend/src/config/app.config.ts b/apps/backend/src/config/app.config.ts index b32c096b..45305ff2 100644 --- a/apps/backend/src/config/app.config.ts +++ b/apps/backend/src/config/app.config.ts @@ -98,7 +98,7 @@ export const configValidationSchema = Joi.object({ DEAL_JOB_TIMEOUT_SECONDS: Joi.number().min(120).default(360), // 6 minutes max runtime for data storage jobs (TODO: reduce default to 3 minutes) RETRIEVAL_JOB_TIMEOUT_SECONDS: Joi.number().min(60).default(60), // 1 minute max runtime for retrieval jobs (TODO: reduce default to 30 seconds) DATA_SET_CREATION_JOB_TIMEOUT_SECONDS: Joi.number().min(60).default(300), // 5 minutes max runtime for dataset creation jobs - DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS: Joi.number().min(60).default(360), // 6 minutes: covers create + seed-piece upload + terminate + pdpEndEpoch poll + DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS: Joi.number().min(60).default(600), // 10 minutes: covers create + seed-piece upload + terminate + pdpEndEpoch poll // Seconds to hold the process alive after pg-boss drain completes, so Prometheus // captures at least one scrape of the terminal counter increments emitted during // 
shutdown. Default 35 covers the 30s ServiceMonitor interval plus a 5s buffer. @@ -520,7 +520,7 @@ export function loadConfig(): IConfig { retrievalJobTimeoutSeconds: Number.parseInt(process.env.RETRIEVAL_JOB_TIMEOUT_SECONDS || "60", 10), dataSetCreationJobTimeoutSeconds: Number.parseInt(process.env.DATA_SET_CREATION_JOB_TIMEOUT_SECONDS || "300", 10), dataSetLifecycleCheckJobTimeoutSeconds: Number.parseInt( - process.env.DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS || "360", + process.env.DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS || "600", 10, ), shutdownFinalScrapeDelaySeconds: Number.parseInt(process.env.SHUTDOWN_FINAL_SCRAPE_DELAY_SECONDS || "35", 10), diff --git a/apps/backend/src/jobs/jobs.service.spec.ts b/apps/backend/src/jobs/jobs.service.spec.ts index 8101acdf..350ff463 100644 --- a/apps/backend/src/jobs/jobs.service.spec.ts +++ b/apps/backend/src/jobs/jobs.service.spec.ts @@ -144,7 +144,7 @@ describe("JobsService schedule rows", () => { dataSetCreationJobTimeoutSeconds: 300, dataSetLifecycleCheckEnabled: false, dataSetLifecycleChecksPerSpPerHour: 1, - dataSetLifecycleCheckJobTimeoutSeconds: 360, + dataSetLifecycleCheckJobTimeoutSeconds: 600, shutdownFinalScrapeDelaySeconds: 35, pieceCleanupPerSpPerHour: 1, maxPieceCleanupRuntimeSeconds: 300, @@ -1613,10 +1613,10 @@ describe("JobsService schedule rows", () => { await vi.advanceTimersByTimeAsync(35_001); await shutdownPromise; - // Defaults: deal=360, retrieval=60, dataSetCreation=300, dataSetLifecycleCheck=360, - // pullCheck=300 → max=360 → +60s buffer + // Defaults: deal=360, retrieval=60, dataSetCreation=300, dataSetLifecycleCheck=600, + // pullCheck=300 → max=600 → +60s buffer expect(bossMock.stop).toHaveBeenCalledTimes(1); - expect(bossMock.stop).toHaveBeenCalledWith({ graceful: true, timeout: 420_000 }); + expect(bossMock.stop).toHaveBeenCalledWith({ graceful: true, timeout: 660_000 }); }); it("picks the longest timeout across all job types, including pullCheck under pullPiece", async () => { diff 
--git a/docs/checks/data-set-lifecycle-check.md b/docs/checks/data-set-lifecycle-check.md index aa131cd5..cbe265df 100644 --- a/docs/checks/data-set-lifecycle-check.md +++ b/docs/checks/data-set-lifecycle-check.md @@ -10,13 +10,13 @@ For event and metric definitions used by the dashboard, see [Dealbot Events & Me ## Overview -A "data set lifecycle check" tests the full `createDataSet → terminateService` lifecycle for a storage provider. Dealbot creates a throwaway data set with a small seed piece and immediately terminates it in the same run. A successful check confirms both the `createDataSet` and `terminateService` paths work correctly on the SP. +A "data set lifecycle check" tests the full `createDataSet → terminateService` lifecycle for a storage provider. Dealbot creates an empty throwaway data set and immediately terminates it in the same run. A successful check confirms both the `createDataSet` and `terminateService` paths work correctly on the SP. Every data set lifecycle check, dealbot: -1. Creates a new data set with a 200 KiB seed piece, tagged with a `dealbotLifecycleCheck` metadata key so any leaked sets are discoverable later -2. Calls `terminateService` on the created data set -3. Polls FWSS until `pdpEndEpoch != 0`, confirming termination was recorded on-chain +1. Creates a new empty data set, tagged with a `dealbotLifecycleCheck` metadata key so any leaked sets are discoverable later +2. Waits for the SP to confirm the data set is created on-chain and returns a `dataSetId` +3. Calls `terminateService` on the created data set and waits for the transaction receipt A successful check requires all [assertions in the table below](#what-gets-asserted) to pass. Failure occurs if any step fails or the check exceeds its max allowed time. 
@@ -26,9 +26,9 @@ Each data set lifecycle check asserts the following for every SP: | # | Assertion | How It's Checked | Relevant Metric | |---|-----------|-----------------|-----------------| -| 1 | SP creates a data set with a seed piece | `createContext` + `executeUpload` call completes and returns a `dataSetId` | [`dataSetLifecycleCheckStatus`](./events-and-metrics.md#dataSetLifecycleCheckStatus) | -| 2 | `terminateService` succeeds on the created data set | `terminateService` call completes without error (already-terminated reverts are treated as success) | [`dataSetLifecycleCheckStatus`](./events-and-metrics.md#dataSetLifecycleCheckStatus) | -| 3 | Termination is confirmed on-chain | Dealbot polls FWSS until `pdpEndEpoch != 0` | [`dataSetLifecycleCheckMs`](./events-and-metrics.md#dataSetLifecycleCheckMs) | +| 1 | SP accepts an empty data set creation | `createDataSet` call completes and the SP returns a `statusUrl` | [`dataSetLifecycleCheckStatus`](./events-and-metrics.md#dataSetLifecycleCheckStatus) | +| 2 | Data set is confirmed on-chain | `waitForCreateDataSet` resolves with a `dataSetId` | [`dataSetLifecycleCheckStatus`](./events-and-metrics.md#dataSetLifecycleCheckStatus) | +| 3 | `terminateService` succeeds on the created data set | `terminateServiceSync` call completes and the transaction receipt is received | [`dataSetLifecycleCheckMs`](./events-and-metrics.md#dataSetLifecycleCheckMs) | | 4 | All steps complete within the timeout | Check is not marked successful until all steps pass within `DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS` | [`dataSetLifecycleCheckMs`](./events-and-metrics.md#dataSetLifecycleCheckMs) | ## Data Set Lifecycle Check Lifecycle @@ -37,35 +37,36 @@ The dealbot scheduler triggers data set lifecycle check jobs at a configurable r ```mermaid flowchart TD - CreateDataSet["Create data set with 200 KiB seed piece"] --> Terminate["Call terminateService"] - Terminate --> Poll["Poll FWSS until pdpEndEpoch != 0"] - Poll -->|confirmed| 
Success["Mark check successful"] - Poll -->|timeout| Fail["Mark check failed (timedout)"] - Terminate -->|error| Fail + CreateDataSet["createDataSet (empty data set)"] --> Wait["waitForCreateDataSet"] + Wait -->|dataSetId confirmed| Terminate["terminateServiceSync"] + Terminate -->|tx receipt received| Success["Mark check successful"] + Terminate -->|error| Fail["Mark check failed"] + Wait -->|error| Fail CreateDataSet -->|error| Fail + CreateDataSet -->|abort signal| Fail ``` ### 1. Apply job guards Dealbot applies the same maintenance-window and SP-blocklist rules used by all other SP jobs. If `DATASET_LIFECYCLE_CHECK_ENABLED` is `false`, the job logs a disabled skip and exits. -### 2. Create the data set +### 2. Create the empty data set -Dealbot creates a new data set with a 200 KiB seed piece. The data set is tagged with metadata `{ dealbotLifecycleCheck: "" }`. The fixed `dealbotLifecycleCheck` key is the handle for finding leaked sets later; the per-run value ensures a fresh data set is created on every invocation rather than resolving a prior one. See the [FAQ](#why-does-data-set-creation-use-a-seed-piece) for why we use a seed piece. +Dealbot calls `createDataSet` (from `@filoz/synapse-core/sp`) to create a new empty data set on the SP. The data set is tagged with metadata `{ dealbotLifecycleCheck: "" }`. The fixed `dealbotLifecycleCheck` key is the handle for finding leaked sets later; the per-run value ensures a fresh data set is created on every invocation rather than resolving a prior one. This step does **not** emit `dataSetCreation` metrics — those belong to the `data_set_creation` job. -Source: [`deal.service.ts` (`runDataSetLifecycleCheck`)](../../apps/backend/src/deal/deal.service.ts) +Source: [`data-set-lifecycle.service.ts` (`runLifecycleCheck`)](../../apps/backend/src/data-set-lifecycle/data-set-lifecycle.service.ts) -### 3. Terminate the service +### 3. 
Wait for creation confirmation -Dealbot calls `synapse.storage.terminateDataSet` (aka `terminateService` at contract level) on the newly created `dataSetId`, which sets `pdpEndEpoch` to a near future epoch (~30 days). +Dealbot calls `waitForCreateDataSet` with the `statusUrl` returned by the SP. When the SP confirms the data set is created on-chain, it resolves with a `dataSetId`. -### 4. Wait for on-chain confirmation +### 4. Terminate the service -Dealbot polls FWSS until `pdpEndEpoch != 0`, confirming the termination was recorded on-chain. This is Step 1 of the [full on-chain termination sequence](#what-happens-on-chain-after-terminateservice-is-called). The job does not wait for the full ~30-day rail finalization. +Dealbot calls `terminateServiceSync` (from `@filoz/synapse-core/warm-storage`) on the newly created `dataSetId`. This submits the terminate transaction and waits for the receipt, confirming the termination was recorded on-chain. This is Step 1 of the [full on-chain termination sequence](#what-happens-on-chain-after-terminateservice-is-called). The job does not wait for the full ~30-day rail finalization. -The entire check (creation + upload + termination + confirmation) is bounded by `DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS`. A timeout is classified as `failure.timedout`. +The entire check (creation + confirmation + termination) is bounded by `DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS`. A timeout is classified as `failure.timedout`. ## Check Status Progression @@ -82,7 +83,7 @@ A data set lifecycle check has a single terminal status, recorded once per check Metric definitions live in [Dealbot Events & Metrics](./events-and-metrics.md). 
The metrics emitted by a data set lifecycle check are: - [`dataSetLifecycleCheckStatus`](./events-and-metrics.md#dataSetLifecycleCheckStatus) — `success`, `failure.timedout`, or `failure.other` per provider per run -- [`dataSetLifecycleCheckMs`](./events-and-metrics.md#dataSetLifecycleCheckMs) — end-to-end duration (create + upload + terminate + confirm); emitted on `success` and `failure.timedout` +- [`dataSetLifecycleCheckMs`](./events-and-metrics.md#dataSetLifecycleCheckMs) — end-to-end duration (create + confirm + terminate); emitted on `success` and `failure.timedout` ## Configuration @@ -92,7 +93,7 @@ Key environment variables that control data set lifecycle check behavior: |----------|-------------| | `DATASET_LIFECYCLE_CHECK_ENABLED` | Enables or disables the check. Defaults to `true` on calibration, `false` on mainnet. When disabled, stale schedules are removed so they stop enqueuing no-op jobs. | | `DATASET_LIFECYCLE_CHECKS_PER_SP_PER_HOUR` | Per-SP check rate. Independent of `DATASET_CREATIONS_PER_SP_PER_HOUR`. | -| `DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS` | Max end-to-end job runtime before forced abort. Default `600`. | +| `DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS` | Max end-to-end job runtime before forced abort. Default `360`. | Source: [`apps/backend/src/config/app.config.ts`](../../apps/backend/src/config/app.config.ts) @@ -110,15 +111,15 @@ See also: [`docs/environment-variables.md`](../environment-variables.md) for the **Step 3 — data set deletion at PDPVerifier (SP-initiated).** After the rail finalizes, the SP may call `PDPVerifier.deleteDataSet`. The lifecycle check does not wait for steps 2 or 3 — waiting ~30 days per invocation would defeat the purpose of a canary. -### Why does data set creation use a seed piece? +### Why does data set creation use an empty data set? 
-Data set creation goes through `createContext` + `executeUpload` (the same flow as the data storage check) rather than calling `PDPVerifier.createDataSet` directly, because support for empty data sets is being removed from Curio and `synapse-sdk`. +Empty data set creation calls `createDataSet` from `@filoz/synapse-core/sp` directly, bypassing the upload flow used by the data storage check. This keeps the lifecycle check lightweight: it validates the SP's `createDataSet → terminateService` path without storing any actual data or consuming upload capacity. ### What if creation succeeds but termination fails? If creation succeeds but termination fails (process crash, job timeout, or an on-chain error that is not an already-terminated no-op), the created data set stays live on the SP. This is called a leak and is an accepted trade-off for keeping the job self-contained. -Leaked sets are discoverable by filtering data sets with the `dealbotLifecycleCheck` metadata key. Each leak is also recorded in the `dataset_lifecycle_check_failed` log line with `leakedDataSet: true` and the `dataSetId` for easy identification. +Leaked sets are discoverable by filtering data sets with the `dealbotLifecycleCheck` metadata key. Each leak is also recorded in the `dataset_lifecycle_check_failed` log line (message: "throwaway data set may have leaked") with the `dataSetId` included for easy identification. ### Why does the job create and terminate in the same run? 
diff --git a/docs/checks/events-and-metrics.md b/docs/checks/events-and-metrics.md index bf250c0e..c0694eee 100644 --- a/docs/checks/events-and-metrics.md +++ b/docs/checks/events-and-metrics.md @@ -126,7 +126,7 @@ sequenceDiagram | `dataStorageCheckMs` | Data Storage | [`uploadToSpStart`](#uploadToSpStart) | [`ipfsRetrievalIntegrityChecked`](#ipfsRetrievalIntegrityChecked) | Duration of a Data Storage check | | | `retrievalCheckMs` | Retrieval | Retrieval check start | [`ipfsRetrievalIntegrityChecked`](#ipfsRetrievalIntegrityChecked) | Duration of a Retrieval check | | | `dataSetCreationMs` | Data-Set Creation | Data-set creation uploadToSpStart | Data-set creation pieceConfirmed | Duration of one data-set creation with confirmed piece (all using `createDataSetWithPiece`) | [`deal.service.ts`](../../apps/backend/src/deal/deal.service.ts) | -| `dataSetLifecycleCheckMs` | Data-Set Lifecycle Check | Data-set create with seed piece | FWSS `pdpEndEpoch != 0` confirmed | End-to-end duration of one lifecycle check: create a throwaway data set then terminate it (`runDataSetLifecycleCheck`). Emitted on `success` and `failure.timedout` only. See [data-set-lifecycle-check.md](./data-set-lifecycle-check.md). | [`deal.service.ts`](../../apps/backend/src/deal/deal.service.ts) | +| `dataSetLifecycleCheckMs` | Data-Set Lifecycle Check | Empty data set creation start | `terminateServiceSync` tx receipt received | End-to-end duration of one lifecycle check: create an empty throwaway data set then terminate it (`runLifecycleCheck`). Emitted on `success` and `failure.timedout` only. See [data-set-lifecycle-check.md](./data-set-lifecycle-check.md). 
| [`data-set-lifecycle.service.ts`](../../apps/backend/src/data-set-lifecycle/data-set-lifecycle.service.ts) | | `pullRequestAcknowledgementLatencyMs` | Pull | [`pullRequestSubmittedToSp`](#pullRequestSubmittedToSp) | [`pullRequestAcknowledgedBySp`](#pullRequestAcknowledgedBySp) | Time from `pullPieces` submission to SP request acknowledgement. | [`pull-check.service.ts`](../../apps/backend/src/pull-check/pull-check.service.ts) | | `pullRequestStartedMs` | Pull | [`pullRequestSubmittedToSp`](#pullRequestSubmittedToSp) | [`pullRequestStartedBySp`](#pullRequestStartedBySp) | Time from `pullPieces` submission to the SP reading the first byte of `/api/piece/{pieceCid}`. Skipped (no observation) when the SP never fetches from dealbot. | [`pull-check.service.ts`](../../apps/backend/src/pull-check/pull-check.service.ts), [`pull-piece.controller.ts`](../../apps/backend/src/pull-check/pull-piece.controller.ts) | | `pullRequestCompletionLatencyMs` | Pull | [`pullRequestSubmittedToSp`](#pullRequestSubmittedToSp) | [`pullRequestIsTerminal`](#pullRequestIsTerminal) | Time from `pullPieces` submission to terminal SP pull status. Emitted once for the check, either on success or failure. | [`pull-check.service.ts`](../../apps/backend/src/pull-check/pull-check.service.ts) | @@ -151,7 +151,7 @@ sequenceDiagram | `ipfsRetrievalHttpResponseCode` | Data Storage, Retrieval | [`ipfsRetrievalLastByteReceived`](#ipfsRetrievalLastByteReceived) | `200`, `500`, `2xxSuccess`, `4xxClientError`, `5xxServerError`, `otherHttpStatusCodes`, `failure` | | 1 | [`retrieval.service.ts`](../../apps/backend/src/retrieval/retrieval.service.ts) | | `retrievalStatus` | Data Storage, Retrieval | [`ipfsRetrievalIntegrityChecked`](#ipfsRetrievalIntegrityChecked) | `success`, `failure.timedout`, `failure.other` from [Data Storage Sub-status meanings](./data-storage.md#sub-status-meanings). | On the Retrieval path, the pre-flight branches on the on-chain `PDPVerifier.pieceLive(dataSetId, pieceId)` result. 
When `pieceLive=false` (dataset terminated, piece never created, or piece hard-removed), `skipped.piece_missing` is emitted and the deal is marked `cleaned_up=true`; no SP probe runs. When `pieceLive=true` and the SP returns 404 on `/pdp/piece/:pieceCid/status`, `failure.other` is emitted and a failed retrieval row is recorded (deal stays in the candidate pool for re-probing). | 1 | | | `dataSetCreationStatus` | Data-Set Creation | Not tied to an [event above](#event-list) but rather to data-set creation start (`pending`) and completion (`success`/`failure.*`) | `pending`, `success`, `failure.timedout`, `failure.other` | | 1 | [`deal.service.ts`](../../apps/backend/src/deal/deal.service.ts) | -| `dataSetLifecycleCheckStatus` | Data-Set Lifecycle Check | When a `data_set_lifecycle_check` invocation finishes (create + terminate) | `success`, `failure.timedout`, `failure.other` | `success` confirms the full create→terminate lifecycle completed (FWSS `pdpEndEpoch != 0`). Persistent `failure.*` indicates a `createDataSet` or `terminateService` regression. See [data-set-lifecycle-check.md](./data-set-lifecycle-check.md). | 1 | [`deal.service.ts`](../../apps/backend/src/deal/deal.service.ts) | +| `dataSetLifecycleCheckStatus` | Data-Set Lifecycle Check | When a `data_set_lifecycle_check` invocation finishes (create + terminate) | `success`, `failure.timedout`, `failure.other` | `success` confirms the full create→terminate lifecycle completed (`terminateServiceSync` tx receipt received). Persistent `failure.*` indicates a `createDataSet` or `terminateService` regression. See [data-set-lifecycle-check.md](./data-set-lifecycle-check.md). | 1 | [`data-set-lifecycle.service.ts`](../../apps/backend/src/data-set-lifecycle/data-set-lifecycle.service.ts) | | `dataSetChallengeStatus` | Data Retention | Emitted on each [Data Retention Check](./data-retention.md) poll when a provider's confirmed proving-period totals advance (strictly positive deltas since the last poll). 
| `success` (challenges in newly confirmed successful proving periods), `failure` (challenges in newly confirmed faulted periods) | | Counter increment = **period delta × 5** (`CHALLENGES_PER_PROVING_PERIOD`). Period delta is the increase in subgraph-confirmed proving periods since the previous poll for that provider (not "challenges per poll" in the abstract). See [data-retention.md §3](./data-retention.md#3-calculate-deltas). | [`data-retention.service.ts`](../../apps/backend/src/data-retention/data-retention.service.ts) | | `pullRequestProviderStatus` | Pull | When the SP reports a terminal pull status via `waitForPullPieces`. Recorded exactly once per check (intermediate poll statuses are not counted). | Raw SP-reported pull status, for example `complete`, `failed`, `not_found`. Use this to separate SP-side pull failures from dealbot-side validation failures. | | 1 | [`pull-check.service.ts`](../../apps/backend/src/pull-check/pull-check.service.ts) | | `pullCheckStatus` | Pull | When the [Pull Check](./pull-check.md) terminates (success after direct piece validation, or any failure). Recorded exactly once per check. | `success`, `failure.timedout`, `failure.other` from [Pull Check Status](./pull-check.md#pull-check-status). | | 1 | [`pull-check.service.ts`](../../apps/backend/src/pull-check/pull-check.service.ts) | diff --git a/docs/environment-variables.md b/docs/environment-variables.md index 5dd4e395..ecf5e6a9 100644 --- a/docs/environment-variables.md +++ b/docs/environment-variables.md @@ -682,7 +682,7 @@ rate-based (per hour) and persisted in Postgres so restarts do not reset timing. - **Required**: No - **Default**: `true` on calibration, `false` on mainnet -**Role**: Enables the `data_set_lifecycle_check` canary job, which in a single tick creates a throwaway data set with a seed piece and immediately terminates it (`terminateService`), continuously exercising the on-chain `createDataSet → terminateService` lifecycle. 
+**Role**: Enables the `data_set_lifecycle_check` canary job, which in a single tick creates an empty throwaway data set and immediately terminates it (`terminateServiceSync`), continuously exercising the on-chain `createDataSet → terminateService` lifecycle. **Notes**: Self-contained — it does not touch the managed check data sets and does not depend on `data_set_creation`. When disabled, stale schedules are removed so they stop enqueuing no-op jobs. @@ -846,7 +846,7 @@ Use this to stagger multiple dealbot deployments that are not sharing a database - **Minimum**: `60` (1 minute) - **Enforced**: Yes (config validation, effective floor applied at runtime) -**Role**: Maximum runtime for `data_set_lifecycle_check` jobs before forced abort via `AbortController`. Bounds the seed-piece upload, the `terminateService` call, and the `pdpEndEpoch != 0` confirmation poll. +**Role**: Maximum runtime for `data_set_lifecycle_check` jobs before forced abort via `AbortController`. Bounds the empty data set creation (`createDataSet` + `waitForCreateDataSet`) and the `terminateServiceSync` call. **When to update**: From 524cdc5d3accf1f719c0f2e8b4b3714a39f018bd Mon Sep 17 00:00:00 2001 From: silent-cipher Date: Sat, 6 Jun 2026 00:09:12 +0530 Subject: [PATCH 16/16] docs: update default value --- docs/checks/data-set-lifecycle-check.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/checks/data-set-lifecycle-check.md b/docs/checks/data-set-lifecycle-check.md index cbe265df..9e559f5a 100644 --- a/docs/checks/data-set-lifecycle-check.md +++ b/docs/checks/data-set-lifecycle-check.md @@ -93,7 +93,7 @@ Key environment variables that control data set lifecycle check behavior: |----------|-------------| | `DATASET_LIFECYCLE_CHECK_ENABLED` | Enables or disables the check. Defaults to `true` on calibration, `false` on mainnet. When disabled, stale schedules are removed so they stop enqueuing no-op jobs. | | `DATASET_LIFECYCLE_CHECKS_PER_SP_PER_HOUR` | Per-SP check rate. 
Independent of `DATASET_CREATIONS_PER_SP_PER_HOUR`. | -| `DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS` | Max end-to-end job runtime before forced abort. Default `360`. | +| `DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS` | Max end-to-end job runtime before forced abort. Default `600`. | Source: [`apps/backend/src/config/app.config.ts`](../../apps/backend/src/config/app.config.ts)