feat: data set lifecycle job by silent-cipher · Pull Request #588 · FilOzone/dealbot

silent-cipher · 2026-06-01T18:24:19Z

Summary

Adds a new data_set_lifecycle_check job that exercises the full createDataSet → terminateService lifecycle in a single self-contained run. It creates a throwaway empty data set, terminates it, and waits for transaction receipt. This runs regardless of how many datasets a provider already has.

The job is calibration-only by default (DATASET_LIFECYCLE_CHECK_ENABLED=false on mainnet).

Changes

jobs.service.ts — new data_set_lifecycle_check job type: scheduling, worker registration, singleton-per-SP enforcement
data-set-lifecycle.service.ts — runLifecycleCheck: creates a tagged empty data set (dealbotLifecycleCheck metadata key), calls terminateService, and waits for the transaction receipt
check-metrics.service.ts — new dataSetLifecycleCheckStatus and dataSetLifecycleCheckMs metrics
app.config.ts / .env.example — three new config variables (see below)
Docs — docs/checks/data-set-lifecycle-check.md, updated environment-variables.md, jobs.md, checks/README.md, events-and-metrics.md
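
For readers skimming the change list, the create → terminate → wait-for-receipt flow can be sketched roughly as below. This is a minimal illustration, not the code in data-set-lifecycle.service.ts: the `LifecycleClient` interface, its method signatures, the `runTag` parameter, and the `failedStep` field are all hypothetical stand-ins for whatever the real service calls.

```typescript
// Hypothetical client interface (assumption): stands in for the real SDK calls.
interface LifecycleClient {
  createDataSet(metadata: Record<string, string>): Promise<{ dataSetId: string }>;
  terminateService(dataSetId: string): Promise<{ txHash: string }>;
  waitForReceipt(txHash: string): Promise<{ status: "success" | "reverted" }>;
}

type LifecycleStep = "create" | "terminate" | "receipt";

interface LifecycleResult {
  status: "success" | "failure";
  durationMs: number;
  failedStep?: LifecycleStep; // hypothetical field, for illustration only
}

// Creates a tagged empty data set, terminates it, and waits for the receipt.
// No pieces are added and nothing is persisted between runs.
async function runLifecycleCheck(
  client: LifecycleClient,
  runTag: string,
  now: () => number = Date.now,
): Promise<LifecycleResult> {
  const startedAt = now();
  let step: LifecycleStep = "create";
  try {
    const { dataSetId } = await client.createDataSet({ dealbotLifecycleCheck: runTag });
    step = "terminate";
    const { txHash } = await client.terminateService(dataSetId);
    step = "receipt";
    const receipt = await client.waitForReceipt(txHash);
    if (receipt.status !== "success") {
      throw new Error("termination transaction reverted");
    }
    return { status: "success", durationMs: now() - startedAt };
  } catch {
    // If creation succeeded but a later step failed, the data set is leaked;
    // the metadata key is what makes it findable afterwards.
    return { status: "failure", durationMs: now() - startedAt, failedStep: step };
  }
}
```

Injecting the client keeps the sketch testable without a chain connection; the real job presumably wires in the synapse-core calls directly.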

New Config Variables

DATASET_LIFECYCLE_CHECK_ENABLED – default: true (calibration) / false (mainnet); enables or disables the job
DATASET_LIFECYCLE_CHECKS_PER_SP_PER_HOUR – check rate per provider
DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS – default: 600 seconds; maximum job runtime before forced abort
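
As an illustration of how these three variables might be resolved, here is a small sketch. It is not the code in app.config.ts: the `loadLifecycleCheckConfig` function, the `Network` type, and the validation rules are assumptions. The description gives no default for the per-hour rate, so the sketch leaves it undefined when unset rather than guessing one.

```typescript
type Network = "calibration" | "mainnet";

interface LifecycleCheckConfig {
  enabled: boolean;
  checksPerSpPerHour: number | undefined; // no default stated in the PR description
  jobTimeoutSeconds: number;
}

// Hypothetical loader: resolves the three variables with the documented defaults.
function loadLifecycleCheckConfig(
  env: Record<string, string | undefined>,
  network: Network,
): LifecycleCheckConfig {
  const rawEnabled = env["DATASET_LIFECYCLE_CHECK_ENABLED"];
  // Default is network-dependent: on for calibration, off for mainnet.
  // An explicit value always wins, so mainnet can still opt in manually.
  const enabled =
    rawEnabled === undefined ? network === "calibration" : rawEnabled.trim().toLowerCase() === "true";

  const rawRate = env["DATASET_LIFECYCLE_CHECKS_PER_SP_PER_HOUR"];
  const checksPerSpPerHour = rawRate === undefined ? undefined : Number(rawRate);
  if (checksPerSpPerHour !== undefined && !(checksPerSpPerHour > 0)) {
    throw new Error("DATASET_LIFECYCLE_CHECKS_PER_SP_PER_HOUR must be a positive number");
  }

  const rawTimeout = env["DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS"];
  const jobTimeoutSeconds = rawTimeout === undefined ? 600 : Number(rawTimeout);
  if (!(jobTimeoutSeconds > 0)) {
    throw new Error("DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS must be a positive number");
  }

  return { enabled, checksPerSpPerHour, jobTimeoutSeconds };
}
```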

Closes #586

Copilot

Pull request overview

Adds a new design document describing a proposed calibration-only data_set_termination pg-boss job intended to periodically terminate a managed dataset slot so the existing data_set_creation job recreates it, keeping the on-chain dataset lifecycle continuously exercised as a canary.

Changes:

Introduces a detailed design/spec for a new data_set_termination job, including scheduling, handler algorithm, and idempotency expectations.
Documents proposed configuration knobs and operational constraints (calibration-only gating, canary window sizing, rate constraints vs creation).
Outlines observability expectations and BetterStack dashboard questions for validating the termination→creation loop.

BigLep · 2026-06-03T14:57:06Z

Note to self: create backlog item for calibration lockup period adjustment (8 hours vs 30 days). I'll do this later 2026-06-03.

BigLep

Note: I submitted this prematurely. I am still reviewing and will send another review when I'm done reading through.

I don't think we should say that this closes #586

We still need to do the implementation work and we should make sure we have visibility on this job on the internal dealbot dashboard.

silent-cipher · 2026-06-03T18:54:26Z

I don't think we should say that this closes #586

I was planning to include implementation in this same PR.

BigLep

I'm happy to take another look 2026-06-04, but hopefully this gives enough direction to give confidence about starting implementation.

(This now concludes the review I started with #588 (review))

BigLep · 2026-06-03T20:45:47Z

Note to self: create backlog item for calibration lockup period adjustment (8 hours vs 30 days). I'll do this later 2026-06-03.

Post GA item created: FilOzone/filecoin-services#503

BigLep · 2026-06-04T16:25:10Z

2026-06-04 notes/decisions from verbal parking lot after standup:

Participants: @BigLep, @silent-cipher, @SgtPooki

We discussed the current design where DATA_SET_TERMINATION_MIN_INDEX partitions slots into "golden" (stable for retrieval) and "recyclable" ranges. The concern is that the recyclable slots aren't truly separate from the data storage/retrieval checks — data_set_creation and deal checks may still add pieces to them, so they're neither fully golden nor fully throwaway. The current approach sits in an awkward middle ground.

Decision: simplify to a standalone create-then-terminate canary job.

Instead of the slot-partitioning approach, we agreed to:

Decouple entirely from golden data sets. The termination canary should not recycle slots used by data_set_creation, data storage checks, or retrieval checks. It's a separate, lightweight job.
Create and immediately terminate in a single job invocation. The job creates a new data set, confirms success, then calls terminateService on it. No pieces are added. No interaction with MIN_NUM_DATASETS_FOR_CHECKS or existing slots.
No DB tracking. In the happy path, the data set is created and terminated within one job run. No new rows in the data sets table. Use a recognizable metadata key on the created data set so that if termination fails and data sets accumulate, they can be identified on-chain for manual cleanup.
Accept possible resource leakage. If creation succeeds but termination fails, we'll leak data sets. This is an acceptable trade-off for simplicity — the metadata tagging gives us a way to query for and clean up any leaked data sets if necessary.
Calibration-only by default. Should not run on mainnet by default, but don't hard-block mainnet — allow manual opt-in.
Chaos monkey deferred. Randomly killing golden data sets (to test resilience of the full lifecycle) is a good future goal but out of scope for now. We want the simplest canary that catches createDataSet/terminateService regressions.
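
To make the "recognizable metadata key" point concrete, a cleanup pass over leaked data sets could look roughly like the sketch below. The `OnChainDataSet` shape and the `findLeakedLifecycleDataSets` helper are hypothetical; how the data sets are actually listed from chain is not specified here and is left out.

```typescript
// Hypothetical view of a data set as returned by some on-chain listing (assumption).
interface OnChainDataSet {
  dataSetId: string;
  metadata: Record<string, string>;
  terminated: boolean;
}

// Fixed key the lifecycle check writes on every throwaway data set.
const LIFECYCLE_METADATA_KEY = "dealbotLifecycleCheck";

// Returns data sets created by the lifecycle check that were never terminated,
// i.e. candidates for manual cleanup.
function findLeakedLifecycleDataSets(dataSets: OnChainDataSet[]): OnChainDataSet[] {
  return dataSets.filter(
    (ds) =>
      Object.prototype.hasOwnProperty.call(ds.metadata, LIFECYCLE_METADATA_KEY) && !ds.terminated,
  );
}
```

Matching on the key alone (not the value) is what lets this find every leaked set regardless of which run created it.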

This is a meaningful pivot from the current design — sorry for not catching it sooner in review. The existing work on the spec and slot-management logic is appreciated, but this approach should result in simpler code and docs.

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

SgtPooki · 2026-06-05T13:40:01Z

DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS – default: 360 secs, max job runtime before forced abort

I think we might need to increase this default. is 6 minutes enough to create a data set and delete it?

SgtPooki · 2026-06-05T13:40:59Z

Also,

Create and immediately terminate in a single job invocation. The job creates a new data set, confirms success, then calls terminateService on it. No pieces are added. No interaction with MIN_NUM_DATASETS_FOR_CHECKS or existing slots.

current pr description says that we are seeding a piece, but we shouldn't need to do that. we can create a data-set without a piece right?

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

silent-cipher · 2026-06-05T14:21:48Z

I think we might need to increase this default. is 6 minutes enough to create a data set and delete it?

It was working locally, and I think it's a decent amount of time for the job. But if you agree, I can increase it to 10 mins.

current pr description says that we are seeding a piece, but we shouldn't need to do that. we can create a data-set without a piece right?

I was also thinking about creating an empty data set but then I saw your doc comment above createDataSetWithPiece - "empty data sets is being removed from Curio and synapse-sdk".

SgtPooki · 2026-06-05T16:31:59Z

It was working locally, and I think it's a decent amount of time for the job. But if you agree, I can increase it to 10 mins.

yeah lets do 10 mins.. its just the max before forced abort, and i'd rather give cleanup room to finish than exit too early.

I was also thinking about creating an empty data set but then I saw your doc comment above createDataSetWithPiece - "empty data sets is being removed from Curio and synapse-sdk".

good catch.. i went and double checked and my doc comment is stale. createDataset was removed in FilOzone/pdp#201 but restored in FilOzone/pdp#219 specifically to allow empty data sets. curio still supports it (POST /pdp/data-sets, no pieces) and synapse-core still exports createDataSet. so empty is possible.. lets drop the seed piece per the parking-lot decision, and i'll fix that comment in a followup.
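
For context on "max before forced abort", one common way to enforce such a ceiling is an `AbortController` plus a timer, sketched below. This is only an illustration under that assumption: `withJobTimeout` is a hypothetical helper, and nothing here claims this is how the dealbot job runner implements its timeout.

```typescript
// Hypothetical helper: runs `work` and force-aborts it after `timeoutSeconds`.
async function withJobTimeout<T>(
  timeoutSeconds: number,
  work: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      // Signal cooperative cancellation first, then fail the job.
      controller.abort();
      reject(new Error(`job exceeded ${timeoutSeconds}s timeout`));
    }, timeoutSeconds * 1000);
  });

  try {
    // Whichever settles first wins: the work itself or the timeout.
    return await Promise.race([work(controller.signal), timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

A generous ceiling such as 600 seconds only bounds the worst case; a run that finishes early returns immediately and the timer is cleared.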

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

silent-cipher · 2026-06-05T18:47:09Z

lets drop the seed piece per the parking-lot decision

Done. Now creating empty data set.

Also, reverted all deal.service.ts changes and moved them to a separate DataSetLifecycle module.

BigLep

My main feedback item is to consider doing both createDataSet and createDataSetAddPiece (sp - can't remember). I would hate for this canary to pass while the main operation that users actually use fails...

BigLep · 2026-06-05T20:50:54Z

+
+## Overview
+
+A "data set lifecycle check" tests the full `createDataSet → terminateService` lifecycle for a storage provider. Dealbot creates an empty throwaway data set and immediately terminates it in the same run. A successful check confirms both the `createDataSet` and `terminateService` paths work correctly on the SP.


One thought: most of our tools do createDataSetAndAddPiece, right? If it is possible that createDataSet could succeed but createDataSetAndAddPiece could fail, then I think it's important to do both (where createDataSetAndAddPiece uses a tiny piece).

I think there can be a case where createDataSet could succeed but createDataSetAndAddPiece could fail.

Thoughts on how we could do it:

We randomly select which path to test on each run: createDataSet or createDataSetAndAddPiece.

We add a path label to both metrics, with the value createDataSet or createDataSetAndAddPiece.
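
A rough sketch of that proposal, for discussion only: `pickLifecyclePath`, `buildLifecycleMetrics`, the `MetricSample` shape, the 50/50 split, and the 1/0 status encoding are all assumptions; only the two metric names and the two path values come from this PR.

```typescript
type LifecyclePath = "createDataSet" | "createDataSetAndAddPiece";

// Hypothetical metric sample shape (assumption), independent of any metrics library.
interface MetricSample {
  name: string;
  value: number;
  labels: Record<string, string>;
}

// Picks one of the two paths per run; the random source is injectable for tests.
function pickLifecyclePath(random: () => number = Math.random): LifecyclePath {
  return random() < 0.5 ? "createDataSet" : "createDataSetAndAddPiece";
}

// Attaches the same `path` label to both lifecycle metrics so dashboards
// can split success rate and duration by the path that was exercised.
function buildLifecycleMetrics(
  path: LifecyclePath,
  succeeded: boolean,
  durationMs: number,
): MetricSample[] {
  const labels = { path };
  return [
    { name: "dataSetLifecycleCheckStatus", value: succeeded ? 1 : 0, labels },
    { name: "dataSetLifecycleCheckMs", value: durationMs, labels },
  ];
}
```

With random selection, each path is only exercised on roughly half of the runs, so the per-path check rate is about half of the configured per-SP rate.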

BigLep · 2026-06-05T20:53:13Z

+
+### 2. Create the empty data set
+
+Dealbot calls `createDataSet` (from `@filoz/synapse-core/sp`) to create a new empty data set on the SP. The data set is tagged with metadata `{ dealbotLifecycleCheck: "<timestamp>" }`. The fixed `dealbotLifecycleCheck` key is the handle for finding leaked sets later; the per-run value ensures a fresh data set is created on every invocation rather than resolving a prior one.


What is the "per run" value?

Suggested change

Dealbot calls `createDataSet` (from `@filoz/synapse-core/sp`) to create a new empty data set on the SP. The data set is tagged with metadata `{ dealbotLifecycleCheck: "<timestamp>" }`. The fixed `dealbotLifecycleCheck` key is the handle for finding leaked sets later; the per-run value ensures a fresh data set is created on every invocation rather than resolving a prior one.

Dealbot calls `createDataSet` (from `@filoz/synapse-core/sp`) to create a new empty data set on the SP. The data set is tagged with metadata `{ dealbotLifecycleCheck: "<timestamp>" }`. The fixed `dealbotLifecycleCheck` key is the handle for finding leaked sets later; the `<timestamp>` value ensures a fresh data set is created on every invocation rather than resolving a prior one.

I did this because I was originally confused by what was meant by "per-run value". I see what you're meaning now, but I think this makes it clearer.
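
To illustrate the fixed-key, varying-value idea, here is a tiny sketch. The `buildLifecycleMetadata` helper is hypothetical, and the ISO 8601 format is an assumption; the doc only says the value is a `<timestamp>`.

```typescript
// Hypothetical helper: fixed key, per-invocation timestamp value.
// The ISO 8601 format is an assumption; the doc only specifies "<timestamp>".
function buildLifecycleMetadata(now: Date = new Date()): Record<string, string> {
  return { dealbotLifecycleCheck: now.toISOString() };
}
```

Because the value differs between invocations, two runs never produce identical metadata, which is what forces a fresh data set instead of resolving one from a prior run.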

BigLep · 2026-06-05T20:54:28Z

+
+Dealbot calls `createDataSet` (from `@filoz/synapse-core/sp`) to create a new empty data set on the SP. The data set is tagged with metadata `{ dealbotLifecycleCheck: "<timestamp>" }`. The fixed `dealbotLifecycleCheck` key is the handle for finding leaked sets later; the per-run value ensures a fresh data set is created on every invocation rather than resolving a prior one.
+
+This step does **not** emit `dataSetCreation` metrics — those belong to the `data_set_creation` job.


Turn data_set_creation into a hyperlink?

BigLep · 2026-06-05T21:06:32Z

+
+## Overview
+
+A "data set lifecycle check" tests the full `createDataSet → terminateService` lifecycle for a storage provider. Dealbot creates an empty throwaway data set and immediately terminates it in the same run. A successful check confirms both the `createDataSet` and `terminateService` paths work correctly on the SP.


Suggested change

A "data set lifecycle check" tests the full `createDataSet → terminateService` lifecycle for a storage provider. Dealbot creates an empty throwaway data set and immediately terminates it in the same run. A successful check confirms both the `createDataSet` and `terminateService` paths work correctly on the SP.

A "data set lifecycle check" tests the `createDataSet → terminateService` lifecycle for a storage provider. Dealbot creates an empty throwaway data set and immediately terminates it in the same run. A successful check confirms both the `createDataSet` and `terminateService` paths work correctly on the SP.

Other dataset operations like adding pieces and retrieving them are handled by the [data storage check](LINK). Data set creation isn't covered by the data storage check, and it also relies on a relatively stable set of datasets, which is why this separate check was developed.

silent-cipher added 2 commits June 1, 2026 17:53

docs: add data-set-creation job design documentation

d71b721

docs: add data-set-deletion job design documentation

45a3e90

FilOzzy added this to FOC Jun 1, 2026

github-project-automation Bot moved this to 📌 Triage in FOC Jun 1, 2026

silent-cipher changed the base branch from main to docs/data-set-creation-design-doc June 1, 2026 18:24

docs: simplify terminated slot skip

e654660

silent-cipher self-assigned this Jun 1, 2026

rjan90 moved this from 📌 Triage to ⌨️ In Progress in FOC Jun 2, 2026

Base automatically changed from docs/data-set-creation-design-doc to main June 3, 2026 06:18

silent-cipher added 2 commits June 3, 2026 11:50

Merge branch 'main' into feat/data-set-deletion-job

c08d54f

docs: rename + more explanations

6af976b

silent-cipher requested review from BigLep and Copilot June 3, 2026 08:03

Copilot started reviewing on behalf of silent-cipher June 3, 2026 08:03

silent-cipher requested a review from SgtPooki June 3, 2026 08:04

Copilot AI reviewed Jun 3, 2026


Comment thread docs/data-set-termination.md Outdated

Comment thread docs/data-set-termination.md Outdated

Comment thread docs/data-set-termination.md Outdated

chore: address pr comments

c537509

silent-cipher changed the title ~~docs: data set deletion job design documentation~~ docs: data set termination job design documentation Jun 3, 2026

rjan90 marked this pull request as ready for review June 3, 2026 14:40

rjan90 moved this from ⌨️ In Progress to 🔎 Awaiting review in FOC Jun 3, 2026

BigLep reviewed Jun 3, 2026


silent-cipher added 2 commits June 4, 2026 15:54

feat: add data_set_termination canary job

fac137f

doc: udpate docs

91ffcdc

BigLep marked this pull request as draft June 4, 2026 16:25

BigLep moved this from 🔎 Awaiting review to ⌨️ In Progress in FOC Jun 4, 2026

silent-cipher added 2 commits June 5, 2026 00:58

refactor: pivot to data_set_lifecycle_check

484996e

chore: format

701b463

silent-cipher changed the title ~~docs: data set termination job design documentation~~ feat: data set lifecycle job Jun 4, 2026

silent-cipher added 3 commits June 5, 2026 01:15

refactor: consolidate logging context + early checks

7024e33

fix: default job timeout to 6 mins

4ee5714

docs: update data set lifecycle check doc

11b8701

silent-cipher marked this pull request as ready for review June 5, 2026 09:55

silent-cipher requested review from BigLep and Copilot June 5, 2026 09:55

Copilot AI reviewed Jun 5, 2026


SgtPooki requested a review from Copilot June 5, 2026 13:38

Copilot started reviewing on behalf of SgtPooki June 5, 2026 13:38

Copilot AI reviewed Jun 5, 2026


Comment thread docs/environment-variables.md

Comment thread docs/checks/data-set-lifecycle-check.md

Comment thread apps/backend/.env.example

BigLep requested a review from Copilot June 5, 2026 17:01

Copilot started reviewing on behalf of BigLep June 5, 2026 17:01

Copilot AI reviewed Jun 5, 2026


silent-cipher added 4 commits June 5, 2026 23:38

chore: revert back deal service

f2fa4ce

refactor: create separate data set lifecycle service

81b049d

docs: update docs

980e668

docs: update default value

524cdc5

BigLep approved these changes Jun 5, 2026



