feat: data set lifecycle job#588

Open: silent-cipher wants to merge 17 commits into main from feat/data-set-deletion-job

Collaborator

@silent-cipher silent-cipher commented Jun 1, 2026

Summary

Adds a new data_set_lifecycle_check job that exercises the full createDataSet → terminateService lifecycle in a single self-contained run. It creates a throwaway empty data set, terminates it, and waits for the transaction receipt. This runs regardless of how many data sets a provider already has.

The job is calibration-only by default (DATASET_LIFECYCLE_CHECK_ENABLED=false on mainnet).

Changes

  • jobs.service.ts — new data_set_lifecycle_check job type: scheduling, worker registration, singleton-per-SP enforcement
  • data-set-lifecycle.service.ts — runLifecycleCheck: creates a tagged empty data set (dealbotLifecycleCheck metadata key), calls terminateService, and waits for the transaction receipt
  • check-metrics.service.ts — new dataSetLifecycleCheckStatus and dataSetLifecycleCheckMs metrics
  • app.config.ts / .env.example — three new config variables (see below)
  • Docs — docs/checks/data-set-lifecycle-check.md, updated environment-variables.md, jobs.md, checks/README.md, events-and-metrics.md

New Config Variables

  • DATASET_LIFECYCLE_CHECK_ENABLED – default: true (calibration) / false (mainnet); enables or disables the job
  • DATASET_LIFECYCLE_CHECKS_PER_SP_PER_HOUR – check rate per provider
  • DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS – default: 600 seconds; maximum job runtime before forced abort
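
As a rough illustration of how these three variables could be resolved, here is a minimal sketch. The function and type names (`loadLifecycleCheckConfig`, `LifecycleCheckConfig`, `Network`) are hypothetical and are not taken from app.config.ts. The PR description does not state a default for the per-hour rate, so the sketch leaves it undefined when unset instead of guessing one.

```typescript
// Hypothetical sketch; the real parsing in app.config.ts may differ.
type Network = "calibration" | "mainnet";

interface LifecycleCheckConfig {
  enabled: boolean;
  checksPerSpPerHour: number | undefined; // no default stated in the PR description
  jobTimeoutSeconds: number;
}

function parseBool(raw: string | undefined, fallback: boolean): boolean {
  if (raw === undefined || raw.trim() === "") return fallback;
  return raw.trim().toLowerCase() === "true";
}

function parsePositiveNumber(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  const value = Number(raw);
  if (!Number.isFinite(value) || value <= 0) {
    throw new Error(`expected a positive number, got "${raw}"`);
  }
  return value;
}

function loadLifecycleCheckConfig(
  env: Record<string, string | undefined>,
  network: Network,
): LifecycleCheckConfig {
  return {
    // Enabled by default on calibration only; mainnet stays off unless explicitly opted in.
    enabled: parseBool(env["DATASET_LIFECYCLE_CHECK_ENABLED"], network === "calibration"),
    checksPerSpPerHour: parsePositiveNumber(env["DATASET_LIFECYCLE_CHECKS_PER_SP_PER_HOUR"]),
    // 600 seconds is the default listed above.
    jobTimeoutSeconds: parsePositiveNumber(env["DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS"]) ?? 600,
  };
}
```

With an empty environment this yields an enabled job on calibration and a disabled one on mainnet. Setting DATASET_LIFECYCLE_CHECK_ENABLED=true on mainnet is the manual opt-in.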

Closes #586

@FilOzzy FilOzzy added this to FOC Jun 1, 2026
@github-project-automation github-project-automation Bot moved this to 📌 Triage in FOC Jun 1, 2026
@silent-cipher silent-cipher changed the base branch from main to docs/data-set-creation-design-doc June 1, 2026 18:24
@silent-cipher silent-cipher self-assigned this Jun 1, 2026
@rjan90 rjan90 moved this from 📌 Triage to ⌨️ In Progress in FOC Jun 2, 2026
Base automatically changed from docs/data-set-creation-design-doc to main June 3, 2026 06:18
Contributor

Copilot AI left a comment

Pull request overview

Adds a new design document describing a proposed calibration-only data_set_termination pg-boss job intended to periodically terminate a managed dataset slot so the existing data_set_creation job recreates it, keeping the on-chain dataset lifecycle continuously exercised as a canary.

Changes:

  • Introduces a detailed design/spec for a new data_set_termination job, including scheduling, handler algorithm, and idempotency expectations.
  • Documents proposed configuration knobs and operational constraints (calibration-only gating, canary window sizing, rate constraints vs creation).
  • Outlines observability expectations and BetterStack dashboard questions for validating the termination→creation loop.

Comment thread docs/data-set-termination.md Outdated (3 threads)
@silent-cipher silent-cipher changed the title docs: data set deletion job design documentation docs: data set termination job design documentation Jun 3, 2026
@rjan90 rjan90 marked this pull request as ready for review June 3, 2026 14:40
@rjan90 rjan90 moved this from ⌨️ In Progress to 🔎 Awaiting review in FOC Jun 3, 2026
Contributor

BigLep commented Jun 3, 2026

Note to self: create backlog item for calibration lockup period adjustment (8 hours vs 30 days). I'll do this later 2026-06-03.

Contributor

@BigLep BigLep left a comment

Note: I submitted this prematurely. I am still reviewing and will send another review when I'm done reading through.

I don't think we should say that this closes #586

We still need to do the implementation work and we should make sure we have visibility on this job on the internal dealbot dashboard.

Comment thread docs/data-set-termination.md Outdated (6 threads)
@silent-cipher
Collaborator Author

I don't think we should say that this closes #586

I was planning to include implementation in this same PR.

Contributor

@BigLep BigLep left a comment

I'm happy to take another look 2026-06-04, but hopefully this gives enough direction to give confidence about starting implementation.

(This now concludes the review I started with #588 (review))

Comment thread docs/data-set-termination.md Outdated (10 threads)
Contributor

BigLep commented Jun 3, 2026

Note to self: create backlog item for calibration lockup period adjustment (8 hours vs 30 days). I'll do this later 2026-06-03.

Post GA item created: FilOzone/filecoin-services#503

Contributor

BigLep commented Jun 4, 2026

2026-06-04 notes/decisions from verbal parking lot after standup:

Participants: @BigLep, @silent-cipher, @SgtPooki

We discussed the current design where DATA_SET_TERMINATION_MIN_INDEX partitions slots into "golden" (stable for retrieval) and "recyclable" ranges. The concern is that the recyclable slots aren't truly separate from the data storage/retrieval checks — data_set_creation and deal checks may still add pieces to them, so they're neither fully golden nor fully throwaway. The current approach sits in an awkward middle ground.

Decision: simplify to a standalone create-then-terminate canary job.

Instead of the slot-partitioning approach, we agreed to:

  1. Decouple entirely from golden data sets. The termination canary should not recycle slots used by data_set_creation, data storage checks, or retrieval checks. It's a separate, lightweight job.
  2. Create and immediately terminate in a single job invocation. The job creates a new data set, confirms success, then calls terminateService on it. No pieces are added. No interaction with MIN_NUM_DATASETS_FOR_CHECKS or existing slots.
  3. No DB tracking. In the happy path, the data set is created and terminated within one job run. No new rows in the data sets table. Use a recognizable metadata key on the created data set so that if termination fails and data sets accumulate, they can be identified on-chain for manual cleanup.
  4. Accept possible resource leakage. If creation succeeds but termination fails, we'll leak data sets. This is an acceptable trade-off for simplicity — the metadata tagging gives us a way to query for and clean up any leaked data sets if necessary.
  5. Calibration-only by default. Should not run on mainnet by default, but don't hard-block mainnet — allow manual opt-in.
  6. Chaos monkey deferred. Randomly killing golden data sets (to test resilience of the full lifecycle) is a good future goal but out of scope for now. We want the simplest canary that catches createDataSet/terminateService regressions.

This is a meaningful pivot from the current design — sorry for not catching it sooner in review. The existing work on the spec and slot-management logic is appreciated, but this approach should result in simpler code and docs.
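
The create-then-terminate flow agreed in the decision list above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `LifecycleDeps` interface is a hypothetical injection surface and does not reflect the real `createDataSet` or `terminateService` signatures, and the outcome names are made up. It only shows the control flow: no pieces, no DB rows, and a `terminate_failed` outcome that reports the leaked data set id so it can be cleaned up later.

```typescript
// Hypothetical dependency surface; these are NOT the synapse-core signatures.
interface LifecycleDeps {
  createDataSet(metadata: Record<string, string>): Promise<string>; // resolves to a data set id
  terminateService(dataSetId: string): Promise<string>; // resolves to a transaction hash
  waitForReceipt(txHash: string): Promise<void>;
  now(): number; // milliseconds, injectable for tests
}

type LifecycleOutcome =
  | { status: "success"; dataSetId: string; durationMs: number }
  | { status: "create_failed"; error: string; durationMs: number }
  | { status: "terminate_failed"; leakedDataSetId: string; error: string; durationMs: number };

const METADATA_KEY = "dealbotLifecycleCheck";

async function runLifecycleCheckSketch(deps: LifecycleDeps): Promise<LifecycleOutcome> {
  const startedAt = deps.now();
  const elapsed = () => deps.now() - startedAt;
  const message = (e: unknown) => (e instanceof Error ? e.message : String(e));

  // Step 1: create an empty, tagged data set. No pieces are added.
  let dataSetId: string;
  try {
    dataSetId = await deps.createDataSet({ [METADATA_KEY]: String(startedAt) });
  } catch (e) {
    return { status: "create_failed", error: message(e), durationMs: elapsed() };
  }

  // Step 2: terminate it in the same run and wait for the receipt.
  try {
    const txHash = await deps.terminateService(dataSetId);
    await deps.waitForReceipt(txHash);
  } catch (e) {
    // Accepted trade-off: the data set leaks, but its metadata tag makes it findable.
    return { status: "terminate_failed", leakedDataSetId: dataSetId, error: message(e), durationMs: elapsed() };
  }

  return { status: "success", dataSetId, durationMs: elapsed() };
}
```

Injecting the dependencies keeps the sketch testable without a chain connection, and the three-way outcome maps naturally onto a status metric.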

@BigLep BigLep marked this pull request as draft June 4, 2026 16:25
@BigLep BigLep moved this from 🔎 Awaiting review to ⌨️ In Progress in FOC Jun 4, 2026
@silent-cipher silent-cipher changed the title docs: data set termination job design documentation feat: data set lifecycle job Jun 4, 2026
@silent-cipher silent-cipher marked this pull request as ready for review June 5, 2026 09:55
@silent-cipher silent-cipher requested review from BigLep and Copilot June 5, 2026 09:55
Contributor

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Collaborator

SgtPooki commented Jun 5, 2026

DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS – default: 360 secs, max job runtime before forced abort

I think we might need to increase this default. is 6 minutes enough to create a data set and delete it?

Collaborator

SgtPooki commented Jun 5, 2026

Also,

Create and immediately terminate in a single job invocation. The job creates a new data set, confirms success, then calls terminateService on it. No pieces are added. No interaction with MIN_NUM_DATASETS_FOR_CHECKS or existing slots.

current pr description says that we are seeding a piece, but we shouldn't need to do that. we can create a data-set without a piece right?

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

Comment thread docs/environment-variables.md
Comment thread docs/checks/data-set-lifecycle-check.md
Comment thread apps/backend/.env.example
@silent-cipher
Collaborator Author

I think we might need to increase this default. is 6 minutes enough to create a data set and delete it?

It was working locally, and I think it's a decent amount of time for the job. But if you agree, I can increase it to 10 mins.

current pr description says that we are seeding a piece, but we shouldn't need to do that. we can create a data-set without a piece right?

I was also thinking about creating an empty data set but then I saw your doc comment above createDataSetWithPiece - "empty data sets is being removed from Curio and synapse-sdk".

Collaborator

SgtPooki commented Jun 5, 2026

It was working locally. And, I think its decent amount of time for the job. But, if you agree , I can increase it to 10 mins.

yeah lets do 10 mins.. its just the max before forced abort, and i'd rather give cleanup room to finish than exit too early.
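
For context on what a "max before forced abort" can mean in code, here is a generic sketch of a timeout wrapper. It is an assumption for illustration only: the helper name `runWithTimeout` is hypothetical, and the PR may instead rely on the job queue's own expiration handling. The wrapper rejects once the limit passes and signals cooperative work through an `AbortSignal`.

```typescript
// Hypothetical helper; not taken from the PR.
function runWithTimeout<T>(
  work: (signal: AbortSignal) => Promise<T>,
  timeoutSeconds: number,
): Promise<T> {
  const controller = new AbortController();
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      controller.abort(); // lets cooperative work notice and stop early
      reject(new Error(`lifecycle check timed out after ${timeoutSeconds}s`));
    }, timeoutSeconds * 1000);

    // Promise.resolve().then(...) also turns a synchronous throw in `work` into a rejection.
    Promise.resolve()
      .then(() => work(controller.signal))
      .then(
        (value) => {
          clearTimeout(timer);
          resolve(value);
        },
        (error) => {
          clearTimeout(timer);
          reject(error);
        },
      );
  });
}
```

A call such as `runWithTimeout((signal) => doCheck(signal), 600)` would apply the 10 minute limit, where `doCheck` stands in for whatever function performs the check. Work that ignores the signal keeps running in the background, which is one reason a generous limit that leaves room for cleanup is the safer default.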

I was also thinking about creating an empty data set but then I saw your doc comment above createDataSetWithPiece - "empty data sets is being removed from Curio and synapse-sdk".

good catch.. i went and double checked and my doc comment is stale. createDataset was removed in FilOzone/pdp#201 but restored in FilOzone/pdp#219 specifically to allow empty data sets. curio still supports it (POST /pdp/data-sets, no pieces) and synapse-core still exports createDataSet. so empty is possible.. lets drop the seed piece per the parking-lot decision, and i'll fix that comment in a followup.

Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@silent-cipher
Collaborator Author

lets drop the seed piece per the parking-lot decision

Done. Now creating an empty data set.

Also, reverted all deal.service.ts changes and moved them to a separate DataSetLifecycle module.

Contributor

@BigLep BigLep left a comment

Main feedback item is to consider doing both createDataSet and createDataSetAddPiece (sp - can't remember). I would hate for this canary to pass but then have the main operation that users actually use fail...


## Overview

A "data set lifecycle check" tests the full `createDataSet → terminateService` lifecycle for a storage provider. Dealbot creates an empty throwaway data set and immediately terminates it in the same run. A successful check confirms both the `createDataSet` and `terminateService` paths work correctly on the SP.
Contributor

One thought: most of our tools do createDataSetAndAddPiece, right? If it is possible that createDataSet could succeed but createDataSetAndAddPiece could fail, then I think it's important to do both (where createDataSetAndAddPiece uses a tiny piece).

Collaborator Author

I think there can be a case where createDataSet could succeed but createDataSetAndAddPiece could fail.

Thoughts on how we could do it:

  1. We randomly select which path to test: createDataSet or createDataSetAndAddPiece.
  2. We add a path label to both metrics, with the value createDataSet or createDataSetAndAddPiece.
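
The two-step proposal above could be sketched as follows. This is only an illustration of the idea under discussion, not code from the PR: the helper names are hypothetical, the `providerId` label is an assumption, and an even 50/50 split between the two paths is assumed because the comment does not specify a weighting.

```typescript
// Hypothetical sketch of the proposal; names and the 50/50 split are assumptions.
type LifecyclePath = "createDataSet" | "createDataSetAndAddPiece";

// The random source is injectable so the choice can be made deterministic in tests.
function pickLifecyclePath(random: () => number = Math.random): LifecyclePath {
  return random() < 0.5 ? "createDataSet" : "createDataSetAndAddPiece";
}

// Labels that would be attached to both dataSetLifecycleCheckStatus and
// dataSetLifecycleCheckMs, so each path can be charted and alerted on separately.
function lifecycleMetricLabels(providerId: string, path: LifecyclePath): Record<string, string> {
  return { providerId, path };
}
```

One consequence of random selection is that each path runs at roughly half the configured check rate, so a regression in one path may take longer to surface than it would if both paths ran on every invocation.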


### 2. Create the empty data set

Dealbot calls `createDataSet` (from `@filoz/synapse-core/sp`) to create a new empty data set on the SP. The data set is tagged with metadata `{ dealbotLifecycleCheck: "<timestamp>" }`. The fixed `dealbotLifecycleCheck` key is the handle for finding leaked sets later; the per-run value ensures a fresh data set is created on every invocation rather than resolving a prior one.
Contributor

What is the "per run" value?

Suggested change
Dealbot calls `createDataSet` (from `@filoz/synapse-core/sp`) to create a new empty data set on the SP. The data set is tagged with metadata `{ dealbotLifecycleCheck: "<timestamp>" }`. The fixed `dealbotLifecycleCheck` key is the handle for finding leaked sets later; the per-run value ensures a fresh data set is created on every invocation rather than resolving a prior one.
Dealbot calls `createDataSet` (from `@filoz/synapse-core/sp`) to create a new empty data set on the SP. The data set is tagged with metadata `{ dealbotLifecycleCheck: "<timestamp>" }`. The fixed `dealbotLifecycleCheck` key is the handle for finding leaked sets later; the `<timestamp>` value ensures a fresh data set is created on every invocation rather than resolving a prior one.

I did this because I was originally confused by what was meant by "per-run value". I see what you're meaning now, but I think this makes it clearer.
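
To make the tagging scheme in this thread concrete, here is a minimal sketch of building the metadata and later filtering for leaked sets. The ISO-8601 timestamp format, the `DataSetSummary` shape, and both function names are assumptions for illustration. The doc only specifies the fixed `dealbotLifecycleCheck` key and a `<timestamp>` value.

```typescript
// Hypothetical sketch; only the metadata key comes from the doc under review.
const LIFECYCLE_METADATA_KEY = "dealbotLifecycleCheck";

// A fresh timestamp per run means the metadata never matches a previously created data set.
function buildLifecycleMetadata(now: Date = new Date()): Record<string, string> {
  return { [LIFECYCLE_METADATA_KEY]: now.toISOString() }; // ISO format is an assumption
}

// Assumed shape of whatever listing is available for a provider's data sets.
interface DataSetSummary {
  id: string;
  terminated: boolean;
  metadata: Record<string, string>;
}

// Leaked sets are those carrying the fixed key that were never terminated.
function findLeakedLifecycleDataSets(dataSets: DataSetSummary[]): DataSetSummary[] {
  return dataSets.filter(
    (ds) => !ds.terminated && Object.prototype.hasOwnProperty.call(ds.metadata, LIFECYCLE_METADATA_KEY),
  );
}
```

Filtering on the key alone, and ignoring the value, is what lets a cleanup pass find every leaked set regardless of when it was created.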


Dealbot calls `createDataSet` (from `@filoz/synapse-core/sp`) to create a new empty data set on the SP. The data set is tagged with metadata `{ dealbotLifecycleCheck: "<timestamp>" }`. The fixed `dealbotLifecycleCheck` key is the handle for finding leaked sets later; the per-run value ensures a fresh data set is created on every invocation rather than resolving a prior one.

This step does **not** emit `dataSetCreation` metrics — those belong to the `data_set_creation` job.
Contributor

Turn data_set_creation into a hyperlink?


## Overview

A "data set lifecycle check" tests the full `createDataSet → terminateService` lifecycle for a storage provider. Dealbot creates an empty throwaway data set and immediately terminates it in the same run. A successful check confirms both the `createDataSet` and `terminateService` paths work correctly on the SP.
Contributor

Suggested change
A "data set lifecycle check" tests the full `createDataSet → terminateService` lifecycle for a storage provider. Dealbot creates an empty throwaway data set and immediately terminates it in the same run. A successful check confirms both the `createDataSet` and `terminateService` paths work correctly on the SP.
A "data set lifecycle check" tests the `createDataSet → terminateService` lifecycle for a storage provider. Dealbot creates an empty throwaway data set and immediately terminates it in the same run. A successful check confirms both the `createDataSet` and `terminateService` paths work correctly on the SP.
Other data set operations, like adding pieces and retrieving them, are handled by the [data storage check](LINK). The data storage check does not cover data set creation and relies on a relatively stable set of data sets, which is why this separate check was developed.

Projects

Status: ⌨️ In Progress

Development

Successfully merging this pull request may close these issues.

Periodic dataset deletion job in calibration to canary createDataSet flow

6 participants