feat: data set lifecycle job #588
Pull request overview
Adds a new design document describing a proposed calibration-only data_set_termination pg-boss job intended to periodically terminate a managed dataset slot so the existing data_set_creation job recreates it, keeping the on-chain dataset lifecycle continuously exercised as a canary.
Changes:
- Introduces a detailed design/spec for a new `data_set_termination` job, including scheduling, handler algorithm, and idempotency expectations.
- Documents proposed configuration knobs and operational constraints (calibration-only gating, canary window sizing, rate constraints vs creation).
- Outlines observability expectations and BetterStack dashboard questions for validating the termination→creation loop.
Note to self: create backlog item for calibration lockup period adjustment (8 hours vs 30 days). I'll do this later 2026-06-03.
Note: I submitted this prematurely. I am still reviewing and will send another review when I'm done reading through.
I don't think we should say that this closes #586
We still need to do the implementation work and we should make sure we have visibility on this job on the internal dealbot dashboard.
I was planning to include implementation in this same PR.
BigLep
left a comment
I'm happy to take another look on 2026-06-04, but hopefully this gives enough direction to give confidence about starting implementation.
(This now concludes the review I started with #588 (review))
Post GA item created: FilOzone/filecoin-services#503
2026-06-04 notes/decisions from verbal parking lot after standup:
Participants: @BigLep, @silent-cipher, @SgtPooki
We discussed the current design where
Decision: simplify to a standalone create-then-terminate canary job. Instead of the slot-partitioning approach, we agreed to:
This is a meaningful pivot from the current design; sorry for not catching it sooner in review. The existing work on the spec and slot-management logic is appreciated, but this approach should result in simpler code and docs.
I think we might need to increase this default. Is 6 minutes enough to create a data set and delete it?
Also, the current PR description says that we are seeding a piece, but we shouldn't need to do that. We can create a data set without a piece, right?
It was working locally, and I think it's a decent amount of time for the job. But if you agree, I can increase it to 10 mins.
I was also thinking about creating an empty data set, but then I saw your doc comment above.
yeah lets do 10 mins.. its just the max before forced abort, and i'd rather give cleanup room to finish than exit too early.
good catch.. i went and double checked and my doc comment is stale.
Done. Now creating empty data set. Also, reverted all
BigLep
left a comment
Main feedback item is to consider doing both `createDataSet` and `createDataSetAddPiece` (sp - can't remember). I would hate for this canary to pass but then have the main operation that users actually use fail...
## Overview

A "data set lifecycle check" tests the full `createDataSet → terminateService` lifecycle for a storage provider. Dealbot creates an empty throwaway data set and immediately terminates it in the same run. A successful check confirms both the `createDataSet` and `terminateService` paths work correctly on the SP.
One thought: most of our tools do `createDataSetAndAddPiece`, right? If it is possible that `createDataSet` could succeed but `createDataSetAndAddPiece` could fail, then I think it's important to do both (where `createDataSetAndAddPiece` uses a tiny piece).
I think there can be a case where `createDataSet` could succeed but `createDataSetAndAddPiece` could fail.
Thoughts on how we do it:
- We randomly select which path to test: `createDataSet` or `createDataSetAndAddPiece`.
- We add a `path` label to both metrics, with the value `createDataSet` or `createDataSetAndAddPiece`.
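The two bullets above could be sketched roughly as follows. This is only an illustration of the idea, not code from the PR: `LifecyclePath`, `pickLifecyclePath`, and `withPathLabel` are hypothetical names, and the 50/50 split is an assumption.

```typescript
// Hypothetical names: nothing here is taken from the dealbot codebase.
type LifecyclePath = "createDataSet" | "createDataSetAndAddPiece";

// Pick one of the two paths per run. The 50/50 split is an assumption.
// The random source is injectable so tests can make the choice deterministic.
function pickLifecyclePath(random: () => number = Math.random): LifecyclePath {
  return random() < 0.5 ? "createDataSet" : "createDataSetAndAddPiece";
}

// Attach the chosen path as a `path` label on top of an existing label set,
// so both lifecycle metrics can be broken down by which path was exercised.
function withPathLabel(
  labels: Record<string, string>,
  path: LifecyclePath,
): Record<string, string> {
  return { ...labels, path };
}
```

Random selection keeps the per-run cost at one data set while still covering both paths over time; the trade-off is that a regression in one path may take a few runs to show up.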
### 2. Create the empty data set

Dealbot calls `createDataSet` (from `@filoz/synapse-core/sp`) to create a new empty data set on the SP. The data set is tagged with metadata `{ dealbotLifecycleCheck: "<timestamp>" }`. The fixed `dealbotLifecycleCheck` key is the handle for finding leaked sets later; the per-run value ensures a fresh data set is created on every invocation rather than resolving a prior one.
What is the "per run" value?
Suggested change:
Dealbot calls `createDataSet` (from `@filoz/synapse-core/sp`) to create a new empty data set on the SP. The data set is tagged with metadata `{ dealbotLifecycleCheck: "<timestamp>" }`. The fixed `dealbotLifecycleCheck` key is the handle for finding leaked sets later; the `<timestamp>` value ensures a fresh data set is created on every invocation rather than resolving a prior one.
I did this because I was originally confused by what was meant by "per-run value". I see what you're meaning now, but I think this makes it clearer.
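To make the fixed-key / per-run-value distinction concrete, here is a minimal sketch. It assumes an ISO-8601 timestamp format and a hypothetical `DataSetInfo` shape; neither is confirmed by the doc under review, and the helper names are illustrative only.

```typescript
// Fixed metadata key quoted in the doc; everything else below is hypothetical.
const LIFECYCLE_CHECK_KEY = "dealbotLifecycleCheck";

// Assumed shape of a listed data set (not the real synapse-core type).
interface DataSetInfo {
  dataSetId: number;
  metadata: Record<string, string>;
}

// Fixed key, per-run value: a different timestamp on every invocation, so
// each run's metadata differs from the previous run's.
// The ISO-8601 format is an assumption.
function buildLifecycleMetadata(now: Date = new Date()): Record<string, string> {
  return { [LIFECYCLE_CHECK_KEY]: now.toISOString() };
}

// Any data set still carrying the fixed key was created by a lifecycle check
// run, so it is a candidate for leak detection regardless of its timestamp.
function findLifecycleCheckDataSets(dataSets: DataSetInfo[]): DataSetInfo[] {
  return dataSets.filter((ds) => ds.metadata[LIFECYCLE_CHECK_KEY] !== undefined);
}
```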
This step does **not** emit `dataSetCreation` metrics; those belong to the `data_set_creation` job.
Turn `data_set_creation` into a hyperlink?
Suggested change:
A "data set lifecycle check" tests the `createDataSet → terminateService` lifecycle for a storage provider. Dealbot creates an empty throwaway data set and immediately terminates it in the same run. A successful check confirms both the `createDataSet` and `terminateService` paths work correctly on the SP.
Other dataset operations like adding pieces and retrieving them are handled by the [data storage check](LINK). Data set creation isn't covered by the data storage check, and it also relies on a relatively stable set of datasets, which is why this separate check was developed.
Summary
Adds a new `data_set_lifecycle_check` job that exercises the full `createDataSet` → `terminateService` lifecycle in a single self-contained run. It creates a throwaway empty data set, terminates it, and waits for transaction receipt. This runs regardless of how many datasets a provider already has.

The job is calibration-only by default (`DATASET_LIFECYCLE_CHECK_ENABLED=false` on mainnet).

Changes
- `jobs.service.ts`: new `data_set_lifecycle_check` job type: scheduling, worker registration, singleton-per-SP enforcement
- `data-set-lifecycle.service.ts`: `runLifecycleCheck` creates a tagged empty data set (`dealbotLifecycleCheck` metadata key), calls `terminateService`, and waits for the transaction receipt
- `check-metrics.service.ts`: new `dataSetLifecycleCheckStatus` and `dataSetLifecycleCheckMs` metrics
- `app.config.ts` / `.env.example`: three new config variables (see below)
- `docs/checks/data-set-lifecycle-check.md`, updated `environment-variables.md`, `jobs.md`, `checks/README.md`, `events-and-metrics.md`

New Config Variables
- `DATASET_LIFECYCLE_CHECK_ENABLED`: default `true` (calibration) / `false` (mainnet), enable/disable the job
- `DATASET_LIFECYCLE_CHECKS_PER_SP_PER_HOUR`: check rate per provider
- `DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS`: default 600 secs, max job runtime before forced abort

Closes #586
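
For readers skimming the config variables, the defaults described above could be resolved along these lines. This is a hedged sketch, not the contents of `app.config.ts`: `LifecycleCheckConfig`, `Network`, and `parseLifecycleCheckConfig` are hypothetical names, and because the description gives no default for the per-hour rate, the sketch leaves it unset rather than guessing one.

```typescript
// Hypothetical types and function; only the variable names and the stated
// defaults (calibration on, mainnet off, 600 s timeout) come from the PR text.
type Network = "calibration" | "mainnet";

interface LifecycleCheckConfig {
  enabled: boolean;
  checksPerSpPerHour: number | undefined; // no default stated in the PR
  jobTimeoutSeconds: number;
}

function parseLifecycleCheckConfig(
  env: Record<string, string | undefined>,
  network: Network,
): LifecycleCheckConfig {
  const enabledRaw = env["DATASET_LIFECYCLE_CHECK_ENABLED"];
  const rateRaw = env["DATASET_LIFECYCLE_CHECKS_PER_SP_PER_HOUR"];
  const timeoutRaw = env["DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS"];

  return {
    // Unset means: enabled on calibration, disabled on mainnet.
    enabled: enabledRaw === undefined ? network === "calibration" : enabledRaw === "true",
    checksPerSpPerHour: rateRaw === undefined ? undefined : Number(rateRaw),
    // 600 seconds is the maximum runtime before a forced abort.
    jobTimeoutSeconds: timeoutRaw === undefined ? 600 : Number(timeoutRaw),
  };
}
```

Passing the environment in as a plain record (for example `parseLifecycleCheckConfig(process.env, "calibration")` in Node) keeps the function easy to test without touching global state.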