Skip to content

feat(jobs): add dataset_cleanup_sweep global job#548

Draft
SgtPooki wants to merge 1 commit into
mainfrom
fix/dataset-cleanup-sweep
Draft

feat(jobs): add dataset_cleanup_sweep global job#548
SgtPooki wants to merge 1 commit into
mainfrom
fix/dataset-cleanup-sweep

Conversation

@SgtPooki
Copy link
Copy Markdown
Collaborator

What changed

New scheduled global job dataset_cleanup_sweep that periodically reconciles Deal.cleaned_up with FWSS state. For every uncleaned Deal row with a non-null data_set_id, the sweeper probes FWSS getDataSet(dataSetId):

  • pdpEndEpoch != 0n → flip cleaned_up=true (terminated)
  • getDataSet returns null → flip cleaned_up=true (dataset removed)
  • otherwise → leave alone (live)
  • probe error → skip + log + retry next tick

UPDATEs filter cleaned_up=false and run inside a single transaction per bucket. Reuses the existing UPDATE block from repairTerminatedDataSet via a new shared method DealService.markDealsCleanedUpForDataSets.

Why

In session-key + multisig payer mode, dealbot cannot auto-terminate datasets (#546). Operators run terminateService via Safe (precedent: #545). After the Safe batch lands, synapse-sdk's createContext filters out the terminated dataset, so getDataSetProvisioningStatus returns "missing" instead of "terminated" and repairTerminatedDataSet is never invoked for those rows. The retrieval candidate selector keeps picking the stale Deal rows and pollutes failure metrics.

Empirically on calibration staging (7-day window):

  • 4530 successful retrievals, 398 failed
  • 340 / 398 failed retrievals (87%) tied to FWSS-terminated datasets
  • 6 / 398 tied to FWSS-DNE datasets
  • Two datasets account for 337/340 of the terminated pollution

How to verify

pnpm test from repo root (364 tests pass).

Manual: in staging after deploy, watch for dataset_cleanup_sweep_completed log events. First run drains the historical backlog; steady-state ticks should report datasetsTerminated: 0 after the queue is clean.

Notes / risks

  • Draft. New unit tests for the sweeper + handler are a follow-up.
  • Cadence is 24h by default (DATASET_CLEANUP_SWEEP_INTERVAL_SECONDS=86400). Operators can dial down via env if a backlog spikes after a Safe batch.
  • Partial index migration: CREATE INDEX idx_deals_unclean_dataset ON deals(data_set_id) WHERE cleaned_up = false AND data_set_id IS NOT NULL. Without it, the SELECT DISTINCT data_set_id WHERE cleaned_up = false triggers a full table scan on every tick.
  • pdpEndEpoch != 0n is treated as irreversible (verified against FWSS source: no path resets it to zero once set).
  • Multicall errors are bucketed separately from result == null so a flaky RPC will not wrongly mark deals cleaned up.
  • Out of scope for this PR: Deal rows with data_set_id IS NULL (separate root cause from Synapse upgrade causes Dealbot to create a new dataset per deal for providers beyond first 100 client datasets #511 piece_id=0 pattern). Filed as a separate concern.
  • Follow-up optimization noted in Dealbot can't auto-terminate datasets when payer is a Safe multisig #546 review: subgraph already exposes pdpEndEpoch per dataset, so a future iteration could skip the per-dataset multicall entirely.

Refs

Scheduled global job that flips Deal.cleaned_up=true for any uncleaned
Deal row whose data_set_id shows pdpEndEpoch != 0n on FWSS (terminated)
or whose getDataSet returns null (removed).

Closes the gap from the operator-must-terminate workflow tracked in
#546: after an operator runs
a Safe terminateService batch (like #545), synapse-sdk's createContext
filters out the now-terminated dataset, so getDataSetProvisioningStatus
returns "missing" instead of "terminated" and repairTerminatedDataSet
is never invoked for those rows.

Empirically, 87% of recent failed retrievals tie to terminated datasets
that should have been auto-cleaned. The sweeper eliminates this noise.

Cadence default 24h (DATASET_CLEANUP_SWEEP_INTERVAL_SECONDS=86400),
batch size 50 (DATASET_CLEANUP_SWEEP_BATCH_SIZE=50). Reuses
recordJobExecution for jobs_* metrics. Idempotent UPDATE filters
cleaned_up=false. Adds a partial index on
deals(data_set_id) WHERE cleaned_up=false AND data_set_id IS NOT NULL
to keep the SELECT DISTINCT cheap as the deals table grows.

Extracts the existing UPDATE block from repairTerminatedDataSet into
DealService.markDealsCleanedUpForDataSets and shares it with the
sweeper to avoid divergence.

Tracking: #546
@FilOzzy FilOzzy added this to FOC May 19, 2026
@github-project-automation github-project-automation Bot moved this to 📌 Triage in FOC May 19, 2026
@BigLep BigLep moved this from 📌 Triage to ⌨️ In Progress in FOC May 20, 2026
@SgtPooki SgtPooki moved this from ⌨️ In Progress to 🐱 Todo in FOC May 26, 2026
@BigLep BigLep assigned SgtPooki and unassigned SgtPooki Jun 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 🐱 Todo

Development

Successfully merging this pull request may close these issues.

3 participants