feat(mediorum): bound ops table via dormant cleanup, gap signal, and opt-in retention#304

Open. RolfAris wants to merge 3 commits into OpenAudio:main from RolfAris:feat/mediorum-ops-retention-minimal

RolfAris commented May 21, 2026

The crudr ops table is currently unbounded: every CRUD-tracked write appends a row that lives forever. On scaled tables (~250M rows) it dominates the database. This PR introduces three coordinated mechanisms to bound it safely, plus the correctness invariants that keep them safe under peer divergence.

Mechanisms

  • One-time dormant cleanup. On boot, drop ops rows for tables with no write in the dormant window (default 90d). Idempotent; opt-out via OPENAUDIO_MEDIORUM_KEEP_DORMANT_OPS=true. Runs off the boot path so /health-check stays reachable while a multi-million-row backlog drains.

  • Retention gap signal. When a peer's sweep cursor falls below our lowest available ulid, ServeCrudSweep emits X-Mediorum-Retention-Gap: true and X-Mediorum-Available-Min-Ulid: <min>. The peer's doSweep stages a cursor advance across the gap. Wire format unchanged; older clients ignore the headers.

  • Opt-in per-table retention sweep. When OPENAUDIO_MEDIORUM_OPS_RETENTION_DAYS is set, a managed routine prunes per-table ops older than the configured window, gated by the slowest active peer cursor (with a safety margin). Unset = no deletions.
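
The gap signal described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `gapHeaders`, `stageGapAdvance`, and `exampleGap` are hypothetical names, only the two header names come from the description, and the validation shown is a bare shape check rather than the real plausibility window.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

const (
	gapHeader = "X-Mediorum-Retention-Gap"
	minHeader = "X-Mediorum-Available-Min-Ulid"
)

// gapHeaders is the serving side: signal a gap when the peer's ?after=
// cursor sorts below the lowest ULID we still hold. Same-case ULIDs sort
// lexicographically in time order, so a string compare is enough.
func gapHeaders(after, availableMin string) http.Header {
	h := http.Header{}
	if availableMin != "" && after < availableMin {
		h.Set(gapHeader, "true")
		h.Set(minHeader, availableMin)
	}
	return h
}

// stageGapAdvance is the sweeping side: it returns the cursor to carry
// into the end-of-sweep upsert. It only stages; nothing is persisted here.
func stageGapAdvance(current string, h http.Header) (string, bool) {
	floor := h.Get(minHeader)
	switch {
	case current == "": // first contact: a peer must not choose our initial cursor
		return current, false
	case h.Get(gapHeader) != "true":
		return current, false
	case len(floor) != 26 || floor <= current: // assumed shape check only
		return current, false
	}
	return floor, true
}

// exampleGap runs both sides once with made-up ULID-shaped cursors.
func exampleGap() (staged string, advanced, firstContact bool) {
	old := "01H" + strings.Repeat("0", 23)
	floor := "01J" + strings.Repeat("0", 23)
	h := gapHeaders(old, floor)
	staged, advanced = stageGapAdvance(old, h)
	_, firstContact = stageGapAdvance("", h)
	return staged, advanced, firstContact
}

func main() {
	fmt.Println(exampleGap())
}
```

Because the signal travels in response headers only, a client that never reads them keeps sweeping exactly as before, which is what lets the wire format stay unchanged.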

Correctness invariants

The sweep can be silently disabled or permanently desynchronized in three independent ways. Each is fixed:

  1. Non-peer rows pin the retention floor. The cursors table is shared with workers like qm_fix_truncated, which writes a CID into LastULID. A bare ulid.Parse on every cursor row would pin the floor at the zero ULID indefinitely. computeRetentionCutoff filters to active peer hosts and treats unparseable cursors as missing.

  2. Bad cursors never heal. A persisted LastULID outside the plausible window (far-future or epoch) gets echoed as ?after= on the next sweep and re-upserted at end of sweep, locking the peer into a permanent prune-and-recreate cycle. doSweep treats an implausible LastULID as missing on load; ApplyOp rejects implausible op.ULID at the boundary.

  3. Gap signal stages too eagerly. Honoring the gap header on first contact lets a peer set our initial cursor at any chosen position. Persisting the staged value before the body decodes also loses the gap on partial failure. The gap branch now requires an existing cursor, defers persistence to the end-of-sweep upsert, and counts MarkSweepGapAdvance only after the upsert succeeds and the durable cursor still sits at or above the staged floor.
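
Invariants 1 and 2 can be illustrated with a small, self-contained sketch of a retention cutoff. Everything here is an assumption made for illustration: `retentionCutoff`, `plausible`, `ulidTime`, `ulidAt`, the 2015 lower bound, the 24h clock-skew allowance, and the choice to skip (rather than block on) a peer whose cursor is unusable are not taken from the PR. The ULID timestamp decode follows the standard layout, where the first 10 Crockford base32 characters carry a 48-bit millisecond timestamp.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

const crockford = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

// ulidAt builds a ULID-shaped string for t with zeroed randomness (example helper only).
func ulidAt(t time.Time) string {
	ms := uint64(t.UnixMilli())
	b := []byte(strings.Repeat("0", 26))
	for i := 9; i >= 0; i-- {
		b[i] = crockford[ms&31]
		ms >>= 5
	}
	return string(b)
}

// ulidTime decodes the millisecond timestamp from a 26-character uppercase ULID.
func ulidTime(s string) (time.Time, bool) {
	if len(s) != 26 || s[0] > '7' { // a first character above '7' overflows 48 bits
		return time.Time{}, false
	}
	var ms uint64
	for i := 0; i < 26; i++ {
		v := strings.IndexByte(crockford, s[i])
		if v < 0 {
			return time.Time{}, false
		}
		if i < 10 {
			ms = ms<<5 | uint64(v)
		}
	}
	return time.UnixMilli(int64(ms)), true
}

// plausible reports whether a cursor timestamp sits inside an assumed sanity window.
func plausible(t, now time.Time) bool {
	earliest := time.Date(2015, 1, 1, 0, 0, 0, 0, time.UTC) // assumed lower bound
	return t.After(earliest) && t.Before(now.Add(24*time.Hour))
}

// retentionCutoff returns the newest timestamp that may be pruned, or false
// when no active peer has a usable cursor (in which case nothing is pruned).
func retentionCutoff(cursors map[string]string, activePeers []string, now time.Time, retention, margin time.Duration) (time.Time, bool) {
	var slowest time.Time
	found := false
	for _, host := range activePeers { // rows owned by non-peer workers are never read
		t, ok := ulidTime(cursors[host])
		if !ok || !plausible(t, now) {
			continue // assumption: unparseable or implausible means "missing"
		}
		if !found || t.Before(slowest) {
			slowest, found = t, true
		}
	}
	if !found {
		return time.Time{}, false
	}
	cutoff := now.Add(-retention)
	if floor := slowest.Add(-margin); floor.Before(cutoff) {
		cutoff = floor // never prune past the slowest peer, minus the margin
	}
	return cutoff, true
}

// exampleCutoff mixes two peer cursors with a CID-shaped worker row.
func exampleCutoff() (string, bool) {
	now := time.Date(2026, 5, 21, 0, 0, 0, 0, time.UTC)
	cursors := map[string]string{
		"https://peer-a.example": ulidAt(now.Add(-2 * time.Hour)),
		"https://peer-b.example": ulidAt(now.Add(-48 * time.Hour)),
		"qm_fix_truncated":       "QmNotAULIDButACID",
	}
	peers := []string{"https://peer-a.example", "https://peer-b.example"}
	cutoff, ok := retentionCutoff(cursors, peers, now, 24*time.Hour, time.Hour)
	return cutoff.UTC().Format(time.RFC3339), ok
}

func main() {
	fmt.Println(exampleCutoff())
}
```

In this example the slowest peer is 48 hours behind, so the cutoff lands at 49 hours back instead of the 24-hour retention window, and the CID-shaped `qm_fix_truncated` row has no effect because it is not in the active peer list.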

Additional correctness

  • ServeCrudSweep snapshot. MIN(ulid) and the body Find now run inside a REPEATABLE READ read-only transaction so a retention DELETE that commits between the two reads cannot leave us serving a row the gap header doesn't cover.

  • EnsureOpsTableIndex pinned-conn lock + invalid-index self-heal. The composite ops("table", ulid) index can take 30-60 min to build on scaled tables. The advisory lock is taken on a pinned *sql.Conn because a session-level advisory lock belongs to a single connection, and gorm.WithContext returns a session over the shared pool, so the unlock could run on a different connection than the one holding the lock. The invalid-leftover probe + DROP handles the recovery case where a prior process died mid-CREATE INDEX CONCURRENTLY; without it, IF NOT EXISTS short-circuits on the INVALID index name forever and retention silently degrades to seq-scans.

  • Self-host normalization. server.New lowercases and trailing-slash-strips config.Self.Host before the peer-vs-self comparison so a chain-registry drift cannot land self in peerHosts and block all retention via the empty-peer sentinel.

  • Panic recovery on managed routines. lifecycle.AddManagedRoutine has no built-in recover; the dormant cleanup, ops-index ensure, and retention sweep are all wrapped in defer recover so a panic in any one cannot crash mediorum.
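
The last two items are small enough to sketch directly. The helper names below (`normalizeHost`, `peersExcludingSelf`, `safely`) are hypothetical and the shapes are assumptions; the PR's actual code in server.New and around lifecycle.AddManagedRoutine may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeHost lowercases a host and strips trailing slashes so that
// "https://Node.Example/" and "https://node.example" compare equal.
func normalizeHost(h string) string {
	return strings.TrimRight(strings.ToLower(h), "/")
}

// peersExcludingSelf returns every registered host that is not self,
// comparing normalized forms on both sides.
func peersExcludingSelf(self string, registered []string) []string {
	self = normalizeHost(self)
	var peers []string
	for _, h := range registered {
		if n := normalizeHost(h); n != self {
			peers = append(peers, n)
		}
	}
	return peers
}

// safely runs fn and turns a panic into an error, so one failing
// background routine cannot take the whole process down.
func safely(name string, fn func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%s panicked: %v", name, r)
		}
	}()
	return fn()
}

func main() {
	registered := []string{"https://Node-A.example/", "https://node-b.example"}
	fmt.Println(peersExcludingSelf("https://node-a.example", registered))

	err := safely("retention sweep", func() error { panic("boom") })
	fmt.Println(err)
}
```

Without the normalization step, the first registered host above would not match self and would be counted as a peer that never reports a cursor.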

Tests

pkg/mediorum/crudr/retention_test.go covers:

  • Dormant cleanup: per-table, dormancy threshold, opt-out, dry-run.
  • Gap signal: header emit, validation, first-contact rejection, hostile far-future ulid, garbage ulid.
  • Retention sweep: cursor floor, safety margin, ancient cursor, empty cursor, malformed cursor, concurrent sweep + delete.
  • TestRetentionTick_NonPeerCursorRowsIgnored covers the qm_fix_truncated-shaped row that motivated the activePeers filter.

go test -count=1 ./pkg/mediorum/crudr/ green.

Risk / rollout

  • Correctness-additive; no new env flags are required (both OPENAUDIO_MEDIORUM_OPS_RETENTION_DAYS and OPENAUDIO_MEDIORUM_KEEP_DORMANT_OPS default to no-op).
  • The activePeers filter is the most behavior-shifting change: a fleet that relied on a non-peer cursor row to block retention would stop seeing that block. No such use case in mediorum today; that path is the bug, not a contract.
  • REPEATABLE READ on ServeCrudSweep is read-only and short. No deadlock risk against the retention DELETE path.
  • The pinned-conn advisory lock is session-scoped: a concurrent boot gets false back from pg_try_advisory_lock and skips the index ensure (no-op) rather than blocking or racing.

Supersedes #277.

RolfAris force-pushed the feat/mediorum-ops-retention-minimal branch from afe1a61 to 6e59e39 on May 21, 2026 02:46
RolfAris changed the title from "feat(mediorum/crudr): tighten ops retention cursor invariants" to "feat(mediorum): bound ops table via dormant cleanup, gap signal, and opt-in retention" on May 21, 2026
@RolfAris
Contributor Author

Rollout note from live fleet readiness on 2026-05-21:

  • Pushed b596358 to gate EnsureOpsTableIndex behind OPENAUDIO_MEDIORUM_ENSURE_OPS_TABLE_INDEX=true (default off). This avoids a canary boot attempting a production-scale composite index build by default.
  • val005 is the better official canary candidate by current headroom, but still only has about 4 GiB free on /.
  • val005 ops is ~89.7 GiB total; existing ops_pkey alone is ~5.8 GiB. Building ops("table", ulid) on that disk budget is not safe without first freeing/adding space.
  • Built a local amd64 candidate tarball for the gated commit: ghcr.io/rolfaris/go-openaudio:pr304-mediorum-ops-retention-b596358-local@sha256:dd8d4507d745ab5c840bbf53da590b1457b53246b197de6f1299a33fbfab60ba; tarball SHA256 c1374930e8c82994b7af4bc2c312622708804d17246fd14b55c014828708b81c.
  • Read-only preflight on val005 is blocked only by candidate_image_not_loaded_locally; runtime/compose/external health are otherwise OK.

Recommendation: canary #304 with the index gate left off first. Only enable OPENAUDIO_MEDIORUM_ENSURE_OPS_TABLE_INDEX=true after a separate disk-headroom step or an explicit index-build window.

@RolfAris
Contributor Author

Canary update from val005:

  • b596358 did start and pass HTTP health, but it was not safe to keep running: with the index build gated off, the default dormant cleanup still launched SELECT MAX(ulid) FROM ops WHERE "table" = $1 on val005's ~90 GB ops relation. pg_stat_activity showed it active for >5 min on IO/DataFileRead, so I rolled it back.
  • Rollback exposed that PR304 has a forward-only DB migration from the baseline runtime's perspective: v1.2.13 fails core startup with unknown migration in database after PR304 has recorded 00034_drop_redundant_core_tx_stats_tx_hash_index.sql. So post-canary rollback needs a PR304-compatible image or a DB migration cleanup plan; baseline v1.2.13 is not a clean rollback target after this migration runs.
  • I pushed 35e3045, which makes dormant cleanup opt-in as well as keeping the composite index build opt-in. val005 is now running ghcr.io/rolfaris/go-openaudio@sha256:f06edada8e4d914aa1d6b864edbc3d7822e5b2074e82e471868bde2a495835a7.
  • Verification after roll-forward: val005 check-fleet passes with http=200, live=True, storage=True; full fleet check-fleet --forks=20 completed successfully; logs show ops table index ensure disabled, dormant ops cleanup disabled, and ops retention disabled; pg_stat_activity shows no active long-running ops scan.

Implication: PR304 is not a physical disk-space reclamation fix by itself. It is now safe enough to continue canary observation with the expensive ops cleanup paths default-off, but actual filesystem space reclaim still needs a separate controlled ops physical reclaim plan.

@RolfAris
Contributor Author

Final val005 canary/fold result from the 20-validator fleet, 2026-05-22 UTC:

  • 35e3045 stayed healthy with the expensive paths default-off: external health 200, storage healthy, version=1.3.0, no restarts, and no long-running ops scan.
  • No reclaim happened while default-off. The only observed ops index was ops_pkey; no retention/dormant cleanup env gates were enabled. A short sample showed +4,178 ops rows and +3,112,960 relation bytes over ~374.6s, roughly 11.15 rows/s and 0.72 GB/day in that window.
  • I folded val005 back to official openaudio/go-openaudio:v1.3.0@sha256:af19d4fdeb135ccec3aae6fbcac9496f02308064a3099ed843adab9318d5ec34 before reclaim.
  • The final physical reclaim used the operator-side guarded cursor-seeded reset, not PR304 retention. Fresh keyset proof showed val001 was not the right reference because uploads differed; val009 matched val005 exactly for uploads, audio_previews, and qm_audio_analyses, so the reset used val009 and failed closed on that match.
  • val005 disk moved from /dev/sda1 used=202,032,758,784B available=4,850,745,344B use=98% to used=104,683,884,544B available=102,199,619,584B use=51% immediately after reset. Removing unused PR304 and old v1.2.13 images moved it to used=101,878,546,432B available=105,004,957,696B use=50%.
  • ops was 97,349,885,952B before the reset, 16,384B after truncate inside the transaction, and 64MB in the post-reset fleet probe after new rows resumed.
  • Final verification: image drift clean on 20/20, full check-fleet.yml --forks=20 green, targeted val005 checks green after Docker cleanup, manual controller health wrapper green, cron-health wrapper green.
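
The growth figures in the second bullet follow from the raw sample by simple division. A minimal check, assuming decimal gigabytes (1 GB = 1e9 bytes) and a hypothetical helper named `growth`:

```go
package main

import "fmt"

// growth converts one sample (rows and bytes added over a window of
// seconds) into rows per second and decimal gigabytes per day.
func growth(rows, bytes int64, seconds float64) (rowsPerSec, gbPerDay float64) {
	rowsPerSec = float64(rows) / seconds
	gbPerDay = float64(bytes) / seconds * 86400 / 1e9
	return rowsPerSec, gbPerDay
}

func main() {
	r, g := growth(4178, 3112960, 374.6)
	fmt.Printf("%.2f rows/s, %.2f GB/day\n", r, g) // prints: 11.15 rows/s, 0.72 GB/day
}
```

A window of about six minutes is short, so this is a point estimate of the growth rate rather than a daily average.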

Operator implication: this PR is a useful safety/observability building block when the expensive paths are default-off, but it is not disk relief by itself. For production operators, the next upstream bar is either proven bounded growth or actual physical reclaim semantics. DELETE-only retention is not enough unless paired with a story that returns filesystem space; partition detach/drop, a safe table swap, or snapshot/bootstrap plus partitioned retention are the shapes that would close the operator problem.
