feat(mediorum): bound ops table via dormant cleanup, gap signal, and opt-in retention #304
RolfAris wants to merge 3 commits
The crudr ops table is currently unbounded: every CRUD-tracked write
appends a row that lives forever. On scaled tables (~250M rows) it
dominates the database. This PR introduces three coordinated
mechanisms to bound it safely, plus the correctness invariants that
keep them safe under peer divergence.
# Mechanisms
* One-time dormant cleanup. On boot, drop ops rows for tables with
no write in the dormant window (default 90d). Idempotent; opt-out
via OPENAUDIO_MEDIORUM_KEEP_DORMANT_OPS=true. Runs off the boot
path so /health-check stays reachable while a multi-million-row
backlog drains.
* Retention gap signal. When a peer's sweep cursor falls below our
lowest available ulid, ServeCrudSweep emits
X-Mediorum-Retention-Gap=true and X-Mediorum-Available-Min-Ulid=<min>.
The peer's doSweep stages a cursor advance across the gap. Wire
format unchanged; older clients ignore the headers.
* Opt-in per-table retention sweep. When
OPENAUDIO_MEDIORUM_OPS_RETENTION_DAYS is set, a managed routine
prunes per-table ops older than the configured window, gated by
the slowest active peer cursor (with a safety margin). Unset = no
deletions.
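As an illustration of the gap signal, here is a minimal sketch of the serving side. It is not the ServeCrudSweep implementation: gapHeaders, sweepHandler, and the /sweep path are hypothetical names, and it assumes cursors are canonical uppercase 26-character ULIDs, so that string order matches time order.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Header names as given in the description above.
const (
	headerRetentionGap = "X-Mediorum-Retention-Gap"
	headerAvailableMin = "X-Mediorum-Available-Min-Ulid"
)

// gapHeaders is a hypothetical helper. It returns the two gap headers when
// the peer's cursor sorts below the oldest ULID still held, and nil
// otherwise. Canonical ULIDs are fixed-length and sort lexicographically
// by time, so a plain string comparison is assumed to be sufficient.
func gapHeaders(after, minAvailable string) map[string]string {
	if minAvailable == "" || after >= minAvailable {
		return nil // empty table, or the peer is already inside our window
	}
	return map[string]string{
		headerRetentionGap: "true",
		headerAvailableMin: minAvailable,
	}
}

// sweepHandler stands in for ServeCrudSweep; the ops body is omitted.
func sweepHandler(minAvailable string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		for k, v := range gapHeaders(r.URL.Query().Get("after"), minAvailable) {
			w.Header().Set(k, v)
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	h := sweepHandler("01HZZZZZZZ0000000000000000")
	req := httptest.NewRequest(http.MethodGet, "/sweep?after=01HAAAAAAA0000000000000000", nil)
	rec := httptest.NewRecorder()
	h(rec, req)
	fmt.Println(rec.Header().Get(headerRetentionGap), rec.Header().Get(headerAvailableMin))
}
```

In this sketch an empty ?after= also sorts below the minimum, so the headers are emitted on first contact too. That is harmless only if the receiving side declines to act on them without an existing cursor.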
# Correctness invariants
The sweep can be silently disabled or permanently desynchronized in
three independent ways. Each is fixed:
1. Non-peer rows pin the retention floor. The cursors table is
shared with workers like qm_fix_truncated, which writes a CID
into LastULID. A bare ulid.Parse on every cursor row would pin
the floor at the zero ULID indefinitely. computeRetentionCutoff
filters to active peer hosts and treats unparseable cursors as
missing.
2. Bad cursors heal nowhere. A persisted LastULID outside the
plausible window (far-future or epoch) gets echoed as ?after=
and re-upserted at end of sweep, locking the peer into a
permanent prune-and-recreate cycle. doSweep treats an
implausible LastULID as missing on load; ApplyOp rejects
implausible op.ULID at the boundary.
3. Gap signal stages too eagerly. Honoring the gap header on first
contact lets a peer set our initial cursor at any chosen
position. Persisting the staged value before the body decodes
also loses the gap on partial failure. The gap branch now
requires an existing cursor, defers persistence to the
end-of-sweep upsert, and counts MarkSweepGapAdvance only after
the upsert succeeds and the durable cursor still sits at or
above the staged floor.
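The first two invariants can be sketched together: decode the ULID timestamp, reject values outside a plausible window, and derive the floor from active peers only. This is a sketch under stated assumptions, not the computeRetentionCutoff code. The names ulidTime, plausible, and retentionFloor are hypothetical, the window bounds are placeholders, the safety margin is left out, and only uppercase canonical ULIDs are accepted. What "missing" means for the cutoff is not spelled out above; the sketch takes the conservative reading and reports no usable floor.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// crockford is the ULID alphabet (Crockford base32: no I, L, O, U).
const crockford = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

// ulidTime decodes the 48-bit millisecond timestamp held in the first 10
// characters of a canonical 26-character ULID. ok is false for anything
// that is not well-formed, for example a CID written by a worker.
func ulidTime(s string) (t time.Time, ok bool) {
	if len(s) != 26 || s[0] > '7' {
		return time.Time{}, false
	}
	var ms int64
	for i := 0; i < len(s); i++ {
		v := strings.IndexByte(crockford, s[i])
		if v < 0 {
			return time.Time{}, false
		}
		if i < 10 {
			ms = ms<<5 | int64(v)
		}
	}
	return time.UnixMilli(ms), true
}

// plausible reports whether a ULID's timestamp lies inside an assumed
// window: not before floor and not more than skew past now.
func plausible(s string, floor, now time.Time, skew time.Duration) bool {
	t, ok := ulidTime(s)
	return ok && !t.Before(floor) && !t.After(now.Add(skew))
}

// retentionFloor returns the lowest plausible cursor among active peers.
// Rows keyed by anything else (such as a worker name) are never looked up.
// Assumption: an active peer without a usable cursor yields ok == false,
// and the caller then skips deletion.
func retentionFloor(cursors map[string]string, activePeers []string,
	floor, now time.Time, skew time.Duration) (lowest string, ok bool) {
	for _, host := range activePeers {
		c := cursors[host]
		if !plausible(c, floor, now, skew) {
			return "", false
		}
		if lowest == "" || c < lowest {
			lowest = c
		}
	}
	return lowest, lowest != ""
}

func main() {
	peerCursor := "01HZZZZZZZ0000000000000000"
	t, _ := ulidTime(peerCursor)
	now := t.Add(time.Hour)
	floor := now.AddDate(-5, 0, 0) // placeholder lower bound
	cursors := map[string]string{
		"https://peer-a.example": peerCursor,
		"qm_fix_truncated":       "QmExampleCid", // non-peer row
	}
	lowest, ok := retentionFloor(cursors, []string{"https://peer-a.example"}, floor, now, time.Minute)
	fmt.Println(lowest, ok)
}
```

Looking cursors up by active peer host, rather than iterating every row, is what keeps a worker's CID from ever reaching the parser in this sketch.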
# Additional correctness
* ServeCrudSweep snapshot. MIN(ulid) and the body Find run in a
REPEATABLE READ read-only transaction so a retention DELETE that
commits between the two reads cannot leave us serving a row the
gap header doesn't cover.
* EnsureOpsTableIndex pinned-conn lock + invalid-index self-heal.
The composite ops("table", ulid) index can take 30-60 min to
build on scaled tables. The advisory lock is on a pinned
*sql.Conn because gorm.WithContext returns a session over the
shared pool that would release the lock on a different
connection. The invalid-leftover probe + DROP handles the
recovery case where a prior process died mid-CREATE INDEX
CONCURRENTLY; without it, IF NOT EXISTS short-circuits on the
INVALID index name forever and retention silently degrades to
seq-scans.
* Self-host normalization. server.New lowercases and
trailing-slash-strips config.Self.Host before the peer-vs-self
comparison so a chain-registry drift cannot land self in
peerHosts and block all retention via the empty-peer sentinel.
* Panic recovery on managed routines.
lifecycle.AddManagedRoutine has no built-in recover; the dormant
cleanup, ops-index ensure, and retention sweep are wrapped in
defer recover so a panic in any one cannot crash mediorum.
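The panic-recovery wrapping can be sketched as follows. The withRecover helper and the func() error routine shape are assumptions made for illustration; they are not the signature of lifecycle.AddManagedRoutine.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// withRecover wraps a routine so that a panic is logged and returned as an
// error instead of unwinding the goroutine and taking the process down.
// The func() error shape is an assumption, not AddManagedRoutine's signature.
func withRecover(name string, fn func() error) func() error {
	return func() (err error) {
		defer func() {
			if r := recover(); r != nil {
				log.Printf("managed routine %q panicked: %v", name, r)
				err = fmt.Errorf("%s: recovered from panic: %v", name, r)
			}
		}()
		return fn()
	}
}

func main() {
	// A routine that panics: the wrapper turns the panic into an error.
	sweep := withRecover("retention-sweep", func() error {
		panic("unexpected nil cursor")
	})
	fmt.Println("sweep:", sweep())

	// A routine that fails normally: its error passes through unchanged.
	cleanup := withRecover("dormant-cleanup", func() error {
		return errors.New("db unavailable")
	})
	fmt.Println("cleanup:", cleanup())
}
```

The named return value is what lets the deferred function replace the result after recover; without it the caller would see a nil error for a routine that panicked.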
# Tests
pkg/mediorum/crudr/retention_test.go covers dormant cleanup
(per-table, threshold, opt-out, dry-run), gap signal (emit,
validation, first-contact rejection, hostile far-future ulid,
garbage ulid), and retention sweep (cursor floor, safety margin,
ancient cursor, empty cursor, malformed cursor, concurrent sweep
+ delete). TestRetentionTick_NonPeerCursorRowsIgnored covers the
qm_fix_truncated-shaped row that motivated the activePeers filter.
go test -count=1 ./pkg/mediorum/crudr/ is green.
Rollout note from live fleet readiness on 2026-05-21:
Recommendation: canary #304 with the index gate left off first. Only enable
Canary update from val005:
Implication: PR304 is not a physical disk-space reclamation fix by itself. It is now safe enough to continue canary observation with the expensive ops cleanup paths default-off, but actual filesystem space reclaim still needs a separate controlled
Final val005 canary/fold result from the 20-validator fleet, 2026-05-22 UTC:
Operator implication: this PR is a useful safety/observability building block when the expensive paths are default-off, but it is not disk relief by itself. For production operators, the next upstream bar is either proven bounded growth or actual physical reclaim semantics. DELETE-only retention is not enough unless paired with a story that returns filesystem space; partition detach/drop, a safe table swap, or snapshot/bootstrap plus partitioned retention are the shapes that would close the operator problem.
The crudr ops table is currently unbounded: every CRUD-tracked write appends a row that lives forever. On scaled tables (~250M rows) it dominates the database. This PR introduces three coordinated mechanisms to bound it safely, plus the correctness invariants that keep them safe under peer divergence.
# Mechanisms
* One-time dormant cleanup. On boot, drop ops rows for tables with no write in the dormant window (default 90d). Idempotent; opt-out via OPENAUDIO_MEDIORUM_KEEP_DORMANT_OPS=true. Runs off the boot path so /health-check stays reachable while a multi-million-row backlog drains.
* Retention gap signal. When a peer's sweep cursor falls below our lowest available ulid, ServeCrudSweep emits X-Mediorum-Retention-Gap: true and X-Mediorum-Available-Min-Ulid: <min>. The peer's doSweep stages a cursor advance across the gap. Wire format unchanged; older clients ignore the headers.
* Opt-in per-table retention sweep. When OPENAUDIO_MEDIORUM_OPS_RETENTION_DAYS is set, a managed routine prunes per-table ops older than the configured window, gated by the slowest active peer cursor (with a safety margin). Unset = no deletions.
# Correctness invariants
The sweep can be silently disabled or permanently desynchronized in three independent ways. Each is fixed:
1. Non-peer rows pin the retention floor. The cursors table is shared with workers like qm_fix_truncated, which writes a CID into LastULID. A bare ulid.Parse on every cursor row would pin the floor at the zero ULID indefinitely. computeRetentionCutoff filters to active peer hosts and treats unparseable cursors as missing.
2. Bad cursors heal nowhere. A persisted LastULID outside the plausible window (far-future or epoch) gets echoed as ?after= on the next sweep and re-upserted at end of sweep, locking the peer into a permanent prune-and-recreate cycle. doSweep treats an implausible LastULID as missing on load; ApplyOp rejects implausible op.ULID at the boundary.
3. Gap signal stages too eagerly. Honoring the gap header on first contact lets a peer set our initial cursor at any chosen position. Persisting the staged value before the body decodes also loses the gap on partial failure. The gap branch now requires an existing cursor, defers persistence to the end-of-sweep upsert, and counts MarkSweepGapAdvance only after the upsert succeeds and the durable cursor still sits at or above the staged floor.
# Additional correctness
* ServeCrudSweep snapshot. MIN(ulid) and the body Find now run inside a REPEATABLE READ read-only transaction so a retention DELETE that commits between the two reads cannot leave us serving a row the gap header doesn't cover.
* EnsureOpsTableIndex pinned-conn lock + invalid-index self-heal. The composite ops("table", ulid) index can take 30-60 min to build on scaled tables. The advisory lock is on a pinned *sql.Conn because gorm.WithContext returns a session over the shared pool that would release the lock on a different connection. The invalid-leftover probe + DROP handles the recovery case where a prior process died mid-CREATE INDEX CONCURRENTLY; without it, IF NOT EXISTS short-circuits on the INVALID index name forever and retention silently degrades to seq-scans.
* Self-host normalization. server.New lowercases and trailing-slash-strips config.Self.Host before the peer-vs-self comparison so a chain-registry drift cannot land self in peerHosts and block all retention via the empty-peer sentinel.
* Panic recovery on managed routines. lifecycle.AddManagedRoutine has no built-in recover; the dormant cleanup, ops-index ensure, and retention sweep are all wrapped in defer recover so a panic in any one cannot crash mediorum.
# Tests
pkg/mediorum/crudr/retention_test.go covers:
* Dormant cleanup (per-table, threshold, opt-out, dry-run)
* Gap signal (emit, validation, first-contact rejection, hostile far-future ulid, garbage ulid)
* Retention sweep (cursor floor, safety margin, ancient cursor, empty cursor, malformed cursor, concurrent sweep + delete)
TestRetentionTick_NonPeerCursorRowsIgnored covers the qm_fix_truncated-shaped row that motivated the activePeers filter.
go test -count=1 ./pkg/mediorum/crudr/ green.
# Risk / rollout
* OPENAUDIO_MEDIORUM_OPS_RETENTION_DAYS and OPENAUDIO_MEDIORUM_KEEP_DORMANT_OPS default to no-op).
* activePeers filter is the most behavior-shifting change: a fleet that relied on a non-peer cursor row to block retention would stop seeing that block. No such use case in mediorum today; that path is the bug, not a contract.
* ServeCrudSweep is read-only and short. No deadlock risk against the retention DELETE path.
* pg_try_advisory_lock returning false (no-op) rather than racing.
Supersedes #277.