Summary
@lde/distribution-monitor appends to the observations table on every probe (every monitor × every poll cycle) and never deletes anything, so the table grows without bound.
Context
This surfaced while debugging the Network of Terms status DB: observations was already at ~2M rows (a few hundred MB) on a slow (HDD) volume. At the time it was compounding a separate problem — the latest_observations materialized view refresh scanned the whole table every cycle and timed out.
That refresh is now gone (latest is maintained by an upsert as of #493), so table size no longer affects status freshness or correctness. This issue is therefore pure hygiene, not a live incident:
- Unbounded disk growth on a fixed-size PVC (the status DB volume is 5Gi).
- Ever-larger backups / slower full-history queries over time.
Proposal
Add configurable retention to the monitor: periodically delete observations older than a cutoff, so that the observations table remains a bounded time series.
Sketch:
- A retentionDays (or retentionInterval) config option; unset = keep forever (current behaviour, backwards compatible).
- A periodic prune, e.g. DELETE FROM observations WHERE observed_at < now() - $retention, run on a cron alongside polling (or piggy-backed on the poll cycle). latest_observations is unaffected: it holds one row per monitor independently of how much history is pruned.
- Consider batched deletes (LIMIT in a loop) so a first prune of a large backlog doesn't lock or bloat, and so it plays nicely with autovacuum.
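A rough sketch of how the three points above could fit together, not a proposed final implementation. It assumes the status DB is PostgreSQL (implied by the materialized view and autovacuum). The Queryable interface, the RetentionConfig shape, the batchSize option and both function names are hypothetical and exist only for illustration. One detail worth noting: PostgreSQL's DELETE has no LIMIT clause, so each batch picks a bounded set of rows in a subquery first.

```typescript
// Hypothetical minimal database interface (an assumption, not the monitor's
// real API); it is loosely modelled on a promise-based query(text, values) call.
interface Queryable {
  query(text: string, values: unknown[]): Promise<{ rowCount: number | null }>;
}

// Hypothetical config shape: leaving retentionDays unset keeps everything,
// which matches the current behaviour.
interface RetentionConfig {
  retentionDays?: number;
  batchSize?: number; // assumed extra knob, not part of the proposal text
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the cutoff timestamp, or null when retention is disabled.
function retentionCutoff(config: RetentionConfig, now: Date): Date | null {
  if (config.retentionDays === undefined) return null;
  if (!(config.retentionDays > 0)) {
    throw new RangeError('retentionDays must be a positive number');
  }
  return new Date(now.getTime() - config.retentionDays * MS_PER_DAY);
}

// DELETE has no LIMIT in PostgreSQL, so a bounded set of row identifiers
// (ctid) is selected first and only those rows are deleted.
const PRUNE_BATCH_SQL = `
  DELETE FROM observations
  WHERE ctid IN (
    SELECT ctid FROM observations
    WHERE observed_at < $1
    LIMIT $2
  )`;

// Deletes expired observations one batch (one statement) at a time and
// returns the total number of rows removed.
async function pruneObservations(
  db: Queryable,
  config: RetentionConfig,
  now: Date = new Date(),
): Promise<number> {
  const cutoff = retentionCutoff(config, now);
  if (cutoff === null) return 0; // retention disabled: keep forever

  const batchSize = config.batchSize ?? 10000;
  let total = 0;
  for (;;) {
    const result = await db.query(PRUNE_BATCH_SQL, [cutoff, batchSize]);
    const deleted = result.rowCount ?? 0;
    total += deleted;
    // A short batch means nothing older than the cutoff is left.
    if (deleted < batchSize) return total;
  }
}
```

Because each batch is its own statement, a large first prune never holds one long-running transaction, and the loop could sleep between batches if I/O on the HDD volume turns out to be a concern. latest_observations is never touched.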
Open questions
- Default retention when enabled (30 / 90 days?), and whether it should be opt-in or have a sane default.
- Whether to expose it via the existing config schema in network-of-terms status as well.
Notes
Not urgent — flagged here so it isn't lost. Related work: #492 (boot index lock), #493 (latest-via-upsert).