You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Audit of the app surfaced a cluster of efficiency issues. The biggest is SSH connection churn: SSHKit.run/2 opens and closes a fresh SSH connection on every command (verified in deps/sshkit/lib/sshkit.ex:313-318 — connect → run → close), and the wrapper docstring claiming per-process connection caching is factually wrong. ConnectionCheck.probe/1 pays this 6x per server per tick.
We want to fix these in one clean slice, with a benchmark task so we can measure before/after.
Verification harness (build this first)
Add a one-off mix task so we can measure without waiting on ticks:
Runs the chosen worker's probe path against one real server N times (default ~10).
Prints SSH call count and min / median / max wall time per worker.
Defaults to running all SSH-bound workers so one invocation gives a full picture.
Lives under lib/mix/tasks/mast.bench.ex, dev/test env only, no Oban enqueue — call the probe functions directly so we measure the SSH path, not the queue.
Expected baseline (what we're fixing):
ConnectionCheck: 6 calls, median ~4s
StatsCollect: 1 call, median ~0.9s # already correct, use as the model
PatchScan: 2 calls
Acceptance: after the fixes below, ConnectionCheck reports 1 call and a markedly lower median; rerun the task and paste output into the PR.
probe/1 issues 6 separate SSH.run/2 calls (os-release, top, free, df, loadavg, nproc). Each is a full TCP+SSH handshake + DB key load + key decryption. StatsCollect already solves this: one command with echo __MARKER__ delimiters, parsed by section (lib/mast/workers/stats_collect.ex:29-36). Mirror that pattern here.
RED: stub-backed test asserting probe/1 issues exactly oneSSH.run. (Currently 6.)
GREEN: combined heredoc command with markers; split output, route each section to the existing OS.parse / Metrics.parse_* calls.
Keep the with-style error handling: a non-zero exit on the combined command still marks the server unreachable.
2. Correct the SSHKit docstring — HIGH (do alongside #1)
lib/mast/ssh/sshkit.ex:6-9
The "Connection caching: SSHKit opens a connection on first use within the calling process... yields connection reuse" claim is false. SSHKit.run/2 connects and closes per call. Rewrite to state that each run/2 is a fresh connection, so callers must batch multiple commands into a single command string (and point at StatsCollect as the reference). This misled the audit itself.
3. PatchScan: 2 connections → 1 — MED
lib/mast/workers/patch_scan.ex:88-89
apt-get update and apt list --upgradable run as two SSH.run/2 calls. Collapse to apt-get update && apt list --upgradable (preserve LANG=C and the sudo -n prefix per the security guardrails). One connection.
4. DashboardLive: memoize derived data out of render/1 — HIGH (cheap)
lib/mast_web/live/dashboard_live.ex:156-157
render/1 recomputes filtered/2 + fleet_stats/2 (full list scans) on every render, including diffs unrelated to the server list. Compute :visible_servers and :stats in assigns, refreshed only when :servers or :filter changes (mount + the relevant handle_* clauses).
5. ServerLive: stop over-fetching on PubSub — HIGH
lib/mast_web/live/server_live.ex:74-87, 132-133
handle_info(:server_updated, ...) reloads audit activity (:132) and re-runs assign_series/Fleet.list_stats (:133) on every status/metrics broadcast — neither of which a metrics update changes. Gate the stats refetch to range changes / stat-collection events, and only reload activity on explicit navigation or audit-affecting events.
6. Release handle lookups: filter in SQL → split to #28 — MOVED
This item needed a migration (a Postgres generated column + unique index) to be done correctly, and the original "replace the advisory taken?/3 check" framing was wrong — taken?/3 is kept as the UX layer, the DB constraint is added as the race backstop. Settled in a /domain-plan session (ADR 0008 amended, CONTEXT.md gains an Effective Handle term). Tracked separately in #28. It is out of this clean slice.
7. Apps.upsert_from_probe: duplicate query — MED
lib/mast/apps.ex:54, 94
Calls list_for_release/1 twice per probe (diff, then return). Reuse the post-write in-memory list instead of re-querying.
Lower priority (note here, fix opportunistically)
ReleaseLive log_buffer (lib/mast_web/live/release_live.ex:240-269) accumulates streamed lines in a regular assign — diffed per line. Use stream/temporary_assigns. — MED
AuditLive re-presents the full event list each render (lib/mast_web/live/audit_live.ex:99). — MED
StatsDownsample.process_transition/1 (lib/mast/workers/stats_downsample.ex:46-57) groups all source rows in memory; move to SQL aggregation as fleet grows. — MED
Oban unique: [period: 50-55] is shorter than the trigger window in connection_check.ex:22, stats_collect.ex:20, app_probe.ex:14 — a job running longer than the period can enqueue a duplicate. Set period ≥ window. — LOW
Ticker default 30s (config/dev.exs:25) fans out into checks: 5 queue (config/config.exs:22); fixing Inline 'paste new key' form in Add System modal #1 (6→1 connections) ~6x's the headroom. Confirm prod has no 30s override. — LOW
Done =
mix mast.bench task exists and reports SSH call count + timing per worker
ConnectionCheck issues 1 SSH call (test-asserted); benchmark shows the drop
PatchScan issues 1 SSH call
SSHKit docstring corrected
DashboardLive + ServerLive render/refetch fixes
Release SQL-filter items resolved or split out (see open question)
Summary
Audit of the app surfaced a cluster of efficiency issues. The biggest is SSH connection churn:
SSHKit.run/2opens and closes a fresh SSH connection on every command (verified indeps/sshkit/lib/sshkit.ex:313-318—connect → run → close), and the wrapper docstring claiming per-process connection caching is factually wrong.ConnectionCheck.probe/1pays this 6x per server per tick.We want to fix these in one clean slice, with a benchmark task so we can measure before/after.
Verification harness (build this first)
Add a one-off mix task so we can measure without waiting on ticks:
lib/mix/tasks/mast.bench.ex, dev/test env only, no Oban enqueue — call the probe functions directly so we measure the SSH path, not the queue.Expected baseline (what we're fixing):
Acceptance: after the fixes below,
ConnectionCheckreports 1 call and a markedly lower median; rerun the task and paste output into the PR.Fixes (priority order)
1. ConnectionCheck: 6 SSH connections → 1 combined command — HIGH
lib/mast/workers/connection_check.ex:56-62probe/1issues 6 separateSSH.run/2calls (os-release, top, free, df, loadavg, nproc). Each is a full TCP+SSH handshake + DB key load + key decryption.StatsCollectalready solves this: one command withecho __MARKER__delimiters, parsed by section (lib/mast/workers/stats_collect.ex:29-36). Mirror that pattern here.probe/1issues exactly oneSSH.run. (Currently 6.)OS.parse/Metrics.parse_*calls.with-style error handling: a non-zero exit on the combined command still marks the server unreachable.2. Correct the SSHKit docstring — HIGH (do alongside #1)
lib/mast/ssh/sshkit.ex:6-9The "Connection caching: SSHKit opens a connection on first use within the calling process... yields connection reuse" claim is false.
SSHKit.run/2connects and closes per call. Rewrite to state that eachrun/2is a fresh connection, so callers must batch multiple commands into a single command string (and point at StatsCollect as the reference). This misled the audit itself.3. PatchScan: 2 connections → 1 — MED
lib/mast/workers/patch_scan.ex:88-89apt-get updateandapt list --upgradablerun as twoSSH.run/2calls. Collapse toapt-get update && apt list --upgradable(preserveLANG=Cand thesudo -nprefix per the security guardrails). One connection.4. DashboardLive: memoize derived data out of render/1 — HIGH (cheap)
lib/mast_web/live/dashboard_live.ex:156-157render/1recomputesfiltered/2+fleet_stats/2(full list scans) on every render, including diffs unrelated to the server list. Compute:visible_serversand:statsin assigns, refreshed only when:serversor:filterchanges (mount + the relevant handle_* clauses).5. ServerLive: stop over-fetching on PubSub — HIGH
lib/mast_web/live/server_live.ex:74-87, 132-133handle_info(:server_updated, ...)reloads audit activity (:132) and re-runsassign_series/Fleet.list_stats(:133) on every status/metrics broadcast — neither of which a metrics update changes. Gate the stats refetch to range changes / stat-collection events, and only reload activity on explicit navigation or audit-affecting events.6.
Release handle lookups: filter in SQL→ split to #28 — MOVEDThis item needed a migration (a Postgres generated column + unique index) to be done correctly, and the original "replace the advisory
taken?/3check" framing was wrong —taken?/3is kept as the UX layer, the DB constraint is added as the race backstop. Settled in a/domain-plansession (ADR 0008 amended, CONTEXT.md gains an Effective Handle term). Tracked separately in #28. It is out of this clean slice.7. Apps.upsert_from_probe: duplicate query — MED
lib/mast/apps.ex:54, 94Calls
list_for_release/1twice per probe (diff, then return). Reuse the post-write in-memory list instead of re-querying.Lower priority (note here, fix opportunistically)
ReleaseLivelog_buffer (lib/mast_web/live/release_live.ex:240-269) accumulates streamed lines in a regular assign — diffed per line. Usestream/temporary_assigns. — MEDAuditLivere-presents the full event list each render (lib/mast_web/live/audit_live.ex:99). — MEDStatsDownsample.process_transition/1(lib/mast/workers/stats_downsample.ex:46-57) groups all source rows in memory; move to SQL aggregation as fleet grows. — MEDunique: [period: 50-55]is shorter than the trigger window inconnection_check.ex:22,stats_collect.ex:20,app_probe.ex:14— a job running longer than the period can enqueue a duplicate. Set period ≥ window. — LOWconfig/dev.exs:25) fans out intochecks: 5queue (config/config.exs:22); fixing Inline 'paste new key' form in Add System modal #1 (6→1 connections) ~6x's the headroom. Confirm prod has no 30s override. — LOWDone =
mix mast.benchtask exists and reports SSH call count + timing per workermix precommitgreen