Skip to content

perf: SSH connection churn + render/query inefficiencies (audited, batch fix) #27

@pachev

Description

@pachev

Summary

Audit of the app surfaced a cluster of efficiency issues. The biggest is SSH connection churn: SSHKit.run/2 opens and closes a fresh SSH connection on every command (verified in deps/sshkit/lib/sshkit.ex:313-318connect → run → close), and the wrapper docstring claiming per-process connection caching is factually wrong. ConnectionCheck.probe/1 pays this 6x per server per tick.

We want to fix these in one clean slice, with a benchmark task so we can measure before/after.


Verification harness (build this first)

Add a one-off mix task so we can measure without waiting on ticks:

mix mast.bench --server <host-or-id> [--runs N] [--worker connection_check|stats_collect|patch_scan|app_probe]
  • Runs the chosen worker's probe path against one real server N times (default ~10).
  • Prints SSH call count and min / median / max wall time per worker.
  • Defaults to running all SSH-bound workers so one invocation gives a full picture.
  • Lives under lib/mix/tasks/mast.bench.ex, dev/test env only, no Oban enqueue — call the probe functions directly so we measure the SSH path, not the queue.

Expected baseline (what we're fixing):

ConnectionCheck: 6 calls, median ~4s
StatsCollect:    1 call,  median ~0.9s   # already correct, use as the model
PatchScan:       2 calls

Acceptance: after the fixes below, ConnectionCheck reports 1 call and a markedly lower median; rerun the task and paste output into the PR.


Fixes (priority order)

1. ConnectionCheck: 6 SSH connections → 1 combined command — HIGH

lib/mast/workers/connection_check.ex:56-62

probe/1 issues 6 separate SSH.run/2 calls (os-release, top, free, df, loadavg, nproc). Each is a full TCP+SSH handshake + DB key load + key decryption. StatsCollect already solves this: one command with echo __MARKER__ delimiters, parsed by section (lib/mast/workers/stats_collect.ex:29-36). Mirror that pattern here.

  • RED: stub-backed test asserting probe/1 issues exactly one SSH.run. (Currently 6.)
  • GREEN: combined heredoc command with markers; split output, route each section to the existing OS.parse / Metrics.parse_* calls.
  • Keep the with-style error handling: a non-zero exit on the combined command still marks the server unreachable.

2. Correct the SSHKit docstring — HIGH (do alongside #1)

lib/mast/ssh/sshkit.ex:6-9

The "Connection caching: SSHKit opens a connection on first use within the calling process... yields connection reuse" claim is false. SSHKit.run/2 connects and closes per call. Rewrite to state that each run/2 is a fresh connection, so callers must batch multiple commands into a single command string (and point at StatsCollect as the reference). This misled the audit itself.

3. PatchScan: 2 connections → 1 — MED

lib/mast/workers/patch_scan.ex:88-89

apt-get update and apt list --upgradable run as two SSH.run/2 calls. Collapse to apt-get update && apt list --upgradable (preserve LANG=C and the sudo -n prefix per the security guardrails). One connection.

4. DashboardLive: memoize derived data out of render/1 — HIGH (cheap)

lib/mast_web/live/dashboard_live.ex:156-157

render/1 recomputes filtered/2 + fleet_stats/2 (full list scans) on every render, including diffs unrelated to the server list. Compute :visible_servers and :stats in assigns, refreshed only when :servers or :filter changes (mount + the relevant handle_* clauses).

5. ServerLive: stop over-fetching on PubSub — HIGH

lib/mast_web/live/server_live.ex:74-87, 132-133

handle_info(:server_updated, ...) reloads audit activity (:132) and re-runs assign_series/Fleet.list_stats (:133) on every status/metrics broadcast — neither of which a metrics update changes. Gate the stats refetch to range changes / stat-collection events, and only reload activity on explicit navigation or audit-affecting events.

6. Release handle lookups: filter in SQL → split to #28 — MOVED

This item needed a migration (a Postgres generated column + unique index) to be done correctly, and the original "replace the advisory taken?/3 check" framing was wrong — taken?/3 is kept as the UX layer, the DB constraint is added as the race backstop. Settled in a /domain-plan session (ADR 0008 amended, CONTEXT.md gains an Effective Handle term). Tracked separately in #28. It is out of this clean slice.

7. Apps.upsert_from_probe: duplicate query — MED

lib/mast/apps.ex:54, 94

Calls list_for_release/1 twice per probe (diff, then return). Reuse the post-write in-memory list instead of re-querying.


Lower priority (note here, fix opportunistically)

  • ReleaseLive log_buffer (lib/mast_web/live/release_live.ex:240-269) accumulates streamed lines in a regular assign — diffed per line. Use stream/temporary_assigns. — MED
  • AuditLive re-presents the full event list each render (lib/mast_web/live/audit_live.ex:99). — MED
  • StatsDownsample.process_transition/1 (lib/mast/workers/stats_downsample.ex:46-57) groups all source rows in memory; move to SQL aggregation as fleet grows. — MED
  • Oban unique: [period: 50-55] is shorter than the trigger window in connection_check.ex:22, stats_collect.ex:20, app_probe.ex:14 — a job running longer than the period can enqueue a duplicate. Set period ≥ window. — LOW
  • Ticker default 30s (config/dev.exs:25) fans out into checks: 5 queue (config/config.exs:22); fixing Inline 'paste new key' form in Add System modal #1 (6→1 connections) ~6x's the headroom. Confirm prod has no 30s override. — LOW

Done =

  • mix mast.bench task exists and reports SSH call count + timing per worker
  • ConnectionCheck issues 1 SSH call (test-asserted); benchmark shows the drop
  • PatchScan issues 1 SSH call
  • SSHKit docstring corrected
  • DashboardLive + ServerLive render/refetch fixes
  • Release SQL-filter items resolved or split out (see open question)
  • mix precommit green

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions