From af81002f27d6d1412f5f02315540f4178942193b Mon Sep 17 00:00:00 2001
From: s-b-e-n-s-o-n <80784472+s-b-e-n-s-o-n@users.noreply.github.com>
Date: Sat, 30 May 2026 17:01:29 -0400
Subject: [PATCH 1/2] 📝 docs(release): finalize v1.5.0 changelog and fix trigger-taxonomy doc links
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- 📝 Fold [Unreleased] + all rc.1–rc.28 sections into a single cumulative `## [1.5.0] — 2026-05-30` entry so release-cut emits complete notes; add the previously-undocumented #402 auth `/health` readiness fix and drop orphaned [Unreleased]/[1.5.0-rc.*] compare-link defs
- 🐛 Fix broken deep-link in actions docs → triggers#environment-variable-prefixes
- 🐛 Fix broken deep-link in watchers docs → deprecations#legacy-trigger-labels
- 📝 Clarify migration-CLI callout: messaging providers may rename to DD_NOTIFICATION_* manually (CLI emits DD_ACTION_* for all types)
---
 CHANGELOG.md | 793 +++++-------------
 .../current/configuration/actions/index.mdx | 2 +-
 .../current/configuration/triggers/index.mdx | 2 +-
 .../current/configuration/watchers/index.mdx | 2 +-
 4 files changed, 224 insertions(+), 575 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index be241750..0645d466 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,791 +8,454 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 > **Fork point:** upstream post-8.1.1 (2025-11-27)
 > **Upstream baseline:** WUD 8.1.1 + 65 merged PRs on `main` (Vue 3 migration, Alpine base image, Rocket.Chat trigger, threshold system, semver improvements, request→axios migration, and more)
-## [Unreleased]
+## [1.5.0] — 2026-05-30

-
-### Fixed
-
-- **[#386](https://github.com/CodesWhat/drydock/issues/386) follow-through — a fresh-restart agent whose in-memory store has not yet been re-populated no longer wipes the controller's last-known container state on handshake.** A user running drydock in a controller + agent topology reported on rc.27 that their `ml` agent still rendered 0 running containers in the controller UI even after the rc.25 watcher-snapshot suppression (`d02080ae`) and the rc.26 stats-changed broadcast (`512c3751`). Cause: when `AgentClient._doHandshake` (`app/agent/AgentClient.ts`) reconnects to an agent's SSE stream it handshakes via `GET /api/containers`, which serves from the agent's *in-memory* `storeContainer`. If the agent process has just restarted and its `watchatstart` cron has not yet finished its first run (the agent's store is non-persistent across restarts), that endpoint legitimately returns `[]` even though the docker daemon has N running containers. `_doHandshake` then called `pruneOldContainers([])` unconditionally, deleting every controller-side container the agent had previously contributed — even though the agent's first `dd:watcher-snapshot` was about to repopulate the store seconds later. The rc.25 fix in `Docker.watch()` only suppresses *outgoing* snapshots from the agent when enumeration fails on the agent itself; it does not cover the controller-side handshake path. The fix makes the handshake's prune step ambiguity-aware: `_doHandshake` now skips `pruneOldContainers` whenever `containers.length === 0` and emits a `Handshake returned 0 containers; preserving last-known state until the first watch cycle completes` warning (only after `hasConnectedOnce`, so the first-ever connection of a genuinely empty agent stays silent). Pruning is deferred to the next authoritative `dd:watcher-snapshot`, which is already gated on `!containerEnumerationFailed && enrichmentErrors === 0` (`app/watchers/providers/docker/Docker.ts:1136`) and is therefore unambiguous: a 0-container snapshot means the agent really has 0 running containers. Non-zero handshakes continue to prune normally — the behaviour change is scoped strictly to the 0-container case that exposed the cold-start race.
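The ambiguity-aware prune in the entry above comes down to a small decision function. This is a simplified model, not the `AgentClient._doHandshake` code. `planHandshakePrune`, `HandshakeState`, and the injected `warn` callback are hypothetical names, and the real method also performs the prune and emits events.

```typescript
// Hypothetical, simplified model of the handshake prune decision.
interface HandshakeState {
  hasConnectedOnce: boolean;
}

type PruneDecision = "prune" | "skip";

function planHandshakePrune(
  containers: readonly unknown[],
  state: HandshakeState,
  warn: (message: string) => void,
): PruneDecision {
  if (containers.length === 0) {
    // An empty handshake is ambiguous: the agent may have just restarted and
    // not yet finished its first watch cycle. Defer pruning to the next
    // authoritative watcher snapshot instead of wiping last-known state.
    if (state.hasConnectedOnce) {
      warn(
        "Handshake returned 0 containers; preserving last-known state until the first watch cycle completes",
      );
    }
    return "skip";
  }
  // A non-empty list is treated as authoritative, so pruning proceeds as before.
  return "prune";
}
```

Only the zero-container case changes behaviour. Every other handshake still prunes, which matches the scope described in the entry.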
-
-## [1.5.0-rc.27] — 2026-05-24
-
-### Fixed
-
-- **[#289](https://github.com/CodesWhat/drydock/issues/289) — Agent-hosted container updates no longer leave an orphaned queued operation row on the controller that the 30-minute TTL sweep force-fails into a misleading "update failed" Pushover/Telegram notification long after the update actually succeeded.** A user running drydock in a controller + agent topology reported on rc.25 that an "Update All" of Tautulli on two hosts produced the success notification only for the controller-host container; the agent-host container's success notification was missing and, ~30 minutes later, a second Pushover arrived saying `[mediavault] Container Tautulli update failed — Marked failed after exceeding active update TTL (1800000ms) while queued.` even though the update had in fact succeeded on the agent. Cause: when the controller queues a container update via `createAcceptedContainerUpdateRequest` (`app/updates/request-update.ts`) it mints a controller-side `operationId` and inserts a `queued` row; the dispatcher then calls `entry.trigger.trigger(entry.container, { operationId })`. For containers hosted on an agent the trigger is `AgentTrigger`, whose `trigger(container)` previously accepted only the container and discarded the `runtimeContext`. `AgentClient.runRemoteTrigger` posted `{id, name}` to the agent without the operationId, so the agent's `/api/triggers/:type/:name` endpoint called `requestContainerUpdate` with no operationId and minted its own row; the agent's `dd:update-applied` / `dd:update-operation-changed` events then arrived back at the controller carrying the agent-side id, which the controller routed through `toAgentScopedId` into a third, agent-scoped row (`agent--`). The original controller-side queued row was therefore never touched, sat queued past the `UPDATE_OPERATION_ACTIVE_TTL_MS` deadline in `app/store/update-operation.ts:295-300`, and was force-failed by the TTL sweep — which fired the misleading "failed" notification with the row's still-valid container snapshot (hence the correct `[mediavault]` agent prefix). The fix threads the controller's `operationId` end-to-end so a single row is the source of truth for the whole lifecycle: `AgentTrigger.trigger` / `triggerBatch` now accept and forward `runtimeContext`; `AgentClient.runRemoteTrigger` / `runRemoteTriggerBatch` extract per-container operationIds via the existing `getRequestedOperationId` helper and include them in the agent payload (`{id, name, operationId}` for single triggers; `{...container, operationId}` per entry for batches); the agent-side controller `runTrigger` accepts an `operationId` in the request body (validated by `triggerRequestBodySchema`) and threads it into `requestContainerUpdate`; the agent-side batch endpoint extracts per-container operationIds into an `{operationIds}` runtimeContext before forwarding to the local trigger; `EnqueueContainerUpdateOptions` gains an `operationId` field honored by `createAcceptedContainerUpdateRequest` (single-container batches only; multi-container batches still mint per-container UUIDs); and a new `AgentClient.resolveAgentOperationId` helper checks the controller's operation store for an existing row at the raw (unscoped) id and reuses it when found — falling back to the `toAgentScopedId` form only when the agent does not echo a known controller id, preserving backwards compatibility with older agents. The controller-side queued row therefore transitions directly to `in-progress` and `succeeded`/`failed` from the agent's lifecycle events, no parallel agent-scoped row is created, the TTL sweep has nothing stale to fail, and the spurious "update failed" notification disappears.
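The `resolveAgentOperationId` fallback described above can be sketched as follows, assuming only that the operation store can answer "is there a row with this id?". `resolveOperationId`, `OperationLookup`, and the `toScopedId` callback are hypothetical names. The callback stands in for `toAgentScopedId`, whose exact id format is not reproduced here.

```typescript
// Hypothetical view of the controller's operation store: only a lookup is needed.
interface OperationLookup {
  has(id: string): boolean;
}

function resolveOperationId(
  rawId: string,
  store: OperationLookup,
  // Stands in for toAgentScopedId; the real scoped-id format is not shown here.
  toScopedId: (id: string) => string,
): string {
  // Newer agents echo the controller-minted id, so a row already exists at the raw id.
  if (store.has(rawId)) {
    return rawId;
  }
  // Older agents mint their own ids; keep those in a separate, agent-scoped namespace.
  return toScopedId(rawId);
}
```

Checking the store before scoping lets one row track the whole lifecycle with newer agents, while events from older agents still land in their own scoped rows.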
-
-- **[#289](https://github.com/CodesWhat/drydock/issues/289) — Update-applied and update-failed notification triggers (Pushover, Telegram, etc.) and UI success toasts no longer silently drop for containers running on a connected agent.** A user running drydock in a controller + agent topology reported on rc.25 that an "Update All" across two hosts produced the success toast and Pushover notification only for the container on the controller host, never for the same-name container on the agent host. Cause: when the agent finishes an update it sends a `dd:update-applied` SSE payload to the controller carrying a full `container` snapshot. The controller's `AgentClient.handleEvent` routes this through `maybeMarkAgentOperationSucceededFromAppliedPayload` → `markAgentOperationTerminal` → `ensureAgentOperationForTerminal` → `updateOperationStore.insertOperation` + `markOperationTerminal`, but `buildAgentOperationBase` in `app/agent/AgentClient.ts` constructed the inserted row from `{id, kind, containerName, containerId, newContainerId}` only — the container snapshot was dropped on the floor. When `markOperationTerminal` then fired `emitTerminalLifecycleEvent` (`app/store/update-operation.ts`), the resulting `emitContainerUpdateApplied` / `emitContainerUpdateFailed` payload built by `buildTerminalLifecycleEventBase` lacked `container`. The notification handler `handleContainerUpdateAppliedEvent` (`app/triggers/providers/Trigger.ts`) then fell back to `findContainerByBusinessId(containerName)`, which compares the agent's bare `containerName` (e.g. `tautulli`) against the controller-side `fullName` (e.g. `mediavault_docker_tautulli`) and silently dropped — the same class of `findContainerByBusinessId` miss as [#385](https://github.com/CodesWhat/drydock/issues/385) but on the agent-scoped operation path that [#385](https://github.com/CodesWhat/drydock/issues/385) did not cover. The fix threads the agent's container snapshot through every level of the agent-scoped operation pipeline — `buildAgentOperationBase`, `ensureAgentOperationForTerminal`, `markAgentOperationTerminal`, `maybeMarkAgentOperationSucceededFromAppliedPayload`, and `maybeMarkAgentOperationFailedFromFailedPayload` — stamping `agent: this.name` so the controller's view of the container is consistent. The `dd:update-operation-changed`-before-`dd:update-applied` race is handled by patching the container snapshot onto the existing active row via `updateOperation` before the terminal emit runs (only when the existing row lacks a container, never overwriting an existing snapshot). `container` is added to `MutableUpdateOperationFields` in `app/store/update-operation.ts` so terminal and active patches accept it. The store's terminal-lifecycle emit therefore naturally carries the agent's container into `emitContainerUpdateApplied` / `emitContainerUpdateFailed`, the `payloadContainer` shortcut in the trigger handler succeeds, and both the notification trigger and the SSE toast fire end-to-end on the controller for agent-originated updates.
-
-## [1.5.0-rc.26] — 2026-05-22
-
-### Fixed
-
-- **Image reference construction — unanchored `/v2` strip could silently corrupt references when the image name contained a `/v2` path segment.** `Registry.getImageFullName` and the controller-mode fallback in `resolveContainerImageFullName` both applied `.replace(/\/v2/, '')` to the fully concatenated `registryUrl/imageName:tag` string. Because the regex was unanchored and non-global, if the image name contained a `/v2` segment (e.g. `library/v2/tool`) the strip would remove it from the image name rather than the registry URL — producing a silently wrong reference handed to Trivy. The fix extracts a shared pure helper `buildImageReference` (`app/registries/image-reference.ts`) that cleans the registry URL *before* concatenation using anchored regexes (`^https?:\/\/` and `/v2\/?$`) so the URL scheme and trailing `/v2` API path are removed without touching anything in the image name. Both `Registry.getImageFullName` and the fallback branch of `resolveContainerImageFullName` now delegate to this helper, eliminating the duplicate logic.
-
-- **[#386](https://github.com/CodesWhat/drydock/issues/386) — Agents intermittently showing 0 running containers in the controller UI — a second recurrence the rc.25 fix did not close.** The rc.25 fix suppressed the authoritative watcher snapshot whenever container enumeration failed or per-container enrichment errors dropped containers, but the recurrence reported on rc.25 is a different failure mode: a cold-start race between the controller's handshake and the agent's first watch cycle. When the controller's `AgentClient` (re)connects to an agent's SSE stream it handshakes immediately via `GET /api/containers`; if the agent's `watchatstart` cron has not yet finished its first run, the agent's in-memory store is still empty and the handshake legitimately receives 0 containers (the agent log shows `Handshake successful. Received 0 containers.` ~5 s before `Cron finished (4 containers watched, 0 errors)`). The handshake then fires `emitAgentConnected`, the UI re-fetches `/api/v1/agents`, and the agent's running-container count renders 0. When the agent's cron completes moments later it pushes a `dd:watcher-snapshot`, and `AgentClient.handleWatcherSnapshotEvent` ingests the four containers into the controller store correctly — but nothing told the UI to refresh, because `AgentsView` only re-fetches the agent summary on `agent-status-changed` / `connected` / `resync-required` events, not on `container-added` / `container-updated`. The stale 0 therefore persisted until an unrelated reconnect event (such as an agent restart) fired. The fix adds a dedicated `AgentStatsChanged` event: `app/event/index.ts` gains `emitAgentStatsChanged` / `registerAgentStatsChanged` (mirroring the existing `AgentConnected` pair); `AgentClient.handleWatcherSnapshotEvent` now emits `emitAgentStatsChanged({ agentName })` after every completed watcher snapshot; `app/api/sse.ts` broadcasts it to UI SSE clients as `dd:agent-stats-changed`; and `ui/src/stores/eventStream.ts` maps that to the existing `agent-status-changed` bus event. A completed agent watch cycle therefore always refreshes the controller's agent-summary count, even when the handshake raced ahead of the agent's first cron.
-
-- **[#342](https://github.com/CodesWhat/drydock/issues/342) — A container is no longer shown as "update available" with a blank target version after a transient registry error.** `hasRawUpdate` in `app/model/container.ts` compared `transformTag(image.tag.value)` against `transformTag(result.tag)` without guarding an undefined `result.tag`. When a registry scan failed mid-flight (for example a Docker Hub or GHCR `429`) and left a container `result` present but its `tag` unset, `transformTag(undefined)` returned `undefined`, the `localTag !== remoteTag` comparison evaluated true, and the container was flagged `updateAvailable` with an `unknown` update kind — which the UI renders as an update with no target version (the reporter saw this on `immich_redis`). `hasRawUpdate` now performs the tag comparison only when both `image.tag.value` and `result.tag` are defined, matching the existing guard in `getRawTagUpdate`. Digest-only updates are unaffected: a container with an undefined `result.tag` but a genuine digest change still reports the digest update.
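Here is a minimal sketch of the anchored cleanup from the rc.26 image-reference entry, combined with the `@` separator for digests mentioned in the #374 entry. The function signature is an assumption. Treating any value that starts with `sha256:` as a digest is a simplification for illustration, and the real helper in `app/registries/image-reference.ts` may decide this differently.

```typescript
// Hypothetical signature; the real helper may take different arguments.
function buildImageReference(
  registryUrl: string,
  imageName: string,
  tagOrDigest: string,
): string {
  // Clean only the registry URL, with anchored patterns, before concatenating,
  // so nothing inside the image name can be matched.
  const registryHost = registryUrl
    .replace(/^https?:\/\//, "") // drop the URL scheme
    .replace(/\/v2\/?$/, ""); // drop a trailing /v2 API path, if present
  // Assumption: a value starting with "sha256:" is a digest and takes "@".
  const separator = tagOrDigest.startsWith("sha256:") ? "@" : ":";
  return `${registryHost}/${imageName}${separator}${tagOrDigest}`;
}
```

Both patterns are anchored to the start or end of the registry URL, so a `/v2` segment inside the image name (such as `library/v2/tool`) survives untouched.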
+### Added

-- **[#386](https://github.com/CodesWhat/drydock/issues/386) follow-through — the controller's agent-summary container count now also refreshes on docker-event-driven container changes, not only completed cron cycles.** The initial #386 fix emitted `emitAgentStatsChanged` from `AgentClient.handleWatcherSnapshotEvent`, the cron-watch path. An agent also ingests individual container add/remove/update events from the Docker event stream between cron cycles (`handleContainerChangeEvent`, `handleContainerRemovedEvent`), via the controller-initiated `watch()` path, and via the per-container controller-initiated `watchContainer()` path — none of which emitted the stats-changed signal, so a container started or stopped on an agent host could leave the `AgentsView` running-container count stale until the next 6-hourly cron. All four paths now emit `emitAgentStatsChanged` after mutating the controller store, keeping the count current in real time.
+- **Trigger environment variable taxonomy split — `DD_ACTION_*` and `DD_NOTIFICATION_*` prefixes.** Action triggers (Docker, Docker Compose, Command) are now configured with `DD_ACTION_*` and `dd.action.*` labels; notification/messaging triggers (Slack, SMTP, Discord, Telegram, ntfy, Pushover, and all others) are configured with `DD_NOTIFICATION_*` and `dd.notification.*` labels. All three prefix families (`DD_ACTION_*`, `DD_NOTIFICATION_*`, `DD_TRIGGER_*`) are interchangeable at runtime — merge priority is `DD_NOTIFICATION_*` > `DD_ACTION_*` > `DD_TRIGGER_*`. A migration CLI (`drydock config migrate --source trigger`) rewrites `DD_TRIGGER_*`, `dd.trigger.include`, and `dd.trigger.exclude` to action-prefixed aliases automatically; use `--dry-run` to preview changes before applying.
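The merge priority in the taxonomy entry above (`DD_NOTIFICATION_*` > `DD_ACTION_*` > `DD_TRIGGER_*`) can be illustrated by applying the prefix families from lowest to highest priority, so later writes win. This is a hypothetical sketch, not drydock's configuration loader. `mergeTriggerEnv` is an invented name, and the sketch assumes the remainder after each prefix (for example `SLACK_MAIN_URL`) identifies the same setting in every family.

```typescript
// Lowest priority first, so each later family overwrites earlier values.
const PREFIXES_LOW_TO_HIGH = ["DD_TRIGGER_", "DD_ACTION_", "DD_NOTIFICATION_"] as const;

// Hypothetical merge: assumes the suffix after the prefix names the same setting in every family.
function mergeTriggerEnv(
  env: Record<string, string | undefined>,
): Record<string, string> {
  const merged: Record<string, string> = {};
  for (const prefix of PREFIXES_LOW_TO_HIGH) {
    for (const key of Object.keys(env)) {
      const value = env[key];
      if (value !== undefined && key.startsWith(prefix)) {
        merged[key.slice(prefix.length)] = value;
      }
    }
  }
  return merged;
}
```

Applying families in ascending priority avoids any per-key comparison. The final write for a key always comes from the highest-priority family that defines it.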

-- **[#342](https://github.com/CodesWhat/drydock/issues/342) — GitHub release-notes lookups now survive GitHub's secondary rate limit instead of giving up on the first burst.** Drydock authenticates its api.github.com release-notes requests by reusing the configured GHCR token, but a watch cycle still fans out a lookup for every watched container at once and trips GitHub's *secondary* rate limit — a `403` GitHub returns to authenticated callers who burst too many requests. The shared retry helper (`app/registries/http-retry.ts`) only retried `429`/`503`, so the secondary-limit `403` was never retried: the provider logged `GitHub release notes lookup is rate-limited` and returned nothing. `withRetry` gains two optional, opt-in hooks — `retryPredicate` (retry a status outside `retryableStatuses`) and `retryDelayMs` (per-attempt delay override) — leaving every existing caller unchanged. `GithubProvider` classifies a `403` as a secondary rate limit only when it carries a `retry-after` header or `x-ratelimit-remaining: 0`, retries those (honouring `retry-after` / `x-ratelimit-reset` for the delay), and leaves a genuine `403` authorization failure failing fast as before. Once retries are exhausted the provider arms a short module-level cooldown — driven by GitHub's own retry hint, floored at the 60 s default so a `retry-after: 0` hint cannot produce an already-expired cooldown — during which further release-notes lookups are skipped, so a single cron cycle no longer hammers an already-tripped limit container after container. The rate-limit warning now also records whether the request was authenticated.
+- **Up-to-date and pinned badges in Kind column** — Containers table now shows a green check-circle badge ("Up to date") for containers at their latest version, and a green pin badge ("Pinned") for containers with skipped updates, replacing the previous dash placeholder.
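The secondary-rate-limit handling in the GitHub release-notes entry above can be sketched as three small helpers. The names `isSecondaryRateLimit`, `retryHintMs`, and `cooldownMs` are hypothetical, and header names are assumed to be lowercased. The parsing follows GitHub's documented header conventions (`retry-after` in seconds, `x-ratelimit-reset` as a Unix timestamp in seconds) rather than drydock's `GithubProvider` code.

```typescript
type HeaderMap = Record<string, string | undefined>;

const DEFAULT_COOLDOWN_MS = 60_000;

// Only a 403 that carries a rate-limit hint is retryable; a bare 403 is an auth failure.
function isSecondaryRateLimit(status: number, headers: HeaderMap): boolean {
  return (
    status === 403 &&
    (headers["retry-after"] !== undefined || headers["x-ratelimit-remaining"] === "0")
  );
}

// Delay suggested by GitHub, or undefined when no usable hint is present.
function retryHintMs(headers: HeaderMap, nowMs: number): number | undefined {
  const retryAfter = headers["retry-after"];
  if (retryAfter !== undefined && /^\d+$/.test(retryAfter)) {
    return Number(retryAfter) * 1000; // retry-after is in seconds
  }
  const reset = headers["x-ratelimit-reset"];
  if (reset !== undefined && /^\d+$/.test(reset)) {
    return Math.max(0, Number(reset) * 1000 - nowMs); // reset is a Unix time in seconds
  }
  return undefined;
}

// Floor at the default so a "retry-after: 0" hint cannot yield an already-expired cooldown.
function cooldownMs(headers: HeaderMap, nowMs: number): number {
  return Math.max(DEFAULT_COOLDOWN_MS, retryHintMs(headers, nowMs) ?? 0);
}
```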

-- **[#342](https://github.com/CodesWhat/drydock/issues/342) — the registry-error tooltip on the Containers view now names the registry that failed.** When a registry tag lookup errors (for example a `429` rate limit) the container shows a registry-error badge whose tooltip previously rendered only the raw transport message — `Registry error: Request failed with status code 429` — with no indication of which registry was queried. `registryErrorTooltip` in `ui/src/views/ContainersView.vue` now derives the registry hostname from the container's `registryUrl` and renders it through a new `registryError.detailWithRegistry` i18n string (`{registryHost} — {error}`), e.g. `ghcr.io — Request failed with status code 429`. Containers whose `registryUrl` is absent or unparseable fall back to the original message unchanged.
+- **Real-time container log viewer** — WebSocket-based live log streaming from Docker containers directly in the UI. Features ANSI color rendering, automatic JSON log detection with syntax-highlighted pretty-printing, free-text and regex search with match navigation, stdout/stderr stream filtering, log level filtering for structured logs, copy to clipboard, and gzip-compressed download. Available in both the container detail panel and a dedicated full-page view at `/containers/:id/logs`. ([Phase 4.2](https://getdrydock.com/docs/configuration/logs))

-## [1.5.0-rc.25] — 2026-05-21
+- **Diagnostic debug dump** — One-click export of redacted system state from Configuration > Diagnostics. Collects runtime metadata, component state (watchers, registries, triggers, agents), Docker API diagnostics, MQTT Home Assistant sensors, recent Docker events, store stats, and `DD_*` environment variables. Sensitive values matching `password|token|secret|key|hash` are automatically redacted. Configurable time window (1–1440 minutes). ([Phase 4.14](https://getdrydock.com/docs/api/container))

-### Fixed
+- **Container log streaming API** — `WS /api/v1/containers/:id/logs/stream` endpoint with Docker binary stream demultiplexing, session-based authentication on WebSocket upgrade, and fixed-window rate limiting (1,000 connections per 15 minutes).

-- **[#371](https://github.com/CodesWhat/drydock/discussions/371) — Containers "Group By Stack" view no longer dissolves a multi-container stack into "Ungrouped" while its last container is mid-update.** The flatten rule in `groupedContainers` (`ui/src/views/ContainersView.vue`) previously keyed off the transient live container count (`buckets[key].length === 1`). During a docker recreate a 2-container stack momentarily shows only 1 live container (old removed, new not yet added), so the rule fired and dropped the stack header. The fix adds a `groupAssignedSizeMap` ref (populated by `loadGroups()` from the groups API response and reset to `{}` on error) that records each group's API-assigned member count. The flatten condition is now `buckets[key].length === 1 && groupAssignedSizeMap.value[key] === 1` — a strict equality check so stacks whose assigned size is > 1 or transiently absent from the API response are never flattened mid-update. Genuine single-container stacks (assigned size exactly 1) are still flattened as before (GitHub Discussion #179).
+- **Container log download API** — `GET /api/v1/containers/:id/logs` endpoint with gzip compression support, stdout/stderr filtering, configurable tail size, and timestamp-based `since` filtering.
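The fixed-window rate limiting in the log streaming API entry above (1,000 connections per 15 minutes) can be sketched as a counter that resets whenever its window elapses. `FixedWindowLimiter` is a hypothetical name. This single-counter version ignores per-client keying and shared state across processes, which the real endpoint may handle differently.

```typescript
// Hypothetical single-process fixed-window limiter.
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private windowStart = Number.NEGATIVE_INFINITY; // forces a fresh window on first use
  private count = 0;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  tryAcquire(nowMs: number): boolean {
    // Start a new window once the current one has fully elapsed.
    if (nowMs - this.windowStart >= this.windowMs) {
      this.windowStart = nowMs;
      this.count = 0;
    }
    if (this.count >= this.limit) {
      return false; // reject until the window rolls over
    }
    this.count += 1;
    return true;
  }
}

// Example configuration matching the documented limit.
const logStreamLimiter = new FixedWindowLimiter(1000, 15 * 60 * 1000);
```

A fixed window is cheap to track, but it can admit up to twice the limit in a short span that straddles a window boundary. A sliding window avoids that at the cost of more state.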

-- **[#386](https://github.com/CodesWhat/drydock/issues/386) — Agents intermittently showing 0 running containers in the controller UI — a recurrence of [#362](https://github.com/CodesWhat/drydock/issues/362) that the rc.20 fix did not fully close.** The rc.20 guard introduced a `containerEnumerationFailed` flag in `Docker.watch()` (`app/watchers/providers/docker/Docker.ts`) that suppresses the authoritative `emitWatcherSnapshot` when `getContainers()` itself throws. However, `getContainers()` does not throw on per-container enrichment failures: `addImageDetailsToContainer()` is called for each watched container, and any container whose enrichment throws is caught (`.catch(error => return error)`) and then silently filtered out by `.filter(result => !(result instanceof Error) && result != null)`. A transient docker / socket-proxy hiccup during image inspect can therefore cause `getContainers()` to return a short or empty array **without throwing** — the `containerEnumerationFailed` guard does not fire, `watch()` emits an authoritative `emitWatcherSnapshot` with the degraded container list, and the controller's `AgentClient.handleWatcherSnapshotEvent` prunes every container not in that list, wiping the agent's view. The agent's own store is preserved because its local prune re-confirms each container via `inspect()`, which is why the agent kept reporting its containers and a restart's handshake re-synced the controller. The fix extends the snapshot-suppression in two steps: `getContainers()` now accepts an optional `diagnostics` out-parameter and writes the number of containers dropped due to enrichment errors into `diagnostics.enrichmentErrors`; `watch()` creates and passes this object on every call, logs a `Container enumeration degraded` warning when the count is non-zero, and suppresses `emitWatcherSnapshot` whenever either `containerEnumerationFailed` is true **or** `enumerationDiagnostics.enrichmentErrors > 0`. Per-container reports still emit as before; only the authoritative controller-side prune is deferred until a fully clean watch cycle.
+- **Debug dump API** — `GET /api/v1/debug/dump` endpoint with configurable `minutes` query parameter for time-windowed event collection.

-- **[#385](https://github.com/CodesWhat/drydock/issues/385) — Telegram, Pushover, and other notification triggers no longer silently swallow `update-applied` and `update-failed` events after a compose recreate or on multi-agent deployments.** When an update routed through the operation queue completed, the terminal lifecycle event (`update-applied` on success, `update-failed` on failure/rolled-back) was emitted from `app/store/update-operation.ts:buildTerminalLifecycleEventBase` with only `containerName` / `containerId` / `operationId` on the payload — no `container` object. Notification handlers in `app/triggers/providers/Trigger.ts` fell back to `findContainerByBusinessId(containerName)`, which missed during the ~8 s window between the old container being removed and the new one being re-watched after a compose recreate; the handler then dropped the event with a `No container found for update-applied event => ignore` debug log. This was the same class of race as [#355](https://github.com/CodesWhat/drydock/issues/355) but for the operation-queue-driven path that bypasses `UpdateLifecycleExecutor`'s direct emit. The fix persists a snapshot of the `Container` on the operation entry at enqueue time (`app/updates/request-update.ts:createAcceptedContainerUpdateRequest`) and `buildTerminalLifecycleEventBase` now forwards that snapshot on the terminal-lifecycle payload — both `update-applied` and `update-failed`, closing the race for compose successes and failures alike. The agent SSE wire was also extended to forward the container snapshot end-to-end so multi-agent deployments get the same fix: `sanitizeUpdateAppliedPayloadForAgentSse` and `sanitizeUpdateFailedPayloadForAgentSse` in `app/agent/api/event.ts` include `container` when present (previously stripped to scalars only), and the controller's `AgentClient.parseUpdateFailedEventPayload` accepts and decorates an inbound container with the source `agent` name to mirror the existing applied-path behaviour. The snapshot is internal-only: a new `toApiUpdateOperation` helper in `app/store/update-operation.ts` strips it before serialising operations through `GET /api/v1/update-operations/:id`, `GET /api/containers/:id/update-operations`, and `POST /api/operations/:id/cancel`, so container labels and `details.env` are not exposed to API consumers.
+- **Dashboard customization** — Customizable grid layout with drag-to-reorder, resize, and per-widget visibility toggles using `grid-layout-plus`. Edit mode via pencil icon in breadcrumb header. Customize panel with checkboxes and S/M/L size badges. All widgets progressively collapse content based on container height.

-## [1.5.0-rc.24] — 2026-05-17
+- **Resource usage dashboard widget** — CPU and memory usage bars with top-N resource consumers, progressive detail at different widget sizes.

-### Changed
+- **Fleet-aggregate stats subsystem ([commits `feature/v1.5-rc17`](https://github.com/CodesWhat/drydock/commits/feature/v1.5-rc17)).** New `ContainerStatsAggregator` polls each locally-monitored container once per tick (default 10 s) and computes a fleet-wide `ContainerStatsSummary` (total CPU%, total memory, top-N rows). Two new endpoints — `GET /api/v1/stats/summary` and `GET /api/v1/stats/summary/stream` — expose the current snapshot and a live SSE feed; the dashboard Resource Usage widget now consumes the SSE stream directly, fixing the regression (introduced in rc.13 by the `?touch=false` workaround) where the widget showed zeros because the per-container cache was never warmed. The legacy `GET /api/v1/containers/stats` endpoint and the client-side `summarizeContainerResourceUsage` rollup have been removed.

-- **Translations refreshed from Crowdin (commit [`202f3d83`](https://github.com/CodesWhat/drydock/commit/202f3d83)).** Human translations were synced from Crowdin for the ~110-string rc.23 i18n extraction sweep, updating the 16 non-English locales across the `appShell`, `containerComponents`, `listViews`, `sharedComponents`, `configView`, `agentsView`, and `notificationOutboxView` namespaces. Strings that were previously falling back to English now render in each locale.
+- **Per-container update locks ([commit `761fb834`](https://github.com/CodesWhat/drydock/commit/761fb834)).** New keyed `LockManager` primitive in `app/updates/lock-primitives.ts` replaces the module-level `pLimit(1)` that was serialising every container update across the entire process. Lock keys are derived per container (and per compose project for `Dockercompose`), so two unrelated containers can now pull and recreate concurrently while two services in the same compose project still serialise correctly.

-### Fixed
+- **Restart recovery for queued and pulling updates ([commit `00788b13`](https://github.com/CodesWhat/drydock/commit/00788b13)).** Startup reconciliation in `app/store/update-operation.ts` is now selective: `status=queued` operations stay queued for the recovery dispatcher to pick up, and `phase=pulling` rows are reset to `queued` (pull is idempotent). A new `app/updates/recovery.ts` module runs once after `registry.init()`, re-resolves trigger and container for each queued operation, and dispatches them through the existing fire-and-forget pipeline.

-- **[#370](https://github.com/CodesWhat/drydock/issues/370) — Containers list "Version" column again shows the human-readable image tag for floating-tag + digest-watch containers, restoring the [#356](https://github.com/CodesWhat/drydock/issues/356) fix that rc.20 inadvertently reverted.** The rc.20 `#342` follow-up (commit `b40d3db8`) added a visible `sha256:… → sha256:…` digest pair to the Containers table "Version" cell and card body for all `updateKind === 'digest'` containers that are not digest-pinned. The intent was to surface the digest transition for hybrid containers where both the tag and the underlying image layer changed simultaneously; however, the change cast too wide a net: it also applied to floating-tag + digest-watch containers (e.g. `prom/prometheus:latest`, `linuxserver/plex` with a transform tag) — exactly the rows that [#356](https://github.com/CodesWhat/drydock/issues/356) fixed to show the human-readable tag instead of raw digest strings. The `updateKind === 'digest' && !isDigestPinned` branch of the table version cell and card body in `ui/src/components/containers/ContainersGroupedViews.vue` has been restored to the rc.19 behaviour: the version cell renders `c.currentTag` as a `CopyableTag` (with the full digest delta in the cell tooltip), and the card body shows only the update-state badge (with the digest delta in the badge tooltip). The digest transition remains visible through the adjacent "kind" column update-state indicator and the container detail panels. Digest-*pinned* containers (where `isDigestPinned` is true) are unaffected and continue to show the `sha256:… → sha256:…` pair directly in the cell.
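The per-container update locks entry above describes a keyed lock. Tasks that share a key serialise, while tasks with different keys proceed concurrently. The sketch below gets that behaviour by chaining promises per key. `KeyedLock` and `lockKeyFor` are hypothetical stand-ins for the `LockManager` in `app/updates/lock-primitives.ts`, and the key format is an assumption.

```typescript
// Hypothetical keyed lock: tasks that share a key run one at a time,
// while tasks with different keys run concurrently.
class KeyedLock {
  private readonly tails = new Map<string, Promise<void>>();

  async run<T>(key: string, task: () => Promise<T>): Promise<T> {
    // Wait for whatever is currently queued on this key (or nothing).
    const previous = this.tails.get(key) ?? Promise.resolve();
    let release!: () => void;
    const current = new Promise<void>((resolve) => {
      release = resolve;
    });
    // Later callers on the same key queue behind this task, in call order.
    const tail = previous.then(() => current);
    this.tails.set(key, tail);
    await previous;
    try {
      return await task();
    } finally {
      release();
      // Clean up once nothing else has queued behind this task.
      if (this.tails.get(key) === tail) {
        this.tails.delete(key);
      }
    }
  }
}

// Hypothetical key scheme: services in one compose project share a key.
function lockKeyFor(container: { id: string; composeProject?: string }): string {
  return container.composeProject !== undefined
    ? `compose:${container.composeProject}`
    : `container:${container.id}`;
}
```

Chaining on a per-key tail promise keeps waiters in call order without polling. The lock is released in `finally`, so a failed task does not block later tasks on the same key.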
+- **Notification outbox with retry and dead-letter queue ([commits `a9561d93`, `7d2ef6eb`, `b215d295`, `ce26bece`](https://github.com/CodesWhat/drydock/commits/feature/v1.5-rc17)).** New `notificationOutbox` LokiJS collection and `app/notifications/outbox-worker.ts` background worker provide durable retry semantics for notification dispatch. On failure, the delivery intent is persisted to the outbox and the worker retries on a periodic drain with exponential backoff + jitter. After a configurable number of failed attempts (default 5) entries transition to the dead-letter queue; delivered and dead-letter entries are auto-purged past TTL (default 30 days). New `/api/notifications/outbox` REST surface lets operators list entries, retry from the DLQ, or discard. -- **[#374](https://github.com/CodesWhat/drydock/issues/374) β€” Security scans no longer hand Trivy a raw registry v2 API URL, which had caused every scan in controller mode to fail.** `resolveContainerImageFullName` (`app/api/container/shared.ts`), used by both the security scan scheduler and the container API, falls back to composing the image reference directly from `container.image.registry.url` whenever the container's registry component is not present in the controller's registry state β€” the normal situation in controller mode (`DD_LOCAL_WATCHER=false`), where registries are configured on the agents rather than on the controller. `registry.url` is stored in the registry v2 API base form (e.g. `https://registry-1.docker.io/v2`), so the fallback produced references such as `https://registry-1.docker.io/v2/dgtlmoon/sockpuppetbrowser:0.0.3`; Trivy then interpreted the scheme as a hostname and every scan failed with `dial tcp: lookup https`. The fallback now mirrors `Registry.getImageFullName`: it strips the URL scheme and the `/v2` path segment and uses an `@` separator for digest references, yielding a plain `registry-1.docker.io/dgtlmoon/sockpuppetbrowser:0.0.3` reference. 
Containers whose registry component is available are unaffected — they already resolved through the correct `getImageFullName` path.
+- **Notification outbox UI ([branch `feature/v1.5-rc17`](https://github.com/CodesWhat/drydock/commits/feature/v1.5-rc17)).** New `Notification outbox` page (route `/notifications/outbox`, nav under *Settings*) with status tabs (Dead-letter / Pending / Delivered) and retry / discard actions.
-## [1.5.0-rc.23] — 2026-05-16
+- **Cancel queued or in-flight updates ([commits `4b79e3ac`, `79487115`](https://github.com/CodesWhat/drydock/commits/feature/v1.5-rc17)).** `POST /api/operations/:id/cancel` now accepts both queued and in-progress operations. Queued ops are marked failed immediately; in-progress ops are flagged via a `cancelRequested` field and the lifecycle observes the flag at three safe checkpoints.
-### Added
+- **Global concurrent-update cap (`DD_UPDATE_MAX_CONCURRENT`).** A new counting semaphore provides a configurable global gate on how many update lifecycles run simultaneously. Default `0` = unlimited. Positive integer `N` means at most N updates run concurrently. Self-update operations bypass the global cap.
-- **Self-update now works when Drydock reaches the Docker daemon over a TCP host, not only through a bind-mounted `/var/run/docker.sock` (commit [`fc34ffb9`](https://github.com/CodesWhat/drydock/commit/fc34ffb9)).** The self-update helper container — the short-lived container that outlives Drydock to stop the old instance, health-check the replacement, and commit or roll back — was hardcoded to a bind-mounted Unix socket and aborted with `Self-update requires the Docker socket to be bind-mounted` whenever Drydock's watcher was configured with a TCP `host`. That is the normal setup when a Docker socket proxy (such as sockguard or `docker-socket-proxy`) mediates daemon access, so self-update was unavailable for those deployments even though every other container updated correctly.
`resolveHelperDockerConnection` now inspects the watcher's Dockerode connection: a TCP host produces a TCP helper that is attached to Drydock's own Docker network (the container's `NetworkMode` is cloned so the helper can resolve the proxy by DNS) and receives `DD_SELF_UPDATE_DOCKER_HOST` / `DD_SELF_UPDATE_DOCKER_PORT` / `DD_SELF_UPDATE_DOCKER_PROTOCOL` instead of a socket bind mount; `runSelfUpdateController` builds a TCP Dockerode client from those variables and skips the socket-only API-version probe and redirect guard. The bind-mounted-socket path is unchanged. When self-update runs through a *filtering* socket proxy the Drydock container must carry the proxy's ownership label so the helper is permitted to stop and replace it — see `content/docs/current/configuration/self-update/index.mdx`.
+- **Health-gate SSE heartbeat (`DD_UPDATE_HEALTH_GATE_HEARTBEAT_MS`).** While drydock waits for a new container to pass its health gate, a periodic heartbeat re-emits `phase: 'health-gate'` at a configurable interval (default 10 s). `DD_UPDATE_HEALTH_GATE_HEARTBEAT_MS=0` disables heartbeats; values below 1000 ms or non-integers fail fast at startup.
-- **The per-container Update button is locked with a `Self-update unavailable` indicator when Drydock cannot update itself in the current deployment (commit [`cf777280`](https://github.com/CodesWhat/drydock/commit/cf777280)).** A new hard `self-update-unavailable` update-eligibility blocker is raised for the Drydock self-container when an update is available but self-update can run neither over a bind-mounted socket nor a TCP host — i.e. the watcher uses a Unix socket and `/var/run/docker.sock` is not present in the container. The blocker locks the per-row Update button with an explanatory tooltip and makes `POST /containers/:id/update` return `409`, instead of the previous behaviour where the button appeared actionable and the update failed mid-flight with a socket error.
Deployments that reach Docker over TCP report self-update as available, so the button is unaffected there. The check fails open: when the watcher cannot be resolved the blocker is not raised.
+- **Self-update now works when Drydock reaches the Docker daemon over a TCP host, not only through a bind-mounted `/var/run/docker.sock` (commit [`fc34ffb9`](https://github.com/CodesWhat/drydock/commit/fc34ffb9)).** `resolveHelperDockerConnection` now inspects the watcher's Dockerode connection: a TCP host produces a TCP helper attached to Drydock's own Docker network. The bind-mounted-socket path is unchanged.
-- **i18n coverage extended to the remaining hardcoded UI strings across 28 components ([discussion #329](https://github.com/CodesWhat/drydock/discussions/329), commit [`1b65e591`](https://github.com/CodesWhat/drydock/commit/1b65e591)).** A full audit of all 82 UI components found approximately 110 English strings that bypassed `vue-i18n` and rendered raw regardless of the active locale. The extraction sweep covers: `AppLayout` search scopes, group labels, section subtitles, and the five deprecation banner bodies (converted to `` so embedded `` elements stay translatable); `ThemeToggle` variant names; `DataFilterBar` view-mode names; `DetailPanel` size labels (S / M / L); update-kind labels (`Major` / `Minor` / `Patch` / `Digest`) in `ContainerFullPageDetail` and `ContainerFullPageTabContent`, which are now reactive `computed` maps so locale switches take effect without a page reload; tail/status labels and the stdout/stderr stream-type labels in `ContainerLogs`; action tooltips and button labels in `ContainersGroupedViews`; error messages and empty-state fallback labels across the Agents, Config, Registries, Triggers, Watchers, Notifications, NotificationOutbox, and Security views; and the `WATCHING` watcher-status badge, which was rendering the raw backend enum string.
New keys land in `en/appShell.json`, `en/containerComponents.json`, `en/listViews.json`, `en/sharedComponents.json`, `en/configView.json`, `en/agentsView.json`, and `en/notificationOutboxView.json`. Non-translatable identifiers — the product name "Drydock" and format strings such as `spdx-json` / `cyclonedx-json` — are intentionally left as literals. Other locales pick up the new keys via the `en` fallback immediately and will receive human translations on the next Crowdin sync.
+- **The per-container Update button is locked with a `Self-update unavailable` indicator when Drydock cannot update itself in the current deployment (commit [`cf777280`](https://github.com/CodesWhat/drydock/commit/cf777280)).** A new hard `self-update-unavailable` update-eligibility blocker is raised when self-update cannot run over either a bind-mounted socket or a TCP host.
-### Changed
+- **i18n coverage extended to the remaining hardcoded UI strings across 28 components ([discussion #329](https://github.com/CodesWhat/drydock/discussions/329)).** All 16 non-English locales now have full key parity with the English source. 17 locales ship in the picker: de, es, fr, it, nl, pl, pt-BR, tr, zh-CN, zh-TW, ar, ja, ko, ru, uk, vi, plus English.
-- **Self-update helper now prefers the bind-mounted Docker socket over a TCP watcher connection (commit [`aa828d88`](https://github.com/CodesWhat/drydock/commit/aa828d88)).** The previous `resolveHelperDockerConnection` logic checked the watcher's TCP modem first, meaning that any deployment where Drydock was configured with a TCP host (e.g. routing through a socket proxy) would always route the helper through that proxy — even when the target container itself had `/var/run/docker.sock` bind-mounted. For infrastructure updates (`dd.update.mode=infrastructure`), where the container being replaced *is* the socket proxy, this is fatal: the helper relies on the proxy being up, but the update stops it.
The resolution order is now inverted: `findDockerSocketBind` runs first, and if the target container carries a socket bind the helper uses that direct socket path regardless of the watcher's TCP configuration. The TCP path is preserved as the fallback for pure socket-less deployments where Drydock reaches Docker exclusively over a remote host.
+- **`DD_AGENT_ALLOW_INSECURE_SECRET` escape hatch for closed-LAN deployments.** rc.20 tightened the agent-secret-over-HTTP check to a hard error. rc.21 introduces `DD_AGENT_ALLOW_INSECURE_SECRET=true` as an explicit controller-side opt-in for environments where the operator accepts that the agent secret travels in cleartext. Default behavior is unchanged.
-### Fixed
+- **Security scan digest mode.** Every scan cycle now carries a stable `cycleId` (UUID v7) and emits a `security-scan-cycle-complete` event. Triggers can configure `SECURITYMODE=digest` (or `batch+digest`) to receive one summary per cycle. Templates are customizable via `SECURITYDIGESTTITLE` / `SECURITYDIGESTBODY`. ([#300](https://github.com/CodesWhat/drydock/discussions/300))
-- **Dashboard Host Status widget no longer auto-scrolls to the last host when the host list changes (commit [`cbe815a6`](https://github.com/CodesWhat/drydock/commit/cbe815a6)).** The full-mode host list used `scroll-snap-type: y mandatory` with a measured tail spacer. Whenever the host-row set changed — a watcher or agent added, removed, or renamed, or a full-to-compact mode transition — Chromium re-snapped to the last row's snap point, leaving only the final host visible above a large empty gap. The scroll-snap classes (`snap-y`, `snap-mandatory`, `snap-start`), the dynamic tail-spacer element, and the measurement machinery behind it (the `onUpdated` hook, `requestAnimationFrame` scheduler, and `ResizeObserver`-triggered recompute) have all been removed.
The content-aware full/compact sizing that keeps whole rows visible was already sufficient; the snapping added no functional value and actively fought the layout on every data change.
+- **Opt-in scheduled-scan notifications** — New `DD_SECURITY_SCAN_NOTIFICATIONS=true` flag enables `security-alert` event emission from scheduled scans. Default is `false`; on-demand scans always emit.
-- **Dashboard Resource Usage widget minimum height raised so per-container CPU and Memory lists stay visible (commit [`59719757`](https://github.com/CodesWhat/drydock/commit/59719757)).** The `resource-usage` widget's `minH` was set to 3 grid units (approximately 122 px), which falls below the 180 px threshold at which the per-container lists collapse out of view. The minimum is now 7 grid units (approximately 306 px). `applyConstraints` clamps any saved layout item that is below the new minimum on load, so existing dashboard configurations with a shrunken resource-usage widget are silently corrected on the next render rather than persisting an unusable layout.
+- **Bulk security scan endpoint** — `POST /api/v1/containers/scan-all` scans all (or a filtered subset of) watched containers server-side, streams per-container progress over the existing scan SSE channel, and honors client-disconnect aborts. Rate-limited to 1 request / 60s per IP (authenticated-admin bypass).
-- **`AgentClient` timers are now cleared when an agent is removed, preventing orphaned timeouts (commit [`03bf7211`](https://github.com/CodesWhat/drydock/commit/03bf7211)).** `AgentClient` maintains two `setTimeout` handles — `stableConnectionTimer` (arms 30 s after the SSE response arrives to reset the backoff counter) and `reconnectTimer` (fires the next reconnect attempt after the exponential-backoff delay). Neither was cancelled when `removeAgent` spliced the client out of the manager's list.
An agent removed mid-reconnect-cycle or mid-stability-window would keep an armed timer alive indefinitely, potentially triggering a `startSse` call against a client that was no longer tracked and leaking the associated resources. A new idempotent `stop()` method on `AgentClient` cancels both timers and nulls the handles; `removeAgent` now calls `stop()` on each matching client before splicing it.
+- **SSE Last-Event-ID replay ([#289](https://github.com/CodesWhat/drydock/issues/289))** — The server stamps every broadcast event with a monotonic `:` id and retains a 5-minute time-bounded ring buffer. Clients reconnecting with a `Last-Event-ID` header receive every event they missed; if the buffer has evicted the requested id, the client receives a `dd:resync-required` event.
-### Security
+- **Update-eligibility blockers on container rows** — The backend surfaces 12 structured blocker reasons per container, rendered inline on the Containers list so users see *why* a row isn't updating without opening the detail drawer.
-- **TCP Docker host is validated before the self-update controller passes it to Dockerode (commit [`441b4358`](https://github.com/CodesWhat/drydock/commit/441b4358)).** `DD_SELF_UPDATE_DOCKER_HOST` was forwarded to Dockerode without sanitization. A new `validateTcpDockerHost` function rejects values that contain a URL scheme prefix (`tcp://`, `http://`, `https://`, or any `://` form), a userinfo segment (`@`), whitespace, or path separators (`/` or `\`), throwing a descriptive error before any network connection is attempted. This prevents an environment variable or compose-file value from inadvertently injecting a path or URL component that Dockerode would interpret in an unexpected way. The validated host and resolved port are also logged at `INFO` level so the connection target is auditable in the container logs.
`runSelfUpdateController` was additionally refactored to remove a control-flow asymmetry: socket and TCP paths previously diverged into separate Dockerode-construct-and-run blocks; they now share a single tail (`disableSocketRedirects` remains socket-only).
+- **`GET /update-operations/:id` endpoint** — Returns the current state of a specific update operation for reconciliation when the terminal SSE is missed.
-- **OIDC error logs now redact RFC-1918 IP addresses and absolute filesystem paths (commit [`9b79de77`](https://github.com/CodesWhat/drydock/commit/9b79de77)).** The rc.22 `getErrorChainMessage` improvement walks `error.cause` chains up to depth 5 and appends the results to OIDC warn logs, which is the right diagnostic behaviour — but TLS and connection errors in Node/undici frequently include private network addresses (e.g. `connect ECONNREFUSED 10.0.0.5:2376`) and absolute filesystem paths (e.g. `error loading /etc/ssl/certs/ca-bundle.pem`) that should not appear in logs shipped to centralised observability systems. `sanitizeOidcErrorMessage` now applies two additional redaction passes after the existing URL and bearer-token passes: RFC-1918 IPv4 ranges (10.x, 172.16–31.x, 192.168.x) with an optional port are replaced with `[internal-addr]`; absolute Unix filesystem paths of two or more segments are replaced with `[path]`. Public IP addresses are not redacted so legitimate DNS and routing errors remain actionable. The new passes run on the fully-assembled error-chain string, so they cover nested cause messages as well as the top-level error.
+- **Inline update action in Security view ([#299](https://github.com/CodesWhat/drydock/discussions/299))** — Image rows in the Security view now show an "Update" action button directly next to the vulnerability data when a newer image is available.
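The two extra redaction passes described for `sanitizeOidcErrorMessage` could look roughly like the following. The regular expressions and the `redactInternalDetails` name are assumptions for illustration, not the project's actual patterns. In this sketch, a lookbehind keeps the path pass from touching `scheme://host/path` URLs, which the real code handles with its earlier URL pass.

```typescript
// Hypothetical patterns; the real sanitizeOidcErrorMessage may differ.
// RFC 1918 IPv4 ranges (10/8, 172.16/12, 192.168/16) with an optional :port.
const PRIVATE_IPV4_WITH_PORT =
  /\b(?:10(?:\.\d{1,3}){3}|172\.(?:1[6-9]|2\d|3[01])(?:\.\d{1,3}){2}|192\.168(?:\.\d{1,3}){2})(?::\d{1,5})?\b/g;

// Absolute Unix paths with at least two segments, e.g. /etc/ssl/certs/ca.pem.
// The lookbehind skips slashes that belong to URLs ("://") or words ("a/b").
const ABSOLUTE_UNIX_PATH = /(?<![\w.:\/])\/[\w.-]+(?:\/[\w.-]+)+/g;

function redactInternalDetails(message: string): string {
  return message
    .replace(PRIVATE_IPV4_WITH_PORT, '[internal-addr]')
    .replace(ABSOLUTE_UNIX_PATH, '[path]');
}
```

Public addresses such as `8.8.8.8:53` match neither pattern, so DNS and routing errors stay actionable, in line with the entry.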
-## [1.5.0-rc.22] — 2026-05-15
+- **Watcher next-run metadata ([#288](https://github.com/CodesWhat/drydock/issues/288))** — Watcher API and Agents view now show when each watcher will next poll for updates, with an absolute-timestamp tooltip on hover.
-### Added
+- **Backend-driven update queue** — Container updates are now queued server-side with per-trigger concurrency limits. UI shows Queued → Updating → Updated state progression with sequence labels (e.g. "Updating 1 of 3").
-- **All 16 non-English locales now have full key parity with the English source (commits [`5e463631`](https://github.com/CodesWhat/drydock/commit/5e463631), [`012dcb83`](https://github.com/CodesWhat/drydock/commit/012dcb83)).** Two complementary passes bring every translation up to date. The first pass (`5e463631`) filled gaps in the ten locales that were already mostly translated (de, es, fr, it, nl, pl, pt-BR, tr, zh-CN, zh-TW) — each was missing `notificationOutboxView.json` entirely and had drifted behind recent string extractions in `listViews.json`, `containerComponents.json`, `containersView.json`, and `dashboardView.json` (new keys: `digestLabel`, `blockedTag` variants, `manualUpdateOnly` variants, `narrowViewportSuffix`, `autoHiddenBadgeTooltip`, queued-update toast variants, `recentUpdates.widgetAria`). A JSON-breaking typo in `de/dashboardView.json` (straight quote instead of closing curly quote) was also corrected. The second pass (`012dcb83`) gave the six stub locales (ar, ja, ko, ru, uk, vi) — which had been scaffolded with English placeholders since rc.20 — a full translation pass across all 13 namespace files plus the new `notificationOutboxView.json`. Brand names, acronyms, and interpolation placeholders are preserved verbatim; DevOps terminology follows each language's established conventions.
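A key-parity check like the one behind these two passes can be expressed as a diff of flattened key paths between the English catalog and a target locale. The helper names `flattenKeys` and `missingKeys` are hypothetical. The sketch also assumes that catalogs are nested objects whose leaves are strings, as with the per-namespace JSON files.

```typescript
// Hypothetical key-parity check between an English catalog and a translation.
type Catalog = { [key: string]: string | Catalog };

// Flattens nested catalogs to dotted paths, e.g. { a: { b: 'x' } } gives ['a.b'].
function flattenKeys(catalog: Catalog, prefix = ''): string[] {
  const keys: string[] = [];
  for (const [key, value] of Object.entries(catalog)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (typeof value === 'string') {
      keys.push(path);
    } else {
      keys.push(...flattenKeys(value, path));
    }
  }
  return keys;
}

// Keys present in the source catalog but absent from the target catalog.
function missingKeys(source: Catalog, target: Catalog): string[] {
  const present = new Set(flattenKeys(target));
  return flattenKeys(source).filter((key) => !present.has(key));
}
```

A namespace file that is missing entirely, such as `notificationOutboxView.json` in the ten partially translated locales, behaves like an empty target here, so every one of its keys would be reported.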
+- **Registry 429 / 503 retry with Retry-After and per-host token bucket ([commit `ffd1b57b`](https://github.com/CodesWhat/drydock/commit/ffd1b57b), [#342](https://github.com/CodesWhat/drydock/issues/342)).** A new `withRetry` helper wraps every registry HTTP call: on 429 or 503 it honors the upstream `Retry-After` header, then falls back to exponential backoff (1 s / 2 s / 4 s, capped at 60 s), up to 3 retries. A new per-host token bucket prevents the watcher from self-inflicting rate limits during a large cron cycle.
-### Changed
+- **Release notes inline popover ([commit `09475fa6`](https://github.com/CodesWhat/drydock/commit/09475fa6)).** The release-notes icon on container rows now opens an inline popover showing both the current and the available-version release notes side by side, with expand/collapse per panel.
-- **Playwright E2E tests moved to a dedicated workflow file (`e2e-playwright.yml`) (commit [`f0989301`](https://github.com/CodesWhat/drydock/commit/f0989301)).** OSSF Scorecard's CI-Tests check scores from the github-actions Check Suite conclusion, not individual check-run conclusions. Because every job in `ci-verify.yml` rolled into a single suite, one failing Playwright assertion would flip the entire suite to failure and cause Scorecard to mark merged PRs as untested — even when all other jobs were green (manifesting as code-scanning alert #43, CI-Tests score 9/10). Each workflow file gets its own Check Suite per commit; isolating Playwright into `e2e-playwright.yml` means a Playwright failure no longer drags the `ci-verify` suite down for Scorecard's purposes. Branch protection continues to gate on the "🎭 E2E: Playwright" status check (matched by job name, not workflow file), and `release-cut.yml` now polls both workflows on the target SHA so releases still require Playwright success.
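The retry and rate-limit behaviour in the registry entry above can be sketched as a delay calculation plus a per-host token bucket. The 1 s / 2 s / 4 s schedule, the 60 s cap and the three-retry limit come from the entry. Everything else is an assumption for illustration: the names `retryDelayMs`, `shouldRetry`, `TokenBucket` and `bucketFor`, the bucket's burst size and refill rate, and the choice to apply the 60 s cap to server-supplied `Retry-After` values as well.

```typescript
// Hypothetical sketch; not drydock's actual withRetry implementation.
const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 60_000;
const MAX_RETRIES = 3;

// Retry-After is either delay-seconds or an HTTP-date (RFC 9110, section 10.2.3).
function parseRetryAfterMs(header: string | null, nowMs: number): number | undefined {
  if (header === null) return undefined;
  const value = header.trim();
  if (/^\d+$/.test(value)) return Number(value) * 1000;
  const at = Date.parse(value);
  return Number.isNaN(at) ? undefined : Math.max(0, at - nowMs);
}

// attempt is 0-based: 0 -> 1 s, 1 -> 2 s, 2 -> 4 s when no usable Retry-After is sent.
// Assumption: the 60 s cap also bounds server-supplied Retry-After values.
function retryDelayMs(attempt: number, retryAfter: string | null, nowMs: number): number {
  const backoff = BASE_DELAY_MS * 2 ** attempt;
  return Math.min(parseRetryAfterMs(retryAfter, nowMs) ?? backoff, MAX_DELAY_MS);
}

function shouldRetry(status: number, attempt: number): boolean {
  return (status === 429 || status === 503) && attempt < MAX_RETRIES;
}

// Classic token bucket: `capacity` burst, refilled continuously at `refillPerSec`.
class TokenBucket {
  private tokens: number;
  private lastMs: number;
  private readonly capacity: number;
  private readonly refillPerSec: number;

  constructor(capacity: number, refillPerSec: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity;
    this.lastMs = nowMs;
  }

  tryTake(nowMs: number): boolean {
    const elapsedSec = Math.max(0, nowMs - this.lastMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// One bucket per registry host, created lazily.
const bucketsByHost = new Map<string, TokenBucket>();

function bucketFor(host: string, nowMs: number): TokenBucket {
  let bucket = bucketsByHost.get(host);
  if (!bucket) {
    bucket = new TokenBucket(10, 2, nowMs); // hypothetical burst of 10, refill 2 per second
    bucketsByHost.set(host, bucket);
  }
  return bucket;
}
```

Passing `nowMs` explicitly keeps both pieces deterministic under test. A caller would pass `Date.now()` in production.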
+- **Container source project shortcut link ([Discussion #295](https://github.com/CodesWhat/drydock/discussions/295))** — Containers now render a clickable "View project" link next to release notes when an `org.opencontainers.image.source` OCI label, `dd.source.repo` override label, or GHCR-derived source URL is available.
-### Fixed
+- **Actionable deprecation banners ([Discussion #214](https://github.com/CodesWhat/drydock/discussions/214))** — The 5 deprecation warning banners now show the concrete migration action inline and include a "View migration guide" link that deep-jumps to the relevant anchored section of the deprecations docs page.
-- **[#368](https://github.com/CodesWhat/drydock/issues/368) — OIDC custom-dispatcher paths (cafile / `DD_AUTH_OIDC_*_INSECURE=true`) no longer fail with an opaque `TypeError: fetch failed` on Node 24.** Node 24 ships built-in undici 7.21.0 (v1 dispatcher interface) while the app's userland `undici@8` (bumped in rc.20) exposes an `Agent` with the v2 dispatcher interface. The OIDC custom fetch was constructing the v2 `Agent` from userland undici and passing it as `dispatcher` to Node's global `fetch`, which is bound to the built-in undici 7. The v2 `Agent`'s handlers don't satisfy the v1 contract, so the request silently fails — the surface symptom reported by a user upgrading rc.19 → rc.21 against self-signed Authentik with `DD_AUTH_OIDC_AUTHENTIK_INSECURE=true`. The undici project's `Dispatcher1Wrapper` bridge (nodejs/undici#4827) covers this mismatch on Node 22 but is absent on Node 24. The fix imports `fetch` from `undici` and uses it whenever a custom dispatcher is required (cafile or insecure path) so both halves share the same dispatcher version. The non-insecure code path is unchanged — openid-client continues to use its default fetch when no custom dispatcher is needed.
A strict-`tsc` type error introduced in the same fix (undici's nominal `RequestInfo`/`Response` types differ from the `lib.dom` types that `openid-client`'s `CustomFetch` is typed against) was resolved by casting through `unknown` at the boundary using `Parameters` and `ReturnType`; there is no runtime behavior change.
+- **Notification dropdown rework + themeable zebra stripes ([Discussion #267](https://github.com/CodesWhat/drydock/discussions/267))** — Header carries the "Notifications" title plus a "Clear" text button, each row shows a per-entry dismiss affordance, and a split footer exposes "Mark all as read" + "Open audit log". Introduces `--dd-zebra-stripe`, a new theme token.
-- **OIDC warn logs now surface the full `error.cause` chain, making TLS and DNS failures actionable (commit [`720d99a3`](https://github.com/CodesWhat/drydock/commit/720d99a3)).** `undici`'s fetch surfaces failures as a generic `TypeError: fetch failed`; the actionable diagnostic (`ENOTFOUND`, `ECONNREFUSED`, `UNABLE_TO_VERIFY_LEAF_SIGNATURE`, etc.) lives on `error.cause`, sometimes nested. The previous error sanitizer logged only the top-level message, so issue #368 reached us with only `"Unable to initialize OIDC session (fetch failed)"` — no indication whether DNS, TLS, or routing was at fault. A new `getErrorChainMessage` helper walks `error.cause` up to depth 5, joining parts with ` ← ` and appending `[code]` when a `code` property is present; a `WeakSet` guards against cyclic cause chains. `sanitizeOidcErrorMessage` now uses it so all OIDC warn logs include the cause chain (still passed through the existing URL and token redaction). This is a forward-only diagnostic improvement with no runtime behavior change for healthy OIDC paths.
+- **Notification history store** — New LokiJS collection (`notifications_history`) records a per-(trigger, container, event-kind) result hash so `once=true` dedup survives process restarts.
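The cause-chain walk described for `getErrorChainMessage` might be sketched as below. The function name `errorChainMessage` and the `ErrorLike` type are hypothetical. Whether "depth 5" counts the top-level error is also an assumption: here it does, so at most five messages are joined.

```typescript
// Hypothetical sketch of a cause-chain formatter; not the project's helper.
// Assumption: "depth 5" counts the top-level error plus up to four causes.
const MAX_CAUSE_DEPTH = 5;

interface ErrorLike {
  message?: unknown;
  code?: unknown;
  cause?: unknown;
}

function errorChainMessage(error: unknown): string {
  const parts: string[] = [];
  const seen = new WeakSet<object>(); // guards against cyclic cause chains
  let current: unknown = error;

  for (let depth = 0; depth < MAX_CAUSE_DEPTH; depth += 1) {
    if (current === null || current === undefined) break;
    if (typeof current !== 'object') {
      parts.push(String(current)); // e.g. a cause that is a plain string
      break;
    }
    const node = current as object;
    if (seen.has(node)) break;
    seen.add(node);

    const { message, code, cause } = node as ErrorLike;
    const text = typeof message === 'string' ? message : String(node);
    parts.push(code === undefined ? text : `${text} [${String(code)}]`);
    current = cause;
  }
  return parts.join(' ← ');
}
```

For an undici-style failure, this turns `fetch failed` into something like `fetch failed ← getaddrinfo ENOTFOUND idp.internal [ENOTFOUND]`, which is the kind of detail the entry says was missing from #368.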
-- **[#362](https://github.com/CodesWhat/drydock/issues/362) — SSE reconnect exponential backoff no longer collapses to a flat 1 s loop when the agent is struggling.** `AgentClient.startSse()` previously called `this.reconnectAttempts = 0` the instant the axios response headers arrived — before the stream had proven it could stay open. A crash-looping agent, a reverse-proxy with a short upstream idle timeout, or any situation where the SSE stream returned HTTP 200 and then ended almost immediately would cycle as: connect → 200 → `reconnectAttempts = 0` → stream ends → `scheduleReconnect()` (delay = 1 000 ms, attempts → 1) → 1 s later connect → 200 → `reconnectAttempts = 0` again — and so on forever. The user who filed #362 saw `SSE stream ended. Reconnecting...` in their controller logs every ~1.00 s indefinitely, with no escalation. The backoff now only resets after the stream has stayed open for `SSE_STABLE_CONNECTION_MS` (30 s). A `setTimeout` is armed when the response arrives and cancelled by `scheduleReconnect()` if the stream ends or errors before the window expires; streams that end early therefore keep their accumulated `reconnectAttempts` and the delay continues to double up to the 60 s cap as intended.
+- **Floating tag detection and UI indicator** — New `tagPrecision` classifier (`specific` | `floating`) detects mutable version aliases and auto-enables digest watching on non-Docker Hub registries. Container detail views show a caution badge when a floating tag is detected without digest watching enabled. ([Discussion #178](https://github.com/CodesWhat/drydock/discussions/178))
-## [1.5.0-rc.21] — 2026-05-15
+- **Hide Pinned containers toggle** — Checkbox in the container list filter bar hides containers pinned to specific versions. Persisted in user preferences.
([Discussion #250](https://github.com/CodesWhat/drydock/discussions/250))
-### Added
+- **Combined batch+digest notification mode** — Triggers can now use `MODE=batch+digest` to send both immediate batch emails and scheduled digest summaries. ([#254](https://github.com/CodesWhat/drydock/issues/254))
-- **i18n coverage extended to the notification outbox, notification rules, and registry/server status badges ([discussion #329](https://github.com/CodesWhat/drydock/discussions/329)).** Four UI surfaces still rendered hardcoded English even under the zh-CN locale: `NotificationOutboxView` (tab labels, table headers, action buttons, toast messages), the notification rule name/description column in `NotificationsView`, and the connection-status badges in `RegistriesView` and `ServersView`. All four are now extracted into the existing `t()` catalogs (new `notificationOutboxView` namespace; new entries under `notificationsView.rules.*`, `registriesView.status.*`, and `serversView.status.*`). Rule names and statuses use `te()` so backend-supplied custom names fall back to the raw string when the catalog has no entry. Translation files for other locales will pick up the new keys on the next Crowdin sync.
+- **Multi-select event-type filter in audit log ([commit `5e2d0c70`](https://github.com/CodesWhat/drydock/commit/5e2d0c70), [Discussion #332](https://github.com/CodesWhat/drydock/discussions/332)).** The audit log's event-type filter is now a checkbox dropdown supporting any combination of event categories simultaneously.
-- **`DD_AGENT_ALLOW_INSECURE_SECRET` escape hatch for closed-LAN deployments.** rc.20 tightened the agent-secret-over-HTTP check from a warning to a hard error in `app/agent/AgentClient.ts`. rc.21 introduces `DD_AGENT_ALLOW_INSECURE_SECRET=true` as an explicit controller-side opt-in for environments (isolated private LANs, air-gapped setups) where the operator accepts that the agent secret travels in cleartext.
Default behavior is unchanged — without the flag the boot-time error is still thrown. When the flag is set to exactly `true`, the error is downgraded to a `log.warn` on every startup so the security signal is preserved and visible in logs. Any other value (e.g. `1`, `yes`, `TRUE`) continues to throw. See `content/docs/current/configuration/agents/index.mdx` for guidance on recommended alternatives (certfile/cafile, reverse proxy TLS termination).
+- **Bearer token auth for `/metrics` endpoint** — Set `DD_SERVER_METRICS_TOKEN` to authenticate Prometheus scrapers via `Authorization: Bearer `.
-### Changed
+- **Disable default local watcher** — Set `DD_LOCAL_WATCHER=false` to prevent the built-in Docker watcher from starting, useful for controller-only nodes that manage remote agents exclusively.
-- **Default watcher cron relaxed from hourly to every 6 hours ([#342](https://github.com/CodesWhat/drydock/issues/342) follow-up).** `app/watchers/providers/docker/Docker.ts` now defaults `cron` to `0 */6 * * *` (every 6 hours) instead of `0 * * * *` (hourly). Hourly polling was the most aggressive default among active 2026 update managers — Diun ships `0 */6 * * *`, Watchtower (archived Dec 2025) defaulted to 24 h, and our upstream WUD ships no default at all. With fleets of 20+ containers and image tag lists that paginate to thousands of entries (immich-server is 24+ GHCR pages), hourly polling saturates anonymous Docker Hub limits (100 pulls / 6 h) and trips GitHub's 5 k req/h release-notes ceiling. rc.20's per-host token bucket + `Retry-After` handling stays as the safety net; the default change addresses the root cause. Users who set `DD_WATCHER_{name}_CRON` explicitly are unaffected. Users who want near-real-time detection (security patches) can still set `DD_WATCHER_{name}_CRON=0 * * * *`.
Docs (`content/docs/current/configuration/watchers/index.mdx`, `content/docs/current/api/watcher.mdx`, `content/docs/current/api/agent.mdx`) updated to reflect the new default.
+- **Multi-server notification identification ([#283](https://github.com/CodesWhat/drydock/issues/283))** — Notifications automatically include a `[server-name]` prefix when agents are registered. The controller name is configurable via `DD_SERVER_NAME`. Custom templates can use `container.notificationServerName` and `container.notificationAgentPrefix`.
-### Fixed
+- **Infrastructure update mode** — Setting the `dd.update.mode=infrastructure` label on a socket proxy container enables a helper-swap update path that bypasses the socket proxy.
-- **[discussion #295](https://github.com/CodesWhat/drydock/discussions/295) — Release-notes icon in the container table now always opens the same popover, even when only an external release URL is available.** Previously the file-text icon in the actions column had two different behaviors depending on what release-notes metadata we'd fetched for the container: containers with structured notes (title + body, e.g. from the GitHub release-notes provider) got an icon button that opened a popover with an expandable preview; containers with only a bare `releaseLink` URL got an icon that was a direct external `` — no popover. The popover shell now renders uniformly for both cases. When only `releaseLink` is available, the popover contains a single row that links out to the external URL (with an `external-link` indicator instead of the chevron used by expandable rows), and clicking the row dismisses the popover before navigating. No change for containers that already have structured notes — the existing popover and inline-expander behavior are unchanged.
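The `DD_AGENT_ALLOW_INSECURE_SECRET` rule from the rc.21 entry above reduces to a few lines: only the exact string `true` downgrades the boot-time error to a warning. `assertAgentSecretTransport` and its parameters are hypothetical. The real check lives in `app/agent/AgentClient.ts` and may be structured differently.

```typescript
// Hypothetical sketch of the controller-side check; names and messages are illustrative.
interface WarnLogger {
  warn(message: string): void;
}

type Env = Record<string, string | undefined>;

function assertAgentSecretTransport(
  protocol: 'http' | 'https',
  hasSecret: boolean,
  env: Env,
  log: WarnLogger,
): void {
  if (protocol === 'https' || !hasSecret) return; // nothing sensitive in cleartext
  // Only the exact string "true" opts in; "1", "yes" or "TRUE" still throw.
  if (env.DD_AGENT_ALLOW_INSECURE_SECRET === 'true') {
    log.warn('Agent secret is sent over plain HTTP (DD_AGENT_ALLOW_INSECURE_SECRET=true)');
    return;
  }
  throw new Error('Refusing to send the agent secret over plain HTTP');
}
```

Running the check on every boot, rather than once, matches the entry's point that the warning stays visible in the logs for as long as the flag is set.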
+- **i18n framework migration ([refs #329](https://github.com/CodesWhat/drydock/issues/329)).** Bulk vue-i18n migration into per-namespace JSON catalogs under `ui/src/locales/en/` (eight namespaces auto-loaded by `import.meta.glob` in `boot/i18n.ts`). Foundation for the Crowdin integration. 17 locales ship in the picker.
-- **Docker multi-arch build no longer fails when Alpine repos drift between archs.** `Dockerfile` pinned `curl=8.17.0-r1`, which broke the `linux/arm64` build leg after Alpine's `latest-stable/aarch64` mirror rotated to `curl-8.19.0-r0` while `latest-stable/x86_64` was still on `8.17.0-r1`. No single pin can satisfy both archs during a mirror rotation window; the curl entry is now unpinned so apk installs whatever's current per-arch. Other pinned packages still match across both archs and stay pinned.
+- **Design system components** — Added shared UI building blocks: `AppIconButton`, `AppBadge`, `StatusDot`, `DetailField`, and `AppTabBar`. ([Discussion #199](https://github.com/CodesWhat/drydock/discussions/199))
-- **[#362](https://github.com/CodesWhat/drydock/issues/362) — `DD_SESSION_SECRET` no longer crashes startup when unset; secret is auto-generated and persisted to the store on first boot.** rc.20 made `DD_SESSION_SECRET` a hard requirement (commit `b9e8be38`) to close a real issue — the prior fallback generated a fresh per-process random secret on every restart, which silently invalidated every active session whenever drydock restarted. But the hard-require shipped without a migration path: existing deployments that didn't set the variable hit an immediate boot crash on upgrade. The fallback is now restored as a *persisted* secret: on first boot without `DD_SESSION_SECRET` set, drydock generates 64 random bytes (`randomBytes(64).toString('hex')`) and writes them to a new `secrets` collection inside `/store/dd.json`. Subsequent boots read the persisted value, so sessions survive restarts.
The env var still takes precedence when set (and whitespace-only values are treated as unset). Operators upgrading from rc.20 with no `DD_SESSION_SECRET` configured will boot cleanly; deployments that already set the variable see no change. +- **Podman API version negotiation** β€” Docker watcher probes the daemon's `/version` endpoint over the Unix socket and pins Dockerode to the reported API version. Prevents `EAI_AGAIN` crashes caused by `docker-modem`'s redirect-following bug when Podman returns HTTP 301 for unversioned API paths. ([#182](https://github.com/CodesWhat/drydock/issues/182)) -## [1.5.0-rc.20] β€” 2026-05-14 +- **System log live streaming in UI** β€” Added end-to-end WebSocket support for system logs (`/api/v1/log/stream`) with new UI service/composable and live log view integration. -### Added +- **System log viewer overhaul** β€” Toolbar stays pinned at top, long lines wrap at viewport width, search matches component/level/channel fields, filter toggle shows only matching entries, sort toggle switches between oldest-first and newest-first. ([#259](https://github.com/CodesWhat/drydock/discussions/259), [#260](https://github.com/CodesWhat/drydock/discussions/260)) -- **Fleet-aggregate stats subsystem ([commits `feature/v1.5-rc17`](https://github.com/CodesWhat/drydock/commits/feature/v1.5-rc17)).** New `ContainerStatsAggregator` polls each locally-monitored container once per tick (default 10 s) and computes a fleet-wide `ContainerStatsSummary` (total CPU%, total memory, top-N rows). Two new endpoints β€” `GET /api/v1/stats/summary` and `GET /api/v1/stats/summary/stream` β€” expose the current snapshot and a live SSE feed; the dashboard Resource Usage widget now consumes the SSE stream directly, fixing the regression (introduced in rc.13 by the `?touch=false` workaround) where the widget showed zeros because the per-container cache was never warmed. 
The legacy `GET /api/v1/containers/stats` endpoint and the client-side `summarizeContainerResourceUsage` rollup have been removed. +- **Rollback shortcut in container actions menu** β€” Quick rollback option directly from the container row actions dropdown. -- **Per-container update locks ([commit `761fb834`](https://github.com/CodesWhat/drydock/commit/761fb834)).** New keyed `LockManager` primitive in `app/updates/lock-primitives.ts` replaces the module-level `pLimit(1)` that was serialising every container update across the entire process. Lock keys are derived per container (and per compose project for `Dockercompose`), so two unrelated containers can now pull and recreate concurrently while two services in the same compose project still serialise correctly. The lock primitive is its own pure-logic file with full unit tests; the docker trigger and compose subclass derive the lock key set via a new `getUpdateLockKeys(container)` method. -- **Restart recovery for queued and pulling updates ([commit `00788b13`](https://github.com/CodesWhat/drydock/commit/00788b13)).** Startup reconciliation in `app/store/update-operation.ts` is now selective: `status=queued` operations stay queued for the recovery dispatcher to pick up, and `phase=pulling` rows are reset to `queued` (pull is idempotent). All other in-progress phases β€” `prepare`, `renamed`, `new-created`, `old-stopped`, `new-started`, `health-gate`, `rollback-*` β€” remain marked failed because they leave inconsistent state that an operator should review. A new `app/updates/recovery.ts` module runs once after `registry.init()`, re-resolves trigger and container for each queued operation, and dispatches them through the existing fire-and-forget pipeline. Operations whose container or trigger no longer exists are marked failed with an explanatory `lastError` so they don't sit in the queue forever. 
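The restart-recovery entry above describes a selective startup pass. A sketch of that decision table over a simplified operation row; the `status` values other than `queued`, the reset of `phase`, and the `lastError` wording are illustrative assumptions, not taken from `app/store/update-operation.ts`:

```typescript
// Hypothetical row shape: only the fields needed to illustrate the startup rule.
type OperationStatus = "queued" | "in-progress" | "succeeded" | "failed";

interface UpdateOperation {
  id: string;
  status: OperationStatus;
  phase?: string; // e.g. "pulling", "renamed", "health-gate"
  lastError?: string;
}

function reconcileOnStartup(op: UpdateOperation): UpdateOperation {
  // Queued and terminal rows are left untouched; queued work has not touched any container yet.
  if (op.status !== "in-progress") return op;
  // Pulling is idempotent, so an interrupted pull can go back on the queue.
  if (op.phase === "pulling") {
    return { ...op, status: "queued", phase: undefined };
  }
  // Any other phase may have left a renamed, created, or stopped container behind.
  return {
    ...op,
    status: "failed",
    lastError: `Interrupted by restart during phase "${op.phase ?? "unknown"}"`,
  };
}

// Startup pass: reconciles every row and returns the ones the recovery dispatcher should pick up.
function reconcileAll(ops: UpdateOperation[]): { updated: UpdateOperation[]; toDispatch: UpdateOperation[] } {
  const updated = ops.map(reconcileOnStartup);
  return { updated, toDispatch: updated.filter((op) => op.status === "queued") };
}
```

Keeping the decision a pure function makes the "which phases are safe to resume" policy easy to unit-test separately from the store.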
-- **Notification outbox with retry and dead-letter queue ([commits `a9561d93`, `7d2ef6eb`, `b215d295`, `ce26bece`](https://github.com/CodesWhat/drydock/commits/feature/v1.5-rc17)).** New `notificationOutbox` LokiJS collection (`app/store/notification-outbox.ts`) and matching `app/notifications/outbox-worker.ts` background worker provide durable retry semantics for notification dispatch. `Trigger.dispatchContainerForEvent` now optimistically calls `this.trigger(container)` directly; on failure, the delivery intent is persisted to the outbox and the worker retries on a periodic drain with exponential backoff + jitter. After a configurable number of failed attempts (default 5) entries transition to the dead-letter queue; delivered and dead-letter entries are auto-purged past TTL (default 30 days). New `/api/notifications/outbox` REST surface lets operators list entries (`?status=` filter), retry from the DLQ (`POST /:id/retry`), or discard (`DELETE /:id`). New base method `Trigger.dispatchOutboxEntry(entry)` is the worker's delivery hook; subclasses can override.
-- **Notification outbox UI ([commit `feature/v1.5-rc17`](https://github.com/CodesWhat/drydock/commits/feature/v1.5-rc17)).** New `Notification outbox` page (route `/notifications/outbox`, nav under *Settings*) consumes the existing `/api/notifications/outbox` REST surface so operators can review the dead-letter queue, retry stuck deliveries, or discard dead entries from the UI. Status tabs (Dead-letter / Pending / Delivered) keep the same query-param convention (`?status=`) used by the rest of the list views; counts per bucket render as inline badges. `Retry` is shown only on dead-letter rows; `Discard` is available everywhere. New `ui/src/services/notification-outbox.ts` mirrors the API exactly.
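The outbox entry above mentions exponential backoff with jitter and a dead-letter transition after 5 failed attempts. The sketch below uses the common "full jitter" variant; the base and maximum delays, the field names, and counting the initial direct dispatch as an attempt are assumptions, since the changelog does not state drydock's exact formula:

```typescript
type OutboxStatus = "pending" | "delivered" | "dead-letter";

// Hypothetical entry shape; not the actual notificationOutbox document schema.
interface OutboxEntry {
  id: string;
  status: OutboxStatus;
  attempts: number; // failed deliveries so far
  nextAttemptAt: number; // epoch milliseconds
  lastError?: string;
}

const MAX_ATTEMPTS = 5; // default from the changelog entry
const BASE_DELAY_MS = 30_000; // assumption
const MAX_DELAY_MS = 60 * 60 * 1000; // assumption: cap a single wait at one hour

// Full jitter: pick uniformly between 0 and the capped exponential bound.
function backoffDelayMs(attempts: number, random: () => number = Math.random): number {
  const bound = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** (attempts - 1));
  return Math.floor(random() * bound);
}

function recordFailure(
  entry: OutboxEntry,
  error: string,
  now: number,
  random: () => number = Math.random,
): OutboxEntry {
  const attempts = entry.attempts + 1;
  if (attempts >= MAX_ATTEMPTS) {
    // Stop retrying; an operator can retry or discard from the dead-letter queue.
    return { ...entry, attempts, status: "dead-letter", lastError: error };
  }
  return {
    ...entry,
    attempts,
    lastError: error,
    nextAttemptAt: now + backoffDelayMs(attempts, random),
  };
}
```

Injecting `random` keeps the schedule deterministic in tests; full jitter spreads retries from many failed entries so a recovering notifier is not hit by a synchronized burst.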
-- **Cancel queued or in-flight updates ([commits `4b79e3ac`, `79487115`](https://github.com/CodesWhat/drydock/commits/feature/v1.5-rc17)).** `POST /api/operations/:id/cancel` now accepts both queued and in-progress operations. Queued ops are marked failed immediately with `lastError: 'Cancelled by operator'` (200). In-progress ops are flagged via a new `cancelRequested` field on the operation row and the endpoint returns `202 Accepted`; the lifecycle observes the flag at three safe checkpoints: after pull and before rename (clean abort, no rollback needed), before creating the replacement container, and before stopping the old container. Cancellations therefore either short-circuit cleanly or fall through the existing rollback path that renames the container back. The rollback path tags the rollback reason as `cancelled` so the audit trail distinguishes operator cancellations from real failures. Already-terminal ops still return `409 Conflict`. The container row's Cancel action is now visible for both queued and in-progress operations; the toast says "Cancelled" for the immediate path and "Cancellation requested" for the in-progress path.
-- **Global concurrent-update cap (`DD_UPDATE_MAX_CONCURRENT`).** New counting semaphore (`Semaphore` class in `app/updates/lock-primitives.ts`) provides a configurable global gate on how many update lifecycles run simultaneously across the entire controller instance. Default `0` = unlimited (no behavior change on upgrade). Positive integer `N` means at most N updates run concurrently. Negative or non-integer values fail fast at startup with a descriptive error. The cap layers on top of the existing per-container and per-compose-project locks; it does not replace them. Operations waiting on the cap remain in `queued` status. Scope is per controller instance; distributed agent hosts have independent counters by design. Self-update operations bypass the global cap: they take per-container locks but never wait on the global semaphore, preventing a full update queue from starving an admin-triggered self-update.
-- **Health-gate SSE heartbeat (`DD_UPDATE_HEALTH_GATE_HEARTBEAT_MS`).** While drydock waits for a new container to pass its health gate, the SSE pipeline was silent for the entire wait: the UI received no events between `phase: 'health-gate'` and `phase: 'health-gate-passed'`. For images with long healthcheck intervals (e.g. vaultwarden's 60 s check) this meant the UI relied on REST reconciliation poll if the SSE connection was interrupted during that window. A periodic heartbeat now re-emits `phase: 'health-gate'` at a configurable interval (default 10 s). `DD_UPDATE_HEALTH_GATE_HEARTBEAT_MS=0` disables heartbeats entirely; values below 1000 ms or non-integers fail fast at startup. The heartbeat cancels immediately when the wait resolves in any direction (success, timeout, or unhealthy), ensuring the terminal event is never preempted. No new phases are introduced; existing UI consumers accept the re-emitted event unchanged.
+- **SPA + hashed-asset cache-control** - Static UI assets with hashed filenames are served with immutable long-lived cache headers; the SPA `index.html` carries a short revalidation header.
 ### Changed
-- **Crowdin export configuration aligned with app locale folders.** Crowdin now maps language codes such as `es-ES` into the locale folder IDs the UI actually loads (for example `es`) and only downloads languages exposed in the locale picker. A new config guard test prevents future sync PRs from adding ignored region-coded folders, and the new auto-hidden-columns tooltip source avoids English-only `column(s)` punctuation that triggered Crowdin QA warnings for translated strings.
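The concurrency-cap entry above describes a counting semaphore where `0` means unlimited and invalid values fail at startup. A sketch under those rules; `parseMaxConcurrent` and this `Semaphore` are illustrative stand-ins, not the class in `app/updates/lock-primitives.ts`, and a real integration would also keep self-updates off this gate as the entry describes:

```typescript
// Assumed parsing: unset/empty = 0 (unlimited); anything but a non-negative integer throws.
function parseMaxConcurrent(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return 0;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`DD_UPDATE_MAX_CONCURRENT must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

class Semaphore {
  private readonly limit: number; // 0 = unlimited
  private active = 0;
  private readonly waiters: Array<() => void> = [];

  constructor(limit: number) {
    this.limit = limit;
  }

  // Resolves with a release function once a slot is available.
  async acquire(): Promise<() => void> {
    if (this.limit > 0 && this.active >= this.limit) {
      // Wait for a releasing holder to hand its slot over directly.
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    } else {
      this.active += 1;
    }
    let released = false;
    return () => {
      if (released) return; // make release idempotent
      released = true;
      const next = this.waiters.shift();
      if (next) next(); // slot passes to the oldest waiter; `active` is unchanged
      else this.active -= 1;
    };
  }
}
```

Handing the slot directly to the oldest waiter keeps the queue FIFO and avoids a window where a newly arriving caller could jump ahead of operations already waiting in `queued`.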
-- **Shared DataTable column sizing overhaul ([commit `596adcd2`](https://github.com/CodesWhat/drydock/commit/596adcd2)).** All first-party table surfaces now route through the shared `DataTable` component with numeric sizing metadata (`size`, `minSize`, `maxSize`, `flex`, `priority`, `overflow`, `autoSize`) instead of ad-hoc string widths. Tables render a stable ``, keep actions in an independent sticky/fixed managed column, support pointer and keyboard column resizing, double-click autosize visible content, and persist manual/autosized widths per table via browser preferences. Containers uses the sizing data for responsive auto-hide math so narrow widths hide lower-priority metadata instead of shrinking columns below readable minimums. The config webhook endpoint list was migrated too, and a new architecture test fails if raw `` markup or string column widths reappear in `ui/src`.
-- **Watcher dispatch is fully fire-and-forget ([commit `5cfa2286`](https://github.com/CodesWhat/drydock/commit/5cfa2286)).** `Trigger.runUpdateAvailableSimpleTrigger` and `runAcceptedUpdateBatch` previously awaited `runAcceptedContainerUpdates`, so a slow update lifecycle stalled the next watcher tick. The API path was already fire-and-forget; the watcher path now matches. New `dispatchAccepted(accepted)` helper centralises the `void runAcceptedContainerUpdates(...).catch(() => undefined)` pattern across all four call sites. Per-operation failures are still terminalised inside the lifecycle handler, so swallowing the dispatch chain's rejection loses no observable information.
-- **Security alert emit is non-blocking inside the update lifecycle ([commit `6c5198dd`](https://github.com/CodesWhat/drydock/commit/6c5198dd)).** `SecurityGate.maybeEmitHighSeverityAlert` was awaited inside `evaluateScanOutcome`, which itself runs inside the update lifecycle's critical path. With multiple notifiers registered for security alerts, the await chained sequential provider calls (SMTP, Slack, HTTP, MQTT, webhook) into the lifecycle, multiplying latency before pull/recreate could even start. The function now returns synchronously after firing the emit; notification dispatch semantics from the caller's perspective are unchanged (the same handlers run in the same order via `emitOrderedHandlers`), the lifecycle just no longer waits.
-- **"Update started" toasts renamed to "Update queued" ([commit `79487115`](https://github.com/CodesWhat/drydock/commit/79487115)).** Dispatch is fire-and-forget: by the time the toast renders, the lifecycle hasn't started, the operation is just queued. The text now matches what actually happened: `"Update queued: {name}"`, `"Force update queued: {name}"`, `"Queued update(s) for N container(s)"`. Function names in `ui/src/utils/container-update.ts` are unchanged so call-site churn is zero.
+- **Default watcher cron relaxed from hourly to every 6 hours ([#342](https://github.com/CodesWhat/drydock/issues/342) follow-up).** `app/watchers/providers/docker/Docker.ts` now defaults `cron` to `0 */6 * * *` (every 6 hours) instead of `0 * * * *` (hourly). Users who set `DD_WATCHER_{name}_CRON` explicitly are unaffected. Users who want the previous hourly cadence can still set `DD_WATCHER_{name}_CRON=0 * * * *`.
-### Fixed
+- **Action trigger default mode** - Action triggers (`docker`, `dockercompose`, `command`) now default to `AUTO=oninclude` instead of `AUTO=all`, requiring an explicit `dd.action.include` label before auto-updating containers. ([#213](https://github.com/CodesWhat/drydock/issues/213))
-- **[#342](https://github.com/CodesWhat/drydock/issues/342) follow-up - Hybrid `image:tag@sha256:digest` refs no longer trigger a spurious "Cannot get a reliable tag" warning when Docker's `RepoTags` is empty.** When a container is pulled by digest pin, Docker's `ImageInspect.RepoTags` is often empty even though the deploy ref written in compose (`docker.io/valkey/valkey:9@sha256:...`) carries an authoritative tag. `Docker.resolveImageName` was diverting to `resolveDigestOnlyImage` in that case, logging a misleading warning and discarding the explicit `:9` tag, which cascaded into `image-comparison.ts` emitting `No Registry Provider found` because the digest-only fallback could lose registry-domain context. The function now detects hybrid refs (a `:` for the tag before `@sha256:`) and parses them directly via `parse-docker-image-name`, which already returns both tag and digest correctly; only true digest-only refs (`image@sha256:...` with no tag) fall through to the existing `RepoTags` / `resolveDigestOnlyImage` path. The user-visible result for the Immich `valkey` and `postgres` digest-pinned containers reported on #342 is that the misleading warnings stop firing on every cron cycle and the registry router resolves them to Docker Hub / GHCR cleanly.
-- **[#342](https://github.com/CodesWhat/drydock/issues/342) follow-up - Registry env-var naming convention now explained in the registries index.** The per-registry doc tables show `DD_REGISTRY__{REGISTRY_NAME}_` as a placeholder convention, but `{REGISTRY_NAME}` is not self-documenting. A new "Naming registry instances" callout on `content/docs/current/configuration/registries/index.mdx` explains that the placeholder is a user-chosen label (`PUBLIC`, `PRIVATE`, `WORK`, anything) that namespaces multiple instances of the same registry type, with a concrete two-instance example (`HUB_PUBLIC_*` + `GHCR_PRIVATE_*`).
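The hybrid-ref entry above hinges on telling `name:tag@sha256:...` apart from a tagless `name@sha256:...`. drydock delegates parsing to `parse-docker-image-name`; the hand-rolled `splitImageRef` below is only an illustration of the rule, including the detail that a registry port (`host:5000/app`) must not be read as a tag:

```typescript
interface ParsedRef {
  name: string;
  tag?: string;
  digest?: string;
}

// Illustrative parser, not the parse-docker-image-name package.
function splitImageRef(ref: string): ParsedRef {
  const at = ref.indexOf("@");
  const digest = at === -1 ? undefined : ref.slice(at + 1);
  const namePart = at === -1 ? ref : ref.slice(0, at);
  // A tag colon must come after the last "/", so "host:5000/app" has no tag.
  const lastSlash = namePart.lastIndexOf("/");
  const colon = namePart.indexOf(":", lastSlash + 1);
  if (colon === -1) return { name: namePart, digest };
  return { name: namePart.slice(0, colon), tag: namePart.slice(colon + 1), digest };
}

// Only tagless digest refs need the RepoTags / digest-only fallback.
function needsRepoTagsFallback(ref: string): boolean {
  const { tag, digest } = splitImageRef(ref);
  return digest !== undefined && tag === undefined;
}
```

For the Immich case, `docker.io/valkey/valkey:9@sha256:...` keeps its `9` tag and never reaches the fallback.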
-- **[#342](https://github.com/CodesWhat/drydock/issues/342) follow-up - Watcher cron callout explains rate-limit interaction with hourly polling.** A new callout on `content/docs/current/configuration/watchers/index.mdx` explains why the default `0 * * * *` (hourly) can saturate anonymous Docker Hub limits with many-container deployments, points to the v1.5.0-rc.20 retry + token-bucket mitigations, and recommends `0 */6 * * *` or `0 1 * * *` for high-container-count installs that don't need near-real-time detection.
-
-- **[#356](https://github.com/CodesWhat/drydock/issues/356) - Containers list Version column no longer hides the human-readable tag for floating-tag + digest-watch images.** The rc.18 ship of [#342](https://github.com/CodesWhat/drydock/issues/342) (digest-pinned containers were rendering `currentTag → newTag` as two identical truncated `sha256:` strings because both fields came from the same pinned digest reference) replaced that with a real `formatShortDigest(localValue) → formatShortDigest(remoteValue)` pair whenever the update kind was `'digest'`. That correctly addressed digest-pinned containers but cast too wide a net: containers that pull a floating tag (`:latest`, `:v8.13.2`, `:compose-X-version-9.0.1`) with `image.digest.watch` enabled also surface as `kind === 'digest'` whenever the registry rebuilds the image; there the tag is meaningful, the user expects to see it, and replacing it with two `sha256:…` hashes obscured every linuxserver/* and similar GHCR-hosted row on the Containers table for users like the reporter. A new derived `isDigestPinned: boolean` (added to the UI Container type, mapped from `image.tag.value.startsWith('sha256:')`, the same heuristic the watcher uses at `app/watchers/providers/docker/image-comparison.ts:240`) now gates the digest-pair render: digest-pinned containers continue to show the `sha256:abc… → sha256:def…` pair the [#342](https://github.com/CodesWhat/drydock/issues/342) fix intended, while floating-tag + digest-watch containers render the tag once (no arrow, since `currentTag === newTag` for digest-only updates) with the digest pair surfaced on the cell tooltip. The two container-detail panels gain a small muted "Digest:" subline showing the actual digest transition so the underlying change is still visible without dominating the version row. Applies symmetrically to all five UI sites that switched in rc.18: the Containers table version cell, card body, and list-accordion image subtitle, plus the side and full-page detail panels.
-- **[#357](https://github.com/CodesWhat/drydock/issues/357) - Transient Trivy failures no longer wipe previously-stored scan history.** The scheduler used to overwrite `container.security.scan` unconditionally; when Trivy hit a hiccup (daemon timeout, registry blip, missing socket) `mapToErrorResult` returned an empty `status:'error'` record and that result silently replaced every prior `passed`/`blocked` entry on the next cycle. The scheduler now keeps the existing record when the new result is an error and there is something to preserve, capped at a 7-day max-staleness window so a persistently broken pipeline eventually surfaces in stored state instead of locking in a stale `passed` indefinitely; the UI still sees the live error via SSE so operators are not left in the dark either way. Error results are also no longer indefinitely re-spawning fresh Trivy invocations: `scanImageWithDedup` now uses a 15-minute error retry floor so under aggressive cron and a registry outage, retries are bounded to once per 15 minutes per digest instead of once per scheduler cycle.
-- **[#355](https://github.com/CodesWhat/drydock/issues/355) - `update-failed` notifications no longer drop silently when the controller's container store races against post-failure prune.** `UpdateLifecycleExecutor` now carries the failing container on the `update-failed` payload, and `Trigger.handleContainerUpdateFailedEvent` accepts `payload.container` as the primary source with the store lookup as fallback, mirroring the existing `update-applied` symmetry. Previously, when the store lookup missed (post-failure prune timing, agent push race, watcher/name re-key) the trigger silently debug-logged "No container found for update-failed event => ignore" and the user got no out-of-band signal that the update had failed. The event payload types are now strictly typed (`container?: Container` only, with no `Record` escape hatch), the three duck-typing payload-extraction blocks across the trigger handlers collapsed into a single direct `payload.container || lookup(...)` pattern, and the agent SSE relay strips `container` from `dd:update-applied` / `dd:update-failed` events before transmit (mirroring the controller-side sanitizer at `app/api/sse.ts`) so the full container blob (vulnerabilities, env entries, labels) no longer goes over the wire on every event.
-- **[#355](https://github.com/CodesWhat/drydock/issues/355) / [#357](https://github.com/CodesWhat/drydock/issues/357) - Trivy scan and SBOM no longer require `/var/run/docker.sock` inside the drydock container.** Regression introduced in rc.17 forced Trivy to use only the local Docker daemon as image source. Operators running the `tecnativa/docker-socket-proxy` topology (documented in `README.md`), rootless Docker, podman, or remote watchers saw every gated update fail post-pull with `dial unix /var/run/docker.sock: connect: no such file or directory`, and previously-stored scan results were also overwritten with empty error records when the scheduler fired. The forced `--image-src docker` flag is removed; Trivy now uses its default source order (`docker, containerd, podman, remote`) and falls back to a registry pull when the local daemon isn't reachable. Operators who know their topology is socket-less and want to skip the docker/containerd/podman probe attempts can set `DD_SECURITY_TRIVY_IMAGE_SRC=remote` (any value Trivy accepts works, including comma-separated lists like `remote,docker`); when unset Trivy auto-detects. Pre-rc.17 behaviour is fully restored.
-- **[#290](https://github.com/CodesWhat/drydock/issues/290) - "Updated Successfully" toast no longer drops intermittently after a container update.** Terminal-update toasts previously fired from three independent handlers (`ContainerUpdateDialog`, `useContainerSsePatchPipeline`, `ContainersGroupedViews`), each gated on different state: any one of `operationId` missing on the wire, the view being unmounted, or the per-batch dependency on `ContainersGroupedViews` being mounted would silently swallow the toast. A new `useGlobalUpdateToast` composable mounted once at `App.vue` is the single source of truth: listens for `dd:sse-update-applied` / `dd:sse-update-failed` / `dd:sse-batch-update-completed` (via `globalThis` events), survives route navigation, dedupes by `operationId` over a 5-minute window matched to the SSE replay buffer, and waits for the matching `dd:sse-container-added/updated/removed` event before firing so the toast appears the moment the row's "Updating" badge clears (not on a hardcoded delay). A 5s safety fallback fires the toast for cases where no row event arrives (remote agents, deleted containers). Backend stops coercing missing `operationId`/`containerId` to `''` so the wire format is honest about what's optional. Browser `EventSource` cannot set custom headers on reconnect, so `Last-Event-ID` is now also accepted via query param (`?last-event-id=`) and validated against the canonical `:` shape at the request boundary. Defensive hardening: module-level singleton guard so a stray child-component install can't double-register listeners, FIFO-bounded dedup map (cap 500) defends against runaway operation throughput, and HTML angle brackets are stripped from raw error text before i18n interpolation.
-- **[#289](https://github.com/CodesWhat/drydock/issues/289) - Container row state regression after recreate.** Same root cause as #290: per-view SSE handlers dropped events when the view was unmounted or the payload omitted `operationId`. The row-state pipeline (`useContainerSsePatchPipeline`) is now decoupled from toast emission so it can focus solely on patch application; toast firing lives exclusively in `useGlobalUpdateToast` at `App.vue`.
-- **[#291](https://github.com/CodesWhat/drydock/issues/291) - Dashboard fired "updated" toast while the "updating" toast was missed.** The dashboard had its own duplicate SSE-terminal-toast handler that competed with (and sometimes pre-empted) the global one. The dashboard SSE handler now does row-state hold/ghost management only; toast emission is owned exclusively by the global handler at `App.vue`.
-- **Release security gate restored before rc.18.** Patched transitive npm dependencies flagged by OSV during the post-merge main CI run: `fast-uri` now resolves to `3.1.2` in app/UI lock domains, and `fast-xml-builder` now resolves to `1.2.0` through the app/e2e XML parser override path. This clears the Qlty security gate without changing runtime behavior.
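The #357 entry above combines two rules: keep the last good scan when a new run errors (for up to 7 days), and re-run a failed scan at most once per 15 minutes per digest. A sketch of both decisions, assuming a simplified record with a `scannedAt` timestamp; the names are hypothetical, and measuring age between the two results is an assumption that may differ from drydock's scheduler:

```typescript
interface ScanRecord {
  status: "passed" | "blocked" | "error";
  scannedAt: number; // epoch milliseconds
}

const MAX_STALENESS_MS = 7 * 24 * 60 * 60 * 1000; // 7-day window from the changelog
const ERROR_RETRY_FLOOR_MS = 15 * 60 * 1000; // 15-minute floor from the changelog

// Decide which record to store after a scheduler cycle.
function selectStoredScan(existing: ScanRecord | undefined, incoming: ScanRecord): ScanRecord {
  if (incoming.status !== "error") return incoming; // fresh results always replace
  if (!existing || existing.status === "error") return incoming; // nothing worth preserving
  const age = incoming.scannedAt - existing.scannedAt;
  // Past the window, let the error surface instead of pinning a stale "passed".
  return age > MAX_STALENESS_MS ? incoming : existing;
}

// Decide whether a digest whose last scan errored may be scanned again.
function mayRetryAfterError(lastErrorAt: number | undefined, now: number): boolean {
  return lastErrorAt === undefined || now - lastErrorAt >= ERROR_RETRY_FLOOR_MS;
}
```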
-- **[#345](https://github.com/CodesWhat/drydock/issues/345) - Host names with numeric suffixes no longer lose the differentiating character in the Containers table.** The rc.18 table pass already replaced the old host badge with plain text, and the host column now has a wider default/readable floor so names like `servicevault` and `servicevault2` remain distinguishable at desktop widths. Narrow layouts still auto-hide the host column into secondary metadata instead of shrinking it below readability.
-- **[#340](https://github.com/CodesWhat/drydock/discussions/340) - Self-update no longer preserves stale Drydock version metadata.** The self-update clone path now drops image-inherited environment variables and labels from the old image when the target image changed them, so replacement containers inherit the new image's `DD_VERSION` and `org.opencontainers.image.version` instead of reporting the previous release after an automatic update. Operator-supplied environment variables and labels remain preserved.
-- **One slow notifier no longer stalls every container update ([commit `761fb834`](https://github.com/CodesWhat/drydock/commit/761fb834)).** The module-level `pLimit(1)` introduced in v1.5 to serialise concurrent updates was the root cause behind reports of stuck queues whenever a single notifier hung: every update on every container was waiting for the same single slot. Per-container locks remove the global bottleneck while still preventing a container from being updated twice in parallel.
-- **Process restart no longer wipes the queued update list ([commit `00788b13`](https://github.com/CodesWhat/drydock/commit/00788b13)).** Previously every active operation was force-failed on startup. Queued and pulling-phase operations now resume; only operations mid-destructive-step (renamed/new-created/old-stopped/etc.) are surfaced for operator review. See the matching addition above.
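The #340 entry above drops values inherited from the old image when the new image changed them, while keeping operator-supplied ones. A sketch of that filter over key/value maps (labels directly, or env after splitting `KEY=value`), assuming the engine fills the gaps from the new image's defaults at create time, as Docker does for image `Env` and `Labels`. The function name is hypothetical, and a value an operator set to exactly the old image default is indistinguishable from an inherited one here:

```typescript
type KeyValues = Record<string, string>;

// Hypothetical helper: decide which entries to copy onto the replacement container.
function carryOverForNewImage(current: KeyValues, oldImage: KeyValues, newImage: KeyValues): KeyValues {
  const result: KeyValues = {};
  for (const [key, value] of Object.entries(current)) {
    const inheritedFromOldImage = oldImage[key] === value;
    const changedByNewImage = newImage[key] !== oldImage[key];
    // Skip stale image defaults (e.g. DD_VERSION) so the new image's value applies.
    if (inheritedFromOldImage && changedByNewImage) continue;
    result[key] = value; // operator-supplied or unchanged: keep
  }
  return result;
}
```

For example, an old `DD_VERSION=1.4.0` inherited from the previous image is dropped when the new image ships `1.5.0`, while an operator's `DD_SERVER_NAME` survives.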
-- **Transient notifier outages no longer drop alerts ([commit `b215d295`](https://github.com/CodesWhat/drydock/commit/b215d295)).** Direct dispatch failures land in the outbox and are retried with exponential backoff + jitter; only persistently failing entries (default: 5 failed attempts) move to the dead-letter queue. Crash-during-dispatch is the only remaining loss window.
-- **`dd.registry.lookup.image` label no longer corrupts deploy identity ([commit `594a07e8`](https://github.com/CodesWhat/drydock/commit/594a07e8), [fixes #336](https://github.com/CodesWhat/drydock/issues/336)).** The lookup label is intended to redirect tag/manifest queries to a different image (e.g. a private mirror running `myreg/nextcloud` looking up tags from Docker Hub's `library/nextcloud`), but `normalizeContainer` was assigning the substituted view back onto the container record so the deploy identity (image name and registry URL) was silently rewritten to the lookup target. Compose-file rewrites and container recreates then deployed the wrong image. `normalizeContainer` no longer overwrites `image.name` / `image.registry.url`; a new `getImageForRegistryQuery` helper applies the substitution + provider URL normalisation only at each query boundary (`getTags`, `getImageManifestDigest`, `getImagePublishedAt`). Un-prefixed images (`nginx:1.0`) now default to `docker.io` for the registry URL; `Hub.getImageFullName` strips the prefix for clean display.
-- **Password-manager autofill restored on login form ([commit `3abe2fa6`](https://github.com/CodesWhat/drydock/commit/3abe2fa6), [fixes #335](https://github.com/CodesWhat/drydock/issues/335)).** Username and password inputs lost their `name` and `id` attributes during the v1.5 plain-HTML rewrite. Browser-native autofill kept working via `autocomplete=`, but credential managers that rely on `name`/`id` heuristics (Dashlane in Chrome, among others) could no longer identify the username field. Both attributes are restored.
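The #336 entry above moves the `dd.registry.lookup.image` substitution to the query boundary instead of writing it back onto the container. A sketch of such a non-mutating helper; the `ImageRef` shape and the host-detection rule (first path component containing `.` or `:`, or equal to `localhost`, following Docker's reference grammar) are assumptions, and the lookup value is assumed to carry no tag:

```typescript
interface ImageRef {
  registryUrl: string;
  name: string;
  tag: string;
}

// Hypothetical counterpart of getImageForRegistryQuery: returns a copy, never edits `image`.
function imageForRegistryQuery(image: ImageRef, lookupImage?: string): ImageRef {
  if (!lookupImage) return { ...image };
  const slash = lookupImage.indexOf("/");
  const first = slash === -1 ? "" : lookupImage.slice(0, slash);
  const hasRegistryHost = first.includes(".") || first.includes(":") || first === "localhost";
  return {
    ...image,
    // Un-prefixed lookups such as "library/nextcloud" default to Docker Hub.
    registryUrl: hasRegistryHost ? first : "docker.io",
    name: hasRegistryHost ? lookupImage.slice(slash + 1) : lookupImage,
  };
}
```

Because each query call builds its own copy, compose rewrites and recreates keep reading the untouched deploy identity.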
-- **`security-scan-skipped` audit row now fires when the gate is disabled globally ([commit `ae24e0a9`](https://github.com/CodesWhat/drydock/commit/ae24e0a9)).** Previously `recordSecurityAudit('security-scan-skipped', …)` only executed when the per-container label `dd.security.gate=off` was set. With `DD_SECURITY_GATE_MODE=off` configured globally, scans were silently skipped with no audit trail: an operator reading the audit log had no indication that the gate was suppressed. `getGateDisabledAuditDetails` now selects the appropriate human-readable reason from whichever off-state is in effect and the audit call is unconditional.
-- **Registry URL normalization restored on container record after regression in `594a07e8`.** Removing the `normalizeImage` call in `normalizeContainer` to fix deploy-identity corruption (issue #336) inadvertently left `image.registry.url` in its raw user-config form (`docker.io`) instead of the API base URL form (`https://registry-1.docker.io/v2`). All registry HTTP callers, `getImageFullName`, the Prometheus `image_registry_url` label, and the Docker trigger's self-update helper expect the normalized form. The URL rewrite is now restored for containers where the deploy image itself matches the provider; harbor-mirror containers (where a lookup label diverts to a different registry) correctly retain their deploy URL unchanged.
-- **`image.name` canonicalization also restored after partial fix in `4e06329b`.** The prior fix only restored `image.registry.url`; `image.name` was still not rewritten through the provider's `normalizeImage`, so Docker Hub containers with un-prefixed names (e.g. `nginx`) kept `image.name = "nginx"` instead of `library/nginx`. This caused the Prometheus `image_name` label to emit the bare name, breaking e2e scenarios that assert `image_name="library/nginx"`. The `normalizeImage` result now also assigns `image.name` in the deploy-match branch; the cross-registry mirror branch (harbor → Hub lookup) is unaffected and still preserves the deploy name.
-- **Stack/group view no longer collapses to ungrouped mid-update when containers are recreated.** When a Docker action recreates a container it receives a new container ID; the group-membership map was keyed only by the original ID, so the post-recreate lookup missed and every container fell into `__ungrouped__`. With a two-container stack the single-member-flatten rule then removed both group buckets entirely. `loadGroups()` now indexes the map under id, name, AND displayName, so the existing `map[container.name]` fallback in the lookup actually resolves after a recreate.
+- **Self-update helper now prefers the bind-mounted Docker socket over a TCP watcher connection (commit [`aa828d88`](https://github.com/CodesWhat/drydock/commit/aa828d88)).** The resolution order is now inverted: `findDockerSocketBind` runs first, and if the target container carries a socket bind the helper uses that direct socket path regardless of the watcher's TCP configuration. TCP is the fallback for pure socket-less deployments.
-### Added
+- **`DD_SESSION_SECRET` auto-generated and persisted when unset.** On first boot without `DD_SESSION_SECRET` set, drydock generates 64 random bytes and writes them to a `secrets` collection inside `/store/dd.json`. Subsequent boots read the persisted value so sessions survive restarts. The env var still takes precedence when set; whitespace-only values are treated as unset. (rc.21 restored this after rc.20 made it a hard requirement without a migration path; existing deployments that set the variable see no change.)
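The session-secret entry above reduces to a three-step precedence: env var, then persisted value, then generate-and-persist. A sketch of that order using Node's `crypto.randomBytes`; the `SecretStore` interface and the `sessionSecret` key stand in for the LokiJS `secrets` collection and are hypothetical:

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical abstraction over the persisted `secrets` collection.
interface SecretStore {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
}

function resolveSessionSecret(env: Record<string, string | undefined>, store: SecretStore): string {
  const fromEnv = env.DD_SESSION_SECRET;
  // The env var wins when set; whitespace-only counts as unset.
  if (fromEnv !== undefined && fromEnv.trim() !== "") return fromEnv;
  const persisted = store.get("sessionSecret");
  if (persisted) return persisted;
  // First boot: 64 random bytes, hex-encoded (128 characters), saved so restarts reuse it.
  const generated = randomBytes(64).toString("hex");
  store.set("sessionSecret", generated);
  return generated;
}
```

Persisting the generated value is what keeps sessions valid across restarts; a per-process secret would invalidate every cookie on each boot.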
-- **Chinese (Simplified) UI ([PR #331](https://github.com/CodesWhat/drydock/discussions/331) by [TianMiao](https://github.com/TianMiao), commits [`8f3286b7`](https://github.com/CodesWhat/drydock/commit/8f3286b7), [`b97944dc`](https://github.com/CodesWhat/drydock/commit/b97944dc)).** Chinese is the first non-English locale to ship in drydock. 14 namespace JSON files under `ui/src/locales/zh-CN/` cover the full UI surface: dashboard, containers, agents, config, list views, container components, app shell, auth, logs, and shared components (~1,100+ strings). A latent bootstrap bug (`buildMessages` map initialized only for `en`, causing `Object.assign(undefined)` crashes for any second locale) was fixed as part of this work, along with 112 translation gaps that arose because the locale files were authored before several new UI strings landed in rc.17. The i18n framework loaded on the existing `import.meta.glob` auto-discovery; no additional wiring was needed.
-- **Chinese (Traditional) UI ([PR #344](https://github.com/CodesWhat/drydock/pull/344) by [TianMiao](https://github.com/TianMiao), commit [`2e60f1e7`](https://github.com/CodesWhat/drydock/commit/2e60f1e7)).** The Chinese catalog is now split into BCP-47 locale folders (`zh-CN` and `zh-TW`) so operators can choose Simplified or Traditional Chinese from Config > Appearance. The Traditional catalog ships with the same namespace coverage as Simplified Chinese, including the rc.18 appearance, outbox, table, and preference strings.
-- **Multi-select event-type filter in audit log ([commit `5e2d0c70`](https://github.com/CodesWhat/drydock/commit/5e2d0c70), [discussion #332](https://github.com/CodesWhat/drydock/discussions/332)).** The audit log's event-type filter was a single-value `