LetPeopleWork · huserben · Jun 18, 2026
diff --git a/Lighthouse.Frontend/package.json b/Lighthouse.Frontend/package.json
@@ -21,7 +21,7 @@
 		"@microsoft/signalr": "^10.0.0",
 		"@mui/icons-material": "^7.3.11",
 		"@mui/lab": "7.0.0",
-		"@mui/material": "^9.0.1",
+		"@mui/material": "^9.1.1",
 		"@mui/system": "^9.1.1",
 		"@mui/x-charts": "9.0.1",
 		"@mui/x-data-grid": "^9.5.0",

diff --git a/Lighthouse.Frontend/pnpm-lock.yaml b/Lighthouse.Frontend/pnpm-lock.yaml
diff --git a/Lighthouse.Frontend/vitest.config.ts b/Lighthouse.Frontend/vitest.config.ts
@@ -29,7 +29,7 @@ export default defineConfig({
 		],
 		server: {
 			deps: {
-				inline: ["@mui/x-data-grid"],
+				inline: [/@mui\//, /react-transition-group/],
 			},
 		},
 

diff --git a/docs/feature/epic-5305-k8s-readiness/feature-delta.md b/docs/feature/epic-5305-k8s-readiness/feature-delta.md
diff --git a/docs/feature/epic-5305-k8s-readiness/slices/slice-01-forwarded-headers.md b/docs/feature/epic-5305-k8s-readiness/slices/slice-01-forwarded-headers.md
@@ -0,0 +1,43 @@
+# Slice 01: Reverse-proxy forwarded headers
+
+**Feature**: epic-5305-k8s-readiness
+**Story**: US-01 (ADO #5311) → job-operator-correct-behind-proxy
+**Estimate**: ~0.5–1 crafter day
+**Reference class**: config-gated startup wiring, similar to `auth-allowedorigins-envvar-binding-fix` (env-bound ASP.NET Core middleware config, off unless declared)
+
+## Goal
+Make Lighthouse honour `X-Forwarded-Proto` / `-Host` / `-For` from a declared, trusted reverse proxy so HTTPS redirects, secure cookies, OIDC callback URLs and SignalR negotiation use the real public scheme + host — config-gated and OFF unless a proxy is declared.
+
+## IN scope
+- `UseForwardedHeaders` wired with a `ForwardedHeadersOptions` populated from configuration: known proxies / known networks (CIDR), forwarded-header count limit.
+- A single config switch (env var + appsettings) that turns forwarded-header trust on and declares the trusted proxy set; default OFF.
+- OIDC callback URL + `RequireHttpsMetadata`/redirect behaviour derive from the forwarded scheme/host when trust is on.
+- Secure-cookie + HTTPS-redirect behaviour consistent with the forwarded scheme.
+
+## OUT scope
+- The Ingress / Traefik manifests themselves (Productization epic #5306, chart story 09).
+- Edge auth (oauth2-proxy) — north-star, not this slice.
+- Health-check endpoints → slice 02.
+
+## Learning hypothesis
+**Confirms if it succeeds**: a real OIDC login through a TLS-terminating proxy completes first try (no http:// callback, no redirect loop, secure cookie persists).
+**Disproves if it fails**: ASP.NET Core forwarded-header handling is insufficient for our SignalR negotiation path and we need per-endpoint handling rather than one global middleware.
+
+## Acceptance criteria
+See US-01 in `../feature-delta.md`. Key: with trust ON and a simulated `X-Forwarded-Proto: https` + `X-Forwarded-Host`, an integration test asserts the generated OIDC redirect/callback URL is `https://<public-host>/...`; with trust OFF (no proxy declared), behaviour is byte-identical to today (standalone gate).
+
+## Dependencies
+None. Foundation slice — unblocks correct auth on any proxied deployment; should land before any cluster auth testing.
+
+## Production data requirement
+**Required.** Smoke a real OIDC login (Keycloak or the configured provider) through an actual reverse proxy (local Traefik/nginx), not just a unit test with synthetic headers.
+
+## Dogfood moment
+The dev instance, placed behind a local Traefik with TLS, logs in via OIDC over the HTTPS hostname within the same day.
+
+## Cross-cutting checklist (confirmed in feature-delta)
+RBAC: N/A — no authorization surface changes; only how the app derives scheme/host. Clients: N/A — no API contract change. Website: N/A — operational, not a marketed surface.
+
+## Pre-slice spike candidates
+- Confirm SignalR negotiation respects `UseForwardedHeaders` ordering relative to other middleware. (~1 hr)
+- Verify the existing OIDC setup reads the request scheme/host (not a hardcoded base URL) so forwarded headers actually flow through. (~30 min)
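Reviewer sketch for slice 01 (not part of the committed file): a minimal, framework-neutral illustration of the "config-gated, OFF unless a proxy is declared" rule described in the Goal and Acceptance criteria. It is not the ASP.NET Core `UseForwardedHeaders` implementation. All names (`ForwardedTrustConfig`, `IncomingRequest`, `resolveOrigin`, `callbackUrl`) and the `/signin-oidc` path used in the usage note are assumptions made for illustration, and CIDR network matching is deliberately left out.

```typescript
// Hypothetical model of the trust switch: forwarded headers are honoured only
// when trust is enabled AND the direct peer is a declared proxy.
interface ForwardedTrustConfig {
  enabled: boolean;        // default OFF in the slice
  knownProxies: string[];  // exact proxy IPs (CIDR matching omitted in this sketch)
}

interface IncomingRequest {
  remoteIp: string;        // address of the direct peer
  scheme: string;          // scheme the app itself observed
  host: string;            // host the app itself observed
  headers: Record<string, string | undefined>;
}

interface EffectiveOrigin {
  scheme: string;
  host: string;
}

function resolveOrigin(req: IncomingRequest, cfg: ForwardedTrustConfig): EffectiveOrigin {
  const direct: EffectiveOrigin = { scheme: req.scheme, host: req.host };

  // Trust OFF, or the peer is not a declared proxy: ignore the headers entirely.
  if (!cfg.enabled || cfg.knownProxies.indexOf(req.remoteIp) === -1) {
    return direct;
  }

  const proto = req.headers["x-forwarded-proto"];
  const host = req.headers["x-forwarded-host"];
  return {
    // Accept only well-formed schemes; otherwise fall back to the observed one.
    scheme: proto === "http" || proto === "https" ? proto : direct.scheme,
    host: host !== undefined && host.length > 0 ? host : direct.host,
  };
}

// Builds an absolute URL (for example an OIDC callback) from the effective origin.
function callbackUrl(origin: EffectiveOrigin, path: string): string {
  return origin.scheme + "://" + origin.host + path;
}
```

With trust on and the peer listed in `knownProxies`, `callbackUrl(resolveOrigin(req, cfg), "/signin-oidc")` uses the forwarded scheme and host; with trust off, or with an undeclared peer, the same call returns the URL the app would have produced without any proxy, which is the standalone gate.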
diff --git a/docs/feature/epic-5305-k8s-readiness/slices/slice-02-health-checks.md b/docs/feature/epic-5305-k8s-readiness/slices/slice-02-health-checks.md
@@ -0,0 +1,44 @@
+# Slice 02: Health checks (liveness / readiness / startup)
+
+**Feature**: epic-5305-k8s-readiness
+**Story**: US-02 (ADO #5310) → job-operator-trust-pod-health
+**Estimate**: ~1–1.5 crafter days
+**Reference class**: new read endpoints + DI wiring; learning story 04 (#5194) exercised probes as a spike — this is the product implementation
+
+## Goal
+Add real ASP.NET Core health checks driving the three k8s probes so traffic reaches only serving pods and only genuinely-dead pods restart.
+
+## IN scope
+- `AddHealthChecks()` with distinct tagged checks mapped to three endpoints:
+  - **readiness** (`/health/ready`): DB connectivity + migrations-applied → pod kept OUT of LB rotation until truly serving.
+  - **liveness** (`/health/live`): shallow — restart only on genuine deadlock, NOT on a slow dependency.
+  - **startup** (`/health/startup`): covers slow boot / migration window without tripping liveness.
+- Endpoints harmless / no-op-friendly in single-container mode (standalone gate).
+
+## OUT scope
+- The k8s probe manifests (chart story 09 / Productization #5306).
+- Migration-applied detection that requires the migration lock → coordinate with slice 04 (this slice checks "migrations applied", slice 04 owns "apply once across replicas").
+- /metrics, tracing → slice 05.
+
+## Learning hypothesis
+**Confirms if it succeeds**: a pod with an unreachable DB drops out of rotation (readiness red) WITHOUT being restarted (liveness green) — no restart storm.
+**Disproves if it fails**: a shallow liveness check can't distinguish deadlock from slow dependency cheaply, forcing a richer (and riskier) liveness signal.
+
+## Acceptance criteria
+See US-02 in `../feature-delta.md`. Key: integration tests assert (a) readiness returns unhealthy when DB is down but liveness stays healthy; (b) readiness returns healthy only when DB reachable AND migrations applied; (c) endpoints return 200 in single-container mode with no orchestrator.
+
+## Dependencies
+Soft on slice 04 for the precise "migrations applied" signal; can ship with a simpler "can open a DB connection" readiness first and tighten once slice 04 lands.
+
+## Production data requirement
+**Required.** Run the dev instance, kill the DB connection, observe readiness flip while the process is NOT restarted; restore and observe recovery.
+
+## Dogfood moment
+Dev instance deployed with the three probes wired; operator watches a clean rollout where a not-yet-migrated pod stays out of rotation until ready.
+
+## Cross-cutting checklist (confirmed in feature-delta)
+RBAC: N/A — health endpoints are unauthenticated operational surface (no business data). Clients: N/A. Website: N/A.
+
+## Pre-slice spike candidates
+- Decide whether health endpoints sit on the main port or a separate management port. (~30 min)
+- Confirm a cheap, reliable "migrations applied" query against EF Core for both SQLite and Postgres. (~1 hr)
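Reviewer sketch for slice 02 (not part of the committed file): a small model of the tagged-check idea, where each probe evaluates only the checks carrying its tag, so a failing dependency turns readiness red while liveness stays green. This is not the ASP.NET Core `AddHealthChecks()` API; the types, check names, and the mutable `appState` object are hypothetical stand-ins.

```typescript
type ProbeTag = "live" | "ready" | "startup";

interface HealthCheck {
  name: string;
  tags: ProbeTag[];
  isHealthy: () => boolean;
}

interface ProbeResult {
  healthy: boolean;
  failed: string[];
}

// A probe is healthy when every check tagged for it passes.
function evaluateProbe(checks: HealthCheck[], tag: ProbeTag): ProbeResult {
  const failed = checks
    .filter((c) => c.tags.indexOf(tag) !== -1 && !c.isHealthy())
    .map((c) => c.name);
  return { healthy: failed.length === 0, failed: failed };
}

// Hypothetical stand-ins for the real dependencies.
const appState = { started: true, dbReachable: true, migrationsApplied: true };

const lighthouseChecks: HealthCheck[] = [
  // Liveness is deliberately shallow: it never looks at the database.
  { name: "process", tags: ["live"], isHealthy: () => true },
  { name: "boot-complete", tags: ["startup"], isHealthy: () => appState.started },
  { name: "database", tags: ["ready"], isHealthy: () => appState.dbReachable },
  { name: "migrations", tags: ["ready"], isHealthy: () => appState.migrationsApplied },
];
```

Setting `appState.dbReachable = false` makes `evaluateProbe(lighthouseChecks, "ready")` unhealthy while `evaluateProbe(lighthouseChecks, "live")` stays healthy, which mirrors acceptance criterion (a).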
diff --git a/docs/feature/epic-5305-k8s-readiness/slices/slice-03-graceful-shutdown.md b/docs/feature/epic-5305-k8s-readiness/slices/slice-03-graceful-shutdown.md
@@ -0,0 +1,45 @@
+# Slice 03: Graceful shutdown (SIGTERM) + connection draining
+
+**Feature**: epic-5305-k8s-readiness
+**Story**: US-03 (ADO #5309) → job-operator-zero-downtime-rollout
+**Estimate**: ~1–1.5 crafter days
+**Reference class**: `IHostedService` / `IHostApplicationLifetime` lifecycle wiring; touches the same update-queue hosted services as Epic 5121 / #5304
+
+## Goal
+Handle SIGTERM cleanly so a terminating pod stops accepting new work, drains in-flight HTTP + SignalR connections, flushes/awaits the in-memory update queue, and finishes within `terminationGracePeriodSeconds` — enabling zero-downtime rolling updates.
+
+## IN scope
+- Wire `IHostApplicationLifetime` `ApplicationStopping`/`ApplicationStopped` and/or `IHostedService.StopAsync` to:
+  - stop accepting new HTTP requests and new SignalR negotiations,
+  - drain in-flight HTTP requests within a bounded window,
+  - flush/await the in-memory `UpdateQueueService` Channel so queued/in-flight updates complete (or are safely abandoned) before exit,
+  - close SignalR connections so clients reconnect to a surviving pod.
+- Configurable shutdown timeout aligned to `terminationGracePeriodSeconds`.
+
+## OUT scope
+- The cluster-wide single-consumer queue redesign → slice 07 (#5304). This slice drains the *current per-process* queue cleanly; it does not make the queue distributed.
+- SignalR Redis backplane → slice 07.
+- Probe manifests → Productization #5306.
+
+## Learning hypothesis
+**Confirms if it succeeds**: under a rolling update, a load test driving requests + an active SignalR client sees zero failed requests and a clean client reconnect as pods cycle.
+**Disproves if it fails**: the in-memory update queue can't be drained deterministically within a sane grace period (e.g. a long external sync mid-flight), forcing the queue-redesign (slice 07) to land *before* true zero-downtime is claimable.
+
+## Acceptance criteria
+See US-03 in `../feature-delta.md`. Key: an integration test issues SIGTERM/`StopAsync` while an HTTP request and a queued update are in flight and asserts both complete (or the update is safely re-enqueued) before the host reports stopped; a single-container Ctrl-C behaves exactly as today (standalone gate).
+
+## Dependencies
+Pairs with slice 02 (readiness must flip to NotReady on `ApplicationStopping` so the LB stops routing before drain). Soft-precedes slice 07.
+
+## Production data requirement
+**Required.** Drive the dev instance under a small load generator + live SignalR client through a simulated rolling restart; assert no dropped requests.
+
+## Dogfood moment
+Operator triggers a rolling restart of the dev deployment during active use and observes no user-visible error and a seamless SignalR reconnect.
+
+## Cross-cutting checklist (confirmed in feature-delta)
+RBAC: N/A. Clients: N/A — server-side lifecycle only; CLI/MCP callers just reconnect. Website: N/A.
+
+## Pre-slice spike candidates
+- Measure worst-case in-flight update duration (external sync) to size the grace period. (~1 hr)
+- Confirm Kestrel/ASP.NET shutdown ordering vs. our hosted services so drain runs before the server socket closes. (~1 hr)
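Reviewer sketch for slice 03 (not part of the committed file): a simplified model of a bounded drain, where intake stops first, queued work completes while the grace budget allows, and whatever does not fit is reported so it can be re-enqueued or abandoned safely. It simulates time as a per-item cost instead of awaiting real work, and it is not the `UpdateQueueService` implementation; `UpdateQueueSketch`, `costMs`, and `budgetMs` are hypothetical names.

```typescript
interface QueuedUpdate {
  id: string;
  costMs: number; // assumed time the update needs to finish
}

interface DrainResult {
  completed: string[];
  abandoned: string[];
}

class UpdateQueueSketch {
  private items: QueuedUpdate[] = [];
  private accepting = true;

  // Returns false once shutdown has begun, so callers can route work elsewhere.
  enqueue(id: string, costMs: number): boolean {
    if (!this.accepting) {
      return false;
    }
    this.items.push({ id: id, costMs: costMs });
    return true;
  }

  // Stops intake, then completes queued items in order while the budget allows.
  drain(budgetMs: number): DrainResult {
    this.accepting = false;
    const completed: string[] = [];
    let remaining = budgetMs;

    let next = this.items[0];
    while (next !== undefined && next.costMs <= remaining) {
      this.items.shift();
      remaining -= next.costMs;
      completed.push(next.id);
      next = this.items[0];
    }

    // Anything left did not fit in the grace period.
    const abandoned = this.items.map((i) => i.id);
    this.items = [];
    return { completed: completed, abandoned: abandoned };
  }
}
```

The `abandoned` list is the case the learning hypothesis worries about: a long external sync that cannot finish inside the grace period shows up there rather than being silently lost.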
diff --git a/docs/feature/epic-5305-k8s-readiness/slices/slice-04-expand-only-migrations.md b/docs/feature/epic-5305-k8s-readiness/slices/slice-04-expand-only-migrations.md
@@ -0,0 +1,42 @@
+# Slice 04: Expand-only EF migrations + safe startup under N replicas
+
+**Feature**: epic-5305-k8s-readiness
+**Story**: US-04 (ADO #5308) → job-operator-zero-downtime-rollout + job-operator-survive-multiple-replicas
+**Estimate**: ~2–2.5 crafter days
+**Reference class**: EF migration mechanics (hit the stale-migration-DLL `--no-incremental` trap in `delivery-target-date-tracking`); concurrency coordination akin to Epic 5121
+
+## Goal
+Two coupled guarantees: (1) each release's migrations are additive-only (expand now; destructive cleanup deferred to a LATER release) so old pods never depend on a dropped column during a rollover; (2) when N replicas boot concurrently, exactly one applies migrations while the rest wait — no race on `Database.Migrate()`.
+
+## IN scope
+- **Expand-only discipline**: a guard/check (analyzer, test, or migration-review gate) that fails CI if a migration in this release is destructive (drop/rename column/table) — destructive ops must be a separate later release. Document the expand → contract two-release pattern.
+- **Startup migration coordination**: a migration lock / dedicated init mechanism / leader so exactly one replica runs `Migrate()`; others wait until migrations are applied, then start serving.
+- **Standalone gate**: a single SQLite or Postgres instance still auto-migrates on boot exactly as today (lock is a no-op / trivially-acquired with one instance).
+
+## OUT scope
+- The actual cluster-wide update-queue redesign → slice 07.
+- Provider-matrix migration generation uses the existing `CreateMigration` PowerShell script (per CLAUDE.md) — not new tooling.
+
+## Learning hypothesis
+**Confirms if it succeeds**: 3 replicas started simultaneously against one fresh Postgres apply the migration exactly once (one applies, two wait), and a destructive migration is rejected by CI before merge.
+**Disproves if it fails**: app-level migration coordination is too fragile under k8s and we must move migrations into a dedicated pre-deploy Job / ArgoCD sync-wave (decision pushed to Productization #5306) — in which case this slice delivers the expand-only guard + a documented "migrate via Job" path instead of an in-process lock.
+
+## Acceptance criteria
+See US-04 in `../feature-delta.md`. Key: an integration/concurrency test starts N hosts against one DB and asserts a single migration application (e.g. via a migration-history assertion / lock observation); a CI check rejects a destructive migration; single-instance boot auto-migrates unchanged.
+
+## Dependencies
+None hard. Feeds slice 02's "migrations applied" readiness signal. Precedes real multi-replica operation (slice 07).
+
+## Production data requirement
+**Required.** Reproduce concurrent startup against a real Postgres (k3s, 3 replicas) — InMemory tests will NOT catch the race (recurring lesson: persisted-model migration traps are invisible to InMemory).
+
+## Dogfood moment
+Operator scales a fresh deploy to 3 replicas against an empty Postgres and observes one migration application in the logs, all pods healthy.
+
+## Cross-cutting checklist (confirmed in feature-delta)
+RBAC: N/A. Clients: N/A — no API contract; possibly a CLI connection hint for Postgres, confirm in DESIGN. Website: N/A.
+
+## Pre-slice spike candidates
+- Evaluate a PostgreSQL advisory lock vs. a migration-history sentinel vs. an init-Job approach for the boot lock. (~2 hr)
+- Prototype the destructive-migration CI guard (parse generated migration for `DropColumn`/`DropTable`/`RenameColumn`). (~1 hr)
+- Confirm the SQLite path degrades the lock to a no-op. (~30 min)
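Reviewer sketch for slice 04 (not part of the committed file): one possible shape for the destructive-migration guard named in the spike list, scanning generated migration source for `DropColumn`/`DropTable`/`RenameColumn` (plus `RenameTable`, per the "rename column/table" wording in scope). It inspects only the `Up` method, because the `Down` method of a purely additive migration legitimately contains drops. The string-slicing heuristic and the function names are assumptions; a real guard might use a proper C# parser instead.

```typescript
const DESTRUCTIVE_OPS = ["DropColumn", "DropTable", "RenameColumn", "RenameTable"];

// Heuristic: take the text from "void Up(" up to "void Down(" (or to the end).
function extractUpBody(source: string): string {
  const upStart = source.indexOf("void Up(");
  if (upStart === -1) {
    return "";
  }
  const downStart = source.indexOf("void Down(", upStart);
  return downStart === -1 ? source.slice(upStart) : source.slice(upStart, downStart);
}

// Returns the destructive operations invoked inside Up, in DESTRUCTIVE_OPS order.
function findDestructiveOps(source: string): string[] {
  const upBody = extractUpBody(source);
  return DESTRUCTIVE_OPS.filter((op) => {
    // Matches calls such as `migrationBuilder.DropColumn(` or `.DropColumn<T>(`.
    const callPattern = new RegExp("\\.\\s*" + op + "\\s*[<(]");
    return callPattern.test(upBody);
  });
}
```

A CI step could fail the build whenever `findDestructiveOps` returns a non-empty list for a migration added in the current release, leaving the contract step to a later release.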
diff --git a/docs/feature/epic-5305-k8s-readiness/slices/slice-05-observability.md b/docs/feature/epic-5305-k8s-readiness/slices/slice-05-observability.md
@@ -0,0 +1,42 @@
+# Slice 05: App observability hooks (/metrics + structured logging + traces)
+
+**Feature**: epic-5305-k8s-readiness
+**Story**: US-05 (ADO #5312) → job-operator-observe-in-cluster
+**Estimate**: ~1.5 crafter days
+**Reference class**: new instrumentation wiring (OpenTelemetry .NET + Prometheus exporter + structured logging provider)
+
+## Goal
+Instrument the app for cluster observability: expose a Prometheus `/metrics` endpoint, emit structured JSON logs to stdout, and add OpenTelemetry traces — in-app instrumentation only, low-overhead / off-by-default where appropriate so the single-container self-hoster pays nothing.
+
+## IN scope
+- Prometheus `/metrics` endpoint (request rate / error rate / latency at minimum) via OpenTelemetry metrics + the Prometheus exporter.
+- Structured JSON logging to stdout (configurable), preserving today's log content but in queryable JSON.
+- OpenTelemetry tracing (ASP.NET Core + HttpClient + EF instrumentation) exporting via OTLP, exporter off/no-op unless configured.
+
+## OUT scope
+- The cluster-side Prometheus / Grafana / Loki stack — Productization epic #5306, story 16.
+- Per-tenant metric labelling / multi-tenant dashboards → #5306.
+- Business KPI instrumentation (those live in `docs/product/kpi-contracts.yaml`); this slice is operational telemetry, not product KPIs.
+
+## Learning hypothesis
+**Confirms if it succeeds**: a local Prometheus scrapes `/metrics` and a local Grafana shows Lighthouse request/error/latency; JSON logs parse field-wise in Loki; a slow request is traceable.
+**Disproves if it fails**: always-on instrumentation imposes measurable overhead on the single container, forcing a stricter off-by-default posture (and documentation that self-hosters must opt in).
+
+## Acceptance criteria
+See US-05 in `../feature-delta.md`. Key: an integration test asserts `/metrics` returns Prometheus-format output including HTTP server metrics; logs emitted in the JSON shape contain the expected fields; with telemetry disabled, no exporter runs and log format and behaviour match the configured default (standalone gate: no perf change).
+
+## Dependencies
+None. Can land any time; valuable before slice 07's multi-replica work (so the operator isn't flying blind during scale-out).
+
+## Production data requirement
+**Recommended.** Scrape the dev instance with a real local Prometheus and confirm a dashboard renders; not strictly required for the unit-level acceptance.
+
+## Dogfood moment
+Operator points a local Prometheus + Grafana at the dev instance and sees a live Lighthouse dashboard within the day.
+
+## Cross-cutting checklist (confirmed in feature-delta)
+RBAC: confirm whether `/metrics` needs gating (it can leak request paths); default to unauthenticated cluster-internal surface but DESIGN must decide exposure (Sonar/security). Clients: N/A. Website: N/A.
+
+## Pre-slice spike candidates
+- Pick the metrics surface (OpenTelemetry.Exporter.Prometheus vs. prometheus-net) and confirm it coexists with our logging. (~1 hr)
+- Measure overhead of always-on ASP.NET Core + EF tracing to decide the default. (~1 hr)
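Reviewer sketch for slice 05 (not part of the committed file): a minimal rendering of a counter in the Prometheus text exposition format, to make concrete what "returns Prometheus-format output" could look like in the acceptance test. It is not the OpenTelemetry exporter or prometheus-net; the metric name `lighthouse_http_requests_total`, its labels, and the helper names are hypothetical.

```typescript
interface CounterSample {
  labels: Record<string, string>;
  value: number;
}

// Label values escape backslash, double quote, and newline, in that order.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Renders HELP and TYPE lines followed by one line per sample.
function renderCounter(name: string, help: string, samples: CounterSample[]): string {
  const lines = ["# HELP " + name + " " + help, "# TYPE " + name + " counter"];
  for (const sample of samples) {
    // Sort label keys so the output is deterministic and easy to assert on.
    const labelText = Object.keys(sample.labels)
      .sort()
      .map((key) => key + '="' + escapeLabelValue(sample.labels[key] ?? "") + '"')
      .join(",");
    const labelBlock = labelText.length > 0 ? "{" + labelText + "}" : "";
    lines.push(name + labelBlock + " " + sample.value);
  }
  return lines.join("\n") + "\n";
}
```

An integration test could then assert that the `/metrics` body contains a `# TYPE ... counter` line and at least one sample line for an HTTP server metric, without pinning exact values.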