2 changes: 1 addition & 1 deletion Lighthouse.Frontend/package.json
@@ -21,7 +21,7 @@
"@microsoft/signalr": "^10.0.0",
"@mui/icons-material": "^7.3.11",
"@mui/lab": "7.0.0",
-"@mui/material": "^9.0.1",
+"@mui/material": "^9.1.1",
"@mui/system": "^9.1.1",
"@mui/x-charts": "9.0.1",
"@mui/x-data-grid": "^9.5.0",
91 changes: 36 additions & 55 deletions Lighthouse.Frontend/pnpm-lock.yaml

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion Lighthouse.Frontend/vitest.config.ts
@@ -29,7 +29,7 @@ export default defineConfig({
],
server: {
deps: {
-inline: ["@mui/x-data-grid"],
+inline: [/@mui\//, /react-transition-group/],
},
},

445 changes: 445 additions & 0 deletions docs/feature/epic-5305-k8s-readiness/feature-delta.md

Large diffs are not rendered by default.

@@ -0,0 +1,43 @@
# Slice 01: Reverse-proxy forwarded headers

**Feature**: epic-5305-k8s-readiness
**Story**: US-01 (ADO #5311) → job-operator-correct-behind-proxy
**Estimate**: ~0.5–1 crafter day
**Reference class**: config-gated startup wiring, similar to `auth-allowedorigins-envvar-binding-fix` (env-bound ASP.NET Core middleware config, off unless declared)

## Goal
Make Lighthouse honour `X-Forwarded-Proto`, `X-Forwarded-Host` and `X-Forwarded-For` from a declared, trusted reverse proxy so that HTTPS redirects, secure cookies, OIDC callback URLs and SignalR negotiation use the real public scheme and host. The behaviour is config-gated and OFF unless a proxy is declared.

## IN scope
- `UseForwardedHeaders` wired with a `ForwardedHeadersOptions` populated from configuration: known proxies / known networks (CIDR), forwarded-header count limit.
- A single config switch (env var + appsettings) that turns forwarded-header trust on and declares the trusted proxy set; default OFF.
- OIDC callback URL + `RequireHttpsMetadata`/redirect behaviour derive from the forwarded scheme/host when trust is on.
- Secure-cookie + HTTPS-redirect behaviour consistent with the forwarded scheme.
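
The trust gate described in the list above can be modelled in a few lines. This is a language-neutral sketch in Python, not the ASP.NET Core wiring: `ForwardedTrust`, `effective_origin` and the example host name are hypothetical names introduced for illustration, and the real middleware also handles `X-Forwarded-For` and a forward limit, which are omitted here. It shows the one property the slice depends on: forwarded headers are honoured only when the direct peer falls inside a declared network, and an empty declaration means trust is OFF.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ForwardedTrust:
    """Hypothetical config object: an empty network list means trust is OFF."""

    trusted_networks: List[str] = field(default_factory=list)  # CIDR strings

    def is_trusted(self, remote_ip: str) -> bool:
        # A peer is trusted only if it falls inside one of the declared networks.
        addr = ipaddress.ip_address(remote_ip)
        return any(addr in ipaddress.ip_network(cidr) for cidr in self.trusted_networks)


def effective_origin(
    trust: ForwardedTrust,
    remote_ip: str,
    scheme: str,
    host: str,
    headers: Dict[str, str],
) -> Tuple[str, str]:
    """Return the (scheme, host) the app should use when building absolute URLs."""
    if not trust.is_trusted(remote_ip):
        # Untrusted peer or trust disabled: ignore forwarded headers entirely.
        return scheme, host
    return (
        headers.get("X-Forwarded-Proto", scheme),
        headers.get("X-Forwarded-Host", host),
    )
```

With no networks declared, a request carrying `X-Forwarded-Proto: https` still resolves to the connection's own scheme and host, which is the standalone gate in miniature.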

## OUT scope
- The Ingress / Traefik manifests themselves (Productization epic #5306, chart story 09).
- Edge auth (oauth2-proxy) — north-star, not this slice.
- Health-check endpoints → slice 02.

## Learning hypothesis
**Confirms if it succeeds**: a real OIDC login through a TLS-terminating proxy completes on the first try (no `http://` callback, no redirect loop, secure cookie persists).
**Disproves if it fails**: ASP.NET Core forwarded-header handling is insufficient for our SignalR negotiation path and we need per-endpoint handling rather than one global middleware.

## Acceptance criteria
See US-01 in `../feature-delta.md`. Key: with trust ON and a simulated `X-Forwarded-Proto: https` + `X-Forwarded-Host`, an integration test asserts the generated OIDC redirect/callback URL is `https://<public-host>/...`; with trust OFF (no proxy declared), behaviour is byte-identical to today (standalone gate).

## Dependencies
None. Foundation slice — unblocks correct auth on any proxied deployment; should land before any cluster auth testing.

## Production data requirement
**Required.** Smoke a real OIDC login (Keycloak or the configured provider) through an actual reverse proxy (local Traefik/nginx), not just a unit test with synthetic headers.

## Dogfood moment
The dev instance, placed behind a local Traefik with TLS, logs in via OIDC over the HTTPS hostname within the same day.

## Cross-cutting checklist (confirmed in feature-delta)
RBAC: N/A — no authorization surface changes; only how the app derives scheme/host. Clients: N/A — no API contract change. Website: N/A — operational, not a marketed surface.

## Pre-slice spike candidates
- Confirm that `UseForwardedHeaders` runs early enough in the middleware pipeline for SignalR negotiation to see the forwarded scheme and host. (~1 hr)
- Verify the existing OIDC setup reads the request scheme/host (not a hardcoded base URL) so forwarded headers actually flow through. (~30 min)
@@ -0,0 +1,44 @@
# Slice 02: Health checks (liveness / readiness / startup)

**Feature**: epic-5305-k8s-readiness
**Story**: US-02 (ADO #5310) → job-operator-trust-pod-health
**Estimate**: ~1–1.5 crafter days
**Reference class**: new read endpoints + DI wiring; learning story 04 (#5194) exercised probes as a spike — this is the product implementation

## Goal
Add real ASP.NET Core health checks driving the three k8s probes so traffic reaches only serving pods and only genuinely-dead pods restart.

## IN scope
- `AddHealthChecks()` with distinct tagged checks mapped to three endpoints:
  - **readiness** (`/health/ready`): DB connectivity + migrations-applied → pod kept OUT of LB rotation until truly serving.
  - **liveness** (`/health/live`): shallow; restart only on genuine deadlock, NOT on a slow dependency.
  - **startup** (`/health/startup`): covers slow boot / migration window without tripping liveness.
- Endpoints harmless / no-op-friendly in single-container mode (standalone gate).
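
The split between the three endpoints comes down to which checks carry which tag. The following is a minimal Python model of that idea, not the ASP.NET Core health-check API: `build_checks`, `evaluate`, the check names and the tag assignments are illustrative assumptions. The key property is that the database probe is tagged only for readiness, so a DB outage cannot turn liveness red.

```python
from typing import Callable, Dict, List, Tuple

# (name, tags, probe): a probe returns True when healthy.
Check = Tuple[str, List[str], Callable[[], bool]]


def build_checks(
    db_reachable: Callable[[], bool],
    migrations_applied: Callable[[], bool],
) -> List[Check]:
    """Hypothetical registration: each check declares the endpoints it feeds."""
    return [
        ("self", ["live"], lambda: True),  # shallow: process can answer at all
        ("database", ["ready"], db_reachable),
        ("migrations", ["ready", "startup"], migrations_applied),
    ]


def evaluate(checks: List[Check], tag: str) -> Tuple[int, Dict[str, bool]]:
    """Run only the checks carrying `tag`; return (HTTP status, per-check results)."""
    results = {name: probe() for name, tags, probe in checks if tag in tags}
    status = 200 if all(results.values()) else 503
    return status, results
```

For example, `evaluate(build_checks(lambda: False, lambda: True), "ready")` yields a 503 while the same checks evaluated with `"live"` yield 200, which is the no-restart-storm behaviour the learning hypothesis describes.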

## OUT scope
- The k8s probe manifests (chart story 09 / Productization #5306).
- Migration-applied detection that requires the migration lock → coordinate with slice 04 (this slice checks "migrations applied", slice 04 owns "apply once across replicas").
- /metrics, tracing → slice 05.

## Learning hypothesis
**Confirms if it succeeds**: a pod with an unreachable DB drops out of rotation (readiness red) WITHOUT being restarted (liveness green) — no restart storm.
**Disproves if it fails**: a shallow liveness check can't distinguish deadlock from slow dependency cheaply, forcing a richer (and riskier) liveness signal.

## Acceptance criteria
See US-02 in `../feature-delta.md`. Key: integration tests assert (a) readiness returns unhealthy when DB is down but liveness stays healthy; (b) readiness returns healthy only when DB reachable AND migrations applied; (c) endpoints return 200 in single-container mode with no orchestrator.

## Dependencies
Soft dependency on slice 04 for the precise "migrations applied" signal; can ship with a simpler "can open a DB connection" readiness first and tighten once slice 04 lands.

## Production data requirement
**Required.** Run the dev instance, kill the DB connection, observe readiness flip while the process is NOT restarted; restore and observe recovery.

## Dogfood moment
Dev instance deployed with the three probes wired; operator watches a clean rollout where a not-yet-migrated pod stays out of rotation until ready.

## Cross-cutting checklist (confirmed in feature-delta)
RBAC: N/A — health endpoints are unauthenticated operational surface (no business data). Clients: N/A. Website: N/A.

## Pre-slice spike candidates
- Decide whether health endpoints sit on the main port or a separate management port. (~30 min)
- Confirm a cheap, reliable "migrations applied" query against EF Core for both SQLite and Postgres. (~1 hr)
@@ -0,0 +1,45 @@
# Slice 03: Graceful shutdown (SIGTERM) + connection draining

**Feature**: epic-5305-k8s-readiness
**Story**: US-03 (ADO #5309) → job-operator-zero-downtime-rollout
**Estimate**: ~1–1.5 crafter days
**Reference class**: `IHostedService` / `IHostApplicationLifetime` lifecycle wiring; touches the same update-queue hosted services as Epic 5121 / #5304

## Goal
Handle SIGTERM cleanly so a terminating pod stops accepting new work, drains in-flight HTTP + SignalR connections, flushes/awaits the in-memory update queue, and finishes within `terminationGracePeriodSeconds` — enabling zero-downtime rolling updates.

## IN scope
- Wire `IHostApplicationLifetime` `ApplicationStopping`/`ApplicationStopped` and/or `IHostedService.StopAsync` to:
  - stop accepting new HTTP requests and new SignalR negotiations,
  - drain in-flight HTTP requests within a bounded window,
  - flush/await the in-memory `UpdateQueueService` Channel so queued/in-flight updates complete (or are safely abandoned) before exit,
  - close SignalR connections so clients reconnect to a surviving pod.
- Configurable shutdown timeout aligned to `terminationGracePeriodSeconds`.
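
The bounded drain of the update queue can be sketched independently of the hosting framework. The example below uses Python's `asyncio` purely as a model; `worker`, `drain` and `demo` are hypothetical names, and the 10 ms sleep stands in for real update work. It waits for queued work up to a timeout, then cancels the consumer and reports whatever was left, including the item that was in flight when the deadline hit.

```python
import asyncio
from typing import List, Tuple


async def worker(queue: asyncio.Queue, processed: List[int]) -> None:
    """Consume items forever; hand an in-flight item back if cancelled."""
    while True:
        item = await queue.get()
        try:
            await asyncio.sleep(0.01)  # stand-in for real update work
            processed.append(item)
        except asyncio.CancelledError:
            queue.put_nowait(item)  # return the in-flight item so it is not lost
            raise
        finally:
            queue.task_done()


async def drain(
    queue: asyncio.Queue, worker_task: asyncio.Task, timeout: float
) -> Tuple[bool, List[int]]:
    """Wait up to `timeout` for the queue to empty, then stop the worker."""
    try:
        await asyncio.wait_for(queue.join(), timeout)
        drained = True
    except asyncio.TimeoutError:
        drained = False
    worker_task.cancel()
    await asyncio.gather(worker_task, return_exceptions=True)
    abandoned = []
    while not queue.empty():
        abandoned.append(queue.get_nowait())
    return drained, abandoned


async def demo(items: int, timeout: float) -> Tuple[bool, List[int], List[int]]:
    queue: asyncio.Queue = asyncio.Queue()
    processed: List[int] = []
    for i in range(items):
        queue.put_nowait(i)
    task = asyncio.create_task(worker(queue, processed))
    drained, abandoned = await drain(queue, task, timeout)
    return drained, processed, abandoned
```

Running `asyncio.run(demo(3, 1.0))` drains everything, whereas a deliberately short timeout such as `asyncio.run(demo(50, 0.05))` returns the unprocessed items in `abandoned`; what to do with those (drop or re-enqueue elsewhere) is the design decision this slice leaves open.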

## OUT scope
- The cluster-wide single-consumer queue redesign → slice 07 (#5304). This slice drains the *current per-process* queue cleanly; it does not make the queue distributed.
- SignalR Redis backplane → slice 07.
- Probe manifests → Productization #5306.

## Learning hypothesis
**Confirms if it succeeds**: under a rolling update, a load test driving requests + an active SignalR client sees zero failed requests and a clean client reconnect as pods cycle.
**Disproves if it fails**: the in-memory update queue can't be drained deterministically within a sane grace period (e.g. a long external sync mid-flight), forcing the queue-redesign (slice 07) to land *before* true zero-downtime is claimable.

## Acceptance criteria
See US-03 in `../feature-delta.md`. Key: an integration test issues SIGTERM/`StopAsync` while an HTTP request and a queued update are in flight and asserts both complete (or the update is safely re-enqueued) before the host reports stopped; a single-container Ctrl-C behaves exactly as today (standalone gate).

## Dependencies
Pairs with slice 02 (readiness must flip to NotReady on `ApplicationStopping` so the LB stops routing before drain). Soft-precedes slice 07.

## Production data requirement
**Required.** Drive the dev instance under a small load generator + live SignalR client through a simulated rolling restart; assert no dropped requests.

## Dogfood moment
Operator triggers a rolling restart of the dev deployment during active use and observes no user-visible error and a seamless SignalR reconnect.

## Cross-cutting checklist (confirmed in feature-delta)
RBAC: N/A. Clients: N/A — server-side lifecycle only; CLI/MCP callers just reconnect. Website: N/A.

## Pre-slice spike candidates
- Measure worst-case in-flight update duration (external sync) to size the grace period. (~1 hr)
- Confirm Kestrel/ASP.NET shutdown ordering vs. our hosted services so drain runs before the server socket closes. (~1 hr)
@@ -0,0 +1,42 @@
# Slice 04: Expand-only EF migrations + safe startup under N replicas

**Feature**: epic-5305-k8s-readiness
**Story**: US-04 (ADO #5308) → job-operator-zero-downtime-rollout + job-operator-survive-multiple-replicas
**Estimate**: ~2–2.5 crafter days
**Reference class**: EF migration mechanics (hit the stale-migration-DLL `--no-incremental` trap in `delivery-target-date-tracking`); concurrency coordination akin to Epic 5121

## Goal
Two coupled guarantees: (1) each release's migrations are additive-only (expand now; destructive cleanup deferred to a LATER release) so old pods never depend on a dropped column during a rollover; (2) when N replicas boot concurrently, exactly one applies migrations while the rest wait — no race on `Database.Migrate()`.

## IN scope
- **Expand-only discipline**: a guard/check (analyzer, test, or migration-review gate) that fails CI if a migration in this release is destructive (drop/rename column/table) — destructive ops must be a separate later release. Document the expand → contract two-release pattern.
- **Startup migration coordination**: a migration lock / dedicated init mechanism / leader so exactly one replica runs `Migrate()`; others wait until migrations are applied, then start serving.
- **Standalone gate**: a single SQLite or Postgres instance still auto-migrates on boot exactly as today (lock is a no-op / trivially-acquired with one instance).
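
The "one applies, the rest wait" guarantee reduces to a lock plus a re-check under the lock. The sketch below models this with threads and an in-memory object; `FakeDatabase`, `start_replica` and `boot` are hypothetical, and `threading.Lock` merely stands in for whichever cross-process mechanism the spike selects. It does not exercise a real database, so it illustrates the control flow only and is no substitute for the real-Postgres reproduction this slice requires.

```python
import threading
from typing import List, Tuple


class FakeDatabase:
    """In-memory stand-in for a shared database with a migration history."""

    def __init__(self, pending: List[str]) -> None:
        self.pending = list(pending)
        self.history: List[str] = []
        self.lock = threading.Lock()  # stand-in for a cross-replica lock


def start_replica(db: FakeDatabase, name: str, applied_by: List[str]) -> None:
    with db.lock:  # replicas queue here; one enters at a time
        if db.pending:  # re-check under the lock: someone may have migrated already
            db.history.extend(db.pending)
            db.pending.clear()
            applied_by.append(name)
    # Lock released: migrations are applied, so this replica may start serving.


def boot(replicas: int, pending: List[str]) -> Tuple[FakeDatabase, List[str]]:
    db = FakeDatabase(pending)
    applied_by: List[str] = []
    threads = [
        threading.Thread(target=start_replica, args=(db, f"pod-{i}", applied_by))
        for i in range(replicas)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return db, applied_by
```

With a single replica the lock is acquired immediately, which mirrors the standalone gate: `boot(1, [...])` and `boot(3, [...])` both end with exactly one entry in `applied_by`.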

## OUT scope
- The actual cluster-wide update-queue redesign → slice 07.
- Provider-matrix migration generation uses the existing `CreateMigration` PowerShell script (per CLAUDE.md) — not new tooling.

## Learning hypothesis
**Confirms if it succeeds**: 3 replicas started simultaneously against one fresh Postgres apply the migration exactly once (one applies, two wait), and a destructive migration is rejected by CI before merge.
**Disproves if it fails**: app-level migration coordination is too fragile under k8s and we must move migrations into a dedicated pre-deploy Job / ArgoCD sync-wave (decision pushed to Productization #5306) — in which case this slice delivers the expand-only guard + a documented "migrate via Job" path instead of an in-process lock.

## Acceptance criteria
See US-04 in `../feature-delta.md`. Key: an integration/concurrency test starts N hosts against one DB and asserts a single migration application (e.g. via a migration-history assertion / lock observation); a CI check rejects a destructive migration; single-instance boot auto-migrates unchanged.

## Dependencies
None hard. Feeds slice 02's "migrations applied" readiness signal. Precedes real multi-replica operation (slice 07).

## Production data requirement
**Required.** Reproduce concurrent startup against a real Postgres (k3s, 3 replicas) — InMemory tests will NOT catch the race (recurring lesson: persisted-model migration traps are invisible to InMemory).

## Dogfood moment
Operator scales a fresh deploy to 3 replicas against an empty Postgres and observes one migration application in the logs, all pods healthy.

## Cross-cutting checklist (confirmed in feature-delta)
RBAC: N/A. Clients: N/A — no API contract; possibly a CLI connection hint for Postgres, confirm in DESIGN. Website: N/A.

## Pre-slice spike candidates
- Evaluate a PostgreSQL advisory lock vs. a migration-history sentinel vs. an init-Job approach for the boot lock. (~2 hr)
- Prototype the destructive-migration CI guard (parse generated migration for `DropColumn`/`DropTable`/`RenameColumn`). (~1 hr)
- Confirm the SQLite path degrades the lock to a no-op. (~30 min)
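
A first cut of the destructive-migration guard could look like the sketch below. It is an assumption-laden prototype in Python rather than the project's chosen tooling: `up_body`, `destructive_operations` and the `DESTRUCTIVE_CALLS` list are illustrative, and `RenameTable` is added here on the reading that table renames count as destructive. It scans only the `Up()` method, because the `Down()` of a purely additive migration legitimately contains `DropColumn`, and a naive grep over the whole file would flag every migration. The brace matching is deliberately simple and would be confused by braces inside string literals.

```python
import re
from typing import List

# Calls treated as destructive; adjust to the team's policy.
DESTRUCTIVE_CALLS = ("DropColumn", "DropTable", "RenameColumn", "RenameTable")
_PATTERN = re.compile(
    r"migrationBuilder\s*\.\s*(" + "|".join(DESTRUCTIVE_CALLS) + r")\b"
)


def up_body(source: str) -> str:
    """Return the text of the Up() method body, located by naive brace matching."""
    match = re.search(r"\bvoid\s+Up\s*\(", source)
    if not match:
        return ""
    start = source.find("{", match.end())
    if start == -1:
        return ""
    depth = 0
    for i in range(start, len(source)):
        if source[i] == "{":
            depth += 1
        elif source[i] == "}":
            depth -= 1
            if depth == 0:
                return source[start : i + 1]
    return source[start:]  # unbalanced braces: scan to end of file


def destructive_operations(source: str) -> List[str]:
    """List destructive calls found inside Up(), in order of appearance."""
    return [m.group(1) for m in _PATTERN.finditer(up_body(source))]
```

A CI step could run `destructive_operations` over each migration file added in the release and fail the build when the returned list is non-empty.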
@@ -0,0 +1,42 @@
# Slice 05: App observability hooks (/metrics + structured logging + traces)

**Feature**: epic-5305-k8s-readiness
**Story**: US-05 (ADO #5312) → job-operator-observe-in-cluster
**Estimate**: ~1.5 crafter days
**Reference class**: new instrumentation wiring (OpenTelemetry .NET + Prometheus exporter + structured logging provider)

## Goal
Instrument the app for cluster observability: expose a Prometheus `/metrics` endpoint, emit structured JSON logs to stdout, and add OpenTelemetry traces — in-app instrumentation only, low-overhead / off-by-default where appropriate so the single-container self-hoster pays nothing.

## IN scope
- Prometheus `/metrics` endpoint (request rate / error rate / latency at minimum) via OpenTelemetry metrics + the Prometheus exporter.
- Structured JSON logging to stdout (configurable), preserving today's log content but in queryable JSON.
- OpenTelemetry tracing (ASP.NET Core + HttpClient + EF instrumentation) exporting via OTLP, exporter off/no-op unless configured.
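
To make "same log content, but in queryable JSON" concrete, here is a small model using Python's standard `logging` module. It is not the .NET logging provider the slice will pick: `JsonFormatter`, `build_logger` and the field names (`timestamp`, `level`, `category`, `message`) are assumptions chosen for illustration. The `structured` flag plays the role of the configuration switch, so the plain format remains available as the default.

```python
import json
import logging
from datetime import datetime, timezone
from typing import TextIO


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "category": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str, stream: TextIO, structured: bool) -> logging.Logger:
    """Return a logger writing to `stream`, as JSON when `structured` is set."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    plain = logging.Formatter("%(levelname)s %(name)s %(message)s")
    handler.setFormatter(JsonFormatter() if structured else plain)
    logger.addHandler(handler)
    return logger
```

Passing `sys.stdout` as the stream gives one JSON object per line, a shape that log collectors can typically parse field by field.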

## OUT scope
- The cluster-side Prometheus / Grafana / Loki stack — Productization epic #5306, story 16.
- Per-tenant metric labelling / multi-tenant dashboards → #5306.
- Business KPI instrumentation (those live in `docs/product/kpi-contracts.yaml`); this slice is operational telemetry, not product KPIs.

## Learning hypothesis
**Confirms if it succeeds**: a local Prometheus scrapes `/metrics` and a local Grafana shows Lighthouse request/error/latency; JSON logs parse field-wise in Loki; a slow request is traceable.
**Disproves if it fails**: always-on instrumentation imposes measurable overhead on the single container, forcing a stricter off-by-default posture (and documentation that self-hosters must opt in).

## Acceptance criteria
See US-05 in `../feature-delta.md`. Key: an integration test asserts `/metrics` returns Prometheus-format output including HTTP server metrics; logs emitted in the JSON shape contain the expected fields; with telemetry disabled, no exporter runs and the log format matches the configured default (standalone gate: no perf change).

## Dependencies
None. Can land any time; valuable before slice 07's multi-replica work (so the operator isn't flying blind during scale-out).

## Production data requirement
**Recommended.** Scrape the dev instance with a real local Prometheus and confirm a dashboard renders; not strictly required for the unit-level acceptance.

## Dogfood moment
Operator points a local Prometheus + Grafana at the dev instance and sees a live Lighthouse dashboard within the day.

## Cross-cutting checklist (confirmed in feature-delta)
RBAC: confirm whether `/metrics` needs gating (it can leak request paths); default to unauthenticated cluster-internal surface but DESIGN must decide exposure (Sonar/security). Clients: N/A. Website: N/A.

## Pre-slice spike candidates
- Pick the metrics surface (OpenTelemetry.Exporter.Prometheus vs. prometheus-net) and confirm it coexists with our logging. (~1 hr)
- Measure overhead of always-on ASP.NET Core + EF tracing to decide the default. (~1 hr)