SecAI-Hub
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
Lines changed: 4 additions & 4 deletions
diff --git a/README.md b/README.md
Lines changed: 7 additions & 6 deletions
diff --git a/docs/production-operations.md b/docs/production-operations.md
Lines changed: 87 additions & 2 deletions
diff --git a/docs/security-status.md b/docs/security-status.md
Lines changed: 2 additions & 1 deletion
diff --git a/docs/test-counts.json b/docs/test-counts.json
Lines changed: 4 additions & 4 deletions
diff --git a/docs/test-matrix.md b/docs/test-matrix.md
Lines changed: 7 additions & 6 deletions
@@ -55,7 +55,7 @@ shellcheck files/system/usr/libexec/secure-ai/*.sh
 
 ## Running Tests
 
-### Go Tests (399 tests across 9 services)
+### Go Tests (402 tests across 9 services)
 
 ```bash
 for svc in airlock registry tool-firewall gpu-integrity-watch mcp-firewall \
@@ -64,7 +64,7 @@ for svc in airlock registry tool-firewall gpu-integrity-watch mcp-firewall \
 done
 ```
 
-### Python Tests (718 tests)
+### Python Tests (739 tests)
 
 ```bash
 pip install -r requirements-ci.txt
@@ -89,13 +89,13 @@ shellcheck files/system/usr/libexec/secure-ai/*.sh files/scripts/*.sh
 ### Run Everything
 
 ```bash
-# Go (9 services, 399 tests)
+# Go (9 services, 402 tests)
 for svc in airlock registry tool-firewall gpu-integrity-watch mcp-firewall \
            policy-engine runtime-attestor integrity-monitor incident-recorder; do
   (cd "services/$svc" && go test -v -race ./...)
 done
 
-# Python (718 tests)
+# Python (739 tests)
 PYTHONPATH=services python -m pytest tests/ -v
 
 # Type check
 
@@ -156,7 +156,7 @@ Every model passes through the same fully automatic pipeline:
 | **Updates** | Cosign-verified rpm-ostree, staged workflow, greenboot auto-rollback |
 | **Supply Chain** | Per-service CycloneDX SBOMs, SLSA3 provenance attestation, cosign-signed checksums |
 
-See [docs/threat-model.md](docs/threat-model.md) for threat classes, residual risks, and security invariants. See [docs/security-status.md](docs/security-status.md) for implementation status of all 47 milestones.
+See [docs/threat-model.md](docs/threat-model.md) for threat classes, residual risks, and security invariants. See [docs/security-status.md](docs/security-status.md) for implementation status of all 48 milestones.
 
 ### Verify Image Signatures
 
@@ -218,8 +218,8 @@ All CI jobs are defined in [`.github/workflows/ci.yml`](.github/workflows/ci.yml
 
 | Job | Workflow Link | What It Proves |
 |-----|--------------|---------------|
-| `go-build-and-test` | [View job](https://github.com/SecAI-Hub/SecAI_OS/actions/workflows/ci.yml) | 399 Go tests across 9 services with `-race` (build, test, vet) |
-| `python-test` | [View job](https://github.com/SecAI-Hub/SecAI_OS/actions/workflows/ci.yml) | 718 Python tests (unit/integration + adversarial/acceptance), ruff lint, bandit security scan (enforced on HIGH/HIGH), mypy type checking |
+| `go-build-and-test` | [View job](https://github.com/SecAI-Hub/SecAI_OS/actions/workflows/ci.yml) | 402 Go tests across 9 services with `-race` (build, test, vet) |
+| `python-test` | [View job](https://github.com/SecAI-Hub/SecAI_OS/actions/workflows/ci.yml) | 739 Python tests (unit/integration + adversarial/acceptance), ruff lint, bandit security scan (enforced on HIGH/HIGH), mypy type checking |
 | `security-regression` | [View job](https://github.com/SecAI-Hub/SecAI_OS/actions/workflows/ci.yml) | Adversarial test suite: prompt injection, policy bypass, containment, recovery |
 | `supply-chain-verify` | [View job](https://github.com/SecAI-Hub/SecAI_OS/actions/workflows/ci.yml) | SBOM generation via Syft, cosign availability, provenance keywords in release/build workflows |
 | `test-count-check` | [View job](https://github.com/SecAI-Hub/SecAI_OS/actions/workflows/ci.yml) | Prevents documented test counts from drifting below actual (source of truth: [test-counts.json](docs/test-counts.json)) |
@@ -239,8 +239,8 @@ All CI jobs are defined in [`.github/workflows/ci.yml`](.github/workflows/ci.yml
 | [Threat Model](docs/threat-model.md) | Threat classes, invariants, residual risks |
 | [API Reference](docs/api.md) | HTTP API for all services |
 | [Policy Schema](docs/policy-schema.md) | Full policy.yaml schema reference |
-| [Security Status](docs/security-status.md) | Implementation status of all 47 milestones |
-| [Test Matrix](docs/test-matrix.md) | Test coverage: 1,117 tests across Go and Python (see [test-counts.json](docs/test-counts.json)) |
+| [Security Status](docs/security-status.md) | Implementation status of all 48 milestones |
+| [Test Matrix](docs/test-matrix.md) | Test coverage: 1,141 tests across Go and Python (see [test-counts.json](docs/test-counts.json)) |
 | [Compatibility Matrix](docs/compatibility-matrix.md) | GPU, VM, and hardware support |
 | [Security Test Matrix](docs/security-test-matrix.md) | Security feature test coverage |
 | [FAQ](docs/faq.md) | Common questions |
@@ -426,6 +426,7 @@ See [docs/test-matrix.md](docs/test-matrix.md) for full breakdown.
 - [x] **Milestone 45** -- Production readiness hardening: incident persistence (file-backed), graceful shutdown for all Go services, HTTP timeouts, systemd production hardening, first-boot validation, audit log rotation, CI vulnerability scanning, production operations guide
 - [x] **Milestone 46** -- Operational maturity: bootstrap trust gap fix (cosign verify before rebase), CI runs on all changes (removed paths-ignore for .md), Python quality gates (ruff + bandit + split test suites), docs-validation CI job, production-readiness checklist, SLOs, release channel policy, support lifecycle, sample verification output
 - [x] **Milestone 47** -- CI enforcement hardening: enforced vulnerability scanning (govulncheck + pip-audit + bandit fail on HIGH/HIGH) with waiver mechanism, mypy type checking for security-sensitive services, pinned reproducible Python CI dependencies, Go 1.23→1.25 (12 stdlib CVE fixes), verification-first bootstrap docs
+- [x] **Milestone 48** -- Production hardening: build script fail-closed (fatal errors for 12 required services + binary verification gate), incident store fsync (crash-safe persistence), GPU backend metadata recording, llama-server watchdog (Type=notify + WatchdogSec=30), model catalog externalization (YAML with fallback), circuit breaker for inter-service HTTP calls, post-upgrade model verification in Greenboot, cosign key rotation documentation (full lifecycle)
 
 </details>
 
@@ -457,7 +458,7 @@ services/
   search-mediator/          Python -- Tor-routed web search (:8485)
   ui/                       Python/Flask -- Web UI (:8480)
   common/                   Python -- Shared utilities (audit, auth, mlock)
-tests/                      718 Python tests, 399 Go tests (1,117 total)
+tests/                      739 Python tests, 402 Go tests (1,141 total)
 docs/                       Architecture, API, threat model, install guides
 schemas/                    OpenAPI spec, JSON Schema for config files
 examples/                   Task-oriented walkthroughs
 
@@ -90,9 +90,94 @@ The inter-service bearer token at `/run/secure-ai/service-token`:
    sudo systemctl restart secure-ai-runtime-attestor
    ```
 
-### Cosign Signing Key (Image & Release)
+### Cosign Signing Key (Image & Release Artifacts)
 
-Rotate via the GitHub repository secrets. Update `SIGNING_SECRET` in repository settings, then trigger a new build.
+The cosign signing key is used to sign:
+- OCI container images (`ghcr.io/secai-hub/secai_os:*`)
+- SBOM attestations (CycloneDX per-service)
+- Release checksums (`SHA256SUMS.sig`)
+- SLSA provenance attestations
+
+#### Key Generation
+
+```bash
+# Generate a new cosign key pair (interactive passphrase prompt)
+cosign generate-key-pair
+
+# This creates cosign.key (private) and cosign.pub (public)
+# Store cosign.key in a password manager or HSM — never commit to git
+```
+
+#### Rotation Schedule
+
+| Trigger | Action |
+|---------|--------|
+| **Annual** (recommended) | Proactive rotation, even with no incident |
+| **Key compromise** | Immediate emergency rotation |
+| **Personnel change** | Rotate if key holder leaves the project |
+| **CI provider breach** | Rotate if GitHub Actions secrets may have been exposed |
+
+#### Rotation Procedure
+
+1. **Generate new key pair** (on an air-gapped machine if possible):
+   ```bash
+   cosign generate-key-pair
+   ```
+
+2. **Update GitHub repository secret**:
+   - Go to: Settings → Secrets and variables → Actions
+   - Update `SIGNING_SECRET` with the new `cosign.key` contents
+   - Verify the secret is updated (name shows "Updated just now")
+
+3. **Update the public key in deployed appliances**:
+   ```bash
+   # Copy the new cosign.pub to the appliance
+   scp cosign.pub admin@appliance:/tmp/cosign.pub
+   # On the appliance (requires local admin access):
+   sudo cp /tmp/cosign.pub /etc/secure-ai/cosign.pub
+   sudo chmod 0644 /etc/secure-ai/cosign.pub
+   ```
+
+4. **Tag a new release** to produce signed artifacts with the new key:
+   ```bash
+   git tag -s vX.Y.Z -m "Release vX.Y.Z (key rotation)"
+   git push origin vX.Y.Z
+   ```
+
+5. **Verify** the new signature:
+   ```bash
+   cosign verify --key cosign.pub ghcr.io/secai-hub/secai_os:vX.Y.Z
+   ```
+
+#### Emergency Revocation
+
+If the signing key is compromised:
+
+1. **Immediately** rotate the key (steps above)
+2. **Revoke trust** in old images: replace `/etc/secure-ai/cosign.pub` on every deployed appliance so that signatures made with the compromised key no longer verify
+3. **Re-sign** the latest stable release with the new key:
+   ```bash
+   cosign sign --key cosign.key ghcr.io/secai-hub/secai_os:latest
+   ```
+4. **Announce** the compromise via GitHub Security Advisory
+5. **Audit** CI logs for any unauthorized release activity during the exposure window
+
+#### Key Audit Checklist
+
+- [ ] Private key stored in encrypted vault or HSM (never on disk in plaintext)
+- [ ] Only CI (`SIGNING_SECRET`) and the key custodian have access to the private key
+- [ ] Public key shipped in OS image at `/etc/secure-ai/cosign.pub`
+- [ ] `verify-release.sh` uses the shipped public key, not a remote fetch
+- [ ] Key rotation date recorded in `CHANGELOG.md`
+- [ ] Previous public key archived (for verifying older releases)
+
+#### Future: HSM Migration
+
+When HSM support is implemented (planned milestone):
+- Private key will be generated inside the HSM and never exported
+- Cosign will use `--key` with a PKCS#11 URI instead of a file path
+- Rotation becomes: generate new key in HSM → update PKCS#11 URI in CI
+- See `docs/security-status.md` for HSM milestone tracking
 
 ## Monitoring
 
 
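Review note on the rotation schedule in `docs/production-operations.md`: the guide recommends annual rotation and asks that the rotation date be recorded in `CHANGELOG.md`, but nothing in this diff turns that date into a reminder. A minimal sketch of such a check follows. The function names and the 365-day reading of "annual" are assumptions for illustration, not part of the repository.

```python
from datetime import date, timedelta

# Assumption: "annual" is read as 365 days; the guide does not fix an exact interval.
MAX_KEY_AGE = timedelta(days=365)


def rotation_due(last_rotated: date, today: date,
                 max_age: timedelta = MAX_KEY_AGE) -> bool:
    """Return True once the signing key is at least `max_age` old."""
    return today - last_rotated >= max_age


def days_until_rotation(last_rotated: date, today: date,
                        max_age: timedelta = MAX_KEY_AGE) -> int:
    """Days left before proactive rotation is due (negative when overdue)."""
    return (last_rotated + max_age - today).days


if __name__ == "__main__":
    # Hypothetical rotation date, as it might be recorded in CHANGELOG.md.
    recorded = date(2025, 6, 1)
    print(rotation_due(recorded, date.today()))
```

A check like this only covers the proactive trigger. Key compromise, personnel change, and CI provider breach still call for a manual decision.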
@@ -1,6 +1,6 @@
 # Security Implementation Status
 
-This document is split into two sections. The first section covers **Security Assurance Controls** -- all implemented milestones (M0 through M47) that satisfy the M5 security assurance acceptance criteria. Every control listed there is complete and tested. The second section is the **Product Feature Roadmap**, which tracks planned product capabilities (Agent Mode Phases 2 and 3). These are product enhancements, not security assurance requirements; the M5 security posture is fully met without them.
+This document is split into two sections. The first section covers **Security Assurance Controls** -- all implemented milestones (M0 through M48) that satisfy the M5 security assurance acceptance criteria. Every control listed there is complete and tested. The second section is the **Product Feature Roadmap**, which tracks planned product capabilities (Agent Mode Phases 2 and 3). These are product enhancements, not security assurance requirements; the M5 security posture is fully met without them.
 
 Last updated: 2026-03-14
 
@@ -60,6 +60,7 @@ All M5 security assurance criteria are met. The controls below have been impleme
 | Production readiness hardening | Implemented | M45 | Incident recorder file-backed persistence (survives restarts), graceful shutdown (SIGTERM/SIGINT with connection draining) for all 9 Go services, HTTP server timeouts for mcp-firewall and gpu-integrity-watch, systemd production hardening (TimeoutStartSec, TimeoutStopSec, StartLimitInterval, StartLimitBurst) for all 12 daemon units, first-boot health validation script, audit log rotation via logrotate, CI dependency vulnerability scanning (govulncheck + pip-audit), production operations guide (upgrade, key rotation, capacity limits, monitoring) |
 | Operational maturity | Implemented | M46 | Bootstrap trust gap fix (cosign verify before unverified rebase, documented trust gap rationale), CI runs on all changes (removed blanket paths-ignore for .md files), Python quality gates (ruff lint + bandit security scan + split test suites into unit/integration and adversarial/acceptance), docs-validation CI job (broken link detection, required docs check, test-counts.json validation), production-readiness checklist (formal release gate), SLOs (availability/latency/correctness targets + alerting thresholds), release channel policy (stable/candidate/dev + versioning + upgrade paths + security patch SLA), support lifecycle (hardware matrix, driver versions, support windows, deprecation policy, scope boundaries), CI evidence table with all 10 job descriptions and workflow links, sample verification output for verify-release.sh |
 | CI enforcement hardening | Implemented | M47 | Enforced vulnerability scanning: bandit fails CI on HIGH-severity/HIGH-confidence findings, govulncheck fails on unwaived Go vulns, pip-audit fails on unwaived Python vulns. Waiver mechanism (`.github/vuln-waivers.json`) with mandatory expiry dates for reviewed/accepted findings. mypy type checking gate for security-sensitive services (common, agent, quarantine, ui). Pinned reproducible Python CI dependencies (`requirements-ci.txt`). Go 1.23→1.25 upgrade fixing 12 stdlib CVEs (crypto/tls, crypto/x509, encoding/asn1, net/url, os). Flask 3.1.1→3.1.3 (GHSA-68rp-wp8r-4726). Verification-first bootstrap documentation (signed rebase as default quickstart, unverified bootstrap moved to labeled recovery section). |
+| Production hardening | Implemented | M48 | Build script fail-closed (all `|| echo WARNING` fallbacks replaced with fatal errors for 12 required services, final binary verification gate), incident store fsync (f.Sync() before close on both incident persistence and audit log writes), GPU backend metadata recording (`/etc/secure-ai/gpu-backend.json` written at build time with backend/version/timestamp), llama-server watchdog (Type=notify wrapper with startup health gate + WatchdogSec=30 continuous monitoring), model catalog externalization (`/etc/secure-ai/model-catalog.yaml` with YAML loading + hardcoded fallback), circuit breaker for Python services (closed→open→half-open state machine protecting inter-service HTTP calls), post-upgrade model verification in Greenboot (SHA256 manifest check closes 15-min integrity gap), cosign key rotation documentation (full lifecycle: generation, rotation schedule, distribution, emergency revocation, HSM migration path). 402 Go + 739 Python tests (1,141 total). |
 
 ---
 
 
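Review note on the M48 row: the incident store change is described as calling `f.Sync()` before close in the Go services. For readers more familiar with the Python side of the codebase, the same flush-then-sync pattern could be sketched as below. This is an illustrative analog, not the incident-recorder implementation; the function name and the JSON-lines record format are assumptions.

```python
import json
import os


def append_record_durably(path: str, record: dict) -> None:
    """Append one JSON line and force it to stable storage before returning.

    Hypothetical helper: mirrors the "sync before close" idea from the M48 row.
    """
    line = json.dumps(record, sort_keys=True) + "\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)
        f.flush()              # move data from Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to flush its cache to the device
```

Without the `os.fsync` call, a crash shortly after the write can lose data that the process already considered saved, because it may still sit in the kernel page cache.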
@@ -9,9 +9,9 @@
     "policy-engine": 44,
     "runtime-attestor": 55,
     "integrity-monitor": 50,
-    "incident-recorder": 83
+    "incident-recorder": 86
   },
-  "go_total": 399,
-  "python_total": 718,
-  "grand_total": 1117
+  "go_total": 402,
+  "python_total": 739,
+  "grand_total": 1141
 }
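Review note on `docs/test-counts.json`: the totals in this hunk are internally consistent (402 + 739 = 1,141). A minimal sketch of a consistency check over the file is shown below. The per-service key name `go_services` is hypothetical, since the hunk does not show the name of the enclosing object, and this is not the `test-count-check` CI job itself.

```python
from typing import Dict, List


def check_test_counts(counts: Dict, services_key: str = "go_services") -> List[str]:
    """Return a list of human-readable inconsistencies (empty when consistent).

    `services_key` is an assumed name for the per-service Go counts object.
    """
    problems = []
    per_service = counts.get(services_key, {})
    if per_service and sum(per_service.values()) != counts["go_total"]:
        problems.append(
            f"per-service sum {sum(per_service.values())} != go_total {counts['go_total']}"
        )
    expected = counts["go_total"] + counts["python_total"]
    if expected != counts["grand_total"]:
        problems.append(
            f"go_total + python_total = {expected} != grand_total {counts['grand_total']}"
        )
    return problems


# Totals taken from this diff; a stale grand_total would be reported.
print(check_test_counts({"go_total": 402, "python_total": 739, "grand_total": 1141}))
```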
@@ -11,11 +11,11 @@ Last updated: 2026-03-14
 
 | Language | Test Count | Runner |
 |----------|-----------|--------|
-| Go | 399 | `go test -race ./...` |
-| Python | 718 | `pytest` |
+| Go | 402 | `go test -race ./...` |
+| Python | 739 | `pytest` |
 | Shell | All .sh files | `shellcheck` |
 
-## Go Tests (399 total)
+## Go Tests (402 total)
 
 | Service | Location | Tests | Description |
 |---------|----------|-------|-------------|
@@ -27,15 +27,16 @@ Last updated: 2026-03-14
 | Policy Engine | services/policy-engine/ | 44 | Unified policy decisions across 6 domains, evidence generation, auth, adversarial tests (M43) |
 | Runtime Attestor | services/runtime-attestor/ | 55 | TPM2 quote verification, HMAC bundles, state machine, startup gating, service digests, incident-recorder integration |
 | Integrity Monitor | services/integrity-monitor/ | 50 | Baseline computation, continuous scanning, violation detection, state machine, HMAC baselines, incident-recorder integration |
-| Incident Recorder | services/incident-recorder/ | 83 | Incident creation, auto-containment, lifecycle management, severity ranking, policy loading, containment execution, enforcement chain integration, recovery ceremony, severity escalation, forensic bundle export (M43) |
+| Incident Recorder | services/incident-recorder/ | 86 | Incident creation, auto-containment, lifecycle management, severity ranking, policy loading, containment execution, enforcement chain integration, recovery ceremony, severity escalation, forensic bundle export (M43), persistence durability (fsync) |
 
-## Python Tests (718 total)
+## Python Tests (739 total)
 
 | Test File | Location | Tests | Description |
 |-----------|----------|-------|-------------|
 | test_pipeline.py | tests/ | 96 | Quarantine pipeline stages, scanning, pass/fail logic |
 | test_search.py | tests/ | 27 | Search mediator, PII stripping, injection detection |
-| test_ui.py | tests/ | 12 | Flask web UI routes, rendering, input handling |
+| test_ui.py | tests/ | 18 | Flask web UI routes, rendering, input handling, model catalog loading (YAML/fallback) |
+| test_circuit_breaker.py | tests/ | 15 | Circuit breaker state machine (closed/open/half-open), reset, error propagation |
 | test_vault_watchdog.py | tests/ | 18 | Vault auto-lock, idle detection, timer controls |
 | test_memory_protection.py | tests/ | 37 | Swap encryption, zswap, core dumps, mlock, TEE detection |
 | test_traffic_analysis.py | tests/ | 41 | Padding, timing jitter, dummy traffic generation |
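Review note on the `test_circuit_breaker.py` row: the matrix lists the closed/open/half-open state machine, reset, and error propagation. A minimal sketch of such a breaker is shown below to make those three states concrete. The class name, parameters, and defaults are assumptions for illustration and may differ from the implementation under `services/`; the sketch is also not thread-safe.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical closed -> open -> half-open breaker around a callable."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # timeout elapsed: allow one trial call
        return "open"

    def reset(self) -> None:
        self._failures = 0
        self._opened_at = None

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit open; call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise  # propagate the original error to the caller
        self.reset()  # any success closes the circuit again
        return result
```

Injecting `clock` keeps the timeout testable without sleeping, which is one way tests of the half-open transition can stay fast.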