Commit 1e5d7e1

SecAI-Hub and claude committed
M48: Production hardening — 8 fixes for operational gaps
Build script fail-closed: replace 10+ `|| echo WARNING` patterns with fatal errors for all 12 required services, add final binary verification gate.

GPU backend metadata now written to /etc/secure-ai/gpu-backend.json at build time.

Incident store durability: add f.Sync() between Flush() and Close() in both persistIncidents() and writeAudit() to survive power loss.

Llama-server crash recovery: Type=notify wrapper with startup health gate and WatchdogSec=30 for continuous hung-process detection.

Model catalog externalization: /etc/secure-ai/model-catalog.yaml loaded at startup with hardcoded fallback for graceful degradation.

Circuit breaker: closed→open→half-open state machine for inter-service HTTP calls, integrated into UI /api/status endpoint.

Greenboot model verification: SHA256 manifest check at boot closes the 15-minute gap between upgrade and periodic integrity scan.

Key rotation docs: cosign key lifecycle expanded from 4 lines to full procedure (generation, rotation schedule, distribution, emergency revocation, audit checklist, HSM migration path).

402 Go + 739 Python = 1,141 total tests (24 new).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 74c51c2 commit 1e5d7e1
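
The incident-store durability fix in the commit message orders the write path as flush, then sync, then close. The commit does this in Go with `f.Sync()`; the sketch below is only a Python analogue of that ordering, not the project's code. The function name `persist_durably` and the JSON-lines layout are illustrative assumptions.

```python
import json
import os
import tempfile


def persist_durably(path: str, records: list[dict]) -> None:
    """Write records as JSON lines and force them to stable storage before closing."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
        f.flush()             # drain Python's userspace buffer into the OS page cache
        os.fsync(f.fileno())  # ask the kernel to commit the cached pages to the device
    # Leaving the with-block closes the file only after the sync has returned.


# Usage with a throwaway path (hypothetical record contents)
target = os.path.join(tempfile.mkdtemp(), "incidents.jsonl")
persist_durably(target, [{"id": 1, "severity": "high"}])
```

Without the sync step, a successful close only guarantees that the data reached the kernel, so a power loss shortly afterwards could still lose it.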

17 files changed

Lines changed: 1142 additions & 247 deletions


CONTRIBUTING.md

Lines changed: 4 additions & 4 deletions
@@ -55,7 +55,7 @@ shellcheck files/system/usr/libexec/secure-ai/*.sh

 ## Running Tests

-### Go Tests (399 tests across 9 services)
+### Go Tests (402 tests across 9 services)

 ```bash
 for svc in airlock registry tool-firewall gpu-integrity-watch mcp-firewall \
@@ -64,7 +64,7 @@ for svc in airlock registry tool-firewall gpu-integrity-watch mcp-firewall \
 done
 ```

-### Python Tests (718 tests)
+### Python Tests (739 tests)

 ```bash
 pip install -r requirements-ci.txt
@@ -89,13 +89,13 @@ shellcheck files/system/usr/libexec/secure-ai/*.sh files/scripts/*.sh
 ### Run Everything

 ```bash
-# Go (9 services, 399 tests)
+# Go (9 services, 402 tests)
 for svc in airlock registry tool-firewall gpu-integrity-watch mcp-firewall \
            policy-engine runtime-attestor integrity-monitor incident-recorder; do
     (cd "services/$svc" && go test -v -race ./...)
 done

-# Python (718 tests)
+# Python (739 tests)
 PYTHONPATH=services python -m pytest tests/ -v

 # Type check

README.md

Lines changed: 7 additions & 6 deletions
@@ -156,7 +156,7 @@ Every model passes through the same fully automatic pipeline:
 | **Updates** | Cosign-verified rpm-ostree, staged workflow, greenboot auto-rollback |
 | **Supply Chain** | Per-service CycloneDX SBOMs, SLSA3 provenance attestation, cosign-signed checksums |

-See [docs/threat-model.md](docs/threat-model.md) for threat classes, residual risks, and security invariants. See [docs/security-status.md](docs/security-status.md) for implementation status of all 47 milestones.
+See [docs/threat-model.md](docs/threat-model.md) for threat classes, residual risks, and security invariants. See [docs/security-status.md](docs/security-status.md) for implementation status of all 48 milestones.

 ### Verify Image Signatures

@@ -218,8 +218,8 @@ All CI jobs are defined in [`.github/workflows/ci.yml`](.github/workflows/ci.yml

 | Job | Workflow Link | What It Proves |
 |-----|--------------|---------------|
-| `go-build-and-test` | [View job](https://github.com/SecAI-Hub/SecAI_OS/actions/workflows/ci.yml) | 399 Go tests across 9 services with `-race` (build, test, vet) |
-| `python-test` | [View job](https://github.com/SecAI-Hub/SecAI_OS/actions/workflows/ci.yml) | 718 Python tests (unit/integration + adversarial/acceptance), ruff lint, bandit security scan (enforced on HIGH/HIGH), mypy type checking |
+| `go-build-and-test` | [View job](https://github.com/SecAI-Hub/SecAI_OS/actions/workflows/ci.yml) | 402 Go tests across 9 services with `-race` (build, test, vet) |
+| `python-test` | [View job](https://github.com/SecAI-Hub/SecAI_OS/actions/workflows/ci.yml) | 739 Python tests (unit/integration + adversarial/acceptance), ruff lint, bandit security scan (enforced on HIGH/HIGH), mypy type checking |
 | `security-regression` | [View job](https://github.com/SecAI-Hub/SecAI_OS/actions/workflows/ci.yml) | Adversarial test suite: prompt injection, policy bypass, containment, recovery |
 | `supply-chain-verify` | [View job](https://github.com/SecAI-Hub/SecAI_OS/actions/workflows/ci.yml) | SBOM generation via Syft, cosign availability, provenance keywords in release/build workflows |
 | `test-count-check` | [View job](https://github.com/SecAI-Hub/SecAI_OS/actions/workflows/ci.yml) | Prevents documented test counts from drifting below actual (source of truth: [test-counts.json](docs/test-counts.json)) |
@@ -239,8 +239,8 @@ All CI jobs are defined in [`.github/workflows/ci.yml`](.github/workflows/ci.yml
 | [Threat Model](docs/threat-model.md) | Threat classes, invariants, residual risks |
 | [API Reference](docs/api.md) | HTTP API for all services |
 | [Policy Schema](docs/policy-schema.md) | Full policy.yaml schema reference |
-| [Security Status](docs/security-status.md) | Implementation status of all 47 milestones |
-| [Test Matrix](docs/test-matrix.md) | Test coverage: 1,117 tests across Go and Python (see [test-counts.json](docs/test-counts.json)) |
+| [Security Status](docs/security-status.md) | Implementation status of all 48 milestones |
+| [Test Matrix](docs/test-matrix.md) | Test coverage: 1,141 tests across Go and Python (see [test-counts.json](docs/test-counts.json)) |
 | [Compatibility Matrix](docs/compatibility-matrix.md) | GPU, VM, and hardware support |
 | [Security Test Matrix](docs/security-test-matrix.md) | Security feature test coverage |
 | [FAQ](docs/faq.md) | Common questions |
@@ -426,6 +426,7 @@ See [docs/test-matrix.md](docs/test-matrix.md) for full breakdown.
 - [x] **Milestone 45** -- Production readiness hardening: incident persistence (file-backed), graceful shutdown for all Go services, HTTP timeouts, systemd production hardening, first-boot validation, audit log rotation, CI vulnerability scanning, production operations guide
 - [x] **Milestone 46** -- Operational maturity: bootstrap trust gap fix (cosign verify before rebase), CI runs on all changes (removed paths-ignore for .md), Python quality gates (ruff + bandit + split test suites), docs-validation CI job, production-readiness checklist, SLOs, release channel policy, support lifecycle, sample verification output
 - [x] **Milestone 47** -- CI enforcement hardening: enforced vulnerability scanning (govulncheck + pip-audit + bandit fail on HIGH/HIGH) with waiver mechanism, mypy type checking for security-sensitive services, pinned reproducible Python CI dependencies, Go 1.23→1.25 (12 stdlib CVE fixes), verification-first bootstrap docs
+- [x] **Milestone 48** -- Production hardening: build script fail-closed (fatal errors for 12 required services + binary verification gate), incident store fsync (crash-safe persistence), GPU backend metadata recording, llama-server watchdog (Type=notify + WatchdogSec=30), model catalog externalization (YAML with fallback), circuit breaker for inter-service HTTP calls, post-upgrade model verification in Greenboot, cosign key rotation documentation (full lifecycle)

 </details>

@@ -457,7 +458,7 @@ services/
 search-mediator/ Python -- Tor-routed web search (:8485)
 ui/ Python/Flask -- Web UI (:8480)
 common/ Python -- Shared utilities (audit, auth, mlock)
-tests/ 718 Python tests, 399 Go tests (1,117 total)
+tests/ 739 Python tests, 402 Go tests (1,141 total)
 docs/ Architecture, API, threat model, install guides
 schemas/ OpenAPI spec, JSON Schema for config files
 examples/ Task-oriented walkthroughs
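
The Milestone 48 entry in the README hunk above mentions model catalog externalization with a hardcoded fallback. As a rough sketch of that load-or-degrade pattern, the code below reads `/etc/secure-ai/model-catalog.yaml` and returns a built-in list whenever the file is missing, unreadable, or malformed. The function name `load_model_catalog`, the top-level `models` key, and the contents of `FALLBACK_CATALOG` are assumptions for illustration; the commit page does not show the real loader.

```python
from pathlib import Path

try:
    import yaml  # PyYAML; treated as optional in this sketch
except ImportError:
    yaml = None

# Hypothetical built-in catalog used when the external file cannot be loaded.
FALLBACK_CATALOG = [{"name": "example-model", "source": "builtin"}]


def load_model_catalog(path: str = "/etc/secure-ai/model-catalog.yaml") -> list:
    """Return catalog entries from YAML, or the built-in list on any failure."""
    if yaml is None:
        return FALLBACK_CATALOG
    try:
        data = yaml.safe_load(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError, yaml.YAMLError):
        # Missing file, bad encoding, or invalid YAML all degrade gracefully.
        return FALLBACK_CATALOG
    models = data.get("models") if isinstance(data, dict) else None
    # An empty or unexpectedly shaped document also falls back instead of raising.
    return models if isinstance(models, list) and models else FALLBACK_CATALOG
```

Returning the fallback rather than raising keeps startup working when the external file is absent, which is the graceful-degradation behaviour the commit message describes.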

docs/production-operations.md

Lines changed: 87 additions & 2 deletions
@@ -90,9 +90,94 @@ The inter-service bearer token at `/run/secure-ai/service-token`:
 sudo systemctl restart secure-ai-runtime-attestor
 ```

-### Cosign Signing Key (Image & Release)
+### Cosign Signing Key (Image & Release Artifacts)

-Rotate via the GitHub repository secrets. Update `SIGNING_SECRET` in repository settings, then trigger a new build.
+The cosign signing key is used to sign:
+- OCI container images (`ghcr.io/secai-hub/secai_os:*`)
+- SBOM attestations (CycloneDX per-service)
+- Release checksums (`SHA256SUMS.sig`)
+- SLSA provenance attestations
+
+#### Key Generation
+
+```bash
+# Generate a new cosign key pair (interactive passphrase prompt)
+cosign generate-key-pair
+
+# This creates cosign.key (private) and cosign.pub (public)
+# Store cosign.key in a password manager or HSM — never commit to git
+```
+
+#### Rotation Schedule
+
+| Trigger | Action |
+|---------|--------|
+| **Annual** (recommended) | Proactive rotation, even with no incident |
+| **Key compromise** | Immediate emergency rotation |
+| **Personnel change** | Rotate if key holder leaves the project |
+| **CI provider breach** | Rotate if GitHub Actions secrets may be exposed |
+
+#### Rotation Procedure
+
+1. **Generate new key pair** (on an air-gapped machine if possible):
+   ```bash
+   cosign generate-key-pair
+   ```
+
+2. **Update GitHub repository secret**:
+   - Go to: Settings → Secrets and variables → Actions
+   - Update `SIGNING_SECRET` with the new `cosign.key` contents
+   - Verify the secret is updated (name shows "Updated just now")
+
+3. **Update the public key in deployed appliances**:
+   ```bash
+   # Copy the new cosign.pub to the appliance
+   scp cosign.pub admin@appliance:/tmp/cosign.pub
+   # On the appliance (requires local admin access):
+   sudo cp /tmp/cosign.pub /etc/secure-ai/cosign.pub
+   sudo chmod 0644 /etc/secure-ai/cosign.pub
+   ```
+
+4. **Tag a new release** to produce signed artifacts with the new key:
+   ```bash
+   git tag -s vX.Y.Z -m "Release vX.Y.Z (key rotation)"
+   git push origin vX.Y.Z
+   ```
+
+5. **Verify** the new signature:
+   ```bash
+   cosign verify --key cosign.pub ghcr.io/secai-hub/secai_os:vX.Y.Z
+   ```
+
+#### Emergency Revocation
+
+If the signing key is compromised:
+
+1. **Immediately** rotate the key (steps above)
+2. **Revoke trust** in old images: update all deployed appliances' `cosign.pub`
+3. **Re-sign** the latest stable release with the new key:
+   ```bash
+   cosign sign --key cosign.key ghcr.io/secai-hub/secai_os:latest
+   ```
+4. **Announce** the compromise via GitHub Security Advisory
+5. **Audit** CI logs for any unauthorized release activity during the exposure window
+
+#### Key Audit Checklist
+
+- [ ] Private key stored in encrypted vault or HSM (never on disk in plaintext)
+- [ ] Only CI (`SIGNING_SECRET`) and key custodian have access to private key
+- [ ] Public key shipped in OS image at `/etc/secure-ai/cosign.pub`
+- [ ] `verify-release.sh` uses the shipped public key, not a remote fetch
+- [ ] Key rotation date recorded in `CHANGELOG.md`
+- [ ] Previous public key archived (for verifying older releases)
+
+#### Future: HSM Migration
+
+When HSM support is implemented (planned milestone):
+- Private key will be generated inside the HSM and never exported
+- Cosign will use `--key` with a PKCS#11 URI instead of a file path
+- Rotation becomes: generate new key in HSM → update PKCS#11 URI in CI
+- See `docs/security-status.md` for HSM milestone tracking

 ## Monitoring

docs/security-status.md

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,6 @@
 # Security Implementation Status

-This document is split into two sections. The first section covers **Security Assurance Controls** -- all implemented milestones (M0 through M47) that satisfy the M5 security assurance acceptance criteria. Every control listed there is complete and tested. The second section is the **Product Feature Roadmap**, which tracks planned product capabilities (Agent Mode Phases 2 and 3). These are product enhancements, not security assurance requirements; the M5 security posture is fully met without them.
+This document is split into two sections. The first section covers **Security Assurance Controls** -- all implemented milestones (M0 through M48) that satisfy the M5 security assurance acceptance criteria. Every control listed there is complete and tested. The second section is the **Product Feature Roadmap**, which tracks planned product capabilities (Agent Mode Phases 2 and 3). These are product enhancements, not security assurance requirements; the M5 security posture is fully met without them.

 Last updated: 2026-03-14

@@ -60,6 +60,7 @@ All M5 security assurance criteria are met. The controls below have been impleme
 | Production readiness hardening | Implemented | M45 | Incident recorder file-backed persistence (survives restarts), graceful shutdown (SIGTERM/SIGINT with connection draining) for all 9 Go services, HTTP server timeouts for mcp-firewall and gpu-integrity-watch, systemd production hardening (TimeoutStartSec, TimeoutStopSec, StartLimitInterval, StartLimitBurst) for all 12 daemon units, first-boot health validation script, audit log rotation via logrotate, CI dependency vulnerability scanning (govulncheck + pip-audit), production operations guide (upgrade, key rotation, capacity limits, monitoring) |
 | Operational maturity | Implemented | M46 | Bootstrap trust gap fix (cosign verify before unverified rebase, documented trust gap rationale), CI runs on all changes (removed blanket paths-ignore for .md files), Python quality gates (ruff lint + bandit security scan + split test suites into unit/integration and adversarial/acceptance), docs-validation CI job (broken link detection, required docs check, test-counts.json validation), production-readiness checklist (formal release gate), SLOs (availability/latency/correctness targets + alerting thresholds), release channel policy (stable/candidate/dev + versioning + upgrade paths + security patch SLA), support lifecycle (hardware matrix, driver versions, support windows, deprecation policy, scope boundaries), CI evidence table with all 10 job descriptions and workflow links, sample verification output for verify-release.sh |
 | CI enforcement hardening | Implemented | M47 | Enforced vulnerability scanning: bandit fails CI on HIGH-severity/HIGH-confidence findings, govulncheck fails on unwaived Go vulns, pip-audit fails on unwaived Python vulns. Waiver mechanism (`.github/vuln-waivers.json`) with mandatory expiry dates for reviewed/accepted findings. mypy type checking gate for security-sensitive services (common, agent, quarantine, ui). Pinned reproducible Python CI dependencies (`requirements-ci.txt`). Go 1.23→1.25 upgrade fixing 12 stdlib CVEs (crypto/tls, crypto/x509, encoding/asn1, net/url, os). Flask 3.1.1→3.1.3 (GHSA-68rp-wp8r-4726). Verification-first bootstrap documentation (signed rebase as default quickstart, unverified bootstrap moved to labeled recovery section). |
+| Production hardening | Implemented | M48 | Build script fail-closed (all `|| echo WARNING` fallbacks replaced with fatal errors for 12 required services, final binary verification gate), incident store fsync (f.Sync() before close on both incident persistence and audit log writes), GPU backend metadata recording (`/etc/secure-ai/gpu-backend.json` written at build time with backend/version/timestamp), llama-server watchdog (Type=notify wrapper with startup health gate + WatchdogSec=30 continuous monitoring), model catalog externalization (`/etc/secure-ai/model-catalog.yaml` with YAML loading + hardcoded fallback), circuit breaker for Python services (closed→open→half-open state machine protecting inter-service HTTP calls), post-upgrade model verification in Greenboot (SHA256 manifest check closes 15-min integrity gap), cosign key rotation documentation (full lifecycle: generation, rotation schedule, distribution, emergency revocation, HSM migration path). 402 Go + 739 Python tests (1,141 total). |

 ---
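
The M48 row above describes a SHA256 manifest check that Greenboot runs after an upgrade. The sketch below shows one way such a check could look in Python. It assumes a `sha256sum`-style manifest (`<hex digest>  <relative path>` per line); the real manifest format, file locations, and script language are not shown on this page, and the names `sha256_file` and `verify_manifest` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path, root: Path) -> list[str]:
    """Return the names of files that are missing or whose digest differs."""
    failures = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, name = line.strip().split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with a leading '*'
        target = root / name
        if not target.is_file() or sha256_file(target) != expected.lower():
            failures.append(name)
    return failures
```

A boot-time health check built on this would exit non-zero when the returned list is non-empty, so a tampered or truncated model is caught at boot rather than at the next periodic scan.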

docs/test-counts.json

Lines changed: 4 additions & 4 deletions
@@ -9,9 +9,9 @@
     "policy-engine": 44,
     "runtime-attestor": 55,
     "integrity-monitor": 50,
-    "incident-recorder": 83
+    "incident-recorder": 86
   },
-  "go_total": 399,
-  "python_total": 718,
-  "grand_total": 1117
+  "go_total": 402,
+  "python_total": 739,
+  "grand_total": 1141
 }
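
Since `test-counts.json` is the documented source of truth for test counts, its totals need to stay internally consistent. The sketch below is a small consistency check under stated assumptions: the key holding the per-service map is not visible in this hunk, so `go_services` is a hypothetical name, and `check_counts` is not the project's actual `test-count-check` job.

```python
import json


def check_counts(counts: dict, services_key: str = "go_services") -> list[str]:
    """Return a list of human-readable inconsistencies (empty when consistent)."""
    problems = []
    per_service = counts.get(services_key)
    if isinstance(per_service, dict):
        summed = sum(per_service.values())
        if summed != counts["go_total"]:
            problems.append(f"per-service sum {summed} != go_total {counts['go_total']}")
    expected = counts["go_total"] + counts["python_total"]
    if expected != counts["grand_total"]:
        problems.append(f"go + python = {expected} != grand_total {counts['grand_total']}")
    return problems


# Totals taken from the updated file in this commit; the per-service map is
# omitted because only part of it is visible in the hunk.
new_totals = json.loads('{"go_total": 402, "python_total": 739, "grand_total": 1141}')
print(check_counts(new_totals))  # → []
```

Mixing the old and new values, for example the old `go_total` of 399 with the new `grand_total` of 1141, would produce one reported problem.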

docs/test-matrix.md

Lines changed: 7 additions & 6 deletions
@@ -11,11 +11,11 @@ Last updated: 2026-03-14

 | Language | Test Count | Runner |
 |----------|-----------|--------|
-| Go | 399 | `go test -race ./...` |
-| Python | 718 | `pytest` |
+| Go | 402 | `go test -race ./...` |
+| Python | 739 | `pytest` |
 | Shell | All .sh files | `shellcheck` |

-## Go Tests (399 total)
+## Go Tests (402 total)

 | Service | Location | Tests | Description |
 |---------|----------|-------|-------------|
@@ -27,15 +27,16 @@ Last updated: 2026-03-14
 | Policy Engine | services/policy-engine/ | 44 | Unified policy decisions across 6 domains, evidence generation, auth, adversarial tests (M43) |
 | Runtime Attestor | services/runtime-attestor/ | 55 | TPM2 quote verification, HMAC bundles, state machine, startup gating, service digests, incident-recorder integration |
 | Integrity Monitor | services/integrity-monitor/ | 50 | Baseline computation, continuous scanning, violation detection, state machine, HMAC baselines, incident-recorder integration |
-| Incident Recorder | services/incident-recorder/ | 83 | Incident creation, auto-containment, lifecycle management, severity ranking, policy loading, containment execution, enforcement chain integration, recovery ceremony, severity escalation, forensic bundle export (M43) |
+| Incident Recorder | services/incident-recorder/ | 86 | Incident creation, auto-containment, lifecycle management, severity ranking, policy loading, containment execution, enforcement chain integration, recovery ceremony, severity escalation, forensic bundle export (M43), persistence durability (fsync) |

-## Python Tests (718 total)
+## Python Tests (739 total)

 | Test File | Location | Tests | Description |
 |-----------|----------|-------|-------------|
 | test_pipeline.py | tests/ | 96 | Quarantine pipeline stages, scanning, pass/fail logic |
 | test_search.py | tests/ | 27 | Search mediator, PII stripping, injection detection |
-| test_ui.py | tests/ | 12 | Flask web UI routes, rendering, input handling |
+| test_ui.py | tests/ | 18 | Flask web UI routes, rendering, input handling, model catalog loading (YAML/fallback) |
+| test_circuit_breaker.py | tests/ | 15 | Circuit breaker state machine (closed/open/half-open), reset, error propagation |
 | test_vault_watchdog.py | tests/ | 18 | Vault auto-lock, idle detection, timer controls |
 | test_memory_protection.py | tests/ | 37 | Swap encryption, zswap, core dumps, mlock, TEE detection |
 | test_traffic_analysis.py | tests/ | 41 | Padding, timing jitter, dummy traffic generation |
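
The `test_circuit_breaker.py` row above covers a closed/open/half-open state machine with reset and error propagation. The following is a minimal sketch of that pattern, not the project's implementation: the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions chosen to keep the example testable.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None
        self.state = "closed"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            self.state = "half-open"  # timeout elapsed: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise  # propagate the original error to the caller
        self._on_success()
        return result

    def _on_failure(self):
        self._failures += 1
        # A failed trial call re-opens immediately; otherwise trip at the threshold.
        if self.state == "half-open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()

    def _on_success(self):
        self._failures = 0
        self.state = "closed"

    def reset(self):
        """Force the breaker back to closed, discarding failure history."""
        self._failures = 0
        self._opened_at = None
        self.state = "closed"
```

A caller would wrap each inter-service HTTP request in `breaker.call(...)` and treat `CircuitOpenError` as "service unavailable". This sketch is not thread-safe; a shared breaker would need a lock around the state transitions.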
