Commit 46fbf31

SecAI-Hub and claude committed
M51: Stronger observability — unified dashboard, SLO tracking, alerting hooks, forensic export
Four operator-facing observability capabilities:

1. Unified appliance health dashboard: trusted/degraded/recovery_required state derived from runtime attestor + integrity monitor + incident recorder. Banner at top of Security page with per-subsystem breakdown.
2. Live SLO compliance monitoring: in-process tracker measuring uptime % and P95 latency against docs/slos.md targets (7-day rolling window). New /api/observability/slos endpoint + dashboard widget.
3. Webhook alerting hooks: fire-and-forget POST to configured URLs on containment events (incident-containment.yaml alerting.webhooks), with per-event-type filtering and 1-retry delivery.
4. Forensic bundle export: wired existing handlers into HTTP mux (were implemented but unregistered), enriched with real audit log entries and policy digest, accessible via UI download button, Flask proxy route, and CLI script (secai-forensic.sh export/verify).

Recovery ceremony endpoints also wired (ack, reattest, status).

413 Go + 744 Python tests (1,157 total, +16 new).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent d52a089 commit 46fbf31

13 files changed

Lines changed: 1181 additions & 11 deletions

README.md

Lines changed: 3 additions & 2 deletions
@@ -158,7 +158,7 @@ Every model passes through the same fully automatic pipeline:
 | **Updates** | Cosign-verified rpm-ostree, staged workflow, greenboot auto-rollback |
 | **Supply Chain** | Per-service CycloneDX SBOMs, SLSA3 provenance attestation, cosign-signed checksums |

-See [docs/threat-model.md](docs/threat-model.md) for threat classes, residual risks, and security invariants. See [docs/security-status.md](docs/security-status.md) for implementation status of all 50 milestones.
+See [docs/threat-model.md](docs/threat-model.md) for threat classes, residual risks, and security invariants. See [docs/security-status.md](docs/security-status.md) for implementation status of all 51 milestones.

 ### Verify Image Signatures

@@ -378,7 +378,7 @@ See [docs/test-matrix.md](docs/test-matrix.md) for full breakdown.
 ## Roadmap

 <details>
-<summary>All 50 project milestones (click to expand)</summary>
+<summary>All 51 project milestones (click to expand)</summary>

 - [x] **Milestone 0** -- Threat model, dataflow, invariants, policy files
 - [x] **Milestone 1** -- Bootable OS, encrypted vault, GPU drivers
@@ -431,6 +431,7 @@ See [docs/test-matrix.md](docs/test-matrix.md) for full breakdown.
 - [x] **Milestone 48** -- Production hardening: build script fail-closed (fatal errors for 12 required services + binary verification gate), incident store fsync (crash-safe persistence), GPU backend metadata recording, llama-server watchdog (Type=notify + WatchdogSec=30), model catalog externalization (YAML with fallback), circuit breaker for inter-service HTTP calls, post-upgrade model verification in Greenboot, cosign key rotation documentation (full lifecycle)
 - [x] **Milestone 49** -- Signed-first install path: bootstrap script configures signing policy before first rebase (eliminates unverified transport), digest-pinned install flow (CI publishes digests in build summary + release assets), first-boot setup wizard (interactive integrity verification + vault + TPM2 + health check), recovery/dev path separated into dedicated doc
 - [x] **Milestone 50** -- Production operations package: backup/restore scripts (full/config/logs/keys categories, age/gpg encryption, SHA256 manifest, LUKS header backup/restore), rollback decision matrix (Greenboot auto-rollback + manual criteria), 5 break-glass recovery procedures, formal data retention policy (7 data classes, disk capacity thresholds)
+- [x] **Milestone 51** -- Stronger observability: unified appliance health dashboard (trusted/degraded/recovery_required), live SLO compliance monitoring (uptime + P95 latency tracking), webhook alerting hooks for containment events, forensic bundle export via UI + CLI (secai-forensic.sh), recovery ceremony endpoints wired

 </details>

docs/security-status.md

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,6 @@
 # Security Implementation Status

-This document is split into two sections. The first section covers **Security Assurance Controls** -- all implemented milestones (M0 through M50) that satisfy the M5 security assurance acceptance criteria. Every control listed there is complete and tested. The second section is the **Product Feature Roadmap**, which tracks planned product capabilities (Agent Mode Phases 2 and 3). These are product enhancements, not security assurance requirements; the M5 security posture is fully met without them.
+This document is split into two sections. The first section covers **Security Assurance Controls** -- all implemented milestones (M0 through M51) that satisfy the M5 security assurance acceptance criteria. Every control listed there is complete and tested. The second section is the **Product Feature Roadmap**, which tracks planned product capabilities (Agent Mode Phases 2 and 3). These are product enhancements, not security assurance requirements; the M5 security posture is fully met without them.

 Last updated: 2026-03-14

@@ -63,6 +63,7 @@ All M5 security assurance criteria are met. The controls below have been impleme
 | Production hardening | Implemented | M48 | Build script fail-closed (all `|| echo WARNING` fallbacks replaced with fatal errors for 12 required services, final binary verification gate), incident store fsync (f.Sync() before close on both incident persistence and audit log writes), GPU backend metadata recording (`/etc/secure-ai/gpu-backend.json` written at build time with backend/version/timestamp), llama-server watchdog (Type=notify wrapper with startup health gate + WatchdogSec=30 continuous monitoring), model catalog externalization (`/etc/secure-ai/model-catalog.yaml` with YAML loading + hardcoded fallback), circuit breaker for Python services (closed→open→half-open state machine protecting inter-service HTTP calls), post-upgrade model verification in Greenboot (SHA256 manifest check closes 15-min integrity gap), cosign key rotation documentation (full lifecycle: generation, rotation schedule, distribution, emergency revocation, HSM migration path). 402 Go + 739 Python tests (1,141 total). |
 | Signed-first install path | Implemented | M49 | Signed bootstrap script (`secai-bootstrap.sh`) configures container signing policy (policy.json + registries.d + cosign public key) before first rebase — eliminates unverified transport from production install path. Digest-pinned install flow (CI publishes image digest in build summary and release assets). First-boot setup wizard (interactive verification of image integrity, transport, vault setup, TPM2 sealing, health check). Signing policy files baked into OS image (`/etc/pki/containers/secai-cosign.pub`, `/etc/containers/registries.d/secai-os.yaml`, policy.json merge in build script). Recovery/dev bootstrap path separated into dedicated doc with clear warnings. |
 | Production operations package | Implemented | M50 | Backup script (`secai-backup.sh`) with full/config/logs/keys categories, age/gpg encryption, internal SHA256 manifest, LUKS header backup. Restore script (`secai-restore.sh`) with integrity verification, staging extraction, double-confirmation LUKS header restore, post-restore health check. Production operations doc extended with rollback decision matrix (Greenboot auto-rollback triggers + manual criteria), 5 break-glass recovery procedures (token loss, attestation failure, Level 1 panic lockout, signing policy break, Greenboot exhaustion), formal data retention policy (7 data classes with retention periods, disk capacity thresholds at 70/80/90/95%). |
+| Stronger observability | Implemented | M51 | Unified appliance health dashboard (trusted/degraded/recovery_required state derived from runtime attestor + integrity monitor + incident recorder). Live SLO compliance monitoring (in-process tracker measuring uptime % and P95 latency against docs/slos.md targets, 7-day rolling window). Webhook alerting hooks for containment events (fire-and-forget POST with retry, configurable per-event-type filtering in incident-containment.yaml). Forensic bundle export wired to HTTP mux (was implemented but unregistered), enriched with real audit log entries and policy digest, accessible via UI download button, Flask proxy, and CLI script (`secai-forensic.sh`). Recovery ceremony endpoints also wired (ack, reattest, status). |

 ---
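
The M51 row says the appliance state (trusted/degraded/recovery_required) is derived from three subsystems, but the derivation rule itself does not appear in this diff. One plausible reading is a "worst state wins" aggregation, sketched below. The type `healthState`, the function `overall`, and the subsystem keys are hypothetical names, and the rule is an assumption rather than the project's confirmed logic.

```go
package main

import "fmt"

// healthState is ordered from best to worst so states can be compared.
type healthState int

const (
	trusted healthState = iota
	degraded
	recoveryRequired
)

// String renders the state using the labels named in the commit message.
func (s healthState) String() string {
	switch s {
	case trusted:
		return "trusted"
	case degraded:
		return "degraded"
	default:
		return "recovery_required"
	}
}

// overall returns the worst state reported by any subsystem.
// An empty map yields trusted, which is a choice made for this sketch only.
func overall(subsystems map[string]healthState) healthState {
	worst := trusted
	for _, s := range subsystems {
		if s > worst {
			worst = s
		}
	}
	return worst
}

func main() {
	state := overall(map[string]healthState{
		"runtime-attestor":  trusted,
		"integrity-monitor": degraded,
		"incident-recorder": trusted,
	})
	fmt.Println(state)
}
```

A fail-closed variant would treat a subsystem that cannot be reached as `recoveryRequired` rather than omitting it from the map.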

files/system/etc/secure-ai/policy/incident-containment.yaml

Lines changed: 14 additions & 0 deletions
@@ -77,3 +77,17 @@ rules:
       - force_vault_relock
       - log_alert
     default_severity: critical
+
+# Alerting — fire-and-forget webhooks on containment events.
+# Configure webhook URLs to receive JSON alert payloads when incidents
+# trigger containment actions. Each entry specifies a URL and which
+# event types to forward. Leave events empty to receive all event types.
+#
+# Supported event types: containment, escalation, recovery
+alerting:
+  webhooks: []
+  # Example:
+  #   - url: "http://127.0.0.1:9090/api/alerts"
+  #     events: ["containment", "escalation"]
+  #   - url: "https://hooks.example.com/secai"
+  #     events: []  # receive all events
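
The policy comments above name three supported event types and expect each webhook entry to carry a URL. The diff does not show any validation of these entries at load time, so the sketch below illustrates what such a check might look like. `validateWebhook` and `supportedEvents` are hypothetical names introduced here, and the rule that URLs must be absolute http(s) is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
)

// supportedEvents lists the event types named in the policy file comments.
var supportedEvents = map[string]bool{
	"containment": true,
	"escalation":  true,
	"recovery":    true,
}

// validateWebhook is a hypothetical helper: it rejects entries whose URL is
// not an absolute http(s) URL or whose event filter names an unknown type.
// An empty event list is accepted because the policy treats it as "all events".
func validateWebhook(rawURL string, events []string) error {
	u, err := url.Parse(rawURL)
	if err != nil {
		return fmt.Errorf("invalid url %q: %w", rawURL, err)
	}
	if (u.Scheme != "http" && u.Scheme != "https") || u.Host == "" {
		return fmt.Errorf("url %q must be an absolute http(s) URL", rawURL)
	}
	for _, e := range events {
		if !supportedEvents[e] {
			return fmt.Errorf("unsupported event type %q", e)
		}
	}
	return nil
}

func main() {
	// The first two entries mirror the commented example in the policy file.
	fmt.Println(validateWebhook("http://127.0.0.1:9090/api/alerts", []string{"containment", "escalation"}))
	fmt.Println(validateWebhook("https://hooks.example.com/secai", nil))
	fmt.Println(validateWebhook("https://hooks.example.com/secai", []string{"reboot"}))
}
```

Rejecting unknown event names early would surface typos such as `containement`, which would otherwise silently match nothing.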
Lines changed: 189 additions & 0 deletions
@@ -0,0 +1,189 @@
+#!/usr/bin/env bash
+#
+# SecAI OS — Forensic Bundle Export/Verify (M51)
+#
+# Exports a signed forensic bundle from the incident recorder, or
+# verifies the integrity of a previously exported bundle.
+#
+# Usage:
+#   secai-forensic export [--output FILE]   Export a signed forensic bundle
+#   secai-forensic verify <FILE>            Verify bundle hash integrity
+#   secai-forensic --help                   Show help
+#
+set -euo pipefail
+
+INCIDENT_RECORDER_URL="${INCIDENT_RECORDER_URL:-http://127.0.0.1:8515}"
+SERVICE_TOKEN_PATH="${SERVICE_TOKEN_PATH:-/run/secure-ai/service-token}"
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[0;33m'
+NC='\033[0m'
+
+info() { echo -e "${GREEN}[INFO]${NC} $*"; }
+warn() { echo -e "${YELLOW}[WARN]${NC} $*"; }
+err() { echo -e "${RED}[ERROR]${NC} $*" >&2; }
+
+usage() {
+    cat <<'EOF'
+secai-forensic — Forensic bundle export and verification
+
+Usage:
+  secai-forensic export [--output FILE]   Export a signed forensic bundle
+  secai-forensic verify <FILE>            Verify bundle hash integrity
+  secai-forensic --help                   Show this help
+
+The export subcommand downloads a signed forensic bundle from the local
+incident recorder service. The bundle contains all incidents, audit log
+entries, system state, and a policy digest, signed with HMAC-SHA256.
+
+The verify subcommand recomputes the bundle hash and checks it against
+the stored hash to detect tampering.
+
+Environment:
+  INCIDENT_RECORDER_URL (default: http://127.0.0.1:8515)
+  SERVICE_TOKEN_PATH (default: /run/secure-ai/service-token)
+EOF
+    exit 0
+}
+
+# ---------------------------------------------------------------------------
+# Export
+# ---------------------------------------------------------------------------
+cmd_export() {
+    local output="${1:-}"
+    if [[ -z "$output" ]]; then
+        output="forensic-bundle-$(date -u +%Y%m%d-%H%M%S).json"
+    fi
+
+    # Read service token if available
+    local auth_args=()
+    if [[ -f "$SERVICE_TOKEN_PATH" ]]; then
+        local token
+        token=$(cat "$SERVICE_TOKEN_PATH")
+        auth_args=(-H "Authorization: Bearer ${token}")
+    else
+        warn "Service token not found at ${SERVICE_TOKEN_PATH} — trying without auth"
+    fi
+
+    info "Exporting forensic bundle from ${INCIDENT_RECORDER_URL}..."
+
+    local http_code
+    http_code=$(curl -sf -w "%{http_code}" \
+        "${auth_args[@]+"${auth_args[@]}"}" \
+        "${INCIDENT_RECORDER_URL}/api/v1/forensic/export" \
+        -o "$output" 2>/dev/null) || true
+
+    if [[ ! -f "$output" ]] || [[ ! -s "$output" ]]; then
+        err "Export failed (HTTP ${http_code:-unknown}). Is the incident recorder running?"
+        rm -f "$output"
+        exit 1
+    fi
+
+    # Show summary
+    local size
+    size=$(wc -c < "$output" | tr -d ' ')
+    info "Exported: ${output} (${size} bytes)"
+
+    # Extract and show bundle hash
+    if command -v python3 &>/dev/null; then
+        python3 -c "
+import json, sys
+try:
+    b = json.load(open('${output}'))
+    print('Bundle hash: ' + b.get('bundle_hash', 'N/A'))
+    print('Exported at: ' + b.get('exported_at', 'N/A'))
+    print('Incidents: ' + str(len(b.get('incidents', []))))
+    print('Audit lines: ' + str(len(b.get('audit_entries', []))))
+    print('Signed: ' + ('yes' if b.get('signature') else 'no'))
+except Exception as e:
+    print('Could not parse bundle: ' + str(e), file=sys.stderr)
+"
+    fi
+}
+
+# ---------------------------------------------------------------------------
+# Verify
+# ---------------------------------------------------------------------------
+cmd_verify() {
+    local file="$1"
+    if [[ ! -f "$file" ]]; then
+        err "File not found: ${file}"
+        exit 1
+    fi
+
+    if ! command -v python3 &>/dev/null; then
+        err "python3 is required for bundle verification"
+        exit 1
+    fi
+
+    python3 -c "
+import json, hashlib, sys
+
+bundle = json.load(open('${file}'))
+
+# Recompute hash over content fields (same structure as Go ExportForensicBundle)
+hash_input = json.dumps({
+    'exported_at': bundle['exported_at'],
+    'incidents': bundle['incidents'],
+    'audit_entries': bundle['audit_entries'],
+    'system_state': bundle['system_state'],
+    'policy_digest': bundle['policy_digest'],
+}, separators=(',', ':'), sort_keys=False).encode()
+
+computed = hashlib.sha256(hash_input).hexdigest()
+stored = bundle.get('bundle_hash', '')
+
+if stored == computed:
+    print('VERIFIED: Bundle hash matches.')
+    print(' Hash: ' + stored)
+    print(' Incidents: ' + str(len(bundle.get('incidents', []))))
+    print(' Exported at: ' + bundle.get('exported_at', 'N/A'))
+    sys.exit(0)
+else:
+    print('FAILED: Bundle hash mismatch — content may have been tampered.', file=sys.stderr)
+    print(' Expected: ' + stored, file=sys.stderr)
+    print(' Computed: ' + computed, file=sys.stderr)
+    sys.exit(1)
+"
+}
+
+# ---------------------------------------------------------------------------
+# Main
+# ---------------------------------------------------------------------------
+case "${1:-}" in
+    export)
+        shift
+        output=""
+        while [[ $# -gt 0 ]]; do
+            case "$1" in
+                --output)
+                    [[ $# -lt 2 ]] && { err "--output requires a filename"; exit 1; }
+                    output="$2"
+                    shift 2
+                    ;;
+                *)
+                    err "Unknown option: $1"
+                    usage
+                    ;;
+            esac
+        done
+        cmd_export "$output"
+        ;;
+    verify)
+        shift
+        [[ $# -lt 1 ]] && { err "verify requires a filename"; usage; }
+        cmd_verify "$1"
+        ;;
+    --help|-h)
+        usage
+        ;;
+    *)
+        err "Unknown command: ${1:-}"
+        echo ""
+        usage
+        ;;
+esac
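
The script's help text says bundles are signed with HMAC-SHA256, while `verify` only recomputes the SHA-256 content hash. For completeness, here is a minimal sketch of an HMAC-SHA256 signature check using Go's standard library. Which bytes the recorder actually signs and where the key lives are not visible in this diff, so `sign`, `verifySignature`, the demo key, and the demo payload are all assumptions for illustration.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign returns the hex-encoded HMAC-SHA256 of content under key.
func sign(key, content []byte) string {
	mac := hmac.New(sha256.New, key)
	mac.Write(content)
	return hex.EncodeToString(mac.Sum(nil))
}

// verifySignature recomputes the MAC and compares it in constant time.
// It returns false for malformed hex as well as for mismatches.
func verifySignature(key, content []byte, sigHex string) bool {
	want, err := hex.DecodeString(sigHex)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, key)
	mac.Write(content)
	return hmac.Equal(mac.Sum(nil), want)
}

func main() {
	key := []byte("demo-key-not-a-real-secret")       // hypothetical key
	content := []byte(`{"bundle_hash":"placeholder"}`) // hypothetical signed bytes
	sig := sign(key, content)

	fmt.Println("intact:  ", verifySignature(key, content, sig))
	fmt.Println("tampered:", verifySignature(key, append(content, ' '), sig))
}
```

Unlike the bare hash comparison, a MAC check requires the secret key, so someone who edits the bundle cannot simply recompute a matching value.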
Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
+package main
+
+import (
+	"bytes"
+	"encoding/json"
+	"log"
+	"net/http"
+	"sync"
+	"time"
+)
+
+// =========================================================================
+// Alerting — fire-and-forget webhooks on containment/escalation events
+// =========================================================================
+
+// AlertingConfig holds webhook configuration loaded from the containment policy.
+type AlertingConfig struct {
+	Webhooks []WebhookTarget `yaml:"webhooks" json:"webhooks"`
+}
+
+// WebhookTarget defines a single webhook endpoint.
+type WebhookTarget struct {
+	URL    string   `yaml:"url" json:"url"`
+	Events []string `yaml:"events" json:"events"` // "containment", "escalation", "recovery"
+}
+
+// AlertPayload is the JSON body sent to webhook endpoints.
+type AlertPayload struct {
+	Event     string   `json:"event"`
+	Timestamp string   `json:"timestamp"`
+	Incident  Incident `json:"incident"`
+	Actions   []string `json:"actions,omitempty"`
+	Severity  string   `json:"severity"`
+	Source    string   `json:"source"`
+}
+
+var (
+	alertingCfg   AlertingConfig
+	alertingCfgMu sync.RWMutex
+)
+
+func getAlertingConfig() AlertingConfig {
+	alertingCfgMu.RLock()
+	defer alertingCfgMu.RUnlock()
+	return alertingCfg
+}
+
+func setAlertingConfig(cfg AlertingConfig) {
+	alertingCfgMu.Lock()
+	defer alertingCfgMu.Unlock()
+	alertingCfg = cfg
+}
+
+// fireWebhooks dispatches alert payloads to all configured webhook URLs
+// matching the given event type. Fire-and-forget with one retry.
+func fireWebhooks(event string, inc Incident, actions []string) {
+	cfg := getAlertingConfig()
+	if len(cfg.Webhooks) == 0 {
+		return
+	}
+
+	payload := AlertPayload{
+		Event:     event,
+		Timestamp: time.Now().UTC().Format(time.RFC3339),
+		Incident:  inc,
+		Actions:   actions,
+		Severity:  string(inc.Severity),
+		Source:    "incident-recorder",
+	}
+
+	body, err := json.Marshal(payload)
+	if err != nil {
+		log.Printf("alerting: failed to marshal payload: %v", err)
+		return
+	}
+
+	for _, wh := range cfg.Webhooks {
+		if !matchesEvent(wh.Events, event) {
+			continue
+		}
+		go sendWebhook(wh.URL, body)
+	}
+}
+
+// matchesEvent returns true if the event list is empty (match all) or
+// contains the given event string.
+func matchesEvent(events []string, event string) bool {
+	if len(events) == 0 {
+		return true // empty filter = match all events
+	}
+	for _, e := range events {
+		if e == event {
+			return true
+		}
+	}
+	return false
+}
+
+// sendWebhook POSTs the JSON body to the given URL.
+// Retries once after 1 second on failure. 5-second timeout per attempt.
+func sendWebhook(url string, body []byte) {
+	client := &http.Client{Timeout: 5 * time.Second}
+	for attempt := 0; attempt < 2; attempt++ {
+		req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
+		if err != nil {
+			log.Printf("alerting: cannot create request for %s: %v", url, err)
+			return
+		}
+		req.Header.Set("Content-Type", "application/json")
+		req.Header.Set("User-Agent", "SecAI-Incident-Recorder/1.0")
+
+		resp, err := client.Do(req)
+		if err != nil {
+			log.Printf("alerting: POST to %s failed (attempt %d): %v", url, attempt+1, err)
+			if attempt == 0 {
+				time.Sleep(1 * time.Second)
+				continue
+			}
+			return
+		}
+		resp.Body.Close()
+		if resp.StatusCode >= 200 && resp.StatusCode < 300 {
+			log.Printf("alerting: webhook delivered to %s (status %d)", url, resp.StatusCode)
+			return
+		}
+		log.Printf("alerting: webhook to %s returned status %d (attempt %d)", url, resp.StatusCode, attempt+1)
+		if attempt == 0 {
+			time.Sleep(1 * time.Second)
+		}
+	}
+}
