Guardian Agent — Full System Analysis

A deep-dive analysis of the Guardian Agent codebase in its current state, covering performance, security, compliance, and hardening.


Executive Summary

| Area | Status | Summary |
|---|---|---|
| Performance | ✅ Strong | Async Rust, benchmarks, 10k+ req/s target; some sync locks in middleware. |
| Security | ✅ Good, gaps | Auth (JWT/API key), MFA, headers, sanitization; no native TLS, no per-route RBAC. |
| Compliance | ✅ Good, stubs | Retention, PII, encryption, audit, reporting; forensics/analytics/chain-of-custody are stubs. |
| Hardening | ✅ Good | Rate limit, lockout, session, backup, CORS; TLS via proxy only. |

1. What We Have

1.1 Core Runtime & API

  • Binary: guardian-server (Rust, Axum, tokio).
  • Config: YAML (guardian.yaml) + env (GUARDIAN_CONFIG, GUARDIAN_JWT_SECRET, GUARDIAN_DEBUG, GUARDIAN_SIGNING_KEY, PORT, RUST_LOG).
  • Routes: 50+ endpoints — health, validate, capability, metrics, policies, retention, legal-hold, PII, encryption, RBAC users, compliance, forensics, tenants, chain-of-custody, regulatory, analytics, MCP, admin (audit, backup), MFA (setup/verify/disable).
  • Graceful shutdown: axum::serve(...).with_graceful_shutdown(shutdown_signal()) on Ctrl+C.

1.2 Performance

  • Async stack: Tokio, Axum, async/await throughout agent and server.
  • Benchmarks: benches/throughput.rs, benches/latency.rs, benches/memory.rs, benches/startup.rs (Criterion).
  • Release profile: opt-level = 3, LTO, codegen-units = 1, panic = "abort".
  • Caching: Config has cache (TTL, max_size); policy validator uses in-memory policy bundle.
  • Logging: Append-only file, optional rotation by size; signing and optional encryption on write.
  • Observability: /metrics, /health, /health/live, /health/ready; tracing with configurable level and JSON format.

1.3 Security

  • Authentication: Optional; when enabled: JWT (Bearer) or API key (x-api-key). Public paths configurable (auth.public_paths).
  • MFA: TOTP (totp-rs); MfaStore; routes /auth/mfa/setup, /auth/mfa/verify, /auth/mfa/disable; auth middleware requires X-MFA-Code when user.mfa_enabled.
  • Secrets: Env-based 32-byte keys (secrets::load_key_from_env, load_secret_from_env); signing key and JWT secret from env in production.
  • Security headers: Middleware sets HSTS, X-Content-Type-Options, X-Frame-Options, CSP, X-XSS-Protection.
  • Error sanitization: GuardianError::sanitize_for_client(); detailed messages only when GUARDIAN_DEBUG=1|true.
  • Input validation: sanitize_path (no ..), validate_id, validate_framework, truncate_safe; used on path params and bodies where applicable.
  • Request size: Middleware enforces max body (e.g. 10MB) via Content-Length.
  • CORS: Configurable cors.allowed_origins; empty = same-origin only.
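The input-validation helpers listed above can be illustrated with a minimal, stdlib-only sketch. The function names come from the list, but the signatures, the `InputError` type, and the exact rules (64-character ID limit, allowed characters) are assumptions made for illustration, not the project's actual implementation.

```rust
/// Hypothetical error type for this sketch; not the project's `GuardianError`.
#[derive(Debug, PartialEq)]
pub enum InputError {
    Empty,
    TooLong,
    Absolute,
    Traversal,
    InvalidChar(char),
}

/// Rejects empty paths, absolute paths, and any `..` component.
pub fn sanitize_path(input: &str) -> Result<&str, InputError> {
    if input.is_empty() {
        return Err(InputError::Empty);
    }
    if input.starts_with('/') || input.starts_with('\\') {
        return Err(InputError::Absolute);
    }
    // Check every component so that "a/../b" and "..\\x" are both rejected.
    if input
        .split(|c: char| c == '/' || c == '\\')
        .any(|part| part == "..")
    {
        return Err(InputError::Traversal);
    }
    Ok(input)
}

/// Accepts IDs of 1 to 64 characters drawn from ASCII alphanumerics, '-' and '_'.
/// (The 64-character limit is an assumption of this sketch.)
pub fn validate_id(id: &str) -> Result<&str, InputError> {
    if id.is_empty() {
        return Err(InputError::Empty);
    }
    if id.len() > 64 {
        return Err(InputError::TooLong);
    }
    match id
        .chars()
        .find(|c| !(c.is_ascii_alphanumeric() || *c == '-' || *c == '_'))
    {
        Some(bad) => Err(InputError::InvalidChar(bad)),
        None => Ok(id),
    }
}

/// Truncates to at most `max_bytes` without splitting a UTF-8 character.
pub fn truncate_safe(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    let mut end = max_bytes;
    // Index 0 is always a char boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

Returning the borrowed input on success lets a handler validate path parameters without allocating; a rejected value would map to a 400 response.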

1.4 Hardening

  • Rate limiting: Per-key token bucket (RateLimitStore); key = X-Forwarded-For / X-Real-IP or "unknown"; 429 + Retry-After and X-RateLimit headers.
  • Account lockout: After N failed auth attempts, key locked for M minutes (LockoutStore).
  • Session management: Optional per-user sessions with TTL and max concurrent (SessionStore); key = user_id, value = (client_key, last_activity).
  • Backup: BackupConfig (enabled, cron, local_path, optional s3/azure); BackupManager copies log (+ optional config) to timestamped dir; cron job in main; POST/GET /admin/backup and /admin/backup/status. Cloud upload stubbed (warn only).
  • Audit: In-memory ring buffer (AuditLog); admin operations record actor, operation, target, success; GET /admin/audit with filters.
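As an illustration of the per-key token bucket described above, here is a minimal sketch. `TokenBucketLimiter` and its methods are hypothetical names, not the project's `RateLimitStore`; the current time is passed in explicitly so the behaviour is deterministic, and the `Err` value is the duration a handler could report in `Retry-After`.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// State kept for one client key.
struct Bucket {
    tokens: f64,
    last_refill: Instant,
}

/// Hypothetical per-key token bucket; a sketch, not the project's RateLimitStore.
pub struct TokenBucketLimiter {
    capacity: f64,
    refill_per_sec: f64,
    buckets: HashMap<String, Bucket>,
}

impl TokenBucketLimiter {
    pub fn new(capacity: u32, refill_per_sec: f64) -> Self {
        assert!(capacity >= 1, "capacity must be at least 1");
        assert!(refill_per_sec > 0.0, "refill rate must be positive");
        Self {
            capacity: f64::from(capacity),
            refill_per_sec,
            buckets: HashMap::new(),
        }
    }

    /// Returns `Ok(remaining_tokens)` when the request is allowed, or
    /// `Err(retry_after)` when it should be answered with 429.
    pub fn check(&mut self, key: &str, now: Instant) -> Result<u32, Duration> {
        let capacity = self.capacity;
        let rate = self.refill_per_sec;
        // A new key starts with a full bucket.
        let bucket = self.buckets.entry(key.to_string()).or_insert(Bucket {
            tokens: capacity,
            last_refill: now,
        });

        // Refill in proportion to elapsed time, capped at capacity.
        let elapsed = now.saturating_duration_since(bucket.last_refill).as_secs_f64();
        bucket.tokens = (bucket.tokens + elapsed * rate).min(capacity);
        bucket.last_refill = now;

        if bucket.tokens >= 1.0 {
            bucket.tokens -= 1.0;
            Ok(bucket.tokens as u32)
        } else {
            // Time until one whole token is available again.
            let missing = 1.0 - bucket.tokens;
            Err(Duration::from_secs_f64(missing / rate))
        }
    }
}
```

With a capacity of 2 and a refill rate of one token per second, a client can burst two requests, is then rejected with a retry hint of one second, and is allowed again once that second has passed.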

1.5 Compliance & Data Governance

  • Retention: RetentionManager — cron-driven; apply policy (retention_days, archive_after_days, deletion, legal_hold); archive to local dir; legal hold list; stats and API.
  • PII: PIIDetector (regex-based SSN, email, credit card, etc.); redaction (mask/hash/remove/replace); agent redacts action/verdict/metadata on log write when enabled.
  • Encryption: EncryptionManager — AES-256-GCM; key sources: Local file, Env, HashiCorp Vault (HTTP), AWS KMS (feature, stub), Azure KV (feature commented out); agent can encrypt log lines; key rotation stub.
  • RBAC: AccessControl — users, roles (Admin, Auditor, Compliance, Operator, Viewer), permissions; list/add/update/delete users; access log; agent.check_permission(user_id, permission) exists.
  • Compliance reporting: ComplianceReporter — SOC2, HIPAA, GDPR, PCI-DSS, ISO27001; requirement definitions; generate report for period; GET requirements by framework.
  • Legal hold: Retention respects legal hold list; add/remove/list via API.
  • Multi-tenancy: TenantManager — CRUD tenants; per-tenant retention, PII config, access control (in-memory).
  • Chain of custody: Types (timestamp token, verification record, custody record); manager returns structure (implementation is placeholder).
  • Regulatory mapping: RegulatoryMapper — gap analysis types; map_to_framework returns empty mappings (TODO).
  • Analytics: AnalyticsEngine — anomaly and risk score types; detect_anomalies and calculate_risk_score return empty (TODO).
  • Forensics: ForensicEngine — query, timeline, correlate; all return empty (TODO); index not built from logs.
  • MCP: Protocol types and parsing; monitor endpoint; config/stats (runtime only, not persisted).
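The retention behaviour described above (retention_days, archive_after_days, deletion, legal hold) could be expressed as a single decision function. The sketch below is an assumption about how those fields interact, in particular that legal hold overrides everything and that deletion takes priority over archiving; the type and function names are hypothetical.

```rust
/// Outcome for one log file or record in a retention pass.
#[derive(Debug, PartialEq)]
pub enum RetentionAction {
    Keep,
    Archive,
    Delete,
}

/// Hypothetical policy shape; field names follow the report, semantics are assumed.
pub struct RetentionPolicy {
    pub retention_days: u32,
    pub archive_after_days: Option<u32>,
    pub deletion_enabled: bool,
}

/// Decides what a cron-driven retention pass should do with an item of the given age.
pub fn decide(policy: &RetentionPolicy, age_days: u32, on_legal_hold: bool) -> RetentionAction {
    // Legal hold always wins: held data is never archived or deleted.
    if on_legal_hold {
        return RetentionAction::Keep;
    }
    // Past the retention window: delete, but only if deletion is enabled.
    if policy.deletion_enabled && age_days > policy.retention_days {
        return RetentionAction::Delete;
    }
    // Otherwise archive once the archive threshold has passed.
    match policy.archive_after_days {
        Some(threshold) if age_days > threshold => RetentionAction::Archive,
        _ => RetentionAction::Keep,
    }
}
```

Keeping the decision pure (no I/O) makes it easy to unit test independently of the cron job and the archive directory.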

1.6 Infrastructure & CI

  • CI: rust-ci.yml — test (stable/beta/nightly), build release, coverage (llvm-cov, Codecov), fmt, clippy, multi-target build (linux/musl/darwin/windows), cargo audit, Docker build/push (distroless, multi-arch).
  • Deploy: Dockerfile(s), Helm chart (helm/guardian-agent), examples (docker-compose, k8s sidecar, systemd).
  • Docs: Compliance, deployment, monitoring, MCP, sidecar, performance impact.

2. What We Might Not Have (Gaps & Risks)

2.1 Performance

  • Sync locks in hot path: RateLimitStore uses std::sync::Mutex, and LockoutStore/SessionStore use std::sync::RwLock. Under high concurrency a contended std lock blocks the Tokio worker thread that is waiting on it, stalling other tasks scheduled on that thread. Prefer tokio::sync::RwLock/Mutex or dedicated async-friendly structures.
  • No HTTP/2: Axum/hyper support it, but it is not explicitly enabled; enabling it would help multiplexing under load.
  • No connection/timeout tuning: Listen backlog, request timeout, body read timeout not explicitly configured.
  • Benchmarks vs server: Throughput/latency benches exercise GuardianAgent::validate_action in-process, not the full HTTP stack (middleware, auth, rate limit). Add server-level benchmarks for realistic numbers.
  • Log I/O: Logger uses std::fs::File + BufWriter behind a mutex; under very high write load this could become a bottleneck. A batched or async write path would mitigate it.
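One way to take log writes out of the request path is a dedicated writer thread fed by a channel, so that handlers only enqueue lines and never wait on file I/O or a shared lock. The sketch below uses only the standard library; `spawn_log_writer` is a hypothetical name, and the signing and encryption steps of the real logger are deliberately left out.

```rust
use std::io::{self, BufWriter, Write};
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Spawns a writer thread that owns `sink` and writes queued lines in batches.
/// Dropping every `Sender` shuts the thread down; joining returns the sink.
pub fn spawn_log_writer<W>(sink: W) -> (Sender<String>, JoinHandle<io::Result<W>>)
where
    W: Write + Send + 'static,
{
    let (tx, rx) = mpsc::channel::<String>();
    let handle = thread::spawn(move || -> io::Result<W> {
        let mut out = BufWriter::new(sink);
        // Block until at least one line arrives, then drain whatever else
        // is already queued so the whole batch shares a single flush.
        while let Ok(first) = rx.recv() {
            writeln!(out, "{}", first)?;
            for line in rx.try_iter() {
                writeln!(out, "{}", line)?;
            }
            out.flush()?;
        }
        // All senders are gone: hand the underlying sink back to the caller.
        out.into_inner().map_err(io::Error::from)
    });
    (tx, handle)
}
```

`mpsc::channel` is unbounded, so a sustained overload would grow memory; `mpsc::sync_channel` with a fixed capacity is the stdlib alternative when backpressure is preferred over buffering.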

2.2 Security

  • No native TLS: start_server returns an error if tls_config.enabled is true; TLS is not implemented in-process. Production must use a reverse proxy (nginx, Caddy, etc.) for HTTPS. mTLS (client_ca_path) is config but unused in code.
  • RBAC not enforced per route: Users and permissions are stored and check_permission exists, but no handler checks permission before acting. Any authenticated user can call any protected endpoint. Need per-route or per-handler permission checks (e.g. Admin-only for backup, ManageRBAC for user management).
  • API keys in config: auth.api_keys (map key → user_id) can be in YAML; if config is committed or wide-readable, keys leak. Prefer env or secrets manager for API keys.
  • MFA secret storage: TOTP secrets in MfaStore (in-memory HashMap); lost on restart. No persistence or encryption at rest for MFA secrets.
  • Session storage: In-memory; no shared session store across instances (sticky sessions or Redis needed for multi-instance).
  • Client key spoofing: Rate limit/lockout key is X-Forwarded-For / X-Real-IP. If proxy is not trusted, clients can spoof; ensure proxy strips/overwrites these.
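The client-key spoofing risk can be reduced by honouring forwarded headers only when the TCP peer is a known proxy. The following is a minimal sketch under stated assumptions: `client_key` and the `trusted_proxies` list are hypothetical, there is a single trusted proxy hop, and that proxy appends the real client address as the last entry of X-Forwarded-For.

```rust
use std::net::IpAddr;

/// Derives the rate-limit/lockout key for a request.
/// Forwarded headers are used only when `peer` is a trusted proxy;
/// otherwise the peer address itself is the key.
pub fn client_key(
    peer: IpAddr,
    x_forwarded_for: Option<&str>,
    x_real_ip: Option<&str>,
    trusted_proxies: &[IpAddr],
) -> String {
    if trusted_proxies.contains(&peer) {
        // X-Forwarded-For is a comma-separated chain. Entries to the left may
        // be client-supplied, so take the last one (appended by the proxy).
        let forwarded = x_forwarded_for
            .and_then(|v| v.rsplit(',').next())
            .and_then(|ip| ip.trim().parse::<IpAddr>().ok());
        let real_ip = x_real_ip.and_then(|v| v.trim().parse::<IpAddr>().ok());
        if let Some(ip) = forwarded.or(real_ip) {
            return ip.to_string();
        }
    }
    // Untrusted peer, or no parseable header: fall back to the socket address.
    peer.to_string()
}
```

Falling back to the peer address, rather than a shared "unknown" key, avoids placing all header-less clients in one bucket.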

2.3 Compliance & Correctness

  • Forensics: Query/timeline/correlate return empty; index not built from log files. No real search or timeline for investigations.
  • Analytics: Anomaly detection and risk scoring return empty; no real algorithms.
  • Regulatory mapping: Mappings are empty; no feature → requirement evidence.
  • Chain of custody: Structure only; no RFC 3161 TSA or real verification flow.
  • Retention archive: Archive is local directory only; no S3/Azure upload in retention (config has archive_location but implementation moves to local archive/).
  • Audit persistence: Audit log is in-memory ring buffer; lost on restart; no durable audit trail for strict compliance.
  • Tenant isolation: Tenants are stored in memory, and no per-request tenant context is enforced on log read/write: no tenant_id is taken from the request or a header and used to scope data access.
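To make the tenant-isolation gap concrete, the sketch below shows one possible shape for a request-scoped tenant context that filters log entries. `TenantContext`, `LogEntry`, and the idea of reading the tenant from a header value are assumptions for illustration; the codebase's TenantManager does not currently do this.

```rust
/// Hypothetical log entry carrying the tenant it belongs to.
#[derive(Debug, Clone, PartialEq)]
pub struct LogEntry {
    pub tenant_id: String,
    pub action: String,
}

/// Tenant identity resolved once per request and passed to data access code.
#[derive(Debug)]
pub struct TenantContext {
    tenant_id: String,
}

impl TenantContext {
    /// Builds the context from a tenant header value (the header name is up
    /// to the deployment). Missing or blank values are rejected, so there is
    /// no implicit "all tenants" mode.
    pub fn from_header(value: Option<&str>) -> Result<Self, &'static str> {
        match value.map(str::trim) {
            Some(id) if !id.is_empty() => Ok(Self { tenant_id: id.to_string() }),
            _ => Err("missing tenant id"),
        }
    }

    /// Returns only the entries that belong to this request's tenant.
    pub fn scope<'a>(&self, entries: &'a [LogEntry]) -> Vec<&'a LogEntry> {
        entries
            .iter()
            .filter(|e| e.tenant_id == self.tenant_id)
            .collect()
    }
}
```

In a real deployment the tenant should be derived from the authenticated identity rather than trusted from a client-supplied header alone.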

2.4 Hardening & Operations

  • Backup cloud: S3/Azure upload in backup is stubbed (warn only); no real cloud backup.
  • TLS feature: tls feature exists in Cargo.toml but no rustls/native-tls wiring in server; config exists, code path errors out.
  • Azure features: azure-kv referenced in encryption but not declared in Cargo.toml (warnings); Azure Key Vault and Azure Blob backup not implemented.
  • Container scanning: CI has cargo audit; no Trivy (or similar) container image scan in the workflow.
  • No request timeout: No global or per-route request timeout; slow clients can hold connections.

2.5 Testing & Quality

  • Integration tests: Cover validation, retention, PII, RBAC, compliance, tenants, etc., but many assertions are permissive (e.g. assert!(x || !x)). Some tests may not fail when behavior regresses.
  • No auth/MFA e2e: No automated test that runs server with auth + MFA and checks 401/403 and MFA flow.
  • Benchmarks not in CI: Criterion benches are not run in CI; no regression tracking on throughput/latency.
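The problem with permissive assertions is easiest to see side by side. In the sketch below, `is_allowed` is a hypothetical stand-in for a validation call; the first test can never fail regardless of what the function returns, while the second pins down the expected behaviour.

```rust
/// Hypothetical stand-in for a validation call under test.
pub fn is_allowed(action: &str) -> bool {
    action != "delete_all"
}

#[cfg(test)]
mod tests {
    use super::*;

    /// Tautology: `x || !x` is true for every `x`, so this test passes even
    /// if `is_allowed` starts returning the wrong answer.
    #[test]
    fn permissive_assertion_never_fails() {
        let x = is_allowed("delete_all");
        assert!(x || !x);
    }

    /// States the expected outcome, so a regression makes the test fail.
    #[test]
    fn strict_assertion_detects_regressions() {
        assert!(!is_allowed("delete_all"));
        assert!(is_allowed("read_file"));
    }
}
```

Replacing each tautological assertion with the specific expected value is usually a mechanical change once the intended behaviour is known.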

3. Factor-by-Factor Deep Dive

3.1 Performance

| Aspect | Have | Missing / Risk |
|---|---|---|
| Async I/O | Tokio, Axum, async agent | |
| Concurrency | Multi-threaded runtime | Sync Mutex/RwLock in middleware |
| Throughput target | 10k+ req/s (docs/benches) | Server-level bench not in CI |
| Latency | Criterion latency bench (in-process) | No p99, no server stack |
| Memory | Small binary, 5–20MB described | No max heap or RSS guard |
| Startup | 50ms target | No startup bench in CI |
| Caching | Config cache, policy bundle | No JWT or validation result cache |
| Backpressure | Rate limit (token bucket) | No explicit connection/request limits |

Recommendations: Replace sync primitives in middleware with async ones; add server-level benchmarks and run in CI; consider JWT caching and request/timeout limits.
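For the missing p99 figure, a server-level benchmark would need to collect per-request latencies and report a tail percentile. A minimal sketch of the nearest-rank method follows; `percentile` is a hypothetical helper, not part of the existing Criterion benches, and it takes whole-number percentiles only.

```rust
use std::time::Duration;

/// Nearest-rank percentile of latency samples.
/// `p` must be in 1..=100; returns `None` for empty input or an invalid `p`.
pub fn percentile(samples: &[Duration], p: u32) -> Option<Duration> {
    if samples.is_empty() || p == 0 || p > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // 1-based rank = ceil(p * n / 100), computed in integers to avoid
    // floating-point rounding at the boundaries.
    let rank = (p as usize * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}
```

For 100 samples of 1 ms through 100 ms, `percentile(&samples, 99)` selects the 99th smallest sample, 99 ms. Tracking this value per CI run would give the regression signal that the in-process benches do not provide.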

3.2 Security

| Aspect | Have | Missing / Risk |
|---|---|---|
| Authentication | JWT + API key, optional | API keys in config; no secret store for keys |
| MFA | TOTP, setup/verify/disable, middleware | MFA secrets volatile; no persistence |
| Secrets management | Env keys, Vault for encryption key | JWT/API keys not in Vault |
| TLS | Config + “use proxy” error | No in-process TLS |
| Headers | HSTS, CSP, X-Frame-Options, etc. | |
| Error leakage | Sanitize unless GUARDIAN_DEBUG | |
| Input validation | Path, ID, framework, size | Could extend to more JSON schemas |
| CORS | Configurable origins | |
| Rate limiting | Per-IP/key token bucket | Key spoofing if proxy untrusted |
| Lockout | N failures → lock M min | |
| Sessions | TTL, max per user | In-memory only |
| RBAC | Users, roles, permissions | Not enforced on routes |

Recommendations: Enforce RBAC per endpoint; persist or encrypt MFA secrets; move API keys to env/Vault; add native TLS option (e.g. rustls) or document proxy-only TLS clearly.
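Per-endpoint RBAC enforcement can be as small as one guard call at the top of each handler. The sketch below reuses the role names from this report, but the `Permission` variants, the role-to-permission mapping, and the `require` and `trigger_backup` functions are assumptions for illustration; the real check would go through the existing check_permission.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Role {
    Admin,
    Auditor,
    Compliance,
    Operator,
    Viewer,
}

/// Hypothetical permission set for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Permission {
    ManageRbac,
    RunBackup,
    ReadAudit,
    ValidateAction,
}

/// Would map to an HTTP 403 response.
#[derive(Debug, PartialEq)]
pub enum AuthzError {
    Forbidden { needed: Permission },
}

impl Role {
    /// Assumed mapping: Admin may do everything, others a narrow subset.
    fn allows(self, p: Permission) -> bool {
        match self {
            Role::Admin => true,
            Role::Auditor | Role::Compliance => matches!(p, Permission::ReadAudit),
            Role::Operator => matches!(p, Permission::ValidateAction),
            Role::Viewer => false,
        }
    }
}

/// Guard to call before a handler does any work.
pub fn require(role: Role, needed: Permission) -> Result<(), AuthzError> {
    if role.allows(needed) {
        Ok(())
    } else {
        Err(AuthzError::Forbidden { needed })
    }
}

/// Example handler body: the permission check comes first.
pub fn trigger_backup(role: Role) -> Result<&'static str, AuthzError> {
    require(role, Permission::RunBackup)?;
    Ok("backup started")
}
```

Because the guard returns a `Result`, the `?` operator keeps each handler's check to a single line, which makes a missing check easier to spot in review.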

3.3 Compliance

| Aspect | Have | Missing / Risk |
|---|---|---|
| Retention | Policies, cron, delete/archive, legal hold | Archive only local |
| PII | Detect + redact on write | |
| Encryption at rest | AES-256-GCM, multiple key sources | KMS/Azure stubs or unimplemented |
| Audit | In-memory admin audit | Not durable |
| Reporting | SOC2, HIPAA, GDPR, PCI-DSS, ISO27001 | Report content is template/placeholder |
| Legal hold | List, add, remove; retention respects | |
| Forensics | API and types | No real query/timeline/correlation |
| Chain of custody | Data structures | No TSA or verification |
| Regulatory mapping | Types, gap analysis | Empty mappings |
| Analytics | Anomaly/risk types | Empty results |
| Multi-tenancy | Tenant CRUD, per-tenant config | No request-scoped tenant enforcement |

Recommendations: Implement forensics (read logs, index, query/timeline); persist audit to file or external store; implement retention archive to S3/Azure; add tenant context to requests and scope data access.
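Persisting the audit trail does not require giving up the in-memory ring buffer used for queries. A minimal sketch, assuming a hypothetical `DurableAuditLog` that appends each record to any `Write` sink (a file opened in append mode, for example) before keeping it in a bounded buffer:

```rust
use std::collections::VecDeque;
use std::io::{self, Write};

/// Fields follow the report's description of an audit entry.
#[derive(Debug, Clone, PartialEq)]
pub struct AuditRecord {
    pub actor: String,
    pub operation: String,
    pub target: String,
    pub success: bool,
}

/// Hypothetical audit log: durable sink plus bounded in-memory buffer.
pub struct DurableAuditLog<W: Write> {
    recent: VecDeque<AuditRecord>,
    capacity: usize,
    sink: W,
}

impl<W: Write> DurableAuditLog<W> {
    pub fn new(capacity: usize, sink: W) -> Self {
        let capacity = capacity.max(1);
        Self { recent: VecDeque::with_capacity(capacity), capacity, sink }
    }

    /// Persists first, then updates the ring buffer, so a record that is
    /// visible in memory has always been written to the sink.
    pub fn record(&mut self, rec: AuditRecord) -> io::Result<()> {
        // Tab-separated for brevity; fields containing tabs or newlines
        // would need escaping or a structured format such as JSON lines.
        writeln!(
            self.sink,
            "{}\t{}\t{}\t{}",
            rec.actor, rec.operation, rec.target, rec.success
        )?;
        self.sink.flush()?;
        if self.recent.len() == self.capacity {
            self.recent.pop_front(); // evict the oldest entry
        }
        self.recent.push_back(rec);
        Ok(())
    }

    /// Most recent records, oldest first.
    pub fn recent(&self) -> Vec<&AuditRecord> {
        self.recent.iter().collect()
    }

    /// Consumes the log and returns the sink.
    pub fn into_sink(self) -> W {
        self.sink
    }
}
```

The ring buffer still bounds memory, while the sink retains every record across restarts.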

3.4 Hardening

| Aspect | Have | Missing / Risk |
|---|---|---|
| Rate limiting | Yes, configurable | |
| Lockout | Yes | |
| Sessions | Yes, TTL + cap | In-memory |
| Backup | Local backup, cron, API | Cloud upload stubbed |
| Health | Live + ready | |
| Graceful shutdown | Yes | |
| CORS | Strict possible | |
| Request size | Capped | |
| No TLS in-process | Documented (use proxy) | No optional rustls build |
| CI | Tests, audit, multi-platform, Docker | No container scan; no bench in CI |

Recommendations: Implement S3/Azure backup or document as future work; add Trivy (or similar) to CI; optionally add rustls and wire tls feature.


4. Summary Tables

4.1 Implemented vs Stub vs Missing

| Component | Status | Notes |
|---|---|---|
| Policy validation (OPA) | Implemented | HTTP + fallback |
| Immutable logger | Implemented | Signing, optional encryption |
| Capability gate | Implemented | JWT capability tokens |
| Retention | Implemented | Cron, legal hold, local archive |
| PII | Implemented | Detect + redact on write |
| Encryption manager | Implemented | Local/Env/Vault; KMS stub |
| RBAC | Implemented | No per-route enforcement |
| Compliance reporter | Implemented | Reports + requirements |
| Audit log | Implemented | Volatile, in-memory |
| Backup | Implemented | Local + cron; cloud stubbed |
| MFA | Implemented | TOTP; secrets volatile |
| Forensics | Stub | Empty query/timeline/correlate |
| Analytics | Stub | Empty anomalies/risk |
| Regulatory mapping | Stub | Empty mappings |
| Chain of custody | Stub | Types only |
| MCP config/stats | Runtime only | Not persisted |
| Native TLS | Missing | Config exists; use proxy |
| Per-route RBAC | Missing | check_permission not used in server |
| Durable audit | Missing | In-memory only |
| Tenant-scoped access | Missing | No request tenant context |

4.2 Risk Overview

| Risk | Severity | Mitigation |
|---|---|---|
| No per-route RBAC | High | Add permission checks to admin/sensitive handlers |
| MFA secrets lost on restart | Medium | Persist (encrypted) or document limitation |
| Audit not durable | Medium | Write audit to file or external store |
| Sync locks in middleware | Medium | Switch to async locks or dedicated structures |
| API keys in config | Medium | Env or secrets manager only |
| Forensics/analytics stubs | Low | Implement or mark as future in docs |
| No native TLS | Low | Reverse proxy is acceptable; document clearly |

5. Conclusion

Guardian Agent has a solid base: async Rust, broad API surface, auth, MFA, rate limiting, lockout, sessions, security headers, input validation, error sanitization, retention, PII, encryption, RBAC, compliance reporting, backup scheduling, and good CI (including cargo audit). The main gaps are: per-route RBAC enforcement, native TLS (or explicit proxy-only guidance), durable audit, persisted MFA secrets, forensics/analytics/regulatory implementations, and replacing sync locks in hot-path middleware for scalability. Addressing the high/medium items above would materially strengthen production readiness and compliance posture.