feat(api): SP-5 · /metrics surface + structured JSON logging (AIN-238 + AIN-249)#101
Conversation
…h secret scrubbing (AIN-238 + AIN-249) Adds the internal-scoped observability surface + a structured JSON log formatter that scrubs secrets before bytes leave the process. Stacks on SP-2 api#72 (`feat/ain271-streaming-tooluse`); independent of SP-5 PR-A (#76 supply-chain) and SP-5 PR-B (#77 resilience). Dependency-free registry in `services/metrics.py`. Named series (process-global; NO tenant_id / agent_id / owner_handle): - `ainfera_http_requests_total{method,path,status}` — counter - `ainfera_http_request_duration_seconds{method,path}` — histogram - `ainfera_provider_calls_total{provider,outcome}` — counter - `ainfera_router_alias_hit_total{alias}` — counter - `ainfera_audit_chain_height` + `_freshness_seconds` — gauges - `ainfera_dispatch_without_capture_total` — bridge for SP-4 PR-A - `ainfera_cost_killswitch_{engaged,spent_usd,threshold_usd}` — bridge for SP-5 PR-B - `ainfera_app_info{version}` — constant info gauge `middleware/request_metrics.py` — ASGI middleware that times every request and uses the FastAPI route TEMPLATE for the path label so agent_id etc. never leak. Defensive label-cardinality cap (200 unique paths) blocks probe-spam from blowing up the histogram set. `routers/metrics.py` — `GET /metrics` gated by `X-Ainfera-Internal-Key` (same key the signup proxy uses). Cold-path enrichment reads `max(seq)` + `max(created_at)` from audit_events (read-only — never mutates the immutable chain). Hidden from openapi so it's not advertised to public clients. `ainfera_dispatch_without_capture_total` is registered here; SP-4 PR-A's `DispatchCaptureCounter` plugs in via a single `.inc()` call once both PRs merge. `services/structured_log.py` — `StructuredJSONFormatter` emits one JSON object per record + scrubs secrets in two layers: 1. Per-KEY scrubbing for structured `extra` fields (`api_key`, `password`, `secret`, `token`, `authorization`, `cookie`, `prompt`, `messages`, `content`). 2. Regex pass for known secret SHAPES in freeform message text (`ai_infera_*`, `sk-*`, `Bearer *`, JWT `eyJ*.*.*`). Tracebacks also flow through the scrubber. Wired in `main.py` via `logging.basicConfig(handlers=[...], force=True)` BEFORE the routers import so startup log lines are also scrubbed. - `tests/unit/test_structured_log.py` — 10 cases (each secret format + structured extra + nested dicts + innocent passthrough + tracebacks). - `tests/unit/test_metrics_registry.py` — 13 cases (primitives, label escaping, cumulative buckets, sorted render, named-series wrappers). Pre-commit: ruff + ruff-format + mypy --strict + pytest unit+smoke = 529 green. - NO tenant_id, agent_id, owner_handle, or any PII appears as a metrics label. - `/metrics` is internal-key gated; tenant cardinality (if ever needed) lands on a stricter-auth endpoint. - Log lines are scrubbed by both KEY and SHAPE. The `test_extra_field_with_prompt_label_redacted` test locks "prompt content is PII; never log it" into CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is ON, but it could not run because the branch was deleted or merged before autofix could start.
Reviewed by Cursor Bugbot for commit 7e81c9b. Configure here.
| raise HTTPException( | ||
| status_code=status.HTTP_401_UNAUTHORIZED, | ||
| detail="invalid X-Ainfera-Internal-Key", | ||
| ) |
There was a problem hiding this comment.
Metrics skips key rotation
Medium Severity
/metrics compares the header to internal_signup_key with != instead of Settings.verify_internal_key. During AIN-289 rotation, scrapers using internal_signup_key_previous get 401 here while other internal endpoints still accept the old key.
Reviewed by Cursor Bugbot for commit 7e81c9b. Configure here.
| raise HTTPException( | ||
| status_code=status.HTTP_401_UNAUTHORIZED, | ||
| detail="invalid X-Ainfera-Internal-Key", | ||
| ) |
There was a problem hiding this comment.
DB opened before metrics auth
Low Severity
DBSession is injected before the internal-key check in the handler body, so unauthenticated /metrics probes still acquire a database session. Sibling internal routes use Depends(require_internal_key) so auth fails before any DB work.
Reviewed by Cursor Bugbot for commit 7e81c9b. Configure here.
…103) The re-cut/squash of the SP-5 metrics work (#101) changed the commit SHAs of the synthetic sk-*/ai_infera_* fixtures in tests/unit/test_structured_log.py, so the SHA-pinned .gitleaksignore fingerprints drifted and the full-history scan flagged them again (4 findings, all that one test file). Replace the fragile SHA-pinning with a robust path allowlist (matches the existing fixture-allowlist pattern). These are synthetic keys whose sole purpose is to exercise the log redactor — never live secrets. Verified: gitleaks detect over full history → no leaks found.


Extracted from the #79 omnibus, rebased clean onto current
main.Scope: SP-5 only —
/metricsPrometheus-style surface + structured JSON logging with secret scrubbing (AIN-238, AIN-249). Payment rails (SP-8) split into a separate PR for deliberate review.Dropped as redundant during re-cut: SP-1 rename + SP-CLOSE (already in main), SP-2 streaming (already upstream).
Verified locally: 724 unit+smoke tests pass; app imports clean (85 routes); exact-match OpenAPI contract test green.
Replaces the SP-5 portion of #79.
Note
Medium Risk
Touches global logging and a new internal metrics endpoint that queries
audit_events; auth uses simple key equality (notverify_internal_keyrotation helper), and log scrubbing is a security backstop that must stay complete as new secret shapes appear.Overview
Adds SP-5 observability: process-wide structured JSON logging with secret/PII scrubbing, a custom Prometheus-style metrics registry (no
prometheus_client), per-request HTTP metrics middleware, and an internal-key–gatedGET /metricsscrape endpoint.Logging:
main.pyinstallsStructuredJSONFormatteron the root logger before router import so all stdout logs are JSON with key-based and regex redaction (API keys, Bearer/JWT, prompts, internal key, etc.).Metrics: New
services/metrics.pydefines counters/gauges/histograms and named series (ainfera_http_*, audit chain, killswitch, provider/alias helpers, etc.) with route-template labels and a path cardinality cap.RequestMetricsMiddlewarerecords every request (including after exception handlers)./metrics(hidden from OpenAPI) requiresX-Ainfera-Internal-Key, refreshes audit chain gauges from DB on scrape, and renders exposition format.Tests: Unit coverage for registry rendering and formatter scrubbing behavior.
Note: Several series are defined for future wiring (e.g. provider calls, killswitch on scrape); this PR primarily instruments HTTP + audit height/freshness at scrape time.
Reviewed by Cursor Bugbot for commit 7e81c9b. Bugbot is set up for automated code reviews on this repo. Configure here.