Skip to content

feat(api): SP-5 · /metrics surface + structured JSON logging (AIN-238 + AIN-249)#101

Merged
hizrianraz merged 1 commit into
mainfrom
recut/sp5-metrics-logging
May 31, 2026
Merged

feat(api): SP-5 · /metrics surface + structured JSON logging (AIN-238 + AIN-249)#101
hizrianraz merged 1 commit into
mainfrom
recut/sp5-metrics-logging

Conversation

@hizrianraz
Copy link
Copy Markdown
Contributor

@hizrianraz hizrianraz commented May 31, 2026

Extracted from the #79 omnibus, rebased clean onto current main.

Scope: SP-5 only — /metrics Prometheus-style surface + structured JSON logging with secret scrubbing (AIN-238, AIN-249). Payment rails (SP-8) split into a separate PR for deliberate review.

Dropped as redundant during re-cut: SP-1 rename + SP-CLOSE (already in main), SP-2 streaming (already upstream).

Verified locally: 724 unit+smoke tests pass; app imports clean (85 routes); exact-match OpenAPI contract test green.

Replaces the SP-5 portion of #79.


Note

Medium Risk
Touches global logging and a new internal metrics endpoint that queries audit_events; auth uses simple key equality (not verify_internal_key rotation helper), and log scrubbing is a security backstop that must stay complete as new secret shapes appear.

Overview
Adds SP-5 observability: process-wide structured JSON logging with secret/PII scrubbing, a custom Prometheus-style metrics registry (no prometheus_client), per-request HTTP metrics middleware, and an internal-key–gated GET /metrics scrape endpoint.

Logging: main.py installs StructuredJSONFormatter on the root logger before router import so all stdout logs are JSON with key-based and regex redaction (API keys, Bearer/JWT, prompts, internal key, etc.).

Metrics: New services/metrics.py defines counters/gauges/histograms and named series (ainfera_http_*, audit chain, killswitch, provider/alias helpers, etc.) with route-template labels and a path cardinality cap. RequestMetricsMiddleware records every request (including after exception handlers). /metrics (hidden from OpenAPI) requires X-Ainfera-Internal-Key, refreshes audit chain gauges from DB on scrape, and renders exposition format.

Tests: Unit coverage for registry rendering and formatter scrubbing behavior.

Note: Several series are defined for future wiring (e.g. provider calls, killswitch on scrape); this PR primarily instruments HTTP + audit height/freshness at scrape time.

Reviewed by Cursor Bugbot for commit 7e81c9b. Bugbot is set up for automated code reviews on this repo. Configure here.

…h secret scrubbing (AIN-238 + AIN-249)

Adds the internal-scoped observability surface + a structured JSON
log formatter that scrubs secrets before bytes leave the process.

Stacks on SP-2 api#72 (`feat/ain271-streaming-tooluse`); independent
of SP-5 PR-A (#76 supply-chain) and SP-5 PR-B (#77 resilience).

Dependency-free registry in `services/metrics.py`. Named series
(process-global; NO tenant_id / agent_id / owner_handle):
  - `ainfera_http_requests_total{method,path,status}` — counter
  - `ainfera_http_request_duration_seconds{method,path}` — histogram
  - `ainfera_provider_calls_total{provider,outcome}` — counter
  - `ainfera_router_alias_hit_total{alias}` — counter
  - `ainfera_audit_chain_height` + `_freshness_seconds` — gauges
  - `ainfera_dispatch_without_capture_total` — bridge for SP-4 PR-A
  - `ainfera_cost_killswitch_{engaged,spent_usd,threshold_usd}` —
    bridge for SP-5 PR-B
  - `ainfera_app_info{version}` — constant info gauge

`middleware/request_metrics.py` — ASGI middleware that times every
request and uses the FastAPI route TEMPLATE for the path label so
agent_id etc. never leak. Defensive label-cardinality cap (200
unique paths) blocks probe-spam from blowing up the histogram set.

`routers/metrics.py` — `GET /metrics` gated by
`X-Ainfera-Internal-Key` (same key the signup proxy uses). Cold-path
enrichment reads `max(seq)` + `max(created_at)` from audit_events
(read-only — never mutates the immutable chain). Hidden from openapi
so it's not advertised to public clients.

`ainfera_dispatch_without_capture_total` is registered here; SP-4
PR-A's `DispatchCaptureCounter` plugs in via a single `.inc()` call
once both PRs merge.

`services/structured_log.py` — `StructuredJSONFormatter` emits one
JSON object per record + scrubs secrets in two layers:
  1. Per-KEY scrubbing for structured `extra` fields (`api_key`,
     `password`, `secret`, `token`, `authorization`, `cookie`,
     `prompt`, `messages`, `content`).
  2. Regex pass for known secret SHAPES in freeform message text
     (`ai_infera_*`, `sk-*`, `Bearer *`, JWT `eyJ*.*.*`).

Tracebacks also flow through the scrubber. Wired in `main.py` via
`logging.basicConfig(handlers=[...], force=True)` BEFORE the routers
import so startup log lines are also scrubbed.

- `tests/unit/test_structured_log.py` — 10 cases (each secret format
  + structured extra + nested dicts + innocent passthrough + tracebacks).
- `tests/unit/test_metrics_registry.py` — 13 cases (primitives, label
  escaping, cumulative buckets, sorted render, named-series wrappers).

Pre-commit: ruff + ruff-format + mypy --strict + pytest unit+smoke
= 529 green.

- NO tenant_id, agent_id, owner_handle, or any PII appears as a
  metrics label.
- `/metrics` is internal-key gated; tenant cardinality (if ever
  needed) lands on a stricter-auth endpoint.
- Log lines are scrubbed by both KEY and SHAPE. The
  `test_extra_field_with_prompt_label_redacted` test locks "prompt
  content is PII; never log it" into CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hizrianraz hizrianraz merged commit 7464cb2 into main May 31, 2026
3 of 5 checks passed
@hizrianraz hizrianraz deleted the recut/sp5-metrics-logging branch May 31, 2026 15:06
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

Bugbot Autofix is ON, but it could not run because the branch was deleted or merged before autofix could start.

Reviewed by Cursor Bugbot for commit 7e81c9b. Configure here.

raise HTTPException(
status_code=status.HTTP_401_UNAUTHORIZED,
detail="invalid X-Ainfera-Internal-Key",
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metrics skips key rotation

Medium Severity

/metrics compares the header to internal_signup_key with != instead of Settings.verify_internal_key. During AIN-289 rotation, scrapers using internal_signup_key_previous get 401 here while other internal endpoints still accept the old key.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 7e81c9b. Configure here.

raise HTTPException(
status_code=status.HTTP_401_UNAUTHORIZED,
detail="invalid X-Ainfera-Internal-Key",
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DB opened before metrics auth

Low Severity

DBSession is injected before the internal-key check in the handler body, so unauthenticated /metrics probes still acquire a database session. Sibling internal routes use Depends(require_internal_key) so auth fails before any DB work.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 7e81c9b. Configure here.

hizrianraz added a commit that referenced this pull request May 31, 2026
…103)

The re-cut/squash of the SP-5 metrics work (#101) changed the commit SHAs of
the synthetic sk-*/ai_infera_* fixtures in tests/unit/test_structured_log.py,
so the SHA-pinned .gitleaksignore fingerprints drifted and the full-history
scan flagged them again (4 findings, all that one test file).

Replace the fragile SHA-pinning with a robust path allowlist (matches the
existing fixture-allowlist pattern). These are synthetic keys whose sole
purpose is to exercise the log redactor — never live secrets.

Verified: gitleaks detect over full history → no leaks found.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant