Skip to content

Audit follow-up: defense-in-depth findings (M2, M3, M4, M5, M8, M11) #144

@jiashuoz

Description

@jiashuoz

Tracking the deferred audit findings from PR #142. These are real but either defense-in-depth or larger refactors that didn't earn a same-PR fix when batched with the H-tier work. Listing here so they don't get lost.

For full context on the audit that surfaced these, see the analysis trail in PRs #140 and #142.


M2 — Pagination cursor unsigned (DoS surface)

Where: internal/agent/api.go:1786-1825

The opaque pagination token is just base64url(json) — no HMAC. Any authenticated caller can edit CreatedAt / ID in the JSON, e.g. set CreatedAt to 0001-01-01 to force a full-table-scan-then-truncate against messages. The cursor.AgentID check prevents cross-tenant reach, but a single bearer can repeatedly request expensive windows. Bounded by partial indexes today; integrity gap is real.

Fix sketch: HMAC the cursor JSON with cfg.Signing.HMACSecret (same pattern as the approval-token signer in internal/approvaltoken/token.go). Reject on tamper.


M3 — Magic-link approval tokens replay-able within TTL

Where: internal/approvaltoken/token.go, used by internal/agent/hitl_magic_api.go

A signed approval / reject token is valid for the entire message TTL plus a 10-minute grace. The status-race guard at approve/reject time blocks repeated state changes once the message is finalized, but if a token leaks (mail archive scrape, Slack-channel paste, Gmail link-tracking) before the reviewer acts, an attacker who reaches the link can race them.

Fix sketch: Either (a) track magic_link_consumed_at on the messages table (new column + migration + index + handler check), or (b) maintain a single-use jti table that the handler INSERTs into and lets the unique constraint reject replay.


M4 — DKIM private keys stored unencrypted at rest

Where: migrations/014_per_domain_dkim.sql, internal/identity/store.go:350-364 (GetDKIMKeyInternal)

Per-domain RSA-2048 private keys live in domains.dkim_private_key BYTEA with no envelope encryption. A DB-read compromise (backup leak, replica access) yields "sign as any user's domain" capability.

Fix sketch: Envelope-encrypt with a KEK derived via HKDF from cfg.Signing.HMACSecret (same secret-rotation pattern as OAuth signing), or move to an external KMS for hosted deployments. Migration needs to re-encrypt existing rows in a backwards-compatible window.


M5 — Webhook delivery vulnerable to DNS rebinding (SSRF)

Where: internal/webhook/deliverer.go:67-105, validation only at registration in internal/agent/api.go:66-98 (ValidateWebhookURL)

ValidateWebhookURL resolves the host and rejects private IPs at agent-registration time. At delivery time, only requireHTTPS is checked. The underlying DNS may have been rebound to 127.0.0.1 or RFC1918. CheckRedirect blocks 30x → loopback, but the initial DNS lookup of the registered hostname is the SSRF vector — register evil.example.com (public IP, passes validation), later change DNS to 127.0.0.1, e2a then POSTs the inbound payload to localhost.

Fix sketch: Use a custom net.Dialer with a Control callback that rejects RFC1918/loopback per-connection, mounted as the Transport.DialContext on the webhook http.Client. Re-resolves and re-checks on every connection regardless of caching.


M8 — No CSRF token on /api/oauth/consent

Where: internal/agent/oauth_handlers.go:658-755, session cookie at internal/auth/auth.go:233 (SameSite: http.SameSiteLaxMode)

Consent POST has no anti-CSRF token. SameSite=Lax blocks cross-site POSTs reliably in current browsers, so this is defensible — but a future browser/extension default-downgrade or a same-site subdomain XSS could mount the consent flow without user intent and silently authorize an MCP client to a chosen agent. Defense in depth.

Fix sketch: Bind a one-time CSRF token to the session cookie, mint on the /oauth/consent GET, verify on POST. Standard double-submit or per-session origin-bound token.


M11 — formatExpiresIn × 4, none tick

Where: web/src/app/(app)/dashboard/agents/messages/view/page.tsx:53, web/src/app/(app)/dashboard/pending/page.tsx:28, web/src/app/(app)/dashboard/pending/_components/PendingDetailPanel.tsx:55, web/src/app/(app)/api-keys/page.tsx:38

formatExpiresIn is reimplemented 4 times. None of the surfaces have a tick interval — the "expires in 47m" label captures Date.now() at render and never refreshes, so a reviewer who leaves the page open for 30 minutes sees stale numbers until SWR's window-focus revalidation kicks the data over. UX freshness, not correctness.

Fix sketch: Extract a shared <RelativeTime iso={...} kind="expires" /> component that owns its own 30-60s setInterval (mirroring the existing tick pattern in dashboard/page.tsx). Replace all 4 callsites.


Sequencing suggestion

  1. M2 — HMAC the cursor (smallest, defense-in-depth on existing surface)
  2. M3 — magic-link single-use (touches one new column + handler)
  3. M5 — webhook DNS rebind (focused fix on one client; high-impact security)
  4. M8 — OAuth CSRF (parallel to M2; reuses HMAC secret infrastructure)
  5. M11 — ticker UI (frontend only, no security implications)
  6. M4 — DKIM at-rest encryption (largest scope; KEK derivation + migration of existing rows)

Each is independent — they can land in any order or be cherry-picked as ad-hoc work permits.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions