Skip to content

fix: close audit_30 backlog — 1 HIGH + 6 MEDIUM (Workstream C / R3)#118

Merged
brownjuly2003-code merged 7 commits into
mainfrom
fix/audit-30-backlog
Jun 30, 2026
Merged

fix: close audit_30 backlog — 1 HIGH + 6 MEDIUM (Workstream C / R3)#118
brownjuly2003-code merged 7 commits into
mainfrom
fix/audit-30-backlog

Conversation

@brownjuly2003-code

Copy link
Copy Markdown
Owner

Workstream C / R3 — fresh adversarial audit + fixes (road-to-9.8.md)

A fresh multi-agent adversarial audit of main (5814712), run after the
audit_28 backlog (16 HIGH) was closed: 4 read-only dimension agents
(concurrency · data/SQL/DV2/masking/tenant · security · API/contract/resource) +
1 verifier that executed code to confirm or refute each candidate. It found
1 HIGH + 6 MEDIUM new defects. Each is fixed here as one thematic commit with
a regression test that fails on the old code and passes on the new.

Fixed

  • D1 (HIGH) — cross-tenant leak. A CTE whose name collides with a real table
    hid the physical inner reference from _scope_sql, so a tenant query bound to
    the shared main schema and returned every tenant's rows. Rescope physical
    refs via sqlglot scope resolution (genuine CTEs untouched). Reproduced
    end-to-end.
  • C1 — alert resolved notification lost. The resolved transition advanced the
    alert even when delivery failed → the page was lost and the incident stayed
    open. Re-attempt next tick (mirrors the firing/escalation hardening).
  • C2 — webhook deliveries dropped. An exception mid per-webhook loop unwound
    dispatch_new_events after the event was marked seen → later webhooks were
    never enqueued. Isolate each webhook; mark seen only after durable enqueue.
  • S1 — unauthenticated DoS. Analytics middleware ran before auth/rate-limit,
    spawning a DB-writing thread and persisting an attacker-controlled body per
    request. Record only after successful auth; cap query_text.
  • A2 — event-loop blocking. lineage / slo / the SSE stream ran sync DuckDB
    scans inline, serializing all tenants on the worker. Offload to a threadpool on
    a dedicated cursor. (deadletter read handlers + alert/webhook history reads:
    same root cause and recipe — tracked as a follow-up.)
  • A1 — metric window unvalidated. /v1/metrics accepted any window, silently
    computed 1h, and echoed the requested window. Validate against
    available_windows → 422 (mirrors the alerts router).
  • D2 — PII masking bypass. Masking keyed on the output column name, so
    SELECT email AS contact returned cleartext. Mask by projection lineage.

Verification

  • Full unit suite 1498 passed; ruff + ruff format clean (285 files);
    mypy --strict clean (100 files); webhook integration (no-Docker) 12 passed.
  • Audit write-up + per-finding repro commands in audit_30_06_26.md (gitignored).

Honest scope

  • Refuted by the verifier: S2 (admin-key in logs) — the shipped
    config/security.yaml already redacts X-Admin-Key/Cookie; no defect at HEAD.
  • Downgraded to LOW (documented, non-gating): two unbounded .add()-only
    seen-sets (global webhook dedup — entangled with cache invalidation; per-SSE —
    bounded by connection lifetime).

🤖 Generated with Claude Code

JuliaEdom and others added 7 commits June 30, 2026 10:38
…30 D1)

_scope_sql is the sole tenant-isolation mechanism (one DuckDB DB, a schema per
tenant, no per-connection search_path). It skipped any table whose name matched
any CTE in the statement, so in `WITH orders_v2 AS (SELECT * FROM orders_v2)
SELECT * FROM orders_v2` the physical inner orders_v2 stayed unqualified, bound
to the shared `main` schema (populated for every tenant by demo-seed and
local_pipeline), and returned other tenants' rows — a cross-tenant leak.

Replace the global cte_names skip with sqlglot scope resolution: classify each
table reference as physical (rescope to the caller's tenant schema) or an
in-scope CTE reference (leave alone), and keep the "tenant context required"
guard firing for a collision in the no-schema branch. Genuine CTEs that merely
share a real table's name are unaffected.

Regression test fails on the old code (inner ref leaks to main) and passes on
the new (inner ref pinned to the tenant schema).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…dit_30 C1)

The resolved-transition branch discarded deliver()'s result and unconditionally
advanced the alert to state="resolved"/fired_at=None. If every resolved page
failed (receiver briefly down/5xx/timeout) the alert still went to resolved and
the next tick fell straight through to "ok" — the resolved notification was lost
for good and the receiver kept the incident open. Firing and escalation were
already hardened against this (audit_28 #4); the resolved branch was not.

Now count only successful resolved deliveries and, if any fail, leave fired_at
set and the state unchanged so the next evaluation tick re-attempts (at-least-
once, consistent with the firing path). Happy path is unchanged.

Regression test fails on the old code (fired_at cleared despite failure) and
passes on the new; a success-path test pins that a delivered resolved page
still advances the alert.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… after enqueue (audit_30 C2)

dispatch_new_events added the event to seen_event_ids *before* the per-webhook
loop, and the loop body (_enqueue_delivery -> deliver -> _record_outcome) had no
per-webhook isolation. An exception while handling one webhook — e.g. an
httpx.InvalidURL deliver() doesn't catch, or a duckdb.Error — unwound the whole
method after the event was already seen, so webhooks not yet reached got no
durable queue row and were never re-driven: permanently undelivered.

Now each matching webhook is wrapped in try/except, the event is marked seen
only once every matching webhook is durably enqueued (still runs for zero-webhook
events, preserving the metric-cache-invalidation trigger), and _enqueue_delivery
returns whether it inserted a new row so a re-scan (when another webhook's
enqueue failed and left the event unseen) does not re-POST already-queued
deliveries inline. Inline-delivery failures are swallowed — the durable row is
re-driven by process_delivery_queue.

Regression tests fail on the old code (_enqueue_delivery returned None; the
second webhook was never enqueued) and pass on the new.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… query_text (audit_30 S1)

The analytics middleware sits OUTSIDE AuthMiddleware in the stack, so it recorded
a session — spawning a DB-writing thread and persisting the request body — for
every /v1 request regardless of the auth result, including unauthenticated 401s
and the unthrottled flood path (per-key and per-IP throttling both live inside
AuthMiddleware). An attacker could drive un-throttled thread spawns and write an
attacker-controlled ~1 MiB query_text per request: a remote DoS + storage
amplification.

Gate recording on a present request.state.tenant_key and a non-rejected status
(skip 401/403/429/503), on both the success and the exception path, and truncate
query_text to the same 1000-char bound /v1/query enforces (analytics runs before
the route's Pydantic validation). Authenticated traffic is unaffected and stays
rate-limited by AuthMiddleware; 401 visibility remains via the AUTH_FAILURES
counter.

Regression tests fail on the old code (unauthenticated request scheduled a write;
oversized question persisted verbatim) and pass on the new; a positive test pins
that an authenticated request still records.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lineage, slo and the SSE stream handlers ran their synchronous DuckDB scans
inline in async handlers on the single event-loop thread. The hot paths
(entity/metric/query) were already offloaded with run_in_threadpool and pinned
by test_agent_query_async, but these were missed: a slow scan froze the loop for
every tenant on the worker (noisy-neighbour), and concurrent requests serialized
instead of overlapping. The SSE stream re-runs its scan once per second per open
connection, so it blocked the loop on every tick.

Offload each scan with run_in_threadpool and run it on a dedicated cursor (not
the shared connection object) so concurrent requests on different worker threads
don't collide. Kept query_engine._conn rather than db_pool.read_conn(): with the
default DUCKDB_PATH=":memory:" the pool and the engine hold *separate* in-memory
databases, so a pooled read connection would not see the engine's data.

Regression test fails on the old code (4 concurrent /v1/lineage requests
serialize to ~1.2s) and passes on the new (~0.3s, overlapped). deadletter's read
handlers and the alert/webhook history reads share the same root cause and fix
recipe and are tracked as a follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
GET /v1/metrics/{name} never validated the window. The engine maps an unknown
window to "1 hour" via WINDOW_MAP.get(window, "1 hour") (and active_sessions,
whose template is hard-coded to 30 minutes, ignores it entirely), but the
handler echoed the raw requested window back in the response and used it in the
cache key. So GET .../revenue?window=2h returned the 1-hour value labelled "2h",
and an arbitrary stream of bogus window strings polluted the metric cache. The
alerts router already validated the same field; the core read path did not.

Reject windows not in the metric's available_windows with 422, mirroring
alerts._validate_metric_request, so only declared windows are computed, echoed
and cached.

Regression test fails on the old code (window=2h returned 200) and passes on the
new (422 for revenue?window=2h and active_sessions?window=24h, 200 for a declared
window).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… (audit_30 D2)

PiiMasker.mask matched rule fields against the result-row dict keys (the output
column names), so a query that renamed a PII column — `SELECT email AS contact`
or `SELECT lower(email) AS e` — returned the cleartext value under the alias and
left pii_masked False (no X-PII-Masked header). Reachable in LLM mode, where the
model controls the projection.

Resolve projection lineage from the SELECT and mask every output column that
*derives* from a rule's source field, not just one literally named after it.
SELECT * (and unparseable SQL) keep the previous name-based matching, which is
correct because the outputs are the canonical source names. Single-entity and
batch masking pass no lineage and are unchanged.

Regression tests fail on the old code (`email AS contact` / `lower(email) AS e`
came back cleartext) and pass on the new; a SELECT * test pins the fallback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

DORA Metrics

  • Window: last 30 days
  • Branch: main
  • Deployment frequency: 129 total / 30.1 per week
  • Lead time for changes: avg 0.32h / median 0.0h
  • Change failure rate: 78.29% (101/129)
  • MTTR: 0.25h across 3 incident(s)

@brownjuly2003-code brownjuly2003-code merged commit d4b3b01 into main Jun 30, 2026
22 of 23 checks passed
@brownjuly2003-code brownjuly2003-code deleted the fix/audit-30-backlog branch June 30, 2026 08:29
brownjuly2003-code added a commit that referenced this pull request Jun 30, 2026
…t_30 A1 follow-up) (#119)

The A1 fix (#118) makes GET /v1/metrics reject a window the metric doesn't
declare. The e2e support-agent journey and the support-agent example queried
active_sessions — a point-in-time gauge whose only declared window is "now" —
with window="1h", which the old code silently computed as a 30-minute count and
echoed back as "1h". Post-merge that correctly returns 422, which turned E2E
Tests + Staging Deploy red on main.

Use window="now" in the e2e journey and the example, and extend the A1 unit test
to pin that active_sessions?window=now returns 200.

Co-authored-by: JuliaEdom <uedomskikh@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
brownjuly2003-code added a commit that referenced this pull request Jun 30, 2026
…IGH + 2 MEDIUM) (#123)

* fix(security): reject WITH RECURSIVE CTE shadowing a tenant table (D1 bypass)

The D1 fix (f153b23) re-scopes a physical table reference shadowed by a
non-recursive CTE of the same name, but a recursive CTE *can* self-
reference, so sqlglot keeps its name in its own body scope: the physical
anchor reference (the first UNION branch, which cannot self-reference) is
mis-classified as a CTE reference and never re-scoped — it stays bound to
the shared `main` schema and leaks every tenant's rows. _scope_sql is the
sole tenant-isolation mechanism (one DuckDB, schema-per-tenant), so this
is a full cross-tenant read from a single valid SELECT.

There is no safe re-scoping of a recursive anchor (genuinely ambiguous
with the recursion) and no legitimate query names a recursive CTE after a
physical table, so fail closed at both layers: validate_nl_sql rejects the
shape (the NL/LLM gate) and _scope_sql raises for any other caller
(defense-in-depth). Non-recursive shadows keep the safe re-scope path.

Regression tests fail on old code (validate accepts; _scope_sql leaks),
pass on new. Verified e2e: the recursive attack through execute_nl_query
now returns 403; the legitimate tenant_a query still sees only its rows.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(security): mask PII renamed through a subquery/CTE (D2 lineage bypass)

The D2 fix (faa6c77) mapped each output column to the source names in the
*outermost* projection only, so a PII column renamed at an inner level —
`SELECT contact FROM (SELECT email AS contact FROM users_enriched) t` (and
the CTE/double-rename variants) — never matched the `email` rule and was
returned as cleartext with no X-PII-Masked signal.

Resolve true projection lineage with sqlglot.lineage, tracing each output
column to its ultimate source columns across subqueries, CTEs and union
branches. Union the deep lineage sources with the shallow columns named
directly in the projection: lineage is blind through an inner `SELECT *`
(no schema to expand it) where the shallow scan still catches a direct
`email AS contact`, and the shallow scan misses the inner rename lineage
catches — either alone leaks one shape, the union closes both. A lineage
failure falls back to a sentinel source set that matches every rule field
(fail closed), so an unresolvable column is masked, never leaked.

Regression test covers subquery, CTE and double-rename renames; the full
masking + property + router suites stay green. SELECT * still falls back to
name-matching (its outputs are the source names verbatim).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(api): serialize lazy table DDL to stop cold-DB concurrent-create 500s

The #120 read-handler offload moved deadletter/webhook-log/alert-history
reads onto worker threads (run_in_threadpool) with a dedicated cursor, but
each still calls its ensure_*_table() — a lazy CREATE TABLE IF NOT EXISTS /
ALTER ... ADD COLUMN IF NOT EXISTS — on that cursor. Pre-#120 these ran
serialized on the event loop; the offload let them run truly in parallel,
and concurrent catalog DDL on one DuckDB raises "Catalog write-write
conflict" (across different tables too, not just the same one — the catalog
is a single versioned structure). The serving store defaults to :memory:,
cold on every restart, so a concurrent burst of cold reads returned HTTP
500s until one request won the CREATE race.

Serialize every ensure_*_table behind one shared process-wide lock
(src/db_concurrency.catalog_ddl_lock): the first thread creates, the rest
see a warm no-op (warm DDL doesn't conflict). The lock is held only for the
brief CREATE/ALTER, never around queries and never nested. Also close the
freshly-opened read cursor in deadletter._read_cursor if the DDL raises,
before the handler's own try/finally takes over (no per-failure leak).

Regression tests fire 32 threads at one ensure_* and 12 each across all
three behind a Barrier: reliably raise without the lock (verified 16-30
conflicts), green with it. A shared-instance test pins the single lock so a
future per-table lock can't reintroduce the cross-table race.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: JuliaEdom <uedomskikh@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants