fix: close audit_30 backlog — 1 HIGH + 6 MEDIUM (Workstream C / R3)#118
Merged
Conversation
…30 D1) _scope_sql is the sole tenant-isolation mechanism (one DuckDB DB, a schema per tenant, no per-connection search_path). It skipped any table whose name matched any CTE in the statement, so in `WITH orders_v2 AS (SELECT * FROM orders_v2) SELECT * FROM orders_v2` the physical inner orders_v2 stayed unqualified, bound to the shared `main` schema (populated for every tenant by demo-seed and local_pipeline), and returned other tenants' rows — a cross-tenant leak. Replace the global cte_names skip with sqlglot scope resolution: classify each table reference as physical (rescope to the caller's tenant schema) or an in-scope CTE reference (leave alone), and keep the "tenant context required" guard firing for a collision in the no-schema branch. Genuine CTEs that merely share a real table's name are unaffected. Regression test fails on the old code (inner ref leaks to main) and passes on the new (inner ref pinned to the tenant schema). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…dit_30 C1) The resolved-transition branch discarded deliver()'s result and unconditionally advanced the alert to state="resolved"/fired_at=None. If every resolved page failed (receiver briefly down/5xx/timeout) the alert still went to resolved and the next tick fell straight through to "ok" — the resolved notification was lost for good and the receiver kept the incident open. Firing and escalation were already hardened against this (audit_28 #4); the resolved branch was not. Now count only successful resolved deliveries and, if any fail, leave fired_at set and the state unchanged so the next evaluation tick re-attempts (at-least- once, consistent with the firing path). Happy path is unchanged. Regression test fails on the old code (fired_at cleared despite failure) and passes on the new; a success-path test pins that a delivered resolved page still advances the alert. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… after enqueue (audit_30 C2) dispatch_new_events added the event to seen_event_ids *before* the per-webhook loop, and the loop body (_enqueue_delivery -> deliver -> _record_outcome) had no per-webhook isolation. An exception while handling one webhook — e.g. an httpx.InvalidURL deliver() doesn't catch, or a duckdb.Error — unwound the whole method after the event was already seen, so webhooks not yet reached got no durable queue row and were never re-driven: permanently undelivered. Now each matching webhook is wrapped in try/except, the event is marked seen only once every matching webhook is durably enqueued (still runs for zero-webhook events, preserving the metric-cache-invalidation trigger), and _enqueue_delivery returns whether it inserted a new row so a re-scan (when another webhook's enqueue failed and left the event unseen) does not re-POST already-queued deliveries inline. Inline-delivery failures are swallowed — the durable row is re-driven by process_delivery_queue. Regression tests fail on the old code (_enqueue_delivery returned None; the second webhook was never enqueued) and pass on the new. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… query_text (audit_30 S1) The analytics middleware sits OUTSIDE AuthMiddleware in the stack, so it recorded a session — spawning a DB-writing thread and persisting the request body — for every /v1 request regardless of the auth result, including unauthenticated 401s and the unthrottled flood path (per-key and per-IP throttling both live inside AuthMiddleware). An attacker could drive un-throttled thread spawns and write an attacker-controlled ~1 MiB query_text per request: a remote DoS + storage amplification. Gate recording on a present request.state.tenant_key and a non-rejected status (skip 401/403/429/503), on both the success and the exception path, and truncate query_text to the same 1000-char bound /v1/query enforces (analytics runs before the route's Pydantic validation). Authenticated traffic is unaffected and stays rate-limited by AuthMiddleware; 401 visibility remains via the AUTH_FAILURES counter. Regression tests fail on the old code (unauthenticated request scheduled a write; oversized question persisted verbatim) and pass on the new; a positive test pins that an authenticated request still records. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lineage, slo and the SSE stream handlers ran their synchronous DuckDB scans inline in async handlers on the single event-loop thread. The hot paths (entity/metric/query) were already offloaded with run_in_threadpool and pinned by test_agent_query_async, but these were missed: a slow scan froze the loop for every tenant on the worker (noisy-neighbour), and concurrent requests serialized instead of overlapping. The SSE stream re-runs its scan once per second per open connection, so it blocked the loop on every tick. Offload each scan with run_in_threadpool and run it on a dedicated cursor (not the shared connection object) so concurrent requests on different worker threads don't collide. Kept query_engine._conn rather than db_pool.read_conn(): with the default DUCKDB_PATH=":memory:" the pool and the engine hold *separate* in-memory databases, so a pooled read connection would not see the engine's data. Regression test fails on the old code (4 concurrent /v1/lineage requests serialize to ~1.2s) and passes on the new (~0.3s, overlapped). deadletter's read handlers and the alert/webhook history reads share the same root cause and fix recipe and are tracked as a follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
GET /v1/metrics/{name} never validated the window. The engine maps an unknown
window to "1 hour" via WINDOW_MAP.get(window, "1 hour") (and active_sessions,
whose template is hard-coded to 30 minutes, ignores it entirely), but the
handler echoed the raw requested window back in the response and used it in the
cache key. So GET .../revenue?window=2h returned the 1-hour value labelled "2h",
and an arbitrary stream of bogus window strings polluted the metric cache. The
alerts router already validated the same field; the core read path did not.
Reject windows not in the metric's available_windows with 422, mirroring
alerts._validate_metric_request, so only declared windows are computed, echoed
and cached.
Regression test fails on the old code (window=2h returned 200) and passes on the
new (422 for revenue?window=2h and active_sessions?window=24h, 200 for a declared
window).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… (audit_30 D2) PiiMasker.mask matched rule fields against the result-row dict keys (the output column names), so a query that renamed a PII column — `SELECT email AS contact` or `SELECT lower(email) AS e` — returned the cleartext value under the alias and left pii_masked False (no X-PII-Masked header). Reachable in LLM mode, where the model controls the projection. Resolve projection lineage from the SELECT and mask every output column that *derives* from a rule's source field, not just one literally named after it. SELECT * (and unparseable SQL) keep the previous name-based matching, which is correct because the outputs are the canonical source names. Single-entity and batch masking pass no lineage and are unchanged. Regression tests fail on the old code (`email AS contact` / `lower(email) AS e` came back cleartext) and pass on the new; a SELECT * test pins the fallback. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
DORA Metrics
|
brownjuly2003-code
added a commit
that referenced
this pull request
Jun 30, 2026
…t_30 A1 follow-up) (#119) The A1 fix (#118) makes GET /v1/metrics reject a window the metric doesn't declare. The e2e support-agent journey and the support-agent example queried active_sessions — a point-in-time gauge whose only declared window is "now" — with window="1h", which the old code silently computed as a 30-minute count and echoed back as "1h". Post-merge that correctly returns 422, which turned E2E Tests + Staging Deploy red on main. Use window="now" in the e2e journey and the example, and extend the A1 unit test to pin that active_sessions?window=now returns 200. Co-authored-by: JuliaEdom <uedomskikh@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
brownjuly2003-code
added a commit
that referenced
this pull request
Jun 30, 2026
…IGH + 2 MEDIUM) (#123) * fix(security): reject WITH RECURSIVE CTE shadowing a tenant table (D1 bypass) The D1 fix (f153b23) re-scopes a physical table reference shadowed by a non-recursive CTE of the same name, but a recursive CTE *can* self- reference, so sqlglot keeps its name in its own body scope: the physical anchor reference (the first UNION branch, which cannot self-reference) is mis-classified as a CTE reference and never re-scoped — it stays bound to the shared `main` schema and leaks every tenant's rows. _scope_sql is the sole tenant-isolation mechanism (one DuckDB, schema-per-tenant), so this is a full cross-tenant read from a single valid SELECT. There is no safe re-scoping of a recursive anchor (genuinely ambiguous with the recursion) and no legitimate query names a recursive CTE after a physical table, so fail closed at both layers: validate_nl_sql rejects the shape (the NL/LLM gate) and _scope_sql raises for any other caller (defense-in-depth). Non-recursive shadows keep the safe re-scope path. Regression tests fail on old code (validate accepts; _scope_sql leaks), pass on new. Verified e2e: the recursive attack through execute_nl_query now returns 403; the legitimate tenant_a query still sees only its rows. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(security): mask PII renamed through a subquery/CTE (D2 lineage bypass) The D2 fix (faa6c77) mapped each output column to the source names in the *outermost* projection only, so a PII column renamed at an inner level — `SELECT contact FROM (SELECT email AS contact FROM users_enriched) t` (and the CTE/double-rename variants) — never matched the `email` rule and was returned as cleartext with no X-PII-Masked signal. Resolve true projection lineage with sqlglot.lineage, tracing each output column to its ultimate source columns across subqueries, CTEs and union branches. Union the deep lineage sources with the shallow columns named directly in the projection: lineage is blind through an inner `SELECT *` (no schema to expand it) where the shallow scan still catches a direct `email AS contact`, and the shallow scan misses the inner rename lineage catches — either alone leaks one shape, the union closes both. A lineage failure falls back to a sentinel source set that matches every rule field (fail closed), so an unresolvable column is masked, never leaked. Regression test covers subquery, CTE and double-rename renames; the full masking + property + router suites stay green. SELECT * still falls back to name-matching (its outputs are the source names verbatim). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(api): serialize lazy table DDL to stop cold-DB concurrent-create 500s The #120 read-handler offload moved deadletter/webhook-log/alert-history reads onto worker threads (run_in_threadpool) with a dedicated cursor, but each still calls its ensure_*_table() — a lazy CREATE TABLE IF NOT EXISTS / ALTER ... ADD COLUMN IF NOT EXISTS — on that cursor. Pre-#120 these ran serialized on the event loop; the offload let them run truly in parallel, and concurrent catalog DDL on one DuckDB raises "Catalog write-write conflict" (across different tables too, not just the same one — the catalog is a single versioned structure). The serving store defaults to :memory:, cold on every restart, so a concurrent burst of cold reads returned HTTP 500s until one request won the CREATE race. Serialize every ensure_*_table behind one shared process-wide lock (src/db_concurrency.catalog_ddl_lock): the first thread creates, the rest see a warm no-op (warm DDL doesn't conflict). The lock is held only for the brief CREATE/ALTER, never around queries and never nested. Also close the freshly-opened read cursor in deadletter._read_cursor if the DDL raises, before the handler's own try/finally takes over (no per-failure leak). Regression tests fire 32 threads at one ensure_* and 12 each across all three behind a Barrier: reliably raise without the lock (verified 16-30 conflicts), green with it. A shared-instance test pins the single lock so a future per-table lock can't reintroduce the cross-table race. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: JuliaEdom <uedomskikh@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Workstream C / R3 — fresh adversarial audit + fixes (
road-to-9.8.md)A fresh multi-agent adversarial audit of
main(5814712), run after theaudit_28 backlog (16 HIGH) was closed: 4 read-only dimension agents
(concurrency · data/SQL/DV2/masking/tenant · security · API/contract/resource) +
1 verifier that executed code to confirm or refute each candidate. It found
1 HIGH + 6 MEDIUM new defects. Each is fixed here as one thematic commit with
a regression test that fails on the old code and passes on the new.
Fixed
hid the physical inner reference from
_scope_sql, so a tenant query bound tothe shared
mainschema and returned every tenant's rows. Rescope physicalrefs via sqlglot scope resolution (genuine CTEs untouched). Reproduced
end-to-end.
alert even when delivery failed → the page was lost and the incident stayed
open. Re-attempt next tick (mirrors the firing/escalation hardening).
dispatch_new_eventsafter the event was marked seen → later webhooks werenever enqueued. Isolate each webhook; mark seen only after durable enqueue.
spawning a DB-writing thread and persisting an attacker-controlled body per
request. Record only after successful auth; cap
query_text.lineage/slo/ the SSE stream ran sync DuckDBscans inline, serializing all tenants on the worker. Offload to a threadpool on
a dedicated cursor. (deadletter read handlers + alert/webhook history reads:
same root cause and recipe — tracked as a follow-up.)
/v1/metricsaccepted anywindow, silentlycomputed 1h, and echoed the requested window. Validate against
available_windows→ 422 (mirrors the alerts router).SELECT email AS contactreturned cleartext. Mask by projection lineage.Verification
ruff+ruff formatclean (285 files);mypy --strictclean (100 files); webhook integration (no-Docker) 12 passed.audit_30_06_26.md(gitignored).Honest scope
config/security.yamlalready redactsX-Admin-Key/Cookie; no defect at HEAD..add()-onlyseen-sets (global webhook dedup — entangled with cache invalidation; per-SSE —
bounded by connection lifetime).
🤖 Generated with Claude Code