Skip to content

fix(api): offload deadletter/alert/webhook history scans off the event loop (audit_30 A2 follow-up)#120

Merged
brownjuly2003-code merged 1 commit into
mainfrom
fix/a2-deadletter-history-offload
Jun 30, 2026
Merged

fix(api): offload deadletter/alert/webhook history scans off the event loop (audit_30 A2 follow-up)#120
brownjuly2003-code merged 1 commit into
mainfrom
fix/a2-deadletter-history-offload

Conversation

@brownjuly2003-code

Copy link
Copy Markdown
Owner

What

Completes the audit_30 A2 event-loop-offload work. The original A2 fix (046c38d) offloaded the lineage / slo / SSE-stream handlers but scoped these as a follow-up:

  • deadletter read handlers — GET /v1/deadletter/stats, GET /v1/deadletter (list), GET /v1/deadletter/{event_id}
  • alert historyGET /v1/alerts/{alert_id}/history
  • webhook logsGET /v1/webhooks/{webhook_id}/logs

Each still ran its synchronous DuckDB scan inline on the single event-loop thread, so a slow scan froze the loop for every tenant on the worker (noisy-neighbour) and concurrent requests serialized instead of overlapping.

How

Same recipe as the original A2 fix:

  • Offload each scan with run_in_threadpool.
  • Run it on a dedicated cursor (query_engine._conn.cursor()), not the shared connection object, so concurrent requests on different worker threads don't collide on the connection; the cursor is closed in a finally.
  • Kept query_engine._conn rather than db_pool.read_conn() — with the default DUCKDB_PATH=":memory:" the pool and the engine hold separate in-memory databases, so a pooled read connection wouldn't see the engine's data.

The deadletter write handlers (replay / dismiss) keep using the shared _conn and are out of scope.

Verification

  • New regression tests in test_blocking_routes_offloaded.py drive 4 concurrent requests against each endpoint: they fail on the old code (serialize to ~1.2 s) and pass on the new (~0.3 s, overlapped). Counterfactual confirmed by stashing only the src fix and re-running (3 fail, lineage still passes).
  • unit 1501 passed, ruff check + ruff format --check clean, mypy strict clean (3 src files).

Part of the road-to-9.8 program (Workstream C / R3 follow-up).

🤖 Generated with Claude Code

…t loop (audit_30 A2 follow-up)

The A2 fix (046c38d) offloaded lineage/slo/stream but scoped these as a
follow-up: the deadletter read handlers (stats/list/detail) and the
alert-history and webhook-log reads still ran their synchronous DuckDB scans
inline on the single event-loop thread, so a slow scan froze the loop for every
tenant on the worker and concurrent requests serialized instead of overlapping.

Offload each scan with run_in_threadpool on a dedicated cursor (not the shared
connection object) so concurrent requests on different worker threads don't
collide. Kept query_engine._conn rather than db_pool.read_conn(): with the
default DUCKDB_PATH=":memory:" the pool and the engine hold separate in-memory
databases, so a pooled read connection would not see the engine's data — same
recipe as the original A2 fix. The deadletter write handlers (replay/dismiss)
keep using the shared _conn and are out of scope.

Regression tests in test_blocking_routes_offloaded.py drive four concurrent
requests each: they fail on the old code (serialize to ~1.2s) and pass on the
new (~0.3s, overlapped), counterfactual-checked by stashing only the src fix.
Updated the alerts-router unit stub to yield a cursor.

Verified: unit 1501 passed, ruff + format clean, mypy strict clean (3 files).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

DORA Metrics

  • Window: last 30 days
  • Branch: main
  • Deployment frequency: 131 total / 30.57 per week
  • Lead time for changes: avg 0.32h / median 0.0h
  • Change failure rate: 78.63% (103/131)
  • MTTR: 0.25h across 3 incident(s)

@brownjuly2003-code brownjuly2003-code merged commit 619ad6e into main Jun 30, 2026
28 of 30 checks passed
@brownjuly2003-code brownjuly2003-code deleted the fix/a2-deadletter-history-offload branch June 30, 2026 10:17
brownjuly2003-code added a commit that referenced this pull request Jun 30, 2026
…IGH + 2 MEDIUM) (#123)

* fix(security): reject WITH RECURSIVE CTE shadowing a tenant table (D1 bypass)

The D1 fix (f153b23) re-scopes a physical table reference shadowed by a
non-recursive CTE of the same name, but a recursive CTE *can* self-
reference, so sqlglot keeps its name in its own body scope: the physical
anchor reference (the first UNION branch, which cannot self-reference) is
mis-classified as a CTE reference and never re-scoped — it stays bound to
the shared `main` schema and leaks every tenant's rows. _scope_sql is the
sole tenant-isolation mechanism (one DuckDB, schema-per-tenant), so this
is a full cross-tenant read from a single valid SELECT.

There is no safe re-scoping of a recursive anchor (genuinely ambiguous
with the recursion) and no legitimate query names a recursive CTE after a
physical table, so fail closed at both layers: validate_nl_sql rejects the
shape (the NL/LLM gate) and _scope_sql raises for any other caller
(defense-in-depth). Non-recursive shadows keep the safe re-scope path.

Regression tests fail on old code (validate accepts; _scope_sql leaks),
pass on new. Verified e2e: the recursive attack through execute_nl_query
now returns 403; the legitimate tenant_a query still sees only its rows.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(security): mask PII renamed through a subquery/CTE (D2 lineage bypass)

The D2 fix (faa6c77) mapped each output column to the source names in the
*outermost* projection only, so a PII column renamed at an inner level —
`SELECT contact FROM (SELECT email AS contact FROM users_enriched) t` (and
the CTE/double-rename variants) — never matched the `email` rule and was
returned as cleartext with no X-PII-Masked signal.

Resolve true projection lineage with sqlglot.lineage, tracing each output
column to its ultimate source columns across subqueries, CTEs and union
branches. Union the deep lineage sources with the shallow columns named
directly in the projection: lineage is blind through an inner `SELECT *`
(no schema to expand it) where the shallow scan still catches a direct
`email AS contact`, and the shallow scan misses the inner rename lineage
catches — either alone leaks one shape, the union closes both. A lineage
failure falls back to a sentinel source set that matches every rule field
(fail closed), so an unresolvable column is masked, never leaked.

Regression test covers subquery, CTE and double-rename renames; the full
masking + property + router suites stay green. SELECT * still falls back to
name-matching (its outputs are the source names verbatim).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(api): serialize lazy table DDL to stop cold-DB concurrent-create 500s

The #120 read-handler offload moved deadletter/webhook-log/alert-history
reads onto worker threads (run_in_threadpool) with a dedicated cursor, but
each still calls its ensure_*_table() — a lazy CREATE TABLE IF NOT EXISTS /
ALTER ... ADD COLUMN IF NOT EXISTS — on that cursor. Pre-#120 these ran
serialized on the event loop; the offload let them run truly in parallel,
and concurrent catalog DDL on one DuckDB raises "Catalog write-write
conflict" (across different tables too, not just the same one — the catalog
is a single versioned structure). The serving store defaults to :memory:,
cold on every restart, so a concurrent burst of cold reads returned HTTP
500s until one request won the CREATE race.

Serialize every ensure_*_table behind one shared process-wide lock
(src/db_concurrency.catalog_ddl_lock): the first thread creates, the rest
see a warm no-op (warm DDL doesn't conflict). The lock is held only for the
brief CREATE/ALTER, never around queries and never nested. Also close the
freshly-opened read cursor in deadletter._read_cursor if the DDL raises,
before the handler's own try/finally takes over (no per-failure leak).

Regression tests fire 32 threads at one ensure_* and 12 each across all
three behind a Barrier: reliably raise without the lock (verified 16-30
conflicts), green with it. A shared-instance test pins the single lock so a
future per-table lock can't reintroduce the cross-table race.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: JuliaEdom <uedomskikh@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants