Skip to content

fix: close 3 defects from a fresh adversarial audit of #118–#122 (1 HIGH + 2 MEDIUM)#123

Merged
brownjuly2003-code merged 3 commits into
mainfrom
fix/audit-recursive-cte-pii-ddl-race
Jun 30, 2026
Merged

fix: close 3 defects from a fresh adversarial audit of #118–#122 (1 HIGH + 2 MEDIUM)#123
brownjuly2003-code merged 3 commits into
mainfrom
fix/audit-recursive-cte-pii-ddl-race

Conversation

@brownjuly2003-code

Copy link
Copy Markdown
Owner

A fresh adversarial audit of the code merged after the last audit (5814712),
covering PRs #118#122 (the 7 audit_30 fixes, the deadletter offload #120, the
Flink real-path work #122), found 1 HIGH + 2 MEDIUM new defects — two of them
bypasses of audit_30 fixes, one a regression introduced by the #120 offload.
Each is confirmed by executing the real shipped code and fixed with a regression
test that fails on the old code and passes on the new.

#1 — HIGH — WITH RECURSIVE bypass of the D1 cross-tenant fix

912e562 · sql_builder.py + sql_guard.py

The D1 fix (f153b23) re-scopes a physical table shadowed by a non-recursive
CTE, but a recursive CTE can self-reference, so sqlglot keeps its name in its own
body scope: the physical anchor reference (first UNION branch, which cannot
self-reference) is mis-classified as a CTE reference and never re-scoped — it binds
to the shared main schema and leaks every tenant's rows from a single valid
SELECT. Fixed fail-closed at both layers (validate_nl_sql rejects; _scope_sql
raises for non-NL callers). Verified e2e: the recursive attack through
execute_nl_query now returns 403; the legitimate tenant query still sees only its
own rows.

#2 — MEDIUM — subquery/CTE-alias bypass of the D2 PII-masking fix

80e31e9 · masking.py

The D2 fix (faa6c77) resolved projection lineage one level deep, so a PII column
renamed at an inner level — SELECT contact FROM (SELECT email AS contact FROM users_enriched) t — returned cleartext. Now resolves true lineage via
sqlglot.lineage across subqueries/CTEs/unions, unioned with the shallow direct
columns (lineage is blind through an inner SELECT * where the shallow scan still
catches a direct email AS contact). Lineage failure falls back to a fail-closed
sentinel.

#3 — MEDIUM — concurrent-DDL race introduced by the #120 read-handler offload

601c263 · db_concurrency.py (new) + event_replayer.py + alerts/history.py + webhook_dispatcher.py + deadletter.py

#120 offloaded deadletter/webhook/alert reads to worker threads, but each still
calls its lazy ensure_*_table() (CREATE/ALTER) on the cursor. Concurrent catalog
DDL on one DuckDB raises Catalog write-write conflict (across different tables
too), so on a cold DB (:memory:, cold on every restart) a burst of cold reads
returned HTTP 500s. Serialized every ensure_*_table behind one shared
process-wide lock; also closes the read cursor if the DDL raises. Regression tests
fire 32 threads behind a Barrier (reliably conflict without the lock, green with).

Verification

  • ruff check + ruff format --check clean (286 files)
  • mypy --strict clean (101 source files)
  • 1508 passed, 1 skipped (tests/unit, SKIP_DOCKER_TESTS=1); +8 new regression tests
  • Each fix re-verified against the auditor's own execution repros

🤖 Generated with Claude Code

JuliaEdom and others added 3 commits June 30, 2026 19:10
… bypass)

The D1 fix (f153b23) re-scopes a physical table reference shadowed by a
non-recursive CTE of the same name, but a recursive CTE *can* self-
reference, so sqlglot keeps its name in its own body scope: the physical
anchor reference (the first UNION branch, which cannot self-reference) is
mis-classified as a CTE reference and never re-scoped — it stays bound to
the shared `main` schema and leaks every tenant's rows. _scope_sql is the
sole tenant-isolation mechanism (one DuckDB, schema-per-tenant), so this
is a full cross-tenant read from a single valid SELECT.

There is no safe re-scoping of a recursive anchor (genuinely ambiguous
with the recursion) and no legitimate query names a recursive CTE after a
physical table, so fail closed at both layers: validate_nl_sql rejects the
shape (the NL/LLM gate) and _scope_sql raises for any other caller
(defense-in-depth). Non-recursive shadows keep the safe re-scope path.

Regression tests fail on old code (validate accepts; _scope_sql leaks),
pass on new. Verified e2e: the recursive attack through execute_nl_query
now returns 403; the legitimate tenant_a query still sees only its rows.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…pass)

The D2 fix (faa6c77) mapped each output column to the source names in the
*outermost* projection only, so a PII column renamed at an inner level —
`SELECT contact FROM (SELECT email AS contact FROM users_enriched) t` (and
the CTE/double-rename variants) — never matched the `email` rule and was
returned as cleartext with no X-PII-Masked signal.

Resolve true projection lineage with sqlglot.lineage, tracing each output
column to its ultimate source columns across subqueries, CTEs and union
branches. Union the deep lineage sources with the shallow columns named
directly in the projection: lineage is blind through an inner `SELECT *`
(no schema to expand it) where the shallow scan still catches a direct
`email AS contact`, and the shallow scan misses the inner rename lineage
catches — either alone leaks one shape, the union closes both. A lineage
failure falls back to a sentinel source set that matches every rule field
(fail closed), so an unresolvable column is masked, never leaked.

Regression test covers subquery, CTE and double-rename renames; the full
masking + property + router suites stay green. SELECT * still falls back to
name-matching (its outputs are the source names verbatim).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…500s

The #120 read-handler offload moved deadletter/webhook-log/alert-history
reads onto worker threads (run_in_threadpool) with a dedicated cursor, but
each still calls its ensure_*_table() — a lazy CREATE TABLE IF NOT EXISTS /
ALTER ... ADD COLUMN IF NOT EXISTS — on that cursor. Pre-#120 these ran
serialized on the event loop; the offload let them run truly in parallel,
and concurrent catalog DDL on one DuckDB raises "Catalog write-write
conflict" (across different tables too, not just the same one — the catalog
is a single versioned structure). The serving store defaults to :memory:,
cold on every restart, so a concurrent burst of cold reads returned HTTP
500s until one request won the CREATE race.

Serialize every ensure_*_table behind one shared process-wide lock
(src/db_concurrency.catalog_ddl_lock): the first thread creates, the rest
see a warm no-op (warm DDL doesn't conflict). The lock is held only for the
brief CREATE/ALTER, never around queries and never nested. Also close the
freshly-opened read cursor in deadletter._read_cursor if the DDL raises,
before the handler's own try/finally takes over (no per-failure leak).

Regression tests fire 32 threads at one ensure_* and 12 each across all
three behind a Barrier: reliably raise without the lock (verified 16-30
conflicts), green with it. A shared-instance test pins the single lock so a
future per-table lock can't reintroduce the cross-table race.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

DORA Metrics

  • Window: last 30 days
  • Branch: main
  • Deployment frequency: 134 total / 31.27 per week
  • Lead time for changes: avg 0.32h / median 0.0h
  • Change failure rate: 77.61% (104/134)
  • MTTR: 0.25h across 3 incident(s)

@brownjuly2003-code brownjuly2003-code merged commit 74e76a6 into main Jun 30, 2026
23 checks passed
@brownjuly2003-code brownjuly2003-code deleted the fix/audit-recursive-cte-pii-ddl-race branch June 30, 2026 17:20
brownjuly2003-code added a commit that referenced this pull request Jun 30, 2026
…eak + 1 LOW DDL race) (#125)

* fix(security): mask PII renamed above an inner SELECT * (D2 #123 residual)

The #123 D2 fix resolves projection lineage so a PII column renamed through a
subquery/CTE is masked by what it is built from. But when an inner `SELECT *`
sits *below* the rename — e.g.
`SELECT c FROM (SELECT email AS c FROM (SELECT * FROM users_enriched) z) t` —
sqlglot.lineage walks past the renamed `email` node to the bare `*` leaf and
returns a plain `frozenset({'*'})`. That is NOT the `_UnresolvedSources`
sentinel (which only fires on a lineage *exception*), so `email` is absent from
the source set and the column fails **open** as cleartext with no X-PII-Masked
signal; the shallow scan sees only the outer alias `c`. The #123 deep|shallow
union only closes a star one level *above* the rename.

A `*` lineage leaf means the column could carry any source column of that
table, including PII, so treat it as unresolved and fail closed (mask) — the
same policy the module already applies on a lineage exception.

Regression test (subquery and CTE forms) fails on old code (cleartext, was
masked=False) and passes on new; the existing masking/property/mutation suites
stay green. Independently reproduced against live DuckDB before and after.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(api): lock the webhook delivery-queue lazy DDL (#123 residual)

The #123 lock fix serialized the three offloaded read-handler `ensure_*_table`
helpers behind `catalog_ddl_lock`, but missed `ensure_webhook_delivery_queue_
table`: the dispatcher runs its lazy `CREATE TABLE IF NOT EXISTS webhook_
delivery_queue` on the shared serving connection from the event loop, while an
offloaded read handler runs its own (now-locked) `ensure_*` on a worker thread.
Concurrent catalog DDL on one cold DuckDB raises a "Catalog write-write
conflict" across *different* tables too, so the cross-table 500-on-cold-restart
the #123 fix set out to remove was still reachable through this unlocked site.

Wrap its CREATE in the same shared `catalog_ddl_lock` as its three siblings
(internal-wrap pattern; the two callers do not hold the lock, so no nesting).

Regression: add `ensure_webhook_delivery_queue_table` to the concurrency
harness's ensurer set so both the 32-thread same-table and the cross-table
Barrier hammers cover it — both fail on old code (verified: 23+ and 11+
conflicts on the queue table) and pass with the lock.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: JuliaEdom <uedomskikh@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants