Skip to content

fix: close 15/16 HIGH audit findings (data correctness, security, durability, perf)#96

Merged
brownjuly2003-code merged 9 commits into
mainfrom
fix/audit-28-backlog
Jun 28, 2026
Merged

fix: close 15/16 HIGH audit findings (data correctness, security, durability, perf)#96
brownjuly2003-code merged 9 commits into
mainfrom
fix/audit-28-backlog

Conversation

@brownjuly2003-code

Copy link
Copy Markdown
Owner

What

A deep adversarial audit (5 serial dimension-auditors + a skeptical verifier) found 16 HIGH defects that the prior single-pass hygiene audit missed — concentrated in data correctness (DV2), multi-tenancy, delivery durability, and event-loop hot paths. The verifier confirmed 16/16 (0 refuted), 5 reproduced by executing project code.

This PR closes 15 of 16 HIGH + 1 medium, one theme per commit, each with a negative/regression test.

Fixes (commit → defect)

Defect Commit
SCD2 frozen — satellite hash_diff was a constant tag, so the live LISTEN/NOTIFY promotion silently dropped every UPDATE. Now hashes the descriptive columns + gates on the latest version. fix(dv2)
Cross-tenant read — NL→SQL guard checked only the leaf table name and _scope_sql skipped schema-qualified tables (victim.orders_v2 read another tenant). Guard now rejects any qualifier; _scope_sql force-rescopes. fix(security)
PII masking fail-open — multi-entity NL JOINs returned cleartext PII. Now masks the union of matched entities. fix(security)
SSRF — webhook/alert URLs had no egress validation. New guard rejects loopback/private/link-local/metadata at create + delivery (DNS-rebinding). fix(security)
Rate-limiter fail-open on Redis error → now fails closed to a local cap. Unbounded /v1/batch NL query → bounded LIMIT. fix(security)
active_sessions counted every session ever → time-bounded on started_at. order metrics status filters aligned. fix(metrics)
Per-request side-effects on the event loop — usage write (run_in_threadpool), audit log O(n²)→O(1), cache KEYSSCAN, search rebuild offloaded. perf
Outbox blocking Kafka flush(10) on the event loop → offloaded via asyncio.to_thread. fix(outbox)
Alerts advanced fired/escalation state regardless of delivery success → now gated on success (eval loop re-attempts). fix(alerts)

One audit finding (a suspected rate-limiter off-by-one) was checked and is not a bug; left unchanged.

Verification

  • ruff check + ruff format clean (273 files); mypy strict clean (100 files)
  • pytest tests/unit (no-Docker): 1124 passed, 1 skipped (+28 new tests)
  • Integration / e2e / chaos / Docker-gated suites were not runnable on the dev host (Windows, no Docker) — relying on CI here, which is part of why this is a PR.

Not in this PR (deliberately — need Docker verification / a migration decision)

  • Webhook durable re-drive — requires a durable per-delivery table + background loop (a new subsystem); seen_event_ids also drives cache invalidation, so a naive change would break it.
  • DV2 pg_vault_writer insert-on-change — consequence is satellite bloat (not loss); changes the shared writer contract, deserves its own PR + DV2 smoke.
  • Cross-loader hash-key unification — a data migration across seeds/loaders/MDM views; high risk, owner decision.

🤖 Generated with Claude Code

JuliaEdom and others added 8 commits June 28, 2026 18:57
…ry UPDATE

The PG OLTP->vault promotion (run live by the LISTEN/NOTIFY freshness listener)
computed satellite hash_diff as a constant per-entity tag md5(id || '|tag|v1').
The NOT EXISTS gate then matched the unchanged (hk, hash_diff) pair on every
re-promotion, so an order moving pending->shipped (or a corrected total_amount,
or a customer PII change) inserted nothing and was permanently invisible to
rv.sat_*, bv_order_canonical and the branch_pnl mart — contradicting spec.yaml
scd2:true and ADR-0005's CDC update contract.

hash_diff is now derived from the descriptive columns and the gate inserts a new
version only when it differs from the current (latest load_ts) version for the
hash key — correct SCD2 insert-on-change, still idempotent on no-change re-runs.
Adds a regression test that fails on the old constant tag.

Audit ref: audit_28_06_26.md #9 (headline data-correctness defect).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two reproduced defects in LLM-mode NL->SQL (audit_28_06_26.md #5, #6):

#5 cross-tenant read: validate_nl_sql checked only the leaf table name and
_scope_sql skipped already schema-qualified tables, so an LLM-emitted
'victim_schema.orders_v2' passed the allow-list AND was executed verbatim
against the victim tenant's schema (schemas are tenant slugs, guessable).
Fix: validate_nl_sql now rejects any table carrying a db/catalog qualifier
(primary); _scope_sql force-rescopes a known table into the caller's tenant
schema even if pre-qualified, instead of skipping it (defense-in-depth).

#6 PII masking fail-open: mask_query_results returned rows UNMASKED whenever a
query touched !=1 entity table, so a users_enriched JOIN orders_v2 leaked
cleartext email/phone/address. Fix: mask the union of all matched entities'
rules; only a query touching no entity at all returns unmasked.

Tests: schema-qualified inputs added to the guard's negative corpus; the
masking test that *asserted* the fail-open leak is rewritten to assert
fail-closed union masking; a _scope_sql foreign-schema re-scope test added.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Webhook and alert targets were validated only as AnyHttpUrl, so any tenant could
register http://169.254.169.254/, http://127.0.0.1:port or http://10.x and the
server would POST to that internal target, returning status/error as an SSRF
oracle (audit_28_06_26.md #2, confirmed by two audit passes).

New egress_guard.validate_public_url resolves the host and rejects any URL that
is not an http(s) target resolving exclusively to public unicast addresses
(loopback/private/link-local/reserved/multicast/unspecified all rejected).
Applied at registration time (POST /v1/webhooks, /v1/alerts, PUT alert -> 400)
and again immediately before each delivery (narrowing the DNS-rebinding window);
a delivery to a now-internal host is failed and logged, not fetched. Resolution
runs via asyncio.to_thread so it never blocks the event loop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ated NL query

#7 rate limiter fail-open: on any Redis error check() returned (True, limit, ...),
so a Redis outage (or an attacker who can degrade it) disabled per-tenant rate
limiting fleet-wide — a brute-force / DoS-amplification window on the expensive
NL->SQL and entity paths. It now falls back to the existing per-process sliding
window (refactored into _check_local) instead of fail-open. (audit #7)

#8 unbounded batch NL query: /v1/query paginates (LIMIT), but /v1/batch calls
execute_nl_query which executed the translated SQL with no row cap, so a batch
item like "SELECT * FROM orders_v2" (up to 20/request) could stream a whole
table into memory. execute_nl_query now wraps the validated SQL in a bounded
LIMIT (1000, the paginated max). nosec B608 pin bumped 2->3 with reason. (audit #8)

Note: the audit suspected a count-vs-limit off-by-one; checked and it is NOT a
bug — the Redis and local paths both admit exactly the configured number of
requests; left unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… filters

#11 active_sessions counted every session ever: the demo write path never sets
ended_at, so the metrics WHERE ended_at IS NULL OR ended_at >= NOW()-30min was
always true and the count grew monotonically. Re-anchored on started_at (active
= started in the last 30 min and not ended), so it is actually time-bounded.

#M4 order metrics were mutually inconsistent: revenue and avg_order_value filter
status != cancelled but order_count did not, so avg_order_value != revenue/
order_count. Aligned order_count to the same non-cancelled filter and corrected
the revenue/order_count descriptions (revenue was labelled completed orders but
includes pending/shipped).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…CAN not KEYS

Four hot-path scaling defects (audit_28_06_26.md #13/#14/#15/#16):

#13 record_usage ran a connection-per-request DuckDB write with a blocking
time.sleep retry inline in the async auth middleware, freezing the event loop on
every authenticated request. Offloaded via run_in_threadpool.

#14 the hash-chained audit publisher re-read the ENTIRE growing log file on every
publish to fetch only the last line — O(file) per request, O(n^2) over the log's
lifetime, on the event loop via #13. Now caches (last_hash, sequence) in memory
after a one-time tail read (it is the only writer, append-only under the lock).

#15 metric-cache invalidation used Redis KEYS metric:* (O(keyspace), blocks
single-threaded Redis for all clients) roughly every 2s under ingestion. Switched
to cursor-based SCAN.

#16 the search-index rebuild full-scanned and re-tokenized every entity table on
the event loop every 60s. Offloaded via run_in_threadpool.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
run_forever is an asyncio task but called the fully synchronous process_pending,
whose per-row producer.flush(10) blocked the event loop for up to 10s per pending
message against a slow/unreachable broker — freezing all HTTP/SSE traffic.
(audit_28_06_26.md #1)

Adds process_pending_async/_process_row_async used by run_forever: the DuckDB
read/mark-sent/schedule-retry stay on the loop (the connection may be shared with
the query engine, so it must not be touched from a worker thread), while the
blocking Kafka produce+flush is offloaded via asyncio.to_thread. The synchronous
process_pending/process_entry are kept for direct/test use.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
dispatch_alert set fired_at / last_escalation_level and returned regardless of
whether deliver() succeeded, so a timed-out level-1 page recorded the alert as
fired and the on-call was never notified — and a single-step alert would not
re-notify until cooldown (default 30 min). (audit_28_06_26.md #4)

State transitions are now gated on delivery success: on a failed page the fire
branch leaves fired_at=None and the escalation branch leaves last_escalation_level
unchanged, so the next evaluation tick (the existing periodic dispatch loop)
re-attempts delivery instead of going silent. Adds no-Docker unit tests for the
fire-success, fire-failure, and escalation-failure paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

DORA Metrics

  • Window: last 30 days
  • Branch: main
  • Deployment frequency: 186 total / 43.4 per week
  • Lead time for changes: avg 0.24h / median 0.0h
  • Change failure rate: 63.44% (118/186)
  • MTTR: 0.47h across 7 incident(s)

…ions)

CI integration (Docker, not runnable on the dev host) surfaced 26 failures from
this branch's own changes — exactly why this went through a PR:

- SSRF egress guard (#2) rejected the reserved .test hostnames the webhook/alert
  suites use (agent.test etc. do not resolve), failing create with 400 and
  blocking deliveries — which then cascaded into the alert-dedup suite via the
  #4 success-gating. Added an autouse integration fixture that resolves only
  .test names to a public IP, so the guard stays active (real loopback/private
  rejection intact; logic unit-tested in test_egress_guard) while the mocked-
  httpx delivery paths remain exercisable.
- outbox ProcessorStub now implements process_pending_async (run_forever calls
  the async variant after #1).
- clickhouse live order_count expectation 8 -> 7 (now excludes the 1 cancelled
  seed order, #M4).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@brownjuly2003-code brownjuly2003-code merged commit 28025a5 into main Jun 28, 2026
22 checks passed
@brownjuly2003-code brownjuly2003-code deleted the fix/audit-28-backlog branch June 28, 2026 16:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants