fix: close 15/16 HIGH audit findings (data correctness, security, durability, perf)#96
Merged
Merged
Conversation
…ry UPDATE The PG OLTP->vault promotion (run live by the LISTEN/NOTIFY freshness listener) computed satellite hash_diff as a constant per-entity tag md5(id || '|tag|v1'). The NOT EXISTS gate then matched the unchanged (hk, hash_diff) pair on every re-promotion, so an order moving pending->shipped (or a corrected total_amount, or a customer PII change) inserted nothing and was permanently invisible to rv.sat_*, bv_order_canonical and the branch_pnl mart — contradicting spec.yaml scd2:true and ADR-0005's CDC update contract. hash_diff is now derived from the descriptive columns and the gate inserts a new version only when it differs from the current (latest load_ts) version for the hash key — correct SCD2 insert-on-change, still idempotent on no-change re-runs. Adds a regression test that fails on the old constant tag. Audit ref: audit_28_06_26.md #9 (headline data-correctness defect). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two reproduced defects in LLM-mode NL->SQL (audit_28_06_26.md #5, #6): #5 cross-tenant read: validate_nl_sql checked only the leaf table name and _scope_sql skipped already schema-qualified tables, so an LLM-emitted 'victim_schema.orders_v2' passed the allow-list AND was executed verbatim against the victim tenant's schema (schemas are tenant slugs, guessable). Fix: validate_nl_sql now rejects any table carrying a db/catalog qualifier (primary); _scope_sql force-rescopes a known table into the caller's tenant schema even if pre-qualified, instead of skipping it (defense-in-depth). #6 PII masking fail-open: mask_query_results returned rows UNMASKED whenever a query touched !=1 entity table, so a users_enriched JOIN orders_v2 leaked cleartext email/phone/address. Fix: mask the union of all matched entities' rules; only a query touching no entity at all returns unmasked. Tests: schema-qualified inputs added to the guard's negative corpus; the masking test that *asserted* the fail-open leak is rewritten to assert fail-closed union masking; a _scope_sql foreign-schema re-scope test added. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Webhook and alert targets were validated only as AnyHttpUrl, so any tenant could register http://169.254.169.254/, http://127.0.0.1:port or http://10.x and the server would POST to that internal target, returning status/error as an SSRF oracle (audit_28_06_26.md #2, confirmed by two audit passes). New egress_guard.validate_public_url resolves the host and rejects any URL that is not an http(s) target resolving exclusively to public unicast addresses (loopback/private/link-local/reserved/multicast/unspecified all rejected). Applied at registration time (POST /v1/webhooks, /v1/alerts, PUT alert -> 400) and again immediately before each delivery (narrowing the DNS-rebinding window); a delivery to a now-internal host is failed and logged, not fetched. Resolution runs via asyncio.to_thread so it never blocks the event loop. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ated NL query #7 rate limiter fail-open: on any Redis error check() returned (True, limit, ...), so a Redis outage (or an attacker who can degrade it) disabled per-tenant rate limiting fleet-wide — a brute-force / DoS-amplification window on the expensive NL->SQL and entity paths. It now falls back to the existing per-process sliding window (refactored into _check_local) instead of fail-open. (audit #7) #8 unbounded batch NL query: /v1/query paginates (LIMIT), but /v1/batch calls execute_nl_query which executed the translated SQL with no row cap, so a batch item like "SELECT * FROM orders_v2" (up to 20/request) could stream a whole table into memory. execute_nl_query now wraps the validated SQL in a bounded LIMIT (1000, the paginated max). nosec B608 pin bumped 2->3 with reason. (audit #8) Note: the audit suspected a count-vs-limit off-by-one; checked and it is NOT a bug — the Redis and local paths both admit exactly the configured number of requests; left unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… filters #11 active_sessions counted every session ever: the demo write path never sets ended_at, so the metrics WHERE ended_at IS NULL OR ended_at >= NOW()-30min was always true and the count grew monotonically. Re-anchored on started_at (active = started in the last 30 min and not ended), so it is actually time-bounded. #M4 order metrics were mutually inconsistent: revenue and avg_order_value filter status != cancelled but order_count did not, so avg_order_value != revenue/ order_count. Aligned order_count to the same non-cancelled filter and corrected the revenue/order_count descriptions (revenue was labelled completed orders but includes pending/shipped). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…CAN not KEYS Four hot-path scaling defects (audit_28_06_26.md #13/#14/#15/#16): #13 record_usage ran a connection-per-request DuckDB write with a blocking time.sleep retry inline in the async auth middleware, freezing the event loop on every authenticated request. Offloaded via run_in_threadpool. #14 the hash-chained audit publisher re-read the ENTIRE growing log file on every publish to fetch only the last line — O(file) per request, O(n^2) over the log's lifetime, on the event loop via #13. Now caches (last_hash, sequence) in memory after a one-time tail read (it is the only writer, append-only under the lock). #15 metric-cache invalidation used Redis KEYS metric:* (O(keyspace), blocks single-threaded Redis for all clients) roughly every 2s under ingestion. Switched to cursor-based SCAN. #16 the search-index rebuild full-scanned and re-tokenized every entity table on the event loop every 60s. Offloaded via run_in_threadpool. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
run_forever is an asyncio task but called the fully synchronous process_pending, whose per-row producer.flush(10) blocked the event loop for up to 10s per pending message against a slow/unreachable broker — freezing all HTTP/SSE traffic. (audit_28_06_26.md #1) Adds process_pending_async/_process_row_async used by run_forever: the DuckDB read/mark-sent/schedule-retry stay on the loop (the connection may be shared with the query engine, so it must not be touched from a worker thread), while the blocking Kafka produce+flush is offloaded via asyncio.to_thread. The synchronous process_pending/process_entry are kept for direct/test use. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
dispatch_alert set fired_at / last_escalation_level and returned regardless of whether deliver() succeeded, so a timed-out level-1 page recorded the alert as fired and the on-call was never notified — and a single-step alert would not re-notify until cooldown (default 30 min). (audit_28_06_26.md #4) State transitions are now gated on delivery success: on a failed page the fire branch leaves fired_at=None and the escalation branch leaves last_escalation_level unchanged, so the next evaluation tick (the existing periodic dispatch loop) re-attempts delivery instead of going silent. Adds no-Docker unit tests for the fire-success, fire-failure, and escalation-failure paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
DORA Metrics
|
…ions) CI integration (Docker, not runnable on the dev host) surfaced 26 failures from this branch's own changes — exactly why this went through a PR: - SSRF egress guard (#2) rejected the reserved .test hostnames the webhook/alert suites use (agent.test etc. do not resolve), failing create with 400 and blocking deliveries — which then cascaded into the alert-dedup suite via the #4 success-gating. Added an autouse integration fixture that resolves only .test names to a public IP, so the guard stays active (real loopback/private rejection intact; logic unit-tested in test_egress_guard) while the mocked- httpx delivery paths remain exercisable. - outbox ProcessorStub now implements process_pending_async (run_forever calls the async variant after #1). - clickhouse live order_count expectation 8 -> 7 (now excludes the 1 cancelled seed order, #M4). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
A deep adversarial audit (5 serial dimension-auditors + a skeptical verifier) found 16 HIGH defects that the prior single-pass hygiene audit missed — concentrated in data correctness (DV2), multi-tenancy, delivery durability, and event-loop hot paths. The verifier confirmed 16/16 (0 refuted), 5 reproduced by executing project code.
This PR closes 15 of 16 HIGH + 1 medium, one theme per commit, each with a negative/regression test.
Fixes (commit → defect)
hash_diffwas a constant tag, so the live LISTEN/NOTIFY promotion silently dropped every UPDATE. Now hashes the descriptive columns + gates on the latest version.fix(dv2)_scope_sqlskipped schema-qualified tables (victim.orders_v2read another tenant). Guard now rejects any qualifier;_scope_sqlforce-rescopes.fix(security)fix(security)fix(security)/v1/batchNL query → bounded LIMIT.fix(security)started_at. order metrics status filters aligned.fix(metrics)run_in_threadpool), audit log O(n²)→O(1), cacheKEYS→SCAN, search rebuild offloaded.perfflush(10)on the event loop → offloaded viaasyncio.to_thread.fix(outbox)fix(alerts)One audit finding (a suspected rate-limiter off-by-one) was checked and is not a bug; left unchanged.
Verification
ruff check+ruff formatclean (273 files);mypystrict clean (100 files)pytest tests/unit(no-Docker): 1124 passed, 1 skipped (+28 new tests)Not in this PR (deliberately — need Docker verification / a migration decision)
seen_event_idsalso drives cache invalidation, so a naive change would break it.pg_vault_writerinsert-on-change — consequence is satellite bloat (not loss); changes the shared writer contract, deserves its own PR + DV2 smoke.🤖 Generated with Claude Code