fix: close 15/16 HIGH audit findings (data correctness, security, durability, perf) by brownjuly2003-code · Pull Request #96 · brownjuly2003-code/agentflow

brownjuly2003-code · 2026-06-28T16:41:55Z

What

A deep adversarial audit (5 serial dimension-auditors + a skeptical verifier) found 16 HIGH defects that the prior single-pass hygiene audit missed — concentrated in data correctness (DV2), multi-tenancy, delivery durability, and event-loop hot paths. The verifier confirmed 16/16 (0 refuted), 5 reproduced by executing project code.

This PR closes 15 of 16 HIGH + 1 medium, one theme per commit, each with a negative/regression test.

Fixes (commit → defect)

Defect	Commit
SCD2 frozen — satellite `hash_diff` was a constant tag, so the live LISTEN/NOTIFY promotion silently dropped every UPDATE. Now hashes the descriptive columns + gates on the latest version.	`fix(dv2)`
Cross-tenant read — NL→SQL guard checked only the leaf table name and `_scope_sql` skipped schema-qualified tables (`victim.orders_v2` read another tenant). Guard now rejects any qualifier; `_scope_sql` force-rescopes.	`fix(security)`
PII masking fail-open — multi-entity NL JOINs returned cleartext PII. Now masks the union of matched entities.	`fix(security)`
SSRF — webhook/alert URLs had no egress validation. New guard rejects loopback/private/link-local/metadata at create + delivery (DNS-rebinding).	`fix(security)`
Rate-limiter fail-open on Redis error → now fails closed to a local cap. Unbounded `/v1/batch` NL query → bounded LIMIT.	`fix(security)`
active_sessions counted every session ever → time-bounded on `started_at`. order metrics status filters aligned.	`fix(metrics)`
Per-request side-effects on the event loop — usage write (`run_in_threadpool`), audit log O(n²)→O(1), cache `KEYS`→`SCAN`, search rebuild offloaded.	`perf`
Outbox blocking Kafka `flush(10)` on the event loop → offloaded via `asyncio.to_thread`.	`fix(outbox)`
Alerts advanced fired/escalation state regardless of delivery success → now gated on success (eval loop re-attempts).	`fix(alerts)`

One audit finding (a suspected rate-limiter off-by-one) was checked and is not a bug; left unchanged.

Verification

ruff check + ruff format clean (273 files); mypy strict clean (100 files)
pytest tests/unit (no-Docker): 1124 passed, 1 skipped (+28 new tests)
Integration / e2e / chaos / Docker-gated suites were not runnable on the dev host (Windows, no Docker) — relying on CI here, which is part of why this is a PR.

Not in this PR (deliberately — need Docker verification / a migration decision)

Webhook durable re-drive — requires a durable per-delivery table + background loop (a new subsystem); seen_event_ids also drives cache invalidation, so a naive change would break it.
DV2 pg_vault_writer insert-on-change — consequence is satellite bloat (not loss); changes the shared writer contract, deserves its own PR + DV2 smoke.
Cross-loader hash-key unification — a data migration across seeds/loaders/MDM views; high risk, owner decision.

🤖 Generated with Claude Code

…ry UPDATE The PG OLTP->vault promotion (run live by the LISTEN/NOTIFY freshness listener) computed satellite hash_diff as a constant per-entity tag md5(id || '|tag|v1'). The NOT EXISTS gate then matched the unchanged (hk, hash_diff) pair on every re-promotion, so an order moving pending->shipped (or a corrected total_amount, or a customer PII change) inserted nothing and was permanently invisible to rv.sat_*, bv_order_canonical and the branch_pnl mart — contradicting spec.yaml scd2:true and ADR-0005's CDC update contract. hash_diff is now derived from the descriptive columns and the gate inserts a new version only when it differs from the current (latest load_ts) version for the hash key — correct SCD2 insert-on-change, still idempotent on no-change re-runs. Adds a regression test that fails on the old constant tag. Audit ref: audit_28_06_26.md #9 (headline data-correctness defect). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Two reproduced defects in LLM-mode NL->SQL (audit_28_06_26.md #5, #6): #5 cross-tenant read: validate_nl_sql checked only the leaf table name and _scope_sql skipped already schema-qualified tables, so an LLM-emitted 'victim_schema.orders_v2' passed the allow-list AND was executed verbatim against the victim tenant's schema (schemas are tenant slugs, guessable). Fix: validate_nl_sql now rejects any table carrying a db/catalog qualifier (primary); _scope_sql force-rescopes a known table into the caller's tenant schema even if pre-qualified, instead of skipping it (defense-in-depth). #6 PII masking fail-open: mask_query_results returned rows UNMASKED whenever a query touched !=1 entity table, so a users_enriched JOIN orders_v2 leaked cleartext email/phone/address. Fix: mask the union of all matched entities' rules; only a query touching no entity at all returns unmasked. Tests: schema-qualified inputs added to the guard's negative corpus; the masking test that *asserted* the fail-open leak is rewritten to assert fail-closed union masking; a _scope_sql foreign-schema re-scope test added. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Webhook and alert targets were validated only as AnyHttpUrl, so any tenant could register http://169.254.169.254/, http://127.0.0.1:port or http://10.x and the server would POST to that internal target, returning status/error as an SSRF oracle (audit_28_06_26.md #2, confirmed by two audit passes). New egress_guard.validate_public_url resolves the host and rejects any URL that is not an http(s) target resolving exclusively to public unicast addresses (loopback/private/link-local/reserved/multicast/unspecified all rejected). Applied at registration time (POST /v1/webhooks, /v1/alerts, PUT alert -> 400) and again immediately before each delivery (narrowing the DNS-rebinding window); a delivery to a now-internal host is failed and logged, not fetched. Resolution runs via asyncio.to_thread so it never blocks the event loop. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ated NL query #7 rate limiter fail-open: on any Redis error check() returned (True, limit, ...), so a Redis outage (or an attacker who can degrade it) disabled per-tenant rate limiting fleet-wide — a brute-force / DoS-amplification window on the expensive NL->SQL and entity paths. It now falls back to the existing per-process sliding window (refactored into _check_local) instead of fail-open. (audit #7) #8 unbounded batch NL query: /v1/query paginates (LIMIT), but /v1/batch calls execute_nl_query which executed the translated SQL with no row cap, so a batch item like "SELECT * FROM orders_v2" (up to 20/request) could stream a whole table into memory. execute_nl_query now wraps the validated SQL in a bounded LIMIT (1000, the paginated max). nosec B608 pin bumped 2->3 with reason. (audit #8) Note: the audit suspected a count-vs-limit off-by-one; checked and it is NOT a bug — the Redis and local paths both admit exactly the configured number of requests; left unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… filters #11 active_sessions counted every session ever: the demo write path never sets ended_at, so the metrics WHERE ended_at IS NULL OR ended_at >= NOW()-30min was always true and the count grew monotonically. Re-anchored on started_at (active = started in the last 30 min and not ended), so it is actually time-bounded. #M4 order metrics were mutually inconsistent: revenue and avg_order_value filter status != cancelled but order_count did not, so avg_order_value != revenue/ order_count. Aligned order_count to the same non-cancelled filter and corrected the revenue/order_count descriptions (revenue was labelled completed orders but includes pending/shipped). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…CAN not KEYS Four hot-path scaling defects (audit_28_06_26.md #13/#14/#15/#16): #13 record_usage ran a connection-per-request DuckDB write with a blocking time.sleep retry inline in the async auth middleware, freezing the event loop on every authenticated request. Offloaded via run_in_threadpool. #14 the hash-chained audit publisher re-read the ENTIRE growing log file on every publish to fetch only the last line — O(file) per request, O(n^2) over the log's lifetime, on the event loop via #13. Now caches (last_hash, sequence) in memory after a one-time tail read (it is the only writer, append-only under the lock). #15 metric-cache invalidation used Redis KEYS metric:* (O(keyspace), blocks single-threaded Redis for all clients) roughly every 2s under ingestion. Switched to cursor-based SCAN. #16 the search-index rebuild full-scanned and re-tokenized every entity table on the event loop every 60s. Offloaded via run_in_threadpool. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

run_forever is an asyncio task but called the fully synchronous process_pending, whose per-row producer.flush(10) blocked the event loop for up to 10s per pending message against a slow/unreachable broker — freezing all HTTP/SSE traffic. (audit_28_06_26.md #1) Adds process_pending_async/_process_row_async used by run_forever: the DuckDB read/mark-sent/schedule-retry stay on the loop (the connection may be shared with the query engine, so it must not be touched from a worker thread), while the blocking Kafka produce+flush is offloaded via asyncio.to_thread. The synchronous process_pending/process_entry are kept for direct/test use. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dispatch_alert set fired_at / last_escalation_level and returned regardless of whether deliver() succeeded, so a timed-out level-1 page recorded the alert as fired and the on-call was never notified — and a single-step alert would not re-notify until cooldown (default 30 min). (audit_28_06_26.md #4) State transitions are now gated on delivery success: on a failed page the fire branch leaves fired_at=None and the escalation branch leaves last_escalation_level unchanged, so the next evaluation tick (the existing periodic dispatch loop) re-attempts delivery instead of going silent. Adds no-Docker unit tests for the fire-success, fire-failure, and escalation-failure paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-28T16:42:13Z

DORA Metrics

Window: last 30 days
Branch: main
Deployment frequency: 186 total / 43.4 per week
Lead time for changes: avg 0.24h / median 0.0h
Change failure rate: 63.44% (118/186)
MTTR: 0.47h across 7 incident(s)

…ions) CI integration (Docker, not runnable on the dev host) surfaced 26 failures from this branch's own changes — exactly why this went through a PR: - SSRF egress guard (#2) rejected the reserved .test hostnames the webhook/alert suites use (agent.test etc. do not resolve), failing create with 400 and blocking deliveries — which then cascaded into the alert-dedup suite via the #4 success-gating. Added an autouse integration fixture that resolves only .test names to a public IP, so the guard stays active (real loopback/private rejection intact; logic unit-tested in test_egress_guard) while the mocked- httpx delivery paths remain exercisable. - outbox ProcessorStub now implements process_pending_async (run_forever calls the async variant after #1). - clickhouse live order_count expectation 8 -> 7 (now excludes the 1 cancelled seed order, #M4). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

JuliaEdom and others added 8 commits June 28, 2026 18:57

brownjuly2003-code merged commit 28025a5 into main Jun 28, 2026
22 checks passed

brownjuly2003-code deleted the fix/audit-28-backlog branch June 28, 2026 16:59

brownjuly2003-code mentioned this pull request Jun 29, 2026

fix(alerts): page every intermediate escalation level (audit_28 §5 medium) #102

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: close 15/16 HIGH audit findings (data correctness, security, durability, perf)#96

fix: close 15/16 HIGH audit findings (data correctness, security, durability, perf)#96
brownjuly2003-code merged 9 commits into
mainfrom
fix/audit-28-backlog

brownjuly2003-code commented Jun 28, 2026

Uh oh!

github-actions Bot commented Jun 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

brownjuly2003-code commented Jun 28, 2026

What

Fixes (commit → defect)

Verification

Not in this PR (deliberately — need Docker verification / a migration decision)

Uh oh!

github-actions Bot commented Jun 28, 2026

DORA Metrics

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants