v1.1.0: session scoring + security hardening + scoring-recv exclusion override#3
Closed
dmichael-fastly wants to merge 1 commit into
Closed
v1.1.0: session scoring + security hardening + scoring-recv exclusion override#3dmichael-fastly wants to merge 1 commit into
dmichael-fastly wants to merge 1 commit into
Conversation
dmichael-fastly
added a commit
that referenced
this pull request
Jun 3, 2026
…ecv exclusion override
This branch ships v1.1.0, three workstreams squashed from 70+ original
commits. 170+ files changed; `make ci` green (lint + format + mypy +
3,087 backend pytest + 9 vcl + 65 Rust scorer + 265 frontend vitest +
OSV no-vulns). Deployed and verified on the dev VM.
## 1 — Session scoring (Phases A / B / C)
End-to-end edge anomaly-detection pipeline for Fastly Compute. Layer 1
behavioural (cookie compliance, impossibly-fast browsing, robotic
dwell) + Layer 2 transition-matrix scoring + combined 0–100 quantized
score. Dual-implemented in Python (backend/scoring) and Rust
(compute/scorer); cross-language wire-format tests pin the AES-GCM
cookie codec byte-for-byte.
- Edge: Compute scorer, 6-snippet VCL preflight (recv / pass / fetch /
deliver / miss / enforce), AES-GCM cookie carrying rotating sid +
transition state, `fastly.ddos_detected` bypass.
- Backend: training pipeline, FOS-published matrix versioning,
labelled-session retrain loop, /scoring/evaluation + /scoring/health
+ composite /scoring/dashboard, matrix version history + rollback,
AES key rotation with grace window, sliding cookie lifetime,
scoring audit log, threshold enforcement that 429s flagged requests
at the edge within seconds of commit.
- Admin UI at /admin/session-scoring: StatusPanel with live ROC-AUC
against accumulated labels, ScoringHealthCard, ThresholdSlider with
counterfactual flag/pass preview + precision/recall, RocPrCurves,
TopFlaggedTable, LabelsTab with click-to-view-events, RetrainButton,
RotateKeyButton, MatrixVersionsCard, per-reason AUC breakdown,
session-events viewer, ExcludeRegexCard (see §3), help popups.
See docs/session_scoring_runbook.md + docs/features.md for the runbook
and feature reference.
## 2 — Security remediation (Phases 0–4)
40-finding security audit closed in five phases. All fixes deployed
and verified. Full breakdown in the `### Security` block of
CHANGELOG.md 1.1.0; one-line summary per phase:
- Phase 0 — uvicorn `--proxy-headers` + Host-spoof bypass /
leftmost-XFF / teardown-auth (#012, #013, #029, #017, #034). Three
extras spotted during the IR sweep:
- E1 Caddyfile peer-IP gate on `Fastly-Client-IP → XFF` rewrite
(port 80 was open to `0.0.0.0/0`).
- E2 docker json-file log rotation (50 MB × 10 compressed).
- E3 `generate_analyst_invite` fail-fast on missing token +
defensive Fastly-response shape check.
- Phase 1 — 14-finding trivial sweep: tenant scoping (#7 / #008 /
#019), TOCTOU (#2), quarantine narrowing (#3), email-enum
timing equalisation (#4), `isoparse` validation (#1),
`service_id` path-traversal regex (#6), session re-sync (#010),
rate-limiter bounds (#014), VCL UA/referer cap (#016), GET→POST CSRF
(#020), read_only query-param removal (#026), stack-trace strip
(#027 / #028) + sweep fixture.
- Phase 2 — backend/utils/sql_validator.py implements Decision B:
statement-type whitelist + recursive parse-tree walker (catalog +
function blocklists) + fail-closed parse + audit log + perf budget
(#031 / #033 / #035). escape_sql_literal helper + characterisation
tests at four ingest sites (#009). VCL preamble unsetting
client-spoofable internal headers (#021). Origin-metric VCL
log-injection gates (#015). Path-traversal cages in /api/download
(#5) + cache cleanup (#022). SSH host-key pinning via
configs/ssh_known_hosts with fail-safe _ensure_known_hosts (#011).
- Phase 3 — Fastly vcl_hash keys on full req.url not just path
(#024). Next.js /admin middleware gates on Caddy-injected
X-Proxied-By-Caddy marker not Host header (#032). Scorer
Python+Rust parity: L1_SCORE_COOKIE_TAMPERED=100,
L1_ROBOTIC_DWELL_LOW_S 0.5 → 0.20 (#036 / #037). #038 sliding-
window mean documented as tracked follow-up.
- Phase 4 — cross-tenant scope enforcement on /api/alerts/* and
/api/views/* with pre-flight get_alert_by_id / get_view_by_id
helpers so unauthorised mutations never land (#039 / #040). NGWAF
workspace listing token-gated (#018). #025 covered by Phase 0 #017.
Cache-layer audit confirmed every per-tenant cache includes
service_id in the key.
## 3 — Scoring-recv URL exclusion regex (new operator control)
The "which requests get sent to Compute" condition was previously a
hard-coded _ASSET_EXT_REGEX in code. Operators can now override it
per-service from the Session Scoring page; the default static-asset
extension list still ships as the fallback.
- Backend — recv_snippet + generate_scoring_vcl accept an
exclude_url_regex parameter; persisted in
cfg.scoring.exclude_url_regex (None / "" = use default).
update_recv_exclusion_regex orchestrator clones only the active
version, swaps the recv snippet, validates, activates — ~5–15s vs.
the full enable_scoring flow.
- New endpoints — GET /api/services/{id}/scoring/exclude-regex
(returns current + default + effective) and PUT
/api/services/{id}/scoring/exclude-regex?confirm=true (token-gated;
audit-logged as scoring_exclude_regex_changed).
- Three-layer validation before any VCL ships:
1. Input policy — length cap (2 KB), no double-quote / control
chars, must compile under Python re.
2. falco static analysis (github.com/ysugimoto/falco) on the
assembled recv snippet (catches composition errors that slip past
Python's compiler).
3. Fastly's own VCL compiler at activate time.
- Frontend — ExcludeRegexCard on the overview tab: textarea
pre-populated with current value, "Show default" toggle, "Reset to
default" button, inline lint-error display, confirm-dialog before
publish.
- Infra — falco v2.3.0 baked into the backend Docker image; production
sets SCORING_REQUIRE_FALCO=1 so a missing binary fails closed
instead of degrading to input-policy-only.
## Infrastructure
- Backend + frontend Docker base: python:3.12-slim-bullseye →
python:3.12-slim-bookworm (cuts CVE-laden Debian 11 base; remaining
13 high CVEs are deep-dependency / OpenSSL CVEs every major Python
base inherits).
- Falco v2.3.0 in the backend image — required by the scoring-recv-
snippet validator.
- Dependency freshness sweep on all four ecosystems:
- Python: aiohttp 3.13.5 → 3.14.0, cfn-lint 1.51.2 → 1.51.4,
distlib, filelock, idna 3.17 → 3.18, joserfc 1.6.8 → 1.7.0.
- Frontend: @tanstack/react-query 5.100.14 → 5.101.0 (+ devtools),
@types/react 19.2.15 → 19.2.16, eslint-config-next 16.2.6 →
16.2.7, next 16.2.6 → 16.2.7, react / react-dom 19.2.6 → 19.2.7.
- Rust: bitflags 2.11.1 → 2.12.1.
- Deferred (major bumps reserved for 1.2): TypeScript 5.9 → 6.0
(compiler-API breaking changes); Fastly Rust SDK 0.11 → 0.12
(Compute@Edge API churn); jsdom / eslint / vitest where we're
already ahead of the npm "latest" tag.
## Versioning
Bumped to 1.1.0 in pyproject.toml, frontend/package.json, and the
FastAPI app.version. CHANGELOG updated under [1.1.0] - 2026-06-03
with Security + Infrastructure sections.
## Test coverage
backend pytest 3,087 (+321 vs v1.0.0)
Rust scorer 65 (+8)
frontend vitest 265 (+13)
VCL tests 9 (same)
New test files for this release:
tests/utils/test_sql_validator.py (60)
tests/utils/test_vcl_validator.py (18)
tests/test_proxy_headers_regression.py (10)
tests/test_no_trace_leakage_sweep.py (4)
tests/routers/test_provision_teardown_auth.py (9)
tests/routers/test_cross_tenant_scope.py (9)
tests/routers/test_scoring_exclude_regex.py (9)
## Notes for reviewers
- Branch was squashed from 70+ commits; full per-commit history is in
git reflog locally. The squash makes this reviewable as one
semantic unit (v1.1.0 release) instead of paging through unrelated
intermediate work.
- Every security fix has acceptance tests. OpenAPI snapshot
regenerated.
- Audit-finding working docs (docs/security_remediation_final_*.md,
audit-findings/) were intentionally .gitignored and cleaned up at
the end of the cycle — a fresh audit will produce fresh artifacts.
- Stale v1.1.0 tag was deleted before the squash. After merge, tag
main with v1.1.0 rather than the PR branch.
## Test plan
- [x] `make ci` passes locally
- [x] Deployed to dev VM (fastly-log-analysis in us-central1-a) — all
three containers healthy, GET /api/health returns 200
- [x] Falco verified in production image: v2.3.0
- [x] Exclude-regex endpoint reachable + returns expected shape
- [x] Off-network attack probes: Host-spoof, SQL injection
(read_csv_auto, information_schema, getenv, duckdb_secrets),
unauthenticated teardown — all rejected with expected status codes
- [ ] Reviewer: open /admin/session-scoring, scroll to the URL
exclusion regex card, paste a custom regex (e.g. \.(healthz)$),
click Save → publish flow completes
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
e617fe1 to
ca28716
Compare
dmichael-fastly
added a commit
that referenced
this pull request
Jun 3, 2026
…ecv exclusion override
This branch ships v1.1.0, three workstreams squashed from 70+ original
commits. 170+ files changed; `make ci` green (lint + format + mypy +
3,087 backend pytest + 9 vcl + 65 Rust scorer + 265 frontend vitest +
OSV no-vulns). Deployed and verified on the dev VM.
## 1 — Session scoring (Phases A / B / C)
End-to-end edge anomaly-detection pipeline for Fastly Compute. Layer 1
behavioural (cookie compliance, impossibly-fast browsing, robotic
dwell) + Layer 2 transition-matrix scoring + combined 0–100 quantized
score. Dual-implemented in Python (backend/scoring) and Rust
(compute/scorer); cross-language wire-format tests pin the AES-GCM
cookie codec byte-for-byte.
- Edge: Compute scorer, 6-snippet VCL preflight (recv / pass / fetch /
deliver / miss / enforce), AES-GCM cookie carrying rotating sid +
transition state, `fastly.ddos_detected` bypass.
- Backend: training pipeline, FOS-published matrix versioning,
labelled-session retrain loop, /scoring/evaluation + /scoring/health
+ composite /scoring/dashboard, matrix version history + rollback,
AES key rotation with grace window, sliding cookie lifetime,
scoring audit log, threshold enforcement that 429s flagged requests
at the edge within seconds of commit.
- Admin UI at /admin/session-scoring: StatusPanel with live ROC-AUC
against accumulated labels, ScoringHealthCard, ThresholdSlider with
counterfactual flag/pass preview + precision/recall, RocPrCurves,
TopFlaggedTable, LabelsTab with click-to-view-events, RetrainButton,
RotateKeyButton, MatrixVersionsCard, per-reason AUC breakdown,
session-events viewer, ExcludeRegexCard (see §3), help popups.
See docs/session_scoring_runbook.md + docs/features.md for the runbook
and feature reference.
## 2 — Security remediation (Phases 0–4)
40-finding security audit closed in five phases. All fixes deployed
and verified. Full breakdown in the `### Security` block of
CHANGELOG.md 1.1.0; one-line summary per phase:
- Phase 0 — uvicorn `--proxy-headers` + Host-spoof bypass /
leftmost-XFF / teardown-auth (#012, #013, #029, #017, #034). Three
extras spotted during the IR sweep:
- E1 Caddyfile peer-IP gate on `Fastly-Client-IP → XFF` rewrite
(port 80 was open to `0.0.0.0/0`).
- E2 docker json-file log rotation (50 MB × 10 compressed).
- E3 `generate_analyst_invite` fail-fast on missing token +
defensive Fastly-response shape check.
- Phase 1 — 14-finding trivial sweep: tenant scoping (#7 / #008 /
#019), TOCTOU (#2), quarantine narrowing (#3), email-enum
timing equalisation (#4), `isoparse` validation (#1),
`service_id` path-traversal regex (#6), session re-sync (#010),
rate-limiter bounds (#014), VCL UA/referer cap (#016), GET→POST CSRF
(#020), read_only query-param removal (#026), stack-trace strip
(#027 / #028) + sweep fixture.
- Phase 2 — backend/utils/sql_validator.py implements Decision B:
statement-type whitelist + recursive parse-tree walker (catalog +
function blocklists) + fail-closed parse + audit log + perf budget
(#031 / #033 / #035). escape_sql_literal helper + characterisation
tests at four ingest sites (#009). VCL preamble unsetting
client-spoofable internal headers (#021). Origin-metric VCL
log-injection gates (#015). Path-traversal cages in /api/download
(#5) + cache cleanup (#022). SSH host-key pinning via
configs/ssh_known_hosts with fail-safe _ensure_known_hosts (#011).
- Phase 3 — Fastly vcl_hash keys on full req.url not just path
(#024). Next.js /admin middleware gates on Caddy-injected
X-Proxied-By-Caddy marker not Host header (#032). Scorer
Python+Rust parity: L1_SCORE_COOKIE_TAMPERED=100,
L1_ROBOTIC_DWELL_LOW_S 0.5 → 0.20 (#036 / #037). #038 sliding-
window mean documented as tracked follow-up.
- Phase 4 — cross-tenant scope enforcement on /api/alerts/* and
/api/views/* with pre-flight get_alert_by_id / get_view_by_id
helpers so unauthorised mutations never land (#039 / #040). NGWAF
workspace listing token-gated (#018). #025 covered by Phase 0 #017.
Cache-layer audit confirmed every per-tenant cache includes
service_id in the key.
## 3 — Scoring-recv URL exclusion regex (new operator control)
The "which requests get sent to Compute" condition was previously a
hard-coded _ASSET_EXT_REGEX in code. Operators can now override it
per-service from the Session Scoring page; the default static-asset
extension list still ships as the fallback.
- Backend — recv_snippet + generate_scoring_vcl accept an
exclude_url_regex parameter; persisted in
cfg.scoring.exclude_url_regex (None / "" = use default).
update_recv_exclusion_regex orchestrator clones only the active
version, swaps the recv snippet, validates, activates — ~5–15s vs.
the full enable_scoring flow.
- New endpoints — GET /api/services/{id}/scoring/exclude-regex
(returns current + default + effective) and PUT
/api/services/{id}/scoring/exclude-regex?confirm=true (token-gated;
audit-logged as scoring_exclude_regex_changed).
- Three-layer validation before any VCL ships:
1. Input policy — length cap (2 KB), no double-quote / control
chars, must compile under Python re.
2. falco static analysis (github.com/ysugimoto/falco) on the
assembled recv snippet (catches composition errors that slip past
Python's compiler).
3. Fastly's own VCL compiler at activate time.
- Frontend — ExcludeRegexCard on the overview tab: textarea
pre-populated with current value, "Show default" toggle, "Reset to
default" button, inline lint-error display, confirm-dialog before
publish.
- Infra — falco v2.3.0 baked into the backend Docker image; production
sets SCORING_REQUIRE_FALCO=1 so a missing binary fails closed
instead of degrading to input-policy-only.
## Infrastructure
- Backend + frontend Docker base: python:3.12-slim-bullseye →
python:3.12-slim-bookworm (cuts CVE-laden Debian 11 base; remaining
13 high CVEs are deep-dependency / OpenSSL CVEs every major Python
base inherits).
- Falco v2.3.0 in the backend image — required by the scoring-recv-
snippet validator.
- Dependency freshness sweep on all four ecosystems:
- Python: aiohttp 3.13.5 → 3.14.0, cfn-lint 1.51.2 → 1.51.4,
distlib, filelock, idna 3.17 → 3.18, joserfc 1.6.8 → 1.7.0.
- Frontend: @tanstack/react-query 5.100.14 → 5.101.0 (+ devtools),
@types/react 19.2.15 → 19.2.16, eslint-config-next 16.2.6 →
16.2.7, next 16.2.6 → 16.2.7, react / react-dom 19.2.6 → 19.2.7.
- Rust: bitflags 2.11.1 → 2.12.1.
- Deferred (major bumps reserved for 1.2): TypeScript 5.9 → 6.0
(compiler-API breaking changes); Fastly Rust SDK 0.11 → 0.12
(Compute@Edge API churn); jsdom / eslint / vitest where we're
already ahead of the npm "latest" tag.
## Versioning
Bumped to 1.1.0 in pyproject.toml, frontend/package.json, and the
FastAPI app.version. CHANGELOG updated under [1.1.0] - 2026-06-03
with Security + Infrastructure sections.
## Test coverage
backend pytest 3,087 (+321 vs v1.0.0)
Rust scorer 65 (+8)
frontend vitest 265 (+13)
VCL tests 9 (same)
New test files for this release:
tests/utils/test_sql_validator.py (60)
tests/utils/test_vcl_validator.py (18)
tests/test_proxy_headers_regression.py (10)
tests/test_no_trace_leakage_sweep.py (4)
tests/routers/test_provision_teardown_auth.py (9)
tests/routers/test_cross_tenant_scope.py (9)
tests/routers/test_scoring_exclude_regex.py (9)
## Notes for reviewers
- Branch was squashed from 70+ commits; full per-commit history is in
git reflog locally. The squash makes this reviewable as one
semantic unit (v1.1.0 release) instead of paging through unrelated
intermediate work.
- Every security fix has acceptance tests. OpenAPI snapshot
regenerated.
- Audit-finding working docs (docs/security_remediation_final_*.md,
audit-findings/) were intentionally .gitignored and cleaned up at
the end of the cycle — a fresh audit will produce fresh artifacts.
- Stale v1.1.0 tag was deleted before the squash. After merge, tag
main with v1.1.0 rather than the PR branch.
## Test plan
- [x] `make ci` passes locally
- [x] Deployed to dev VM (fastly-log-analysis in us-central1-a) — all
three containers healthy, GET /api/health returns 200
- [x] Falco verified in production image: v2.3.0
- [x] Exclude-regex endpoint reachable + returns expected shape
- [x] Off-network attack probes: Host-spoof, SQL injection
(read_csv_auto, information_schema, getenv, duckdb_secrets),
unauthenticated teardown — all rejected with expected status codes
- [ ] Reviewer: open /admin/session-scoring, scroll to the URL
exclusion regex card, paste a custom regex (e.g. \.(healthz)$),
click Save → publish flow completes
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ca28716 to
49861ff
Compare
dmichael-fastly
added a commit
that referenced
this pull request
Jun 3, 2026
…ecv exclusion override
This branch ships v1.1.0, three workstreams squashed from 70+ original
commits. 170+ files changed; `make ci` green (lint + format + mypy +
3,087 backend pytest + 9 vcl + 65 Rust scorer + 265 frontend vitest +
OSV no-vulns). Deployed and verified on the dev VM.
## 1 — Session scoring (Phases A / B / C)
End-to-end edge anomaly-detection pipeline for Fastly Compute. Layer 1
behavioural (cookie compliance, impossibly-fast browsing, robotic
dwell) + Layer 2 transition-matrix scoring + combined 0–100 quantized
score. Dual-implemented in Python (backend/scoring) and Rust
(compute/scorer); cross-language wire-format tests pin the AES-GCM
cookie codec byte-for-byte.
- Edge: Compute scorer, 6-snippet VCL preflight (recv / pass / fetch /
deliver / miss / enforce), AES-GCM cookie carrying rotating sid +
transition state, `fastly.ddos_detected` bypass.
- Backend: training pipeline, FOS-published matrix versioning,
labelled-session retrain loop, /scoring/evaluation + /scoring/health
+ composite /scoring/dashboard, matrix version history + rollback,
AES key rotation with grace window, sliding cookie lifetime,
scoring audit log, threshold enforcement that 429s flagged requests
at the edge within seconds of commit.
- Admin UI at /admin/session-scoring: StatusPanel with live ROC-AUC
against accumulated labels, ScoringHealthCard, ThresholdSlider with
counterfactual flag/pass preview + precision/recall, RocPrCurves,
TopFlaggedTable, LabelsTab with click-to-view-events, RetrainButton,
RotateKeyButton, MatrixVersionsCard, per-reason AUC breakdown,
session-events viewer, ExcludeRegexCard (see §3), help popups.
See docs/session_scoring_runbook.md + docs/features.md for the runbook
and feature reference.
## 2 — Security remediation (Phases 0–4)
40-finding security audit closed in five phases. All fixes deployed
and verified. Full breakdown in the `### Security` block of
CHANGELOG.md 1.1.0; one-line summary per phase:
- Phase 0 — uvicorn `--proxy-headers` + Host-spoof bypass /
leftmost-XFF / teardown-auth (#012, #013, #029, #017, #034). Three
extras spotted during the IR sweep:
- E1 Caddyfile peer-IP gate on `Fastly-Client-IP → XFF` rewrite
(port 80 was open to `0.0.0.0/0`).
- E2 docker json-file log rotation (50 MB × 10 compressed).
- E3 `generate_analyst_invite` fail-fast on missing token +
defensive Fastly-response shape check.
- Phase 1 — 14-finding trivial sweep: tenant scoping (#7 / #008 /
#019), TOCTOU (#2), quarantine narrowing (#3), email-enum
timing equalisation (#4), `isoparse` validation (#1),
`service_id` path-traversal regex (#6), session re-sync (#010),
rate-limiter bounds (#014), VCL UA/referer cap (#016), GET→POST CSRF
(#020), read_only query-param removal (#026), stack-trace strip
(#027 / #028) + sweep fixture.
- Phase 2 — backend/utils/sql_validator.py implements Decision B:
statement-type whitelist + recursive parse-tree walker (catalog +
function blocklists) + fail-closed parse + audit log + perf budget
(#031 / #033 / #035). escape_sql_literal helper + characterisation
tests at four ingest sites (#009). VCL preamble unsetting
client-spoofable internal headers (#021). Origin-metric VCL
log-injection gates (#015). Path-traversal cages in /api/download
(#5) + cache cleanup (#022). SSH host-key pinning via
configs/ssh_known_hosts with fail-safe _ensure_known_hosts (#011).
- Phase 3 — Fastly vcl_hash keys on full req.url not just path
(#024). Next.js /admin middleware gates on Caddy-injected
X-Proxied-By-Caddy marker not Host header (#032). Scorer
Python+Rust parity: L1_SCORE_COOKIE_TAMPERED=100,
L1_ROBOTIC_DWELL_LOW_S 0.5 → 0.20 (#036 / #037). #038 sliding-
window mean documented as tracked follow-up.
- Phase 4 — cross-tenant scope enforcement on /api/alerts/* and
/api/views/* with pre-flight get_alert_by_id / get_view_by_id
helpers so unauthorised mutations never land (#039 / #040). NGWAF
workspace listing token-gated (#018). #025 covered by Phase 0 #017.
Cache-layer audit confirmed every per-tenant cache includes
service_id in the key.
## 3 — Scoring-recv URL exclusion regex (new operator control)
The "which requests get sent to Compute" condition was previously a
hard-coded _ASSET_EXT_REGEX in code. Operators can now override it
per-service from the Session Scoring page; the default static-asset
extension list still ships as the fallback.
- Backend — recv_snippet + generate_scoring_vcl accept an
exclude_url_regex parameter; persisted in
cfg.scoring.exclude_url_regex (None / "" = use default).
update_recv_exclusion_regex orchestrator clones only the active
version, swaps the recv snippet, validates, activates — ~5–15s vs.
the full enable_scoring flow.
- New endpoints — GET /api/services/{id}/scoring/exclude-regex
(returns current + default + effective) and PUT
/api/services/{id}/scoring/exclude-regex?confirm=true (token-gated;
audit-logged as scoring_exclude_regex_changed).
- Three-layer validation before any VCL ships:
1. Input policy — length cap (2 KB), no double-quote / control
chars, must compile under Python re.
2. falco static analysis (github.com/ysugimoto/falco) on the
assembled recv snippet (catches composition errors that slip past
Python's compiler).
3. Fastly's own VCL compiler at activate time.
- Frontend — ExcludeRegexCard on the overview tab: textarea
pre-populated with current value, "Show default" toggle, "Reset to
default" button, inline lint-error display, confirm-dialog before
publish.
- Infra — falco v2.3.0 baked into the backend Docker image; production
sets SCORING_REQUIRE_FALCO=1 so a missing binary fails closed
instead of degrading to input-policy-only.
## Infrastructure
- Backend + frontend Docker base: python:3.12-slim-bullseye →
python:3.12-slim-bookworm (cuts CVE-laden Debian 11 base; remaining
13 high CVEs are deep-dependency / OpenSSL CVEs every major Python
base inherits).
- Falco v2.3.0 in the backend image — required by the scoring-recv-
snippet validator.
- Dependency freshness sweep on all four ecosystems:
- Python: aiohttp 3.13.5 → 3.14.0, cfn-lint 1.51.2 → 1.51.4,
distlib, filelock, idna 3.17 → 3.18, joserfc 1.6.8 → 1.7.0.
- Frontend: @tanstack/react-query 5.100.14 → 5.101.0 (+ devtools),
@types/react 19.2.15 → 19.2.16, eslint-config-next 16.2.6 →
16.2.7, next 16.2.6 → 16.2.7, react / react-dom 19.2.6 → 19.2.7.
- Rust: bitflags 2.11.1 → 2.12.1.
- Deferred (major bumps reserved for 1.2): TypeScript 5.9 → 6.0
(compiler-API breaking changes); Fastly Rust SDK 0.11 → 0.12
(Compute@Edge API churn); jsdom / eslint / vitest where we're
already ahead of the npm "latest" tag.
## Versioning
Bumped to 1.1.0 in pyproject.toml, frontend/package.json, and the
FastAPI app.version. CHANGELOG updated under [1.1.0] - 2026-06-03
with Security + Infrastructure sections.
## Test coverage
backend pytest 3,087 (+321 vs v1.0.0)
Rust scorer 65 (+8)
frontend vitest 265 (+13)
VCL tests 9 (same)
New test files for this release:
tests/utils/test_sql_validator.py (60)
tests/utils/test_vcl_validator.py (18)
tests/test_proxy_headers_regression.py (10)
tests/test_no_trace_leakage_sweep.py (4)
tests/routers/test_provision_teardown_auth.py (9)
tests/routers/test_cross_tenant_scope.py (9)
tests/routers/test_scoring_exclude_regex.py (9)
## Notes for reviewers
- Branch was squashed from 70+ commits; full per-commit history is in
git reflog locally. The squash makes this reviewable as one
semantic unit (v1.1.0 release) instead of paging through unrelated
intermediate work.
- Every security fix has acceptance tests. OpenAPI snapshot
regenerated.
- Audit-finding working docs (docs/security_remediation_final_*.md,
audit-findings/) were intentionally .gitignored and cleaned up at
the end of the cycle — a fresh audit will produce fresh artifacts.
- Stale v1.1.0 tag was deleted before the squash. After merge, tag
main with v1.1.0 rather than the PR branch.
## Test plan
- [x] `make ci` passes locally
- [x] Deployed to dev VM (fastly-log-analysis in us-central1-a) — all
three containers healthy, GET /api/health returns 200
- [x] Falco verified in production image: v2.3.0
- [x] Exclude-regex endpoint reachable + returns expected shape
- [x] Off-network attack probes: Host-spoof, SQL injection
(read_csv_auto, information_schema, getenv, duckdb_secrets),
unauthenticated teardown — all rejected with expected status codes
- [ ] Reviewer: open /admin/session-scoring, scroll to the URL
exclusion regex card, paste a custom regex (e.g. \.(healthz)$),
click Save → publish flow completes
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
49861ff to
0cd7260
Compare
dmichael-fastly
added a commit
that referenced
this pull request
Jun 3, 2026
…ecv exclusion override
This branch ships v1.1.0, three workstreams squashed from 70+ original
commits. 170+ files changed; `make ci` green (lint + format + mypy +
3,087 backend pytest + 9 vcl + 65 Rust scorer + 265 frontend vitest +
OSV no-vulns). Deployed and verified on the dev VM.
## 1 — Session scoring (Phases A / B / C)
End-to-end edge anomaly-detection pipeline for Fastly Compute. Layer 1
behavioural (cookie compliance, impossibly-fast browsing, robotic
dwell) + Layer 2 transition-matrix scoring + combined 0–100 quantized
score. Dual-implemented in Python (backend/scoring) and Rust
(compute/scorer); cross-language wire-format tests pin the AES-GCM
cookie codec byte-for-byte.
- Edge: Compute scorer, 6-snippet VCL preflight (recv / pass / fetch /
deliver / miss / enforce), AES-GCM cookie carrying rotating sid +
transition state, `fastly.ddos_detected` bypass.
- Backend: training pipeline, FOS-published matrix versioning,
labelled-session retrain loop, /scoring/evaluation + /scoring/health
+ composite /scoring/dashboard, matrix version history + rollback,
AES key rotation with grace window, sliding cookie lifetime,
scoring audit log, threshold enforcement that 429s flagged requests
at the edge within seconds of commit.
- Admin UI at /admin/session-scoring: StatusPanel with live ROC-AUC
against accumulated labels, ScoringHealthCard, ThresholdSlider with
counterfactual flag/pass preview + precision/recall, RocPrCurves,
TopFlaggedTable, LabelsTab with click-to-view-events, RetrainButton,
RotateKeyButton, MatrixVersionsCard, per-reason AUC breakdown,
session-events viewer, ExcludeRegexCard (see §3), help popups.
See docs/session_scoring_runbook.md + docs/features.md for the runbook
and feature reference.
## 2 — Security remediation (Phases 0–4)
40-finding security audit closed in five phases. All fixes deployed
and verified. Full breakdown in the `### Security` block of
CHANGELOG.md 1.1.0; one-line summary per phase:
- Phase 0 — uvicorn `--proxy-headers` + Host-spoof bypass /
leftmost-XFF / teardown-auth (#012, #013, #029, #017, #034). Three
extras spotted during the IR sweep:
- E1 Caddyfile peer-IP gate on `Fastly-Client-IP → XFF` rewrite
(port 80 was open to `0.0.0.0/0`).
- E2 docker json-file log rotation (50 MB × 10 compressed).
- E3 `generate_analyst_invite` fail-fast on missing token +
defensive Fastly-response shape check.
- Phase 1 — 14-finding trivial sweep: tenant scoping (#7 / #008 /
#019), TOCTOU (#2), quarantine narrowing (#3), email-enum
timing equalisation (#4), `isoparse` validation (#1),
`service_id` path-traversal regex (#6), session re-sync (#010),
rate-limiter bounds (#014), VCL UA/referer cap (#016), GET→POST CSRF
(#020), read_only query-param removal (#026), stack-trace strip
(#027 / #028) + sweep fixture.
- Phase 2 — backend/utils/sql_validator.py implements Decision B:
statement-type whitelist + recursive parse-tree walker (catalog +
function blocklists) + fail-closed parse + audit log + perf budget
(#031 / #033 / #035). escape_sql_literal helper + characterisation
tests at four ingest sites (#009). VCL preamble unsetting
client-spoofable internal headers (#021). Origin-metric VCL
log-injection gates (#015). Path-traversal cages in /api/download
(#5) + cache cleanup (#022). SSH host-key pinning via
configs/ssh_known_hosts with fail-safe _ensure_known_hosts (#011).
- Phase 3 — Fastly vcl_hash keys on full req.url not just path
(#024). Next.js /admin middleware gates on Caddy-injected
X-Proxied-By-Caddy marker not Host header (#032). Scorer
Python+Rust parity: L1_SCORE_COOKIE_TAMPERED=100,
L1_ROBOTIC_DWELL_LOW_S 0.5 → 0.20 (#036 / #037). #038 sliding-
window mean documented as tracked follow-up.
- Phase 4 — cross-tenant scope enforcement on /api/alerts/* and
/api/views/* with pre-flight get_alert_by_id / get_view_by_id
helpers so unauthorised mutations never land (#039 / #040). NGWAF
workspace listing token-gated (#018). #025 covered by Phase 0 #017.
Cache-layer audit confirmed every per-tenant cache includes
service_id in the key.
## 3 — Scoring-recv URL exclusion regex (new operator control)
The "which requests get sent to Compute" condition was previously a
hard-coded _ASSET_EXT_REGEX in code. Operators can now override it
per-service from the Session Scoring page; the default static-asset
extension list still ships as the fallback.
- Backend — recv_snippet + generate_scoring_vcl accept an
exclude_url_regex parameter; persisted in
cfg.scoring.exclude_url_regex (None / "" = use default).
update_recv_exclusion_regex orchestrator clones only the active
version, swaps the recv snippet, validates, activates — ~5–15s vs.
the full enable_scoring flow.
- New endpoints — GET /api/services/{id}/scoring/exclude-regex
(returns current + default + effective) and PUT
/api/services/{id}/scoring/exclude-regex?confirm=true (token-gated;
audit-logged as scoring_exclude_regex_changed).
- Three-layer validation before any VCL ships:
1. Input policy — length cap (2 KB), no double-quote / control
chars, must compile under Python re.
2. falco static analysis (github.com/ysugimoto/falco) on the
assembled recv snippet (catches composition errors that slip past
Python's compiler).
3. Fastly's own VCL compiler at activate time.
- Frontend — ExcludeRegexCard on the overview tab: textarea
pre-populated with current value, "Show default" toggle, "Reset to
default" button, inline lint-error display, confirm-dialog before
publish.
- Infra — falco v2.3.0 baked into the backend Docker image; production
sets SCORING_REQUIRE_FALCO=1 so a missing binary fails closed
instead of degrading to input-policy-only.
## Infrastructure
- Backend + frontend Docker base: python:3.12-slim-bullseye →
python:3.12-slim-bookworm (cuts CVE-laden Debian 11 base; remaining
13 high CVEs are deep-dependency / OpenSSL CVEs every major Python
base inherits).
- Falco v2.3.0 in the backend image — required by the scoring-recv-
snippet validator.
- Dependency freshness sweep on all four ecosystems:
- Python: aiohttp 3.13.5 → 3.14.0, cfn-lint 1.51.2 → 1.51.4,
distlib, filelock, idna 3.17 → 3.18, joserfc 1.6.8 → 1.7.0.
- Frontend: @tanstack/react-query 5.100.14 → 5.101.0 (+ devtools),
@types/react 19.2.15 → 19.2.16, eslint-config-next 16.2.6 →
16.2.7, next 16.2.6 → 16.2.7, react / react-dom 19.2.6 → 19.2.7.
- Rust: bitflags 2.11.1 → 2.12.1.
- Deferred (major bumps reserved for 1.2): TypeScript 5.9 → 6.0
(compiler-API breaking changes); Fastly Rust SDK 0.11 → 0.12
(Compute@Edge API churn); jsdom / eslint / vitest where we're
already ahead of the npm "latest" tag.
## Versioning
Bumped to 1.1.0 in pyproject.toml, frontend/package.json, and the
FastAPI app.version. CHANGELOG updated under [1.1.0] - 2026-06-03
with Security + Infrastructure sections.
## Test coverage
backend pytest 3,087 (+321 vs v1.0.0)
Rust scorer 65 (+8)
frontend vitest 265 (+13)
VCL tests 9 (same)
New test files for this release:
tests/utils/test_sql_validator.py (60)
tests/utils/test_vcl_validator.py (18)
tests/test_proxy_headers_regression.py (10)
tests/test_no_trace_leakage_sweep.py (4)
tests/routers/test_provision_teardown_auth.py (9)
tests/routers/test_cross_tenant_scope.py (9)
tests/routers/test_scoring_exclude_regex.py (9)
## Notes for reviewers
- Branch was squashed from 70+ commits; full per-commit history is in
git reflog locally. The squash makes this reviewable as one
semantic unit (v1.1.0 release) instead of paging through unrelated
intermediate work.
- Every security fix has acceptance tests. OpenAPI snapshot
regenerated.
- Audit-finding working docs (docs/security_remediation_final_*.md,
audit-findings/) were intentionally .gitignored and cleaned up at
the end of the cycle — a fresh audit will produce fresh artifacts.
- Stale v1.1.0 tag was deleted before the squash. After merge, tag
main with v1.1.0 rather than the PR branch.
## Test plan
- [x] `make ci` passes locally
- [x] Deployed to dev VM (fastly-log-analysis in us-central1-a) — all
three containers healthy, GET /api/health returns 200
- [x] Falco verified in production image: v2.3.0
- [x] Exclude-regex endpoint reachable + returns expected shape
- [x] Off-network attack probes: Host-spoof, SQL injection
(read_csv_auto, information_schema, getenv, duckdb_secrets),
unauthenticated teardown — all rejected with expected status codes
- [ ] Reviewer: open /admin/session-scoring, scroll to the URL
exclusion regex card, paste a custom regex (e.g. \.(healthz)$),
click Save → publish flow completes
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
e6650ee to
70929f0
Compare
dc91229 to
a446c3a
Compare
19b2f14 to
231d96a
Compare
Squashed from the working set on session-scoring. Covers the session scoring + dashboard performance work from the prior squash baseline plus the recent additions: - DUCKDB_POOL_MAX_SIZE env knob (was hardcoded to 8 per service) - run.sh: compose NODE_OPTIONS instead of clobbering, refuse to bind ports commonly used by SSH tunnels to a remote backend/frontend - Dashboard stale-view retry: detect when /api/dashboard/aggregates returns inconsistent results (metadata reports recent logs but every aggregation comes back empty) and let React Query retry up to twice. Mitigates the intermittent "no data" symptom during metadata_sync cron ticks; doesn't address the underlying writer contention. - scripts/dev/sync-from-remote.sh: developer-only tool that mirrors a remote data tree locally and scrubs credentials/crons in the copied configs so the local backend can serve the synced volume without writing back. - .vscode/ added to .gitignore (local editor config). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
5919286 to
14bc9bc
Compare
Collaborator
Author
|
Closing and re-opening as a fresh PR. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
v1.1.0: session scoring + security hardening + scoring-recv exclusion override
This branch ships v1.1.0, three workstreams squashed from 70+ original
commits. 170+ files changed;
make cigreen (lint + format + mypy +3,053 backend pytest + 9 vcl + 65 Rust scorer + 265 frontend vitest +
gitleaks no-leaks + OSV no-vulns). Deployed and verified on the dev VM.
1 — Session scoring
End-to-end edge anomaly-detection pipeline for Fastly Compute. Layer 1
behavioural (cookie compliance, impossibly-fast browsing, robotic
dwell) + Layer 2 transition-matrix scoring + combined 0–100 quantized
score. Dual-implemented in Python (backend/scoring) and Rust
(compute/scorer); cross-language wire-format tests pin the AES-GCM
cookie codec byte-for-byte.
deliver / miss / enforce), AES-GCM cookie carrying rotating sid +
transition state,
fastly.ddos_detectedbypass.labelled-session retrain loop, /scoring/evaluation + /scoring/health
AES key rotation with grace window, sliding cookie lifetime,
scoring audit log, threshold enforcement that 429s flagged requests
at the edge within seconds of commit.
against accumulated labels, ScoringHealthCard, ThresholdSlider with
counterfactual flag/pass preview + precision/recall, RocPrCurves,
TopFlaggedTable, LabelsTab with click-to-view-events, RetrainButton,
RotateKeyButton, MatrixVersionsCard, per-reason AUC breakdown,
session-events viewer, ExcludeRegexCard (see §3), help popups.
See docs/session_scoring_runbook.md + docs/features.md for the runbook
and feature reference.
2 — Security hardening
Comprehensive hardening across the FastAPI backend, Fastly VCL, Next.js
frontend, and Rust scorer. Full breakdown in the
### Securityblockof CHANGELOG.md 1.1.0; capability summary:
--proxy-headers+--forwarded-allow-ips=127.0.0.1, Caddy peer-IP gating ofFastly-Client-IP → XFFrewrite, Caddy-injectedX-Proxied-By-Caddymarker driving Next.js/admingating in placeof the (forgeable) Host header. Backend reads request.client.host
as the trust signal everywhere; in-app leftmost-XFF parsing is gone.
mutation, and NGWAF workspace listing all require a caller-supplied
Fastly token validated via /tokens/self for
globalscope andservice binding. Server-stored credentials are never used as a
destructive-op fallback.
every /api/query call with a statement-type whitelist + recursive
parse-tree walker (catalog + function blocklists, fail-closed parse,
audit logging, perf budget). Replaces an incomplete regex-based
blocklist that missed read_csv_auto, information_schema,
duckdb_secrets, INSTALL/LOAD, and getenv. Plus escape_sql_literal
helper applied across ingest sites with characterisation tests for
the audit-PoC payload, multi-byte UTF-8, backslash, and empty inputs.
client-spoofable internal x-of-* / x-fos-edge-data /
x-is-cluster-fetch / X-Edge-* header; origin-metric log fields are
numeric-regex-gated and json.escape-wrapped; CDN vcl_hash keys on
full req.url; CDN vcl_recv now also runs querystring.filter_except
(S3-API allow-list) + querystring.sort so unexpected params can't
fracture the cache or leak the auth key into req.hash.
filter every read by analyst-session service_ids and gate every
mutation with pre-flight get_alert_by_id / get_view_by_id lookups so
unauthorised mutations never land. Cache layer audited: every
per-tenant cache key includes service_id.
realpath + commonpath check; service_id alphanumeric/dash regex at
path helpers; bucket-name separator rejection in cleanup.
with atomic UPDATE-with-rowcount; quarantine narrowed to actual
SQLite corruption signatures (was wiping the DB on any
OperationalError); scrypt timing equalised across hit/miss to close
the email-enumeration oracle; rate-limiter dicts bounded; stack-
trace key stripped from HTTPException.detail with sweep fixture
asserting no route leaks tracebacks.
loader; the tunnel manager refuses to start if the pinned file is
missing/empty.
L1_SCORE_COOKIE_TAMPERED=100 (was capped at 75 alongside
missing/expired), L1_ROBOTIC_DWELL_LOW_S 0.5 → 0.20 (closes the
0.20s–0.50s dwell-band evasion). Sliding-window mean (cookie schema
v3) tracked as a 1.2 follow-up.
3 — Scoring-recv URL exclusion regex (new operator control)
The "which requests get sent to Compute" condition was previously a
hard-coded _ASSET_EXT_REGEX in code. Operators can now override it
per-service from the Session Scoring page; the default static-asset
extension list still ships as the fallback.
exclude_url_regex parameter; persisted in
cfg.scoring.exclude_url_regex (None / "" = use default).
update_recv_exclusion_regex orchestrator clones only the active
version, swaps the recv snippet, validates, activates — ~5–15s vs.
the full enable_scoring flow.
(returns current + default + effective) and PUT
/api/services/{id}/scoring/exclude-regex?confirm=true (token-gated;
audit-logged as scoring_exclude_regex_changed).
chars, must compile under Python re.
assembled recv snippet (catches composition errors that slip past
Python's compiler).
pre-populated with current value, "Show default" toggle, "Reset to
default" button, inline lint-error display, confirm-dialog before
publish.
sets SCORING_REQUIRE_FALCO=1 so a missing binary fails closed
instead of degrading to input-policy-only.
Tooling additions
(.pre-commit-config.yaml), make secret-scan (chained into make ci),
and .github/workflows/ci.yml. Configuration in .gitleaks.toml
extends the default ruleset with path allowlists for tracked test
fixtures, Rust lockfile checksums, the public SSH host key, and
gitignored runtime directories. AGENTS.md §Secrets documents the
policy and suppression playbook.
Infrastructure
python:3.12-slim-bookworm. Remaining base-image CVEs are deep-
dependency / OpenSSL CVEs every major Python base inherits.
snippet validator.
distlib, filelock, idna 3.17 → 3.18, joserfc 1.6.8 → 1.7.0.
@types/react 19.2.15 → 19.2.16, eslint-config-next 16.2.6 →
16.2.7, next 16.2.6 → 16.2.7, react / react-dom 19.2.6 → 19.2.7.
(compiler-API breaking changes); Fastly Rust SDK 0.11 → 0.12
(Compute@Edge API churn); jsdom / eslint / vitest where we're
already ahead of the npm "latest" tag.
Versioning
Bumped to 1.1.0 in pyproject.toml, frontend/package.json, and the
FastAPI app.version. CHANGELOG updated under [1.1.0] - 2026-06-03
with Security + Infrastructure sections.
Test coverage
backend pytest 3,053
Rust scorer 65 (+8)
frontend vitest 265 (+13)
VCL tests 9 (same)
New test files for this release:
tests/utils/test_sql_validator.py (60)
tests/utils/test_vcl_validator.py (18)
tests/test_proxy_headers_regression.py (10)
tests/test_no_trace_leakage_sweep.py (4)
tests/routers/test_provision_teardown_auth.py (9)
tests/routers/test_cross_tenant_scope.py (9)
tests/routers/test_scoring_exclude_regex.py (9)
Notes for reviewers
git reflog locally. The squash makes this reviewable as one
semantic unit (v1.1.0 release) instead of paging through unrelated
intermediate work.
snapshot regenerated.
main with v1.1.0 rather than the PR branch.
Test plan
make cipasses locallythree containers healthy, GET /api/health returns 200
+ sort + req.url cache key
exclusion regex card, paste a custom regex (e.g. .(healthz)$),
click Save → publish flow completes
Co-Authored-By: Claude Opus 4.7 noreply@anthropic.com