
v1.1.0: session scoring + security hardening + scoring-recv exclusion override #4

Merged: dmichael-fastly merged 3 commits into main from session-scoring on Jun 8, 2026

dmichael-fastly (Collaborator) commented Jun 5, 2026

v1.1.0: session scoring + security hardening + scoring-recv exclusion override

This branch ships v1.1.0, squashed from 70+ original commits across
four workstreams (sections below): session scoring, security
hardening, scoring-recv URL exclusion override, and a tranche of
performance / reliability / operational changes. 240+ files changed;
make ci green (lint + format + mypy + 3,053 backend pytest + 9 vcl
+ 65 Rust scorer + 265 frontend vitest + gitleaks no-leaks + OSV
no-vulns). Deployed and verified on the dev VM.

1 — Session scoring

End-to-end edge anomaly-detection pipeline for Fastly Compute. Layer 1
behavioural (cookie compliance, impossibly-fast browsing, robotic
dwell) + Layer 2 transition-matrix scoring + combined 0–100 quantized
score. Dual-implemented in Python (backend/scoring) and Rust
(compute/scorer); cross-language wire-format tests pin the AES-GCM
cookie codec byte-for-byte.

  • Edge: Compute scorer, 6-snippet VCL preflight (recv / pass / fetch /
    deliver / miss / enforce), AES-GCM cookie carrying rotating sid +
    transition state, fastly.ddos_detected bypass.
  • Backend: training pipeline, FOS-published matrix versioning,
    labelled-session retrain loop, /scoring/evaluation + /scoring/health
    + composite /scoring/dashboard, matrix version history + rollback,
    AES key rotation with grace window, sliding cookie lifetime,
    scoring audit log, threshold enforcement that 429s flagged requests
    at the edge within seconds of commit.
  • Admin UI at /admin/session-scoring: StatusPanel with live ROC-AUC
    against accumulated labels, ScoringHealthCard, ThresholdSlider with
    counterfactual flag/pass preview + precision/recall, RocPrCurves,
    TopFlaggedTable, LabelsTab with click-to-view-events, RetrainButton,
    RotateKeyButton, MatrixVersionsCard, per-reason AUC breakdown,
    session-events viewer, ExcludeRegexCard (see §3), help popups.

See docs/session_scoring_runbook.md + docs/features.md for the runbook
and feature reference.
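
For reviewers unfamiliar with transition-matrix scoring, here is a minimal illustrative sketch of the Layer 2 idea. It is not the shipped scorer: the page classes, the probabilities, the unseen-transition floor, and the 5-point quantization step are all assumptions made up for this example. It scores a session by the mean surprisal of its page-to-page transitions and maps that onto a quantized 0–100 scale.

```python
import math

# Hypothetical transition probabilities P(next | current) over coarse page
# classes. The real matrix is trained and versioned; these values are invented.
MATRIX = {
    "home": {"listing": 0.7, "detail": 0.2, "cart": 0.1},
    "listing": {"detail": 0.8, "home": 0.1, "cart": 0.1},
    "detail": {"cart": 0.5, "listing": 0.4, "home": 0.1},
}
FLOOR = 1e-3  # assumed probability for transitions the matrix has never seen
QUANTUM = 5   # assumed quantization step for the 0-100 score


def transition_score(path):
    """Mean surprisal of the path's transitions, scaled to 0-100 and quantized."""
    steps = list(zip(path, path[1:]))
    if not steps:
        return 0  # a single page view carries no transition evidence
    surprisal = sum(
        -math.log(MATRIX.get(a, {}).get(b, FLOOR)) for a, b in steps
    ) / len(steps)
    # Normalise against the worst case (every transition unseen), then clamp.
    raw = 100.0 * surprisal / -math.log(FLOOR)
    clamped = max(0.0, min(100.0, raw))
    return int(round(clamped / QUANTUM) * QUANTUM)
```

In this sketch a common browsing path scores near the bottom of the range, while a path made entirely of unseen transitions scores 100. How the shipped scorer combines Layer 1 and Layer 2 is described in docs/session_scoring_runbook.md, not here.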

2 — Security hardening

Comprehensive hardening across the FastAPI backend, Fastly VCL, Next.js
frontend, and Rust scorer. Full breakdown in the ### Security block
of CHANGELOG.md 1.1.0; capability summary:

  • Trust-boundary normalisation — uvicorn --proxy-headers +
    --forwarded-allow-ips=127.0.0.1, Caddy peer-IP gating of
    Fastly-Client-IP → XFF rewrite, Caddy-injected
    X-Proxied-By-Caddy marker driving Next.js /admin gating in place
    of the (forgeable) Host header. Backend reads request.client.host
    as the trust signal everywhere; in-app leftmost-XFF parsing is gone.
  • Destructive-op auth — provisioning teardown, NGWAF workspace
    mutation, and NGWAF workspace listing all require a caller-supplied
    Fastly token validated via /tokens/self for global scope and
    service binding. Server-stored credentials are never used as a
    destructive-op fallback.
  • DuckDB user-SQL safety — new backend/utils/sql_validator.py wraps
    every /api/query call with a statement-type whitelist + recursive
    parse-tree walker (catalog + function blocklists, fail-closed parse,
    audit logging, perf budget). Replaces an incomplete regex-based
    blocklist that missed read_csv_auto, information_schema,
    duckdb_secrets, INSTALL/LOAD, and getenv. Plus escape_sql_literal
    helper applied across ingest sites with characterisation tests for
    the audit-PoC payload, multi-byte UTF-8, backslash, and empty inputs.
  • VCL header & cache discipline — vcl_recv preamble unsets every
    client-spoofable internal x-of-* / x-fos-edge-data /
    x-is-cluster-fetch / X-Edge-* header; origin-metric log fields are
    numeric-regex-gated and json.escape-wrapped; CDN vcl_hash keys on
    full req.url; CDN vcl_recv now also runs querystring.filter_except
    (S3-API allow-list) + querystring.sort so unexpected params can't
    fracture the cache or leak the auth key into req.hash.
  • Cross-tenant scope enforcement — /api/alerts/* and /api/views/*
    filter every read by analyst-session service_ids and gate every
    mutation with pre-flight get_alert_by_id / get_view_by_id lookups so
    unauthorised mutations never land. Cache layer audited: every
    per-tenant cache key includes service_id.
  • Path-traversal cages — /api/download and cache-cleanup paths
    realpath + commonpath check; service_id alphanumeric/dash regex at
    path helpers; bucket-name separator rejection in cleanup.
  • Secret & data hygiene — share-DB TOCTOU on claim_token replaced
    with atomic UPDATE-with-rowcount; quarantine narrowed to actual
    SQLite corruption signatures (was wiping the DB on any
    OperationalError); scrypt timing equalised across hit/miss to close
    the email-enumeration oracle; rate-limiter dicts bounded; stack-
    trace key stripped from HTTPException.detail with sweep fixture
    asserting no route leaks tracebacks.
  • SSH host-key pinning — configs/ssh_known_hosts with fail-safe
    loader; the tunnel manager refuses to start if the pinned file is
    missing/empty.
  • Scorer signal tightening — Python+Rust parity:
    L1_SCORE_COOKIE_TAMPERED=100 (was capped at 75 alongside
    missing/expired), L1_ROBOTIC_DWELL_LOW_S 0.5 → 0.20 (closes the
    0.20s–0.50s dwell-band evasion). Sliding-window mean (cookie schema
    v3) tracked as a 1.2 follow-up.
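
As an illustration of the path-traversal cage described above (realpath + commonpath plus an alphanumeric/dash service_id check), a minimal sketch follows. The function name `resolve_inside`, the regex constant, and the directory layout are hypothetical and are not taken from the repository.

```python
import os
import re

# Assumed shape of the service_id rule: alphanumeric and dash only.
SERVICE_ID_RE = re.compile(r"[A-Za-z0-9-]+")


def resolve_inside(base_dir, service_id, relative_path):
    """Resolve relative_path under base_dir/service_id or raise ValueError."""
    if not SERVICE_ID_RE.fullmatch(service_id):
        raise ValueError("invalid service_id")
    # realpath collapses '..' segments and follows symlinks before comparing.
    base = os.path.realpath(os.path.join(base_dir, service_id))
    candidate = os.path.realpath(os.path.join(base, relative_path))
    # commonpath compares whole path components, so '/data/svc-10' is not
    # mistaken for a child of '/data/svc-1' the way a prefix check would be.
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError("path escapes the service directory")
    return candidate
```

Comparing resolved paths with commonpath rather than `str.startswith` is the design point: a string-prefix check accepts sibling directories that merely share a prefix.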

3 — Scoring-recv URL exclusion regex (operator control)

The Compute scorer skips requests whose URL matches a configurable
exclusion regex. A default static-asset extension list (.css, .js,
.png, …) ships as the fallback so common asset traffic bypasses the
scorer out of the box; operators override it per-service from the
Session Scoring page to add patterns (health checks, internal
endpoints, etc.) without touching code or VCL.

  • Backend — recv_snippet + generate_scoring_vcl accept an
    exclude_url_regex parameter; persisted in
    cfg.scoring.exclude_url_regex (None / "" = use default).
    update_recv_exclusion_regex orchestrator clones only the active
    version, swaps the recv snippet, validates, activates — ~5–15s vs.
    the full enable_scoring flow.
  • New endpoints — GET /api/services/{id}/scoring/exclude-regex
    (returns current + default + effective) and PUT
    /api/services/{id}/scoring/exclude-regex?confirm=true (token-gated;
    audit-logged as scoring_exclude_regex_changed).
  • Three-layer validation before any VCL ships:
    1. Input policy — length cap (2 KB), no double-quote / control
      chars, must compile under Python re.
    2. falco static analysis (github.com/ysugimoto/falco) on the
      assembled recv snippet (catches composition errors that slip past
      Python's compiler).
    3. Fastly's own VCL compiler at activate time.
  • Frontend — ExcludeRegexCard on the overview tab: textarea
    pre-populated with current value, "Show default" toggle, "Reset to
    default" button, inline lint-error display, confirm-dialog before
    publish.
  • Infra — falco v2.3.0 baked into the backend Docker image; production
    sets SCORING_REQUIRE_FALCO=1 so a missing binary fails closed
    instead of degrading to input-policy-only.
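
The first validation layer (input policy) and the "None / "" = use default" rule can be sketched as below. This is a hedged illustration only: `check_exclude_regex`, `effective_regex`, and the abbreviated `DEFAULT_EXCLUDE_REGEX` are hypothetical names and values, and the falco and Fastly-compiler layers are not modelled.

```python
import re

MAX_REGEX_BYTES = 2048  # the 2 KB length cap from the input policy
# Placeholder default; the shipped static-asset extension list is longer.
DEFAULT_EXCLUDE_REGEX = r"\.(css|js|png)$"


def check_exclude_regex(pattern):
    """Return a list of input-policy violations; an empty list means it passes."""
    errors = []
    if len(pattern.encode("utf-8")) > MAX_REGEX_BYTES:
        errors.append("pattern exceeds 2 KB")
    if '"' in pattern:
        errors.append("double-quote characters are not allowed")
    if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in pattern):
        errors.append("control characters are not allowed")
    try:
        re.compile(pattern)
    except re.error as exc:
        errors.append(f"does not compile under Python re: {exc}")
    return errors


def effective_regex(stored):
    """None or an empty string falls back to the default pattern."""
    return stored if stored else DEFAULT_EXCLUDE_REGEX
```

Passing this layer only shows the pattern is well-formed in Python; the later falco and Fastly compile stages remain the authority on whether the assembled VCL is valid.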

4 — Performance, reliability, and operational changes

A tranche of supporting work that doesn't fit cleanly into §1–§3 but
is in the squash:

Dashboard performance

  • backend/core/duckdb_pool.py — DuckDB connection pool replacing
    per-request connection setup.
  • backend/core/rollups.py + scripts/backfill_rollups.py — rollup
    precomputation pipeline for the dashboard's hot aggregates.
  • backend/utils/bounded_cache.py — generic TTL + maxsize cache
    (tests/utils/test_bounded_cache.py, 13 tests), used to bound the
    rate-limiter and several previously-unbounded dict caches.
  • 17 new loading.tsx files across the Next.js app routes, plus
    frontend/components/skeletons/PageSkeleton.tsx,
    frontend/components/LazyMount.tsx,
    frontend/lib/staleViewRetry.ts, frontend/hooks/useNowSeconds.ts
    for streaming SSR fallbacks and stale-view recovery.
  • Repository refactors absorbing the pool + rollup paths:
    backend/repositories/dashboard.py (+346),
    backend/repositories/_base.py (+261), backend/state_sync.py
    (+268), backend/scheduler.py (+229),
    backend/core/metadata_db.py (+893).
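
For context on the "generic TTL + maxsize cache" item, here is a minimal sketch of that kind of primitive. It is not the contents of backend/utils/bounded_cache.py: the class name, the least-recently-used eviction policy, and the injectable clock are assumptions chosen for the example.

```python
import time
from collections import OrderedDict


class BoundedTTLCache:
    """Dict-like cache bounded by entry count (LRU eviction) and entry age."""

    def __init__(self, maxsize, ttl_seconds, clock=time.monotonic):
        self._maxsize = maxsize
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries are dropped lazily on read
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        self._data[key] = (self._clock() + self._ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self._maxsize:
            self._data.popitem(last=False)  # evict the least recently used

    def __len__(self):
        return len(self._data)
```

Bounding both size and age is what keeps structures such as rate-limiter dicts from growing without limit under a stream of unique keys.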

Operational tooling

  • backend/core/data_migrations.py — schema migration framework run
    on backend startup.
  • scripts/dev/sync-from-remote.sh — pull GCE deployment data into a
    scrubbed local working copy for performance work or feature dev
    against realistic volumes.
  • frontend/middleware.ts removed; behaviour moved to
    frontend/proxy.ts for the Caddy-marker / analyst-session split.
  • docs/demo_production_guide.md — runbook for the demo deployment.

Supporting modules surfaced during §2 / §3 work

  • backend/utils/vcl_validator.py (+ tests/utils/test_vcl_validator.py)
    — the falco-backed VCL static-analysis stage used by the scoring
    orchestrator.
  • backend/utils/fastly_auth.py — extracted Fastly token-validation
    helper used by the destructive-op auth checks.

Session scoring additions beyond §1's named components

  • Additional SessionScoring UI: AuditLogTab, ComplianceChart,
    FlagSessionPopover, PerReasonAucCard, ScoreDistChart,
    SessionEventsDialog, SinceHoursPicker, StackedHourlyBarChart,
    per-component help-content modules.
  • Additional scoring router endpoints: /scoring/score-distribution,
    /scoring/compliance-breakdown, /scoring/enforce-threshold
    (GET+PUT), /scoring/enforce-status-code (GET+PUT),
    /scoring/threshold (GET+PUT), /scoring/evaluation/per-reason,
    /scoring/matrix-versions/{version}/restore,
    /scoring/exclude-regex/validate, plus the /scoring/labels
    POST/PATCH/DELETE mutations behind the read endpoint.
  • compute/scorer/matrix.default.json ships as the default trained
    matrix; scripts/scoring/{deploy_wasm,extract_traces,train}.sh|.py
    are the training / deploy tooling.

Tooling additions

  • Secret scanner — gitleaks v8.30.1 wired into pre-commit
    (.pre-commit-config.yaml), make secret-scan (chained into make ci),
    and .github/workflows/ci.yml. Configuration in .gitleaks.toml
    extends the default ruleset with path allowlists for tracked test
    fixtures, Rust lockfile checksums, the public SSH host key, and
    gitignored runtime directories. AGENTS.md §Secrets documents the
    policy and suppression playbook.

Infrastructure

  • Backend Docker base: python:3.12-slim-bullseye →
    python:3.12-slim-bookworm. The frontend runtime stays on
    node:24-slim; only the multi-stage api-schema build inherits the
    Python bookworm bump. Remaining base-image CVEs are deep-dependency
    / OpenSSL CVEs every major Python base inherits.
  • Falco v2.3.0 in the backend image — required by the scoring-recv-
    snippet validator.
  • Dependency freshness sweep on all four ecosystems:
    • Python: aiohttp 3.13.5 → 3.14.0, cfn-lint 1.51.2 → 1.51.4,
      distlib, filelock, idna 3.17 → 3.18, joserfc 1.6.8 → 1.7.0.
    • Frontend: @tanstack/react-query 5.100.14 → 5.101.0 (+ devtools),
      @types/react 19.2.15 → 19.2.16, react / react-dom resolved to
      19.2.7 via the existing ^19.2.5 range. next + eslint-config-next
      stay pinned at 16.2.6.
    • Rust: bitflags 2.11.1 → 2.12.1.
    • Deferred (major bumps reserved for 1.2): TypeScript 5.9 → 6.0
      (compiler-API breaking changes); Fastly Rust SDK 0.11 → 0.12
      (Compute@Edge API churn); jsdom / eslint / vitest where we're
      already ahead of the npm "latest" tag.

Versioning

Bumped to 1.1.0 in pyproject.toml, frontend/package.json, and the
FastAPI app.version. CHANGELOG updated under [1.1.0] - 2026-06-03
with Security + Infrastructure sections.

Test coverage

| Suite | Tests |
| --- | --- |
| backend pytest | 3,053 |
| Rust scorer | 65 (+8) |
| frontend vitest | 265 (+13) |
| VCL tests | 9 (same) |

New test files for this release (test-function count):
| Test file | Test functions |
| --- | --- |
| tests/utils/test_sql_validator.py | 27 |
| tests/utils/test_vcl_validator.py | 18 |
| tests/test_proxy_headers_regression.py | 7 |
| tests/test_no_trace_leakage_sweep.py | 3 |
| tests/routers/test_provision_teardown_auth.py | 11 |
| tests/routers/test_cross_tenant_scope.py | 9 |
| tests/routers/test_scoring_exclude_regex.py | 15 |
| tests/utils/test_bounded_cache.py | 13 (see §4) |

Notes for reviewers

  • Branch was squashed from 70+ commits; full per-commit history is in
    git reflog locally. The squash makes this reviewable as one
    semantic unit (v1.1.0 release) instead of paging through unrelated
    intermediate work.
  • Every security-relevant change has acceptance tests. OpenAPI
    snapshot regenerated.
  • Stale v1.1.0 tag was deleted before the squash. After merge, tag
    main with v1.1.0 rather than the PR branch.

Test plan

  • make ci passes locally
  • Deployed to dev VM (fastly-log-analysis in us-central1-a) — all
    three containers healthy, GET /api/health returns 200
  • Falco verified in production image: v2.3.0
  • Exclude-regex endpoint reachable + returns expected shape
  • CDN VCL active version updated with new querystring filter_except
    + sort + req.url cache key
  • Gitleaks scan clean against full branch history
  • Reviewer: open /admin/session-scoring, scroll to the URL
    exclusion regex card, paste a custom regex (e.g. .(healthz)$),
    click Save → publish flow completes

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

dmichael-fastly and others added 3 commits June 5, 2026 09:17
Squashed from the working set on session-scoring. Covers the session
scoring + dashboard performance work from the prior squash baseline
plus the recent additions:

- DUCKDB_POOL_MAX_SIZE env knob (was hardcoded to 8 per service)
- run.sh: compose NODE_OPTIONS instead of clobbering, refuse to bind
  ports commonly used by SSH tunnels to a remote backend/frontend
- Dashboard stale-view retry: detect when /api/dashboard/aggregates
  returns inconsistent results (metadata reports recent logs but every
  aggregation comes back empty) and let React Query retry up to twice.
  Mitigates the intermittent "no data" symptom during metadata_sync
  cron ticks; doesn't address the underlying writer contention.
- scripts/dev/sync-from-remote.sh: developer-only tool that mirrors a
  remote data tree locally and scrubs credentials/crons in the copied
  configs so the local backend can serve the synced volume without
  writing back.
- .vscode/ added to .gitignore (local editor config).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CHANGELOG.md
- Drop incorrect "next 16.2.6 → 16.2.7" and "eslint-config-next 16.2.6
  → 16.2.7" bumps; both stayed pinned at 16.2.6. Phrasing for the
  react/react-dom resolved bump corrected.
- Performance section now names the three structural workstreams that
  landed alongside the smaller tuning bullets: the DuckDB connection
  pool, the hourly Top-N rollup pipeline, and the bounded-cache
  primitive.

README.md
- Replace broken reference to `sample-vcl.vcl` (the file does not
  exist in the repo) with a description of what the manual-setup VCL
  needs to do and how to source it from the wizard's output.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@dmichael-fastly dmichael-fastly merged commit 9448897 into main Jun 8, 2026
0 of 2 checks passed
@dmichael-fastly dmichael-fastly deleted the session-scoring branch June 8, 2026 17:49