v1.2.0: dashboard performance overhaul + security hardening #7

Merged
dmichael-fastly merged 1 commit into main from performance-improvement on Jun 9, 2026

@dmichael-fastly

Collaborator

Summary

  • Performance: cold and warm dashboard loads drop from seconds to sub-second on large services; sustained concurrent load no longer wedges the backend. Read-path I/O is structurally cut by a per-service DuckDB connection pool, a per-minute time-series rollup bundle, size-capped bin-packing local compaction (daily + weekly tiers), composite admin-page endpoints, and a frontend pre-warm + hover-prefetch pattern that makes navigation feel instant.
  • Reliability: multi-worker login loop fixed; DuckDB pool/cron lock conflict resolved; iceberg s3fs proxy hook always registers; rollup current-hour merge restored; usage-log reconcile changed to UPSERT; telemetry backstop middleware added.
  • Security: closed a cross-tenant ContextVar leak in the s3fs proxy hook (now via a ThreadPoolExecutor.submit monkeypatch); centralised path-param service-scope checks; removed a secret-in-URL leak on downloads; added strict input validation across the destructive-op surface; tightened CSRF gates.
  • Version bumped to 1.2.0 across pyproject.toml, frontend/package.json, frontend/openapi.json, and backend/main.py.

See CHANGELOG.md for the full entry.

Highlights

Performance — structural

  • Per-minute time-series rollup bundle and per-day rollup compaction.
  • Size-capped bin-packing local compaction (default 256 MB) — preserves DuckDB scan parallelism on multi-month services.
  • DuckDB pool tuning knobs (DUCKDB_POOL_CONN_MEMORY_LIMIT, DUCKDB_POOL_CONN_THREADS); view-binding moved outside the pool lock.
  • Composite endpoints: POST /api/scoring/dashboard, GET /api/scoring/analytics, GET /api/scoring/config, expanded GET /api/network-health, new POST /api/origin/aggregates. Per-card endpoints stay mounted for back-compat.
  • Parquet ingest sort key → (timestamp, ip) (~2× sessions speedup).
  • ingested_files.file_date column + (source_name, file_date) index for log-accounting fast path.
  • Iceberg buffer-file tombstoning; optimize_table adds union_by_name + retry-on-CAS-conflict.
  • Bootstrap stale-while-revalidate dir-stats; views folded into the response.
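
A minimal sketch of the size-capped bin-packing idea from the compaction bullet above, assuming a first-fit-decreasing strategy (the PR does not say which packing heuristic is used). `Bin` and `pack_files` are hypothetical names for illustration only; the 256 MB default matches the documented cap.

```python
from dataclasses import dataclass, field

MB = 1024 * 1024


@dataclass
class Bin:
    """One compacted output file: the inputs merged into it and their total size."""
    files: list = field(default_factory=list)
    size: int = 0


def pack_files(sizes: dict, cap_bytes: int = 256 * MB) -> list:
    """First-fit-decreasing packing of input files into size-capped bins.

    Assumption: a file larger than the cap gets a bin of its own
    rather than being split.
    """
    bins = []
    # Largest first, name as tie-breaker so the plan is deterministic.
    for name, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        target = next((b for b in bins if b.size + size <= cap_bytes), None)
        if target is None:
            target = Bin()
            bins.append(target)
        target.files.append(name)
        target.size += size
    return bins
```

Several capped outputs, rather than one large file per day or week, leave the reader multiple files to scan in parallel, which is the stated motivation for the cap.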

Frontend

  • starlette-compress replaces GZipMiddleware (br / zstd / gzip).
  • Keep-alive on Next.js http/undici global agents.
  • Pre-warm + lazy-mount for plotly + maplibre-gl + world.geojson; hover-prefetch sidebar links; per-insight skeleton cards.
  • Modulepreload for the plotly chunk via a build-time-generated preload manifest.
  • Shared useNowMs interval; MapLibre style-data listener replaces a 100 ms setTimeout poll.

Security

  • ContextVar propagation patch on ThreadPoolExecutor.submit eliminates the prior endpoint-keyed proxy registry. Documented in MONKEYPATCHES.md §6.
  • Centralised session-scope check on every scoped route; signed short-lived bearer on the download redirect.
  • Strict validation on the destructive-op surface: length caps, character allowlists, and falco static analysis before any VCL ships.
  • Cross-tenant cache-key audit — two missing service_id entries closed.
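
For the signed short-lived bearer mentioned above, a rough sketch of one common construction: an HMAC-SHA256 signature over the resource name and an expiry timestamp. The token format, `issue_token`, and `verify_token` are assumptions for illustration; the PR does not describe the actual scheme.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def _sign(secret: bytes, payload: bytes) -> bytes:
    return hmac.new(secret, payload, hashlib.sha256).digest()


def issue_token(secret: bytes, resource: str, ttl_seconds: int = 60,
                now: Optional[float] = None) -> str:
    """Return '<payload>.<signature>', both URL-safe base64."""
    issued_at = time.time() if now is None else now
    payload = f"{resource}|{int(issued_at) + ttl_seconds}".encode()
    parts = (payload, _sign(secret, payload))
    return ".".join(base64.urlsafe_b64encode(p).decode() for p in parts)


def verify_token(secret: bytes, token: str, resource: str,
                 now: Optional[float] = None) -> bool:
    """Accept only an untampered, unexpired token issued for this resource."""
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(sig_b64)
        token_resource, expires = payload.decode().rsplit("|", 1)
        expires_at = int(expires)
    except ValueError:
        return False
    # Constant-time comparison avoids leaking signature bytes via timing.
    if not hmac.compare_digest(signature, _sign(secret, payload)):
        return False
    current = time.time() if now is None else now
    return token_resource == resource and current < expires_at
```

Because the token expires quickly and is bound to one resource, it is far less sensitive than a long-lived secret if it ends up in a log or browser history.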

Documentation

  • AGENTS.md — Key Systems entries for the DuckDB connection pool, the hourly Top-N rollup pipeline, and the response telemetry middleware; local-compaction section updated for the bin-packing tiers.
  • MONKEYPATCHES.md — documents the new ThreadPoolExecutor.submit patch.

Test plan

  • make ci green locally (lint + format + mypy + pytest + vcl-test + verify-deps + typecheck-frontend + test-frontend + osv + secret-scan)
  • Cold dashboard load on a large service is sub-second
  • Concurrent load (~20 in-flight queries) — backend stays responsive, no wedge
  • Login flow with multiple uvicorn workers succeeds on first POST
  • Cross-tenant smoke test — verify no leaked state across services after concurrent reads
  • Modulepreload manifest is read at request time on a fresh container build
  • Manually exercise the new composite endpoints (scoring dashboard / analytics / config, network-health shielding, origin aggregates)

🤖 Generated with Claude Code

Cold and warm dashboard loads drop from seconds to sub-second on large
services; sustained concurrent load no longer wedges the backend.
Read-path I/O is structurally cut by a per-service DuckDB connection
pool, a per-minute time-series rollup bundle, size-capped bin-packing
local compaction (daily + weekly tiers), composite admin-page endpoints,
and a frontend pre-warm + hover-prefetch pattern that makes navigation
feel instant.

Performance — structural

* Per-minute time-series rollup bundle precomputes the dashboard chart's
  per-minute aggregate per (field, hour); eliminates the wide Iceberg
  scan on chart render.
* Per-day rollup compaction — closed days roll up into a single per-day
  file; the reader prefers per-day and falls back to hourly only for the
  current day.
* Size-capped bin-packing local compaction (default 256 MB cap) replaces
  single-file daily/weekly rollups; preserves DuckDB scan parallelism on
  multi-month services.
* DuckDB connection-pool tuning — DUCKDB_POOL_CONN_MEMORY_LIMIT and
  DUCKDB_POOL_CONN_THREADS env vars cap per-connection RSS and threads.
  View-binding moved outside the pool's Condition lock to eliminate a
  stale-snapshot deadlock.
* Composite read endpoints — POST /api/scoring/dashboard,
  GET /api/scoring/analytics, GET /api/scoring/config,
  GET /api/network-health (now includes shielding), and the new
  POST /api/origin/aggregates collapse multi-card mounts into one round
  trip. Per-card endpoints stay mounted for back-compat.
* Parquet ingest sort key changed to (timestamp, ip) so sessions queries
  stream-merge on ip instead of materialising a temp table (~2× speedup).
* ingested_files.file_date column + (source_name, file_date) index for
  the log-accounting fast path.
* Iceberg buffer files tombstoned and removed on the next pass instead
  of unlinked inline at commit. optimize_table adds union_by_name +
  retry-on-CAS-conflict.
* Bootstrap stale-while-revalidate for dir-stats; views folded into the
  response.
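
The "view-binding moved outside the pool's Condition lock" change above follows a general pattern: hold the lock only long enough to hand out a connection, then do the slow per-checkout work unlocked. A minimal sketch, assuming a generic pool; `Pool` and `prepare` are hypothetical names and this is not the project's actual pool.

```python
import threading
from collections import deque
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")


class Pool(Generic[T]):
    """Tiny connection pool guarded by a threading.Condition."""

    def __init__(self, conns: List[T], prepare: Callable[[T], None]) -> None:
        self._idle = deque(conns)
        self._cond = threading.Condition()
        self._prepare = prepare  # stand-in for slow work such as view binding

    def acquire(self, timeout: float = 5.0) -> T:
        with self._cond:
            # Wait (bounded) until a connection is idle, then take it.
            if not self._cond.wait_for(lambda: len(self._idle) > 0, timeout=timeout):
                raise TimeoutError("no idle connection")
            conn = self._idle.popleft()
        # The lock is released here, so slow preparation cannot block
        # other threads that are acquiring or releasing connections.
        try:
            self._prepare(conn)
        except BaseException:
            self.release(conn)
            raise
        return conn

    def release(self, conn: T) -> None:
        with self._cond:
            self._idle.append(conn)
            self._cond.notify()
```

If preparation ran inside the `with self._cond:` block, a thread stuck in it would also block every `release()` call, which is one way such a pool can wedge under load.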

Performance — tuning

* Dashboard: live-hour TEMP TABLE shared across CTEs; Python-side bot
  match; memoised ngwaf_top.
* Insights: coalesce 4 city/region/country queries into 1; coalesce 4
  URL-keyed insights into 1 CTE.
* Sessions: split monolithic CTE into measurable stages; eliminate hot-
  path temp-table materialisation.
* Origin: combine two sequential scans into one via GROUPING SETS.
* Cron-runs since_id delta-poll on /logs recentCrons.
* Admin usage-log visibility-gates its 30s tick; latest-per-task SQL
  rewritten to skip the full join.
* 60s TTL on bot-source cache-dir scandir.
* React-Query: skip 4xx retries; hooks lifted out of insights /
  ReportLayout render-props.

Frontend

* starlette-compress replaces GZipMiddleware (br / zstd / gzip
  negotiation).
* Keep-alive on Next.js http/undici global agents.
* Pre-warm + lazy-mount pattern for plotly + maplibre-gl +
  world.geojson on AppLayout mount; hover-prefetch sidebar links;
  per-insight skeleton cards on first paint.
* Modulepreload for the plotly chunk via a build-time-generated preload
  manifest. Root layout opts out of build-time SSG so the manifest is
  read at request time.
* /geo/* aggressively cached; PlotlyChart dynamic-import on /network.
* SystemHealthCard polls at 1s for live attack/load feedback.
* Shared useNowMs interval for visible-tick components.
* MapLibre style-data listener replaces a 100ms setTimeout poll.

Reliability

* Multi-worker login loop fixed via on-demand SQLite session rehydration.
* DuckDB lock conflict between pool and cron writes resolved —
  get_connection forces read_only=False on the file.
* QueryRunner empty-schema self-heal busts _view_cache before the
  force=True rebuild so the lock-timeout fallback can't re-execute the
  same stale cached SQL (mirrors the execute() self-heal). Without
  this, ingest-cron lock contention pinned the view to a deleted buffer
  path and the dashboard surfaced "No data available" on a 200.
* QueryRunner clears _view_cache before force=True rebuild on the post-
  empty self-heal path.
* Iceberg s3fs proxy hook falls back to the process-global source so the
  hook always registers (cold-start LIST before _get_catalog).
* Top-N current-hour merge: silent ImportError fixed. Rollup compaction
  now threads run_id through the error branch and uses in-memory DuckDB.
* Dashboard response cache: write to is_cached (not aliased _is_cached)
  to keep Pydantic from dropping the flag.
* Usage-log reconcile cycle changed from DELETE+INSERT to UPSERT.
* expire_snapshots updated for pyiceberg 0.11.1 + emits cron_runs
  telemetry.
* Next.js 16 compat: middleware.ts → proxy.ts (Caddy-marker preserved).
* TelemetryResponseBodyMiddleware backstops endpoints that bypass
  BaseResponse.with_telemetry.
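
The DELETE+INSERT to UPSERT change above can be sketched with SQLite's `INSERT ... ON CONFLICT DO UPDATE` (available since SQLite 3.24). The table and column names are invented for illustration; the real usage-log schema is not shown in this PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE usage_log (
        task TEXT NOT NULL,
        day  TEXT NOT NULL,
        runs INTEGER NOT NULL,
        PRIMARY KEY (task, day)
    )
    """
)


def reconcile(db: sqlite3.Connection, rows: list) -> None:
    """Insert new rows and overwrite existing ones in a single statement."""
    with db:  # one transaction; commits on success, rolls back on error
        db.executemany(
            """
            INSERT INTO usage_log (task, day, runs) VALUES (?, ?, ?)
            ON CONFLICT (task, day) DO UPDATE SET runs = excluded.runs
            """,
            rows,
        )
```

With DELETE followed by INSERT, a reader that runs between the two statements outside a transaction can see the row missing; an upsert never removes the row.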

Security

* Cross-tenant ContextVar leak in the s3fs proxy hook closed —
  ThreadPoolExecutor.submit monkeypatched to wrap callables in
  contextvars.copy_context(); endpoint-keyed global registry removed.
* Path-param service-scope desync — centralised the session-scope check
  via a router-utils helper invoked on every scoped route.
* Secret-in-URL leak on downloads — switched to a signed short-lived
  bearer stripped before redirect.
* Strict input validation on the destructive-op surface (provision
  teardown, NGWAF mutations, scoring threshold + enforce-status-code +
  recv-exclusion-regex). Length caps, character allowlists, and falco
  static analysis before any VCL ships.
* CSRF: state-changing endpoints moved off GET.
* Cross-tenant cache key audit — every per-tenant cache key includes
  service_id; closed two missing entries on insights and origin paths.
* Thread leak in share-login eliminated by switching to on-demand SQLite
  rehydration.
* Terms-of-service bypass on share-login /acknowledge fixed.
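
A minimal sketch of the `ThreadPoolExecutor.submit` patch described in the first bullet above, assuming it works roughly as stated (wrap each callable in `contextvars.copy_context()`). `current_service` is a hypothetical ContextVar used for illustration; the actual patch is documented in MONKEYPATCHES.md and may differ.

```python
import contextvars
import functools
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-request variable; the real code's variable names are not shown.
current_service = contextvars.ContextVar("current_service", default=None)

_original_submit = ThreadPoolExecutor.submit


@functools.wraps(_original_submit)
def _submit_with_context(self, fn, /, *args, **kwargs):
    # Snapshot the caller's context at submit time and run fn inside it,
    # so the worker thread sees the submitting request's values.
    ctx = contextvars.copy_context()
    return _original_submit(self, ctx.run, fn, *args, **kwargs)


ThreadPoolExecutor.submit = _submit_with_context
```

Without the wrapper, a pool worker thread runs in its own context and sees the variable's default instead of the caller's value, which is why a global lookup keyed on something other than the request context might be used as a workaround. Copying the context on every submit ties the value to the submitting request.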

Tests

* 3500+ backend tests (+450), 290+ frontend vitest tests (+25).
* New coverage: DuckDB pool, local compaction, rollups compaction +
  hour bundling, iceberg helpers, service manager, SQL validator,
  telemetry response middleware, router utils, state sync, terraform
  gen, plus router coverage for the new composite endpoints and the
  destructive-op-auth surface.
* make ci green: lint + format + mypy + pytest + vcl-test + verify-deps
  + typecheck-frontend + test-frontend + osv + secret-scan.

Infrastructure

* Synthetic load generator (scripts/loadtest_generator.py) and read-path
  probe (scripts/dev/loadtest_probe.sh) for reproducible perf
  measurement.
* Two-pass next build in the frontend Dockerfile so SSG sees the
  correct plotly chunk hashes.

Documentation

* AGENTS.md — Key Systems entries for the DuckDB connection pool, the
  hourly Top-N rollup pipeline, and the response telemetry middleware;
  local-compaction section updated for the bin-packing tiers.
* MONKEYPATCHES.md — documents the new ThreadPoolExecutor.submit patch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@dmichael-fastly force-pushed the performance-improvement branch from e179fa7 to 0f0887e on June 9, 2026 17:51
@dmichael-fastly enabled auto-merge on June 9, 2026 17:54
@dmichael-fastly merged commit b552c60 into main on Jun 9, 2026
0 of 2 checks passed
@dmichael-fastly deleted the performance-improvement branch on June 9, 2026 18:08