v1.2.0: dashboard performance overhaul + security hardening #7
Merged
Cold and warm dashboard loads drop from seconds to sub-second on large services; sustained concurrent load no longer wedges the backend. Read path I/O is structurally cut by a per-service DuckDB connection pool, a per-minute time-series rollup bundle, size-capped bin-packing local compaction (daily + weekly tiers), composite admin-page endpoints, and a frontend pre-warm + hover-prefetch pattern that makes navigation feel instant.

Performance — structural

* Per-minute time-series rollup bundle precomputes the dashboard chart's per-minute aggregate per (field, hour); eliminates the wide Iceberg scan on chart render.
* Per-day rollup compaction — closed days roll up into a single per-day file; the reader prefers per-day and falls back to hourly only for the current day.
* Size-capped bin-packing local compaction (default 256 MB cap) replaces single-file daily/weekly rollups; preserves DuckDB scan parallelism on multi-month services.
* DuckDB connection-pool tuning — DUCKDB_POOL_CONN_MEMORY_LIMIT and DUCKDB_POOL_CONN_THREADS env vars cap per-connection RSS and threads. View-binding moved outside the pool's Condition lock to eliminate a stale-snapshot deadlock.
* Composite read endpoints — POST /api/scoring/dashboard, GET /api/scoring/analytics, GET /api/scoring/config, GET /api/network-health (now includes shielding), and the new POST /api/origin/aggregates collapse multi-card mounts into one round trip. Per-card endpoints stay mounted for back-compat.
* Parquet ingest sort key changed to (timestamp, ip) so sessions queries stream-merge on ip instead of materialising a temp table (~2× speedup).
* ingested_files.file_date column + (source_name, file_date) index for the log-accounting fast path.
* Iceberg buffer files tombstoned and removed on the next pass instead of unlinked inline at commit. optimize_table adds union_by_name + retry-on-CAS-conflict.
* Bootstrap stale-while-revalidate for dir-stats; views folded into the response.
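
The size-capped bin-packing compaction listed above can be sketched as a first-fit-decreasing pass over file sizes. This is a minimal illustration, not the project's code: the function name `pack_files`, the `(path, size)` tuple shape, and the choice of first-fit decreasing are all assumptions. Only the 256 MB default cap comes from the notes above.

```python
from typing import List, Tuple

DEFAULT_CAP_BYTES = 256 * 1024 * 1024  # 256 MB default cap, per the notes above


def pack_files(
    files: List[Tuple[str, int]], cap_bytes: int = DEFAULT_CAP_BYTES
) -> List[List[str]]:
    """Group (path, size) pairs into bins whose total size stays within cap_bytes.

    Hypothetical helper using first-fit decreasing. A single file larger
    than the cap is placed in a bin of its own.
    """
    bins: List[List[str]] = []
    remaining: List[int] = []  # free bytes left in each bin
    # Largest files first; ties keep their input order (sorted() is stable).
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        for i, free in enumerate(remaining):
            if size <= free:
                bins[i].append(path)
                remaining[i] -= size
                break
        else:
            # No existing bin has room: open a new one.
            bins.append([path])
            remaining.append(max(cap_bytes - size, 0))
    return bins
```

Each returned bin would become one compacted output file. Producing several capped files instead of one large rollup is what leaves a scan with multiple files to read in parallel.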

Performance — tuning

* Dashboard: live-hour TEMP TABLE shared across CTEs; Python-side bot match; memoised ngwaf_top.
* Insights: coalesce 4 city/region/country queries into 1; coalesce 4 URL-keyed insights into 1 CTE.
* Sessions: split monolithic CTE into measurable stages; eliminate hot-path temp-table materialisation.
* Origin: combine two sequential scans into one via GROUPING SETS.
* Cron-runs since_id delta-poll on /logs recentCrons.
* Admin usage-log visibility-gates its 30s tick; latest-per-task SQL rewritten to skip the full join.
* 60s TTL on bot-source cache-dir scandir.
* React-Query: skip 4xx retries; hooks lifted out of insights / ReportLayout render-props.

Frontend

* starlette-compress replaces GZipMiddleware (br / zstd / gzip negotiation).
* Keep-alive on Next.js http/undici global agents.
* Pre-warm + lazy-mount pattern for plotly + maplibre-gl + world.geojson on AppLayout mount; hover-prefetch sidebar links; per-insight skeleton cards on first paint.
* Modulepreload for the plotly chunk via a build-time-generated preload manifest. Root layout opts out of build-time SSG so the manifest is read at request time.
* /geo/* aggressively cached; PlotlyChart dynamic-import on /network.
* SystemHealthCard polls at 1s for live attack/load feedback.
* Shared useNowMs interval for visible-tick components.
* MapLibre style-data listener replaces a 100ms setTimeout poll.

Reliability

* Multi-worker login loop fixed via on-demand SQLite session rehydration.
* DuckDB lock conflict between pool and cron writes resolved — get_connection forces read_only=False on the file.
* QueryRunner empty-schema self-heal busts _view_cache before the force=True rebuild so the lock-timeout fallback can't re-execute the same stale cached SQL (mirrors the execute() self-heal). Without this, ingest-cron lock contention pinned the view to a deleted buffer path and the dashboard surfaced "No data available" on a 200.
* QueryRunner clears _view_cache before force=True rebuild on the post-empty self-heal path.
* Iceberg s3fs proxy hook falls back to the process-global source so the hook always registers (cold-start LIST before _get_catalog).
* Top-N current-hour merge silent ImportError fixed; rollup compaction threads run_id through the error branch + uses in-memory DuckDB.
* Dashboard response cache: write to is_cached (not aliased _is_cached) to keep Pydantic from dropping the flag.
* Usage-log reconcile cycle changed from DELETE+INSERT to UPSERT.
* expire_snapshots updated for pyiceberg 0.11.1 + emits cron_runs telemetry.
* Next.js 16 compat: middleware.ts → proxy.ts (Caddy-marker preserved).
* TelemetryResponseBodyMiddleware backstops endpoints that bypass BaseResponse.with_telemetry.

Security

* Cross-tenant ContextVar leak in the s3fs proxy hook closed — ThreadPoolExecutor.submit monkeypatched to wrap callables in contextvars.copy_context(); endpoint-keyed global registry removed.
* Path-param service-scope desync — centralised the session-scope check via a router-utils helper invoked on every scoped route.
* Secret-in-URL leak on downloads — switched to a signed short-lived bearer stripped before redirect.
* Strict input validation on the destructive-op surface (provision teardown, NGWAF mutations, scoring threshold + enforce-status-code + recv-exclusion-regex). Length caps, character allowlists, and falco static analysis before any VCL ships.
* CSRF: state-changing endpoints moved off GET.
* Cross-tenant cache key audit — every per-tenant cache key includes service_id; closed two missing entries on insights and origin paths.
* Thread leak in share-login replaced by on-demand SQLite rehydration.
* Terms-of-service bypass on share-login /acknowledge fixed.

Tests

* 3500+ backend tests (+450), 290+ frontend vitest tests (+25).
* New coverage: DuckDB pool, local compaction, rollups compaction + hour bundling, iceberg helpers, service manager, SQL validator, telemetry response middleware, router utils, state sync, terraform gen, plus router coverage for the new composite endpoints and the destructive-op-auth surface.
* make ci green: lint + format + mypy + pytest + vcl-test + verify-deps + typecheck-frontend + test-frontend + osv + secret-scan.

Infrastructure

* Synthetic load generator (scripts/loadtest_generator.py) and read-path probe (scripts/dev/loadtest_probe.sh) for reproducible perf measurement.
* Two-pass next build in the frontend Dockerfile so SSG sees the correct plotly chunk hashes.

Documentation

* AGENTS.md — Key Systems entries for the DuckDB connection pool, the hourly Top-N rollup pipeline, and the response telemetry middleware; local-compaction section updated for the bin-packing tiers.
* MONKEYPATCHES.md — documents the new ThreadPoolExecutor.submit patch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
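
The Security entry on the ContextVar leak describes wrapping submitted callables in `contextvars.copy_context()`. A minimal sketch of that idea follows, written as an explicit helper rather than a monkeypatch of `ThreadPoolExecutor.submit`. The names `current_service`, `submit_with_context`, and `read_service` are hypothetical and do not come from the project.

```python
import contextvars
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable

# Hypothetical per-request variable standing in for the tenant/service scope.
current_service: contextvars.ContextVar[str] = contextvars.ContextVar(
    "current_service", default="<unset>"
)


def submit_with_context(
    pool: ThreadPoolExecutor, fn: Callable[..., Any], *args: Any, **kwargs: Any
) -> "Future[Any]":
    """Run fn in the pool inside a snapshot of the caller's context."""
    ctx = contextvars.copy_context()  # snapshot taken at submit time
    # Context.run executes fn with the snapshot active in the worker thread.
    return pool.submit(ctx.run, fn, *args, **kwargs)


def read_service() -> str:
    """Return whatever service scope is visible to the running code."""
    return current_service.get()


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as pool:
        current_service.set("svc-a")
        print(submit_with_context(pool, read_service).result())
        current_service.set("svc-b")
        print(submit_with_context(pool, read_service).result())
```

Because each call takes a fresh snapshot, the worker sees the scope of the request that submitted the work, and anything the callable sets stays inside that copy instead of lingering on a reused worker thread.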
e179fa7 to 0f0887e
Summary
* […] ThreadPoolExecutor.submit monkeypatch); centralised path-param service-scope checks; removed a secret-in-URL leak on downloads; added strict input validation across the destructive-op surface; tightened CSRF gates.
* […] pyproject.toml, frontend/package.json, frontend/openapi.json, and backend/main.py.

See CHANGELOG.md for the full entry.

Highlights
Performance — structural
* […] DUCKDB_POOL_CONN_MEMORY_LIMIT, DUCKDB_POOL_CONN_THREADS); view-binding moved outside the pool lock.
* […] POST /api/scoring/dashboard, GET /api/scoring/analytics, GET /api/scoring/config, expanded GET /api/network-health, new POST /api/origin/aggregates. Per-card endpoints stay mounted for back-compat.
* […] (timestamp, ip) (~2× sessions speedup).
* ingested_files.file_date column + (source_name, file_date) index for log-accounting fast path.
* optimize_table adds union_by_name + retry-on-CAS-conflict.

Frontend
* starlette-compress replaces GZipMiddleware (br / zstd / gzip).
* […] world.geojson; hover-prefetch sidebar links; per-insight skeleton cards.
* […] useNowMs interval; MapLibre style-data listener replaces a 100 ms setTimeout poll.

Security
* […] ThreadPoolExecutor.submit eliminates the prior endpoint-keyed proxy registry. Documented in MONKEYPATCHES.md §6.
* […] falco static analysis) before any VCL ships on the destructive-op surface.
* […] service_id entries closed.

Documentation
* AGENTS.md — Key Systems entries for the DuckDB connection pool, the hourly Top-N rollup pipeline, and the response telemetry middleware; local-compaction section updated for the bin-packing tiers.
* MONKEYPATCHES.md — documents the new ThreadPoolExecutor.submit patch.

Test plan
* make ci green locally (lint + format + mypy + pytest + vcl-test + verify-deps + typecheck-frontend + test-frontend + osv + secret-scan)

🤖 Generated with Claude Code