
Add app_benchmark harness for edge-endpoint application throughput#400

Draft
honeytung wants to merge 12 commits into htung_axoncorp/load-test-migrate from htung_axoncorp/edge-bench-impl

Conversation

@honeytung

Summary

A YAML-driven, multi-process Python harness for benchmarking edge-endpoint application throughput — multiple lenses (single-detector or chained bbox→binary), each driven by N independent client processes, with per-frame composite generation, real chain inference, and rolled-up FPS / latency / VRAM / GPU-compute reporting.

Branched from #398 so it picks up groundlight_helpers._get_resources and call_api(timeout=...) directly. Will rebase to main once #398 lands.

Hard dependencies (already on this branch):

Companion design docs (gitignored, in .context/):

  • prd-edge-benchmark.md — PRD
  • tdd-edge-benchmark.md — Technical design (which this PR implements)

What's in this PR

New package at load-testing/app_benchmark/ (~1900 LOC):

| Module | Role |
| --- | --- |
| cli.py | Top-level orchestration (load → host check → create → register → verify → run → cleanup → report) |
| config.py | Pydantic schemas + ruamel.yaml loader with line-annotated validation errors |
| detectors.py | Wraps provision_detector + configure_edge_endpoint (no reimpl) |
| verification.py | Control-plane proof: from_edge=True, latency sanity, loading_detectors == 0, IDs in /status/resources.json |
| host_check.py | Pre-flight: refuse if non-bench detectors are loaded |
| cleanup_orphans.py | Standalone CLI for SIGKILL recovery (deletes by name prefix) |
| image_loader.py | CompositeGenerator with per-camera random.Random, parametric base, ground-truth ROI crops, padding |
| client.py | Per-camera process loop: composite → stage 0 → ground-truth crops + padding → stages 1..N. Per-client pacing, retry policy, error budget. |
| supervisor.py | Spawns/joins clients + monitor (multiprocessing spawn context) |
| monitor.py | Resource sampler — thin wrapper over glh._get_resources, builds ExperimentalApi() post-fork |
| metrics_writer.py | CSV writers for metrics.csv / warmup.csv / lens_events.csv; rolling FPS aggregation |
| environment.py | Host / GPU / git SHA / image digest capture for summary.json |
| report.py | Aggregates artifacts, builds summary.json/md, plots (per-lens FPS w/ composite-objects overlay, combined cross-lens FPS, system metrics) |

Plus:

  • configs/example_3lens.yaml — fire/fall/fence reference config
  • configs/known_pipelines.md — mlpipe registry-key starter list (sourced from Benchmarking VRAM and RAM #373)
  • tests/test_config_schema.py, test_image_loader.py, test_client.py — 24 unit tests
  • images/cat.jpeg — default downstream-crop padding fixture
  • pyproject.toml — added ruamel-yaml, psutil, pytest (dev)

Key semantics

  • cameras: N spawns N independent client processes; target_fps is per-camera. Aggregate = cameras × target_fps.
  • FPS counts lens-loop iterations (stage_idx == -1 rows), NOT HTTP requests. For a 2-stage chain, HTTP rate = aggregate_fps × (1 + num_crops_into_next). See the sketch after this list.
  • mlpipe is Optional[str] (≤100 chars), interpreted as a named-pipeline key in the cloud registry. null = mode default.
  • Composite generation (per frame): a random count in [1, num_crops_into_next] of copies of the base image at random sizes/positions. Downstream crops use the generation-time ground-truth ROIs (not detector outputs); slots beyond k_actual are filled with the configured padding_image, so exactly num_crops_into_next crops always go to the next stage.
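
As a concrete illustration of the first, second, and fourth bullets, a small self-contained sketch (all numbers are illustrative, not taken from any config in this PR):

```python
import random

cameras = 4                 # independent client processes for this lens
target_fps = 5.0            # per-camera target
num_crops_into_next = 3     # crops forwarded from stage 0 to stage 1 every frame

aggregate_fps = cameras * target_fps                    # 20.0 lens-loop iterations/s
http_rate = aggregate_fps * (1 + num_crops_into_next)   # 80.0 HTTP requests/s for a 2-stage chain

# Per-frame composite: a random count of base-image copies, then pad the crop list
# so exactly num_crops_into_next crops always go to the next stage.
rng = random.Random(0)                                  # per-camera RNG, seeded here for repeatability
k_actual = rng.randint(1, num_crops_into_next)
crops = [f"gt_roi_{i}" for i in range(k_actual)]        # ground-truth ROI crops (placeholders)
crops += ["padding_image"] * (num_crops_into_next - k_actual)
assert len(crops) == num_crops_into_next
```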

Test plan

  • Unit tests pass (24/24): cd load-testing && uv run pytest app_benchmark/tests/
  • All modules import without error
  • Example config validates and loads cleanly
  • Integration (cannot run from this workspace): --dry-run against staging cloud + local k3s edge — should create 4 detectors, register on edge, verify from_edge=true, then delete them
  • Integration: 1-min run against the example_3lens config — should produce summary.json, plots, clean detector cleanup
  • Integration: kill -9 mid-run, then cleanup_orphans --prefix bench — verify orphans are gone

Known follow-ups

  • TQ-1b: enumerate canonical mlpipe names in configs/known_pipelines.md (started; needs validation against staging)
  • TQ-2: capture edge-endpoint git SHA via a /version endpoint if/when one exists (currently stamps image digest only)
  • TQ-4: measure observer effect of polling /status/resources.json at 2 Hz under saturation
  • TQ-8: scope reconciliation with PR Benchmarking VRAM and RAM #373 (tim/benchmarking-scripts) — overlap with measure_ram_and_vram_usage.py
  • Plotly HTML dashboard (TDD §6.10) — currently just matplotlib PNGs; HTML can be added in a follow-up

🤖 Generated with Claude Code

honeytung and others added 12 commits May 6, 2026 13:35

A YAML-driven multi-process harness that benchmarks edge-endpoint
throughput under realistic application shapes (single-detector and
chained bbox->binary lenses with multi-camera fan-out).

Each lens spawns N independent client processes that generate per-frame
composite images, send them through a configurable detector chain, and
record FPS / latency / error stats. Resource sampling reuses
groundlight_helpers._get_resources (PR #398) to consume system-level
totals from /status/resources.json.

Reuses without reimplementing:
  - groundlight_helpers.provision_detector (cloud detector creation)
  - groundlight_helpers.configure_edge_endpoint (gl.edge.set_config)
  - groundlight_helpers._get_resources (post-#398 resource endpoint)
  - image_helpers compositing logic (parametric base + per-camera RNG)

CLI entry points:
  - python -m app_benchmark <yaml>           — main run
  - python -m app_benchmark.cleanup_orphans  — SIGKILL recovery

24/24 unit tests pass. Branched from htung_axoncorp/load-test-migrate
(PR #398) per the TDD's branching strategy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two fixes from the first dry-run against a real edge:

1. Edge-side cleanup gap: gl.delete_detector() removes the detector from
   the cloud but leaves it configured on the edge (inference pod stays
   loaded). Fix: snapshot gl.edge.get_config() at startup, push it back
   in the cleanup path. Preserves any pre-existing detectors when
   refuse_if_host_not_clean=false.

2. host_check warning was a single log line that's easy to miss. Now
   prints a banner-bordered multi-line warning explaining that
   pre-existing detectors share resources and will skew the results.

Plus two new configs:
  - smoke_test.yaml: minimal 1-detector dry-run validation
  - two_lens_2k.yaml: counting + chain at 1920×1080, saturate, max 5 objs
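
A minimal sketch of the snapshot-and-restore described in point 1, assuming an ExperimentalApi-style client and the gl.edge.get_config() / gl.edge.set_config() calls this branch already uses (the edge URL and exact call shape are illustrative):

```python
from groundlight import ExperimentalApi

gl = ExperimentalApi(endpoint="http://localhost:30101")  # edge URL is illustrative

# Snapshot whatever the edge is already serving before touching anything.
original_edge_config = gl.edge.get_config()

try:
    ...  # create bench detectors, register them on the edge, run the load loop
finally:
    # Restore the snapshot so pre-existing detectors survive and the bench
    # detectors stop being served, even if the run aborted mid-way.
    gl.edge.set_config(original_edge_config)
```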

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two issues found running against a real edge:

1. cleanup_orphans only handles cloud-side detector leaks, but the edge
   can also have orphan EdgeEndpointConfig entries (e.g. from a pre-fix
   run that deleted the cloud detector but left the edge config). Added
   a sister CLI:

       python -m app_benchmark.cleanup_edge --edge-endpoint URL --list
       python -m app_benchmark.cleanup_edge --edge-endpoint URL --wipe

2. host_check was prefix-matching against the `name` field of
   /status/resources.json detectors[] entries, but that endpoint only
   returns `detector_id` (no friendly name). Prefix matching was always
   failing → every loaded detector got flagged. Simplified to
   "any loaded detector = not clean", with the recovery path printed
   in the error message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

EdgeEndpointConfig.detectors is a list of DetectorConfig objects (each
with a .detector_id attribute), not a dict keyed by detector ID. The
previous introspection treated it as a dict and silently returned [],
causing --list to always report "0 configured" and --wipe to be a no-op
when the config actually had detectors.

This is what caused the user's earlier wipe to do nothing — the edge
config still had det_3DM, the controller correctly kept its pods alive,
and we incorrectly attributed it to an edge-endpoint behavior bug.
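
A minimal sketch of the corrected introspection, assuming the list-of-DetectorConfig shape described above (edge URL is illustrative):

```python
from groundlight import ExperimentalApi

gl = ExperimentalApi(endpoint="http://localhost:30101")  # edge URL is illustrative

# EdgeEndpointConfig.detectors is a list of DetectorConfig objects, each with a
# .detector_id attribute, not a dict keyed by detector ID.
configured_ids = [d.detector_id for d in gl.edge.get_config().detectors]
print(f"{len(configured_ids)} configured detector(s): {configured_ids}")
```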

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

multiprocessing.Queue(maxsize=N) creates a POSIX BoundedSemaphore
initialized to N. On macOS, SEM_VALUE_MAX=32767 — passing 100_000 raises
OSError: [Errno 22] Invalid argument when the supervisor spins up.

Cap to 32_000 (frame_queue) and 4_000 (sample_queue). Both are well
above what we actually queue at run-time since metrics_writer drains
continuously on a background thread.
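
A minimal sketch of the cap, using the spawn context the supervisor already uses (queue names match the commit; sizes are the new caps):

```python
import multiprocessing as mp

# macOS POSIX semaphores top out at SEM_VALUE_MAX = 32767, so Queue(maxsize=100_000)
# raises OSError [Errno 22] at construction time. Stay comfortably below the limit.
ctx = mp.get_context("spawn")
frame_queue = ctx.Queue(maxsize=32_000)   # per-frame metric rows, drained by metrics_writer
sample_queue = ctx.Queue(maxsize=4_000)   # monitor resource samples
```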

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three things broken in the actual load loop, all caught by the first
real run:

1. Client was POSTing to /image-queries — actual route per app/api/api.py
   is /device-api/v1/image-queries (mounted via API_BASE_PATH). Verification
   used the SDK so it found the right path; raw-HTTP clients got 404.

2. Monitor process constructed ExperimentalApi() with no endpoint, which
   falls back to GROUNDLIGHT_ENDPOINT — usually the cloud, which 404s on
   /status/resources.json. Pass the edge URL explicitly through Supervisor.

3. Console output was JSON-lines (the JsonFormatter was applied to both
   stdout and run.log). Split: stdout now uses a human-readable format
   ("HH:MM:SS [LEVEL] logger: msg"), run.log keeps the structured JSON
   for machine parsing.
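
A minimal sketch of the logging split in point 3; the JsonFormatter below is a stand-in for the project's own formatter, not a stdlib class:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Stand-in for the harness's structured formatter (illustrative)."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({"ts": self.formatTime(record, "%H:%M:%S"),
                           "level": record.levelname,
                           "logger": record.name,
                           "msg": record.getMessage()})

console = logging.StreamHandler(sys.stdout)          # human-readable console output
console.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s",
                                        datefmt="%H:%M:%S"))

file_handler = logging.FileHandler("run.log")        # structured JSON for machine parsing
file_handler.setFormatter(JsonFormatter())

root = logging.getLogger()
root.setLevel(logging.INFO)
root.handlers = [console, file_handler]
```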

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two related cleanups per review feedback:

1. client.py was POSTing to the edge with raw `requests`. That meant
   reimplementing path resolution, auth headers, and retry — all things
   the Groundlight SDK already does correctly. Replaced with
   `gl.submit_image_query(detector_id, image_bytes, **IQ_KWARGS_FOR_NO_ESCALATION)`,
   matching the canonical pattern in load-testing/simple_ee_test.py.
   The SDK's built-in 5xx/429 retry replaces our manual retry loop;
   from_edge=False still trips a fatal control-plane-drift error.

2. cloud_endpoint default was "https://api.groundlight.ai/device-api".
   The SDK appends /device-api itself, so the suffix was redundant.
   Updated default + all example configs to "https://api.groundlight.ai/".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Audit follow-up to the client.py SDK refactor — replace remaining
indirect-HTTP usages with direct SDK methods where they exist:

- host_check: gl.edge.get_config().detectors instead of
  glh._get_resources()['detectors']. The active config is the
  authoritative source for "what's configured" anyway.

- verification._check_resources_loaded: gl.edge.get_detector_readiness()
  (returns dict[detector_id -> bool]) instead of polling
  /status/resources.json for loading_detectors==0 + ID presence. One
  call answers exactly the question we're asking.

Remaining HTTP via glh:
  - monitor.py: system CPU/RAM/GPU metrics — no SDK equivalent (mandatory)
  - environment.py: container image digests — no SDK equivalent (mandatory)
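
A minimal sketch of the readiness check, assuming the dict[detector_id -> bool] return shape described above (client setup and detector IDs are illustrative):

```python
from groundlight import ExperimentalApi

gl = ExperimentalApi(endpoint="http://localhost:30101")  # edge URL is illustrative
bench_detector_ids = ["det_abc123", "det_def456"]        # IDs created for this run (illustrative)

# One call answers "is every bench detector loaded and ready on the edge?"
readiness = gl.edge.get_detector_readiness()
not_ready = [d for d in bench_detector_ids if not readiness.get(d, False)]
if not_ready:
    raise RuntimeError(f"detectors not ready on the edge: {not_ready}")
```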

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three review-driven cleanups:

1. Network latency baseline: at startup, run `ping -c 5 <edge-host>`
   and parse min/avg/max/stddev RTT. Stored under
   summary.json.environment.network_latency_ms and surfaced in
   summary.md. Tolerant of failure (firewalls, no ping binary). Cloud
   RTT is intentionally NOT measured — cloud is only touched during
   one-time training, not in the steady-state load loop.

2. Drop api_token_env from config schema. The Groundlight SDK reads
   GROUNDLIGHT_API_TOKEN directly; we never need to extract it
   ourselves now that all calls go through the SDK. cli.py asserts
   it's set in the environment as a startup precondition.

3. RAM and VRAM plots use a GB y-axis instead of bytes (and
   summary.md likewise). Easier to read for engineers; we don't
   need byte-level resolution for these graphs.

New file: app_benchmark/network.py
Updated: cli.py, config.py, report.py, all three example configs
TDD/PRD updated to match.
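
A minimal sketch of the RTT baseline from point 1, not the actual network.py implementation (Linux prints "rtt min/avg/max/mdev", macOS "round-trip min/avg/max/stddev"; both match the regex below):

```python
import re
import subprocess

def ping_rtt_ms(host: str, count: int = 5) -> dict[str, float] | None:
    """Best-effort RTT baseline; returns None on any failure (no ping binary, firewall, timeout)."""
    try:
        out = subprocess.run(["ping", "-c", str(count), host],
                             capture_output=True, text=True, timeout=15).stdout
    except (OSError, subprocess.TimeoutExpired):
        return None
    match = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms", out)
    if not match:
        return None
    return dict(zip(("min", "avg", "max", "stddev"), (float(v) for v in match.groups())))
```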

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

submit_image_query polls the cloud for high-confidence answers when
the returned IQ is below confidence_threshold — that fails when the
edge runs in NO_CLOUD mode (no cloud reachable for the IQ poll).

ask_ml returns the first ML prediction without confidence polling or
human-review escalation — exactly what we want for benchmarking
inference throughput.

Applied to:
  - client.py per-frame chain (3 call sites via _submit_via_sdk)
  - verification.py sentinel + final-pass latency measurement

Side effect: ask_ml takes no escalation kwargs (the no-escalation behavior is built in),
so dropped the IQ_KWARGS_FOR_NO_ESCALATION lookup from client and the
unused groundlight_helpers import from verification.
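
A minimal sketch of the call pattern, assuming ask_ml accepts a detector ID and a local image path the same way submit_image_query does (client setup and values are illustrative):

```python
from groundlight import Groundlight

gl = Groundlight(endpoint="http://localhost:30101")   # edge URL is illustrative

# ask_ml returns the first ML prediction: no cloud confidence polling and no
# human-review escalation, so it works in NO_CLOUD mode and measures raw
# inference throughput.
iq = gl.ask_ml(detector="det_abc123", image="frame.jpg")  # any local JPEG
print(iq.result)
```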

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1. Cloud Predictor.name has a 100-char limit. Our prefix
   `{detector_name_prefix}_{run.name}_{spec.name}` plus the
   ~50-char suffix from provision_detector plus ~20 chars of cloud
   sibling-detector overhead pushed long configs over the limit
   (e.g. fire-fall-fence-3lens hit 108 chars).

   Replaced run.name with a 6-char SHA-256 prefix and capped the
   total prefix at 28 chars. cleanup_orphans still matches by the
   user-facing detector_name_prefix (e.g. `bench_*`), so the inner
   hash is invisible to the user. spec.name is preserved for
   readability when it fits, otherwise replaced with an 8-char hash.

2. Pinging `localhost` was hanging on macOS for 15s before our
   timeout fired. Loopback hosts are now short-circuited (no ping,
   no benchmark value anyway since RTT to self is ~0).
   Per-ping timeout also dropped from 2.0s → 1.5s.
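
An illustrative version of the name-shortening scheme in point 1 (helper names and the exact fallback rule are hypothetical; the real logic lives in the harness):

```python
import hashlib

def _short_hash(text: str, n: int) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:n]

def bench_prefix(detector_name_prefix: str, run_name: str, spec_name: str,
                 max_len: int = 28) -> str:
    # run.name collapses to a 6-char SHA-256 prefix; spec.name is kept for
    # readability when the result fits, otherwise replaced with an 8-char hash.
    candidate = f"{detector_name_prefix}_{_short_hash(run_name, 6)}_{spec_name}"
    if len(candidate) <= max_len:
        return candidate
    return f"{detector_name_prefix}_{_short_hash(run_name, 6)}_{_short_hash(spec_name, 8)}"

print(bench_prefix("bench", "fire-fall-fence-3lens", "fire_bbox"))
```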

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The 'Target FPS' column showed `target_fps × cameras` (the aggregate
target the lens-as-a-whole should produce) without saying so. With
target_fps: 5 and cameras: 4 in the YAML, the table read 'Target FPS:
20' which looked like a bug.

The math was correct (we compare achieved aggregate vs target
aggregate to compute deficit %), but the presentation was opaque.

- Add a Cams column so the per-camera × cams = aggregate math is
  self-documenting.
- Split into 'Target FPS (per-cam)' and 'Target FPS (aggr)' columns,
  explicit aggregate suffix on Achieved.
- Add 'saturate' label when target_fps == 0 (no rate limit).
- Add target_fps_aggregate to summary.json for symmetry with
  achieved_fps_aggregate (target_fps_per_camera was already there).
- Add a one-line note above the table pointing readers to
  achieved_fps_per_client in summary.json for per-camera data.
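
For concreteness, the column math plus one plausible form of the deficit computation (the exact formula in report.py may differ):

```python
target_fps_per_camera = 5.0
cameras = 4
target_fps_aggregate = target_fps_per_camera * cameras   # 20.0, the "Target FPS (aggr)" column
achieved_fps_aggregate = 18.4                             # illustrative measured value
deficit_pct = 100 * (1 - achieved_fps_aggregate / target_fps_aggregate)
print(f"deficit: {deficit_pct:.1f}%")                     # 8.0%
```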

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>