Add app_benchmark harness for edge-endpoint application throughput #400
Draft
honeytung wants to merge 12 commits into htung_axoncorp/load-test-migrate from
Conversation
A YAML-driven multi-process harness that benchmarks edge-endpoint throughput under realistic application shapes (single-detector and chained bbox→binary lenses with multi-camera fan-out). Each lens spawns N independent client processes that generate per-frame composite images, send them through a configurable detector chain, and record FPS / latency / error stats. Resource sampling reuses `groundlight_helpers._get_resources` (PR #398) to consume system-level totals from `/status/resources.json`.

Reuses without reimplementing:
- `groundlight_helpers.provision_detector` (cloud detector creation)
- `groundlight_helpers.configure_edge_endpoint` (`gl.edge.set_config`)
- `groundlight_helpers._get_resources` (post-#398 resource endpoint)
- `image_helpers` compositing logic (parametric base + per-camera RNG)

CLI entry points:
- `python -m app_benchmark <yaml>` — main run
- `python -m app_benchmark.cleanup_orphans` — SIGKILL recovery

24/24 unit tests pass. Branched from htung_axoncorp/load-test-migrate (PR #398) per the TDD's branching strategy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
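As a rough illustration of the per-camera loop this commit describes, a minimal sketch of one client process — `gen`, `submit`, and `record` are stand-ins for the harness's generator, SDK submission, and metrics plumbing, not its actual API:

```python
import time

# Hypothetical sketch of one client process's frame loop: composite a
# frame with the per-camera RNG, drive it through the detector chain,
# record a completed-frame sample, then pace toward target_fps.
def client_loop(gen, submit, record, chain, target_fps):
    interval = 1.0 / target_fps if target_fps > 0 else 0.0  # 0 => saturate
    while True:
        t0 = time.monotonic()
        frame, _rois = gen.next_frame()      # per-camera RNG composite
        for detector_id in chain:            # configurable detector chain
            submit(detector_id, frame)
        record(frames=1, latency_s=time.monotonic() - t0)
        if interval:
            time.sleep(max(0.0, interval - (time.monotonic() - t0)))
```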
Two fixes from the first dry-run against a real edge:

1. Edge-side cleanup gap: gl.delete_detector() removes the detector from the cloud but leaves it configured on the edge (the inference pod stays loaded). Fix: snapshot gl.edge.get_config() at startup, push it back in the cleanup path. Preserves any pre-existing detectors when refuse_if_host_not_clean=false.
2. The host_check warning was a single log line that's easy to miss. It now prints a banner-bordered multi-line warning explaining that pre-existing detectors share resources and will skew the results.

Plus two new configs:
- smoke_test.yaml: minimal 1-detector dry-run validation
- two_lens_2k.yaml: counting + chain at 1920×1080, saturate, max 5 objs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
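A minimal sketch of the snapshot-and-restore pattern in fix 1, assuming the `gl.edge.get_config` / `gl.edge.set_config` calls named in these commits (`run_fn` is a hypothetical stand-in for the benchmark body):

```python
def run_with_edge_config_restore(gl, run_fn):
    # Snapshot the edge config before provisioning so pre-existing
    # detectors survive cleanup (fix 1 above).
    original = gl.edge.get_config()
    try:
        run_fn()
    finally:
        gl.edge.set_config(original)  # push snapshot back in cleanup path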
Two issues found running against a real edge:
1. cleanup_orphans only handles cloud-side detector leaks, but the edge
can also have orphan EdgeEndpointConfig entries (e.g. from a pre-fix
run that deleted the cloud detector but left the edge config). Added
a sister CLI:
python -m app_benchmark.cleanup_edge --edge-endpoint URL --list
python -m app_benchmark.cleanup_edge --edge-endpoint URL --wipe
2. host_check was prefix-matching against the `name` field of
/status/resources.json detectors[] entries, but that endpoint only
returns `detector_id` (no friendly name). Prefix matching was always
failing → every loaded detector got flagged. Simplified to
"any loaded detector = not clean", with the recovery path printed
in the error message.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
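A sketch of the simplified check in item 2, assuming the `/status/resources.json` shape these commits describe (entries carrying only `detector_id`); the function name is an assumption:

```python
def assert_host_clean(resources: dict, edge_url: str) -> None:
    # Any loaded detector means the host is not clean — entries have no
    # friendly name, so no prefix matching is attempted (item 2 above).
    loaded = [d["detector_id"] for d in resources.get("detectors", [])]
    if loaded:
        raise RuntimeError(
            f"Edge has {len(loaded)} loaded detector(s): {loaded}. "
            f"Recover with: python -m app_benchmark.cleanup_edge "
            f"--edge-endpoint {edge_url} --wipe"
        )
```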
EdgeEndpointConfig.detectors is a list of DetectorConfig objects (each with a .detector_id attribute), not a dict keyed by detector ID. The previous introspection treated it as a dict and silently returned [], causing --list to always report "0 configured" and --wipe to be a no-op when the config actually had detectors.

This is what caused the user's earlier wipe to do nothing — the edge config still had det_3DM, the controller correctly kept its pods alive, and we incorrectly attributed it to an edge-endpoint behavior bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
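A sketch of the corrected introspection, assuming the types named above:

```python
# detectors is a list[DetectorConfig], not a dict keyed by detector ID.
def configured_detector_ids(config) -> list[str]:
    # The buggy version iterated the field as if it were dict-like and
    # silently yielded nothing; the fix reads .detector_id off each entry.
    return [d.detector_id for d in config.detectors]
```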
multiprocessing.Queue(maxsize=N) creates a POSIX BoundedSemaphore initialized to N. On macOS, SEM_VALUE_MAX=32767 — passing 100_000 raises OSError: [Errno 22] Invalid argument when the supervisor spins up.

Cap to 32_000 (frame_queue) and 4_000 (sample_queue). Both are well above what we actually queue at run-time since metrics_writer drains continuously on a background thread.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
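A minimal sketch of the capped construction (constant names are assumptions):

```python
import multiprocessing as mp

# macOS caps POSIX semaphore values at SEM_VALUE_MAX (32767), and
# mp.Queue(maxsize=N) backs its bound onto such a semaphore, so an
# oversized maxsize fails at construction time.
FRAME_QUEUE_MAX = 32_000    # was 100_000: OSError(EINVAL) on macOS
SAMPLE_QUEUE_MAX = 4_000

frame_queue = mp.Queue(maxsize=FRAME_QUEUE_MAX)
sample_queue = mp.Queue(maxsize=SAMPLE_QUEUE_MAX)
```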
Three things broken in the actual load loop, all caught by the first
real run:
1. Client was POSTing to /image-queries — actual route per app/api/api.py
is /device-api/v1/image-queries (mounted via API_BASE_PATH). Verification
used the SDK so it found the right path; raw-HTTP clients got 404.
2. Monitor process constructed ExperimentalApi() with no endpoint, which
falls back to GROUNDLIGHT_ENDPOINT — usually the cloud, which 404s on
/status/resources.json. Pass the edge URL explicitly through Supervisor.
3. Console output was JSON-lines (the JsonFormatter was applied to both
stdout and run.log). Split: stdout now uses a human-readable format
("HH:MM:SS [LEVEL] logger: msg"), run.log keeps the structured JSON
for machine parsing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
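A minimal sketch of the stdout / run.log split in item 3 — the handler wiring is assumed, and the `JsonFormatter` import path is hypothetical (it stands for the project's existing formatter):

```python
import logging

from app_benchmark.logging_utils import JsonFormatter  # hypothetical import path

root = logging.getLogger()
root.setLevel(logging.INFO)

# Human-readable console: "HH:MM:SS [LEVEL] logger: msg"
console = logging.StreamHandler()
console.setFormatter(logging.Formatter(
    "%(asctime)s [%(levelname)s] %(name)s: %(message)s", datefmt="%H:%M:%S"))

# run.log keeps the structured JSON for machine parsing
file_handler = logging.FileHandler("run.log")
file_handler.setFormatter(JsonFormatter())

root.addHandler(console)
root.addHandler(file_handler)
```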
Two related cleanups per review feedback:

1. client.py was POSTing to the edge with raw `requests`. That meant reimplementing path resolution, auth headers, and retry — all things the Groundlight SDK already does correctly. Replaced with `gl.submit_image_query(detector_id, image_bytes, **IQ_KWARGS_FOR_NO_ESCALATION)`, matching the canonical pattern in load-testing/simple_ee_test.py. The SDK's built-in 5xx/429 retry replaces our manual retry loop; from_edge=False still trips a fatal control-plane-drift error.
2. cloud_endpoint default was "https://api.groundlight.ai/device-api". The SDK appends /device-api itself, so the suffix was redundant. Updated default + all example configs to "https://api.groundlight.ai/".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audit followup to the client.py SDK refactor — replace remaining indirect-HTTP usages with direct SDK methods where they exist:

- host_check: `gl.edge.get_config().detectors` instead of `glh._get_resources()['detectors']`. The active config is the authoritative source for "what's configured" anyway.
- verification._check_resources_loaded: `gl.edge.get_detector_readiness()` (returns dict[detector_id -> bool]) instead of polling /status/resources.json for loading_detectors==0 + ID presence. One call answers exactly the question we're asking.

Remaining HTTP via glh:
- monitor.py: system CPU/RAM/GPU metrics — no SDK equivalent (mandatory)
- environment.py: container image digests — no SDK equivalent (mandatory)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
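A sketch of the readiness check described above; the `get_detector_readiness()` return shape (dict mapping detector_id to bool) is taken from this commit's own text:

```python
# One call answers "are all our detectors loaded and ready?"
def _check_resources_loaded(gl, expected_ids: set[str]) -> bool:
    readiness = gl.edge.get_detector_readiness()  # dict[detector_id -> bool]
    return all(readiness.get(det_id, False) for det_id in expected_ids)
```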
Three review-driven cleanups:

1. Network latency baseline: at startup, run `ping -c 5 <edge-host>` and parse min/avg/max/stddev RTT. Stored under summary.json.environment.network_latency_ms and surfaced in summary.md. Tolerant of failure (firewalls, no ping binary). Cloud RTT is intentionally NOT measured — the cloud is only touched during one-time training, not in the steady-state load loop.
2. Drop api_token_env from the config schema. The Groundlight SDK reads GROUNDLIGHT_API_TOKEN directly; we never need to extract it ourselves now that all calls go through the SDK. cli.py asserts it's set in the environment as a startup precondition.
3. RAM and VRAM plots use a GB y-axis instead of bytes (and summary.md likewise). Easier to read for engineers; we don't need byte-level resolution for these graphs.

New file: app_benchmark/network.py
Updated: cli.py, config.py, report.py, all three example configs
TDD/PRD updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
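A minimal sketch of the ping baseline in item 1, assuming the standard `ping` summary line ("round-trip min/avg/max/stddev = a/b/c/d ms" on macOS/BSD, "rtt min/avg/max/mdev" on Linux); the function name is an assumption:

```python
import re
import subprocess

def measure_rtt_ms(host: str, count: int = 5) -> dict | None:
    """Parse min/avg/max/stddev RTT from ping; None on any failure."""
    try:
        out = subprocess.run(
            ["ping", "-c", str(count), host],
            capture_output=True, text=True, timeout=15,
        ).stdout
    except (OSError, subprocess.TimeoutExpired):
        return None  # tolerate firewalls / missing ping binary
    m = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms", out)
    if not m:
        return None
    return dict(zip(("min", "avg", "max", "stddev"),
                    (float(g) for g in m.groups())))
```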
submit_image_query polls the cloud for high-confidence answers when the returned IQ is below confidence_threshold — that fails when the edge runs in NO_CLOUD mode (no cloud reachable for the IQ poll). ask_ml returns the first ML prediction without confidence polling or human-review escalation — exactly what we want for benchmarking inference throughput.

Applied to:
- client.py per-frame chain (3 call sites via _submit_via_sdk)
- verification.py sentinel + final-pass latency measurement

Side effect: ask_ml takes no escalation kwargs (its behavior is inherent), so the IQ_KWARGS_FOR_NO_ESCALATION lookup was dropped from client and the unused groundlight_helpers import from verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
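A sketch of the swap, using the `_submit_via_sdk` helper name from this commit:

```python
def _submit_via_sdk(gl, detector_id: str, image_bytes: bytes):
    # Before: gl.submit_image_query(detector_id, image_bytes,
    #                               **IQ_KWARGS_FOR_NO_ESCALATION)
    # After: first ML prediction, no confidence polling or human-review
    # escalation, so it also works in NO_CLOUD mode.
    return gl.ask_ml(detector_id, image_bytes)
```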
1. Cloud Predictor.name has a 100-char limit. Our prefix
`{detector_name_prefix}_{run.name}_{spec.name}` plus the
~50-char suffix from provision_detector plus ~20 chars of cloud
sibling-detector overhead pushed long configs over the limit
(e.g. fire-fall-fence-3lens hit 108 chars).
Replaced run.name with a 6-char SHA-256 prefix and capped the
total prefix at 28 chars. cleanup_orphans still matches by the
user-facing detector_name_prefix (e.g. `bench_*`), so the inner
hash is invisible to the user. spec.name is preserved for
readability when it fits, otherwise replaced with an 8-char hash.
2. Pinging `localhost` was hanging on macOS for 15s before our
timeout fired. Loopback hosts are now short-circuited (no ping,
no benchmark value anyway since RTT to self is ~0).
Per-ping timeout also dropped from 2.0s → 1.5s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
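An illustrative sketch of the naming scheme in item 1 — function names and exact truncation rules are assumptions; only the 6-char run hash, 8-char spec fallback, and 28-char cap come from the commit:

```python
import hashlib

def _short_hash(s: str, n: int) -> str:
    return hashlib.sha256(s.encode()).hexdigest()[:n]

def detector_prefix(user_prefix: str, run_name: str, spec_name: str,
                    cap: int = 28) -> str:
    # run.name is always replaced with a 6-char SHA-256 prefix; spec.name
    # is kept for readability when the result fits, otherwise it too is
    # replaced with an 8-char hash.
    name = f"{user_prefix}_{_short_hash(run_name, 6)}_{spec_name}"
    if len(name) > cap:
        name = f"{user_prefix}_{_short_hash(run_name, 6)}_{_short_hash(spec_name, 8)}"
    return name
```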
The 'Target FPS' column showed `target_fps × cameras` (the aggregate target the lens as a whole should produce) without saying so. With target_fps: 5 and cameras: 4 in the YAML, the table read 'Target FPS: 20', which looked like a bug. The math was correct (we compare achieved aggregate vs target aggregate to compute deficit %), but the presentation was opaque.

- Add a Cams column so the per-camera × cams = aggregate math is self-documenting.
- Split into 'Target FPS (per-cam)' and 'Target FPS (aggr)' columns, with an explicit aggregate suffix on Achieved.
- Add a 'saturate' label when target_fps == 0 (no rate limit).
- Add target_fps_aggregate to summary.json for symmetry with achieved_fps_aggregate (target_fps_per_camera was already there).
- Add a one-line note above the table pointing readers to achieved_fps_per_client in summary.json for per-camera data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
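A worked instance of the aggregate and deficit math the new columns make explicit (the achieved value is illustrative, not a measured result):

```python
cameras = 4
target_fps_per_camera = 5.0
target_fps_aggregate = cameras * target_fps_per_camera   # 20.0 -> 'Target FPS (aggr)'
achieved_fps_aggregate = 18.4                             # illustrative measurement
deficit_pct = 100.0 * (1.0 - achieved_fps_aggregate / target_fps_aggregate)  # 8.0%
```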
Summary
A YAML-driven, multi-process Python harness for benchmarking edge-endpoint application throughput — multiple lenses (single-detector or chained bbox→binary), each driven by N independent client processes, with per-frame composite generation, real chain inference, and rolled-up FPS / latency / VRAM / GPU-compute reporting.
Branched from #398 so it picks up `groundlight_helpers._get_resources` and `call_api(timeout=...)` directly. Will rebase to `main` once #398 lands.

Hard dependencies (already on this branch):
- #394 (`/status/resources.json` v2 shape + `loading_detectors` warmup gate)
- #398 (`_get_resources` helper + `SystemMonitor` reference pattern)

Companion design docs (gitignored, in `.context/`):
- `prd-edge-benchmark.md` — PRD
- `tdd-edge-benchmark.md` — technical design (this PR is its implementation)

What's in this PR
New package at `load-testing/app_benchmark/` (~1900 LOC):

- `cli.py` — entry point (`python -m app_benchmark <yaml>`)
- `config.py` — YAML config schema
- `detectors.py` — `provision_detector` + `configure_edge_endpoint` (no reimpl)
- `verification.py` — `from_edge=True`, latency sanity, `loading_detectors == 0`, IDs in `/status/resources.json`
- `host_check.py` — pre-run clean-host check
- `cleanup_orphans.py` — SIGKILL recovery CLI
- `image_loader.py` — `CompositeGenerator` with per-camera `random.Random`, parametric base, ground-truth ROI crops, padding
- `client.py` — per-camera load-generation process
- `supervisor.py` — process orchestration (`spawn` context)
- `monitor.py` — `glh._get_resources`, builds `ExperimentalApi()` post-fork
- `metrics_writer.py` — `metrics.csv` / `warmup.csv` / `lens_events.csv`; rolling FPS aggregation
- `environment.py` — environment stamp (container image digests) for `summary.json`
- `report.py` — `summary.json`/`.md`, plots (per-lens FPS w/ composite-objects overlay, combined cross-lens FPS, system metrics)

Plus:
- `configs/example_3lens.yaml` — fire/fall/fence reference config
- `configs/known_pipelines.md` — `mlpipe` registry-key starter list (sourced from Benchmarking VRAM and RAM #373)
- `tests/test_config_schema.py`, `test_image_loader.py`, `test_client.py` — 24 unit tests
- `images/cat.jpeg` — default downstream-crop padding fixture
- `pyproject.toml` — added `ruamel-yaml`, `psutil`, `pytest` (dev)

Key semantics
- `cameras: N` spawns N independent client processes; `target_fps` is per-camera. Aggregate = `cameras × target_fps`.
- FPS counts completed frames (`stage_idx == -1` rows), NOT HTTP requests. HTTP rate = `aggregate_fps × (1 + num_crops_into_next)` for a 2-stage chain.
- `mlpipe` is `Optional[str]` (≤100 chars), interpreted as a named-pipeline key in the cloud registry. `null` = mode default.
- Composite generation places a random `[1, num_crops_into_next]` count of base-image copies at random sizes/positions. Downstream crops use generation ground-truth ROIs (not detector outputs); slots beyond `k_actual` are filled with the configured `padding_image`. Always sends exactly `num_crops_into_next` to the next stage — a worked instance of the rate math is sketched below.
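A worked instance of the rate math above (illustrative numbers):

```python
cameras = 4
target_fps = 5.0                  # per-camera
num_crops_into_next = 3           # 2-stage chain fan-out
aggregate_fps = cameras * target_fps                  # 20 completed frames/s
http_rps = aggregate_fps * (1 + num_crops_into_next)  # 80 HTTP requests/s
```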
Test plan

- `cd load-testing && uv run pytest app_benchmark/tests/`
- `--dry-run` against staging cloud + local k3s edge — should create 4 detectors, register on edge, verify `from_edge=true`, then delete them
- Full run: `summary.json`, plots, clean detector cleanup
- `cleanup_orphans --prefix bench` — verify orphans are gone

Known follow-ups
- More `mlpipe` names in `configs/known_pipelines.md` (started; needs validation against staging)
- Stamp a `/version` endpoint if/when one exists (currently stamps image digest only)
- `/status/resources.json` polling at 2 Hz under saturation
- Consolidate with `tim/benchmarking-scripts` — overlap with `measure_ram_and_vram_usage.py`

🤖 Generated with Claude Code