
Add app_benchmark harness for edge-endpoint application throughput#400

Draft
honeytung wants to merge 12 commits into htung_axoncorp/load-test-migrate from htung_axoncorp/edge-bench-impl

Conversation

@honeytung

Summary

A YAML-driven, multi-process Python harness for benchmarking edge-endpoint application throughput — multiple lenses (single-detector or chained bbox→binary), each driven by N independent client processes, with per-frame composite generation, real chain inference, and rolled-up FPS / latency / VRAM / GPU-compute reporting.

Branched from #398 so it picks up groundlight_helpers._get_resources and call_api(timeout=...) directly. Will rebase to main once #398 lands.

Hard dependencies (already on this branch):

Companion design docs (gitignored, in .context/):

  • prd-edge-benchmark.md — PRD
  • tdd-edge-benchmark.md — Technical design (which this PR implements)

What's in this PR

New package at load-testing/app_benchmark/ (~1900 LOC):

| Module | Role |
| --- | --- |
| cli.py | Top-level orchestration (load → host check → create → register → verify → run → cleanup → report) |
| config.py | Pydantic schemas + ruamel.yaml loader with line-annotated validation errors |
| detectors.py | Wraps provision_detector + configure_edge_endpoint (no reimpl) |
| verification.py | Control-plane proof: from_edge=True, latency sanity, loading_detectors == 0, IDs in /status/resources.json |
| host_check.py | Pre-flight: refuse if non-bench detectors are loaded |
| cleanup_orphans.py | Standalone CLI for SIGKILL recovery (deletes by name prefix) |
| image_loader.py | CompositeGenerator with per-camera random.Random, parametric base, ground-truth ROI crops, padding |
| client.py | Per-camera process loop: composite → stage 0 → ground-truth crops + padding → stages 1..N. Per-client pacing, retry policy, error budget. |
| supervisor.py | Spawns/joins clients + monitor (multiprocessing spawn context) |
| monitor.py | Resource sampler — thin wrapper over glh._get_resources, builds ExperimentalApi() post-fork |
| metrics_writer.py | CSV writers for metrics.csv / warmup.csv / lens_events.csv; rolling FPS aggregation |
| environment.py | Host / GPU / git SHA / image digest capture for summary.json |
| report.py | Aggregates artifacts, builds summary.json/md, plots (per-lens FPS w/ composite-objects overlay, combined cross-lens FPS, system metrics) |

Plus:

  • configs/example_3lens.yaml — fire/fall/fence reference config
  • configs/known_pipelines.md — mlpipe registry-key starter list (sourced from Benchmarking VRAM and RAM #373)
  • tests/test_config_schema.py, test_image_loader.py, test_client.py — 24 unit tests
  • images/cat.jpeg — default downstream-crop padding fixture
  • pyproject.toml — added ruamel-yaml, psutil, pytest (dev)

Key semantics

  • cameras: N spawns N independent client processes; target_fps is per-camera. Aggregate = cameras × target_fps.
  • FPS counts lens-loop iterations (stage_idx == -1 rows), NOT HTTP requests. For a 2-stage chain, HTTP rate = aggregate_fps × (1 + num_crops_into_next). See the sketch after this list.
  • mlpipe is Optional[str] (≤100 chars), interpreted as a named-pipeline key in the cloud registry. null = mode default.
  • Composite generation (per frame): a random count in [1, num_crops_into_next] of copies of the base image at random sizes/positions. Downstream crops use the generation-time ground-truth ROIs (not detector outputs); slots beyond k_actual are filled with the configured padding_image, so exactly num_crops_into_next crops always go to the next stage.
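
As a concrete illustration of the first, second, and fourth bullets, a small self-contained sketch (all numbers are illustrative, not taken from any config in this PR):

```python
import random

cameras = 4                 # independent client processes for this lens
target_fps = 5.0            # per-camera target
num_crops_into_next = 3     # crops forwarded from stage 0 to stage 1 every frame

aggregate_fps = cameras * target_fps                    # 20.0 lens-loop iterations/s
http_rate = aggregate_fps * (1 + num_crops_into_next)   # 80.0 HTTP requests/s for a 2-stage chain

# Per-frame composite: a random count of base-image copies, then pad the crop list
# so exactly num_crops_into_next crops always go to the next stage.
rng = random.Random(0)                                  # per-camera RNG, seeded here for repeatability
k_actual = rng.randint(1, num_crops_into_next)
crops = [f"gt_roi_{i}" for i in range(k_actual)]        # ground-truth ROI crops (placeholders)
crops += ["padding_image"] * (num_crops_into_next - k_actual)
assert len(crops) == num_crops_into_next
```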

Test plan

  • Unit tests pass (24/24): cd load-testing && uv run pytest app_benchmark/tests/
  • All modules import without error
  • Example config validates and loads cleanly
  • Integration (cannot run from this workspace): --dry-run against staging cloud + local k3s edge — should create 4 detectors, register on edge, verify from_edge=true, then delete them
  • Integration: 1-min run against the example_3lens config — should produce summary.json, plots, clean detector cleanup
  • Integration: kill -9 mid-run, then cleanup_orphans --prefix bench — verify orphans are gone

Known follow-ups

  • TQ-1b: enumerate canonical mlpipe names in configs/known_pipelines.md (started; needs validation against staging)
  • TQ-2: capture edge-endpoint git SHA via a /version endpoint if/when one exists (currently stamps image digest only)
  • TQ-4: measure observer effect of polling /status/resources.json at 2 Hz under saturation
  • TQ-8: scope reconciliation with PR Benchmarking VRAM and RAM #373 (tim/benchmarking-scripts) — overlap with measure_ram_and_vram_usage.py
  • Plotly HTML dashboard (TDD §6.10) — currently just matplotlib PNGs; HTML can be added in a follow-up

🤖 Generated with Claude Code

honeytung and others added 12 commits May 6, 2026 13:35

A YAML-driven multi-process harness that benchmarks edge-endpoint
throughput under realistic application shapes (single-detector and
chained bbox->binary lenses with multi-camera fan-out).

Each lens spawns N independent client processes that generate per-frame
composite images, send them through a configurable detector chain, and
record FPS / latency / error stats. Resource sampling reuses
groundlight_helpers._get_resources (PR #398) to consume system-level
totals from /status/resources.json.

Reuses without reimplementing:
  - groundlight_helpers.provision_detector (cloud detector creation)
  - groundlight_helpers.configure_edge_endpoint (gl.edge.set_config)
  - groundlight_helpers._get_resources (post-#398 resource endpoint)
  - image_helpers compositing logic (parametric base + per-camera RNG)

CLI entry points:
  - python -m app_benchmark <yaml>           — main run
  - python -m app_benchmark.cleanup_orphans  — SIGKILL recovery

24/24 unit tests pass. Branched from htung_axoncorp/load-test-migrate
(PR #398) per the TDD's branching strategy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two fixes from the first dry-run against a real edge:

1. Edge-side cleanup gap: gl.delete_detector() removes the detector from
   the cloud but leaves it configured on the edge (inference pod stays
   loaded). Fix: snapshot gl.edge.get_config() at startup, push it back
   in the cleanup path. Preserves any pre-existing detectors when
   refuse_if_host_not_clean=false.

2. host_check warning was a single log line that's easy to miss. Now
   prints a banner-bordered multi-line warning explaining that
   pre-existing detectors share resources and will skew the results.

Plus two new configs:
  - smoke_test.yaml: minimal 1-detector dry-run validation
  - two_lens_2k.yaml: counting + chain at 1920×1080, saturate, max 5 objs
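
A minimal sketch of the snapshot-and-restore described in point 1, assuming an ExperimentalApi-style client and the gl.edge.get_config() / gl.edge.set_config() calls this branch already uses (the edge URL and exact call shape are illustrative):

```python
from groundlight import ExperimentalApi

gl = ExperimentalApi(endpoint="http://localhost:30101")  # edge URL is illustrative

# Snapshot whatever the edge is already serving before touching anything.
original_edge_config = gl.edge.get_config()

try:
    ...  # create bench detectors, register them on the edge, run the load loop
finally:
    # Restore the snapshot so pre-existing detectors survive and the bench
    # detectors stop being served, even if the run aborted mid-way.
    gl.edge.set_config(original_edge_config)
```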

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two issues found running against a real edge:

1. cleanup_orphans only handles cloud-side detector leaks, but the edge
   can also have orphan EdgeEndpointConfig entries (e.g. from a pre-fix
   run that deleted the cloud detector but left the edge config). Added
   a sister CLI:

       python -m app_benchmark.cleanup_edge --edge-endpoint URL --list
       python -m app_benchmark.cleanup_edge --edge-endpoint URL --wipe

2. host_check was prefix-matching against the `name` field of
   /status/resources.json detectors[] entries, but that endpoint only
   returns `detector_id` (no friendly name). Prefix matching was always
   failing → every loaded detector got flagged. Simplified to
   "any loaded detector = not clean", with the recovery path printed
   in the error message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

EdgeEndpointConfig.detectors is a list of DetectorConfig objects (each
with a .detector_id attribute), not a dict keyed by detector ID. The
previous introspection treated it as a dict and silently returned [],
causing --list to always report "0 configured" and --wipe to be a no-op
when the config actually had detectors.

This is what caused the user's earlier wipe to do nothing — the edge
config still had det_3DM, the controller correctly kept its pods alive,
and we incorrectly attributed it to an edge-endpoint behavior bug.
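
A minimal sketch of the corrected introspection, assuming the list-of-DetectorConfig shape described above (edge URL is illustrative):

```python
from groundlight import ExperimentalApi

gl = ExperimentalApi(endpoint="http://localhost:30101")  # edge URL is illustrative

# EdgeEndpointConfig.detectors is a list of DetectorConfig objects, each with a
# .detector_id attribute, not a dict keyed by detector ID.
configured_ids = [d.detector_id for d in gl.edge.get_config().detectors]
print(f"{len(configured_ids)} configured detector(s): {configured_ids}")
```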

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

multiprocessing.Queue(maxsize=N) creates a POSIX BoundedSemaphore
initialized to N. On macOS, SEM_VALUE_MAX=32767 — passing 100_000 raises
OSError: [Errno 22] Invalid argument when the supervisor spins up.

Cap to 32_000 (frame_queue) and 4_000 (sample_queue). Both are well
above what we actually queue at run-time since metrics_writer drains
continuously on a background thread.
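
A minimal sketch of the cap, using the spawn context the supervisor already uses (queue names match the commit; sizes are the new caps):

```python
import multiprocessing as mp

# macOS POSIX semaphores top out at SEM_VALUE_MAX = 32767, so Queue(maxsize=100_000)
# raises OSError [Errno 22] at construction time. Stay comfortably below the limit.
ctx = mp.get_context("spawn")
frame_queue = ctx.Queue(maxsize=32_000)   # per-frame metric rows, drained by metrics_writer
sample_queue = ctx.Queue(maxsize=4_000)   # monitor resource samples
```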

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three things broken in the actual load loop, all caught by the first
real run:

1. Client was POSTing to /image-queries — actual route per app/api/api.py
   is /device-api/v1/image-queries (mounted via API_BASE_PATH). Verification
   used the SDK so it found the right path; raw-HTTP clients got 404.

2. Monitor process constructed ExperimentalApi() with no endpoint, which
   falls back to GROUNDLIGHT_ENDPOINT — usually the cloud, which 404s on
   /status/resources.json. Pass the edge URL explicitly through Supervisor.

3. Console output was JSON-lines (the JsonFormatter was applied to both
   stdout and run.log). Split: stdout now uses a human-readable format
   ("HH:MM:SS [LEVEL] logger: msg"), run.log keeps the structured JSON
   for machine parsing.
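
A minimal sketch of the logging split in point 3; the JsonFormatter below is a stand-in for the project's own formatter, not a stdlib class:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Stand-in for the harness's structured formatter (illustrative)."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({"ts": self.formatTime(record, "%H:%M:%S"),
                           "level": record.levelname,
                           "logger": record.name,
                           "msg": record.getMessage()})

console = logging.StreamHandler(sys.stdout)          # human-readable console output
console.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s",
                                        datefmt="%H:%M:%S"))

file_handler = logging.FileHandler("run.log")        # structured JSON for machine parsing
file_handler.setFormatter(JsonFormatter())

root = logging.getLogger()
root.setLevel(logging.INFO)
root.handlers = [console, file_handler]
```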

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two related cleanups per review feedback:

1. client.py was POSTing to the edge with raw `requests`. That meant
   reimplementing path resolution, auth headers, and retry — all things
   the Groundlight SDK already does correctly. Replaced with
   `gl.submit_image_query(detector_id, image_bytes, **IQ_KWARGS_FOR_NO_ESCALATION)`,
   matching the canonical pattern in load-testing/simple_ee_test.py.
   The SDK's built-in 5xx/429 retry replaces our manual retry loop;
   from_edge=False still trips a fatal control-plane-drift error.

2. cloud_endpoint default was "https://api.groundlight.ai/device-api".
   The SDK appends /device-api itself, so the suffix was redundant.
   Updated default + all example configs to "https://api.groundlight.ai/".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Audit follow-up to the client.py SDK refactor — replace remaining
indirect-HTTP usages with direct SDK methods where they exist:

- host_check: gl.edge.get_config().detectors instead of
  glh._get_resources()['detectors']. The active config is the
  authoritative source for "what's configured" anyway.

- verification._check_resources_loaded: gl.edge.get_detector_readiness()
  (returns dict[detector_id -> bool]) instead of polling
  /status/resources.json for loading_detectors==0 + ID presence. One
  call answers exactly the question we're asking.

Remaining HTTP via glh:
  - monitor.py: system CPU/RAM/GPU metrics — no SDK equivalent (mandatory)
  - environment.py: container image digests — no SDK equivalent (mandatory)
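
A minimal sketch of the readiness check, assuming the dict[detector_id -> bool] return shape described above (client setup and detector IDs are illustrative):

```python
from groundlight import ExperimentalApi

gl = ExperimentalApi(endpoint="http://localhost:30101")  # edge URL is illustrative
bench_detector_ids = ["det_abc123", "det_def456"]        # IDs created for this run (illustrative)

# One call answers "is every bench detector loaded and ready on the edge?"
readiness = gl.edge.get_detector_readiness()
not_ready = [d for d in bench_detector_ids if not readiness.get(d, False)]
if not_ready:
    raise RuntimeError(f"detectors not ready on the edge: {not_ready}")
```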

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three review-driven cleanups:

1. Network latency baseline: at startup, run `ping -c 5 <edge-host>`
   and parse min/avg/max/stddev RTT. Stored under
   summary.json.environment.network_latency_ms and surfaced in
   summary.md. Tolerant of failure (firewalls, no ping binary). Cloud
   RTT is intentionally NOT measured — cloud is only touched during
   one-time training, not in the steady-state load loop.

2. Drop api_token_env from config schema. The Groundlight SDK reads
   GROUNDLIGHT_API_TOKEN directly; we never need to extract it
   ourselves now that all calls go through the SDK. cli.py asserts
   it's set in the environment as a startup precondition.

3. RAM and VRAM plots use a GB y-axis instead of bytes (and
   summary.md likewise). Easier to read for engineers; we don't
   need byte-level resolution for these graphs.

New file: app_benchmark/network.py
Updated: cli.py, config.py, report.py, all three example configs
TDD/PRD updated to match.
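
A minimal sketch of the RTT baseline from point 1, not the actual network.py implementation (Linux prints "rtt min/avg/max/mdev", macOS "round-trip min/avg/max/stddev"; both match the regex below):

```python
import re
import subprocess

def ping_rtt_ms(host: str, count: int = 5) -> dict[str, float] | None:
    """Best-effort RTT baseline; returns None on any failure (no ping binary, firewall, timeout)."""
    try:
        out = subprocess.run(["ping", "-c", str(count), host],
                             capture_output=True, text=True, timeout=15).stdout
    except (OSError, subprocess.TimeoutExpired):
        return None
    match = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms", out)
    if not match:
        return None
    return dict(zip(("min", "avg", "max", "stddev"), (float(v) for v in match.groups())))
```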

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

submit_image_query polls the cloud for high-confidence answers when
the returned IQ is below confidence_threshold — that fails when the
edge runs in NO_CLOUD mode (no cloud reachable for the IQ poll).

ask_ml returns the first ML prediction without confidence polling or
human-review escalation — exactly what we want for benchmarking
inference throughput.

Applied to:
  - client.py per-frame chain (3 call sites via _submit_via_sdk)
  - verification.py sentinel + final-pass latency measurement

Side effect: ask_ml takes no escalation kwargs (the no-escalation behavior is built in),
so dropped the IQ_KWARGS_FOR_NO_ESCALATION lookup from client and the
unused groundlight_helpers import from verification.
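
A minimal sketch of the call pattern, assuming ask_ml accepts a detector ID and a local image path the same way submit_image_query does (client setup and values are illustrative):

```python
from groundlight import Groundlight

gl = Groundlight(endpoint="http://localhost:30101")   # edge URL is illustrative

# ask_ml returns the first ML prediction: no cloud confidence polling and no
# human-review escalation, so it works in NO_CLOUD mode and measures raw
# inference throughput.
iq = gl.ask_ml(detector="det_abc123", image="frame.jpg")  # any local JPEG
print(iq.result)
```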

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1. Cloud Predictor.name has a 100-char limit. Our prefix
   `{detector_name_prefix}_{run.name}_{spec.name}` plus the
   ~50-char suffix from provision_detector plus ~20 chars of cloud
   sibling-detector overhead pushed long configs over the limit
   (e.g. fire-fall-fence-3lens hit 108 chars).

   Replaced run.name with a 6-char SHA-256 prefix and capped the
   total prefix at 28 chars. cleanup_orphans still matches by the
   user-facing detector_name_prefix (e.g. `bench_*`), so the inner
   hash is invisible to the user. spec.name is preserved for
   readability when it fits, otherwise replaced with an 8-char hash.

2. Pinging `localhost` was hanging on macOS for 15s before our
   timeout fired. Loopback hosts are now short-circuited (no ping,
   no benchmark value anyway since RTT to self is ~0).
   Per-ping timeout also dropped from 2.0s → 1.5s.
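
An illustrative version of the name-shortening scheme in point 1 (helper names and the exact fallback rule are hypothetical; the real logic lives in the harness):

```python
import hashlib

def _short_hash(text: str, n: int) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:n]

def bench_prefix(detector_name_prefix: str, run_name: str, spec_name: str,
                 max_len: int = 28) -> str:
    # run.name collapses to a 6-char SHA-256 prefix; spec.name is kept for
    # readability when the result fits, otherwise replaced with an 8-char hash.
    candidate = f"{detector_name_prefix}_{_short_hash(run_name, 6)}_{spec_name}"
    if len(candidate) <= max_len:
        return candidate
    return f"{detector_name_prefix}_{_short_hash(run_name, 6)}_{_short_hash(spec_name, 8)}"

print(bench_prefix("bench", "fire-fall-fence-3lens", "fire_bbox"))
```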

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The 'Target FPS' column showed `target_fps × cameras` (the aggregate
target the lens-as-a-whole should produce) without saying so. With
target_fps: 5 and cameras: 4 in the YAML, the table read 'Target FPS:
20' which looked like a bug.

The math was correct (we compare achieved aggregate vs target
aggregate to compute deficit %), but the presentation was opaque.

- Add a Cams column so the per-camera × cams = aggregate math is
  self-documenting.
- Split into 'Target FPS (per-cam)' and 'Target FPS (aggr)' columns,
  explicit aggregate suffix on Achieved.
- Add 'saturate' label when target_fps == 0 (no rate limit).
- Add target_fps_aggregate to summary.json for symmetry with
  achieved_fps_aggregate (target_fps_per_camera was already there).
- Add a one-line note above the table pointing readers to
  achieved_fps_per_client in summary.json for per-camera data.
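
For concreteness, the column math plus one plausible form of the deficit computation (the exact formula in report.py may differ):

```python
target_fps_per_camera = 5.0
cameras = 4
target_fps_aggregate = target_fps_per_camera * cameras   # 20.0, the "Target FPS (aggr)" column
achieved_fps_aggregate = 18.4                             # illustrative measured value
deficit_pct = 100 * (1 - achieved_fps_aggregate / target_fps_aggregate)
print(f"deficit: {deficit_pct:.1f}%")                     # 8.0%
```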

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>