From 04b1e8b3551dfa777abaddfa870a176ce7007f64 Mon Sep 17 00:00:00 2001
From: Riccardo Schirone <riccardo.schirone@trailofbits.com>
Date: Thu, 14 May 2026 16:31:48 +0200
Subject: [PATCH 01/10] feat(scripts): add docker-only e2e command for
 example-libpng

Adds scripts/e2e.sh, `make e2e`, and a .claude/commands/e2e.md slash
command that bring the Buttercup stack up via dev/docker-compose
(no Kubernetes), submit the example-libpng task, and monitor the
scheduler / seed-gen / patcher logs through the milestones tracked by
.github/workflows/system-integration.yml (fuzzer build, POV submit/
pass, seed-gen, patch generate / approve / pass, bundle submit, and
optionally SARIF). Defaults LITELLM_MAX_BUDGET to $3 so accidental
runs are cheap; tears the stack down on exit unless --keep-up is set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
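Reviewer note (not part of the commit): the milestone waiter in
scripts/e2e.sh is essentially a poll-and-grep loop. The snippet below is a
minimal sketch of that idea under stated assumptions: `poll_logs` and its
argument order are hypothetical names that do not appear in the patch, and
`printf` stands in for `docker compose logs` so it runs without a stack.

```shell
#!/bin/sh
# poll_logs TIMEOUT_SEC INTERVAL_SEC PATTERN CMD [ARGS...]
# Hypothetical helper: re-run CMD until its output contains a line matching
# PATTERN (grep -E syntax) or TIMEOUT_SEC elapses. Returns 0 on match, 1 on timeout.
poll_logs() {
    timeout="$1"; interval="$2"; pattern="$3"
    shift 3
    deadline=$(( $(date +%s) + timeout ))
    while :; do
        # grep reads the whole stream (no -q / -m1), so the producer is never
        # cut off by a closed pipe and the result is safe under `pipefail`.
        if "$@" 2>/dev/null | grep -E -- "$pattern" >/dev/null; then
            return 0
        fi
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval"
    done
}

# Stand-in producer: two fake log lines instead of real scheduler output.
if poll_logs 5 1 'pov_id=' printf '%s\n' 'build ok' 'POV submission response: pov_id=42'; then
    echo "milestone reached"
fi
```

Replaying the full log on every pass keeps the loop stateless, at the cost of
re-reading history each interval.
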
 .claude/commands/e2e.md |  89 +++++++++
 Makefile                |   8 +-
 scripts/e2e.sh          | 424 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 520 insertions(+), 1 deletion(-)
 create mode 100644 .claude/commands/e2e.md
 create mode 100755 scripts/e2e.sh

diff --git a/.claude/commands/e2e.md b/.claude/commands/e2e.md
new file mode 100644
index 00000000..a364c01b
--- /dev/null
+++ b/.claude/commands/e2e.md
@@ -0,0 +1,89 @@
+---
+description: Run a Docker-only end-to-end smoke test of Buttercup against example-libpng with a low LLM budget, and monitor the pipeline.
+argument-hint: "[--budget N] [--task-duration SEC] [--keep-up] [--no-build] [--skip-wait] [--sarif]"
+allowed-tools: Bash(./scripts/e2e.sh:*), Bash(make e2e*), Bash(docker compose:*), Bash(cd dev/docker-compose && docker compose:*), Read
+---
+
+# /e2e — Docker-only end-to-end Buttercup run (example-libpng)
+
+This command exercises the full Buttercup pipeline on the [example-libpng](https://github.com/tob-challenges/example-libpng) challenge **using Docker only — no Kubernetes/minikube**. It uses the `dev/docker-compose/` stack and a low LiteLLM budget (default **$3**), so an accidental run is cheap.
+
+> **Host requirement:** x86_64. The fuzzer / patcher / seed-gen images build on `gcr.io/oss-fuzz-base/base-runner`, which is amd64-only. On aarch64 the build will fail with `exec format error` unless you install `qemu-user-static` + `binfmt` and set `DOCKER_DEFAULT_PLATFORM=linux/amd64` (and even then everything runs ~10× slower under emulation).
+
+Mirrors the milestones in `.github/workflows/system-integration.yml`, but tails `docker compose logs` instead of `kubectl logs`.
+
+## What it does
+
+1. Checks for `docker`, `docker compose`, `curl`, and at least one LLM provider key (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, or `GEMINI_API_KEY`) in your env.
+2. Writes `dev/docker-compose/.env` with the provider keys and `LITELLM_MAX_BUDGET=$BUDGET` (default `3`).
+3. Builds and starts every service in `dev/docker-compose/compose.yaml` (redis, dind, litellm, task-server, task-downloader, scheduler, program-model, build-bot, fuzzer-bot, coverage-bot, tracer-bot, seed-gen, patcher, buttercup-ui).
+4. POSTs the canned libpng `trigger_task` payload to `http://localhost:31323/webhook/trigger_task`.
+5. Waits, in order, for these scheduler/seed-gen log markers (timeout configurable per phase):
+   - `Processing build output for type FUZZER` — fuzzer build done
+   - `POV submission response: pov_id=` — vulnerability found and POV submitted
+   - `Updated POV status. New status PASSED` — POV accepted by competition API
+   - `Copied N files to corpus` (with N ≥ 1) — seed-gen produced seeds
+   - `Appending patch for task` — patch generated
+   - (action, not a log marker) approves the patch via `POST /v1/task/<task_id>/patch/<patch_id>/approve`
+   - `Patch passed` — patch accepted
+   - `Bundle submission response: bundle_id=` — bundle submitted
+6. With `--sarif`, also sends a SARIF broadcast and waits for `Matching SARIF submission response`.
+7. Prints a colored summary and tears the stack down with `docker compose down -v` (unless `--keep-up`).
+
+## Run it
+
+The driver is `scripts/e2e.sh`. The `Makefile` exposes `make e2e`.
+
+```bash
+# Plain run with the $3 budget default
+make e2e
+
+# Pass flags through the Makefile
+make e2e E2E_ARGS="--budget 5 --keep-up"
+
+# Or call the script directly
+./scripts/e2e.sh --budget 3 --task-duration 1800
+./scripts/e2e.sh --skip-wait --keep-up   # just bring the stack up + submit task
+./scripts/e2e.sh --sarif                 # also exercise the SARIF flow
+```
+
+The script writes/overwrites `dev/docker-compose/.env` on each run.
+
+## Monitoring while it's running
+
+The script already streams milestone progress to its own stdout. For finer-grained visibility while it runs:
+
+```bash
+# All services, follow
+cd dev/docker-compose && docker compose logs -f
+
+# Just the scheduler (most milestones live here)
+cd dev/docker-compose && docker compose logs -f scheduler
+
+# Patcher, seed-gen, fuzzer-bot, program-model
+cd dev/docker-compose && docker compose logs -f patcher seed-gen fuzzer-bot program-model
+
+# LiteLLM spend tracking
+cd dev/docker-compose && docker compose logs -f litellm | grep -i 'spend\|budget'
+```
+
+The web UI is at `http://localhost:31323` (no port-forward needed — it's published on the host).
+
+## Tearing down
+
+```bash
+cd dev/docker-compose && docker compose down -v --remove-orphans
+```
+
+`scripts/e2e.sh` does this automatically on exit unless you pass `--keep-up`.
+
+## When you invoke /e2e
+
+When the user runs `/e2e`, default behavior:
+
+1. Run `./scripts/e2e.sh $ARGUMENTS` (forwarding any flags the user passed).
+2. While it runs, surface key transitions to the user. The script's own output already prints `[e2e] Reached: …` for each milestone — relay those as they arrive.
+3. If the run fails on a milestone, fetch the last ~50 log lines of the relevant service:
+   - `cd dev/docker-compose && docker compose logs --tail=50 <service>`
+4. If the user asks to keep digging, expand the watch with `docker compose logs -f <service>` until the user is satisfied.
+5. On success, summarize the milestones reached and remind the user the stack is already torn down (or still up, if `--keep-up`).
diff --git a/Makefile b/Makefile
index fbbd49e6..ca083f9c 100644
--- a/Makefile
+++ b/Makefile
@@ -1,6 +1,6 @@
 # Makefile for Trail of Bits AIxCC Finals CRS
 
-.PHONY: help setup-local setup-azure validate deploy test undeploy install-cscope lint lint-component clean-local wait-crs check-crs crs-instance-id status send-integration-task
+.PHONY: help setup-local setup-azure validate deploy test undeploy install-cscope lint lint-component clean-local wait-crs check-crs crs-instance-id status send-integration-task e2e
 
 # Default target
 help:
@@ -23,6 +23,7 @@ help:
 	@echo "Testing:"
 	@echo "  send-integration-task  - Run integration-test task"
 	@echo "  send-libpng-task  - Run libpng task"
+	@echo "  e2e                   - Docker-only end-to-end smoke test against example-libpng (low LLM budget)"
 	@echo ""
 	@echo "Development:"
 	@echo "  install-cscope    - Install cscope tool"
@@ -150,6 +151,11 @@ send-libpng-task:
 	./orchestrator/scripts/task_crs.sh; \
 	kill $$PORT_FORWARD_PID 2>/dev/null || true
 
+# Docker-only end-to-end run against example-libpng. No Kubernetes required.
+# Pass extra flags via E2E_ARGS, e.g.:  make e2e E2E_ARGS="--keep-up --budget 5"
+e2e:
+	@./scripts/e2e.sh $(E2E_ARGS)
+
 # Development targets
 lint:
 	@echo "Linting all Python code..."
diff --git a/scripts/e2e.sh b/scripts/e2e.sh
new file mode 100755
index 00000000..2f8cce99
--- /dev/null
+++ b/scripts/e2e.sh
@@ -0,0 +1,424 @@
+#!/usr/bin/env bash
+# scripts/e2e.sh — Run the full Buttercup pipeline against example-libpng using
+# the dev docker-compose stack (no Kubernetes required).
+#
+# This mirrors the milestones checked by .github/workflows/system-integration.yml
+# but reads docker-compose logs instead of `kubectl logs`.
+
+set -u
+set -o pipefail
+
+###############################################################################
+# Config & defaults
+###############################################################################
+
+# Resolve repo root from this script's location.
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)"
+COMPOSE_DIR="${REPO_ROOT}/dev/docker-compose"
+ENV_FILE="${COMPOSE_DIR}/.env"
+
+# Defaults — overridable via flags or environment.
+BUDGET="${LITELLM_MAX_BUDGET:-3}"
+TASK_DURATION="${E2E_TASK_DURATION:-1800}"
+BUILD_TIMEOUT="${E2E_BUILD_TIMEOUT:-1800}"      # seconds  (fuzzer build)
+VULN_TIMEOUT="${E2E_VULN_TIMEOUT:-1800}"
+PATCH_TIMEOUT="${E2E_PATCH_TIMEOUT:-1800}"
+BUNDLE_TIMEOUT="${E2E_BUNDLE_TIMEOUT:-300}"
+SEED_GEN_TIMEOUT="${E2E_SEED_GEN_TIMEOUT:-1800}"
+
+DO_BUILD=1
+DO_TEARDOWN=1
+SKIP_WAIT=0
+TASK_JSON=""    # if set, used instead of the canned libpng payload
+SARIF_RUN=0
+
+###############################################################################
+# Logging
+###############################################################################
+
+if [[ -t 1 ]]; then
+    C_RST=$'\033[0m'; C_RED=$'\033[1;31m'; C_GRN=$'\033[1;32m'
+    C_YLW=$'\033[1;33m'; C_BLU=$'\033[1;36m'; C_DIM=$'\033[2m'
+else
+    C_RST=""; C_RED=""; C_GRN=""; C_YLW=""; C_BLU=""; C_DIM=""
+fi
+
+log()    { printf '%s[e2e]%s %s\n' "$C_BLU" "$C_RST" "$*"; }
+ok()     { printf '%s[e2e]%s %s\n' "$C_GRN" "$C_RST" "$*"; }
+warn()   { printf '%s[e2e]%s %s\n' "$C_YLW" "$C_RST" "$*" >&2; }
+err()    { printf '%s[e2e]%s %s\n' "$C_RED" "$C_RST" "$*" >&2; }
+dim()    { printf '%s[e2e]%s %s%s%s\n' "$C_BLU" "$C_RST" "$C_DIM" "$*" "$C_RST"; }
+
+###############################################################################
+# Usage
+###############################################################################
+
+usage() {
+    cat <<EOF
+Usage: scripts/e2e.sh [options]
+
+Runs an end-to-end smoke test of Buttercup against example-libpng using
+docker-compose (no Kubernetes). Monitors scheduler/seed-gen logs for the
+milestones tracked by .github/workflows/system-integration.yml.
+
+Options:
+  --budget DOLLARS          LiteLLM per-user max budget (default: $BUDGET)
+  --task-duration SECONDS   How long the CRS should fuzz (default: $TASK_DURATION)
+  --task-json FILE          Custom trigger_task payload (default: example-libpng)
+  --no-build                Skip 'docker compose build' (use existing images)
+  --keep-up                 Don't tear the stack down on exit (for debugging)
+  --skip-wait               Bring the stack up and submit the task, but don't
+                            block waiting on milestones (returns immediately)
+  --sarif                   Also submit a SARIF broadcast after the patch
+                            passes and wait for the matching SARIF response
+  --build-timeout SEC       Override fuzzer-build milestone timeout (default $BUILD_TIMEOUT)
+  --vuln-timeout SEC        Override vuln milestone timeout (default $VULN_TIMEOUT)
+  --patch-timeout SEC       Override patch milestone timeout (default $PATCH_TIMEOUT)
+  --bundle-timeout SEC      Override bundle milestone timeout (default $BUNDLE_TIMEOUT)
+  --seed-gen-timeout SEC    Override seed-gen milestone timeout (default $SEED_GEN_TIMEOUT)
+  -h, --help                Print this help
+
+Required environment (at least one provider key; the LiteLLM master key has a default):
+  ANTHROPIC_API_KEY   and/or   OPENAI_API_KEY   and/or   GEMINI_API_KEY
+  BUTTERCUP_LITELLM_KEY  (optional, defaults to sk-1234 for local runs)
+
+Optional:
+  LANGFUSE_HOST, LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY
+
+The script writes ${ENV_FILE} from the values above each run.
+EOF
+}
+
+###############################################################################
+# Argument parsing
+###############################################################################
+
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --budget)            BUDGET="$2"; shift 2 ;;
+        --task-duration)     TASK_DURATION="$2"; shift 2 ;;
+        --task-json)         TASK_JSON="$(cat "$2")"; shift 2 ;;
+        --no-build)          DO_BUILD=0; shift ;;
+        --keep-up)           DO_TEARDOWN=0; shift ;;
+        --skip-wait)         SKIP_WAIT=1; shift ;;
+        --sarif)             SARIF_RUN=1; shift ;;
+        --build-timeout)     BUILD_TIMEOUT="$2"; shift 2 ;;
+        --vuln-timeout)      VULN_TIMEOUT="$2"; shift 2 ;;
+        --patch-timeout)     PATCH_TIMEOUT="$2"; shift 2 ;;
+        --bundle-timeout)    BUNDLE_TIMEOUT="$2"; shift 2 ;;
+        --seed-gen-timeout)  SEED_GEN_TIMEOUT="$2"; shift 2 ;;
+        -h|--help)           usage; exit 0 ;;
+        *) err "Unknown argument: $1"; usage; exit 2 ;;
+    esac
+done
+
+###############################################################################
+# Pre-flight checks
+###############################################################################
+
+if ! command -v docker >/dev/null 2>&1; then
+    err "docker is required but not installed."
+    exit 1
+fi
+if ! docker compose version >/dev/null 2>&1; then
+    err "'docker compose' v2 is required (not 'docker-compose')."
+    exit 1
+fi
+if ! command -v curl >/dev/null 2>&1; then
+    err "curl is required but not installed."
+    exit 1
+fi
+
+provider_keys_set=0
+for v in ANTHROPIC_API_KEY OPENAI_API_KEY GEMINI_API_KEY; do
+    val="${!v:-}"
+    if [[ -n "$val" && "$val" != "<INSERT_KEY>" ]]; then
+        provider_keys_set=1
+    fi
+done
+if [[ "$provider_keys_set" -eq 0 ]]; then
+    err "No LLM provider key found in env. Set ANTHROPIC_API_KEY, OPENAI_API_KEY, or GEMINI_API_KEY."
+    err "Tip: 'export ANTHROPIC_API_KEY=...; scripts/e2e.sh'. ${ENV_FILE} is regenerated from the environment on every run, so keys set only there are ignored."
+    exit 1
+fi
+
+# If keys are missing, leave them at the placeholder so litellm still loads the
+# config (some models will fail at request time, others will succeed).
+: "${ANTHROPIC_API_KEY:=<INSERT_KEY>}"
+: "${OPENAI_API_KEY:=<INSERT_KEY>}"
+: "${GEMINI_API_KEY:=<INSERT_KEY>}"
+: "${AZURE_API_BASE:=<INSERT_HOST>}"
+: "${AZURE_API_KEY:=<INSERT_KEY>}"
+: "${BUTTERCUP_LITELLM_KEY:=sk-1234}"
+: "${LANGFUSE_HOST:=}"
+: "${LANGFUSE_PUBLIC_KEY:=}"
+: "${LANGFUSE_SECRET_KEY:=}"
+
+###############################################################################
+# .env generation
+###############################################################################
+
+log "Writing ${ENV_FILE} (LITELLM_MAX_BUDGET=\$${BUDGET})"
+{
+    echo "# Generated by scripts/e2e.sh on $(date -Is)"
+    echo "BUTTERCUP_LITELLM_KEY=${BUTTERCUP_LITELLM_KEY}"
+    echo "ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}"
+    echo "OPENAI_API_KEY=${OPENAI_API_KEY}"
+    echo "GEMINI_API_KEY=${GEMINI_API_KEY}"
+    echo "AZURE_API_BASE=${AZURE_API_BASE}"
+    echo "AZURE_API_KEY=${AZURE_API_KEY}"
+    echo "LITELLM_MAX_BUDGET=${BUDGET}"
+    echo "LANGFUSE_HOST=${LANGFUSE_HOST}"
+    echo "LANGFUSE_PUBLIC_KEY=${LANGFUSE_PUBLIC_KEY}"
+    echo "LANGFUSE_SECRET_KEY=${LANGFUSE_SECRET_KEY}"
+} > "$ENV_FILE"
+
+###############################################################################
+# docker compose helpers
+###############################################################################
+
+# Always run compose from the compose dir so relative includes resolve.
+dc() {
+    (cd "$COMPOSE_DIR" && docker compose "$@")
+}
+
+teardown() {
+    if [[ "$DO_TEARDOWN" -eq 1 ]]; then
+        log "Tearing the stack down (docker compose down -v)"
+        dc down -v --remove-orphans || true
+    else
+        warn "Leaving the stack up (--keep-up). Tear down with: cd ${COMPOSE_DIR} && docker compose down -v"
+    fi
+}
+
+on_exit() {
+    rc=$?
+    teardown
+    if [[ $rc -ne 0 ]]; then
+        err "e2e run finished with exit code $rc"
+    fi
+    exit $rc
+}
+trap on_exit EXIT; trap 'exit 130' INT; trap 'exit 143' TERM   # signals exit once; teardown runs only via EXIT
+
+###############################################################################
+# Bring the stack up
+###############################################################################
+
+if [[ "$DO_BUILD" -eq 1 ]]; then
+    log "Building docker compose images (this can take a while the first time)"
+    if ! dc build; then
+        err "docker compose build failed. On non-x86_64 hosts this usually means an"
+        err "image (e.g. fuzzer/Dockerfile -> gcr.io/oss-fuzz-base/base-runner) is amd64-only."
+        err "Inspect the build output above; retry on an x86_64 host, or install"
+        err "qemu-user-static + binfmt and re-run with DOCKER_DEFAULT_PLATFORM=linux/amd64."
+        exit 1
+    fi
+fi
+
+log "Starting services"
+if ! dc up -d; then
+    err "docker compose up failed. Check 'docker compose ps' / logs."
+    exit 1
+fi
+
+# Wait for the buttercup-ui task webhook to be reachable.
+log "Waiting for buttercup-ui to accept connections on http://localhost:31323"
+ui_up=0
+for _ in $(seq 1 120); do
+    if curl -sf -o /dev/null -m 2 http://localhost:31323/v1/ping/ 2>/dev/null \
+        || curl -sf -o /dev/null -m 2 http://localhost:31323/ 2>/dev/null; then
+        ui_up=1; break
+    fi
+    sleep 2
+done
+if [[ "$ui_up" -ne 1 ]]; then
+    err "buttercup-ui did not come up on port 31323. Check 'docker compose logs buttercup-ui'."
+    exit 1
+fi
+ok "buttercup-ui is up."
+
+###############################################################################
+# Submit the task
+###############################################################################
+
+if [[ -z "$TASK_JSON" ]]; then
+    TASK_JSON=$(cat <<EOF
+{
+    "challenge_repo_url": "https://github.com/tob-challenges/example-libpng",
+    "challenge_repo_base_ref": "5bf8da2d7953974e5dfbd778429c3affd461f51a",
+    "challenge_repo_head_ref": "challenges/lp-delta-01",
+    "fuzz_tooling_url": "https://github.com/trail-of-forks/oss-fuzz",
+    "fuzz_tooling_ref": "fix-libpng",
+    "fuzz_tooling_project_name": "libpng",
+    "duration": ${TASK_DURATION}
+}
+EOF
+)
+fi
+
+log "Submitting task to buttercup-ui /webhook/trigger_task"
+http_code=$(curl -s -o /tmp/e2e_task_resp.$$ -w '%{http_code}' \
+    -X POST 'http://127.0.0.1:31323/webhook/trigger_task' \
+    -H 'Content-Type: application/json' \
+    -d "$TASK_JSON")
+resp_body=$(cat /tmp/e2e_task_resp.$$ || true)
+rm -f /tmp/e2e_task_resp.$$
+if [[ "$http_code" != "200" && "$http_code" != "201" ]]; then
+    err "trigger_task returned HTTP $http_code: $resp_body"
+    exit 1
+fi
+ok "Task accepted (HTTP $http_code). ${C_DIM}${resp_body}${C_RST}"
+
+if [[ "$SKIP_WAIT" -eq 1 ]]; then
+    ok "--skip-wait set; not waiting on milestones."
+    DO_TEARDOWN=0
+    exit 0
+fi
+
+###############################################################################
+# Milestone waiters
+###############################################################################
+
+# wait_for SERVICE PATTERN TIMEOUT_SEC LABEL
+#
+# Polls `docker compose logs <SERVICE>` every 15 seconds until a line matching
+# PATTERN (a grep -E regex) appears or TIMEOUT_SEC elapses. Returns 0 on success, 1 on timeout.
+wait_for() {
+    local service="$1" pattern="$2" timeout="$3" label="$4"
+    local deadline=$(( $(date +%s) + timeout ))
+    log "Waiting for milestone: ${label}  ${C_DIM}(service=${service}, timeout=${timeout}s)${C_RST}"
+
+    while [[ $(date +%s) -lt $deadline ]]; do
+        # --no-color keeps lines plain for grep; --tail=all replays history. No -m1: grep must drain the pipe, or pipefail turns the producer's SIGPIPE into a miss.
+        if dc logs --no-color --no-log-prefix --tail=all "$service" 2>/dev/null \
+            | grep -E "$pattern" >/dev/null; then
+            ok "Reached: ${label}"
+            return 0
+        fi
+        sleep 15
+    done
+
+    err "Timed out after ${timeout}s waiting for: ${label}"
+    err "Recent logs from ${service}:"
+    dc logs --no-color --tail=50 "$service" >&2 || true
+    return 1
+}
+
+# Capture a single matching log line (returns it on stdout, empty on miss).
+capture_line() {
+    local service="$1" pattern="$2"
+    dc logs --no-color --no-log-prefix --tail=all "$service" 2>/dev/null \
+        | grep -E "$pattern" | head -n1 || true
+}
+
+###############################################################################
+# Walk through the pipeline
+###############################################################################
+
+declare -a SUMMARY=()
+record() { SUMMARY+=("$1"); }
+
+wait_for scheduler \
+    "Processing build output for type FUZZER" \
+    "$BUILD_TIMEOUT" "fuzzer build processed" \
+    && record "fuzzer-build: ok" || record "fuzzer-build: TIMEOUT"
+
+wait_for scheduler \
+    "POV submission response: pov_id=" \
+    "$VULN_TIMEOUT" "vulnerability (POV) submitted" \
+    && record "pov-submit: ok" || record "pov-submit: TIMEOUT"
+
+wait_for scheduler \
+    "Updated POV status. New status PASSED" \
+    "$VULN_TIMEOUT" "POV accepted by competition API" \
+    && record "pov-passed: ok" || record "pov-passed: TIMEOUT"
+
+wait_for seed-gen \
+    "Copied [1-9][0-9]* files to corpus" \
+    "$SEED_GEN_TIMEOUT" "seed-gen produced seeds" \
+    && record "seed-gen: ok" || record "seed-gen: TIMEOUT"
+
+wait_for scheduler \
+    "Appending patch for task" \
+    "$PATCH_TIMEOUT" "patch generated" \
+    && record "patch-generated: ok" || record "patch-generated: TIMEOUT"
+
+# Approve the patch (the local UI requires explicit approval, unlike scored
+# rounds where it is automatic).
+PATCH_LINE="$(capture_line scheduler 'competition_patch_id=')"
+if [[ -n "$PATCH_LINE" ]]; then
+    PATCH_ID=$(printf '%s' "$PATCH_LINE" | sed -n 's/.*competition_patch_id=\([^ ]*\).*/\1/p')
+    # Task id is inside the first [...] block, after the last ':' (same parsing as the SARIF path below).
+    TASK_ID=$(printf '%s' "$PATCH_LINE" | grep -o '\[[^]]*\]' | head -n1 | tr -d '[]' | awk -F: '{print $NF}')
+    if [[ -n "$PATCH_ID" && -n "$TASK_ID" ]]; then
+        log "Approving patch ${C_DIM}task=${TASK_ID} patch=${PATCH_ID}${C_RST}"
+        curl -fsS -X POST \
+            "http://127.0.0.1:31323/v1/task/${TASK_ID}/patch/${PATCH_ID}/approve" \
+            >/dev/null && record "patch-approve: ok" || record "patch-approve: HTTP fail"
+    else
+        warn "Could not extract patch/task ids from: $PATCH_LINE"
+        record "patch-approve: skipped (parse fail)"
+    fi
+else
+    warn "No competition_patch_id= line seen; skipping approval"
+    record "patch-approve: skipped (no patch line)"
+fi
+
+wait_for scheduler \
+    "Patch passed" \
+    "$PATCH_TIMEOUT" "patch accepted by competition API" \
+    && record "patch-passed: ok" || record "patch-passed: TIMEOUT"
+
+wait_for scheduler \
+    "Bundle submission response: bundle_id=" \
+    "$BUNDLE_TIMEOUT" "bundle submitted" \
+    && record "bundle-submit: ok" || record "bundle-submit: TIMEOUT"
+
+if [[ "$SARIF_RUN" -eq 1 ]]; then
+    SARIF_TASK_ID="${TASK_ID:-}"
+    if [[ -z "$SARIF_TASK_ID" ]]; then
+        SARIF_TASK_ID=$(dc logs --no-color --no-log-prefix --tail=all scheduler \
+            | grep "Submitting bundle for harness" | head -n1 \
+            | grep -o "\[[^]]*\]" | head -n1 \
+            | tr -d '[]' | awk -F: '{print $NF}')
+    fi
+    if [[ -n "$SARIF_TASK_ID" ]]; then
+        log "Sending SARIF broadcast for task ${SARIF_TASK_ID}"
+        if "${REPO_ROOT}/orchestrator/scripts/send_sarif.sh" "$SARIF_TASK_ID" >/dev/null 2>&1; then
+            record "sarif-send: ok"
+        else
+            record "sarif-send: HTTP fail"
+        fi
+        wait_for scheduler \
+            "Matching SARIF submission response" \
+            "$BUNDLE_TIMEOUT" "SARIF accepted" \
+            && record "sarif-passed: ok" || record "sarif-passed: TIMEOUT"
+    else
+        record "sarif: skipped (no task id)"
+    fi
+fi
+
+###############################################################################
+# Summary
+###############################################################################
+
+printf '\n%s===================== e2e summary =====================%s\n' "$C_BLU" "$C_RST"
+for line in "${SUMMARY[@]}"; do
+    if [[ "$line" == *": ok" ]]; then
+        printf '  %s✓%s %s\n' "$C_GRN" "$C_RST" "$line"
+    elif [[ "$line" == *": TIMEOUT" || "$line" == *"fail"* ]]; then
+        printf '  %s✗%s %s\n' "$C_RED" "$C_RST" "$line"
+    else
+        printf '  %s•%s %s\n' "$C_YLW" "$C_RST" "$line"
+    fi
+done
+printf '%s=======================================================%s\n' "$C_BLU" "$C_RST"
+
+# Exit non-zero if any milestone failed.
+for line in "${SUMMARY[@]}"; do
+    if [[ "$line" == *": TIMEOUT" || "$line" == *"fail"* ]]; then
+        exit 1
+    fi
+done

From f1ae7073e84708c2f315e382d8140670ea56b4ad Mon Sep 17 00:00:00 2001
From: Riccardo Schirone <riccardo.schirone@trailofbits.com>
Date: Fri, 15 May 2026 09:40:06 +0000
Subject: [PATCH 02/10] feat(scripts): run e2e via prebuilt GHCR images instead
 of local build

The e2e driver now brings the stack up through the compose.prebuilt.yaml
overlay and `docker compose pull` (tag configurable via --image-tag /
BUTTERCUP_IMAGE_TAG, default "main") instead of `docker compose build`,
so a run no longer depends on a working local image build (e.g. the
cscope submodule / oss-fuzz base-runner build chain).

- dc() applies `-f compose.yaml -f compose.prebuilt.yaml` and exports
  BUTTERCUP_IMAGE_TAG for every compose subcommand (pull/up/logs/down).
- --no-build kept as a deprecated alias for the new --no-pull.
- Teardown hint and e2e.md updated for the overlay.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
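Reviewer note (not part of the commit): the overlay wrapper described above
can be dry-run without Docker. The snippet below is a minimal sketch under
stated assumptions: `dc_overlay` and the `COMPOSE_BIN` override are
hypothetical names introduced here for illustration; the patch itself calls
`docker compose` directly inside `dc()`.

```shell
#!/bin/sh
# Image tag resolution as in the patch: env var first, "main" as the fallback.
IMAGE_TAG="${BUTTERCUP_IMAGE_TAG:-main}"
# Hypothetical override so the command line can be printed instead of executed.
COMPOSE_BIN="${COMPOSE_BIN:-docker}"

dc_overlay() {
    # Compose merges multiple -f files in order; the later overlay file
    # overrides or extends the service definitions of the base file.
    BUTTERCUP_IMAGE_TAG="$IMAGE_TAG" \
        "$COMPOSE_BIN" compose -f compose.yaml -f compose.prebuilt.yaml "$@"
}

# Dry run: print the command line that a pull would use.
COMPOSE_BIN=echo dc_overlay pull
```

Because every subcommand goes through the same wrapper, `pull`, `up`, `logs`,
and `down` all see the same file set and the same tag.
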
 .claude/commands/e2e.md | 10 ++++++----
 scripts/e2e.sh          | 40 ++++++++++++++++++++++++++++------------
 2 files changed, 34 insertions(+), 16 deletions(-)

diff --git a/.claude/commands/e2e.md b/.claude/commands/e2e.md
index a364c01b..1e94f492 100644
--- a/.claude/commands/e2e.md
+++ b/.claude/commands/e2e.md
@@ -1,14 +1,16 @@
 ---
 description: Run a Docker-only end-to-end smoke test of Buttercup against example-libpng with a low LLM budget, and monitor the pipeline.
-argument-hint: "[--budget N] [--task-duration SEC] [--keep-up] [--no-build] [--skip-wait] [--sarif]"
+argument-hint: "[--budget N] [--task-duration SEC] [--image-tag TAG] [--keep-up] [--no-pull] [--skip-wait] [--sarif]"
 allowed-tools: Bash(./scripts/e2e.sh:*), Bash(make e2e*), Bash(docker compose:*), Bash(cd dev/docker-compose && docker compose:*), Read
 ---
 
 # /e2e — Docker-only end-to-end Buttercup run (example-libpng)
 
-This command exercises the full Buttercup pipeline on the [example-libpng](https://github.com/tob-challenges/example-libpng) challenge **using Docker only — no Kubernetes/minikube**. It uses the `dev/docker-compose/` stack and a low LiteLLM budget (default **$3**), so an accidental run is cheap.
+This command exercises the full Buttercup pipeline on the [example-libpng](https://github.com/tob-challenges/example-libpng) challenge **using Docker only — no Kubernetes/minikube**. It uses the `dev/docker-compose/` stack with the **`compose.prebuilt.yaml` overlay** — every component runs from its prebuilt GHCR image (`ghcr.io/trailofbits/buttercup/*`, tag `main` by default), so **nothing is built locally**. A low LiteLLM budget (default **$3**) keeps an accidental run cheap.
 
-> **Host requirement:** x86_64. The fuzzer / patcher / seed-gen images build on `gcr.io/oss-fuzz-base/base-runner`, which is amd64-only. On aarch64 the build will fail with `exec format error` unless you install `qemu-user-static` + `binfmt` and set `DOCKER_DEFAULT_PLATFORM=linux/amd64` (and even then everything runs ~10× slower under emulation).
+> **Image tag:** defaults to `main`. Override with `--image-tag <branch-or-tag>` or `BUTTERCUP_IMAGE_TAG=...` to test a specific build. Private images require `docker login ghcr.io` first.
+>
+> **Host requirement:** x86_64. The prebuilt fuzzer / patcher / seed-gen images are based on `gcr.io/oss-fuzz-base/base-runner`, which is amd64-only. On aarch64 they only run under `qemu-user-static` + `binfmt` with `DOCKER_DEFAULT_PLATFORM=linux/amd64` (and ~10× slower).
 
 Mirrors the milestones in `.github/workflows/system-integration.yml`, but tails `docker compose logs` instead of `kubectl logs`.
 
@@ -16,7 +18,7 @@ Mirrors the milestones in `.github/workflows/system-integration.yml`, but tails
 
 1. Checks for `docker`, `docker compose`, `curl`, and at least one LLM provider key (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, or `GEMINI_API_KEY`) in your env.
 2. Writes `dev/docker-compose/.env` with the provider keys and `LITELLM_MAX_BUDGET=$BUDGET` (default `3`).
-3. Builds and starts every service in `dev/docker-compose/compose.yaml` (redis, dind, litellm, task-server, task-downloader, scheduler, program-model, build-bot, fuzzer-bot, coverage-bot, tracer-bot, seed-gen, patcher, buttercup-ui).
+3. Pulls the prebuilt component images (`docker compose -f compose.yaml -f compose.prebuilt.yaml pull`, skippable with `--no-pull`) and starts every service (redis, dind, litellm, task-server, task-downloader, scheduler, program-model, build-bot, fuzzer-bot, coverage-bot, tracer-bot, seed-gen, patcher, buttercup-ui). No local image build.
 4. POSTs the canned libpng `trigger_task` payload to `http://localhost:31323/webhook/trigger_task`.
 5. Waits, in order, for these scheduler/seed-gen log markers (timeout configurable per phase):
    - `Processing build output for type FUZZER` — fuzzer build done
diff --git a/scripts/e2e.sh b/scripts/e2e.sh
index 2f8cce99..7fa0ed0e 100755
--- a/scripts/e2e.sh
+++ b/scripts/e2e.sh
@@ -2,6 +2,10 @@
 # scripts/e2e.sh — Run the full Buttercup pipeline against example-libpng using
 # the dev docker-compose stack (no Kubernetes required).
 #
+# Uses the prebuilt component images published to GHCR (via the
+# compose.prebuilt.yaml overlay) instead of building them locally, so a run
+# does not depend on a working local image build.
+#
 # This mirrors the milestones checked by .github/workflows/system-integration.yml
 # but reads docker-compose logs instead of `kubectl logs`.
 
@@ -27,7 +31,10 @@ PATCH_TIMEOUT="${E2E_PATCH_TIMEOUT:-1800}"
 BUNDLE_TIMEOUT="${E2E_BUNDLE_TIMEOUT:-300}"
 SEED_GEN_TIMEOUT="${E2E_SEED_GEN_TIMEOUT:-1800}"
 
-DO_BUILD=1
+# Prebuilt GHCR images instead of local builds (compose.prebuilt.yaml overlay).
+IMAGE_TAG="${BUTTERCUP_IMAGE_TAG:-main}"
+
+DO_PULL=1
 DO_TEARDOWN=1
 SKIP_WAIT=0
 TASK_JSON=""    # if set, used instead of the canned libpng payload
@@ -66,7 +73,9 @@ Options:
   --budget DOLLARS          LiteLLM per-user max budget (default: $BUDGET)
   --task-duration SECONDS   How long the CRS should fuzz (default: $TASK_DURATION)
   --task-json FILE          Custom trigger_task payload (default: example-libpng)
-  --no-build                Skip 'docker compose build' (use existing images)
+  --image-tag TAG           Prebuilt GHCR image tag to run (default: $IMAGE_TAG)
+  --no-pull                 Skip 'docker compose pull' (use already-pulled images)
+  --no-build                Deprecated alias for --no-pull (no local build happens)
   --keep-up                 Don't tear the stack down on exit (for debugging)
   --skip-wait               Bring the stack up and submit the task, but don't
                             block waiting on milestones (returns immediately)
@@ -84,6 +93,7 @@ Required environment (at least one provider key, plus litellm master key):
   BUTTERCUP_LITELLM_KEY  (optional, defaults to sk-1234 for local runs)
 
 Optional:
+  BUTTERCUP_IMAGE_TAG  Prebuilt GHCR image tag (default: main; same as --image-tag)
   LANGFUSE_HOST, LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY
 
 The script writes ${ENV_FILE} from the values above each run.
@@ -99,7 +109,9 @@ while [[ $# -gt 0 ]]; do
         --budget)            BUDGET="$2"; shift 2 ;;
         --task-duration)     TASK_DURATION="$2"; shift 2 ;;
         --task-json)         TASK_JSON="$(cat "$2")"; shift 2 ;;
-        --no-build)          DO_BUILD=0; shift ;;
+        --image-tag)         IMAGE_TAG="$2"; shift 2 ;;
+        --no-pull)           DO_PULL=0; shift ;;
+        --no-build)          DO_PULL=0; shift ;;   # deprecated alias
         --keep-up)           DO_TEARDOWN=0; shift ;;
         --skip-wait)         SKIP_WAIT=1; shift ;;
         --sarif)             SARIF_RUN=1; shift ;;
@@ -179,8 +191,12 @@ log "Writing ${ENV_FILE} (LITELLM_MAX_BUDGET=\$${BUDGET})"
 ###############################################################################
 
 # Always run compose from the compose dir so relative includes resolve.
+# The compose.prebuilt.yaml overlay swaps every locally-built service for its
+# prebuilt GHCR image, so nothing is built locally.
 dc() {
-    (cd "$COMPOSE_DIR" && docker compose "$@")
+    (cd "$COMPOSE_DIR" \
+        && BUTTERCUP_IMAGE_TAG="$IMAGE_TAG" \
+           docker compose -f compose.yaml -f compose.prebuilt.yaml "$@")
 }
 
 teardown() {
@@ -188,7 +204,7 @@ teardown() {
         log "Tearing the stack down (docker compose down -v)"
         dc down -v --remove-orphans || true
     else
-        warn "Leaving the stack up (--keep-up). Tear down with: cd ${COMPOSE_DIR} && docker compose down -v"
+        warn "Leaving the stack up (--keep-up). Tear down with: cd ${COMPOSE_DIR} && docker compose -f compose.yaml -f compose.prebuilt.yaml down -v"
     fi
 }
 
@@ -206,13 +222,13 @@ trap on_exit EXIT INT TERM
 # Bring the stack up
 ###############################################################################
 
-if [[ "$DO_BUILD" -eq 1 ]]; then
-    log "Building docker compose images (this can take a while the first time)"
-    if ! dc build; then
-        err "docker compose build failed. On non-x86_64 hosts this usually means an"
-        err "image (e.g. fuzzer/Dockerfile -> gcr.io/oss-fuzz-base/base-runner) is amd64-only."
-        err "Inspect the build output above; retry on an x86_64 host, or install"
-        err "qemu-user-static + binfmt and re-run with DOCKER_DEFAULT_PLATFORM=linux/amd64."
+if [[ "$DO_PULL" -eq 1 ]]; then
+    log "Pulling prebuilt component images from GHCR (tag: ${IMAGE_TAG})"
+    if ! dc pull; then
+        err "docker compose pull failed for tag '${IMAGE_TAG}'."
+        err "Check that the tag exists at ghcr.io/trailofbits/buttercup/* and that"
+        err "you can reach GHCR (private images need 'docker login ghcr.io')."
+        err "Override with --image-tag <branch-or-tag> or BUTTERCUP_IMAGE_TAG=..."
         exit 1
     fi
 fi

From a25a525ad45b8e55e068ef1867ec0c4b5c51792e Mon Sep 17 00:00:00 2001
From: Riccardo Schirone <riccardo.schirone@trailofbits.com>
Date: Fri, 15 May 2026 12:31:40 +0000
Subject: [PATCH 03/10] fix(scripts): preserve existing .env values in e2e.sh

e2e.sh regenerates dev/docker-compose/.env from scratch every run,
sourcing values only from environment variables. Variables not exported
(notably LANGFUSE_HOST/PUBLIC_KEY/SECRET_KEY) were defaulted to empty and
written back, clobbering values a user had set directly in .env.

Add prev_env() and a 3-tier resolution: environment > existing .env >
placeholder. Manually-set .env values (Langfuse creds, provider keys,
litellm key) now survive subsequent runs.
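
A minimal sketch of the resolution order described above, not the exact
code in e2e.sh. The temp file stands in for dev/docker-compose/.env, and
the key values and the langfuse.example host are made up for
illustration:

```shell
# Hypothetical stand-in for dev/docker-compose/.env so the sketch is self-contained.
ENV_FILE="$(mktemp)"
printf 'LANGFUSE_HOST=https://langfuse.example\nOPENAI_API_KEY=from-dotenv\n' > "$ENV_FILE"

# Print the value stored for key $1 in $ENV_FILE, or nothing if it is absent.
prev_env() {
    [ -f "$ENV_FILE" ] || return 0
    sed -n "s/^$1=//p" "$ENV_FILE" | head -n1
}

# Simulated caller environment: one key set, the others unset.
OPENAI_API_KEY="from-environment"
unset ANTHROPIC_API_KEY LANGFUSE_HOST

# Tiers 1 and 2: keep the environment value if set, else read the old .env.
: "${OPENAI_API_KEY:=$(prev_env OPENAI_API_KEY)}"
: "${LANGFUSE_HOST:=$(prev_env LANGFUSE_HOST)}"
: "${ANTHROPIC_API_KEY:=$(prev_env ANTHROPIC_API_KEY)}"

# Tier 3: placeholder for anything still empty.
: "${ANTHROPIC_API_KEY:=<INSERT_KEY>}"

printf '%s\n' "$OPENAI_API_KEY" "$LANGFUSE_HOST" "$ANTHROPIC_API_KEY"
rm -f "$ENV_FILE"
```

The `:=` expansion only assigns when the variable is unset or empty,
which is what lets each tier defer to the one before it.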

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 scripts/e2e.sh | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/scripts/e2e.sh b/scripts/e2e.sh
index 7fa0ed0e..44dc6dbe 100755
--- a/scripts/e2e.sh
+++ b/scripts/e2e.sh
@@ -155,8 +155,29 @@ if [[ "$provider_keys_set" -eq 0 ]]; then
     exit 1
 fi
 
-# If keys are missing, leave them at the placeholder so litellm still loads the
-# config (some models will fail at request time, others will succeed).
+# Read a value already present in the existing .env. Used so that variables
+# not provided via the environment (e.g. LANGFUSE_*) are preserved across runs
+# instead of being clobbered with empty/placeholder values, since this script
+# regenerates .env from scratch on every run.
+prev_env() {
+    [[ -f "$ENV_FILE" ]] || return 0
+    sed -n "s/^$1=//p" "$ENV_FILE" | head -n1
+}
+
+# 1) Prefer the environment; 2) fall back to whatever is already in .env.
+: "${ANTHROPIC_API_KEY:=$(prev_env ANTHROPIC_API_KEY)}"
+: "${OPENAI_API_KEY:=$(prev_env OPENAI_API_KEY)}"
+: "${GEMINI_API_KEY:=$(prev_env GEMINI_API_KEY)}"
+: "${AZURE_API_BASE:=$(prev_env AZURE_API_BASE)}"
+: "${AZURE_API_KEY:=$(prev_env AZURE_API_KEY)}"
+: "${BUTTERCUP_LITELLM_KEY:=$(prev_env BUTTERCUP_LITELLM_KEY)}"
+: "${LANGFUSE_HOST:=$(prev_env LANGFUSE_HOST)}"
+: "${LANGFUSE_PUBLIC_KEY:=$(prev_env LANGFUSE_PUBLIC_KEY)}"
+: "${LANGFUSE_SECRET_KEY:=$(prev_env LANGFUSE_SECRET_KEY)}"
+
+# 3) Final placeholders if still unset after both env and .env. Keys left at
+# the placeholder so litellm still loads its config (some models will fail at
+# request time, others will succeed). LANGFUSE_* stay empty (telemetry off).
 : "${ANTHROPIC_API_KEY:=<INSERT_KEY>}"
 : "${OPENAI_API_KEY:=<INSERT_KEY>}"
 : "${GEMINI_API_KEY:=<INSERT_KEY>}"

From 7616b37953c9c1a0476be7888e2d9b03a08ef0fc Mon Sep 17 00:00:00 2001
From: Riccardo Schirone <riccardo.schirone@trailofbits.com>
Date: Fri, 15 May 2026 14:45:42 +0000
Subject: [PATCH 04/10] fix(scripts): use explicit if-then-else in e2e.sh to
 satisfy shellcheck

Replace the `wait_for ... && record ok || record TIMEOUT` and
`curl ... && record ok || record fail` constructs with explicit
if-then-else blocks. shellcheck flagged these as SC2015 (A && B || C
is not if-then-else), causing the "Lint shell scripts" step in the
Static Checks workflow to fail. Behavior is unchanged.
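
For reference, a small sketch of what SC2015 warns about. The helpers
below are hypothetical; record_ok is made to fail on purpose. In e2e.sh
record() only appends to an array and does not fail, which is why the
rewrite does not change behavior there:

```shell
LOG=""
# Hypothetical helpers for illustration only.
check()          { return 0; }                     # the condition succeeds
record_ok()      { LOG="${LOG}ok "; return 1; }    # records, then fails
record_timeout() { LOG="${LOG}TIMEOUT "; }

# A && B || C: C also runs when A succeeds but B fails.
check && record_ok || record_timeout
chained="$LOG"

# if-then-else: the else branch depends on the condition alone.
LOG=""
if check; then
    record_ok || true    # ignore the simulated failure
else
    record_timeout
fi
explicit="$LOG"

printf 'chained=[%s] explicit=[%s]\n' "$chained" "$explicit"
```

The chained form records both outcomes, while the explicit form records
only the success branch.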

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 scripts/e2e.sh | 80 ++++++++++++++++++++++++++++++++++----------------
 1 file changed, 54 insertions(+), 26 deletions(-)

diff --git a/scripts/e2e.sh b/scripts/e2e.sh
index 44dc6dbe..8816a01b 100755
--- a/scripts/e2e.sh
+++ b/scripts/e2e.sh
@@ -357,30 +357,45 @@ capture_line() {
 declare -a SUMMARY=()
 record() { SUMMARY+=("$1"); }
 
-wait_for scheduler \
+if wait_for scheduler \
     "Processing build output for type FUZZER" \
-    "$BUILD_TIMEOUT" "fuzzer build processed" \
-    && record "fuzzer-build: ok" || record "fuzzer-build: TIMEOUT"
+    "$BUILD_TIMEOUT" "fuzzer build processed"; then
+    record "fuzzer-build: ok"
+else
+    record "fuzzer-build: TIMEOUT"
+fi
 
-wait_for scheduler \
+if wait_for scheduler \
     "POV submission response: pov_id=" \
-    "$VULN_TIMEOUT" "vulnerability (POV) submitted" \
-    && record "pov-submit: ok" || record "pov-submit: TIMEOUT"
+    "$VULN_TIMEOUT" "vulnerability (POV) submitted"; then
+    record "pov-submit: ok"
+else
+    record "pov-submit: TIMEOUT"
+fi
 
-wait_for scheduler \
+if wait_for scheduler \
     "Updated POV status. New status PASSED" \
-    "$VULN_TIMEOUT" "POV accepted by competition API" \
-    && record "pov-passed: ok" || record "pov-passed: TIMEOUT"
+    "$VULN_TIMEOUT" "POV accepted by competition API"; then
+    record "pov-passed: ok"
+else
+    record "pov-passed: TIMEOUT"
+fi
 
-wait_for seed-gen \
+if wait_for seed-gen \
     "Copied [1-9][0-9]* files to corpus" \
-    "$SEED_GEN_TIMEOUT" "seed-gen produced seeds" \
-    && record "seed-gen: ok" || record "seed-gen: TIMEOUT"
+    "$SEED_GEN_TIMEOUT" "seed-gen produced seeds"; then
+    record "seed-gen: ok"
+else
+    record "seed-gen: TIMEOUT"
+fi
 
-wait_for scheduler \
+if wait_for scheduler \
     "Appending patch for task" \
-    "$PATCH_TIMEOUT" "patch generated" \
-    && record "patch-generated: ok" || record "patch-generated: TIMEOUT"
+    "$PATCH_TIMEOUT" "patch generated"; then
+    record "patch-generated: ok"
+else
+    record "patch-generated: TIMEOUT"
+fi
 
 # Approve the patch (the local UI requires explicit approval, unlike scored
 # rounds where it is automatic).
@@ -391,9 +406,13 @@ if [[ -n "$PATCH_LINE" ]]; then
     TASK_ID=$(printf '%s' "$PATCH_LINE" | sed -n 's/.*\[\([^]]*\)\].*/\1/p' | sed 's/^[^:]*://')
     if [[ -n "$PATCH_ID" && -n "$TASK_ID" ]]; then
         log "Approving patch ${C_DIM}task=${TASK_ID} patch=${PATCH_ID}${C_RST}"
-        curl -fsS -X POST \
+        if curl -fsS -X POST \
             "http://127.0.0.1:31323/v1/task/${TASK_ID}/patch/${PATCH_ID}/approve" \
-            >/dev/null && record "patch-approve: ok" || record "patch-approve: HTTP fail"
+            >/dev/null; then
+            record "patch-approve: ok"
+        else
+            record "patch-approve: HTTP fail"
+        fi
     else
         warn "Could not extract patch/task ids from: $PATCH_LINE"
         record "patch-approve: skipped (parse fail)"
@@ -403,15 +422,21 @@ else
     record "patch-approve: skipped (no patch line)"
 fi
 
-wait_for scheduler \
+if wait_for scheduler \
     "Patch passed" \
-    "$PATCH_TIMEOUT" "patch accepted by competition API" \
-    && record "patch-passed: ok" || record "patch-passed: TIMEOUT"
+    "$PATCH_TIMEOUT" "patch accepted by competition API"; then
+    record "patch-passed: ok"
+else
+    record "patch-passed: TIMEOUT"
+fi
 
-wait_for scheduler \
+if wait_for scheduler \
     "Bundle submission response: bundle_id=" \
-    "$BUNDLE_TIMEOUT" "bundle submitted" \
-    && record "bundle-submit: ok" || record "bundle-submit: TIMEOUT"
+    "$BUNDLE_TIMEOUT" "bundle submitted"; then
+    record "bundle-submit: ok"
+else
+    record "bundle-submit: TIMEOUT"
+fi
 
 if [[ "$SARIF_RUN" -eq 1 ]]; then
     SARIF_TASK_ID="${TASK_ID:-}"
@@ -428,10 +453,13 @@ if [[ "$SARIF_RUN" -eq 1 ]]; then
         else
             record "sarif-send: HTTP fail"
         fi
-        wait_for scheduler \
+        if wait_for scheduler \
             "Matching SARIF submission response" \
-            "$BUNDLE_TIMEOUT" "SARIF accepted" \
-            && record "sarif-passed: ok" || record "sarif-passed: TIMEOUT"
+            "$BUNDLE_TIMEOUT" "SARIF accepted"; then
+            record "sarif-passed: ok"
+        else
+            record "sarif-passed: TIMEOUT"
+        fi
     else
         record "sarif: skipped (no task id)"
     fi

From ba140d902c52a677899e6176b6d1b02c7aef6af6 Mon Sep 17 00:00:00 2001
From: Riccardo Schirone <riccardo.schirone@trailofbits.com>
Date: Mon, 18 May 2026 10:29:15 +0000
Subject: [PATCH 05/10] fix(scripts): make e2e.sh wait_for robust to
 pipefail+SIGPIPE

Under `set -o pipefail`, `dc logs ... | grep -m1` fails even when grep
finds a match: `grep -m1` exits at the first match, the upstream
`docker compose logs` then dies with SIGPIPE (rc 141) on its next
write, and pipefail reports that status for the whole pipeline. As a
result, milestones whose log line appears early in a high-volume stream
(e.g. seed-gen's `Copied N files to corpus`) were never registered, and
wait_for spun until its timeout even though the milestone had occurred.
Capture the grep output with `|| true` and test for a non-empty string
instead.
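
The failure mode can be reproduced without Docker. In this sketch `yes`
is a stand-in for a high-volume `docker compose logs` stream (an
assumption for illustration), and bash is assumed for `pipefail`:

```shell
set -o pipefail

# The match is found, but the pipeline still reports failure because the
# upstream writer is killed once grep has exited (typically status 141).
rc=0
yes | grep -m1 y >/dev/null || rc=$?
echo "pipeline status: $rc"

# Capturing the output and ignoring the pipeline status keeps the match.
match="$(yes | grep -m1 y || true)"
if [ -n "$match" ]; then
    echo "matched: $match"
fi
```

Testing the captured string rather than the pipeline status makes the
check independent of how the upstream command terminates.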

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 scripts/e2e.sh | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/scripts/e2e.sh b/scripts/e2e.sh
index 8816a01b..9b2e3ecf 100755
--- a/scripts/e2e.sh
+++ b/scripts/e2e.sh
@@ -329,8 +329,15 @@ wait_for() {
 
     while [[ $(date +%s) -lt $deadline ]]; do
         # --no-color so the grep matches plain text; --tail=all replays history.
-        if dc logs --no-color --no-log-prefix --tail=all "$service" 2>/dev/null \
-            | grep -m1 -E "$pattern" >/dev/null; then
+        # NOTE: capture into a var with `|| true` instead of `if cmd | grep`.
+        # Under `set -o pipefail`, `grep -m1` exits on the first match and the
+        # upstream `docker compose logs` then dies with SIGPIPE (rc 141), which
+        # would make the whole pipeline "fail" and the milestone never register
+        # for high-volume services whose match is early in the stream.
+        local match
+        match="$(dc logs --no-color --no-log-prefix --tail=all "$service" 2>/dev/null \
+            | grep -m1 -E "$pattern" || true)"
+        if [[ -n "$match" ]]; then
             ok "Reached: ${label}"
             return 0
         fi

From f39763178054674a30681a04d8129a2b2f424426 Mon Sep 17 00:00:00 2001
From: Riccardo Schirone <riccardo.schirone@trailofbits.com>
Date: Tue, 19 May 2026 07:29:17 +0000
Subject: [PATCH 06/10] refactor(scripts): simplify e2e.sh to
 budget/duration/tag/no-pull

Drop --no-build, --keep-up, --skip-wait, --sarif, --task-json and the
per-phase --*-timeout flags. The stack now always tears down on exit;
milestone timeouts are internal constants.

Addresses PR #552 review:
- provider-key check moved below the .env fallback so keys saved to
  .env on a prior run are accepted (the error tip about adding the key
  to .env is now accurate)
- --task-json removed (was silently falling back to the libpng default)
- trigger_task response uses mktemp + on_exit cleanup instead of a
  predictable /tmp/e2e_task_resp.$$ leaked on SIGINT/SIGTERM
- --no-build phantom "deprecated alias" removed
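
A sketch of the mktemp + exit-trap pattern from the third item, under
simplifying assumptions: run_demo and cleanup are hypothetical names,
the printf stands in for `curl -o`, and only EXIT is trapped here
(e2e.sh traps EXIT INT TERM):

```shell
# Run in a subshell so the EXIT trap fires when the demo ends.
run_demo() (
    TASK_RESP=""
    cleanup() {
        if [ -n "$TASK_RESP" ]; then rm -f "$TASK_RESP"; fi
    }
    trap cleanup EXIT

    TASK_RESP="$(mktemp)"                      # unpredictable name
    printf '{"demo": true}' > "$TASK_RESP"     # stand-in for curl -o
    printf '%s' "$TASK_RESP"                   # report the path to the caller
    exit 1                                     # simulated failure
)

path="$(run_demo)" || true
if [ -e "$path" ]; then
    echo "leaked: $path"
else
    echo "cleaned up"
fi
```

Initializing TASK_RESP to an empty string lets the trap run safely even
when the script exits before the temp file has been created.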

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .claude/commands/e2e.md |  18 +++---
 Makefile                |   2 +-
 scripts/e2e.sh          | 135 +++++++++++-----------------------------
 3 files changed, 45 insertions(+), 110 deletions(-)

diff --git a/.claude/commands/e2e.md b/.claude/commands/e2e.md
index 1e94f492..d757fc7b 100644
--- a/.claude/commands/e2e.md
+++ b/.claude/commands/e2e.md
@@ -1,6 +1,6 @@
 ---
 description: Run a Docker-only end-to-end smoke test of Buttercup against example-libpng with a low LLM budget, and monitor the pipeline.
-argument-hint: "[--budget N] [--task-duration SEC] [--image-tag TAG] [--keep-up] [--no-pull] [--skip-wait] [--sarif]"
+argument-hint: "[--budget N] [--task-duration SEC] [--image-tag TAG] [--no-pull]"
 allowed-tools: Bash(./scripts/e2e.sh:*), Bash(make e2e*), Bash(docker compose:*), Bash(cd dev/docker-compose && docker compose:*), Read
 ---
 
@@ -16,11 +16,11 @@ Mirrors the milestones in `.github/workflows/system-integration.yml`, but tails
 
 ## What it does
 
-1. Checks for `docker`, `docker compose`, `curl`, and at least one LLM provider key (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, or `GEMINI_API_KEY`) in your env.
+1. Checks for `docker`, `docker compose`, `curl`, and at least one LLM provider key (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, or `GEMINI_API_KEY`) in your env (or already saved in `dev/docker-compose/.env`).
 2. Writes `dev/docker-compose/.env` with the provider keys and `LITELLM_MAX_BUDGET=$BUDGET` (default `3`).
 3. Pulls the prebuilt component images (`docker compose -f compose.yaml -f compose.prebuilt.yaml pull`, skippable with `--no-pull`) and starts every service (redis, dind, litellm, task-server, task-downloader, scheduler, program-model, build-bot, fuzzer-bot, coverage-bot, tracer-bot, seed-gen, patcher, buttercup-ui). No local image build.
 4. POSTs the canned libpng `trigger_task` payload to `http://localhost:31323/webhook/trigger_task`.
-5. Waits, in order, for these scheduler/seed-gen log markers (timeout configurable per phase):
+5. Waits, in order, for these scheduler/seed-gen log markers:
    - `Processing build output for type FUZZER` — fuzzer build done
    - `POV submission response: pov_id=` — vulnerability found and POV submitted
    - `Updated POV status. New status PASSED` — POV accepted by competition API
@@ -29,8 +29,7 @@ Mirrors the milestones in `.github/workflows/system-integration.yml`, but tails
    - approves the patch via `POST /v1/task/<task_id>/patch/<patch_id>/approve`
    - `Patch passed` — patch accepted
    - `Bundle submission response: bundle_id=` — bundle submitted
-6. With `--sarif`, also sends a SARIF broadcast and waits for `Matching SARIF submission response`.
-7. Prints a colored summary and tears the stack down with `docker compose down -v` (unless `--keep-up`).
+6. Prints a colored summary and tears the stack down with `docker compose down -v`.
 
 ## Run it
 
@@ -41,12 +40,11 @@ The driver is `scripts/e2e.sh`. The `Makefile` exposes `make e2e`.
 make e2e
 
 # Pass flags through the Makefile
-make e2e E2E_ARGS="--budget 5 --keep-up"
+make e2e E2E_ARGS="--budget 5 --no-pull"
 
 # Or call the script directly
 ./scripts/e2e.sh --budget 3 --task-duration 1800
-./scripts/e2e.sh --skip-wait --keep-up   # just bring the stack up + submit task
-./scripts/e2e.sh --sarif                 # also exercise the SARIF flow
+./scripts/e2e.sh --image-tag my-branch --no-pull   # run already-present images
 ```
 
 The script writes/overwrites `dev/docker-compose/.env` on each run.
@@ -77,7 +75,7 @@ The web UI is at `http://localhost:31323` (no port-forward needed — it's publi
 cd dev/docker-compose && docker compose down -v --remove-orphans
 ```
 
-`scripts/e2e.sh` does this automatically on exit unless you pass `--keep-up`.
+`scripts/e2e.sh` does this automatically on exit.
 
 ## When you invoke /e2e
 
@@ -88,4 +86,4 @@ When the user runs `/e2e`, default behavior:
 3. If the run fails on a milestone, fetch the last ~50 lines of the relevant service:
    - `cd dev/docker-compose && docker compose logs --tail=50 <service>`
 4. If the user asks to keep digging, expand the watch with `docker compose logs -f <service>` until the user is satisfied.
-5. On success, summarize the milestones reached and remind the user the stack is already torn down (or still up, if `--keep-up`).
+5. On success, summarize the milestones reached and remind the user the stack is already torn down.
diff --git a/Makefile b/Makefile
index ca083f9c..a5f0d445 100644
--- a/Makefile
+++ b/Makefile
@@ -152,7 +152,7 @@ send-libpng-task:
 	kill $$PORT_FORWARD_PID 2>/dev/null || true
 
 # Docker-only end-to-end run against example-libpng. No Kubernetes required.
-# Pass extra flags via E2E_ARGS, e.g.:  make e2e E2E_ARGS="--keep-up --budget 5"
+# Pass extra flags via E2E_ARGS, e.g.:  make e2e E2E_ARGS="--budget 5 --no-pull"
 e2e:
 	@./scripts/e2e.sh $(E2E_ARGS)
 
diff --git a/scripts/e2e.sh b/scripts/e2e.sh
index 9b2e3ecf..b25c3aaf 100755
--- a/scripts/e2e.sh
+++ b/scripts/e2e.sh
@@ -25,20 +25,19 @@ ENV_FILE="${COMPOSE_DIR}/.env"
 # Defaults — overridable via flags or environment.
 BUDGET="${LITELLM_MAX_BUDGET:-3}"
 TASK_DURATION="${E2E_TASK_DURATION:-1800}"
-BUILD_TIMEOUT="${E2E_BUILD_TIMEOUT:-1800}"      # seconds  (fuzzer build)
-VULN_TIMEOUT="${E2E_VULN_TIMEOUT:-1800}"
-PATCH_TIMEOUT="${E2E_PATCH_TIMEOUT:-1800}"
-BUNDLE_TIMEOUT="${E2E_BUNDLE_TIMEOUT:-300}"
-SEED_GEN_TIMEOUT="${E2E_SEED_GEN_TIMEOUT:-1800}"
 
 # Prebuilt GHCR images instead of local builds (compose.prebuilt.yaml overlay).
 IMAGE_TAG="${BUTTERCUP_IMAGE_TAG:-main}"
 
 DO_PULL=1
-DO_TEARDOWN=1
-SKIP_WAIT=0
-TASK_JSON=""    # if set, used instead of the canned libpng payload
-SARIF_RUN=0
+
+# Internal milestone timeouts (seconds). Bundle submission is quick; the rest
+# (build, vuln, seed-gen, patch) can each take a while on a low-budget run.
+MILESTONE_TIMEOUT=1800
+BUNDLE_TIMEOUT=300
+
+# Temp file for the trigger_task HTTP response; cleaned up on exit.
+TASK_RESP=""
 
 ###############################################################################
 # Logging
@@ -72,20 +71,8 @@ milestones tracked by .github/workflows/system-integration.yml.
 Options:
   --budget DOLLARS          LiteLLM per-user max budget (default: $BUDGET)
   --task-duration SECONDS   How long the CRS should fuzz (default: $TASK_DURATION)
-  --task-json FILE          Custom trigger_task payload (default: example-libpng)
   --image-tag TAG           Prebuilt GHCR image tag to run (default: $IMAGE_TAG)
   --no-pull                 Skip 'docker compose pull' (use already-pulled images)
-  --no-build                Deprecated alias for --no-pull (no local build happens)
-  --keep-up                 Don't tear the stack down on exit (for debugging)
-  --skip-wait               Bring the stack up and submit the task, but don't
-                            block waiting on milestones (returns immediately)
-  --sarif                   Also submit a SARIF broadcast after the patch
-                            passes and wait for the matching SARIF response
-  --build-timeout SEC       Override fuzzer-build milestone timeout (default $BUILD_TIMEOUT)
-  --vuln-timeout SEC        Override vuln milestone timeout (default $VULN_TIMEOUT)
-  --patch-timeout SEC       Override patch milestone timeout (default $PATCH_TIMEOUT)
-  --bundle-timeout SEC      Override bundle milestone timeout (default $BUNDLE_TIMEOUT)
-  --seed-gen-timeout SEC    Override seed-gen milestone timeout (default $SEED_GEN_TIMEOUT)
   -h, --help                Print this help
 
 Required environment (at least one provider key, plus litellm master key):
@@ -108,18 +95,8 @@ while [[ $# -gt 0 ]]; do
     case "$1" in
         --budget)            BUDGET="$2"; shift 2 ;;
         --task-duration)     TASK_DURATION="$2"; shift 2 ;;
-        --task-json)         TASK_JSON="$(cat "$2")"; shift 2 ;;
         --image-tag)         IMAGE_TAG="$2"; shift 2 ;;
         --no-pull)           DO_PULL=0; shift ;;
-        --no-build)          DO_PULL=0; shift ;;   # deprecated alias
-        --keep-up)           DO_TEARDOWN=0; shift ;;
-        --skip-wait)         SKIP_WAIT=1; shift ;;
-        --sarif)             SARIF_RUN=1; shift ;;
-        --build-timeout)     BUILD_TIMEOUT="$2"; shift 2 ;;
-        --vuln-timeout)      VULN_TIMEOUT="$2"; shift 2 ;;
-        --patch-timeout)     PATCH_TIMEOUT="$2"; shift 2 ;;
-        --bundle-timeout)    BUNDLE_TIMEOUT="$2"; shift 2 ;;
-        --seed-gen-timeout)  SEED_GEN_TIMEOUT="$2"; shift 2 ;;
         -h|--help)           usage; exit 0 ;;
         *) err "Unknown argument: $1"; usage; exit 2 ;;
     esac
@@ -142,19 +119,6 @@ if ! command -v curl >/dev/null 2>&1; then
     exit 1
 fi
 
-provider_keys_set=0
-for v in ANTHROPIC_API_KEY OPENAI_API_KEY GEMINI_API_KEY; do
-    val="${!v:-}"
-    if [[ -n "$val" && "$val" != "<INSERT_KEY>" ]]; then
-        provider_keys_set=1
-    fi
-done
-if [[ "$provider_keys_set" -eq 0 ]]; then
-    err "No LLM provider key found in env. Set ANTHROPIC_API_KEY, OPENAI_API_KEY, or GEMINI_API_KEY."
-    err "Tip: 'export ANTHROPIC_API_KEY=...; scripts/e2e.sh' or add to ${ENV_FILE} first."
-    exit 1
-fi
-
 # Read a value already present in the existing .env. Used so that variables
 # not provided via the environment (e.g. LANGFUSE_*) are preserved across runs
 # instead of being clobbered with empty/placeholder values, since this script
@@ -175,6 +139,21 @@ prev_env() {
 : "${LANGFUSE_PUBLIC_KEY:=$(prev_env LANGFUSE_PUBLIC_KEY)}"
 : "${LANGFUSE_SECRET_KEY:=$(prev_env LANGFUSE_SECRET_KEY)}"
 
+# Require at least one usable provider key. Checked *after* the .env fallback
+# above so a key saved to .env on a prior run still counts.
+provider_keys_set=0
+for v in ANTHROPIC_API_KEY OPENAI_API_KEY GEMINI_API_KEY; do
+    val="${!v:-}"
+    if [[ -n "$val" && "$val" != "<INSERT_KEY>" ]]; then
+        provider_keys_set=1
+    fi
+done
+if [[ "$provider_keys_set" -eq 0 ]]; then
+    err "No LLM provider key found. Set ANTHROPIC_API_KEY, OPENAI_API_KEY, or GEMINI_API_KEY."
+    err "Tip: 'export ANTHROPIC_API_KEY=...; scripts/e2e.sh' or add it to ${ENV_FILE} first."
+    exit 1
+fi
+
 # 3) Final placeholders if still unset after both env and .env. Keys left at
 # the placeholder so litellm still loads its config (some models will fail at
 # request time, others will succeed). LANGFUSE_* stay empty (telemetry off).
@@ -220,18 +199,11 @@ dc() {
            docker compose -f compose.yaml -f compose.prebuilt.yaml "$@")
 }
 
-teardown() {
-    if [[ "$DO_TEARDOWN" -eq 1 ]]; then
-        log "Tearing the stack down (docker compose down -v)"
-        dc down -v --remove-orphans || true
-    else
-        warn "Leaving the stack up (--keep-up). Tear down with: cd ${COMPOSE_DIR} && docker compose -f compose.yaml -f compose.prebuilt.yaml down -v"
-    fi
-}
-
 on_exit() {
     rc=$?
-    teardown
+    [[ -n "$TASK_RESP" ]] && rm -f "$TASK_RESP"
+    log "Tearing the stack down (docker compose down -v)"
+    dc down -v --remove-orphans || true
     if [[ $rc -ne 0 ]]; then
         err "e2e run finished with exit code $rc"
     fi
@@ -280,8 +252,7 @@ ok "buttercup-ui is up."
 # Submit the task
 ###############################################################################
 
-if [[ -z "$TASK_JSON" ]]; then
-    TASK_JSON=$(cat <<EOF
+TASK_JSON=$(cat <<EOF
 {
     "challenge_repo_url": "https://github.com/tob-challenges/example-libpng",
     "challenge_repo_base_ref": "5bf8da2d7953974e5dfbd778429c3affd461f51a",
@@ -293,27 +264,20 @@ if [[ -z "$TASK_JSON" ]]; then
 }
 EOF
 )
-fi
 
 log "Submitting task to buttercup-ui /webhook/trigger_task"
-http_code=$(curl -s -o /tmp/e2e_task_resp.$$ -w '%{http_code}' \
+TASK_RESP="$(mktemp)"
+http_code=$(curl -s -o "$TASK_RESP" -w '%{http_code}' \
     -X POST 'http://127.0.0.1:31323/webhook/trigger_task' \
     -H 'Content-Type: application/json' \
     -d "$TASK_JSON")
-resp_body=$(cat /tmp/e2e_task_resp.$$ || true)
-rm -f /tmp/e2e_task_resp.$$
+resp_body=$(cat "$TASK_RESP" || true)
 if [[ "$http_code" != "200" && "$http_code" != "201" ]]; then
     err "trigger_task returned HTTP $http_code: $resp_body"
     exit 1
 fi
 ok "Task accepted (HTTP $http_code). ${C_DIM}${resp_body}${C_RST}"
 
-if [[ "$SKIP_WAIT" -eq 1 ]]; then
-    ok "--skip-wait set; not waiting on milestones."
-    DO_TEARDOWN=0
-    exit 0
-fi
-
 ###############################################################################
 # Milestone waiters
 ###############################################################################
@@ -366,7 +330,7 @@ record() { SUMMARY+=("$1"); }
 
 if wait_for scheduler \
     "Processing build output for type FUZZER" \
-    "$BUILD_TIMEOUT" "fuzzer build processed"; then
+    "$MILESTONE_TIMEOUT" "fuzzer build processed"; then
     record "fuzzer-build: ok"
 else
     record "fuzzer-build: TIMEOUT"
@@ -374,7 +338,7 @@ fi
 
 if wait_for scheduler \
     "POV submission response: pov_id=" \
-    "$VULN_TIMEOUT" "vulnerability (POV) submitted"; then
+    "$MILESTONE_TIMEOUT" "vulnerability (POV) submitted"; then
     record "pov-submit: ok"
 else
     record "pov-submit: TIMEOUT"
@@ -382,7 +346,7 @@ fi
 
 if wait_for scheduler \
     "Updated POV status. New status PASSED" \
-    "$VULN_TIMEOUT" "POV accepted by competition API"; then
+    "$MILESTONE_TIMEOUT" "POV accepted by competition API"; then
     record "pov-passed: ok"
 else
     record "pov-passed: TIMEOUT"
@@ -390,7 +354,7 @@ fi
 
 if wait_for seed-gen \
     "Copied [1-9][0-9]* files to corpus" \
-    "$SEED_GEN_TIMEOUT" "seed-gen produced seeds"; then
+    "$MILESTONE_TIMEOUT" "seed-gen produced seeds"; then
     record "seed-gen: ok"
 else
     record "seed-gen: TIMEOUT"
@@ -398,7 +362,7 @@ fi
 
 if wait_for scheduler \
     "Appending patch for task" \
-    "$PATCH_TIMEOUT" "patch generated"; then
+    "$MILESTONE_TIMEOUT" "patch generated"; then
     record "patch-generated: ok"
 else
     record "patch-generated: TIMEOUT"
@@ -431,7 +395,7 @@ fi
 
 if wait_for scheduler \
     "Patch passed" \
-    "$PATCH_TIMEOUT" "patch accepted by competition API"; then
+    "$MILESTONE_TIMEOUT" "patch accepted by competition API"; then
     record "patch-passed: ok"
 else
     record "patch-passed: TIMEOUT"
@@ -445,33 +409,6 @@ else
     record "bundle-submit: TIMEOUT"
 fi
 
-if [[ "$SARIF_RUN" -eq 1 ]]; then
-    SARIF_TASK_ID="${TASK_ID:-}"
-    if [[ -z "$SARIF_TASK_ID" ]]; then
-        SARIF_TASK_ID=$(dc logs --no-color --no-log-prefix --tail=all scheduler \
-            | grep "Submitting bundle for harness" | head -n1 \
-            | grep -o "\[[^]]*\]" | head -n1 \
-            | tr -d '[]' | awk -F: '{print $NF}')
-    fi
-    if [[ -n "$SARIF_TASK_ID" ]]; then
-        log "Sending SARIF broadcast for task ${SARIF_TASK_ID}"
-        if "${REPO_ROOT}/orchestrator/scripts/send_sarif.sh" "$SARIF_TASK_ID" >/dev/null 2>&1; then
-            record "sarif-send: ok"
-        else
-            record "sarif-send: HTTP fail"
-        fi
-        if wait_for scheduler \
-            "Matching SARIF submission response" \
-            "$BUNDLE_TIMEOUT" "SARIF accepted"; then
-            record "sarif-passed: ok"
-        else
-            record "sarif-passed: TIMEOUT"
-        fi
-    else
-        record "sarif: skipped (no task id)"
-    fi
-fi
-
 ###############################################################################
 # Summary
 ###############################################################################

From acf39c237ab93a34747605807e3bbeace2eebde8 Mon Sep 17 00:00:00 2001
From: Riccardo Schirone <riccardo.schirone@trailofbits.com>
Date: Tue, 19 May 2026 08:02:45 +0000
Subject: [PATCH 07/10] refactor(scripts): drop user-facing
 BUTTERCUP_LITELLM_KEY from e2e.sh

The local litellm master key is an internal detail of the docker-compose
stack, not something the user should set. Remove it from the usage text
and the env/.env resolution; e2e.sh now just writes the local default
(sk-1234) into the generated .env.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 scripts/e2e.sh | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/scripts/e2e.sh b/scripts/e2e.sh
index b25c3aaf..82177c45 100755
--- a/scripts/e2e.sh
+++ b/scripts/e2e.sh
@@ -75,9 +75,8 @@ Options:
   --no-pull                 Skip 'docker compose pull' (use already-pulled images)
   -h, --help                Print this help
 
-Required environment (at least one provider key, plus litellm master key):
+Required environment (at least one provider key):
   ANTHROPIC_API_KEY   and/or   OPENAI_API_KEY   and/or   GEMINI_API_KEY
-  BUTTERCUP_LITELLM_KEY  (optional, defaults to sk-1234 for local runs)
 
 Optional:
   BUTTERCUP_IMAGE_TAG  Prebuilt GHCR image tag (default: main; same as --image-tag)
@@ -134,7 +133,6 @@ prev_env() {
 : "${GEMINI_API_KEY:=$(prev_env GEMINI_API_KEY)}"
 : "${AZURE_API_BASE:=$(prev_env AZURE_API_BASE)}"
 : "${AZURE_API_KEY:=$(prev_env AZURE_API_KEY)}"
-: "${BUTTERCUP_LITELLM_KEY:=$(prev_env BUTTERCUP_LITELLM_KEY)}"
 : "${LANGFUSE_HOST:=$(prev_env LANGFUSE_HOST)}"
 : "${LANGFUSE_PUBLIC_KEY:=$(prev_env LANGFUSE_PUBLIC_KEY)}"
 : "${LANGFUSE_SECRET_KEY:=$(prev_env LANGFUSE_SECRET_KEY)}"
@@ -162,7 +160,6 @@ fi
 : "${GEMINI_API_KEY:=<INSERT_KEY>}"
 : "${AZURE_API_BASE:=<INSERT_HOST>}"
 : "${AZURE_API_KEY:=<INSERT_KEY>}"
-: "${BUTTERCUP_LITELLM_KEY:=sk-1234}"
 : "${LANGFUSE_HOST:=}"
 : "${LANGFUSE_PUBLIC_KEY:=}"
 : "${LANGFUSE_SECRET_KEY:=}"
@@ -174,7 +171,8 @@ fi
 log "Writing ${ENV_FILE} (LITELLM_MAX_BUDGET=\$${BUDGET})"
 {
     echo "# Generated by scripts/e2e.sh on $(date -Is)"
-    echo "BUTTERCUP_LITELLM_KEY=${BUTTERCUP_LITELLM_KEY}"
+    # litellm master key — internal to the local stack, not user-facing.
+    echo "BUTTERCUP_LITELLM_KEY=sk-1234"
     echo "ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}"
     echo "OPENAI_API_KEY=${OPENAI_API_KEY}"
     echo "GEMINI_API_KEY=${GEMINI_API_KEY}"

From 4ec48c35328de4a1a2d87d3b418b8b6c274db302 Mon Sep 17 00:00:00 2001
From: Riccardo Schirone <riccardo.schirone@trailofbits.com>
Date: Tue, 19 May 2026 08:11:54 +0000
Subject: [PATCH 08/10] fix(scripts): don't clobber LANGFUSE_* with empty
 values in e2e.sh

e2e.sh regenerates dev/docker-compose/.env every run and was always
writing LANGFUSE_HOST=/PUBLIC_KEY=/SECRET_KEY= even when unset. Since
.env is loaded last in compose's env_file list, an empty value silently
disabled Langfuse telemetry. The values are now resolved from the
environment first, then from the existing .env, and the LANGFUSE_*
lines are only written when non-empty, so values the user set in .env
survive across runs.
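
A minimal sketch of the conditional write. The temp file stands in for
the generated .env and the values are made up for illustration:

```shell
ENV_FILE="$(mktemp)"                       # stand-in for the generated .env
LANGFUSE_HOST="https://langfuse.example"   # made-up value
LANGFUSE_PUBLIC_KEY=""                     # not provided on this run

{
    echo "LITELLM_MAX_BUDGET=3"
    # Emit a LANGFUSE_* line only when there is a value to write.
    [ -n "$LANGFUSE_HOST" ]       && echo "LANGFUSE_HOST=${LANGFUSE_HOST}"
    [ -n "$LANGFUSE_PUBLIC_KEY" ] && echo "LANGFUSE_PUBLIC_KEY=${LANGFUSE_PUBLIC_KEY}"
    true    # keep the group's exit status 0 when the last test is false
} > "$ENV_FILE"

written="$(cat "$ENV_FILE")"
rm -f "$ENV_FILE"
printf '%s\n' "$written"
```

The trailing `true` matters under `set -e`: without it, a false final
test would make the whole group return non-zero.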

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 scripts/e2e.sh | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/scripts/e2e.sh b/scripts/e2e.sh
index 82177c45..5fb266fd 100755
--- a/scripts/e2e.sh
+++ b/scripts/e2e.sh
@@ -120,8 +120,8 @@ fi
 
 # Read a value already present in the existing .env. Used so that variables
 # not provided via the environment (e.g. LANGFUSE_*) are preserved across runs
-# instead of being clobbered with empty/placeholder values, since this script
-# regenerates .env from scratch on every run.
+# instead of being clobbered, since this script regenerates .env from scratch
+# on every run.
 prev_env() {
     [[ -f "$ENV_FILE" ]] || return 0
     sed -n "s/^$1=//p" "$ENV_FILE" | head -n1
@@ -154,7 +154,9 @@ fi
 
 # 3) Final placeholders if still unset after both env and .env. Keys left at
 # the placeholder so litellm still loads its config (some models will fail at
-# request time, others will succeed). LANGFUSE_* stay empty (telemetry off).
+# request time, others will succeed). LANGFUSE_* are intentionally left unset
+# here: empty values are NOT written to .env below, so a run without them set
+# never clobbers LANGFUSE_* the user previously had in .env.
 : "${ANTHROPIC_API_KEY:=<INSERT_KEY>}"
 : "${OPENAI_API_KEY:=<INSERT_KEY>}"
 : "${GEMINI_API_KEY:=<INSERT_KEY>}"
@@ -179,9 +181,12 @@ log "Writing ${ENV_FILE} (LITELLM_MAX_BUDGET=\$${BUDGET})"
     echo "AZURE_API_BASE=${AZURE_API_BASE}"
     echo "AZURE_API_KEY=${AZURE_API_KEY}"
     echo "LITELLM_MAX_BUDGET=${BUDGET}"
-    echo "LANGFUSE_HOST=${LANGFUSE_HOST}"
-    echo "LANGFUSE_PUBLIC_KEY=${LANGFUSE_PUBLIC_KEY}"
-    echo "LANGFUSE_SECRET_KEY=${LANGFUSE_SECRET_KEY}"
+    # Only emit LANGFUSE_* when we actually have a value, so a run without
+    # them set leaves no empty LANGFUSE_HOST= behind to disable telemetry.
+    [[ -n "$LANGFUSE_HOST" ]]       && echo "LANGFUSE_HOST=${LANGFUSE_HOST}"
+    [[ -n "$LANGFUSE_PUBLIC_KEY" ]] && echo "LANGFUSE_PUBLIC_KEY=${LANGFUSE_PUBLIC_KEY}"
+    [[ -n "$LANGFUSE_SECRET_KEY" ]] && echo "LANGFUSE_SECRET_KEY=${LANGFUSE_SECRET_KEY}"
+    true
 } > "$ENV_FILE"
 
 ###############################################################################

From dc77e02809260b48f3817e8878d21d245fd9ecd8 Mon Sep 17 00:00:00 2001
From: Riccardo Schirone <riccardo.schirone@trailofbits.com>
Date: Tue, 19 May 2026 08:54:58 +0000
Subject: [PATCH 09/10] fix(scripts): match real summary log markers for
 POV/bundle milestones
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The pov-submit and bundle-submit waiters matched on
"POV submission response: pov_id=" and
"Bundle submission response: bundle_id=", which never appear in any
rendered log line. The only "... submission response:" logs are
logger.debug calls whose payload is an API object repr (no literal
pov_id=/bundle_id=); pov_id=/bundle_id= appear only in the separate
structured summary line (logger.info), which has a different prefix.
As a result both milestones always timed out, so every run, including
fully successful ones, wasted MILESTONE_TIMEOUT+BUNDLE_TIMEOUT and
exited non-zero.

Repoint both to the structured summary tokens (pov_id= / bundle_id=) and
sync the marker list in .claude/commands/e2e.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .claude/commands/e2e.md | 4 ++--
 scripts/e2e.sh          | 9 +++++++--
 2 files changed, 9 insertions(+), 4 deletions(-)
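
Note for reviewers (below the `---`, so not part of the commit message):
a small sketch of why the old marker could not match. The two log lines
are hypothetical, shaped after the description above, and the real
scheduler output may differ. The `matches` helper and its fixed-string
grep are also assumptions, not the matching logic of `wait_for`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical log lines; prefixes and fields are made up for illustration.
debug_line='POV submission response: <POVSubmissionResponse object>'
info_line='[0:task-1234] pov_id=abcd-ef01 status=ACCEPTED'

old_marker='POV submission response: pov_id='
new_marker='pov_id='

# matches MARKER LINE: succeed when LINE contains MARKER as a fixed string.
matches() {
    printf '%s\n' "$2" | grep -qF -- "$1"
}

for line in "$debug_line" "$info_line"; do
    for marker in "$old_marker" "$new_marker"; do
        if matches "$marker" "$line"; then
            echo "match:    '$marker' in: $line"
        else
            echo "no match: '$marker' in: $line"
        fi
    done
done
```

With these sample lines the old marker matches neither line, while the
bare `pov_id=` token matches only the summary line.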

diff --git a/.claude/commands/e2e.md b/.claude/commands/e2e.md
index d757fc7b..7c81d7e3 100644
--- a/.claude/commands/e2e.md
+++ b/.claude/commands/e2e.md
@@ -22,13 +22,13 @@ Mirrors the milestones in `.github/workflows/system-integration.yml`, but tails
 4. POSTs the canned libpng `trigger_task` payload to `http://localhost:31323/webhook/trigger_task`.
 5. Waits, in order, for these scheduler/seed-gen log markers:
    - `Processing build output for type FUZZER` — fuzzer build done
-   - `POV submission response: pov_id=` — vulnerability found and POV submitted
+   - `pov_id=` — vulnerability found and POV submitted
    - `Updated POV status. New status PASSED` — POV accepted by competition API
    - `Copied N files to corpus` — seed-gen produced seeds
    - `Appending patch for task` — patch generated
    - approves the patch via `POST /v1/task/<task_id>/patch/<patch_id>/approve`
    - `Patch passed` — patch accepted
-   - `Bundle submission response: bundle_id=` — bundle submitted
+   - `bundle_id=` — bundle submitted
 6. Prints a colored summary and tears the stack down with `docker compose down -v`.
 
 ## Run it
diff --git a/scripts/e2e.sh b/scripts/e2e.sh
index 5fb266fd..5b08e428 100755
--- a/scripts/e2e.sh
+++ b/scripts/e2e.sh
@@ -339,8 +339,11 @@ else
     record "fuzzer-build: TIMEOUT"
 fi
 
+# NOTE: match the structured summary line (`[i:task] pov_id=<id> ...`,
+# logger.info), NOT the "POV submission response:" debug line whose payload is
+# an API object repr that never contains a literal `pov_id=`.
 if wait_for scheduler \
-    "POV submission response: pov_id=" \
+    "pov_id=" \
     "$MILESTONE_TIMEOUT" "vulnerability (POV) submitted"; then
     record "pov-submit: ok"
 else
@@ -404,8 +407,10 @@ else
     record "patch-passed: TIMEOUT"
 fi
 
+# NOTE: same as POV above — match the structured summary `bundle_id=<id>`
+# (logger.info), not the "Bundle submission response:" debug object repr.
 if wait_for scheduler \
-    "Bundle submission response: bundle_id=" \
+    "bundle_id=" \
     "$BUNDLE_TIMEOUT" "bundle submitted"; then
     record "bundle-submit: ok"
 else

From c1856c4ae38f5e5b1445d904745d79db99337f3a Mon Sep 17 00:00:00 2001
From: Riccardo Schirone <riccardo.schirone@trailofbits.com>
Date: Tue, 19 May 2026 11:10:48 +0000
Subject: [PATCH 10/10] fix(scripts): e2e.sh approval wait-loop + viable
 budget/duration defaults

Three defects found while verifying the pipeline end-to-end:

1. Approval one-shot race: capture_line 'competition_patch_id=' ran once
   right after the patch-generated milestone, but the scheduler logs that
   id only minutes later (after it builds+verifies+submits the patch). The
   capture always lost the race, so approval was always skipped and the
   local stack never reached Patch passed / bundle. Replace with a
   wait_capture() poll loop (mirrors wait_for) so approval actually fires.

2. Default --task-duration 1800 is self-defeating: build->POV->seed-gen->
   patch exceeds 30 min on normal hardware, so the task expires mid-patch
   ("task expired/cancelled? Will discard") and never reaches patch/bundle.
   Default to 7200 so the task outlives the pipeline.

3. Default --budget 3 cannot reach patch/bundle: a full run through patch
   generation costs ~$10; $3 is exhausted around POV. Default to 10.

e2e.md updated to match (defaults, the cheap --budget 3 caveat, and the
poll-then-approve description).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .claude/commands/e2e.md | 13 ++++++------
 scripts/e2e.sh          | 47 +++++++++++++++++++++++++++++++++++++----
 2 files changed, 50 insertions(+), 10 deletions(-)
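
Note for reviewers (below the `---`, so not part of the commit message):
a self-contained sketch of the one-shot race from defect 1 and the
poll-then-extract fix. `poll_capture` is a hypothetical stand-in for
`wait_capture` that reads a plain file instead of `docker compose logs`;
the log line format, the ids and the short timings are made up.

```shell
#!/usr/bin/env bash
set -euo pipefail

# poll_capture FILE PATTERN TIMEOUT_SEC [INTERVAL_SEC]
# Hypothetical stand-in for wait_capture: polls FILE until PATTERN appears
# or TIMEOUT_SEC elapses, printing the first matching line on stdout.
poll_capture() {
    local file="$1" pattern="$2" timeout="$3" interval="${4:-1}"
    local deadline=$(( $(date +%s) + timeout ))
    local match
    while [[ $(date +%s) -lt $deadline ]]; do
        match="$(grep -m1 -E "$pattern" "$file" 2>/dev/null || true)"
        if [[ -n "$match" ]]; then
            printf '%s\n' "$match"
            return 0
        fi
        sleep "$interval"
    done
    return 1
}

log_file="$(mktemp)"
echo "Appending patch for task task-1" > "$log_file"

# Simulate the scheduler logging the id later (line format is made up).
( sleep 2; echo "[0:task-1] competition_patch_id=1a2b-3c4d submitted" >> "$log_file" ) &

# A one-shot capture right after the milestone finds nothing yet...
one_shot="$(grep -m1 -E 'competition_patch_id=' "$log_file" || true)"
echo "one-shot: '${one_shot}'"

# ...while polling picks the line up once it appears.
line="$(poll_capture "$log_file" 'competition_patch_id=[0-9a-fA-F-]' 10)"
patch_id="$(printf '%s' "$line" | sed -n 's/.*competition_patch_id=\([^ ]*\).*/\1/p')"
echo "patch id: ${patch_id}"

wait
rm -f "$log_file"
```

The sed expression is the one e2e.sh uses to pull the id out of the
captured line; everything else is scaffolding for the demonstration.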

diff --git a/.claude/commands/e2e.md b/.claude/commands/e2e.md
index 7c81d7e3..a3da6ea2 100644
--- a/.claude/commands/e2e.md
+++ b/.claude/commands/e2e.md
@@ -6,7 +6,7 @@ allowed-tools: Bash(./scripts/e2e.sh:*), Bash(make e2e*), Bash(docker compose:*)
 
 # /e2e — Docker-only end-to-end Buttercup run (example-libpng)
 
-This command exercises the full Buttercup pipeline on the [example-libpng](https://github.com/tob-challenges/example-libpng) challenge **using Docker only — no Kubernetes/minikube**. It uses the `dev/docker-compose/` stack with the **`compose.prebuilt.yaml` overlay** — every component runs from its prebuilt GHCR image (`ghcr.io/trailofbits/buttercup/*`, tag `main` by default), so **nothing is built locally**. A low LiteLLM budget (default **$3**) keeps an accidental run cheap.
+This command exercises the full Buttercup pipeline on the [example-libpng](https://github.com/tob-challenges/example-libpng) challenge **using Docker only — no Kubernetes/minikube**. It uses the `dev/docker-compose/` stack with the **`compose.prebuilt.yaml` overlay** — every component runs from its prebuilt GHCR image (`ghcr.io/trailofbits/buttercup/*`, tag `main` by default), so **nothing is built locally**. A LiteLLM budget cap (default **$10**) bounds the spend — a full run through patch generation costs roughly that; a lower cap stops the pipeline before patch/bundle, so `--budget 3` only exercises up to seed-gen.
 
 > **Image tag:** defaults to `main`. Override with `--image-tag <branch-or-tag>` or `BUTTERCUP_IMAGE_TAG=...` to test a specific build. Private images require `docker login ghcr.io` first.
 >
@@ -17,7 +17,7 @@ Mirrors the milestones in `.github/workflows/system-integration.yml`, but tails
 ## What it does
 
 1. Checks for `docker`, `docker compose`, `curl`, and at least one LLM provider key (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, or `GEMINI_API_KEY`) in your env (or already saved in `dev/docker-compose/.env`).
-2. Writes `dev/docker-compose/.env` with the provider keys and `LITELLM_MAX_BUDGET=$BUDGET` (default `3`).
+2. Writes `dev/docker-compose/.env` with the provider keys and `LITELLM_MAX_BUDGET=$BUDGET` (default `10`). The submitted task's `duration` defaults to `7200`s (2h) — the CRS discards a task's work once its deadline passes, and the full pipeline can exceed 30 min, so a short duration would expire mid-patch.
 3. Pulls the prebuilt component images (`docker compose -f compose.yaml -f compose.prebuilt.yaml pull`, skippable with `--no-pull`) and starts every service (redis, dind, litellm, task-server, task-downloader, scheduler, program-model, build-bot, fuzzer-bot, coverage-bot, tracer-bot, seed-gen, patcher, buttercup-ui). No local image build.
 4. POSTs the canned libpng `trigger_task` payload to `http://localhost:31323/webhook/trigger_task`.
 5. Waits, in order, for these scheduler/seed-gen log markers:
@@ -26,7 +26,7 @@ Mirrors the milestones in `.github/workflows/system-integration.yml`, but tails
    - `Updated POV status. New status PASSED` — POV accepted by competition API
    - `Copied N files to corpus` — seed-gen produced seeds
    - `Appending patch for task` — patch generated
-   - approves the patch via `POST /v1/task/<task_id>/patch/<patch_id>/approve`
+   - polls for the `competition_patch_id=` summary line (logged only after the scheduler builds, verifies and submits the patch — minutes after the patch is generated), then approves via `POST /v1/task/<task_id>/patch/<patch_id>/approve`
    - `Patch passed` — patch accepted
    - `bundle_id=` — bundle submitted
 6. Prints a colored summary and tears the stack down with `docker compose down -v`.
@@ -36,15 +36,16 @@ Mirrors the milestones in `.github/workflows/system-integration.yml`, but tails
 The driver is `scripts/e2e.sh`. The `Makefile` exposes `make e2e`.
 
 ```bash
-# Plain run with the $3 budget default
+# Plain run with the $10 budget / 7200s task-duration defaults
 make e2e
 
 # Pass flags through the Makefile
-make e2e E2E_ARGS="--budget 5 --no-pull"
+make e2e E2E_ARGS="--budget 15 --no-pull"
 
 # Or call the script directly
-./scripts/e2e.sh --budget 3 --task-duration 1800
+./scripts/e2e.sh --budget 10 --task-duration 7200
 ./scripts/e2e.sh --image-tag my-branch --no-pull   # run already-present images
+./scripts/e2e.sh --budget 3                         # cheap: only reaches ~seed-gen
 ```
 
 The script writes/overwrites `dev/docker-compose/.env` on each run.
diff --git a/scripts/e2e.sh b/scripts/e2e.sh
index 5b08e428..84f93799 100755
--- a/scripts/e2e.sh
+++ b/scripts/e2e.sh
@@ -23,8 +23,17 @@ COMPOSE_DIR="${REPO_ROOT}/dev/docker-compose"
 ENV_FILE="${COMPOSE_DIR}/.env"
 
 # Defaults — overridable via flags or environment.
-BUDGET="${LITELLM_MAX_BUDGET:-3}"
-TASK_DURATION="${E2E_TASK_DURATION:-1800}"
+#
+# BUDGET: a full run through patch generation costs ~$10 of LLM spend; $3 is
+# exhausted during/just after POV, so anything past seed-gen would always time
+# out. Default to 10 so the whole pipeline (incl. patch+bundle) is reachable.
+#
+# TASK_DURATION: the CRS discards a task's work once its deadline passes. On
+# normal hardware build->POV->seed-gen->patch exceeds 30 min, so an 1800s task
+# expires mid-patch ("task expired/cancelled? Will discard") and never reaches
+# patch/bundle. Default to 7200 (2h) so the task outlives the pipeline.
+BUDGET="${LITELLM_MAX_BUDGET:-10}"
+TASK_DURATION="${E2E_TASK_DURATION:-7200}"
 
 # Prebuilt GHCR images instead of local builds (compose.prebuilt.yaml overlay).
 IMAGE_TAG="${BUTTERCUP_IMAGE_TAG:-main}"
@@ -324,6 +333,33 @@ capture_line() {
         | grep -E "$pattern" | head -n1 || true
 }
 
+# wait_capture SERVICE PATTERN TIMEOUT_SEC LABEL
+#
+# Like capture_line, but polls until the pattern appears or TIMEOUT_SEC
+# elapses, echoing the first matching line on stdout (empty on timeout).
+# Progress goes to stderr so stdout stays just the captured line.
+#
+# Needed because `competition_patch_id=` is logged by the scheduler only
+# *after* it builds, verifies and submits the patch — minutes after the
+# "Appending patch for task" milestone. A one-shot capture right after that
+# milestone always races and loses, so approval would always be skipped.
+wait_capture() {
+    local service="$1" pattern="$2" timeout="$3" label="$4"
+    local deadline=$(( $(date +%s) + timeout ))
+    log "Waiting to capture: ${label}  ${C_DIM}(service=${service}, timeout=${timeout}s)${C_RST}" >&2
+    while [[ $(date +%s) -lt $deadline ]]; do
+        local match
+        match="$(dc logs --no-color --no-log-prefix --tail=all "$service" 2>/dev/null \
+            | grep -m1 -E "$pattern" || true)"
+        if [[ -n "$match" ]]; then
+            printf '%s\n' "$match"
+            return 0
+        fi
+        sleep 15
+    done
+    return 1
+}
+
 ###############################################################################
 # Walk through the pipeline
 ###############################################################################
@@ -375,8 +411,11 @@ else
 fi
 
 # Approve the patch (the local UI requires explicit approval, unlike scored
-# rounds where it is automatic).
-PATCH_LINE="$(capture_line scheduler 'competition_patch_id=')"
+# rounds where it is automatic). competition_patch_id= only appears once the
+# scheduler has built+verified+submitted the patch, well after the patch was
+# generated, so poll for it rather than capturing once (which always races).
+PATCH_LINE="$(wait_capture scheduler 'competition_patch_id=[0-9a-fA-F-]' \
+    "$MILESTONE_TIMEOUT" "competition_patch_id (for approval)" || true)"
 if [[ -n "$PATCH_LINE" ]]; then
     PATCH_ID=$(printf '%s' "$PATCH_LINE" | sed -n 's/.*competition_patch_id=\([^ ]*\).*/\1/p')
     # Task id is inside the first [...] block, after the last ':'.