feat: add functional eval framework for agent pipelines #1682
Force-pushed: 6e40384 to 826ceaf
sandbox.Upload() silently fails for large files (~16MB): the binary doesn't appear in the sandbox. Switch bootstrapSandbox to use UploadDir() (tarball + extract), the same mechanism used for uploading project code, which works reliably regardless of file size.

Observed in CI functional evals (run 26642640034). This is the same class of issue that motivated the UploadFile helper in commit 907d482.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
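The "tarball + extract" idea mentioned in this commit can be illustrated locally with plain tar. This is only a sketch of the general pattern, not the sandbox API: the two temporary directories stand in for the host and the sandbox, and the file name fullsend is a placeholder for the uploaded binary.

```shell
# Local illustration only: these temp directories stand in for host and sandbox.
src="$(mktemp -d)"
dest="$(mktemp -d)"

# Create a 16 MiB placeholder where the real binary would be.
head -c 16777216 /dev/zero > "$src/fullsend"

# Pack the directory into a tar stream and unpack it on the other side.
tar -C "$src" -cf - . | tar -C "$dest" -xf -

# Confirm the file arrived byte-for-byte intact.
cmp "$src/fullsend" "$dest/fullsend" && echo "transfer ok"
```

Streaming an archive moves the whole directory through one channel, so the transfer path is the same for small and large files.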
Force-pushed: aec8273 to a74886c
Force-pushed: 35b2e20 to 8e004de
Force-pushed: b4bda40 to 73a0b60
Assisted-by: Claude [claude-opus-4-6] <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
Add openshell sandbox setup (CLI, gateway, Podman), GCP WIF authentication, and credential preparation for functional evals.

Key changes:
- Set up openshell sandbox infrastructure in the workflow
- Prepare WIF credentials for sandbox use (rewrite external_account config to file-based OIDC token source)
- Restore host credentials for the scoring phase (LLM judge runs on the host, not in the sandbox)
- Pass EVALS_GCP_REGION and EVALS_VERTEX_PROJECT_ID for Vertex AI
- Configure git identity and authenticated clone URLs
- Pass --fullsend-binary flag in eval runner

Assisted-by: Claude [claude-opus-4-6] <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
Force-pushed: 73a0b60 to d056e94
- ADR 0044: establish functional evals as a test category in the four-layer test pyramid (unit, prompt eval, functional eval, e2e)
- ADR 0045: adopt agent-eval-harness as the eval framework, using its opaque CLI runner contract as the integration boundary
- Add docs/testing/evals.md with contributor guide for writing and running functional evals
- Update docs/architecture.md with ADR 0044 references on testing open questions
- Annotate testing-agents.md golden-set bootstrapping question as partially answered

Refs #1682

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
- Use git credential helper instead of embedding GH_TOKEN in clone URLs, preventing token persistence in .git/config
- Set 0600 permissions on .eval-env before writing secrets
- Delete .eval-env after fullsend run to prevent secret upload as artifact
- Capture fullsend run exit code correctly (rc=$? pattern vs $? after ||)
- Remove 2>&1 from gh create commands that corrupted URL extraction
- Replace non-portable grep -P with portable parameter expansion
- Pipe yq output directly instead of echo to prevent backslash mangling
- Add explicit out.Close() check in copyFile to catch write flush errors
- Add concurrency group with cancel-in-progress to CI workflow
- Add timeout-minutes: 45 to prevent runaway eval jobs
- Fix shellcheck warnings (SC2086, SC2034, SC2155) in CI workflow
- Suppress ruff F841 in eval fixture code (intentional unused variable)
- Add reversibility note to ADR 0045 consequences

Assisted-by: Claude [claude-opus-4-6] <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
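Two of the shell fixes listed in this commit, the exit-code capture and the grep -P replacement, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the PR's actual script: fake_run is a hypothetical stand-in for fullsend run, the URL is a made-up example, and the `cmd || rc=$?` form is one common way to write the rc=$? pattern.

```shell
# Hypothetical stand-in for `fullsend run`; it always fails with status 3.
fake_run() { return 3; }

# Pitfall: after `cmd || handler`, $? is the status of the handler,
# because the handler is the last command that ran.
fake_run || true
buggy_rc=$?

# Fix: record the status inside the || branch, where $? still belongs to cmd.
rc=0
fake_run || rc=$?

echo "buggy_rc=$buggy_rc rc=$rc"

# Portable alternative to `grep -oP`: strip everything up to the last slash
# with POSIX parameter expansion. The URL is a made-up example.
url="https://github.com/example-org/example-repo/issues/42"
issue_number="${url##*/}"
echo "issue_number=$issue_number"
```

The parameter-expansion form needs no external tool, so it behaves the same on systems whose grep lacks PCRE support.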
- Delete .eval-env in cleanup trap so secrets don't survive crashes
- Scrub .eval-env from artifact uploads as defense-in-depth
- Pin agent-eval-harness clone to a specific commit ref
- Add per-case timeout (default 30min) to prevent hung agents
- Fix credential helper comment (expression is stored, not the token)
- Fix docs env var table: add missing vars, correct FULLSEND_DIR desc
- Remove unused shell variable declaration (shellcheck SC2034)
- Remove unused comments.min_count/max_count from annotations

Assisted-by: Claude [claude-opus-4-6] <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
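The cleanup trap and per-case timeout described in this commit might look roughly like the sketch below. It is an illustration under assumptions, not the runner's real code: run_case is a hypothetical helper, the secret is a placeholder, and the coreutils timeout command is assumed as the timeout mechanism (the commit does not say which one is used).

```shell
# Hypothetical helper: run_case <limit-seconds> <workdir> <command...>
run_case() {
  # A subshell scopes the EXIT trap to this one case.
  (
    limit="$1"
    workdir="$2"
    shift 2
    env_file="$workdir/.eval-env"

    # Remove the secrets file whenever this subshell exits, success or failure.
    trap 'rm -f "$env_file"' EXIT

    # Create the file empty, lock it down to 0600, and only then write secrets.
    : > "$env_file"
    chmod 600 "$env_file"
    printf 'EXAMPLE_SECRET=%s\n' "placeholder" >> "$env_file"

    # coreutils `timeout` kills the command and exits 124 when the limit is hit.
    timeout "$limit" "$@"
  )
}

workdir="$(mktemp -d)"
case_rc=0
# A 1-second limit against a 5-second command forces the timeout path.
run_case 1 "$workdir" sleep 5 || case_rc=$?
echo "case_rc=$case_rc"
```

A real run would pass 1800 seconds to match the 30-minute default named above. Because the trap fires on every exit path, .eval-env is gone even when the case is killed by the timeout.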
Replace the runtime clone of agent-eval-harness with a git submodule at eval/.agent-eval-harness. Dependabot's gitsubmodule ecosystem will keep it updated automatically, which the previous pinned-SHA-in-a-script approach could not support.

- Add eval/.agent-eval-harness submodule (at 8e471f8, post-v1.4.0)
- Remove .gitignore entry for eval/.agent-eval-harness/
- Update run-functional.sh to init submodule instead of cloning
- Add submodules: true to workflow checkout step
- Update docs/testing/evals.md to describe submodule setup

Assisted-by: Claude [claude-opus-4-6] <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
While reviewing this I dug into what agent-eval-harness can do that we're not using yet. Noting it here so we can file backlog issues after this merges if any of these seem worth pursuing.
Quick wins (could adopt with minimal effort):
Medium-term:
Longer-term (needs architecture changes):
The HTML reports and parallelism seem like obvious next steps once we have more than one test case.
Summary
- 001-bug-url-encoding) deliberately tests whether the triage agent reads source code critically vs. parroting the issue description

What's in the box
- eval/fullsend-runner.sh
- eval/run-functional.sh
- eval/triage/eval.yaml
- eval/triage/cases/001-*: + but issue claims it doesn't)
- eval/triage/repos/python-webapp/
- .github/workflows/functional-evals.yml: eval/ or internal/scaffold/ changes
- Makefile: make functional-evals target

Current results
The LLM judge scores the triage agent at 3/5: correct labels and reasonable comment, but the agent accepts the issue at face value without noticing the regex actually handles +. Threshold set to 2.5 with a TODO to raise it.

Still TODO
- evals GitHub environment with secrets (EVAL_GH_TOKEN, GCP_CREDENTIALS) and vars (EVAL_ORG, ANTHROPIC_VERTEX_PROJECT_ID)
- execute.py expectations so we can use their parallelism

Refs #499, #73
🤖 Generated with Claude Code