feat: add functional eval framework for agent pipelines by ralphbean · Pull Request #1682 · fullsend-ai/fullsend

ralphbean · 2026-05-29T00:26:28Z

Summary

Adds end-to-end functional eval framework that tests the full agent pipeline (pre-script → agent → post-script) by observing side effects on GitHub fixtures
Ephemeral repos with UUID suffixes ensure concurrent eval runs don't interfere
Uses agent-eval-harness for LLM-graded scoring via CLI runner interface
First test case (001-bug-url-encoding) deliberately tests whether the triage agent reads source code critically vs. parroting the issue description
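
The UUID-suffix idea for ephemeral repos can be sketched as below. This is only an illustration, not the code in `eval/fullsend-runner.sh`: the `eval-fixture` prefix and both function names are hypothetical, and the fallback chain for generating a random suffix is an assumption about what tools the host provides.

```shell
#!/bin/sh
# Sketch: build a collision-resistant name for an ephemeral eval repo.
# The function names and the "eval-fixture" prefix are hypothetical.

# Print a random identifier using whichever source is available.
random_uuid() {
    if command -v uuidgen >/dev/null 2>&1; then
        uuidgen | tr '[:upper:]' '[:lower:]'
    elif [ -r /proc/sys/kernel/random/uuid ]; then
        cat /proc/sys/kernel/random/uuid
    else
        # Fallback: 16 random bytes as hex (not a formatted UUID).
        od -An -N16 -tx1 /dev/urandom | tr -d ' \n'
        echo
    fi
}

# Usage: ephemeral_repo_name PREFIX
ephemeral_repo_name() {
    printf '%s-%s\n' "$1" "$(random_uuid)"
}

ephemeral_repo_name "eval-fixture"
```

Because each run draws a fresh suffix, two concurrent runs get different repo names without any coordination between them.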

What's in the box

File	Purpose
`eval/fullsend-runner.sh`	CLI runner: ephemeral repo → fixture → fullsend run → capture state → teardown
`eval/run-functional.sh`	Orchestrator: iterate cases, call runner, score with agent-eval-harness
`eval/triage/eval.yaml`	Eval config: LLM judge (1-5 rubric) + deterministic label check
`eval/triage/cases/001-*`	Test case with tricky scenario (regex accepts `+` but issue claims it doesn't)
`eval/triage/repos/python-webapp/`	Shared repo content, symlinked by cases
`.github/workflows/functional-evals.yml`	CI workflow triggered by `eval/` or `internal/scaffold/` changes
`Makefile`	`make functional-evals` target

Current results

The LLM judge scores the triage agent at 3/5 — correct labels and reasonable comment, but the agent accepts the issue at face value without noticing the regex actually handles +. Threshold set to 2.5 with a TODO to raise it.

Still TODO

Set up evals GitHub environment with secrets (EVAL_GH_TOKEN, GCP_CREDENTIALS) and vars (EVAL_ORG, ANTHROPIC_VERTEX_PROJECT_ID)
Refactor case layout to match agent-eval-harness execute.py expectations so we can use their parallelism
Add more test cases (feature request, vague bug, PR review)
Per-agent path filtering in CI (only run triage evals when triage-related files change)

Refs #499, #73

🤖 Generated with Claude Code

github-actions · 2026-05-29T00:27:54Z

Site preview

Preview: https://c48d522b-site.fullsend-ai.workers.dev

Commit: a67077658b24f50a39aa7097b0b10196de77debe

fullsend-ai-review · 2026-05-29T00:32:22Z

Review

Findings

Medium

[protected-path] .github/workflows/functional-evals.yml: This file is under .github/, a protected path requiring human approval. The PR links to issues #499 ("Investigate opendatahub-io/agent-eval-harness for skill evals") and #73 ("Add regression tests / evals for all agents and skills"), and the description explains the rationale for the new CI workflow (functional eval framework for agent pipelines). Human review of the workflow's secret access, permissions, and trigger configuration is required before merge.

Low

[correctness] .github/workflows/functional-evals.yml — The workflow references secrets.EVAL_GH_TOKEN and vars.EVAL_ORG but lacks an environment: evals declaration on the job. Without it, environment-scoped secrets and variables won't resolve. The PR TODO acknowledges the environment isn't set up yet, but the YAML should include the environment: stanza so it's ready when secrets are configured.
Remediation: Add environment: evals to the functional-evals job definition.
[correctness] eval/fullsend-runner.sh — Ephemeral repos are created with --public (line 84 in the diff). This is acceptable for the current test fixtures (innocuous Python webapp), but future test cases could inadvertently expose sensitive patterns or data. Consider using --private if the eval token has the necessary permissions, or document the public visibility as a deliberate design constraint.

Info

[style] .github/workflows/functional-evals.yml — The OpenShell setup (version, CLI install, gateway download, Podman config, gateway start) is duplicated from action.yml. The inline TODO already flags this for extraction into a shared script.
[correctness] eval/run-functional.sh — The script does not produce a summary exit code reflecting the scoring outcome. CI gating depends on score.py exiting non-zero on threshold failure. If score.py handles this natively, no change needed; if not, the CI step will always pass regardless of eval scores.
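
One way to make the orchestrator's exit status reflect scoring, independent of how score.py behaves, is to count failures in the loop. This is a minimal sketch under assumptions: `run_cases` and the scorer argument are hypothetical names, and the real script's case iteration may look different.

```shell
#!/bin/sh
# Sketch: run a scoring command per case and fail if any case fails.
# run_cases and its SCORER argument are hypothetical, not the PR's code.

# Usage: run_cases SCORER CASE...
# SCORER is any command that takes a case name and returns non-zero on failure.
run_cases() (
    scorer=$1
    shift
    failed=0
    for case_name in "$@"; do
        if "$scorer" "$case_name"; then
            echo "PASS $case_name"
        else
            echo "FAIL $case_name"
            failed=$((failed + 1))
        fi
    done
    echo "$failed case(s) failed"
    # The function's status is non-zero when at least one case failed.
    [ "$failed" -eq 0 ]
)
```

A CI step that ends with `run_cases some_scorer case-a case-b` would then fail whenever any case scores below threshold, provided the scorer itself returns non-zero in that situation.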

Previous run

Review

Findings

Medium

[protected-path] .github/workflows/functional-evals.yml: This file is under .github/, a protected path requiring human approval. The PR links to issues #499 ("Investigate opendatahub-io/agent-eval-harness for skill evals") and #73 ("Add regression tests / evals for all agents and skills"), and the description explains the rationale for the new CI workflow. Human review of the workflow's secret access, permissions, and trigger configuration is required before merge.

Low

[correctness] docs/testing/evals.md:39 — The environment variable table describes FULLSEND_DIR as "Path to pre-built fullsend binary (skips build)" but this is incorrect. FULLSEND_DIR is the path to the fullsend scaffold directory (e.g., internal/scaffold/fullsend-repo), as documented in the Makefile default and the runner script comments. The description and required/optional status should be corrected.
Remediation: Change the description to "Path to fullsend scaffold directory (default: internal/scaffold/fullsend-repo)" and mark as required (the runner script uses ${FULLSEND_DIR:?}).
[correctness] .github/workflows/functional-evals.yml — The workflow references secrets.EVAL_GH_TOKEN and vars.EVAL_ORG but lacks an environment: evals declaration on the job. Without it, environment-scoped secrets and variables won't resolve. The PR TODO acknowledges the environment isn't set up yet, but the YAML should include the environment: stanza so it's ready when secrets are configured.
Remediation: Add environment: evals to the functional-evals job definition.
[correctness] eval/fullsend-runner.sh:84 — Ephemeral repos are created with --public. This is fine for the current test fixtures (innocuous Python webapp), but future test cases could inadvertently expose sensitive patterns or data. Consider using --private if the eval token has the necessary permissions, or document the public visibility as a deliberate design constraint.

Info

[style] .github/workflows/functional-evals.yml — The OpenShell setup (version, CLI install, gateway download, Podman config, gateway start) is duplicated from action.yml. The inline TODO already flags this for extraction into a shared script. Tracking as a follow-up.
[correctness] eval/run-functional.sh — The script does not produce a summary exit code reflecting the scoring outcome. CI gating depends on score.py exiting non-zero on threshold failure. If score.py does handle this, no change needed; if not, the CI step will always pass regardless of eval scores.

Previous run (2)

Review

Reason: stale-head

The review agent reviewed commit a74886c393f4913c5d194e1e2fb0c0c4531346e8 but the PR HEAD is now 8c8c781113d441f22e8c870e14b5cbcd5c32d3f8. This review was discarded to avoid approving unreviewed code.

Previous run (3)

Review

Findings

Medium

[protected-path] .github/workflows/functional-evals.yml: This PR adds a new GitHub Actions workflow under .github/, which is a protected path requiring human approval. The PR links to issues #499 ("Investigate opendatahub-io/agent-eval-harness for skill evals") and #73 ("Add regression tests / evals for all agents and skills") and explains the rationale (functional eval framework for agent pipelines). The workflow uses appropriately scoped permissions (contents: read, id-token: write) and references secrets via a GitHub environment that hasn't been created yet (acknowledged in the PR as a TODO).

Low

[correctness] eval/fullsend-runner.sh:353 — grep -oP '/issues/\K[0-9]+' uses Perl-compatible regex (-P flag) which is not available on macOS BSD grep. Since the script header and Makefile target suggest local execution is expected, consider using sed or awk for portability: echo "$url" | sed 's|.*/issues/||'.
Remediation: Replace grep -oP '/issues/\K[0-9]+' and grep -oP '/pull/\K[0-9]+' with portable alternatives (e.g., awk -F/ '{print $NF}' or sed 's|.*/||').
[correctness] internal/cli/run.go:1810 — The new copyFile function uses defer out.Close() without checking the close error. While Go's os.File.Write is unbuffered (writes go to syscall directly), the idiomatic Go pattern is to explicitly close the destination file and check the error before returning, especially for write operations. In practice this is safe for the temp-dir use case here.
Remediation: Close out explicitly before the Stat call and return any close error.

Previous run (4)

Review

Findings

Medium

[protected-path] .github/workflows/functional-evals.yml: New workflow file under .github/. The PR provides sufficient context (references #499, "Investigate opendatahub-io/agent-eval-harness for skill evals", and #73, "Add regression tests / evals for all agents and skills"; explains the eval framework purpose), but human approval is always required for changes to protected governance and infrastructure paths.

Low

[correctness] eval/fullsend-runner.sh:352,389 — grep -oP uses Perl regex, a GNU grep extension not available on macOS. The primary use is CI on ubuntu-latest so this works today, but developers running evals locally on macOS will hit failures.
Remediation: Replace with a POSIX-compatible alternative, e.g. sed -n 's|.*/issues/||p' or awk -F/ '{print $NF}'.
[correctness] internal/cli/run.go:724-727 — The openshell upload path fix is correct (upload to parent dir so the binary lands at the expected path), but this bug fix is bundled into an eval-framework feature PR. A separate commit improves bisectability if the upload behavior needs to be reverted independently.
[documentation-currency] docs/problems/testing-agents.md — This doc explores agent testing approaches (golden-set evaluation, behavioral contracts) in the abstract. The new eval/ directory is a concrete implementation of Approach 1 (golden-set evaluation with LLM judges). A cross-reference from the doc to the implementation would help contributors discover the working framework.

Previous run (5)

Review

Findings

High

[correctness] internal/cli/run.go:348,697,725,731,735,1283 — Debug instrumentation left in production code, now in six locations — worse than the prior review. (a) Lines 348–350: a new sandbox.Exec call that runs ls -la /tmp/workspace/bin/ and prints to stderr after every bootstrap. (b) Line 697: prints fullsendBinary and localBinary flag values to stderr. (c) Line 725: prints localBinary before upload to stderr. (d) Line 731: prints upload source/destination to stderr. (e) Lines 735–737: runs ls -la /tmp/workspace/bin/fullsend inside the sandbox and prints the result to stderr. (f) Lines 1283–1284: echo DEBUG:PATH=$PATH >&2 && ls -la .../bin/fullsend >&2 && which fullsend >&2 injected into buildScanContextCommand, emitting debug output on every context scan — flagged in the prior review and still present. The latest commit added five more debug blocks instead of removing the previously flagged instance.
Remediation: Remove all six debug blocks. For (a), delete lines 348–350. For (b)–(e), delete the fmt.Fprintf(os.Stderr, "DEBUG ... lines and the verify sandbox.Exec call. For (f), restore the original single-line fmt.Sprintf in buildScanContextCommand without the debug echo/ls/which commands.

Medium

[correctness] eval/fullsend-runner.sh:351,387 — gh issue create and gh pr create use 2>&1, merging stderr into the captured URL. If gh emits warnings or progress text to stderr, they contaminate FIXTURE_URL and may break downstream scoring via fixture-state.json. While grep -oP extracts the number today, the URL variable itself carries garbage.
Remediation: Remove 2>&1 from both command substitutions. Redirect stderr to a log file if error capture is needed.
[protected-path] .github/workflows/functional-evals.yml: This PR adds a new CI workflow under .github/, a protected path. The PR links to #499 ("Investigate opendatahub-io/agent-eval-harness for skill evals") and #73 ("Add regression tests / evals for all agents and skills") and describes the workflow's purpose. Human approval is always required for protected-path changes regardless of context.
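
The `2>&1` finding above is easy to reproduce in isolation. In this sketch `fake_gh` is a hypothetical stand-in for `gh issue create` (URL on stdout, a notice on stderr), and the example URL is made up; it only illustrates the difference between the two capture styles.

```shell
#!/bin/sh
# Sketch: why merging stderr into a captured URL is fragile.
# fake_gh is a hypothetical stand-in for `gh issue create`.
fake_gh() {
    echo "warning: a notice on stderr" >&2
    echo "https://github.com/example-org/example-repo/issues/7"
}

log_file=$(mktemp)

# Merged streams: the warning ends up inside the variable.
merged=$(fake_gh 2>&1)

# Stdout only: stderr is appended to a log file instead.
clean=$(fake_gh 2>>"$log_file")

printf 'merged: %s\n' "$merged"
printf 'clean:  %s\n' "$clean"

rm -f "$log_file"
```

With the second form the variable holds exactly the URL, and the diagnostic text is still kept for debugging.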

Low

[supply-chain] eval/run-functional.sh:55 — git clone --depth=1 of agent-eval-harness without pinning to a commit or tag. A compromised upstream would execute malicious score.py in CI with access to EVAL_GH_TOKEN and GCP credentials.
Remediation: Pin the clone to a specific commit SHA.
[supply-chain] .github/workflows/functional-evals.yml:48 — yq binary downloaded via curl without SHA256 checksum verification. Other tool installs in this repo (e.g., lychee in the Makefile) include checksum verification.
Remediation: Add sha256sum -c verification after download.
[platform-security] eval/fullsend-runner.sh:316 — GH_TOKEN embedded in the git clone URL is visible in /proc/*/cmdline. Acceptable in ephemeral CI but worth noting.
Remediation: Use a credential helper instead of embedding the token in the URL.
[documentation-currency] docs/problems/testing-agents.md:272 — The "Practical architecture" section describes a hypothetical eval pipeline using Inspect AI and DeepEval Synthesizer. This PR implements a real pipeline using agent-eval-harness in eval/. A cross-reference to the actual implementation would help future readers. Not blocking since problem docs are exploratory.

Previous run (6)

Review

Findings

High

[correctness] internal/cli/run.go:348,1276 — Debug instrumentation left in production code, now in two locations. (a) Lines 348–350: a new sandbox.Exec call that runs ls -la /tmp/workspace/bin/ and prints to stderr after every bootstrap — added since the prior review. (b) Line 1276: echo DEBUG:PATH, ls -la .../bin/fullsend, and which fullsend injected into buildScanContextCommand, emitting debug output to stderr on every context scan — flagged in the prior review and still present. The latest commit added more debug code instead of removing the previously flagged instance.
Remediation: Remove both debug blocks. For (a), delete lines 348–350 (the debugOut/debugErr block). For (b), restore the original single-line fmt.Sprintf in buildScanContextCommand without the debug echo/ls/which commands.

Medium

[correctness] eval/fullsend-runner.sh:351,385 — gh issue create and gh pr create use 2>&1, merging stderr into the captured URL. If gh emits warnings or progress text to stderr, they contaminate FIXTURE_URL and may break downstream scoring via fixture-state.json. While grep -oP extracts the number today, the URL variable itself carries garbage.
Remediation: Remove 2>&1 from both command substitutions. Redirect stderr to a log file if error capture is needed.
[protected-path] .github/workflows/functional-evals.yml: This PR adds a new CI workflow under .github/, a protected path. The PR links to #499 ("Investigate opendatahub-io/agent-eval-harness for skill evals") and #73 ("Add regression tests / evals for all agents and skills") and describes the workflow's purpose. Human approval is always required for protected-path changes regardless of context.

Low

[supply-chain] eval/run-functional.sh:55 — git clone --depth=1 of agent-eval-harness without pinning to a commit or tag. A compromised upstream would execute malicious score.py in CI with access to EVAL_GH_TOKEN and GCP credentials.
Remediation: Pin the clone to a specific commit SHA.
[supply-chain] .github/workflows/functional-evals.yml:48 — yq binary downloaded via curl without SHA256 checksum verification. Other tool installs in this repo (e.g., lychee in the Makefile) include checksum verification.
Remediation: Add sha256sum -c verification after download.
[platform-security] eval/fullsend-runner.sh:316 — GH_TOKEN embedded in the git clone URL is visible in /proc/*/cmdline. Acceptable in ephemeral CI but worth noting.
Remediation: Use a credential helper instead of embedding the token in the URL.
[documentation-currency] docs/problems/testing-agents.md:272 — The "Practical architecture" section describes a hypothetical eval pipeline using Inspect AI and DeepEval Synthesizer. This PR implements a real pipeline using agent-eval-harness in eval/. A cross-reference to the actual implementation would help future readers. Not blocking since problem docs are exploratory.

Previous run (8)

Review

Findings

High

[correctness] internal/cli/run.go:1276 — Debug instrumentation left in production code. The change adds echo DEBUG:PATH=$PATH >&2 && ls -la .../bin/fullsend >&2 && which fullsend >&2 to buildScanContextCommand, which runs inside every sandbox context scan. This emits noisy debug output to stderr on every agent run and is clearly development debugging that should not be merged.
Remediation: Revert the internal/cli/run.go change — remove the debug echo, ls -la, and which commands from the format string, restoring the original single-line command.

Medium

[correctness] eval/fullsend-runner.sh:351 — gh issue create and gh pr create (line 385) use 2>&1, which merges stderr into the captured URL. If gh emits any warnings or progress messages to stderr (e.g., authentication notices, rate-limit warnings), they contaminate FIXTURE_URL. While grep -oP may still extract the number, FIXTURE_URL will contain garbage text that propagates into fixture-state.json and potentially breaks downstream scoring.
Remediation: Remove 2>&1 from both gh issue create and gh pr create command substitutions. If error capture is needed, redirect stderr to a log file instead.
[protected-path] .github/workflows/functional-evals.yml: This PR adds a new CI workflow under .github/, which is a protected path. The PR body explains the purpose (functional eval CI) and references #499 ("Investigate opendatahub-io/agent-eval-harness for skill evals") and #73 ("Add regression tests / evals for all agents and skills"), providing sufficient context. Human approval is required for all protected-path changes regardless of justification.

Low

[supply-chain] eval/run-functional.sh:55 — git clone --depth=1 https://github.com/opendatahub-io/agent-eval-harness.git clones the harness repo at HEAD without pinning to a specific commit or tag. If the upstream repo is compromised, malicious scoring code (score.py) would execute in CI with access to EVAL_GH_TOKEN and GCP credentials.
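Remediation: Pin the clone to a specific tag or commit SHA and document the pinned version. Note that git clone --depth=1 --branch <tag> ... accepts a tag or branch name but not a raw commit SHA; pinning to a SHA requires fetching that commit explicitly (e.g., git fetch --depth=1 origin <sha> followed by git checkout FETCH_HEAD).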
[platform-security] eval/fullsend-runner.sh:316 — git clone "https://x-access-token:${GH_TOKEN}@github.com/..." embeds the token in the URL, which is visible in /proc/*/cmdline on Linux. Acceptable in ephemeral CI runners but worth noting.
Remediation: Consider using git -c credential.helper='!echo password=${GH...' clone ... or configuring the credential helper to avoid token exposure in process listings.

Previous run (9)

Review

Findings

High

[platform-security] eval/fullsend-runner.sh:511 — Token leakage via CI artifact upload. The runner writes .eval-env containing GH_TOKEN, PUSH_TOKEN, and REVIEW_TOKEN to $OUTPUT_DIR (which resolves to eval/runs/<agent>/<run-id>/cases/<case>/). The CI workflow uploads eval/runs/ as an artifact with 30-day retention. The .eval-env file would be included in the downloadable artifact, exposing all three tokens to anyone with artifact download access. The cleanup trap only removes TARGET_DIR — it does not delete ENV_FILE.
Remediation: Write the env file to a temp location outside the output directory (e.g., ENV_FILE=$(mktemp)) and add rm -f "$ENV_FILE" to the cleanup trap. Alternatively, add an exclusion pattern to the artifact upload step.
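
The first remediation option could look roughly like the following. It is a sketch under assumptions: `write_env_file` is a hypothetical helper, the token value is a placeholder, and the real runner's trap also has to keep removing TARGET_DIR.

```shell
#!/bin/sh
# Sketch: keep a token-bearing env file out of the uploaded output directory.
# write_env_file and the placeholder token are hypothetical.

# Usage: write_env_file TOKEN   (prints the path of the created file)
write_env_file() {
    env_file=$(mktemp) || return 1
    chmod 600 "$env_file"
    # printf '%s' writes the value literally, with no further expansion.
    printf 'GH_TOKEN=%s\n' "$1" > "$env_file"
    printf '%s\n' "$env_file"
}

ENV_FILE=$(write_env_file "example-token-not-real")

# Remove the file on any exit, next to whatever else the trap cleans up.
trap 'rm -f "$ENV_FILE"' EXIT

echo "env file created at $ENV_FILE"
```

Since the file lives in the system temp directory rather than under `eval/runs/`, an artifact upload of that tree cannot pick it up even if the trap never fires.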

Medium

[protected-path] .github/workflows/functional-evals.yml: This file is under .github/, a protected path requiring human approval. The PR links to #499 ("Investigate opendatahub-io/agent-eval-harness for skill evals") and #73 ("Add regression tests / evals for all agents and skills"), and the description explains the workflow's purpose. Human review and approval is required regardless of automated review outcome.
[supply-chain] .github/workflows/functional-evals.yml:48 — yq binary downloaded from GitHub Releases without checksum verification (curl | chmod +x). Compare with the existing lychee install in the Makefile which includes a sha256sum -c check. Additionally, agent-eval-harness is installed from a git URL without a version pin or commit SHA (line 44), meaning the CI installs whatever is on the default branch at run time.
Remediation: Add SHA256 checksum verification for the yq download. Pin agent-eval-harness to a specific commit SHA or release tag (e.g., git+https://github.com/opendatahub-io/agent-eval-harness.git@<commit>).
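
The checksum half of this remediation might be sketched as follows. The `verify_sha256` helper is a hypothetical name, and the demo hashes a local temp file rather than a real download; in the workflow the expected digest would be a pinned value recorded alongside the tool version.

```shell
#!/bin/sh
# Sketch: verify a file against an expected SHA-256 digest.
# verify_sha256 is a hypothetical helper; assumes GNU coreutils sha256sum.

# Usage: verify_sha256 FILE EXPECTED_HEX
verify_sha256() {
    # sha256sum -c reads "<digest>  <path>" lines and exits non-zero on mismatch.
    printf '%s  %s\n' "$2" "$1" | sha256sum -c - >/dev/null 2>&1
}

# Demo on a local file instead of a downloaded binary.
demo_file=$(mktemp)
printf 'hello\n' > "$demo_file"
good_sum=$(sha256sum "$demo_file" | cut -d' ' -f1)
bad_sum=$(printf '%064d' 0)

if verify_sha256 "$demo_file" "$good_sum"; then
    echo "checksum OK"
fi
if ! verify_sha256 "$demo_file" "$bad_sum"; then
    echo "mismatch rejected"
fi

rm -f "$demo_file"
```

In a CI step the download would be followed by `verify_sha256 ./yq "$EXPECTED" || exit 1` before `chmod +x`, where `EXPECTED` is an assumed variable holding the pinned digest.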

Low

[correctness] eval/fullsend-runner.sh:352 — grep -oP uses PCRE mode, which is a GNU grep extension not available on macOS default grep. Local development on macOS would fail when parsing fixture URLs. The same pattern appears at line 389 for PR URL parsing.
Remediation: Use sed 's|.*/issues/||' or awk -F/ '{print $NF}' for portable URL parsing.
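
A portable replacement for the `grep -oP` calls could be written with plain `sed`, which behaves the same on GNU and BSD systems for this pattern. The function name `fixture_number` and the example URLs are hypothetical; only the `/issues/` and `/pull/` URL shapes come from the finding above.

```shell
#!/bin/sh
# Sketch: extract the issue or PR number from a GitHub URL without grep -P.
# fixture_number is a hypothetical name; uses only POSIX basic regex in sed.

# Usage: fixture_number URL
fixture_number() {
    printf '%s\n' "$1" | sed -n \
        -e 's|.*/issues/\([0-9][0-9]*\).*|\1|p' \
        -e 's|.*/pull/\([0-9][0-9]*\).*|\1|p'
}

fixture_number "https://github.com/example-org/example-repo/issues/42"  # prints 42
fixture_number "https://github.com/example-org/example-repo/pull/17"    # prints 17
```

Unlike `awk -F/ '{print $NF}'`, this prints nothing when the URL has neither path segment, so a caller can treat empty output as a parse failure.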

Previous run (10)

Review

Findings

Medium

[correctness] eval/fullsend-runner.sh:137,173 — gh issue create and gh pr create output is captured with 2>&1, which mixes stderr progress messages into the URL variable. This makes FIXTURE_URL unreliable and FIXTURE_NUMBER extraction via grep -oP fragile — it works today because stderr messages don't contain /issues/ or /pull/, but any change to gh CLI output format could break it silently.
Remediation: Drop 2>&1 so only stdout (the URL) is captured, or redirect stderr to /dev/null or a log file: url=$(gh issue create ... 2>/dev/null) or url=$(gh issue create ... 2>>"$case_output_dir/runner.log").
[protected-path] .github/workflows/functional-evals.yml: This PR adds a new GitHub Actions workflow under .github/, which is a protected path requiring human approval. The PR links to issues #499 ("Investigate opendatahub-io/agent-eval-harness for skill evals") and #73 ("Add regression tests / evals for all agents and skills") and provides clear rationale for the workflow. Human review of this file is required regardless of automated review outcome.

Low

[supply-chain] .github/workflows/functional-evals.yml:48 — The yq binary is downloaded from GitHub releases via curl without SHA256 checksum verification. While HTTPS and curl -sSfL provide baseline safety, adding checksum verification would be a defense-in-depth improvement against compromised release artifacts.
[supply-chain] .github/workflows/functional-evals.yml:84 — The openshell-gateway binary is downloaded from GitHub releases via curl without SHA256 checksum verification, same pattern as the yq download. The supervisor image is pinned by SHA (dfd47683e7da4f1a4a8fa5d77f92d3696e6a41f9), which is good — applying the same rigor to the gateway binary would be consistent.
[documentation-currency] docs/problems/testing-agents.md:272 — The "Practical architecture" section proposes a hypothetical eval pipeline using Inspect AI and DeepEval Synthesizer. This PR implements a real eval pipeline using agent-eval-harness in eval/. While the doc is exploratory (problem doc, not spec), a cross-reference to the actual implementation would help future readers. Not blocking since problem docs are expected to evolve independently.
[correctness] eval/fullsend-runner.sh:298 — The env file writes GH_TOKEN=${GH_TOKEN} using double-quoted shell expansion. If the token ever contains shell metacharacters ($, backticks), they would be interpreted during expansion. Current GitHub tokens are alphanumeric so this is not an active bug, but using printf '%s' "$GH_TOKEN" or a quoted heredoc would be more robust.
[style] eval/fullsend-runner.sh:97 — Ephemeral repos created as --public without a code comment explaining the choice. Adding a comment about why public was chosen (vs private) helps future maintainers understand the trade-off.

Previous run (11)

Review

Findings

Medium

[correctness] eval/fullsend-runner.sh:130,156 — gh issue create and gh pr create output is captured with 2>&1, which mixes stderr progress messages into the URL variable. This makes FIXTURE_URL unreliable and FIXTURE_NUMBER extraction via grep -oP fragile — it works today because stderr messages don't contain /issues/ or /pull/, but any change to gh CLI output format could break it silently.
Remediation: Drop 2>&1 so only stdout (the URL) is captured, or redirect stderr to /dev/null or a log file: url=$(gh issue create ... 2>/dev/null) or url=$(gh issue create ... 2>>"$case_output_dir/runner.log").
[protected-path] .github/workflows/functional-evals.yml: This PR adds a new GitHub Actions workflow under .github/, which is a protected path requiring human approval. The PR links to issues #499 ("Investigate opendatahub-io/agent-eval-harness for skill evals") and #73 ("Add regression tests / evals for all agents and skills") and provides clear rationale for the workflow. Human review of this file is required regardless of automated review outcome.

Low

[supply-chain] .github/workflows/functional-evals.yml:43 — The yq binary is downloaded from GitHub releases via curl without SHA256 checksum verification. While HTTPS and curl -sSfL provide baseline safety, adding checksum verification would be a defense-in-depth improvement against compromised release artifacts.
[documentation-currency] docs/problems/testing-agents.md:272 — The "Practical architecture" section proposes a hypothetical eval pipeline using Inspect AI and DeepEval Synthesizer. This PR implements a real eval pipeline using agent-eval-harness in eval/. While the doc is exploratory (problem doc, not spec), a cross-reference to the actual implementation would help future readers. Not blocking since problem docs are expected to evolve independently.
[correctness] eval/fullsend-runner.sh:263 — The env file writes GH_TOKEN=${GH_TOKEN} using double-quoted heredoc expansion. If the token ever contains shell metacharacters ($, backticks), they would be interpreted. Current GitHub tokens are alphanumeric so this is not an active bug, but using printf or single-quoting the value would be more robust.
[style] eval/fullsend-runner.sh:91 — Ephemeral repos created as --public without a code comment explaining the choice. Adding a comment about why public was chosen (vs private) helps future maintainers understand the trade-off.

Previous run (13)

Review

Findings

Medium

[correctness] eval/fullsend-runner.sh:130,156 — gh issue create and gh pr create output is captured with 2>&1, which mixes stderr progress messages into the URL variable. This makes FIXTURE_URL unreliable and FIXTURE_NUMBER extraction via grep -oP fragile — it works today because stderr messages don't contain /issues/ or /pull/, but any change to gh CLI output format could break it silently.
Remediation: Drop 2>&1 so only stdout (the URL) is captured, or redirect stderr to /dev/null or a log file: url=$(gh issue create ... 2>/dev/null) or url=$(gh issue create ... 2>>"$case_output_dir/runner.log").
[protected-path] .github/workflows/functional-evals.yml — This PR adds a new GitHub Actions workflow under .github/, which is a protected path requiring human approval. The PR links to issues #499 (Investigate opendatahub-io/agent-eval-harness for skill evals) and #73 (Add regression tests / evals for all agents and skills) and provides clear rationale for the workflow. Human review of this file is required regardless of automated review outcome.
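
The remediation for the 2>&1 finding above could look roughly like the sketch below. `fake_gh_create` is a stand-in for `gh issue create` (assumed here to print progress on stderr and the URL on stdout, as the finding describes), and `fixture_number_from_url` is a hypothetical helper, not a function from the PR.

```shell
#!/bin/sh
# Sketch only: capture the URL from stdout and keep stderr out of the variable.

# Stand-in for `gh issue create`: progress on stderr, URL on stdout.
fake_gh_create() {
  echo "Creating issue in example-org/example-repo" >&2
  echo "https://github.com/example-org/example-repo/issues/42"
}

# Strip everything up to the last "/" with POSIX parameter expansion,
# which avoids depending on `grep -P`.
fixture_number_from_url() {
  printf '%s\n' "${1##*/}"
}

log=$(mktemp)
# stderr is appended to a log file instead of being merged into the capture.
FIXTURE_URL=$(fake_gh_create 2>>"$log")
FIXTURE_NUMBER=$(fixture_number_from_url "$FIXTURE_URL")
echo "url=$FIXTURE_URL number=$FIXTURE_NUMBER"
rm -f "$log"
```

Because only stdout is captured, a change in the progress messages cannot leak into the URL or the extracted number.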

Low

[documentation-currency] docs/problems/testing-agents.md:272 — The "Practical architecture" section proposes a hypothetical eval pipeline using Inspect AI and DeepEval Synthesizer. This PR implements a real eval pipeline using agent-eval-harness in eval/. While the doc is exploratory (problem doc, not spec), a cross-reference to the actual implementation would help future readers. Not blocking since problem docs are expected to evolve independently.
[correctness] eval/fullsend-runner.sh:263 — The env file writes GH_TOKEN=${GH_TOKEN} using double-quoted heredoc expansion. If the token ever contains shell metacharacters ($, backticks), they would be interpreted. Current GitHub tokens are alphanumeric so this is not an active bug, but using printf or single-quoting the value would be more robust.
[style] eval/fullsend-runner.sh:91 — Ephemeral repos created as --public without a code comment explaining the choice. Adding a comment about why public was chosen (vs private) helps future maintainers understand the trade-off.
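
One way to make the env-file write robust, sketched under the assumption that the file is later sourced by a shell (a dotenv-style parser would need different quoting). `quote_sh` and `write_env_file` are hypothetical helpers, not code from eval/fullsend-runner.sh.

```shell
#!/bin/sh
# Sketch only: write a secret to an env file so that sourcing it yields the
# literal value, even if the value contains $, backticks, or single quotes.

# quote_sh VALUE: wrap VALUE in single quotes, rewriting each embedded
# single quote as '\'' so the result is one literal shell word.
quote_sh() {
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# write_env_file PATH TOKEN: create PATH with owner-only permissions first,
# then append the quoted assignment.
write_env_file() {
  ( umask 077 && : > "$1" ) || return 1
  chmod 600 "$1" || return 1
  printf 'GH_TOKEN=%s\n' "$(quote_sh "$2")" >> "$1"
}
```

Creating the file with restrictive permissions before any secret is written avoids a window in which the token is readable by other users.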

fullsend-ai-review

See the review comment for full details.

sandbox.Upload() silently fails for large files (~16MB) — the binary doesn't appear in the sandbox. Switch bootstrapSandbox to use UploadDir() (tarball + extract), the same mechanism used for uploading project code, which works reliably regardless of file size.
Observed in CI functional evals (run 26642640034). This is the same class of issue that motivated the UploadFile helper in commit 907d482.
Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>

Assisted-by: Claude [claude-opus-4-6] <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>

Add openshell sandbox setup (CLI, gateway, Podman), GCP WIF authentication, and credential preparation for functional evals.
Key changes:
- Set up openshell sandbox infrastructure in the workflow
- Prepare WIF credentials for sandbox use (rewrite external_account config to file-based OIDC token source)
- Restore host credentials for the scoring phase (LLM judge runs on the host, not in the sandbox)
- Pass EVALS_GCP_REGION and EVALS_VERTEX_PROJECT_ID for Vertex AI
- Configure git identity and authenticated clone URLs
- Pass --fullsend-binary flag in eval runner
Assisted-by: Claude [claude-opus-4-6] <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>

- ADR 0044: establish functional evals as a test category in the four-layer test pyramid (unit, prompt eval, functional eval, e2e)
- ADR 0045: adopt agent-eval-harness as the eval framework, using its opaque CLI runner contract as the integration boundary
- Add docs/testing/evals.md with contributor guide for writing and running functional evals
- Update docs/architecture.md with ADR 0044 references on testing open questions
- Annotate testing-agents.md golden-set bootstrapping question as partially answered
Refs #1682
Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>

- Use git credential helper instead of embedding GH_TOKEN in clone URLs, preventing token persistence in .git/config
- Set 0600 permissions on .eval-env before writing secrets
- Delete .eval-env after fullsend run to prevent secret upload as artifact
- Capture fullsend run exit code correctly (rc=$? pattern vs $? after ||)
- Remove 2>&1 from gh create commands that corrupted URL extraction
- Replace non-portable grep -P with portable parameter expansion
- Pipe yq output directly instead of echo to prevent backslash mangling
- Add explicit out.Close() check in copyFile to catch write flush errors
- Add concurrency group with cancel-in-progress to CI workflow
- Add timeout-minutes: 45 to prevent runaway eval jobs
- Fix shellcheck warnings (SC2086, SC2034, SC2155) in CI workflow
- Suppress ruff F841 in eval fixture code (intentional unused variable)
- Add reversibility note to ADR 0045 consequences
Assisted-by: Claude [claude-opus-4-6] <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
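
The exit-code item in this commit can be illustrated with a small sketch. `capture_rc` is a hypothetical helper, and the command passed to it stands in for `fullsend run`. The point is that `cmd || rc=$?` records the status of `cmd` itself, whereas reading `$?` after a longer `cmd || handler` list reports the status of whatever ran last.

```shell
#!/bin/sh
set -e

# capture_rc CMD...: run CMD and print its exit status without letting a
# failure abort the script under `set -e`.
capture_rc() {
  rc=0
  # The right-hand side only runs when CMD fails, so $? is still CMD's status.
  "$@" || rc=$?
  printf '%s\n' "$rc"
}

# Contrast: in `cmd || echo "failed"; rc=$?` the value of $? is the status
# of `echo` (0) whenever cmd fails, so the failure is lost.

echo "status of a failing command: $(capture_rc sh -c 'exit 3')"
echo "status of a passing command: $(capture_rc true)"
```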

- Delete .eval-env in cleanup trap so secrets don't survive crashes
- Scrub .eval-env from artifact uploads as defense-in-depth
- Pin agent-eval-harness clone to a specific commit ref
- Add per-case timeout (default 30min) to prevent hung agents
- Fix credential helper comment (expression is stored, not the token)
- Fix docs env var table: add missing vars, correct FULLSEND_DIR desc
- Remove unused shell variable declaration (shellcheck SC2034)
- Remove unused comments.min_count/max_count from annotations
Assisted-by: Claude [claude-opus-4-6] <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
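
A minimal sketch of the cleanup-trap idea from this commit, assuming the secret file lives in a per-case directory. `run_with_secret_file` is a hypothetical wrapper, not the runner's actual code; it only shows that an EXIT trap removes the file whether the wrapped command succeeds or fails.

```shell
#!/bin/sh
# Sketch only: remove a secrets file on every exit path.

# run_with_secret_file DIR CMD...: create DIR/.eval-env, run CMD, and delete
# the file on exit. The body runs in a subshell so the trap is scoped to it.
run_with_secret_file() (
  env_file="$1/.eval-env"
  shift
  cleanup() { rm -f "$env_file"; }
  trap cleanup EXIT
  # Turn common signals into a normal exit so the EXIT trap still fires.
  trap 'exit 130' INT
  trap 'exit 143' TERM
  ( umask 077 && : > "$env_file" )
  rc=0
  "$@" || rc=$?
  exit "$rc"
)
```

For example, `run_with_secret_file "$case_dir" false` returns 1 and leaves no `.eval-env` behind (`case_dir` being whatever directory the caller uses).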

Replace the runtime clone of agent-eval-harness with a git submodule at eval/.agent-eval-harness. Dependabot's gitsubmodule ecosystem will keep it updated automatically, which the previous pinned-SHA-in-a-script approach could not support.
- Add eval/.agent-eval-harness submodule (at 8e471f8, post-v1.4.0)
- Remove .gitignore entry for eval/.agent-eval-harness/
- Update run-functional.sh to init submodule instead of cloning
- Add submodules: true to workflow checkout step
- Update docs/testing/evals.md to describe submodule setup
Assisted-by: Claude [claude-opus-4-6] <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>

ralphbean · 2026-05-29T21:27:23Z

While reviewing this I dug into what agent-eval-harness can do that we're not using yet. Noting it here so we can file backlog issues after this merges if any of these seem worth pursuing.

Quick wins — could adopt with minimal effort:

HTML reports (report.py) — self-contained visual report with per-case scores, rationale, artifact previews. One extra script call in run-functional.sh.
Parallel case execution (execution.parallelism: N) — runs cases concurrently. Wall-clock time drops linearly as we add cases.
Pairwise comparison (score.py pairwise) — compares two runs case-by-case with win/loss/tie. Useful for regression detection when we change agent prompts or models.
Built-in judges (cost_budget, tool_call_validation) — reusable parameterized judges, less inline Python to maintain.
Conditional judges (if: field) — skip judges on certain cases based on annotations.

Medium-term:

Analysis generation — automatic failure pattern clustering + prioritized recommendations.
Baseline regression (score.py regression) — compare current run metrics against a previous run.
Judge model overrides (judges[].model) — use a cheaper model for simple judges.
Test case generation (/eval-dataset) — bootstrap new cases from a skill description or expand coverage gaps.

Longer-term (needs architecture changes):

Tool interception — auto-answer AskUserQuestion during headless runs (requires Claude Code runner, not our CLI runner).
MLflow integration — log runs, traces, feedback (needs MLflow infrastructure).
/eval-optimize — closed-loop: run evals → analyze failures → edit skill → re-run.
Stream-JSON event parsing — judges analyze tool call sequences and reasoning (requires Claude Code runner).
Production trace extraction — bootstrap test cases from real production runs (needs MLflow tracing in prod).

The HTML reports and parallelism seem like obvious next steps once we have more than one test case.

ralphbean had a problem deploying to evals with GitHub Actions (Failure) May 29, 2026 00:26

github-actions Bot deployed to site-preview May 29, 2026 00:27

fullsend-ai-review Bot added the requires-manual-review label (Review requires human judgment) May 29, 2026

github-actions Bot deployed to site-preview May 29, 2026 00:36

github-actions Bot deployed to site-preview May 29, 2026 00:38

fullsend-ai-review Bot added and removed the requires-manual-review label (Review requires human judgment) May 29, 2026

github-actions Bot deployed to site-preview May 29, 2026 01:38

fullsend-ai-review Bot added and removed the requires-manual-review label (Review requires human judgment) May 29, 2026

github-actions Bot deployed to site-preview May 29, 2026 01:48

ralphbean force-pushed the ci/functional-evals branch from 6e40384 to 826ceaf May 29, 2026 01:50

github-actions Bot deployed to site-preview May 29, 2026 01:52

fullsend-ai-review Bot added and removed the requires-manual-review label (Review requires human judgment) May 29, 2026

github-actions Bot deployed to site-preview May 29, 2026 10:40

fullsend-ai-review Bot suggested changes May 29, 2026

Comment thread .github/workflows/functional-evals.yml

Comment thread .github/workflows/functional-evals.yml

fullsend-ai-review Bot removed the requires-manual-review label (Review requires human judgment) May 29, 2026

ralphbean marked this pull request as draft May 29, 2026 10:57

github-actions Bot deployed to site-preview May 29, 2026 11:02

fullsend-ai-review Bot suggested changes May 29, 2026

Comment thread internal/cli/run.go (outdated)

Comment thread eval/run-functional.sh

Comment thread eval/fullsend-runner.sh

github-actions Bot deployed to site-preview May 29, 2026 13:50

fullsend-ai-review Bot suggested changes May 29, 2026

Comment thread internal/cli/run.go (outdated)

Comment thread eval/run-functional.sh

Comment thread .github/workflows/functional-evals.yml

Comment thread eval/fullsend-runner.sh

github-actions Bot deployed to site-preview May 29, 2026 14:00

fullsend-ai-review Bot suggested changes May 29, 2026

Comment thread internal/cli/run.go (outdated)

Comment thread .github/workflows/functional-evals.yml

Comment thread eval/run-functional.sh

Comment thread .github/workflows/functional-evals.yml

Comment thread eval/fullsend-runner.sh

github-actions Bot deployed to site-preview May 29, 2026 14:21

fullsend-ai-review Bot added the requires-manual-review label (Review requires human judgment) May 29, 2026

github-actions Bot deployed to site-preview May 29, 2026 14:58

fullsend-ai-review Bot added and removed the requires-manual-review label (Review requires human judgment) May 29, 2026

ralphbean force-pushed the ci/functional-evals branch from aec8273 to a74886c May 29, 2026 15:48

github-actions Bot deployed to site-preview May 29, 2026 15:50

github-actions Bot deployed to site-preview May 29, 2026 15:55

github-actions Bot deployed to site-preview May 29, 2026 16:06

ralphbean force-pushed the ci/functional-evals branch from 35b2e20 to 8e004de May 29, 2026 17:06

github-actions Bot deployed to site-preview May 29, 2026 17:08

github-actions Bot deployed to site-preview May 29, 2026 17:45

github-actions Bot deployed to site-preview May 29, 2026 17:52

ralphbean force-pushed the ci/functional-evals branch from b4bda40 to 73a0b60 May 29, 2026 18:01

github-actions Bot deployed to site-preview May 29, 2026 18:02

ralphbean added 2 commits May 29, 2026 14:11

feat: add functional eval framework for agent pipelines

681e801

Assisted-by: Claude [claude-opus-4-6] <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>

ralphbean force-pushed the ci/functional-evals branch from 73a0b60 to d056e94 May 29, 2026 18:12

github-actions Bot deployed to site-preview May 29, 2026 18:13

github-actions Bot deployed to site-preview May 29, 2026 18:51

github-actions Bot deployed to site-preview May 29, 2026 19:31

fullsend-ai-review Bot added and removed the requires-manual-review label (Review requires human judgment) May 29, 2026

ralphbean added 2 commits May 29, 2026 17:09

ralphbean marked this pull request as ready for review May 29, 2026 21:16

fullsend-ai-review Bot added and removed the requires-manual-review label (Review requires human judgment) May 29, 2026
