
feat: add functional eval framework for agent pipelines #1682

Open: ralphbean wants to merge 7 commits into main from ci/functional-evals

@ralphbean
Contributor

Summary

  • Adds end-to-end functional eval framework that tests the full agent pipeline (pre-script → agent → post-script) by observing side effects on GitHub fixtures
  • Ephemeral repos with UUID suffixes ensure concurrent eval runs don't interfere
  • Uses agent-eval-harness for LLM-graded scoring via CLI runner interface
  • First test case (001-bug-url-encoding) deliberately tests whether the triage agent reads source code critically vs. parroting the issue description
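
The UUID-suffix naming from the second bullet could look roughly like the sketch below. The function name `ephemeral_repo_name` and the `eval-triage` prefix are hypothetical and not taken from eval/fullsend-runner.sh; the /dev/urandom fallback is an assumption for hosts without `uuidgen`.

```shell
#!/bin/sh
# Hypothetical helper: build a collision-resistant repo name for one eval run.
ephemeral_repo_name() {
  prefix=$1
  if command -v uuidgen >/dev/null 2>&1; then
    suffix=$(uuidgen | tr 'A-Z' 'a-z')
  else
    # Fallback (assumption): 16 random bytes rendered as 32 hex characters.
    suffix=$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n')
  fi
  printf '%s-%s\n' "$prefix" "$suffix"
}

ephemeral_repo_name "eval-triage"
```

Because every run gets its own suffix, two concurrent runs never target the same repository name, which is the isolation property the summary describes.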

What's in the box

| File | Purpose |
| --- | --- |
| eval/fullsend-runner.sh | CLI runner: ephemeral repo → fixture → fullsend run → capture state → teardown |
| eval/run-functional.sh | Orchestrator: iterate cases, call runner, score with agent-eval-harness |
| eval/triage/eval.yaml | Eval config: LLM judge (1-5 rubric) + deterministic label check |
| eval/triage/cases/001-* | Test case with tricky scenario (regex accepts + but issue claims it doesn't) |
| eval/triage/repos/python-webapp/ | Shared repo content, symlinked by cases |
| .github/workflows/functional-evals.yml | CI workflow triggered by eval/ or internal/scaffold/ changes |
| Makefile | make functional-evals target |

Current results

The LLM judge scores the triage agent at 3/5 — correct labels and reasonable comment, but the agent accepts the issue at face value without noticing the regex actually handles +. Threshold set to 2.5 with a TODO to raise it.

Still TODO

  • Set up evals GitHub environment with secrets (EVAL_GH_TOKEN, GCP_CREDENTIALS) and vars (EVAL_ORG, ANTHROPIC_VERTEX_PROJECT_ID)
  • Refactor case layout to match agent-eval-harness execute.py expectations so we can use their parallelism
  • Add more test cases (feature request, vague bug, PR review)
  • Per-agent path filtering in CI (only run triage evals when triage-related files change)

Refs #499, #73

🤖 Generated with Claude Code

@github-actions

github-actions Bot commented May 29, 2026

Site preview

Preview: https://c48d522b-site.fullsend-ai.workers.dev

Commit: a67077658b24f50a39aa7097b0b10196de77debe

@fullsend-ai-review

fullsend-ai-review Bot commented May 29, 2026

Review

Findings

Medium

Low

  • [correctness] .github/workflows/functional-evals.yml — The workflow references secrets.EVAL_GH_TOKEN and vars.EVAL_ORG but lacks an environment: evals declaration on the job. Without it, environment-scoped secrets and variables won't resolve. The PR TODO acknowledges the environment isn't set up yet, but the YAML should include the environment: stanza so it's ready when secrets are configured.
    Remediation: Add environment: evals to the functional-evals job definition.

  • [correctness] eval/fullsend-runner.sh — Ephemeral repos are created with --public (line 84 in the diff). This is acceptable for the current test fixtures (innocuous Python webapp), but future test cases could inadvertently expose sensitive patterns or data. Consider using --private if the eval token has the necessary permissions, or document the public visibility as a deliberate design constraint.

Info

  • [style] .github/workflows/functional-evals.yml — The OpenShell setup (version, CLI install, gateway download, Podman config, gateway start) is duplicated from action.yml. The inline TODO already flags this for extraction into a shared script.

  • [correctness] eval/run-functional.sh — The script does not produce a summary exit code reflecting the scoring outcome. CI gating depends on score.py exiting non-zero on threshold failure. If score.py handles this natively, no change needed; if not, the CI step will always pass regardless of eval scores.
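
If score.py turns out not to signal threshold failures, the orchestrator could track them itself. The sketch below is one way to do that under stated assumptions: `score_case` is a stand-in that reads a number from a hypothetical `score.txt` in each case directory, and `run_all_cases` is an illustrative name, not a function in eval/run-functional.sh.

```shell
#!/bin/sh
# Threshold is configurable; the PR description mentions 2.5 for the triage eval.
THRESHOLD=${THRESHOLD:-2.5}

# Stand-in scorer (assumption): reads a numeric score from <case>/score.txt
# and succeeds only when the score meets the threshold.
score_case() {
  awk -v s="$(cat "$1/score.txt")" -v t="$THRESHOLD" \
    'BEGIN { exit !(s + 0 >= t + 0) }'
}

# Score every case, count failures, and report them through the exit status.
run_all_cases() {
  failures=0
  for case_dir in "$@"; do
    if ! score_case "$case_dir"; then
      echo "FAIL: $case_dir" >&2
      failures=$((failures + 1))
    fi
  done
  [ "$failures" -eq 0 ]
}
```

A caller would end with something like `run_all_cases eval/triage/cases/* || exit 1`, so the CI step fails whenever any case scores below the threshold.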

Previous run

Review

Findings

Medium

Low

  • [correctness] docs/testing/evals.md:39 — The environment variable table describes FULLSEND_DIR as "Path to pre-built fullsend binary (skips build)" but this is incorrect. FULLSEND_DIR is the path to the fullsend scaffold directory (e.g., internal/scaffold/fullsend-repo), as documented in the Makefile default and the runner script comments. The description and required/optional status should be corrected.
    Remediation: Change the description to "Path to fullsend scaffold directory (default: internal/scaffold/fullsend-repo)" and mark as required (the runner script uses ${FULLSEND_DIR:?}).

  • [correctness] .github/workflows/functional-evals.yml — The workflow references secrets.EVAL_GH_TOKEN and vars.EVAL_ORG but lacks an environment: evals declaration on the job. Without it, environment-scoped secrets and variables won't resolve. The PR TODO acknowledges the environment isn't set up yet, but the YAML should include the environment: stanza so it's ready when secrets are configured.
    Remediation: Add environment: evals to the functional-evals job definition.

  • [correctness] eval/fullsend-runner.sh:84 — Ephemeral repos are created with --public. This is fine for the current test fixtures (innocuous Python webapp), but future test cases could inadvertently expose sensitive patterns or data. Consider using --private if the eval token has the necessary permissions, or document the public visibility as a deliberate design constraint.

Info

  • [style] .github/workflows/functional-evals.yml — The OpenShell setup (version, CLI install, gateway download, Podman config, gateway start) is duplicated from action.yml. The inline TODO already flags this for extraction into a shared script. Tracking as a follow-up.

  • [correctness] eval/run-functional.sh — The script does not produce a summary exit code reflecting the scoring outcome. CI gating depends on score.py exiting non-zero on threshold failure. If score.py does handle this, no change needed; if not, the CI step will always pass regardless of eval scores.

Previous run (2)

Review

Reason: stale-head

The review agent reviewed commit a74886c393f4913c5d194e1e2fb0c0c4531346e8 but the PR HEAD is now 8c8c781113d441f22e8c870e14b5cbcd5c32d3f8. This review was discarded to avoid approving unreviewed code.

Previous run (3)

Review

Findings

Medium

Low

  • [correctness] eval/fullsend-runner.sh:353 — grep -oP '/issues/\K[0-9]+' uses Perl-compatible regex (-P flag) which is not available on macOS BSD grep. Since the script header and Makefile target suggest local execution is expected, consider using sed or awk for portability: echo "$url" | sed 's|.*/issues/||'.
    Remediation: Replace grep -oP '/issues/\K[0-9]+' and grep -oP '/pull/\K[0-9]+' with portable alternatives (e.g., awk -F/ '{print $NF}' or sed 's|.*/||').

  • [correctness] internal/cli/run.go:1810 — The new copyFile function uses defer out.Close() without checking the close error. While Go's os.File.Write is unbuffered (writes go to syscall directly), the idiomatic Go pattern is to explicitly close the destination file and check the error before returning, especially for write operations. In practice this is safe for the temp-dir use case here.
    Remediation: Close out explicitly before the Stat call and return any close error.
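
For the grep -oP portability finding above, POSIX parameter expansion is another option alongside sed and awk. This is a minimal sketch; `fixture_number` is a hypothetical name, and it assumes the URL ends in the issue or PR number.

```shell
#!/bin/sh
# Print the trailing numeric path segment of an issue or PR URL.
# Uses only POSIX parameter expansion, so it avoids GNU-only grep -P.
fixture_number() {
  url=${1%/}            # tolerate a trailing slash
  number=${url##*/}     # keep everything after the last "/"
  case "$number" in
    '' | *[!0-9]*) return 1 ;;   # reject empty or non-numeric segments
  esac
  printf '%s\n' "$number"
}

fixture_number "https://github.com/example-org/example-repo/issues/42"   # prints 42
```

Unlike a bare `sed 's|.*/||'`, the function returns non-zero when the last segment is not a number, so a malformed URL fails loudly instead of propagating garbage.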

Previous run (4)

Review

Findings

Medium

Low

  • [correctness] eval/fullsend-runner.sh:352,389 — grep -oP uses Perl regex, a GNU grep extension not available on macOS. The primary use is CI on ubuntu-latest so this works today, but developers running evals locally on macOS will hit failures.
    Remediation: Replace with a POSIX-compatible alternative, e.g. sed -n 's|.*/issues/||p' or awk -F/ '{print $NF}'.

  • [correctness] internal/cli/run.go:724-727 — The openshell upload path fix is correct (upload to parent dir so the binary lands at the expected path), but this bug fix is bundled into an eval-framework feature PR. A separate commit improves bisectability if the upload behavior needs to be reverted independently.

  • [documentation-currency] docs/problems/testing-agents.md — This doc explores agent testing approaches (golden-set evaluation, behavioral contracts) in the abstract. The new eval/ directory is a concrete implementation of Approach 1 (golden-set evaluation with LLM judges). A cross-reference from the doc to the implementation would help contributors discover the working framework.

Previous run (5)

Review

Findings

High

  • [correctness] internal/cli/run.go:348,697,725,731,735,1283 — Debug instrumentation left in production code, now in six locations — worse than the prior review. (a) Lines 348–350: a new sandbox.Exec call that runs ls -la /tmp/workspace/bin/ and prints to stderr after every bootstrap. (b) Line 697: prints fullsendBinary and localBinary flag values to stderr. (c) Line 725: prints localBinary before upload to stderr. (d) Line 731: prints upload source/destination to stderr. (e) Lines 735–737: runs ls -la /tmp/workspace/bin/fullsend inside the sandbox and prints the result to stderr. (f) Lines 1283–1284: echo DEBUG:PATH=$PATH >&2 && ls -la .../bin/fullsend >&2 && which fullsend >&2 injected into buildScanContextCommand, emitting debug output on every context scan — flagged in the prior review and still present. The latest commit added five more debug blocks instead of removing the previously flagged instance.
    Remediation: Remove all six debug blocks. For (a), delete lines 348–350. For (b)–(e), delete the fmt.Fprintf(os.Stderr, "DEBUG ... lines and the verify sandbox.Exec call. For (f), restore the original single-line fmt.Sprintf in buildScanContextCommand without the debug echo/ls/which commands.

Medium

  • [correctness] eval/fullsend-runner.sh:351,387 — gh issue create and gh pr create use 2>&1, merging stderr into the captured URL. If gh emits warnings or progress text to stderr, they contaminate FIXTURE_URL and may break downstream scoring via fixture-state.json. While grep -oP extracts the number today, the URL variable itself carries garbage.
    Remediation: Remove 2>&1 from both command substitutions. Redirect stderr to a log file if error capture is needed.

  • [protected-path] .github/workflows/functional-evals.yml — This PR adds a new CI workflow under .github/, a protected path. The PR links to Investigate opendatahub-io/agent-eval-harness for skill evals #499 and Add regression tests / evals for all agents and skills #73 and describes the workflow's purpose. Human approval is always required for protected-path changes regardless of context.
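
The 2>&1 remediation above can be illustrated without calling GitHub at all. In this sketch `fake_issue_create` is a hypothetical stand-in for `gh issue create` that prints a URL on stdout and a notice on stderr; the log path comes from `mktemp` rather than the runner's real output directory.

```shell
#!/bin/sh
# Hypothetical stand-in for `gh issue create`: URL on stdout, noise on stderr.
fake_issue_create() {
  echo "warning: simulated notice on stderr" >&2
  echo "https://github.com/example-org/example-repo/issues/7"
}

log_file=$(mktemp)

# Capture stdout only; append stderr to a log instead of merging it with 2>&1.
fixture_url=$(fake_issue_create 2>>"$log_file")

printf 'url=%s\n' "$fixture_url"
```

The captured variable holds just the URL, while the notice stays available in the log for debugging.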

Low

  • [supply-chain] eval/run-functional.sh:55 — git clone --depth=1 of agent-eval-harness without pinning to a commit or tag. A compromised upstream would execute malicious score.py in CI with access to EVAL_GH_TOKEN and GCP credentials.
    Remediation: Pin the clone to a specific commit SHA.

  • [supply-chain] .github/workflows/functional-evals.yml:48 — yq binary downloaded via curl without SHA256 checksum verification. Other tool installs in this repo (e.g., lychee in the Makefile) include checksum verification.
    Remediation: Add sha256sum -c verification after download.

  • [platform-security] eval/fullsend-runner.sh:316 — GH_TOKEN embedded in the git clone URL is visible in /proc/*/cmdline. Acceptable in ephemeral CI but worth noting.
    Remediation: Use a credential helper instead of embedding the token in the URL.

  • [documentation-currency] docs/problems/testing-agents.md:272 — The "Practical architecture" section describes a hypothetical eval pipeline using Inspect AI and DeepEval Synthesizer. This PR implements a real pipeline using agent-eval-harness in eval/. A cross-reference to the actual implementation would help future readers. Not blocking since problem docs are exploratory.
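
The checksum remediation for the yq download could be wrapped in a small helper like the one below. This is a sketch that assumes GNU coreutils `sha256sum`; `verify_sha256` is a hypothetical name, and the expected digest would have to come from the upstream release's published checksums.

```shell
#!/bin/sh
# Verify a downloaded file against a pinned SHA-256 digest.
# Assumes GNU coreutils sha256sum (macOS ships shasum instead).
verify_sha256() {
  file=$1
  expected=$2
  # sha256sum -c expects "<digest>  <path>" (two spaces) on stdin.
  printf '%s  %s\n' "$expected" "$file" | sha256sum -c --status -
}
```

A workflow step would call it right after the download and before `chmod +x`, for example `verify_sha256 ./yq "$YQ_SHA256" || exit 1`, where `YQ_SHA256` is a placeholder for the pinned digest.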

Previous run (6)

Review

Findings

High

  • [correctness] internal/cli/run.go:348,1276 — Debug instrumentation left in production code, now in two locations. (a) Lines 348–350: a new sandbox.Exec call that runs ls -la /tmp/workspace/bin/ and prints to stderr after every bootstrap — added since the prior review. (b) Line 1276: echo DEBUG:PATH, ls -la .../bin/fullsend, and which fullsend injected into buildScanContextCommand, emitting debug output to stderr on every context scan — flagged in the prior review and still present. The latest commit added more debug code instead of removing the previously flagged instance.
    Remediation: Remove both debug blocks. For (a), delete lines 348–350 (the debugOut/debugErr block). For (b), restore the original single-line fmt.Sprintf in buildScanContextCommand without the debug echo/ls/which commands.

Medium

  • [correctness] eval/fullsend-runner.sh:351,385 — gh issue create and gh pr create use 2>&1, merging stderr into the captured URL. If gh emits warnings or progress text to stderr, they contaminate FIXTURE_URL and may break downstream scoring via fixture-state.json. While grep -oP extracts the number today, the URL variable itself carries garbage.
    Remediation: Remove 2>&1 from both command substitutions. Redirect stderr to a log file if error capture is needed.

  • [protected-path] .github/workflows/functional-evals.yml — This PR adds a new CI workflow under .github/, a protected path. The PR links to Investigate opendatahub-io/agent-eval-harness for skill evals #499 and Add regression tests / evals for all agents and skills #73 and describes the workflow's purpose. Human approval is always required for protected-path changes regardless of context.

Low

  • [supply-chain] eval/run-functional.sh:55 — git clone --depth=1 of agent-eval-harness without pinning to a commit or tag. A compromised upstream would execute malicious score.py in CI with access to EVAL_GH_TOKEN and GCP credentials.
    Remediation: Pin the clone to a specific commit SHA.

  • [supply-chain] .github/workflows/functional-evals.yml:48 — yq binary downloaded via curl without SHA256 checksum verification. Other tool installs in this repo (e.g., lychee in the Makefile) include checksum verification.
    Remediation: Add sha256sum -c verification after download.

  • [platform-security] eval/fullsend-runner.sh:316 — GH_TOKEN embedded in the git clone URL is visible in /proc/*/cmdline. Acceptable in ephemeral CI but worth noting.
    Remediation: Use a credential helper instead of embedding the token in the URL.

  • [documentation-currency] docs/problems/testing-agents.md:272 — The "Practical architecture" section describes a hypothetical eval pipeline using Inspect AI and DeepEval Synthesizer. This PR implements a real pipeline using agent-eval-harness in eval/. A cross-reference to the actual implementation would help future readers. Not blocking since problem docs are exploratory.

Previous run (7)

Review

Findings

High

  • [correctness] internal/cli/run.go:348,1276 — Debug instrumentation left in production code, now in two locations. (a) Lines 348–350: a new sandbox.Exec call that runs ls -la /tmp/workspace/bin/ and prints to stderr after every bootstrap — added since the prior review. (b) Line 1276: echo DEBUG:PATH, ls -la .../bin/fullsend, and which fullsend injected into buildScanContextCommand, emitting debug output to stderr on every context scan — flagged in the prior review and still present. The latest commit added more debug code instead of removing the previously flagged instance.
    Remediation: Remove both debug blocks. For (a), delete lines 348–350 (the debugOut/debugErr block). For (b), restore the original single-line fmt.Sprintf in buildScanContextCommand without the debug echo/ls/which commands.

Medium

  • [correctness] eval/fullsend-runner.sh:351,385 — gh issue create and gh pr create use 2>&1, merging stderr into the captured URL. If gh emits warnings or progress text to stderr, they contaminate FIXTURE_URL and may break downstream scoring via fixture-state.json. While grep -oP extracts the number today, the URL variable itself carries garbage.
    Remediation: Remove 2>&1 from both command substitutions. Redirect stderr to a log file if error capture is needed.

  • [protected-path] .github/workflows/functional-evals.yml — This PR adds a new CI workflow under .github/, a protected path. The PR links to Investigate opendatahub-io/agent-eval-harness for skill evals #499 and Add regression tests / evals for all agents and skills #73 and describes the workflow's purpose. Human approval is always required for protected-path changes regardless of context.

Low

  • [supply-chain] eval/run-functional.sh:55 — git clone --depth=1 of agent-eval-harness without pinning to a commit or tag. A compromised upstream would execute malicious score.py in CI with access to EVAL_GH_TOKEN and GCP credentials.
    Remediation: Pin the clone to a specific commit SHA.

  • [supply-chain] .github/workflows/functional-evals.yml:48 — yq binary downloaded via curl without SHA256 checksum verification. Other tool installs in this repo (e.g., lychee in the Makefile) include checksum verification.
    Remediation: Add sha256sum -c verification after download.

  • [platform-security] eval/fullsend-runner.sh:316 — GH_TOKEN embedded in the git clone URL is visible in /proc/*/cmdline. Acceptable in ephemeral CI but worth noting.
    Remediation: Use a credential helper instead of embedding the token in the URL.

  • [documentation-currency] docs/problems/testing-agents.md:272 — The "Practical architecture" section describes a hypothetical eval pipeline using Inspect AI and DeepEval Synthesizer. This PR implements a real pipeline using agent-eval-harness in eval/. A cross-reference to the actual implementation would help future readers. Not blocking since problem docs are exploratory.

Previous run (8)

Review

Findings

High

  • [correctness] internal/cli/run.go:1276 — Debug instrumentation left in production code. The change adds echo DEBUG:PATH=$PATH >&2 && ls -la .../bin/fullsend >&2 && which fullsend >&2 to buildScanContextCommand, which runs inside every sandbox context scan. This emits noisy debug output to stderr on every agent run and is clearly development debugging that should not be merged.
    Remediation: Revert the internal/cli/run.go change — remove the debug echo, ls -la, and which commands from the format string, restoring the original single-line command.

Medium

  • [correctness] eval/fullsend-runner.sh:351 — gh issue create and gh pr create (line 385) use 2>&1, which merges stderr into the captured URL. If gh emits any warnings or progress messages to stderr (e.g., authentication notices, rate-limit warnings), they contaminate FIXTURE_URL. While grep -oP may still extract the number, FIXTURE_URL will contain garbage text that propagates into fixture-state.json and potentially breaks downstream scoring.
    Remediation: Remove 2>&1 from both gh issue create and gh pr create command substitutions. If error capture is needed, redirect stderr to a log file instead.

  • [protected-path] .github/workflows/functional-evals.yml — This PR adds a new CI workflow under .github/, which is a protected path. The PR body explains the purpose (functional eval CI) and references Investigate opendatahub-io/agent-eval-harness for skill evals #499 and Add regression tests / evals for all agents and skills #73, providing sufficient context. Human approval is required for all protected-path changes regardless of justification.

Low

  • [supply-chain] eval/run-functional.sh:55 — git clone --depth=1 https://github.com/opendatahub-io/agent-eval-harness.git clones the harness repo at HEAD without pinning to a specific commit or tag. If the upstream repo is compromised, malicious scoring code (score.py) would execute in CI with access to EVAL_GH_TOKEN and GCP credentials.
    Remediation: Pin the clone to a specific release tag or commit SHA and document the pinned version. Note that git clone --depth=1 --branch <tag> ... accepts only branch and tag names; to pin to a commit SHA, clone first and then check out that SHA, verifying that HEAD matches it.

  • [platform-security] eval/fullsend-runner.sh:316 — git clone "https://x-access-token:${GH_TOKEN}@github.com/..." embeds the token in the URL, which is visible in /proc/*/cmdline on Linux. Acceptable in ephemeral CI runners but worth noting.
    Remediation: Consider using git -c credential.helper='!echo password=${GH...' clone ... or configuring the credential helper to avoid token exposure in process listings.
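
Pinning the harness clone to a commit, as the supply-chain finding in this run asks, could be sketched as follows. `clone_at_commit` is a hypothetical helper; it does a full clone for simplicity instead of the runner's --depth=1, and it expects the full 40-character SHA so the final comparison is exact.

```shell
#!/bin/sh
# Clone a repository and pin the working tree to one exact commit.
# A full clone is used here for simplicity; it trades download size for
# not depending on the server allowing fetch-by-SHA.
clone_at_commit() {
  url=$1
  sha=$2   # full commit SHA (assumption: not abbreviated)
  dir=$3
  git clone --quiet "$url" "$dir" || return 1
  git -C "$dir" checkout --quiet --detach "$sha" || return 1
  # Fail closed if HEAD is not the commit we pinned.
  [ "$(git -C "$dir" rev-parse HEAD)" = "$sha" ]
}
```

The trailing comparison matters: if the checkout resolved to anything other than the pinned commit, the function returns non-zero and the caller can abort before running any cloned code.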

Previous run (9)

Review

Findings

High

  • [platform-security] eval/fullsend-runner.sh:511 — Token leakage via CI artifact upload. The runner writes .eval-env containing GH_TOKEN, PUSH_TOKEN, and REVIEW_TOKEN to $OUTPUT_DIR (which resolves to eval/runs/<agent>/<run-id>/cases/<case>/). The CI workflow uploads eval/runs/ as an artifact with 30-day retention. The .eval-env file would be included in the downloadable artifact, exposing all three tokens to anyone with artifact download access. The cleanup trap only removes TARGET_DIR — it does not delete ENV_FILE.
    Remediation: Write the env file to a temp location outside the output directory (e.g., ENV_FILE=$(mktemp)) and add rm -f "$ENV_FILE" to the cleanup trap. Alternatively, add an exclusion pattern to the artifact upload step.
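
The mktemp-plus-trap remediation above might look like this. It is a sketch only: `run_with_private_env` is a hypothetical wrapper, it writes just GH_TOKEN for brevity, and it passes the file path to the wrapped command as its last argument.

```shell
#!/bin/sh
# Run a command with a short-lived env file kept outside any artifact directory.
# The command receives the env file path as its last argument.
run_with_private_env() {
  (
    env_file=$(mktemp) || exit 1
    # Remove the file whenever this subshell exits, on success or failure.
    trap 'rm -f "$env_file"' EXIT
    printf 'GH_TOKEN=%s\n' "${GH_TOKEN:-}" > "$env_file"
    "$@" "$env_file"
    status=$?
    exit "$status"
  )
}
```

Since the file lives under the system temp directory and is deleted by the trap, nothing token-bearing remains under eval/runs/ when the artifact upload step runs.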

Medium

  • [protected-path] .github/workflows/functional-evals.yml — This file is under .github/, a protected path requiring human approval. The PR links to Investigate opendatahub-io/agent-eval-harness for skill evals #499 and Add regression tests / evals for all agents and skills #73 and the description explains the workflow's purpose. Human review and approval is required regardless of automated review outcome.

  • [supply-chain] .github/workflows/functional-evals.yml:48 — yq binary downloaded from GitHub Releases without checksum verification (curl followed by chmod +x). Compare with the existing lychee install in the Makefile, which includes a sha256sum -c check. Additionally, agent-eval-harness is installed from a git URL without a version pin or commit SHA (line 44), meaning the CI installs whatever is on the default branch at run time.
    Remediation: Add SHA256 checksum verification for the yq download. Pin agent-eval-harness to a specific commit SHA or release tag (e.g., git+https://github.com/opendatahub-io/agent-eval-harness.git@<commit>).

Low

  • [correctness] eval/fullsend-runner.sh:352 — grep -oP uses PCRE mode, which is a GNU grep extension not available on macOS default grep. Local development on macOS would fail when parsing fixture URLs. The same pattern appears at line 389 for PR URL parsing.
    Remediation: Use sed 's|.*/issues/||' or awk -F/ '{print $NF}' for portable URL parsing.

Previous run (10)

Review

Findings

Medium

  • [correctness] eval/fullsend-runner.sh:137,173 — gh issue create and gh pr create output is captured with 2>&1, which mixes stderr progress messages into the URL variable. This makes FIXTURE_URL unreliable and FIXTURE_NUMBER extraction via grep -oP fragile — it works today because stderr messages don't contain /issues/ or /pull/, but any change to gh CLI output format could break it silently.
    Remediation: Drop 2>&1 so only stdout (the URL) is captured, or redirect stderr to /dev/null or a log file: url=$(gh issue create ... 2>/dev/null) or url=$(gh issue create ... 2>>"$case_output_dir/runner.log").

  • [protected-path] .github/workflows/functional-evals.yml — This PR adds a new GitHub Actions workflow under .github/, which is a protected path requiring human approval. The PR links to issues Investigate opendatahub-io/agent-eval-harness for skill evals #499 and Add regression tests / evals for all agents and skills #73 and provides clear rationale for the workflow. Human review of this file is required regardless of automated review outcome.

Low

  • [supply-chain] .github/workflows/functional-evals.yml:48 — The yq binary is downloaded from GitHub releases via curl without SHA256 checksum verification. While HTTPS and curl -sSfL provide baseline safety, adding checksum verification would be a defense-in-depth improvement against compromised release artifacts.

  • [supply-chain] .github/workflows/functional-evals.yml:84 — The openshell-gateway binary is downloaded from GitHub releases via curl without SHA256 checksum verification, same pattern as the yq download. The supervisor image is pinned by SHA (dfd47683e7da4f1a4a8fa5d77f92d3696e6a41f9), which is good — applying the same rigor to the gateway binary would be consistent.

  • [documentation-currency] docs/problems/testing-agents.md:272 — The "Practical architecture" section proposes a hypothetical eval pipeline using Inspect AI and DeepEval Synthesizer. This PR implements a real eval pipeline using agent-eval-harness in eval/. While the doc is exploratory (problem doc, not spec), a cross-reference to the actual implementation would help future readers. Not blocking since problem docs are expected to evolve independently.

  • [correctness] eval/fullsend-runner.sh:298 — The env file writes GH_TOKEN=${GH_TOKEN} using double-quoted shell expansion, so the raw token value lands in the file unquoted. If the token ever contains shell metacharacters ($, backticks, spaces), they would be interpreted by any shell that later sources the file. Current GitHub tokens contain only alphanumerics and underscores, so this is not an active bug, but writing the value single-quoted (with any embedded single quotes escaped) would be more robust.

  • [style] eval/fullsend-runner.sh:97 — Ephemeral repos created as --public without a code comment explaining the choice. Adding a comment about why public was chosen (vs private) helps future maintainers understand the trade-off.
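
For the env-file finding in this run, one POSIX-only way to write values that survive being sourced is to single-quote them. The sketch below assumes the file is later read by a shell; `shell_quote` and `write_env_line` are hypothetical helpers, not functions from eval/fullsend-runner.sh.

```shell
#!/bin/sh
# Wrap a value in single quotes so a POSIX shell reads it back literally.
# Each embedded single quote becomes '\'' (close quote, escaped quote, reopen).
shell_quote() {
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Emit a NAME='value' line that is safe to source later.
write_env_line() {
  printf '%s=%s\n' "$1" "$(shell_quote "$2")"
}
```

With this, `write_env_line GH_TOKEN "$GH_TOKEN" > "$ENV_FILE"` produces a line whose value round-trips through `. "$ENV_FILE"` even when it contains `$`, backticks, or spaces.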

Previous run (11)

Review

Findings

Medium

  • [correctness] eval/fullsend-runner.sh:130,156 — gh issue create and gh pr create output is captured with 2>&1, which mixes stderr progress messages into the URL variable. This makes FIXTURE_URL unreliable and FIXTURE_NUMBER extraction via grep -oP fragile — it works today because stderr messages don't contain /issues/ or /pull/, but any change to gh CLI output format could break it silently.
    Remediation: Drop 2>&1 so only stdout (the URL) is captured, or redirect stderr to /dev/null or a log file: url=$(gh issue create ... 2>/dev/null) or url=$(gh issue create ... 2>>"$case_output_dir/runner.log").

  • [protected-path] .github/workflows/functional-evals.yml — This PR adds a new GitHub Actions workflow under .github/, which is a protected path requiring human approval. The PR links to issues Investigate opendatahub-io/agent-eval-harness for skill evals #499 and Add regression tests / evals for all agents and skills #73 and provides clear rationale for the workflow. Human review of this file is required regardless of automated review outcome.

Low

  • [supply-chain] .github/workflows/functional-evals.yml:43 — The yq binary is downloaded from GitHub releases via curl without SHA256 checksum verification. While HTTPS and curl -sSfL provide baseline safety, adding checksum verification would be a defense-in-depth improvement against compromised release artifacts.

  • [documentation-currency] docs/problems/testing-agents.md:272 — The "Practical architecture" section proposes a hypothetical eval pipeline using Inspect AI and DeepEval Synthesizer. This PR implements a real eval pipeline using agent-eval-harness in eval/. While the doc is exploratory (problem doc, not spec), a cross-reference to the actual implementation would help future readers. Not blocking since problem docs are expected to evolve independently.

  • [correctness] eval/fullsend-runner.sh:263 — The env file writes GH_TOKEN=${GH_TOKEN} using heredoc expansion, so the raw token value lands in the file unquoted. If the token ever contains shell metacharacters ($, backticks, spaces), they would be interpreted by any shell that later sources the file. Current GitHub tokens contain only alphanumerics and underscores, so this is not an active bug, but single-quoting the written value would be more robust.

  • [style] eval/fullsend-runner.sh:91 — Ephemeral repos created as --public without a code comment explaining the choice. Adding a comment about why public was chosen (vs private) helps future maintainers understand the trade-off.

Previous run (12)

Review

Findings

Medium

  • [correctness] eval/fullsend-runner.sh:130,156 — gh issue create and gh pr create output is captured with 2>&1, which mixes stderr progress messages into the URL variable. This makes FIXTURE_URL unreliable and FIXTURE_NUMBER extraction via grep -oP fragile — it works today because stderr messages don't contain /issues/ or /pull/, but any change to gh CLI output format could break it silently.
    Remediation: Drop 2>&1 so only stdout (the URL) is captured, or redirect stderr to /dev/null or a log file: url=$(gh issue create ... 2>/dev/null) or url=$(gh issue create ... 2>>"$case_output_dir/runner.log").

  • [protected-path] .github/workflows/functional-evals.yml — This PR adds a new GitHub Actions workflow under .github/, which is a protected path requiring human approval. The PR links to issues Investigate opendatahub-io/agent-eval-harness for skill evals #499 and Add regression tests / evals for all agents and skills #73 and provides clear rationale for the workflow. Human review of this file is required regardless of automated review outcome.

Low

  • [supply-chain] .github/workflows/functional-evals.yml:43 — The yq binary is downloaded from GitHub releases via curl without SHA256 checksum verification. While HTTPS and curl -sSfL provide baseline safety, adding checksum verification would be a defense-in-depth improvement against compromised release artifacts.

  • [documentation-currency] docs/problems/testing-agents.md:272 — The "Practical architecture" section proposes a hypothetical eval pipeline using Inspect AI and DeepEval Synthesizer. This PR implements a real eval pipeline using agent-eval-harness in eval/. While the doc is exploratory (problem doc, not spec), a cross-reference to the actual implementation would help future readers. Not blocking since problem docs are expected to evolve independently.

  • [correctness] eval/fullsend-runner.sh:263 — The env file writes GH_TOKEN=${GH_TOKEN} using heredoc expansion, so the raw token value lands in the file unquoted. If the token ever contains shell metacharacters ($, backticks, spaces), they would be interpreted by any shell that later sources the file. Current GitHub tokens contain only alphanumerics and underscores, so this is not an active bug, but single-quoting the written value would be more robust.

  • [style] eval/fullsend-runner.sh:91 — Ephemeral repos created as --public without a code comment explaining the choice. Adding a comment about why public was chosen (vs private) helps future maintainers understand the trade-off.

Previous run (13)

Review

Findings

Medium

  • [correctness] eval/fullsend-runner.sh:130,156 — gh issue create and gh pr create output is captured with 2>&1, which mixes stderr progress messages into the URL variable. This makes FIXTURE_URL unreliable and FIXTURE_NUMBER extraction via grep -oP fragile — it works today because stderr messages don't contain /issues/ or /pull/, but any change to gh CLI output format could break it silently.
    Remediation: Drop 2>&1 so only stdout (the URL) is captured, or redirect stderr to /dev/null or a log file: url=$(gh issue create ... 2>/dev/null) or url=$(gh issue create ... 2>>"$case_output_dir/runner.log").

  • [protected-path] .github/workflows/functional-evals.yml — This PR adds a new GitHub Actions workflow under .github/, which is a protected path requiring human approval. The PR links to issues Investigate opendatahub-io/agent-eval-harness for skill evals #499 and Add regression tests / evals for all agents and skills #73 and provides clear rationale for the workflow. Human review of this file is required regardless of automated review outcome.

Low

  • [documentation-currency] docs/problems/testing-agents.md:272 — The "Practical architecture" section proposes a hypothetical eval pipeline using Inspect AI and DeepEval Synthesizer. This PR implements a real eval pipeline using agent-eval-harness in eval/. While the doc is exploratory (problem doc, not spec), a cross-reference to the actual implementation would help future readers. Not blocking since problem docs are expected to evolve independently.

  • [correctness] eval/fullsend-runner.sh:263 — The env file writes GH_TOKEN=${GH_TOKEN} using heredoc expansion, so the raw token value lands in the file unquoted. If the token ever contains shell metacharacters ($, backticks, spaces), they would be interpreted by any shell that later sources the file. Current GitHub tokens contain only alphanumerics and underscores, so this is not an active bug, but single-quoting the written value would be more robust.

  • [style] eval/fullsend-runner.sh:91 — Ephemeral repos created as --public without a code comment explaining the choice. Adding a comment about why public was chosen (vs private) helps future maintainers understand the trade-off.
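
For the env file finding, one way to write the token as plain data is printf with a %s placeholder. The sketch below assumes the consumer reads KEY=value lines literally; the token value and the temporary file are hypothetical and only serve to show that shell metacharacters survive unchanged.

```shell
set -eu

# Hypothetical token containing characters the shell normally treats specially.
GH_TOKEN='abc$HOME`id`xyz'

env_file="$(mktemp)"
chmod 600 "$env_file"   # restrict permissions before any secret is written

# The token is passed to printf as an argument, so it is written byte for
# byte and never re-parsed as shell syntax.
printf 'GH_TOKEN=%s\n' "$GH_TOKEN" > "$env_file"

# Read the value back as data instead of sourcing the file as shell code.
stored="$(sed -n 's/^GH_TOKEN=//p' "$env_file")"

rm -f "$env_file"
```

If the file is later sourced by a shell rather than parsed as data, the value would additionally need quoting suited to that shell.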

@fullsend-ai-review fullsend-ai-review Bot added the requires-manual-review Review requires human judgment label May 29, 2026
@fullsend-ai-review fullsend-ai-review Bot added requires-manual-review Review requires human judgment and removed requires-manual-review Review requires human judgment labels May 29, 2026
@ralphbean ralphbean force-pushed the ci/functional-evals branch from 6e40384 to 826ceaf Compare May 29, 2026 01:50
@fullsend-ai-review fullsend-ai-review Bot added requires-manual-review Review requires human judgment and removed requires-manual-review Review requires human judgment labels May 29, 2026
@fullsend-ai-review fullsend-ai-review Bot left a comment

See the review comment for full details.

Comment thread .github/workflows/functional-evals.yml
Comment thread .github/workflows/functional-evals.yml
@fullsend-ai-review fullsend-ai-review Bot removed the requires-manual-review Review requires human judgment label May 29, 2026
@ralphbean ralphbean marked this pull request as draft May 29, 2026 10:57
@fullsend-ai-review fullsend-ai-review Bot left a comment

See the review comment for full details.

Comment thread internal/cli/run.go Outdated
Comment thread eval/run-functional.sh
Comment thread eval/fullsend-runner.sh
@fullsend-ai-review fullsend-ai-review Bot left a comment

See the review comment for full details.

Comment thread internal/cli/run.go Outdated
Comment thread eval/run-functional.sh
Comment thread .github/workflows/functional-evals.yml
Comment thread eval/fullsend-runner.sh
@fullsend-ai-review fullsend-ai-review Bot left a comment

See the review comment for full details.

Comment thread internal/cli/run.go Outdated
Comment thread .github/workflows/functional-evals.yml
Comment thread eval/run-functional.sh
Comment thread .github/workflows/functional-evals.yml
Comment thread eval/fullsend-runner.sh
@fullsend-ai-review fullsend-ai-review Bot added requires-manual-review Review requires human judgment and removed requires-manual-review Review requires human judgment labels May 29, 2026
sandbox.Upload() silently fails for large files (~16MB) — the binary
doesn't appear in the sandbox. Switch bootstrapSandbox to use
UploadDir() (tarball + extract), the same mechanism used for uploading
project code, which works reliably regardless of file size.

Observed in CI functional evals (run 26642640034). This is the same
class of issue that motivated the UploadFile helper in commit 907d482.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
ralphbean added 2 commits May 29, 2026 14:11
Assisted-by: Claude [claude-opus-4-6] <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
Add openshell sandbox setup (CLI, gateway, Podman), GCP WIF
authentication, and credential preparation for functional evals.

Key changes:
- Set up openshell sandbox infrastructure in the workflow
- Prepare WIF credentials for sandbox use (rewrite external_account
  config to file-based OIDC token source)
- Restore host credentials for the scoring phase (LLM judge runs on
  the host, not in the sandbox)
- Pass EVALS_GCP_REGION and EVALS_VERTEX_PROJECT_ID for Vertex AI
- Configure git identity and authenticated clone URLs
- Pass --fullsend-binary flag in eval runner

Assisted-by: Claude [claude-opus-4-6] <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
- ADR 0044: establish functional evals as a test category in the four-layer
  test pyramid (unit, prompt eval, functional eval, e2e)
- ADR 0045: adopt agent-eval-harness as the eval framework, using its opaque
  CLI runner contract as the integration boundary
- Add docs/testing/evals.md with contributor guide for writing and running
  functional evals
- Update docs/architecture.md with ADR 0044 references on testing open
  questions
- Annotate testing-agents.md golden-set bootstrapping question as partially
  answered

Refs #1682

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
- Use git credential helper instead of embedding GH_TOKEN in clone URLs,
  preventing token persistence in .git/config
- Set 0600 permissions on .eval-env before writing secrets
- Delete .eval-env after fullsend run to prevent secret upload as artifact
- Capture fullsend run exit code correctly (rc=$? pattern vs $? after ||)
- Remove 2>&1 from gh create commands that corrupted URL extraction
- Replace non-portable grep -P with portable parameter expansion
- Pipe yq output directly instead of echo to prevent backslash mangling
- Add explicit out.Close() check in copyFile to catch write flush errors
- Add concurrency group with cancel-in-progress to CI workflow
- Add timeout-minutes: 45 to prevent runaway eval jobs
- Fix shellcheck warnings (SC2086, SC2034, SC2155) in CI workflow
- Suppress ruff F841 in eval fixture code (intentional unused variable)
- Add reversibility note to ADR 0045 consequences

Assisted-by: Claude [claude-opus-4-6] <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
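
The exit-code item in the commit above refers to a common shell pitfall. The sketch below shows one way it can go wrong and the usual fix; run_agent is a hypothetical stand-in for fullsend run, and the exact form of the original bug in the runner is an assumption.

```shell
set -eu

# Hypothetical stand-in for `fullsend run` that fails with exit status 3.
run_agent() { return 3; }

# Buggy: after `cmd || true`, $? is the status of the whole || list (0 here),
# so the real failure code is lost.
run_agent || true
lost_rc=$?

# Correct: read $? on the right-hand side of ||, where it still holds the
# status of the failed command. Initializing rc=0 covers the success path.
rc=0
run_agent || rc=$?

echo "lost_rc=$lost_rc rc=$rc"   # prints: lost_rc=0 rc=3
```

The rc=0 followed by cmd || rc=$? form also works under set -e, because a command on the left of || does not abort the script.
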
@fullsend-ai-review fullsend-ai-review Bot added requires-manual-review Review requires human judgment and removed requires-manual-review Review requires human judgment labels May 29, 2026
ralphbean added 2 commits May 29, 2026 17:09
- Delete .eval-env in cleanup trap so secrets don't survive crashes
- Scrub .eval-env from artifact uploads as defense-in-depth
- Pin agent-eval-harness clone to a specific commit ref
- Add per-case timeout (default 30min) to prevent hung agents
- Fix credential helper comment (expression is stored, not the token)
- Fix docs env var table: add missing vars, correct FULLSEND_DIR desc
- Remove unused shell variable declaration (shellcheck SC2034)
- Remove unused comments.min_count/max_count from annotations

Assisted-by: Claude [claude-opus-4-6] <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
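
The cleanup-trap item above can be illustrated with a minimal sketch. The file name .eval-env comes from the commit message; the temporary directory, the placeholder secret, and the simulated crash are assumptions for the example.

```shell
set -eu

work_dir="$(mktemp -d)"
env_file="$work_dir/.eval-env"
status=0

# Run the risky part in a subshell so the example can observe the cleanup.
(
  # The EXIT trap fires however the subshell ends, including on failure.
  trap 'rm -f "$env_file"' EXIT
  printf 'GH_TOKEN=%s\n' 'hypothetical-secret' > "$env_file"
  exit 7   # simulate a crash after the secret has been written
) || status=$?

if [ -e "$env_file" ]; then
  echo "secret file survived (status $status)"
else
  echo "secret file removed (status $status)"
fi

rmdir "$work_dir"
```

An explicit exit is used for the simulated crash because set -e is ignored inside a subshell that sits on the left of ||.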
Replace the runtime clone of agent-eval-harness with a git submodule at
eval/.agent-eval-harness. Dependabot's gitsubmodule ecosystem will keep
it updated automatically, which the previous pinned-SHA-in-a-script
approach could not support.

- Add eval/.agent-eval-harness submodule (at 8e471f8, post-v1.4.0)
- Remove .gitignore entry for eval/.agent-eval-harness/
- Update run-functional.sh to init submodule instead of cloning
- Add submodules: true to workflow checkout step
- Update docs/testing/evals.md to describe submodule setup

Assisted-by: Claude [claude-opus-4-6] <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
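
A rough sketch of the "init submodule instead of cloning" step follows. The submodule path is taken from the commit message; ensure_harness and HARNESS_DIR are hypothetical names, and the real run-functional.sh may do this differently.

```shell
set -eu

# Path from the commit message; overridable for local experiments.
HARNESS_DIR="${HARNESS_DIR:-eval/.agent-eval-harness}"

# Initialize the harness submodule only when its directory is missing or empty.
ensure_harness() {
  # A submodule that has not been initialized shows up as an empty directory.
  if [ -z "$(ls -A "$HARNESS_DIR" 2>/dev/null)" ]; then
    git submodule update --init -- "$HARNESS_DIR"
  fi
}
```

Calling ensure_harness from the repository root before invoking the harness keeps local runs working for contributors who cloned without submodules.
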
@ralphbean ralphbean marked this pull request as ready for review May 29, 2026 21:16
@fullsend-ai-review fullsend-ai-review Bot added requires-manual-review Review requires human judgment and removed requires-manual-review Review requires human judgment labels May 29, 2026
@ralphbean
Contributor Author

While reviewing this I dug into what agent-eval-harness can do that we're not using yet. Noting it here so we can file backlog issues after this merges if any of these seem worth pursuing.

Quick wins — could adopt with minimal effort:

  • HTML reports (report.py) — self-contained visual report with per-case scores, rationale, artifact previews. One extra script call in run-functional.sh.
  • Parallel case execution (execution.parallelism: N) — runs cases concurrently. Wall-clock time should drop roughly in proportion to N once there are enough cases to keep the workers busy.
  • Pairwise comparison (score.py pairwise) — compares two runs case-by-case with win/loss/tie. Useful for regression detection when we change agent prompts or models.
  • Built-in judges (cost_budget, tool_call_validation) — reusable parameterized judges, less inline Python to maintain.
  • Conditional judges (if: field) — skip judges on certain cases based on annotations.

Medium-term:

  • Analysis generation — automatic failure pattern clustering + prioritized recommendations.
  • Baseline regression (score.py regression) — compare current run metrics against a previous run.
  • Judge model overrides (judges[].model) — use a cheaper model for simple judges.
  • Test case generation (/eval-dataset) — bootstrap new cases from a skill description or expand coverage gaps.

Longer-term (needs architecture changes):

  • Tool interception — auto-answer AskUserQuestion during headless runs (requires Claude Code runner, not our CLI runner).
  • MLflow integration — log runs, traces, feedback (needs MLflow infrastructure).
  • /eval-optimize — closed-loop: run evals → analyze failures → edit skill → re-run.
  • Stream-JSON event parsing — judges analyze tool call sequences and reasoning (requires Claude Code runner).
  • Production trace extraction — bootstrap test cases from real production runs (needs MLflow tracing in prod).

The HTML reports and parallelism seem like obvious next steps once we have more than one test case.

Labels

requires-manual-review Review requires human judgment
