Skip to content

Implement reproducer agent#610

Open
vmihalko wants to merge 16 commits into
packit:mainfrom
vmihalko:reproducer-agent
Open

Implement reproducer agent#610
vmihalko wants to merge 16 commits into
packit:mainfrom
vmihalko:reproducer-agent

Conversation

@vmihalko

Copy link
Copy Markdown

TODO:

  • Unit tests for Testing Farm / SSH tools (710 lines)
  • Unit tests for reproducer input/output models
  • Updated existing test_push_to_remote_repository for new communicate() API
  • E2E tests (deferred to follow-up PR)
  • Doc-strings on all new public classes and helpers
  • Skill definition (agents_as_skills/reproducer/SKILL.md)
  • Deployment config (compose service, Makefile targets, env template)
  • Jinja2 prompt template following existing agent patterns
  • Validated on 2 different Jira issues / bug types (RHEL-46618 ksh, RHEL-170657 file)
  • 8 end-to-end test runs against real Testing Farm infrastructure

Background

I've been using and improving a custom reproducer skill over the last few months. After a meeting with @TomasKorbar, I took my reproducer skill and @mkyral's cve-test-generator (kudos to him) as references, and let Claude Opus 4.6 read the existing ai-workflows agents and codebase to prepare an implementation plan: basic functionality + tests, monofunctional commits. Once it was working, I started an iterative implement - review - fix - test run loop with Claude. This PR is the result after 16 runs.

What

Adds a reproducer agent that takes a Jira issue, provisions a Testing Farm machine, designs and iterates on a BeakerLib reproducer test against real RHEL, and opens a merge request with the working test.

New components

  • Testing Farm / SSH tools (testing_farm.py): ReserveTestingFarmMachine, GetTestingFarmReservationDetails, CancelTestingFarmRequest, RunRemoteCommand, CopyFilesToRemote - full reservation lifecycle and remote execution via the MCP gateway
  • TF cleanup middleware (tf_cleanup_middleware.py): event-based tracker that cancels leaked Testing Farm reservations on agent crash, preventing orphaned machines from burning quota
  • Reproducer agent (reproducer_agent.py): orchestration with dedup guard (prevents duplicate runs across concurrent deployments), structured output, retry logic, and Jira lifecycle labels
  • Prompt template (prompts/reproducer/prompt.j2): 717-line Jinja2 template guiding the LLM through issue analysis, test design, TF provisioning, iterative verification, and MR submission
  • Skill definition (agents_as_skills/reproducer/SKILL.md): registers the agent as a Claude Code slash-command skill
  • Deployment config: compose service, Makefile targets, env template

Bug fixes in existing GitLab tools

  • push_to_remote_repository: was silently failing with a generic error and leaking credentials via git stderr. Now captures stderr via subprocess.PIPE + communicate(), filters auth-related lines through _sanitize_git_stderr, and includes sanitized output in the error message.
  • clone_repository: failed on retry when a previous clone left a partial directory behind. Now removes the target directory before cloning (guarded by an allowlist of safe parent paths).
  • ToolError double-wrapping: 5 existing tools had except Exception handlers that caught ToolError and re-wrapped it, losing the specific error message. Added except ToolError: raise before each except Exception.

Test coverage

  • Unit tests for all Testing Farm/SSH tools (710 lines)
  • Unit tests for reproducer input/output models
  • Updated test_push_to_remote_repository for the new communicate() API

Note: E2E tests for the reproducer agent are intentionally deferred to a follow-up PR to allow the team to decide on the e2e testing approach independently.

Test runs and results

The agent was validated over 8 end-to-end runs against real Jira issues.

Issue Package Result Cost Duration Tool calls
RHEL-46618 ksh SUCCESS $11.21 (3.6M tokens) ~22 min 49
RHEL-170657 file SUCCESS $6.43 ~10 min 36

The cost difference is driven by bug complexity: ksh required 22 remote commands (valgrind, pmap, multiple reproducer variants, RSS threshold tuning) vs file's 6 (create a 32-byte binary, run file, check output). Both runs demonstrate the agent generalizes across different bug types.

Merge requests created by the agent

Related to
#553

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new "reproducer" agent skill to Ymir, designed to automatically create, verify, and publish minimal reproducers for RHEL bugs and CVEs. It adds the reproducer workflow, prompts, BeakerLib templates, Testing Farm integration tools (for machine reservation, remote command execution, and file copying), and cleanup middleware. The feedback focuses on critical improvements: wrapping synchronous blocking calls (such as API requests and directory removals) in asyncio.to_thread to prevent event loop starvation, fixing a scoping bug in the BeakerLib test template where variables are lost across subshells, robustly parsing LLM JSON responses to handle markdown code blocks, and using Path.is_relative_to for safer path validation.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

I am having trouble creating individual review comments. Click here to see my feedback.

ymir/tools/privileged/testing_farm.py (346)

high

The _testing_farm_api_post call is synchronous and blocking. Since this tool is executed within an asynchronous context (async def _run), calling it directly blocks the event loop and can starve other concurrent tasks. Wrap the call in asyncio.to_thread to run it in a separate thread.

            response = await asyncio.to_thread(_testing_farm_api_post, "requests", json=body)

ymir/tools/privileged/testing_farm.py (397-409)

high

There are two issues here:

  1. The _testing_farm_api_get call is synchronous and blocking. It should be run in a separate thread using asyncio.to_thread to avoid blocking the event loop during polling.
  2. Catching only requests.HTTPError means other requests-related exceptions (like ConnectionError or Timeout) will propagate to the generic except Exception block and immediately abort the polling loop. Catching requests.RequestException and treating connection/timeout errors as transient makes the polling loop much more resilient to temporary network issues.
                response = await asyncio.to_thread(_testing_farm_api_get, f"requests/{tool_input.request_id}")
            except requests.RequestException as e:
                is_transient = False
                if isinstance(e, requests.HTTPError) and e.response is not None:
                    if e.response.status_code in _TRANSIENT_HTTP_CODES:
                        is_transient = True
                elif isinstance(e, (requests.ConnectionError, requests.Timeout)):
                    is_transient = True

                if is_transient:
                    logger.warning(
                        "Transient error %s polling TF %s (attempt %d/%d)",
                        e, tool_input.request_id, attempt, max_attempts,
                    )
                    if attempt < max_attempts:
                        await asyncio.sleep(poll_interval)
                    continue
                raise ToolError(
                    f"Failed to get Testing Farm reservation details {tool_input.request_id}: {e}"
                ) from e

ymir/tools/privileged/testing_farm.py (424)

high

The requests.get call to fetch pipeline.log is synchronous and blocking. Wrap it in asyncio.to_thread to prevent blocking the event loop.

                        log_resp = await asyncio.to_thread(requests.get, f"{artifacts_url}/pipeline.log", timeout=30)

ymir/tools/privileged/testing_farm.py (487)

high

The _testing_farm_api_delete call is synchronous and blocking. Wrap it in asyncio.to_thread to avoid blocking the event loop.

            await asyncio.to_thread(_testing_farm_api_delete, f"requests/{request_id}")

ymir/tools/privileged/gitlab.py (505)

high

The shutil.rmtree call is synchronous and blocking. Removing a cloned git repository can take several seconds, which will block the event loop. Wrap it in asyncio.to_thread to run it asynchronously.

                await asyncio.to_thread(shutil.rmtree, clone_path)

ymir/agents/prompts/reproducer/prompt.j2 (300-302)

high

In BeakerLib, each rlRun invocation executes in a separate subshell. Assigning TmpDir=\$(mktemp -d) inside rlRun means the variable is lost immediately after that command exits. Subsequent commands like rlRun "pushd \$TmpDir" will fail because TmpDir is empty. Assign TmpDir in the parent shell context instead, and let it expand in the parent shell when passing to rlRun.

        TmpDir=\$(mktemp -d)
        ORIG_DIR=\$(pwd)
        rlRun "pushd \$TmpDir"

ymir/agents/prompts/reproducer/prompt.j2 (331)

high

Since TmpDir is now defined in the parent shell context, do not escape the $ in the rlRun call so that it expands in the parent shell before being executed.

        rlRun "rm -rf \$TmpDir" 0 "Removing tmp directory"

agents_as_skills/reproducer/SKILL.md (300-302)

high

In BeakerLib, each rlRun invocation executes in a separate subshell. Assigning TmpDir=\$(mktemp -d) inside rlRun means the variable is lost immediately after that command exits. Subsequent commands like rlRun "pushd \$TmpDir" will fail because TmpDir is empty. Assign TmpDir in the parent shell context instead, and let it expand in the parent shell when passing to rlRun.

        TmpDir=\$(mktemp -d)
        ORIG_DIR=\$(pwd)
        rlRun "pushd \$TmpDir"

agents_as_skills/reproducer/SKILL.md (331)

high

Since TmpDir is now defined in the parent shell context, do not escape the $ in the rlRun call so that it expands in the parent shell before being executed.

        rlRun "rm -rf \$TmpDir" 0 "Removing tmp directory"

ymir/agents/reproducer_agent.py (238)

medium

Directly calling OutputSchema.model_validate_json on the raw LLM response text is fragile. If the LLM wraps the JSON in markdown code blocks (e.g., ```json ... ```), validation will fail. Clean the response text to robustly extract the JSON payload before validation.

            raw_text = response.last_message.text.strip()
            if raw_text.startswith("```"):
                lines = raw_text.splitlines()
                if lines[0].startswith("```"):
                    lines = lines[1:]
                if lines and lines[-1].startswith("```"):
                    lines = lines[:-1]
                raw_text = "\n".join(lines).strip()
            state.result = OutputSchema.model_validate_json(raw_text)

ymir/tools/privileged/testing_farm.py (571)

medium

Define _ALLOWED_COPY_BASES as Path objects to enable more robust and idiomatic path validation using is_relative_to instead of string prefix matching.

_ALLOWED_COPY_BASES = (Path("/git-repos"), Path("/tmp"))

ymir/tools/privileged/testing_farm.py (580-587)

medium

Use is_relative_to for path validation to be consistent with the rest of the codebase (e.g., gitlab.py) and to avoid fragile string-based prefix matching.

    @field_validator("local_paths")
    @classmethod
    def validate_local_paths(cls, v: list[str]) -> list[str]:
        for p in v:
            resolved = Path(p).resolve()
            if not any(resolved.is_relative_to(base) for base in _ALLOWED_COPY_BASES):
                raise ValueError(f"Path {p} is not under an allowed directory")
        return v

vmihalko and others added 15 commits June 19, 2026 14:58
Add ReproducerInputSchema and ReproducerOutputSchema for structured
agent input/output with validation. Includes comprehensive unit tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add REPRODUCER_QUEUE name and lifecycle labels (REPRODUCER_TRIAGED,
REPRODUCER_COMPLETED, REPRODUCER_FAILED, REPRODUCER_ERRORED) for
tracking reproducer agent pipeline state in Jira.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Make TF API URL configurable via TESTING_FARM_API_URL env var
(defaults to testing-farm.io). Add _redact_secrets for safe logging
of API payloads, _testing_farm_api_delete helper, _ensure_gateway_ssh_key
for gateway-managed SSH key generation, and _parse_tf_request helper.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add three MCP tools for the TF machine reservation lifecycle:
- ReserveTestingFarmMachineTool: provisions machines with SSH access,
  security group ingress rules, and configurable compose/arch/duration
- GetTestingFarmReservationDetailsTool: polls SSH availability from
  pipeline.log with built-in retry (20 attempts, 30s intervals) and
  transient HTTP 502/503/504 retry
- CancelTestingFarmRequestTool: releases reserved machines via API

Includes input validation with regex patterns for request_id and
remote_dir parameters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add two MCP tools for executing commands on reserved TF machines:
- RunRemoteCommandTool: runs commands via SSH with configurable
  timeout, working directory, and gateway SSH key
- CopyFilesToRemoteTool: transfers files via SCP with path validation
  against an allowlist of safe base directories

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comprehensive test suite covering all TF tools: reservation,
polling with retry/transient-error handling, cancellation, remote
command execution, file copying, input validation patterns, SSH
key management, and dry-run mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Before: push failures returned only "Failed to push to the specified
repository" with no detail. Worse, git stderr could contain credential
helper output (Authorization headers, Basic auth tokens) which would
be logged or returned to the LLM agent verbatim.

Fix: capture stderr via subprocess.PIPE + communicate(), filter it
through _sanitize_git_stderr (strips lines matching auth-related
patterns), and include the sanitized output in the error message.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Before: clone_repository unconditionally called mkdir(exist_ok=True)
then git clone, which fails with "destination path already exists"
if a previous clone attempt left a partial directory behind (e.g.
a network failure mid-clone).

Fix: for branchless clones (the common path), remove the target
directory if it already exists before cloning. The rmtree is guarded
by an allowlist of safe parent directories (/git-repos/, /tmp/) to
prevent accidental deletion of arbitrary paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The `except Exception` handlers in 5 tools (AddMergeRequestLabels,
AddMergeRequestComment, GetAuthorizedCommentsFromMergeRequest,
GetMergeRequestDetails, SearchGitlabProjectMrs) catch ToolError
alongside genuine exceptions, wrapping it in a second ToolError and
losing the specific error message. Add `except ToolError: raise`
before each `except Exception` to preserve the original error.

Also remove the stale local `from beeai_framework.tools import ToolError`
in SearchGitlabProjectMrsTool in favor of the existing top-level import.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Event-based middleware that tracks active Testing Farm reservations
during agent runs. If the agent crashes or is interrupted, any
reservations left in the tracker are cancelled on cleanup, preventing
orphaned machines from burning Testing Farm quota.

Listens for tool success/error events and maintains a set of active
request IDs, removing them when the agent cancels them normally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
717-line structured prompt that guides the LLM through the full
reproducer workflow: analyzing the Jira issue, designing a BeakerLib
test, provisioning a Testing Farm machine, iterating on the test
until it reliably demonstrates the bug, and opening a merge request.

Extracted as a Jinja2 template to keep the Python agent code focused
on orchestration rather than prompt content. Uses template variables
for issue metadata, tool names, and configuration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Main agent module that wires together the reproducer workflow:
- Dedup guard: checks Jira labels before starting to prevent
  duplicate runs across concurrent deployments
- Tool registration: assembles the MCP gateway tools (GitLab,
  Testing Farm, SSH) with Jira and file-access tools
- TF cleanup middleware: attaches crash-recovery reservation tracking
- Structured output: parses LLM response into ReproducerOutputSchema
- Retry logic: up to 3 attempts on transient LLM/tool failures
- Jira lifecycle labels: marks issues through queued → in-progress →
  done/failed/error states

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- compose.yaml: Add reproducer-agent service definition with
  MCP gateway dependency, Jira/TF environment variables, and
  volume mounts for git repos and prompt templates
- Makefile: Add reproducer-related targets (run, test, lint)
- .gitignore: Exclude secrets directory and agent runtime artifacts
- templates/beeai-agent.env: Add REPRODUCER_QUEUE variable

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SKILL.md that registers the reproducer as a Claude Code slash-command
skill. Defines the input schema (Jira issue key), required environment
variables, and invocation instructions for running the reproducer
through the Claude Code CLI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vmihalko

Copy link
Copy Markdown
Author

ad gemini findings:

  1. _testing_farm_api_post blocking (line 346): Fixed, wrapped with asyncio.to_thread.
  2. _testing_farm_api_get blocking + broader exception handling (lines 397-409): Fixed. Wrapped with asyncio.to_thread and broadened to requests.RequestException, treating ConnectionError and Timeout as transient alongside HTTP.
  3. requests.get for pipeline.log blocking (line 424): Fixed, wrapped with asyncio.to_thread.
  4. _testing_farm_api_delete blocking (line 487): Fixed, wrapped with asyncio.to_thread.
  5. shutil.rmtree blocking in gitlab.py (line 505): Fixed, wrapped with asyncio.to_thread.
  6. BeakerLib rlRun subshell in prompt.j2 (lines 300-302): False positive. rlRun uses eval in the current shell context, not a subshell. rlRun "TmpDir=$(mktemp -d)" is a standard BeakerLib pattern - the variable persists and is
    available to subsequent commands. All existing ksh tests use exactly rlRun "TmpDir=$(mktemp -d)"...
  7. BeakerLib rlRun cleanup in prompt.j2 (line 331): Same as above, not a bug.
  8. BeakerLib rlRun subshell in SKILL.md (lines 300-302): Same as 6, false positive.
  9. BeakerLib rlRun cleanup in SKILL.md (line 331): Same as 7, not a bug.
  10. JSON markdown code block stripping (reproducer_agent.py line 238): Not needed. The BeeAI framework's expected_output parameter already provides explicit JSON formatting instructions and schema. Adding defensive stripping would mask issues rather than fix them.
  11. _ALLOWED_COPY_BASES as Path objects (line 571): Fixed.
  12. is_relative_to for path validation (lines 580-587): Fixed.

@vmihalko

Copy link
Copy Markdown
Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new 'reproducer' agent to automate the creation, verification, and publication of minimal reproducers for RHEL bugs and CVEs. It adds the core reproducer agent workflow, prompt templates, Testing Farm reservation cleanup middleware, and several new Testing Farm tools. The review feedback highlights several key areas for robustness and performance improvements, including wrapping blocking synchronous calls in asyncio.to_thread, resolving allowed parent paths to absolute paths to prevent validation failures, defensively parsing LLM JSON responses that may contain markdown wrapping, and safely accessing event attributes in the cleanup middleware.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

)

try:
_ensure_gateway_ssh_key()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Calling _ensure_gateway_ssh_key() synchronously inside an async def _run blocks the event loop because it performs synchronous file I/O and runs a subprocess. Wrap this call in asyncio.to_thread to run it asynchronously.

Suggested change
_ensure_gateway_ssh_key()
await asyncio.to_thread(_ensure_gateway_ssh_key)

}
)

_ensure_gateway_ssh_key()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Calling _ensure_gateway_ssh_key() synchronously inside an async def _run blocks the event loop because it performs synchronous file I/O and runs a subprocess. Wrap this call in asyncio.to_thread to run it asynchronously.

Suggested change
_ensure_gateway_ssh_key()
await asyncio.to_thread(_ensure_gateway_ssh_key)

Comment on lines +495 to +498
Path(os.environ.get("GIT_REPO_BASEPATH", "/git-repos")),
Path("/tmp"), # noqa: S108
}
if not any(clone_path.resolve().is_relative_to(p) for p in allowed_parents):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If GIT_REPO_BASEPATH is configured as a relative path or contains symlinks, clone_path.resolve().is_relative_to(p) can raise a ValueError or return False due to mixing absolute and relative paths. To prevent false-positive directory traversal blocks, resolve all paths in allowed_parents to absolute paths using .resolve().

Suggested change
Path(os.environ.get("GIT_REPO_BASEPATH", "/git-repos")),
Path("/tmp"), # noqa: S108
}
if not any(clone_path.resolve().is_relative_to(p) for p in allowed_parents):
Path(os.environ.get("GIT_REPO_BASEPATH", "/git-repos")).resolve(),
Path("/tmp").resolve(), # noqa: S108
}
if not any(clone_path.resolve().is_relative_to(p) for p in allowed_parents):

),
**get_agent_execution_config(),
)
state.result = OutputSchema.model_validate_json(response.last_message.text)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

LLMs frequently wrap JSON responses in markdown code blocks (e.g., ```json ... ```) even when instructed not to. Directly calling model_validate_json on such wrapped text will raise a JSONDecodeError. Extract the raw JSON string from markdown code blocks before parsing to make the validation more robust.

Suggested change
state.result = OutputSchema.model_validate_json(response.last_message.text)
raw_text = response.last_message.text.strip()
if raw_text.startswith("```"):
lines = raw_text.splitlines()
if lines[0].startswith("```"):
lines = lines[1:]
if lines and lines[-1].startswith("```"):
lines = lines[:-1]
raw_text = "\n".join(lines).strip()
state.result = OutputSchema.model_validate_json(raw_text)

Comment on lines +30 to +40
async def _on_reserve(self, event: ToolSuccessEvent, meta: EventMeta) -> None:
request_id = event.output.result.get("id")
if request_id:
self._reserved.add(request_id)
logger.debug("Tracked TF reservation %s", request_id)

async def _on_cancel(self, event: ToolSuccessEvent, meta: EventMeta) -> None:
request_id = event.input.request_id if hasattr(event.input, "request_id") else None
if request_id:
self._cancelled.add(request_id)
logger.debug("Tracked TF cancellation %s", request_id)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To prevent potential runtime crashes in the event listeners (which could disrupt the agent's execution), use defensive programming when accessing event.input and event.output. event.input and event.output can be dictionaries or Pydantic models depending on the framework adapter, and event.output.result can occasionally be None.

    async def _on_reserve(self, event: ToolSuccessEvent, meta: EventMeta) -> None:
        output = getattr(event, "output", None)
        result = getattr(output, "result", None) if output else None
        request_id = result.get("id") if isinstance(result, dict) else None
        if request_id:
            self._reserved.add(request_id)
            logger.debug("Tracked TF reservation %s", request_id)

    async def _on_cancel(self, event: ToolSuccessEvent, meta: EventMeta) -> None:
        request_id = None
        tool_input = getattr(event, "input", None)
        if isinstance(tool_input, dict):
            request_id = tool_input.get("request_id")
        elif tool_input:
            request_id = getattr(tool_input, "request_id", None)

        if request_id:
            self._cancelled.add(request_id)
            logger.debug("Tracked TF cancellation %s", request_id)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant