FORGE-EVAL-3: replay mode (--replay) for full-sandbox eval (Phase 7) #191

Description

@initializ-mk

Context

Replay-A: full deterministic re-execution of an agent against a candidate model, with tools stubbed by recorded responses. This is the high-fidelity replay strategy reserved for production-critical agents. Replay-B (LLM-only replay, initializ/eval#11) ships as v1; Replay-A is Phase 7 of the eval epic.

Scope

Replay mode entry point

```bash
forge run \
--replay \
--replay-source <invocation_id_or_trace_id> \
--replay-tool-oracle <oracle_endpoint_or_file> \
--replay-candidate-model <candidate_model> \
--replay-output <output_path>
```

When `--replay` is set:

  1. Load recorded inputs: from the trace identified by `--replay-source`, reconstruct the inbound A2A request (input message, headers including workflow_id/correlation_id).
  2. Tool oracle: replace the tool dispatch layer with a stub that, for every tool call (tool_name, args), looks up the recorded response from the oracle. Strict by default: no recorded response → return error to the agent (counts as drift in evaluation).
  3. Egress lockdown: override forge.yaml's egress allow-list to only the LLM provider host. Block everything else regardless of declared allow-list.
  4. Candidate model: override the LLM dispatch to use `--replay-candidate-model` regardless of forge.yaml.
  5. Audit emission: tag every emitted audit event with `replay_run_id`; route to a separate NATS subject `replayEvent` (not `auditEvent`) so replay events don't pollute production audit.
  6. OTel spans: emit normally but tag with `replay_run_id` resource attribute so the trace store can segregate.
  7. No A2A response: the agent's output is written to `--replay-output` instead of being returned to a caller. The agent process exits when the invocation completes.
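Steps 3 and 5 above can be sketched roughly as follows. This is an illustration only, written in Python for brevity rather than in forge's own code: `ReplayConfig`, `effective_allow_list`, `egress_allowed`, and `route_audit_event` are hypothetical names, and the real egress and audit layers may be structured quite differently. The only details taken from this issue are the subject names `replayEvent` / `auditEvent` and the `replay_run_id` tag.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
from urllib.parse import urlparse

PRODUCTION_SUBJECT = "auditEvent"
REPLAY_SUBJECT = "replayEvent"


@dataclass(frozen=True)
class ReplayConfig:
    """Hypothetical container for the effective replay settings."""
    replay_run_id: str
    llm_provider_host: str  # the only host egress may reach during replay
    candidate_model: str


def effective_allow_list(declared: List[str], cfg: Optional[ReplayConfig]) -> List[str]:
    """Step 3: in replay mode the declared allow-list is ignored entirely."""
    if cfg is None:
        return list(declared)
    return [cfg.llm_provider_host]


def egress_allowed(url: str, allow_list: List[str]) -> bool:
    """Allow a request only if its host is on the effective allow-list."""
    host = urlparse(url).hostname
    return host is not None and host in allow_list


def route_audit_event(event: dict, cfg: Optional[ReplayConfig]) -> Tuple[str, dict]:
    """Step 5: tag replay events and route them to the replay subject."""
    if cfg is None:
        return PRODUCTION_SUBJECT, dict(event)
    tagged = {**event, "replay_run_id": cfg.replay_run_id}
    return REPLAY_SUBJECT, tagged
```

In this sketch the replay override replaces the declared allow-list instead of intersecting with it, which matches "block everything else regardless of declared allow-list".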

Tool oracle protocol

Two formats:

  • File: JSON file with an array of `{tool_name, args_hash, response}` entries. Pre-computed from the original trace.
  • Endpoint: HTTP service run by the eval orchestrator; the agent calls it on every tool dispatch. Simpler to drive from the eval#11/Replay-A orchestrator code.
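A minimal sketch of the file format with strict lookup, for discussion. The issue does not define how `args_hash` is computed, so the SHA-256 over sorted-key JSON used here is purely an assumption; `FileOracle`, `OracleMiss`, and the `misses` list are likewise hypothetical names, and Python is used only for illustration.

```python
import hashlib
import json


def args_hash(args: dict) -> str:
    """Assumed scheme: SHA-256 over canonical (sorted-key, compact) JSON."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class OracleMiss(Exception):
    """Raised in strict mode when no recorded response exists."""


class FileOracle:
    def __init__(self, entries):
        # Index entries by (tool_name, args_hash). If a key repeats,
        # this sketch keeps the last recorded response.
        self._by_key = {
            (e["tool_name"], e["args_hash"]): e["response"] for e in entries
        }
        self.misses = []  # keys with no recording, to be reported as drift

    @classmethod
    def from_json(cls, text: str) -> "FileOracle":
        """Build an oracle from the JSON array described above."""
        return cls(json.loads(text))

    def lookup(self, tool_name: str, args: dict):
        """Return the recorded response, or raise OracleMiss (strict mode)."""
        key = (tool_name, args_hash(args))
        if key not in self._by_key:
            self.misses.append(key)
            raise OracleMiss(f"no recorded response for {tool_name}")
        return self._by_key[key]
```

Whatever hashing scheme is chosen, it has to be insensitive to argument key order, otherwise semantically identical calls from the candidate model would be counted as drift.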

Tool annotation for read-only-real (future)

`forge.yaml` tool declarations may include `idempotent: true`; for those tools the oracle is bypassed and the tool runs for real. Not in v1: capture it in this issue's design, but ship strict-only.
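One possible shape for the future bypass, as a sketch under assumptions: `make_dispatcher` and its parameters are hypothetical, tool declarations are modeled as plain dicts, and the bypass is off by default so that the shipped behavior stays strict-only.

```python
from typing import Any, Callable, Dict


def make_dispatcher(
    tool_decls: Dict[str, dict],
    real_tools: Dict[str, Callable[[dict], Any]],
    oracle_lookup: Callable[[str, dict], Any],
    allow_idempotent_bypass: bool = False,  # strict-only unless enabled
) -> Callable[[str, dict], Any]:
    """Build a tool dispatch function for replay mode."""

    def dispatch(tool_name: str, args: dict) -> Any:
        decl = tool_decls.get(tool_name, {})
        # Future behavior: tools declared idempotent run for real.
        if allow_idempotent_bypass and decl.get("idempotent") is True:
            return real_tools[tool_name](args)
        # Default: every call is answered by the oracle.
        return oracle_lookup(tool_name, args)

    return dispatch
```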

Test plan

  • Replay against a known invocation reproduces the conversation; output captured in --replay-output
  • Candidate model attempts a tool the original didn't use → strict oracle returns error; agent surfaces it; replay completes with drift recorded
  • Egress to a non-LLM host is blocked even if the agent declares it in forge.yaml allow-list
  • Audit events from the replay land on `replayEvent`, not `auditEvent` — production audit untouched

Dependencies

Phase 7, depends on:

  • initializ/eval#11 (Replay-B as the v1 baseline that this replaces for high-stakes agents)
  • initializ/eval#15 (collector accepting replay-tagged spans on a separate path or with filtering)

Labels: epic:eval (Cross-repo observability + evaluation epic)