Context
Replay-A: full deterministic re-execution of an agent against a candidate model with tools stubbed by recorded responses. This is the high-fidelity replay strategy reserved for production-critical agents. Replay-B (LLM-only replay, initializ/eval#11) is v1; this is Phase 7 of the eval epic.
Scope
Replay mode entry point
```bash
forge run \
--replay \
--replay-source <invocation_id_or_trace_id> \
--replay-tool-oracle <oracle_endpoint_or_file> \
--replay-candidate-model <candidate_model> \
--replay-output <output_path>
```
When `--replay` is set:
- Load recorded inputs: from the trace identified by `--replay-source`, reconstruct the inbound A2A request (input message, headers including workflow_id/correlation_id).
- Tool oracle: replace the tool dispatch layer with a stub that, for every tool call (tool_name, args), looks up the recorded response from the oracle. Strict by default: no recorded response → return error to the agent (counts as drift in evaluation).
- Egress lockdown: override forge.yaml's egress allow-list so that only the LLM provider host is reachable. Everything else is blocked, regardless of the declared allow-list.
- Candidate model: override the LLM dispatch to use `--replay-candidate-model` regardless of forge.yaml.
- Audit emission: tag every emitted audit event with `replay_run_id`; route to a separate NATS subject `replayEvent` (not `auditEvent`) so replay events don't pollute production audit.
- OTel spans: emit normally but tag with `replay_run_id` resource attribute so the trace store can segregate.
- No A2A response: the agent's output is written to `--replay-output` instead of being returned to a caller. The agent process exits when the invocation completes.
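The configuration-level overrides above (egress lockdown, candidate model, audit routing, OTel tagging) can be sketched as a pure function over a parsed config. This is a minimal illustration, not the Forge implementation: the dict layout (`egress.allow`, `model`, `audit.subject`, `audit.tags`, `otel.resource_attributes`) and the function names are assumptions made for the example, since the actual forge.yaml schema is not given here.

```python
import copy
import uuid


def apply_replay_overrides(config, candidate_model, llm_host, replay_run_id=None):
    """Return a copy of a parsed config dict with replay-mode overrides applied.

    All key names below are hypothetical stand-ins for the real forge.yaml schema.
    """
    run_id = replay_run_id or str(uuid.uuid4())
    replay = copy.deepcopy(config)  # never mutate the production config

    # Egress lockdown: only the LLM provider host survives, whatever was declared.
    replay.setdefault("egress", {})["allow"] = [llm_host]

    # Candidate model wins over the configured model.
    replay["model"] = candidate_model

    # Replay audit events go to their own subject and carry the run id.
    audit = replay.setdefault("audit", {})
    audit["subject"] = "replayEvent"
    audit.setdefault("tags", {})["replay_run_id"] = run_id

    # Resource attribute the trace store can use to segregate replay spans.
    otel = replay.setdefault("otel", {})
    otel.setdefault("resource_attributes", {})["replay_run_id"] = run_id
    return replay


def egress_allowed(replay_config, host):
    """True only for hosts on the (overridden) allow-list."""
    return host in replay_config["egress"]["allow"]
```

Keeping the overrides in one function that returns a new config makes it straightforward to assert, in tests, that no declared egress host other than the LLM provider survives a replay run.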
Tool oracle protocol
Two formats:
- File: JSON file with an array of `{tool_name, args_hash, response}` entries. Pre-computed from the original trace.
- Endpoint: HTTP service that the eval orchestrator runs; agent calls it per tool dispatch. Simpler to drive from eval#11/Replay-A orchestrator code.
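A file-backed oracle with strict lookup might look like the sketch below. The entry shape `{tool_name, args_hash, response}` comes from the format above; how `args_hash` is computed is not specified in this issue, so the SHA-256 over canonical JSON used here is an assumption, as are the class and exception names.

```python
import hashlib
import json


class OracleMiss(Exception):
    """Raised in strict mode when no recorded response matches a tool call."""


def args_hash(args):
    # Assumption: hash of canonical JSON (sorted keys, compact separators),
    # so that key order in the arguments does not change the hash.
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class FileToolOracle:
    """Stub for the tool dispatch layer, backed by recorded responses."""

    def __init__(self, entries):
        # Index recorded entries by (tool_name, args_hash) for direct lookup.
        self._responses = {
            (e["tool_name"], e["args_hash"]): e["response"] for e in entries
        }
        self.misses = []  # unmatched calls, available to the evaluator as drift

    @classmethod
    def from_file(cls, path):
        with open(path, encoding="utf-8") as f:
            return cls(json.load(f))

    def dispatch(self, tool_name, args):
        key = (tool_name, args_hash(args))
        if key not in self._responses:
            # Strict by default: record the miss and surface an error to the agent.
            self.misses.append({"tool_name": tool_name, "args": args})
            raise OracleMiss(f"no recorded response for {tool_name}")
        return self._responses[key]
```

In this sketch, duplicate `(tool_name, args_hash)` entries resolve to the last one in the file. Whether repeated identical calls should instead be replayed in recorded order is a design question the format above leaves open.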
Tool annotation for read-only-real (future)
`forge.yaml`'s tool declarations may include `idempotent: true` — for those, the oracle is bypassed and the tool runs for real. Not in v1; capture in this issue's design but ship strict-only.
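For the design record, the bypass decision could be as small as the following sketch. The `allow_real_tools` switch is hypothetical and defaults to off, which matches the strict-only v1 behavior; the tool declaration is modeled as a plain dict.

```python
def should_run_for_real(tool_decl, allow_real_tools=False):
    """Decide whether a tool call bypasses the oracle and runs for real.

    `allow_real_tools` is a hypothetical switch; with its default (False),
    every call goes through the oracle, as in strict-only v1.
    """
    return allow_real_tools and tool_decl.get("idempotent") is True
```

Requiring `idempotent` to be exactly `True` means a missing or malformed annotation falls back to the oracle rather than to a real call.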
Test plan
Dependencies
Phase 7, depends on:
- initializ/eval#11 (Replay-B as the v1 baseline that this replaces for high-stakes agents)
- initializ/eval#15 (collector accepting replay-tagged spans on a separate path or with filtering)