FORGE-EVAL-3: replay mode (--replay) for full-sandbox eval (Phase 7)

## Context

Replay-A is a full deterministic re-execution of an agent against a candidate model, with tools stubbed by recorded responses. It is the high-fidelity replay strategy reserved for production-critical agents. Replay-B (LLM-only replay, initializ/eval#11) ships as v1; Replay-A is Phase 7 of the eval epic.

## Scope

### Replay mode entry point

```bash
forge run \
  --replay \
  --replay-source <invocation_id_or_trace_id> \
  --replay-tool-oracle <oracle_endpoint_or_file> \
  --replay-candidate-model <model> \
  --replay-output <file>
```

When `--replay` is set:

1. **Load recorded inputs**: from the trace identified by `--replay-source`, reconstruct the inbound A2A request (input message, headers including workflow_id/correlation_id).
2. **Tool oracle**: replace the tool dispatch layer with a stub that, for every tool call (tool_name, args), looks up the recorded response from the oracle. Strict by default: no recorded response → return an error to the agent (counts as drift in evaluation).
3. **Egress lockdown**: restrict `forge.yaml`'s egress allow-list to the LLM provider host only. Block everything else regardless of the declared allow-list.
4. **Candidate model**: override the LLM dispatch to use `--replay-candidate-model` regardless of `forge.yaml`.
5. **Audit emission**: tag every emitted audit event with `replay_run_id`; route to a separate NATS subject `replayEvent` (not `auditEvent`) so replay events don't pollute production audit.
6. **OTel spans**: emit normally but tag with a `replay_run_id` resource attribute so the trace store can segregate them.
7. **No A2A response**: the agent's output is written to `--replay-output` instead of being returned to a caller. The agent process exits when the invocation completes.
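
Steps 3 and 5 can be sketched as two small decision functions. This is a minimal illustration, not Forge's implementation: `RunContext`, `egress_allowed`, and `route_audit_event` are hypothetical names, and the assumption that a normal run permits exactly the declared allow-list is mine.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple
from urllib.parse import urlsplit

AUDIT_SUBJECT = "auditEvent"    # production subject named in this issue
REPLAY_SUBJECT = "replayEvent"  # replay-only subject named in this issue


@dataclass(frozen=True)
class RunContext:
    """Hypothetical per-process settings; not a real Forge type."""
    llm_host: str
    declared_allow_list: FrozenSet[str]
    replay_run_id: Optional[str] = None  # None means a normal (non-replay) run

    @property
    def is_replay(self) -> bool:
        return self.replay_run_id is not None


def egress_allowed(ctx: RunContext, url: str) -> bool:
    """In replay, only the LLM provider host passes, whatever forge.yaml declares."""
    host = urlsplit(url).hostname  # already lower-cased by urlsplit
    if host is None:
        return False
    if ctx.is_replay:
        return host == ctx.llm_host.lower()
    # Assumption: outside replay the declared allow-list applies as written.
    return host in ctx.declared_allow_list


def route_audit_event(ctx: RunContext, event: dict) -> Tuple[str, dict]:
    """Return (subject, event); replay events are tagged and kept off auditEvent."""
    if not ctx.is_replay:
        return AUDIT_SUBJECT, dict(event)
    return REPLAY_SUBJECT, {**event, "replay_run_id": ctx.replay_run_id}
```

Keeping both decisions keyed off a single `replay_run_id` means a process cannot be half in replay mode: the same value that locks down egress also tags and reroutes audit events.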

### Tool oracle protocol

Two formats:

- **File**: a JSON file containing an array of `{tool_name, args_hash, response}` entries, pre-computed from the original trace.
- **Endpoint**: an HTTP service run by the eval orchestrator; the agent calls it once per tool dispatch. Simpler to drive from eval#11/Replay-A orchestrator code.
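
A strict file-backed oracle could look roughly like the sketch below. The issue does not specify how `args_hash` is computed, so the SHA-256 over canonical JSON used here is an assumption, and `FileOracle`, `OracleMiss`, and `args_hash()` are hypothetical names.

```python
import hashlib
import json
from typing import Any, Dict, List, Tuple


class OracleMiss(Exception):
    """Raised in strict mode when no recorded response matches a tool call."""


def args_hash(args: Any) -> str:
    # Assumption: hash of canonical JSON (sorted keys, compact separators).
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class FileOracle:
    def __init__(self, entries: List[Dict[str, Any]]) -> None:
        # Index recorded responses by (tool_name, args_hash).
        self._responses: Dict[Tuple[str, str], Any] = {
            (e["tool_name"], e["args_hash"]): e["response"] for e in entries
        }
        self.drift: List[Dict[str, Any]] = []  # unmatched calls, for evaluation

    @classmethod
    def from_path(cls, path: str) -> "FileOracle":
        with open(path, encoding="utf-8") as f:
            return cls(json.load(f))

    def dispatch(self, tool_name: str, args: Any) -> Any:
        key = (tool_name, args_hash(args))
        if key not in self._responses:
            # Strict by default: record drift and surface an error to the agent.
            self.drift.append({"tool_name": tool_name, "args": args})
            raise OracleMiss(f"no recorded response for {tool_name}")
        return self._responses[key]
```

One limitation of this sketch: if the original trace called the same tool with identical arguments more than once and got different responses, a plain dictionary keeps only the last one. Handling that would need per-key ordering, which the issue does not define.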

### Tool annotation for read-only-real (future)

`forge.yaml`'s tool declarations may include `idempotent: true`; for those tools, the oracle is bypassed and the tool runs for real. Not in v1: capture this in the issue's design but ship strict-only.
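
The future bypass could be a single guarded branch in the dispatch path. The sketch below is an illustration under assumptions: `dispatch_tool`, its parameters, and the shape of `tool_config` (tool declarations already parsed from `forge.yaml` into a dict keyed by tool name) are all hypothetical.

```python
from typing import Any, Callable, Dict

ToolFn = Callable[[str, Any], Any]


def dispatch_tool(
    tool_name: str,
    args: Any,
    tool_config: Dict[str, Dict[str, Any]],
    oracle_lookup: ToolFn,
    real_call: ToolFn,
    allow_real_idempotent: bool = False,  # stays False while v1 is strict-only
) -> Any:
    """Route a tool call to the oracle, or to the real tool if explicitly allowed."""
    declared = tool_config.get(tool_name, {})
    if allow_real_idempotent and declared.get("idempotent") is True:
        return real_call(tool_name, args)
    # Default path: every call goes through the oracle, annotated or not.
    return oracle_lookup(tool_name, args)
```

Defaulting the switch to off keeps the strict-only behavior even if a `forge.yaml` already carries the annotation.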

## Test plan

- [ ] Replay against a known invocation reproduces the conversation; output is captured in `--replay-output`
- [ ] Candidate model attempts a tool the original didn't use → strict oracle returns error; agent surfaces it; replay completes with drift recorded
- [ ] Egress to a non-LLM host is blocked even if the agent declares it in the `forge.yaml` allow-list
- [ ] Audit events from the replay land on `replayEvent`, not `auditEvent`; production audit is untouched

## Dependencies

Phase 7, depends on:
- initializ/eval#11 (Replay-B as the v1 baseline that this replaces for high-stakes agents)
- initializ/eval#15 (collector accepting replay-tagged spans on a separate path or with filtering)
