Goal
Define and implement an agent harness for module and middleware code repositories (communication, persistency, lifecycle, baselibs, and related) with a deterministic evaluation loop covering build, test, lint, and policy checks.
Why
Module repos have different task units, evaluation metrics, and failure modes than the docs-as-code traceability domain. They need their own harness definition, task corpus, and trace schema. Sharing infrastructure with the docs-as-code domain (run contract, CI templates, evidence schema base) is correct; sharing the task corpus or consistency rules is not.
Domain Framing
- Fixed component: the Lane A gate per repo (compile + tests green + lint clean + policy pass)
- Variable component: harness context provided to the agent before it acts (architecture context, relevant test files, known constraints)
- Base model: any — model-agnostic
- Evaluation unit: one failing CI job or issue-scoped code change, with a known expected pass/fail verdict
Agent Roles
| Role | Responsibility |
| --- | --- |
| Specifier | Defines one task as a spec.md (failing scenario, repo context, success criteria) |
| Executor | Applies the code change given the harness context |
| Validator | Runs build, tests, lint, and policy checks; emits structured trace artifacts |
| Distiller | Outer loop script: extracts structured fields from CI output and writes per-task trace files |
Repository Entry Strategy
- Keep the top-level agent instruction file short and navigational
- Keep stack-specific guidance in indexed subsystem docs or path-scoped rules
- Favor deterministic sensors (build, test, lint, structural checks) over long prose where possible
Trace Store Schema
```
runs/
  <iteration>/
    <candidate>/
      meta.json
      score.json                  # build pass/fail, test pass count, lint error count
      traces/
        <task_id>/
          compile_errors.json     # structured compiler errors (file, line, message)
          test_failures.json      # structured test failures (test id, failure message)
          lint_results.json       # structured lint findings
          score.json              # expected verdict vs observed CI result
          agent_diff.patch
evolution_summary.jsonl
```
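The per-task layout above can be materialized by a small Distiller helper. This is a minimal sketch; the function name `write_task_trace` and its parameter shapes are illustrative, while the filenames and score fields follow the schema above.

```python
import json
from pathlib import Path

def write_task_trace(trace_dir: Path, task_id: str,
                     compile_errors: list, test_failures: list,
                     lint_results: list, expected: str, observed: str) -> Path:
    """Write the per-task trace files from the schema above.

    `expected` and `observed` are verdict strings (e.g. "pass"/"fail");
    score.json records both plus whether they match.
    """
    task_dir = trace_dir / task_id
    task_dir.mkdir(parents=True, exist_ok=True)
    (task_dir / "compile_errors.json").write_text(json.dumps(compile_errors, indent=2))
    (task_dir / "test_failures.json").write_text(json.dumps(test_failures, indent=2))
    (task_dir / "lint_results.json").write_text(json.dumps(lint_results, indent=2))
    (task_dir / "score.json").write_text(json.dumps(
        {"expected": expected, "observed": observed,
         "match": expected == observed}, indent=2))
    return task_dir
```

Keeping every field in JSON rather than raw CI logs is what makes the traces grep-able by the proposer later.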
The proposer should start from evolution_summary.jsonl, then inspect only the
relevant candidate and task traces.
This should mirror the same index-first navigation pattern already piloted in
the docs-as-code harness, even though the domain-specific artifacts differ.
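The index-first pattern can be sketched as a tiny helper the proposer runs before opening any trace files. The record fields (`candidate`, `failed_tasks`) are assumptions about the evolution_summary.jsonl line shape, not a fixed contract.

```python
import json
from pathlib import Path

def failing_tasks(summary_path: Path) -> list[tuple[str, str]]:
    """Scan the one-record-per-line summary and return (candidate, task_id)
    pairs worth inspecting, without touching any per-task trace files."""
    hits = []
    for line in summary_path.read_text().splitlines():
        rec = json.loads(line)
        for task_id in rec.get("failed_tasks", []):
            hits.append((rec["candidate"], task_id))
    return hits
```

Only the pairs returned here justify descending into `traces/<task_id>/`.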
Harness Interface
```python
class CodeHarness:
    def get_context(self, task_spec: dict) -> str:
        """Return context to present to the agent before it acts."""
        ...

    def post_process(self, agent_output: str, task_spec: dict) -> dict:
        """Optional: validate or transform agent output before CI runs."""
        ...
```
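A do-nothing baseline implementing this interface can look as follows. The task-spec keys (`failing_scenario`, `test_files`, `id`) are hypothetical field names for illustration, not a committed spec.

```python
class BaselineCodeHarness:
    """Minimal baseline: pass through the spec's own context and output."""

    def get_context(self, task_spec: dict) -> str:
        parts = [task_spec.get("failing_scenario", "")]
        if task_spec.get("test_files"):
            parts.append("Relevant tests: " + ", ".join(task_spec["test_files"]))
        return "\n".join(p for p in parts if p)

    def post_process(self, agent_output: str, task_spec: dict) -> dict:
        # Pass the diff through unchanged; a richer candidate could reject
        # patches touching files outside the task scope here.
        return {"patch": agent_output, "task_id": task_spec.get("id")}
```

A baseline this thin is still useful: it gives the outer loop a known-weak candidate to improve on.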
Public Task Corpus
- Sourced from historical failing CI jobs across Wave 1 repos
- Each scenario: repo snapshot at failure point + CI command + expected pass verdict
- No confidential defect data or product-specific customer findings (those stay in OEM internal file)
- Search set: 30-50 scenarios; held-out set: 10-15 scenarios, cleanly separated from the search set
- Corpus includes at least one scenario per stack group: C++/Bazel, Rust, Python
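A scenario record might look like the following sketch. Every field name and value here is illustrative, not a fixed corpus schema.

```python
# Hypothetical public task scenario; field names are illustrative.
scenario = {
    "id": "comm-0007",
    "repo": "communication",
    "snapshot": "refs/ci-failures/comm-0007",  # repo state at the failure point
    "ci_command": "bazel test //comm/...",
    "expected_verdict": "pass",                # after the fix, CI must go green
    "stack": "cpp_bazel",
}
```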
Frameworks Used
- Spec Kit (`specify`): bootstrap task specs per failing scenario
- Meta-Harness pattern: outer loop reads trace filesystem, proposes improved harness (Lane B)
- Open Harness: evaluate for deterministic CI replay once it reaches a stable API (currently early)
Lane A Checks (mandatory)
Varies by repo stack — all must produce structured JSON outputs:
| Stack | Required checks |
| --- | --- |
| C++/Bazel | `bazel build`, `bazel test`, clang-tidy or equivalent |
| Rust | `cargo build`, `cargo test`, `cargo clippy` |
| Python | `pytest`, `ruff` or `flake8` |
All stacks: policy check consuming Lane A artifacts, schema validation.
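The stack-agnostic policy check can consume the Lane A artifacts directly. This is a sketch assuming the trace filenames from the schema above; the function name and the `max_lint_errors` threshold are assumptions.

```python
import json
from pathlib import Path

def policy_check(task_dir: Path, max_lint_errors: int = 0) -> dict:
    """Policy gate over a task's Lane A artifacts: no compile errors,
    no test failures, lint findings within budget."""
    compile_errors = json.loads((task_dir / "compile_errors.json").read_text())
    test_failures = json.loads((task_dir / "test_failures.json").read_text())
    lint = json.loads((task_dir / "lint_results.json").read_text())
    ok = (not compile_errors and not test_failures
          and len(lint) <= max_lint_errors)
    return {"policy_pass": ok, "lint_errors": len(lint)}
```

Because it only reads JSON artifacts, this check runs in CI with no LLM dependency, as required by Lane A.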
Lightweight Validation Before Full Evaluation
Before a candidate enters the full task set, run a cheap validation step:
- import the candidate harness
- instantiate the harness class
- verify one tiny task spec can be loaded
- verify the candidate emits the expected trace filenames
- expose a small query helper over `runs/` so agents can compare candidates and inspect failed tasks without scanning whole histories
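The first four steps above can be sketched as a single pre-flight function. The `trace_filenames` attribute is an assumed convention for how a candidate declares its outputs; everything else uses only the stdlib.

```python
import importlib

REQUIRED_TRACE_FILES = {"compile_errors.json", "test_failures.json",
                        "lint_results.json", "score.json"}

def smoke_check(module_name: str, class_name: str, tiny_spec: dict) -> bool:
    """Cheap pre-flight before full evaluation: import the candidate,
    instantiate its harness class, load one tiny task spec, and confirm
    it declares the expected trace filenames."""
    mod = importlib.import_module(module_name)
    harness = getattr(mod, class_name)()
    context = harness.get_context(tiny_spec)
    declared = set(getattr(harness, "trace_filenames", REQUIRED_TRACE_FILES))
    return bool(context) and REQUIRED_TRACE_FILES <= declared
```

A candidate that fails this check never touches the full task set, which keeps broken candidates cheap.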
Done When
- At least 20 public task scenarios exist across at least two module repos
- Outer loop runs end-to-end for at least one stack (C++ or Rust)
- One baseline harness candidate is evaluated and its trace is grep-able
- Lane A checks run in CI without LLM dependency
- Trace schema produces structured outputs grep-able by proposer
- Run history is navigable from a summary index plus per-task structured traces
- Top-level agent guidance stays concise and delegates stack detail to indexed docs
Parent: #2851