
Agent Harness — Module and Middleware Code Domain #2851

@FScholPer

Description

Goal

Define and implement an agent harness for module and middleware code repositories (communication, persistency, lifecycle, baselibs, and related) with a deterministic evaluation loop covering build, test, lint, and policy checks.

Why

Module repos have different task units, evaluation metrics, and failure modes than the docs-as-code traceability domain. They need their own harness definition, task corpus, and trace schema. Sharing infrastructure with the docs-as-code domain (run contract, CI templates, evidence schema base) is correct; sharing the task corpus or consistency rules is not.

Domain Framing

  • Fixed component: the Lane A gate per repo (compile + tests green + lint clean + policy pass)
  • Variable component: harness context provided to the agent before it acts (architecture context, relevant test files, known constraints)
  • Base model: any — model-agnostic
  • Evaluation unit: one failing CI job or issue-scoped code change, with a known expected pass/fail verdict

Agent Roles

| Role | Responsibility |
| --- | --- |
| Specifier | Defines one task as a `spec.md` (failing scenario, repo context, success criteria) |
| Executor | Applies the code change given the harness context |
| Validator | Runs build, tests, lint, and policy checks; emits structured trace artifacts |
| Distiller | Outer loop script: extracts structured fields from CI output and writes per-task trace files |

Repository Entry Strategy

  • Keep the top-level agent instruction file short and navigational
  • Keep stack-specific guidance in indexed subsystem docs or path-scoped rules
  • Favor deterministic sensors (build, test, lint, structural checks) over long prose where possible

Trace Store Schema

```
runs/
  <iteration>/
    <candidate>/
      meta.json
      score.json                  # build pass/fail, test pass count, lint error count
      traces/
        <task_id>/
          compile_errors.json     # structured compiler errors (file, line, message)
          test_failures.json      # structured test failures (test id, failure message)
          lint_results.json       # structured lint findings
          score.json              # expected verdict vs observed CI result
          agent_diff.patch
evolution_summary.jsonl
```

The proposer should start from evolution_summary.jsonl, then inspect only the
relevant candidate and task traces.

This should mirror the same index-first navigation pattern already piloted in
the docs-as-code harness, even though the domain-specific artifacts differ.
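The index-first navigation pattern can be sketched as a small helper. This is a minimal sketch, assuming `evolution_summary.jsonl` holds one JSON object per line; all field and path names follow the schema tree above, but the helper names themselves are illustrative:

```python
import json
from pathlib import Path


def load_summary(runs_dir):
    """Read the run index first: one JSON object per line, newest last."""
    path = Path(runs_dir) / "evolution_summary.jsonl"
    return [json.loads(line) for line in path.read_text().splitlines() if line.strip()]


def task_trace(runs_dir, iteration, candidate, task_id, artifact):
    """Drill into a single task's structured trace artifact on demand,
    instead of scanning the whole runs/ tree."""
    path = Path(runs_dir) / iteration / candidate / "traces" / task_id / artifact
    return json.loads(path.read_text())
```

A proposer would call `load_summary` once, pick the candidates worth inspecting, and only then touch individual trace files.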

Harness Interface

```python
class CodeHarness:
    def get_context(self, task_spec: dict) -> str:
        """Return context to present to the agent before it acts."""
        ...

    def post_process(self, agent_output: str, task_spec: dict) -> dict:
        """Optional: validate or transform agent output before CI runs."""
        ...
```
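A concrete subclass for one stack might look like the following sketch. The class name and the task-spec field names (`ci_command`, `relevant_tests`, `constraints`) are assumptions for illustration, not a fixed schema; the base stub is repeated so the snippet runs standalone:

```python
class CodeHarness:
    """Base interface (repeated from the definition above so the snippet is standalone)."""

    def get_context(self, task_spec: dict) -> str: ...
    def post_process(self, agent_output: str, task_spec: dict) -> dict: ...


class RustCodeHarness(CodeHarness):
    """Illustrative harness for a Rust repo; field names are assumptions."""

    def get_context(self, task_spec: dict) -> str:
        # Assemble the variable component: failing job, relevant tests, constraints.
        parts = [f"Failing CI job: {task_spec['ci_command']}"]
        parts += [f"Relevant test: {t}" for t in task_spec.get("relevant_tests", [])]
        if task_spec.get("constraints"):
            parts.append("Constraints: " + "; ".join(task_spec["constraints"]))
        return "\n".join(parts)

    def post_process(self, agent_output: str, task_spec: dict) -> dict:
        # Cheap sanity check: reject empty diffs before spending a CI run.
        return {"diff": agent_output, "valid": bool(agent_output.strip())}
```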

Public Task Corpus

  • Sourced from historical failing CI jobs across Wave 1 repos
  • Each scenario: repo snapshot at failure point + CI command + expected pass verdict
  • No confidential defect data or product-specific customer findings (those stay in the OEM-internal file)
  • Search set: 30-50 scenarios; held-out: 10-15 cleanly separated
  • Corpus includes at least one scenario per stack group: C++/Bazel, Rust, Python
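A single corpus entry might be represented as follows. Every field name and value here is illustrative (the snapshot path and task id are invented for the example), with a small validator to keep entries consistent:

```python
# Illustrative corpus entry; field names and values are assumptions, not a fixed schema.
scenario = {
    "task_id": "comm-0042",                               # hypothetical id
    "repo_snapshot": "snapshots/communication@a1b2c3",    # hypothetical snapshot ref
    "ci_command": "bazel test //communication/...",
    "expected_verdict": "pass",
    "stack": "cpp_bazel",
    "split": "search",  # "search" (30-50 scenarios) vs "held_out" (10-15)
}

REQUIRED = {"task_id", "repo_snapshot", "ci_command", "expected_verdict", "stack", "split"}


def validate_scenario(s: dict) -> list:
    """Return the sorted list of missing required fields (empty means valid)."""
    return sorted(REQUIRED - s.keys())
```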

Frameworks Used

  • Spec Kit (specify): bootstrap task specs per failing scenario
  • Meta-Harness pattern: outer loop reads trace filesystem, proposes improved harness (Lane B)
  • Open Harness: evaluate for deterministic CI replay once it reaches a stable API (currently early)

Lane A Checks (mandatory)

Varies by repo stack — all must produce structured JSON outputs:

| Stack | Required checks |
| --- | --- |
| C++/Bazel | `bazel build`, `bazel test`, `clang-tidy` or equivalent |
| Rust | `cargo build`, `cargo test`, `cargo clippy` |
| Python | `pytest`, `ruff` or `flake8` |

All stacks: policy check consuming Lane A artifacts, schema validation.
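A minimal check runner that produces the structured JSON outputs could look like this sketch. It assumes exit code 0 means pass; a real harness would additionally parse tool-specific diagnostics (compiler errors, test ids, lint findings) into the per-task trace files:

```python
import json
import subprocess
from pathlib import Path


def run_check(name: str, cmd: list, out_dir) -> dict:
    """Run one Lane A check and write a structured JSON result.

    Assumption: exit code 0 means pass. Tool-specific parsing of
    stdout/stderr into structured fields is left out of this sketch.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    result = {
        "check": name,
        "command": cmd,
        "exit_code": proc.returncode,
        "passed": proc.returncode == 0,
    }
    (Path(out_dir) / f"{name}.json").write_text(json.dumps(result, indent=2))
    return result
```

The policy check can then consume these JSON files instead of re-running the tools.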

Lightweight Validation Before Full Evaluation

Before a candidate enters the full task set, run a cheap validation step:

  • import the candidate harness
  • instantiate the harness class
  • verify one tiny task spec can be loaded
  • verify the candidate emits the expected trace filenames
  • expose a small query helper over runs/ so agents can compare candidates and inspect failed tasks without scanning whole histories
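The checklist above can be sketched as a smoke test. Module, class, and attribute names here are hypothetical; the expected trace filenames come from the trace schema, and the check against a `trace_files` attribute is one possible (assumed) way a candidate could declare what it emits:

```python
import importlib

# Expected per-task trace filenames, taken from the trace store schema.
EXPECTED_TRACE_FILES = {
    "compile_errors.json", "test_failures.json",
    "lint_results.json", "score.json", "agent_diff.patch",
}


def smoke_test(module_name: str, class_name: str, tiny_spec: dict) -> dict:
    """Cheap validation before a candidate enters the full task set."""
    module = importlib.import_module(module_name)   # 1. import the candidate harness
    harness = getattr(module, class_name)()         # 2. instantiate the harness class
    context = harness.get_context(tiny_spec)        # 3. load one tiny task spec
    assert isinstance(context, str) and context
    # 4. verify the candidate declares the expected trace filenames
    #    (falls back to the schema defaults if it declares nothing).
    declared = set(getattr(harness, "trace_files", EXPECTED_TRACE_FILES))
    missing = EXPECTED_TRACE_FILES - declared
    return {"ok": not missing, "missing": sorted(missing)}
```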

Done When

  • At least 20 public task scenarios exist across at least two module repos
  • Outer loop runs end-to-end for at least one stack (C++ or Rust)
  • One baseline harness candidate is evaluated and its trace is grep-able
  • Lane A checks run in CI without LLM dependency
  • Trace schema produces structured outputs grep-able by the proposer
  • Run history is navigable from a summary index plus per-task structured traces
  • Top-level agent guidance stays concise and delegates stack detail to indexed docs

Parent: #2851
