Goal
Define and implement an agent harness for module and middleware code repositories (communication, persistency, lifecycle, baselibs, and related) with a deterministic evaluation loop covering build, test, lint, and policy checks.
Why
Module repos have different task units, evaluation metrics, and failure modes than the docs-as-code traceability domain. They need their own harness definition, task corpus, and trace schema. Sharing infrastructure with the docs-as-code domain (run contract, CI templates, evidence schema base) is correct; sharing the task corpus or consistency rules is not.
Domain Framing
- Fixed component: the Lane A gate per repo (compile + tests green + lint clean + policy pass)
- Variable component: harness context provided to the agent before it acts (architecture context, relevant test files, known constraints)
- Base model: any — model-agnostic
- Evaluation unit: one failing CI job or issue-scoped code change, with a known expected pass/fail verdict
Agent Roles
| Role | Responsibility |
| --- | --- |
| Specifier | Defines one task as a spec.md (failing scenario, repo context, success criteria) |
| Executor | Applies the code change given the harness context |
| Validator | Runs build, tests, lint, and policy checks; emits structured trace artifacts |
| Distiller | Outer loop script: extracts structured fields from CI output and writes per-task trace files |
Repository Entry Strategy
- Keep the top-level agent instruction file short and navigational
- Keep stack-specific guidance in indexed subsystem docs or path-scoped rules
- Favor deterministic sensors (build, test, lint, structural checks) over long prose where possible
Trace Store Schema
```
runs/
  <iteration>/
    <candidate>/
      meta.json
      score.json                  # build pass/fail, test pass count, lint error count
      traces/
        <task_id>/
          compile_errors.json     # structured compiler errors (file, line, message)
          test_failures.json      # structured test failures (test id, failure message)
          lint_results.json       # structured lint findings
          score.json              # expected verdict vs observed CI result
          agent_diff.patch
evolution_summary.jsonl
```
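The per-task layout above can be materialized by a small Distiller helper. This is a minimal sketch; the function name `write_task_trace` and its parameter shapes are illustrative, while the filenames and score fields follow the schema above.

```python
import json
from pathlib import Path

def write_task_trace(trace_dir: Path, task_id: str,
                     compile_errors: list, test_failures: list,
                     lint_results: list, expected: str, observed: str) -> Path:
    """Write the per-task trace files from the schema above.

    `expected` and `observed` are verdict strings (e.g. "pass"/"fail");
    score.json records both plus whether they match.
    """
    task_dir = trace_dir / task_id
    task_dir.mkdir(parents=True, exist_ok=True)
    (task_dir / "compile_errors.json").write_text(json.dumps(compile_errors, indent=2))
    (task_dir / "test_failures.json").write_text(json.dumps(test_failures, indent=2))
    (task_dir / "lint_results.json").write_text(json.dumps(lint_results, indent=2))
    (task_dir / "score.json").write_text(json.dumps(
        {"expected": expected, "observed": observed,
         "match": expected == observed}, indent=2))
    return task_dir
```

Keeping every field in JSON rather than raw CI logs is what makes the traces grep-able by the proposer later.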
The proposer should start from evolution_summary.jsonl, then inspect only the
relevant candidate and task traces.
This should mirror the same index-first navigation pattern already piloted in
the docs-as-code harness, even though the domain-specific artifacts differ.
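The index-first pattern can be sketched as a tiny helper the proposer runs before opening any trace files. The record fields (`candidate`, `failed_tasks`) are assumptions about the evolution_summary.jsonl line shape, not a fixed contract.

```python
import json
from pathlib import Path

def failing_tasks(summary_path: Path) -> list[tuple[str, str]]:
    """Scan the one-record-per-line summary and return (candidate, task_id)
    pairs worth inspecting, without touching any per-task trace files."""
    hits = []
    for line in summary_path.read_text().splitlines():
        rec = json.loads(line)
        for task_id in rec.get("failed_tasks", []):
            hits.append((rec["candidate"], task_id))
    return hits
```

Only the pairs returned here justify descending into `traces/<task_id>/`.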
Harness Interface
```python
class CodeHarness:
    def get_context(self, task_spec: dict) -> str:
        """Return context to present to the agent before it acts."""
        ...

    def post_process(self, agent_output: str, task_spec: dict) -> dict:
        """Optional: validate or transform agent output before CI runs."""
        ...
```
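A do-nothing baseline implementing this interface can look as follows. The task-spec keys (`failing_scenario`, `test_files`, `id`) are hypothetical field names for illustration, not a committed spec.

```python
class BaselineCodeHarness:
    """Minimal baseline: pass through the spec's own context and output."""

    def get_context(self, task_spec: dict) -> str:
        parts = [task_spec.get("failing_scenario", "")]
        if task_spec.get("test_files"):
            parts.append("Relevant tests: " + ", ".join(task_spec["test_files"]))
        return "\n".join(p for p in parts if p)

    def post_process(self, agent_output: str, task_spec: dict) -> dict:
        # Pass the diff through unchanged; a richer candidate could reject
        # patches touching files outside the task scope here.
        return {"patch": agent_output, "task_id": task_spec.get("id")}
```

A baseline this thin is still useful: it gives the outer loop a known-weak candidate to improve on.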
Public Task Corpus
- Sourced from historical failing CI jobs across Wave 1 repos
- Each scenario: repo snapshot at failure point + CI command + expected pass verdict
- No confidential defect data or product-specific customer findings (those stay in OEM internal file)
- Search set: 30-50 scenarios; held-out set: 10-15 scenarios, cleanly separated from the search set
- Corpus includes at least one scenario per stack group: C++/Bazel, Rust, Python
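A scenario record might look like the following sketch. Every field name and value here is illustrative, not a fixed corpus schema.

```python
# Hypothetical public task scenario; field names are illustrative.
scenario = {
    "id": "comm-0007",
    "repo": "communication",
    "snapshot": "refs/ci-failures/comm-0007",  # repo state at the failure point
    "ci_command": "bazel test //comm/...",
    "expected_verdict": "pass",                # after the fix, CI must go green
    "stack": "cpp_bazel",
}
```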
Frameworks Used
- Spec Kit (`specify`): bootstrap task specs per failing scenario
- Meta-Harness pattern: outer loop reads trace filesystem, proposes improved harness (Lane B)
- Open Harness: evaluate for deterministic CI replay once it reaches a stable API (currently early)
Lane A Checks (mandatory)
Varies by repo stack — all must produce structured JSON outputs:
| Stack | Required checks |
| --- | --- |
| C++/Bazel | `bazel build`, `bazel test`, clang-tidy or equivalent |
| Rust | `cargo build`, `cargo test`, `cargo clippy` |
| Python | `pytest`, `ruff` or `flake8` |
All stacks: policy check consuming Lane A artifacts, schema validation.
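The stack-agnostic policy check can consume the Lane A artifacts directly. This is a sketch assuming the trace filenames from the schema above; the function name and the `max_lint_errors` threshold are assumptions.

```python
import json
from pathlib import Path

def policy_check(task_dir: Path, max_lint_errors: int = 0) -> dict:
    """Policy gate over a task's Lane A artifacts: no compile errors,
    no test failures, lint findings within budget."""
    compile_errors = json.loads((task_dir / "compile_errors.json").read_text())
    test_failures = json.loads((task_dir / "test_failures.json").read_text())
    lint = json.loads((task_dir / "lint_results.json").read_text())
    ok = (not compile_errors and not test_failures
          and len(lint) <= max_lint_errors)
    return {"policy_pass": ok, "lint_errors": len(lint)}
```

Because it only reads JSON artifacts, this check runs in CI with no LLM dependency, as required by Lane A.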
Lightweight Validation Before Full Evaluation
Before a candidate enters the full task set, run a cheap validation step:
- import the candidate harness
- instantiate the harness class
- verify one tiny task spec can be loaded
- verify the candidate emits the expected trace filenames
- expose a small query helper over `runs/` so agents can compare candidates and inspect failed tasks without scanning whole histories
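The first four steps above can be sketched as a single pre-flight function. The `trace_filenames` attribute is an assumed convention for how a candidate declares its outputs; everything else uses only the stdlib.

```python
import importlib

REQUIRED_TRACE_FILES = {"compile_errors.json", "test_failures.json",
                        "lint_results.json", "score.json"}

def smoke_check(module_name: str, class_name: str, tiny_spec: dict) -> bool:
    """Cheap pre-flight before full evaluation: import the candidate,
    instantiate its harness class, load one tiny task spec, and confirm
    it declares the expected trace filenames."""
    mod = importlib.import_module(module_name)
    harness = getattr(mod, class_name)()
    context = harness.get_context(tiny_spec)
    declared = set(getattr(harness, "trace_filenames", REQUIRED_TRACE_FILES))
    return bool(context) and REQUIRED_TRACE_FILES <= declared
```

A candidate that fails this check never touches the full task set, which keeps broken candidates cheap.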
Done When
- At least 20 public task scenarios exist across at least two module repos
- Outer loop runs end-to-end for at least one stack (C++ or Rust)
- One baseline harness candidate is evaluated and its trace is grep-able
- Lane A checks run in CI without LLM dependency
- Trace schema produces structured outputs grep-able by proposer
- Run history is navigable from a summary index plus per-task structured traces
- Top-level agent guidance stays concise and delegates stack detail to indexed docs
Parent: #2851