Harden eval suites, apply autoresearch-verified skill fixes #23

Merged
zircote merged 9 commits into main from eval-suite-and-skill-improvements on Jul 1, 2026
zircote commented Jul 1, 2026

Summary

This branch runs two passes over all 23 doc-genre skills in this plugin.

The first pass is an eval-doctor review. Every skill's evals/evals.json was checked against the eval-quality rubric. Expectations that relied only on an LLM judgment call, using language such as "must contain," "must equal," or "must match," were converted into deterministic checks. Prompts were reworded to target a named output file instead of transcript.md, since the deterministic checker cannot resolve that path. Deterministic coverage across the suite rose from zero percent to approximately sixty-three percent.
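As a rough illustration of what moving from a judgment-call expectation to a deterministic check can look like, the sketch below runs a regex against a named output file. The check keys (`file`, `pattern`, `must_match`) and the helper `run_check` are hypothetical; the actual schema of evals/evals.json and the behaviour of the real checker are not shown in this PR.

```python
import re
import tempfile
from pathlib import Path


def run_check(workdir: Path, check: dict) -> bool:
    """Return True when a check passes against a named output file.

    `check` uses hypothetical keys: "file", "pattern", "must_match".
    """
    target = workdir / check["file"]
    if not target.is_file():
        # A missing artifact fails the same way on every run.
        return False
    text = target.read_text(encoding="utf-8")
    found = re.search(check["pattern"], text) is not None
    return found == check.get("must_match", True)


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "runbook.md").write_text(
        "# Runbook\n\n## Rollback\n", encoding="utf-8"
    )
    checks = [
        # Positive check: the heading must exist on its own line.
        {"file": "runbook.md", "pattern": r"(?m)^## Rollback\s*$"},
        # Negative check: the placeholder must be absent.
        {"file": "runbook.md", "pattern": r"TODO", "must_match": False},
        # This sketch never writes transcript.md, so the check fails.
        {"file": "transcript.md", "pattern": r"Rollback"},
    ]
    results = [run_check(workdir, c) for c in checks]
```

The same inputs always produce the same verdicts, which is the property an LLM-graded "must contain" expectation cannot guarantee.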

The second pass runs the autoresearch improvement loop against the newly hardened evals. Ten skills returned a genuine, verified improvement. The remaining skills were already scoring at or near the ceiling on their own eval set, so no change was applied.

Changes by skill

arc42-arch-doc and diataxis-tutorial each carried a self-referential bug: the rule text banning a word (TBD) contained that literal word, so the rule leaked into transcripts and tripped its own check. Both were reworded to state the rule without quoting the banned term.
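A minimal sketch of how such a self-referential rule can trip its own check. The rule wordings, the transcript strings, and the check itself are hypothetical stand-ins; the real SKILL.md text and eval patterns are not quoted in this PR.

```python
import re

BANNED = re.compile(r"\bTBD\b")


def passes_banned_word_check(text: str) -> bool:
    """Pass only when the banned token appears nowhere in the text."""
    return BANNED.search(text) is None


# Hypothetical rule wordings, before and after the rewording.
rule_quoting_token = "Never leave TBD in a finished section."
rule_reworded = "Never leave a to-be-determined placeholder in a finished section."

# A transcript that echoes the rule inherits whatever the rule text contains.
leaky_transcript = f"Applying rule: {rule_quoting_token}\n## Context\nDone."
clean_transcript = f"Applying rule: {rule_reworded}\n## Context\nDone."

leaky_ok = passes_banned_word_check(leaky_transcript)
clean_ok = passes_banned_word_check(clean_transcript)
```

Here `leaky_ok` is False even though the generated section itself is clean, while `clean_ok` is True once the rule no longer quotes the token.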

c4-model-diagram gained guidance on widening the outer fence when a mermaid-fenced document is quoted elsewhere, so backtick nesting does not get escaped into invalid output.
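The widening rule can be sketched as follows, assuming CommonMark-style fences, where a closing fence must be at least as long as the opening one. The helper name `wrap_in_fence` and the sample document are illustrative only and do not come from the skill.

```python
import re


def wrap_in_fence(document: str, info: str = "markdown") -> str:
    """Quote `document` inside a fence longer than any backtick run it contains."""
    longest = max((len(run) for run in re.findall(r"`+", document)), default=0)
    fence = "`" * max(3, longest + 1)
    return f"{fence}{info}\n{document.rstrip()}\n{fence}"


tick3 = "`" * 3  # built programmatically to keep this example easy to quote
inner = f"# Context\n\n{tick3}mermaid\nC4Context\n{tick3}"
quoted = wrap_in_fence(inner)
```

Because the inner document contains a run of three backticks, the outer fence becomes four, so the mermaid block survives verbatim and nothing needs escaping.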

diataxis-how-to regained a rule requiring content to be relocated, not deleted, when correcting a draft that has drifted into another genre.

ears-acceptance-criteria, and by extension feature-spec, now instruct the model to commit to a plausible component name and flag it as an assumption when none is given, rather than deferring the choice to the reader.

mif-frontmatter now surfaces the type enum whenever the L1 floor is discussed, not only while drafting YAML.

mif-validate now states the determinism and lossless-round-trip guarantees as part of its answer, not only as internal reasoning.

python-pep and rust-rfc received clarity fixes: a single-state constraint on the Status field, and a rule that a review must supply corrected text alongside every named gap.

Every change above was checked for markdownlint compliance (this repository caps prose at one hundred characters) and, where a template was touched, for MIF round-trip integrity.

Review

This branch went through arbiter's independent review loop: three rounds, each dispatched to an out-of-session sub-agent with no prior involvement in authoring the change. Round one caught two fragile eval regexes and a dropped L2 coverage check. Round two trimmed three redundant restatements and taught feature-spec a convention its own eval already expected. Round three confirmed that the one remaining item, a duplicated gate-notes.md scenario repeated across eight sibling evals, has no safe surgical fix without introducing a shared-fixture mechanism to the eval loader. It is left as a documented limitation, not treated as a bug.

The loop ended at zero must-fix and zero should-fix findings.

Verification

  • npm run validate-plugin: eighty-four checks passed, zero errors
  • npm run lint:md: two hundred seventy-seven files, zero errors
  • npm run test:hook: five of five tests pass
  • MIF round-trip verified clean on both touched templates

Excluded from this pass

Four skills, changelog, diataxis-explanation, diataxis-reference, and sre-runbook, surfaced real, independently verified defects during the autoresearch loop. In each case, the gain from fixing the defect was offset by grading noise on an unrelated eval, so the composite score never exceeded baseline under the strict keep-only-if-better rule. These four are candidates for a follow-up eval-doctor pass targeting the specific checks flagged in those runs before the loop is run again.
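The arithmetic behind that outcome can be sketched as below. It assumes the composite is a plain total of per-eval passed checks, which is an assumption made for illustration; the loop's real scoring formula is not described in this PR, and all numbers are hypothetical.

```python
def keep_change(baseline: dict, candidate: dict) -> bool:
    """Strict rule: keep only when the composite strictly exceeds baseline."""
    return sum(candidate.values()) > sum(baseline.values())


# Hypothetical passed-check counts per eval for one skill.
baseline = {"eval_1": 8, "eval_2": 9, "eval_3": 10}
# eval_1 gains a point from the fix; eval_2 loses one to grading noise.
candidate = {"eval_1": 9, "eval_2": 8, "eval_3": 10}

kept = keep_change(baseline, candidate)
```

The totals tie at 27, so `kept` is False and the fix is discarded even though it repaired a real defect.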

Runs an eval-doctor pass across all 23 doc-genre skills, converting
LLM-only expectations into deterministic checks (0% -> ~63% deterministic
coverage) and targeting named output files instead of transcript.md.

Then runs the autoresearch improvement loop against the hardened evals
and applies the 10 improvements it found and verified: self-referential
banned-word bugs in arc42-arch-doc and diataxis-tutorial, missing
mermaid-fence-nesting guidance in c4-model-diagram, a dropped genre-drift
rule in diataxis-how-to, a concrete-naming rule in ears-acceptance-criteria
(mirrored into feature-spec), a type-enum surfacing gap in mif-frontmatter,
a results-reporting gap in mif-validate, and review/status-clarity fixes
in python-pep and rust-rfc.

Reviewed via an independent out-of-session loop (3 rounds, 0 must-fix /
0 should-fix remaining); all touched skills verified against
validate-plugin, lint:md, test:hook, and MIF round-trip.

Copilot AI left a comment

Pull request overview

This PR hardens the evaluation suites for multiple doc-genre skills by adding deterministic checks (and updating prompts/expected outputs to write to concrete output files), while also applying targeted skill-instruction fixes discovered via an autoresearch loop (e.g., clearer genre boundaries, stronger constraints, and additional authoring rules).

Changes:

  • Add deterministic_checks to many skills’ evals/evals.json and update prompts to save artifacts to named files (e.g., runbook.md, spec.md, design.md) for deterministic grading.
  • Update several SKILL.md documents with clarified authoring rules (e.g., RFC vs ADR guidance, PEP review expectations, MIF/frontmatter rules, C4 quoting guidance).
  • Tighten or expand eval prompts/expectations to reduce subjective grading and increase structural/verifiable coverage.

Reviewed changes

Copilot reviewed 35 out of 35 changed files in this pull request and generated 8 comments.

Show a summary per file
| File | Description |
| --- | --- |
| skills/sre-runbook/evals/evals.json | Adds deterministic runbook structure/content checks and updates prompts to write runbook.md/section files. |
| skills/rust-rfc/SKILL.md | Clarifies RFC guidance, including “not an RFC” cases and steering to ADRs when post-decision. |
| skills/rust-rfc/evals/evals.json | Adds deterministic section/frontmatter checks and file-targeted outputs for RFC evals. |
| skills/python-pep/SKILL.md | Adds explicit single-state Status: rule and strengthens “review must include corrected text” guidance. |
| skills/python-pep/evals/evals.json | Adds deterministic header/section/frontmatter checks and new eval scenarios. |
| skills/prd/evals/evals.json | Adds deterministic PRD structure/frontmatter/EARS checks and more specific prompts. |
| skills/playbook/evals/evals.json | Adds deterministic section/frontmatter checks and expands scenario coverage. |
| skills/mif-validate/SKILL.md | Requires reporting determinism + lossless round-trip properties explicitly in answers. |
| skills/mif-validate/evals/evals.json | Reworks prompts toward concrete artifacts and adds deterministic checks for tool outputs/projections. |
| skills/mif-frontmatter/SKILL.md | Makes type enum explicit whenever L1/floors are discussed; tightens placeholder guidance. |
| skills/mif-frontmatter/evals/evals.json | Adds deterministic checks for field presence/absence and date handling across L1–L3 cases. |
| skills/kiro-tasks/evals/evals.json | Adds deterministic checks for checkbox numbering, traceability markers, and frontmatter. |
| skills/kiro-requirements/evals/evals.json | Adds deterministic checks for numbering, EARS forms, unhappy-path criteria, and frontmatter. |
| skills/kiro-design/evals/evals.json | Adds deterministic checks for required sections, requirement citations, and domain specificity. |
| skills/google-design-doc/evals/evals.json | Adds deterministic checks for required sections, frontmatter, and scope/non-goals structure. |
| skills/feature-spec/SKILL.md | Adds edge-case rules (credential validity categories) and explicit-assumption guidance for sparse inputs. |
| skills/feature-spec/evals/evals.json | Adds deterministic section/EARS/frontmatter checks and file-targeted outputs. |
| skills/ears-acceptance-criteria/SKILL.md | Requires choosing a plausible concrete component name when absent and flagging as assumption. |
| skills/ears-acceptance-criteria/evals/evals.json | Adds deterministic checks for templates, placeholder avoidance, and multi-criterion splitting. |
| skills/doc-set-planner/evals/evals.json | Adds deterministic checks for recipe/member naming and linkage language. |
| skills/diataxis-tutorial/templates/bad.md | Refines antipattern commentary to emphasize named explanation pointers. |
| skills/diataxis-tutorial/SKILL.md | Strengthens tutorial rules around “named explanation” pointers and single happy-path constraints. |
| skills/diataxis-tutorial/evals/evals.json | Adds deterministic checks for prerequisites/steps/frontmatter and mode-mixing detection. |
| skills/diataxis-reference/evals/evals.json | Adds deterministic checks for synopsis/tables/no-steps/no-opinion constraints. |
| skills/diataxis-how-to/SKILL.md | Adds explicit rationale-vs-action clause guidance and “relocate, don’t delete” drift-fix procedure. |
| skills/diataxis-how-to/evals/evals.json | Adds deterministic checks for file output, headings/steps, L2 ceiling fields, and anti-placeholders. |
| skills/diataxis-explanation/evals/evals.json | Adds deterministic checks to prevent how-to/reference drift and require trade-off language. |
| skills/changelog/evals/evals.json | Adds deterministic checks for Keep-a-Changelog structure, categorization, and MIF gating language. |
| skills/c4-model-diagram/SKILL.md | Adds guidance for quoting mermaid-fenced docs without escaping (outer-fence widening). |
| skills/c4-model-diagram/evals/evals.json | Adds deterministic checks for C4 mermaid blocks, actors/boundaries, and frontmatter. |
| skills/arc42-arch-doc/templates/bad.md | Removes self-referential placeholder token usage from the bad exemplar text. |
| skills/arc42-arch-doc/SKILL.md | Clarifies section numbering conventions and avoids literal placeholder token mention in rules. |
| skills/arc42-arch-doc/evals/evals.json | Adds deterministic checks for section order/presence, no placeholders, and frontmatter type. |
| skills/ai-architecture-doc/evals/evals.json | Adds deterministic checks for required composite sections and EARS-style NFR presence. |
| skills/adr/evals/evals.json | Adds deterministic checks for MADR structure elements and lifecycle/status constraints. |

Comment thread skills/ears-acceptance-criteria/evals/evals.json
Comment thread skills/ears-acceptance-criteria/evals/evals.json
Comment thread skills/prd/evals/evals.json
Comment thread skills/prd/evals/evals.json
Comment thread skills/ai-architecture-doc/evals/evals.json
Comment thread skills/ai-architecture-doc/evals/evals.json
Comment thread skills/sre-runbook/evals/evals.json
Comment thread skills/sre-runbook/evals/evals.json
zircote added 3 commits July 1, 2026 11:21
…asing

Addresses Copilot review feedback on PR #23:
- prd: EARS regex checks required uppercase WHEN/IF/WHILE/SHALL, failing
  correct title-case output. Added the (?i) flag to match this file's
  existing convention.
- sre-runbook: the "expected result" check required that literal phrase,
  failing compliant runbooks using "Expected:", "you should see", etc.
  Broadened to a regex alternation covering the wording SKILL.md actually
  permits.

Addresses Copilot review feedback on PR #23: the two WHEN/IF/WHILE/SHALL
regexes required all-caps, failing correct title-case EARS output.
Added the (?i) flag, consistent with the file's other checks.

Addresses Copilot review feedback on PR #23: two deterministic checks
shelled out to python3 just to count occurrences of "shall", making the
eval dependent on the runner environment and expanding the attack surface.
Replaced both with a repeated non-greedy regex group that asserts the
same count declaratively.
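The three regex fixes described in these commit messages can be sketched as below. Every pattern is a hypothetical reconstruction written with Python's `re` module; the actual patterns in the evals.json files are not shown in this PR, and the alternation lists only the three wordings the commit message quotes.

```python
import re

# 1. Case-insensitive EARS keywords: (?i) lets title-case output pass.
ears_strict = re.compile(r"\bWHEN\b.*\bSHALL\b")
ears_relaxed = re.compile(r"(?i)\bwhen\b.*\bshall\b")
criterion = "When the token expires, the gateway shall return 401."

# 2. An alternation over accepted wordings instead of one literal phrase.
expected_result = re.compile(r"(?i)expected result|expected:|you should see")


# 3. Counting declaratively: a non-greedy group repeated n times matches
#    only when "shall" occurs at least n times, so no subprocess is needed.
def at_least_n_shall(n: int) -> re.Pattern:
    return re.compile(r"(?is)(?:.*?\bshall\b){%d}" % n)


spec = "The API shall log.\nThe API shall retry.\nThe UI may warn."

strict_hit = ears_strict.search(criterion) is not None
relaxed_hit = ears_relaxed.search(criterion) is not None
two_shalls = at_least_n_shall(2).search(spec) is not None
three_shalls = at_least_n_shall(3).search(spec) is not None
```

With these inputs the strict pattern misses the title-case criterion while the relaxed one accepts it, and the counted group matches for two occurrences of "shall" but not for three.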

Copilot AI left a comment

Pull request overview

Copilot reviewed 35 out of 35 changed files in this pull request and generated 9 comments.

Comment thread skills/python-pep/evals/evals.json Outdated
Comment thread skills/diataxis-how-to/evals/evals.json
Comment thread skills/diataxis-tutorial/evals/evals.json Outdated
Comment thread skills/diataxis-tutorial/evals/evals.json Outdated
Comment thread skills/diataxis-tutorial/evals/evals.json Outdated
Comment thread skills/diataxis-tutorial/evals/evals.json Outdated
Comment thread skills/diataxis-tutorial/evals/evals.json Outdated
Comment thread skills/diataxis-tutorial/evals/evals.json Outdated
Comment thread skills/diataxis-tutorial/evals/evals.json Outdated
zircote added 3 commits July 1, 2026 11:33
Addresses Copilot review feedback on PR #23: eval 1 drafts a brand-new
PEP, which per PEP lifecycle rules must start in Draft status. The
check previously accepted Accepted/Final/Provisional too, which would
let a clearly wrong lifecycle state pass.

Addresses Copilot review feedback on PR #23: seven "Step N" deterministic
checks used step\s*N without a trailing word boundary, so "Step 1" and
"Step 2" also matched inside "Step 10" and "Step 20". Added \b to all
seven instances.

…start

Addresses Copilot review feedback on PR #23: the check only asserted
"Recommendation: how-to" appeared somewhere in recommendation.md, so a
submission could bury the line later and still pass. Anchored to the
start of the file instead.
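A sketch of the three tightenings in this batch, again using hypothetical patterns in Python's `re` syntax rather than the repository's actual checks. The sample PEP number and file contents are invented for illustration.

```python
import re

# 1. A brand-new PEP must be Draft: accept only that state on the Status line.
status_draft = re.compile(r"(?m)^Status:[ \t]*Draft[ \t]*$")

# 2. Without a trailing \b, "step 1" also matches inside "Step 10".
step1_loose = re.compile(r"(?i)step\s*1")
step1_tight = re.compile(r"(?i)step\s*1\b")

# 3. \A pins the recommendation to the start of the file, so a line
#    buried later no longer satisfies the check.
rec_anywhere = re.compile(r"Recommendation: how-to")
rec_at_start = re.compile(r"\ARecommendation: how-to")

new_pep = "PEP: 9999\nStatus: Draft\nType: Standards Track\n"
wrong_pep = "PEP: 9999\nStatus: Accepted\nType: Standards Track\n"
buried = "Some preamble first.\nRecommendation: how-to\n"

draft_ok = status_draft.search(new_pep) is not None
accepted_ok = status_draft.search(wrong_pep) is not None
loose_false_positive = step1_loose.search("Step 10: deploy") is not None
tight_false_positive = step1_tight.search("Step 10: deploy") is not None
buried_passes_old = rec_anywhere.search(buried) is not None
buried_passes_new = rec_at_start.search(buried) is not None
```

Each pair shows the old pattern accepting an input that the tightened pattern rejects, while correct inputs such as "Step 1: install" or a file that opens with the recommendation still pass.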

Copilot AI left a comment

Pull request overview

Copilot reviewed 35 out of 35 changed files in this pull request and generated 3 comments.

Comment thread skills/diataxis-reference/evals/evals.json
Comment thread skills/diataxis-reference/evals/evals.json Outdated
Comment thread skills/diataxis-reference/evals/evals.json Outdated
zircote added 2 commits July 1, 2026 11:47
Addresses Copilot review feedback on PR #23 (cycle 3):
- diataxis-reference's synopsis check required a fenced code block, but
  the skill only requires a synopsis/usage line in any form. Loosened to
  accept fenced, inline-coded, or plain lines.
- Two "we recommend" / "best practice" negative checks were case-sensitive,
  missing title-case or all-caps phrasing. Made both case-insensitive.
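Both cycle-3 changes can be sketched as follows. The patterns are hypothetical: in particular, treating a line that begins with "usage" or "synopsis" as the synopsis line is an assumption made for this example, not the skill's actual definition.

```python
import re

# Loosened positive check: accept a usage/synopsis line whether it sits
# inside a fence, is wrapped in inline code, or is written as plain text.
synopsis_any = re.compile(r"(?im)^[ \t]*`?(?:usage|synopsis)\b")

# Negative checks: a reference page passes only when the phrase is absent.
opinion_cs = re.compile(r"we recommend|best practice")
opinion_ci = re.compile(r"(?i)we recommend|best practice")

tick3 = "`" * 3  # built programmatically to keep this example easy to quote
fenced = f"## Synopsis\n\n{tick3}\nusage: tool [OPTIONS] FILE\n{tick3}\n"
inline = "`usage: tool [OPTIONS] FILE`\n"
plain = "Usage: tool [OPTIONS] FILE\n"

page = "Best Practice: always pin versions.\n"
missed_by_cs = opinion_cs.search(page) is None
caught_by_ci = opinion_ci.search(page) is not None
```

All three synopsis forms satisfy the loosened check, and the title-case opinion phrase slips past the case-sensitive pattern but is caught by the case-insensitive one.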
zircote marked this pull request as ready for review July 1, 2026 15:50
zircote merged commit e36f38a into main Jul 1, 2026
9 checks passed
zircote deleted the eval-suite-and-skill-improvements branch July 1, 2026 15:53
zircote added a commit that referenced this pull request Jul 1, 2026
Bumps package.json and .claude-plugin/plugin.json to 0.3.1 and adds the CHANGELOG section for PR #23 (eval-suite hardening + autoresearch-verified skill fixes + 3 rounds of Copilot-driven eval corrections).
@zircote zircote mentioned this pull request Jul 1, 2026