Harden eval suites, apply autoresearch-verified skill fixes by zircote · Pull Request #23 · modeled-information-format/mif-docs-plugin

zircote · 2026-07-01T15:12:12Z

Summary

This branch runs two passes over all 23 doc-genre skills in this plugin.

The first pass is an eval-doctor review. Every skill's evals/evals.json was checked against the eval-quality rubric. Expectations that relied only on an LLM judgment call, using language such as "must contain," "must equal," or "must match," were converted into deterministic checks. Prompts were reworded to target a named output file instead of transcript.md, since the deterministic checker cannot resolve that path. Deterministic coverage across the suite rose from zero percent to approximately sixty-three percent.
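As a rough illustration of what such a conversion can look like, here is a minimal sketch of a deterministic check evaluated against a named output file. The `file`, `pattern`, and `must_match` field names and the `run_check` helper are assumptions made for this example; the actual `deterministic_checks` schema and loader are not shown in this PR.

```python
import re
import tempfile
from pathlib import Path


def run_check(check: dict, workdir: Path) -> bool:
    """Evaluate one hypothetical deterministic check against a named output file."""
    target = workdir / check["file"]
    if not target.is_file():
        return False  # a missing artifact fails instead of deferring to a judge
    text = target.read_text(encoding="utf-8")
    found = re.search(check["pattern"], text) is not None
    # must_match=False turns the check into a "must not appear" assertion
    return found == check.get("must_match", True)


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "runbook.md").write_text("# Runbook\n\n## Rollback\n", encoding="utf-8")
    checks = [
        {"file": "runbook.md", "pattern": r"(?m)^## Rollback$"},
        {"file": "runbook.md", "pattern": r"(?i)\bplaceholder\b", "must_match": False},
        {"file": "transcript.md", "pattern": r"."},  # file was never written
    ]
    results = [run_check(c, workdir) for c in checks]
```

The third check fails because the file it names does not exist, which is why a prompt has to name a file the checker can actually open.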

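The second pass runs the autoresearch improvement loop against the newly hardened evals. Ten skills returned a genuine, verified improvement. Of the remaining thirteen, four surfaced real defects whose fixes did not raise the composite score (see "Excluded from this pass"), and the rest were already scoring at or near the ceiling on their own eval set, so no change was applied to any of them.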

Changes by skill

arc42-arch-doc and diataxis-tutorial each carried a self-referential bug: the rule text banning a word (TBD) contained that literal word, so the rule leaked into transcripts and tripped its own check. Both were reworded to state the rule without quoting the banned term.
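The failure mode is easy to reproduce in isolation. The sketch below assumes a banned-token check of the form `\bTBD\b`; both rule wordings are invented for illustration and are not the text in either SKILL.md.

```python
import re

# Hypothetical form of the banned-token check.
BANNED = re.compile(r"\bTBD\b")

# A rule that quotes the token it bans: any transcript echoing the rule fails.
leaky_rule = "Never leave TBD in a section body."
# A reworded rule that describes the token without containing it.
safe_rule = "Never leave the three-letter to-be-determined marker in a section body."


def transcript_passes(transcript: str) -> bool:
    """Pass only when the banned token is absent from the transcript."""
    return BANNED.search(transcript) is None


echo_leaky = f"Following the rule '{leaky_rule}', every section is filled in."
echo_safe = f"Following the rule '{safe_rule}', every section is filled in."
```

Here `transcript_passes(echo_leaky)` is `False` even though the output itself is compliant, while the reworded rule can be echoed safely.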

c4-model-diagram gained guidance on widening the outer fence when a mermaid-fenced document is quoted elsewhere, so backtick nesting does not get escaped into invalid output.
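The widening rule can be sketched as follows: pick an outer fence one backtick longer than the longest backtick run in the quoted document. The `wrap_in_fence` helper and the sample diagram are assumptions for illustration, not code from the skill.

```python
import re

TICKS = "`" * 3  # built programmatically so this example nests cleanly


def wrap_in_fence(document: str, info: str = "markdown") -> str:
    """Quote a document that may itself contain backtick fences."""
    longest = max((len(run) for run in re.findall(r"`+", document)), default=0)
    # The outer fence must be longer than any fence inside, and at least three.
    fence = "`" * max(3, longest + 1)
    return f"{fence}{info}\n{document}\n{fence}"


inner = f'{TICKS}mermaid\nC4Context\n  Person(user, "User")\n{TICKS}'
quoted = wrap_in_fence(inner)
```

Counting every backtick run, not only runs at the start of a line, is more conservative than strictly necessary, but it keeps the helper short and never produces a fence that is too narrow.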

diataxis-how-to regained a rule requiring content to be relocated, not deleted, when correcting a draft that has drifted into another genre.

ears-acceptance-criteria, and by extension feature-spec, now instruct the model to commit to a plausible component name and flag it as an assumption when none is given, rather than deferring the choice to the reader.

mif-frontmatter now surfaces the type enum whenever the L1 floor is discussed, not only while drafting YAML.

mif-validate now states the determinism and lossless-round-trip guarantees as part of its answer, not only as internal reasoning.

python-pep and rust-rfc received clarity fixes: a single-state constraint on the Status field, and a rule that a review must supply corrected text alongside every named gap.

Every change above was checked for markdownlint compliance (this repository caps prose at one hundred characters) and, where a template was touched, for MIF round-trip integrity.

Review

This branch went through arbiter's independent review loop: three rounds, each dispatched to an out-of-session sub-agent with no prior involvement in authoring the change. Round one caught two fragile eval regexes and a dropped L2 coverage check. Round two trimmed three redundant restatements and taught feature-spec a convention its own eval already expected. Round three confirmed that the one remaining item, a duplicated gate-notes.md scenario repeated across eight sibling evals, has no safe surgical fix without introducing a shared-fixture mechanism to the eval loader. It is left as a documented limitation, not treated as a bug.

The loop ended at zero must-fix and zero should-fix findings.

Verification

npm run validate-plugin: eighty-four checks passed, zero errors
npm run lint:md: two hundred seventy-seven files, zero errors
npm run test:hook: five of five tests pass
MIF round-trip verified clean on both touched templates

Excluded from this pass

Four skills, changelog, diataxis-explanation, diataxis-reference, and sre-runbook, surfaced real, independently verified defects during the autoresearch loop. In each case, the gain from fixing the defect was offset by grading noise on an unrelated eval, so the composite score never exceeded baseline under the strict keep-only-if-better rule. These four are candidates for a follow-up eval-doctor pass targeting the specific checks flagged in those runs before the loop is run again.
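A small sketch shows how a real fix can be rejected under that rule. It assumes the composite score is the mean of per-eval scores, which this PR does not state, and the numbers are invented.

```python
from statistics import mean


def keep_change(baseline: list[float], candidate: list[float]) -> bool:
    """Strict rule: keep a change only if the composite score exceeds baseline."""
    return mean(candidate) > mean(baseline)


# Hypothetical per-eval scores for one skill.
baseline = [0.75, 1.0, 0.5]
# The fix lifts the first eval, but grading noise lowers an unrelated one.
candidate = [1.0, 0.75, 0.5]

decision = keep_change(baseline, candidate)
```

Both composites come out equal, so `decision` is `False` and the fix is dropped even though the defect it addresses is real.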

Runs an eval-doctor pass across all 23 doc-genre skills, converting LLM-only expectations into deterministic checks (0% -> ~63% deterministic coverage) and targeting named output files instead of transcript.md. Then runs the autoresearch improvement loop against the hardened evals and applies the 10 improvements it found and verified: self-referential banned-word bugs in arc42-arch-doc and diataxis-tutorial, missing mermaid-fence-nesting guidance in c4-model-diagram, a dropped genre-drift rule in diataxis-how-to, a concrete-naming rule in ears-acceptance-criteria (mirrored into feature-spec), a type-enum surfacing gap in mif-frontmatter, a results-reporting gap in mif-validate, and review/status-clarity fixes in python-pep and rust-rfc. Reviewed via an independent out-of-session loop (3 rounds, 0 must-fix / 0 should-fix remaining); all touched skills verified against validate-plugin, lint:md, test:hook, and MIF round-trip.

Copilot

Pull request overview

This PR hardens the evaluation suites for multiple doc-genre skills by adding deterministic checks (and updating prompts/expected outputs to write to concrete output files), while also applying targeted skill-instruction fixes discovered via an autoresearch loop (e.g., clearer genre boundaries, stronger constraints, and additional authoring rules).

Changes:

- Add deterministic_checks to many skills’ evals/evals.json and update prompts to save artifacts to named files (e.g., runbook.md, spec.md, design.md) for deterministic grading.
- Update several SKILL.md documents with clarified authoring rules (e.g., RFC vs ADR guidance, PEP review expectations, MIF/frontmatter rules, C4 quoting guidance).
- Tighten or expand eval prompts/expectations to reduce subjective grading and increase structural/verifiable coverage.

Reviewed changes

Copilot reviewed 35 out of 35 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| skills/sre-runbook/evals/evals.json | Adds deterministic runbook structure/content checks and updates prompts to write `runbook.md`/section files. |
| skills/rust-rfc/SKILL.md | Clarifies RFC guidance, including “not an RFC” cases and steering to ADRs when post-decision. |
| skills/rust-rfc/evals/evals.json | Adds deterministic section/frontmatter checks and file-targeted outputs for RFC evals. |
| skills/python-pep/SKILL.md | Adds explicit single-state `Status:` rule and strengthens “review must include corrected text” guidance. |
| skills/python-pep/evals/evals.json | Adds deterministic header/section/frontmatter checks and new eval scenarios. |
| skills/prd/evals/evals.json | Adds deterministic PRD structure/frontmatter/EARS checks and more specific prompts. |
| skills/playbook/evals/evals.json | Adds deterministic section/frontmatter checks and expands scenario coverage. |
| skills/mif-validate/SKILL.md | Requires reporting determinism + lossless round-trip properties explicitly in answers. |
| skills/mif-validate/evals/evals.json | Reworks prompts toward concrete artifacts and adds deterministic checks for tool outputs/projections. |
| skills/mif-frontmatter/SKILL.md | Makes `type` enum explicit whenever L1/floors are discussed; tightens placeholder guidance. |
| skills/mif-frontmatter/evals/evals.json | Adds deterministic checks for field presence/absence and date handling across L1–L3 cases. |
| skills/kiro-tasks/evals/evals.json | Adds deterministic checks for checkbox numbering, traceability markers, and frontmatter. |
| skills/kiro-requirements/evals/evals.json | Adds deterministic checks for numbering, EARS forms, unhappy-path criteria, and frontmatter. |
| skills/kiro-design/evals/evals.json | Adds deterministic checks for required sections, requirement citations, and domain specificity. |
| skills/google-design-doc/evals/evals.json | Adds deterministic checks for required sections, frontmatter, and scope/non-goals structure. |
| skills/feature-spec/SKILL.md | Adds edge-case rules (credential validity categories) and explicit-assumption guidance for sparse inputs. |
| skills/feature-spec/evals/evals.json | Adds deterministic section/EARS/frontmatter checks and file-targeted outputs. |
| skills/ears-acceptance-criteria/SKILL.md | Requires choosing a plausible concrete component name when absent and flagging as assumption. |
| skills/ears-acceptance-criteria/evals/evals.json | Adds deterministic checks for templates, placeholder avoidance, and multi-criterion splitting. |
| skills/doc-set-planner/evals/evals.json | Adds deterministic checks for recipe/member naming and linkage language. |
| skills/diataxis-tutorial/templates/bad.md | Refines antipattern commentary to emphasize named explanation pointers. |
| skills/diataxis-tutorial/SKILL.md | Strengthens tutorial rules around “named explanation” pointers and single happy-path constraints. |
| skills/diataxis-tutorial/evals/evals.json | Adds deterministic checks for prerequisites/steps/frontmatter and mode-mixing detection. |
| skills/diataxis-reference/evals/evals.json | Adds deterministic checks for synopsis/tables/no-steps/no-opinion constraints. |
| skills/diataxis-how-to/SKILL.md | Adds explicit rationale-vs-action clause guidance and “relocate, don’t delete” drift-fix procedure. |
| skills/diataxis-how-to/evals/evals.json | Adds deterministic checks for file output, headings/steps, L2 ceiling fields, and anti-placeholders. |
| skills/diataxis-explanation/evals/evals.json | Adds deterministic checks to prevent how-to/reference drift and require trade-off language. |
| skills/changelog/evals/evals.json | Adds deterministic checks for Keep-a-Changelog structure, categorization, and MIF gating language. |
| skills/c4-model-diagram/SKILL.md | Adds guidance for quoting mermaid-fenced docs without escaping (outer-fence widening). |
| skills/c4-model-diagram/evals/evals.json | Adds deterministic checks for C4 mermaid blocks, actors/boundaries, and frontmatter. |
| skills/arc42-arch-doc/templates/bad.md | Removes self-referential placeholder token usage from the bad exemplar text. |
| skills/arc42-arch-doc/SKILL.md | Clarifies section numbering conventions and avoids literal placeholder token mention in rules. |
| skills/arc42-arch-doc/evals/evals.json | Adds deterministic checks for section order/presence, no placeholders, and frontmatter type. |
| skills/ai-architecture-doc/evals/evals.json | Adds deterministic checks for required composite sections and EARS-style NFR presence. |
| skills/adr/evals/evals.json | Adds deterministic checks for MADR structure elements and lifecycle/status constraints. |

…asing

Addresses Copilot review feedback on PR #23:

- prd: EARS regex checks required uppercase WHEN/IF/WHILE/SHALL, failing correct title-case output. Added the (?i) flag to match this file's existing convention.
- sre-runbook: the "expected result" check required that literal phrase, failing compliant runbooks using "Expected:", "you should see", etc. Broadened to a regex alternation covering the wording SKILL.md actually permits.
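Both fixes in this commit can be sketched with Python's `re` module. The patterns below are approximations written for this example; the exact expressions in the two evals.json files are not quoted in the PR.

```python
import re

# Case-sensitive form that rejects title-case EARS keywords.
strict_ears = re.compile(r"\b(WHEN|IF|WHILE)\b.*\bSHALL\b")
# The same idea with the inline (?i) flag.
relaxed_ears = re.compile(r"(?i)\b(when|if|while)\b.*\bshall\b")

# Alternation over several accepted ways of stating the expected outcome.
expected_result = re.compile(r"(?i)expected result|expected:|you should see")

criterion = "When the token expires, the gateway shall return 401."
step = "Run the health check. You should see `status: ok`."
```

The strict pattern finds nothing in `criterion`, while the relaxed one matches; `expected_result` accepts `step` as well as a line such as "Expected: 200 OK".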

Addresses Copilot review feedback on PR #23: the two WHEN/IF/WHILE/SHALL regexes required all-caps, failing correct title-case EARS output. Added the (?i) flag, consistent with the file's other checks.

Addresses Copilot review feedback on PR #23: two deterministic checks shelled out to python3 just to count occurrences of "shall", making the eval dependent on the runner environment and expanding the attack surface. Replaced both with a repeated non-greedy regex group that asserts the same count declaratively.
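One way to assert an occurrence count without a subprocess is a repeated non-greedy group, sketched below. The `at_least_n_shall` helper and the sample text are illustrative assumptions; the actual pattern and the required count in the eval are not quoted in the PR.

```python
import re


def at_least_n_shall(n: int) -> re.Pattern:
    """Build a pattern that matches only if 'shall' occurs at least n times."""
    # (?s) lets .*? cross newlines; each repetition consumes up to the next 'shall'.
    return re.compile(rf"(?is)\A(?:.*?\bshall\b){{{n}}}")


text = (
    "The service shall log the error.\n"
    "The client SHALL retry once.\n"
    "The UI shall show a banner."
)

has_three = at_least_n_shall(3).search(text) is not None
has_four = at_least_n_shall(4).search(text) is not None
```

This form asserts a minimum, not an exact count; an exact count would need a trailing negative lookahead such as `(?!.*\bshall\b)`.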

Copilot

Pull request overview

Copilot reviewed 35 out of 35 changed files in this pull request and generated 9 comments.

Addresses Copilot review feedback on PR #23: eval 1 drafts a brand-new PEP, which per PEP lifecycle rules must start in Draft status. The check previously accepted Accepted/Final/Provisional too, which would let a clearly wrong lifecycle state pass.
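The difference between the two checks can be sketched as follows. The patterns and the sample headers (including the PEP number) are assumptions for illustration, not the expressions used in the eval.

```python
import re

# Permissive form: any of several lifecycle states passes.
any_status = re.compile(r"(?m)^Status:[ \t]*(Draft|Accepted|Final|Provisional)[ \t]*$")
# Tightened form: a newly drafted PEP must be in Draft.
draft_only = re.compile(r"(?m)^Status:[ \t]*Draft[ \t]*$")

new_pep_wrong = "PEP: 9999\nTitle: Example\nStatus: Accepted\nType: Standards Track\n"
new_pep_right = "PEP: 9999\nTitle: Example\nStatus: Draft\nType: Standards Track\n"
```

The permissive pattern accepts `new_pep_wrong`; the tightened one rejects it and accepts only `new_pep_right`.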

Addresses Copilot review feedback on PR #23: seven "Step N" deterministic checks used step\s*N without a trailing word boundary, so "Step 1" and "Step 2" also matched inside "Step 10" and "Step 20". Added \b to all seven instances.
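The false positive is straightforward to demonstrate. The `(?i)` flag in the patterns below is an assumption; the commit message only gives the `step\s*N` core.

```python
import re

loose = re.compile(r"(?i)step\s*1")    # also matches the start of "Step 10"
tight = re.compile(r"(?i)step\s*1\b")  # requires the number to end after "1"

only_step_ten = "Step 10: restart the service."
real_step_one = "Step 1: stop the service."
```

`loose` matches inside `only_step_ten`, so a document with no Step 1 could still pass; `tight` matches only `real_step_one`.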

…start

Addresses Copilot review feedback on PR #23: the check only asserted "Recommendation: how-to" appeared somewhere in recommendation.md, so a submission could bury the line later and still pass. Anchored to the start of the file instead.
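A sketch of the anchoring, assuming the check is a plain regex search over the file contents. Whether the real check tolerates leading whitespace or a heading marker is not stated, so this version allows neither.

```python
import re

anywhere = re.compile(r"Recommendation: how-to")
# \A matches only at the very start of the text, regardless of the MULTILINE flag.
at_start = re.compile(r"\ARecommendation: how-to")

buried = "Some preamble that hedges.\n\nRecommendation: how-to\n"
leading = "Recommendation: how-to\n\nRationale follows.\n"
```

`anywhere` accepts both files, while `at_start` accepts only `leading`.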

Copilot

Pull request overview

Copilot reviewed 35 out of 35 changed files in this pull request and generated 3 comments.

Addresses Copilot review feedback on PR #23 (cycle 3):

- diataxis-reference's synopsis check required a fenced code block, but the skill only requires a synopsis/usage line in any form. Loosened to accept fenced, inline-coded, or plain lines.
- Two "we recommend" / "best practice" negative checks were case-sensitive, missing title-case or all-caps phrasing. Made both case-insensitive.

…improvements

Bumps package.json and .claude-plugin/plugin.json to 0.3.1 and adds the CHANGELOG section for PR #23 (eval-suite hardening + autoresearch-verified skill fixes + 3 rounds of Copilot-driven eval corrections).

zircote requested a review from Copilot July 1, 2026 15:12

Copilot started reviewing on behalf of zircote July 1, 2026 15:13

Copilot AI reviewed Jul 1, 2026

zircote added 3 commits July 1, 2026 11:21

fix(evals): case-insensitive EARS check in ai-architecture-doc

cd54002

zircote requested a review from Copilot July 1, 2026 15:26

Copilot started reviewing on behalf of zircote July 1, 2026 15:26

Copilot AI reviewed Jul 1, 2026

zircote added 3 commits July 1, 2026 11:33

zircote requested a review from Copilot July 1, 2026 15:39

Copilot started reviewing on behalf of zircote July 1, 2026 15:40

Copilot AI reviewed Jul 1, 2026

Comment thread skills/diataxis-reference/evals/evals.json

Comment thread skills/diataxis-reference/evals/evals.json Outdated

Comment thread skills/diataxis-reference/evals/evals.json Outdated

zircote added 2 commits July 1, 2026 11:47

Merge remote-tracking branch 'origin/main' into eval-suite-and-skill-…

c122e8e

…improvements

zircote marked this pull request as ready for review July 1, 2026 15:50

zircote merged commit e36f38a into main Jul 1, 2026
9 checks passed

zircote deleted the eval-suite-and-skill-improvements branch July 1, 2026 15:53

zircote mentioned this pull request Jul 1, 2026

chore(release): 0.3.1 #24

Merged

2 participants