Harden eval suites, apply autoresearch-verified skill fixes #23
Merged
Runs an eval-doctor pass across all 23 doc-genre skills, converting LLM-only expectations into deterministic checks (0% -> ~63% deterministic coverage) and targeting named output files instead of transcript.md. Then runs the autoresearch improvement loop against the hardened evals and applies the 10 improvements it found and verified: self-referential banned-word bugs in arc42-arch-doc and diataxis-tutorial, missing mermaid-fence-nesting guidance in c4-model-diagram, a dropped genre-drift rule in diataxis-how-to, a concrete-naming rule in ears-acceptance-criteria (mirrored into feature-spec), a type-enum surfacing gap in mif-frontmatter, a results-reporting gap in mif-validate, and review/status-clarity fixes in python-pep and rust-rfc. Reviewed via an independent out-of-session loop (3 rounds, 0 must-fix / 0 should-fix remaining); all touched skills verified against validate-plugin, lint:md, test:hook, and MIF round-trip.
Pull request overview
This PR hardens the evaluation suites for multiple doc-genre skills by adding deterministic checks (and updating prompts/expected outputs to write to concrete output files), while also applying targeted skill-instruction fixes discovered via an autoresearch loop (e.g., clearer genre boundaries, stronger constraints, and additional authoring rules).
Changes:
- Add `deterministic_checks` to many skills’ `evals/evals.json` and update prompts to save artifacts to named files (e.g., `runbook.md`, `spec.md`, `design.md`) for deterministic grading.
- Update several `SKILL.md` documents with clarified authoring rules (e.g., RFC vs ADR guidance, PEP review expectations, MIF/frontmatter rules, C4 quoting guidance).
- Tighten or expand eval prompts/expectations to reduce subjective grading and increase structural/verifiable coverage.
Reviewed changes
Copilot reviewed 35 out of 35 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| skills/sre-runbook/evals/evals.json | Adds deterministic runbook structure/content checks and updates prompts to write runbook.md/section files. |
| skills/rust-rfc/SKILL.md | Clarifies RFC guidance, including “not an RFC” cases and steering to ADRs when post-decision. |
| skills/rust-rfc/evals/evals.json | Adds deterministic section/frontmatter checks and file-targeted outputs for RFC evals. |
| skills/python-pep/SKILL.md | Adds explicit single-state Status: rule and strengthens “review must include corrected text” guidance. |
| skills/python-pep/evals/evals.json | Adds deterministic header/section/frontmatter checks and new eval scenarios. |
| skills/prd/evals/evals.json | Adds deterministic PRD structure/frontmatter/EARS checks and more specific prompts. |
| skills/playbook/evals/evals.json | Adds deterministic section/frontmatter checks and expands scenario coverage. |
| skills/mif-validate/SKILL.md | Requires reporting determinism + lossless round-trip properties explicitly in answers. |
| skills/mif-validate/evals/evals.json | Reworks prompts toward concrete artifacts and adds deterministic checks for tool outputs/projections. |
| skills/mif-frontmatter/SKILL.md | Makes type enum explicit whenever L1/floors are discussed; tightens placeholder guidance. |
| skills/mif-frontmatter/evals/evals.json | Adds deterministic checks for field presence/absence and date handling across L1–L3 cases. |
| skills/kiro-tasks/evals/evals.json | Adds deterministic checks for checkbox numbering, traceability markers, and frontmatter. |
| skills/kiro-requirements/evals/evals.json | Adds deterministic checks for numbering, EARS forms, unhappy-path criteria, and frontmatter. |
| skills/kiro-design/evals/evals.json | Adds deterministic checks for required sections, requirement citations, and domain specificity. |
| skills/google-design-doc/evals/evals.json | Adds deterministic checks for required sections, frontmatter, and scope/non-goals structure. |
| skills/feature-spec/SKILL.md | Adds edge-case rules (credential validity categories) and explicit-assumption guidance for sparse inputs. |
| skills/feature-spec/evals/evals.json | Adds deterministic section/EARS/frontmatter checks and file-targeted outputs. |
| skills/ears-acceptance-criteria/SKILL.md | Requires choosing a plausible concrete component name when absent and flagging as assumption. |
| skills/ears-acceptance-criteria/evals/evals.json | Adds deterministic checks for templates, placeholder avoidance, and multi-criterion splitting. |
| skills/doc-set-planner/evals/evals.json | Adds deterministic checks for recipe/member naming and linkage language. |
| skills/diataxis-tutorial/templates/bad.md | Refines antipattern commentary to emphasize named explanation pointers. |
| skills/diataxis-tutorial/SKILL.md | Strengthens tutorial rules around “named explanation” pointers and single happy-path constraints. |
| skills/diataxis-tutorial/evals/evals.json | Adds deterministic checks for prerequisites/steps/frontmatter and mode-mixing detection. |
| skills/diataxis-reference/evals/evals.json | Adds deterministic checks for synopsis/tables/no-steps/no-opinion constraints. |
| skills/diataxis-how-to/SKILL.md | Adds explicit rationale-vs-action clause guidance and “relocate, don’t delete” drift-fix procedure. |
| skills/diataxis-how-to/evals/evals.json | Adds deterministic checks for file output, headings/steps, L2 ceiling fields, and anti-placeholders. |
| skills/diataxis-explanation/evals/evals.json | Adds deterministic checks to prevent how-to/reference drift and require trade-off language. |
| skills/changelog/evals/evals.json | Adds deterministic checks for Keep-a-Changelog structure, categorization, and MIF gating language. |
| skills/c4-model-diagram/SKILL.md | Adds guidance for quoting mermaid-fenced docs without escaping (outer-fence widening). |
| skills/c4-model-diagram/evals/evals.json | Adds deterministic checks for C4 mermaid blocks, actors/boundaries, and frontmatter. |
| skills/arc42-arch-doc/templates/bad.md | Removes self-referential placeholder token usage from the bad exemplar text. |
| skills/arc42-arch-doc/SKILL.md | Clarifies section numbering conventions and avoids literal placeholder token mention in rules. |
| skills/arc42-arch-doc/evals/evals.json | Adds deterministic checks for section order/presence, no placeholders, and frontmatter type. |
| skills/ai-architecture-doc/evals/evals.json | Adds deterministic checks for required composite sections and EARS-style NFR presence. |
| skills/adr/evals/evals.json | Adds deterministic checks for MADR structure elements and lifecycle/status constraints. |
…asing
Addresses Copilot review feedback on PR #23:
- prd: EARS regex checks required uppercase WHEN/IF/WHILE/SHALL, failing correct title-case output. Added the (?i) flag to match this file's existing convention.
- sre-runbook: the "expected result" check required that literal phrase, failing compliant runbooks using "Expected:", "you should see", etc. Broadened to a regex alternation covering the wording SKILL.md actually permits.
Addresses Copilot review feedback on PR #23: the two WHEN/IF/WHILE/SHALL regexes required all-caps, failing correct title-case EARS output. Added the (?i) flag, consistent with the file's other checks.
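A minimal sketch of the casing failure described above, assuming Python's `re` semantics. The two patterns and the sample criterion are hypothetical stand-ins; the real expressions live in the skill's `evals/evals.json` and are not quoted in this thread.

```python
import re

# Hypothetical patterns for illustration only, not copied from evals.json.
STRICT = re.compile(r"\bWHEN\b.*\bSHALL\b")        # all-caps keywords required
RELAXED = re.compile(r"(?i)\bWHEN\b.*\bSHALL\b")   # same pattern with the (?i) flag

# A correctly formed EARS criterion written with title-case keywords.
criterion = "When the user submits the form, the system shall validate the email field."

strict_hit = STRICT.search(criterion) is not None
relaxed_hit = RELAXED.search(criterion) is not None

print(strict_hit, relaxed_hit)  # prints: False True
```

The strict pattern rejects a valid criterion purely on casing, which is the false failure the commit removes.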
Addresses Copilot review feedback on PR #23: two deterministic checks shelled out to python3 just to count occurrences of "shall", making the eval dependent on the runner environment and expanding the attack surface. Replaced both with a repeated non-greedy regex group that asserts the same count declaratively.
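One way a repeated non-greedy group can stand in for an occurrence count, sketched with Python's `re`. The threshold of three and both sample documents are assumptions made for illustration; the commit does not state the actual count or pattern. This form asserts "at least N" occurrences, so an exact count would need an additional negative check.

```python
import re

# Hypothetical threshold (3) and pattern; the real check is not quoted in the PR.
# (?i) ignores case, (?s) lets "." cross newlines so the group can span lines.
AT_LEAST_THREE_SHALL = re.compile(r"(?is)(?:.*?\bshall\b){3}")

doc_with_three = (
    "1. When X happens, the system shall do A.\n"
    "2. While Y holds, the system shall do B.\n"
    "3. If Z fails, then the system shall do C.\n"
)
doc_with_two = "The system shall do A. The system shall do B."

three_ok = AT_LEAST_THREE_SHALL.search(doc_with_three) is not None
two_ok = AT_LEAST_THREE_SHALL.search(doc_with_two) is not None

print(three_ok, two_ok)  # prints: True False
```

Because the count is carried by the pattern itself, the check needs no subprocess and no interpreter on the runner.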
Addresses Copilot review feedback on PR #23: eval 1 drafts a brand-new PEP, which per PEP lifecycle rules must start in Draft status. The check previously accepted Accepted/Final/Provisional too, which would let a clearly wrong lifecycle state pass.
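A sketch of the narrowing, assuming the check is a regex over the PEP header block. Both patterns and the placeholder PEP number are hypothetical; only the accepted status values come from the commit message.

```python
import re

# Hypothetical before/after patterns; the real check is not quoted in the PR.
PERMISSIVE = re.compile(r"(?m)^Status:\s*(Draft|Accepted|Final|Provisional)\s*$")
DRAFT_ONLY = re.compile(r"(?m)^Status:\s*Draft\s*$")

# Header blocks for a brand-new PEP; 9999 is a placeholder number.
wrong_state = "PEP: 9999\nTitle: Example\nStatus: Accepted\nType: Standards Track\n"
right_state = "PEP: 9999\nTitle: Example\nStatus: Draft\nType: Standards Track\n"

permissive_passes_wrong = PERMISSIVE.search(wrong_state) is not None
draft_only_passes_wrong = DRAFT_ONLY.search(wrong_state) is not None
draft_only_passes_right = DRAFT_ONLY.search(right_state) is not None

print(permissive_passes_wrong, draft_only_passes_wrong, draft_only_passes_right)
# prints: True False True
```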
Addresses Copilot review feedback on PR #23: seven "Step N" deterministic checks used step\s*N without a trailing word boundary, so "Step 1" and "Step 2" also matched inside "Step 10" and "Step 20". Added \b to all seven instances.
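The false match is easy to reproduce. The sketch below assumes Python's `re` and a case-insensitive flag; the sample step lines are hypothetical, while the `step\s*N` shape and the `\b` fix come from the commit message.

```python
import re

LOOSE = re.compile(r"(?i)step\s*1")    # no trailing boundary
TIGHT = re.compile(r"(?i)step\s*1\b")  # boundary added after the digit

# A document that contains Step 10 but no Step 1.
only_step_ten = "Step 10: Restart the service."
real_step_one = "Step 1: Check the dashboard."

loose_false_positive = LOOSE.search(only_step_ten) is not None
tight_rejects_ten = TIGHT.search(only_step_ten) is None
tight_accepts_one = TIGHT.search(real_step_one) is not None

print(loose_false_positive, tight_rejects_ten, tight_accepts_one)
# prints: True True True
```

`\b` fails between `1` and `0` because both are word characters, so "Step 10" no longer satisfies a check meant for "Step 1".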
…start
Addresses Copilot review feedback on PR #23: the check only asserted "Recommendation: how-to" appeared somewhere in recommendation.md, so a submission could bury the line later and still pass. Anchored to the start of the file instead.
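A sketch of the difference between an unanchored and a start-anchored check, assuming Python's `re`, where `\A` matches only at the very beginning of the string. The file contents are hypothetical, and whether the real check tolerates leading whitespace or frontmatter is not stated in the commit.

```python
import re

ANYWHERE = re.compile(r"Recommendation: how-to")
AT_START = re.compile(r"\ARecommendation: how-to")  # \A anchors to start of file

# Hypothetical submissions for recommendation.md.
buried = "Recommendation: tutorial\n\nOn reflection:\nRecommendation: how-to\n"
leading = "Recommendation: how-to\n\nRationale: the reader has a concrete goal.\n"

anywhere_passes_buried = ANYWHERE.search(buried) is not None
anchored_passes_buried = AT_START.search(buried) is not None
anchored_passes_leading = AT_START.search(leading) is not None

print(anywhere_passes_buried, anchored_passes_buried, anchored_passes_leading)
# prints: True False True
```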
Addresses Copilot review feedback on PR #23 (cycle 3):
- diataxis-reference's synopsis check required a fenced code block, but the skill only requires a synopsis/usage line in any form. Loosened to accept fenced, inline-coded, or plain lines.
- Two "we recommend" / "best practice" negative checks were case-sensitive, missing title-case or all-caps phrasing. Made both case-insensitive.
zircote added a commit that referenced this pull request on Jul 1, 2026
Bumps package.json and .claude-plugin/plugin.json to 0.3.1 and adds the CHANGELOG section for PR #23 (eval-suite hardening + autoresearch-verified skill fixes + 3 rounds of Copilot-driven eval corrections).
Summary
This branch runs two passes over all 23 doc-genre skills in this plugin.
The first pass is an eval-doctor review. Every skill's `evals/evals.json` was checked against the eval-quality rubric. Expectations that relied only on an LLM judgment call, using language such as "must contain," "must equal," or "must match," were converted into deterministic checks. Prompts were reworded to target a named output file instead of `transcript.md`, since the deterministic checker cannot resolve that path. Deterministic coverage across the suite rose from zero percent to approximately sixty-three percent.
The second pass runs the autoresearch improvement loop against the newly hardened evals. Ten skills returned a genuine, verified improvement. The remaining skills were already scoring at or near the ceiling on their own eval set, so no change was applied.
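To make the first-pass conversion concrete, here is a rough sketch of what a deterministic check against a named output file can look like. The check shape (`file`, `must_match`, `must_not_match`), the `run_checks` helper, and the sample patterns are all hypothetical; the actual `deterministic_checks` schema and loader are not shown in this description.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical check shape; not the plugin's real deterministic_checks schema.
checks = [
    {"file": "runbook.md", "must_match": r"(?m)^## Rollback\b"},
    {"file": "runbook.md", "must_not_match": r"(?i)\bTBD\b"},
]

def run_checks(workdir: Path, checks: list[dict]) -> list[bool]:
    """Evaluate each check against its named file; a missing file fails the check."""
    results = []
    for check in checks:
        path = workdir / check["file"]
        if not path.exists():
            results.append(False)
            continue
        text = path.read_text(encoding="utf-8")
        if "must_match" in check:
            results.append(re.search(check["must_match"], text) is not None)
        else:
            results.append(re.search(check["must_not_match"], text) is None)
    return results

with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    sample = "# Runbook\n\n## Rollback\n\n1. Revert the deploy.\n"
    (workdir / "runbook.md").write_text(sample, encoding="utf-8")
    results = run_checks(workdir, checks)

print(results)  # prints: [True, True]
```

A check of this kind gives the same verdict on every run, which is what an LLM-graded "must contain" expectation cannot guarantee.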
Changes by skill
- `arc42-arch-doc` and `diataxis-tutorial` each carried a self-referential bug: the rule text banning a word (TBD) contained that literal word, so the rule leaked into transcripts and tripped its own check. Both were reworded to state the rule without quoting the banned term.
- `c4-model-diagram` gained guidance on widening the outer fence when a mermaid-fenced document is quoted elsewhere, so backtick nesting does not get escaped into invalid output.
- `diataxis-how-to` regained a rule requiring content to be relocated, not deleted, when correcting a draft that has drifted into another genre.
- `ears-acceptance-criteria`, and by extension `feature-spec`, now instruct the model to commit to a plausible component name and flag it as an assumption when none is given, rather than deferring the choice to the reader.
- `mif-frontmatter` now surfaces the `type` enum whenever the L1 floor is discussed, not only while drafting YAML.
- `mif-validate` now states the determinism and lossless-round-trip guarantees as part of its answer, not only as internal reasoning.
- `python-pep` and `rust-rfc` received clarity fixes: a single-state constraint on the `Status` field, and a rule that a review must supply corrected text alongside every named gap.
Every change above was checked for markdownlint compliance (this repository caps prose at one hundred characters) and, where a template was touched, for MIF round-trip integrity.
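The outer-fence widening mentioned for `c4-model-diagram` can be sketched as follows. This relies on the CommonMark rule that a closing fence must be at least as long as the opening one, so an inner three-backtick fence cannot close a four-backtick outer fence. The `outer_fence` helper and the sample diagram are hypothetical, and the exact wording in `SKILL.md` may differ.

```python
import re

TICK = "`"

def outer_fence(inner: str) -> str:
    """Return a backtick fence one longer than the longest backtick run in the text."""
    runs = re.findall(r"`+", inner)
    longest = max((len(run) for run in runs), default=0)
    # CommonMark needs at least three backticks to open a fence.
    return TICK * max(3, longest + 1)

# Hypothetical inner document that carries its own three-backtick mermaid fence.
inner_fence = TICK * 3
inner_doc = (
    "# Context diagram\n\n"
    f"{inner_fence}mermaid\n"
    "C4Context\n"
    '  Person(user, "User")\n'
    f"{inner_fence}\n"
)

fence = outer_fence(inner_doc)
quoted = f"{fence}markdown\n{inner_doc}{fence}\n"

print(len(fence))  # prints: 4
```

Widening the wrapper leaves the quoted document byte-for-byte intact, whereas escaping the inner backticks would change what the reader copies out.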
Review
This branch went through arbiter's independent review loop: three rounds, each dispatched to an out-of-session sub-agent with no prior involvement in authoring the change. Round one caught two fragile eval regexes and a dropped L2 coverage check. Round two trimmed three redundant restatements and taught `feature-spec` a convention its own eval already expected. Round three confirmed that the one remaining item, a duplicated `gate-notes.md` scenario repeated across eight sibling evals, has no safe surgical fix without introducing a shared-fixture mechanism to the eval loader. It is left as a documented limitation, not treated as a bug.
The loop ended at zero must-fix and zero should-fix findings.
Verification
- `npm run validate-plugin`: eighty-four checks passed, zero errors
- `npm run lint:md`: two hundred seventy-seven files, zero errors
- `npm run test:hook`: five of five tests pass
Excluded from this pass
Four skills, `changelog`, `diataxis-explanation`, `diataxis-reference`, and `sre-runbook`, surfaced real, independently verified defects during the autoresearch loop. In each case, the gain from fixing the defect was offset by grading noise on an unrelated eval, so the composite score never exceeded baseline under the strict keep-only-if-better rule. These four are candidates for a follow-up eval-doctor pass targeting the specific checks flagged in those runs before the loop is run again.