eval(workbench): web-design-guidelines eval suite + proposed upstream changes by Zhaiyuqing2003 · Pull Request #45 · fastxyz/skill-optimizer

Zhaiyuqing2003 · 2026-05-07T16:06:04Z

First skill from the prioritized top-N list. Adds an eval suite for vercel-labs/agent-skills/web-design-guidelines (~298K installs) and a before/after proposal of upstream changes for the team to review before we PR.

Stacked on fix/workbench-linux-docker-permissions (the Linux Docker fix is required to run the suite).

Eval suite — `examples/workbench/web-design-guidelines/`

4 cases × 3 mid-tier models × 3 trials = 36 trials, with 20 seeded violations across 6 rule families (a11y / focus / forms / typography / animation / images).

```bash
cd examples/workbench/web-design-guidelines
export OPENROUTER_API_KEY=sk-or-...
npx tsx ../../../src/cli.ts run-suite ./suite.yml --trials 3
```

Proposed upstream changes — `proposed-upstream-changes/`

Before/after snapshots of the two upstream files we'd PR back. Not published yet: this is our first contribution to vercel-labs, so it is pending team coordination.

| Upstream | Change |
| --- | --- |
| `vercel-labs/agent-skills` SKILL.md (39→54 lines) | adds explicit two-pass workflow |
| `vercel-labs/web-interface-guidelines` command.md (180→304 lines) | adds per-element Pass 2 checklist + 5 BAD/GOOD examples for most-missed rules |

Eval evidence

| Model | Before | After |
| --- | --- | --- |
| claude-sonnet-4.6 | 10/12 (83%) | 12/12 (100%) |
| gpt-5-mini | 9/12 (75%) | 10/12 (83%) |
| gemini-2.5-pro | 7/12 (58%) | 9/12 (75%) |
| Total | 26/36 (72%) | 31/36 (86%) |

Two rules eliminated entirely (no-empty-state-handling, input-missing-autocomplete); two reduced (submit-disabled 3→1, above-fold-priority 2→1).
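For reviewers who want to sanity-check the table, here is a minimal sketch that re-derives the totals and percentages from the per-model pass counts. The counts are copied from the table above; the `Row` type and `rate` helper are names made up for this illustration and are not part of the workbench CLI.

```typescript
// Per-model pass counts copied from the "Eval evidence" table.
// Row and rate() are illustrative names, not part of the workbench CLI.
type Row = { model: string; before: number; after: number };

const TRIALS_PER_MODEL = 4 * 3; // 4 cases × 3 trials

const rows: Row[] = [
  { model: "claude-sonnet-4.6", before: 10, after: 12 },
  { model: "gpt-5-mini", before: 9, after: 10 },
  { model: "gemini-2.5-pro", before: 7, after: 9 },
];

// Format a pass count as "n/total (p%)", rounded to the nearest percent.
function rate(passed: number, total: number): string {
  return `${passed}/${total} (${Math.round((passed / total) * 100)}%)`;
}

const totalTrials = rows.length * TRIALS_PER_MODEL;
const totalBefore = rows.reduce((sum, r) => sum + r.before, 0);
const totalAfter = rows.reduce((sum, r) => sum + r.after, 0);

for (const r of rows) {
  console.log(`${r.model}\t${rate(r.before, TRIALS_PER_MODEL)}\t${rate(r.after, TRIALS_PER_MODEL)}`);
}
console.log(`Total\t${rate(totalBefore, totalTrials)}\t${rate(totalAfter, totalTrials)}`);
```

The per-model rows sum to the reported totals (26 and 31 out of 36), so the table is internally consistent.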

Eval case for vercel-labs/agent-skills/web-design-guidelines (~298K installs).

Four cases cover six rule families:
- review-product-card: Accessibility + Focus States (5 violations)
- review-checkout-form: Forms (5 violations)
- review-loading-screen: Typography + Content Handling (5 violations)
- review-hero-section: Animation + Images + Performance (5 violations)

Each case feeds the agent a focused TSX file with seeded violations and grades whether the agent's findings cover them. The workflow mirrors real usage (one file at a time, not a kitchen sink) and avoids overwhelming smaller models.

The vendored references/ contains the skill plus a snapshot of the upstream rules document with our proposed additions baked in: the two-pass workflow, the per-element checklist, and the BAD/GOOD examples.

The suite is configured for a 3-provider mid-tier matrix (sonnet, gpt-5-mini, gemini-2.5-pro).

Baseline (upstream skill): 26/36 (72%) across 36 trials.
With proposed additions: 31/36 (86%); sonnet 100%, gpt 83%, gemini 75%.

Reproduce: `cd examples/workbench/web-design-guidelines && npx tsx ../../../src/cli.ts run-suite ./suite.yml --trials 3`

Before/after snapshots of the two upstream files we'd PR back to Vercel:
- vercel-labs/agent-skills skills/web-design-guidelines/SKILL.md
- vercel-labs/web-interface-guidelines command.md

Nothing here is published yet; the team needs to coordinate the PR since this is our first upstream contribution to vercel-labs.

Proposed changes are purely additive:
- SKILL.md gains an explicit two-pass workflow (visible patterns → per-element absence checklist). WebFetch behavior unchanged.
- command.md gains a "Per-element review (Pass 2 checklist)" section and a "Common-miss examples" section with five BAD/GOOD code blocks for the rules our eval shows are most often overlooked. Existing rule sections are left intact.

Eval evidence (4 cases × 3 mid-tier models × 3 trials = 36 trials):
- before: 26/36 (72%); sonnet 83%, gpt-5-mini 75%, gemini 58%
- after: 31/36 (86%); sonnet 100%, gpt-5-mini 83%, gemini 75%

Two rules eliminated entirely (no-empty-state-handling, input-missing-autocomplete); two more reduced (submit-disabled 3→1, above-fold-priority 2→1).
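To make the two-pass idea concrete, here is a rough sketch of it as code. This is not how the skill is implemented: the skill is a prompt that an agent follows, and everything below (the `El` shape, the checklist table, and the rule IDs other than `input-missing-autocomplete`) is hypothetical. It only illustrates why absence-type rules need a separate per-element pass: Pass 1 can match what is visibly in the markup, while Pass 2 has to walk each element and ask what is missing.

```typescript
// Simplified element model (hypothetical); a real review reads TSX source.
type El = { tag: string; attrs: Record<string, string> };
type Finding = { rule: string; tag: string };

// Pass 1: flag anti-patterns that are visibly present in the markup.
function passOne(els: El[]): Finding[] {
  const findings: Finding[] = [];
  for (const el of els) {
    const cls = el.attrs["className"] ?? "";
    // Hypothetical rule ID: `focus:` styling where `focus-visible:` is expected.
    if (/(^|\s)focus:/.test(cls)) {
      findings.push({ rule: "focus-not-focus-visible", tag: el.tag });
    }
  }
  return findings;
}

// Pass 2: per-element checklist of attributes whose absence is the violation.
const requiredAttrs: Record<string, { attr: string; rule: string }[]> = {
  input: [{ attr: "autoComplete", rule: "input-missing-autocomplete" }],
  button: [{ attr: "type", rule: "button-missing-type" }], // hypothetical rule ID
};

function passTwo(els: El[]): Finding[] {
  const findings: Finding[] = [];
  for (const el of els) {
    for (const req of requiredAttrs[el.tag] ?? []) {
      if (!(req.attr in el.attrs)) {
        findings.push({ rule: req.rule, tag: el.tag });
      }
    }
  }
  return findings;
}

// Toy input: an input without autoComplete, a button without type.
const els: El[] = [
  { tag: "input", attrs: { name: "email", type: "email" } },
  { tag: "button", attrs: { className: "focus:ring-2" } },
];

const findings = [...passOne(els), ...passTwo(els)];
console.log(findings.map((f) => f.rule));
```

On the toy input, Pass 1 reports only the `focus:` class, and the two missing-attribute findings surface only in Pass 2.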

Adds 5 new cases covering the 9 untested upstream rule sections:
- review-data-table: Performance + Typography + Content & Copy
- review-confirm-dialog: Touch & Interaction + Safe Areas + Hover
- review-search-page: Navigation & State + Locale & i18n
- review-theme-toggle: Hydration Safety + Dark Mode + Focus
- review-blog-post: Heading hierarchy + Aria + Focus + Content & Copy

Also extends `command.md` (vendored + upstream-proposal) with:
- Per-element Pass 2 entries for modal/dialog, native `<select>`, headings, brand names, and any `<button>` (hover state, focus-visible, type=button)
- BAD/GOOD examples for focus-visible vs focus and `translate="no"`

9-case matrix (9 cases × 3 mid-tier models × 3 trials = 81 trials):
- Strict pass: 42/81 (52%)
- Rule coverage: 334/405 (82%), the load-bearing metric

Per-case coverage: product-card 100%, hero-section 100%, checkout-form 98%, blog-post 87%, loading-screen 82%, data-table 80%, theme-toggle 69%, search-page 64%, confirm-dialog 62%.

The strict pass rate dropped from 86% (4-case) to 52% (9-case) because the new cases test harder absence-type rules (touch-action, safe-area, brand translate=no) that even the updated skill misses occasionally. The rule-coverage rate of 82% is what shows the SKILL.md + command.md changes generalize across all 16 upstream sections.

Coverage of the 81 upstream rules: ~45 graded across 9 cases; ~36 skipped (subjective copy rules, framework-bound SSR/drag concerns, overlap with other rules).
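The gap between the strict pass rate and rule coverage is easier to see with a small sketch. It assumes a trial passes strictly only when every seeded rule is reported, and that coverage pools reported rules over all seeded rules. That reading fits the numbers above (405 / 81 = 5 graded rules per trial), but the `Trial` shape, the function names, and the placeholder rule IDs are invented for this illustration and are not the grader's actual code.

```typescript
// One trial: the rule IDs seeded in the case and the rule IDs the agent reported.
// Trial, strictPass and summarize are hypothetical names for this sketch.
type Trial = { expected: string[]; found: string[] };

// Strict pass: the trial counts only if every seeded violation was reported.
function strictPass(t: Trial): boolean {
  const found = new Set(t.found);
  return t.expected.every((rule) => found.has(rule));
}

// Rule coverage: seeded violations reported, pooled over all trials.
function summarize(trials: Trial[]) {
  let passed = 0;
  let covered = 0;
  let seeded = 0;
  for (const t of trials) {
    const found = new Set(t.found);
    if (strictPass(t)) passed++;
    covered += t.expected.filter((rule) => found.has(rule)).length;
    seeded += t.expected.length;
  }
  return { passed, trials: trials.length, covered, seeded };
}

// Three made-up trials of a case with five seeded rules (placeholder IDs).
const seededRules = ["r1", "r2", "r3", "r4", "r5"];
const trials: Trial[] = [
  { expected: seededRules, found: ["r1", "r2", "r3", "r4", "r5"] },
  { expected: seededRules, found: ["r1", "r2", "r3", "r4"] },
  { expected: seededRules, found: ["r1", "r2", "r3", "r4", "unrelated"] },
];

const s = summarize(trials);
console.log(`strict ${s.passed}/${s.trials}, coverage ${s.covered}/${s.seeded}`);
// prints "strict 1/3, coverage 13/15"
```

A single missed rule fails the whole trial under the strict metric but costs only one point of coverage, which is why the two numbers can diverge as much as 52% vs 82%.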

Yuqing Zhai added 2 commits May 7, 2026 10:58

Copilot AI review requested due to automatic review settings May 7, 2026 16:06

Copilot started reviewing on behalf of Zhaiyuqing2003 May 7, 2026 16:06

Zhaiyuqing2003 wants to merge 3 commits into `fix/workbench-linux-docker-permissions` from `eval/web-design-guidelines`
