eval(workbench): web-design-guidelines eval suite + proposed upstream changes #45
Open
Zhaiyuqing2003 wants to merge 3 commits into
added 2 commits
May 7, 2026 10:58
Eval case for vercel-labs/agent-skills/web-design-guidelines (~298K installs).

Four cases cover six rule families:
- review-product-card: Accessibility + Focus States (5 violations)
- review-checkout-form: Forms (5 violations)
- review-loading-screen: Typography + Content Handling (5 violations)
- review-hero-section: Animation + Images + Performance (5 violations)

Each case feeds the agent a focused TSX file with seeded violations and grades whether the agent's findings cover them. Workflow mirrors real usage (one file at a time, not a kitchen sink) and avoids overwhelming smaller models.

The vendored references/ contains the skill plus a snapshot of the upstream rules document with our proposed two-pass workflow + per-element checklist + BAD/GOOD examples additions baked in.

Suite is configured for a 3-provider mid-tier matrix (sonnet, gpt-5-mini, gemini-2.5-pro).

Baseline (upstream skill): 26/36 (72%) across 36 trials.
With proposed additions: 31/36 (86%): sonnet 100%, gpt 83%, gemini 75%.

Reproduce:
  cd examples/workbench/web-design-guidelines && npx tsx ../../../src/cli.ts run-suite ./suite.yml --trials 3
Before/after snapshots of the two upstream files we'd PR back to Vercel:

  vercel-labs/agent-skills               skills/web-design-guidelines/SKILL.md
  vercel-labs/web-interface-guidelines   command.md

Nothing here is published yet; the team needs to coordinate the PR since this is our first upstream contribution to vercel-labs.

Proposed changes are purely additive:
- SKILL.md gains an explicit two-pass workflow (visible patterns → per-element absence checklist). WebFetch behavior unchanged.
- command.md gains a "Per-element review (Pass 2 checklist)" section and a "Common-miss examples" section with five BAD/GOOD code blocks for the rules our eval shows are most often overlooked. Existing rule sections are left intact.

Eval evidence (4 cases × 3 mid-tier models × 3 trials = 36 trials):
  before: 26/36 (72%): sonnet 83%, gpt-5-mini 75%, gemini 58%
  after:  31/36 (86%): sonnet 100%, gpt-5-mini 83%, gemini 75%

Misses on two rules were eliminated entirely (no-empty-state-handling, input-missing-autocomplete); misses on two more were reduced (submit-disabled 3→1, above-fold-priority 2→1).
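The headline percentages follow directly from the matrix size. The sketch below is a sanity check only: `summarize` and `pct` are hypothetical helpers, not part of the workbench CLI, and the per-model pass counts are back-derived from the reported percentages on the assumption that each model ran 4 cases × 3 trials = 12 trials.

```typescript
// Hypothetical helpers for sanity-checking the reported numbers;
// nothing here is part of the workbench CLI.
interface ModelResult {
  model: string;
  passed: number; // assumed pass count, back-derived from the reported percentage
}

const CASES = 4;
const MODELS = 3;
const TRIALS_PER_CASE = 3;

const trialsPerModel = CASES * TRIALS_PER_CASE; // 12
const totalTrials = trialsPerModel * MODELS; // 36

// Whole-number percentage, rounded to nearest.
function pct(passed: number, trials: number): number {
  return Math.round((passed / trials) * 100);
}

function summarize(results: ModelResult[]): string {
  const passed = results.reduce((sum, r) => sum + r.passed, 0);
  const perModel = results
    .map((r) => `${r.model} ${pct(r.passed, trialsPerModel)}%`)
    .join(", ");
  return `${passed}/${totalTrials} (${pct(passed, totalTrials)}%): ${perModel}`;
}

const before: ModelResult[] = [
  { model: "sonnet", passed: 10 },
  { model: "gpt-5-mini", passed: 9 },
  { model: "gemini", passed: 7 },
];

const after: ModelResult[] = [
  { model: "sonnet", passed: 12 },
  { model: "gpt-5-mini", passed: 10 },
  { model: "gemini", passed: 9 },
];

console.log("before:", summarize(before));
console.log("after: ", summarize(after));
```

With 12 trials per model, each additional passing trial moves a model's rate by roughly 8 points, so the per-model deltas here correspond to one or two trials each.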
Adds 5 new cases covering the 9 untested upstream rule sections:

  review-data-table       Performance + Typography + Content & Copy
  review-confirm-dialog   Touch & Interaction + Safe Areas + Hover
  review-search-page      Navigation & State + Locale & i18n
  review-theme-toggle     Hydration Safety + Dark Mode + Focus
  review-blog-post        Heading hierarchy + Aria + Focus + Content & Copy

Also extends `command.md` (vendored + upstream-proposal) with:
- Per-element Pass 2 entries for modal/dialog, native <select>, headings, brand names, and any <button> (hover state, focus-visible, type=button)
- BAD/GOOD examples for focus-visible vs focus and translate="no"

9-case matrix (9 cases × 3 mid-tier models × 3 trials = 81 trials):
  Strict pass:   42/81 (52%)
  Rule coverage: 334/405 (82%), the load-bearing metric

Per-case coverage: product-card 100%, hero-section 100%, checkout-form 98%, blog-post 87%, loading-screen 82%, data-table 80%, theme-toggle 69%, search-page 64%, confirm-dialog 62%.

The strict pass rate dropped from 86% (4-case) to 52% (9-case) because the new cases test harder absence-type rules (touch-action, safe-area, brand translate=no) that even the updated skill misses occasionally. The rule-coverage rate of 82% is what shows the SKILL.md + command.md changes generalize across all 16 upstream sections.

Coverage of the 81 upstream rules: ~45 graded across 9 cases; ~36 skipped (subjective copy rules, framework-bound SSR/drag concerns, overlap with other rules).
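The gap between the two metrics is easier to see with a small example. This is a minimal sketch under two assumptions the commit message does not state outright: that a trial counts as a strict pass only when every seeded violation is found, and that rule coverage is found violations over seeded violations summed across trials (405 = 81 trials × 5 seeded violations is consistent with that reading). The `TrialResult` shape and the rule IDs are hypothetical, not the suite's actual grader output.

```typescript
// Hypothetical trial shape; the real grader output may differ.
interface TrialResult {
  seeded: string[]; // IDs of the violations planted in the TSX file
  found: string[]; // IDs the agent's review was graded as covering
}

// Number of seeded violations that the agent's findings cover.
function coveredCount(t: TrialResult): number {
  const found = new Set(t.found);
  return t.seeded.filter((id) => found.has(id)).length;
}

// Assumed definition: a strict pass requires every seeded violation to be found.
function strictPass(t: TrialResult): boolean {
  return coveredCount(t) === t.seeded.length;
}

function score(trials: TrialResult[]) {
  return {
    strictPasses: trials.filter(strictPass).length,
    trials: trials.length,
    covered: trials.reduce((n, t) => n + coveredCount(t), 0),
    seeded: trials.reduce((n, t) => n + t.seeded.length, 0),
  };
}

// Placeholder rule IDs, for illustration only.
const seeded = ["rule-1", "rule-2", "rule-3", "rule-4", "rule-5"];

const example: TrialResult[] = [
  { seeded, found: [...seeded] }, // finds all five
  { seeded, found: seeded.slice(0, 4) }, // misses one
];

const s = score(example);
console.log(`Strict pass:   ${s.strictPasses}/${s.trials}`);
console.log(`Rule coverage: ${s.covered}/${s.seeded}`);
```

Under these definitions a single missed rule fails the whole trial for strict pass but costs only one fifth of that trial's coverage, which is why strict pass can sit at 52% while rule coverage stays at 82%.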
First skill from the prioritized top-N list. Adds an eval suite for `vercel-labs/agent-skills/web-design-guidelines` (~298K installs) and a before/after proposal of upstream changes for the team to review before we PR.

Stacked on `fix/workbench-linux-docker-permissions` (the Linux Docker fix is required to run the suite).

Eval suite: `examples/workbench/web-design-guidelines/`

4 cases × 3 mid-tier models × 3 trials. 20 seeded violations across 6 rule families (a11y / focus / forms / typography / animation / images).

Proposed upstream changes: `proposed-upstream-changes/`

Before/after snapshots of the two upstream files we'd PR back. Not published yet, pending team coordination since this is our first contribution to vercel-labs.

- vercel-labs/agent-skills: SKILL.md (39→54 lines)
- vercel-labs/web-interface-guidelines: command.md (180→304 lines)

Eval evidence

Misses on two rules were eliminated entirely (`no-empty-state-handling`, `input-missing-autocomplete`); misses on two more were reduced (`submit-disabled` 3→1, `above-fold-priority` 2→1).