eval(workbench): web-design-guidelines eval suite + proposed upstream changes #45

Open
Zhaiyuqing2003 wants to merge 3 commits into fix/workbench-linux-docker-permissions from eval/web-design-guidelines

@Zhaiyuqing2003

First skill from the prioritized top-N list. Adds an eval suite for vercel-labs/agent-skills/web-design-guidelines (~298K installs) and a before/after proposal of upstream changes for the team to review before we PR.

Stacked on fix/workbench-linux-docker-permissions (the Linux Docker fix is required to run the suite).

Eval suite — examples/workbench/web-design-guidelines/

4 cases × 3 mid-tier models × 3 trials. 20 seeded violations across 6 rule families (a11y / focus / forms / typography / animation / images).

cd examples/workbench/web-design-guidelines
export OPENROUTER_API_KEY=sk-or-...
npx tsx ../../../src/cli.ts run-suite ./suite.yml --trials 3

Proposed upstream changes — proposed-upstream-changes/

Before/after snapshots of the two upstream files we'd PR back. Not published yet — pending team coordination since this is our first contribution to vercel-labs.

| Upstream | Change |
| --- | --- |
| vercel-labs/agent-skills SKILL.md (39→54 lines) | adds explicit two-pass workflow |
| vercel-labs/web-interface-guidelines command.md (180→304 lines) | adds per-element Pass 2 checklist + 5 BAD/GOOD examples for most-missed rules |

Eval evidence

| Model | Before | After |
| --- | --- | --- |
| claude-sonnet-4.6 | 10/12 (83%) | 12/12 (100%) |
| gpt-5-mini | 9/12 (75%) | 10/12 (83%) |
| gemini-2.5-pro | 7/12 (58%) | 9/12 (75%) |
| Total | 26/36 (72%) | 31/36 (86%) |

Misses on two rules were eliminated entirely (no-empty-state-handling, input-missing-autocomplete); misses on two more were reduced (submit-disabled 3→1, above-fold-priority 2→1).

Yuqing Zhai added 2 commits May 7, 2026 10:58
Eval case for vercel-labs/agent-skills/web-design-guidelines (~298K
installs). Four cases cover six rule families:

- review-product-card  — Accessibility + Focus States  (5 violations)
- review-checkout-form — Forms                          (5 violations)
- review-loading-screen — Typography + Content Handling (5 violations)
- review-hero-section  — Animation + Images + Performance (5 violations)

Each case feeds the agent a focused TSX file with seeded violations and
grades whether the agent's findings cover them. Workflow mirrors real
usage (one file at a time, not a kitchen sink) and avoids overwhelming
smaller models.
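
One way to picture the grading step is as a set-coverage check: each case lists its seeded rule IDs, and a trial is scored by which of them show up in the agent's findings. The sketch below is a hypothetical illustration, not the workbench's actual grader. The `gradeTrial` name, the `TrialGrade` shape, the placeholder rule IDs, and the assumption that findings are already normalized to rule IDs are all invented for the example.

```typescript
// Hypothetical sketch of coverage grading; not the workbench's real grader.
interface TrialGrade {
  covered: string[];   // seeded rules the agent reported
  missed: string[];    // seeded rules the agent did not report
  strictPass: boolean; // assumed to mean: nothing was missed
}

// Assumes findings have already been normalized to rule IDs.
function gradeTrial(seeded: string[], findings: string[]): TrialGrade {
  const reported = new Set(findings);
  const covered = seeded.filter((rule) => reported.has(rule));
  const missed = seeded.filter((rule) => !reported.has(rule));
  return { covered, missed, strictPass: missed.length === 0 };
}

// Example: a trial with 5 seeded violations, one of which is missed.
const seededRules = [
  "input-missing-autocomplete",
  "submit-disabled",
  "placeholder-rule-a", // placeholder IDs, not taken from the suite
  "placeholder-rule-b",
  "placeholder-rule-c",
];
const agentFindings = [
  "input-missing-autocomplete",
  "placeholder-rule-a",
  "placeholder-rule-b",
  "placeholder-rule-c",
  "unrelated-finding", // extra findings do not change coverage in this sketch
];
const grade = gradeTrial(seededRules, agentFindings);
console.log(grade.missed); // the one missed rule: submit-disabled
```

In this sketch extra findings are ignored rather than penalized; whether the real suite treats false positives that way is not stated here.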

The vendored references/ contains the skill plus a snapshot of the
upstream rules document with our proposed two-pass workflow + per-element
checklist + BAD/GOOD examples additions baked in. Suite is configured for
a 3-provider mid-tier matrix (sonnet, gpt-5-mini, gemini-2.5-pro).

Baseline (upstream skill): 26/36 (72%) across 36 trials.
With proposed additions:    31/36 (86%) — sonnet 100%, gpt 83%, gemini 75%.

Reproduce: cd examples/workbench/web-design-guidelines &&
  npx tsx ../../../src/cli.ts run-suite ./suite.yml --trials 3

Before/after snapshots of the two upstream files we'd PR back to Vercel:

  vercel-labs/agent-skills        skills/web-design-guidelines/SKILL.md
  vercel-labs/web-interface-guidelines  command.md

Nothing here is published yet — the team needs to coordinate the PR
since this is our first upstream contribution to vercel-labs.

Proposed changes are purely additive:
- SKILL.md gains an explicit two-pass workflow (visible patterns →
  per-element absence checklist). WebFetch behavior unchanged.
- command.md gains a "Per-element review (Pass 2 checklist)" section
  and a "Common-miss examples" section with five BAD/GOOD code blocks
  for the rules our eval shows are most often overlooked. Existing
  rule sections are left intact.

Eval evidence (4 cases × 3 mid-tier models × 3 trials = 36 trials):
  before: 26/36 (72%)  — sonnet 83%, gpt-5-mini 75%, gemini 58%
  after:  31/36 (86%)  — sonnet 100%, gpt-5-mini 83%, gemini 75%

Two rules eliminated entirely (no-empty-state-handling,
input-missing-autocomplete); two more reduced (submit-disabled 3→1,
above-fold-priority 2→1).
Copilot AI review requested due to automatic review settings May 7, 2026 16:06

Adds 5 new cases covering the 9 untested upstream rule sections:

  review-data-table     Performance + Typography + Content & Copy
  review-confirm-dialog Touch & Interaction + Safe Areas + Hover
  review-search-page    Navigation & State + Locale & i18n
  review-theme-toggle   Hydration Safety + Dark Mode + Focus
  review-blog-post      Heading hierarchy + Aria + Focus + Content & Copy

Also extends `command.md` (vendored + upstream-proposal) with:
- Per-element Pass 2 entries for modal/dialog, native <select>, headings,
  brand names, and any <button> (hover state, focus-visible, type=button)
- BAD/GOOD examples for focus-visible vs focus and translate="no"

9-case matrix (9 cases × 3 mid-tier models × 3 trials = 81 trials):
  Strict pass:    42/81 (52%)
  Rule coverage:  334/405 (82%) — load-bearing metric

Per-case coverage: product-card 100%, hero-section 100%, checkout-form
98%, blog-post 87%, loading-screen 82%, data-table 80%, theme-toggle
69%, search-page 64%, confirm-dialog 62%.

The strict pass rate dropped from 86% (4-case) to 52% (9-case) because
the new cases test harder absence-type rules (touch-action, safe-area,
brand translate=no) that even the updated skill misses occasionally.
The rule-coverage rate of 82% is what shows the SKILL.md + command.md
changes generalize across all 16 upstream sections.
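
The gap between the two metrics follows from how they aggregate. Assuming a strict pass requires every seeded rule in a trial to be found (an assumption; the definition is not spelled out here), a single miss fails the whole trial but costs only one rule slot in coverage. A minimal sketch with made-up trial data, using hypothetical names (`TrialResult`, `summarize`):

```typescript
// Hypothetical aggregation sketch; the trial data below is illustrative only.
interface TrialResult {
  seededCount: number;  // rules seeded in the case
  coveredCount: number; // seeded rules the agent found
}

function summarize(trials: TrialResult[]) {
  // A trial passes strictly only if every seeded rule was covered (assumed definition).
  const strictPasses = trials.filter((t) => t.coveredCount === t.seededCount).length;
  const seededTotal = trials.reduce((sum, t) => sum + t.seededCount, 0);
  const coveredTotal = trials.reduce((sum, t) => sum + t.coveredCount, 0);
  return {
    strictPassRate: strictPasses / trials.length,
    ruleCoverage: coveredTotal / seededTotal,
  };
}

// Four trials of 5 rules each; two of them miss exactly one rule.
const summary = summarize([
  { seededCount: 5, coveredCount: 5 },
  { seededCount: 5, coveredCount: 4 },
  { seededCount: 5, coveredCount: 5 },
  { seededCount: 5, coveredCount: 4 },
]);
console.log(summary); // strictPassRate 0.5, ruleCoverage 0.9
```

The reported denominators fit this reading: 405 rule slots over 81 trials works out to five seeded rules per trial on average.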

Coverage of the 81 upstream rules: ~45 graded across 9 cases; ~36
skipped (subjective copy rules, framework-bound SSR/drag concerns,
overlap with other rules).