fix(scoring): honest grades — wire metrics analyzer, calibrate web3 severities, treat code-quality signals as N/A for markdown skills#20
Merged
Three independent scoring bugs were compounding to give every skill the
wrong grade. On the Odos skill pack (the reference web3-router skill),
these fixes move the grade from 47/D to 82/B without any upstream changes.
1. **MetricsAnalyzer was silently broken on every audit.** The CLI called
`new MetricsAnalyzer()` with no constructor arg and then
`.analyze(skill)` with the wrong arg signature — the access to
`this.skill.files` threw, the try/catch swallowed the error, and the
code fell through to a hardcoded default of
`{hasReadme: false, hasLicense: false, maintenanceHealth: 50}`
regardless of what the skill actually shipped. Fixed the call site to
pass the skill into the constructor and call `.analyze()` zero-arg.
2. **AST-W12 + W04M severities were miscalibrated as HIGH/MEDIUM.** Audit
sink, kill-switch contract, incident runbook, and missing
`permissions[]` are governance / forensics hygiene gaps, not
exploitable vulnerabilities — most skills today have none of these.
Surfacing them as HIGH (21 security-score points each) tanked
otherwise-clean skills and trained authors to treat them as noise.
Downgraded to LOW (6 points each) with rationale comments on each
rule. The autonomy-claim companion check (W12-020) still escalates
when a skill markets itself as autonomous without a kill-switch.
3. **Code-library scoring model was applied to markdown-only skill packs.**
The maintenance scorer demanded tests / CI / types / linter / formatter
(80% of its weight) and the documentation scorer demanded inline
comments and JSDoc (65% of its weight) — all absurd for a 6-file
markdown skill that has no source code to test, type, or lint. Both
scorers now detect markdown-only skills (no `.ts`/`.js`/`.py`/etc) and
treat those signals as N/A: the maintenance scorer gives full credit
for the inapplicable axes, and the documentation scorer lets a strong
README absorb the full doc weight (up from a 0.35 cap).
Verified end-to-end against the real Odos skill at github.com/odos-xyz/odos-skills@93b3db6:
| Metric | Before | After |
|---------------|--------|-------|
| Overall | 47 D | 82 B |
| Security | 34 | 76 |
| Quality | 65 | 80 |
| Maintenance | 50† | 100 |
† The 50 was the broken-analyzer fallback default, not a real measurement.
The actual measurement on the same skill (license + gitignore, nothing else
applicable) is 20 — which the skill-type-aware fix correctly normalizes to
100 since the missing axes are N/A for a markdown skill, not failures.
All 343 affected tests pass (288 web3 + 55 metrics). Two pre-existing
e2e/web3.test.ts failures about WEB3_RULES registry length are unrelated
to this change (verified by stashing and re-running on HEAD).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Markeljan added a commit that referenced this pull request on May 21, 2026:

…imitations heading
- Delete docs/odos-skill-recommendations.md: it was a one-off snapshot against agentsec v0.2.7 + odos-skills@f88b7c89; PR #20 invalidated the numbers (Odos moved 47/D → 82/B on a newer commit) and the doc is no longer worth maintaining.
- Drop the now-dangling JSDoc citation in openclaw/src/formats.ts.
- Rename packages/web3/README.md "Limitations of v0.2.0" → "Current limitations": the bullets describe current limitations, not 0.2.0-specific historical ones, and stamping a version on the heading invites the same drift the Odos doc just hit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Three independent scoring bugs were compounding to give every skill the wrong grade. On the Odos skill pack (the reference web3-router skill in our recs doc), these fixes move the grade from 47/D to 82/B without any upstream changes.
- `MetricsAnalyzer` was silently broken on every audit; every score relied on a hardcoded fallback

The 3 bugs and the fixes
1. `MetricsAnalyzer` was silently broken on every audit

`packages/cli/src/commands/audit.ts:181` called `new MetricsAnalyzer()` with no constructor arg and `analyzer.analyze(skill)` with the wrong arg signature. The analyzer's actual API is `new MetricsAnalyzer(skill)` then `.analyze()` zero-arg, so the wrong call path threw on `this.skill.files`, the try/catch swallowed the exception, and the code fell through to a hardcoded default: `{hasReadme: false, hasLicense: false, maintenanceHealth: 50}`.

Every audit on every skill has been getting these fallback defaults regardless of what the skill actually shipped. Odos's audit reported `hasReadme: false` and `hasLicense: false` despite the repo having a 2,703-byte README.md and a LICENSE file at the root.

Fixed the call site to use the analyzer's real signature.
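
The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the repo's code: the `Skill` and `Metrics` shapes, the `auditBefore`/`auditAfter` helpers, and the body of `analyze()` are hypothetical stand-ins, and the `as any` cast is only an assumption about how a mismatched call could get past the type checker.

```typescript
// Hypothetical stand-ins for illustration; the real types live in the repo.
interface Skill {
  files: string[];
}

interface Metrics {
  hasReadme: boolean;
  hasLicense: boolean;
  maintenanceHealth: number;
}

// The hardcoded fallback quoted in the description above.
const FALLBACK: Metrics = { hasReadme: false, hasLicense: false, maintenanceHealth: 50 };

class MetricsAnalyzer {
  // The skill is supplied once, at construction time.
  constructor(private readonly skill: Skill) {}

  // Zero-arg: reads this.skill, so it throws if the constructor received nothing.
  analyze(): Metrics {
    const names = this.skill.files.map((f) => f.toLowerCase());
    return {
      hasReadme: names.includes("readme.md"),
      hasLicense: names.includes("license"),
      maintenanceHealth: 100, // placeholder value, not the real scoring
    };
  }
}

// Old call shape: no constructor arg, skill passed to analyze().
// The catch block hides the TypeError and returns the fallback.
function auditBefore(skill: Skill): Metrics {
  try {
    const analyzer = new (MetricsAnalyzer as any)();
    return analyzer.analyze(skill);
  } catch {
    return FALLBACK;
  }
}

// Fixed call shape: skill into the constructor, analyze() with no arguments.
function auditAfter(skill: Skill): Metrics {
  return new MetricsAnalyzer(skill).analyze();
}

const skill: Skill = { files: ["README.md", "LICENSE", "SKILL.md"] };
const before = auditBefore(skill);
const after = auditAfter(skill);
```

With this sketch, `before` is the fallback object even though the skill ships a README and a LICENSE, while `after` reflects the files that are actually present. Logging or rethrowing inside the `catch` would have surfaced the bug on the first audit.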
2. AST-W12 + W04M severities were miscalibrated
Four findings were rated HIGH/MEDIUM despite being governance / forensics hygiene gaps, not exploitable vulnerabilities: audit sink, kill-switch contract, incident runbook, and missing `permissions[]`.
Per-finding security cost dropped from 21 (high) and 12 (medium) to 6 (low). The audit still surfaces all four as actionable, just no longer treats them as showstoppers. Each rule got an inline rationale comment explaining the calibration.
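
To make the arithmetic concrete, here is a small sketch of the per-finding costs. The three point values come from the paragraph above; everything else is an assumption for illustration: that costs simply add up across findings, and that the four findings were split two HIGH and two MEDIUM (the description does not say which rule carried which severity).

```typescript
type Severity = "high" | "medium" | "low";

// Per-finding security cost quoted in the description above.
const SEVERITY_COST: Record<Severity, number> = { high: 21, medium: 12, low: 6 };

// Assumption: costs are additive across findings.
function totalCost(severities: Severity[]): number {
  return severities.reduce((sum, s) => sum + SEVERITY_COST[s], 0);
}

// Hypothetical split of the four governance findings before the recalibration.
const beforeCost = totalCost(["high", "high", "medium", "medium"]);

// After the recalibration all four are LOW.
const afterCost = totalCost(["low", "low", "low", "low"]);

// Points handed back to the security score under these assumptions.
const recovered = beforeCost - afterCost;
```

Under that assumed split the four findings cost 66 points before and 24 after, a difference of 42. That is the same size as the Security change reported for Odos (34 to 76), though the split itself remains a guess.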
3. Code-library scoring model was applied to markdown-only skills
The maintenance scorer demanded tests (25 pts), CI (20 pts), types (20 pts), linter (10 pts), formatter (5 pts) — 80% of its weight — and the documentation scorer demanded inline comments and JSDoc (65% of its weight). All absurd for a 6-file markdown skill pack with no source code to test, type, lint, or document with inline comments.
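Both scorers now detect markdown-only skills (no `.ts`/`.js`/`.py`/`.rb`/`.rs`/`.go`/etc files) and:
- the maintenance scorer gives full credit for the inapplicable axes
- the documentation scorer lets a strong README absorb the full doc weight (up from a 0.35 cap)

Verified end-to-end against the real Odos skill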
Audited `odos-xyz/odos-skills@93b3db6` before and after:

| Metric      | Before | After |
|-------------|--------|-------|
| Overall     | 47 D   | 82 B  |
| Security    | 34     | 76    |
| Quality     | 65     | 80    |
| Maintenance | 50†    | 100   |

† The 50 was the broken-analyzer fallback default, not a real measurement. The real measurement on the same skill is 20 (license + gitignore, nothing else applicable on a markdown skill), which the skill-type-aware fix correctly normalizes to 100 since the missing axes are N/A, not failures.
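
The 20-to-100 normalization can be sketched as follows. The five code-axis weights are the ones quoted in this description; the extension list is truncated in the source ("etc"), the 10/10 split of the remaining 20 points between license and gitignore is an assumption, and `isMarkdownOnly` and `maintenanceScore` are hypothetical names rather than the scorer's real API.

```typescript
// Extensions named in the description; the real list continues ("etc").
const SOURCE_EXTENSIONS = [".ts", ".js", ".py", ".rb", ".rs", ".go"];

// A skill is markdown-only when none of its files has a source-code extension.
function isMarkdownOnly(files: string[]): boolean {
  return !files.some((f) => SOURCE_EXTENSIONS.some((ext) => f.toLowerCase().endsWith(ext)));
}

interface Axis {
  name: string;
  weight: number;
  needsSourceCode: boolean;
}

// Weights for the five code axes are quoted in the description (80 in total).
// The 10/10 split of the remaining 20 between license and gitignore is assumed.
const AXES: Axis[] = [
  { name: "tests", weight: 25, needsSourceCode: true },
  { name: "ci", weight: 20, needsSourceCode: true },
  { name: "types", weight: 20, needsSourceCode: true },
  { name: "linter", weight: 10, needsSourceCode: true },
  { name: "formatter", weight: 5, needsSourceCode: true },
  { name: "license", weight: 10, needsSourceCode: false },
  { name: "gitignore", weight: 10, needsSourceCode: false },
];

// An axis earns its weight if it passed, or if it is N/A for a markdown-only skill.
function maintenanceScore(files: string[], passed: Set<string>, skillTypeAware: boolean): number {
  const markdownOnly = skillTypeAware && isMarkdownOnly(files);
  let score = 0;
  for (const axis of AXES) {
    const notApplicable = markdownOnly && axis.needsSourceCode;
    if (notApplicable || passed.has(axis.name)) score += axis.weight;
  }
  return score;
}

const files = ["README.md", "SKILL.md", "LICENSE", ".gitignore"];
const passed = new Set(["license", "gitignore"]);
const raw = maintenanceScore(files, passed, false);
const normalized = maintenanceScore(files, passed, true);
```

Here `raw` is 20 and `normalized` is 100, matching the footnote. Adding a single `.ts` file to `files` would make every code axis applicable again, so the same skill would fall back to 20.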
To reach A (90+), Odos needs to apply rec #8 of docs/odos-skill-recommendations.md — the same upstream PR our recs doc has been asking for all along.
Test plan
- `cd packages/web3 && bun test`: 288 pass / 0 fail
- `cd packages/metrics && bun test`: 55 pass / 0 fail
- `odos-xyz/odos-skills@93b3db6` produces 82/B with the new findings + metrics
- … (`e2e/web3.test.ts` tests about `WEB3_RULES` registry length are pre-existing on HEAD)
- … `e2e/fixtures/profiles/` to confirm grade movements are reasonable
- … `examples/` if they get regenerated as part of the version bump

Notes

The two failures in `e2e/web3.test.ts` (`WEB3_RULES registry > contains exactly 12 rules` and `every rule has a unique category`) are registry drift unrelated to this change. Verified by stashing and re-running on HEAD.

🤖 Generated with Claude Code