fix(scoring): honest grades — wire metrics analyzer, calibrate web3 severities, treat code-quality signals as N/A for markdown skills#20

Merged
Markeljan merged 1 commit into main from claude/scoring-honesty on May 21, 2026

@Markeljan
Collaborator

Summary

Three independent scoring bugs were compounding to give every skill the wrong grade. On the Odos skill pack (the reference web3-router skill in our recs doc), these fixes move the grade from 47/D to 82/B without any upstream changes.

  • 🐛 MetricsAnalyzer was silently broken on every audit — every score relied on a hardcoded fallback
  • 📉 AST-W12 + W04M severities were miscalibrated as HIGH/MEDIUM for governance hygiene that isn't exploitable
  • 🪜 Code-library scoring (tests/CI/types/linter/formatter) was being applied to markdown-only skill packs

The 3 bugs and the fixes

1. MetricsAnalyzer was silently broken on every audit

packages/cli/src/commands/audit.ts:181 called new MetricsAnalyzer() with no constructor argument and then analyzer.analyze(skill), which does not match the analyzer's signature. The actual API is new MetricsAnalyzer(skill) followed by a zero-argument .analyze(). The wrong call path threw on this.skill.files, the try/catch swallowed the exception, and the code fell through to a hardcoded default:

return {
  // ...
  documentationScore: 0,
  maintenanceHealth: 50,
  hasReadme: false,
  hasLicense: false,
  // ...
};

Every audit of every skill has been returning these fallback defaults, regardless of what the skill actually shipped. Odos's audit reported hasReadme: false and hasLicense: false despite the repo having a 2,703-byte README.md and a LICENSE file at the root.

Fixed the call site to use the analyzer's real signature.
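
The failure mode can be reproduced in isolation. The sketch below is not the repo's code: the Skill and Metrics shapes, the auditBuggy and auditFixed helpers, and the file-name checks are hypothetical stand-ins, and only the constructor-then-zero-arg-analyze shape and the fallback values come from the description above.

```typescript
// Hypothetical stand-ins; the real types in the repo may differ.
interface Skill {
  files: string[];
}

interface Metrics {
  hasReadme: boolean;
  hasLicense: boolean;
}

// Fallback values quoted in the PR description.
const FALLBACK: Metrics = { hasReadme: false, hasLicense: false };

class MetricsAnalyzer {
  // The skill is bound at construction time; analyze() takes no arguments.
  constructor(private readonly skill?: Skill) {}

  analyze(): Metrics {
    // Throws a TypeError when no skill was passed to the constructor.
    const names = this.skill!.files.map((f) => f.toLowerCase());
    return {
      hasReadme: names.includes("readme.md"),
      hasLicense: names.includes("license"),
    };
  }
}

// Old call shape: no constructor argument, skill passed to analyze().
function auditBuggy(skill: Skill): Metrics {
  try {
    const analyzer = new MetricsAnalyzer();
    // The extra argument is ignored at runtime, so this.skill stays undefined.
    return (analyzer as any).analyze(skill);
  } catch {
    // The exception is swallowed, so callers only ever see the fallback.
    return FALLBACK;
  }
}

// Fixed call shape: skill goes into the constructor, analyze() is zero-arg.
function auditFixed(skill: Skill): Metrics {
  return new MetricsAnalyzer(skill).analyze();
}
```

With a skill whose files include README.md and LICENSE, auditBuggy still reports both as missing, while auditFixed reports both as present.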

2. AST-W12 + W04M severities were miscalibrated

Four findings were rated HIGH/MEDIUM despite being governance / forensics hygiene gaps, not exploitable vulnerabilities:

| Rule                       | Before | After | Why                                                                |
|----------------------------|--------|-------|--------------------------------------------------------------------|
| W12-001 audit sink         | high   | low   | Forensics-trail nice-to-have; most skills today have none          |
| W12-002 kill-switch        | high   | low   | Operational nice-to-have; W12-020 still escalates on autonomy claims |
| W12-003 incident runbook   | medium | low   | Documentation gap only                                             |
| W04M-002 permissions array | medium | low   | Manifest hygiene; runtime can enforce via other channels           |

Per-finding security cost dropped from 21 (high) and 12 (medium) to 6 (low). The audit still surfaces all four as actionable, just no longer treats them as showstoppers. Each rule got an inline rationale comment explaining the calibration.
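
As a rough check on these numbers, the sketch below assumes the security score is simply 100 minus the summed per-finding costs. That formula is an assumption, not something read from the scorer, but it matches the reported Security values for the Odos audit (34 with two high and two medium findings, 76 with four low findings). The Severity type and securityScore function are illustrative names.

```typescript
type Severity = "high" | "medium" | "low";

// Per-finding costs quoted in the PR description.
const COST: Record<Severity, number> = { high: 21, medium: 12, low: 6 };

// Assumption: score = 100 minus the summed costs, floored at 0.
function securityScore(findings: Severity[]): number {
  const penalty = findings.reduce((sum, s) => sum + COST[s], 0);
  return Math.max(0, 100 - penalty);
}

// Odos before recalibration: 2 high + 2 medium findings.
const before: Severity[] = ["high", "high", "medium", "medium"];
// Odos after recalibration: the same four findings, all low.
const after: Severity[] = ["low", "low", "low", "low"];

console.log(securityScore(before)); // 34
console.log(securityScore(after)); // 76
```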

3. Code-library scoring model was applied to markdown-only skills

The maintenance scorer demanded tests (25 pts), CI (20 pts), types (20 pts), linter (10 pts), and formatter (5 pts), which together make up 80% of its weight. The documentation scorer demanded inline comments and JSDoc (65% of its weight). All of that is absurd for a 6-file markdown skill pack with no source code to test, type, lint, or document with inline comments.

Both scorers now detect markdown-only skills (no .ts/.js/.py/.rb/.rs/.go/etc files) and:

  • Maintenance: treat tests/CI/types/linter/formatter as N/A (full credit). Score still hinges on license, gitignore, and the other axes that actually apply.
  • Documentation: let the README absorb the full weight when there's no source code, so a strong README earns the full 1.0 instead of being capped at 0.35 by absent code-doc signals.
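
The two rules above can be sketched as follows. This is a minimal illustration, not the scorers' actual code: the function names, the signal shape, the extension list (the description ends it with "etc"), and the 10/10 split of the remaining 20 maintenance points between license and gitignore are all assumptions. The 25/20/20/10/5 weights and the 0.35/0.65 documentation split come from the description.

```typescript
// Illustrative extension list; the real detector may cover more languages.
const CODE_EXTENSIONS = [".ts", ".js", ".py", ".rb", ".rs", ".go"];

// A skill counts as markdown-only when no file has a source-code extension.
function isMarkdownOnly(files: string[]): boolean {
  return !files.some((f) =>
    CODE_EXTENSIONS.some((ext) => f.toLowerCase().endsWith(ext)),
  );
}

interface MaintenanceSignals {
  hasTests: boolean;
  hasCi: boolean;
  hasTypes: boolean;
  hasLinter: boolean;
  hasFormatter: boolean;
  hasLicense: boolean;
  hasGitignore: boolean;
}

function maintenanceScore(s: MaintenanceSignals, markdownOnly: boolean): number {
  // Code-quality axes are N/A (full credit) when there is no source code.
  const codeAxis = (present: boolean, pts: number) =>
    present || markdownOnly ? pts : 0;
  return (
    codeAxis(s.hasTests, 25) +
    codeAxis(s.hasCi, 20) +
    codeAxis(s.hasTypes, 20) +
    codeAxis(s.hasLinter, 10) +
    codeAxis(s.hasFormatter, 5) +
    // Assumed split of the remaining 20 points.
    (s.hasLicense ? 10 : 0) +
    (s.hasGitignore ? 10 : 0)
  );
}

// readme and codeDocs are quality ratings in the range [0, 1].
function documentationScore(
  readme: number,
  codeDocs: number,
  markdownOnly: boolean,
): number {
  // With no source code, the README absorbs the full weight.
  return markdownOnly ? readme : 0.35 * readme + 0.65 * codeDocs;
}
```

Under these assumptions, a skill with only a license and a gitignore scores 20 on maintenance when treated as a code library and 100 when recognized as markdown-only, and a strong README moves from 0.35 to 1.0 on documentation.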

Verified end-to-end against the real Odos skill

Audited odos-xyz/odos-skills@93b3db6 before and after:

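| Metric            | Before | After  |
|-------------------|--------|--------|
| Overall           | 47 / D | 82 / B |
| Security          | 34     | 76     |
| Quality           | 65     | 80     |
| Maintenance       | 50†    | 100    |
| Critical findings | 0      | 0      |
| High findings     | 2      | 0      |
| Medium findings   | 2      | 0      |
| Low findings      | 0      | 4      |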

† The 50 was the broken-analyzer fallback default, not a real measurement. The real measurement on the same skill is 20 (license + gitignore, nothing else applicable on a markdown skill) — which the skill-type-aware fix correctly normalizes to 100 since the missing axes are N/A, not failures.

To reach A (90+), Odos needs to apply rec #8 of docs/odos-skill-recommendations.md — the same upstream PR our recs doc has been asking for all along.

Test plan

  • cd packages/web3 && bun test — 288 pass / 0 fail
  • cd packages/metrics && bun test — 55 pass / 0 fail
  • End-to-end audit on odos-xyz/odos-skills@93b3db6 produces 82/B with the new findings + metrics
  • No new test failures introduced (the 2 failing e2e/web3.test.ts tests about WEB3_RULES registry length were verified as pre-existing on HEAD)
  • Manual sanity check on the fixtures in e2e/fixtures/profiles/ to confirm grade movements are reasonable
  • Eyeball the example reports in examples/ if they get regenerated as part of the version bump

Notes

🤖 Generated with Claude Code

…everities, treat code-quality signals as N/A for markdown skills

Three independent scoring bugs were compounding to give every skill the
wrong grade. On the Odos skill pack (the reference web3-router skill),
these fixes move the grade from 47/D to 82/B without any upstream changes.

1. **MetricsAnalyzer was silently broken on every audit.** The CLI called
   `new MetricsAnalyzer()` with no constructor arg and then
   `.analyze(skill)` with the wrong arg signature — the access to
   `this.skill.files` threw, the try/catch swallowed the error, and the
   code fell through to a hardcoded default of
   `{hasReadme: false, hasLicense: false, maintenanceHealth: 50}`
   regardless of what the skill actually shipped. Fixed the call site to
   pass the skill into the constructor and call `.analyze()` zero-arg.

2. **AST-W12 + W04M severities were miscalibrated as HIGH/MEDIUM.** Audit
   sink, kill-switch contract, incident runbook, and missing
   `permissions[]` are governance / forensics hygiene gaps, not
   exploitable vulnerabilities — most skills today have none of these.
   Surfacing them as HIGH (21 security-score points each) tanked
   otherwise-clean skills and trained authors to treat them as noise.
   Downgraded to LOW (6 points each) with rationale comments on each
   rule. The autonomy-claim companion check (W12-020) still escalates
   when a skill markets itself as autonomous without a kill-switch.

3. **Code-library scoring model was applied to markdown-only skill packs.**
   The maintenance scorer demanded tests / CI / types / linter / formatter
   (80% of its weight) and the documentation scorer demanded inline
   comments and JSDoc (65% of its weight) — all absurd for a 6-file
   markdown skill that has no source code to test, type, or lint. Both
   scorers now detect markdown-only skills (no `.ts`/`.js`/`.py`/etc) and
   treat those signals as N/A: the maintenance scorer gives full credit
   for the inapplicable axes, and the documentation scorer lets a strong
   README absorb the full doc weight (up from a 0.35 cap).

Verified end-to-end against the real Odos skill at github.com/odos-xyz/odos-skills@93b3db6:

| Metric        | Before | After |
|---------------|--------|-------|
| Overall       | 47 D   | 82 B  |
| Security      | 34     | 76    |
| Quality       | 65     | 80    |
| Maintenance   | 50†    | 100   |

† The 50 was the broken-analyzer fallback default, not a real measurement.
The actual measurement on the same skill (license + gitignore, nothing else
applicable) is 20 — which the skill-type-aware fix correctly normalizes to
100 since the missing axes are N/A for a markdown skill, not failures.

All 343 affected tests pass (288 web3 + 55 metrics). Two pre-existing
e2e/web3.test.ts failures about WEB3_RULES registry length are unrelated
to this change (verified by stashing and re-running on HEAD).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Markeljan Markeljan merged commit 7f3a58d into main May 21, 2026
1 check passed
@Markeljan Markeljan deleted the claude/scoring-honesty branch May 21, 2026 14:43
Markeljan added a commit that referenced this pull request May 21, 2026
…imitations heading

- Delete docs/odos-skill-recommendations.md — was a one-off snapshot
  against agentsec v0.2.7 + odos-skills@f88b7c89; PR #20 invalidated
  the numbers (Odos moved 47/D → 82/B on a newer commit) and the doc
  is no longer worth maintaining.
- Drop the now-dangling JSDoc citation in openclaw/src/formats.ts.
- Rename packages/web3/README.md "Limitations of v0.2.0" → "Current
  limitations" — the bullets describe current limitations, not
  0.2.0-specific historical ones, and stamping a version on the
  heading invites the same drift the Odos doc just hit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>