fix(scoring): honest grades — wire metrics analyzer, calibrate web3 severities, treat code-quality signals as N/A for markdown skills by Markeljan · Pull Request #20 · semiotic-ai/agentsec

Markeljan · 2026-05-21T14:38:15Z

Summary

Three independent scoring bugs were compounding to give every skill the wrong grade. On the Odos skill pack (the reference web3-router skill in our recs doc), these fixes move the grade from 47/D to 82/B without any upstream changes.

🐛 MetricsAnalyzer was silently broken on every audit — every score relied on a hardcoded fallback
📉 AST-W12 + W04M severities were miscalibrated as HIGH/MEDIUM for governance hygiene that isn't exploitable
🪜 Code-library scoring (tests/CI/types/linter/formatter) was being applied to markdown-only skill packs

The 3 bugs and the fixes

1. `MetricsAnalyzer` was silently broken on every audit

packages/cli/src/commands/audit.ts:181 called new MetricsAnalyzer() with no constructor arg and analyzer.analyze(skill) with the wrong arg signature. The analyzer's actual API is new MetricsAnalyzer(skill) then .analyze() zero-arg — so the wrong call path threw on this.skill.files, the try/catch swallowed the exception, and the code fell through to a hardcoded default:

return {
  // ...
  documentationScore: 0,
  maintenanceHealth: 50,
  hasReadme: false,
  hasLicense: false,
  // ...
};

Every audit of every skill has therefore been receiving these fallback defaults, regardless of what the skill actually shipped. The Odos audit reported hasReadme: false and hasLicense: false even though the repo has a 2,703-byte README.md and a LICENSE file at the root.

Fixed the call site to use the analyzer's real signature: pass the skill into the constructor, then call .analyze() with no arguments.
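The failure mode can be reproduced in a few lines. The sketch below is illustrative only: `Skill`, `Metrics`, `auditBroken`, `auditFixed`, and the body of this stand-in `MetricsAnalyzer` are hypothetical, and the placeholder scores are not the real scorer. Only the call shapes and the fallback values come from the description above.

```typescript
// Hypothetical stand-ins for illustration; the real types live in the agentsec packages.
interface Skill {
  files: string[];
}

interface Metrics {
  documentationScore: number;
  maintenanceHealth: number;
  hasReadme: boolean;
  hasLicense: boolean;
}

// Fallback values quoted in this PR description.
const FALLBACK: Metrics = {
  documentationScore: 0,
  maintenanceHealth: 50,
  hasReadme: false,
  hasLicense: false,
};

class MetricsAnalyzer {
  private skill: Skill;

  constructor(skill: Skill) {
    this.skill = skill;
  }

  analyze(): Metrics {
    // Throws a TypeError when the constructor was called without a skill.
    const names = this.skill.files.map((f) => f.toLowerCase());
    const hasReadme = names.includes("readme.md");
    const hasLicense = names.includes("license");
    // Placeholder scores, not the real scoring logic.
    return {
      documentationScore: hasReadme ? 1 : 0,
      maintenanceHealth: hasLicense ? 100 : 0,
      hasReadme,
      hasLicense,
    };
  }
}

// Old call shape: no constructor argument, skill passed to analyze().
function auditBroken(skill: Skill): Metrics {
  try {
    const analyzer = new (MetricsAnalyzer as any)();
    return analyzer.analyze(skill);
  } catch {
    return FALLBACK; // the swallowed exception hides the bug
  }
}

// Fixed call shape: skill goes into the constructor, analyze() takes no arguments.
function auditFixed(skill: Skill): Metrics {
  return new MetricsAnalyzer(skill).analyze();
}
```

With a skill that ships `README.md` and `LICENSE`, `auditBroken` still reports `hasReadme: false` while `auditFixed` reports `true`. Logging or rethrowing in the `catch` would have surfaced the mismatch much earlier.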

2. AST-W12 + W04M severities were miscalibrated

Four findings were rated HIGH/MEDIUM despite being governance / forensics hygiene gaps, not exploitable vulnerabilities:

Rule	Before	After	Why
W12-001 audit sink	high	low	Forensics-trail nice-to-have; most skills today have none
W12-002 kill-switch	high	low	Operational nice-to-have; W12-020 still escalates on autonomy claims
W12-003 incident runbook	medium	low	Documentation gap only
W04M-002 permissions array	medium	low	Manifest hygiene; runtime can enforce via other channels

Per-finding security cost dropped from 21 (high) and 12 (medium) to 6 (low). The audit still surfaces all four as actionable, just no longer treats them as showstoppers. Each rule got an inline rationale comment explaining the calibration.
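As a rough check on the arithmetic, the sketch below assumes the security score is 100 minus the summed per-finding costs. That model is an assumption rather than something this PR states, although it does reproduce the Security values reported for the Odos audit (34 before, 76 after). `securityScore` and `SEVERITY_COST` are hypothetical names, and the critical tier is left out because its cost is not given here.

```typescript
type Severity = "high" | "medium" | "low";

// Per-finding costs quoted in this PR.
const SEVERITY_COST: Record<Severity, number> = { high: 21, medium: 12, low: 6 };

// Assumption: score = 100 minus the summed costs, floored at 0.
function securityScore(findings: Severity[]): number {
  const cost = findings.reduce((sum, s) => sum + SEVERITY_COST[s], 0);
  return Math.max(0, 100 - cost);
}

// W12-001, W12-002, W12-003, W04M-002 under the old and new calibration.
const before: Severity[] = ["high", "high", "medium", "medium"];
const after: Severity[] = ["low", "low", "low", "low"];

console.log(securityScore(before)); // 34
console.log(securityScore(after)); // 76
```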

3. Code-library scoring model was applied to markdown-only skills

The maintenance scorer demanded tests (25 pts), CI (20 pts), types (20 pts), linter (10 pts), formatter (5 pts) — 80% of its weight — and the documentation scorer demanded inline comments and JSDoc (65% of its weight). All absurd for a 6-file markdown skill pack with no source code to test, type, lint, or document with inline comments.

Both scorers now detect markdown-only skills (no .ts/.js/.py/.rb/.rs/.go/etc files) and:

Maintenance: treat tests/CI/types/linter/formatter as N/A (full credit). Score still hinges on license, gitignore, and the other axes that actually apply.
Documentation: let the README absorb the full weight when there's no source code, so a strong README earns the full 1.0 instead of being capped at 0.35 by absent code-doc signals.
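A minimal sketch of the skill-type-aware scoring described above, under stated assumptions: the extension list is abbreviated, the 10/10 split of the remaining 20 maintenance points between license and gitignore is a guess, and all names (`isMarkdownOnly`, `maintenanceScore`, `documentationScore`) are hypothetical rather than the real scorer API.

```typescript
// Abbreviated, hypothetical extension list; the real detector covers more languages.
const SOURCE_EXTENSIONS = [".ts", ".js", ".py", ".rb", ".rs", ".go"];

function isMarkdownOnly(files: string[]): boolean {
  return !files.some((f) =>
    SOURCE_EXTENSIONS.some((ext) => f.toLowerCase().endsWith(ext)),
  );
}

interface MaintenanceSignals {
  hasTests: boolean;
  hasCI: boolean;
  hasTypes: boolean;
  hasLinter: boolean;
  hasFormatter: boolean;
  hasLicense: boolean;
  hasGitignore: boolean;
}

// Code-quality weights quoted in this PR (80 points in total).
const CODE_AXES: [keyof MaintenanceSignals, number][] = [
  ["hasTests", 25],
  ["hasCI", 20],
  ["hasTypes", 20],
  ["hasLinter", 10],
  ["hasFormatter", 5],
];

// Assumption: the remaining 20 points are split evenly.
const GENERAL_AXES: [keyof MaintenanceSignals, number][] = [
  ["hasLicense", 10],
  ["hasGitignore", 10],
];

function maintenanceScore(s: MaintenanceSignals, markdownOnly: boolean): number {
  let score = 0;
  for (const [key, points] of CODE_AXES) {
    // N/A axes earn full credit on markdown-only skills.
    if (markdownOnly || s[key]) score += points;
  }
  for (const [key, points] of GENERAL_AXES) {
    if (s[key]) score += points;
  }
  return score;
}

// readme and codeDocs are quality ratios in [0, 1].
function documentationScore(readme: number, codeDocs: number, markdownOnly: boolean): number {
  return markdownOnly ? readme : 0.35 * readme + 0.65 * codeDocs;
}
```

For a pack with a license and a gitignore but no code-quality tooling, `maintenanceScore` returns 20 under the code-library model and 100 once the skill is recognized as markdown-only. A perfect README scores 0.35 under the old documentation model and 1.0 under the new one.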

Verified end-to-end against the real Odos skill

Audited odos-xyz/odos-skills@93b3db6 before and after:

Metric	Before	After
Overall	47 / D	82 / B
Security	34	76
Quality	65	80
Maintenance	50†	100
Critical findings	0	0
High findings	2	0
Medium findings	2	0
Low findings	0	4

† The 50 was the broken-analyzer fallback default, not a real measurement. The real measurement on the same skill is 20 (license + gitignore, nothing else applicable on a markdown skill) — which the skill-type-aware fix correctly normalizes to 100 since the missing axes are N/A, not failures.

To reach A (90+), Odos needs to apply rec #8 of docs/odos-skill-recommendations.md — the same upstream PR our recs doc has been asking for all along.

Test plan

cd packages/web3 && bun test — 288 pass / 0 fail
cd packages/metrics && bun test — 55 pass / 0 fail
End-to-end audit on odos-xyz/odos-skills@93b3db6 produces 82/B with the new findings + metrics
No new test failures introduced (verified 2 failing e2e/web3.test.ts tests about WEB3_RULES registry length are pre-existing on HEAD)
Manual sanity check on the fixtures in e2e/fixtures/profiles/ to confirm grade movements are reasonable
Eyeball the example reports in examples/ if they get regenerated as part of the version bump

Notes

This PR is independent of fix(openclaw): parse nested SKILL.md frontmatter + hoist metadata.openclaw.web3 #19 (the parser fix). #19 is what made the W12 findings visible in the first place; this PR is what stops those findings (and the broken analyzer fallback) from inappropriately tanking the grade. The two compound nicely: merged together, the Odos audit goes from a false-clean 65/C to a real 82/B.
The two pre-existing failing tests in e2e/web3.test.ts (WEB3_RULES registry > contains exactly 12 rules and every rule has a unique category) are registry drift unrelated to this change. Verified by stashing and re-running on HEAD.

🤖 Generated with Claude Code

…everities, treat code-quality signals as N/A for markdown skills

Three independent scoring bugs were compounding to give every skill the wrong grade. On the Odos skill pack (the reference web3-router skill), these fixes move the grade from 47/D to 82/B without any upstream changes.

1. **MetricsAnalyzer was silently broken on every audit.** The CLI called `new MetricsAnalyzer()` with no constructor arg and then `.analyze(skill)` with the wrong arg signature: the access to `this.skill.files` threw, the try/catch swallowed the error, and the code fell through to a hardcoded default of `{hasReadme: false, hasLicense: false, maintenanceHealth: 50}` regardless of what the skill actually shipped. Fixed the call site to pass the skill into the constructor and call `.analyze()` zero-arg.

2. **AST-W12 + W04M severities were miscalibrated as HIGH/MEDIUM.** Audit sink, kill-switch contract, incident runbook, and missing `permissions[]` are governance / forensics hygiene gaps, not exploitable vulnerabilities; most skills today have none of these. Surfacing them as HIGH (21 security-score points each) tanked otherwise-clean skills and trained authors to treat them as noise. Downgraded to LOW (6 points each) with rationale comments on each rule. The autonomy-claim companion check (W12-020) still escalates when a skill markets itself as autonomous without a kill-switch.

3. **Code-library scoring model was applied to markdown-only skill packs.** The maintenance scorer demanded tests / CI / types / linter / formatter (80% of its weight) and the documentation scorer demanded inline comments and JSDoc (65% of its weight), all absurd for a 6-file markdown skill that has no source code to test, type, or lint. Both scorers now detect markdown-only skills (no `.ts`/`.js`/`.py`/etc) and treat those signals as N/A: the maintenance scorer gives full credit for the inapplicable axes, and the documentation scorer lets a strong README absorb the full doc weight (up from a 0.35 cap).

Verified end-to-end against the real Odos skill at github.com/odos-xyz/odos-skills@93b3db6:

| Metric        | Before | After |
|---------------|--------|-------|
| Overall       | 47 D   | 82 B  |
| Security      | 34     | 76    |
| Quality       | 65     | 80    |
| Maintenance   | 50†    | 100   |

† The 50 was the broken-analyzer fallback default, not a real measurement. The actual measurement on the same skill (license + gitignore, nothing else applicable) is 20, which the skill-type-aware fix correctly normalizes to 100 since the missing axes are N/A for a markdown skill, not failures.

All 343 affected tests pass (288 web3 + 55 metrics). Two pre-existing e2e/web3.test.ts failures about WEB3_RULES registry length are unrelated to this change (verified by stashing and re-running on HEAD).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…imitations heading

- Delete docs/odos-skill-recommendations.md: was a one-off snapshot against agentsec v0.2.7 + odos-skills@f88b7c89; PR #20 invalidated the numbers (Odos moved 47/D → 82/B on a newer commit) and the doc is no longer worth maintaining.
- Drop the now-dangling JSDoc citation in openclaw/src/formats.ts.
- Rename packages/web3/README.md "Limitations of v0.2.0" → "Current limitations": the bullets describe current limitations, not 0.2.0-specific historical ones, and stamping a version on the heading invites the same drift the Odos doc just hit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Markeljan merged commit 7f3a58d into main May 21, 2026
1 check passed

Markeljan deleted the claude/scoring-honesty branch May 21, 2026 14:43

Markeljan merged 1 commit into main from claude/scoring-honesty
