Audit and harden skills + agents architecture against Anthropic best-practices
Why now
A 2026-05-10 audit of .claude/agents/ (35 agents) and .claude/skills/ (37 skills, plus ~43 plugin skills auto-loaded) compared against Anthropic's primary docs surfaces a small number of latent risks. The most pressing: at ~80 auto-discoverable skills × ~125 chars median listing entry, this repo sits at or just over the 10,000-char skills-listing budget (Skills > "Skill descriptions are cut short", code.claude.com). Past that line, descriptions are silently truncated or dropped entirely — names always remain, but the trigger keywords disappear, so auto-routing degrades without any error surfaced. Several other items below are pure hygiene that the official docs flag explicitly.
This issue captures findings from a structured deep-dive (architecture audit + four parallel research probes covering best-practices, real-world failure modes, conductor patterns, token-budget mechanics, and trigger-description engineering). All claims are linked to primary sources retrieved on 2026-05-10.
Scope: template-self improvement. Tag track:specorator-improvement.
Findings — Critical
C1. At the skills-listing budget cliff edge
- Symptom: With ~80 auto-discoverable skills (37 in-repo + ~43 plugin), median listing entry ~125 chars, total ≈ 10,000 chars. Anthropic's listing budget is
1% × context window with a fallback of 8,000 chars. On Opus 4.7 (1M context) that is exactly 10,000 chars. Past it, descriptions are silently shortened or dropped (names always preserved); trigger keywords disappear → undertrigger.
- Source: Extend Claude with skills > "Skill descriptions are cut short"; per-entry cap 1,536 chars confirmed in issue #47627 (raised from 250 in v2.1.105). Engineering blog estimate
~100 tokens median metadata: anthropic.com/engineering/equipping-agents….
- Proposed fix:
- Cap
description at ≤200 chars per skill in .claude/skills/*/SKILL.md (~80 × 200 = 16,000 chars envelope).
- Bump
SLASH_COMMAND_TOOL_CHAR_BUDGET=20000 in user-settings guidance for adopters (docs/specorator-product/tech.md).
- Set
disable-model-invocation: true on user-only utilities — caveman-help, token-budget-review, verify, new-adr, new-glossary-entry, record-decision — they leave the always-on listing entirely; still invocable via /name.
- Set
skillOverrides: name-only on dormant skills via settings.local.json template.
- Verification: run
/context and /skills before + after; record character delta in PR body.
C2. Two oversized skills
- Symptom:
.claude/skills/tackle-issue/SKILL.md is 337 lines; .claude/skills/issue-breakdown/SKILL.md is 164 lines. Anthropic recommends ≤500 lines as a hard ceiling, but 337 lines is high enough that the body crowds the post-compaction 5,000-token / 25,000-token-combined re-attachment budget when other skills are also active.
- Source: Skills > "Keep SKILL.md under 500 lines"; skill content lifecycle caps quoted in Skills > "Skill content lifecycle".
- Proposed fix: split each into
SKILL.md + supporting references/ files (one level deep only, per docs). For tackle-issue/: extract Phase 3+ procedure into references/execution.md; keep triage logic in SKILL.md. For issue-breakdown/: extract slice-template + examples into references/slicing.md and examples/.
- Acceptance: both
SKILL.md bodies < 200 lines; references one level deep; existing behaviour unchanged.
C3. No frontmatter linter / CI gate for skills + agents
- Symptom: Frontmatter shape is convention-only. No CI enforces
description ≤ 1024 chars, third-person voice, "use when" phrasing, or tools: allowlist explicitness on agents.
- Source: Required field rules and limits in Skill authoring best practices; third-person rule is the only
<Warning> in that doc. wshobson/agents PluginEval codifies the practical lint vocabulary (EMPTY_DESCRIPTION, MISSING_TRIGGER, BLOATED_SKILL, ORPHAN_REFERENCE, OVER_CONSTRAINED).
- Proposed fix: add
scripts/verify/skills-frontmatter.ts to the verify gate. Checks per skill:
name present, lowercase + hyphens, ≤64 chars, not anthropic / claude.
description present, ≤1024 chars, ≤200 char project-cap, third-person (no ^I /I'll /You can /Helps you openers), contains a Use when clause.
- SKILL.md body ≤500 lines.
- No second-level references (file linked from a referenced file).
- Per agent:
tools: field present and non-empty; description ≤1024 chars; name rules.
- Acceptance: gate runs on every PR; current repo passes after C1+C2+M1 land.
Findings — Medium
M1. Description voice + format inconsistency
- Symptom: ~7 skills open with first/second person or trigger-leaks (workflow steps in description). Frontmatter format mixes single-line and YAML
> fold across caveman*, issue-pr-sync, issue-draft. Agents are uniformly single-line; skills are not.
- Source: Anthropic best-practices doc warns explicitly: "Always write in third person… inconsistent point-of-view can cause discovery problems." obra/superpowers writing-skills body documents that workflow-leaks in description make Claude follow the description and skip the body.
- Proposed fix: sweep all 37 skills; rewrite descriptions to
<verb-phrase>. Use when <triggers>. pattern. Single-line YAML for every skill. Document the rule in .claude/skills/README.md and in the new linter.
M2. Four practice skills lack examples
- Symptom:
grill, tdd-cycle, tracer-bullet, record-decision have no examples/ or references/ — body prose is the only invocation guide. Cognitive load is high; trigger reliability suffers.
- Source: skill-creator SKILL.md recommends references one level deep when triggers are subtle.
- Proposed fix: add 1–2 worked examples per skill in
examples/ (one happy path, one edge case). Linked from SKILL.md body, not preloaded.
M3. Default model on read-heavy agents leaves cost lever unused
- Symptom: All 35 agents use the inherited (Opus by default) model. Read-heavy roles —
analyst, legacy-auditor, reviewer, project-reviewer, retrospective, brand-reviewer — could run on Haiku 4.5 with no quality regression, at ~5× cost reduction.
- Source: sub-agents > "Control costs": "Control costs by routing tasks to faster, cheaper models like Haiku." The built-in
Explore agent is Anthropic's own demonstration. wshobson/agents's four-tier strategy is the most explicit external example.
- Proposed fix: add
model: haiku to the six read-heavy agents above. Document the heuristic in .claude/agents/README.md: read-heavy → Haiku; reasoning-heavy → Opus; default → inherit.
M4. Subagent tools: allowlists not consistently scoped
- Symptom: Most agents follow least-privilege correctly. Mild inconsistencies:
roadmap-manager carries WebSearch+WebFetch while peer portfolio-manager does not despite similar information-gathering scope; product-page-designer has Bash while same-stage ui-designer does not.
- Source: sub-agents docs: inheritance default is wrong default for production. Adam Wolf (Claude Code core engineer) quoted in claudekit.cc 2025-12: "Sub agents work best when they just look for information and provide a small amount of summary back."
- Proposed fix: review each tool grant against the agent's documented purpose; remove unused. Add the rationale as a one-line comment in each agent body where the choice is non-obvious.
M5. Skill trigger-evals not part of authoring contract
- Symptom: No skill in this repo has a 20-query eval; description changes ship blind.
- Source: skill-creator SKILL.md — "Create 20 eval queries — a mix of should-trigger and should-not-trigger."
- Proposed fix: add
eval.json to each skill (8–10 should-trigger queries, 8–10 should-not). Run on description-only PRs. Start with the 12 conductor skills (highest blast radius); expand opportunistically. Eval runner can borrow from hqhq1025/skill-optimizer (community tool, 60/40 train/test split, 3 runs/query).
M6. Conductor-calling-conductor stacks SKILL.md bodies in context
- Symptom: When
orchestrate invokes spec:design which invokes arc42-baseline, all three SKILL.md bodies stay in context until compaction. Specorator's track conductors call inner stage skills routinely, so this stacking is the steady-state pattern.
- Source: Skills > Skill content lifecycle — "the rendered SKILL.md content enters the conversation as a single message and stays there for the rest of the session."
- Proposed fix: for inner stage skills called only by a conductor, audit whether they should be inlined as
references/ to the conductor instead of separate top-level skills. Candidates: spec:clarify, spec:analyze, the per-phase discovery commands when invoked from discovery-sprint. Don't blanket-merge — keep direct user-invocability where it matters.
Findings — Low
L1. Deprecated ubiquitous-language skill still loads
- Symptom:
.claude/skills/ubiquitous-language/SKILL.md is 93 lines, marked deprecated by ADR-0010, still in the always-on listing.
- Proposed fix: add
disable-model-invocation: true and description: '[Deprecated by ADR-0010]…' so it leaves the auto-routing pool but stays for fork compatibility.
L2. No agent–skill adjacency map
- Symptom: Which agents invoke which skills at which stage is implicit in body prose. New contributors have no overview.
- Proposed fix: generate
docs/agent-skill-matrix.md from the canonical .claude/agents/ + .claude/skills/ (extracted from frontmatter description and body cross-refs). Include in verify so it stays current.
L3. Argument-hint inconsistency
- Symptom:
argument-hint: is set on most conductors but missing on arc42-baseline, improve-codebase-architecture, project-run, etc.
- Proposed fix: sweep; add where the skill takes meaningful arguments.
L4. Plugin-skill description budget invisible to repo
- Symptom: ~43 plugin skills add to the listing budget but cannot be lint-controlled from this repo. They cannot be
skillOverridesd either (see Skills > "Override skill visibility").
- Proposed fix: document in
docs/specorator-product/tech.md: which plugins this repo expects to coexist with, and the recommended SLASH_COMMAND_TOOL_CHAR_BUDGET floor for adopters. Future contributors then know the headroom they're working in.
L5. Windows-platform footguns documented
- Symptom: Active Claude Code bugs disproportionately hit Windows: PostToolUse
Skill matcher drops stdin payload on Git Bash (#57250); context: fork skill spawn broken since v2.1.113 (#51165); background subagent permissions regressed in v2.1.114 (#50267). Specorator does not currently use context: fork, so the second item is informational; the first applies if any hooks rely on PostToolUse Skill payload.
- Proposed fix: check
.claude/settings.json for Skill-matched PostToolUse hooks; if present, add a Windows fallback or move the telemetry inside the slash-command body. Add a short Windows-platform notes section to docs/verify-gate.md.
Findings — Strengths to preserve
Audit-grade good news, worth recording so future template improvements don't accidentally regress:
- Frontmatter conformance is 35/35 agents with
name/description/tools. No drift.
- Tool restriction by stage role is intentional and consistent (design phases lack Bash; build phases have Bash). Documented in
.claude/agents/README.md.
- Permission deny-list (no push to main/develop/demo, no
--no-verify) is the right backstop.
- The conductor-skills idiom (12 conductors + AskUserQuestion gates + file-based
workflow-state.md per feature) is idiomatic by community convention, even though "conductor" is not first-party Anthropic vocabulary. Closest first-party term is "team lead" in agent-teams docs. The pattern is well-established in wshobson/agents (Conductor plugin) and htdocs.dev write-ups.
- ADR-blessed track taxonomy (ADR-0026) is unusual but coherent — methodology layer above skills/agents, not in conflict with first-party model.
/specorator:update, /token-review, and the verify gate already cover the self-improvement infra most other repos lack.
Acceptance criteria
Out of scope
- Migrating away from the conductor-skill pattern (idiomatic; not a defect).
- Renaming
agentic-workflow → specorator (locked by ADR-0024).
- Anything that would change the v1 track taxonomy (ADR-0026).
- Redesigning the
inputs/ ingestion convention (separate concern).
Suggested ordering
- C1 → C2 → C3 (architectural; everything else builds on the linter).
- M1 → M3 → M4 (low-risk sweeps; hit M3 + M4 in same PR).
- M2 → M6 (touches authoring patterns; can ship with C2).
- M5 (depends on linter; eval framework is its own slice).
- L-series, opportunistically.
References
Primary sources, all retrieved 2026-05-10:
Internal context:
Generated from a 2026-05-10 audit of .claude/agents/ (35), .claude/skills/ (37 in-repo + ~43 plugin) plus four parallel research probes against Anthropic primary docs and high-signal community sources.
Audit and harden skills + agents architecture against Anthropic best-practices
Why now
A 2026-05-10 audit of
.claude/agents/(35 agents) and.claude/skills/(37 skills, plus ~43 plugin skills auto-loaded) compared against Anthropic's primary docs surfaces a small number of latent risks. The most pressing: at ~80 auto-discoverable skills × ~125 chars median listing entry, this repo sits at or just over the 10,000-char skills-listing budget (Skills > "Skill descriptions are cut short", code.claude.com). Past that line, descriptions are silently truncated or dropped entirely — names always remain, but the trigger keywords disappear, so auto-routing degrades without any error surfaced. Several other items below are pure hygiene that the official docs flag explicitly.This issue captures findings from a structured deep-dive (architecture audit + four parallel research probes covering best-practices, real-world failure modes, conductor patterns, token-budget mechanics, and trigger-description engineering). All claims are linked to primary sources retrieved on 2026-05-10.
Scope: template-self improvement. Tag
track:specorator-improvement.Findings — Critical
C1. At the skills-listing budget cliff edge
1% × context windowwith a fallback of 8,000 chars. On Opus 4.7 (1M context) that is exactly 10,000 chars. Past it, descriptions are silently shortened or dropped (names always preserved); trigger keywords disappear → undertrigger.~100 tokensmedian metadata: anthropic.com/engineering/equipping-agents….descriptionat ≤200 chars per skill in.claude/skills/*/SKILL.md(~80 × 200 = 16,000 chars envelope).SLASH_COMMAND_TOOL_CHAR_BUDGET=20000in user-settings guidance for adopters (docs/specorator-product/tech.md).disable-model-invocation: trueon user-only utilities —caveman-help,token-budget-review,verify,new-adr,new-glossary-entry,record-decision— they leave the always-on listing entirely; still invocable via/name.skillOverrides: name-onlyon dormant skills viasettings.local.jsontemplate./contextand/skillsbefore + after; record character delta in PR body.C2. Two oversized skills
.claude/skills/tackle-issue/SKILL.mdis 337 lines;.claude/skills/issue-breakdown/SKILL.mdis 164 lines. Anthropic recommends ≤500 lines as a hard ceiling, but 337 lines is high enough that the body crowds the post-compaction 5,000-token / 25,000-token-combined re-attachment budget when other skills are also active.SKILL.md+ supportingreferences/files (one level deep only, per docs). Fortackle-issue/: extract Phase 3+ procedure intoreferences/execution.md; keep triage logic inSKILL.md. Forissue-breakdown/: extract slice-template + examples intoreferences/slicing.mdandexamples/.SKILL.mdbodies < 200 lines; references one level deep; existing behaviour unchanged.C3. No frontmatter linter / CI gate for skills + agents
description ≤ 1024 chars, third-person voice, "use when" phrasing, ortools:allowlist explicitness on agents.<Warning>in that doc. wshobson/agentsPluginEvalcodifies the practical lint vocabulary (EMPTY_DESCRIPTION, MISSING_TRIGGER, BLOATED_SKILL, ORPHAN_REFERENCE, OVER_CONSTRAINED).scripts/verify/skills-frontmatter.tsto theverifygate. Checks per skill:namepresent, lowercase + hyphens, ≤64 chars, notanthropic/claude.descriptionpresent, ≤1024 chars, ≤200 char project-cap, third-person (no^I/I'll/You can/Helps youopeners), contains aUse whenclause.tools:field present and non-empty;description≤1024 chars;namerules.Findings — Medium
M1. Description voice + format inconsistency
>fold acrosscaveman*,issue-pr-sync,issue-draft. Agents are uniformly single-line; skills are not.<verb-phrase>. Use when <triggers>.pattern. Single-line YAML for every skill. Document the rule in.claude/skills/README.mdand in the new linter.M2. Four practice skills lack examples
grill,tdd-cycle,tracer-bullet,record-decisionhave noexamples/orreferences/— body prose is the only invocation guide. Cognitive load is high; trigger reliability suffers.examples/(one happy path, one edge case). Linked from SKILL.md body, not preloaded.M3. Default model on read-heavy agents leaves cost lever unused
analyst,legacy-auditor,reviewer,project-reviewer,retrospective,brand-reviewer— could run on Haiku 4.5 with no quality regression, at ~5× cost reduction.Exploreagent is Anthropic's own demonstration. wshobson/agents's four-tier strategy is the most explicit external example.model: haikuto the six read-heavy agents above. Document the heuristic in.claude/agents/README.md: read-heavy → Haiku; reasoning-heavy → Opus; default → inherit.M4. Subagent
tools:allowlists not consistently scopedroadmap-managercarriesWebSearch+WebFetchwhile peerportfolio-managerdoes not despite similar information-gathering scope;product-page-designerhas Bash while same-stageui-designerdoes not.M5. Skill trigger-evals not part of authoring contract
eval.jsonto each skill (8–10 should-trigger queries, 8–10 should-not). Run on description-only PRs. Start with the 12 conductor skills (highest blast radius); expand opportunistically. Eval runner can borrow from hqhq1025/skill-optimizer (community tool, 60/40 train/test split, 3 runs/query).M6. Conductor-calling-conductor stacks SKILL.md bodies in context
orchestrateinvokesspec:designwhich invokesarc42-baseline, all three SKILL.md bodies stay in context until compaction. Specorator's track conductors call inner stage skills routinely, so this stacking is the steady-state pattern.references/to the conductor instead of separate top-level skills. Candidates:spec:clarify,spec:analyze, the per-phase discovery commands when invoked fromdiscovery-sprint. Don't blanket-merge — keep direct user-invocability where it matters.Findings — Low
L1. Deprecated
ubiquitous-languageskill still loads.claude/skills/ubiquitous-language/SKILL.mdis 93 lines, marked deprecated by ADR-0010, still in the always-on listing.disable-model-invocation: trueanddescription: '[Deprecated by ADR-0010]…'so it leaves the auto-routing pool but stays for fork compatibility.L2. No agent–skill adjacency map
docs/agent-skill-matrix.mdfrom the canonical.claude/agents/+.claude/skills/(extracted from frontmatterdescriptionand body cross-refs). Include inverifyso it stays current.L3. Argument-hint inconsistency
argument-hint:is set on most conductors but missing onarc42-baseline,improve-codebase-architecture,project-run, etc.L4. Plugin-skill description budget invisible to repo
skillOverridesd either (see Skills > "Override skill visibility").docs/specorator-product/tech.md: which plugins this repo expects to coexist with, and the recommendedSLASH_COMMAND_TOOL_CHAR_BUDGETfloor for adopters. Future contributors then know the headroom they're working in.L5. Windows-platform footguns documented
Skillmatcher drops stdin payload on Git Bash (#57250);context: forkskill spawn broken since v2.1.113 (#51165); background subagent permissions regressed in v2.1.114 (#50267). Specorator does not currently usecontext: fork, so the second item is informational; the first applies if any hooks rely on PostToolUse Skill payload..claude/settings.jsonforSkill-matched PostToolUse hooks; if present, add a Windows fallback or move the telemetry inside the slash-command body. Add a short Windows-platform notes section todocs/verify-gate.md.Findings — Strengths to preserve
Audit-grade good news, worth recording so future template improvements don't accidentally regress:
name/description/tools. No drift..claude/agents/README.md.--no-verify) is the right backstop.workflow-state.mdper feature) is idiomatic by community convention, even though "conductor" is not first-party Anthropic vocabulary. Closest first-party term is "team lead" in agent-teams docs. The pattern is well-established inwshobson/agents(Conductor plugin) andhtdocs.devwrite-ups./specorator:update,/token-review, and theverifygate already cover the self-improvement infra most other repos lack.Acceptance criteria
descriptioncap ≤200 chars enforced; ≤6 skills move todisable-model-invocation: true;/contextshows ≥1500 chars headroom over the 1% budget.tackle-issue/SKILL.mdandissue-breakdown/SKILL.md< 200 lines; supporting refs added; behaviour unchanged.scripts/verify/skills-frontmatter.tslands and is wired intoverify; all current skills + agents pass.descriptionfollows<verb-phrase>. Use when <triggers>.in third person; YAML format single-line.grill,tdd-cycle,tracer-bullet,record-decision).model: haiku; rationale documented.eval.json(20 queries); CI runs eval on description-only PRs.spec:clarify,spec:analyze, per-phase discovery commands.wontfixrationale.Out of scope
agentic-workflow→specorator(locked by ADR-0024).inputs/ingestion convention (separate concern).Suggested ordering
References
Primary sources, all retrieved 2026-05-10:
anthropics/skills/skill-creator/SKILL.mdobra/superpowers—writing-skillsand friendswshobson/agents— four-tier model assignment + Conductor plugin patterncontext: forkWindowsInternal context:
docs/verify-gate.md,docs/branching.md.claude/skills/README.md,.claude/agents/README.mdGenerated from a 2026-05-10 audit of
.claude/agents/(35),.claude/skills/(37 in-repo + ~43 plugin) plus four parallel research probes against Anthropic primary docs and high-signal community sources.