Skip to content

Audit and harden skills + agents architecture against Anthropic best-practices #462

@Luis85

Description

@Luis85

Audit and harden skills + agents architecture against Anthropic best-practices

Why now

A 2026-05-10 audit of .claude/agents/ (35 agents) and .claude/skills/ (37 skills, plus ~43 plugin skills auto-loaded) compared against Anthropic's primary docs surfaces a small number of latent risks. The most pressing: at ~80 auto-discoverable skills × ~125 chars median listing entry, this repo sits at or just over the 10,000-char skills-listing budget (Skills > "Skill descriptions are cut short", code.claude.com). Past that line, descriptions are silently truncated or dropped entirely — names always remain, but the trigger keywords disappear, so auto-routing degrades without any error surfaced. Several other items below are pure hygiene that the official docs flag explicitly.

This issue captures findings from a structured deep-dive (architecture audit + four parallel research probes covering best-practices, real-world failure modes, conductor patterns, token-budget mechanics, and trigger-description engineering). All claims are linked to primary sources retrieved on 2026-05-10.

Scope: template-self improvement. Tag track:specorator-improvement.


Findings — Critical

C1. At the skills-listing budget cliff edge

  • Symptom: With ~80 auto-discoverable skills (37 in-repo + ~43 plugin), median listing entry ~125 chars, total ≈ 10,000 chars. Anthropic's listing budget is 1% × context window with a fallback of 8,000 chars. On Opus 4.7 (1M context) that is exactly 10,000 chars. Past it, descriptions are silently shortened or dropped (names always preserved); trigger keywords disappear → undertrigger.
  • Source: Extend Claude with skills > "Skill descriptions are cut short"; per-entry cap 1,536 chars confirmed in issue #47627 (raised from 250 in v2.1.105). Engineering blog estimate ~100 tokens median metadata: anthropic.com/engineering/equipping-agents….
  • Proposed fix:
    1. Cap description at ≤200 chars per skill in .claude/skills/*/SKILL.md (~80 × 200 = 16,000 chars envelope).
    2. Bump SLASH_COMMAND_TOOL_CHAR_BUDGET=20000 in user-settings guidance for adopters (docs/specorator-product/tech.md).
    3. Set disable-model-invocation: true on user-only utilities — caveman-help, token-budget-review, verify, new-adr, new-glossary-entry, record-decision — they leave the always-on listing entirely; still invocable via /name.
    4. Set skillOverrides: name-only on dormant skills via settings.local.json template.
  • Verification: run /context and /skills before + after; record character delta in PR body.

C2. Two oversized skills

  • Symptom: .claude/skills/tackle-issue/SKILL.md is 337 lines; .claude/skills/issue-breakdown/SKILL.md is 164 lines. Anthropic recommends ≤500 lines as a hard ceiling, but 337 lines is high enough that the body crowds the post-compaction 5,000-token / 25,000-token-combined re-attachment budget when other skills are also active.
  • Source: Skills > "Keep SKILL.md under 500 lines"; skill content lifecycle caps quoted in Skills > "Skill content lifecycle".
  • Proposed fix: split each into SKILL.md + supporting references/ files (one level deep only, per docs). For tackle-issue/: extract Phase 3+ procedure into references/execution.md; keep triage logic in SKILL.md. For issue-breakdown/: extract slice-template + examples into references/slicing.md and examples/.
  • Acceptance: both SKILL.md bodies < 200 lines; references one level deep; existing behaviour unchanged.

C3. No frontmatter linter / CI gate for skills + agents

  • Symptom: Frontmatter shape is convention-only. No CI enforces description ≤ 1024 chars, third-person voice, "use when" phrasing, or tools: allowlist explicitness on agents.
  • Source: Required field rules and limits in Skill authoring best practices; third-person rule is the only <Warning> in that doc. wshobson/agents PluginEval codifies the practical lint vocabulary (EMPTY_DESCRIPTION, MISSING_TRIGGER, BLOATED_SKILL, ORPHAN_REFERENCE, OVER_CONSTRAINED).
  • Proposed fix: add scripts/verify/skills-frontmatter.ts to the verify gate. Checks per skill:
    1. name present, lowercase + hyphens, ≤64 chars, not anthropic / claude.
    2. description present, ≤1024 chars, ≤200 char project-cap, third-person (no ^I /I'll /You can /Helps you openers), contains a Use when clause.
    3. SKILL.md body ≤500 lines.
    4. No second-level references (file linked from a referenced file).
  • Per agent: tools: field present and non-empty; description ≤1024 chars; name rules.
  • Acceptance: gate runs on every PR; current repo passes after C1+C2+M1 land.

Findings — Medium

M1. Description voice + format inconsistency

  • Symptom: ~7 skills open with first/second person or trigger-leaks (workflow steps in description). Frontmatter format mixes single-line and YAML > fold across caveman*, issue-pr-sync, issue-draft. Agents are uniformly single-line; skills are not.
  • Source: Anthropic best-practices doc warns explicitly: "Always write in third person… inconsistent point-of-view can cause discovery problems." obra/superpowers writing-skills body documents that workflow-leaks in description make Claude follow the description and skip the body.
  • Proposed fix: sweep all 37 skills; rewrite descriptions to <verb-phrase>. Use when <triggers>. pattern. Single-line YAML for every skill. Document the rule in .claude/skills/README.md and in the new linter.

M2. Four practice skills lack examples

  • Symptom: grill, tdd-cycle, tracer-bullet, record-decision have no examples/ or references/ — body prose is the only invocation guide. Cognitive load is high; trigger reliability suffers.
  • Source: skill-creator SKILL.md recommends references one level deep when triggers are subtle.
  • Proposed fix: add 1–2 worked examples per skill in examples/ (one happy path, one edge case). Linked from SKILL.md body, not preloaded.

M3. Default model on read-heavy agents leaves cost lever unused

  • Symptom: All 35 agents use the inherited (Opus by default) model. Read-heavy roles — analyst, legacy-auditor, reviewer, project-reviewer, retrospective, brand-reviewer — could run on Haiku 4.5 with no quality regression, at ~5× cost reduction.
  • Source: sub-agents > "Control costs": "Control costs by routing tasks to faster, cheaper models like Haiku." The built-in Explore agent is Anthropic's own demonstration. wshobson/agents's four-tier strategy is the most explicit external example.
  • Proposed fix: add model: haiku to the six read-heavy agents above. Document the heuristic in .claude/agents/README.md: read-heavy → Haiku; reasoning-heavy → Opus; default → inherit.

M4. Subagent tools: allowlists not consistently scoped

  • Symptom: Most agents follow least-privilege correctly. Mild inconsistencies: roadmap-manager carries WebSearch+WebFetch while peer portfolio-manager does not despite similar information-gathering scope; product-page-designer has Bash while same-stage ui-designer does not.
  • Source: sub-agents docs: inheritance default is wrong default for production. Adam Wolf (Claude Code core engineer) quoted in claudekit.cc 2025-12: "Sub agents work best when they just look for information and provide a small amount of summary back."
  • Proposed fix: review each tool grant against the agent's documented purpose; remove unused. Add the rationale as a one-line comment in each agent body where the choice is non-obvious.

M5. Skill trigger-evals not part of authoring contract

  • Symptom: No skill in this repo has a 20-query eval; description changes ship blind.
  • Source: skill-creator SKILL.md"Create 20 eval queries — a mix of should-trigger and should-not-trigger."
  • Proposed fix: add eval.json to each skill (8–10 should-trigger queries, 8–10 should-not). Run on description-only PRs. Start with the 12 conductor skills (highest blast radius); expand opportunistically. Eval runner can borrow from hqhq1025/skill-optimizer (community tool, 60/40 train/test split, 3 runs/query).

M6. Conductor-calling-conductor stacks SKILL.md bodies in context

  • Symptom: When orchestrate invokes spec:design which invokes arc42-baseline, all three SKILL.md bodies stay in context until compaction. Specorator's track conductors call inner stage skills routinely, so this stacking is the steady-state pattern.
  • Source: Skills > Skill content lifecycle"the rendered SKILL.md content enters the conversation as a single message and stays there for the rest of the session."
  • Proposed fix: for inner stage skills called only by a conductor, audit whether they should be inlined as references/ to the conductor instead of separate top-level skills. Candidates: spec:clarify, spec:analyze, the per-phase discovery commands when invoked from discovery-sprint. Don't blanket-merge — keep direct user-invocability where it matters.

Findings — Low

L1. Deprecated ubiquitous-language skill still loads

  • Symptom: .claude/skills/ubiquitous-language/SKILL.md is 93 lines, marked deprecated by ADR-0010, still in the always-on listing.
  • Proposed fix: add disable-model-invocation: true and description: '[Deprecated by ADR-0010]…' so it leaves the auto-routing pool but stays for fork compatibility.

L2. No agent–skill adjacency map

  • Symptom: Which agents invoke which skills at which stage is implicit in body prose. New contributors have no overview.
  • Proposed fix: generate docs/agent-skill-matrix.md from the canonical .claude/agents/ + .claude/skills/ (extracted from frontmatter description and body cross-refs). Include in verify so it stays current.

L3. Argument-hint inconsistency

  • Symptom: argument-hint: is set on most conductors but missing on arc42-baseline, improve-codebase-architecture, project-run, etc.
  • Proposed fix: sweep; add where the skill takes meaningful arguments.

L4. Plugin-skill description budget invisible to repo

  • Symptom: ~43 plugin skills add to the listing budget but cannot be lint-controlled from this repo. They cannot be skillOverridesd either (see Skills > "Override skill visibility").
  • Proposed fix: document in docs/specorator-product/tech.md: which plugins this repo expects to coexist with, and the recommended SLASH_COMMAND_TOOL_CHAR_BUDGET floor for adopters. Future contributors then know the headroom they're working in.

L5. Windows-platform footguns documented

  • Symptom: Active Claude Code bugs disproportionately hit Windows: PostToolUse Skill matcher drops stdin payload on Git Bash (#57250); context: fork skill spawn broken since v2.1.113 (#51165); background subagent permissions regressed in v2.1.114 (#50267). Specorator does not currently use context: fork, so the second item is informational; the first applies if any hooks rely on PostToolUse Skill payload.
  • Proposed fix: check .claude/settings.json for Skill-matched PostToolUse hooks; if present, add a Windows fallback or move the telemetry inside the slash-command body. Add a short Windows-platform notes section to docs/verify-gate.md.

Findings — Strengths to preserve

Audit-grade good news, worth recording so future template improvements don't accidentally regress:

  • Frontmatter conformance is 35/35 agents with name/description/tools. No drift.
  • Tool restriction by stage role is intentional and consistent (design phases lack Bash; build phases have Bash). Documented in .claude/agents/README.md.
  • Permission deny-list (no push to main/develop/demo, no --no-verify) is the right backstop.
  • The conductor-skills idiom (12 conductors + AskUserQuestion gates + file-based workflow-state.md per feature) is idiomatic by community convention, even though "conductor" is not first-party Anthropic vocabulary. Closest first-party term is "team lead" in agent-teams docs. The pattern is well-established in wshobson/agents (Conductor plugin) and htdocs.dev write-ups.
  • ADR-blessed track taxonomy (ADR-0026) is unusual but coherent — methodology layer above skills/agents, not in conflict with first-party model.
  • /specorator:update, /token-review, and the verify gate already cover the self-improvement infra most other repos lack.

Acceptance criteria

  • C1: description cap ≤200 chars enforced; ≤6 skills move to disable-model-invocation: true; /context shows ≥1500 chars headroom over the 1% budget.
  • C2: tackle-issue/SKILL.md and issue-breakdown/SKILL.md < 200 lines; supporting refs added; behaviour unchanged.
  • C3: scripts/verify/skills-frontmatter.ts lands and is wired into verify; all current skills + agents pass.
  • M1: every description follows <verb-phrase>. Use when <triggers>. in third person; YAML format single-line.
  • M2: 1–2 examples per practice skill (grill, tdd-cycle, tracer-bullet, record-decision).
  • M3: read-heavy agents pinned to model: haiku; rationale documented.
  • M4: tool grants reviewed; non-obvious choices have a one-line comment.
  • M5: 12 conductor skills have eval.json (20 queries); CI runs eval on description-only PRs.
  • M6: inline-vs-skill decision recorded for spec:clarify, spec:analyze, per-phase discovery commands.
  • L1–L5: each addressed or deferred with wontfix rationale.

Out of scope

  • Migrating away from the conductor-skill pattern (idiomatic; not a defect).
  • Renaming agentic-workflowspecorator (locked by ADR-0024).
  • Anything that would change the v1 track taxonomy (ADR-0026).
  • Redesigning the inputs/ ingestion convention (separate concern).

Suggested ordering

  1. C1 → C2 → C3 (architectural; everything else builds on the linter).
  2. M1 → M3 → M4 (low-risk sweeps; hit M3 + M4 in same PR).
  3. M2 → M6 (touches authoring patterns; can ship with C2).
  4. M5 (depends on linter; eval framework is its own slice).
  5. L-series, opportunistically.

References

Primary sources, all retrieved 2026-05-10:

Internal context:


Generated from a 2026-05-10 audit of .claude/agents/ (35), .claude/skills/ (37 in-repo + ~43 plugin) plus four parallel research probes against Anthropic primary docs and high-signal community sources.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2High / next-up, important but not blockingarea:docsstatus:draftIssue still being iterated on; not yet ready for /spec:starttrack:specorator-improvementImprovement to the Specorator template itself

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions