Audit and harden skills + agents architecture against Anthropic best-practices

# Audit and harden skills + agents architecture against Anthropic best-practices

## Why now

A 2026-05-10 audit of `.claude/agents/` (35 agents) and `.claude/skills/` (37 skills, plus ~43 plugin skills auto-loaded) compared against Anthropic's primary docs surfaces a small number of latent risks. The most pressing: at ~80 auto-discoverable skills × ~125 chars median listing entry, this repo sits **at or just over the 10,000-char skills-listing budget** ([Skills > "Skill descriptions are cut short", code.claude.com](https://code.claude.com/docs/en/skills#skill-descriptions-are-cut-short)). Past that line, descriptions are silently truncated or dropped entirely — names always remain, but the trigger keywords disappear, so auto-routing degrades without any error surfaced. Several other items below are pure hygiene that the official docs flag explicitly.

This issue captures findings from a structured deep-dive (architecture audit + four parallel research probes covering best-practices, real-world failure modes, conductor patterns, token-budget mechanics, and trigger-description engineering). All claims are linked to primary sources retrieved on 2026-05-10.

Scope: template-self improvement. Tag `track:specorator-improvement`.

---

## Findings — Critical

### C1. At the skills-listing budget cliff edge

- **Symptom:** With ~80 auto-discoverable skills (37 in-repo + ~43 plugin), median listing entry ~125 chars, total ≈ 10,000 chars. Anthropic's listing budget is `1% × context window` with a fallback of 8,000 chars. On Opus 4.7 (1M context) that is exactly 10,000 chars. Past it, descriptions are silently shortened or dropped (names always preserved); trigger keywords disappear → undertrigger.
- **Source:** [Extend Claude with skills > "Skill descriptions are cut short"](https://code.claude.com/docs/en/skills#skill-descriptions-are-cut-short); per-entry cap 1,536 chars confirmed in [issue #47627](https://github.com/anthropics/claude-code/issues/47627) (raised from 250 in v2.1.105). Engineering blog estimate `~100 tokens` median metadata: [anthropic.com/engineering/equipping-agents…](https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills).
- **Proposed fix:**
  1. Cap `description` at ≤200 chars per skill in `.claude/skills/*/SKILL.md` (~80 × 200 = 16,000 chars envelope).
  2. Bump `SLASH_COMMAND_TOOL_CHAR_BUDGET=20000` in user-settings guidance for adopters (`docs/specorator-product/tech.md`).
  3. Set `disable-model-invocation: true` on user-only utilities — `caveman-help`, `token-budget-review`, `verify`, `new-adr`, `new-glossary-entry`, `record-decision` — they leave the always-on listing entirely; still invocable via `/name`.
  4. Set `skillOverrides: name-only` on dormant skills via `settings.local.json` template.
- **Verification:** run `/context` and `/skills` before + after; record character delta in PR body.

### C2. Two oversized skills

- **Symptom:** `.claude/skills/tackle-issue/SKILL.md` is **337 lines**; `.claude/skills/issue-breakdown/SKILL.md` is **164 lines**. Anthropic recommends ≤500 lines as a hard ceiling, but 337 lines is high enough that the body crowds the post-compaction 5,000-token / 25,000-token-combined re-attachment budget when other skills are also active.
- **Source:** [Skills > "Keep SKILL.md under 500 lines"](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices); skill content lifecycle caps quoted in [Skills > "Skill content lifecycle"](https://code.claude.com/docs/en/skills#skill-content-lifecycle).
- **Proposed fix:** split each into `SKILL.md` + supporting `references/` files (one level deep only, per docs). For `tackle-issue/`: extract Phase 3+ procedure into `references/execution.md`; keep triage logic in `SKILL.md`. For `issue-breakdown/`: extract slice-template + examples into `references/slicing.md` and `examples/`.
- **Acceptance:** both `SKILL.md` bodies < 200 lines; references one level deep; existing behaviour unchanged.

### C3. No frontmatter linter / CI gate for skills + agents

- **Symptom:** Frontmatter shape is convention-only. No CI enforces `description ≤ 1024 chars`, third-person voice, "use when" phrasing, or `tools:` allowlist explicitness on agents.
- **Source:** Required field rules and limits in [Skill authoring best practices](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices); third-person rule is the only `<Warning>` in that doc. wshobson/agents `PluginEval` codifies the practical lint vocabulary (EMPTY_DESCRIPTION, MISSING_TRIGGER, BLOATED_SKILL, ORPHAN_REFERENCE, OVER_CONSTRAINED).
- **Proposed fix:** add `scripts/verify/skills-frontmatter.ts` to the `verify` gate. Checks per skill:
  1. `name` present, lowercase + hyphens, ≤64 chars, not `anthropic` / `claude`.
  2. `description` present, ≤1024 chars, ≤200 char project-cap, third-person (no `^I `/`I'll `/`You can `/`Helps you ` openers), contains a `Use when` clause.
  3. SKILL.md body ≤500 lines.
  4. No second-level references (file linked from a referenced file).
- Per agent: `tools:` field present and non-empty; `description` ≤1024 chars; `name` rules.
- **Acceptance:** gate runs on every PR; current repo passes after C1+C2+M1 land.

---

## Findings — Medium

### M1. Description voice + format inconsistency

- **Symptom:** ~7 skills open with first/second person or trigger-leaks (workflow steps in description). Frontmatter format mixes single-line and YAML `>` fold across `caveman*`, `issue-pr-sync`, `issue-draft`. Agents are uniformly single-line; skills are not.
- **Source:** Anthropic best-practices doc warns explicitly: *"Always write in third person… inconsistent point-of-view can cause discovery problems."* obra/superpowers writing-skills body documents that workflow-leaks in description make Claude follow the description and skip the body.
- **Proposed fix:** sweep all 37 skills; rewrite descriptions to `<verb-phrase>. Use when <triggers>.` pattern. Single-line YAML for every skill. Document the rule in `.claude/skills/README.md` and in the new linter.

### M2. Four practice skills lack examples

- **Symptom:** `grill`, `tdd-cycle`, `tracer-bullet`, `record-decision` have no `examples/` or `references/` — body prose is the only invocation guide. Cognitive load is high; trigger reliability suffers.
- **Source:** [skill-creator SKILL.md](https://github.com/anthropics/skills/blob/main/skill-creator/SKILL.md) recommends references one level deep when triggers are subtle.
- **Proposed fix:** add 1–2 worked examples per skill in `examples/` (one happy path, one edge case). Linked from SKILL.md body, not preloaded.

### M3. Default model on read-heavy agents leaves cost lever unused

- **Symptom:** All 35 agents use the inherited (Opus by default) model. Read-heavy roles — `analyst`, `legacy-auditor`, `reviewer`, `project-reviewer`, `retrospective`, `brand-reviewer` — could run on Haiku 4.5 with no quality regression, at ~5× cost reduction.
- **Source:** [sub-agents > "Control costs"](https://code.claude.com/docs/en/sub-agents): *"Control costs by routing tasks to faster, cheaper models like Haiku."* The built-in `Explore` agent is Anthropic's own demonstration. wshobson/agents's four-tier strategy is the most explicit external example.
- **Proposed fix:** add `model: haiku` to the six read-heavy agents above. Document the heuristic in `.claude/agents/README.md`: read-heavy → Haiku; reasoning-heavy → Opus; default → inherit.

### M4. Subagent `tools:` allowlists not consistently scoped

- **Symptom:** Most agents follow least-privilege correctly. Mild inconsistencies: `roadmap-manager` carries `WebSearch`+`WebFetch` while peer `portfolio-manager` does not despite similar information-gathering scope; `product-page-designer` has Bash while same-stage `ui-designer` does not.
- **Source:** [sub-agents docs](https://code.claude.com/docs/en/sub-agents): inheritance default is wrong default for production. Adam Wolf (Claude Code core engineer) quoted in [claudekit.cc 2025-12](https://claudekit.cc/blog/vc-04-subagents-from-basic-to-deep-dive-i-misunderstood): *"Sub agents work best when they just look for information and provide a small amount of summary back."*
- **Proposed fix:** review each tool grant against the agent's documented purpose; remove unused. Add the rationale as a one-line comment in each agent body where the choice is non-obvious.

### M5. Skill trigger-evals not part of authoring contract

- **Symptom:** No skill in this repo has a 20-query eval; description changes ship blind.
- **Source:** [skill-creator SKILL.md](https://github.com/anthropics/skills/blob/main/skill-creator/SKILL.md) — *"Create 20 eval queries — a mix of should-trigger and should-not-trigger."*
- **Proposed fix:** add `eval.json` to each skill (8–10 should-trigger queries, 8–10 should-not). Run on description-only PRs. Start with the 12 conductor skills (highest blast radius); expand opportunistically. Eval runner can borrow from [hqhq1025/skill-optimizer](https://github.com/hqhq1025/skill-optimizer) (community tool, 60/40 train/test split, 3 runs/query).

### M6. Conductor-calling-conductor stacks SKILL.md bodies in context

- **Symptom:** When `orchestrate` invokes `spec:design` which invokes `arc42-baseline`, all three SKILL.md bodies stay in context until compaction. Specorator's track conductors call inner stage skills routinely, so this stacking is the steady-state pattern.
- **Source:** [Skills > Skill content lifecycle](https://code.claude.com/docs/en/skills#skill-content-lifecycle) — *"the rendered SKILL.md content enters the conversation as a single message and stays there for the rest of the session."*
- **Proposed fix:** for inner stage skills called only by a conductor, audit whether they should be **inlined as `references/` to the conductor** instead of separate top-level skills. Candidates: `spec:clarify`, `spec:analyze`, the per-phase discovery commands when invoked from `discovery-sprint`. Don't blanket-merge — keep direct user-invocability where it matters.

---

## Findings — Low

### L1. Deprecated `ubiquitous-language` skill still loads

- **Symptom:** `.claude/skills/ubiquitous-language/SKILL.md` is 93 lines, marked deprecated by ADR-0010, still in the always-on listing.
- **Proposed fix:** add `disable-model-invocation: true` and `description: '[Deprecated by ADR-0010]…'` so it leaves the auto-routing pool but stays for fork compatibility.

### L2. No agent–skill adjacency map

- **Symptom:** Which agents invoke which skills at which stage is implicit in body prose. New contributors have no overview.
- **Proposed fix:** generate `docs/agent-skill-matrix.md` from the canonical `.claude/agents/` + `.claude/skills/` (extracted from frontmatter `description` and body cross-refs). Include in `verify` so it stays current.

### L3. Argument-hint inconsistency

- **Symptom:** `argument-hint:` is set on most conductors but missing on `arc42-baseline`, `improve-codebase-architecture`, `project-run`, etc.
- **Proposed fix:** sweep; add where the skill takes meaningful arguments.

### L4. Plugin-skill description budget invisible to repo

- **Symptom:** ~43 plugin skills add to the listing budget but cannot be lint-controlled from this repo. They cannot be `skillOverrides`d either (see [Skills > "Override skill visibility"](https://code.claude.com/docs/en/skills#override-skill-visibility-from-settings)).
- **Proposed fix:** document in `docs/specorator-product/tech.md`: which plugins this repo expects to coexist with, and the recommended `SLASH_COMMAND_TOOL_CHAR_BUDGET` floor for adopters. Future contributors then know the headroom they're working in.

### L5. Windows-platform footguns documented

- **Symptom:** Active Claude Code bugs disproportionately hit Windows: PostToolUse `Skill` matcher drops stdin payload on Git Bash ([#57250](https://github.com/anthropics/claude-code/issues/57250)); `context: fork` skill spawn broken since v2.1.113 ([#51165](https://github.com/anthropics/claude-code/issues/51165)); background subagent permissions regressed in v2.1.114 ([#50267](https://github.com/anthropics/claude-code/issues/50267)). Specorator does not currently use `context: fork`, so the second item is informational; the first applies if any hooks rely on PostToolUse Skill payload.
- **Proposed fix:** check `.claude/settings.json` for `Skill`-matched PostToolUse hooks; if present, add a Windows fallback or move the telemetry inside the slash-command body. Add a short Windows-platform notes section to `docs/verify-gate.md`.

---

## Findings — Strengths to preserve

Audit-grade good news, worth recording so future template improvements don't accidentally regress:

- Frontmatter conformance is **35/35 agents** with `name`/`description`/`tools`. No drift.
- Tool restriction by stage role is intentional and consistent (design phases lack Bash; build phases have Bash). Documented in `.claude/agents/README.md`.
- Permission deny-list (no push to main/develop/demo, no `--no-verify`) is the right backstop.
- The conductor-skills idiom (12 conductors + AskUserQuestion gates + file-based `workflow-state.md` per feature) is **idiomatic by community convention**, even though "conductor" is not first-party Anthropic vocabulary. Closest first-party term is "team lead" in [agent-teams docs](https://code.claude.com/docs/en/agent-teams). The pattern is well-established in `wshobson/agents` (Conductor plugin) and `htdocs.dev` write-ups.
- ADR-blessed track taxonomy ([ADR-0026](docs/adr/0026-freeze-v1-workflow-track-taxonomy.md)) is unusual but coherent — methodology layer above skills/agents, not in conflict with first-party model.
- `/specorator:update`, `/token-review`, and the `verify` gate already cover the self-improvement infra most other repos lack.

---

## Acceptance criteria

- [ ] C1: `description` cap ≤200 chars enforced; ≤6 skills move to `disable-model-invocation: true`; `/context` shows ≥1500 chars headroom over the 1% budget.
- [ ] C2: `tackle-issue/SKILL.md` and `issue-breakdown/SKILL.md` < 200 lines; supporting refs added; behaviour unchanged.
- [ ] C3: `scripts/verify/skills-frontmatter.ts` lands and is wired into `verify`; all current skills + agents pass.
- [ ] M1: every `description` follows `<verb-phrase>. Use when <triggers>.` in third person; YAML format single-line.
- [ ] M2: 1–2 examples per practice skill (`grill`, `tdd-cycle`, `tracer-bullet`, `record-decision`).
- [ ] M3: read-heavy agents pinned to `model: haiku`; rationale documented.
- [ ] M4: tool grants reviewed; non-obvious choices have a one-line comment.
- [ ] M5: 12 conductor skills have `eval.json` (20 queries); CI runs eval on description-only PRs.
- [ ] M6: inline-vs-skill decision recorded for `spec:clarify`, `spec:analyze`, per-phase discovery commands.
- [ ] L1–L5: each addressed or deferred with `wontfix` rationale.

## Out of scope

- Migrating away from the conductor-skill pattern (idiomatic; not a defect).
- Renaming `agentic-workflow` → `specorator` (locked by [ADR-0024](docs/adr/0024-lock-specorator-agentic-workflow-naming-contract.md)).
- Anything that would change the v1 track taxonomy ([ADR-0026](docs/adr/0026-freeze-v1-workflow-track-taxonomy.md)).
- Redesigning the `inputs/` ingestion convention (separate concern).

## Suggested ordering

1. C1 → C2 → C3 (architectural; everything else builds on the linter).
2. M1 → M3 → M4 (low-risk sweeps; hit M3 + M4 in same PR).
3. M2 → M6 (touches authoring patterns; can ship with C2).
4. M5 (depends on linter; eval framework is its own slice).
5. L-series, opportunistically.

## References

Primary sources, all retrieved 2026-05-10:

- [Agent Skills overview — platform.claude.com](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview)
- [Skill authoring best practices — platform.claude.com](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices)
- [Extend Claude with skills — code.claude.com](https://code.claude.com/docs/en/skills)
- [Create custom subagents — code.claude.com](https://code.claude.com/docs/en/sub-agents)
- [Orchestrate teams of Claude Code sessions (agent-teams) — code.claude.com](https://code.claude.com/docs/en/agent-teams)
- [How Claude Code works — code.claude.com](https://code.claude.com/docs/en/how-claude-code-works)
- [Equipping agents for the real world with Agent Skills — anthropic.com/engineering](https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills)
- [Skills explained — claude.com/blog](https://claude.com/blog/skills-explained)
- [`anthropics/skills/skill-creator/SKILL.md`](https://github.com/anthropics/skills/blob/main/skill-creator/SKILL.md)
- [`obra/superpowers` — `writing-skills` and friends](https://github.com/obra/superpowers/tree/main/skills/writing-skills)
- [`wshobson/agents`](https://github.com/wshobson/agents) — four-tier model assignment + Conductor plugin pattern
- [Issue #47627](https://github.com/anthropics/claude-code/issues/47627) — listing cap raised 250 → 1,536 in v2.1.105
- [Issue #50267](https://github.com/anthropics/claude-code/issues/50267) — background subagent permissions regression
- [Issue #51165](https://github.com/anthropics/claude-code/issues/51165) — `context: fork` Windows
- [Issue #57250](https://github.com/anthropics/claude-code/issues/57250) — PostToolUse Skill matcher Windows stdin

Internal context:

- Memory rule: [Plugin bundle is generated — never hand-edit](../.claude/memory/feedback_plugin_bundle_generated.md)
- [`docs/verify-gate.md`](../docs/verify-gate.md), [`docs/branching.md`](../docs/branching.md)
- [`.claude/skills/README.md`](../.claude/skills/README.md), [`.claude/agents/README.md`](../.claude/agents/README.md)

---

*Generated from a 2026-05-10 audit of `.claude/agents/` (35), `.claude/skills/` (37 in-repo + ~43 plugin) plus four parallel research probes against Anthropic primary docs and high-signal community sources.*


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Audit and harden skills + agents architecture against Anthropic best-practices #462

Audit and harden skills + agents architecture against Anthropic best-practices

Why now

Findings — Critical

C1. At the skills-listing budget cliff edge

C2. Two oversized skills

C3. No frontmatter linter / CI gate for skills + agents

Findings — Medium

M1. Description voice + format inconsistency

M2. Four practice skills lack examples

M3. Default model on read-heavy agents leaves cost lever unused

M4. Subagent `tools:` allowlists not consistently scoped

M5. Skill trigger-evals not part of authoring contract

M6. Conductor-calling-conductor stacks SKILL.md bodies in context

Findings — Low

L1. Deprecated `ubiquitous-language` skill still loads

L2. No agent–skill adjacency map

L3. Argument-hint inconsistency

L4. Plugin-skill description budget invisible to repo

L5. Windows-platform footguns documented

Findings — Strengths to preserve

Acceptance criteria

Out of scope

Suggested ordering

References

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Audit and harden skills + agents architecture against Anthropic best-practices #462

Description

Audit and harden skills + agents architecture against Anthropic best-practices

Why now

Findings — Critical

C1. At the skills-listing budget cliff edge

C2. Two oversized skills

C3. No frontmatter linter / CI gate for skills + agents

Findings — Medium

M1. Description voice + format inconsistency

M2. Four practice skills lack examples

M3. Default model on read-heavy agents leaves cost lever unused

M4. Subagent tools: allowlists not consistently scoped

M5. Skill trigger-evals not part of authoring contract

M6. Conductor-calling-conductor stacks SKILL.md bodies in context

Findings — Low

L1. Deprecated ubiquitous-language skill still loads

L2. No agent–skill adjacency map

L3. Argument-hint inconsistency

L4. Plugin-skill description budget invisible to repo

L5. Windows-platform footguns documented

Findings — Strengths to preserve

Acceptance criteria

Out of scope

Suggested ordering

References

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

M4. Subagent `tools:` allowlists not consistently scoped

L1. Deprecated `ubiquitous-language` skill still loads