Marketplace Test Strategy

Status: ⏸ Phase 1 plan — awaiting approval before Phase 2 execution.

LLM-facing reference for the marketplace-wide testing initiative. Phase 1 product. Read this before executing any per-plugin plan in testing/plans/.

1. Scope

2026-05-08 update. Five plugins removed from the marketplace in commit 3b8323e: claude-sync, design-assistant, docs-manager, linux-sysadmin, python-dev. Four were in-scope and one (python-dev) was excluded as pure-markdown. Updated counts below.

In scope (11): github-repo-manager, handoff, home-assistant-dev, nominal, opus-context, plugin-test-harness, qt-suite, release-pipeline, repo-hygiene, test-driver, up-docs. PTH stays in scope per §2 (peer plugin and canonical live-loop runner). Total marketplace plugins = 12, minus 1 excluded = 11.
Excluded (1): qdev. Pure-markdown plugin (commands + agents + skills only). Zero non-vendored source files. There is no mechanical surface to unit-test — its behavior is validated by direct invocation through Claude Code, not by a test harness.
Vendored trees ignored: node_modules/, .venv/, dist/, build/, __pycache__/.

2. Plugin-test-harness role (disambiguation)

PTH is both a peer plugin and the canonical live-loop runner — but not the canonical unit-test runner.

Surface	What	Where
Peer plugin	Distributed in marketplace; tests its own internals	`plugins/plugin-test-harness/` (Jest)
Canonical live runner	Drives test/fix/reload loop against another plugin's MCP server or `.claude-plugin/` interface	Invoked from any session via `pth_*` MCP tools
NOT	Canonical unit-test framework for the marketplace	Each plugin owns its unit framework per language

Implication for Phase 2: per-plugin unit suites use the per-language canonical framework (§3). PTH is complementary — once a plugin's unit suite is green, PTH can drive behavioral/integration coverage in a session branch. PTH coverage is out of scope for Phase 2 unless a plan explicitly calls it out.

3. Canonical frameworks (by language)

Language	Framework	Rationale
Bash / shell	`bats-core`	Already used by 9 of 11 in-scope plugins (github-repo-manager, handoff, nominal, opus-context, qt-suite, release-pipeline, repo-hygiene, test-driver, up-docs). Idiomatic `@test`, `setup`/`teardown`, `bats-assert` ergonomics.
Python	`pytest` (+ `pytest-asyncio` where async)	Already used by home-assistant-dev and qt-suite. Fixtures, parametrize, markers established.
TypeScript	`Jest` with `NODE_OPTIONS=--experimental-vm-modules` for ESM	Already used by home-assistant-dev/mcp-server and plugin-test-harness. ts-jest or esbuild-jest acceptable.

Legacy ad-hoc runners (do not migrate in Phase 2)

Plugin	Legacy file shape	Decision
github-repo-manager	`tests/run-tier-{a,b,c}.sh`, `lib.sh`, `cleanup.sh`	Leave intact (Tier A/B/C is a meaningful taxonomy worth preserving). New unit-level coverage in `*.bats`.
release-pipeline	`tests/test-.sh` + `run-all.sh` (alongside `.bats` added in Phase 2)	Leave intact. Phase 2 introduced bats as the canonical layer; ad-hoc files are the integration smoke layer.

Rationale: the prompt's "tests validate existing behavior; do not refactor plugin source" extends to existing tests. Ad-hoc bash test runners are part of the existing-behavior surface — migrating them would be churn that exceeds Phase 2 scope and risks losing tribal-knowledge edge cases.

4. Naming & layout conventions

Framework	Test file	Fixture location
bats	`tests/<script-name>.bats` (mirror script name)	`tests/fixtures/<topic>/` (create on demand)
pytest	`tests/test_<module>.py` (PEP 8)	`tests/conftest.py` per directory; `tests/fixtures/` for data
Jest	`test/unit/<path-mirror>/<module>.test.ts`	`test/fixtures/<scenario>/`

Plugin sub-trees that have their own runner (e.g., home-assistant-dev/mcp-server/__tests__/, qt-suite/mcp/qt-pilot/tests/) keep their existing layout. A plugin can have multiple sub-suites under different runners — that's fine, document it in the plugin plan.

5. Enforcement-layer mapping (the principle → layer rule)

Every proposed test in a per-plugin plan is tagged with the layer it exercises. Layer assignment uses these definitions, ordered strongest-to-weakest:

Mechanical

A hook, script, or process that deterministically blocks/warns/asserts regardless of AI behavior. Verifiable by running with deterministic input → asserting on output (exit code, stdout/stderr, file mutations, JSON return shape).

Test signature: run invocation in bats / subprocess.run in pytest / execa in Jest, asserting on side effects.

Structural

Verifiable by reading file structure, manifests, JSON contracts — without executing code. Schema conformance, hook event-keying, manifest-version match between marketplace.json and plugin.json, presence of required scripts, frontmatter shape on skills.

Test signature: jq / python -c "json.load(...)" / Zod schema parse over file contents.

Behavioral

Anything requiring an LLM, agent, or human to interpret prompts (skill markdown, agent system prompts, command instruction blocks). Cannot be unit-tested. Coverage comes from PTH live-loop sessions or design review. Plans explicitly mark these as Behavioral — out of unit-test scope so the gap is visible, not hidden.

Per-test mapping format (used in every plugin plan)

| Principle | Layer | Proposed test | Rationale |
|-----------|-------|---------------|-----------|
| [P3] Succeed quietly, fail transparently | Mechanical | hook script exits 0 on no-op input; non-zero exit emits raw error to stderr + recovery hint | Verifies stated quiet-success behavior at the script seam |
| [P1] Principles before architecture       | Behavioral — out of scope | n/a | Workflow constraint; covered by /design-draft session UX, not unit tests |

6. Priority order for Phase 2

Highest test-debt first (no tests / poor principle coverage / release-critical), then breadth-fill.

Phase 2 is complete (~225 cases cherry-picked to main per session 2026-05-07). The original 15-row priority table is preserved below with the four removed plugins struck out for historical traceability.

#	Plugin	Why this rank
1	release-pipeline	14 src + 9 ad-hoc shell tests but 0 bats, no coverage of waiver/tag-reconcile/auto-stash principle compliance. Release-critical surface — failures here ship broken plugins.
2	opus-context	1 src (SessionStart hook) + 0 tests. Whole plugin's mechanical layer is one shell script that emits `additionalContext` JSON. Tiny scope, high-value Mechanical coverage.
3	handoff	2 src + 2 bats (1:1 nominal) but bats files cover happy paths only — `[P1] Complete Context` and `[P2] Actionable Next Steps` are not asserted. Small surface; low effort to close gap.
4	up-docs	3 bats + 4 src; convergence-tracker and link-audit have script-level coverage but `[P3] Update, don't rewrite` and `[P4] Ground Truth Wins` are unverified.
5	repo-hygiene	3 bats + 7 src. `[P4] Safety by Construction` (3-check orphan delete) is the highest-stakes guarantee in the plugin and only one of the three checks is tested today.
6	github-repo-manager	3 bats + ~20 src (8 shell + ~18 helper/.js). Helper CLI is entirely untested by Jest*; tier classification + PreToolUse guard are Mechanical claims that should have explicit assertions.
7	nominal	6 bats + 6 src (1:1). Best-covered shell plugin; gaps are subtle (flight-log append-only on race, abort.json schema).
8	test-driver	5 bats + 5 src (1:1). Similar shape to nominal.
9	qt-suite	4 pytest + 5 src. Qt Pilot MCP is the Mechanical surface; existing tests cover annotations + harness + main + imports. Gap-fill mostly.
10	home-assistant-dev	7 pytest + 3 Jest + 1 e2e + 12 mcp-server src + 5 scripts. Best-covered plugin in the marketplace; gap is targeted (specific tools, specific IQS rules).
11	plugin-test-harness	15 Jest unit + 35 src. Highest coverage already. Gap-fill opportunity is `convergence.ts` trend math and `gap-analyzer.ts` snapshot diff.
—	~~linux-sysadmin~~	Plugin removed 2026-05-08.
—	~~claude-sync~~	Plugin removed 2026-05-08.
—	~~docs-manager~~	Plugin removed 2026-05-08.
—	~~design-assistant~~	Plugin removed 2026-05-08.

7. Marketplace-level conventions (cross-cutting)

These apply to every per-plugin plan and to any new tests introduced in Phase 2.

Tests assert on existing behavior. If a test would require changing the script under test to make a seam available, flag the issue in the plan, do not change the script. Phase 2 is not allowed to refactor.
No network, no real tokens, no real GitHub API. Mock at the boundary. helper/*.js tests stub client.js with a fake fetch; bats tests stub gh/git via PATH precedence.
No real ~/.claude/, no real ~/.pth/, no real /mnt/share/. Use BATS_TMPDIR / tmp_path / os.tmpdir() and HOME=$BATS_TMPDIR/home redirection. Tests that touch the user's real config are forbidden.
set -euo pipefail semantics. Bats tests of bash scripts must reproduce the script's actual flag environment — the ((var++)) and [[ test ]] && action failure modes are real and have shipped bugs in this repo before.
JSON validation tests stay deterministic. Compare with jq -S (sorted keys) or parsed Python dict equality, not raw string compare — key-order non-determinism between platforms breaks tests silently.
Marker discipline (pytest only). unit / integration / validation markers carry over from ha-dev's existing convention. New pytest suites adopt the same names so CI matrix remains uniform if/when extended.
No multi-plugin tests in Phase 2. Cross-plugin integration (e.g., test-driver → opus-context profile loading) is deferred. STRATEGY notes one candidate: validate-marketplace.sh as a marketplace-level Mechanical suite — proposed but not in this Phase 2 batch unless user confirms during review.

8. Out of scope for Phase 2 (will not write tests for these)

CI/CD pipeline changes. Existing .github/workflows/ha-dev-plugin-tests.yml and plugin-test-harness-ci.yml are the only test-running workflows. Adding bats CI for the other 9 in-scope plugins is a separate prompt.
Cross-plugin integration tests (e.g., release-pipeline → repo-hygiene chained sweep). Listed as a candidate; gated on user approval.
Refactoring source to add seams. If a script has no testable seam (source of stdin-reading code without isolation, hardcoded ~/.claude/ paths, etc.), the per-plugin plan flags it with a Risk: note. No source change.
PTH-driven behavioral coverage. PTH is mentioned where relevant but not bundled into Phase 2 plans.
qdev — see §1.

9. Phase 2 execution rules (recap)

For each plugin, in priority order:

Confirm target with user.
Direct commits to main. (Phase 2 originally branched tests/<plugin> from a testing integration branch; that workflow was retired 2026-05-07.)
Implement strictly per testing/plans/<plugin>.md.
Run the plugin's existing runner + new tests; both must be green.
Update plan with deviations (and why).
Single commit (or logically-grouped commits) on main.
HALT — surface coverage delta, ask for next plugin.

10. Document index

File	Purpose
`testing/STRATEGY.md`	This file
`testing/plans/<plugin>.md` × 11	Per-plugin Phase 2 work order (was 15; 4 plans deleted 2026-05-08 alongside their plugins)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Marketplace Test Strategy

1. Scope

2. Plugin-test-harness role (disambiguation)

3. Canonical frameworks (by language)

Legacy ad-hoc runners (do not migrate in Phase 2)

4. Naming & layout conventions

5. Enforcement-layer mapping (the principle → layer rule)

Mechanical

Structural

Behavioral

Per-test mapping format (used in every plugin plan)

6. Priority order for Phase 2

7. Marketplace-level conventions (cross-cutting)

8. Out of scope for Phase 2 (will not write tests for these)

9. Phase 2 execution rules (recap)

10. Document index

FilesExpand file tree

STRATEGY.md

Latest commit

History

STRATEGY.md

File metadata and controls

Marketplace Test Strategy

1. Scope

2. Plugin-test-harness role (disambiguation)

3. Canonical frameworks (by language)

Legacy ad-hoc runners (do not migrate in Phase 2)

4. Naming & layout conventions

5. Enforcement-layer mapping (the principle → layer rule)

Mechanical

Structural

Behavioral

Per-test mapping format (used in every plugin plan)

6. Priority order for Phase 2

7. Marketplace-level conventions (cross-cutting)

8. Out of scope for Phase 2 (will not write tests for these)

9. Phase 2 execution rules (recap)

10. Document index