v1.1.18 | Quality & Testing | 20 iterations
The Red-Green-Refactor methodology for writing failing tests before implementation code -- across Python (pytest), TypeScript (Vitest), Playwright E2E, Emacs Lisp (ERT), and Zod schema validation. Single skill + 12 references + 5 scripts + 4 templates
TDD works cycle by cycle, not all at once. Give the skill enough context to write the first meaningful failing test -- and the cycle begins.
What information to include in your prompt:
- Language and testing framework -- "Python with pytest," "TypeScript with Vitest," "Playwright E2E," or "Emacs Lisp with ERT." This determines which patterns, fixtures, and conventions apply.
- The specific behavior to implement -- not "authentication module" but "a function that validates email format and returns a validation result." Smaller scope = better tests.
- Key business rules or edge cases -- list any rules that are not obvious from the function name. "Discount stacking: percentage discounts apply before fixed discounts." These become test cases.
- What already exists -- if there is existing code, paste the function signature or interface so tests can reference it. If there is existing test infrastructure, describe it (fixtures, helpers, conftest).
- Coverage goals if relevant -- if you are filling gaps ("coverage is at 65%, need 80%"), paste the coverage report output so the skill can identify the highest-impact gaps.
What makes results better vs worse:
- Better: name one concrete behavior to test first ("the simplest happy path case"), not the whole feature
- Better: provide the function signature or type definition even if implementation does not exist yet
- Better: if refactoring, share the current test output (all green) alongside the code to be refactored
- Worse: asking to "add tests to the whole module" -- TDD works incrementally, one behavior at a time
- Worse: providing implementation code without specifying which behavior to test first
- Worse: skipping the framework -- language-specific patterns matter (pytest fixtures vs Vitest describe, async patterns, etc.)
Template prompt:
I need to implement [feature/behavior] using TDD in [language] with [framework: pytest / Vitest / Playwright / ERT].
The function/component signature: [paste or describe the interface]
Key behaviors (each will become a test):
1. [behavior 1 -- this is where we start]
2. [behavior 2]
3. [edge case or error condition]
[Optional: existing test infrastructure -- fixtures, helpers, conftest, test database]
Start with the failing test for behavior 1.
Most developers write code first and tests second -- if they write tests at all. The tests they write are shaped by the implementation they already built, so the tests verify the code does what it does, not what it should do. Edge cases that were not considered during implementation are not considered during testing either. The tests become a rubber stamp rather than a design tool, and the team ships bugs that a test-first approach would have caught.
When developers do practice TDD, they often do it wrong. They write tests that are too large (testing an entire workflow instead of one behavior), too coupled to implementation (asserting mock call arguments instead of observable behavior), or too isolated (100% unit test coverage but no integration tests, so the units work individually and fail together). The Red-Green-Refactor cycle degrades into "write test, write code, move on" -- the refactoring step is skipped because the tests are green and there is pressure to ship.
The problem compounds across languages. A team that practices TDD well in Python with pytest does not know how to apply the same discipline in TypeScript with Vitest. Playwright E2E tests are written without the TDD mindset because "E2E tests cannot be written first." Schema validation with Zod is not tested at all because "the schema IS the test." These gaps mean TDD coverage is inconsistent across the stack, and bugs hide in the untested layers.
This plugin implements the complete TDD methodology across four languages and five testing frameworks: Python with pytest, TypeScript with Vitest, browser E2E with Playwright, Emacs Lisp with ERT, and schema validation with Zod. It enforces the Red-Green-Refactor cycle at every level: write a failing test that defines the expected behavior (RED), write the minimal code to make it pass (GREEN), then improve the code while keeping tests green (REFACTOR).
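The cycle can be sketched in miniature with pytest. The names here (`validate_email` and its test) are illustrative, not part of the plugin:

```python
# One Red-Green cycle, condensed into a single file for illustration.

# RED: this test is written first. At that point validate_email does not
# exist, so the test fails with a NameError/ImportError -- the correct
# kind of failure (missing behavior, not a syntax error).
def test_valid_email_returns_true():
    assert validate_email("user@example.com") is True

# GREEN: the minimal implementation -- just enough to pass the one test.
# REFACTOR would follow once real duplication or complexity appears.
def validate_email(address: str) -> bool:
    return "@" in address and "." in address.split("@")[-1]
```

The point of the minimal GREEN step is discipline: the next behavior (say, rejecting addresses without a domain) gets its own RED test before the implementation grows.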
The skill covers three testing tiers with clear boundaries: Unit tests (fast, isolated, mock external dependencies, run on every commit), Integration tests (test component interactions, may use test databases, run on pull requests), and E2E tests (complete user workflows, run before deployment). Coverage targets are explicit: 80-90% line coverage for unit tests, critical paths for integration, and main user workflows for E2E.
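One way to make the unit-test coverage floor enforceable is a pytest-cov setting. This is a hedged sketch, not plugin configuration: it assumes pytest-cov is installed, and the `src` path and 80% figure are placeholders to adjust per project.

```ini
# pytest.ini (sketch)
[pytest]
addopts = --cov=src --cov-fail-under=80
```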
Twelve reference files provide language-specific patterns, test design patterns, refactoring techniques, and coverage validation. Five utility scripts handle test execution, coverage analysis, threshold checking, test template generation, and implementation validation. Four templates provide starting points for pytest, ERT, checklists, and session logging.
| Without this plugin | With this plugin |
|---|---|
| Code first, tests second -- tests verify what the code does, not what it should do | Tests first, code second -- tests define the expected behavior before implementation exists |
| Tests too large: testing entire workflows instead of individual behaviors | One test at a time, one behavior at a time -- small steps that build confidence incrementally |
| Green bar and move on -- refactoring step skipped under delivery pressure | Explicit REFACTOR phase after every GREEN: improve code quality while tests guarantee behavior |
| TDD practiced in one language but not others -- gaps across the stack | Consistent methodology across Python (pytest), TypeScript (Vitest), E2E (Playwright), Emacs Lisp (ERT), and Zod |
| Test names like test_discount() -- no indication of what behavior is being tested | Descriptive naming: test_calculate_discount_returns_zero_for_empty_cart -- behavior documented in the name |
| Shared mutable state between tests -- test order affects results | Test independence enforced: no shared state, fixtures for setup/teardown, each test runs in isolation |
Add the SkillStack marketplace, then install this plugin:
/plugin marketplace add viktorbezdek/skillstack
/plugin install test-driven-development@skillstack
After installing, test with:
I want to implement a shopping cart discount calculator using TDD -- walk me through the red-green-refactor cycle
The skill should activate and start with the RED phase: writing a failing test for the simplest case before any implementation code exists.
- Install the plugin using the commands above
- Start with the test:
  I need a function that validates email addresses -- let's do TDD
- The skill writes the first failing test (RED): test_valid_email_returns_true with a simple happy path case
- You confirm the test fails as expected (not a syntax error -- it fails because the function does not exist yet)
- The skill writes minimal implementation (GREEN): just enough code to pass that one test, then immediately asks for the next behavior to test
User wants to build a feature using TDD
│
▼
┌─────────────────────────────────────────────────────────┐
│ test-driven-development (skill) │
│ │
│ Red-Green-Refactor Cycle: │
│ │
│ ┌───────────────────────────────────────────────┐ │
│ │ RED: Write failing test │ │
│ │ (define expected behavior) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ GREEN: Write minimal code │ │
│ │ (just enough to pass) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ REFACTOR: Improve code quality │ │
│ │ (tests stay green) │ │
│ │ │ │ │
│ │ └──── REPEAT ──── ↑ │ │
│ └───────────────────────────────────────────────┘ │
│ │
│ Three Testing Tiers: │
│ ├── Unit (fast, isolated, 80-90% coverage) │
│ ├── Integration (component interactions, critical paths) │
│ └── E2E (user workflows, pre-deployment) │
│ │
│ Language Support: │
│ ├── Python ──── pytest ── python-tdd.md │
│ ├── TypeScript ── Vitest ── vitest-patterns.md │
│ ├── E2E ──────── Playwright ── playwright-*.md │
│ ├── Emacs Lisp ── ERT ── elisp-tdd.md │
│ └── Schema ───── Zod ── zod-testing-patterns.md │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ 12 References│ │ 5 Scripts │ │ 4 Templates │ │
│ │ patterns, │ │ run, cover- │ │ pytest, ERT │ │
│ │ refactoring, │ │ age, check, │ │ checklist, │ │
│ │ coverage │ │ generate, │ │ session log │ │
│ │ │ │ validate │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────┘
| Component | Type | Purpose |
|---|---|---|
| test-driven-development | skill | Core TDD methodology: Red-Green-Refactor, Arrange-Act-Assert, three testing tiers, coverage targets, naming conventions |
| general-tdd.md | reference | TDD principles and philosophy |
| python-tdd.md | reference | Python-specific TDD with pytest: fixtures, parametrization, markers |
| vitest-patterns.md | reference | Vitest testing patterns for TypeScript/JavaScript |
| playwright-e2e-patterns.md | reference | Playwright E2E test patterns with TDD mindset |
| playwright-best-practices.md | reference | E2E testing guidelines and common pitfalls |
| zod-testing-patterns.md | reference | Schema validation testing with Zod |
| elisp-tdd.md | reference | Emacs Lisp TDD with ERT framework |
| test-design-patterns.md | reference | Common test patterns: builder, mother, fixture |
| test-structure-guide.md | reference | Test file organization and naming conventions |
| refactoring-with-tests.md | reference | Safe refactoring techniques with green tests as safety net |
| coverage-validation.md | reference | Coverage analysis, threshold enforcement, gap identification |
| extended-patterns.md | reference | Detailed language-specific examples and the 6-phase TDD workflow |
| run_tests.py | script | Test runner utility with formatting and filtering |
| coverage_analyzer.py | script | Analyze coverage data and identify gaps |
| coverage_check.py | script | Enforce minimum coverage thresholds |
| test_template_generator.py | script | Generate test boilerplate for new modules |
| skill_validator.py | script | Validate test implementations against TDD practices |
| 4 template files | template | pytest, ERT, coverage checklist, and TDD session log |
Eval coverage: 13 trigger evaluation cases, 3 output evaluation cases.
What it does: Activates when you want to practice TDD, follow the Red-Green-Refactor cycle, or write tests before implementation. Walks you through each phase: write a failing test (RED), implement minimal code to pass (GREEN), refactor while tests stay green (REFACTOR). Supports Python/pytest, TypeScript/Vitest, Playwright E2E, Emacs Lisp/ERT, and Zod schema testing.
Input -> Output: A feature or behavior to implement -> A test-first development session with failing test, minimal implementation, refactoring, and iterative expansion through successive TDD cycles.
When to use: Implementing new features with test-first methodology. Adding tests to increase coverage. Refactoring with a test safety net. Writing E2E tests that define expected UX. Practicing TDD in a new language or framework.
When NOT to use: Choosing or setting up test frameworks (use testing-framework). Finding and fixing bugs (use debugging). Reviewing existing code or PRs (use code-review).
Try these prompts:
I need to implement a JWT authentication module in Python using pytest. The module should validate tokens,
extract claims, and reject expired or tampered tokens. Start with the simplest case: a valid token returns
the user ID claim. Here is the function signature:
def validate_token(token: str, secret: str) -> dict | None: ...
Walk me through red-green-refactor for a React ProductList component that filters by category using Vitest.
The component receives products as a prop (array of {id, name, category}) and a selectedCategory string.
When selectedCategory is "electronics", only electronics products should render.
I want to write Playwright E2E tests using TDD for our three-step checkout flow: cart review, shipping
address, payment. Define the expected UX as tests before I build the UI. The key assertions: cart shows
correct totals, address validates on blur, payment submits and shows confirmation screen.
My coverage report shows 65% overall. Here are the uncovered lines from pytest --cov-report=term-missing:
[paste coverage output]. I need to reach 80%. Identify the highest-impact gaps (business logic and error
handling first) and write the tests for them.
I have a 280-line PaymentProcessor class. All tests pass. I need to extract the retry logic into a separate
RetryStrategy class. Walk me through safe refactoring: one extraction at a time, full suite after each step,
no behavior changes. Here is the current implementation: [paste code]
Write Vitest tests for this TypeScript price calculation utility -- TDD style, start with failing tests.
Here are the functions to implement:
- calculateSubtotal(items: CartItem[]): number
- applyDiscount(subtotal: number, code: DiscountCode): number
- calculateTax(amount: number, region: string): number
Key references:
| Reference | Topic |
|---|---|
| general-tdd.md | TDD principles and methodology |
| python-tdd.md | pytest fixtures, parametrization, markers |
| vitest-patterns.md | Vitest testing patterns for TypeScript |
| playwright-e2e-patterns.md | Playwright E2E with TDD mindset |
| test-design-patterns.md | Builder, mother, fixture patterns |
| refactoring-with-tests.md | Safe refactoring techniques |
| coverage-validation.md | Coverage analysis and gap identification |
CLI: python scripts/coverage_analyzer.py --path src/ --threshold 80
What it produces: A coverage analysis report identifying uncovered lines, branches, and functions, ranked by impact. Highlights the highest-value gaps to close first.
Typical workflow: After a TDD session, run to verify coverage targets are met and identify remaining gaps.
CLI: python scripts/test_template_generator.py --module src/auth.py --output tests/test_auth.py
What it produces: A test file skeleton with Arrange-Act-Assert structure, proper fixtures, and descriptive test names for each public function in the module.
Typical workflow: When starting TDD on an existing module that has no tests, generate the skeleton first then fill in assertions.
| Bad (skips the TDD mindset) | Good (embraces test-first thinking) |
|---|---|
| "Write a discount calculator" | "I need a discount calculator with percentage, fixed, and stacking rules -- let's TDD it starting with the simplest case" |
| "Write tests for this function" | "I want to practice TDD for this feature -- start with the failing test before any implementation exists" |
| "Add tests to get coverage up" | "My coverage is 65% and I need 80%. Analyze the gaps and prioritize: untested business logic first, edge cases second" |
| "Test this component" | "Walk me through red-green-refactor for this React component -- I want E2E tests with Playwright that define the UX first" |
| "Refactor this code" | "This 400-line function works (tests green). Walk me through extracting smaller functions with the test safety net." |
For starting a new feature with TDD:
I need to implement [feature description] in [language: Python / TypeScript / Emacs Lisp].
The key behaviors are [list 2-3 core behaviors]. Let's do TDD with [framework: pytest / Vitest / ERT].
Start with the failing test for the simplest behavior.
For increasing coverage:
My [project / module] has [N]% test coverage and I need [target]%.
Analyze the gaps and write tests for the highest-impact uncovered code.
Focus on [priority: business logic / error handling / edge cases] first.
For safe refactoring:
This [function / class / module] is [N] lines and needs refactoring.
All tests pass. Walk me through safe refactoring: one extraction at a time,
run tests after each change, no behavior changes.
For E2E with TDD mindset:
Write Playwright E2E tests for [user workflow] using TDD. Define the expected
user experience as tests BEFORE I build the UI. The key steps are:
[list user actions and expected outcomes].
- Asking to "write tests for existing code" without the TDD mindset: This is test-after, not test-driven. The skill works best when you describe the behavior you want to implement and let it write the failing test first. If the code already exists, ask for coverage gap analysis instead.
- Requesting the entire implementation at once: TDD works in small cycles. Asking to "implement the full authentication system with TDD" skips the incremental nature of the methodology. Instead, ask to start with one behavior ("valid email returns true") and build up.
- Skipping the RED verification: The failing test must fail for the right reason (missing functionality, not syntax errors). If you do not verify the failure, you might write a test that accidentally passes, defeating the purpose.
- Asking to "just make it pass" without mentioning refactoring: The REFACTOR phase is not optional. After GREEN, explicitly ask about refactoring opportunities. The skill will prompt you, but acknowledging the phase keeps the cycle disciplined.
You are building a notification service that sends alerts through email, Slack, and SMS based on user preferences. The service needs to handle preference lookup, channel routing, rate limiting, and delivery confirmation. You want to build it test-first to ensure every behavior is explicitly defined before implementation.
You open Claude Code and say:
I'm building a notification service in Python. It sends alerts via email, Slack, and SMS based on user preferences. Let's do TDD with pytest.
The skill starts with the RED phase. Before writing any implementation code, it identifies the simplest behavior to test first: "Given a user with email preferences, sending a notification should route to the email channel."
```python
def test_send_notification_routes_to_email_when_user_prefers_email():
    user = User(id="u1", preferences=NotificationPreferences(channels=["email"]))
    notification = Notification(user_id="u1", message="Server is down", severity="critical")
    result = send_notification(notification, user)
    assert result.channel == "email"
    assert result.status == "sent"
```

You run the test: it fails with `ImportError: cannot import name 'send_notification'`. This is the correct kind of failure -- the function does not exist yet. RED is complete.
The skill moves to GREEN: write the minimal code to pass this one test. Not the entire notification service -- just enough:
```python
def send_notification(notification, user):
    channel = user.preferences.channels[0]
    return NotificationResult(channel=channel, status="sent")
```

Deliberately naive. Test passes. GREEN is complete. REFACTOR: nothing to refactor yet -- too simple.
Second cycle -- multiple channels: Write a test expecting a list of results for multiple preferred channels. RED: fails because the function returns a single result. GREEN: modify to iterate over channels. REFACTOR: extract channel routing into a separate function.
Third cycle -- rate limiting: Write a test that sends 4 notifications when the rate limit is 3, asserting the 4th returns "rate_limited." RED: fails. GREEN: add a counter. REFACTOR: extract into a RateLimiter class.
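The third cycle's REFACTOR step might look like this minimal sketch (the `RateLimiter` class and test names are illustrative, not the plugin's output):

```python
# Extracted during REFACTOR: the counter from the GREEN step, now a class.
class RateLimiter:
    def __init__(self, limit: int):
        self.limit = limit
        self.count = 0

    def allow(self) -> bool:
        # Permit the call if under the limit; otherwise reject it.
        if self.count >= self.limit:
            return False
        self.count += 1
        return True

# The RED-phase test that drove this cycle: 4th send hits the limit of 3.
def test_fourth_notification_is_rate_limited():
    limiter = RateLimiter(limit=3)
    results = [limiter.allow() for _ in range(4)]
    assert results == [True, True, True, False]
```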
After six TDD cycles, you have channel routing, multi-channel delivery, rate limiting, severity-based overrides, delivery confirmation, and retry logic -- every behavior defined by a test before implementation. The test suite has 14 tests. Coverage check:
```shell
uv run pytest --cov=src/notifications --cov-report=term-missing
# 94% coverage -- 2 lines uncovered in retry backoff
```

One more test for the backoff edge case brings coverage to 97%. The skill then walks you through a final refactoring pass: the send_notification function has grown to 45 lines across six cycles. Using the test safety net, you extract it into three functions (route, deliver, confirm), running the full suite after each extraction.
Context: You are adding a discount calculation feature with percentage, fixed amount, minimum purchase, and stacking rules.
You say: I need to implement discount calculation with percentage, fixed, and stacking rules. Let's TDD it.
The skill provides:
- Starting test for the simplest case (single percentage discount)
- Progressive test additions for each rule
- Refactoring guidance after each GREEN
- Edge case tests: zero discount, negative total, expired discounts
You end up with: A discount calculator with 20+ tests where each business rule was defined as a test before implementation.
Context: Building a checkout flow and want E2E tests to define the expected UX before the UI exists.
You say: Write Playwright E2E tests for our checkout flow using TDD -- define the UX before I build it
The skill provides:
- E2E tests defining the happy path: cart, shipping, payment, confirmation
- Page object structure and locator patterns
- Assertion patterns for each step
- Guidance on E2E vs unit test boundaries
You end up with: A Playwright suite that serves as a living specification -- the implementation passes when the UX matches.
Context: 65% coverage, CI requires 80%. Need to fill the gaps efficiently.
You say: My coverage is at 65% and I need 80%. Find the gaps and write the missing tests.
The skill provides:
- Coverage analysis identifying uncovered lines and branches
- Priority ranking: business logic first, error handling second, edge cases third
- Test generation for highest-impact gaps
- Threshold enforcement with coverage_check.py
You end up with: Targeted tests closing the 15% gap, focused on code that matters.
Context: A 400-line function works correctly but is impossible to maintain. Tests are green.
You say: This 400-line function needs refactoring. Tests are green. Walk me through safe refactoring.
The skill provides:
- Extract method strategy, one function at a time
- Test execution after each extraction
- New tests for previously implicit behaviors
- Verification that refactoring preserved behavior
You end up with: The same behavior in 5 well-named functions under 80 lines each, with original tests passing plus new tests for implicit behaviors.
How does the skill choose which testing tier to apply?
- Unit tests (default): when testing isolated functions, business logic, or data transformations with no external dependencies. Mock everything external. Run in milliseconds.
- Integration tests: when testing interactions between components, database queries, or external API contracts. Use test databases or fixtures. Run in seconds.
- E2E tests: when defining or verifying complete user workflows through the UI. Use Playwright with real or staging environments. Run in minutes.
The skill defaults to unit tests unless the task involves component interactions (integration) or user workflows (E2E).
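One common way to encode that tier split in pytest is with markers -- a convention, not something the plugin mandates (the marker name would be registered in pytest configuration):

```python
import pytest

# Unit tier: pure logic, no I/O, runs in milliseconds on every commit.
def test_discount_is_ten_percent():
    assert round(200 * 0.10, 2) == 20.0

# Integration tier: opted in explicitly, e.g. `pytest -m integration`
# on pull requests once a test database is available.
@pytest.mark.integration
def test_order_roundtrips_through_db():
    pytest.skip("needs a test database")
```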
How does the skill choose which language reference to load?
Based on file extensions and explicit mentions:
- .py files or "pytest" mentions -> python-tdd.md
- .ts/.tsx files or "Vitest" mentions -> vitest-patterns.md
- "Playwright" or "E2E" or "browser" -> playwright-e2e-patterns.md
- .el files or "ERT" or "Emacs Lisp" -> elisp-tdd.md
- "Zod" or "schema" -> zod-testing-patterns.md
When does REFACTOR happen?
After every GREEN phase. The skill checks: Is the implementation code clean? Are there duplicated patterns? Has the function grown too long? Are there extract method opportunities? If yes, refactor now with tests as the safety net. If the code is still simple (early cycles), the skill notes this and moves to the next RED.
| Failure | Symptom | Recovery |
|---|---|---|
| Test passes when it should fail (GREEN before RED) | The test accidentally passes because the function already exists or the assertion is wrong | Verify the RED phase: the test must fail because the behavior is not implemented, not because of a syntax error. If the test passes immediately, the assertion is not testing the right thing. |
| Tests coupled to implementation, not behavior | Refactoring breaks tests even though behavior has not changed -- tests assert mock call arguments instead of outputs | Rewrite tests to assert observable behavior (return values, side effects, state changes) not implementation details (which functions were called, in what order). |
| Refactoring step consistently skipped | Code works but grows messily across TDD cycles -- functions become long, names become vague | The skill explicitly prompts for refactoring after every GREEN. If you skip it, technical debt accumulates during the TDD session itself. Treat REFACTOR as mandatory, not optional. |
| Coverage numbers are high but tests are shallow | 90% line coverage but tests only check happy paths -- no edge cases, no error handling, no boundary conditions | Line coverage is necessary but not sufficient. The skill uses coverage_analyzer.py to identify covered-but-shallow code and suggests edge case, error, and boundary tests. |
| E2E tests are too slow for TDD rhythm | Playwright tests take 30+ seconds per run, breaking the fast feedback cycle | Use unit/integration tests for the rapid RED-GREEN-REFACTOR cycle. Write E2E tests to define the workflow specification, but do not run them on every micro-cycle. Run E2E after each feature is unit-tested. |
- Developers adopting TDD for the first time -- the skill walks through each phase (RED, GREEN, REFACTOR) explicitly, preventing the common mistake of skipping refactoring
- Teams building complex business logic -- test-first ensures every business rule is defined as a test before implementation, making rules auditable and preventing regression
- Polyglot teams working across Python, TypeScript, and Emacs Lisp -- consistent TDD methodology across languages with framework-specific patterns
- Engineers refactoring legacy code -- the test safety net approach ensures refactoring does not change behavior
- Choosing or setting up test frameworks -- use testing-framework for framework selection, infrastructure setup, and configuration across languages
- Finding and fixing bugs -- use debugging for root cause analysis and stack trace interpretation; TDD prevents bugs, it does not diagnose them
- Reviewing existing code or PRs -- use code-review for structured code review; TDD is a development methodology, not a review tool
- Testing Framework -- Test infrastructure setup and framework selection (complementary: TDD is the methodology, testing-framework is the tooling)
- Python Development -- Python-specific patterns including pytest fixtures and parametrization
- React Development -- React component testing with hooks and component architecture patterns
- Debugging -- When TDD catches a bug, debugging helps trace the root cause
- Code Review -- Review test quality and coverage as part of PR reviews
Part of SkillStack -- production-grade plugins for Claude Code.