2 changes: 2 additions & 0 deletions .gitignore
@@ -4,6 +4,8 @@ node_modules/
# Output directories - all generated outputs should not be committed
outputs/
specwright/outputs/
# But keep demo fixtures tracked in git
!fixtures/demo/outputs/

# System files
.DS_Store
14 changes: 14 additions & 0 deletions CLAUDE.md
@@ -24,6 +24,20 @@ It has a CLI (Node.js/Express) and a Web UI (React/Vite) that share a file-based
- `npm run format` / `npm run format:check` — Prettier
- No test framework is configured. Do not create test files.

## Demo Fixtures

Pre-populated projects for demos live in `fixtures/demo/`. Load them with:

```bash
./scripts/load-demo.sh # Copy fixtures into outputs/ (backs up existing)
./scripts/load-demo.sh --clean # Wipe outputs/ and load fresh
./scripts/load-demo.sh --reset # Restore original outputs/ from backup
```

The fixtures contain five NovaMind AI projects, each at a different workflow stage: complete with issues, docs reviewing, mid-workflow, early (PM questions), and complete with all issues pending.

When editing fixtures, keep files in `fixtures/demo/outputs/` (tracked in git via `.gitignore` negation). The `outputs/` directory itself remains gitignored.

## Local CLI Testing (Unpublished Changes)

**MANDATORY: After ANY code edit session, ALWAYS run `npm run build && npm link` before finishing.** This ensures `specwright-dev` reflects the latest changes. Verify with `specwright-dev --version`. Never skip this step.
@@ -0,0 +1,105 @@
{
  "project_id": "001-ai-prompt-playground",
  "project_name": "AI Prompt Playground",
  "job_stories": [
    {
      "job_story_id": "js_001",
      "title": "Write and Test a Prompt",
      "situation": "I have a prompt idea and want to see how different LLMs respond",
      "motivation": "quickly test and compare outputs without switching between provider playgrounds",
      "outcome": "I can evaluate which provider gives the best result for my use case",
      "acceptance_criteria": [
        {
          "id": "ac_001_01",
          "given": "I am on the Prompt Playground page",
          "when": "I type a prompt with {{variable}} placeholders in the editor",
          "then": "Variable inputs appear below the editor for each detected placeholder"
        },
        {
          "id": "ac_001_02",
          "given": "I have written a prompt and selected Claude and GPT-4",
          "when": "I click 'Run'",
          "then": "Both providers stream responses simultaneously in side-by-side panels"
        },
        {
          "id": "ac_001_03",
          "given": "Responses are streaming",
          "when": "I view the response panels",
          "then": "Each panel shows a live token counter and elapsed time"
        },
        {
          "id": "ac_001_04",
          "given": "A provider returns an error",
          "when": "The error is received",
          "then": "The panel shows the error message with a 'Retry' button while other panels continue"
        }
      ]
    },
    {
      "job_story_id": "js_002",
      "title": "Compare Responses Across Providers",
      "situation": "multiple providers have returned responses to the same prompt",
      "motivation": "evaluate which response is best for quality, accuracy, and cost",
      "outcome": "I can make an informed decision about which provider to use",
      "acceptance_criteria": [
        {
          "id": "ac_002_01",
          "given": "Responses from 2+ providers are displayed",
          "when": "I view the comparison layout",
          "then": "Each response shows token count, latency, and estimated cost"
        },
        {
          "id": "ac_002_02",
          "given": "I am viewing a response",
          "when": "I click the thumbs up or thumbs down button",
          "then": "The rating is saved and visible in the version history for this run"
        },
        {
          "id": "ac_002_03",
          "given": "I want to read one response in detail",
          "when": "I click 'Expand' on a response panel",
          "then": "The panel takes full width with the others collapsed to tabs"
        },
        {
          "id": "ac_002_04",
          "given": "Responses contain markdown",
          "when": "I view the response",
          "then": "Markdown is rendered with proper headings, code blocks, and lists"
        }
      ]
    },
    {
      "job_story_id": "js_003",
      "title": "Version and Iterate on Prompts",
      "situation": "I have been iterating on a prompt across multiple test runs",
      "motivation": "see what changes I made and how they affected response quality",
      "outcome": "I can learn what prompt patterns work best and avoid regressions",
      "acceptance_criteria": [
        {
          "id": "ac_003_01",
          "given": "I have run a prompt test",
          "when": "The responses complete",
          "then": "A new version is auto-saved with timestamp, prompt text, and quality scores"
        },
        {
          "id": "ac_003_02",
          "given": "I am in the version history panel",
          "when": "I select two versions",
          "then": "A diff view shows additions in green and deletions in red between the two prompts"
        },
        {
          "id": "ac_003_03",
          "given": "I am viewing a previous version",
          "when": "I click 'Restore'",
          "then": "The prompt editor loads that version's text as the active prompt"
        },
        {
          "id": "ac_003_04",
          "given": "I have multiple versions with quality scores",
          "when": "I view the version list",
          "then": "A sparkline chart shows quality score trend across versions"
        }
      ]
    }
  ]
}
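Reviewer note: the fixture above follows a regular shape, so a small type definition plus an ID check can catch copy-paste slips when fixtures are edited by hand. The sketch below is illustrative only. The interface names and the `checkJobStoryIds` helper are hypothetical and not part of SpecWright, and the `ac_<story number>_<two-digit index>` convention is inferred from the IDs in this file rather than from a documented schema.

```typescript
// Field names mirror the fixture; the type names themselves are hypothetical.
interface AcceptanceCriterion {
  id: string;
  given: string;
  when: string;
  then: string;
}

interface JobStory {
  job_story_id: string;
  title: string;
  situation: string;
  motivation: string;
  outcome: string;
  acceptance_criteria: AcceptanceCriterion[];
}

interface JobStoriesFile {
  project_id: string;
  project_name: string;
  job_stories: JobStory[];
}

// Returns human-readable problems; an empty array means the IDs are consistent.
function checkJobStoryIds(file: JobStoriesFile): string[] {
  const problems: string[] = [];
  const seen: string[] = [];
  for (const story of file.job_stories) {
    const match = /^js_(\d{3})$/.exec(story.job_story_id);
    if (match === null) {
      problems.push(`bad job story id: ${story.job_story_id}`);
      continue;
    }
    // Assumed convention: ac_<story number>_<two-digit index>, e.g. ac_001_02.
    const expected = new RegExp(`^ac_${match[1]}_\\d{2}$`);
    for (const criterion of story.acceptance_criteria) {
      if (!expected.test(criterion.id)) {
        problems.push(`${criterion.id} does not belong to ${story.job_story_id}`);
      }
      if (seen.indexOf(criterion.id) !== -1) {
        problems.push(`duplicate id: ${criterion.id}`);
      }
      seen.push(criterion.id);
    }
  }
  return problems;
}
```

Passing the parsed fixture to `checkJobStoryIds` should return an empty array if every criterion ID matches its parent story.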
@@ -0,0 +1,78 @@
# Design Brief: AI Prompt Playground

## Design Goals

1. **Editor-first** - The prompt editor is the hero; everything else supports it
2. **Comparison-friendly** - Make it effortless to see differences between providers and versions
3. **Fast iteration** - Minimize clicks between writing a prompt and seeing results

## User Flows

### Flow 1: Write and Run a Prompt

```
New Prompt → Editor appears
|
Type prompt with {{variables}}
- Variable inputs auto-appear below
- Provider checkboxes on the right
|
Click "Run" (or Cmd+Enter)
|
Responses stream side-by-side
- Token count + latency per provider
- Rate each response (thumbs up/down)
```
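One way to read the "Run" step in Flow 1 is that every selected provider is called concurrently and a failure in one panel must not block the others. The sketch below shows that isolation under simplifying assumptions: a provider is modelled as a plain async function, streaming is left out, and the `Provider`, `PanelState`, and `runAll` names are hypothetical rather than taken from any SDK or from the SpecWright codebase.

```typescript
// Assumption: a provider is a function from prompt text to the full response text.
type Provider = (prompt: string) => Promise<string>;

type PanelState =
  | { provider: string; status: "done"; text: string; latencyMs: number }
  | { provider: string; status: "error"; message: string };

// Starts every provider at once. Each call catches its own error, so the
// combined promise never rejects and one failing panel leaves the rest intact.
function runAll(
  prompt: string,
  providers: Record<string, Provider>,
): Promise<PanelState[]> {
  return Promise.all(
    Object.entries(providers).map(async ([name, run]): Promise<PanelState> => {
      const started = Date.now();
      try {
        const text = await run(prompt);
        return { provider: name, status: "done", text, latencyMs: Date.now() - started };
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        return { provider: name, status: "error", message };
      }
    }),
  );
}
```

An error panel carries only the message here; a "Retry" action could call the same provider function again for that one panel.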

### Flow 2: Compare and Iterate

```
View responses from Run #1
|
Edit prompt → Run again
|
Version history sidebar shows Run #1 and #2
|
Select both → Diff view shows prompt changes
|
Quality trend sparkline shows improvement
```
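The quality trend sparkline at the end of Flow 2 can be prototyped as text before any charting work. The sketch below is a minimal version, assuming quality scores are plain numbers on any consistent scale (the brief does not specify one). The `sparkline` function is a hypothetical helper.

```typescript
const BARS = "▁▂▃▄▅▆▇█";

// Maps each score to one of eight block characters, scaled between the
// lowest and highest score in the series.
function sparkline(scores: number[]): string {
  if (scores.length === 0) return "";
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  const range = max - min;
  return scores
    .map((score) => {
      // A flat series has no range; draw it mid-height instead of dividing by zero.
      const level = range === 0 ? 3 : Math.round(((score - min) / range) * (BARS.length - 1));
      return BARS.charAt(level);
    })
    .join("");
}
```

Because the scale is relative to the series, the same bar height can mean different absolute scores in two different prompts. A real chart would likely fix the axis to the scoring range.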

### Flow 3: Browse and Use Templates

```
Click "Templates" in sidebar
|
Browse by category (Extraction, Classification, Generation, etc.)
|
Preview template with sample variables
|
Click "Use Template" → loads into editor
|
Customize and run
```

## Key Screens

1. **Prompt Editor** - Split view: editor left, response panels right
2. **Version History** - Sidebar with version list, diff view on select
3. **Template Library** - Grid of template cards by category
4. **Settings** - Provider API keys, default model selection, preferences

## Visual Guidelines

- Clean, minimal interface — the content (prompts and responses) is the focus
- Monospace font in editor, proportional in responses (with code blocks monospace)
- Provider colors: Claude (orange), GPT-4 (green), Gemini (blue)
- Quality indicators: High (green), Medium (yellow), Low (red)
- Dark mode default with light mode option

## Accessibility

- Full keyboard navigation (Tab between panels, Cmd+Enter to run)
- Screen reader announces response completion and quality scores
- Sufficient contrast for diff highlighting (not just color-dependent)
- Focus management: after Run, focus moves to first response panel

---
*Document generated as part of SpecWright specification*
@@ -0,0 +1,82 @@
# Product Requirements Document: AI Prompt Playground

## Overview

An interactive workspace for prompt engineering that lets developers write, test, and refine AI prompts across multiple LLM providers. Compare responses side-by-side, track prompt iterations with version history, score response quality, and build a reusable template library.

## Problem Statement

Prompt engineering is iterative and messy. Developers bounce between provider playgrounds, lose track of what they tested, and have no way to compare outputs systematically. There is no single tool that combines writing, testing, versioning, and quality tracking in one place.

## Goals

1. **Multi-provider testing** - Run the same prompt against Claude, GPT-4, and Gemini in one click
2. **Version control** - Track prompt iterations with diffs so you can see what changed and why
3. **Quality measurement** - Automated and manual scoring to track improvement over time
4. **Reusable templates** - Build a library of proven prompt patterns for the team

## User Stories

### US-1: Write and Test a Prompt
**As a** developer, **I want to** write a prompt and test it against multiple LLMs simultaneously, **so that** I can compare outputs and pick the best provider for my use case.

**Acceptance Criteria:**
- Rich text editor for prompt with variable placeholder support ({{variable}})
- Select one or more providers to test against
- Run all selected providers in parallel
- Streaming responses displayed in real-time
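The placeholder support in the first criterion implies the editor has to find `{{variable}}` names in the prompt text. A minimal sketch of that detection follows. It assumes names consist of letters, digits, and underscores and that whitespace inside the braces is tolerated; the PRD does not define the grammar, and `extractVariables` is a hypothetical helper.

```typescript
// Returns the distinct {{variable}} names in a prompt, in order of first appearance.
function extractVariables(prompt: string): string[] {
  // Assumed grammar: {{ name }} where name is an identifier-like token.
  const pattern = /\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g;
  const names: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(prompt)) !== null) {
    const name = match[1];
    if (name !== undefined && names.indexOf(name) === -1) {
      names.push(name);
    }
  }
  return names;
}
```

Deduplicating the names means a variable used twice in the prompt still produces a single input field.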

### US-2: Compare Responses Side-by-Side
**As a** developer, **I want to** see responses from different providers next to each other, **so that** I can evaluate quality, tone, and accuracy differences.

**Acceptance Criteria:**
- Side-by-side columns for each provider response
- Token count and latency shown per response
- Thumbs up/down rating per response
- Expand any single response to full width for detailed reading

### US-3: Version and Iterate on Prompts
**As a** developer, **I want to** save prompt versions and see diffs between iterations, **so that** I can track what changes improved or degraded quality.

**Acceptance Criteria:**
- Auto-save creates a new version on each test run
- Version list shows timestamp, provider tested, quality score
- Diff view highlights additions/deletions between any two versions
- Can restore any previous version as the active prompt
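The diff criterion above does not say at what granularity changes are compared. As a sketch under the assumption that a line-level diff is acceptable, the following uses a longest-common-subsequence table to label each line as unchanged, added, or removed. `DiffLine` and `diffLines` are hypothetical names, and a production diff view would probably use an existing diff library rather than this quadratic-time version.

```typescript
type DiffLine = { kind: "same" | "added" | "removed"; text: string };

function diffLines(before: string, after: string): DiffLine[] {
  const a = before.split("\n");
  const b = after.split("\n");

  // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..].
  const lcs: number[][] = [];
  for (let row = 0; row <= a.length; row++) {
    lcs.push(new Array<number>(b.length + 1).fill(0));
  }
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] =
        a[i] === b[j] ? lcs[i + 1][j + 1] + 1 : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk both versions, preferring the step that keeps the longer common tail.
  const out: DiffLine[] = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      out.push({ kind: "same", text: a[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      out.push({ kind: "removed", text: a[i] });
      i++;
    } else {
      out.push({ kind: "added", text: b[j] });
      j++;
    }
  }
  while (i < a.length) out.push({ kind: "removed", text: a[i++] });
  while (j < b.length) out.push({ kind: "added", text: b[j++] });
  return out;
}
```

Each `DiffLine` maps directly onto a row in the diff view: `added` rows rendered as additions, `removed` rows as deletions.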

### US-4: Use and Create Templates
**As a** developer, **I want to** start from proven prompt templates, **so that** I don't reinvent common patterns.

**Acceptance Criteria:**
- Browse templates by category (classification, extraction, generation, etc.)
- Preview template with example variables filled in
- One-click to load template into editor
- Save any prompt as a new template
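Previewing a template with example variables amounts to substituting values for `{{variable}}` placeholders. The sketch below shows one possible behavior, assuming the same identifier-style placeholder names as in US-1 and assuming that placeholders without a supplied value should stay visible in the preview. `renderTemplate` is a hypothetical helper, not an existing SpecWright function.

```typescript
// Replaces each {{name}} with values[name]; unknown names are left untouched
// so the preview makes missing variables obvious.
function renderTemplate(template: string, values: Record<string, string>): string {
  return template.replace(
    /\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g,
    (placeholder: string, name: string) => {
      // hasOwnProperty guards against inherited keys such as "constructor".
      const value = Object.prototype.hasOwnProperty.call(values, name)
        ? values[name]
        : undefined;
      return value === undefined ? placeholder : value;
    },
  );
}
```

For example, `renderTemplate("Classify {{text}} as {{label_set}}", { text: "great product" })` fills in `text` and leaves `{{label_set}}` as written.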

## Scope

### In Scope
- Multi-provider prompt testing (Claude, GPT-4, Gemini)
- Streaming responses with real-time display
- Prompt version history with diff comparison
- Automated + manual quality scoring
- Template library with categories and search
- Export/import prompts as JSON

### Out of Scope
- Team collaboration (v2)
- CI/CD integration for prompt regression testing (v2)
- Prompt chaining / multi-step workflows
- Fine-tuning integration

## Success Metrics

| Metric | Target |
|--------|--------|
| Prompts tested per session | avg 5+ |
| Version comparison usage | >60% of users |
| Template adoption rate | >40% start from template |
| Time to first test | <30 seconds |

---
*Document generated as part of SpecWright specification*