Enhance eval wrapper with XML structure, parallel research, token budget awareness, and GOLDEN framework #8

@severity1

Description

Summary

Enhance the prompt evaluation wrapper (scripts/improve-prompt.py) with four high-impact improvements based on Claude/Claude Code prompt engineering best practices research:

  1. Convert to XML structure - Leverage Claude's exceptional XML parsing capabilities
  2. Encourage parallel research - Align with Sonnet 4.5's parallel tool calling
  3. Token budget awareness - Prevent context exhaustion during research
  4. GOLDEN framework - Structure evaluation using Goal-Output-Limits-Data-Evaluation pattern

These improvements will enhance instruction-following accuracy, reduce research latency, improve context management, and provide clearer evaluation structure.
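
For orientation, the hook's overall flow is roughly the following minimal sketch, assuming a Claude Code UserPromptSubmit hook that receives its payload as JSON on stdin and contributes context via stdout (the actual scripts/improve-prompt.py may differ in its details):

#!/usr/bin/env python3
"""Illustrative sketch of the wrapper hook's flow (not the real script)."""
import json
import sys

BYPASS_PREFIXES = ("*", "/", "#")  # prefixes that skip evaluation entirely

def main() -> None:
    payload = json.load(sys.stdin)      # hook input arrives as JSON on stdin
    prompt = payload.get("prompt", "")

    # Bypass: prefixed prompts pass through with no added context.
    if prompt.lstrip().startswith(BYPASS_PREFIXES):
        sys.exit(0)

    # Otherwise emit the evaluation instructions; stdout from a
    # UserPromptSubmit hook is added to the model's context.
    wrapped = (
        "<prompt_evaluation>\n"
        f"<original_request>{prompt}</original_request>\n"
        "...\n"
        "</prompt_evaluation>"
    )
    print(wrapped)

if __name__ == "__main__":
    main()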


1. Convert to XML Structure

Current State

Plain text with dashes and bullet points for evaluation instructions.

Proposed Change

Replace plain text structure with XML tags throughout the wrapped prompt.

Implementation Details

Before:

EVALUATE: Is this prompt clear enough to execute, or does it need enrichment?

PROCEED IMMEDIATELY if:
- Detailed/specific OR you have sufficient context OR can infer intent

After:

<prompt_evaluation>
<original_request>{escaped_prompt}</original_request>

<evaluation_criteria>
Is this prompt clear enough to execute, or does it need enrichment?

<proceed_immediately>
- Detailed/specific OR you have sufficient context OR can infer intent
</proceed_immediately>

<clarification_required>
Only if genuinely vague (e.g., "fix the bug" with no context)
</clarification_required>
</evaluation_criteria>

<critical_rules>
- Trust user intent by default. Check conversation history before doing research.
- Do not rely on base knowledge.
- Never skip Phase 1. Research before asking.
- Don't announce evaluation - just proceed or ask.
</critical_rules>

<phase_1_research>
<required>DO NOT SKIP</required>
<steps>
1. Preface with brief note: "Prompt Improver Hook is seeking clarification because [specific reason]"
2. Create research plan with TodoWrite: "What do I need to research to clarify this vague request?"
3. Execute research using Task/Explore, WebSearch, Read/Grep
4. Use research findings to formulate grounded questions
5. Mark completed
</steps>
</phase_1_research>

<phase_2_ask>
<prerequisite>Only after Phase 1</prerequisite>
<steps>
1. Use the AskUserQuestion tool to ask 1-6 questions offering specific options from research
2. Use the answers to execute the original user request
</steps>
</phase_2_ask>
</prompt_evaluation>
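
In Python, the replaced wrapped_prompt construction could look like the following minimal sketch (html.escape stands in for whatever escaping produces {escaped_prompt} in the real script, and the XML body is elided to the structure shown above):

from html import escape

def build_wrapped_prompt(user_prompt: str) -> str:
    """Embed the user's prompt in the XML evaluation structure above."""
    escaped_prompt = escape(user_prompt)  # keep <, > and & from breaking the XML
    return f"""<prompt_evaluation>
<original_request>{escaped_prompt}</original_request>

<evaluation_criteria>
Is this prompt clear enough to execute, or does it need enrichment?
...
</evaluation_criteria>
</prompt_evaluation>"""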

File Changes

  • scripts/improve-prompt.py:43-69 - Replace wrapped_prompt variable construction

Benefits

  • Claude's training heavily emphasizes XML structure recognition
  • Improved instruction-following accuracy
  • Clearer hierarchical organization
  • Better parsing of nested instructions

Acceptance Criteria

  • All evaluation instructions use XML tags
  • Nested structure clearly defines phases and steps
  • Original functionality preserved (same bypass logic)
  • Token count remains reasonable (~300 tokens or less)

2. Encourage Parallel Research

Current State

Sequential research instructions with no guidance on parallel execution.

Proposed Change

Add explicit guidance to use parallel tool calls during research phase.

Implementation Details

Add new section within <phase_1_research>:

<research_execution>
Execute research efficiently:
- Use parallel tool calls when researching independent aspects
- Example: Run WebSearch + Task/Explore + Grep simultaneously
- Only sequence tools when they have dependencies
- Maximize throughput by batching independent operations
</research_execution>

File Changes

  • scripts/improve-prompt.py:59-64 - Add <research_execution> guidance within Phase 1

Benefits

  • Aligns with Sonnet 4.5's aggressive parallel tool calling capabilities
  • Reduces research-phase latency by running independent lookups concurrently
  • More efficient use of Claude Code's tool system
  • Faster time-to-clarification for users

Acceptance Criteria

  • Parallel research guidance added to Phase 1 instructions
  • Examples demonstrate parallel vs. sequential tool usage
  • Instructions specify when to use parallel execution
  • Guidance integrated into XML structure

3. Token Budget Awareness

Current State

No context window management during research phase.

Proposed Change

Add guidance to monitor token budget and keep research concise.

Implementation Details

Add new section after <phase_1_research>:

<context_management>
Monitor your token budget during research phase:
- Keep research findings concise and high-signal
- Prioritize most relevant context over exhaustive exploration
- If approaching token limits, summarize and proceed
- Aim for minimal necessary context to formulate questions
</context_management>

File Changes

  • scripts/improve-prompt.py - Add <context_management> section after Phase 1

Benefits

  • Prevents context exhaustion during research phase
  • Leverages Sonnet 4.5's context window awareness capability
  • Encourages focused, efficient research
  • Reduces risk of hitting token limits before asking questions

Acceptance Criteria

  • Token budget awareness guidance added
  • Instructions emphasize concise, high-signal research
  • Guidance mentions summarization strategy
  • Integrated naturally into evaluation flow

4. GOLDEN Framework Structure

Current State

Implicit evaluation criteria without formal structure.

Proposed Change

Add explicit GOLDEN framework section (Goal-Output-Limits-Data-Evaluation).

Implementation Details

Add new section at the beginning of <prompt_evaluation>:

<evaluation_framework>
<goal>Determine if prompt needs enrichment to achieve successful first-attempt execution</goal>

<output>Either (a) proceed immediately with clear prompt, or (b) ask 1-6 grounded questions based on research</output>

<limits>
- 1-6 questions in Phase 2
- Research before asking (no base knowledge assumptions)
- Respect conversation context and history
- Honor bypass prefixes (*, /, #)
- Maintain ~300 token overhead maximum
</limits>

<data>
Available context sources:
- User prompt content and clarity
- Conversation history
- Codebase context (via Task/Explore, Grep, Read)
- External research (via WebSearch)
</data>

<evaluation>
Prompt clarity sufficient? Context available in conversation? Intent inferable from history?
</evaluation>
</evaluation_framework>

File Changes

  • scripts/improve-prompt.py:43 - Add <evaluation_framework> at start of wrapped prompt

Benefits

  • Aligns with proven GOLDEN framework for optimal prompt construction
  • Provides clear success criteria for evaluation
  • Makes evaluation boundaries explicit
  • Helps Claude understand constraints and objectives upfront

Acceptance Criteria

  • GOLDEN framework section added
  • All five components present (Goal, Output, Limits, Data, Evaluation)
  • Framework appears before detailed instructions
  • Other sections remain consistent with the framework's goal, limits, and data sources

Implementation Strategy

Recommended Approach

Step 1: XML Conversion

  • Convert existing plain text to XML structure
  • Test with sample prompts (clear, vague, bypass)
  • Ensure no regressions

Step 2: Add New Sections

  • Add GOLDEN framework first (provides context for other sections)
  • Add parallel research guidance
  • Add token budget awareness
  • Integrate all sections cohesively

Step 3: Testing

  • Test with various prompt types
  • Verify token count stays reasonable
  • Confirm bypass logic still works
  • Validate research quality improves

Testing Checklist

  • Clear prompts proceed without intervention
  • Vague prompts trigger research phase
  • Bypass prefixes (*, /, #) work correctly
  • Research uses parallel tools when appropriate
  • Questions are grounded in research findings
  • Token overhead remains ~300 tokens or less
  • XML structure doesn't break JSON escaping
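
A hedged pytest sketch for the bypass, token-overhead, and JSON-escaping items above (the import path and helper are hypothetical, and the chars/4 count is a crude heuristic, not Anthropic's tokenizer):

import json

import pytest

from improve_prompt import build_wrapped_prompt  # hypothetical import path

def rough_tokens(text: str) -> int:
    return len(text) // 4  # crude heuristic, not a real tokenizer

@pytest.mark.parametrize("prefix", ["*", "/", "#"])
def test_bypass_prefix_detected(prefix):
    prompt = f"{prefix} run the linter"
    assert prompt.lstrip().startswith(("*", "/", "#"))  # hook's bypass check

def test_wrapper_overhead_stays_small():
    wrapped = build_wrapped_prompt("fix the bug")
    assert rough_tokens(wrapped) - rough_tokens("fix the bug") <= 320

def test_xml_survives_json_round_trip():
    wrapped = build_wrapped_prompt('say "hi" & <b>bold</b>')
    assert json.loads(json.dumps(wrapped)) == wrapped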

Files to Modify

  • scripts/improve-prompt.py - Main implementation
  • README.md - Update documentation with new features
  • CHANGELOG.md - Document changes for next version

Version Target

Suggest version v0.4.0 for these enhancements.


Additional Context

Research Sources

Based on comprehensive research of:

  • Claude 4/Sonnet 4.5 best practices (docs.claude.com)
  • Anthropic's "Effective Context Engineering for AI Agents"
  • Anthropic's "Claude Code Best Practices"
  • Claude prompt engineering overview and techniques
  • GOLDEN framework for prompt optimization

Token Budget Impact

Current wrapper: ~300 tokens
Expected with changes: ~280-320 tokens (roughly net neutral; small shifts in either direction are possible)

XML structure is more compact than prose, potentially offsetting additions.
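
To verify the estimate rather than eyeball it, the two wrappers could be measured with Anthropic's token-counting endpoint, as in this sketch (assumes the anthropic Python SDK and an API key; the model id and sample strings are illustrative):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def count_tokens(text: str) -> int:
    result = client.messages.count_tokens(
        model="claude-sonnet-4-5",  # illustrative model id
        messages=[{"role": "user", "content": text}],
    )
    return result.input_tokens

old_wrapper = "EVALUATE: Is this prompt clear enough to execute..."  # current text
new_wrapper = "<prompt_evaluation>...</prompt_evaluation>"           # proposed text
print(count_tokens(new_wrapper) - count_tokens(old_wrapper))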

Breaking Changes

None expected. All changes are additive improvements to instruction quality.


Success Metrics

How we'll know this is successful:

  1. Improved accuracy in proceed vs. ask decisions
  2. Faster research phase (via parallel execution)
  3. Higher quality questions (via GOLDEN framework structure)
  4. No context exhaustion issues during research
  5. Maintained or improved user experience

References

  • Related: Claude docs on XML structure, parallel tool calling, context awareness
  • Framework: GOLDEN (Goal, Output, Limits, Data, Evaluation)
