Summary
Enhance the prompt evaluation wrapper (scripts/improve-prompt.py) with four high-impact improvements based on research into Claude/Claude Code prompt engineering best practices:
- Convert to XML structure - Leverage Claude's exceptional XML parsing capabilities
- Encourage parallel research - Align with Sonnet 4.5's parallel tool calling
- Token budget awareness - Prevent context exhaustion during research
- GOLDEN framework - Structure evaluation using Goal-Output-Limits-Data-Evaluation pattern
These improvements will enhance instruction-following accuracy, reduce research latency, improve context management, and provide clearer evaluation structure.
1. Convert to XML Structure
Current State
Plain text with dashes and bullet points for evaluation instructions.
Proposed Change
Replace plain text structure with XML tags throughout the wrapped prompt.
Implementation Details
Before:
EVALUATE: Is this prompt clear enough to execute, or does it need enrichment?
PROCEED IMMEDIATELY if:
- Detailed/specific OR you have sufficient context OR can infer intent
After:
<prompt_evaluation>
<original_request>{escaped_prompt}</original_request>
<evaluation_criteria>
Is this prompt clear enough to execute, or does it need enrichment?
<proceed_immediately>
- Detailed/specific OR you have sufficient context OR can infer intent
</proceed_immediately>
<clarification_required>
Only if genuinely vague (e.g., "fix the bug" with no context)
</clarification_required>
</evaluation_criteria>
<critical_rules>
- Trust user intent by default. Check conversation history before doing research.
- Do not rely on base knowledge.
- Never skip Phase 1. Research before asking.
- Don't announce evaluation - just proceed or ask.
</critical_rules>
<phase_1_research>
<required>DO NOT SKIP</required>
<steps>
1. Preface with brief note: "Prompt Improver Hook is seeking clarification because [specific reason]"
2. Create research plan with TodoWrite: "What do I need to research to clarify this vague request?"
3. Execute research using Task/Explore, WebSearch, Read/Grep
4. Use research findings to formulate grounded questions
5. Mark completed
</steps>
</phase_1_research>
<phase_2_ask>
<prerequisite>Only after Phase 1</prerequisite>
<steps>
1. Use the AskUserQuestion tool with 1-6 questions offering specific options from research
2. Use the answers to execute the original user request
</steps>
</phase_2_ask>
</prompt_evaluation>
File Changes
scripts/improve-prompt.py:43-69 - Replace wrapped_prompt variable construction
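A minimal sketch of what the replaced construction could look like, assuming the script builds wrapped_prompt from the raw user prompt string. build_wrapped_prompt is an illustrative name and the template body is abbreviated; xml.sax.saxutils.escape shows how user text can be kept from breaking the tag structure:

from xml.sax.saxutils import escape

def build_wrapped_prompt(user_prompt: str) -> str:
    """Sketch of the XML-structured wrapper; most section bodies elided."""
    escaped_prompt = escape(user_prompt)  # escapes &, <, > so user text stays inert
    return f"""<prompt_evaluation>
<original_request>{escaped_prompt}</original_request>
<evaluation_criteria>
Is this prompt clear enough to execute, or does it need enrichment?
</evaluation_criteria>
<critical_rules>
- Trust user intent by default. Check conversation history before doing research.
</critical_rules>
</prompt_evaluation>"""

Escaping the user prompt before interpolation matters here: once the wrapper is XML, a prompt containing literal angle brackets would otherwise be parsed as (or collide with) tags.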
Benefits
- Claude's training heavily emphasizes XML structure recognition
- Improved instruction-following accuracy
- Clearer hierarchical organization
- Better parsing of nested instructions
Acceptance Criteria
2. Encourage Parallel Research
Current State
Sequential research instructions with no guidance on parallel execution.
Proposed Change
Add explicit guidance to use parallel tool calls during research phase.
Implementation Details
Add new section within <phase_1_research>:
<research_execution>
Execute research efficiently:
- Use parallel tool calls when researching independent aspects
- Example: Run WebSearch + Task/Explore + Grep simultaneously
- Only sequence tools when they have dependencies
- Maximize throughput by batching independent operations
</research_execution>
File Changes
scripts/improve-prompt.py:59-64 - Add <research_execution> guidance within Phase 1
Benefits
- Aligns with Sonnet 4.5's aggressive parallel tool calling capabilities
- Reduces research phase latency significantly
- More efficient use of Claude Code's tool system
- Faster time-to-clarification for users
Acceptance Criteria
3. Token Budget Awareness
Current State
No context window management during research phase.
Proposed Change
Add guidance to monitor token budget and keep research concise.
Implementation Details
Add new section after <phase_1_research>:
<context_management>
Monitor your token budget during research phase:
- Keep research findings concise and high-signal
- Prioritize most relevant context over exhaustive exploration
- If approaching token limits, summarize and proceed
- Aim for minimal necessary context to formulate questions
</context_management>
File Changes
scripts/improve-prompt.py - Add <context_management> section after Phase 1
Benefits
- Prevents context exhaustion during research phase
- Leverages Sonnet 4.5's context window awareness capability
- Encourages focused, efficient research
- Reduces risk of hitting token limits before asking questions
Acceptance Criteria
4. GOLDEN Framework Structure
Current State
Implicit evaluation criteria without formal structure.
Proposed Change
Add explicit GOLDEN framework section (Goal-Output-Limits-Data-Evaluation).
Implementation Details
Add new section at the beginning of <prompt_evaluation>:
<evaluation_framework>
<goal>Determine if prompt needs enrichment to achieve successful first-attempt execution</goal>
<output>Either (a) proceed immediately with clear prompt, or (b) ask 1-6 grounded questions based on research</output>
<limits>
- Max 1-6 questions in Phase 2
- Research before asking (no base knowledge assumptions)
- Respect conversation context and history
- Honor bypass prefixes (*, /, #)
- Maintain ~300 token overhead maximum
</limits>
<data>
Available context sources:
- User prompt content and clarity
- Conversation history
- Codebase context (via Task/Explore, Grep, Read)
- External research (via WebSearch)
</data>
<evaluation>
Prompt clarity sufficient? Context available in conversation? Intent inferable from history?
</evaluation>
</evaluation_framework>
File Changes
scripts/improve-prompt.py:43 - Add <evaluation_framework> at start of wrapped prompt
Benefits
- Aligns with proven GOLDEN framework for optimal prompt construction
- Provides clear success criteria for evaluation
- Makes evaluation boundaries explicit
- Helps Claude understand constraints and objectives upfront
Acceptance Criteria
Implementation Strategy
Recommended Approach
Step 1: XML Conversion
- Convert existing plain text to XML structure
- Test with sample prompts (clear, vague, bypass)
- Ensure no regressions
Step 2: Add New Sections
- Add GOLDEN framework first (provides context for other sections)
- Add parallel research guidance
- Add token budget awareness
- Integrate all sections cohesively
Step 3: Testing
- Test with various prompt types
- Verify token count stays reasonable
- Confirm bypass logic still works
- Validate research quality improves
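A pytest sketch of these checks, assuming the hyphenated script is loaded via importlib and exposes helpers along the lines of should_bypass and build_wrapped_prompt (both names are assumptions; adapt to the script's real entry points):

import importlib.util
import xml.etree.ElementTree as ET
import pytest

# Load the hyphenated script as a module (cannot be imported by name directly).
spec = importlib.util.spec_from_file_location("improve_prompt", "scripts/improve-prompt.py")
improve_prompt = importlib.util.module_from_spec(spec)
spec.loader.exec_module(improve_prompt)

@pytest.mark.parametrize("prompt", ["*skip me", "/compact", "#remember this"])
def test_bypass_prefixes_skip_wrapping(prompt):
    # Bypass prefixes (*, /, #) must leave the prompt untouched.
    assert improve_prompt.should_bypass(prompt)  # hypothetical helper name

def test_wrapped_prompt_is_well_formed_xml():
    wrapped = improve_prompt.build_wrapped_prompt("add a retry flag to the CLI")
    ET.fromstring(wrapped)  # raises ParseError if the tags are unbalanced

def test_user_text_cannot_break_structure():
    # Angle brackets in user input must arrive as escaped text, not as tags.
    wrapped = improve_prompt.build_wrapped_prompt("what does <phase_1_research> mean?")
    root = ET.fromstring(wrapped)
    assert "<phase_1_research>" in root.findtext("original_request")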
Testing Checklist
Files to Modify
scripts/improve-prompt.py - Main implementation
README.md - Update documentation with new features
CHANGELOG.md - Document changes for next version
Version Target
Suggested version: v0.4.0 for these enhancements.
Additional Context
Research Sources
Based on comprehensive research of:
- Claude 4/Sonnet 4.5 best practices (docs.claude.com)
- Anthropic's "Effective Context Engineering for AI Agents"
- Anthropic's "Claude Code Best Practices"
- Claude prompt engineering overview and techniques
- GOLDEN framework for prompt optimization
Token Budget Impact
Current wrapper: ~300 tokens
Expected with changes: ~280-320 tokens (roughly net neutral)
XML structure is more compact than prose, potentially offsetting additions.
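A quick sanity check for that claim, using a crude ~4 characters/token heuristic and the hypothetical build_wrapped_prompt helper sketched earlier (for exact counts, the Anthropic SDK's token-counting endpoint would be more reliable):

def approx_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English prose and markup.
    return len(text) // 4

# Wrap an empty prompt so only the fixed wrapper overhead is measured.
overhead = approx_tokens(build_wrapped_prompt(""))
print(f"wrapper overhead: ~{overhead} tokens (target: ~280-320)")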
Breaking Changes
None expected. All changes are additive improvements to instruction quality.
Success Metrics
How we'll know this is successful:
- Improved accuracy in proceed vs. ask decisions
- Faster research phase (via parallel execution)
- Higher quality questions (via GOLDEN framework structure)
- No context exhaustion issues during research
- Maintained or improved user experience
References
- Related: Claude docs on XML structure, parallel tool calling, context awareness
- Framework: GOLDEN (Goal, Output, Limits, Data, Evaluation)