Enhance eval wrapper with XML structure, parallel research, token budget awareness, and GOLDEN framework #8

@severity1

Description

Summary

Enhance the prompt evaluation wrapper (scripts/improve-prompt.py) with four high-impact improvements based on Claude/Claude Code prompt engineering best practices research:

  1. Convert to XML structure - Leverage Claude's exceptional XML parsing capabilities
  2. Encourage parallel research - Align with Sonnet 4.5's parallel tool calling
  3. Token budget awareness - Prevent context exhaustion during research
  4. GOLDEN framework - Structure evaluation using Goal-Output-Limits-Data-Evaluation pattern

These improvements will enhance instruction-following accuracy, reduce research latency, improve context management, and provide clearer evaluation structure.
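
For orientation, the hook's overall flow is roughly the following minimal sketch, assuming a Claude Code UserPromptSubmit hook that receives its payload as JSON on stdin and contributes context via stdout (the actual scripts/improve-prompt.py may differ in its details):

#!/usr/bin/env python3
"""Illustrative sketch of the wrapper hook's flow (not the real script)."""
import json
import sys

BYPASS_PREFIXES = ("*", "/", "#")  # prefixes that skip evaluation entirely

def main() -> None:
    payload = json.load(sys.stdin)      # hook input arrives as JSON on stdin
    prompt = payload.get("prompt", "")

    # Bypass: prefixed prompts pass through with no added context.
    if prompt.lstrip().startswith(BYPASS_PREFIXES):
        sys.exit(0)

    # Otherwise emit the evaluation instructions; stdout from a
    # UserPromptSubmit hook is added to the model's context.
    wrapped = (
        "<prompt_evaluation>\n"
        f"<original_request>{prompt}</original_request>\n"
        "...\n"
        "</prompt_evaluation>"
    )
    print(wrapped)

if __name__ == "__main__":
    main()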


1. Convert to XML Structure

Current State

Plain text with dashes and bullet points for evaluation instructions.

Proposed Change

Replace plain text structure with XML tags throughout the wrapped prompt.

Implementation Details

Before:

EVALUATE: Is this prompt clear enough to execute, or does it need enrichment?

PROCEED IMMEDIATELY if:
- Detailed/specific OR you have sufficient context OR can infer intent

After:

<prompt_evaluation>
<original_request>{escaped_prompt}</original_request>

<evaluation_criteria>
Is this prompt clear enough to execute, or does it need enrichment?

<proceed_immediately>
- Detailed/specific OR you have sufficient context OR can infer intent
</proceed_immediately>

<clarification_required>
Only if genuinely vague (e.g., "fix the bug" with no context)
</clarification_required>
</evaluation_criteria>

<critical_rules>
- Trust user intent by default. Check conversation history before doing research.
- Do not rely on base knowledge.
- Never skip Phase 1. Research before asking.
- Don't announce evaluation - just proceed or ask.
</critical_rules>

<phase_1_research>
<required>DO NOT SKIP</required>
<steps>
1. Preface with brief note: "Prompt Improver Hook is seeking clarification because [specific reason]"
2. Create research plan with TodoWrite: "What do I need to research to clarify this vague request?"
3. Execute research using Task/Explore, WebSearch, Read/Grep
4. Use research findings to formulate grounded questions
5. Mark completed
</steps>
</phase_1_research>

<phase_2_ask>
<prerequisite>Only after Phase 1</prerequisite>
<steps>
1. Use the AskUserQuestion tool to ask 1-6 questions offering specific options from research
2. Use the answers to execute the original user request
</steps>
</phase_2_ask>
</prompt_evaluation>
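
In Python, the replaced wrapped_prompt construction could look like the following minimal sketch (html.escape stands in for whatever escaping produces {escaped_prompt} in the real script, and the XML body is elided to the structure shown above):

from html import escape

def build_wrapped_prompt(user_prompt: str) -> str:
    """Embed the user's prompt in the XML evaluation structure above."""
    escaped_prompt = escape(user_prompt)  # keep <, > and & from breaking the XML
    return f"""<prompt_evaluation>
<original_request>{escaped_prompt}</original_request>

<evaluation_criteria>
Is this prompt clear enough to execute, or does it need enrichment?
...
</evaluation_criteria>
</prompt_evaluation>"""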

File Changes

  • scripts/improve-prompt.py:43-69 - Replace wrapped_prompt variable construction

Benefits

  • Claude's training heavily emphasizes XML structure recognition
  • Improved instruction-following accuracy
  • Clearer hierarchical organization
  • Better parsing of nested instructions

Acceptance Criteria

  • All evaluation instructions use XML tags
  • Nested structure clearly defines phases and steps
  • Original functionality preserved (same bypass logic)
  • Token count remains reasonable (~300 tokens or less)

2. Encourage Parallel Research

Current State

Sequential research instructions with no guidance on parallel execution.

Proposed Change

Add explicit guidance to use parallel tool calls during research phase.

Implementation Details

Add new section within <phase_1_research>:

<research_execution>
Execute research efficiently:
- Use parallel tool calls when researching independent aspects
- Example: Run WebSearch + Task/Explore + Grep simultaneously
- Only sequence tools when they have dependencies
- Maximize throughput by batching independent operations
</research_execution>

File Changes

  • scripts/improve-prompt.py:59-64 - Add <research_execution> guidance within Phase 1

Benefits

  • Aligns with Sonnet 4.5's aggressive parallel tool calling capabilities
  • Reduces research-phase latency by running independent lookups concurrently
  • More efficient use of Claude Code's tool system
  • Faster time-to-clarification for users

Acceptance Criteria

  • Parallel research guidance added to Phase 1 instructions
  • Examples demonstrate parallel vs. sequential tool usage
  • Instructions specify when to use parallel execution
  • Guidance integrated into XML structure

3. Token Budget Awareness

Current State

No context window management during research phase.

Proposed Change

Add guidance to monitor token budget and keep research concise.

Implementation Details

Add new section after <phase_1_research>:

<context_management>
Monitor your token budget during research phase:
- Keep research findings concise and high-signal
- Prioritize most relevant context over exhaustive exploration
- If approaching token limits, summarize and proceed
- Aim for minimal necessary context to formulate questions
</context_management>

File Changes

  • scripts/improve-prompt.py - Add <context_management> section after Phase 1

Benefits

  • Prevents context exhaustion during research phase
  • Leverages Sonnet 4.5's context window awareness capability
  • Encourages focused, efficient research
  • Reduces risk of hitting token limits before asking questions

Acceptance Criteria

  • Token budget awareness guidance added
  • Instructions emphasize concise, high-signal research
  • Guidance mentions summarization strategy
  • Integrated naturally into evaluation flow

4. GOLDEN Framework Structure

Current State

Implicit evaluation criteria without formal structure.

Proposed Change

Add explicit GOLDEN framework section (Goal-Output-Limits-Data-Evaluation).

Implementation Details

Add new section at the beginning of <prompt_evaluation>:

<evaluation_framework>
<goal>Determine if prompt needs enrichment to achieve successful first-attempt execution</goal>

<output>Either (a) proceed immediately with clear prompt, or (b) ask 1-6 grounded questions based on research</output>

<limits>
- 1-6 questions in Phase 2
- Research before asking (no base knowledge assumptions)
- Respect conversation context and history
- Honor bypass prefixes (*, /, #)
- Maintain ~300 token overhead maximum
</limits>

<data>
Available context sources:
- User prompt content and clarity
- Conversation history
- Codebase context (via Task/Explore, Grep, Read)
- External research (via WebSearch)
</data>

<evaluation>
Prompt clarity sufficient? Context available in conversation? Intent inferable from history?
</evaluation>
</evaluation_framework>

File Changes

  • scripts/improve-prompt.py:43 - Add <evaluation_framework> at start of wrapped prompt

Benefits

  • Aligns with proven GOLDEN framework for optimal prompt construction
  • Provides clear success criteria for evaluation
  • Makes evaluation boundaries explicit
  • Helps Claude understand constraints and objectives upfront

Acceptance Criteria

  • GOLDEN framework section added
  • All five components present (Goal, Output, Limits, Data, Evaluation)
  • Framework appears before detailed instructions
  • Other sections remain consistent with the framework's goal, limits, and data sources

Implementation Strategy

Recommended Approach

Step 1: XML Conversion

  • Convert existing plain text to XML structure
  • Test with sample prompts (clear, vague, bypass)
  • Ensure no regressions

Step 2: Add New Sections

  • Add GOLDEN framework first (provides context for other sections)
  • Add parallel research guidance
  • Add token budget awareness
  • Integrate all sections cohesively

Step 3: Testing

  • Test with various prompt types
  • Verify token count stays reasonable
  • Confirm bypass logic still works
  • Validate research quality improves

Testing Checklist

  • Clear prompts proceed without intervention
  • Vague prompts trigger research phase
  • Bypass prefixes (*, /, #) work correctly
  • Research uses parallel tools when appropriate
  • Questions are grounded in research findings
  • Token overhead remains ~300 tokens or less
  • XML structure doesn't break JSON escaping
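
A hedged pytest sketch for the bypass, token-overhead, and JSON-escaping items above (the import path and helper are hypothetical, and the chars/4 count is a crude heuristic, not Anthropic's tokenizer):

import json

import pytest

from improve_prompt import build_wrapped_prompt  # hypothetical import path

def rough_tokens(text: str) -> int:
    return len(text) // 4  # crude heuristic, not a real tokenizer

@pytest.mark.parametrize("prefix", ["*", "/", "#"])
def test_bypass_prefix_detected(prefix):
    prompt = f"{prefix} run the linter"
    assert prompt.lstrip().startswith(("*", "/", "#"))  # hook's bypass check

def test_wrapper_overhead_stays_small():
    wrapped = build_wrapped_prompt("fix the bug")
    assert rough_tokens(wrapped) - rough_tokens("fix the bug") <= 320

def test_xml_survives_json_round_trip():
    wrapped = build_wrapped_prompt('say "hi" & <b>bold</b>')
    assert json.loads(json.dumps(wrapped)) == wrapped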

Files to Modify

  • scripts/improve-prompt.py - Main implementation
  • README.md - Update documentation with new features
  • CHANGELOG.md - Document changes for next version

Version Target

Suggest version v0.4.0 for these enhancements.


Additional Context

Research Sources

Based on comprehensive research of:

  • Claude 4/Sonnet 4.5 best practices (docs.claude.com)
  • Anthropic's "Effective Context Engineering for AI Agents"
  • Anthropic's "Claude Code Best Practices"
  • Claude prompt engineering overview and techniques
  • GOLDEN framework for prompt optimization

Token Budget Impact

Current wrapper: ~300 tokens
Expected with changes: ~280-320 tokens (roughly net neutral; small shifts in either direction are possible)

XML structure is more compact than prose, potentially offsetting additions.
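
To verify the estimate rather than eyeball it, the two wrappers could be measured with Anthropic's token-counting endpoint, as in this sketch (assumes the anthropic Python SDK and an API key; the model id and sample strings are illustrative):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def count_tokens(text: str) -> int:
    result = client.messages.count_tokens(
        model="claude-sonnet-4-5",  # illustrative model id
        messages=[{"role": "user", "content": text}],
    )
    return result.input_tokens

old_wrapper = "EVALUATE: Is this prompt clear enough to execute..."  # current text
new_wrapper = "<prompt_evaluation>...</prompt_evaluation>"           # proposed text
print(count_tokens(new_wrapper) - count_tokens(old_wrapper))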

Breaking Changes

None expected. All changes are additive improvements to instruction quality.


Success Metrics

How we'll know this is successful:

  1. Improved accuracy in proceed vs. ask decisions
  2. Faster research phase (via parallel execution)
  3. Higher quality questions (via GOLDEN framework structure)
  4. No context exhaustion issues during research
  5. Maintained or improved user experience

References

  • Related: Claude docs on XML structure, parallel tool calling, context awareness
  • Framework: GOLDEN (Goal, Output, Limits, Data, Evaluation)
