Agent System
The agent system is the heart of PrismBench's evaluation methodology. It uses specialized agents to orchestrate complex multi-step evaluations that mirror real-world software development workflows.
PrismBench employs a multi-agent architecture where specialized agents collaborate to create, evaluate, and analyze coding challenges. Each agent has a specific role, expertise area, and set of responsibilities.
```mermaid
graph TB
    subgraph "Agent Workflow"
        subgraph "Problem Creation"
            CD[Challenge Designer<br/>Creates Problems]
            CDA[Challenge Designer Advanced<br/>Diverse Problems]
        end
        subgraph "Test Development"
            TG[Test Generator<br/>Creates Tests]
            TV[Test Validator<br/>Validates Tests]
        end
        subgraph "Solution Development"
            PS[Problem Solver<br/>Writes Solutions]
            PF[Problem Fixer<br/>Debugs Code]
        end
        subgraph "Analysis"
            TEA[Test Error Analyzer<br/>Analyzes Failures]
            SPA[Solution Pattern Analyzer<br/>Studies Patterns]
        end
    end
    CD --> TG
    CDA --> TV
    TG --> PS
    TV --> PS
    PS --> PF
    PS --> TEA
    PF --> SPA
```
Each agent is designed for a specific task with:
- Domain-specific expertise
- Tailored prompts and instructions
- Optimized output formats
- Task-appropriate model parameters
Agents work together in defined workflows:
- Sequential processing pipelines
- Information passing between agents
- Error handling and recovery
- Quality assurance checkpoints
Agents are independently configurable:
- Swappable implementations
- Adjustable parameters
- Custom prompt templates
- Provider-agnostic design
Standardized interaction patterns:
- Common configuration format
- Uniform output formatting
- Consistent error handling
- Predictable behavior
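The common configuration format can be sketched as a small dataclass built from the parsed YAML shown throughout this page. This is an illustrative sketch only: `AgentConfig` and `load_agent_config` are hypothetical names, not PrismBench's actual classes.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the common agent configuration format; field
# names mirror the YAML keys used in this document, but the real
# PrismBench classes may differ.
@dataclass
class AgentConfig:
    agent_name: str
    model_name: str
    provider: str
    params: dict = field(default_factory=dict)
    system_prompt: str = ""

def load_agent_config(raw: dict) -> AgentConfig:
    """Build an AgentConfig from a parsed YAML dictionary."""
    configs = raw.get("configs", {})
    return AgentConfig(
        agent_name=raw["agent_name"],
        model_name=configs.get("model_name", "gpt-4o-mini"),
        provider=configs.get("provider", "openai"),
        params=configs.get("params", {}),
        system_prompt=raw.get("system_prompt", ""),
    )

cfg = load_agent_config({
    "agent_name": "challenge_designer",
    "configs": {
        "model_name": "gpt-4o-mini",
        "provider": "openai",
        "params": {"temperature": 0.8, "max_tokens": 5120},
    },
    "system_prompt": "You are an expert computer science educator...",
})
print(cfg.agent_name, cfg.params["temperature"])
```

Because every agent shares this shape, swapping implementations or providers only requires editing the YAML, not the orchestration code.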
Challenge Designer
Purpose: Creates coding problems focused on specific CS concepts and difficulty levels.
Key Responsibilities:
- Generate problem statements similar to LeetCode challenges
- Ensure problems match specified concepts and difficulty
- Provide clear input/output specifications
- Include comprehensive examples and constraints
- Explain concept relevance
Input Requirements:
- concepts: List of CS concepts to test
- difficulty_level: Target difficulty (very easy to very hard)
Output Format:
```
<problem_description>
## Problem Title
Difficulty: [Level]
[Problem description with input/output specs, constraints, examples]
</problem_description>
```

Configuration Example:

```yaml
agent_name: challenge_designer
configs:
  model_name: gpt-4o-mini
  provider: openai
  params:
    temperature: 0.8  # High creativity for diverse problems
    max_tokens: 5120
system_prompt: >
  You are an expert computer science educator specializing in creating
  coding challenges...
```

Challenge Designer Advanced
Purpose: Creates diverse, unique coding problems while avoiding duplication.
Key Differentiators:
- Duplicate Avoidance: Analyzes previous problems to ensure uniqueness
- Variation Generation: Creates substantially different problem approaches
- Context Diversity: Uses different scenarios and problem contexts
- Advanced Constraints: Implements more complex requirement patterns
Input Requirements:
- concepts: CS concepts to test
- difficulty_level: Target difficulty
- previous_problems: List of previously generated problems to avoid
Advanced Features:
- Problem similarity analysis
- Context variation (different domains, scenarios)
- Approach diversification (different algorithmic strategies)
- Constraint variation (different input/output patterns)
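As an illustration of problem similarity analysis, a minimal duplicate check could compare a candidate statement against previously generated problems. This `difflib` ratio heuristic is a sketch under assumed thresholds, not PrismBench's actual implementation.

```python
from difflib import SequenceMatcher

def is_duplicate(candidate: str, previous_problems: list[str],
                 threshold: float = 0.8) -> bool:
    """Flag a candidate problem statement that is too similar to any
    previously generated problem (simple string-ratio heuristic)."""
    return any(
        SequenceMatcher(None, candidate.lower(), prev.lower()).ratio() >= threshold
        for prev in previous_problems
    )

previous = ["Given an array of integers, return the two numbers that sum to a target."]
near_copy = "Given an array of integers, return two numbers that sum to a target."
fresh = "Design a rate limiter using the sliding-window algorithm."

print(is_duplicate(near_copy, previous))  # near-identical wording is flagged
print(is_duplicate(fresh, previous))      # a genuinely different problem passes
```

A production system would more likely compare embeddings or problem structure, but the contract is the same: reject candidates that are too close to anything in `previous_problems`.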
Test Generator
Purpose: Develops comprehensive unittest test cases for coding problems.
Key Responsibilities:
- Create test classes inheriting from unittest.TestCase
- Cover scenarios from very easy to very hard
- Include edge cases and boundary conditions
- Provide descriptive test method names and docstrings
- Use appropriate assertions for result verification
Test Coverage Strategy:
- Basic functionality: Core problem requirements
- Edge cases: Boundary values and special conditions
- Error handling: Invalid inputs and error conditions
- Performance: Large inputs and stress testing
- Corner cases: Unusual but valid scenarios
Output Format:
```
<test_code>
import unittest

class TestFunction(unittest.TestCase):
    def test_basic_case(self):
        """Test basic functionality."""
        self.assertEqual(function_to_test([1, 2, 3]), expected_result)

    def test_edge_case(self):
        """Test edge cases."""
        self.assertEqual(function_to_test([]), expected_edge_result)
</test_code>
```

Test Validator
Purpose: Validates test case quality and coverage.
Validation Criteria:
- Coverage Analysis: Ensures all problem requirements are tested
- Edge Case Verification: Confirms comprehensive edge case testing
- Assertion Correctness: Validates test expectations
- Test Structure: Checks test organization and naming
- Quality Assessment: Evaluates test effectiveness
Output Format:
```
<test_validation>
1. Missing Test Scenarios:
   - [List of missing test cases]
2. Incorrect Assertions:
   - [Issues with test expectations]
3. Suggestions for Improving Test Coverage:
   - [Recommendations for better coverage]
4. Analysis of Edge Cases:
   - [Edge case analysis and recommendations]
</test_validation>
```

Problem Solver
Purpose: Implements efficient, well-structured solutions to coding problems.
Key Responsibilities:
- Develop algorithmic solutions
- Write clean, commented code
- Handle all specified constraints
- Optimize for time and space complexity
- Provide a single solution function
Solution Requirements:
- Function named solution
- Handles all input specifications
- Returns output as specified
- Includes clear comments
- Follows Python best practices
Output Format:
```
<generated_solution>
def solution(input_params):
    """
    Clear description of the approach and complexity.

    Args:
        input_params: Description of inputs

    Returns:
        Description of output
    """
    # Implementation with clear comments
    pass
</generated_solution>
```

Problem Fixer
Purpose: Analyzes failing solutions and provides corrected versions.
Debugging Process:
- Error Analysis: Identify specific failure points
- Root Cause Analysis: Determine underlying issues
- Solution Strategy: Plan correction approach
- Code Repair: Implement fixes
- Verification: Ensure fixes address all issues
Input Requirements:
- problem_statement: Original problem description
- test_cases: Test cases that are failing
- current_solution: Current failing solution
- error_output: Detailed error information
Analysis Areas:
- Logic errors and algorithm issues
- Edge case handling problems
- Performance and efficiency issues
- Syntax and runtime errors
- Test requirement misunderstandings
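The solve–test–fix cycle described above can be sketched as a bounded retry loop. Here `solve`, `fix`, and `run_tests` are hypothetical stubs standing in for the Problem Solver agent, the Problem Fixer agent, and test execution; PrismBench's real orchestration lives in its environments.

```python
# Hypothetical sketch of the solve -> test -> fix loop; agent calls
# and test execution are stubbed for illustration.
MAX_FIX_ATTEMPTS = 3

def run_tests(solution_code: str) -> tuple[bool, str]:
    """Stub for running the generated unittest suite.
    Returns (passed, error_output)."""
    passed = "bug" not in solution_code
    return passed, "" if passed else "AssertionError in test_basic_case"

def solve(problem: str) -> str:
    """Stub for the Problem Solver agent (first attempt fails)."""
    return "def solution(xs): return sorted(xs)  # bug"

def fix(problem: str, code: str, errors: str) -> str:
    """Stub for the Problem Fixer agent; it receives full context."""
    return "def solution(xs): return sorted(xs)"

def evaluate(problem: str) -> tuple[str, bool]:
    code = solve(problem)
    passed, errors = run_tests(code)
    attempts = 0
    while not passed and attempts < MAX_FIX_ATTEMPTS:
        code = fix(problem, code, errors)  # retry with error context
        passed, errors = run_tests(code)
        attempts += 1
    return code, passed

code, passed = evaluate("Sort an array of integers.")
print(passed)
```

Bounding the number of fix attempts keeps a persistently failing solution from looping forever while still giving the Problem Fixer several chances to repair it.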
Test Error Analyzer
Purpose: Provides detailed analysis of test execution failures.
Analysis Categories:
- Test Failures: Specific tests that didn't pass
- Error Analysis: Runtime errors and exceptions
- Root Cause Identification: Underlying problem causes
- Improvement Suggestions: Recommendations for fixes
Output Format:
```
<error_analysis>
Test Failures:
1. [test_name]:
   Failure Reason: [detailed explanation]
   Root Cause: [underlying issue]
Test Errors:
1. [test_name]:
   Error Reason: [error details]
   Root Cause: [error source]
Root Causes:
- [List of identified root causes]
Suggested Areas to Investigate:
- [Specific recommendations for investigation]
</error_analysis>
```

Solution Pattern Analyzer
Purpose: Analyzes solution code to identify patterns and implementation approaches.
Analysis Dimensions:
- Algorithm Patterns: Strategic approaches used
- Data Structure Usage: Choice and application of data structures
- Code Organization: Structure and modularity patterns
- Implementation Choices: Language features and techniques
- Performance Characteristics: Complexity and optimization patterns
Output Format:
```
<pattern_analysis>
{
  "algorithm_patterns": {
    "main_strategy": "dynamic programming",
    "time_complexity": "O(n^2)",
    "space_complexity": "O(n)"
  },
  "data_structures": {
    "primary": ["hashmap", "array"],
    "usage_patterns": {
      "hashmap": "O(1) lookups for memoization"
    }
  },
  "implementation_choices": {
    "language_features": ["list comprehension"],
    "optimization_techniques": ["early termination"]
  }
}
</pattern_analysis>
```

Each agent is defined by a YAML configuration file with four main sections:
```yaml
agent_name: [unique_identifier]
configs: [model and provider settings]
system_prompt: [role definition and instructions]
interaction_templates: [input/output templates]
```

Agents can use different models optimized for their tasks:
```yaml
configs:
  model_name: gpt-4o-mini  # Model selection
  provider: openai         # Provider choice
  params:
    temperature: 0.8       # Creativity level
    max_tokens: 5120       # Response length
    local: false           # Local vs cloud model
```

System prompts define agent expertise and behavior:
```yaml
system_prompt: >
  You are an expert [domain] specialist with expertise in [areas].
  Your role is to [specific responsibilities].
  When given [inputs], you should [expected actions].
  Your response should include [output requirements].
  **IMPORTANT:** [critical formatting requirements]
```

Templates define how agents receive inputs and format outputs:
```yaml
interaction_templates:
  - name: basic
    required_keys: [input1, input2]
    template: >
      [Input processing template with {placeholders}]
    output_format:
      response_begin: <tag>
      response_end: </tag>
```

Sequential agent pipeline for basic evaluation:
```mermaid
sequenceDiagram
    participant E as Environment
    participant CD as Challenge Designer
    participant TG as Test Generator
    participant PS as Problem Solver
    participant PF as Problem Fixer
    E->>CD: Generate Problem
    CD->>E: Problem Description
    E->>TG: Generate Tests
    TG->>E: Test Cases
    E->>PS: Solve Problem
    PS->>E: Solution Code
    E->>E: Execute Tests
    alt Tests Fail
        E->>PF: Fix Solution
        PF->>E: Fixed Code
    end
```
Extended pipeline with validation and analysis:
```mermaid
sequenceDiagram
    participant E as Environment
    participant CDA as Challenge Designer Advanced
    participant TG as Test Generator
    participant TV as Test Validator
    participant PS as Problem Solver
    participant PF as Problem Fixer
    participant TEA as Test Error Analyzer
    E->>CDA: Generate Unique Problem
    CDA->>E: Problem Description
    E->>TG: Generate Tests
    TG->>E: Test Cases
    E->>TV: Validate Tests
    TV->>E: Validation Report
    E->>PS: Solve Problem
    PS->>E: Solution Code
    E->>E: Execute Tests
    alt Tests Fail
        E->>TEA: Analyze Failures
        TEA->>E: Error Analysis
        E->>PF: Fix Solution
        PF->>E: Fixed Code
    end
```
Agents communicate through the LLM Interface Service:
- Session Initialization: Create agent-specific session
- Request Submission: Send task with formatted input
- Asynchronous Processing: Task queued and processed
- Status Monitoring: Poll for completion
- Result Retrieval: Extract formatted output
- Session Cleanup: Clean up resources
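The six lifecycle steps can be illustrated with an in-memory stand-in for the LLM Interface Service. The class and method names here are hypothetical; the real service is an HTTP API whose endpoints and fields may differ.

```python
import time
import uuid

# Hypothetical in-memory stand-in for the LLM Interface Service, used
# only to illustrate the six-step session lifecycle.
class LLMInterfaceStub:
    def __init__(self):
        self.sessions = {}

    def init_session(self, agent_name: str) -> str:          # 1. initialize
        session_id = str(uuid.uuid4())
        self.sessions[session_id] = {"agent": agent_name, "status": "idle"}
        return session_id

    def submit(self, session_id: str, input_data: dict) -> None:  # 2. submit
        # 3. in the real service the task is queued and processed
        #    asynchronously; the stub completes it immediately
        self.sessions[session_id].update(
            status="completed",
            result={"response": "<tag>...</tag>"},
        )

    def status(self, session_id: str) -> str:                # 4. monitor
        return self.sessions[session_id]["status"]

    def result(self, session_id: str) -> dict:               # 5. retrieve
        return self.sessions[session_id]["result"]

    def close(self, session_id: str) -> None:                # 6. cleanup
        del self.sessions[session_id]

svc = LLMInterfaceStub()
sid = svc.init_session("challenge_designer")
svc.submit(sid, {"concepts": ["loops"], "difficulty_level": "medium"})
while svc.status(sid) != "completed":  # poll until the task finishes
    time.sleep(0.1)
response = svc.result(sid)["response"]
svc.close(sid)
print(response)
```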
Standardized communication format:
```python
# Input to agent
{
    "session_id": "uuid",
    "input_data": {
        "concepts": ["loops", "arrays"],
        "difficulty_level": "medium",
        # ... other template parameters
    },
    "use_agent": False
}

# Output from agent
{
    "status": "completed",
    "result": {
        "response": "[formatted response with delimiters]"
    }
}
```

Robust error handling operates at multiple levels:
- Input Validation: Check required parameters
- Model Errors: Handle API failures and timeouts
- Output Parsing: Validate response format
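The output-parsing level can be illustrated with a small helper that extracts the content between an agent's response delimiters and fails loudly when the format is wrong. `extract_response` is a hypothetical name; the delimiters follow the `response_begin`/`response_end` convention shown in the interaction templates above.

```python
# Sketch of the "output parsing" step: pull the payload out of an
# agent response and raise when the expected delimiters are missing.
def extract_response(raw: str, begin: str, end: str) -> str:
    start = raw.find(begin)
    stop = raw.find(end, start + len(begin))
    if start == -1 or stop == -1:
        raise ValueError(f"Malformed agent output: missing {begin}/{end}")
    return raw[start + len(begin):stop].strip()

raw = "Some preamble <problem_description>## Two Sum\nDifficulty: Easy</problem_description>"
body = extract_response(raw, "<problem_description>", "</problem_description>")
print(body.splitlines()[0])
```

Raising a specific error here lets the caller distinguish a malformed response (retry or re-prompt the agent) from a model or transport failure.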
Agents can be configured with different models for different tasks:
```yaml
# High creativity for problem generation
challenge_designer:
  configs:
    temperature: 0.8

# Low temperature for code generation
problem_solver:
  configs:
    temperature: 0.2
```

Create domain-specific agents:
```yaml
agent_name: security_analyst
configs:
  model_name: gpt-4o-mini
  provider: openai
system_prompt: >
  You are a cybersecurity expert specializing in code security analysis...
interaction_templates:
  - name: security_review
    required_keys: [code, security_requirements]
    template: >
      Analyze the following code for security vulnerabilities: {code}
```

Combine agents for complex workflows:
```yaml
custom_environment:
  agents:
    - challenge_designer
    - security_analyst
    - performance_analyzer
    - problem_solver
    - code_reviewer
```

Multiple agents can run concurrently:
- Independent task processing
- Parallel test execution
- Async communication patterns
- Resource management
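Concurrent execution of independent agent tasks can be sketched with `asyncio.gather`; `call_agent` is a hypothetical stand-in for an async request to the LLM Interface Service.

```python
import asyncio

# Illustrative sketch of running independent agent tasks concurrently.
async def call_agent(agent_name: str, payload: dict) -> dict:
    """Hypothetical async request to the LLM Interface Service."""
    await asyncio.sleep(0.1)  # simulated network/model latency
    return {"agent": agent_name, "status": "completed"}

async def run_parallel():
    tasks = [
        call_agent("problem_solver", {"problem": "A"}),
        call_agent("problem_solver", {"problem": "B"}),
        call_agent("test_generator", {"problem": "A"}),
    ]
    # gather awaits all three coroutines concurrently, so total wall
    # time is roughly one latency period instead of three
    return await asyncio.gather(*tasks)

results = asyncio.run(run_parallel())
print([r["agent"] for r in results])
```

Since agent calls are I/O-bound (waiting on a remote model), async concurrency gives near-linear speedup without threads; a semaphore or task queue would cap concurrent requests for resource management.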
Optimize repeated operations:
- Session reuse for similar tasks
- Connection pooling
Next Steps:
- Environment System - How agents work within environments
- Agent Configurations - Detailed configuration guide
- Custom Agents - Creating new agent types with custom prompts
- Examples - Agent usage examples
- Custom Environments - Agent orchestration in environments
- Extension Combinations - Combining agents with other extensions
- Architecture Overview - Overall system design
- Configuration Overview - Agent configuration system
- Extending PrismBench - Framework extension overview
- Results Analysis - Understanding agent performance
- Troubleshooting - Agent-related issues and solutions