Agent System
The agent system is the heart of PrismBench's evaluation methodology. It uses specialized agents to orchestrate complex multi-step evaluations that mirror real-world software development workflows.
PrismBench employs a multi-agent architecture where specialized agents collaborate to create, evaluate, and analyze coding challenges. Each agent has a specific role, expertise area, and set of responsibilities.
```mermaid
graph TB
    subgraph "Agent Workflow"
        subgraph "Problem Creation"
            CD[Challenge Designer<br/>Creates Problems]
            CDA[Challenge Designer Advanced<br/>Diverse Problems]
        end
        subgraph "Test Development"
            TG[Test Generator<br/>Creates Tests]
            TV[Test Validator<br/>Validates Tests]
        end
        subgraph "Solution Development"
            PS[Problem Solver<br/>Writes Solutions]
            PF[Problem Fixer<br/>Debugs Code]
        end
        subgraph "Analysis"
            TEA[Test Error Analyzer<br/>Analyzes Failures]
            SPA[Solution Pattern Analyzer<br/>Studies Patterns]
        end
    end
    CD --> TG
    CDA --> TV
    TG --> PS
    TV --> PS
    PS --> PF
    PS --> TEA
    PF --> SPA
```
Each agent is designed for a specific task with:
- Domain-specific expertise
- Tailored prompts and instructions
- Optimized output formats
- Task-appropriate model parameters
Agents work together in defined workflows:
- Sequential processing pipelines
- Information passing between agents
- Error handling and recovery
- Quality assurance checkpoints
Agents are independently configurable:
- Swappable implementations
- Adjustable parameters
- Custom prompt templates
- Provider-agnostic design
Standardized interaction patterns:
- Common configuration format
- Uniform output formatting
- Consistent error handling
- Predictable behavior
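The common configuration format can be sketched as a small dataclass built from the parsed YAML shown throughout this page. This is an illustrative sketch only: `AgentConfig` and `load_agent_config` are hypothetical names, not PrismBench's actual classes.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the common agent configuration format; field
# names mirror the YAML keys used in this document, but the real
# PrismBench classes may differ.
@dataclass
class AgentConfig:
    agent_name: str
    model_name: str
    provider: str
    params: dict = field(default_factory=dict)
    system_prompt: str = ""

def load_agent_config(raw: dict) -> AgentConfig:
    """Build an AgentConfig from a parsed YAML dictionary."""
    configs = raw.get("configs", {})
    return AgentConfig(
        agent_name=raw["agent_name"],
        model_name=configs.get("model_name", "gpt-4o-mini"),
        provider=configs.get("provider", "openai"),
        params=configs.get("params", {}),
        system_prompt=raw.get("system_prompt", ""),
    )

cfg = load_agent_config({
    "agent_name": "challenge_designer",
    "configs": {
        "model_name": "gpt-4o-mini",
        "provider": "openai",
        "params": {"temperature": 0.8, "max_tokens": 5120},
    },
    "system_prompt": "You are an expert computer science educator...",
})
print(cfg.agent_name, cfg.params["temperature"])
```

Because every agent shares this shape, swapping implementations or providers only requires editing the YAML, not the orchestration code.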
Challenge Designer
Purpose: Creates coding problems focused on specific CS concepts and difficulty levels.
Key Responsibilities:
- Generate problem statements similar to LeetCode challenges
- Ensure problems match specified concepts and difficulty
- Provide clear input/output specifications
- Include comprehensive examples and constraints
- Explain concept relevance
Input Requirements:
- concepts: List of CS concepts to test
- difficulty_level: Target difficulty (very easy to very hard)
Output Format:
```
<problem_description>
## Problem Title
Difficulty: [Level]
[Problem description with input/output specs, constraints, examples]
</problem_description>
```

Configuration Example:

```yaml
agent_name: challenge_designer
configs:
  model_name: gpt-4o-mini
  provider: openai
  params:
    temperature: 0.8  # High creativity for diverse problems
    max_tokens: 5120
system_prompt: >
  You are an expert computer science educator specializing in creating
  coding challenges...
```

Challenge Designer Advanced
Purpose: Creates diverse, unique coding problems while avoiding duplication.
Key Differentiators:
- Duplicate Avoidance: Analyzes previous problems to ensure uniqueness
- Variation Generation: Creates substantially different problem approaches
- Context Diversity: Uses different scenarios and problem contexts
- Advanced Constraints: Implements more complex requirement patterns
Input Requirements:
- concepts: CS concepts to test
- difficulty_level: Target difficulty
- previous_problems: List of previously generated problems to avoid
Advanced Features:
- Problem similarity analysis
- Context variation (different domains, scenarios)
- Approach diversification (different algorithmic strategies)
- Constraint variation (different input/output patterns)
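As an illustration of problem similarity analysis, a minimal duplicate check could compare a candidate statement against previously generated problems. This `difflib` ratio heuristic is a sketch under assumed thresholds, not PrismBench's actual implementation.

```python
from difflib import SequenceMatcher

def is_duplicate(candidate: str, previous_problems: list[str],
                 threshold: float = 0.8) -> bool:
    """Flag a candidate problem statement that is too similar to any
    previously generated problem (simple string-ratio heuristic)."""
    return any(
        SequenceMatcher(None, candidate.lower(), prev.lower()).ratio() >= threshold
        for prev in previous_problems
    )

previous = ["Given an array of integers, return the two numbers that sum to a target."]
near_copy = "Given an array of integers, return two numbers that sum to a target."
fresh = "Design a rate limiter using the sliding-window algorithm."

print(is_duplicate(near_copy, previous))  # near-identical wording is flagged
print(is_duplicate(fresh, previous))      # a genuinely different problem passes
```

A production system would more likely compare embeddings or problem structure, but the contract is the same: reject candidates that are too close to anything in `previous_problems`.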
Test Generator
Purpose: Develops comprehensive unittest test cases for coding problems.
Key Responsibilities:
- Create test classes inheriting from unittest.TestCase
- Cover scenarios from very easy to very hard
- Include edge cases and boundary conditions
- Provide descriptive test method names and docstrings
- Use appropriate assertions for result verification
Test Coverage Strategy:
- Basic functionality: Core problem requirements
- Edge cases: Boundary values and special conditions
- Error handling: Invalid inputs and error conditions
- Performance: Large inputs and stress testing
- Corner cases: Unusual but valid scenarios
Output Format:
```
<test_code>
import unittest

class TestFunction(unittest.TestCase):
    def test_basic_case(self):
        """Test basic functionality."""
        self.assertEqual(function_to_test([1, 2, 3]), expected_result)

    def test_edge_case(self):
        """Test edge cases."""
        self.assertEqual(function_to_test([]), expected_edge_result)
</test_code>
```

Test Validator
Purpose: Validates test case quality and coverage.
Validation Criteria:
- Coverage Analysis: Ensures all problem requirements are tested
- Edge Case Verification: Confirms comprehensive edge case testing
- Assertion Correctness: Validates test expectations
- Test Structure: Checks test organization and naming
- Quality Assessment: Evaluates test effectiveness
Output Format:
```
<test_validation>
1. Missing Test Scenarios:
   - [List of missing test cases]
2. Incorrect Assertions:
   - [Issues with test expectations]
3. Suggestions for Improving Test Coverage:
   - [Recommendations for better coverage]
4. Analysis of Edge Cases:
   - [Edge case analysis and recommendations]
</test_validation>
```

Problem Solver
Purpose: Implements efficient, well-structured solutions to coding problems.
Key Responsibilities:
- Develop algorithmic solutions
- Write clean, commented code
- Handle all specified constraints
- Optimize for time and space complexity
- Provide a single solution function
Solution Requirements:
- Function named solution
- Handles all input specifications
- Returns output as specified
- Includes clear comments
- Follows Python best practices
Output Format:
```
<generated_solution>
def solution(input_params):
    """
    Clear description of the approach and complexity.

    Args:
        input_params: Description of inputs

    Returns:
        Description of output
    """
    # Implementation with clear comments
    pass
</generated_solution>
```

Problem Fixer
Purpose: Analyzes failing solutions and provides corrected versions.
Debugging Process:
- Error Analysis: Identify specific failure points
- Root Cause Analysis: Determine underlying issues
- Solution Strategy: Plan correction approach
- Code Repair: Implement fixes
- Verification: Ensure fixes address all issues
Input Requirements:
- problem_statement: Original problem description
- test_cases: Test cases that are failing
- current_solution: Current failing solution
- error_output: Detailed error information
Analysis Areas:
- Logic errors and algorithm issues
- Edge case handling problems
- Performance and efficiency issues
- Syntax and runtime errors
- Test requirement misunderstandings
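The solve–test–fix cycle described above can be sketched as a bounded retry loop. Here `solve`, `fix`, and `run_tests` are hypothetical stubs standing in for the Problem Solver agent, the Problem Fixer agent, and test execution; PrismBench's real orchestration lives in its environments.

```python
# Hypothetical sketch of the solve -> test -> fix loop; agent calls
# and test execution are stubbed for illustration.
MAX_FIX_ATTEMPTS = 3

def run_tests(solution_code: str) -> tuple[bool, str]:
    """Stub for running the generated unittest suite.
    Returns (passed, error_output)."""
    passed = "bug" not in solution_code
    return passed, "" if passed else "AssertionError in test_basic_case"

def solve(problem: str) -> str:
    """Stub for the Problem Solver agent (first attempt fails)."""
    return "def solution(xs): return sorted(xs)  # bug"

def fix(problem: str, code: str, errors: str) -> str:
    """Stub for the Problem Fixer agent; it receives full context."""
    return "def solution(xs): return sorted(xs)"

def evaluate(problem: str) -> tuple[str, bool]:
    code = solve(problem)
    passed, errors = run_tests(code)
    attempts = 0
    while not passed and attempts < MAX_FIX_ATTEMPTS:
        code = fix(problem, code, errors)  # retry with error context
        passed, errors = run_tests(code)
        attempts += 1
    return code, passed

code, passed = evaluate("Sort an array of integers.")
print(passed)
```

Bounding the number of fix attempts keeps a persistently failing solution from looping forever while still giving the Problem Fixer several chances to repair it.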
Test Error Analyzer
Purpose: Provides detailed analysis of test execution failures.
Analysis Categories:
- Test Failures: Specific tests that didn't pass
- Error Analysis: Runtime errors and exceptions
- Root Cause Identification: Underlying problem causes
- Improvement Suggestions: Recommendations for fixes
Output Format:
```
<error_analysis>
Test Failures:
1. [test_name]:
   Failure Reason: [detailed explanation]
   Root Cause: [underlying issue]
Test Errors:
1. [test_name]:
   Error Reason: [error details]
   Root Cause: [error source]
Root Causes:
- [List of identified root causes]
Suggested Areas to Investigate:
- [Specific recommendations for investigation]
</error_analysis>
```

Solution Pattern Analyzer
Purpose: Analyzes solution code to identify patterns and implementation approaches.
Analysis Dimensions:
- Algorithm Patterns: Strategic approaches used
- Data Structure Usage: Choice and application of data structures
- Code Organization: Structure and modularity patterns
- Implementation Choices: Language features and techniques
- Performance Characteristics: Complexity and optimization patterns
Output Format:
```
<pattern_analysis>
{
  "algorithm_patterns": {
    "main_strategy": "dynamic programming",
    "time_complexity": "O(n^2)",
    "space_complexity": "O(n)"
  },
  "data_structures": {
    "primary": ["hashmap", "array"],
    "usage_patterns": {
      "hashmap": "O(1) lookups for memoization"
    }
  },
  "implementation_choices": {
    "language_features": ["list comprehension"],
    "optimization_techniques": ["early termination"]
  }
}
</pattern_analysis>
```

Each agent is defined by a YAML configuration file with four main sections:
```yaml
agent_name: [unique_identifier]
configs: [model and provider settings]
system_prompt: [role definition and instructions]
interaction_templates: [input/output templates]
```

Agents can use different models optimized for their tasks:
```yaml
configs:
  model_name: gpt-4o-mini  # Model selection
  provider: openai         # Provider choice
  params:
    temperature: 0.8       # Creativity level
    max_tokens: 5120       # Response length
    local: false           # Local vs cloud model
```

System prompts define agent expertise and behavior:
```yaml
system_prompt: >
  You are an expert [domain] specialist with expertise in [areas].
  Your role is to [specific responsibilities].
  When given [inputs], you should [expected actions].
  Your response should include [output requirements].
  **IMPORTANT:** [critical formatting requirements]
```

Templates define how agents receive inputs and format outputs:
```yaml
interaction_templates:
  - name: basic
    required_keys: [input1, input2]
    template: >
      [Input processing template with {placeholders}]
    output_format:
      response_begin: <tag>
      response_end: </tag>
```

Sequential agent pipeline for basic evaluation:
```mermaid
sequenceDiagram
    participant E as Environment
    participant CD as Challenge Designer
    participant TG as Test Generator
    participant PS as Problem Solver
    participant PF as Problem Fixer
    E->>CD: Generate Problem
    CD->>E: Problem Description
    E->>TG: Generate Tests
    TG->>E: Test Cases
    E->>PS: Solve Problem
    PS->>E: Solution Code
    E->>E: Execute Tests
    alt Tests Fail
        E->>PF: Fix Solution
        PF->>E: Fixed Code
    end
```
Extended pipeline with validation and analysis:
```mermaid
sequenceDiagram
    participant E as Environment
    participant CDA as Challenge Designer Advanced
    participant TG as Test Generator
    participant TV as Test Validator
    participant PS as Problem Solver
    participant PF as Problem Fixer
    participant TEA as Test Error Analyzer
    E->>CDA: Generate Unique Problem
    CDA->>E: Problem Description
    E->>TG: Generate Tests
    TG->>E: Test Cases
    E->>TV: Validate Tests
    TV->>E: Validation Report
    E->>PS: Solve Problem
    PS->>E: Solution Code
    E->>E: Execute Tests
    alt Tests Fail
        E->>TEA: Analyze Failures
        TEA->>E: Error Analysis
        E->>PF: Fix Solution
        PF->>E: Fixed Code
    end
```
Agents communicate through the LLM Interface Service:
- Session Initialization: Create agent-specific session
- Request Submission: Send task with formatted input
- Asynchronous Processing: Task queued and processed
- Status Monitoring: Poll for completion
- Result Retrieval: Extract formatted output
- Session Cleanup: Clean up resources
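The six lifecycle steps can be illustrated with an in-memory stand-in for the LLM Interface Service. The class and method names here are hypothetical; the real service is an HTTP API whose endpoints and fields may differ.

```python
import time
import uuid

# Hypothetical in-memory stand-in for the LLM Interface Service, used
# only to illustrate the six-step session lifecycle.
class LLMInterfaceStub:
    def __init__(self):
        self.sessions = {}

    def init_session(self, agent_name: str) -> str:          # 1. initialize
        session_id = str(uuid.uuid4())
        self.sessions[session_id] = {"agent": agent_name, "status": "idle"}
        return session_id

    def submit(self, session_id: str, input_data: dict) -> None:  # 2. submit
        # 3. in the real service the task is queued and processed
        #    asynchronously; the stub completes it immediately
        self.sessions[session_id].update(
            status="completed",
            result={"response": "<tag>...</tag>"},
        )

    def status(self, session_id: str) -> str:                # 4. monitor
        return self.sessions[session_id]["status"]

    def result(self, session_id: str) -> dict:               # 5. retrieve
        return self.sessions[session_id]["result"]

    def close(self, session_id: str) -> None:                # 6. cleanup
        del self.sessions[session_id]

svc = LLMInterfaceStub()
sid = svc.init_session("challenge_designer")
svc.submit(sid, {"concepts": ["loops"], "difficulty_level": "medium"})
while svc.status(sid) != "completed":  # poll until the task finishes
    time.sleep(0.1)
response = svc.result(sid)["response"]
svc.close(sid)
print(response)
```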
Standardized communication format:
```python
# Input to agent
{
    "session_id": "uuid",
    "input_data": {
        "concepts": ["loops", "arrays"],
        "difficulty_level": "medium",
        # ... other template parameters
    },
    "use_agent": False
}

# Output from agent
{
    "status": "completed",
    "result": {
        "response": "[formatted response with delimiters]"
    }
}
```

Robust error handling operates at multiple levels:
- Input Validation: Check required parameters
- Model Errors: Handle API failures and timeouts
- Output Parsing: Validate response format
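The output-parsing level can be illustrated with a small helper that extracts the content between an agent's response delimiters and fails loudly when the format is wrong. `extract_response` is a hypothetical name; the delimiters follow the `response_begin`/`response_end` convention shown in the interaction templates above.

```python
# Sketch of the "output parsing" step: pull the payload out of an
# agent response and raise when the expected delimiters are missing.
def extract_response(raw: str, begin: str, end: str) -> str:
    start = raw.find(begin)
    stop = raw.find(end, start + len(begin))
    if start == -1 or stop == -1:
        raise ValueError(f"Malformed agent output: missing {begin}/{end}")
    return raw[start + len(begin):stop].strip()

raw = "Some preamble <problem_description>## Two Sum\nDifficulty: Easy</problem_description>"
body = extract_response(raw, "<problem_description>", "</problem_description>")
print(body.splitlines()[0])
```

Raising a specific error here lets the caller distinguish a malformed response (retry or re-prompt the agent) from a model or transport failure.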
Agents can be configured with different models for different tasks:
```yaml
# High creativity for problem generation
challenge_designer:
  configs:
    temperature: 0.8

# Low temperature for code generation
problem_solver:
  configs:
    temperature: 0.2
```

Create domain-specific agents:
```yaml
agent_name: security_analyst
configs:
  model_name: gpt-4o-mini
  provider: openai
system_prompt: >
  You are a cybersecurity expert specializing in code security analysis...
interaction_templates:
  - name: security_review
    required_keys: [code, security_requirements]
    template: >
      Analyze the following code for security vulnerabilities: {code}
```

Combine agents for complex workflows:
```yaml
custom_environment:
  agents:
    - challenge_designer
    - security_analyst
    - performance_analyzer
    - problem_solver
    - code_reviewer
```

Multiple agents can run concurrently:
- Independent task processing
- Parallel test execution
- Async communication patterns
- Resource management
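Concurrent execution of independent agent tasks can be sketched with `asyncio.gather`; `call_agent` is a hypothetical stand-in for an async request to the LLM Interface Service.

```python
import asyncio

# Illustrative sketch of running independent agent tasks concurrently.
async def call_agent(agent_name: str, payload: dict) -> dict:
    """Hypothetical async request to the LLM Interface Service."""
    await asyncio.sleep(0.1)  # simulated network/model latency
    return {"agent": agent_name, "status": "completed"}

async def run_parallel():
    tasks = [
        call_agent("problem_solver", {"problem": "A"}),
        call_agent("problem_solver", {"problem": "B"}),
        call_agent("test_generator", {"problem": "A"}),
    ]
    # gather awaits all three coroutines concurrently, so total wall
    # time is roughly one latency period instead of three
    return await asyncio.gather(*tasks)

results = asyncio.run(run_parallel())
print([r["agent"] for r in results])
```

Since agent calls are I/O-bound (waiting on a remote model), async concurrency gives near-linear speedup without threads; a semaphore or task queue would cap concurrent requests for resource management.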
Optimize repeated operations:
- Session reuse for similar tasks
- Connection pooling
Next Steps:
- Environment System - How agents work within environments
- Agent Configurations - Detailed configuration guide
- Custom Agents - Creating new agent types with custom prompts
- Examples - Agent usage examples
- Custom Environments - Agent orchestration in environments
- Extension Combinations - Combining agents with other extensions
- Architecture Overview - Overall system design
- Configuration Overview - Agent configuration system
- Extending PrismBench - Framework extension overview
- Results Analysis - Understanding agent performance
- Troubleshooting - Agent-related issues and solutions