Home
Systematic evaluation of language models through Monte Carlo Tree Search
PrismBench is a comprehensive framework for evaluating Large Language Model capabilities in computer science problem-solving. Using a three-phase Monte Carlo Tree Search approach, it systematically maps model strengths, discovers challenging areas, and provides detailed performance analysis.
Core Approach:
- Phase 1: Maps initial capabilities across CS concepts
- Phase 2: Discovers challenging concept combinations
- Phase 3: Conducts comprehensive evaluation of weaknesses
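The three phases can be illustrated with a toy sketch. The concept list, success rates, and UCB-guided sampling loop below are illustrative stand-ins, not PrismBench's actual implementation: a fixed probability table simulates model performance, Phase 1 estimates per-concept capability, and the weakest estimate is flagged for deeper exploration.

```python
import math
import random

random.seed(0)

# Hypothetical concepts and simulated success rates -- the real framework
# evaluates an actual LLM on generated challenges.
CONCEPTS = ["recursion", "dynamic_programming", "graphs"]
SUCCESS_RATE = {"recursion": 0.9, "dynamic_programming": 0.4, "graphs": 0.7}

def evaluate(concept):
    """Simulated challenge run: 1.0 on success, 0.0 on failure."""
    return 1.0 if random.random() < SUCCESS_RATE[concept] else 0.0

def ucb1(value, visits, total_visits, c=1.4):
    """Standard UCB1 score balancing exploitation and exploration."""
    if visits == 0:
        return float("inf")
    return value / visits + c * math.sqrt(math.log(total_visits) / visits)

# Phase 1: map initial capability per concept via UCB-guided sampling.
stats = {c: {"value": 0.0, "visits": 0} for c in CONCEPTS}
for t in range(1, 101):
    concept = max(
        CONCEPTS,
        key=lambda c: ucb1(stats[c]["value"], stats[c]["visits"], t),
    )
    stats[concept]["visits"] += 1
    stats[concept]["value"] += evaluate(concept)

# Phase 2 would then explore combinations of the weakest concepts;
# here we only flag the single weakest estimate.
weakest = min(
    CONCEPTS, key=lambda c: stats[c]["value"] / max(stats[c]["visits"], 1)
)
print(weakest)
```

In the real system each "evaluation" is a full challenge-generation and solution-scoring round handled by the agent and environment services.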
New to PrismBench? Follow our quick start guide to get running in 5 minutes.
Need detailed setup? See our comprehensive configuration documentation.
| Component | Description | Documentation |
|---|---|---|
| MCTS Algorithm | Three-phase search strategy for capability mapping | MCTS Algorithm → |
| Agent System | Multi-agent architecture for challenge creation and evaluation | Agent System → |
| Environment System | Pluggable evaluation environments for different scenarios | Environment System → |
| Architecture | System design and component interactions | Architecture Overview → |
| Topic | Description | Documentation |
|---|---|---|
| Results Analysis | Understanding and interpreting evaluation results | Results Analysis → |
| Tree Structure | Search tree implementation and concept organization | Tree Structure → |
PrismBench is designed to be extensible, allowing you to add custom agents, environments, and MCTS phases.
- Extending PrismBench →
- Custom Agents →
- Custom Environments →
- Custom MCTS Phases →
- Extension Combinations →
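Since the framework mentions an environment registry, a common way to wire up such pluggable components is a decorator-based registry. The sketch below is hypothetical: the registry name, decorator, and `run` method are illustrative, not PrismBench's actual API.

```python
# Illustrative registry pattern for pluggable environments; all names here
# are hypothetical, not PrismBench's real extension interface.
ENVIRONMENT_REGISTRY = {}

def register_environment(name):
    """Class decorator that records an environment under a lookup key."""
    def wrap(cls):
        ENVIRONMENT_REGISTRY[name] = cls
        return cls
    return wrap

@register_environment("coding_challenge")
class CodingChallengeEnvironment:
    def run(self, solution: str) -> bool:
        # A real environment would execute the solution against test cases;
        # this stand-in only checks that some code was submitted.
        return "def" in solution

# Components are then looked up by name at runtime rather than imported
# directly, which is what makes the environment set pluggable.
env = ENVIRONMENT_REGISTRY["coding_challenge"]()
print(env.run("def solve(): pass"))
```

See the extension documentation linked above for the actual interfaces custom agents, environments, and phases must implement.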
PrismBench follows a microservices architecture with three core services:
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│     Search      │     │   Environment   │     │  LLM Interface  │
│    Port 8002    │◄───►│    Port 8001    │◄───►│    Port 8000    │
│                 │     │                 │     │                 │
│   MCTS Engine   │     │ Challenge Exec  │     │   Model Comm    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
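A quick way to confirm all three services are up after deployment is to probe each port. The ports come from the diagram above, but the `/health` path is an assumption; check the service documentation for the actual probe endpoint.

```python
import urllib.request
import urllib.error

# Ports taken from the architecture diagram; the /health path is assumed.
SERVICES = {
    "llm-interface": "http://localhost:8000",
    "environment": "http://localhost:8001",
    "search": "http://localhost:8002",
}

def probe(base_url, path="/health", timeout=2.0):
    """Return the HTTP status code, or None if the service is unreachable."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status
    except (urllib.error.URLError, OSError):
        return None

if __name__ == "__main__":
    for name, url in SERVICES.items():
        status = probe(url, timeout=0.5)
        print(f"{name}: {'up' if status == 200 else 'unreachable'}")
```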
- Systematic Evaluation through MCTS-driven exploration
- Challenge Discovery automatically identifies model weaknesses
- Comprehensive Analysis with detailed performance metrics
- Containerized Deployment with Docker support
- API Compatible with OpenAI-compatible endpoints
- Extensible Architecture for custom components
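Because the framework targets OpenAI-compatible endpoints, a model backend is addressed with the standard chat-completions request shape. The sketch below only builds the URL and JSON body; the base URL, path, and model name are placeholders, not values PrismBench ships with.

```python
import json

# Minimal OpenAI-compatible chat-completions request builder. The base URL
# and model name are placeholders -- point them at whatever endpoint the
# LLM Interface service is configured to use.
def chat_request(model, prompt, base_url="http://localhost:8000/v1"):
    """Return the URL and JSON body for a /chat/completions call."""
    url = f"{base_url}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, body

url, body = chat_request("my-model", "Write a binary search in Python.")
print(url)
```

Any backend that accepts this request format (self-hosted or hosted) should be usable without code changes.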
| Resource | Description |
|---|---|
| Troubleshooting | Common issues and solutions |
| GitHub Discussions | Community support and questions |
| Issue Tracker | Bug reports and feature requests |
We welcome contributions to PrismBench! Whether you're fixing bugs, adding features, or improving documentation, your help is appreciated.
- Quick Start - Setup and first run
- Configuration Overview - Complete configuration guide
- Architecture Overview - System design and components
- MCTS Algorithm - Monte Carlo Tree Search implementation
- Agent System - Multi-agent architecture
- Environment System - Evaluation environments
- Extending PrismBench - Framework extensions
- Results Analysis - Understanding evaluation results
- Troubleshooting - Common issues and solutions
Made with enough ☕ to fell an elephant and a whole lot of ❤️ by anonymous (for now)
MCTS System
- MCTS Algorithm
- Core MCTS Process
- Key Components
- PrismBench's Three-Phase MCTS
- Tree Structure
- Node Structure

Agent System
- Agent Overview
- Agent Roles
- Agent Configuration
- Agent Workflows
- Agent Communication

Environment System
- Environment Overview
- Environment Types
- Environment Registry
- Agent Integration
- Environment Configuration

Main Configuration
- Configuration Overview
- Agent Configurations
- Environment Configurations
- Phase Configurations
- Tree Configurations

Extension
- Extending PrismBench
- Custom Agents
- Custom Environments
- Custom MCTS Phases
- Extension Combinations
- Basic Examples (Coming Soon)
- Advanced Examples (Coming Soon)
- Step-by-Step Tutorials (Coming Soon)