Paper | Features | Installation | Usage | CLI | Development
A Model Context Protocol (MCP) based LLM deep evaluation framework.
This project provides a framework for evaluating Large Language Models using the Model Context Protocol. It automates end-to-end task generation and deep evaluation of LLM agents across diverse dimensions.
🎬 Watch Full Demo Video (with audio)
Click above to download and view the complete MCPEval demonstration with audio explanation
MCPEval system architecture showing the complete evaluation pipeline from task generation to analysis
MCPEval web interface providing intuitive access to all evaluation features
- v1.1.0 — Multi-turn simulation web UI, conversation replay viewer, model comparison dashboard with statistical testing, SQLite persistence & v1 REST API, SFRGateway proxy, CI pipeline, and comprehensive test suite
- GPT-5 support
- Model-config support for using any model for generation and evaluation
- A new revalidation CLI (revalidate-tasks) for generating high-quality data
- 🚀 Automated End-to-End Evaluation — Single-command pipeline from task generation to analysis with parallel execution
- 🔧 MCP Protocol Integration — 15+ built-in MCP servers spanning enterprise, utility, and public API domains
- 📊 Comprehensive Analysis & Insights — Statistical model comparison with bootstrap confidence intervals and paired tests
- 💻 User-Friendly Web Interface — Conversation replay viewer, model comparison dashboard, and multi-turn simulation UI
- ⚡ Advanced CLI Commands — Generate, verify, evaluate, simulate, and judge with flexible model configuration
- 🗄️ SQLite Persistence & REST API — Durable storage for evaluation runs with a v1 leaderboard and runs API
- 🔬 Multi-Turn Simulation — LLM-as-user simulation with scenario generation, persona support, and 5-dimension LLM judging
- 🌐 SFRGateway Proxy — Self-contained LLM inference via the Salesforce Research gateway (no direct API keys needed)
- ✅ CI & Test Suite — GitHub Actions pipeline with unit and integration tests
If you find our system or paper useful, please cite:
@misc{liu2025mcpevalautomaticmcpbaseddeep,
title={MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models},
author={Zhiwei Liu and Jielin Qiu and Shiyu Wang and Jianguo Zhang and Zuxin Liu and Roshan Ram and Haolin Chen and Weiran Yao and Huan Wang and Shelby Heinecke and Silvio Savarese and Caiming Xiong},
year={2025},
eprint={2507.12806},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2507.12806},
}
For complete setup including both CLI and Web UI:
# Clone the repository
git clone https://github.com/SalesforceAIResearch/MCPEval.git
cd MCPEval
# Run unified setup script (installs CLI, backend API, and frontend UI)
./setup.sh

This will set up:
- ✅ Core CLI evaluation framework
- ✅ Flask REST API backend
- ✅ React web interface
- ✅ All dependencies using uv package manager
For command-line usage only:
# Make sure uv is installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install the package
uv sync
# Or include development dependencies
uv sync --extra dev

# Create your environment file from the template
cp .env.template .env
Edit the .env file to add your OpenAI API key:
OPENAI_API_KEY=YOUR_OPENAI_API_KEY_HERE
OR export the key in your terminal:
export OPENAI_API_KEY=YOUR_OPENAI_API_KEY_HERE
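If you load the key programmatically instead, a minimal .env reader can be sketched as follows (illustrative only; tools like python-dotenv do this more robustly, and this is not the project's own loader):

```python
import os

def load_env_file(path=".env"):
    """Minimal .env reader: KEY=VALUE lines; blank lines and '#' comments skipped."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault keeps values already exported in the shell
            os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env_file()
```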
For self-contained LLM inference without managing API keys directly, use the bundled SFRGateway proxy:
cd sfrgateway
cp .env.template .env # edit .env with your X_API_KEY
PROXY_PORT=8008 uv run python server.py

Then point model configs at http://localhost:8008/v1 with "api_key": "dummy". See sfrgateway/README.md for details.
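For example, a model config routed through the proxy might look like the following (the model name and sampling fields here are placeholders; only the base_url and the dummy api_key are prescribed above):

```json
{
  "model": "your-model-name",
  "api_key": "dummy",
  "base_url": "http://localhost:8008/v1",
  "temperature": 0.01,
  "max_tokens": 3000
}
```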
After running the setup script:

- Start the backend API:

  cd backend
  uv run app.py

  The backend will run on http://localhost:22358

- Start the frontend (in a new terminal):

  cd frontend
  npm start

  The frontend will run on http://localhost:22359

- Access the web application:

  - Open http://localhost:22359 in your browser
  - Use the intuitive interface to generate tasks, run evaluations, and view results
  - Conversation Replay — Browse and inspect multi-turn conversations turn by turn
  - Model Comparison — Side-by-side model comparison with statistical significance testing
  - Multi-Turn Simulation — Generate scenarios, run user simulations, and evaluate conversations from the UI
  - Real-time progress tracking for all operations
Note: The frontend automatically proxies API requests to the backend server (port 22358). No additional configuration is needed.
For advanced users and automation:
We provide an example based on a special calculator MCP application: we define a special calculator MCP server and use an OpenAI client to interact with it.
Quick start:
# Basic example with local MCP server
uv run mcp_clients/example_openai_client/client.py --servers mcp_servers/special_calculator/server.py
# Multiple servers with environment variables (use ^ for env vars)
uv run mcp_clients/example_openai_client/client.py --servers @modelcontextprotocol/server-sequential-thinking mcp-server-nationalparks^NPS_API_KEY=your-api-key-here
# Combined example with arguments and environment variables
uv run mcp_clients/example_openai_client/client.py --servers @openbnb/mcp-server-airbnb:--ignore-robots-txt mcp-server-nationalparks^NPS_API_KEY=your-api-key-here

For more details on the OpenAI client usage, see the OpenAI Client README.
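The --servers syntax above (":" to append server arguments, "^" to set environment variables) can be illustrated with a small parser. This is a sketch of the convention only, not the client's actual implementation:

```python
def parse_server_spec(spec: str):
    """Split 'server[:args][^KEY=VALUE]' into (server, args, env).

    Sketch of the CLI convention shown above; the framework's real
    parser may handle multiple args or env vars differently.
    """
    spec, _, env_part = spec.partition("^")
    server, _, args = spec.partition(":")
    env = {}
    if env_part:
        key, _, value = env_part.partition("=")
        env[key] = value
    return server, args, env

print(parse_server_spec("@openbnb/mcp-server-airbnb:--ignore-robots-txt"))
# ('@openbnb/mcp-server-airbnb', '--ignore-robots-txt', {})
```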
MCPEval includes a diverse set of MCP servers spanning enterprise domains, public APIs, and computation utilities. Each server exposes tools that LLM agents are evaluated against.
These servers are fully deterministic with embedded data or pure computation — ideal for reproducible evaluation.
| Server | Tools | Domain | Description |
|---|---|---|---|
| hr_management | 10 | Enterprise | Departments, employees, leave requests, performance reviews, org chart. Embedded SQLite with 70+ rows. |
| ecommerce | 11 | Enterprise | Products, orders, customers, inventory, sales summaries. Embedded SQLite with 80+ rows. |
| datetime_tools | 7 | Utility | Timezone conversion, date difference, business days, holiday support (US/UK/DE/FR/JP). |
| unit_converter | 6 | Utility | Length, weight, temperature, volume, speed, data size conversion with strict enum schemas. |
| special_calculator | 4 | Demo | Basic arithmetic with special transformations (add+double, subtract+halve, etc.). |
| sqlite | 8 | Database | General-purpose SQLite operations — create tables, query, insert, with sample datasets. |
| filesystem | 14 | System | Local file operations (read, write, search, directory listing). npm: @modelcontextprotocol/server-filesystem |
| memory | 9 | Knowledge | Knowledge graph with entities, relations, and observations. npm: @modelcontextprotocol/server-memory |
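As an illustration of the special_calculator transformations listed above, a minimal sketch (the function names are hypothetical; the actual tool definitions live in mcp_servers/special_calculator/server.py):

```python
def add_then_double(a: float, b: float) -> float:
    """'add+double': sum the operands, then double the result."""
    return (a + b) * 2

def subtract_then_halve(a: float, b: float) -> float:
    """'subtract+halve': subtract b from a, then halve the result."""
    return (a - b) / 2

print(add_then_double(2, 3))       # 10
print(subtract_then_halve(10, 4))  # 3.0
```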
| Server | Tools | Domain | Description |
|---|---|---|---|
| book | 8 | Library | Open Library search — books by title/ISBN, authors, advanced search. |
| youtube | 4 | Media | YouTube transcript extraction, search, and summarization. |
| healthcare | 5 | Medical | FDA drug lookup, PubMed search, clinical trials, ICD-10 codes. |
| sports | 4 | Sports | NBA, MLB, NFL teams, players, and game data via balldontlie.io. |
| Server | Tools | Domain | Credentials |
|---|---|---|---|
| travel_assistant | 6 | Travel | Flights, hotels, restaurants, local events. Requires SERPAPI_API_KEY, YELP_API_KEY. |
| airbnb | 2 | Travel | Airbnb listing search and details. npm: @openbnb/mcp-server-airbnb |
| yfinance | 10 | Finance | Stock prices, financials, options, analyst recommendations via Yahoo Finance. |
| national_park | 6 | Parks | U.S. National Parks info, alerts, campgrounds, events. Requires NPS_API_KEY (free). |
| crm_bench | 11 | CRM | Salesforce CRM operations (stub implementation for benchmarking). |
MCPEval supports multi-turn user simulation where a simulator LLM plays the user role and an agent LLM is tested:
# Generate scenarios from verified tasks
mcp-eval generate-scenarios \
--servers mcp_servers/hr_management/server.py \
--output scenarios.jsonl \
--num-scenarios 5
# Run multi-turn simulation
mcp-eval simulate \
--servers mcp_servers/hr_management/server.py \
--simulator-model-config simulator_model.json \
--agent-model-config agent_model.json \
--scenarios-file scenarios.jsonl \
--output multiturn_results.jsonl
# Evaluate conversations with LLM judge
mcp-eval evaluate-multiturn \
--input multiturn_results.jsonl \
--output multiturn_evaluation.jsonl

The judge evaluates on 5 dimensions: clarification handling, context maintenance, tool usage efficiency, goal achievement, and response quality.
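As a sketch of how per-dimension judge scores could roll up into a single number (the equal weighting and 0-1 scale are assumptions here; the framework's own judge may aggregate differently):

```python
DIMENSIONS = (
    "clarification_handling",
    "context_maintenance",
    "tool_usage_efficiency",
    "goal_achievement",
    "response_quality",
)

def overall_score(scores: dict) -> float:
    """Equal-weight mean over the five judging dimensions (0-1 scale assumed)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

example = dict(zip(DIMENSIONS, (0.9, 0.8, 0.7, 1.0, 0.85)))
print(round(overall_score(example), 2))  # 0.85
```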
# Complete development environment
./setup.sh
# Start backend API (Terminal 1)
cd backend && uv run app.py
# Start frontend UI (Terminal 2)
cd frontend && npm start
# Access at http://localhost:22359

For each benchmark contribution, please follow these steps:
- Create a new directory in the benchmarks/your_benchmark_name folder.
- If you are developing a new MCP server, create a new folder and add the server script in the mcp_servers folder.
- If you are developing a new MCP client, create a new folder and add the client script in the mcp_clients folder.
- Add your benchmark scripts to the benchmarks/your_benchmark_name folder.
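The layout described above can be scaffolded programmatically; a minimal sketch (the README placeholder is illustrative, not something the framework requires):

```python
import tempfile
from pathlib import Path

def scaffold_benchmark(root: Path, name: str) -> Path:
    """Create benchmarks/<name> under root, with a placeholder README."""
    bench_dir = root / "benchmarks" / name
    bench_dir.mkdir(parents=True, exist_ok=True)
    (bench_dir / "README.md").write_text(f"# {name} benchmark\n")
    return bench_dir

# Demo against a temporary directory rather than a real checkout
created = scaffold_benchmark(Path(tempfile.mkdtemp()), "your_benchmark_name")
print(created.name)  # your_benchmark_name
```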
For web interface contributions:
- Frontend components: frontend/src/components/ and frontend/src/pages/
- Backend API endpoints: backend/app.py
See our detailed Development Roadmap for the current progress and planned features across all components.
The MCPEval CLI provides a comprehensive toolkit for managing MCP servers and evaluating LLMs. For detailed documentation, parameter descriptions, and advanced usage examples, see the CLI README.
Auto Workflow (Recommended) - Complete evaluation pipeline in one command:
# Automatically generate tasks, verify, evaluate, and analyze results
mcp-eval auto \
--servers @openbnb/mcp-server-airbnb:--ignore-robots-txt \
--working-dir evaluation_results/airbnb_eval \
--task-model gpt-4o-2024-11-20 \
--eval-model-configs benchmarks/airbnb/eval_models/gpt-4o.json \
--num-tasks 50

For more control over each step:
# 1. Generate tasks
mcp-eval generate-tasks \
--servers @openbnb/mcp-server-airbnb:--ignore-robots-txt \
--model-config benchmarks/airbnb/eval_models/gpt-4o.json \
--num-tasks 200 \
--output data/airbnb/evaluation_tasks.jsonl
# 2. Verify tasks work correctly
mcp-eval verify-tasks \
--servers @openbnb/mcp-server-airbnb:--ignore-robots-txt \
--tasks-file data/airbnb/evaluation_tasks.jsonl \
--output data/airbnb/evaluation_tasks_verified.jsonl
# 3. Revalidate task descriptions based on execution data (optional but recommended)
mcp-eval revalidate-tasks \
--verified-tasks-file data/airbnb/evaluation_tasks_verified.jsonl \
--model-config benchmarks/airbnb/eval_models/gpt-4o.json \
--output data/airbnb/evaluation_tasks_final.jsonl
# 4. Evaluate model performance
mcp-eval evaluate \
--servers @openbnb/mcp-server-airbnb:--ignore-robots-txt \
--model-config benchmarks/airbnb/eval_models/gpt-4o.json \
--tasks-file data/airbnb/evaluation_tasks_final.jsonl \
--output benchmarks/airbnb/results/gpt4o_evaluation.json \
--max-turns 30
# 5. Analyze results and generate reports
mcp-eval analyze \
--predictions benchmarks/airbnb/results/gpt4o_evaluation.json \
--ground-truth data/airbnb/evaluation_tasks_final.jsonl \
--generate-report
# 6. Optional: Run LLM judge evaluation
mcp-eval judge \
--input-file benchmarks/airbnb/results/gpt4o_evaluation.json \
--output-dir benchmarks/airbnb/results \
--model-config benchmarks/airbnb/eval_models/gpt-4o.json
# 7. Optional: Analyze LLM judgment results
mcp-eval judge-rubric \
--trajectory-file benchmarks/airbnb/results/gpt4o_evaluation_trajectory.json \
--completion-file benchmarks/airbnb/results/gpt4o_evaluation_completion.json \
--output-dir benchmarks/airbnb/report

Note: The revalidation step (step 3) analyzes the actual tool conversations from verified tasks and improves task descriptions to be more accurate and specific. This leads to higher-quality evaluation datasets and better task clarity for subsequent evaluations.
- generate-tasks — Generate evaluation tasks for MCP servers
- verify-tasks — Verify tasks can be executed successfully
- revalidate-tasks — Improve task descriptions based on actual execution data
- evaluate — Evaluate models using MCP servers and tasks
- analyze — Analyze evaluation results and generate reports
- judge — Run LLM-based evaluation of execution trajectories
- judge-rubric — Analyze LLM judgment results
- generate-scenarios — Generate multi-turn scenarios from tasks or servers
- simulate — Run multi-turn user simulation conversations
- evaluate-multiturn — Evaluate multi-turn conversations with LLM judge
- convert-data — Convert data to different formats (e.g., XLAM)
- auto — Complete automated evaluation workflow
Models are configured using JSON files. Examples:
{
"model": "gpt-4o-2024-11-20",
"temperature": 0.01,
"max_tokens": 16384
}

For custom endpoints:
{
"model": "mistral-24b",
"api_key": "default",
"temperature": 0.01,
"max_tokens": 3000,
"base_url": "http://<IP_Address>:<port>/v1"
}

# General help
mcp-eval --help
# Command-specific help
mcp-eval generate-tasks --help
mcp-eval evaluate --help

For comprehensive documentation, examples, and advanced usage patterns, see the Complete CLI Documentation.
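If you script against these model-config files, a minimal sketch of loading one and separating client options (api_key, base_url) from request options follows. The key grouping is inferred from the example configs above and is not an official API; the endpoint URL is a placeholder:

```python
import json
import tempfile

CLIENT_KEYS = {"api_key", "base_url"}  # client construction vs. per-request fields

def load_model_config(path):
    """Split a model-config JSON into (client_opts, request_opts)."""
    with open(path) as fh:
        cfg = json.load(fh)
    client_opts = {k: v for k, v in cfg.items() if k in CLIENT_KEYS}
    request_opts = {k: v for k, v in cfg.items() if k not in CLIENT_KEYS}
    return client_opts, request_opts

# Demo with a config shaped like the custom-endpoint example above
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fh:
    json.dump({"model": "mistral-24b", "api_key": "default",
               "temperature": 0.01, "max_tokens": 3000,
               "base_url": "http://localhost:8000/v1"}, fh)
    cfg_path = fh.name

client_opts, request_opts = load_model_config(cfg_path)
print(client_opts)   # {'api_key': 'default', 'base_url': 'http://localhost:8000/v1'}
print(request_opts)  # {'model': 'mistral-24b', 'temperature': 0.01, 'max_tokens': 3000}
```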
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.
For any questions or feedback, please contact Zhiwei Liu at zhiweiliu@salesforce.com.

