12 changes: 10 additions & 2 deletions README.md
@@ -85,14 +85,22 @@ python -m pipeline \
--k 1 \ # run once to quick start
--models gpt-5 \ # or any model you configured
--tasks file_property/size_classification
# Add --task-suite easy to run the lightweight dataset (where available)
```

Results are saved to `./results/{exp_name}/{model}__{mcp}/run-*/...` (e.g., `./results/test-run/gpt-5__filesystem/run-1/...`).
Results are saved to `./results/{exp_name}/{model}__{mcp}/run-*/...` for the standard suite and `./results/{exp_name}/{model}__{mcp}-easy/run-*/...` when you run `--task-suite easy` (e.g., `./results/test-run/gpt-5__filesystem/run-1/...` or `./results/test-run/gpt-5__github-easy/run-1/...`).

---

## Run your evaluations

### Task suites (standard vs easy)

- Each MCP service now stores tasks under `tasks/<mcp>/<task_suite>/<category>/<task>/`.
- `standard` (default) covers the full benchmark (127 tasks today).
- `easy` hosts 10 lightweight tasks per MCP, ideal for smoke tests and CI (GitHub’s are already available under `tasks/github/easy`).
- Switch suites with `--task-suite easy` (defaults to `--task-suite standard`).
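
A minimal sketch of how the new layout maps to a directory on disk. The helper `task_dir` is hypothetical (it is not part of this PR), and the `github` category and task names are placeholders; only the path shape `tasks/<mcp>/<task_suite>/<category>/<task>/` and the two suite names come from the README text above.

```python
from pathlib import Path

# Suite names taken from the README; everything else here is illustrative.
KNOWN_SUITES = {"standard", "easy"}


def task_dir(
    mcp: str,
    category: str,
    task: str,
    task_suite: str = "standard",
    tasks_root: Path = Path("tasks"),
) -> Path:
    """Hypothetical helper: build tasks/<mcp>/<task_suite>/<category>/<task>."""
    if task_suite not in KNOWN_SUITES:
        raise ValueError(f"unknown task suite: {task_suite!r}")
    return tasks_root / mcp / task_suite / category / task


# Default suite is "standard".
print(task_dir("filesystem", "file_property", "size_classification").as_posix())
# Placeholder category/task names for the easy suite.
print(task_dir("github", "example_category", "example_task", task_suite="easy").as_posix())
```

Rejecting unknown suite names early mirrors the `choices` restriction on the CLI flag, so a typo fails fast instead of silently discovering zero tasks.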

### Single run (k=1)
```bash
# Run ALL tasks for a service
@@ -173,7 +181,7 @@ python -m src.aggregators.aggregate_results --exp-name exp --k 4 --single-run-mo
## Contributing

Contributions are welcome:
1. Add a new task under `tasks/<category_id>/<task_id>/` with `meta.json`, `description.md` and `verify.py`.
1. Add a new task under `tasks/<mcp>/<task_suite>/<category_id>/<task_id>/` with `meta.json`, `description.md` and `verify.py`.
2. Ensure local checks pass and open a PR.
3. See `docs/contributing/make-contribution.md`.

4 changes: 2 additions & 2 deletions docs/contributing/make-contribution.md
@@ -2,8 +2,8 @@

1. Fork the repository and create a feature branch.

2. Add new tasks under `tasks/<category>/<task_n>/` with the files of `meta.json`, `description.md` and `verify.py`. Please refer to [Task Page](../datasets/task.md) for detailed instructions.
2. Add new tasks under `tasks/<mcp>/<task_suite>/<category>/<task_id>/` with the files of `meta.json`, `description.md` and `verify.py`. Please refer to [Task Page](../datasets/task.md) for detailed instructions.

3. Ensure all tests pass.

4. Submit a pull request - contributions are welcome!
4. Submit a pull request - contributions are welcome!
16 changes: 9 additions & 7 deletions docs/datasets/task.md
@@ -18,15 +18,17 @@ tasks
│
└───filesystem
│
└───file_context
└───standard # task_suite (also supports `easy`)
│
└───create_file_write
│ meta.json
│ description.md
│ verify.py
└───file_context # category_id
│
└───create_file_write
│ meta.json
│ description.md
│ verify.py
```

Note that all tasks are placed under `tasks/`. `filesystem` refers to the environment for the MCP service.
All tasks live under `tasks/<mcp>/<task_suite>/<category>/<task_id>/`. `filesystem` refers to the MCP service and `task_suite` captures the difficulty slice (`standard` benchmark vs `easy` smoke tests).

`meta.json` includes the meta information about the task, including the following key
- task_id: the id of the task.
@@ -68,4 +70,4 @@ Accordingly, the `verify.py` contains the following functionalities
- Check whether the target directory contains the file with target file name. [![Check Target File Existence](https://i.postimg.cc/Qx0Zwnf6/task-sample-verify-file-existence.png)](https://postimg.cc/7fGRTX87)
- Check whether the target file contains the desired content `EXPECTED_PATTERNS = ["Hello Wolrd"]`. [![Check Content in Target File](https://i.postimg.cc/JzzMhWyV/task-sample-verify-check-content.png)](https://postimg.cc/w7ZSWZc0)

- If the outcome passes **all the above verification functionalities**, the task would be marked as successfully completed.
- If the outcome passes **all the above verification functionalities**, the task would be marked as successfully completed.
2 changes: 1 addition & 1 deletion docs/installation_and_docker_usage.md
@@ -44,7 +44,7 @@ The `run-task.sh` script provides simplified Docker usage:
./run-task.sh --mcp MCPSERVICE --models MODEL_NAME --exp-name EXPNAME --tasks TASK --k K
```

where *MODEL_NAME* refers to the model choice from the supported models (see [Introduction Page](./introduction.md) for more information), *EXPNAME* refers to customized experiment name, *TASK* refers to specific task or task group (see `tasks/` for more information), *K* refers to the time of independent experiments.
where *MODEL_NAME* refers to the model choice from the supported models (see [Introduction Page](./introduction.md) for more information), *EXPNAME* refers to customized experiment name, *TASK* refers to specific task or task group (see `tasks/<mcp>/<task_suite>/...` for more information), *K* refers to the time of independent experiments.


Additionally, the `run-benchmark.sh` script evaluates models across all MCP services:
8 changes: 8 additions & 0 deletions pipeline.py
@@ -54,6 +54,12 @@ def main():
default="all",
help='Tasks to run: (1). "all"; (2). "category"; or (3). "category/task".',
)
parser.add_argument(
"--task-suite",
default="standard",
choices=["standard", "easy"],
help="Task suite to run (default: standard). Use 'easy' to run the lightweight dataset.",
)
parser.add_argument(
"--exp-name",
default=None,
@@ -111,6 +117,7 @@ def main():

logger.info("MCPMark Evaluation")
logger.info(f"Experiment: {args.exp_name} | {len(model_list)} Model(s): {', '.join(model_list)}")
logger.info(f"Task suite: {args.task_suite}")
if args.k > 1:
logger.info(f"Running {args.k} evaluation runs for pass@k metrics")

@@ -147,6 +154,7 @@ def main():
output_dir=run_output_dir,
reasoning_effort=args.reasoning_effort,
agent_name=args.agent,
task_suite=args.task_suite,
)

pipeline.run_evaluation(args.tasks)
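
The flag added in this file can be exercised in isolation. The sketch below reproduces only the `--task-suite` argument with the standard-library `argparse` module; `build_parser` is a hypothetical wrapper introduced for illustration, and the real `pipeline.py` parser defines many more options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the parser in pipeline.py (one flag only)."""
    parser = argparse.ArgumentParser(prog="pipeline")
    parser.add_argument(
        "--task-suite",
        default="standard",
        choices=["standard", "easy"],
        help="Task suite to run (default: standard).",
    )
    return parser


# argparse converts "--task-suite" into the attribute name "task_suite".
default_args = build_parser().parse_args([])
easy_args = build_parser().parse_args(["--task-suite", "easy"])
print(default_args.task_suite, easy_args.task_suite)
```

Because `choices` is set, any other value makes `argparse` print a usage error and exit, so downstream code can rely on the suite being one of the two known names.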
71 changes: 52 additions & 19 deletions src/aggregators/aggregate_results.py
@@ -20,8 +20,12 @@
from src.aggregators.pricing import compute_cost_usd


def discover_tasks() -> Dict[str, List[str]]:
"""Discover all tasks from ./tasks directory."""
# Supported difficulty splits in ./tasks/<service>/<task_set>/
SUPPORTED_TASK_SETS = {"standard", "easy"}


def discover_tasks(task_set: str = "standard") -> Dict[str, List[str]]:
"""Discover all tasks from ./tasks directory filtered by task set."""
tasks_dir = Path("./tasks")

all_tasks = {}
@@ -37,22 +41,39 @@ def discover_tasks() -> Dict[str, List[str]]:
}

    for mcp_service, task_dirs in service_mappings.items():
        tasks = []
        tasks: List[str] = []
        for task_dir_name in task_dirs:
            service_path = tasks_dir / task_dir_name
            if not service_path.exists():
                continue

            # Find all category/task combinations
            for category_dir in service_path.iterdir():
                if not category_dir.is_dir() or category_dir.name.startswith("__"):
                    continue

                for task_dir in category_dir.iterdir():
                    if task_dir.is_dir():
                        # Use unified naming for both playwright and webarena variants
                        tasks.append(f"{category_dir.name}__{task_dir.name}")


            selected_root = service_path / task_set

            # Detect if this service has partitioned task sets (e.g. standard/easy)
            has_partitioned_layout = any(
                child.is_dir() and child.name in SUPPORTED_TASK_SETS
                for child in service_path.iterdir()
            )

            if selected_root.exists():
                search_roots = [selected_root]
            elif has_partitioned_layout:
                # Requested task set missing for this service; skip it for this run
                print(f" ⚠️ No '{task_set}' tasks found under {service_path}")
                search_roots = []
            else:
                # Legacy layout without task sets – fall back to original structure
                search_roots = [service_path]

            for root in search_roots:
                for category_dir in root.iterdir():
                    if not category_dir.is_dir() or category_dir.name.startswith("__"):
                        continue

                    for task_dir in category_dir.iterdir():
                        if task_dir.is_dir() and not task_dir.name.startswith("__"):
                            tasks.append(f"{category_dir.name}__{task_dir.name}")

        all_tasks[mcp_service] = sorted(tasks)

    return all_tasks
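
The three-way choice of search roots in this hunk (requested suite present, suite missing in a partitioned service, legacy flat layout) can be checked against a throwaway directory tree. This is a sketch only: `select_search_roots` and `demo` are hypothetical names that isolate the branching from `discover_tasks`, and the directory names are made up.

```python
import tempfile
from pathlib import Path
from typing import Dict, List

SUPPORTED_TASK_SETS = {"standard", "easy"}


def select_search_roots(service_path: Path, task_set: str) -> List[Path]:
    """Hypothetical helper mirroring the root selection in discover_tasks."""
    selected_root = service_path / task_set
    has_partitioned_layout = any(
        child.is_dir() and child.name in SUPPORTED_TASK_SETS
        for child in service_path.iterdir()
    )
    if selected_root.exists():
        return [selected_root]      # requested suite exists
    if has_partitioned_layout:
        return []                   # service is partitioned but lacks this suite
    return [service_path]           # legacy layout: categories sit at the top


def demo() -> Dict[str, List[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        partitioned = Path(tmp) / "partitioned_service"
        (partitioned / "standard" / "some_category" / "some_task").mkdir(parents=True)
        legacy = Path(tmp) / "legacy_service"
        (legacy / "some_category" / "some_task").mkdir(parents=True)
        return {
            "partitioned/standard": [p.name for p in select_search_roots(partitioned, "standard")],
            "partitioned/easy": [p.name for p in select_search_roots(partitioned, "easy")],
            "legacy/easy": [p.name for p in select_search_roots(legacy, "easy")],
        }


results = demo()
print(results)
```

One consequence worth noting: a legacy-layout service is scanned in full regardless of the requested task set, so its tasks would be counted for both `standard` and `easy` aggregations.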
@@ -655,14 +676,19 @@ def render_section(title: str, section_data: Dict[str, Any]) -> List[str]:
f"# {exp_name} - Evaluation Results",
"",
f"Generated: {summary['generated_at']}",
"",
]

task_set = summary.get("task_set")
if task_set:
lines.append(f"Task set: {task_set}")

lines.append("")

# Overall table
lines.extend(render_section("Overall Performance", summary.get("overall", {})))

# Service tables: infer service keys from summary
reserved = {"overall", "generated_at", "k", "experiment_name"}
reserved = {"overall", "generated_at", "k", "experiment_name", "task_set"}
service_keys = [key for key in summary.keys() if key not in reserved]
# Keep stable order
for service in sorted(service_keys):
@@ -875,6 +901,12 @@ def main():
type=str,
help="Comma-separated list of models that only need run-1"
)
parser.add_argument(
"--task-set",
choices=sorted(SUPPORTED_TASK_SETS),
default="standard",
help="Which task subset to aggregate (default: standard)"
)
parser.add_argument("--push", action="store_true", help="Push to GitHub (default to main)")

args = parser.parse_args()
@@ -894,8 +926,8 @@ def main():
print(f"🔄 Processing experiment: {args.exp_name}")

# Discover all tasks
print("📋 Discovering tasks...")
all_tasks = discover_tasks()
print(f"📋 Discovering tasks (task set: {args.task_set})...")
all_tasks = discover_tasks(args.task_set)
total_tasks = sum(len(tasks) for tasks in all_tasks.values())
print(f" Found {total_tasks} tasks across {len(all_tasks)} services")

@@ -920,6 +952,7 @@ def main():
print("\n📊 Calculating metrics...")
summary = calculate_metrics(complete_models, all_tasks, args.k, single_run_models)
summary["experiment_name"] = args.exp_name
summary["task_set"] = args.task_set

# Save summary
summary_path = exp_dir / "summary.json"
@@ -954,4 +987,4 @@ def main():


if __name__ == "__main__":
exit(main())
exit(main())
8 changes: 7 additions & 1 deletion src/base/task_manager.py
@@ -55,6 +55,7 @@ def __init__(
mcp_service: str = None,
task_class: type = None,
task_organization: str = None,
task_suite: str | None = "standard",
):
"""Initialize the base task manager.

@@ -63,13 +64,15 @@ def __init__(
mcp_service: MCP service name (e.g., 'notion', 'github', 'filesystem')
task_class: Custom task class to use (defaults to BaseTask)
task_organization: 'file' or 'directory' based task organization
task_suite: Logical task suite (e.g., 'standard', 'easy')
"""
self.tasks_root = tasks_root
self.mcp_service = mcp_service or self.__class__.__name__.lower().replace(
"taskmanager", ""
)
self.task_class = task_class or BaseTask
self.task_organization = task_organization
self.task_suite = task_suite
self._tasks_cache = None

# =========================================================================
@@ -85,6 +88,8 @@ def discover_all_tasks(self) -> List[BaseTask]:
service_dir = self.tasks_root / (
self.mcp_service or self._get_service_directory_name()
)
if self.task_suite:
service_dir = service_dir / self.task_suite

if not service_dir.exists():
logger.warning(
@@ -112,9 +117,10 @@ def discover_all_tasks(self) -> List[BaseTask]:
# Sort by category_id and a stringified task_id to handle both numeric IDs and slugs uniformly
self._tasks_cache = sorted(tasks, key=lambda t: (t.category_id, str(t.task_id)))
logger.info(
"Discovered %d %s tasks across all categories",
"Discovered %d %s tasks across all categories (suite=%s)",
len(self._tasks_cache),
self.mcp_service.title(),
self.task_suite or "default",
)
return self._tasks_cache

10 changes: 8 additions & 2 deletions src/evaluator.py
@@ -27,11 +27,13 @@ def __init__(
output_dir: Path = None,
reasoning_effort: str = "default",
agent_name: str = "mcpmark",
task_suite: str = "standard",
):
# Main configuration
self.mcp_service = mcp_service
self.timeout = timeout
self.agent_name = (agent_name or "mcpmark").lower()
self.task_suite = (task_suite or "standard").lower()
if self.agent_name not in AGENT_REGISTRY:
raise ValueError(f"Unsupported agent '{agent_name}'. Available: {sorted(AGENT_REGISTRY)}")

@@ -48,7 +50,9 @@ def __init__(
self.litellm_run_model_name = None

# Initialize managers using the factory pattern (simplified)
self.task_manager = MCPServiceFactory.create_task_manager(mcp_service)
self.task_manager = MCPServiceFactory.create_task_manager(
mcp_service, task_suite=self.task_suite
)
self.state_manager = MCPServiceFactory.create_state_manager(mcp_service)

# Obtain static service configuration from state manager (e.g., notion_key)
@@ -80,7 +84,9 @@ def __init__(
model_slug = self.model_name.replace(".", "-")

service_for_dir = "playwright" if mcp_service == "playwright_webarena" else mcp_service
self.base_experiment_dir = output_dir / f"{model_slug}__{service_for_dir}" / exp_name
suite_suffix = "" if self.task_suite in ("standard", "", None) else f"-{self.task_suite}"
service_dir_name = f"{service_for_dir}{suite_suffix}"
self.base_experiment_dir = output_dir / f"{model_slug}__{service_dir_name}" / exp_name
self.base_experiment_dir.mkdir(parents=True, exist_ok=True)

def _format_duration(self, seconds: float) -> str:
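
The directory naming introduced in this hunk can be summarized as a pure function. The sketch below is an assumption-laden restatement rather than the evaluator's code: `result_dir_name` is a hypothetical name, and it folds in the `model_slug` and `playwright_webarena` handling visible in the surrounding context lines.

```python
def result_dir_name(model_name: str, mcp_service: str, task_suite: str = "standard") -> str:
    """Hypothetical helper: compute the '{model}__{mcp}[-suite]' directory name."""
    model_slug = model_name.replace(".", "-")
    # Both playwright variants share one results directory.
    service_for_dir = "playwright" if mcp_service == "playwright_webarena" else mcp_service
    suite = (task_suite or "standard").lower()
    # The standard suite keeps the pre-existing name; other suites get a suffix.
    suite_suffix = "" if suite == "standard" else f"-{suite}"
    return f"{model_slug}__{service_for_dir}{suite_suffix}"


print(result_dir_name("gpt-5", "filesystem"))
print(result_dir_name("gpt-5", "github", "easy"))
print(result_dir_name("gpt-5", "playwright_webarena", "easy"))
```

Leaving the standard suite unsuffixed keeps existing result directories valid, which is presumably why only non-standard suites change the name.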
3 changes: 2 additions & 1 deletion src/mcp_services/filesystem/filesystem_task_manager.py
@@ -30,7 +30,7 @@ class FilesystemTask(BaseTask):
class FilesystemTaskManager(BaseTaskManager):
"""Simplified filesystem task manager using enhanced base class."""

def __init__(self, tasks_root: Path = None):
def __init__(self, tasks_root: Path = None, task_suite: str = "standard"):
"""Initialize filesystem task manager."""
if tasks_root is None:
tasks_root = Path(__file__).resolve().parents[3] / "tasks"
@@ -40,6 +40,7 @@ def __init__(self, tasks_root: Path = None):
mcp_service="filesystem",
task_class=FilesystemTask,
task_organization="directory",
task_suite=task_suite,
)

# Override only what's needed for filesystem-specific behavior
30 changes: 29 additions & 1 deletion src/mcp_services/github/github_state_manager.py
@@ -626,7 +626,35 @@ def _request_with_retry(

    # Initial state for each task category is resolved via self.initial_state_mapping
    def select_initial_state_for_task(self, task_category: str) -> Optional[str]:
        return self.initial_state_mapping.get(task_category)
        """Resolve template name for a task category with light normalization."""
        if not task_category:
            return None

        candidate_keys = []
        candidate_keys.append(task_category)

        # Allow users to swap between hyphen/underscore naming conventions.
        hyphen_to_underscore = task_category.replace("-", "_")
        if hyphen_to_underscore not in candidate_keys:
            candidate_keys.append(hyphen_to_underscore)

        underscore_to_hyphen = task_category.replace("_", "-")
        if underscore_to_hyphen not in candidate_keys:
            candidate_keys.append(underscore_to_hyphen)

        for key in candidate_keys:
            template = self.initial_state_mapping.get(key)
            if template:
                if key != task_category:
                    logger.debug(
                        "| Resolved GitHub template for %s via alias %s -> %s",
                        task_category,
                        key,
                        template,
                    )
                return template

        return None

    def extract_repo_info_from_url(self, repo_url: str) -> tuple[str, str]:
        """Extract owner and repo name from GitHub URL."""
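
The hyphen/underscore fallback added to `select_initial_state_for_task` can be tried outside the state manager. A minimal sketch, assuming a standalone function `resolve_template` and a made-up mapping; neither the function name nor the mapping entries exist in the repository.

```python
from typing import Dict, Optional


def resolve_template(mapping: Dict[str, str], task_category: str) -> Optional[str]:
    """Hypothetical standalone version of the alias lookup in this hunk."""
    if not task_category:
        return None
    candidates = [
        task_category,                      # exact name first
        task_category.replace("-", "_"),    # hyphens -> underscores
        task_category.replace("_", "-"),    # underscores -> hyphens
    ]
    # dict.fromkeys drops duplicates while preserving the lookup order.
    for key in dict.fromkeys(candidates):
        template = mapping.get(key)
        if template:
            return template
    return None


# Illustrative mapping only; real keys live in initial_state_mapping.
example_mapping = {"example_category": "example-template"}
print(resolve_template(example_mapping, "example-category"))
print(resolve_template(example_mapping, "unknown"))
```

The normalization is all-or-nothing: a mixed name such as `a-b_c` is tried as `a-b_c`, `a_b_c`, and `a-b-c`, but not as other combinations.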
3 changes: 2 additions & 1 deletion src/mcp_services/github/github_task_manager.py
@@ -42,7 +42,7 @@ class GitHubTask(BaseTask):
class GitHubTaskManager(BaseTaskManager):
"""Manages task discovery, filtering, and verification for GitHub-based MCPMark evaluation."""

def __init__(self, tasks_root: Path = None):
def __init__(self, tasks_root: Path = None, task_suite: str = "standard"):
"""Initialize GitHub task manager.

Args:
@@ -57,6 +57,7 @@ def __init__(self, tasks_root: Path = None):
mcp_service="github",
task_class=GitHubTask,
task_organization="file",
task_suite=task_suite,
) # GitHub uses file-based tasks

# =========================================================================