eval-sys · xyliugo · Nov 17, 2025
diff --git a/README.md b/README.md
@@ -85,14 +85,22 @@ python -m pipeline \
   --k 1 \ # run once to quick start
   --models gpt-5  \ # or any model you configured
   --tasks file_property/size_classification
+# Add --task-suite easy to run the lightweight dataset (where available)
 ```
 
-Results are saved to `./results/{exp_name}/{model}__{mcp}/run-*/...` (e.g., `./results/test-run/gpt-5__filesystem/run-1/...`).
+Results are saved to `./results/{exp_name}/{model}__{mcp}/run-*/...` for the standard suite and `./results/{exp_name}/{model}__{mcp}-easy/run-*/...` when you run `--task-suite easy` (e.g., `./results/test-run/gpt-5__filesystem/run-1/...` or `./results/test-run/gpt-5__github-easy/run-1/...`).
 
 ---
 
 ## Run your evaluations
 
+### Task suites (standard vs easy)
+
+- Each MCP service now stores tasks under `tasks/<mcp>/<task_suite>/<category>/<task>/`.
+- `standard` (default) covers the full benchmark (127 tasks today).
+- `easy` hosts 10 lightweight tasks per MCP, ideal for smoke tests and CI (the GitHub easy tasks are already available under `tasks/github/easy`).
+- Switch suites with `--task-suite easy` (defaults to `--task-suite standard`).
+
 ### Single run (k=1)
 ```bash
 # Run ALL tasks for a service
@@ -173,7 +181,7 @@ python -m src.aggregators.aggregate_results --exp-name exp --k 4 --single-run-mo
 ## Contributing
 
 Contributions are welcome:
-1. Add a new task under `tasks/<category_id>/<task_id>/` with `meta.json`, `description.md` and `verify.py`.
+1. Add a new task under `tasks/<mcp>/<task_suite>/<category_id>/<task_id>/` with `meta.json`, `description.md` and `verify.py`.
 2. Ensure local checks pass and open a PR.
 3. See `docs/contributing/make-contribution.md`.
 

diff --git a/docs/contributing/make-contribution.md b/docs/contributing/make-contribution.md
@@ -2,8 +2,8 @@
 
 1. Fork the repository and create a feature branch.
 
-2. Add new tasks under `tasks/<category>/<task_n>/` with the files of `meta.json`, `description.md` and `verify.py`. Please refer to [Task Page](../datasets/task.md) for detailed instructions.
+2. Add new tasks under `tasks/<mcp>/<task_suite>/<category>/<task_id>/`, each containing `meta.json`, `description.md`, and `verify.py`. See the [Task Page](../datasets/task.md) for detailed instructions.
 
 3. Ensure all tests pass.
 
-4. Submit a pull request — contributions are welcome!
+4. Submit a pull request — contributions are welcome!
diff --git a/docs/datasets/task.md b/docs/datasets/task.md
@@ -18,15 +18,17 @@ tasks
 │
 └───filesystem
    │
-   └───file_context
+   └───standard          # task_suite (also supports `easy`)
       │
-      └───create_file_write
-         │   meta.json 
-         │   description.md
-         │   verify.py
+      └───file_context   # category_id
+         │
+         └───create_file_write
+            │   meta.json 
+            │   description.md
+            │   verify.py
 ```
 
-Note that all tasks are placed under `tasks/`. `filesystem` refers to the environment for the MCP service.
+All tasks live under `tasks/<mcp>/<task_suite>/<category>/<task_id>/`. `filesystem` refers to the MCP service and `task_suite` captures the difficulty slice (`standard` benchmark vs `easy` smoke tests).
 
 `meta.json` includes the meta information about the task, including the following key
 - task_id: the id of the task.
@@ -68,4 +70,4 @@ Accordingly, the `verify.py` contains the following functionalities
 - Check whether the target directory contains the file with target file name. [![Check Target File Existence](https://i.postimg.cc/Qx0Zwnf6/task-sample-verify-file-existence.png)](https://postimg.cc/7fGRTX87)
 - Check whether the target file contains the desired content `EXPECTED_PATTERNS = ["Hello Wolrd"]`. [![Check Content in Target File](https://i.postimg.cc/JzzMhWyV/task-sample-verify-check-content.png)](https://postimg.cc/w7ZSWZc0)
 
-- If the outcome passes **all the above verification functionalities**, the task would be marked as successfully completed.
+- If the outcome passes **all the above verification functionalities**, the task would be marked as successfully completed.
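The directory layout documented in `docs/datasets/task.md` above can be sketched in Python. This is a minimal illustration and not code from the repository: `task_dir`, `missing_files`, and `REQUIRED_FILES` are hypothetical names, and the relative `tasks` root assumes the script runs from the repository root.

```python
from pathlib import Path
from typing import List

# Files each task directory is expected to contain, per docs/datasets/task.md.
REQUIRED_FILES = ("meta.json", "description.md", "verify.py")


def task_dir(
    mcp: str,
    category: str,
    task_id: str,
    task_suite: str = "standard",
    tasks_root: Path = Path("tasks"),
) -> Path:
    """Build tasks/<mcp>/<task_suite>/<category>/<task_id> (hypothetical helper)."""
    return tasks_root / mcp / task_suite / category / task_id


def missing_files(directory: Path) -> List[str]:
    """Return the required files that are absent from a task directory."""
    return [name for name in REQUIRED_FILES if not (directory / name).is_file()]


example = task_dir("filesystem", "file_context", "create_file_write")
print(example.as_posix())  # tasks/filesystem/standard/file_context/create_file_write
```

A contributor could call `missing_files(task_dir(...))` before opening a PR to confirm that the three required files are in place.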
diff --git a/docs/installation_and_docker_usage.md b/docs/installation_and_docker_usage.md
@@ -44,7 +44,7 @@ The `run-task.sh` script provides simplified Docker usage:
 ./run-task.sh --mcp MCPSERVICE --models MODEL_NAME --exp-name EXPNAME --tasks TASK --k K
 ```
 
-where *MODEL_NAME* refers to the model choice from the supported models (see [Introduction Page](./introduction.md) for more information), *EXPNAME* refers to customized experiment name, *TASK* refers to specific task or task group (see `tasks/` for more information), *K* refers to the time of independent experiments.
+where *MODEL_NAME* is one of the supported models (see [Introduction Page](./introduction.md) for more information), *EXPNAME* is a custom experiment name, *TASK* is a specific task or task group (see `tasks/<mcp>/<task_suite>/...` for more information), and *K* is the number of independent runs.
 
 
 Additionally, the `run-benchmark.sh` script evaluates models across all MCP services:

diff --git a/pipeline.py b/pipeline.py
@@ -54,6 +54,12 @@ def main():
         default="all",
         help='Tasks to run: (1). "all"; (2). "category"; or (3). "category/task".',
     )
+    parser.add_argument(
+        "--task-suite",
+        default="standard",
+        choices=["standard", "easy"],
+        help="Task suite to run (default: standard). Use 'easy' to run the lightweight dataset.",
+    )
     parser.add_argument(
         "--exp-name",
         default=None,
@@ -111,6 +117,7 @@ def main():
 
     logger.info("MCPMark Evaluation")
     logger.info(f"Experiment: {args.exp_name} | {len(model_list)} Model(s): {', '.join(model_list)}")
+    logger.info(f"Task suite: {args.task_suite}")
     if args.k > 1:
         logger.info(f"Running {args.k} evaluation runs for pass@k metrics")
 
@@ -147,6 +154,7 @@ def main():
                 output_dir=run_output_dir,
                 reasoning_effort=args.reasoning_effort,
                 agent_name=args.agent,
+                task_suite=args.task_suite,
             )
 
             pipeline.run_evaluation(args.tasks)
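The new `--task-suite` flag in `pipeline.py` can be tried in isolation. The sketch below reproduces only that one argument with the same default and choices as the diff; `build_parser` is a hypothetical wrapper, and the real parser defines many more options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser containing only the --task-suite option (hypothetical wrapper)."""
    parser = argparse.ArgumentParser(prog="pipeline")
    parser.add_argument(
        "--task-suite",
        default="standard",
        choices=["standard", "easy"],
        help="Task suite to run (default: standard).",
    )
    return parser


parser = build_parser()
print(parser.parse_args([]).task_suite)                        # standard
print(parser.parse_args(["--task-suite", "easy"]).task_suite)  # easy
```

Because `choices` is set, argparse rejects any other value and exits with a usage error before the evaluator is constructed.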

diff --git a/src/aggregators/aggregate_results.py b/src/aggregators/aggregate_results.py
@@ -20,8 +20,12 @@
 from src.aggregators.pricing import compute_cost_usd
 
 
-def discover_tasks() -> Dict[str, List[str]]:
-    """Discover all tasks from ./tasks directory."""
+# Supported difficulty splits in ./tasks/<service>/<task_set>/
+SUPPORTED_TASK_SETS = {"standard", "easy"}
+
+
+def discover_tasks(task_set: str = "standard") -> Dict[str, List[str]]:
+    """Discover all tasks from ./tasks directory filtered by task set."""
     tasks_dir = Path("./tasks")
 
     all_tasks = {}
@@ -37,22 +41,39 @@ def discover_tasks() -> Dict[str, List[str]]:
     }
 
     for mcp_service, task_dirs in service_mappings.items():
-        tasks = []
+        tasks: List[str] = []
         for task_dir_name in task_dirs:
             service_path = tasks_dir / task_dir_name
             if not service_path.exists():
                 continue
-
-            # Find all category/task combinations
-            for category_dir in service_path.iterdir():
-                if not category_dir.is_dir() or category_dir.name.startswith("__"):
-                    continue
-
-                for task_dir in category_dir.iterdir():
-                    if task_dir.is_dir():
-                        # Use unified naming for both playwright and webarena variants
-                        tasks.append(f"{category_dir.name}__{task_dir.name}")
-
+
+            selected_root = service_path / task_set
+
+            # Detect if this service has partitioned task sets (e.g. standard/easy)
+            has_partitioned_layout = any(
+                child.is_dir() and child.name in SUPPORTED_TASK_SETS
+                for child in service_path.iterdir()
+            )
+
+            if selected_root.exists():
+                search_roots = [selected_root]
+            elif has_partitioned_layout:
+                # Requested task set missing for this service; skip it for this run
+                print(f"  ⚠️ No '{task_set}' tasks found under {service_path}")
+                search_roots = []
+            else:
+                # Legacy layout without task sets – fall back to original structure
+                search_roots = [service_path]
+
+            for root in search_roots:
+                for category_dir in root.iterdir():
+                    if not category_dir.is_dir() or category_dir.name.startswith("__"):
+                        continue
+
+                    for task_dir in category_dir.iterdir():
+                        if task_dir.is_dir() and not task_dir.name.startswith("__"):
+                            tasks.append(f"{category_dir.name}__{task_dir.name}")
+
         all_tasks[mcp_service] = sorted(tasks)
 
     return all_tasks
@@ -655,14 +676,19 @@ def render_section(title: str, section_data: Dict[str, Any]) -> List[str]:
         f"# {exp_name} - Evaluation Results",
         "",
         f"Generated: {summary['generated_at']}",
-        "",
     ]
 
+    task_set = summary.get("task_set")
+    if task_set:
+        lines.append(f"Task set: {task_set}")
+
+    lines.append("")
+
     # Overall table
     lines.extend(render_section("Overall Performance", summary.get("overall", {})))
 
     # Service tables: infer service keys from summary
-    reserved = {"overall", "generated_at", "k", "experiment_name"}
+    reserved = {"overall", "generated_at", "k", "experiment_name", "task_set"}
     service_keys = [key for key in summary.keys() if key not in reserved]
     # Keep stable order
     for service in sorted(service_keys):
@@ -875,6 +901,12 @@ def main():
         type=str,
         help="Comma-separated list of models that only need run-1"
     )
+    parser.add_argument(
+        "--task-set",
+        choices=sorted(SUPPORTED_TASK_SETS),
+        default="standard",
+        help="Which task subset to aggregate (default: standard)"
+    )
     parser.add_argument("--push", action="store_true", help="Push to GitHub (default to main)")
 
     args = parser.parse_args()
@@ -894,8 +926,8 @@ def main():
     print(f"🔄 Processing experiment: {args.exp_name}")
 
     # Discover all tasks
-    print("📋 Discovering tasks...")
-    all_tasks = discover_tasks()
+    print(f"📋 Discovering tasks (task set: {args.task_set})...")
+    all_tasks = discover_tasks(args.task_set)
     total_tasks = sum(len(tasks) for tasks in all_tasks.values())
     print(f"  Found {total_tasks} tasks across {len(all_tasks)} services")
 
@@ -920,6 +952,7 @@ def main():
     print("\n📊 Calculating metrics...")
     summary = calculate_metrics(complete_models, all_tasks, args.k, single_run_models)
     summary["experiment_name"] = args.exp_name
+    summary["task_set"] = args.task_set
 
     # Save summary
     summary_path = exp_dir / "summary.json"
@@ -954,4 +987,4 @@ def main():
 
 
 if __name__ == "__main__":
-    exit(main())
+    exit(main())
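The root-selection branch in `discover_tasks` above can be exercised on a throwaway directory tree. This sketch extracts that logic into a hypothetical helper, `select_search_roots`; the service and category names are made up for the demonstration. Note that the aggregator exposes this option as `--task-set`, whereas `pipeline.py` calls it `--task-suite`.

```python
import tempfile
from pathlib import Path
from typing import List

SUPPORTED_TASK_SETS = {"standard", "easy"}


def select_search_roots(service_path: Path, task_set: str = "standard") -> List[Path]:
    """Mirror the three-way branch from the diff (hypothetical helper)."""
    selected_root = service_path / task_set
    if selected_root.exists():
        # Partitioned layout and the requested set is present.
        return [selected_root]
    has_partitioned_layout = any(
        child.is_dir() and child.name in SUPPORTED_TASK_SETS
        for child in service_path.iterdir()
    )
    if has_partitioned_layout:
        # Partitioned layout, but this service has no tasks in the requested set.
        return []
    # Legacy layout: categories sit directly under the service directory.
    return [service_path]


with tempfile.TemporaryDirectory() as tmp:
    partitioned = Path(tmp) / "github"
    (partitioned / "standard" / "some_category").mkdir(parents=True)
    legacy = Path(tmp) / "notion"
    (legacy / "some_category").mkdir(parents=True)

    print(select_search_roots(partitioned, "standard") == [partitioned / "standard"])  # True
    print(select_search_roots(partitioned, "easy"))                                    # []
    print(select_search_roots(legacy, "easy") == [legacy])                             # True
```

One consequence of this fallback is that a service still on the legacy layout contributes all of its tasks regardless of which task set is requested.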
diff --git a/src/base/task_manager.py b/src/base/task_manager.py
@@ -55,6 +55,7 @@ def __init__(
         mcp_service: str = None,
         task_class: type = None,
         task_organization: str = None,
+        task_suite: str | None = "standard",
     ):
         """Initialize the base task manager.
 
@@ -63,13 +64,15 @@ def __init__(
             mcp_service: MCP service name (e.g., 'notion', 'github', 'filesystem')
             task_class: Custom task class to use (defaults to BaseTask)
             task_organization: 'file' or 'directory' based task organization
+            task_suite: Logical task suite (e.g., 'standard', 'easy')
         """
         self.tasks_root = tasks_root
         self.mcp_service = mcp_service or self.__class__.__name__.lower().replace(
             "taskmanager", ""
         )
         self.task_class = task_class or BaseTask
         self.task_organization = task_organization
+        self.task_suite = task_suite
         self._tasks_cache = None
 
     # =========================================================================
@@ -85,6 +88,8 @@ def discover_all_tasks(self) -> List[BaseTask]:
         service_dir = self.tasks_root / (
             self.mcp_service or self._get_service_directory_name()
         )
+        if self.task_suite:
+            service_dir = service_dir / self.task_suite
 
         if not service_dir.exists():
             logger.warning(
@@ -112,9 +117,10 @@ def discover_all_tasks(self) -> List[BaseTask]:
         # Sort by category_id and a stringified task_id to handle both numeric IDs and slugs uniformly
         self._tasks_cache = sorted(tasks, key=lambda t: (t.category_id, str(t.task_id)))
         logger.info(
-            "Discovered %d %s tasks across all categories",
+            "Discovered %d %s tasks across all categories (suite=%s)",
             len(self._tasks_cache),
             self.mcp_service.title(),
+            self.task_suite or "default",
         )
         return self._tasks_cache
 

diff --git a/src/evaluator.py b/src/evaluator.py
@@ -27,11 +27,13 @@ def __init__(
         output_dir: Path = None,
         reasoning_effort: str = "default",
         agent_name: str = "mcpmark",
+        task_suite: str = "standard",
     ):
         # Main configuration
         self.mcp_service = mcp_service
         self.timeout = timeout
         self.agent_name = (agent_name or "mcpmark").lower()
+        self.task_suite = (task_suite or "standard").lower()
         if self.agent_name not in AGENT_REGISTRY:
             raise ValueError(f"Unsupported agent '{agent_name}'. Available: {sorted(AGENT_REGISTRY)}")
 
@@ -48,7 +50,9 @@ def __init__(
         self.litellm_run_model_name = None
 
         # Initialize managers using the factory pattern (simplified)
-        self.task_manager = MCPServiceFactory.create_task_manager(mcp_service)
+        self.task_manager = MCPServiceFactory.create_task_manager(
+            mcp_service, task_suite=self.task_suite
+        )
         self.state_manager = MCPServiceFactory.create_state_manager(mcp_service)
 
         # Obtain static service configuration from state manager (e.g., notion_key)
@@ -80,7 +84,9 @@ def __init__(
             model_slug = self.model_name.replace(".", "-")
 
         service_for_dir = "playwright" if mcp_service == "playwright_webarena" else mcp_service
-        self.base_experiment_dir = output_dir / f"{model_slug}__{service_for_dir}" / exp_name
+        suite_suffix = "" if self.task_suite in ("standard", "", None) else f"-{self.task_suite}"
+        service_dir_name = f"{service_for_dir}{suite_suffix}"
+        self.base_experiment_dir = output_dir / f"{model_slug}__{service_dir_name}" / exp_name
         self.base_experiment_dir.mkdir(parents=True, exist_ok=True)
 
     def _format_duration(self, seconds: float) -> str:
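The results-directory naming introduced in `src/evaluator.py` can be summarized as a small pure function. This is a sketch under the assumption that only the three lines shown in the hunk determine the directory name; `results_dir_name` is a hypothetical helper, not part of the repository.

```python
from typing import Optional


def results_dir_name(
    model_slug: str, mcp_service: str, task_suite: Optional[str] = "standard"
) -> str:
    """Return '<model>__<service>' with a '-<suite>' suffix for non-standard suites."""
    suite = (task_suite or "standard").lower()
    # The webarena variant shares the 'playwright' directory, as in the diff.
    service_for_dir = "playwright" if mcp_service == "playwright_webarena" else mcp_service
    suite_suffix = "" if suite == "standard" else f"-{suite}"
    return f"{model_slug}__{service_for_dir}{suite_suffix}"


print(results_dir_name("gpt-5", "filesystem"))                   # gpt-5__filesystem
print(results_dir_name("gpt-5", "github", "easy"))               # gpt-5__github-easy
print(results_dir_name("gpt-5", "playwright_webarena", "Easy"))  # gpt-5__playwright-easy
```

Keeping the standard suite unsuffixed means existing result directories remain valid, while easy-suite runs land in a separate directory and cannot overwrite them.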

diff --git a/src/mcp_services/filesystem/filesystem_task_manager.py b/src/mcp_services/filesystem/filesystem_task_manager.py
@@ -30,7 +30,7 @@ class FilesystemTask(BaseTask):
 class FilesystemTaskManager(BaseTaskManager):
     """Simplified filesystem task manager using enhanced base class."""
 
-    def __init__(self, tasks_root: Path = None):
+    def __init__(self, tasks_root: Path = None, task_suite: str = "standard"):
         """Initialize filesystem task manager."""
         if tasks_root is None:
             tasks_root = Path(__file__).resolve().parents[3] / "tasks"
@@ -40,6 +40,7 @@ def __init__(self, tasks_root: Path = None):
             mcp_service="filesystem",
             task_class=FilesystemTask,
             task_organization="directory",
+            task_suite=task_suite,
         )
 
     # Override only what's needed for filesystem-specific behavior

diff --git a/src/mcp_services/github/github_state_manager.py b/src/mcp_services/github/github_state_manager.py
@@ -626,7 +626,35 @@ def _request_with_retry(
 
     # Initial state for each task category is resolved via self.initial_state_mapping
     def select_initial_state_for_task(self, task_category: str) -> Optional[str]:
-        return self.initial_state_mapping.get(task_category)
+        """Resolve template name for a task category with light normalization."""
+        if not task_category:
+            return None
+
+        candidate_keys = []
+        candidate_keys.append(task_category)
+
+        # Allow users to swap between hyphen/underscore naming conventions.
+        hyphen_to_underscore = task_category.replace("-", "_")
+        if hyphen_to_underscore not in candidate_keys:
+            candidate_keys.append(hyphen_to_underscore)
+
+        underscore_to_hyphen = task_category.replace("_", "-")
+        if underscore_to_hyphen not in candidate_keys:
+            candidate_keys.append(underscore_to_hyphen)
+
+        for key in candidate_keys:
+            template = self.initial_state_mapping.get(key)
+            if template:
+                if key != task_category:
+                    logger.debug(
+                        "| Resolved GitHub template for %s via alias %s -> %s",
+                        task_category,
+                        key,
+                        template,
+                    )
+                return template
+
+        return None
 
     def extract_repo_info_from_url(self, repo_url: str) -> tuple[str, str]:
         """Extract owner and repo name from GitHub URL."""

diff --git a/src/mcp_services/github/github_task_manager.py b/src/mcp_services/github/github_task_manager.py
@@ -42,7 +42,7 @@ class GitHubTask(BaseTask):
 class GitHubTaskManager(BaseTaskManager):
     """Manages task discovery, filtering, and verification for GitHub-based MCPMark evaluation."""
 
-    def __init__(self, tasks_root: Path = None):
+    def __init__(self, tasks_root: Path = None, task_suite: str = "standard"):
         """Initialize GitHub task manager.
 
         Args:
@@ -57,6 +57,7 @@ def __init__(self, tasks_root: Path = None):
             mcp_service="github",
             task_class=GitHubTask,
             task_organization="file",
+            task_suite=task_suite,
         )  # GitHub uses file-based tasks
 
     # =========================================================================
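The hyphen/underscore normalization added to `select_initial_state_for_task` in `github_state_manager.py` can be illustrated as a standalone function. This is a sketch only: `resolve_template` is a hypothetical name, the mapping contents are invented for the example, and the debug logging from the diff is omitted.

```python
from typing import Dict, Optional


def resolve_template(mapping: Dict[str, str], task_category: str) -> Optional[str]:
    """Look up a category, then its underscore and hyphen variants, in that order."""
    if not task_category:
        return None

    candidates = [task_category]
    for variant in (task_category.replace("-", "_"), task_category.replace("_", "-")):
        if variant not in candidates:
            candidates.append(variant)

    for key in candidates:
        template = mapping.get(key)
        if template:
            return template
    return None


# Hypothetical mapping; the real initial_state_mapping lives in the state manager.
mapping = {"build_your_own_x": "template-a"}
print(resolve_template(mapping, "build_your_own_x"))  # template-a
print(resolve_template(mapping, "build-your-own-x"))  # template-a
print(resolve_template(mapping, "unknown-category"))  # None
```

The exact name is always tried first, so an alias can never shadow a category that is spelled exactly as it appears in the mapping.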