Clarify Table 5: Individual Tasks (10,115) vs LLM Judger (50)

Hi, thanks for the excellent work and for open-sourcing MCPEval — it’s very helpful to the community. I have two quick questions:

1. In **Table 5**, what’s the exact difference between **Individual Tasks (10,115)** and **LLM Judger (50)**? What does each sample count represent, and how is each aggregated?

2. The **Trajectory Evaluation Aspects** and **Task Completion Evaluation Aspects** are defined in Appendix **B.2 LLM Judger Criteria**. Why do they also appear under **Individual Tasks (10,115)**? Are the same rubrics applied at both the task level and the model×domain level, and if not, what differs?

Thanks again!


Clarify Table 5: Individual Tasks (10,115) vs LLM Judger (50) #16
