Hi, thanks for the excellent work and for open-sourcing MCPEval — it’s very helpful to the community. I have two quick questions:
- In Table 5, what’s the exact difference between Individual Tasks (10,115) and LLM Judger (50)? What does each sample count represent, and how is each aggregated?
- The Trajectory Evaluation Aspects and Task Completion Evaluation Aspects are defined in Appendix B.2 LLM Judger Criteria. Why do they also appear under Individual Tasks (10,115)? Are the same rubrics applied at the task level vs. the model×domain level, and if not, what differs?
Thanks again!