Clarify Table 5: Individual Tasks (10,115) vs LLM Judger (50) #16

@Ethereal-sakura

Hi, thanks for the excellent work and for open-sourcing MCPEval — it’s very helpful to the community. I have two quick questions:

  1. In Table 5, what’s the exact difference between Individual Tasks (10,115) and LLM Judger (50)? What does each sample count represent, and how is each aggregated?

  2. The Trajectory Evaluation Aspects and Task Completion Evaluation Aspects are defined in Appendix B.2 LLM Judger Criteria. Why do they also appear under Individual Tasks (10,115)? Are the same rubrics applied at the task level vs. the model×domain level, and if not, what differs?

Thanks again!
