Could You Share the Full Cross-Harness Evaluation Results? #11

@SpiritsYouthHarmony

Hi, thanks for releasing WildClawBench and the accompanying paper.

I have a question about the harness comparison experiment. In Section 4.1, the paper states that “we evaluate 19 frontier models on WildClawBench under four harnesses: OpenClaw, Claude Code, Codex, and Hermes Agent.” However, Table 2 reports the main results for all 19 models only under the default OpenClaw harness, while Table 3 compares the harnesses using only four models: GPT 5.4, GLM 5, MiMo V2 Pro, and MiniMax M2.7. As far as I can tell, the results for the other 15 models under Claude Code, Codex, and Hermes Agent are not reported.

Could you please share the full evaluation results of all 19 models across the four harnesses (OpenClaw, Claude Code, Codex, Hermes Agent)?

Thanks again for the great work!
