Could You Share the Full Cross-Harness Evaluation Results?

Hi, thanks for releasing WildClawBench and the accompanying paper.

I have a question about the harness comparison experiment. In Section 4.1, the paper states that “we evaluate 19 frontier models on WildClawBench under four harnesses: OpenClaw, Claude Code, Codex, and Hermes Agent.” However, Table 2 reports the main results for all 19 models only under the default OpenClaw harness, while Table 3 compares different harnesses using only four models: GPT 5.4, GLM 5, MiMo V2 Pro, and MiniMax M2.7.

Could you please share the full evaluation results of all 19 models across the four harnesses (OpenClaw, Claude Code, Codex, Hermes Agent)?

Thanks again for the great work!

Could You Share the Full Cross-Harness Evaluation Results? #11

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Could You Share the Full Cross-Harness Evaluation Results? #11

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions