Hi, thanks for releasing WildClawBench and the accompanying paper.
I have a question about the harness comparison experiment. In Section 4.1, the paper states that “we evaluate 19 frontier models on WildClawBench under four harnesses: OpenClaw, Claude Code, Codex, and Hermes Agent.” However, Table 2 reports the main results for all 19 models only under the default OpenClaw harness, while Table 3 compares different harnesses using only four models: GPT 5.4, GLM 5, MiMo V2 Pro, and MiniMax M2.7.
Could you please share the full evaluation results of all 19 models across the four harnesses (OpenClaw, Claude Code, Codex, Hermes Agent)?
Thanks again for the great work!