This repository benchmarks models with our custom agent harness, which gives the agent a sandbox (a container) to work in. The sandbox provides several tools, such as a browser and a terminal.

We use uv to manage the project, so to add a new package, use uv commands such as `uv add [package]`.
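As a rough sketch of a typical uv workflow: the package names below (`requests`, `pytest`) are placeholders chosen for illustration, not dependencies this repository is known to use.

```shell
# Add a runtime dependency (placeholder package name).
uv add requests

# Add a development-only dependency (placeholder package name).
uv add --dev pytest

# Install everything recorded in pyproject.toml and uv.lock into the project environment.
uv sync

# Run a command inside the project environment.
uv run pytest
```

Running `uv add` updates `pyproject.toml` and the lockfile together, so there is usually no need to edit the dependency list by hand.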

Try to keep things simple while maintaining correctness.