This repository benchmarks models using our custom agent harness. The harness gives the agent a sandbox (a container) to work in, and the sandbox comes with several tools, such as a browser and a terminal.
We use uv to manage the project, so to add a new package, use uv commands such as uv add [package].
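If you ever need to script dependency additions instead of typing them, the following is a minimal sketch of how the uv add command line could be assembled. The helper name uv_add_args is hypothetical and not part of this repository, and requests and pytest are only placeholder package names. The uv add subcommand and its --dev flag are part of the uv CLI.

```python
import shlex


def uv_add_args(package: str, dev: bool = False) -> list[str]:
    """Build the argument list for `uv add` (hypothetical helper, not part of this repo)."""
    args = ["uv", "add"]
    if dev:
        # --dev records the package as a development dependency.
        args.append("--dev")
    args.append(package)
    return args


if __name__ == "__main__":
    # Placeholder package names; the commands are printed, not executed.
    print(shlex.join(uv_add_args("requests")))
    print(shlex.join(uv_add_args("pytest", dev=True)))
```

To actually run one of these commands, the returned list can be passed to subprocess.run(args, check=True) from the repository root.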
Try to keep things simple while maintaining correctness.