This workflow is for coding agents that need sandboxed execution: running inside Docker ensures the agent cannot access ground truth during testing.
```bash
# 1. Build base image
docker build -t iplotbench:latest -f docker/Dockerfile .

# 2. Prepare environments
python -m eval.prepare_envs --output ./env

# 3. Build and run your agent
docker run -v $(pwd)/env:/data my-agent:latest

# 4. Compute metrics
python -m eval.run_eval my_agent
```

In detail, first build the base image from the repository root:

```bash
cd iPlotBench
docker build -t iplotbench:latest -f docker/Dockerfile .
```

Then prepare the evaluation environments:

```bash
python -m eval.prepare_envs --output ./env
```

This creates:
```
env/
├── test/
│   └── {figure_id}/
│       └── input.png        # Input for both task1 and task2
├── query/
│   └── {figure_id}/
│       └── questions.json   # Questions only (no answers)
└── output/                  # Agent writes results here
```
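As a quick sanity check of this layout, here is a small sketch (assuming the `./env` path from the command above) that lists each figure and confirms its files exist:

```python
from pathlib import Path

env = Path("./env")  # matches --output ./env above

for fig_dir in sorted((env / "test").iterdir()):
    figure_id = fig_dir.name
    input_png = fig_dir / "input.png"                         # input for both tasks
    questions = env / "query" / figure_id / "questions.json"  # task2 questions
    print(figure_id, input_png.exists(), questions.exists())
```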
Create a Dockerfile in your agent project:
```dockerfile
FROM iplotbench:latest

# Install your dependencies

# Copy your agent code
COPY . /agent
WORKDIR /agent

# Set entry point
ENTRYPOINT ["python", "-m", "your_runner"]
```

Build your image:

```bash
docker build -t my-agent-eval:latest .
```

Then run it with the prepared environments mounted at `/data`:

```bash
docker run \
  -v /path/to/iPlotBench/env:/data \
  my-agent-eval:latest
```
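For orientation, here is a minimal, hypothetical skeleton of the `your_runner` module named in the ENTRYPOINT above. `CONFIG` and the `solve_task1` / `solve_task2` helpers are placeholders for your agent's own logic; only the paths and file names follow the contract described under Input and Output below:

```python
# your_runner.py -- hypothetical skeleton; solve_task1/solve_task2 stand in
# for your agent's logic, and CONFIG is an arbitrary label.
import json
from pathlib import Path

DATA = Path("/data")
CONFIG = "default"

def solve_task1(image_path: Path) -> dict:
    return {"data": [], "layout": {}}  # placeholder recreation result

def solve_task2(image_path: Path, question: str) -> int:
    return 0  # placeholder yes/no answer (0 = No, 1 = Yes)

def main() -> None:
    for fig_dir in sorted((DATA / "test").iterdir()):
        figure_id = fig_dir.name
        image = fig_dir / "input.png"
        out_dir = DATA / "output" / figure_id
        out_dir.mkdir(parents=True, exist_ok=True)

        # Task 1: recreate the figure
        result = solve_task1(image)
        (out_dir / f"output_{CONFIG}_task1.json").write_text(json.dumps(result))

        # Task 2: one answer file per question
        questions_path = DATA / "query" / figure_id / "questions.json"
        for q in json.loads(questions_path.read_text()):
            answer = solve_task2(image, q["question_string"])
            out_path = out_dir / f"output_{CONFIG}_task2_q{q['question_id']}.json"
            out_path.write_text(json.dumps({"answer": answer}))

if __name__ == "__main__":
    main()
```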
Finally, compute metrics from the repository root:

```bash
cd iPlotBench
python -m eval.run_eval my_agent
```

Input:
```
/data/test/{figure_id}/
└── input.png   # The visualization to analyze
```
For task2, questions are provided:
```
/data/query/{figure_id}/
└── questions.json   # Questions only (no answers)
```
`questions.json` format:
```json
[
  {"question_id": 0, "question_string": "Is X the minimum?"},
  {"question_id": 1, "question_string": "Is Y greater than Z?"}
]
```
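A short sketch of reading this file (the `/data` paths follow the mount above; `fig_001` is a made-up figure ID):

```python
import json
from pathlib import Path

# fig_001 is a hypothetical figure_id used for illustration
for q in json.loads(Path("/data/query/fig_001/questions.json").read_text()):
    print(q["question_id"], q["question_string"])  # each question is yes/no
```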
Output:

The agent should produce:
```
/data/output/{figure_id}/
├── output_{config}_task1.json       # Recreation task
└── output_{config}_task2_q{n}.json  # QA task
```
Task 1 (Recreation) Output Format:
```json
{
  "data": [...],
  "layout": {...}
}
```
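The `data`/`layout` pair has the shape of a Plotly figure's JSON. Assuming that is the intended schema (an inference from the key names, not something this section states), a Plotly-based agent could serialize its figure directly:

```python
import json
import plotly.graph_objects as go

# Assumes the benchmark accepts Plotly figure JSON (top-level data + layout).
fig = go.Figure(
    data=[go.Bar(x=["A", "B"], y=[1, 3])],
    layout=go.Layout(title="example"),
)
payload = fig.to_plotly_json()  # {"data": [...], "layout": {...}}
print(json.dumps(payload, default=str)[:80])
```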
Task 2 (QA) Output Format:

```json
{
  "answer": 0
}
```

Where `answer` is 0 (No) or 1 (Yes).
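For completeness, a one-file sketch of writing a QA answer (`fig_001`, `my_config`, and `q0` are illustrative; the full loop appears in the runner skeleton earlier):

```python
import json
from pathlib import Path

out = Path("/data/output/fig_001/output_my_config_task2_q0.json")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps({"answer": 1}))  # 1 = Yes, 0 = No
```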