FIRE-Bench is constructed through research-problem decomposition, a process that transforms high-quality empirical analysis papers into verifiable benchmark tasks. This approach balances:
- Exploratory freedom (avoiding tasks that are too broad to benchmark), and
- Empirical verifiability (avoiding tasks that are too narrow to allow genuine exploration).
We evaluate agent performance through claim-level analysis. Both agent conclusions C_agent and ground-truth conclusions C_gt are decomposed into atomic, verifiable claims. Overall performance is measured using Precision, Recall, and the F₁ score.
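The claim-level scoring above can be sketched as follows. This is a minimal illustration that matches claims by exact set membership; the actual pipeline judges semantic equivalence between atomic claims with an LLM:

```python
def claim_metrics(agent_claims, gt_claims):
    """Claim-level Precision/Recall/F1.

    Simplified sketch: claims are matched by exact set intersection,
    whereas FIRE-Bench matches claims semantically.
    """
    agent, gt = set(agent_claims), set(gt_claims)
    matched = agent & gt
    precision = len(matched) / len(agent) if agent else 0.0
    recall = len(matched) / len(gt) if gt else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```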
```
mamba create -n firebench python=3.11   # or conda
mamba activate firebench
pip install -r requirements.txt
```
Create a `.env` file:
```
OPENAI_API_KEY=
GOOGLE_API_KEY=
HF_TOKEN=
ANTHROPIC_API_KEY=
USE_SUBSCRIPTION=1
```
Set `USE_SUBSCRIPTION=1` to use a Claude Code subscription, or `0` to use an API key. Note that some benchmark papers running on Claude models still require `ANTHROPIC_API_KEY`.
- Codex (v0.39.0 is required for timestamp logging):
  ```
  npx @openai/codex@0.39.0 --version
  ```
- Claude Code: follow the setup guide, or install via:
  ```
  curl -fsSL https://claude.ai/install.sh | bash
  ```
- OpenHands (requires Docker):
  ```
  export OPENHANDS_HOME=./.openhands; mkdir -p ./.openhands
  ```
In the `benchmark` folder, some papers have an empty `data` folder because their data can be loaded directly from HuggingFace. The sources of the data in non-empty `data` folders are described in `dataset.txt`.
Use the tree parser to decompose a research paper (PDF) into a hierarchical research-problem tree via OpenAI:
```
# Single paper
bash run_tree_parser.sh --papers /path/to/paper.pdf

# Multiple papers (quote the list)
bash run_tree_parser.sh --papers "/path/to/paper1.pdf /path/to/paper2.pdf" --model gpt-4o

# Glob pattern for a whole directory
bash run_tree_parser.sh --papers "/path/to/papers/*.pdf" --output_dir benchmark/trees
```
Options:
- `--papers`: space-separated PDF paths or a glob pattern (required)
- `--model`: OpenAI model name (default: `gpt-4o`)
- `--output_dir`: directory for output JSON trees (default: `benchmark/trees`)
- `--max_tokens`: max output tokens (default: `16384`)
- `--temperature`: sampling temperature (default: `0.0`)
Each paper produces a `<name>_tree.json` file containing the problem tree.
You can use FIRE-Bench to evaluate any agent system beyond the built-in ones (Codex, Claude Code, OpenHands).
Each task in `benchmark/papers/<task_id>/` contains:
```
benchmark/papers/<task_id>/
├── instruction/
│   ├── instruction.txt      # Research prompt for the agent
│   └── instruction_gt.txt   # (some tasks) Ground-truth experimental plan
└── data/                    # (optional) Datasets for the task
```
- `instruction.txt` defines the research question, available resources (models, datasets), and constraints.
- `data/` contains task-specific datasets. Some tasks load data directly from HuggingFace instead (described in the instruction).
There are 35 benchmark tasks spanning topics such as CoT reasoning, hallucination, bias, safety, multimodal understanding, and more.
For each task you want to evaluate, set up a working directory for your agent:
- Copy `benchmark/papers/<task_id>/data/` (if it exists) into the working directory as `data/`
- Copy the project-level `utils/` folder into the working directory (provides `LLMInference` and other shared helpers)
- Create a `.env` file with API keys so the agent can call LLMs during experiments
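The three setup steps above can be sketched in Python. This is a minimal illustration: `setup_workdir` and its arguments are hypothetical helpers, and it assumes a `.env` file already exists at the repository root:

```python
import shutil
from pathlib import Path

def setup_workdir(task_id: str, workdir: str, repo_root: str = "."):
    """Prepare a working directory for one benchmark task
    (hypothetical helper mirroring the setup steps above)."""
    root, wd = Path(repo_root), Path(workdir)
    wd.mkdir(parents=True, exist_ok=True)
    # Step 1: copy task data, if the task ships any locally
    task_data = root / "benchmark" / "papers" / task_id / "data"
    if task_data.is_dir():  # some tasks load data from HuggingFace instead
        shutil.copytree(task_data, wd / "data", dirs_exist_ok=True)
    # Step 2: copy shared helpers (LLMInference etc.)
    shutil.copytree(root / "utils", wd / "utils", dirs_exist_ok=True)
    # Step 3: provide API keys (assumes a .env at the repo root)
    shutil.copy(root / ".env", wd / ".env")
```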
Read the prompt from `benchmark/papers/<task_id>/instruction/instruction.txt` and pass it to your agent as the task instruction. The agent should:
- Design and execute experiments using the provided datasets and models
- Produce a final conclusion summarizing its findings
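Loading the prompt can be as simple as the sketch below; `run_my_agent` is a placeholder for your own agent's entry point:

```python
from pathlib import Path

def load_instruction(task_id: str, repo_root: str = ".") -> str:
    """Read the research prompt for one benchmark task."""
    path = (Path(repo_root) / "benchmark" / "papers" / task_id
            / "instruction" / "instruction.txt")
    return path.read_text()

# prompt = load_instruction("rational")
# conclusion = run_my_agent(prompt, workdir="...")  # your agent's entry point
```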
The evaluation pipeline reads logs from:
```
log/<agent_name>/<model_name>/<task_id>/<timestamp>/log.log
```
Each `log.log` must begin with three metadata lines followed by the full agent output:
```
agent_id: <your_agent_name>
task_id: <task_id>
llm_model: <model_name>
========================================
<full agent trajectory and output>
```
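A minimal sketch of writing a log in the expected location and header format (`write_log` is a hypothetical helper, and the timestamp format here is illustrative — the built-in runners generate their own timestamps):

```python
import time
from pathlib import Path

def write_log(agent: str, model: str, task: str, trajectory: str,
              log_root: str = "log") -> Path:
    """Write an agent run log under
    log/<agent>/<model>/<task>/<timestamp>/log.log, starting with
    the three metadata lines and the separator the evaluator expects."""
    ts = time.strftime("%Y%m%d%H%M%S")  # illustrative timestamp format
    out_dir = Path(log_root) / agent / model / task / ts
    out_dir.mkdir(parents=True, exist_ok=True)
    header = (f"agent_id: {agent}\n"
              f"task_id: {task}\n"
              f"llm_model: {model}\n"
              + "=" * 40 + "\n")
    log_path = out_dir / "log.log"
    log_path.write_text(header + trajectory)
    return log_path
```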
The evaluator extracts the agent's final conclusion from the log. It recognizes three formats — append one of these at the end of your log:
| Format | How to emit |
|---|---|
| JSON (simplest) | Append a JSON line: `{"result": "<final conclusion>"}` |
| OpenHands-style | Include `final_thought='<conclusion>', outputs=` in the log |
| Codex-style | Bracket conclusions between `[YYYY-MM-DDTHH:MM:SS]` timestamp lines |
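For the JSON format (the simplest of the three), appending the final line might look like this sketch; `append_conclusion` is a hypothetical helper:

```python
import json

def append_conclusion(log_path: str, conclusion: str) -> None:
    """Append the JSON-format final conclusion line to an existing
    log.log so the evaluator can extract it."""
    with open(log_path, "a") as f:
        f.write("\n" + json.dumps({"result": conclusion}) + "\n")
```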
Run the evaluation pipeline on your logs:
```
bash run_eval.sh --agents <your_agent_name> --models <model_name> --tasks <task_id>

# Or evaluate everything at once
bash run_eval.sh --agents all --models all --tasks all
```
The pipeline decomposes both the agent's conclusion and the ground truth into atomic claims, then computes Precision, Recall, and F₁ via claim-level analysis.
Edit `run_experiment.sh` to configure your agent/task/model combinations, then run:
```
bash run_experiment.sh
```
This iterates over all combinations of `AGENT_IDS`, `TASK_IDS`, and `LLM_MODELS`, calling `run_agent.py` for each. Results are saved to the `log/` folder.
Parameters in `run_experiment.sh`:
- `AGENT_IDS`: agents to run (e.g., `codex`, `claude_code`, `openhands`)
- `TASK_IDS`: benchmark tasks (e.g., `rational`)
- `LLM_MODELS`: models to use (e.g., `gpt-5`)
After experiments finish, evaluate the generated logs:
```
# Evaluate all agents/models/tasks
bash run_eval.sh --agents all --models all --tasks all

# Evaluate a specific run
bash run_eval.sh --agents codex --models gpt-5 --tasks rational --timestamp 20251016232701_10997
```
Options:
- `--agents`: agent name or `all`
- `--models`: model name or `all`
- `--tasks`: task name or `all`
- `--timestamp`: (optional) evaluate a specific run by timestamp
