FIRE-Bench is constructed through research-problem decomposition, a process that transforms high-quality empirical analysis papers into verifiable benchmark tasks. This approach balances:
- Exploratory freedom (avoiding tasks that are too broad to benchmark), and
- Empirical verifiability (avoiding tasks that are too narrow to allow genuine exploration).
We evaluate agent performance through claim-level analysis. Both agent conclusions C_agent and ground-truth conclusions C_gt are decomposed into atomic, verifiable claims. Overall performance is measured using Precision, Recall, and the F₁ score.
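The claim-level scoring above can be sketched as follows. This is a minimal illustration that matches claims by exact set membership; the actual pipeline judges semantic equivalence between atomic claims with an LLM:

```python
def claim_metrics(agent_claims, gt_claims):
    """Claim-level Precision/Recall/F1.

    Simplified sketch: claims are matched by exact set intersection,
    whereas FIRE-Bench matches claims semantically.
    """
    agent, gt = set(agent_claims), set(gt_claims)
    matched = agent & gt
    precision = len(matched) / len(agent) if agent else 0.0
    recall = len(matched) / len(gt) if gt else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```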
```
mamba create -n firebench python=3.11   # or conda
mamba activate firebench
pip install -r requirements.txt
```
Create a `.env` file:
```
OPENAI_API_KEY=
GOOGLE_API_KEY=
HF_TOKEN=
ANTHROPIC_API_KEY=
USE_SUBSCRIPTION=1
```
Set `USE_SUBSCRIPTION=1` to use a Claude Code subscription, or `0` to use an API key. Note that some benchmark papers running on Claude models still require `ANTHROPIC_API_KEY`.
- Codex (v0.39.0 is required for timestamp logging):
  ```
  npx @openai/codex@0.39.0 --version
  ```
- Claude Code: follow the setup guide, or install via:
  ```
  curl -fsSL https://claude.ai/install.sh | bash
  ```
- OpenHands (requires Docker):
  ```
  export OPENHANDS_HOME=./.openhands; mkdir -p ./.openhands
  ```
In the `benchmark` folder, some papers have an empty `data` folder because their data can be loaded directly from HuggingFace. The sources of the data in non-empty `data` folders are described in `dataset.txt`.
Use the tree parser to decompose a research paper (PDF) into a hierarchical research-problem tree via OpenAI:
```
# Single paper
bash run_tree_parser.sh --papers /path/to/paper.pdf

# Multiple papers (quote the list)
bash run_tree_parser.sh --papers "/path/to/paper1.pdf /path/to/paper2.pdf" --model gpt-4o

# Glob pattern for a whole directory
bash run_tree_parser.sh --papers "/path/to/papers/*.pdf" --output_dir benchmark/trees
```
Options:
- `--papers`: space-separated PDF paths or a glob pattern (required)
- `--model`: OpenAI model name (default: `gpt-4o`)
- `--output_dir`: directory for output JSON trees (default: `benchmark/trees`)
- `--max_tokens`: max output tokens (default: `16384`)
- `--temperature`: sampling temperature (default: `0.0`)
Each paper produces a `<name>_tree.json` file containing the problem tree.
You can use FIRE-Bench to evaluate any agent system beyond the built-in ones (Codex, Claude Code, OpenHands).
Each task in `benchmark/papers/<task_id>/` contains:
```
benchmark/papers/<task_id>/
├── instruction/
│   ├── instruction.txt      # Research prompt for the agent
│   └── instruction_gt.txt   # (some tasks) Ground-truth experimental plan
└── data/                    # (optional) Datasets for the task
```
- `instruction.txt` defines the research question, available resources (models, datasets), and constraints.
- `data/` contains task-specific datasets. Some tasks load data directly from HuggingFace instead (described in the instruction).
There are 35 benchmark tasks spanning topics such as CoT reasoning, hallucination, bias, safety, multimodal understanding, and more.
For each task you want to evaluate, set up a working directory for your agent:
- Copy `benchmark/papers/<task_id>/data/` (if it exists) into the working directory as `data/`
- Copy the project-level `utils/` folder into the working directory (provides `LLMInference` and other shared helpers)
- Create a `.env` file with API keys so the agent can call LLMs during experiments
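The three setup steps above can be sketched in Python. This is a minimal illustration: `setup_workdir` and its arguments are hypothetical helpers, and it assumes a `.env` file already exists at the repository root:

```python
import shutil
from pathlib import Path

def setup_workdir(task_id: str, workdir: str, repo_root: str = "."):
    """Prepare a working directory for one benchmark task
    (hypothetical helper mirroring the setup steps above)."""
    root, wd = Path(repo_root), Path(workdir)
    wd.mkdir(parents=True, exist_ok=True)
    # Step 1: copy task data, if the task ships any locally
    task_data = root / "benchmark" / "papers" / task_id / "data"
    if task_data.is_dir():  # some tasks load data from HuggingFace instead
        shutil.copytree(task_data, wd / "data", dirs_exist_ok=True)
    # Step 2: copy shared helpers (LLMInference etc.)
    shutil.copytree(root / "utils", wd / "utils", dirs_exist_ok=True)
    # Step 3: provide API keys (assumes a .env at the repo root)
    shutil.copy(root / ".env", wd / ".env")
```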
Read the prompt from `benchmark/papers/<task_id>/instruction/instruction.txt` and pass it to your agent as the task instruction. The agent should:
- Design and execute experiments using the provided datasets and models
- Produce a final conclusion summarizing its findings
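Loading the prompt can be as simple as the sketch below; `run_my_agent` is a placeholder for your own agent's entry point:

```python
from pathlib import Path

def load_instruction(task_id: str, repo_root: str = ".") -> str:
    """Read the research prompt for one benchmark task."""
    path = (Path(repo_root) / "benchmark" / "papers" / task_id
            / "instruction" / "instruction.txt")
    return path.read_text()

# prompt = load_instruction("rational")
# conclusion = run_my_agent(prompt, workdir="...")  # your agent's entry point
```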
The evaluation pipeline reads logs from:
```
log/<agent_name>/<model_name>/<task_id>/<timestamp>/log.log
```
Each `log.log` must begin with three metadata lines followed by the full agent output:
```
agent_id: <your_agent_name>
task_id: <task_id>
llm_model: <model_name>
========================================
<full agent trajectory and output>
```
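A minimal sketch of writing a log in the expected location and header format (`write_log` is a hypothetical helper, and the timestamp format here is illustrative — the built-in runners generate their own timestamps):

```python
import time
from pathlib import Path

def write_log(agent: str, model: str, task: str, trajectory: str,
              log_root: str = "log") -> Path:
    """Write an agent run log under
    log/<agent>/<model>/<task>/<timestamp>/log.log, starting with
    the three metadata lines and the separator the evaluator expects."""
    ts = time.strftime("%Y%m%d%H%M%S")  # illustrative timestamp format
    out_dir = Path(log_root) / agent / model / task / ts
    out_dir.mkdir(parents=True, exist_ok=True)
    header = (f"agent_id: {agent}\n"
              f"task_id: {task}\n"
              f"llm_model: {model}\n"
              + "=" * 40 + "\n")
    log_path = out_dir / "log.log"
    log_path.write_text(header + trajectory)
    return log_path
```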
The evaluator extracts the agent's final conclusion from the log. It recognizes three formats — append one of these at the end of your log:
| Format | How to emit |
|---|---|
| JSON (simplest) | Append a JSON line: `{"result": "<final conclusion>"}` |
| OpenHands-style | Include `final_thought='<conclusion>', outputs=` in the log |
| Codex-style | Bracket conclusions between `[YYYY-MM-DDTHH:MM:SS]` timestamp lines |
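For the JSON format (the simplest of the three), appending the final line might look like this sketch; `append_conclusion` is a hypothetical helper:

```python
import json

def append_conclusion(log_path: str, conclusion: str) -> None:
    """Append the JSON-format final conclusion line to an existing
    log.log so the evaluator can extract it."""
    with open(log_path, "a") as f:
        f.write("\n" + json.dumps({"result": conclusion}) + "\n")
```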
Run the evaluation pipeline on your logs:
```
bash run_eval.sh --agents <your_agent_name> --models <model_name> --tasks <task_id>

# Or evaluate everything at once
bash run_eval.sh --agents all --models all --tasks all
```
The pipeline decomposes both the agent's conclusion and the ground truth into atomic claims, then computes Precision, Recall, and F₁ via claim-level analysis.
Edit `run_experiment.sh` to configure your agent/task/model combinations, then run:
```
bash run_experiment.sh
```
This iterates over all combinations of `AGENT_IDS`, `TASK_IDS`, and `LLM_MODELS`, calling `run_agent.py` for each. Results are saved to the `log/` folder.
Parameters in `run_experiment.sh`:
- `AGENT_IDS`: agents to run (e.g., `codex`, `claude_code`, `openhands`)
- `TASK_IDS`: benchmark tasks (e.g., `rational`)
- `LLM_MODELS`: models to use (e.g., `gpt-5`)
After experiments finish, evaluate the generated logs:
```
# Evaluate all agents/models/tasks
bash run_eval.sh --agents all --models all --tasks all

# Evaluate a specific run
bash run_eval.sh --agents codex --models gpt-5 --tasks rational --timestamp 20251016232701_10997
```
Options:
- `--agents`: agent name or `all`
- `--models`: model name or `all`
- `--tasks`: task name or `all`
- `--timestamp`: (optional) evaluate a specific run by timestamp
