Here we provide a detailed benchmarking guide to reproduce the paper’s results across ten benchmarks.
We provide an automated vLLM serving script in `scripts/serve_vllm.sh`:

```bash
bash scripts/serve_vllm.sh
```

**Configuration**

You can configure the following parameters in `scripts/serve_vllm.sh`:
| Parameter | Description | Default |
|---|---|---|
| MODEL | Model path to serve (HuggingFace or local) | "AgentFlow/agentflow-planner-7b" |
| GPU | GPU device ID(s) to use | "0" |
| PORT | vLLM serving port | 8000 |
| TP | Tensor parallel size (passed to `--tensor-parallel-size`) | 1 |
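For reference, here is a minimal sketch of what a serving script with these parameters might look like. This is an assumption for illustration only; the actual `scripts/serve_vllm.sh` may be organized differently.

```bash
#!/bin/bash
# Illustrative sketch only; see scripts/serve_vllm.sh for the real script.
MODEL="AgentFlow/agentflow-planner-7b"   # HuggingFace repo or local path
GPU="0"                                  # GPU device ID(s)
PORT=8000                                # vLLM serving port
TP=1                                     # tensor parallel size

# Serve the model with vLLM's OpenAI-compatible server
CUDA_VISIBLE_DEVICES="$GPU" vllm serve "$MODEL" \
  --port "$PORT" \
  --tensor-parallel-size "$TP"
```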
We provide task-specific scripts to run the benchmarks. Each script executes our agentic system, saves the outputs, and automatically invokes an LLM to evaluate the results.
To run a specific benchmark (e.g., Bamboogle):

```bash
cd test
bash bamboogle/run.sh
```

You can configure benchmark settings in each task's `run.sh` script (e.g., `test/bamboogle/run.sh`).
Example configuration in `test/bamboogle/run.sh`:

```bash
#!/bin/bash
# Configuration
TASK="bamboogle"
THREADS=20
DATA_FILE_NAME="data.json"
MODELS=(
"8000:vllm-AgentFlow/agentflow-planner-7b,AgentFlow-7B,\
Base_Generator_Tool|Python_Coder_Tool|Google_Search_Tool|Wikipedia_Search_Tool,\
gpt-4o-mini|gpt-4o-mini|Default|Default,\
trainable|gpt-4o|gpt-4o|gpt-4o"
)
```

Steps:
1. Set Parallelism: Set the data parallelism for inference (too high a value may exceed API rate limits):

```bash
THREADS=20  # Number of parallel workers
```

2. Select Tasks: Enable or disable benchmarks by commenting/uncommenting them:
```bash
TASKS=(
  "aime24"
  "gameof24"
  "bamboogle"
  # "gpqa"
)
```
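If you want to run every enabled benchmark in one pass, a small driver loop over `TASKS` could look like the sketch below. This assumes each task directory contains a `run.sh` like the Bamboogle example above; the repository's own scripts may organize this differently.

```bash
# Sketch: run each enabled task's run.sh from the test/ directory (assumed layout)
cd test
for TASK in "${TASKS[@]}"; do
  echo "Running benchmark: ${TASK}"
  bash "${TASK}/run.sh"
done
```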
3. Define Models: Specify models with their configurations:
```bash
MODELS=(
"8000:vllm-AgentFlow/agentflow-planner-7b,AgentFlow-7B,Base_Generator_Tool|Python_Coder_Tool|Google_Search_Tool|Wikipedia_Search_Tool,dashscope-qwen2.5-7b-instruct|dashscope-qwen2.5-7b-instruct|Default|Default"
)
```

Format: `"port:model_path,label,Tool1|Tool2,engine1|engine2,planner|fixed|verifier|executor"` (an annotated breakdown is shown after the field list below)
- `port`: vLLM serving port (leave empty for API-based models)
- `model_path`: Model engine name (e.g., `gpt-4o` or `vllm-AgentFlow/agentflow-planner-7b`)
- `label`: Display name for results (used for folder naming)
- `tools`: Pipe-separated tool list (e.g., `Tool1|Tool2`)
- `tool_engine`: Pipe-separated engines for each tool
- `model_engines`: Configuration for the four agent modules (e.g., `trainable|gpt-4o|gpt-4o|gpt-4o`)
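To make the field layout concrete, here is the entry from the Bamboogle example decomposed into one shell variable per field. The variable names are purely illustrative; only the final composed string matters.

```bash
# Illustrative breakdown of a single MODELS entry (variable names are hypothetical)
PORT="8000"                                           # vLLM port; leave empty for API-based models
MODEL_PATH="vllm-AgentFlow/agentflow-planner-7b"      # model engine name
LABEL="AgentFlow-7B"                                  # used for results/log folder names
TOOLS="Base_Generator_Tool|Python_Coder_Tool|Google_Search_Tool|Wikipedia_Search_Tool"
TOOL_ENGINES="gpt-4o-mini|gpt-4o-mini|Default|Default"   # one engine per tool
MODEL_ENGINES="trainable|gpt-4o|gpt-4o|gpt-4o"           # planner|fixed|verifier|executor

MODELS=(
  "${PORT}:${MODEL_PATH},${LABEL},${TOOLS},${TOOL_ENGINES},${MODEL_ENGINES}"
)
```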
Note: For all agents except the planner, we now use gpt-4o by default to ensure high-quality reasoning.
After benchmark completion, results are organized in the following structure:
```
test/
└── {TASK_NAME}/                                # e.g., aime24, bamboogle
    ├── logs/
    │   └── {MODEL_LABEL}/                      # e.g., AgentFlow-7B
    │       ├── 0.log                           # Per-problem execution logs
    │       ├── 1.log
    │       └── ...
    ├── results/
    │   └── {MODEL_LABEL}/
    │       ├── final_results_direct_output.json   # Per-problem analysis
    │       ├── final_scores_direct_output.json    # Aggregate metrics
    │       ├── final_score_direct_output.log      # Scoring process log
    │       ├── output_0.json                      # Individual outputs
    │       ├── output_1.json
    │       └── ...
    └── cache/                                  # Cached intermediate results
```
| File | Description |
|---|---|
| `final_scores_direct_output.json` | Aggregate metrics: accuracy, correct/wrong counts, tool usage statistics |
| `final_results_direct_output.json` | Detailed per-problem results with verification and analysis |
| `output_{i}.json` | Complete execution trace: query, response, memory, tool calls |
| `final_score_direct_output.log` | Detailed scoring process log |
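As a quick sanity check after a run, you can pretty-print the aggregate score file; the path below assumes the Bamboogle task and the `AgentFlow-7B` label used in the examples above.

```bash
# Pretty-print the aggregate metrics for one benchmark run
python -m json.tool test/bamboogle/results/AgentFlow-7B/final_scores_direct_output.json
```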