Here we provide a detailed benchmarking guide to reproduce the paper’s results across ten benchmarks.
We provide an automated vLLM serving script in `scripts/serve_vllm.sh`:

```bash
bash scripts/serve_vllm.sh
```

**Configuration**

You can configure the following parameters in `scripts/serve_vllm.sh`:
| Parameter | Description | Default |
|---|---|---|
| MODEL | Model path to serve (HuggingFace or local) | "AgentFlow/agentflow-planner-7b" |
| GPU | GPU device ID(s) to use | "0" |
| PORT | vLLM serving port | 8000 |
| TP | Tensor parallel size (passed to `--tensor-parallel-size`) | 1 |
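For reference, here is a minimal sketch of what a serving script with these parameters might look like. This is an assumption for illustration only; the actual `scripts/serve_vllm.sh` may be organized differently.

```bash
#!/bin/bash
# Illustrative sketch only; see scripts/serve_vllm.sh for the real script.
MODEL="AgentFlow/agentflow-planner-7b"   # HuggingFace repo or local path
GPU="0"                                  # GPU device ID(s)
PORT=8000                                # vLLM serving port
TP=1                                     # tensor parallel size

# Serve the model with vLLM's OpenAI-compatible server
CUDA_VISIBLE_DEVICES="$GPU" vllm serve "$MODEL" \
  --port "$PORT" \
  --tensor-parallel-size "$TP"
```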
We provide task-specific scripts to run the benchmarks. Each script executes our agentic system, saves the outputs, and automatically invokes an LLM to evaluate the results.
To run a specific benchmark (e.g., Bamboogle):

```bash
cd test
bash bamboogle/run.sh
```

You can configure benchmark settings in each task's `run.sh` script (e.g., `test/bamboogle/run.sh`).
Example configuration in `test/bamboogle/run.sh`:

```bash
#!/bin/bash
# Configuration
TASK="bamboogle"
THREADS=20
DATA_FILE_NAME="data.json"
MODELS=(
"8000:vllm-AgentFlow/agentflow-planner-7b,AgentFlow-7B,\
Base_Generator_Tool|Python_Coder_Tool|Google_Search_Tool|Wikipedia_Search_Tool,\
gpt-4o-mini|gpt-4o-mini|Default|Default,\
trainable|gpt-4o|gpt-4o|gpt-4o"
)
```

Steps:
1. Set Parallelism: Set the data parallelism for inference (too high a value may exceed API rate limits):

```bash
THREADS=20  # Number of parallel workers
```

2. Select Tasks: Enable or disable benchmarks by commenting/uncommenting them:
```bash
TASKS=(
  "aime24"
  "gameof24"
  "bamboogle"
  # "gpqa"
)
```
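If you want to run every enabled benchmark in one pass, a small driver loop over `TASKS` could look like the sketch below. This assumes each task directory contains a `run.sh` like the Bamboogle example above; the repository's own scripts may organize this differently.

```bash
# Sketch: run each enabled task's run.sh from the test/ directory (assumed layout)
cd test
for TASK in "${TASKS[@]}"; do
  echo "Running benchmark: ${TASK}"
  bash "${TASK}/run.sh"
done
```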
3. Define Models: Specify models with their configurations:
```bash
MODELS=(
"8000:vllm-AgentFlow/agentflow-planner-7b,AgentFlow-7B,Base_Generator_Tool|Python_Coder_Tool|Google_Search_Tool|Wikipedia_Search_Tool,dashscope-qwen2.5-7b-instruct|dashscope-qwen2.5-7b-instruct|Default|Default"
)
```

Format: `"port:model_path,label,Tool1|Tool2,engine1|engine2,planner|fixed|verifier|executor"` (an annotated breakdown is shown after the field list below)
- `port`: vLLM serving port (leave empty for API-based models)
- `model_path`: Model engine name (e.g., `gpt-4o` or `vllm-AgentFlow/agentflow-planner-7b`)
- `label`: Display name for results (used for folder naming)
- `tools`: Pipe-separated tool list (e.g., `Tool1|Tool2`)
- `tool_engine`: Pipe-separated engines for each tool
- `model_engines`: Configuration for the four agent modules (e.g., `trainable|gpt-4o|gpt-4o|gpt-4o`)
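To make the field layout concrete, here is the entry from the Bamboogle example decomposed into one shell variable per field. The variable names are purely illustrative; only the final composed string matters.

```bash
# Illustrative breakdown of a single MODELS entry (variable names are hypothetical)
PORT="8000"                                           # vLLM port; leave empty for API-based models
MODEL_PATH="vllm-AgentFlow/agentflow-planner-7b"      # model engine name
LABEL="AgentFlow-7B"                                  # used for results/log folder names
TOOLS="Base_Generator_Tool|Python_Coder_Tool|Google_Search_Tool|Wikipedia_Search_Tool"
TOOL_ENGINES="gpt-4o-mini|gpt-4o-mini|Default|Default"   # one engine per tool
MODEL_ENGINES="trainable|gpt-4o|gpt-4o|gpt-4o"           # planner|fixed|verifier|executor

MODELS=(
  "${PORT}:${MODEL_PATH},${LABEL},${TOOLS},${TOOL_ENGINES},${MODEL_ENGINES}"
)
```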
Note: For all agents except the planner, we now use gpt-4o by default to ensure high-quality reasoning.
After benchmark completion, results are organized in the following structure:
```
test/
└── {TASK_NAME}/                                # e.g., aime24, bamboogle
    ├── logs/
    │   └── {MODEL_LABEL}/                      # e.g., AgentFlow-7B
    │       ├── 0.log                           # Per-problem execution logs
    │       ├── 1.log
    │       └── ...
    ├── results/
    │   └── {MODEL_LABEL}/
    │       ├── final_results_direct_output.json   # Per-problem analysis
    │       ├── final_scores_direct_output.json    # Aggregate metrics
    │       ├── final_score_direct_output.log      # Scoring process log
    │       ├── output_0.json                      # Individual outputs
    │       ├── output_1.json
    │       └── ...
    └── cache/                                  # Cached intermediate results
```
| File | Description |
|---|---|
| `final_scores_direct_output.json` | Aggregate metrics: accuracy, correct/wrong counts, tool usage statistics |
| `final_results_direct_output.json` | Detailed per-problem results with verification and analysis |
| `output_{i}.json` | Complete execution trace: query, response, memory, tool calls |
| `final_score_direct_output.log` | Detailed scoring process log |
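As a quick sanity check after a run, you can pretty-print the aggregate score file; the path below assumes the Bamboogle task and the `AgentFlow-7B` label used in the examples above.

```bash
# Pretty-print the aggregate metrics for one benchmark run
python -m json.tool test/bamboogle/results/AgentFlow-7B/final_scores_direct_output.json
```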