# Usage

This guide explains how to run benchmarks and use the automated workflow optimizer with MASArena.

## Prerequisites

1. **Install dependencies:**
   If you haven't already, install the required packages. We recommend using `uv`.
   ```bash
   uv sync
   ```

2. **Configure Environment Variables:**
   Create a `.env` file in the project root and set your OpenAI API key and desired model.
   ```bash
   OPENAI_API_KEY=your_openai_api_key
   MODEL_NAME=gpt-4o-mini
   OPENAI_API_BASE=https://api.openai.com/v1
   ```

## Running Benchmarks

You can run benchmarks using the convenience shell script `run_benchmark.sh` (recommended) or by directly calling `main.py`.

### Using the Shell Script (`run_benchmark.sh`)

The `run_benchmark.sh` script is the simplest way to run evaluations.

**Syntax:**
```bash
# Usage: ./run_benchmark.sh [benchmark] [agent_system] [limit] [mcp_config] [concurrency] [optimizer]
./run_benchmark.sh math supervisor_mas 10
```

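If you want to extend the script, the positional-arguments-with-defaults pattern its usage line implies can be sketched in plain bash. The default values below are assumptions for illustration, not the script's actual defaults:

```bash
# Illustrative sketch of run_benchmark.sh-style positional arguments.
# The default values here are assumptions, not the real script's defaults.
benchmark="${1:-math}"
agent_system="${2:-single_agent}"
limit="${3:-10}"
mcp_config="${4:-}"     # empty string means "no MCP config"
concurrency="${5:-}"
optimizer="${6:-}"
echo "benchmark=$benchmark agent_system=$agent_system limit=$limit"
```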
**Examples:**

```bash
# Run the 'math' benchmark on 10 problems with the 'supervisor_mas' agent system
./run_benchmark.sh math supervisor_mas 10

# Run the 'humaneval' benchmark asynchronously with a concurrency of 10
# The "" is a placeholder for the mcp_config argument.
./run_benchmark.sh humaneval single_agent 20 "" 10
```

## Automated Workflow Optimization (AFlow)

MASArena includes an implementation of AFlow, an automated optimizer for agent workflows.

**Example:**
To optimize an agent for the `humaneval` benchmark with AFlow, pass `aflow` as the optimizer argument to the shell script:

```bash
# The "" arguments are placeholders for mcp_config and concurrency.
./run_benchmark.sh humaneval single_agent 10 "" "" aflow
```

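The shell script forwards its arguments to `main.py`, so the same optimization run can also be started directly:

```bash
# Equivalent direct invocation via main.py
python main.py --run-optimizer aflow --benchmark humaneval
```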
## Command-Line Arguments

Here are the most common arguments for `main.py`.

### Main Arguments

| Argument | Description | Default |
|---|---|---|
| `--benchmark` | The name of the benchmark to run. | `math` |
| `--agent-system` | The agent system to use for the benchmark. | `single_agent` |
| `--limit` | The maximum number of problems to evaluate. | `None` (all) |
| `--data` | Path to a custom benchmark data file (JSONL format). | `data/{benchmark}_test.jsonl` |
| `--data-id` | A specific data ID to run from the benchmark file. | `None` |
| `--results-dir` | Directory to store detailed JSON results. | `results/` |
| `--verbose` | Print progress information. | `True` |
| `--async-run` | Run the benchmark asynchronously for faster evaluation. | `False` |
| `--concurrency` | Set the concurrency level for asynchronous runs. | `10` |
| `--use-tools` | Enable the agent to use integrated tools (e.g., code interpreter). | `False` |
| `--use-mcp-tools` | Enable the agent to use tools via MCP. | `False` |
| `--mcp-config-file` | Path to the MCP server configuration file. Required for MCP tools. | `None` |

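Combining several of the flags above, a typical direct `main.py` run might look like the following. The flag names come from the table; the specific values are arbitrary examples:

```bash
# Run 20 humaneval problems asynchronously with built-in tools enabled;
# detailed JSON results are written under results/.
python main.py \
  --benchmark humaneval \
  --agent-system single_agent \
  --limit 20 \
  --async-run --concurrency 10 \
  --use-tools \
  --results-dir results/
```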
### Optimizer Arguments

These arguments are used when running an optimizer like AFlow via `--run-optimizer`.

| Argument | Type | Default | Description |
|---|---|---|---|
| `--run-optimizer` | str | `None` | Specifies the optimizer to run. Use `aflow`. |
| `--graph_path` | str | `mas_arena/configs/aflow` | Path to the base AFlow graph configuration. |
| `--optimized_path` | str | `example/aflow/humaneval/optimization` | Path to save the optimized AFlow graph. |
| `--validation_rounds` | int | `1` | Number of validation rounds per optimization cycle. |
| `--eval_rounds` | int | `1` | Number of evaluation rounds per optimization cycle. |
| `--max_rounds` | int | `3` | Maximum number of optimization rounds. |

## Example Output
