A minimal end-to-end example that demonstrates running an agent task and grading the result.
The agent is asked to explore a filesystem containing several subdirectories, each with an animal image file (with random names), and find the path to the gorilla image.
Directory structure:
animals/
xk92m/qz7fw.png (gorilla - the target)
ab47z/rt3ky.png (cat)
mn83p/jw9vb.png (elephant)
jh21q/pl4nc.png (penguin)
cd archipelago/examples/simple_task
GOOGLE_API_KEY=your_key_here ./run.shThe script will:
- Start the environment container (if not already running)
- Populate the environment with the world snapshot
- Configure MCP servers
- Run the agent
- Save the final snapshot
- Run grading and display results
| File | Description |
|---|---|
initial_messages.json |
Task prompt for the agent |
agent_config.json |
Agent configuration (loop_agent with defaults) |
orchestrator_config.json |
Model and extra args for the agent |
mcp_config.json |
MCP server configuration (filesystem only) |
original_snapshot.zip |
World snapshot with animal image directories |
verifiers.json |
Grading criteria |
eval_configs.json |
Eval definitions |
grading_settings.json |
LLM judge settings |
scoring_config.json |
Scoring method configuration |
This walkthrough shows exactly how to run 1 task from start to finish, explaining each step.
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ 1. Environment │───▶│ 2. Agent │───▶│ 3. Grading │
│ (Docker) │ │ (Python/uv) │ │ (Python/uv) │
└──────────────────┘ └──────────────────┘ └──────────────────┘
│ │ │
Load snapshot Run LLM loop Compare snapshots
Configure MCP Call tools Evaluate criteria
Expose /mcp/ Save trajectory Calculate score
Create a task file (task.json) with your task definition:
{
"trajectory_id": "task_001",
"model": "openai/gpt-5",
"initial_messages": [
{
"role": "user",
"content": "List all files in the filesystem and create a summary.txt file."
}
],
"mcp_client_config": {
"filesystem-server": {
"transport": "stdio",
"command": "python",
"args": ["main.py"],
"cwd": "./mcp_servers/filesystem_server",
"env": {
"APP_FS_ROOT": "/filesystem"
}
}
}
}Key fields:
trajectory_id: Unique identifier for this runmodel: LLM to use (format:provider/model-name)initial_messages: The task promptmcp_client_config: MCP server configuration- Must include
"transport": "stdio" cwdmust be a valid path inside the container
- Must include
cd archipelago/environment
# Copy and configure .env
cp .env.example .env
# Start the container
docker-compose up --buildWait for Application startup complete message.
The environment expects .tar.gz archives. If you have a .zip file, convert it:
import tarfile
import zipfile
import io
def zip_to_tar_gz(zip_path, tar_gz_path):
with zipfile.ZipFile(zip_path, "r") as zf:
with tarfile.open(tar_gz_path, "w:gz") as tar:
for name in zf.namelist():
data = zf.read(name)
info = tarfile.TarInfo(name=name)
info.size = len(data)
tar.addfile(info, io.BytesIO(data))
zip_to_tar_gz("world.zip", "world.tar.gz")Then populate the environment:
curl -X POST "http://localhost:8080/data/populate?subsystem=filesystem" \
-F "archive=@world.tar.gz"curl -X POST http://localhost:8080/apps \
-H "Content-Type: application/json" \
-d @- << 'EOF'
{
"mcpServers": {
"filesystem-server": {
"transport": "stdio",
"command": "python",
"args": ["main.py"],
"cwd": "./mcp_servers/filesystem_server",
"env": {"APP_FS_ROOT": "/filesystem"}
}
}
}
EOFThe agent runner needs separate config files:
agent_config.json:
{
"agent_config_id": "loop_agent",
"agent_name": "Loop Agent",
"agent_config_values": {
"timeout": 3600,
"max_steps": 50,
"tool_call_timeout": 60,
"llm_response_timeout": 300
}
}initial_messages.json:
[
{
"role": "user",
"content": "List all files in the filesystem and create a summary.txt file."
}
]cd archipelago/agents
# Copy and configure .env
cp .env.example .env
# Edit .env with your LLM API key
# Install dependencies
uv sync
# Run the agent
uv run python -m runner.main \
--trajectory-id "task_001" \
--initial-messages ./initial_messages.json \
--mcp-gateway-url "http://localhost:8080/mcp/" \
--agent-config ./agent_config.json \
--orchestrator-model "openai/gpt-5" \
--output ./trajectory.json# Download as tar.gz
curl -X POST http://localhost:8080/data/snapshot -o final_snapshot.tar.gz
# Convert to zip for grading
python -c "
import tarfile, zipfile, io
with tarfile.open('final_snapshot.tar.gz', 'r:gz') as tar:
with zipfile.ZipFile('final_snapshot.zip', 'w') as zf:
for m in tar.getmembers():
if m.isfile():
f = tar.extractfile(m)
if f: zf.writestr(m.name, f.read())
"verifiers.json:
[
{
"verifier_id": "ver_001",
"verifier_version": 1,
"world_id": null,
"task_id": "task_001",
"eval_config_id": "ec_output_llm",
"verifier_values": {
"criteria": "The agent listed all files and created a summary.txt file",
"is_primary_objective": true
},
"verifier_index": 0,
"verifier_dependencies": null
}
]eval_configs.json:
[
{
"eval_config_id": "ec_output_llm",
"eval_config_name": "Output LLM Verifier",
"eval_defn_id": "output_llm",
"eval_config_values": {}
}
]grading_settings.json:
{
"llm_judge_model": "openai/gpt-5",
"llm_judge_extra_args": null
}scoring_config.json:
{
"scoring_config_id": "sc_default",
"scoring_config_name": "Default Scoring",
"scoring_defn_id": "task_score_unweighted_and_universal_penalty",
"scoring_config_values": {
"task_primary_objective_scaling_factor": 2.0,
"task_non_primary_objective_scaling_factor": 1.0,
"task_negative_scaling_factor": 2.0,
"universal_penalty_cap": 0.2,
"final_score_ceiling": 1.0,
"final_score_floor": 0.0
}
}cd archipelago/grading
# Copy and configure .env
cp .env.example .env
# Edit .env with your LLM API key
# Install dependencies
uv sync
# Run grading
uv run python -m runner.main \
--grading-run-id "gr_001" \
--trajectory-id "task_001" \
--initial-snapshot ./original_snapshot.zip \
--final-snapshot ./final_snapshot.zip \
--trajectory ./trajectory.json \
--grading-settings ./grading_settings.json \
--verifiers ./verifiers.json \
--eval-configs ./eval_configs.json \
--scoring-config ./scoring_config.json \
--output ./grades.jsoncat grades.json | jq '.'After a successful run:
trajectory.json- Agent's conversation history and tool callsfinal_snapshot.zip- Final state of the filesystemgrades.json- Grading results with score and rationale
The agent should find and report the path /animals/xk92m/qz7fw.png as the gorilla image.
This example uses the simplest scoring method (template) which requires no configuration:
{
"scoring_defn_id": "template",
"scoring_config_values": {}
}For production use, you may want task_score_unweighted_and_universal_penalty which supports weighted scoring, negative criteria penalties, and score clamping. See archipelago/grading/runner/scoring_methods/ for available methods.