A systematic comparison of 9 combinations of multi-agent orchestrators × spec-driven toolkits, all building the same Python CLI Snake game from a shared specification.
Article: Evaluating 9 AI Agent Combinations on the Same Snake Game
| Orchestrator \ Toolkit | GSD | Spec Kit | OpenSpec |
|---|---|---|---|
| Superpowers | Procedural (178 LOC) | OOP 5-class (213 LOC) 🏆 | Functional+dataclass (179 LOC) |
| DeerFlow | Module functions (181 LOC) | State machine (152 LOC) | Event-driven (171 LOC) |
| Squad | Clean minimal (132 LOC) | Protocol pattern (180 LOC) | Dict-state pure (157 LOC) |
```bash
# Clone the repo
git clone https://github.com/nimanch/multi-agent-benchmark.git
cd multi-agent-benchmark

# Run any of the 9 snake games (no dependencies beyond Python 3.x):
python3 superpowers-gsd/snake.py
python3 deerflow-speckit/snake.py
python3 squad-openspec/snake.py
# ... etc.
```

Requirements: Python 3.x with curses (standard library). Terminal at least 20×10 characters.

Controls: Arrow keys to move, `r` to restart, `q` to quit.
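The control scheme above implies the same input rules in every implementation: arrow keys steer, and a 180-degree reversal is ignored. A minimal curses-free sketch of that logic (illustrative only, not taken from any of the nine games):

```python
import curses

# Map curses key codes to (dy, dx) direction vectors.
DIRECTIONS = {
    curses.KEY_UP: (-1, 0),
    curses.KEY_DOWN: (1, 0),
    curses.KEY_LEFT: (0, -1),
    curses.KEY_RIGHT: (0, 1),
}

def handle_key(key: int, current: tuple) -> tuple:
    """Return the new direction, ignoring non-arrow keys and reversals."""
    new = DIRECTIONS.get(key)
    if new is None:
        return current  # not an arrow key
    if (new[0] + current[0], new[1] + current[1]) == (0, 0):
        return current  # opposite vector: reversal is ignored per the spec
    return new
```

The `curses.KEY_*` constants are available at import time, so this piece is unit-testable without a live terminal.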
This section explains how to set up each tool from scratch and run the experiments yourself. The experiment was conducted on an NVIDIA Jetson (ARM64, 8GB RAM, Ubuntu, Python 3.10) but should work on any Linux/macOS machine.
- Python 3.10+ (3.12+ required for DeerFlow native runs)
- Node.js 22+ (for Copilot CLI / Squad)
- GitHub CLI (`gh`) — for GitHub Models API access and Copilot
- Git
All 9 experiments start from the same specification: SNAKE_SPEC.md. Each experiment directory gets a copy as SPEC.md. This is the "contract" every agent combination must fulfill.
Superpowers (obra, 94K+ stars) is a subagent-driven development framework with planning, implementation, and two-stage review (spec compliance + code quality).
Install:

```bash
# Superpowers is a set of agent skills — clone to your agent skills directory
mkdir -p ~/.agents/skills
git clone https://github.com/obra/superpowers.git ~/.agents/skills/superpowers
```

How it works: Superpowers provides skill definitions in `.superpowers/skills/` that guide AI agents through structured workflows: writing-plans, executing-plans, subagent-driven-development, and verification-before-completion. Your AI coding agent (e.g., Claude Code, Copilot) reads these skills and follows the multi-stage process.
Project setup for an experiment:

```bash
mkdir superpowers-gsd && cd superpowers-gsd

# Link Superpowers skills
mkdir -p .superpowers/skills
# Copy or symlink the relevant skills:
cp -r ~/.agents/skills/superpowers/skills/* .superpowers/skills/

# Copy the shared spec
cp ../SNAKE_SPEC.md SPEC.md
```

Then run your AI agent in the directory; it picks up the `.superpowers/` context automatically.
DeerFlow 2.0 (ByteDance) follows a Research → Plan → Code → Review pipeline with sub-agent memory.
Install:

```bash
# DeerFlow requires Python 3.12+
# If your system Python is older, use pyenv:
pyenv install 3.12.9
pyenv local 3.12.9

# Clone and install
git clone https://github.com/bytedance/deer-flow.git ~/deerflow
cd ~/deerflow/backend
uv sync  # or: pip install -e .
```

Configuration: create `.deerflow/config.yaml` in each experiment directory:
```yaml
config_version: 3
models:
  - name: gpt-4o
    display_name: GPT-4o (GitHub Models)
    use: langchain_openai:ChatOpenAI
    model: gpt-4o
    api_key: <your-github-token>  # from: gh auth token
    base_url: https://models.inference.ai.azure.com
    max_tokens: 4096
    temperature: 0.7
tool_groups:
  - name: web
  - name: file:read
  - name: file:write
  - name: bash
tools:
  - name: write_file
    group: file:write
    use: deerflow.sandbox.tools:write_file_tool
  - name: read_file
    group: file:read
    use: deerflow.sandbox.tools:read_file_tool
sandbox:
  use: deerflow.sandbox.local:LocalSandboxProvider
```

Run natively:
```python
from deerflow import DeerFlowClient

client = DeerFlowClient()
result = client.chat("""
Read SPEC.md and implement the Snake game as snake.py.
Follow the GSD methodology: define milestones, execute phases, ship.
""")
```

Gotchas:
- `sandbox: mode: local` fails Pydantic validation; use `sandbox: use: deerflow.sandbox.local:LocalSandboxProvider`
- DeerFlow requires Python ≥3.12 (won't work with system Python 3.10)
- GitHub Models API works well as the LLM backend (`gh auth token` for the API key)
Squad (v0.8.25) is a team management layer for GitHub Copilot that creates persistent specialist roles.
Install:

```bash
# Install Squad CLI
npm install -g @anthropic/squad  # or the appropriate package

# Verify
squad --version

# Install GitHub Copilot CLI (required)
gh extension install github/gh-copilot
```

Initialize a project:
```bash
mkdir squad-gsd && cd squad-gsd
squad init
# This creates .squad/ with:
#   config.json, team.md, routing.md, agents/, ceremonies.md, etc.

# Copy the shared spec
cp ../SNAKE_SPEC.md SPEC.md
```

Run:
```bash
# Squad doesn't register as --agent for Copilot.
# Instead, Copilot reads .squad/ context when invoked in the directory:
copilot --allow-all --no-ask-user -s \
  -p "Read SPEC.md and implement snake.py following GSD methodology" \
  --model gpt-5.2
```

Important note: `--agent squad` does not work. Squad is a team coordination layer for GitHub Issues/PR workflows, not a single-prompt agent. In this benchmark, Squad provided project context (`.squad/` files) that Copilot read, but no true multi-agent orchestration occurred. See SQUAD_RERUN.md for details.
Each toolkit shapes how the agent approaches the problem. Each experiment directory pairs exactly one orchestrator with one toolkit.
GSD (v1.28.0) is a milestone-based project lifecycle manager. Focus: ship pragmatic code fast.
Install:

```bash
# GSD installs as a Copilot agent skill
mkdir -p ~/.copilot/get-shit-done
git clone https://github.com/gsd-build/get-shit-done.git ~/.copilot/get-shit-done
```

Project setup:
```bash
# In each experiment directory, link GSD:
mkdir -p .copilot
cp -r ~/.copilot/get-shit-done .copilot/get-shit-done

# Optionally add MCP config:
cat > .copilot/mcp-config.json << 'EOF'
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@anthropic/github-mcp-server"],
      "env": { "GITHUB_TOKEN": "${GITHUB_TOKEN}" }
    }
  }
}
EOF
```

Effect on code: Pragmatic, minimal, "just works" style. Fewest lines. Milestone/phase workflows encourage shipping over architecture.
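To give that claim some shape, the GSD-style outputs tended toward flat module-level functions with no classes. A hypothetical sketch in that spirit (not actual benchmark code):

```python
import random

def move(snake, direction):
    """Prepend a new head in the travel direction; caller pops the tail."""
    head = snake[0]
    return [(head[0] + direction[0], head[1] + direction[1])] + snake

def crashed(snake, height, width):
    """Wall or self collision = game over, per the spec."""
    y, x = snake[0]
    return (y in (0, height - 1) or x in (0, width - 1)
            or snake[0] in snake[1:])

def spawn_food(snake, height, width):
    """Pick a random free interior cell for the next food item."""
    free = [(y, x) for y in range(1, height - 1)
            for x in range(1, width - 1) if (y, x) not in snake]
    return random.choice(free)
```

No abstractions, no indirection: three functions and a game loop is the whole program.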
Spec Kit is a structured feature specification toolkit with scenarios and acceptance criteria.
Install:

```bash
git clone https://github.com/github/spec-kit.git ~/spec-kit
```

Project setup:
```bash
# Create a spec-kit/ directory in the experiment folder with a feature spec:
mkdir -p spec-kit
cat > spec-kit/snake-game.md << 'EOF'
# Feature: Terminal Snake Game

## Summary
A terminal-based Snake game in Python using curses.

## Requirements
- Snake moves continuously, arrow keys change direction
- Cannot reverse direction
- Food spawns randomly as `*`, snake body `█`, head `O`
- Score +10 per food, displayed at top
- Wall and self collision = game over
- Game over screen with final score, 'q' quit, 'r' restart
- 100ms tick, min terminal 20x10
- Single file: snake.py

## Acceptance Criteria
See SPEC.md for full criteria list.
EOF
```

Effect on code: Most structured output — OOP patterns, type hints, clear separation of concerns. Encourages scenario-based thinking.
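A flavor of that output style — dataclasses, type hints, behavior encapsulated per entity. This is a hypothetical sketch, not the actual 213-LOC implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Snake:
    """Snake body as a list of (y, x) cells, head first."""
    body: list = field(default_factory=lambda: [(5, 5), (5, 4), (5, 3)])
    direction: tuple = (0, 1)

    def turn(self, new_direction: tuple) -> None:
        # Spec: the snake cannot reverse onto itself.
        dy, dx = new_direction
        if (dy + self.direction[0], dx + self.direction[1]) != (0, 0):
            self.direction = new_direction

    def step(self, grow: bool = False) -> None:
        """Advance one tick; keep the tail only when food was eaten."""
        head = self.body[0]
        self.body.insert(0, (head[0] + self.direction[0],
                             head[1] + self.direction[1]))
        if not grow:
            self.body.pop()

    def hit_self(self) -> bool:
        return self.body[0] in self.body[1:]
```

In the real benchmark outputs this pattern extended to separate classes for board, food, renderer, and game loop.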
OpenSpec is a lightweight spec-driven framework with plan → apply → archive workflow.
Install:

```bash
git clone https://github.com/Fission-AI/openspec.git ~/openspec
```

Project setup:
```bash
# Create an openspec/ directory with a plan:
mkdir -p openspec
cat > openspec/plan.md << 'EOF'
# Plan: Terminal Snake Game

## Goal
Build a terminal Snake game in Python curses as a single file (snake.py).

## Tasks
1. Set up curses window with border
2. Initialize snake (position, direction, body list)
3. Implement game loop with 100ms tick
4. Handle arrow key input, prevent reverse
5. Move snake, check wall/self collision
6. Spawn food randomly, detect eating
7. Track and display score
8. Implement game over screen with restart
## Constraints
- Python 3, curses only, no external deps
- See SPEC.md for full specification
EOF
```

Effect on code: Clean functional patterns. State-as-data approach. Task-oriented decomposition.
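The "state-as-data" flavor can be sketched as a pure step function over a plain dict, returning a new state each tick. A simplified hypothetical sketch (the self-collision check is against the pre-move body), not the actual benchmark output:

```python
def step(state: dict) -> dict:
    """Advance one tick; returns a new state dict, never mutates the input."""
    dy, dx = state["direction"]
    head = (state["snake"][0][0] + dy, state["snake"][0][1] + dx)
    ate = head == state["food"]
    # Keep the tail only when food was eaten.
    snake = [head] + (state["snake"] if ate else state["snake"][:-1])
    dead = (head in state["snake"]
            or not (0 < head[0] < state["height"] - 1)
            or not (0 < head[1] < state["width"] - 1))
    return {**state,
            "snake": snake,
            "score": state["score"] + (10 if ate else 0),
            "game_over": dead}
```

Because `step` has no side effects, every spec behavior (scoring, growth, collision) can be asserted without a terminal, which is why the table above calls the squad-openspec output the most testable design.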
Here's the general flow for each of the 9 combinations:
```bash
mkdir <orchestrator>-<toolkit>
cd <orchestrator>-<toolkit>

# Copy the shared spec
cp ../SNAKE_SPEC.md SPEC.md
```

Add the orchestrator context:

- Superpowers: Copy `.superpowers/skills/` into the directory
- DeerFlow: Create `.deerflow/config.yaml` (see above)
- Squad: Run `squad init` to generate `.squad/`

Add the toolkit context:

- GSD: Copy `.copilot/get-shit-done/` into the directory
- Spec Kit: Create `spec-kit/snake-game.md` with the feature spec
- OpenSpec: Create `openspec/plan.md` with the task plan
For Superpowers experiments, use any agent that reads `.superpowers/` skills (e.g., Claude Code):

```bash
# The agent reads .superpowers/ and SPEC.md, then follows the
# subagent-driven-development workflow automatically
```

For DeerFlow experiments:
```python
from deerflow import DeerFlowClient

client = DeerFlowClient()
result = client.chat("Read SPEC.md and implement snake.py. Use <toolkit> methodology.")
```

For Squad experiments:
```bash
copilot --allow-all --no-ask-user -s \
  -p "Read SPEC.md and implement snake.py following <toolkit> methodology" \
  --model gpt-5.2
```

Verify the output:

```bash
# Check it compiles
python3 -m py_compile snake.py

# Play it
python3 snake.py
```

These frameworks are AI agent orchestration prompts/workflows, not automated CLI pipelines. Each experiment was conducted by:
- Following the framework's documented methodology
- Using its architectural patterns to guide code structure
- Applying its review/verification processes
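The per-directory `py_compile` check is easy to script across all nine outputs at once. A small helper sketch (directory names follow the repo layout; run it from the repo root):

```python
import py_compile
from pathlib import Path

ORCHESTRATORS = ["superpowers", "deerflow", "squad"]
TOOLKITS = ["gsd", "speckit", "openspec"]

def check_all(root: str = ".") -> dict:
    """Byte-compile every <orchestrator>-<toolkit>/snake.py; map dir -> ok?"""
    results = {}
    for orch in ORCHESTRATORS:
        for kit in TOOLKITS:
            target = Path(root) / f"{orch}-{kit}" / "snake.py"
            try:
                py_compile.compile(str(target), doraise=True)
                results[target.parent.name] = True
            except (py_compile.PyCompileError, OSError):
                # Syntax error or missing file both count as a failure.
                results[target.parent.name] = False
    return results
```

Usage: `print(check_all())` from the repo root reports pass/fail per implementation; a compile check catches syntax errors but not gameplay bugs, which still require playing the game.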
Some frameworks couldn't run natively on the test hardware:
- DeerFlow: Initially simulated (Python 3.12 unavailable), later re-run natively via pyenv. Both versions preserved (`snake.py` = native, `snake.py.simulated` = simulated). See DEERFLOW_NATIVE.md.
- Squad: `--agent squad` doesn't register with Copilot. Effectively single-agent Copilot with Squad's `.squad/` context. See SQUAD_RERUN.md.
- Superpowers: Ran natively; the only orchestrator that fully worked as documented.
See EVALUATION.md for detailed per-implementation scoring (5 dimensions, 25 points max).
| Rank | Implementation | Score | Key Strength |
|---|---|---|---|
| 🥇 | deerflow-gsd | 22/25 | Food-spawn safety net, graceful terminal check, no bugs |
| 🥈 | squad-gsd | 21/25 | Cleanest, most readable code, zero over-engineering |
| 🥈 | squad-openspec | 21/25 | Most testable design, interactive terminal resize loop |
| 4 | superpowers-gsd | 20/25 | Solid all-around, Unicode borders |
| 5 | superpowers-speckit | 19/25 | Best OOP architecture, but has grow-timing bug |
| 6 | deerflow-speckit | 18/25 | Good state machine pattern |
| 7 | superpowers-openspec | 17/25 | Decent functional approach |
| 7 | deerflow-openspec | 17/25 | EventBus with zero subscribers |
| 7 | squad-speckit | 17/25 | Over-typed, unused Protocol |
- GSD swept the top 3 — pragmatic "just ship it" beats architectural ambition for this task size
- Spec Kit produces the most structured code but introduced bugs (grow-timing delay in 2 of 3 implementations)
- The toolkit shaped code more than the orchestrator — same toolkit produced similar patterns regardless of orchestrator
- Over-engineering correlates with lower scores — EventBus with no subscribers, Protocol with no dispatch
- Native DeerFlow produced simpler code than simulated — the `DeerFlowClient.chat()` API doesn't fully exercise the documented multi-agent pipeline
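The "EventBus with zero subscribers" finding is worth spelling out: deerflow-openspec built a publish/subscribe bus, but nothing ever subscribed, so every publish was a no-op. A hypothetical sketch of the shape (not the benchmark code):

```python
class EventBus:
    """Minimal publish/subscribe bus."""
    def __init__(self):
        self._subscribers = {}

    def subscribe(self, event: str, handler) -> None:
        self._subscribers.setdefault(event, []).append(handler)

    def publish(self, event: str, payload=None) -> int:
        """Call every handler registered for `event`; return how many ran."""
        handlers = self._subscribers.get(event, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)

bus = EventBus()
# The over-engineered version: events published, zero subscribers.
# Every publish silently notifies nobody -- dead machinery.
assert bus.publish("food_eaten", {"score": 10}) == 0
```

For a ~150-line single-file game, direct function calls do the same work with none of the indirection, which is exactly why the pragmatic GSD outputs scored higher.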
```
multi-agent-benchmark/
├── README.md                  # This file
├── SNAKE_SPEC.md              # Common spec all 9 games implement
├── RESULTS.md                 # Comparison & analysis
├── EVALUATION.md              # Detailed scoring (5 dimensions × 9 implementations)
├── ARTICLE.md                 # Medium article
├── DEERFLOW_NATIVE.md         # DeerFlow native vs simulated analysis
├── SQUAD_RERUN.md             # Squad re-run findings
├── article.html               # Rendered article
├── screenshots/               # Game screenshots
├── gifs/                      # Gameplay GIFs
├── superpowers-gsd/           # Superpowers + GSD
│   ├── .superpowers/          # Superpowers skills
│   ├── .copilot/              # GSD workflow + MCP config
│   ├── SPEC.md                # Snake spec (copy)
│   └── snake.py               # Output: 178 LOC procedural
├── superpowers-speckit/       # Superpowers + Spec Kit
│   ├── .superpowers/
│   ├── spec-kit/              # Feature spec
│   └── snake.py               # Output: 213 LOC OOP (5 classes)
├── superpowers-openspec/      # Superpowers + OpenSpec
│   ├── .superpowers/
│   ├── openspec/              # Task plan
│   └── snake.py               # Output: 179 LOC functional+dataclass
├── deerflow-gsd/              # DeerFlow + GSD
│   ├── .deerflow/             # DeerFlow config
│   ├── .copilot/              # GSD workflow
│   ├── snake.py               # Output: native (91 LOC)
│   └── snake.py.simulated     # Simulated version (181 LOC)
├── deerflow-speckit/          # DeerFlow + Spec Kit
│   ├── .deerflow/
│   ├── spec-kit/
│   ├── snake.py               # Output: native (110 LOC)
│   └── snake.py.simulated     # Simulated version (152 LOC)
├── deerflow-openspec/         # DeerFlow + OpenSpec
│   ├── .deerflow/
│   ├── openspec/
│   ├── snake.py               # Output: native (131 LOC)
│   └── snake.py.simulated     # Simulated version (171 LOC)
├── squad-gsd/                 # Squad + GSD
│   ├── .squad/                # Squad team definitions
│   ├── .copilot/              # GSD workflow
│   └── snake.py               # Output: 132 LOC clean procedural
├── squad-speckit/             # Squad + Spec Kit
│   ├── .squad/
│   ├── spec-kit/
│   └── snake.py               # Output: 180 LOC Protocol pattern
└── squad-openspec/            # Squad + OpenSpec
    ├── .squad/
    ├── openspec/
    └── snake.py               # Output: 157 LOC dict-state pure functions
```
Tested on NVIDIA Jetson (ARM64, 8GB RAM, Ubuntu, Python 3.10/3.12).
Nishant Manchanda — March 2026