<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Evaluating 9 AI Agent Combinations on the Same Snake Game</title>
</head>
<body style="margin:0;padding:0;background:#fff;color:#1a1a1a;font-family:Charter,'Bitstream Charter',Georgia,'Times New Roman',serif;font-size:18px;line-height:1.7;">
<div style="max-width:720px;margin:0 auto;padding:40px 20px;">
<h1 style="font-size:36px;line-height:1.2;font-weight:700;margin-bottom:8px;color:#0d0d0d;">Evaluating 9 AI Agent Combinations on the Same Snake Game</h1>
<p style="color:#757575;font-size:16px;margin-top:0;margin-bottom:40px;"><em>Nishant Manchanda · March 2026</em></p>
<hr style="border:none;border-top:1px solid #e6e6e6;margin:32px 0;">
<p>There's a new AI coding agent every week. Most benchmarks compare them on vibes. I wanted something more controlled: give multiple agent combinations the same spec, same language, same machine, and compare the output code systematically.</p>
<p>I picked three multi-agent orchestrators (<a href="https://github.com/obra/superpowers" style="color:#1a8917;text-decoration:none;">Superpowers</a>, <a href="https://github.com/bytedance/deer-flow" style="color:#1a8917;text-decoration:none;">DeerFlow 2.0</a>, <a href="https://github.com/bradygaster/squad" style="color:#1a8917;text-decoration:none;">Squad</a>) and three spec-driven toolkits (<a href="https://github.com/gsd-build/get-shit-done" style="color:#1a8917;text-decoration:none;">GSD</a>, <a href="https://github.com/github/spec-kit" style="color:#1a8917;text-decoration:none;">Spec Kit</a>, <a href="https://github.com/Fission-AI/openspec" style="color:#1a8917;text-decoration:none;">OpenSpec</a>), producing a 3×3 matrix of nine implementations. The task: a terminal Snake game in Python curses, from a <a href="https://github.com/nimanch/multi-agent-benchmark/blob/main/SNAKE_SPEC.md" style="color:#1a8917;text-decoration:none;">shared spec</a>.</p>
<p>Then I had an LLM judge score each implementation across five dimensions.</p>
<p>The results surprised me. The “best architecture” had a bug. The simplest approach swept the top three spots. And the spec toolkit shaped the code structure more than the orchestrator did.</p>
<p><strong>Repo</strong>: <a href="https://github.com/nimanch/multi-agent-benchmark" style="color:#1a8917;text-decoration:none;">github.com/nimanch/multi-agent-benchmark</a></p>
<!-- ================================================================ -->
<h2 style="font-size:28px;font-weight:700;margin-top:48px;margin-bottom:16px;color:#0d0d0d;">The Tools</h2>
<p>Before getting into results, here's what each tool actually does. These aren't interchangeable — they solve fundamentally different problems.</p>
<h3 style="font-size:22px;font-weight:700;margin-top:36px;margin-bottom:12px;color:#0d0d0d;">Orchestrators</h3>
<p><strong>Superpowers</strong> (<a href="https://github.com/obra/superpowers" style="color:#1a8917;text-decoration:none;">github.com/obra/superpowers</a>, 94K+ stars) is a subagent-driven development framework. When you give it a task, it doesn't just generate code; it dispatches specialized subagents in parallel. A planning agent breaks the problem down. Implementation agents write code concurrently. Then a two-stage review process kicks in: first a spec-compliance check (does the code match the requirements?), then a code-quality review (is the code well-structured?). Superpowers uses a skill-based architecture where agents have defined capabilities and constraints. It ran natively in this experiment as a fully multi-agent system.</p>
<p><strong>DeerFlow 2.0</strong> (<a href="https://github.com/bytedance/deer-flow" style="color:#1a8917;text-decoration:none;">github.com/bytedance/deer-flow</a>, ByteDance) follows a Research → Plan → Code → Review pipeline. What distinguishes it is the research phase: before any code is written, a research agent examines best practices, common patterns, and potential pitfalls for the task at hand. This front-loaded research gets passed to the coding agent as context. DeerFlow also maintains sub-agent memory across tasks within a session, so later tasks can reference decisions made in earlier ones. DeerFlow ran natively using the embedded Python client (<code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">DeerFlowClient</code>) with GitHub Models API (gpt-4o) as the LLM backend, after installing Python 3.12 via pyenv on the ARM64 test machine.</p>
<p><strong>Squad</strong> (<a href="https://github.com/bradygaster/squad" style="color:#1a8917;text-decoration:none;">github.com/bradygaster/squad</a>, v0.8.25) is a fundamentally different kind of tool. It's a team management layer for GitHub Copilot that creates persistent specialist roles — frontend developer, backend developer, tester, team lead — stored in <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">.squad/</code> directories within your project. Each role has custom instructions that Copilot reads when working on tasks assigned to that specialist. Squad is designed for ongoing GitHub Issues and PR workflows: you assign an issue to the “backend” specialist, and Copilot picks up that role's context and constraints.</p>
<p>This matters for interpreting the results. We installed Squad v0.8.25, ran <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">squad init</code> to generate the team structure, but discovered that <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">--agent squad</code> doesn't register as a Copilot agent in the way you'd expect. The Squad experiments in this benchmark were effectively single-agent Copilot sessions where Copilot had access to Squad's <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">.squad/</code> context files (role definitions, project conventions). No multi-agent orchestration occurred. Squad is built for persistent team coordination across many PRs over time, not single-prompt code generation from a spec. Comparing it head-to-head with Superpowers or DeerFlow as an “orchestrator” is somewhat apples-to-oranges.</p>
<h3 style="font-size:22px;font-weight:700;margin-top:36px;margin-bottom:12px;color:#0d0d0d;">Spec Toolkits</h3>
<p><strong>GSD (Get Shit Done)</strong> (<a href="https://github.com/gsd-build/get-shit-done" style="color:#1a8917;text-decoration:none;">github.com/gsd-build/get-shit-done</a>, 35K+ stars) takes a milestone/phase approach to project management for AI agents. Its philosophy is shipping fast with accountability. You run <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">new-project</code> to scaffold milestones from a spec, then <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">execute-phase</code> to have an agent work through each phase, then <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">verify-phase</code> to check the output. Each phase gets a fresh context window, which prevents context pollution across tasks. GSD enforces atomic git commits per phase, creating a clear audit trail. The fresh-context-per-phase pattern may explain why GSD variants consistently handled edge cases like terminal size checks — each phase is independently verified against the spec.</p>
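<p>To make the terminal size check concrete, here is a minimal sketch of what such a guard might look like in a curses program. The minimum dimensions and the names <code>terminal_fits</code> and <code>main</code> are assumptions for illustration; they are not taken from the spec or from any of the nine implementations.</p>

```python
import curses

# Hypothetical minimum playfield size; the actual spec may use other values.
MIN_ROWS, MIN_COLS = 12, 30


def terminal_fits(rows: int, cols: int,
                  min_rows: int = MIN_ROWS, min_cols: int = MIN_COLS) -> bool:
    """Return True when the terminal can hold the bordered play area."""
    return rows >= min_rows and cols >= min_cols


def main(stdscr) -> None:
    """Entry point intended to be passed to curses.wrapper()."""
    rows, cols = stdscr.getmaxyx()  # curses reports (rows, cols)
    if not terminal_fits(rows, cols):
        try:
            stdscr.addstr(0, 0, "Terminal too small")
        except curses.error:
            pass  # the window is too small to show even the message
        stdscr.getch()  # wait for a key press before exiting
        return
    # ... the game loop would start here ...
```

<p>Running <code>curses.wrapper(main)</code> from a real terminal would exercise the guard. Keeping the comparison in a pure function means it can be checked without a terminal at all.</p>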
<p><strong>Spec Kit</strong> (<a href="https://github.com/github/spec-kit" style="color:#1a8917;text-decoration:none;">github.com/github/spec-kit</a>, 72.7K stars) is GitHub's approach to spec-driven development. It generates structured requirement documents: feature scenarios (given/when/then), acceptance criteria, and architectural specifications. These documents are designed to be consumed by AI agents as context. Spec Kit emphasizes separation of concerns — the spec describes <em>what</em>, not <em>how</em> — and produces documents that read like detailed product requirements. In practice, this led agents to build more elaborate class hierarchies and typed interfaces, mirroring the structured nature of the specs themselves.</p>
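<p>As a rough illustration of the given/when/then style, the sketch below turns one Snake requirement (eating food adds 10 points and grows the snake) into a plain Python test. <code>GameState</code>, <code>eat_food</code>, the starting length of 3, and growth by exactly one segment are hypothetical choices made for this example; Spec Kit itself produces requirement documents, not this code.</p>

```python
from dataclasses import dataclass


@dataclass
class GameState:
    """Hypothetical minimal state; real implementations track much more."""
    score: int = 0
    length: int = 3  # assumed starting length


def eat_food(state: GameState) -> GameState:
    """Apply the scoring rule: +10 points and one extra segment (assumed)."""
    return GameState(score=state.score + 10, length=state.length + 1)


def test_eating_food_scores_ten_and_grows() -> None:
    # Given a fresh game
    state = GameState()
    # When the snake eats one piece of food
    after = eat_food(state)
    # Then the score rises by 10 and the snake is one segment longer
    assert after.score == 10
    assert after.length == 4
```

<p>The scenario describes <em>what</em> must hold after the event, which leaves the agent free to decide how the state is stored.</p>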
<p><strong>OpenSpec</strong> (<a href="https://github.com/Fission-AI/openspec" style="color:#1a8917;text-decoration:none;">github.com/Fission-AI/openspec</a>) is the most lightweight of the three. Its workflow is <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">plan → apply → archive</code>: generate a plan.md with task breakdowns, apply the plan to produce code, then archive the completed plan. Minimal ceremony, minimal scaffolding. The task breakdowns in plan.md files tend to be flat lists rather than hierarchical milestones, which gives agents more freedom in how they organize the code. This freedom showed in the results — OpenSpec variants had the most diverse architectures across the three orchestrators.</p>
<!-- ================================================================ -->
<h2 style="font-size:28px;font-weight:700;margin-top:48px;margin-bottom:16px;color:#0d0d0d;">Methodology</h2>
<p><strong>What was controlled:</strong></p>
<ul style="padding-left:24px;">
<li>Identical spec (<a href="https://github.com/nimanch/multi-agent-benchmark/blob/main/SNAKE_SPEC.md" style="color:#1a8917;text-decoration:none;">SNAKE_SPEC.md</a>) — 16 acceptance criteria covering movement, collision, scoring, display, and restart</li>
<li>Same language: Python 3.x with curses (stdlib only)</li>
<li>Same machine: NVIDIA Jetson ARM64, 8GB RAM, Ubuntu</li>
<li>Single-file implementations, zero manual interventions on any of the nine</li>
</ul>
<p><strong>What was NOT controlled — and this matters:</strong></p>
<ul style="padding-left:24px;">
<li><strong>Superpowers</strong> ran natively as a fully multi-agent system. Subagents were dispatched, parallel work occurred, two-stage review happened automatically.</li>
<li><strong>DeerFlow 2.0</strong> ran natively using the embedded Python client (<code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">DeerFlowClient.chat()</code>) with GitHub Models API (gpt-4o) as the LLM backend. Python 3.12 was installed via pyenv on the ARM64 Jetson. Each experiment was a single <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">client.chat()</code> call with a methodology-specific prompt referencing the spec. DeerFlow orchestrates sub-agents, memory, and tools internally.</li>
<li><strong>Squad</strong> was installed as the actual CLI (v0.8.25). We ran <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">squad init</code> to generate the <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">.squad/</code> team structure with specialist roles and project conventions. Copilot then generated code with Squad's context files available in the working directory. Squad is designed for persistent team roles across GitHub Issues and PR workflows — its value proposition is coordination over many tasks over time, not single-prompt generation from a spec.</li>
</ul>
<p>All three tools were installed and used as designed. The key difference is what “multi-agent” means for each: Superpowers dispatches parallel subagents, DeerFlow pipelines through specialized stages, and Squad provides persistent team context that shapes how Copilot approaches tasks.</p>
<p><strong>Evaluation:</strong> An LLM judge (not me) scored each implementation 1–5 on five dimensions: Spec Compliance, Correctness, Code Quality, Completeness, and Robustness. I reviewed the scores against the code and agreed with most of them, though the evaluation has its own limitations (discussed below).</p>
<!-- ================================================================ -->
<h2 style="font-size:28px;font-weight:700;margin-top:48px;margin-bottom:16px;color:#0d0d0d;">The Scores</h2>
<table style="width:100%;border-collapse:collapse;margin:24px 0;font-size:15px;line-height:1.5;">
<thead>
<tr style="background:#1e1e2e;color:#cdd6f4;">
<th style="padding:10px 12px;text-align:left;font-weight:600;">Implementation</th>
<th style="padding:10px 8px;text-align:center;font-weight:600;">Spec</th>
<th style="padding:10px 8px;text-align:center;font-weight:600;">Correct</th>
<th style="padding:10px 8px;text-align:center;font-weight:600;">Quality</th>
<th style="padding:10px 8px;text-align:center;font-weight:600;">Complete</th>
<th style="padding:10px 8px;text-align:center;font-weight:600;">Robust</th>
<th style="padding:10px 8px;text-align:center;font-weight:700;">Total</th>
</tr>
</thead>
<tbody>
<tr style="background:#f8f9fa;">
<td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">squad-gsd</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">5</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;font-weight:700;">21</td>
</tr>
<tr style="background:#ffffff;">
<td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">squad-openspec</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">5</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;font-weight:700;">21</td>
</tr>
<tr style="background:#f8f9fa;">
<td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">superpowers-gsd</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">5</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">3</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;font-weight:700;">20</td>
</tr>
<tr style="background:#ffffff;">
<td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">superpowers-speckit</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">3</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">5</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">3</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;font-weight:700;">19</td>
</tr>
<tr style="background:#f8f9fa;">
<td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">deerflow-gsd</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">5</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">3</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">3</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">3</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;font-weight:700;">18</td>
</tr>
<tr style="background:#ffffff;">
<td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">deerflow-speckit</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">5</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">3</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">3</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">3</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;font-weight:700;">18</td>
</tr>
<tr style="background:#f8f9fa;">
<td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">superpowers-openspec</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">3</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">2</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;font-weight:700;">17</td>
</tr>
<tr style="background:#ffffff;">
<td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">deerflow-openspec</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">3</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">2</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;font-weight:700;">17</td>
</tr>
<tr style="background:#f8f9fa;">
<td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">squad-speckit</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">3</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">4</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">2</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;font-weight:700;">17</td>
</tr>
</tbody>
</table>
<p>GSD took two of the top three positions, and every orchestrator paired with GSD matched or outscored the same orchestrator paired with either alternative toolkit. squad-gsd and squad-openspec tied for first at 21/25; superpowers-gsd came in third at 20.</p>
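<p>The totals and the per-orchestrator comparison can be rechecked with a few lines of Python. The scores below are copied from the table above; the helper name <code>gsd_matches_or_beats</code> is just for this sketch.</p>

```python
# Scores copied from the table: (spec, correct, quality, complete, robust)
SCORES = {
    "squad-gsd": (5, 4, 4, 4, 4),
    "squad-openspec": (5, 4, 4, 4, 4),
    "superpowers-gsd": (5, 4, 4, 4, 3),
    "superpowers-speckit": (4, 3, 5, 4, 3),
    "deerflow-gsd": (5, 4, 3, 3, 3),
    "deerflow-speckit": (5, 4, 3, 3, 3),
    "superpowers-openspec": (4, 4, 4, 3, 2),
    "deerflow-openspec": (4, 4, 4, 3, 2),
    "squad-speckit": (4, 3, 4, 4, 2),
}

# Sum the five dimensions for each implementation.
totals = {name: sum(dims) for name, dims in SCORES.items()}


def gsd_matches_or_beats(orchestrator: str) -> bool:
    """True if the GSD pairing scores at least as high as both alternatives."""
    gsd = totals[f"{orchestrator}-gsd"]
    return all(gsd >= totals[f"{orchestrator}-{kit}"]
               for kit in ("speckit", "openspec"))


# sorted() is stable, so ties keep the table's original order.
ranking = sorted(totals.items(), key=lambda item: -item[1])

for name, total in ranking:
    print(f"{name:22s} {total}/25")
```

<p>Calling <code>gsd_matches_or_beats</code> for <code>"superpowers"</code>, <code>"deerflow"</code>, and <code>"squad"</code> returns <code>True</code> in each case, with the Squad and DeerFlow rows being ties rather than wins.</p>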
<p>squad-openspec is the notable outlier: the only non-GSD implementation in the top three, scoring 21/25 with an interactive terminal-size resize loop and smart self-collision handling.</p>
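<p>One plausible reading of “smart self-collision handling” is that the tail cell is treated as free when the snake is not growing, since the tail moves away on the same tick. The sketch below shows that idea only; it is an assumption about the technique, not code from squad-openspec, and the function name <code>step</code> and the head-first <code>deque</code> layout are chosen for illustration. Wall collision is left out to keep it short.</p>

```python
from collections import deque


def step(snake: deque, direction: tuple, food: tuple) -> tuple:
    """Advance the snake one cell.

    snake is a deque of (row, col) cells with the head first.
    Returns (alive, ate). The snake is mutated only when it survives.
    """
    head_row, head_col = snake[0]
    d_row, d_col = direction
    new_head = (head_row + d_row, head_col + d_col)
    ate = new_head == food

    # When not eating, the tail vacates its cell this tick, so moving
    # into it is legal. When eating, the tail stays and still blocks.
    blocking = list(snake) if ate else list(snake)[:-1]
    if new_head in blocking:
        return False, ate

    snake.appendleft(new_head)
    if not ate:
        snake.pop()  # drop the tail so the length stays the same
    return True, ate
```

<p>A naive check against every segment would kill the snake when it chases its own tail in a tight loop, which is the case this variant allows.</p>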
<!-- ================================================================ -->
<h2 style="font-size:28px;font-weight:700;margin-top:48px;margin-bottom:16px;color:#0d0d0d;">Spec Compliance Matrix</h2>
<p>Not all “pass” results are equal. The LLM judge checked 16 specific requirements (columns: SP = Superpowers, DF = DeerFlow, SQ = Squad; SK = Spec Kit, OS = OpenSpec):</p>
<div style="overflow-x:auto;margin:24px 0;">
<table style="width:100%;border-collapse:collapse;font-size:13px;line-height:1.4;white-space:nowrap;">
<thead>
<tr style="background:#1e1e2e;color:#cdd6f4;">
<th style="padding:8px 10px;text-align:left;font-weight:600;">Requirement</th>
<th style="padding:8px 6px;text-align:center;font-weight:600;">SP-GSD</th>
<th style="padding:8px 6px;text-align:center;font-weight:600;">SP-SK</th>
<th style="padding:8px 6px;text-align:center;font-weight:600;">SP-OS</th>
<th style="padding:8px 6px;text-align:center;font-weight:600;">DF-GSD</th>
<th style="padding:8px 6px;text-align:center;font-weight:600;">DF-SK</th>
<th style="padding:8px 6px;text-align:center;font-weight:600;">DF-OS</th>
<th style="padding:8px 6px;text-align:center;font-weight:600;">SQ-GSD</th>
<th style="padding:8px 6px;text-align:center;font-weight:600;">SQ-SK</th>
<th style="padding:8px 6px;text-align:center;font-weight:600;">SQ-OS</th>
</tr>
</thead>
<tbody>
<tr style="background:#f8f9fa;"><td style="padding:6px 10px;border-bottom:1px solid #e9ecef;">Continuous movement</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td></tr>
<tr style="background:#ffffff;"><td style="padding:6px 10px;border-bottom:1px solid #e9ecef;">Arrow key controls</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td></tr>
<tr style="background:#f8f9fa;"><td style="padding:6px 10px;border-bottom:1px solid #e9ecef;">No reverse direction</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td></tr>
<tr style="background:#ffffff;"><td style="padding:6px 10px;border-bottom:1px solid #e9ecef;">Random food spawn</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td></tr>
<tr style="background:#f8f9fa;"><td style="padding:6px 10px;border-bottom:1px solid #e9ecef;">Food as <code>*</code></td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td></tr>
<tr style="background:#ffffff;"><td style="padding:6px 10px;border-bottom:1px solid #e9ecef;">Head=O, Body=█</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td></tr>
<tr style="background:#f8f9fa;"><td style="padding:6px 10px;border-bottom:1px solid #e9ecef;">Score +10, displayed</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td></tr>
<tr style="background:#ffffff;font-weight:600;"><td style="padding:6px 10px;border-bottom:1px solid #e9ecef;">Snake grows on eat</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;color:#e67700;">⚠</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;color:#e67700;">⚠</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td></tr>
<tr style="background:#f8f9fa;"><td style="padding:6px 10px;border-bottom:1px solid #e9ecef;">Wall collision = death</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td></tr>
<tr style="background:#ffffff;"><td style="padding:6px 10px;border-bottom:1px solid #e9ecef;">Self collision = death</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td></tr>
<tr style="background:#f8f9fa;"><td style="padding:6px 10px;border-bottom:1px solid #e9ecef;">Border around play area</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td></tr>
<tr style="background:#ffffff;"><td style="padding:6px 10px;border-bottom:1px solid #e9ecef;">Game over screen</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td></tr>
<tr style="background:#f8f9fa;"><td style="padding:6px 10px;border-bottom:1px solid #e9ecef;">Final score shown</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td></tr>
<tr style="background:#ffffff;"><td style="padding:6px 10px;border-bottom:1px solid #e9ecef;">q=quit, r=restart</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td></tr>
<tr style="background:#f8f9fa;"><td style="padding:6px 10px;border-bottom:1px solid #e9ecef;">100ms tick</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td></tr>
<tr style="background:#ffffff;font-weight:600;"><td style="padding:6px 10px;border-bottom:1px solid #e9ecef;">Min terminal 20×10</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;color:#e67700;">⚠</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;color:#d63384;">✗</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;color:#d63384;">✗</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;color:#d63384;">✗</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;color:#d63384;">✗</td><td style="padding:6px;text-align:center;border-bottom:1px solid #e9ecef;">✓</td></tr>
</tbody>
</table>
</div>
<p>Two things jump out:</p>
<ol style="padding-left:24px;">
<li><strong>Only 4 of 9 implementations checked minimum terminal size:</strong> the three GSD variants plus squad-openspec. This is an explicit spec requirement that the other 5 did not fully meet (four are marked ✗ and one ⚠ in the table above).</li>
<li><strong>Two implementations (superpowers-speckit, squad-speckit) have a grow-timing bug</strong> where the snake doesn't visually grow until one frame after eating. More on this below.</li>
</ol>
<!-- ================================================================ -->
<h2 style="font-size:28px;font-weight:700;margin-top:48px;margin-bottom:16px;color:#0d0d0d;">The Grow-Timing Bug</h2>
<p>The spec says “the snake grows by one segment each time it eats food.” Two implementations (superpowers-speckit and squad-speckit) get this subtly wrong due to an architectural choice.</p>
<p>Here's the pattern that causes the bug — from superpowers-speckit:</p>
<pre style="background:#1e1e2e;color:#cdd6f4;padding:20px;border-radius:8px;overflow-x:auto;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:14px;line-height:1.6;margin:24px 0;"><span style="color:#6c7086;"># Game loop (simplified):</span>
snake.<span style="color:#89b4fa;">move</span>() <span style="color:#6c7086;"># advance first</span>
<span style="color:#cba6f7;">if</span> snake.head == food.position:
    snake.<span style="color:#89b4fa;">grow</span>() <span style="color:#6c7086;"># sets grow_pending = True</span>
    food.<span style="color:#89b4fa;">respawn</span>(...)

<span style="color:#6c7086;"># Inside Snake.move():</span>
<span style="color:#cba6f7;">def</span> <span style="color:#89b4fa;">move</span>(<span style="color:#fab387;">self</span>):
    new_head = (<span style="color:#fab387;">self</span>.head[<span style="color:#fab387;">0</span>] + dy, <span style="color:#fab387;">self</span>.head[<span style="color:#fab387;">1</span>] + dx)
    <span style="color:#fab387;">self</span>.body.<span style="color:#89b4fa;">insert</span>(<span style="color:#fab387;">0</span>, new_head)
    <span style="color:#cba6f7;">if not</span> <span style="color:#fab387;">self</span>.grow_pending: <span style="color:#6c7086;"># checked BEFORE grow() is called this tick</span>
        <span style="color:#fab387;">self</span>.body.<span style="color:#89b4fa;">pop</span>()
    <span style="color:#cba6f7;">else</span>:
        <span style="color:#fab387;">self</span>.grow_pending = <span style="color:#fab387;">False</span></pre>
<p>The problem: <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">move()</code> executes before the food check. So when the snake eats food, <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">grow_pending</code> is still <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">False</code> during that tick's <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">move()</code>. The tail gets popped. <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">grow()</code> then sets <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">grow_pending = True</code>, which takes effect on the <em>next</em> tick. The snake still grows by exactly one segment — but there's a one-frame visual glitch where the food disappears before the snake visually extends.</p>
<p>Compare with the pattern used by 7 of 9 implementations (here from deerflow-gsd):</p>
<pre style="background:#1e1e2e;color:#cdd6f4;padding:20px;border-radius:8px;overflow-x:auto;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:14px;line-height:1.6;margin:24px 0;"><span style="color:#cba6f7;">def</span> <span style="color:#89b4fa;">move_snake</span>(snake, direction, food):
    new_head = (head_y + dy, head_x + dx)
    new_snake = [new_head] + snake[:]
    ate = new_head == food
    <span style="color:#cba6f7;">if not</span> ate:
        new_snake.<span style="color:#89b4fa;">pop</span>() <span style="color:#6c7086;"># grow decision is immediate</span>
    <span style="color:#cba6f7;">return</span> new_snake, ate</pre>
<p>Insert head, then conditionally pop tail. No deferred flag, no timing issue. The simpler pattern is the correct one.</p>
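<p>The difference between the two patterns can be reproduced in a few lines. The sketch below is not taken from any of the nine implementations: <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">DeferredSnake</code>, <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">immediate_move</code>, the starting body, and the straight-line food placement are all hypothetical choices made for illustration. It records the snake's length after each tick under both patterns.</p>
<pre style="background:#1e1e2e;color:#cdd6f4;padding:20px;border-radius:8px;overflow-x:auto;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:14px;line-height:1.6;margin:24px 0;">
```python
# Hypothetical sketch: nothing here is copied from the benchmark repository.


class DeferredSnake:
    """Grow flag is set after move(), so it takes effect one tick late."""

    def __init__(self, body):
        self.body = list(body)      # head is body[0]; cells are (y, x)
        self.grow_pending = False

    def move(self, dy, dx):
        head_y, head_x = self.body[0]
        self.body.insert(0, (head_y + dy, head_x + dx))
        if not self.grow_pending:
            self.body.pop()         # tail is removed even on the eating tick
        else:
            self.grow_pending = False

    def grow(self):
        self.grow_pending = True


def immediate_move(snake, direction, food):
    """Decide growth in the same tick the head reaches the food."""
    dy, dx = direction
    head_y, head_x = snake[0]
    new_head = (head_y + dy, head_x + dx)
    new_snake = [new_head] + snake
    ate = new_head == food
    if not ate:
        new_snake.pop()
    return new_snake, ate


def run_deferred(ticks, food):
    """Move right for `ticks` steps and record the length after each one."""
    snake = DeferredSnake([(0, 2), (0, 1), (0, 0)])
    lengths = []
    for _ in range(ticks):
        snake.move(0, 1)
        if snake.body[0] == food:
            snake.grow()
        lengths.append(len(snake.body))
    return lengths


def run_immediate(ticks, food):
    """Same walk as run_deferred, using the immediate pattern."""
    snake = [(0, 2), (0, 1), (0, 0)]
    lengths = []
    for _ in range(ticks):
        snake, _ate = immediate_move(snake, (0, 1), food)
        lengths.append(len(snake))
    return lengths


if __name__ == "__main__":
    food = (0, 4)                   # reached on the second tick
    print("deferred :", run_deferred(4, food))
    print("immediate:", run_immediate(4, food))
```
</pre>
<p>Under these assumptions the deferred version reports lengths 3, 3, 4, 4 and the immediate version 3, 4, 4, 4. Both end at the same length, but the deferred one gets there a tick later.</p>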
<p>This is the most interesting result from the evaluation: the implementation with the <em>best</em> architecture score (superpowers-speckit: Code Quality 5/5, five clean OOP classes) had a functional bug. The procedural implementations with no classes got the behavior right.</p>
<!-- ================================================================ -->
<h2 style="font-size:28px;font-weight:700;margin-top:48px;margin-bottom:16px;color:#0d0d0d;">The Dead EventBus (DeerFlow + OpenSpec)</h2>
<p>The deerflow-openspec implementation built an event-driven architecture with an <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">EventBus</code> class:</p>
<pre style="background:#1e1e2e;color:#cdd6f4;padding:20px;border-radius:8px;overflow-x:auto;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:14px;line-height:1.6;margin:24px 0;"><span style="color:#cba6f7;">class</span> <span style="color:#f9e2af;">EventBus</span>:
    <span style="color:#cba6f7;">def</span> <span style="color:#89b4fa;">on</span>(<span style="color:#fab387;">self</span>, event, handler):
        <span style="color:#fab387;">self</span>._handlers.<span style="color:#89b4fa;">setdefault</span>(event, []).<span style="color:#89b4fa;">append</span>(handler)

    <span style="color:#cba6f7;">def</span> <span style="color:#89b4fa;">emit</span>(<span style="color:#fab387;">self</span>, event, **kwargs):
        <span style="color:#cba6f7;">for</span> h <span style="color:#cba6f7;">in</span> <span style="color:#fab387;">self</span>._handlers.<span style="color:#89b4fa;">get</span>(event, []):
            <span style="color:#89b4fa;">h</span>(**kwargs)</pre>
<p>Events are emitted throughout the game — <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">EVT_MOVE</code>, <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">EVT_EAT</code>, <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">EVT_DIE</code>. But <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">bus.on()</code> is never called anywhere. Zero subscribers. The events fire into the void.</p>
<p>The evaluator called this “architecture astronautics,” which is fair. The game works fine — the EventBus just doesn't do anything. It's the code equivalent of installing a fire alarm system and never connecting the sirens.</p>
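<p>To make the “zero subscribers” point concrete, here is a self-contained sketch of a bus with the same <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">on</code>/<code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">emit</code> shape. The <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">__init__</code> method, the <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">"eat"</code> event name, and the <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">on_eat</code> handler are assumptions added for illustration; the excerpt above does not show how deerflow-openspec initializes <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">_handlers</code> or what the <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">EVT_*</code> constants contain.</p>
<pre style="background:#1e1e2e;color:#cdd6f4;padding:20px;border-radius:8px;overflow-x:auto;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:14px;line-height:1.6;margin:24px 0;">
```python
class EventBus:
    def __init__(self):
        # Assumption: a plain dict mapping event name to a list of handlers.
        self._handlers = {}

    def on(self, event, handler):
        self._handlers.setdefault(event, []).append(handler)

    def emit(self, event, **kwargs):
        for h in self._handlers.get(event, []):
            h(**kwargs)


def demo():
    """Emit once with no subscriber, then once with a hypothetical one."""
    bus = EventBus()
    log = []

    # No subscribers yet: emit() iterates over an empty list and does nothing.
    bus.emit("eat", score=1)
    calls_without_subscriber = len(log)

    # Hypothetical handler; registering one is the step that never happens
    # in the implementation described above.
    def on_eat(score):
        log.append(score)

    bus.on("eat", on_eat)
    bus.emit("eat", score=2)
    return calls_without_subscriber, log


if __name__ == "__main__":
    print(demo())
```
</pre>
<p>The first <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">emit</code> reaches nothing, and only the second, after a handler is registered, appends to the log. A bus whose <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">on()</code> is never called stays in the first state for the whole game.</p>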
<p>One redeeming quality: deerflow-openspec is the only implementation that uses <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">collections.deque</code> for the snake body, giving O(1) head insertion vs O(n) for list <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">insert(0, ...)</code>. For a 20×10 terminal it doesn't matter, but it's the technically correct data structure.</p>
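<p>A minimal sketch of a deque-based move, assuming the head is stored at the left end. The function name <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">step</code> and its parameters are hypothetical and are not deerflow-openspec's actual API.</p>
<pre style="background:#1e1e2e;color:#cdd6f4;padding:20px;border-radius:8px;overflow-x:auto;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:14px;line-height:1.6;margin:24px 0;">
```python
from collections import deque


def step(body, new_head, ate):
    """Advance a snake stored as a deque with the head at index 0."""
    body.appendleft(new_head)       # O(1), unlike list.insert(0, ...)
    if not ate:
        body.pop()                  # O(1) removal of the tail cell
    return body


if __name__ == "__main__":
    snake = deque([(5, 5), (5, 4), (5, 3)])
    step(snake, (5, 6), ate=False)
    print(list(snake))
    step(snake, (5, 7), ate=True)
    print(len(snake))
```
</pre>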
<!-- ================================================================ -->
<h2 style="font-size:28px;font-weight:700;margin-top:48px;margin-bottom:16px;color:#0d0d0d;">Lines of Code</h2>
<table style="width:100%;border-collapse:collapse;margin:24px 0;font-size:15px;line-height:1.5;">
<thead>
<tr style="background:#1e1e2e;color:#cdd6f4;">
<th style="padding:10px 12px;text-align:left;font-weight:600;">Implementation</th>
<th style="padding:10px 8px;text-align:center;font-weight:600;">LOC</th>
<th style="padding:10px 12px;text-align:left;font-weight:600;">Architecture</th>
</tr>
</thead>
<tbody>
<tr style="background:#f8f9fa;"><td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">deerflow-gsd</td><td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">91</td><td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">Bare procedural</td></tr>
<tr style="background:#ffffff;"><td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">deerflow-speckit</td><td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">110</td><td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">Procedural (despite Spec Kit)</td></tr>
<tr style="background:#f8f9fa;"><td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">deerflow-openspec</td><td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">131</td><td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">OOP (SnakeGame class)</td></tr>
<tr style="background:#ffffff;"><td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">squad-gsd</td><td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">132</td><td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">Clean procedural</td></tr>
<tr style="background:#f8f9fa;"><td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">superpowers-gsd</td><td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">178</td><td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">Procedural with constants</td></tr>
<tr style="background:#ffffff;"><td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">superpowers-openspec</td><td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">179</td><td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">Functional + dataclass</td></tr>
<tr style="background:#f8f9fa;"><td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">squad-speckit</td><td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">180</td><td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">Protocol pattern + Position class</td></tr>
<tr style="background:#ffffff;"><td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">squad-openspec</td><td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">203</td><td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">Dict-state, pure functions</td></tr>
<tr style="background:#f8f9fa;"><td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">superpowers-speckit</td><td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;">213</td><td style="padding:8px 12px;border-bottom:1px solid #e9ecef;">OOP (5 classes)</td></tr>
</tbody>
</table>
<p>The native DeerFlow outputs are the shortest — 91 to 131 lines. squad-gsd's 132 lines scored 21/25 (tied first). superpowers-speckit's 213 lines scored 19/25 (with a bug). More code correlated with more architectural ambition, which correlated with more opportunities for bugs — except when the extra code was defensive rather than structural.</p>
<!-- ================================================================ -->
<h2 style="font-size:28px;font-weight:700;margin-top:48px;margin-bottom:16px;color:#0d0d0d;">What the Spec Toolkit Controls vs. What the Orchestrator Controls</h2>
<p>The spec toolkit shaped <em>architecture</em>:</p>
<ul style="padding-left:24px;">
<li><strong>GSD</strong> → procedural, pragmatic, “just make it work”</li>
<li><strong>Spec Kit</strong> → OOP, typed, separated concerns, more elaborate class hierarchies</li>
<li><strong>OpenSpec</strong> → functional, state-as-data, testable pure functions</li>
</ul>
<p>The orchestrator shaped <em>completeness and robustness</em>:</p>
<ul style="padding-left:24px;">
<li>All three GSD variants checked terminal size regardless of orchestrator. Outside GSD, only a Squad run (squad-openspec) did.</li>
<li>Superpowers consistently produced more polished rendering (manual Unicode box-drawing borders).</li>
<li>Native DeerFlow produced notably compact code: both GSD and Spec Kit yielded procedural implementations, while OpenSpec prompted DeerFlow to use a class-based approach.</li>
</ul>
<p>The interaction between them is what mattered. GSD's milestone-based approach appears to have pushed all three orchestrators to handle an edge case (terminal size) that the other toolkits did not prompt for. The spec explicitly says “Minimum terminal size: 20×10”. GSD treated this as a requirement, while Spec Kit and OpenSpec (squad-openspec excepted) did not seem to surface it as a first-class acceptance criterion.</p>
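<p>For reference, a terminal-size check of this kind can be very small. The sketch below is an assumption-laden illustration, not code from any of the four implementations that passed: it reads “20×10” as 20 columns by 10 rows (the quoted spec line does not say which dimension is which), and <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">fits_minimum</code>, <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">size_error</code>, and <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">FakeWindow</code> are hypothetical names.</p>
<pre style="background:#1e1e2e;color:#cdd6f4;padding:20px;border-radius:8px;overflow-x:auto;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:14px;line-height:1.6;margin:24px 0;">
```python
MIN_COLS = 20   # assumption: "20x10" read as 20 columns wide
MIN_ROWS = 10   # assumption: and 10 rows tall


def fits_minimum(rows, cols):
    """Return True when the terminal meets the assumed minimum size."""
    return rows >= MIN_ROWS and cols >= MIN_COLS


def size_error(stdscr):
    """Return an error message for a too-small window, else None.

    `stdscr` only needs a getmaxyx() method returning (rows, cols),
    which is what a curses window object provides.
    """
    rows, cols = stdscr.getmaxyx()
    if fits_minimum(rows, cols):
        return None
    return f"Terminal too small: {cols}x{rows}, need {MIN_COLS}x{MIN_ROWS}"


class FakeWindow:
    """Stand-in for a curses window so the check runs without a terminal."""

    def __init__(self, rows, cols):
        self._size = (rows, cols)

    def getmaxyx(self):
        return self._size
```
</pre>
<p>In a real game, <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">size_error</code> would be called with the window that <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">curses.wrapper</code> passes to the main function, before the first frame is drawn.</p>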
<!-- ================================================================ -->
<h2 style="font-size:28px;font-weight:700;margin-top:48px;margin-bottom:16px;color:#0d0d0d;">The Demos</h2>
<p>All nine games running:</p>
<table style="width:100%;border-collapse:collapse;margin:24px 0;">
<thead>
<tr style="background:#1e1e2e;color:#cdd6f4;">
<th style="padding:10px 12px;text-align:left;font-weight:600;width:120px;"></th>
<th style="padding:10px 8px;text-align:center;font-weight:600;">GSD</th>
<th style="padding:10px 8px;text-align:center;font-weight:600;">Spec Kit</th>
<th style="padding:10px 8px;text-align:center;font-weight:600;">OpenSpec</th>
</tr>
</thead>
<tbody>
<tr style="background:#f8f9fa;">
<td style="padding:8px 12px;border-bottom:1px solid #e9ecef;font-weight:600;">Superpowers</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;"><img src="https://raw.githubusercontent.com/nimanch/multi-agent-benchmark/main/gifs/superpowers-gsd.gif" alt="superpowers-gsd" style="width:100%;max-width:200px;border-radius:6px;border:1px solid #e9ecef;"></td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;"><img src="https://raw.githubusercontent.com/nimanch/multi-agent-benchmark/main/gifs/superpowers-speckit.gif" alt="superpowers-speckit" style="width:100%;max-width:200px;border-radius:6px;border:1px solid #e9ecef;"></td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;"><img src="https://raw.githubusercontent.com/nimanch/multi-agent-benchmark/main/gifs/superpowers-openspec.gif" alt="superpowers-openspec" style="width:100%;max-width:200px;border-radius:6px;border:1px solid #e9ecef;"></td>
</tr>
<tr style="background:#ffffff;">
<td style="padding:8px 12px;border-bottom:1px solid #e9ecef;font-weight:600;">DeerFlow</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;"><img src="https://raw.githubusercontent.com/nimanch/multi-agent-benchmark/main/gifs/deerflow-gsd.gif" alt="deerflow-gsd" style="width:100%;max-width:200px;border-radius:6px;border:1px solid #e9ecef;"></td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;"><img src="https://raw.githubusercontent.com/nimanch/multi-agent-benchmark/main/gifs/deerflow-speckit.gif" alt="deerflow-speckit" style="width:100%;max-width:200px;border-radius:6px;border:1px solid #e9ecef;"></td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;"><img src="https://raw.githubusercontent.com/nimanch/multi-agent-benchmark/main/gifs/deerflow-openspec.gif" alt="deerflow-openspec" style="width:100%;max-width:200px;border-radius:6px;border:1px solid #e9ecef;"></td>
</tr>
<tr style="background:#f8f9fa;">
<td style="padding:8px 12px;border-bottom:1px solid #e9ecef;font-weight:600;">Squad</td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;"><img src="https://raw.githubusercontent.com/nimanch/multi-agent-benchmark/main/gifs/squad-gsd.gif" alt="squad-gsd" style="width:100%;max-width:200px;border-radius:6px;border:1px solid #e9ecef;"></td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;"><img src="https://raw.githubusercontent.com/nimanch/multi-agent-benchmark/main/gifs/squad-speckit.gif" alt="squad-speckit" style="width:100%;max-width:200px;border-radius:6px;border:1px solid #e9ecef;"></td>
<td style="padding:8px;text-align:center;border-bottom:1px solid #e9ecef;"><img src="https://raw.githubusercontent.com/nimanch/multi-agent-benchmark/main/gifs/squad-openspec.gif" alt="squad-openspec" style="width:100%;max-width:200px;border-radius:6px;border:1px solid #e9ecef;"></td>
</tr>
</tbody>
</table>
<!-- ================================================================ -->
<h2 style="font-size:28px;font-weight:700;margin-top:48px;margin-bottom:16px;color:#0d0d0d;">Limitations</h2>
<p>This evaluation has real weaknesses:</p>
<p><strong>Snake is too easy.</strong> All nine implementations work. The interesting differences are in edge cases (terminal size, grow timing) and code style — not in whether the agents could solve the problem. A harder task (multi-file, networking, database) would produce more differentiation.</p>
<p><strong>All three orchestrators ran with their actual tools.</strong> Superpowers dispatched subagents natively. DeerFlow ran through its embedded client. Squad provided team context via <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">.squad/</code> directories. The tools are designed for different workflows — Superpowers for parallel subagent dispatch, DeerFlow for staged pipelines, Squad for persistent team coordination — so the comparison is inherently asymmetric.</p>
<p><strong>N=1 per combination.</strong> I ran each combination once. LLM outputs are stochastic. Running each combination five times and averaging would be more rigorous. The scores could shift by a few points on re-runs.</p>
<p><strong>LLM-as-judge.</strong> The evaluation was done by an LLM, not by running a test suite. The grow-timing bug was caught by code inspection, not automated testing. A proper evaluation would include functional tests that exercise each acceptance criterion programmatically.</p>
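<p>As an illustration of what such a test might look like, the sketch below checks the growth criterion against a function with the <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">move_snake(snake, direction, food)</code> shape shown earlier. The body of <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">move_snake</code> here is a stand-in written for this example, with the direction assumed to be a <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">(dy, dx)</code> tuple and the snake a list of <code style="background:#f0f0f0;padding:2px 6px;border-radius:3px;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:15px;">(y, x)</code> cells. Each real implementation exposes a different interface, so each would need its own adapter.</p>
<pre style="background:#1e1e2e;color:#cdd6f4;padding:20px;border-radius:8px;overflow-x:auto;font-family:'SFMono-Regular',Consolas,'Liberation Mono',Menlo,monospace;font-size:14px;line-height:1.6;margin:24px 0;">
```python
def move_snake(snake, direction, food):
    """Stand-in for the implementation under test (hypothetical body)."""
    dy, dx = direction
    head_y, head_x = snake[0]
    new_head = (head_y + dy, head_x + dx)
    new_snake = [new_head] + snake
    ate = new_head == food
    if not ate:
        new_snake.pop()
    return new_snake, ate


def test_grows_on_the_eating_tick():
    # Acceptance criterion: one extra segment each time food is eaten.
    snake = [(3, 3), (3, 2), (3, 1)]
    after, ate = move_snake(snake, (0, 1), food=(3, 4))
    assert ate
    assert len(after) == len(snake) + 1


def test_length_unchanged_without_food():
    snake = [(3, 3), (3, 2), (3, 1)]
    after, ate = move_snake(snake, (0, 1), food=(9, 9))
    assert not ate
    assert len(after) == len(snake)


if __name__ == "__main__":
    test_grows_on_the_eating_tick()
    test_length_unchanged_without_food()
    print("growth criterion holds")
```
</pre>
<p>A deferred-flag implementation would fail the first test as written, which is the kind of signal code inspection had to supply here.</p>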
<p><strong>Scoring granularity.</strong> 1–5 integer scores across five dimensions give a 5–25 range. The actual spread was 17–21, a 4-point range across 9 implementations. The top-3 vs bottom-6 distinction is clearer than individual rankings.</p>
<!-- ================================================================ -->
<h2 style="font-size:28px;font-weight:700;margin-top:48px;margin-bottom:16px;color:#0d0d0d;">Conclusions</h2>
<p>Three data-driven takeaways:</p>
<ol style="padding-left:24px;">
<li style="margin-bottom:16px;"><strong>GSD's spec compliance advantage was decisive.</strong> All three GSD variants scored 5/5 on spec compliance and checked terminal size. The toolkit's milestone-based approach — fresh context per phase, explicit verification steps — appears to enforce spec requirements more reliably than scenario-based (Spec Kit) or task-based (OpenSpec) approaches. squad-gsd and squad-openspec tied at 21/25; superpowers-gsd scored 20. At least for this spec, GSD's “verify each phase” pattern caught requirements that other toolkits missed.</li>
<li style="margin-bottom:16px;"><strong>Architectural ambition correlated negatively with correctness.</strong> The two implementations with the most classes (superpowers-speckit: 5 classes, squad-speckit: Position + Protocol + entities) both had the grow-timing bug (Correctness: 3/5 each). The top scorers (21, 21, 20) were all procedural or simple functional code. squad-openspec (21/25) is instructive: its 203 lines are longer than average, but the extra code is defensive (try/except, resize loop) rather than structural (classes, protocols).</li>
<li style="margin-bottom:16px;"><strong>DeerFlow favored simplicity over architecture.</strong> DeerFlow's pipeline produced the shortest implementations (91–131 LOC), and two of its three runs were plain procedural code. This is interesting because DeerFlow's documented architecture (research → plan → code → review) suggests it would produce more considered designs. In practice, the research and review stages may have encouraged <em>simplicity</em>, steering those runs away from elaborate class hierarchies. The tendency was not uniform: deerflow-openspec is the implementation that shipped the unused EventBus. Whether this holds for more complex tasks is an open question.</li>
</ol>
<p>Whether this generalizes beyond Snake — beyond a well-understood, small, single-file problem — is an open question.</p>
<hr style="border:none;border-top:1px solid #e6e6e6;margin:40px 0 24px;">
<p style="color:#757575;font-size:16px;"><em>Nishant Manchanda builds things in Seattle.</em></p>
<p style="color:#757575;font-size:16px;"><em>All code and evaluation data: <a href="https://github.com/nimanch/multi-agent-benchmark" style="color:#1a8917;text-decoration:none;">github.com/nimanch/multi-agent-benchmark</a></em></p>
<h2 style="font-size:28px;font-weight:700;margin-top:48px;margin-bottom:16px;color:#0d0d0d;">References</h2>
<p><strong>Orchestrators:</strong></p>
<ul style="padding-left:24px;">
<li><a href="https://github.com/obra/superpowers" style="color:#1a8917;text-decoration:none;">Superpowers</a> — Subagent-driven development framework with two-stage review (94K+ stars)</li>
<li><a href="https://github.com/bytedance/deer-flow" style="color:#1a8917;text-decoration:none;">DeerFlow 2.0</a> — ByteDance's Research → Plan → Code → Review pipeline</li>
<li><a href="https://github.com/bradygaster/squad" style="color:#1a8917;text-decoration:none;">Squad</a> — Team management layer for GitHub Copilot with persistent specialist roles (v0.8.25)</li>
</ul>
<p><strong>Spec Toolkits:</strong></p>
<ul style="padding-left:24px;">
<li><a href="https://github.com/gsd-build/get-shit-done" style="color:#1a8917;text-decoration:none;">GSD (Get Shit Done)</a> — Milestone/phase project management with fresh context windows (35K+ stars)</li>
<li><a href="https://github.com/github/spec-kit" style="color:#1a8917;text-decoration:none;">Spec Kit</a> — GitHub's spec-driven development with feature scenarios and acceptance criteria (72.7K stars)</li>
<li><a href="https://github.com/Fission-AI/openspec" style="color:#1a8917;text-decoration:none;">OpenSpec</a> — Lightweight plan → apply → archive workflow</li>
</ul>
<p><strong>Evaluation:</strong></p>
<ul style="padding-left:24px;">
<li><a href="https://github.com/nimanch/multi-agent-benchmark/blob/main/SNAKE_SPEC.md" style="color:#1a8917;text-decoration:none;">SNAKE_SPEC.md</a> — Shared spec used across all 9 experiments</li>
<li><a href="https://github.com/nimanch/multi-agent-benchmark/blob/main/EVALUATION.md" style="color:#1a8917;text-decoration:none;">EVALUATION.md</a> — Full LLM-as-Judge scoring with per-implementation analysis</li>
<li><a href="https://github.com/nimanch/multi-agent-benchmark/blob/main/RESULTS.md" style="color:#1a8917;text-decoration:none;">RESULTS.md</a> — Raw benchmark results</li>
</ul>
</div>
</body>
</html>