Demo repository for Perstack. Runs the same task across different providers and compares results.
Game generation demo using the bash-gaming expert defined in perstack.toml — a multi-agent team that autonomously designs, implements, tests, and packages CLI games.
Definition: perstack.toml

Providers:
- Anthropic
- OpenAI
- Google
- Fireworks

Models:
- Anthropic: Opus 4.6, Sonnet 4.6
- OpenAI: GPT-5.4, GPT-5 mini
- Google: Gemini 3.1 Pro, Gemini 3 Flash
- Fireworks: Kimi K2.5, MiniMax M2.5
```
bash-gaming (coordinator)
├── @bash-gaming/plan
├── @bash-gaming/build
└── @bash-gaming/verify
```
Each expert has a defaultModelTier (high or middle). Perstack automatically selects the appropriate model within the provider based on this tier. The --model flag sets the base model; delegates may use a different model from the same provider depending on their tier.
| Expert | Model Tier | Role |
|---|---|---|
| bash-gaming (coordinator) | high | Coordinates the entire task and delegates to the appropriate experts. |
| @bash-gaming/plan | middle | Expands requirements, defines dual-mode API contract, npm package structure. |
| @bash-gaming/build | high | Implements Ink+React TUI, --ai JSON mode, game logic, npm packaging. |
| @bash-gaming/verify | middle | Validates npx installability, AI mode deterministic output, TUI playthrough tests. |
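The tier routing above could be expressed in perstack.toml along these lines. This is a hypothetical sketch: only the expert names and the defaultModelTier field come from this document, and every other key is an assumed schema, not a copy of the repository's actual config.

```toml
# Hypothetical sketch — field names other than defaultModelTier are assumptions.
[experts.bash-gaming]
defaultModelTier = "high"
description = "Coordinates the entire task and delegates to the appropriate experts."
delegates = ["@bash-gaming/plan", "@bash-gaming/build", "@bash-gaming/verify"]

[experts."@bash-gaming/plan"]
defaultModelTier = "middle"

[experts."@bash-gaming/build"]
defaultModelTier = "high"

[experts."@bash-gaming/verify"]
defaultModelTier = "middle"
```

With a layout like this, passing --model only pins the base model; each delegate resolves to the high- or middle-tier model of the same provider according to its own tier.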
Create a Wizardry-like dungeon crawler in a fixed 10-floor labyrinth with complex layouts, traps, fixed room encounters, and random battles. Include special-effect gear drops, leveling, and a skill tree for one playable character. Balance difficulty around build optimization. Death in the dungeon causes loss of one random equipped item.
Replace `<provider>`, `<model>`, and the result directory for each run.

```shell
docker run --pull always --rm -it \
  --env-file .env \
  -v ./<result-dir>:/workspace \
  -v ./bash-gaming/perstack.toml:/definitions/perstack.toml:ro \
  perstack/perstack start bash-gaming \
  --config /definitions/perstack.toml \
  --provider <provider> \
  --model <model> \
  "Create a Wizardry-like dungeon crawler in a fixed 10-floor labyrinth with complex layouts, traps, fixed room encounters, and random battles. Include special-effect gear drops, leveling, and a skill tree for one playable character. Balance difficulty around build optimization. Death in the dungeon causes loss of one random equipped item."
```

The game must be playable via `npx .` in the result directory: an Ink-based TUI launches and the dungeon crawler is interactive.
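The container reads provider credentials from a local .env file via --env-file. The variable names below are assumptions based on each provider's conventional key naming, not taken from Perstack's documentation; check what Perstack actually expects before running.

```shell
# .env — hypothetical key names; set only the providers you intend to run
ANTHROPIC_API_KEY=...
OPENAI_API_KEY=...
GEMINI_API_KEY=...
FIREWORKS_API_KEY=...
```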
| Provider | Models Used | Works | Directory | Steps | Duration | Input Tokens | Cached | Output Tokens | Cost |
|---|---|---|---|---|---|---|---|---|---|
| Anthropic | Opus 4.6, Sonnet 4.6 | ✅ | anthropic/ | 173 | 51m 18s | 13.8M | 13.3M (96.07%) | 213.9K | $15.24 |
| OpenAI | GPT-5.4, GPT-5 mini | ✅ | openai/ | 118 | 12m 24s | 2.4M | 2.2M (89.56%) | 42.6K | ~$1.80 |
| Google | Gemini 3.1 Pro, Gemini 3 Flash | ✅ | google/ | 163 | 16m 31s | 2.9M | 2.1M (72.69%) | 46.2K | ~$1.76 |
| Fireworks | Kimi K2.5 | ✅ | fireworks-kimi-k2p5/ | 324 | 1h 46m | 20.6M | 19.0M (92.13%) | 189.1K | ~$3.43 |
| Fireworks | MiniMax M2.5 | ❌ | fireworks-minimax-m2p5/ | 59 | 5m 49s | 1.0M | 844.4K (83.31%) | 39.7K | ~$0.13 |
Building production-grade agentic apps requires models that perform well, but at an operationally viable cost. Model performance itself has multiple axes:
- Trustworthiness — Does the model avoid self-destructive behavior (e.g., corrupting its own workspace, breaking the environment)? Can a user delegate a task and walk away? This is a knockout factor — if a model cannot be trusted, nothing else matters.
- Outcome quality — Does the model exceed expectations, or merely satisfy them? Does it produce deliverables that go beyond the literal request?
- Instruction adherence — Does the model follow the defined expert topology and constraints? Note: blind compliance is not the same as good adherence — over-literal following at the expense of the task goal is a performance regression.
- Cost efficiency — Is the result achievable at a price point that makes the agentic app commercially viable?
This evaluation assessed each provider across these axes.
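The cost-efficiency figures in this comparison are straight arithmetic over the per-run token tables later in this document: uncached input, cached input, and output tokens are each billed at their own per-million rate. A minimal sketch, using the Gemini 3.1 Pro runs and the $2.00 / $0.20 / $12.00 list prices quoted below as inputs:

```python
def run_cost(uncached, cached, output, price_in, price_cached, price_out):
    """Dollar cost of one run; prices are per 1M tokens."""
    return (uncached * price_in + cached * price_cached + output * price_out) / 1e6

# Gemini 3.1 Pro rows from the Google cost-breakdown table: (uncached, cached, output)
pro_runs = [
    (1_100,   0,         354),     # bash-gaming (coordinator)
    (12_200,  0,         531),     # bash-gaming (coordinator)
    (112_400, 379_800,   12_000),  # @bash-gaming/build
    (3_900,   0,         48),      # bash-gaming (coordinator)
    (18_400,  0,         347),     # bash-gaming (coordinator)
    (200_000, 1_400_000, 15_100),  # @bash-gaming/build
    (46_100,  0,         952),     # bash-gaming (coordinator)
]

total = sum(run_cost(u, c, o, 2.00, 0.20, 12.00) for u, c, o in pro_runs)
print(f"~${total:.2f}")  # → ~$1.50
```

This reproduces the ~$1.50 attributed to Gemini 3.1 Pro; the same formula applied to every table in this document yields the per-provider totals.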
1. Kimi K2.5 (Fireworks) — Best overall.
| Axis | Assessment |
|---|---|
| Trustworthiness | High. No self-destructive behavior. Two follow-ups were needed, but for environment-specific issues — the game was functional within the harness. |
| Outcome quality | Excellent. Consumed high input tokens and iterated extensively, delivering a polished game comparable to Anthropic's output. |
| Instruction adherence | Full. Respected the plan→build→verify pipeline and met quality gates defined in the expert topology. |
| Cost efficiency | Strong. ~$3.43 — achieved near-Anthropic quality at ~1/4 the cost. |
2. Anthropic (Opus 4.6 + Sonnet 4.6) — Highest quality, but cost is prohibitive.
| Axis | Assessment |
|---|---|
| Trustworthiness | High. Stable, no environment issues. Completed in a single query. |
| Outcome quality | Excellent. Rich Ink TUI with polish beyond the minimum requirements. However, a significant portion of output tokens went to excessive documentation rather than code — a misallocation of effort. |
| Instruction adherence | Full. Correctly routed delegates via defaultModelTier and followed the plan→build→verify pipeline. |
| Cost efficiency | Poor. $15.24 is prohibitive for third-party agentic apps. The Opus + Sonnet combination works architecturally but the price makes it unviable at scale. |
3. Google (Gemini 3.1 Pro + Gemini 3 Flash) — Efficient and compliant, but shallow.
| Axis | Assessment |
|---|---|
| Trustworthiness | Adequate. No destructive behavior, but the output has noticeable bugs. |
| Outcome quality | Minimal. Functional but lacks depth and polish. Does not aim beyond satisfying the literal requirements. |
| Instruction adherence | Full. Correctly followed plan→build→verify, routing delegates to Flash via defaultModelTier. |
| Cost efficiency | Excellent. ~$1.76 — cheapest successful run. |
4. OpenAI (GPT-5.4 + GPT-5 mini) — Fast and cheap, but skipped verification.
| Axis | Assessment |
|---|---|
| Trustworthiness | Adequate. No destructive behavior, stable execution. |
| Outcome quality | Minimal. Functional but plain. Prioritizes task completion over exceeding expectations. |
| Instruction adherence | Partial. Skipped @bash-gaming/verify entirely, calling @bash-gaming/build three times instead. The defined pipeline was not followed. |
| Cost efficiency | Excellent. ~$1.80 — fastest completion at 12m 24s. |
5. MiniMax M2.5 (Fireworks) — Failed.
| Axis | Assessment |
|---|---|
| Trustworthiness | Low. Ignored expert instructions entirely. |
| Outcome quality | N/A. Produced a browser-based HTML file instead of a CLI game. |
| Instruction adherence | None. Did not follow the expert topology or use the plan/build/verify pipeline. |
| Cost efficiency | N/A. ~$0.13 spent, but no usable output. |
```shell
cd bash-gaming/opus-4-6-anthropic && npm install && npx .
```

Run with perstack@0.0.136. Completed in a single query with no follow-up requests. The coordinator routed plan and verify to Sonnet 4.6 (middle tier), and build to Opus 4.6 (high tier). High instruction adherence — produced a fully functional Ink TUI dungeon crawler with all requested features.
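The `npx .` entry point works because the generated package declares an executable in the standard package.json "bin" field, which npx resolves and runs from the current directory. A minimal illustrative manifest — the name, paths, and version ranges here are assumptions, not copied from the generated game:

```json
{
  "name": "wizardry-crawler",
  "version": "1.0.0",
  "type": "module",
  "bin": {
    "wizardry-crawler": "./dist/cli.js"
  },
  "dependencies": {
    "ink": "^5.0.0",
    "react": "^18.0.0"
  }
}
```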
Cost breakdown by run
| Run | Model | Input | Cached | Uncached | Output |
|---|---|---|---|---|---|
| bash-gaming (coordinator) | Opus 4.6 | 24.7K | — | 24.7K | 2.1K |
| @bash-gaming/plan | Sonnet 4.6 | 7.1M | 6.9M | 200K | 142.6K |
| bash-gaming (coordinator) | Opus 4.6 | 8.4K | — | 8.4K | 11.8K |
| @bash-gaming/build | Opus 4.6 | 4.6M | 4.5M | 100K | 36.1K |
| bash-gaming (coordinator) | Opus 4.6 | 90.0K | 65.8K | 24.2K | 1.0K |
| @bash-gaming/verify | Sonnet 4.6 | 1.6M | 1.5M | 100K | 16.8K |
| bash-gaming (coordinator) | Opus 4.6 | 370.8K | 351.8K | 19.0K | 3.5K |
By model:
| Model | Cost |
|---|---|
| Opus 4.6 | $9.22 |
| Sonnet 4.6 | $6.02 |
| Total | $15.24 |
View execution history:
```shell
cd bash-gaming/opus-4-6-anthropic && npx perstack log --job omd3cvzndvtvbpma0tut38ac
```

```shell
cd bash-gaming/gpt-5-4-openai && npm install && npx .
```

Run with perstack@0.0.136. Completed in a single query with no follow-up requests. The coordinator routed plan to GPT-5 mini (middle tier). However, it never delegated to @bash-gaming/verify — instead it called @bash-gaming/build three times, bypassing the defined plan→build→verify pipeline. The result is functional, but the expert topology was not fully followed.
Cost breakdown by run
| Run | Model | Input | Cached | Uncached | Output |
|---|---|---|---|---|---|
| bash-gaming (coordinator) | GPT-5.4 | 5.8K | 1.2K | 4.6K | 541 |
| @bash-gaming/plan | GPT-5 mini | 3.9K | 1.0K | 2.9K | 4.6K |
| bash-gaming (coordinator) | GPT-5.4 | 5.3K | — | 5.3K | 209 |
| @bash-gaming/build | GPT-5.4 | 1.6M | 1.5M | 100K | 22.5K |
| bash-gaming (coordinator) | GPT-5.4 | 85.9K | 46.6K | 39.3K | 1.2K |
| @bash-gaming/build | GPT-5.4 | 503.1K | 427.5K | 75.6K | 9.3K |
| bash-gaming (coordinator) | GPT-5.4 | 59.0K | 54.3K | 4.7K | 375 |
| @bash-gaming/build | GPT-5.4 | 99.8K | 82.4K | 17.4K | 3.0K |
| bash-gaming (coordinator) | GPT-5.4 | 99.3K | 94.8K | 4.5K | 892 |
By model:
| Model | Pricing (input / cached / output per 1M) | Cost |
|---|---|---|
| GPT-5.4 | $2.50 / $0.25 / $15.00 | ~$1.75 |
| GPT-5 mini | $0.25 / $0.025 / $2.00 | < $0.01 |
| Total | | ~$1.80 |
View execution history:
```shell
cd bash-gaming/gpt-5-4-openai && npx perstack log --job cbtj0h7h3hm6o92m12awertf
```

```shell
cd bash-gaming/google/wizardry-crawler && npm install && npx .
```

Run with perstack@0.0.136. Completed in a single query with no follow-up requests. The coordinator correctly followed the plan→build→verify pipeline, routing plan and verify to Gemini 3 Flash (middle tier). Functional but buggy — the game runs and is playable, but exhibits noticeable gameplay issues.
Cost breakdown by run
| Run | Model | Input | Cached | Uncached | Output |
|---|---|---|---|---|---|
| bash-gaming (coordinator) | Gemini 3.1 Pro | 1.1K | — | 1.1K | 354 |
| @bash-gaming/plan | Gemini 3 Flash | 394.9K | 229.7K | 165.2K | 12.5K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 12.2K | — | 12.2K | 531 |
| @bash-gaming/build | Gemini 3.1 Pro | 492.2K | 379.8K | 112.4K | 12.0K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 3.9K | — | 3.9K | 48 |
| @bash-gaming/verify | Gemini 3 Flash | 289.0K | 60.0K | 229.0K | 4.4K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 18.4K | — | 18.4K | 347 |
| @bash-gaming/build | Gemini 3.1 Pro | 1.6M | 1.4M | 200K | 15.1K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 46.1K | — | 46.1K | 952 |
By model:
| Model | Pricing (input / cached / output per 1M) | Cost |
|---|---|---|
| Gemini 3.1 Pro Preview | $2.00 / $0.20 / $12.00 | ~$1.50 |
| Gemini 3 Flash Preview | $0.50 / $0.05 / $3.00 | ~$0.26 |
| Total | | ~$1.76 |
View execution history:
```shell
cd bash-gaming/google/wizardry-crawler && npx perstack log --job j3oa726f86kyqyqc75nekbbu
```

```shell
cd bash-gaming/fireworks-kimi-k2p5 && npm install && npx .
```

Run with perstack@0.0.136. Kimi K2.5 performed micro-agent orchestration as expected, leveraging delegation across the expert topology to design, implement, test, and iteratively improve the deliverables. Two follow-up requests were made to address environment-specific issues (the game was functional within the harness).
Cost breakdown by run
| Run | Input | Cached | Uncached | Output |
|---|---|---|---|---|
| bash-gaming (coordinator) | 952 | — | 952 | 504 |
| @bash-gaming/plan | 955.8K | 794.1K | 161.7K | 69.5K |
| bash-gaming (coordinator) | 2.5K | 512 | 2.0K | 432 |
| @bash-gaming/build | 12.2M | 11.7M | 500K | 83.1K |
| bash-gaming (coordinator) | 3.9K | 2.0K | 1.9K | 307 |
| @bash-gaming/verify | 1.8M | 1.6M | 200K | 8.6K |
| bash-gaming (coordinator) | 56.9K | 26.1K | 30.8K | 1.3K |
| (follow-up 1) | | | | |
| bash-gaming (coordinator) | 57.9K | 26.6K | 31.3K | 566 |
| @bash-gaming/build | 882.5K | 710.7K | 171.8K | 6.9K |
| bash-gaming (coordinator) | 1.7M | 1.5M | 200K | 3.6K |
| (follow-up 2) | | | | |
| bash-gaming (coordinator) | 306.1K | 205.3K | 100.8K | 700 |
| @bash-gaming/build | 2.2M | 2.1M | 100K | 13.0K |
| bash-gaming (coordinator) | 416.1K | 394.2K | 21.9K | 695 |
All runs used Kimi K2.5 ($0.60 / $0.10 / $3.00 per 1M). Total: ~$3.43
View execution history:
```shell
cd bash-gaming/fireworks-kimi-k2p5 && npx perstack log --job iaitgzq7vdn92fwmu16pzm18
```

Run with perstack@0.0.136. MiniMax M2.5 ignored the expert instructions — it produced a single-file browser-based HTML game (labyrinth.html) instead of an npx-installable CLI game with Ink TUI and AI mode. No npm package structure, no TypeScript, no tests. The coordinator delegated to plan and verify but skipped @bash-gaming/build entirely — the HTML file was written directly during the plan phase.
Cost breakdown by run
| Run | Input | Cached | Uncached | Output |
|---|---|---|---|---|
| bash-gaming (coordinator) | 14.6K | 10.1K | 4.5K | 1.5K |
| @bash-gaming/plan | 172.9K | 111.2K | 61.7K | 29.3K |
| bash-gaming (coordinator) | 3.4K | 2.3K | 1.1K | 346 |
| @bash-gaming/verify | 755.9K | 663.1K | 92.8K | 5.5K |
| bash-gaming (coordinator) | 66.7K | 57.6K | 9.1K | 2.9K |
All runs used MiniMax M2.5 ($0.30 / $0.03 / $1.20 per 1M). Total: ~$0.13
View execution history:
```shell
cd bash-gaming/fireworks-minimax-m2p5 && npx perstack log --job pxpbam5i9zliguib8ivk2itw
```





