Demo repository for Perstack. Runs the same task across different providers and compares results.
Game generation demo using the bash-gaming expert defined in perstack.toml — a multi-agent team that autonomously designs, implements, tests, and packages CLI games.
Definition: perstack.toml

Providers:
- Anthropic
- OpenAI
- Google
- Fireworks

Models:
- Anthropic: Opus 4.6, Sonnet 4.6
- OpenAI: GPT-5.4, GPT-5 mini
- Google: Gemini 3.1 Pro, Gemini 3 Flash
- Fireworks: Kimi K2.5, MiniMax M2.5
```
bash-gaming (coordinator)
├── @bash-gaming/plan
├── @bash-gaming/build
└── @bash-gaming/verify
```
Each expert has a defaultModelTier (high or middle). Perstack automatically selects the appropriate model within the provider based on this tier. The --model flag sets the base model; delegates may use a different model from the same provider depending on their tier.
| Expert | Model Tier | Role |
|---|---|---|
| bash-gaming (coordinator) | high | Coordinates the entire task and delegates to the appropriate experts. |
| @bash-gaming/plan | middle | Expands requirements, defines dual-mode API contract, npm package structure. |
| @bash-gaming/build | high | Implements Ink+React TUI, --ai JSON mode, game logic, npm packaging. |
| @bash-gaming/verify | middle | Validates npx installability, AI mode deterministic output, TUI playthrough tests. |
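The tier routing above could be expressed in perstack.toml along these lines. This is a hypothetical sketch: only the expert names and the defaultModelTier field come from this document, and every other key is an assumed schema, not a copy of the repository's actual config.

```toml
# Hypothetical sketch — field names other than defaultModelTier are assumptions.
[experts.bash-gaming]
defaultModelTier = "high"
description = "Coordinates the entire task and delegates to the appropriate experts."
delegates = ["@bash-gaming/plan", "@bash-gaming/build", "@bash-gaming/verify"]

[experts."@bash-gaming/plan"]
defaultModelTier = "middle"

[experts."@bash-gaming/build"]
defaultModelTier = "high"

[experts."@bash-gaming/verify"]
defaultModelTier = "middle"
```

With a layout like this, passing --model only pins the base model; each delegate resolves to the high- or middle-tier model of the same provider according to its own tier.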
Create a Wizardry-like dungeon crawler in a fixed 10-floor labyrinth with complex layouts, traps, fixed room encounters, and random battles. Include special-effect gear drops, leveling, and a skill tree for one playable character. Balance difficulty around build optimization. Death in the dungeon causes loss of one random equipped item.
Replace `<provider>`, `<model>`, and the result directory for each run.

```shell
docker run --pull always --rm -it \
  --env-file .env \
  -v ./<result-dir>:/workspace \
  -v ./bash-gaming/perstack.toml:/definitions/perstack.toml:ro \
  perstack/perstack start bash-gaming \
  --config /definitions/perstack.toml \
  --provider <provider> \
  --model <model> \
  "Create a Wizardry-like dungeon crawler in a fixed 10-floor labyrinth with complex layouts, traps, fixed room encounters, and random battles. Include special-effect gear drops, leveling, and a skill tree for one playable character. Balance difficulty around build optimization. Death in the dungeon causes loss of one random equipped item."
```

The game must be playable via `npx .` in the result directory: an Ink-based TUI launches and the dungeon crawler is interactive.
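The container reads provider credentials from a local .env file via --env-file. The variable names below are assumptions based on each provider's conventional key naming, not taken from Perstack's documentation; check what Perstack actually expects before running.

```shell
# .env — hypothetical key names; set only the providers you intend to run
ANTHROPIC_API_KEY=...
OPENAI_API_KEY=...
GEMINI_API_KEY=...
FIREWORKS_API_KEY=...
```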
| Provider | Models Used | Works | Directory | Steps | Duration | Input Tokens | Cached | Output Tokens | Cost |
|---|---|---|---|---|---|---|---|---|---|
| Anthropic | Opus 4.6, Sonnet 4.6 | ✅ | anthropic/ | 173 | 51m 18s | 13.8M | 13.3M (96.07%) | 213.9K | $15.24 |
| OpenAI | GPT-5.4, GPT-5 mini | ✅ | openai/ | 118 | 12m 24s | 2.4M | 2.2M (89.56%) | 42.6K | ~$1.80 |
| Google | Gemini 3.1 Pro, Gemini 3 Flash | ✅ | google/ | 163 | 16m 31s | 2.9M | 2.1M (72.69%) | 46.2K | ~$1.76 |
| Fireworks | Kimi K2.5 | ✅ | fireworks-kimi-k2p5/ | 324 | 1h 46m | 20.6M | 19.0M (92.13%) | 189.1K | ~$3.43 |
| Fireworks | MiniMax M2.5 | ❌ | fireworks-minimax-m2p5/ | 59 | 5m 49s | 1.0M | 844.4K (83.31%) | 39.7K | ~$0.13 |
Building production-grade agentic apps requires models that perform well, but at an operationally viable cost. Model performance itself has multiple axes:
- Trustworthiness — Does the model avoid self-destructive behavior (e.g., corrupting its own workspace, breaking the environment)? Can a user delegate a task and walk away? This is a knockout factor — if a model cannot be trusted, nothing else matters.
- Outcome quality — Does the model exceed expectations, or merely satisfy them? Does it produce deliverables that go beyond the literal request?
- Instruction adherence — Does the model follow the defined expert topology and constraints? Note: blind compliance is not the same as good adherence — over-literal following at the expense of the task goal is a performance regression.
- Cost efficiency — Is the result achievable at a price point that makes the agentic app commercially viable?
This evaluation assessed each provider across these axes.
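The cost-efficiency figures in this comparison are straight arithmetic over the per-run token tables later in this document: uncached input, cached input, and output tokens are each billed at their own per-million rate. A minimal sketch, using the Gemini 3.1 Pro runs and the $2.00 / $0.20 / $12.00 list prices quoted below as inputs:

```python
def run_cost(uncached, cached, output, price_in, price_cached, price_out):
    """Dollar cost of one run; prices are per 1M tokens."""
    return (uncached * price_in + cached * price_cached + output * price_out) / 1e6

# Gemini 3.1 Pro rows from the Google cost-breakdown table: (uncached, cached, output)
pro_runs = [
    (1_100,   0,         354),     # bash-gaming (coordinator)
    (12_200,  0,         531),     # bash-gaming (coordinator)
    (112_400, 379_800,   12_000),  # @bash-gaming/build
    (3_900,   0,         48),      # bash-gaming (coordinator)
    (18_400,  0,         347),     # bash-gaming (coordinator)
    (200_000, 1_400_000, 15_100),  # @bash-gaming/build
    (46_100,  0,         952),     # bash-gaming (coordinator)
]

total = sum(run_cost(u, c, o, 2.00, 0.20, 12.00) for u, c, o in pro_runs)
print(f"~${total:.2f}")  # → ~$1.50
```

This reproduces the ~$1.50 attributed to Gemini 3.1 Pro; the same formula applied to every table in this document yields the per-provider totals.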
1. Kimi K2.5 (Fireworks) — Best overall.
| Axis | Assessment |
|---|---|
| Trustworthiness | High. No self-destructive behavior. Two follow-ups were needed, but for environment-specific issues — the game was functional within the harness. |
| Outcome quality | Excellent. Consumed high input tokens and iterated extensively, delivering a polished game comparable to Anthropic's output. |
| Instruction adherence | Full. Respected the plan→build→verify pipeline and met quality gates defined in the expert topology. |
| Cost efficiency | Strong. ~$3.43 — achieved near-Anthropic quality at ~1/4 the cost. |
2. Anthropic (Opus 4.6 + Sonnet 4.6) — Highest quality, but cost is prohibitive.
| Axis | Assessment |
|---|---|
| Trustworthiness | High. Stable, no environment issues. Completed in a single query. |
| Outcome quality | Excellent. Rich Ink TUI with polish beyond the minimum requirements. However, a significant portion of output tokens went to excessive documentation rather than code — a misallocation of effort. |
| Instruction adherence | Full. Correctly routed delegates via defaultModelTier and followed the plan→build→verify pipeline. |
| Cost efficiency | Poor. $15.24 is prohibitive for third-party agentic apps. The Opus + Sonnet combination works architecturally but the price makes it unviable at scale. |
3. Google (Gemini 3.1 Pro + Gemini 3 Flash) — Efficient and compliant, but shallow.
| Axis | Assessment |
|---|---|
| Trustworthiness | Adequate. No destructive behavior, but the output has noticeable bugs. |
| Outcome quality | Minimal. Functional but lacks depth and polish. Does not aim beyond satisfying the literal requirements. |
| Instruction adherence | Full. Correctly followed plan→build→verify, routing delegates to Flash via defaultModelTier. |
| Cost efficiency | Excellent. ~$1.76 — cheapest successful run. |
4. OpenAI (GPT-5.4 + GPT-5 mini) — Fast and cheap, but skipped verification.
| Axis | Assessment |
|---|---|
| Trustworthiness | Adequate. No destructive behavior, stable execution. |
| Outcome quality | Minimal. Functional but plain. Prioritizes task completion over exceeding expectations. |
| Instruction adherence | Partial. Skipped @bash-gaming/verify entirely, calling @bash-gaming/build three times instead. The defined pipeline was not followed. |
| Cost efficiency | Excellent. ~$1.80 — fastest completion at 12m 24s. |
5. MiniMax M2.5 (Fireworks) — Failed.
| Axis | Assessment |
|---|---|
| Trustworthiness | Low. Ignored expert instructions entirely. |
| Outcome quality | N/A. Produced a browser-based HTML file instead of a CLI game. |
| Instruction adherence | None. Did not follow the expert topology or use the plan/build/verify pipeline. |
| Cost efficiency | N/A. ~$0.13 spent, but no usable output. |
```shell
cd bash-gaming/opus-4-6-anthropic && npm install && npx .
```

Run with perstack@0.0.136. Completed in a single query with no follow-up requests. The coordinator routed plan and verify to Sonnet 4.6 (middle tier), and build to Opus 4.6 (high tier). High instruction adherence — produced a fully functional Ink TUI dungeon crawler with all requested features.
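The `npx .` entry point works because the generated package declares an executable in the standard package.json "bin" field, which npx resolves and runs from the current directory. A minimal illustrative manifest — the name, paths, and version ranges here are assumptions, not copied from the generated game:

```json
{
  "name": "wizardry-crawler",
  "version": "1.0.0",
  "type": "module",
  "bin": {
    "wizardry-crawler": "./dist/cli.js"
  },
  "dependencies": {
    "ink": "^5.0.0",
    "react": "^18.0.0"
  }
}
```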
Cost breakdown by run
| Run | Model | Input | Cached | Uncached | Output |
|---|---|---|---|---|---|
| bash-gaming (coordinator) | Opus 4.6 | 24.7K | — | 24.7K | 2.1K |
| @bash-gaming/plan | Sonnet 4.6 | 7.1M | 6.9M | 200K | 142.6K |
| bash-gaming (coordinator) | Opus 4.6 | 8.4K | — | 8.4K | 11.8K |
| @bash-gaming/build | Opus 4.6 | 4.6M | 4.5M | 100K | 36.1K |
| bash-gaming (coordinator) | Opus 4.6 | 90.0K | 65.8K | 24.2K | 1.0K |
| @bash-gaming/verify | Sonnet 4.6 | 1.6M | 1.5M | 100K | 16.8K |
| bash-gaming (coordinator) | Opus 4.6 | 370.8K | 351.8K | 19.0K | 3.5K |
By model:
| Model | Cost |
|---|---|
| Opus 4.6 | $9.22 |
| Sonnet 4.6 | $6.02 |
| Total | $15.24 |
View execution history:
```shell
cd bash-gaming/opus-4-6-anthropic && npx perstack log --job omd3cvzndvtvbpma0tut38ac
```

```shell
cd bash-gaming/gpt-5-4-openai && npm install && npx .
```

Run with perstack@0.0.136. Completed in a single query with no follow-up requests. The coordinator routed plan to GPT-5 mini (middle tier). However, it never delegated to @bash-gaming/verify — instead it called @bash-gaming/build three times, bypassing the defined plan→build→verify pipeline. The result is functional, but the expert topology was not fully followed.
Cost breakdown by run
| Run | Model | Input | Cached | Uncached | Output |
|---|---|---|---|---|---|
| bash-gaming (coordinator) | GPT-5.4 | 5.8K | 1.2K | 4.6K | 541 |
| @bash-gaming/plan | GPT-5 mini | 3.9K | 1.0K | 2.9K | 4.6K |
| bash-gaming (coordinator) | GPT-5.4 | 5.3K | — | 5.3K | 209 |
| @bash-gaming/build | GPT-5.4 | 1.6M | 1.5M | 100K | 22.5K |
| bash-gaming (coordinator) | GPT-5.4 | 85.9K | 46.6K | 39.3K | 1.2K |
| @bash-gaming/build | GPT-5.4 | 503.1K | 427.5K | 75.6K | 9.3K |
| bash-gaming (coordinator) | GPT-5.4 | 59.0K | 54.3K | 4.7K | 375 |
| @bash-gaming/build | GPT-5.4 | 99.8K | 82.4K | 17.4K | 3.0K |
| bash-gaming (coordinator) | GPT-5.4 | 99.3K | 94.8K | 4.5K | 892 |
By model:
| Model | Pricing (input / cached / output per 1M) | Cost |
|---|---|---|
| GPT-5.4 | $2.50 / $0.25 / $15.00 | ~$1.75 |
| GPT-5 mini | $0.25 / $0.025 / $2.00 | < $0.01 |
| Total | | ~$1.80 |
View execution history:
```shell
cd bash-gaming/gpt-5-4-openai && npx perstack log --job cbtj0h7h3hm6o92m12awertf
```

```shell
cd bash-gaming/google/wizardry-crawler && npm install && npx .
```

Run with perstack@0.0.136. Completed in a single query with no follow-up requests. The coordinator correctly followed the plan→build→verify pipeline, routing plan and verify to Gemini 3 Flash (middle tier). Functional but buggy — the game runs and is playable, but exhibits noticeable gameplay issues.
Cost breakdown by run
| Run | Model | Input | Cached | Uncached | Output |
|---|---|---|---|---|---|
| bash-gaming (coordinator) | Gemini 3.1 Pro | 1.1K | — | 1.1K | 354 |
| @bash-gaming/plan | Gemini 3 Flash | 394.9K | 229.7K | 165.2K | 12.5K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 12.2K | — | 12.2K | 531 |
| @bash-gaming/build | Gemini 3.1 Pro | 492.2K | 379.8K | 112.4K | 12.0K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 3.9K | — | 3.9K | 48 |
| @bash-gaming/verify | Gemini 3 Flash | 289.0K | 60.0K | 229.0K | 4.4K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 18.4K | — | 18.4K | 347 |
| @bash-gaming/build | Gemini 3.1 Pro | 1.6M | 1.4M | 200K | 15.1K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 46.1K | — | 46.1K | 952 |
By model:
| Model | Pricing (input / cached / output per 1M) | Cost |
|---|---|---|
| Gemini 3.1 Pro Preview | $2.00 / $0.20 / $12.00 | ~$1.50 |
| Gemini 3 Flash Preview | $0.50 / $0.05 / $3.00 | ~$0.26 |
| Total | | ~$1.76 |
View execution history:
```shell
cd bash-gaming/google/wizardry-crawler && npx perstack log --job j3oa726f86kyqyqc75nekbbu
```

```shell
cd bash-gaming/fireworks-kimi-k2p5 && npm install && npx .
```

Run with perstack@0.0.136. Kimi K2.5 performed micro-agent orchestration as expected, leveraging delegation across the expert topology to design, implement, test, and iteratively improve the deliverables. Two follow-up requests were made to address environment-specific issues (the game was functional within the harness).
Cost breakdown by run
| Run | Input | Cached | Uncached | Output |
|---|---|---|---|---|
| bash-gaming (coordinator) | 952 | — | 952 | 504 |
| @bash-gaming/plan | 955.8K | 794.1K | 161.7K | 69.5K |
| bash-gaming (coordinator) | 2.5K | 512 | 2.0K | 432 |
| @bash-gaming/build | 12.2M | 11.7M | 500K | 83.1K |
| bash-gaming (coordinator) | 3.9K | 2.0K | 1.9K | 307 |
| @bash-gaming/verify | 1.8M | 1.6M | 200K | 8.6K |
| bash-gaming (coordinator) | 56.9K | 26.1K | 30.8K | 1.3K |
| (follow-up 1) | | | | |
| bash-gaming (coordinator) | 57.9K | 26.6K | 31.3K | 566 |
| @bash-gaming/build | 882.5K | 710.7K | 171.8K | 6.9K |
| bash-gaming (coordinator) | 1.7M | 1.5M | 200K | 3.6K |
| (follow-up 2) | | | | |
| bash-gaming (coordinator) | 306.1K | 205.3K | 100.8K | 700 |
| @bash-gaming/build | 2.2M | 2.1M | 100K | 13.0K |
| bash-gaming (coordinator) | 416.1K | 394.2K | 21.9K | 695 |
All runs used Kimi K2.5 ($0.60 / $0.10 / $3.00 per 1M). Total: ~$3.43
View execution history:
```shell
cd bash-gaming/fireworks-kimi-k2p5 && npx perstack log --job iaitgzq7vdn92fwmu16pzm18
```

Run with perstack@0.0.136. MiniMax M2.5 ignored the expert instructions — it produced a single-file browser-based HTML game (labyrinth.html) instead of an npx-installable CLI game with Ink TUI and AI mode. No npm package structure, no TypeScript, no tests. The coordinator delegated to plan and verify but skipped @bash-gaming/build entirely — the HTML file was written directly during the plan phase.
Cost breakdown by run
| Run | Input | Cached | Uncached | Output |
|---|---|---|---|---|
| bash-gaming (coordinator) | 14.6K | 10.1K | 4.5K | 1.5K |
| @bash-gaming/plan | 172.9K | 111.2K | 61.7K | 29.3K |
| bash-gaming (coordinator) | 3.4K | 2.3K | 1.1K | 346 |
| @bash-gaming/verify | 755.9K | 663.1K | 92.8K | 5.5K |
| bash-gaming (coordinator) | 66.7K | 57.6K | 9.1K | 2.9K |
All runs used MiniMax M2.5 ($0.30 / $0.03 / $1.20 per 1M). Total: ~$0.13
View execution history:
```shell
cd bash-gaming/fireworks-minimax-m2p5 && npx perstack log --job pxpbam5i9zliguib8ivk2itw
```





