perstack-ai/demo-catalog
Perstack: Demo Catalog

Demo repository for Perstack. Runs the same task across different providers and compares results.


bash-gaming

Game generation demo using the bash-gaming expert defined in perstack.toml — a multi-agent team that autonomously designs, implements, tests, and packages CLI games.

  • Definition: perstack.toml

  • Providers:

    • Anthropic
    • OpenAI
    • Google
    • Fireworks
  • Models:

    • Anthropic: Opus 4.6, Sonnet 4.6
    • OpenAI: GPT-5.4, GPT-5 mini
    • Google: Gemini 3.1 Pro, Gemini 3 Flash
    • Fireworks:
      • Kimi K2.5
      • MiniMax M2.5

Expert Topology

bash-gaming (coordinator)
├── @bash-gaming/plan
├── @bash-gaming/build
└── @bash-gaming/verify

Each expert has a defaultModelTier (high or middle). Perstack automatically selects the appropriate model within the provider based on this tier. The --model flag sets the base model; delegates may use a different model from the same provider depending on their tier.

| Expert | Model Tier | Role |
|---|---|---|
| bash-gaming (coordinator) | high | Coordinates the entire task and delegates to the appropriate experts. |
| @bash-gaming/plan | middle | Expands requirements, defines dual-mode API contract, npm package structure. |
| @bash-gaming/build | high | Implements Ink+React TUI, --ai JSON mode, game logic, npm packaging. |
| @bash-gaming/verify | middle | Validates npx installability, AI mode deterministic output, TUI playthrough tests. |
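
This wiring might look roughly like the following in perstack.toml. This is an illustrative sketch only: `defaultModelTier` and the expert names come from this README, but the table layout, `description`, and `delegates` keys are assumptions, not Perstack's confirmed schema.

```toml
# Hypothetical sketch of the bash-gaming expert topology.
# Field names other than defaultModelTier are illustrative.
[experts.bash-gaming]
description = "Coordinates the entire task and delegates to the appropriate experts."
defaultModelTier = "high"
delegates = ["@bash-gaming/plan", "@bash-gaming/build", "@bash-gaming/verify"]

[experts."bash-gaming/plan"]
defaultModelTier = "middle"

[experts."bash-gaming/build"]
defaultModelTier = "high"

[experts."bash-gaming/verify"]
defaultModelTier = "middle"
```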

Task (Query)

Create a Wizardry-like dungeon crawler in a fixed 10-floor labyrinth with complex layouts, traps, fixed room encounters, and random battles. Include special-effect gear drops, leveling, and a skill tree for one playable character. Balance difficulty around build optimization. Death in the dungeon causes loss of one random equipped item.

Execution Command

Replace --provider, --model, and result directory for each run.

```sh
docker run --pull always --rm -it \
  --env-file .env \
  -v ./<result-dir>:/workspace \
  -v ./bash-gaming/perstack.toml:/definitions/perstack.toml:ro \
  perstack/perstack start bash-gaming \
    --config /definitions/perstack.toml \
    --provider <provider> \
    --model <model> \
    "Create a Wizardry-like dungeon crawler in a fixed 10-floor labyrinth with complex layouts, traps, fixed room encounters, and random battles. Include special-effect gear drops, leveling, and a skill tree for one playable character. Balance difficulty around build optimization. Death in the dungeon causes loss of one random equipped item."
```

Completion Criteria

The game must be playable via npx . in the result directory — an Ink-based TUI launches and the dungeon crawler is interactive.
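
What makes `npx .` launch the game is a `bin` entry in the package's package.json. A minimal sketch, assuming an Ink-based setup — the package name, entry-point path, and version ranges here are illustrative, not taken from the actual result directories:

```json
{
  "name": "wizardry-crawler",
  "version": "1.0.0",
  "type": "module",
  "bin": { "wizardry-crawler": "./dist/cli.js" },
  "dependencies": {
    "ink": "^5.0.0",
    "react": "^18.3.0"
  }
}
```

With this in place, `npx .` resolves the local package, installs its dependencies if needed, and runs the `bin` script.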

Results

| Provider | Models Used | Works | Directory | Steps | Duration | Input Tokens | Cached | Output Tokens | Cost |
|---|---|---|---|---|---|---|---|---|---|
| Anthropic | Opus 4.6, Sonnet 4.6 | ✓ | anthropic/ | 173 | 51m 18s | 13.8M | 13.3M (96.07%) | 213.9K | $15.24 |
| OpenAI | GPT-5.4, GPT-5 mini | ✓ | openai/ | 118 | 12m 24s | 2.4M | 2.2M (89.56%) | 42.6K | ~$1.80 |
| Google | Gemini 3.1 Pro, Gemini 3 Flash | ✓ | google/ | 163 | 16m 31s | 2.9M | 2.1M (72.69%) | 46.2K | ~$1.76 |
| Fireworks | Kimi K2.5 | ✓ | fireworks-kimi-k2p5/ | 324 | 1h 46m | 20.6M | 19.0M (92.13%) | 189.1K | ~$3.43 |
| Fireworks | MiniMax M2.5 | ✗ | fireworks-minimax-m2p5/ | 59 | 5m 49s | 1.0M | 844.4K (83.31%) | 39.7K | ~$0.13 |

Thoughts

Building production-grade agentic apps requires models that perform well at an operationally viable cost. Model performance itself has multiple axes:

  • Trustworthiness — Does the model avoid self-destructive behavior (e.g., corrupting its own workspace, breaking the environment)? Can a user delegate a task and walk away? This is a knockout factor — if a model cannot be trusted, nothing else matters.
  • Outcome quality — Does the model exceed expectations, or merely satisfy them? Does it produce deliverables that go beyond the literal request?
  • Instruction adherence — Does the model follow the defined expert topology and constraints? Note: blind compliance is not the same as good adherence — over-literal following at the expense of the task goal is a performance regression.
  • Cost efficiency — Is the result achievable at a price point that makes the agentic app commercially viable?

This evaluation assessed each provider across these axes.

1. Kimi K2.5 (Fireworks) — Best overall.

| Axis | Assessment |
|---|---|
| Trustworthiness | High. No self-destructive behavior. Two follow-ups were needed, but only for environment-specific issues — the game was functional within the harness. |
| Outcome quality | Excellent. Consumed high input tokens and iterated extensively, delivering a polished game comparable to Anthropic's output. |
| Instruction adherence | Full. Respected the plan→build→verify pipeline and met quality gates defined in the expert topology. |
| Cost efficiency | Strong. ~$3.43 — achieved near-Anthropic quality at ~1/4 the cost. |

2. Anthropic (Opus 4.6 + Sonnet 4.6) — Highest quality, but cost is prohibitive.

| Axis | Assessment |
|---|---|
| Trustworthiness | High. Stable, no environment issues. Completed in a single query. |
| Outcome quality | Excellent. Rich Ink TUI with polish beyond the minimum requirements. However, a significant portion of output tokens went to excessive documentation rather than code — a misallocation of effort. |
| Instruction adherence | Full. Correctly routed delegates via defaultModelTier and followed the plan→build→verify pipeline. |
| Cost efficiency | Poor. $15.24 is prohibitive for third-party agentic apps. The Opus + Sonnet combination works architecturally, but the price makes it unviable at scale. |

3. Google (Gemini 3.1 Pro + Gemini 3 Flash) — Efficient and compliant, but shallow.

| Axis | Assessment |
|---|---|
| Trustworthiness | Adequate. No destructive behavior, but the output has noticeable bugs. |
| Outcome quality | Minimal. Functional but lacks depth and polish. Does not aim beyond satisfying the literal requirements. |
| Instruction adherence | Full. Correctly followed plan→build→verify, routing delegates to Flash via defaultModelTier. |
| Cost efficiency | Excellent. ~$1.76 — cheapest successful run. |

4. OpenAI (GPT-5.4 + GPT-5 mini) — Fast and cheap, but skipped verification.

| Axis | Assessment |
|---|---|
| Trustworthiness | Adequate. No destructive behavior, stable execution. |
| Outcome quality | Minimal. Functional but plain. Prioritizes task completion over exceeding expectations. |
| Instruction adherence | Partial. Skipped @bash-gaming/verify entirely, calling @bash-gaming/build three times instead. The defined pipeline was not followed. |
| Cost efficiency | Excellent. ~$1.80 — fastest completion at 12m 24s. |

5. MiniMax M2.5 (Fireworks) — Failed.

| Axis | Assessment |
|---|---|
| Trustworthiness | Low. Ignored expert instructions entirely. |
| Outcome quality | N/A. Produced a browser-based HTML file instead of a CLI game. |
| Instruction adherence | None. Did not follow the expert topology or use the plan/build/verify pipeline. |
| Cost efficiency | N/A. ~$0.13 spent, but no usable output. |

Anthropic — Opus 4.6 + Sonnet 4.6

cd bash-gaming/opus-4-6-anthropic && npm install && npx .

Anthropic title

Anthropic gameplay

Run with perstack@0.0.136. Completed in a single query with no follow-up requests. The coordinator routed plan and verify to Sonnet 4.6 (middle tier), and build to Opus 4.6 (high tier). High instruction adherence — produced a fully functional Ink TUI dungeon crawler with all requested features.

Cost breakdown by run
| Run | Model | Input | Cached | Uncached | Output |
|---|---|---|---|---|---|
| bash-gaming (coordinator) | Opus 4.6 | 24.7K | — | 24.7K | 2.1K |
| @bash-gaming/plan | Sonnet 4.6 | 7.1M | 6.9M | 200K | 142.6K |
| bash-gaming (coordinator) | Opus 4.6 | 8.4K | — | 8.4K | 11.8K |
| @bash-gaming/build | Opus 4.6 | 4.6M | 4.5M | 100K | 36.1K |
| bash-gaming (coordinator) | Opus 4.6 | 90.0K | 65.8K | 24.2K | 1.0K |
| @bash-gaming/verify | Sonnet 4.6 | 1.6M | 1.5M | 100K | 16.8K |
| bash-gaming (coordinator) | Opus 4.6 | 370.8K | 351.8K | 19.0K | 3.5K |

By model:

| Model | Cost |
|---|---|
| Opus 4.6 | $9.22 |
| Sonnet 4.6 | $6.02 |
| Total | $15.24 |

View execution history:

cd bash-gaming/opus-4-6-anthropic && npx perstack log --job omd3cvzndvtvbpma0tut38ac

OpenAI — GPT-5.4 + GPT-5 mini

cd bash-gaming/gpt-5-4-openai && npm install && npx .

OpenAI gameplay

Run with perstack@0.0.136. Completed in a single query with no follow-up requests. The coordinator routed plan to GPT-5 mini (middle tier). However, it never delegated to @bash-gaming/verify — instead it called @bash-gaming/build three times, bypassing the defined plan→build→verify pipeline. The result is functional, but the expert topology was not fully followed.

Cost breakdown by run
| Run | Model | Input | Cached | Uncached | Output |
|---|---|---|---|---|---|
| bash-gaming (coordinator) | GPT-5.4 | 5.8K | 1.2K | 4.6K | 541 |
| @bash-gaming/plan | GPT-5 mini | 3.9K | 1.0K | 2.9K | 4.6K |
| bash-gaming (coordinator) | GPT-5.4 | 5.3K | — | 5.3K | 209 |
| @bash-gaming/build | GPT-5.4 | 1.6M | 1.5M | 100K | 22.5K |
| bash-gaming (coordinator) | GPT-5.4 | 85.9K | 46.6K | 39.3K | 1.2K |
| @bash-gaming/build | GPT-5.4 | 503.1K | 427.5K | 75.6K | 9.3K |
| bash-gaming (coordinator) | GPT-5.4 | 59.0K | 54.3K | 4.7K | 375 |
| @bash-gaming/build | GPT-5.4 | 99.8K | 82.4K | 17.4K | 3.0K |
| bash-gaming (coordinator) | GPT-5.4 | 99.3K | 94.8K | 4.5K | 892 |

By model:

| Model | Pricing (input / cached / output per 1M) | Cost |
|---|---|---|
| GPT-5.4 | $2.50 / $0.25 / $15.00 | ~$1.75 |
| GPT-5 mini | $0.25 / $0.025 / $2.00 | < $0.01 |
| Total | | ~$1.80 |

View execution history:

cd bash-gaming/gpt-5-4-openai && npx perstack log --job cbtj0h7h3hm6o92m12awertf

Google — Gemini 3.1 Pro + Gemini 3 Flash

cd bash-gaming/google/wizardry-crawler && npm install && npx .

Google gameplay

Run with perstack@0.0.136. Completed in a single query with no follow-up requests. The coordinator correctly followed the plan→build→verify pipeline, routing plan and verify to Gemini 3 Flash (middle tier). Functional but buggy — the game runs and is playable, but exhibits noticeable gameplay issues.

Cost breakdown by run
| Run | Model | Input | Cached | Uncached | Output |
|---|---|---|---|---|---|
| bash-gaming (coordinator) | Gemini 3.1 Pro | 1.1K | — | 1.1K | 354 |
| @bash-gaming/plan | Gemini 3 Flash | 394.9K | 229.7K | 165.2K | 12.5K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 12.2K | — | 12.2K | 531 |
| @bash-gaming/build | Gemini 3.1 Pro | 492.2K | 379.8K | 112.4K | 12.0K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 3.9K | — | 3.9K | 48 |
| @bash-gaming/verify | Gemini 3 Flash | 289.0K | 60.0K | 229.0K | 4.4K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 18.4K | — | 18.4K | 347 |
| @bash-gaming/build | Gemini 3.1 Pro | 1.6M | 1.4M | 200K | 15.1K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 46.1K | — | 46.1K | 952 |

By model:

| Model | Pricing (input / cached / output per 1M) | Cost |
|---|---|---|
| Gemini 3.1 Pro Preview | $2.00 / $0.20 / $12.00 | ~$1.50 |
| Gemini 3 Flash Preview | $0.50 / $0.05 / $3.00 | ~$0.26 |
| Total | | ~$1.76 |

View execution history:

cd bash-gaming/google/wizardry-crawler && npx perstack log --job j3oa726f86kyqyqc75nekbbu

Fireworks — Kimi K2.5

cd bash-gaming/kimi-k2p5-fireworks && npm install && npx .

Kimi K2.5 title

Kimi K2.5 gameplay

Run with perstack@0.0.136. Kimi K2.5 performed micro-agent orchestration as expected, leveraging delegation across the expert topology to design, implement, test, and iteratively improve the deliverables. Two follow-up requests were made to address environment-specific issues (the game was functional within the harness).

Cost breakdown by run
| Run | Input | Cached | Uncached | Output |
|---|---|---|---|---|
| bash-gaming (coordinator) | 952 | — | 952 | 504 |
| @bash-gaming/plan | 955.8K | 794.1K | 161.7K | 69.5K |
| bash-gaming (coordinator) | 2.5K | 512 | 2.0K | 432 |
| @bash-gaming/build | 12.2M | 11.7M | 500K | 83.1K |
| bash-gaming (coordinator) | 3.9K | 2.0K | 1.9K | 307 |
| @bash-gaming/verify | 1.8M | 1.6M | 200K | 8.6K |
| bash-gaming (coordinator) | 56.9K | 26.1K | 30.8K | 1.3K |
| *(follow-up 1)* | | | | |
| bash-gaming (coordinator) | 57.9K | 26.6K | 31.3K | 566 |
| @bash-gaming/build | 882.5K | 710.7K | 171.8K | 6.9K |
| bash-gaming (coordinator) | 1.7M | 1.5M | 200K | 3.6K |
| *(follow-up 2)* | | | | |
| bash-gaming (coordinator) | 306.1K | 205.3K | 100.8K | 700 |
| @bash-gaming/build | 2.2M | 2.1M | 100K | 13.0K |
| bash-gaming (coordinator) | 416.1K | 394.2K | 21.9K | 695 |

All runs used Kimi K2.5 ($0.60 / $0.10 / $3.00 per 1M). Total: ~$3.43
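
The ~$3.43 total can be sanity-checked from the headline token counts. This is a rough recomputation under a simple assumption: uncached input = total input − cached input, with each bucket billed at its listed per-1M rate. The rounded totals from the results table are used, so the answer is approximate by construction.

```python
# Rough recomputation of the Kimi K2.5 run cost from the rounded
# totals in the results table. Token counts are in millions of tokens;
# prices are USD per 1M tokens ($0.60 uncached / $0.10 cached / $3.00 output).
input_tok, cached_tok, output_tok = 20.6, 19.0, 0.1891  # 20.6M / 19.0M / 189.1K
price_uncached, price_cached, price_output = 0.60, 0.10, 3.00

uncached_tok = input_tok - cached_tok  # assumption: input splits into cached + uncached
cost = (uncached_tok * price_uncached
        + cached_tok * price_cached
        + output_tok * price_output)
print(f"~${cost:.2f}")  # matches the reported ~$3.43
```

The same arithmetic applies to the other providers' totals, modulo their per-model pricing splits.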

View execution history:

cd bash-gaming/fireworks-kimi-k2p5 && npx perstack log --job iaitgzq7vdn92fwmu16pzm18

Fireworks — MiniMax M2.5

MiniMax M2.5 screenshot

Run with perstack@0.0.136. MiniMax M2.5 ignored the expert instructions — it produced a single-file browser-based HTML game (labyrinth.html) instead of an npx-installable CLI game with Ink TUI and AI mode. No npm package structure, no TypeScript, no tests. The coordinator delegated to plan and verify but skipped @bash-gaming/build entirely — the HTML file was written directly during the plan phase.

Cost breakdown by run
| Run | Input | Cached | Uncached | Output |
|---|---|---|---|---|
| bash-gaming (coordinator) | 14.6K | 10.1K | 4.5K | 1.5K |
| @bash-gaming/plan | 172.9K | 111.2K | 61.7K | 29.3K |
| bash-gaming (coordinator) | 3.4K | 2.3K | 1.1K | 346 |
| @bash-gaming/verify | 755.9K | 663.1K | 92.8K | 5.5K |
| bash-gaming (coordinator) | 66.7K | 57.6K | 9.1K | 2.9K |

All runs used MiniMax M2.5 ($0.30 / $0.03 / $1.20 per 1M). Total: ~$0.13

View execution history:

cd bash-gaming/fireworks-minimax-m2p5 && npx perstack log --job pxpbam5i9zliguib8ivk2itw
