Merged
24 commits
a95908f
harnesses: add package skeleton with base, memory, tools
yourconscience May 11, 2026
1c63c3c
harnesses: implement 8 concrete harness classes
yourconscience May 11, 2026
68cb27e
harnesses: HarnessConfig schema + test_factory
yourconscience May 11, 2026
1bb3121
harnesses: implement 8 concrete harness classes, retire agents/
yourconscience May 11, 2026
78babe8
harnesses: HarnessConfig schema + factory
yourconscience May 11, 2026
eeac9f3
harnesses: wire CLI, benchmark, YAML configs
yourconscience May 11, 2026
664452d
tests: migrate to harness API
yourconscience May 11, 2026
1ac851e
docs: reframe as agent harness benchmark
yourconscience May 11, 2026
7cc2a21
docs: add experiment audit
yourconscience May 11, 2026
27328d1
fix: double memory update, compaction guard, test config migration
yourconscience May 11, 2026
5c2aa5c
fix: preserve benchmark result compatibility
yourconscience May 11, 2026
930edda
fix: address PR review feedback
yourconscience May 11, 2026
9ed2679
fix: P2 harness model attribution and harness_id includes system_temp…
yourconscience May 11, 2026
96704c9
fix: hide legacy AgentConfig public export
yourconscience May 11, 2026
cce82dc
docs: consolidate harness documentation
yourconscience May 11, 2026
2276639
fix: address harness review feedback
yourconscience May 11, 2026
42a89b1
fix: preserve legacy agent compatibility
yourconscience May 12, 2026
d9a0f22
fix: validate special harness models
yourconscience May 12, 2026
0a3f0b0
style: format special harness validation
yourconscience May 12, 2026
ff73599
fix: preserve legacy memory routing
yourconscience May 12, 2026
6ae2265
remove legacy agent compatibility
yourconscience May 12, 2026
131ecad
simplify harness template surface
yourconscience May 13, 2026
4f39834
rename non-llm agents to players
yourconscience May 13, 2026
1fe3930
fix harness leaderboard memory mode
yourconscience May 13, 2026
5 changes: 3 additions & 2 deletions README.md
@@ -58,7 +58,7 @@ uv run llm-quest benchmark --config configs/benchmarks/memory_full_transcript.ya
 uv run llm-quest benchmark-report --benchmark-id <id> --output report.md

 # Analyze a single run
-uv run llm-quest analyze-run --run-summary results/<agent>/<quest>/run_<id>/run_summary.json
+uv run llm-quest analyze-run --run-summary results/<harness>/<quest>/run_<id>/run_summary.json

 # Play as human in terminal
 uv run llm-quest play --quest quests/Boat.qm
@@ -107,7 +107,8 @@ Provider-specific keys in `.env`:

 ## Project Structure

-- `llm_quest_benchmark/agents/` - Agent implementations (LLM, planner, tool-augmented)
+- `llm_quest_benchmark/harnesses/` - LLM harness implementations for prompt, memory, tools, and planning experiments
+- `llm_quest_benchmark/players/` - Non-LLM player primitives (`human`, `random_choice`)
 - `llm_quest_benchmark/prompt_templates/` - Jinja2 prompt templates for the public context-scaffold taxonomy
 - `llm_quest_benchmark/executors/` - CLI, benchmark orchestration, TS bridge
 - `configs/benchmarks/` - YAML benchmark configurations
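The updated `analyze-run` example keys result directories by harness instead of agent. A minimal sketch of building that path, assuming the layout `results/<harness>/<quest>/run_<id>/run_summary.json` from the README hunk above; the helper name `run_summary_path` and the example values are hypothetical and not part of the project:

```python
from pathlib import PurePosixPath


def run_summary_path(harness: str, quest: str, run_id: str, root: str = "results") -> PurePosixPath:
    """Build results/<harness>/<quest>/run_<id>/run_summary.json (layout taken from the README)."""
    return PurePosixPath(root) / harness / quest / f"run_{run_id}" / "run_summary.json"


# Example values are illustrative only.
print(run_summary_path("memo_compact", "Boat", "42"))
```

The resulting string could be passed to `--run-summary`; whether existing result directories are renamed by this PR is not shown in the diff.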
20 changes: 8 additions & 12 deletions configs/benchmarks/balanced_gpt5mini_all_modes.yaml
@@ -21,49 +21,45 @@ quests:
 agents:
   # 1. Minimal prompt
   - model: gpt-5-mini
-    template: stub
+    harness: minimal
     temperature: 0.4
     runs: 3
   # 2. Short-context reasoning
   - model: gpt-5-mini
-    template: reasoning
+    harness: reasoning_recent
     temperature: 0.4
     runs: 3
   # 3. Full-history reasoning
   - model: gpt-5-mini
-    template: reasoning
+    harness: reasoning_full
     temperature: 0.4
     runs: 3
-    memory_mode: full_transcript
   # 4. Compact memory / memo
   - model: gpt-5-mini
-    template: stateful_compact
+    harness: memo_compact
     temperature: 0.4
     runs: 3
-    memory_mode: compaction
     compaction_interval: 50
   # 5. Prompt hints
   - model: gpt-5-mini
-    template: light_hints
+    harness: hinted_compact
     temperature: 0.4
     runs: 3
   # 6. Tools + compact memory
   - model: gpt-5-mini
-    template: tool_augmented
+    harness: tool_compact
     temperature: 0.4
     runs: 3
-    memory_mode: compaction
     compaction_interval: 50
   # 7. Tools + hints + compact memory
   - model: gpt-5-mini
-    template: tool_augmented_hints
+    harness: tool_hinted
     temperature: 0.4
     runs: 3
-    memory_mode: compaction
     compaction_interval: 50
   # 8. Planner loop
   - model: gpt-5-mini
-    template: planner
+    harness: planner
     temperature: 0.4
     runs: 3
 debug: false
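In this file each `template` (plus optional `memory_mode`) pair collapses into a single `harness` name. The sketch below restates that mapping as data, inferred only from the eight entries in `balanced_gpt5mini_all_modes.yaml`; `LEGACY_TO_HARNESS` and `migrate_agent_entry` are hypothetical names, and the project's own factory may resolve harnesses differently:

```python
# Mapping inferred from the balanced_gpt5mini_all_modes.yaml diff; not the project's code.
LEGACY_TO_HARNESS = {
    ("stub", None): "minimal",
    ("reasoning", None): "reasoning_recent",
    ("reasoning", "full_transcript"): "reasoning_full",
    ("stateful_compact", "compaction"): "memo_compact",
    ("light_hints", None): "hinted_compact",
    ("tool_augmented", "compaction"): "tool_compact",
    ("tool_augmented_hints", "compaction"): "tool_hinted",
    ("planner", None): "planner",
}


def migrate_agent_entry(entry: dict) -> dict:
    """Return a copy of a legacy agent entry that uses `harness` instead of template/memory_mode."""
    key = (entry.get("template"), entry.get("memory_mode"))
    if key not in LEGACY_TO_HARNESS:
        # Refuse to guess for pairs this file does not cover.
        raise ValueError(f"no known harness for template/memory_mode pair {key}")
    migrated = {k: v for k, v in entry.items() if k not in ("template", "memory_mode")}
    migrated["harness"] = LEGACY_TO_HARNESS[key]
    return migrated


legacy = {"model": "gpt-5-mini", "template": "stateful_compact", "temperature": 0.4,
          "runs": 3, "memory_mode": "compaction", "compaction_interval": 50}
print(migrate_agent_entry(legacy))
```

Keys other than `template` and `memory_mode` (for example `compaction_interval`) are carried over unchanged, which matches what the hunk keeps as context.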
3 changes: 1 addition & 2 deletions configs/benchmarks/exp3_no_loop_breaker.yaml
@@ -24,10 +24,9 @@ quests:
   - quests/sr_2_1_2121_eng/Sortirovka1_eng.qm
 agents:
   - model: "openrouter:google/gemini-3-flash-preview"
-    template: reasoning
+    harness: reasoning_full
     temperature: 0.4
     runs: 2
-    memory_mode: full_transcript
 debug: false
 quest_timeout: 600
 max_workers: 2
3 changes: 1 addition & 2 deletions configs/benchmarks/exp3_stateful_compact.yaml
@@ -24,10 +24,9 @@ quests:
   - quests/sr_2_1_2121_eng/Sortirovka1_eng.qm
 agents:
   - model: "openrouter:google/gemini-3-flash-preview"
-    template: stateful_compact
+    harness: memo_compact
     temperature: 0.4
     runs: 2
-    memory_mode: compaction
     compaction_interval: 50
 debug: false
 quest_timeout: 600
3 changes: 1 addition & 2 deletions configs/benchmarks/exp4_compaction_no_memo.yaml
@@ -24,10 +24,9 @@ quests:
   - quests/sr_2_1_2121_eng/Sortirovka1_eng.qm
 agents:
   - model: "openrouter:google/gemini-3-flash-preview"
-    template: reasoning
+    harness: compaction_no_memo
     temperature: 0.4
     runs: 2
-    memory_mode: compaction
     compaction_interval: 50
 debug: false
 quest_timeout: 600
3 changes: 1 addition & 2 deletions configs/benchmarks/exp4_memo_cot.yaml
@@ -24,10 +24,9 @@ quests:
   - quests/sr_2_1_2121_eng/Sortirovka1_eng.qm
 agents:
   - model: "openrouter:google/gemini-3-flash-preview"
-    template: memo_cot
+    harness: memo_cot
     temperature: 0.4
     runs: 2
-    memory_mode: compaction
     compaction_interval: 50
 debug: false
 quest_timeout: 600
3 changes: 1 addition & 2 deletions configs/benchmarks/exp4_memo_extended.yaml
@@ -24,10 +24,9 @@ quests:
   - quests/sr_2_1_2121_eng/Sortirovka1_eng.qm
 agents:
   - model: "openrouter:google/gemini-3-flash-preview"
-    template: memo_extended
+    harness: memo_extended
     temperature: 0.4
     runs: 2
-    memory_mode: compaction
     compaction_interval: 50
 debug: false
 quest_timeout: 600
3 changes: 1 addition & 2 deletions configs/benchmarks/exp4_memo_structured.yaml
@@ -24,10 +24,9 @@ quests:
   - quests/sr_2_1_2121_eng/Sortirovka1_eng.qm
 agents:
   - model: "openrouter:google/gemini-3-flash-preview"
-    template: memo_structured
+    harness: memo_structured
     temperature: 0.4
     runs: 2
-    memory_mode: compaction
     compaction_interval: 50
 debug: false
 quest_timeout: 600
3 changes: 1 addition & 2 deletions configs/benchmarks/exp5_stateful_compact_variance.yaml
@@ -24,10 +24,9 @@ quests:
   - quests/sr_2_1_2121_eng/Sortirovka1_eng.qm
 agents:
   - model: "openrouter:google/gemini-3-flash-preview"
-    template: stateful_compact
+    harness: memo_compact
     temperature: 0.4
     runs: 5
-    memory_mode: compaction
     compaction_interval: 50
 debug: false
 quest_timeout: 600
3 changes: 1 addition & 2 deletions configs/benchmarks/exp6_prompt_hints.yaml
@@ -24,10 +24,9 @@ quests:
   - quests/sr_2_1_2121_eng/Sortirovka1_eng.qm
 agents:
   - model: "openrouter:google/gemini-3-flash-preview"
-    template: stateful_compact_hints
+    harness: hinted_compact
     temperature: 0.4
     runs: 3
-    memory_mode: compaction
     compaction_interval: 50
 debug: false
 quest_timeout: 600
3 changes: 1 addition & 2 deletions configs/benchmarks/exp6_tools.yaml
@@ -24,10 +24,9 @@ quests:
   - quests/sr_2_1_2121_eng/Sortirovka1_eng.qm
 agents:
   - model: "openrouter:google/gemini-3-flash-preview"
-    template: tool_augmented
+    harness: tool_compact
     temperature: 0.4
     runs: 3
-    memory_mode: compaction
     compaction_interval: 50
 debug: false
 quest_timeout: 600
3 changes: 1 addition & 2 deletions configs/benchmarks/exp6_tools_hints.yaml
@@ -24,10 +24,9 @@ quests:
   - quests/sr_2_1_2121_eng/Sortirovka1_eng.qm
 agents:
   - model: "openrouter:google/gemini-3-flash-preview"
-    template: tool_augmented_hints
+    harness: tool_hinted
     temperature: 0.4
     runs: 3
-    memory_mode: compaction
     compaction_interval: 50
 debug: false
 quest_timeout: 600
3 changes: 1 addition & 2 deletions configs/benchmarks/exp6_unified_tools_screen.yaml
@@ -24,10 +24,9 @@ quests:
   - quests/sr_2_1_2121_eng/Sortirovka1_eng.qm
 agents:
   - model: "openrouter:google/gemini-3-flash-preview"
-    template: tool_augmented
+    harness: tool_compact
     temperature: 0.4
     runs: 2
-    memory_mode: compaction
     compaction_interval: 50
 debug: false
 quest_timeout: 600
3 changes: 1 addition & 2 deletions configs/benchmarks/exp7_deepseek.yaml
@@ -7,10 +7,9 @@ quests:
   - quests/sr_2_1_2121_eng/Robots_eng.qm
 agents:
   - model: "openrouter:deepseek/deepseek-chat-v3-0324"
-    template: stateful_compact
+    harness: memo_compact
     temperature: 0.4
     runs: 3
-    memory_mode: compaction
     compaction_interval: 50
 debug: false
 quest_timeout: 600
3 changes: 1 addition & 2 deletions configs/benchmarks/exp7_haiku.yaml
@@ -7,10 +7,9 @@ quests:
   - quests/sr_2_1_2121_eng/Robots_eng.qm
 agents:
   - model: "anthropic:claude-3-5-haiku-latest"
-    template: stateful_compact
+    harness: memo_compact
     temperature: 0.4
     runs: 3
-    memory_mode: compaction
     compaction_interval: 50
 debug: false
 quest_timeout: 600
3 changes: 1 addition & 2 deletions configs/benchmarks/exp7_llama.yaml
@@ -7,10 +7,9 @@ quests:
   - quests/sr_2_1_2121_eng/Robots_eng.qm
 agents:
   - model: "openrouter:meta-llama/llama-4-scout"
-    template: stateful_compact
+    harness: memo_compact
     temperature: 0.4
     runs: 3
-    memory_mode: compaction
     compaction_interval: 50
 debug: false
 quest_timeout: 600
3 changes: 1 addition & 2 deletions configs/benchmarks/exp7_mistral.yaml
@@ -7,10 +7,9 @@ quests:
   - quests/sr_2_1_2121_eng/Robots_eng.qm
 agents:
   - model: "openrouter:mistralai/mistral-small-2603"
-    template: stateful_compact
+    harness: memo_compact
     temperature: 0.4
     runs: 3
-    memory_mode: compaction
     compaction_interval: 50
 debug: false
 quest_timeout: 600
3 changes: 1 addition & 2 deletions configs/benchmarks/exp7_qwen.yaml
@@ -7,10 +7,9 @@ quests:
   - quests/sr_2_1_2121_eng/Robots_eng.qm
 agents:
   - model: "openrouter:qwen/qwen3-30b-a3b"
-    template: stateful_compact
+    harness: memo_compact
     temperature: 0.4
     runs: 3
-    memory_mode: compaction
     compaction_interval: 50
 debug: false
 quest_timeout: 600
11 changes: 4 additions & 7 deletions configs/benchmarks/exp7b_model_upgrades.yaml
@@ -20,22 +20,19 @@ quests:
   - quests/sr_2_1_2121_eng/Sortirovka1_eng.qm
 agents:
   - model: "openrouter:deepseek/deepseek-v4-flash"
-    template: stateful_compact
+    harness: memo_compact
     temperature: 0.4
     runs: 2
-    memory_mode: compaction
     compaction_interval: 50
   - model: "openrouter:qwen/qwen3.6-flash"
-    template: stateful_compact
+    harness: memo_compact
     temperature: 0.4
     runs: 2
-    memory_mode: compaction
     compaction_interval: 50
-  - model: "claude:claude-haiku-4-5-20251001"
-    template: stateful_compact
+  - model: "anthropic:claude-haiku-4-5-20251001"
+    harness: memo_compact
     temperature: 0.4
     runs: 2
-    memory_mode: compaction
     compaction_interval: 50
 debug: false
 quest_timeout: 600
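The last entry in this file changes the provider prefix from `claude:` to `anthropic:`, while other configs use `openrouter:` or a bare model name such as `gpt-5-mini`. A small sketch of splitting such `provider:model` strings follows; `split_model_spec` is a hypothetical helper, and how the project itself parses or validates these strings is not shown in this diff:

```python
from typing import Optional, Tuple


def split_model_spec(spec: str) -> Tuple[Optional[str], str]:
    """Split 'provider:model' on the first colon; bare names return (None, name)."""
    provider, sep, name = spec.partition(":")
    if not sep:
        # No colon at all, e.g. "gpt-5-mini".
        return None, spec
    return provider, name


for spec in ("anthropic:claude-haiku-4-5-20251001",
             "openrouter:google/gemini-3-flash-preview",
             "gpt-5-mini"):
    print(split_model_spec(spec))
```

Splitting on the first colon only keeps any later colons inside the model name intact.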
18 changes: 6 additions & 12 deletions configs/benchmarks/memory_compaction.yaml
@@ -18,45 +18,39 @@ quests:
 agents:
   # Gemini 3 Flash - compaction interval 10
   - model: "openrouter:google/gemini-3-flash-preview"
-    template: reasoning
+    harness: memo_compact
     temperature: 0.4
     runs: 3
-    memory_mode: compaction
     compaction_interval: 10
   # Gemini 3 Flash - compaction interval 20
   - model: "openrouter:google/gemini-3-flash-preview"
-    template: reasoning
+    harness: memo_compact
     temperature: 0.4
     runs: 3
-    memory_mode: compaction
     compaction_interval: 20
   # GPT-5.4 Mini - compaction interval 10
   - model: "openrouter:openai/gpt-5.4-mini"
-    template: reasoning
+    harness: memo_compact
     temperature: 0.4
     runs: 3
-    memory_mode: compaction
     compaction_interval: 10
   # GPT-5.4 Mini - compaction interval 20
   - model: "openrouter:openai/gpt-5.4-mini"
-    template: reasoning
+    harness: memo_compact
     temperature: 0.4
     runs: 3
-    memory_mode: compaction
     compaction_interval: 20
   # DeepSeek V3.2 - compaction interval 10
   - model: "openrouter:deepseek/deepseek-v3.2"
-    template: reasoning
+    harness: memo_compact
     temperature: 0.4
     runs: 3
-    memory_mode: compaction
     compaction_interval: 10
   # DeepSeek V3.2 - compaction interval 20
   - model: "openrouter:deepseek/deepseek-v3.2"
-    template: reasoning
+    harness: memo_compact
     temperature: 0.4
     runs: 3
-    memory_mode: compaction
     compaction_interval: 20
 debug: false
 quest_timeout: 600
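The six entries above are a 3 x 2 grid: three models, each at `compaction_interval` 10 and 20, all on the `memo_compact` harness. A sketch of generating the same list programmatically is shown below; `compaction_grid` is a hypothetical helper, and the defaults simply mirror the values in this file:

```python
from itertools import product

MODELS = [
    "openrouter:google/gemini-3-flash-preview",
    "openrouter:openai/gpt-5.4-mini",
    "openrouter:deepseek/deepseek-v3.2",
]
INTERVALS = [10, 20]


def compaction_grid(models, intervals, harness="memo_compact", temperature=0.4, runs=3):
    """Build one agent entry per (model, interval) pair, model-major like the YAML file."""
    return [
        {
            "model": model,
            "harness": harness,
            "temperature": temperature,
            "runs": runs,
            "compaction_interval": interval,
        }
        for model, interval in product(models, intervals)
    ]


agents = compaction_grid(MODELS, INTERVALS)
print(len(agents))
```

Serializing `agents` under an `agents:` key would reproduce the shape of the new config, though the hand-written YAML with its comments is what the PR actually ships.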
9 changes: 3 additions & 6 deletions configs/benchmarks/memory_full_transcript.yaml
@@ -18,22 +18,19 @@ quests:
 agents:
   # Gemini 3 Flash - full transcript
   - model: "openrouter:google/gemini-3-flash-preview"
-    template: reasoning
+    harness: reasoning_full
     temperature: 0.4
     runs: 3
-    memory_mode: full_transcript
   # GPT-5.4 Mini - full transcript
   - model: "openrouter:openai/gpt-5.4-mini"
-    template: reasoning
+    harness: reasoning_full
     temperature: 0.4
     runs: 3
-    memory_mode: full_transcript
   # DeepSeek V3.2 - full transcript
   - model: "openrouter:deepseek/deepseek-v3.2"
-    template: reasoning
+    harness: reasoning_full
     temperature: 0.4
     runs: 3
-    memory_mode: full_transcript
 debug: false
 quest_timeout: 600
 max_workers: 2
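Across these config diffs, every agent entry loses `template` and `memory_mode` and gains `harness`. A quick way to check that a parsed config has been fully migrated is sketched below; `check_agents` is a hypothetical helper operating on already-loaded dictionaries, and it assumes nothing about the project's own `HarnessConfig` validation:

```python
from typing import Dict, List

LEGACY_KEYS = ("template", "memory_mode")


def check_agents(agents: List[Dict]) -> List[str]:
    """Return human-readable problems for entries that still look like the old schema."""
    problems = []
    for index, entry in enumerate(agents):
        for key in LEGACY_KEYS:
            if key in entry:
                problems.append(f"agents[{index}]: legacy key '{key}' still present")
        if "harness" not in entry:
            problems.append(f"agents[{index}]: missing 'harness'")
    return problems


migrated = [{"model": "openrouter:deepseek/deepseek-v3.2", "harness": "reasoning_full",
             "temperature": 0.4, "runs": 3}]
print(check_agents(migrated))
```

An empty list means no legacy keys were found; it does not confirm that the harness name itself is one the project recognizes.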