Agents must beat unmanaged baseline #6
Open
The Problem
First paired benchmark against a live RimWorld colony shows agents are not helping:
Agent: 0.801 ± 0.03
Baseline: 0.830 ± 0.00
Delta: -0.029 (p = 0.37)
The unmanaged colony (RimWorld's built-in pawn AI) scores higher than our 6-agent team, though at p = 0.37 the gap is not statistically significant. The agents appear net-negative: they issue actions that fail or disrupt colonist routines.
Why Agents Are Losing
1. High action failure rate
- set_growing_zone → RIMAPI 500 every time (fork bug, tracked separately)
- place_blueprint → agent doesn't include x,z coordinates
- toggle_power → agent sends building_id=0 (no valid IDs in state)
- haul_resource → RIMAPI rejects the job assignment
Agents propose ~14 actions per tick but only ~6 execute. The rest fail silently. Failed actions waste the tick without benefit.
2. Agents disrupt productive colonist behavior
- RimWorld's built-in AI already assigns colonists to work, eat, sleep, haul
- Our agents override work priorities, draft colonists away from tasks, reassign researchers
- If the override is wrong or the action fails, the colonist is worse off than if we'd done nothing
3. No understanding of what's already working
- Agents see a snapshot of colony state but don't know what colonists are currently doing
- They propose "set_work_priority growing=1" but the colonist is already growing
- The action succeeds but adds no value — and may disrupt the colonist's current task queue
4. 10-second tick interval means minimal game progression
- Colony runs for 10 seconds between deliberation cycles
- Not enough time for actions to have measurable impact before the next override
What Needs to Change
Fix action reliability first
- Fix set_growing_zone RIMAPI fork bug
- Teach agents to include coordinates for blueprints
- Expose building IDs in filtered state for toggle_power
- Get execution rate from 43% to 90%+
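Two of the failure modes above (missing blueprint coordinates, invalid building IDs) could be caught before an action ever reaches RIMAPI. The following is a minimal sketch of such a pre-flight check, not the project's actual code: the `Action` class, the `validate` and `execution_rate` helpers, and the parameter keys `x`, `z`, and `building_id` are assumptions based on the action names in this issue.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Set


@dataclass
class Action:
    """A proposed agent action. Field names are assumptions for this sketch."""
    name: str
    params: Dict[str, Any] = field(default_factory=dict)


def validate(action: Action, known_building_ids: Set[int]) -> Optional[str]:
    """Return a rejection reason, or None if the action passes the checks."""
    params = action.params
    if action.name == "place_blueprint":
        # Blueprints without both coordinates cannot be placed.
        if "x" not in params or "z" not in params:
            return "place_blueprint requires x and z coordinates"
    elif action.name == "toggle_power":
        # building_id=0 (or any ID absent from state) is rejected up front.
        if params.get("building_id") not in known_building_ids:
            return "toggle_power requires a building_id present in state"
    return None


def execution_rate(executed: int, proposed: int) -> float:
    """Fraction of proposed actions that executed (0.0 if none proposed)."""
    return executed / proposed if proposed else 0.0
```

With the figures reported above, `execution_rate(6, 14)` is about 0.43, which matches the 43% starting point. Logging the rejection reason instead of failing silently would also make the remaining failures easier to attribute.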
Make agents aware of current colonist activity
- Add current_activity or current_job to colonist state (if RIMAPI exposes it)
- Agents should propose NO_ACTION when colonists are already doing the right thing
- Penalize unnecessary overrides in the scoring
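A redundancy filter along these lines could sit between the agents and the executor. This is a sketch under assumptions: it presumes the colonist state is a dict with a `current_job` string whose value is comparable to a work type such as `growing`, which depends on what RIMAPI actually exposes. The helper names `is_redundant` and `filter_priority_change` are hypothetical.

```python
from typing import Dict

NO_ACTION = "NO_ACTION"


def is_redundant(colonist: Dict[str, object], work_type: str) -> bool:
    """True if the colonist's reported job already matches the work type."""
    current = colonist.get("current_job")  # assumed field name
    return isinstance(current, str) and current.lower() == work_type.lower()


def filter_priority_change(colonist: Dict[str, object], work_type: str) -> str:
    """Replace a priority override with NO_ACTION when it would change nothing."""
    if is_redundant(colonist, work_type):
        return NO_ACTION
    return f"set_work_priority {work_type}=1"
```

If `current_job` is missing from state, the filter lets the action through, so it degrades to today's behavior rather than suppressing actions blindly.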
Increase tick interval for meaningful progression
- Test with 30-60 second tick intervals so colony state actually changes between ticks
- Fewer but higher-quality interventions > many disruptive ones
Add "do no harm" principle to agent prompts
- System prompt: "Only propose actions that improve on the colony's current trajectory. If colonists are already productive, propose NO_ACTION."
- Weight NO_ACTION higher in the conflict resolver when no crisis exists
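One way to weight NO_ACTION in the conflict resolver is to treat it as an always-present candidate that receives a bonus when no crisis is flagged. The sketch below assumes proposals arrive as `(action, score)` pairs and that a boolean crisis flag is available; the `resolve` function, the score scale, and the `no_action_bonus` default of 0.2 are illustrative assumptions, not the existing resolver's interface.

```python
from typing import List, Tuple

NO_ACTION = "NO_ACTION"


def resolve(
    proposals: List[Tuple[str, float]],
    crisis: bool,
    no_action_bonus: float = 0.2,
) -> str:
    """Pick the highest-weighted proposal, favoring NO_ACTION outside a crisis."""
    # NO_ACTION is always a candidate, so an empty proposal list resolves to it.
    candidates = [(NO_ACTION, 0.0)] + list(proposals)

    def weight(item: Tuple[str, float]) -> float:
        action, score = item
        if action == NO_ACTION and not crisis:
            return score + no_action_bonus
        return score

    # max() keeps the first maximum, so ties go to the implicit NO_ACTION.
    return max(candidates, key=weight)[0]
```

Under this scheme a low-confidence override (score below the bonus) is dropped when the colony is stable, while the same proposal still wins during a crisis.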
Success Criteria
The benchmark answer should be:
Agent: 0.85 ± 0.05
Baseline: 0.75 ± 0.03
Delta: +0.10 (p < 0.05)
Agents must demonstrably improve colony outcomes. Until then, the benchmark is failing honestly.
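The issue does not say which test produced the p-values. As one possible way to check the criterion on paired runs, here is a sketch of an exact sign-flip permutation test using only the standard library. It is a swapped-in technique, not necessarily the one the benchmark uses, and `paired_sign_flip_test` is a hypothetical helper.

```python
from itertools import product
from statistics import mean
from typing import Sequence, Tuple


def paired_sign_flip_test(
    agent: Sequence[float], baseline: Sequence[float]
) -> Tuple[float, float]:
    """Return (mean delta, two-sided p) from an exact sign-flip test."""
    if len(agent) != len(baseline) or not agent:
        raise ValueError("need equal-length, non-empty paired scores")
    deltas = [a - b for a, b in zip(agent, baseline)]
    observed = mean(deltas)
    extreme = 0
    total = 0
    # Under the null, each paired delta is equally likely to have either sign.
    for signs in product((1, -1), repeat=len(deltas)):
        total += 1
        flipped = mean([s * d for s, d in zip(signs, deltas)])
        if abs(flipped) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / total
```

With n paired runs the smallest two-sided p this test can return is 2 / 2^n, so at least 6 paired runs are needed before p < 0.05 is attainable. The enumeration is exponential in n, which is fine for a handful of runs but would need sampling for larger counts.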
How to Reproduce
# Requires: RimWorld running, RIMAPI mod, LM Studio with Nemotron Nano 4B
# Save a Crashlanded colony as "rle_crashlanded_v1"
python scripts/run_scenario.py crashlanded_survival \
--provider openai --model nvidia/nemotron-3-nano-4b \
--base-url http://localhost:1234/v1 \
--no-think --ticks 10
python scripts/run_scenario.py crashlanded_survival --no-agent --ticks 10
Compare the two final scores. Agent must be higher.