Agents must beat unmanaged baseline #6

@jkbennitt

The Problem

First paired benchmark against a live RimWorld colony shows agents are not helping:

Agent:    0.801 ± 0.03
Baseline: 0.830 ± 0.00
Delta:    -0.029 (p = 0.37)

The unmanaged colony (RimWorld's built-in pawn AI) scores higher than our 6-agent team. The gap is not statistically significant (p = 0.37), but the agents show no benefit and are likely net-negative: they issue actions that fail or disrupt colonist routines.

Why Agents Are Losing

1. High action failure rate

  • set_growing_zone → RIMAPI 500 every time (fork bug, tracked separately)
  • place_blueprint → agent doesn't include x,z coordinates
  • toggle_power → agent sends building_id=0 (no valid IDs in state)
  • haul_resource → RIMAPI rejects the job assignment

Agents propose ~14 actions per tick but only ~6 execute (~43%). The rest fail silently, so most of each tick's deliberation produces no benefit.
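
To stop failures from being silent, per-tick results could be tallied by action type. This is only a sketch: the `ActionResult` record and its fields are hypothetical, and the per-action split in the sample tick is invented for illustration. Only the 14 proposed and 6 executed totals come from the numbers above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ActionResult:
    """Hypothetical record of one proposed action and its outcome."""
    action: str
    ok: bool
    error: str = ""


def summarize(results):
    """Return (execution_rate, failures_by_action) for one tick."""
    if not results:
        return 0.0, Counter()
    executed = sum(1 for r in results if r.ok)
    failures = Counter(r.action for r in results if not r.ok)
    return executed / len(results), failures


# Illustrative tick: 14 proposed, 6 executed (the split below is made up).
tick = (
    [ActionResult("set_work_priority", True)] * 6
    + [ActionResult("set_growing_zone", False, "RIMAPI 500")] * 3
    + [ActionResult("place_blueprint", False, "missing x,z")] * 2
    + [ActionResult("toggle_power", False, "building_id=0")] * 2
    + [ActionResult("haul_resource", False, "job rejected")]
)
rate, failures = summarize(tick)
print(f"execution rate: {rate:.0%}")
print(failures.most_common())
```

Logging this every tick would show which action types dominate the failures and whether the execution rate moves after each fix.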

2. Agents disrupt productive colonist behavior

  • RimWorld's built-in AI already assigns colonists to work, eat, sleep, haul
  • Our agents override work priorities, draft colonists away from tasks, reassign researchers
  • If the override is wrong or the action fails, the colonist is worse off than if we'd done nothing

3. No understanding of what's already working

  • Agents see a snapshot of colony state but don't know what colonists are currently doing
  • They propose "set_work_priority growing=1" but the colonist is already growing
  • The action succeeds but adds no value — and may disrupt the colonist's current task queue

4. 10-second tick interval means minimal game progression

  • Colony runs for 10 seconds between deliberation cycles
  • Not enough time for actions to have measurable impact before the next override

What Needs to Change

Fix action reliability first

  • Fix set_growing_zone RIMAPI fork bug
  • Teach agents to include coordinates for blueprints
  • Expose building IDs in filtered state for toggle_power
  • Get execution rate from 43% to 90%+
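
One way to raise the execution rate independently of prompt changes is a pre-flight check that rejects malformed actions before they reach RIMAPI. A minimal sketch, assuming actions are dicts with `type` and `params` keys and that valid building IDs are available as a set; both are assumptions, and the real payload shape may differ.

```python
def validate_action(action, known_building_ids):
    """Return a list of problems; an empty list means the action may be sent.

    The action schema ("type", "params") is hypothetical.
    """
    problems = []
    name = action.get("type")
    params = action.get("params", {})
    if name == "place_blueprint":
        # Blueprints need explicit map coordinates.
        for key in ("x", "z"):
            if not isinstance(params.get(key), int):
                problems.append(f"place_blueprint missing integer '{key}'")
    elif name == "toggle_power":
        # Reject placeholder IDs such as 0 that do not exist in state.
        building_id = params.get("building_id")
        if building_id not in known_building_ids:
            problems.append(f"toggle_power has unknown building_id={building_id!r}")
    return problems


# Example: the two failure modes listed above.
print(validate_action({"type": "place_blueprint", "params": {"x": 3}}, set()))
print(validate_action({"type": "toggle_power", "params": {"building_id": 0}}, {101, 102}))
```

Rejected actions could be fed back to the agent as an error message rather than sent and left to fail.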

Make agents aware of current colonist activity

  • Add current_activity or current_job to colonist state (if RIMAPI exposes it)
  • Agents should propose NO_ACTION when colonists are already doing the right thing
  • Penalize unnecessary overrides in the scoring
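
If colonist state does gain a `current_job` field, redundant proposals could be dropped before execution. A rough sketch, assuming hypothetical field names (`name`, `current_job`, `colonist`, `work_type`) and that job names and work types share one vocabulary; none of this is confirmed against RIMAPI.

```python
NO_ACTION = {"type": "NO_ACTION"}


def filter_redundant(proposals, colonists):
    """Drop set_work_priority proposals for work the colonist is already doing.

    Falls back to a single NO_ACTION if nothing useful remains.
    All field names here are hypothetical.
    """
    current = {c["name"]: c.get("current_job") for c in colonists}
    kept = []
    for proposal in proposals:
        if proposal["type"] == "set_work_priority":
            params = proposal["params"]
            if current.get(params["colonist"]) == params["work_type"]:
                continue  # already doing it, so the override adds nothing
        kept.append(proposal)
    return kept or [NO_ACTION]


colonists = [{"name": "Ana", "current_job": "growing"}]
proposals = [
    {"type": "set_work_priority",
     "params": {"colonist": "Ana", "work_type": "growing", "priority": 1}},
]
print(filter_redundant(proposals, colonists))
```

The count of dropped proposals could also serve as the "unnecessary override" penalty in scoring.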

Increase tick interval for meaningful progression

  • Test with 30-60 second tick intervals so colony state actually changes between ticks
  • Fewer but higher-quality interventions > many disruptive ones

Add "do no harm" principle to agent prompts

  • System prompt: "Only propose actions that improve on the colony's current trajectory. If colonists are already productive, propose NO_ACTION."
  • Weight NO_ACTION higher in the conflict resolver when no crisis exists
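
The resolver weighting could be as simple as a score bonus for NO_ACTION when no crisis is flagged. A sketch under stated assumptions: candidates are `(action_type, score)` pairs, `crisis` is a boolean computed elsewhere, and the `0.2` bonus is an arbitrary placeholder that would need tuning against the benchmark.

```python
def resolve(candidates, crisis, no_action_bonus=0.2):
    """Pick the highest-scoring candidate, favoring NO_ACTION outside a crisis.

    candidates: list of (action_type, score) pairs (hypothetical shape).
    """
    def adjusted(candidate):
        name, score = candidate
        if name == "NO_ACTION" and not crisis:
            return score + no_action_bonus
        return score

    return max(candidates, key=adjusted)[0]


candidates = [("set_work_priority", 0.5), ("NO_ACTION", 0.4)]
print(resolve(candidates, crisis=False))
print(resolve(candidates, crisis=True))
```

With the bonus applied, an intervention has to beat doing nothing by a clear margin before it wins.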

Success Criteria

The target is a benchmark result along these lines:

Agent:    0.85 ± 0.05
Baseline: 0.75 ± 0.03
Delta:    +0.10** (p < 0.05)

Agents must demonstrably improve colony outcomes. Until then, the benchmark is failing honestly.

How to Reproduce

# Requires: RimWorld running, RIMAPI mod, LM Studio with Nemotron Nano 4B
# Save a Crashlanded colony as "rle_crashlanded_v1"

# Agent run
python scripts/run_scenario.py crashlanded_survival \
  --provider openai --model nvidia/nemotron-3-nano-4b \
  --base-url http://localhost:1234/v1 \
  --no-think --ticks 10

# Baseline run (no agents, unmanaged colony)
python scripts/run_scenario.py crashlanded_survival --no-agent --ticks 10

Compare the two final scores. Agent must be higher.
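
For repeated paired runs, the comparison could be scripted. The sketch below uses an exact sign-flip permutation test from the standard library. This is a stand-in, not necessarily the test behind the p-value reported above, and the function name and the sample scores are placeholders, not real benchmark output.

```python
from itertools import product
from statistics import mean


def paired_permutation_test(agent, baseline):
    """Exact two-sided sign-flip test on paired score differences.

    Returns (mean_delta, p_value).
    """
    if len(agent) != len(baseline) or not agent:
        raise ValueError("need equal-length, non-empty paired samples")
    diffs = [a - b for a, b in zip(agent, baseline)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    # Enumerate every way of flipping the sign of each paired difference.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean([s * d for s, d in zip(signs, diffs)]))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return mean(diffs), hits / total


# Placeholder scores for illustration only.
agent_scores = [0.80, 0.78, 0.83, 0.79]
baseline_scores = [0.83, 0.83, 0.83, 0.83]
delta, p = paired_permutation_test(agent_scores, baseline_scores)
print(f"Delta: {delta:+.3f} (p = {p:.2f})")
```

With this test the smallest achievable p-value for n pairs is 2 / 2^n, so at least 6 paired runs are needed before p < 0.05 is even possible.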
