Multi-Agent RL Environment For LLM Reward Verification

This repository implements a multi-agent evaluation and improvement loop for LLMs on GSM8K-style math tasks. It combines objective correctness, peer judgment, trust weighting, attention weighting, revision rounds, and optional contextual-bandit RL updates.

Core Reward Design

The environment verifies rewards in strict order:

ground_truth_reward -> raw_peer_salvageability_scores -> score_normalization -> trust_weighting -> attention_weighting -> combined_peer_recoverability -> recoverability_components -> final_recoverability_reward -> sanity_checks
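The peer-side stages of this pipeline (normalization, trust weighting, attention weighting, combination) can be illustrated with a minimal sketch. It is not the repository's implementation: the min-max normalization, the softmax attention with a temperature, and the names normalize, softmax, and combined_peer_score are all assumptions chosen for illustration.

```python
import math


def normalize(scores):
    """Min-max normalize raw scores to [0, 1] (assumed scheme); constant inputs map to 0.5."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def softmax(values, temperature=1.0):
    """Softmax with temperature; a lower temperature sharpens the weights."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / temperature for v in values]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def combined_peer_score(raw_scores, trust, attention_logits, temperature=1.0):
    """Aggregate peer scores using weights proportional to trust * attention."""
    norm = normalize(raw_scores)
    attn = softmax(attention_logits, temperature)
    weights = [t * a for t, a in zip(trust, attn)]
    total = sum(weights)
    if total == 0:
        return 0.0  # no trusted peers: fall back to a neutral zero
    return sum(w * s for w, s in zip(weights, norm)) / total
```

Under these assumptions, a peer with zero trust contributes nothing to the aggregate, regardless of how much attention it receives.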

The final scalar reward is recoverability-aware:

R = alpha*R_final + beta*sum(delta_u) + gamma*mean(b_t) - delta*mean(f_t) + eta*peer_salv + zeta*branch_bonus

where alpha, beta, gamma, delta, eta, and zeta are scalar weights; delta_u is the change in heuristic recoverability per reasoning step (distinct from the weight delta); b_t is a belief / branch-preservation score; f_t is a penalty term for matches against stored failure motifs; and peer_salv is the attention- and trust-weighted aggregate of peer salvageability scores (not a simple blend of ground-truth and peer accuracy).
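As a worked illustration of the formula, the sketch below evaluates R from its components. The RewardWeights container, the recoverability_reward function, and the default weight values of 1.0 are hypothetical placeholders, not the repository's configuration.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class RewardWeights:
    # Placeholder values for illustration only; real weights come from the run config.
    alpha: float = 1.0
    beta: float = 1.0
    gamma: float = 1.0
    delta: float = 1.0
    eta: float = 1.0
    zeta: float = 1.0


def recoverability_reward(r_final, delta_u, b_t, f_t, peer_salv, branch_bonus, w):
    """Evaluate R = alpha*R_final + beta*sum(delta_u) + gamma*mean(b_t)
    - delta*mean(f_t) + eta*peer_salv + zeta*branch_bonus."""
    mean_b = fmean(b_t) if b_t else 0.0  # empty traces contribute nothing
    mean_f = fmean(f_t) if f_t else 0.0
    return (
        w.alpha * r_final
        + w.beta * sum(delta_u)
        + w.gamma * mean_b
        - w.delta * mean_f
        + w.eta * peer_salv
        + w.zeta * branch_bonus
    )


reward = recoverability_reward(
    r_final=1.0,
    delta_u=[0.5, -0.25],
    b_t=[0.5, 1.0],
    f_t=[0.0, 0.5],
    peer_salv=0.5,
    branch_bonus=0.25,
    w=RewardWeights(),
)
print(reward)  # 1.0 + 0.25 + 0.75 - 0.25 + 0.5 + 0.25 = 2.5
```

Note that the failure-motif term is the only one subtracted, so a trace that repeatedly matches stored failure motifs lowers R even when the final answer is correct.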

Repository Layout

environment/: Multi-agent environment and reward pipeline.
agents/: Agent base classes and implementations: heuristic agents, Ollama agents, self-refine agents, and ICL agents.
data/: GSM8K loading and answer extraction.
experiment/: Experiment runner, resumable manifests, and aggregate summaries.
analysis/: Statistics, tables, and learning-curve plots.
configs/: YAML configs for smoke tests and ablations.
notebook.ipynb: Local notebook for train/test runs and reward-verification inspection.
notebook_colab.ipynb: Colab-first notebook with smaller defaults and setup logic.

Installation

python -m venv .venv
source .venv/bin/activate
pip install -e .

For development tools:

pip install -e ".[dev]"

Quick Start

Fast smoke test

python scripts/run_smoke_ablation.py

Single CLI run

python run_experiment.py \
  --backend heuristic \
  --dataset gsm8k \
  --split test \
  --limit 5 \
  --output-dir outputs/quick-run

Local notebook

Open notebook.ipynb and run the cells from top to bottom.

Colab notebook

Open notebook_colab.ipynb in Colab. It defaults to the heuristic backend because standard Colab does not include a local Ollama server.

Important Run Outputs

Each run writes:

results.jsonl: Full per-example responses, rewards, ranking, and metadata.
summary.json: Aggregate leaderboard metrics and verification_metrics.
learning_curve.jsonl: Per-episode learning-curve rows.
run_manifest.json: Resolved run configuration plus environment settings.
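Before loading results in a notebook or script, it can help to check which of these files a run directory actually contains. The helper below is a small sketch; missing_outputs is a hypothetical name, not a function provided by the repository, and the example path is the one used in the CLI run above.

```python
from pathlib import Path

# File names taken from the list of run outputs above.
EXPECTED_OUTPUTS = (
    "results.jsonl",
    "summary.json",
    "learning_curve.jsonl",
    "run_manifest.json",
)


def missing_outputs(run_dir):
    """Return the expected output files that are absent from run_dir."""
    run_path = Path(run_dir)
    return [name for name in EXPECTED_OUTPUTS if not (run_path / name).is_file()]


if __name__ == "__main__":
    absent = missing_outputs("outputs/quick-run")
    if absent:
        print("Run incomplete, missing:", ", ".join(absent))
    else:
        print("All expected outputs are present.")
```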

Reward Verification Output

Each episode stores a detailed report under:

metadata.reward_verification

Useful fields include:

stage_order
ground_truth_reward
raw_peer_salvageability_scores
score_normalization
trust_weighting
attention_weighting
combined_peer_recoverability
recoverability_components
final_recoverability_reward
sanity_checks
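A short sketch of how these reports might be read back from results.jsonl follows. It assumes each line is a JSON object whose report sits at metadata.reward_verification, as described above; the function names iter_verification_reports and missing_fields are hypothetical.

```python
import json

# Field names taken from the list above.
EXPECTED_FIELDS = (
    "stage_order",
    "ground_truth_reward",
    "raw_peer_salvageability_scores",
    "score_normalization",
    "trust_weighting",
    "attention_weighting",
    "combined_peer_recoverability",
    "recoverability_components",
    "final_recoverability_reward",
    "sanity_checks",
)


def iter_verification_reports(results_path):
    """Yield the reward_verification report of each record that has one."""
    with open(results_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            report = record.get("metadata", {}).get("reward_verification")
            if report is not None:
                yield report


def missing_fields(report):
    """Return the expected field names that a report does not contain."""
    return [name for name in EXPECTED_FIELDS if name not in report]
```

For example, `[missing_fields(r) for r in iter_verification_reports("outputs/quick-run/results.jsonl")]` would list, per episode, any fields the report lacks.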

Current Improvement Areas

The codebase is in solid shape for research iteration, but the main areas to keep improving are:

model-backend ergonomics for non-Ollama environments
richer aggregate reporting over reward-verification health
clearer top-level documentation and setup guidance
notebook UX when outputs do not exist yet

This pass addresses the documentation gap, adds aggregate verification metrics to summaries, exposes the missing attention temperature config in experiment entrypoints, and hardens both notebooks against missing output files.

Tests

pytest -q tests/test_environment.py

More Detail

PROJECT_SUMMARY.md
docs/PAPER_EXPERIMENTS.md
