This repository implements a multi-agent evaluation and improvement loop for LLMs on GSM8K-style math tasks. It combines objective correctness, peer judgment, trust weighting, attention weighting, revision rounds, and optional contextual-bandit RL updates.
The environment verifies rewards in strict order:
ground_truth_reward -> raw_peer_salvageability_scores -> score_normalization -> trust_weighting -> attention_weighting -> combined_peer_recoverability -> recoverability_components -> final_recoverability_reward -> sanity_checks
The final scalar reward is recoverability-aware:
R = alpha*R_final + beta*sum(delta_u) + gamma*mean(b_t) - delta*mean(f_t) + eta*peer_salv + zeta*branch_bonus
where delta_u is the change in heuristic recoverability per reasoning step, b_t is a belief / branch-preservation score, f_t matches stored failure motifs, and peer_salv is the attention–trust weighted aggregate of peer salvageability scores (not a simple blend of GT and peer accuracy).
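As a minimal sketch, the scalar reward above could be assembled from its components as follows. The function name, the argument names, the default weights, and the handling of empty step sequences are assumptions made for illustration; they are not taken from the repository.

```python
from statistics import mean
from typing import Sequence


def recoverability_reward(
    r_final: float,
    delta_u: Sequence[float],
    b_t: Sequence[float],
    f_t: Sequence[float],
    peer_salv: float,
    branch_bonus: float,
    *,
    alpha: float = 1.0,   # hypothetical default weights, not the repo's values
    beta: float = 0.1,
    gamma: float = 0.1,
    delta: float = 0.1,
    eta: float = 0.1,
    zeta: float = 0.1,
) -> float:
    """Combine the reward terms term by term, following the formula for R."""
    # Assumption of this sketch: empty per-step sequences contribute zero.
    mean_b = mean(b_t) if b_t else 0.0
    mean_f = mean(f_t) if f_t else 0.0
    return (
        alpha * r_final
        + beta * sum(delta_u)      # summed change in heuristic recoverability
        + gamma * mean_b           # mean belief / branch-preservation score
        - delta * mean_f           # penalty for matching stored failure motifs
        + eta * peer_salv          # attention-trust weighted peer salvageability
        + zeta * branch_bonus
    )


# Example call with made-up component values.
example = recoverability_reward(
    r_final=1.0,
    delta_u=[0.2, -0.1],
    b_t=[0.5, 0.5],
    f_t=[0.0, 1.0],
    peer_salv=0.8,
    branch_bonus=0.0,
)
```

Only the failure-motif term enters with a negative sign, so a response that repeats known failure patterns is penalized even when its final answer is correct.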
- environment/: Multi-agent environment and reward pipeline.
- agents/: Agent base classes and implementations: heuristic agents, Ollama agents, self-refine agents, and ICL agents.
- data/: GSM8K loading and answer extraction.
- experiment/: Experiment runner, resumable manifests, and aggregate summaries.
- analysis/: Statistics, tables, and learning-curve plots.
- configs/: YAML configs for smoke tests and ablations.
- notebook.ipynb: Local notebook for train/test runs and reward-verification inspection.
- notebook_colab.ipynb: Colab-first notebook with smaller defaults and setup logic.

python -m venv .venv
source .venv/bin/activate
pip install -e .

For development tools:

pip install -e .[dev]

Run the smoke ablation:

python scripts/run_smoke_ablation.py

Run a quick experiment with the heuristic backend:

python run_experiment.py \
--backend heuristic \
--dataset gsm8k \
--split test \
--limit 5 \
--output-dir outputs/quick-run

Open notebook.ipynb and run the cells from top to bottom.
Open notebook_colab.ipynb in Colab.
It defaults to the heuristic backend because standard Colab does not include a local Ollama server.
Each run writes:
- results.jsonl: Full per-example responses, rewards, ranking, and metadata.
- summary.json: Aggregate leaderboard metrics and verification_metrics.
- learning_curve.jsonl: Per-episode learning-curve rows.
- run_manifest.json: Resolved run configuration plus environment settings.
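
The four files can be read back with a small loader that tolerates missing outputs, for example when a run was interrupted. This is a sketch only: the helper name `load_run_outputs` is hypothetical, and it assumes the `.jsonl` files hold one JSON object per line and the `.json` files hold a single JSON document.

```python
import json
from pathlib import Path
from typing import Any

RUN_FILES = (
    "results.jsonl",
    "summary.json",
    "learning_curve.jsonl",
    "run_manifest.json",
)


def load_run_outputs(output_dir: str) -> dict[str, Any]:
    """Load each per-run file, mapping a missing file to None."""
    root = Path(output_dir)
    outputs: dict[str, Any] = {}
    for name in RUN_FILES:
        path = root / name
        if not path.exists():
            # A missing file is reported as None instead of raising.
            outputs[name] = None
            continue
        text = path.read_text(encoding="utf-8")
        if name.endswith(".jsonl"):
            # One JSON object per non-empty line (assumed layout).
            outputs[name] = [
                json.loads(line) for line in text.splitlines() if line.strip()
            ]
        else:
            outputs[name] = json.loads(text)
    return outputs
```

Returning None for absent files lets a notebook cell show a "no outputs yet" message instead of failing on a fresh checkout.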
Each episode stores a detailed report under:
metadata.reward_verification
Useful fields include:
- stage_order
- ground_truth_reward
- raw_peer_salvageability_scores
- score_normalization
- trust_weighting
- attention_weighting
- combined_peer_recoverability
- recoverability_components
- final_recoverability_reward
- sanity_checks

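One way to use this report is to check that an episode recorded its stages in the documented strict order. The sketch below assumes that each results.jsonl record is a dict with the report under metadata.reward_verification and that stage_order is a list of stage names; both the record layout and the helper name `stage_order_is_valid` are assumptions, not confirmed details of the codebase.

```python
from typing import Any, Mapping

# The strict order documented for the reward-verification pipeline.
EXPECTED_STAGE_ORDER = [
    "ground_truth_reward",
    "raw_peer_salvageability_scores",
    "score_normalization",
    "trust_weighting",
    "attention_weighting",
    "combined_peer_recoverability",
    "recoverability_components",
    "final_recoverability_reward",
    "sanity_checks",
]


def stage_order_is_valid(record: Mapping[str, Any]) -> bool:
    """Return True if the record's stage_order matches the documented order."""
    # Assumed layout: record["metadata"]["reward_verification"]["stage_order"].
    report = record.get("metadata", {}).get("reward_verification", {})
    return list(report.get("stage_order", [])) == EXPECTED_STAGE_ORDER
```

A record with no report, or with stages in a different order, is treated as invalid, which makes the check usable as a filter over all episodes in a run.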
The codebase is in solid shape for research iteration, but the main areas to keep improving are:
- model-backend ergonomics for non-Ollama environments
- richer aggregate reporting over reward-verification health
- clearer top-level documentation and setup guidance
- notebook UX when outputs do not exist yet
This pass addresses the documentation gap, adds aggregate verification metrics to summaries, exposes the missing attention temperature config in experiment entrypoints, and hardens both notebooks against missing output files.
pytest -q tests/test_environment.py

Related documents:
- PROJECT_SUMMARY.md
- docs/PAPER_EXPERIMENTS.md