Self-improving LLMs through Group Relative Policy Optimization (GRPO) and multi-language execution sandboxes.
This project implements an autonomous training loop that evolves Qwen2.5-Coder (or any compatible model) into a superior coding agent. It leverages GRPO—the same reinforcement learning logic behind DeepSeek-R1—to optimize model performance using real-world execution feedback across six programming languages, requiring zero human labels.
- GRPO Training: Efficient reinforcement learning without the overhead of a separate critic or value network.
- Autonomous Problem Generation: Uses any OpenAI-compatible API (Ollama, vLLM, OpenRouter) to generate unique coding challenges on the fly.
- Multi-Language Sandbox: Integrated execution for Python, Go, Node.js, C#, C++, and Rust with strict timeouts and dependency management.
- Hardware Efficient: Optimized for consumer hardware; fits 7B models in 4-bit (NF4) on a single 24GB GPU (RTX 3090/4090).
- Windows & Linux Ready: Cross-platform support for all language runtimes and execution environments.
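The "strict timeouts" in the sandbox feature can be illustrated with Python's standard `subprocess` module. This is a minimal sketch of the general technique, not the project's actual implementation; the function name and return shape are assumptions:

```python
import subprocess
import sys

def run_with_timeout(code: str, timeout: float = 5.0) -> dict:
    """Execute untrusted code in a child process, killing it on timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
        return {"ok": proc.returncode == 0, "out": proc.stdout, "err": proc.stderr}
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before the exception propagates.
        return {"ok": False, "out": "", "err": f"killed after {timeout}s"}

run_with_timeout("print(2 + 2)")           # ok=True, out="4\n"
run_with_timeout("while True: pass", 1.0)  # ok=False, killed after 1.0s
```

A real sandbox would add further isolation (working-directory jails, resource limits), but the timeout mechanism is the part that keeps infinite loops from stalling the training loop.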
1. Synthesize: A "Teacher" LLM generates a structured coding problem with unit tests.
2. Rollout: The "Student" model generates multiple completion candidates (the "Group").
3. Execute: Every candidate is compiled and run against the unit tests in a secure sandbox.
4. Reward: Candidates are scored based on formatting, compilation success, and pass rate.
5. Optimize: GRPO computes advantages within the group to update the student policy.
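The Optimize step relies on GRPO's core trick: instead of a learned critic or value network, each completion's advantage is computed relative to the other completions in its own group. A minimal sketch of that normalization (standalone illustration, not the project's code):

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: normalize each reward against its own group's
    mean and standard deviation, so no separate value network is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Example: a group of 4 completions scored by the sandbox.
# Completions above the group mean get a positive advantage, below get negative.
advs = group_relative_advantages([1.0, 0.3, 0.0, 0.3])
```

These per-candidate advantages then weight the policy-gradient update, clipped PPO-style with `clip_eps` and regularized toward the reference model by `kl_coef`.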
Ensure you have the necessary language runtimes (Go, Node, .NET, G++, Rust) installed.
```bash
# On Linux
bash system_deps.sh

# On Windows: ensure 'go', 'node', 'dotnet', 'g++', and 'cargo' are in your PATH.
```

```bash
git clone https://github.com/Akicou/rl-coding-agent
cd rl-coding-agent
pip install -r requirements.txt
cp .env.example .env  # Configure your OAI_BASE_URL and API keys
```

```bash
# Run a smoke test to verify model loading and generation
python scripts/smoke_test.py

# Start the infinite training loop
python scripts/train.py
```

Key parameters for tuning the training loop:
| Category | Parameter | Default | Description |
|---|---|---|---|
| Model | `model_name` | `Qwen/Qwen2.5-Coder-7B-Instruct` | Target policy & reference model |
| Generation | `group_size` | 4 | Completions per rollout group |
| Generation | `max_new_tokens` | 65536 | Max generation length |
| RL | `kl_coef` | 0.04 | Regularization vs. reference policy |
| RL | `clip_eps` | 0.2 | PPO-style clipping epsilon |
| Reward | `w_pass` | 1.0 | Weight for test pass rate |
| Reward | `w_compile` | 0.3 | Weight for compilation success |
| Loop | `batch_size` | 2 | Problems per micro-batch |
| Loop | `grad_accum` | 4 | Gradients accumulated per step |
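With the defaults above, a candidate's scalar reward might combine the weighted components like this. The exact formula (and how the formatting bonus enters) is an assumption for illustration, not taken from the project:

```python
def score_candidate(passed: int, total: int, compiled: bool,
                    w_pass: float = 1.0, w_compile: float = 0.3) -> float:
    """Illustrative reward: weighted sum of compilation success and
    unit-test pass rate. Weights mirror the table defaults."""
    pass_rate = passed / total if total else 0.0
    return w_compile * float(compiled) + w_pass * pass_rate

score_candidate(3, 4, True)   # compiles, 3/4 tests pass -> 0.3 + 0.75 = 1.05
score_candidate(0, 4, False)  # fails to compile -> 0.0
```

Weighting compilation separately from pass rate gives the policy a smoother gradient: a candidate that compiles but fails every test still outranks one that does not compile at all.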
| Language | Engine | Sandbox Detail |
|---|---|---|
| Python | 3.11+ | Auto-installs missing packages via pip |
| Go | 1.22+ | Isolated go.mod environment |
| Node.js | 20+ | Dynamic package.json with npm support |
| C# | .NET 8 | Ephemeral .csproj with NuGet resolution |
| C++ | G++ 17 | Direct compilation and execution |
| Rust | Stable | Full Cargo project isolation |
We welcome technical contributions. To add a new language runtime:

1. Subclass `LanguageExecutor` in `rl_agent/languages/`.
2. Implement `extract_deps()` and `execute()`.
3. Register it in the `LANGUAGE_REGISTRY`.
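As a rough sketch of what such a subclass might look like, here is a hypothetical executor for a new language (Ruby). The method signatures and return types are assumptions; check the actual `LanguageExecutor` base class before implementing:

```python
import subprocess

class RubyExecutor:  # would subclass rl_agent.languages.LanguageExecutor
    """Hypothetical skeleton; the real base-class interface may differ."""
    name = "ruby"

    def extract_deps(self, source: str) -> list[str]:
        # Parse `require` statements to find gems that may need installing.
        return [line.split()[-1].strip("'\"")
                for line in source.splitlines()
                if line.strip().startswith("require ")]

    def execute(self, source: str, timeout: float = 10.0) -> tuple[bool, str]:
        # Run the candidate in a subprocess with a strict timeout.
        try:
            proc = subprocess.run(["ruby", "-e", source],
                                  capture_output=True, text=True, timeout=timeout)
            return proc.returncode == 0, proc.stdout + proc.stderr
        except subprocess.TimeoutExpired:
            return False, "timeout"

# Step 3 would then register it, e.g.:
# LANGUAGE_REGISTRY["ruby"] = RubyExecutor()
```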
MIT © 2026 Akicou