A reinforcement learning environment for training LLM-driven agents to autonomously debug, localize, and repair code through sandboxed execution and feedback-driven learning.
Built for scalable agent evaluation on real-world software debugging tasks.
CodeFix-Env is a Gymnasium-compatible RL environment that simulates real-world Python code debugging workflows. It enables researchers and engineers to:
| Goal | Description |
|---|---|
| Train | Large language models (LLMs) on code repair tasks. |
| Evaluate | AI agents on real debugging scenarios. |
| Benchmark | Code generation and reasoning capabilities. |
| Build | Automated debugging systems. |
Compatible with: Qwen, Llama, CodeLlama, DeepSeek, and other leading LLMs.
Training large language models (LLMs) for code generation and autonomous repair takes more than static data. Effective learning requires:
- Authentic Scenarios: Moving beyond synthetic examples to real-world debugging.
- Tight Feedback Loops: Providing immediate test results and error logs.
- Measurable Progress: Quantifiable metrics based on test pass rates.
- Secure Execution: Sandboxed environments to prevent unsafe code execution.
CodeFix-Env is a specialized environment designed for RL agents and LLMs to master code repair through:
- Real-World Tasks: 21 curated Python bugs ranging from logic errors to complex edge cases.
- Test-Driven Feedback: Agents receive granular execution data after every action.
- Multi-Factor Rewards: Reward signals based on test progress, code quality, and task completion.
- Isolated Execution: Built-in sandboxing with timeout protection for safe evaluations.
- Structured Interaction: Clean API for structured actions that LLMs can reliably generate.
graph TD
A[Agent receives buggy code] --> B[Analyze code & test failures]
B --> C[Generate action: e.g., edit_line]
C --> D[Environment executes action]
D --> E[Environment provides feedback]
subgraph Feedback Loop
E --> F[Test results & pass/fail counts]
E --> G[Code diff & change analysis]
E --> H[Shaped reward signals]
end
F & G & H --> I{Solution found?}
I -- No --> B
I -- Yes --> J[Task Complete]
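The loop above maps directly onto the environment API used throughout this README. A minimal sketch of one rollout, where `choose_action` is a hypothetical stand-in for your agent or LLM policy:

from codefix_env import CodeFixEnvironment, CodeFixAction, ActionType

def choose_action(obs):
    # Hypothetical policy stub: a real agent would inspect obs.current_code
    # and obs.test_output before picking an edit, test run, or submission.
    return CodeFixAction(action_type=ActionType.RUN_TESTS)

env = CodeFixEnvironment()
obs = env.reset(difficulty="easy")

for step in range(20):  # cap the rollout so the stub policy cannot loop forever
    result = env.step(choose_action(obs))
    obs = result.observation
    print(f"step {step}: reward={result.reward:.3f}, tests={obs.tests_passed}/{obs.tests_total}")
    if result.done:
        break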
**Task Library**
- 21 carefully designed debugging tasks
- 3 difficulty levels: Easy (8), Medium (8), Hard (5)
- Each task includes buggy code, solution, test cases, and hints
- Covers real Python patterns: loops, recursion, algorithms, decorators
**LLM Integration**
- Works with HuggingFace Transformers ecosystem
- Compatible with training methods: SFT, DPO, GRPO, PPO
- Pre-built support for Qwen, Llama, CodeLlama, DeepSeek models
- Structured JSON action interface for LLM control (see the sketch below)
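Because the action space is structured, a model's raw text output can be parsed straight into an action. A minimal sketch, assuming you prompt the model to emit one JSON object whose fields mirror `CodeFixAction` (the exact prompt and schema are your choice, and `ActionType` is assumed to be a standard Python Enum):

import json
from codefix_env import CodeFixAction, ActionType

# What an LLM might emit when asked for a single JSON action
llm_output = '{"action_type": "edit_line", "line_number": 3, "new_content": "    return result"}'

payload = json.loads(llm_output)
action = CodeFixAction(
    action_type=ActionType[payload["action_type"].upper()],  # "edit_line" -> ActionType.EDIT_LINE
    line_number=payload.get("line_number"),
    new_content=payload.get("new_content"),
)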
**Performance**
- Multiprocessing-based sandbox with timeout enforcement (see the sketch after this list)
- Per-test execution isolation
- 197 unit tests with 83% code coverage
- Automated CI/CD pipeline (Python 3.10, 3.11, 3.12)
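The sandbox bullet above follows the common pattern of running untrusted code in a child process and killing it if it exceeds a time budget. A minimal sketch of that general pattern using only the standard library (illustrative only, not CodeFix-Env's actual implementation):

import multiprocessing as mp

def _run_candidate(code: str, queue: mp.Queue) -> None:
    # Child process: execute the candidate code in isolation and report the outcome.
    try:
        exec(compile(code, "<candidate>", "exec"), {})
        queue.put(("ok", None))
    except Exception as exc:
        queue.put(("error", repr(exc)))

def run_with_timeout(code: str, timeout: float = 5.0):
    queue: mp.Queue = mp.Queue()
    proc = mp.Process(target=_run_candidate, args=(code, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():  # over budget: kill the worker
        proc.terminate()
        proc.join()
        return ("timeout", None)
    return queue.get() if not queue.empty() else ("error", "no result")

if __name__ == "__main__":
    print(run_with_timeout("while True: pass", timeout=1.0))  # -> ('timeout', None)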
**Developer Friendly**
- Gymnasium standard interface
- Async client for high-performance training
- Sync wrapper for notebooks
- FastAPI HTTP server for remote access
- Complete type hints and documentation
git clone https://github.com/dhakarshailendra829/codefix-env.git
cd codefix-env

# Create and activate a virtual environment
python3.11 -m venv .venv
source .venv/bin/activate        # macOS / Linux
.venv\Scripts\activate           # Windows

# Install (pick one)
pip install -e .                 # base
pip install -e ".[dev]"          # + development tools
pip install -e ".[llm]"          # + LLM integration
pip install -e ".[dev,llm]"      # + both

# Verify the installation
python -c "from codefix_env import CodeFixEnvironment; print('Installation successful!')"

# Quick start
python << 'EOF'
from codefix_env import CodeFixEnvironment, CodeFixAction, ActionType
# Create environment
env = CodeFixEnvironment()
# Start episode with random easy task
obs = env.reset(difficulty="easy")
print(f"Task: {obs.task_id}")
print(f"Current Code:\n{obs.current_code}")
# Run tests to see failures
action = CodeFixAction(action_type=ActionType.RUN_TESTS)
result = env.step(action)
print(f"\nTest Results: {result.observation.tests_passed}/{result.observation.tests_total} passing")
print(f"Output:\n{result.observation.test_output}")
print(f"\nReward: {result.reward:.3f}")
print(f"Episode Done: {result.done}")
EOF

To run the full suite:
cd codefix-env
PYTHONPATH=src pytest tests/ -v --timeout=60

To run a specific test suite:
PYTHONPATH=src pytest tests/test_tasks.py::TestSolutionCorrectness -v

To check code standards:
ruff check src/ tests/ examples/
black --check src/ tests/ examples/
mypy src/codefix_env/ --ignore-missing-imports

To start the development server:
cd codefix-env
uvicorn src.codefix_env.server.app:app --reload --port 8000

A complete cycle from task initialization to final submission.
from codefix_env import CodeFixEnvironment, CodeFixAction, ActionType
env = CodeFixEnvironment()
obs = env.reset(task_id="easy-001-missing-return")
# 1. Run tests to identify failures
result = env.step(CodeFixAction(action_type=ActionType.RUN_TESTS))
print(f"Tests: {result.observation.tests_passed}/{result.observation.tests_total} passing")
# 2. Apply a targeted fix
result = env.step(CodeFixAction(
action_type=ActionType.EDIT_LINE,
line_number=3,
new_content=" return result"
))
# 3. Final Verification & Submission
result = env.step(CodeFixAction(action_type=ActionType.SUBMIT_FIX))
print(f"Status: {'Solved' if result.observation.all_tests_pass else 'Failed'}")
print(f"Final Reward: {result.reward:.3f}")Programmatically interact with the task database.
from codefix_env import list_tasks, Difficulty, task_count
# Get environment stats
print(f"Total tasks by difficulty: {task_count()}")
# Bulk process specific difficulty levels
easy_tasks = list_tasks(difficulty=Difficulty.EASY)
for task in easy_tasks[:5]:
print(f"Loading {task.id}: {task.title}")Deploy the environment as a service to scale LLM evaluations across multiple agents.
# Start the API server
uvicorn src.codefix_env.server.app:app --port 8000

Export environment states into .jsonl format for fine-tuning models like Llama or Qwen.
import json
from codefix_env import CodeFixClient
async def collect_trajectories():
async with CodeFixClient("http://localhost:8000") as client:
obs = await client.reset(difficulty="medium")
dataset_entry = {
"instruction": f"Fix the following code: {obs.current_code}",
"context": obs.test_output,
"metadata": {"task_id": obs.task_id}
}
        # Append each entry to training_data.jsonl
        with open("training_data.jsonl", "a") as f:
            f.write(json.dumps(dataset_entry) + "\n")

Connect an LLM directly to the environment loop.
from transformers import pipeline
from codefix_env import CodeFixEnvironment
# Load a coding model
fixer = pipeline("text-generation", model="Qwen/Qwen2.5-Coder-7B")
env = CodeFixEnvironment()
obs = env.reset(difficulty="easy")
# Generate fix based on environment observation
prompt = f"Fix this Python code based on these errors: {obs.test_output}\nCode:\n{obs.current_code}"
suggestion = fixer(prompt)

| Class / Method | Description |
|---|---|
| `CodeFixEnvironment` | Local execution engine for debugging tasks. |
| `CodeFixClient` | Async client for interacting with a remote server. |
| `env.step(action)` | Executes an `ActionType` (EDIT, RUN_TESTS, SUBMIT). |
| `Difficulty` | Enum: EASY, MEDIUM, HARD. |
| `BugCategory` | Categories: SYNTAX, LOGIC, DATA_STRUCTURE, ALGORITHM. |
Standard commands for maintaining the environment.
# Run full test suite with coverage
pytest tests/ -v --cov=src/codefix_env
# Static Analysis
ruff check src/ # Linting
black --check src/ # Formatting
mypy src/ # Type checking

| Model | Description |
|---|---|
| Qwen 1.5B, 7B, 14B | Alibaba's instruction-tuned models |
| Llama 2-7B-Chat | Meta's foundation model |
| CodeLlama 7B, 13B, 34B | Specialized for code |
| DeepSeek Coder 1.3B, 6.7B | DeepSeek's code-specialized models |
| Mistral 7B | Open-source efficient model |
| Method | Description |
|---|---|
| SFT | Supervised fine-tuning |
| DPO | Direct preference optimization |
| GRPO | Group relative policy optimization |
| PPO | Proximal policy optimization |
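For the reward-based methods (GRPO/PPO), the environment's episode reward can serve directly as the training signal. A minimal GRPO-style sketch: sample several candidate fixes for one task, score each in a fresh episode, then normalize rewards within the group. `apply_fix` is a hypothetical placeholder for however you turn a candidate into EDIT_LINE actions:

import statistics
from codefix_env import CodeFixEnvironment, CodeFixAction, ActionType

def score_candidate(task_id: str, candidate: str, apply_fix) -> float:
    # Run one episode for a candidate fix and return its final reward.
    env = CodeFixEnvironment()
    env.reset(task_id=task_id)
    apply_fix(env, candidate)  # hypothetical: issue EDIT_LINE actions for this candidate
    result = env.step(CodeFixAction(action_type=ActionType.SUBMIT_FIX))
    return result.reward

def group_relative_advantages(rewards):
    # GRPO-style normalization: advantage = (reward - group mean) / group std
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]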
| Tool | Purpose |
|---|---|
| HuggingFace | Model hub & training |
| vLLM | Fast LLM serving |
| Ollama | Local model deployment |
| Docker | Containerization |
| GitHub Actions | CI/CD |