Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
c3a8f64
initial qed-math OpenEnv scaffold with models and client
rycerzes Mar 14, 2026
016aa35
implement MCP server tools and QEDMath environment
rycerzes Mar 14, 2026
7d725b5
rubric implementation
rycerzes Mar 15, 2026
e01a3f0
data pipeline integration
rycerzes Mar 16, 2026
efa2544
grading prompts
rycerzes Mar 16, 2026
bc95317
API & Client
rycerzes Mar 16, 2026
8d91ff8
fix deps
rycerzes Mar 16, 2026
53b8476
proof submission with output length tokens and reasoning stripping
rycerzes Mar 16, 2026
31ba418
fix eval prompt
rycerzes Mar 17, 2026
8caae19
original_problem field for grading in QED-Nano RC stream
rycerzes Mar 17, 2026
74be02a
add verifier metrics to grading results
rycerzes Mar 17, 2026
5305d42
docs
rycerzes Mar 17, 2026
477024f
tests for QED Math Env
rycerzes Mar 26, 2026
244b5c0
add QED Math inference example
rycerzes Mar 26, 2026
bdb5678
improve async step handling
rycerzes Mar 26, 2026
d6b0c15
increase timeout for LLM judge calls and adjust WebSocket ping settings
rycerzes Mar 26, 2026
f54cc91
add feedback output for incorrect answers in QED Math inference
rycerzes Mar 26, 2026
aeebd21
refactor get_problem method to include metadata and adjust reference …
rycerzes Mar 28, 2026
a41919f
refactor math expression parsing and simplify async calls in tests
rycerzes Mar 29, 2026
f69b83c
impl process-based math verification service
rycerzes Mar 29, 2026
83811d4
gold answer caching and request admission control
rycerzes Mar 29, 2026
8d99084
worker pool restart and health probe reporting
rycerzes Mar 29, 2026
e13f22a
preserve custom observation fields during parsing
rycerzes Mar 29, 2026
60c171a
metrics tracking for MathVerifierService and QEDMathEnvironment
rycerzes Mar 29, 2026
8f86fc8
update env docs
rycerzes Mar 29, 2026
3f822b3
Merge branch 'main' into feat/qed-math-env
rycerzes Mar 29, 2026
389980d
refactor: remove output_length_tokens param from submit_proof and rel…
rycerzes Mar 30, 2026
e35b91e
Merge branch 'main' into feat/qed-math-env
rycerzes Apr 4, 2026
2939ab5
chore(lint): usort + ruff format
rycerzes May 11, 2026
9ffb7ef
Merge branch 'main' into feat/qed-math-env
rycerzes May 11, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 15 additions & 0 deletions envs/qed_math_env/.dockerignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
.venv
.git
.gitignore
.env
__pycache__/
*.pyc
*.pyo
*.pyd
*.pyw
*.pyz
*.pywz
*.pyzw
*.pyzwz


328 changes: 328 additions & 0 deletions envs/qed_math_env/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,328 @@
---
title: QED Math Environment
emoji: 🧮
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
- mathematics
- proof-evaluation
- llm-grading
---

# QED Math Environment

Mathematical proof generation and evaluation environment for OpenEnv, ported from [QED-Nano](https://github.com/meta-pytorch/QED-Nano). Agents receive math problems, submit proofs, and receive LLM-based rubric grading on a 0–7 scale with normalized rewards.

## Features

- **LLM-based rubric grading** (0–7 scale) via any OpenAI-compatible endpoint
- **Process-based answer verification service** (`math_verify` in worker processes)
- **Backpressure + retries + worker restart** for robust concurrent rollout operation
- **Gold-answer cache** keyed by `problem_id` and verifier normalization settings
- **Flexible dataset loading**: local JSONL/JSON, Hugging Face Hub, or built-in bootstrap problems
- **Reward shaping**: discount factor, length penalty, and optional score thresholding
- **Reasoning stripping**: configurable delimiters (e.g. `<think>...</think>`) removed before grading
- **Multi-step problems**: configurable max attempts with per-attempt feedback
- **Verifier metrics**: rollout/staging counters and health signals surfaced in observation metadata, ready for TrackIO / WandB
- **MCP tool interface**: `get_problem`, `submit_proof`, `get_grading_guidelines`

## Quick Start

### Async (default)

```python
import asyncio
from qed_math_env import QEDMathEnv

async def main():
async with QEDMathEnv(base_url="http://localhost:8000") as env:
# Reset to load a problem
result = await env.reset()
obs = result.observation
print(f"Problem: {obs.problem[:100]}...")

# Submit a proof
submission = await env.submit_proof(proof="By induction on n...")
print(f"Score: {submission.score}/7, Reward: {submission.reward:.2f}")

asyncio.run(main())
```

### Sync

```python
from qed_math_env import QEDMathEnv

with QEDMathEnv(base_url="http://localhost:8000").sync() as env:
result = env.reset()
submission = env.call_tool("submit_proof", proof="By induction on n...")
```

### MCP tool-calling

```python
async with QEDMathEnv(base_url="http://localhost:8000") as env:
await env.reset()

# Discover tools
tools = await env.list_tools()
print([t.name for t in tools])
# ['get_problem', 'submit_proof', 'get_grading_guidelines']

# Call tools by name
problem = await env.call_tool("get_problem")
guidelines = await env.call_tool("get_grading_guidelines")
result = await env.call_tool("submit_proof", proof="...")
```

## Building & Running

```bash
# Build Docker image (from project root)
docker build -t qed-math-env:latest -f envs/qed_math_env/server/Dockerfile .

# Run the server
docker run -p 8000:8000 -e OPENAI_API_KEY=$OPENAI_API_KEY qed-math-env:latest

# Or run locally with uvicorn
PYTHONPATH=src:envs uvicorn qed_math_env.server.app:app --port 8000

# Or install and run via uv
cd envs/qed_math_env
uv sync
uv run server
```

## Project Structure

```
qed_math_env/
├── __init__.py # Module exports (QEDMathEnv, models)
├── models.py # ProblemObservation, ProofSubmissionObservation
├── client.py # QEDMathEnv client (MCPToolClient subclass)
├── openenv.yaml # OpenEnv manifest with metrics declarations
├── pyproject.toml # Dependencies
├── uv.lock # Locked dependencies
├── README.md
├── prompts/
│ └── evaluator_prompts/
│ └── v2.md # Evaluator prompt template (QED-Nano v2)
└── server/
├── __init__.py
├── app.py # FastAPI server (create_app factory)
├── qed_math_environment.py # QEDMathEnvironment (MCPEnvironment)
├── math_verify_service.py # Process-pool verifier service + health/metrics
├── mcp_server.py # MCP tool registration
├── rubric.py # MathProofRubric + GradingResult
└── Dockerfile
```

## Configuration

The environment is configured via `QEDMathConfig`:

| Parameter | Default | Description |
|-----------|---------|-------------|
| `dataset_path` | `None` | Dataset source: local path, Hub ID, or list of specs. `None` uses bootstrap problems. |
| `grader_model` | `"gemini-3-pro"` | Model identifier for the LLM grader (any OpenAI-compatible endpoint) |
| `prompt_name` | `"v2"` | Evaluator prompt template name (loads from `prompts/evaluator_prompts/`) |
| `custom_reward_threshold` | `False` | When `True`, collapses partial-credit scores 1–5 → 1 |
| `max_attempts` | `1` | Max proof attempts per problem (>1 for multi-step) |
| `discount_factor` | `1.0` | Exponential discount: `reward *= discount_factor ** output_length_tokens` |
| `buffer_tokens` | `0` | Length penalty zone width. `0` disables the penalty. |
| `max_tokens` | `0` | Max token limit for length penalty computation |
| `reasoning_delimiters` | `None` | Delimiter strings to strip reasoning (e.g. `["</think>"]`) |
| `verifier_workers` | `max(2, min(8, cpu_count//2))` | Number of process workers used for answer-mode verification |
| `verifier_queue_size` | `verifier_workers * 32` | Max in-flight verifier requests before backpressure |
| `verifier_request_timeout_seconds` | `5.0` | Per-request client-side timeout when awaiting worker response |
| `verifier_max_retries` | `1` | Retry budget for transient verifier infra failures |
| `verifier_strict` | `True` | Strict `math_verify` equivalence mode |
| `verifier_numeric_precision` | `5` | Numeric precision setting used in verifier request contract |
| `verifier_float_rounding` | `10` | Float rounding setting used in verifier request contract |

Environment variables:
- `OPENAI_API_KEY` — API key for the grader LLM
- `OPENAI_BASE_URL` — Base URL override (for non-OpenAI providers)

## Dataset Format

### Local JSONL/JSON

```json
{
"problem": "Prove that the sum of two even integers is even.",
"solution": "Let a=2m and b=2n. Then a+b=2(m+n), which is even.",
"rubrics": [
{"title": "Definitions", "points": 2, "desc": "Correctly defines even integers."},
{"title": "Algebra", "points": 3, "desc": "Valid algebraic manipulation."},
{"title": "Conclusion", "points": 2, "desc": "Correctly concludes evenness."}
],
"dataset": "FineProofs-RL",
"problem_id": "fp_001"
}
```

### Hugging Face Hub

```python
QEDMathConfig(dataset_path="meta-math/MetaMathQA")
# or with config
QEDMathConfig(dataset_path={"hub_id": "meta-math/MetaMathQA", "split": "train", "config": "default"})
```

### Field Aliases

The environment normalizes many dataset formats automatically:

| Canonical Field | Accepted Aliases |
|----------------|------------------|
| `problem` | `task`, `Problem` |
| `reference_solution` | `solution`, `answer`, `Solution` |
| `grading_guidelines` | `rubrics`, `schema`, `schema_0`, `Grading guidelines`, `details` |
| `problem_id` | `id` |
| `original_problem` | Used for RC-stream problems where the actor prompt differs from grading prompt |

## Observation Space

### ProblemObservation (from `reset` / `get_problem`)

| Field | Type | Description |
|-------|------|-------------|
| `problem` | `str` | Math problem statement |
| `reference_solution` | `str` | Ground-truth solution |
| `grading_guidelines` | `str` | Rubric / marking scheme |
| `problem_id` | `str` | Unique identifier |
| `problem_type` | `str` | `"proof"`, `"answer"`, or `"multi_step"` |
| `dataset_source` | `str` | Source dataset name |
| `metadata` | `dict` | Additional context (e.g. `original_problem`) |

### ProofSubmissionObservation (from `submit_proof`)

| Field | Type | Description |
|-------|------|-------------|
| `proof` | `str` | Submitted proof text |
| `score` | `int` | Raw grade (0–7) |
| `feedback` | `str` | Full grader response |
| `reward` | `float` | Shaped reward in [0, 1] |
| `done` | `bool` | Whether the episode is over |
| `is_correct` | `bool` | Whether score >= success threshold (default 6) |
| `attempt_number` | `int` | Current attempt count |
| `attempts_remaining` | `int` | Remaining attempts |
| `problem_type` | `str` | Problem type |
| `metadata` | `dict` | Contains `verifier_metrics`, `base_reward`, `shaped_reward` |

## MCP Tools

| Tool | Description | Parameters |
|------|-------------|------------|
| `get_problem` | Return current problem statement and metadata | — |
| `submit_proof` | Submit a proof for LLM-based rubric grading | `proof` (str, required) |
| `get_grading_guidelines` | Return the rubric/marking scheme | — |

> **Note:** `output_length_tokens` is **not** an agent-supplied parameter. Token counts are
> injected by the training harness via the HTTP step request body (see [Reward Shaping](#reward-shaping))
> to preserve reward integrity — the agent cannot influence its own discount factor.

## Reward Shaping

The reward pipeline follows QED-Nano conventions:

1. **LLM grading**: Score 0–7 via evaluator prompt with `<score>N</score>` parsing
2. **Optional thresholding**: Collapses 1–5 → 1 (when `custom_reward_threshold=True`)
3. **Normalization**: `reward = score / 7.0`
4. **Discount factor**: `reward *= discount_factor ** output_length_tokens`
5. **Length penalty**: Linear penalty when output approaches `max_tokens`

For answer-mode problems (`evaluation_mode: "answer"`), grading is routed through the process-based verifier service: `\boxed{}` answers are extracted and verified against cached gold answers, with timeout/retry/backpressure handling for concurrent rollouts.

### Harness-injected token count

Steps 4 and 5 require the full generation length (including any reasoning trace that is stripped before grading). This value cannot come from the agent — it is supplied by the training harness as an out-of-band field in the HTTP step request body, mirroring the [`StateUsageTracker`](https://github.com/PrimeIntellect-ai/verifiers/blob/main/verifiers/utils/usage_utils.py) pattern from PrimeIntellect/verifiers:

```python
# Training harness (pseudocode)
completion_tokens = llm_call.usage.completion_tokens # from inference API

step_response = await openenv_client.step(
action=CallToolAction(tool_name="submit_proof", arguments={"proof": proof_text}),
output_length_tokens=completion_tokens, # injected here, not via MCP tool
)
```

When `output_length_tokens` is absent (local testing, eval without a training loop) shaping is skipped entirely — no estimation is attempted, consistent with verifiers' behaviour of returning `None` from `StateUsageTracker.snapshot()` when no usage was recorded.

## Verifier Metrics

Every `submit_proof` call emits verifier metrics in `metadata["verifier_metrics"]`, compatible with TrackIO and WandB:

| Metric | Description |
|--------|-------------|
| `verifier/rollouts/success` | 1 if grading succeeded |
| `verifier/rollouts/failure` | 1 if grading failed |
| `verifier/failures/timeout` | Count of timeout errors |
| `verifier/failures/rate_limit` | Count of rate-limit errors |
| `verifier/failures/no_input` | 1 if proof was empty |
| `verifier/failures/no_score_tag` | 1 if LLM response had no `<score>` tag |
| `verifier/failures/all_attempts_failed` | 1 if all retries exhausted |
| `verifier/failures/num_retries` | Number of retries used |
| `verifier/runtime/latency_per_request` | Grading wall-clock time (seconds) |
| `verifier/requests/count` | Total verifier requests processed by the service |
| `verifier/requests/latency_ms` | Service-level average request latency |
| `verifier/requests/timeout_count` | Service-level timeout counter |
| `verifier/requests/error_count` | Service-level internal error counter |
| `verifier/queue/depth` | Current in-flight verifier queue depth |
| `verifier/cache/hit_rate` | Gold-answer cache hit rate |
| `verifier/workers/restart_count` | Worker-pool restart count |
| `verifier/workers/worker_restarted` | 1 if current request required worker restart |
| `verifier/workers/heartbeat_lag_ms` | Time since last verifier activity |
| `verifier/runtime/input_tokens` | Estimated input tokens |
| `verifier/runtime/output_tokens` | Estimated output tokens |
| `reward/base` | Pre-shaping reward |
| `reward/shaped` | Post-shaping reward |
| `reward/score_raw` | Raw integer score (0–7) |
| `reward/overlong_penalty` | Length penalty applied |
| `episode/attempt_number` | Current attempt |
| `episode/is_correct` | 1 if correct |
| `episode/problem_type` | proof / answer / multi_step |
| `episode/dataset_source` | Source dataset name |

### TrackIO Integration

```python
import trackio

run = trackio.init(project="qed-math-training")

# After each submit_proof call:
verifier_metrics = result["metadata"]["verifier_metrics"]
numeric = {k: v for k, v in verifier_metrics.items() if isinstance(v, (int, float))}
run.log(numeric, step=global_step)
```

Or with TRL's GRPOTrainer:

```python
from trl import GRPOConfig

config = GRPOConfig(
report_to="trackio",
trackio_space_id="your-org/qed-math-grpo",
# ...
)
```

## Deployment

```bash
# Optional: run rollout/staging verifier validation first
PYTHONPATH=src:envs uv run python scripts/qed_math_verifier_staging_validation.py \
--workers 4 --queue-size 128 --concurrency 64 --requests 2000 \
--max-timeout-rate 0.05 --max-error-rate 0.02

openenv push
```
29 changes: 29 additions & 0 deletions envs/qed_math_env/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""QED Math Environment."""

from .client import QEDMathEnv
from .models import (
GetGradingGuidelines,
GetProblem,
ProblemObservation,
ProofSubmissionObservation,
QEDMathAction,
QEDMathObservation,
SubmitProof,
)

__all__ = [
"QEDMathAction",
"QEDMathObservation",
"QEDMathEnv",
"SubmitProof",
"GetProblem",
"GetGradingGuidelines",
"ProblemObservation",
"ProofSubmissionObservation",
]
Loading