Commit fa7f8d6

Codemode
1 parent 2e49922 commit fa7f8d6

46 files changed

Lines changed: 2557 additions & 73 deletions

CHANGELOG.md

Lines changed: 26 additions & 1 deletion
@@ -5,7 +5,31 @@ All notable changes to this project are documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

-## [0.1.5] - 2026-02-15
+## [0.1.2] - 2026-02-20
+
+### Added
+- Harness strategy selector with `tool_call` (default) and opt-in `codemode`.
+- CodeMode execution flow in harness: MCP tool discovery (`search_tools`), typed tool surface prompt, single-program generation, guardrail validation, and MCP chain execution (`call_tool_chain`).
+- Benchmark support for harness strategy comparison with CodeMode telemetry fields (`harness_strategy`, `codemode_chain_calls`, `codemode_search_calls`, `codemode_discovery_calls`, `codemode_guardrail_blocked`).
+- New top-level CodeMode docs section with dedicated pages for quickstart, architecture, guardrails, and evaluation.
+- Release documentation set for CodeMode:
+  - quickstart and operator workflow
+  - integration architecture and runtime controls
+  - provider/bridge separation model (Cloudflare-based, UTCP, custom)
+  - CodeMode sandbox responsibility and deployment matrix
+  - guardrail policy and safety runbook
+  - benchmark evaluation and promotion-gate criteria
+
+### Changed
+- `/harness run` supports `strategy=tool_call|codemode` and `mcp_server=<name>`.
+- `/rlm bench` in `mode=harness` supports `strategy=tool_call|codemode`.
+- Harness and benchmark command handling now auto-enables MCP when `strategy=codemode` is selected.
+
+### Security
+- Added explicit CodeMode guardrail policy documentation with blocked API classes and runtime limit defaults.
+- Codemode path remains opt-in; default harness behavior remains strict baseline `strategy=tool_call`.
+
+## [0.1.1] - 2026-02-15

 Initial public release of **RLM Code**.

@@ -31,3 +55,4 @@ Initial public release of **RLM Code**.
 - Unsafe local `exec` usage preserved only as an explicit, opt-in path for advanced development scenarios.

 [0.1.5]: https://github.com/SuperagenticAI/rlm-code/releases/tag/v0.1.5
+[0.1.2]: https://github.com/SuperagenticAI/rlm-code/releases/tag/v0.1.2

README.md

Lines changed: 16 additions & 0 deletions
@@ -25,6 +25,22 @@ RLM Code implements the [Recursive Language Models](https://arxiv.org/abs/2502.0

 RLM Code wraps this algorithm in an interactive terminal UI with built-in benchmarks, trajectory replay, and observability.

+## Release v0.1.2
+
+This release adds the new CodeMode path as an opt-in harness strategy.
+
+- New harness strategy: `strategy=codemode` (default remains `strategy=tool_call`)
+- MCP bridge flow for CodeMode: `search_tools` -> typed tool surface -> `call_tool_chain`
+- Guardrails before execution: blocked API classes plus timeout/size/tool-call caps
+- Benchmark telemetry for side-by-side comparison: `tool_call` vs `codemode`
+- Dedicated docs section for CodeMode: quickstart, architecture, guardrails, evaluation
+
+Example:
+
+```text
+/harness run "implement feature and add tests" steps=8 mcp=on strategy=codemode mcp_server=codemode
+```
+
 ## Documentation

 <p align="center">

docs/benchmarks/codemode-evaluation.md

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
+# CodeMode Evaluation & Promotion Gates
+
+Use this page to evaluate `strategy=codemode` against baseline `strategy=tool_call` before wider rollout.
+
+---
+
+## Objective
+
+Compare harness strategies on identical workloads:
+
+- baseline: `strategy=tool_call`
+- candidate: `strategy=codemode`
+
+Promotion decision should be metric-driven and safety-aware.
+
+---
+
+## Evaluation Protocol
+
+1. Select the same preset and same case limit for both runs.
+2. Keep model, MCP server, and environment fixed.
+3. Run baseline first, candidate second.
+4. Compare with gate thresholds.
+
+Recommended starting preset for MCP-heavy behavior:
+
+- `dynamic_web_filtering`
+
+---
+
+## Commands
+
+### Baseline
+
+```bash
+/rlm bench preset=dynamic_web_filtering mode=harness strategy=tool_call mcp=on mcp_server=codemode
+```
+
+### Candidate
+
+```bash
+/rlm bench preset=dynamic_web_filtering mode=harness strategy=codemode mcp=on mcp_server=codemode
+```
+
+### Compare
+
+```bash
+/rlm bench compare candidate=latest baseline=previous min_reward_delta=0.00 min_completion_delta=0.00 max_steps_increase=0.50
+```
+
+### CI-style gate
+
+```bash
+/rlm bench validate candidate=latest baseline=previous min_reward_delta=0.00 min_completion_delta=0.00 max_steps_increase=0.50 fail_on_completion_regression=on --json
+```
+
+---
+
+## Metrics to Watch
+
+Core benchmark metrics:
+
+- `avg_reward`
+- `completion_rate`
+- `avg_steps`
+- usage totals (`total_calls`, `prompt_tokens`, `completion_tokens`)
+
+CodeMode-specific diagnostics (per case):
+
+- `harness_strategy`
+- `codemode_chain_calls`
+- `codemode_search_calls`
+- `codemode_discovery_calls`
+- `codemode_guardrail_blocked`
+- `mcp_tool_calls`
+
+---
+
+## Suggested Promotion Criteria
+
+Use these as default release criteria unless your team has stricter requirements.
+
+| Gate | Recommended threshold |
+|---|---|
+| Reward delta | `>= 0.00` |
+| Completion delta | `>= 0.00` |
+| Steps increase | `<= 0.50` |
+| Completion regressions | `0` (enforce `fail_on_completion_regression=on`) |
+| Safety | No unexplained policy failures in case logs |
+
+If candidate fails any gate, keep default on `tool_call` and continue CodeMode as opt-in only.
+
+---
+
+## Reading Summary Files
+
+Benchmark summaries are stored under `.rlm_code/rlm/benchmarks/*.json`.
+
+For harness runs, summary-level fields include:
+
+- `mode`
+- `mcp_enabled`
+- `mcp_server`
+- `harness_strategy`
+
+Case payloads include the CodeMode telemetry listed above.
+
+---
+
+## Release Decision Template
+
+Use this lightweight checklist for launch approval:
+
+- Baseline benchmark ID:
+- Candidate benchmark ID:
+- Reward delta:
+- Completion delta:
+- Steps increase:
+- Completion regressions:
+- Guardrail blocked count:
+- Decision: `promote` or `hold`
+- Owner + date:
+
+---
+
+## Related Pages
+
+- [Benchmarks & Leaderboard](index.md)
+- [CodeMode Integration](../integrations/codemode.md)
+- [CodeMode Guardrails](../security/codemode-guardrails.md)
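
The promotion gates documented in this file can be illustrated with a small Python sketch. It is not code from this commit: the `Summary` dataclass and `evaluate_gates` function are hypothetical names, and the sketch assumes that each delta is the plain difference `candidate - baseline` of the summary metrics (`avg_reward`, `completion_rate`, `avg_steps`). How `/rlm bench compare` actually computes these values is not shown in the diff.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Hypothetical container for the three core metrics of one benchmark run."""
    avg_reward: float
    completion_rate: float
    avg_steps: float


def evaluate_gates(baseline, candidate,
                   min_reward_delta=0.0,
                   min_completion_delta=0.0,
                   max_steps_increase=0.5):
    """Return ("promote" | "hold", per-gate results).

    Assumption: deltas are absolute differences (candidate minus baseline).
    """
    checks = {
        "reward_delta":
            candidate.avg_reward - baseline.avg_reward >= min_reward_delta,
        "completion_delta":
            candidate.completion_rate - baseline.completion_rate >= min_completion_delta,
        "steps_increase":
            candidate.avg_steps - baseline.avg_steps <= max_steps_increase,
    }
    decision = "promote" if all(checks.values()) else "hold"
    return decision, checks


if __name__ == "__main__":
    base = Summary(avg_reward=0.70, completion_rate=0.80, avg_steps=5.0)
    cand = Summary(avg_reward=0.75, completion_rate=0.80, avg_steps=5.25)
    print(evaluate_gates(base, cand))
```

The safety gate ("no unexplained policy failures in case logs") is deliberately left out of the sketch, since it depends on a manual review of case logs rather than on a numeric threshold.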

docs/benchmarks/index.md

Lines changed: 5 additions & 4 deletions
@@ -4,7 +4,7 @@ title: Benchmarks & Leaderboard

 # Benchmarks & Leaderboard

-RLM Code includes a complete benchmarking and evaluation framework designed for research reproducibility and systematic performance tracking. The system provides **10 preset benchmark suites** covering 33+ test cases, a **multi-metric leaderboard** for ranking and comparison, and **session replay** with full time-travel debugging.
+RLM Code includes a complete benchmarking and evaluation framework designed for research reproducibility and systematic performance tracking. The system provides **11 preset benchmark suites** covering 33+ test cases, a **multi-metric leaderboard** for ranking and comparison, and **session replay** with full time-travel debugging.

 ---

@@ -28,15 +28,15 @@ flowchart TD

 ### Preset Benchmarks

-10 built-in benchmark suites that cover the full spectrum of RLM capabilities:
+11 built-in benchmark suites that cover the full spectrum of RLM capabilities:

 | Category | Presets | Total Cases | Focus |
 |---|---|---|---|
 | DSPy | `dspy_quick`, `dspy_extended` | 8 | DSPy coding loop: signatures, modules, tests |
 | Generic | `generic_smoke` | 2 | Basic Python execution and error recovery |
 | Pure RLM | `pure_rlm_smoke`, `pure_rlm_context` | 7 | Paper-compliant mode, context-as-variable |
 | Advanced | `deep_recursion`, `paradigm_comparison` | 6 | Depth > 1 recursion, cross-paradigm comparison |
-| Paper-Compatible | `oolong_style`, `browsecomp_style`, `token_efficiency` | 10 | OOLONG, BrowseComp-Plus, token efficiency |
+| Paper-Compatible | `oolong_style`, `browsecomp_style`, `token_efficiency`, `dynamic_web_filtering` | 13 | OOLONG, BrowseComp-Plus, token efficiency, dynamic web filtering |

 See [Preset Benchmarks](presets.md) for full details on every suite and case.

@@ -129,6 +129,7 @@ See [Preset Benchmarks](presets.md) for all supported pack formats including Goo

 | Page | Description |
 |---|---|
-| [Preset Benchmarks](presets.md) | All 10 presets in detail, custom pack loading, YAML format |
+| [Preset Benchmarks](presets.md) | All 11 presets in detail, custom pack loading, YAML format |
 | [Leaderboard](leaderboard.md) | Ranking, filtering, statistics, trend analysis, export |
+| [CodeMode Evaluation & Promotion Gates](codemode-evaluation.md) | Side-by-side harness strategy methodology, telemetry, and release gates |
 | [Session Replay](session-replay.md) | Recording, replaying, time-travel debugging, session comparison |

docs/benchmarks/presets.md

Lines changed: 19 additions & 2 deletions
@@ -4,7 +4,7 @@ title: Preset Benchmarks

 # Preset Benchmarks

-RLM Code ships with 10 preset benchmark suites containing 33+ test cases. These cover DSPy coding loops, generic execution, Pure RLM paper-compliant mode, deep recursion, paradigm comparison, and paper-compatible evaluation tasks.
+RLM Code ships with 11 preset benchmark suites containing 36+ test cases. These cover DSPy coding loops, generic execution, Pure RLM paper-compliant mode, deep recursion, paradigm comparison, and paper-compatible evaluation tasks.

 **Module**: `rlm_code.rlm.benchmarks`

@@ -38,7 +38,7 @@ class RLMBenchmarkCase:

 ---

-## All 10 Presets
+## All 11 Presets

 ### 1. `dspy_quick` -- Fast DSPy Smoke Test (3 cases)

@@ -206,6 +206,23 @@ rlm-code bench preset=token_efficiency

 ---

+### 11. `dynamic_web_filtering` -- Dynamic Web Filtering (3 cases)
+
+Benchmarks designed for retrieval workflows where search results must be
+filtered by domain scope, keyword constraints, and search-budget discipline.
+
+| Case ID | Description | Max Steps | Timeout |
+|---|---|---|---|
+| `dynamic_filter_domain_scope` | Domain-scoped retrieval with strict source constraints | 6 | 90s |
+| `dynamic_filter_claim_verification` | Claim verification with include/exclude term filters | 7 | 120s |
+| `dynamic_filter_budgeted_search` | Budgeted search with early stopping | 6 | 90s |
+
+```bash
+rlm-code bench preset=dynamic_web_filtering
+```
+
+---
+
 ## Running Benchmarks

 ### CLI

docs/codemode/architecture.md

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
+# CodeMode Architecture
+
+This page describes the implemented CodeMode path in `HarnessRunner`.
+
+---
+
+## Strategy behavior
+
+| Strategy | Planner | Execution |
+|---|---|---|
+| `tool_call` | iterative tool/final loop | multi-step tool calls |
+| `codemode` | single generated JS/TS program | single guarded MCP chain execution |
+
+---
+
+## End-to-end flow
+
+```mermaid
+sequenceDiagram
+participant User as User
+participant HR as HarnessRunner
+participant MCP as MCP Server
+participant LLM as Model
+
+User->>HR: /harness run ... strategy=codemode
+HR->>MCP: search_tools(task_description, limit=10)
+MCP-->>HR: discovered tool interfaces
+HR->>LLM: typed surface + task prompt
+LLM-->>HR: {"code": "..."}
+HR->>HR: guardrail checks
+HR->>MCP: call_tool_chain(code, timeout, max_output_size)
+MCP-->>HR: result payload
+HR-->>User: final response
+```
+
+---
+
+## Contract required by RLM
+
+CodeMode requires MCP server tools:
+
+- `search_tools`
+- `call_tool_chain`
+
+Strict MCP allowlist in harness also includes:
+
+- `list_tools`
+- `tools_info`
+- `get_required_keys_for_tool`
+
+---
+
+## Provider separation
+
+RLM is provider-agnostic at this layer.
+
+- Cloudflare code mode stacks are implementation choices.
+- UTCP bridges are implementation choices.
+- Custom bridges are implementation choices.
+
+RLM only requires MCP tools with compatible names and input schemas.
+
+---
+
+## Sandbox responsibility
+
+For `strategy=codemode`, generated JS/TS is executed by the MCP bridge behind
+`call_tool_chain`, not by the RLM Python execution sandbox.
+
+That means:
+
+- RLM harness guardrails run before execution.
+- Runtime isolation is primarily enforced by the bridge deployment environment.
+- `/sandbox` settings in RLM control RLM Python execution paths, but do not
+  automatically sandbox external bridge execution.
+
+---
+
+## CodeMode sandbox matrix
+
+| Bridge deployment pattern | Where CodeMode program executes | Isolation boundary | Recommended usage |
+|---|---|---|---|
+| UTCP bridge as local `npx` process | Local host process | Host OS process boundary and bridge controls | Local development and debugging |
+| Cloudflare-based bridge deployment | Cloudflare runtime isolate where bridge is deployed | Cloudflare runtime controls plus bridge policy | Hosted production-style deployments |
+| Custom bridge in Docker or VM | Container or VM running bridge server | Container or VM isolation plus bridge policy | Self-hosted staging and production |
+| Custom bridge as plain local process | Local host process | Minimal isolation unless hardening is added | Avoid for production |
+
+Notes:
+
+- Strongest safety posture comes from hardened bridge runtime plus RLM guardrails.
+- For production, prefer explicit `mcp_server=<name>` to avoid ambiguous server resolution.
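
The end-to-end flow in this file (discover tools, generate one program, run guardrails, execute the chain once) can be sketched in Python. This is an illustrative sketch, not the `HarnessRunner` code from the commit. The `call_mcp` and `generate` callables, the `BLOCKED_SNIPPETS` list, the `MAX_CODE_CHARS` cap, and the default `timeout` and `max_output_size` values are all hypothetical; only the tool names and their argument names (`task_description`, `limit`, `code`, `timeout`, `max_output_size`) come from the sequence diagram.

```python
import json

# Hypothetical guardrail policy; the real blocked API classes and limits
# are defined in the guardrails documentation, not in this diff.
BLOCKED_SNIPPETS = ("child_process", "eval(", "process.env")
MAX_CODE_CHARS = 10_000


def check_guardrails(code):
    """Return a list of violations; an empty list means the program may run."""
    violations = [f"blocked API: {s}" for s in BLOCKED_SNIPPETS if s in code]
    if len(code) > MAX_CODE_CHARS:
        violations.append("program too large")
    return violations


def run_codemode(task, call_mcp, generate, timeout=30, max_output_size=20_000):
    """One CodeMode pass: discover tools, generate code, guard, execute once.

    call_mcp(name, args) and generate(tools, task) are injected stand-ins
    for the MCP client and the model call.
    """
    tools = call_mcp("search_tools", {"task_description": task, "limit": 10})
    reply = generate(tools, task)          # model replies with {"code": "..."}
    code = json.loads(reply)["code"]
    violations = check_guardrails(code)
    if violations:
        # Guardrails run before execution, so the bridge is never called.
        return {"blocked": True, "violations": violations}
    result = call_mcp(
        "call_tool_chain",
        {"code": code, "timeout": timeout, "max_output_size": max_output_size},
    )
    return {"blocked": False, "result": result}
```

Because the guardrail check sits in the harness and the program itself runs behind `call_tool_chain`, a check like this complements the bridge's runtime isolation and does not replace it.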

docs/codemode/evaluation.md

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+# CodeMode Evaluation
+
+Evaluate `strategy=codemode` against `strategy=tool_call` on identical harness workloads.
+
+---
+
+## Suggested workflow
+
+```bash
+/rlm bench preset=dynamic_web_filtering mode=harness strategy=tool_call mcp=on mcp_server=codemode
+/rlm bench preset=dynamic_web_filtering mode=harness strategy=codemode mcp=on mcp_server=codemode
+/rlm bench compare candidate=latest baseline=previous min_reward_delta=0.00 min_completion_delta=0.00 max_steps_increase=0.50
+```
+
+---
+
+## Metrics to review
+
+- `avg_reward`
+- `completion_rate`
+- `avg_steps`
+- usage totals
+- `codemode_chain_calls`
+- `codemode_search_calls`
+- `codemode_discovery_calls`
+- `codemode_guardrail_blocked`
+
+---
+
+## Promotion guidance
+
+Keep CodeMode opt-in unless it meets your gate thresholds with no policy regressions.
+
+For full gating details, see:
+
+- [CodeMode Evaluation and Promotion Gates](../benchmarks/codemode-evaluation.md)
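
For reviewing the CodeMode telemetry fields listed in this file, a short Python sketch can total them across the cases of one benchmark summary. The summary schema is not shown in this commit, so the sketch assumes a top-level `cases` list whose entries carry the per-case fields; that key and the function names `summarize_codemode` and `load_summary` are hypothetical.

```python
import json
from pathlib import Path

# Per-case counter fields named in the docs added by this commit.
CODEMODE_COUNTERS = (
    "codemode_chain_calls",
    "codemode_search_calls",
    "codemode_discovery_calls",
)


def summarize_codemode(summary):
    """Total the CodeMode counters over all cases of one summary dict.

    Assumption: per-case payloads live under a top-level "cases" list.
    """
    cases = summary.get("cases", [])
    totals = {
        name: sum(int(case.get(name) or 0) for case in cases)
        for name in CODEMODE_COUNTERS
    }
    # Count the cases in which the guardrail blocked the generated program.
    totals["guardrail_blocked_cases"] = sum(
        1 for case in cases if case.get("codemode_guardrail_blocked")
    )
    totals["harness_strategy"] = summary.get("harness_strategy")
    return totals


def load_summary(path):
    """Read one summary file, e.g. from .rlm_code/rlm/benchmarks/."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Running `summarize_codemode(load_summary(path))` on the baseline and candidate summaries gives a quick side-by-side view before applying the gate thresholds.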
