SuperagenticAI
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 26 additions & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 16 additions & 0 deletions
diff --git a/docs/benchmarks/codemode-evaluation.md b/docs/benchmarks/codemode-evaluation.md
Lines changed: 130 additions & 0 deletions
diff --git a/docs/benchmarks/index.md b/docs/benchmarks/index.md
Lines changed: 5 additions & 4 deletions
diff --git a/docs/benchmarks/presets.md b/docs/benchmarks/presets.md
Lines changed: 19 additions & 2 deletions
diff --git a/docs/codemode/architecture.md b/docs/codemode/architecture.md
Lines changed: 91 additions & 0 deletions
diff --git a/docs/codemode/evaluation.md b/docs/codemode/evaluation.md
Lines changed: 36 additions & 0 deletions
@@ -5,7 +5,31 @@ All notable changes to this project are documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## [0.1.5] - 2026-02-15
+## [0.1.2] - 2026-02-20
+
+### Added
+- Harness strategy selector with `tool_call` (default) and opt-in `codemode`.
+- CodeMode execution flow in harness: MCP tool discovery (`search_tools`), typed tool surface prompt, single-program generation, guardrail validation, and MCP chain execution (`call_tool_chain`).
+- Benchmark support for harness strategy comparison with CodeMode telemetry fields (`harness_strategy`, `codemode_chain_calls`, `codemode_search_calls`, `codemode_discovery_calls`, `codemode_guardrail_blocked`).
+- New top-level CodeMode docs section with dedicated pages for quickstart, architecture, guardrails, and evaluation.
+- Release documentation set for CodeMode:
+  - quickstart and operator workflow
+  - integration architecture and runtime controls
+  - provider/bridge separation model (Cloudflare-based, UTCP, custom)
+  - CodeMode sandbox responsibility and deployment matrix
+  - guardrail policy and safety runbook
+  - benchmark evaluation and promotion-gate criteria
+
+### Changed
+- `/harness run` supports `strategy=tool_call|codemode` and `mcp_server=<name>`.
+- `/rlm bench` in `mode=harness` supports `strategy=tool_call|codemode`.
+- Harness and benchmark command handling now auto-enables MCP when `strategy=codemode` is selected.
+
+### Security
+- Added explicit CodeMode guardrail policy documentation with blocked API classes and runtime limit defaults.
+- CodeMode path remains opt-in; default harness behavior remains strict baseline `strategy=tool_call`.
+
+## [0.1.1] - 2026-02-15
 
 Initial public release of **RLM Code**.
 
@@ -31,3 +55,4 @@ Initial public release of **RLM Code**.
 - Unsafe local `exec` usage preserved only as an explicit, opt-in path for advanced development scenarios.
 
 [0.1.5]: https://github.com/SuperagenticAI/rlm-code/releases/tag/v0.1.5
+[0.1.2]: https://github.com/SuperagenticAI/rlm-code/releases/tag/v0.1.2
@@ -25,6 +25,22 @@ RLM Code implements the [Recursive Language Models](https://arxiv.org/abs/2502.0
 
 RLM Code wraps this algorithm in an interactive terminal UI with built-in benchmarks, trajectory replay, and observability.
 
+## Release v0.1.2
+
+This release adds the new CodeMode path as an opt-in harness strategy.
+
+- New harness strategy: `strategy=codemode` (default remains `strategy=tool_call`)
+- MCP bridge flow for CodeMode: `search_tools` -> typed tool surface -> `call_tool_chain`
+- Guardrails before execution: blocked API classes plus timeout/size/tool-call caps
+- Benchmark telemetry for side-by-side comparison: `tool_call` vs `codemode`
+- Dedicated docs section for CodeMode: quickstart, architecture, guardrails, evaluation
+
+Example:
+
+```text
+/harness run "implement feature and add tests" steps=8 mcp=on strategy=codemode mcp_server=codemode
+```
+
 ## Documentation
 
 <p align="center">
 
@@ -0,0 +1,130 @@
+# CodeMode Evaluation & Promotion Gates
+
+Use this page to evaluate `strategy=codemode` against baseline `strategy=tool_call` before wider rollout.
+
+---
+
+## Objective
+
+Compare harness strategies on identical workloads:
+
+- baseline: `strategy=tool_call`
+- candidate: `strategy=codemode`
+
+Promotion decision should be metric-driven and safety-aware.
+
+---
+
+## Evaluation Protocol
+
+1. Select the same preset and same case limit for both runs.
+2. Keep model, MCP server, and environment fixed.
+3. Run baseline first, candidate second.
+4. Compare with gate thresholds.
+
+Recommended starting preset for MCP-heavy behavior:
+
+- `dynamic_web_filtering`
+
+---
+
+## Commands
+
+### Baseline
+
+```bash
+/rlm bench preset=dynamic_web_filtering mode=harness strategy=tool_call mcp=on mcp_server=codemode
+```
+
+### Candidate
+
+```bash
+/rlm bench preset=dynamic_web_filtering mode=harness strategy=codemode mcp=on mcp_server=codemode
+```
+
+### Compare
+
+```bash
+/rlm bench compare candidate=latest baseline=previous min_reward_delta=0.00 min_completion_delta=0.00 max_steps_increase=0.50
+```
+
+### CI-style gate
+
+```bash
+/rlm bench validate candidate=latest baseline=previous min_reward_delta=0.00 min_completion_delta=0.00 max_steps_increase=0.50 fail_on_completion_regression=on --json
+```
+
+---
+
+## Metrics to Watch
+
+Core benchmark metrics:
+
+- `avg_reward`
+- `completion_rate`
+- `avg_steps`
+- usage totals (`total_calls`, `prompt_tokens`, `completion_tokens`)
+
+CodeMode-specific diagnostics (per case):
+
+- `harness_strategy`
+- `codemode_chain_calls`
+- `codemode_search_calls`
+- `codemode_discovery_calls`
+- `codemode_guardrail_blocked`
+- `mcp_tool_calls`
+
+---
+
+## Suggested Promotion Criteria
+
+Use these as default release criteria unless your team has stricter requirements.
+
+| Gate | Recommended threshold |
+|---|---|
+| Reward delta | `>= 0.00` |
+| Completion delta | `>= 0.00` |
+| Steps increase | `<= 0.50` |
+| Completion regressions | `0` (enforce `fail_on_completion_regression=on`) |
+| Safety | No unexplained policy failures in case logs |
+
+If candidate fails any gate, keep default on `tool_call` and continue CodeMode as opt-in only.
+
+---
+
+## Reading Summary Files
+
+Benchmark summaries are stored under `.rlm_code/rlm/benchmarks/*.json`.
+
+For harness runs, summary-level fields include:
+
+- `mode`
+- `mcp_enabled`
+- `mcp_server`
+- `harness_strategy`
+
+Case payloads include the CodeMode telemetry listed above.
+
+---
+
+## Release Decision Template
+
+Use this lightweight checklist for launch approval:
+
+- Baseline benchmark ID:
+- Candidate benchmark ID:
+- Reward delta:
+- Completion delta:
+- Steps increase:
+- Completion regressions:
+- Guardrail blocked count:
+- Decision: `promote` or `hold`
+- Owner + date:
+
+---
+
+## Related Pages
+
+- [Benchmarks & Leaderboard](index.md)
+- [CodeMode Integration](../integrations/codemode.md)
+- [CodeMode Guardrails](../security/codemode-guardrails.md)
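Review note on `docs/benchmarks/codemode-evaluation.md` (not part of the diff): the promotion table can be read as a simple predicate over two benchmark summaries. The sketch below is a minimal illustration of that reading, not the logic behind `/rlm bench validate`. The `Summary` class, the `evaluate_gates` function, and the interpretation of "steps increase" as an absolute difference in `avg_steps` are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Hypothetical container for the core metrics named in the doc."""
    avg_reward: float
    completion_rate: float
    avg_steps: float


def evaluate_gates(baseline, candidate, *, min_reward_delta=0.0,
                   min_completion_delta=0.0, max_steps_increase=0.5,
                   completion_regressions=0):
    """Return ("promote" | "hold", per-gate results) for the documented thresholds.

    Assumption: every delta is candidate minus baseline, and the steps gate
    compares the absolute increase in avg_steps.
    """
    reward_delta = candidate.avg_reward - baseline.avg_reward
    completion_delta = candidate.completion_rate - baseline.completion_rate
    steps_increase = candidate.avg_steps - baseline.avg_steps
    gates = {
        "reward_delta": reward_delta >= min_reward_delta,
        "completion_delta": completion_delta >= min_completion_delta,
        "steps_increase": steps_increase <= max_steps_increase,
        "completion_regressions": completion_regressions == 0,
    }
    decision = "promote" if all(gates.values()) else "hold"
    return decision, gates


if __name__ == "__main__":
    baseline = Summary(avg_reward=0.5, completion_rate=0.5, avg_steps=4.0)
    candidate = Summary(avg_reward=0.75, completion_rate=0.5, avg_steps=4.25)
    print(evaluate_gates(baseline, candidate))
```

The safety gate ("no unexplained policy failures in case logs") is left out on purpose, since it needs a human reading of the logs rather than a numeric threshold.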
@@ -4,7 +4,7 @@ title: Benchmarks & Leaderboard
 
 # Benchmarks & Leaderboard
 
-RLM Code includes a complete benchmarking and evaluation framework designed for research reproducibility and systematic performance tracking. The system provides **10 preset benchmark suites** covering 33+ test cases, a **multi-metric leaderboard** for ranking and comparison, and **session replay** with full time-travel debugging.
+RLM Code includes a complete benchmarking and evaluation framework designed for research reproducibility and systematic performance tracking. The system provides **11 preset benchmark suites** covering 36+ test cases, a **multi-metric leaderboard** for ranking and comparison, and **session replay** with full time-travel debugging.
 
 ---
 
@@ -28,15 +28,15 @@ flowchart TD
 
 ### Preset Benchmarks
 
-10 built-in benchmark suites that cover the full spectrum of RLM capabilities:
+11 built-in benchmark suites that cover the full spectrum of RLM capabilities:
 
 | Category | Presets | Total Cases | Focus |
 |---|---|---|---|
 | DSPy | `dspy_quick`, `dspy_extended` | 8 | DSPy coding loop: signatures, modules, tests |
 | Generic | `generic_smoke` | 2 | Basic Python execution and error recovery |
 | Pure RLM | `pure_rlm_smoke`, `pure_rlm_context` | 7 | Paper-compliant mode, context-as-variable |
 | Advanced | `deep_recursion`, `paradigm_comparison` | 6 | Depth > 1 recursion, cross-paradigm comparison |
-| Paper-Compatible | `oolong_style`, `browsecomp_style`, `token_efficiency` | 10 | OOLONG, BrowseComp-Plus, token efficiency |
+| Paper-Compatible | `oolong_style`, `browsecomp_style`, `token_efficiency`, `dynamic_web_filtering` | 13 | OOLONG, BrowseComp-Plus, token efficiency, dynamic web filtering |
 
 See [Preset Benchmarks](presets.md) for full details on every suite and case.
 
@@ -129,6 +129,7 @@ See [Preset Benchmarks](presets.md) for all supported pack formats including Goo
 
 | Page | Description |
 |---|---|
-| [Preset Benchmarks](presets.md) | All 10 presets in detail, custom pack loading, YAML format |
+| [Preset Benchmarks](presets.md) | All 11 presets in detail, custom pack loading, YAML format |
 | [Leaderboard](leaderboard.md) | Ranking, filtering, statistics, trend analysis, export |
+| [CodeMode Evaluation & Promotion Gates](codemode-evaluation.md) | Side-by-side harness strategy methodology, telemetry, and release gates |
 | [Session Replay](session-replay.md) | Recording, replaying, time-travel debugging, session comparison |
@@ -4,7 +4,7 @@ title: Preset Benchmarks
 
 # Preset Benchmarks
 
-RLM Code ships with 10 preset benchmark suites containing 33+ test cases. These cover DSPy coding loops, generic execution, Pure RLM paper-compliant mode, deep recursion, paradigm comparison, and paper-compatible evaluation tasks.
+RLM Code ships with 11 preset benchmark suites containing 36+ test cases. These cover DSPy coding loops, generic execution, Pure RLM paper-compliant mode, deep recursion, paradigm comparison, and paper-compatible evaluation tasks.
 
 **Module**: `rlm_code.rlm.benchmarks`
 
@@ -38,7 +38,7 @@ class RLMBenchmarkCase:
 
 ---
 
-## All 10 Presets
+## All 11 Presets
 
 ### 1. `dspy_quick` -- Fast DSPy Smoke Test (3 cases)
 
@@ -206,6 +206,23 @@ rlm-code bench preset=token_efficiency
 
 ---
 
+### 11. `dynamic_web_filtering` -- Dynamic Web Filtering (3 cases)
+
+Benchmarks designed for retrieval workflows where search results must be
+filtered by domain scope, keyword constraints, and search-budget discipline.
+
+| Case ID | Description | Max Steps | Timeout |
+|---|---|---|---|
+| `dynamic_filter_domain_scope` | Domain-scoped retrieval with strict source constraints | 6 | 90s |
+| `dynamic_filter_claim_verification` | Claim verification with include/exclude term filters | 7 | 120s |
+| `dynamic_filter_budgeted_search` | Budgeted search with early stopping | 6 | 90s |
+
+```bash
+rlm-code bench preset=dynamic_web_filtering
+```
+
+---
+
 ## Running Benchmarks
 
 ### CLI
 
@@ -0,0 +1,91 @@
+# CodeMode Architecture
+
+This page describes the implemented CodeMode path in `HarnessRunner`.
+
+---
+
+## Strategy behavior
+
+| Strategy | Planner | Execution |
+|---|---|---|
+| `tool_call` | iterative tool/final loop | multi-step tool calls |
+| `codemode` | single generated JS/TS program | single guarded MCP chain execution |
+
+---
+
+## End-to-end flow
+
+```mermaid
+sequenceDiagram
+    participant User as User
+    participant HR as HarnessRunner
+    participant MCP as MCP Server
+    participant LLM as Model
+
+    User->>HR: /harness run ... strategy=codemode
+    HR->>MCP: search_tools(task_description, limit=10)
+    MCP-->>HR: discovered tool interfaces
+    HR->>LLM: typed surface + task prompt
+    LLM-->>HR: {"code": "..."}
+    HR->>HR: guardrail checks
+    HR->>MCP: call_tool_chain(code, timeout, max_output_size)
+    MCP-->>HR: result payload
+    HR-->>User: final response
+```
+
+---
+
+## Contract required by RLM
+
+CodeMode requires MCP server tools:
+
+- `search_tools`
+- `call_tool_chain`
+
+Strict MCP allowlist in harness also includes:
+
+- `list_tools`
+- `tools_info`
+- `get_required_keys_for_tool`
+
+---
+
+## Provider separation
+
+RLM is provider-agnostic at this layer.
+
+- Cloudflare code mode stacks are implementation choices.
+- UTCP bridges are implementation choices.
+- Custom bridges are implementation choices.
+
+RLM only requires MCP tools with compatible names and input schemas.
+
+---
+
+## Sandbox responsibility
+
+For `strategy=codemode`, generated JS/TS is executed by the MCP bridge behind
+`call_tool_chain`, not by the RLM Python execution sandbox.
+
+That means:
+
+- RLM harness guardrails run before execution.
+- Runtime isolation is primarily enforced by the bridge deployment environment.
+- `/sandbox` settings in RLM control RLM Python execution paths, but do not
+  automatically sandbox external bridge execution.
+
+---
+
+## CodeMode sandbox matrix
+
+| Bridge deployment pattern | Where CodeMode program executes | Isolation boundary | Recommended usage |
+|---|---|---|---|
+| UTCP bridge as local `npx` process | Local host process | Host OS process boundary and bridge controls | Local development and debugging |
+| Cloudflare-based bridge deployment | Cloudflare runtime isolate where bridge is deployed | Cloudflare runtime controls plus bridge policy | Hosted production-style deployments |
+| Custom bridge in Docker or VM | Container or VM running bridge server | Container or VM isolation plus bridge policy | Self-hosted staging and production |
+| Custom bridge as plain local process | Local host process | Minimal isolation unless hardening is added | Avoid for production |
+
+Notes:
+
+- Strongest safety posture comes from hardened bridge runtime plus RLM guardrails.
+- For production, prefer explicit `mcp_server=<name>` to avoid ambiguous server resolution.
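Review note on `docs/codemode/architecture.md` (not part of the diff): the sequence diagram boils down to discover, generate, check, execute. The sketch below walks that order with an injected client so it runs without a real MCP server. It is not the `HarnessRunner` implementation. The `McpClient` protocol, `run_codemode`, `guardrail_violation`, the blocked-snippet list, and the default limits are hypothetical; only the tool names, their argument names, and the `{"code": "..."}` model reply shape come from the diagram.

```python
import json
from typing import Callable, Optional, Protocol


class McpClient(Protocol):
    """Hypothetical minimal client interface; real MCP clients differ."""

    def call(self, tool: str, arguments: dict) -> dict: ...


# Illustrative values only; the actual policy lives in the guardrail docs.
BLOCKED_SNIPPETS = ("eval(", "child_process")
MAX_CODE_CHARS = 20_000


def guardrail_violation(code: str) -> Optional[str]:
    """Return a reason string if the program should be blocked, else None."""
    if len(code) > MAX_CODE_CHARS:
        return "code too large"
    for snippet in BLOCKED_SNIPPETS:
        if snippet in code:
            return f"blocked API: {snippet}"
    return None


def run_codemode(task: str, mcp: McpClient, model: Callable[[str], str],
                 timeout: int = 30, max_output_size: int = 65536) -> dict:
    # 1. Discover tool interfaces relevant to the task.
    tools = mcp.call("search_tools", {"task_description": task, "limit": 10})
    # 2. Ask the model for one program; it replies with {"code": "..."}.
    prompt = f"Tools:\n{json.dumps(tools)}\n\nTask: {task}"
    code = json.loads(model(prompt))["code"]
    # 3. Guardrail checks run in the harness, before anything executes.
    reason = guardrail_violation(code)
    if reason is not None:
        return {"blocked": True, "reason": reason}
    # 4. Execution happens in the bridge behind call_tool_chain.
    result = mcp.call("call_tool_chain", {
        "code": code,
        "timeout": timeout,
        "max_output_size": max_output_size,
    })
    return {"blocked": False, "result": result}
```

A blocked program never reaches `call_tool_chain`, which matches the ordering described under "Sandbox responsibility": harness guardrails run first, and runtime isolation is the bridge's job.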
@@ -0,0 +1,36 @@
+# CodeMode Evaluation
+
+Evaluate `strategy=codemode` against `strategy=tool_call` on identical harness workloads.
+
+---
+
+## Suggested workflow
+
+```bash
+/rlm bench preset=dynamic_web_filtering mode=harness strategy=tool_call mcp=on mcp_server=codemode
+/rlm bench preset=dynamic_web_filtering mode=harness strategy=codemode mcp=on mcp_server=codemode
+/rlm bench compare candidate=latest baseline=previous min_reward_delta=0.00 min_completion_delta=0.00 max_steps_increase=0.50
+```
+
+---
+
+## Metrics to review
+
+- `avg_reward`
+- `completion_rate`
+- `avg_steps`
+- usage totals
+- `codemode_chain_calls`
+- `codemode_search_calls`
+- `codemode_discovery_calls`
+- `codemode_guardrail_blocked`
+
+---
+
+## Promotion guidance
+
+Keep CodeMode opt-in unless it meets your gate thresholds with no policy regressions.
+
+For full gating details, see:
+
+- [CodeMode Evaluation and Promotion Gates](../benchmarks/codemode-evaluation.md)
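Review note on `docs/codemode/evaluation.md` (not part of the diff): the per-case CodeMode counters are easier to compare across runs once they are totalled. The sketch below sums them over a list of case payloads. The function name `total_telemetry` is hypothetical, and the layout of the summary JSON (for example, which key holds the case list) is not specified in this PR, so the function takes already-extracted case dictionaries.

```python
# Field names are taken from the metrics list in the doc.
TELEMETRY_KEYS = (
    "codemode_chain_calls",
    "codemode_search_calls",
    "codemode_discovery_calls",
    "codemode_guardrail_blocked",
)


def total_telemetry(cases):
    """Sum CodeMode telemetry fields over an iterable of case dicts.

    Missing or null fields count as 0; boolean flags count as 0 or 1.
    """
    totals = {key: 0 for key in TELEMETRY_KEYS}
    for case in cases:
        for key in TELEMETRY_KEYS:
            totals[key] += int(case.get(key) or 0)
    return totals


if __name__ == "__main__":
    example_cases = [
        {"codemode_chain_calls": 1, "codemode_search_calls": 1},
        {"codemode_chain_calls": 0, "codemode_guardrail_blocked": True},
    ]
    print(total_telemetry(example_cases))
```

A non-zero `codemode_guardrail_blocked` total is the figure to carry into the "Guardrail blocked count" line of the release decision template.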