From 49827a55d0d84b3b4874a07a0f02a81ca64eff20 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Sat, 4 Apr 2026 14:43:49 +0000
Subject: [PATCH 1/3] Initial plan


From 8b68464408842f6d3eedbe99cb5300f56b43bc8b Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Sat, 4 Apr 2026 14:45:29 +0000
Subject: [PATCH 2/3] docs: add docs/riteway-ai.md with complete Riteway AI
 documentation including temperature/model config

Agent-Logs-Url: https://github.com/paralleldrive/riteway/sessions/ec479bc6-b9af-46a2-be4e-74350664a1d8

Co-authored-by: ericelliott <364727+ericelliott@users.noreply.github.com>
---
 docs/riteway-ai.md | 179 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 179 insertions(+)
 create mode 100644 docs/riteway-ai.md

diff --git a/docs/riteway-ai.md b/docs/riteway-ai.md
new file mode 100644
index 00000000..af9edfd8
--- /dev/null
+++ b/docs/riteway-ai.md
@@ -0,0 +1,179 @@
+# `riteway ai` — AI Prompt Evaluations
+
+The `riteway ai` CLI runs your AI agent prompt evaluations against a configurable pass-rate threshold. Write a `.sudo` test file, run it through any supported AI agent, and get a TAP-formatted report with per-assertion pass rates across multiple runs.
+
+---
+
+## Authentication
+
+All agents use OAuth authentication — no API keys needed. Authenticate once before running evals:
+
+| Agent | Command | Docs |
+|-------|---------|------|
+| Claude | `claude setup-token` | [Claude Code docs](https://docs.anthropic.com/en/docs/claude-code) |
+| Cursor | `agent login` | [Cursor docs](https://docs.cursor.com/context/rules-for-ai) |
+| OpenCode | See docs | [opencode.ai/docs/cli](https://opencode.ai/docs/cli/) |
+
+---
+
+## Writing a test file
+
+AI evals are written in `.sudo` files using [SudoLang](https://github.com/paralleldrive/sudolang) syntax:
+
+```
+# my-feature-test.sudo
+
+import 'path/to/spec.mdc'
+
+userPrompt = """
+Implement the sum function as described.
+"""
+
+- Given the spec, should name the function sum
+- Given the spec, should accept two parameters named a and b
+- Given the spec, should return the correct sum of the two parameters
+```
+
+Each `- Given ..., should ...` line becomes an independently judged assertion. The agent is asked to respond to the `userPrompt` (with any imported spec as context), and a judge agent scores each assertion across all runs.
+
+---
+
+## Running an eval
+
+```shell
+riteway ai path/to/my-feature-test.sudo
+```
+
+By default this runs **4 passes**, requires **75% pass rate**, uses the **claude** agent, runs up to **4 tests concurrently**, and allows **300 seconds** per agent call.
+
+```shell
+# Specify runs, threshold, and agent
+riteway ai path/to/test.sudo --runs 10 --threshold 80 --agent opencode
+
+# Use a Cursor agent with color output
+riteway ai path/to/test.sudo --agent cursor --color
+
+# Use a custom agent config file (mutually exclusive with --agent)
+riteway ai path/to/test.sudo --agent-config ./my-agent.json
+```
+
+---
+
+## Options
+
+| Flag | Default | Description |
+|------|---------|-------------|
+| `--runs N` | `4` | Number of passes per assertion |
+| `--threshold P` | `75` | Required pass percentage (0–100) |
+| `--timeout MS` | `300000` | Per-agent-call timeout in milliseconds |
+| `--agent NAME` | `claude` | Agent: `claude`, `opencode`, `cursor`, or a custom name from `riteway.agent-config.json` |
+| `--agent-config FILE` | — | Path to a flat single-agent JSON config `{"command","args","outputFormat"}` — mutually exclusive with `--agent` |
+| `--concurrency N` | `4` | Max concurrent test executions |
+| `--color` | off | Enable ANSI color output |
+| `--save-responses` | off | Save raw agent responses and judge details to a companion `.responses.md` file |
+
+Results are written as a TAP markdown file under `ai-evals/` in the project root.
+
+---
+
+## Saving raw responses for debugging
+
+When `--save-responses` is passed, a companion `.responses.md` file is written alongside the `.tap.md` output. It contains the raw result agent response and per-run judge details (passed, actual, expected, score) for every assertion — useful for debugging failures without adding console noise.
+
+```shell
+riteway ai path/to/test.sudo --save-responses
+```
+
+Each test file produces its own uniquely-named pair of files (e.g. `2026-03-17-test-abc12.tap.md` and `2026-03-17-test-abc12.responses.md`), so multiple test files never conflict.
+
+### Capturing responses as CI artifacts
+
+In GitHub Actions, use `--save-responses` and upload the `ai-evals/` directory as an artifact:
+
+```yaml
+- name: Run AI prompt evaluations
+  run: npx riteway ai path/to/test.sudo --save-responses
+
+- name: Upload AI eval responses
+  if: always()
+  uses: actions/upload-artifact@v4
+  with:
+    name: ai-eval-responses
+    path: ai-evals/*.responses.md
+    retention-days: 14
+```
+
+The `if: always()` ensures responses are uploaded even when assertions fail, so you can inspect exactly what the agent produced.
+
+### Partial results on timeout
+
+If some runs complete before another times out, the completed runs' responses are still written to the responses file. The timed-out run's partial agent output is also captured, followed by a `[RITEWAY TIMEOUT]` marker showing when and where the timeout occurred. This lets you debug why a run took too long and potentially optimize the prompt to run faster.
+
+---
+
+## Custom agent configuration
+
+`riteway ai init` writes all built-in agent configs to `riteway.agent-config.json` in your project root, so you can add custom agents or tweak existing flags:
+
+```shell
+riteway ai init           # create riteway.agent-config.json
+riteway ai init --force   # overwrite existing file
+```
+
+The generated file is a keyed registry. Add a custom agent entry and use it with `--agent`:
+
+```json
+{
+  "claude":   { "command": "claude",   "args": ["-p", "--output-format", "json", "--no-session-persistence"], "outputFormat": "json"  },
+  "opencode": { "command": "opencode", "args": ["run", "--format", "json"],                                   "outputFormat": "ndjson" },
+  "cursor":   { "command": "agent",    "args": ["--print", "--output-format", "json"],                        "outputFormat": "json"  },
+  "my-agent": { "command": "my-tool",  "args": ["--json"],                                                    "outputFormat": "json"  }
+}
+```
+
+```shell
+riteway ai path/to/test.sudo --agent my-agent
+```
+
+Once `riteway.agent-config.json` exists, any agent key defined in it supersedes the library's built-in defaults for that agent.
+
+---
+
+## Configuring Model and Temperature (OpenCode)
+
+Most agent harnesses don't expose temperature settings. If you need temperature control for Riteway AI evals, use OpenCode with a custom agent config.
+
+### 1. Create the agent file
+
+Create `.opencode/agents/riteway.md` in your project root:
+
+```markdown
+---
+description: Riteway AI eval agent
+mode: primary
+model: anthropic/claude-sonnet-4-6
+temperature: 0.2
+---
+```
+
+### 2. Configure the Riteway agent
+
+In your Riteway config (`riteway.agent-config.json`), reference the agent by filename (without extension):
+
+```json
+{
+  "opencode": {
+    "command": "opencode",
+    "args": ["run", "--agent", "riteway", "--format", "json"],
+    "outputFormat": "ndjson"
+  }
+}
+```
+
+Then run your evals using the `opencode` agent:
+
+```shell
+riteway ai path/to/test.sudo --agent opencode
+```
+
+This gives you precise control over both the model and sampling temperature used during evaluations, which is useful for experiments that require deterministic or creative outputs.

From 43ef6b6878f09a30dd6bbcee6319b5a83c995cb5 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Sat, 4 Apr 2026 14:57:52 +0000
Subject: [PATCH 3/3] =?UTF-8?q?docs:=20clarify=20temperature=20semantics?=
 =?UTF-8?q?=20=E2=80=94=20no=20LLM=20guarantees=20determinism?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Agent-Logs-Url: https://github.com/paralleldrive/riteway/sessions/10d75dcd-7c04-4829-9b44-2e419a4135fd

Co-authored-by: ericelliott <364727+ericelliott@users.noreply.github.com>
---
 docs/riteway-ai.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/riteway-ai.md b/docs/riteway-ai.md
index af9edfd8..6a7e4f05 100644
--- a/docs/riteway-ai.md
+++ b/docs/riteway-ai.md
@@ -176,4 +176,4 @@ Then run your evals using the `opencode` agent:
 riteway ai path/to/test.sudo --agent opencode
 ```
 
-This gives you precise control over both the model and sampling temperature used during evaluations, which is useful for experiments that require deterministic or creative outputs.
+This gives you precise control over both the model and sampling temperature used during evaluations. Temperature shapes the next-token probability distribution — lower values skew toward more probable tokens (less creative, more repetitive), and higher values spread probability more evenly (more creative, less predictable). Note that no current LLM guarantees deterministic outputs even at temperature 0; results will still vary across runs.