From 8d262a6ded9f088aee68c7ae59e7c492c90f7591 Mon Sep 17 00:00:00 2001 From: Mayank Jha Date: Mon, 4 May 2026 15:15:04 -0700 Subject: [PATCH 1/5] feat: add low-code agent evaluation docs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The uipath-agents skill has comprehensive evaluation docs for coded agents (5 files under coded/lifecycle/evaluations/) but none for low-code agents, despite full CLI support in `uip agent eval`. Adds 4 reference files under lowcode/evaluation/: - evaluate.md — entry point, prerequisites, file structure, differences from coded - evaluators.md — 4 evaluator types, add/list/remove, JSON format, custom prompts - evaluation-sets.md — eval set and test case CRUD, simulation options, JSON format - running-evaluations.md — run start/status/results/list/compare, workflow example Updates SKILL.md task navigation and lowcode.md capability registry to reference the new evaluation docs. Co-Authored-By: Claude Opus 4.6 (1M context) --- skills/uipath-agents/SKILL.md | 1 + .../references/lowcode/evaluation/evaluate.md | 71 ++++++++ .../lowcode/evaluation/evaluation-sets.md | 140 +++++++++++++++ .../lowcode/evaluation/evaluators.md | 102 +++++++++++ .../lowcode/evaluation/running-evaluations.md | 163 ++++++++++++++++++ .../references/lowcode/lowcode.md | 2 + 6 files changed, 479 insertions(+) create mode 100644 skills/uipath-agents/references/lowcode/evaluation/evaluate.md create mode 100644 skills/uipath-agents/references/lowcode/evaluation/evaluation-sets.md create mode 100644 skills/uipath-agents/references/lowcode/evaluation/evaluators.md create mode 100644 skills/uipath-agents/references/lowcode/evaluation/running-evaluations.md diff --git a/skills/uipath-agents/SKILL.md b/skills/uipath-agents/SKILL.md index 170893ba9..53fbc6b88 100644 --- a/skills/uipath-agents/SKILL.md +++ b/skills/uipath-agents/SKILL.md @@ -46,6 +46,7 @@ Determine the agent mode before proceeding: | Add guardrails (PII, harmful content, custom rules) to a low-code agent | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) § Capability Registry | `lowcode/capabilities/guardrails/guardrails.md` | | Add escalation guardrail (escalate action / Action Center app) | Low-code | [lowcode/capabilities/guardrails/guardrails.md](references/lowcode/capabilities/guardrails/guardrails.md) § escalate — Hand Off to Action Center | Run `uip solution resource list --kind App` to confirm app exists | | Embed a low-code agent inline in a flow, or wire a multi-agent solution | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) § Capability Registry | `lowcode/capabilities/inline-in-flow/inline-in-flow.md`, `lowcode/capabilities/process/solution-agent.md` | +| Run low-code evaluations | Low-code | [lowcode/evaluation/evaluate.md](references/lowcode/evaluation/evaluate.md) | `lowcode/evaluation/evaluators.md`, `lowcode/evaluation/evaluation-sets.md`, `lowcode/evaluation/running-evaluations.md` | | Validate, pack, publish, upload, or deploy a low-code agent | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) | `lowcode/project-lifecycle.md`, `lowcode/solution-resources.md` | ## Resources diff --git a/skills/uipath-agents/references/lowcode/evaluation/evaluate.md b/skills/uipath-agents/references/lowcode/evaluation/evaluate.md new file mode 100644 index 000000000..c60e450cd --- /dev/null +++ b/skills/uipath-agents/references/lowcode/evaluation/evaluate.md @@ -0,0 +1,71 @@ +# Evaluate Low-Code Agents + +Design and run evaluations against low-code 
agents using the `uip agent eval` CLI. + +## Quick Reference + +```bash +# Add a test case +uip agent eval add happy-path --set "Default Evaluation Set" --inputs '{"input":"hello"}' --expected '{"content":"greeting"}' --path ./my-agent --output json + +# Run evals and wait for results +uip agent eval run start --set "Default Evaluation Set" --path ./my-agent --wait --output json + +# Check results (failures only, with justifications) +uip agent eval run results --set "Default Evaluation Set" --only-failed --verbose --path ./my-agent --output json +``` + +## Prerequisites + +- Agent project initialized (`uip agent init`) +- Agent pushed to Studio Web (`uip agent push`) — required for running evals (the Agent Runtime executes test cases in the cloud) +- `SolutionStorage.json` exists in the agent project (created by `uip agent push`) + +Local operations (managing evaluators, eval sets, test cases) do **not** require authentication or a cloud connection. + +## Reference Navigation + +- [Evaluators](evaluators.md) — evaluator types, adding/removing, default prompts +- [Evaluation Sets and Test Cases](evaluation-sets.md) — creating sets, adding test cases, simulation options +- [Running Evaluations](running-evaluations.md) — start, status, results, compare + +Read Evaluators before choosing an evaluator type, and Evaluation Sets before writing test cases. + +## File Structure + +After `uip agent init`, the eval-related project structure is: + +``` +my-agent/ + agent.json + SolutionStorage.json # Created after `uip agent push` + evals/ + evaluators/ + evaluator-default.json # Semantic similarity evaluator + evaluator-default-trajectory.json # Trajectory evaluator + eval-sets/ + evaluation-set-default.json # Default eval set (references both evaluators) +``` + +Evaluators live in `evals/evaluators/` and eval sets (with inline test cases) live in `evals/eval-sets/`. Both are auto-discovered by the CLI from these directories. 
## Key Differences from Coded Agent Evals

| Aspect | Coded (`uip codedagent eval`) | Low-code (`uip agent eval`) |
|--------|-------------------------------|------------------------------|
| Execution | Local Python process | Cloud-based via Agent Runtime |
| Auth required | Only for `--report` | Always (cloud execution) |
| Prerequisite | `entry-points.json` | `uip agent push` (SolutionStorage.json) |
| Mocking | `@mockable()` decorator + declarative | Simulation instructions only |
| CLI prefix | `uip codedagent eval` | `uip agent eval` |

## Troubleshooting

| Error | Cause | Fix |
|-------|-------|-----|
| `SolutionStorage.json not found` | Agent not pushed to Studio Web | Run `uip agent push --output json` |
| `No evaluators found` | Empty `evals/evaluators/` directory | Run `uip agent eval evaluator add` or re-init with `uip agent init` |
| `No test cases in eval set` | Eval set has no evaluations | Run `uip agent eval add` to add test cases |
| `401 Unauthorized` | Auth expired | Run `uip login --output json` |
| Eval run timeout | Agent taking too long or stuck | Increase `--timeout` or check agent health in Studio Web |
| `same-as-agent` model error | Evaluator model can't be resolved | Set an explicit model in the evaluator config instead of `"same-as-agent"` |
diff --git a/skills/uipath-agents/references/lowcode/evaluation/evaluation-sets.md b/skills/uipath-agents/references/lowcode/evaluation/evaluation-sets.md
new file mode 100644
index 000000000..490faf2a2
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluation/evaluation-sets.md
@@ -0,0 +1,140 @@
+# Evaluation Sets and Test Cases

Evaluation sets group test cases and reference which evaluators to use. Each set is a JSON file in `evals/eval-sets/`. Test cases are stored inline within the eval set.

## Managing Eval Sets

### Add an eval set

```bash
uip agent eval set add <name> --path <path> --output json
```

**Options:**

| Flag | Required | Description | Default |
|------|----------|-------------|---------|
| `--evaluators <ids>` | No | Comma-separated evaluator IDs | All existing evaluators |
| `--path <path>` | No | Agent project directory | `.` |

When `--evaluators` is not provided, the new eval set automatically references **all** evaluators in the project.

### List eval sets

```bash
uip agent eval set list --path <path> --output json
```

### Remove an eval set

```bash
uip agent eval set remove <name-or-id> --path <path> --output json
```

## Managing Test Cases

Test cases live inside eval sets. Each test case defines an input, expected output, and optional behavior expectations.
### Add a test case

```bash
uip agent eval add <name> \
  --set "<eval-set-name>" \
  --inputs '{"input":"hello"}' \
  --expected '{"content":"greeting response"}' \
  --path <path> \
  --output json
```

**Options:**

| Flag | Required | Description | Default |
|------|----------|-------------|---------|
| `--set <name-or-id>` | Yes | Eval set name or ID | — |
| `--inputs <json>` | Yes | Input values as JSON | — |
| `--expected <json>` | No | Expected output as JSON | `{}` |
| `--expected-agent-behavior <text>` | No | Description of expected behavior (used by trajectory evaluator) | `""` |
| `--simulation-instructions <text>` | No | Instructions for simulating agent behavior | `""` |
| `--simulate-input` | No | Enable input simulation | `false` |
| `--simulate-tools` | No | Enable tool simulation | `false` |
| `--input-generation-instructions <text>` | No | Instructions for generating synthetic inputs | `""` |
| `--path <path>` | No | Agent project directory | `.` |

### List test cases

```bash
uip agent eval list --set "<eval-set-name>" --path <path> --output json
```

### Remove a test case

```bash
uip agent eval remove <name-or-id> --set "<eval-set-name>" --path <path> --output json
```

## Test Case Design

### Matching evaluator to test case fields

| Evaluator Type | Key Test Case Fields |
|---------------|---------------------|
| Semantic Similarity | `--inputs`, `--expected` |
| Trajectory | `--inputs`, `--expected-agent-behavior` |
| Context Precision | `--inputs`, `--expected` |
| Faithfulness | `--inputs`, `--expected` |

For trajectory evaluation, write `--expected-agent-behavior` as a natural language description of what the agent should do, not what it should output:

```bash
uip agent eval add tool-usage-test \
  --set "Default Evaluation Set" \
  --inputs '{"input":"What is the weather in NYC?"}' \
  --expected-agent-behavior "Agent should call the weather tool with location NYC and return a formatted weather summary" \
  --path ./my-agent --output json
```

### Simulation options

- `--simulate-input` — the runtime generates synthetic input variations based on the provided input
- `--simulate-tools` — tool calls are simulated rather than executed against real services
- `--input-generation-instructions` — guides synthetic input generation (e.g., "generate edge cases with empty strings and special characters")
- `--simulation-instructions` — guides the overall simulation behavior

These are useful for expanding test coverage without writing every input by hand.

## Eval Set JSON Format

```json
{
  "fileName": "evaluation-set-default.json",
  "id": "<uuid>",
  "name": "Default Evaluation Set",
  "batchSize": 10,
  "evaluatorRefs": ["<evaluator-uuid>", "<evaluator-uuid>"],
  "evaluations": [
    {
      "id": "<uuid>",
      "name": "happy-path",
      "inputs": {"input": "hello"},
      "expectedOutput": {"content": "greeting"},
      "expectedAgentBehavior": "",
      "simulationInstructions": "",
      "simulateInput": false,
      "simulateTools": false,
      "inputGenerationInstructions": "",
      "evalSetId": "<eval-set-uuid>",
      "source": "manual",
      "createdAt": "...",
      "updatedAt": "..."
    }
  ],
  "modelSettings": [],
  "agentMemoryEnabled": false,
  "agentMemorySettings": [],
  "lineByLineEvaluation": false,
  "createdAt": "...",
  "updatedAt": "..."
}
```

The `source` field indicates how the test case was created: `"manual"` (CLI), `"debugRun"` (from a debug session), `"runtimeRun"` (from a live run), `"simulatedRun"`, or `"autopilotUserInitiated"`.
diff --git a/skills/uipath-agents/references/lowcode/evaluation/evaluators.md b/skills/uipath-agents/references/lowcode/evaluation/evaluators.md
new file mode 100644
index 000000000..bfb284d6d
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluation/evaluators.md
@@ -0,0 +1,102 @@
+# Evaluators

Evaluators define how agent output is scored. Each evaluator is a JSON file in `evals/evaluators/`.

## Evaluator Types

| Type | CLI Flag | What It Scores |
|------|----------|----------------|
| Semantic Similarity | `semantic-similarity` | Whether the agent's output has the same meaning as the expected output |
| Trajectory | `trajectory` | Whether the agent's reasoning path and tool usage match expected behavior |
| Context Precision | `context-precision` | Whether retrieved context is relevant to the user's query |
| Faithfulness | `faithfulness` | Whether the agent's output is grounded in the provided context |

## Managing Evaluators

### Add an evaluator

```bash
uip agent eval evaluator add <name> --type <type> --path <path> --output json
```

**Options:**

| Flag | Required | Description | Default |
|------|----------|-------------|---------|
| `--type <type>` | Yes | One of: `semantic-similarity`, `trajectory`, `context-precision`, `faithfulness` | — |
| `--description <text>` | No | Human-readable description | Auto-generated from type |
| `--prompt <text>` | No | Custom LLM evaluation prompt | Built-in default per type |
| `--target-key <key>` | No | Specific output key to evaluate | `*` (all keys) |
| `--path <path>` | No | Agent project directory | `.` |

**Example:**
```bash
uip agent eval evaluator add content-quality \
  --type semantic-similarity \
  --path ./my-agent \
  --output json
```

### List evaluators

```bash
uip agent eval evaluator list --path <path> --output json
```

### Remove an evaluator

```bash
uip agent eval evaluator remove <name-or-id> --path <path> --output json
```

Removing an evaluator automatically removes its references from all eval sets that reference it.

## Default Evaluators

`uip agent init` creates two default evaluators:

### Semantic Similarity (`evaluator-default.json`)

Compares expected vs actual output for semantic equivalence. Uses template variables `{{ExpectedOutput}}` and `{{ActualOutput}}`. Scores 0–100.

### Trajectory (`evaluator-default-trajectory.json`)

Evaluates the agent's reasoning path against expected behavior. Uses template variables `{{UserOrSyntheticInput}}`, `{{SimulationInstructions}}`, `{{ExpectedAgentBehavior}}`, and `{{AgentRunHistory}}`. Scores 0–100.

Both default evaluators use `"same-as-agent"` as the model, which resolves to the agent's configured model at runtime.

## Evaluator JSON Format

```json
{
  "fileName": "evaluator-content-quality.json",
  "id": "<uuid>",
  "name": "content-quality",
  "description": "Evaluates semantic similarity of output",
  "category": 1,
  "type": 5,
  "prompt": "Compare {{ExpectedOutput}} with {{ActualOutput}}...",
  "model": "same-as-agent",
  "targetOutputKey": "*",
  "createdAt": "2025-01-01T00:00:00.000Z",
  "updatedAt": "2025-01-01T00:00:00.000Z"
}
```

**Type and category mapping:**

| CLI Type | `type` (numeric) | `category` |
|----------|-------------------|------------|
| `semantic-similarity` | 5 | 1 (output-based) |
| `trajectory` | 7 | 3 (trajectory-based) |
| `context-precision` | 8 | 1 (output-based) |
| `faithfulness` | 9 | 1 (output-based) |

## Custom Prompts

When `--prompt` is omitted, the CLI uses a built-in default prompt for each type.
To customize, pass a prompt string using the appropriate template variables:

- **Semantic Similarity**: `{{ExpectedOutput}}`, `{{ActualOutput}}`
- **Trajectory**: `{{AgentRunHistory}}`, `{{ExpectedBehavior}}`
- **Context Precision**: `{{UserQuery}}`, `{{RetrievedContext}}`
- **Faithfulness**: `{{AgentOutput}}`, `{{Context}}`
diff --git a/skills/uipath-agents/references/lowcode/evaluation/running-evaluations.md b/skills/uipath-agents/references/lowcode/evaluation/running-evaluations.md
new file mode 100644
index 000000000..0ffc4ebe8
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluation/running-evaluations.md
@@ -0,0 +1,163 @@
+# Running Evaluations

Execute evaluations against the Agent Runtime, check status, view results, and compare runs.

All run commands require the agent to be pushed to Studio Web first (`uip agent push`). The Agent Runtime executes test cases in the cloud using the pushed agent definition.

## Start an Eval Run

```bash
uip agent eval run start --set "<eval-set-name>" --path <path> --wait --output json
```

**Options:**

| Flag | Required | Description | Default |
|------|----------|-------------|---------|
| `--set <name-or-id>` | Yes | Eval set name or ID | — |
| `--path <path>` | No | Agent project directory | `.` |
| `--wait` | No | Poll until completion and show results | `false` |
| `--timeout <seconds>` | No | Polling timeout (with `--wait`) | 600 (10 min) |
| `--solution-id <id>` | No | Override solution ID | Auto-resolved from `SolutionStorage.json` |

Without `--wait`, the command returns immediately with an `EvalSetRunId`:

```json
{
  "Code": "AgentEvalRunStarted",
  "Data": {
    "EvalSetRunId": "a1b2c3d4-...",
    "EvalSetName": "Default Evaluation Set",
    "TestCases": 5,
    "Evaluators": 2
  }
}
```

With `--wait`, the CLI polls every 5 seconds until completion, then outputs both a summary and per-test-case results.

## Check Run Status

```bash
uip agent eval run status --set "<eval-set-name>" --path <path> --output json
```

**Output:**
```json
{
  "Code": "AgentEvalRunStatus",
  "Data": {
    "EvalSetRunId": "a1b2c3d4-...",
    "Status": "completed",
    "Score": 0.86,
    "Duration": "42.5s",
    "EvaluatorScores": "semantic: 0.9, trajectory: 0.82"
  }
}
```

Terminal states: `completed` or `failed`.

## View Results

```bash
uip agent eval run results \
  --set "<eval-set-name>" \
  --path <path> \
  --output json
```

**Options:**

| Flag | Description |
|------|-------------|
| `--only-failed` | Show only failed or errored test cases |
| `--verbose` | Include evaluator justifications in output |
| `--export-format <format>` | Export results to file (`eval-results-{timestamp}.json` or `.csv`) |

**Per-test-case output fields:** `TestCase`, `Status`, `Score`, `EvaluatorScores`, `Tokens`, `Duration`, `Error` (plus `Justifications` when `--verbose`).

### Failure detection

A test case is considered **failed** if any of these are true:
- Status is `failed`
- Has an error message
- Any evaluator score type is `error`
- Any exact-match evaluator returned `false`

## List Past Runs

```bash
uip agent eval run list --set "<eval-set-name>" --path <path> --output json
```

**Per-row output:** `EvalSetRunId`, `Status`, `Score`, `TestCases`, `Duration`, `EvaluatorScores`, `CreatedAt`.
## Compare Runs

Compare two eval runs side by side to see score changes:

```bash
uip agent eval run compare \
  --compare-to <run-id> \
  --set "<eval-set-name>" \
  --path <path> \
  --output json
```

**Output:**
```json
{
  "Code": "AgentEvalRunComparison",
  "Data": {
    "RunA": { "Id": "...", "Score": 0.86, "Status": "completed" },
    "RunB": { "Id": "...", "Score": 0.80, "Status": "completed" },
    "ScoreDelta": 0.06,
    "TestCases": [
      {
        "TestCase": "happy-path",
        "ScoreA": 1.0,
        "ScoreB": 0.9,
        "Delta": "+0.1",
        "StatusA": "completed",
        "StatusB": "completed"
      }
    ]
  }
}
```

Use `compare` after prompt changes to verify improvements without regressions.

## Workflow Example

```bash
# 1. Push agent to Studio Web (if not already done)
uip agent push --path ./my-agent --output json

# 2. Add test cases
uip agent eval add greeting-test \
  --set "Default Evaluation Set" \
  --inputs '{"input":"hi there"}' \
  --expected '{"content":"Hello! How can I help you?"}' \
  --expected-agent-behavior "Agent should respond with a friendly greeting" \
  --path ./my-agent --output json

# 3. Run and wait
uip agent eval run start \
  --set "Default Evaluation Set" \
  --path ./my-agent \
  --wait --output json

# 4. Review failures
uip agent eval run results \
  --set "Default Evaluation Set" \
  --only-failed --verbose \
  --path ./my-agent --output json

# 5. Make changes, push, re-run, compare
uip agent push --path ./my-agent --output json
uip agent eval run start --set "Default Evaluation Set" --path ./my-agent --wait --output json
uip agent eval run compare --compare-to <previous-run-id> \
  --set "Default Evaluation Set" --path ./my-agent --output json
```
diff --git a/skills/uipath-agents/references/lowcode/lowcode.md b/skills/uipath-agents/references/lowcode/lowcode.md
index 6f16e3bd5..a60995399 100644
--- a/skills/uipath-agents/references/lowcode/lowcode.md
+++ b/skills/uipath-agents/references/lowcode/lowcode.md
@@ -46,6 +46,7 @@ Capabilities are **orthogonal**: there is no ordering requirement among them. Ad
| Scaffolding, validating, or running solution lifecycle commands | [project-lifecycle.md](project-lifecycle.md) |
| Editing `agent.json` (prompts, schemas, model, contentTokens) or `entry-points.json` | [agent-definition.md](agent-definition.md) |
| External tools / IS tools / index contexts / escalations behave unexpectedly after `uip solution resource refresh` | [solution-resources.md](solution-resources.md) |
+| Running evaluations, adding test cases, managing evaluators | [evaluation/evaluate.md](evaluation/evaluate.md) |

### Capability Registry

@@ -68,6 +69,7 @@ Capabilities are **orthogonal**: there is no ordering requirement among them. Ad
| Add an Action Center escalation (HITL) | [capabilities/escalation/escalation.md](capabilities/escalation/escalation.md) | |
| Add guardrails (PII, harmful content, custom rules) | [capabilities/guardrails/guardrails.md](capabilities/guardrails/guardrails.md) | |
| Embed an agent inline in a flow | [capabilities/inline-in-flow/inline-in-flow.md](capabilities/inline-in-flow/inline-in-flow.md) | |
+| Evaluate agent (add test cases, run evals, view results) | [evaluation/evaluate.md](evaluation/evaluate.md) | `evaluation/evaluators.md`, `evaluation/evaluation-sets.md`, `evaluation/running-evaluations.md` |
| Set up Orchestrator resources | Tell the user to use the `uipath-platform` skill | |
| Wire agent into a flow | Tell the user to use the `uipath-maestro-flow` skill | |

From 29f59568f780d3788ae5ff2d6181f724bdc5e0be Mon Sep 17 00:00:00 2001
From: Mayank Jha
Date: Mon, 4 May 2026 18:29:50 -0700
Subject: [PATCH 2/5] fix(uipath-agents): correct low-code eval docs against CLI/SDK source
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Address PR #552 review comment from @Chibionos: drop SolutionStorage.json mentions throughout the eval refs (it is going away). Reword troubleshooting, prerequisites, file-structure tree, and the --solution-id default to describe the user-facing concept ("agent pushed to Studio Web") instead.

Folds in additional corrections found while verifying the PR against the uip CLI (Code/cli), uipath-python SDK, and Agents service repo:

- Rename evaluation/ → evaluations/ to match coded sibling convention.
- Move eval row from Capability Registry to "Read on demand" in lowcode.md (eval is lifecycle, not a capability).
- Fix evaluator filename example: actual pattern is evaluator-<id-prefix>.json, not <name>.json. The user-supplied <name> goes into the JSON name field.
- Restore --wait polling cadence (5s) and --timeout default (600s) — both hardcoded in eval-run.ts. Removed earlier when unverified.
- Add complete output Code enum (AgentEvalRunStarted/Completed/Results/ Status/Exported/List/Comparison).
- Expand failure detection with the numeric forms isFailedRun() actually checks (status "3", score.type "2"), plus the SDK status enum.
- Document the worker-side LLM model fail-fast (activities.py) and the same-as-agent resolver error (EvaluatorFactory) — these are runtime, not validate-time, errors.
- Correct context-precision/faithfulness data flow: both are trace-driven (RETRIEVER spans), not test-case-driven; faithfulness reads expectedOutput as the candidate text, not the agent's actual output.
- Add "Why fewer evaluators than coded?" section explaining the legacy vs new SDK engine split, plus the 2 runtime-supported types not exposed by the CLI (Equals=1, JsonSimilarity=6) with copy-pasteable JSON.
- Document validate's category↔type matrix (cat 0→{1,6}, cat 1→{5,8,9}, cat 3→{7}) and required fields per schema-validation-service.ts.
- Add Anti-patterns section to all four eval reference files per skill-structure.md convention.
- Workflow example: insert validate step between add and push.
Co-Authored-By: Claude Opus 4.7 (1M context) --- skills/uipath-agents/SKILL.md | 2 +- .../references/lowcode/evaluation/evaluate.md | 71 ----- .../lowcode/evaluation/evaluation-sets.md | 140 ---------- .../lowcode/evaluation/evaluators.md | 102 -------- .../lowcode/evaluation/running-evaluations.md | 163 ------------ .../lowcode/evaluations/evaluate.md | 90 +++++++ .../lowcode/evaluations/evaluation-sets.md | 163 ++++++++++++ .../lowcode/evaluations/evaluators.md | 242 ++++++++++++++++++ .../evaluations/running-evaluations.md | 207 +++++++++++++++ .../references/lowcode/lowcode.md | 3 +- 10 files changed, 704 insertions(+), 479 deletions(-) delete mode 100644 skills/uipath-agents/references/lowcode/evaluation/evaluate.md delete mode 100644 skills/uipath-agents/references/lowcode/evaluation/evaluation-sets.md delete mode 100644 skills/uipath-agents/references/lowcode/evaluation/evaluators.md delete mode 100644 skills/uipath-agents/references/lowcode/evaluation/running-evaluations.md create mode 100644 skills/uipath-agents/references/lowcode/evaluations/evaluate.md create mode 100644 skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md create mode 100644 skills/uipath-agents/references/lowcode/evaluations/evaluators.md create mode 100644 skills/uipath-agents/references/lowcode/evaluations/running-evaluations.md diff --git a/skills/uipath-agents/SKILL.md b/skills/uipath-agents/SKILL.md index 53fbc6b88..734437628 100644 --- a/skills/uipath-agents/SKILL.md +++ b/skills/uipath-agents/SKILL.md @@ -46,7 +46,7 @@ Determine the agent mode before proceeding: | Add guardrails (PII, harmful content, custom rules) to a low-code agent | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) § Capability Registry | `lowcode/capabilities/guardrails/guardrails.md` | | Add escalation guardrail (escalate action / Action Center app) | Low-code | [lowcode/capabilities/guardrails/guardrails.md](references/lowcode/capabilities/guardrails/guardrails.md) § escalate — Hand Off to Action Center | Run `uip solution resource list --kind App` to confirm app exists | | Embed a low-code agent inline in a flow, or wire a multi-agent solution | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) § Capability Registry | `lowcode/capabilities/inline-in-flow/inline-in-flow.md`, `lowcode/capabilities/process/solution-agent.md` | -| Run low-code evaluations | Low-code | [lowcode/evaluation/evaluate.md](references/lowcode/evaluation/evaluate.md) | `lowcode/evaluation/evaluators.md`, `lowcode/evaluation/evaluation-sets.md`, `lowcode/evaluation/running-evaluations.md` | +| Run low-code evaluations | Low-code | [lowcode/evaluations/evaluate.md](references/lowcode/evaluations/evaluate.md) | `lowcode/evaluations/evaluators.md`, `lowcode/evaluations/evaluation-sets.md`, `lowcode/evaluations/running-evaluations.md` | | Validate, pack, publish, upload, or deploy a low-code agent | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) | `lowcode/project-lifecycle.md`, `lowcode/solution-resources.md` | ## Resources diff --git a/skills/uipath-agents/references/lowcode/evaluation/evaluate.md b/skills/uipath-agents/references/lowcode/evaluation/evaluate.md deleted file mode 100644 index c60e450cd..000000000 --- a/skills/uipath-agents/references/lowcode/evaluation/evaluate.md +++ /dev/null @@ -1,71 +0,0 @@ -# Evaluate Low-Code Agents - -Design and run evaluations against low-code agents using the `uip agent eval` CLI. 
- -## Quick Reference - -```bash -# Add a test case -uip agent eval add happy-path --set "Default Evaluation Set" --inputs '{"input":"hello"}' --expected '{"content":"greeting"}' --path ./my-agent --output json - -# Run evals and wait for results -uip agent eval run start --set "Default Evaluation Set" --path ./my-agent --wait --output json - -# Check results (failures only, with justifications) -uip agent eval run results --set "Default Evaluation Set" --only-failed --verbose --path ./my-agent --output json -``` - -## Prerequisites - -- Agent project initialized (`uip agent init`) -- Agent pushed to Studio Web (`uip agent push`) — required for running evals (the Agent Runtime executes test cases in the cloud) -- `SolutionStorage.json` exists in the agent project (created by `uip agent push`) - -Local operations (managing evaluators, eval sets, test cases) do **not** require authentication or a cloud connection. - -## Reference Navigation - -- [Evaluators](evaluators.md) — evaluator types, adding/removing, default prompts -- [Evaluation Sets and Test Cases](evaluation-sets.md) — creating sets, adding test cases, simulation options -- [Running Evaluations](running-evaluations.md) — start, status, results, compare - -Read Evaluators before choosing an evaluator type, and Evaluation Sets before writing test cases. - -## File Structure - -After `uip agent init`, the eval-related project structure is: - -``` -my-agent/ - agent.json - SolutionStorage.json # Created after `uip agent push` - evals/ - evaluators/ - evaluator-default.json # Semantic similarity evaluator - evaluator-default-trajectory.json # Trajectory evaluator - eval-sets/ - evaluation-set-default.json # Default eval set (references both evaluators) -``` - -Evaluators live in `evals/evaluators/` and eval sets (with inline test cases) live in `evals/eval-sets/`. Both are auto-discovered by the CLI from these directories. 
-## Key Differences from Coded Agent Evals
-
-| Aspect | Coded (`uip codedagent eval`) | Low-code (`uip agent eval`) |
-|--------|-------------------------------|------------------------------|
-| Execution | Local Python process | Cloud-based via Agent Runtime |
-| Auth required | Only for `--report` | Always (cloud execution) |
-| Prerequisite | `entry-points.json` | `uip agent push` (SolutionStorage.json) |
-| Mocking | `@mockable()` decorator + declarative | Simulation instructions only |
-| CLI prefix | `uip codedagent eval` | `uip agent eval` |
-
-## Troubleshooting
-
-| Error | Cause | Fix |
-|-------|-------|-----|
-| `SolutionStorage.json not found` | Agent not pushed to Studio Web | Run `uip agent push --output json` |
-| `No evaluators found` | Empty `evals/evaluators/` directory | Run `uip agent eval evaluator add` or re-init with `uip agent init` |
-| `No test cases in eval set` | Eval set has no evaluations | Run `uip agent eval add` to add test cases |
-| `401 Unauthorized` | Auth expired | Run `uip login --output json` |
-| Eval run timeout | Agent taking too long or stuck | Increase `--timeout` or check agent health in Studio Web |
-| `same-as-agent` model error | Evaluator model can't be resolved | Set an explicit model in the evaluator config instead of `"same-as-agent"` |
diff --git a/skills/uipath-agents/references/lowcode/evaluation/evaluation-sets.md b/skills/uipath-agents/references/lowcode/evaluation/evaluation-sets.md
deleted file mode 100644
index 490faf2a2..000000000
--- a/skills/uipath-agents/references/lowcode/evaluation/evaluation-sets.md
+++ /dev/null
@@ -1,140 +0,0 @@
-# Evaluation Sets and Test Cases
-
-Evaluation sets group test cases and reference which evaluators to use. Each set is a JSON file in `evals/eval-sets/`. Test cases are stored inline within the eval set.
-
-## Managing Eval Sets
-
-### Add an eval set
-
-```bash
-uip agent eval set add <name> --path <path> --output json
-```
-
-**Options:**
-
-| Flag | Required | Description | Default |
-|------|----------|-------------|---------|
-| `--evaluators <ids>` | No | Comma-separated evaluator IDs | All existing evaluators |
-| `--path <path>` | No | Agent project directory | `.` |
-
-When `--evaluators` is not provided, the new eval set automatically references **all** evaluators in the project.
-
-### List eval sets
-
-```bash
-uip agent eval set list --path <path> --output json
-```
-
-### Remove an eval set
-
-```bash
-uip agent eval set remove <name-or-id> --path <path> --output json
-```
-
-## Managing Test Cases
-
-Test cases live inside eval sets. Each test case defines an input, expected output, and optional behavior expectations.
-### Add a test case
-
-```bash
-uip agent eval add <name> \
-  --set "<eval-set-name>" \
-  --inputs '{"input":"hello"}' \
-  --expected '{"content":"greeting response"}' \
-  --path <path> \
-  --output json
-```
-
-**Options:**
-
-| Flag | Required | Description | Default |
-|------|----------|-------------|---------|
-| `--set <name-or-id>` | Yes | Eval set name or ID | — |
-| `--inputs <json>` | Yes | Input values as JSON | — |
-| `--expected <json>` | No | Expected output as JSON | `{}` |
-| `--expected-agent-behavior <text>` | No | Description of expected behavior (used by trajectory evaluator) | `""` |
-| `--simulation-instructions <text>` | No | Instructions for simulating agent behavior | `""` |
-| `--simulate-input` | No | Enable input simulation | `false` |
-| `--simulate-tools` | No | Enable tool simulation | `false` |
-| `--input-generation-instructions <text>` | No | Instructions for generating synthetic inputs | `""` |
-| `--path <path>` | No | Agent project directory | `.` |
-
-### List test cases
-
-```bash
-uip agent eval list --set "<eval-set-name>" --path <path> --output json
-```
-
-### Remove a test case
-
-```bash
-uip agent eval remove <name-or-id> --set "<eval-set-name>" --path <path> --output json
-```
-
-## Test Case Design
-
-### Matching evaluator to test case fields
-
-| Evaluator Type | Key Test Case Fields |
-|---------------|---------------------|
-| Semantic Similarity | `--inputs`, `--expected` |
-| Trajectory | `--inputs`, `--expected-agent-behavior` |
-| Context Precision | `--inputs`, `--expected` |
-| Faithfulness | `--inputs`, `--expected` |
-
-For trajectory evaluation, write `--expected-agent-behavior` as a natural language description of what the agent should do, not what it should output:
-
-```bash
-uip agent eval add tool-usage-test \
-  --set "Default Evaluation Set" \
-  --inputs '{"input":"What is the weather in NYC?"}' \
-  --expected-agent-behavior "Agent should call the weather tool with location NYC and return a formatted weather summary" \
-  --path ./my-agent --output json
-```
-
-### Simulation options
-
-- `--simulate-input` — the runtime generates synthetic input variations based on the provided input
-- `--simulate-tools` — tool calls are simulated rather than executed against real services
-- `--input-generation-instructions` — guides synthetic input generation (e.g., "generate edge cases with empty strings and special characters")
-- `--simulation-instructions` — guides the overall simulation behavior
-
-These are useful for expanding test coverage without writing every input by hand.
-
-## Eval Set JSON Format
-
-```json
-{
-  "fileName": "evaluation-set-default.json",
-  "id": "<uuid>",
-  "name": "Default Evaluation Set",
-  "batchSize": 10,
-  "evaluatorRefs": ["<evaluator-uuid>", "<evaluator-uuid>"],
-  "evaluations": [
-    {
-      "id": "<uuid>",
-      "name": "happy-path",
-      "inputs": {"input": "hello"},
-      "expectedOutput": {"content": "greeting"},
-      "expectedAgentBehavior": "",
-      "simulationInstructions": "",
-      "simulateInput": false,
-      "simulateTools": false,
-      "inputGenerationInstructions": "",
-      "evalSetId": "<eval-set-uuid>",
-      "source": "manual",
-      "createdAt": "...",
-      "updatedAt": "..."
-    }
-  ],
-  "modelSettings": [],
-  "agentMemoryEnabled": false,
-  "agentMemorySettings": [],
-  "lineByLineEvaluation": false,
-  "createdAt": "...",
-  "updatedAt": "..."
-}
-```
-
-The `source` field indicates how the test case was created: `"manual"` (CLI), `"debugRun"` (from a debug session), `"runtimeRun"` (from a live run), `"simulatedRun"`, or `"autopilotUserInitiated"`.
diff --git a/skills/uipath-agents/references/lowcode/evaluation/evaluators.md b/skills/uipath-agents/references/lowcode/evaluation/evaluators.md
deleted file mode 100644
index bfb284d6d..000000000
--- a/skills/uipath-agents/references/lowcode/evaluation/evaluators.md
+++ /dev/null
@@ -1,102 +0,0 @@
-# Evaluators
-
-Evaluators define how agent output is scored. Each evaluator is a JSON file in `evals/evaluators/`.
-
-## Evaluator Types
-
-| Type | CLI Flag | What It Scores |
-|------|----------|----------------|
-| Semantic Similarity | `semantic-similarity` | Whether the agent's output has the same meaning as the expected output |
-| Trajectory | `trajectory` | Whether the agent's reasoning path and tool usage match expected behavior |
-| Context Precision | `context-precision` | Whether retrieved context is relevant to the user's query |
-| Faithfulness | `faithfulness` | Whether the agent's output is grounded in the provided context |
-
-## Managing Evaluators
-
-### Add an evaluator
-
-```bash
-uip agent eval evaluator add <name> --type <type> --path <path> --output json
-```
-
-**Options:**
-
-| Flag | Required | Description | Default |
-|------|----------|-------------|---------|
-| `--type <type>` | Yes | One of: `semantic-similarity`, `trajectory`, `context-precision`, `faithfulness` | — |
-| `--description <text>` | No | Human-readable description | Auto-generated from type |
-| `--prompt <text>` | No | Custom LLM evaluation prompt | Built-in default per type |
-| `--target-key <key>` | No | Specific output key to evaluate | `*` (all keys) |
-| `--path <path>` | No | Agent project directory | `.` |
-
-**Example:**
-```bash
-uip agent eval evaluator add content-quality \
-  --type semantic-similarity \
-  --path ./my-agent \
-  --output json
-```
-
-### List evaluators
-
-```bash
-uip agent eval evaluator list --path <path> --output json
-```
-
-### Remove an evaluator

-```bash
-uip agent eval evaluator remove <name-or-id> --path <path> --output json
-```
-
-Removing an evaluator automatically removes its references from all eval sets that reference it.
-
-## Default Evaluators
-
-`uip agent init` creates two default evaluators:
-
-### Semantic Similarity (`evaluator-default.json`)
-
-Compares expected vs actual output for semantic equivalence. Uses template variables `{{ExpectedOutput}}` and `{{ActualOutput}}`. Scores 0–100.
-
-### Trajectory (`evaluator-default-trajectory.json`)
-
-Evaluates the agent's reasoning path against expected behavior. Uses template variables `{{UserOrSyntheticInput}}`, `{{SimulationInstructions}}`, `{{ExpectedAgentBehavior}}`, and `{{AgentRunHistory}}`. Scores 0–100.
-
-Both default evaluators use `"same-as-agent"` as the model, which resolves to the agent's configured model at runtime.
-
-## Evaluator JSON Format
-
-```json
-{
-  "fileName": "evaluator-content-quality.json",
-  "id": "<uuid>",
-  "name": "content-quality",
-  "description": "Evaluates semantic similarity of output",
-  "category": 1,
-  "type": 5,
-  "prompt": "Compare {{ExpectedOutput}} with {{ActualOutput}}...",
-  "model": "same-as-agent",
-  "targetOutputKey": "*",
-  "createdAt": "2025-01-01T00:00:00.000Z",
-  "updatedAt": "2025-01-01T00:00:00.000Z"
-}
-```
-
-**Type and category mapping:**
-
-| CLI Type | `type` (numeric) | `category` |
-|----------|-------------------|------------|
-| `semantic-similarity` | 5 | 1 (output-based) |
-| `trajectory` | 7 | 3 (trajectory-based) |
-| `context-precision` | 8 | 1 (output-based) |
-| `faithfulness` | 9 | 1 (output-based) |
-
-## Custom Prompts
-
-When `--prompt` is omitted, the CLI uses a built-in default prompt for each type.
To customize, pass a prompt string using the appropriate template variables:
-
-- **Semantic Similarity**: `{{ExpectedOutput}}`, `{{ActualOutput}}`
-- **Trajectory**: `{{AgentRunHistory}}`, `{{ExpectedBehavior}}`
-- **Context Precision**: `{{UserQuery}}`, `{{RetrievedContext}}`
-- **Faithfulness**: `{{AgentOutput}}`, `{{Context}}`
diff --git a/skills/uipath-agents/references/lowcode/evaluation/running-evaluations.md b/skills/uipath-agents/references/lowcode/evaluation/running-evaluations.md
deleted file mode 100644
index 0ffc4ebe8..000000000
--- a/skills/uipath-agents/references/lowcode/evaluation/running-evaluations.md
+++ /dev/null
@@ -1,163 +0,0 @@
-# Running Evaluations
-
-Execute evaluations against the Agent Runtime, check status, view results, and compare runs.
-
-All run commands require the agent to be pushed to Studio Web first (`uip agent push`). The Agent Runtime executes test cases in the cloud using the pushed agent definition.
-
-## Start an Eval Run
-
-```bash
-uip agent eval run start --set "<eval-set-name>" --path <path> --wait --output json
-```
-
-**Options:**
-
-| Flag | Required | Description | Default |
-|------|----------|-------------|---------|
-| `--set <name-or-id>` | Yes | Eval set name or ID | — |
-| `--path <path>` | No | Agent project directory | `.` |
-| `--wait` | No | Poll until completion and show results | `false` |
-| `--timeout <seconds>` | No | Polling timeout (with `--wait`) | 600 (10 min) |
-| `--solution-id <id>` | No | Override solution ID | Auto-resolved from `SolutionStorage.json` |
-
-Without `--wait`, the command returns immediately with an `EvalSetRunId`:
-
-```json
-{
-  "Code": "AgentEvalRunStarted",
-  "Data": {
-    "EvalSetRunId": "a1b2c3d4-...",
-    "EvalSetName": "Default Evaluation Set",
-    "TestCases": 5,
-    "Evaluators": 2
-  }
-}
-```
-
-With `--wait`, the CLI polls every 5 seconds until completion, then outputs both a summary and per-test-case results.
-
-## Check Run Status
-
-```bash
-uip agent eval run status --set "<eval-set-name>" --path <path> --output json
-```
-
-**Output:**
-```json
-{
-  "Code": "AgentEvalRunStatus",
-  "Data": {
-    "EvalSetRunId": "a1b2c3d4-...",
-    "Status": "completed",
-    "Score": 0.86,
-    "Duration": "42.5s",
-    "EvaluatorScores": "semantic: 0.9, trajectory: 0.82"
-  }
-}
-```
-
-Terminal states: `completed` or `failed`.
-
-## View Results
-
-```bash
-uip agent eval run results \
-  --set "<eval-set-name>" \
-  --path <path> \
-  --output json
-```
-
-**Options:**
-
-| Flag | Description |
-|------|-------------|
-| `--only-failed` | Show only failed or errored test cases |
-| `--verbose` | Include evaluator justifications in output |
-| `--export-format <format>` | Export results to file (`eval-results-{timestamp}.json` or `.csv`) |
-
-**Per-test-case output fields:** `TestCase`, `Status`, `Score`, `EvaluatorScores`, `Tokens`, `Duration`, `Error` (plus `Justifications` when `--verbose`).
-
-### Failure detection
-
-A test case is considered **failed** if any of these are true:
-- Status is `failed`
-- Has an error message
-- Any evaluator score type is `error`
-- Any exact-match evaluator returned `false`
-
-## List Past Runs
-
-```bash
-uip agent eval run list --set "<eval-set-name>" --path <path> --output json
-```
-
-**Per-row output:** `EvalSetRunId`, `Status`, `Score`, `TestCases`, `Duration`, `EvaluatorScores`, `CreatedAt`.
-## Compare Runs
-
-Compare two eval runs side by side to see score changes:
-
-```bash
-uip agent eval run compare \
-  --compare-to <run-id> \
-  --set "<eval-set-name>" \
-  --path <path> \
-  --output json
-```
-
-**Output:**
-```json
-{
-  "Code": "AgentEvalRunComparison",
-  "Data": {
-    "RunA": { "Id": "...", "Score": 0.86, "Status": "completed" },
-    "RunB": { "Id": "...", "Score": 0.80, "Status": "completed" },
-    "ScoreDelta": 0.06,
-    "TestCases": [
-      {
-        "TestCase": "happy-path",
-        "ScoreA": 1.0,
-        "ScoreB": 0.9,
-        "Delta": "+0.1",
-        "StatusA": "completed",
-        "StatusB": "completed"
-      }
-    ]
-  }
-}
-```
-
-Use `compare` after prompt changes to verify improvements without regressions.
-
-## Workflow Example
-
-```bash
-# 1. Push agent to Studio Web (if not already done)
-uip agent push --path ./my-agent --output json
-
-# 2. Add test cases
-uip agent eval add greeting-test \
-  --set "Default Evaluation Set" \
-  --inputs '{"input":"hi there"}' \
-  --expected '{"content":"Hello! How can I help you?"}' \
-  --expected-agent-behavior "Agent should respond with a friendly greeting" \
-  --path ./my-agent --output json
-
-# 3. Run and wait
-uip agent eval run start \
-  --set "Default Evaluation Set" \
-  --path ./my-agent \
-  --wait --output json
-
-# 4. Review failures
-uip agent eval run results \
-  --set "Default Evaluation Set" \
-  --only-failed --verbose \
-  --path ./my-agent --output json
-
-# 5. Make changes, push, re-run, compare
-uip agent push --path ./my-agent --output json
-uip agent eval run start --set "Default Evaluation Set" --path ./my-agent --wait --output json
-uip agent eval run compare --compare-to <previous-run-id> \
-  --set "Default Evaluation Set" --path ./my-agent --output json
-```
diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluate.md b/skills/uipath-agents/references/lowcode/evaluations/evaluate.md
new file mode 100644
index 000000000..632af61fc
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluations/evaluate.md
@@ -0,0 +1,90 @@
+# Evaluate Low-Code Agents

Design and run evaluations against low-code agents using the `uip agent eval` CLI.

## Quick Reference

```bash
# Add a test case
uip agent eval add happy-path --set "Default Evaluation Set" --inputs '{"input":"hello"}' --expected '{"content":"greeting"}' --path ./my-agent --output json

# Run evals and wait for results
uip agent eval run start --set "Default Evaluation Set" --path ./my-agent --wait --output json

# Check results (failures only, with justifications)
uip agent eval run results --set "Default Evaluation Set" --only-failed --verbose --path ./my-agent --output json
```

## Prerequisites

- Agent project initialized (`uip agent init <name>`)
- `entry-points.json` present (defines `input`/`output` schema that test case `--inputs`/`--expected` must conform to)
- `uip agent validate --output json` passes (validate also checks evals and evaluators)
- Agent pushed to Studio Web (`uip agent push`) — required for running evals (the Agent Runtime executes test cases in the cloud)

Local operations (managing evaluators, eval sets, test cases) do **not** require authentication or a cloud connection. Only `uip agent eval run *` commands require cloud connectivity.
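A minimal pre-flight sequence for a fresh project, assuming the name `my-agent` (a sketch built only from the commands documented in this skill; adjust names and paths to your project):

```bash
# Scaffold the project, then work from inside it
uip agent init my-agent --output json
cd my-agent

# Validate locally; this also checks evals/ and evaluators/
uip agent validate --output json

# Push so the Agent Runtime executes the current definition
uip agent push --output json
```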
## Reference Navigation

- [Evaluators](evaluators.md) — evaluator types, adding/removing, default prompts
- [Evaluation Sets and Test Cases](evaluation-sets.md) — creating sets, adding test cases, simulation options
- [Running Evaluations](running-evaluations.md) — start, status, results, compare

Read Evaluators before choosing an evaluator type, and Evaluation Sets before writing test cases.

## File Structure

After `uip agent init`, the project structure is:

```
my-agent/
  agent.json
  entry-points.json                      # Input/output schema — test case --inputs / --expected must match
  project.uiproj
  flow-layout.json
  evals/
    evaluators/
      evaluator-default.json             # name: "Default Evaluator" (semantic-similarity)
      evaluator-default-trajectory.json  # name: "Default Trajectory Evaluator"
    eval-sets/
      evaluation-set-default.json        # name: "Default Evaluation Set" (references both evaluators)
```

Evaluators live in `evals/evaluators/` and eval sets (with inline test cases) live in `evals/eval-sets/`. Both are auto-discovered by the CLI from these directories.

CLI-added evaluators are written as `evaluator-<id-prefix>.json` (first 8 hex chars of the evaluator UUID). The `<name>` argument populates the `name` field inside the JSON, NOT the filename. Reference evaluators in eval sets by `id` (UUID), not filename.

## Key Differences from Coded Agent Evals

| Aspect | Coded (`uip codedagent eval`) | Low-code (`uip agent eval`) |
|--------|-------------------------------|------------------------------|
| Execution | Local Python process | Cloud-based via Agent Runtime |
| Auth required | Only for `--report` | For `run` commands (cloud execution) |
| Prerequisite | `entry-points.json` | `uip agent push` |
| Mocking | `@mockable()` decorator + declarative | Simulation instructions only |
| CLI prefix | `uip codedagent eval` | `uip agent eval` |

## Troubleshooting

| Error | Cause | Fix |
|-------|-------|-----|
| Solution ID could not be resolved | Agent not pushed to Studio Web | Run `uip agent push --output json`, or pass `--solution-id <id>` explicitly to `uip agent eval run start` |
| `No evaluators found` | Empty `evals/evaluators/` directory | Run `uip agent eval evaluator add` or re-init with `uip agent init` |
| `No test cases in eval set` | Eval set has no evaluations | Run `uip agent eval add` to add test cases |
| `Unknown evaluator type "X"` | Wrong case on `--type` value | Use kebab-case only: `semantic-similarity`, `trajectory`, `context-precision`, `faithfulness` |
| `Evaluator '<name>' is an LLM-based evaluator but 'model' is not set in its evaluatorConfig.` | LLM evaluator JSON has empty/missing `model` and is not `same-as-agent` | Set `"model"` in the evaluator JSON to a valid model (e.g. `claude-haiku-4-5-20251001`), or set it to `"same-as-agent"` and ensure `agent.json` has a model |
| `'same-as-agent' model option requires agent settings. Ensure agent.json contains valid model settings.` | Evaluator uses `"model": "same-as-agent"` but `agent.json` has no resolvable model | Set a model in `agent.json`, or override the evaluator with an explicit model |
| `401 Unauthorized` | Auth expired | Run `uip login --output json` |
| Eval run timeout (with `--wait`) | Agent taking too long or stuck | Increase `--timeout` or check agent health in Studio Web. Note: this only stops the local CLI from blocking; the run continues server-side — query it with `uip agent eval run status --set "<eval-set-name>"` |
| Validate fails with eval errors | Eval set references an evaluator that no longer exists, OR evaluator JSON missing required field, OR `category`/`type` mismatch (see [evaluators.md](evaluators.md) § What `uip agent validate` Checks) | Re-run `uip agent eval evaluator list` and reconcile `evaluatorRefs`; fix per the validate error message |

The two model-resolution errors above are **runtime checks in the cloud eval worker**, not validate-time checks — `uip agent validate` will not catch them. They surface only after `uip agent eval run start`. To pre-empt them, inspect each evaluator's `model` field locally before pushing.

## Anti-patterns

- **Don't run `uip agent eval run start` before `uip agent push`.** The Agent Runtime executes against the pushed agent. Local edits to `agent.json` after the last push will not be reflected in the run.
- **Don't skip `uip agent validate` before push.** Validate checks `evals/` and `evaluators/`; broken eval JSON will not block push but will surface as runtime errors.
- **Don't hand-edit `id` or `evaluatorRefs` UUIDs.** Eval sets reference evaluators by UUID. Renaming an evaluator file or copy-pasting a UUID across evaluators silently breaks resolution.
- **Don't expect filenames to match `<name>`.** CLI-generated evaluator files use `evaluator-<id-prefix>.json`, not `<name>.json`. Look up evaluators by the `name` field inside the JSON, not by filename.
- **Don't pass `--type` in PascalCase.** The CLI rejects `SemanticSimilarity`. Only kebab-case is accepted.
- **Don't reference evaluators across projects.** Each agent project has its own `evals/evaluators/` directory; UUIDs are not portable.
diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md b/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md
new file mode 100644
index 000000000..60a751f96
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md
@@ -0,0 +1,163 @@
+# Evaluation Sets and Test Cases

Evaluation sets group test cases and reference which evaluators to use. Each set is a JSON file in `evals/eval-sets/`. Test cases are stored inline within the eval set.

## Managing Eval Sets

### Add an eval set

```bash
uip agent eval set add <name> --path <path> --output json
```

**Options:**

| Flag | Required | Description | Default |
|------|----------|-------------|---------|
| `--evaluators <ids>` | No | Comma-separated evaluator IDs | All existing evaluators |
| `--path <path>` | No | Agent project directory | `.` |

When `--evaluators` is not provided, the new eval set automatically references **all** evaluators in the project.

### List eval sets

```bash
uip agent eval set list --path <path> --output json
```

### Remove an eval set

```bash
uip agent eval set remove <name-or-id> --path <path> --output json
```

## Managing Test Cases

Test cases live inside eval sets. Each test case defines an input, expected output, and optional behavior expectations.
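As a concrete sketch before the command reference below, assume a hypothetical agent whose `entry-points.json` declares a string input field `input` and an output field `content`; a complete test case for it looks like this:

```bash
# Hypothetical refund-handling agent; the --inputs/--expected keys must
# match the input/output schema in entry-points.json
uip agent eval add refund-request \
  --set "Default Evaluation Set" \
  --inputs '{"input":"I want a refund for order 1234"}' \
  --expected '{"content":"Refund policy explanation with next steps"}' \
  --path ./my-agent --output json
```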
### Add a test case

```bash
uip agent eval add <name> \
  --set "<eval-set-name>" \
  --inputs '{"input":"hello"}' \
  --expected '{"content":"greeting response"}' \
  --path <path> \
  --output json
```

**Options:**

| Flag | Required | Description | Default |
|------|----------|-------------|---------|
| `--set <name-or-id>` | Yes | Eval set name or ID | — |
| `--inputs <json>` | Yes | Input values as JSON | — |
| `--expected <json>` | No | Expected output as JSON | `{}` |
| `--expected-agent-behavior <text>` | No | Description of expected behavior (used by trajectory evaluator) | `""` |
| `--simulation-instructions <text>` | No | Instructions for simulating agent behavior | `""` |
| `--simulate-input` | No | Enable input simulation | `false` |
| `--simulate-tools` | No | Enable tool simulation | `false` |
| `--input-generation-instructions <text>` | No | Instructions for generating synthetic inputs | `""` |
| `--path <path>` | No | Agent project directory | `.` |

### List test cases

```bash
uip agent eval list --set "<eval-set-name>" --path <path> --output json
```

### Remove a test case

```bash
uip agent eval remove <name-or-id> --set "<eval-set-name>" --path <path> --output json
```

## Test Case Design

### Aligning `--inputs` with `entry-points.json`

`--inputs` JSON keys must match the `input` schema in `entry-points.json`. Mismatched keys do not block `eval add` (the CLI stores the JSON verbatim) but will fail at run time when the Agent Runtime invokes the agent. Run `uip agent validate --output json` after adding test cases to surface schema drift.

### Matching evaluator to test case fields

The `--inputs` and `--expected` flags populate `inputs` and `expectedOutput` on the test case. Each evaluator type sources its placeholder values from a different combination of test-case fields and agent run trace:

| Evaluator Type | From test case | From agent run trace |
|----------------|---------------|----------------------|
| `semantic-similarity` | `expectedOutput` → `{{ExpectedOutput}}` | Agent output → `{{ActualOutput}}` |
| `trajectory` | `expectedAgentBehavior` → `{{ExpectedAgentBehavior}}`, `inputs` → `{{UserOrSyntheticInput}}`, `simulationInstructions` → `{{SimulationInstructions}}` | Trace → `{{AgentRunHistory}}` |
| `context-precision` | (none directly used) | RETRIEVER spans `input.value` → `{{UserQuery}}`, `output.value.documents` → `{{RetrievedContext}}` |
| `faithfulness` | `expectedOutput` → `{{AgentOutput}}` (note: it is the *expected* output that is treated as the candidate text to fact-check, not the agent's actual output) | Trace span outputs (RETRIEVER + tool calls) → `{{Context}}` |

`context-precision` and `faithfulness` are **trace-driven evaluators**. They extract `{{UserQuery}}`, `{{RetrievedContext}}`, and `{{Context}}` by walking `openinference.span.kind == "RETRIEVER"` (and other tool spans) on the agent's run trace. Their behavior:

- **The agent must perform retrieval** (Context Grounding / index / DataFabric / a tool that emits an OpenInference RETRIEVER span). Without retrieval spans, the placeholders resolve to empty and scores collapse.
- **`--inputs` and `--expected` are not consumed in the obvious way**: `context-precision` ignores test-case `inputs` (it reads the query from the trace); `faithfulness` reads the *expected* output (not the agent's actual output) as the candidate text.
- **CLI-default placeholders may differ from SDK-internal placeholders.** The CLI writes prompts with `{{UserQuery}}` and `{{RetrievedContext}}` for context-precision, but the SDK's legacy evaluator hardcodes `{{Query}}` and `{{Chunks}}` internally. Inspect the resulting evaluator JSON; if you customize the prompt, match the placeholders the runtime actually substitutes (test with a small run before relying on results).

If the agent has no retrieval step, remove `context-precision` and `faithfulness` from the eval set rather than letting them silently score everything as 0.

For trajectory evaluation, write `--expected-agent-behavior` as a natural language description of what the agent should do, not what it should output:

```bash
uip agent eval add tool-usage-test \
  --set "Default Evaluation Set" \
  --inputs '{"input":"What is the weather in NYC?"}' \
  --expected-agent-behavior "Agent should call the weather tool with location NYC and return a formatted weather summary" \
  --path ./my-agent --output json
```

### Simulation options

- `--simulate-input` — runtime generates synthetic input variations based on the provided input
- `--simulate-tools` — tool calls are simulated rather than executed against real services
- `--input-generation-instructions` — guides synthetic input generation (e.g., "generate edge cases with empty strings and special characters")
- `--simulation-instructions` — guides overall simulation behavior

Use these to expand test coverage without writing every input by hand.

## Eval Set JSON Format

```json
{
  "fileName": "evaluation-set-default.json",
  "id": "<uuid>",
  "name": "Default Evaluation Set",
  "batchSize": 10,
  "evaluatorRefs": ["<evaluator-uuid>", "<evaluator-uuid>"],
  "evaluations": [
    {
      "id": "<uuid>",
      "name": "happy-path",
      "inputs": {"input": "hello"},
      "expectedOutput": {"content": "greeting"},
      "expectedAgentBehavior": "",
      "simulationInstructions": "",
      "simulateInput": false,
      "simulateTools": false,
      "inputGenerationInstructions": "",
      "evalSetId": "<eval-set-uuid>",
      "source": "manual",
      "createdAt": "...",
      "updatedAt": "..."
    }
  ],
  "modelSettings": [],
  "agentMemoryEnabled": false,
  "agentMemorySettings": [],
  "lineByLineEvaluation": false,
  "createdAt": "...",
  "updatedAt": "..."
}
```

The `source` field indicates how the test case was created. CLI-added test cases are always `"manual"` (verified). Other observed values from Studio Web include `"debugRun"`, `"runtimeRun"`, `"simulatedRun"`, and `"autopilotUserInitiated"` — treat the `source` field as an enum but do not set it manually; the CLI and Studio Web own this value.

## Anti-patterns

- **Don't hand-write `evalSetId` or test case `id` UUIDs.** Use `uip agent eval add` so the CLI keeps `evaluations[].evalSetId` consistent with the parent eval set's `id`.
- **Don't add `--inputs` keys that are not in `entry-points.json`.** The runtime will reject the test case at execution time. Run `uip agent validate` to catch this before push.
- **Don't add `context-precision` or `faithfulness` evaluators to an eval set whose agent has no RETRIEVER span.** Both extract their placeholders from agent trace spans, not from `inputs`/`expectedOutput`. No retrieval → scores collapse to 0.
- **Don't expect `faithfulness` to read the agent's actual output.** It reads `expectedOutput` (the criteria field) as the candidate text. To fact-check actual agent output, use `semantic-similarity` against an expected ground truth instead.
## Eval Set JSON Format

```json
{
  "fileName": "evaluation-set-default.json",
  "id": "<eval-set-uuid>",
  "name": "Default Evaluation Set",
  "batchSize": 10,
  "evaluatorRefs": ["<evaluator-uuid>", "<evaluator-uuid>"],
  "evaluations": [
    {
      "id": "<test-case-uuid>",
      "name": "happy-path",
      "inputs": {"input": "hello"},
      "expectedOutput": {"content": "greeting"},
      "expectedAgentBehavior": "",
      "simulationInstructions": "",
      "simulateInput": false,
      "simulateTools": false,
      "inputGenerationInstructions": "",
      "evalSetId": "<eval-set-uuid>",
      "source": "manual",
      "createdAt": "...",
      "updatedAt": "..."
    }
  ],
  "modelSettings": [],
  "agentMemoryEnabled": false,
  "agentMemorySettings": [],
  "lineByLineEvaluation": false,
  "createdAt": "...",
  "updatedAt": "..."
}
```

The `source` field indicates how the test case was created. CLI-added test cases are always `"manual"` (verified). Other observed values from Studio Web include `"debugRun"`, `"runtimeRun"`, `"simulatedRun"`, and `"autopilotUserInitiated"` — treat the `source` field as an enum but do not set it manually; the CLI and Studio Web own this value.

## Anti-patterns

- **Don't hand-write `evalSetId` or test case `id` UUIDs.** Use `uip agent eval add` so the CLI keeps `evaluations[].evalSetId` consistent with the parent eval set's `id`.
- **Don't add `--inputs` keys that are not in `entry-points.json`.** The runtime will reject the test case at execution time. Run `uip agent validate` to catch this before push.
- **Don't add `context-precision` or `faithfulness` evaluators to an eval set whose agent has no RETRIEVER span.** Both extract their placeholders from agent trace spans, not from `inputs`/`expectedOutput`. No retrieval → scores collapse to 0.
- **Don't expect `faithfulness` to read the agent's actual output.** It reads `expectedOutput` (the criteria field) as the candidate text. To fact-check actual agent output, use `semantic-similarity` against an expected ground truth instead.
- **Don't set `--expected '{}'` (empty) and `--expected-agent-behavior ""` together.** The semantic-similarity evaluator scores against an empty `{{ExpectedOutput}}`; the trajectory evaluator scores against an empty `{{ExpectedAgentBehavior}}`. Every run scores low for non-actionable reasons.
- **Don't set the `source` field manually.** Owned by the CLI and Studio Web; hand-edits may be overwritten on the next sync.
diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluators.md b/skills/uipath-agents/references/lowcode/evaluations/evaluators.md
new file mode 100644
index 000000000..89d8c0d61
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluations/evaluators.md
@@ -0,0 +1,242 @@
+# Evaluators
+
Evaluators define how agent output is scored. Each evaluator is a JSON file in `evals/evaluators/`.

## Why fewer evaluators than coded?

The coded eval reference ([coded/lifecycle/evaluations/evaluators.md](../../coded/lifecycle/evaluations/evaluators.md)) lists 13 evaluator types. Low-code lists only 4 because the two surfaces use **different engines** in the SDK:

- **Coded** uses the new evaluator hierarchy (`BaseEvaluator`, eval sets carry `version: "1.0"`). 13 distinct `evaluatorTypeId` strings, each with its own implementation class.
- **Low-code** uses the **legacy** evaluator hierarchy (`BaseLegacyEvaluator`, no `version` field on the eval set). Only 4 implementation classes ship: `LegacyLlmAsAJudgeEvaluator`, `LegacyTrajectoryEvaluator`, `LegacyExactMatchEvaluator`, `LegacyJsonSimilarityEvaluator`.

Most coded evaluator types (`contains`, `binary-classification`, `multiclass-classification`, all four `tool-call-*`, `llm-judge-output-strict-json-similarity`, `llm-judge-trajectory-simulation`) **have no legacy counterpart** and cannot be used on a low-code agent — the cloud eval worker will not load them.

The CLI also narrows the runtime surface further: of the 6 legacy `type` values the runtime accepts, the `--type` flag exposes only 4. See § Runtime-supported types not exposed by the CLI below.

## Evaluator Types (CLI-exposed)

| Type | CLI Flag | What It Scores |
|------|----------|----------------|
| Semantic Similarity | `semantic-similarity` | Whether the agent's output has the same meaning as the expected output |
| Trajectory | `trajectory` | Whether the agent's reasoning path and tool usage match expected behavior |
| Context Precision | `context-precision` | Whether retrieved context is relevant to the user's query |
| Faithfulness | `faithfulness` | Whether the agent's output is grounded in the provided context |

## Runtime-supported types not exposed by the CLI

The eval worker's discriminator (`uipath/eval/evaluators/evaluator.py` § `legacy_evaluator_discriminator`) accepts two more `type` values that have no `--type` flag. To use them, hand-write the evaluator JSON in `evals/evaluators/<filename>.json`:

### `Equals` (type 1, category 0 — Deterministic)

Exact-match comparison; no LLM. Equivalent of coded `uipath-exact-match`.

```json
{
  "fileName": "evaluator-equals.json",
  "id": "<uuid>",
  "name": "exact-match",
  "description": "Exact-match evaluator",
  "category": 0,
  "type": 1,
  "targetOutputKey": "*",
  "createdAt": "<timestamp>",
  "updatedAt": "<timestamp>"
}
```

No `prompt`/`model` required (the Deterministic category bypasses the LLM checks).

### `JsonSimilarity` (type 6, category 0 — Deterministic)

Tree-based JSON comparison; no LLM. Equivalent of coded `uipath-json-similarity`.

```json
{
  "fileName": "evaluator-json-sim.json",
  "id": "<uuid>",
  "name": "json-similarity",
  "description": "JSON similarity evaluator",
  "category": 0,
  "type": 6,
  "targetOutputKey": "*",
  "createdAt": "<timestamp>",
  "updatedAt": "<timestamp>"
}
```

After hand-writing, run `uip agent validate --output json` to confirm the file passes schema migration. Then reference the new evaluator's `id` from your eval set's `evaluatorRefs`. Watch for: `id` collisions with existing evaluators, missing required fields, and ISO-8601 formatting on the timestamps.
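A quick dangling-reference check, assuming `jq` is installed and the default file layout described in evaluate.md:

```bash
# ids referenced by the eval set, minus ids that actually exist on disk —
# any line printed is a dangling evaluatorRef.
comm -23 \
  <(jq -r '.evaluatorRefs[]' evals/eval-sets/evaluation-set-default.json | sort) \
  <(jq -r '.id' evals/evaluators/*.json | sort)
```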
## Coded-only evaluators (NOT available on low-code)

The following coded `evaluatorTypeId` strings have no legacy class — agents working on a low-code agent should not attempt to use them. Switch to a coded agent (`version: "1.0"` eval sets) if you need any of these:

`uipath-contains`, `uipath-llm-judge-output-strict-json-similarity`, `uipath-llm-judge-trajectory-simulation`, `uipath-binary-classification`, `uipath-multiclass-classification`, `uipath-tool-call-order`, `uipath-tool-call-args`, `uipath-tool-call-count`, `uipath-tool-call-output`.

## Managing Evaluators

### Add an evaluator

```bash
uip agent eval evaluator add <name> --type <type> --path <project-dir> --output json
```

**Options:**

| Flag | Required | Description | Default |
|------|----------|-------------|---------|
| `--type <type>` | Yes | One of: `semantic-similarity`, `trajectory`, `context-precision`, `faithfulness` | — |
| `--description <text>` | No | Human-readable description | Auto-generated from type |
| `--prompt <prompt>` | No | Custom LLM evaluation prompt | Built-in default per type |
| `--target-key <key>` | No | Specific output key to evaluate | `*` (all keys) |
| `--path <dir>` | No | Agent project directory | `.` |

**Example:**
```bash
uip agent eval evaluator add content-quality \
  --type semantic-similarity \
  --path ./my-agent \
  --output json
```

### List evaluators

```bash
uip agent eval evaluator list --path <project-dir> --output json
```

### Remove an evaluator

```bash
uip agent eval evaluator remove <name-or-id> --path <project-dir> --output json
```

Removing an evaluator automatically removes its references from all eval sets that reference it.

## Default Evaluators

`uip agent init` creates two default evaluators:

### Semantic Similarity (`evaluator-default.json`, `name: "Default Evaluator"`)

Compares expected vs actual output for semantic equivalence. The default prompt asks the LLM for a 0–100 score and substitutes `{{ExpectedOutput}}` and `{{ActualOutput}}`.

### Trajectory (`evaluator-default-trajectory.json`, `name: "Default Trajectory Evaluator"`)

Evaluates the agent's reasoning path against expected behavior. The default prompt asks the LLM for a 0–100 score and substitutes `{{UserOrSyntheticInput}}`, `{{SimulationInstructions}}`, `{{ExpectedAgentBehavior}}`, and `{{AgentRunHistory}}`.

Both default evaluators ship with `"model": "same-as-agent"` — this is supported and resolves to the agent's configured model at runtime. Override with an explicit model only if you need to score with a different model than the agent uses.

The runtime DTO normalizes all evaluator scores to a 0–100 scale regardless of what the prompt asks for, but mixed-scale prompts in the same eval set produce confusing intermediate values — pick one scale per eval set.

## Filename vs Name

CLI-added evaluators are saved as `evaluator-<uuid8>.json` (first 8 hex chars of the evaluator UUID). The `<name>` argument populates the `name` field inside the JSON; it does NOT shape the filename.

```bash
uip agent eval evaluator add content-quality --type semantic-similarity --path ./my-agent
# Creates: evals/evaluators/evaluator-b47e26ca.json
# JSON has: "name": "content-quality"
```

The two `evaluator-default*.json` files are written by `uip agent init`, not by `evaluator add`. Eval sets reference evaluators by `id` (UUID), not by filename or name.
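To map a `name` back to its file, a small sketch — plain `grep`, assuming the CLI's pretty-printed JSON spacing:

```bash
# Which file holds the evaluator named "content-quality"?
grep -l '"name": "content-quality"' evals/evaluators/*.json
```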
## Evaluator JSON Format

```json
{
  "fileName": "evaluator-b47e26ca.json",
  "id": "b47e26ca-7a13-4c83-9ee4-039d6415fb63",
  "name": "content-quality",
  "description": "Semantic Similarity",
  "category": 1,
  "type": 5,
  "prompt": "As an expert evaluator, ... {{ExpectedOutput}} ... {{ActualOutput}} ...",
  "model": "same-as-agent",
  "targetOutputKey": "*",
  "createdAt": "2026-05-04T00:00:00.000Z",
  "updatedAt": "2026-05-04T00:00:00.000Z"
}
```

**Type and category mapping:**

| CLI Type | `type` (numeric) | `category` |
|----------|-------------------|------------|
| `semantic-similarity` | 5 | 1 (output-based) |
| `trajectory` | 7 | 3 (trajectory-based) |
| `context-precision` | 8 | 1 (output-based) |
| `faithfulness` | 9 | 1 (output-based) |

## Default Prompts and Template Variables

The prompt and score scale the CLI writes when you run `evaluator add` differ from what `uip agent init` writes for the two default evaluators:

| Type | `evaluator add` default | `uip agent init` default |
|------|-------------------------|--------------------------|
| `semantic-similarity` | Asks 0–1; uses `{{ExpectedOutput}}`, `{{ActualOutput}}` | Asks 0–100; same placeholders |
| `trajectory` | Asks 0–1; uses `{{AgentRunHistory}}`, `{{ExpectedBehavior}}` | Asks 0–100; uses `{{UserOrSyntheticInput}}`, `{{SimulationInstructions}}`, `{{ExpectedAgentBehavior}}`, `{{AgentRunHistory}}` |
| `context-precision` | Asks 0–1; uses `{{UserQuery}}`, `{{RetrievedContext}}` | Not created by `init` |
| `faithfulness` | Asks 0–1; uses `{{AgentOutput}}`, `{{Context}}` | Not created by `init` |

Two notable inconsistencies:

1. **Trajectory placeholder names**: `{{ExpectedBehavior}}` (CLI add) vs `{{ExpectedAgentBehavior}}` (init default). When editing a prompt, use the placeholders already present in that file — do not mix.
2. **Score scales**: `evaluator add` writes 0–1 prompts; `init` writes 0–100 prompts. The runtime normalizes both to 0–100 in the result DTO, but the LLM judge actually returns whatever the prompt asks for. Mixed-scale eval sets are hard to read; pick one and rewrite the prompts you don't want.

For `context-precision` and `faithfulness`, the SDK's legacy evaluator may use its own internal placeholders (`{{Query}}`, `{{Chunks}}`) that differ from what the CLI writes. Inspect the resulting evaluator JSON and run a small test before relying on customized prompts. See [evaluation-sets.md](evaluation-sets.md) § Matching evaluator to test case fields for the data flow.

## Custom Prompts

Pass `--prompt` to override the default. Use only the placeholders listed above for the chosen `--type`; unknown placeholders are passed through to the LLM as literal text.

```bash
uip agent eval evaluator add strict-match \
  --type semantic-similarity \
  --prompt 'Score 0-100 how closely {{ActualOutput}} matches {{ExpectedOutput}}. Return JSON {"score": N, "reason": "..."}.' \
  --path ./my-agent --output json
```
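A read-back sketch to confirm what was actually written (assumes `jq`; deterministic evaluators print `null` for the prompt). Check that the placeholders and scale in the stored prompt match what you intended:

```bash
# Print every evaluator's name followed by its stored prompt.
jq -r '"\(.name):\n\(.prompt)\n"' evals/evaluators/*.json
```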
## What `uip agent validate` Checks

Validate runs schema migration, which enforces the following on every file in `evals/evaluators/`:

**Required fields:** `fileName`, `id`, `name`, `description`, `category`, `type`, `targetOutputKey`, `createdAt`, `updatedAt`. Missing field → `Required field "<field>" is missing`.

**Category ↔ type compatibility:**

| Category | Name | Allowed `type` | Additional requirements |
|----------|------|----------------|-------------------------|
| `0` | Deterministic | `1`, `6` | — |
| `1` | LlmAsAJudge | `5`, `8`, `9` | `prompt` and `model` required |
| `3` | Trajectory | `7` | `prompt` and `model` required |

Category `2` (`AgentScorer`) exists in the SDK enum but is reserved/unused — do not write it manually.

Eval sets are validated against a Zod schema. The CLI surfaces the offending file path, JSON path, and message — fix and re-run validate.

## Runtime Errors (Eval Worker)

These errors surface only after `uip agent eval run start` — `uip agent validate` does NOT catch them. They come from the cloud eval worker (`python-eval-worker/workflows/eval/activities.py`) and the SDK's `EvaluatorFactory`.

| Error string | Trigger | Fix |
|--------------|---------|-----|
| `Evaluator '<name>' is an LLM-based evaluator but 'model' is not set in its evaluatorConfig. Specify a valid model name (e.g. 'claude-haiku-4-5-20251001').` | Evaluator JSON has an empty/missing `model` (and is not `same-as-agent`). The worker fails fast before calling the LLM gateway. | Set `model` in the evaluator JSON to a model available in your tenant, or set `"model": "same-as-agent"` and ensure `agent.json` has a model. |
| `'same-as-agent' model option requires agent settings. Ensure agent.json contains valid model settings.` | Evaluator uses `"same-as-agent"` but `agent.json` has no resolvable model. | Set `model` in `agent.json`, or override the evaluator with an explicit model. |

**Pre-empt locally:** before push, run

```bash
uip agent eval evaluator list --path ./my-agent --output json --output-filter '[?model==`""` || model==`null`]'
```

to find any LLM evaluator without an explicit model. (Switch to `--output-filter '[?model==`"same-as-agent"`]'` if you want to flag those that depend on `agent.json`.)
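If the filter flags a file, one possible fix-up before pushing — a sketch that pins an explicit model (assumes `jq`; the filename and model name are illustrative):

```bash
# Pin an explicit model on the offending evaluator, then re-validate.
f=evals/evaluators/evaluator-b47e26ca.json
jq '.model = "claude-haiku-4-5-20251001"' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
uip agent validate --output json   # re-check schema migration after the hand-edit
```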
## Anti-patterns

- **Don't reference an evaluator by filename.** Eval sets reference evaluators by UUID (`id`).
- **Don't pass `--type` in PascalCase.** Only `semantic-similarity`, `trajectory`, `context-precision`, `faithfulness` are accepted.
- **Don't assume `evaluator add` mirrors `init`'s prompts.** They differ for trajectory; check the resulting JSON before reusing template variables in your own scoring tooling.
- **Don't delete an evaluator file by hand.** Use `uip agent eval evaluator remove` so `evaluatorRefs` in eval sets are cleaned up automatically.
- **Don't copy evaluator JSON across projects without regenerating UUIDs.** `id` collisions silently corrupt cross-project resolution.
- **Don't try to add a coded-only evaluator type to a low-code agent.** Anything starting with `uipath-tool-call-*`, `uipath-binary-classification`, `uipath-multiclass-classification`, `uipath-contains`, `uipath-llm-judge-output-strict-json-similarity`, or `uipath-llm-judge-trajectory-simulation` has no legacy class and the eval worker will not load it. If you need one of these, the agent must be coded, not low-code.
- **Don't hand-write a category/type combination outside the validate matrix.** Validate accepts cat 0 → types {1, 6}, cat 1 → types {5, 8, 9}, cat 3 → type {7}. Anything else fails schema migration.
diff --git a/skills/uipath-agents/references/lowcode/evaluations/running-evaluations.md b/skills/uipath-agents/references/lowcode/evaluations/running-evaluations.md
new file mode 100644
index 000000000..8713186b1
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluations/running-evaluations.md
@@ -0,0 +1,207 @@
+# Running Evaluations
+
Execute evaluations against the Agent Runtime, check status, view results, and compare runs.

All run commands require the agent to be pushed to Studio Web first (`uip agent push`). The Agent Runtime executes test cases in the cloud using the pushed agent definition.

## Start an Eval Run

```bash
uip agent eval run start --set "<set-name-or-id>" --path <project-dir> --wait --output json
```

**Options:**

| Flag | Required | Description | Default |
|------|----------|-------------|---------|
| `--set <name-or-id>` | Yes | Eval set name or ID | — |
| `--path <dir>` | No | Agent project directory | `.` |
| `--wait` | No | Block until the run completes, then print results | `false` |
| `--timeout <seconds>` | No | Maximum time to block when `--wait` is set | `600` (10 min) |
| `--solution-id <id>` | No | Override solution ID for this run | Auto-resolved from the pushed-agent state |

Without `--wait`, the command returns immediately with `Code: AgentEvalRunStarted`:

```json
{
  "Code": "AgentEvalRunStarted",
  "Data": {
    "EvalSetRunId": "a1b2c3d4-...",
    "EvalSetName": "Default Evaluation Set",
    "TestCases": 5,
    "Evaluators": 2
  }
}
```

With `--wait`, the CLI polls every 5 seconds (hardcoded interval) until the run reaches a terminal state (`completed` or `failed`) or `--timeout` elapses, then emits `AgentEvalRunCompleted` plus per-test `AgentEvalRunResults`. If `--timeout` elapses first, the run continues server-side; query progress with `eval run status`.

### Output codes

| Subcommand | `Code` |
|------------|--------|
| `run start` (no `--wait`) | `AgentEvalRunStarted` |
| `run start --wait` (summary) | `AgentEvalRunCompleted` |
| `run start --wait` (per-case detail) | `AgentEvalRunResults` |
| `run status` | `AgentEvalRunStatus` |
| `run results` | `AgentEvalRunResults` |
| `run results --export-format` | `AgentEvalRunExported` |
| `run list` | `AgentEvalRunList` |
| `run compare` | `AgentEvalRunComparison` |

## Check Run Status

```bash
uip agent eval run status --set "<set-name-or-id>" --path <project-dir> --output json
```

**Output:**
```json
{
  "Code": "AgentEvalRunStatus",
  "Data": {
    "EvalSetRunId": "a1b2c3d4-...",
    "Status": "completed",
    "Score": 0.86,
    "Duration": "42.5s",
    "EvaluatorScores": "semantic: 0.9, trajectory: 0.82"
  }
}
```

Terminal states: `completed` or `failed`.

## View Results

```bash
uip agent eval run results \
  --set "<set-name-or-id>" \
  --path <project-dir> \
  --output json
```

**Options:**

| Flag | Description |
|------|-------------|
| `--only-failed` | Show only failed or errored test cases |
| `--verbose` | Include evaluator justifications in output |
| `--export-format <format>` | Export results to file as `json` or `csv` (`eval-results-{timestamp}.json` or `.csv`) |

**Per-test-case output fields:** `TestCase`, `Status`, `Score`, `EvaluatorScores`, `Tokens`, `Duration`, `Error` (plus `Justifications` when `--verbose`).
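For offline triage, `--export-format` writes the same rows to a file — a sketch:

```bash
# Writes eval-results-{timestamp}.csv for spreadsheet review.
uip agent eval run results --set "Default Evaluation Set" \
  --path ./my-agent --export-format csv --output json
```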
### Filtering results with `--output-filter`

`--output-filter` takes a JMESPath expression and applies it to the JSON payload before printing. Useful for triage:

```bash
# Print only test cases with a specific name
uip agent eval run results --set "Default Evaluation Set" --path ./my-agent \
  --output json --output-filter 'Data.Results[?TestCase==`"greeting-test"`]'

# Print only the score field for each test case
uip agent eval run results --set "Default Evaluation Set" --path ./my-agent \
  --output json --output-filter 'Data.Results[*].{name: TestCase, score: Score}'
```

### Failure detection

`--only-failed` filters to test cases where any of these are true (`isFailedRun()` in the CLI):

1. `status === "failed"` (or numeric `"3"`)
2. `errorMessage` is non-null
3. `result.score.type === "error"` (or numeric `"2"`)
4. Any `assertionRuns[*].result.score.type === "error"` (or numeric `"2"`)
5. Any `assertionRuns[*].result.score.value === false` (exact-match evaluators that returned a false boolean)

Status enum values from the SDK: `0 = pending`, `1 = running`, `2 = completed`, `3 = failed`. The CLI normalizes string and numeric forms.

## List Past Runs

```bash
uip agent eval run list --set "<set-name-or-id>" --path <project-dir> --output json
```

**Per-row output:** `EvalSetRunId`, `Status`, `Score`, `TestCases`, `Duration`, `EvaluatorScores`, `CreatedAt`.

## Compare Runs

Compare two eval runs side by side to see score changes:

```bash
uip agent eval run compare \
  --compare-to <baseline-run-id> \
  --set "<set-name-or-id>" \
  --path <project-dir> \
  --output json
```

**Output:**
```json
{
  "Code": "AgentEvalRunComparison",
  "Data": {
    "RunA": { "Id": "...", "Score": 0.86, "Status": "completed" },
    "RunB": { "Id": "...", "Score": 0.80, "Status": "completed" },
    "ScoreDelta": 0.06,
    "TestCases": [
      {
        "TestCase": "happy-path",
        "ScoreA": 1.0,
        "ScoreB": 0.9,
        "Delta": "+0.1",
        "StatusA": "completed",
        "StatusB": "completed"
      }
    ]
  }
}
```

Use `compare` after prompt changes to verify improvements without regressions.

## Workflow Example

```bash
# 1. Add test cases
uip agent eval add greeting-test \
  --set "Default Evaluation Set" \
  --inputs '{"input":"hi there"}' \
  --expected '{"content":"Hello! How can I help you?"}' \
  --expected-agent-behavior "Agent should respond with a friendly greeting" \
  --path ./my-agent --output json

# 2. Validate (catches schema drift, missing evaluator refs, broken eval JSON)
uip agent validate --path ./my-agent --output json

# 3. Push agent to Studio Web (required before running evals)
uip agent push --path ./my-agent --output json

# 4. Run and wait
uip agent eval run start \
  --set "Default Evaluation Set" \
  --path ./my-agent \
  --wait --timeout 600 --output json

# 5. Review failures
uip agent eval run results \
  --set "Default Evaluation Set" \
  --only-failed --verbose \
  --path ./my-agent --output json

# 6. Make changes, validate, push, re-run, compare
uip agent validate --path ./my-agent --output json
uip agent push --path ./my-agent --output json
uip agent eval run start --set "Default Evaluation Set" --path ./my-agent --wait --output json
uip agent eval run compare --compare-to <previous-run-id> \
  --set "Default Evaluation Set" --path ./my-agent --output json
```

## Anti-patterns

- **Don't run `eval run start` without `uip agent push` first.** The Agent Runtime executes against the pushed agent, not local files. Local edits made after the last push will not affect results.
+- **Don't assume `--timeout` cancels the server-side run.** It only stops the local CLI from blocking. The run continues and can be inspected with `run status`. +- **Don't skip `uip agent validate` between edits and push.** Validate catches eval-set / evaluator drift that push will accept silently and the runtime will reject. +- **Don't compare runs from different eval sets.** `compare` aligns by test case `name` within the eval set; cross-set deltas are meaningless. +- **Don't rely on `Score` alone — inspect `EvaluatorScores`.** A 0.86 aggregate can mask a faithful-but-wrong agent (high semantic, low trajectory). Use `--verbose` to read justifications when scores look surprising. +- **Don't mix score scales across evaluators in the same eval set.** Defaults written by `uip agent init` use 0–100 prompts; defaults written by `evaluator add` use 0–1 prompts. The runtime DTO normalizes to 0–100, but mixed-scale prompts produce confusing per-evaluator scores. Decide on one scale per eval set and edit prompts to match. diff --git a/skills/uipath-agents/references/lowcode/lowcode.md b/skills/uipath-agents/references/lowcode/lowcode.md index a60995399..d11133c7d 100644 --- a/skills/uipath-agents/references/lowcode/lowcode.md +++ b/skills/uipath-agents/references/lowcode/lowcode.md @@ -46,7 +46,7 @@ Capabilities are **orthogonal**: there is no ordering requirement among them. Ad | Scaffolding, validating, or running solution lifecycle commands | [project-lifecycle.md](project-lifecycle.md) | | Editing `agent.json` (prompts, schemas, model, contentTokens) or `entry-points.json` | [agent-definition.md](agent-definition.md) | | External tools / IS tools / index contexts / escalations behave unexpectedly after `uip solution resource refresh` | [solution-resources.md](solution-resources.md) | -| Running evaluations, adding test cases, managing evaluators | [evaluation/evaluate.md](evaluation/evaluate.md) | +| Running evaluations, adding test cases, managing evaluators | [evaluations/evaluate.md](evaluations/evaluate.md) | ### Capability Registry @@ -69,7 +69,6 @@ Capabilities are **orthogonal**: there is no ordering requirement among them. Ad | Add an Action Center escalation (HITL) | [capabilities/escalation/escalation.md](capabilities/escalation/escalation.md) | | | Add guardrails (PII, harmful content, custom rules) | [capabilities/guardrails/guardrails.md](capabilities/guardrails/guardrails.md) | | | Embed an agent inline in a flow | [capabilities/inline-in-flow/inline-in-flow.md](capabilities/inline-in-flow/inline-in-flow.md) | | -| Evaluate agent (add test cases, run evals, view results) | [evaluation/evaluate.md](evaluation/evaluate.md) | `evaluation/evaluators.md`, `evaluation/evaluation-sets.md`, `evaluation/running-evaluations.md` | | Set up Orchestrator resources | Tell the user to use the `uipath-platform` skill | | | Wire agent into a flow | Tell the user to use the `uipath-maestro-flow` skill | | From 79268a548396931159e4e40fc40c9044ede14b02 Mon Sep 17 00:00:00 2001 From: Mayank Jha Date: Mon, 4 May 2026 18:33:57 -0700 Subject: [PATCH 3/5] fix(uipath-agents): drop context-precision and faithfulness from low-code eval docs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Remove context-precision and faithfulness from the low-code evaluator surface entirely. Updates: - evaluators.md: drop both rows from the CLI-exposed table, the --type description, the type/category mapping, and the default-prompts table. 
Narrow the validate matrix's cat 1 to type {5} only. Update the "Why fewer" intro to reflect 2 supported CLI types.
- evaluation-sets.md: remove the trace-driven data-flow rows for both evaluators, the explanatory callout about RETRIEVER spans, and the related anti-patterns. Test-case design now covers only semantic-similarity + trajectory.
- evaluate.md: narrow the "Unknown evaluator type" troubleshooting hint.

Coded eval refs are unchanged — those use uipath-llm-judge-* IDs, not the legacy CLI names.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../lowcode/evaluations/evaluate.md        |  2 +-
 .../lowcode/evaluations/evaluation-sets.md | 12 ----------
 .../lowcode/evaluations/evaluators.md      | 24 +++++++-------------
 3 files changed, 9 insertions(+), 29 deletions(-)

diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluate.md b/skills/uipath-agents/references/lowcode/evaluations/evaluate.md
index 632af61fc..e77c339c0 100644
--- a/skills/uipath-agents/references/lowcode/evaluations/evaluate.md
+++ b/skills/uipath-agents/references/lowcode/evaluations/evaluate.md
@@ -71,7 +71,7 @@ CLI-added evaluators are written as `evaluator-<uuid8>.json` (first 8 hex chars
 | Solution ID could not be resolved | Agent not pushed to Studio Web | Run `uip agent push --output json`, or pass `--solution-id <id>` explicitly to `uip agent eval run start` |
 | `No evaluators found` | Empty `evals/evaluators/` directory | Run `uip agent eval evaluator add` or re-init with `uip agent init` |
 | `No test cases in eval set` | Eval set has no evaluations | Run `uip agent eval add` to add test cases |
-| `Unknown evaluator type "X"` | Wrong case on `--type` value | Use kebab-case only: `semantic-similarity`, `trajectory`, `context-precision`, `faithfulness` |
+| `Unknown evaluator type "X"` | Wrong case on `--type` value | Use kebab-case only: `semantic-similarity`, `trajectory` |
 | `Evaluator '<name>' is an LLM-based evaluator but 'model' is not set in its evaluatorConfig.` | LLM evaluator JSON has empty/missing `model` and is not `same-as-agent` | Set `"model"` in the evaluator JSON to a valid model (e.g. `claude-haiku-4-5-20251001`), or set it to `"same-as-agent"` and ensure `agent.json` has a model |
 | `'same-as-agent' model option requires agent settings.
Ensure agent.json contains valid model settings.` | Evaluator uses `"model": "same-as-agent"` but `agent.json` has no resolvable model | Set a model in `agent.json`, or override the evaluator with an explicit model | | `401 Unauthorized` | Auth expired | Run `uip login --output json` | diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md b/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md index 60a751f96..528bc0868 100644 --- a/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md +++ b/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md @@ -86,16 +86,6 @@ The `--inputs` and `--expected` flags populate `inputs` and `expectedOutput` on |----------------|---------------|----------------------| | `semantic-similarity` | `expectedOutput` → `{{ExpectedOutput}}` | Agent output → `{{ActualOutput}}` | | `trajectory` | `expectedAgentBehavior` → `{{ExpectedAgentBehavior}}`, `inputs` → `{{UserOrSyntheticInput}}`, `simulationInstructions` → `{{SimulationInstructions}}` | Trace → `{{AgentRunHistory}}` | -| `context-precision` | (none directly used) | RETRIEVER spans `input.value` → `{{UserQuery}}`, `output.value.documents` → `{{RetrievedContext}}` | -| `faithfulness` | `expectedOutput` → `{{AgentOutput}}` (note: it is the *expected* output that is treated as the candidate text to fact-check, not the agent's actual output) | Trace span outputs (RETRIEVER + tool calls) → `{{Context}}` | - -`context-precision` and `faithfulness` are **trace-driven evaluators**. They extract `{{UserQuery}}`, `{{RetrievedContext}}`, and `{{Context}}` by walking `openinference.span.kind == "RETRIEVER"` (and other tool spans) on the agent's run trace. Their behavior: - -- **The agent must perform retrieval** (Context Grounding / index / DataFabric / a tool that emits an OpenInference RETRIEVER span). Without retrieval spans, the placeholders resolve to empty and scores collapse. -- **`--inputs` and `--expected` are not consumed in the obvious way**: `context-precision` ignores test-case `inputs` (it reads the query from the trace); `faithfulness` reads the *expected* output (not the agent's actual output) as the candidate text. -- **CLI-default placeholders may differ from SDK-internal placeholders.** The CLI writes prompts with `{{UserQuery}}` and `{{RetrievedContext}}` for context-precision, but the SDK's legacy evaluator hardcodes `{{Query}}` and `{{Chunks}}` internally. Inspect the resulting evaluator JSON; if you customize the prompt, match the placeholders the runtime actually substitutes (test with a small run before relying on results). - -If the agent has no retrieval step, remove `context-precision` and `faithfulness` from the eval set rather than letting them silently score everything as 0. For trajectory evaluation, write `--expected-agent-behavior` as a natural language description of what the agent should do, not what it should output: @@ -157,7 +147,5 @@ The `source` field indicates how the test case was created. CLI-added test cases - **Don't hand-write `evalSetId` or test case `id` UUIDs.** Use `uip agent eval add` so the CLI keeps `evaluations[].evalSetId` consistent with the parent eval set's `id`. - **Don't add `--inputs` keys that are not in `entry-points.json`.** The runtime will reject the test case at execution time. Run `uip agent validate` to catch this before push. 
-- **Don't add `context-precision` or `faithfulness` evaluators to an eval set whose agent has no RETRIEVER span.** Both extract their placeholders from agent trace spans, not from `inputs`/`expectedOutput`. No retrieval → scores collapse to 0. -- **Don't expect `faithfulness` to read the agent's actual output.** It reads `expectedOutput` (the criteria field) as the candidate text. To fact-check actual agent output, use `semantic-similarity` against an expected ground truth instead. - **Don't set `--expected '{}'` (empty) and `--expected-agent-behavior ""` together.** The semantic-similarity evaluator scores against an empty `{{ExpectedOutput}}`; the trajectory evaluator scores against an empty `{{ExpectedAgentBehavior}}`. Every run scores low for non-actionable reasons. - **Don't set the `source` field manually.** Owned by CLI and Studio Web; hand-edits may be overwritten on the next sync. diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluators.md b/skills/uipath-agents/references/lowcode/evaluations/evaluators.md index 89d8c0d61..2c35f0180 100644 --- a/skills/uipath-agents/references/lowcode/evaluations/evaluators.md +++ b/skills/uipath-agents/references/lowcode/evaluations/evaluators.md @@ -4,14 +4,14 @@ Evaluators define how agent output is scored. Each evaluator is a JSON file in ` ## Why fewer evaluators than coded? -The coded eval reference ([coded/lifecycle/evaluations/evaluators.md](../../coded/lifecycle/evaluations/evaluators.md)) lists 13 evaluator types. Low-code lists only 4 because the two surfaces use **different engines** in the SDK: +The coded eval reference ([coded/lifecycle/evaluations/evaluators.md](../../coded/lifecycle/evaluations/evaluators.md)) lists 13 evaluator types. Low-code lists only 2 supported types because the two surfaces use **different engines** in the SDK: - **Coded** uses the new evaluator hierarchy (`BaseEvaluator`, eval sets carry `version: "1.0"`). 13 distinct `evaluatorTypeId` strings, each with its own implementation class. -- **Low-code** uses the **legacy** evaluator hierarchy (`BaseLegacyEvaluator`, no `version` field on the eval set). Only 4 implementation classes ship: `LegacyLlmAsAJudgeEvaluator`, `LegacyTrajectoryEvaluator`, `LegacyExactMatchEvaluator`, `LegacyJsonSimilarityEvaluator`. +- **Low-code** uses the **legacy** evaluator hierarchy (`BaseLegacyEvaluator`, no `version` field on the eval set). Only `LegacyLlmAsAJudgeEvaluator` (semantic-similarity), `LegacyTrajectoryEvaluator`, `LegacyExactMatchEvaluator`, and `LegacyJsonSimilarityEvaluator` are supported for low-code agents. -Most coded evaluator types (`contains`, `binary-classification`, `multiclass-classification`, all four `tool-call-*`, `llm-judge-output-strict-json-similarity`, `llm-judge-trajectory-simulation`) **have no legacy counterpart** and cannot be used on a low-code agent — the cloud eval worker will not load them. +Most coded evaluator types (`contains`, `binary-classification`, `multiclass-classification`, all four `tool-call-*`, `llm-judge-output-strict-json-similarity`, `llm-judge-trajectory-simulation`) **have no legacy counterpart** and cannot be used on a low-code agent. -The CLI also narrows the runtime surface further: of the 6 legacy `type` values the runtime accepts, the `--type` flag exposes only 4. See § Runtime-supported types not exposed by the CLI below. +The CLI exposes only `semantic-similarity` and `trajectory` via `--type`. 
`Equals` and `JsonSimilarity` are accepted by the runtime but require hand-written JSON — see § Runtime-supported types not exposed by the CLI below.

## Evaluator Types (CLI-exposed)

@@ -19,8 +19,6 @@ The CLI also narrows the runtime surface further: of the 6 legacy `type` values
 |------|----------|----------------|
 | Semantic Similarity | `semantic-similarity` | Whether the agent's output has the same meaning as the expected output |
 | Trajectory | `trajectory` | Whether the agent's reasoning path and tool usage match expected behavior |
-| Context Precision | `context-precision` | Whether retrieved context is relevant to the user's query |
-| Faithfulness | `faithfulness` | Whether the agent's output is grounded in the provided context |

 ## Runtime-supported types not exposed by the CLI

@@ -84,7 +82,7 @@ uip agent eval evaluator add <name> --type <type> --path <project-dir> --output js

 | Flag | Required | Description | Default |
 |------|----------|-------------|---------|
-| `--type <type>` | Yes | One of: `semantic-similarity`, `trajectory`, `context-precision`, `faithfulness` | — |
+| `--type <type>` | Yes | One of: `semantic-similarity`, `trajectory` | — |
 | `--description <text>` | No | Human-readable description | Auto-generated from type |
 | `--prompt <prompt>` | No | Custom LLM evaluation prompt | Built-in default per type |
 | `--target-key <key>` | No | Specific output key to evaluate | `*` (all keys) |
@@ -164,8 +162,6 @@ The two `evaluator-default*.json` files are written by `uip agent init`, not by
 |----------|-------------------|------------|
 | `semantic-similarity` | 5 | 1 (output-based) |
 | `trajectory` | 7 | 3 (trajectory-based) |
-| `context-precision` | 8 | 1 (output-based) |
-| `faithfulness` | 9 | 1 (output-based) |

 ## Default Prompts and Template Variables

@@ -175,16 +171,12 @@ The prompt and score scale the CLI writes when you run `evaluator add` differ f
 |------|-------------------------|--------------------------|
 | `semantic-similarity` | Asks 0–1; uses `{{ExpectedOutput}}`, `{{ActualOutput}}` | Asks 0–100; same placeholders |
 | `trajectory` | Asks 0–1; uses `{{AgentRunHistory}}`, `{{ExpectedBehavior}}` | Asks 0–100; uses `{{UserOrSyntheticInput}}`, `{{SimulationInstructions}}`, `{{ExpectedAgentBehavior}}`, `{{AgentRunHistory}}` |
-| `context-precision` | Asks 0–1; uses `{{UserQuery}}`, `{{RetrievedContext}}` | Not created by `init` |
-| `faithfulness` | Asks 0–1; uses `{{AgentOutput}}`, `{{Context}}` | Not created by `init` |

 Two notable inconsistencies:

 1. **Trajectory placeholder names**: `{{ExpectedBehavior}}` (CLI add) vs `{{ExpectedAgentBehavior}}` (init default). When editing a prompt, use the placeholders already present in that file — do not mix.
 2. **Score scales**: `evaluator add` writes 0–1 prompts; `init` writes 0–100 prompts. The runtime normalizes both to 0–100 in the result DTO, but the LLM judge actually returns whatever the prompt asks for. Mixed-scale eval sets are hard to read; pick one and rewrite the prompts you don't want.

-For `context-precision` and `faithfulness`, the SDK's legacy evaluator may use its own internal placeholders (`{{Query}}`, `{{Chunks}}`) that differ from what the CLI writes. Inspect the resulting evaluator JSON and run a small test before relying on customized prompts. See [evaluation-sets.md](evaluation-sets.md) § Matching evaluator to test case fields for the data flow.
-
 ## Custom Prompts

 Pass `--prompt` to override the default. Use only the placeholders listed above for the chosen `--type`; unknown placeholders are passed through to the LLM as literal text.
@@ -207,7 +199,7 @@ Validate runs schema migration, which enforces the following on every file in `e
 | Category | Name | Allowed `type` | Additional requirements |
 |----------|------|----------------|-------------------------|
 | `0` | Deterministic | `1`, `6` | — |
-| `1` | LlmAsAJudge | `5`, `8`, `9` | `prompt` and `model` required |
+| `1` | LlmAsAJudge | `5` | `prompt` and `model` required |
 | `3` | Trajectory | `7` | `prompt` and `model` required |

 Category `2` (`AgentScorer`) exists in the SDK enum but is reserved/unused — do not write it manually.
@@ -234,9 +226,9 @@ to find any LLM evaluator without an explicit model. (Switch to `--output-filter
 ## Anti-patterns

 - **Don't reference an evaluator by filename.** Eval sets reference evaluators by UUID (`id`).
-- **Don't pass `--type` in PascalCase.** Only `semantic-similarity`, `trajectory`, `context-precision`, `faithfulness` are accepted.
+- **Don't pass `--type` in PascalCase.** Only `semantic-similarity` and `trajectory` are accepted.
 - **Don't assume `evaluator add` mirrors `init`'s prompts.** They differ for trajectory; check the resulting JSON before reusing template variables in your own scoring tooling.
 - **Don't delete an evaluator file by hand.** Use `uip agent eval evaluator remove` so `evaluatorRefs` in eval sets are cleaned up automatically.
 - **Don't copy evaluator JSON across projects without regenerating UUIDs.** `id` collisions silently corrupt cross-project resolution.
 - **Don't try to add a coded-only evaluator type to a low-code agent.** Anything starting with `uipath-tool-call-*`, `uipath-binary-classification`, `uipath-multiclass-classification`, `uipath-contains`, `uipath-llm-judge-output-strict-json-similarity`, or `uipath-llm-judge-trajectory-simulation` has no legacy class and the eval worker will not load it. If you need one of these, the agent must be coded, not low-code.
-- **Don't hand-write a category/type combination outside the validate matrix.** Validate accepts cat 0 → types {1, 6}, cat 1 → types {5, 8, 9}, cat 3 → type {7}. Anything else fails schema migration.
+- **Don't hand-write a category/type combination outside the validate matrix.** Validate accepts cat 0 → types {1, 6}, cat 1 → type {5}, cat 3 → type {7}. Anything else fails schema migration.

From 6cb0dac4092f53f31f4ff7c10a30e83ed63069b4 Mon Sep 17 00:00:00 2001
From: Mayank Jha
Date: Mon, 4 May 2026 18:44:54 -0700
Subject: [PATCH 4/5] docs(uipath-agents): add realistic legacy evaluator JSON examples

Replace the synthetic skeletons in "Runtime-supported types not exposed by the CLI" with the canonical shapes used in real low-code agent projects:

- Equals (type 1) and JsonSimilarity (type 6) keep their Deterministic-category shape (no prompt/model needed) but now use realistic descriptions and filenames.
- Add explicit LlmAsAJudge (type 5) and Trajectory (type 7) JSON shapes for hand-written use, including the full prompt strings, an explicit model pin, and the descriptions used in production examples.
- Soften the filename rule: CLI-generated evaluators use evaluator-<uuid8>.json, but hand-written files can use any descriptive name. The runtime keys off id / evaluatorRefs.
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../lowcode/evaluations/evaluators.md | 54 ++++++++++++++++---
 1 file changed, 47 insertions(+), 7 deletions(-)

diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluators.md b/skills/uipath-agents/references/lowcode/evaluations/evaluators.md
index 2c35f0180..81dd6bdf1 100644
--- a/skills/uipath-agents/references/lowcode/evaluations/evaluators.md
+++ b/skills/uipath-agents/references/lowcode/evaluations/evaluators.md
@@ -24,16 +24,18 @@ The CLI exposes only `semantic-similarity` and `trajectory` via `--type`. `Equal

 The eval worker's discriminator (`uipath/eval/evaluators/evaluator.py` § `legacy_evaluator_discriminator`) accepts two more `type` values that have no `--type` flag. To use them, hand-write the evaluator JSON in `evals/evaluators/<filename>.json`:

+For hand-written files, the filename can be any descriptive name (e.g. `legacy-equality.json`) — the runtime keys off `id` / `evaluatorRefs`, not the filename. The CLI-generated `evaluator-<uuid8>.json` pattern only applies to evaluators created via `uip agent eval evaluator add`.
+
 ### `Equals` (type 1, category 0 — Deterministic)

 Exact-match comparison; no LLM. Equivalent of coded `uipath-exact-match`.

 ```json
 {
-  "fileName": "evaluator-equals.json",
+  "fileName": "legacy-equality.json",
   "id": "<uuid>",
-  "name": "exact-match",
-  "description": "Exact-match evaluator",
+  "name": "Equality Evaluator",
+  "description": "An evaluator that judges the agent based on expected output.",
   "category": 0,
   "type": 1,
   "targetOutputKey": "*",
   "createdAt": "<timestamp>",
   "updatedAt": "<timestamp>"
 }
 ```

@@ -50,10 +52,10 @@ Tree-based JSON comparison; no LLM. Equivalent of coded `uipath-json-similarity`

 ```json
 {
-  "fileName": "evaluator-json-sim.json",
+  "fileName": "legacy-json-similarity.json",
   "id": "<uuid>",
-  "name": "json-similarity",
-  "description": "JSON similarity evaluator",
+  "name": "JSON Similarity Evaluator",
+  "description": "An evaluator that compares JSON structures with tolerance for numeric and string differences.",
   "category": 0,
   "type": 6,
   "targetOutputKey": "*",
   "createdAt": "<timestamp>",
   "updatedAt": "<timestamp>"
 }
 ```

-After hand-writing, run `uip agent validate --output json` to confirm the file passes schema migration. Then reference the new evaluator's `id` from your eval set's `evaluatorRefs`. Watch for: `id` collisions with existing evaluators, missing required fields, and ISO-8601 formatting on the timestamps.
+### `LlmAsAJudge` semantic-similarity (type 5, category 1) — explicit shape
+
+If you want to hand-write a semantic-similarity evaluator instead of using `evaluator add` (e.g. to pin a specific model and prompt), the full shape is:
+
+```json
+{
+  "fileName": "legacy-llm-as-a-judge.json",
+  "id": "<uuid>",
+  "name": "LLM As A Judge Evaluator",
+  "description": "An evaluator that uses an LLM to judge the similarity of the actual output to the expected output",
+  "category": 1,
+  "type": 5,
+  "prompt": "As an expert evaluator, analyze the semantic similarity of these outputs to determine a score from 0-100.\n----\nExpectedOutput:\n{{ExpectedOutput}}\n----\nActualOutput:\n{{ActualOutput}}\n",
+  "targetOutputKey": "*",
+  "model": "gpt-4.1-2025-04-14",
+  "createdAt": "<timestamp>",
+  "updatedAt": "<timestamp>"
+}
+```
+
+### `Trajectory` (type 7, category 3) — explicit shape
+
+```json
+{
+  "fileName": "legacy-trajectory.json",
+  "id": "<uuid>",
+  "name": "Trajectory Evaluator",
+  "description": "An evaluator that analyzes the execution trajectory and decision sequence taken by the agent.",
+  "category": 3,
+  "type": 7,
+  "prompt": "Evaluate the agent's execution trajectory based on the expected behavior.\n\nExpected Agent Behavior: {{ExpectedAgentBehavior}}\nAgent Run History: {{AgentRunHistory}}\n\nProvide a score from 0-100 based on how well the agent followed the expected trajectory.",
+  "model": "gpt-4.1-2025-04-14",
+  "targetOutputKey": "*",
+  "createdAt": "<timestamp>",
+  "updatedAt": "<timestamp>"
+}
+```
+
+After hand-writing any evaluator, run `uip agent validate --output json` to confirm the file passes schema migration. Then reference the new evaluator's `id` from your eval set's `evaluatorRefs`. Watch for: `id` collisions with existing evaluators, missing required fields, and ISO-8601 formatting on the timestamps.

 ## Coded-only evaluators (NOT available on low-code)

From 5fb59e7a1da09af53b9c8f5e4fa3a7228d868d26 Mon Sep 17 00:00:00 2001
From: Mayank Jha
Date: Tue, 5 May 2026 18:09:05 -0700
Subject: [PATCH 5/5] docs(uipath-agents): reframe low-code evaluators as 4 first-class types
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Studio Web UI exposes 4 evaluator types (Semantic Similarity, Trajectory, Exact match, JSON similarity). Verified by counting evaluator JSON files across multiple production examples — only types 1, 5, 6, 7 appear; nothing else does.

Previous framing called Exact match and JSON similarity "runtime-supported types not exposed by the CLI", which understated their status. Both are real first-class options; the only narrowing surface is the CLI's --type flag (which covers 2 of 4).

evaluators.md changes:
- New "Supported Evaluator Types" section with a 4-row table mapping UI label, type/category, --type flag (where applicable), what it scores, and whether it is LLM-based.
- New subsection "How to add each type" calling out the three creation paths (UI, CLI, hand-write JSON).
- Renamed the "Why fewer than coded?" section into a subsection of the Supported Types group; updated wording to reflect 4 supported types.
- Renamed "Runtime-supported types not exposed by the CLI" to "JSON Shapes" and reordered the four shapes to match the table order (Exact match, JSON similarity, LLM-as-a-judge, Trajectory).

evaluation-sets.md changes:
- Added Exact match and JSON similarity rows to the field-mapping table so all 4 supported types are covered.
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../lowcode/evaluations/evaluation-sets.md | 10 ++--
 .../lowcode/evaluations/evaluators.md      | 46 +++++++++++--------
 2 files changed, 32 insertions(+), 24 deletions(-)

diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md b/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md
index 528bc0868..acda99f05 100644
--- a/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md
+++ b/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md
@@ -82,10 +82,12 @@ The `--inputs` and `--expected` flags populate `inputs` and `expectedOutput` on

-| Evaluator Type | From test case | From agent run trace |
-|----------------|---------------|----------------------|
-| `semantic-similarity` | `expectedOutput` → `{{ExpectedOutput}}` | Agent output → `{{ActualOutput}}` |
-| `trajectory` | `expectedAgentBehavior` → `{{ExpectedAgentBehavior}}`, `inputs` → `{{UserOrSyntheticInput}}`, `simulationInstructions` → `{{SimulationInstructions}}` | Trace → `{{AgentRunHistory}}` |
+| Evaluator Type | From test case | From agent run |
+|----------------|---------------|----------------|
+| Semantic Similarity (type 5) | `expectedOutput` → `{{ExpectedOutput}}` | Agent output → `{{ActualOutput}}` |
+| Trajectory (type 7) | `expectedAgentBehavior` → `{{ExpectedAgentBehavior}}`, `inputs` → `{{UserOrSyntheticInput}}`, `simulationInstructions` → `{{SimulationInstructions}}` | Trace → `{{AgentRunHistory}}` |
+| Exact match (type 1) | `expectedOutput` (compared verbatim, no placeholders) | Agent output (compared verbatim) |
+| JSON similarity (type 6) | `expectedOutput` (tree-compared, no placeholders) | Agent output (tree-compared) |

 For trajectory evaluation, write `--expected-agent-behavior` as a natural language description of what the agent should do, not what it should output:

diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluators.md b/skills/uipath-agents/references/lowcode/evaluations/evaluators.md
index 81dd6bdf1..c3f4adfae 100644
--- a/skills/uipath-agents/references/lowcode/evaluations/evaluators.md
+++ b/skills/uipath-agents/references/lowcode/evaluations/evaluators.md
@@ -2,33 +2,39 @@ Evaluators define how agent output is scored. Each evaluator is a JSON file in `evals/evaluators/`.

-## Why fewer evaluators than coded?
+## Supported Evaluator Types

-The coded eval reference ([coded/lifecycle/evaluations/evaluators.md](../../coded/lifecycle/evaluations/evaluators.md)) lists 13 evaluator types. Low-code lists only 2 supported types because the two surfaces use **different engines** in the SDK:
+Low-code agents support exactly four evaluator types. All four are first-class options in the Studio Web "Add evaluator" dialog. Two also have CLI-flag shortcuts; the other two are created via the UI or by hand-writing JSON in `evals/evaluators/`.

+| UI label | `type` | `category` | `--type` flag | What it scores | LLM-based |
+|----------|--------|-----------|---------------|----------------|-----------|
+| LLM-as-a-judge: Semantic Similarity | 5 | 1 (LlmAsAJudge) | `semantic-similarity` | Whether the agent's output has the same meaning as the expected output | Yes |
+| Trajectory | 7 | 3 (Trajectory) | `trajectory` | Whether the agent's reasoning path and tool usage match expected behavior | Yes |
+| Exact match | 1 | 0 (Deterministic) | — | Whether the output precisely matches the expected output without variations in wording or formatting | No |
+| JSON similarity | 6 | 0 (Deterministic) | — | Whether two JSON structures or values are "close enough" or share similar structure/contents | No |
+
+How to add each type:

-Most coded evaluator types (`contains`, `binary-classification`, `multiclass-classification`, all four `tool-call-*`, `llm-judge-output-strict-json-similarity`, `llm-judge-trajectory-simulation`) **have no legacy counterpart** and cannot be used on a low-code agent.
+- **Studio Web UI** — Evaluators tab → **Create New** → Add evaluator dialog → pick any of the four. UI is the canonical surface and supports all four with no special steps.
+- **CLI** — `uip agent eval evaluator add <name> --type <type>` for `semantic-similarity` or `trajectory`. The CLI does not have a `--type` value for Exact match or JSON similarity; create those in the UI or hand-write the JSON.
+- **Hand-write JSON** — drop a file in `evals/evaluators/` matching the schema below; run `uip agent validate --output json`; reference the new `id` from your eval set's `evaluatorRefs`. Useful when you want to pin a specific model and prompt for the LLM-based types, or when you're scaffolding eval files programmatically.

-The CLI exposes only `semantic-similarity` and `trajectory` via `--type`. `Equals` and `JsonSimilarity` are accepted by the runtime but require hand-written JSON — see § Runtime-supported types not exposed by the CLI below.
+### Why fewer than coded?

-## Evaluator Types (CLI-exposed)
+The coded eval reference ([coded/lifecycle/evaluations/evaluators.md](../../coded/lifecycle/evaluations/evaluators.md)) lists 13 evaluator types. Low-code supports only these four because the two surfaces use **different engines** in the SDK:

-| Type | CLI Flag | What It Scores |
-|------|----------|----------------|
-| Semantic Similarity | `semantic-similarity` | Whether the agent's output has the same meaning as the expected output |
-| Trajectory | `trajectory` | Whether the agent's reasoning path and tool usage match expected behavior |
+- **Coded** uses the new evaluator hierarchy (`BaseEvaluator`, eval sets carry `version: "1.0"`). 13 distinct `evaluatorTypeId` strings, each with its own implementation class.
+- **Low-code** uses the **legacy** evaluator hierarchy (`BaseLegacyEvaluator`, no `version` field on the eval set).
Only `LegacyLlmAsAJudgeEvaluator` (semantic-similarity), `LegacyTrajectoryEvaluator`, `LegacyExactMatchEvaluator`, and `LegacyJsonSimilarityEvaluator` are supported for low-code agents. +| UI label | `type` | `category` | `--type` flag | What it scores | LLM-based | +|----------|--------|-----------|---------------|----------------|-----------| +| LLM-as-a-judge: Semantic Similarity | 5 | 1 (LlmAsAJudge) | `semantic-similarity` | Whether the agent's output has the same meaning as the expected output | Yes | +| Trajectory | 7 | 3 (Trajectory) | `trajectory` | Whether the agent's reasoning path and tool usage match expected behavior | Yes | +| Exact match | 1 | 0 (Deterministic) | — | Whether the output precisely matches the expected output without variations in wording or formatting | No | +| JSON similarity | 6 | 0 (Deterministic) | — | Whether two JSON structures or values are "close enough" or share similar structure/contents | No | + +How to add each type: -Most coded evaluator types (`contains`, `binary-classification`, `multiclass-classification`, all four `tool-call-*`, `llm-judge-output-strict-json-similarity`, `llm-judge-trajectory-simulation`) **have no legacy counterpart** and cannot be used on a low-code agent. +- **Studio Web UI** — Evaluators tab → **Create New** → Add evaluator dialog → pick any of the four. UI is the canonical surface and supports all four with no special steps. +- **CLI** — `uip agent eval evaluator add --type ` for `semantic-similarity` or `trajectory`. The CLI does not have a `--type` value for Exact match or JSON similarity; create those in the UI or hand-write the JSON. +- **Hand-write JSON** — drop a file in `evals/evaluators/` matching the schema below; run `uip agent validate --output json`; reference the new `id` from your eval set's `evaluatorRefs`. Useful when you want to pin a specific model and prompt for the LLM-based types, or when you're scaffolding eval files programmatically. -The CLI exposes only `semantic-similarity` and `trajectory` via `--type`. `Equals` and `JsonSimilarity` are accepted by the runtime but require hand-written JSON — see § Runtime-supported types not exposed by the CLI below. +### Why fewer than coded? -## Evaluator Types (CLI-exposed) +The coded eval reference ([coded/lifecycle/evaluations/evaluators.md](../../coded/lifecycle/evaluations/evaluators.md)) lists 13 evaluator types. Low-code supports only these four because the two surfaces use **different engines** in the SDK: -| Type | CLI Flag | What It Scores | -|------|----------|----------------| -| Semantic Similarity | `semantic-similarity` | Whether the agent's output has the same meaning as the expected output | -| Trajectory | `trajectory` | Whether the agent's reasoning path and tool usage match expected behavior | +- **Coded** uses the new evaluator hierarchy (`BaseEvaluator`, eval sets carry `version: "1.0"`). 13 distinct `evaluatorTypeId` strings, each with its own implementation class. +- **Low-code** uses the **legacy** evaluator hierarchy (`BaseLegacyEvaluator`, no `version` field on the eval set). The four legacy classes shipped — `LegacyLlmAsAJudgeEvaluator`, `LegacyTrajectoryEvaluator`, `LegacyExactMatchEvaluator`, `LegacyJsonSimilarityEvaluator` — are exactly what the UI exposes. 
-
Most coded evaluator types (`contains`, `binary-classification`, `multiclass-classification`, all four `tool-call-*`, `llm-judge-output-strict-json-similarity`, `llm-judge-trajectory-simulation`) have no legacy counterpart and cannot be used on a low-code agent.
+Most coded evaluator types (`contains`, `binary-classification`, `multiclass-classification`, all four `tool-call-*`, `llm-judge-output-strict-json-similarity`, `llm-judge-trajectory-simulation`) have no legacy counterpart and cannot be used on a low-code agent.

-The CLI exposes only `semantic-similarity` and `trajectory` via `--type`. `Equals` and `JsonSimilarity` are accepted by the runtime but require hand-written JSON — see § Runtime-supported types not exposed by the CLI below.
+## JSON Shapes

-## Runtime-supported types not exposed by the CLI

-The eval worker's discriminator (`uipath/eval/evaluators/evaluator.py` § `legacy_evaluator_discriminator`) accepts two more `type` values that have no `--type` flag. To use them, hand-write the evaluator JSON in `evals/evaluators/<filename>.json`:

 For hand-written files, the filename can be any descriptive name (e.g. `legacy-equality.json`) — the runtime keys off `id` / `evaluatorRefs`, not the filename. The CLI-generated `evaluator-<uuid8>.json` pattern only applies to evaluators created via `uip agent eval evaluator add`.

-### `Equals` (type 1, category 0 — Deterministic)
+### Exact match (`type` 1, `category` 0 — Deterministic)

-Exact-match comparison; no LLM. Equivalent of coded `uipath-exact-match`.
+No LLM. Equivalent of coded `uipath-exact-match`.

@@ -46,9 +52,9 @@ Exact-match comparison; no LLM. Equivalent of coded `uipath-exact-match`.

 No `prompt`/`model` required (Deterministic category bypasses the LLM checks).

-### `JsonSimilarity` (type 6, category 0 — Deterministic)
+### JSON similarity (`type` 6, `category` 0 — Deterministic)

-Tree-based JSON comparison; no LLM. Equivalent of coded `uipath-json-similarity`.
+Tree-based JSON comparison. No LLM. Equivalent of coded `uipath-json-similarity`.

@@ -64,9 +70,9 @@ Tree-based JSON comparison; no LLM. Equivalent of coded `uipath-json-similarity`

-### `LlmAsAJudge` semantic-similarity (type 5, category 1) — explicit shape
+### LLM-as-a-judge: Semantic Similarity (`type` 5, `category` 1 — LlmAsAJudge)

-If you want to hand-write a semantic-similarity evaluator instead of using `evaluator add` (e.g. to pin a specific model and prompt), the full shape is:
+The CLI's `evaluator add --type semantic-similarity` writes a shorter prompt; hand-write the file when you want to pin a specific model and the longer 0–100 prompt:

@@ -84,7 +90,7 @@ If you want to hand-write a semantic-similarity evaluator instead of using `eval

-### `Trajectory` (type 7, category 3) — explicit shape
+### Trajectory (`type` 7, `category` 3 — Trajectory)

 ```json
 {