diff --git a/skills/uipath-agents/SKILL.md b/skills/uipath-agents/SKILL.md
index 7dc4747c3..391fe7346 100644
--- a/skills/uipath-agents/SKILL.md
+++ b/skills/uipath-agents/SKILL.md
@@ -47,7 +47,8 @@ Determine the agent mode before proceeding:
 | Add an Action Center escalation (HITL) to a low-code agent | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) § Capability Registry | `lowcode/capabilities/escalation/escalation.md` |
 | Add guardrails (PII, harmful content, custom rules) to a low-code agent | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) § Capability Registry | `lowcode/capabilities/guardrails/guardrails.md` |
 | Add escalation guardrail (escalate action / Action Center app) | Low-code | [lowcode/capabilities/guardrails/guardrails.md](references/lowcode/capabilities/guardrails/guardrails.md) § escalate — Hand Off to Action Center | Run `uip solution resource list --kind App` to confirm app exists |
-| Embed a low-code agent inline in a flow, or wire a multi-agent solution | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) § Capability Registry | `lowcode/capabilities/inline-in-flow/inline-in-flow.md`, `lowcode/capabilities/process/process.md` |
+| Embed a low-code agent inline in a flow, or wire a multi-agent solution | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) § Capability Registry | `lowcode/capabilities/inline-in-flow/inline-in-flow.md`, `lowcode/capabilities/process/solution-agent.md` |
+| Run low-code evaluations | Low-code | [lowcode/evaluations/evaluate.md](references/lowcode/evaluations/evaluate.md) | `lowcode/evaluations/evaluators.md`, `lowcode/evaluations/evaluation-sets.md`, `lowcode/evaluations/running-evaluations.md` |
 | Validate, pack, publish, upload, or deploy a low-code agent | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) | `lowcode/project-lifecycle.md`, `lowcode/solution-resources.md` |
 
 ## Resources
diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluate.md b/skills/uipath-agents/references/lowcode/evaluations/evaluate.md
new file mode 100644
index 000000000..e77c339c0
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluations/evaluate.md
@@ -0,0 +1,90 @@
+# Evaluate Low-Code Agents
+
+Design and run evaluations against low-code agents using the `uip agent eval` CLI.
+
+## Quick Reference
+
+```bash
+# Add a test case
+uip agent eval add happy-path --set "Default Evaluation Set" --inputs '{"input":"hello"}' --expected '{"content":"greeting"}' --path ./my-agent --output json
+
+# Run evals and wait for results
+uip agent eval run start --set "Default Evaluation Set" --path ./my-agent --wait --output json
+
+# Check results (failures only, with justifications)
+uip agent eval run results --set "Default Evaluation Set" --only-failed --verbose --path ./my-agent --output json
+```
+
+## Prerequisites
+
+- Agent project initialized (`uip agent init <name>`)
+- `entry-points.json` present (defines `input`/`output` schema that test case `--inputs`/`--expected` must conform to)
+- `uip agent validate --output json` passes (validate also checks evals and evaluators)
+- Agent pushed to Studio Web (`uip agent push`) — required for running evals (the Agent Runtime executes test cases in the cloud)
+
+Local operations (managing evaluators, eval sets, test cases) do **not** require authentication or a cloud connection. Only `uip agent eval run *` commands require cloud connectivity.
+
+## Reference Navigation
+
+- [Evaluators](evaluators.md) — evaluator types, adding/removing, default prompts
+- [Evaluation Sets and Test Cases](evaluation-sets.md) — creating sets, adding test cases, simulation options
+- [Running Evaluations](running-evaluations.md) — start, status, results, compare
+
+Read Evaluators before choosing an evaluator type, and Evaluation Sets before writing test cases.
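The prerequisites above can be checked locally before invoking the CLI. A minimal pre-flight sketch (Python; the directory layout is taken from the File Structure section, and the `preflight` helper name is hypothetical):

```python
from pathlib import Path

def preflight(project_dir: str) -> list[str]:
    """Return a list of problems that would block local eval work or an eval run."""
    root = Path(project_dir)
    problems = []
    if not (root / "entry-points.json").is_file():
        problems.append("entry-points.json missing — --inputs/--expected cannot be checked")
    if not (root / "evals" / "evaluators").is_dir():
        problems.append("evals/evaluators/ missing — run `uip agent init` or `uip agent eval evaluator add`")
    if not (root / "evals" / "eval-sets").is_dir():
        problems.append("evals/eval-sets/ missing — run `uip agent eval set add`")
    return problems
```

An empty list means the local layout is in place; cloud prerequisites (`uip agent push`, auth) still apply for `eval run *`.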
+
+## File Structure
+
+After `uip agent init`, the project structure is:
+
+```
+my-agent/
+  agent.json
+  entry-points.json        # Input/output schema — test case --inputs / --expected must match
+  project.uiproj
+  flow-layout.json
+  evals/
+    evaluators/
+      evaluator-default.json              # name: "Default Evaluator" (semantic-similarity)
+      evaluator-default-trajectory.json   # name: "Default Trajectory Evaluator"
+    eval-sets/
+      evaluation-set-default.json         # name: "Default Evaluation Set" (references both evaluators)
+```
+
+Evaluators live in `evals/evaluators/` and eval sets (with inline test cases) live in `evals/eval-sets/`. Both are auto-discovered by the CLI from these directories.
+
+CLI-added evaluators are written as `evaluator-<id8>.json` (first 8 hex chars of the evaluator UUID). The `<name>` argument populates the `name` field inside the JSON, NOT the filename. Reference evaluators in eval sets by `id` (UUID), not filename.
+
+## Key Differences from Coded Agent Evals
+
+| Aspect | Coded (`uip codedagent eval`) | Low-code (`uip agent eval`) |
+|--------|-------------------------------|------------------------------|
+| Execution | Local Python process | Cloud-based via Agent Runtime |
+| Auth required | Only for `--report` | Always (cloud execution) |
+| Prerequisite | `entry-points.json` | `uip agent push` |
+| Mocking | `@mockable()` decorator + declarative | Simulation instructions only |
+| CLI prefix | `uip codedagent eval` | `uip agent eval` |
+
+## Troubleshooting
+
+| Error | Cause | Fix |
+|-------|-------|-----|
+| Solution ID could not be resolved | Agent not pushed to Studio Web | Run `uip agent push --output json`, or pass `--solution-id <id>` explicitly to `uip agent eval run start` |
+| `No evaluators found` | Empty `evals/evaluators/` directory | Run `uip agent eval evaluator add` or re-init with `uip agent init` |
+| `No test cases in eval set` | Eval set has no evaluations | Run `uip agent eval add` to add test cases |
+| `Unknown evaluator type "X"` | Wrong case on `--type` value | Use kebab-case only: `semantic-similarity`, `trajectory` |
+| `Evaluator '<name>' is an LLM-based evaluator but 'model' is not set in its evaluatorConfig.` | LLM evaluator JSON has empty/missing `model` and is not `same-as-agent` | Set `"model"` in the evaluator JSON to a valid model (e.g. `claude-haiku-4-5-20251001`), or set it to `"same-as-agent"` and ensure `agent.json` has a model |
+| `'same-as-agent' model option requires agent settings. Ensure agent.json contains valid model settings.` | Evaluator uses `"model": "same-as-agent"` but `agent.json` has no resolvable model | Set a model in `agent.json`, or override the evaluator with an explicit model |
+| `401 Unauthorized` | Auth expired | Run `uip login --output json` |
+| Eval run timeout (with `--wait`) | Agent taking too long or stuck | Increase `--timeout` or check agent health in Studio Web. Note: this only stops the local CLI from blocking; the run continues server-side — query with `uip agent eval run status <run-id>` |
+| Validate fails with eval errors | Eval set references an evaluator that no longer exists, OR evaluator JSON missing required field, OR `category`/`type` mismatch (see [evaluators.md](evaluators.md) § What `uip agent validate` Checks) | Re-run `uip agent eval evaluator list` and reconcile `evaluatorRefs`; fix per the validate error message |
+
+The two model-resolution errors above are **runtime checks in the cloud eval worker**, not validate-time checks — `uip agent validate` will not catch them. They surface only after `uip agent eval run start`. To pre-empt them, inspect each evaluator's `model` field locally before pushing.
+
+## Anti-patterns
+
+- **Don't run `uip agent eval run start` before `uip agent push`.** The Agent Runtime executes against the pushed agent. Local edits to `agent.json` after the last push will not be reflected in the run.
+- **Don't skip `uip agent validate` before push.** Validate checks `evals/` and `evaluators/`; broken eval JSON will not block push but will surface as runtime errors.
+- **Don't hand-edit `id` or `evaluatorRefs` UUIDs.** Eval sets reference evaluators by UUID. Renaming an evaluator file or copy-pasting a UUID across evaluators silently breaks resolution.
+- **Don't expect filenames to match `<name>`.** CLI-generated evaluator files use `evaluator-<id8>.json`, not `<name>.json`. Look up evaluators by the `name` field inside the JSON, not by filename.
+- **Don't pass `--type` in PascalCase.** The CLI rejects `SemanticSimilarity`. Only kebab-case is accepted.
+- **Don't reference evaluators across projects.** Each agent project has its own `evals/evaluators/` directory; UUIDs are not portable.
diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md b/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md
new file mode 100644
index 000000000..acda99f05
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md
@@ -0,0 +1,153 @@
+# Evaluation Sets and Test Cases
+
+Evaluation sets group test cases and reference which evaluators to use. Each set is a JSON file in `evals/eval-sets/`. Test cases are stored inline within the eval set.
+
+## Managing Eval Sets
+
+### Add an eval set
+
+```bash
+uip agent eval set add <name> --path <dir> --output json
+```
+
+**Options:**
+
+| Flag | Required | Description | Default |
+|------|----------|-------------|---------|
+| `--evaluators <ids>` | No | Comma-separated evaluator IDs | All existing evaluators |
+| `--path <dir>` | No | Agent project directory | `.` |
+
+When `--evaluators` is not provided, the new eval set automatically references **all** evaluators in the project.
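Because eval sets are plain JSON on disk, they can also be inventoried without the CLI. A local discovery sketch (Python), assuming only the `name`, `id`, and `evaluations` fields shown in the Eval Set JSON Format section:

```python
import json
from pathlib import Path

def list_eval_sets(project_dir: str) -> list[dict]:
    """Summarize each eval set file under evals/eval-sets/: name, id, test case count."""
    sets = []
    for f in sorted(Path(project_dir, "evals", "eval-sets").glob("*.json")):
        doc = json.loads(f.read_text())
        sets.append({
            "file": f.name,
            "name": doc.get("name"),
            "id": doc.get("id"),
            "testCases": len(doc.get("evaluations", [])),
        })
    return sets
```

Prefer `uip agent eval set list` for anything authoritative; this is only useful for quick scripting over the files.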
+
+### List eval sets
+
+```bash
+uip agent eval set list --path <dir> --output json
+```
+
+### Remove an eval set
+
+```bash
+uip agent eval set remove <name-or-id> --path <dir> --output json
+```
+
+## Managing Test Cases
+
+Test cases live inside eval sets. Each test case defines an input, expected output, and optional behavior expectations.
+
+### Add a test case
+
+```bash
+uip agent eval add <name> \
+  --set "<set-name>" \
+  --inputs '{"input":"hello"}' \
+  --expected '{"content":"greeting response"}' \
+  --path <dir> \
+  --output json
+```
+
+**Options:**
+
+| Flag | Required | Description | Default |
+|------|----------|-------------|---------|
+| `--set <name-or-id>` | Yes | Eval set name or ID | — |
+| `--inputs <json>` | Yes | Input values as JSON | — |
+| `--expected <json>` | No | Expected output as JSON | `{}` |
+| `--expected-agent-behavior <text>` | No | Description of expected behavior (used by trajectory evaluator) | `""` |
+| `--simulation-instructions <text>` | No | Instructions for simulating agent behavior | `""` |
+| `--simulate-input` | No | Enable input simulation | `false` |
+| `--simulate-tools` | No | Enable tool simulation | `false` |
+| `--input-generation-instructions <text>` | No | Instructions for generating synthetic inputs | `""` |
+| `--path <dir>` | No | Agent project directory | `.` |
+
+### List test cases
+
+```bash
+uip agent eval list --set "<set-name>" --path <dir> --output json
+```
+
+### Remove a test case
+
+```bash
+uip agent eval remove <name-or-id> --set "<set-name>" --path <dir> --output json
+```
+
+## Test Case Design
+
+### Aligning `--inputs` with `entry-points.json`
+
+`--inputs` JSON keys must match the `input` schema in `entry-points.json`. Mismatched keys do not block `eval add` (the CLI stores the JSON verbatim) but will fail at run time when the Agent Runtime invokes the agent. Run `uip agent validate --output json` after adding test cases to surface schema drift.
+
+### Matching evaluator to test case fields
+
+The `--inputs` and `--expected` flags populate `inputs` and `expectedOutput` on the test case. Each evaluator type sources its placeholder values from a different combination of test-case fields and agent run trace:
+
+| Evaluator Type | From test case | From agent run |
+|----------------|---------------|----------------|
+| Semantic Similarity (type 5) | `expectedOutput` → `{{ExpectedOutput}}` | Agent output → `{{ActualOutput}}` |
+| Trajectory (type 7) | `expectedAgentBehavior` → `{{ExpectedAgentBehavior}}`, `inputs` → `{{UserOrSyntheticInput}}`, `simulationInstructions` → `{{SimulationInstructions}}` | Trace → `{{AgentRunHistory}}` |
+| Exact match (type 1) | `expectedOutput` (compared verbatim, no placeholders) | Agent output (compared verbatim) |
+| JSON similarity (type 6) | `expectedOutput` (tree-compared, no placeholders) | Agent output (tree-compared) |
+
+For trajectory evaluation, write `--expected-agent-behavior` as a natural language description of what the agent should do, not what it should output:
+
+```bash
+uip agent eval add tool-usage-test \
+  --set "Default Evaluation Set" \
+  --inputs '{"input":"What is the weather in NYC?"}' \
+  --expected-agent-behavior "Agent should call the weather tool with location NYC and return a formatted weather summary" \
+  --path ./my-agent --output json
+```
+
+### Simulation options
+
+- `--simulate-input` — runtime generates synthetic input variations based on the provided input
+- `--simulate-tools` — tool calls are simulated rather than executed against real services
+- `--input-generation-instructions` — guides synthetic input generation (e.g., "generate edge cases with empty strings and special characters")
+- `--simulation-instructions` — guides overall simulation behavior
+
+Use these to expand test coverage without writing every input by hand.
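The schema-drift problem above (`eval add` stores `--inputs` verbatim, mismatches only fail in the cloud) can be caught with a local key check. A sketch (Python), assuming `entry-points.json` carries a JSON-Schema-style `input` object with a `properties` map — adjust the lookup to your actual file layout:

```python
import json
from pathlib import Path

def check_inputs(project_dir: str, inputs: dict) -> list[str]:
    """Flag test-case input keys that are absent from the entry point's input schema."""
    schema = json.loads(Path(project_dir, "entry-points.json").read_text())
    # Assumed layout: {"input": {"properties": {...}}} — this is a guess, not the documented shape.
    allowed = set(schema.get("input", {}).get("properties", {}))
    return [k for k in inputs if k not in allowed]
```

Run it over each `--inputs` payload before `uip agent push`; a non-empty result is a test case that will fail at execution time.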
+
+## Eval Set JSON Format
+
+```json
+{
+  "fileName": "evaluation-set-default.json",
+  "id": "<uuid>",
+  "name": "Default Evaluation Set",
+  "batchSize": 10,
+  "evaluatorRefs": ["<evaluator-uuid>", "<evaluator-uuid>"],
+  "evaluations": [
+    {
+      "id": "<uuid>",
+      "name": "happy-path",
+      "inputs": {"input": "hello"},
+      "expectedOutput": {"content": "greeting"},
+      "expectedAgentBehavior": "",
+      "simulationInstructions": "",
+      "simulateInput": false,
+      "simulateTools": false,
+      "inputGenerationInstructions": "",
+      "evalSetId": "<uuid>",
+      "source": "manual",
+      "createdAt": "...",
+      "updatedAt": "..."
+    }
+  ],
+  "modelSettings": [],
+  "agentMemoryEnabled": false,
+  "agentMemorySettings": [],
+  "lineByLineEvaluation": false,
+  "createdAt": "...",
+  "updatedAt": "..."
+}
+```
+
+The `source` field indicates how the test case was created. CLI-added test cases are always `"manual"` (verified). Other observed values from Studio Web include `"debugRun"`, `"runtimeRun"`, `"simulatedRun"`, and `"autopilotUserInitiated"` — treat the `source` field as an enum but do not set it manually; the CLI and Studio Web own this value.
+
+## Anti-patterns
+
+- **Don't hand-write `evalSetId` or test case `id` UUIDs.** Use `uip agent eval add` so the CLI keeps `evaluations[].evalSetId` consistent with the parent eval set's `id`.
+- **Don't add `--inputs` keys that are not in `entry-points.json`.** The runtime will reject the test case at execution time. Run `uip agent validate` to catch this before push.
+- **Don't set `--expected '{}'` (empty) and `--expected-agent-behavior ""` together.** The semantic-similarity evaluator scores against an empty `{{ExpectedOutput}}`; the trajectory evaluator scores against an empty `{{ExpectedAgentBehavior}}`. Every run scores low for non-actionable reasons.
+- **Don't set the `source` field manually.** Owned by CLI and Studio Web; hand-edits may be overwritten on the next sync.
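The UUID wiring the anti-patterns above warn about can be linted locally. A sketch (Python; `lint_eval_set` is a hypothetical helper, and it assumes each evaluator file carries the `id` field shown in evaluators.md):

```python
import json
from pathlib import Path

def lint_eval_set(set_file: str, evaluators_dir: str) -> list[str]:
    """Check evaluatorRefs resolve to known evaluator ids and evalSetId matches the parent set."""
    doc = json.loads(Path(set_file).read_text())
    known = {json.loads(p.read_text()).get("id") for p in Path(evaluators_dir).glob("*.json")}
    problems = [f"unresolved evaluatorRef {ref}" for ref in doc.get("evaluatorRefs", []) if ref not in known]
    for ev in doc.get("evaluations", []):
        if ev.get("evalSetId") != doc.get("id"):
            problems.append(f"test case {ev.get('name')!r} has evalSetId != parent id")
    return problems
```

This catches the silent-breakage cases (renamed files, copy-pasted UUIDs) before `uip agent validate` or a cloud run does.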
diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluators.md b/skills/uipath-agents/references/lowcode/evaluations/evaluators.md
new file mode 100644
index 000000000..c3f4adfae
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluations/evaluators.md
@@ -0,0 +1,280 @@
+# Evaluators
+
+Evaluators define how agent output is scored. Each evaluator is a JSON file in `evals/evaluators/`.
+
+## Supported Evaluator Types
+
+Low-code agents support exactly four evaluator types. All four are first-class options in the Studio Web "Add evaluator" dialog. Two also have CLI-flag shortcuts; the other two are created via the UI or by hand-writing JSON in `evals/evaluators/`.
+
+| UI label | `type` | `category` | `--type` flag | What it scores | LLM-based |
+|----------|--------|-----------|---------------|----------------|-----------|
+| LLM-as-a-judge: Semantic Similarity | 5 | 1 (LlmAsAJudge) | `semantic-similarity` | Whether the agent's output has the same meaning as the expected output | Yes |
+| Trajectory | 7 | 3 (Trajectory) | `trajectory` | Whether the agent's reasoning path and tool usage match expected behavior | Yes |
+| Exact match | 1 | 0 (Deterministic) | — | Whether the output precisely matches the expected output without variations in wording or formatting | No |
+| JSON similarity | 6 | 0 (Deterministic) | — | Whether two JSON structures or values are "close enough" or share similar structure/contents | No |
+
+How to add each type:
+
+- **Studio Web UI** — Evaluators tab → **Create New** → Add evaluator dialog → pick any of the four. UI is the canonical surface and supports all four with no special steps.
+- **CLI** — `uip agent eval evaluator add <name> --type <type>` for `semantic-similarity` or `trajectory`. The CLI does not have a `--type` value for Exact match or JSON similarity; create those in the UI or hand-write the JSON.
+- **Hand-write JSON** — drop a file in `evals/evaluators/` matching the schema below; run `uip agent validate --output json`; reference the new `id` from your eval set's `evaluatorRefs`. Useful when you want to pin a specific model and prompt for the LLM-based types, or when you're scaffolding eval files programmatically.
+
+### Why fewer than coded?
+
+The coded eval reference ([coded/lifecycle/evaluations/evaluators.md](../../coded/lifecycle/evaluations/evaluators.md)) lists 13 evaluator types. Low-code supports only these four because the two surfaces use **different engines** in the SDK:
+
+- **Coded** uses the new evaluator hierarchy (`BaseEvaluator`, eval sets carry `version: "1.0"`). 13 distinct `evaluatorTypeId` strings, each with its own implementation class.
+- **Low-code** uses the **legacy** evaluator hierarchy (`BaseLegacyEvaluator`, no `version` field on the eval set). The four legacy classes shipped — `LegacyLlmAsAJudgeEvaluator`, `LegacyTrajectoryEvaluator`, `LegacyExactMatchEvaluator`, `LegacyJsonSimilarityEvaluator` — are exactly what the UI exposes.
+
+Most coded evaluator types (`contains`, `binary-classification`, `multiclass-classification`, all four `tool-call-*`, `llm-judge-output-strict-json-similarity`, `llm-judge-trajectory-simulation`) have no legacy counterpart and cannot be used on a low-code agent.
+
+## JSON Shapes
+
+For hand-written files, the filename can be any descriptive name (e.g. `legacy-equality.json`) — the runtime keys off `id` / `evaluatorRefs`, not the filename. The CLI-generated `evaluator-<id8>.json` pattern only applies to evaluators created via `uip agent eval evaluator add`.
+
+### Exact match (`type` 1, `category` 0 — Deterministic)
+
+No LLM. Equivalent of coded `uipath-exact-match`.
+
+```json
+{
+  "fileName": "legacy-equality.json",
+  "id": "<uuid>",
+  "name": "Equality Evaluator",
+  "description": "An evaluator that judges the agent based on expected output.",
+  "category": 0,
+  "type": 1,
+  "targetOutputKey": "*",
+  "createdAt": "<iso-8601>",
+  "updatedAt": "<iso-8601>"
+}
+```
+
+No `prompt`/`model` required (Deterministic category bypasses the LLM checks).
+
+### JSON similarity (`type` 6, `category` 0 — Deterministic)
+
+Tree-based JSON comparison. No LLM. Equivalent of coded `uipath-json-similarity`.
+
+```json
+{
+  "fileName": "legacy-json-similarity.json",
+  "id": "<uuid>",
+  "name": "JSON Similarity Evaluator",
+  "description": "An evaluator that compares JSON structures with tolerance for numeric and string differences.",
+  "category": 0,
+  "type": 6,
+  "targetOutputKey": "*",
+  "createdAt": "<iso-8601>",
+  "updatedAt": "<iso-8601>"
+}
```

+### LLM-as-a-judge: Semantic Similarity (`type` 5, `category` 1 — LlmAsAJudge)
+
+The CLI's `evaluator add --type semantic-similarity` writes a shorter prompt; hand-write the file when you want to pin a specific model and the longer 0–100 prompt:
+
+```json
+{
+  "fileName": "legacy-llm-as-a-judge.json",
+  "id": "<uuid>",
+  "name": "LLM As A Judge Evaluator",
+  "description": "An evaluator that uses an LLM to judge the similarity of the actual output to the expected output",
+  "category": 1,
+  "type": 5,
+  "prompt": "As an expert evaluator, analyze the semantic similarity of these outputs to determine a score from 0-100.\n----\nExpectedOutput:\n{{ExpectedOutput}}\n----\nActualOutput:\n{{ActualOutput}}\n",
+  "targetOutputKey": "*",
+  "model": "gpt-4.1-2025-04-14",
+  "createdAt": "<iso-8601>",
+  "updatedAt": "<iso-8601>"
+}
+```
+
+### Trajectory (`type` 7, `category` 3 — Trajectory)
+
+```json
+{
+  "fileName": "legacy-trajectory.json",
+  "id": "<uuid>",
+  "name": "Trajectory Evaluator",
+  "description": "An evaluator that analyzes the execution trajectory and decision sequence taken by the agent.",
+  "category": 3,
+  "type": 7,
+  "prompt": "Evaluate the agent's execution trajectory based on the expected behavior.\n\nExpected Agent Behavior: {{ExpectedAgentBehavior}}\nAgent Run History: {{AgentRunHistory}}\n\nProvide a score from 0-100 based on how well the agent followed the expected trajectory.",
+  "model": "gpt-4.1-2025-04-14",
+  "targetOutputKey": "*",
+  "createdAt": "<iso-8601>",
+  "updatedAt": "<iso-8601>"
+}
+```
+
+After hand-writing any evaluator, run `uip agent validate --output json` to confirm the file passes schema migration. Then reference the new evaluator's `id` from your eval set's `evaluatorRefs`. Watch for: `id` collisions with existing evaluators, missing required fields, and ISO-8601 formatting on the timestamps.
+
+## Coded-only evaluators (NOT available on low-code)
+
+The following coded `evaluatorTypeId` strings have no legacy class — agents working on a low-code agent should not attempt to use them. Switch to a coded agent (`version: "1.0"` eval sets) if you need any of these:
+
+`uipath-contains`, `uipath-llm-judge-output-strict-json-similarity`, `uipath-llm-judge-trajectory-simulation`, `uipath-binary-classification`, `uipath-multiclass-classification`, `uipath-tool-call-order`, `uipath-tool-call-args`, `uipath-tool-call-count`, `uipath-tool-call-output`.
+
+## Managing Evaluators
+
+### Add an evaluator
+
+```bash
+uip agent eval evaluator add <name> --type <type> --path <dir> --output json
+```
+
+**Options:**
+
+| Flag | Required | Description | Default |
+|------|----------|-------------|---------|
+| `--type <type>` | Yes | One of: `semantic-similarity`, `trajectory` | — |
+| `--description <text>` | No | Human-readable description | Auto-generated from type |
+| `--prompt <text>` | No | Custom LLM evaluation prompt | Built-in default per type |
+| `--target-key <key>` | No | Specific output key to evaluate | `*` (all keys) |
+| `--path <dir>` | No | Agent project directory | `.` |
+
+**Example:**
+```bash
+uip agent eval evaluator add content-quality \
+  --type semantic-similarity \
+  --path ./my-agent \
+  --output json
+```
+
+### List evaluators
+
+```bash
+uip agent eval evaluator list --path <dir> --output json
+```
+
+### Remove an evaluator
+
+```bash
+uip agent eval evaluator remove <name-or-id> --path <dir> --output json
+```
+
+Removing an evaluator automatically removes its references from all eval sets that reference it.
+
+## Default Evaluators
+
+`uip agent init` creates two default evaluators:
+
+### Semantic Similarity (`evaluator-default.json`, `name: "Default Evaluator"`)
+
+Compares expected vs actual output for semantic equivalence. Default prompt asks the LLM for a 0–100 score and substitutes `{{ExpectedOutput}}` and `{{ActualOutput}}`.
+
+### Trajectory (`evaluator-default-trajectory.json`, `name: "Default Trajectory Evaluator"`)
+
+Evaluates the agent's reasoning path against expected behavior. Default prompt asks the LLM for a 0–100 score and substitutes `{{UserOrSyntheticInput}}`, `{{SimulationInstructions}}`, `{{ExpectedAgentBehavior}}`, and `{{AgentRunHistory}}`.
+
+Both default evaluators ship with `"model": "same-as-agent"` — this is supported and resolves to the agent's configured model at runtime. Override with an explicit model only if you need to score with a different model than the agent uses.
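Since the model-resolution errors only surface in the cloud eval worker, it is worth scanning evaluator files locally first. A sketch (Python; `find_modelless_llm_evaluators` is a hypothetical helper — categories 1 and 3 are the LLM-based ones per the tables above):

```python
import json
from pathlib import Path

def find_modelless_llm_evaluators(evaluators_dir: str) -> list[str]:
    """Names of LLM-based evaluators (category 1 or 3) with an empty or missing model."""
    bad = []
    for p in sorted(Path(evaluators_dir).glob("*.json")):
        ev = json.loads(p.read_text())
        if ev.get("category") in (1, 3) and not ev.get("model"):
            bad.append(ev.get("name", p.name))
    return bad
```

Note this deliberately does not flag `"same-as-agent"` — those evaluators are fine as long as `agent.json` has a resolvable model, which is a separate check.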
+
+The runtime DTO normalizes all evaluator scores to a 0–100 scale regardless of what the prompt asks for, but mixed-scale prompts in the same eval set produce confusing intermediate values — pick one scale per eval set.
+
+## Filename vs Name
+
+CLI-added evaluators are saved as `evaluator-<id8>.json` (first 8 hex chars of the evaluator UUID). The `<name>` argument populates the `name` field inside the JSON; it does NOT shape the filename.
+
+```bash
+uip agent eval evaluator add content-quality --type semantic-similarity --path ./my-agent
+# Creates: evals/evaluators/evaluator-b47e26ca.json
+# JSON has: "name": "content-quality"
+```
+
+The two `evaluator-default*.json` files are written by `uip agent init`, not by `evaluator add`. Eval sets reference evaluators by `id` (UUID), not by filename or name.
+
+## Evaluator JSON Format
+
+```json
+{
+  "fileName": "evaluator-b47e26ca.json",
+  "id": "b47e26ca-7a13-4c83-9ee4-039d6415fb63",
+  "name": "content-quality",
+  "description": "Semantic Similarity",
+  "category": 1,
+  "type": 5,
+  "prompt": "As an expert evaluator, ... {{ExpectedOutput}} ... {{ActualOutput}} ...",
+  "model": "same-as-agent",
+  "targetOutputKey": "*",
+  "createdAt": "2026-05-04T00:00:00.000Z",
+  "updatedAt": "2026-05-04T00:00:00.000Z"
+}
+```
+
+**Type and category mapping:**
+
+| CLI Type | `type` (numeric) | `category` |
+|----------|-------------------|------------|
+| `semantic-similarity` | 5 | 1 (output-based) |
+| `trajectory` | 7 | 3 (trajectory-based) |
+
+## Default Prompts and Template Variables
+
+The prompt and score scale the CLI writes when you run `evaluator add` differ from what `uip agent init` writes for the two default evaluators:
+
+| Type | `evaluator add` default | `uip agent init` default |
+|------|-------------------------|--------------------------|
+| `semantic-similarity` | Asks 0–1; uses `{{ExpectedOutput}}`, `{{ActualOutput}}` | Asks 0–100; same placeholders |
+| `trajectory` | Asks 0–1; uses `{{AgentRunHistory}}`, `{{ExpectedBehavior}}` | Asks 0–100; uses `{{UserOrSyntheticInput}}`, `{{SimulationInstructions}}`, `{{ExpectedAgentBehavior}}`, `{{AgentRunHistory}}` |
+
+Two notable inconsistencies:
+
+1. **Trajectory placeholder names**: `{{ExpectedBehavior}}` (CLI add) vs `{{ExpectedAgentBehavior}}` (init default). When editing a prompt, use the placeholders already present in that file — do not mix.
+2. **Score scales**: `evaluator add` writes 0–1 prompts; `init` writes 0–100 prompts. The runtime normalizes both to 0–100 in the result DTO, but the LLM judge actually returns whatever the prompt asks for. Mixed-scale eval sets are hard to read; pick one and rewrite the prompts you don't want.
+
+## Custom Prompts
+
+Pass `--prompt` to override the default. Use only the placeholders listed above for the chosen `--type`; unknown placeholders are passed through to the LLM as literal text.
+
+```bash
+uip agent eval evaluator add strict-match \
+  --type semantic-similarity \
+  --prompt 'Score 0-100 how closely {{ActualOutput}} matches {{ExpectedOutput}}. Return JSON {"score": N, "reason": "..."}.' \
+  --path ./my-agent --output json
+```
+
+## What `uip agent validate` Checks
+
+Validate runs schema migration, which enforces the following on every file in `evals/evaluators/`:
+
+**Required fields:** `fileName`, `id`, `name`, `description`, `category`, `type`, `targetOutputKey`, `createdAt`, `updatedAt`. Missing field → `Required field "<field>" is missing`.
+
+**Category ↔ type compatibility:**
+
+| Category | Name | Allowed `type` | Additional requirements |
+|----------|------|----------------|-------------------------|
+| `0` | Deterministic | `1`, `6` | — |
+| `1` | LlmAsAJudge | `5` | `prompt` and `model` required |
+| `3` | Trajectory | `7` | `prompt` and `model` required |
+
+Category `2` (`AgentScorer`) exists in the SDK enum but is reserved/unused — do not write it manually.
+
+Eval sets are validated against a Zod schema. The CLI surfaces the offending file path, JSON path, and message — fix and re-run validate.
+
+## Runtime Errors (Eval Worker)
+
+These errors surface only after `uip agent eval run start` — `uip agent validate` does NOT catch them. They come from the cloud eval worker (`python-eval-worker/workflows/eval/activities.py`) and the SDK's `EvaluatorFactory`.
+
+| Error string | Trigger | Fix |
+|--------------|---------|-----|
+| `Evaluator '<name>' is an LLM-based evaluator but 'model' is not set in its evaluatorConfig. Specify a valid model name (e.g. 'claude-haiku-4-5-20251001').` | Evaluator JSON has empty/missing `model` (and is not `same-as-agent`). The worker fail-fasts before calling the LLM gateway. | Set `model` in the evaluator JSON to a model available in your tenant, or set `"model": "same-as-agent"` and ensure `agent.json` has a model. |
+| `'same-as-agent' model option requires agent settings. Ensure agent.json contains valid model settings.` | Evaluator uses `"same-as-agent"` but `agent.json` has no resolvable model. | Set `model` in `agent.json`, or override the evaluator with an explicit model. |
+
+**Pre-empt locally:** before push, run
+
+```bash
+uip agent eval evaluator list --path ./my-agent --output json --output-filter '[?model==`""` || model==`null`]'
+```
+
+to find any LLM evaluator without an explicit model. (Switch to `--output-filter '[?model==`"same-as-agent"`]'` if you want to flag those that depend on `agent.json`.)
+
+## Anti-patterns
+
+- **Don't reference an evaluator by filename.** Eval sets reference evaluators by UUID (`id`).
+- **Don't pass `--type` in PascalCase.** Only `semantic-similarity` and `trajectory` are accepted.
+- **Don't assume `evaluator add` mirrors `init`'s prompts.** They differ for trajectory; check the resulting JSON before reusing template variables in your own scoring tooling.
+- **Don't delete an evaluator file by hand.** Use `uip agent eval evaluator remove` so `evaluatorRefs` in eval sets are cleaned up automatically.
+- **Don't copy evaluator JSON across projects without regenerating UUIDs.** `id` collisions silently corrupt cross-project resolution.
+- **Don't try to add a coded-only evaluator type to a low-code agent.** Anything starting with `uipath-tool-call-*`, `uipath-binary-classification`, `uipath-multiclass-classification`, `uipath-contains`, `uipath-llm-judge-output-strict-json-similarity`, or `uipath-llm-judge-trajectory-simulation` has no legacy class and the eval worker will not load it. If you need one of these, the agent must be coded, not low-code.
+- **Don't hand-write a category/type combination outside the validate matrix.** Validate accepts cat 0 → types {1, 6}, cat 1 → type {5}, cat 3 → type {7}. Anything else fails schema migration.
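The validate matrix can also be enforced in local tooling before handing files to `uip agent validate`. A sketch (Python; `check_category_type` is a hypothetical helper encoding the table above):

```python
import json
from pathlib import Path

# Allowed category -> type combinations, from the validate matrix above.
ALLOWED = {0: {1, 6}, 1: {5}, 3: {7}}

def check_category_type(evaluator_file: str):
    """Return an error string if the category/type pair would fail schema migration."""
    ev = json.loads(Path(evaluator_file).read_text())
    cat, typ = ev.get("category"), ev.get("type")
    if typ not in ALLOWED.get(cat, set()):
        return f"category {cat} does not allow type {typ}"
    # Categories 1 and 3 additionally require prompt and model.
    if cat in (1, 3) and not (ev.get("prompt") and ev.get("model")):
        return "LLM-based category requires prompt and model"
    return None
```

This mirrors only the matrix and the prompt/model requirement; the required-fields check is separate.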
diff --git a/skills/uipath-agents/references/lowcode/evaluations/running-evaluations.md b/skills/uipath-agents/references/lowcode/evaluations/running-evaluations.md
new file mode 100644
index 000000000..8713186b1
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluations/running-evaluations.md
@@ -0,0 +1,207 @@
+# Running Evaluations
+
+Execute evaluations against the Agent Runtime, check status, view results, and compare runs.
+
+All run commands require the agent to be pushed to Studio Web first (`uip agent push`). The Agent Runtime executes test cases in the cloud using the pushed agent definition.
+
+## Start an Eval Run
+
+```bash
+uip agent eval run start --set "<set-name>" --path <dir> --wait --output json
+```
+
+**Options:**
+
+| Flag | Required | Description | Default |
+|------|----------|-------------|---------|
+| `--set <name-or-id>` | Yes | Eval set name or ID | — |
+| `--path <dir>` | No | Agent project directory | `.` |
+| `--wait` | No | Block until the run completes, then print results | `false` |
+| `--timeout <seconds>` | No | Maximum time to block when `--wait` is set | `600` (10 min) |
+| `--solution-id <id>` | No | Override solution ID for this run | Auto-resolved from the pushed-agent state |
+
+Without `--wait`, the command returns immediately with `Code: AgentEvalRunStarted`:
+
+```json
+{
+  "Code": "AgentEvalRunStarted",
+  "Data": {
+    "EvalSetRunId": "a1b2c3d4-...",
+    "EvalSetName": "Default Evaluation Set",
+    "TestCases": 5,
+    "Evaluators": 2
+  }
+}
+```
+
+With `--wait`, the CLI polls every 5 seconds (hardcoded interval) until the run reaches a terminal state (`completed` or `failed`) or `--timeout` elapses, then emits `AgentEvalRunCompleted` plus per-test `AgentEvalRunResults`. If `--timeout` elapses first, the run continues server-side; query progress with `uip agent eval run status <run-id>`.
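The `--wait` semantics (fixed poll interval, timeout that abandons the local wait but not the server-side run) can be mirrored when scripting around the CLI. A sketch (Python) with a hypothetical `fetch_status` callable standing in for a status query:

```python
import time

def wait_for_run(fetch_status, timeout_s: float = 600.0, poll_s: float = 5.0):
    """Poll until the run reaches a terminal state or the timeout elapses.

    fetch_status() -> str, one of: "pending", "running", "completed", "failed".
    Returns the terminal status, or None if we gave up waiting
    (the run itself keeps going server-side).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_s)
    return None
```

A `None` return should be treated like the CLI's timeout: not a failure, just a cue to re-query later.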
+
+### Output codes
+
+| Subcommand | `Code` |
+|------------|--------|
+| `run start` (no `--wait`) | `AgentEvalRunStarted` |
+| `run start --wait` (summary) | `AgentEvalRunCompleted` |
+| `run start --wait` (per-case detail) | `AgentEvalRunResults` |
+| `run status` | `AgentEvalRunStatus` |
+| `run results` | `AgentEvalRunResults` |
+| `run results --export-format` | `AgentEvalRunExported` |
+| `run list` | `AgentEvalRunList` |
+| `run compare` | `AgentEvalRunComparison` |
+
+## Check Run Status
+
+```bash
+uip agent eval run status --set "<eval-set-name>" --path <project-dir> --output json
+```
+
+**Output:**
+```json
+{
+  "Code": "AgentEvalRunStatus",
+  "Data": {
+    "EvalSetRunId": "a1b2c3d4-...",
+    "Status": "completed",
+    "Score": 0.86,
+    "Duration": "42.5s",
+    "EvaluatorScores": "semantic: 0.9, trajectory: 0.82"
+  }
+}
+```
+
+Terminal states: `completed` or `failed`.
+
+## View Results
+
+```bash
+uip agent eval run results \
+  --set "<eval-set-name>" \
+  --path <project-dir> \
+  --output json
+```
+
+**Options:**
+
+| Flag | Description |
+|------|-------------|
+| `--only-failed` | Show only failed or errored test cases |
+| `--verbose` | Include evaluator justifications in output |
+| `--export-format <format>` | Export results to file (`eval-results-{timestamp}.json` or `.csv`) |
+
+**Per-test-case output fields:** `TestCase`, `Status`, `Score`, `EvaluatorScores`, `Tokens`, `Duration`, `Error` (plus `Justifications` when `--verbose`).
+
+### Filtering results with `--output-filter`
+
+`--output-filter` takes a JMESPath expression and applies it to the JSON payload before printing.
Useful for triage:
+
+```bash
+# Print only test cases with a specific name
+uip agent eval run results --set "Default Evaluation Set" --path ./my-agent \
+  --output json --output-filter 'Data.Results[?TestCase==`greeting-test`]'
+
+# Print only the score field for each test case
+uip agent eval run results --set "Default Evaluation Set" --path ./my-agent \
+  --output json --output-filter 'Data.Results[*].{name: TestCase, score: Score}'
+```
+
+### Failure detection
+
+`--only-failed` filters to test cases where any of these are true (`isFailedRun()` in the CLI):
+
+1. `status === "failed"` (or numeric `"3"`)
+2. `errorMessage` is non-null
+3. `result.score.type === "error"` (or numeric `"2"`)
+4. Any `assertionRuns[*].result.score.type === "error"` (or numeric `"2"`)
+5. Any `assertionRuns[*].result.score.value === false` (exact-match evaluators that returned a false boolean)
+
+Status enum values from the SDK: `0 = pending`, `1 = running`, `2 = completed`, `3 = failed`. The CLI normalizes string and numeric forms.
+
+## List Past Runs
+
+```bash
+uip agent eval run list --set "<eval-set-name>" --path <project-dir> --output json
+```
+
+**Per-row output:** `EvalSetRunId`, `Status`, `Score`, `TestCases`, `Duration`, `EvaluatorScores`, `CreatedAt`.
+
+## Compare Runs
+
+Compare two eval runs side by side to see score changes:
+
+```bash
+uip agent eval run compare \
+  --compare-to <run-id> \
+  --set "<eval-set-name>" \
+  --path <project-dir> \
+  --output json
+```
+
+**Output:**
+```json
+{
+  "Code": "AgentEvalRunComparison",
+  "Data": {
+    "RunA": { "Id": "...", "Score": 0.86, "Status": "completed" },
+    "RunB": { "Id": "...", "Score": 0.80, "Status": "completed" },
+    "ScoreDelta": 0.06,
+    "TestCases": [
+      {
+        "TestCase": "happy-path",
+        "ScoreA": 1.0,
+        "ScoreB": 0.9,
+        "Delta": "+0.1",
+        "StatusA": "completed",
+        "StatusB": "completed"
+      }
+    ]
+  }
+}
+```
+
+Use `compare` after prompt changes to verify improvements without regressions.
+
+## Workflow Example
+
+```bash
+# 1.
Add test cases
+uip agent eval add greeting-test \
+  --set "Default Evaluation Set" \
+  --inputs '{"input":"hi there"}' \
+  --expected '{"content":"Hello! How can I help you?"}' \
+  --expected-agent-behavior "Agent should respond with a friendly greeting" \
+  --path ./my-agent --output json
+
+# 2. Validate (catches schema drift, missing evaluator refs, broken eval JSON)
+uip agent validate --path ./my-agent --output json
+
+# 3. Push agent to Studio Web (required before running evals)
+uip agent push --path ./my-agent --output json
+
+# 4. Run and wait
+uip agent eval run start \
+  --set "Default Evaluation Set" \
+  --path ./my-agent \
+  --wait --timeout 600 --output json
+
+# 5. Review failures
+uip agent eval run results \
+  --set "Default Evaluation Set" \
+  --only-failed --verbose \
+  --path ./my-agent --output json
+
+# 6. Make changes, validate, push, re-run, compare
+uip agent validate --path ./my-agent --output json
+uip agent push --path ./my-agent --output json
+uip agent eval run start --set "Default Evaluation Set" --path ./my-agent --wait --output json
+uip agent eval run compare --compare-to <previous-run-id> \
+  --set "Default Evaluation Set" --path ./my-agent --output json
+```
+
+## Anti-patterns
+
+- **Don't run `eval run start` without `uip agent push` first.** The Agent Runtime executes against the pushed agent, not local files. Local edits made after the last push will not affect results.
+- **Don't assume `--timeout` cancels the server-side run.** It only stops the local CLI from blocking. The run continues and can be inspected with `run status`.
+- **Don't skip `uip agent validate` between edits and push.** Validate catches eval-set / evaluator drift that push will accept silently and the runtime will reject.
+- **Don't compare runs from different eval sets.** `compare` aligns by test case `name` within the eval set; cross-set deltas are meaningless.
+- **Don't rely on `Score` alone; inspect `EvaluatorScores`.** A 0.86 aggregate can mask an agent that reaches the right-looking output through the wrong steps (high semantic score, low trajectory score). Use `--verbose` to read justifications when scores look surprising.
+- **Don't mix score scales across evaluators in the same eval set.** Defaults written by `uip agent init` use 0–100 prompts; defaults written by `evaluator add` use 0–1 prompts. The runtime DTO normalizes to 0–100, but mixed-scale prompts produce confusing per-evaluator scores. Decide on one scale per eval set and edit prompts to match.
diff --git a/skills/uipath-agents/references/lowcode/lowcode.md b/skills/uipath-agents/references/lowcode/lowcode.md
index 94b65521d..4858efd71 100644
--- a/skills/uipath-agents/references/lowcode/lowcode.md
+++ b/skills/uipath-agents/references/lowcode/lowcode.md
@@ -46,6 +46,7 @@ Capabilities are **orthogonal**: there is no ordering requirement among them. Ad
 | Scaffolding, validating, or running solution lifecycle commands | [project-lifecycle.md](project-lifecycle.md) |
 | Editing `agent.json` (prompts, schemas, model, contentTokens) or `entry-points.json` | [agent-definition.md](agent-definition.md) |
 | External tools / IS tools / index contexts / escalations behave unexpectedly after `uip solution resource refresh` | [solution-resources.md](solution-resources.md) |
+| Running evaluations, adding test cases, managing evaluators | [evaluations/evaluate.md](evaluations/evaluate.md) |
 
 ### Capability Registry