From 8d262a6ded9f088aee68c7ae59e7c492c90f7591 Mon Sep 17 00:00:00 2001 From: Mayank Jha Date: Mon, 4 May 2026 15:15:04 -0700 Subject: [PATCH 1/5] feat: add low-code agent evaluation docs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The uipath-agents skill has comprehensive evaluation docs for coded agents (5 files under coded/lifecycle/evaluations/) but none for low-code agents, despite full CLI support in `uip agent eval`. Adds 4 reference files under lowcode/evaluation/: - evaluate.md — entry point, prerequisites, file structure, differences from coded - evaluators.md — 4 evaluator types, add/list/remove, JSON format, custom prompts - evaluation-sets.md — eval set and test case CRUD, simulation options, JSON format - running-evaluations.md — run start/status/results/list/compare, workflow example Updates SKILL.md task navigation and lowcode.md capability registry to reference the new evaluation docs. Co-Authored-By: Claude Opus 4.6 (1M context) --- skills/uipath-agents/SKILL.md | 1 + .../references/lowcode/evaluation/evaluate.md | 71 ++++++++ .../lowcode/evaluation/evaluation-sets.md | 140 +++++++++++++++ .../lowcode/evaluation/evaluators.md | 102 +++++++++++ .../lowcode/evaluation/running-evaluations.md | 163 ++++++++++++++++++ .../references/lowcode/lowcode.md | 2 + 6 files changed, 479 insertions(+) create mode 100644 skills/uipath-agents/references/lowcode/evaluation/evaluate.md create mode 100644 skills/uipath-agents/references/lowcode/evaluation/evaluation-sets.md create mode 100644 skills/uipath-agents/references/lowcode/evaluation/evaluators.md create mode 100644 skills/uipath-agents/references/lowcode/evaluation/running-evaluations.md diff --git a/skills/uipath-agents/SKILL.md b/skills/uipath-agents/SKILL.md index 170893ba9..53fbc6b88 100644 --- a/skills/uipath-agents/SKILL.md +++ b/skills/uipath-agents/SKILL.md @@ -46,6 +46,7 @@ Determine the agent mode before proceeding: | Add guardrails (PII, harmful content, custom rules) to a low-code agent | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) § Capability Registry | `lowcode/capabilities/guardrails/guardrails.md` | | Add escalation guardrail (escalate action / Action Center app) | Low-code | [lowcode/capabilities/guardrails/guardrails.md](references/lowcode/capabilities/guardrails/guardrails.md) § escalate — Hand Off to Action Center | Run `uip solution resource list --kind App` to confirm app exists | | Embed a low-code agent inline in a flow, or wire a multi-agent solution | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) § Capability Registry | `lowcode/capabilities/inline-in-flow/inline-in-flow.md`, `lowcode/capabilities/process/solution-agent.md` | +| Run low-code evaluations | Low-code | [lowcode/evaluation/evaluate.md](references/lowcode/evaluation/evaluate.md) | `lowcode/evaluation/evaluators.md`, `lowcode/evaluation/evaluation-sets.md`, `lowcode/evaluation/running-evaluations.md` | | Validate, pack, publish, upload, or deploy a low-code agent | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) | `lowcode/project-lifecycle.md`, `lowcode/solution-resources.md` | ## Resources diff --git a/skills/uipath-agents/references/lowcode/evaluation/evaluate.md b/skills/uipath-agents/references/lowcode/evaluation/evaluate.md new file mode 100644 index 000000000..c60e450cd --- /dev/null +++ b/skills/uipath-agents/references/lowcode/evaluation/evaluate.md @@ -0,0 +1,71 @@ +# Evaluate Low-Code Agents + +Design and run evaluations against low-code 
agents using the `uip agent eval` CLI. + +## Quick Reference + +```bash +# Add a test case +uip agent eval add happy-path --set "Default Evaluation Set" --inputs '{"input":"hello"}' --expected '{"content":"greeting"}' --path ./my-agent --output json + +# Run evals and wait for results +uip agent eval run start --set "Default Evaluation Set" --path ./my-agent --wait --output json + +# Check results (failures only, with justifications) +uip agent eval run results --set "Default Evaluation Set" --only-failed --verbose --path ./my-agent --output json +``` + +## Prerequisites + +- Agent project initialized (`uip agent init`) +- Agent pushed to Studio Web (`uip agent push`) — required for running evals (the Agent Runtime executes test cases in the cloud) +- `SolutionStorage.json` exists in the agent project (created by `uip agent push`) + +Local operations (managing evaluators, eval sets, test cases) do **not** require authentication or a cloud connection. + +## Reference Navigation + +- [Evaluators](evaluators.md) — evaluator types, adding/removing, default prompts +- [Evaluation Sets and Test Cases](evaluation-sets.md) — creating sets, adding test cases, simulation options +- [Running Evaluations](running-evaluations.md) — start, status, results, compare + +Read Evaluators before choosing an evaluator type, and Evaluation Sets before writing test cases. + +## File Structure + +After `uip agent init`, the eval-related project structure is: + +``` +my-agent/ + agent.json + SolutionStorage.json # Created after `uip agent push` + evals/ + evaluators/ + evaluator-default.json # Semantic similarity evaluator + evaluator-default-trajectory.json # Trajectory evaluator + eval-sets/ + evaluation-set-default.json # Default eval set (references both evaluators) +``` + +Evaluators live in `evals/evaluators/` and eval sets (with inline test cases) live in `evals/eval-sets/`. Both are auto-discovered by the CLI from these directories. 
## Key Differences from Coded Agent Evals

| Aspect | Coded (`uip codedagent eval`) | Low-code (`uip agent eval`) |
|--------|-------------------------------|------------------------------|
| Execution | Local Python process | Cloud-based via Agent Runtime |
| Auth required | Only for `--report` | Always (cloud execution) |
| Prerequisite | `entry-points.json` | `uip agent push` (SolutionStorage.json) |
| Mocking | `@mockable()` decorator + declarative | Simulation instructions only |
| CLI prefix | `uip codedagent eval` | `uip agent eval` |

## Troubleshooting

| Error | Cause | Fix |
|-------|-------|-----|
| `SolutionStorage.json not found` | Agent not pushed to Studio Web | Run `uip agent push --output json` |
| `No evaluators found` | Empty `evals/evaluators/` directory | Run `uip agent eval evaluator add` or re-init with `uip agent init` |
| `No test cases in eval set` | Eval set has no evaluations | Run `uip agent eval add` to add test cases |
| `401 Unauthorized` | Auth expired | Run `uip login --output json` |
| Eval run timeout | Agent taking too long or stuck | Increase `--timeout` or check agent health in Studio Web |
| `same-as-agent` model error | Evaluator model can't be resolved | Set an explicit model in the evaluator config instead of `"same-as-agent"` |
diff --git a/skills/uipath-agents/references/lowcode/evaluation/evaluation-sets.md b/skills/uipath-agents/references/lowcode/evaluation/evaluation-sets.md
new file mode 100644
index 000000000..490faf2a2
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluation/evaluation-sets.md
@@ -0,0 +1,140 @@
+# Evaluation Sets and Test Cases

Evaluation sets group test cases and reference which evaluators to use. Each set is a JSON file in `evals/eval-sets/`. Test cases are stored inline within the eval set.

## Managing Eval Sets

### Add an eval set

```bash
uip agent eval set add <name> --path <path> --output json
```

**Options:**

| Flag | Required | Description | Default |
|------|----------|-------------|---------|
| `--evaluators <ids>` | No | Comma-separated evaluator IDs | All existing evaluators |
| `--path <path>` | No | Agent project directory | `.` |

When `--evaluators` is not provided, the new eval set automatically references **all** evaluators in the project.

### List eval sets

```bash
uip agent eval set list --path <path> --output json
```

### Remove an eval set

```bash
uip agent eval set remove <name-or-id> --path <path> --output json
```

## Managing Test Cases

Test cases live inside eval sets. Each test case defines an input, expected output, and optional behavior expectations.
### Add a test case

```bash
uip agent eval add <name> \
  --set "<eval-set-name>" \
  --inputs '{"input":"hello"}' \
  --expected '{"content":"greeting response"}' \
  --path <path> \
  --output json
```

**Options:**

| Flag | Required | Description | Default |
|------|----------|-------------|---------|
| `--set <name-or-id>` | Yes | Eval set name or ID | — |
| `--inputs <json>` | Yes | Input values as JSON | — |
| `--expected <json>` | No | Expected output as JSON | `{}` |
| `--expected-agent-behavior <text>` | No | Description of expected behavior (used by trajectory evaluator) | `""` |
| `--simulation-instructions <text>` | No | Instructions for simulating agent behavior | `""` |
| `--simulate-input` | No | Enable input simulation | `false` |
| `--simulate-tools` | No | Enable tool simulation | `false` |
| `--input-generation-instructions <text>` | No | Instructions for generating synthetic inputs | `""` |
| `--path <path>` | No | Agent project directory | `.` |

### List test cases

```bash
uip agent eval list --set "<eval-set-name>" --path <path> --output json
```

### Remove a test case

```bash
uip agent eval remove <name-or-id> --set "<eval-set-name>" --path <path> --output json
```

## Test Case Design

### Matching evaluator to test case fields

| Evaluator Type | Key Test Case Fields |
|---------------|---------------------|
| Semantic Similarity | `--inputs`, `--expected` |
| Trajectory | `--inputs`, `--expected-agent-behavior` |
| Context Precision | `--inputs`, `--expected` |
| Faithfulness | `--inputs`, `--expected` |

For trajectory evaluation, write `--expected-agent-behavior` as a natural language description of what the agent should do, not what it should output:

```bash
uip agent eval add tool-usage-test \
  --set "Default Evaluation Set" \
  --inputs '{"input":"What is the weather in NYC?"}' \
  --expected-agent-behavior "Agent should call the weather tool with location NYC and return a formatted weather summary" \
  --path ./my-agent --output json
```

### Simulation options

- `--simulate-input` — the runtime generates synthetic input variations based on the provided input
- `--simulate-tools` — tool calls are simulated rather than executed against real services
- `--input-generation-instructions` — guides synthetic input generation (e.g., "generate edge cases with empty strings and special characters")
- `--simulation-instructions` — guides the overall simulation behavior

These are useful for expanding test coverage without writing every input by hand.

## Eval Set JSON Format

```json
{
  "fileName": "evaluation-set-default.json",
  "id": "<uuid>",
  "name": "Default Evaluation Set",
  "batchSize": 10,
  "evaluatorRefs": ["<evaluator-uuid>", "<evaluator-uuid>"],
  "evaluations": [
    {
      "id": "<uuid>",
      "name": "happy-path",
      "inputs": {"input": "hello"},
      "expectedOutput": {"content": "greeting"},
      "expectedAgentBehavior": "",
      "simulationInstructions": "",
      "simulateInput": false,
      "simulateTools": false,
      "inputGenerationInstructions": "",
      "evalSetId": "<eval-set-uuid>",
      "source": "manual",
      "createdAt": "...",
      "updatedAt": "..."
    }
  ],
  "modelSettings": [],
  "agentMemoryEnabled": false,
  "agentMemorySettings": [],
  "lineByLineEvaluation": false,
  "createdAt": "...",
  "updatedAt": "..."
}
```

The `source` field indicates how the test case was created: `"manual"` (CLI), `"debugRun"` (from a debug session), `"runtimeRun"` (from a live run), `"simulatedRun"`, or `"autopilotUserInitiated"`.
diff --git a/skills/uipath-agents/references/lowcode/evaluation/evaluators.md b/skills/uipath-agents/references/lowcode/evaluation/evaluators.md
new file mode 100644
index 000000000..bfb284d6d
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluation/evaluators.md
@@ -0,0 +1,102 @@
+# Evaluators

Evaluators define how agent output is scored. Each evaluator is a JSON file in `evals/evaluators/`.

## Evaluator Types

| Type | CLI Flag | What It Scores |
|------|----------|----------------|
| Semantic Similarity | `semantic-similarity` | Whether the agent's output has the same meaning as the expected output |
| Trajectory | `trajectory` | Whether the agent's reasoning path and tool usage match expected behavior |
| Context Precision | `context-precision` | Whether retrieved context is relevant to the user's query |
| Faithfulness | `faithfulness` | Whether the agent's output is grounded in the provided context |

## Managing Evaluators

### Add an evaluator

```bash
uip agent eval evaluator add <name> --type <type> --path <path> --output json
```

**Options:**

| Flag | Required | Description | Default |
|------|----------|-------------|---------|
| `--type <type>` | Yes | One of: `semantic-similarity`, `trajectory`, `context-precision`, `faithfulness` | — |
| `--description <text>` | No | Human-readable description | Auto-generated from type |
| `--prompt <text>` | No | Custom LLM evaluation prompt | Built-in default per type |
| `--target-key <key>` | No | Specific output key to evaluate | `*` (all keys) |
| `--path <path>` | No | Agent project directory | `.` |

**Example:**
```bash
uip agent eval evaluator add content-quality \
  --type semantic-similarity \
  --path ./my-agent \
  --output json
```

### List evaluators

```bash
uip agent eval evaluator list --path <path> --output json
```

### Remove an evaluator

```bash
uip agent eval evaluator remove <name-or-id> --path <path> --output json
```

Removing an evaluator automatically removes its references from all eval sets that reference it.

## Default Evaluators

`uip agent init` creates two default evaluators:

### Semantic Similarity (`evaluator-default.json`)

Compares expected vs actual output for semantic equivalence. Uses template variables `{{ExpectedOutput}}` and `{{ActualOutput}}`. Scores 0–100.

### Trajectory (`evaluator-default-trajectory.json`)

Evaluates the agent's reasoning path against expected behavior. Uses template variables `{{UserOrSyntheticInput}}`, `{{SimulationInstructions}}`, `{{ExpectedAgentBehavior}}`, and `{{AgentRunHistory}}`. Scores 0–100.

Both default evaluators use `"same-as-agent"` as the model, which resolves to the agent's configured model at runtime.

## Evaluator JSON Format

```json
{
  "fileName": "evaluator-content-quality.json",
  "id": "<uuid>",
  "name": "content-quality",
  "description": "Evaluates semantic similarity of output",
  "category": 1,
  "type": 5,
  "prompt": "Compare {{ExpectedOutput}} with {{ActualOutput}}...",
  "model": "same-as-agent",
  "targetOutputKey": "*",
  "createdAt": "2025-01-01T00:00:00.000Z",
  "updatedAt": "2025-01-01T00:00:00.000Z"
}
```

**Type and category mapping:**

| CLI Type | `type` (numeric) | `category` |
|----------|-------------------|------------|
| `semantic-similarity` | 5 | 1 (output-based) |
| `trajectory` | 7 | 3 (trajectory-based) |
| `context-precision` | 8 | 1 (output-based) |
| `faithfulness` | 9 | 1 (output-based) |

## Custom Prompts

When `--prompt` is omitted, the CLI uses a built-in default prompt for each type.
To customize, pass a prompt string using the appropriate template variables:

- **Semantic Similarity**: `{{ExpectedOutput}}`, `{{ActualOutput}}`
- **Trajectory**: `{{AgentRunHistory}}`, `{{ExpectedBehavior}}`
- **Context Precision**: `{{UserQuery}}`, `{{RetrievedContext}}`
- **Faithfulness**: `{{AgentOutput}}`, `{{Context}}`
diff --git a/skills/uipath-agents/references/lowcode/evaluation/running-evaluations.md b/skills/uipath-agents/references/lowcode/evaluation/running-evaluations.md
new file mode 100644
index 000000000..0ffc4ebe8
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluation/running-evaluations.md
@@ -0,0 +1,163 @@
+# Running Evaluations

Execute evaluations against the Agent Runtime, check status, view results, and compare runs.

All run commands require the agent to be pushed to Studio Web first (`uip agent push`). The Agent Runtime executes test cases in the cloud using the pushed agent definition.

## Start an Eval Run

```bash
uip agent eval run start --set "<eval-set-name>" --path <path> --wait --output json
```

**Options:**

| Flag | Required | Description | Default |
|------|----------|-------------|---------|
| `--set <name-or-id>` | Yes | Eval set name or ID | — |
| `--path <path>` | No | Agent project directory | `.` |
| `--wait` | No | Poll until completion and show results | `false` |
| `--timeout <seconds>` | No | Polling timeout (with `--wait`) | 600 (10 min) |
| `--solution-id <id>` | No | Override solution ID | Auto-resolved from `SolutionStorage.json` |

Without `--wait`, the command returns immediately with an `EvalSetRunId`:

```json
{
  "Code": "AgentEvalRunStarted",
  "Data": {
    "EvalSetRunId": "a1b2c3d4-...",
    "EvalSetName": "Default Evaluation Set",
    "TestCases": 5,
    "Evaluators": 2
  }
}
```

With `--wait`, the CLI polls every 5 seconds until completion, then outputs both a summary and per-test-case results.

## Check Run Status

```bash
uip agent eval run status --set "<eval-set-name>" --path <path> --output json
```

**Output:**
```json
{
  "Code": "AgentEvalRunStatus",
  "Data": {
    "EvalSetRunId": "a1b2c3d4-...",
    "Status": "completed",
    "Score": 0.86,
    "Duration": "42.5s",
    "EvaluatorScores": "semantic: 0.9, trajectory: 0.82"
  }
}
```

Terminal states: `completed` or `failed`.

## View Results

```bash
uip agent eval run results \
  --set "<eval-set-name>" \
  --path <path> \
  --output json
```

**Options:**

| Flag | Description |
|------|-------------|
| `--only-failed` | Show only failed or errored test cases |
| `--verbose` | Include evaluator justifications in output |
| `--export-format <format>` | Export results to file (`eval-results-{timestamp}.json` or `.csv`) |

**Per-test-case output fields:** `TestCase`, `Status`, `Score`, `EvaluatorScores`, `Tokens`, `Duration`, `Error` (plus `Justifications` when `--verbose`).

### Failure detection

A test case is considered **failed** if any of these are true:
- Status is `failed`
- Has an error message
- Any evaluator score type is `error`
- Any exact-match evaluator returned `false`

## List Past Runs

```bash
uip agent eval run list --set "<eval-set-name>" --path <path> --output json
```

**Per-row output:** `EvalSetRunId`, `Status`, `Score`, `TestCases`, `Duration`, `EvaluatorScores`, `CreatedAt`.
## Compare Runs

Compare two eval runs side by side to see score changes:

```bash
uip agent eval run compare \
  --compare-to <run-id> \
  --set "<eval-set-name>" \
  --path <path> \
  --output json
```

**Output:**
```json
{
  "Code": "AgentEvalRunComparison",
  "Data": {
    "RunA": { "Id": "...", "Score": 0.86, "Status": "completed" },
    "RunB": { "Id": "...", "Score": 0.80, "Status": "completed" },
    "ScoreDelta": 0.06,
    "TestCases": [
      {
        "TestCase": "happy-path",
        "ScoreA": 1.0,
        "ScoreB": 0.9,
        "Delta": "+0.1",
        "StatusA": "completed",
        "StatusB": "completed"
      }
    ]
  }
}
```

Use `compare` after prompt changes to verify improvements without regressions.

## Workflow Example

```bash
# 1. Push agent to Studio Web (if not already done)
uip agent push --path ./my-agent --output json

# 2. Add test cases
uip agent eval add greeting-test \
  --set "Default Evaluation Set" \
  --inputs '{"input":"hi there"}' \
  --expected '{"content":"Hello! How can I help you?"}' \
  --expected-agent-behavior "Agent should respond with a friendly greeting" \
  --path ./my-agent --output json

# 3. Run and wait
uip agent eval run start \
  --set "Default Evaluation Set" \
  --path ./my-agent \
  --wait --output json

# 4. Review failures
uip agent eval run results \
  --set "Default Evaluation Set" \
  --only-failed --verbose \
  --path ./my-agent --output json

# 5. Make changes, push, re-run, compare
uip agent push --path ./my-agent --output json
uip agent eval run start --set "Default Evaluation Set" --path ./my-agent --wait --output json
uip agent eval run compare --compare-to <previous-run-id> \
  --set "Default Evaluation Set" --path ./my-agent --output json
```
diff --git a/skills/uipath-agents/references/lowcode/lowcode.md b/skills/uipath-agents/references/lowcode/lowcode.md
index 6f16e3bd5..a60995399 100644
--- a/skills/uipath-agents/references/lowcode/lowcode.md
+++ b/skills/uipath-agents/references/lowcode/lowcode.md
@@ -46,6 +46,7 @@ Capabilities are **orthogonal**: there is no ordering requirement among them. Ad
| Scaffolding, validating, or running solution lifecycle commands | [project-lifecycle.md](project-lifecycle.md) |
| Editing `agent.json` (prompts, schemas, model, contentTokens) or `entry-points.json` | [agent-definition.md](agent-definition.md) |
| External tools / IS tools / index contexts / escalations behave unexpectedly after `uip solution resource refresh` | [solution-resources.md](solution-resources.md) |
+| Running evaluations, adding test cases, managing evaluators | [evaluation/evaluate.md](evaluation/evaluate.md) |

### Capability Registry

@@ -68,6 +69,7 @@ Capabilities are **orthogonal**: there is no ordering requirement among them. Ad
| Add an Action Center escalation (HITL) | [capabilities/escalation/escalation.md](capabilities/escalation/escalation.md) | |
| Add guardrails (PII, harmful content, custom rules) | [capabilities/guardrails/guardrails.md](capabilities/guardrails/guardrails.md) | |
| Embed an agent inline in a flow | [capabilities/inline-in-flow/inline-in-flow.md](capabilities/inline-in-flow/inline-in-flow.md) | |
+| Evaluate agent (add test cases, run evals, view results) | [evaluation/evaluate.md](evaluation/evaluate.md) | `evaluation/evaluators.md`, `evaluation/evaluation-sets.md`, `evaluation/running-evaluations.md` |
| Set up Orchestrator resources | Tell the user to use the `uipath-platform` skill | |
| Wire agent into a flow | Tell the user to use the `uipath-maestro-flow` skill | |

From 29f59568f780d3788ae5ff2d6181f724bdc5e0be Mon Sep 17 00:00:00 2001
From: Mayank Jha
Date: Mon, 4 May 2026 18:29:50 -0700
Subject: [PATCH 2/5] fix(uipath-agents): correct low-code eval docs against CLI/SDK source
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Address PR #552 review comment from @Chibionos: drop SolutionStorage.json mentions throughout the eval refs (it is going away). Reword troubleshooting, prerequisites, file-structure tree, and the --solution-id default to describe the user-facing concept ("agent pushed to Studio Web") instead.

Folds in additional corrections found while verifying the PR against the uip CLI (Code/cli), uipath-python SDK, and Agents service repo:

- Rename evaluation/ → evaluations/ to match coded sibling convention.
- Move eval row from Capability Registry to "Read on demand" in lowcode.md (eval is lifecycle, not a capability).
- Fix evaluator filename example: actual pattern is evaluator-<id-prefix>.json, not <name>.json. The user-supplied <name> goes into the JSON name field.
- Restore --wait polling cadence (5s) and --timeout default (600s) — both hardcoded in eval-run.ts. Removed earlier when unverified.
- Add complete output Code enum (AgentEvalRunStarted/Completed/Results/ Status/Exported/List/Comparison).
- Expand failure detection with the numeric forms isFailedRun() actually checks (status "3", score.type "2"), plus the SDK status enum.
- Document the worker-side LLM model fail-fast (activities.py) and the same-as-agent resolver error (EvaluatorFactory) — these are runtime, not validate-time, errors.
- Correct context-precision/faithfulness data flow: both are trace-driven (RETRIEVER spans), not test-case-driven; faithfulness reads expectedOutput as the candidate text, not the agent's actual output.
- Add "Why fewer evaluators than coded?" section explaining the legacy vs new SDK engine split, plus the 2 runtime-supported types not exposed by the CLI (Equals=1, JsonSimilarity=6) with copy-pasteable JSON.
- Document validate's category↔type matrix (cat 0→{1,6}, cat 1→{5,8,9}, cat 3→{7}) and required fields per schema-validation-service.ts.
- Add Anti-patterns section to all four eval reference files per skill-structure.md convention.
- Workflow example: insert validate step between add and push.
Co-Authored-By: Claude Opus 4.7 (1M context) --- skills/uipath-agents/SKILL.md | 2 +- .../references/lowcode/evaluation/evaluate.md | 71 ----- .../lowcode/evaluation/evaluation-sets.md | 140 ---------- .../lowcode/evaluation/evaluators.md | 102 -------- .../lowcode/evaluation/running-evaluations.md | 163 ------------ .../lowcode/evaluations/evaluate.md | 90 +++++++ .../lowcode/evaluations/evaluation-sets.md | 163 ++++++++++++ .../lowcode/evaluations/evaluators.md | 242 ++++++++++++++++++ .../evaluations/running-evaluations.md | 207 +++++++++++++++ .../references/lowcode/lowcode.md | 3 +- 10 files changed, 704 insertions(+), 479 deletions(-) delete mode 100644 skills/uipath-agents/references/lowcode/evaluation/evaluate.md delete mode 100644 skills/uipath-agents/references/lowcode/evaluation/evaluation-sets.md delete mode 100644 skills/uipath-agents/references/lowcode/evaluation/evaluators.md delete mode 100644 skills/uipath-agents/references/lowcode/evaluation/running-evaluations.md create mode 100644 skills/uipath-agents/references/lowcode/evaluations/evaluate.md create mode 100644 skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md create mode 100644 skills/uipath-agents/references/lowcode/evaluations/evaluators.md create mode 100644 skills/uipath-agents/references/lowcode/evaluations/running-evaluations.md diff --git a/skills/uipath-agents/SKILL.md b/skills/uipath-agents/SKILL.md index 53fbc6b88..734437628 100644 --- a/skills/uipath-agents/SKILL.md +++ b/skills/uipath-agents/SKILL.md @@ -46,7 +46,7 @@ Determine the agent mode before proceeding: | Add guardrails (PII, harmful content, custom rules) to a low-code agent | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) § Capability Registry | `lowcode/capabilities/guardrails/guardrails.md` | | Add escalation guardrail (escalate action / Action Center app) | Low-code | [lowcode/capabilities/guardrails/guardrails.md](references/lowcode/capabilities/guardrails/guardrails.md) § escalate — Hand Off to Action Center | Run `uip solution resource list --kind App` to confirm app exists | | Embed a low-code agent inline in a flow, or wire a multi-agent solution | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) § Capability Registry | `lowcode/capabilities/inline-in-flow/inline-in-flow.md`, `lowcode/capabilities/process/solution-agent.md` | -| Run low-code evaluations | Low-code | [lowcode/evaluation/evaluate.md](references/lowcode/evaluation/evaluate.md) | `lowcode/evaluation/evaluators.md`, `lowcode/evaluation/evaluation-sets.md`, `lowcode/evaluation/running-evaluations.md` | +| Run low-code evaluations | Low-code | [lowcode/evaluations/evaluate.md](references/lowcode/evaluations/evaluate.md) | `lowcode/evaluations/evaluators.md`, `lowcode/evaluations/evaluation-sets.md`, `lowcode/evaluations/running-evaluations.md` | | Validate, pack, publish, upload, or deploy a low-code agent | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) | `lowcode/project-lifecycle.md`, `lowcode/solution-resources.md` | ## Resources diff --git a/skills/uipath-agents/references/lowcode/evaluation/evaluate.md b/skills/uipath-agents/references/lowcode/evaluation/evaluate.md deleted file mode 100644 index c60e450cd..000000000 --- a/skills/uipath-agents/references/lowcode/evaluation/evaluate.md +++ /dev/null @@ -1,71 +0,0 @@ -# Evaluate Low-Code Agents - -Design and run evaluations against low-code agents using the `uip agent eval` CLI. 
- -## Quick Reference - -```bash -# Add a test case -uip agent eval add happy-path --set "Default Evaluation Set" --inputs '{"input":"hello"}' --expected '{"content":"greeting"}' --path ./my-agent --output json - -# Run evals and wait for results -uip agent eval run start --set "Default Evaluation Set" --path ./my-agent --wait --output json - -# Check results (failures only, with justifications) -uip agent eval run results --set "Default Evaluation Set" --only-failed --verbose --path ./my-agent --output json -``` - -## Prerequisites - -- Agent project initialized (`uip agent init`) -- Agent pushed to Studio Web (`uip agent push`) — required for running evals (the Agent Runtime executes test cases in the cloud) -- `SolutionStorage.json` exists in the agent project (created by `uip agent push`) - -Local operations (managing evaluators, eval sets, test cases) do **not** require authentication or a cloud connection. - -## Reference Navigation - -- [Evaluators](evaluators.md) — evaluator types, adding/removing, default prompts -- [Evaluation Sets and Test Cases](evaluation-sets.md) — creating sets, adding test cases, simulation options -- [Running Evaluations](running-evaluations.md) — start, status, results, compare - -Read Evaluators before choosing an evaluator type, and Evaluation Sets before writing test cases. - -## File Structure - -After `uip agent init`, the eval-related project structure is: - -``` -my-agent/ - agent.json - SolutionStorage.json # Created after `uip agent push` - evals/ - evaluators/ - evaluator-default.json # Semantic similarity evaluator - evaluator-default-trajectory.json # Trajectory evaluator - eval-sets/ - evaluation-set-default.json # Default eval set (references both evaluators) -``` - -Evaluators live in `evals/evaluators/` and eval sets (with inline test cases) live in `evals/eval-sets/`. Both are auto-discovered by the CLI from these directories. 
-## Key Differences from Coded Agent Evals
-
-| Aspect | Coded (`uip codedagent eval`) | Low-code (`uip agent eval`) |
-|--------|-------------------------------|------------------------------|
-| Execution | Local Python process | Cloud-based via Agent Runtime |
-| Auth required | Only for `--report` | Always (cloud execution) |
-| Prerequisite | `entry-points.json` | `uip agent push` (SolutionStorage.json) |
-| Mocking | `@mockable()` decorator + declarative | Simulation instructions only |
-| CLI prefix | `uip codedagent eval` | `uip agent eval` |
-
-## Troubleshooting
-
-| Error | Cause | Fix |
-|-------|-------|-----|
-| `SolutionStorage.json not found` | Agent not pushed to Studio Web | Run `uip agent push --output json` |
-| `No evaluators found` | Empty `evals/evaluators/` directory | Run `uip agent eval evaluator add` or re-init with `uip agent init` |
-| `No test cases in eval set` | Eval set has no evaluations | Run `uip agent eval add` to add test cases |
-| `401 Unauthorized` | Auth expired | Run `uip login --output json` |
-| Eval run timeout | Agent taking too long or stuck | Increase `--timeout` or check agent health in Studio Web |
-| `same-as-agent` model error | Evaluator model can't be resolved | Set an explicit model in the evaluator config instead of `"same-as-agent"` |
diff --git a/skills/uipath-agents/references/lowcode/evaluation/evaluation-sets.md b/skills/uipath-agents/references/lowcode/evaluation/evaluation-sets.md
deleted file mode 100644
index 490faf2a2..000000000
--- a/skills/uipath-agents/references/lowcode/evaluation/evaluation-sets.md
+++ /dev/null
@@ -1,140 +0,0 @@
-# Evaluation Sets and Test Cases
-
-Evaluation sets group test cases and reference which evaluators to use. Each set is a JSON file in `evals/eval-sets/`. Test cases are stored inline within the eval set.
-
-## Managing Eval Sets
-
-### Add an eval set
-
-```bash
-uip agent eval set add <name> --path <path> --output json
-```
-
-**Options:**
-
-| Flag | Required | Description | Default |
-|------|----------|-------------|---------|
-| `--evaluators <ids>` | No | Comma-separated evaluator IDs | All existing evaluators |
-| `--path <path>` | No | Agent project directory | `.` |
-
-When `--evaluators` is not provided, the new eval set automatically references **all** evaluators in the project.
-
-### List eval sets
-
-```bash
-uip agent eval set list --path <path> --output json
-```
-
-### Remove an eval set
-
-```bash
-uip agent eval set remove <name-or-id> --path <path> --output json
-```
-
-## Managing Test Cases
-
-Test cases live inside eval sets. Each test case defines an input, expected output, and optional behavior expectations.
-### Add a test case
-
-```bash
-uip agent eval add <name> \
-  --set "<eval-set-name>" \
-  --inputs '{"input":"hello"}' \
-  --expected '{"content":"greeting response"}' \
-  --path <path> \
-  --output json
-```
-
-**Options:**
-
-| Flag | Required | Description | Default |
-|------|----------|-------------|---------|
-| `--set <name-or-id>` | Yes | Eval set name or ID | — |
-| `--inputs <json>` | Yes | Input values as JSON | — |
-| `--expected <json>` | No | Expected output as JSON | `{}` |
-| `--expected-agent-behavior <text>` | No | Description of expected behavior (used by trajectory evaluator) | `""` |
-| `--simulation-instructions <text>` | No | Instructions for simulating agent behavior | `""` |
-| `--simulate-input` | No | Enable input simulation | `false` |
-| `--simulate-tools` | No | Enable tool simulation | `false` |
-| `--input-generation-instructions <text>` | No | Instructions for generating synthetic inputs | `""` |
-| `--path <path>` | No | Agent project directory | `.` |
-
-### List test cases
-
-```bash
-uip agent eval list --set "<eval-set-name>" --path <path> --output json
-```
-
-### Remove a test case
-
-```bash
-uip agent eval remove <name-or-id> --set "<eval-set-name>" --path <path> --output json
-```
-
-## Test Case Design
-
-### Matching evaluator to test case fields
-
-| Evaluator Type | Key Test Case Fields |
-|---------------|---------------------|
-| Semantic Similarity | `--inputs`, `--expected` |
-| Trajectory | `--inputs`, `--expected-agent-behavior` |
-| Context Precision | `--inputs`, `--expected` |
-| Faithfulness | `--inputs`, `--expected` |
-
-For trajectory evaluation, write `--expected-agent-behavior` as a natural language description of what the agent should do, not what it should output:
-
-```bash
-uip agent eval add tool-usage-test \
-  --set "Default Evaluation Set" \
-  --inputs '{"input":"What is the weather in NYC?"}' \
-  --expected-agent-behavior "Agent should call the weather tool with location NYC and return a formatted weather summary" \
-  --path ./my-agent --output json
-```
-
-### Simulation options
-
-- `--simulate-input` — the runtime generates synthetic input variations based on the provided input
-- `--simulate-tools` — tool calls are simulated rather than executed against real services
-- `--input-generation-instructions` — guides synthetic input generation (e.g., "generate edge cases with empty strings and special characters")
-- `--simulation-instructions` — guides the overall simulation behavior
-
-These are useful for expanding test coverage without writing every input by hand.
-
-## Eval Set JSON Format
-
-```json
-{
-  "fileName": "evaluation-set-default.json",
-  "id": "<uuid>",
-  "name": "Default Evaluation Set",
-  "batchSize": 10,
-  "evaluatorRefs": ["<evaluator-uuid>", "<evaluator-uuid>"],
-  "evaluations": [
-    {
-      "id": "<uuid>",
-      "name": "happy-path",
-      "inputs": {"input": "hello"},
-      "expectedOutput": {"content": "greeting"},
-      "expectedAgentBehavior": "",
-      "simulationInstructions": "",
-      "simulateInput": false,
-      "simulateTools": false,
-      "inputGenerationInstructions": "",
-      "evalSetId": "<eval-set-uuid>",
-      "source": "manual",
-      "createdAt": "...",
-      "updatedAt": "..."
-    }
-  ],
-  "modelSettings": [],
-  "agentMemoryEnabled": false,
-  "agentMemorySettings": [],
-  "lineByLineEvaluation": false,
-  "createdAt": "...",
-  "updatedAt": "..."
-}
-```
-
-The `source` field indicates how the test case was created: `"manual"` (CLI), `"debugRun"` (from a debug session), `"runtimeRun"` (from a live run), `"simulatedRun"`, or `"autopilotUserInitiated"`.
diff --git a/skills/uipath-agents/references/lowcode/evaluation/evaluators.md b/skills/uipath-agents/references/lowcode/evaluation/evaluators.md
deleted file mode 100644
index bfb284d6d..000000000
--- a/skills/uipath-agents/references/lowcode/evaluation/evaluators.md
+++ /dev/null
@@ -1,102 +0,0 @@
-# Evaluators
-
-Evaluators define how agent output is scored. Each evaluator is a JSON file in `evals/evaluators/`.
-
-## Evaluator Types
-
-| Type | CLI Flag | What It Scores |
-|------|----------|----------------|
-| Semantic Similarity | `semantic-similarity` | Whether the agent's output has the same meaning as the expected output |
-| Trajectory | `trajectory` | Whether the agent's reasoning path and tool usage match expected behavior |
-| Context Precision | `context-precision` | Whether retrieved context is relevant to the user's query |
-| Faithfulness | `faithfulness` | Whether the agent's output is grounded in the provided context |
-
-## Managing Evaluators
-
-### Add an evaluator
-
-```bash
-uip agent eval evaluator add <name> --type <type> --path <path> --output json
-```
-
-**Options:**
-
-| Flag | Required | Description | Default |
-|------|----------|-------------|---------|
-| `--type <type>` | Yes | One of: `semantic-similarity`, `trajectory`, `context-precision`, `faithfulness` | — |
-| `--description <text>` | No | Human-readable description | Auto-generated from type |
-| `--prompt <text>` | No | Custom LLM evaluation prompt | Built-in default per type |
-| `--target-key <key>` | No | Specific output key to evaluate | `*` (all keys) |
-| `--path <path>` | No | Agent project directory | `.` |
-
-**Example:**
-```bash
-uip agent eval evaluator add content-quality \
-  --type semantic-similarity \
-  --path ./my-agent \
-  --output json
-```
-
-### List evaluators
-
-```bash
-uip agent eval evaluator list --path <path> --output json
-```
-
-### Remove an evaluator

-```bash
-uip agent eval evaluator remove <name-or-id> --path <path> --output json
-```
-
-Removing an evaluator automatically removes its references from all eval sets that reference it.
-
-## Default Evaluators
-
-`uip agent init` creates two default evaluators:
-
-### Semantic Similarity (`evaluator-default.json`)
-
-Compares expected vs actual output for semantic equivalence. Uses template variables `{{ExpectedOutput}}` and `{{ActualOutput}}`. Scores 0–100.
-
-### Trajectory (`evaluator-default-trajectory.json`)
-
-Evaluates the agent's reasoning path against expected behavior. Uses template variables `{{UserOrSyntheticInput}}`, `{{SimulationInstructions}}`, `{{ExpectedAgentBehavior}}`, and `{{AgentRunHistory}}`. Scores 0–100.
-
-Both default evaluators use `"same-as-agent"` as the model, which resolves to the agent's configured model at runtime.
-
-## Evaluator JSON Format
-
-```json
-{
-  "fileName": "evaluator-content-quality.json",
-  "id": "<uuid>",
-  "name": "content-quality",
-  "description": "Evaluates semantic similarity of output",
-  "category": 1,
-  "type": 5,
-  "prompt": "Compare {{ExpectedOutput}} with {{ActualOutput}}...",
-  "model": "same-as-agent",
-  "targetOutputKey": "*",
-  "createdAt": "2025-01-01T00:00:00.000Z",
-  "updatedAt": "2025-01-01T00:00:00.000Z"
-}
-```
-
-**Type and category mapping:**
-
-| CLI Type | `type` (numeric) | `category` |
-|----------|-------------------|------------|
-| `semantic-similarity` | 5 | 1 (output-based) |
-| `trajectory` | 7 | 3 (trajectory-based) |
-| `context-precision` | 8 | 1 (output-based) |
-| `faithfulness` | 9 | 1 (output-based) |
-
-## Custom Prompts
-
-When `--prompt` is omitted, the CLI uses a built-in default prompt for each type.
To customize, pass a prompt string using the appropriate template variables:
-
-- **Semantic Similarity**: `{{ExpectedOutput}}`, `{{ActualOutput}}`
-- **Trajectory**: `{{AgentRunHistory}}`, `{{ExpectedBehavior}}`
-- **Context Precision**: `{{UserQuery}}`, `{{RetrievedContext}}`
-- **Faithfulness**: `{{AgentOutput}}`, `{{Context}}`
diff --git a/skills/uipath-agents/references/lowcode/evaluation/running-evaluations.md b/skills/uipath-agents/references/lowcode/evaluation/running-evaluations.md
deleted file mode 100644
index 0ffc4ebe8..000000000
--- a/skills/uipath-agents/references/lowcode/evaluation/running-evaluations.md
+++ /dev/null
@@ -1,163 +0,0 @@
-# Running Evaluations
-
-Execute evaluations against the Agent Runtime, check status, view results, and compare runs.
-
-All run commands require the agent to be pushed to Studio Web first (`uip agent push`). The Agent Runtime executes test cases in the cloud using the pushed agent definition.
-
-## Start an Eval Run
-
-```bash
-uip agent eval run start --set "<eval-set-name>" --path <path> --wait --output json
-```
-
-**Options:**
-
-| Flag | Required | Description | Default |
-|------|----------|-------------|---------|
-| `--set <name-or-id>` | Yes | Eval set name or ID | — |
-| `--path <path>` | No | Agent project directory | `.` |
-| `--wait` | No | Poll until completion and show results | `false` |
-| `--timeout <seconds>` | No | Polling timeout (with `--wait`) | 600 (10 min) |
-| `--solution-id <id>` | No | Override solution ID | Auto-resolved from `SolutionStorage.json` |
-
-Without `--wait`, the command returns immediately with an `EvalSetRunId`:
-
-```json
-{
-  "Code": "AgentEvalRunStarted",
-  "Data": {
-    "EvalSetRunId": "a1b2c3d4-...",
-    "EvalSetName": "Default Evaluation Set",
-    "TestCases": 5,
-    "Evaluators": 2
-  }
-}
-```
-
-With `--wait`, the CLI polls every 5 seconds until completion, then outputs both a summary and per-test-case results.
-
-## Check Run Status
-
-```bash
-uip agent eval run status --set "<eval-set-name>" --path <path> --output json
-```
-
-**Output:**
-```json
-{
-  "Code": "AgentEvalRunStatus",
-  "Data": {
-    "EvalSetRunId": "a1b2c3d4-...",
-    "Status": "completed",
-    "Score": 0.86,
-    "Duration": "42.5s",
-    "EvaluatorScores": "semantic: 0.9, trajectory: 0.82"
-  }
-}
-```
-
-Terminal states: `completed` or `failed`.
-
-## View Results
-
-```bash
-uip agent eval run results \
-  --set "<eval-set-name>" \
-  --path <path> \
-  --output json
-```
-
-**Options:**
-
-| Flag | Description |
-|------|-------------|
-| `--only-failed` | Show only failed or errored test cases |
-| `--verbose` | Include evaluator justifications in output |
-| `--export-format <format>` | Export results to file (`eval-results-{timestamp}.json` or `.csv`) |
-
-**Per-test-case output fields:** `TestCase`, `Status`, `Score`, `EvaluatorScores`, `Tokens`, `Duration`, `Error` (plus `Justifications` when `--verbose`).
-
-### Failure detection
-
-A test case is considered **failed** if any of these are true:
-- Status is `failed`
-- Has an error message
-- Any evaluator score type is `error`
-- Any exact-match evaluator returned `false`
-
-## List Past Runs
-
-```bash
-uip agent eval run list --set "<eval-set-name>" --path <path> --output json
-```
-
-**Per-row output:** `EvalSetRunId`, `Status`, `Score`, `TestCases`, `Duration`, `EvaluatorScores`, `CreatedAt`.
-## Compare Runs
-
-Compare two eval runs side by side to see score changes:
-
-```bash
-uip agent eval run compare \
-  --compare-to <run-id> \
-  --set "<eval-set-name>" \
-  --path <path> \
-  --output json
-```
-
-**Output:**
-```json
-{
-  "Code": "AgentEvalRunComparison",
-  "Data": {
-    "RunA": { "Id": "...", "Score": 0.86, "Status": "completed" },
-    "RunB": { "Id": "...", "Score": 0.80, "Status": "completed" },
-    "ScoreDelta": 0.06,
-    "TestCases": [
-      {
-        "TestCase": "happy-path",
-        "ScoreA": 1.0,
-        "ScoreB": 0.9,
-        "Delta": "+0.1",
-        "StatusA": "completed",
-        "StatusB": "completed"
-      }
-    ]
-  }
-}
-```
-
-Use `compare` after prompt changes to verify improvements without regressions.
-
-## Workflow Example
-
-```bash
-# 1. Push agent to Studio Web (if not already done)
-uip agent push --path ./my-agent --output json
-
-# 2. Add test cases
-uip agent eval add greeting-test \
-  --set "Default Evaluation Set" \
-  --inputs '{"input":"hi there"}' \
-  --expected '{"content":"Hello! How can I help you?"}' \
-  --expected-agent-behavior "Agent should respond with a friendly greeting" \
-  --path ./my-agent --output json
-
-# 3. Run and wait
-uip agent eval run start \
-  --set "Default Evaluation Set" \
-  --path ./my-agent \
-  --wait --output json
-
-# 4. Review failures
-uip agent eval run results \
-  --set "Default Evaluation Set" \
-  --only-failed --verbose \
-  --path ./my-agent --output json
-
-# 5. Make changes, push, re-run, compare
-uip agent push --path ./my-agent --output json
-uip agent eval run start --set "Default Evaluation Set" --path ./my-agent --wait --output json
-uip agent eval run compare --compare-to <previous-run-id> \
-  --set "Default Evaluation Set" --path ./my-agent --output json
-```
diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluate.md b/skills/uipath-agents/references/lowcode/evaluations/evaluate.md
new file mode 100644
index 000000000..632af61fc
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluations/evaluate.md
@@ -0,0 +1,90 @@
+# Evaluate Low-Code Agents

Design and run evaluations against low-code agents using the `uip agent eval` CLI.

## Quick Reference

```bash
# Add a test case
uip agent eval add happy-path --set "Default Evaluation Set" --inputs '{"input":"hello"}' --expected '{"content":"greeting"}' --path ./my-agent --output json

# Run evals and wait for results
uip agent eval run start --set "Default Evaluation Set" --path ./my-agent --wait --output json

# Check results (failures only, with justifications)
uip agent eval run results --set "Default Evaluation Set" --only-failed --verbose --path ./my-agent --output json
```

## Prerequisites

- Agent project initialized (`uip agent init <name>`)
- `entry-points.json` present (defines `input`/`output` schema that test case `--inputs`/`--expected` must conform to)
- `uip agent validate --output json` passes (validate also checks evals and evaluators)
- Agent pushed to Studio Web (`uip agent push`) — required for running evals (the Agent Runtime executes test cases in the cloud)

Local operations (managing evaluators, eval sets, test cases) do **not** require authentication or a cloud connection. Only `uip agent eval run *` commands require cloud connectivity.
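A minimal pre-flight sequence for a fresh project, assuming the name `my-agent` (a sketch built only from the commands documented in this skill; adjust names and paths to your project):

```bash
# Scaffold the project, then work from inside it
uip agent init my-agent --output json
cd my-agent

# Validate locally; this also checks evals/ and evaluators/
uip agent validate --output json

# Push so the Agent Runtime executes the current definition
uip agent push --output json
```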
## Reference Navigation

- [Evaluators](evaluators.md) — evaluator types, adding/removing, default prompts
- [Evaluation Sets and Test Cases](evaluation-sets.md) — creating sets, adding test cases, simulation options
- [Running Evaluations](running-evaluations.md) — start, status, results, compare

Read Evaluators before choosing an evaluator type, and Evaluation Sets before writing test cases.

## File Structure

After `uip agent init`, the project structure is:

```
my-agent/
  agent.json
  entry-points.json                      # Input/output schema — test case --inputs / --expected must match
  project.uiproj
  flow-layout.json
  evals/
    evaluators/
      evaluator-default.json             # name: "Default Evaluator" (semantic-similarity)
      evaluator-default-trajectory.json  # name: "Default Trajectory Evaluator"
    eval-sets/
      evaluation-set-default.json        # name: "Default Evaluation Set" (references both evaluators)
```

Evaluators live in `evals/evaluators/` and eval sets (with inline test cases) live in `evals/eval-sets/`. Both are auto-discovered by the CLI from these directories.

CLI-added evaluators are written as `evaluator-<id-prefix>.json` (first 8 hex chars of the evaluator UUID). The `<name>` argument populates the `name` field inside the JSON, NOT the filename. Reference evaluators in eval sets by `id` (UUID), not filename.

## Key Differences from Coded Agent Evals

| Aspect | Coded (`uip codedagent eval`) | Low-code (`uip agent eval`) |
|--------|-------------------------------|------------------------------|
| Execution | Local Python process | Cloud-based via Agent Runtime |
| Auth required | Only for `--report` | For `run` commands (cloud execution) |
| Prerequisite | `entry-points.json` | `uip agent push` |
| Mocking | `@mockable()` decorator + declarative | Simulation instructions only |
| CLI prefix | `uip codedagent eval` | `uip agent eval` |

## Troubleshooting

| Error | Cause | Fix |
|-------|-------|-----|
| Solution ID could not be resolved | Agent not pushed to Studio Web | Run `uip agent push --output json`, or pass `--solution-id <id>` explicitly to `uip agent eval run start` |
| `No evaluators found` | Empty `evals/evaluators/` directory | Run `uip agent eval evaluator add` or re-init with `uip agent init` |
| `No test cases in eval set` | Eval set has no evaluations | Run `uip agent eval add` to add test cases |
| `Unknown evaluator type "X"` | Wrong case on `--type` value | Use kebab-case only: `semantic-similarity`, `trajectory`, `context-precision`, `faithfulness` |
| `Evaluator '<name>' is an LLM-based evaluator but 'model' is not set in its evaluatorConfig.` | LLM evaluator JSON has empty/missing `model` and is not `same-as-agent` | Set `"model"` in the evaluator JSON to a valid model (e.g. `claude-haiku-4-5-20251001`), or set it to `"same-as-agent"` and ensure `agent.json` has a model |
| `'same-as-agent' model option requires agent settings. Ensure agent.json contains valid model settings.` | Evaluator uses `"model": "same-as-agent"` but `agent.json` has no resolvable model | Set a model in `agent.json`, or override the evaluator with an explicit model |
| `401 Unauthorized` | Auth expired | Run `uip login --output json` |
| Eval run timeout (with `--wait`) | Agent taking too long or stuck | Increase `--timeout` or check agent health in Studio Web. Note: this only stops the local CLI from blocking; the run continues server-side — query it with `uip agent eval run status --set "<eval-set-name>"` |
| Validate fails with eval errors | Eval set references an evaluator that no longer exists, OR evaluator JSON missing required field, OR `category`/`type` mismatch (see [evaluators.md](evaluators.md) § What `uip agent validate` Checks) | Re-run `uip agent eval evaluator list` and reconcile `evaluatorRefs`; fix per the validate error message |

The two model-resolution errors above are **runtime checks in the cloud eval worker**, not validate-time checks — `uip agent validate` will not catch them. They surface only after `uip agent eval run start`. To pre-empt them, inspect each evaluator's `model` field locally before pushing.

## Anti-patterns

- **Don't run `uip agent eval run start` before `uip agent push`.** The Agent Runtime executes against the pushed agent. Local edits to `agent.json` after the last push will not be reflected in the run.
- **Don't skip `uip agent validate` before push.** Validate checks `evals/` and `evaluators/`; broken eval JSON will not block push but will surface as runtime errors.
- **Don't hand-edit `id` or `evaluatorRefs` UUIDs.** Eval sets reference evaluators by UUID. Renaming an evaluator file or copy-pasting a UUID across evaluators silently breaks resolution.
- **Don't expect filenames to match `<name>`.** CLI-generated evaluator files use `evaluator-<id-prefix>.json`, not `<name>.json`. Look up evaluators by the `name` field inside the JSON, not by filename.
- **Don't pass `--type` in PascalCase.** The CLI rejects `SemanticSimilarity`. Only kebab-case is accepted.
- **Don't reference evaluators across projects.** Each agent project has its own `evals/evaluators/` directory; UUIDs are not portable.
diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md b/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md
new file mode 100644
index 000000000..60a751f96
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md
@@ -0,0 +1,163 @@
+# Evaluation Sets and Test Cases

Evaluation sets group test cases and reference which evaluators to use. Each set is a JSON file in `evals/eval-sets/`. Test cases are stored inline within the eval set.

## Managing Eval Sets

### Add an eval set

```bash
uip agent eval set add <name> --path <path> --output json
```

**Options:**

| Flag | Required | Description | Default |
|------|----------|-------------|---------|
| `--evaluators <ids>` | No | Comma-separated evaluator IDs | All existing evaluators |
| `--path <path>` | No | Agent project directory | `.` |

When `--evaluators` is not provided, the new eval set automatically references **all** evaluators in the project.

### List eval sets

```bash
uip agent eval set list --path <path> --output json
```

### Remove an eval set

```bash
uip agent eval set remove <name-or-id> --path <path> --output json
```

## Managing Test Cases

Test cases live inside eval sets. Each test case defines an input, expected output, and optional behavior expectations.
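As a concrete sketch before the command reference below, assume a hypothetical agent whose `entry-points.json` declares a string input field `input` and an output field `content`; a complete test case for it looks like this:

```bash
# Hypothetical refund-handling agent; the --inputs/--expected keys must
# match the input/output schema in entry-points.json
uip agent eval add refund-request \
  --set "Default Evaluation Set" \
  --inputs '{"input":"I want a refund for order 1234"}' \
  --expected '{"content":"Refund policy explanation with next steps"}' \
  --path ./my-agent --output json
```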
### Add a test case

```bash
uip agent eval add <name> \
  --set "<eval-set-name>" \
  --inputs '{"input":"hello"}' \
  --expected '{"content":"greeting response"}' \
  --path <path> \
  --output json
```

**Options:**

| Flag | Required | Description | Default |
|------|----------|-------------|---------|
| `--set <name-or-id>` | Yes | Eval set name or ID | — |
| `--inputs <json>` | Yes | Input values as JSON | — |
| `--expected <json>` | No | Expected output as JSON | `{}` |
| `--expected-agent-behavior <text>` | No | Description of expected behavior (used by trajectory evaluator) | `""` |
| `--simulation-instructions <text>` | No | Instructions for simulating agent behavior | `""` |
| `--simulate-input` | No | Enable input simulation | `false` |
| `--simulate-tools` | No | Enable tool simulation | `false` |
| `--input-generation-instructions <text>` | No | Instructions for generating synthetic inputs | `""` |
| `--path <path>` | No | Agent project directory | `.` |

### List test cases

```bash
uip agent eval list --set "<eval-set-name>" --path <path> --output json
```

### Remove a test case

```bash
uip agent eval remove <name-or-id> --set "<eval-set-name>" --path <path> --output json
```

## Test Case Design

### Aligning `--inputs` with `entry-points.json`

`--inputs` JSON keys must match the `input` schema in `entry-points.json`. Mismatched keys do not block `eval add` (the CLI stores the JSON verbatim) but will fail at run time when the Agent Runtime invokes the agent. Run `uip agent validate --output json` after adding test cases to surface schema drift.

### Matching evaluator to test case fields

The `--inputs` and `--expected` flags populate `inputs` and `expectedOutput` on the test case. Each evaluator type sources its placeholder values from a different combination of test-case fields and agent run trace:

| Evaluator Type | From test case | From agent run trace |
|----------------|---------------|----------------------|
| `semantic-similarity` | `expectedOutput` → `{{ExpectedOutput}}` | Agent output → `{{ActualOutput}}` |
| `trajectory` | `expectedAgentBehavior` → `{{ExpectedAgentBehavior}}`, `inputs` → `{{UserOrSyntheticInput}}`, `simulationInstructions` → `{{SimulationInstructions}}` | Trace → `{{AgentRunHistory}}` |
| `context-precision` | (none directly used) | RETRIEVER spans `input.value` → `{{UserQuery}}`, `output.value.documents` → `{{RetrievedContext}}` |
| `faithfulness` | `expectedOutput` → `{{AgentOutput}}` (note: it is the *expected* output that is treated as the candidate text to fact-check, not the agent's actual output) | Trace span outputs (RETRIEVER + tool calls) → `{{Context}}` |

`context-precision` and `faithfulness` are **trace-driven evaluators**. They extract `{{UserQuery}}`, `{{RetrievedContext}}`, and `{{Context}}` by walking `openinference.span.kind == "RETRIEVER"` (and other tool spans) on the agent's run trace. Their behavior:

- **The agent must perform retrieval** (Context Grounding / index / DataFabric / a tool that emits an OpenInference RETRIEVER span). Without retrieval spans, the placeholders resolve to empty and scores collapse.
- **`--inputs` and `--expected` are not consumed in the obvious way**: `context-precision` ignores test-case `inputs` (it reads the query from the trace); `faithfulness` reads the *expected* output (not the agent's actual output) as the candidate text.
- **CLI-default placeholders may differ from SDK-internal placeholders.** The CLI writes prompts with `{{UserQuery}}` and `{{RetrievedContext}}` for context-precision, but the SDK's legacy evaluator hardcodes `{{Query}}` and `{{Chunks}}` internally. Inspect the resulting evaluator JSON; if you customize the prompt, match the placeholders the runtime actually substitutes (test with a small run before relying on results).

If the agent has no retrieval step, remove `context-precision` and `faithfulness` from the eval set rather than letting them silently score everything as 0.

For trajectory evaluation, write `--expected-agent-behavior` as a natural language description of what the agent should do, not what it should output:

```bash
uip agent eval add tool-usage-test \
  --set "Default Evaluation Set" \
  --inputs '{"input":"What is the weather in NYC?"}' \
  --expected-agent-behavior "Agent should call the weather tool with location NYC and return a formatted weather summary" \
  --path ./my-agent --output json
```

### Simulation options

- `--simulate-input` — runtime generates synthetic input variations based on the provided input
- `--simulate-tools` — tool calls are simulated rather than executed against real services
- `--input-generation-instructions` — guides synthetic input generation (e.g., "generate edge cases with empty strings and special characters")
- `--simulation-instructions` — guides overall simulation behavior

Use these to expand test coverage without writing every input by hand.

## Eval Set JSON Format

```json
{
  "fileName": "evaluation-set-default.json",
  "id": "<uuid>",
  "name": "Default Evaluation Set",
  "batchSize": 10,
  "evaluatorRefs": ["<evaluator-uuid>", "<evaluator-uuid>"],
  "evaluations": [
    {
      "id": "<uuid>",
      "name": "happy-path",
      "inputs": {"input": "hello"},
      "expectedOutput": {"content": "greeting"},
      "expectedAgentBehavior": "",
      "simulationInstructions": "",
      "simulateInput": false,
      "simulateTools": false,
      "inputGenerationInstructions": "",
      "evalSetId": "<eval-set-uuid>",
      "source": "manual",
      "createdAt": "...",
      "updatedAt": "..."
    }
  ],
  "modelSettings": [],
  "agentMemoryEnabled": false,
  "agentMemorySettings": [],
  "lineByLineEvaluation": false,
  "createdAt": "...",
  "updatedAt": "..."
}
```

The `source` field indicates how the test case was created. CLI-added test cases are always `"manual"` (verified). Other observed values from Studio Web include `"debugRun"`, `"runtimeRun"`, `"simulatedRun"`, and `"autopilotUserInitiated"` — treat the `source` field as an enum but do not set it manually; the CLI and Studio Web own this value.

## Anti-patterns

- **Don't hand-write `evalSetId` or test case `id` UUIDs.** Use `uip agent eval add` so the CLI keeps `evaluations[].evalSetId` consistent with the parent eval set's `id`.
- **Don't add `--inputs` keys that are not in `entry-points.json`.** The runtime will reject the test case at execution time. Run `uip agent validate` to catch this before push.
- **Don't add `context-precision` or `faithfulness` evaluators to an eval set whose agent has no RETRIEVER span.** Both extract their placeholders from agent trace spans, not from `inputs`/`expectedOutput`. No retrieval → scores collapse to 0.
- **Don't expect `faithfulness` to read the agent's actual output.** It reads `expectedOutput` (the criteria field) as the candidate text. To fact-check actual agent output, use `semantic-similarity` against an expected ground truth instead.
## Eval Set JSON Format

```json
{
  "fileName": "evaluation-set-default.json",
  "id": "<eval-set-uuid>",
  "name": "Default Evaluation Set",
  "batchSize": 10,
  "evaluatorRefs": ["<evaluator-uuid>", "<evaluator-uuid>"],
  "evaluations": [
    {
      "id": "<test-case-uuid>",
      "name": "happy-path",
      "inputs": {"input": "hello"},
      "expectedOutput": {"content": "greeting"},
      "expectedAgentBehavior": "",
      "simulationInstructions": "",
      "simulateInput": false,
      "simulateTools": false,
      "inputGenerationInstructions": "",
      "evalSetId": "<eval-set-uuid>",
      "source": "manual",
      "createdAt": "...",
      "updatedAt": "..."
    }
  ],
  "modelSettings": [],
  "agentMemoryEnabled": false,
  "agentMemorySettings": [],
  "lineByLineEvaluation": false,
  "createdAt": "...",
  "updatedAt": "..."
}
```

The `source` field indicates how the test case was created. CLI-added test cases are always `"manual"` (verified). Other observed values from Studio Web include `"debugRun"`, `"runtimeRun"`, `"simulatedRun"`, and `"autopilotUserInitiated"` — treat the `source` field as an enum but do not set it manually; the CLI and Studio Web own this value.

## Anti-patterns

- **Don't hand-write `evalSetId` or test case `id` UUIDs.** Use `uip agent eval add` so the CLI keeps `evaluations[].evalSetId` consistent with the parent eval set's `id`.
- **Don't add `--inputs` keys that are not in `entry-points.json`.** The runtime will reject the test case at execution time. Run `uip agent validate` to catch this before push.
- **Don't add `context-precision` or `faithfulness` evaluators to an eval set whose agent has no RETRIEVER span.** Both extract their placeholders from agent trace spans, not from `inputs`/`expectedOutput`. No retrieval → scores collapse to 0.
- **Don't expect `faithfulness` to read the agent's actual output.** It reads `expectedOutput` (the criteria field) as the candidate text. To fact-check actual agent output, use `semantic-similarity` against an expected ground truth instead.
- **Don't set `--expected '{}'` (empty) and `--expected-agent-behavior ""` together.** The semantic-similarity evaluator scores against an empty `{{ExpectedOutput}}`; the trajectory evaluator scores against an empty `{{ExpectedAgentBehavior}}`. Every run scores low for non-actionable reasons.
- **Don't set the `source` field manually.** Owned by the CLI and Studio Web; hand-edits may be overwritten on the next sync.
diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluators.md b/skills/uipath-agents/references/lowcode/evaluations/evaluators.md
new file mode 100644
index 000000000..89d8c0d61
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluations/evaluators.md
@@ -0,0 +1,242 @@
+# Evaluators
+
Evaluators define how agent output is scored. Each evaluator is a JSON file in `evals/evaluators/`.

## Why fewer evaluators than coded?

The coded eval reference ([coded/lifecycle/evaluations/evaluators.md](../../coded/lifecycle/evaluations/evaluators.md)) lists 13 evaluator types. Low-code lists only 4 because the two surfaces use **different engines** in the SDK:

- **Coded** uses the new evaluator hierarchy (`BaseEvaluator`, eval sets carry `version: "1.0"`). 13 distinct `evaluatorTypeId` strings, each with its own implementation class.
- **Low-code** uses the **legacy** evaluator hierarchy (`BaseLegacyEvaluator`, no `version` field on the eval set). Only 4 implementation classes ship: `LegacyLlmAsAJudgeEvaluator`, `LegacyTrajectoryEvaluator`, `LegacyExactMatchEvaluator`, `LegacyJsonSimilarityEvaluator`.

Most coded evaluator types (`contains`, `binary-classification`, `multiclass-classification`, all four `tool-call-*`, `llm-judge-output-strict-json-similarity`, `llm-judge-trajectory-simulation`) **have no legacy counterpart** and cannot be used on a low-code agent — the cloud eval worker will not load them.

The CLI also narrows the runtime surface further: of the 6 legacy `type` values the runtime accepts, the `--type` flag exposes only 4. See § Runtime-supported types not exposed by the CLI below.

## Evaluator Types (CLI-exposed)

| Type | CLI Flag | What It Scores |
|------|----------|----------------|
| Semantic Similarity | `semantic-similarity` | Whether the agent's output has the same meaning as the expected output |
| Trajectory | `trajectory` | Whether the agent's reasoning path and tool usage match expected behavior |
| Context Precision | `context-precision` | Whether retrieved context is relevant to the user's query |
| Faithfulness | `faithfulness` | Whether the agent's output is grounded in the provided context |

## Runtime-supported types not exposed by the CLI

The eval worker's discriminator (`uipath/eval/evaluators/evaluator.py` § `legacy_evaluator_discriminator`) accepts two more `type` values that have no `--type` flag. To use them, hand-write the evaluator JSON in `evals/evaluators/<filename>.json`:

### `Equals` (type 1, category 0 — Deterministic)

Exact-match comparison; no LLM. Equivalent of coded `uipath-exact-match`.

```json
{
  "fileName": "evaluator-equals.json",
  "id": "<uuid>",
  "name": "exact-match",
  "description": "Exact-match evaluator",
  "category": 0,
  "type": 1,
  "targetOutputKey": "*",
  "createdAt": "<timestamp>",
  "updatedAt": "<timestamp>"
}
```

No `prompt`/`model` required (the Deterministic category bypasses the LLM checks).

### `JsonSimilarity` (type 6, category 0 — Deterministic)

Tree-based JSON comparison; no LLM. Equivalent of coded `uipath-json-similarity`.

```json
{
  "fileName": "evaluator-json-sim.json",
  "id": "<uuid>",
  "name": "json-similarity",
  "description": "JSON similarity evaluator",
  "category": 0,
  "type": 6,
  "targetOutputKey": "*",
  "createdAt": "<timestamp>",
  "updatedAt": "<timestamp>"
}
```

After hand-writing, run `uip agent validate --output json` to confirm the file passes schema migration. Then reference the new evaluator's `id` from your eval set's `evaluatorRefs`. Watch for: `id` collisions with existing evaluators, missing required fields, and ISO-8601 formatting on the timestamps.
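A quick dangling-reference check, assuming `jq` is installed and the default file layout described in evaluate.md:

```bash
# ids referenced by the eval set, minus ids that actually exist on disk —
# any line printed is a dangling evaluatorRef.
comm -23 \
  <(jq -r '.evaluatorRefs[]' evals/eval-sets/evaluation-set-default.json | sort) \
  <(jq -r '.id' evals/evaluators/*.json | sort)
```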
## Coded-only evaluators (NOT available on low-code)

The following coded `evaluatorTypeId` strings have no legacy class — agents working on a low-code agent should not attempt to use them. Switch to a coded agent (`version: "1.0"` eval sets) if you need any of these:

`uipath-contains`, `uipath-llm-judge-output-strict-json-similarity`, `uipath-llm-judge-trajectory-simulation`, `uipath-binary-classification`, `uipath-multiclass-classification`, `uipath-tool-call-order`, `uipath-tool-call-args`, `uipath-tool-call-count`, `uipath-tool-call-output`.

## Managing Evaluators

### Add an evaluator

```bash
uip agent eval evaluator add <name> --type <type> --path <project-dir> --output json
```

**Options:**

| Flag | Required | Description | Default |
|------|----------|-------------|---------|
| `--type <type>` | Yes | One of: `semantic-similarity`, `trajectory`, `context-precision`, `faithfulness` | — |
| `--description <text>` | No | Human-readable description | Auto-generated from type |
| `--prompt <prompt>` | No | Custom LLM evaluation prompt | Built-in default per type |
| `--target-key <key>` | No | Specific output key to evaluate | `*` (all keys) |
| `--path <dir>` | No | Agent project directory | `.` |

**Example:**
```bash
uip agent eval evaluator add content-quality \
  --type semantic-similarity \
  --path ./my-agent \
  --output json
```

### List evaluators

```bash
uip agent eval evaluator list --path <project-dir> --output json
```

### Remove an evaluator

```bash
uip agent eval evaluator remove <name-or-id> --path <project-dir> --output json
```

Removing an evaluator automatically removes its references from all eval sets that reference it.

## Default Evaluators

`uip agent init` creates two default evaluators:

### Semantic Similarity (`evaluator-default.json`, `name: "Default Evaluator"`)

Compares expected vs actual output for semantic equivalence. The default prompt asks the LLM for a 0–100 score and substitutes `{{ExpectedOutput}}` and `{{ActualOutput}}`.

### Trajectory (`evaluator-default-trajectory.json`, `name: "Default Trajectory Evaluator"`)

Evaluates the agent's reasoning path against expected behavior. The default prompt asks the LLM for a 0–100 score and substitutes `{{UserOrSyntheticInput}}`, `{{SimulationInstructions}}`, `{{ExpectedAgentBehavior}}`, and `{{AgentRunHistory}}`.

Both default evaluators ship with `"model": "same-as-agent"` — this is supported and resolves to the agent's configured model at runtime. Override with an explicit model only if you need to score with a different model than the agent uses.

The runtime DTO normalizes all evaluator scores to a 0–100 scale regardless of what the prompt asks for, but mixed-scale prompts in the same eval set produce confusing intermediate values — pick one scale per eval set.

## Filename vs Name

CLI-added evaluators are saved as `evaluator-<uuid8>.json` (first 8 hex chars of the evaluator UUID). The `<name>` argument populates the `name` field inside the JSON; it does NOT shape the filename.

```bash
uip agent eval evaluator add content-quality --type semantic-similarity --path ./my-agent
# Creates: evals/evaluators/evaluator-b47e26ca.json
# JSON has: "name": "content-quality"
```

The two `evaluator-default*.json` files are written by `uip agent init`, not by `evaluator add`. Eval sets reference evaluators by `id` (UUID), not by filename or name.
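To map a `name` back to its file, a small sketch — plain `grep`, assuming the CLI's pretty-printed JSON spacing:

```bash
# Which file holds the evaluator named "content-quality"?
grep -l '"name": "content-quality"' evals/evaluators/*.json
```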
## Evaluator JSON Format

```json
{
  "fileName": "evaluator-b47e26ca.json",
  "id": "b47e26ca-7a13-4c83-9ee4-039d6415fb63",
  "name": "content-quality",
  "description": "Semantic Similarity",
  "category": 1,
  "type": 5,
  "prompt": "As an expert evaluator, ... {{ExpectedOutput}} ... {{ActualOutput}} ...",
  "model": "same-as-agent",
  "targetOutputKey": "*",
  "createdAt": "2026-05-04T00:00:00.000Z",
  "updatedAt": "2026-05-04T00:00:00.000Z"
}
```

**Type and category mapping:**

| CLI Type | `type` (numeric) | `category` |
|----------|-------------------|------------|
| `semantic-similarity` | 5 | 1 (output-based) |
| `trajectory` | 7 | 3 (trajectory-based) |
| `context-precision` | 8 | 1 (output-based) |
| `faithfulness` | 9 | 1 (output-based) |

## Default Prompts and Template Variables

The prompt and score scale the CLI writes when you run `evaluator add` differ from what `uip agent init` writes for the two default evaluators:

| Type | `evaluator add` default | `uip agent init` default |
|------|-------------------------|--------------------------|
| `semantic-similarity` | Asks 0–1; uses `{{ExpectedOutput}}`, `{{ActualOutput}}` | Asks 0–100; same placeholders |
| `trajectory` | Asks 0–1; uses `{{AgentRunHistory}}`, `{{ExpectedBehavior}}` | Asks 0–100; uses `{{UserOrSyntheticInput}}`, `{{SimulationInstructions}}`, `{{ExpectedAgentBehavior}}`, `{{AgentRunHistory}}` |
| `context-precision` | Asks 0–1; uses `{{UserQuery}}`, `{{RetrievedContext}}` | Not created by `init` |
| `faithfulness` | Asks 0–1; uses `{{AgentOutput}}`, `{{Context}}` | Not created by `init` |

Two notable inconsistencies:

1. **Trajectory placeholder names**: `{{ExpectedBehavior}}` (CLI add) vs `{{ExpectedAgentBehavior}}` (init default). When editing a prompt, use the placeholders already present in that file — do not mix.
2. **Score scales**: `evaluator add` writes 0–1 prompts; `init` writes 0–100 prompts. The runtime normalizes both to 0–100 in the result DTO, but the LLM judge actually returns whatever the prompt asks for. Mixed-scale eval sets are hard to read; pick one and rewrite the prompts you don't want.

For `context-precision` and `faithfulness`, the SDK's legacy evaluator may use its own internal placeholders (`{{Query}}`, `{{Chunks}}`) that differ from what the CLI writes. Inspect the resulting evaluator JSON and run a small test before relying on customized prompts. See [evaluation-sets.md](evaluation-sets.md) § Matching evaluator to test case fields for the data flow.

## Custom Prompts

Pass `--prompt` to override the default. Use only the placeholders listed above for the chosen `--type`; unknown placeholders are passed through to the LLM as literal text.

```bash
uip agent eval evaluator add strict-match \
  --type semantic-similarity \
  --prompt 'Score 0-100 how closely {{ActualOutput}} matches {{ExpectedOutput}}. Return JSON {"score": N, "reason": "..."}.' \
  --path ./my-agent --output json
```
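A read-back sketch to confirm what was actually written (assumes `jq`; deterministic evaluators print `null` for the prompt). Check that the placeholders and scale in the stored prompt match what you intended:

```bash
# Print every evaluator's name followed by its stored prompt.
jq -r '"\(.name):\n\(.prompt)\n"' evals/evaluators/*.json
```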
## What `uip agent validate` Checks

Validate runs schema migration, which enforces the following on every file in `evals/evaluators/`:

**Required fields:** `fileName`, `id`, `name`, `description`, `category`, `type`, `targetOutputKey`, `createdAt`, `updatedAt`. Missing field → `Required field "<field>" is missing`.

**Category ↔ type compatibility:**

| Category | Name | Allowed `type` | Additional requirements |
|----------|------|----------------|-------------------------|
| `0` | Deterministic | `1`, `6` | — |
| `1` | LlmAsAJudge | `5`, `8`, `9` | `prompt` and `model` required |
| `3` | Trajectory | `7` | `prompt` and `model` required |

Category `2` (`AgentScorer`) exists in the SDK enum but is reserved/unused — do not write it manually.

Eval sets are validated against a Zod schema. The CLI surfaces the offending file path, JSON path, and message — fix and re-run validate.

## Runtime Errors (Eval Worker)

These errors surface only after `uip agent eval run start` — `uip agent validate` does NOT catch them. They come from the cloud eval worker (`python-eval-worker/workflows/eval/activities.py`) and the SDK's `EvaluatorFactory`.

| Error string | Trigger | Fix |
|--------------|---------|-----|
| `Evaluator '<name>' is an LLM-based evaluator but 'model' is not set in its evaluatorConfig. Specify a valid model name (e.g. 'claude-haiku-4-5-20251001').` | Evaluator JSON has an empty/missing `model` (and is not `same-as-agent`). The worker fails fast before calling the LLM gateway. | Set `model` in the evaluator JSON to a model available in your tenant, or set `"model": "same-as-agent"` and ensure `agent.json` has a model. |
| `'same-as-agent' model option requires agent settings. Ensure agent.json contains valid model settings.` | Evaluator uses `"same-as-agent"` but `agent.json` has no resolvable model. | Set `model` in `agent.json`, or override the evaluator with an explicit model. |

**Pre-empt locally:** before push, run

```bash
uip agent eval evaluator list --path ./my-agent --output json --output-filter '[?model==`""` || model==`null`]'
```

to find any LLM evaluator without an explicit model. (Switch to `--output-filter '[?model==`"same-as-agent"`]'` if you want to flag those that depend on `agent.json`.)
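If the filter flags a file, one possible fix-up before pushing — a sketch that pins an explicit model (assumes `jq`; the filename and model name are illustrative):

```bash
# Pin an explicit model on the offending evaluator, then re-validate.
f=evals/evaluators/evaluator-b47e26ca.json
jq '.model = "claude-haiku-4-5-20251001"' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
uip agent validate --output json   # re-check schema migration after the hand-edit
```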
## Anti-patterns

- **Don't reference an evaluator by filename.** Eval sets reference evaluators by UUID (`id`).
- **Don't pass `--type` in PascalCase.** Only `semantic-similarity`, `trajectory`, `context-precision`, `faithfulness` are accepted.
- **Don't assume `evaluator add` mirrors `init`'s prompts.** They differ for trajectory; check the resulting JSON before reusing template variables in your own scoring tooling.
- **Don't delete an evaluator file by hand.** Use `uip agent eval evaluator remove` so `evaluatorRefs` in eval sets are cleaned up automatically.
- **Don't copy evaluator JSON across projects without regenerating UUIDs.** `id` collisions silently corrupt cross-project resolution.
- **Don't try to add a coded-only evaluator type to a low-code agent.** Anything starting with `uipath-tool-call-*`, `uipath-binary-classification`, `uipath-multiclass-classification`, `uipath-contains`, `uipath-llm-judge-output-strict-json-similarity`, or `uipath-llm-judge-trajectory-simulation` has no legacy class and the eval worker will not load it. If you need one of these, the agent must be coded, not low-code.
- **Don't hand-write a category/type combination outside the validate matrix.** Validate accepts cat 0 → types {1, 6}, cat 1 → types {5, 8, 9}, cat 3 → type {7}. Anything else fails schema migration.
diff --git a/skills/uipath-agents/references/lowcode/evaluations/running-evaluations.md b/skills/uipath-agents/references/lowcode/evaluations/running-evaluations.md
new file mode 100644
index 000000000..8713186b1
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluations/running-evaluations.md
@@ -0,0 +1,207 @@
+# Running Evaluations
+
Execute evaluations against the Agent Runtime, check status, view results, and compare runs.

All run commands require the agent to be pushed to Studio Web first (`uip agent push`). The Agent Runtime executes test cases in the cloud using the pushed agent definition.

## Start an Eval Run

```bash
uip agent eval run start --set "<set-name-or-id>" --path <project-dir> --wait --output json
```

**Options:**

| Flag | Required | Description | Default |
|------|----------|-------------|---------|
| `--set <name-or-id>` | Yes | Eval set name or ID | — |
| `--path <dir>` | No | Agent project directory | `.` |
| `--wait` | No | Block until the run completes, then print results | `false` |
| `--timeout <seconds>` | No | Maximum time to block when `--wait` is set | `600` (10 min) |
| `--solution-id <id>` | No | Override solution ID for this run | Auto-resolved from the pushed-agent state |

Without `--wait`, the command returns immediately with `Code: AgentEvalRunStarted`:

```json
{
  "Code": "AgentEvalRunStarted",
  "Data": {
    "EvalSetRunId": "a1b2c3d4-...",
    "EvalSetName": "Default Evaluation Set",
    "TestCases": 5,
    "Evaluators": 2
  }
}
```

With `--wait`, the CLI polls every 5 seconds (hardcoded interval) until the run reaches a terminal state (`completed` or `failed`) or `--timeout` elapses, then emits `AgentEvalRunCompleted` plus per-test `AgentEvalRunResults`. If `--timeout` elapses first, the run continues server-side; query progress with `eval run status`.

### Output codes

| Subcommand | `Code` |
|------------|--------|
| `run start` (no `--wait`) | `AgentEvalRunStarted` |
| `run start --wait` (summary) | `AgentEvalRunCompleted` |
| `run start --wait` (per-case detail) | `AgentEvalRunResults` |
| `run status` | `AgentEvalRunStatus` |
| `run results` | `AgentEvalRunResults` |
| `run results --export-format` | `AgentEvalRunExported` |
| `run list` | `AgentEvalRunList` |
| `run compare` | `AgentEvalRunComparison` |

## Check Run Status

```bash
uip agent eval run status --set "<set-name-or-id>" --path <project-dir> --output json
```

**Output:**
```json
{
  "Code": "AgentEvalRunStatus",
  "Data": {
    "EvalSetRunId": "a1b2c3d4-...",
    "Status": "completed",
    "Score": 0.86,
    "Duration": "42.5s",
    "EvaluatorScores": "semantic: 0.9, trajectory: 0.82"
  }
}
```

Terminal states: `completed` or `failed`.

## View Results

```bash
uip agent eval run results \
  --set "<set-name-or-id>" \
  --path <project-dir> \
  --output json
```

**Options:**

| Flag | Description |
|------|-------------|
| `--only-failed` | Show only failed or errored test cases |
| `--verbose` | Include evaluator justifications in output |
| `--export-format <format>` | Export results to file as `json` or `csv` (`eval-results-{timestamp}.json` or `.csv`) |

**Per-test-case output fields:** `TestCase`, `Status`, `Score`, `EvaluatorScores`, `Tokens`, `Duration`, `Error` (plus `Justifications` when `--verbose`).
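For offline triage, `--export-format` writes the same rows to a file — a sketch:

```bash
# Writes eval-results-{timestamp}.csv for spreadsheet review.
uip agent eval run results --set "Default Evaluation Set" \
  --path ./my-agent --export-format csv --output json
```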
### Filtering results with `--output-filter`

`--output-filter` takes a JMESPath expression and applies it to the JSON payload before printing. Useful for triage:

```bash
# Print only test cases with a specific name
uip agent eval run results --set "Default Evaluation Set" --path ./my-agent \
  --output json --output-filter 'Data.Results[?TestCase==`"greeting-test"`]'

# Print only the score field for each test case
uip agent eval run results --set "Default Evaluation Set" --path ./my-agent \
  --output json --output-filter 'Data.Results[*].{name: TestCase, score: Score}'
```

### Failure detection

`--only-failed` filters to test cases where any of these are true (`isFailedRun()` in the CLI):

1. `status === "failed"` (or numeric `"3"`)
2. `errorMessage` is non-null
3. `result.score.type === "error"` (or numeric `"2"`)
4. Any `assertionRuns[*].result.score.type === "error"` (or numeric `"2"`)
5. Any `assertionRuns[*].result.score.value === false` (exact-match evaluators that returned a false boolean)

Status enum values from the SDK: `0 = pending`, `1 = running`, `2 = completed`, `3 = failed`. The CLI normalizes string and numeric forms.

## List Past Runs

```bash
uip agent eval run list --set "<set-name-or-id>" --path <project-dir> --output json
```

**Per-row output:** `EvalSetRunId`, `Status`, `Score`, `TestCases`, `Duration`, `EvaluatorScores`, `CreatedAt`.

## Compare Runs

Compare two eval runs side by side to see score changes:

```bash
uip agent eval run compare \
  --compare-to <baseline-run-id> \
  --set "<set-name-or-id>" \
  --path <project-dir> \
  --output json
```

**Output:**
```json
{
  "Code": "AgentEvalRunComparison",
  "Data": {
    "RunA": { "Id": "...", "Score": 0.86, "Status": "completed" },
    "RunB": { "Id": "...", "Score": 0.80, "Status": "completed" },
    "ScoreDelta": 0.06,
    "TestCases": [
      {
        "TestCase": "happy-path",
        "ScoreA": 1.0,
        "ScoreB": 0.9,
        "Delta": "+0.1",
        "StatusA": "completed",
        "StatusB": "completed"
      }
    ]
  }
}
```

Use `compare` after prompt changes to verify improvements without regressions.

## Workflow Example

```bash
# 1. Add test cases
uip agent eval add greeting-test \
  --set "Default Evaluation Set" \
  --inputs '{"input":"hi there"}' \
  --expected '{"content":"Hello! How can I help you?"}' \
  --expected-agent-behavior "Agent should respond with a friendly greeting" \
  --path ./my-agent --output json

# 2. Validate (catches schema drift, missing evaluator refs, broken eval JSON)
uip agent validate --path ./my-agent --output json

# 3. Push agent to Studio Web (required before running evals)
uip agent push --path ./my-agent --output json

# 4. Run and wait
uip agent eval run start \
  --set "Default Evaluation Set" \
  --path ./my-agent \
  --wait --timeout 600 --output json

# 5. Review failures
uip agent eval run results \
  --set "Default Evaluation Set" \
  --only-failed --verbose \
  --path ./my-agent --output json

# 6. Make changes, validate, push, re-run, compare
uip agent validate --path ./my-agent --output json
uip agent push --path ./my-agent --output json
uip agent eval run start --set "Default Evaluation Set" --path ./my-agent --wait --output json
uip agent eval run compare --compare-to <previous-run-id> \
  --set "Default Evaluation Set" --path ./my-agent --output json
```

## Anti-patterns

- **Don't run `eval run start` without `uip agent push` first.** The Agent Runtime executes against the pushed agent, not local files. Local edits made after the last push will not affect results.
+- **Don't assume `--timeout` cancels the server-side run.** It only stops the local CLI from blocking. The run continues and can be inspected with `run status`. +- **Don't skip `uip agent validate` between edits and push.** Validate catches eval-set / evaluator drift that push will accept silently and the runtime will reject. +- **Don't compare runs from different eval sets.** `compare` aligns by test case `name` within the eval set; cross-set deltas are meaningless. +- **Don't rely on `Score` alone — inspect `EvaluatorScores`.** A 0.86 aggregate can mask a faithful-but-wrong agent (high semantic, low trajectory). Use `--verbose` to read justifications when scores look surprising. +- **Don't mix score scales across evaluators in the same eval set.** Defaults written by `uip agent init` use 0–100 prompts; defaults written by `evaluator add` use 0–1 prompts. The runtime DTO normalizes to 0–100, but mixed-scale prompts produce confusing per-evaluator scores. Decide on one scale per eval set and edit prompts to match. diff --git a/skills/uipath-agents/references/lowcode/lowcode.md b/skills/uipath-agents/references/lowcode/lowcode.md index a60995399..d11133c7d 100644 --- a/skills/uipath-agents/references/lowcode/lowcode.md +++ b/skills/uipath-agents/references/lowcode/lowcode.md @@ -46,7 +46,7 @@ Capabilities are **orthogonal**: there is no ordering requirement among them. Ad | Scaffolding, validating, or running solution lifecycle commands | [project-lifecycle.md](project-lifecycle.md) | | Editing `agent.json` (prompts, schemas, model, contentTokens) or `entry-points.json` | [agent-definition.md](agent-definition.md) | | External tools / IS tools / index contexts / escalations behave unexpectedly after `uip solution resource refresh` | [solution-resources.md](solution-resources.md) | -| Running evaluations, adding test cases, managing evaluators | [evaluation/evaluate.md](evaluation/evaluate.md) | +| Running evaluations, adding test cases, managing evaluators | [evaluations/evaluate.md](evaluations/evaluate.md) | ### Capability Registry @@ -69,7 +69,6 @@ Capabilities are **orthogonal**: there is no ordering requirement among them. Ad | Add an Action Center escalation (HITL) | [capabilities/escalation/escalation.md](capabilities/escalation/escalation.md) | | | Add guardrails (PII, harmful content, custom rules) | [capabilities/guardrails/guardrails.md](capabilities/guardrails/guardrails.md) | | | Embed an agent inline in a flow | [capabilities/inline-in-flow/inline-in-flow.md](capabilities/inline-in-flow/inline-in-flow.md) | | -| Evaluate agent (add test cases, run evals, view results) | [evaluation/evaluate.md](evaluation/evaluate.md) | `evaluation/evaluators.md`, `evaluation/evaluation-sets.md`, `evaluation/running-evaluations.md` | | Set up Orchestrator resources | Tell the user to use the `uipath-platform` skill | | | Wire agent into a flow | Tell the user to use the `uipath-maestro-flow` skill | | From 79268a548396931159e4e40fc40c9044ede14b02 Mon Sep 17 00:00:00 2001 From: Mayank Jha Date: Mon, 4 May 2026 18:33:57 -0700 Subject: [PATCH 3/5] fix(uipath-agents): drop context-precision and faithfulness from low-code eval docs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Remove context-precision and faithfulness from the low-code evaluator surface entirely. Updates: - evaluators.md: drop both rows from the CLI-exposed table, the --type description, the type/category mapping, and the default-prompts table. 
Narrow the validate matrix's cat 1 to type {5} only. Update the "Why fewer" intro to reflect 2 supported CLI types.
- evaluation-sets.md: remove the trace-driven data-flow rows for both evaluators, the explanatory callout about RETRIEVER spans, and the related anti-patterns. Test-case design now covers only semantic-similarity + trajectory.
- evaluate.md: narrow the "Unknown evaluator type" troubleshooting hint.

Coded eval refs are unchanged — those use uipath-llm-judge-* IDs, not the legacy CLI names.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../lowcode/evaluations/evaluate.md        |  2 +-
 .../lowcode/evaluations/evaluation-sets.md | 12 ----------
 .../lowcode/evaluations/evaluators.md      | 24 +++++++-------------
 3 files changed, 9 insertions(+), 29 deletions(-)

diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluate.md b/skills/uipath-agents/references/lowcode/evaluations/evaluate.md
index 632af61fc..e77c339c0 100644
--- a/skills/uipath-agents/references/lowcode/evaluations/evaluate.md
+++ b/skills/uipath-agents/references/lowcode/evaluations/evaluate.md
@@ -71,7 +71,7 @@ CLI-added evaluators are written as `evaluator-<uuid8>.json` (first 8 hex chars
 | Solution ID could not be resolved | Agent not pushed to Studio Web | Run `uip agent push --output json`, or pass `--solution-id <id>` explicitly to `uip agent eval run start` |
 | `No evaluators found` | Empty `evals/evaluators/` directory | Run `uip agent eval evaluator add` or re-init with `uip agent init` |
 | `No test cases in eval set` | Eval set has no evaluations | Run `uip agent eval add` to add test cases |
-| `Unknown evaluator type "X"` | Wrong case on `--type` value | Use kebab-case only: `semantic-similarity`, `trajectory`, `context-precision`, `faithfulness` |
+| `Unknown evaluator type "X"` | Wrong case on `--type` value | Use kebab-case only: `semantic-similarity`, `trajectory` |
 | `Evaluator '<name>' is an LLM-based evaluator but 'model' is not set in its evaluatorConfig.` | LLM evaluator JSON has empty/missing `model` and is not `same-as-agent` | Set `"model"` in the evaluator JSON to a valid model (e.g. `claude-haiku-4-5-20251001`), or set it to `"same-as-agent"` and ensure `agent.json` has a model |
 | `'same-as-agent' model option requires agent settings.
Ensure agent.json contains valid model settings.` | Evaluator uses `"model": "same-as-agent"` but `agent.json` has no resolvable model | Set a model in `agent.json`, or override the evaluator with an explicit model | | `401 Unauthorized` | Auth expired | Run `uip login --output json` | diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md b/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md index 60a751f96..528bc0868 100644 --- a/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md +++ b/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md @@ -86,16 +86,6 @@ The `--inputs` and `--expected` flags populate `inputs` and `expectedOutput` on |----------------|---------------|----------------------| | `semantic-similarity` | `expectedOutput` → `{{ExpectedOutput}}` | Agent output → `{{ActualOutput}}` | | `trajectory` | `expectedAgentBehavior` → `{{ExpectedAgentBehavior}}`, `inputs` → `{{UserOrSyntheticInput}}`, `simulationInstructions` → `{{SimulationInstructions}}` | Trace → `{{AgentRunHistory}}` | -| `context-precision` | (none directly used) | RETRIEVER spans `input.value` → `{{UserQuery}}`, `output.value.documents` → `{{RetrievedContext}}` | -| `faithfulness` | `expectedOutput` → `{{AgentOutput}}` (note: it is the *expected* output that is treated as the candidate text to fact-check, not the agent's actual output) | Trace span outputs (RETRIEVER + tool calls) → `{{Context}}` | - -`context-precision` and `faithfulness` are **trace-driven evaluators**. They extract `{{UserQuery}}`, `{{RetrievedContext}}`, and `{{Context}}` by walking `openinference.span.kind == "RETRIEVER"` (and other tool spans) on the agent's run trace. Their behavior: - -- **The agent must perform retrieval** (Context Grounding / index / DataFabric / a tool that emits an OpenInference RETRIEVER span). Without retrieval spans, the placeholders resolve to empty and scores collapse. -- **`--inputs` and `--expected` are not consumed in the obvious way**: `context-precision` ignores test-case `inputs` (it reads the query from the trace); `faithfulness` reads the *expected* output (not the agent's actual output) as the candidate text. -- **CLI-default placeholders may differ from SDK-internal placeholders.** The CLI writes prompts with `{{UserQuery}}` and `{{RetrievedContext}}` for context-precision, but the SDK's legacy evaluator hardcodes `{{Query}}` and `{{Chunks}}` internally. Inspect the resulting evaluator JSON; if you customize the prompt, match the placeholders the runtime actually substitutes (test with a small run before relying on results). - -If the agent has no retrieval step, remove `context-precision` and `faithfulness` from the eval set rather than letting them silently score everything as 0. For trajectory evaluation, write `--expected-agent-behavior` as a natural language description of what the agent should do, not what it should output: @@ -157,7 +147,5 @@ The `source` field indicates how the test case was created. CLI-added test cases - **Don't hand-write `evalSetId` or test case `id` UUIDs.** Use `uip agent eval add` so the CLI keeps `evaluations[].evalSetId` consistent with the parent eval set's `id`. - **Don't add `--inputs` keys that are not in `entry-points.json`.** The runtime will reject the test case at execution time. Run `uip agent validate` to catch this before push. 
-- **Don't add `context-precision` or `faithfulness` evaluators to an eval set whose agent has no RETRIEVER span.** Both extract their placeholders from agent trace spans, not from `inputs`/`expectedOutput`. No retrieval → scores collapse to 0. -- **Don't expect `faithfulness` to read the agent's actual output.** It reads `expectedOutput` (the criteria field) as the candidate text. To fact-check actual agent output, use `semantic-similarity` against an expected ground truth instead. - **Don't set `--expected '{}'` (empty) and `--expected-agent-behavior ""` together.** The semantic-similarity evaluator scores against an empty `{{ExpectedOutput}}`; the trajectory evaluator scores against an empty `{{ExpectedAgentBehavior}}`. Every run scores low for non-actionable reasons. - **Don't set the `source` field manually.** Owned by CLI and Studio Web; hand-edits may be overwritten on the next sync. diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluators.md b/skills/uipath-agents/references/lowcode/evaluations/evaluators.md index 89d8c0d61..2c35f0180 100644 --- a/skills/uipath-agents/references/lowcode/evaluations/evaluators.md +++ b/skills/uipath-agents/references/lowcode/evaluations/evaluators.md @@ -4,14 +4,14 @@ Evaluators define how agent output is scored. Each evaluator is a JSON file in ` ## Why fewer evaluators than coded? -The coded eval reference ([coded/lifecycle/evaluations/evaluators.md](../../coded/lifecycle/evaluations/evaluators.md)) lists 13 evaluator types. Low-code lists only 4 because the two surfaces use **different engines** in the SDK: +The coded eval reference ([coded/lifecycle/evaluations/evaluators.md](../../coded/lifecycle/evaluations/evaluators.md)) lists 13 evaluator types. Low-code lists only 2 supported types because the two surfaces use **different engines** in the SDK: - **Coded** uses the new evaluator hierarchy (`BaseEvaluator`, eval sets carry `version: "1.0"`). 13 distinct `evaluatorTypeId` strings, each with its own implementation class. -- **Low-code** uses the **legacy** evaluator hierarchy (`BaseLegacyEvaluator`, no `version` field on the eval set). Only 4 implementation classes ship: `LegacyLlmAsAJudgeEvaluator`, `LegacyTrajectoryEvaluator`, `LegacyExactMatchEvaluator`, `LegacyJsonSimilarityEvaluator`. +- **Low-code** uses the **legacy** evaluator hierarchy (`BaseLegacyEvaluator`, no `version` field on the eval set). Only `LegacyLlmAsAJudgeEvaluator` (semantic-similarity), `LegacyTrajectoryEvaluator`, `LegacyExactMatchEvaluator`, and `LegacyJsonSimilarityEvaluator` are supported for low-code agents. -Most coded evaluator types (`contains`, `binary-classification`, `multiclass-classification`, all four `tool-call-*`, `llm-judge-output-strict-json-similarity`, `llm-judge-trajectory-simulation`) **have no legacy counterpart** and cannot be used on a low-code agent — the cloud eval worker will not load them. +Most coded evaluator types (`contains`, `binary-classification`, `multiclass-classification`, all four `tool-call-*`, `llm-judge-output-strict-json-similarity`, `llm-judge-trajectory-simulation`) **have no legacy counterpart** and cannot be used on a low-code agent. -The CLI also narrows the runtime surface further: of the 6 legacy `type` values the runtime accepts, the `--type` flag exposes only 4. See § Runtime-supported types not exposed by the CLI below. +The CLI exposes only `semantic-similarity` and `trajectory` via `--type`. 
`Equals` and `JsonSimilarity` are accepted by the runtime but require hand-written JSON — see § Runtime-supported types not exposed by the CLI below.

## Evaluator Types (CLI-exposed)

@@ -19,8 +19,6 @@ The CLI also narrows the runtime surface further: of the 6 legacy `type` values
 |------|----------|----------------|
 | Semantic Similarity | `semantic-similarity` | Whether the agent's output has the same meaning as the expected output |
 | Trajectory | `trajectory` | Whether the agent's reasoning path and tool usage match expected behavior |
-| Context Precision | `context-precision` | Whether retrieved context is relevant to the user's query |
-| Faithfulness | `faithfulness` | Whether the agent's output is grounded in the provided context |

 ## Runtime-supported types not exposed by the CLI

@@ -84,7 +82,7 @@ uip agent eval evaluator add <name> --type <type> --path <project-dir> --output js

 | Flag | Required | Description | Default |
 |------|----------|-------------|---------|
-| `--type <type>` | Yes | One of: `semantic-similarity`, `trajectory`, `context-precision`, `faithfulness` | — |
+| `--type <type>` | Yes | One of: `semantic-similarity`, `trajectory` | — |
 | `--description <text>` | No | Human-readable description | Auto-generated from type |
 | `--prompt <prompt>` | No | Custom LLM evaluation prompt | Built-in default per type |
 | `--target-key <key>` | No | Specific output key to evaluate | `*` (all keys) |
@@ -164,8 +162,6 @@ The two `evaluator-default*.json` files are written by `uip agent init`, not by
 |----------|-------------------|------------|
 | `semantic-similarity` | 5 | 1 (output-based) |
 | `trajectory` | 7 | 3 (trajectory-based) |
-| `context-precision` | 8 | 1 (output-based) |
-| `faithfulness` | 9 | 1 (output-based) |

 ## Default Prompts and Template Variables

@@ -175,16 +171,12 @@ The prompt and score scale the CLI writes when you run `evaluator add` differ f
 |------|-------------------------|--------------------------|
 | `semantic-similarity` | Asks 0–1; uses `{{ExpectedOutput}}`, `{{ActualOutput}}` | Asks 0–100; same placeholders |
 | `trajectory` | Asks 0–1; uses `{{AgentRunHistory}}`, `{{ExpectedBehavior}}` | Asks 0–100; uses `{{UserOrSyntheticInput}}`, `{{SimulationInstructions}}`, `{{ExpectedAgentBehavior}}`, `{{AgentRunHistory}}` |
-| `context-precision` | Asks 0–1; uses `{{UserQuery}}`, `{{RetrievedContext}}` | Not created by `init` |
-| `faithfulness` | Asks 0–1; uses `{{AgentOutput}}`, `{{Context}}` | Not created by `init` |

 Two notable inconsistencies:

 1. **Trajectory placeholder names**: `{{ExpectedBehavior}}` (CLI add) vs `{{ExpectedAgentBehavior}}` (init default). When editing a prompt, use the placeholders already present in that file — do not mix.
 2. **Score scales**: `evaluator add` writes 0–1 prompts; `init` writes 0–100 prompts. The runtime normalizes both to 0–100 in the result DTO, but the LLM judge actually returns whatever the prompt asks for. Mixed-scale eval sets are hard to read; pick one and rewrite the prompts you don't want.

-For `context-precision` and `faithfulness`, the SDK's legacy evaluator may use its own internal placeholders (`{{Query}}`, `{{Chunks}}`) that differ from what the CLI writes. Inspect the resulting evaluator JSON and run a small test before relying on customized prompts. See [evaluation-sets.md](evaluation-sets.md) § Matching evaluator to test case fields for the data flow.
-
 ## Custom Prompts

 Pass `--prompt` to override the default. Use only the placeholders listed above for the chosen `--type`; unknown placeholders are passed through to the LLM as literal text.
@@ -207,7 +199,7 @@ Validate runs schema migration, which enforces the following on every file in `e
 | Category | Name | Allowed `type` | Additional requirements |
 |----------|------|----------------|-------------------------|
 | `0` | Deterministic | `1`, `6` | — |
-| `1` | LlmAsAJudge | `5`, `8`, `9` | `prompt` and `model` required |
+| `1` | LlmAsAJudge | `5` | `prompt` and `model` required |
 | `3` | Trajectory | `7` | `prompt` and `model` required |

 Category `2` (`AgentScorer`) exists in the SDK enum but is reserved/unused — do not write it manually.
@@ -234,9 +226,9 @@ to find any LLM evaluator without an explicit model. (Switch to `--output-filter
 ## Anti-patterns

 - **Don't reference an evaluator by filename.** Eval sets reference evaluators by UUID (`id`).
-- **Don't pass `--type` in PascalCase.** Only `semantic-similarity`, `trajectory`, `context-precision`, `faithfulness` are accepted.
+- **Don't pass `--type` in PascalCase.** Only `semantic-similarity` and `trajectory` are accepted.
 - **Don't assume `evaluator add` mirrors `init`'s prompts.** They differ for trajectory; check the resulting JSON before reusing template variables in your own scoring tooling.
 - **Don't delete an evaluator file by hand.** Use `uip agent eval evaluator remove` so `evaluatorRefs` in eval sets are cleaned up automatically.
 - **Don't copy evaluator JSON across projects without regenerating UUIDs.** `id` collisions silently corrupt cross-project resolution.
 - **Don't try to add a coded-only evaluator type to a low-code agent.** Anything starting with `uipath-tool-call-*`, `uipath-binary-classification`, `uipath-multiclass-classification`, `uipath-contains`, `uipath-llm-judge-output-strict-json-similarity`, or `uipath-llm-judge-trajectory-simulation` has no legacy class and the eval worker will not load it. If you need one of these, the agent must be coded, not low-code.
-- **Don't hand-write a category/type combination outside the validate matrix.** Validate accepts cat 0 → types {1, 6}, cat 1 → types {5, 8, 9}, cat 3 → type {7}. Anything else fails schema migration.
+- **Don't hand-write a category/type combination outside the validate matrix.** Validate accepts cat 0 → types {1, 6}, cat 1 → type {5}, cat 3 → type {7}. Anything else fails schema migration.

From 6cb0dac4092f53f31f4ff7c10a30e83ed63069b4 Mon Sep 17 00:00:00 2001
From: Mayank Jha
Date: Mon, 4 May 2026 18:44:54 -0700
Subject: [PATCH 4/5] docs(uipath-agents): add realistic legacy evaluator JSON examples

Replace the synthetic skeletons in "Runtime-supported types not exposed by the CLI" with the canonical shapes used in real low-code agent projects:

- Equals (type 1) and JsonSimilarity (type 6) keep their Deterministic-category shape (no prompt/model needed) but now use realistic descriptions and filenames.
- Add explicit LlmAsAJudge (type 5) and Trajectory (type 7) JSON shapes for hand-written use, including the full prompt strings, an explicit model pin, and the descriptions used in production examples.
- Soften the filename rule: CLI-generated evaluators use evaluator-<uuid8>.json, but hand-written files can use any descriptive name. The runtime keys off id / evaluatorRefs.
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../lowcode/evaluations/evaluators.md | 54 ++++++++++++++++---
 1 file changed, 47 insertions(+), 7 deletions(-)

diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluators.md b/skills/uipath-agents/references/lowcode/evaluations/evaluators.md
index 2c35f0180..81dd6bdf1 100644
--- a/skills/uipath-agents/references/lowcode/evaluations/evaluators.md
+++ b/skills/uipath-agents/references/lowcode/evaluations/evaluators.md
@@ -24,16 +24,18 @@ The CLI exposes only `semantic-similarity` and `trajectory` via `--type`. `Equal

 The eval worker's discriminator (`uipath/eval/evaluators/evaluator.py` § `legacy_evaluator_discriminator`) accepts two more `type` values that have no `--type` flag. To use them, hand-write the evaluator JSON in `evals/evaluators/<filename>.json`:

+For hand-written files, the filename can be any descriptive name (e.g. `legacy-equality.json`) — the runtime keys off `id` / `evaluatorRefs`, not the filename. The CLI-generated `evaluator-<uuid8>.json` pattern only applies to evaluators created via `uip agent eval evaluator add`.
+
 ### `Equals` (type 1, category 0 — Deterministic)

 Exact-match comparison; no LLM. Equivalent of coded `uipath-exact-match`.

 ```json
 {
-  "fileName": "evaluator-equals.json",
+  "fileName": "legacy-equality.json",
   "id": "<uuid>",
-  "name": "exact-match",
-  "description": "Exact-match evaluator",
+  "name": "Equality Evaluator",
+  "description": "An evaluator that judges the agent based on expected output.",
   "category": 0,
   "type": 1,
   "targetOutputKey": "*",
   "createdAt": "<timestamp>",
   "updatedAt": "<timestamp>"
 }
 ```

@@ -50,10 +52,10 @@ Tree-based JSON comparison; no LLM. Equivalent of coded `uipath-json-similarity`

 ```json
 {
-  "fileName": "evaluator-json-sim.json",
+  "fileName": "legacy-json-similarity.json",
   "id": "<uuid>",
-  "name": "json-similarity",
-  "description": "JSON similarity evaluator",
+  "name": "JSON Similarity Evaluator",
+  "description": "An evaluator that compares JSON structures with tolerance for numeric and string differences.",
   "category": 0,
   "type": 6,
   "targetOutputKey": "*",
   "createdAt": "<timestamp>",
   "updatedAt": "<timestamp>"
 }
 ```

-After hand-writing, run `uip agent validate --output json` to confirm the file passes schema migration. Then reference the new evaluator's `id` from your eval set's `evaluatorRefs`. Watch for: `id` collisions with existing evaluators, missing required fields, and ISO-8601 formatting on the timestamps.
+### `LlmAsAJudge` semantic-similarity (type 5, category 1) — explicit shape
+
+If you want to hand-write a semantic-similarity evaluator instead of using `evaluator add` (e.g. to pin a specific model and prompt), the full shape is:
+
+```json
+{
+  "fileName": "legacy-llm-as-a-judge.json",
+  "id": "<uuid>",
+  "name": "LLM As A Judge Evaluator",
+  "description": "An evaluator that uses an LLM to judge the similarity of the actual output to the expected output",
+  "category": 1,
+  "type": 5,
+  "prompt": "As an expert evaluator, analyze the semantic similarity of these outputs to determine a score from 0-100.\n----\nExpectedOutput:\n{{ExpectedOutput}}\n----\nActualOutput:\n{{ActualOutput}}\n",
+  "targetOutputKey": "*",
+  "model": "gpt-4.1-2025-04-14",
+  "createdAt": "<timestamp>",
+  "updatedAt": "<timestamp>"
+}
+```
+
+### `Trajectory` (type 7, category 3) — explicit shape
+
+```json
+{
+  "fileName": "legacy-trajectory.json",
+  "id": "<uuid>",
+  "name": "Trajectory Evaluator",
+  "description": "An evaluator that analyzes the execution trajectory and decision sequence taken by the agent.",
+  "category": 3,
+  "type": 7,
+  "prompt": "Evaluate the agent's execution trajectory based on the expected behavior.\n\nExpected Agent Behavior: {{ExpectedAgentBehavior}}\nAgent Run History: {{AgentRunHistory}}\n\nProvide a score from 0-100 based on how well the agent followed the expected trajectory.",
+  "model": "gpt-4.1-2025-04-14",
+  "targetOutputKey": "*",
+  "createdAt": "<timestamp>",
+  "updatedAt": "<timestamp>"
+}
+```
+
+After hand-writing any evaluator, run `uip agent validate --output json` to confirm the file passes schema migration. Then reference the new evaluator's `id` from your eval set's `evaluatorRefs`. Watch for: `id` collisions with existing evaluators, missing required fields, and ISO-8601 formatting on the timestamps.

 ## Coded-only evaluators (NOT available on low-code)

From 5fb59e7a1da09af53b9c8f5e4fa3a7228d868d26 Mon Sep 17 00:00:00 2001
From: Mayank Jha
Date: Tue, 5 May 2026 18:09:05 -0700
Subject: [PATCH 5/5] docs(uipath-agents): reframe low-code evaluators as 4 first-class types
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Studio Web UI exposes 4 evaluator types (Semantic Similarity, Trajectory, Exact match, JSON similarity). Verified by counting evaluator JSON files across multiple production examples — only types 1, 5, 6, 7 appear; nothing else does.

Previous framing called Exact match and JSON similarity "runtime-supported types not exposed by the CLI", which understated their status. Both are real first-class options; the only narrowing surface is the CLI's --type flag (which covers 2 of 4).

evaluators.md changes:
- New "Supported Evaluator Types" section with a 4-row table mapping UI label, type/category, --type flag (where applicable), what it scores, and whether it is LLM-based.
- New subsection "How to add each type" calling out the three creation paths (UI, CLI, hand-write JSON).
- Renamed the "Why fewer than coded?" section into a subsection of the Supported Types group; updated wording to reflect 4 supported types.
- Renamed "Runtime-supported types not exposed by the CLI" to "JSON Shapes" and reordered the four shapes to match the table order (Exact match, JSON similarity, LLM-as-a-judge, Trajectory).

evaluation-sets.md changes:
- Added Exact match and JSON similarity rows to the field-mapping table so all 4 supported types are covered.
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../lowcode/evaluations/evaluation-sets.md | 10 ++--
 .../lowcode/evaluations/evaluators.md      | 46 +++++++++++--------
 2 files changed, 32 insertions(+), 24 deletions(-)

diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md b/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md
index 528bc0868..acda99f05 100644
--- a/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md
+++ b/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md
@@ -82,10 +82,12 @@ The `--inputs` and `--expected` flags populate `inputs` and `expectedOutput` on

-| Evaluator Type | From test case | From agent run trace |
-|----------------|---------------|----------------------|
-| `semantic-similarity` | `expectedOutput` → `{{ExpectedOutput}}` | Agent output → `{{ActualOutput}}` |
-| `trajectory` | `expectedAgentBehavior` → `{{ExpectedAgentBehavior}}`, `inputs` → `{{UserOrSyntheticInput}}`, `simulationInstructions` → `{{SimulationInstructions}}` | Trace → `{{AgentRunHistory}}` |
+| Evaluator Type | From test case | From agent run |
+|----------------|---------------|----------------|
+| Semantic Similarity (type 5) | `expectedOutput` → `{{ExpectedOutput}}` | Agent output → `{{ActualOutput}}` |
+| Trajectory (type 7) | `expectedAgentBehavior` → `{{ExpectedAgentBehavior}}`, `inputs` → `{{UserOrSyntheticInput}}`, `simulationInstructions` → `{{SimulationInstructions}}` | Trace → `{{AgentRunHistory}}` |
+| Exact match (type 1) | `expectedOutput` (compared verbatim, no placeholders) | Agent output (compared verbatim) |
+| JSON similarity (type 6) | `expectedOutput` (tree-compared, no placeholders) | Agent output (tree-compared) |

 For trajectory evaluation, write `--expected-agent-behavior` as a natural language description of what the agent should do, not what it should output:

diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluators.md b/skills/uipath-agents/references/lowcode/evaluations/evaluators.md
index 81dd6bdf1..c3f4adfae 100644
--- a/skills/uipath-agents/references/lowcode/evaluations/evaluators.md
+++ b/skills/uipath-agents/references/lowcode/evaluations/evaluators.md
@@ -2,33 +2,39 @@ Evaluators define how agent output is scored. Each evaluator is a JSON file in `evals/evaluators/`.

-## Why fewer evaluators than coded?
+## Supported Evaluator Types

-The coded eval reference ([coded/lifecycle/evaluations/evaluators.md](../../coded/lifecycle/evaluations/evaluators.md)) lists 13 evaluator types. Low-code lists only 2 supported types because the two surfaces use **different engines** in the SDK:
+Low-code agents support exactly four evaluator types. All four are first-class options in the Studio Web "Add evaluator" dialog. Two also have CLI-flag shortcuts; the other two are created via the UI or by hand-writing JSON in `evals/evaluators/`.

+| UI label | `type` | `category` | `--type` flag | What it scores | LLM-based |
+|----------|--------|-----------|---------------|----------------|-----------|
+| LLM-as-a-judge: Semantic Similarity | 5 | 1 (LlmAsAJudge) | `semantic-similarity` | Whether the agent's output has the same meaning as the expected output | Yes |
+| Trajectory | 7 | 3 (Trajectory) | `trajectory` | Whether the agent's reasoning path and tool usage match expected behavior | Yes |
+| Exact match | 1 | 0 (Deterministic) | — | Whether the output precisely matches the expected output without variations in wording or formatting | No |
+| JSON similarity | 6 | 0 (Deterministic) | — | Whether two JSON structures or values are "close enough" or share similar structure/contents | No |
+
+How to add each type:

-Most coded evaluator types (`contains`, `binary-classification`, `multiclass-classification`, all four `tool-call-*`, `llm-judge-output-strict-json-similarity`, `llm-judge-trajectory-simulation`) **have no legacy counterpart** and cannot be used on a low-code agent.
+- **Studio Web UI** — Evaluators tab → **Create New** → Add evaluator dialog → pick any of the four. UI is the canonical surface and supports all four with no special steps.
+- **CLI** — `uip agent eval evaluator add <name> --type <type>` for `semantic-similarity` or `trajectory`. The CLI does not have a `--type` value for Exact match or JSON similarity; create those in the UI or hand-write the JSON.
+- **Hand-write JSON** — drop a file in `evals/evaluators/` matching the schema below; run `uip agent validate --output json`; reference the new `id` from your eval set's `evaluatorRefs`. Useful when you want to pin a specific model and prompt for the LLM-based types, or when you're scaffolding eval files programmatically.

-The CLI exposes only `semantic-similarity` and `trajectory` via `--type`. `Equals` and `JsonSimilarity` are accepted by the runtime but require hand-written JSON — see § Runtime-supported types not exposed by the CLI below.
+### Why fewer than coded?

-## Evaluator Types (CLI-exposed)
+The coded eval reference ([coded/lifecycle/evaluations/evaluators.md](../../coded/lifecycle/evaluations/evaluators.md)) lists 13 evaluator types. Low-code supports only these four because the two surfaces use **different engines** in the SDK:

-| Type | CLI Flag | What It Scores |
-|------|----------|----------------|
-| Semantic Similarity | `semantic-similarity` | Whether the agent's output has the same meaning as the expected output |
-| Trajectory | `trajectory` | Whether the agent's reasoning path and tool usage match expected behavior |
+- **Coded** uses the new evaluator hierarchy (`BaseEvaluator`, eval sets carry `version: "1.0"`). 13 distinct `evaluatorTypeId` strings, each with its own implementation class.
+- **Low-code** uses the **legacy** evaluator hierarchy (`BaseLegacyEvaluator`, no `version` field on the eval set).
Only `LegacyLlmAsAJudgeEvaluator` (semantic-similarity), `LegacyTrajectoryEvaluator`, `LegacyExactMatchEvaluator`, and `LegacyJsonSimilarityEvaluator` are supported for low-code agents. +| UI label | `type` | `category` | `--type` flag | What it scores | LLM-based | +|----------|--------|-----------|---------------|----------------|-----------| +| LLM-as-a-judge: Semantic Similarity | 5 | 1 (LlmAsAJudge) | `semantic-similarity` | Whether the agent's output has the same meaning as the expected output | Yes | +| Trajectory | 7 | 3 (Trajectory) | `trajectory` | Whether the agent's reasoning path and tool usage match expected behavior | Yes | +| Exact match | 1 | 0 (Deterministic) | — | Whether the output precisely matches the expected output without variations in wording or formatting | No | +| JSON similarity | 6 | 0 (Deterministic) | — | Whether two JSON structures or values are "close enough" or share similar structure/contents | No | + +How to add each type: -Most coded evaluator types (`contains`, `binary-classification`, `multiclass-classification`, all four `tool-call-*`, `llm-judge-output-strict-json-similarity`, `llm-judge-trajectory-simulation`) **have no legacy counterpart** and cannot be used on a low-code agent. +- **Studio Web UI** — Evaluators tab → **Create New** → Add evaluator dialog → pick any of the four. UI is the canonical surface and supports all four with no special steps. +- **CLI** — `uip agent eval evaluator add --type ` for `semantic-similarity` or `trajectory`. The CLI does not have a `--type` value for Exact match or JSON similarity; create those in the UI or hand-write the JSON. +- **Hand-write JSON** — drop a file in `evals/evaluators/` matching the schema below; run `uip agent validate --output json`; reference the new `id` from your eval set's `evaluatorRefs`. Useful when you want to pin a specific model and prompt for the LLM-based types, or when you're scaffolding eval files programmatically. -The CLI exposes only `semantic-similarity` and `trajectory` via `--type`. `Equals` and `JsonSimilarity` are accepted by the runtime but require hand-written JSON — see § Runtime-supported types not exposed by the CLI below. +### Why fewer than coded? -## Evaluator Types (CLI-exposed) +The coded eval reference ([coded/lifecycle/evaluations/evaluators.md](../../coded/lifecycle/evaluations/evaluators.md)) lists 13 evaluator types. Low-code supports only these four because the two surfaces use **different engines** in the SDK: -| Type | CLI Flag | What It Scores | -|------|----------|----------------| -| Semantic Similarity | `semantic-similarity` | Whether the agent's output has the same meaning as the expected output | -| Trajectory | `trajectory` | Whether the agent's reasoning path and tool usage match expected behavior | +- **Coded** uses the new evaluator hierarchy (`BaseEvaluator`, eval sets carry `version: "1.0"`). 13 distinct `evaluatorTypeId` strings, each with its own implementation class. +- **Low-code** uses the **legacy** evaluator hierarchy (`BaseLegacyEvaluator`, no `version` field on the eval set). The four legacy classes shipped — `LegacyLlmAsAJudgeEvaluator`, `LegacyTrajectoryEvaluator`, `LegacyExactMatchEvaluator`, `LegacyJsonSimilarityEvaluator` — are exactly what the UI exposes. 
-
Most coded evaluator types (`contains`, `binary-classification`, `multiclass-classification`, all four `tool-call-*`, `llm-judge-output-strict-json-similarity`, `llm-judge-trajectory-simulation`) have no legacy counterpart and cannot be used on a low-code agent.
+Most coded evaluator types (`contains`, `binary-classification`, `multiclass-classification`, all four `tool-call-*`, `llm-judge-output-strict-json-similarity`, `llm-judge-trajectory-simulation`) have no legacy counterpart and cannot be used on a low-code agent.

-The CLI exposes only `semantic-similarity` and `trajectory` via `--type`. `Equals` and `JsonSimilarity` are accepted by the runtime but require hand-written JSON — see § Runtime-supported types not exposed by the CLI below.
+## JSON Shapes

-## Runtime-supported types not exposed by the CLI

-The eval worker's discriminator (`uipath/eval/evaluators/evaluator.py` § `legacy_evaluator_discriminator`) accepts two more `type` values that have no `--type` flag. To use them, hand-write the evaluator JSON in `evals/evaluators/<filename>.json`:

 For hand-written files, the filename can be any descriptive name (e.g. `legacy-equality.json`) — the runtime keys off `id` / `evaluatorRefs`, not the filename. The CLI-generated `evaluator-<uuid8>.json` pattern only applies to evaluators created via `uip agent eval evaluator add`.

-### `Equals` (type 1, category 0 — Deterministic)
+### Exact match (`type` 1, `category` 0 — Deterministic)

-Exact-match comparison; no LLM. Equivalent of coded `uipath-exact-match`.
+No LLM. Equivalent of coded `uipath-exact-match`.

@@ -46,9 +52,9 @@ Exact-match comparison; no LLM. Equivalent of coded `uipath-exact-match`.

 No `prompt`/`model` required (Deterministic category bypasses the LLM checks).

-### `JsonSimilarity` (type 6, category 0 — Deterministic)
+### JSON similarity (`type` 6, `category` 0 — Deterministic)

-Tree-based JSON comparison; no LLM. Equivalent of coded `uipath-json-similarity`.
+Tree-based JSON comparison. No LLM. Equivalent of coded `uipath-json-similarity`.

@@ -64,9 +70,9 @@ Tree-based JSON comparison; no LLM. Equivalent of coded `uipath-json-similarity`

-### `LlmAsAJudge` semantic-similarity (type 5, category 1) — explicit shape
+### LLM-as-a-judge: Semantic Similarity (`type` 5, `category` 1 — LlmAsAJudge)

-If you want to hand-write a semantic-similarity evaluator instead of using `evaluator add` (e.g. to pin a specific model and prompt), the full shape is:
+The CLI's `evaluator add --type semantic-similarity` writes a shorter prompt; hand-write the file when you want to pin a specific model and the longer 0–100 prompt:

@@ -84,7 +90,7 @@ If you want to hand-write a semantic-similarity evaluator instead of using `eval

-### `Trajectory` (type 7, category 3) — explicit shape
+### Trajectory (`type` 7, `category` 3 — Trajectory)

 ```json
 {