diff --git a/skills/uipath-agents/SKILL.md b/skills/uipath-agents/SKILL.md
index 7dc4747c3..391fe7346 100644
--- a/skills/uipath-agents/SKILL.md
+++ b/skills/uipath-agents/SKILL.md
@@ -47,7 +47,8 @@ Determine the agent mode before proceeding:
 | Add an Action Center escalation (HITL) to a low-code agent | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) § Capability Registry | `lowcode/capabilities/escalation/escalation.md` |
 | Add guardrails (PII, harmful content, custom rules) to a low-code agent | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) § Capability Registry | `lowcode/capabilities/guardrails/guardrails.md` |
 | Add escalation guardrail (escalate action / Action Center app) | Low-code | [lowcode/capabilities/guardrails/guardrails.md](references/lowcode/capabilities/guardrails/guardrails.md) § escalate — Hand Off to Action Center | Run `uip solution resource list --kind App` to confirm app exists |
-| Embed a low-code agent inline in a flow, or wire a multi-agent solution | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) § Capability Registry | `lowcode/capabilities/inline-in-flow/inline-in-flow.md`, `lowcode/capabilities/process/process.md` |
+| Embed a low-code agent inline in a flow, or wire a multi-agent solution | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) § Capability Registry | `lowcode/capabilities/inline-in-flow/inline-in-flow.md`, `lowcode/capabilities/process/solution-agent.md` |
+| Run low-code evaluations | Low-code | [lowcode/evaluations/evaluate.md](references/lowcode/evaluations/evaluate.md) | `lowcode/evaluations/evaluators.md`, `lowcode/evaluations/evaluation-sets.md`, `lowcode/evaluations/running-evaluations.md` |
 | Validate, pack, publish, upload, or deploy a low-code agent | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) | `lowcode/project-lifecycle.md`, `lowcode/solution-resources.md` |
 
 ## Resources
diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluate.md b/skills/uipath-agents/references/lowcode/evaluations/evaluate.md
new file mode 100644
index 000000000..e77c339c0
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluations/evaluate.md
@@ -0,0 +1,90 @@
+# Evaluate Low-Code Agents
+
+Design and run evaluations against low-code agents using the `uip agent eval` CLI.
+
+## Quick Reference
+
+```bash
+# Add a test case
+uip agent eval add happy-path --set "Default Evaluation Set" --inputs '{"input":"hello"}' --expected '{"content":"greeting"}' --path ./my-agent --output json
+
+# Run evals and wait for results
+uip agent eval run start --set "Default Evaluation Set" --path ./my-agent --wait --output json
+
+# Check results (failures only, with justifications)
+uip agent eval run results --set "Default Evaluation Set" --only-failed --verbose --path ./my-agent --output json
+```
+
+## Prerequisites
+
+- Agent project initialized (`uip agent init <name>`)
+- `entry-points.json` present (defines `input`/`output` schema that test case `--inputs`/`--expected` must conform to)
+- `uip agent validate --output json` passes (validate also checks evals and evaluators)
+- Agent pushed to Studio Web (`uip agent push`) — required for running evals (the Agent Runtime executes test cases in the cloud)
+
+Local operations (managing evaluators, eval sets, test cases) do **not** require authentication or a cloud connection. Only `uip agent eval run *` commands require cloud connectivity.
+
+## Reference Navigation
+
+- [Evaluators](evaluators.md) — evaluator types, adding/removing, default prompts
+- [Evaluation Sets and Test Cases](evaluation-sets.md) — creating sets, adding test cases, simulation options
+- [Running Evaluations](running-evaluations.md) — start, status, results, compare
+
+Read Evaluators before choosing an evaluator type, and Evaluation Sets before writing test cases.
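The prerequisites above can be checked locally before invoking the CLI. A minimal pre-flight sketch (Python; the directory layout is taken from the File Structure section, and the `preflight` helper name is hypothetical):

```python
from pathlib import Path

def preflight(project_dir: str) -> list[str]:
    """Return a list of problems that would block local eval work or an eval run."""
    root = Path(project_dir)
    problems = []
    if not (root / "entry-points.json").is_file():
        problems.append("entry-points.json missing — --inputs/--expected cannot be checked")
    if not (root / "evals" / "evaluators").is_dir():
        problems.append("evals/evaluators/ missing — run `uip agent init` or `uip agent eval evaluator add`")
    if not (root / "evals" / "eval-sets").is_dir():
        problems.append("evals/eval-sets/ missing — run `uip agent eval set add`")
    return problems
```

An empty list means the local layout is in place; cloud prerequisites (`uip agent push`, auth) still apply for `eval run *`.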
+
+## File Structure
+
+After `uip agent init`, the project structure is:
+
+```
+my-agent/
+  agent.json
+  entry-points.json        # Input/output schema — test case --inputs / --expected must match
+  project.uiproj
+  flow-layout.json
+  evals/
+    evaluators/
+      evaluator-default.json              # name: "Default Evaluator" (semantic-similarity)
+      evaluator-default-trajectory.json   # name: "Default Trajectory Evaluator"
+    eval-sets/
+      evaluation-set-default.json         # name: "Default Evaluation Set" (references both evaluators)
+```
+
+Evaluators live in `evals/evaluators/` and eval sets (with inline test cases) live in `evals/eval-sets/`. Both are auto-discovered by the CLI from these directories.
+
+CLI-added evaluators are written as `evaluator-<id8>.json` (first 8 hex chars of the evaluator UUID). The `<name>` argument populates the `name` field inside the JSON, NOT the filename. Reference evaluators in eval sets by `id` (UUID), not filename.
+
+## Key Differences from Coded Agent Evals
+
+| Aspect | Coded (`uip codedagent eval`) | Low-code (`uip agent eval`) |
+|--------|-------------------------------|------------------------------|
+| Execution | Local Python process | Cloud-based via Agent Runtime |
+| Auth required | Only for `--report` | Always (cloud execution) |
+| Prerequisite | `entry-points.json` | `uip agent push` |
+| Mocking | `@mockable()` decorator + declarative | Simulation instructions only |
+| CLI prefix | `uip codedagent eval` | `uip agent eval` |
+
+## Troubleshooting
+
+| Error | Cause | Fix |
+|-------|-------|-----|
+| Solution ID could not be resolved | Agent not pushed to Studio Web | Run `uip agent push --output json`, or pass `--solution-id <id>` explicitly to `uip agent eval run start` |
+| `No evaluators found` | Empty `evals/evaluators/` directory | Run `uip agent eval evaluator add` or re-init with `uip agent init` |
+| `No test cases in eval set` | Eval set has no evaluations | Run `uip agent eval add` to add test cases |
+| `Unknown evaluator type "X"` | Wrong case on `--type` value | Use kebab-case only: `semantic-similarity`, `trajectory` |
+| `Evaluator '<name>' is an LLM-based evaluator but 'model' is not set in its evaluatorConfig.` | LLM evaluator JSON has empty/missing `model` and is not `same-as-agent` | Set `"model"` in the evaluator JSON to a valid model (e.g. `claude-haiku-4-5-20251001`), or set it to `"same-as-agent"` and ensure `agent.json` has a model |
+| `'same-as-agent' model option requires agent settings. Ensure agent.json contains valid model settings.` | Evaluator uses `"model": "same-as-agent"` but `agent.json` has no resolvable model | Set a model in `agent.json`, or override the evaluator with an explicit model |
+| `401 Unauthorized` | Auth expired | Run `uip login --output json` |
+| Eval run timeout (with `--wait`) | Agent taking too long or stuck | Increase `--timeout` or check agent health in Studio Web. Note: this only stops the local CLI from blocking; the run continues server-side — query with `uip agent eval run status <run-id>` |
+| Validate fails with eval errors | Eval set references an evaluator that no longer exists, OR evaluator JSON missing required field, OR `category`/`type` mismatch (see [evaluators.md](evaluators.md) § What `uip agent validate` Checks) | Re-run `uip agent eval evaluator list` and reconcile `evaluatorRefs`; fix per the validate error message |
+
+The two model-resolution errors above are **runtime checks in the cloud eval worker**, not validate-time checks — `uip agent validate` will not catch them. They surface only after `uip agent eval run start`. To pre-empt them, inspect each evaluator's `model` field locally before pushing.
+
+## Anti-patterns
+
+- **Don't run `uip agent eval run start` before `uip agent push`.** The Agent Runtime executes against the pushed agent. Local edits to `agent.json` after the last push will not be reflected in the run.
+- **Don't skip `uip agent validate` before push.** Validate checks `evals/` and `evaluators/`; broken eval JSON will not block push but will surface as runtime errors.
+- **Don't hand-edit `id` or `evaluatorRefs` UUIDs.** Eval sets reference evaluators by UUID. Renaming an evaluator file or copy-pasting a UUID across evaluators silently breaks resolution.
+- **Don't expect filenames to match `<name>`.** CLI-generated evaluator files use `evaluator-<id8>.json`, not `<name>.json`. Look up evaluators by the `name` field inside the JSON, not by filename.
+- **Don't pass `--type` in PascalCase.** The CLI rejects `SemanticSimilarity`. Only kebab-case is accepted.
+- **Don't reference evaluators across projects.** Each agent project has its own `evals/evaluators/` directory; UUIDs are not portable.
diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md b/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md
new file mode 100644
index 000000000..acda99f05
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md
@@ -0,0 +1,153 @@
+# Evaluation Sets and Test Cases
+
+Evaluation sets group test cases and reference which evaluators to use. Each set is a JSON file in `evals/eval-sets/`. Test cases are stored inline within the eval set.
+
+## Managing Eval Sets
+
+### Add an eval set
+
+```bash
+uip agent eval set add <name> --path <dir> --output json
+```
+
+**Options:**
+
+| Flag | Required | Description | Default |
+|------|----------|-------------|---------|
+| `--evaluators <ids>` | No | Comma-separated evaluator IDs | All existing evaluators |
+| `--path <dir>` | No | Agent project directory | `.` |
+
+When `--evaluators` is not provided, the new eval set automatically references **all** evaluators in the project.
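Because eval sets are plain JSON on disk, they can also be inventoried without the CLI. A local discovery sketch (Python), assuming only the `name`, `id`, and `evaluations` fields shown in the Eval Set JSON Format section:

```python
import json
from pathlib import Path

def list_eval_sets(project_dir: str) -> list[dict]:
    """Summarize each eval set file under evals/eval-sets/: name, id, test case count."""
    sets = []
    for f in sorted(Path(project_dir, "evals", "eval-sets").glob("*.json")):
        doc = json.loads(f.read_text())
        sets.append({
            "file": f.name,
            "name": doc.get("name"),
            "id": doc.get("id"),
            "testCases": len(doc.get("evaluations", [])),
        })
    return sets
```

Prefer `uip agent eval set list` for anything authoritative; this is only useful for quick scripting over the files.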
+
+### List eval sets
+
+```bash
+uip agent eval set list --path <dir> --output json
+```
+
+### Remove an eval set
+
+```bash
+uip agent eval set remove <name-or-id> --path <dir> --output json
+```
+
+## Managing Test Cases
+
+Test cases live inside eval sets. Each test case defines an input, expected output, and optional behavior expectations.
+
+### Add a test case
+
+```bash
+uip agent eval add <name> \
+  --set "<set-name>" \
+  --inputs '{"input":"hello"}' \
+  --expected '{"content":"greeting response"}' \
+  --path <dir> \
+  --output json
+```
+
+**Options:**
+
+| Flag | Required | Description | Default |
+|------|----------|-------------|---------|
+| `--set <name-or-id>` | Yes | Eval set name or ID | — |
+| `--inputs <json>` | Yes | Input values as JSON | — |
+| `--expected <json>` | No | Expected output as JSON | `{}` |
+| `--expected-agent-behavior <text>` | No | Description of expected behavior (used by trajectory evaluator) | `""` |
+| `--simulation-instructions <text>` | No | Instructions for simulating agent behavior | `""` |
+| `--simulate-input` | No | Enable input simulation | `false` |
+| `--simulate-tools` | No | Enable tool simulation | `false` |
+| `--input-generation-instructions <text>` | No | Instructions for generating synthetic inputs | `""` |
+| `--path <dir>` | No | Agent project directory | `.` |
+
+### List test cases
+
+```bash
+uip agent eval list --set "<set-name>" --path <dir> --output json
+```
+
+### Remove a test case
+
+```bash
+uip agent eval remove <name-or-id> --set "<set-name>" --path <dir> --output json
+```
+
+## Test Case Design
+
+### Aligning `--inputs` with `entry-points.json`
+
+`--inputs` JSON keys must match the `input` schema in `entry-points.json`. Mismatched keys do not block `eval add` (the CLI stores the JSON verbatim) but will fail at run time when the Agent Runtime invokes the agent. Run `uip agent validate --output json` after adding test cases to surface schema drift.
+
+### Matching evaluator to test case fields
+
+The `--inputs` and `--expected` flags populate `inputs` and `expectedOutput` on the test case. Each evaluator type sources its placeholder values from a different combination of test-case fields and agent run trace:
+
+| Evaluator Type | From test case | From agent run |
+|----------------|---------------|----------------|
+| Semantic Similarity (type 5) | `expectedOutput` → `{{ExpectedOutput}}` | Agent output → `{{ActualOutput}}` |
+| Trajectory (type 7) | `expectedAgentBehavior` → `{{ExpectedAgentBehavior}}`, `inputs` → `{{UserOrSyntheticInput}}`, `simulationInstructions` → `{{SimulationInstructions}}` | Trace → `{{AgentRunHistory}}` |
+| Exact match (type 1) | `expectedOutput` (compared verbatim, no placeholders) | Agent output (compared verbatim) |
+| JSON similarity (type 6) | `expectedOutput` (tree-compared, no placeholders) | Agent output (tree-compared) |
+
+For trajectory evaluation, write `--expected-agent-behavior` as a natural language description of what the agent should do, not what it should output:
+
+```bash
+uip agent eval add tool-usage-test \
+  --set "Default Evaluation Set" \
+  --inputs '{"input":"What is the weather in NYC?"}' \
+  --expected-agent-behavior "Agent should call the weather tool with location NYC and return a formatted weather summary" \
+  --path ./my-agent --output json
+```
+
+### Simulation options
+
+- `--simulate-input` — runtime generates synthetic input variations based on the provided input
+- `--simulate-tools` — tool calls are simulated rather than executed against real services
+- `--input-generation-instructions` — guides synthetic input generation (e.g., "generate edge cases with empty strings and special characters")
+- `--simulation-instructions` — guides overall simulation behavior
+
+Use these to expand test coverage without writing every input by hand.
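The schema-drift problem above (`eval add` stores `--inputs` verbatim, mismatches only fail in the cloud) can be caught with a local key check. A sketch (Python), assuming `entry-points.json` carries a JSON-Schema-style `input` object with a `properties` map — adjust the lookup to your actual file layout:

```python
import json
from pathlib import Path

def check_inputs(project_dir: str, inputs: dict) -> list[str]:
    """Flag test-case input keys that are absent from the entry point's input schema."""
    schema = json.loads(Path(project_dir, "entry-points.json").read_text())
    # Assumed layout: {"input": {"properties": {...}}} — this is a guess, not the documented shape.
    allowed = set(schema.get("input", {}).get("properties", {}))
    return [k for k in inputs if k not in allowed]
```

Run it over each `--inputs` payload before `uip agent push`; a non-empty result is a test case that will fail at execution time.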
+
+## Eval Set JSON Format
+
+```json
+{
+  "fileName": "evaluation-set-default.json",
+  "id": "<uuid>",
+  "name": "Default Evaluation Set",
+  "batchSize": 10,
+  "evaluatorRefs": ["<evaluator-uuid>", "<evaluator-uuid>"],
+  "evaluations": [
+    {
+      "id": "<uuid>",
+      "name": "happy-path",
+      "inputs": {"input": "hello"},
+      "expectedOutput": {"content": "greeting"},
+      "expectedAgentBehavior": "",
+      "simulationInstructions": "",
+      "simulateInput": false,
+      "simulateTools": false,
+      "inputGenerationInstructions": "",
+      "evalSetId": "<uuid>",
+      "source": "manual",
+      "createdAt": "...",
+      "updatedAt": "..."
+    }
+  ],
+  "modelSettings": [],
+  "agentMemoryEnabled": false,
+  "agentMemorySettings": [],
+  "lineByLineEvaluation": false,
+  "createdAt": "...",
+  "updatedAt": "..."
+}
+```
+
+The `source` field indicates how the test case was created. CLI-added test cases are always `"manual"` (verified). Other observed values from Studio Web include `"debugRun"`, `"runtimeRun"`, `"simulatedRun"`, and `"autopilotUserInitiated"` — treat the `source` field as an enum but do not set it manually; the CLI and Studio Web own this value.
+
+## Anti-patterns
+
+- **Don't hand-write `evalSetId` or test case `id` UUIDs.** Use `uip agent eval add` so the CLI keeps `evaluations[].evalSetId` consistent with the parent eval set's `id`.
+- **Don't add `--inputs` keys that are not in `entry-points.json`.** The runtime will reject the test case at execution time. Run `uip agent validate` to catch this before push.
+- **Don't set `--expected '{}'` (empty) and `--expected-agent-behavior ""` together.** The semantic-similarity evaluator scores against an empty `{{ExpectedOutput}}`; the trajectory evaluator scores against an empty `{{ExpectedAgentBehavior}}`. Every run scores low for non-actionable reasons.
+- **Don't set the `source` field manually.** Owned by CLI and Studio Web; hand-edits may be overwritten on the next sync.
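The UUID wiring the anti-patterns above warn about can be linted locally. A sketch (Python; `lint_eval_set` is a hypothetical helper, and it assumes each evaluator file carries the `id` field shown in evaluators.md):

```python
import json
from pathlib import Path

def lint_eval_set(set_file: str, evaluators_dir: str) -> list[str]:
    """Check evaluatorRefs resolve to known evaluator ids and evalSetId matches the parent set."""
    doc = json.loads(Path(set_file).read_text())
    known = {json.loads(p.read_text()).get("id") for p in Path(evaluators_dir).glob("*.json")}
    problems = [f"unresolved evaluatorRef {ref}" for ref in doc.get("evaluatorRefs", []) if ref not in known]
    for ev in doc.get("evaluations", []):
        if ev.get("evalSetId") != doc.get("id"):
            problems.append(f"test case {ev.get('name')!r} has evalSetId != parent id")
    return problems
```

This catches the silent-breakage cases (renamed files, copy-pasted UUIDs) before `uip agent validate` or a cloud run does.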
diff --git a/skills/uipath-agents/references/lowcode/evaluations/evaluators.md b/skills/uipath-agents/references/lowcode/evaluations/evaluators.md
new file mode 100644
index 000000000..c3f4adfae
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluations/evaluators.md
@@ -0,0 +1,280 @@
+# Evaluators
+
+Evaluators define how agent output is scored. Each evaluator is a JSON file in `evals/evaluators/`.
+
+## Supported Evaluator Types
+
+Low-code agents support exactly four evaluator types. All four are first-class options in the Studio Web "Add evaluator" dialog. Two also have CLI-flag shortcuts; the other two are created via the UI or by hand-writing JSON in `evals/evaluators/`.
+
+| UI label | `type` | `category` | `--type` flag | What it scores | LLM-based |
+|----------|--------|-----------|---------------|----------------|-----------|
+| LLM-as-a-judge: Semantic Similarity | 5 | 1 (LlmAsAJudge) | `semantic-similarity` | Whether the agent's output has the same meaning as the expected output | Yes |
+| Trajectory | 7 | 3 (Trajectory) | `trajectory` | Whether the agent's reasoning path and tool usage match expected behavior | Yes |
+| Exact match | 1 | 0 (Deterministic) | — | Whether the output precisely matches the expected output without variations in wording or formatting | No |
+| JSON similarity | 6 | 0 (Deterministic) | — | Whether two JSON structures or values are "close enough" or share similar structure/contents | No |
+
+How to add each type:
+
+- **Studio Web UI** — Evaluators tab → **Create New** → Add evaluator dialog → pick any of the four. UI is the canonical surface and supports all four with no special steps.
+- **CLI** — `uip agent eval evaluator add <name> --type <type>` for `semantic-similarity` or `trajectory`. The CLI does not have a `--type` value for Exact match or JSON similarity; create those in the UI or hand-write the JSON.
+- **Hand-write JSON** — drop a file in `evals/evaluators/` matching the schema below; run `uip agent validate --output json`; reference the new `id` from your eval set's `evaluatorRefs`. Useful when you want to pin a specific model and prompt for the LLM-based types, or when you're scaffolding eval files programmatically.
+
+### Why fewer than coded?
+
+The coded eval reference ([coded/lifecycle/evaluations/evaluators.md](../../coded/lifecycle/evaluations/evaluators.md)) lists 13 evaluator types. Low-code supports only these four because the two surfaces use **different engines** in the SDK:
+
+- **Coded** uses the new evaluator hierarchy (`BaseEvaluator`, eval sets carry `version: "1.0"`). 13 distinct `evaluatorTypeId` strings, each with its own implementation class.
+- **Low-code** uses the **legacy** evaluator hierarchy (`BaseLegacyEvaluator`, no `version` field on the eval set). The four legacy classes shipped — `LegacyLlmAsAJudgeEvaluator`, `LegacyTrajectoryEvaluator`, `LegacyExactMatchEvaluator`, `LegacyJsonSimilarityEvaluator` — are exactly what the UI exposes.
+
+Most coded evaluator types (`contains`, `binary-classification`, `multiclass-classification`, all four `tool-call-*`, `llm-judge-output-strict-json-similarity`, `llm-judge-trajectory-simulation`) have no legacy counterpart and cannot be used on a low-code agent.
+
+## JSON Shapes
+
+For hand-written files, the filename can be any descriptive name (e.g. `legacy-equality.json`) — the runtime keys off `id` / `evaluatorRefs`, not the filename. The CLI-generated `evaluator-<id8>.json` pattern only applies to evaluators created via `uip agent eval evaluator add`.
+
+### Exact match (`type` 1, `category` 0 — Deterministic)
+
+No LLM. Equivalent of coded `uipath-exact-match`.
+
+```json
+{
+  "fileName": "legacy-equality.json",
+  "id": "<uuid>",
+  "name": "Equality Evaluator",
+  "description": "An evaluator that judges the agent based on expected output.",
+  "category": 0,
+  "type": 1,
+  "targetOutputKey": "*",
+  "createdAt": "<iso-8601>",
+  "updatedAt": "<iso-8601>"
+}
+```
+
+No `prompt`/`model` required (Deterministic category bypasses the LLM checks).
+
+### JSON similarity (`type` 6, `category` 0 — Deterministic)
+
+Tree-based JSON comparison. No LLM. Equivalent of coded `uipath-json-similarity`.
+
+```json
+{
+  "fileName": "legacy-json-similarity.json",
+  "id": "<uuid>",
+  "name": "JSON Similarity Evaluator",
+  "description": "An evaluator that compares JSON structures with tolerance for numeric and string differences.",
+  "category": 0,
+  "type": 6,
+  "targetOutputKey": "*",
+  "createdAt": "<iso-8601>",
+  "updatedAt": "<iso-8601>"
+}
```

+### LLM-as-a-judge: Semantic Similarity (`type` 5, `category` 1 — LlmAsAJudge)
+
+The CLI's `evaluator add --type semantic-similarity` writes a shorter prompt; hand-write the file when you want to pin a specific model and the longer 0–100 prompt:
+
+```json
+{
+  "fileName": "legacy-llm-as-a-judge.json",
+  "id": "<uuid>",
+  "name": "LLM As A Judge Evaluator",
+  "description": "An evaluator that uses an LLM to judge the similarity of the actual output to the expected output",
+  "category": 1,
+  "type": 5,
+  "prompt": "As an expert evaluator, analyze the semantic similarity of these outputs to determine a score from 0-100.\n----\nExpectedOutput:\n{{ExpectedOutput}}\n----\nActualOutput:\n{{ActualOutput}}\n",
+  "targetOutputKey": "*",
+  "model": "gpt-4.1-2025-04-14",
+  "createdAt": "<iso-8601>",
+  "updatedAt": "<iso-8601>"
+}
+```
+
+### Trajectory (`type` 7, `category` 3 — Trajectory)
+
+```json
+{
+  "fileName": "legacy-trajectory.json",
+  "id": "<uuid>",
+  "name": "Trajectory Evaluator",
+  "description": "An evaluator that analyzes the execution trajectory and decision sequence taken by the agent.",
+  "category": 3,
+  "type": 7,
+  "prompt": "Evaluate the agent's execution trajectory based on the expected behavior.\n\nExpected Agent Behavior: {{ExpectedAgentBehavior}}\nAgent Run History: {{AgentRunHistory}}\n\nProvide a score from 0-100 based on how well the agent followed the expected trajectory.",
+  "model": "gpt-4.1-2025-04-14",
+  "targetOutputKey": "*",
+  "createdAt": "<iso-8601>",
+  "updatedAt": "<iso-8601>"
+}
+```
+
+After hand-writing any evaluator, run `uip agent validate --output json` to confirm the file passes schema migration. Then reference the new evaluator's `id` from your eval set's `evaluatorRefs`. Watch for: `id` collisions with existing evaluators, missing required fields, and ISO-8601 formatting on the timestamps.
+
+## Coded-only evaluators (NOT available on low-code)
+
+The following coded `evaluatorTypeId` strings have no legacy class — agents working on a low-code agent should not attempt to use them. Switch to a coded agent (`version: "1.0"` eval sets) if you need any of these:
+
+`uipath-contains`, `uipath-llm-judge-output-strict-json-similarity`, `uipath-llm-judge-trajectory-simulation`, `uipath-binary-classification`, `uipath-multiclass-classification`, `uipath-tool-call-order`, `uipath-tool-call-args`, `uipath-tool-call-count`, `uipath-tool-call-output`.
+
+## Managing Evaluators
+
+### Add an evaluator
+
+```bash
+uip agent eval evaluator add <name> --type <type> --path <dir> --output json
+```
+
+**Options:**
+
+| Flag | Required | Description | Default |
+|------|----------|-------------|---------|
+| `--type <type>` | Yes | One of: `semantic-similarity`, `trajectory` | — |
+| `--description <text>` | No | Human-readable description | Auto-generated from type |
+| `--prompt <text>` | No | Custom LLM evaluation prompt | Built-in default per type |
+| `--target-key <key>` | No | Specific output key to evaluate | `*` (all keys) |
+| `--path <dir>` | No | Agent project directory | `.` |
+
+**Example:**
+```bash
+uip agent eval evaluator add content-quality \
+  --type semantic-similarity \
+  --path ./my-agent \
+  --output json
+```
+
+### List evaluators
+
+```bash
+uip agent eval evaluator list --path <dir> --output json
+```
+
+### Remove an evaluator
+
+```bash
+uip agent eval evaluator remove <name-or-id> --path <dir> --output json
+```
+
+Removing an evaluator automatically removes its references from all eval sets that reference it.
+
+## Default Evaluators
+
+`uip agent init` creates two default evaluators:
+
+### Semantic Similarity (`evaluator-default.json`, `name: "Default Evaluator"`)
+
+Compares expected vs actual output for semantic equivalence. Default prompt asks the LLM for a 0–100 score and substitutes `{{ExpectedOutput}}` and `{{ActualOutput}}`.
+
+### Trajectory (`evaluator-default-trajectory.json`, `name: "Default Trajectory Evaluator"`)
+
+Evaluates the agent's reasoning path against expected behavior. Default prompt asks the LLM for a 0–100 score and substitutes `{{UserOrSyntheticInput}}`, `{{SimulationInstructions}}`, `{{ExpectedAgentBehavior}}`, and `{{AgentRunHistory}}`.
+
+Both default evaluators ship with `"model": "same-as-agent"` — this is supported and resolves to the agent's configured model at runtime. Override with an explicit model only if you need to score with a different model than the agent uses.
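Since the model-resolution errors only surface in the cloud eval worker, it is worth scanning evaluator files locally first. A sketch (Python; `find_modelless_llm_evaluators` is a hypothetical helper — categories 1 and 3 are the LLM-based ones per the tables above):

```python
import json
from pathlib import Path

def find_modelless_llm_evaluators(evaluators_dir: str) -> list[str]:
    """Names of LLM-based evaluators (category 1 or 3) with an empty or missing model."""
    bad = []
    for p in sorted(Path(evaluators_dir).glob("*.json")):
        ev = json.loads(p.read_text())
        if ev.get("category") in (1, 3) and not ev.get("model"):
            bad.append(ev.get("name", p.name))
    return bad
```

Note this deliberately does not flag `"same-as-agent"` — those evaluators are fine as long as `agent.json` has a resolvable model, which is a separate check.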
+
+The runtime DTO normalizes all evaluator scores to a 0–100 scale regardless of what the prompt asks for, but mixed-scale prompts in the same eval set produce confusing intermediate values — pick one scale per eval set.
+
+## Filename vs Name
+
+CLI-added evaluators are saved as `evaluator-<id8>.json` (first 8 hex chars of the evaluator UUID). The `<name>` argument populates the `name` field inside the JSON; it does NOT shape the filename.
+
+```bash
+uip agent eval evaluator add content-quality --type semantic-similarity --path ./my-agent
+# Creates: evals/evaluators/evaluator-b47e26ca.json
+# JSON has: "name": "content-quality"
+```
+
+The two `evaluator-default*.json` files are written by `uip agent init`, not by `evaluator add`. Eval sets reference evaluators by `id` (UUID), not by filename or name.
+
+## Evaluator JSON Format
+
+```json
+{
+  "fileName": "evaluator-b47e26ca.json",
+  "id": "b47e26ca-7a13-4c83-9ee4-039d6415fb63",
+  "name": "content-quality",
+  "description": "Semantic Similarity",
+  "category": 1,
+  "type": 5,
+  "prompt": "As an expert evaluator, ... {{ExpectedOutput}} ... {{ActualOutput}} ...",
+  "model": "same-as-agent",
+  "targetOutputKey": "*",
+  "createdAt": "2026-05-04T00:00:00.000Z",
+  "updatedAt": "2026-05-04T00:00:00.000Z"
+}
+```
+
+**Type and category mapping:**
+
+| CLI Type | `type` (numeric) | `category` |
+|----------|-------------------|------------|
+| `semantic-similarity` | 5 | 1 (output-based) |
+| `trajectory` | 7 | 3 (trajectory-based) |
+
+## Default Prompts and Template Variables
+
+The prompt and score scale the CLI writes when you run `evaluator add` differ from what `uip agent init` writes for the two default evaluators:
+
+| Type | `evaluator add` default | `uip agent init` default |
+|------|-------------------------|--------------------------|
+| `semantic-similarity` | Asks 0–1; uses `{{ExpectedOutput}}`, `{{ActualOutput}}` | Asks 0–100; same placeholders |
+| `trajectory` | Asks 0–1; uses `{{AgentRunHistory}}`, `{{ExpectedBehavior}}` | Asks 0–100; uses `{{UserOrSyntheticInput}}`, `{{SimulationInstructions}}`, `{{ExpectedAgentBehavior}}`, `{{AgentRunHistory}}` |
+
+Two notable inconsistencies:
+
+1. **Trajectory placeholder names**: `{{ExpectedBehavior}}` (CLI add) vs `{{ExpectedAgentBehavior}}` (init default). When editing a prompt, use the placeholders already present in that file — do not mix.
+2. **Score scales**: `evaluator add` writes 0–1 prompts; `init` writes 0–100 prompts. The runtime normalizes both to 0–100 in the result DTO, but the LLM judge actually returns whatever the prompt asks for. Mixed-scale eval sets are hard to read; pick one and rewrite the prompts you don't want.
+
+## Custom Prompts
+
+Pass `--prompt` to override the default. Use only the placeholders listed above for the chosen `--type`; unknown placeholders are passed through to the LLM as literal text.
+
+```bash
+uip agent eval evaluator add strict-match \
+  --type semantic-similarity \
+  --prompt 'Score 0-100 how closely {{ActualOutput}} matches {{ExpectedOutput}}. Return JSON {"score": N, "reason": "..."}.' \
+  --path ./my-agent --output json
+```
+
+## What `uip agent validate` Checks
+
+Validate runs schema migration, which enforces the following on every file in `evals/evaluators/`:
+
+**Required fields:** `fileName`, `id`, `name`, `description`, `category`, `type`, `targetOutputKey`, `createdAt`, `updatedAt`. Missing field → `Required field "<field>" is missing`.
+
+**Category ↔ type compatibility:**
+
+| Category | Name | Allowed `type` | Additional requirements |
+|----------|------|----------------|-------------------------|
+| `0` | Deterministic | `1`, `6` | — |
+| `1` | LlmAsAJudge | `5` | `prompt` and `model` required |
+| `3` | Trajectory | `7` | `prompt` and `model` required |
+
+Category `2` (`AgentScorer`) exists in the SDK enum but is reserved/unused — do not write it manually.
+
+Eval sets are validated against a Zod schema. The CLI surfaces the offending file path, JSON path, and message — fix and re-run validate.
+
+## Runtime Errors (Eval Worker)
+
+These errors surface only after `uip agent eval run start` — `uip agent validate` does NOT catch them. They come from the cloud eval worker (`python-eval-worker/workflows/eval/activities.py`) and the SDK's `EvaluatorFactory`.
+
+| Error string | Trigger | Fix |
+|--------------|---------|-----|
+| `Evaluator '<name>' is an LLM-based evaluator but 'model' is not set in its evaluatorConfig. Specify a valid model name (e.g. 'claude-haiku-4-5-20251001').` | Evaluator JSON has empty/missing `model` (and is not `same-as-agent`). The worker fail-fasts before calling the LLM gateway. | Set `model` in the evaluator JSON to a model available in your tenant, or set `"model": "same-as-agent"` and ensure `agent.json` has a model. |
+| `'same-as-agent' model option requires agent settings. Ensure agent.json contains valid model settings.` | Evaluator uses `"same-as-agent"` but `agent.json` has no resolvable model. | Set `model` in `agent.json`, or override the evaluator with an explicit model. |
+
+**Pre-empt locally:** before push, run
+
+```bash
+uip agent eval evaluator list --path ./my-agent --output json --output-filter '[?model==`""` || model==`null`]'
+```
+
+to find any LLM evaluator without an explicit model. (Switch to `--output-filter '[?model==`"same-as-agent"`]'` if you want to flag those that depend on `agent.json`.)
+
+## Anti-patterns
+
+- **Don't reference an evaluator by filename.** Eval sets reference evaluators by UUID (`id`).
+- **Don't pass `--type` in PascalCase.** Only `semantic-similarity` and `trajectory` are accepted.
+- **Don't assume `evaluator add` mirrors `init`'s prompts.** They differ for trajectory; check the resulting JSON before reusing template variables in your own scoring tooling.
+- **Don't delete an evaluator file by hand.** Use `uip agent eval evaluator remove` so `evaluatorRefs` in eval sets are cleaned up automatically.
+- **Don't copy evaluator JSON across projects without regenerating UUIDs.** `id` collisions silently corrupt cross-project resolution.
+- **Don't try to add a coded-only evaluator type to a low-code agent.** Anything starting with `uipath-tool-call-*`, `uipath-binary-classification`, `uipath-multiclass-classification`, `uipath-contains`, `uipath-llm-judge-output-strict-json-similarity`, or `uipath-llm-judge-trajectory-simulation` has no legacy class and the eval worker will not load it. If you need one of these, the agent must be coded, not low-code.
+- **Don't hand-write a category/type combination outside the validate matrix.** Validate accepts cat 0 → types {1, 6}, cat 1 → type {5}, cat 3 → type {7}. Anything else fails schema migration.
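The validate matrix can also be enforced in local tooling before handing files to `uip agent validate`. A sketch (Python; `check_category_type` is a hypothetical helper encoding the table above):

```python
import json
from pathlib import Path

# Allowed category -> type combinations, from the validate matrix above.
ALLOWED = {0: {1, 6}, 1: {5}, 3: {7}}

def check_category_type(evaluator_file: str):
    """Return an error string if the category/type pair would fail schema migration."""
    ev = json.loads(Path(evaluator_file).read_text())
    cat, typ = ev.get("category"), ev.get("type")
    if typ not in ALLOWED.get(cat, set()):
        return f"category {cat} does not allow type {typ}"
    # Categories 1 and 3 additionally require prompt and model.
    if cat in (1, 3) and not (ev.get("prompt") and ev.get("model")):
        return "LLM-based category requires prompt and model"
    return None
```

This mirrors only the matrix and the prompt/model requirement; the required-fields check is separate.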
diff --git a/skills/uipath-agents/references/lowcode/evaluations/running-evaluations.md b/skills/uipath-agents/references/lowcode/evaluations/running-evaluations.md
new file mode 100644
index 000000000..8713186b1
--- /dev/null
+++ b/skills/uipath-agents/references/lowcode/evaluations/running-evaluations.md
@@ -0,0 +1,207 @@
+# Running Evaluations
+
+Execute evaluations against the Agent Runtime, check status, view results, and compare runs.
+
+All run commands require the agent to be pushed to Studio Web first (`uip agent push`). The Agent Runtime executes test cases in the cloud using the pushed agent definition.
+
+## Start an Eval Run
+
+```bash
+uip agent eval run start --set "<set-name>" --path <dir> --wait --output json
+```
+
+**Options:**
+
+| Flag | Required | Description | Default |
+|------|----------|-------------|---------|
+| `--set <name-or-id>` | Yes | Eval set name or ID | — |
+| `--path <dir>` | No | Agent project directory | `.` |
+| `--wait` | No | Block until the run completes, then print results | `false` |
+| `--timeout <seconds>` | No | Maximum time to block when `--wait` is set | `600` (10 min) |
+| `--solution-id <id>` | No | Override solution ID for this run | Auto-resolved from the pushed-agent state |
+
+Without `--wait`, the command returns immediately with `Code: AgentEvalRunStarted`:
+
+```json
+{
+  "Code": "AgentEvalRunStarted",
+  "Data": {
+    "EvalSetRunId": "a1b2c3d4-...",
+    "EvalSetName": "Default Evaluation Set",
+    "TestCases": 5,
+    "Evaluators": 2
+  }
+}
+```
+
+With `--wait`, the CLI polls every 5 seconds (hardcoded interval) until the run reaches a terminal state (`completed` or `failed`) or `--timeout` elapses, then emits `AgentEvalRunCompleted` plus per-test `AgentEvalRunResults`. If `--timeout` elapses first, the run continues server-side; query progress with `uip agent eval run status <run-id>`.
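The `--wait` semantics (fixed poll interval, timeout that abandons the local wait but not the server-side run) can be mirrored when scripting around the CLI. A sketch (Python) with a hypothetical `fetch_status` callable standing in for a status query:

```python
import time

def wait_for_run(fetch_status, timeout_s: float = 600.0, poll_s: float = 5.0):
    """Poll until the run reaches a terminal state or the timeout elapses.

    fetch_status() -> str, one of: "pending", "running", "completed", "failed".
    Returns the terminal status, or None if we gave up waiting
    (the run itself keeps going server-side).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_s)
    return None
```

A `None` return should be treated like the CLI's timeout: not a failure, just a cue to re-query later.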
+
+### Output codes
+
+| Subcommand | `Code` |
+|------------|--------|
+| `run start` (no `--wait`) | `AgentEvalRunStarted` |
+| `run start --wait` (summary) | `AgentEvalRunCompleted` |
+| `run start --wait` (per-case detail) | `AgentEvalRunResults` |
+| `run status` | `AgentEvalRunStatus` |
+| `run results` | `AgentEvalRunResults` |
+| `run results --export-format` | `AgentEvalRunExported` |
+| `run list` | `AgentEvalRunList` |
+| `run compare` | `AgentEvalRunComparison` |
+
+## Check Run Status
+
+```bash
+uip agent eval run status --set "<eval-set-name>" --path <project-dir> --output json
+```
+
+**Output:**
+```json
+{
+  "Code": "AgentEvalRunStatus",
+  "Data": {
+    "EvalSetRunId": "a1b2c3d4-...",
+    "Status": "completed",
+    "Score": 0.86,
+    "Duration": "42.5s",
+    "EvaluatorScores": "semantic: 0.9, trajectory: 0.82"
+  }
+}
+```
+
+Terminal states: `completed` or `failed`.
+
+## View Results
+
+```bash
+uip agent eval run results \
+  --set "<eval-set-name>" \
+  --path <project-dir> \
+  --output json
+```
+
+**Options:**
+
+| Flag | Description |
+|------|-------------|
+| `--only-failed` | Show only failed or errored test cases |
+| `--verbose` | Include evaluator justifications in output |
+| `--export-format <format>` | Export results to file (`eval-results-{timestamp}.json` or `.csv`) |
+
+**Per-test-case output fields:** `TestCase`, `Status`, `Score`, `EvaluatorScores`, `Tokens`, `Duration`, `Error` (plus `Justifications` when `--verbose`).
+
+### Filtering results with `--output-filter`
+
+`--output-filter` takes a JMESPath expression and applies it to the JSON payload before printing.
Useful for triage:
+
+```bash
+# Print only test cases with a specific name
+uip agent eval run results --set "Default Evaluation Set" --path ./my-agent \
+  --output json --output-filter 'Data.Results[?TestCase==`greeting-test`]'
+
+# Print only the score field for each test case
+uip agent eval run results --set "Default Evaluation Set" --path ./my-agent \
+  --output json --output-filter 'Data.Results[*].{name: TestCase, score: Score}'
+```
+
+### Failure detection
+
+`--only-failed` filters to test cases where any of these are true (`isFailedRun()` in the CLI):
+
+1. `status === "failed"` (or numeric `"3"`)
+2. `errorMessage` is non-null
+3. `result.score.type === "error"` (or numeric `"2"`)
+4. Any `assertionRuns[*].result.score.type === "error"` (or numeric `"2"`)
+5. Any `assertionRuns[*].result.score.value === false` (exact-match evaluators that returned a false boolean)
+
+Status enum values from the SDK: `0 = pending`, `1 = running`, `2 = completed`, `3 = failed`. The CLI normalizes string and numeric forms.
+
+## List Past Runs
+
+```bash
+uip agent eval run list --set "<eval-set-name>" --path <project-dir> --output json
+```
+
+**Per-row output:** `EvalSetRunId`, `Status`, `Score`, `TestCases`, `Duration`, `EvaluatorScores`, `CreatedAt`.
+
+## Compare Runs
+
+Compare two eval runs side by side to see score changes:
+
+```bash
+uip agent eval run compare \
+  --compare-to <run-id> \
+  --set "<eval-set-name>" \
+  --path <project-dir> \
+  --output json
+```
+
+**Output:**
+```json
+{
+  "Code": "AgentEvalRunComparison",
+  "Data": {
+    "RunA": { "Id": "...", "Score": 0.86, "Status": "completed" },
+    "RunB": { "Id": "...", "Score": 0.80, "Status": "completed" },
+    "ScoreDelta": 0.06,
+    "TestCases": [
+      {
+        "TestCase": "happy-path",
+        "ScoreA": 1.0,
+        "ScoreB": 0.9,
+        "Delta": "+0.1",
+        "StatusA": "completed",
+        "StatusB": "completed"
+      }
+    ]
+  }
+}
+```
+
+Use `compare` after prompt changes to verify improvements without regressions.
+
+## Workflow Example
+
+```bash
+# 1.
Add test cases
+uip agent eval add greeting-test \
+  --set "Default Evaluation Set" \
+  --inputs '{"input":"hi there"}' \
+  --expected '{"content":"Hello! How can I help you?"}' \
+  --expected-agent-behavior "Agent should respond with a friendly greeting" \
+  --path ./my-agent --output json
+
+# 2. Validate (catches schema drift, missing evaluator refs, broken eval JSON)
+uip agent validate --path ./my-agent --output json
+
+# 3. Push agent to Studio Web (required before running evals)
+uip agent push --path ./my-agent --output json
+
+# 4. Run and wait
+uip agent eval run start \
+  --set "Default Evaluation Set" \
+  --path ./my-agent \
+  --wait --timeout 600 --output json
+
+# 5. Review failures
+uip agent eval run results \
+  --set "Default Evaluation Set" \
+  --only-failed --verbose \
+  --path ./my-agent --output json
+
+# 6. Make changes, validate, push, re-run, compare
+uip agent validate --path ./my-agent --output json
+uip agent push --path ./my-agent --output json
+uip agent eval run start --set "Default Evaluation Set" --path ./my-agent --wait --output json
+uip agent eval run compare --compare-to <previous-run-id> \
+  --set "Default Evaluation Set" --path ./my-agent --output json
+```
+
+## Anti-patterns
+
+- **Don't run `eval run start` without `uip agent push` first.** The Agent Runtime executes against the pushed agent, not local files. Local edits made after the last push will not affect results.
+- **Don't assume `--timeout` cancels the server-side run.** It only stops the local CLI from blocking. The run continues and can be inspected with `run status`.
+- **Don't skip `uip agent validate` between edits and push.** Validate catches eval-set / evaluator drift that push will accept silently and the runtime will reject.
+- **Don't compare runs from different eval sets.** `compare` aligns by test case `name` within the eval set; cross-set deltas are meaningless.
+- **Don't rely on `Score` alone; inspect `EvaluatorScores`.** A 0.86 aggregate can mask an agent that reaches the right-looking output through the wrong steps (high semantic score, low trajectory score). Use `--verbose` to read justifications when scores look surprising.
+- **Don't mix score scales across evaluators in the same eval set.** Defaults written by `uip agent init` use 0–100 prompts; defaults written by `evaluator add` use 0–1 prompts. The runtime DTO normalizes to 0–100, but mixed-scale prompts produce confusing per-evaluator scores. Decide on one scale per eval set and edit prompts to match.
diff --git a/skills/uipath-agents/references/lowcode/lowcode.md b/skills/uipath-agents/references/lowcode/lowcode.md
index 94b65521d..4858efd71 100644
--- a/skills/uipath-agents/references/lowcode/lowcode.md
+++ b/skills/uipath-agents/references/lowcode/lowcode.md
@@ -46,6 +46,7 @@ Capabilities are **orthogonal**: there is no ordering requirement among them. Ad
 | Scaffolding, validating, or running solution lifecycle commands | [project-lifecycle.md](project-lifecycle.md) |
 | Editing `agent.json` (prompts, schemas, model, contentTokens) or `entry-points.json` | [agent-definition.md](agent-definition.md) |
 | External tools / IS tools / index contexts / escalations behave unexpectedly after `uip solution resource refresh` | [solution-resources.md](solution-resources.md) |
+| Running evaluations, adding test cases, managing evaluators | [evaluations/evaluate.md](evaluations/evaluate.md) |
 
 ### Capability Registry