`skills/uipath-agents/SKILL.md` (1 addition, 0 deletions)

@@ -46,6 +46,7 @@ Determine the agent mode before proceeding:
| Add guardrails (PII, harmful content, custom rules) to a low-code agent | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) § Capability Registry | `lowcode/capabilities/guardrails/guardrails.md` |
| Add escalation guardrail (escalate action / Action Center app) | Low-code | [lowcode/capabilities/guardrails/guardrails.md](references/lowcode/capabilities/guardrails/guardrails.md) § escalate — Hand Off to Action Center | Run `uip solution resource list --kind App` to confirm app exists |
| Embed a low-code agent inline in a flow, or wire a multi-agent solution | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) § Capability Registry | `lowcode/capabilities/inline-in-flow/inline-in-flow.md`, `lowcode/capabilities/process/solution-agent.md` |
| Run low-code evaluations | Low-code | [lowcode/evaluations/evaluate.md](references/lowcode/evaluations/evaluate.md) | `lowcode/evaluations/evaluators.md`, `lowcode/evaluations/evaluation-sets.md`, `lowcode/evaluations/running-evaluations.md` |
| Validate, pack, publish, upload, or deploy a low-code agent | Low-code | [lowcode/lowcode.md](references/lowcode/lowcode.md) | `lowcode/project-lifecycle.md`, `lowcode/solution-resources.md` |

## Resources
`skills/uipath-agents/references/lowcode/evaluations/evaluate.md` (90 additions, 0 deletions)

@@ -0,0 +1,90 @@
# Evaluate Low-Code Agents

Design and run evaluations against low-code agents using the `uip agent eval` CLI.

## Quick Reference

```bash
# Add a test case
uip agent eval add happy-path --set "Default Evaluation Set" --inputs '{"input":"hello"}' --expected '{"content":"greeting"}' --path ./my-agent --output json

# Run evals and wait for results
uip agent eval run start --set "Default Evaluation Set" --path ./my-agent --wait --output json

# Check results (failures only, with justifications)
uip agent eval run results <run_id> --set "Default Evaluation Set" --only-failed --verbose --path ./my-agent --output json
```

## Prerequisites

- Agent project initialized (`uip agent init <path>`)
- `entry-points.json` present (defines `input`/`output` schema that test case `--inputs`/`--expected` must conform to)
- `uip agent validate --output json` passes (validate also checks evals and evaluators)
- Agent pushed to Studio Web (`uip agent push`) — required for running evals (the Agent Runtime executes test cases in the cloud)

Local operations (managing evaluators, eval sets, test cases) do **not** require authentication or a cloud connection. Only `uip agent eval run *` commands require cloud connectivity.
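
Putting the prerequisites together, a minimal first-run sequence looks like the sketch below. One assumption: `--path` on `validate` and `push` is inferred from the eval commands above; adjust if your CLI version expects the project directory differently.

```bash
# Scaffold the project, then check it locally (no auth needed)
uip agent init ./my-agent
uip agent validate --path ./my-agent --output json   # also checks evals and evaluators

# Authenticate and push so the cloud Agent Runtime can execute test cases
uip login --output json
uip agent push --path ./my-agent --output json

# Only now will an eval run resolve the solution in the cloud
uip agent eval run start --set "Default Evaluation Set" --path ./my-agent --wait --output json
```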

## Reference Navigation

- [Evaluators](evaluators.md) — evaluator types, adding/removing, default prompts
- [Evaluation Sets and Test Cases](evaluation-sets.md) — creating sets, adding test cases, simulation options
- [Running Evaluations](running-evaluations.md) — start, status, results, compare

Read Evaluators before choosing an evaluator type, and Evaluation Sets before writing test cases.

## File Structure

After `uip agent init`, the project structure is:

```
my-agent/
  agent.json
  entry-points.json                      # Input/output schema — test case --inputs / --expected must match
  project.uiproj
  flow-layout.json
  evals/
    evaluators/
      evaluator-default.json             # name: "Default Evaluator" (semantic-similarity)
      evaluator-default-trajectory.json  # name: "Default Trajectory Evaluator"
    eval-sets/
      evaluation-set-default.json        # name: "Default Evaluation Set" (references both evaluators)
```

Evaluators live in `evals/evaluators/` and eval sets (with inline test cases) live in `evals/eval-sets/`. Both are auto-discovered by the CLI from these directories.

CLI-added evaluators are written as `evaluator-<uuid8>.json` (first 8 hex chars of the evaluator UUID). The `<name>` argument populates the `name` field inside the JSON, NOT the filename. Reference evaluators in eval sets by `id` (UUID), not filename.
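
Because filenames do not encode the evaluator name, locate evaluators on disk by searching the JSON contents. A quick sketch (assumes `jq` is installed and that `name`/`id` sit at the top level of each evaluator file, as in the default files above):

```bash
# Print file, name, and id for every evaluator; the ids are what evaluatorRefs reference
for f in evals/evaluators/*.json; do
  jq -r --arg f "$f" '"\($f)  \(.name)  \(.id)"' "$f"
done
```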

## Key Differences from Coded Agent Evals

| Aspect | Coded (`uip codedagent eval`) | Low-code (`uip agent eval`) |
|--------|-------------------------------|------------------------------|
| Execution | Local Python process | Cloud-based via Agent Runtime |
| Auth required | Only for `--report` | For `eval run *` commands (cloud execution) |
| Prerequisite | `entry-points.json` | `uip agent push` |
| Mocking | `@mockable()` decorator + declarative | Simulation instructions only |
| CLI prefix | `uip codedagent eval` | `uip agent eval` |

## Troubleshooting

| Error | Cause | Fix |
|-------|-------|-----|
| Solution ID could not be resolved | Agent not pushed to Studio Web | Run `uip agent push --output json`, or pass `--solution-id <id>` explicitly to `uip agent eval run start` |
| `No evaluators found` | Empty `evals/evaluators/` directory | Run `uip agent eval evaluator add` or re-init with `uip agent init` |
| `No test cases in eval set` | Eval set has no evaluations | Run `uip agent eval add` to add test cases |
| `Unknown evaluator type "X"` | Wrong case on `--type` value | Use kebab-case only: `semantic-similarity`, `trajectory` |
| `Evaluator '<id>' is an LLM-based evaluator but 'model' is not set in its evaluatorConfig.` | LLM evaluator JSON has empty/missing `model` and is not `same-as-agent` | Set `"model"` in the evaluator JSON to a valid model (e.g. `claude-haiku-4-5-20251001`), or set it to `"same-as-agent"` and ensure `agent.json` has a model |
| `'same-as-agent' model option requires agent settings. Ensure agent.json contains valid model settings.` | Evaluator uses `"model": "same-as-agent"` but `agent.json` has no resolvable model | Set a model in `agent.json`, or override the evaluator with an explicit model |
| `401 Unauthorized` | Auth expired | Run `uip login --output json` |
| Eval run timeout (with `--wait`) | Agent taking too long or stuck | Increase `--timeout` or check agent health in Studio Web. Note: this only stops the local CLI from blocking; the run continues server-side — query with `uip agent eval run status <run_id>` |
| Validate fails with eval errors | Eval set references an evaluator that no longer exists, OR evaluator JSON missing required field, OR `category`/`type` mismatch (see [evaluators.md](evaluators.md) § What `uip agent validate` Checks) | Re-run `uip agent eval evaluator list` and reconcile `evaluatorRefs`; fix per the validate error message |

The two model-resolution errors above are **runtime checks in the cloud eval worker**, not validate-time checks — `uip agent validate` will not catch them. They surface only after `uip agent eval run start`. To pre-empt them, inspect each evaluator's `model` field locally before pushing.
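
One way to do that local inspection, as a sketch (assumes `jq`, and that `model` nests under `evaluatorConfig` as the error messages indicate; non-LLM evaluators may legitimately report no model):

```bash
# Report each evaluator's model, flagging missing/empty values before push
jq -r 'if (.evaluatorConfig.model // "") == ""
       then "\(.name): model MISSING"
       else "\(.name): \(.evaluatorConfig.model)" end' evals/evaluators/*.json
```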

## Anti-patterns

- **Don't run `uip agent eval run start` before `uip agent push`.** The Agent Runtime executes against the pushed agent. Local edits to `agent.json` after the last push will not be reflected in the run.
- **Don't skip `uip agent validate` before push.** Validate checks `evals/eval-sets/` and `evals/evaluators/`; broken eval JSON will not block push but will surface as runtime errors.
- **Don't hand-edit `id` or `evaluatorRefs` UUIDs.** Eval sets reference evaluators by UUID. Renaming an evaluator file or copy-pasting a UUID across evaluators silently breaks resolution.
- **Don't expect filenames to match `<name>`.** CLI-generated evaluator files use `evaluator-<uuid8>.json`, not `<name>.json`. Look up evaluators by the `name` field inside the JSON, not by filename.
- **Don't pass `--type` in PascalCase.** The CLI rejects `SemanticSimilarity`. Only kebab-case is accepted.
- **Don't reference evaluators across projects.** Each agent project has its own `evals/evaluators/` directory; UUIDs are not portable.
`skills/uipath-agents/references/lowcode/evaluations/evaluation-sets.md` (153 additions, 0 deletions)

@@ -0,0 +1,153 @@
# Evaluation Sets and Test Cases

Evaluation sets group test cases and reference which evaluators to use. Each set is a JSON file in `evals/eval-sets/`. Test cases are stored inline within the eval set.

## Managing Eval Sets

### Add an eval set

```bash
uip agent eval set add <name> --path <agent_dir> --output json
```

**Options:**

| Flag | Required | Description | Default |
|------|----------|-------------|---------|
| `--evaluators <ids>` | No | Comma-separated evaluator IDs | All existing evaluators |
| `--path <path>` | No | Agent project directory | `.` |

When `--evaluators` is not provided, the new eval set automatically references **all** evaluators in the project.
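
For example, to scope a new set to two specific evaluators (UUIDs as reported by `uip agent eval evaluator list`):

```bash
uip agent eval set add "Regression Set" \
  --evaluators "<evaluator-uuid-1>,<evaluator-uuid-2>" \
  --path ./my-agent --output json
```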

### List eval sets

```bash
uip agent eval set list --path <agent_dir> --output json
```

### Remove an eval set

```bash
uip agent eval set remove <id_or_name> --path <agent_dir> --output json
```

## Managing Test Cases

Test cases live inside eval sets. Each test case defines an input, expected output, and optional behavior expectations.

### Add a test case

```bash
uip agent eval add <name> \
--set "<eval_set_name>" \
--inputs '{"input":"hello"}' \
--expected '{"content":"greeting response"}' \
--path <agent_dir> \
--output json
```

**Options:**

| Flag | Required | Description | Default |
|------|----------|-------------|---------|
| `--set <name>` | Yes | Eval set name or ID | — |
| `--inputs <json>` | Yes | Input values as JSON | — |
| `--expected <json>` | No | Expected output as JSON | `{}` |
| `--expected-agent-behavior <text>` | No | Description of expected behavior (used by trajectory evaluator) | `""` |
| `--simulation-instructions <text>` | No | Instructions for simulating agent behavior | `""` |
| `--simulate-input` | No | Enable input simulation | `false` |
| `--simulate-tools` | No | Enable tool simulation | `false` |
| `--input-generation-instructions <text>` | No | Instructions for generating synthetic inputs | `""` |
| `--path <path>` | No | Agent project directory | `.` |

### List test cases

```bash
uip agent eval list --set "<eval_set_name>" --path <agent_dir> --output json
```

### Remove a test case

```bash
uip agent eval remove <id_or_name> --set "<eval_set_name>" --path <agent_dir> --output json
```

## Test Case Design

### Aligning `--inputs` with `entry-points.json`

`--inputs` JSON keys must match the `input` schema in `entry-points.json`. Mismatched keys do not block `eval add` (the CLI stores the JSON verbatim) but will fail at run time when the Agent Runtime invokes the agent. Run `uip agent validate --output json` after adding test cases to surface schema drift.
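
A quick local check is to compare the declared keys against a candidate payload. The sketch below is hedged: the internal shape of `entry-points.json` is not documented here, so the `.input.properties` path is an assumption to adapt to your file:

```bash
# Keys the entry-point schema declares (adjust the jq path to your schema's shape)
jq -r '.input.properties | keys[]' entry-points.json

# Keys a candidate --inputs payload would send
echo '{"input":"hello"}' | jq -r 'keys[]'
```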

### Matching evaluator to test case fields

The `--inputs` and `--expected` flags populate `inputs` and `expectedOutput` on the test case. Each evaluator type sources its placeholder values from a different combination of test-case fields and agent run trace:

| Evaluator Type | From test case | From agent run |
|----------------|---------------|----------------|
| Semantic Similarity (type 5) | `expectedOutput` → `{{ExpectedOutput}}` | Agent output → `{{ActualOutput}}` |
| Trajectory (type 7) | `expectedAgentBehavior` → `{{ExpectedAgentBehavior}}`, `inputs` → `{{UserOrSyntheticInput}}`, `simulationInstructions` → `{{SimulationInstructions}}` | Trace → `{{AgentRunHistory}}` |
| Exact match (type 1) | `expectedOutput` (compared verbatim, no placeholders) | Agent output (compared verbatim) |
| JSON similarity (type 6) | `expectedOutput` (tree-compared, no placeholders) | Agent output (tree-compared) |

For trajectory evaluation, write `--expected-agent-behavior` as a natural language description of what the agent should do, not what it should output:

```bash
uip agent eval add tool-usage-test \
--set "Default Evaluation Set" \
--inputs '{"input":"What is the weather in NYC?"}' \
--expected-agent-behavior "Agent should call the weather tool with location NYC and return a formatted weather summary" \
--path ./my-agent --output json
```

### Simulation options

- `--simulate-input` — runtime generates synthetic input variations based on the provided input
- `--simulate-tools` — tool calls are simulated rather than executed against real services
- `--input-generation-instructions` — guides synthetic input generation (e.g., "generate edge cases with empty strings and special characters")
- `--simulation-instructions` — guides overall simulation behavior

Use these to expand test coverage without writing every input by hand.
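
A sketch combining these flags on one test case (instruction strings are illustrative):

```bash
uip agent eval add edge-cases \
  --set "Default Evaluation Set" \
  --inputs '{"input":"hello"}' \
  --simulate-input \
  --simulate-tools \
  --input-generation-instructions "generate edge cases with empty strings and special characters" \
  --simulation-instructions "tools return plausible canned responses" \
  --path ./my-agent --output json
```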

## Eval Set JSON Format

```json
{
  "fileName": "evaluation-set-default.json",
  "id": "<uuid>",
  "name": "Default Evaluation Set",
  "batchSize": 10,
  "evaluatorRefs": ["<evaluator-uuid-1>", "<evaluator-uuid-2>"],
  "evaluations": [
    {
      "id": "<uuid>",
      "name": "happy-path",
      "inputs": {"input": "hello"},
      "expectedOutput": {"content": "greeting"},
      "expectedAgentBehavior": "",
      "simulationInstructions": "",
      "simulateInput": false,
      "simulateTools": false,
      "inputGenerationInstructions": "",
      "evalSetId": "<eval-set-uuid>",
      "source": "manual",
      "createdAt": "...",
      "updatedAt": "..."
    }
  ],
  "modelSettings": [],
  "agentMemoryEnabled": false,
  "agentMemorySettings": [],
  "lineByLineEvaluation": false,
  "createdAt": "...",
  "updatedAt": "..."
}
```

The `source` field indicates how the test case was created. CLI-added test cases are always `"manual"` (verified). Other observed values from Studio Web include `"debugRun"`, `"runtimeRun"`, `"simulatedRun"`, and `"autopilotUserInitiated"` — treat the `source` field as an enum but do not set it manually; the CLI and Studio Web own this value.

## Anti-patterns

- **Don't hand-write `evalSetId` or test case `id` UUIDs.** Use `uip agent eval add` so the CLI keeps `evaluations[].evalSetId` consistent with the parent eval set's `id`.
- **Don't add `--inputs` keys that are not in `entry-points.json`.** The runtime will reject the test case at execution time. Run `uip agent validate` to catch this before push.
- **Don't leave both `--expected` empty (`{}`) and `--expected-agent-behavior` blank.** The semantic-similarity evaluator then scores against an empty `{{ExpectedOutput}}` and the trajectory evaluator against an empty `{{ExpectedAgentBehavior}}`, so every run scores low for non-actionable reasons.
- **Don't set the `source` field manually.** Owned by CLI and Studio Web; hand-edits may be overwritten on the next sync.