
feat: add low-code agent evaluation docs #552

Open

mjnovice wants to merge 5 commits into main from feat/lowcode-eval-docs

Conversation


@mjnovice mjnovice commented May 4, 2026

Summary

  • Adds low-code agent evaluation documentation to the uipath-agents skill, filling a gap where coded agent evals have 5+ reference files but low-code has zero
  • Full CLI support exists in uip agent eval (evaluator CRUD, eval set/test case management, run start/status/results/compare) but was undocumented in the skill
  • 4 new reference files under references/lowcode/evaluation/ matching the structure of the coded eval docs
  • Updates SKILL.md task navigation and lowcode.md capability registry with eval entries

Files Added

  • evaluation/evaluate.md — entry point, prerequisites, file structure, key differences from coded evals
  • evaluation/evaluators.md — 4 evaluator types (semantic-similarity, trajectory, context-precision, faithfulness), JSON format, custom prompts
  • evaluation/evaluation-sets.md — eval set and test case CRUD, simulation options, JSON format
  • evaluation/running-evaluations.md — run start/status/results/list/compare commands with output examples (see the illustrative sketch below)
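
For orientation, the flow these files document looks roughly like the sketch below. This is a sketch, not a copy of the docs: only `run start/status/results/compare`, `--wait`, and `--type` are named elsewhere in this PR, and the evaluator subcommand spellings are assumptions.

```bash
# Illustrative sketch; exact flag and subcommand spellings live in the
# reference files, and lines marked "assumed" are guesses at the syntax.
# Prerequisite (per evaluate.md): the agent is already pushed to Studio Web.
uip agent eval evaluator add --type semantic-similarity   # assumed subcommand spelling
uip agent eval evaluator list                             # assumed subcommand spelling
uip agent eval run start --wait                           # block until the run completes
uip agent eval run results
uip agent eval run compare
```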

Test plan

  • Verify all internal links resolve (evaluate.md → evaluators.md, etc.)
  • Verify SKILL.md task navigation row links to correct file
  • Verify lowcode.md capability registry and "Read on demand" table entries
  • Test documented commands against a real low-code agent project with uip agent eval

🤖 Generated with Claude Code

The uipath-agents skill has comprehensive evaluation docs for coded
agents (5 files under coded/lifecycle/evaluations/) but none for
low-code agents, despite full CLI support in `uip agent eval`.

Adds 4 reference files under lowcode/evaluation/:
- evaluate.md — entry point, prerequisites, file structure, differences from coded
- evaluators.md — 4 evaluator types, add/list/remove, JSON format, custom prompts
- evaluation-sets.md — eval set and test case CRUD, simulation options, JSON format
- running-evaluations.md — run start/status/results/list/compare, workflow example

Updates SKILL.md task navigation and lowcode.md capability registry
to reference the new evaluation docs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 4, 2026

Claude finished @mjnovice's task in 3m 0s


Summary

Adds 4 new reference files documenting the uip agent eval CLI for low-code agents under references/lowcode/evaluation/, plus task navigation and capability registry entries in SKILL.md and lowcode.md. Fills a real documentation gap — coded agent evals had extensive coverage while low-code had none.

Change-by-Change Review

1. skills/uipath-agents/SKILL.md (+1 line)

Severity: OK

New task navigation row correctly links to references/lowcode/evaluation/evaluate.md as the entry point and lists the three sub-docs in "Then". Placement is alphabetically consistent with surrounding rows. Pattern matches existing entries.

2. skills/uipath-agents/references/lowcode/lowcode.md (+2 lines)

Severity: OK

"Read on demand" table entry (line 49) and Capability Registry row (line 72) both link to evaluation/evaluate.md with correct relative paths. Sub-doc references in the "Then" column match the SKILL.md entry.

3. skills/uipath-agents/references/lowcode/evaluation/evaluate.md (new, 71 lines)

Severity: OK

Well-structured entry point. Quick Reference provides copy-paste commands, prerequisites are clear (push before run, local ops don't need auth), reference navigation links all resolve. Troubleshooting table covers common failure modes. Key Differences table vs coded evals is useful context.

Minor: Comments in the Quick Reference code block (lines 8, 11, 14 — # Add a test case, # Run evals and wait for results, # Check results...) restate what the adjacent command does. Per token-optimization rules, these could be stripped since the section heading "Quick Reference" already frames the context. Low — not blocking.

4. skills/uipath-agents/references/lowcode/evaluation/evaluators.md (new, 102 lines)

Severity: High — template variable inconsistency

Line 64 documents the trajectory evaluator's template variables as:

{{UserOrSyntheticInput}}, {{SimulationInstructions}}, {{ExpectedAgentBehavior}}, {{AgentRunHistory}}

But the Custom Prompts section at line 100 lists trajectory variables as:

{{AgentRunHistory}}, {{ExpectedBehavior}}

Two problems:

  1. {{ExpectedAgentBehavior}} vs {{ExpectedBehavior}} — contradictory names. An agent following the Custom Prompts section would use the wrong variable.
  2. Line 100 lists only 2 variables vs 4 on line 64 — the Custom Prompts section is incomplete, missing {{UserOrSyntheticInput}} and {{SimulationInstructions}}.

Fix: Reconcile the two lists — determine which variable name is correct ({{ExpectedAgentBehavior}} or {{ExpectedBehavior}}) by checking the actual CLI, then update the Custom Prompts section to list all 4 variables consistently.

5. skills/uipath-agents/references/lowcode/evaluation/evaluation-sets.md (new, 140 lines)

Severity: OK

Comprehensive coverage of eval set and test case CRUD. Option tables are complete. Test case design section helpfully maps evaluator types to relevant flags. JSON format example is useful for agents that need to understand the file structure. Simulation options section is concise.

6. skills/uipath-agents/references/lowcode/evaluation/running-evaluations.md (new, 163 lines)

Severity: Low

Full coverage of run start/status/results/list/compare. JSON output examples match expected CLI patterns. Failure detection criteria are explicit — good for LLM usability.

The Workflow Example (lines 134–163) has numbered step comments (# 1. Push agent..., # 2. Add test cases..., etc.). These serve as section boundaries in a long code block, which is acceptable per token-optimization rules. Borderline — the surrounding ## Workflow Example heading provides no prose context, so the comments carry necessary structure here.

What's Missing

  • E2E test tasks for low-code evaluation — No test task in tests/tasks/uipath-agents/ covers the new evaluation workflow (e.g., adding a test case, running an eval). Existing tasks cover guardrails, tools, contexts, escalations, etc., but not evals. Severity: Medium — consistent with the review guidelines that flag missing test coverage for new capabilities.
  • Error handling guidance — The docs cover the happy path well but don't specify what an agent should do when uip agent eval run start returns a non-zero exit code or unexpected JSON structure. The troubleshooting table in evaluate.md partially covers this, but the sub-docs lack explicit "if this fails, do X" instructions (a minimal example of such a guard is sketched below). Severity: Low — the troubleshooting table provides reasonable coverage.
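
A minimal sketch of the kind of guard the sub-docs could spell out (illustrative only; assumes jq is available and that the CLI emits the documented Code field, neither of which is verified against the real CLI output):

```bash
# Illustrative guard; does not assert which Code value a successful start returns.
if ! out=$(uip agent eval run start --wait); then
  echo "uip agent eval run start failed (non-zero exit); check auth and that the agent is pushed" >&2
  exit 1
fi
code=$(printf '%s' "$out" | jq -r '.Code // empty')
[ -n "$code" ] || { echo "unexpected output: no Code field in response" >&2; exit 1; }
```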

Area Ratings

| Area | Status | Notes |
|---|---|---|
| Frontmatter | OK | No frontmatter changes needed — existing SKILL.md frontmatter unmodified |
| E2E Tests | Issue | No eval-specific test tasks added (Medium) |
| Skill Body | OK | Task navigation and capability registry entries correct |
| References & Assets | Issue | Template variable inconsistency in evaluators.md (High) |
| Repo Hygiene | OK | CODEOWNERS covers /skills/uipath-agents/, no secrets, no cross-skill refs, all links resolve |

Issues for Manual Review

  • CLI command accuracy — Cannot verify whether the documented uip agent eval subcommands, flags, and JSON output shapes match the actual CLI implementation. The commands follow established patterns from the coded eval docs, but the flag names (--simulate-input, --simulate-tools, --input-generation-instructions, --solution-id), JSON output codes (AgentEvalRunStarted, AgentEvalRunStatus, AgentEvalRunComparison), and evaluator type numeric mappings (5, 7, 8, 9) should be validated against the real CLI.
  • Template variable names — Which is correct: {{ExpectedAgentBehavior}} or {{ExpectedBehavior}}? Only the CLI source or actual evaluator default prompts can resolve this (a quick check against the CLI source is sketched below).
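
One quick way to resolve both points, assuming the CLI source checkout referenced in the commit messages (Code/cli) is available and that the CLI exposes a conventional --help flag:

```bash
# Locate the real template variable name and the documented flags in the CLI source.
grep -rn "ExpectedAgentBehavior" Code/cli
grep -rn "ExpectedBehavior" Code/cli
uip agent eval run start --help   # assumed to exist; prints the supported flags if so
```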

Conclusion

Solid documentation addition that fills a clear gap. One High issue must be fixed before merge: the contradictory trajectory evaluator template variable names in evaluators.md (line 64 vs line 100). The missing e2e test task is Medium — consistent with the general test coverage gap across the repo. Everything else is clean.


mjnovice and others added 4 commits May 4, 2026 18:29
Address PR #552 review comment from @Chibionos: drop SolutionStorage.json
mentions throughout the eval refs (it is going away). Reword troubleshooting,
prerequisites, file-structure tree, and the --solution-id default to describe
the user-facing concept ("agent pushed to Studio Web") instead.

Folds in additional corrections found while verifying the PR against the
uip CLI (Code/cli), uipath-python SDK, and Agents service repo:

- Rename evaluation/ → evaluations/ to match coded sibling convention.
- Move eval row from Capability Registry to "Read on demand" in lowcode.md
  (eval is lifecycle, not a capability).
- Fix evaluator filename example: actual pattern is evaluator-<uuid8>.json,
  not <name>.json. The user-supplied <name> goes into the JSON name field.
- Restore --wait polling cadence (5s) and --timeout default (600s) — both
  hardcoded in eval-run.ts. Removed earlier when unverified.
- Add complete output Code enum (AgentEvalRunStarted/Completed/Results/
  Status/Exported/List/Comparison).
- Expand failure detection with the numeric forms isFailedRun() actually
  checks (status "3", score.type "2"), plus the SDK status enum.
- Document the worker-side LLM model fail-fast (activities.py) and the
  same-as-agent resolver error (EvaluatorFactory) — these are runtime,
  not validate-time, errors.
- Correct context-precision/faithfulness data flow: both are trace-driven
  (RETRIEVER spans), not test-case-driven; faithfulness reads expectedOutput
  as the candidate text, not the agent's actual output.
- Add "Why fewer evaluators than coded?" section explaining the legacy vs
  new SDK engine split, plus the 2 runtime-supported types not exposed by
  the CLI (Equals=1, JsonSimilarity=6) with copy-pasteable JSON.
- Document validate's category↔type matrix (cat 0→{1,6}, cat 1→{5,8,9},
  cat 3→{7}) and required fields per schema-validation-service.ts.
- Add Anti-patterns section to all four eval reference files per
  skill-structure.md convention.
- Workflow example: insert validate step between add and push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…code eval docs

Remove context-precision and faithfulness from the low-code evaluator
surface entirely. Updates:

- evaluators.md: drop both rows from the CLI-exposed table, the --type
  description, the type/category mapping, and the default-prompts table.
  Narrow the validate matrix's cat 1 to type {5} only. Update the "Why
  fewer" intro to reflect 2 supported CLI types.
- evaluation-sets.md: remove the trace-driven data-flow rows for both
  evaluators, the explanatory callout about RETRIEVER spans, and the
  related anti-patterns. Test-case design now covers only ss + trajectory.
- evaluate.md: narrow the "Unknown evaluator type" troubleshooting hint.

Coded eval refs are unchanged — those use uipath-llm-judge-* IDs, not
the legacy CLI names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the synthetic skeletons in "Runtime-supported types not exposed
by the CLI" with the canonical shapes used in real low-code agent
projects:

- Equals (type 1) and JsonSimilarity (type 6) keep their
  Deterministic-category shape (no prompt/model needed) but now use
  realistic descriptions and filenames.
- Add explicit LlmAsAJudge (type 5) and Trajectory (type 7) JSON shapes
  for hand-written use, including the full prompt strings, an explicit
  model pin, and the descriptions used in production examples.
- Soften the filename rule: CLI-generated evaluators use
  evaluator-<uuid8>.json, but hand-written files can use any descriptive
  name. The runtime keys off id / evaluatorRefs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Studio Web UI exposes 4 evaluator types (Semantic Similarity, Trajectory,
Exact match, JSON similarity). Verified by counting evaluator JSON files
across multiple production examples — only types 1, 5, 6, 7 appear; nothing
else does.

Previous framing called Exact match and JSON similarity "runtime-supported
types not exposed by the CLI", which understated their status. Both are
real first-class options; the only narrowing surface is the CLI's --type
flag (which covers 2 of 4).

evaluators.md changes:
- New "Supported Evaluator Types" section with a 4-row table mapping UI
  label, type/category, --type flag (where applicable), what it scores,
  and whether it is LLM-based.
- New subsection "How to add each type" calling out the three creation
  paths (UI, CLI, hand-write JSON).
- Renamed the "Why fewer than coded?" section into a subsection of the
  Supported Types group; updated wording to reflect 4 supported types.
- Renamed "Runtime-supported types not exposed by the CLI" to "JSON
  Shapes" and reordered the four shapes to match the table order
  (Exact match, JSON similarity, LLM-as-a-judge, Trajectory).

evaluation-sets.md changes:
- Added Exact match and JSON similarity rows to the field-mapping table
  so all 4 supported types are covered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mjnovice mjnovice requested a review from Chibionos May 6, 2026 01:15

@andreibalas-uipath andreibalas-uipath left a comment

The correct semantics for "pushing" an agent to Studio Web is to use the solutions CLI: uip solution upload. The command either creates a new solution or updates an existing one after the solution has been edited locally.
