Conversation
The uipath-agents skill has comprehensive evaluation docs for coded agents (5 files under `coded/lifecycle/evaluations/`) but none for low-code agents, despite full CLI support in `uip agent eval`. Adds 4 reference files under `lowcode/evaluation/`:

- `evaluate.md` — entry point, prerequisites, file structure, differences from coded
- `evaluators.md` — 4 evaluator types, add/list/remove, JSON format, custom prompts
- `evaluation-sets.md` — eval set and test case CRUD, simulation options, JSON format
- `running-evaluations.md` — run start/status/results/list/compare, workflow example

Updates SKILL.md task navigation and the lowcode.md capability registry to reference the new evaluation docs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
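For orientation, the documented surface roughly maps to a workflow like this — a sketch only; the exact subcommand spellings are assumptions inferred from the file descriptions above, not verified CLI syntax:

```sh
# Assumed workflow; subcommand spellings inferred from the refs above,
# not verified against the real `uip` CLI.
uip agent eval evaluator add      # evaluators.md: add/list/remove evaluators
uip agent eval set add            # evaluation-sets.md: eval set & test case CRUD
uip agent eval run start --wait   # running-evaluations.md: start a run, poll to completion
uip agent eval run compare        # compare two runs
```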
**Claude** finished @mjnovice's task in 3m 0s

### Summary

Adds 4 new reference files documenting the low-code agent evaluation workflow (`uip agent eval`).

### Change-by-Change Review
| Area | Status | Notes |
|---|---|---|
| Frontmatter | OK | No frontmatter changes needed — existing SKILL.md frontmatter unmodified |
| E2E Tests | Issue | No eval-specific test tasks added (Medium) |
| Skill Body | OK | Task navigation and capability registry entries correct |
| References & Assets | Issue | Template variable inconsistency in evaluators.md (High) |
| Repo Hygiene | OK | CODEOWNERS covers /skills/uipath-agents/, no secrets, no cross-skill refs, all links resolve |
### Issues for Manual Review
- **CLI command accuracy** — Cannot verify whether the documented `uip agent eval` subcommands, flags, and JSON output shapes match the actual CLI implementation. The commands follow established patterns from the coded eval docs, but the flag names (`--simulate-input`, `--simulate-tools`, `--input-generation-instructions`, `--solution-id`), JSON output codes (`AgentEvalRunStarted`, `AgentEvalRunStatus`, `AgentEvalRunComparison`), and evaluator type numeric mappings (5, 7, 8, 9) should be validated against the real CLI. A hedged sketch of the output shape in question follows this list.
- **Template variable names** — Which is correct: `{{ExpectedAgentBehavior}}` or `{{ExpectedBehavior}}`? Only the CLI source or the actual evaluator default prompts can resolve this.
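To make the first item concrete, the shape in question looks roughly like this — a hypothetical transcript, where only the `Code` value appears in the review text and every other field name is an assumption:

```sh
# Hypothetical transcript; only the Code value is attested in the review above.
$ uip agent eval run start
{ "Code": "AgentEvalRunStarted", "runId": "<run-id>" }
```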
### Conclusion
Solid documentation addition that fills a clear gap. One High issue must be fixed before merge: the contradictory trajectory evaluator template variable names in evaluators.md (line 64 vs line 100). The missing e2e test task is Medium — consistent with the general test coverage gap across the repo. Everything else is clean.
Address PR #552 review comment from @Chibionos: drop SolutionStorage.json mentions throughout the eval refs (it is going away). Reword troubleshooting, prerequisites, the file-structure tree, and the `--solution-id` default to describe the user-facing concept ("agent pushed to Studio Web") instead.

Folds in additional corrections found while verifying the PR against the uip CLI (Code/cli), the uipath-python SDK, and the Agents service repo:

- Rename `evaluation/` → `evaluations/` to match the coded sibling convention.
- Move the eval row from the Capability Registry to "Read on demand" in lowcode.md (eval is lifecycle, not a capability).
- Fix the evaluator filename example: the actual pattern is `evaluator-<uuid8>.json`, not `<name>.json`. The user-supplied `<name>` goes into the JSON `name` field.
- Restore the `--wait` polling cadence (5s) and `--timeout` default (600s) — both hardcoded in eval-run.ts. Removed earlier when unverified.
- Add the complete output Code enum (AgentEvalRunStarted/Completed/Results/Status/Exported/List/Comparison).
- Expand failure detection with the numeric forms isFailedRun() actually checks (status "3", score.type "2"), plus the SDK status enum.
- Document the worker-side LLM model fail-fast (activities.py) and the same-as-agent resolver error (EvaluatorFactory) — these are runtime, not validate-time, errors.
- Correct the context-precision/faithfulness data flow: both are trace-driven (RETRIEVER spans), not test-case-driven; faithfulness reads expectedOutput as the candidate text, not the agent's actual output.
- Add a "Why fewer evaluators than coded?" section explaining the legacy vs. new SDK engine split, plus the 2 runtime-supported types not exposed by the CLI (Equals=1, JsonSimilarity=6) with copy-pasteable JSON.
- Document validate's category↔type matrix (cat 0 → {1, 6}, cat 1 → {5, 8, 9}, cat 3 → {7}) and the required fields per schema-validation-service.ts.
- Add an Anti-patterns section to all four eval reference files per the skill-structure.md convention.
- Workflow example: insert a validate step between add and push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
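A minimal sketch of the "copy-pasteable JSON" for Equals (type 1) mentioned above: type and category follow the validate matrix (type 1 → category 0); the id, name, description, and filename are illustrative assumptions, not production data.

```sh
# Sketch of an Equals evaluator file. type=1 / category=0 follow the
# validate matrix above; all other values are illustrative assumptions.
cat > evaluator-equals.json <<'EOF'
{
  "id": "00000000-0000-0000-0000-000000000001",
  "name": "exact-match",
  "description": "Passes only when the agent output equals the expected output verbatim.",
  "type": 1,
  "category": 0
}
EOF
```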
Remove context-precision and faithfulness from the low-code evaluator
surface entirely. Updates:
- evaluators.md: drop both rows from the CLI-exposed table, the --type
description, the type/category mapping, and the default-prompts table.
Narrow the validate matrix's cat 1 to type {5} only. Update the "Why
fewer" intro to reflect 2 supported CLI types.
- evaluation-sets.md: remove the trace-driven data-flow rows for both
evaluators, the explanatory callout about RETRIEVER spans, and the
related anti-patterns. Test-case design now covers only ss + trajectory.
- evaluate.md: narrow the "Unknown evaluator type" troubleshooting hint.
Coded eval refs are unchanged — those use uipath-llm-judge-* IDs, not
the legacy CLI names.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
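After this narrowing, the CLI's `--type` flag covers only the two LLM-based evaluators — a sketch, with flag-value spellings assumed from the names used elsewhere in this thread:

```sh
# Remaining --type values after this change (spellings assumed, not verified):
uip agent eval evaluator add --type semantic-similarity   # type 5, validate cat 1
uip agent eval evaluator add --type trajectory            # type 7, validate cat 3
```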
Replace the synthetic skeletons in "Runtime-supported types not exposed by the CLI" with the canonical shapes used in real low-code agent projects:

- Equals (type 1) and JsonSimilarity (type 6) keep their Deterministic-category shape (no prompt/model needed) but now use realistic descriptions and filenames.
- Add explicit LlmAsAJudge (type 5) and Trajectory (type 7) JSON shapes for hand-written use, including the full prompt strings, an explicit model pin, and the descriptions used in production examples.
- Soften the filename rule: CLI-generated evaluators use `evaluator-<uuid8>.json`, but hand-written files can use any descriptive name. The runtime keys off `id` / `evaluatorRefs`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
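As an example of the hand-written shapes described above, a sketch of an LlmAsAJudge (type 5) file — the prompt wording, model pin, and id are illustrative assumptions, not the production values the commit refers to:

```sh
# Hand-written LlmAsAJudge evaluator sketch. type=5 / category=1 follow the
# validate matrix; the model pin and prompt text are illustrative assumptions.
cat > semantic-similarity-judge.json <<'EOF'
{
  "id": "00000000-0000-0000-0000-000000000005",
  "name": "semantic-similarity",
  "description": "LLM judge scoring how closely the agent output matches the expected output.",
  "type": 5,
  "category": 1,
  "model": "gpt-4o-2024-11-20",
  "prompt": "Compare the agent's actual output with the expected output and return a score from 0 to 100 reflecting how similar they are in meaning."
}
EOF
```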
Studio Web UI exposes 4 evaluator types (Semantic Similarity, Trajectory, Exact match, JSON similarity). Verified by counting evaluator JSON files across multiple production examples — only types 1, 5, 6, 7 appear; nothing else does. Previous framing called Exact match and JSON similarity "runtime-supported types not exposed by the CLI", which understated their status. Both are real first-class options; the only narrowing surface is the CLI's `--type` flag (which covers 2 of 4).

evaluators.md changes:

- New "Supported Evaluator Types" section with a 4-row table mapping UI label, type/category, `--type` flag (where applicable), what it scores, and whether it is LLM-based.
- New subsection "How to add each type" calling out the three creation paths (UI, CLI, hand-write JSON).
- Renamed the "Why fewer than coded?" section into a subsection of the Supported Types group; updated wording to reflect 4 supported types.
- Renamed "Runtime-supported types not exposed by the CLI" to "JSON Shapes" and reordered the four shapes to match the table order (Exact match, JSON similarity, LLM-as-a-judge, Trajectory).

evaluation-sets.md changes:

- Added Exact match and JSON similarity rows to the field-mapping table so all 4 supported types are covered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
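Reconstructed for orientation, the 4-row table described above plausibly looks like this (the "What it scores" wording is paraphrased from this thread, not quoted from the file):

| UI label | Type / category | `--type` flag | What it scores | LLM-based |
|---|---|---|---|---|
| Exact match | 1 / 0 | — | Verbatim equality of actual vs. expected output | No |
| JSON similarity | 6 / 0 | — | Structural similarity of JSON outputs | No |
| Semantic Similarity | 5 / 1 | Yes | Closeness in meaning of actual vs. expected output | Yes |
| Trajectory | 7 / 3 | Yes | Whether the agent's run followed the expected behavior | Yes |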
**andreibalas-uipath** left a comment:
The correct semantics for "pushing" an agent to Studio Web is actually to use the solutions CLI: `uip solution upload`. The command either creates a new solution or updates an existing one after the solution is edited locally.
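In concrete terms — only the command name comes from the comment; no flags are shown because none are named:

```sh
# Push a locally edited agent/solution to Studio Web. Creates a new solution
# or updates the existing one, per the reviewer's comment.
uip solution upload
```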
### Summary
- Adds low-code agent evaluation docs to the `uipath-agents` skill, filling a gap where coded agent evals have 5+ reference files but low-code has zero
- `uip agent eval` (evaluator CRUD, eval set/test case management, run start/status/results/compare) was fully supported but undocumented in the skill
- New docs live under `references/lowcode/evaluation/`, matching the structure of the coded eval docs
- Updates `SKILL.md` task navigation and the `lowcode.md` capability registry with eval entries

### Files Added
- `evaluation/evaluate.md` — entry point, prerequisites, file structure, key differences from coded evals
- `evaluation/evaluators.md` — 4 evaluator types (semantic-similarity, trajectory, context-precision, faithfulness), JSON format, custom prompts
- `evaluation/evaluation-sets.md` — eval set and test case CRUD, simulation options, JSON format
- `evaluation/running-evaluations.md` — run start/status/results/list/compare commands with output examples

### Test plan
- Internal cross-references resolve (`evaluate.md` → `evaluators.md`, etc.)
- `uip agent eval` …

🤖 Generated with Claude Code