
feat: add low-code agent evaluation docs #552

Open

mjnovice wants to merge 5 commits into main from feat/lowcode-eval-docs

Conversation


@mjnovice mjnovice commented May 4, 2026

Summary

  • Adds low-code agent evaluation documentation to the uipath-agents skill, filling a gap where coded agent evals have 5+ reference files but low-code has zero
  • Full CLI support exists in uip agent eval (evaluator CRUD, eval set/test case management, run start/status/results/compare) but was undocumented in the skill
  • 4 new reference files under references/lowcode/evaluation/ matching the structure of the coded eval docs
  • Updates SKILL.md task navigation and lowcode.md capability registry with eval entries

Files Added

  • evaluation/evaluate.md — entry point, prerequisites, file structure, key differences from coded evals
  • evaluation/evaluators.md — 4 evaluator types (semantic-similarity, trajectory, context-precision, faithfulness), JSON format, custom prompts
  • evaluation/evaluation-sets.md — eval set and test case CRUD, simulation options, JSON format
  • evaluation/running-evaluations.md — run start/status/results/list/compare commands with output examples (see the illustrative sketch below)
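
For orientation, the flow these files document looks roughly like the sketch below. This is a sketch, not a copy of the docs: only `run start/status/results/compare`, `--wait`, and `--type` are named elsewhere in this PR, and the evaluator subcommand spellings are assumptions.

```bash
# Illustrative sketch; exact flag and subcommand spellings live in the
# reference files, and lines marked "assumed" are guesses at the syntax.
# Prerequisite (per evaluate.md): the agent is already pushed to Studio Web.
uip agent eval evaluator add --type semantic-similarity   # assumed subcommand spelling
uip agent eval evaluator list                             # assumed subcommand spelling
uip agent eval run start --wait                           # block until the run completes
uip agent eval run results
uip agent eval run compare
```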

Test plan

  • Verify all internal links resolve (evaluate.md → evaluators.md, etc.)
  • Verify SKILL.md task navigation row links to correct file
  • Verify lowcode.md capability registry and "Read on demand" table entries
  • Test documented commands against a real low-code agent project with uip agent eval

🤖 Generated with Claude Code

The uipath-agents skill has comprehensive evaluation docs for coded
agents (5 files under coded/lifecycle/evaluations/) but none for
low-code agents, despite full CLI support in `uip agent eval`.

Adds 4 reference files under lowcode/evaluation/:
- evaluate.md — entry point, prerequisites, file structure, differences from coded
- evaluators.md — 4 evaluator types, add/list/remove, JSON format, custom prompts
- evaluation-sets.md — eval set and test case CRUD, simulation options, JSON format
- running-evaluations.md — run start/status/results/list/compare, workflow example

Updates SKILL.md task navigation and lowcode.md capability registry
to reference the new evaluation docs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 4, 2026

Claude finished @mjnovice's task in 3m 0s


Summary

Adds 4 new reference files documenting the uip agent eval CLI for low-code agents under references/lowcode/evaluation/, plus task navigation and capability registry entries in SKILL.md and lowcode.md. Fills a real documentation gap — coded agent evals had extensive coverage while low-code had none.

Change-by-Change Review

1. skills/uipath-agents/SKILL.md (+1 line)

Severity: OK

New task navigation row correctly links to references/lowcode/evaluation/evaluate.md as the entry point and lists the three sub-docs in "Then". Placement is alphabetically consistent with surrounding rows. Pattern matches existing entries.

2. skills/uipath-agents/references/lowcode/lowcode.md (+2 lines)

Severity: OK

"Read on demand" table entry (line 49) and Capability Registry row (line 72) both link to evaluation/evaluate.md with correct relative paths. Sub-doc references in the "Then" column match the SKILL.md entry.

3. skills/uipath-agents/references/lowcode/evaluation/evaluate.md (new, 71 lines)

Severity: OK

Well-structured entry point. Quick Reference provides copy-paste commands, prerequisites are clear (push before run, local ops don't need auth), reference navigation links all resolve. Troubleshooting table covers common failure modes. Key Differences table vs coded evals is useful context.

Minor: Comments in the Quick Reference code block (lines 8, 11, 14 — # Add a test case, # Run evals and wait for results, # Check results...) restate what the adjacent command does. Per token-optimization rules, these could be stripped since the section heading "Quick Reference" already frames the context. Low — not blocking.

4. skills/uipath-agents/references/lowcode/evaluation/evaluators.md (new, 102 lines)

Severity: High — template variable inconsistency

Line 64 documents the trajectory evaluator's template variables as:

{{UserOrSyntheticInput}}, {{SimulationInstructions}}, {{ExpectedAgentBehavior}}, {{AgentRunHistory}}

But the Custom Prompts section at line 100 lists trajectory variables as:

{{AgentRunHistory}}, {{ExpectedBehavior}}

Two problems:

  1. {{ExpectedAgentBehavior}} vs {{ExpectedBehavior}} — contradictory names. An agent following the Custom Prompts section would use the wrong variable.
  2. Line 100 lists only 2 variables vs 4 on line 64 — the Custom Prompts section is incomplete, missing {{UserOrSyntheticInput}} and {{SimulationInstructions}}.

Fix: Reconcile the two lists — determine which variable name is correct ({{ExpectedAgentBehavior}} or {{ExpectedBehavior}}) by checking the actual CLI, then update the Custom Prompts section to list all 4 variables consistently.

5. skills/uipath-agents/references/lowcode/evaluation/evaluation-sets.md (new, 140 lines)

Severity: OK

Comprehensive coverage of eval set and test case CRUD. Option tables are complete. Test case design section helpfully maps evaluator types to relevant flags. JSON format example is useful for agents that need to understand the file structure. Simulation options section is concise.

6. skills/uipath-agents/references/lowcode/evaluation/running-evaluations.md (new, 163 lines)

Severity: Low

Full coverage of run start/status/results/list/compare. JSON output examples match expected CLI patterns. Failure detection criteria are explicit — good for LLM usability.

The Workflow Example (lines 134–163) has numbered step comments (# 1. Push agent..., # 2. Add test cases..., etc.). These serve as section boundaries in a long code block, which is acceptable per token-optimization rules. Borderline — the surrounding ## Workflow Example heading provides no prose context, so the comments carry necessary structure here.

What's Missing

  • E2E test tasks for low-code evaluation — No test task in tests/tasks/uipath-agents/ covers the new evaluation workflow (e.g., adding a test case, running an eval). Existing tasks cover guardrails, tools, contexts, escalations, etc., but not evals. Severity: Medium — consistent with the review guidelines that flag missing test coverage for new capabilities.
  • Error handling guidance — The docs cover the happy path well but don't specify what an agent should do when uip agent eval run start returns a non-zero exit code or unexpected JSON structure. The troubleshooting table in evaluate.md partially covers this, but the sub-docs lack explicit "if this fails, do X" instructions (a minimal example of such a guard is sketched below). Severity: Low — the troubleshooting table provides reasonable coverage.
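
A minimal sketch of the kind of guard the sub-docs could spell out (illustrative only; assumes jq is available and that the CLI emits the documented Code field, neither of which is verified against the real CLI output):

```bash
# Illustrative guard; does not assert which Code value a successful start returns.
if ! out=$(uip agent eval run start --wait); then
  echo "uip agent eval run start failed (non-zero exit); check auth and that the agent is pushed" >&2
  exit 1
fi
code=$(printf '%s' "$out" | jq -r '.Code // empty')
[ -n "$code" ] || { echo "unexpected output: no Code field in response" >&2; exit 1; }
```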

Area Ratings

| Area | Status | Notes |
|---|---|---|
| Frontmatter | OK | No frontmatter changes needed — existing SKILL.md frontmatter unmodified |
| E2E Tests | Issue | No eval-specific test tasks added (Medium) |
| Skill Body | OK | Task navigation and capability registry entries correct |
| References & Assets | Issue | Template variable inconsistency in evaluators.md (High) |
| Repo Hygiene | OK | CODEOWNERS covers /skills/uipath-agents/, no secrets, no cross-skill refs, all links resolve |

Issues for Manual Review

  • CLI command accuracy — Cannot verify whether the documented uip agent eval subcommands, flags, and JSON output shapes match the actual CLI implementation. The commands follow established patterns from the coded eval docs, but the flag names (--simulate-input, --simulate-tools, --input-generation-instructions, --solution-id), JSON output codes (AgentEvalRunStarted, AgentEvalRunStatus, AgentEvalRunComparison), and evaluator type numeric mappings (5, 7, 8, 9) should be validated against the real CLI.
  • Template variable names — Which is correct: {{ExpectedAgentBehavior}} or {{ExpectedBehavior}}? Only the CLI source or actual evaluator default prompts can resolve this (a quick check against the CLI source is sketched below).
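
One quick way to resolve both points, assuming the CLI source checkout referenced in the commit messages (Code/cli) is available and that the CLI exposes a conventional --help flag:

```bash
# Locate the real template variable name and the documented flags in the CLI source.
grep -rn "ExpectedAgentBehavior" Code/cli
grep -rn "ExpectedBehavior" Code/cli
uip agent eval run start --help   # assumed to exist; prints the supported flags if so
```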

Conclusion

Solid documentation addition that fills a clear gap. One High issue must be fixed before merge: the contradictory trajectory evaluator template variable names in evaluators.md (line 64 vs line 100). The missing e2e test task is Medium — consistent with the general test coverage gap across the repo. Everything else is clean.


mjnovice and others added 4 commits May 4, 2026 18:29
Address PR #552 review comment from @Chibionos: drop SolutionStorage.json
mentions throughout the eval refs (it is going away). Reword troubleshooting,
prerequisites, file-structure tree, and the --solution-id default to describe
the user-facing concept ("agent pushed to Studio Web") instead.

Folds in additional corrections found while verifying the PR against the
uip CLI (Code/cli), uipath-python SDK, and Agents service repo:

- Rename evaluation/ → evaluations/ to match coded sibling convention.
- Move eval row from Capability Registry to "Read on demand" in lowcode.md
  (eval is lifecycle, not a capability).
- Fix evaluator filename example: actual pattern is evaluator-<uuid8>.json,
  not <name>.json. The user-supplied <name> goes into the JSON name field.
- Restore --wait polling cadence (5s) and --timeout default (600s) — both
  hardcoded in eval-run.ts. Removed earlier when unverified.
- Add complete output Code enum (AgentEvalRunStarted/Completed/Results/
  Status/Exported/List/Comparison).
- Expand failure detection with the numeric forms isFailedRun() actually
  checks (status "3", score.type "2"), plus the SDK status enum.
- Document the worker-side LLM model fail-fast (activities.py) and the
  same-as-agent resolver error (EvaluatorFactory) — these are runtime,
  not validate-time, errors.
- Correct context-precision/faithfulness data flow: both are trace-driven
  (RETRIEVER spans), not test-case-driven; faithfulness reads expectedOutput
  as the candidate text, not the agent's actual output.
- Add "Why fewer evaluators than coded?" section explaining the legacy vs
  new SDK engine split, plus the 2 runtime-supported types not exposed by
  the CLI (Equals=1, JsonSimilarity=6) with copy-pasteable JSON.
- Document validate's category↔type matrix (cat 0→{1,6}, cat 1→{5,8,9},
  cat 3→{7}) and required fields per schema-validation-service.ts.
- Add Anti-patterns section to all four eval reference files per
  skill-structure.md convention.
- Workflow example: insert validate step between add and push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…code eval docs

Remove context-precision and faithfulness from the low-code evaluator
surface entirely. Updates:

- evaluators.md: drop both rows from the CLI-exposed table, the --type
  description, the type/category mapping, and the default-prompts table.
  Narrow the validate matrix's cat 1 to type {5} only. Update the "Why
  fewer" intro to reflect 2 supported CLI types.
- evaluation-sets.md: remove the trace-driven data-flow rows for both
  evaluators, the explanatory callout about RETRIEVER spans, and the
  related anti-patterns. Test-case design now covers only ss + trajectory.
- evaluate.md: narrow the "Unknown evaluator type" troubleshooting hint.

Coded eval refs are unchanged — those use uipath-llm-judge-* IDs, not
the legacy CLI names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the synthetic skeletons in "Runtime-supported types not exposed
by the CLI" with the canonical shapes used in real low-code agent
projects:

- Equals (type 1) and JsonSimilarity (type 6) keep their
  Deterministic-category shape (no prompt/model needed) but now use
  realistic descriptions and filenames.
- Add explicit LlmAsAJudge (type 5) and Trajectory (type 7) JSON shapes
  for hand-written use, including the full prompt strings, an explicit
  model pin, and the descriptions used in production examples.
- Soften the filename rule: CLI-generated evaluators use
  evaluator-<uuid8>.json, but hand-written files can use any descriptive
  name. The runtime keys off id / evaluatorRefs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Studio Web UI exposes 4 evaluator types (Semantic Similarity, Trajectory,
Exact match, JSON similarity). Verified by counting evaluator JSON files
across multiple production examples — only types 1, 5, 6, 7 appear; nothing
else does.

Previous framing called Exact match and JSON similarity "runtime-supported
types not exposed by the CLI", which understated their status. Both are
real first-class options; the only narrowing surface is the CLI's --type
flag (which covers 2 of 4).

evaluators.md changes:
- New "Supported Evaluator Types" section with a 4-row table mapping UI
  label, type/category, --type flag (where applicable), what it scores,
  and whether it is LLM-based.
- New subsection "How to add each type" calling out the three creation
  paths (UI, CLI, hand-write JSON).
- Renamed the "Why fewer than coded?" section into a subsection of the
  Supported Types group; updated wording to reflect 4 supported types.
- Renamed "Runtime-supported types not exposed by the CLI" to "JSON
  Shapes" and reordered the four shapes to match the table order
  (Exact match, JSON similarity, LLM-as-a-judge, Trajectory).

evaluation-sets.md changes:
- Added Exact match and JSON similarity rows to the field-mapping table
  so all 4 supported types are covered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mjnovice mjnovice requested a review from Chibionos May 6, 2026 01:15

@andreibalas-uipath andreibalas-uipath left a comment

The correct semantics for "pushing" an agent to Studio Web is to use the solutions CLI: uip solution upload. The command either creates a new solution or updates an existing one after the solution has been edited locally.
