feat(uipath-maestro-flow): add evaluate capability for uip maestro flow eval #600
Conversation
New self-contained skill covering the full `uip maestro flow eval` CLI surface:

- evaluator add/list/remove for the 7 evaluator types (exact-match, json-similarity, contains, llm-judge-output, llm-judge-strict-json, llm-judge-trajectory, llm-judge-trajectory-simulation), with the internal `uipath-*` type-id mapping and JSON shapes.
- set add/list/remove with `--evaluators` and `--entry-point` semantics.
- Data point CRUD via top-level add/list/remove, including `--inputs`, `--expected`, `--criteria`, `--input-file` (repeatable file attachment), and `--search-text` for contains evaluators.
- run start/status/results/list/compare with `--wait`/`--timeout`, output `Code` enum, JMESPath `--output-filter` examples, failure detection rules, and per-row output fields.

Critical rule (per source review of cli/packages/flow-tool and solution-tool/src/commands/upload.ts): the skill must NEVER auto-run `uip solution upload`. The eval run requires the Flow solution to be in Studio Web, but uploading from a local working tree can overwrite concurrent Studio Web edits or push work-in-progress unintentionally. references/upload-safety.md documents the rule, the detection signals for "local workspace or VS Code" (missing SolutionStorage.json, .vscode/ present, no prior upload in session), and the safe alternative flow (read-only probe via `eval run list`, then explicit user consent).

Skill is self-contained per repo conventions; SKILL.md frontmatter includes -> redirects for sibling skills (uipath-agents for agent evals, uipath-maestro-flow for non-eval Flow work). CODEOWNERS updated to mirror uipath-maestro-flow ownership.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
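The type-id mapping mentioned above can be pictured with a small sketch. The uniform `uipath-<type>` prefix is an assumption extrapolated from the one pair the PR confirms (`exact-match` → `uipath-exact-match`); the authoritative list lives in the CLI's flow-eval-schema registry:

```python
# The 7 kebab-case --type values the skill documents.
EVALUATOR_TYPES = [
    "exact-match",
    "json-similarity",
    "contains",
    "llm-judge-output",
    "llm-judge-strict-json",
    "llm-judge-trajectory",
    "llm-judge-trajectory-simulation",
]

def to_evaluator_type_id(cli_type: str) -> str:
    """Map a CLI --type value to its internal evaluatorTypeId.

    Assumes the uniform `uipath-` prefix; verify against
    flow-eval-schema/registry.ts before relying on it.
    """
    if cli_type not in EVALUATOR_TYPES:
        raise ValueError(f"unknown evaluator type: {cli_type}")
    return f"uipath-{cli_type}"
```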
Claude finished @mjnovice's task in 2m 51s — PR Review
| Area | Status | Notes |
|---|---|---|
| Frontmatter | OK | Valid YAML, name matches folder, description 392 chars, front-loads identity, has → redirects |
| E2E Tests | Issue (Medium) | tests/tasks/uipath-maestro-flow-eval/ does not exist |
| Skill Body | OK | Canonical section order, 8 critical rules, anti-patterns, completion output |
| References & Assets | OK (with Medium note) | 5 references, all linked from SKILL.md, kebab-case naming. Upload-safety redundancy is a minor duplication concern |
| Repo Hygiene | OK | CODEOWNERS updated, no secrets, no cross-skill refs, all links resolve, changes scoped to new skill |
Issues for Manual Review
- CLI accuracy. The 7 evaluator `--type` values and their internal `evaluatorTypeId` mappings should be verified against the actual `flow-eval-schema/registry.ts` in the CLI source. The PR description claims verification but the reviewer cannot confirm.
- `--wait` and `--timeout` defaults. `commands-reference.md:143` states the default timeout is 600s. Verify against CLI source — if the default has changed, the docs will mislead.
- `--export-format` flag. `running-guide.md:110` documents `--export-format <json|csv>` on `run results`. Verify this flag exists on the CLI — it's not in the commands-reference flag table for `run results` (lines 162–168), which creates an inconsistency.
- Sibling skill conflict. Without a redirect in `uipath-maestro-flow`, an agent might trigger the wrong skill for eval-related queries since both skills match on `uip maestro flow`.
Conclusion
This is a high-quality new skill. The structure, content, and LLM usability are strong — 8 numbered critical rules, prescriptive CLI examples, decision tables, comprehensive anti-patterns, and well-organized reference extraction. The upload-safety documentation is particularly thoughtful.
Request changes on two items:
- (Medium) Add a `→ uipath-maestro-flow-eval` redirect to the `uipath-maestro-flow` skill's description to prevent mis-triggering. This is the most impactful fix — without it, agents will frequently land on the wrong skill for eval queries.
- (Medium) Reconcile the `--export-format` flag: it appears in `running-guide.md:110` but not in `commands-reference.md`'s `run results` flag table.
The missing e2e tests are tracked as Medium but not blocking given repo-wide test coverage status.
Consider adding smoke tests before merging.
rockymadden left a comment
Will tests follow? Else, can we add them?
…aluate capability

Merges uipath-maestro-flow-eval into uipath-maestro-flow as a 4th capability (evaluate/), addressing review feedback that subset-skills for the same project type compete for activation against the parent.

- Move 5 eval reference files under references/evaluate/references/
- New evaluate/CAPABILITY.md mirroring the author/operate/diagnose format; drops universal rules already enforced by the parent SKILL.md, keeps the 6 eval-specific rules (no auto upload, --path semantics, login boundary, LLM-judge model pinning, UUID refs, --wait timeout behavior)
- Parent SKILL.md: extend description with eval signal (399 chars), add "Evaluate" entry to When-to-use and the capability router
- shared/cli-commands.md: add eval subcommand block so the flat CLI lookup stays authoritative
- CODEOWNERS: drop the standalone eval entry (parent maestro-flow ownership covers it with the same owner set)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gabrielavaduva — agreed, addressed in e4b58a3. Folded the eval surface into …

Re: testing against activation evals — agreed, this should be covered. The merge collapses two competing skills into one, so the prior activation conflict (eval requests matching both `uip maestro flow eval` …
… e2e)

Adds coder_eval task YAMLs covering the new `evaluate` capability folded into uipath-maestro-flow in e4b58a3. Addresses @rockymadden's review request for tests on PR #600.

- evaluate/local_crud.yaml (smoke) — agent scaffolds a project + runs full local eval CRUD (evaluator add → set add → data point add → list). Asserts kebab-case --type, --output json on every eval command, evaluator JSON file lands on disk.
- evaluate/evaluator_type_choice.yaml (smoke) — agent picks the correct --type for 3 goals (NL similarity, deterministic JSON match, substring contains). Pure decision-making against report.json.
- evaluate/no_auto_upload.yaml (smoke, anti-pattern guard) — agent refuses to auto-`uip solution upload` when the solution isn't in Studio Web. Mirrors the uipath-data-fabric/smoke_negative_guards.yaml pattern. Verifies the Critical Rule from references/evaluate/references/upload-safety.md.
- evaluate/eval_run/eval_run.yaml + check_eval_run.py (e2e) — full lifecycle on a live tenant: scaffold a trivial deterministic flow, upload (explicit prompt-level authorization), run `eval run start --wait`, fetch results, assert score=1.0 across 3 data points. Cleanup wired through the existing e2e.yaml post_run (cleanup_solutions.py).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude finished @mjnovice's task in 4m 36s — Coder-eval task lint (advisory)

4 task YAMLs changed; verdicts: 0 Critical, 2 High, 2 Medium, 0 Low, 0 OK. Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run
❌ High — PR body does not claim the changed tasks have been run and passed. The test plan checkboxes cover SKILL.md frontmatter and link resolution, but no coder-eval task run is mentioned. Please edit the PR description to add a line like: …

Per-task lint …
…ommand + sidecar

The coder-eval TaskDefinition schema does not include `json_check` or the `operator: in` value (the data-fabric task I copied from must be similarly broken on main, but isn't surfaced because data-fabric is not changed in this PR). Refactor the two affected tasks to use `run_command` with a small Python sidecar — same pattern as the e2e and uipath-agents eval_exact_match.

- evaluate/check_type_choice.py — validates report.json against the expected evaluator type per goal (A/B/C)
- evaluate/check_no_auto_upload.py — validates report.json shows refusal, ran_solution_upload=false, ran_eval_run_start=false, Studio Web reason
- evaluate/evaluator_type_choice.yaml — replace 3 json_check entries with one run_command pointing at check_type_choice.py
- evaluate/no_auto_upload.yaml — replace 3 json_check + operator:in entries with one run_command pointing at check_no_auto_upload.py. Drop two negative `command_executed` checks (max_count is not in the schema) — the sidecar already verifies via report.json.

Verified locally: all 4 YAMLs pass `yaml.safe_load` and use only valid TaskDefinition types.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
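The sidecar pattern above can be sketched roughly as follows — a minimal illustration only, assuming report.json carries the fields named in the commit message (ran_solution_upload, ran_eval_run_start, and a hypothetical refusal_reason key); the shipped check_no_auto_upload.py may differ:

```python
import json
import sys
from pathlib import Path


def check_no_auto_upload(report_path: str) -> list[str]:
    """Return a list of failure messages for the no-auto-upload guard.

    An empty list means the report passes. Field names are illustrative,
    taken from the commit message rather than the real checker.
    """
    try:
        report = json.loads(Path(report_path).read_text())
    except (OSError, json.JSONDecodeError) as exc:
        return [f"could not read report: {exc}"]

    failures = []
    if report.get("ran_solution_upload") is not False:
        failures.append("agent ran (or did not report) `uip solution upload`")
    if report.get("ran_eval_run_start") is not False:
        failures.append("agent ran (or did not report) `eval run start`")
    if "studio web" not in report.get("refusal_reason", "").lower():
        failures.append("refusal reason does not mention Studio Web")
    return failures


if __name__ == "__main__" and len(sys.argv) > 1:
    # run_command in the task YAML would gate on this exit code.
    problems = check_no_auto_upload(sys.argv[1])
    print("PASS" if not problems else "\n".join(f"FAIL: {m}" for m in problems))
    raise SystemExit(1 if problems else 0)
```

The design point is that the sidecar validates artifacts on disk and returns a machine-checkable exit code, so the task harness needs no JSON-assertion schema of its own.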
Claude finished @mjnovice's task in 3m 49s — Coder-eval task lint (advisory). Same result as the previous lint run: 0 Critical, 2 High, 2 Medium, 0 Low, 0 OK, with the same High finding on missing passing-run evidence in the PR body.
Both local_crud and no_auto_upload hit MAX_TURNS_EXHAUSTED on the first real run. The agent budget was too tight for the full path: skill activation → read CAPABILITY → read upload-safety → scaffold solution + flow project → eval CRUD → write report.json.

- local_crud.yaml: max_turns 30 → 60. Agent reached `evaluator add` but ran out of turns before `set add`, `eval add`, and the verification lists. Score was 0.47 (4/8 criteria passed).
- no_auto_upload.yaml: max_turns 25 → 50. Agent never wrote report.json. Score was 0.00.
- local_crud.yaml: also fix alternation precedence in the `eval add` regex — `(--inputs.*--expected|--expected.*--inputs)` properly groups both flag orderings under the `uip ... eval add ...` prefix.

Reference: evaluator_type_choice.yaml passed at 1.00 on its 15-turn budget since it's pure decision-making (no CLI calls).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
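The alternation-precedence fix above matters because `|` binds looser than concatenation: without parentheses, the right branch of the alternation is not anchored to the command prefix at all. A small sketch with illustrative patterns (not the exact matcher from the task YAML):

```python
import re

# Without grouping, `|` splits the WHOLE pattern: the right branch
# "--expected.*--inputs" matches anywhere, ignoring the command prefix.
ungrouped = re.compile(r"uip .*eval add .*--inputs.*--expected|--expected.*--inputs")

# With grouping, both flag orderings sit under the same `eval add` prefix.
grouped = re.compile(r"uip .*eval add .*(--inputs.*--expected|--expected.*--inputs)")

flipped = "uip maestro flow eval add --expected out.json --inputs in.json"
no_prefix = "echo --expected out.json --inputs in.json"

# Grouped: accepts both flag orderings, but only after the prefix.
assert grouped.search(flipped)
assert not grouped.search(no_prefix)
# Ungrouped: wrongly accepts a line with no `eval add` at all.
assert ungrouped.search(no_prefix)
```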
Claude finished @mjnovice's task in 3m 45s — Coder-eval task lint (advisory). Same verdicts again (0 Critical, 2 High, 2 Medium, 0 Low, 0 OK); the passing-run-evidence High finding still stands.
Address the advisory lint comment from coder-eval (Claude Code Action) on PR #600 (issuecomment-4391714203). Three substantive fixes; cosmetic prompt-tightening and the validate-only carve-out left as follow-ups.

- #1 — no_auto_upload.yaml: Meaningful coverage (High). Previously all 5 criteria read report.json; the agent could bypass every CLI command. Add 5 `command_executed` checks for the CRUD setup steps: solution new, flow init, evaluator add (--type exact-match), eval set add, eval add. Sidecar weight rebalanced.
- #3 — evaluator_type_choice.yaml: Self-report anti-pattern (Critical). Previously a pure decision task — the agent wrote report.json and check_type_choice.py validated the JSON. Now the agent must scaffold a project and actually run `uip maestro flow eval evaluator add --type <chosen>` for each goal. Three `command_executed` matchers pin the kebab-case --type values (llm-judge-output / json-similarity / contains), so the agent has to invoke the skill, not just self-report. max_turns 15 → 50.
- #4 — local_crud.yaml: Meaningful coverage (Medium). Previously all 8 criteria were command/file existence; no output check. Add a check_local_crud.py sidecar that walks SmokeEval/ and verifies an evaluator JSON exists with name="greeting-match" + evaluatorTypeId="uipath-exact-match", plus an eval set "Smoke Set" containing a data point "hello" with inputs and expected output. Catches commands that ran but errored.
- #2 (passing-run claim) handled by a separate PR description edit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
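The #4 sidecar could look roughly like the following — an illustrative sketch only: the directory walk, function shape, and any JSON keys beyond name/evaluatorTypeId are assumptions, not the shipped check_local_crud.py:

```python
import json
import sys
from pathlib import Path


def find_evaluator(root: Path, name: str, type_id: str) -> bool:
    """Walk the project tree for an evaluator JSON file whose name and
    evaluatorTypeId (keys named in the commit message) both match."""
    for path in root.rglob("*.json"):
        try:
            data = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or non-JSON files
        if (isinstance(data, dict)
                and data.get("name") == name
                and data.get("evaluatorTypeId") == type_id):
            return True
    return False


if __name__ == "__main__" and len(sys.argv) > 1:
    # e.g. `python check_local_crud.py SmokeEval` from run_command
    ok = find_evaluator(Path(sys.argv[1]), "greeting-match", "uipath-exact-match")
    print("PASS" if ok else "FAIL: evaluator JSON not found")
    raise SystemExit(0 if ok else 1)
```

Checking the on-disk artifact rather than the command log is what catches "commands that ran but errored", as the fix note says.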
Claude finished @mjnovice's task in 4m 36s — Coder-eval task lint (advisory)

4 task YAMLs changed; verdicts: 0 Critical, 0 High, 3 Medium, 1 Low, 0 OK. Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run
❌ High — PR body does not claim the changed tasks have been run and passed. The test plan checkboxes cover SKILL.md frontmatter and link resolution, but no coder-eval task run is mentioned. Please edit the PR description to add a line like: …

Per-task lint …
Summary
Adds a new Evaluate capability to the `uipath-maestro-flow` skill, covering the full `uip maestro flow eval` CLI surface — evaluator CRUD (7 types), eval set CRUD with entry-point pinning, data point management with file attachments, and Studio Web run start/status/results/list/compare.

The capability mirrors the existing `author/`, `operate/`, `diagnose/` structure so the four lifecycle phases of a Flow project share one skill and one activation surface, avoiding the activation-conflict problem that arises when multiple sibling skills compete for the same `.flow` request.

The CLI surface is verified end-to-end against `cli/packages/flow-tool/src/services/flow-eval-schema/` (evaluator type-id mapping), `cli/packages/flow-tool/src/services/flow-eval-run-context.ts` (solution/project resolution), and `cli/packages/solution-tool/src/commands/upload.ts` (upload semantics).

Critical rule: no auto-`uip solution upload`

The eval run requires the Flow solution to exist in Studio Web, but the skill MUST NEVER auto-upload to satisfy that prerequisite. `references/evaluate/references/upload-safety.md` documents:

- the detection signals for a local workspace (missing `SolutionStorage.json`, `.vscode/` present, no prior upload in session)
- the safe alternative: a read-only probe via `eval run list`, then ask the user
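Those detection signals could be probed with something like the sketch below. The file names come from the PR text; the conjunctive heuristic and the function shape are assumptions, not upload-safety.md's actual wording:

```python
from pathlib import Path


def looks_like_local_workspace(root: Path, uploaded_this_session: bool = False) -> bool:
    """Heuristic for 'local workspace or VS Code' per the signals above:
    no SolutionStorage.json, a .vscode/ directory present, and no prior
    upload in this session. How the signals combine is an assumption."""
    missing_storage = not (root / "SolutionStorage.json").exists()
    has_vscode = (root / ".vscode").is_dir()
    return missing_storage and has_vscode and not uploaded_this_session
```

When this returns True, the skill falls back to the read-only probe (`eval run list`) and asks the user before any `uip solution upload`.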
Files

- `skills/uipath-maestro-flow/SKILL.md` — extended `description` with eval signal; added "Evaluate" entry to When to use and the capability router
- `skills/uipath-maestro-flow/references/evaluate/CAPABILITY.md` — capability index: When-to-use, 6 eval-specific rules (no auto upload, `--path` semantics, login boundary, LLM-judge model pinning, UUID refs, `--wait` timeout), Quick Start, Workflow / Common-tasks tables, Anti-patterns, References. Universal rules (`--output json`, AskUserQuestion, narration, todos) inherit from the parent SKILL.md.
- `skills/uipath-maestro-flow/references/evaluate/references/commands-reference.md` — every subcommand, flags, defaults, output `Code` enum
- `skills/uipath-maestro-flow/references/evaluate/references/evaluators-guide.md` — 7 evaluator types mapped to internal `uipath-*` IDs, JSON shapes, template variables
- `skills/uipath-maestro-flow/references/evaluate/references/eval-sets-guide.md` — eval set + data point CRUD, `--inputs`/`--expected`/`--criteria`/`--input-file`/`--search-text`
- `skills/uipath-maestro-flow/references/evaluate/references/running-guide.md` — run start/status/results/list/compare, JMESPath `--output-filter`, failure detection
- `skills/uipath-maestro-flow/references/evaluate/references/upload-safety.md` — the `solution upload` rule
- `skills/uipath-maestro-flow/references/shared/cli-commands.md` — added `uip maestro flow eval` subcommand block so the flat CLI lookup stays authoritative
- `tests/tasks/uipath-maestro-flow/evaluate/` — 3 smoke tasks + 1 e2e task with sidecar checkers (added in response to @rockymadden's review)
- `CODEOWNERS` — no new entry needed; the existing `uipath-maestro-flow` ownership covers `evaluate/`

History
- Started as a standalone `uipath-maestro-flow-eval` skill; addressed @gabrielavaduva's review feedback by folding it into `uipath-maestro-flow` as a 4th capability (commit `e4b58a3d`).
- Tests added (`2a3b2fa4`); coder-eval lint High/Critical findings closed in `4a30c1c2`.

Test plan — passing-run evidence

Ran `skill-flow-eval-local-crud`, `skill-flow-eval-evaluator-type-choice`, `skill-flow-eval-no-auto-upload` in CI Smoke Skill Tests run 25458500050 — all 3 passed with `status=SUCCESS score=1.00`. Updated tests in commit `4a30c1c2` will be re-validated by the next CI run on this push; will follow up with the run id once it completes.

- `SKILL.md` frontmatter passes `hooks/validate-skill-descriptions.sh`
- `.md` links audited across changed files
- Smoke tasks (`local_crud`, `evaluator_type_choice`, `no_auto_upload`) ran in CI and passed at 1.00 — see linked run above
- Live-tenant (`eval_run`) task runs against a live Studio Web tenant (cadence: daily/weekly via separate infra; not run on PR)
- Verified `--type` values against the registry in `cli/packages/flow-tool/src/services/flow-eval-schema/registry.ts`
- `uipath-maestro-flow` skill activates correctly on Flow eval requests (no longer competing with a sibling eval-only skill) — confirmed by smoke test results

🤖 Generated with Claude Code