feat(uipath-maestro-flow): add evaluate capability for uip maestro flow eval #600

Open
mjnovice wants to merge 7 commits into main from feat/maestro-flow-eval-skill

Conversation


@mjnovice mjnovice commented May 6, 2026

Summary

Adds a new Evaluate capability to the uipath-maestro-flow skill, covering the full uip maestro flow eval CLI surface — evaluator CRUD (7 types), eval set CRUD with entry-point pinning, data point management with file attachments, and Studio Web run start/status/results/list/compare.

The capability mirrors the existing author/, operate/, diagnose/ structure so the four lifecycle phases of a Flow project share one skill and one activation surface, avoiding the activation-conflict problem that arises when multiple sibling skills compete for the same .flow request.

The CLI surface is verified end-to-end against cli/packages/flow-tool/src/services/flow-eval-schema/ (evaluator type-id mapping), cli/packages/flow-tool/src/services/flow-eval-run-context.ts (solution/project resolution), and cli/packages/solution-tool/src/commands/upload.ts (upload semantics).

Critical rule: no auto-uip solution upload

The eval run requires the Flow solution to exist in Studio Web, but the skill MUST NEVER auto-upload to satisfy that prerequisite. references/evaluate/references/upload-safety.md documents:

  • Why (overwrites concurrent Studio Web edits, pushes work-in-progress, discards remote changes)
  • How to detect "local workspace or VS Code" state (missing SolutionStorage.json, .vscode/ present, no prior upload in session)
  • The safe alternative — read-only probe via eval run list, then ask the user
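
For illustration, a minimal sketch of how those detection signals could combine, assuming a Python harness (file names are from the list above; the function and its signature are hypothetical):

```python
from pathlib import Path

def looks_like_local_workspace(root: Path, uploaded_this_session: bool) -> bool:
    """Hypothetical combination of the upload-safety.md signals: treat the
    tree as a local working copy (so never auto-upload) if any signal fires."""
    missing_storage = not (root / "SolutionStorage.json").exists()
    vscode_present = (root / ".vscode").is_dir()
    return missing_storage or vscode_present or not uploaded_this_session
```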

Files

  • skills/uipath-maestro-flow/SKILL.md — extended description with eval signal; added "Evaluate" entry to When to use and capability router
  • skills/uipath-maestro-flow/references/evaluate/CAPABILITY.md — capability index: When-to-use, 6 eval-specific rules (no auto upload, --path semantics, login boundary, LLM-judge model pinning, UUID refs, --wait timeout), Quick Start, Workflow / Common-tasks tables, Anti-patterns, References. Universal rules (--output json, AskUserQuestion, narration, todos) inherit from parent SKILL.md.
  • skills/uipath-maestro-flow/references/evaluate/references/commands-reference.md — every subcommand, flags, defaults, output Code enum
  • skills/uipath-maestro-flow/references/evaluate/references/evaluators-guide.md — 7 evaluator types mapped to internal uipath-* IDs, JSON shapes, template variables (a hedged sketch of the mapping shape follows this file list)
  • skills/uipath-maestro-flow/references/evaluate/references/eval-sets-guide.md — eval set + data point CRUD, --inputs/--expected/--criteria/--input-file/--search-text
  • skills/uipath-maestro-flow/references/evaluate/references/running-guide.md — run start/status/results/list/compare, JMESPath --output-filter, failure detection
  • skills/uipath-maestro-flow/references/evaluate/references/upload-safety.md — the solution upload rule
  • skills/uipath-maestro-flow/references/shared/cli-commands.md — added uip maestro flow eval subcommand block so the flat CLI lookup stays authoritative
  • tests/tasks/uipath-maestro-flow/evaluate/ — 3 smoke tasks + 1 e2e task with sidecar checkers (added in response to @rockymadden's review)
  • CODEOWNERS — no new entry needed; the existing uipath-maestro-flow ownership covers evaluate/
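
A hedged sketch of that evaluators-guide mapping shape. Only the exact-match row is confirmed elsewhere in this PR (check_local_crud.py asserts evaluatorTypeId="uipath-exact-match"); every other ID simply assumes the stated uipath-* naming pattern and should be read from the guide, not from this sketch:

```python
# Only the exact-match row is confirmed in this PR; the rest are assumed
# from the stated uipath-* naming pattern.
EVALUATOR_TYPE_IDS = {
    "exact-match": "uipath-exact-match",                     # confirmed
    "json-similarity": "uipath-json-similarity",             # assumed
    "contains": "uipath-contains",                           # assumed
    "llm-judge-output": "uipath-llm-judge-output",           # assumed
    "llm-judge-strict-json": "uipath-llm-judge-strict-json", # assumed
    "llm-judge-trajectory": "uipath-llm-judge-trajectory",   # assumed
    "llm-judge-trajectory-simulation": "uipath-llm-judge-trajectory-simulation",  # assumed
}
```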

History

  • The skill landed initially as a standalone uipath-maestro-flow-eval skill; addressed @gabrielavaduva's review feedback by folding it into uipath-maestro-flow as a 4th capability (commit e4b58a3d).
  • Tests added in response to @rockymadden's review (commit 2a3b2fa4); coder-eval lint High/Critical findings closed in 4a30c1c2.

Test plan — passing-run evidence

Ran skill-flow-eval-local-crud, skill-flow-eval-evaluator-type-choice, and skill-flow-eval-no-auto-upload in CI Smoke Skill Tests run 25458500050 — all 3 passed (status=SUCCESS, score=1.00). The tests updated in commit 4a30c1c2 will be re-validated by the next CI run on this push; I'll follow up with the run id once it completes.

  • SKILL.md frontmatter passes hooks/validate-skill-descriptions.sh
  • All internal links resolve (39 relative .md links audited across changed files)
  • All 3 smoke tasks (local_crud, evaluator_type_choice, no_auto_upload) ran in CI and passed at 1.00 — see linked run above
  • Run the e2e (eval_run) task against a live Studio Web tenant (cadence: daily/weekly via separate infra; not run on PR)
  • Spot-check the 7 --type values against the registry in cli/packages/flow-tool/src/services/flow-eval-schema/registry.ts
  • Verify the uipath-maestro-flow skill activates correctly on Flow eval requests (no longer competing with a sibling eval-only skill) — confirmed by smoke test results

🤖 Generated with Claude Code

New self-contained skill covering the full uip maestro flow eval CLI
surface:

- evaluator add/list/remove for the 7 evaluator types (exact-match,
  json-similarity, contains, llm-judge-output, llm-judge-strict-json,
  llm-judge-trajectory, llm-judge-trajectory-simulation), with the
  internal uipath-* type-id mapping and JSON shapes.
- set add/list/remove with --evaluators and --entry-point semantics.
- Data point CRUD via top-level add/list/remove, including --inputs,
  --expected, --criteria, --input-file (repeatable file attachment),
  and --search-text for contains evaluators.
- run start/status/results/list/compare with --wait/--timeout, output
  Code enum, JMESPath --output-filter examples, failure detection
  rules, and per-row output fields.
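
As a concrete illustration of the JMESPath --output-filter idea, a filter expression can pull failing rows out of a results payload (the payload shape below is invented for the example; real field names live in running-guide.md):

```python
import jmespath  # pip install jmespath

# Invented payload shape; real field names per running-guide.md.
results = {
    "dataPoints": [
        {"name": "hello", "score": 1.0},
        {"name": "edge-case", "score": 0.4},
    ]
}

# Names of data points that scored below 1.0.
failing = jmespath.search("dataPoints[?score < `1.0`].name", results)
print(failing)  # ['edge-case']
```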

Critical rule (per source review of cli/packages/flow-tool and
solution-tool/src/commands/upload.ts): the skill must NEVER auto-run
`uip solution upload`. The eval run requires the Flow solution to be
in Studio Web, but uploading from a local working tree can overwrite
concurrent Studio Web edits or push work-in-progress unintentionally.
references/upload-safety.md documents the rule, the detection signals
for "local workspace or VS Code" (missing SolutionStorage.json, .vscode/
present, no prior upload in session), and the safe alternative flow
(read-only probe via `eval run list`, then explicit user consent).
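
A minimal sketch of that probe, assuming a Python harness (the command comes from the text above; how a never-uploaded solution manifests, non-zero exit versus an empty listing, is an assumption):

```python
import subprocess

# Read-only: `eval run list` changes nothing remotely. Treating a non-zero
# exit or an empty listing as "solution not in Studio Web" is an assumption;
# either way, the next step is to ask the user, never to upload.
proc = subprocess.run(
    ["uip", "maestro", "flow", "eval", "run", "list", "--output", "json"],
    capture_output=True,
    text=True,
)
solution_in_studio_web = proc.returncode == 0 and proc.stdout.strip() not in ("", "[]")
```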

Skill is self-contained per repo conventions; SKILL.md frontmatter
includes → redirects for sibling skills (uipath-agents for agent
evals, uipath-maestro-flow for non-eval Flow work). CODEOWNERS updated
to mirror uipath-maestro-flow ownership.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 6, 2026

Claude finished @mjnovice's task in 2m 51s


PR Review: uipath-maestro-flow-eval

  • Read review criteria and project conventions
  • Read all changed files (SKILL.md + 5 references + CODEOWNERS)
  • Run validation checks (frontmatter, links, tests, CODEOWNERS)
  • Apply skill review framework and draft findings
  • Post review

Summary

Adds a new self-contained skill uipath-maestro-flow-eval covering the full uip maestro flow eval CLI surface — evaluator CRUD (7 types), eval set CRUD with entry-point pinning, data point management with file attachments, and Studio Web run lifecycle (start/status/results/list/compare). Well-structured, thorough, and immediately usable by an agent.

Change-by-Change Review

1. skills/uipath-maestro-flow-eval/SKILL.md

Severity: OK (with one Low item)

Strong SKILL.md. Frontmatter is valid — name matches folder, description is 392 characters (well under 1024 cap), front-loads identity and domain signals (UiPath Maestro Flow evaluations, uip maestro flow eval), includes redirects for uipath-agents and uipath-maestro-flow. Body follows the canonical section order: Title → When to Use → Critical Rules → Quick Start → Reference Navigation → Anti-patterns. Critical Rules are numbered (8 rules), prescriptive, and cover the key safety concern (no auto-upload). Anti-patterns section is comprehensive (7 items). Reference Navigation table is clean with all 5 references linked.

  • Low — SKILL.md:7 — The allowed-tools includes AskUserQuestion. The uipath-agents skill omits this. Verify this is intentional — not a problem per se, but inconsistent with the nearest sibling.

2. skills/uipath-maestro-flow-eval/references/commands-reference.md

Severity: Low

Comprehensive command reference covering the full command tree, all flags, defaults, and output codes. Tables are well-formatted with required/optional markings. The Output Codes table at the end is a useful agent aid.

  • Low — commands-reference.md:7 — The command tree code block has no language identifier (``` instead of ```text). Minor formatting — a text or plaintext identifier would be more correct per markdown standards.

3. skills/uipath-maestro-flow-eval/references/evaluators-guide.md

Severity: OK

Excellent evaluator reference. The type mapping table (lines 5–13) with --type → internal evaluatorTypeId is exactly what an agent needs. "When to Pick Each Type" decision table is clear. JSON shape examples are complete. Template variable table per type (lines 99–106) is valuable.

No issues.

4. skills/uipath-maestro-flow-eval/references/eval-sets-guide.md

Severity: OK

Thorough coverage of eval set and data point CRUD. The --expected mapping table by evaluator type (lines 64–73) is a good reference. The JSON shape section (lines 119–143) with field explanations helps agents understand the on-disk format. Anti-patterns are specific and actionable.

No issues.

5. skills/uipath-maestro-flow-eval/references/running-guide.md

Severity: Medium

Good coverage of the full run lifecycle. The workflow example (lines 175–196) is a useful end-to-end walkthrough. Failure detection criteria (lines 200–206) are clear.

  • Medium — Upload-safety duplication. The "no auto-upload" rule is stated in: SKILL.md Critical Rule 1 (3+ lines), SKILL.md Anti-patterns (2 lines), running-guide.md callout (line 5), running-guide.md Resolution section (line 26), running-guide.md Anti-patterns (line 211), and the full upload-safety.md (82 lines). The SKILL.md and running-guide.md mentions should be trimmed to a single bridging sentence + link. The current state has 5 locations restating the same rule with varying levels of detail — this is a drift risk per the Reference Pointer Discipline check. The authoritative source is upload-safety.md; the other mentions should link there without restating the "why."

    That said, this is a safety-critical rule where redundancy is arguably a feature. I'm flagging it as Medium rather than High because the rule is consistent across all locations and the risk is over-messaging rather than conflicting guidance. The author can reasonably decide the repetition is intentional.

6. skills/uipath-maestro-flow-eval/references/upload-safety.md

Severity: OK

Well-written safety document. The "Why" section enumerates three concrete failure modes. The detection heuristics (lines 39–46) are practical. The "When the User Explicitly Asks" section (lines 60–74) sets a clear protocol. Anti-patterns are specific.

No issues.

7. CODEOWNERS

Severity: OK

New entry /skills/uipath-maestro-flow-eval/ mirrors the ownership of /skills/uipath-maestro-flow/. Correctly placed alphabetically. No tests/tasks/ CODEOWNERS entry, but that's expected since no tests exist yet (see below).

What's Missing

  • E2E tests. tests/tasks/uipath-maestro-flow-eval/ does not exist. No smoke or e2e task YAML files are included. Per review criteria, flagging as Medium — most skills are not yet test-compliant, but this should be tracked.
  • uipath-maestro-flow sibling redirect. The new skill's description correctly redirects .flow authoring→uipath-maestro-flow, but the uipath-maestro-flow skill's description does not mention →uipath-maestro-flow-eval for eval requests. An agent asking about "flow evaluations" might trigger the parent uipath-maestro-flow skill first (its description mentions .flow and uip maestro flow). Adding a compact For evals→uipath-maestro-flow-eval redirect to the sibling description would improve routing. Medium — not blocking, but a disambiguation gap.

Area Ratings

Area Status Notes
Frontmatter OK Valid YAML, name matches folder, description 392 chars, front-loads identity, has redirects
E2E Tests Issue (Medium) tests/tasks/uipath-maestro-flow-eval/ does not exist
Skill Body OK Canonical section order, 8 critical rules, anti-patterns, completion output
References & Assets OK (with Medium note) 5 references, all linked from SKILL.md, kebab-case naming. Upload-safety redundancy is a minor duplication concern
Repo Hygiene OK CODEOWNERS updated, no secrets, no cross-skill refs, all links resolve, changes scoped to new skill

Issues for Manual Review

  • CLI accuracy. The 7 evaluator --type values and their internal evaluatorTypeId mappings should be verified against the actual flow-eval-schema/registry.ts in the CLI source. The PR description claims verification but the reviewer cannot confirm.
  • --wait and --timeout defaults. commands-reference.md:143 states default timeout is 600s. Verify against CLI source — if the default has changed, the docs will mislead.
  • --export-format flag. running-guide.md:110 documents --export-format <json|csv> on run results. Verify this flag exists on the CLI — it's not in the commands-reference flag table for run results (lines 162–168), which creates an inconsistency.
  • Sibling skill conflict. Without a redirect in uipath-maestro-flow, an agent might trigger the wrong skill for eval-related queries since both skills match on uip maestro flow.

Conclusion

This is a high-quality new skill. The structure, content, and LLM usability are strong — 8 numbered critical rules, prescriptive CLI examples, decision tables, comprehensive anti-patterns, and well-organized reference extraction. The upload-safety documentation is particularly thoughtful.

Request changes on two items:

  1. (Medium) Add a →uipath-maestro-flow-eval redirect to the uipath-maestro-flow skill's description to prevent mis-triggering. This is the most impactful fix — without it, agents will frequently land on the wrong skill for eval queries.
  2. (Medium) Reconcile the --export-format flag: it appears in running-guide.md:110 but not in commands-reference.md's run results flag table.

The missing e2e tests are tracked as Medium but not blocking given repo-wide test coverage status.



github-actions Bot commented May 6, 2026

⚠️ Smoke test coverage gap — the following changed skills have no tests under tests/tasks/:

  • uipath-maestro-flow-eval

Consider adding smoke tests before merging.


@rockymadden rockymadden left a comment


Will tests follow? Else, can we add them?


@gabrielavaduva gabrielavaduva left a comment


This skill should be part of the maestro-flow skill. Considering we already have activation problems, another skill that handles a subset of operations for one project type would not be recommended. We should test this against activation evals.

…aluate capability

Merges uipath-maestro-flow-eval into uipath-maestro-flow as a 4th
capability (evaluate/), addressing review feedback that subset-skills for
the same project type compete for activation against the parent.

- Move 5 eval reference files under references/evaluate/references/
- New evaluate/CAPABILITY.md mirroring author/operate/diagnose format;
  drops universal rules already enforced by parent SKILL.md, keeps the
  6 eval-specific rules (no auto upload, --path semantics, login boundary,
  LLM-judge model pinning, UUID refs, --wait timeout behavior)
- Parent SKILL.md: extend description with eval signal (399 chars),
  add "Evaluate" entry to When-to-use and capability router
- shared/cli-commands.md: add eval subcommand block so flat CLI lookup
  stays authoritative
- CODEOWNERS: drop standalone eval entry (parent maestro-flow ownership
  covers it with same owner set)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mjnovice commented May 6, 2026

@gabrielavaduva — agreed, addressed in e4b58a3.

Folded the eval surface into uipath-maestro-flow as a 4th capability evaluate/, mirroring the existing author/operate/diagnose/ structure:

  • Moved the 5 eval reference files under skills/uipath-maestro-flow/references/evaluate/references/
  • New evaluate/CAPABILITY.md drops the universal rules the parent SKILL.md already enforces (--output json, AskUserQuestion, narration, todos) and keeps only the 6 eval-specific rules (no auto solution upload, --path semantics, login boundary, LLM-judge model pinning, UUID evaluator refs, --wait timeout behavior)
  • Parent SKILL.md description extended with eval signals (399 chars, well under the 1024 cap); "Evaluate" added to When to use and the capability router
  • shared/cli-commands.md extended with the uip maestro flow eval subcommand tree so the flat CLI lookup stays authoritative
  • Deleted skills/uipath-maestro-flow-eval/; updated CODEOWNERS (the parent entry has the same owner set)

Re: testing against activation evals — agreed, this should be covered. The merge collapses two competing skills into one, so the prior activation conflict (eval requests matching both uipath-maestro-flow and uipath-maestro-flow-eval) goes away by construction. Happy to add a Flow-eval activation case to the eval suite — let me know which fixture set it should land in.

@mjnovice mjnovice requested a review from gabrielavaduva May 6, 2026 19:22
@mjnovice mjnovice changed the title from "feat(uipath-maestro-flow-eval): add new skill for Flow evaluations" to "feat(uipath-maestro-flow): add evaluate capability for uip maestro flow eval" May 6, 2026
… e2e)

Adds coder_eval task YAMLs covering the new `evaluate` capability folded
into uipath-maestro-flow in e4b58a3. Addresses @rockymadden's review
request for tests on PR #600.

- evaluate/local_crud.yaml (smoke) — agent scaffolds project + runs full
  local eval CRUD (evaluator add → set add → data point add → list).
  Asserts kebab-case --type, --output json on every eval command, evaluator
  JSON file lands on disk.
- evaluate/evaluator_type_choice.yaml (smoke) — agent picks correct --type
  for 3 goals (NL similarity, deterministic JSON match, substring contains).
  Pure decision-making against report.json.
- evaluate/no_auto_upload.yaml (smoke, anti-pattern guard) — agent refuses
  to auto-`uip solution upload` when solution isn't in Studio Web. Mirrors
  uipath-data-fabric/smoke_negative_guards.yaml pattern. Verifies the
  Critical Rule from references/evaluate/references/upload-safety.md.
- evaluate/eval_run/eval_run.yaml + check_eval_run.py (e2e) — full
  lifecycle on live tenant: scaffold trivial deterministic flow, upload
  (explicit prompt-level authorization), run `eval run start --wait`,
  fetch results, assert score=1.0 across 3 data points. Cleanup wired
  through existing e2e.yaml post_run (cleanup_solutions.py).
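
A rough sketch of the kind of assertion check_eval_run.py makes (the results-file layout is assumed for illustration; the real checker reads the CLI's actual output shape):

```python
import json
from pathlib import Path

# Assumed layout of the saved results file; the real shape comes from
# `eval run results --output json`.
results = json.loads(Path("eval-results.json").read_text())
points = results.get("dataPoints", [])

assert len(points) == 3, f"expected 3 data points, got {len(points)}"
bad = [p["name"] for p in points if p.get("score") != 1.0]
assert not bad, f"data points below 1.0: {bad}"
```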

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 6, 2026

Claude finished @mjnovice's task in 4m 36s


Coder-eval task lint (advisory)

4 task YAMLs changed; verdicts: 0 Critical, 2 High, 2 Medium, 0 Low, 0 OK.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

High — PR body does not claim the changed tasks have been run and passed. The test plan checkboxes cover SKILL.md frontmatter and link resolution, but no coder-eval task run is mentioned. Please edit the PR description to add a line like: Ran skill-flow-eval-local-crud locally and it passed.

Per-task lint

tests/tasks/uipath-maestro-flow/evaluate/evaluator_type_choice.yaml — verdict: High (theme-captured; see Theme 1)

tests/tasks/uipath-maestro-flow/evaluate/no_auto_upload.yaml — verdict: High

Issues:

  • [High] Meaningful coverage (lines 59–102): all 5 success criteria read report.json. No command_executed for any setup CRUD (solution new, flow init, evaluator add, eval set add, data point add). Agent can skip all CLI work and write a compliant JSON file to pass the test. The upload-refusal decision is only verified through the agent's self-written claim.

Suggested fixes:

  • Add command_executed criteria for at least solution new, evaluator add, and eval set add to verify the agent actually built the pre-eval state before making the refusal decision.
  • Consider adding command_executed with a negative assertion or low max_count for solution upload to verify the agent did NOT run the forbidden command, rather than trusting ran_solution_upload: false.

tests/tasks/uipath-maestro-flow/evaluate/local_crud.yaml — verdict: Medium

Issues:

  • [Medium] Prompt over-specification (lines 25–46): numbered step-by-step procedure with exact entity names, types, and explicit --output json flag prescription. The skill should teach when to use --output json and the CRUD sequence; the prompt should state the goal.

Suggested fixes:

  • Condense steps into a goal: "Scaffold a Flow project called SmokeEval (double-nested layout), then use the uipath-maestro-flow skill's evaluate capability to add an exact-match evaluator, an eval set with one data point, and list both to confirm creation. All eval commands must use --output json."
  • Keep entity names (SmokeEval, greeting-match, Smoke Set) and data point values since criteria check for them (ground-truth anchors), but remove the numbered procedure.

tests/tasks/uipath-maestro-flow/evaluate/eval_run/eval_run.yaml — verdict: Medium

Issues:

  • [Medium] Prompt over-specification (lines 47–73): run-phase commands are spelled out as a numbered procedure with exact flags (--wait --timeout 600 --output json, resource refresh before upload). The skill's evaluate capability teaches this workflow; the prompt should state the goal.
  • [Medium] Validate-only flow tests (line 12): e2e tier, tags include uipath-maestro-flow, no command_executed matching flow\s+debug. Mitigating: eval run start --wait exercises runtime and check_eval_run.py (lines 160–166, weight 8.0) validates all 3 data points scored 1.0. The eval run IS the runtime path for this capability — flow debug is not applicable to eval workflows.

Suggested fixes:

  • For over-spec: replace the numbered run steps (lines 48–60) with a goal: "Upload the solution (authorized for this task), start an eval run, wait for completion, and save detailed results to eval-results.json."
  • For validate-only: add a rationale to the description field: "Skipping flow debug — eval run exercises runtime; check_eval_run.py validates all data points scored 1.0." This triggers the rubric's description-rationale carve-out and downgrades the finding to Low.

Within-PR duplicates

No duplicate clusters detected. The 4 tasks exercise materially distinct operations: decision-only knowledge test (evaluator_type_choice), local CRUD happy-path (local_crud), anti-pattern guard / negative test (no_auto_upload), and full e2e lifecycle (eval_run). local_crud and no_auto_upload share CRUD setup scaffolding but test fundamentally different objectives.

Theme 1: Self-report anti-pattern (Critical)

Affected: evaluator_type_choice.yaml, no_auto_upload.yaml

Both tasks instruct the agent to write report.json and all success criteria check that file. The agent grades its own homework — evaluator_type_choice has json_check assertions with exact expected values (which constrains answers but doesn't prove the skill was consulted), and no_auto_upload has no command_executed criteria at all. A capable LLM could pass both without invoking the skill or running any CLI commands.

Suggested systemic fix:

  • evaluator_type_choice: have the agent run uip maestro flow eval evaluator add with each chosen type (they'll fail without a project, but command_executed can verify the correct --type flag was used). Or scaffold a minimal project and validate the evaluator was created with evaluator list --output json + json_check.
  • no_auto_upload: add command_executed checks for the CRUD setup steps (the agent should demonstrably build the pre-eval state) and keep report.json only for the refusal-decision fields.

Conclusion

⚠ 4 task(s) have issues, max severity High. Advisory only — not blocking merge. The two biggest wins: (1) add command_executed criteria to no_auto_upload.yaml so the CRUD setup is verified, and (2) add a passing-run claim to the PR description.


…ommand + sidecar

The coder-eval TaskDefinition schema does not include `json_check` or the
`operator: in` value (the data-fabric task I copied from must be similarly
broken on main, but isn't surfaced because data-fabric is not changed in
this PR). Refactor the two affected tasks to use `run_command` with a
small Python sidecar — same pattern as the e2e and uipath-agents
eval_exact_match.

- evaluate/check_type_choice.py — validates report.json against expected
  evaluator type per goal (A/B/C)
- evaluate/check_no_auto_upload.py — validates report.json shows refusal,
  ran_solution_upload=false, ran_eval_run_start=false, Studio Web reason
- evaluate/evaluator_type_choice.yaml — replace 3 json_check entries with
  one run_command pointing at check_type_choice.py
- evaluate/no_auto_upload.yaml — replace 3 json_check + operator:in
  entries with one run_command pointing at check_no_auto_upload.py.
  Drop two negative `command_executed` checks (max_count is not in the
  schema) — sidecar already verifies via report.json.

Verified locally: all 4 YAMLs pass `yaml.safe_load` and use only valid
TaskDefinition types.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 6, 2026

Claude finished @mjnovice's task in 3m 49s


Coder-eval task lint (advisory)

4 task YAMLs changed; verdicts: 0 Critical, 2 High, 2 Medium, 0 Low, 0 OK.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

High — PR body does not claim the changed tasks have been run and passed. The test plan checkboxes cover SKILL.md frontmatter and link resolution, but no coder-eval task run is mentioned. Please edit the PR description to add a line like: Ran skill-flow-eval-local-crud locally and it passed.

Per-task lint

tests/tasks/uipath-maestro-flow/evaluate/evaluator_type_choice.yaml — verdict: High (theme-captured; see Theme 1)

tests/tasks/uipath-maestro-flow/evaluate/no_auto_upload.yaml — verdict: High

Issues:

  • [High] Meaningful coverage (lines 59–72): all success criteria read report.json. No command_executed for any setup CRUD (solution new, flow init, evaluator add, eval set add, data point add). Agent can skip all CLI work and write a compliant JSON file to pass the test. The upload-refusal decision is verified only through the agent's self-written claim.

Suggested fixes:

  • Add command_executed criteria for at least solution new, evaluator add, and eval set add to verify the agent actually built the pre-eval state before making the refusal decision.
  • Consider adding command_executed with a negative assertion or low max_count for solution upload to verify the agent did NOT run the forbidden command, rather than trusting ran_solution_upload: false.

tests/tasks/uipath-maestro-flow/evaluate/local_crud.yaml — verdict: Medium

Issues:

  • [Medium] Prompt over-specification (lines 25–46): numbered step-by-step procedure with exact entity names, types, and explicit --output json flag prescription. The skill should teach when to use --output json and the CRUD sequence; the prompt should state the goal.

Suggested fixes:

  • Condense steps into a goal: "Scaffold a Flow project called SmokeEval (double-nested layout), then use the uipath-maestro-flow skill's evaluate capability to add an exact-match evaluator, an eval set with one data point, and list both to confirm creation. All eval commands must use --output json."
  • Keep entity names (SmokeEval, greeting-match, Smoke Set) and data point values since criteria check for them (ground-truth anchors), but remove the numbered procedure.

tests/tasks/uipath-maestro-flow/evaluate/eval_run/eval_run.yaml — verdict: Medium

Issues:

  • [Medium] Prompt over-specification (lines 47–73): run-phase commands are spelled out as a numbered procedure with exact flags (--wait --timeout 600 --output json, resource refresh before upload). The skill's evaluate capability teaches this workflow; the prompt should state the goal.
  • [Medium] Validate-only flow tests (tags: uipath-maestro-flow, tier: e2e): no command_executed matching flow\s+debug. Mitigating: eval run start --wait exercises runtime and check_eval_run.py (lines 160–166, weight 8.0) validates all 3 data points scored 1.0. The eval run IS the runtime path for this capability — flow debug is not applicable to eval workflows.

Suggested fixes:

  • For over-spec: replace the numbered run steps (lines 48–60) with a goal: "Upload the solution (authorized for this task), start an eval run, wait for completion, and save detailed results to eval-results.json."
  • For validate-only: add a rationale to the description field (e.g. "Skipping flow debug — eval run exercises runtime; check_eval_run.py validates all data points scored 1.0."). This triggers the rubric's description-rationale carve-out and downgrades the finding to Low.

Within-PR duplicates

No duplicate clusters detected. The 4 tasks exercise materially distinct operations: decision-only knowledge test (evaluator_type_choice), local CRUD happy-path (local_crud), anti-pattern guard / negative test (no_auto_upload), and full e2e lifecycle (eval_run). local_crud and no_auto_upload share CRUD setup scaffolding but test fundamentally different objectives.

Theme 1: Self-report anti-pattern (Critical)

Affected: evaluator_type_choice.yaml, no_auto_upload.yaml

Both tasks instruct the agent to write report.json and all success criteria check that file. The agent grades its own homework — evaluator_type_choice has run_command assertions via check_type_choice.py with exact expected values (which constrains answers but doesn't prove the skill was consulted), and no_auto_upload has no command_executed criteria at all. A capable LLM could pass both without invoking the skill or running any CLI commands.

Suggested systemic fix:

  • evaluator_type_choice: have the agent run uip maestro flow eval evaluator add with each chosen type (they'll fail without a project, but command_executed can verify the correct --type flag was used). Or scaffold a minimal project and validate the evaluator was created with evaluator list --output json + json_check.
  • no_auto_upload: add command_executed checks for the CRUD setup steps (the agent should demonstrably build the pre-eval state) and keep report.json only for the refusal-decision fields.

Conclusion

⚠ 4 task(s) have issues, max severity High. Advisory only — not blocking merge. The two biggest wins: (1) add command_executed criteria to no_auto_upload.yaml so the CRUD setup is verified, and (2) add a passing-run claim to the PR description.


Both local_crud and no_auto_upload hit MAX_TURNS_EXHAUSTED on the first
real run. The agent budget was too tight for the full path: skill
activation → read CAPABILITY → read upload-safety → scaffold solution +
flow project → eval CRUD → write report.json.

- local_crud.yaml: max_turns 30 → 60. Agent reached `evaluator add` but
  ran out of turns before `set add`, `eval add`, and the verification
  lists. Score was 0.47 (4/8 criteria passed).
- no_auto_upload.yaml: max_turns 25 → 50. Agent never wrote report.json.
  Score was 0.00.
- local_crud.yaml: also fix alternation precedence in the `eval add`
  regex — `(--inputs.*--expected|--expected.*--inputs)` properly groups
  both flag orderings under the `uip ... eval add ...` prefix.
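
To see why the grouping matters: in Python's re syntax, `|` binds loosest, so without parentheses the second branch escapes the command prefix entirely. A quick demonstration (command strings are illustrative):

```python
import re

# Ungrouped: the alternation splits the WHOLE pattern, so the second
# branch matches any text containing --expected ... --inputs, with no
# `eval add` prefix required.
loose = re.compile(r"uip .*eval add .*--inputs.*--expected|--expected.*--inputs")

# Grouped: both flag orderings stay under the `eval add` prefix.
strict = re.compile(r"uip .*eval add .*(--inputs.*--expected|--expected.*--inputs)")

cmd = "uip maestro flow eval add --expected exp.json --inputs in.json"
other = "uip solution upload --expected exp.json --inputs in.json"

assert strict.search(cmd) and not strict.search(other)
assert loose.search(other)  # the false positive the fix removes
```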

Reference: evaluator_type_choice.yaml passed 1.00/15-turn budget cleanly
since it's pure decision-making (no CLI calls).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 6, 2026

Claude finished @mjnovice's task in 3m 45s


Coder-eval task lint (advisory)

  • Read lint rubric (.claude/commands/lint-task.md)
  • Identify changed task YAMLs (4 files)
  • Check PR body for evidence of passing run
  • Read and lint each changed task YAML
  • Within-PR duplicate check
  • Post final lint results


4 task YAMLs changed; verdicts: 0 Critical, 2 High, 2 Medium, 0 Low, 0 OK.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

High — PR body does not claim the changed tasks have been run and passed. The test plan checkboxes cover SKILL.md frontmatter and link resolution, but no coder-eval task run is mentioned. Please edit the PR description to add a line like: Ran skill-flow-eval-local-crud locally and it passed.

Per-task lint

tests/tasks/uipath-maestro-flow/evaluate/evaluator_type_choice.yaml — verdict: High (theme-captured; see Theme 1)

tests/tasks/uipath-maestro-flow/evaluate/no_auto_upload.yaml — verdict: High

Issues:

  • [High] Meaningful coverage (lines 59–72): all 5 success criteria read report.json via file_exists + run_command. No command_executed for any setup CRUD (solution new, flow init, evaluator add, eval set add, data point add). Agent can skip all CLI work and write a compliant JSON file to pass.

Suggested fixes:

  • Add command_executed criteria for at least solution new, evaluator add, and eval set add to verify the agent actually built the pre-eval state before making the refusal decision.
  • Consider adding command_executed with a negative assertion or max_count: 0 for solution upload to verify the agent did NOT run the forbidden command, rather than trusting ran_solution_upload: false in report.json.

tests/tasks/uipath-maestro-flow/evaluate/local_crud.yaml — verdict: Medium

Issues:

  • [Medium] Prompt over-specification (lines 25–45): numbered step-by-step procedure with exact entity names, types, and explicit --output json flag prescription. Entity names (SmokeEval, greeting-match, Smoke Set, hello) are ground-truth anchors (OK per rubric), but the 5-step procedure and flag prescription should be the skill's job.
  • [Medium] Meaningful coverage (lines 47–108): all 8 criteria are command_executed / file_exists — no output validation. Verifies the agent ran correct commands but not that they succeeded or produced correct output.

Suggested fixes:

  • Condense steps into a goal: "Scaffold a Flow project called SmokeEval (double-nested layout), then use the uipath-maestro-flow skill's evaluate capability to add an exact-match evaluator named greeting-match, an eval set Smoke Set with one data point hello (inputs: {"name":"Alice"}, expected: {"greeting":"Hello, Alice!"}), and list both to confirm creation."
  • Add a run_command or json_check on at least one evaluator list or set list output to verify the created entities appear in the listing.

tests/tasks/uipath-maestro-flow/evaluate/eval_run/eval_run.yaml — verdict: Medium

Issues:

  • [Medium] Prompt over-specification (lines 47–60): run-phase commands spelled out as a numbered procedure with exact flags (--wait --timeout 600 --output json, resource refresh before upload). The skill's evaluate capability teaches this workflow; the prompt should state the goal.
  • [Medium] Validate-only flow tests (line 12): e2e tier, tags include uipath-maestro-flow, no command_executed matching flow\s+debug. Mitigating: eval run start --wait exercises runtime through Studio Web and check_eval_run.py (lines 160–166, weight 8.0) validates all 3 data points scored 1.0. The eval run IS the runtime path for this capability — flow debug is not applicable to eval workflows.

Suggested fixes:

  • For over-spec: replace the numbered run steps (lines 48–60) with a goal: "Upload the solution (authorized for this task), start an eval run, wait for completion, and save detailed results to eval-results.json."
  • For validate-only: add a rationale to the description field: "Skipping flow debug — eval run exercises runtime; check_eval_run.py validates all data points scored 1.0." This triggers the rubric's description-rationale carve-out and downgrades the finding to Low.

Within-PR duplicates

No duplicate clusters detected. The 4 tasks exercise materially distinct operations: decision-only knowledge test (evaluator_type_choice), local CRUD happy-path (local_crud), anti-pattern guard / negative test (no_auto_upload), and full e2e lifecycle (eval_run). local_crud and no_auto_upload share CRUD setup scaffolding but test fundamentally different objectives.

Theme 1: Self-report anti-pattern (Critical)

Affected: evaluator_type_choice.yaml, no_auto_upload.yaml

Both tasks instruct the agent to write report.json and success criteria check that file. The agent grades its own homework — evaluator_type_choice has run_command via check_type_choice.py with exact expected values (constrains answers but doesn't prove the skill was consulted), and no_auto_upload has no command_executed criteria at all. A capable LLM could pass both without invoking the skill or running any CLI commands.

Suggested systemic fix:

  • evaluator_type_choice: have the agent run uip maestro flow eval evaluator add with each chosen type (they'll fail without a project, but command_executed can verify the correct --type flag was used). Or scaffold a minimal project and validate the evaluator was created with evaluator list --output json + json_check.
  • no_auto_upload: add command_executed checks for the CRUD setup steps (the agent should demonstrably build the pre-eval state) and keep report.json only for the refusal-decision fields.

Conclusion

⚠ 4 task(s) have issues, max severity High. Advisory only — not blocking merge. The two biggest wins: (1) add command_executed criteria to no_auto_upload.yaml so the CRUD setup is verified, and (2) add a passing-run claim to the PR description.


Address the advisory lint comment from coder-eval (Claude Code Action) on
PR #600 (issuecomment-4391714203). Three substantive fixes; cosmetic
prompt-tightening and the validate-only carve-out left as follow-ups.

#1 — no_auto_upload.yaml: Meaningful coverage (High).
Previously all 5 criteria read report.json; agent could bypass every CLI
command. Add 5 `command_executed` checks for the CRUD setup steps:
solution new, flow init, evaluator add (--type exact-match), eval set add,
eval add. Sidecar weight rebalanced.

#3 — evaluator_type_choice.yaml: Self-report anti-pattern (Critical).
Previously a pure decision task — agent wrote report.json and
check_type_choice.py validated the JSON. Now the agent must scaffold a
project and
actually run `uip maestro flow eval evaluator add --type <chosen>` for
each goal. Three `command_executed` matchers pin the kebab-case --type
values (llm-judge-output / json-similarity / contains), so the agent has
to invoke the skill, not just self-report. max_turns 15 → 50.

#4 — local_crud.yaml: Meaningful coverage (Medium).
Previously all 8 criteria were command/file existence; no output check.
Add check_local_crud.py sidecar that walks SmokeEval/ and verifies an
  evaluator JSON exists with name="greeting-match" and
  evaluatorTypeId="uipath-exact-match", plus an eval set "Smoke Set"
  containing a data
point "hello" with inputs and expected output. Catches commands that
ran but errored.
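
Roughly, that disk walk could look like this (directory layout and JSON nesting are assumptions; the field names are the ones quoted above):

```python
import json
from pathlib import Path

def any_json_matches(root: Path, predicate) -> bool:
    """True if any *.json under root parses and satisfies predicate."""
    for path in root.rglob("*.json"):
        try:
            data = json.loads(path.read_text())
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue
        if isinstance(data, dict) and predicate(data):
            return True
    return False

root = Path("SmokeEval")
has_evaluator = any_json_matches(
    root,
    lambda d: d.get("name") == "greeting-match"
    and d.get("evaluatorTypeId") == "uipath-exact-match",
)
print("evaluator found:", has_evaluator)
```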

#2 (passing-run claim) handled by separate PR description edit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 6, 2026

Claude finished @mjnovice's task in 4m 36s


Coder-eval task lint (advisory)

  • Read lint rubric (.claude/commands/lint-task.md)
  • Identify changed task YAMLs (4 files)
  • Check PR body for evidence of passing run
  • Read and lint each changed task YAML
  • Within-PR duplicate check
  • Post final lint results

4 task YAMLs changed; verdicts: 0 Critical, 0 High, 3 Medium, 1 Low, 0 OK.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

High — PR body does not claim the changed tasks have been run and passed. The test plan checkboxes cover SKILL.md frontmatter and link resolution, but no coder-eval task run is mentioned. Please edit the PR description to add a line like: Ran skill-flow-eval-local-crud locally and it passed.

Per-task lint

tests/tasks/uipath-maestro-flow/evaluate/evaluator_type_choice.yaml — verdict: Low

Issues:

  • [Low] Prompt over-specification (lines 33–34): lists all 7 valid --type values in the prompt; the skill's evaluator taxonomy is partially leaked, though the core challenge — matching goal descriptions to the correct type — remains a decision the agent must make.

Suggested fixes:

  • Consider removing the explicit list of 7 --type values from the prompt and instead saying "pick the correct --type from the types the skill documents." The agent should discover valid types from the skill's evaluators-guide, not from the prompt.

tests/tasks/uipath-maestro-flow/evaluate/no_auto_upload.yaml — verdict: Medium

Issues:

  • [Medium] Self-report anti-pattern (lines 41–48, 100–112): the refusal decision is validated only via agent-written report.json (check script weight 5.0). CRUD setup IS properly validated by 5 command_executed criteria (lines 60–98, total weight 6.5) — this is a significant improvement. However, the core assertion — that the agent refused to upload — remains self-reported. No negative assertion prevents the agent from running solution upload and then claiming it didn't.
  • [Medium] Prompt over-specification (lines 37–39, 50–57): the upload-safety rule is restated in the prompt ("you must NOT auto-run uip solution upload") and forbidden commands are explicitly listed. An agent following prompt instructions alone passes the refusal portion without consulting the skill's upload-safety.md.

Suggested fixes:

  • Add a command_executed criterion with max_count: 0 (if supported) for solution upload to independently verify the agent did NOT run the forbidden command, rather than relying on ran_solution_upload: false in the self-written report.
  • Soften the prompt's refusal hint: instead of "you must NOT auto-run uip solution upload", say "Decide what to do based on the skill's Critical Rules. Record your decision in report.json." Let the skill teach the rule.

tests/tasks/uipath-maestro-flow/evaluate/local_crud.yaml — verdict: Medium

Issues:

  • [Medium] Prompt over-specification (lines 25–45): 5-step numbered procedure with explicit flag prescription (line 41: "Use --output json on every uip maestro flow eval ... command"). Entity names (SmokeEval, greeting-match, Smoke Set, hello) are ground-truth anchors (OK per rubric), but the step-by-step CRUD sequence and --output json flag requirement should be the skill's teaching, not the prompt's prescription.

Suggested fixes:

  • Condense steps into a goal: "Scaffold a Flow project called SmokeEval (double-nested layout), then use the uipath-maestro-flow skill's evaluate capability to add an exact-match evaluator named greeting-match, an eval set Smoke Set with one data point hello (inputs: {"name":"Alice"}, expected: {"greeting":"Hello, Alice!"}), and list both to confirm creation."
  • Remove the --output json prescription — the skill should teach when to use it.

tests/tasks/uipath-maestro-flow/evaluate/eval_run/eval_run.yaml — verdict: Medium

Issues:

  • [Medium] Prompt over-specification (lines 47–60): run-phase commands spelled out as a numbered procedure with exact flags (--wait --timeout 600 --output json, resource refresh before upload). The skill's evaluate capability teaches this workflow; the prompt should state the goal.
  • [Medium] Validate-only flow tests (line 12): e2e tier, tags include uipath-maestro-flow, no command_executed matching flow\s+debug. Mitigating: eval run start --wait exercises runtime via Studio Web, and check_eval_run.py (lines 160–166, weight 8.0) validates all 3 data points scored 1.0. The eval run IS the runtime path for this capability — flow debug is not applicable to eval workflows.

Suggested fixes:

  • For over-spec: replace the numbered run steps (lines 48–60) with a goal: "Upload the solution (authorized for this task), start an eval run, wait for completion, and save detailed results to eval-results.json."
  • For validate-only: add a rationale to the description field (e.g. "Skipping flow debug — eval run exercises runtime; check_eval_run.py validates all data points scored 1.0."). This triggers the rubric's description-rationale carve-out and downgrades the finding to Low.

Within-PR duplicates

No duplicate clusters detected. The 4 tasks exercise materially distinct operations: type-selection decision (evaluator_type_choice), local CRUD happy-path (local_crud), anti-pattern guard / negative test (no_auto_upload), and full e2e lifecycle (eval_run). local_crud and no_auto_upload share CRUD setup scaffolding but test fundamentally different objectives.

Conclusion

⚠ 3 task(s) have issues at Medium, 1 at Low. Advisory only — not blocking merge. The two biggest wins: (1) soften the prompt hints in no_auto_upload.yaml so the skill's upload-safety rule is genuinely tested, and (2) add a passing-run claim to the PR description.

Notable improvement from prior rounds: the evaluator_type_choice and no_auto_upload tasks now include command_executed criteria for CRUD setup, substantially addressing the self-report anti-pattern flagged in earlier reviews. The remaining self-report concern on no_auto_upload is limited to the refusal decision — an inherently difficult negative to verify.



@gabrielavaduva gabrielavaduva left a comment


looks great!
