feat(uipath-maestro-flow): add evaluate capability for uip maestro flow eval #600

Open
mjnovice wants to merge 7 commits into main from feat/maestro-flow-eval-skill

Conversation


@mjnovice mjnovice commented May 6, 2026

Summary

Adds a new Evaluate capability to the uipath-maestro-flow skill, covering the full uip maestro flow eval CLI surface — evaluator CRUD (7 types), eval set CRUD with entry-point pinning, data point management with file attachments, and Studio Web run start/status/results/list/compare.

The capability mirrors the existing author/, operate/, diagnose/ structure so the four lifecycle phases of a Flow project share one skill and one activation surface, avoiding the activation-conflict problem that arises when multiple sibling skills compete for the same .flow request.

The CLI surface is verified end-to-end against cli/packages/flow-tool/src/services/flow-eval-schema/ (evaluator type-id mapping), cli/packages/flow-tool/src/services/flow-eval-run-context.ts (solution/project resolution), and cli/packages/solution-tool/src/commands/upload.ts (upload semantics).

Critical rule: no auto-uip solution upload

The eval run requires the Flow solution to exist in Studio Web, but the skill MUST NEVER auto-upload to satisfy that prerequisite. references/evaluate/references/upload-safety.md documents:

  • Why (overwrites concurrent Studio Web edits, pushes work-in-progress, discards remote changes)
  • How to detect "local workspace or VS Code" state (missing SolutionStorage.json, .vscode/ present, no prior upload in session)
  • The safe alternative — read-only probe via eval run list, then ask the user
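
For illustration, a minimal sketch of how those detection signals could combine, assuming a Python harness (file names are from the list above; the function and its signature are hypothetical):

```python
from pathlib import Path

def looks_like_local_workspace(root: Path, uploaded_this_session: bool) -> bool:
    """Hypothetical combination of the upload-safety.md signals: treat the
    tree as a local working copy (so never auto-upload) if any signal fires."""
    missing_storage = not (root / "SolutionStorage.json").exists()
    vscode_present = (root / ".vscode").is_dir()
    return missing_storage or vscode_present or not uploaded_this_session
```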

Files

  • skills/uipath-maestro-flow/SKILL.md — extended description with eval signal; added "Evaluate" entry to When to use and capability router
  • skills/uipath-maestro-flow/references/evaluate/CAPABILITY.md — capability index: When-to-use, 6 eval-specific rules (no auto upload, --path semantics, login boundary, LLM-judge model pinning, UUID refs, --wait timeout), Quick Start, Workflow / Common-tasks tables, Anti-patterns, References. Universal rules (--output json, AskUserQuestion, narration, todos) inherit from parent SKILL.md.
  • skills/uipath-maestro-flow/references/evaluate/references/commands-reference.md — every subcommand, flags, defaults, output Code enum
  • skills/uipath-maestro-flow/references/evaluate/references/evaluators-guide.md — 7 evaluator types mapped to internal uipath-* IDs, JSON shapes, template variables (a hedged sketch of the mapping shape follows this file list)
  • skills/uipath-maestro-flow/references/evaluate/references/eval-sets-guide.md — eval set + data point CRUD, --inputs/--expected/--criteria/--input-file/--search-text
  • skills/uipath-maestro-flow/references/evaluate/references/running-guide.md — run start/status/results/list/compare, JMESPath --output-filter, failure detection
  • skills/uipath-maestro-flow/references/evaluate/references/upload-safety.md — the solution upload rule
  • skills/uipath-maestro-flow/references/shared/cli-commands.md — added uip maestro flow eval subcommand block so the flat CLI lookup stays authoritative
  • tests/tasks/uipath-maestro-flow/evaluate/ — 3 smoke tasks + 1 e2e task with sidecar checkers (added in response to @rockymadden's review)
  • CODEOWNERS — no new entry needed; the existing uipath-maestro-flow ownership covers evaluate/
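
A hedged sketch of that evaluators-guide mapping shape. Only the exact-match row is confirmed elsewhere in this PR (check_local_crud.py asserts evaluatorTypeId="uipath-exact-match"); every other ID simply assumes the stated uipath-* naming pattern and should be read from the guide, not from this sketch:

```python
# Only the exact-match row is confirmed in this PR; the rest are assumed
# from the stated uipath-* naming pattern.
EVALUATOR_TYPE_IDS = {
    "exact-match": "uipath-exact-match",                     # confirmed
    "json-similarity": "uipath-json-similarity",             # assumed
    "contains": "uipath-contains",                           # assumed
    "llm-judge-output": "uipath-llm-judge-output",           # assumed
    "llm-judge-strict-json": "uipath-llm-judge-strict-json", # assumed
    "llm-judge-trajectory": "uipath-llm-judge-trajectory",   # assumed
    "llm-judge-trajectory-simulation": "uipath-llm-judge-trajectory-simulation",  # assumed
}
```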

History

  • The skill landed initially as a standalone uipath-maestro-flow-eval skill; addressed @gabrielavaduva's review feedback by folding it into uipath-maestro-flow as a 4th capability (commit e4b58a3d).
  • Tests added in response to @rockymadden's review (commit 2a3b2fa4); coder-eval lint High/Critical findings closed in 4a30c1c2.

Test plan — passing-run evidence

Ran skill-flow-eval-local-crud, skill-flow-eval-evaluator-type-choice, and skill-flow-eval-no-auto-upload in CI Smoke Skill Tests run 25458500050 — all 3 passed (status=SUCCESS, score=1.00). The tests updated in commit 4a30c1c2 will be re-validated by the next CI run on this push; I'll follow up with the run id once it completes.

  • SKILL.md frontmatter passes hooks/validate-skill-descriptions.sh
  • All internal links resolve (39 relative .md links audited across changed files)
  • All 3 smoke tasks (local_crud, evaluator_type_choice, no_auto_upload) ran in CI and passed at 1.00 — see linked run above
  • Run the e2e (eval_run) task against a live Studio Web tenant (cadence: daily/weekly via separate infra; not run on PR)
  • Spot-check the 7 --type values against the registry in cli/packages/flow-tool/src/services/flow-eval-schema/registry.ts
  • Verify the uipath-maestro-flow skill activates correctly on Flow eval requests (no longer competing with a sibling eval-only skill) — confirmed by smoke test results

🤖 Generated with Claude Code

New self-contained skill covering the full uip maestro flow eval CLI
surface:

- evaluator add/list/remove for the 7 evaluator types (exact-match,
  json-similarity, contains, llm-judge-output, llm-judge-strict-json,
  llm-judge-trajectory, llm-judge-trajectory-simulation), with the
  internal uipath-* type-id mapping and JSON shapes.
- set add/list/remove with --evaluators and --entry-point semantics.
- Data point CRUD via top-level add/list/remove, including --inputs,
  --expected, --criteria, --input-file (repeatable file attachment),
  and --search-text for contains evaluators.
- run start/status/results/list/compare with --wait/--timeout, output
  Code enum, JMESPath --output-filter examples, failure detection
  rules, and per-row output fields.
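
As a concrete illustration of the JMESPath --output-filter idea, a filter expression can pull failing rows out of a results payload (the payload shape below is invented for the example; real field names live in running-guide.md):

```python
import jmespath  # pip install jmespath

# Invented payload shape; real field names per running-guide.md.
results = {
    "dataPoints": [
        {"name": "hello", "score": 1.0},
        {"name": "edge-case", "score": 0.4},
    ]
}

# Names of data points that scored below 1.0.
failing = jmespath.search("dataPoints[?score < `1.0`].name", results)
print(failing)  # ['edge-case']
```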

Critical rule (per source review of cli/packages/flow-tool and
solution-tool/src/commands/upload.ts): the skill must NEVER auto-run
`uip solution upload`. The eval run requires the Flow solution to be
in Studio Web, but uploading from a local working tree can overwrite
concurrent Studio Web edits or push work-in-progress unintentionally.
references/upload-safety.md documents the rule, the detection signals
for "local workspace or VS Code" (missing SolutionStorage.json, .vscode/
present, no prior upload in session), and the safe alternative flow
(read-only probe via `eval run list`, then explicit user consent).
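
A minimal sketch of that probe, assuming a Python harness (the command comes from the text above; how a never-uploaded solution manifests, non-zero exit versus an empty listing, is an assumption):

```python
import subprocess

# Read-only: `eval run list` changes nothing remotely. Treating a non-zero
# exit or an empty listing as "solution not in Studio Web" is an assumption;
# either way, the next step is to ask the user, never to upload.
proc = subprocess.run(
    ["uip", "maestro", "flow", "eval", "run", "list", "--output", "json"],
    capture_output=True,
    text=True,
)
solution_in_studio_web = proc.returncode == 0 and proc.stdout.strip() not in ("", "[]")
```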

Skill is self-contained per repo conventions; SKILL.md frontmatter
includes → redirects for sibling skills (uipath-agents for agent
evals, uipath-maestro-flow for non-eval Flow work). CODEOWNERS updated
to mirror uipath-maestro-flow ownership.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 6, 2026

Claude finished @mjnovice's task in 2m 51s


PR Review: uipath-maestro-flow-eval

  • Read review criteria and project conventions
  • Read all changed files (SKILL.md + 5 references + CODEOWNERS)
  • Run validation checks (frontmatter, links, tests, CODEOWNERS)
  • Apply skill review framework and draft findings
  • Post review

Summary

Adds a new self-contained skill uipath-maestro-flow-eval covering the full uip maestro flow eval CLI surface — evaluator CRUD (7 types), eval set CRUD with entry-point pinning, data point management with file attachments, and Studio Web run lifecycle (start/status/results/list/compare). Well-structured, thorough, and immediately usable by an agent.

Change-by-Change Review

1. skills/uipath-maestro-flow-eval/SKILL.md

Severity: OK (with one Low item)

Strong SKILL.md. Frontmatter is valid — name matches folder, description is 392 characters (well under 1024 cap), front-loads identity and domain signals (UiPath Maestro Flow evaluations, uip maestro flow eval), includes redirects for uipath-agents and uipath-maestro-flow. Body follows the canonical section order: Title → When to Use → Critical Rules → Quick Start → Reference Navigation → Anti-patterns. Critical Rules are numbered (8 rules), prescriptive, and cover the key safety concern (no auto-upload). Anti-patterns section is comprehensive (7 items). Reference Navigation table is clean with all 5 references linked.

  • Low — SKILL.md:7 — The allowed-tools includes AskUserQuestion. The uipath-agents skill omits this. Verify this is intentional — not a problem per se, but inconsistent with the nearest sibling.

2. skills/uipath-maestro-flow-eval/references/commands-reference.md

Severity: Low

Comprehensive command reference covering the full command tree, all flags, defaults, and output codes. Tables are well-formatted with required/optional markings. The Output Codes table at the end is a useful agent aid.

  • Low — commands-reference.md:7 — The command tree code block has no language identifier (``` instead of ```text). Minor formatting — a text or plaintext identifier would be more correct per markdown standards.

3. skills/uipath-maestro-flow-eval/references/evaluators-guide.md

Severity: OK

Excellent evaluator reference. The type mapping table (lines 5–13) with --type → internal evaluatorTypeId is exactly what an agent needs. "When to Pick Each Type" decision table is clear. JSON shape examples are complete. Template variable table per type (lines 99–106) is valuable.

No issues.

4. skills/uipath-maestro-flow-eval/references/eval-sets-guide.md

Severity: OK

Thorough coverage of eval set and data point CRUD. The --expected mapping table by evaluator type (lines 64–73) is a good reference. The JSON shape section (lines 119–143) with field explanations helps agents understand the on-disk format. Anti-patterns are specific and actionable.

No issues.

5. skills/uipath-maestro-flow-eval/references/running-guide.md

Severity: Medium

Good coverage of the full run lifecycle. The workflow example (lines 175–196) is a useful end-to-end walkthrough. Failure detection criteria (lines 200–206) are clear.

  • Medium — Upload-safety duplication. The "no auto-upload" rule is stated in: SKILL.md Critical Rule 1 (3+ lines), SKILL.md Anti-patterns (2 lines), running-guide.md callout (line 5), running-guide.md Resolution section (line 26), running-guide.md Anti-patterns (line 211), and the full upload-safety.md (82 lines). The SKILL.md and running-guide.md mentions should be trimmed to a single bridging sentence + link. The current state has 5 locations restating the same rule with varying levels of detail — this is a drift risk per the Reference Pointer Discipline check. The authoritative source is upload-safety.md; the other mentions should link there without restating the "why."

    That said, this is a safety-critical rule where redundancy is arguably a feature. I'm flagging it as Medium rather than High because the rule is consistent across all locations and the risk is over-messaging rather than conflicting guidance. The author can reasonably decide the repetition is intentional.

6. skills/uipath-maestro-flow-eval/references/upload-safety.md

Severity: OK

Well-written safety document. The "Why" section enumerates three concrete failure modes. The detection heuristics (lines 39–46) are practical. The "When the User Explicitly Asks" section (lines 60–74) sets a clear protocol. Anti-patterns are specific.

No issues.

7. CODEOWNERS

Severity: OK

New entry /skills/uipath-maestro-flow-eval/ mirrors the ownership of /skills/uipath-maestro-flow/. Correctly placed alphabetically. No tests/tasks/ CODEOWNERS entry, but that's expected since no tests exist yet (see below).

What's Missing

  • E2E tests. tests/tasks/uipath-maestro-flow-eval/ does not exist. No smoke or e2e task YAML files are included. Per review criteria, flagging as Medium — most skills are not yet test-compliant, but this should be tracked.
  • uipath-maestro-flow sibling redirect. The new skill's description correctly redirects .flow authoring→uipath-maestro-flow, but the uipath-maestro-flow skill's description does not mention →uipath-maestro-flow-eval for eval requests. An agent asking about "flow evaluations" might trigger the parent uipath-maestro-flow skill first (its description mentions .flow and uip maestro flow). Adding a compact For evals→uipath-maestro-flow-eval redirect to the sibling description would improve routing. Medium — not blocking, but a disambiguation gap.

Area Ratings

Area Status Notes
Frontmatter OK Valid YAML, name matches folder, description 392 chars, front-loads identity, has redirects
E2E Tests Issue (Medium) tests/tasks/uipath-maestro-flow-eval/ does not exist
Skill Body OK Canonical section order, 8 critical rules, anti-patterns, completion output
References & Assets OK (with Medium note) 5 references, all linked from SKILL.md, kebab-case naming. Upload-safety redundancy is a minor duplication concern
Repo Hygiene OK CODEOWNERS updated, no secrets, no cross-skill refs, all links resolve, changes scoped to new skill

Issues for Manual Review

  • CLI accuracy. The 7 evaluator --type values and their internal evaluatorTypeId mappings should be verified against the actual flow-eval-schema/registry.ts in the CLI source. The PR description claims verification but the reviewer cannot confirm.
  • --wait and --timeout defaults. commands-reference.md:143 states default timeout is 600s. Verify against CLI source — if the default has changed, the docs will mislead.
  • --export-format flag. running-guide.md:110 documents --export-format <json|csv> on run results. Verify this flag exists on the CLI — it's not in the commands-reference flag table for run results (lines 162–168), which creates an inconsistency.
  • Sibling skill conflict. Without a redirect in uipath-maestro-flow, an agent might trigger the wrong skill for eval-related queries since both skills match on uip maestro flow.

Conclusion

This is a high-quality new skill. The structure, content, and LLM usability are strong — 8 numbered critical rules, prescriptive CLI examples, decision tables, comprehensive anti-patterns, and well-organized reference extraction. The upload-safety documentation is particularly thoughtful.

Request changes on two items:

  1. (Medium) Add a →uipath-maestro-flow-eval redirect to the uipath-maestro-flow skill's description to prevent mis-triggering. This is the most impactful fix — without it, agents will frequently land on the wrong skill for eval queries.
  2. (Medium) Reconcile the --export-format flag: it appears in running-guide.md:110 but not in commands-reference.md's run results flag table.

The missing e2e tests are tracked as Medium but not blocking given repo-wide test coverage status.



github-actions Bot commented May 6, 2026

⚠️ Smoke test coverage gap — the following changed skills have no tests under tests/tasks/:

  • uipath-maestro-flow-eval

Consider adding smoke tests before merging.


@rockymadden rockymadden left a comment


Will tests follow? Else, can we add them?


@gabrielavaduva gabrielavaduva left a comment


This skill should be part of the maestro-flow skill. Considering we already have activation problems, another skill that handles a subset of operations for one project type would not be recommended. We should test this against activation evals.

…aluate capability

Merges uipath-maestro-flow-eval into uipath-maestro-flow as a 4th
capability (evaluate/), addressing review feedback that subset-skills for
the same project type compete for activation against the parent.

- Move 5 eval reference files under references/evaluate/references/
- New evaluate/CAPABILITY.md mirroring author/operate/diagnose format;
  drops universal rules already enforced by parent SKILL.md, keeps the
  6 eval-specific rules (no auto upload, --path semantics, login boundary,
  LLM-judge model pinning, UUID refs, --wait timeout behavior)
- Parent SKILL.md: extend description with eval signal (399 chars),
  add "Evaluate" entry to When-to-use and capability router
- shared/cli-commands.md: add eval subcommand block so flat CLI lookup
  stays authoritative
- CODEOWNERS: drop standalone eval entry (parent maestro-flow ownership
  covers it with same owner set)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mjnovice commented May 6, 2026

@gabrielavaduva — agreed, addressed in e4b58a3.

Folded the eval surface into uipath-maestro-flow as a 4th capability evaluate/, mirroring the existing author/operate/diagnose/ structure:

  • Moved the 5 eval reference files under skills/uipath-maestro-flow/references/evaluate/references/
  • New evaluate/CAPABILITY.md drops the universal rules the parent SKILL.md already enforces (--output json, AskUserQuestion, narration, todos) and keeps only the 6 eval-specific rules (no auto solution upload, --path semantics, login boundary, LLM-judge model pinning, UUID evaluator refs, --wait timeout behavior)
  • Parent SKILL.md description extended with eval signals (399 chars, well under the 1024 cap); "Evaluate" added to When to use and the capability router
  • shared/cli-commands.md extended with the uip maestro flow eval subcommand tree so the flat CLI lookup stays authoritative
  • Deleted skills/uipath-maestro-flow-eval/; updated CODEOWNERS (the parent entry has the same owner set)

Re: testing against activation evals — agreed, this should be covered. The merge collapses two competing skills into one, so the prior activation conflict (eval requests matching both uipath-maestro-flow and uipath-maestro-flow-eval) goes away by construction. Happy to add a Flow-eval activation case to the eval suite — let me know which fixture set it should land in.

@mjnovice mjnovice requested a review from gabrielavaduva May 6, 2026 19:22
@mjnovice mjnovice changed the title from "feat(uipath-maestro-flow-eval): add new skill for Flow evaluations" to "feat(uipath-maestro-flow): add evaluate capability for uip maestro flow eval" May 6, 2026
… e2e)

Adds coder_eval task YAMLs covering the new `evaluate` capability folded
into uipath-maestro-flow in e4b58a3. Addresses @rockymadden's review
request for tests on PR #600.

- evaluate/local_crud.yaml (smoke) — agent scaffolds project + runs full
  local eval CRUD (evaluator add → set add → data point add → list).
  Asserts kebab-case --type, --output json on every eval command, evaluator
  JSON file lands on disk.
- evaluate/evaluator_type_choice.yaml (smoke) — agent picks correct --type
  for 3 goals (NL similarity, deterministic JSON match, substring contains).
  Pure decision-making against report.json.
- evaluate/no_auto_upload.yaml (smoke, anti-pattern guard) — agent refuses
  to auto-`uip solution upload` when solution isn't in Studio Web. Mirrors
  uipath-data-fabric/smoke_negative_guards.yaml pattern. Verifies the
  Critical Rule from references/evaluate/references/upload-safety.md.
- evaluate/eval_run/eval_run.yaml + check_eval_run.py (e2e) — full
  lifecycle on live tenant: scaffold trivial deterministic flow, upload
  (explicit prompt-level authorization), run `eval run start --wait`,
  fetch results, assert score=1.0 across 3 data points. Cleanup wired
  through existing e2e.yaml post_run (cleanup_solutions.py).
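
A rough sketch of the kind of assertion check_eval_run.py makes (the results-file layout is assumed for illustration; the real checker reads the CLI's actual output shape):

```python
import json
from pathlib import Path

# Assumed layout of the saved results file; the real shape comes from
# `eval run results --output json`.
results = json.loads(Path("eval-results.json").read_text())
points = results.get("dataPoints", [])

assert len(points) == 3, f"expected 3 data points, got {len(points)}"
bad = [p["name"] for p in points if p.get("score") != 1.0]
assert not bad, f"data points below 1.0: {bad}"
```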

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 6, 2026

Claude finished @mjnovice's task in 4m 36s


Coder-eval task lint (advisory)

4 task YAMLs changed; verdicts: 0 Critical, 2 High, 2 Medium, 0 Low, 0 OK.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

High — PR body does not claim the changed tasks have been run and passed. The test plan checkboxes cover SKILL.md frontmatter and link resolution, but no coder-eval task run is mentioned. Please edit the PR description to add a line like: Ran skill-flow-eval-local-crud locally and it passed.

Per-task lint

tests/tasks/uipath-maestro-flow/evaluate/evaluator_type_choice.yaml — verdict: High (theme-captured; see Theme 1)

tests/tasks/uipath-maestro-flow/evaluate/no_auto_upload.yaml — verdict: High

Issues:

  • [High] Meaningful coverage (lines 59–102): all 5 success criteria read report.json. No command_executed for any setup CRUD (solution new, flow init, evaluator add, eval set add, data point add). Agent can skip all CLI work and write a compliant JSON file to pass the test. The upload-refusal decision is only verified through the agent's self-written claim.

Suggested fixes:

  • Add command_executed criteria for at least solution new, evaluator add, and eval set add to verify the agent actually built the pre-eval state before making the refusal decision.
  • Consider adding command_executed with a negative assertion or low max_count for solution upload to verify the agent did NOT run the forbidden command, rather than trusting ran_solution_upload: false.

tests/tasks/uipath-maestro-flow/evaluate/local_crud.yaml — verdict: Medium

Issues:

  • [Medium] Prompt over-specification (lines 25–46): numbered step-by-step procedure with exact entity names, types, and explicit --output json flag prescription. The skill should teach when to use --output json and the CRUD sequence; the prompt should state the goal.

Suggested fixes:

  • Condense steps into a goal: "Scaffold a Flow project called SmokeEval (double-nested layout), then use the uipath-maestro-flow skill's evaluate capability to add an exact-match evaluator, an eval set with one data point, and list both to confirm creation. All eval commands must use --output json."
  • Keep entity names (SmokeEval, greeting-match, Smoke Set) and data point values since criteria check for them (ground-truth anchors), but remove the numbered procedure.

tests/tasks/uipath-maestro-flow/evaluate/eval_run/eval_run.yaml — verdict: Medium

Issues:

  • [Medium] Prompt over-specification (lines 47–73): run-phase commands are spelled out as a numbered procedure with exact flags (--wait --timeout 600 --output json, resource refresh before upload). The skill's evaluate capability teaches this workflow; the prompt should state the goal.
  • [Medium] Validate-only flow tests (line 12): e2e tier, tags include uipath-maestro-flow, no command_executed matching flow\s+debug. Mitigating: eval run start --wait exercises runtime and check_eval_run.py (lines 160–166, weight 8.0) validates all 3 data points scored 1.0. The eval run IS the runtime path for this capability — flow debug is not applicable to eval workflows.

Suggested fixes:

  • For over-spec: replace the numbered run steps (lines 48–60) with a goal: "Upload the solution (authorized for this task), start an eval run, wait for completion, and save detailed results to eval-results.json."
  • For validate-only: add a rationale to the description field: "Skipping flow debug — eval run exercises runtime; check_eval_run.py validates all data points scored 1.0." This triggers the rubric's description-rationale carve-out and downgrades the finding to Low.

Within-PR duplicates

No duplicate clusters detected. The 4 tasks exercise materially distinct operations: decision-only knowledge test (evaluator_type_choice), local CRUD happy-path (local_crud), anti-pattern guard / negative test (no_auto_upload), and full e2e lifecycle (eval_run). local_crud and no_auto_upload share CRUD setup scaffolding but test fundamentally different objectives.

Theme 1: Self-report anti-pattern (Critical)

Affected: evaluator_type_choice.yaml, no_auto_upload.yaml

Both tasks instruct the agent to write report.json and all success criteria check that file. The agent grades its own homework — evaluator_type_choice has json_check assertions with exact expected values (which constrains answers but doesn't prove the skill was consulted), and no_auto_upload has no command_executed criteria at all. A capable LLM could pass both without invoking the skill or running any CLI commands.

Suggested systemic fix:

  • evaluator_type_choice: have the agent run uip maestro flow eval evaluator add with each chosen type (they'll fail without a project, but command_executed can verify the correct --type flag was used). Or scaffold a minimal project and validate the evaluator was created with evaluator list --output json + json_check.
  • no_auto_upload: add command_executed checks for the CRUD setup steps (the agent should demonstrably build the pre-eval state) and keep report.json only for the refusal-decision fields.

Conclusion

⚠ 4 task(s) have issues, max severity High. Advisory only — not blocking merge. The two biggest wins: (1) add command_executed criteria to no_auto_upload.yaml so the CRUD setup is verified, and (2) add a passing-run claim to the PR description.


…ommand + sidecar

The coder-eval TaskDefinition schema does not include `json_check` or the
`operator: in` value (the data-fabric task I copied from must be similarly
broken on main, but isn't surfaced because data-fabric is not changed in
this PR). Refactor the two affected tasks to use `run_command` with a
small Python sidecar — same pattern as the e2e and uipath-agents
eval_exact_match.

- evaluate/check_type_choice.py — validates report.json against expected
  evaluator type per goal (A/B/C)
- evaluate/check_no_auto_upload.py — validates report.json shows refusal,
  ran_solution_upload=false, ran_eval_run_start=false, Studio Web reason
- evaluate/evaluator_type_choice.yaml — replace 3 json_check entries with
  one run_command pointing at check_type_choice.py
- evaluate/no_auto_upload.yaml — replace 3 json_check + operator:in
  entries with one run_command pointing at check_no_auto_upload.py.
  Drop two negative `command_executed` checks (max_count is not in the
  schema) — sidecar already verifies via report.json.

Verified locally: all 4 YAMLs pass `yaml.safe_load` and use only valid
TaskDefinition types.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 6, 2026

Claude finished @mjnovice's task in 3m 49s


Coder-eval task lint (advisory)

4 task YAMLs changed; verdicts: 0 Critical, 2 High, 2 Medium, 0 Low, 0 OK.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

High — PR body does not claim the changed tasks have been run and passed. The test plan checkboxes cover SKILL.md frontmatter and link resolution, but no coder-eval task run is mentioned. Please edit the PR description to add a line like: Ran skill-flow-eval-local-crud locally and it passed.

Per-task lint

tests/tasks/uipath-maestro-flow/evaluate/evaluator_type_choice.yaml — verdict: High (theme-captured; see Theme 1)

tests/tasks/uipath-maestro-flow/evaluate/no_auto_upload.yaml — verdict: High

Issues:

  • [High] Meaningful coverage (lines 59–72): all success criteria read report.json. No command_executed for any setup CRUD (solution new, flow init, evaluator add, eval set add, data point add). Agent can skip all CLI work and write a compliant JSON file to pass the test. The upload-refusal decision is verified only through the agent's self-written claim.

Suggested fixes:

  • Add command_executed criteria for at least solution new, evaluator add, and eval set add to verify the agent actually built the pre-eval state before making the refusal decision.
  • Consider adding command_executed with a negative assertion or low max_count for solution upload to verify the agent did NOT run the forbidden command, rather than trusting ran_solution_upload: false.

tests/tasks/uipath-maestro-flow/evaluate/local_crud.yaml — verdict: Medium

Issues:

  • [Medium] Prompt over-specification (lines 25–46): numbered step-by-step procedure with exact entity names, types, and explicit --output json flag prescription. The skill should teach when to use --output json and the CRUD sequence; the prompt should state the goal.

Suggested fixes:

  • Condense steps into a goal: "Scaffold a Flow project called SmokeEval (double-nested layout), then use the uipath-maestro-flow skill's evaluate capability to add an exact-match evaluator, an eval set with one data point, and list both to confirm creation. All eval commands must use --output json."
  • Keep entity names (SmokeEval, greeting-match, Smoke Set) and data point values since criteria check for them (ground-truth anchors), but remove the numbered procedure.

tests/tasks/uipath-maestro-flow/evaluate/eval_run/eval_run.yaml — verdict: Medium

Issues:

  • [Medium] Prompt over-specification (lines 47–73): run-phase commands are spelled out as a numbered procedure with exact flags (--wait --timeout 600 --output json, resource refresh before upload). The skill's evaluate capability teaches this workflow; the prompt should state the goal.
  • [Medium] Validate-only flow tests (tags: uipath-maestro-flow, tier: e2e): no command_executed matching flow\s+debug. Mitigating: eval run start --wait exercises runtime and check_eval_run.py (lines 160–166, weight 8.0) validates all 3 data points scored 1.0. The eval run IS the runtime path for this capability — flow debug is not applicable to eval workflows.

Suggested fixes:

  • For over-spec: replace the numbered run steps (lines 48–60) with a goal: "Upload the solution (authorized for this task), start an eval run, wait for completion, and save detailed results to eval-results.json."
  • For validate-only: add a rationale to the description field (e.g. "Skipping flow debug — eval run exercises runtime; check_eval_run.py validates all data points scored 1.0."). This triggers the rubric's description-rationale carve-out and downgrades the finding to Low.

Within-PR duplicates

No duplicate clusters detected. The 4 tasks exercise materially distinct operations: decision-only knowledge test (evaluator_type_choice), local CRUD happy-path (local_crud), anti-pattern guard / negative test (no_auto_upload), and full e2e lifecycle (eval_run). local_crud and no_auto_upload share CRUD setup scaffolding but test fundamentally different objectives.

Theme 1: Self-report anti-pattern (Critical)

Affected: evaluator_type_choice.yaml, no_auto_upload.yaml

Both tasks instruct the agent to write report.json and all success criteria check that file. The agent grades its own homework — evaluator_type_choice has run_command assertions via check_type_choice.py with exact expected values (which constrains answers but doesn't prove the skill was consulted), and no_auto_upload has no command_executed criteria at all. A capable LLM could pass both without invoking the skill or running any CLI commands.

Suggested systemic fix:

  • evaluator_type_choice: have the agent run uip maestro flow eval evaluator add with each chosen type (they'll fail without a project, but command_executed can verify the correct --type flag was used). Or scaffold a minimal project and validate the evaluator was created with evaluator list --output json + json_check.
  • no_auto_upload: add command_executed checks for the CRUD setup steps (the agent should demonstrably build the pre-eval state) and keep report.json only for the refusal-decision fields.

Conclusion

⚠ 4 task(s) have issues, max severity High. Advisory only — not blocking merge. The two biggest wins: (1) add command_executed criteria to no_auto_upload.yaml so the CRUD setup is verified, and (2) add a passing-run claim to the PR description.


Both local_crud and no_auto_upload hit MAX_TURNS_EXHAUSTED on the first
real run. The agent budget was too tight for the full path: skill
activation → read CAPABILITY → read upload-safety → scaffold solution +
flow project → eval CRUD → write report.json.

- local_crud.yaml: max_turns 30 → 60. Agent reached `evaluator add` but
  ran out of turns before `set add`, `eval add`, and the verification
  lists. Score was 0.47 (4/8 criteria passed).
- no_auto_upload.yaml: max_turns 25 → 50. Agent never wrote report.json.
  Score was 0.00.
- local_crud.yaml: also fix alternation precedence in the `eval add`
  regex — `(--inputs.*--expected|--expected.*--inputs)` properly groups
  both flag orderings under the `uip ... eval add ...` prefix.
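
To see why the grouping matters: in Python's re syntax, `|` binds loosest, so without parentheses the second branch escapes the command prefix entirely. A quick demonstration (command strings are illustrative):

```python
import re

# Ungrouped: the alternation splits the WHOLE pattern, so the second
# branch matches any text containing --expected ... --inputs, with no
# `eval add` prefix required.
loose = re.compile(r"uip .*eval add .*--inputs.*--expected|--expected.*--inputs")

# Grouped: both flag orderings stay under the `eval add` prefix.
strict = re.compile(r"uip .*eval add .*(--inputs.*--expected|--expected.*--inputs)")

cmd = "uip maestro flow eval add --expected exp.json --inputs in.json"
other = "uip solution upload --expected exp.json --inputs in.json"

assert strict.search(cmd) and not strict.search(other)
assert loose.search(other)  # the false positive the fix removes
```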

Reference: evaluator_type_choice.yaml passed 1.00/15-turn budget cleanly
since it's pure decision-making (no CLI calls).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 6, 2026

Claude finished @mjnovice's task in 3m 45s


Coder-eval task lint (advisory)

  • Read lint rubric (.claude/commands/lint-task.md)
  • Identify changed task YAMLs (4 files)
  • Check PR body for evidence of passing run
  • Read and lint each changed task YAML
  • Within-PR duplicate check
  • Post final lint results


4 task YAMLs changed; verdicts: 0 Critical, 2 High, 2 Medium, 0 Low, 0 OK.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

High — PR body does not claim the changed tasks have been run and passed. The test plan checkboxes cover SKILL.md frontmatter and link resolution, but no coder-eval task run is mentioned. Please edit the PR description to add a line like: Ran skill-flow-eval-local-crud locally and it passed.

Per-task lint

tests/tasks/uipath-maestro-flow/evaluate/evaluator_type_choice.yaml — verdict: High (theme-captured; see Theme 1)

tests/tasks/uipath-maestro-flow/evaluate/no_auto_upload.yaml — verdict: High

Issues:

  • [High] Meaningful coverage (lines 59–72): all 5 success criteria read report.json via file_exists + run_command. No command_executed for any setup CRUD (solution new, flow init, evaluator add, eval set add, data point add). Agent can skip all CLI work and write a compliant JSON file to pass.

Suggested fixes:

  • Add command_executed criteria for at least solution new, evaluator add, and eval set add to verify the agent actually built the pre-eval state before making the refusal decision.
  • Consider adding command_executed with a negative assertion or max_count: 0 for solution upload to verify the agent did NOT run the forbidden command, rather than trusting ran_solution_upload: false in report.json.

tests/tasks/uipath-maestro-flow/evaluate/local_crud.yaml — verdict: Medium

Issues:

  • [Medium] Prompt over-specification (lines 25–45): numbered step-by-step procedure with exact entity names, types, and explicit --output json flag prescription. Entity names (SmokeEval, greeting-match, Smoke Set, hello) are ground-truth anchors (OK per rubric), but the 5-step procedure and flag prescription should be the skill's job.
  • [Medium] Meaningful coverage (lines 47–108): all 8 criteria are command_executed / file_exists — no output validation. Verifies the agent ran correct commands but not that they succeeded or produced correct output.

Suggested fixes:

  • Condense steps into a goal: "Scaffold a Flow project called SmokeEval (double-nested layout), then use the uipath-maestro-flow skill's evaluate capability to add an exact-match evaluator named greeting-match, an eval set Smoke Set with one data point hello (inputs: {"name":"Alice"}, expected: {"greeting":"Hello, Alice!"}), and list both to confirm creation."
  • Add a run_command or json_check on at least one evaluator list or set list output to verify the created entities appear in the listing.

tests/tasks/uipath-maestro-flow/evaluate/eval_run/eval_run.yaml — verdict: Medium

Issues:

  • [Medium] Prompt over-specification (lines 47–60): run-phase commands spelled out as a numbered procedure with exact flags (--wait --timeout 600 --output json, resource refresh before upload). The skill's evaluate capability teaches this workflow; the prompt should state the goal.
  • [Medium] Validate-only flow tests (line 12): e2e tier, tags include uipath-maestro-flow, no command_executed matching flow\s+debug. Mitigating: eval run start --wait exercises runtime through Studio Web and check_eval_run.py (lines 160–166, weight 8.0) validates all 3 data points scored 1.0. The eval run IS the runtime path for this capability — flow debug is not applicable to eval workflows.

Suggested fixes:

  • For over-spec: replace the numbered run steps (lines 48–60) with a goal: "Upload the solution (authorized for this task), start an eval run, wait for completion, and save detailed results to eval-results.json."
  • For validate-only: add a rationale to the description field: "Skipping flow debug — eval run exercises runtime; check_eval_run.py validates all data points scored 1.0." This triggers the rubric's description-rationale carve-out and downgrades the finding to Low.

Within-PR duplicates

No duplicate clusters detected. The 4 tasks exercise materially distinct operations: decision-only knowledge test (evaluator_type_choice), local CRUD happy-path (local_crud), anti-pattern guard / negative test (no_auto_upload), and full e2e lifecycle (eval_run). local_crud and no_auto_upload share CRUD setup scaffolding but test fundamentally different objectives.

Theme 1: Self-report anti-pattern (Critical)

Affected: evaluator_type_choice.yaml, no_auto_upload.yaml

Both tasks instruct the agent to write report.json and success criteria check that file. The agent grades its own homework — evaluator_type_choice has run_command via check_type_choice.py with exact expected values (constrains answers but doesn't prove the skill was consulted), and no_auto_upload has no command_executed criteria at all. A capable LLM could pass both without invoking the skill or running any CLI commands.

Suggested systemic fix:

  • evaluator_type_choice: have the agent run uip maestro flow eval evaluator add with each chosen type (they'll fail without a project, but command_executed can verify the correct --type flag was used). Or scaffold a minimal project and validate the evaluator was created with evaluator list --output json + json_check.
  • no_auto_upload: add command_executed checks for the CRUD setup steps (the agent should demonstrably build the pre-eval state) and keep report.json only for the refusal-decision fields.

Conclusion

⚠ 4 task(s) have issues, max severity High. Advisory only — not blocking merge. The two biggest wins: (1) add command_executed criteria to no_auto_upload.yaml so the CRUD setup is verified, and (2) add a passing-run claim to the PR description.


Address the advisory lint comment from coder-eval (Claude Code Action) on
PR #600 (issuecomment-4391714203). Three substantive fixes; cosmetic
prompt-tightening and the validate-only carve-out left as follow-ups.

#1 — no_auto_upload.yaml: Meaningful coverage (High).
Previously all 5 criteria read report.json; agent could bypass every CLI
command. Add 5 `command_executed` checks for the CRUD setup steps:
solution new, flow init, evaluator add (--type exact-match), eval set add,
eval add. Sidecar weight rebalanced.

#3 — evaluator_type_choice.yaml: Self-report anti-pattern (Critical).
Previously a pure decision task — agent wrote report.json and
check_type_choice.py validated the JSON. Now the agent must scaffold a
project and
actually run `uip maestro flow eval evaluator add --type <chosen>` for
each goal. Three `command_executed` matchers pin the kebab-case --type
values (llm-judge-output / json-similarity / contains), so the agent has
to invoke the skill, not just self-report. max_turns 15 → 50.

#4 — local_crud.yaml: Meaningful coverage (Medium).
Previously all 8 criteria were command/file existence; no output check.
Add check_local_crud.py sidecar that walks SmokeEval/ and verifies an
  evaluator JSON exists with name="greeting-match" and
  evaluatorTypeId="uipath-exact-match", plus an eval set "Smoke Set"
  containing a data
point "hello" with inputs and expected output. Catches commands that
ran but errored.
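
Roughly, that disk walk could look like this (directory layout and JSON nesting are assumptions; the field names are the ones quoted above):

```python
import json
from pathlib import Path

def any_json_matches(root: Path, predicate) -> bool:
    """True if any *.json under root parses and satisfies predicate."""
    for path in root.rglob("*.json"):
        try:
            data = json.loads(path.read_text())
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue
        if isinstance(data, dict) and predicate(data):
            return True
    return False

root = Path("SmokeEval")
has_evaluator = any_json_matches(
    root,
    lambda d: d.get("name") == "greeting-match"
    and d.get("evaluatorTypeId") == "uipath-exact-match",
)
print("evaluator found:", has_evaluator)
```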

#2 (passing-run claim) handled by separate PR description edit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 6, 2026

Claude finished @mjnovice's task in 4m 36s


Coder-eval task lint (advisory)

  • Read lint rubric (.claude/commands/lint-task.md)
  • Identify changed task YAMLs (4 files)
  • Check PR body for evidence of passing run
  • Read and lint each changed task YAML
  • Within-PR duplicate check
  • Post final lint results

4 task YAMLs changed; verdicts: 0 Critical, 0 High, 3 Medium, 1 Low, 0 OK.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

High — PR body does not claim the changed tasks have been run and passed. The test plan checkboxes cover SKILL.md frontmatter and link resolution, but no coder-eval task run is mentioned. Please edit the PR description to add a line like: Ran skill-flow-eval-local-crud locally and it passed.

Per-task lint

tests/tasks/uipath-maestro-flow/evaluate/evaluator_type_choice.yaml — verdict: Low

Issues:

  • [Low] Prompt over-specification (lines 33–34): lists all 7 valid --type values in the prompt; the skill's evaluator taxonomy is partially leaked, though the core challenge — matching goal descriptions to the correct type — remains a decision the agent must make.

Suggested fixes:

  • Consider removing the explicit list of 7 --type values from the prompt and instead saying "pick the correct --type from the types the skill documents." The agent should discover valid types from the skill's evaluators-guide, not from the prompt.

tests/tasks/uipath-maestro-flow/evaluate/no_auto_upload.yaml — verdict: Medium

Issues:

  • [Medium] Self-report anti-pattern (lines 41–48, 100–112): the refusal decision is validated only via agent-written report.json (check script weight 5.0). CRUD setup IS properly validated by 5 command_executed criteria (lines 60–98, total weight 6.5) — this is a significant improvement. However, the core assertion — that the agent refused to upload — remains self-reported. No negative assertion prevents the agent from running solution upload and then claiming it didn't.
  • [Medium] Prompt over-specification (lines 37–39, 50–57): the upload-safety rule is restated in the prompt ("you must NOT auto-run uip solution upload") and forbidden commands are explicitly listed. An agent following prompt instructions alone passes the refusal portion without consulting the skill's upload-safety.md.

Suggested fixes:

  • Add a command_executed criterion with max_count: 0 (if supported) for solution upload to independently verify the agent did NOT run the forbidden command, rather than relying on ran_solution_upload: false in the self-written report.
  • Soften the prompt's refusal hint: instead of "you must NOT auto-run uip solution upload", say "Decide what to do based on the skill's Critical Rules. Record your decision in report.json." Let the skill teach the rule.

tests/tasks/uipath-maestro-flow/evaluate/local_crud.yaml — verdict: Medium

Issues:

  • [Medium] Prompt over-specification (lines 25–45): 5-step numbered procedure with explicit flag prescription (line 41: "Use --output json on every uip maestro flow eval ... command"). Entity names (SmokeEval, greeting-match, Smoke Set, hello) are ground-truth anchors (OK per rubric), but the step-by-step CRUD sequence and --output json flag requirement should be the skill's teaching, not the prompt's prescription.

Suggested fixes:

  • Condense steps into a goal: "Scaffold a Flow project called SmokeEval (double-nested layout), then use the uipath-maestro-flow skill's evaluate capability to add an exact-match evaluator named greeting-match, an eval set Smoke Set with one data point hello (inputs: {"name":"Alice"}, expected: {"greeting":"Hello, Alice!"}), and list both to confirm creation."
  • Remove the --output json prescription — the skill should teach when to use it.

tests/tasks/uipath-maestro-flow/evaluate/eval_run/eval_run.yaml — verdict: Medium

Issues:

  • [Medium] Prompt over-specification (lines 47–60): run-phase commands spelled out as a numbered procedure with exact flags (--wait --timeout 600 --output json, resource refresh before upload). The skill's evaluate capability teaches this workflow; the prompt should state the goal.
  • [Medium] Validate-only flow tests (line 12): e2e tier, tags include uipath-maestro-flow, no command_executed matching flow\s+debug. Mitigating: eval run start --wait exercises runtime via Studio Web, and check_eval_run.py (lines 160–166, weight 8.0) validates all 3 data points scored 1.0. The eval run IS the runtime path for this capability — flow debug is not applicable to eval workflows.

Suggested fixes:

  • For over-spec: replace the numbered run steps (lines 48–60) with a goal: "Upload the solution (authorized for this task), start an eval run, wait for completion, and save detailed results to eval-results.json."
  • For validate-only: add a rationale to the description field (e.g. "Skipping flow debug — eval run exercises runtime; check_eval_run.py validates all data points scored 1.0."). This triggers the rubric's description-rationale carve-out and downgrades the finding to Low.

Within-PR duplicates

No duplicate clusters detected. The 4 tasks exercise materially distinct operations: type-selection decision (evaluator_type_choice), local CRUD happy-path (local_crud), anti-pattern guard / negative test (no_auto_upload), and full e2e lifecycle (eval_run). local_crud and no_auto_upload share CRUD setup scaffolding but test fundamentally different objectives.

Conclusion

⚠ 3 task(s) have issues at Medium, 1 at Low. Advisory only — not blocking merge. The two biggest wins: (1) soften the prompt hints in no_auto_upload.yaml so the skill's upload-safety rule is genuinely tested, and (2) add a passing-run claim to the PR description.

Notable improvement from prior rounds: the evaluator_type_choice and no_auto_upload tasks now include command_executed criteria for CRUD setup, substantially addressing the self-report anti-pattern flagged in earlier reviews. The remaining self-report concern on no_auto_upload is limited to the refusal decision — an inherently difficult negative to verify.



@gabrielavaduva gabrielavaduva left a comment


looks great!
