test(hitl): migrate HITL smoke tasks to canonical pattern vocabulary by tmatup · Pull Request #555 · UiPath/skills

tmatup · 2026-05-04T22:40:43Z

Summary

Replaces 12 brittle json_check.contains substring assertions across the five HITL smoke tasks with strict canonical-name checks aligned with references/hitl-patterns.md.

| Task | Before | After |
| --- | --- | --- |
| `smoke_01_explicit` | `pattern contains pprov \| rite \| alid` (3 fragments, OR @ 0.33) | `pattern regex ^(approval-gate\|write-back-validation)$` |
| `smoke_02_approval_gate` | `pattern contains pprov` | `pattern equals approval-gate` |
| `smoke_03_escalation` | `pattern contains scalat`, `insertion_point contains lassif` | `pattern regex ^(exception-escalation\|agentic-output-review)$`, `insertion_point regex (?i)(classif\|confidence\|routing\|...)` |
| `smoke_04_writeback` | `pattern contains rite \| alid \| nrich` (3 fragments, OR @ 0.33) | `pattern regex ^(write-back-validation\|data-enrichment\|agentic-output-review)$` |
| `smoke_05_compliance` | `pattern contains udit \| omplianc \| ign \| pprov` (4 fragments, OR @ 0.25) | `pattern regex ^(compliance-checkpoint\|approval-gate)$` (+ prompt strengthened with explicit "regulatory compliance" framing and GDPR Article 17 reference) |

Each task's prompt now lists the same 6 canonical pattern machine names mirroring the section titles in references/hitl-patterns.md:

approval-gate, exception-escalation, data-enrichment, compliance-checkpoint, write-back-validation, agentic-output-review

The agent is required to emit exactly one of those names verbatim. Where the scenario maps to multiple defensible canonical answers per the skill doc, the criterion uses regex and accepts each documented option.
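As a rough illustration of the contract described above, the sketch below checks an emitted `pattern` value first against the closed vocabulary and then against the per-task criterion from the table. The names `CANONICAL_PATTERNS`, `ACCEPTED`, and `grade_pattern` are hypothetical and are not part of coder_eval. Using `re.search` is an assumption about how a `regex` operator might be evaluated, which is why the `^...$` anchors matter here.

```python
import re

# The six canonical machine names listed in each task prompt.
CANONICAL_PATTERNS = {
    "approval-gate",
    "exception-escalation",
    "data-enrichment",
    "compliance-checkpoint",
    "write-back-validation",
    "agentic-output-review",
}

# Per-task criteria copied from the table above.
# smoke_02 uses `equals` in the PR; it is written as an anchored regex here
# only so that every task can share one code path in this sketch.
ACCEPTED = {
    "smoke_01_explicit": r"^(approval-gate|write-back-validation)$",
    "smoke_02_approval_gate": r"^approval-gate$",
    "smoke_03_escalation": r"^(exception-escalation|agentic-output-review)$",
    "smoke_04_writeback": r"^(write-back-validation|data-enrichment|agentic-output-review)$",
    "smoke_05_compliance": r"^(compliance-checkpoint|approval-gate)$",
}


def grade_pattern(task: str, emitted: str) -> bool:
    """Return True only if `emitted` is canonical and accepted for `task`."""
    if emitted not in CANONICAL_PATTERNS:
        return False
    return re.search(ACCEPTED[task], emitted) is not None
```

Under these assumptions, `grade_pattern("smoke_03_escalation", "agentic-output-review")` passes, while a free-form paraphrase is rejected before the regex is even consulted.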

Why

The original assertions used short substrings of the canonical pattern names, between 3 and 8 characters long (e.g. expected: "scalat" for "escalation", expected: "udit" for "audit", expected: "ign" for "sign-off"). When the agent's free-form English answer didn't happen to contain that exact fragment, the smoke task failed, even when the answer was semantically correct. Local triage of run 2026-05-04_04-05-27 (and a 9-rep cross-grid across PR #517 / #545 / #546 worktrees) traced one such failure to expected: "scalat": smoke_03_escalation passed 8/9 locally, but the failing rep emitted pattern: "confidence-gated human review", a semantically correct answer that simply didn't contain the literal substring.
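The failure mode can be reproduced in a few lines. This is a minimal sketch that assumes the old `contains` operator behaved like a case-sensitive Python substring test; the helper `contains` and the sample answers other than the failing one are illustrative, not taken from the harness.

```python
def contains(actual: str, expected: str) -> bool:
    """Substring check, standing in for the old `contains` operator (assumed semantics)."""
    return expected in actual


answers = [
    "exception-escalation",
    "Escalation to a supervisor",
    "confidence-gated human review",  # the answer emitted by the failing rep
]

# Map each answer to whether the old "scalat" fragment would have matched it.
results = {answer: contains(answer, "scalat") for answer in answers}

# One failing rep out of nine, as reported in the local cross-grid.
flake_rate = 1 / 9
```

The first two answers match the fragment and the third does not, although all three describe the same pattern. One miss in nine reps corresponds to the roughly 11% flake rate quoted in the commit message.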

Replacing the substring trap with a controlled vocabulary:

- Decouples the test from agent-paraphrase variance.
- Forces the agent to consult references/hitl-patterns.md (the prompt names the file) and pick a documented pattern.
- Surfaces test/skill drift loudly: if the skill ever renames a pattern, the test fails immediately rather than slowly drifting into uselessness.

Commits

fix(hitl): rewrite smoke_03_escalation around canonical pattern vocabulary — the originally failing test that triggered the investigation.
test(hitl): migrate 4 HITL smoke tasks to canonical pattern vocabulary — the proactive cleanup of the same anti-pattern in the four sibling smoke tasks.

Test plan

Local: coder-eval run --repeats 3 against this branch — 3/3 SUCCESS at score 1.0 for smoke_03 (commit 1) and 12/12 SUCCESS at score 1.0 across smoke_01/02/04/05 (commit 2). 15/15 total.
Authoritative validator-driven scan of origin/main (post-sync) confirms all 6 failing criteria across these 5 files are covered. After this PR lands: 0 violators repo-wide.
CI smoke-skills run (will fire on PR push).

Companion PR

UiPath/coder_eval#212 — adds the matching Pydantic validator that hard-fails any task YAML using json_check.contains with a < 8-char literal at task-load time. Merge this PR first, otherwise smoke-skills CI will go red the moment coder_eval ships the validator.
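For readers who have not opened the companion PR, here is a minimal sketch of the kind of load-time rule it describes. It deliberately uses a plain dataclass instead of Pydantic so that it runs with the standard library alone. The class name `JsonCheckCriterion` and the `path` field are hypothetical; only `operator`, `expected`, and the 8-character threshold come from this PR's text, and the real validator in UiPath/coder_eval#212 may be structured quite differently.

```python
from dataclasses import dataclass

MIN_CONTAINS_LITERAL = 8  # threshold quoted in the PR description


@dataclass(frozen=True)
class JsonCheckCriterion:
    """Hypothetical stand-in for one json_check assertion in a task YAML."""

    path: str
    operator: str
    expected: str

    def __post_init__(self) -> None:
        # Reject short `contains` literals when the task is loaded,
        # before any evaluation run starts.
        if self.operator == "contains" and len(self.expected) < MIN_CONTAINS_LITERAL:
            raise ValueError(
                f"{self.path}: contains literal {self.expected!r} is shorter "
                f"than {MIN_CONTAINS_LITERAL} characters; use equals or regex"
            )
```

With this rule, constructing `JsonCheckCriterion("pattern", "contains", "scalat")` raises `ValueError`, whereas the migrated `regex` and `equals` criteria load without complaint. That is the ordering concern behind merging this PR first.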

🤖 Generated with Claude Code

…ulary

The criterion `pattern contains "scalat"` (a 6-char fragment of "escalation") was a paraphrase trap: when the agent answered with "confidence-gated review" or any non-"escalation" phrasing, the substring missed and the smoke task failed. Local 9-rep cross-grid across 3 PR worktrees confirmed it as a stochastic ~11% flake unrelated to PR diffs.

Replace the brittle substring with a canonical-vocabulary contract:

- Prompt now lists the 6 canonical HITL pattern machine names from `references/hitl-patterns.md` (`approval-gate`, `exception-escalation`, `data-enrichment`, `compliance-checkpoint`, `write-back-validation`, `agentic-output-review`) and requires the agent to emit one verbatim. Names mirror the doc's section titles.
- Prompt borrows the skill's own signal phrases for the scenario ("edge cases", "cannot resolve autonomously", "escalate to a supervisor") so the test exercises documented recognition rather than free-form English.
- `pattern` criterion: `regex ^(exception-escalation|agentic-output-review)$`. Both names are documented as applicable to a low-confidence-AI scenario, so either is accepted (honest about the documented overlap rather than overfitting the prompt).
- `insertion_point` criterion: a multi-alternate regex matching classifier / confidence / routing / low-confidence variants.

The `regex` operator makes the pattern-matching nature visible in the YAML, with no more short-literal substring guessing.

Verified locally with 3 reps × 3 PR-worktree combinations after the rewrite: 3/3 SUCCESS at score 1.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-04T22:41:03Z

Claude finished @tmatup's task in 2m 18s

PR Review: fix(hitl): rewrite smoke_03_escalation around canonical pattern vocabulary

Read review criteria and project conventions
Read PR diff and full changed file
Read referenced skill docs (hitl-patterns.md) for context
Check sibling smoke tests for the old anti-pattern
Perform review and post feedback

Summary

Rewrites smoke_03_escalation.yaml to constrain the agent to a closed vocabulary of HITL pattern machine names instead of matching a fragile 6-character substring ("scalat"). The prompt now enumerates 6 canonical names and the assertion uses regex with an exact match against the two valid answers. This directly addresses stochastic CI failures.

Change-by-Change Review

1. `tests/tasks/uipath-human-in-the-loop/smoke_03_escalation.yaml` — prompt rewrite

Severity: OK (with two minor notes)

What's good:

Switching from operator: contains / expected: "scalat" to operator: regex / expected: "^(exception-escalation|agentic-output-review)$" eliminates the failure mode where a semantically correct paraphrase misses a substring fragment. This is the right fix.
Accepting both exception-escalation and agentic-output-review is well-reasoned — both patterns legitimately apply to a low-confidence-AI scenario per the skill doc, and the description explains why.
The insertion_point regex ((?i)(classif|confidence|routing|low.{0,10}confidence|after.{0,30}\\bAI\\b)) is a solid improvement over "lassif" — it covers the semantic space without being overly brittle.
The scenario text now uses the skill's own signal phrases ("edge cases", "cannot resolve autonomously", "escalate to a supervisor"), making it a better test of whether the agent recognises documented patterns.
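To make the behaviour of the `insertion_point` regex quoted above concrete, the sketch below compiles it in Python and probes it with a few sample phrasings. The sample strings are invented for illustration, and search (rather than full-match) semantics is an assumption about the harness. The YAML form doubles the backslashes (`\\b`); in a Python raw string a single `\b` is the equivalent.

```python
import re

# Same expression as in the review comment, written as a Python raw string.
INSERTION_POINT = re.compile(
    r"(?i)(classif|confidence|routing|low.{0,10}confidence|after.{0,30}\bAI\b)"
)


def insertion_point_ok(text: str) -> bool:
    """Return True if any alternative matches anywhere in `text`."""
    return INSERTION_POINT.search(text) is not None


samples = {
    "After the classifier assigns a label": insertion_point_ok(
        "After the classifier assigns a label"
    ),
    "between Confidence scoring and routing": insertion_point_ok(
        "between Confidence scoring and routing"
    ),
    "right after the AI model returns its draft": insertion_point_ok(
        "right after the AI model returns its draft"
    ),
    "before the final email is sent": insertion_point_ok(
        "before the final email is sent"
    ),
}
```

The first three phrasings match and the last does not. One side observation under search semantics: any text matched by `low.{0,10}confidence` is also matched by the bare `confidence` alternative, so that branch documents intent more than it changes the result.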

Note 1 — pattern count mismatch (Low):
Line 37 says "enumerates six canonical patterns", but references/hitl-patterns.md actually has 10 ###-level sections (6 core + 4 domain-specific: IT/change-management, HR/offer, customer-communication, financial-transaction). The prompt explicitly lists the 6 names so the agent is still properly constrained — but an agent that reads the file will see 10, not 6. Consider saying "six core patterns" or updating the count.

Note 2 — compliance-checkpoint naming (Low):
Five of the six machine names are exact kebab-case conversions of hitl-patterns.md headings. The exception is compliance-checkpoint which truncates "Compliance and audit checkpoint". The prompt claims names "mirror the section titles" (line 36–37). This doesn't affect this test (neither name is in the accepted regex), but when the follow-up PR migrates the other smoke tests, smoke_05_compliance will presumably accept compliance-checkpoint — and the mismatch between that name and the heading could confuse the agent. Worth normalizing now or in that follow-up.
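The mismatch in Note 2 is easy to see with a mechanical conversion. This is a sketch only: `kebab` is a hypothetical helper, and whether the repository derives machine names from headings by any fixed rule is not stated in the PR. The heading text is the one quoted in the note.

```python
import re


def kebab(heading: str) -> str:
    """Lowercase a heading and join its alphanumeric words with hyphens."""
    words = re.findall(r"[a-z0-9]+", heading.lower())
    return "-".join(words)


heading = "Compliance and audit checkpoint"
machine_name = "compliance-checkpoint"

derived = kebab(heading)            # "compliance-and-audit-checkpoint"
matches = derived == machine_name   # False: the machine name drops "and audit"
```

A check along these lines, run over every heading and machine-name pair, would flag this one entry and leave the five exact conversions untouched.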

2. Task metadata (lines 1–6)

Severity: OK
task_id, description, and tags are unchanged and correct. tags follows convention: skill name first, then test type.

What's Missing

Sibling smoke test migration (acknowledged): smoke_01, smoke_02, smoke_04, smoke_05 all still use the fragile operator: contains + substring pattern (confirmed via grep). The PR description explicitly calls this out as a follow-up — no issue, just confirming it's real.
Nothing else identified.

Area Ratings

| Area | Status | Notes |
| --- | --- | --- |
| Frontmatter | N/A | No skill frontmatter changed |
| E2E Tests | OK | Smoke test improved; e2e tests untouched and still present |
| Skill Body | N/A | No skill doc changed |
| References & Assets | N/A | No reference files changed |
| Repo Hygiene | OK | Scoped to one file, no secrets, no cross-skill refs |

Issues for Manual Review

Verify that the coder_eval harness supports the regex operator on json_check assertions — the test assumes it does. If it only supports contains / equals, this will fail at runtime.
The 3-rep local validation is promising but a small sample; CI run will be the real confirmation.

Conclusion

Clean, well-motivated fix. The core change (substring → canonical vocabulary + regex) is correct and directly addresses the CI flake. The two Low-severity notes (pattern count wording, compliance-checkpoint naming) are polish items that can land in this PR or the follow-up. No blocking issues.

Replace 10 brittle `json_check.contains` substring assertions across four HITL smoke tasks (smoke_01_explicit, smoke_02_approval_gate, smoke_04_writeback, smoke_05_compliance) with strict canonical-name checks aligned with `references/hitl-patterns.md`.

Each task's prompt now lists the same 6 canonical pattern machine names and requires the agent to emit one verbatim. Each task's `pattern` criterion uses `equals` (single canonical answer) or `regex` (multiple defensible canonical answers per the skill doc):

- smoke_01_explicit: `equals approval-gate` -> `regex (approval-gate|write-back-validation)`; the "approve before write" scenario documents both as applicable.
- smoke_02_approval_gate: `equals approval-gate`; single canonical fit, manager-must-approve language.
- smoke_04_writeback: `regex (write-back-validation|data-enrichment|agentic-output-review)`; AI-enriches-then-writes-to-SAP scenario matches three patterns in the doc.
- smoke_05_compliance: `regex (compliance-checkpoint|approval-gate)`; GDPR scenario. "Regulatory sign-off" steers toward compliance-checkpoint, but "sign off" alone is also an approval-gate signal phrase. Prompt also strengthened with explicit "regulatory compliance" framing and a GDPR Article 17 reference.

Companion to #555 (smoke_03_escalation), which migrated the same anti-pattern in a separate, focused PR.

Verified: all four tasks pass 3/3 reps each (12/12 SUCCESS at score 1.0) using local coder_eval against this branch.

This PR plus #555 unblocks the matching coder_eval validator (UiPath/coder_eval#212), which hard-fails any task YAML using `json_check.contains` with a `< 8`-char literal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… asks for (#561)

Local 3-rep baseline of `skill-flow-ipe-ceql-where` on current main: 0/3 PASS at score 0.273. Three reps split across two checker-driven failure modes that aren't really agent failures:

1. Connector-key invention (reps 00 + 02): agent wrote `uipath-microsoft-entra` / `uipath-microsoft-entra-id`, derived from the current product name "Microsoft Entra ID". The registered connector key is the legacy `uipath-microsoft-azureactivedirectory`, and nothing in the prompt or skill SKILL.md surfaces that.
2. Empty inputs.detail (rep 01): agent built a structurally-valid flow with `inputs: {}` and ran `uip maestro flow validate`, which returned "Status: Valid". The CLI itself does not require inputs.detail for a connector node; only this test's checker did. Since the prompt explicitly forbids `uip flow node configure` (no live tenant), populating inputs.detail requires the agent to reverse-engineer what the CLI would emit, and that's not what the prompt asks for.

Fix follows the same playbook as #555 (smoke_03_escalation): give the agent the canonical vocabulary and grade the artifacts the prompt actually asks for.

Prompt updates:

- Names the registered connector key verbatim (`uipath-microsoft-azureactivedirectory`) and warns that display names like "Microsoft Entra" / "Microsoft Entra ID" are NOT registry keys.
- Points the agent at the canonical filter-tree shape: section "Filter Trees (CEQL)" in `skills/uipath-platform/references/integration-service/activities.md` (added by #492) and Step 6a in the Maestro Flow connector plugin's impl.md (which cross-references it).
- Enumerates the required `where_detail.json` shape inline: groupOperator + filters[].id + PascalCase operator + WorkflowValue-wrapped value.

Checker updates:

- Validates `where_detail.json`'s `filter` against the canonical shape (groupOperator numeric, filters[] non-empty, leaves carry PascalCase operator + id referencing displayName + value 'active'). This is the artifact the prompt asks the agent to plan.
- Validates the .flow file at the structural level only: registered connector key present + List Groups operation referenced + Decision + Terminate nodes. No more inputs.detail / queryParameters.where / configuration / =jsonString:... reverse-engineering; those are the CLI's job, not the agent's.
- Drops the `inputs.detail` / =js: expression form path (the CLI's own validator accepts `inputs: {}`, so we should too).

Verified locally on this branch with 3 fresh reps: 3/3 SUCCESS at score 1.0 (durations 298s / 602s / 346s, vs baseline 835s / 402s / 1086s; the canonical-key directive shaves ~45% off the agent's trial-and-error loops on connector picking and validator schema). All three reps converged on the canonical filter-tree shape verbatim.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
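The shape checks listed for `where_detail.json` in the commit message above could look roughly like the sketch below. This is not the actual checker from #561. The function name `check_filter`, the PascalCase regex, the substring test on `id`, and the way the value is inspected are all assumptions, because the commit message names the fields but does not show the WorkflowValue wrapper or the checker code.

```python
import json
import re

# Assumed reading of "PascalCase": starts with an uppercase letter, alphanumeric only.
PASCAL_CASE = re.compile(r"^[A-Z][A-Za-z0-9]*$")


def check_filter(filter_tree: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks canonical."""
    problems = []

    group_op = filter_tree.get("groupOperator")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(group_op, bool) or not isinstance(group_op, (int, float)):
        problems.append("groupOperator must be numeric")

    leaves = filter_tree.get("filters")
    if not isinstance(leaves, list) or not leaves:
        problems.append("filters must be a non-empty list")
        return problems

    for i, leaf in enumerate(leaves):
        if not isinstance(leaf, dict):
            problems.append(f"filters[{i}] must be an object")
            continue
        if not PASCAL_CASE.match(str(leaf.get("operator", ""))):
            problems.append(f"filters[{i}].operator must be PascalCase")
        if "displayName" not in str(leaf.get("id", "")):
            problems.append(f"filters[{i}].id must reference displayName")
        # The wrapper shape is not shown in the PR, so this sketch only
        # searches the serialized value for the expected literal.
        if "active" not in json.dumps(leaf.get("value")):
            problems.append(f"filters[{i}].value must carry 'active'")
    return problems
```

Returning a list of problems rather than a single boolean is a design choice for this sketch: it lets a failing rep report every shape violation at once, which fits the goal of grading the artifact the prompt asks for.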

tmatup requested a review from dushyant-uipath as a code owner May 4, 2026 22:40

tmatup mentioned this pull request May 4, 2026

test(hitl): migrate 4 HITL smoke tasks to canonical pattern vocabulary #556

Closed

3 tasks

tmatup changed the title ~~fix(hitl): rewrite skill-hitl-smoke-escalation around canonical pattern vocabulary~~ test(hitl): migrate HITL smoke tasks to canonical pattern vocabulary May 4, 2026

rockymadden approved these changes May 4, 2026

View reviewed changes

tmatup mentioned this pull request May 4, 2026

fix(maestro-flow): grade CEQL where test against artifacts the prompt asks for #561

Merged

dushyant-uipath closed this May 5, 2026

tmatup reopened this May 5, 2026

dushyant-uipath approved these changes May 5, 2026

View reviewed changes

tmatup merged commit db92cbc into main May 5, 2026
10 checks passed

tmatup deleted the fix/skill-hitl-smoke-escalation-canonical-vocab branch May 5, 2026 04:31

tmatup merged 2 commits into main from fix/skill-hitl-smoke-escalation-canonical-vocab
