test(hitl): migrate HITL smoke tasks to canonical pattern vocabulary #555

Merged
tmatup merged 2 commits into main from fix/skill-hitl-smoke-escalation-canonical-vocab
May 5, 2026

tmatup commented May 4, 2026

Summary

Replaces 12 brittle json_check.contains substring assertions across the five HITL smoke tasks with strict canonical-name checks aligned with references/hitl-patterns.md.

| Task | Before | After |
| --- | --- | --- |
| smoke_01_explicit | `pattern` contains `pprov` \| `rite` \| `alid` (3 fragments, OR @ 0.33) | `pattern` regex `^(approval-gate\|write-back-validation)$` |
| smoke_02_approval_gate | `pattern` contains `pprov` | `pattern` equals `approval-gate` |
| smoke_03_escalation | `pattern` contains `scalat`, `insertion_point` contains `lassif` | `pattern` regex `^(exception-escalation\|agentic-output-review)$`, `insertion_point` regex `(?i)(classif\|confidence\|routing\|...)` |
| smoke_04_writeback | `pattern` contains `rite` \| `alid` \| `nrich` (3 fragments, OR @ 0.33) | `pattern` regex `^(write-back-validation\|data-enrichment\|agentic-output-review)$` |
| smoke_05_compliance | `pattern` contains `udit` \| `omplianc` \| `ign` \| `pprov` (4 fragments, OR @ 0.25) | `pattern` regex `^(compliance-checkpoint\|approval-gate)$` (+ prompt strengthened with explicit "regulatory compliance" framing and GDPR Article 17 reference) |

Each task's prompt now lists the same 6 canonical pattern machine names mirroring the section titles in references/hitl-patterns.md:

  • approval-gate, exception-escalation, data-enrichment, compliance-checkpoint, write-back-validation, agentic-output-review

The agent is required to emit exactly one of those names verbatim. Where the scenario maps to multiple defensible canonical answers per the skill doc, the criterion uses regex and accepts each documented option.

Why

The original assertions used 4–6-character substrings of the canonical pattern names (e.g. expected: "scalat" for "escalation", expected: "udit" for "audit", expected: "ign" for "sign-off"). When the agent's free-form English answer didn't happen to contain that exact substring fragment, the smoke task failed — even when the answer was semantically correct. Local triage of run 2026-05-04_04-05-27 (and a 9-rep cross-grid across PR #517 / #545 / #546 worktrees) traced one such failure to expected: "scalat": smoke_03_escalation passed 8/9 locally but the failing rep emitted pattern: "confidence-gated human review" — a perfectly correct semantic answer that just didn't contain the literal substring.

Replacing the substring trap with a controlled vocabulary:

  • Decouples the test from agent-paraphrase variance.
  • Forces the agent to consult references/hitl-patterns.md (the prompt names the file) and pick a documented pattern.
  • Surfaces test/skill drift loudly: if the skill ever renames a pattern, the test fails immediately rather than slowly drifting into uselessness.
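To make the difference concrete, here is a minimal Python sketch comparing the old substring check with the new anchored check for smoke_03_escalation. The function names are hypothetical, and the sketch assumes the harness evaluates `regex` criteria with search semantics; the fragment, the regex, and the failing answer are the ones quoted in this description.

```python
import re

# The six canonical machine names listed in each task prompt.
CANONICAL = [
    "approval-gate",
    "exception-escalation",
    "data-enrichment",
    "compliance-checkpoint",
    "write-back-validation",
    "agentic-output-review",
]

def old_check(pattern: str) -> bool:
    """Old style: pass if the 6-character fragment appears anywhere."""
    return "scalat" in pattern

# New style: anchored regex that accepts only the documented names.
NEW_CHECK = re.compile(r"^(exception-escalation|agentic-output-review)$")

def new_check(pattern: str) -> bool:
    """Assumed search semantics; the anchors make the match exact."""
    return NEW_CHECK.search(pattern) is not None

# The paraphrase from the failing rep misses the old fragment.
print(old_check("confidence-gated human review"))   # False
# The old check also accepts any free text that happens to contain it.
print(old_check("no escalation needed here"))       # True
# The new check accepts exactly the two documented names.
print([name for name in CANONICAL if new_check(name)])
```

The new criterion rejects the paraphrase too; what changes is that the prompt now tells the agent to emit one of the six names verbatim, so a paraphrase is a real contract violation rather than a grading accident.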

Commits

  1. fix(hitl): rewrite smoke_03_escalation around canonical pattern vocabulary — the originally failing test that triggered the investigation.
  2. test(hitl): migrate 4 HITL smoke tasks to canonical pattern vocabulary — the proactive cleanup of the same anti-pattern in the four sibling smoke tasks.

Test plan

  • Local: coder-eval run --repeats 3 against this branch — 3/3 SUCCESS at score 1.0 for smoke_03 (commit 1) and 12/12 SUCCESS at score 1.0 across smoke_01/02/04/05 (commit 2). 15/15 total.
  • Authoritative validator-driven scan of origin/main (post-sync) confirms all 6 failing criteria across these 5 files are covered. After this PR lands: 0 violators repo-wide.
  • CI smoke-skills run (will fire on PR push).

Companion PR

  • UiPath/coder_eval#212 — adds the matching Pydantic validator that hard-fails any task YAML using json_check.contains with a < 8-char literal at task-load time. Merge this PR first, otherwise smoke-skills CI will go red the moment coder_eval ships the validator.
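The rule that companion validator enforces can be sketched as follows. This is not the code from UiPath/coder_eval#212: the PR describes a Pydantic validator, while this sketch uses a standard-library dataclass to stay dependency-free, and the class and field names (`JsonCheckCriterion`, `path`, `operator`, `expected`) are hypothetical. Only the rule itself (reject `contains` with a literal shorter than 8 characters at load time) comes from the description above.

```python
from dataclasses import dataclass

# Threshold quoted in the PR description: literals under 8 chars are rejected.
MIN_CONTAINS_LITERAL_LEN = 8

@dataclass(frozen=True)
class JsonCheckCriterion:
    """Hypothetical stand-in for one json_check assertion in a task YAML."""
    path: str
    operator: str
    expected: str

    def __post_init__(self) -> None:
        # Fail at construction (task-load) time, not at grading time.
        too_short = len(self.expected) < MIN_CONTAINS_LITERAL_LEN
        if self.operator == "contains" and too_short:
            raise ValueError(
                f"{self.path}: 'contains' with a {len(self.expected)}-char "
                f"literal {self.expected!r} is too brittle; use 'equals' or "
                f"'regex' against a canonical name instead"
            )

# The migrated criterion loads without complaint.
ok = JsonCheckCriterion("pattern", "regex",
                        "^(exception-escalation|agentic-output-review)$")

# The original criterion is rejected as soon as it is loaded.
try:
    JsonCheckCriterion("pattern", "contains", "scalat")
except ValueError as err:
    print(err)
```

Failing at load time is what makes the merge order matter: any task file still carrying a short `contains` literal would break every run, not just its own.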

🤖 Generated with Claude Code

…ulary

The criterion `pattern contains "scalat"` (a 5-char fragment of
"escalation") was a paraphrase trap: when the agent answered with
"confidence-gated review" or any non-"escalation" phrasing, the
substring missed and the smoke task failed. Local 9-rep cross-grid
across 3 PR worktrees confirmed it as a stochastic ~11% flake unrelated
to PR diffs.

Replace the brittle substring with a canonical-vocabulary contract:

- Prompt now lists the 6 canonical HITL pattern machine names from
  `references/hitl-patterns.md` (`approval-gate`, `exception-escalation`,
  `data-enrichment`, `compliance-checkpoint`, `write-back-validation`,
  `agentic-output-review`) and requires the agent to emit one verbatim.
  Names mirror the doc's section titles.
- Prompt borrows the skill's own signal phrases for the scenario
  ("edge cases", "cannot resolve autonomously", "escalate to a
  supervisor") so the test exercises documented recognition rather
  than free-form English.
- `pattern` criterion: `regex ^(exception-escalation|agentic-output-review)$`
  — both names are documented as applicable to a low-confidence-AI
  scenario, so either is accepted (honest about the documented overlap
  rather than overfitting the prompt).
- `insertion_point` criterion: a multi-alternate regex matching
  classifier / confidence / routing / low-confidence variants. The
  `regex` operator makes the pattern-matching nature visible in the
  YAML — no more short-literal substring guessing.

Verified locally with 3 reps × 3 PR-worktree combinations after the
rewrite: 3/3 SUCCESS at score 1.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tmatup tmatup requested a review from dushyant-uipath as a code owner May 4, 2026 22:40
github-actions Bot commented May 4, 2026

Claude finished @tmatup's task in 2m 18s


PR Review: fix(hitl): rewrite smoke_03_escalation around canonical pattern vocabulary

  • Read review criteria and project conventions
  • Read PR diff and full changed file
  • Read referenced skill docs (hitl-patterns.md) for context
  • Check sibling smoke tests for the old anti-pattern
  • Perform review and post feedback

Summary

Rewrites smoke_03_escalation.yaml to constrain the agent to a closed vocabulary of HITL pattern machine names instead of matching a fragile 5-character substring ("scalat"). The prompt now enumerates 6 canonical names and the assertion uses regex with an exact match against the two valid answers. This directly addresses stochastic CI failures.

Change-by-Change Review

1. tests/tasks/uipath-human-in-the-loop/smoke_03_escalation.yaml — prompt rewrite

Severity: OK (with two minor notes)

What's good:

  • Switching from operator: contains / expected: "scalat" to operator: regex / expected: "^(exception-escalation|agentic-output-review)$" eliminates the failure mode where a semantically correct paraphrase misses a substring fragment. This is the right fix.
  • Accepting both exception-escalation and agentic-output-review is well-reasoned — both patterns legitimately apply to a low-confidence-AI scenario per the skill doc, and the description explains why.
  • The insertion_point regex ((?i)(classif|confidence|routing|low.{0,10}confidence|after.{0,30}\\bAI\\b)) is a solid improvement over "lassif" — it covers the semantic space without being overly brittle.
  • The scenario text now uses the skill's own signal phrases ("edge cases", "cannot resolve autonomously", "escalate to a supervisor"), making it a better test of whether the agent recognises documented patterns.
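As a quick illustration of what the `insertion_point` regex quoted above does and does not accept, the following Python sketch runs it against a few sample answers. The sample strings are invented for illustration, and the sketch assumes the doubled backslashes in the quoted form are YAML string escaping, so the effective regex uses `\b`.

```python
import re

# Effective regex after YAML unescaping (assumption: "\\b" decodes to \b).
INSERTION_POINT = re.compile(
    r"(?i)(classif|confidence|routing|low.{0,10}confidence|after.{0,30}\bAI\b)"
)

def matches(text: str) -> bool:
    """Assumed search semantics: the regex may match anywhere in the answer."""
    return INSERTION_POINT.search(text) is not None

# Hypothetical agent answers, not taken from real runs.
samples = [
    "after the classifier scores the ticket",  # hits "classif"
    "Low-Confidence branch",                   # hits "confidence" via (?i)
    "right after the AI step",                 # hits after ... \bAI\b
    "after the email is sent",                 # "ai" inside "email" is not a word
    "before sending the reply",                # no alternative applies
]
for text in samples:
    print(f"{matches(text)!s:5}  {text}")
```

The word boundaries around `AI` are what keep ordinary words such as "email" from satisfying the last alternative.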

Note 1 — pattern count mismatch (Low):
Line 37 says "enumerates six canonical patterns", but references/hitl-patterns.md actually has 10 ###-level sections (6 core + 4 domain-specific: IT/change-management, HR/offer, customer-communication, financial-transaction). The prompt explicitly lists the 6 names so the agent is still properly constrained — but an agent that reads the file will see 10, not 6. Consider saying "six core patterns" or updating the count.

Note 2 — compliance-checkpoint naming (Low):
Five of the six machine names are exact kebab-case conversions of hitl-patterns.md headings. The exception is compliance-checkpoint which truncates "Compliance and audit checkpoint". The prompt claims names "mirror the section titles" (line 36–37). This doesn't affect this test (neither name is in the accepted regex), but when the follow-up PR migrates the other smoke tests, smoke_05_compliance will presumably accept compliance-checkpoint — and the mismatch between that name and the heading could confuse the agent. Worth normalizing now or in that follow-up.
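The mismatch in Note 2 is easy to reproduce with a small Python sketch. The `kebab` helper is hypothetical and assumes "exact kebab-case conversion" means lowercasing the heading and joining its words with hyphens; the heading text is the one quoted above.

```python
import re

def kebab(heading: str) -> str:
    """Lowercase a heading and join its alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", heading.lower()).strip("-")

heading = "Compliance and audit checkpoint"
machine_name = "compliance-checkpoint"

print(kebab(heading))                  # compliance-and-audit-checkpoint
print(kebab(heading) == machine_name)  # False
```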

2. Task metadata (lines 1–6)

Severity: OK
task_id, description, and tags are unchanged and correct. tags follows convention: skill name first, then test type.

What's Missing

  • Sibling smoke test migration (acknowledged): smoke_01, smoke_02, smoke_04, smoke_05 all still use the fragile operator: contains + substring pattern (confirmed via grep). The PR description explicitly calls this out as a follow-up — no issue, just confirming it's real.
  • Nothing else identified.

Area Ratings

| Area | Status | Notes |
| --- | --- | --- |
| Frontmatter | N/A | No skill frontmatter changed |
| E2E Tests | OK | Smoke test improved; e2e tests untouched and still present |
| Skill Body | N/A | No skill doc changed |
| References & Assets | N/A | No reference files changed |
| Repo Hygiene | OK | Scoped to one file, no secrets, no cross-skill refs |

Issues for Manual Review

  • Verify that the coder_eval harness supports the regex operator on json_check assertions — the test assumes it does. If it only supports contains / equals, this will fail at runtime.
  • The 3-rep local validation is promising but a small sample; CI run will be the real confirmation.
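To put a rough number on the small-sample caveat, the Python sketch below asks how often three reps would all pass if the pre-fix failure rate still applied. It takes the 1-in-9 rate from the local cross-grid reported in the PR description and assumes reps are independent, which is a simplification.

```python
from fractions import Fraction

# Pre-fix flake rate observed in the 9-rep cross-grid: 1 failure in 9 reps.
FLAKE = Fraction(1, 9)

def p_all_pass(reps: int, flake: Fraction = FLAKE) -> Fraction:
    """Probability that every rep passes, assuming independent reps."""
    return (1 - flake) ** reps

p3 = p_all_pass(3)
print(p3, float(p3))  # 512/729, roughly 0.70

# Smallest rep count at which an unfixed test would slip through
# with probability below 5%.
reps = 1
while p_all_pass(reps) >= Fraction(1, 20):
    reps += 1
print(reps)  # 26
```

Under these assumptions an unfixed test would still go 3/3 about 70% of the time, so the local run is weak evidence on its own and the CI history carries most of the weight.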

Conclusion

Clean, well-motivated fix. The core change (substring → canonical vocabulary + regex) is correct and directly addresses the CI flake. The two Low-severity notes (pattern count wording, compliance-checkpoint naming) are polish items that can land in this PR or the follow-up. No blocking issues.


Replace 10 brittle `json_check.contains` substring assertions across
four HITL smoke tasks (smoke_01_explicit, smoke_02_approval_gate,
smoke_04_writeback, smoke_05_compliance) with strict canonical-name
checks aligned with `references/hitl-patterns.md`.

Each task's prompt now lists the same 6 canonical pattern machine
names and requires the agent to emit one verbatim. Each task's
`pattern` criterion uses `equals` (single canonical answer) or
`regex` (multiple defensible canonical answers per the skill doc):

- smoke_01_explicit: `equals approval-gate` -> `regex (approval-gate
  |write-back-validation)` — the "approve before write" scenario
  documents both as applicable.
- smoke_02_approval_gate: `equals approval-gate` — single canonical
  fit, manager-must-approve language.
- smoke_04_writeback: `regex (write-back-validation|data-enrichment
  |agentic-output-review)` — AI-enriches-then-writes-to-SAP scenario
  matches three patterns in the doc.
- smoke_05_compliance: `regex (compliance-checkpoint|approval-gate)`
  — GDPR scenario; "regulatory sign-off" steers toward
  compliance-checkpoint, but "sign off" alone is also an approval-gate
  signal phrase. Prompt also strengthened with explicit "regulatory
  compliance" framing and a GDPR Article 17 reference.

Companion to #555 (smoke_03_escalation), which migrated
the same anti-pattern in a separate, focused PR.

Verified: all four tasks pass 3/3 reps each (12/12 SUCCESS at score
1.0) using local coder_eval against this branch.

This PR plus #555 unblocks the matching coder_eval validator
(UiPath/coder_eval#212), which hard-fails any task YAML using
`json_check.contains` with a `< 8`-char literal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tmatup changed the title from "fix(hitl): rewrite skill-hitl-smoke-escalation around canonical pattern vocabulary" to "test(hitl): migrate HITL smoke tasks to canonical pattern vocabulary" May 4, 2026
@tmatup tmatup reopened this May 5, 2026
baishalighosh pushed a commit that referenced this pull request May 5, 2026
… asks for (#561)

Local 3-rep baseline of `skill-flow-ipe-ceql-where` on current main:
0/3 PASS at score 0.273. Three reps split across two checker-driven
failure modes that aren't really agent failures:

1. Connector-key invention (reps 00 + 02) — agent wrote
   `uipath-microsoft-entra` / `uipath-microsoft-entra-id`, derived from
   the current product name "Microsoft Entra ID". The registered
   connector key is the legacy `uipath-microsoft-azureactivedirectory`,
   and nothing in the prompt or skill SKILL.md surfaces that.
2. Empty inputs.detail (rep 01) — agent built a structurally-valid
   flow with `inputs: {}` and ran `uip maestro flow validate`, which
   returned "Status: Valid". The CLI itself does not require
   inputs.detail for a connector node; only this test's checker did.
   Since the prompt explicitly forbids `uip flow node configure` (no
   live tenant), populating inputs.detail requires the agent to
   reverse-engineer what the CLI would emit — that's not what the
   prompt asks for.

Fix follows the same playbook as #555 (smoke_03_escalation): give the
agent the canonical vocabulary and grade the artifacts the prompt
actually asks for.

Prompt updates:
- Names the registered connector key verbatim
  (`uipath-microsoft-azureactivedirectory`) and warns that display
  names like "Microsoft Entra" / "Microsoft Entra ID" are NOT registry
  keys.
- Points the agent at the canonical filter-tree shape: section "Filter
  Trees (CEQL)" in
  `skills/uipath-platform/references/integration-service/activities.md`
  (added by #492) and Step 6a in the Maestro Flow connector plugin's
  impl.md (which cross-references it).
- Enumerates the required `where_detail.json` shape inline:
  groupOperator + filters[].id + PascalCase operator + WorkflowValue-
  wrapped value.

Checker updates:
- Validates `where_detail.json`'s `filter` against the canonical shape
  (groupOperator numeric, filters[] non-empty, leaves carry
  PascalCase operator + id referencing displayName + value 'active').
  This is the artifact the prompt asks the agent to plan.
- Validates the .flow file at the structural level only: registered
  connector key present + List Groups operation referenced + Decision
  + Terminate nodes. No more inputs.detail / queryParameters.where /
  configuration / =jsonString:... reverse-engineering — those are the
  CLI's job, not the agent's.
- Drops the `inputs.detail` / =js: expression form path (the CLI's
  own validator accepts `inputs: {}`, so we should too).
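A shape check along the lines described above could look like the following Python sketch. It is not the checker from #561. The key names (`groupOperator`, `filters`, `id`, `operator`, `value`) follow the commit message, but the exact JSON layout, the `Contains` operator, and the `{"value": ...}` wrapper in the sample are assumptions, so the sketch searches for `active` recursively instead of relying on a particular wrapper.

```python
import re
from typing import Any

PASCAL_CASE = re.compile(r"^[A-Z][A-Za-z0-9]*$")

def contains_string(node: Any, target: str) -> bool:
    """True if `target` appears as a string value anywhere inside `node`."""
    if isinstance(node, str):
        return node == target
    if isinstance(node, dict):
        return any(contains_string(v, target) for v in node.values())
    if isinstance(node, list):
        return any(contains_string(v, target) for v in node)
    return False

def check_filter(node: Any) -> list[str]:
    """Return a list of shape violations; an empty list means the tree is valid."""
    if not isinstance(node, dict):
        return ["filter must be an object"]
    errors: list[str] = []
    group_op = node.get("groupOperator")
    if isinstance(group_op, bool) or not isinstance(group_op, (int, float)):
        errors.append("groupOperator must be numeric")
    filters = node.get("filters")
    if not isinstance(filters, list) or not filters:
        errors.append("filters[] must be a non-empty list")
        return errors
    for leaf in filters:
        if not isinstance(leaf, dict):
            errors.append("each filter must be an object")
        elif "filters" in leaf:  # assumed: a nested group carries its own filters[]
            errors.extend(check_filter(leaf))
        else:
            op = leaf.get("operator")
            if not isinstance(op, str) or not PASCAL_CASE.match(op):
                errors.append(f"operator {op!r} must be PascalCase")
            if "displayName" not in str(leaf.get("id", "")):
                errors.append("leaf id must reference displayName")
            if not contains_string(leaf.get("value"), "active"):
                errors.append("leaf value must carry 'active'")
    return errors

# Hypothetical where_detail.json filter; the wrapper shape is an assumption.
sample = {
    "groupOperator": 0,
    "filters": [
        {"id": "displayName", "operator": "Contains", "value": {"value": "active"}}
    ],
}
print(check_filter(sample))  # []
```

Returning a list of violations rather than a single boolean keeps failures readable when several parts of the tree are wrong at once.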

Verified locally on this branch with 3 fresh reps: 3/3 SUCCESS at
score 1.0 (durations 298s / 602s / 346s, vs baseline 835s / 402s /
1086s — the canonical-key directive shaves ~45% off the agent's
trial-and-error loops on connector picking and validator schema).
All three reps converged on the canonical filter-tree shape verbatim.
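The speed-up figure can be checked directly from the six durations quoted above. This short Python sketch compares total wall-clock time across the three reps, which is one reasonable way to read "~45%" though the commit message does not say which aggregate it used.

```python
# Per-rep durations in seconds, as reported in the commit message.
baseline = [835, 402, 1086]
after = [298, 602, 346]

reduction = 1 - sum(after) / sum(baseline)
print(sum(baseline), sum(after))  # 2323 1246
print(f"{reduction:.1%}")         # 46.4%
```

With only three reps per side, and one post-fix rep (602s) slower than one baseline rep (402s), the figure is best read as indicative.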

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tmatup tmatup merged commit db92cbc into main May 5, 2026
10 checks passed
@tmatup tmatup deleted the fix/skill-hitl-smoke-escalation-canonical-vocab branch May 5, 2026 04:31