test(hitl): migrate 4 HITL smoke tasks to canonical pattern vocabulary #556
Replace 10 brittle `json_check.contains` substring assertions across four HITL smoke tasks (smoke_01_explicit, smoke_02_approval_gate, smoke_04_writeback, smoke_05_compliance) with strict canonical-name checks aligned with `references/hitl-patterns.md`.

Each task's prompt now lists the same 6 canonical pattern machine names and requires the agent to emit one verbatim. Each task's `pattern` criterion uses `equals` (single canonical answer) or `regex` (multiple defensible canonical answers per the skill doc):

- smoke_01_explicit: `equals approval-gate` -> `regex (approval-gate|write-back-validation)`. The "approve before write" scenario documents both as applicable.
- smoke_02_approval_gate: `equals approval-gate`. Single canonical fit, manager-must-approve language.
- smoke_04_writeback: `regex (write-back-validation|data-enrichment|agentic-output-review)`. The AI-enriches-then-writes-to-SAP scenario matches three patterns in the doc.
- smoke_05_compliance: `regex (compliance-checkpoint|approval-gate)`. GDPR scenario; "regulatory sign-off" steers toward compliance-checkpoint, but "sign off" alone is also an approval-gate signal phrase. Prompt also strengthened with explicit "regulatory compliance" framing and a GDPR Article 17 reference.

Companion to #555 (smoke_03_escalation), which migrated the same anti-pattern in a separate, focused PR.

Verified: all four tasks pass 3/3 reps each (12/12 SUCCESS at score 1.0) using local coder_eval against this branch.

This PR plus #555 unblocks the matching coder_eval validator (UiPath/coder_eval#212), which hard-fails any task YAML using `json_check.contains` with a `< 8`-char literal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
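To make the difference between the old and new criteria concrete, here is a minimal Python sketch of the three operators discussed above. It is an illustration only: `check_pattern` is a hypothetical helper, not coder_eval's implementation, and it assumes the `regex` operator must match the whole answer (the anchored form `^(...)$` is the one quoted in the PR description).

```python
import re


def check_pattern(operator: str, expected: str, actual: str) -> bool:
    """Hypothetical stand-in for a `pattern` criterion; not coder_eval code."""
    if operator == "contains":
        # Old style: passes only if the fragment appears somewhere in the answer.
        return expected in actual
    if operator == "equals":
        # New style, single canonical answer: exact string match.
        return actual == expected
    if operator == "regex":
        # New style, several canonical answers: the whole answer must match.
        return re.fullmatch(expected, actual) is not None
    raise ValueError(f"unknown operator: {operator}")


# Anchored alternation as listed for smoke_01_explicit in the PR description.
SMOKE_01_REGEX = r"^(approval-gate|write-back-validation)$"

# A semantically reasonable free-form answer that misses the old fragment.
free_form_answer = "A manager must sign off first"
old_style_passes = check_pattern("contains", "pprov", free_form_answer)

# Canonical answers under the new criteria.
new_style_passes = check_pattern("regex", SMOKE_01_REGEX, "write-back-validation")
```

In this sketch `old_style_passes` is `False` even though the answer describes an approval step, while any of the listed canonical names passes the `regex` check and nothing else does.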
Claude finished @tmatup's task in 2m 34s

PR Review: test(hitl): migrate 4 HITL smoke tasks to canonical pattern vocabulary

Summary

Replaces 10 brittle …

Change-by-Change Review

1. …

| Area | Status | Notes |
|---|---|---|
| Frontmatter | N/A | No skill frontmatter changed — test task files only |
| E2E Tests | OK | Smoke tasks correctly migrated, regex patterns syntactically valid |
| Skill Body | N/A | No skill doc changes |
| References & Assets | OK | hitl-patterns.md verified — canonical names match section titles |
| Repo Hygiene | OK | No secrets, no cross-skill refs, changes scoped to 4 test files |
Issues for Manual Review
- Confirm that the coder_eval runner supports the `regex` operator for `json_check` assertions. The PR description and 12/12 local test results imply it does, but this can't be verified from this repo alone.
- Confirm merge ordering: this PR + #555 ("test(hitl): migrate HITL smoke tasks to canonical pattern vocabulary") must land before UiPath/coder_eval#212 (which hard-fails short-literal `contains`).
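
For context on the merge-ordering item, the short-literal rule can be pictured roughly as below. This is a sketch under stated assumptions: the criterion dict shape (`operator`, `expected`) and the function name are hypothetical and are not taken from UiPath/coder_eval#212; only the `contains` operator and the 8-character threshold come from the PR description.

```python
MIN_LITERAL_LEN = 8  # threshold quoted in the PR description ("< 8-char literal")


def short_contains_violations(criteria: list[dict]) -> list[dict]:
    """Return criteria that use `contains` with a literal shorter than the threshold.

    The dict keys used here are assumptions made for illustration only.
    """
    return [
        criterion
        for criterion in criteria
        if criterion.get("operator") == "contains"
        and len(str(criterion.get("expected", ""))) < MIN_LITERAL_LEN
    ]


# One pre-migration style criterion and one post-migration style criterion.
example_criteria = [
    {"field": "pattern", "operator": "contains", "expected": "pprov"},
    {"field": "pattern", "operator": "equals", "expected": "approval-gate"},
]
violations = short_contains_violations(example_criteria)
```

Under these assumptions only the `contains "pprov"` criterion is flagged, which is why unmigrated tasks would fail to load once such a validator is active.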
Conclusion
Clean PR. The migration from substring fragments to canonical pattern vocabulary is well-motivated, correctly implemented, and consistent across all 4 files. The only finding is a Low severity factual inaccuracy in the prompts ("six" vs. 10 patterns in the referenced file) that doesn't affect test outcomes but could be tightened. Approve.
Superseded by #555: its commit was cherry-picked into that PR for a unified history. Closing.
Summary
Migrates 10 brittle `json_check.contains` substring assertions across four HITL smoke tasks to strict canonical-name checks aligned with `references/hitl-patterns.md`. Companion to #555 (`smoke_03_escalation`).

| Task | Before | After |
|---|---|---|
| `smoke_01_explicit` | `pattern contains pprov \| rite \| alid` (3 fragments, OR @ 0.33) | `pattern regex ^(approval-gate\|write-back-validation)$` |
| `smoke_02_approval_gate` | `pattern contains pprov` | `pattern equals approval-gate` |
| `smoke_04_writeback` | `pattern contains rite \| alid \| nrich` (3 fragments, OR @ 0.33) | `pattern regex ^(write-back-validation\|data-enrichment\|agentic-output-review)$` |
| `smoke_05_compliance` | `pattern contains udit \| omplianc \| ign \| pprov` (4 fragments, OR @ 0.25) | `pattern regex ^(compliance-checkpoint\|approval-gate)$` (+ prompt strengthened with explicit "regulatory compliance" framing and GDPR Article 17 reference) |

Each task's prompt now lists the same 6 canonical pattern machine names mirroring the section titles in `references/hitl-patterns.md`:

`approval-gate`, `exception-escalation`, `data-enrichment`, `compliance-checkpoint`, `write-back-validation`, `agentic-output-review`

The agent is required to emit exactly one of those names verbatim. Where the scenario maps to multiple defensible canonical answers per the skill doc, the criterion uses `regex` and accepts each documented option (mirroring the approach already used in #555 for `exception-escalation` vs `agentic-output-review`).

Why
The original assertions used 4–6-character substrings of the canonical pattern names (e.g. `expected: "scalat"` for "escalation", `expected: "udit"` for "audit", `expected: "ign"` for "sign-off"). When the agent's free-form English answer didn't happen to contain that exact substring fragment, the smoke task failed, even when the answer was semantically correct. Local triage of run `2026-05-04_04-05-27` traced one such failure to `expected: "scalat"`.

Replacing the substring trap with a controlled vocabulary:

- … `references/hitl-patterns.md` (the prompt names the file) and pick a documented pattern.
- … #555 did for `smoke_03_escalation`.

Test plan
- `coder-eval run --repeats 3 --max-parallel 6` against this branch: 12/12 SUCCESS at score 1.0 across the four migrated tasks.
- … `json_check.contains` violators with `< 8`-char literals in `tests/tasks/` after this PR + #555 ("test(hitl): migrate HITL smoke tasks to canonical pattern vocabulary") land.

Merge ordering
This PR plus #555 are the prerequisites for the matching coder_eval validator (UiPath/coder_eval#212), which hard-fails any task YAML using `json_check.contains` with a `< 8`-char literal at task-load time. Land both #555 and this PR before merging UiPath/coder_eval#212.

🤖 Generated with Claude Code