test(hitl): migrate 4 HITL smoke tasks to canonical pattern vocabulary#556

Closed
tmatup wants to merge 1 commit into main from chore/migrate-canonical-criteria

tmatup commented May 4, 2026

Summary

Migrates 10 brittle json_check.contains substring assertions across four HITL smoke tasks to strict canonical-name checks aligned with references/hitl-patterns.md. Companion to #555 (smoke_03_escalation).

| Task | Before | After |
| --- | --- | --- |
| smoke_01_explicit | pattern contains `pprov` \| `rite` \| `alid` (3 fragments, OR @ 0.33) | pattern regex `^(approval-gate\|write-back-validation)$` |
| smoke_02_approval_gate | pattern contains `pprov` | pattern equals `approval-gate` |
| smoke_04_writeback | pattern contains `rite` \| `alid` \| `nrich` (3 fragments, OR @ 0.33) | pattern regex `^(write-back-validation\|data-enrichment\|agentic-output-review)$` |
| smoke_05_compliance | pattern contains `udit` \| `omplianc` \| `ign` \| `pprov` (4 fragments, OR @ 0.25) | pattern regex `^(compliance-checkpoint\|approval-gate)$` (+ prompt strengthened with explicit "regulatory compliance" framing and GDPR Article 17 reference) |

Each task's prompt now lists the same 6 canonical pattern machine names mirroring the section titles in references/hitl-patterns.md:

  • approval-gate, exception-escalation, data-enrichment, compliance-checkpoint, write-back-validation, agentic-output-review

The agent is required to emit exactly one of those names verbatim. Where the scenario maps to multiple defensible canonical answers per the skill doc, the criterion uses regex and accepts each documented option (mirroring the approach already used in #555 for exception-escalation vs agentic-output-review).
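
As a rough illustration of the two criterion styles, the Python sketch below checks an emitted pattern name against an `equals` target and against an anchored regex. The helpers `check_equals` and `check_regex` are hypothetical stand-ins written for this example, not the coder_eval implementation; only the pattern names and the regexes are taken from this PR.

```python
import re

# Regexes as listed in this PR's Before/After table.
SMOKE_01 = r"^(approval-gate|write-back-validation)$"
SMOKE_05 = r"^(compliance-checkpoint|approval-gate)$"


def check_equals(actual: str, expected: str) -> bool:
    """Hypothetical `equals` criterion: exact string match."""
    return actual == expected


def check_regex(actual: str, pattern: str) -> bool:
    """Hypothetical `regex` criterion: the anchors force a whole-string match."""
    return re.search(pattern, actual) is not None


# A verbatim canonical name satisfies the criterion.
print(check_equals("approval-gate", "approval-gate"))       # True
print(check_regex("write-back-validation", SMOKE_01))       # True

# Free-form prose no longer passes, even if it mentions the right idea.
print(check_regex("I would use an approval gate", SMOKE_01))  # False

# A canonical name that is not documented for the scenario is rejected.
print(check_regex("data-enrichment", SMOKE_05))              # False
```

Because both regexes are anchored with `^` and `$`, any extra words around the name cause a failure, which is the intended strictness.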

Why

The original assertions used 3–8-character substrings of the canonical pattern names and their signal words (e.g. expected: "scalat" for "escalation", expected: "udit" for "audit", expected: "ign" for "sign-off"). When the agent's free-form English answer didn't happen to contain that exact substring fragment, the smoke task failed, even when the answer was semantically correct. Local triage of run 2026-05-04_04-05-27 traced one such failure to expected: "scalat".

Replacing the substring trap with a controlled vocabulary:

  • Decouples the test from agent paraphrase variance.
  • Forces the agent to consult references/hitl-patterns.md (the prompt names the file) and pick a documented pattern.
  • Surfaces test/skill drift loudly: if the skill ever renames a pattern, the test fails immediately rather than slowly drifting into uselessness.
  • Matches what #555 did for smoke_03_escalation.
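
The fragility described above can be reproduced in a few lines of Python. This is a minimal sketch under stated assumptions: `contains` and `passes` are hypothetical helpers, the check is assumed to be case-sensitive, and the score is assumed to be the fraction of matching fragments compared against `pass_threshold`. The two sample answers are invented for illustration.

```python
def contains(answer: str, fragment: str) -> bool:
    """Hypothetical `contains` criterion: plain case-sensitive substring test."""
    return fragment in answer


def passes(answer: str, fragments: list[str], threshold: float) -> bool:
    """Assumed scoring: fraction of matching fragments must reach the threshold."""
    hits = sum(contains(answer, f) for f in fragments)
    return hits / len(fragments) >= threshold


# False negative: a reasonable paraphrase that never uses the word "escalation".
paraphrase = "hand the case to a senior reviewer when confidence is low"
print(contains(paraphrase, "scalat"))  # False

# False positive: an unrelated answer passes the old smoke_05 check because
# "ignore" happens to contain the fragment "ign" (1 of 4 fragments = 0.25).
unrelated = "no review needed, ignore the flag"
print(passes(unrelated, ["udit", "omplianc", "ign", "pprov"], 0.25))  # True
```

Both failure modes disappear once the expected value is a full canonical name that the agent must emit verbatim.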

Test plan

Merge ordering

This PR plus #555 are the prerequisites for the matching coder_eval validator (UiPath/coder_eval#212), which hard-fails at task-load time any task YAML that uses json_check.contains with a literal shorter than 8 characters. Land both #555 and this PR before merging UiPath/coder_eval#212.
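
To make the merge-ordering constraint concrete, here is a hedged Python sketch of what such a load-time check could look like. It is not the code from UiPath/coder_eval#212: the criterion shape (dicts with `operator` and `expected` keys), the `TaskValidationError` class, and the `validate_criteria` function are all assumptions made for this example. Only the rule itself (reject `contains` literals shorter than 8 characters) comes from this PR.

```python
MIN_LITERAL_LEN = 8  # threshold stated in this PR


class TaskValidationError(ValueError):
    """Hypothetical error raised when a task definition is rejected."""


def validate_criteria(task_name: str, criteria: list[dict]) -> None:
    """Reject any `contains` criterion whose literal is shorter than 8 chars."""
    for criterion in criteria:
        if criterion.get("operator") != "contains":
            continue
        literal = str(criterion.get("expected", ""))
        if len(literal) < MIN_LITERAL_LEN:
            raise TaskValidationError(
                f"{task_name}: contains literal {literal!r} has "
                f"{len(literal)} chars, minimum is {MIN_LITERAL_LEN}"
            )


# An un-migrated task would fail to load under this rule...
try:
    validate_criteria(
        "smoke_02_approval_gate",
        [{"operator": "contains", "expected": "pprov"}],
    )
except TaskValidationError as err:
    print(err)

# ...while the migrated criterion loads without error.
validate_criteria(
    "smoke_02_approval_gate",
    [{"operator": "equals", "expected": "approval-gate"}],
)
```

Under a rule like this, any task still carrying a short fragment would stop loading as soon as the validator merges, which is why the migrations need to land first.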

🤖 Generated with Claude Code

Replace 10 brittle `json_check.contains` substring assertions across
four HITL smoke tasks (smoke_01_explicit, smoke_02_approval_gate,
smoke_04_writeback, smoke_05_compliance) with strict canonical-name
checks aligned with `references/hitl-patterns.md`.

Each task's prompt now lists the same 6 canonical pattern machine
names and requires the agent to emit one verbatim. Each task's
`pattern` criterion uses `equals` (single canonical answer) or
`regex` (multiple defensible canonical answers per the skill doc):

- smoke_01_explicit: `equals approval-gate` -> `regex (approval-gate
  |write-back-validation)` — the "approve before write" scenario
  documents both as applicable.
- smoke_02_approval_gate: `equals approval-gate` — single canonical
  fit, manager-must-approve language.
- smoke_04_writeback: `regex (write-back-validation|data-enrichment
  |agentic-output-review)` — AI-enriches-then-writes-to-SAP scenario
  matches three patterns in the doc.
- smoke_05_compliance: `regex (compliance-checkpoint|approval-gate)`
  — GDPR scenario; "regulatory sign-off" steers toward
  compliance-checkpoint, but "sign off" alone is also an approval-gate
  signal phrase. Prompt also strengthened with explicit "regulatory
  compliance" framing and a GDPR Article 17 reference.

Companion to #555 (smoke_03_escalation), which migrated
the same anti-pattern in a separate, focused PR.

Verified: all four tasks pass 3/3 reps each (12/12 SUCCESS at score
1.0) using local coder_eval against this branch.

This PR plus #555 unblocks the matching coder_eval validator
(UiPath/coder_eval#212), which hard-fails any task YAML using
`json_check.contains` with a `< 8`-char literal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tmatup requested a review from dushyant-uipath as a code owner May 4, 2026 22:53
github-actions Bot commented May 4, 2026

Claude finished @tmatup's task in 2m 34s


PR Review: test(hitl): migrate 4 HITL smoke tasks to canonical pattern vocabulary

  • Read review criteria and project conventions
  • Read full diff and changed files
  • Validate canonical names against references/hitl-patterns.md
  • Check for remaining short-literal contains assertions
  • Check for secrets, cross-skill refs, repo hygiene
  • Post review

Summary

Replaces 10 brittle json_check.contains substring assertions (e.g. expected: "pprov", expected: "udit") across 4 HITL smoke tasks with strict regex/equals checks against canonical pattern machine names. Each prompt now explicitly lists the 6 accepted pattern names and instructs the agent to emit one verbatim. Well-motivated by real test failures traced to substring fragility.

Change-by-Change Review

1. tests/tasks/uipath-human-in-the-loop/smoke_01_explicit.yaml

Severity: OK

Prompt upgraded with canonical vocabulary block. Three contains assertions ("pprov", "rite", "alid" at pass_threshold: 0.33) replaced with a single regex assertion: ^(approval-gate|write-back-validation)$ at pass_threshold: 1.0. Both patterns are defensible for an "approve before writing" scenario per hitl-patterns.md. Clean.

2. tests/tasks/uipath-human-in-the-loop/smoke_02_approval_gate.yaml

Severity: OK

Prompt upgraded with canonical vocabulary block. Single contains "pprov" replaced with equals "approval-gate". This is the strictest task — only one valid answer, which is correct since the scenario is unambiguously an approval gate. Clean.

3. tests/tasks/uipath-human-in-the-loop/smoke_04_writeback.yaml

Severity: OK

Prompt upgraded with canonical vocabulary block. Three contains assertions ("rite", "alid", "nrich" at pass_threshold: 0.33) replaced with regex: ^(write-back-validation|data-enrichment|agentic-output-review)$. All three patterns are documented as applicable for "AI enriches data, writes it back to a system of record." Clean.

4. tests/tasks/uipath-human-in-the-loop/smoke_05_compliance.yaml

Severity: OK

Prompt strengthened with explicit "regulatory compliance" framing and GDPR Article 17 reference. Four contains assertions ("udit", "omplianc", "ign", "pprov" at pass_threshold: 0.25) replaced with regex: ^(compliance-checkpoint|approval-gate)$. Both patterns are defensible per hitl-patterns.md — compliance-checkpoint is the primary fit, approval-gate covers the "sign off" signal overlap. Clean.

5. Cross-cutting: Canonical vocabulary block duplication

Severity: Low

The 10-line "Use EXACTLY one of these machine names..." block is copy-pasted identically across all 4 files. This is expected and correct — each task YAML must be self-contained for the coder_eval runner. Not a real duplication issue.

6. Cross-cutting: Prompt claims "six canonical patterns" but hitl-patterns.md has 10 sections

Severity: Low

Each prompt says "The skill's references/hitl-patterns.md enumerates six canonical patterns" — but the file actually has 10 ###-level pattern sections (the 6 core patterns plus 4 industry-specific ones: IT/change-management, HR/contract, customer-communication, financial-transaction). The agent gets the explicit list of 6 names so this doesn't cause test failures (12/12 confirmed), but the claim is factually inaccurate when the agent reads the file. If the agent notices the discrepancy, it could hesitate.

Consider either:

  • Changing "six" to "six core" in the prompts, or
  • Adding a --- separator or heading in hitl-patterns.md between the core and industry-specific patterns to make the distinction clearer
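
A small script can make this kind of discrepancy visible before a reviewer has to spot it. The Python sketch below extracts `###` headings from a markdown string, slugifies them, and compares the result with the six canonical names. It is an illustration only: `SAMPLE_DOC` is invented text, not the contents of hitl-patterns.md, and the title-case heading format and the `heading_slugs` helper are assumptions.

```python
import re

CANONICAL = {
    "approval-gate",
    "exception-escalation",
    "data-enrichment",
    "compliance-checkpoint",
    "write-back-validation",
    "agentic-output-review",
}

# Invented stand-in for the reference file; the real headings may differ.
SAMPLE_DOC = """\
### Approval Gate
### Exception Escalation
### Data Enrichment
### Compliance Checkpoint
### Write-Back Validation
### Agentic Output Review
### Industry Example A
### Industry Example B
"""


def heading_slugs(markdown: str) -> list[str]:
    """Return lowercase, hyphen-separated slugs for every ###-level heading."""
    slugs = []
    for line in markdown.splitlines():
        match = re.match(r"^###\s+(.+?)\s*$", line)
        if match:
            slug = re.sub(r"[^a-z0-9]+", "-", match.group(1).lower()).strip("-")
            slugs.append(slug)
    return slugs


slugs = heading_slugs(SAMPLE_DOC)
missing = sorted(CANONICAL - set(slugs))              # canonical names with no section
extra = [s for s in slugs if s not in CANONICAL]      # sections outside the vocabulary

print(len(slugs))  # 8
print(missing)     # []
print(extra)       # ['industry-example-a', 'industry-example-b']
```

A non-empty `extra` list is the signal that a prompt claiming a fixed number of patterns no longer matches the section count in the file.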

What's Missing

Area Ratings

| Area | Status | Notes |
| --- | --- | --- |
| Frontmatter | N/A | No skill frontmatter changed; test task files only |
| E2E Tests | OK | Smoke tasks correctly migrated, regex patterns syntactically valid |
| Skill Body | N/A | No skill doc changes |
| References & Assets | OK | hitl-patterns.md verified; canonical names match section titles |
| Repo Hygiene | OK | No secrets, no cross-skill refs, changes scoped to 4 test files |

Issues for Manual Review

  • Confirm that the coder_eval runner supports the regex operator for json_check assertions. The PR description and 12/12 local test results imply it does, but this can't be verified from this repo alone.
  • Confirm merge ordering: this PR and #555 must land before UiPath/coder_eval#212 (which hard-fails short-literal contains).

Conclusion

Clean PR. The migration from substring fragments to canonical pattern vocabulary is well-motivated, correctly implemented, and consistent across all 4 files. The only actionable finding is a Low-severity factual inaccuracy in the prompts ("six" vs. 10 patterns in the referenced file) that doesn't affect test outcomes but could be tightened. Approve.

tmatup commented May 4, 2026

Superseded by #555 — its commit was cherry-picked into that PR for a unified history. Closing.

tmatup closed this May 4, 2026
tmatup deleted the chore/migrate-canonical-criteria branch May 4, 2026 23:09