…ulary
The criterion `pattern contains "scalat"` (a 6-char fragment of
"escalation") was a paraphrase trap: when the agent answered with
"confidence-gated review" or any non-"escalation" phrasing, the
substring missed and the smoke task failed. Local 9-rep cross-grid
across 3 PR worktrees confirmed it as a stochastic ~11% flake unrelated
to PR diffs.
Replace the brittle substring with a canonical-vocabulary contract:
- Prompt now lists the 6 canonical HITL pattern machine names from
`references/hitl-patterns.md` (`approval-gate`, `exception-escalation`,
`data-enrichment`, `compliance-checkpoint`, `write-back-validation`,
`agentic-output-review`) and requires the agent to emit one verbatim.
Names mirror the doc's section titles.
- Prompt borrows the skill's own signal phrases for the scenario
("edge cases", "cannot resolve autonomously", "escalate to a
supervisor") so the test exercises documented recognition rather
than free-form English.
- `pattern` criterion: `regex ^(exception-escalation|agentic-output-review)$`
— both names are documented as applicable to a low-confidence-AI
scenario, so either is accepted (honest about the documented overlap
rather than overfitting the prompt).
- `insertion_point` criterion: a multi-alternate regex matching
classifier / confidence / routing / low-confidence variants. The
`regex` operator makes the pattern-matching nature visible in the
YAML — no more short-literal substring guessing.
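
To make the contrast concrete, here is a minimal Python sketch of the two checks. It does not use coder_eval: `old_check` and `new_check` are hypothetical stand-ins for the `contains` and `regex` criteria, and the regex is assumed to be applied to the whole `pattern` value.

```python
import re

# Anchored alternation copied from the new `pattern` criterion.
PATTERN_RE = re.compile(r"^(exception-escalation|agentic-output-review)$")


def old_check(answer: str) -> bool:
    """Hypothetical stand-in for the old `contains "scalat"` criterion."""
    return "scalat" in answer


def new_check(answer: str) -> bool:
    """Hypothetical stand-in for the new anchored `regex` criterion."""
    return PATTERN_RE.search(answer) is not None


paraphrase = "confidence-gated review"      # free-form answer that tripped the old test
canonical = "exception-escalation"          # verbatim machine name from the prompt
loose = "escalation to a supervisor"        # free-form English that happens to match

old_results = [old_check(a) for a in (paraphrase, canonical, loose)]
new_results = [new_check(a) for a in (paraphrase, canonical, loose)]
```

The old check passes or fails depending on wording, while the new one accepts only the documented names. It relies on the prompt telling the agent to emit one of them verbatim.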
Verified locally with 3 reps × 3 PR-worktree combinations after the
rewrite: 3/3 SUCCESS at score 1.0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude finished @tmatup's task in 2m 18s

PR Review: fix(hitl): rewrite smoke_03_escalation around canonical pattern vocabulary

Summary
Rewrites …

Change-by-Change Review
1. …

| Area | Status | Notes |
|---|---|---|
| Frontmatter | N/A | No skill frontmatter changed |
| E2E Tests | OK | Smoke test improved; e2e tests untouched and still present |
| Skill Body | N/A | No skill doc changed |
| References & Assets | N/A | No reference files changed |
| Repo Hygiene | OK | Scoped to one file, no secrets, no cross-skill refs |
Issues for Manual Review
- Verify that the coder_eval harness supports the `regex` operator on `json_check` assertions; the test assumes it does. If it only supports `contains`/`equals`, this will fail at runtime.
- The 3-rep local validation is promising but a small sample; the CI run will be the real confirmation.
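
The first point can be illustrated with a small sketch of an operator dispatch. This is not coder_eval's implementation: the `OPERATORS` table and `evaluate` function are hypothetical, and they only show how an unregistered operator name would surface as a runtime error rather than a failed assertion.

```python
import re
from typing import Callable, Dict

# Hypothetical operator table; the real coder_eval operator set is not shown in this PR.
OPERATORS: Dict[str, Callable[[str, str], bool]] = {
    "contains": lambda actual, expected: expected in actual,
    "equals": lambda actual, expected: actual == expected,
    "regex": lambda actual, expected: re.search(expected, actual) is not None,
}


def evaluate(operator: str, actual: str, expected: str) -> bool:
    """Apply one assertion, raising if the operator is not registered."""
    check = OPERATORS.get(operator)
    if check is None:
        raise ValueError(f"unsupported json_check operator: {operator!r}")
    return check(actual, expected)


def is_supported(operator: str) -> bool:
    """Cheap pre-flight probe that a task loader could run before any agent work."""
    return operator in OPERATORS
```

Running a probe like `is_supported("regex")` against the real harness before merging would settle the question without waiting for a full CI run.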
Conclusion
Clean, well-motivated fix. The core change (substring → canonical vocabulary + regex) is correct and directly addresses the CI flake. The two Low-severity notes (pattern count wording, compliance-checkpoint naming) are polish items that can land in this PR or the follow-up. No blocking issues.
Replace 10 brittle `json_check.contains` substring assertions across four HITL smoke tasks (smoke_01_explicit, smoke_02_approval_gate, smoke_04_writeback, smoke_05_compliance) with strict canonical-name checks aligned with `references/hitl-patterns.md`.

Each task's prompt now lists the same 6 canonical pattern machine names and requires the agent to emit one verbatim. Each task's `pattern` criterion uses `equals` (single canonical answer) or `regex` (multiple defensible canonical answers per the skill doc):
- smoke_01_explicit: `equals approval-gate` -> `regex (approval-gate|write-back-validation)`; the "approve before write" scenario documents both as applicable.
- smoke_02_approval_gate: `equals approval-gate`; single canonical fit, manager-must-approve language.
- smoke_04_writeback: `regex (write-back-validation|data-enrichment|agentic-output-review)`; the AI-enriches-then-writes-to-SAP scenario matches three patterns in the doc.
- smoke_05_compliance: `regex (compliance-checkpoint|approval-gate)`; GDPR scenario. "Regulatory sign-off" steers toward compliance-checkpoint, but "sign off" alone is also an approval-gate signal phrase. Prompt also strengthened with explicit "regulatory compliance" framing and a GDPR Article 17 reference.

Companion to #555 (smoke_03_escalation), which migrated the same anti-pattern in a separate, focused PR.

Verified: all four tasks pass 3/3 reps each (12/12 SUCCESS at score 1.0) using local coder_eval against this branch.

This PR plus #555 unblocks the matching coder_eval validator (UiPath/coder_eval#212), which hard-fails any task YAML using `json_check.contains` with a `< 8`-char literal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
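
The rule described for the validator can be sketched as a small lint. This is a guess at its shape, not the code in UiPath/coder_eval#212: the criterion layout (dicts with `type`, `operator`, and `expected` keys) and the function name are assumptions made for illustration.

```python
from typing import Any, Dict, List

# The commit message describes a hard-fail for `contains` literals shorter than 8 chars.
MIN_LITERAL_LEN = 8


def find_short_contains(criteria: List[Dict[str, Any]]) -> List[str]:
    """Return the `expected` literals that would violate the rule."""
    violations = []
    for criterion in criteria:
        if criterion.get("type") != "json_check":
            continue
        if criterion.get("operator") != "contains":
            continue
        expected = str(criterion.get("expected", ""))
        if len(expected) < MIN_LITERAL_LEN:
            violations.append(expected)
    return violations


# Hypothetical in-memory form of the smoke_03 criteria before and after the migration.
old_criteria = [
    {"type": "json_check", "operator": "contains", "expected": "scalat"},
    {"type": "json_check", "operator": "contains", "expected": "lassif"},
]
new_criteria = [
    {
        "type": "json_check",
        "operator": "regex",
        "expected": r"^(exception-escalation|agentic-output-review)$",
    },
]

old_violations = find_short_contains(old_criteria)
new_violations = find_short_contains(new_criteria)
```

Under these assumptions the old criteria produce two violations and the migrated ones produce none, which is why the migration has to land before the validator ships.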
… asks for (#561)

Local 3-rep baseline of `skill-flow-ipe-ceql-where` on current main: 0/3 PASS at score 0.273. Three reps split across two checker-driven failure modes that aren't really agent failures:

1. Connector-key invention (reps 00 + 02): agent wrote `uipath-microsoft-entra` / `uipath-microsoft-entra-id`, derived from the current product name "Microsoft Entra ID". The registered connector key is the legacy `uipath-microsoft-azureactivedirectory`, and nothing in the prompt or skill SKILL.md surfaces that.
2. Empty inputs.detail (rep 01): agent built a structurally-valid flow with `inputs: {}` and ran `uip maestro flow validate`, which returned "Status: Valid". The CLI itself does not require inputs.detail for a connector node; only this test's checker did. Since the prompt explicitly forbids `uip flow node configure` (no live tenant), populating inputs.detail requires the agent to reverse-engineer what the CLI would emit; that's not what the prompt asks for.

Fix follows the same playbook as #555 (smoke_03_escalation): give the agent the canonical vocabulary and grade the artifacts the prompt actually asks for.

Prompt updates:
- Names the registered connector key verbatim (`uipath-microsoft-azureactivedirectory`) and warns that display names like "Microsoft Entra" / "Microsoft Entra ID" are NOT registry keys.
- Points the agent at the canonical filter-tree shape: section "Filter Trees (CEQL)" in `skills/uipath-platform/references/integration-service/activities.md` (added by #492) and Step 6a in the Maestro Flow connector plugin's impl.md (which cross-references it).
- Enumerates the required `where_detail.json` shape inline: groupOperator + filters[].id + PascalCase operator + WorkflowValue-wrapped value.

Checker updates:
- Validates `where_detail.json`'s `filter` against the canonical shape (groupOperator numeric, filters[] non-empty, leaves carry PascalCase operator + id referencing displayName + value 'active'). This is the artifact the prompt asks the agent to plan.
- Validates the .flow file at the structural level only: registered connector key present + List Groups operation referenced + Decision + Terminate nodes. No more inputs.detail / queryParameters.where / configuration / =jsonString:... reverse-engineering; those are the CLI's job, not the agent's.
- Drops the `inputs.detail` / =js: expression form path (the CLI's own validator accepts `inputs: {}`, so we should too).

Verified locally on this branch with 3 fresh reps: 3/3 SUCCESS at score 1.0 (durations 298s / 602s / 346s, vs baseline 835s / 402s / 1086s; the canonical-key directive shaves ~45% off the agent's trial-and-error loops on connector picking and validator schema). All three reps converged on the canonical filter-tree shape verbatim.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
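
The `where_detail.json` shape check described under "Checker updates" might look roughly like the sketch below. It is not the checker from this PR. The sample document, the `Contains` operator, and the `{"value": ...}` wrapper are placeholders, since the exact WorkflowValue layout is not shown here. For that reason the value test searches recursively instead of assuming a wrapper structure.

```python
import re
from typing import Any, List

PASCAL_CASE = re.compile(r"^[A-Z][A-Za-z0-9]*$")


def contains_string(node: Any, target: str) -> bool:
    """Recursively look for `target` so the wrapper layout need not be assumed."""
    if isinstance(node, str):
        return node == target
    if isinstance(node, dict):
        return any(contains_string(v, target) for v in node.values())
    if isinstance(node, list):
        return any(contains_string(v, target) for v in node)
    return False


def check_filter(doc: dict) -> List[str]:
    """Return a list of problems; an empty list means the shape looks canonical."""
    tree = doc.get("filter")
    if not isinstance(tree, dict):
        return ["missing `filter` object"]
    problems = []
    group_op = tree.get("groupOperator")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(group_op, bool) or not isinstance(group_op, (int, float)):
        problems.append("groupOperator must be numeric")
    leaves = tree.get("filters")
    if not isinstance(leaves, list) or not leaves:
        problems.append("filters[] must be a non-empty list")
        return problems
    for i, leaf in enumerate(leaves):
        if not isinstance(leaf, dict):
            problems.append(f"filters[{i}] is not an object")
            continue
        if not PASCAL_CASE.match(str(leaf.get("operator", ""))):
            problems.append(f"filters[{i}].operator is not PascalCase")
        if "displayName" not in str(leaf.get("id", "")):
            problems.append(f"filters[{i}].id does not reference displayName")
        if not contains_string(leaf.get("value"), "active"):
            problems.append(f"filters[{i}].value does not carry 'active'")
    return problems


# Placeholder documents: operator name and value wrapper are illustrative only.
good = {
    "filter": {
        "groupOperator": 0,
        "filters": [
            {"id": "displayName", "operator": "Contains", "value": {"value": "active"}}
        ],
    }
}
bad = {"filter": {"groupOperator": "and", "filters": []}}

good_problems = check_filter(good)
bad_problems = check_filter(bad)
```

Returning a list of problems rather than a single boolean keeps each failed criterion visible in the grading output.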
Summary
Replaces 12 brittle `json_check.contains` substring assertions across the five HITL smoke tasks with strict canonical-name checks aligned with `references/hitl-patterns.md`.

| Task | Before | After |
|---|---|---|
| `smoke_01_explicit` | `pattern contains pprov \| rite \| alid` (3 fragments, OR @ 0.33) | `pattern regex ^(approval-gate\|write-back-validation)$` |
| `smoke_02_approval_gate` | `pattern contains pprov` | `pattern equals approval-gate` |
| `smoke_03_escalation` | `pattern contains scalat`, `insertion_point contains lassif` | `pattern regex ^(exception-escalation\|agentic-output-review)$`, `insertion_point regex (?i)(classif\|confidence\|routing\|...)` |
| `smoke_04_writeback` | `pattern contains rite \| alid \| nrich` (3 fragments, OR @ 0.33) | `pattern regex ^(write-back-validation\|data-enrichment\|agentic-output-review)$` |
| `smoke_05_compliance` | `pattern contains udit \| omplianc \| ign \| pprov` (4 fragments, OR @ 0.25) | `pattern regex ^(compliance-checkpoint\|approval-gate)$` (+ prompt strengthened with explicit "regulatory compliance" framing and GDPR Article 17 reference) |

Each task's prompt now lists the same 6 canonical pattern machine names mirroring the section titles in `references/hitl-patterns.md`: `approval-gate`, `exception-escalation`, `data-enrichment`, `compliance-checkpoint`, `write-back-validation`, `agentic-output-review`.

The agent is required to emit exactly one of those names verbatim. Where the scenario maps to multiple defensible canonical answers per the skill doc, the criterion uses `regex` and accepts each documented option.

Why
The original assertions used 3- to 8-character substrings of the canonical pattern names (e.g. `expected: "scalat"` for "escalation", `expected: "udit"` for "audit", `expected: "ign"` for "sign-off"). When the agent's free-form English answer didn't happen to contain that exact substring fragment, the smoke task failed, even when the answer was semantically correct.

Local triage of run `2026-05-04_04-05-27` (and a 9-rep cross-grid across PR #517 / #545 / #546 worktrees) traced one such failure to `expected: "scalat"`: smoke_03_escalation passed 8/9 locally, but the failing rep emitted `pattern: "confidence-gated human review"`, a perfectly correct semantic answer that just didn't contain the literal substring.

Replacing the substring trap with a controlled vocabulary:
- … `references/hitl-patterns.md` (the prompt names the file) and pick a documented pattern.

Commits
- `fix(hitl): rewrite smoke_03_escalation around canonical pattern vocabulary`: the originally failing test that triggered the investigation.
- `test(hitl): migrate 4 HITL smoke tasks to canonical pattern vocabulary`: the proactive cleanup of the same anti-pattern in the four sibling smoke tasks.

Test plan
- `coder-eval run --repeats 3` against this branch: 3/3 SUCCESS at score 1.0 for smoke_03 (commit 1) and 12/12 SUCCESS at score 1.0 across smoke_01/02/04/05 (commit 2). 15/15 total.
- … `origin/main` (post-sync) confirms all 6 failing criteria across these 5 files are covered. After this PR lands: 0 violators repo-wide.

Companion PR
… `json_check.contains` with a `< 8`-char literal at task-load time. Merge this PR first, otherwise smoke-skills CI will go red the moment coder_eval ships the validator.

🤖 Generated with Claude Code