(Diagnostic Agent) - rpa preflight test #631

Open
MarinRzv wants to merge 3 commits into main from feat/diagnostics-rpa-preflight-test

Conversation

@MarinRzv MarinRzv commented May 7, 2026

Ran skill-diagnostics-rpa-preflight-failure locally and it passed. (Also confirm the two modified tasks still pass with the new turn_timeout values.)

github-actions Bot commented May 7, 2026

Claude finished @MarinRzv's task in 2m 12s


Coder-eval task lint (advisory)

3 task YAMLs changed; verdict counts: 0 Critical, 1 High, 0 Medium, 0 Low, 3 OK.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

High — PR body does not claim the changed tasks have been run and passed. Please edit the PR description to add a line like: Ran skill-diagnostics-rpa-preflight-failure locally and it passed. (Also confirm the two modified tasks still pass with the new turn_timeout values.)

Per-task lint

tests/tasks/uipath-diagnostics/rpa-preflight-failure/task.yaml — verdict: OK

New faithful-replay scenario for a dual-root-cause preflight failure (missing asset + missing IS connection). Prompt is minimal ("can you investigate my last failed job?"), no procedure leakage. skill_triggered + llm_judge with ground truth via RESOLUTION.md — appropriate for diagnostic investigations where the output is a diagnosis. Criteria are not gameable without actually running the skill. No near-duplicate — tests a materially distinct failure mode (preflight check with two independent root causes) from the existing faulted_excel_o365 (single cross-workspace connection) and maestro-stuck-rpa-job (stuck Maestro job).

tests/tasks/uipath-diagnostics/faulted_excel_o365/task.yaml — verdict: OK

Change is a turn_timeout tune (1200 → 800). No structural impact.

tests/tasks/uipath-diagnostics/maestro-stuck-rpa-job/task.yaml — verdict: OK

Change is a turn_timeout tune (1200 → 1000). No structural impact.

Within-PR duplicates

No duplicate clusters detected. Only one new task added; the two modified files are config tunes.

Conclusion

⚠ 1 task has issues, max severity High. Advisory only — not blocking merge. The sole High is the missing passing-run evidence in the PR description — the task YAMLs themselves are well-structured.


github-actions Bot commented May 7, 2026

Claude finished @MarinRzv's task in 2m 56s


Summary

Adds a new rpa-preflight-failure e2e test scenario for the uipath-diagnostics skill, replaying a real investigation where two co-equal root causes (missing Orchestrator asset + missing Integration Service connection) caused a preflight check failure. Alongside the test, the PR makes significant behavioral changes to the diagnostic skill itself — adding a co-equal-roots guard, removing the upstream-cause gate and causal precedence logic from the depth-verifier, simplifying hypothesis generation ordering, and streamlining integration-service investigation/playbook docs. Also includes unrelated cross-skill changes to uipath-llm-configuration-byo-connections, uipath-maestro-flow, and uipath-platform.

Change-by-Change Review

1. skills/uipath-diagnostics/SKILL.md — Behavioral changes

Severity: Medium

Description broadened to add "investigating, troubleshooting" and "stuck jobs" — good for trigger matching.

The upstream-cause gate and sibling-precedence backstop paragraphs were removed and replaced with a new Co-equal-roots guard. This is a meaningful logic change: the old mechanism worked at hypothesis classification time (preventing premature explains-WHY classification), while the new mechanism works at the exit-decision point (preventing skip-remaining when co-equal playbooks remain). The co-equal-roots guard text is clear and prescriptive — well-written for LLM usability. However:

  • The reference to agents/depth-verifier.md § Causal precedence was removed alongside the linked section itself. The old mechanism enforced a deeper property (distinguishing consequence vs. originating fault). The new guard only checks playbook confidence levels, which is a narrower criterion. This is a design choice worth calling out — it trades depth-of-analysis correctness for simpler multi-root handling.
  • "When to stop testing" section at line 126 correctly cross-references the co-equal guard.
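The exit-decision rule can be illustrated with a minimal sketch. It assumes a hypothetical data model (`PlaybookMatch`, a numeric confidence scale, and the playbook name `asset-not-found` are all made up for illustration); the skill itself is prose for an LLM, not code. The rule encoded here follows the commit message in this PR: when two or more playbooks match at the same highest confidence for distinct error signatures, all of them must be tested before exiting.

```python
from dataclasses import dataclass


# Hypothetical data model: the skill itself is prose for an LLM, not code.
@dataclass(frozen=True)
class PlaybookMatch:
    playbook: str         # e.g. "connection-invalid"
    confidence: int       # higher means more confident (assumed numeric scale)
    error_signature: str  # the error signature that triggered the match


def co_equal_roots(matches):
    """Return the playbooks that must all be tested before exiting.

    The guard applies only when two or more playbooks share the highest
    confidence and were matched on at least two distinct error signatures.
    """
    if not matches:
        return []
    top = max(m.confidence for m in matches)
    top_matches = [m for m in matches if m.confidence == top]
    signatures = {m.error_signature for m in top_matches}
    if len(top_matches) >= 2 and len(signatures) >= 2:
        return top_matches
    return []


def may_skip_remaining(matches, tested):
    """Allow skip-remaining only once every co-equal playbook has been tested."""
    required = {m.playbook for m in co_equal_roots(matches)}
    return required <= set(tested)


# Illustrative triage result for the dual-root scenario.
matches = [
    PlaybookMatch("asset-not-found", 3, "asset 404"),
    PlaybookMatch("connection-invalid", 3, "connection invalid"),
    PlaybookMatch("generic-fault", 1, "job faulted"),
]
print(may_skip_remaining(matches, {"asset-not-found"}))
print(may_skip_remaining(matches, {"asset-not-found", "connection-invalid"}))
```

Note that the sketch compares confidence levels only, which is the narrower criterion called out above: nothing in it distinguishes a consequence from an originating fault.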

2. skills/uipath-diagnostics/agents/depth-verifier.md — Causal precedence removed

Severity: Medium

The entire "Causal precedence" section (36 lines) was deleted. This removed explicit-event and implicit-presupposition checks that prevented downstream hypotheses from being accepted as root causes. This is the largest behavioral reduction in the PR — the depth-verifier loses the ability to reject hypotheses that describe consequences rather than originating faults. If this was intentional (e.g., the checks were causing false rejections in practice), noting that rationale in the PR description would help reviewers.

3. skills/uipath-diagnostics/agents/hypothesis-generator.md — Step 5 removed

Severity: Low

Removed the "Inspect for explicit fault signals first" step that required originating-fault hypotheses to be drafted first and at highest confidence. Remaining steps renumbered. Consistent with the removal of causal-precedence logic.

4. skills/uipath-diagnostics/references/products/integration-service/investigation_guide.md — Simplified

Severity: Low

Ownership analysis ("Who owns the connection?", "Where is it bound?") condensed to a simpler three-step flow. The connection resource file is still checked first, which is correct. The **Do NOT infer the connector from the activity package name** warning is preserved. Clean simplification.

5. skills/uipath-diagnostics/references/products/integration-service/overview.md — Path table removed

Severity: Low

Multi-layout path table (standalone/solution single-folder/multi-folder) replaced with a single-line path pattern. The glob recommendation (**/connection/<connector-key>/*.json) was removed. This simplifies the doc but may cause misses for solution-layout projects where the resource file is under resources/.
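To make the risk concrete, here is a minimal sketch of a layout-independent lookup that applies the removed glob pattern. The function name and the directory names in the usage note are assumptions for illustration; only the pattern itself comes from the doc.

```python
from pathlib import Path


def find_connection_resources(project_root, connector_key):
    """Find connection resource files anywhere under project_root.

    Applies the glob **/connection/<connector-key>/*.json, so a file nested
    under a directory such as resources/ is found as well as one at the root.
    """
    root = Path(project_root)
    return sorted(root.glob(f"**/connection/{connector_key}/*.json"))
```

Calling `find_connection_resources(".", "<connector-key>")` with a real connector key would return matches from both a standalone layout and a nested solution layout, which a single fixed path pattern cannot do.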

6. skills/uipath-diagnostics/references/products/integration-service/playbooks/connection-invalid.md — Simplified

Severity: Low

Investigation steps reduced from 4 to 3, removing the hard precondition about reading the resource file before CLI commands and the cross-reference workflow binding step. The old precondition paragraph was quite prescriptive about ordering — its removal means the agent could now draw conclusions from CLI evidence alone without the resource file, which the old text explicitly warned against.

7. skills/uipath-llm-configuration-byo-connections/SKILL.md (create → add, --folder-key → --folder-id)

Severity: High (scoping)

This is an unrelated change to a different skill. Renames the CLI verb from create to add and the flag from --folder-key to --folder-id across the entire file (~20 substitutions). This should be in its own PR per repo conventions: "Changes are scoped to the skill being modified — no drive-by changes to unrelated skills."

8. skills/uipath-maestro-flow/references/author/references/plugins/connector-trigger/impl.md — Filter tree docs rewritten

Severity: High (scoping)

Another unrelated cross-skill change. The filter tree building procedure was inlined (previously linked to uipath-platform/references/integration-service/triggers.md), and array-field filter handling was changed from supported to "not yet supported." This is a meaningful doc change to a different skill.

9. skills/uipath-platform/references/integration-service/triggers.md — Section removed

Severity: High (scoping)

Entire "Building Filter Trees from filterFields" section (26 lines) deleted. This was the target of the cross-reference from uipath-maestro-flow. Combined with change #8, the content was moved/inlined — but this belongs in a separate PR.

10. tests/tasks/uipath-diagnostics/rpa-preflight-failure/task.yaml — New test scenario

Severity: OK

Well-structured e2e task. task_id: skill-diagnostics-rpa-preflight-failure follows the naming convention. Tags include uipath-diagnostics as first tag. LLM judge prompt has clear dimensions and scoring rubric. Simulation persona is well-constrained. turn_timeout: 1800 and task_timeout: 2400 are generous but reasonable for a two-root-cause scenario. max_turns: 60 matches existing tasks.

11. tests/tasks/uipath-diagnostics/rpa-preflight-failure/fixtures/mocks/responses/or-jobs-get-*.json:31 — Unscrubbed personal data

Severity: High

"LocalSystemAccount": "UIPATH\\ELENA.BUCHIR" appears to contain a real person's name. The mandatory scrub list requires replacing "Real first names used as identifiers" with generic placeholders. This field should be scrubbed to something like "UIPATH\\MOCK-USER".

12. tests/tasks/uipath-diagnostics/rpa-preflight-failure/README.md and RESOLUTION.md

Severity: OK

README clearly documents the scenario purpose, layers, success criteria, and regeneration command. RESOLUTION.md is thorough — two root causes with evidence, immediate fixes, preventive fixes, and an investigation summary table. Well-suited as LLM judge ground truth.

13. tests/tasks/uipath-diagnostics/rpa-preflight-failure/fixtures/mocks/responses/manifest.json

Severity: OK

11 canned rules + 1 docsai passthrough. unmocked_default returns empty array with exit 0 — permissive fallback is appropriate for diagnostic exploration. Rules cover the full investigation path (folders, jobs list/get/logs, assets, IS connections).
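The permissive fallback can be pictured with a small sketch. Everything here is hypothetical: the real manifest schema is not shown in this review, so the rule shape, the prefix matching, and the command strings are assumptions. Only the fallback behavior (empty array, exit 0) comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MockResponse:
    stdout: str
    exit_code: int


# Permissive fallback described in the review: empty array, exit 0.
UNMOCKED_DEFAULT = MockResponse(stdout="[]", exit_code=0)


def resolve(command, rules):
    """Return the response of the first rule whose prefix matches the command.

    Prefix matching is an assumption; the real manifest may match differently.
    """
    for prefix, response in rules:
        if command.startswith(prefix):
            return response
    return UNMOCKED_DEFAULT


# Made-up command strings, for illustration only.
rules = [("example jobs list", MockResponse('[{"State": "Faulted"}]', 0))]
print(resolve("example jobs list --top 1", rules).stdout)
print(resolve("example something else", rules).stdout)
```

The trade-off of a fallback like this is that an unexpected command looks like "no results" rather than an error, which suits exploratory diagnostics but can hide a missing rule.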

14. tests/tasks/uipath-diagnostics/rpa-preflight-failure/process/ — Project snapshot

Severity: OK

Main.xaml, project.json, entry-points.json, project.uiproj — clean UiPath project snapshot. The XAML correctly contains both the Outlook 365 activity (connection ID 7d15b17a-...) and the Get Asset activity (PO_PricingTable), matching the scenario's two root causes. No personal paths detected in the project files.

15. tests/tasks/uipath-diagnostics/_shared/scripts/generate_scenario.py — Ignore dirs

Severity: OK

Adds .local, .tmh, .settings, .entities, .project, .templates, .claude to the project snapshot ignore list. These are UiPath Studio caches/runtime artifacts — correct exclusions.
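For readers unfamiliar with this kind of exclusion, a minimal sketch of a snapshot copy that skips these directories follows. It is not the code in generate_scenario.py: the constant name is taken from the commit message, but the set below holds only the newly added entries, and `snapshot_project` is a hypothetical helper.

```python
import shutil
from pathlib import Path

# Only the entries added in this PR; the real list may contain more.
PROJECT_SNAPSHOT_IGNORE_DIRS = {
    ".local", ".tmh", ".settings", ".entities", ".project", ".templates", ".claude",
}


def _ignore(directory, names):
    """copytree callback: skip ignored names that are directories."""
    return [
        name for name in names
        if name in PROJECT_SNAPSHOT_IGNORE_DIRS and Path(directory, name).is_dir()
    ]


def snapshot_project(src, dst):
    """Copy src to dst (which must not exist yet), leaving out ignored directories."""
    shutil.copytree(src, dst, ignore=_ignore)
```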

16. Existing task.yaml turn_timeout tuning

Severity: OK

faulted_excel_o365: 1200 → 800, maestro-stuck-rpa-job: 1200 → 1000. Comment explains the rationale (observed run times). Fine.

What's Missing

  • PR description is empty — "No description provided." Given the scope of changes (behavioral skill changes + new test + cross-skill modifications), a description explaining the motivation for removing causal precedence and adding the co-equal-roots guard would be valuable.
  • No smoke test — The new scenario is tagged e2e but no diagnostics task is tagged smoke. Per CONTRIBUTING.md, both are required. This is pre-existing (not introduced by this PR), but adding a lightweight smoke test would improve coverage.
  • No connection resource file in the process/ snapshot — The process references connection ID 7d15b17a-... (Outlook 365) but no connection/ folder exists under process/. The investigation guide says "Read the project's connection resource file first" — without one, the test only exercises the CLI-based fallback path. This may be intentional (the real project may not have had one), but worth noting.

Area Ratings

| Area | Status | Notes |
| --- | --- | --- |
| Frontmatter | OK | Description broadened appropriately, under 1024 chars |
| E2E Tests | OK | Well-structured e2e task with clear judge rubric; no smoke test (pre-existing gap) |
| Skill Body | Issue | Significant behavioral changes (causal precedence removal, co-equal guard) bundled with a test PR without PR description explaining rationale |
| References & Assets | OK | Simplifications are internally consistent |
| Repo Hygiene | Issue | Unscrubbed personal identifier; unrelated cross-skill changes should be separate PRs |

Issues for Manual Review

  • Causal precedence removal trade-off — The old depth-verifier logic prevented downstream hypotheses from being classified as root causes. The new co-equal-roots guard addresses a different concern (multi-root stopping). Domain experts should verify that removing causal precedence won't cause the agent to accept "describes-WHAT" hypotheses as root causes in single-root scenarios.
  • Integration-service overview simplification — Removing the multi-layout path table and glob recommendation may cause investigation failures for solution-layout projects. Verify this against real solution-based projects.
  • CLI verb rename in byo-connections — Verify that create → add and --folder-key → --folder-id match the actual current CLI. If these are correct CLI updates, they should still be in a separate PR.

Conclusion

The new rpa-preflight-failure test scenario is well-crafted — good fixtures, clear RESOLUTION.md, appropriate judge rubric. However, there are three issues to address before merge:

  1. High — Scrub ELENA.BUCHIR from or-jobs-get-*.json:31 per the mandatory scrub list.
  2. High — Split cross-skill changes (uipath-llm-configuration-byo-connections, uipath-maestro-flow, uipath-platform) into separate PRs per repo conventions.
  3. Medium — Add a PR description explaining the rationale for the behavioral changes to the diagnostic skill (causal precedence removal, co-equal-roots guard).
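
The scrub in item 1 could be applied with a small script. This is a minimal sketch, assuming the fixture is ordinary JSON and that every `LocalSystemAccount` value should become the placeholder suggested earlier in this review. The `Robot` nesting and the `SOME.NAME` value in the usage lines are made up for illustration.

```python
import json

# Placeholder suggested in the review; the field name comes from the flagged fixture.
PLACEHOLDER = "UIPATH\\MOCK-USER"


def scrub_local_system_account(value):
    """Recursively replace every string "LocalSystemAccount" value in parsed JSON."""
    if isinstance(value, dict):
        return {
            key: PLACEHOLDER
            if key == "LocalSystemAccount" and isinstance(item, str)
            else scrub_local_system_account(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [scrub_local_system_account(item) for item in value]
    return value


# Illustrative input; the nesting is an assumption, not the real fixture shape.
raw = '{"Robot": {"LocalSystemAccount": "UIPATH\\\\SOME.NAME"}}'
cleaned = scrub_local_system_account(json.loads(raw))
print(json.dumps(cleaned))
```

Walking the parsed structure rather than running a text substitution keeps the JSON escaping of the backslash correct when the file is written back.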

MarinRzv added 2 commits May 7, 2026 13:58
- description: add 'investigating'/'troubleshooting' verbs and 'stuck jobs' to broaden activation. The user's natural-language prompts ('investigate this job', 'why is my job stuck') were not matching the previous 'diagnosing/failed jobs' wording.
- co-equal-roots guard: if triage matches >=2 playbooks at the same highest confidence for distinct independent error signatures, the orchestrator must test all hypotheses sourced from those playbooks before depth-checking and exiting. Prevents the 'skip remaining' rule from short-circuiting multi-root failures (asset 404 + connection invalid), where reporting only one root cause leaves the user with an incomplete fix list.
…rn_timeouts

- new scenario rpa-preflight-failure: faithful replay of an RPA job that faulted on a preflight check with two co-equal root causes (Orchestrator asset 404 + IS connection invalid). Project source is a real RPA/XAML PurchaseOrderProcessing project; CLI fixtures populated from the original session's .investigation/raw/* outputs (folder list, faulted-job list, job get/logs, asset list, IS connection lookups). docsai rules collapsed to a single passthrough per existing convention. PII scrubbed (emails -> placeholders, hostnames -> MOCK-HOST). Validates the new co-equal-roots guard end-to-end (3/3 SUCCESS = 1.000).
- generator: extend PROJECT_SNAPSHOT_IGNORE_DIRS with UiPath Studio caches (.local, .tmh, .settings, .entities, .project, .templates, .claude). Project snapshot for the new test went from 239 files to 5.
- turn_timeouts: bump rpa-preflight-failure to 1800s (multi-root path needs ~25 min wall under contention); tighten faulted_excel_o365 to 800s (observed ~10 min) and maestro-stuck-rpa-job to 1000s (observed ~13 min) so failures surface earlier instead of the previous shared 1200s ceiling.
@MarinRzv MarinRzv force-pushed the feat/diagnostics-rpa-preflight-test branch from b47c487 to d1310d1 May 7, 2026 11:01

github-actions Bot commented May 7, 2026

Claude finished @MarinRzv's task in 1m 20s


Coder-eval task lint (advisory)

3 task YAMLs changed; verdict counts: 0 Critical, 1 High, 0 Medium, 0 Low, 3 OK.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

High — PR body does not claim the changed tasks have been run and passed. Please edit the PR description to add a line like: Ran skill-diagnostics-rpa-preflight-failure locally and it passed. (Also confirm the two modified tasks still pass with the new turn_timeout values.)

Per-task lint

tests/tasks/uipath-diagnostics/rpa-preflight-failure/task.yaml — verdict: OK

New faithful-replay scenario for a dual-root-cause preflight failure (missing asset + missing IS connection). Prompt is minimal ("can you investigate my last failed job?"), no procedure leakage. skill_triggered + llm_judge with ground truth via RESOLUTION.md — appropriate for diagnostic investigations where the output is a diagnosis. Criteria are not gameable without actually running the skill. No near-duplicate — tests a materially distinct failure mode (preflight check with two independent root causes) from the existing faulted_excel_o365 (single cross-workspace connection) and maestro-stuck-rpa-job (stuck Maestro job).

tests/tasks/uipath-diagnostics/faulted_excel_o365/task.yaml — verdict: OK

Change is a turn_timeout tune (1200 → 800). No structural impact.

tests/tasks/uipath-diagnostics/maestro-stuck-rpa-job/task.yaml — verdict: OK

Change is a turn_timeout tune (1200 → 1000). No structural impact.

Within-PR duplicates

No duplicate clusters detected. Only one new task added; the two modified files are config tunes.

Conclusion

⚠ 1 task has issues, max severity High. Advisory only — not blocking merge. The sole High is the missing passing-run evidence in the PR description — the task YAMLs themselves are well-structured.


…pected_skill

Aligns the diagnostic task YAMLs with the upstream coder_eval schema change in commit 48cc93e (UiPath/coder_eval) that paired with skill activation evals (#596). The SkillTriggeredCriterion now requires expected_skill (the row's expected skill, with internal logic 'expected = yes iff expected_skill == skill_name') instead of expected ("yes" / "no"). For our diagnostic scenarios where the skill must always activate, set expected_skill to the same value as skill_name. Generator template updated so future scenarios emit the new field. Without this, CI smoke fails to load the task files because the local-pinned coder_eval is older than CI's main install.
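
The rule quoted in this commit message reduces to a single comparison. The sketch below is not coder_eval code: the function name is hypothetical, and only the "yes iff expected_skill == skill_name" logic comes from the message.

```python
def expected_activation(expected_skill, skill_name):
    """Return "yes" iff the row's expected skill is the skill under test."""
    return "yes" if expected_skill == skill_name else "no"


# For always-activate scenarios both fields carry the same value.
print(expected_activation("uipath-diagnostics", "uipath-diagnostics"))
```

Setting `expected_skill` equal to `skill_name` therefore reproduces the old `expected: "yes"` behavior for the diagnostic scenarios.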

github-actions Bot commented May 7, 2026

Claude finished @MarinRzv's task in 1m 36s


Coder-eval task lint (advisory)

3 task YAMLs changed; verdict counts: 0 Critical, 0 High, 0 Medium, 0 Low, 3 OK.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

✅ Author confirms passing run: "Ran skill-diagnostics-rpa-preflight-failure locally and it passed. (Also confirm the two modified tasks still pass with the new turn_timeout values.)"

Per-task lint

tests/tasks/uipath-diagnostics/rpa-preflight-failure/task.yaml — verdict: OK

New faithful-replay scenario for a dual-root-cause preflight failure (missing asset + missing IS connection). Prompt is minimal ("can you investigate my last failed job?"), no procedure leakage. skill_triggered + llm_judge with ground truth via RESOLUTION.md — appropriate for diagnostic investigations where the output is a diagnosis. Criteria are not gameable without actually running the skill. No near-duplicate — tests a materially distinct failure mode (preflight check with two independent root causes) from the existing faulted_excel_o365 (single cross-workspace connection) and maestro-stuck-rpa-job (stuck Maestro job).

tests/tasks/uipath-diagnostics/faulted_excel_o365/task.yaml — verdict: OK

Change is a turn_timeout tune (1200 → 800). No structural impact.

tests/tasks/uipath-diagnostics/maestro-stuck-rpa-job/task.yaml — verdict: OK

Change is a turn_timeout tune (1200 → 1000). No structural impact.

Within-PR duplicates

No duplicate clusters detected. Only one new task added; the two modified files are config tunes.

Conclusion

✅ All changed tasks pass the rubric. Evidence of passing run confirmed.

