Claude finished @MarinRzv's task in 2m 12s
Coder-eval task lint (advisory)
3 task YAMLs changed; verdict counts: 0 Critical, 1 High, 0 Medium, 0 Low, 3 OK. Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.
Evidence of passing run
❌ High: PR body does not claim the changed tasks have been run and passed. Please edit the PR description to add a line like:
Per-task lint
|
|
Claude finished @MarinRzv's task in 2m 56s
Summary
Adds a new
Change-by-Change Review
1.

| Area | Status | Notes |
|---|---|---|
| Frontmatter | OK | Description broadened appropriately, under 1024 chars |
| E2E Tests | OK | Well-structured e2e task with clear judge rubric; no smoke test (pre-existing gap) |
| Skill Body | Issue | Significant behavioral changes (causal precedence removal, co-equal guard) bundled with a test PR without PR description explaining rationale |
| References & Assets | OK | Simplifications are internally consistent |
| Repo Hygiene | Issue | Unscrubbed personal identifier; unrelated cross-skill changes should be separate PRs |
Issues for Manual Review
- Causal precedence removal trade-off — The old depth-verifier logic prevented downstream hypotheses from being classified as root causes. The new co-equal-roots guard addresses a different concern (multi-root stopping). Domain experts should verify that removing causal precedence won't cause the agent to accept "describes-WHAT" hypotheses as root causes in single-root scenarios.
- Integration-service overview simplification — Removing the multi-layout path table and glob recommendation may cause investigation failures for solution-layout projects. Verify this against real solution-based projects.
- CLI verb rename in byo-connections: Verify that create → add and --folder-key → --folder-id match the actual current CLI. If these are correct CLI updates, they should still be in a separate PR.
Conclusion
The new rpa-preflight-failure test scenario is well-crafted — good fixtures, clear RESOLUTION.md, appropriate judge rubric. However, there are three issues to address before merge:
- High: Scrub ELENA.BUCHIR from or-jobs-get-*.json:31 per the mandatory scrub list.
- High: Split cross-skill changes (uipath-llm-configuration-byo-connections, uipath-maestro-flow, uipath-platform) into separate PRs per repo conventions.
- Medium: Add a PR description explaining the rationale for the behavioral changes to the diagnostic skill (causal precedence removal, co-equal-roots guard).
- description: add 'investigating'/'troubleshooting' verbs and 'stuck jobs' to broaden activation. The user's natural-language prompts ('investigate this job', 'why is my job stuck') were not matching the previous 'diagnosing/failed jobs' wording.
- co-equal-roots guard: if triage matches >=2 playbooks at the same highest confidence for distinct independent error signatures, the orchestrator must test all hypotheses sourced from those playbooks before depth-checking and exiting. Prevents the 'skip remaining' rule from short-circuiting multi-root failures (asset 404 + connection invalid), where reporting only one root cause leaves the user with an incomplete fix list.
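The co-equal-roots guard described above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: `PlaybookMatch`, `co_equal_roots`, the integer confidence rank, and the playbook names are all hypothetical and chosen only to make the rule concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaybookMatch:
    playbook: str         # hypothetical playbook identifier
    confidence: int       # hypothetical rank; higher means more confident
    error_signature: str  # the error signature that triggered the match


def co_equal_roots(matches):
    """Return every top-confidence match when the guard applies, else []."""
    if not matches:
        return []
    top = max(m.confidence for m in matches)
    tied = [m for m in matches if m.confidence == top]
    distinct_signatures = {m.error_signature for m in tied}
    # The guard applies only when >= 2 playbooks tie on distinct signatures.
    if len(tied) >= 2 and len(distinct_signatures) >= 2:
        return tied
    return []


# Example mirroring the asset 404 + connection invalid case.
matches = [
    PlaybookMatch("asset-not-found", 3, "asset 404"),
    PlaybookMatch("connection-invalid", 3, "connection invalid"),
    PlaybookMatch("generic-fault", 1, "job faulted"),
]
must_test = co_equal_roots(matches)
print([m.playbook for m in must_test])
```

When the returned list is non-empty, the orchestrator would test hypotheses from every listed playbook before depth-checking, rather than applying the "skip remaining" rule after the first confirmed root cause.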
…rn_timeouts
- new scenario rpa-preflight-failure: faithful replay of an RPA job that faulted on a preflight check with two co-equal root causes (Orchestrator asset 404 + IS connection invalid). Project source is a real RPA/XAML PurchaseOrderProcessing project; CLI fixtures populated from the original session's .investigation/raw/* outputs (folder list, faulted-job list, job get/logs, asset list, IS connection lookups). docsai rules collapsed to a single passthrough per existing convention. PII scrubbed (emails -> placeholders, hostnames -> MOCK-HOST). Validates the new co-equal-roots guard end-to-end (3/3 SUCCESS = 1.000).
- generator: extend PROJECT_SNAPSHOT_IGNORE_DIRS with UiPath Studio caches (.local, .tmh, .settings, .entities, .project, .templates, .claude). Project snapshot for the new test went from 239 files to 5.
- turn_timeouts: bump rpa-preflight-failure to 1800s (multi-root path needs ~25 min wall under contention); tighten faulted_excel_o365 to 800s (observed ~10 min) and maestro-stuck-rpa-job to 1000s (observed ~13 min) so failures surface earlier instead of the previous shared 1200s ceiling.
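To illustrate how an ignore list like PROJECT_SNAPSHOT_IGNORE_DIRS can shrink a project snapshot, here is a small sketch. The constant name and the seven directory names come from the commit message; the `snapshot_files` helper and the pruning approach are assumptions for illustration, and the real generator constant presumably contains earlier entries as well.

```python
import os

# Directory names from the commit message; the real constant likely
# holds additional, pre-existing entries that are not shown here.
PROJECT_SNAPSHOT_IGNORE_DIRS = {
    ".local", ".tmh", ".settings", ".entities",
    ".project", ".templates", ".claude",
}


def snapshot_files(root):
    """Hypothetical helper: list files under root, skipping ignored dirs."""
    kept = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into ignored directories.
        dirnames[:] = [d for d in dirnames if d not in PROJECT_SNAPSHOT_IGNORE_DIRS]
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            kept.append(os.path.relpath(full_path, root))
    return sorted(kept)
```

Pruning `dirnames` in place avoids walking cache directories at all, which matters when they hold most of the files in a project.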
b47c487 to d1310d1
|
Claude finished @MarinRzv's task in 1m 20s
Coder-eval task lint (advisory)
3 task YAMLs changed; verdict counts: 0 Critical, 1 High, 0 Medium, 0 Low, 3 OK. Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.
Evidence of passing run
❌ High: PR body does not claim the changed tasks have been run and passed. Please edit the PR description to add a line like:
Per-task lint
|
…pected_skill
Aligns the diagnostic task YAMLs with the upstream coder_eval schema change in commit 48cc93e (UiPath/coder_eval) that paired with skill activation evals (#596). The SkillTriggeredCriterion now requires expected_skill (the row's expected skill, with internal logic 'expected = yes iff expected_skill == skill_name') instead of expected ("yes" / "no").
For our diagnostic scenarios where the skill must always activate, set expected_skill to the same value as skill_name. Generator template updated so future scenarios emit the new field.
Without this, CI smoke fails to load the task files because the local-pinned coder_eval is older than CI's main install.
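The rule quoted in the commit message ('expected = yes iff expected_skill == skill_name') can be written out as a short sketch. The function name `expected_activation` and the example skill names are hypothetical; this is not coder_eval's actual code, only a restatement of the stated logic.

```python
def expected_activation(skill_name, expected_skill):
    """Derive the old-style "yes"/"no" value from the new expected_skill field."""
    return "yes" if expected_skill == skill_name else "no"


# For a diagnostic scenario the skill must always activate, so both
# fields carry the same (here hypothetical) skill name.
print(expected_activation("example-diagnostic-skill", "example-diagnostic-skill"))
```

Setting expected_skill equal to skill_name therefore reproduces the previous expected: "yes" behavior under the new schema.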
|
Claude finished @MarinRzv's task in 1m 36s
Coder-eval task lint (advisory)
3 task YAMLs changed; verdict counts: 0 Critical, 0 High, 0 Medium, 0 Low, 3 OK. Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.
Evidence of passing run
✅ Author confirms passing run: "Ran skill-diagnostics-rpa-preflight-failure locally and it passed. (Also confirm the two modified tasks still pass with the new turn_timeout values.)"
Per-task lint
|
Ran skill-diagnostics-rpa-preflight-failure locally and it passed. (Also confirm the two modified tasks still pass with the new turn_timeout values.)