test(uipath-agents): fill remaining coded-agent test gaps #633

Open

radugheo wants to merge 1 commit into main from test/fill-coded-agents-tests

Conversation


radugheo (Contributor) commented May 7, 2026

Summary

Adds 11 coded-agent tests covering remaining gaps in the uipath-agents test suite, plus small fixes to four existing tests.

Tests added

  • skill-agent-coded-antipattern-dev-non-interactive (smoke) — C7: uip codedagent dev auto-appends --interactive; agent must switch to uip codedagent run.
  • skill-agent-coded-antipattern-output-aggregation (e2e) — C8: Annotated[list, operator.add] reducers vanish from --output-file; agent must carry aggregate fields explicitly per node (see the sketch after this list).
  • skill-agent-coded-antipattern-pack-publish (smoke) — C10: uip codedagent pack and publish are wrapper-disabled; agent must consolidate to uip codedagent deploy.
  • skill-agent-coded-chat-react-agent (e2e) — LangGraph chat-style agent via create_react_agent + tools.
  • skill-agent-coded-edit-add-node (e2e) — First positive lifecycle:edit test; modifies an existing LangGraph by adding a second node.
  • skill-agent-coded-eval-classification (e2e) — uipath-binary-classification + uipath-multiclass-classification evaluators.
  • skill-agent-coded-eval-output-suite (e2e) — uipath-contains + uipath-json-similarity + uipath-llm-judge-output-strict-json-similarity + uipath-llm-judge-trajectory-simulation.
  • skill-agent-coded-eval-tool-call-suite (e2e) — All four tool-call evaluators (uipath-tool-call-{order,args,count,output}).
  • skill-agent-coded-langgraph-supervisor (e2e) — LangGraph multi-agent supervisor topology: supervisor + two worker agents.
  • skill-agent-coded-login-status-checkpoint (integration) — C6: auth check is one-shot via uip login status --output json; no redundant logins.
  • skill-agent-coded-sdk-services-suite (e2e) — Direct UiPath() SDK client surface across multiple services (assets + llm + attachments).
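
For the C8 row, a minimal sketch of the anti-pattern and its fix, assuming a plain LangGraph TypedDict state (the state and field names here are illustrative, not taken from the test):

```python
import operator
from typing import Annotated, TypedDict

class BadState(TypedDict):
    # Anti-pattern (C8): reducer-aggregated field; per the test's premise,
    # the aggregate does not survive into the --output-file.
    items: Annotated[list, operator.add]

class GoodState(TypedDict):
    # Fix: a plain field that each node returns in full.
    items: list

def collect(state: GoodState) -> dict:
    # Carry the aggregate forward explicitly instead of relying on a reducer.
    return {"items": state["items"] + ["new-item"]}
```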

Existing tests touched

  • bindings_sync — python -c → python3 -c (the sandbox only has Python 3).
  • eval_exact_match, eval_llm_judges — natural-language prompt cleanup; eval turn_timeout 1200 → 1800s.
  • openai_agents_handoff — natural-language prompt cleanup; check no longer pins the typed-context class name (was hardcoded CustomerInput).

Test plan

  • Each check_*.py dry-runs green against synthetic well-formed projects.
  • All YAMLs parse, tag lists conform to the namespaced taxonomy in tests/README.md.
  • All 11 new tests run end-to-end locally in WSL.


github-actions Bot commented May 7, 2026

Claude finished @radugheo's task in 4m 12s


Review: PR #633 — test(uipath-agents): fill remaining coded-agent test gaps

  • Read review criteria and project conventions
  • Read existing PR conversation
  • Review all 23 changed files in detail
  • Check test conventions (tags, task_ids, YAML structure)
  • Check Python checkers for bugs and consistency
  • Post review findings

Summary

Adds 12 well-structured coded-agent test tasks (YAML + Python checkers) covering evaluator typeIds, edit lifecycle, anti-patterns, SDK services, and solution upload. Also extends .claude/commands/test-coverage.md with a Planned Skills Registry and planned-skill report template. The tests are consistent with existing conventions and the checkers are thorough.

Change-by-Change Review

1. .claude/commands/test-coverage.md — Planned Skills Registry

Severity: OK

Clean extension. The registry adds 8 planned skills with a 0%-by-definition stub template, and wires them into the all-mode discovery path. The transition logic (Phase 1 step 1 existence check) is sound — entries auto-graduate when the folder appears. Rule 13 is a good guardrail.

2. antipattern_dev_non_interactive/ (YAML + checker)

Severity: OK

Smoke-tier negative test for Critical Rule 7. Tags, task_id, and structure are correct. The checker regex (DEV_RX, RUN_RX) is line-anchored and correctly skips comments. The \S*?\b prefix before uip handles env-style wrappers — nice touch.
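
A plausible reconstruction of the two regexes and the comment-skipping scan (the DEV_RX/RUN_RX names come from the PR; the exact patterns are assumed):

```python
import re

# The \S*?\b prefix lets a wrapper-prefixed invocation (e.g. ./bin/uip) match.
DEV_RX = re.compile(r"^\S*?\buip\s+codedagent\s+dev\b")
RUN_RX = re.compile(r"^\S*?\buip\s+codedagent\s+run\b")

def scan(script: str) -> tuple[bool, bool]:
    saw_dev = saw_run = False
    for line in script.splitlines():
        line = line.strip()
        if line.startswith("#"):  # comments don't count, as in the checker
            continue
        saw_dev = saw_dev or bool(DEV_RX.search(line))
        saw_run = saw_run or bool(RUN_RX.search(line))
    return saw_dev, saw_run

# A compliant script mentions dev only in a comment and executes via `run`.
assert scan("# never: uip codedagent dev\n./bin/uip codedagent run") == (False, True)
```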

3. antipattern_output_aggregation/ (YAML + checker)

Severity: OK

E2E negative test for Rules 8+9. The checker is the most complex in this PR (260 lines) and does thorough AST-based validation: reducer annotation detection with alias tracking, carry-forward verification, graph export, lazy-init scan, and out.json content validation. The _is_reducer_marker + _collect_reducer_aliases pair correctly handles both operator.add and from operator import add patterns.

One minor note: check_antipattern_output_aggregation.py:207 — items is typed list[str] | None, but the candidates loop always expects a dict. If the top-level JSON is a list, the loop silently skips everything, items stays None, and the checker exits with a reasonable message, so this is fine in practice.
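
A sketch of what the alias-aware pair described above likely looks like (the two function names are from the PR; the bodies are assumed):

```python
import ast

def _collect_reducer_aliases(tree: ast.Module) -> set[str]:
    """Names bound to operator.add via `from operator import add [as x]`."""
    aliases = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom) and node.module == "operator":
            for alias in node.names:
                if alias.name == "add":
                    aliases.add(alias.asname or alias.name)
    return aliases

def _is_reducer_marker(node: ast.expr, aliases: set[str]) -> bool:
    """True for `operator.add` or any collected alias of it."""
    if isinstance(node, ast.Attribute):
        return (isinstance(node.value, ast.Name)
                and node.value.id == "operator"
                and node.attr == "add")
    return isinstance(node, ast.Name) and node.id in aliases

src = "from operator import add as concat"
aliases = _collect_reducer_aliases(ast.parse(src))
assert _is_reducer_marker(ast.parse("concat", mode="eval").body, aliases)
```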

4. antipattern_pack_publish/ (YAML + checker)

Severity: OK

Smoke-tier negative test for Critical Rule 10. Checker is clean — three regex checks (no pack, no publish, has deploy --my-workspace).

5. chat_react_agent/ (YAML + checker)

Severity: OK

E2E create_react_agent pattern test. Checker validates the import path (langgraph.prebuilt), UiPath LLM client import, tools=[...] with ≥1 entry (handles both kwarg and positional), and graph export. Lazy-init scan included.
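
For reference, a sketch of the agent shape this checker would accept — make_uipath_chat_model is a hypothetical stand-in for the UiPath LLM client import the checker actually requires, not a real API:

```python
from langgraph.prebuilt import create_react_agent

def lookup_weather(city: str) -> str:
    """Toy tool so the tools=[...] list has at least one entry."""
    return f"Sunny in {city}"

def make_uipath_chat_model():
    """Hypothetical stand-in for the UiPath LLM chat client."""
    raise NotImplementedError

def build_graph():
    # Constructing the client inside the factory (not at import time)
    # is the pattern the lazy-init scan enforces.
    llm = make_uipath_chat_model()
    return create_react_agent(llm, tools=[lookup_weather])
```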

6. edit_add_node/ (YAML + checker)

Severity: OK

First positive lifecycle:edit test — adds a second node + conditional edge, re-inits, runs both branches. The checker validates 2+ add_node calls, GraphOutput has both size and category fields (via AST), entry-points.json reflects the schema change, and outputs.json has both tiny_run and huge_run branches.

check_edit_add_node.py:129 — the outputs.json check uses expected_category not in flat (string containment in serialized JSON). This is loose but pragmatic — "tiny" and "huge" are unlikely to appear in unrelated fields.
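
A sketch of that containment check, assuming outputs.json is loaded as a dict (a reconstruction, not the checker's actual code):

```python
import json

def branch_present(outputs: dict, expected_category: str) -> bool:
    flat = json.dumps(outputs)        # serialize all of outputs.json
    return expected_category in flat  # substring match, not a path lookup

assert branch_present({"tiny_run": {"category": "tiny"}}, "tiny")
assert not branch_present({"tiny_run": {"category": "tiny"}}, "huge")
```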

7. eval_classification/ (YAML + checker)

Severity: OK

Covers uipath-binary-classification and uipath-multiclass-classification evaluators. Checker validates positiveClass on binary config, averaging on multiclass config, eval set refs, and expectedClass on each test case.

8. eval_output_suite/ (YAML + checker)

Severity: OK

Covers all four output evaluators. The REQUIRED_FIELD_BY_TYPE map correctly matches each typeId to its per-evaluator field (searchText, expectedOutput, expectedAgentBehavior). Clean structure.
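
A sketch of what that map plausibly looks like — the typeIds and field names appear above, but the exact typeId-to-field pairing shown here is assumed:

```python
REQUIRED_FIELD_BY_TYPE = {
    "uipath-contains": "searchText",
    "uipath-json-similarity": "expectedOutput",
    "uipath-llm-judge-output-strict-json-similarity": "expectedOutput",
    "uipath-llm-judge-trajectory-simulation": "expectedAgentBehavior",
}

def missing_required_field(test_case: dict, type_id: str) -> str | None:
    """Return the per-evaluator field the test case is missing, if any."""
    required = REQUIRED_FIELD_BY_TYPE.get(type_id)
    return required if required and required not in test_case else None
```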

9. eval_tool_call_suite/ (YAML + checker)

Severity: OK

Covers all four tool-call evaluators. Checker additionally validates version == "1.0" on the eval set — the only eval checker that does this.

Severity: Low — The version check at check_eval_tool_call_suite.py:84 (doc.get("version") != "1.0") is not present in the other two eval checkers (eval_classification, eval_output_suite). This is a minor inconsistency — either all eval checkers should validate the version or none should, to avoid one test being stricter than its siblings for no documented reason.

10. langgraph_supervisor/ (YAML + checker)

Severity: OK

Multi-agent supervisor topology test. Checker validates Router TypedDict with next field (AST), with_structured_output(Router) call, 2+ create_react_agent calls, all three nodes registered (supervisor, researcher, coder), conditional edges from supervisor, and loop-back edges from both workers. Thorough topology assertion.
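
A sketch of the topology the checker asserts, assuming any chat model that supports with_structured_output (the node names come from the review; everything else is illustrative):

```python
from typing import Literal, TypedDict
from langgraph.graph import END, MessagesState, StateGraph
from langgraph.prebuilt import create_react_agent

class Router(TypedDict):
    """Structured routing decision the supervisor must emit."""
    next: Literal["researcher", "coder", "FINISH"]

class State(MessagesState):
    next: str  # set by the supervisor, read by the conditional edge

def build_graph(llm):
    def supervisor(state: State) -> dict:
        decision = llm.with_structured_output(Router).invoke(state["messages"])
        return {"next": decision["next"]}

    builder = StateGraph(State)
    builder.add_node("supervisor", supervisor)
    builder.add_node("researcher", create_react_agent(llm, tools=[]))
    builder.add_node("coder", create_react_agent(llm, tools=[]))
    builder.add_conditional_edges(
        "supervisor", lambda s: s["next"],
        {"researcher": "researcher", "coder": "coder", "FINISH": END})
    builder.add_edge("researcher", "supervisor")  # worker loop-back
    builder.add_edge("coder", "supervisor")       # worker loop-back
    builder.set_entry_point("supervisor")
    return builder.compile()
```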

11. login_status_checkpoint/ (YAML only — no checker)

Severity: OK

Integration-tier auth checkpoint test. Uses built-in command_executed, file_exists, and file_contains success criteria — no custom checker needed. The file_contains criterion validates that Status, Organization, and Tenant are all present.

This is the only integration-tier task in the PR (all others are smoke or e2e), and the only one tagged lifecycle:activate — both are correct for an auth-surface test.

12. sdk_services_suite/ (YAML + checker)

Severity: OK

SDK Integration pattern test. Checker validates from uipath.platform import UiPath (not from uipath import UiPath), all three SDK call sites (assets.retrieve_credential_async, llm.chat_completions, attachments.upload), UiPath() constructed inside main() (AST walk), and bindings.json with SubType: "credentialAsset".
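
A sketch of that call surface — the import path and method names are taken from the review, but all arguments and signatures here are assumptions:

```python
import asyncio
from uipath.platform import UiPath  # import path asserted by the checker

async def main() -> None:
    sdk = UiPath()  # constructed inside main(), as the AST walk requires
    # The three call sites named above; argument shapes are illustrative.
    cred = await sdk.assets.retrieve_credential_async(name="MyCredential")
    reply = sdk.llm.chat_completions(
        messages=[{"role": "user", "content": "ping"}])
    sdk.attachments.upload(name="report.txt", content=b"ok")

if __name__ == "__main__":
    asyncio.run(main())
```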

13. solution_upload/ (YAML + checker)

Severity: OK

Delivery Option B (solution upload) test. Checker validates scaffold artifacts exist, .venv was stripped, .uipx archive exists with size ≥ 1024 bytes, and upload_marker.txt references the actual archive name.
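
A sketch of those archive checks (a reconstruction; the file names are from the review, the logic is assumed):

```python
from pathlib import Path

archive = next(Path(".").glob("*.uipx"), None)
assert archive is not None, "no .uipx archive produced"
assert archive.stat().st_size >= 1024, "archive suspiciously small"
assert not Path(".venv").exists(), ".venv must be stripped before upload"
marker = Path("upload_marker.txt").read_text()
assert archive.name in marker, "marker must reference the actual archive"
```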

What's Missing

  • eval_classification minimum test cases: check_eval_classification.py:96 fails when len(cases) < 2 (i.e., it requires ≥2 test cases), while the YAML prompt asks for "three test cases (one per sentiment)". The other eval checkers also require ≥2. This is fine — the checker is more lenient than the prompt, which is the right direction. Nothing to fix.
  • Contradiction with deploy_my_workspace: The pre-existing deploy_my_workspace task description mentions "pack → publish (or deploy)" as valid, while the new antipattern_pack_publish task says pack/publish are wrapper-disabled. This appears to be a pre-existing inconsistency (Rule 10 may have been added after the deploy test was written). Not introduced by this PR, but worth a follow-up to align the deploy test description.

Area Ratings

  • Frontmatter — N/A: no skill SKILL.md changes (test-only PR).
  • E2E Tests — OK: all 12 tasks well-formed; correct task_id pattern (skill-agent-coded-*), first tag is uipath-agents, required tags present (skill + tier + lifecycle:), feature tags use established vocabulary.
  • Skill Body — N/A: no skill content changes.
  • References & Assets — N/A: no reference/asset changes.
  • Repo Hygiene — OK: no secrets, no cross-skill refs, changes scoped to tests/tasks/uipath-agents/ + one command file.

Issues for Manual Review

  • CLI command correctness: The eval tasks instruct the agent to run uip codedagent eval ... --no-report. Verify this flag exists and behaves as described (saves results locally without uploading). Automated review cannot validate actual CLI surface.
  • uip codedagent run vs uip codedagent dev: The antipattern_dev_non_interactive test claims dev auto-appends --interactive. Verify this is still the current behavior of the wrapper.
  • pack/publish wrapper-disabled status: The antipattern_pack_publish test claims these commands fail with "command is disabled". Verify this matches the current uip wrapper behavior — the pre-existing deploy_my_workspace test seems to still accept pack+publish as valid (see What's Missing).
  • UiPath() import path: sdk_services_suite checker asserts from uipath.platform import UiPath. Verify this is still the canonical import (not from uipath import UiPath).

Conclusion

Approve. This is a clean, well-structured PR that significantly improves test coverage. All 12 tasks follow established conventions, tag taxonomy is consistent, checkers are thorough with good AST-based validation, and shared helpers are reused without modification. The only actionable finding is a minor inconsistency (version check in one eval checker but not the others — Low severity). The pre-existing deploy_my_workspace description conflict is worth a follow-up but is not blocking.


github-actions Bot commented May 7, 2026

Claude finished @radugheo's task in 6m 25s


Coder-eval task lint (advisory)

12 task YAMLs added; verdicts (after theme downgrade): 0 Critical, 0 High, 1 Medium, 2 Low, 9 OK. 1 theme at High severity.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

High — PR body claims check_*.py dry-runs green against synthetic projects, but does not claim the coder-eval tasks themselves were run end-to-end (i.e., an agent given the initial_prompt produced passing output). Please edit the PR description to add a line like: Ran skill-agent-coded-antipattern-dev-non-interactive (and 11 others) locally and they passed.

Per-task lint

antipattern_dev_non_interactive/antipattern_dev_non_interactive.yaml — verdict: OK

antipattern_output_aggregation/antipattern_output_aggregation.yaml — verdict: OK (theme-captured; see Theme 1)

antipattern_pack_publish/antipattern_pack_publish.yaml — verdict: OK

chat_react_agent/chat_react_agent.yaml — verdict: OK

edit_add_node/edit_add_node.yaml — verdict: OK

eval_classification/eval_classification.yaml — verdict: OK

eval_output_suite/eval_output_suite.yaml — verdict: OK

eval_tool_call_suite/eval_tool_call_suite.yaml — verdict: OK

langgraph_supervisor/langgraph_supervisor.yaml — verdict: Low

Issues:

  • [Low] Prompt over-specification: wiring requirements (lines 32–43) prescribe the exact supervisor topology — Router TypedDict, with_structured_output, three named nodes, conditional edge routing, worker loop-back — reducing the skill's role to code-generation rather than pattern selection.

Suggested fixes:

  • Replace the detailed wiring list with a goal like "Build a multi-agent supervisor that routes between a researcher and a coder worker" and let the skill teach the documented Router / with_structured_output / worker-loop pattern.

login_status_checkpoint/login_status_checkpoint.yaml — verdict: Low (theme-captured; see Theme 1)

sdk_services_suite/sdk_services_suite.yaml — verdict: OK

solution_upload/solution_upload.yaml — verdict: Medium (theme-captured; see Theme 1)

Issues:

  • [High → Medium via theme] Prompt over-specification: lines 38–43 give the exact five-command solution-upload sequence verbatim (uip solution new → cd → uip solution project import --source → rm -rf .venv → uip solution upload). Any agent can follow this recipe without invoking the skill. The .venv strip rationale (lines 46–48) is also inlined. This is the most prescriptive prompt in the PR.

Suggested fixes:

  • Replace the verbatim command block with a goal: "Perform the Studio Web solution-upload hand-off (Delivery Option B) for the agent." Keep the solution name UploadSmokeSol as a ground-truth anchor but let the skill teach the import/strip/upload sequence.
  • Move the .venv strip explanation to a description comment in the YAML (documents intent without leaking procedure to the agent).

Within-PR duplicates

No duplicate clusters detected. The three eval-suite tasks (classification, output_suite, tool_call_suite) share scaffold structure but exercise materially distinct evaluator families — good template reuse.

Themes

  • [High] Theme 1 — Prompt prescribes CLI procedures the skill should teach. Three tasks include CLI command sequences in initial_prompt that the skill's workflow is supposed to teach: antipattern_output_aggregation (scaffold steps in a parenthetical, lines 29–30), login_status_checkpoint (6-step scaffold recipe, lines 33–34), solution_upload (full 5-command upload recipe, lines 38–43). Severity scales with how much of the core tested content is given away — solution_upload is the worst offender (the entire delivery sequence is verbatim), while the other two prescribe setup steps that are ancillary to their core test goal.

Conclusion

⚠ 3 task(s) have issues (max severity High via theme); 1 additional task has an independent Low finding. Evidence of passing run is missing from the PR description. Advisory only — not blocking merge.



github-actions Bot commented May 7, 2026

Claude finished @radugheo's task in 4m 34s


Coder-eval task lint (advisory)

17 task YAMLs changed (12 added, 5 modified); verdicts: 0 Critical, 0 High, 0 Medium, 1 Low, 16 OK.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

High — PR body claims check_*.py dry-runs green against synthetic projects, but does not claim the coder-eval tasks themselves were run end-to-end (i.e., an agent given the initial_prompt produced passing output). Please edit the PR description to add a line like: Ran skill-agent-coded-antipattern-dev-non-interactive (and 11 others) locally and they passed.

Per-task lint

antipattern_dev_non_interactive/antipattern_dev_non_interactive.yaml — verdict: OK

antipattern_output_aggregation/antipattern_output_aggregation.yaml — verdict: OK

antipattern_pack_publish/antipattern_pack_publish.yaml — verdict: OK

chat_react_agent/chat_react_agent.yaml — verdict: OK

edit_add_node/edit_add_node.yaml — verdict: OK

eval_classification/eval_classification.yaml — verdict: OK

eval_output_suite/eval_output_suite.yaml — verdict: OK

eval_tool_call_suite/eval_tool_call_suite.yaml — verdict: OK

langgraph_supervisor/langgraph_supervisor.yaml — verdict: Low

Issues:

  • [Low] Prompt over-specification: wiring requirements (lines 31–43) prescribe the exact supervisor topology — Router TypedDict, with_structured_output, three named nodes, conditional edge routing, worker loop-back. The prompt describes the desired architecture in enough detail that an agent could implement it from the requirements alone without consulting the skill's supervisor pattern docs.

Suggested fixes:

  • Replace the detailed wiring list with a goal like "Build a multi-agent supervisor that routes between a researcher and a coder worker, using the documented supervisor pattern" and let the skill teach the Router / with_structured_output / worker-loop details. Keep the node names (supervisor, researcher, coder) as ground-truth anchors for the checker.

login_status_checkpoint/login_status_checkpoint.yaml — verdict: OK

sdk_services_suite/sdk_services_suite.yaml — verdict: OK

solution_upload/solution_upload.yaml — verdict: OK

bindings_sync.yaml (modified) — verdict: OK

eval_exact_match/eval_exact_match.yaml (modified) — verdict: OK

eval_llm_judges/eval_llm_judges.yaml (modified) — verdict: OK

openai_agents_handoff/openai_agents_handoff.yaml (modified) — verdict: OK

subtype_credential_asset/subtype_credential_asset.yaml (modified) — verdict: OK

Within-PR duplicates

No duplicate clusters detected. The three eval-suite tasks (classification, output_suite, tool_call_suite) share scaffold structure but exercise materially distinct evaluator families — good template reuse. The three antipattern tasks each target a different Critical Rule.

Conclusion

⚠ 1 task has issues (max severity Low). Evidence of passing run is missing from the PR description. Advisory only — not blocking merge.

radugheo force-pushed the test/fill-coded-agents-tests branch from f02375f to eb1a4fd on May 7, 2026 at 20:18

github-actions Bot commented May 7, 2026

Claude finished @radugheo's task in 4m 18s


  • Read lint rubric (.claude/commands/lint-task.md)
  • Identify changed task YAMLs (15 files: 11 added, 4 modified)
  • Check PR body for evidence of passing run
  • Read all changed task YAMLs + nearby siblings for duplicate detection
  • Apply rubric to each changed file
  • Within-PR duplicate check
  • Post final lint results

Coder-eval task lint (advisory)

15 task YAMLs changed (11 added, 4 modified); verdicts: 0 Critical, 0 High, 0 Medium, 1 Low, 14 OK.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

High — PR body claims check_*.py dry-runs green against synthetic projects, but does not claim the coder-eval tasks themselves were run end-to-end (i.e., an agent given the initial_prompt produced passing output). Please edit the PR description to add a line like: Ran skill-agent-coded-antipattern-dev-non-interactive (and 10 others) locally and they passed.

Per-task lint

antipattern_dev_non_interactive/antipattern_dev_non_interactive.yaml — verdict: OK

antipattern_output_aggregation/antipattern_output_aggregation.yaml — verdict: OK

antipattern_pack_publish/antipattern_pack_publish.yaml — verdict: OK

chat_react_agent/chat_react_agent.yaml — verdict: OK

edit_add_node/edit_add_node.yaml — verdict: OK

eval_classification/eval_classification.yaml — verdict: OK

eval_output_suite/eval_output_suite.yaml — verdict: OK

eval_tool_call_suite/eval_tool_call_suite.yaml — verdict: OK

langgraph_supervisor/langgraph_supervisor.yaml — verdict: Low

Issues:

  • [Low] Prompt over-specification: lines 31–45 prescribe the full supervisor topology — Router TypedDict shape, with_structured_output(Router) call, three named nodes, conditional edge routing, worker loop-back edges. An agent familiar with LangGraph could implement this from the prompt alone without consulting the skill's supervisor pattern docs. Rated Low (not Medium) because the prompt describes the desired architecture rather than CLI procedure, and the skill is still needed for UiPath-specific scaffolding (uip codedagent new/init), lazy-init rules, and graph export conventions.

Suggested fixes:

  • Replace the detailed wiring list (lines 31–45) with a goal like "Build a multi-agent supervisor that routes between a researcher and a coder worker, using the documented supervisor pattern." Keep the node names (supervisor, researcher, coder) as ground-truth anchors for the checker, but let the skill teach the Router TypedDict / with_structured_output / worker-loop wiring.

login_status_checkpoint/login_status_checkpoint.yaml — verdict: OK

sdk_services_suite/sdk_services_suite.yaml — verdict: OK

bindings_sync.yaml (modified) — verdict: OK

Good change: removed the 5-step procedure prescription (lines 89–93 on main) that leaked the scanning/resolution steps. Also fixed python → python3 in the run_command criterion.

eval_exact_match/eval_exact_match.yaml (modified) — verdict: OK

Good change: timeout bump (1200→1800) and de-specified "ExactMatch evaluator" to "deterministic-output evaluator", letting the skill pick the right typeId.

eval_llm_judges/eval_llm_judges.yaml (modified) — verdict: OK

Good change: removed leaked evaluator IDs (LLMJudgeOutputEvaluator, LLMJudgeTrajectoryEvaluator) and field names (expectedOutput, expectedAgentBehavior) from the prompt. Now describes the evaluators by behavior rather than by internal name.

openai_agents_handoff/openai_agents_handoff.yaml (modified) — verdict: OK

Good change: removed prescriptive Pydantic model naming (CustomerInput), JSON registration format, and exact uip codedagent run invocation syntax from the prompt. Now goal-oriented.

Within-PR duplicates

No duplicate clusters detected. The three eval-suite tasks (classification, output_suite, tool_call_suite) share scaffold structure but exercise materially distinct evaluator families — good template reuse. The three antipattern tasks each target a different Critical Rule (C7, C8/C9, C10).

Conclusion

⚠ 1 task has issues (max severity Low). Evidence of passing run is missing from the PR description. The 4 modified tasks are all quality improvements — de-specifying prompts and fixing minor issues. Advisory only — not blocking merge.

