
fix: pin verl<0.7.0 to prevent incompatible upgrade breaking vLLM async server#499

Draft
Copilot wants to merge 4 commits into main from copilot/fix-calc-x-example

Conversation

Contributor

Copilot AI commented Feb 28, 2026

The "latest" CI configuration runs uv lock --upgrade, which upgraded verl from 0.6.1 to 0.7.0. verl 0.7.0 has breaking API changes (SPMD rollout removal, ResourcePool/WorkerGroup changes) that prevent the vLLM async server from starting, resulting in ConnectionRefusedError.

Root cause analysis

By comparing the dependency artifacts from passing vs failing CI runs:

  • stable (passing): verl==0.6.1, vllm==0.11.0
  • latest (failing): verl==0.7.0, vllm==0.11.0

Both configs use the same vLLM version. The root cause is the verl 0.7.0 upgrade, not vLLM itself.
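One lightweight way to make this kind of regression fail fast is to check the installed verl version against the intended range at import time. The sketch below uses a plain tuple comparison over release segments; the helper names (`parse`, `verl_is_compatible`, `check_installed`) are illustrative and not part of the repository:

```python
from importlib.metadata import PackageNotFoundError, version


def parse(v: str) -> tuple:
    # Compare release segments numerically: "0.6.1" -> (0, 6, 1).
    # (Pre-release suffixes are out of scope for this sketch.)
    return tuple(int(part) for part in v.split(".")[:3])


def verl_is_compatible(v: str) -> bool:
    # Mirrors the pin from this PR: verl>=0.6.0,<0.7.0
    return parse("0.6.0") <= parse(v) < parse("0.7.0")


def check_installed() -> None:
    try:
        installed = version("verl")
    except PackageNotFoundError:
        return  # verl extra not installed; nothing to check
    if not verl_is_compatible(installed):
        raise RuntimeError(
            f"verl {installed} is outside the supported range >=0.6.0,<0.7.0"
        )
```

A check like this turns a late ConnectionRefusedError deep inside a rollout into an immediate, self-explanatory failure.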

Changes

  • Pinned verl>=0.6.0,<0.7.0 in the torch-gpu-stable dependency group in pyproject.toml
  • Pinned verl>=0.5.0,<0.7.0 in the verl optional extra
  • Updated uv.lock to reflect the version constraints
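In pyproject.toml, the pins described above would look roughly like this. The group and extra names follow the ones mentioned in this PR; surrounding entries are omitted, so treat this as a sketch rather than the repository's exact file:

```toml
[project.optional-dependencies]
verl = [
    "verl>=0.5.0,<0.7.0",
]

[dependency-groups]
torch-gpu-stable = [
    "verl>=0.6.0,<0.7.0",
]
```

After editing the constraints, running `uv lock` regenerates uv.lock under the new bounds, so subsequent `uv lock --upgrade` runs can no longer resolve verl to 0.7.0.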
Original prompt

This section details the original issue you should resolve

<issue_title>calc-x example fails on next</issue_title>
<issue_description>2026-02-27T19:28:35Z (TaskRunner pid=7250):

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/openai/_base_client.py", line 1604, in request
    response = await self._client.send(
    ...<3 lines>...
    )
  File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/httpx/_client.py", line 1629, in send
    response = await self._send_handling_auth(
    ...<4 lines>...
    )
  File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/httpx/_client.py", line 1657, in _send_handling_auth
    response = await self._send_handling_redirects(
    ...<3 lines>...
    )
  File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/httpx/_client.py", line 1694, in _send_handling_redirects
    response = await self._send_single_request(request)
  File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/httpx/_client.py", line 1730, in _send_single_request
    response = await transport.handle_async_request(request)
  File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/litellm/llms/custom_httpx/aiohttp_transport.py", line 272, in handle_async_request
    with map_aiohttp_exceptions():
  File "/home/cloudtest/.local/share/uv/python/cpython-3.13.12-linux-x86_64-gnu/lib/python3.13/contextlib.py", line 162, in __exit__
    self.gen.throw(value)
  File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/litellm/llms/custom_httpx/aiohttp_transport.py", line 74, in map_aiohttp_exceptions
    raise mapped_exc(message) from exc
httpx.ConnectError: Cannot connect to host 10.1.0.15:35035 ssl:default [Connect call failed ('10.1.0.15', 35035)]

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/litellm/llms/openai/openai....



Copilot AI and others added 2 commits February 28, 2026 03:31
… failed rollouts

Co-authored-by: ultmaster <8463288+ultmaster@users.noreply.github.com>
Co-authored-by: ultmaster <8463288+ultmaster@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix calc-x example failing on next" to "fix: guard compute_data_metrics against empty tensors from failed rollouts" on Feb 28, 2026
@ultmaster
Contributor

@copilot this is not the correct fix. Instead of guarding against the failed rollouts, you should find out why the vLLM connection fails.

…nc server

Co-authored-by: ultmaster <8463288+ultmaster@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 28, 2026

@copilot this is not the correct fix. Instead of guarding against the failed rollouts, you should find out why the vLLM connection fails.

You're right — the empty tensor guard was treating the symptom, not the cause. I investigated the dependency artifacts from the failing CI runs:

  • stable (passing): verl==0.6.1, vllm==0.11.0
  • latest (failing): verl==0.7.0, vllm==0.11.0

Both use the same vLLM version. The root cause is verl 0.7.0 being pulled in by uv lock --upgrade. verl 0.7.0 has breaking API changes (SPMD rollout removal, ResourcePool/WorkerGroup changes) that prevent the vLLM async server from starting, causing ConnectionRefusedError.

Reverted the compute_data_metrics guards and instead pinned verl>=0.6.0,<0.7.0 in pyproject.toml + updated uv.lock. See commit 7690d1e.

Copilot AI changed the title from "fix: guard compute_data_metrics against empty tensors from failed rollouts" to "fix: pin verl<0.7.0 to prevent incompatible upgrade breaking vLLM async server" on Feb 28, 2026


Development

Successfully merging this pull request may close these issues.

calc-x example fails on next
