
fix: pin verl<0.7.0 to prevent incompatible upgrade breaking vLLM async server#499

Draft
Copilot wants to merge 4 commits into main from copilot/fix-calc-x-example

Conversation

Contributor

Copilot AI commented Feb 28, 2026

The "latest" CI configuration runs uv lock --upgrade, which upgraded verl from 0.6.1 to 0.7.0. verl 0.7.0 has breaking API changes (SPMD rollout removal, ResourcePool/WorkerGroup changes) that prevent the vLLM async server from starting, resulting in ConnectionRefusedError.

Root cause analysis

By comparing the dependency artifacts from passing vs failing CI runs:

  • stable (passing): verl==0.6.1, vllm==0.11.0
  • latest (failing): verl==0.7.0, vllm==0.11.0

Both configs use the same vLLM version. The root cause is the verl 0.7.0 upgrade, not vLLM itself.
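One lightweight way to make this kind of regression fail fast is to check the installed verl version against the intended range at import time. The sketch below uses a plain tuple comparison over release segments; the helper names (`parse`, `verl_is_compatible`, `check_installed`) are illustrative and not part of the repository:

```python
from importlib.metadata import PackageNotFoundError, version


def parse(v: str) -> tuple:
    # Compare release segments numerically: "0.6.1" -> (0, 6, 1).
    # (Pre-release suffixes are out of scope for this sketch.)
    return tuple(int(part) for part in v.split(".")[:3])


def verl_is_compatible(v: str) -> bool:
    # Mirrors the pin from this PR: verl>=0.6.0,<0.7.0
    return parse("0.6.0") <= parse(v) < parse("0.7.0")


def check_installed() -> None:
    try:
        installed = version("verl")
    except PackageNotFoundError:
        return  # verl extra not installed; nothing to check
    if not verl_is_compatible(installed):
        raise RuntimeError(
            f"verl {installed} is outside the supported range >=0.6.0,<0.7.0"
        )
```

A check like this turns a late ConnectionRefusedError deep inside a rollout into an immediate, self-explanatory failure.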

Changes

  • Pinned verl>=0.6.0,<0.7.0 in the torch-gpu-stable dependency group in pyproject.toml
  • Pinned verl>=0.5.0,<0.7.0 in the verl optional extra
  • Updated uv.lock to reflect the version constraints
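In pyproject.toml, the pins described above would look roughly like this. The group and extra names follow the ones mentioned in this PR; surrounding entries are omitted, so treat this as a sketch rather than the repository's exact file:

```toml
[project.optional-dependencies]
verl = [
    "verl>=0.5.0,<0.7.0",
]

[dependency-groups]
torch-gpu-stable = [
    "verl>=0.6.0,<0.7.0",
]
```

After editing the constraints, running `uv lock` regenerates uv.lock under the new bounds, so subsequent `uv lock --upgrade` runs can no longer resolve verl to 0.7.0.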
Original prompt

This section details the original issue you should resolve

<issue_title>calc-x example fails on next</issue_title>
<issue_description>2026-02-27T19:28:35Z (TaskRunner pid=7250):

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/openai/_base_client.py", line 1604, in request
    response = await self._client.send(
    ...<3 lines>...
    )
  File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/httpx/_client.py", line 1629, in send
    response = await self._send_handling_auth(
    ...<4 lines>...
    )
  File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/httpx/_client.py", line 1657, in _send_handling_auth
    response = await self._send_handling_redirects(
    ...<3 lines>...
    )
  File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/httpx/_client.py", line 1694, in _send_handling_redirects
    response = await self._send_single_request(request)
  File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/httpx/_client.py", line 1730, in _send_single_request
    response = await transport.handle_async_request(request)
  File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/litellm/llms/custom_httpx/aiohttp_transport.py", line 272, in handle_async_request
    with map_aiohttp_exceptions():
  File "/home/cloudtest/.local/share/uv/python/cpython-3.13.12-linux-x86_64-gnu/lib/python3.13/contextlib.py", line 162, in __exit__
    self.gen.throw(value)
  File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/litellm/llms/custom_httpx/aiohttp_transport.py", line 74, in map_aiohttp_exceptions
    raise mapped_exc(message) from exc
httpx.ConnectError: Cannot connect to host 10.1.0.15:35035 ssl:default [Connect call failed ('10.1.0.15', 35035)]

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/litellm/llms/openai/openai....



Copilot AI and others added 2 commits February 28, 2026 03:31
… failed rollouts

Co-authored-by: ultmaster <8463288+ultmaster@users.noreply.github.com>
Co-authored-by: ultmaster <8463288+ultmaster@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix calc-x example failing on next" to "fix: guard compute_data_metrics against empty tensors from failed rollouts" on Feb 28, 2026
@ultmaster
Contributor

@copilot this is not the correct fix. Instead of guarding against the failed rollouts, you should find out why the vLLM connection fails.

…nc server

Co-authored-by: ultmaster <8463288+ultmaster@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 28, 2026

@copilot this is not the correct fix. Instead of guarding against the failed rollouts, you should find out why the vLLM connection fails.

You're right — the empty tensor guard was treating the symptom, not the cause. I investigated the dependency artifacts from the failing CI runs:

  • stable (passing): verl==0.6.1, vllm==0.11.0
  • latest (failing): verl==0.7.0, vllm==0.11.0

Both use the same vLLM version. The root cause is verl 0.7.0 being pulled in by uv lock --upgrade. verl 0.7.0 has breaking API changes (SPMD rollout removal, ResourcePool/WorkerGroup changes) that prevent the vLLM async server from starting, causing ConnectionRefusedError.

Reverted the compute_data_metrics guards and instead pinned verl>=0.6.0,<0.7.0 in pyproject.toml + updated uv.lock. See commit 7690d1e.

Copilot AI changed the title from "fix: guard compute_data_metrics against empty tensors from failed rollouts" to "fix: pin verl<0.7.0 to prevent incompatible upgrade breaking vLLM async server" on Feb 28, 2026


Development

Successfully merging this pull request may close these issues.

calc-x example fails on next
