fix: pin verl<0.7.0 to prevent incompatible upgrade breaking vLLM async server #499
… failed rollouts Co-authored-by: ultmaster <8463288+ultmaster@users.noreply.github.com>
@copilot this is not the correct fix. Instead of guarding against the failed rollouts, you should find out why the vLLM connection fails.
…nc server Co-authored-by: ultmaster <8463288+ultmaster@users.noreply.github.com>
You're right — the empty tensor guard was treating the symptom, not the cause. I investigated the dependency artifacts from the failing CI runs:
Both use the same vLLM version. The root cause is the verl 0.7.0 upgrade, not vLLM itself. Reverted the …
The "latest" CI configuration runs `uv lock --upgrade`, which upgraded `verl` from 0.6.1 to 0.7.0. verl 0.7.0 has breaking API changes (SPMD rollout removal, ResourcePool/WorkerGroup changes) that prevent the vLLM async server from starting, resulting in `ConnectionRefusedError`.
Root cause analysis
By comparing the dependency artifacts from passing vs failing CI runs:
- Passing: `verl==0.6.1`, `vllm==0.11.0`
- Failing: `verl==0.7.0`, `vllm==0.11.0`

Both configs use the same vLLM version. The root cause is the verl 0.7.0 upgrade, not vLLM itself.
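The comparison above can be sketched as a small script. This is only an illustration: it assumes the dependency artifacts are plain `name==version` lists (pip-freeze style), which the PR does not confirm, and `parse_pins` and `diff_pins` are hypothetical helper names, not part of the repository.

```python
from __future__ import annotations


def parse_pins(text: str) -> dict[str, str]:
    """Parse 'name==version' lines into a {name: version} mapping."""
    pins: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and anything that is not an exact pin
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def diff_pins(passing: str, failing: str) -> dict[str, tuple[str | None, str | None]]:
    """Return packages whose pinned version differs between the two runs."""
    old, new = parse_pins(passing), parse_pins(failing)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }


# Assumed artifact contents, reduced to the two packages named in this PR.
passing = "verl==0.6.1\nvllm==0.11.0\n"
failing = "verl==0.7.0\nvllm==0.11.0\n"

print(diff_pins(passing, failing))  # {'verl': ('0.6.1', '0.7.0')}
```

Only `verl` shows up in the diff, which matches the conclusion that the vLLM version is unchanged between the two runs.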
Changes
- Pin `verl>=0.6.0,<0.7.0` in the `torch-gpu-stable` dependency group in `pyproject.toml`
- Pin `verl>=0.5.0,<0.7.0` in the `verl` optional extra
- Update `uv.lock` to reflect the version constraints

Original prompt
This section details the original issue you should resolve
<issue_title>calc-x example fails on next</issue_title>
<issue_description>2026-02-27T19:28:35.0128090Z (TaskRunner pid=7250) The above exception was the direct cause of the following exception:
2026-02-27T19:28:35.0128599Z (TaskRunner pid=7250)
2026-02-27T19:28:35.0128990Z (TaskRunner pid=7250) Traceback (most recent call last):
2026-02-27T19:28:35.0129903Z (TaskRunner pid=7250) File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/openai/_base_client.py", line 1604, in request
2026-02-27T19:28:35.0130773Z (TaskRunner pid=7250) response = await self._client.send(
2026-02-27T19:28:35.0131268Z (TaskRunner pid=7250) ^^^^^^^^^^^^^^^^^^^^^^^^
2026-02-27T19:28:35.0131882Z (TaskRunner pid=7250) ...<3 lines>...
2026-02-27T19:28:35.0132226Z (TaskRunner pid=7250) )
2026-02-27T19:28:35.0132529Z (TaskRunner pid=7250) ^
2026-02-27T19:28:35.0133299Z (TaskRunner pid=7250) File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/httpx/_client.py", line 1629, in send
2026-02-27T19:28:35.0134157Z (TaskRunner pid=7250) response = await self._send_handling_auth(
2026-02-27T19:28:35.0134874Z (TaskRunner pid=7250) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-02-27T19:28:35.0135322Z (TaskRunner pid=7250) ...<4 lines>...
2026-02-27T19:28:35.0135660Z (TaskRunner pid=7250) )
2026-02-27T19:28:35.0135959Z (TaskRunner pid=7250) ^
2026-02-27T19:28:35.0136860Z (TaskRunner pid=7250) File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/httpx/_client.py", line 1657, in _send_handling_auth
2026-02-27T19:28:35.0137842Z (TaskRunner pid=7250) response = await self._send_handling_redirects(
2026-02-27T19:28:35.0138479Z (TaskRunner pid=7250) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-02-27T19:28:35.0139027Z (TaskRunner pid=7250) ...<3 lines>...
2026-02-27T19:28:35.0139441Z (TaskRunner pid=7250) )
2026-02-27T19:28:35.0140052Z (TaskRunner pid=7250) ^
2026-02-27T19:28:35.0140974Z (TaskRunner pid=7250) File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/httpx/_client.py", line 1694, in _send_handling_redirects
2026-02-27T19:28:35.0142294Z (TaskRunner pid=7250) response = await self._send_single_request(request)
2026-02-27T19:28:35.0142913Z (TaskRunner pid=7250) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-02-27T19:28:35.0144052Z (TaskRunner pid=7250) File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/httpx/_client.py", line 1730, in _send_single_request
2026-02-27T19:28:35.0145249Z (TaskRunner pid=7250) response = await transport.handle_async_request(request)
2026-02-27T19:28:35.0145969Z (TaskRunner pid=7250) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-02-27T19:28:35.0147192Z (TaskRunner pid=7250) File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/litellm/llms/custom_httpx/aiohttp_transport.py", line 272, in handle_async_request
2026-02-27T19:28:35.0148324Z (TaskRunner pid=7250) with map_aiohttp_exceptions():
2026-02-27T19:28:35.0148944Z (TaskRunner pid=7250) ~~~~~~~~~~~~~~~~~~~~~~^^
2026-02-27T19:28:35.0149885Z (TaskRunner pid=7250) File "/home/cloudtest/.local/share/uv/python/cpython-3.13.12-linux-x86_64-gnu/lib/python3.13/contextlib.py", line 162, in __exit__
2026-02-27T19:28:35.0150764Z (TaskRunner pid=7250) self.gen.throw(value)
2026-02-27T19:28:35.0151174Z (TaskRunner pid=7250) ~~~~~~~~~~~~~~^^^^^^^
2026-02-27T19:28:35.0152980Z (TaskRunner pid=7250) File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/litellm/llms/custom_httpx/aiohttp_transport.py", line 74, in map_aiohttp_exceptions
2026-02-27T19:28:35.0154190Z (TaskRunner pid=7250) raise mapped_exc(message) from exc
2026-02-27T19:28:35.0155122Z (TaskRunner pid=7250) httpx.ConnectError: Cannot connect to host 10.1.0.15:35035 ssl:default [Connect call failed ('10.1.0.15', 35035)]
2026-02-27T19:28:35.0155785Z (TaskRunner pid=7250)
2026-02-27T19:28:35.0156302Z (TaskRunner pid=7250) The above exception was the direct cause of the following exception:
2026-02-27T19:28:35.0156815Z (TaskRunner pid=7250)
2026-02-27T19:28:35.0157200Z (TaskRunner pid=7250) Traceback (most recent call last):
2026-02-27T19:28:35.0158203Z (TaskRunner pid=7250) File "/mnt/vss/_work/agent-lightning/agent-lightning/.venv/lib/python3.13/site-packages/litellm/llms/openai/openai....