-
Notifications
You must be signed in to change notification settings - Fork 233
[train] Enable RayPrometheusStatLogger for async vLLM engine #900
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Add support for vLLM's RayPrometheusStatLogger to enable metrics collection
via Ray's metrics infrastructure for the async vLLM inference engine.
Changes:
- Add `enable_ray_prometheus_stats` config option (default: false)
- Pass config through main_base.py and ray_wrapped_inference_engine.py
- Modify AsyncVLLMInferenceEngine to create and use RayPrometheusStatLogger
when enabled (requires vLLM >= 0.9.0)
- Add unit tests for the new feature
Usage:
generator:
async_engine: true
enable_ray_prometheus_stats: true
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request adds support for vLLM's RayPrometheusStatLogger to enable metrics collection for the async vLLM engine. The changes include adding a new configuration option, passing it through the necessary components, and conditionally enabling the stat logger in AsyncVLLMInferenceEngine. The implementation correctly handles different vLLM versions by using a try-except block for the import. Unit tests have been added to verify the new functionality.
My review focuses on improving documentation clarity and test coverage. I've suggested updating a docstring to match the implementation and adding a test case for the failure scenario where the required vLLM module is not available.
| import sys | ||
|
|
||
|
|
||
| class TestRayPrometheusStatLoggers: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's great that you've added tests for the success case. To make the tests more robust, consider adding a test case for the failure scenario where RayPrometheusStatLogger cannot be imported (e.g., on older vLLM versions). This would verify that the method correctly returns None and logs a warning.
You could add a test like this to TestRayPrometheusStatLoggers:
def test_create_ray_prometheus_stat_loggers_v1_unavailable(self):
"""Test that None is returned when vLLM v1 API is not available."""
# By setting the module to None in sys.modules, the import will fail.
with patch.dict(sys.modules, {"vllm.v1.metrics.ray_wrappers": None}):
from skyrl_train.inference_engines.vllm.vllm_engine import AsyncVLLMInferenceEngine
# Create a minimal instance without actually initializing the engine
engine = object.__new__(AsyncVLLMInferenceEngine)
with patch("skyrl_train.inference_engines.vllm.vllm_engine.logger") as mock_logger:
result = engine._create_ray_prometheus_stat_loggers()
assert result is None
mock_logger.warning.assert_called_once()
assert "not available in this vLLM version" in mock_logger.warning.call_args[0][0]Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Test verifies that _create_ray_prometheus_stat_loggers returns None and logs a warning when vLLM v1 API is not available (e.g., on older vLLM versions). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
@kanwang really appreciate the PR! Could you run the |
Thanks! just did. |
|
Ah I see the CI still failing due to vllm not being available in the CPU test (only available in GPU test). I can fix it myself |
Add support for vLLM's RayPrometheusStatLogger to enable metrics collection via Ray's metrics infrastructure for the async vLLM inference engine.
Changes:
enable_ray_prometheus_statsconfig option (default: false)Usage:
generator: async_engine: true enable_ray_prometheus_stats: true