Use Ray to validate that allocated GPUs correspond to requested # of GPUs #1606
mnoukhov wants to merge 2 commits into
Code Review
This pull request introduces a GPU allocation validation function to ensure that the Ray cluster resources match the expected configuration for learners and vLLM engines. Review feedback indicates that the current implementation fails to account for tensor parallelism and single GPU mode, which would result in incorrect validation. Suggestions were provided to update the validation logic, its call sites, and the associated unit tests.
| "env_vars": {k: v for k, v in os.environ.items() if k not in EXCLUDED_ENV_VARS}, | ||
| } | ||
| ) | ||
| validate_allocated_gpus(tuple(args.num_learners_per_node), vllm_config.vllm_num_engines) |
Update the call to validate_allocated_gpus to include the missing configuration parameters required for accurate GPU validation.
| validate_allocated_gpus(tuple(args.num_learners_per_node), vllm_config.vllm_num_engines) | |
| validate_allocated_gpus(tuple(args.num_learners_per_node), vllm_config.vllm_num_engines, vllm_config.vllm_tensor_parallel_size, args.single_gpu_mode) |
| ray.init(**ray_init_kwargs) | ||
| grpo_utils.validate_allocated_gpus( | ||
| tuple(args.num_learners_per_node), | ||
| vllm_config.vllm_num_engines, |
Let's pass vllm_config in here instead?
| expected_vllm_gpus = 0 if single_gpu_mode else vllm_num_engines * vllm_tensor_parallel_size | ||
| expected_gpus = total_learners + expected_vllm_gpus | ||
| allocated_gpus: float = ray.cluster_resources().get("GPU", 0.0) | ||
| if not math.isclose(allocated_gpus, expected_gpus): |
It's integer math, so we can check equality!
| ) | ||
| expected_vllm_gpus = 0 if single_gpu_mode else vllm_num_engines * vllm_tensor_parallel_size | ||
| expected_gpus = total_learners + expected_vllm_gpus | ||
| allocated_gpus: float = ray.cluster_resources().get("GPU", 0.0) |
Make this an int! Also let's not default to 0. When would "GPU" not be present? Only if we're running on CPU, right?
I'll have to cast it, as Ray returns a float for some reason, but I can do that.
Let's convert it?
allocated_gpus = int(ray.cluster_resources().get("GPU"))
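One way to cover both points in this thread (an integer count, and no silent default to 0) is sketched below. allocated_gpu_count is a hypothetical helper name, not something in this PR. It takes the dictionary that ray.cluster_resources() returns, so the logic can be exercised without a running cluster. Treating a missing "GPU" key as a CPU-only cluster is an assumption.

```python
def allocated_gpu_count(resources: dict[str, float]) -> int:
    """Return the whole number of GPUs found in a Ray resource dict.

    Hypothetical helper: `resources` is meant to be the mapping returned
    by ray.cluster_resources().

    Raises:
        ValueError: if there is no "GPU" entry or the count is fractional.
    """
    if "GPU" not in resources:
        # Assumption: a missing key means the cluster has no GPUs at all.
        raise ValueError("Ray reports no GPU resource; is this a CPU-only cluster?")
    gpus = resources["GPU"]
    if gpus != int(gpus):
        # Refuse to truncate silently, e.g. 7.5 must not become 7.
        raise ValueError(f"Expected a whole number of GPUs, got {gpus}")
    return int(gpus)
```

After ray.init(), this would be called as allocated_gpu_count(ray.cluster_resources()). Unlike int(ray.cluster_resources().get("GPU")), a missing key produces a readable ValueError rather than a TypeError from int(None).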
| vllm_tensor_parallel_size: int, | ||
| single_gpu_mode: bool, | ||
| ) -> None: | ||
| """Validate that Ray sees the expected number of GPUs for this job.""" |
Please mention that it raises a ValueError if the GPUs are wrong.
| class TestValidateAllocatedGpus(unittest.TestCase): | ||
| def test_accepts_matching_gpu_count(self): |
Parameterize these!
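A minimal sketch of what parameterizing could look like using only the standard library's unittest.subTest. The expected_gpus function is a hypothetical stand-in that mirrors the arithmetic in the diff, and the cases are illustrative values, not the PR's actual test data.

```python
import unittest


def expected_gpus(num_learners_per_node, num_engines, tensor_parallel_size, single_gpu_mode):
    # Hypothetical stand-in for the logic under test, mirroring the diff in this PR.
    vllm_gpus = 0 if single_gpu_mode else num_engines * tensor_parallel_size
    return sum(num_learners_per_node) + vllm_gpus


class TestExpectedGpus(unittest.TestCase):
    # (learners per node, engines, tensor parallel size, single_gpu_mode, expected)
    CASES = [
        ((8,), 8, 1, False, 16),
        ((4, 4), 2, 4, False, 16),
        ((1,), 1, 1, True, 1),
    ]

    def test_expected_gpus(self):
        for learners, engines, tp_size, single, expected in self.CASES:
            # subTest reports each failing case separately instead of stopping at the first.
            with self.subTest(learners=learners, engines=engines, tp_size=tp_size, single=single):
                self.assertEqual(expected_gpus(learners, engines, tp_size, single), expected)
```

If the third-party parameterized package is already a test dependency, its parameterized.expand decorator would serve the same purpose and give each case its own test name.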
| "env_vars": {k: v for k, v in os.environ.items() if k not in EXCLUDED_ENV_VARS}, | ||
| } | ||
| ) | ||
| grpo_utils.validate_allocated_gpus( |
In Beaker, we can sometimes have straggler nodes that take an extra few minutes to start up, and we want to wait for them instead of failing instantly (waiting for 10-15ish min seems okay?).
I can put this right before we do ray allocation?
Currently, if you specify the GPU count incorrectly, Ray just hangs forever. Do you want to make this a 10-min timeout?
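The wait-with-timeout idea discussed here could be sketched roughly as follows. wait_for_gpus and all of its parameters are hypothetical names, and the 10-minute default is just the figure floated in this thread. The clock and sleep arguments are injectable only so the loop can be tested without real waiting.

```python
import time
from typing import Callable


def wait_for_gpus(
    get_gpu_count: Callable[[], int],
    expected_gpus: int,
    timeout_s: float = 600.0,
    poll_interval_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll until the expected GPU count is visible, or give up after timeout_s.

    Raises:
        ValueError: if the count still does not match when the deadline passes.
    """
    deadline = clock() + timeout_s
    while True:
        seen = get_gpu_count()
        if seen == expected_gpus:
            return
        if clock() >= deadline:
            raise ValueError(
                f"Expected {expected_gpus} GPUs but Ray reports {seen} after {timeout_s:.0f}s"
            )
        # Straggler nodes may still be joining, so wait and check again.
        sleep(poll_interval_s)
```

In the training script, get_gpu_count could be something like lambda: int(ray.cluster_resources().get("GPU", 0)), on the assumption that the resource view grows as late nodes join the cluster.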
Summary
- Validate GPU allocation in grpo_fast.py and grpo.py after ray.init()
- Include vllm_tensor_parallel_size when computing expected vLLM GPU usage
- Reject single_gpu_mode configurations unless there is exactly one learner and one vLLM engine
- Keep the shared helper in grpo_utils.py
- Test it in a dedicated test_grpo_utils.py module and keep test_grpo_fast_eval.py focused on GRPO-fast eval behavior
- Update CHANGELOG.md

Details
The shared helper now expects:
- sum(num_learners_per_node) + vllm_num_engines * vllm_tensor_parallel_size GPUs outside single_gpu_mode
- single_gpu_mode: exactly 1 learner and 1 vLLM engine, with no extra vLLM GPUs expected

This avoids undercounting tensor-parallel vLLM allocations and prevents single_gpu_mode from silently accepting unsupported multi-learner or multi-engine layouts.

Testing
- uv run pytest open_instruct/test_grpo_utils.py open_instruct/test_grpo_fast_eval.py
- make style
- make quality
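For readers who want the rules from the Details section in one place, here is a rough sketch. It is not the PR's actual implementation: expected_gpu_count and validate_gpu_count are hypothetical names, and splitting the work into two pure functions is a choice made here so the example runs without Ray.

```python
def expected_gpu_count(
    num_learners_per_node: tuple[int, ...],
    vllm_num_engines: int,
    vllm_tensor_parallel_size: int,
    single_gpu_mode: bool,
) -> int:
    """Return how many GPUs the job is expected to occupy.

    Raises:
        ValueError: if single_gpu_mode is combined with more than one
            learner or more than one vLLM engine.
    """
    total_learners = sum(num_learners_per_node)
    if single_gpu_mode:
        if total_learners != 1 or vllm_num_engines != 1:
            raise ValueError(
                "single_gpu_mode requires exactly 1 learner and 1 vLLM engine, "
                f"got {total_learners} learner(s) and {vllm_num_engines} engine(s)"
            )
        # No extra vLLM GPUs are expected in this mode.
        return total_learners
    # Each engine occupies one GPU per tensor-parallel rank.
    return total_learners + vllm_num_engines * vllm_tensor_parallel_size


def validate_gpu_count(allocated_gpus: int, expected_gpus: int) -> None:
    """Raise ValueError if the allocated and expected GPU counts differ."""
    if allocated_gpus != expected_gpus:
        raise ValueError(f"Ray sees {allocated_gpus} GPUs but the job expects {expected_gpus}")
```

For example, two nodes with 4 learners each plus 2 engines at tensor parallel size 4 gives 8 + 2 * 4 = 16 expected GPUs, so a cluster reporting 10 would be rejected.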