Skip to content

Use Ray to validate that allocated gpus correspond to requeusted # of GPUs#1606

Open
mnoukhov wants to merge 2 commits into
mainfrom
feat/ray-check-gpus
Open

Use Ray to validate that allocated gpus correspond to requeusted # of GPUs#1606
mnoukhov wants to merge 2 commits into
mainfrom
feat/ray-check-gpus

Conversation

@mnoukhov
Copy link
Copy Markdown
Contributor

@mnoukhov mnoukhov commented Apr 13, 2026

Summary

  • validate Ray GPU allocation in both grpo_fast.py and grpo.py after ray.init()
  • account for vllm_tensor_parallel_size when computing expected vLLM GPU usage
  • reject invalid single_gpu_mode configurations unless there is exactly one learner and one vLLM engine
  • move the shared validation helper into grpo_utils.py
  • move validation coverage into a dedicated test_grpo_utils.py module and keep test_grpo_fast_eval.py focused on GRPO-fast eval behavior
  • add a changelog entry for this PR in CHANGELOG.md

Details

The shared helper now expects:

  • standard mode: sum(num_learners_per_node) + vllm_num_engines * vllm_tensor_parallel_size
  • single_gpu_mode: exactly 1 learner and 1 vLLM engine, with no extra vLLM GPUs expected

This avoids undercounting tensor-parallel vLLM allocations and prevents single_gpu_mode from silently accepting unsupported multi-learner or multi-engine layouts.

Testing

  • uv run pytest open_instruct/test_grpo_utils.py open_instruct/test_grpo_fast_eval.py
  • make style
  • make quality

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a GPU allocation validation function to ensure that the Ray cluster resources match the expected configuration for learners and vLLM engines. Review feedback indicates that the current implementation fails to account for tensor parallelism and single GPU mode, which would result in incorrect validation. Suggestions were provided to update the validation logic, its call sites, and the associated unit tests.

Comment thread open_instruct/grpo_fast.py Outdated
Comment thread open_instruct/grpo_fast.py Outdated
"env_vars": {k: v for k, v in os.environ.items() if k not in EXCLUDED_ENV_VARS},
}
)
validate_allocated_gpus(tuple(args.num_learners_per_node), vllm_config.vllm_num_engines)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Update the call to validate_allocated_gpus to include the missing configuration parameters required for accurate GPU validation.

Suggested change
validate_allocated_gpus(tuple(args.num_learners_per_node), vllm_config.vllm_num_engines)
validate_allocated_gpus(tuple(args.num_learners_per_node), vllm_config.vllm_num_engines, vllm_config.vllm_tensor_parallel_size, args.single_gpu_mode)

Comment thread open_instruct/test_grpo_fast_eval.py Outdated
@finbarrtimbers finbarrtimbers self-requested a review April 13, 2026 16:38
Comment thread open_instruct/grpo.py
ray.init(**ray_init_kwargs)
grpo_utils.validate_allocated_gpus(
tuple(args.num_learners_per_node),
vllm_config.vllm_num_engines,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's pass vllm_config in here instead?

expected_vllm_gpus = 0 if single_gpu_mode else vllm_num_engines * vllm_tensor_parallel_size
expected_gpus = total_learners + expected_vllm_gpus
allocated_gpus: float = ray.cluster_resources().get("GPU", 0.0)
if not math.isclose(allocated_gpus, expected_gpus):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's integer math, so we can check equality!

)
expected_vllm_gpus = 0 if single_gpu_mode else vllm_num_engines * vllm_tensor_parallel_size
expected_gpus = total_learners + expected_vllm_gpus
allocated_gpus: float = ray.cluster_resources().get("GPU", 0.0)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this an int! Also let's not default to 0. When would "GPU" not be present? Only if we're running on CPU, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to cast it, as ray returns a float for some reason but can do that

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's convert it?

allocated_gpus = int(ray.cluster_resources().get("GPU"))

vllm_tensor_parallel_size: int,
single_gpu_mode: bool,
) -> None:
"""Validate that Ray sees the expected number of GPUs for this job."""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mention that it raises a ValueError if the GPUs are wrong.



class TestValidateAllocatedGpus(unittest.TestCase):
def test_accepts_matching_gpu_count(self):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parameterize these!

"env_vars": {k: v for k, v in os.environ.items() if k not in EXCLUDED_ENV_VARS},
}
)
grpo_utils.validate_allocated_gpus(
Copy link
Copy Markdown
Collaborator

@hamishivi hamishivi Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In beaker, we can sometimes have straggler nodes that take an extra few minutes to startup, and we want to wait for them instead of instant failing instantly (waiting for 10-15ish min seems okay?).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can put this right before we do ray allocation?

Currently if you specify incorrectly, ray just hangs forever, do you want to make this a 10-min timeout?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants