Use Ray to validate that allocated gpus correspond to requeusted # of GPUs by mnoukhov · Pull Request #1606 · allenai/open-instruct

mnoukhov · 2026-04-13T16:17:38Z

Summary

validate Ray GPU allocation in both grpo_fast.py and grpo.py after ray.init()
account for vllm_tensor_parallel_size when computing expected vLLM GPU usage
reject invalid single_gpu_mode configurations unless there is exactly one learner and one vLLM engine
move the shared validation helper into grpo_utils.py
move validation coverage into a dedicated test_grpo_utils.py module and keep test_grpo_fast_eval.py focused on GRPO-fast eval behavior
add a changelog entry for this PR in CHANGELOG.md

Details

The shared helper now expects:

standard mode: sum(num_learners_per_node) + vllm_num_engines * vllm_tensor_parallel_size
single_gpu_mode: exactly 1 learner and 1 vLLM engine, with no extra vLLM GPUs expected

This avoids undercounting tensor-parallel vLLM allocations and prevents single_gpu_mode from silently accepting unsupported multi-learner or multi-engine layouts.

Testing

uv run pytest open_instruct/test_grpo_utils.py open_instruct/test_grpo_fast_eval.py
make style
make quality

gemini-code-assist

Code Review

This pull request introduces a GPU allocation validation function to ensure that the Ray cluster resources match the expected configuration for learners and vLLM engines. Review feedback indicates that the current implementation fails to account for tensor parallelism and single GPU mode, which would result in incorrect validation. Suggestions were provided to update the validation logic, its call sites, and the associated unit tests.

gemini-code-assist · 2026-04-13T16:24:05Z

            "env_vars": {k: v for k, v in os.environ.items() if k not in EXCLUDED_ENV_VARS},
        }
    )
+    validate_allocated_gpus(tuple(args.num_learners_per_node), vllm_config.vllm_num_engines)


Update the call to validate_allocated_gpus to include the missing configuration parameters required for accurate GPU validation.

Suggested change

validate_allocated_gpus(tuple(args.num_learners_per_node), vllm_config.vllm_num_engines)

validate_allocated_gpus(tuple(args.num_learners_per_node), vllm_config.vllm_num_engines, vllm_config.vllm_tensor_parallel_size, args.single_gpu_mode)

finbarrtimbers · 2026-04-13T16:38:12Z

    ray.init(**ray_init_kwargs)
+    grpo_utils.validate_allocated_gpus(
+        tuple(args.num_learners_per_node),
+        vllm_config.vllm_num_engines,


Let's pass vllm_config in here instead?

finbarrtimbers · 2026-04-13T16:39:46Z

+    expected_vllm_gpus = 0 if single_gpu_mode else vllm_num_engines * vllm_tensor_parallel_size
+    expected_gpus = total_learners + expected_vllm_gpus
+    allocated_gpus: float = ray.cluster_resources().get("GPU", 0.0)
+    if not math.isclose(allocated_gpus, expected_gpus):


It's integer math, so we can check equality!

finbarrtimbers · 2026-04-13T16:40:13Z

+        )
+    expected_vllm_gpus = 0 if single_gpu_mode else vllm_num_engines * vllm_tensor_parallel_size
+    expected_gpus = total_learners + expected_vllm_gpus
+    allocated_gpus: float = ray.cluster_resources().get("GPU", 0.0)


Make this an int! Also let's not default to 0. When would "GPU" not be present? Only if we're running on CPU, right?

I'll have to cast it, as ray returns a float for some reason but can do that

Let's convert it?

allocated_gpus = int(ray.cluster_resources().get("GPU"))

finbarrtimbers · 2026-04-13T16:40:40Z

+    vllm_tensor_parallel_size: int,
+    single_gpu_mode: bool,
+) -> None:
+    """Validate that Ray sees the expected number of GPUs for this job."""


Please mention that it raises a ValueError if the GPUs are wrong.

finbarrtimbers · 2026-04-13T16:41:19Z

+
+
+class TestValidateAllocatedGpus(unittest.TestCase):
+    def test_accepts_matching_gpu_count(self):


Parameterize these!

hamishivi · 2026-04-13T16:46:35Z

            "env_vars": {k: v for k, v in os.environ.items() if k not in EXCLUDED_ENV_VARS},
        }
    )
+    grpo_utils.validate_allocated_gpus(


In beaker, we can sometimes have straggler nodes that take an extra few minutes to startup, and we want to wait for them instead of instant failing instantly (waiting for 10-15ish min seems okay?).

I can put this right before we do ray allocation?

Currently if you specify incorrectly, ray just hangs forever, do you want to make this a 10-min timeout?

validate allocated gpus

eae58fa

gemini-code-assist Bot reviewed Apr 13, 2026

View reviewed changes

Validate GRPO Ray GPU allocation

feb5cb8

finbarrtimbers self-requested a review April 13, 2026 16:38

finbarrtimbers requested changes Apr 13, 2026

View reviewed changes

hamishivi requested changes Apr 13, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use Ray to validate that allocated gpus correspond to requeusted # of GPUs#1606

Use Ray to validate that allocated gpus correspond to requeusted # of GPUs#1606
mnoukhov wants to merge 2 commits into
mainfrom
feat/ray-check-gpus

mnoukhov commented Apr 13, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

gemini-code-assist Bot Apr 13, 2026

Uh oh!

Uh oh!

finbarrtimbers Apr 13, 2026

Uh oh!

finbarrtimbers Apr 13, 2026

Uh oh!

finbarrtimbers Apr 13, 2026

Uh oh!

mnoukhov Apr 13, 2026

Uh oh!

finbarrtimbers Apr 13, 2026

Uh oh!

finbarrtimbers Apr 13, 2026

Uh oh!

finbarrtimbers Apr 13, 2026

Uh oh!

hamishivi Apr 13, 2026 •

edited

Loading

Uh oh!

mnoukhov Apr 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	validate_allocated_gpus(tuple(args.num_learners_per_node), vllm_config.vllm_num_engines)
	validate_allocated_gpus(tuple(args.num_learners_per_node), vllm_config.vllm_num_engines, vllm_config.vllm_tensor_parallel_size, args.single_gpu_mode)



		class TestValidateAllocatedGpus(unittest.TestCase):
		def test_accepts_matching_gpu_count(self):

Conversation

mnoukhov commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Details

Testing

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

gemini-code-assist Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hamishivi Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mnoukhov commented Apr 13, 2026 •

edited

Loading

hamishivi Apr 13, 2026 •

edited

Loading