
Runs the benchmarks with the hybrid model#1425

Merged
finbarrtimbers merged 282 commits into main from finbarr/hybrid-benchmarks
Mar 26, 2026

Conversation

@finbarrtimbers
Collaborator

@finbarrtimbers finbarrtimbers commented Jan 26, 2026

Summary

  • Add hybrid model (Olmo-Hybrid) support to the training pipeline
  • Monkey-patch vLLM 0.18.0 MambaSpec.dtypes to fix hybrid model dtype serialization bug
  • Force tokens to be decoded to int, as HF now returns numpy dtypes
  • Pass trust_remote_code through to vLLM engines for custom model/tokenizer support
  • Add min_tokens to SamplingConfig with extra_body pass-through for vLLM API compat (see the sketch after this list)
  • Upgrade dependencies: vllm>=0.18.0, transformers>=5.3.0, torch upper bound removal
  • Add return_dict=False to all apply_chat_template calls for transformers 5.x compat
  • Remove configure_hf_hub_retry() — the underlying huggingface_hub.configure_http_backend() was removed in huggingface_hub v1.0.0 (we now pull in v1.7.2 via the vllm/transformers upgrade). HF Hub v1.0+ has built-in retry logic via http_backoff() with the same 429/5xx exponential backoff behavior, and the pre-download fix from PR #1528 (Pre-download HF model before Ray actors spawn) remains in place
  • Add hybrid test scripts (GRPO + DPO) and production training scripts
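
A minimal sketch of the min_tokens pass-through referenced above, assuming the stock openai Python client pointed at a vLLM OpenAI-compatible server; the base URL and model name are placeholders, and the SamplingConfig wiring in the repo is not shown.

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder vLLM server

    def generate_fixed_length(prompt: str, n_tokens: int):
        # min_tokens is vLLM-specific and not part of the OpenAI API surface,
        # so it travels via extra_body, which is merged into the request JSON.
        return client.completions.create(
            model="placeholder-model",
            prompt=prompt,
            max_tokens=n_tokens,
            extra_body={"min_tokens": n_tokens},  # force exact-length responses for throughput benchmarks
        )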

Runs

  1. Multi-node GRPO: Beaker
  2. DPO 2-node: Beaker
  3. Single GPU benchmark: Beaker

GPU_TESTS=01KMNGSSB8HAR6HKS1136BJFZH

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@gemini-code-assist
Contributor

Summary of Changes

Hello @finbarrtimbers, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on integrating and benchmarking a 'hybrid model' with VLLM. It introduces explicit control over VLLM's data type, updates the VLLM dependency to a custom fork that likely provides hybrid model support, and enhances the stability of VLLM request handling. A new script is added to streamline single GPU benchmarking with these configurations, complemented by a comprehensive update and cleanup of various Python package dependencies.

Highlights

  • VLLM Data Type Configuration: The VLLM configuration now supports specifying the data type (vllm_dtype), defaulting to bfloat16, and this parameter is correctly propagated to the VLLM engine creation process.
  • VLLM Tool Parser Refactoring: Updated VLLM tool parser import paths to align with internal refactoring within the VLLM library, moving them from vllm.entrypoints.openai.tool_parsers to vllm.tool_parsers (a fallback-import sketch follows this list).
  • VLLM Request Processing Enhancements: Improved robustness in VLLM request processing by adding timeouts and error handling for queue retrieval, and addressed Ray serialization issues by manually constructing SamplingConfig objects instead of using dataclasses.replace.
  • Custom VLLM Dependency and Torch Override: The VLLM dependency has been updated to point to a custom Git fork (yanhong-lbh/vllm branch olmo-3.5-hybrid), removing previous specific version pins. Additionally, an override for the torch dependency (torch>=2.9.0,<2.10) has been added.
  • New Hybrid Model Benchmarking Script: A new benchmarking script (hybrid_single_gpu.sh) has been introduced for single GPU hybrid models, which configures VLLM with vllm_dtype auto and other specific parameters for performance evaluation.
  • Extensive Dependency Updates and Cleanup: Numerous Python package dependencies in requirements.txt have been updated to newer versions. Several CUDA-related and flashinfer-python dependencies were removed, while grpcio-reflection and intel-openmp were added, indicating a significant dependency overhaul.
  • Removal of GPU Rollout Saving Tests: GPU-specific tests for rl_utils rollout saving functionality have been removed from the codebase.
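
A hedged sketch of tolerating the tool-parser module move noted above; ToolParserManager is an assumed symbol used purely for illustration.

    try:
        # newer layout described in the highlight above
        from vllm.tool_parsers import ToolParserManager  # assumed symbol
    except ImportError:
        # older layout prior to the vLLM refactor
        from vllm.entrypoints.openai.tool_parsers import ToolParserManager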


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

The pull request introduces changes to integrate a hybrid model for benchmarking, primarily focusing on updating vLLM dependencies and configurations. Key changes include adding vllm_dtype to VLLMConfig and passing it to engine creation, updating vLLM import paths and parser configurations to align with internal library changes, and enhancing error handling during vLLM engine initialization. A new benchmarking script hybrid_single_gpu.sh has been added, and test_rl_utils_gpu.py has been removed. Dependency versions in pyproject.toml and requirements.txt have also been updated.

Comment thread open_instruct/vllm_utils.py Outdated
Comment on lines +453 to +454
except Exception:
pass
Contributor


Severity: high

Catching a broad Exception and then passing silently can hide critical issues and make debugging very difficult. It's best practice to either catch more specific exceptions or at least log the exception details to ensure that unexpected errors are not swallowed.

Suggested change:
-    except Exception:
-        pass
+    except Exception as e:
+        logger.error(f"Error getting request from prompt queue: {e}")

@finbarrtimbers finbarrtimbers force-pushed the finbarr/hybrid-benchmarks branch from 0291177 to 3ce4c53 on January 28, 2026 at 23:35
finbarrtimbers and others added 26 commits February 4, 2026 22:46
The get_kv_cache_spec RPC hangs on hybrid models due to multi-dtype
serialization issues in vLLM v1. Skip the call and use default batch size.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Set min_tokens = max_tokens to ensure all responses are exactly
the specified length for accurate throughput benchmarking.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
4 engines * 2 tensor_parallel = 8 GPUs total.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
min_tokens is a vLLM-specific parameter not supported by the
standard OpenAI API, so it must be passed in extra_body.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The hybrid model experiment reached step 18/19 in 64 minutes before
hitting the 1h timeout. Increasing to 90m to allow completion.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Switch to tulu_thinker template (same as main) to isolate the
model as the only variable when comparing performance.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Keep the origin/main Qwen version as the default large_test_script.sh
and save the hybrid model version separately.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Let exceptions propagate to identify the root cause of
description updates not showing progress bars.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This will help debug why progress bars aren't appearing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Log BEAKER_JOB_ID and BEAKER_WORKLOAD_ID availability
before and after each update call.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The collective_rpc("get_kv_cache_spec") call in get_kv_cache_info() hangs
on hybrid models (like OLMo with linear RNNs) due to multi-dtype
serialization issues when using the multiprocessing executor (TP>1).

This fix returns the pre-set inference_batch_size (64) instead of calling
collective_rpc, which matches the fallback behavior already documented
in the LLMRayActor.__init__ method.

Also adds PYTHONUNBUFFERED=1 to the hybrid test script.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
beaker_config was only assigned inside the push_to_hub block but
accessed outside it, causing an UnboundLocalError when push_to_hub
was false.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The standard Python logger doesn't support the main_process_only
parameter that is specific to Accelerate's logger.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Transformers v5 changed apply_chat_template to return a BatchEncoding
dict by default. This broke tokenization for models using custom
tokenizers (e.g. hybrid OLMo 3.5). Instead of fixing each call site,
patch the method at tokenizer load time via functools.partial.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move cache path logging into build_reference_logprobs_cache
- Inline hash/path construction at call sites
- Fix E402 lint error in benchmark_generators.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add --gradient_checkpointing to single_gpu_cache_hybrid.sh to fix OOM
- Replace --gradient_checkpointing with --activation_memory_budget 0.5
  in multi_node_hybrid.sh (dpo.py uses new API)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace dead args.gradient_checkpointing references with
args.activation_memory_budget < 1.0 to match ExperimentConfig.
Update hybrid DPO cache scripts to use --activation_memory_budget 0.5.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…y_chat_template return_dict fix Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rdcoding flash_attention_2 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…detect Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…x OpenAI client compat Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mplify benchmark_generators SamplingConfig Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_hub_retry, DeepSpeed hooks) — split to separate PRs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…huggingface_hub v1.0 (now v1.7.2) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…set_transformation.py Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…red-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…cache_info fails on hybrid models Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

@hamishivi hamishivi left a comment


Nice! Some comments. Also, does this make qwen3.5 work out of the box now?

Comment thread open_instruct/vllm_utils.py
Comment thread open_instruct/vllm_utils.py Outdated
-    self.inference_batch_size = self.get_kv_cache_info()
+    try:
+        self.inference_batch_size = self.get_kv_cache_info()
+    except Exception:
Collaborator


do we need the try catch (and fallback inference size) with the monkey-patch?

Collaborator Author


Good question. Let me try removing it.

Collaborator Author


It's fine! Beaker.

Comment thread scripts/debug/repro_vllm_hybrid_dtype.py Outdated
…ch_size since MambaSpec monkey-patch fixes the root cause Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…al models like Qwen3.5 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…el Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…arks

# Conflicts:
#	open_instruct/vllm_utils.py
…r Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ide verify, add CHANGELOG entry Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@finbarrtimbers finbarrtimbers added this pull request to the merge queue Mar 26, 2026
Merged via the queue into main with commit d5a6b60 Mar 26, 2026
6 of 7 checks passed
@finbarrtimbers finbarrtimbers deleted the finbarr/hybrid-benchmarks branch March 26, 2026 22:00
hamishivi pushed a commit that referenced this pull request Mar 30, 2026
* Skip KV cache info call for hybrid model compatibility

The get_kv_cache_spec RPC hangs on hybrid models due to multi-dtype
serialization issues in vLLM v1. Skip the call and use default batch size.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Force exact response length in benchmark with min_tokens

Set min_tokens = max_tokens to ensure all responses are exactly
the specified length for accurate throughput benchmarking.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Use 4 engines with TP=2 to utilize all 8 GPUs

4 engines * 2 tensor_parallel = 8 GPUs total.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Move min_tokens to extra_body for vLLM API compatibility

min_tokens is a vLLM-specific parameter not supported by the
standard OpenAI API, so it must be passed in extra_body.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Increase batch timeout to 2400s for 32k token benchmarks

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Increase large_test_script timeout to 90m for hybrid model

The hybrid model experiment reached step 18/19 in 64 minutes before
hitting the 1h timeout. Increasing to 90m to allow completion.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add minimal repro for KV cache hang on hybrid models

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Set vLLM logging level to INFO in benchmark

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Use tulu_thinker template for hybrid model test

Switch to tulu_thinker template (same as main) to isolate the
model as the only variable when comparing performance.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Move hybrid model test script to separate file

Keep the origin/main Qwen version as the default large_test_script.sh
and save the hybrid model version separately.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Remove error suppression in Beaker description updates

Let exceptions propagate to identify the root cause of
description updates not showing progress bars.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add logging to Beaker description updates

This will help debug why progress bars aren't appearing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Use origin/main version of single_gpu_on_beaker.sh for testing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add debug logging for Beaker description updates

Log BEAKER_JOB_ID and BEAKER_WORKLOAD_ID availability
before and after each update call.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix get_kv_cache_info() hang on hybrid models with multiproc executor

The collective_rpc("get_kv_cache_spec") call in get_kv_cache_info() hangs
on hybrid models (like OLMo with linear RNNs) due to multi-dtype
serialization issues when using the multiprocessing executor (TP>1).

This fix returns the pre-set inference_batch_size (64) instead of calling
collective_rpc, which matches the fallback behavior already documented
in the LLMRayActor.__init__ method.

Also adds PYTHONUNBUFFERED=1 to the hybrid test script.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* added repro_rope_bug.py

* Update uv.lock

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add hybrid model DPO debug scripts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add hybrid model DPO cache debug scripts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix unbound beaker_config in dpo_tune_cache.py

beaker_config was only assigned inside the push_to_hub block but
accessed outside it, causing an UnboundLocalError when push_to_hub
was false.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
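
A minimal sketch of the fix pattern described in this commit; maybe_get_beaker_config is a hypothetical stand-in for the real lookup, and the actual call site in dpo_tune_cache.py may differ.

    def maybe_get_beaker_config():
        # Hypothetical stand-in for the real Beaker config lookup.
        return {"experiment": "example"}

    def resolve_beaker_config(push_to_hub: bool):
        # Bind beaker_config unconditionally so code after the conditional
        # never raises UnboundLocalError when push_to_hub is false.
        beaker_config = None
        if push_to_hub:
            beaker_config = maybe_get_beaker_config()
        return beaker_config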

* Fix logger.info() call with unsupported main_process_only kwarg

The standard Python logger doesn't support the main_process_only
parameter that is specific to Accelerate's logger.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Patch apply_chat_template to default return_dict=False

Transformers v5 changed apply_chat_template to return a BatchEncoding
dict by default. This broke tokenization for models using custom
tokenizers (e.g. hybrid OLMo 3.5). Instead of fixing each call site,
patch the method at tokenizer load time via functools.partial.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
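
A hedged sketch of the load-time patch this commit describes; the loader function name is an assumption, and only the return_dict default changes.

    import functools
    from transformers import AutoTokenizer

    def load_tokenizer(name_or_path: str):
        tokenizer = AutoTokenizer.from_pretrained(name_or_path)
        # Transformers v5 returns a BatchEncoding by default; restore the old
        # token-id-list behaviour for every downstream caller in one place.
        # Callers can still pass return_dict=True explicitly to override.
        tokenizer.apply_chat_template = functools.partial(
            tokenizer.apply_chat_template, return_dict=False
        )
        return tokenizer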

* Clean up build_reference_logprobs_cache callers and fix lint

- Move cache path logging into build_reference_logprobs_cache
- Inline hash/path construction at call sites
- Fix E402 lint error in benchmark_generators.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add TORCH_LOGS env var to single GPU hybrid DPO scripts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Sync hybrid DPO debug scripts with latest flags

- Add --gradient_checkpointing to single_gpu_cache_hybrid.sh to fix OOM
- Replace --gradient_checkpointing with --activation_memory_budget 0.5
  in multi_node_hybrid.sh (dpo.py uses new API)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix gradient checkpointing in dpo_tune_cache.py

Replace dead args.gradient_checkpointing references with
args.activation_memory_budget < 1.0 to match ExperimentConfig.
Update hybrid DPO cache scripts to use --activation_memory_budget 0.5.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Reduce multi_node_cache_hybrid.sh to 8k seq length to avoid OOM

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Reduce multi_node_cache_hybrid.sh to 4k seq length to avoid OOM

8k sequence length OOMs during backward pass on the hybrid model.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Reduce multi_node_cache_hybrid.sh to 2k seq length to avoid OOM

4k still OOMs during backward pass. The hybrid model's linear attention
layers have large intermediate states that HF gradient checkpointing
doesn't help enough with.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Try 1k seq len + budget=0.25 for hybrid DPO cache

2k with budget=0.5 still OOMs at ~72GB - the HF gradient checkpointing
isn't reducing memory enough for the hybrid model's linear attention
layers. Try more aggressive settings.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Improve gradient checkpointing in dpo_tune_cache.py

- Use use_reentrant=False for gradient checkpointing (more memory efficient
  and recommended for PyTorch 2.4+)
- Add log message when enabling gradient checkpointing
- Add PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to hybrid script
  to reduce memory fragmentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
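
A minimal sketch of the two settings this commit names; the surrounding training setup is assumed.

    import os

    # Reduce CUDA allocator fragmentation; must be set before CUDA tensors exist.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    def enable_activation_checkpointing(model) -> None:
        # Non-reentrant checkpointing is more memory efficient on recent PyTorch.
        model.gradient_checkpointing_enable(
            gradient_checkpointing_kwargs={"use_reentrant": False}
        )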

* Enable ZeRO-3 for hybrid DPO cache training

ZeRO-3 partitions model parameters, gradients, and optimizer state
across GPUs, significantly reducing per-GPU memory usage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix bf16 mixed precision for ZeRO-3 in dpo_tune_cache.py

When using ZeRO-3, the accelerator needs mixed_precision="bf16" set
explicitly, otherwise DeepSpeed's "auto" bf16 setting resolves to False,
causing FlashAttention to fail (it requires fp16 or bf16).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
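
A hedged sketch of requesting bf16 explicitly on the Accelerator, per this commit; the DeepSpeed plugin and config file details are omitted.

    from accelerate import Accelerator

    # With ZeRO-3, leaving the DeepSpeed bf16 setting on "auto" can resolve to False,
    # so ask for bf16 mixed precision explicitly.
    accelerator = Accelerator(mixed_precision="bf16")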

* Increase hybrid DPO cache to 4k seq length

Testing how far we can push sequence length with ZeRO-3.
2k worked, trying 4k.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Increase hybrid DPO cache to 8k seq length

4k worked with ZeRO-3. Doubling to 8k.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Increase hybrid DPO cache to 16k seq length

8k worked with ZeRO-3. Doubling to 16k.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add hybrid model support to dpo.py via olmo_core_utils

- Update ai2-olmo-core dependency to tyler/anejs/linear-rnns branch for FLA support
- Add get_olmo3_7b_hybrid_config() to build OLMo 3.1 7B hybrid config with GatedDeltaNet
- Map hybrid model checkpoint path to olmo3_7B_hybrid config in OLMO_MODEL_CONFIG_MAP

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Replace load_hf_model with safetensors loader supporting hybrid FLA checkpoints

- Load safetensors directly and remap keys using mappings derived from
  olmo-core's convert_checkpoint_to_hf_hybrid.py (reversed). This supports
  both standard and hybrid (FLA) checkpoints.
- Fix SpeedMonitorCallback kwarg renamed in linear-rnns branch
  (device_peak_flops -> device_peak_flops_per_second).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update transformers to latest HEAD of olmo-3.5-hybrid-clean branch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Lower activation memory budget to 0.1 for hybrid DPO

Previous run OOM'd at step 11 with budget=0.5 due to variable
sequence lengths causing memory spikes in the lm_head.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update hybrid DPO script with torchrun launch config for jupiter

Switch from accelerate to torchrun, remove NCCL env vars, add
hybrid-specific flags (trust_remote_code, zero_stage 3), loop over
SFT checkpoints and LRs, remove auto evals. Currently launching 1
job for testing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Replace --gradient_checkpointing with --activation_memory_budget 0.5

dpo_tune_cache.py doesn't accept --gradient_checkpointing as a CLI
arg; it uses --activation_memory_budget instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Filter rows with None message content in DPO preprocessing

One of the mixer datasets has rows where message content is None,
causing apply_chat_template to crash with TypeError. Skip these
rows by returning empty results that preference_tulu_filter_v1
will filter out.

Also add --description to mason.py call for more informative
Beaker experiment names.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Gather model params before reference logprob caching to avoid per-batch all-gather

With ZeRO-3 across multiple nodes, every forward pass during caching triggers
an all-gather of the full model. Since caching is inference-only, we use
unwrap_model_for_generation to gather params once upfront, run local forward
passes, then restore ZeRO-3 hooks for training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Use 4x batch size during reference logprob caching and report progress to Beaker

- Create a separate dataloader with 4x the training batch size for caching
  (only 31% memory used during caching, so plenty of headroom)
- Update Beaker experiment description every step during caching

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Reduce caching batch size multiplier from 4x to 2x to avoid OOM

4x OOMed trying to allocate 41.87 GiB for activations with only 38.84 GiB free
(full gathered model takes ~40GB in fp32 under ZeRO-3).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix add_hooks to use setup_zero_stage3_hooks (compatible with DeepSpeed 0.18+)

_register_hooks_recursively was removed in newer DeepSpeed. The replacement
is setup_zero_stage3_hooks() which re-registers all forward/backward hooks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix tokens_per_second undercount and add hybrid layer speed benchmark

- Fix token_count metric in dpo_tune_cache.py: the reduce(mean) + divide
  by grad_accum*logging_steps was applied to token_count before the
  num_processes correction, causing a 4x undercount. Now multiplies by
  the full correction factor (num_processes * grad_accum * logging_steps).
- Add GPU benchmark test comparing GatedDeltaNet (linear attention) vs
  FlashAttention layer speed to verify hybrid model training slowdown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Avoid OOM in full model benchmark by running models sequentially

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Use inline config for hybrid model benchmark (no HF download needed)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Error if flash-linear-attention is missing for hybrid models

Prevents silent fallback to slow PyTorch linear attention kernels
when the fla package is not installed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Use inline configs for both models in benchmark (no HF downloads)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Switch hybrid DPO script to accelerate launch and add FLA check

- Use accelerate launch with DeepSpeed config file instead of torchrun
  with --zero_stage, matching the OLMo3 DPO setup.
- Error if flash-linear-attention is missing for hybrid models to prevent
  silent fallback to slow PyTorch linear attention kernels.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update hybrid benchmark: model arch is NOT the slowdown cause

Beaker GPU tests showed the hybrid model (0.93x) is actually slightly
faster than OLMo3 at the full-model level, despite individual linear
attention layers being slower. The smaller hidden_size (3840 vs 4096)
compensates. The ~3.5x DPO training slowdown must come from distributed
training overhead, not model architecture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add Jupiter NCCL env vars to hybrid DPO script

Match the NCCL configuration that OLMo-core uses for Jupiter
(NCCL_IB_HCA, NCCL_SOCKET_IFNAME, TORCH_NCCL_AVOID_RECORD_STREAMS,
TORCH_DIST_INIT_BARRIER) to ensure inter-node traffic uses InfiniBand.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add hybrid DPO performance investigation doc

Track hypotheses and results for the ~3.5x hybrid DPO training slowdown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add OLMo3 7B DPO baseline script for Jupiter

Same config as the regular DPO script but targeting Jupiter instead of
Augusta, to get a fair throughput comparison with the hybrid DPO run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add tests for token_count metric reduction fix

Verifies that the corrected formula (multiply by num_processes *
grad_accum_steps * logging_steps) recovers the true total token count,
and that the old formula undercounted by grad_accum_steps * logging_steps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add torch.profiler support to DPO training and enable on hybrid script

Add --profile_every_n_steps flag that profiles one micro-batch every N
optimization steps. Exports Chrome traces and logs top-30 CUDA ops by
time. Enabled on the hybrid DPO script (every 2 steps) to diagnose the
~16x distributed training slowdown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add checkpoint resume bugs doc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix checkpoint resume bugs: race condition, non-deterministic output dir

- Add ignore_errors=True to shutil.rmtree in checkpoint cleanup to handle
  race conditions on shared filesystems where multiple nodes may delete
  the same directory
- Change is_local_main_process to is_main_process for checkpoint cleanup
  in dpo_tune_cache.py so only global rank 0 does cleanup on shared Weka
- Remove non-deterministic int(time.time()) from exp_name construction
  in dpo_tune_cache.py, finetune.py, and utils.py so output_dir is stable
  across Beaker retries
- Add tests for clean_last_n_checkpoints

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
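
A hedged sketch of race-tolerant checkpoint cleanup; clean_last_n_checkpoints exists in the repo per this commit, but the signature and step_* directory layout shown here are assumptions.

    import os
    import shutil

    def clean_last_n_checkpoints(output_dir: str, keep_last_n: int) -> None:
        steps = sorted(
            (d for d in os.listdir(output_dir) if d.startswith("step_")),
            key=lambda name: int(name.split("_")[1]),
        )
        stale = steps[:-keep_last_n] if keep_last_n > 0 else steps
        for name in stale:
            # ignore_errors tolerates another node deleting the same directory
            # concurrently on a shared filesystem like Weka.
            shutil.rmtree(os.path.join(output_dir, name), ignore_errors=True)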

* Add multi-node dpo_tune_cache test script and enable checkpointing in integration test

- New scripts/train/debug/dpo/multi_node_cache.sh: 2-node dpo_tune_cache
  run that exercises checkpoint cleanup with checkpointing_steps=5 and
  keep_last_n_checkpoints=2
- Enable checkpointing in dpo_integration_test.sh to exercise the
  is_main_process and ignore_errors fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix indentation error in profiler code

The metrics tracking block was incorrectly indented after the profiler
changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* changed baseline RL script

* Setting triton flag

* Add DPO checkpoint integration test with save, cleanup, and resume

Rewrites multi_node_cache.sh to run two sequential Beaker jobs that
exercise checkpoint saving (COMPLETED markers), cleanup
(keep_last_n_checkpoints), and resume from a WEKA-persisted output dir.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Enable packing and switch cache scripts to accelerate launch

- Add --packing to all DPO debug scripts and production hybrid DPO script
- Switch cache hybrid scripts from torchrun to accelerate launch with deepspeed
- Add TRITON_PRINT_AUTOTUNING=1 env var to debug scripts
- Create single_gpu_cache.sh for non-hybrid cache testing
- Fix run-dpo-experiments skill to reference correct script paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix TPS metrics to be per-GPU in dpo_tune_cache.py

Divide tokens_per_second_step and tokens_per_second_total by
accelerator.num_processes so they report per-GPU throughput, consistent
with MFU and the olmo-core SpeedMonitorCallback convention. Also use
enumerate() for the micro_step loop to fix SIM113 lint warning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add batch shape logging to dpo_tune_cache.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Pad packed batches to max_seq_length in DPO collator

Move pad_to_length to padding_free_collator to avoid circular imports,
add pad_to_max_length/pad_token_id fields to TensorDataCollatorWithFlattening,
and wire up padding in dpo_tune_cache.py so every packed batch has a fixed
(1, max_seq_length) shape.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* no saturn

* Truncate padded tensors in get_batch_logps before splitting

get_batch_logps uses cu_seq_lens to split tensors by sequence, but
after padding the tensor is longer than cu_seq_lens accounts for.
Truncate to real content length before splitting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix NaN losses and duplicate logging in padded DPO

Strip padding from chosen/rejected tensors in concatenated_inputs()
before concatenating, so cu_seq_lens offsets point to real data instead
of padding. Guard batch-shape and step-metrics logging with
is_main_process to avoid 16x duplicate log lines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert padding changes, fix duplicate logging, add forward shape log

Revert collator padding (pad_to_max_length) that caused NaN losses and
Triton autotuning overhead. Keep duplicate-logging fix (is_main_process
guards). Add input_ids shape logging before model forward to diagnose
autotuning triggers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Restore pad_to_length used by dpo_utils

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix forward shape logging to correct function and log all input shapes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* now flags match

* Add GRPO performance investigation doc for hybrid model

Documents the 3.4x step time gap between hybrid (GatedDeltaNet) and
standard (OLMo-3) GRPO training, tracing root causes to two architectural
differences: 3.25x more parameter tensors per layer (causing ZeRO-3
all-gather overhead) and FLA Triton kernel incompatibility with CUDA graphs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix benchmark timeout at long response lengths

Remove queue maxsize to prevent blocking the vLLM async event loop,
scale batch timeout with response length, and add health checks to
the main batch collection loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Disable HF generation_config override in vLLM engines

The model's generation_config.json sets max_new_tokens=2048 which
vLLM applies as a cap on max_tokens, causing 400 errors when
requesting longer responses. Pass generation_config="vllm" to use
vLLM defaults instead. Also fix health check spam when no results
are collected yet (0 % 10 == 0).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update fla to tyler/fewer-l2norm-recompilations branch

Switch flash-linear-attention to tyler-romero fork with fewer L2 norm
recompilations for improved hybrid model performance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add realistic generation mode to benchmark for hybrid model investigation

Support variable-length responses with stop_strings in benchmark_generators.py
(matching GRPO conditions) to isolate the long-tail response length effect on TPS.
Add two new benchmark scripts for the hybrid model: one with GRPO-realistic settings
(enforce_eager + stop_strings + 32x16 batch) and one with fixed-length + enforce_eager.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update investigation doc with enforce_eager and long-tail findings

- enforce_eager causes 47% TPS penalty for hybrid (vs 25% for standard transformer)
- Long-tail response lengths are the primary driver of per-step TPS variance:
  steps with MaxSeq<1000 achieve 2,500-3,566 TPS, steps with MaxSeq=4096 drop to 625
- The original 2.6x gap attribution to enforce_eager alone was incorrect;
  the benchmark comparison was confounded by fixed-length vs variable-length generation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add hybrid GRPO test script without enforce_eager

Test whether the hybrid model can run GRPO with CUDA graphs enabled,
since the standalone benchmark already works without enforce_eager.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix batches_processed crash in reference logprobs cache

Plain DataLoader wrapped by accelerate doesn't have batches_processed.
Track step locally and save it in the checkpoint dict instead of relying
on dataloader.state_dict()/load_state_dict(). Also make checkpoint
frequency configurable via checkpoint_every_n_steps parameter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add kv_cache_concurrency benchmark script

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add Beaker launch script for KV cache concurrency analysis

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix kv_cache_concurrency: disable V1 multiprocessing for engine access

SyncMPClient (used with TP>1) doesn't expose engine_core. Setting
VLLM_ENABLE_V1_MULTIPROCESSING=0 forces InprocClient so we can
access the engine core internals to extract KV cache specs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Set enforce eager on script

* Add GRPO hybrid vs non-hybrid benchmark comparison

Three-way comparison of hybrid+enforce_eager, hybrid, and non-hybrid
OLMo 3.1 runs on the large test script (2x8 GPU, 19 steps).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Pass trust_remote_code through to vLLM engines

Models like nvidia/Nemotron-H-8B-Base-8K require trust_remote_code=True
to load in vLLM. This was missing from create_vllm_engines, causing
failures for custom-code models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update transformers fork and rename olmo3_5_hybrid to olmo3_2_hybrid

The upstream fork renamed the model type on Feb 12. Update uv.lock
to pick up commit 77d52dc and update all code references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add minimal repro for olmo3_2_hybrid ZeRO-3 crash

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update repro to use Accelerator with DeepSpeed ZeRO-3 context

The previous version loaded the model directly without the Accelerate
context, so ZeRO-3 partitioning wasn't active and the crash didn't
reproduce. This version initializes an Accelerator with DeepSpeedPlugin
before loading the model, matching the dpo_tune_cache.py code path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Workaround olmo3_2_hybrid _init_weights crash under ZeRO-3

The _init_weights added in 01f141b accesses Embedding weights by
padding_idx, which crashes under ZeRO-3 where weights are partitioned
(size 0 on non-owning ranks). Monkey-patch until fixed upstream.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove debug logging and profiling from hybrid DPO

Remove per-step Model forward shape logging and disable profiler tracing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* updated benchmarking scripts

* Add hybrid GRPO training script for OLMo3 7B on Jupiter

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Add benchmark results

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Update vLLM to latest olmo-3.5-hybrid-clean branch (411aac2e)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* added perf/seconds_per_step

* Apply olmo3_2_hybrid _init_weights ZeRO-3 workaround to grpo_fast.py, add Slack alerts to DPO and RL scripts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix 7b_instruct_hybrid_rl.sh: add checkpoint_state_dir, remove max_retries

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Remove invalid --beaker_eval_freq flag from hybrid RL script

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Remove --add_bos from hybrid RL script (incompatible with olmo123 template)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Increase weight sync timeout from 120s to 1200s

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Set vllm_tensor_parallel_size=2 and halve vllm_num_engines in hybrid RL script

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Revert "Set vllm_tensor_parallel_size=2 and halve vllm_num_engines in hybrid RL script"

This reverts commit e8f0faa61a61d9f2c0d28651f4fb619371129a5c.

* updated code

* now we set checkpointing

* Reduce vLLM nodes from 7 to 3 (24 engines) and total nodes from 8 to 4

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* less vllm

* now trying 2 nodes

* 2 node

* Add debug resume scripts for DPO and RL experiments

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* set model name

* Fix ref policy checkpoint save under DeepSpeed ZeRO-3

When saving a checkpoint with ZeRO-3 sharding and 16 GPUs, calling
state_dict() on rank 0 only returns that rank's local shard (~1/16 of
each parameter), producing a ~3MB file instead of the full ~14GB model.
On resume, loading this broken file silently falls back to the initial
HuggingFace weights, so the ref policy has step-0 weights while the
training model has step-N weights, causing KL≈2 at the first logged step.

Fix by using GatheredParameters (the same pattern as save_model) to
assemble the full tensors on rank 0 before saving.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
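
A hedged sketch of the GatheredParameters pattern this commit names; the rank handling and save path are simplified.

    import torch
    import deepspeed

    def save_ref_policy(model, path: str, global_rank: int) -> None:
        # Gather the ZeRO-3 shards so rank 0 sees full tensors rather than
        # its local ~1/world_size slice of each parameter.
        with deepspeed.zero.GatheredParameters(list(model.parameters()), modifier_rank=0):
            if global_rank == 0:
                torch.save(model.state_dict(), path)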

* Update transformers to v5.3.0 from PyPI

Remove the custom fork (yanhong-lbh/transformers olmo-3.5-hybrid-clean)
and pin to the official transformers==5.3.0 release, which now includes
olmo3_2_hybrid support natively.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update vllm and transformers to latest olmo-3.5-hybrid-clean commits

Bumps both forks to their latest commits and relaxes numpy<2 to
numpy>=2, which the new vllm now requires via opencv-python-headless.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update vllm, transformers, and torch to latest versions

- vllm and transformers: latest olmo-3.5-hybrid-clean commits
- torch: 2.9.0 -> 2.10.0 (required by new vllm's torchvision==0.25.0)
- numpy: <2 -> >=2 (required by new vllm's opencv-python-headless)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update olmo3_2_hybrid references to olmo_hybrid for new transformers fork

The transformers fork renamed the module from olmo3_2_hybrid to
olmo_hybrid with new class names (OlmoHybrid* instead of Olmo3_2Hybrid*).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix LD_LIBRARY_PATH for pip-bundled NVIDIA libs in launch script

torch 2.10+cu129 bundles CUDA 12.9 libs that must be found before
system CUDA libs (e.g. 12.8) to avoid missing symbol errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add repro script for olmo3_2_hybrid model_type not recognized

The new transformers fork (olmo-3.5-hybrid-clean) renamed the model type
from olmo3_2_hybrid to olmo_hybrid, but existing HF checkpoints still use
the old name, causing AutoConfig.from_pretrained() to fail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Register olmo3_2_hybrid model_type for backward compat and simplify repro script

Register OlmoHybridConfig under the old olmo3_2_hybrid model_type in
grpo_fast.py so AutoConfig works with old checkpoints. Simplify repro
script to just test the real HF checkpoint.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove broken olmo3_2_hybrid AutoConfig registration from grpo_fast.py

The registration fails because OlmoHybridConfig has model_type="olmo_hybrid",
causing a mismatch ValueError. The HF checkpoint has been updated to use
olmo_hybrid, so the backward-compat registration is no longer needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove FLASH_ATTENTION_SKIP_CUDA_BUILD to fix torch 2.10 compat

flash-attn 2.8.3 has no prebuilt wheel for torch 2.10, so with
SKIP_CUDA_BUILD the CUDA kernels were never compiled and importing
flash_attn_2_cuda failed at runtime. Removing the flag lets uv build
from source against the installed torch version.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Bump hybrid RL exp_name to v2 for fresh checkpoint after layernorm key change

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* added script

* Fix pg_options TypeError by using inspect instead of string version comparison

The lexicographic comparison `str(torch.__version__) >= "2.6"` fails for
torch 2.10+ because "2.10" < "2.6" as strings. Replace with
inspect.signature introspection on _new_process_group_helper to detect the
correct parameter name. Also surface Ray actor exceptions in
ray_get_with_progress by catching and logging before re-raising.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
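
A hedged sketch of the signature introspection described here; the alternative keyword name ("backend_options") is an assumption used only for illustration.

    import inspect
    import torch.distributed.distributed_c10d as c10d

    # String comparison breaks on torch 2.10+: "2.10" < "2.6" lexicographically.
    # Inspect the private helper's signature instead to pick the right keyword.
    params = inspect.signature(c10d._new_process_group_helper).parameters
    options_kwarg = "backend_options" if "backend_options" in params else "pg_options"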

* Add vLLM vs HF logprobs comparison test for hybrid model

Tests that vLLM and HuggingFace produce matching logprobs using the
Olmo-Hybrid-Instruct-DPO-7B checkpoint, with a Beaker launch script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Copy test into scripts/ so it's included in Docker image

tests/ is in .dockerignore, so place the test file in scripts/ which
is copied into the image.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix test to use rl_utils (not rl_utils2) and pass vllm_logprobs arg

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix vLLM generate API: use dict prompt instead of prompt_token_ids kwarg

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Bump exp_name to avoid stale hybrid checkpoint

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add transformer model (Olmo-3-1025-7B) to logprobs comparison test

Tests both hybrid and transformer models to compare vLLM vs HF logprobs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove -x flag so all test cases run even if some fail

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add explicit GPU memory cleanup between test cases

Delete model and vLLM engine and clear CUDA cache after each use
to prevent OOM when running multiple model tests sequentially.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Set vLLM gpu_memory_utilization=0.5 to avoid OOM across test cases

vLLM subprocess retains GPU memory even after del, so limit its
allocation to allow subsequent model loads.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Restore dynamic get_kv_cache_info() with fallback for hybrid models

The hardcoded inference_batch_size=64 was causing 22x worse logprob
divergence and 6.4x slower weight sync vs old-main. Restore the dynamic
KV cache calculation (~12 for Olmo-3-1025-7B on H100) with a try/except
fallback to 16 for hybrid models where collective_rpc hangs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
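
A minimal sketch of the fallback behaviour this commit restores; the fallback value of 16 comes from the commit text, while the function shape is assumed.

    def resolve_inference_batch_size(get_kv_cache_info, fallback: int = 16) -> int:
        """Prefer the dynamic KV-cache-derived batch size; fall back for hybrid models."""
        try:
            return get_kv_cache_info()
        except Exception:
            # collective_rpc("get_kv_cache_spec") fails on hybrid models, so use a
            # conservative fixed batch size instead.
            return fallback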

* Bump exp_name to avoid restoring stale checkpoint

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove --no_auto_dataset_cache to avoid HF rate limiting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert: restore --no_auto_dataset_cache (local caching fails)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove --no_auto_dataset_cache to fix HF rate limiting on Beaker

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Restore --no_auto_dataset_cache (olmo123 template not in CHAT_TEMPLATES)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add vLLM A/B test script for weight sync performance comparison

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add .claude/worktrees/ to .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add weight_sync A/B test script: load_ref_policy=true vs false

Identical to large_test_script.sh except load_ref_policy=true.
Baseline (large_test_script / vllm_ab_test) has ~14.6s weight_sync.
Hypothesis: load_ref_policy=true causes GPU memory pressure → slower weight_sync.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add weight_sync A/B test 2: match slow grpo_p64 config exactly

Changes from large_test_script.sh: load_ref_policy=true,
num_mini_batches=4, sequence_parallel_size=1. These are the
remaining config differences vs the slow grpo_p64_2node_v2 run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Pin vLLM fork to working commit 411aac2e2

The fork's branch head (b041ef4f9) introduced a msgspec
ValidationError for hybrid models: "Expected array of length 1,
got 2 - at $.dtypes". Pin to the last known working commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Override torchvision/torchaudio to fix circular import with torch 2.10

Pinning vLLM to an older commit caused uv to resolve torchvision 0.24.1
(instead of 0.25.0), which has a circular import bug with torch 2.10.0.
Add override-dependencies to ensure compatible versions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Pin torch <2.10 for compatibility with pinned vLLM commit

The vLLM fork at commit 411aac2e2 was built against torch 2.9.x. Using
torch 2.10.0 causes an undefined symbol error in vLLM's CUDA kernels.
Revert torch constraint to <2.10 to match the working configuration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Pin vLLM to 3677c274d: has OlmoHybrid, before dtypes bug

Previous pin (411aac2e2) was upstream vllm-main without OlmoHybrid support.
This commit is the last fork-specific OlmoHybrid commit before the merge
that introduced the dtypes serialization bug. Requires torch 2.10.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Force rebuild: empty commit to get new git hash

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove torchvision/torchaudio from Docker image

torchvision 0.25.0 has a circular import bug with torch 2.10 that
crashes on startup. Neither package is used at runtime.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Pin torch >=2.10 to match vLLM 3677c274d build ABI

UV was resolving torch 2.9.1 despite <2.11 constraint, causing ABI
mismatch with vLLM compiled against torch 2.10.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add VLLM_ALLOW_INSECURE_SERIALIZATION for hybrid model support

The hybrid model's dual KV cache dtypes cause a msgspec validation
error (Expected array of length 1, got 2). Setting this env var uses
pickle serialization instead, which handles variable-length tuples.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Patch vLLM MambaSpec to allow variable-length dtypes tuple

The hybrid model has 2 KV cache dtypes but MambaSpec.dtypes is typed
as tuple[torch.dtype] (exactly 1). Patch to tuple[torch.dtype, ...]
so msgspec accepts variable-length tuples.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
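
A self-contained toy illustration (not vLLM's real MambaSpec) of why a fixed-length tuple annotation rejects the hybrid model's two dtypes while the variable-length form accepts them.

    import msgspec

    class FixedSpec(msgspec.Struct):
        dtypes: tuple[str]        # exactly one element permitted

    class VariableSpec(msgspec.Struct):
        dtypes: tuple[str, ...]   # any number of elements permitted

    payload = msgspec.msgpack.encode({"dtypes": ["bfloat16", "float32"]})
    msgspec.msgpack.decode(payload, type=VariableSpec)   # decodes fine
    # msgspec.msgpack.decode(payload, type=FixedSpec)    # ValidationError: expected array of length 1, got 2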

* Add logprob comparison tests: eager vs compiled vs fp32 SSM state

Tests three vLLM modes against HF logprobs for hybrid models:

1. eager mode (baseline)

2. compiled mode (expected divergence due to bf16 state compounding)

3. compiled + fp32 SSM state (expected fix per team findings)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add long-sequence logprob comparison tests (256, 512, 1024 tokens)

Tests hybrid model divergence compounding over sequence length across eager, compiled, and compiled+fp32 SSM state modes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Test prefill-mode logprob divergence up to 16384 tokens

Matches production GRPO scenario: both vLLM and HF score the same pre-existing sequence in a single forward pass (prefill/chunk path). Tests eager, compiled, and compiled+fp32 SSM state at 256-16384 tokens.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Test GRPO-matching logprob comparison: vLLM generate + HF score

Uses real dataset prompts (hamishivi/rlvr_acecoder_filtered_filtered) with olmo123 chat template. vLLM generates up to 16384 tokens, then HF scores the full sequence in one forward pass. Tests eager, compiled, and compiled+fp32 SSM state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Force full-length generation with ignore_eos, test 1024-8192 tokens

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix sequence packing recurrent state leak for hybrid models

For hybrid (recurrent) models, SSM state accumulated across packed sequence
boundaries, contaminating logprobs of later sequences in a pack. Add
_forward_packed_sequences_separately() which processes each packed sequence
independently with fresh state. Enabled via reset_recurrent_state flag,
auto-detected for olmo_hybrid models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Relax packing test threshold: bf16 gives ~0.21 mean diff

The state-reset path gives 0.21 mean diff vs individual (compared to 4.75
for naive packed). The residual is bf16 numerical noise, not state leak.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Free intermediate tensors in separate forward to prevent OOM

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Rename experiment to v3_packing_fix for fresh start

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Pre-download model to shared HF cache before spawning Ray actors

Prevents HuggingFace rate limits when 16 processes try to download simultaneously.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Monkey-patch OlmoHybridGatedDeltaNet to pass cu_seqlens for packed sequences

Replace the separate-forward-pass workaround (_forward_packed_sequences_separately) with a monkey-patch that passes cu_seqlens to the FLA recurrent kernels. This resets recurrent state at sequence boundaries natively within a single forward pass, matching the olmo-core implementation.

The old approach ran N separate model() calls per packed sequence, which was catastrophically slow with DeepSpeed ZeRO-3 due to multiplied NCCL all-gather overhead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add standalone repro script for OlmoHybrid recurrent state leak

Demonstrates that OlmoHybridGatedDeltaNet.forward() doesn't pass cu_seqlens to FLA recurrent kernels, causing SSM state to leak across sequence boundaries in packed inputs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Move snapshot_download before tokenizer load and set HF_HUB_OFFLINE after

The tokenizer load was hitting HF API rate limits (429) because snapshot_download ran after make_tokenizer(). Move it before, and set HF_HUB_OFFLINE=1 after download to prevent transformers from making API calls (e.g. model_info() inside _patch_mistral_regex).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
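
A hedged sketch of the ordering this commit describes: download once to the shared cache, then go offline before any tokenizer or config load; make_tokenizer stands in for the project's own loader and is not shown.

    import os
    from huggingface_hub import snapshot_download

    def prepare_model_locally(repo_id: str) -> str:
        local_path = snapshot_download(repo_id=repo_id)  # single download into the shared HF cache
        # Prevent later loads (e.g. model_info() during tokenizer setup) from
        # hitting the Hub API and getting rate-limited (429).
        os.environ["HF_HUB_OFFLINE"] = "1"
        return local_path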

* Use stock transformers>=5.3.0 and base Olmo-Hybrid-7B model

Drop trust_remote_code and custom fork dependency so the repro script works with stock transformers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restructure logprobs comparison tests for hybrid vs transformer diagnosis

- Parameterize on both hybrid and transformer models at 1024/4096/8192 tokens
- Use full production dataset mix (6 datasets) for realistic prompts
- Add TestPatchEffect: compare patched vs unpatched HF vs vLLM
- Add TestLengthScaling: measure gap growth at 128-8192 tokens
- Keep TestPackingStateLeak for cu_seqlens correctness
- Use logger_utils and modern type hints per project conventions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Move LD_LIBRARY_PATH setup before uv sync in launch script

flash-attn build imports torch which loads cusparse, requiring
nvjitlink 12.9. The pip-installed NVIDIA libs in .venv have the
right version but LD_LIBRARY_PATH was set after uv sync.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Focus logprobs tests on production config: 8192 response tokens

Match scripts/train/olmo3/7b_instruct_hybrid_rl.sh:
- response_length=8192, pack_length=11264, max_prompt_length=2048
- 1 prompt per dataset (6 total) instead of 2 (12)
- TestLengthScaling: 1024/4096/8192 instead of 128-8192
- Bump Beaker timeout to 120m

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Handle tokenizers with multiple chat templates in logprobs test

Olmo-3-1025-7B has multiple chat templates with no default, causing
apply_chat_template to raise ValueError. Use 'default' template
when tokenizer.chat_template is a dict.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
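
A minimal sketch of the dict-valued chat_template handling this commit describes; using the "default" key comes from the commit text, and the helper name is an assumption.

    def render_chat(tokenizer, messages: list[dict]) -> str:
        template = tokenizer.chat_template
        if isinstance(template, dict):
            # Multiple named templates with no implicit default: pick "default" explicitly.
            template = template["default"]
        return tokenizer.apply_chat_template(messages, chat_template=template, tokenize=False)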

* Fix chat template for base models and launch script for local builds

- Fall back to hybrid model's tokenizer for chat template formatting
  when the target model has no chat template (e.g. Olmo-3-1025-7B)
- Use --no-install-package for flash-attn/vllm in local uv sync
- Move LD_LIBRARY_PATH setup before uv sync

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix chat template to match production and add test results to doc

- Use CHAT_TEMPLATES["olmo"] for transformer model (matching production
  experiment 01KK201Y3C2Z6VNJVKRPASEGHA which used --chat_template_name olmo)
- Hybrid model uses its built-in chat_template.jinja (matching production
  experiment 01KKVTQQZ86A5PB1MV1C2337DQ which used --chat_template_name olmo123)
- Document test results from experiment 01KM0KKAAAAACZERB01GR7M5YY
- Add instructions for running the test suite

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update doc with corrected test results using proper olmo chat template

Results from experiment 01KM0QY7F73Y1JG6P7S7Q4PYYA with the correct
chat template (olmo for transformer, built-in for hybrid).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Document hypotheses for test-vs-production divergence with test plans

Two hypotheses explain the inverted pattern (hybrid better in test,
worse in production; transformer worse in test, better in production):
1. cu_seqlens packing error at long sequences (hybrid only)
2. Production responses shorter than 8192 due to stop strings (transformer only)

Each includes a concrete test plan to confirm or reject.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add hypothesis tests for packing error and natural response lengths

TestPackingAtProdLength (hypothesis 1): Pack two ~4500-token responses
into an 11264-token tensor matching production config. Compare
patched-packed vs individual scoring to measure cu_seqlens packing
error at production lengths.

TestNaturalResponseLength (hypothesis 2): Generate without ignore_eos
using production stop strings to measure actual response lengths and
vLLM-HF diff at those lengths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove redundant tests: TestPatchEffect, TestLengthScaling, TestPackingStateLeak

- TestPatchEffect: confirmed patch has zero effect on single sequences
- TestLengthScaling: superseded by TestNaturalResponseLength
- TestPackingStateLeak: superseded by TestPackingAtProdLength

Remaining tests: TestGRPOLogprobsMatch, TestPackingAtProdLength,
TestNaturalResponseLength (4 test cases total).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add TestVllmVsPackedHF to reproduce production logprob comparison

Replace TestPackingAtProdLength with TestVllmVsPackedHF which directly
replicates the production metric: vLLM logprobs vs packed-HF logprobs
scored with forward_for_logprobs. Two variants (packed/single) to
isolate whether the gap comes from packing or the scoring function.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix bool mask in TestVllmVsPackedHF: response_masks are Long, not bool

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
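The underlying PyTorch gotcha, as a tiny standalone example rather than the test code itself:

```python
import torch

# Indexing with a Long tensor gathers by position, while indexing with a bool
# tensor selects by mask. With a 0/1 Long mask the first form silently
# returns the wrong elements instead of raising an error.
logprobs = torch.tensor([0.1, 0.2, 0.3, 0.4])
response_mask = torch.tensor([0, 1, 1, 0])  # Long, as produced by the dataloader

wrong = logprobs[response_mask]         # gather: tensor([0.1, 0.2, 0.2, 0.1])
right = logprobs[response_mask.bool()]  # mask:   tensor([0.2, 0.3])
print(wrong, right)
```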

* Update docs with TestVllmVsPackedHF results: packing does not explain gap

Packed vLLM-vs-HF diff is 0.012, single is 0.030 — both far below
the production 0.24-0.58. Transformer gap explained by natural response
lengths (median 1559 tokens vs forced 8192). Hybrid gap remains unexplained
by isolated tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add hybrid packed metrics debug script and W&B analysis results

Production W&B run (fo8rjg42) only has combined diff metric, no
packed/unpacked split. Debug script runs a short hybrid GRPO training
to capture the split metrics (debug/vllm_diff_mean_packed vs _unpacked).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Revert torch to <2.10 and restore FLASH_ATTENTION_SKIP_CUDA_BUILD

Torch 2.10 cu129 wheels bundle mismatched NVIDIA pip libraries
(cusparse 12.7 vs nvjitlink <12.9) that crash on import in uv's
build isolation environment on CPU-only machines. Reverting to
torch <2.10 to match main and restore local uv run/sync.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add --no-install-package flags to single_gpu_on_beaker.sh

uv run on CPU machines cannot build flash-attn or vllm from source. These packages are only needed inside the Beaker container, not for mason.py submission.

* Use uv run --no-sync in hybrid_packed_metrics.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove torch <2.10 upper bound to fix vllm ABI mismatch

The vllm fork (3677c274d) compiles against torch 2.10's C10 CUDA API, where c10_cuda_check_implementation takes unsigned int; torch 2.9.1 uses int. Removing the upper bound lets vllm's build isolation use the correct torch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Bump torch to >=2.10.0 to match vllm fork's ABI

The vllm fork (3677c274d) was compiled against torch 2.10's C10 CUDA API; torch 2.9.1 has an incompatible symbol signature (int vs unsigned int).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Use HF repo allenai/Olmo-Hybrid-Instruct-DPO-7B in single GPU debug script

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Use eager attention in single GPU debug script to avoid flash-attn issue

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Revert "Use eager attention in single GPU debug script to avoid flash-attn issue"

This reverts commit 9c97550cb0fc4bc5d58164859435c7484bb94ffc.

* Remove FLASH_ATTENTION_SKIP_CUDA_BUILD so flash-attn compiles CUDA kernels

The skip flag prevented flash_attn_2_cuda from being built in Docker.

Local dev uses --no-install-package flash-attn so this doesn't affect CPU machines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Use uv run --no-sync in large_test_script.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Use uv run --no-sync in 7b_instruct_hybrid_rl.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Bump torch>=2.10, transformers>=5.3, vllm>=0.18 for olmo_hybrid support

olmo_hybrid was upstreamed to transformers 5.3.0. Override vllm's
transformers<5 pin since the models are compatible.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Guard configure_http_backend for huggingface_hub 1.x compatibility

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
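A minimal sketch of such a guard, with a placeholder session factory (the real retry setup lived in configure_hf_hub_retry, which a later commit in this PR removes entirely):

```python
import huggingface_hub

# configure_http_backend() exists in huggingface_hub 0.x but was removed in
# 1.x, where retries are handled internally; only call it when available.
if hasattr(huggingface_hub, "configure_http_backend"):
    import requests

    def _session_factory() -> requests.Session:
        # Placeholder session setup; a real factory would attach retry adapters.
        return requests.Session()

    huggingface_hub.configure_http_backend(backend_factory=_session_factory)
```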

* Fix tokenizer.decode() to pass list for transformers 5.x compatibility

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
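A hedged sketch of the normalization, using a placeholder tokenizer: datasets can hand back numpy or string token ids, so everything is cast to plain Python ints and passed to decode() as a list.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

raw_ids = tokenizer("hello world")["input_ids"]
token_ids = [int(t) for t in raw_ids]  # normalize numpy/str ids to Python ints
text = tokenizer.decode(token_ids)     # decode() gets a list, per transformers 5.x
```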

* Cast tokens to int in visualize_token for datasets with string tokens

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Guard visualize_token call against dataset format changes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Switch to custom vllm fork for olmo_hybrid serialization support

Upstream vllm 0.18.0's v1 engine can't serialize olmo_hybrid Mamba state outputs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Document vLLM hybrid dtype bug with minimal repro script

MambaSpec.dtypes is tuple[torch.dtype] (length 1) but hybrid models send 2 dtypes, causing msgspec.ValidationError during serialization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
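A standalone illustration of the msgspec behavior behind the bug (this is not the repo's repro script, and a str field stands in for torch.dtype):

```python
import msgspec

# msgspec treats tuple[X] as a fixed length-1 tuple, so a second element
# fails validation, mirroring the hybrid model's two-dtype case.
class Spec(msgspec.Struct):
    dtypes: tuple[str]

try:
    msgspec.convert({"dtypes": ("bfloat16", "float32")}, Spec)
except msgspec.ValidationError as err:
    print(err)  # expected tuple of length 1 (message wording may differ)
```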

* Simplify repro script and add Beaker launcher

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Cleaned up PR

* Add if __name__ guard for vllm spawn multiprocessing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
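A minimal sketch of the guard, with a placeholder model: vLLM's spawn workers re-import the launching module, so top-level engine construction would otherwise recurse.

```python
from vllm import LLM, SamplingParams

def main() -> None:
    # Model and parameters are placeholders, not this PR's configuration.
    llm = LLM(model="facebook/opt-125m", dtype="bfloat16")
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
    print(outputs[0].outputs[0].text)

if __name__ == "__main__":
    # Required with the spawn start method: child processes re-import this
    # module, and the guard keeps them from re-running engine construction.
    main()
```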

* Upgrade vLLM to 0.17.1, torch to 2.10, and uv to 0.10.12 (#1556)

* Upgrade vLLM to 0.17.1, torch to 2.10, and uv to 0.10.12 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Cleaned up PR

* Cleaned up PR

* Cleaned up PR

* Remove investigation artifacts and minimize diff vs main Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add hybrid model test scripts for GRPO, DPO, and benchmarks Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Switch from vllm fork to vllm>=0.18.0 from PyPI, remove flash-attn v2 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add minimal repro for vLLM 0.18.0 hybrid dtype serialization bug Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix repro script multiprocessing guard, remove unused VLLM_ATTENTION_BACKEND Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Rebased

* Simplify repro script and add Beaker launcher Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove unrelated changes from hybrid benchmarks branch (return_dict=False, dead code, defensive try/except, compat shims, token_count tests) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Switch hybrid GRPO test to TP=1 with 8 engines to work around vLLM dtype serialization bug Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update repro to match Ray actor path: VLLM_ENABLE_V1_MULTIPROCESSING=0 + mp executor Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Repro via collective_rpc get_kv_cache_spec + fix configure_http_backend for new HF Hub Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Work around vLLM hybrid dtype bug: catch collective_rpc failure in get_kv_cache_info Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Cast tokens to int in visualize_token for datasets with string tokens Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Switch hybrid GRPO test to TP=1 to match production config Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Guard visualize_token call against dataset format changes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove unused vllm_dtype config field and parameter Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Monkey-patch MambaSpec.dtypes to fix vLLM hybrid model serialization bug Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add return_dict=False to all apply_chat_template calls for transformers 5.x compat Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
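A hedged sketch of the call shape, with a placeholder tokenizer that ships a chat template:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # placeholder
messages = [{"role": "user", "content": "Hello"}]

token_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=False,  # keep the flat list-of-ids return shape the pipeline expects
)
```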

* Remove try/except guard around visualize_token Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Bump DATASET_CACHE_VERSION to v6 to invalidate stale caches from apply_chat_template return_dict fix Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Auto-detect attention implementation (fa3 > fa2 > sdpa) instead of hardcoding flash_attention_2 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Keep attn_implementation as optional CLI override, fall back to auto-detect Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
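A hedged sketch of the preference order; the package names used for detection and the flash_attention_3 value are assumptions, and the CLI override always wins.

```python
import importlib.util

def pick_attn_implementation(override: str | None = None) -> str:
    """Prefer fa3, then fa2, then sdpa, unless the caller overrides."""
    if override is not None:
        return override
    if importlib.util.find_spec("flash_attn_interface") is not None:  # flash-attn 3 (assumed name)
        return "flash_attention_3"
    if importlib.util.find_spec("flash_attn") is not None:  # flash-attn 2
        return "flash_attention_2"
    return "sdpa"
```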

* Move min_tokens to extra_body in vLLM completions.create() call to fix OpenAI client compat Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
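A minimal sketch against a local vLLM OpenAI-compatible server; the URL and model id are placeholders. min_tokens is a vLLM-specific sampling parameter the OpenAI client does not accept as a named argument, so it rides along in extra_body.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="allenai/Olmo-Hybrid-Instruct-DPO-7B",
    prompt="Write a haiku about benchmarks.",
    max_tokens=128,
    extra_body={"min_tokens": 16},  # forwarded verbatim to vLLM's sampling params
)
print(completion.choices[0].text)
```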

* Remove configure_hf_hub_retry (HF hub handles retries internally), simplify benchmark_generators SamplingConfig Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Cleaned up PR.

* Revert non-hybrid changes (mason VLLM_ATTENTION_BACKEND, configure_hf_hub_retry, DeepSpeed hooks) — split to separate PRs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove configure_hf_hub_retry; configure_http_backend was removed in huggingface_hub v1.0 (now v1.7.2) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restore 'here marks the end of the previous messages' comment in dataset_transformation.py Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restore accidentally deleted repro_vllm_hybrid_dtype scripts Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Use fallback_inference_batch_size instead of sys.maxsize when get_kv_cache_info fails on hybrid models Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* cleaned up PR

* Added qwen3.5 test

* Remove try/except around get_kv_cache_info and fallback_inference_batch_size since MambaSpec monkey-patch fixes the root cause Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Use get_text_config() in ModelDims.from_hf_config to support multimodal models like Qwen3.5 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
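A minimal sketch of the lookup, with a placeholder multimodal model id: on a multimodal config the text dimensions live on a nested text config, and get_text_config() returns the config itself for text-only models, so one code path covers both.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")  # placeholder
text_config = config.get_text_config()
print(text_config.hidden_size, text_config.num_hidden_layers)
```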

* Switch large_test_script_hybrid.sh to Olmo-Hybrid-Instruct-DPO-7B model Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Hardcode vllm dtype to bfloat16 and remove unused vllm_dtype parameter Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix CI: add get_text_config to test mocks, setup beaker for GPU override verify, add CHANGELOG entry Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
