
Runs the benchmarks with the hybrid model#1425

Merged
finbarrtimbers merged 282 commits into main from finbarr/hybrid-benchmarks
Mar 26, 2026

Conversation

@finbarrtimbers
Collaborator

@finbarrtimbers finbarrtimbers commented Jan 26, 2026

Summary

  • Add hybrid model (Olmo-Hybrid) support to the training pipeline
  • Monkey-patch vLLM 0.18.0 MambaSpec.dtypes to fix hybrid model dtype serialization bug
  • Force tokens to be decoded to int, as HF now returns numpy dtypes
  • Pass trust_remote_code through to vLLM engines for custom model/tokenizer support
  • Add min_tokens to SamplingConfig with extra_body pass-through for vLLM API compat (see the sketch after this list)
  • Upgrade dependencies: vllm>=0.18.0, transformers>=5.3.0, torch upper bound removal
  • Add return_dict=False to all apply_chat_template calls for transformers 5.x compat
  • Remove configure_hf_hub_retry() — the underlying huggingface_hub.configure_http_backend() was removed in huggingface_hub v1.0.0 (we now pull in v1.7.2 via the vllm/transformers upgrade). HF Hub v1.0+ has built-in retry logic via http_backoff() with the same 429/5xx exponential backoff behavior, and the pre-download fix from PR #1528 (Pre-download HF model before Ray actors spawn) remains in place
  • Add hybrid test scripts (GRPO + DPO) and production training scripts
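
A minimal sketch of the min_tokens pass-through referenced above, assuming the stock openai Python client pointed at a vLLM OpenAI-compatible server; the base URL and model name are placeholders, and the SamplingConfig wiring in the repo is not shown.

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder vLLM server

    def generate_fixed_length(prompt: str, n_tokens: int):
        # min_tokens is vLLM-specific and not part of the OpenAI API surface,
        # so it travels via extra_body, which is merged into the request JSON.
        return client.completions.create(
            model="placeholder-model",
            prompt=prompt,
            max_tokens=n_tokens,
            extra_body={"min_tokens": n_tokens},  # force exact-length responses for throughput benchmarks
        )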

Runs

  1. Multi-node GRPO: Beaker
  2. DPO 2-node: Beaker
  3. Single GPU benchmark: Beaker

GPU_TESTS=01KMNGSSB8HAR6HKS1136BJFZH

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@gemini-code-assist
Contributor

Summary of Changes

Hello @finbarrtimbers, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on integrating and benchmarking a 'hybrid model' with VLLM. It introduces explicit control over VLLM's data type, updates the VLLM dependency to a custom fork that likely provides hybrid model support, and enhances the stability of VLLM request handling. A new script is added to streamline single GPU benchmarking with these configurations, complemented by a comprehensive update and cleanup of various Python package dependencies.

Highlights

  • VLLM Data Type Configuration: The VLLM configuration now supports specifying the data type (vllm_dtype), defaulting to bfloat16, and this parameter is correctly propagated to the VLLM engine creation process.
  • VLLM Tool Parser Refactoring: Updated VLLM tool parser import paths to align with internal refactoring within the VLLM library, moving them from vllm.entrypoints.openai.tool_parsers to vllm.tool_parsers (a fallback-import sketch follows this list).
  • VLLM Request Processing Enhancements: Improved robustness in VLLM request processing by adding timeouts and error handling for queue retrieval, and addressed Ray serialization issues by manually constructing SamplingConfig objects instead of using dataclasses.replace.
  • Custom VLLM Dependency and Torch Override: The VLLM dependency has been updated to point to a custom Git fork (yanhong-lbh/vllm branch olmo-3.5-hybrid), removing previous specific version pins. Additionally, an override for the torch dependency (torch>=2.9.0,<2.10) has been added.
  • New Hybrid Model Benchmarking Script: A new benchmarking script (hybrid_single_gpu.sh) has been introduced for single GPU hybrid models, which configures VLLM with vllm_dtype auto and other specific parameters for performance evaluation.
  • Extensive Dependency Updates and Cleanup: Numerous Python package dependencies in requirements.txt have been updated to newer versions. Several CUDA-related and flashinfer-python dependencies were removed, while grpcio-reflection and intel-openmp were added, indicating a significant dependency overhaul.
  • Removal of GPU Rollout Saving Tests: GPU-specific tests for rl_utils rollout saving functionality have been removed from the codebase.
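
A hedged sketch of tolerating the tool-parser module move noted above; ToolParserManager is an assumed symbol used purely for illustration.

    try:
        # newer layout described in the highlight above
        from vllm.tool_parsers import ToolParserManager  # assumed symbol
    except ImportError:
        # older layout prior to the vLLM refactor
        from vllm.entrypoints.openai.tool_parsers import ToolParserManager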


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

The pull request introduces changes to integrate a hybrid model for benchmarking, primarily focusing on updating vLLM dependencies and configurations. Key changes include adding vllm_dtype to VLLMConfig and passing it to engine creation, updating vLLM import paths and parser configurations to align with internal library changes, and enhancing error handling during vLLM engine initialization. A new benchmarking script hybrid_single_gpu.sh has been added, and test_rl_utils_gpu.py has been removed. Dependency versions in pyproject.toml and requirements.txt have also been updated.

Comment thread open_instruct/vllm_utils.py Outdated
Comment on lines +453 to +454
except Exception:
pass
Contributor


Severity: high

Catching a broad Exception and then passing silently can hide critical issues and make debugging very difficult. It's best practice to either catch more specific exceptions or at least log the exception details to ensure that unexpected errors are not swallowed.

Suggested change:
-    except Exception:
-        pass
+    except Exception as e:
+        logger.error(f"Error getting request from prompt queue: {e}")

@finbarrtimbers finbarrtimbers force-pushed the finbarr/hybrid-benchmarks branch from 0291177 to 3ce4c53 on January 28, 2026 at 23:35
finbarrtimbers and others added 26 commits February 4, 2026 22:46
The get_kv_cache_spec RPC hangs on hybrid models due to multi-dtype
serialization issues in vLLM v1. Skip the call and use default batch size.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Set min_tokens = max_tokens to ensure all responses are exactly
the specified length for accurate throughput benchmarking.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
4 engines * 2 tensor_parallel = 8 GPUs total.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
min_tokens is a vLLM-specific parameter not supported by the
standard OpenAI API, so it must be passed in extra_body.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The hybrid model experiment reached step 18/19 in 64 minutes before
hitting the 1h timeout. Increasing to 90m to allow completion.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Switch to tulu_thinker template (same as main) to isolate the
model as the only variable when comparing performance.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Keep the origin/main Qwen version as the default large_test_script.sh
and save the hybrid model version separately.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Let exceptions propagate to identify the root cause of
description updates not showing progress bars.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This will help debug why progress bars aren't appearing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Log BEAKER_JOB_ID and BEAKER_WORKLOAD_ID availability
before and after each update call.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The collective_rpc("get_kv_cache_spec") call in get_kv_cache_info() hangs
on hybrid models (like OLMo with linear RNNs) due to multi-dtype
serialization issues when using the multiprocessing executor (TP>1).

This fix returns the pre-set inference_batch_size (64) instead of calling
collective_rpc, which matches the fallback behavior already documented
in the LLMRayActor.__init__ method.

Also adds PYTHONUNBUFFERED=1 to the hybrid test script.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
beaker_config was only assigned inside the push_to_hub block but
accessed outside it, causing an UnboundLocalError when push_to_hub
was false.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The standard Python logger doesn't support the main_process_only
parameter that is specific to Accelerate's logger.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Transformers v5 changed apply_chat_template to return a BatchEncoding
dict by default. This broke tokenization for models using custom
tokenizers (e.g. hybrid OLMo 3.5). Instead of fixing each call site,
patch the method at tokenizer load time via functools.partial.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move cache path logging into build_reference_logprobs_cache
- Inline hash/path construction at call sites
- Fix E402 lint error in benchmark_generators.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add --gradient_checkpointing to single_gpu_cache_hybrid.sh to fix OOM
- Replace --gradient_checkpointing with --activation_memory_budget 0.5
  in multi_node_hybrid.sh (dpo.py uses new API)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace dead args.gradient_checkpointing references with
args.activation_memory_budget < 1.0 to match ExperimentConfig.
Update hybrid DPO cache scripts to use --activation_memory_budget 0.5.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…y_chat_template return_dict fix Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rdcoding flash_attention_2 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…detect Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…x OpenAI client compat Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mplify benchmark_generators SamplingConfig Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_hub_retry, DeepSpeed hooks) — split to separate PRs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…huggingface_hub v1.0 (now v1.7.2) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…set_transformation.py Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…red-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…cache_info fails on hybrid models Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

@hamishivi hamishivi left a comment


Nice! Some comments. Also, does this make qwen3.5 work out of the box now?

Comment thread open_instruct/vllm_utils.py
Comment thread open_instruct/vllm_utils.py Outdated
-    self.inference_batch_size = self.get_kv_cache_info()
+    try:
+        self.inference_batch_size = self.get_kv_cache_info()
+    except Exception:
Collaborator


do we need the try catch (and fallback inference size) with the monkey-patch?

Collaborator Author


Good question. Let me try removing it.

Collaborator Author


It's fine! Beaker.

Comment thread scripts/debug/repro_vllm_hybrid_dtype.py Outdated
…ch_size since MambaSpec monkey-patch fixes the root cause Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…al models like Qwen3.5 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…el Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…arks

# Conflicts:
#	open_instruct/vllm_utils.py
…r Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ide verify, add CHANGELOG entry Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@finbarrtimbers finbarrtimbers added this pull request to the merge queue Mar 26, 2026
Merged via the queue into main with commit d5a6b60 Mar 26, 2026
6 of 7 checks passed
@finbarrtimbers finbarrtimbers deleted the finbarr/hybrid-benchmarks branch March 26, 2026 22:00
hamishivi pushed a commit that referenced this pull request Mar 30, 2026
* Skip KV cache info call for hybrid model compatibility

The get_kv_cache_spec RPC hangs on hybrid models due to multi-dtype
serialization issues in vLLM v1. Skip the call and use default batch size.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Force exact response length in benchmark with min_tokens

Set min_tokens = max_tokens to ensure all responses are exactly
the specified length for accurate throughput benchmarking.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Use 4 engines with TP=2 to utilize all 8 GPUs

4 engines * 2 tensor_parallel = 8 GPUs total.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Move min_tokens to extra_body for vLLM API compatibility

min_tokens is a vLLM-specific parameter not supported by the
standard OpenAI API, so it must be passed in extra_body.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Increase batch timeout to 2400s for 32k token benchmarks

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Increase large_test_script timeout to 90m for hybrid model

The hybrid model experiment reached step 18/19 in 64 minutes before
hitting the 1h timeout. Increasing to 90m to allow completion.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add minimal repro for KV cache hang on hybrid models

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Set vLLM logging level to INFO in benchmark

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Use tulu_thinker template for hybrid model test

Switch to tulu_thinker template (same as main) to isolate the
model as the only variable when comparing performance.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Move hybrid model test script to separate file

Keep the origin/main Qwen version as the default large_test_script.sh
and save the hybrid model version separately.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Remove error suppression in Beaker description updates

Let exceptions propagate to identify the root cause of
description updates not showing progress bars.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add logging to Beaker description updates

This will help debug why progress bars aren't appearing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Use origin/main version of single_gpu_on_beaker.sh for testing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add debug logging for Beaker description updates

Log BEAKER_JOB_ID and BEAKER_WORKLOAD_ID availability
before and after each update call.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix get_kv_cache_info() hang on hybrid models with multiproc executor

The collective_rpc("get_kv_cache_spec") call in get_kv_cache_info() hangs
on hybrid models (like OLMo with linear RNNs) due to multi-dtype
serialization issues when using the multiprocessing executor (TP>1).

This fix returns the pre-set inference_batch_size (64) instead of calling
collective_rpc, which matches the fallback behavior already documented
in the LLMRayActor.__init__ method.

Also adds PYTHONUNBUFFERED=1 to the hybrid test script.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* added repro_rope_bug.py

* Update uv.lock

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add hybrid model DPO debug scripts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add hybrid model DPO cache debug scripts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix unbound beaker_config in dpo_tune_cache.py

beaker_config was only assigned inside the push_to_hub block but
accessed outside it, causing an UnboundLocalError when push_to_hub
was false.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
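
A minimal sketch of the fix pattern described in this commit; maybe_get_beaker_config is a hypothetical stand-in for the real lookup, and the actual call site in dpo_tune_cache.py may differ.

    def maybe_get_beaker_config():
        # Hypothetical stand-in for the real Beaker config lookup.
        return {"experiment": "example"}

    def resolve_beaker_config(push_to_hub: bool):
        # Bind beaker_config unconditionally so code after the conditional
        # never raises UnboundLocalError when push_to_hub is false.
        beaker_config = None
        if push_to_hub:
            beaker_config = maybe_get_beaker_config()
        return beaker_config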

* Fix logger.info() call with unsupported main_process_only kwarg

The standard Python logger doesn't support the main_process_only
parameter that is specific to Accelerate's logger.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Patch apply_chat_template to default return_dict=False

Transformers v5 changed apply_chat_template to return a BatchEncoding
dict by default. This broke tokenization for models using custom
tokenizers (e.g. hybrid OLMo 3.5). Instead of fixing each call site,
patch the method at tokenizer load time via functools.partial.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
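
A hedged sketch of the load-time patch this commit describes; the loader function name is an assumption, and only the return_dict default changes.

    import functools
    from transformers import AutoTokenizer

    def load_tokenizer(name_or_path: str):
        tokenizer = AutoTokenizer.from_pretrained(name_or_path)
        # Transformers v5 returns a BatchEncoding by default; restore the old
        # token-id-list behaviour for every downstream caller in one place.
        # Callers can still pass return_dict=True explicitly to override.
        tokenizer.apply_chat_template = functools.partial(
            tokenizer.apply_chat_template, return_dict=False
        )
        return tokenizer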

* Clean up build_reference_logprobs_cache callers and fix lint

- Move cache path logging into build_reference_logprobs_cache
- Inline hash/path construction at call sites
- Fix E402 lint error in benchmark_generators.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add TORCH_LOGS env var to single GPU hybrid DPO scripts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Sync hybrid DPO debug scripts with latest flags

- Add --gradient_checkpointing to single_gpu_cache_hybrid.sh to fix OOM
- Replace --gradient_checkpointing with --activation_memory_budget 0.5
  in multi_node_hybrid.sh (dpo.py uses new API)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix gradient checkpointing in dpo_tune_cache.py

Replace dead args.gradient_checkpointing references with
args.activation_memory_budget < 1.0 to match ExperimentConfig.
Update hybrid DPO cache scripts to use --activation_memory_budget 0.5.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Reduce multi_node_cache_hybrid.sh to 8k seq length to avoid OOM

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Reduce multi_node_cache_hybrid.sh to 4k seq length to avoid OOM

8k sequence length OOMs during backward pass on the hybrid model.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Reduce multi_node_cache_hybrid.sh to 2k seq length to avoid OOM

4k still OOMs during backward pass. The hybrid model's linear attention
layers have large intermediate states that HF gradient checkpointing
doesn't help enough with.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Try 1k seq len + budget=0.25 for hybrid DPO cache

2k with budget=0.5 still OOMs at ~72GB - the HF gradient checkpointing
isn't reducing memory enough for the hybrid model's linear attention
layers. Try more aggressive settings.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Improve gradient checkpointing in dpo_tune_cache.py

- Use use_reentrant=False for gradient checkpointing (more memory efficient
  and recommended for PyTorch 2.4+)
- Add log message when enabling gradient checkpointing
- Add PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to hybrid script
  to reduce memory fragmentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
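
A minimal sketch of the two settings this commit names; the surrounding training setup is assumed.

    import os

    # Reduce CUDA allocator fragmentation; must be set before CUDA tensors exist.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    def enable_activation_checkpointing(model) -> None:
        # Non-reentrant checkpointing is more memory efficient on recent PyTorch.
        model.gradient_checkpointing_enable(
            gradient_checkpointing_kwargs={"use_reentrant": False}
        )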

* Enable ZeRO-3 for hybrid DPO cache training

ZeRO-3 partitions model parameters, gradients, and optimizer state
across GPUs, significantly reducing per-GPU memory usage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix bf16 mixed precision for ZeRO-3 in dpo_tune_cache.py

When using ZeRO-3, the accelerator needs mixed_precision="bf16" set
explicitly, otherwise DeepSpeed's "auto" bf16 setting resolves to False,
causing FlashAttention to fail (it requires fp16 or bf16).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
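
A hedged sketch of requesting bf16 explicitly on the Accelerator, per this commit; the DeepSpeed plugin and config file details are omitted.

    from accelerate import Accelerator

    # With ZeRO-3, leaving the DeepSpeed bf16 setting on "auto" can resolve to False,
    # so ask for bf16 mixed precision explicitly.
    accelerator = Accelerator(mixed_precision="bf16")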

* Increase hybrid DPO cache to 4k seq length

Testing how far we can push sequence length with ZeRO-3.
2k worked, trying 4k.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Increase hybrid DPO cache to 8k seq length

4k worked with ZeRO-3. Doubling to 8k.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Increase hybrid DPO cache to 16k seq length

8k worked with ZeRO-3. Doubling to 16k.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add hybrid model support to dpo.py via olmo_core_utils

- Update ai2-olmo-core dependency to tyler/anejs/linear-rnns branch for FLA support
- Add get_olmo3_7b_hybrid_config() to build OLMo 3.1 7B hybrid config with GatedDeltaNet
- Map hybrid model checkpoint path to olmo3_7B_hybrid config in OLMO_MODEL_CONFIG_MAP

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Replace load_hf_model with safetensors loader supporting hybrid FLA checkpoints

- Load safetensors directly and remap keys using mappings derived from
  olmo-core's convert_checkpoint_to_hf_hybrid.py (reversed). This supports
  both standard and hybrid (FLA) checkpoints.
- Fix SpeedMonitorCallback kwarg renamed in linear-rnns branch
  (device_peak_flops -> device_peak_flops_per_second).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update transformers to latest HEAD of olmo-3.5-hybrid-clean branch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Lower activation memory budget to 0.1 for hybrid DPO

Previous run OOM'd at step 11 with budget=0.5 due to variable
sequence lengths causing memory spikes in the lm_head.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update hybrid DPO script with torchrun launch config for jupiter

Switch from accelerate to torchrun, remove NCCL env vars, add
hybrid-specific flags (trust_remote_code, zero_stage 3), loop over
SFT checkpoints and LRs, remove auto evals. Currently launching 1
job for testing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Replace --gradient_checkpointing with --activation_memory_budget 0.5

dpo_tune_cache.py doesn't accept --gradient_checkpointing as a CLI
arg; it uses --activation_memory_budget instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Filter rows with None message content in DPO preprocessing

One of the mixer datasets has rows where message content is None,
causing apply_chat_template to crash with TypeError. Skip these
rows by returning empty results that preference_tulu_filter_v1
will filter out.

Also add --description to mason.py call for more informative
Beaker experiment names.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Gather model params before reference logprob caching to avoid per-batch all-gather

With ZeRO-3 across multiple nodes, every forward pass during caching triggers
an all-gather of the full model. Since caching is inference-only, we use
unwrap_model_for_generation to gather params once upfront, run local forward
passes, then restore ZeRO-3 hooks for training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Use 4x batch size during reference logprob caching and report progress to Beaker

- Create a separate dataloader with 4x the training batch size for caching
  (only 31% memory used during caching, so plenty of headroom)
- Update Beaker experiment description every step during caching

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Reduce caching batch size multiplier from 4x to 2x to avoid OOM

4x OOMed trying to allocate 41.87 GiB for activations with only 38.84 GiB free
(full gathered model takes ~40GB in fp32 under ZeRO-3).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix add_hooks to use setup_zero_stage3_hooks (compatible with DeepSpeed 0.18+)

_register_hooks_recursively was removed in newer DeepSpeed. The replacement
is setup_zero_stage3_hooks() which re-registers all forward/backward hooks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix tokens_per_second undercount and add hybrid layer speed benchmark

- Fix token_count metric in dpo_tune_cache.py: the reduce(mean) + divide
  by grad_accum*logging_steps was applied to token_count before the
  num_processes correction, causing a 4x undercount. Now multiplies by
  the full correction factor (num_processes * grad_accum * logging_steps).
- Add GPU benchmark test comparing GatedDeltaNet (linear attention) vs
  FlashAttention layer speed to verify hybrid model training slowdown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Avoid OOM in full model benchmark by running models sequentially

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Use inline config for hybrid model benchmark (no HF download needed)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Error if flash-linear-attention is missing for hybrid models

Prevents silent fallback to slow PyTorch linear attention kernels
when the fla package is not installed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Use inline configs for both models in benchmark (no HF downloads)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Switch hybrid DPO script to accelerate launch and add FLA check

- Use accelerate launch with DeepSpeed config file instead of torchrun
  with --zero_stage, matching the OLMo3 DPO setup.
- Error if flash-linear-attention is missing for hybrid models to prevent
  silent fallback to slow PyTorch linear attention kernels.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update hybrid benchmark: model arch is NOT the slowdown cause

Beaker GPU tests showed the hybrid model (0.93x) is actually slightly
faster than OLMo3 at the full-model level, despite individual linear
attention layers being slower. The smaller hidden_size (3840 vs 4096)
compensates. The ~3.5x DPO training slowdown must come from distributed
training overhead, not model architecture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add Jupiter NCCL env vars to hybrid DPO script

Match the NCCL configuration that OLMo-core uses for Jupiter
(NCCL_IB_HCA, NCCL_SOCKET_IFNAME, TORCH_NCCL_AVOID_RECORD_STREAMS,
TORCH_DIST_INIT_BARRIER) to ensure inter-node traffic uses InfiniBand.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add hybrid DPO performance investigation doc

Track hypotheses and results for the ~3.5x hybrid DPO training slowdown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add OLMo3 7B DPO baseline script for Jupiter

Same config as the regular DPO script but targeting Jupiter instead of
Augusta, to get a fair throughput comparison with the hybrid DPO run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add tests for token_count metric reduction fix

Verifies that the corrected formula (multiply by num_processes *
grad_accum_steps * logging_steps) recovers the true total token count,
and that the old formula undercounted by grad_accum_steps * logging_steps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add torch.profiler support to DPO training and enable on hybrid script

Add --profile_every_n_steps flag that profiles one micro-batch every N
optimization steps. Exports Chrome traces and logs top-30 CUDA ops by
time. Enabled on the hybrid DPO script (every 2 steps) to diagnose the
~16x distributed training slowdown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add checkpoint resume bugs doc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix checkpoint resume bugs: race condition, non-deterministic output dir

- Add ignore_errors=True to shutil.rmtree in checkpoint cleanup to handle
  race conditions on shared filesystems where multiple nodes may delete
  the same directory
- Change is_local_main_process to is_main_process for checkpoint cleanup
  in dpo_tune_cache.py so only global rank 0 does cleanup on shared Weka
- Remove non-deterministic int(time.time()) from exp_name construction
  in dpo_tune_cache.py, finetune.py, and utils.py so output_dir is stable
  across Beaker retries
- Add tests for clean_last_n_checkpoints

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
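
A hedged sketch of race-tolerant checkpoint cleanup; clean_last_n_checkpoints exists in the repo per this commit, but the signature and step_* directory layout shown here are assumptions.

    import os
    import shutil

    def clean_last_n_checkpoints(output_dir: str, keep_last_n: int) -> None:
        steps = sorted(
            (d for d in os.listdir(output_dir) if d.startswith("step_")),
            key=lambda name: int(name.split("_")[1]),
        )
        stale = steps[:-keep_last_n] if keep_last_n > 0 else steps
        for name in stale:
            # ignore_errors tolerates another node deleting the same directory
            # concurrently on a shared filesystem like Weka.
            shutil.rmtree(os.path.join(output_dir, name), ignore_errors=True)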

* Add multi-node dpo_tune_cache test script and enable checkpointing in integration test

- New scripts/train/debug/dpo/multi_node_cache.sh: 2-node dpo_tune_cache
  run that exercises checkpoint cleanup with checkpointing_steps=5 and
  keep_last_n_checkpoints=2
- Enable checkpointing in dpo_integration_test.sh to exercise the
  is_main_process and ignore_errors fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix indentation error in profiler code

The metrics tracking block was incorrectly indented after the profiler
changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* changed baseline RL script

* Setting triton flag

* Add DPO checkpoint integration test with save, cleanup, and resume

Rewrites multi_node_cache.sh to run two sequential Beaker jobs that
exercise checkpoint saving (COMPLETED markers), cleanup
(keep_last_n_checkpoints), and resume from a WEKA-persisted output dir.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Enable packing and switch cache scripts to accelerate launch

- Add --packing to all DPO debug scripts and production hybrid DPO script
- Switch cache hybrid scripts from torchrun to accelerate launch with deepspeed
- Add TRITON_PRINT_AUTOTUNING=1 env var to debug scripts
- Create single_gpu_cache.sh for non-hybrid cache testing
- Fix run-dpo-experiments skill to reference correct script paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix TPS metrics to be per-GPU in dpo_tune_cache.py

Divide tokens_per_second_step and tokens_per_second_total by
accelerator.num_processes so they report per-GPU throughput, consistent
with MFU and the olmo-core SpeedMonitorCallback convention. Also use
enumerate() for the micro_step loop to fix SIM113 lint warning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add batch shape logging to dpo_tune_cache.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Pad packed batches to max_seq_length in DPO collator

Move pad_to_length to padding_free_collator to avoid circular imports,
add pad_to_max_length/pad_token_id fields to TensorDataCollatorWithFlattening,
and wire up padding in dpo_tune_cache.py so every packed batch has a fixed
(1, max_seq_length) shape.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* no saturn

* Truncate padded tensors in get_batch_logps before splitting

get_batch_logps uses cu_seq_lens to split tensors by sequence, but
after padding the tensor is longer than cu_seq_lens accounts for.
Truncate to real content length before splitting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix NaN losses and duplicate logging in padded DPO

Strip padding from chosen/rejected tensors in concatenated_inputs()
before concatenating, so cu_seq_lens offsets point to real data instead
of padding. Guard batch-shape and step-metrics logging with
is_main_process to avoid 16x duplicate log lines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert padding changes, fix duplicate logging, add forward shape log

Revert collator padding (pad_to_max_length) that caused NaN losses and
Triton autotuning overhead. Keep duplicate-logging fix (is_main_process
guards). Add input_ids shape logging before model forward to diagnose
autotuning triggers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Restore pad_to_length used by dpo_utils

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix forward shape logging to correct function and log all input shapes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* now flags match

* Add GRPO performance investigation doc for hybrid model

Documents the 3.4x step time gap between hybrid (GatedDeltaNet) and
standard (OLMo-3) GRPO training, tracing root causes to two architectural
differences: 3.25x more parameter tensors per layer (causing ZeRO-3
all-gather overhead) and FLA Triton kernel incompatibility with CUDA graphs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix benchmark timeout at long response lengths

Remove queue maxsize to prevent blocking the vLLM async event loop,
scale batch timeout with response length, and add health checks to
the main batch collection loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Disable HF generation_config override in vLLM engines

The model's generation_config.json sets max_new_tokens=2048 which
vLLM applies as a cap on max_tokens, causing 400 errors when
requesting longer responses. Pass generation_config="vllm" to use
vLLM defaults instead. Also fix health check spam when no results
are collected yet (0 % 10 == 0).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update fla to tyler/fewer-l2norm-recompilations branch

Switch flash-linear-attention to tyler-romero fork with fewer L2 norm
recompilations for improved hybrid model performance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add realistic generation mode to benchmark for hybrid model investigation

Support variable-length responses with stop_strings in benchmark_generators.py
(matching GRPO conditions) to isolate the long-tail response length effect on TPS.
Add two new benchmark scripts for the hybrid model: one with GRPO-realistic settings
(enforce_eager + stop_strings + 32x16 batch) and one with fixed-length + enforce_eager.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update investigation doc with enforce_eager and long-tail findings

- enforce_eager causes 47% TPS penalty for hybrid (vs 25% for standard transformer)
- Long-tail response lengths are the primary driver of per-step TPS variance:
  steps with MaxSeq<1000 achieve 2,500-3,566 TPS, steps with MaxSeq=4096 drop to 625
- The original 2.6x gap attribution to enforce_eager alone was incorrect;
  the benchmark comparison was confounded by fixed-length vs variable-length generation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add hybrid GRPO test script without enforce_eager

Test whether the hybrid model can run GRPO with CUDA graphs enabled,
since the standalone benchmark already works without enforce_eager.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix batches_processed crash in reference logprobs cache

Plain DataLoader wrapped by accelerate doesn't have batches_processed.
Track step locally and save it in the checkpoint dict instead of relying
on dataloader.state_dict()/load_state_dict(). Also make checkpoint
frequency configurable via checkpoint_every_n_steps parameter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add kv_cache_concurrency benchmark script

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add Beaker launch script for KV cache concurrency analysis

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix kv_cache_concurrency: disable V1 multiprocessing for engine access

SyncMPClient (used with TP>1) doesn't expose engine_core. Setting
VLLM_ENABLE_V1_MULTIPROCESSING=0 forces InprocClient so we can
access the engine core internals to extract KV cache specs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Set enforce eager on script

* Add GRPO hybrid vs non-hybrid benchmark comparison

Three-way comparison of hybrid+enforce_eager, hybrid, and non-hybrid
OLMo 3.1 runs on the large test script (2x8 GPU, 19 steps).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Pass trust_remote_code through to vLLM engines

Models like nvidia/Nemotron-H-8B-Base-8K require trust_remote_code=True
to load in vLLM. This was missing from create_vllm_engines, causing
failures for custom-code models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update transformers fork and rename olmo3_5_hybrid to olmo3_2_hybrid

The upstream fork renamed the model type on Feb 12. Update uv.lock
to pick up commit 77d52dc and update all code references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add minimal repro for olmo3_2_hybrid ZeRO-3 crash

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update repro to use Accelerator with DeepSpeed ZeRO-3 context

The previous version loaded the model directly without the Accelerate
context, so ZeRO-3 partitioning wasn't active and the crash didn't
reproduce. This version initializes an Accelerator with DeepSpeedPlugin
before loading the model, matching the dpo_tune_cache.py code path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Workaround olmo3_2_hybrid _init_weights crash under ZeRO-3

The _init_weights added in 01f141b accesses Embedding weights by
padding_idx, which crashes under ZeRO-3 where weights are partitioned
(size 0 on non-owning ranks). Monkey-patch until fixed upstream.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove debug logging and profiling from hybrid DPO

Remove per-step Model forward shape logging and disable profiler tracing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* updated benchmarking scripts

* Add hybrid GRPO training script for OLMo3 7B on Jupiter

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Add benchmark results

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Update vLLM to latest olmo-3.5-hybrid-clean branch (411aac2e)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* added perf/seconds_per_step

* Apply olmo3_2_hybrid _init_weights ZeRO-3 workaround to grpo_fast.py, add Slack alerts to DPO and RL scripts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix 7b_instruct_hybrid_rl.sh: add checkpoint_state_dir, remove max_retries

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Remove invalid --beaker_eval_freq flag from hybrid RL script

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Remove --add_bos from hybrid RL script (incompatible with olmo123 template)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Increase weight sync timeout from 120s to 1200s

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Set vllm_tensor_parallel_size=2 and halve vllm_num_engines in hybrid RL script

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Revert "Set vllm_tensor_parallel_size=2 and halve vllm_num_engines in hybrid RL script"

This reverts commit e8f0faa61a61d9f2c0d28651f4fb619371129a5c.

* updated code

* now we set checkpointing

* Reduce vLLM nodes from 7 to 3 (24 engines) and total nodes from 8 to 4

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* less vllm

* now trying 2 nodes

* 2 node

* Add debug resume scripts for DPO and RL experiments

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* set model name

* Fix ref policy checkpoint save under DeepSpeed ZeRO-3

When saving a checkpoint with ZeRO-3 sharding and 16 GPUs, calling
state_dict() on rank 0 only returns that rank's local shard (~1/16 of
each parameter), producing a ~3MB file instead of the full ~14GB model.
On resume, loading this broken file silently falls back to the initial
HuggingFace weights, so the ref policy has step-0 weights while the
training model has step-N weights, causing KL≈2 at the first logged step.

Fix by using GatheredParameters (the same pattern as save_model) to
assemble the full tensors on rank 0 before saving.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
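
A hedged sketch of the GatheredParameters pattern this commit names; the rank handling and save path are simplified.

    import torch
    import deepspeed

    def save_ref_policy(model, path: str, global_rank: int) -> None:
        # Gather the ZeRO-3 shards so rank 0 sees full tensors rather than
        # its local ~1/world_size slice of each parameter.
        with deepspeed.zero.GatheredParameters(list(model.parameters()), modifier_rank=0):
            if global_rank == 0:
                torch.save(model.state_dict(), path)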

* Update transformers to v5.3.0 from PyPI

Remove the custom fork (yanhong-lbh/transformers olmo-3.5-hybrid-clean)
and pin to the official transformers==5.3.0 release, which now includes
olmo3_2_hybrid support natively.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update vllm and transformers to latest olmo-3.5-hybrid-clean commits

Bumps both forks to their latest commits and relaxes numpy<2 to
numpy>=2, which the new vllm now requires via opencv-python-headless.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update vllm, transformers, and torch to latest versions

- vllm and transformers: latest olmo-3.5-hybrid-clean commits
- torch: 2.9.0 -> 2.10.0 (required by new vllm's torchvision==0.25.0)
- numpy: <2 -> >=2 (required by new vllm's opencv-python-headless)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update olmo3_2_hybrid references to olmo_hybrid for new transformers fork

The transformers fork renamed the module from olmo3_2_hybrid to
olmo_hybrid with new class names (OlmoHybrid* instead of Olmo3_2Hybrid*).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix LD_LIBRARY_PATH for pip-bundled NVIDIA libs in launch script

torch 2.10+cu129 bundles CUDA 12.9 libs that must be found before
system CUDA libs (e.g. 12.8) to avoid missing symbol errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add repro script for olmo3_2_hybrid model_type not recognized

The new transformers fork (olmo-3.5-hybrid-clean) renamed the model type
from olmo3_2_hybrid to olmo_hybrid, but existing HF checkpoints still use
the old name, causing AutoConfig.from_pretrained() to fail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Register olmo3_2_hybrid model_type for backward compat and simplify repro script

Register OlmoHybridConfig under the old olmo3_2_hybrid model_type in
grpo_fast.py so AutoConfig works with old checkpoints. Simplify repro
script to just test the real HF checkpoint.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove broken olmo3_2_hybrid AutoConfig registration from grpo_fast.py

The registration fails because OlmoHybridConfig has model_type="olmo_hybrid",
causing a mismatch ValueError. The HF checkpoint has been updated to use
olmo_hybrid, so the backward-compat registration is no longer needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove FLASH_ATTENTION_SKIP_CUDA_BUILD to fix torch 2.10 compat

flash-attn 2.8.3 has no prebuilt wheel for torch 2.10, so with
SKIP_CUDA_BUILD the CUDA kernels were never compiled and importing
flash_attn_2_cuda failed at runtime. Removing the flag lets uv build
from source against the installed torch version.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Bump hybrid RL exp_name to v2 for fresh checkpoint after layernorm key change

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* added script

* Fix pg_options TypeError by using inspect instead of string version comparison

The lexicographic comparison `str(torch.__version__) >= "2.6"` fails for
torch 2.10+ because "2.10" < "2.6" as strings. Replace with
inspect.signature introspection on _new_process_group_helper to detect the
correct parameter name. Also surface Ray actor exceptions in
ray_get_with_progress by catching and logging before re-raising.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
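
A hedged sketch of the signature introspection described here; the alternative keyword name ("backend_options") is an assumption used only for illustration.

    import inspect
    import torch.distributed.distributed_c10d as c10d

    # String comparison breaks on torch 2.10+: "2.10" < "2.6" lexicographically.
    # Inspect the private helper's signature instead to pick the right keyword.
    params = inspect.signature(c10d._new_process_group_helper).parameters
    options_kwarg = "backend_options" if "backend_options" in params else "pg_options"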

* Add vLLM vs HF logprobs comparison test for hybrid model

Tests that vLLM and HuggingFace produce matching logprobs using the
Olmo-Hybrid-Instruct-DPO-7B checkpoint, with a Beaker launch script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Copy test into scripts/ so it's included in Docker image

tests/ is in .dockerignore, so place the test file in scripts/ which
is copied into the image.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix test to use rl_utils (not rl_utils2) and pass vllm_logprobs arg

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix vLLM generate API: use dict prompt instead of prompt_token_ids kwarg

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Bump exp_name to avoid stale hybrid checkpoint

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add transformer model (Olmo-3-1025-7B) to logprobs comparison test

Tests both hybrid and transformer models to compare vLLM vs HF logprobs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove -x flag so all test cases run even if some fail

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add explicit GPU memory cleanup between test cases

Delete model and vLLM engine and clear CUDA cache after each use
to prevent OOM when running multiple model tests sequentially.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Set vLLM gpu_memory_utilization=0.5 to avoid OOM across test cases

vLLM subprocess retains GPU memory even after del, so limit its
allocation to allow subsequent model loads.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Restore dynamic get_kv_cache_info() with fallback for hybrid models

The hardcoded inference_batch_size=64 was causing 22x worse logprob
divergence and 6.4x slower weight sync vs old-main. Restore the dynamic
KV cache calculation (~12 for Olmo-3-1025-7B on H100) with a try/except
fallback to 16 for hybrid models where collective_rpc hangs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
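
A minimal sketch of the fallback behaviour this commit restores; the fallback value of 16 comes from the commit text, while the function shape is assumed.

    def resolve_inference_batch_size(get_kv_cache_info, fallback: int = 16) -> int:
        """Prefer the dynamic KV-cache-derived batch size; fall back for hybrid models."""
        try:
            return get_kv_cache_info()
        except Exception:
            # collective_rpc("get_kv_cache_spec") fails on hybrid models, so use a
            # conservative fixed batch size instead.
            return fallback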

* Bump exp_name to avoid restoring stale checkpoint

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove --no_auto_dataset_cache to avoid HF rate limiting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert: restore --no_auto_dataset_cache (local caching fails)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove --no_auto_dataset_cache to fix HF rate limiting on Beaker

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Restore --no_auto_dataset_cache (olmo123 template not in CHAT_TEMPLATES)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add vLLM A/B test script for weight sync performance comparison

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add .claude/worktrees/ to .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add weight_sync A/B test script: load_ref_policy=true vs false

Identical to large_test_script.sh except load_ref_policy=true.
Baseline (large_test_script / vllm_ab_test) has ~14.6s weight_sync.
Hypothesis: load_ref_policy=true causes GPU memory pressure → slower weight_sync.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add weight_sync A/B test 2: match slow grpo_p64 config exactly

Changes from large_test_script.sh: load_ref_policy=true,
num_mini_batches=4, sequence_parallel_size=1. These are the
remaining config differences vs the slow grpo_p64_2node_v2 run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Pin vLLM fork to working commit 411aac2e2

The fork's branch head (b041ef4f9) introduced a msgspec
ValidationError for hybrid models: "Expected array of length 1,
got 2 - at $.dtypes". Pin to the last known working commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Override torchvision/torchaudio to fix circular import with torch 2.10

Pinning vLLM to an older commit caused uv to resolve torchvision 0.24.1
(instead of 0.25.0), which has a circular import bug with torch 2.10.0.
Add override-dependencies to ensure compatible versions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Pin torch <2.10 for compatibility with pinned vLLM commit

The vLLM fork at commit 411aac2e2 was built against torch 2.9.x. Using
torch 2.10.0 causes an undefined symbol error in vLLM's CUDA kernels.
Revert torch constraint to <2.10 to match the working configuration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Pin vLLM to 3677c274d: has OlmoHybrid, before dtypes bug

Previous pin (411aac2e2) was upstream vllm-main without OlmoHybrid support.
This commit is the last fork-specific OlmoHybrid commit before the merge
that introduced the dtypes serialization bug. Requires torch 2.10.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Force rebuild: empty commit to get new git hash

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove torchvision/torchaudio from Docker image

torchvision 0.25.0 has a circular import bug with torch 2.10 that
crashes on startup. Neither package is used at runtime.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Pin torch >=2.10 to match vLLM 3677c274d build ABI

UV was resolving torch 2.9.1 despite <2.11 constraint, causing ABI
mismatch with vLLM compiled against torch 2.10.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add VLLM_ALLOW_INSECURE_SERIALIZATION for hybrid model support

The hybrid model's dual KV cache dtypes cause a msgspec validation
error (Expected array of length 1, got 2). Setting this env var uses
pickle serialization instead, which handles variable-length tuples.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Patch vLLM MambaSpec to allow variable-length dtypes tuple

The hybrid model has 2 KV cache dtypes but MambaSpec.dtypes is typed
as tuple[torch.dtype] (exactly 1). Patch to tuple[torch.dtype, ...]
so msgspec accepts variable-length tuples.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
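
A self-contained toy illustration (not vLLM's real MambaSpec) of why a fixed-length tuple annotation rejects the hybrid model's two dtypes while the variable-length form accepts them.

    import msgspec

    class FixedSpec(msgspec.Struct):
        dtypes: tuple[str]        # exactly one element permitted

    class VariableSpec(msgspec.Struct):
        dtypes: tuple[str, ...]   # any number of elements permitted

    payload = msgspec.msgpack.encode({"dtypes": ["bfloat16", "float32"]})
    msgspec.msgpack.decode(payload, type=VariableSpec)   # decodes fine
    # msgspec.msgpack.decode(payload, type=FixedSpec)    # ValidationError: expected array of length 1, got 2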

* Add logprob comparison tests: eager vs compiled vs fp32 SSM state

Tests three vLLM modes against HF logprobs for hybrid models:

1. eager mode (baseline)

2. compiled mode (expected divergence due to bf16 state compounding)

3. compiled + fp32 SSM state (expected fix per team findings)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add long-sequence logprob comparison tests (256, 512, 1024 tokens)

Tests hybrid model divergence compounding over sequence length across eager, compiled, and compiled+fp32 SSM state modes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Test prefill-mode logprob divergence up to 16384 tokens

Matches production GRPO scenario: both vLLM and HF score the same pre-existing sequence in a single forward pass (prefill/chunk path). Tests eager, compiled, and compiled+fp32 SSM state at 256-16384 tokens.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Test GRPO-matching logprob comparison: vLLM generate + HF score

Uses real dataset prompts (hamishivi/rlvr_acecoder_filtered_filtered) with olmo123 chat template. vLLM generates up to 16384 tokens, then HF scores the full sequence in one forward pass. Tests eager, compiled, and compiled+fp32 SSM state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Force full-length generation with ignore_eos, test 1024-8192 tokens

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix sequence packing recurrent state leak for hybrid models

For hybrid (recurrent) models, SSM state accumulated across packed sequence
boundaries, contaminating logprobs of later sequences in a pack. Add
_forward_packed_sequences_separately() which processes each packed sequence
independently with fresh state. Enabled via reset_recurrent_state flag,
auto-detected for olmo_hybrid models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Relax packing test threshold: bf16 gives ~0.21 mean diff

The state-reset path gives 0.21 mean diff vs individual (compared to 4.75
for naive packed). The residual is bf16 numerical noise, not state leak.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Free intermediate tensors in separate forward to prevent OOM

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Rename experiment to v3_packing_fix for fresh start

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Pre-download model to shared HF cache before spawning Ray actors

Prevents HuggingFace rate limits when 16 processes try to download simultaneously.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Monkey-patch OlmoHybridGatedDeltaNet to pass cu_seqlens for packed sequences

Replace the separate-forward-pass workaround (_forward_packed_sequences_separately) with a monkey-patch that passes cu_seqlens to the FLA recurrent kernels. This resets recurrent state at sequence boundaries natively within a single forward pass, matching the olmo-core implementation.

The old approach ran N separate model() calls per packed sequence, which was catastrophically slow with DeepSpeed ZeRO-3 due to multiplied NCCL all-gather overhead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add standalone repro script for OlmoHybrid recurrent state leak

Demonstrates that OlmoHybridGatedDeltaNet.forward() doesn't pass cu_seqlens to FLA recurrent kernels, causing SSM state to leak across sequence boundaries in packed inputs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Move snapshot_download before tokenizer load and set HF_HUB_OFFLINE after

The tokenizer load was hitting HF API rate limits (429) because snapshot_download ran after make_tokenizer(). Move it before, and set HF_HUB_OFFLINE=1 after download to prevent transformers from making API calls (e.g. model_info() inside _patch_mistral_regex).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
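
A hedged sketch of the ordering this commit describes: download once to the shared cache, then go offline before any tokenizer or config load; make_tokenizer stands in for the project's own loader and is not shown.

    import os
    from huggingface_hub import snapshot_download

    def prepare_model_locally(repo_id: str) -> str:
        local_path = snapshot_download(repo_id=repo_id)  # single download into the shared HF cache
        # Prevent later loads (e.g. model_info() during tokenizer setup) from
        # hitting the Hub API and getting rate-limited (429).
        os.environ["HF_HUB_OFFLINE"] = "1"
        return local_path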

* Use stock transformers>=5.3.0 and base Olmo-Hybrid-7B model

Drop trust_remote_code and custom fork dependency so the repro script works with stock transformers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restructure logprobs comparison tests for hybrid vs transformer diagnosis

- Parameterize on both hybrid and transformer models at 1024/4096/8192 tokens
- Use full production dataset mix (6 datasets) for realistic prompts
- Add TestPatchEffect: compare patched vs unpatched HF vs vLLM
- Add TestLengthScaling: measure gap growth at 128-8192 tokens
- Keep TestPackingStateLeak for cu_seqlens correctness
- Use logger_utils and modern type hints per project conventions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Move LD_LIBRARY_PATH setup before uv sync in launch script

flash-attn build imports torch which loads cusparse, requiring
nvjitlink 12.9. The pip-installed NVIDIA libs in .venv have the
right version but LD_LIBRARY_PATH was set after uv sync.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Focus logprobs tests on production config: 8192 response tokens

Match scripts/train/olmo3/7b_instruct_hybrid_rl.sh:
- response_length=8192, pack_length=11264, max_prompt_length=2048
- 1 prompt per dataset (6 total) instead of 2 (12)
- TestLengthScaling: 1024/4096/8192 instead of 128-8192
- Bump Beaker timeout to 120m

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Handle tokenizers with multiple chat templates in logprobs test

Olmo-3-1025-7B has multiple chat templates with no default, causing
apply_chat_template to raise ValueError. Use 'default' template
when tokenizer.chat_template is a dict.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
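
A minimal sketch of the dict-valued chat_template handling this commit describes; using the "default" key comes from the commit text, and the helper name is an assumption.

    def render_chat(tokenizer, messages: list[dict]) -> str:
        template = tokenizer.chat_template
        if isinstance(template, dict):
            # Multiple named templates with no implicit default: pick "default" explicitly.
            template = template["default"]
        return tokenizer.apply_chat_template(messages, chat_template=template, tokenize=False)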

* Fix chat template for base models and launch script for local builds

- Fall back to hybrid model's tokenizer for chat template formatting
  when the target model has no chat template (e.g. Olmo-3-1025-7B)
- Use --no-install-package for flash-attn/vllm in local uv sync
- Move LD_LIBRARY_PATH setup before uv sync

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix chat template to match production and add test results to doc

- Use CHAT_TEMPLATES["olmo"] for transformer model (matching production
  experiment 01KK201Y3C2Z6VNJVKRPASEGHA which used --chat_template_name olmo)
- Hybrid model uses its built-in chat_template.jinja (matching production
  experiment 01KKVTQQZ86A5PB1MV1C2337DQ which used --chat_template_name olmo123)
- Document test results from experiment 01KM0KKAAAAACZERB01GR7M5YY
- Add instructions for running the test suite

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update doc with corrected test results using proper olmo chat template

Results from experiment 01KM0QY7F73Y1JG6P7S7Q4PYYA with the correct
chat template (olmo for transformer, built-in for hybrid).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Document hypotheses for test-vs-production divergence with test plans

Two hypotheses explain the inverted pattern (hybrid better in test,
worse in production; transformer worse in test, better in production):
1. cu_seqlens packing error at long sequences (hybrid only)
2. Production responses shorter than 8192 due to stop strings (transformer only)

Each includes a concrete test plan to confirm or reject.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add hypothesis tests for packing error and natural response lengths

TestPackingAtProdLength (hypothesis 1): Pack two ~4500-token responses
into an 11264-token tensor matching production config. Compare
patched-packed vs individual scoring to measure cu_seqlens packing
error at production lengths.

TestNaturalResponseLength (hypothesis 2): Generate without ignore_eos
using production stop strings to measure actual response lengths and
vLLM-HF diff at those lengths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove redundant tests: TestPatchEffect, TestLengthScaling, TestPackingStateLeak

- TestPatchEffect: confirmed patch has zero effect on single sequences
- TestLengthScaling: superseded by TestNaturalResponseLength
- TestPackingStateLeak: superseded by TestPackingAtProdLength

Remaining tests: TestGRPOLogprobsMatch, TestPackingAtProdLength,
TestNaturalResponseLength (4 test cases total).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add TestVllmVsPackedHF to reproduce production logprob comparison

Replace TestPackingAtProdLength with TestVllmVsPackedHF which directly
replicates the production metric: vLLM logprobs vs packed-HF logprobs
scored with forward_for_logprobs. Two variants (packed/single) to
isolate whether the gap comes from packing or the scoring function.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix bool mask in TestVllmVsPackedHF: response_masks are Long, not bool

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
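The underlying PyTorch gotcha, as a tiny standalone example rather than the test code itself:

```python
import torch

# Indexing with a Long tensor gathers by position, while indexing with a bool
# tensor selects by mask. With a 0/1 Long mask the first form silently
# returns the wrong elements instead of raising an error.
logprobs = torch.tensor([0.1, 0.2, 0.3, 0.4])
response_mask = torch.tensor([0, 1, 1, 0])  # Long, as produced by the dataloader

wrong = logprobs[response_mask]         # gather: tensor([0.1, 0.2, 0.2, 0.1])
right = logprobs[response_mask.bool()]  # mask:   tensor([0.2, 0.3])
print(wrong, right)
```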

* Update docs with TestVllmVsPackedHF results: packing does not explain gap

Packed vLLM-vs-HF diff is 0.012, single is 0.030 — both far below
the production 0.24-0.58. Transformer gap explained by natural response
lengths (median 1559 tokens vs forced 8192). Hybrid gap remains unexplained
by isolated tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add hybrid packed metrics debug script and W&B analysis results

Production W&B run (fo8rjg42) only has combined diff metric, no
packed/unpacked split. Debug script runs a short hybrid GRPO training
to capture the split metrics (debug/vllm_diff_mean_packed vs _unpacked).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Revert torch to <2.10 and restore FLASH_ATTENTION_SKIP_CUDA_BUILD

Torch 2.10 cu129 wheels bundle mismatched NVIDIA pip libraries
(cusparse 12.7 vs nvjitlink <12.9) that crash on import in uv's
build isolation environment on CPU-only machines. Reverting to
torch <2.10 to match main and restore local uv run/sync.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add --no-install-package flags to single_gpu_on_beaker.sh

uv run on CPU machines cannot build flash-attn or vllm from source. These packages are only needed inside the Beaker container, not for mason.py submission.

* Use uv run --no-sync in hybrid_packed_metrics.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove torch <2.10 upper bound to fix vllm ABI mismatch

The vllm fork (3677c274d) compiles against torch 2.10's C10 CUDA API, where c10_cuda_check_implementation takes unsigned int; torch 2.9.1 uses int. Removing the upper bound lets vllm's build isolation use the correct torch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Bump torch to >=2.10.0 to match vllm fork's ABI

The vllm fork (3677c274d) was compiled against torch 2.10's C10 CUDA API; torch 2.9.1 has an incompatible symbol signature (int vs unsigned int).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Use HF repo allenai/Olmo-Hybrid-Instruct-DPO-7B in single GPU debug script

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Use eager attention in single GPU debug script to avoid flash-attn issue

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Revert "Use eager attention in single GPU debug script to avoid flash-attn issue"

This reverts commit 9c97550cb0fc4bc5d58164859435c7484bb94ffc.

* Remove FLASH_ATTENTION_SKIP_CUDA_BUILD so flash-attn compiles CUDA kernels

The skip flag prevented flash_attn_2_cuda from being built in Docker.

Local dev uses --no-install-package flash-attn so this doesn't affect CPU machines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Use uv run --no-sync in large_test_script.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Use uv run --no-sync in 7b_instruct_hybrid_rl.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Bump torch>=2.10, transformers>=5.3, vllm>=0.18 for olmo_hybrid support

olmo_hybrid was upstreamed to transformers 5.3.0. Override vllm's
transformers<5 pin since the models are compatible.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Guard configure_http_backend for huggingface_hub 1.x compatibility

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
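A minimal sketch of such a guard, with a placeholder session factory (the real retry setup lived in configure_hf_hub_retry, which a later commit in this PR removes entirely):

```python
import huggingface_hub

# configure_http_backend() exists in huggingface_hub 0.x but was removed in
# 1.x, where retries are handled internally; only call it when available.
if hasattr(huggingface_hub, "configure_http_backend"):
    import requests

    def _session_factory() -> requests.Session:
        # Placeholder session setup; a real factory would attach retry adapters.
        return requests.Session()

    huggingface_hub.configure_http_backend(backend_factory=_session_factory)
```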

* Fix tokenizer.decode() to pass list for transformers 5.x compatibility

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
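A hedged sketch of the normalization, using a placeholder tokenizer: datasets can hand back numpy or string token ids, so everything is cast to plain Python ints and passed to decode() as a list.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

raw_ids = tokenizer("hello world")["input_ids"]
token_ids = [int(t) for t in raw_ids]  # normalize numpy/str ids to Python ints
text = tokenizer.decode(token_ids)     # decode() gets a list, per transformers 5.x
```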

* Cast tokens to int in visualize_token for datasets with string tokens

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Guard visualize_token call against dataset format changes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Switch to custom vllm fork for olmo_hybrid serialization support

Upstream vllm 0.18.0's v1 engine can't serialize olmo_hybrid Mamba state outputs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Document vLLM hybrid dtype bug with minimal repro script

MambaSpec.dtypes is tuple[torch.dtype] (length 1) but hybrid models send 2 dtypes, causing msgspec.ValidationError during serialization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
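A standalone illustration of the msgspec behavior behind the bug (this is not the repo's repro script, and a str field stands in for torch.dtype):

```python
import msgspec

# msgspec treats tuple[X] as a fixed length-1 tuple, so a second element
# fails validation, mirroring the hybrid model's two-dtype case.
class Spec(msgspec.Struct):
    dtypes: tuple[str]

try:
    msgspec.convert({"dtypes": ("bfloat16", "float32")}, Spec)
except msgspec.ValidationError as err:
    print(err)  # expected tuple of length 1 (message wording may differ)
```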

* Simplify repro script and add Beaker launcher

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Cleaned up PR

* Add if __name__ guard for vllm spawn multiprocessing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
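A minimal sketch of the guard, with a placeholder model: vLLM's spawn workers re-import the launching module, so top-level engine construction would otherwise recurse.

```python
from vllm import LLM, SamplingParams

def main() -> None:
    # Model and parameters are placeholders, not this PR's configuration.
    llm = LLM(model="facebook/opt-125m", dtype="bfloat16")
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
    print(outputs[0].outputs[0].text)

if __name__ == "__main__":
    # Required with the spawn start method: child processes re-import this
    # module, and the guard keeps them from re-running engine construction.
    main()
```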

* Upgrade vLLM to 0.17.1, torch to 2.10, and uv to 0.10.12 (#1556)

* Upgrade vLLM to 0.17.1, torch to 2.10, and uv to 0.10.12 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Cleaned up PR

* Cleaned up PR

* Cleaned up PR

* Remove investigation artifacts and minimize diff vs main Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add hybrid model test scripts for GRPO, DPO, and benchmarks Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Switch from vllm fork to vllm>=0.18.0 from PyPI, remove flash-attn v2 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add minimal repro for vLLM 0.18.0 hybrid dtype serialization bug Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix repro script multiprocessing guard, remove unused VLLM_ATTENTION_BACKEND Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Rebased

* Simplify repro script and add Beaker launcher Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove unrelated changes from hybrid benchmarks branch (return_dict=False, dead code, defensive try/except, compat shims, token_count tests) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Switch hybrid GRPO test to TP=1 with 8 engines to work around vLLM dtype serialization bug Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update repro to match Ray actor path: VLLM_ENABLE_V1_MULTIPROCESSING=0 + mp executor Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Repro via collective_rpc get_kv_cache_spec + fix configure_http_backend for new HF Hub Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Work around vLLM hybrid dtype bug: catch collective_rpc failure in get_kv_cache_info Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Cast tokens to int in visualize_token for datasets with string tokens Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Switch hybrid GRPO test to TP=1 to match production config Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Guard visualize_token call against dataset format changes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove unused vllm_dtype config field and parameter Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Monkey-patch MambaSpec.dtypes to fix vLLM hybrid model serialization bug Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add return_dict=False to all apply_chat_template calls for transformers 5.x compat Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
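A hedged sketch of the call shape, with a placeholder tokenizer that ships a chat template:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # placeholder
messages = [{"role": "user", "content": "Hello"}]

token_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=False,  # keep the flat list-of-ids return shape the pipeline expects
)
```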

* Remove try/except guard around visualize_token Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Bump DATASET_CACHE_VERSION to v6 to invalidate stale caches from apply_chat_template return_dict fix Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Auto-detect attention implementation (fa3 > fa2 > sdpa) instead of hardcoding flash_attention_2 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Keep attn_implementation as optional CLI override, fall back to auto-detect Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
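A hedged sketch of the preference order; the package names used for detection and the flash_attention_3 value are assumptions, and the CLI override always wins.

```python
import importlib.util

def pick_attn_implementation(override: str | None = None) -> str:
    """Prefer fa3, then fa2, then sdpa, unless the caller overrides."""
    if override is not None:
        return override
    if importlib.util.find_spec("flash_attn_interface") is not None:  # flash-attn 3 (assumed name)
        return "flash_attention_3"
    if importlib.util.find_spec("flash_attn") is not None:  # flash-attn 2
        return "flash_attention_2"
    return "sdpa"
```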

* Move min_tokens to extra_body in vLLM completions.create() call to fix OpenAI client compat Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
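A minimal sketch against a local vLLM OpenAI-compatible server; the URL and model id are placeholders. min_tokens is a vLLM-specific sampling parameter the OpenAI client does not accept as a named argument, so it rides along in extra_body.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="allenai/Olmo-Hybrid-Instruct-DPO-7B",
    prompt="Write a haiku about benchmarks.",
    max_tokens=128,
    extra_body={"min_tokens": 16},  # forwarded verbatim to vLLM's sampling params
)
print(completion.choices[0].text)
```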

* Remove configure_hf_hub_retry (HF hub handles retries internally), simplify benchmark_generators SamplingConfig Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Cleaned up PR.

* Revert non-hybrid changes (mason VLLM_ATTENTION_BACKEND, configure_hf_hub_retry, DeepSpeed hooks) — split to separate PRs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove configure_hf_hub_retry; configure_http_backend was removed in huggingface_hub v1.0 (now v1.7.2) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restore 'here marks the end of the previous messages' comment in dataset_transformation.py Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restore accidentally deleted repro_vllm_hybrid_dtype scripts Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Use fallback_inference_batch_size instead of sys.maxsize when get_kv_cache_info fails on hybrid models Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* cleaned up PR

* Added qwen3.5 test

* Remove try/except around get_kv_cache_info and fallback_inference_batch_size since MambaSpec monkey-patch fixes the root cause Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Use get_text_config() in ModelDims.from_hf_config to support multimodal models like Qwen3.5 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
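A minimal sketch of the lookup, with a placeholder multimodal model id: on a multimodal config the text dimensions live on a nested text config, and get_text_config() returns the config itself for text-only models, so one code path covers both.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")  # placeholder
text_config = config.get_text_config()
print(text_config.hidden_size, text_config.num_hidden_layers)
```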

* Switch large_test_script_hybrid.sh to Olmo-Hybrid-Instruct-DPO-7B model Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Hardcode vllm dtype to bfloat16 and remove unused vllm_dtype parameter Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix CI: add get_text_config to test mocks, setup beaker for GPU override verify, add CHANGELOG entry Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
