Conversation

@ipanfilo ipanfilo commented Jan 13, 2026

Description

IFU TE 2.8.0.dev0 commit 7f77127 from 2025-09-18

Fixes #14813

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

KshitijLakhani and others added 30 commits July 21, 2025 01:04
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
* Remove GH pinned deps

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Pin onnxscript

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Reset FP8 weight workspace if usages are invalid

Signed-off-by: Tim Moon <tmoon@nvidia.com>
…end` (#1965)

Update utils.py

Fix the incorrect FP8 attention condition in `get_attention_backend`

Signed-off-by: yuzhongw-nvidia <yuzhongw@nvidia.com>
Co-authored-by: Xiaowei Ren <103958965+xrennvidia@users.noreply.github.com>
* exclude 9.10.0/.1 for certain configs

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix kv_channels

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add get_backend to tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add init files

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix numerics and cuda graph tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix jax tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove prints

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor changes after renaming

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix import structure and rename get_attention_backends

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix docs and benchmarks

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix get backend calls

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Revert "fix get backend calls"

This reverts commit 653cbb51c697bc2f975416bb3aac1d85f76c36dc.
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Revert "fix docs and benchmarks"

This reverts commit 98cd52e04ff7c53e26b412195f5744e39f7ed0e9.
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix docs, benchmarks and pre-commit ci

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix dpa/mha flash attn selection

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix rng states

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix ModelConfig

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix backend selection on Ampere

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix issues from last merge

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Update tests/pytorch/utils.py

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove initialization of rng_states to None

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* redefine ModelConfig

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix ModelConfig

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix seed for CP tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Update tests/pytorch/test_sanity.py

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* move fixture from utils to individual tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix CI

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
…ug quantizer (#1963)

* Debug linear layer when saving original input and using debug quantizer

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Workaround bugs with quantizing with only column-wise usage

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unused imports

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Avoid unnecessary row-wise data

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Workaround bugs with quantizing with only column-wise usage

FP8 does not support transpose-only cast.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
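
The workaround above exists because plain FP8 kernels cannot emit only the transposed ("column-wise") data. A minimal PyTorch sketch of the idea, assuming `torch.float8_e4m3fn` as the target dtype (illustrative only, not the TE internals):

```python
import torch

def quantize_columnwise_only(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Emulate a column-wise-only FP8 cast for a 2-D tensor.

    Since a transpose-only cast is unsupported, quantize row-wise
    first, then take a transposed copy of the quantized data.
    """
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)  # row-wise cast
    return x_fp8.t().contiguous()                # column-wise (transposed) copy
```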

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fixed conflicts

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Minor code refactoring to avoid unnecessary checks

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed typo

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed dBias accumulation error due to initialization. Minor code refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Test case to reproduce the init error

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed rowwise dbias error

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Changed ptx API

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added a struct for two packed FP8 values

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Rolled back to scalar code for columnwise scaling due to its better performance

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Minor corrections

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Rebased on main

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes per code review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Removed constexpr in C++ test suite to build faster

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Computed activations are now numerically truncated to InputType before scaling. Improved test suite.

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
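
For reference, a hedged sketch of the new numerics: compute the activation at high precision, truncate to the input dtype, and only then scale and cast to FP8 (GELU and `float8_e4m3fn` are arbitrary choices here):

```python
import torch

def ref_activation_fp8(x: torch.Tensor, scale: float) -> torch.Tensor:
    y = torch.nn.functional.gelu(x.float())  # activation computed in FP32
    y = y.to(x.dtype)                        # truncate to InputType first...
    return (y.float() * scale).to(torch.float8_e4m3fn)  # ...then scale and cast
```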

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Minor refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Minor refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Modified MXFP8 mismatch checks to account for FP8 numerics

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Implemented Jeremy's fixes to JAX test suite with an intermediate downcast

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Reduced the dims of the test tensors to improve CI runtime

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed memory alignment issue. Compute dbias without downcast.

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed misaligned memory issue also in gated kernels. Reduced size of MXFP8 gated tests

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Refactor _OperationFuserAutogradFunction.forward to use fewer parameters

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit f8f59b1bb184e89468058521df4cfff029ad909c)

* Rename `BackwardBiasActivation` to `BackwardActivationBias`

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 397c58fc296f801fe4ad600aadc2daff3b78be45)

* Use forward operation order in backward fused operations

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 2d37a9385069b066e6cdeff3eb9173c2079cb791)

* Rename `prev_op_grad_input_quantizer` to `prev_op_grad_output_quantizer`

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit d7ab5dfb23e216866f7f4fc4d7a99f625d329f1e)

* Make OperationFuser persistent

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 77984d9715d31e87519dc6ea1e02c483a81355a7)

* Distribute extra inputs to and collect extra outputs from multiple module groups in Sequential

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 0716aaad542e59f2c1ac4620167965a0334bbf71)

* Take requires_grad into account when fusing operations

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Change get_quantizer to return None if no quantization recipe is used

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Refactor pre_first_forward

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Fix for failing `test_make_graphed_callables[fp8_recipe0-*-True-*-linear_op]`

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Fix linting errors

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Apply suggestions from code review

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Fix fp8 meta tensors in CUDA Graph capture

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix failing distributed userbuffers tests

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

---------

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…1979)

Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
…ory length changes (#1985)

* Fix bug where TE ops were not updating fp8_meta dicts

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Rename reset_recipe_state function

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update error message when initializing meta device quantized weight without recipe

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix current device for cuDNN/cuBLAS handles

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add unit test

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use weight device and improve tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
… L0 (#1990)

Fix current scaling test_helper.py and enable test_helper.py in L0

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
…on-MXFP8 recipes. (#1962)

* add manage_primitives() helper

* disable GEMM primitives for non-MXFP8 recipes

* implement the NVTE_JAX_CUSTOM_CALLS + deprecate NVTE_JAX_CUSTOM_CALLS_RE

* replace NVTE_JAX_CUSTOM_CALLS_RE with NVTE_JAX_CUSTOM_CALLS in TE tests and examples

* fix use_jax_gemm contextmanager

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
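
A sketch of the migration; the regex form for the deprecated variable follows existing usage, while the named-toggle syntax for the new variable is an assumption:

```python
import os

# Old (deprecated): regex over custom-call names, e.g. disable GemmPrimitive.
os.environ["NVTE_JAX_CUSTOM_CALLS_RE"] = "^(?!GemmPrimitive$).+$"
# New: named toggle (assumed syntax).
os.environ["NVTE_JAX_CUSTOM_CALLS"] = "GemmPrimitive=false"
# Either variable must be set before transformer_engine.jax is first used.
```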

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* Increase intermediate precision and reuse tensors from fwd

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* JIT warmup only when required

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Recompute only rsqrt_norm

Signed-off-by: Evgeny <etsykunov@nvidia.com>
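
Since the rsqrt factor is a cheap reduction over the last dim, the backward can recompute it instead of saving the normalized output, trading a little compute for activation memory. A self-contained PyTorch sketch of that trade-off (not the TE kernel):

```python
import torch

class RMSNormRecompute(torch.autograd.Function):
    """RMSNorm that saves only (x, weight) and recomputes rsqrt_norm."""

    @staticmethod
    def forward(ctx, x, weight, eps=1e-6):
        rsqrt_norm = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
        ctx.save_for_backward(x, weight)
        ctx.eps = eps
        return x * rsqrt_norm * weight

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        # Recompute the cheap reduction rather than having saved it.
        rsqrt_norm = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + ctx.eps)
        xhat = x * rsqrt_norm
        grad_w = (grad_out * xhat).sum(tuple(range(x.dim() - 1)))
        gx = grad_out * weight
        grad_x = rsqrt_norm * (gx - xhat * (gx * xhat).mean(-1, keepdim=True))
        return grad_x, grad_w, None
```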

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Evgeny <etsykunov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Fix cuDNN lib runtime loading and simplify

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Mark output tensors as not deletable in backward

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Add `in_place` kwarg to `MakeExtraOutput`

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Rename `AddInPlace` to `AddExtraInput` and add an `in_place` kwarg

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
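
A hypothetical sketch of the renamed op in the fusible-ops API; the composition and kwarg defaults below are assumptions drawn from the commit messages, not verified docs:

```python
from transformer_engine.pytorch import ops

block = ops.Sequential(
    ops.Linear(1024, 1024),
    ops.AddExtraInput(in_place=False),   # formerly AddInPlace (implicitly in-place)
    ops.MakeExtraOutput(in_place=True),  # new kwarg from this PR
)
```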

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Fix cuDNN version support in PyTorch DPA and fused attention

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
* Fixed integer overflow when computing offsets

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
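
The usual shape of such a fix is to widen to 64-bit before the multiply, not after. A tiny NumPy illustration with made-up dimensions:

```python
import numpy as np

rows, cols = 200_000, 16_384              # plausible large tensor dims
bad = np.int32(rows) * np.int32(cols)     # wraps: 3.28e9 > INT32_MAX
good = np.int64(rows) * np.int64(cols)    # widen *before* multiplying
assert good == 3_276_800_000 and bad != good
```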

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…elism correctly for sequence-parallel inputs (#1980)

* updated GemmPrimitive partitioning rules to explicitly control all-reduce vs. reduce-scatter for sequence-parallelism

Signed-off-by: Alp Dener <adener@nvidia.com>
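
The choice being controlled is which collective resolves the per-device partial GEMM outputs: an all-reduce leaves the result replicated, while sequence parallelism wants a reduce-scatter that keeps the result sharded along the sequence axis. A toy JAX illustration of the two collectives (not TE's partitioning code):

```python
from functools import partial

import jax
import jax.numpy as jnp

n = jax.local_device_count()
x = jnp.ones((n, 2 * n))  # one row of partial results per device

@partial(jax.pmap, axis_name="tp")
def all_reduce(v):
    return jax.lax.psum(v, "tp")  # full, replicated output

@partial(jax.pmap, axis_name="tp")
def reduce_scatter(v):
    return jax.lax.psum_scatter(v, "tp", tiled=True)  # each device keeps a shard
```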

* corrected handling of FSDP sharding for the RHS operand

Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use correct logical axes variable to identify sequence-parallel dim in LayerNormDenseGeneral

Signed-off-by: Alp Dener <adener@nvidia.com>

* fixed linting issues

Signed-off-by: Alp Dener <adener@nvidia.com>

* added assert on sequence-parallel options when GemmPrimitive is disabled

Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Alp Dener <adener@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* optimize static grad outputs

Signed-off-by: Robin Zhang <robinz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* Support RMSNorm for QK

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rms -> RMSNorm, l2 -> L2Normalization (align with current pattern)

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Support LayerNorm + init refactor

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Before/after RoPE

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix pylint

Signed-off-by: Evgeny <etsykunov@nvidia.com>

---------

Signed-off-by: Evgeny <etsykunov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…#1994)

* Remove deprecated device arg

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Remove test

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fixed double buffering issue for asymmetric layers

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Add ops for dropout and constant scale

Signed-off-by: Tim Moon <tmoon@nvidia.com>
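
Hypothetical usage of the two new ops; the constructor arguments are assumptions based on the commit message:

```python
import torch
from transformer_engine.pytorch import ops

block = ops.Sequential(ops.Dropout(0.1), ops.ConstantScale(0.5))
y = block(torch.randn(32, 1024, device="cuda"))
```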

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add verbosity only for failing tests

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Prune some tests and preinit recipe

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Prune further tests

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix multitensor

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Minor fixes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix a100

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
* remove reciprocal op

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* Refactor Quantizer::create_tensor function

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug when constructing FP8 tensor

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add quantize function to C++ quantizers

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Prototype function to coerce Python quantized tensors to match quantizer

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Use quantizer class in tex.quantize

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add FP8 current scaling support for activation backward

Signed-off-by: Tim Moon <tmoon@nvidia.com>
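
For context, a minimal sketch of FP8 current scaling, which this change extends to the activation backward: the scale comes from the tensor's own amax rather than a delayed-scaling history (448 is the e4m3fn max; the helper name is illustrative):

```python
import torch

FP8_MAX = 448.0  # largest finite magnitude of float8_e4m3fn

def current_scale_quantize(x: torch.Tensor):
    amax = x.abs().amax().float()
    scale = FP8_MAX / torch.clamp(amax, min=1e-12)
    data = (x.float() * scale).to(torch.float8_e4m3fn)
    return data, scale.reciprocal()  # quantized data and its scale-inv
```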

* Disable quantized GEMM output with FP8 current scaling

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add coerce_tensor functions for MXFP8 and DSv3

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Avoid quantizing empty tensors

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Use consistent shapes for FP8 transposes

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* In attention impl, construct FP8 tensors with pre-initialized scale-invs

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Initialize MXFP8 scales to zero

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Store copy of quantizer when creating quantized tensors

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix linter warnings

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Make sure quantized tensors have private quantizer

Avoid problems with in-place ops after quantizer usages are changed externally.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Rename "coerce_tensor" to "convert_and_update_tensor"

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Make sure CUDA context is available when launching NVRTC kernel

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Expose CUDA context creation function externally

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: zhongboz <zhongboz@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: zhongboz <zhongboz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Jan Bielak <jbielak@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Compute amax in activation kernels when the output pointer is provided, even for non-fp8 outputs

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 9f13fe2fefc58cae93bc467d87d01ecf792a0381)

* Initialize metatensor values

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Fuse computation of amax into the activation kernel for fp8 current scaling

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 2b54327ac9c931a5340983a79e99de5caa0399dd)
Signed-off-by: Jan Bielak <jbielak@nvidia.com>
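
The fusion's observable semantics, sketched in plain PyTorch (the real work happens on-chip inside the activation kernel, which avoids a second pass over the output):

```python
import torch

def gelu_with_amax(x: torch.Tensor):
    y = torch.nn.functional.gelu(x)
    return y, y.abs().amax()  # amax produced alongside the activation
```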

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Zero out amax in `create_hp_tensor_with_amax` instead of relying on `Float8CurrentScalingQuantizer.__init__` to zero-initialize it

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Fix merge conflict bug with clearing op outputs

Signed-off-by: Tim Moon <tmoon@nvidia.com>
yuzhongw-nvidia and others added 9 commits September 16, 2025 22:59
…allel (#2125)

* fix memory overhead of all-gather with sequence parallelism

Signed-off-by: Yuzhong Wang <yuzhongw@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/pytorch/tensor/_internal/float8_blockwise_tensor_base.py

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* quick fix for errors with UB buffers

Signed-off-by: Yuzhong Wang <yuzhongw@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/pytorch/module/linear.py

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* Avoid deallocating FP8 scale-invs since they are reused

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Yuzhong Wang <yuzhongw@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
… (#2045)

* feat: add cutlass group gemm support

Signed-off-by: Min Yang <min.yang@shopee.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refactor: refactor multi tensor gemm interface

Signed-off-by: Min Yang <min.yang@shopee.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refactor: refactor nvte_multi_stream_cublas_gemm func and add license info

Signed-off-by: Min Yang <min.yang@shopee.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feat: add unit test for cutlass group gemm

Signed-off-by: Min Yang <min.yang@shopee.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feat: add cutlass support type protect

Signed-off-by: Min Yang <min.yang@shopee.com>

* add tests and fix lint

Signed-off-by: Xin Yao <xiny@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feat: fix unit tests error

Signed-off-by: Min Yang <min.yang@shopee.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feat: refactor host workspace malloc

Signed-off-by: Min Yang <min.yang@shopee.com>

* update cutlass

Signed-off-by: Xin Yao <xiny@nvidia.com>

* update cutlass

Signed-off-by: Xin Yao <xiny@nvidia.com>

* further relax threshold and add an env var to warn on fallback

Signed-off-by: Xin Yao <xiny@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Min Yang <min.yang@shopee.com>
Signed-off-by: Xin Yao <xiny@nvidia.com>
Signed-off-by: alan yang <89962857+cassiewilliam@users.noreply.github.com>
Co-authored-by: Min Yang <min.yang@shopee.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
feature(FA3,MLA,CP):
1. Update FA3 to commit-id 3ba6f82 (tag 2.8.0.post2 with compile error fixed); PR-1604 adds support for hdimQK != hdimV in backward
2. Update the get_attention_backend method, since FA3 now supports MLA
3. Add CP MLA support for FA3
4. Add unit tests for FA3 MLA CP
5. Update the attention doc

Signed-off-by: zhujian <zhujian.whu.cs@gmail.com>
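
A hedged sketch of the MLA shape this enables, where the QK head dim differs from the V head dim (sizes are hypothetical; tuple-valued `kv_channels` is assumed to be the way to express it):

```python
from transformer_engine.pytorch import DotProductAttention

# hdimQK = 192, hdimV = 128 (MLA-style, now supported by the FA3 backward)
dpa = DotProductAttention(num_attention_heads=16, kv_channels=(192, 128))
```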
…#2185)

* Fix cuDNN version checks for the KV cache on sm89; add a cuDNN version check in preparation for 9.14 when getting the backend

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Minor fix for cuDNN version condition check

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Fix build and UT issues
Upcoming ROCm and JAX 0.8 support - cherry-pick: 8e25035 03525d3 (#403)

@ipanfilo ipanfilo left a comment

Dummy comment so GitHub lets me submit my review comments

@ipanfilo ipanfilo requested a review from wangye805 January 15, 2026 22:42
@ipanfilo ipanfilo requested a review from wangye805 January 17, 2026 02:47

@wenchenvincent wenchenvincent left a comment

LGTM.

@wangye805 wangye805 left a comment

LGTM

@ipanfilo ipanfilo merged commit a627ad8 into dev Jan 18, 2026
2 of 4 checks passed