Conversation

@ipanfilo ipanfilo commented Jan 13, 2026

Description

IFU TE 2.8.0.dev0 commit 7f77127 from 2025-09-18

Fixes #14813

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

KshitijLakhani and others added 30 commits July 21, 2025 01:04
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
* Remove GH pinned deps

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Pin onnxscript

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Reset FP8 weight workspace if usages are invalid

Signed-off-by: Tim Moon <tmoon@nvidia.com>
…end` (#1965)

Update utils.py

Fix the incorrect FP8 attention condition in `get_attention_backend`

Signed-off-by: yuzhongw-nvidia <yuzhongw@nvidia.com>
Co-authored-by: Xiaowei Ren <103958965+xrennvidia@users.noreply.github.com>
* exclude 9.10.0/.1 for certain configs

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix kv_channels

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add get_backend to tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add init files

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix numerics and cuda graph tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix jax tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove prints

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor changes after renaming

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix import structure and rename get_attention_backends

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix docs and benchmarks

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix get backend calls

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Revert "fix get backend calls"

This reverts commit 653cbb51c697bc2f975416bb3aac1d85f76c36dc.
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Revert "fix docs and benchmarks"

This reverts commit 98cd52e04ff7c53e26b412195f5744e39f7ed0e9.
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix docs, benchmarks and pre-commit ci

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix dpa/mha flash attn selection

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix rng states

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix ModelConfig

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix backend selection on Ampere

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix issues from last merge

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Update tests/pytorch/utils.py

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove initialization of rng_states to None

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* redefine ModelConfig

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix ModelConfig

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix seed for CP tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Update tests/pytorch/test_sanity.py

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* move fixture from utils to individual tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix CI

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
…ug quantizer (#1963)

* Debug linear layer when saving original input and using debug quantizer

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Workaround bugs with quantizing with only column-wise usage

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unused imports

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Avoid unnecessary row-wise data

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Workaround bugs with quantizing with only column-wise usage

FP8 does not support transpose-only cast.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
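
The workaround above exists because plain FP8 kernels cannot emit only the transposed ("column-wise") data. A minimal PyTorch sketch of the idea, assuming `torch.float8_e4m3fn` as the target dtype (illustrative only, not the TE internals):

```python
import torch

def quantize_columnwise_only(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Emulate a column-wise-only FP8 cast for a 2-D tensor.

    Since a transpose-only cast is unsupported, quantize row-wise
    first, then take a transposed copy of the quantized data.
    """
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)  # row-wise cast
    return x_fp8.t().contiguous()                # column-wise (transposed) copy
```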

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fixed conflicts

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Minor code refactoring to avoid unnecessary checks

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed typo

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed dBias accumulation error due to initialization. Minor code refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Test case to reproduce the init error

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed rowwise dbias error

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Changed ptx API

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added a struct for two packed FP8 values

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Rolled back to scalar code for columnwise scaling due to its better performance

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Minor corrections

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Rebased on main

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes per code review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Removed constexpr in C++ test suite to build faster

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Computed activations are now numerically truncated to InputType before scaling. Improved test suite.

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
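
For reference, a hedged sketch of the new numerics: compute the activation at high precision, truncate to the input dtype, and only then scale and cast to FP8 (GELU and `float8_e4m3fn` are arbitrary choices here):

```python
import torch

def ref_activation_fp8(x: torch.Tensor, scale: float) -> torch.Tensor:
    y = torch.nn.functional.gelu(x.float())  # activation computed in FP32
    y = y.to(x.dtype)                        # truncate to InputType first...
    return (y.float() * scale).to(torch.float8_e4m3fn)  # ...then scale and cast
```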

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Minor refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Minor refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Modified MXFP8 mismatch checks to account for FP8 numerics

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Implemented Jeremy's fixes to JAX test suite with an intermediate downcast

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Reduced the dims of the test tensors to improve CI runtime

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed memory alignment issue. Compute dbias without downcast.

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed misaligned memory issue also in gated kernels. Reduced size of MXFP8 gated tests

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Refactor _OperationFuserAutogradFunction.forward to use fewer parameters

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit f8f59b1bb184e89468058521df4cfff029ad909c)

* Rename `BackwardBiasActivation` to `BackwardActivationBias`

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 397c58fc296f801fe4ad600aadc2daff3b78be45)

* Use forward operation order in backward fused operations

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 2d37a9385069b066e6cdeff3eb9173c2079cb791)

* Rename `prev_op_grad_input_quantizer` to `prev_op_grad_output_quantizer`

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit d7ab5dfb23e216866f7f4fc4d7a99f625d329f1e)

* Make OperationFuser persistent

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 77984d9715d31e87519dc6ea1e02c483a81355a7)

* Distribute extra inputs to and collect extra outputs from multiple module groups in Sequential

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 0716aaad542e59f2c1ac4620167965a0334bbf71)

* Take requires_grad into account when fusing operations

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Change get_quantizer to return None if no quantization recipe is used

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Refactor pre_first_forward

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Fix for failing `test_make_graphed_callables[fp8_recipe0-*-True-*-linear_op]`

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Fix linting errors

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Apply suggestions from code review

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Fix fp8 meta tensors in CUDA Graph capture

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix failing distributed userbuffers tests

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

---------

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…1979)

Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
…ory length changes (#1985)

* Fix bug where TE ops were not updating fp8_meta dicts

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Rename reset_recipe_state function

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update error message when initializing meta device quantized weight without recipe

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix current device for cuDNN/cuBLAS handles

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add unit test

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use weight device and improve tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
… L0 (#1990)

Fix current scaling test_helper.py and enable test_helper.py in L0

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
…on-MXFP8 recipes. (#1962)

* add manage_primitives() helper

* disable GEMM primitives for non-MXFP8 recipes

* implement the NVTE_JAX_CUSTOM_CALLS + deprecate NVTE_JAX_CUSTOM_CALLS_RE

* replace NVTE_JAX_CUSTOM_CALLS_RE with NVTE_JAX_CUSTOM_CALLS in TE tests and examples

* fix use_jax_gemm contextmanager

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
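
A sketch of the migration; the regex form for the deprecated variable follows existing usage, while the named-toggle syntax for the new variable is an assumption:

```python
import os

# Old (deprecated): regex over custom-call names, e.g. disable GemmPrimitive.
os.environ["NVTE_JAX_CUSTOM_CALLS_RE"] = "^(?!GemmPrimitive$).+$"
# New: named toggle (assumed syntax).
os.environ["NVTE_JAX_CUSTOM_CALLS"] = "GemmPrimitive=false"
# Either variable must be set before transformer_engine.jax is first used.
```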

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* Increase intermediate precision and reuse tensors from fwd

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* JIT warmup only when required

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Recompute only rsqrt_norm

Signed-off-by: Evgeny <etsykunov@nvidia.com>
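
Since the rsqrt factor is a cheap reduction over the last dim, the backward can recompute it instead of saving the normalized output, trading a little compute for activation memory. A self-contained PyTorch sketch of that trade-off (not the TE kernel):

```python
import torch

class RMSNormRecompute(torch.autograd.Function):
    """RMSNorm that saves only (x, weight) and recomputes rsqrt_norm."""

    @staticmethod
    def forward(ctx, x, weight, eps=1e-6):
        rsqrt_norm = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
        ctx.save_for_backward(x, weight)
        ctx.eps = eps
        return x * rsqrt_norm * weight

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        # Recompute the cheap reduction rather than having saved it.
        rsqrt_norm = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + ctx.eps)
        xhat = x * rsqrt_norm
        grad_w = (grad_out * xhat).sum(tuple(range(x.dim() - 1)))
        gx = grad_out * weight
        grad_x = rsqrt_norm * (gx - xhat * (gx * xhat).mean(-1, keepdim=True))
        return grad_x, grad_w, None
```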

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Evgeny <etsykunov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Fix cuDNN lib runtime loading and simplify

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Mark output tensors as not deletable in backward

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Add `in_place` kwarg to `MakeExtraOutput`

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Rename `AddInPlace` to `AddExtraInput` and add an `in_place` kwarg

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
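
A hypothetical sketch of the renamed op in the fusible-ops API; the composition and kwarg defaults below are assumptions drawn from the commit messages, not verified docs:

```python
from transformer_engine.pytorch import ops

block = ops.Sequential(
    ops.Linear(1024, 1024),
    ops.AddExtraInput(in_place=False),   # formerly AddInPlace (implicitly in-place)
    ops.MakeExtraOutput(in_place=True),  # new kwarg from this PR
)
```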

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Fix cuDNN version support in PyTorch DPA and fused attention

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
* Fixed integer overflow when computing offsets

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
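
The usual shape of such a fix is to widen to 64-bit before the multiply, not after. A tiny NumPy illustration with made-up dimensions:

```python
import numpy as np

rows, cols = 200_000, 16_384              # plausible large tensor dims
bad = np.int32(rows) * np.int32(cols)     # wraps: 3.28e9 > INT32_MAX
good = np.int64(rows) * np.int64(cols)    # widen *before* multiplying
assert good == 3_276_800_000 and bad != good
```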

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…elism correctly for sequence-parallel inputs (#1980)

* updated GemmPrimitive partitioning rules to explicitly control all-reduce vs. reduce-scatter for sequence-parallelism

Signed-off-by: Alp Dener <adener@nvidia.com>
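
The choice being controlled is which collective resolves the per-device partial GEMM outputs: an all-reduce leaves the result replicated, while sequence parallelism wants a reduce-scatter that keeps the result sharded along the sequence axis. A toy JAX illustration of the two collectives (not TE's partitioning code):

```python
from functools import partial

import jax
import jax.numpy as jnp

n = jax.local_device_count()
x = jnp.ones((n, 2 * n))  # one row of partial results per device

@partial(jax.pmap, axis_name="tp")
def all_reduce(v):
    return jax.lax.psum(v, "tp")  # full, replicated output

@partial(jax.pmap, axis_name="tp")
def reduce_scatter(v):
    return jax.lax.psum_scatter(v, "tp", tiled=True)  # each device keeps a shard
```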

* corrected handling of FSDP sharding for the RHS operand

Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use correct logical axes variable to identify sequence-parallel dim in LayerNormDenseGeneral

Signed-off-by: Alp Dener <adener@nvidia.com>

* fixed linting issues

Signed-off-by: Alp Dener <adener@nvidia.com>

* added assert on sequence-parallel options when GemmPrimitive is disabled

Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Alp Dener <adener@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* optimize static grad outputs

Signed-off-by: Robin Zhang <robinz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* Support RMSNorm for QK

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rms -> RMSNorm, l2 -> L2Normalization (align with current pattern)

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Support LayerNorm + init refactor

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Before/after RoPE

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix pylint

Signed-off-by: Evgeny <etsykunov@nvidia.com>

---------

Signed-off-by: Evgeny <etsykunov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…#1994)

* Remove deprecated device arg

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Remove test

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fixed double buffering issue for asymmetric layers

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Add ops for dropout and constant scale

Signed-off-by: Tim Moon <tmoon@nvidia.com>
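
Hypothetical usage of the two new ops; the constructor arguments are assumptions based on the commit message:

```python
import torch
from transformer_engine.pytorch import ops

block = ops.Sequential(ops.Dropout(0.1), ops.ConstantScale(0.5))
y = block(torch.randn(32, 1024, device="cuda"))
```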

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add verbosity only for failing tests

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Prune some tests and preinit recipe

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Prune further tests

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix multitensor

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Minor fixes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix a100

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
* remove reciprocal op

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* Refactor Quantizer::create_tensor function

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug when constructing FP8 tensor

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add quantize function to C++ quantizers

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Prototype function to coerce Python quantized tensors to match quantizer

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Use quantizer class in tex.quantize

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add FP8 current scaling support for activation backward

Signed-off-by: Tim Moon <tmoon@nvidia.com>
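
For context, a minimal sketch of FP8 current scaling, which this change extends to the activation backward: the scale comes from the tensor's own amax rather than a delayed-scaling history (448 is the e4m3fn max; the helper name is illustrative):

```python
import torch

FP8_MAX = 448.0  # largest finite magnitude of float8_e4m3fn

def current_scale_quantize(x: torch.Tensor):
    amax = x.abs().amax().float()
    scale = FP8_MAX / torch.clamp(amax, min=1e-12)
    data = (x.float() * scale).to(torch.float8_e4m3fn)
    return data, scale.reciprocal()  # quantized data and its scale-inv
```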

* Disable quantized GEMM output with FP8 current scaling

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add coerce_tensor functions for MXFP8 and DSv3

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Avoid quantizing empty tensors

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Use consistent shapes for FP8 transposes

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* In attention impl, construct FP8 tensors with pre-initialized scale-invs

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Initialize MXFP8 scales to zero

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Store copy of quantizer when creating quantized tensors

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix linter warnings

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Make sure quantized tensors have private quantizer

Avoid problems with in-place ops after quantizer usages are changed externally.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Rename "coerce_tensor" to "convert_and_update_tensor"

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Make sure CUDA context is available when launching NVRTC kernel

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Expose CUDA context creation function externally

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: zhongboz <zhongboz@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: zhongboz <zhongboz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Jan Bielak <jbielak@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Compute amax in activation kernels when the output pointer is provided, even for non-fp8 outputs

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 9f13fe2fefc58cae93bc467d87d01ecf792a0381)

* Initialize metatensor values

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Fuse computation of amax into the activation kernel for fp8 current scaling

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 2b54327ac9c931a5340983a79e99de5caa0399dd)
Signed-off-by: Jan Bielak <jbielak@nvidia.com>
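
The fusion's observable semantics, sketched in plain PyTorch (the real work happens on-chip inside the activation kernel, which avoids a second pass over the output):

```python
import torch

def gelu_with_amax(x: torch.Tensor):
    y = torch.nn.functional.gelu(x)
    return y, y.abs().amax()  # amax produced alongside the activation
```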

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Zero out amax in `create_hp_tensor_with_amax` instead of relying on `Float8CurrentScalingQuantizer.__init__` to zero-initialize it

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Fix merge conflict bug with clearing op outputs

Signed-off-by: Tim Moon <tmoon@nvidia.com>
yuzhongw-nvidia and others added 9 commits September 16, 2025 22:59
…allel (#2125)

* fix memory overhead of all-gather with sequence parallelism

Signed-off-by: Yuzhong Wang <yuzhongw@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/pytorch/tensor/_internal/float8_blockwise_tensor_base.py

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* quick fix for errors with UB buffers

Signed-off-by: Yuzhong Wang <yuzhongw@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/pytorch/module/linear.py

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* Avoid deallocating FP8 scale-invs since they are reused

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Yuzhong Wang <yuzhongw@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
… (#2045)

* feat: add cutlass group gemm support

Signed-off-by: Min Yang <min.yang@shopee.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refactor: refactor multi tensor gemm interface

Signed-off-by: Min Yang <min.yang@shopee.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refactor: refactor nvte_multi_stream_cublas_gemm func and add license info

Signed-off-by: Min Yang <min.yang@shopee.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feat: add unit test for cutlass group gemm

Signed-off-by: Min Yang <min.yang@shopee.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feat: add cutlass support type protect

Signed-off-by: Min Yang <min.yang@shopee.com>

* add tests and fix lint

Signed-off-by: Xin Yao <xiny@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feat: fix unit tests error

Signed-off-by: Min Yang <min.yang@shopee.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feat: refactor host workspace malloc

Signed-off-by: Min Yang <min.yang@shopee.com>

* update cutlass

Signed-off-by: Xin Yao <xiny@nvidia.com>

* update cutlass

Signed-off-by: Xin Yao <xiny@nvidia.com>

* further relax threshold and add an env var to warn on fallback

Signed-off-by: Xin Yao <xiny@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Min Yang <min.yang@shopee.com>
Signed-off-by: Xin Yao <xiny@nvidia.com>
Signed-off-by: alan yang <89962857+cassiewilliam@users.noreply.github.com>
Co-authored-by: Min Yang <min.yang@shopee.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
feature(FA3,MLA,CP):
1. Update FA3 to commit-id 3ba6f82 (tag 2.8.0.post2 with compile error fixed); PR-1604 adds support for hdimQK != hdimV in backward
2. Update the get_attention_backend method, since FA3 now supports MLA
3. Add CP MLA support for FA3
4. Add unit tests for FA3 MLA CP
5. Update the attention doc

Signed-off-by: zhujian <zhujian.whu.cs@gmail.com>
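
A hedged sketch of the MLA shape this enables, where the QK head dim differs from the V head dim (sizes are hypothetical; tuple-valued `kv_channels` is assumed to be the way to express it):

```python
from transformer_engine.pytorch import DotProductAttention

# hdimQK = 192, hdimV = 128 (MLA-style, now supported by the FA3 backward)
dpa = DotProductAttention(num_attention_heads=16, kv_channels=(192, 128))
```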
…#2185)

* Fix cuDNN version checks for the KV cache on sm89; add a cuDNN version check in preparation for 9.14 when getting the backend

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Minor fix for cuDNN version condition check

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Fix build and UT issues
Upcoming ROCm and JAX 0.8 support - cherry-pick: 8e25035 03525d3 (#403)

@ipanfilo ipanfilo left a comment

Dummy comment so GitHub lets me submit my review comments

@ipanfilo ipanfilo requested a review from wangye805 January 15, 2026 22:42
@ipanfilo ipanfilo requested a review from wangye805 January 17, 2026 02:47

@wenchenvincent wenchenvincent left a comment

LGTM.

@wangye805 wangye805 left a comment

LGTM

@ipanfilo ipanfilo merged commit a627ad8 into dev Jan 18, 2026
2 of 4 checks passed