
[PyTorch] Introduce quantizer roles#2620

Open
negvet wants to merge 48 commits into NVIDIA:main from negvet:semantic_quantizer_roles

Conversation

@negvet
Collaborator

@negvet negvet commented Jan 23, 2026

Description

Introducing QuantizerRole

@dataclasses.dataclass(frozen=True)
class QuantizerRole:
    module_type: str = ""   # e.g. "linear", "grouped_linear", "dpa"
    tensor_type: str = ""   # e.g. "input", "weight", "grad_output", "qkv", "s"
    name: str = ""          # instance name, e.g. "qkv", "proj", "fc1", "fc2"

This API makes it possible to express intents as specific as "set this LayerNormLinear in this transformer layer to be less aggressively quantized" (a fine-grained, per-module/per-tensor quantization control mechanism).
See test_custom_recipe.py::test_custom_recipe_quantization_targets().
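As a minimal sketch of the idea, a user-supplied factory can dispatch on the role fields. The `QuantizerRole` dataclass below matches the PR description; the string return values are stand-ins for real quantizer objects, and the name `"fc1"` is purely illustrative:

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class QuantizerRole:
    module_type: str = ""   # e.g. "linear", "grouped_linear", "dpa"
    tensor_type: str = ""   # e.g. "input", "weight", "grad_output"
    name: str = ""          # instance name, e.g. "qkv", "proj", "fc1", "fc2"

def qfactory(role):
    # Boundary slots (output in forward, grad_input in backward) may
    # arrive as None, so guard first.
    if role is None:
        return "default-quantizer"
    # Target one module instance: quantize its weight less aggressively.
    if role.name == "fc1" and role.tensor_type == "weight":
        return "high-precision-quantizer"
    return "default-quantizer"
```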

The quantizer factory uses roles to dispatch quantizers as needed.

Each TE module/op emits a list of QuantizerRole objects:

  • Linear, LayerNormLinear, LayerNormMLP emit module_type="linear" with tensor_type in {"input", "weight", "grad_output"}.
  • GroupedLinear emits module_type="grouped_linear".

CustomRecipe accepts a qfactory callable that receives QuantizerRole and returns a quantizer.

Factories can be composed: e.g., first dispatch to different sub-factories based on module_type (dpa vs. linear), then refine based on tensor_type.
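A hypothetical composition along these lines, with stand-in strings in place of real quantizer objects (the sub-factory names and return labels are illustrative, not part of the PR):

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class QuantizerRole:
    module_type: str = ""
    tensor_type: str = ""
    name: str = ""

def linear_factory(role):
    # Refine on tensor_type within the linear sub-factory.
    return "weight-quantizer" if role.tensor_type == "weight" else "activation-quantizer"

def dpa_factory(role):
    return "attention-quantizer"

def composed_factory(role):
    if role is None:                 # boundary slots carry no role
        return "default-quantizer"
    # Dispatch to a sub-factory based on module_type.
    sub = {"linear": linear_factory, "dpa": dpa_factory}.get(role.module_type)
    return sub(role) if sub else "default-quantizer"
```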

Summary:

  • Modules implement get_quantizer_roles() that returns a list of QuantizerRole objects.
  • During set_meta_tensor(), modules call get_quantizer_roles() and pass roles to RecipeState.create().
  • RecipeState.create() assigns roles to the state (e.g., CustomRecipeState.roles).
  • CustomRecipeState.make_quantizers() calls qfactory(role) for each role to create quantizers.
  • The factory can inspect role.module_type, role.tensor_type, and role.name to dispatch to different quantizers.
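The flow summarized above can be sketched with simplified stand-in classes; the class and method names follow the summary, but the bodies here are illustrative, not the actual TE implementation:

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class QuantizerRole:
    module_type: str = ""
    tensor_type: str = ""
    name: str = ""

class CustomRecipeState:
    def __init__(self, qfactory, roles):
        self.qfactory = qfactory
        self.roles = roles          # assigned during RecipeState.create()

    def make_quantizers(self):
        # One factory call per slot; the factory dispatches on role fields.
        return [self.qfactory(role) for role in self.roles]

roles = [
    QuantizerRole("linear", "input", "fc1"),
    QuantizerRole("linear", "weight", "fc1"),
    None,  # boundary slot (output)
]
state = CustomRecipeState(
    lambda r: "skip" if r is None else f"{r.tensor_type}-q",
    roles,
)
quantizers = state.make_quantizers()
```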

This PR enables granular control over the recipe, though it may still have some limitations and edge cases.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

negvet and others added 4 commits January 23, 2026 15:14
…ipe state

Signed-off-by: Evgeny <etsykunov@nvidia.com>
@negvet negvet requested review from cyanguwa and timmoon10 January 23, 2026 15:32
@greptile-apps
Contributor

greptile-apps bot commented Jan 23, 2026

Greptile Summary

This PR introduces QuantizerRole — a frozen dataclass (module_type, tensor_type, name) that allows fine-grained, per-module/per-tensor quantization dispatch when using CustomRecipe. All major TE modules (Linear, LayerNormLinear, LayerNormMLP, GroupedLinear, DotProductAttention) now implement get_quantizer_roles() to describe their internal quantizer slots, and the CustomRecipeState machinery is updated to pass these roles to the user-provided qfactory. The PR also adds support for DelayedScalingRequest — a mechanism to request TE-managed stateful (delayed-scaling) quantizers from within a CustomRecipe — and ships two new helper modules (quantization_recipes_base.py, quantization_factory_examples.py) with reference factory implementations.

Key points:

  • QuantizerRole dataclass is the central new abstraction; it is experimental and marked as such.
  • Boundary slots (output in forward, grad-input in backward) default to None as the role, which means every factory must guard if role is None explicitly — a new contract that is not consistently applied in the updated test files.
  • test_nvfp4_module_exact.py: the factory inside get_nvfp4_quantizer_factory() accesses role.tensor_type without a None guard, causing AttributeError at runtime for boundary slots. The same factory also returns None for some tensor types, which CustomRecipeState.make_quantizers() now unconditionally rejects with ValueError.
  • LayerNormMLP.get_quantizer_roles(): forward slots 2 and 3 share an identical QuantizerRole (tensor_type="input", name=fc2_name), making it impossible for a factory to distinguish the fc1 output tensor from the fc2 input tensor by role alone.
  • _has_delayed_scaling_state() helper correctly unifies built-in and custom DS state detection across the codebase.
  • Several existing recipe.delayed() guards are correctly updated to use _has_delayed_scaling_state(), enabling custom-recipe modules to participate in the global amax-reduction path.

Confidence Score: 3/5

  • The core role-dispatch infrastructure looks sound, but a runtime crash exists in test_nvfp4_module_exact.py that would surface during CI runs.
  • The PR introduces a well-designed and well-tested new API. The main concern is a logic bug in test_nvfp4_module_exact.py where the factory does not handle role=None, leading to an AttributeError for boundary slots under the new dispatch contract. This would cause test failures. Additionally, the None-rejection in make_quantizers() is a silent breaking change for third-party factories. These are moderate-severity issues that should be addressed before merge.
  • tests/pytorch/nvfp4/test_nvfp4_module_exact.py (None-role crash), transformer_engine/pytorch/module/layernorm_mlp.py (duplicate slot roles), transformer_engine/pytorch/quantization.py (None rejection breaking change)

Important Files Changed

Filename Overview
transformer_engine/pytorch/quantization.py Introduces QuantizerRole, QuantizerRequest, DelayedScalingRequest dataclasses and reworks CustomRecipeState.make_quantizers() to use role-based dispatch; rejects None factory returns (potential breaking change); adds _has_delayed_scaling_state helper to support mixed DS/stateless quantizers in one state.
transformer_engine/pytorch/module/base.py Adds output_quantizer_role / grad_input_quantizer_role properties and get_quantizer_roles() base method; updates set_meta_tensor to pass roles to RecipeState.create(); replaces delayed() checks with _has_delayed_scaling_state throughout for custom-recipe DS compatibility.
transformer_engine/pytorch/module/layernorm_mlp.py Adds get_quantizer_roles() for LayerNormMLP; forward slots 2 and 3 share an identical role (both fc2 input), making individual per-slot targeting impossible; backward slot ordering follows GEMM index rather than execution order, which may be surprising.
transformer_engine/pytorch/attention/multi_head_attention.py Adds _update_output_quantizer_roles() that wires boundary roles across QKV→DPA→proj edges; called in forward(); handles both self and cross attention; logic looks correct but boundary conditions for cross-attention with key_value module are worth verifying.
tests/pytorch/nvfp4/test_nvfp4_module_exact.py Updated to use quantization_ref_nvfp4, but the factory inside get_nvfp4_quantizer_factory() doesn't guard against role=None (boundary slots), causing AttributeError; also returns None for "output" and "grad_input" which is now rejected by CustomRecipeState.
transformer_engine/pytorch/custom_recipes/quantization_recipes_base.py New file providing canonical factory implementations (delayed_scaling, current_scaling, mxfp8, float8_block_scaling, nvfp4) that correctly handle role=None; well documented with role-dispatch logic.

Sequence Diagram

sequenceDiagram
    participant Module as TE Module<br/>(Linear/DPA/etc.)
    participant Base as TransformerEngineBaseModule
    participant RecipeState as RecipeState.create()
    participant CRS as CustomRecipeState
    participant Factory as qfactory (user-provided)
    participant DSRS as DelayedScalingRecipeState

    Module->>Base: set_meta_tensor(fwd, recipe)
    Base->>Module: get_quantizer_roles(fwd, num_quantizers)
    Module-->>Base: List[QuantizerRole | None]
    Base->>RecipeState: create(recipe, mode, num_quantizers, roles)
    RecipeState-->>CRS: CustomRecipeState(recipe, mode, num_quantizers)
    RecipeState->>CRS: state.roles = roles

    Module->>CRS: make_quantizers()
    loop for each slot i
        CRS->>Factory: qfactory(roles[i])
        alt role is None (boundary slot)
            Factory-->>CRS: Quantizer (user handles None)
        else role is QuantizerRole
            Factory-->>CRS: Quantizer OR DelayedScalingRequest
        end
    end

    CRS->>CRS: _handle_delayed_scaling_requests(raw)
    alt DS requests found
        CRS->>DSRS: DelayedScalingRecipeState(inner_recipe, n_ds_slots)
        DSRS-->>CRS: Float8Quantizer instances (splice into raw)
        CRS->>CRS: self._ds_state = DSRS
    end

    CRS-->>Module: List[Quantizer]
    Module->>Base: fp8_meta[key] = recipe_state
    Module->>Base: quantizers[key] = quantizers

Comments Outside Diff (4)

  1. tests/pytorch/nvfp4/test_nvfp4_module_exact.py, line 113-151 (link)

    Factory crashes on None role for boundary slots

    The factory inside get_nvfp4_quantizer_factory() accesses role.tensor_type unconditionally. Under the new role-based dispatch, boundary slots (output in forward, grad_input in backward) emit None as the role (from self._output_quantizer_role / self._grad_input_quantizer_role, which default to None). So when the factory is called with role=None, role.tensor_type raises AttributeError.

    Additionally, the two arms that return None (tensor_type == "output" and tensor_type == "grad_input") would now trigger a ValueError from CustomRecipeState.make_quantizers(), which rejects None factory returns.

    The factory should guard against None at the top and return a valid quantizer (or forward to a fallback factory) for those slots:

    def factory(role):
        if role is None:
            # Boundary slot (output / grad_input): return a default quantizer.
            return quantization_ref_nvfp4.NVFP4QuantizerRef(
                dtype=utils.Fp4Formats.E2M1,
                quant_tile_shape=(1, 16),
                pow_2_scales=False,
                with_rht=True,
            )
        if role.tensor_type == "input":
            ...
        elif role.tensor_type == "weight":
            ...
        elif role.tensor_type == "grad_output":
            ...
        else:
            # Fallback for any unrecognised tensor_type
            return quantization_ref_nvfp4.NVFP4QuantizerRef(...)

    The same pattern is used correctly in the updated factories in test_custom_recipe.py (e.g. if role is None: return Float8CurrentScalingQuantizer(...)) and in quantization_recipes_base.py.

  2. transformer_engine/pytorch/module/layernorm_mlp.py, line 2348-2358 (link)

    Duplicate roles for fc1 output and fc2 input make per-slot targeting impossible

    In the forward base list, slots 2 and 3 are identical QuantizerRole objects:

    # slot 2: fc1 output (labeled as fc2 input, consumer = fc2)
    QuantizerRole(module_type="linear", tensor_type="input", name=fc2_name),
    # slot 3: fc2 input
    QuantizerRole(module_type="linear", tensor_type="input", name=fc2_name),

    Both carry tensor_type="input" and name=fc2_name. A user factory that dispatches on role fields (e.g. role.name == "tl1.layernorm_mlp.fc2") will apply the same quantizer to both the activation-input-to-fc2 tensor (slot 2, the stored fc1 output used internally) and the fc2 GEMM input tensor (slot 3). Typically these are the same tensor, but the duplication makes it impossible for the factory to target only one of them.

    Consider either:

    • Using a dedicated tensor_type string (e.g. "fc1_output" vs "input") for slot 2 so the factory can differentiate, or
    • Adding a clear comment in the docstring warning that slots 2 and 3 always share the same role by design.
  3. transformer_engine/pytorch/module/layernorm_mlp.py, line 2360-2368 (link)

    Backward slot ordering may be surprising: fc1 precedes fc2

    The backward base list is ordered [fc1_grad_output, fc1_grad_input, fc2_grad_output, fc2_grad_input]. Numerically the backward pass runs fc2 first then fc1, so a user expecting the quantizer slots to follow execution order (fc2 → fc1) would misread this. The slot order follows the GEMM index convention (GEMM1=fc1, GEMM2=fc2), not execution order, which is a reasonable choice — but it's worth adding a note in the docstring explicitly stating that slot ordering is by GEMM index, not backward-pass execution order, to avoid confusion when users target individual backward slots by index.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  4. transformer_engine/pytorch/quantization.py, line 2930-2943 (link)

    None rejection in make_quantizers is a breaking change for existing factories

    The new guard unconditionally rejects None from the factory:

    for i, q in enumerate(raw):
        if q is None:
            raise ValueError(...)

    This is a silently breaking change for any qfactory that previously returned None to disable quantization for a particular slot (e.g. the reference factories in quantization_ref_nvfp4.py formerly returned None for "linear_output" / "linear_grad_input"). While the TODO comment acknowledges the missing IdentityQuantizer, it would be safer to also emit a deprecation warning (DeprecationWarning) with a helpful message rather than immediately raising, so downstream users get a migration path instead of an outright crash. Alternatively, the TODO should be tracked as a follow-up issue to avoid a hard cut-off without a replacement.
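The suggested softer migration path could look like the sketch below. `IdentityQuantizer` is a hypothetical stand-in named after the TODO mentioned in the comment, and `check_factory_returns` is an illustrative helper, not the actual TE code:

```python
import warnings

class IdentityQuantizer:
    """Hypothetical placeholder that leaves the tensor unquantized."""

def check_factory_returns(raw):
    out = []
    for i, q in enumerate(raw):
        if q is None:
            # Warn instead of raising, giving downstream factories a
            # migration path before the hard cut-off.
            warnings.warn(
                f"qfactory returned None for slot {i}; returning None is "
                "deprecated, return an identity/pass-through quantizer instead.",
                DeprecationWarning,
            )
            q = IdentityQuantizer()
        out.append(q)
    return out
```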

Last reviewed commit: eaa04bc

@greptile-apps

This comment was marked as off-topic.

Signed-off-by: Evgeny <etsykunov@nvidia.com>
greptile-apps[bot]

This comment was marked as outdated.

Signed-off-by: Evgeny <etsykunov@nvidia.com>
greptile-apps[bot]

This comment was marked as resolved.

Signed-off-by: Evgeny <etsykunov@nvidia.com>
greptile-apps[bot]

This comment was marked as resolved.

Collaborator

@timmoon10 timmoon10 left a comment


Overall this design is quite clean and generalizable.

Comment on lines +1320 to +1329
base = [
QuantizerRole(module_type="linear", tensor_type="input", name=name),
QuantizerRole(module_type="linear", tensor_type="weight", name=name),
QuantizerRole(module_type="linear", tensor_type="output", name=name),
]
else:
base = [
QuantizerRole(module_type="linear", tensor_type="grad_output", name=name),
QuantizerRole(module_type="linear", tensor_type="grad_input", name=name),
]
Collaborator

@timmoon10 timmoon10 Feb 20, 2026


"output" and "grad_input" roles don't make sense. In reality, we are implicitly assuming that the tensor will be consumed by another linear-like layer.

Suggested change

Before:

    base = [
        QuantizerRole(module_type="linear", tensor_type="input", name=name),
        QuantizerRole(module_type="linear", tensor_type="weight", name=name),
        QuantizerRole(module_type="linear", tensor_type="output", name=name),
    ]
else:
    base = [
        QuantizerRole(module_type="linear", tensor_type="grad_output", name=name),
        QuantizerRole(module_type="linear", tensor_type="grad_input", name=name),
    ]

After:

    base = [
        QuantizerRole(module_type="linear", tensor_type="input", name=name),
        QuantizerRole(module_type="linear", tensor_type="weight", name=name),
        QuantizerRole(module_type="linear", tensor_type="input", name=name),
    ]
else:
    base = [
        QuantizerRole(module_type="linear", tensor_type="grad_output", name=name),
        QuantizerRole(module_type="linear", tensor_type="grad_output", name=name),
    ]

Alternatively, if we want to use the output in FP8 DPA, the right role would be module_type="dpa" and tensor_type="input". We should probably make this configurable. I kind of like that this design is exposing the hidden assumptions we've been making.

Collaborator Author

@negvet negvet Feb 25, 2026


I agree about "output" and "grad_input" roles. Setting roles for those slots to None (the safest) and enabling the configuration. Also configured it in MHA.

Comment on lines +310 to +314
assert counts["input"] == 1
assert counts["weight"] == 1
assert counts["output"] == 1
assert counts["grad_output"] == 1
assert counts["grad_input"] == 1
Collaborator


Suggested change

Before:

    assert counts["input"] == 1
    assert counts["weight"] == 1
    assert counts["output"] == 1
    assert counts["grad_output"] == 1
    assert counts["grad_input"] == 1

After:

    assert counts["input"] == 2
    assert counts["weight"] == 1
    assert counts["output"] == 0
    assert counts["grad_output"] == 2
    assert counts["grad_input"] == 0

negvet and others added 2 commits February 20, 2026 14:31
Signed-off-by: Evgeny Tsykunov <etsykunov@nvidia.com>
greptile-apps[bot]

This comment was marked as resolved.

negvet and others added 5 commits February 20, 2026 15:05
Signed-off-by: Evgeny <etsykunov@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment


15 files reviewed, no comments


Comment on lines +85 to +88
def is_gemm(self) -> bool:
"""Whether this role belongs to a GEMM-based module."""
return self.module_type in self.GEMM_MODULE_TYPES

Collaborator


I think this is baking in assumptions about what formats are similar (our recent experiences with grouped tensors makes me wonder if the requirements for "linear" and "grouped_linear" will diverge in the future), and it's also not giving us that much convenience.

Suggested change (remove the helper):

    def is_gemm(self) -> bool:
        """Whether this role belongs to a GEMM-based module."""
        return self.module_type in self.GEMM_MODULE_TYPES

Collaborator Author


Sure, removed

Signed-off-by: Evgeny <etsykunov@nvidia.com>
@greptile-apps
Contributor

greptile-apps bot commented Mar 2, 2026

Additional Comments (2)

tests/pytorch/distributed/run_numerics_exact.py, line 63
Missing None check before accessing role.tensor_type. According to module implementations (Linear, LayerNormLinear, etc.), role can be None for output and grad_input quantizer slots when output_quantizer_role/grad_input_quantizer_role properties are not set. This will cause AttributeError.

        if role is None:
            return None
        if role.tensor_type == "input":

tests/pytorch/nvfp4/test_nvfp4_module_exact.py, line 83
Missing None check before accessing role.tensor_type. role can be None for output and grad_input quantizer slots. Add check at start of factory:

        if role is None:
            return None
        if role.tensor_type == "input":

negvet and others added 3 commits March 2, 2026 15:27
@negvet
Collaborator Author

negvet commented Mar 2, 2026

About 1d63084

Custom recipe factories can now return stateful quantizer requests (not just stateless quantizers) — TE detects these request dataclasses, allocates the required shared state (scale/amax buffers), and replaces them with real quantizer instances wired into existing infrastructure. Factories cannot create stateful quantizers directly because the shared buffers must be allocated across all slots simultaneously and registered with TE's global state manager for cross-module distributed reduction, recompute, and checkpointing — a lifecycle that only TE can orchestrate. Delayed scaling is supported via DelayedScalingRequest; the composed DelayedScalingRecipeState integrates with distributed amax reduction, activation recompute, and checkpointing. Factories can mix stateful requests and stateless quantizers per-slot within the same CustomRecipe.
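The request-detection pass described above can be sketched as follows. All classes and the replacement strings are simplified stand-ins for illustration; only the `DelayedScalingRequest` name comes from the PR:

```python
import dataclasses

@dataclasses.dataclass
class DelayedScalingRequest:
    """Marker a factory returns to request a TE-managed stateful quantizer."""
    margin: int = 0

def handle_delayed_scaling_requests(raw):
    # Find every slot that requested delayed scaling.
    ds_slots = [i for i, q in enumerate(raw) if isinstance(q, DelayedScalingRequest)]
    # In the real implementation, TE allocates the shared amax/scale buffers
    # for all these slots at once and registers them with the global state
    # manager; here we just splice in stand-in quantizer labels.
    for i in ds_slots:
        raw[i] = f"delayed-scaling-quantizer[{i}]"
    return raw

# Factories can mix stateful requests and stateless quantizers per-slot.
raw = ["stateless-q", DelayedScalingRequest(), "stateless-q"]
quantizers = handle_delayed_scaling_requests(raw)
```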

Signed-off-by: Evgeny <etsykunov@nvidia.com>
@negvet
Collaborator Author

negvet commented Mar 2, 2026

@cyanguwa @ptrendx, please review: custom recipe is now enabled for attention with current scaling, and the existing envvar-based routing remains functional.

Comment on lines +823 to +831
Forward (3 GEMMs x 3 = 9 slots):
GEMM1 -> QKV (at ``GEMM1_OUTPUT``),
GEMM2 -> O (at ``GEMM2_INPUT``),
GEMM3 -> S (at ``GEMM3_OUTPUT``).

Backward (3 GEMMs x 2 = 6 slots):
GEMM1 -> dQKV (at ``GRAD_OUTPUT1``),
GEMM2 -> dO (at ``GRAD_INPUT2``),
GEMM3 -> dP (at ``GRAD_INPUT3``).
Member


This is not intuitive to me - O is created by the last gemm, whereas S is created by the first gemm inside DPA (so should be gemm 2?).

Collaborator Author


This is not intuitive to me either, this is a legacy of the cudnn naming convention, trying to fix in f21ce2f

self.fp8_meta[fp8_meta_tensor_key] = recipe_state
self.quantizers[fp8_meta_tensor_key] = recipe_state.make_quantizers()

def get_quantizer_roles(
Member


I don't really understand how to properly create this function. We should have some documentation here on what is expected - let's say you have some number of gemms, you want to assign potentially different quantizers to their inputs/weights and you would like to name those roles X/Y etc. - this is what you would need to have in it.

Collaborator Author


Improved in 0c1ec9b

Comment on lines +121 to +122
This factory demonstrates how to use ``CustomRecipe`` with ``fp8_dpa=True``
to combine NVFP4 quantization for linear layers with FP8 attention.
Member


To be honest I was hoping that something like this would enable getting rid of fp8_dpa/fp8_mha toggles or at least make them irrelevant for the custom recipe.

Collaborator Author


fp8_dpa/fp8_mha is a kernel-dispatch gate. The factory says how to quantize; fp8_dpa/fp8_mha says whether to run the FP8 attention kernel path at all. It is possible to infer the gate from the quantizer type produced by the factory; that is a refactor, but nothing dramatic. Still, I would propose planning it as a follow-up, not in this PR.

cc @cyanguwa

Signed-off-by: Evgeny Tsykunov <etsykunov@nvidia.com>
@greptile-apps
Contributor

greptile-apps bot commented Mar 4, 2026

Additional Comments (3)

transformer_engine/pytorch/custom_recipes/quantization_recipes_base.py, line 1550
Contradictory inline comment

The comment # constrain scale to powers of 2 describes the exact opposite of the parameter value — False means scales are NOT constrained to powers of 2. A reader will likely infer the wrong behaviour from this comment. It should say something like # do not constrain scale to powers of 2 or mirror the adjacent parameter.

        force_pow_2_scales=False,  # do not constrain scale to powers of 2

transformer_engine/pytorch/custom_recipes/quantization_factory_examples.py, line 1471
nvfp4_linear_fp8_dpa_factory silently produces NVFP4 quantizers for DPA boundary slots when fp8_mha=True

When MultiheadAttention is configured with fp8_mha=True, _update_output_quantizer_roles sets core_attention.output_quantizer_role to a proper QuantizerRole(module_type="linear", tensor_type="input", name=proj_name). With that role, the is_dpa_boundary guard here is never triggered (because role.module_type is "linear", not empty), so the call falls through to _make_nvfp4_quantizer(role). This returns an NVFP4Quantizer, which then fails the assert isinstance(_q, _fp8_types) assertion inside get_attention_quantizers at runtime.

The factory is designed for the standalone fp8_dpa=True / fp8_mha=False scenario where DPA emits hint-only roles (empty module_type), but this is not clearly stated in the docstring. Consider adding an explicit fp8_mha=True guard or at least a note in the docstring warning users that this factory is incompatible with fp8_mha=True:

# DPA boundary slots (O output / dQKV grad-input): the fused attention
# kernel only supports FP8 quantizers here, regardless of the linear recipe.
# NOTE: when fp8_mha=True, MultiheadAttention wires output_quantizer_role with
# module_type="linear", which bypasses this guard.  This factory is only
# designed for standalone fp8_dpa=True usage (fp8_mha=False).
is_dpa_boundary = (
    role is not None
    and (
        # standalone DPA: hint-only role emitted by DotProductAttention
        (not role.module_type and ("dpa_output" in role.name or "dpa_grad_input" in role.name))
        # MHA-wired DPA: output_quantizer_role set by _update_output_quantizer_roles
        or (role.module_type == "linear" and role.tensor_type == "input")
    )
)

transformer_engine/pytorch/module/layernorm_mlp.py, line 2018
Forward slot 2 and slot 3 carry identical roles, making them indistinguishable to factories

Slot 2 (GEMM1_OUTPUT) is the output of FC1, and slot 3 (GEMM2_INPUT) is the input quantizer for FC2. Both are assigned QuantizerRole(module_type="linear", tensor_type="input", name=fc2_name).

Because the roles are identical, a custom factory cannot distinguish between the two quantizer slots for this internal activation. This mirrors the forward design choice in other modules (where GEMM1_OUTPUT and GEMM2_INPUT represent the same logical tensor), but it differs from the output-boundary convention: in Linear and LayerNormLinear, the analogous slot uses tensor_type="output" for the GEMM output position.

Consider using tensor_type="output" for slot 2 to allow factories to differentiate:

if fwd:
    base = [
        QuantizerRole(module_type="linear", tensor_type="input",  name=fc1_name),   # GEMM1_INPUT
        QuantizerRole(module_type="linear", tensor_type="weight", name=fc1_name),   # GEMM1_WEIGHT
        QuantizerRole(module_type="linear", tensor_type="output", name=fc1_name),   # GEMM1_OUTPUT (fc1→fc2 activation)
        QuantizerRole(module_type="linear", tensor_type="input",  name=fc2_name),   # GEMM2_INPUT
        QuantizerRole(module_type="linear", tensor_type="weight", name=fc2_name),   # GEMM2_WEIGHT
        self._output_quantizer_role,                                                 # GEMM2_OUTPUT
    ]

negvet and others added 2 commits March 4, 2026 16:03
@greptile-apps
Contributor

greptile-apps bot commented Mar 4, 2026

Additional Comments (2)

tests/pytorch/test_custom_recipe.py, line 104
("grad_output") is a parenthesized string, not a tuple. The in operator performs substring matching rather than membership testing. This happens to work for the current set of tensor_type values, but is incorrect Python and could silently fail if future tensor_type values contain "grad_output" as a substring.

Use a proper equality check instead:

        if role.tensor_type == "grad_output":

Or alternatively role.tensor_type in ("grad_output",) with a trailing comma to create a real single-element tuple. This pattern appears in multiple test functions: lines 104, 141, 203, and 260.
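A short demonstration of the pitfall, using only the literals from the comment:

```python
# ("grad_output") is just a parenthesized string, so `in` performs
# substring matching; ("grad_output",) is a real one-element tuple,
# so `in` performs membership testing.
not_a_tuple = ("grad_output")
real_tuple = ("grad_output",)

assert isinstance(not_a_tuple, str)
assert "grad" in not_a_tuple       # substring match: True, likely unintended
assert "grad" not in real_tuple    # membership test: "grad" != "grad_output"
```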


transformer_engine/pytorch/module/layernorm_mlp.py, line 2008
Slots 2 and 3 are assigned identical QuantizerRole objects, making them indistinguishable to factory dispatch:

slot 2 (GEMM1_OUTPUT): QuantizerRole(module_type="linear", tensor_type="input", name=fc2_name)
slot 3 (GEMM2_INPUT):  QuantizerRole(module_type="linear", tensor_type="input", name=fc2_name)

These represent physically different tensors (output of fc1 before GeLU vs. input to fc2 after GeLU) and should have distinct roles so factories can apply different quantization settings per slot.

Slot 2 should identify as the fc1 output:

QuantizerRole(module_type="linear", tensor_type="output", name=fc1_name),  # slot 2: GEMM1 output
QuantizerRole(module_type="linear", tensor_type="input",  name=fc2_name),  # slot 3: GEMM2 input

This allows factories to inspect role.name and role.tensor_type independently for fine-grained per-slot quantization control.

Signed-off-by: Evgeny <etsykunov@nvidia.com>
@greptile-apps
Contributor

greptile-apps bot commented Mar 4, 2026

Additional Comments (4)

tests/pytorch/nvfp4/test_nvfp4_module_exact.py, line 116
AttributeError when factory is called with role=None

Since the Linear/LayerNormLinear/etc. output and grad-input slots now default to None (when _output_quantizer_role / _grad_input_quantizer_role are not set), qfactory(None) will be called for those slots.

The inner factory starts with if role.tensor_type == "input": without first checking role is None, which will immediately raise AttributeError: 'NoneType' object has no attribute 'tensor_type' when run through a CustomRecipe.

The same crash pattern exists in tests/pytorch/distributed/run_numerics_exact.py at the equivalent factory.

    def factory(role):
        if role is None:
            return None
        if role.tensor_type == "input":

transformer_engine/pytorch/module/layernorm_mlp.py, line 2153
Duplicate backward role for slot 3 (fc2 grad-input / fc1 output-grad)

Slot 3 (the fc2 GEMM's grad-input, which is the internal boundary gradient flowing back through the activation into fc1's output) is assigned the same role as slot 0:

# slot 0: fc1 grad_output
QuantizerRole(module_type="linear", tensor_type="grad_output", name=fc1_name),
...
# slot 3: fc2 grad_input (internal boundary) — but labeled identically to slot 0
QuantizerRole(module_type="linear", tensor_type="grad_output", name=fc1_name),

A user factory dispatching on role.name and role.tensor_type has no way to distinguish the actual fc1 grad_output (slot 0, flowing in from upstream) from the internal fc2→fc1 boundary gradient (slot 3). In fine-grained quantization scenarios (e.g. the test_custom_recipe_quantization_targets example), this could cause the wrong quantizer type to be selected for the internal boundary.

Consider using a distinct role for slot 3, such as:

QuantizerRole(module_type="linear", tensor_type="grad_output", name=fc2_name),

or a dedicated tensor_type like "internal_grad" to make the slot semantically distinguishable.


transformer_engine/pytorch/module/layernorm_mlp.py, line 2145
Forward slots 2 and 3 share identical roles

Slot 2 is the output of fc1 (the internal boundary feeding into the activation before fc2) and slot 3 is fc2's actual input quantizer — yet both are assigned:

QuantizerRole(module_type="linear", tensor_type="input", name=fc2_name)

While the docstring notes "Internal boundaries use fixed roles with known consumer identity" (so slot 2's consumer-perspective labelling is intentional), the practical effect is that the factory receives two indistinguishable roles. For example, a factory that applies a special quantization for fc2's input will apply the same quantizer to both the fc1 output boundary and the fc2 GEMM input, which may not match user intent when fine-tuning quantization per-slot.

Consider naming the boundary slot differently, e.g. with a tensor_type of "output" for fc1 and "input" for fc2:

QuantizerRole(module_type="linear", tensor_type="output", name=fc1_name),   # slot 2
QuantizerRole(module_type="linear", tensor_type="input",  name=fc2_name),   # slot 3

transformer_engine/pytorch/quantization.py, line 2609
amax_compute_algo equality check may fail for callable objects

When validating that all DelayedScalingRequest instances in a state share identical parameters, the code uses if v0 != vi for the amax_compute_algo field, which can be a Callable. Python callables define no value-based equality, so != between two distinct callable objects falls back to identity comparison: two functionally identical factories that are nevertheless different objects (e.g. separately created lambdas or functools.partial instances) fail the check. scaling_factor_compute_algo has the same problem.

This could produce a confusing ValueError in valid usage (e.g. two layers created from the same factory function with a fresh closure each time). Consider documenting this limitation prominently, or restricting the equality check to non-callable fields.
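A quick demonstration of the identity-comparison pitfall:

```python
from functools import partial

# Functions and partials define no value-based __eq__, so != falls back
# to object identity: functionally identical callables compare unequal.
f = lambda h: max(h)
g = lambda h: max(h)
assert f != g                        # distinct objects -> unequal
assert partial(max) != partial(max)  # same for functools.partial
assert f == f                        # only the very same object is equal
```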

Signed-off-by: Evgeny <etsykunov@nvidia.com>
@greptile-apps
Contributor

greptile-apps bot commented Mar 4, 2026

Additional Comments (3)

transformer_engine/pytorch/module/layernorm_mlp.py, line 2008
Duplicate forward roles for GEMM1_OUTPUT and GEMM2_INPUT

Slots 2 and 3 in the forward base list are identical: both are QuantizerRole(module_type="linear", tensor_type="input", name=fc2_name). Slot 2 corresponds to GEMM1_OUTPUT (fc1's output tensor) and slot 3 to GEMM2_INPUT (fc2's input tensor). These are assigned the same role, making them indistinguishable to a custom factory.

For LayerNormMLP with num_gemms=2, the forward slot layout is:

Slot 0 → GEMM1_INPUT   (fc1 input)
Slot 1 → GEMM1_WEIGHT  (fc1 weight)
Slot 2 → GEMM1_OUTPUT  (fc1 output / intermediate activation)  ← currently labeled as fc2 input
Slot 3 → GEMM2_INPUT   (fc2 input)                             ← labeled as fc2 input (duplicate)
Slot 4 → GEMM2_WEIGHT  (fc2 weight)
Slot 5 → GEMM2_OUTPUT  (module output)

Slot 2 should distinguish the fc1 output from the fc2 input. If the intent is to label the internal fc1→fc2 boundary from the consumer perspective, slot 2 should at minimum carry a different tensor_type (e.g. "output") or a different name:

                QuantizerRole(module_type="linear", tensor_type="output", name=fc1_name),
                QuantizerRole(module_type="linear", tensor_type="input", name=fc2_name),
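
With the suggested relabelling applied, every forward slot gets a unique dispatch key. A sketch using a stand-in dataclass (the "output" tensor_type for slot 2 is the suggestion above, not current PR behavior):

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class QuantizerRole:  # stand-in for the PR's dataclass
    module_type: str = ""
    tensor_type: str = ""
    name: str = ""

# Forward slot layout with the suggested fix: slot 2 is labeled as fc1's
# output rather than duplicating fc2's input role.
forward_roles = [
    QuantizerRole("linear", "input",  "fc1"),  # slot 0: GEMM1_INPUT
    QuantizerRole("linear", "weight", "fc1"),  # slot 1: GEMM1_WEIGHT
    QuantizerRole("linear", "output", "fc1"),  # slot 2: GEMM1_OUTPUT (fixed)
    QuantizerRole("linear", "input",  "fc2"),  # slot 3: GEMM2_INPUT
    QuantizerRole("linear", "weight", "fc2"),  # slot 4: GEMM2_WEIGHT
    QuantizerRole("linear", "output", "fc2"),  # slot 5: GEMM2_OUTPUT
]
# Every slot now maps to a unique (tensor_type, name) dispatch key.
keys = {(r.tensor_type, r.name) for r in forward_roles}
assert len(keys) == len(forward_roles)
```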

transformer_engine/pytorch/module/layernorm_mlp.py, line 2017
Duplicate backward roles for GEMM1_GRAD_OUTPUT and GEMM2_GRAD_INPUT

Slots 0 and 3 in the backward base list are identical: both are QuantizerRole(module_type="linear", tensor_type="grad_output", name=fc1_name). Slot 0 is GEMM1_GRAD_OUTPUT (the gradient flowing into fc1's backward GEMM) and slot 3 is GEMM2_GRAD_INPUT (the internal boundary — the gradient of the intermediate activation between fc1 and fc2, which conceptually is also fc1's grad_output).

While conceptually the same tensor (both represent the gradient at the fc1↔fc2 boundary), having two slots with identical roles means a factory cannot distinguish GEMM1_GRAD_OUTPUT from GEMM2_GRAD_INPUT. A factory doing fine-grained slot-level dispatch on (tensor_type, name) will receive two calls with the same arguments.

The slot at index 3 should ideally carry a distinct role. If this is intentional (both slots logically represent the same tensor), a comment to that effect would help future maintainers understand why the duplication is deliberate rather than a copy-paste error.


transformer_engine/pytorch/custom_recipes/quantization_recipes_base.py, line 74
Misleading inline comments invert the parameter semantics

Both inline comments describe the opposite effect of the value they annotate:

  • force_pow_2_scales=False, # constrain scale to powers of 2 — Setting this to False explicitly does not constrain scales to powers of 2. The comment implies it does.
  • amax_epsilon=0.0, # clamp amax from below to avoid div-by-zero — An epsilon of 0.0 provides no lower-bound clamping and does nothing to prevent division by zero. A positive epsilon (e.g. 1e-12) would serve that purpose.

The same misleading comment appears on amax_epsilon=0.0 in float8_block_scaling_quantizer_factory at line 117.

        force_pow_2_scales=False,  # allow non-power-of-2 scales (matches Float8CurrentScaling defaults)
        amax_epsilon=0.0,  # no lower-bound clamping on amax (matches Float8CurrentScaling defaults)
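
To see why amax_epsilon=0.0 offers no div-by-zero protection, a toy model of the clamp (the exact formula in TE may differ; this assumes amax is clamped as max(amax, amax_epsilon) before computing scale = fp8_max / amax):

```python
# Illustrative only: hypothetical clamp-then-scale, not TE's actual code.
def compute_scale(amax, amax_epsilon, fp8_max=448.0):
    amax = max(amax, amax_epsilon)
    # With amax_epsilon=0.0 a zero amax survives the clamp and would
    # divide by zero, forcing a fallback path.
    return fp8_max / amax if amax > 0.0 else 1.0

assert compute_scale(0.0, 0.0) == 1.0          # epsilon=0.0 clamped nothing
assert compute_scale(0.0, 1e-12) == 448.0 / 1e-12  # positive epsilon avoids the fallback
```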

negvet and others added 4 commits March 5, 2026 16:20
Signed-off-by: Evgeny <etsykunov@nvidia.com>
Signed-off-by: Evgeny <etsykunov@nvidia.com>
@negvet
Collaborator Author

negvet commented Mar 5, 2026

/te-ci pytorch L0

@negvet negvet requested a review from ptrendx March 5, 2026 18:10
negvet and others added 5 commits March 6, 2026 11:37
Signed-off-by: Evgeny <etsykunov@nvidia.com>
Signed-off-by: Evgeny <etsykunov@nvidia.com>
Signed-off-by: Evgeny <etsykunov@nvidia.com>