Use Protocols to type-check linear_proj submodules of Attention by nschank · Pull Request #3434 · NVIDIA/Megatron-LM

nschank · 2026-02-15T16:42:02Z

What does this PR do ?

Defines Protocols representing linear_proj submodules, and uses them instead of ModuleSpec to enable typechecking of its construction in SelfAttention, CrossAttention, and MLA.

I also updated Backend to return linear_proj specifically, allowing type-checking of RowParallelLinear types as instances of linear_proj directly (otherwise Backend "hides" the type and makes no type-checking occur).

While I was in attention, I also updated the naming conventions of the existing interfaces to match what we've finalized on.

Associated design doc: Typed ModuleSpec.pdf

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code Typing guidelines
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label `Expert Review`

(Step 2): Collect the expert reviewers reviews

Attach the Expert Review label when your PR is ready for review.
GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge-conflict are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

Add Final Review label
GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

copy-pr-bot · 2026-02-15T16:42:05Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Phlip79 · 2026-02-17T01:05:07Z

/ok to test 9db13d6

megatron/core/models/gpt/gpt_layer_specs.py

nschank · 2026-03-07T20:38:20Z

Resynced after coming back from travel, sorry for delay!

yashaswikarnati

synced offline, just had a minor comment,overall lgtm!

jaredcasper · 2026-03-17T23:53:49Z

I also updated Backend to return linear_proj specifically, allowing type-checking of RowParallelLinear types as instances of linear_proj directly (otherwise Backend "hides" the type and makes no type-checking occur).

Can you expand on this a bit? I'm guessing this is adding the row_parallel_linear_proj() function in addition to the "row_parallel_linear()" function? Don't those have the same inputs/outputs so same types? Why the need for a special one for "_proj"?

megatron/core/transformer/attention.py

nschank · 2026-03-19T12:00:22Z

@jaredcasper Sure! Fair criticism, this is sorta in a partial state so maybe I should update with a TODO for clarity or something. I'm trying to solve the following problem:

backend: BackendSpecProvider = ...
submodules = SelfAttentionSubmodules(..., linear_proj=backend.get_type(), ...)

SelfAttentionSubmodules.linear_proj has a specific interface it wants to require - it knows the exact signature that a LinearProjBuilder is supposed to satisfy, and same for the LinearProjInterface it must return. So whenever you provide something via linear_proj=, the type checker is given the opportunity to check that the interface actually matches.

It can only do so if the thing being passed to linear_proj= actually has a type which can be tested against that interface. This is true of specific classes (so if I pass something of type type[RowParallelLinear]), unions of classes, Callables, functools.partial, etc.

But the return type of BackendSpecProvider.row_parallel_linear() is just type. type is basically equivalent to 🤷 as far as the type-checker is concerned, so doing linear_proj=backend.row_parallel_linear() will not catch a type error. Individual subclasses of BackendSpecProvider can provide a narrower return type for row_parallel_linear, which helps somewhat (if callers are using a subclass directly), but any time a caller is using something which the type-checker only knows is a BackendSpecProvider (but not which kind) then it will not type-check row_parallel_linear.

I don't have a great Protocol to use here for what generically a method named row_parallel_linear() should actually return - there are at least two distinct Protocols that row_parallel_linear() needs to satisfy (LinearProjBuilder and LinearFc2Builder), and it's not entirely obvious those two things are required to have identical interfaces. The ideal world would be if I could just say the return type is LinearProjBuilder & LinearFc2Builder (i.e. it must satisfy both at once) but Python doesn't support that.

Thus, my proposed solution here is effectively to have BackendSpecProvider offer individual methods for each particular Builder protocol that we end up introducing. If we later merge LinearProjBuilder and LinearFc2Builder into a single LinearLayerBuilder then both column_parallel_linear and row_parallel_linear could use it; but in the meantime I think we should have row_parallel_linear_proj (returning LinearProjBuilder) and I will rename row_parallel_linear() to row_parallel_linear_fc2() -> LinearFc2Builder. This basically means BackendSpecProvider might return the same class from multiple separate methods, but each one is enforcing that that class satisfies a different interface.

jaredcasper · 2026-03-24T22:39:02Z

it's not entirely obvious those two things are required to have identical interfaces.

Why not? Both are row parallel linear layers. Do we have cases where the fc2 row_parallel_linear has a different interface than the linear_proj row_parallel_linear? If so that should be fixed. I don't want a backend to say "this is what I want you to use for linear proj" and "this is what I want you to use for fc2"... That's not a backend, that's just an additional layer to specs (which is already too confusing as it is). I want a backend to say "when you need a row_parallel_linear layer, wherever it is, this is the thing to use."

nschank · 2026-03-24T23:45:34Z

Why not? Both are row parallel linear layers. Do we have cases where the fc2 row_parallel_linear has a different interface than the linear_proj row_parallel_linear?

This is fair! I was trying to be less restrictive than that, and let the two APIs evolve independently. I don't really have an opinion between the two - if we want to basically merge the Linear protocols into one, and enforce that for all callers, that would be totally fine with me.

I would personally recommend letting me get this in first, and then I can merge them as an immediate followup - merging the interfaces for just row_linear will be a nontrivial task, and similarly it will be a bit of work to do the same for column_linear. So if you're fine with the somewhat gross intermediate step, I can follow up. But if not, I'm happy to spend some time getting together the full thing. Which makes more sense to you? @jaredcasper

jaredcasper · 2026-03-30T18:34:56Z

I think it makes sense to have the backend define types for general layer types, not specific layers (i.e. define "row parallel linear" instead of "type specifically for the linear proj", otherwise, as I said, it's doing the same thing as the spec in general, just hidden behind yet another layer of abstraction. Putting this in the meantime adds this API to the backend that would then need to be changed again. Let's just go straight to row_parallel_linear().

nschank · 2026-03-31T01:01:35Z

SG, will get back in a day or so with the update!

nschank · 2026-04-01T02:00:45Z

@jaredcasper I realized the relevant work is actually somewhat independent so am opening a separate PR for it here: #4087 - I simply reverted the Backend changes here so now we're just focusing on linear_proj, and the issue I noted with type-checking will be fixed by that PR.

Phlip79 · 2026-04-03T16:43:01Z

/ok to test 6585350

nschank requested review from a team as code owners February 15, 2026 16:42

ko3n1g requested a review from a team February 15, 2026 16:42

github-actions bot added the community-request label Feb 15, 2026

Phlip79 added Expert Review [deprecated] Apply this label to indicate that your PR is ready for expert review. complexity: medium labels Feb 17, 2026

copy-pr-bot bot temporarily deployed to nemo-ci February 17, 2026 01:05 Inactive

ko3n1g added this to the Core 0.16 milestone Feb 17, 2026

copy-pr-bot bot temporarily deployed to nemo-ci February 17, 2026 01:05 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci February 17, 2026 01:05 Failure

copy-pr-bot bot temporarily deployed to test February 17, 2026 01:06 Inactive

yashaswikarnati reviewed Feb 20, 2026

View reviewed changes

megatron/core/models/gpt/gpt_layer_specs.py Outdated Show resolved Hide resolved

chtruong814 added the needs-follow-up Issue needs follow-up label Mar 2, 2026

nschank force-pushed the linearproj branch from 9db13d6 to 479b369 Compare March 7, 2026 20:35

chtruong814 added needs-follow-up Issue needs follow-up and removed needs-follow-up Issue needs follow-up labels Mar 7, 2026

yashaswikarnati approved these changes Mar 13, 2026

View reviewed changes

nschank force-pushed the linearproj branch from 479b369 to 3f8feac Compare March 14, 2026 17:51

chtruong814 added the needs-follow-up Issue needs follow-up label Mar 14, 2026

santhnm2 reviewed Mar 18, 2026

View reviewed changes

megatron/core/transformer/attention.py Outdated Show resolved Hide resolved

nschank force-pushed the linearproj branch from 3f8feac to 208d564 Compare March 19, 2026 12:01

svcnvidia-nemo-ci added the Final Review PR is in the "final review" stage label Mar 19, 2026

shanmugamr1992 approved these changes Mar 19, 2026

View reviewed changes

chtruong814 removed the needs-follow-up Issue needs follow-up label Mar 19, 2026

santhnm2 approved these changes Mar 19, 2026

View reviewed changes

chtruong814 added the needs-follow-up Issue needs follow-up label Mar 22, 2026

Use Protocols to type-check linear_proj submodules of Attention

3248471

nschank force-pushed the linearproj branch from 09fcf45 to 3248471 Compare April 1, 2026 01:40

nschank mentioned this pull request Apr 1, 2026

Refactor BackendSpecProvider to use Protocols to define the types it returns #4087

Open

5 tasks

Remove accidental extra default arg

6585350

Phlip79 removed the Expert Review [deprecated] Apply this label to indicate that your PR is ready for expert review. label Apr 3, 2026

copy-pr-bot bot requested a deployment to test April 3, 2026 16:43 Waiting

Conversation

nschank commented Feb 15, 2026

What does this PR do ?

Contribution process

Pre-checks

Code review

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers reviews

(Step 3): Final Review

(Optional Step 4): Cherry-pick into release branch

Merging your PR

Uh oh!

copy-pr-bot bot commented Feb 15, 2026

Uh oh!

Phlip79 commented Feb 17, 2026

Uh oh!

Uh oh!

nschank commented Mar 7, 2026

Uh oh!

yashaswikarnati left a comment

Choose a reason for hiding this comment

Uh oh!

jaredcasper commented Mar 17, 2026

Uh oh!

Uh oh!

nschank commented Mar 19, 2026

Uh oh!

jaredcasper commented Mar 24, 2026

Uh oh!

nschank commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jaredcasper commented Mar 30, 2026

Uh oh!

nschank commented Mar 31, 2026

Uh oh!

nschank commented Apr 1, 2026

Uh oh!

Phlip79 commented Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

(Step 1): Add PR label `Expert Review`

nschank commented Mar 24, 2026 •

edited

Loading