
Replace custom SDPA with F.scaled_dot_product_attention#538

Open
stashuk-olek wants to merge 4 commits into facebookresearch:main from stashuk-olek:export-D92927084

Conversation

@stashuk-olek

Summary:
Replace the manual scaled dot-product attention implementation (matmul -> scale -> mask -> softmax -> dropout -> matmul) with PyTorch's `F.scaled_dot_product_attention` using the MATH backend.

For more context on the unified attention API, see https://docs.google.com/document/d/1XCZkhLtBNXGhxoYZ2-47XVvH7bQDOzPXBNZyL7Q7Fqo/edit?tab=t.0#heading=h.gzjhznhk1ejy

Two numerical equivalence tests are added to verify that the new implementation matches the manual computation path with `torch.allclose`.

Reviewed By: OmarPavel

Differential Revision: D92927084

@meta-codesync

meta-codesync bot commented Feb 13, 2026

@stashuk-olek has exported this pull request. If you are a Meta employee, you can view the originating Diff in D92927084.

The meta-cla bot added the "CLA Signed" label on Feb 13, 2026. (This label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed.)
stashuk-olek added a commit to stashuk-olek/multimodal that referenced this pull request Feb 13, 2026
…arch#538)

stashuk-olek added a commit to stashuk-olek/multimodal that referenced this pull request Feb 13, 2026
…arch#538)

… weights in FLAVA (facebookresearch#535)

Summary:

The `attentions` field on `TransformerOutput` and the `return_attn_weights`/`head_mask` parameters in the FLAVA encoder stack were never used by any consumer.

This diff cleans them up. The longer-term intent is to simplify attention usage and move to a common attention API.

Reviewed By: OmarPavel

Differential Revision: D92927086
…h#536)

Summary:

Remove dead `head_mask`, `return_attn_weights`, and `attention_weights` from the VideoGPT stack. These features were never used by any consumer — `head_mask` was always `None` or all-ones, and `return_attn_weights` was always `False` except in tests that verified the feature itself.

Reviewed By: OmarPavel

Differential Revision: D92927089
Summary:

After removing all consumers of `head_mask`, `return_attn_weights`, and `attn_probs` in the previous commits, the core attention module can be simplified. This commit:
- Removes `head_mask` param from `scaled_dot_product_attention` and `SelfAttention.forward`
- Changes return types from `Tuple[Tensor, Tensor]` to `Tensor` (no longer returning attention probabilities)
- Removes `return_attn_weights` param and tuple unpacking logic from `MultiHeadAttention.forward`
- Cleans up unused imports (`Tuple`, `Union`)

No behavioral change — the attention computation itself is unchanged.

Reviewed By: OmarPavel

Differential Revision: D92927085