
Replace custom SDPA with F.scaled_dot_product_attention#538

Open
stashuk-olek wants to merge 4 commits into facebookresearch:main from stashuk-olek:export-D92927084

Conversation

@stashuk-olek

Summary:
Replace the manual scaled dot-product attention implementation (matmul -> scale -> mask -> softmax -> dropout -> matmul) with PyTorch's `F.scaled_dot_product_attention` using the MATH backend.

For more context on the unified attention API, see https://docs.google.com/document/d/1XCZkhLtBNXGhxoYZ2-47XVvH7bQDOzPXBNZyL7Q7Fqo/edit?tab=t.0#heading=h.gzjhznhk1ejy

Two numerical equivalence tests are added to verify that the new implementation matches the manual computation path with `torch.allclose`.

Reviewed By: OmarPavel

Differential Revision: D92927084

@meta-codesync

meta-codesync bot commented Feb 13, 2026

@stashuk-olek has exported this pull request. If you are a Meta employee, you can view the originating Diff in D92927084.

The meta-cla bot added the "CLA Signed" label on Feb 13, 2026. (This label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed.)
stashuk-olek added a commit to stashuk-olek/multimodal that referenced this pull request Feb 13, 2026
…arch#538)

stashuk-olek added a commit to stashuk-olek/multimodal that referenced this pull request Feb 13, 2026
…arch#538)

… weights in FLAVA (facebookresearch#535)

Summary:

The `attentions` field on `TransformerOutput` and the `return_attn_weights`/`head_mask` parameters in the FLAVA encoder stack were never used by any consumer.

This diff cleans them up. The longer-term intent is to simplify attention usage and move to a common attention API.

Reviewed By: OmarPavel

Differential Revision: D92927086
…h#536)

Summary:

Remove dead `head_mask`, `return_attn_weights`, and `attention_weights` from the VideoGPT stack. These features were never used by any consumer — `head_mask` was always `None` or all-ones, and `return_attn_weights` was always `False` except in tests that verified the feature itself.

Reviewed By: OmarPavel

Differential Revision: D92927089
Summary:

After removing all consumers of `head_mask`, `return_attn_weights`, and `attn_probs` in the previous commits, the core attention module can be simplified. This commit:
- Removes `head_mask` param from `scaled_dot_product_attention` and `SelfAttention.forward`
- Changes return types from `Tuple[Tensor, Tensor]` to `Tensor` (no longer returning attention probabilities)
- Removes `return_attn_weights` param and tuple unpacking logic from `MultiHeadAttention.forward`
- Cleans up unused imports (`Tuple`, `Union`)

No behavioral change — the attention computation itself is unchanged.

Reviewed By: OmarPavel

Differential Revision: D92927085