[feat] LongSANA: a minute-length real-time video generation model #12723
base: main
Conversation
1. Add `SanaVideoCausalTransformerBlock` and `SanaVideoCausalTransformer3DModel`.
2. Add `LongSanaVideoPipeline` for Linear Attention KV-Cache.
3. Support converting LongSANA from `.pth` to diffusers safetensors.
Co-authored-by: Yuyang Zhao <43061147+HeliosZhao@users.noreply.github.com>
We can actually leverage our attention backends.
Is KV cache supported in any backends?
Gentle ping @dg845
Hi @lawrence-cj, is the …
>     return hidden_states
>
> class CachedGLUMBConvTemp(nn.Module):
Are the additions of the LongSANA modeling blocks (CachedGLUMBConvTemp, SanaCausalLinearAttnProcessor1_0, SanaVideoCausalTransformerBlock, and SanaVideoCausalTransformer3DModel) to transformer_sana_video.py intended? It looks like all of these blocks are also defined in transformer_sana_video_causal.py and it doesn't look like any of the previous Sana Video models are being modified to use these blocks.
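For context on what a block like `CachedGLUMBConvTemp` does: a causal temporal convolution can run on a stream of chunks if it carries the last `kernel - 1` frames between calls. The NumPy sketch below illustrates that idea only; the function name, depthwise form, and shapes are assumptions, not the PR's actual implementation.

```python
import numpy as np

def cached_causal_conv1d(x, weight, state=None):
    """Causal temporal 1-D conv over frames with a carried cache.

    x: (frames, channels) activations for this chunk.
    weight: (kernel, channels) depthwise taps (kernel >= 2 assumed).
    state: the last (kernel - 1) frames of the previous chunk, so the conv
    sees a continuous stream across chunks instead of re-padding with zeros.
    Returns (out, new_state).
    """
    k = weight.shape[0]
    # First chunk: causal zero-padding; later chunks: cached frames.
    pad = state if state is not None else np.zeros((k - 1, x.shape[1]))
    xp = np.concatenate([pad, x], axis=0)
    # Each output frame i only sees frames <= i (strictly causal).
    out = np.stack([(xp[i:i + k] * weight).sum(axis=0) for i in range(x.shape[0])])
    return out, xp[-(k - 1):]  # new cache: trailing (kernel - 1) frames
```

Processing a sequence in two chunks with the carried state gives bitwise the same result as processing it in one pass, which is the property the cached blocks rely on.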
Oh, I must have forgotten to delete them.
There is a …
Hi @lawrence-cj, I don't think I can access it unless I'm specifically given permission (for example, via a read access token).
yiyixuxu left a comment
I cleaned up & refactored the KV cache implementation in this commit: 8b177ff
Please take a look and feel free to cherry-pick if it works for you. Main changes are:
- Created `SanaBlockKvCache` to replace `[None, None, None]`. It is just more readable, and we know exactly what we are caching.
- Created a `LongSanaKvCache` class for the pipeline, to abstract away the logic of initializing and accumulating kv_caches across different chunks inside the pipeline.
- `kv_cache` is always passed/returned (it can be `None` when not used). This simplifies the code a bit, so that our input/output format is consistent.
- We have an `enable_save` flag on the cache class, so we don't need to pass a `save_kv_cache` argument through every layer; we just enable/disable it inside the pipeline.
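To make the described refactor concrete, here is a minimal sketch of what such cache classes could look like. The field names and the exact cached tensors are assumptions based on the comment above, not the code in the referenced commit.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class SanaBlockKvCache:
    # Named slots replacing the opaque [None, None, None]; field names
    # are hypothetical, but make explicit what each block caches.
    attn_kv: Optional[Any] = None      # accumulated K^T V state of linear attention
    attn_k_sum: Optional[Any] = None   # accumulated key sum (the normalizer)
    conv_state: Optional[Any] = None   # temporal state of the Mix-FFN conv


class LongSanaKvCache:
    """Pipeline-level wrapper: one SanaBlockKvCache per transformer block,
    plus a single enable_save switch, so no save_kv_cache argument has to
    be threaded through every layer."""

    def __init__(self, num_blocks: int):
        self.blocks: List[SanaBlockKvCache] = [
            SanaBlockKvCache() for _ in range(num_blocks)
        ]
        self.enable_save = False
```

The pipeline would flip `enable_save` once before a chunk that should populate the cache, rather than passing a flag down through every layer call.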
Additionally, do you think we should create a custom scheduler for SANA long video? There's quite a bit of logic in the pipeline that I think belongs in a scheduler instead.
This PR adds support for LongSANA, a minute-length real-time video generation model.
Related links:
- Project: https://nvlabs.github.io/Sana/Video
- Code: https://github.com/NVlabs/Sana
- Paper: https://arxiv.org/pdf/2509.24695
PR features:
LongSANA uses a Causal Linear Attention KV cache during inference, which is crucial for long video generation (FlashAttention may need another PR). This PR adds the causal computation logic for both Linear Attention and Mix-FFN (the conv in the MLP).
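The reason a KV cache works so well here is that causal linear attention has a constant-size recurrent state: instead of storing all past keys and values, each block only needs the running sums of phi(k)^T v and phi(k). The NumPy sketch below illustrates this property; the feature map, shapes, and function names are assumptions for illustration, not the PR's actual kernel.

```python
import numpy as np

def elu_plus_one(x):
    # A common positive feature map for linear attention (an assumption
    # here; LongSANA's actual feature map may differ).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def causal_linear_attention_chunk(q, k, v, state=None):
    """Process one chunk of a long sequence causally.

    q, k, v: (seq, dim) arrays for this chunk.
    state: (S, z) carried from previous chunks, where S accumulates
    phi(k)^T v (dim x dim) and z accumulates phi(k) (dim,).
    Returns (out, new_state). Caching only (S, z) keeps per-step cost
    O(1) in the number of past frames, which is what makes
    minute-length generation feasible.
    """
    dim = q.shape[1]
    S, z = state if state is not None else (np.zeros((dim, dim)), np.zeros(dim))
    phi_q, phi_k = elu_plus_one(q), elu_plus_one(k)
    out = np.empty_like(v)
    for i in range(q.shape[0]):          # strictly causal within the chunk
        S = S + np.outer(phi_k[i], v[i])  # accumulate phi(k)^T v
        z = z + phi_k[i]                  # accumulate the normalizer
        out[i] = (phi_q[i] @ S) / (phi_q[i] @ z + 1e-6)
    return out, (S, z)
```

Running two chunks with the carried `(S, z)` state reproduces the output of running the full sequence at once, which is exactly the invariant the pipeline's per-chunk KV cache maintains.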
Added classes and functions:
- `SanaVideoCausalTransformerBlock` and `SanaVideoCausalTransformer3DModel`
- `LongSanaVideoPipeline` for Linear Attention KV-Cache

Cc: @sayakpaul @dg845
Co-author: @HeliosZhao
Code snippet: