
🚀 LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity

CVPR 2025


This repository contains an unofficial implementation of the MATE block from the paper:

LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
Hongjie Wang, Chih-Yao Ma, Yen-Cheng Liu, Ji Hou, Tao Xu, Jialiang Wang, Felix Juefei-Xu, Yaqiao Luo, Peizhao Zhang, Tingbo Hou, Peter Vajda, Niraj K. Jha, Xiaoliang Dai
CVPR 2025


🎯 About This Repository

This is an unofficial implementation of the MATE (MA-branch + TE-branch) block described in the LinGen paper, built on top of the PixArt codebase. The implementation enables linear computational complexity for text-to-video generation by replacing the quadratic-complexity self-attention with the proposed MATE block.

Key Features

  • 🔧 MATE Block Implementation: Custom implementation of the MA-branch and the TE-branch
  • 📹 Video Support: Extended PixArt architecture to handle video data
  • ⚡ Linear Complexity: Replaces the quadratic-complexity self-attention with the linear-complexity MATE block
  • 🎨 Based on PixArt: Built upon the Diffusion Transformer architecture of PixArt

🏗️ Architecture Overview

The MATE block consists of two main components:

MA-Branch

  • Bidirectional Mamba2 block for short-to-long-range token correlations
  • Rotary Major Scan (RMS) for token rearrangement at almost no extra cost
  • Review tokens for enhanced long video generation
  • Implementation located in: mamba_blocks/ directory
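The rearrangement idea behind RMS can be illustrated with a toy scan-order function. This is a simplified sketch of axis-major flattening, not the paper's exact rotation schedule; the function name and the `t*H*W + h*W + w` indexing convention are assumptions:

```python
# Toy illustration in the spirit of the Rotary Major Scan: successive
# blocks flatten the (T, H, W) token grid with a different "major" axis
# order, so the 1-D Mamba scan sees different neighbourhoods each time.
# The rearrangement is a pure index permutation, hence nearly free.
# This is NOT the paper's exact scheme.

def rms_order(T, H, W, major):
    """Return a flattened scan order over a (T, H, W) token grid.
    Tokens are indexed t*H*W + h*W + w; `major` is a permutation of
    "thw" giving the loop nesting (outermost axis first)."""
    sizes = {"t": T, "h": H, "w": W}
    order = []
    for a in range(sizes[major[0]]):
        for b in range(sizes[major[1]]):
            for c in range(sizes[major[2]]):
                pos = dict(zip(major, (a, b, c)))
                order.append(pos["t"] * H * W + pos["h"] * W + pos["w"])
    return order

# A block would gather tokens as x[order], run the bidirectional Mamba2
# scan, then scatter back with the inverse permutation.
```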

TE-Branch

  • Temporal Swin Attention block for spatially adjacent and temporally medium-range correlations
  • Addresses Mamba's adjacency-preservation issue: flattening video tokens into a 1-D scan places spatially adjacent tokens far apart in the sequence
  • Implementation located in: temporal_swin_attn.py
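The windowing mechanism that keeps the TE-branch cheap can be sketched as follows. The window shape and helper name below are illustrative assumptions, and the attention computation itself is omitted:

```python
# Sketch of non-overlapping 3-D window partitioning, the mechanism that
# keeps windowed attention linear overall: attention is quadratic only
# within a fixed-size window, so total cost grows linearly with the
# number of tokens. Window shape (wt x ws x ws) is an assumption.

def partition_windows(T, H, W, wt, ws):
    """Split token indices (t*H*W + h*W + w) of a (T, H, W) grid into
    non-overlapping (wt, ws, ws) windows. Assumes each axis is divisible
    by its window size."""
    assert T % wt == 0 and H % ws == 0 and W % ws == 0
    windows = []
    for t0 in range(0, T, wt):
        for h0 in range(0, H, ws):
            for w0 in range(0, W, ws):
                win = [t * H * W + h * W + w
                       for t in range(t0, t0 + wt)
                       for h in range(h0, h0 + ws)
                       for w in range(w0, w0 + ws)]
                windows.append(win)
    return windows
```

Because every window holds a fixed number of tokens, per-window attention cost is constant, and the number of windows grows linearly with video length.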

📂 Repository Structure

├── PixArt/
│   └── PixArtMS.py          # Modified PixArt with MATE block integration & video support
├── mamba_blocks/            # MA-branch implementations  
├── temporal_swin_attn.py    # TE-branch implementation
└── README.md               # This file

Core Modifications

  1. PixArt/PixArtMS.py:

    • Added option to replace standard self-attention with MATE block
    • Extended to support video data
    • Maintains compatibility with the original PixArt architecture
  2. mamba_blocks/:

    • Contains implementations of the MA-branch components
    • Includes bidirectional Mamba2, Rotary Major Scan, and review tokens
  3. temporal_swin_attn.py:

    • Implements the TE-branch Temporal Swin Attention mechanism
    • Handles temporal correlations and spatial adjacency

🚀 Getting Started

Dependencies
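
This section is empty in the README; a typical environment for a Mamba2-based PyTorch codebase would look roughly like the following. The package list is an assumption — verify against the actual imports in the code before installing:

```shell
# Assumed environment for a Mamba2 + PixArt-style codebase (assumption,
# not confirmed by this repository's README).
pip install torch timm diffusers
pip install mamba-ssm causal-conv1d   # Mamba2 CUDA kernels
```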

Usage

The MATE block can be enabled in the PixArt architecture by modifying the configuration in PixArt/PixArtMS.py. The implementation supports both image and video generation tasks with linear computational complexity.


📊 Key Benefits

  • Up to 15x FLOPs and 11.5x Latency Reduction: significant speedup over standard Diffusion Transformers, as reported in the paper
  • Linear Scaling: Computational cost scales linearly with number of pixels in the generated videos
  • Minute-Length Videos: Enables generation of long videos without compromising quality
  • Single GPU Inference: High-resolution minute-length video generation on a single GPU
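
The linear-scaling claim can be made concrete with a back-of-the-envelope cost model. The patch size and constant factor below are placeholders, not values from the paper:

```python
# Why linear complexity matters for minute-length video: the token count
# grows with frames x height x width, so quadratic self-attention blows
# up while a linear-cost block (like MATE) scales proportionally.
# `patch` and `c` are hypothetical constants, not measured figures.

def num_tokens(frames, height, width, patch=2):
    """Latent tokens for a video, assuming patchified latents (assumption)."""
    return frames * (height // patch) * (width // patch)

def attn_cost(n):          # self-attention: O(n^2) in the token count
    return n * n

def mate_cost(n, c=256):   # linear-complexity block: O(n), c hypothetical
    return n * c
```

Going from 16 to 256 frames multiplies the token count by 16x, so self-attention cost grows 256x while the linear-cost block grows only 16x.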

⚠️ Important Notes

  • This is an unofficial implementation based on the paper description
  • Built on the PixArt codebase which uses the standard Diffusion Transformer architecture

📝 Citation

If you use this implementation in your research, please cite the original LinGen paper:

@inproceedings{wang2025lingen,
  title={LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity},
  author={Wang, Hongjie and Ma, Chih-Yao and Liu, Yen-Cheng and Hou, Ji and Xu, Tao and Wang, Jialiang and Juefei-Xu, Felix and Luo, Yaqiao and Zhang, Peizhao and Hou, Tingbo and Vajda, Peter and Jha, Niraj K. and Dai, Xiaoliang},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={2578--2588},
  year={2025}
}

🙏 Acknowledgments

This implementation builds on the PixArt codebase. Credit for the MATE block design goes to the authors of the LinGen paper.

📧 Contact

For questions about this implementation, please open an issue in this repository.
For questions about the original LinGen research, please refer to the official project page.