
🚀 LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity

CVPR 2025


This repository contains an unofficial implementation of the MATE block from the paper:

LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
Hongjie Wang, Chih-Yao Ma, Yen-Cheng Liu, Ji Hou, Tao Xu, Jialiang Wang, Felix Juefei-Xu, Yaqiao Luo, Peizhao Zhang, Tingbo Hou, Peter Vajda, Niraj K. Jha, Xiaoliang Dai
CVPR 2025


🎯 About This Repository

This is an unofficial implementation of the MATE (MA-branch + TE-branch) block described in the LinGen paper, built on top of the PixArt codebase. The implementation enables linear computational complexity for text-to-video generation by replacing the quadratic-complexity self-attention with the proposed MATE block.

Key Features

  • 🔧 MATE Block Implementation: Custom implementation of the MA-branch and the TE-branch
  • 📹 Video Support: Extended PixArt architecture to handle video data
  • ⚡ Linear Complexity: Replaces the quadratic-complexity self-attention with the linear-complexity MATE block
  • 🎨 Based on PixArt: Built upon the Diffusion Transformer architecture of PixArt

🏗️ Architecture Overview

The MATE block consists of two main components:

MA-Branch

  • Bidirectional Mamba2 block for short-to-long-range token correlations
  • Rotary Major Scan (RMS) for token rearrangement at almost no extra cost
  • Review tokens for enhanced long video generation
  • Implementation located in: mamba_blocks/ directory
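The rearrangement idea behind RMS can be illustrated with a toy scan-order function. This is a simplified sketch of axis-major flattening, not the paper's exact rotation schedule; the function name and the `t*H*W + h*W + w` indexing convention are assumptions:

```python
# Toy illustration in the spirit of the Rotary Major Scan: successive
# blocks flatten the (T, H, W) token grid with a different "major" axis
# order, so the 1-D Mamba scan sees different neighbourhoods each time.
# The rearrangement is a pure index permutation, hence nearly free.
# This is NOT the paper's exact scheme.

def rms_order(T, H, W, major):
    """Return a flattened scan order over a (T, H, W) token grid.
    Tokens are indexed t*H*W + h*W + w; `major` is a permutation of
    "thw" giving the loop nesting (outermost axis first)."""
    sizes = {"t": T, "h": H, "w": W}
    order = []
    for a in range(sizes[major[0]]):
        for b in range(sizes[major[1]]):
            for c in range(sizes[major[2]]):
                pos = dict(zip(major, (a, b, c)))
                order.append(pos["t"] * H * W + pos["h"] * W + pos["w"])
    return order

# A block would gather tokens as x[order], run the bidirectional Mamba2
# scan, then scatter back with the inverse permutation.
```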

TE-Branch

  • Temporal Swin Attention block for spatially adjacent and temporally medium-range correlations
  • Addresses Mamba's adjacency-preservation issue: flattening video tokens into a 1-D scan places spatially adjacent tokens far apart in the sequence
  • Implementation located in: temporal_swin_attn.py
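The windowing mechanism that keeps the TE-branch cheap can be sketched as follows. The window shape and helper name below are illustrative assumptions, and the attention computation itself is omitted:

```python
# Sketch of non-overlapping 3-D window partitioning, the mechanism that
# keeps windowed attention linear overall: attention is quadratic only
# within a fixed-size window, so total cost grows linearly with the
# number of tokens. Window shape (wt x ws x ws) is an assumption.

def partition_windows(T, H, W, wt, ws):
    """Split token indices (t*H*W + h*W + w) of a (T, H, W) grid into
    non-overlapping (wt, ws, ws) windows. Assumes each axis is divisible
    by its window size."""
    assert T % wt == 0 and H % ws == 0 and W % ws == 0
    windows = []
    for t0 in range(0, T, wt):
        for h0 in range(0, H, ws):
            for w0 in range(0, W, ws):
                win = [t * H * W + h * W + w
                       for t in range(t0, t0 + wt)
                       for h in range(h0, h0 + ws)
                       for w in range(w0, w0 + ws)]
                windows.append(win)
    return windows
```

Because every window holds a fixed number of tokens, per-window attention cost is constant, and the number of windows grows linearly with video length.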

📂 Repository Structure

├── PixArt/
│   └── PixArtMS.py          # Modified PixArt with MATE block integration & video support
├── mamba_blocks/            # MA-branch implementations  
├── temporal_swin_attn.py    # TE-branch implementation
└── README.md               # This file

Core Modifications

  1. PixArt/PixArtMS.py:

    • Added option to replace standard self-attention with MATE block
    • Extended to support video data
    • Maintains compatibility with the original PixArt architecture
  2. mamba_blocks/:

    • Contains implementations of the MA-branch components
    • Includes bidirectional Mamba2, Rotary Major Scan, and review tokens
  3. temporal_swin_attn.py:

    • Implements the TE-branch Temporal Swin Attention mechanism
    • Handles temporal correlations and spatial adjacency

🚀 Getting Started

Dependencies
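
This section is empty in the README; a typical environment for a Mamba2-based PyTorch codebase would look roughly like the following. The package list is an assumption — verify against the actual imports in the code before installing:

```shell
# Assumed environment for a Mamba2 + PixArt-style codebase (assumption,
# not confirmed by this repository's README).
pip install torch timm diffusers
pip install mamba-ssm causal-conv1d   # Mamba2 CUDA kernels
```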

Usage

The MATE block can be enabled in the PixArt architecture by modifying the configuration in PixArt/PixArtMS.py. The implementation supports both image and video generation tasks with linear computational complexity.


📊 Key Benefits

  • Up to 15x FLOPs and 11.5x Latency Reduction: significant speedup over standard Diffusion Transformers, as reported in the paper
  • Linear Scaling: Computational cost scales linearly with number of pixels in the generated videos
  • Minute-Length Videos: Enables generation of long videos without compromising quality
  • Single GPU Inference: High-resolution minute-length video generation on a single GPU
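
The linear-scaling claim can be made concrete with a back-of-the-envelope cost model. The patch size and constant factor below are placeholders, not values from the paper:

```python
# Why linear complexity matters for minute-length video: the token count
# grows with frames x height x width, so quadratic self-attention blows
# up while a linear-cost block (like MATE) scales proportionally.
# `patch` and `c` are hypothetical constants, not measured figures.

def num_tokens(frames, height, width, patch=2):
    """Latent tokens for a video, assuming patchified latents (assumption)."""
    return frames * (height // patch) * (width // patch)

def attn_cost(n):          # self-attention: O(n^2) in the token count
    return n * n

def mate_cost(n, c=256):   # linear-complexity block: O(n), c hypothetical
    return n * c
```

Going from 16 to 256 frames multiplies the token count by 16x, so self-attention cost grows 256x while the linear-cost block grows only 16x.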

⚠️ Important Notes

  • This is an unofficial implementation based on the paper description
  • Built on the PixArt codebase which uses the standard Diffusion Transformer architecture

📝 Citation

If you use this implementation in your research, please cite the original LinGen paper:

@inproceedings{wang2025lingen,
  title={LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity},
  author={Wang, Hongjie and Ma, Chih-Yao and Liu, Yen-Cheng and Hou, Ji and Xu, Tao and Wang, Jialiang and Juefei-Xu, Felix and Luo, Yaqiao and Zhang, Peizhao and Hou, Tingbo and Vajda, Peter and Jha, Niraj K. and Dai, Xiaoliang},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={2578--2588},
  year={2025}
}

🙏 Acknowledgments

This implementation builds on the PixArt codebase. Credit for the MATE block design goes to the authors of the LinGen paper.

📧 Contact

For questions about this implementation, please open an issue in this repository.
For questions about the original LinGen research, please refer to the official project page.