End-to-end recipes for optimizing diffusion models with torchao and diffusers (inference and FP8 training).
Updated Jan 8, 2026 · Python
Profine automatically profiles and optimizes PyTorch training jobs on real GPUs, delivering measurable speedups and lower GPU costs before teams waste days tuning configs by hand.
PyTorch extension for Rebellions NPU
Wide-model collective ensemble system with fractal, geometric, and heavy compilation optimizations.
Leveraging torch.compile to accelerate cross-encoder inference
🚀 2-4x faster PyTorch training with one line of code. Beats torch.compile by 79%. Zero config, automatic hardware optimization for T4/V100/A100/H100 GPUs.
Optimized CSM-1B TTS pipeline for RTX 5090 (Blackwell sm_120). CUDA graph replay via patched HF Transformers. ~0.46x RTF. Topics (tags): csm text-to-speech rtx-5090 blackwell cuda-graphs torch-compile sesame streaming pytorch