refactor(quantization): modularize linear.py and loader.py, add FP8 KV cache support #31

Closed
luozixin2 wants to merge 1 commit into SJTU-DENG-Lab:main from luozixin2:main

Conversation

@luozixin2 luozixin2 commented Mar 11, 2026

…V cache support

Major refactoring and feature additions:

Code Modularization:

  • Split linear.py (872 lines) into focused modules:
    • linear.py: Main mixin class with forward dispatch
    • offline_prep.py: GPTQ/AWQ/Marlin weight preparation
    • online_quant.py: Runtime weight quantization helpers
  • Split loader.py (887 lines) into loader/ package:
    • core.py: Base weight loading utilities
    • lora.py: LoRA weight loading
    • offline_quant.py: GPTQ/AWQ weight loading
    • main.py: Entry point
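The "mixin class with forward dispatch" split can be sketched roughly as below. This is an illustrative pattern, not the repo's actual API: `QuantLinearMixin`, `quant_method`, and the handler names are all hypothetical stand-ins for whatever linear.py actually defines.

```python
# Hypothetical sketch of a forward-dispatch mixin: forward() routes to a
# per-quantization-method handler, keeping method-specific logic in the
# separate modules (offline_prep.py, online_quant.py) described above.

class QuantLinearMixin:
    """Dispatches forward() to a method-specific implementation."""

    quant_method: str = "none"  # e.g. "gptq", "awq", "marlin", "none"

    def forward(self, x):
        # Look up the handler for the configured quantization method,
        # falling back to the unquantized path if none is registered.
        handler = getattr(self, f"_forward_{self.quant_method}", self._forward_none)
        return handler(x)

    def _forward_none(self, x):
        # Stand-in for an unquantized GEMM.
        return [xi * self.weight for xi in x]

    def _forward_gptq(self, x):
        # Real code would dequantize packed GPTQ weights before the matmul.
        return [xi * self.weight for xi in x]


class MyLinear(QuantLinearMixin):
    def __init__(self, weight, quant_method="none"):
        self.weight = weight
        self.quant_method = quant_method
```

The appeal of this shape is that adding a new quantization format only requires adding one `_forward_<name>` handler; the dispatch in `forward()` stays untouched.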

FP8 KV Cache Support:

  • Add chunked_prefill_fp8_triton.py with Triton kernel
  • Implement per-head running-max quantization strategy
  • Add FP8 dispatch in dllm_flash_attn_kernels.py
  • Support both BF16 compute (A100) and native FP8 (H100)
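The per-head running-max strategy can be sketched as follows, using NumPy as a stand-in for the Triton kernel. All names are illustrative, and `FP8_E4M3_MAX = 448.0` assumes the e4m3 format for the cache; the real kernel operates on GPU tensors, not arrays.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite e4m3 value (assumption: e4m3 KV cache)

def update_running_max(k_new, running_max):
    """Fold a new K/V chunk's per-head absolute max into the running max."""
    # k_new: [num_heads, seq_len, head_dim]; running_max: [num_heads]
    return np.maximum(running_max, np.abs(k_new).max(axis=(1, 2)))

def quantize_per_head(x, running_max):
    """Scale each head by its running max so values fit the FP8 range."""
    # Assumes running_max > 0 for every head (some epsilon guard in practice).
    scale = running_max / FP8_E4M3_MAX            # [num_heads]
    q = np.clip(x / scale[:, None, None], -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale  # q would be cast to float8 in the real cache

def dequantize_per_head(q, scale):
    """Recover (approximately) the original values from FP8 + scales."""
    return q * scale[:, None, None]
```

Under this scheme the A100 path would dequantize cached K/V back to BF16 before the attention matmuls, while the H100 path can feed the FP8 values to native FP8 tensor cores directly.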

Compatibility Fixes:

  • Fix torch.distributed device_id type (int -> torch.device)
  • Fix vLLM platform API incompatibility for the FP8 dtype
  • Disable Triton autotune for compatibility with Triton 3.1.0
  • Reduce BLOCK size from 128 to 64 to fit shared-memory limits
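The device_id fix amounts to passing a `torch.device` instead of a bare GPU index, roughly as below. The surrounding init helper is illustrative, not the repo's actual code; `device_id` is a real parameter of `torch.distributed.init_process_group` in recent PyTorch releases.

```python
import torch
import torch.distributed as dist

def init_distributed(rank: int, world_size: int) -> torch.device:
    # Newer torch.distributed expects device_id to be a torch.device,
    # not a bare int GPU index -- hence the int -> torch.device fix.
    device = torch.device("cuda", rank)
    if not dist.is_initialized():
        dist.init_process_group(
            backend="nccl",
            rank=rank,
            world_size=world_size,
            device_id=device,  # was: device_id=rank (an int)
        )
    return device
```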

Cleanup:

  • Remove obsolete commented code in attn_impl.py
  • Add attn_type field to D2F metadata for attention routing

Net reduction of ~1200 lines, leaving cleaner, more maintainable code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added FP8 KV-cache support for improved attention computation efficiency.
    • Enhanced quantization capabilities with offline and online support for GPTQ, AWQ, and Marlin formats.
  • Bug Fixes

    • Improved FP8 dtype detection with better PyTorch version compatibility.
    • Fixed device handling during distributed training initialization.
  • Refactor

    • Reorganized model loader utilities into a modular package structure for improved maintainability.
