refactor(quantization): modularize linear.py and loader.py, add FP8 KV cache support #31

Closed
luozixin2 wants to merge 1 commit into SJTU-DENG-Lab:main from luozixin2:main

Conversation

@luozixin2 luozixin2 commented Mar 11, 2026

…V cache support

Major refactoring and feature additions:

Code Modularization:

  • Split linear.py (872 lines) into focused modules:
    • linear.py: Main mixin class with forward dispatch
    • offline_prep.py: GPTQ/AWQ/Marlin weight preparation
    • online_quant.py: Runtime weight quantization helpers
  • Split loader.py (887 lines) into loader/ package:
    • core.py: Base weight loading utilities
    • lora.py: LoRA weight loading
    • offline_quant.py: GPTQ/AWQ weight loading
    • main.py: Entry point
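The "mixin class with forward dispatch" split can be sketched roughly as below. This is an illustrative pattern, not the repo's actual API: `QuantLinearMixin`, `quant_method`, and the handler names are all hypothetical stand-ins for whatever linear.py actually defines.

```python
# Hypothetical sketch of a forward-dispatch mixin: forward() routes to a
# per-quantization-method handler, keeping method-specific logic in the
# separate modules (offline_prep.py, online_quant.py) described above.

class QuantLinearMixin:
    """Dispatches forward() to a method-specific implementation."""

    quant_method: str = "none"  # e.g. "gptq", "awq", "marlin", "none"

    def forward(self, x):
        # Look up the handler for the configured quantization method,
        # falling back to the unquantized path if none is registered.
        handler = getattr(self, f"_forward_{self.quant_method}", self._forward_none)
        return handler(x)

    def _forward_none(self, x):
        # Stand-in for an unquantized GEMM.
        return [xi * self.weight for xi in x]

    def _forward_gptq(self, x):
        # Real code would dequantize packed GPTQ weights before the matmul.
        return [xi * self.weight for xi in x]


class MyLinear(QuantLinearMixin):
    def __init__(self, weight, quant_method="none"):
        self.weight = weight
        self.quant_method = quant_method
```

The appeal of this shape is that adding a new quantization format only requires adding one `_forward_<name>` handler; the dispatch in `forward()` stays untouched.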

FP8 KV Cache Support:

  • Add chunked_prefill_fp8_triton.py with Triton kernel
  • Implement per-head running-max quantization strategy
  • Add FP8 dispatch in dllm_flash_attn_kernels.py
  • Support both BF16 compute (A100) and native FP8 (H100)
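The per-head running-max strategy can be sketched as follows, using NumPy as a stand-in for the Triton kernel. All names are illustrative, and `FP8_E4M3_MAX = 448.0` assumes the e4m3 format for the cache; the real kernel operates on GPU tensors, not arrays.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite e4m3 value (assumption: e4m3 KV cache)

def update_running_max(k_new, running_max):
    """Fold a new K/V chunk's per-head absolute max into the running max."""
    # k_new: [num_heads, seq_len, head_dim]; running_max: [num_heads]
    return np.maximum(running_max, np.abs(k_new).max(axis=(1, 2)))

def quantize_per_head(x, running_max):
    """Scale each head by its running max so values fit the FP8 range."""
    # Assumes running_max > 0 for every head (some epsilon guard in practice).
    scale = running_max / FP8_E4M3_MAX            # [num_heads]
    q = np.clip(x / scale[:, None, None], -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale  # q would be cast to float8 in the real cache

def dequantize_per_head(q, scale):
    """Recover (approximately) the original values from FP8 + scales."""
    return q * scale[:, None, None]
```

Under this scheme the A100 path would dequantize cached K/V back to BF16 before the attention matmuls, while the H100 path can feed the FP8 values to native FP8 tensor cores directly.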

Compatibility Fixes:

  • Fix torch.distributed device_id type (int -> torch.device)
  • Fix vLLM platform API incompatibility for the FP8 dtype
  • Disable Triton autotune for compatibility with Triton 3.1.0
  • Reduce BLOCK size from 128 to 64 to fit shared-memory limits
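The device_id fix amounts to passing a `torch.device` instead of a bare GPU index, roughly as below. The surrounding init helper is illustrative, not the repo's actual code; `device_id` is a real parameter of `torch.distributed.init_process_group` in recent PyTorch releases.

```python
import torch
import torch.distributed as dist

def init_distributed(rank: int, world_size: int) -> torch.device:
    # Newer torch.distributed expects device_id to be a torch.device,
    # not a bare int GPU index -- hence the int -> torch.device fix.
    device = torch.device("cuda", rank)
    if not dist.is_initialized():
        dist.init_process_group(
            backend="nccl",
            rank=rank,
            world_size=world_size,
            device_id=device,  # was: device_id=rank (an int)
        )
    return device
```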

Cleanup:

  • Remove obsolete commented code in attn_impl.py
  • Add attn_type field to D2F metadata for attention routing

Net reduction of ~1200 lines, leaving cleaner, more maintainable code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added FP8 KV-cache support for improved attention computation efficiency.
    • Enhanced quantization capabilities with offline and online support for GPTQ, AWQ, and Marlin formats.
  • Bug Fixes

    • Improved FP8 dtype detection with better PyTorch version compatibility.
    • Fixed device handling during distributed training initialization.
  • Refactor

    • Reorganized model loader utilities into a modular package structure for improved maintainability.
