
feat: upgrade to FLUX.2-klein-4B with FP8 quantization#1

Open

Defilan wants to merge 4 commits into main from feat/flux2-klein-4b-fp8

Conversation

Defilan (Member) commented Mar 23, 2026

Summary

  • Replaces FLUX.1-schnell (12B params) with FLUX.2-klein-4B (4B params, Apache 2.0) for improved image quality
  • Adds FP8 quantization (initially via optimum-quanto; later commits switch to diffusers' built-in enable_layerwise_casting), reducing VRAM from ~13GB to ~8GB on 16GB GPUs
  • Updates inference parameters: guidance_scale=1.0, Flux2KleinPipeline, enable_model_cpu_offload
  • Adds FLUX_QUANTIZE_FP8 env var (default true) to toggle FP8 at deploy time
  • Reduces K8s memory request from 16Gi to 12Gi to reflect smaller model footprint

Notes

  • Requires diffusers installed from source (Klein pipeline not in a PyPI release yet)
  • First deploy will download ~8GB of model weights to the PVC; subsequent startups load from cache
  • FP8 can be disabled by setting FLUX_QUANTIZE_FP8=false if quality issues arise
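The load path described above can be sketched roughly as follows. This is a sketch, not the PR's actual code: the helper name, the model repo id, and the parsing of the default-true toggle are assumptions; Flux2KleinPipeline and enable_model_cpu_offload are as named in this PR, and the FP8 path uses the layerwise casting the final commit settled on.

```python
import os


def fp8_enabled(env=os.environ) -> bool:
    """Parse the FLUX_QUANTIZE_FP8 toggle; defaults to true as in this PR."""
    return env.get("FLUX_QUANTIZE_FP8", "true").strip().lower() not in ("0", "false", "no")


def load_pipeline():
    # Requires diffusers installed from source (Klein pipeline not in a PyPI release yet).
    import torch
    from diffusers import Flux2KleinPipeline

    # Repo id is an assumption for illustration.
    pipe = Flux2KleinPipeline.from_pretrained(
        "black-forest-labs/FLUX.2-klein-4B", torch_dtype=torch.bfloat16
    )
    if fp8_enabled():
        # Store transformer weights in FP8, compute in BF16 (see final commit).
        pipe.transformer.enable_layerwise_casting(
            storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16
        )
    pipe.enable_model_cpu_offload()  # ~8GB VRAM on a 16GB GPU per the summary
    return pipe
```

Keeping the toggle parsing in its own small function makes the deploy-time behavior (default true, opt-out via `false`) easy to unit-test without loading any weights.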

Test plan

  • Build Docker image with updated Dockerfile
  • Deploy to Shadowstack and verify model loads with FP8
  • Generate sample images and compare quality vs FLUX.1-schnell
  • Verify /health endpoint reports FLUX.2-klein-4B
  • Test with FLUX_QUANTIZE_FP8=false to confirm BF16 fallback works
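The /health verification step above could be scripted; a minimal sketch, assuming the endpoint returns JSON with a "model" field (the base URL and response shape are assumptions about the service, not confirmed by this PR):

```python
import json
import urllib.request


def model_from_health(base_url: str) -> str:
    """Fetch /health and return the reported model name ("model" field assumed)."""
    with urllib.request.urlopen(f"{base_url}/health") as resp:
        return json.load(resp)["model"]


# After deploy, this should report "FLUX.2-klein-4B" per the test plan.
```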

Defilan added 4 commits March 23, 2026 11:43
Replace FLUX.1-schnell (12B) with FLUX.2-klein-4B (4B params,
Apache 2.0) for significantly improved image quality at lower
VRAM usage.

Key changes:
- Switch from FluxPipeline to Flux2KleinPipeline (diffusers main)
- Add FP8 quantization via optimum-quanto (~8GB VRAM vs ~13GB BF16)
- Update guidance_scale from 0.0 to 1.0 (Klein uses light guidance)
- Use enable_model_cpu_offload instead of sequential offloading
- Add FLUX_QUANTIZE_FP8 env var (default true, set false to disable)
- Reduce K8s memory request from 16Gi to 12Gi

Signed-off-by: Christopher Maher <chris@mahercode.io>

The runtime image lacks nvcc/build tools needed by optimum-quanto
to JIT-compile Marlin FP8 CUDA kernels. Also adds git for pip
install of diffusers from source.

Signed-off-by: Christopher Maher <chris@mahercode.io>

CUDA 12.4 nvcc doesn't support compute_120 (Blackwell/RTX 50 series).
CUDA 12.8 adds Blackwell support. Also adds python3-dev for the
Python.h headers needed by optimum-quanto's JIT kernel compilation.

Signed-off-by: Christopher Maher <chris@mahercode.io>

optimum-quanto's Marlin FP8 kernels require JIT compilation and hit
runtime contiguity bugs. Switch to diffusers' built-in
enable_layerwise_casting (stores weights in FP8, computes in BF16)
which needs no external dependencies or CUDA compilation.

This also allows switching back to the smaller runtime base image.

Signed-off-by: Christopher Maher <chris@mahercode.io>