feat: upgrade to FLUX.2-klein-4B with FP8 quantization (#1)
Replace FLUX.1-schnell (12B) with FLUX.2-klein-4B (4B params, Apache 2.0) for significantly improved image quality at lower VRAM usage.

Key changes:
- Switch from FluxPipeline to Flux2KleinPipeline (diffusers main)
- Add FP8 quantization via optimum-quanto (~8GB VRAM vs ~13GB BF16)
- Update guidance_scale from 0.0 to 1.0 (Klein uses light guidance)
- Use enable_model_cpu_offload instead of sequential offloading
- Add FLUX_QUANTIZE_FP8 env var (default true, set false to disable)
- Reduce K8s memory request from 16Gi to 12Gi

Signed-off-by: Christopher Maher <chris@mahercode.io>
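The deploy-time toggle described above could be parsed along these lines. Only the `FLUX_QUANTIZE_FP8` name and its `true` default come from the PR; the `env_flag` helper and the accepted value spellings are illustrative assumptions, not the PR's actual code.

```python
import os


def env_flag(name: str, default: bool = True) -> bool:
    """Parse a boolean deploy-time env var (illustrative helper, not the PR's code)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    # Treat common "off" spellings as false; everything else as true.
    return raw.strip().lower() not in ("0", "false", "no", "off")


# FLUX_QUANTIZE_FP8 defaults to true; set it to "false" to fall back to BF16.
quantize_fp8 = env_flag("FLUX_QUANTIZE_FP8", default=True)
```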
The runtime image lacks nvcc/build tools needed by optimum-quanto to JIT-compile Marlin FP8 CUDA kernels. Also adds git for pip install of diffusers from source. Signed-off-by: Christopher Maher <chris@mahercode.io>
CUDA 12.4 nvcc doesn't support compute_120 (Blackwell/RTX 50 series). CUDA 12.8 adds Blackwell support. Also adds python3-dev for the Python.h headers needed by optimum-quanto's JIT kernel compilation. Signed-off-by: Christopher Maher <chris@mahercode.io>
optimum-quanto's Marlin FP8 kernels require JIT compilation and hit runtime contiguity bugs. Switch to diffusers' built-in enable_layerwise_casting (stores weights in FP8, computes in BF16) which needs no external dependencies or CUDA compilation. This also allows switching back to the smaller runtime base image. Signed-off-by: Christopher Maher <chris@mahercode.io>
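The layerwise-casting approach can be sketched with diffusers' public `enable_layerwise_casting` API. The pipeline class and model id below follow the PR description, but the exact loading code is an assumption, not the PR's actual diff, and running it requires diffusers installed from source plus a CUDA GPU.

```python
# Sketch only: assumed usage of FP8 layerwise casting, not the PR's real code.
import torch
from diffusers import Flux2KleinPipeline  # available on diffusers main, not a PyPI release


def load_klein_fp8() -> Flux2KleinPipeline:
    pipe = Flux2KleinPipeline.from_pretrained(
        "black-forest-labs/FLUX.2-klein-4B",  # model id assumed from the PR title
        torch_dtype=torch.bfloat16,
    )

    # Store transformer weights in FP8 and upcast to BF16 per layer at compute
    # time; no optimum-quanto dependency and no JIT-compiled CUDA kernels.
    pipe.transformer.enable_layerwise_casting(
        storage_dtype=torch.float8_e4m3fn,
        compute_dtype=torch.bfloat16,
    )

    # Offload whole submodules to CPU between uses (the PR's replacement for
    # sequential offloading).
    pipe.enable_model_cpu_offload()
    return pipe


if __name__ == "__main__":
    pipe = load_klein_fp8()
    image = pipe("a lighthouse at dawn", guidance_scale=1.0).images[0]
```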
Summary
- Switch to `Flux2KleinPipeline` with `guidance_scale=1.0` and `enable_model_cpu_offload`
- Add `FLUX_QUANTIZE_FP8` env var (default `true`) to toggle FP8 at deploy time

Notes
- `diffusers` installed from source (Klein pipeline not in a PyPI release yet)
- Set `FLUX_QUANTIZE_FP8=false` if quality issues arise

Test plan
- `/health` endpoint reports `FLUX.2-klein-4B`
- Set `FLUX_QUANTIZE_FP8=false` to confirm BF16 fallback works