
feat: upgrade to FLUX.2-klein-4B with FP8 quantization#1

Open

Defilan wants to merge 4 commits into main from feat/flux2-klein-4b-fp8

Conversation

Defilan (Member) commented Mar 23, 2026

Summary

  • Replaces FLUX.1-schnell (12B params) with FLUX.2-klein-4B (4B params, Apache 2.0) for improved image quality
  • Adds FP8 quantization (initially via optimum-quanto; later commits switch to diffusers' built-in enable_layerwise_casting), reducing VRAM from ~13GB to ~8GB on 16GB GPUs
  • Updates inference parameters: guidance_scale=1.0, Flux2KleinPipeline, enable_model_cpu_offload
  • Adds FLUX_QUANTIZE_FP8 env var (default true) to toggle FP8 at deploy time
  • Reduces K8s memory request from 16Gi to 12Gi to reflect smaller model footprint

Notes

  • Requires diffusers installed from source (Klein pipeline not in a PyPI release yet)
  • First deploy will download ~8GB of model weights to the PVC; subsequent startups load from cache
  • FP8 can be disabled by setting FLUX_QUANTIZE_FP8=false if quality issues arise
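The load path described above can be sketched roughly as follows. This is a sketch, not the PR's actual code: the helper name, the model repo id, and the parsing of the default-true toggle are assumptions; Flux2KleinPipeline and enable_model_cpu_offload are as named in this PR, and the FP8 path uses the layerwise casting the final commit settled on.

```python
import os


def fp8_enabled(env=os.environ) -> bool:
    """Parse the FLUX_QUANTIZE_FP8 toggle; defaults to true as in this PR."""
    return env.get("FLUX_QUANTIZE_FP8", "true").strip().lower() not in ("0", "false", "no")


def load_pipeline():
    # Requires diffusers installed from source (Klein pipeline not in a PyPI release yet).
    import torch
    from diffusers import Flux2KleinPipeline

    # Repo id is an assumption for illustration.
    pipe = Flux2KleinPipeline.from_pretrained(
        "black-forest-labs/FLUX.2-klein-4B", torch_dtype=torch.bfloat16
    )
    if fp8_enabled():
        # Store transformer weights in FP8, compute in BF16 (see final commit).
        pipe.transformer.enable_layerwise_casting(
            storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16
        )
    pipe.enable_model_cpu_offload()  # ~8GB VRAM on a 16GB GPU per the summary
    return pipe
```

Keeping the toggle parsing in its own small function makes the deploy-time behavior (default true, opt-out via `false`) easy to unit-test without loading any weights.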

Test plan

  • Build Docker image with updated Dockerfile
  • Deploy to Shadowstack and verify model loads with FP8
  • Generate sample images and compare quality vs FLUX.1-schnell
  • Verify /health endpoint reports FLUX.2-klein-4B
  • Test with FLUX_QUANTIZE_FP8=false to confirm BF16 fallback works
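The /health verification step above could be scripted; a minimal sketch, assuming the endpoint returns JSON with a "model" field (the base URL and response shape are assumptions about the service, not confirmed by this PR):

```python
import json
import urllib.request


def model_from_health(base_url: str) -> str:
    """Fetch /health and return the reported model name ("model" field assumed)."""
    with urllib.request.urlopen(f"{base_url}/health") as resp:
        return json.load(resp)["model"]


# After deploy, this should report "FLUX.2-klein-4B" per the test plan.
```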

Defilan added 4 commits March 23, 2026 11:43
Replace FLUX.1-schnell (12B) with FLUX.2-klein-4B (4B params,
Apache 2.0) for significantly improved image quality at lower
VRAM usage.

Key changes:
- Switch from FluxPipeline to Flux2KleinPipeline (diffusers main)
- Add FP8 quantization via optimum-quanto (~8GB VRAM vs ~13GB BF16)
- Update guidance_scale from 0.0 to 1.0 (Klein uses light guidance)
- Use enable_model_cpu_offload instead of sequential offloading
- Add FLUX_QUANTIZE_FP8 env var (default true, set false to disable)
- Reduce K8s memory request from 16Gi to 12Gi

Signed-off-by: Christopher Maher <chris@mahercode.io>

The runtime image lacks nvcc/build tools needed by optimum-quanto
to JIT-compile Marlin FP8 CUDA kernels. Also adds git for pip
install of diffusers from source.

Signed-off-by: Christopher Maher <chris@mahercode.io>

CUDA 12.4 nvcc doesn't support compute_120 (Blackwell/RTX 50 series).
CUDA 12.8 adds Blackwell support. Also adds python3-dev for the
Python.h headers needed by optimum-quanto's JIT kernel compilation.

Signed-off-by: Christopher Maher <chris@mahercode.io>

optimum-quanto's Marlin FP8 kernels require JIT compilation and hit
runtime contiguity bugs. Switch to diffusers' built-in
enable_layerwise_casting (stores weights in FP8, computes in BF16)
which needs no external dependencies or CUDA compilation.

This also allows switching back to the smaller runtime base image.

Signed-off-by: Christopher Maher <chris@mahercode.io>