
Conversation

@Pfannkuchensack
Contributor

Add comprehensive support for Z-Image-Turbo (S3-DiT) models including:

Backend:

  • New BaseModelType.ZImage in taxonomy
  • Z-Image model config classes (ZImageTransformerConfig, Qwen3TextEncoderConfig)
  • Model loader for Z-Image transformer and Qwen3 text encoder
  • Z-Image conditioning data structures
  • Step callback support for Z-Image with FLUX latent RGB factors

Invocations:

  • z_image_model_loader: Load Z-Image transformer and Qwen3 encoder
  • z_image_text_encoder: Encode prompts using Qwen3 with chat template (see the sketch after this list)
  • z_image_denoise: Flow matching denoising with time-shifted sigmas
  • z_image_image_to_latents: Encode images to 16-channel latents
  • z_image_latents_to_image: Decode latents using FLUX VAE
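
As a rough illustration of the text-encoder invocation above, here is a minimal sketch of encoding a prompt with a Qwen3 model through its chat template via Hugging Face transformers. The model ID, dtype, and the choice of last hidden states as conditioning are illustrative assumptions, not necessarily what this PR does.

```
# Minimal sketch, not the PR's implementation: encode a prompt with Qwen3
# using its chat template. Model ID and conditioning choice are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")  # assumed model id
encoder = AutoModel.from_pretrained("Qwen/Qwen3-4B", torch_dtype=torch.bfloat16)

prompt = "a cat wearing a spacesuit, studio lighting"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # The token-level hidden states act as the conditioning sequence for the DiT.
    cond = encoder(**inputs).last_hidden_state
```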

Frontend:

  • Z-Image graph builder for text-to-image generation
  • Model picker and validation updates for z-image base type
  • CFG scale now allows 0 (required for Z-Image-Turbo)
  • Clip skip disabled for Z-Image (uses Qwen3, not CLIP)
  • Optimal dimension settings for Z-Image (1024x1024)

Technical details:

  • Uses Qwen3 text encoder (not CLIP/T5)
  • 16 latent channels with FLUX-compatible VAE
  • Flow matching scheduler with dynamic time shift (see the sketch after this list)
  • 8 inference steps recommended for Turbo variant
  • bfloat16 inference dtype
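
To make the flow-matching schedule above concrete, here is a minimal sketch of FLUX/SD3-style time-shifted sigmas with a resolution-dependent shift. The constants and the 2x2 patchify assumption are illustrative, not necessarily the exact values used by Z-Image.

```
# Minimal sketch of a time-shifted flow-matching sigma schedule (FLUX/SD3 style).
# Constants and the patchify assumption are illustrative only.
import math

def dynamic_shift(image_seq_len, base_len=256, max_len=4096, base_shift=0.5, max_shift=1.15):
    # Interpolate mu with sequence length, so larger images shift more toward high noise.
    m = (max_shift - base_shift) / (max_len - base_len)
    mu = image_seq_len * m + (base_shift - m * base_len)
    return math.exp(mu)

def shifted_sigmas(num_steps, shift):
    # Linear sigmas from 1 -> 0, then warp them with the shift factor.
    sigmas = [1.0 - i / num_steps for i in range(num_steps + 1)]
    return [shift * s / (1.0 + (shift - 1.0) * s) for s in sigmas]

# 1024x1024 image -> 128x128 latents -> 64x64 tokens after 2x2 patchify (assumed).
sigmas = shifted_sigmas(num_steps=8, shift=dynamic_shift(64 * 64))
```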

Summary

Related Issues / Discussions

QA Instructions

  • Install a Z-Image-Turbo model (e.g., from HuggingFace)
  • Select the model in the Model Picker
  • Generate a text-to-image with:
      • CFG Scale: 0
      • Steps: 8
      • Resolution: 1024x1024
  • Verify the generated image is coherent (not noise)

Merge Plan

Standard merge, no special considerations needed.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • ❗Changes to a redux slice have a corresponding migration
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

@github-actions github-actions bot added api python PRs that change python files Root invocations PRs that change invocations backend PRs that change backend files frontend PRs that change frontend files python-deps PRs that change python dependencies labels Nov 30, 2025
Add comprehensive LoRA support for Z-Image models including:

Backend:
- New Z-Image LoRA config classes (LoRA_LyCORIS_ZImage_Config, LoRA_Diffusers_ZImage_Config)
- Z-Image LoRA conversion utilities with key mapping for transformer and Qwen3 encoder (sketched below)
- LoRA prefix constants (Z_IMAGE_LORA_TRANSFORMER_PREFIX, Z_IMAGE_LORA_QWEN3_PREFIX)
- LoRA detection logic to distinguish Z-Image from Flux models
- Layer patcher improvements for proper dtype conversion and parameter
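
As a sketch of the key-mapping idea above, the conversion might route LoRA keys to the transformer or the Qwen3 encoder by prefix. Only the constant names come from the commit; the prefix values and the source key layouts below are assumptions for illustration.

```
# Minimal sketch, assumed key formats: route Z-Image LoRA keys by prefix.
# Only the constant names are from the commit; their values are assumptions.
import torch

Z_IMAGE_LORA_TRANSFORMER_PREFIX = "z_image_transformer:"  # assumed value
Z_IMAGE_LORA_QWEN3_PREFIX = "z_image_qwen3:"              # assumed value

def route_lora_keys(state_dict: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    routed: dict[str, torch.Tensor] = {}
    for key, tensor in state_dict.items():
        if key.startswith("diffusion_model."):
            # Transformer LoRA weights, e.g. "diffusion_model.layers.3.attention..."
            routed[Z_IMAGE_LORA_TRANSFORMER_PREFIX + key.removeprefix("diffusion_model.")] = tensor
        elif key.startswith("text_encoder."):
            # Qwen3 encoder LoRA weights (assumed source prefix).
            routed[Z_IMAGE_LORA_QWEN3_PREFIX + key.removeprefix("text_encoder.")] = tensor
        else:
            routed[key] = tensor
    return routed
```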
@lstein
Collaborator

lstein commented Dec 2, 2025

Very impressive. The model is working with acceptable performance even on my 12 GB VRAM card.

I notice the following message in the error log:

[2025-12-01 20:50:58,822]::[ModelManagerService]::WARNING --> [MODEL CACHE] Failed to calculate model size for unexpected model type: <class 'transformers.models.qwen2.tokenization_qwen2.Qwen2Tokenizer'>. The model will be treated as having size 0.

Would it be possible to add support for the quantized models, e.g. T5B/Z-Image-Turbo-FP8 or jayn7/Z-Image-Turbo-GGUF?

@Pfannkuchensack
Contributor Author

I'll take a look at it and report back.

@lstein
Collaborator

lstein commented Dec 2, 2025

I tried two HuggingFace LoRAs that claim to be based on Z-Image, but they were detected as Flux LyCORIS models:

reverentelusarca/elusarca-anime-style-lora-z-image-turbo
tarn59/pixel_art_style_lora_z_image_turbo

…ntification

Move Flux layer structure check before metadata check to prevent misidentifying Z-Image LoRAs (which use `diffusion_model.layers.X`) as Flux AI Toolkit format. Flux models use `double_blocks` and `single_blocks` patterns which are now checked first regardless of metadata presence.
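
A minimal sketch of that detection order, assuming the key patterns named above; the metadata key checked in the fallback is an assumption:

```
# Sketch: check layer structure before metadata when classifying a LoRA.
def detect_lora_base(keys: list[str], metadata: dict | None) -> str:
    if any(".double_blocks." in k or ".single_blocks." in k for k in keys):
        return "flux"          # Flux structure wins regardless of metadata
    if any(k.startswith("diffusion_model.layers.") for k in keys):
        return "z-image"       # S3-DiT layer layout
    if metadata and "flux" in str(metadata.get("modelspec.architecture", "")):  # assumed metadata key
        return "flux"
    return "unknown"
```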
…ibility

Add comprehensive support for GGUF quantized Z-Image models and improve component flexibility:

Backend:
- New Main_GGUF_ZImage_Config for GGUF quantized Z-Image transformers
- Z-Image key detection (_has_z_image_keys) to identify S3-DiT models
- GGUF quantization detection and sidecar LoRA patching for quantized models
- Qwen3Encoder_Qwen3Encoder_Config for standalone Qwen3 encoder models

Model Loader:
- Split Z-Image model
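
One piece of that conversion is splitting fused attention weights; a minimal sketch of a QKV split is below. The key naming (a fused qkv tensor stacked on the first dimension, split into to_q/to_k/to_v targets) is an assumption for illustration.

```
# Minimal sketch of splitting a fused QKV tensor during GGUF -> diffusers
# conversion. Key naming conventions here are illustrative assumptions.
import torch

def split_qkv(key: str, tensor: torch.Tensor) -> dict[str, torch.Tensor]:
    if tensor.shape[0] % 3 != 0:
        raise ValueError(
            f"Cannot split QKV tensor '{key}': first dimension ({tensor.shape[0]}) is not divisible by 3."
        )
    q, k, v = tensor.chunk(3, dim=0)
    prefix = key.rsplit(".qkv.", 1)[0]   # e.g. "layers.0.attention"
    suffix = key.rsplit(".", 1)[-1]      # "weight" or "bias"
    return {f"{prefix}.to_q.{suffix}": q, f"{prefix}.to_k.{suffix}": k, f"{prefix}.to_v.{suffix}": v}
```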
@Pfannkuchensack
Contributor Author

[screenshot attached] I tried both of the LoRAs and both of them are imported as Z-Image LoRAs.

@Pfannkuchensack Pfannkuchensack marked this pull request as ready for review December 4, 2025 23:46
@lstein
Collaborator

lstein commented Dec 5, 2025

When running upscaling, diffusers 0.36.0.dev0 dies because the diffusers.models.controlnet module has been renamed to diffusers.models.controlnets.controlnet. I suggest applying this patch to fix the issue:

diff --git a/invokeai/backend/util/hotfixes.py b/invokeai/backend/util/hotfixes.py
index 7e258b8779..1609fe12c4 100644
--- a/invokeai/backend/util/hotfixes.py
+++ b/invokeai/backend/util/hotfixes.py
@@ -5,7 +5,6 @@ import torch
 from diffusers.configuration_utils import ConfigMixin, register_to_config
 from diffusers.loaders.single_file_model import FromOriginalModelMixin
 from diffusers.models.attention_processor import AttentionProcessor, AttnProcessor
-from diffusers.models.controlnet import ControlNetConditioningEmbedding, ControlNetOutput, zero_module
 from diffusers.models.embeddings import (
     TextImageProjection,
     TextImageTimeEmbedding,
@@ -13,6 +12,7 @@ from diffusers.models.embeddings import (
     TimestepEmbedding,
     Timesteps,
 )
+from diffusers.models.controlnets.controlnet import ControlNetConditioningEmbedding, ControlNetOutput, zero_module
 from diffusers.models.modeling_utils import ModelMixin
 from diffusers.models.unets.unet_2d_blocks import (
     CrossAttnDownBlock2D,
@@ -777,7 +777,7 @@ class ControlNetModel(ModelMixin, ConfigMixin, FromOriginalModelMixin):
 
 
 diffusers.ControlNetModel = ControlNetModel
-diffusers.models.controlnet.ControlNetModel = ControlNetModel
+diffusers.models.controlnets.controlnet.ControlNetModel = ControlNetModel
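
An alternative to hard-switching the import, sketched below, is a version-tolerant guard; whether that is preferable to simply requiring diffusers >= 0.36 is a design choice.

```
# Sketch: tolerate both diffusers module layouts instead of assuming one.
try:
    from diffusers.models.controlnets.controlnet import (
        ControlNetConditioningEmbedding,
        ControlNetOutput,
        zero_module,
    )
except ImportError:  # older diffusers still expose the flat module path
    from diffusers.models.controlnet import (
        ControlNetConditioningEmbedding,
        ControlNetOutput,
        zero_module,
    )
```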

@blessedcoolant
Collaborator

I think this needs support for loading the repackaged safetensors versions of the models that people use with Comfy - the default fp16 version and the fp8 model. People will likely try to load those model files as the transformer and also as the text encoder, and share them between the two programs.

@lstein
Collaborator

lstein commented Dec 8, 2025

I've tested multiple LoRAs and they import and work correctly.

@lstein lstein requested a review from Copilot December 8, 2025 22:59
…inModelConfig

The FLUX Dev license warning in model pickers used isCheckpointMainModelConfig
incorrectly:
```
isCheckpointMainModelConfig(config) && config.variant === 'dev'
```

This caused a TypeScript error because CheckpointModelConfig type doesn't
include the 'variant' property (it's extracted as `{ type: 'main'; format:
'checkpoint' }` which doesn't narrow to include variant).

Changes:
- Add isFluxDevMainModelConfig type guard that properly checks
  base='flux' AND variant='dev', returning MainModelConfig
- Update MainModelPicker and InitialStateMainModelPicker to use new guard
- Remove isCheckpointMainModelConfig as it had no other usages

The function was removed because:
1. It was only used for detecting FLUX Dev models (incorrect use case)
2. No other code needs a generic "is checkpoint format" check
3. The pattern in this codebase is specific type guards per model variant
   (isFluxFillMainModelModelConfig, isRefinerMainModelModelConfig, etc.)
@blessedcoolant
Collaborator

blessedcoolant commented Dec 9, 2025

I used the models from https://comfyanonymous.github.io/ComfyUI_examples/z_image/ for testing.

  1. ✔️ Nice job. The models are being detected correctly via the model manager.

  2. ❌ The inference seems to be fine at FP16 but on the FP8 models the following error occurs. Reference FP8 model to check: https://huggingface.co/Kijai/Z-Image_comfy_fp8_scaled/tree/main

File "~\InvokeAI\invokeai\backend\model_manager\load\model_loaders\z_image.py", line 75, in _convert_z_image_gguf_to_diffusers
    raise ValueError(
ValueError: Cannot split QKV tensor 'context_refiner.0.attention.qkv.scale_weight': first dimension (1) is not divisible by 3. The model file may be corrupted or incompatible.

The same issue also occurs with a model that I manually converted on my end.

  3. ⁉️ A secondary thing would be to set the default params for the Z Image model when loaded -- the recommended steps of 9, CFG of 1, etc.

  4. ❌ Also, LoRAs for Z-Image that I randomly pulled off Civitai are being loaded as checkpoint models rather than LoRAs, and manually trying to update the field fails. So effectively they cannot be used at all.

  5. ✔️ Tested the GGUF models. The base model quants are working as expected.

  6. ❌ The GGUF quants for the text encoder Qwen 4B are failing to load. https://huggingface.co/Qwen/Qwen3-4B-GGUF/tree/main

  7. ❌ There are some weird artifacts at 9 steps and 1 CFG, which I believe are the recommended settings for Z Image Turbo. These are not so visible in stylized images, but when it comes to realism they are quite prominent.

[screenshot attached: opera_EBU1aieHo3]

@Pfannkuchensack
Contributor Author

These are the settings from the GitHub page of the model, so I think CFG 1 is the problem there, but I will check:
    num_inference_steps=9, # This actually results in 8 DiT forwards
    guidance_scale=0.0, # Guidance should be 0 for the Turbo models
I'll take a look at the rest as well.

@blessedcoolant
Collaborator

These are the settings from the GitHub page of the model, so I think CFG 1 is the problem there, but I will check:
    num_inference_steps=9, # This actually results in 8 DiT forwards
    guidance_scale=0.0, # Guidance should be 0 for the Turbo models
I'll take a look at the rest as well.

Same issue with CFG set to 0 too. Another issue I found: now that 0 CFG is possible, we cannot set it as the model default in the model manager. It bugs out. Needs fixing.

…ters

- Add Qwen3EncoderGGUFLoader for llama.cpp GGUF quantized text encoders
- Convert llama.cpp key format (blk.X., token_embd) to PyTorch format (see the sketch below)
- Handle tied embeddings (lm_head.weight ↔ embed_tokens.weight)
- Dequantize embed_tokens for embedding lookups (GGMLTensor limitation)
- Add QK normalization key mappings (q_norm, k_norm) for Qwen3
- Set Z-Image defaults: steps=9, cfg_scale=0.0, width/height=1024
- Allow cfg_scale >= 0 (was >= 1) for Z-Image Turbo compatibility
- Add GGUF format detection for Qwen3 model probing
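
To illustrate the key conversion described above, here is a minimal sketch of renaming llama.cpp GGUF tensor names to Hugging Face Qwen3 names; the mapping table is an assumption based on common llama.cpp/HF conventions, not the PR's exact table.

```
# Minimal sketch: rename llama.cpp GGUF keys to Hugging Face Qwen3 names.
# The mapping table is an assumption based on common conventions.
import re

TOP_LEVEL = {
    "token_embd.weight": "model.embed_tokens.weight",
    "output_norm.weight": "model.norm.weight",
    "output.weight": "lm_head.weight",
}
PER_LAYER = {
    "attn_q": "self_attn.q_proj", "attn_k": "self_attn.k_proj",
    "attn_v": "self_attn.v_proj", "attn_output": "self_attn.o_proj",
    "attn_q_norm": "self_attn.q_norm", "attn_k_norm": "self_attn.k_norm",  # Qwen3 QK norm
    "attn_norm": "input_layernorm", "ffn_norm": "post_attention_layernorm",
    "ffn_gate": "mlp.gate_proj", "ffn_up": "mlp.up_proj", "ffn_down": "mlp.down_proj",
}

def rename_gguf_key(key: str) -> str:
    if key in TOP_LEVEL:
        return TOP_LEVEL[key]
    m = re.match(r"blk\.(\d+)\.(\w+)\.(weight|bias)$", key)
    if m:
        layer, sub, kind = m.groups()
        return f"model.layers.{layer}.{PER_LAYER.get(sub, sub)}.{kind}"
    return key

def convert(state_dict: dict) -> dict:
    sd = {rename_gguf_key(k): v for k, v in state_dict.items()}
    # Tied embeddings: if lm_head is absent in the file, reuse embed_tokens.
    if "lm_head.weight" not in sd and "model.embed_tokens.weight" in sd:
        sd["lm_head.weight"] = sd["model.embed_tokens.weight"]
    return sd
```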
…rNorm

- Add CustomDiffusersRMSNorm for diffusers.models.normalization.RMSNorm
- Add CustomLayerNorm for torch.nn.LayerNorm
- Register both in AUTOCAST_MODULE_TYPE_MAPPING

Enables partial loading (enable_partial_loading: true) for Z-Image models
by wrapping their normalization layers with device autocast support
Fixed the DEFAULT_TOKENIZER_SOURCE to Qwen/Qwen3-4B
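
For context on the autocast registration above, the sketch below shows the general shape of a device-autocast wrapper for torch.nn.LayerNorm; it is illustrative, not InvokeAI's actual CustomLayerNorm.

```
# Illustrative sketch of a device-autocast LayerNorm wrapper (not the actual
# InvokeAI CustomLayerNorm). With partial loading, weights may still sit on the
# CPU while activations are already on the GPU, so move them per forward pass.
import torch
import torch.nn.functional as F

class AutocastLayerNorm(torch.nn.LayerNorm):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.weight.to(x.device) if self.weight is not None else None
        bias = self.bias.to(x.device) if self.bias is not None else None
        return F.layer_norm(x, self.normalized_shape, weight, bias, self.eps)

# A loader can then swap module classes through a mapping of this shape:
AUTOCAST_EXAMPLE_MAPPING = {torch.nn.LayerNorm: AutocastLayerNorm}
```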
@Pfannkuchensack
Contributor Author

  2. ❌ The inference seems to be fine at FP16 but on the FP8 models the following error occurs. Reference FP8 model to check: https://huggingface.co/Kijai/Z-Image_comfy_fp8_scaled/tree/main

I got the non-scaled FP8 version running.

File "~\InvokeAI\invokeai\backend\model_manager\load\model_loaders\z_image.py", line 75, in _convert_z_image_gguf_to_diffusers
    raise ValueError(
ValueError: Cannot split QKV tensor 'context_refiner.0.attention.qkv.scale_weight': first dimension (1) is not divisible by 3. The model file may be corrupted or incompatible.

The same issue also occurs with a model that I manually converted on my end.

  3. ⁉️ A secondary thing would be to set the default params for the Z Image model when loaded -- the recommended steps of 9, CFG of 1, etc.

Added.

  4. ❌ Also, LoRAs for Z-Image that I randomly pulled off Civitai are being loaded as checkpoint models rather than LoRAs, and manually trying to update the field fails. So effectively they cannot be used at all.

Fixed for the one you linked.

  6. ❌ The GGUF quants for the text encoder Qwen 4B are failing to load. https://huggingface.co/Qwen/Qwen3-4B-GGUF/tree/main

Fixed. Tested with Qwen3-4B-Q5_0.gguf and Qwen3-4B-Q6_K.gguf.

  7. ❌ There are some weird artifacts at 9 steps and 1 CFG, which I believe are the recommended settings for Z Image Turbo. These are not so visible in stylized images, but when it comes to realism they are quite prominent.
[screenshot attached: opera_EBU1aieHo3]

Found the problem here. I used the wrong Qwen config.

@blessedcoolant
Collaborator

blessedcoolant commented Dec 14, 2025

Gave it a quick test. All the above stuff has been fixed. Tested text to image, image to image, inpainting and outpainting. All of them work fine except outpainting -- model-related, maybe?

But beyond that, I think this PR is good to go once we clear up the tests and formatting. @lstein could probably give it a quick look over too maybe.

There's a new controlnet model for Z Image. But I guess that can be another PR.

Great job overall.

@Pfannkuchensack
Contributor Author

Yeah, the outpainting is a limitation of the model. I found LanPaint (https://github.com/scraed/LanPaint?tab=readme-ov-file#example-z-image-inpaintlanpaint-k-sampler-5-steps-of-thinking, https://arxiv.org/abs/2502.03491), but that is a lot for now.

@Pfannkuchensack
Contributor Author

The control branch: https://github.com/Pfannkuchensack/InvokeAI/tree/feature/z-image-control. Not ready yet.

@blessedcoolant
Collaborator

I've fixed up the ruff checks. I didn't update the uv lockfile yet; I'll let you do the pinning of the diffusers version there. Currently it is locking to the dev version from their git. Version 0.36, released last week, suffices for this update; we can pin that and update the lock file directly.

I'll also go through the code later today. There's a lot but I'll skim through.

Once the uv lockfile is resolved, we are a go I think. Pinging @lstein once more for a second set of eyes before we merge.

@lstein lstein self-requested a review December 16, 2025 02:49
Collaborator

@lstein lstein left a comment


I've functionally tested this extensively over the past week and did not encounter any hiccups. I'm happy to approve.

…noise node

The Z-Image denoise node outputs latents, not images, so these mixins
were unnecessary. Metadata and board handling is correctly done in the
L2I (latents-to-image) node. This aligns with how FLUX denoise works.
The previous mixed-precision optimization for FP32 mode only converted
some VAE decoder layers (post_quant_conv, conv_in, mid_block) to the
latents dtype while leaving others (up_blocks, conv_norm_out) in float32.
This caused "expected scalar type Half but found Float" errors after
recent diffusers updates.

Simplify FP32 mode to consistently use float32 for both VAE and latents,
removing the incomplete mixed-precision logic. This trades some VRAM
usage for stability and correctness.

Also removes now-unused attention processor imports.
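
The essence of the FP32 change above is decoding with the VAE and latents in one dtype; a minimal sketch with diffusers' AutoencoderKL, with the latent scaling simplified:

```
# Minimal sketch of "one dtype for VAE and latents" (illustrative, not the PR's code).
import torch
from diffusers import AutoencoderKL

def decode_latents(vae: AutoencoderKL, latents: torch.Tensor, use_fp32: bool) -> torch.Tensor:
    dtype = torch.float32 if use_fp32 else latents.dtype
    vae = vae.to(dtype=dtype)           # cast the whole VAE, not selected submodules
    latents = latents.to(dtype=dtype)
    with torch.no_grad():
        return vae.decode(latents / vae.config.scaling_factor).sample
```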
@Pfannkuchensack
Contributor Author

I fixed a problem with SDXL FP32 that may come from the diffusers update.

@blessedcoolant
Collaborator

Sounds good (if anything else breaks, we'll fix it up in another PR).

Should I update the uv lockfile to 0.36 and merge the PR? If anyone has an issue, speak now or forever hold your peace until the next PR.

@blessedcoolant
Collaborator

blessedcoolant commented Dec 21, 2025

I've upgraded Diffusers to 0.36.0 and the lock file. If the checks pass, I think this PR is good for merge. Once this is merged, we'll move on to the ControlNet PR.

@blessedcoolant blessedcoolant merged commit ab6b672 into invoke-ai:main Dec 21, 2025
13 checks passed
@Pfannkuchensack Pfannkuchensack deleted the feat/z-image-turbo-support branch December 21, 2025 18:46