Fast-Diffusion

Mobile-first Stable Diffusion inference on Qualcomm CLML. On CLML-supported devices it is the fastest option among the publicly comparable results listed under Performance.
Runs SDXL 1024 on phones with a practical end-to-end pipeline.

Highlights

CLML backend optimized for fast on-device inference
SD1.5 512 (single-process) and SDXL 1024 (multi-process, recommended)
Early-decoded x0 flow for cleaner SDXL results
Includes app source, APK, and bench binaries

Gallery (SDXL 1024, CLML)

Same prompt/seed/CFG. Left is early decode (k=2, x0), right is the default final-step decode.

20 steps

Early decode (k=2, x0)	Final-step decode (x0)
SM8750 • steps=20 • k=2 • s/it=3.21684 (CFG UNet)	SM8750 • steps=20 • final • s/it=3.21684 (CFG UNet)

25 steps

Early decode (k=2, x0)	Final-step decode (x0)
SM8750 • steps=25 • k=2 • s/it=3.21684 (CFG UNet)	SM8750 • steps=25 • final • s/it=3.21684 (CFG UNet)

30 steps

Early decode (k=2, x0)	Final-step decode (x0)
SM8750 • steps=30 • k=2 • s/it=3.21684 (CFG UNet)	SM8750 • steps=30 • final • s/it=3.21684 (CFG UNet)

Step snapshots (20 steps, x0)

Step 17 (k=2)	Step 18 (k=1)	Step 19 (final)
SM8750 • steps=20 • step=17 • s/it=3.21684 (CFG UNet)	SM8750 • steps=20 • step=18 • s/it=3.21684 (CFG UNet)	SM8750 • steps=20 • step=19 • s/it=3.21684 (CFG UNet)
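
The step labels above follow a simple pattern: with 20 steps, k=2 maps to step index 17, k=1 to 18, and the final step to 19, i.e. step = steps - 1 - k. The sketch below illustrates that mapping and one common way to form an x0 estimate. The function names are hypothetical, and the x0 formula assumes an epsilon-prediction model with a cumulative alpha (alpha_bar) noise schedule; the pipeline's actual scheduler math is not documented here and may differ.

```python
import math

def early_decode_step(steps: int, k: int) -> int:
    """0-based index of the step whose result is decoded when stopping k steps early."""
    if not 0 <= k < steps:
        raise ValueError("k must satisfy 0 <= k < steps")
    return steps - 1 - k

def estimate_x0(x_t, eps, alpha_bar):
    """x0 estimate under epsilon-prediction (an assumption, not confirmed by the repo):
    x0 = (x_t - sqrt(1 - alpha_bar) * eps) / sqrt(alpha_bar)
    """
    sa = math.sqrt(alpha_bar)
    sb = math.sqrt(1.0 - alpha_bar)
    return [(x - sb * e) / sa for x, e in zip(x_t, eps)]

# Labels from the step snapshots: 20 steps, k=2 / k=1 / final (k=0)
print([early_decode_step(20, k) for k in (2, 1, 0)])  # [17, 18, 19]
```

Decoding the x0 estimate rather than the still-noisy latent is what lets an early step produce a usable image.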

Gallery (SD1.5 512, CLML)

Portrait 2	Non-portrait (cityscape)
SM8750 • steps=20 • s/step=0.459639 (CFG)	SM8750 • steps=20 • s/step=0.459639 (CFG)

Performance (CFG-aligned, with explicit records)

Our measured results (full records are included in this repo):

SD1.5 512 (CFG): steps=20, total_s=9.19277, s/step=0.459639
- Record: release/sd_pipelines_zh.md
SDXL 1024 UNet-only (CFG): init_s=65.6627, loop_s=64.3368, s/step=3.21684 (20 steps, precomputed embeddings)
- Record: release/bench/logs/sdxl_unet_pyclip.log
SDXL 1024 UNet single pass (no CFG): iters=1, s/it=1.61393
- Record: release/bench/logs/sdxl_unet_single_step.log

Public comparison baselines (CFG enabled or equivalent):

CVPR 2023 (Google LLC), Adreno 740, SD1.4, 20 steps: 11.5s
- Paper: https://arxiv.org/abs/2304.11267
Local Diffusion app (author report), Snapdragon 8 Gen 3, SD1.5: 8 s/it, GPU slower than CPU
- App: https://github.com/rmatif/Local-Diffusion
- Backend: https://github.com/leejet/stable-diffusion.cpp
Local Dream (MNN GPU backend), Snapdragon 8 Elite, SD1.5, 20 steps: 52s
T4 baseline for SDXL 1024: 1.2 s/it (CFG enabled)

Note on T4: the 1.2 s/it figure includes CFG (two UNet passes per step), so a single pass is about 0.6 s/it.
For comparison, our SDXL UNet single-pass record is 1.61393 s/it and our CFG step is 3.21684 s/step, roughly 2.7x the T4 figures in both cases.
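
The per-step figures and the single-pass normalization can be re-derived from the recorded totals. This is a minimal sketch using only the numbers quoted in this section; the helper names are illustrative, and halving a CFG step is an approximation that assumes the two UNet passes cost about the same.

```python
def seconds_per_step(total_s: float, steps: int) -> float:
    """Average wall time per denoising step."""
    return total_s / steps

def single_pass_from_cfg(cfg_s_per_step: float) -> float:
    """CFG runs two UNet passes per step, so one pass costs roughly half."""
    return cfg_s_per_step / 2.0

sd15 = seconds_per_step(9.19277, 20)      # SD1.5 512, total_s over 20 steps
sdxl_cfg = seconds_per_step(64.3368, 20)  # SDXL 1024, loop_s over 20 steps
t4_single = single_pass_from_cfg(1.2)     # T4 baseline normalized to one pass

print(f"SD1.5 s/step: {sd15:.4f}")
print(f"SDXL CFG s/step: {sdxl_cfg:.4f}")
print(f"T4 single pass s/it: {t4_single:.2f}")
print(f"SDXL vs T4 with CFG: {sdxl_cfg / 1.2:.2f}x")
print(f"SDXL vs T4 single pass: {1.61393 / t4_single:.2f}x")
```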

Conclusion highlights:

Fastest among the public baselines above on CLML-supported phones
To our knowledge, SDXL 1024 on mobile is practical here for the first time
Per-step SDXL speed on a phone-class device is within about 2.7x of the T4 baseline (3.21684 vs 1.2 s/step with CFG)

Roadmap (Confirmed)

SDXL Base/Turbo 512/768 backend is already done; front-end integration is pending
SDXL Turbo 768 on SM8750 is expected to generate a high-quality image within 10 seconds. We consider 768 the best balance point: Turbo is trained at a lower resolution than Base, and at 768 its quality is better than Base's
On 16GB RAM devices, enable optional UNet pre-init for SDXL

Requirements

Qualcomm Adreno GPU
OpenCL device extension: cl_qcom_ml_ops
CLML SDK (for building)
MNN with Attention HostOp enabled
- Example build flags: MNN_SUPPORT_TRANSFORMER_FUSE=ON
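
To check the extension requirement, look for cl_qcom_ml_ops in the device's OpenCL extension string (the space-separated value that clGetDeviceInfo returns for CL_DEVICE_EXTENSIONS). The sketch below only covers the string check; how you obtain the string on your device is up to you, and the sample string is made up for illustration.

```python
def has_extension(extensions: str, name: str = "cl_qcom_ml_ops") -> bool:
    """True if `name` appears as a whole token in a CL_DEVICE_EXTENSIONS string."""
    return name in extensions.split()

# Made-up sample; a real string comes from an OpenCL device query
sample = "cl_khr_fp16 cl_qcom_ml_ops cl_khr_3d_image_writes"
print(has_extension(sample))
```

Matching on whole tokens avoids false positives from extensions that merely share the prefix.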

Runtime notes:

Always set CLML_NO_REUSE_TNN=1 (TNN reuse causes numerical instability)
SDXL VAE requires CLML VAE + MNN Attention HostOp
SDXL 1024 needs 16GB RAM even without pre-init; pre-initializing with CLIP and UNet co-resident runs out of memory (OOM)
SD1.5 does not have this issue and supports pre-init for a smoother UX
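
Before trying SDXL 1024, it can help to confirm how much RAM the device reports. The sketch below parses the MemTotal line from /proc/meminfo text (for example, the output of adb shell cat /proc/meminfo). The helper name and the sample text are illustrative. Note that a device sold as 16GB reports somewhat less than 16 GiB, so any threshold you pick is a judgment call.

```python
import re

def mem_total_gib(meminfo_text: str) -> float:
    """Parse MemTotal (reported in kB) from /proc/meminfo text and return GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal not found")
    return int(match.group(1)) / (1024 * 1024)

# Illustrative input, not a real device dump
sample = "MemTotal:       15728640 kB\nMemFree:         1234567 kB\n"
print(f"{mem_total_gib(sample):.1f} GiB")
```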

SDK Versions

CLML SDK: v4.1 (cl_qcom_ml_ops)
QNN/SNPE SDK: 2.39 (used for SoC table source)
MNN: 3.3.0 custom build with Attention HostOp enabled (Transformer Fuse)

Supported SoCs (from Qualcomm QNN/SNPE SDK table)

Source: QNN_SDK_2.39/qairt/2.39.0.250926/docs/SNPE/html/general/overview.html

SD 8 Elite Gen 5 (SM8850)
SD 8 Elite (SM8750)
SD 8 Gen 3 (SM8650)
SD 8 Gen 2 (SM8550)
SD 8s Gen 3 (SM8635)
SD 8+ Gen 1 (SM8475)
SD 8 Gen 1 (SM8450)
888+ (SM8350P)
888 (SM8350)
7+ Gen 3 (SM7675)
7 Gen 1 (SM7450)
778G (SM7325)
865 (SM8250)
765 (SM7250)
750G (SM7225)
690 (SM6350)
695 (SM6375)
680 (SM6225)
480 (SM4350/6325)
460 (SM4250)
662 (SM6115)

Repository Layout (Release)

app/sdxl-clml/ - Android app source
release/ - release artifacts
- release/app/sdxl-clml-debug.apk
- release/bench/ (binaries + source)
- release/sd_pipelines_zh.md (full pipeline notes, Chinese)

Quick Start (SD1.5 512)

adb push release/bench/sd15_pipeline_run /data/local/tmp/sd15_pipeline_run
adb shell mkdir -p /data/local/tmp/sd15_clml
adb push <sd15_clml_weights_dir> /data/local/tmp/sd15_clml/

adb shell "CLML_NO_REUSE_TNN=1 /data/local/tmp/sd15_pipeline_run /data/local/tmp/sd15_clml/sd15_clml_weights 20"

Output: /data/local/tmp/output/clml_stable_diffusion_output.qfp32

Quick Start (SDXL 1024, recommended)

Memory note:

SDXL 1024 needs 16GB RAM even without pre-init
Pre-initializing with CLIP and UNet co-resident runs out of memory (OOM), so the app does not pre-init SDXL

1) Generate CLIP token ids (host)

conda run -n comfyui --no-capture-output python - <<'PY'
import sys
import numpy as np
from pathlib import Path
COMFY_ROOT = "<COMFYUI_ROOT>"  # path to your local ComfyUI checkout
if COMFY_ROOT not in sys.path:
    sys.path.append(COMFY_ROOT)
import comfy.sd

ckpt_path = "<SDXL_CKPT_PATH>/sd_xl_base_1.0.safetensors"
prompt = "a close-up portrait of a young woman, soft lighting, shallow depth of field"

_, clip, _, _ = comfy.sd.load_checkpoint_guess_config(
    ckpt_path,
    output_vae=False,
    output_clip=True,
    output_model=False,
)

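def token_ids(token_list):
    # Each entry is a tuple whose first element is the token id; keep only the id.
    return [int(t[0]) for t in token_list]

cond = clip.tokenize(prompt)
uncond = clip.tokenize("")

# Unconditional ids first, then conditional ids, for each text encoder ("l" and "g").
ids_l = np.array(token_ids(uncond["l"][0]) + token_ids(cond["l"][0]), dtype=np.int32)
ids_g = np.array(token_ids(uncond["g"][0]) + token_ids(cond["g"][0]), dtype=np.int32)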

Path("clip_l_ids.i32").write_bytes(ids_l.tobytes())
Path("clip_g_ids.i32").write_bytes(ids_g.tobytes())
print("ok")
PY
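
To sanity-check the generated .i32 files before pushing them to the device, you can decode them back into ids. The sketch below is illustrative: it assumes the files are little-endian int32 (what np.int32 produces on typical hosts) and that the unconditional and conditional halves have equal length, which holds when both prompts are padded to the same token count. The helper name is hypothetical.

```python
import struct

def split_cfg_ids(raw: bytes):
    """Decode little-endian int32 ids and split them into (uncond, cond) halves."""
    if len(raw) % 8 != 0:
        raise ValueError("expected an even number of int32 values")
    n = len(raw) // 4
    ids = list(struct.unpack(f"<{n}i", raw))
    half = n // 2
    return ids[:half], ids[half:]

# Synthetic example with made-up ids, not real tokenizer output
raw = struct.pack("<6i", 1, 2, 3, 4, 5, 6)
uncond, cond = split_cfg_ids(raw)
print(uncond, cond)
```

For the real files, pass in the bytes read from clip_l_ids.i32 or clip_g_ids.i32 and check that both halves have the length you expect.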

2) UNet process (CPU CLIP + CLML UNet)

adb push release/bench/sdxl_pipeline_run /data/local/tmp/sdxl_pipeline_run
adb push clip_l_ids.i32 clip_g_ids.i32 /data/local/tmp/

adb shell "LD_LIBRARY_PATH=/data/local/tmp/MNN_fuse:/system/lib64:/vendor/lib64 \
MNN_CL_LIB=/data/local/tmp/MNN_fuse/libMNN_CL.so \
CLML_MNN_ATTN_BACKEND=opencl CLML_MNN_ATTN_FP32=1 CLML_NO_REUSE_TNN=1 \
SDXL_EARLY_DECODE_K=2 SDXL_EARLY_DECODE_X0=1 SDXL_UNET_ONLY=1 \
SDXL_LATENT_OUT=/data/local/tmp/sdxl_latent_clipcpu_early2_x0.qfp32 \
/data/local/tmp/sdxl_pipeline_run \
/data/local/tmp/sdxl_clml/sdxl_clml_weights \
/data/local/tmp/MNN_clip \
/data/local/tmp/clip_l_ids.i32 /data/local/tmp/clip_g_ids.i32 \
20 7.5 0 1024 1024 /data/local/tmp/unused_output.qfp32"

3) VAE process (CLML VAE + MNN Attention)

adb push release/bench/sdxl_vae_decoder_run /data/local/tmp/sdxl_vae_decoder_run

adb shell "cd /data/local/tmp && \
LD_LIBRARY_PATH=/data/local/tmp/MNN_fuse:/system/lib64:/vendor/lib64 \
MNN_CL_LIB=/data/local/tmp/MNN_fuse/libMNN_CL.so \
MNN_BACKEND=opencl MNN_GPU_MODE=1 MNN_MEM=0 MNN_POWER=0 MNN_PREC=0 \
CLML_MNN_ATTN_BACKEND=opencl CLML_MNN_ATTN_FP32=1 CLML_NO_REUSE_TNN=1 \
./sdxl_vae_decoder_run /data/local/tmp/sdxl_clml/sdxl_clml_weights \
1 0 1 0.1 128 128 1 /data/local/tmp/sdxl_latent_clipcpu_early2_x0.qfp32"

4) Convert qfp32 to PNG (host)

The script reads ./sdxl_vae_out_clipcpu_early2_x0.qfp32, so copy the VAE decoder output from the device to the host under that name first (for example with adb pull).

python3 - <<'PY'
import numpy as np
from PIL import Image
path_in = './sdxl_vae_out_clipcpu_early2_x0.qfp32'
path_out = './sdxl_vae_out_clipcpu_early2_x0.png'
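# Raw float32 tensor in NCHW layout: (batch, channels, height, width)
arr = np.fromfile(path_in, dtype=np.float32).reshape(1, 3, 1024, 1024)
# Map from [-1, 1] to [0, 1], then clamp
img = (arr[0] / 2.0 + 0.5)
img = np.clip(img, 0.0, 1.0)
# CHW -> HWC and scale to 8-bit RGB
img = (img.transpose(1, 2, 0) * 255.0).round().astype(np.uint8)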
Image.fromarray(img).save(path_out)
print(path_out)
PY
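
If the reshape in the script fails, the file size is usually the first thing to check. The sketch below computes the expected size of a raw float32 NCHW tensor, which is how the script above interprets the .qfp32 file. The helper names are illustrative. Applying the same layout to the SD1.5 512 output is an assumption, not something this README states.

```python
def expected_qfp32_bytes(height: int, width: int, channels: int = 3, batch: int = 1) -> int:
    """Byte size of a raw float32 tensor in NCHW layout (4 bytes per value)."""
    return batch * channels * height * width * 4

def check_qfp32_size(num_bytes: int, height: int, width: int) -> None:
    """Raise if a file of num_bytes cannot hold a 1x3xHxW float32 image."""
    expected = expected_qfp32_bytes(height, width)
    if num_bytes != expected:
        raise ValueError(f"got {num_bytes} bytes, expected {expected} for {height}x{width}")

print(expected_qfp32_bytes(1024, 1024))  # 12582912
```

A typical call would pass os.path.getsize(path_in) together with the target resolution before calling np.fromfile.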

App

Source: app/sdxl-clml/
APK: release/app/sdxl-clml-debug.apk
Features: SDXL 1024 + SD1.5 512, steps, CFG, scheduler, early decode, decode x0, seed, prompt/negative prompt

Weights

Weights are hosted publicly on Hugging Face:
https://huggingface.co/zhiyuanasad/fast-diffusion-weights

SD1.5 weights: https://huggingface.co/zhiyuanasad/fast-diffusion-weights/tree/main/sd15_clml_weights
SDXL weights: https://huggingface.co/zhiyuanasad/fast-diffusion-weights/tree/main/sdxl_clml_weights

Acknowledgements

Qualcomm CLML SDK (cl_qcom_ml_ops)
MNN SDK and runtime

Notes

Full pipeline records are in release/sd_pipelines_zh.md (Chinese)
Chinese release doc: release/README_zh.md
For reproducibility, ensure the CLML and MNN runtime libraries match the versions and build options listed under SDK Versions
