feat: add support for init audio allowing Audio to Video (A2V) for LTX2.3 by LostRuins · Pull Request #1676 · leejet/stable-diffusion.cpp

LostRuins · 2026-06-17T15:54:49Z

Summary

This PR adds support for an init audio input, enabling Audio to Video (A2V) generation for LTX2.3.

Related Issue / Discussion

Implements and closes #1674

Additional Information

!! AI Disclosure: This PR was created with the assistance of OpenAI Codex !!

Besides I2V, which creates a video from a start frame, LTX2.3 also supports audio to video (A2V), which generates a video from an audio clip, as seen in https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/src/ltx_pipelines/a2vid_two_stage.py

Previously, stable-diffusion.cpp ignored the audio VAE encoder. This PR allows the audio latent to be created from a .wav audio file provided with --init-audio, and that latent is used instead of denoising from the default audio latent.
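For readers unfamiliar with the format, reading a PCM .wav file mostly means walking the RIFF chunks until the "fmt " and "data" chunks are found. The following is a minimal sketch of that idea only; it is not the loader added in this PR. `WavInfo` and `parse_pcm_wav` are hypothetical names, and the sketch assumes an in-memory buffer containing integer PCM (format tag 1).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Hypothetical struct for illustration; this is NOT the PR's sd_audio_t.
struct WavInfo {
    uint16_t channels;
    uint32_t sample_rate;
    uint16_t bits_per_sample;
    size_t data_offset;  // byte offset of the first PCM sample in the buffer
    uint32_t data_size;  // size of the PCM payload in bytes
};

// WAV fields are little-endian regardless of host byte order.
static uint32_t read_u32le(const uint8_t* p) {
    return uint32_t(p[0]) | (uint32_t(p[1]) << 8) |
           (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
}

static uint16_t read_u16le(const uint8_t* p) {
    return uint16_t(p[0] | (p[1] << 8));
}

// Hypothetical helper: returns format info for an integer-PCM WAV buffer,
// or std::nullopt if the buffer is malformed or uses another encoding.
std::optional<WavInfo> parse_pcm_wav(const std::vector<uint8_t>& buf) {
    if (buf.size() < 12 ||
        std::memcmp(buf.data(), "RIFF", 4) != 0 ||
        std::memcmp(buf.data() + 8, "WAVE", 4) != 0) {
        return std::nullopt;
    }
    WavInfo info{};
    bool have_fmt = false;
    size_t pos = 12;  // first chunk starts right after "RIFF<size>WAVE"
    while (pos + 8 <= buf.size()) {
        const uint8_t* chunk = buf.data() + pos;
        const uint32_t size = read_u32le(chunk + 4);
        const size_t body = pos + 8;
        if (size > buf.size() - body) {
            return std::nullopt;  // chunk claims more bytes than exist
        }
        if (std::memcmp(chunk, "fmt ", 4) == 0) {
            if (size < 16 || read_u16le(buf.data() + body) != 1) {
                return std::nullopt;  // only integer PCM (format tag 1)
            }
            info.channels = read_u16le(buf.data() + body + 2);
            info.sample_rate = read_u32le(buf.data() + body + 4);
            info.bits_per_sample = read_u16le(buf.data() + body + 14);
            have_fmt = true;
        } else if (std::memcmp(chunk, "data", 4) == 0) {
            if (!have_fmt) {
                return std::nullopt;  // need the format before the samples
            }
            info.data_offset = body;
            info.data_size = size;
            return info;
        }
        pos = body + size + (size & 1);  // chunks are padded to even length
    }
    return std::nullopt;  // no data chunk found
}
```

Unknown chunks (for example "LIST") are skipped rather than rejected, which is why the loop advances by the declared chunk size instead of assuming a fixed 44-byte header.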

I tested with the distilled LTX2.3 model.

Run with CLI like this:

sd-cli.exe -M vid_gen --diffusion-model ltx-2.3-22b-distilled-1.1-Q4_K_S.gguf --vae ltx-2.3-22b-distilled_video_vae.safetensors --audio-vae ltx-2.3-22b-distilled_audio_vae.safetensors --llm gemma3-12b-it-Q4_K_M.gguf --embeddings-connectors ltx-2.3-22b-distilled_embeddings_connectors.safetensors -p "man speaking" --cfg-scale 1.0 --sampling-method euler -v -W 256 -H 256 --video-frames 65 --init-audio tonystark.wav -o output.avi

Example outputs and audio inputs for reference:

rap.mp4

tony.mp4

epic_rap.wav
tony_stark_cave.wav

Known limitations:

As this project does not currently ingest audio, a simple PCM .wav to sd_audio_t file loader was added. This can probably be improved or replaced with an existing library (e.g. miniaudio or dr_wav).
CLI only; this is not yet added to the sd.cpp server.
The .wav audio length and the generated video length should match. For example, with a 4 s audio clip at 16 fps, you would generate 16 x 4 + 1 = 65 frames.
Including the text of the spoken words in your prompt may or may not improve results. I have seen mixed results, but often something as simple as "man speaking" is enough for the lips and mouth to line up with the audio, though there is some element of randomness involved. YMMV.
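The frame-count rule from the limitations above (frames = fps x seconds + 1) can be written as a small helper. `frames_for_audio` is a hypothetical name used for illustration, not a function in stable-diffusion.cpp, and rounding to the nearest frame when the product is not an integer is an assumption.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, not part of stable-diffusion.cpp.
// Returns the number of video frames that covers an audio clip:
// one frame per 1/fps seconds, plus the initial frame.
int frames_for_audio(double audio_seconds, int fps) {
    assert(audio_seconds > 0.0 && fps > 0);
    // Round to the nearest whole frame (assumption for fractional lengths).
    return static_cast<int>(std::lround(audio_seconds * fps)) + 1;
}
```

For the 4 s clip at 16 fps, `frames_for_audio(4.0, 16)` returns 65, which matches the `--video-frames 65` value in the command above.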

Checklist

I have read and confirmed this PR follows the contribution guidelines.

pinging those who might be interested: @stduhpf @wbruna
@leejet Marking PR as draft, but feel free to modify (you have branch access) or merge anytime at your discretion. If you prefer to replace this implementation, that's fine too.

LostRuins added 3 commits June 17, 2026 21:45

added support for Audio to Video (A2V) for LTX2.3

fb7453b

minor linting format-code

8ab6cf8

corrected the help text for init audio to be more precise

ae7f9ef

LostRuins wants to merge 3 commits into leejet:master from LostRuins:add_ltx_a2v_support
