feat: add support for init audio allowing Audio to Video (A2V) for LTX2.3#1676

Draft
LostRuins wants to merge 3 commits into
leejet:masterfrom
LostRuins:add_ltx_a2v_support

Conversation

@LostRuins

Contributor

Summary

This PR adds support for an init audio input, enabling Audio to Video (A2V) generation with LTX2.3.

Related Issue / Discussion

Implements and closes #1674

Additional Information

!! AI Disclosure: This PR was created with the assistance of OpenAI Codex !!

Besides I2V, which creates a video from a start frame, LTX2.3 also supports audio to video (A2V): generating a video from an audio clip, as seen in https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/src/ltx_pipelines/a2vid_two_stage.py

Previously, stable-diffusion.cpp ignored the audio VAE encoder. With this PR, the audio latent can be created from a .wav file passed with --init-audio, and that latent is used instead of denoising from the default audio latent.
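To illustrate the kind of input this expects, here is a minimal Python sketch that decodes a 16-bit PCM .wav into normalized float samples. It is not the loader added in this PR: the function name `load_pcm16_wav`, the 16-bit-only restriction, and the scaling to [-1.0, 1.0) are all assumptions made for illustration.

```python
import array
import sys
import wave


def load_pcm16_wav(path):
    """Hypothetical helper: read a 16-bit PCM .wav file.

    Returns (sample_rate, channels, samples), where samples is a flat,
    channel-interleaved list of floats in [-1.0, 1.0).
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        raw = wav.readframes(wav.getnframes())

    # WAV stores samples little-endian; swap on big-endian hosts.
    pcm = array.array("h")
    pcm.frombytes(raw)
    if sys.byteorder == "big":
        pcm.byteswap()

    # Scale signed 16-bit integers to floats (assumed convention).
    samples = [s / 32768.0 for s in pcm]
    return sample_rate, channels, samples
```

A quick way to sanity-check an input file before passing it to --init-audio is to call `load_pcm16_wav("tonystark.wav")` and inspect the returned sample rate and channel count.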

I tested with the distilled LTX2.3 model.

Run with CLI like this:

sd-cli.exe -M vid_gen --diffusion-model ltx-2.3-22b-distilled-1.1-Q4_K_S.gguf --vae ltx-2.3-22b-distilled_video_vae.safetensors --audio-vae ltx-2.3-22b-distilled_audio_vae.safetensors --llm gemma3-12b-it-Q4_K_M.gguf --embeddings-connectors ltx-2.3-22b-distilled_embeddings_connectors.safetensors -p "man speaking" --cfg-scale 1.0 --sampling-method euler -v -W 256 -H 256 --video-frames 65 --init-audio tonystark.wav -o output.avi

Example outputs and audio inputs for reference:

rap.mp4
tony.mp4

epic_rap.wav
tony_stark_cave.wav

Known limitations:

  • As this project does not currently ingest audio, a simple PCM .wav to sd_audio_t file loader was added. This can probably be improved or replaced (e.g. with miniaudio or dr_wav).
  • CLI only; not yet added to the sd.cpp server.
  • The .wav audio and the generated video should have matching lengths. For example, 4 s of audio at 16 fps calls for 16 × 4 + 1 = 65 generated frames.
  • Including the text of the spoken words in your prompt may or may not improve results. I have seen mixed results, but often something as simple as "man speaking" is enough for the lips and mouth to line up with the audio, though there is some randomness involved. YMMV.
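
The length-matching rule above (fps × seconds + 1) can be scripted so you do not have to work out --video-frames by hand. The sketch below is a standalone Python helper, not part of this PR. The function names are hypothetical, the default of 16 fps is taken from the example above, and rounding to the nearest frame is an assumption for clips whose duration is not a whole number of frames.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def video_frames_for_duration(seconds: float, fps: int = 16) -> int:
    """Frame count from the rule of thumb above: fps * seconds + 1.

    Rounding to the nearest whole frame is an assumption of this sketch.
    """
    return round(seconds * fps) + 1
```

For a 4 s clip at 16 fps, `video_frames_for_duration(4.0)` gives 65, matching the example. For a real file, `video_frames_for_duration(wav_duration_seconds("tonystark.wav"))` yields the value to pass as --video-frames.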

Checklist

pinging those who might be interested: @stduhpf @wbruna
@leejet Marking PR as draft, but feel free to modify (you have branch access) or merge anytime at your discretion. If you prefer to replace this implementation, that's fine too.

Development

Successfully merging this pull request may close these issues.

[Feature] Is LTX2.3 audio-to-video (A2V) supported?
