Add unified MLX model format and audio-video generation improvements by james-see · Pull Request #18 · Blaizzy/mlx-video

james-see · 2026-02-01T02:52:53Z

- Add unified MLX model conversion that creates single model.safetensors
- Support loading from unified MLX format in generate_av.py
- Fix conversion bug that included quantized weights (fp4/fp8 scale keys)
- Add local path support in get_model_path (tilde expansion)
- Update README with MLX model conversion and usage instructions
- Add soundfile dependency for audio processing
- Add CLI entry points for generate_av and convert commands

The unified format provides:

- Faster loading (single file vs multiple)
- Pre-sanitized weights (no on-the-fly transformation)
- Easy sharing via HuggingFace
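
To make the conversion fix concrete, here is a minimal sketch of merging per-component weights into one flat dictionary while dropping quantization scale entries. It is not the code from this PR: the function names, the `component.key` prefix scheme, and the scale-key suffixes are all assumptions chosen for illustration, and real checkpoints may name their fp4/fp8 scale tensors differently.

```python
# Hypothetical suffixes for fp4/fp8 scale tensors; adjust to the real checkpoint.
SCALE_KEY_SUFFIXES = (".weight_scale", ".input_scale")


def drop_quantization_keys(weights):
    """Return a copy of `weights` without entries that look like scale keys."""
    return {k: v for k, v in weights.items() if not k.endswith(SCALE_KEY_SUFFIXES)}


def merge_components(components):
    """Flatten {component: {key: tensor}} into one dict keyed 'component.key'."""
    merged = {}
    for name, weights in components.items():
        for key, tensor in drop_quantization_keys(weights).items():
            merged[f"{name}.{key}"] = tensor
    return merged
```

The merged dictionary is what would then be written out as a single model.safetensors file, so the loader only has to open one file and never sees the scale keys.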

- Update description to reflect video+audio generation capabilities
- Add keywords for better PyPI discoverability
- Update README with PyPI installation instructions
- Update repository URLs and image paths
- Bump version to 0.1.2

- New enhance_prompt.py: standalone enhancement via mlx_lm with TheCluster/amoral-gemma-3-12B-v2-mlx-4bit
- --use-uncensored-enhancer CLI flag for generate_av
- Avoids content filters on words like urine, blood, etc.
- First run downloads ~7GB; logs progress to stderr

Blaizzy · 2026-03-14T19:30:47Z

Hey @james-see,

Thank you for this contribution, really appreciate the effort you put into it!

I'm going to close this PR though, as I've been reworking the Unified API on a dev branch over the past month and have taken it in a different direction. The results have been really solid: significantly lower memory usage (20-35GB) and better speed.

I'll have a PR up soon, just working through a few remaining bugs around LTX-2.3.

I don't want you spending time on something that'll get overwritten, but there will be plenty of work to pick up once that PR lands. I'll put up some issues you can grab, would love to keep you involved.

…epos)

is_unified_mlx_model() now recognizes split-weight layouts (transformer.safetensors + vae_decoder.safetensors + config.json) in addition to single-file model.safetensors. Also detects via split_model.json manifest from mlx-forge conversions.

Updated weight loading across all components (transformer, VAE decoder/encoder, upsampler, connector, text encoder) to load from individual split files when model.safetensors is absent. Fixes dgrauet/ltx-2.3-mlx-distilled-q4 "text encoder config mismatch".
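
The layout detection described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's is_unified_mlx_model(): the function name `detect_unified_layout`, its return values, and the order in which the layouts are checked are hypothetical. Only the file names come from the commit message.

```python
from pathlib import Path

SINGLE_FILE = "model.safetensors"
SPLIT_FILES = ("transformer.safetensors", "vae_decoder.safetensors", "config.json")
SPLIT_MANIFEST = "split_model.json"


def detect_unified_layout(model_dir):
    """Return 'single', 'manifest', 'split', or None for an unrecognized directory."""
    root = Path(model_dir).expanduser()  # also handles '~/models/...' paths
    if (root / SINGLE_FILE).is_file():
        return "single"
    if (root / SPLIT_MANIFEST).is_file():
        return "manifest"
    if all((root / name).is_file() for name in SPLIT_FILES):
        return "split"
    return None
```

Returning the layout name rather than a plain boolean lets the weight loaders decide whether to open model.safetensors or the individual split files.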

Support gated-attention transformer blocks, prompt scale/shift modulation, and BigVGAN/SnakeBeta vocoder loading for distilled split-weight checkpoints. Also add unified VAE layout fallback to Lightricks/LTX-2 when the embedded decoder topology is not yet supported.

Detect LTX-2.3 checkpoints and resolve the matching 2.3 spatial upsampler during stage-2 refinement. For 2.3 models, remove fallback to LTX-2 upsampler weights so incompatible latent refinement paths fail fast instead of producing garbled output.

James Campbell and others added 17 commits January 29, 2026 03:51

Add PyPI auto-publish workflow and library API exports

5f421d2

- Add GitHub Actions workflow to publish to PyPI on tagged releases
- Export generate_video and generate_video_with_audio functions
- Bump version to 0.1.0 for first PyPI release

Fix package name to mlx-video-with-audio

bf09c7c

- Rename package from mlx-video to mlx-video-with-audio
- Update project URLs to correct repository
- Bump version to 0.1.1

Add structured progress logging for external parsers

d2aeef5

- Add stage parameter to denoise() and denoise_av() functions
- Emit STAGE:N:STEP:X:Y:Denoising to stderr for progress tracking
- Enables external apps to parse generation progress
- Bump version to 0.1.3
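
For a host application consuming these lines, a parser could look something like the sketch below. It assumes that N is the stage number, X the current step, and Y the total step count, all as non-negative integers; the commit message does not spell that out, and `Progress` and `parse_progress` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape: STAGE:<stage>:STEP:<current>:<total>:<label>
PROGRESS_RE = re.compile(r"^STAGE:(\d+):STEP:(\d+):(\d+):(.*)$")


class Progress(NamedTuple):
    stage: int
    step: int
    total: int
    label: str


def parse_progress(line: str) -> Optional[Progress]:
    """Parse one stderr line; return None for lines that are not progress tokens."""
    match = PROGRESS_RE.match(line.strip())
    if match is None:
        return None
    stage, step, total, label = match.groups()
    return Progress(int(stage), int(step), int(total), label)
```

Returning None for unrelated lines means the caller can feed every stderr line through the parser and simply ignore ordinary log output.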

v0.1.5: Don't save audio track separately by default

0eb52a5

- --save-audio-separately flag (default off)
- When off: use temp wav for muxing, delete after
- When on or --output-audio: keep .wav file
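
The temp-file behaviour can be illustrated with a small sketch. It is not the PR's implementation: `mux_with_audio`, `write_wav`, and `mux` are hypothetical stand-ins for whatever writes the audio and combines it with the video, and passing `keep_audio_path` plays the role of the "keep the .wav" case.

```python
import os
import tempfile


def mux_with_audio(write_wav, mux, video_path, keep_audio_path=None):
    """Write audio, mux it into video_path, and delete the wav unless asked to keep it."""
    if keep_audio_path is not None:
        wav_path = keep_audio_path
    else:
        # Reserve a temporary file name; the writer callback fills it in.
        fd, wav_path = tempfile.mkstemp(suffix=".wav")
        os.close(fd)
    try:
        write_wav(wav_path)
        mux(video_path, wav_path)
    finally:
        # Only clean up files this function created itself.
        if keep_audio_path is None and os.path.exists(wav_path):
            os.remove(wav_path)
    return keep_audio_path
```

Putting the cleanup in a `finally` block means the temporary wav is removed even if muxing fails partway through.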

Fix mlx-lm 0.25+ API: use sampler instead of temp for generate()

a774c31

Fix unified model: avoid Lightricks download, use model_path and MLX Gemma

c3a81ff

v0.1.8: Auto-use cached text encoder, default to smaller 4-bit Gemma (~7.5GB)

ddce860

v0.1.9: Fix VAE decoder conv_in - skip transpose when loading from unified model

4a6212b

v0.1.10: Restore bf16 Gemma text encoder default for generation quality

69d0df7

v0.1.11: Restore get_model_path in utils to fix import/validation

9a692c5

v0.1.12: Fix unified connector key mapping to restore generation quality

3daf53b

v0.1.13: Fix malformed unified Conv3d decoder weight layout

7161489

Detect and correct unified checkpoint conv3d tensors saved as (O,D,H,I,W) by swapping trailing dims to MLX layout (O,D,H,W,I), preventing decoder channel mismatch crashes.
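
One way to picture the detection step is a shape-only check like the sketch below. It is an assumption-laden illustration rather than the PR's logic: the helper names are hypothetical, and it assumes the caller already knows the layer's expected input channel count. The actual repair would swap the last two axes of the tensor itself.

```python
def needs_trailing_swap(shape, in_channels):
    """True if a 5-D conv3d weight looks like (O, D, H, I, W) instead of (O, D, H, W, I)."""
    shape = tuple(shape)
    if len(shape) != 5:
        return False
    # Channels sit in position 3 but belong in position 4.
    return shape[3] == in_channels and shape[4] != in_channels


def corrected_shape(shape, in_channels):
    """Return the shape the weight would have after swapping its trailing dims."""
    shape = tuple(shape)
    if needs_trailing_swap(shape, in_channels):
        return shape[:3] + (shape[4], shape[3])
    return shape
```

The check is deliberately conservative: when the kernel width happens to equal the channel count, the two layouts are indistinguishable by shape and the weight is left untouched.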

v0.1.14: add native prompt-enhancer fallback stability

60f1ec4

Fallback to the original prompt when enhancement fails and emit structured stderr tokens so host apps can surface clear status without failing generation.
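
The fallback pattern might look roughly like this sketch. The token strings `ENHANCE:OK` and `ENHANCE:FALLBACK:<error>` are invented for illustration, as is the `enhance_or_fallback` helper; the commit message does not show the tokens the PR actually emits.

```python
import sys


def enhance_or_fallback(prompt, enhancer, log=sys.stderr):
    """Return the enhanced prompt, or the original prompt if enhancement fails."""
    try:
        enhanced = enhancer(prompt)
        if not enhanced or not enhanced.strip():
            raise ValueError("enhancer returned an empty prompt")
    except Exception as exc:
        # Hypothetical token format: one machine-readable line per outcome.
        print(f"ENHANCE:FALLBACK:{type(exc).__name__}", file=log)
        return prompt
    print("ENHANCE:OK", file=log)
    return enhanced.strip()
```

Because the failure is reported on stderr and the original prompt is returned, generation can continue while the host app still learns that enhancement was skipped.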

Merge branch 'main' into main

1601b24

James Campbell added 6 commits March 14, 2026 15:34

v0.1.15: guard text-encoder config schema and improve diagnostics

360d63d

Fail fast with actionable errors when a unified AV config is used as a text encoder config, and emit structured resolution diagnostics to make issue Blaizzy#20 triage deterministic.

v0.1.18: fix distilled VAE decode topology

f1be9c5

Build the unified VAE decoder from embedded decoder_blocks so distilled checkpoints use their native topology instead of a mismatched static layout. This resolves garbled/distorted output caused by incorrect upsampling path selection.

v0.1.20: align distilled AV decode math paths

02d5adc

Wire distilled vocoder BWE execution with checkpoint mel_stft tensors, fix 2.3 upsampler key/layout mapping, and tighten unified VAE Conv3d layout handling so unified split weights are not re-transposed incorrectly.

Blaizzy closed this Mar 18, 2026

james-see wants to merge 23 commits into Blaizzy:main from james-see:main