Add unified MLX model format and audio-video generation improvements#18

Closed
james-see wants to merge 23 commits into Blaizzy:main from james-see:main

@james-see

  • Add unified MLX model conversion that creates single model.safetensors
  • Support loading from unified MLX format in generate_av.py
  • Fix conversion bug that included quantized weights (fp4/fp8 scale keys)
  • Add local path support in get_model_path (tilde expansion)
  • Update README with MLX model conversion and usage instructions
  • Add soundfile dependency for audio processing
  • Add CLI entry points for generate_av and convert commands
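
As an illustration of the local path support listed above, here is a minimal sketch of tilde expansion before falling back to a remote repo id. The helper name `resolve_local_model_path` is hypothetical and is not the PR's `get_model_path`; it only shows the standard-library mechanism.

```python
from pathlib import Path
from typing import Optional


def resolve_local_model_path(path_or_repo: str) -> Optional[Path]:
    """Return an absolute path if the argument names an existing local directory.

    Hypothetical helper: returns None when the argument should instead be
    treated as a remote repo id (for example "org/name").
    """
    # expanduser() turns a leading "~" into the current user's home directory.
    candidate = Path(path_or_repo).expanduser()
    if candidate.is_dir():
        return candidate.resolve()
    return None
```

A caller would try this first and only attempt a download when it returns `None`.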

The unified format provides:

  • Faster loading (single file vs multiple)
  • Pre-sanitized weights (no on-the-fly transformation)
  • Easy sharing via HuggingFace
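
The conversion idea described above can be sketched roughly as follows: merge per-component weight dicts into one flat mapping while skipping quantization scale tensors. The function name, the component prefixes, and the suffix list are all assumptions for illustration; the actual key names depend on the source checkpoint and are not taken from this PR.

```python
from typing import Any, Dict

# Assumed suffixes for fp4/fp8 scale tensors; real checkpoints may differ.
QUANT_SCALE_SUFFIXES = (".weight_scale", ".input_scale")


def build_unified_weights(components: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Flatten {component: {key: tensor}} into {"component.key": tensor}.

    Keys that look like quantization scale tensors are dropped so they do
    not end up in the unified file.
    """
    unified: Dict[str, Any] = {}
    for component, weights in components.items():
        for key, tensor in weights.items():
            if key.endswith(QUANT_SCALE_SUFFIXES):
                continue  # skip quantization side tensors
            unified[f"{component}.{key}"] = tensor
    return unified
```

The resulting dict could then be written to a single safetensors file by whatever serializer the project uses.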

James Campbell and others added 17 commits January 29, 2026 03:51
- Add GitHub Actions workflow to publish to PyPI on tagged releases
- Export generate_video and generate_video_with_audio functions
- Bump version to 0.1.0 for first PyPI release
- Rename package from mlx-video to mlx-video-with-audio
- Update project URLs to correct repository
- Bump version to 0.1.1
- Update description to reflect video+audio generation capabilities
- Add keywords for better PyPI discoverability
- Update README with PyPI installation instructions
- Update repository URLs and image paths
- Bump version to 0.1.2
- Add stage parameter to denoise() and denoise_av() functions
- Emit STAGE:N:STEP:X:Y:Denoising to stderr for progress tracking
- Enables external apps to parse generation progress
- Bump version to 0.1.3
- New enhance_prompt.py: standalone enhancement via mlx_lm with TheCluster/amoral-gemma-3-12B-v2-mlx-4bit
- --use-uncensored-enhancer CLI flag for generate_av
- Avoids content filters on words like urine, blood, etc.
- First run downloads ~7GB; logs progress to stderr
- --save-audio-separately flag (default off)
- When off: use temp wav for muxing, delete after
- When on or --output-audio: keep .wav file
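
For host apps that want to consume the `STAGE:N:STEP:X:Y:Denoising` lines mentioned in the commit list above, a parser could look roughly like this. It assumes `X` is the current step and `Y` is the total number of steps, which the commit message does not spell out; the `Progress` type and `parse_progress` name are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Matches lines such as "STAGE:1:STEP:3:8:Denoising".
PROGRESS_RE = re.compile(r"^STAGE:(\d+):STEP:(\d+):(\d+):(.*)$")


class Progress(NamedTuple):
    stage: int
    step: int   # assumed: current step
    total: int  # assumed: total steps in this stage
    label: str


def parse_progress(line: str) -> Optional[Progress]:
    """Parse one stderr line; return None for lines that are not progress tokens."""
    match = PROGRESS_RE.match(line.strip())
    if match is None:
        return None
    stage, step, total, label = match.groups()
    return Progress(int(stage), int(step), int(total), label)
```

A host app would read the child process's stderr line by line and ignore every line for which this returns `None`.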
Detect and correct unified checkpoint conv3d tensors saved as (O,D,H,I,W) by swapping trailing dims to MLX layout (O,D,H,W,I), preventing decoder channel mismatch crashes.
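
The layout correction described above amounts to swapping the last two axes of the weight. Below is a minimal sketch using NumPy for illustration (the project itself works with MLX arrays); the function name and the detection rule based on a known `in_channels` value are assumptions, not the PR's actual heuristic.

```python
import numpy as np


def to_mlx_conv3d_layout(weight: np.ndarray, in_channels: int) -> np.ndarray:
    """Return a conv3d weight in (O, D, H, W, I) layout.

    Assumed detection rule: if the last axis already equals in_channels the
    tensor is left alone; if the second-to-last axis does, the tensor is
    taken to be (O, D, H, I, W) and the trailing axes are swapped. When both
    trailing axes equal in_channels the layout is ambiguous and the tensor
    is returned unchanged.
    """
    if weight.ndim != 5:
        raise ValueError(f"expected a 5-D conv3d weight, got {weight.ndim}-D")
    if weight.shape[-1] == in_channels:
        return weight
    if weight.shape[-2] == in_channels:
        return np.swapaxes(weight, -1, -2)
    raise ValueError(f"no trailing axis matches in_channels={in_channels}")
```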
Fall back to the original prompt when enhancement fails, and emit structured stderr tokens so host apps can surface clear status without failing generation.
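
The fallback behavior can be sketched as a small wrapper around the enhancer. The wrapper name and the `ENHANCE:*` token strings are hypothetical; the commit message does not specify the actual token format.

```python
import sys
from typing import Callable


def enhance_or_fallback(prompt: str, enhancer: Callable[[str], str]) -> str:
    """Run the enhancer, returning the original prompt if it fails.

    Status tokens (hypothetical names) go to stderr so a host app can
    report what happened without generation being aborted.
    """
    try:
        enhanced = enhancer(prompt)
    except Exception as exc:  # any enhancer failure should not stop generation
        print(f"ENHANCE:FAILED:{type(exc).__name__}", file=sys.stderr)
        return prompt
    if not enhanced or not enhanced.strip():
        print("ENHANCE:EMPTY", file=sys.stderr)
        return prompt
    print("ENHANCE:OK", file=sys.stderr)
    return enhanced.strip()
```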
Blaizzy (Owner) commented Mar 14, 2026

Hey @james-see,

Thank you for this contribution, really appreciate the effort you put into it!

I'm going to close this PR though, as I've been reworking the Unified API on a dev branch over the past month and have taken it in a different direction. The results have been really solid: significantly lower memory usage (20-35GB) and better speed.

I'll have a PR up soon, just working through a few remaining bugs around LTX-2.3.

I don't want you spending time on something that'll get overwritten, but there will be plenty of work to pick up once that PR lands. I'll put up some issues you can grab; I would love to keep you involved.

James Campbell added 6 commits March 14, 2026 15:34
Fail fast with actionable errors when a unified AV config is used as a text encoder config, and emit structured resolution diagnostics to make issue Blaizzy#20 triage deterministic.
…epos)

is_unified_mlx_model() now recognizes split-weight layouts
(transformer.safetensors + vae_decoder.safetensors + config.json)
in addition to single-file model.safetensors. Also detects via
split_model.json manifest from mlx-forge conversions.

Updated weight loading across all components (transformer, VAE
decoder/encoder, upsampler, connector, text encoder) to load from
individual split files when model.safetensors is absent.

Fixes dgrauet/ltx-2.3-mlx-distilled-q4 "text encoder config mismatch".
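
The layout detection described in this commit message could look roughly like the sketch below. It is not the PR's `is_unified_mlx_model()`; the function name, the return values, and the order of checks are assumptions, and only the file names come from the message above.

```python
from pathlib import Path
from typing import Optional, Union

# File names taken from the commit message above.
SPLIT_REQUIRED = ("transformer.safetensors", "vae_decoder.safetensors", "config.json")


def detect_unified_layout(model_dir: Union[str, Path]) -> Optional[str]:
    """Return "single", "split", or None for a local model directory.

    Assumed precedence: a single model.safetensors wins, then a
    split_model.json manifest, then the set of required split files.
    """
    root = Path(model_dir)
    if (root / "model.safetensors").is_file():
        return "single"
    if (root / "split_model.json").is_file():
        return "split"
    if all((root / name).is_file() for name in SPLIT_REQUIRED):
        return "split"
    return None
```

A loader could branch on the returned value to decide whether to read one file or the individual component files.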
Support gated-attention transformer blocks, prompt scale/shift modulation, and BigVGAN/SnakeBeta vocoder loading for distilled split-weight checkpoints. Also add unified VAE layout fallback to Lightricks/LTX-2 when the embedded decoder topology is not yet supported.
Build the unified VAE decoder from embedded decoder_blocks so distilled checkpoints use their native topology instead of a mismatched static layout. This resolves garbled/distorted output caused by incorrect upsampling path selection.
Detect LTX-2.3 checkpoints and resolve the matching 2.3 spatial upsampler during stage-2 refinement. For 2.3 models, remove fallback to LTX-2 upsampler weights so incompatible latent refinement paths fail fast instead of producing garbled output.
Wire distilled vocoder BWE execution with checkpoint mel_stft tensors, fix 2.3 upsampler key/layout mapping, and tighten unified VAE Conv3d layout handling so unified split weights are not re-transposed incorrectly.
Blaizzy closed this Mar 18, 2026