VDLM is an inference framework for serving masked diffusion language models (MDMs) with an OpenAI-style API.
```shell
# start the server
python api_server.py

# send a test request
python test_request.py
```
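Since the server speaks an OpenAI-style API, a request can be built like any OpenAI chat completion call. The sketch below constructs (without sending) such a request; the URL, port, and model name are assumptions, not the project's documented defaults:

```python
import json
from urllib import request

# Hypothetical endpoint; adjust host/port to match your api_server.py config.
URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "llada",  # model name is illustrative
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64,
}

def build_request(url: str = URL) -> request.Request:
    """Build an OpenAI-style chat completion request without sending it."""
    return request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request()
# To actually send it (requires the server running):
#   urllib.request.urlopen(req)
```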
Video sped up for demonstration purposes
Tests are written with pytest; run them with `pytest`.
- by default, tests run the server with a mock engine loop instead of loading a real model
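The mock-engine pattern can be sketched as a loop that streams canned tokens instead of running inference. This is a minimal illustration, not the project's actual classes; all names here are hypothetical:

```python
import queue
import threading

class MockEngine:
    """Stands in for the real model engine during tests: streams canned
    tokens instead of running inference. Illustrative only."""

    def __init__(self, canned=("Hello", " world")):
        self.canned = canned
        self.requests: queue.Queue = queue.Queue()
        self._stop = threading.Event()

    def loop(self):
        # Engine loop: pull a request, stream back canned tokens, repeat.
        while not self._stop.is_set():
            try:
                prompt, out = self.requests.get(timeout=0.1)
            except queue.Empty:
                continue
            for tok in self.canned:
                out.put(tok)
            out.put(None)  # sentinel: generation finished

    def submit(self, prompt: str) -> queue.Queue:
        out: queue.Queue = queue.Queue()
        self.requests.put((prompt, out))
        return out

    def stop(self):
        self._stop.set()

engine = MockEngine()
threading.Thread(target=engine.loop, daemon=True).start()
out = engine.submit("hi")
tokens = []
while (tok := out.get()) is not None:
    tokens.append(tok)
engine.stop()
# tokens == ["Hello", " world"]
```

Swapping this in for the real engine keeps the API-server tests fast and model-free.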
- add more architectures; the current code only supports LLaDA
- implement CUDA graph capture for model serving
- cancellable engine requests
- dynamic request batching
- faster IPC using ZMQ + msgpack over `multiprocessing.Queue`
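One of the TODO items, dynamic request batching, usually amounts to waiting for the first request and then draining whatever else arrives within a short window. A stdlib-only sketch of that collection step, under assumed batch-size and window parameters:

```python
import queue
import time

def collect_batch(q: queue.Queue, max_batch: int = 8, window_s: float = 0.01):
    """Dynamic batching sketch: block for the first request, then drain
    whatever else arrives within `window_s`, up to `max_batch` items."""
    batch = [q.get()]  # wait for at least one request
    deadline = time.monotonic() + window_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for i in range(3):
    q.put(f"req-{i}")
batch = collect_batch(q)
print(batch)  # → ['req-0', 'req-1', 'req-2']
```

The window trades a little latency on the first request for higher GPU utilization when traffic is bursty.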
The model generation and config-loading code is adapted from fast-dLLM.
- a slight modification was made to the original RoPE implementation for `torch.compile` compatibility
- some numerical precision issues were observed; see link for more info
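For context, standard RoPE (independent of the modified fast-dLLM implementation) rotates each pair of channels by a position-dependent angle. A conceptual NumPy sketch, not the project's code:

```python
import numpy as np

def rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings to x of shape (seq, dim).

    Conceptual sketch of standard RoPE: channel pair i at position p is
    rotated by angle p * base**(-2i/dim). Not the modified fast-dLLM code.
    """
    seq, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) * 2.0 / dim)  # (half,)
    angles = positions[:, None] * inv_freq[None, :]    # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]                  # split channel pairs
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

x = np.random.randn(4, 8)
# Position 0 rotates by angle 0, so it must leave the vectors unchanged.
assert np.allclose(rope(x, np.zeros(4)), x)
```

Because the rotation is a pure orthogonal transform, any numerical drift comes from the trig evaluations and dtype of `cos`/`sin`, which is where reduced-precision issues typically show up.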
