
VDLM Overview

VDLM is an inference framework for serving masked diffusion language models (MDMs) with an OpenAI-style API.
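For readers unfamiliar with MDMs: instead of generating left-to-right, a masked diffusion model starts from a fully masked sequence and unmasks tokens over several steps, committing the most confident predictions first. This is a toy sketch of that general idea, not VDLM's or LLaDA's actual decoding code (LLaDA's low-confidence remasking schedule is more involved):

```python
import random

def mdm_decode(score_fn, length, steps=4, seed=0):
    """Toy masked-diffusion decoding: start fully masked, then over a few
    steps unmask the positions the model is most confident about,
    reconditioning on the partial sequence each step."""
    rng = random.Random(seed)
    MASK = None
    seq = [MASK] * length
    per_step = max(1, length // steps)
    while MASK in seq:
        # score_fn returns (token, confidence) for every masked position
        scores = {i: score_fn(seq, i, rng) for i, t in enumerate(seq) if t is MASK}
        # commit only the highest-confidence predictions this step
        best = sorted(scores, key=lambda i: scores[i][1], reverse=True)[:per_step]
        for i in best:
            seq[i] = scores[i][0]
    return seq

# Dummy stand-in "model": predicts the position index as the token.
def dummy_score(seq, i, rng):
    return i, rng.random()
```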

Running the server

python api_server.py     # start the API server
python test_request.py   # send a sample request against it
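A minimal client sketch follows. The endpoint path, port, and model name here are assumptions (check test_request.py for the actual request the repo sends); only the OpenAI chat payload shape is standard:

```python
import json

# Hypothetical request payload in the OpenAI chat-completions format.
payload = {
    "model": "llada",                                    # assumed model id
    "messages": [{"role": "user", "content": "Hello"}],  # standard chat format
    "max_tokens": 64,
}
body = json.dumps(payload)
# e.g. POST body to http://localhost:8000/v1/chat/completions (assumed URL)
# with header Content-Type: application/json.
```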

Demo

Video sped up for demonstration purposes

Demo Gif

Tests

Written with pytest; run with pytest

  • by default, tests run the server with a mock engine loop rather than loading a real model
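The mock-engine pattern looks roughly like this. Names such as MockEngine and generate() are invented for illustration, not VDLM's actual API:

```python
# Hypothetical illustration of testing a handler with a mock engine.
class MockEngine:
    """Stands in for a real model: returns a canned completion instantly."""
    def generate(self, prompt, max_tokens=16):
        return ("echo: " + prompt)[: max_tokens + 6]

def handle_request(engine, prompt):
    # The handler only needs something with a .generate() method, so tests
    # can inject MockEngine instead of loading real model weights.
    return {"choices": [{"text": engine.generate(prompt)}]}

def test_handle_request_with_mock_engine():
    out = handle_request(MockEngine(), "hi")
    assert out["choices"][0]["text"] == "echo: hi"
```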

Work in Progress

  • add more architectures; the current code only supports LLaDA
  • implement CUDA graph capture for model serving
  • cancellable engine requests
  • dynamic request batching
  • faster IPC using ZMQ + msgpack in place of multiprocessing.Queue
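Dynamic request batching (one of the items above) typically means collecting incoming requests until a batch fills or a short wait elapses. A minimal sketch of that pattern under those assumptions, not VDLM's planned design:

```python
import queue

def collect_batch(q, max_batch=8, timeout_s=0.01):
    """Pop up to max_batch requests: wait at most timeout_s for the first
    one, then drain whatever else is already queued without extra waiting."""
    batch = []
    try:
        batch.append(q.get(timeout=timeout_s))  # block briefly for work
        while len(batch) < max_batch:
            batch.append(q.get_nowait())        # grab the rest if present
    except queue.Empty:
        pass
    return batch
```

Draining without further waiting keeps latency low when traffic is light, while still filling large batches under load.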

Acknowledgements

Model generation + load config code is from fast-dLLM.

  • a slight modification was added to the original RoPE implementation for torch compilability
    • some numerical precision issues were observed, see link for more info
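For context, RoPE (rotary position embedding) rotates pairs of query/key dimensions by position-dependent angles. This is a generic pure-Python sketch of the standard formulation, not the repo's (or fast-dLLM's) modified implementation:

```python
import math

def rope_rotate_pair(x0, x1, pos, dim_pair, head_dim, base=10000.0):
    """Rotate one (even, odd) dimension pair by angle pos * theta_i, where
    theta_i = base^(-2i / head_dim) -- the standard RoPE rotation."""
    theta = base ** (-2.0 * dim_pair / head_dim)
    angle = pos * theta
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c
```

Because each pair is rotated (norm-preserving), dot products between rotated queries and keys depend only on the relative position, which is the property RoPE is built around.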
