
davefojtik/RunPod-vLLM


Caution

My RunPod repos have been archived and won't receive any more work!

Long story short: there has been an issue with queue delay times that I tried to bring to the attention of the RunPod support team and get fixed. There are big delays before each container starts, especially before each cold-start (up to 10+ seconds), and they are billed. A couple of users and I pressed them to fix it for more than two months. After countless conversations and endless documentation, it got swept aside, and I was banned from their Discord server with no explanation.

I don't know whether the delay is an intentional way to silently charge more money per request, but this completely unprofessional approach on their side convinced me to stop all work related to their platform, and I simply can't recommend their serverless services anymore. Here are text copies of the whole Discord thread and email conversations if you want the full story.

I'm sorry it had to end like this. Thank you for your understanding and interest in these projects. See you at other ones.



RunPod serverless handler for vLLM, optimized for efficient tuning and production deployment. Written from scratch, easily customisable, and OpenAI-compatible.

Advantages and Support:

  • Optimized, minimal container image and settings for the best performance/cold-start time
  • Automatic vLLM + RunPod continuous batching (AsyncEngine + Dynamic Concurrent Handler modifier; see the sketch after this list)
  • Per-coldstart vLLM configuration via request payload (change engine arguments and env variables for each cold-start)
  • Static memory profiling patched into vLLM source code (reduces cold-start time by an additional ~30%)
  • Automatic network-volume-shared torch.compile and graph capture (caching is performed only when a network path is detected)
  • Prewarming method for RunPod Flashboot
  • SSE Concurrent Streaming (Generator Handler)
  • Baked or network volume models
  • FP8, GPTQ, AWQ, and bitsandbytes quantizations
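
To give a sense of how the AsyncEngine + concurrent generator handler pattern fits together, here is a minimal sketch. It is not the actual handler from this repo: the model name, input field names, and concurrency limit are placeholder assumptions, and the real handler adds the per-cold-start configuration, prewarming, and caching logic listed above.

```python
import uuid

import runpod
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# Engine is built once per cold-start; warm requests reuse it (continuous batching).
# Model name and engine arguments here are placeholder assumptions.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="Qwen/Qwen3-14B-AWQ", gpu_memory_utilization=0.95)
)

async def handler(job):
    """Async generator handler: streams newly generated text as it is produced."""
    prompt = job["input"]["prompt"]                       # assumed input field name
    params = SamplingParams(max_tokens=job["input"].get("max_tokens", 256))
    sent = ""
    async for output in engine.generate(prompt, params, request_id=str(uuid.uuid4())):
        text = output.outputs[0].text                     # cumulative text so far
        yield text[len(sent):]                            # emit only the new delta
        sent = text

def concurrency_modifier(current_concurrency: int) -> int:
    # Allow many in-flight requests per worker so vLLM's batching can absorb them.
    return 32                                             # assumed limit

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
    "return_aggregate_stream": True,
})
```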

Todo / Needs to be tested:

  • Network volume setup interface
  • Embedding and Multimodal models
  • Q/LoRAs
  • Tensorizer for faster model loading over the network

Tip

For quick, optimized deployments, you can use the pre-made images from our projects:

  • Qwen3-14B-AWQ: 3wad/runpod-vllm:0.8.5-qwen3-14B-awq
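
Once an endpoint running one of these images (or your own build) is deployed, it can be queried through RunPod's serverless REST API. The snippet below is only a sketch: the endpoint ID and API key are placeholders, and the exact input field names are assumptions — check the handler source for the schema it actually expects.

```python
# Minimal sketch of querying a deployed endpoint via RunPod's serverless REST API.
# The endpoint ID, API key, and "input" field names are placeholders/assumptions;
# consult the handler source for the exact request schema this worker expects.
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"        # placeholder
API_KEY = "YOUR_RUNPOD_API_KEY"         # placeholder

payload = {
    "input": {
        "messages": [{"role": "user", "content": "Hello!"}],  # OpenAI-style body (assumed)
        "max_tokens": 128,
        "stream": False,
    }
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```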

Documentation


Why use this template when there are others, including the official one?

  • This repo tries to implement and fix things not addressed by the others, which was the main reason it was made.
  • It's used and tested with the best-performing models in our own projects. That way, you can enjoy optimized pre-made images with baked models and settings dialled in for the best performance or cold-start time.
  • It aims to be easy to fix and debug, with well-organized code in a single handler file. You can now fork this repo, customize it, and build it directly on RunPod with their new GitHub build feature!
  • It aims for active communication and cooperation, to help with all issues and to be open to community feature requests.

Contributors and support

All sorts of help with this repo are very welcome, whether you choose to answer questions, help with issues, or contribute to the code or performance tuning. We try our best to communicate the priorities of bug fixes and further development through tags, but never hesitate to reach out and contact us with any interest in cooperation.

Please keep in mind that this is a hobby project. While we are often contributors to them, we're not directly associated with the companies or projects behind the things we're implementing. We're trying to help the open-source community in our free time and for free. If you use these projects commercially, consider supporting vLLM.


Note

This repo is in no way affiliated with RunPod Inc. All logos and names are owned by their authors. This is an unofficial community implementation.
