nanoLLMServe

nanoLLMServe is a tiny, readable LLM serving engine.

The goal is to build a small vLLM/SGLang-style system that can actually run, while keeping each production serving idea easy to inspect:

OpenAI-compatible API
KV cache decode
batching and continuous batching
block KV cache management
prefix cache
chunked prefill
metrics, structured output, speculative decoding, LoRA, quantization, and distributed serving

It is not trying to be faster than vLLM. It is trying to make the serving stack understandable.

Roadmap

Milestones live in docs/exec-plans/active/milestones.

Start with v0.0-naive-single-request, expose it through the v0.1 OpenAI-compatible server, then switch generation to v0.2 KV cache decode.

Quick Start

Install the package in a virtual environment:

uv venv --python python3
source .venv/bin/activate
uv pip install -e ".[dev]"

Generate text for one prompt:

nanollm-generate \
  --model Qwen/Qwen3-1.7B \
  --prompt "Explain KV cache in one sentence." \
  --max-new-tokens 32 \
  --temperature 0 \
  --show-stats

On the A100 harness machine, use the cached local weights:

nanollm-generate \
  --model /data2/nanoLLMServe/models/Qwen3-1.7B \
  --local-files-only \
  --prompt "Explain KV cache in one sentence." \
  --max-new-tokens 32 \
  --temperature 0 \
  --show-stats

v0.2 KV Cache Decode

The default generation path now runs one prefill pass over the prompt and then reuses Hugging Face past_key_values, so each later decode step receives only the last generated token instead of the full sequence.
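As a rough illustration of why the cached path only needs the newest token, here is a toy single-head attention in plain Python. It is not the project's implementation and does not call Hugging Face; the weights `W_Q`, `W_K`, `W_V`, the `KVCache` class, and the helper names are all made up for this sketch.

```python
import math

# Hypothetical 2x2 projection weights, chosen only for illustration.
W_Q = [[0.5, -0.2], [0.1, 0.3]]
W_K = [[0.4, 0.1], [-0.3, 0.2]]
W_V = [[1.0, 0.0], [0.5, 0.5]]


def project(x, matrix):
    """Multiply a vector by a matrix stored as a list of rows."""
    return [sum(xi * row[j] for xi, row in zip(x, matrix))
            for j in range(len(matrix[0]))]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


def naive_last_output(embeddings):
    """No cache: recompute K and V for the whole sequence on every step."""
    keys = [project(x, W_K) for x in embeddings]
    values = [project(x, W_V) for x in embeddings]
    return attend(project(embeddings[-1], W_Q), keys, values)


class KVCache:
    """Keeps the K and V vectors of every token seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, x):
        """Project only the new token, append it, and attend over the cache."""
        self.keys.append(project(x, W_K))
        self.values.append(project(x, W_V))
        return attend(project(x, W_Q), self.keys, self.values)


def prefill(cache, prompt_embeddings):
    """Fill the cache from the prompt (a real model does this in one pass)."""
    output = None
    for x in prompt_embeddings:
        output = cache.step(x)
    return output


prompt = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
new_token = [0.2, -0.7]

cache = KVCache()
prefill(cache, prompt)
cached_output = cache.step(new_token)           # sees one token
full_output = naive_last_output(prompt + [new_token])  # sees four tokens

# Both paths should agree on the output for the newest position.
assert all(abs(a - b) < 1e-12 for a, b in zip(cached_output, full_output))
```

The cached step does a constant amount of projection work per token, while the naive path re-projects the whole sequence each time. That gap is what the benchmark comparison is meant to expose.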

Run the benchmark with a v0.0-style naive baseline comparison:

python benchmarks/benchmark_generate.py \
  --model /data2/nanoLLMServe/models/Qwen3-1.7B \
  --local-files-only \
  --runs 3 \
  --warmup 1

The JSON output includes kv_cache_decode.mean_ttft_seconds (mean time to first token), kv_cache_decode.mean_tpot_seconds (mean time per output token), and a comparison section that reports the elapsed-time and TPOT speedups over the naive full-sequence loop.
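For readers new to these latency metrics, the sketch below shows one common way to derive them from per-token timestamps. The function names and the exact formulas are assumptions for illustration; the benchmark script may compute its fields differently.

```python
def ttft_seconds(request_start, token_times):
    """Time to first token: delay from request start to the first token."""
    return token_times[0] - request_start


def tpot_seconds(token_times):
    """Time per output token: mean gap between tokens after the first one."""
    if len(token_times) < 2:
        return 0.0  # a single token has no inter-token gap to average
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Hypothetical timestamps in seconds, not measured results.
start = 1.0
token_times = [1.5, 1.75, 2.0, 2.25]

print("ttft:", ttft_seconds(start, token_times))
print("tpot:", tpot_seconds(token_times))
```

Under these definitions, TTFT is dominated by prefill cost and TPOT by the per-step decode cost, which is why a KV cache mainly shows up as a TPOT improvement.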

v0.3 Static Batching (Teaching-Scale)

You can benchmark fixed-size static batching with --batch-size:

python benchmarks/benchmark_generate.py \
  --model /data2/nanoLLMServe/models/Qwen3-1.7B \
  --local-files-only \
  --batch-size 4 \
  --runs 5 \
  --warmup 2

The output includes a static_batch section with the batch elapsed time and the mean row-level token throughput for the fixed-size group.
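A minimal sketch of what a fixed-size static batch involves, assuming left padding with an attention mask (a common choice for decoder-only generation, not confirmed for this project). The helper names and `pad_id` value are hypothetical.

```python
def left_pad_batch(token_rows, pad_id=0):
    """Pad every row on the left to the longest row and build a 0/1 mask."""
    width = max(len(row) for row in token_rows)
    input_ids = [[pad_id] * (width - len(row)) + list(row)
                 for row in token_rows]
    attention_mask = [[0] * (width - len(row)) + [1] * len(row)
                      for row in token_rows]
    return input_ids, attention_mask


def static_batch_decode_steps(target_lengths):
    """A static batch keeps stepping until its longest row has finished."""
    return max(target_lengths)


def wasted_row_steps(target_lengths):
    """Decode steps spent on rows that had already finished."""
    steps = static_batch_decode_steps(target_lengths)
    return sum(steps - n for n in target_lengths)


# Hypothetical token ids for two prompts of different lengths.
ids, mask = left_pad_batch([[5, 6, 7], [8]])
print(ids)
print(mask)
print(wasted_row_steps([4, 1, 2]))
```

The wasted steps are the main cost of static batching: short rows wait for the longest one, which is the gap continuous batching tries to close.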

v0.4 Continuous Batching (Teaching-Scale)

When --batch-size is greater than 1, the benchmark now also runs a teaching-scale continuous batching path. It admits requests over scheduler steps, rebuilds the active batch each step, and reports active batch sizes:

python benchmarks/benchmark_generate.py \
  --model /data2/nanoLLMServe/models/Qwen3-1.7B \
  --local-files-only \
  --batch-size 4 \
  --runs 5 \
  --warmup 2 \
  --skip-naive-baseline

The output includes continuous_batch.active_batch_sizes and continuous_batch.mean_active_batch_size. This milestone intentionally recomputes full active rows; paged KV cache for dynamic rows belongs to v0.5.
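The scheduling idea can be sketched as a small simulation with no model involved. The function name, the `arrivals_per_step` admission rule, and the request lengths are assumptions for illustration; the real scheduler may admit and retire requests differently.

```python
from collections import deque
from statistics import mean


def simulate_continuous_batching(output_lengths, max_batch_size,
                                 arrivals_per_step=1):
    """Return the active batch size observed at each scheduler step.

    output_lengths holds the number of tokens each request will generate.
    """
    waiting = deque(output_lengths)
    active = []  # tokens still to generate for each running request
    active_batch_sizes = []
    while waiting or active:
        # Admit waiting requests into free slots, a few per step.
        admitted = 0
        while (waiting and len(active) < max_batch_size
               and admitted < arrivals_per_step):
            active.append(waiting.popleft())
            admitted += 1
        active_batch_sizes.append(len(active))
        # One decode step: every active row emits a token, then the batch
        # is rebuilt without the rows that just finished.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
    return active_batch_sizes


sizes = simulate_continuous_batching([3, 1, 2], max_batch_size=2)
print(sizes)        # [1, 2, 2, 1]
print(mean(sizes))  # 1.5
```

Finished rows free their slot on the next step instead of idling until the longest request ends, so the mean active batch size stays closer to the configured limit.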

v0.1 OpenAI-Compatible Server

Serve one local or Hugging Face causal LM:

nanollm-serve \
  --model /data2/nanoLLMServe/models/Qwen3-1.7B \
  --served-model-name Qwen3-1.7B \
  --local-files-only \
  --host 127.0.0.1 \
  --port 8000

List models:

curl http://127.0.0.1:8000/v1/models

Call the recommended Responses endpoint:

curl http://127.0.0.1:8000/v1/responses \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen3-1.7B",
    "instructions": "Answer in one sentence.",
    "input": "Explain KV cache.",
    "max_output_tokens": 32,
    "temperature": 0
  }'

Call the chat completions endpoint:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen3-1.7B",
    "messages": [{"role": "user", "content": "Explain KV cache in one sentence."}],
    "max_tokens": 32,
    "temperature": 0
  }'
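
The same chat completions call can be made from Python with only the standard library. This is a sketch that mirrors the curl payload above; `build_chat_request` and `send` are hypothetical helper names, not part of the package.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_text,
                       max_tokens=32, temperature=0):
    """Build a POST request for /v1/chat/completions without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_chat_request(
    "http://127.0.0.1:8000",
    "Qwen3-1.7B",
    "Explain KV cache in one sentence.",
)
print(request.full_url)
```

With nanollm-serve running, `send(request)` returns the parsed JSON body. Assuming the server follows the OpenAI response shape, the generated text would be under `choices[0]["message"]["content"]`.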

Streaming uses OpenAI-style server-sent events:

curl -N http://127.0.0.1:8000/v1/responses \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen3-1.7B",
    "input": "KV cache is",
    "max_output_tokens": 16,
    "temperature": 0,
    "stream": true
  }'
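
To consume the stream from Python, the event lines need to be decoded. The sketch below is a simplified parser that assumes one `data:` line per event and treats `[DONE]` as an end marker, which is the OpenAI chat completions convention; whether this server emits that sentinel, and what fields each event carries, are assumptions here. `iter_sse_data` is a hypothetical helper name.

```python
import json


def iter_sse_data(lines):
    """Yield the decoded JSON payload of each `data:` line in an SSE stream."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators, `event:` lines, and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed end-of-stream sentinel
        yield json.loads(data)


# Made-up sample lines, only to exercise the parser.
sample = [
    "event: example\n",
    'data: {"delta": "KV"}\n',
    "\n",
    'data: {"delta": " cache"}\n',
    "\n",
    "data: [DONE]\n",
]
for event in iter_sse_data(sample):
    print(event)
```

In practice `lines` would be the decoded lines of the HTTP response body, read incrementally so that tokens are shown as they arrive.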

License

MIT
