
DART


Diffusion-Inspired Speculative Decoding for Fast LLM Inference

[Paper of DART]


Overview

DART is a new speculative decoding approach for Large Language Model (LLM) inference, inspired by diffusion-based LLMs (dLLMs). DART surpasses EAGLE3 by 30% on average and achieves up to a 65% improvement on certain code-centric workloads.

DART's drafting requires only a single forward pass of a single transformer layer plus a fast C++-based tree search to build the draft token tree, resulting in extremely low drafting cost while preserving a relatively high $\tau$ (average acceptance length).
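To make the drafting/verification interplay concrete, here is a minimal, hypothetical sketch of the generic speculative decoding loop and how $\tau$ is measured. `draft_fn` and `verify_fn` are illustrative stand-ins (not DART's actual API): the drafter proposes tokens cheaply, and the target model checks the whole proposal in one forward pass.

# Minimal sketch of the generic speculative decoding loop (not DART's actual code).
# `draft_fn` and `verify_fn` are hypothetical stand-ins for the drafter and the
# target model's verification forward pass.
def speculative_generate(prompt_ids, draft_fn, verify_fn, max_new_tokens=128):
    tokens = list(prompt_ids)
    accepted_counts = []
    while len(tokens) - len(prompt_ids) < max_new_tokens:
        draft = draft_fn(tokens)                         # cheap draft (e.g. one forward of one layer + tree search)
        accepted, next_token = verify_fn(tokens, draft)  # one target forward verifies the whole draft
        tokens.extend(accepted + [next_token])           # keep the accepted prefix plus one bonus token
        accepted_counts.append(len(accepted) + 1)
    tau = sum(accepted_counts) / max(len(accepted_counts), 1)  # average acceptance length
    return tokens, tau

The higher $\tau$ is and the cheaper each `draft_fn` call is, the fewer expensive target forwards are needed per generated token, which is where the speedup comes from.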

Speedup

[Figure: DART speedup results]

Speedup Comparison

[Figure: DART comparison animation]

Key Features

  • Fast Drafting Forward: Produces multiple logits simultaneously with a single forward pass of a single layer.
  • Fast Tree Search: Uses an n-gram-based tree search (implemented in C++) to build the final draft tree; see the illustrative sketch after this list.
  • Low Drafting Cost: Keeps drafting overhead extremely low for efficient inference.
  • Relatively High $\tau$: Average acceptance length is competitive with EAGLE3.
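As referenced in the tree-search bullet above, the following is a simplified, hypothetical Python illustration of how an n-gram trie could guide draft-tree expansion. The names (`NGramTrie`, `build_draft_tree`) and the pure-Python structure are illustrative only; DART's real tree search is a C++ implementation over its n-gram model.

from collections import defaultdict

# Hypothetical illustration of n-gram-guided draft tree construction.
class NGramTrie:
    def __init__(self, n=3):
        self.n = n
        self.next_counts = defaultdict(lambda: defaultdict(int))

    def fit(self, token_ids):
        # Count which token follows each n-token context.
        for i in range(len(token_ids) - self.n):
            ctx = tuple(token_ids[i:i + self.n])
            self.next_counts[ctx][token_ids[i + self.n]] += 1

    def top_next(self, ctx, k=2):
        # Return the k most frequent continuations of the last n tokens.
        counts = self.next_counts.get(tuple(ctx[-self.n:]), {})
        return [t for t, _ in sorted(counts.items(), key=lambda kv: -kv[1])[:k]]

def build_draft_tree(trie, context, depth=3, branch=2):
    # Each node is (token, children); the root's children form the first draft level.
    def expand(ctx, d):
        if d == 0:
            return []
        return [(t, expand(ctx + [t], d - 1)) for t in trie.top_next(ctx, branch)]
    return expand(list(context), depth)

The resulting tree of candidate continuations is what the target model verifies in a single forward pass, so a wider or deeper tree trades extra verification compute for a higher chance of long accepted prefixes.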

Quick Start

Installation

git clone https://github.com/fvliang/DART.git
cd DART
curl -LsSf https://astral.sh/uv/install.sh | sh   # optional: skip if uv is already installed
uv sync
uv pip install -e .

Model Weights (HuggingFace)

Model Weights (ModelScope)

Inference

With UI (Gradio App)

We provide a Gradio web interface in dart/app/app.py. The easiest way is to run one of the prepared scripts:

bash dart/app/qwen3_1d7b_app.sh

You can also launch the app directly:

uv run python dart/app/app.py \
  --base-model-name-or-path Qwen/Qwen3-4B \
  --dart-model-name-or-path fvliang/qwen4b-dart \
  --ngram-model-name-or-path fvliang/dart-qwen3-ngram \
  --template-name qwen \
  --device cuda \
  --max-new-tokens 2048 \
  --max-length 4096 \
  --use-small-ngram \
  --listen \
  --server-port 30000

You can also compare DART with EAGLE3 in a single UI (the target model is shared by DART and EAGLE3):

uv run python dart/app/app.py \
  --base-model-name-or-path Qwen/Qwen3-4B \
  --dart-model-name-or-path fvliang/qwen4b-dart \
  --ngram-model-name-or-path fvliang/dart-qwen3-ngram \
  --template-name qwen \
  --device cuda \
  --max-new-tokens 2048 \
  --max-length 4096 \
  --use-small-ngram \
  --compare-eagle3 \
  --eagle3-model-name-or-path AngelSlim/Qwen3-4B_eagle3 \
  --listen \
  --server-port 30000

After the model is fully loaded, Gradio will print a local URL in the terminal that you can open in your browser.

Tip: --use-small-ngram is great for fast testing. For best accuracy, omit it and load the full n-gram trie (this uses more memory and takes longer to load).

With Code (main.py / Python API)

You can use DART programmatically via DartModel.from_pretrained(...) and dart_generate(...), similar to the Hugging Face generate API:

import torch
from dart.model.dart_model import DartModel
from dart.model.template import TEMPLATE_REGISTRY

base_model_path = "Qwen/Qwen3-1.7B"
dart_model_path = "fvliang/qwen1.7b-dart"
ngram_model_path = "fvliang/dart-qwen3-ngram"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = DartModel.from_pretrained(
    base_model_name_or_path=base_model_path,
    dart_model_name_or_path=dart_model_path,
    ngram_model_name_or_path=ngram_model_path,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
    is_small_ngram=True,
).to(device)
model.eval()

template = TEMPLATE_REGISTRY.get("qwen")
messages = [
    {"role": "system", "content": template.system_prompt},
    {"role": "user", "content": "Hello! Please introduce DART briefly."},
]
prompt = model.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

input_ids = model.tokenizer(
    prompt, return_tensors="pt", add_special_tokens=False
).input_ids.to(device)

output_ids = model.dart_generate(
    input_ids,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    max_new_token_num=512,
    max_length=2048,
)

output = model.tokenizer.decode(
    output_ids[0],
    skip_special_tokens=True,
    spaces_between_special_tokens=False,
    clean_up_tokenization_spaces=True,
)
print(output)

Citation

If you find DART useful in your research, please cite:

@misc{liu2026dart,
      title={DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference}, 
      author={Fuliang Liu and Xue Li and Ketai Zhao and Yinxi Gao and Ziyan Zhou and Zhonghui Zhang and Zhibin Wang and Wanchun Dou and Sheng Zhong and Chen Tian},
      year={2026},
      eprint={2601.19278},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.19278}, 
}

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.


If you find this project helpful, please give it a ⭐ Star!
