DART is a new speculative decoding approach for Large Language Model (LLM) inference, inspired by diffusion-based LLMs (dLLMs). DART surpasses EAGLE3 by 30% on average and achieves up to 65% improvement on certain code-centric workloads.
DART's drafting requires only a single forward pass of a single transformer layer plus a fast C++-based tree search to build the draft token tree, resulting in extremely low drafting cost while preserving a relatively high average acceptance length.
- Fast Drafting Forward: Produces multiple logits simultaneously with one forward pass of one layer.
- Fast Tree Search: Uses an n-gram–based tree search (implemented in C++) to build the final draft tree; a minimal sketch of the idea follows this list.
- Low Drafting Cost: Results in extremely low drafting cost for efficient inference.
- Relatively High $\tau$: Average acceptance length is competitive with EAGLE3.
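To make the drafting step concrete, here is a minimal, illustrative sketch of n-gram-based draft-tree expansion. This is not DART's actual implementation (DART's tree search is written in C++ and driven by the drafter's logits); all names here, such as `build_draft_tree` and `ngram_table`, are hypothetical.

```python
# Toy sketch (not DART's real API): expand a small draft token tree by looking up
# the most frequent continuations of each node's trailing bigram in an n-gram table.
from collections import defaultdict


def build_draft_tree(context, ngram_table, depth=3, branch=2):
    """Greedily grow a draft tree: each node's children are the `branch` most
    frequent next tokens observed after its trailing (n-1)-gram."""
    root = {"token": context[-1], "children": []}
    frontier = [(root, tuple(context[-2:]))]  # (node, trailing bigram key)
    for _ in range(depth):
        next_frontier = []
        for node, key in frontier:
            # Top `branch` continuations observed after this bigram.
            candidates = sorted(ngram_table.get(key, {}).items(),
                                key=lambda kv: -kv[1])[:branch]
            for token, _count in candidates:
                child = {"token": token, "children": []}
                node["children"].append(child)
                next_frontier.append((child, (key[1], token)))
        frontier = next_frontier
    return root


# Toy trigram table: bigram -> counts of the next token.
table = defaultdict(lambda: defaultdict(int))
for a, b, c in [(1, 2, 3), (1, 2, 3), (2, 3, 4), (2, 3, 5)]:
    table[(a, b)][c] += 1

tree = build_draft_tree([7, 1, 2], table, depth=2, branch=2)
print(tree)
```

In the real system the drafter's logits and the n-gram statistics are presumably combined when selecting children; the toy version above uses raw n-gram counts only.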
```bash
git clone https://github.com/fvliang/DART.git
cd DART
curl -LsSf https://astral.sh/uv/install.sh | sh  # optional, only if you don't have uv installed
uv sync
uv pip install -e .
```

We provide a Gradio web interface in `dart/app/app.py`. The easiest way to try it is to run one of the prepared scripts:
```bash
bash dart/app/qwen3_1d7b_app.sh
```

You can also launch the app directly:
```bash
uv run python dart/app/app.py \
    --base-model-name-or-path Qwen/Qwen3-4B \
    --dart-model-name-or-path fvliang/qwen4b-dart \
    --ngram-model-name-or-path fvliang/dart-qwen3-ngram \
    --template-name qwen \
    --device cuda \
    --max-new-tokens 2048 \
    --max-length 4096 \
    --use-small-ngram \
    --listen \
    --server-port 30000
```

You can also compare DART with EAGLE3 in a single UI (the target model is shared by DART and EAGLE3):
```bash
uv run python dart/app/app.py \
    --base-model-name-or-path Qwen/Qwen3-4B \
    --dart-model-name-or-path fvliang/qwen4b-dart \
    --ngram-model-name-or-path fvliang/dart-qwen3-ngram \
    --template-name qwen \
    --device cuda \
    --max-new-tokens 2048 \
    --max-length 4096 \
    --use-small-ngram \
    --compare-eagle3 \
    --eagle3-model-name-or-path AngelSlim/Qwen3-4B_eagle3 \
    --listen \
    --server-port 30000
```

After the model is fully loaded, Gradio will print a local URL in the terminal that you can open in your browser.
Tip: `--use-small-ngram` is great for fast testing. For best accuracy, omit it and load the full n-gram trie (this uses more memory and takes longer to load).
You can use DART programmatically via `DartModel.from_pretrained(...)` and `dart_generate(...)`, similar to Hugging Face `generate`:
```python
import torch

from dart.model.dart_model import DartModel
from dart.model.template import TEMPLATE_REGISTRY

base_model_path = "Qwen/Qwen3-1.7B"
dart_model_path = "fvliang/qwen1.7b-dart"
ngram_model_path = "fvliang/dart-qwen3-ngram"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = DartModel.from_pretrained(
    base_model_name_or_path=base_model_path,
    dart_model_name_or_path=dart_model_path,
    ngram_model_name_or_path=ngram_model_path,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
    is_small_ngram=True,
).to(device)
model.eval()

template = TEMPLATE_REGISTRY.get("qwen")
messages = [
    {"role": "system", "content": template.system_prompt},
    {"role": "user", "content": "Hello! Please introduce DART briefly."},
]
prompt = model.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
input_ids = model.tokenizer(
    prompt, return_tensors="pt", add_special_tokens=False
).input_ids.to(device)

output_ids = model.dart_generate(
    input_ids,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    max_new_token_num=512,
    max_length=2048,
)
output = model.tokenizer.decode(
    output_ids[0],
    skip_special_tokens=True,
    spaces_between_special_tokens=False,
    clean_up_tokenization_spaces=True,
)
print(output)
```
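As a quick follow-up, you can time `dart_generate` to get a rough tokens-per-second number. The snippet below is only a sketch: it reuses `model`, `input_ids`, and the sampling arguments from the example above, and assumes (as with Hugging Face `generate`) that the returned ids include the prompt; adjust the token count if `dart_generate` returns only the newly generated tokens.

```python
import time

# Rough throughput check; reuses `model` and `input_ids` from the example above.
if torch.cuda.is_available():
    torch.cuda.synchronize()
start = time.perf_counter()
output_ids = model.dart_generate(
    input_ids,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    max_new_token_num=512,
    max_length=2048,
)
if torch.cuda.is_available():
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Assumes output_ids contains the prompt ids followed by the new ids.
new_tokens = output_ids.shape[1] - input_ids.shape[1]
print(f"{new_tokens} new tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tok/s)")
```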
If you find DART useful in your research, please cite:

```bibtex
@misc{liu2026dart,
      title={DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference},
      author={Fuliang Liu and Xue Li and Ketai Zhao and Yinxi Gao and Ziyan Zhou and Zhonghui Zhang and Zhibin Wang and Wanchun Dou and Sheng Zhong and Chen Tian},
      year={2026},
      eprint={2601.19278},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.19278},
}
```

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
If you find this project helpful, please give it a ⭐ Star!

