
DART


Diffusion-Inspired Speculative Decoding for Fast LLM Inference

[Paper of DART]


Overview

DART is a new speculative decoding approach for Large Language Model (LLM) inference, inspired by diffusion-based LLMs (dLLMs). DART surpasses EAGLE3 by 30% on average and achieves up to a 65% improvement on certain code-centric workloads.

DART's drafting requires only a single forward pass of a single transformer layer plus a fast C++-based tree search to build the draft token tree, resulting in extremely low drafting cost while preserving a relatively high $\tau$ (average acceptance length).
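To make the drafting/verification interplay concrete, here is a minimal, hypothetical sketch of the generic speculative decoding loop and how $\tau$ is measured. `draft_fn` and `verify_fn` are illustrative stand-ins (not DART's actual API): the drafter proposes tokens cheaply, and the target model checks the whole proposal in one forward pass.

# Minimal sketch of the generic speculative decoding loop (not DART's actual code).
# `draft_fn` and `verify_fn` are hypothetical stand-ins for the drafter and the
# target model's verification forward pass.
def speculative_generate(prompt_ids, draft_fn, verify_fn, max_new_tokens=128):
    tokens = list(prompt_ids)
    accepted_counts = []
    while len(tokens) - len(prompt_ids) < max_new_tokens:
        draft = draft_fn(tokens)                         # cheap draft (e.g. one forward of one layer + tree search)
        accepted, next_token = verify_fn(tokens, draft)  # one target forward verifies the whole draft
        tokens.extend(accepted + [next_token])           # keep the accepted prefix plus one bonus token
        accepted_counts.append(len(accepted) + 1)
    tau = sum(accepted_counts) / max(len(accepted_counts), 1)  # average acceptance length
    return tokens, tau

The higher $\tau$ is and the cheaper each `draft_fn` call is, the fewer expensive target forwards are needed per generated token, which is where the speedup comes from.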

Speedup

[Figure: DART speedup results]

Speedup Comparison

[Figure: DART comparison animation]

Key Features

  • Fast Drafting Forward: Produces multiple logits simultaneously with a single forward pass of a single layer.
  • Fast Tree Search: Uses an n-gram-based tree search (implemented in C++) to build the final draft tree; see the illustrative sketch after this list.
  • Low Drafting Cost: Keeps drafting overhead extremely low for efficient inference.
  • Relatively High $\tau$: Average acceptance length is competitive with EAGLE3.
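As referenced in the tree-search bullet above, the following is a simplified, hypothetical Python illustration of how an n-gram trie could guide draft-tree expansion. The names (`NGramTrie`, `build_draft_tree`) and the pure-Python structure are illustrative only; DART's real tree search is a C++ implementation over its n-gram model.

from collections import defaultdict

# Hypothetical illustration of n-gram-guided draft tree construction.
class NGramTrie:
    def __init__(self, n=3):
        self.n = n
        self.next_counts = defaultdict(lambda: defaultdict(int))

    def fit(self, token_ids):
        # Count which token follows each n-token context.
        for i in range(len(token_ids) - self.n):
            ctx = tuple(token_ids[i:i + self.n])
            self.next_counts[ctx][token_ids[i + self.n]] += 1

    def top_next(self, ctx, k=2):
        # Return the k most frequent continuations of the last n tokens.
        counts = self.next_counts.get(tuple(ctx[-self.n:]), {})
        return [t for t, _ in sorted(counts.items(), key=lambda kv: -kv[1])[:k]]

def build_draft_tree(trie, context, depth=3, branch=2):
    # Each node is (token, children); the root's children form the first draft level.
    def expand(ctx, d):
        if d == 0:
            return []
        return [(t, expand(ctx + [t], d - 1)) for t in trie.top_next(ctx, branch)]
    return expand(list(context), depth)

The resulting tree of candidate continuations is what the target model verifies in a single forward pass, so a wider or deeper tree trades extra verification compute for a higher chance of long accepted prefixes.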

Quick Start

Installation

git clone https://github.com/fvliang/DART.git
cd DART
curl -LsSf https://astral.sh/uv/install.sh | sh   # optional: skip if uv is already installed
uv sync
uv pip install -e .

Model Weights (HuggingFace)

Model Weights (ModelScope)

Inference

With UI (Gradio App)

We provide a Gradio web interface in dart/app/app.py. The easiest way is to run one of the prepared scripts:

bash dart/app/qwen3_1d7b_app.sh

You can also launch the app directly:

uv run python dart/app/app.py \
  --base-model-name-or-path Qwen/Qwen3-4B \
  --dart-model-name-or-path fvliang/qwen4b-dart \
  --ngram-model-name-or-path fvliang/dart-qwen3-ngram \
  --template-name qwen \
  --device cuda \
  --max-new-tokens 2048 \
  --max-length 4096 \
  --use-small-ngram \
  --listen \
  --server-port 30000

You can also compare DART with EAGLE3 in a single UI (the target model is shared by DART and EAGLE3):

uv run python dart/app/app.py \
  --base-model-name-or-path Qwen/Qwen3-4B \
  --dart-model-name-or-path fvliang/qwen4b-dart \
  --ngram-model-name-or-path fvliang/dart-qwen3-ngram \
  --template-name qwen \
  --device cuda \
  --max-new-tokens 2048 \
  --max-length 4096 \
  --use-small-ngram \
  --compare-eagle3 \
  --eagle3-model-name-or-path AngelSlim/Qwen3-4B_eagle3 \
  --listen \
  --server-port 30000

After the model is fully loaded, Gradio will print a local URL in the terminal that you can open in your browser.

Tip: --use-small-ngram is great for fast testing. For best accuracy, omit it and load the full n-gram trie (this uses more memory and takes longer to load).

With Code (main.py / Python API)

You can use DART programmatically via DartModel.from_pretrained(...) and dart_generate(...), similar to the Hugging Face generate API:

import torch
from dart.model.dart_model import DartModel
from dart.model.template import TEMPLATE_REGISTRY

base_model_path = "Qwen/Qwen3-1.7B"
dart_model_path = "fvliang/qwen1.7b-dart"
ngram_model_path = "fvliang/dart-qwen3-ngram"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = DartModel.from_pretrained(
    base_model_name_or_path=base_model_path,
    dart_model_name_or_path=dart_model_path,
    ngram_model_name_or_path=ngram_model_path,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
    is_small_ngram=True,
).to(device)
model.eval()

template = TEMPLATE_REGISTRY.get("qwen")
messages = [
    {"role": "system", "content": template.system_prompt},
    {"role": "user", "content": "Hello! Please introduce DART briefly."},
]
prompt = model.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

input_ids = model.tokenizer(
    prompt, return_tensors="pt", add_special_tokens=False
).input_ids.to(device)

output_ids = model.dart_generate(
    input_ids,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    max_new_token_num=512,
    max_length=2048,
)

output = model.tokenizer.decode(
    output_ids[0],
    skip_special_tokens=True,
    spaces_between_special_tokens=False,
    clean_up_tokenization_spaces=True,
)
print(output)

Citation

If you find DART useful in your research, please cite:

@misc{liu2026dart,
      title={DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference}, 
      author={Fuliang Liu and Xue Li and Ketai Zhao and Yinxi Gao and Ziyan Zhou and Zhonghui Zhang and Zhibin Wang and Wanchun Dou and Sheng Zhong and Chen Tian},
      year={2026},
      eprint={2601.19278},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.19278}, 
}

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.


If you find this project helpful, please give it a ⭐ Star!
