
More Efficient Text Diffusion via Length Prediction

Davide Beltrame*, Giacomo Cirò*, Luca Gandolfi*, Vittorio Rossi*

Bocconi University
Milan, Italy

*Equal contribution; authors are listed alphabetically.

Abstract

Diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARMs) for text generation, but their fixed-length decoding process leads to significant computational inefficiencies. In this work, we address this limitation by predicting an upper bound on the output sequence length before generation begins and reducing the context window to be processed accordingly. Our approach relies solely on the internal representations of the model and explores both zero-shot and embedding-based techniques. On a state-of-the-art DLM, LLaDa-8B, a token-level classifier built on top of the encoded token embeddings successfully predicts an upper bound on the sequence length 80% of the time and avoids underestimation better than a DistilBERT-based baseline. Our results show that output length prediction is an effective and lightweight strategy to improve DLM efficiency, enabling significant computational savings with minimal overhead.

Overview

Language generation with Large Language Models (LLMs) has traditionally been dominated by autoregressive models (ARMs), which generate text one token at a time. While effective, their sequential nature limits inference speed. Diffusion Language Models (DLMs) have emerged as a promising alternative, offering potential for faster generation through a denoising process.

DLMs operate by iteratively unmasking a fixed-length sequence of tokens: the initial tokens hold the prompt, and the remaining positions hold placeholder mask tokens that are revealed progressively. At each denoising step, the model predicts logits for the full masked sequence and unmasks only the tokens it is confident about. The context length must be fixed before generation starts, so DLMs handle variable-length output by padding the sequence with special end-of-sentence (EoS) tokens once the actual output ends. This approach is effective but computationally inefficient: the entire context window must be processed during every forward pass of the denoising process, regardless of the actual output length.
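
To make the inefficiency concrete, here is a minimal sketch of a masked-diffusion decoding loop in the spirit of this process (the `model` call, the mask-token id, and the top-decile confidence rule are illustrative assumptions, not LLaDa's actual implementation). Every step runs the model over the full fixed-length window, even though most of it will end up as EoS padding:

```python
import torch

MASK_ID = 0  # hypothetical mask-token id, not LLaDa's real one

@torch.no_grad()
def diffusion_decode(model, prompt_ids: torch.Tensor, ctx_len: int, steps: int) -> torch.Tensor:
    """Simplified masked-diffusion decoding over a fixed-length window."""
    seq = torch.full((ctx_len,), MASK_ID, dtype=torch.long)
    seq[: prompt_ids.numel()] = prompt_ids         # prompt tokens stay fixed
    for _ in range(steps):
        masked = seq == MASK_ID
        if not masked.any():                       # everything already revealed
            break
        logits = model(seq.unsqueeze(0))[0]        # forward pass over the FULL window
        probs, preds = logits.softmax(-1).max(-1)  # per-position confidence and argmax
        # Reveal only the masked positions the model is most confident about.
        reveal = masked & (probs >= probs[masked].quantile(0.9))
        seq[reveal] = preds[reveal]
    return seq  # positions after the answer come back as EoS padding
```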

In this work, we focus on LLaDa (Large Language Diffusion with mAsking), an 8B-parameter DLM, and propose methods to predict an upper bound on the generated sequence length at the initial stage of the denoising process, restricting the effective context to this predicted window and cutting unnecessary computation.
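
With a predicted bound in hand, the change to decoding is small. A hedged sketch reusing the hypothetical `diffusion_decode` above (`predict_output_length` stands in for any of the predictors described below; `CTX_LEN` and `STEPS` are assumed constants):

```python
# Shrink the decoding window from the full context to prompt + predicted bound.
pred_len = predict_output_length(model, prompt_ids)   # any of the methods below
window = min(CTX_LEN, prompt_ids.numel() + pred_len)
seq = diffusion_decode(model, prompt_ids, ctx_len=window, steps=STEPS)
```

Every forward pass now processes `window` tokens instead of `CTX_LEN`, which is where the savings come from.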

Core Contributions

Our key contributions are techniques for predicting output length efficiently:

  1. Zero-shot Length Prediction: We analyze the model's internal signals by examining the logits corresponding to the EoS token in a zero-shot fashion.

  2. Embedding-based Methods: We develop:

    • A regression model using average prompt embeddings
    • A token-wise classification approach to identify EoS tokens
  3. Comparative Analysis: We compare our methods against DistilBERT-based baselines, evaluating them on their ability to provide accurate upper bounds that avoid underestimating sequence lengths.

Implementation Approaches

We explore the following methods for output length prediction (methods 1 and 3 are sketched after the list):

  1. Logit Quantile Heuristic: A zero-shot approach analyzing token-wise logit distributions for the EoS token
  2. Embedding-based Regression: A neural network trained on prompt embeddings to predict output length
  3. Token-wise Classification: A classifier trained to identify which tokens are likely to be EoS tokens
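
As a concrete illustration of method 1, here is one plausible reading of the logit quantile heuristic as a minimal sketch (the exact rule used in the paper may differ; `first_step_logits` is assumed to be the model's output at the initial denoising step, shaped [seq_len, vocab_size]):

```python
import torch

def quantile_length_bound(first_step_logits: torch.Tensor, eos_id: int, q: float = 0.75) -> int:
    """Zero-shot length bound: first position whose EoS logit is in the top quantile."""
    eos_logits = first_step_logits[:, eos_id]       # EoS logit at every position
    threshold = eos_logits.quantile(q)              # e.g. Q75 over the window
    positions = (eos_logits >= threshold).nonzero(as_tuple=True)[0]
    return int(positions[0])                        # earliest confident EoS position
```

Method 3 admits an equally simple sketch: a small binary head over the DLM's per-token embeddings that flags likely EoS positions (the hidden size and head architecture below are our assumptions):

```python
import torch
import torch.nn as nn

class EosTokenClassifier(nn.Module):
    """Per-token binary head over frozen DLM embeddings: will this position be EoS?"""

    def __init__(self, d_model: int = 4096):        # hidden size assumed
        super().__init__()
        self.head = nn.Sequential(nn.Linear(d_model, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, token_embs: torch.Tensor) -> torch.Tensor:
        # token_embs: [batch, seq_len, d_model] -> [batch, seq_len] EoS logits
        return self.head(token_embs).squeeze(-1)
```

The predicted upper bound is then the first position the classifier marks as EoS.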

Evaluation Framework

Our evaluation framework focuses on the quality of the upper bounds provided by each method (a sketch of the metrics follows the list):

  1. Bound Correctness: Percentage of test samples for which a correct upper bound is estimated
  2. Bound Tightness: Average number of tokens from true end of sequence to estimated end
  3. Saved Tokens: Average number of tokens saved from estimated end to context window end
  4. Root MSE: Square root of mean squared error between predicted and true sequence length
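
A minimal sketch of how these metrics could be computed (a hypothetical helper; that tightness and saved tokens are averaged over valid bounds only is our assumption):

```python
import numpy as np

def bound_metrics(true_len, pred_len, ctx_len: int = 1024) -> dict:
    """Evaluate predicted length upper bounds against true sequence lengths."""
    true_len = np.asarray(true_len, dtype=float)
    pred_len = np.asarray(pred_len, dtype=float)
    valid = pred_len >= true_len                                        # bound holds
    return {
        "bound_correctness": float(valid.mean()),                       # share of valid bounds
        "bound_tightness": float((pred_len - true_len)[valid].mean()),  # avg slack (tokens)
        "saved_tokens": float((ctx_len - pred_len)[valid].mean()),      # window tokens skipped
        "rmse": float(np.sqrt(np.mean((pred_len - true_len) ** 2))),
    }
```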

We prioritize methods that tend to overestimate rather than underestimate sequence lengths, as overestimation is safer for generation quality.

Experimental Results

Our experiments with LLaDa-8B show:

  1. Token-level Classification: The classifier based on LLaDa embeddings correctly predicts upper bounds for over 80% of test cases, with an average bound looseness of ~127 tokens.

  2. Zero-shot Quantile Heuristics: Show a clear trade-off: higher quantiles yield higher correctness (Q75 achieves 99.88% valid bounds) but looser estimates (~510 tokens).

  3. Regression Methods: Achieve the lowest RMSE but produce valid bounds for fewer samples (46-57%), as they're trained to minimize error rather than ensure upper bounds.

  4. Comparison with DistilBERT: While DistilBERT exhibited lower average error in some cases, it frequently underestimated length, which is problematic for generation. LLaDa methods tended to safely overestimate.

  5. Efficiency Gains: Our best methods save approximately 760-870 tokens on a 1024-token context window, representing substantial computational savings with minimal overhead.

Limitations & Future Work

Our work has several limitations that suggest directions for future research:

  1. Computational Constraints: Limited computational resources restricted our ability to explore more complex models or larger datasets.

  2. Model Requirements: Our approach works best with DLMs that have embeddings pre-trained for variable-length output generation.

  3. Pretraining Challenges: We began exploring ways to adapt existing diffusion language models to support variable-length generation (e.g., with DiffuGPT), but this remains an open challenge.

  4. Zero-shot Prediction Depth: We did not test zero-shot EoS prediction at different stages of the denoising process, which might improve accuracy.

  5. Output Quality Analysis: Our assumption that models maintain performance under varying upper bounds needs further verification.

  6. Dataset Limitations: We used clean, multilingual, conversational prompts. A more diverse and length-balanced dataset could yield more generalizable insights.

  7. Multilingual Capabilities: We only superficially explored the model's multilingual capabilities and its sensitivity to prompt phrasing.

Conclusion

Our experiments addressed a significant limitation of DLMs by investigating how to upper bound the output sequence length. The core challenge is balancing the tightness of the predicted bound with the risk of underestimation, which can lead to premature truncation of generated text.

We demonstrated that upper bound prediction can be successfully approached as a classification or regression problem using a DLM's internal representations. A classifier relying on the DLM's internal representation strikes the best balance between bound tightness and accuracy, even compared to methods using sentence-level DistilBERT embeddings.

This work shows that predicting output sequence length is a viable strategy for enhancing the efficiency of DLMs like LLaDa, with potential for zero-shot and specialized solutions to address computational challenges in large-scale generative models.
