From e4234468192b0c156175e31e5b5325b0b205dc9f Mon Sep 17 00:00:00 2001
From: JunghwanNA <70629228+shaun0927@users.noreply.github.com>
Date: Fri, 17 Apr 2026 17:00:59 +0900
Subject: [PATCH] Pass temperature to draft sampler in dflash_generate

The draft sampler at dflash/model.py:121 was called without the
user-supplied temperature, so it always used the default
`temperature=0.0` (greedy argmax). The target sampler at line 134 does
receive `temperature`. For any `temperature > 0` the two paths therefore
sample from different distributions: the draft is deterministic while
the target is stochastic.

Acceptance is decided by token equality
(block_output_ids[:, 1:] == posterior[:, :-1]), so the mismatch
artificially depresses acceptance and the accepted tokens do not follow
the target distribution.

Minimal repro without a model:

    torch.manual_seed(0)
    logits = torch.tensor([[[2.0, 1.5, 1.0, 0.5]]])
    draft = sum(int(sample(logits).item() == 0) for _ in range(4000))
    target = sum(int(sample(logits, 1.0).item() == 0) for _ in range(4000))
    # draft = 4000/4000 (100%), target ~1900/4000 (~47%)

Pass `temperature` through so both paths use the same scheme.

Refs: #74
---
 dflash/model.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dflash/model.py b/dflash/model.py
index ef8d497..d64e838 100644
--- a/dflash/model.py
+++ b/dflash/model.py
@@ -118,7 +118,7 @@ def dflash_generate(
             is_causal=False,
         )[:, 1 - block_size :, :])
         past_key_values_draft.crop(start)
-        block_output_ids[:, 1:] = sample(draft_logits)
+        block_output_ids[:, 1:] = sample(draft_logits, temperature)
         if draft_prefill and return_stats:
             draft_prefill = False
             decode_start = _cuda_time()
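
The repro in the commit message depends on dflash's `sample` helper and on torch. A self-contained sketch of the same mismatch, with a hypothetical pure-Python `sample` standing in for dflash's (same convention: `temperature=0.0` means greedy argmax, `temperature > 0` means softmax sampling), gives roughly the same numbers:

```python
import math
import random

def sample(logits, temperature=0.0):
    """Illustrative stand-in for dflash's sampler: argmax at temperature 0,
    otherwise an index drawn from softmax(logits / temperature)."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return random.choices(range(len(logits)), weights=[e / total for e in exps], k=1)[0]

random.seed(0)
logits = [2.0, 1.5, 1.0, 0.5]
trials = 4000
draft = sum(sample(logits) == 0 for _ in range(trials))        # draft path: no temperature
target = sum(sample(logits, 1.0) == 0 for _ in range(trials))  # target path: temperature=1.0

print(draft / trials)   # 1.0 — the greedy draft always emits the argmax token
print(target / trials)  # roughly 0.45 — softmax([2, 1.5, 1, 0.5])[0] is about 0.455
```

So on a position where the target would pick the argmax token only ~45% of the time, the deterministic draft proposes it 100% of the time, and equality-based acceptance rejects the rest; passing `temperature` to the draft sampler makes both paths draw from the same distribution.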