From e4234468192b0c156175e31e5b5325b0b205dc9f Mon Sep 17 00:00:00 2001
From: JunghwanNA <70629228+shaun0927@users.noreply.github.com>
Date: Fri, 17 Apr 2026 17:00:59 +0900
Subject: [PATCH] Pass temperature to draft sampler in dflash_generate

The draft sampler at dflash/model.py:121 was called without the
user-supplied temperature, so it always used the default
`temperature=0.0` (greedy argmax). The target sampler at line 134 does
receive `temperature`. For any `temperature > 0` the two paths therefore
sample from different distributions: the draft is deterministic while
the target is stochastic.

Acceptance is decided by token equality
(block_output_ids[:, 1:] == posterior[:, :-1]), so the mismatch
artificially depresses acceptance and the accepted tokens do not follow
the target distribution.

Minimal repro without a model:

    torch.manual_seed(0)
    logits = torch.tensor([[[2.0, 1.5, 1.0, 0.5]]])
    draft = sum(int(sample(logits).item() == 0) for _ in range(4000))
    target = sum(int(sample(logits, 1.0).item() == 0) for _ in range(4000))
    # draft = 4000/4000 (100%), target ~1900/4000 (~47%)

Pass `temperature` through so both paths use the same scheme.

Refs: #74
---
 dflash/model.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dflash/model.py b/dflash/model.py
index ef8d497..d64e838 100644
--- a/dflash/model.py
+++ b/dflash/model.py
@@ -118,7 +118,7 @@ def dflash_generate(
             is_causal=False,
         )[:, 1 - block_size :, :])
         past_key_values_draft.crop(start)
-        block_output_ids[:, 1:] = sample(draft_logits)
+        block_output_ids[:, 1:] = sample(draft_logits, temperature)
         if draft_prefill and return_stats:
             draft_prefill = False
             decode_start = _cuda_time()
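
The repro in the commit message depends on dflash's `sample` helper and on torch. A self-contained sketch of the same mismatch, with a hypothetical pure-Python `sample` standing in for dflash's (same convention: `temperature=0.0` means greedy argmax, `temperature > 0` means softmax sampling), gives roughly the same numbers:

```python
import math
import random

def sample(logits, temperature=0.0):
    """Illustrative stand-in for dflash's sampler: argmax at temperature 0,
    otherwise an index drawn from softmax(logits / temperature)."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return random.choices(range(len(logits)), weights=[e / total for e in exps], k=1)[0]

random.seed(0)
logits = [2.0, 1.5, 1.0, 0.5]
trials = 4000
draft = sum(sample(logits) == 0 for _ in range(trials))        # draft path: no temperature
target = sum(sample(logits, 1.0) == 0 for _ in range(trials))  # target path: temperature=1.0

print(draft / trials)   # 1.0 — the greedy draft always emits the argmax token
print(target / trials)  # roughly 0.45 — softmax([2, 1.5, 1, 0.5])[0] is about 0.455
```

So on a position where the target would pick the argmax token only ~45% of the time, the deterministic draft proposes it 100% of the time, and equality-based acceptance rejects the rest; passing `temperature` to the draft sampler makes both paths draw from the same distribution.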