disabling reasoning does not work anymore on certain models #20196
Description
Name and Version
llama-cli --version
version: 8233 (c5a778891)
built with GNU 10.5.0 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA
Hardware
Nvidia Tesla V100 32GB
Models
MiniMax-M2.5-IQ4_XS
Problem description & steps to reproduce
Hello,
On a fresh build, I noticed that in llama-server, --reasoning-budget 0 and --chat-template-kwargs {"enable_thinking": false} no longer have any effect with MiniMax M2.5 (unsloth GGUF): the model still emits thinking tokens in its output.
AesSedai/Qwen3.5-35B-A3B-GGUF, for instance, does not exhibit this behavior.
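For reference, the same attempt to disable thinking can be made per request against the server's OpenAI-compatible endpoint. A minimal sketch of the payload (the chat_template_kwargs request field is assumed to be supported by this llama-server build, mirroring the CLI flag; the port matches the log below):

```python
import json

# Request body for POST http://127.0.0.1:8081/v1/chat/completions,
# attempting to disable thinking per request instead of via the CLI flag
# --chat-template-kwargs '{"enable_thinking": false}'.
payload = {
    "messages": [{"role": "user", "content": "Hello"}],
    "chat_template_kwargs": {"enable_thinking": False},
}

body = json.dumps(payload)
print(body)
```

With the reported bug, the response content still begins with thinking tokens regardless of this setting.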
First Bad Commit
No response
Relevant log output
Logs
llama-server output showing that thinking is still enabled despite the flag:
llama-server -m MiniMax-M2.5-IQ4_XS-00001-of-00004.gguf -ngl 99 -b 2048 -ub 2048 --jinja --reasoning-budget 0
[...]
init: chat template, example_format: ']~b]system
You are a helpful assistant[e~[
]~b]user
Hello[e~[
]~b]ai
Hi there[e~[
]~b]user
How are you?[e~[
]~b]ai
<think>
'
srv init: init: chat template, thinking = 0
main: model loaded
main: server is listening on http://127.0.0.1:8081
main: starting the main loop...
srv update_slots: all slots are idle
srv log_server_r: done request: GET / 127.0.0.1 200
srv params_from_: Chat format: peg-native
slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 0 | processing task, is_child = 0
slot update_slots: id 3 | task 0 | new prompt, n_ctx_slot = 180736, n_keep = 0, task.n_tokens = 39
slot update_slots: id 3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot init_sampler: id 3 | task 0 | init sampler, took 0.03 ms, tokens: text = 39, total = 39
slot update_slots: id 3 | task 0 | prompt processing done, n_tokens = 39, batch.n_tokens = 39
slot print_timing: id 3 | task 0 |
prompt eval time = 882.74 ms / 39 tokens ( 22.63 ms per token, 44.18 tokens per second)
eval time = 664.85 ms / 38 tokens ( 17.50 ms per token, 57.16 tokens per second)
total time = 1547.58 ms / 77 tokens
slot release: id 3 | task 0 | stop processing: n_tokens = 76, truncated = 0
srv update_slots: all slots are idle
srv init: init: chat template, thinking = 0