disabling reasoning does not work anymore on certain models #20196

@gelim

Description

Name and Version

llama-cli  --version
version: 8233 (c5a778891)
built with GNU 10.5.0 for Linux x86_64

Operating systems

Linux

GGML backends

CUDA

Hardware

Nvidia Tesla V100 32GB

Models

MiniMax-M2.5-IQ4_XS

Problem description & steps to reproduce

Hello,

On a fresh build, I noticed that with MiniMax M2.5 (unsloth GGUF), neither `--reasoning-budget 0` nor `--chat-template-kwargs '{"enable_thinking": false}'` has any effect on llama-server anymore: the model still emits thinking tokens in its output.
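For reference, this is the request shape I would expect to suppress thinking at the request level as well (whether llama-server honors a per-request `chat_template_kwargs` depends on the build; the model name and message are just placeholders):

```python
import json

# Illustrative /v1/chat/completions payload. "chat_template_kwargs" is the
# request-level counterpart of the --chat-template-kwargs CLI flag; with
# the bug described here, neither path disables thinking for MiniMax M2.5.
payload = {
    "model": "MiniMax-M2.5-IQ4_XS",
    "messages": [{"role": "user", "content": "Hello"}],
    "chat_template_kwargs": {"enable_thinking": False},
}
print(json.dumps(payload))
```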

AesSedai/Qwen3.5-35B-A3B-GGUF does not exhibit this behavior, for instance.

First Bad Commit

No response

Relevant log output

Logs

From the llama-server output I can see that thinking is still enabled:

llama-server  -m MiniMax-M2.5-IQ4_XS-00001-of-00004.gguf -ngl 99 -b 2048 -ub 2048 --jinja --reasoning-budget 0
[...]
init: chat template, example_format: ']~b]system
You are a helpful assistant[e~[
]~b]user
Hello[e~[
]~b]ai
Hi there[e~[
]~b]user
How are you?[e~[
]~b]ai
<think>
'
srv          init: init: chat template, thinking = 0
main: model loaded
main: server is listening on http://127.0.0.1:8081
main: starting the main loop...
srv  update_slots: all slots are idle
srv  log_server_r: done request: GET / 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  3 | task 0 | processing task, is_child = 0
slot update_slots: id  3 | task 0 | new prompt, n_ctx_slot = 180736, n_keep = 0, task.n_tokens = 39
slot update_slots: id  3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot init_sampler: id  3 | task 0 | init sampler, took 0.03 ms, tokens: text = 39, total = 39
slot update_slots: id  3 | task 0 | prompt processing done, n_tokens = 39, batch.n_tokens = 39
slot print_timing: id  3 | task 0 |
prompt eval time =     882.74 ms /    39 tokens (   22.63 ms per token,    44.18 tokens per second)
       eval time =     664.85 ms /    38 tokens (   17.50 ms per token,    57.16 tokens per second)
      total time =    1547.58 ms /    77 tokens
slot      release: id  3 | task 0 | stop processing: n_tokens = 76, truncated = 0
srv  update_slots: all slots are idle
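Note that the `example_format` above ends with an open `<think>` tag even though the log also prints `thinking = 0`, which suggests the flag is not reaching the template. As a minimal mock (not the actual MiniMax Jinja template, just an illustration of the usual gating pattern), a template would normally only append the reasoning prefix when thinking is enabled:

```python
# Minimal mock of how chat templates typically gate the reasoning prefix
# on enable_thinking: with thinking disabled, the generation prompt must
# not end in an open "<think>" tag, since that is what steers the model
# into emitting reasoning tokens.
def render_generation_prompt(enable_thinking: bool) -> str:
    prompt = "]~b]ai\n"  # assistant-turn opener seen in the log above
    if enable_thinking:
        prompt += "<think>\n"  # model continues inside a thinking block
    return prompt
```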

