## Problem

The FORCED_OPEN workaround in `chat-diff-analyzer.cpp` (lines 28-42) catches templates with `content.split('</think>')` that lack `reasoning_content`, and sets `reasoning_mode::FORCED_OPEN`. This workaround was designed for old Qwen/DeepSeek thinking templates, but it also matches NVIDIA-Nemotron-Nano-9B-v2, which supports per-message thinking toggling via `/no_think`.

For Nemotron-Nano-v2, this causes 100% of streaming SSE chunks to carry `reasoning_content` instead of `content`, because:
- The FORCED_OPEN PEG parser (`optional(literal(start)) + reasoning(until(end)) + end`) makes the reasoning block mandatory
- In lenient (streaming) mode, `until("</think>")` returns `NEED_MORE_INPUT` when `</think>` hasn't appeared yet
- `NEED_MORE_INPUT` propagates through the AST, tagging all accumulated output as reasoning
- When `</think>` never appears (e.g., thinking exceeds `max_tokens`), every token is classified as `reasoning_content`
This breaks OpenAI-compatible clients that don't handle `reasoning_content` in streaming deltas.
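The lenient-mode failure described above can be sketched in miniature. This is a hypothetical illustration, not the actual llama.cpp combinator API: a lenient `until(end)` cannot rule out that the closing tag arrives in a later chunk, so it reports `NEED_MORE_INPUT` and the caller keeps tagging everything seen so far as reasoning.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the lenient until(end) behavior (names and shape
// are illustrative, not the real PEG parser in chat-auto-parser-generator).
enum class parse_result { MATCH, NEED_MORE_INPUT };

// In streaming mode the input is a prefix of the final output. If `end`
// is absent, it may still arrive in a future chunk, so we cannot commit:
// we return NEED_MORE_INPUT, and accumulated text stays classified as
// reasoning. If </think> never arrives, this never resolves.
parse_result until_lenient(const std::string & input, const std::string & end) {
    return input.find(end) == std::string::npos
        ? parse_result::NEED_MORE_INPUT
        : parse_result::MATCH;
}
```

This is why the failure mode is total rather than partial: every intermediate chunk is a prefix without `</think>`, so every delta is emitted as `reasoning_content`.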
## Affected models

Only NVIDIA-Nemotron-Nano-9B-v2. Other templates that trigger the workaround (DeepSeek-R1 variants, QwQ, rwkv-world) are unaffected because they don't have `/no_think` toggling.
## Proposed fix
Two changes (tested and verified):
1. `common/chat-diff-analyzer.cpp` — Exclude `/no_think` templates from the FORCED_OPEN workaround. The autoparser can't reliably handle templates where thinking is toggled per-message via template logic.

   ```cpp
   if (tmpl.src.find("content.split('</think>')") != std::string::npos &&
       tmpl.src.find("reasoning_content") == std::string::npos &&
       tmpl.src.find("no_think") == std::string::npos && // NEW
       analysis.reasoning.mode == reasoning_mode::NONE) {
   ```
2. `common/chat-auto-parser-generator.cpp` — Make the FORCED_OPEN reasoning block fully optional (defensive; matches TAG_BASED behavior). Currently only the start tag is optional; the reasoning+end part is mandatory. This is a no-op in the current always-lenient architecture but makes FORCED_OPEN consistent with TAG_BASED.

   ```cpp
   // Before:
   return p.optional(p.literal(start)) + p.reasoning(p.until(end)) + end;
   // After:
   return p.optional(p.optional(p.literal(start)) + p.reasoning(p.until(end)) + end);
   ```
## Testing
Verified with NVIDIA-Nemotron-Nano-9B-v2 (bf16 and q4_k_m) and DeepSeek-R1-Distill-Llama-8B (q4_k_m) on GB200:
- Nemotron: previously 100% failure → now passes
- DeepSeek-R1: no regression
## Related