
Autoparser misclassifies all output as reasoning for templates with /no_think toggling (Nemotron-Nano-9B-v2) #20754

@janbernloehr

Problem

The FORCED_OPEN workaround in chat-diff-analyzer.cpp (lines 28-42) matches templates whose source contains content.split('</think>') but no reference to reasoning_content, and sets the reasoning mode to reasoning_mode::FORCED_OPEN. The workaround was designed for old Qwen/DeepSeek thinking templates, but it also matches NVIDIA-Nemotron-Nano-9B-v2, which supports per-message thinking toggling via /no_think.

For Nemotron-Nano-v2, this causes every streaming SSE chunk to carry reasoning_content instead of content, because:

  1. In the FORCED_OPEN PEG parser (optional(literal(start)) + reasoning(until(end)) + end), only the start tag is optional; the reasoning block and the end tag are mandatory
  2. In lenient (streaming) mode, until("</think>") returns NEED_MORE_INPUT when </think> hasn't appeared yet
  3. NEED_MORE_INPUT propagates through the AST, tagging all accumulated output as reasoning
  4. When </think> never appears (e.g., thinking exceeds max_tokens), every token is classified as reasoning_content
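
The four steps above can be illustrated with a small standalone model. This is only a sketch of the described behavior, not the PEG parser from llama.cpp: split_result and classify_forced_open are hypothetical names, and the lenient NEED_MORE_INPUT case is approximated by simply leaving all accumulated text tagged as reasoning.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: the part of the output tagged as reasoning
// and the part tagged as regular content.
struct split_result {
    std::string reasoning;
    std::string content;
};

// Toy model of a forced-open reasoning block: an optional start tag, then
// everything up to the end tag is reasoning, and the remainder is content.
// Assumption: when the end tag has not been seen yet (the NEED_MORE_INPUT
// case in the issue), all accumulated text stays tagged as reasoning.
split_result classify_forced_open(const std::string & text,
                                  const std::string & start = "<think>",
                                  const std::string & end   = "</think>") {
    std::size_t pos = 0;
    if (text.compare(0, start.size(), start) == 0) {
        pos = start.size();  // optional start tag was present, skip it
    }
    const std::size_t end_pos = text.find(end, pos);
    if (end_pos == std::string::npos) {
        // End tag never seen: nothing can be classified as content.
        return { text.substr(pos), "" };
    }
    return { text.substr(pos, end_pos - pos),
             text.substr(end_pos + end.size()) };
}
```

In this model, classify_forced_open("The answer is 4.") puts the whole string into reasoning and leaves content empty, which mirrors the symptom reported for responses that never emit </think>.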

This breaks OpenAI-compatible clients that don't handle reasoning_content in streaming deltas.

Affected models

Only NVIDIA-Nemotron-Nano-9B-v2 is affected. Other templates that trigger the workaround (DeepSeek-R1 variants, QwQ, rwkv-world) are unaffected because they don't have /no_think toggling.

Proposed fix

Two changes (tested and verified):

1. common/chat-diff-analyzer.cpp: Exclude /no_think templates from the FORCED_OPEN workaround. The autoparser can't reliably handle templates where thinking is toggled per-message via template logic.

if (tmpl.src.find("content.split('</think>')") != std::string::npos &&
    tmpl.src.find("reasoning_content") == std::string::npos &&
    tmpl.src.find("no_think") == std::string::npos &&  // NEW
    analysis.reasoning.mode == reasoning_mode::NONE) {
    // ... existing workaround body, unchanged ...
}
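
To make the effect of the extra condition easier to check in isolation, the same substring tests can be pulled into a standalone predicate. This is a sketch only: matches_forced_open_workaround is a hypothetical helper, the boolean parameter stands in for the analysis.reasoning.mode == reasoning_mode::NONE check, and the template strings in the usage note are made-up fragments rather than real model templates.

```cpp
#include <string>

// Hypothetical standalone version of the condition from the proposed fix.
// src is the chat template source; reasoning_mode_is_none stands in for
// the check that no reasoning mode has been detected yet.
bool matches_forced_open_workaround(const std::string & src,
                                    bool reasoning_mode_is_none) {
    const bool splits_on_end_tag =
        src.find("content.split('</think>')") != std::string::npos;
    const bool mentions_reasoning_content =
        src.find("reasoning_content") != std::string::npos;
    const bool mentions_no_think =  // the NEW exclusion
        src.find("no_think") != std::string::npos;

    return splits_on_end_tag &&
           !mentions_reasoning_content &&
           !mentions_no_think &&
           reasoning_mode_is_none;
}
```

For a fragment such as "{{ content.split('</think>')[-1] }}" the predicate returns true, while adding a "/no_think" check anywhere in the same source makes it return false, so such a template would no longer be forced into FORCED_OPEN.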

2. common/chat-auto-parser-generator.cpp: Make the whole FORCED_OPEN reasoning block optional. Currently only the start tag is optional, while the reasoning and end tag are mandatory. This is a defensive change: it is a no-op in the current always-lenient architecture, but it makes FORCED_OPEN consistent with TAG_BASED.

// Before:
return p.optional(p.literal(start)) + p.reasoning(p.until(end)) + end;
// After:
return p.optional(p.optional(p.literal(start)) + p.reasoning(p.until(end)) + end);

Testing

Verified with NVIDIA-Nemotron-Nano-9B-v2 (bf16 and q4_k_m) and DeepSeek-R1-Distill-Llama-8B (q4_k_m) on GB200:

  • Nemotron: previously 100% failure → now passes
  • DeepSeek-R1: no regression
