## Problem

The FORCED_OPEN workaround in `chat-diff-analyzer.cpp` (lines 28-42) catches templates with `content.split('</think>')` that lack `reasoning_content`, and sets `reasoning_mode::FORCED_OPEN`. This workaround was designed for old Qwen/DeepSeek thinking templates, but it also matches NVIDIA-Nemotron-Nano-9B-v2, which supports per-message thinking toggling via `/no_think`.

For Nemotron-Nano-v2, this causes 100% of streaming SSE chunks to carry `reasoning_content` instead of `content`, because:
- The FORCED_OPEN PEG parser (`optional(literal(start)) + reasoning(until(end)) + end`) makes the reasoning block mandatory
- In lenient (streaming) mode, `until("</think>")` returns `NEED_MORE_INPUT` when `</think>` hasn't appeared yet
- `NEED_MORE_INPUT` propagates through the AST, tagging all accumulated output as reasoning
- When `</think>` never appears (e.g., thinking exceeds `max_tokens`), every token is classified as `reasoning_content`
This breaks OpenAI-compatible clients that don't handle `reasoning_content` in streaming deltas.
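The lenient-mode failure described above can be sketched in miniature. This is a hypothetical illustration, not the actual llama.cpp combinator API: a lenient `until(end)` cannot rule out that the closing tag arrives in a later chunk, so it reports `NEED_MORE_INPUT` and the caller keeps tagging everything seen so far as reasoning.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the lenient until(end) behavior (names and shape
// are illustrative, not the real PEG parser in chat-auto-parser-generator).
enum class parse_result { MATCH, NEED_MORE_INPUT };

// In streaming mode the input is a prefix of the final output. If `end`
// is absent, it may still arrive in a future chunk, so we cannot commit:
// we return NEED_MORE_INPUT, and accumulated text stays classified as
// reasoning. If </think> never arrives, this never resolves.
parse_result until_lenient(const std::string & input, const std::string & end) {
    return input.find(end) == std::string::npos
        ? parse_result::NEED_MORE_INPUT
        : parse_result::MATCH;
}
```

This is why the failure mode is total rather than partial: every intermediate chunk is a prefix without `</think>`, so every delta is emitted as `reasoning_content`.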
## Affected models

Only NVIDIA-Nemotron-Nano-9B-v2. Other templates that trigger the workaround (DeepSeek-R1 variants, QwQ, rwkv-world) are unaffected because they don't have `/no_think` toggling.
## Proposed fix
Two changes (tested and verified):
1. `common/chat-diff-analyzer.cpp` — Exclude `/no_think` templates from the FORCED_OPEN workaround. The autoparser can't reliably handle templates where thinking is toggled per-message via template logic.

   ```cpp
   if (tmpl.src.find("content.split('</think>')") != std::string::npos &&
       tmpl.src.find("reasoning_content") == std::string::npos &&
       tmpl.src.find("no_think") == std::string::npos && // NEW
       analysis.reasoning.mode == reasoning_mode::NONE) {
   ```
2. `common/chat-auto-parser-generator.cpp` — Make the FORCED_OPEN reasoning block fully optional (defensive; matches TAG_BASED behavior). Currently only the start tag is optional; the reasoning+end part is mandatory. This is a no-op in the current always-lenient architecture but makes FORCED_OPEN consistent with TAG_BASED.

   ```cpp
   // Before:
   return p.optional(p.literal(start)) + p.reasoning(p.until(end)) + end;
   // After:
   return p.optional(p.optional(p.literal(start)) + p.reasoning(p.until(end)) + end);
   ```
## Testing
Verified with NVIDIA-Nemotron-Nano-9B-v2 (bf16 and q4_k_m) and DeepSeek-R1-Distill-Llama-8B (q4_k_m) on GB200:
- Nemotron: previously 100% failure → now passes
- DeepSeek-R1: no regression
## Related