@dranger003
Contributor

Fixes a regression introduced by #17786 where models with complex tokenizer patterns (e.g., gpt-4o) fail to load on Windows/MSVC with error_stack.

MSVC's std::regex has severe stack limitations with nested lookaheads. The nosubs | optimize flags exacerbate this, causing failure during pattern compilation.

This PR uses #ifdef _MSC_VER to apply the flags only on platforms where they're beneficial (GCC/Clang on Linux/macOS) while preserving the original behaviour on Windows.

Closes #17830

@dranger003
Contributor Author

dranger003 commented Dec 6, 2025

The problem: The original fix works on Linux/GCC but breaks Windows/MSVC because MSVC's std::regex implementation is fundamentally flawed for complex patterns.

Why MSVC Fails

The gpt-oss tokenizer regex is:

[^\r\n\p{L}\p{N}]?((?=[\p{L}])([^a-z]))*((?=[\p{L}])([^A-Z]))+(?:'[sS]|...|'[dD])?|...

This pattern contains:

  1. Multiple lookaheads: (?=[\p{L}])
  2. Nested capturing groups: ((?=...)([^a-z]))*
  3. Complex Unicode properties: \p{L}, \p{N}
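To make that structure concrete, here is a rough ASCII stand-in for the first alternative of the pattern. This is only a sketch: std::regex has no \p{...} support, so [A-Za-z] is assumed in place of \p{L}, and the names k_simplified_pattern, matches_word and capture_group_count are hypothetical, not taken from the codebase.

```cpp
#include <regex>
#include <string>

// ASCII stand-in for the first alternative of the tokenizer pattern.
// Assumption: [A-Za-z] replaces \p{L}; this is not the real pattern.
static const char * const k_simplified_pattern =
    "((?=[A-Za-z])([^a-z]))*((?=[A-Za-z])([^A-Z]))+";

// True if the whole string is zero or more "letter that is not lowercase"
// followed by one or more "letter that is not uppercase".
static bool matches_word(const std::string & text) {
    static const std::regex re(k_simplified_pattern);
    return std::regex_match(text, re);
}

// Number of capturing groups the engine tracks with the default flags.
static unsigned capture_group_count() {
    return std::regex(k_simplified_pattern).mark_count();
}
```

Each repeated group pairs a lookahead with a nested capturing group, so the engine has to evaluate an assertion and track a capture for every character it consumes. With the default flags the sketch reports four capturing groups; nosubs asks the engine to treat them as non-capturing.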

MSVC's std::regex implementation uses recursive backtracking that can easily exhaust the call stack. Its matcher (std::_Matcher) recurses through deeply nested method calls, and this is a known design issue that causes stack overflows for complicated expressions.

The error_stack error code specifically means "There was insufficient memory to determine whether the regular expression could match the specified character sequence."
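For anyone trying to reproduce this, a minimal sketch of how the failure can be told apart from other regex errors. The helper name try_compile is hypothetical and this is not how the loader reports the error; it only shows that the code is available through std::regex_error::code().

```cpp
#include <regex>
#include <string>

// Hypothetical helper: compiles a pattern and returns "ok" or a short label
// describing why std::regex rejected it.
static std::string try_compile(const std::string & pattern,
                               std::regex::flag_type flags) {
    try {
        std::regex re(pattern, flags);
        (void) re; // only the construction matters here
        return "ok";
    } catch (const std::regex_error & e) {
        if (e.code() == std::regex_constants::error_stack) {
            return "error_stack";
        }
        if (e.code() == std::regex_constants::error_complexity) {
            return "error_complexity";
        }
        return std::string("regex_error: ") + e.what();
    }
}
```

Calling it once with std::regex_constants::ECMAScript and once with ECMAScript | nosubs | optimize should show whether the flags alone change the outcome on a given toolchain.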

| Platform | Without PR #17786 | With PR #17786 |
| --- | --- | --- |
| Linux/GCC | Crashes on repetitive input at runtime | ✅ Works |
| Windows/MSVC | ✅ Works for normal input | Crashes at model load |

The nosubs and optimize flags change how MSVC's regex engine processes the pattern. Paradoxically, these "optimization" flags seem to trigger a different code path in MSVC that hits the stack limit earlier during pattern preparation/compilation rather than during matching.

Solutions

  1. Add a preprocessor conditional to only apply the nosubs|optimize flags on non-MSVC platforms.

  2. Replace std::regex with a more robust regex library (like RE2, PCRE2, etc.) - but this introduces potentially unwanted external dependencies.
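Option 1 can be sketched roughly as follows. This is an illustration of the approach, not the actual diff: the function name make_tokenizer_regex is hypothetical, and the real code may construct the regex differently.

```cpp
#include <regex>
#include <string>

// Hypothetical helper, not the actual code from this PR.
static std::regex make_tokenizer_regex(const std::string & pattern) {
#ifdef _MSC_VER
    // MSVC: keep the default flags, i.e. the behaviour before #17786.
    const auto flags = std::regex_constants::ECMAScript;
#else
    // Other compilers: skip capture tracking and ask for a matcher
    // optimized for repeated use.
    const auto flags = std::regex_constants::ECMAScript
                     | std::regex_constants::nosubs
                     | std::regex_constants::optimize;
#endif
    return std::regex(pattern, flags);
}
```

Call sites stay the same on every platform; only the flag set differs, so the change is small and easy to revert if MSVC's implementation improves.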


Development

Successfully merging this pull request may close these issues.

Regression: PR #17786 breaks model loading on Windows/MSVC for models with complex tokenizer regex
