feat/deepseek-v4-tokenizer#981
Merged
Conversation
This was referenced May 15, 2026 (referencing item closed)
Owner
Thanks! #978 is in — go ahead and rebase whenever you're ready, the conflict in …
- Update tokenizer data to use DeepSeek V4 vocabulary
- Update token IDs for `<think>` and `</think>` special tokens
- Update test suites to reflect V4 tokenization behavior and CJK compression characteristics
- Tokenizer from: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/tokenizer.json
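The special-token update above amounts to a vocabulary lookup. A minimal sketch of such a lookup is shown below — the `<think>`/`</think>` strings come from the commit message, but the helper name and the ID values are placeholders, not the actual DeepSeek V4 data:

```typescript
// Hypothetical helper: resolve a special token to its vocabulary ID.
// The IDs below are placeholders, not the real V4 token IDs.
function specialTokenId(
  vocab: ReadonlyMap<string, number>,
  token: string,
): number {
  const id = vocab.get(token);
  if (id === undefined) {
    throw new Error(`unknown special token: ${token}`);
  }
  return id;
}

// Placeholder vocabulary fragment shaped like tokenizer.json's added tokens.
const vocab = new Map<string, number>([
  ["<think>", 1001],   // placeholder ID
  ["</think>", 1002],  // placeholder ID
]);
```

Failing loudly on an unknown token is deliberate: a silent fallback ID would corrupt every downstream token-count estimate.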
- Reference: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/encoding/encoding_dsv4.py
- Implement `formatDeepSeekPrompt` to include BOS, EOS, and role-specific separators (`<|User|>`, `<|Assistant|>`)
- Add support for DeepSeek Machine Learning Language (DSML) tool-calling syntax in prompt estimation
- Update `estimateRequestTokens` to account for chat template framing and tool schema overhead
- Refactor `ContextManager` to use more accurate token counting for message folding logic
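The role-separator framing described above can be sketched as follows. `formatDeepSeekPrompt` and the `<|User|>`/`<|Assistant|>` separators are named in the commit message; the function body and the parameterized BOS/EOS strings are illustrative assumptions, not the PR's actual implementation:

```typescript
// Hedged sketch of V4 chat-template framing. BOS/EOS are passed in
// rather than hardcoded, since their exact strings aren't in the PR text.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

function formatDeepSeekPrompt(
  messages: ChatMessage[],
  bos: string,
  eos: string,
): string {
  // Frame each turn with its role separator; close assistant turns with EOS.
  const body = messages
    .map((m) =>
      m.role === "user"
        ? `<|User|>${m.content}`
        : `<|Assistant|>${m.content}${eos}`,
    )
    .join("");
  return bos + body;
}
```

Because the separators and EOS land in the prompt, a token estimator that ignores this framing undercounts every turn — which is exactly the `estimateRequestTokens` overhead adjustment described above.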
- Update `V4Message` interface to include `_textParts` for better content tracking
- Improve `mergeToolMessages` to handle both text parts and tool blocks during message folding
- Simplify JSDoc comments for better readability
- Fix type casting in message merging to ensure consistency with new internal properties
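A minimal sketch of the folding behavior described above: `V4Message`, `_textParts`, and `mergeToolMessages` are named in the commit message, but the field shapes and merge rule here are assumptions for illustration:

```typescript
// Hypothetical message shape; only the fields needed for the sketch.
interface V4Message {
  role: "user" | "assistant" | "tool";
  content: string;
  _textParts?: string[]; // internal tracking of merged text segments, per the PR
}

// Fold runs of adjacent tool-result messages into a single message,
// accumulating their text in _textParts.
function mergeToolMessages(messages: V4Message[]): V4Message[] {
  const merged: V4Message[] = [];
  for (const msg of messages) {
    const prev = merged[merged.length - 1];
    if (prev && prev.role === "tool" && msg.role === "tool") {
      prev._textParts = [...(prev._textParts ?? [prev.content]), msg.content];
      prev.content = prev._textParts.join("\n");
    } else {
      merged.push({ ...msg }); // copy so callers' messages aren't mutated
    }
  }
  return merged;
}
```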
Force-pushed from 330390f to 0fe0bed
Contributor (Author)
Rebase done; the conflicting sections follow the main branch exactly.
What
Upgrade the tokenizer from DeepSeek V3 to V4 chat template format, update pricing constants, and adjust context-manager fold ordering to use content-only token counts.
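The fold-ordering change can be illustrated with a hedged sketch — the function name, message shape, and ranking rule below are assumptions, not the PR's actual `ContextManager` code. The point is only that fold candidates are ranked by the tokens in their content alone, excluding chat-template framing:

```typescript
// Hypothetical message shape for the sketch.
interface Msg {
  id: number;
  content: string;
}

// Rank messages for auto-folding by content-only token count,
// largest first. countTokens stands in for the real tokenizer.
function foldOrder(
  messages: Msg[],
  countTokens: (s: string) => number,
): Msg[] {
  return [...messages].sort(
    (a, b) => countTokens(b.content) - countTokens(a.content),
  );
}
```

Counting only content keeps the ordering stable across template changes: the framing overhead is per-request, so including it would bias the ranking without changing which messages are actually cheapest to keep.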
Why
DeepSeek V4 uses a different chat template (DSML tool-call framing, `\u1f60` generation suffix, merged tool-result blocks). The V3-only tokenizer produces inaccurate `prompt_tokens` estimates for V4 models (`deepseek-v4-flash`, `deepseek-v4-pro`), causing wrong context-window calculations and premature or missed auto-folding.
How to verify
- `npm run verify` — lint, typecheck, tests, comment-policy all pass
- `npx tsx src/tokenizer.ts` (if there's a smoke entry) or run `reasonix chat` with a V4 model and observe correct token estimates
Checklist
- `npm run verify` passes locally (lint + typecheck + tests + comment-policy gate)
- `Co-Authored-By: Claude` trailer in commits
- `CHANGELOG.md` — release notes are maintainer-written at release time
Reference:
https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/tokenizer.json
https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/encoding/encoding_dsv4.py
Closes #982