
feat/deepseek-v4-tokenizer#981

Merged
esengine merged 3 commits into esengine:main from ADX15xs:feat/deepseek-v4-tokenizer
May 16, 2026
Conversation

@ADX15xs (Contributor) commented May 15, 2026

What

Upgrade the tokenizer from DeepSeek V3 to V4 chat template format, update pricing constants, and adjust context-manager fold ordering to use content-only token counts.

Why

DeepSeek V4 uses a different chat template (DSML tool-call framing, \u1f60 generation suffix, merged tool-result blocks). The V3-only tokenizer produces inaccurate prompt_tokens estimates for V4 models (deepseek-v4-flash, deepseek-v4-pro), causing wrong context-window calculations and premature or missed auto-folding.
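To illustrate why template framing changes the estimate, here is a minimal sketch of an `estimateRequestTokens`-style calculation. The overhead constants, the 4-chars-per-token heuristic, and the message shape are all illustrative assumptions, not the real V4 values from this PR.

```typescript
// Hypothetical sketch: content-only counting vs. template-aware counting.
// All numeric overheads below are illustrative assumptions.

interface ChatMessage {
  role: "user" | "assistant" | "tool";
  content: string;
}

// Rough content-only heuristic: ~4 characters per token (assumed).
function contentTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// A template-aware estimate adds fixed framing overhead per message
// (role separators, tool-result wrappers) plus a generation suffix.
function estimateRequestTokens(messages: ChatMessage[]): number {
  const PER_MESSAGE_OVERHEAD = 3; // e.g. role separator tokens (assumed)
  const TOOL_BLOCK_OVERHEAD = 8;  // e.g. DSML tool-result framing (assumed)
  const GENERATION_SUFFIX = 1;    // trailing generation marker (assumed)

  let total = GENERATION_SUFFIX;
  for (const m of messages) {
    total += contentTokens(m.content) + PER_MESSAGE_OVERHEAD;
    if (m.role === "tool") total += TOOL_BLOCK_OVERHEAD;
  }
  return total;
}

const msgs: ChatMessage[] = [
  { role: "user", content: "What is the capital of France?" },
  { role: "tool", content: '{"answer":"Paris"}' },
];
console.log(estimateRequestTokens(msgs));
```

A V3-only estimator that skips the tool-block framing would undercount exactly this kind of request, which is what leads to the premature or missed auto-folding described above.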

How to verify

  1. npm run verify — lint, typecheck, tests, comment-policy all pass
  2. npx tsx src/tokenizer.ts (if there's a smoke entry) or run reasonix chat with a V4 model and observe correct token estimates

Checklist

  • npm run verify passes locally (lint + typecheck + tests + comment-policy gate)
  • No Co-Authored-By: Claude trailer in commits
  • Comments follow CONTRIBUTING.md (no module-essay headers, no incident history)
  • No edits to CHANGELOG.md — release notes are maintainer-written at release time

Reference:
https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/tokenizer.json
https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/encoding/encoding_dsv4.py

Closes #982

This PR conflicts with #978 over the official price value in src/telemetry/stats.ts, but the conflict is easy to resolve.
I recommend merging that PR first; I will then rebase this branch and resolve the conflicts.
Alternatively, if @esengine prefers a different order, we can adjust.

@ADX15xs ADX15xs changed the title Feat/deepseek v4 tokenizer feat/deepseek v4 tokenizer May 15, 2026
@ADX15xs ADX15xs changed the title feat/deepseek v4 tokenizer feat/deepseek-v4-tokenizer May 15, 2026
@esengine (Owner) commented:

Thanks! #978 is in — go ahead and rebase whenever you're ready, the conflict in src/telemetry/stats.ts should be trivial (keep the new constants from main). The tokenizer work itself looks great at a glance; I'll do a proper pass once the rebase lands.

ADX15xs added 3 commits May 16, 2026 11:15
- Update tokenizer data to use DeepSeek V4 vocabulary
- Update token IDs for `<think>` and `</think>` special tokens
- Update test suites to reflect V4 tokenization behavior and CJK compression characteristics
- Tokenizer source: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/tokenizer.json
- Reference: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/encoding/encoding_dsv4.py
- Implement `formatDeepSeekPrompt` to include BOS, EOS, and role-specific separators (`<|User|>`, `<|Assistant|>`)
- Add support for DeepSeek Machine Learning Language (DSML) tool-calling syntax in prompt estimation
- Update `estimateRequestTokens` to account for chat template framing and tool schema overhead
- Refactor `ContextManager` to use more accurate token counting for message folding logic
- Update `V4Message` interface to include `_textParts` for better content tracking
- Improve `mergeToolMessages` to handle both text parts and tool blocks during message folding
- Simplify JSDoc comments for better readability
- Fix type casting in message merging to ensure consistency with new internal properties
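The role framing named in the commit list above can be sketched roughly as follows. The `<|User|>`/`<|Assistant|>` separators come from the commit message; the exact BOS/EOS marker strings and the turn-closing rule here are illustrative assumptions, not the implementation in this PR.

```typescript
// Hypothetical sketch of formatDeepSeekPrompt-style role framing.
// BOS/EOS marker names are assumed for illustration.

interface PromptMessage {
  role: "user" | "assistant";
  content: string;
}

const BOS = "<|begin_of_sentence|>"; // assumed marker name
const EOS = "<|end_of_sentence|>";   // assumed marker name

function formatDeepSeekPrompt(messages: PromptMessage[]): string {
  let out = BOS;
  for (const m of messages) {
    const sep = m.role === "user" ? "<|User|>" : "<|Assistant|>";
    out += sep + m.content;
    // Assumed rule: assistant turns are closed with EOS.
    if (m.role === "assistant") out += EOS;
  }
  // End with the assistant separator so the model generates the reply.
  return out + "<|Assistant|>";
}

console.log(formatDeepSeekPrompt([{ role: "user", content: "Hi" }]));
```

Counting tokens over this framed string, rather than over the raw message contents, is what makes the prompt_tokens estimate line up with what the V4 models actually bill.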
@ADX15xs ADX15xs force-pushed the feat/deepseek-v4-tokenizer branch from 330390f to 0fe0bed on May 16, 2026 03:19
@ADX15xs (Contributor, Author) commented May 16, 2026

Rebase done; the conflicting sections follow the main branch exactly.

@esengine esengine merged commit 30bdc51 into esengine:main May 16, 2026
5 checks passed
@ADX15xs ADX15xs deleted the feat/deepseek-v4-tokenizer branch May 16, 2026 04:14
