perf: sort vocab before merge parsing + rebuild WASM with ASYNCIFY #22

Merged
unamedkr merged 1 commit into main from fix/merge-perf-wasm-rebuild on Apr 10, 2026

@unamedkr
Collaborator

Summary

Two fixes that together complete the WASM demo improvements from PRs #20 and #21:

1. Merge parsing performance: ~10 s → ~100 ms

During GGUF BPE merge parsing, str_lookup() fell back to an O(n) linear scan because sorted_indices was built only after the merge loop. For Qwen3 (248K vocab × 50K merges × 3 lookups), this meant ~22 billion string comparisons on every model load.
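
As a rough sanity check on the scale involved, the following sketch computes worst-case comparison counts for a linear scan and for a binary search. The function names are hypothetical and not taken from the codebase, and 248K and 50K are assumed to mean 248000 and 50000.

```c
#include <assert.h>
#include <stdint.h>

/* Worst case for a linear scan: every lookup compares against every entry. */
uint64_t linear_scan_worst_case(uint64_t vocab, uint64_t merges,
                                uint64_t lookups_per_merge) {
    return vocab * merges * lookups_per_merge;
}

/* Worst-case probes for a binary search over n sorted entries:
 * floor(log2(n)) + 1, computed here as the bit length of n. */
unsigned binary_search_probes(uint64_t n) {
    assert(n > 0);
    unsigned probes = 0;
    while (n > 0) {
        n >>= 1;
        probes++;
    }
    return probes;
}

/* Worst case when every lookup is a binary search over the vocab. */
uint64_t binary_search_worst_case(uint64_t vocab, uint64_t merges,
                                  uint64_t lookups_per_merge) {
    return (uint64_t)binary_search_probes(vocab) * merges * lookups_per_merge;
}
```

With a vocab of 248000, 50000 merges, and 3 lookups per merge, the linear bound is 37.2 billion comparisons and the binary-search bound is 2.7 million. The linear figure is an upper bound, since a scan stops at the first match, so the real count is lower, in line with the ~22 billion estimate above.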

Fix: Move the sorted_indices build (qsort) above the merge parsing loop, so that str_lookup() uses binary search during merge parsing. Applied to both tq_tokenizer.c and quant.h.
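
The ordering issue can be illustrated with a minimal sketch. The vocab_t struct, the comparator, and build_sorted_indices() below are assumptions made for illustration, since the actual definitions in tq_tokenizer.c and quant.h are not shown in this PR. Only the names str_lookup and sorted_indices come from the description.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical minimal vocab; the real tokenizer struct is not shown here. */
typedef struct {
    const char **tokens;   /* token strings, indexed by token id */
    int n_tokens;
    int *sorted_indices;   /* token ids ordered by strcmp; NULL until built */
} vocab_t;

/* qsort() takes no user-data argument, so the comparator reads the token
 * table through a file-scope pointer. */
static const char **g_sort_tokens;

static int cmp_token_ids(const void *a, const void *b) {
    int ia = *(const int *)a, ib = *(const int *)b;
    return strcmp(g_sort_tokens[ia], g_sort_tokens[ib]);
}

/* Build the sorted index. Calling this before the merge loop is what lets
 * every lookup inside the loop take the binary-search path. */
int build_sorted_indices(vocab_t *v) {
    assert(v != NULL && v->n_tokens > 0);
    v->sorted_indices = malloc((size_t)v->n_tokens * sizeof(int));
    if (!v->sorted_indices) return -1;
    for (int i = 0; i < v->n_tokens; i++) v->sorted_indices[i] = i;
    g_sort_tokens = v->tokens;
    qsort(v->sorted_indices, (size_t)v->n_tokens, sizeof(int), cmp_token_ids);
    return 0;
}

/* Returns the token id for s, or -1 if s is not in the vocab. */
int str_lookup(const vocab_t *v, const char *s) {
    if (!v->sorted_indices) {
        /* Index not built yet: O(n) linear scan (the slow path). */
        for (int i = 0; i < v->n_tokens; i++)
            if (strcmp(v->tokens[i], s) == 0) return i;
        return -1;
    }
    /* Index available: O(log n) binary search over the sorted ids. */
    int lo = 0, hi = v->n_tokens - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int id = v->sorted_indices[mid];
        int c = strcmp(v->tokens[id], s);
        if (c == 0) return id;
        if (c < 0) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}
```

Both paths return the same token id, so moving the build earlier changes only the cost of each lookup, not its result. The caller is responsible for freeing sorted_indices.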

2. WASM binary rebuild with ASYNCIFY

PR #20 added -sASYNCIFY to build.sh but never recompiled the binaries. The deployed quant.js/quant.wasm were from a pre-ASYNCIFY build, so wasm_generate_async() didn't exist and the JS silently fell back to the synchronous path (blocking the event loop, tokens appearing all at once).

Fix: Recompiled with emcc 5.0.5 + ASYNCIFY. Verified: strings quant.wasm | grep asyncify returns 5 hits, and grepping for emscripten_sleep returns 1 hit. The binary grew from 197K to 244K (ASYNCIFY stack overhead).
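
For context, a generation loop that yields to the browser between tokens might look roughly like the sketch below. This is not the actual wasm_generate_async() implementation, whose signature is not shown in this PR. The names generate_streaming, next_token, token_log, and log_token are hypothetical. Only emscripten_sleep() and EMSCRIPTEN_KEEPALIVE are real Emscripten APIs, and the yield only works when the module is linked with -sASYNCIFY.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __EMSCRIPTEN__
#include <emscripten.h>
/* With -sASYNCIFY, emscripten_sleep(0) unwinds to the JS event loop and
 * resumes afterwards, giving the browser a chance to repaint. */
#define YIELD_TO_BROWSER() emscripten_sleep(0)
#define EXPORTED EMSCRIPTEN_KEEPALIVE
#else
/* Native builds have no event loop to yield to. */
#define YIELD_TO_BROWSER() ((void)0)
#define EXPORTED
#endif

typedef void (*token_cb)(int token_id, void *user);

/* Hypothetical stand-in for one decode step of the model. */
static int next_token(int step) { return 100 + step; }

/* Example callback that records emitted token ids. */
typedef struct { int ids[16]; int n; } token_log;

void log_token(int token_id, void *user) {
    token_log *log = (token_log *)user;
    if (log->n < 16) log->ids[log->n++] = token_id;
}

/* Emit one token at a time, yielding after each so that tokens appear
 * incrementally instead of all at once. Returns the number produced. */
EXPORTED int generate_streaming(int max_tokens, token_cb on_token, void *user) {
    assert(on_token != NULL);
    int produced = 0;
    for (int i = 0; i < max_tokens; i++) {
        on_token(next_token(i), user);
        produced++;
        YIELD_TO_BROWSER();
    }
    return produced;
}
```

Without the yield, the whole loop runs inside a single JS task, which matches the symptom described above: the event loop is blocked and the tokens appear all at once.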

Impact on models (from analysis)

| Model family | Tokenizer | Merge parsing | Impact |
| --- | --- | --- | --- |
| Gemma 3/4 | SentencePiece | Uses JSON path, not GGUF path | No change |
| Qwen 2.5/3 | BPE (248K vocab) | Now fast + correct | Fixed |
| Llama 3.x | tiktoken BPE (128K) | Now fast + correct | Fixed |
| SmolLM2 | BPE (SentencePiece detect) | Merges now parsed but SPM path used | No regression |
| Phi-3 / Mistral | BPE | Now fast + correct | Fixed |

Test plan

  • Native build passes
  • WASM rebuild succeeds (quant.js 72K, quant.wasm 256K)
  • strings quant.wasm | grep asyncify confirms ASYNCIFY present
  • WASM demo: Qwen3 0.6B loads without long init delay
  • WASM demo: tokens stream in real-time (not all at once)

🤖 Generated with Claude Code

Two changes:

1. Move sorted_indices build before GGUF BPE merge parsing in both
   tq_tokenizer.c and quant.h.  str_lookup() during merge parsing was
   falling back to O(n) linear scan because sorted_indices wasn't built
   yet. For Qwen3 (248K vocab × 50K merges × 3 lookups) this was ~10 s
   of init time. Now uses binary search: ~100 ms.

2. Rebuild quant.js (72K) and quant.wasm (256K) with -sASYNCIFY.
   The previous binaries were compiled before the ASYNCIFY flags were
   added to build.sh, so wasm_generate_async() didn't exist and the
   JS fallback ran the synchronous path (blocking the browser event
   loop, all tokens appearing at once). The new binary contains
   asyncify runtime + emscripten_sleep, enabling real-time per-token
   streaming in the browser demo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@unamedkr unamedkr merged commit c717832 into main Apr 10, 2026
3 checks passed
@unamedkr unamedkr deleted the fix/merge-perf-wasm-rebuild branch April 10, 2026 05:43