perf(wasm): pthreads multi-threading + Service Worker COOP/COEP by unamedkr · Pull Request #27 · quantumaikr/quant.cpp

unamedkr · 2026-04-10T07:53:41Z

Summary

Enable multi-threaded inference in the WASM demo for an expected 3-4x speedup.

Problem

The WASM demo runs at ~0.9 tok/s on a single thread, while the native build reaches 25 tok/s. WASM SIMD (PR #25) helps, but single-threaded execution remains the primary bottleneck.

Solution

Service Worker for COOP/COEP (coi-serviceworker.js):

GitHub Pages doesn't support custom HTTP headers
Service Worker intercepts all responses and injects the required headers
Enables SharedArrayBuffer which pthreads needs
Auto-reloads on first visit to activate the Service Worker
Well-established pattern used by FFmpeg.wasm, SQL.js, etc.
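
The header injection described above can be sketched roughly as follows. This is a minimal illustration, not the contents of coi-serviceworker.js: the helper name `withIsolationHeaders` is hypothetical, and the header values `same-origin` and `require-corp` are the standard ones that make a page cross-origin isolated.

```javascript
// Standard header values that make a page cross-origin isolated.
const ISOLATION_HEADERS = {
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "require-corp",
};

// Hypothetical helper: copy existing [name, value] pairs into a plain object
// and overwrite or add the two isolation headers (names lower-cased).
function withIsolationHeaders(headerEntries) {
  const merged = {};
  for (const [name, value] of headerEntries) {
    merged[name.toLowerCase()] = value;
  }
  for (const [name, value] of Object.entries(ISOLATION_HEADERS)) {
    merged[name.toLowerCase()] = value;
  }
  return merged;
}

// The wiring below only runs inside a Service Worker scope.
if (typeof ServiceWorkerGlobalScope !== "undefined" &&
    self instanceof ServiceWorkerGlobalScope) {
  // Take control as soon as possible so the next page load is intercepted.
  self.addEventListener("install", () => self.skipWaiting());
  self.addEventListener("activate", (event) =>
    event.waitUntil(self.clients.claim()));

  self.addEventListener("fetch", (event) => {
    event.respondWith(
      fetch(event.request).then((response) => {
        // Opaque responses (status 0) cannot be rewritten; pass them through.
        if (response.status === 0) return response;
        return new Response(response.body, {
          status: response.status,
          statusText: response.statusText,
          headers: withIsolationHeaders(response.headers.entries()),
        });
      })
    );
  });
}
```

Keeping the header merge in a pure function makes it checkable outside a browser; the fetch handler itself only matters once the worker controls the page.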

Multi-threaded build:

-pthread + PTHREAD_POOL_SIZE=4
Auto-detects navigator.hardwareConcurrency (capped at 4)
Shows thread count in status message
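
The thread detection could look something like the sketch below. The PR does this inside quant_wasm.c; this version is plain JavaScript for illustration. The function names are hypothetical, and falling back to one thread when the page is not cross-origin isolated is an assumption rather than something the PR states.

```javascript
// Cap taken from the PR description (PTHREAD_POOL_SIZE=4).
const MAX_THREADS = 4;

// Hypothetical helper: choose a thread count from the reported core count.
// Assumption: without cross-origin isolation, fall back to a single thread.
function pickThreadCount(hardwareConcurrency, isolated) {
  if (!isolated) return 1;
  const cores =
    Number.isInteger(hardwareConcurrency) && hardwareConcurrency > 0
      ? hardwareConcurrency
      : 1;
  return Math.min(cores, MAX_THREADS);
}

// Status text modeled on the message quoted in the commit description.
// The singular form for one thread is an assumption.
function statusMessage(threads) {
  const unit = threads === 1 ? "thread" : "threads";
  return `Model loaded! Ready to chat. (${threads} ${unit})`;
}

// Browser usage (skipped where `navigator` is unavailable).
if (typeof navigator !== "undefined" && typeof globalThis.crossOriginIsolated !== "undefined") {
  const n = pickThreadCount(navigator.hardwareConcurrency, globalThis.crossOriginIsolated);
  console.log(statusMessage(n));
}
```

On an 8-core machine with isolation active, `pickThreadCount(8, true)` returns 4, matching the pool size the build preallocates.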

Expected performance

| Config | tok/s (est.) |
|---|---|
| Before (1 thread, no SIMD) | ~0.9 |
| PR #25 (1 thread + SIMD) | ~2-3 |
| This PR (4 threads + SIMD) | ~5-10 |
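
As a quick sanity check on these estimates, the speedup each row implies over the ~0.9 tok/s baseline can be computed directly. The numbers come from the table; the variable and function names are illustrative only.

```javascript
// Baseline and estimated ranges (tok/s) as listed in the table.
const BASELINE_TOK_S = 0.9;
const estimates = {
  "PR #25 (1 thread + SIMD)": [2, 3],
  "This PR (4 threads + SIMD)": [5, 10],
};

// Convert a [low, high] throughput range into a speedup range.
function speedupRange([low, high], baseline = BASELINE_TOK_S) {
  return [low / baseline, high / baseline];
}

for (const [config, range] of Object.entries(estimates)) {
  const [low, high] = speedupRange(range);
  console.log(`${config}: ${low.toFixed(1)}x to ${high.toFixed(1)}x over baseline`);
}
```

For the 4-thread row this works out to roughly 5.6x to 11.1x. All figures are estimates until the speedup is actually measured.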

Files

wasm/coi-serviceworker.js — new, Service Worker for header injection
wasm/build.sh — pthread flags
wasm/quant_wasm.c — thread detection + config
wasm/index.html — Service Worker registration
wasm/quant.{js,wasm} — rebuilt binaries
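
The registration step in index.html, together with the first-visit reload, might be structured like this. It is a sketch under assumptions: the decision function, its state names, and the `coi-reloaded` sessionStorage flag (used to avoid a reload loop) are all hypothetical and not taken from the PR.

```javascript
// Hypothetical decision function: what should the page do next?
function nextIsolationStep({ isolated, serviceWorkerSupported, alreadyReloaded }) {
  if (isolated) return "ready";                        // headers already in effect
  if (!serviceWorkerSupported) return "unsupported";   // stay single-threaded
  if (alreadyReloaded) return "give-up";               // avoid an endless reload loop
  return "register-and-reload";
}

// Browser wiring; skipped entirely outside a window context.
if (typeof window !== "undefined" && typeof navigator !== "undefined") {
  const RELOAD_FLAG = "coi-reloaded"; // hypothetical sessionStorage key
  const step = nextIsolationStep({
    isolated: window.crossOriginIsolated === true,
    serviceWorkerSupported: "serviceWorker" in navigator,
    alreadyReloaded: window.sessionStorage.getItem(RELOAD_FLAG) === "1",
  });

  if (step === "register-and-reload") {
    navigator.serviceWorker
      .register("coi-serviceworker.js")
      // `ready` resolves once a worker is active for this scope.
      .then(() => navigator.serviceWorker.ready)
      .then(() => {
        window.sessionStorage.setItem(RELOAD_FLAG, "1");
        window.location.reload(); // reload so responses pass through the worker
      });
  }
}
```

Separating the decision from the side effects keeps the reload logic easy to reason about: the page reloads at most once per session, and only when isolation is missing and a Service Worker can be registered.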

Test plan

WASM builds with pthreads (quant.js has pthread_create, SharedArrayBuffer)
GitHub Pages: Service Worker activates, crossOriginIsolated === true
Multi-threaded inference: status shows "(4 threads)"
Measurable speedup vs single-threaded build
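
The second check in this plan can be run from the browser console with a small helper along these lines. The function name and the shape of the returned object are illustrative, not part of the PR.

```javascript
// Hypothetical helper: summarize whether threads can work in a given scope.
function isolationReport(scope) {
  const isolated = scope.crossOriginIsolated === true;
  // Browsers generally expose SharedArrayBuffer only to isolated pages.
  const sharedMemory = typeof scope.SharedArrayBuffer === "function";
  return { isolated, sharedMemory, threadsUsable: isolated && sharedMemory };
}

// In a browser console this inspects the current page.
console.log(isolationReport(globalThis));
```

If `threadsUsable` is false after the automatic reload, the Service Worker is most likely not controlling the page yet.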

🤖 Generated with Claude Code

Enable WASM pthreads so inference uses multiple CPU cores in the browser.

Three changes:

1. coi-serviceworker.js: injects Cross-Origin-Opener-Policy and Cross-Origin-Embedder-Policy headers into all responses via Service Worker. This enables SharedArrayBuffer on GitHub Pages and other static hosts that don't support custom HTTP headers. Well-established pattern (used by FFmpeg.wasm, SQL.js, etc.).
2. build.sh: add -pthread, PTHREAD_POOL_SIZE=4, ENVIRONMENT=web,worker. WASM binary now includes multi-threaded libc and pthread support.
3. quant_wasm.c: detect navigator.hardwareConcurrency (capped at 4) and pass to quant_config.n_threads. Model load message shows thread count ("Model loaded! Ready to chat. (4 threads)").

Expected speedup: 3-4x on multi-core devices (most modern laptops). Combined with SIMD128 from PR #25: total 6-12x vs original build.

Binary: 320K → 384K (pthread runtime overhead).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace ASYNCIFY-based streaming with a dedicated Web Worker. Inference runs entirely in the worker thread; tokens stream to the main thread via postMessage(). The main thread never blocks.

Changes:

- inference-worker.js: new Web Worker that loads WASM + runs quant_generate() in a blocking loop, posting each token
- quant_wasm.c: simplified (removed ASYNCIFY, sleep, async variants). Single sync callback posts tokens via EM_JS
- build.sh: removed -sASYNCIFY and ASYNCIFY_IMPORTS. Added -mrelaxed-simd for FMA. Fixed 1GB memory (no growth penalty with pthreads). ALLOW_MEMORY_GROWTH=0
- index.html: generate() sends to worker, receives tokens via onmessage handler. Model loading via transferable ArrayBuffer

Performance impact:

- ASYNCIFY removal: ~30-50% less overhead (no stack unwind/rewind)
- Fixed memory: eliminates pthreads+growth penalty
- Relaxed SIMD: FMA instructions where available
- Binary: 384K → 256K (-33%)

Combined with pthreads (PR #27) and SIMD128 (PR #25): expected total speedup 8-15x vs original single-thread build.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
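
The main-thread side of this worker design might be sketched as below. The message shapes (`{ type: "load" }`, `{ type: "token" }`, `{ type: "done" }`) and both function names are assumptions made for illustration; the commit message only states that tokens arrive via postMessage() and that the model is sent as a transferable ArrayBuffer.

```javascript
// Hypothetical accumulator for messages posted by the inference worker.
function createTokenSink(onUpdate) {
  let text = "";
  let done = false;
  return {
    // Intended to be called from worker.onmessage with event.data.
    handle(message) {
      if (done) return; // ignore anything after completion
      if (message.type === "token") {
        text += message.text;
        onUpdate(text);
      } else if (message.type === "done") {
        done = true;
      }
    },
    get text() { return text; },
    get done() { return done; },
  };
}

// Send model bytes to the worker. Listing the buffer as a transferable moves
// it instead of copying it; the sender's buffer becomes detached afterwards.
function sendModel(worker, buffer) {
  worker.postMessage({ type: "load", model: buffer }, [buffer]);
}
```

In a page, this would be wired up with `const worker = new Worker("inference-worker.js")` and `worker.onmessage = (event) => sink.handle(event.data)`. Because the blocking generate loop lives in the worker, the handler on the main thread only appends text and returns.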

* fix: quantcpp CLI command + default to Llama-3.2-1B (user feedback)

User feedback: "quantcpp command not found" + "garbage text from 135M"

1. Added `quantcpp` CLI entry point (pyproject.toml [project.scripts])
   - `quantcpp "question"`: one-shot
   - `quantcpp`: interactive chat
   - `quantcpp --model path.gguf`: custom model
2. Default model changed from SmolLM2-135M to Llama-3.2-1B
   - 135M produces garbage text, a terrible first impression
   - 1B is 750MB (bigger download) but produces actually useful output
   - SmolLM2-135M still available for bandwidth-constrained users
3. README Quick Start now shows `quantcpp` CLI first, Python second

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* perf(wasm): Web Worker architecture, eliminate ASYNCIFY for max speed

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

unamedkr merged commit 454f664 into main Apr 10, 2026
3 checks passed

unamedkr deleted the perf/wasm-pthreads branch April 10, 2026 07:53

unamedkr mentioned this pull request Apr 10, 2026

perf(wasm): Web Worker + no ASYNCIFY — maximum inference speed #28

Merged

6 tasks
