perf(wasm): pthreads multi-threading + Service Worker COOP/COEP #27

Merged
unamedkr merged 1 commit into main from perf/wasm-pthreads on Apr 10, 2026
@unamedkr
Collaborator

Summary

Enable multi-threaded inference in the WASM demo for an expected 3-4x speedup.

Problem

The WASM demo runs at ~0.9 tok/s on a single thread, while the native build reaches 25 tok/s. WASM SIMD (PR #25) helps, but single-threaded execution remains the primary bottleneck.

Solution

Service Worker for COOP/COEP (coi-serviceworker.js):

  • GitHub Pages doesn't support custom HTTP headers
  • Service Worker intercepts all responses and injects the required headers
  • Enables SharedArrayBuffer which pthreads needs
  • Auto-reloads on first visit to activate the Service Worker
  • Well-established pattern used by FFmpeg.wasm, SQL.js, etc.
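
The header injection described above can be sketched as follows. This is a minimal illustration and not the contents of coi-serviceworker.js from this PR: the helper name `withIsolationHeaders` is hypothetical, the header values (`same-origin`, `require-corp`) are the ones commonly used for cross-origin isolation, and the auto-reload step is left out.

```javascript
// Hypothetical helper: copy a response and add the two headers that
// browsers require before they expose SharedArrayBuffer.
function withIsolationHeaders(response) {
  // Opaque responses (status 0) cannot be re-wrapped, so pass them through.
  if (response.status === 0) return response;

  const headers = new Headers(response.headers);
  headers.set("Cross-Origin-Opener-Policy", "same-origin");
  headers.set("Cross-Origin-Embedder-Policy", "require-corp");

  return new Response(response.body, {
    status: response.status,
    statusText: response.statusText,
    headers,
  });
}

// Register the fetch handler only when running inside a Service Worker,
// so the helper above can also be exercised outside the browser.
if (
  typeof ServiceWorkerGlobalScope !== "undefined" &&
  self instanceof ServiceWorkerGlobalScope
) {
  self.addEventListener("fetch", (event) => {
    event.respondWith(fetch(event.request).then(withIsolationHeaders));
  });
}
```

Keeping the header logic in a plain function makes it easy to check in isolation; the page itself would still need to register the worker and reload once so the first document response also passes through it.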

Multi-threaded build:

  • -pthread + PTHREAD_POOL_SIZE=4
  • Auto-detects navigator.hardwareConcurrency (capped at 4)
  • Shows thread count in status message
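
The thread-count detection above could look roughly like this. It is a sketch under assumptions: `pickThreadCount` is a hypothetical name, the fallback to one thread when the browser reports nothing is my assumption, and the cap of 4 mirrors PTHREAD_POOL_SIZE=4 from the list.

```javascript
// Hypothetical helper: clamp the reported core count to the pthread pool size.
function pickThreadCount(reported, cap = 4) {
  // Fall back to a single thread if the value is missing or invalid.
  const cores = Number.isInteger(reported) && reported > 0 ? reported : 1;
  return Math.min(cores, cap);
}

// In a browser this would read navigator.hardwareConcurrency.
const reported =
  typeof navigator !== "undefined" ? navigator.hardwareConcurrency : undefined;
console.log(`(${pickThreadCount(reported)} threads)`);
```

Capping at the pool size avoids requesting more threads than the build pre-allocates workers for.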

Expected performance

| Config | tok/s (est.) |
|---|---|
| Before (1 thread, no SIMD) | ~0.9 |
| PR #25 (1 thread + SIMD) | ~2-3 |
| This PR (4 threads + SIMD) | ~5-10 |
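
To relate the estimates above to a speedup factor, the ratio can be computed directly. The helper name `speedupRange` is hypothetical, and the inputs are the estimated figures from this PR, not measurements.

```javascript
// Hypothetical helper: express an estimated tok/s range as a multiple of a baseline.
function speedupRange(baseline, low, high) {
  return { low: low / baseline, high: high / baseline };
}

// Baseline ~0.9 tok/s, estimated ~5-10 tok/s with 4 threads + SIMD.
const range = speedupRange(0.9, 5, 10);
console.log(`${range.low.toFixed(1)}x to ${range.high.toFixed(1)}x`);
```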

Files

  • wasm/coi-serviceworker.js — new, Service Worker for header injection
  • wasm/build.sh — pthread flags
  • wasm/quant_wasm.c — thread detection + config
  • wasm/index.html — Service Worker registration
  • wasm/quant.{js,wasm} — rebuilt binaries

Test plan

  • WASM builds with pthreads (quant.js has pthread_create, SharedArrayBuffer)
  • GitHub Pages: Service Worker activates, crossOriginIsolated === true
  • Multi-threaded inference: status shows "(4 threads)"
  • Measurable speedup vs single-threaded build
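
The isolation check in the test plan can be scripted, for example from the browser console. This is a sketch: `threadingReadiness` is a hypothetical helper, and it only inspects the two standard globals named above.

```javascript
// Hypothetical helper: report whether the page can use shared-memory threads.
function threadingReadiness(scope) {
  const isolated = scope.crossOriginIsolated === true;
  const sharedMemory = typeof scope.SharedArrayBuffer === "function";
  return { isolated, sharedMemory, ready: isolated && sharedMemory };
}

// On the deployed page, pass the global object.
console.log(threadingReadiness(globalThis));
```

If `isolated` is false on the first load, that may simply mean the Service Worker has not taken control yet; the check is more meaningful after the automatic reload.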

🤖 Generated with Claude Code

Enable WASM pthreads so inference uses multiple CPU cores in the
browser. Three changes:

1. coi-serviceworker.js: injects Cross-Origin-Opener-Policy and
   Cross-Origin-Embedder-Policy headers into all responses via
   Service Worker. This enables SharedArrayBuffer on GitHub Pages
   and other static hosts that don't support custom HTTP headers.
   Well-established pattern (used by FFmpeg.wasm, SQL.js, etc.).

2. build.sh: add -pthread, PTHREAD_POOL_SIZE=4, ENVIRONMENT=web,worker.
   WASM binary now includes multi-threaded libc and pthread support.

3. quant_wasm.c: detect navigator.hardwareConcurrency (capped at 4)
   and pass to quant_config.n_threads. Model load message shows
   thread count ("Model loaded! Ready to chat. (4 threads)").

Expected speedup: 3-4x on multi-core devices (most modern laptops).
Combined with SIMD128 from PR #25: total 6-12x vs original build.

Binary: 320K → 384K (pthread runtime overhead).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@unamedkr unamedkr merged commit 454f664 into main Apr 10, 2026
3 checks passed
@unamedkr unamedkr deleted the perf/wasm-pthreads branch April 10, 2026 07:53
unamedkr added a commit that referenced this pull request Apr 10, 2026
Replace ASYNCIFY-based streaming with a dedicated Web Worker.
Inference runs entirely in the worker thread; tokens stream to
the main thread via postMessage(). The main thread never blocks.

Changes:
- inference-worker.js: new Web Worker that loads WASM + runs
  quant_generate() in a blocking loop, posting each token
- quant_wasm.c: simplified — removed ASYNCIFY, sleep, async
  variants. Single sync callback posts tokens via EM_JS
- build.sh: removed -sASYNCIFY and ASYNCIFY_IMPORTS. Added
  -mrelaxed-simd for FMA. Fixed 1GB memory (no growth penalty
  with pthreads). ALLOW_MEMORY_GROWTH=0
- index.html: generate() sends to worker, receives tokens via
  onmessage handler. Model loading via transferable ArrayBuffer
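
The token streaming described in the changes above can be sketched as a small message protocol. This is not the code from inference-worker.js or index.html: the message shapes (`token`, `done`) and the names `runGeneration` and `makeMessageHandler` are hypothetical, and a plain generator stands in for the blocking quant_generate() call.

```javascript
// Worker side (hypothetical): run a blocking loop and post each token as it appears.
function runGeneration(generateTokens, post) {
  let count = 0;
  for (const text of generateTokens()) {
    post({ type: "token", text });
    count += 1;
  }
  post({ type: "done", count });
}

// Main-thread side (hypothetical): an onmessage handler that dispatches by type.
function makeMessageHandler(onToken, onDone) {
  return (event) => {
    const msg = event.data;
    if (msg.type === "token") onToken(msg.text);
    else if (msg.type === "done") onDone(msg.count);
  };
}

// Local wiring without a real Worker: deliver each posted message to the handler.
let output = "";
const handler = makeMessageHandler(
  (text) => { output += text; },
  (count) => console.log(`${count} tokens: ${output}`)
);
runGeneration(
  function* () { yield "Hello"; yield ", "; yield "world"; },
  (data) => handler({ data })
);
```

In the browser, `post` would be the worker's `postMessage` and the handler would be assigned to `worker.onmessage`. Model bytes can be handed to the worker without copying by listing the buffer as a transferable, for example `worker.postMessage({ type: "load", model: buf }, [buf])`, where the `load` message shape is again an assumption.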

Performance impact:
- ASYNCIFY removal: ~30-50% less overhead (no stack unwind/rewind)
- Fixed memory: eliminates pthreads+growth penalty
- Relaxed SIMD: FMA instructions where available
- Binary: 384K → 256K (-33%)

Combined with pthreads (PR #27) and SIMD128 (PR #25):
expected total speedup 8-15x vs original single-thread build.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr added a commit that referenced this pull request Apr 10, 2026
* fix: quantcpp CLI command + default to Llama-3.2-1B (user feedback)

User feedback: "quantcpp command not found" + "garbage text from 135M"

1. Added `quantcpp` CLI entry point (pyproject.toml [project.scripts])
   - `quantcpp "question"` — one-shot
   - `quantcpp` — interactive chat
   - `quantcpp --model path.gguf` — custom model

2. Default model changed from SmolLM2-135M to Llama-3.2-1B
   - 135M produces garbage text — terrible first impression
   - 1B is 750MB (bigger download) but actually useful output
   - SmolLM2-135M still available for bandwidth-constrained users

3. README Quick Start now shows `quantcpp` CLI first, Python second

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* perf(wasm): Web Worker architecture — eliminate ASYNCIFY for max speed

Replace ASYNCIFY-based streaming with a dedicated Web Worker.
Inference runs entirely in the worker thread; tokens stream to
the main thread via postMessage(). The main thread never blocks.

Changes:
- inference-worker.js: new Web Worker that loads WASM + runs
  quant_generate() in a blocking loop, posting each token
- quant_wasm.c: simplified — removed ASYNCIFY, sleep, async
  variants. Single sync callback posts tokens via EM_JS
- build.sh: removed -sASYNCIFY and ASYNCIFY_IMPORTS. Added
  -mrelaxed-simd for FMA. Fixed 1GB memory (no growth penalty
  with pthreads). ALLOW_MEMORY_GROWTH=0
- index.html: generate() sends to worker, receives tokens via
  onmessage handler. Model loading via transferable ArrayBuffer

Performance impact:
- ASYNCIFY removal: ~30-50% less overhead (no stack unwind/rewind)
- Fixed memory: eliminates pthreads+growth penalty
- Relaxed SIMD: FMA instructions where available
- Binary: 384K → 256K (-33%)

Combined with pthreads (PR #27) and SIMD128 (PR #25):
expected total speedup 8-15x vs original single-thread build.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>