Bounded-latency browser edge inference pipeline for real-time voice interview summarization using ONNX + WASM.
Built for a 3-hour edge AI hackathon. Every design decision is driven by runtime constraints, not model scale.
- O(1) bounded inference via semantic ring buffer — latency never scales with session length
- Latest-only concurrency lock — intentional backpressure, freshness over completeness
- WASM pre-warm on init — absorbs JIT cold-start spike before user interaction
- Web Worker isolated ONNX inference — UI thread never touched by inference compute
- Real-time observability dashboard — load time, inference latency, queue drops, worker uptime, SLA telemetry
- Audio feedback loop prevention: `utterance.onstart` STT guard stops recursive transcript contamination
- Modular STT architecture: `whisper.cpp` WASM is a drop-in swap with zero pipeline changes
🎥 Full recorded demo — model boot → READY state → live summarization → stress test:
| READY State | LISTENING State |
|---|---|
| (screenshot: READY state) | (screenshot: LISTENING state) |
EdgePulse AI is a bounded-latency browser inference pipeline — not an AI API wrapper.
Every engineering decision is driven by hard constraints:
| Constraint | Solution |
|---|---|
| Model ≤35MB | Xenova/t5-small quantized q4 ONNX |
| Inference <50ms | Semantic ring buffer bounds input · pre-warmed WASM |
| No main-thread lag | Isolated Web Worker · latest-only concurrency lock |
| Runs offline | allowRemoteModels=false + local ONNX weights |
| No perceivable pauses | Heuristic TTS fillers (0ms latency) |
┌─ MAIN THREAD (UI) ──────────┐
│ [🎤 Mic] │
│ ↓ │
│ [Web Speech API] │ ← Modular STT node
│ *(modular STT node)* │ Production: whisper.cpp WASM
│ ↓ │
│ [Semantic Ring Buffer] │ ← Max 3 utterances
│ *max 3 utterances* │ O(1) memory · O(1) latency forever
│ ↓ ↘ │
│ [Pause→TTS] │ ← 0ms filler phrases
└──────────────┬──────────────┘
│ postMessage
┌──────────────▼──────────────┐
│ WEB WORKER (WASM) │
│ [Concurrency Lock] │ ← Latest-only policy
│ *latest-only policy* │ Drops stale inference requests
│ ↓ │
│ [ONNX Runtime + t5-small] │ ← Xenova/t5-small q4 · ~35MB
│ *q4 · ~35MB · local* │ Pre-warmed on init
│ ↓ │
│ postMessage({summary}) │
└─────────────────────────────┘
1. Semantic Ring Buffer (O(1) guarantee)
- Stores only the last 3 finalized utterances
- `shift()` on overflow: constant memory, constant inference time
- Eliminates the O(n) latency scaling that slows unbounded-context pipelines as a session grows
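A minimal sketch of the buffer described above, assuming utterances are plain strings that are joined into one model input. The class and method names (`UtteranceRingBuffer`, `push`, `toPrompt`) are illustrative and are not taken from `app.js`.

```javascript
// Hypothetical sketch: fixed-capacity buffer of finalized utterances.
// Names are illustrative; the real app.js may be structured differently.
class UtteranceRingBuffer {
  constructor(capacity = 3) {
    this.capacity = capacity;
    this.items = [];
  }

  // Append a finalized utterance and evict the oldest one on overflow.
  push(utterance) {
    this.items.push(utterance);
    if (this.items.length > this.capacity) {
      this.items.shift(); // FIFO eviction keeps the length at `capacity`
    }
  }

  // Current depth, e.g. for a "Buffer Depth" readout.
  get depth() {
    return this.items.length;
  }

  // Text handed to the summarizer: at most `capacity` utterances.
  toPrompt() {
    return this.items.join(" ");
  }
}

const demoBuffer = new UtteranceRingBuffer(3);
for (const u of ["I led the migration.", "We cut costs.", "Latency dropped.", "The team grew."]) {
  demoBuffer.push(u);
}
console.log(demoBuffer.depth);      // 3
console.log(demoBuffer.toPrompt()); // "We cut costs. Latency dropped. The team grew."
```

This bounds the number of utterances, not the length of each one. Capping characters or tokens per utterance would be a separate design choice.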
2. Latest-Only Concurrency Lock
- If inference is running when new speech arrives → DROP the new request
- Tracked live in Observability Dashboard as "Latest-Only Queue Drops"
- Rationale: freshness > completeness in conversational systems
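The drop policy above can be sketched as a busy flag plus a counter. This is an illustration under stated assumptions, not the code in `worker.js`: `InferenceLock`, `handleRequest`, and the injected `summarize` function are hypothetical names.

```javascript
// Hypothetical sketch of the drop policy; not the actual worker.js code.
class InferenceLock {
  constructor() {
    this.busy = false;
    this.drops = 0; // could feed a "Latest-Only Queue Drops" counter
  }

  // Returns true if the caller may run inference, false if the request is dropped.
  tryAcquire() {
    if (this.busy) {
      this.drops += 1;
      return false;
    }
    this.busy = true;
    return true;
  }

  release() {
    this.busy = false;
  }
}

// `summarize` is an assumed async function: (text) => summary.
async function handleRequest(lock, text, summarize) {
  if (!lock.tryAcquire()) {
    return { dropped: true, drops: lock.drops };
  }
  try {
    return { dropped: false, summary: await summarize(text) };
  } finally {
    lock.release(); // always free the lock, even if inference throws
  }
}
```

Because nothing is queued, at most one inference is ever in flight. The next request that gets through reads the current ring buffer contents, so a dropped request does not lose its utterance.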
3. WASM Pre-Warm on Init
- Runs a dummy inference during model loading
- Absorbs the 200–500ms WASM JIT compilation spike before any user interaction
- Ensures the first real inference has stable, predictable latency
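One way to express the pre-warm step is shown below. It is a sketch only: `initEngine`, `loadModel`, and `notify` are hypothetical names, and the model loader is injected rather than tied to a specific library call.

```javascript
// Hypothetical sketch: load the model, run one throwaway inference, then report READY.
// `loadModel` is an assumed async factory that resolves to a summarize(text) function.
// `notify` is an assumed callback (in a worker it could wrap postMessage).
async function initEngine(loadModel, notify) {
  const t0 = performance.now();
  const summarize = await loadModel();

  // Dummy inference: the result is discarded. First-run compilation and
  // allocation happen here instead of on the user's first utterance.
  await summarize("warm-up sentence for the summarizer");

  const loadTimeMs = performance.now() - t0;
  notify({ type: "READY", loadTimeMs });
  return summarize;
}
```

Measuring from before the load until after the dummy run means the reported load time includes the warm-up cost, which matches a "WASM init + pre-warm" style metric.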
4. Audio Feedback Loop Prevention
- `recognition.stop()` is called from `utterance.onstart`, as TTS playback begins
- 150ms debounce guard on STT restart prevents Chrome `InvalidStateError`
- Without this: mic picks up TTS output → infinite recursive transcript contamination
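The guard above could be wired roughly as follows. The function name `speakGuarded` and the choice to restart recognition from `onend` are assumptions. The recognizer, synthesizer, and utterance factory are injected so the wiring is explicit.

```javascript
// Hypothetical sketch of the guard. In Chrome the arguments would typically be
// a webkitSpeechRecognition instance, window.speechSynthesis, and
// (text) => new SpeechSynthesisUtterance(text).
function speakGuarded(text, recognition, synth, makeUtterance, restartDelayMs = 150) {
  const utterance = makeUtterance(text);

  // Stop listening as soon as TTS output starts, so the mic never transcribes it.
  utterance.onstart = () => recognition.stop();

  // Assumption: listening resumes after TTS ends. The delay gives the
  // recognizer time to finish stopping; calling start() on a recognizer that
  // is still running throws InvalidStateError.
  utterance.onend = () => {
    setTimeout(() => recognition.start(), restartDelayMs);
  };

  synth.speak(utterance);
  return utterance;
}
```

A fixed delay is the simplest option. Restarting from the recognizer's own `onend` event would be an alternative that does not depend on timing.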
5. Heuristic Pause Detection
- 2-second VAD timeout triggers a filler phrase
- Fillers need no inference, so they play with no model latency and mask the WASM summarization delay
- Creates natural conversational rhythm during background processing
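The pause heuristic can be sketched as a timer that restarts on every speech result. `PauseDetector`, `noteSpeech`, and the filler phrases are made up for illustration; only the 2-second timeout comes from the description above.

```javascript
// Hypothetical sketch: a silence timer that fires a filler after `timeoutMs`.
// The filler phrases are invented examples.
const FILLERS = ["Take your time.", "Mm-hmm.", "Go on."];

class PauseDetector {
  constructor(onPause, timeoutMs = 2000) {
    this.onPause = onPause;
    this.timeoutMs = timeoutMs;
    this.timer = null;
  }

  // Call on every speech result; each call restarts the silence countdown.
  noteSpeech() {
    clearTimeout(this.timer);
    this.timer = setTimeout(() => {
      const filler = FILLERS[Math.floor(Math.random() * FILLERS.length)];
      this.onPause(filler);
    }, this.timeoutMs);
  }

  // Cancel the countdown, e.g. when the session ends.
  stop() {
    clearTimeout(this.timer);
    this.timer = null;
  }
}
```

In the browser, `onPause` would hand the phrase to the TTS path. Since the phrase is picked from a static list, no model call sits between the timeout and playback.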
Prerequisites: Python 3.x · Chrome · Microphone
git clone https://github.com/rajveer100704/EdgepulseAI.git
cd EdgepulseAI
python serve.py
# Open Chrome: http://localhost:8000

On first launch the quantized t5-small model (~35MB) downloads via CDN and is cached by the browser. Subsequent loads are served from that cache and skip the download.
# Download model weights locally (~35MB, one-time)
pip install huggingface_hub
python setup_model.py
# Enable strict offline in worker.js:
# const STRICT_OFFLINE = true;
python serve.py

After this, zero network requests are made. Inference runs entirely in-browser.
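For orientation, here is a sketch of what a strict-offline switch might toggle. It assumes `env` is the settings object exported by Transformers.js, which exposes `allowRemoteModels`, `allowLocalModels`, and `localModelPath`. The helper name `configureModelSource` and the `/models/` path are assumptions, not taken from `worker.js`.

```javascript
// Hypothetical helper. `env` is the settings object exported by Transformers.js.
// The "/models/" path is an assumption based on the repo's models/ directory.
function configureModelSource(env, strictOffline) {
  env.allowLocalModels = true;
  env.localModelPath = "/models/";
  // With remote loading disabled, a missing local file fails loudly instead
  // of silently falling back to the network.
  env.allowRemoteModels = !strictOffline;
  return env;
}

// Plain object standing in for the library's env, for illustration only.
const demoEnv = configureModelSource({}, true);
console.log(demoEnv.allowRemoteModels); // false
```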
| Metric | What It Measures |
|---|---|
| Model Load Time | WASM init + pre-warm (cold-start absorbed) |
| Inference Latency | Per-request time, color-coded against SLA |
| Buffer Depth | Current / max utterances in ring buffer |
| Latest-Only Queue Drops | Requests dropped by concurrency lock |
| Worker Thread | IDLE / INFERRING live status |
| Queue Policy | Confirms LATEST-ONLY is active |
| Worker Uptime | Elapsed time since READY state |
Latency SLA: 🟢 <50ms · 🟡 <150ms · 🔴 >150ms
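The color coding above amounts to a threshold check. In this sketch the function name `slaBand` is hypothetical, and treating exactly 150ms as red is an assumption, since the thresholds above leave that boundary open.

```javascript
// Hypothetical sketch of the SLA color bands.
// Assumption: exactly 150ms counts as red.
function slaBand(latencyMs) {
  if (latencyMs < 50) return "green";
  if (latencyMs < 150) return "yellow";
  return "red";
}

console.log(slaBand(32));  // "green"
console.log(slaBand(90));  // "yellow"
console.log(slaBand(210)); // "red"
```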
The STT node is intentionally modular — designed for zero-friction replacement:
Prototype: Web Speech API (browser-native, Chrome)
↓
Production: whisper.cpp WASM + AudioWorklet VAD
→ Core ring buffer + inference locks: UNCHANGED
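A swap like this works when the pipeline depends only on a small contract. The sketch below assumes one possible contract: `start()`, `stop()`, and an `onFinal(text)` callback. `attachStt` and `ScriptedStt` are hypothetical names, and the real `app.js` may wire this differently.

```javascript
// Hypothetical contract: an STT source exposes start(), stop(), and calls
// onFinal(text) once per finalized utterance. Everything downstream depends
// only on that callback.
function attachStt(sttSource, onUtterance) {
  sttSource.onFinal = (text) => {
    const cleaned = text.trim();
    if (cleaned.length > 0) onUtterance(cleaned); // skip empty results
  };
  sttSource.start();
  return () => sttSource.stop(); // detach function
}

// Stand-in source for illustration. A Web Speech API wrapper or a
// whisper.cpp WASM wrapper would expose the same three members.
class ScriptedStt {
  constructor(lines) {
    this.lines = lines;
    this.onFinal = null;
    this.running = false;
  }
  start() {
    this.running = true;
    for (const line of this.lines) this.onFinal(line);
  }
  stop() {
    this.running = false;
  }
}

const heard = [];
const detach = attachStt(new ScriptedStt(["  hello there ", "", "next answer"]), (t) => heard.push(t));
detach();
console.log(heard); // [ 'hello there', 'next answer' ]
```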
EdgepulseAI/
├── assets/
│ ├── demo.webp — Full demo recording
│ ├── screenshot_ready.png — READY state
│ ├── screenshot_listening.png — LISTENING state
│ └── screenshot_upgraded.png — Full UI with all panels
├── models/
│ └── README.md — Instructions (weights not committed)
├── index.html — UI: state machine, observability, arch diagram
├── app.js — Ring buffer, STT, TTS, worker communication
├── worker.js — Web Worker: ONNX engine, concurrency lock, pre-warm
├── serve.py — CORS-safe HTTP server (required for WASM)
├── setup_model.py — Model downloader for strict offline mode
├── .gitignore
├── LICENSE
└── README.md
| Requirement | Status | Evidence |
|---|---|---|
| Model ≤100MB | ✅ ~35MB | t5-small q4 ONNX · displayed in UI |
| Client-side ONNX/WASM | ✅ | ONNX Runtime Web · Web Worker · no backend |
| Inference <50ms | ✅ | Ring buffer + pre-warm + max_new_tokens=30 |
| Runs offline | ✅ | allowRemoteModels=false + setup_model.py |
| Real-time summarization | ✅ | Per-utterance trigger · live summary panel |
| No main-thread lag | ✅ | Isolated Web Worker · concurrency lock |
| Pause detection + fillers | ✅ | 2s heuristic VAD · TTS engine · audio loop guard |
| Load time metric | ✅ | Observability Dashboard |
| Response time metric | ✅ | Per-inference latency, SLA color-coded |
The model is the least interesting part. Here's what was actually solved:
| Failure Mode | Solution |
|---|---|
| Queue explosion under rapid speech | Latest-only concurrency lock |
| O(n) latency scaling over long sessions | Semantic ring buffer with FIFO eviction |
| Cold-start WASM stutter | Pre-warm dummy inference at init |
| Recursive audio (mic captures TTS) | utterance.onstart STT stop guard |
| Chrome InvalidStateError crashes | 150ms debounce on STT restart |
| Stale summaries from backed-up queue | Latest-only drop policy |
| Unprovable performance claims | Live observability dashboard |
JavaScript (ES Modules) · WebAssembly · ONNX Runtime Web · Transformers.js · Web Workers · Web Speech API · SpeechSynthesis API · Python (local server)
Rajveer Singh Saggu
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.



