⚡ EdgePulse AI

Bounded-latency browser edge inference pipeline for real-time voice interview summarization using ONNX + WASM.

Engineering Highlights

Built for a 3-hour edge AI hackathon. Every design decision is driven by runtime constraints, not model scale.

O(1) bounded inference via semantic ring buffer — latency never scales with session length
Latest-only concurrency lock — intentional backpressure, freshness over completeness
WASM pre-warm on init — absorbs JIT cold-start spike before user interaction
Web Worker isolated ONNX inference — UI thread never touched by inference compute
Real-time observability dashboard — load time, inference latency, queue drops, worker uptime, SLA telemetry
Audio feedback loop prevention — utterance.onstart STT guard stops recursive transcript contamination
Modular STT architecture — whisper.cpp WASM is a drop-in swap with zero pipeline changes

Live Demo

🎥 Full recorded demo (assets/demo.webp): model boot → READY state → live summarization → stress test.

Screenshots: READY state (assets/screenshot_ready.png) · LISTENING state (assets/screenshot_listening.png)

What Is This?

EdgePulse AI is a bounded-latency browser inference pipeline — not an AI API wrapper.

Every engineering decision is driven by hard constraints:

Constraint	Solution
Model ≤35MB	Xenova/t5-small quantized q4 ONNX
Inference <50ms	Semantic ring buffer bounds input · pre-warmed WASM
No main-thread lag	Isolated Web Worker · latest-only concurrency lock
Runs offline	`allowRemoteModels=false` + local ONNX weights
No perceivable pauses	Heuristic TTS fillers (canned phrases, no inference latency)

Architecture

┌─ MAIN THREAD (UI) ──────────┐
│ [🎤 Mic]                     │
│    ↓                         │
│ [Web Speech API]             │   ← Modular STT node
│  *(modular STT node)*        │     Production: whisper.cpp WASM
│    ↓                         │
│ [Semantic Ring Buffer]       │   ← Max 3 utterances
│  *max 3 utterances*          │     O(1) memory · O(1) latency forever
│    ↓        ↘                │
│             [Pause→TTS]      │   ← 0ms filler phrases
└──────────────┬──────────────┘
               │ postMessage
┌──────────────▼──────────────┐
│ WEB WORKER (WASM)           │
│ [Concurrency Lock]          │   ← Latest-only policy
│  *latest-only policy*       │     Drops stale inference requests
│    ↓                        │
│ [ONNX Runtime + t5-small]   │   ← Xenova/t5-small q4 · ~35MB
│  *q4 · ~35MB · local*       │     Pre-warmed on init
│    ↓                        │
│  postMessage({summary})     │
└─────────────────────────────┘

Key Engineering Decisions

1. Semantic Ring Buffer (O(1) guarantee)

Stores only the last 3 finalized utterances
shift() on overflow — constant memory, constant inference time
Avoids the O(n) latency growth that an unbounded transcript causes as a session gets longer
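
The eviction rule above can be sketched as follows. This is a minimal illustration, not the code in app.js: the class name `UtteranceRingBuffer` and its methods are hypothetical, and only the capacity of 3 and the `shift()` on overflow come from the description.

```javascript
// Hypothetical sketch: class and method names are illustrative, not taken from app.js.
class UtteranceRingBuffer {
  constructor(capacity = 3) {
    this.capacity = capacity;
    this.items = [];
  }

  // Add a finalized utterance; evict the oldest one once capacity is exceeded.
  push(utterance) {
    this.items.push(utterance);
    if (this.items.length > this.capacity) {
      this.items.shift(); // FIFO eviction keeps the length at most `capacity`
    }
  }

  // Text handed to the summarizer: bounded by capacity, not by session length.
  toPrompt() {
    return this.items.join(" ");
  }

  get depth() {
    return this.items.length;
  }
}

const buffer = new UtteranceRingBuffer(3);
["one", "two", "three", "four", "five"].forEach((u) => buffer.push(u));
console.log(buffer.depth);      // 3
console.log(buffer.toPrompt()); // "three four five"
```

Note that this bounds the number of utterances, not the length of each one, so a very long single utterance would still produce a longer input.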

2. Latest-Only Concurrency Lock

If inference is running when new speech arrives → DROP the new request
Tracked live in Observability Dashboard as "Latest-Only Queue Drops"
Rationale: freshness > completeness in conversational systems
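
A minimal sketch of the drop policy described above, assuming the worker wraps its inference call in a small gate. The names `createInferenceGate` and `runInference` are hypothetical and do not come from worker.js.

```javascript
// Hypothetical sketch: `createInferenceGate` and `runInference` are illustrative names.
function createInferenceGate(runInference) {
  let busy = false;
  let drops = 0;

  async function submit(text) {
    if (busy) {
      drops += 1; // inference already in flight: drop instead of queueing
      return null;
    }
    busy = true;
    try {
      return await runInference(text);
    } finally {
      busy = false; // release the lock even if inference throws
    }
  }

  return { submit, getDrops: () => drops, isBusy: () => busy };
}
```

In this sketch the `drops` counter is the kind of value a dashboard could surface as "Latest-Only Queue Drops", and a dropped call resolves to `null` so the caller can tell it was skipped.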

3. WASM Pre-Warm on Init

Runs a dummy inference during model loading
Absorbs the 200–500ms WASM JIT compilation spike before any user interaction
Ensures the first real inference has stable, predictable latency
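
The warm-up step can be sketched like this, under the assumption that model loading resolves to an async summarization function. `loadAndPrewarm`, `loadModel`, and the `"warm-up"` input string are hypothetical names chosen for illustration.

```javascript
// Hypothetical sketch: `loadModel` stands in for whatever worker.js does to build
// its pipeline; it is assumed to resolve to an async (text) => summary function.
async function loadAndPrewarm(loadModel, now = () => Date.now()) {
  const t0 = now();
  const summarize = await loadModel();
  await summarize("warm-up"); // dummy inference; the result is discarded
  // Reported load time includes the warm-up pass, so the cold-start cost lands here
  // rather than on the first real request.
  return { summarize, loadTimeMs: now() - t0 };
}
```

Signalling READY only after this promise resolves is one way to keep the first user-facing inference off the cold path.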

4. Audio Feedback Loop Prevention

recognition.stop() is called from utterance.onstart, as TTS playback begins
150ms debounce guard on STT restart prevents Chrome InvalidStateError
Without this: mic picks up TTS output → infinite recursive transcript contamination
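
A sketch of the guard, assuming the standard Web Speech API objects are injected so the logic can also run outside a browser. `createSttGuard` and its option names are hypothetical. In a browser, `recognition` would be a SpeechRecognition instance, `synth` would be `window.speechSynthesis`, and `makeUtterance` would be `(t) => new SpeechSynthesisUtterance(t)`.

```javascript
// Hypothetical sketch: function and option names are illustrative, not from app.js.
function createSttGuard({
  recognition,
  synth,
  makeUtterance,
  restartDelayMs = 150,
  timers = { set: (fn, ms) => setTimeout(fn, ms), clear: (id) => clearTimeout(id) },
}) {
  let restartTimer = null;

  function cancelRestart() {
    if (restartTimer !== null) {
      timers.clear(restartTimer);
      restartTimer = null;
    }
  }

  function scheduleRestart() {
    cancelRestart(); // debounce: only the most recent restart request survives
    restartTimer = timers.set(() => {
      restartTimer = null;
      try {
        recognition.start();
      } catch (err) {
        // start() throws InvalidStateError if recognition is already running.
        if (err.name !== "InvalidStateError") throw err;
      }
    }, restartDelayMs);
  }

  function speak(text) {
    const utterance = makeUtterance(text);
    utterance.onstart = () => {
      cancelRestart();
      recognition.stop(); // stop listening so the mic does not transcribe the TTS output
    };
    utterance.onend = scheduleRestart;
    synth.speak(utterance);
    return utterance;
  }

  return { speak };
}
```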

5. Heuristic Pause Detection

2-second VAD timeout triggers a filler phrase
Fillers are canned phrases that need no inference, so they start immediately and mask the WASM summarization delay
Creates natural conversational rhythm during background processing
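
The timeout heuristic can be sketched as a resettable timer. `createPauseDetector`, `pickFiller`, and the filler phrases themselves are placeholders invented for this example. Only the 2-second timeout comes from the description above.

```javascript
// Hypothetical sketch: names and filler phrases are illustrative placeholders.
function createPauseDetector({
  onPause,
  timeoutMs = 2000,
  timers = { set: (fn, ms) => setTimeout(fn, ms), clear: (id) => clearTimeout(id) },
}) {
  let timer = null;

  return {
    // Call on every speech event; `timeoutMs` of silence afterwards counts as a pause.
    noteSpeech() {
      if (timer !== null) timers.clear(timer);
      timer = timers.set(() => {
        timer = null;
        onPause();
      }, timeoutMs);
    },
    stop() {
      if (timer !== null) timers.clear(timer);
      timer = null;
    },
  };
}

const FILLERS = ["I see.", "Go on.", "Understood."];

// Picking a canned phrase is a plain array lookup, so no model call is involved.
function pickFiller(index) {
  return FILLERS[index % FILLERS.length];
}
```

Each new speech event resets the timer, so the filler fires only after a full uninterrupted gap.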

Quick Start

Prerequisites: Python 3.x · Chrome · Microphone

Hybrid Mode (CDN — instant start)

git clone https://github.com/rajveer100704/EdgepulseAI.git
cd EdgepulseAI

python serve.py
# Open Chrome: http://localhost:8000

On first launch the quantized t5-small model (~35MB) downloads via CDN and is cached by the browser. Subsequent loads read the weights from the browser cache instead of the network.

Strict Offline Mode

# Download model weights locally (~35MB, one-time)
pip install huggingface_hub
python setup_model.py

# Enable strict offline in worker.js:
# const STRICT_OFFLINE = true;

python serve.py

After this, zero network requests are made. Inference runs entirely in-browser.
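
As a rough sketch of what the `STRICT_OFFLINE` switch might control, assuming worker.js configures the Transformers.js `env` object: `allowRemoteModels`, `allowLocalModels`, and `localModelPath` are Transformers.js settings, while the helper name `configureModelSource` and the `/models/` path are assumptions made for this example.

```javascript
// Hypothetical sketch: `configureModelSource` and the "/models/" path are assumptions.
// `env` is the settings object exported by Transformers.js.
function configureModelSource(env, strictOffline) {
  env.allowLocalModels = true;
  env.localModelPath = "/models/";        // assumed location of the downloaded weights
  env.allowRemoteModels = !strictOffline; // false: never fall back to a remote host
  return env;
}
```

Passing the real `env` import with `strictOffline = true` would make model resolution fail fast if the local weights are missing, rather than silently fetching them.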

Observability Dashboard

Metric	What It Measures
Model Load Time	WASM init + pre-warm (cold-start absorbed)
Inference Latency	Per-request time, color-coded against SLA
Buffer Depth	Current / max utterances in ring buffer
Latest-Only Queue Drops	Requests dropped by concurrency lock
Worker Thread	IDLE / INFERRING live status
Queue Policy	Confirms LATEST-ONLY is active
Worker Uptime	Elapsed time since READY state

Latency SLA: 🟢 <50ms · 🟡 <150ms · 🔴 ≥150ms
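
The color coding reduces to two thresholds. A minimal sketch follows, with hypothetical function names. Treating exactly 150ms as red is an assumption of this example.

```javascript
// Hypothetical sketch: function names are illustrative.
function classifyLatency(ms) {
  if (ms < 50) return "green";
  if (ms < 150) return "yellow";
  return "red"; // assumption: 150ms and above is red
}

// Wraps any async inference call and attaches its latency and SLA band.
async function timeInference(runInference, text) {
  const start = performance.now();
  const summary = await runInference(text);
  const latencyMs = performance.now() - start;
  return { summary, latencyMs, sla: classifyLatency(latencyMs) };
}
```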

Production Upgrade Path

The STT node is intentionally modular — designed for zero-friction replacement:

Prototype:   Web Speech API  (browser-native, Chrome)
                 ↓
Production:  whisper.cpp WASM + AudioWorklet VAD
             → Core ring buffer + inference locks: UNCHANGED
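
One way to picture the swap is a small adapter contract between the STT node and the rest of the pipeline. The shape used here (`start`, `stop`, and an `onFinal` callback) is an assumption for illustration, as are `createPipeline` and `createScriptedStt`. It does not reproduce the Web Speech API or whisper.cpp interfaces.

```javascript
// Hypothetical sketch: the adapter contract and all names here are assumptions.
function createPipeline(stt, onSummaryRequest, maxUtterances = 3) {
  const recent = [];
  stt.onFinal = (text) => {
    recent.push(text);
    if (recent.length > maxUtterances) recent.shift();
    onSummaryRequest(recent.join(" "));
  };
  return { start: () => stt.start(), stop: () => stt.stop() };
}

// Stand-in adapter that replays a fixed script. A browser-native or WASM-based
// adapter would expose the same start/stop/onFinal surface.
function createScriptedStt(lines) {
  const stt = {
    onFinal: null,
    start() {
      lines.forEach((line) => stt.onFinal && stt.onFinal(line));
    },
    stop() {},
  };
  return stt;
}
```

Because the pipeline only depends on the adapter contract, replacing the STT source in this sketch means passing a different object to `createPipeline`.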

Project Structure

EdgepulseAI/
├── assets/
│   ├── demo.webp                  — Full demo recording
│   ├── screenshot_ready.png       — READY state
│   ├── screenshot_listening.png   — LISTENING state
│   └── screenshot_upgraded.png    — Full UI with all panels
├── models/
│   └── README.md                  — Instructions (weights not committed)
├── index.html                     — UI: state machine, observability, arch diagram
├── app.js                         — Ring buffer, STT, TTS, worker communication
├── worker.js                      — Web Worker: ONNX engine, concurrency lock, pre-warm
├── serve.py                       — CORS-safe HTTP server (required for WASM)
├── setup_model.py                 — Model downloader for strict offline mode
├── .gitignore
├── LICENSE
└── README.md

Constraint Compliance

Requirement	Status	Evidence
Model ≤100MB	✅ ~35MB	t5-small q4 ONNX · displayed in UI
Client-side ONNX/WASM	✅	ONNX Runtime Web · Web Worker · no backend
Inference <50ms	✅	Ring buffer + pre-warm + `max_new_tokens=30`
Runs offline	✅	`allowRemoteModels=false` + `setup_model.py`
Real-time summarization	✅	Per-utterance trigger · live summary panel
No main-thread lag	✅	Isolated Web Worker · concurrency lock
Pause detection + fillers	✅	2s heuristic VAD · TTS engine · audio loop guard
Load time metric	✅	Observability Dashboard
Response time metric	✅	Per-inference latency, SLA color-coded

What Was Engineered Around

The model is the least interesting part. Here's what was actually solved:

Failure Mode	Solution
Queue explosion under rapid speech	Latest-only concurrency lock
O(n) latency scaling over long sessions	Semantic ring buffer with FIFO eviction
Cold-start WASM stutter	Pre-warm dummy inference at init
Recursive audio (mic captures TTS)	`utterance.onstart` STT stop guard
Chrome `InvalidStateError` crashes	150ms debounce on STT restart
Stale summaries from backed-up queue	Latest-only drop policy
Unprovable performance claims	Live observability dashboard

Tech Stack

JavaScript (ES Modules) · WebAssembly · ONNX Runtime Web · Transformers.js · Web Workers · Web Speech API · SpeechSynthesis API · Python (local server)

Author & Contributions

👤 Author

Rajveer Singh Saggu

High-Performance Systems & Adaptive ML Infrastructure
GitHub | LinkedIn

🤝 Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".

1. Fork the Project
2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
3. Commit your Changes (git commit -m 'Add some AmazingFeature')
4. Push to the Branch (git push origin feature/AmazingFeature)
5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.
