⚡ EdgePulse AI

Bounded-latency browser edge inference pipeline for real-time voice interview summarization using ONNX + WASM.


Engineering Highlights

Built for a 3-hour edge AI hackathon. Every design decision is driven by runtime constraints, not model scale.

  • O(1) bounded inference via semantic ring buffer — latency never scales with session length
  • Latest-only concurrency lock — intentional backpressure, freshness over completeness
  • WASM pre-warm on init — absorbs JIT cold-start spike before user interaction
  • Web Worker isolated ONNX inference — UI thread never touched by inference compute
  • Real-time observability dashboard — load time, inference latency, queue drops, worker uptime, SLA telemetry
  • Audio feedback loop prevention: utterance.onstart STT guard stops recursive transcript contamination
  • Modular STT architecture: whisper.cpp WASM is a drop-in swap with zero pipeline changes

Live Demo

🎥 Full recorded demo — model boot → READY state → live summarization → stress test:

[Image: EdgePulse AI full demo recording]

[Images: READY state and LISTENING state screenshots]

What Is This?

EdgePulse AI is a bounded-latency browser inference pipeline — not an AI API wrapper.

Every engineering decision is driven by hard constraints:

| Constraint | Solution |
|---|---|
| Model ≤35MB | Xenova/t5-small quantized q4 ONNX |
| Inference <50ms | Semantic ring buffer bounds input · pre-warmed WASM |
| No main-thread lag | Isolated Web Worker · latest-only concurrency lock |
| Runs offline | allowRemoteModels=false + local ONNX weights |
| No perceivable pauses | Heuristic TTS fillers (0ms latency) |

Architecture

┌─ MAIN THREAD (UI) ──────────┐
│ [🎤 Mic]                     │
│    ↓                         │
│ [Web Speech API]             │   ← Modular STT node
│  *(modular STT node)*        │     Production: whisper.cpp WASM
│    ↓                         │
│ [Semantic Ring Buffer]       │   ← Max 3 utterances
│  *max 3 utterances*          │     O(1) memory · O(1) latency forever
│    ↓        ↘                │
│             [Pause→TTS]      │   ← 0ms filler phrases
└──────────────┬──────────────┘
               │ postMessage
┌──────────────▼──────────────┐
│ WEB WORKER (WASM)           │
│ [Concurrency Lock]          │   ← Latest-only policy
│  *latest-only policy*       │     Drops stale inference requests
│    ↓                        │
│ [ONNX Runtime + t5-small]   │   ← Xenova/t5-small q4 · ~35MB
│  *q4 · ~35MB · local*       │     Pre-warmed on init
│    ↓                        │
│  postMessage({summary})     │
└─────────────────────────────┘

Key Engineering Decisions

1. Semantic Ring Buffer (O(1) guarantee)

  • Stores only the last 3 finalized utterances
  • shift() on overflow — constant memory, constant inference time
  • Eliminates the O(n) latency growth that an ever-growing transcript would otherwise cause over a long session
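
The bullets above can be sketched as a small fixed-capacity buffer. This is an illustrative sketch, not the code in app.js: the class name `UtteranceRingBuffer` and its methods are hypothetical, and only the capacity of 3 and the `shift()` eviction come from the description above.

```javascript
// Hypothetical sketch of a fixed-capacity utterance buffer with FIFO eviction.
class UtteranceRingBuffer {
  constructor(capacity = 3) {
    if (!Number.isInteger(capacity) || capacity < 1) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
    this.items = [];
  }

  // Append a finalized utterance; evict the oldest one on overflow.
  push(utterance) {
    this.items.push(utterance);
    if (this.items.length > this.capacity) {
      this.items.shift();
    }
    return this.items.length;
  }

  // Text handed to the summarizer: at most `capacity` utterances, oldest first.
  toPrompt() {
    return this.items.join(" ");
  }

  // Current number of stored utterances (the "buffer depth").
  get depth() {
    return this.items.length;
  }
}

const buffer = new UtteranceRingBuffer(3);
["one", "two", "three", "four"].forEach((u) => buffer.push(u));
console.log(buffer.toPrompt()); // two three four
```

Note that this bounds the number of utterances, not the length of each one, so the input size is only as constant as the individual utterances are short.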

2. Latest-Only Concurrency Lock

  • If inference is running when new speech arrives → DROP the new request
  • Tracked live in Observability Dashboard as "Latest-Only Queue Drops"
  • Rationale: freshness > completeness in conversational systems
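
One way to picture the drop-on-busy behaviour described above is a single busy flag around the inference call. The sketch below is an assumption about the shape of such a gate, not the contents of worker.js; `createInferenceGate` and `runInference` are hypothetical names, and the fake inference function only simulates a delay.

```javascript
// Hypothetical sketch: at most one inference in flight, requests arriving meanwhile are dropped.
function createInferenceGate(runInference) {
  let busy = false;
  let drops = 0;

  async function submit(text) {
    if (busy) {
      drops += 1; // would feed a "queue drops" style counter
      return { dropped: true };
    }
    busy = true;
    try {
      const summary = await runInference(text);
      return { dropped: false, summary };
    } finally {
      busy = false; // release the gate even if inference throws
    }
  }

  return { submit, getDrops: () => drops, isBusy: () => busy };
}

// Usage with a stand-in for the model call (20ms simulated latency).
const gate = createInferenceGate(
  (text) => new Promise((resolve) => setTimeout(() => resolve(`summary of: ${text}`), 20))
);
const first = gate.submit("utterance A");  // accepted
const second = gate.submit("utterance B"); // dropped, A is still running
Promise.all([first, second]).then(([a, b]) => {
  console.log(a.dropped, b.dropped, gate.getDrops()); // false true 1
});
```

Because the ring buffer still holds the dropped utterance, the next accepted request would summarize it along with whatever arrived afterwards.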

3. WASM Pre-Warm on Init

  • Runs a dummy inference during model loading
  • Absorbs the 200–500ms WASM JIT compilation spike before any user interaction
  • Ensures the first real inference has stable, predictable latency
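
A pre-warm step along these lines can be very small. The following is a sketch under assumptions: `preWarm` and `summarize` are hypothetical names, `summarize` stands for whatever async function wraps the loaded model, and the warm-up text is arbitrary.

```javascript
// Hypothetical sketch: run one throwaway inference at load time and report its cost.
async function preWarm(summarize, now = () => Date.now()) {
  const start = now();
  await summarize("warm up"); // result is discarded on purpose
  return now() - start;       // milliseconds spent before the user ever speaks
}

// Example boot sequence; `loadModel` is assumed to resolve to a summarize function.
async function boot(loadModel) {
  const t0 = Date.now();
  const summarize = await loadModel();
  const warmMs = await preWarm(summarize);
  return { summarize, warmMs, loadMs: Date.now() - t0 };
}
```

Reporting the load time only after the warm-up finishes matches the dashboard's description of Model Load Time as WASM init plus pre-warm.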

4. Audio Feedback Loop Prevention

  • recognition.stop() is called from the utterance.onstart handler, as TTS playback begins
  • 150ms debounce guard on STT restart prevents Chrome InvalidStateError
  • Without this: mic picks up TTS output → infinite recursive transcript contamination
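
The guard can be sketched as two event handlers on the utterance. This is a sketch, not the code in app.js: `guardSpeech` is a hypothetical name, restarting recognition from `onend` is an assumption (the text above only says the restart is debounced by 150ms), and both objects are passed in so the logic can run outside a browser.

```javascript
// Hypothetical sketch of the STT/TTS guard.
// `recognition` is assumed to expose start()/stop() like SpeechRecognition;
// `utterance` is assumed to accept onstart/onend like SpeechSynthesisUtterance.
function guardSpeech(recognition, utterance, restartDelayMs = 150, schedule = setTimeout) {
  utterance.onstart = () => {
    recognition.stop(); // mic off while the synthesizer is speaking
  };
  utterance.onend = () => {
    // Delay the restart so the recognizer has time to finish stopping.
    schedule(() => {
      try {
        recognition.start();
      } catch (err) {
        // start() throws InvalidStateError when recognition is already running.
        if (err.name !== "InvalidStateError") throw err;
      }
    }, restartDelayMs);
  };
  return utterance;
}

// In a browser this could be wired up roughly as:
//   speechSynthesis.speak(guardSpeech(recognition, new SpeechSynthesisUtterance("Go on.")));
```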

5. Heuristic Pause Detection

  • 2-second VAD timeout triggers a filler phrase
  • Fillers need no inference (0ms model latency), so they mask the WASM summarization delay
  • Creates natural conversational rhythm during background processing
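
A timeout-based pause heuristic of this kind can be sketched as a timer that is reset on every speech event. The names `createPauseDetector` and `speak` are hypothetical, and the filler phrases are placeholders rather than the ones the project uses; only the 2-second timeout comes from the text above.

```javascript
// Hypothetical sketch: speak a filler phrase if no speech activity is seen for `timeoutMs`.
function createPauseDetector(speak, options = {}) {
  const {
    timeoutMs = 2000,
    fillers = ["I see.", "Go on.", "Understood."], // placeholder phrases
    setTimer = setTimeout,
    clearTimer = clearTimeout,
  } = options;
  let timer = null;
  let next = 0;

  // Call on every interim or final STT result to restart the countdown.
  function onSpeechActivity() {
    if (timer !== null) clearTimer(timer);
    timer = setTimer(() => {
      timer = null;
      speak(fillers[next % fillers.length]); // fixed phrase, no model call
      next += 1;
    }, timeoutMs);
  }

  // Cancel any pending filler, for example when the session ends.
  function stop() {
    if (timer !== null) clearTimer(timer);
    timer = null;
  }

  return { onSpeechActivity, stop };
}
```

The timer functions are injectable so the countdown can be driven manually when checking the logic without waiting two real seconds.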

Quick Start

Prerequisites: Python 3.x · Chrome · Microphone

Hybrid Mode (CDN — instant start)

git clone https://github.com/rajveer100704/EdgepulseAI.git
cd EdgepulseAI

python serve.py
# Open Chrome: http://localhost:8000

On first launch the quantized t5-small model (~35MB) downloads via CDN and is cached by the browser. All subsequent loads are instant.

Strict Offline Mode

# Download model weights locally (~35MB, one-time)
pip install huggingface_hub
python setup_model.py

# Enable strict offline in worker.js:
# const STRICT_OFFLINE = true;

python serve.py

After this, zero network requests are made. Inference runs entirely in-browser.
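
As a rough illustration of what the `STRICT_OFFLINE` switch might control, the sketch below sets the Transformers.js `env` flags for local-only loading. The function name `configureModelSource` is hypothetical, the `/models/` path is an assumption based on the models/ folder in this repository, and `env` is passed in as a parameter so the function can be exercised without the library.

```javascript
// Hypothetical sketch: toggle Transformers.js between CDN and local-only model loading.
// `env` is the settings object exported by Transformers.js.
function configureModelSource(env, strictOffline, localModelPath = "/models/") {
  env.allowLocalModels = true;
  env.localModelPath = localModelPath;    // assumed location of the downloaded weights
  env.allowRemoteModels = !strictOffline; // false means model files are never fetched remotely
  return env;
}

// Stand-in object for illustration; in worker.js this would be the library's `env`.
const settings = configureModelSource({}, true);
console.log(settings.allowRemoteModels); // false
```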


Observability Dashboard

| Metric | What It Measures |
|---|---|
| Model Load Time | WASM init + pre-warm (cold-start absorbed) |
| Inference Latency | Per-request time, color-coded against SLA |
| Buffer Depth | Current / max utterances in ring buffer |
| Latest-Only Queue Drops | Requests dropped by concurrency lock |
| Worker Thread | IDLE / INFERRING live status |
| Queue Policy | Confirms LATEST-ONLY is active |
| Worker Uptime | Elapsed time since READY state |

Latency SLA: 🟢 <50ms · 🟡 <150ms · 🔴 >150ms
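
The colour coding can be expressed as a simple threshold function. This is a sketch with a hypothetical name, `classifyLatency`; treating a sample of exactly 150ms as red is an assumption, since the thresholds above leave that boundary open.

```javascript
// Hypothetical sketch: map a latency sample (in ms) to an SLA band.
function classifyLatency(ms) {
  if (ms < 50) return "green";
  if (ms < 150) return "yellow";
  return "red"; // assumption: exactly 150ms counts as red
}

console.log([12, 50, 149, 400].map((ms) => classifyLatency(ms)).join(", "));
// green, yellow, yellow, red
```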


Production Upgrade Path

The STT node is intentionally modular — designed for zero-friction replacement:

Prototype:   Web Speech API  (browser-native, Chrome)
                 ↓
Production:  whisper.cpp WASM + AudioWorklet VAD
             → Core ring buffer + inference locks: UNCHANGED
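
To make the drop-in idea concrete, the sketch below shows one possible adapter contract. It is hypothetical and not the interface used in app.js: `connectStt` and the `onFinal` hook are invented for illustration, on the assumption that the pipeline only needs a callback per finalized utterance.

```javascript
// Hypothetical sketch of an STT adapter contract: any engine that calls
// `onFinal(text)` for each finalized utterance can feed the same pipeline.
function connectStt(sttEngine, handleUtterance) {
  sttEngine.onFinal = (text) => {
    const trimmed = String(text).trim();
    if (trimmed.length > 0) handleUtterance(trimmed); // skip empty results
  };
  return sttEngine;
}

// Stand-in engine for illustration; a Web Speech API or whisper.cpp wrapper
// would expose the same `onFinal` hook.
const fakeEngine = {
  onFinal: null,
  emit(text) {
    this.onFinal(text);
  },
};
const received = [];
connectStt(fakeEngine, (t) => received.push(t));
fakeEngine.emit("  hello there ");
console.log(received[0]); // hello there
```

With a contract like this, swapping the STT engine would only change the wrapper that raises `onFinal`, while the ring buffer and inference gate stay as they are.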

Project Structure

EdgepulseAI/
├── assets/
│   ├── demo.webp                  — Full demo recording
│   ├── screenshot_ready.png       — READY state
│   ├── screenshot_listening.png   — LISTENING state
│   └── screenshot_upgraded.png   — Full UI with all panels
├── models/
│   └── README.md                  — Instructions (weights not committed)
├── index.html                     — UI: state machine, observability, arch diagram
├── app.js                         — Ring buffer, STT, TTS, worker communication
├── worker.js                      — Web Worker: ONNX engine, concurrency lock, pre-warm
├── serve.py                       — CORS-safe HTTP server (required for WASM)
├── setup_model.py                 — Model downloader for strict offline mode
├── .gitignore
├── LICENSE
└── README.md

Constraint Compliance

| Requirement | Status | Evidence |
|---|---|---|
| Model ≤100MB | | ~35MB t5-small q4 ONNX · displayed in UI |
| Client-side ONNX/WASM | | ONNX Runtime Web · Web Worker · no backend |
| Inference <50ms | | Ring buffer + pre-warm + max_new_tokens=30 |
| Runs offline | | allowRemoteModels=false + setup_model.py |
| Real-time summarization | | Per-utterance trigger · live summary panel |
| No main-thread lag | | Isolated Web Worker · concurrency lock |
| Pause detection + fillers | | 2s heuristic VAD · TTS engine · audio loop guard |
| Load time metric | | Observability Dashboard |
| Response time metric | | Per-inference latency, SLA color-coded |

What Was Engineered Around

The model is the least interesting part. Here's what was actually solved:

| Failure Mode | Solution |
|---|---|
| Queue explosion under rapid speech | Latest-only concurrency lock |
| O(n) latency scaling over long sessions | Semantic ring buffer with FIFO eviction |
| Cold-start WASM stutter | Pre-warm dummy inference at init |
| Recursive audio (mic captures TTS) | utterance.onstart STT stop guard |
| Chrome InvalidStateError crashes | 150ms debounce on STT restart |
| Stale summaries from backed-up queue | Latest-only drop policy |
| Unprovable performance claims | Live observability dashboard |

Tech Stack

JavaScript (ES Modules) · WebAssembly · ONNX Runtime Web · Transformers.js · Web Workers · Web Speech API · SpeechSynthesis API · Python (local server)


Author & Contributions

👤 Author

Rajveer Singh Saggu

  • High-Performance Systems & Adaptive ML Infrastructure
  • GitHub | LinkedIn

🤝 Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.
