Real-time social-emotional intelligence for video calls
An AI-powered desktop overlay that helps neurodivergent individuals understand social cues during video conversations by analyzing facial expressions, voice prosody, and speech patterns.
Spectra is a desktop application designed to bridge the social communication gap for neurodivergent individuals during video calls. By combining computer vision, speech recognition, and AI-powered analysis, Spectra provides real-time insights into the emotional state and social cues of conversation partners.
- 🎭 Facial Expression Analysis - Real-time detection of smiles, brow movements, and eye contact using MediaPipe
- 🎤 Voice Prosody Analysis - Emotion detection from vocal tone and pitch using Hume AI
- 💬 Live Transcription - Real-time speech-to-text for both conversation participants via Deepgram
- 🤖 AI-Powered Insights - Contextual social cue interpretation using GPT-5-nano or Llama 3.1
- ⚡ Dual-Path Synthesis - Fast rule-based alerts (500ms) + deep LLM analysis (2s)
- 🪟 Floating Overlay - Non-intrusive HUD that stays on top of all applications
- ⌨️ Keyboard Controls - Quick positioning and visibility toggling
- Electron: Cross-platform desktop framework with native macOS integration
- React 18: Component-based UI with hooks for state management
- Zustand: Lightweight state management for real-time data flow
- Framer Motion: Smooth animations and transitions for the overlay UI
- Tailwind CSS v4: Modern utility-first styling with glassmorphism effects
- MediaPipe FaceLandmarker: 468-point facial landmark detection with expression blendshapes
- Hume AI Batch API: Voice prosody analysis for emotional tone detection
- Deepgram Nova-2: Real-time speech-to-text with speaker diarization
- OpenAI GPT-5-nano / Groq Llama 3.1: Social cue interpretation and contextual advice generation
- Input Capture: Desktop screen + system audio capture via Electron's desktopCapturer
- Parallel Processing: Simultaneous face analysis (160ms), audio transcription (real-time), and prosody detection (3s chunks)
- Context Synthesis: Fusion of visual, auditory, and textual signals into unified emotional state
- Dual-Path Analysis:
- Fast Path (500ms): Rule-based detection for sarcasm, disengagement, escalation
- Slow Path (2s): LLM-powered nuanced interpretation with actionable advice
- Real-Time Display: Emotion visualization, live advice, and alerts on floating overlay
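The context-synthesis step can be sketched as a weighted fusion of per-modality emotion scores into one unified state. This is an illustrative sketch, not Spectra's actual `contextSynthesis.ts` code; the equal face/voice weights and the `fuse` helper are assumptions.

```typescript
// Hypothetical fusion of facial and vocal emotion scores into a unified
// emotional state. Weights are illustrative placeholders.
type Emotion = "joy" | "anger" | "sadness" | "fear" | "anxiety" | "neutral";
type Scores = Partial<Record<Emotion, number>>; // each score in [0, 1]

function fuse(
  face: Scores,
  voice: Scores,
  weights = { face: 0.5, voice: 0.5 }
): Scores {
  const emotions: Emotion[] = ["joy", "anger", "sadness", "fear", "anxiety", "neutral"];
  const out: Scores = {};
  for (const e of emotions) {
    // Missing modality scores are treated as zero intensity.
    out[e] = (face[e] ?? 0) * weights.face + (voice[e] ?? 0) * weights.voice;
  }
  return out;
}
```

In practice the weights could be tuned per emotion (e.g. anger is often more audible than visible), but an even split keeps the sketch simple.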
For a comprehensive view of the complete signal-processing pipeline with all intermediate steps, see the data-flow diagram in `assets/dataflow.png`.
- Node.js 18.x or higher
- macOS 12.0+ (Monterey or later) with Screen Recording permissions
- API Keys for: Hume AI, Deepgram, and OpenAI or Groq
# Clone the repository
git clone https://github.com/Dharshan2004/spectra.git
cd spectra
# Install dependencies
npm install
# Create environment file
cp .env.example .env
# Add your API keys to .env
VITE_HUME_API_KEY=your_hume_key
VITE_DEEPGRAM_API_KEY=your_deepgram_key
VITE_OPENAI_API_KEY=your_openai_key
# OR
VITE_GROQ_API_KEY=your_groq_key

# Start the development server
npm run dev

The app will launch with DevTools attached. On first run, you'll need to grant Screen Recording permissions:
- Go to System Settings → Privacy & Security → Screen Recording
- Enable Electron in the list
- Restart the app (`Cmd+Q`, then `npm run dev`)
# Build the app
npm run build
# Package for macOS
npm run package:mac

| Action | Shortcut |
|---|---|
| Move overlay | Cmd + Arrow Keys |
| Hide/Show | Cmd + H |
| Select source | Click monitor icon |
| Start microphone | Click mic icon |
| Collapse/Expand | Click chevron icon |
- Launch Spectra - The overlay appears in the top-right corner
- Select Source - Click the monitor icon and choose "Entire Screen" (required for audio)
- Start Call - Join your video call (Zoom, Meet, Teams, etc.)
- Monitor Insights - Watch real-time emotion detection and social cue advice
- Optional: Enable Mic - Add your own voice for better conversation context
- Emotion Orb: Pulses with detected emotion color (Joy=Gold, Anger=Red, etc.)
- Emotion Bar: Visual intensity indicator below the toolbar
- Live Advice: Main text showing contextual social cue interpretation
- Cue Tags: Detected patterns with confidence scores
- Alerts: Pop-up notifications for critical situations (escalation, sarcasm)
Facial Expression Mapping:
- Smile: `(mouthSmileLeft + mouthSmileRight) / 2` from MediaPipe blendshapes
- Brow Tension: `(browDownLeft + browDownRight) / 2` indicates concern/frustration
- Eye Contact: Calculated from horizontal gaze direction (looking straight vs. away)
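The facial-expression mapping above can be sketched in a few lines. The `mouthSmile*` and `browDown*` names are standard MediaPipe FaceLandmarker blendshape categories; the `eyeContact` heuristic and the specific gaze blendshapes it reads are assumptions for illustration, not Spectra's exact implementation.

```typescript
// Illustrative blendshape-to-metric helpers; names of the helpers and the
// eye-contact heuristic are assumptions, not Spectra's tuned code.
type Blendshapes = Record<string, number>; // category name -> score in [0, 1]

function smileScore(b: Blendshapes): number {
  return ((b["mouthSmileLeft"] ?? 0) + (b["mouthSmileRight"] ?? 0)) / 2;
}

function browTension(b: Blendshapes): number {
  return ((b["browDownLeft"] ?? 0) + (b["browDownRight"] ?? 0)) / 2;
}

// High "look out/in" scores mean the gaze is averted horizontally,
// so eye contact is approximated as the inverse of the strongest one.
function eyeContact(b: Blendshapes): number {
  const left = Math.max(b["eyeLookOutLeft"] ?? 0, b["eyeLookInLeft"] ?? 0);
  const right = Math.max(b["eyeLookOutRight"] ?? 0, b["eyeLookInRight"] ?? 0);
  return 1 - Math.max(left, right);
}
```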
Voice Prosody:
- Hume AI analyzes 48 emotional dimensions, mapped to 6 core emotions: Joy, Anger, Sadness, Fear, Anxiety, Neutral
- Audio encoded as 16-bit PCM WAV at 16kHz sample rate
- Processed in 3-second chunks via Hume's batch API
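Encoding captured Float32 audio samples into the 16-bit PCM WAV format described above looks roughly like this. The 44-byte RIFF/WAVE header layout is the standard one; the `encodeWav` helper name is illustrative, not Spectra's actual code.

```typescript
// Sketch: encode Float32 samples in [-1, 1] as 16-bit PCM mono WAV
// at 16 kHz, using the standard 44-byte RIFF/WAVE header.
function encodeWav(samples: Float32Array, sampleRate = 16000): ArrayBuffer {
  const buffer = new ArrayBuffer(44 + samples.length * 2);
  const view = new DataView(buffer);
  const writeStr = (offset: number, s: string) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };
  writeStr(0, "RIFF");
  view.setUint32(4, 36 + samples.length * 2, true); // total size - 8
  writeStr(8, "WAVE");
  writeStr(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // audio format: PCM
  view.setUint16(22, 1, true);              // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate (16-bit mono)
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  writeStr(36, "data");
  view.setUint32(40, samples.length * 2, true);
  for (let i = 0; i < samples.length; i++) {
    // Clamp, then scale to signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i] ?? 0));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

A 3-second chunk at 16 kHz is 48,000 samples, i.e. about 94 KB per request to the batch API.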
Context Synthesis:
- Rule-based fast path detects sarcasm (tone-expression mismatch), disengagement (low eye contact + flat affect), escalation (high anger)
- LLM slow path combines transcript + facial data + prosody for nuanced interpretation
- Advice generated with emphasis on supportive, actionable guidance
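The rule-based fast path can be sketched as simple threshold checks over the fused signals. All thresholds and field names below are illustrative assumptions, not Spectra's tuned values.

```typescript
// Hypothetical fast-path rules for the three alert types described above.
interface SocialContext {
  smile: number;       // 0..1, from facial blendshapes
  eyeContact: number;  // 0..1, from gaze estimation
  voiceJoy: number;    // 0..1, from prosody
  voiceAnger: number;  // 0..1, from prosody
}

function fastPathAlerts(ctx: SocialContext): string[] {
  const alerts: string[] = [];
  // Sarcasm: upbeat vocal tone mismatched with a flat or negative face.
  if (ctx.voiceJoy > 0.6 && ctx.smile < 0.2) alerts.push("possible-sarcasm");
  // Disengagement: low eye contact combined with flat affect.
  if (ctx.eyeContact < 0.3 && ctx.smile < 0.2 && ctx.voiceJoy < 0.2) {
    alerts.push("disengagement");
  }
  // Escalation: high anger in the voice.
  if (ctx.voiceAnger > 0.7) alerts.push("escalation");
  return alerts;
}
```

Because these checks are pure arithmetic, they comfortably fit the 500 ms fast-path budget; the slow path then adds nuance the rules miss.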
Spectra uses desktopCapturer with special handling for system audio:
// Screen capture with audio
const stream = await navigator.mediaDevices.getUserMedia({
audio: {
mandatory: {
chromeMediaSource: 'desktop',
chromeMediaSourceId: sourceId
}
},
video: {
mandatory: {
chromeMediaSource: 'desktop',
chromeMediaSourceId: sourceId
}
}
})

Important: System audio is only available when capturing "Entire Screen", not individual windows. This is a macOS security restriction.
- Frame Processing: 6 FPS (every 160ms) for MediaPipe to balance accuracy and CPU usage
- AudioContext: ScriptProcessor with 4096 buffer size for low-latency processing
- State Management: Zustand with `useShallow` to prevent unnecessary re-renders
- Debouncing: LLM calls rate-limited to a 1.5s minimum interval
- Lazy Loading: MediaPipe WASM loaded from CDN on-demand
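The 1.5 s debounce on LLM calls amounts to a minimum-interval gate. A minimal sketch, assuming a `RateLimiter` class (the name and wiring are illustrative; Spectra's actual hook may differ):

```typescript
// Gate that allows at most one call per minIntervalMs.
class RateLimiter {
  private last = 0;
  constructor(private minIntervalMs: number) {}

  // Returns true (and records the call) only if enough time has passed
  // since the last accepted call; otherwise the caller should skip.
  tryAcquire(now: number = Date.now()): boolean {
    if (now - this.last < this.minIntervalMs) return false;
    this.last = now;
    return true;
  }
}
```

Usage: wrap the slow-path request in `if (limiter.tryAcquire()) { await analyzeWithLLM(ctx); }` so bursts of context updates collapse into at most one LLM call per 1.5 s window.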
spectra/
├── electron/
│ ├── main.ts # Main process (window management, IPC)
│ └── preload.ts # Preload script (IPC bridge)
├── src/
│ ├── components/
│ │ └── HUD/
│ │ └── SocialHUD.tsx # Main overlay UI
│ ├── services/
│ │ ├── audioCapture.ts # System audio + mic capture
│ │ ├── screenCapture.ts # Desktop video capture
│ │ ├── vision.ts # MediaPipe integration
│ │ ├── hume.ts # Hume AI prosody API
│ │ ├── transcription.ts # Deepgram streaming
│ │ └── socialCueDecoder.ts # LLM-powered analysis
│ ├── core/
│ │ └── contextSynthesis.ts # Signal fusion logic
│ ├── hooks/
│ │ └── useSocialSynthesis.ts # Dual-path orchestration
│ ├── store/
│ │ └── useSocialStore.ts # Zustand state management
│ ├── types/
│ │ └── index.ts # TypeScript definitions
│ ├── App.tsx # Root component
│ ├── main.tsx # React entry point
│ └── index.css # Global styles (Tailwind)
├── assets/
│ ├── logo.png # Application logo
│ ├── architecture.png # System architecture diagram
│ └── dataflow.png # Data flow visualization
├── electron.vite.config.ts # Vite + Electron build config
├── tailwind.config.js # Tailwind CSS configuration
├── tsconfig.json # TypeScript configuration
├── package.json # Dependencies and scripts
└── .env.example # Environment variables template
- No Local Storage: All AI inference happens via cloud API calls; no data is stored locally
- No Recording: Spectra analyzes live streams but does not record or save audio/video
- Encrypted Transit: All API communications use HTTPS/WSS
- Permissions: Only requests Screen Recording access; no camera or mic access by default
- Click-Through: Overlay doesn't interfere with underlying applications
- API Keys: Stored in a local `.env` file, never committed to version control
Problem: "Unable to capture screen" or "Permission denied"
Solution:
- Open System Settings → Privacy & Security → Screen Recording
- Find Electron (dev mode) or Spectra (production) in the list
- Toggle it OFF and back ON
- Fully quit the app (`Cmd+Q`) and restart
Problem: Transcription and prosody not detecting audio
Solution:
- Ensure you selected "Entire Screen" or "Screen 1", NOT a window
- macOS only provides audio for full screen captures, not individual windows
- Check that audio is playing from the video call (test with YouTube)
Problem: HUD disappears when video call goes fullscreen
Solution:
- This should be handled automatically via `setVisibleOnAllWorkspaces`
- If it persists, try moving the overlay with `Cmd + Arrow Keys` while in fullscreen
- Alternative: Use "windowed fullscreen" instead of native fullscreen in your video call app
Problem: Fans spinning up, system lag
Solution:
- MediaPipe face detection is GPU-accelerated but can be intensive
- Reduce quality: Lower the screen capture resolution in `screenCapture.ts` (lines 44-47)
- Close DevTools in production (`npm run build` instead of `npm run dev`)
- Disable Hume prosody if not needed (comment it out in `App.tsx`, lines 110-114)
# Run linter
npm run lint
# Auto-fix issues
npm run lint:fix
# Format code
npm run format

# Run tests (when implemented)
npm test
# Test in production mode without packaging
npm run build
npm run preview

Electron Main Process:
# Run with inspector
npm run dev -- --inspect=5858

React DevTools:
- DevTools open automatically in development mode
- Use the Console tab for service logs (`[MediaPipe]`, `[Hume]`, `[Deepgram]`, etc.)
- Multi-language Support - Extend beyond English for global accessibility
- Sentiment History - Track emotional patterns over time with visualizations
- Custom Alerts - User-defined rules for specific social cue combinations
- Recording Mode - Optional session recording for post-call review
- Mobile Companion - iOS/Android app for in-person conversations
- Accessibility Features - Screen reader support, high-contrast themes
- Integration APIs - Export insights to therapy apps or journaling tools
We welcome contributions from the community! Whether it's bug reports, feature requests, or code contributions, please feel free to get involved.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Please read our CONTRIBUTING.md for detailed guidelines.
| Name | Role |
|---|---|
| K Priyadharshan | Lead Developer |
| Negha M | Teammate |
| Aafia | Teammate |
This project is licensed under the MIT License - see the LICENSE file for details.
- MediaPipe by Google - Facial landmark detection framework
- Hume AI - Voice prosody and emotion recognition API
- Deepgram - High-accuracy speech-to-text service
- OpenAI / Groq - LLM inference for social cue interpretation
- Electron - Cross-platform desktop framework
- React - UI component library
For questions, feedback, or collaboration opportunities:
- Email: kpd2204@gmail.com
- GitHub: @Dharshan2004
- LinkedIn: K Priyadharshan
