Spectra Logo

Spectra

Real-time social-emotional intelligence for video calls

An AI-powered desktop overlay that helps neurodivergent individuals understand social cues during video conversations by analyzing facial expressions, voice prosody, and speech patterns.

License: MIT · Electron · React · TypeScript


🎯 Overview

Spectra is a desktop application designed to bridge the social communication gap for neurodivergent individuals during video calls. By combining computer vision, speech recognition, and AI-powered analysis, Spectra provides real-time insights into the emotional state and social cues of conversation partners.

Key Features

  • 🎭 Facial Expression Analysis - Real-time detection of smiles, brow movements, and eye contact using MediaPipe
  • 🎤 Voice Prosody Analysis - Emotion detection from vocal tone and pitch using Hume AI
  • 💬 Live Transcription - Real-time speech-to-text for both conversation participants via Deepgram
  • 🤖 AI-Powered Insights - Contextual social cue interpretation using GPT-5-nano or Llama 3.1
  • ⚡ Dual-Path Synthesis - Fast rule-based alerts (500ms) + deep LLM analysis (2s)
  • 🪟 Floating Overlay - Non-intrusive HUD that stays on top of all applications
  • ⌨️ Keyboard Controls - Quick positioning and visibility toggling

🏗️ Architecture

Spectra Architecture

System Components

Frontend Layer

  • Electron: Cross-platform desktop framework with native macOS integration
  • React 18: Component-based UI with hooks for state management
  • Zustand: Lightweight state management for real-time data flow
  • Framer Motion: Smooth animations and transitions for the overlay UI
  • Tailwind CSS v4: Modern utility-first styling with glassmorphism effects

AI Services Layer

  • MediaPipe FaceLandmarker: 468-point facial landmark detection with expression blendshapes
  • Hume AI Batch API: Voice prosody analysis for emotional tone detection
  • Deepgram Nova-2: Real-time speech-to-text with speaker diarization
  • OpenAI GPT-5-nano / Groq Llama 3.1: Social cue interpretation and contextual advice generation

Processing Pipeline

  1. Input Capture: Desktop screen + system audio capture via Electron's desktopCapturer
  2. Parallel Processing: Simultaneous face analysis (160ms), audio transcription (real-time), and prosody detection (3s chunks)
  3. Context Synthesis: Fusion of visual, auditory, and textual signals into unified emotional state
  4. Dual-Path Analysis:
    • Fast Path (500ms): Rule-based detection for sarcasm, disengagement, escalation
    • Slow Path (2s): LLM-powered nuanced interpretation with actionable advice
  5. Real-Time Display: Emotion visualization, live advice, and alerts on floating overlay
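The dual-path step above can be sketched as follows. This is a minimal illustration, not Spectra's actual code: `ContextFrame`, `fastRules`, `onContextUpdate`, and all thresholds are assumed names and values.

```typescript
// Dual-path idea: a cheap synchronous rule check on every context update,
// plus a rate-limited asynchronous LLM call for deeper analysis.

interface ContextFrame {
  smile: number;        // 0..1, from blendshapes
  eyeContact: number;   // 0..1
  anger: number;        // 0..1, from prosody
  transcript: string;
}

// Fast path: fires immediately on simple rule matches.
function fastRules(ctx: ContextFrame): string | null {
  if (ctx.anger > 0.7) return "escalation";
  if (ctx.eyeContact < 0.2 && ctx.smile < 0.1) return "disengagement";
  return null;
}

const SLOW_PATH_INTERVAL_MS = 2000;
let lastSlowCall = -Infinity;

// Slow path: fire-and-forget LLM call, at most once per interval.
function onContextUpdate(
  ctx: ContextFrame,
  slowAnalyze: (ctx: ContextFrame) => Promise<string>
): string | null {
  const now = Date.now();
  if (now - lastSlowCall >= SLOW_PATH_INTERVAL_MS) {
    lastSlowCall = now;
    void slowAnalyze(ctx);
  }
  return fastRules(ctx);
}
```

The key design point is that the fast path never waits on the network: rule-based alerts surface within a frame, while LLM results arrive later and update the advice text when ready.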

📊 Data Flow

Spectra Data Flow

Detailed Signal Flow

For a comprehensive view of the complete signal processing pipeline with all intermediate steps:

Detailed Signal Flow

🚀 Getting Started

Prerequisites

  • Node.js 18.x or higher
  • macOS 12.0+ (Monterey or later) with Screen Recording permissions
  • API Keys for: Hume AI, Deepgram, and OpenAI or Groq

Installation

# Clone the repository
git clone https://github.com/Dharshan2004/spectra.git
cd spectra

# Install dependencies
npm install

# Create environment file
cp .env.example .env

# Add your API keys to .env
VITE_HUME_API_KEY=your_hume_key
VITE_DEEPGRAM_API_KEY=your_deepgram_key
VITE_OPENAI_API_KEY=your_openai_key
# OR
VITE_GROQ_API_KEY=your_groq_key

Running in Development

npm run dev

The app will launch with DevTools attached. On first run, you'll need to grant Screen Recording permissions:

  1. Go to System Settings → Privacy & Security → Screen Recording
  2. Enable Electron in the list
  3. Restart the app (Cmd+Q, then npm run dev)

Building for Production

# Build the app
npm run build

# Package for macOS
npm run package:mac

🎮 Usage

Basic Controls

  • Move overlay - Cmd + Arrow Keys
  • Hide/Show - Cmd + H
  • Select source - Click monitor icon
  • Start microphone - Click mic icon
  • Collapse/Expand - Click chevron icon

Workflow

  1. Launch Spectra - The overlay appears in the top-right corner
  2. Select Source - Click the monitor icon and choose "Entire Screen" (required for audio)
  3. Start Call - Join your video call (Zoom, Meet, Teams, etc.)
  4. Monitor Insights - Watch real-time emotion detection and social cue advice
  5. Optional: Enable Mic - Add your own voice for better conversation context

Understanding the HUD

  • Emotion Orb: Pulses with detected emotion color (Joy=Gold, Anger=Red, etc.)
  • Emotion Bar: Visual intensity indicator below the toolbar
  • Live Advice: Main text showing contextual social cue interpretation
  • Cue Tags: Detected patterns with confidence scores
  • Alerts: Pop-up notifications for critical situations (escalation, sarcasm)

🔧 Technical Details

Emotion Detection

Facial Expression Mapping:

  • Smile: (mouthSmileLeft + mouthSmileRight) / 2 from MediaPipe blendshapes
  • Brow Tension: (browDownLeft + browDownRight) / 2 indicates concern/frustration
  • Eye Contact: Calculated from horizontal gaze direction (looking straight vs. away)
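The formulas above can be written out directly against MediaPipe blendshape scores (category name → score in 0..1). The helper names are illustrative; only the smile and brow formulas come from the README, and the eye-contact heuristic is an assumption based on the lateral-gaze blendshapes.

```typescript
type Blendshapes = Record<string, number>;

// Smile: average of the left/right mouth-smile blendshapes.
function smileScore(b: Blendshapes): number {
  return ((b["mouthSmileLeft"] ?? 0) + (b["mouthSmileRight"] ?? 0)) / 2;
}

// Brow tension: average of the left/right brow-down blendshapes.
function browTension(b: Blendshapes): number {
  return ((b["browDownLeft"] ?? 0) + (b["browDownRight"] ?? 0)) / 2;
}

// Eye contact: lateral gaze blendshapes rise when the gaze drifts sideways,
// so low lateral movement is treated as "looking straight".
function eyeContact(b: Blendshapes): number {
  const lateral = Math.max(
    b["eyeLookOutLeft"] ?? 0, b["eyeLookOutRight"] ?? 0,
    b["eyeLookInLeft"] ?? 0, b["eyeLookInRight"] ?? 0
  );
  return 1 - lateral;
}
```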

Voice Prosody:

  • Hume AI analyzes 48 emotional dimensions, mapped to 6 core emotions: Joy, Anger, Sadness, Fear, Anxiety, Neutral
  • Audio encoded as 16-bit PCM WAV at 16kHz sample rate
  • Processed in 3-second chunks via Hume's batch API
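The 16-bit PCM step above amounts to converting Web Audio's Float32 samples (range -1..1) to signed 16-bit integers. A minimal sketch, with WAV header construction omitted:

```typescript
// Convert Float32 audio samples (-1..1) to 16-bit signed PCM.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to valid range
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;        // scale to int16 range
  }
  return out;
}
```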

Context Synthesis:

  • Rule-based fast path detects sarcasm (tone-expression mismatch), disengagement (low eye contact + flat affect), escalation (high anger)
  • LLM slow path combines transcript + facial data + prosody for nuanced interpretation
  • Advice generated with emphasis on supportive, actionable guidance
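The fast-path rules described above might look like this. The field names and thresholds are illustrative assumptions, not Spectra's actual values:

```typescript
interface Signals {
  smile: number;      // facial positivity, 0..1
  eyeContact: number; // 0..1
  joy: number;        // vocal joy, 0..1
  anger: number;      // vocal anger, 0..1
}

function detectCues(s: Signals): string[] {
  const cues: string[] = [];
  // Sarcasm: tone-expression mismatch (smiling face, negative voice).
  if (s.smile > 0.6 && s.anger > 0.5) cues.push("sarcasm");
  // Disengagement: looking away with flat affect.
  if (s.eyeContact < 0.2 && s.smile < 0.15 && s.joy < 0.15) cues.push("disengagement");
  // Escalation: high vocal anger.
  if (s.anger > 0.75) cues.push("escalation");
  return cues;
}
```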

macOS Screen Capture

Spectra uses desktopCapturer with special handling for system audio:

// Screen capture with audio. Electron's non-standard `mandatory`
// constraints are not part of the DOM typings, hence the cast.
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    mandatory: {
      chromeMediaSource: 'desktop',
      chromeMediaSourceId: sourceId
    }
  },
  video: {
    mandatory: {
      chromeMediaSource: 'desktop',
      chromeMediaSourceId: sourceId
    }
  }
} as unknown as MediaStreamConstraints)

Important: System audio is only available when capturing "Entire Screen", not individual windows. This is a macOS security restriction.
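For context, the `sourceId` consumed above is typically obtained in the main process via `desktopCapturer.getSources` and handed to the renderer over IPC. A sketch, in which the channel name `get-capture-sources` is a hypothetical:

```typescript
// Main process: enumerate capture sources for the renderer's source picker.
import { desktopCapturer, ipcMain } from 'electron'

ipcMain.handle('get-capture-sources', async () => {
  const sources = await desktopCapturer.getSources({ types: ['screen', 'window'] })
  // On macOS, only 'screen' sources carry system audio (see note above).
  return sources.map(s => ({ id: s.id, name: s.name }))
})
```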

Performance Optimization

  • Frame Processing: 6 FPS (every 160ms) for MediaPipe to balance accuracy and CPU usage
  • AudioContext: ScriptProcessor with 4096 buffer size for low-latency processing
  • State Management: Zustand with useShallow to prevent unnecessary re-renders
  • Debouncing: LLM calls rate-limited to 1.5s minimum interval
  • Lazy Loading: MediaPipe WASM loaded from CDN on-demand
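The 1.5 s debouncing bullet can be implemented as a generic minimum-interval gate that drops calls arriving too soon. A sketch (the `now` parameter is injectable only to make the gate testable; this is an assumed helper, not Spectra's actual code):

```typescript
// Wrap a function so it runs at most once per `ms` milliseconds;
// returns true if the call ran, false if it was dropped.
function minInterval<T extends unknown[]>(
  ms: number,
  fn: (...args: T) => void,
  now: () => number = Date.now
): (...args: T) => boolean {
  let last = -Infinity;
  return (...args: T) => {
    const t = now();
    if (t - last < ms) return false; // dropped: too soon after last call
    last = t;
    fn(...args);
    return true;
  };
}

// Usage sketch: const callLLM = minInterval(1500, ctx => analyze(ctx));
```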

🧩 Project Structure

spectra/
├── electron/
│   ├── main.ts              # Main process (window management, IPC)
│   └── preload.ts           # Preload script (IPC bridge)
├── src/
│   ├── components/
│   │   └── HUD/
│   │       └── SocialHUD.tsx    # Main overlay UI
│   ├── services/
│   │   ├── audioCapture.ts      # System audio + mic capture
│   │   ├── screenCapture.ts     # Desktop video capture
│   │   ├── vision.ts            # MediaPipe integration
│   │   ├── hume.ts              # Hume AI prosody API
│   │   ├── transcription.ts     # Deepgram streaming
│   │   └── socialCueDecoder.ts  # LLM-powered analysis
│   ├── core/
│   │   └── contextSynthesis.ts  # Signal fusion logic
│   ├── hooks/
│   │   └── useSocialSynthesis.ts # Dual-path orchestration
│   ├── store/
│   │   └── useSocialStore.ts    # Zustand state management
│   ├── types/
│   │   └── index.ts             # TypeScript definitions
│   ├── App.tsx                  # Root component
│   ├── main.tsx                 # React entry point
│   └── index.css                # Global styles (Tailwind)
├── assets/
│   ├── logo.png                 # Application logo
│   ├── architecture.png         # System architecture diagram
│   └── dataflow.png             # Data flow visualization
├── electron.vite.config.ts      # Vite + Electron build config
├── tailwind.config.js           # Tailwind CSS configuration
├── tsconfig.json                # TypeScript configuration
├── package.json                 # Dependencies and scripts
└── .env.example                 # Environment variables template

🔐 Privacy & Security

  • Cloud Inference: All AI analysis happens via API calls; no data is stored locally
  • No Recording: Spectra analyzes live streams but does not record or save audio/video
  • Encrypted Transit: All API communications use HTTPS/WSS
  • Permissions: Only requests Screen Recording access; no camera or mic access by default
  • Click-Through: Overlay doesn't interfere with underlying applications
  • API Keys: Stored in local .env file, never committed to version control
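The click-through, always-on-top behavior comes from the overlay window's configuration in the main process. A sketch under assumptions (exact option values and `PRELOAD_PATH` are illustrative, not Spectra's actual settings):

```typescript
// Main process: frameless, transparent, click-through overlay window.
import { BrowserWindow } from 'electron'

const overlay = new BrowserWindow({
  frame: false,
  transparent: true,
  alwaysOnTop: true,
  hasShadow: false,
  webPreferences: { preload: PRELOAD_PATH } // PRELOAD_PATH: hypothetical path to your preload script
})
overlay.setAlwaysOnTop(true, 'screen-saver')           // stay above fullscreen apps
overlay.setVisibleOnAllWorkspaces(true, { visibleOnFullScreen: true })
overlay.setIgnoreMouseEvents(true, { forward: true })  // click-through
```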

🐛 Troubleshooting

Screen Recording Permission Issues

Problem: "Unable to capture screen" or "Permission denied"

Solution:

  1. Open System Settings → Privacy & Security → Screen Recording
  2. Find Electron (dev mode) or Spectra (production) in the list
  3. Toggle it OFF and back ON
  4. Fully quit the app (Cmd+Q) and restart

System Audio Not Working

Problem: Transcription and prosody not detecting audio

Solution:

  • Ensure you selected "Entire Screen" or "Screen 1", NOT a window
  • macOS only provides audio for full screen captures, not individual windows
  • Check that audio is playing from the video call (test with YouTube)

Overlay Not Appearing on Fullscreen Apps

Problem: HUD disappears when video call goes fullscreen

Solution:

  • This should be handled automatically via setVisibleOnAllWorkspaces
  • If it persists, try moving the overlay with Cmd+Arrow Keys while in fullscreen
  • Alternative: Use "windowed fullscreen" instead of native fullscreen in your video call app

High CPU Usage

Problem: Fans spinning up, system lag

Solution:

  • MediaPipe face detection is GPU-accelerated but can be intensive
  • Reduce quality: Lower screen capture resolution in screenCapture.ts (line 44-47)
  • Close DevTools in production (npm run build instead of npm run dev)
  • Disable Hume prosody if not needed (comment out in App.tsx line 110-114)

🛠️ Development

Code Style

# Run linter
npm run lint

# Auto-fix issues
npm run lint:fix

# Format code
npm run format

Testing

# Run tests (when implemented)
npm test

# Test in production mode without packaging
npm run build
npm run preview

Debugging

Electron Main Process:

# Run with inspector
npm run dev -- --inspect=5858

React DevTools:

  • DevTools open automatically in development mode
  • Use Console tab for service logs ([MediaPipe], [Hume], [Deepgram], etc.)

📈 Future Roadmap

  • Multi-language Support - Extend beyond English for global accessibility
  • Sentiment History - Track emotional patterns over time with visualizations
  • Custom Alerts - User-defined rules for specific social cue combinations
  • Recording Mode - Optional session recording for post-call review
  • Mobile Companion - iOS/Android app for in-person conversations
  • Accessibility Features - Screen reader support, high-contrast themes
  • Integration APIs - Export insights to therapy apps or journaling tools

🤝 Contributing

We welcome contributions from the community! Whether it's bug reports, feature requests, or code contributions, please feel free to get involved.

How to Contribute

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Please read our CONTRIBUTING.md for detailed guidelines.


👥 Team

NTU WIT Beyond Binary 2026 Hackathon Team

  • K Priyadharshan - Lead Developer
  • Negha M - Teammate
  • Aafia - Teammate

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • MediaPipe by Google - Facial landmark detection framework
  • Hume AI - Voice prosody and emotion recognition API
  • Deepgram - High-accuracy speech-to-text service
  • OpenAI / Groq - LLM inference for social cue interpretation
  • Electron - Cross-platform desktop framework
  • React - UI component library

📧 Contact

For questions, feedback, or collaboration opportunities:


Built for NTU WIT Beyond Binary 2026 Hackathon

Empowering neurodivergent individuals with AI-driven social intelligence

