Bridging the gap between Sign Language and Spoken English with Real-Time, Edge-Computed AI.
Sign2Sound Euphoria is a bi-directional Sign Language Translation system designed to run entirely on consumer-grade hardware (Offline-First). It eliminates the need for expensive cloud APIs or heavy server-grade GPUs.
Using a dual-expert architecture built on ST-GCN (Spatial-Temporal Graph Convolutional Network), the system distinguishes between dynamic words (WLASL) and static finger-spelling (ASL letters) in real time. It then passes the raw glosses to a Small Language Model (SLM), which rewrites them as grammatically natural English sentences.
- Dual-Expert Routing: Separate specialized models for Spelling vs. Signing to eliminate the "Hold vs. Letter" confusion.
- Edge-Optimized: Runs at about 22 FPS on a laptop RTX 3050 (4GB VRAM).
- Hybrid Pipeline: Combines Vision (ST-GCN) + Language (SLM) for context-aware translation.
- Privacy First: Zero data leaves the device; fully offline execution.
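The README does not spell out how clips are routed between the two experts. As a rough sketch of one possible rule, the snippet below measures how much the keypoints move across a clip and sends low-motion clips to the spelling expert. The function names, the `"letters"`/`"words"` labels, and the 0.02 threshold are all hypothetical and are not taken from the project's code.

```python
from typing import Sequence, Tuple

# One frame = one (x, y) pair per keypoint, in normalized image coordinates.
Frame = Sequence[Tuple[float, float]]


def motion_energy(frames: Sequence[Frame]) -> float:
    """Mean per-keypoint displacement between consecutive frames."""
    total = 0.0
    count = 0
    for prev, curr in zip(frames, frames[1:]):
        for (x0, y0), (x1, y1) in zip(prev, curr):
            total += ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
            count += 1
    # Clips with fewer than two frames carry no motion information.
    return total / count if count else 0.0


def route(frames: Sequence[Frame], threshold: float = 0.02) -> str:
    """Pick an expert: 'words' for moving clips, 'letters' for held poses.

    The threshold is an illustrative value, not a tuned one.
    """
    return "words" if motion_energy(frames) > threshold else "letters"
```

A learned gating classifier would be another option; a fixed threshold is simply the smallest thing that shows the idea.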
The pipeline processes video input in four distinct stages:
- Skeletal Extraction:
  - Tool: Google MediaPipe Holistic.
  - Data: Extracts 109 Keypoints (Body, Hands, Face) per frame.
  - Normalization: Relative Nose-Centric Alignment (invariant to user position).
- Dual-Expert Inference (ST-GCN):
  - Expert A (WLASL): Tracks temporal motion for dynamic words (e.g., "Mother", "Eat").
  - Expert B (ASL): Recognizes static spatial features for finger-spelling (e.g., "A-D-A-M").
- Grammar Correction (SLM):
  - Input: Raw Glosses (e.g., "Who Eat Now").
  - Model: Quantized Microsoft Phi-2 / DistilGPT-2.
  - Output: Natural English (e.g., "Who is eating now?").
- Vocalization (Coming Soon):
  - Engine: KokoroTTS (High-fidelity, <80 ms latency).
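Nose-centric alignment in the first stage can be illustrated with a short sketch: subtracting the nose coordinate from every keypoint makes the skeleton independent of where the user stands in the frame. The function name and the assumption that the nose sits at index 0 (as it does in MediaPipe's pose landmarks) are illustrative; the project's actual 109-point layout may order keypoints differently.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def nose_center(frame: Sequence[Point], nose_index: int = 0) -> List[Point]:
    """Translate all keypoints so the nose becomes the origin.

    nose_index=0 is an assumption about the keypoint ordering.
    """
    nose_x, nose_y = frame[nose_index]
    return [(x - nose_x, y - nose_y) for x, y in frame]


# The same pose at two different positions in the image...
pose = [(0.5, 0.5), (0.75, 0.25)]
shifted = [(x + 0.25, y + 0.25) for x, y in pose]

# ...maps to the same nose-relative coordinates.
print(nose_center(pose) == nose_center(shifted))  # True
```

This handles translation only; making the input invariant to the user's distance from the camera would need an extra scaling step, which the README does not mention.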
We evaluated the system on a held-out test set (20% split) using an ASUS TUF A15 (RTX 3050).
| Dataset / Task | Accuracy | F1-Score | Latency / Throughput |
|---|---|---|---|
| ASL Letters (Static) | 99.04% | 0.99 | 45 ms |
| WLASL-100 (Dynamic) | 92.05% | 0.91 | 45 ms |
| End-to-End Pipeline | N/A | N/A | ~22 FPS |
Note: Training graphs and confusion matrices are available in the `results/` directory.
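As a sanity check on the table, a per-frame latency converts to throughput as FPS = 1000 / latency in ms. Assuming the 45 ms figure is the per-frame cost and frames are processed one at a time (an assumption; the source does not say how latency was measured), this gives roughly 22 FPS, in line with the end-to-end row. The helper name below is illustrative.

```python
def fps_from_latency_ms(latency_ms: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


print(round(fps_from_latency_ms(45), 1))  # 22.2
```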
- Python 3.10+
- NVIDIA GPU (Recommended) or CPU
- Webcam
- Clone the Repository

      git clone https://github.com/yourusername/Sign2Sound-Euphoria.git
      cd Sign2Sound-Euphoria

- Install Dependencies

      pip install -r requirements.txt

- Download Models
  - Place `stgcn_wlasl100_final.pth` in `models/`.
  - Place `stgcn_letters_scratch.pth` in `models/`. (Pre-trained weights link to come)
Runs the full stack: Video -> Gloss -> SLM Correction.

    python inference/final_pipeline.py

- Input: Sequence of videos (e.g., `who.mp4`, `eat.mp4`, `now.mp4`).
- Output: `[SLM]: Who is eating now?`
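The gloss-to-sentence step can be pictured as joining the per-clip glosses and wrapping them in a prompt for the SLM. The sketch below is an assumption about how that glue might look: `build_prompt`, the prompt wording, and `format_output` are hypothetical, and the model call itself is left out.

```python
from typing import Sequence


def build_prompt(glosses: Sequence[str]) -> str:
    """Join raw glosses into a single instruction prompt (hypothetical wording)."""
    raw = " ".join(g.strip() for g in glosses if g.strip())
    return (
        "Rewrite the sign language glosses as a natural English sentence.\n"
        f"Glosses: {raw}\n"
        "Sentence:"
    )


def format_output(sentence: str) -> str:
    """Prefix the corrected sentence the way the sample output shows."""
    return f"[SLM]: {sentence.strip()}"


prompt = build_prompt(["Who", "Eat", "Now"])
# The prompt would be passed to the quantized SLM; here the reply is hard-coded.
print(format_output("Who is eating now?"))  # [SLM]: Who is eating now?
```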
We used a split-dataset strategy to reduce class imbalance and the confusion between static letters and dynamic words:
- IEEE DataPort ASL Dataset: Used for training the static Spelling Expert (filtered to ~200 samples/class).
- WLASL (Word-Level American Sign Language): The top 100 classes were used for the Dynamic Word Expert.
Access: Dataset composition details available here.
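Filtering to roughly 200 samples per class can be sketched as a per-class cap with random subsampling. This is a minimal illustration, not the project's preprocessing script: the `(item, label)` sample format, the function name, and the fixed seed are assumptions.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple

Sample = Tuple[str, Hashable]  # (path or id, class label)


def cap_per_class(
    samples: Sequence[Sample], max_per_class: int = 200, seed: int = 0
) -> List[Sample]:
    """Keep at most max_per_class samples for each label."""
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    balanced: List[Sample] = []
    for label in sorted(by_label, key=str):
        items = by_label[label]
        if len(items) > max_per_class:
            items = rng.sample(items, max_per_class)
        balanced.extend((item, label) for item in items)
    return balanced
```

Classes with fewer samples than the cap are kept whole, so this only trims over-represented classes; it does not oversample rare ones.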
- KokoroTTS Integration: Replace text output with natural voice synthesis.
- Streaming Decoder: Optimize SLM to decode tokens asynchronously for lower latency.
- Mobile Port: Quantize models for deployment on Android/iOS via TFLite.
- Roshan Robin - AI Engineer & Architecture
- Jayalakshmy Jayakrishnan - Data Processing & Evaluation
- Nima Fathima - Data Processing & Evaluation
- Sakhil N Maju - Frontend & Integration
Distributed under the MIT License. See LICENSE for more information.