🎙️ SpeechSeek

SpeechSeek is an advanced, end-to-end audio processing pipeline designed for comprehensive conversational analysis. It combines state-of-the-art acoustic modeling, multilingual Automatic Speech Recognition (ASR), and Large Language Models (LLMs) to perform robust speaker diarization, transcription, language identification, and emotion detection—specifically optimized for code-switched Indian languages.

✨ Key Features

Intelligent Speaker Diarization: Utilizes Naturalspeech3's FACodec for high-dimensional acoustic feature extraction, paired with an adaptive angular distance clustering algorithm and chunk-based reassignment to accurately track distinct voices.
Multilingual LID & ASR: Features a custom 6-Language Identification model (English, Hindi, Tamil, Telugu, Marathi, Kannada) that dynamically prompts openai/whisper-medium. This ensures highly accurate transcription, even during complex code-switching scenarios.
Segment-Level Emotion Detection: Employs a custom QFormerBridge mapped to the CLAP embedding space to classify the speaker's emotional state across 6 categories (Happy, Sad, Anger, Fear, Disgust, Surprise) for every single utterance.
LLM-Powered Contextual Refinement: Passes raw segment metadata and complete audio context to Google Gemini to intelligently verify speaker identities based on turn-taking, correct minor ASR hallucinations, and seamlessly format the final output.
Interactive Web UI: Includes a deployed Gradio interface for seamless microphone recording or audio file uploading.

🧠 Architecture & Tech Stack

Acoustic & Semantic Extraction: Amphion FACodec, SpeechTokenizer
Transcription: Hugging Face transformers, openai/whisper-medium
Custom Models: PyTorch-based Q-Former, CNN-based LID Classifier
LLM Integration: Google GenAI SDK (Gemini 2.0 Flash / 2.5 Pro)
UI & Deployment: Gradio

⚙️ How It Works

Audio Ingestion: The user uploads or records audio via the Gradio UI. The audio is normalized and resampled to 16kHz mono.
Feature Extraction: The audio is chunked into sliding windows. FACodec extracts high-dimensional speaker embeddings.
Clustering: Embeddings are clustered using a custom adaptive angular distance algorithm to map out raw speaker segments.
Diarized Processing: * The LID model detects the language of each segment.
- Whisper transcribes the audio using the detected language as a prompt.
- The Q-Former model assigns an emotion label.
LLM Formatting: Gemini acts as the final adjudicator, listening to the full audio context to merge broken segments, fix transcription anomalies, and output a clean, readable transcript in the format: (start-end) speaker: text (emotion).

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
Copy_of_speechseek_project_RC.ipynb		Copy_of_speechseek_project_RC.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🎙️ SpeechSeek

✨ Key Features

🧠 Architecture & Tech Stack

⚙️ How It Works

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🎙️ SpeechSeek

✨ Key Features

🧠 Architecture & Tech Stack

⚙️ How It Works

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages