Comprehensive Speech & Audio Processing, Disfluency Detection, Spell Checking, and Model Evaluation
- Overview
- Module 1: Whisper Fine-Tuning
- Module 2: Disfluency Detection
- Module 3: NLP Spell Checking
- Module 4: Consensus Architecture & Evaluation
This repository contains solutions, methodologies, and architectures tailored to conversational Hindi (Hinglish). The pipeline spans end-to-end model fine-tuning, automated NLP data cleaning, and consensus-based WER scoring.
Fine-tuned `openai/whisper-small` on the Hindi subset (`hi_in`) of the FLEURS dataset.
- Bypassed default Hugging Face audio casting on Apple Silicon by decoding audio chunks manually with `soundfile`.
- Alleviated `mps` backend out-of-memory issues locally by setting `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0`.
- Used a batch size of 1 with 16 gradient accumulation steps (an effective batch size of 16) to ensure stable training.
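The memory and batching settings above can be sketched roughly as follows. This is a minimal illustration, not the repository's training script: the helper names `effective_batch_size` and `is_optimizer_step` are hypothetical, and setting the environment variable before `torch` is imported is assumed to be the safest ordering.

```python
import os

# Assumption: set this before importing torch so the MPS allocator picks it up.
# A ratio of 0.0 removes the upper limit on MPS memory allocations.
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"

PER_DEVICE_BATCH_SIZE = 1
GRADIENT_ACCUMULATION_STEPS = 16


def effective_batch_size(per_device, accum_steps, num_devices=1):
    """Number of examples that contribute to each optimizer update."""
    return per_device * accum_steps * num_devices


def is_optimizer_step(micro_batch_index, accum_steps):
    """True when the gradients accumulated so far should be applied.

    micro_batch_index is 0-based, so with 16 accumulation steps the
    optimizer updates after micro-batches 15, 31, 47, and so on.
    """
    return (micro_batch_index + 1) % accum_steps == 0


print(effective_batch_size(PER_DEVICE_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS))
```

If training goes through Hugging Face `TrainingArguments`, the corresponding parameters are `per_device_train_batch_size` and `gradient_accumulation_steps`.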
| Model | WER | Relative WER reduction |
|---|---|---|
| Whisper-small (Baseline) | 84.16% | - |
| Whisper-small (Fine-tuned) | 47.50% | **~43.6%** (36.66 points absolute) |
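For reference, word error rate and the relative improvement between two WER values can be computed as in the sketch below. It uses a plain word-level edit distance and is not necessarily the scoring code used for the table; the function names are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level WER: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction of the baseline error removed by the new model."""
    return (baseline_wer - new_wer) / baseline_wer


# The two WER values reported in the table above.
print(f"{relative_wer_reduction(84.16, 47.50):.1%}")
```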
Identified and isolated speech disfluencies (fillers and stutters) across a 10-hour conversational dataset.
- Detection: Leveraged NLP techniques over Google Cloud STT `.json` transcriptions. Extracted common Hindi fillers ("हम्म", "आह", "उम", "अह", "ओह") alongside custom backtracking logic to catch the non-lexical word repetitions typical of stuttering. A total of 907 segments were detected.
- Precision Clipping: Loaded `.wav` recordings directly into memory with `soundfile` to avoid slow I/O operations. Converted `start` and `end` timestamps into sample indices for exact clipping.
- Audio Fidelity: Preserved original audio quality by applying no destructive dynamic-range normalization.
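The detection and clipping steps above could look roughly like this. The sketch makes several assumptions: each word is a dict with hypothetical keys `word`, `start`, and `end` (in seconds) rather than the raw Google Cloud STT JSON layout, and a simple adjacent-repetition check stands in for the custom backtracking logic, which is not shown in this README.

```python
FILLERS = {"हम्म", "आह", "उम", "अह", "ओह"}


def find_disfluencies(words):
    """Return filler and immediate-repetition segments from timestamped words."""
    segments = []
    i = 0
    while i < len(words):
        token = words[i]["word"]
        if token in FILLERS:
            segments.append({"type": "filler", "text": token,
                             "start": words[i]["start"], "end": words[i]["end"]})
            i += 1
            continue
        # Extend j over consecutive copies of the same token (stutter-like repeats).
        j = i
        while j + 1 < len(words) and words[j + 1]["word"] == token:
            j += 1
        if j > i:
            segments.append({"type": "repetition", "text": token,
                             "start": words[i]["start"], "end": words[j]["end"]})
        i = j + 1
    return segments


def to_sample_range(start_s, end_s, sample_rate):
    """Convert timestamps in seconds to integer sample indices."""
    # round() rather than int() avoids off-by-one errors from float truncation.
    return round(start_s * sample_rate), round(end_s * sample_rate)


words = [
    {"word": "मैं", "start": 0.0, "end": 0.3},
    {"word": "मैं", "start": 0.3, "end": 0.6},
    {"word": "उम", "start": 0.6, "end": 0.9},
    {"word": "जा", "start": 0.9, "end": 1.1},
]
for seg in find_disfluencies(words):
    print(seg["type"], to_sample_range(seg["start"], seg["end"], 16000))
```

With audio already loaded as an array (for example via `soundfile.read`), a segment can then be clipped by slicing the array with the returned sample range.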
Evaluated ~177,000 unique transcribed Hinglish words to separate correctly spelled terms from misspelled ones, while correctly handling English loanwords written in Devanagari script (e.g., "कंप्यूटर").
Designed a robust two-tiered NLP validation engine:
- Academic Morphological Validation: Used `spylls` with the official LibreOffice `hi_IN` dictionary (`hi_IN.dic` / `hi_IN.aff`), which allows strict evaluation of native Hindi prefix and suffix affixation rules.
- Frequency Corpus Check: Cross-referenced words that failed the morphological check against a large conversational Hindi subtitle frequency corpus. If a word occurred frequently in real-world usage, it was accepted as a correctly spelled conversational loanword.
- Data Cleaning: Filtered out Latin alphabetic characters (A-Z), empty strings, numbers, and boundary punctuation.
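A minimal sketch of the cleaning and two-tier check described above follows. It rests on assumptions: tokens containing Latin letters or digits are dropped outright, the dictionary check is passed in as a callable, and the `min_count` threshold is a hypothetical value, since the README does not state the frequency cutoff.

```python
import re
import string

# \d also matches Devanagari digits for str patterns in Python 3.
LATIN_OR_DIGIT = re.compile(r"[A-Za-z]|\d")
BOUNDARY_PUNCT = string.punctuation + "।॥"


def clean_token(token):
    """Strip boundary punctuation; return None for tokens that should be filtered."""
    token = token.strip().strip(BOUNDARY_PUNCT)
    if not token or LATIN_OR_DIGIT.search(token):
        return None
    return token


def classify(word, dictionary_lookup, frequency, min_count=5):
    """Tier 1: dictionary check. Tier 2: corpus frequency fallback."""
    if dictionary_lookup(word):
        return "correct (dictionary)"
    if frequency.get(word, 0) >= min_count:
        return "correct (frequency)"
    return "incorrect"


# Toy stand-ins for the hi_IN dictionary and the subtitle frequency corpus.
toy_dictionary = {"घर", "पानी"}
toy_frequency = {"कंप्यूटर": 1200, "घरर": 1}

for raw in ["घर,", "कंप्यूटर", "घरर", "hello", "123", ""]:
    word = clean_token(raw)
    if word is not None:
        print(word, classify(word, toy_dictionary.__contains__, toy_frequency))
```

With `spylls` installed, `Dictionary.from_files("hi_IN").lookup` from `spylls.hunspell` could be passed as `dictionary_lookup`; the path shown is an assumption about where the `.dic` and `.aff` files live.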
Constructed a ROVER-style transcription consensus across 6 distinct ASR models so that models are not unfairly penalized for human transcription typos in the reference.
- Alignment: Stripped punctuation to normalize the alignments, so that evaluation runs purely at the word level.
- Lattice Construction: Aligned the 6 candidate models (`Model H, i, k, l, m, n`) against the human reference baseline using `difflib` sequence alignment. Note that `difflib` matches longest common blocks, which approximates a Levenshtein alignment rather than computing a true edit distance.
- Majority Voting: Evaluated every word slot in the lattice alignment. Replaced the human reference word only if at least 3 of the 6 models agreed on the same different output.
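The lattice and voting steps above can be sketched with `difflib.SequenceMatcher`. This is a simplified illustration under stated assumptions: only equal spans and same-length replacements are mapped to reference slots, insertions and deletions are ignored, and the function names are hypothetical rather than taken from the repository.

```python
import difflib
from collections import Counter


def align_to_reference(reference, hypothesis):
    """For each reference slot, return the hypothesis word aligned to it, or None."""
    slots = [None] * len(reference)
    matcher = difflib.SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only 1:1 spans map cleanly onto reference slots.
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            for offset in range(i2 - i1):
                slots[i1 + offset] = hypothesis[j1 + offset]
    return slots


def consensus_reference(reference, hypotheses, min_votes=3):
    """Replace a reference word when at least min_votes models agree on another word."""
    aligned = [align_to_reference(reference, hyp) for hyp in hypotheses]
    result = []
    for idx, ref_word in enumerate(reference):
        votes = Counter(a[idx] for a in aligned
                        if a[idx] is not None and a[idx] != ref_word)
        if votes:
            word, count = votes.most_common(1)[0]
            if count >= min_votes:
                result.append(word)
                continue
        result.append(ref_word)
    return result


reference = "मैं गर जा रहा हूँ".split()   # "गर" is a human typo for "घर"
correct = "मैं घर जा रहा हूँ".split()
hypotheses = [correct, correct, correct, reference, reference, reference]
print(" ".join(consensus_reference(reference, hypotheses)))
```

Each model's WER can then be recomputed against the consensus reference instead of the original human transcript. If two alternative words both reach the vote threshold, this sketch keeps whichever was counted first, so a real implementation would need an explicit tie-breaking rule.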
| Model | Original WER | Consensus WER | Status |
|---|---|---|---|
| Model H | 2.81 % | 3.18 % | ↑ Matched human typos closely |
| Model i | 0.37 % | 1.47 % | ↑ Matched human typos closely |
| Model k | 8.07 % | 7.33 % | ↓ Improved |
| Model l | 8.68 % | 8.07 % | ↓ Improved |
| Model m | 15.77 % | 14.67 % | ↓ Improved |
| Model n | 9.90 % | 9.05 % | ↓ Improved |
Note: Models with high initial error rates correctly saw score improvements since they were no longer unfairly penalized for disagreeing with human typos.