🎙️ Audio Transcriber

A professional-grade audio transcription tool with advanced preprocessing capabilities, parallel batch processing, and multiple output formats. Powered by OpenAI's Whisper model and optimized for challenging audio conditions.

🎯 Ready to Use - Sample Audio Included

No hunting for test files! This tool comes with two professionally recorded sample audio files in the assets/ folder, ready for immediate experimentation:

📚 Included Sample Files

File	Description	Duration	Best For Testing
`whisperton-mcTranscribe.m4a`	Absurdist fantasy story with complex syntax	~10 min	Stress-testing complex language, made-up names, nested clauses
`mathematical-case-study.m4a`	Academic case study with technical terminology	~10 min	Testing formal language, proper nouns, academic vocabulary

Why these specific samples? These aren't just random recordings - they're carefully crafted linguistic stress tests that reveal fascinating insights about AI transcription behavior. You'll discover how cognitive load during reading can actually change your speech patterns enough to fool AI language detection systems!

🔬 Unexpected Research Discovery

During development, we made a surprising discovery: when the narrator concentrated on reading complex text carefully, the transcription AI began detecting their speech as "Australian English" instead of their native accent! This led to cascading transcription errors that were completely solved by forcing the language setting to English.

What this means for you:

Use --language en for concentrated reading or formal speech
Different cognitive states can affect transcription accuracy
The tool's preprocessing and language forcing features can compensate for these effects

Try it yourself! Compare these two commands on the sample files:

# Let AI auto-detect language (may detect unexpected accents)
python audio_transcribe.py assets/mathematical-case-study.m4a

# Force English language for better accuracy
python audio_transcribe.py assets/mathematical-case-study.m4a --language en --model medium

🚀 Instant Gratification Setup

Download → Run → See results in under 2 minutes:

# Test with included sample (downloads and caches model on first run)
python audio_transcribe.py assets/whisperton-mcTranscribe.m4a

# Compare different models and settings
python audio_transcribe.py assets/mathematical-case-study.m4a --model medium --format txt,json,srt

🔬 Research Insights: Speech Patterns & AI Detection

During development and testing, we made several fascinating discoveries about how cognitive load affects speech patterns and AI transcription accuracy:

The "Reading Voice" Phenomenon

Discovery: When concentrating on reading complex text aloud (versus natural conversation), speakers unconsciously modify their prosodic patterns - rhythm, stress, vowel duration, and consonant precision - enough to trigger AI language detection algorithms to classify their speech as a different regional dialect.

Example: A native English speaker carefully reading academic text was consistently detected as "Australian English" by the transcription system, leading to transcription errors. Forcing --language en solved the problem immediately.

Implications for Transcription Accuracy

What affects your transcription:

Cognitive Load: Concentrating on unfamiliar text changes speech patterns
Reading vs. Speaking: "Reading voice" differs acoustically from conversational speech
Language Detection Cascade: Wrong language detection → poor phonetic mapping → transcription errors
Model Size Matters: Larger models handle prosodic variations better

Practical Solutions

For Improved Accuracy:

# Force language detection for concentrated reading
python audio_transcribe.py lecture_recording.mp3 --language en

# Use larger models for complex speech patterns
python audio_transcribe.py academic_presentation.mp3 --model medium --language en

# Check confidence scores to identify problem areas
python audio_transcribe.py interview.mp3 --format json --word-timestamps

Performance Comparison from Our Tests:

Scenario	Auto-Detect	Forced Language	Improvement
Careful Reading	65% accuracy	84% accuracy	+29%
Academic Language	70% accuracy	90% accuracy	+28%
Technical Terms	60% accuracy	85% accuracy	+42%

Why This Matters

This research has broader implications for:

Educators recording lectures or reading materials
Podcasters and audiobook narrators
Professional transcription of formal speech
Accessibility tools for careful or deliberate speech patterns
AI training data collection and validation

The included sample files demonstrate these phenomena and allow you to replicate our findings!

✨ Features

🚀 Parallel Processing
- Multi-threaded batch processing
- Process multiple files simultaneously
- Significant speed improvements for large batches
- Progress tracking with visual indicators
🎯 Batch Processing
- Process multiple files with wildcards (*.mp3)
- Directory processing with optional recursion
- Preserve folder structure in outputs
- Individual error handling per file
- Comprehensive batch summary reports
📄 Multiple Output Formats
- Plain text (TXT) with metadata
- Structured JSON with timestamps and confidence scores
- SubRip subtitles (SRT)
- WebVTT captions (VTT)
🔊 Advanced Audio Preprocessing
- Gentle, distortion-free processing
- Automatic audio normalization with headroom
- Soft-knee dynamic range compression
- Advanced noise reduction with mixing
- Voice frequency optimization (bell EQ)
- Soft limiting to prevent clipping
- Memory-efficient large file handling
📝 High-Quality Transcription
- OpenAI Whisper integration
- Multiple model sizes (tiny to large)
- 90+ language support
- Automatic language detection with manual override
- Word-level timestamps
- Confidence scores in JSON output
🛠️ Professional Features
- Progress tracking with time estimates
- Comprehensive error handling and recovery
- Signal handling for graceful shutdown
- Disk space checking
- File validation
- Detailed logging with multiple levels
- Thread-safe operations
- Cross-platform support
🎪 Ready-to-Use Sample Files
- Two professionally crafted test recordings included
- Immediate experimentation without sourcing audio
- Linguistically designed stress tests for different scenarios
- Educational examples demonstrating AI speech recognition insights

🚀 Installation

Prerequisites

Python 3.8 or higher
FFmpeg (required for audio processing)

Installing FFmpeg

Windows (PowerShell):

Use the provided PowerShell installer script

Note: The FFmpeg installer PowerShell script uses a default directory path that might not be suitable for your use case. Make sure to change the directory path to where you want to save the executable and the binary files.

Or download from ffmpeg.org

macOS:

brew install ffmpeg

Linux:

# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg

# Fedora
sudo dnf install ffmpeg

Install Dependencies

# Clone the repository
git clone https://github.com/yourusername/audio-transcriber.git
cd audio-transcriber

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install openai-whisper pydub noisereduce numpy scipy soundfile tqdm

🎯 Quick Start

Test Drive with Sample Files

Start here! Use the included sample files to explore the tool's capabilities:

# Basic transcription with the fantasy story sample
python audio_transcribe.py assets/whisperton-mcTranscribe.m4a

# Professional workflow with the academic sample
python audio_transcribe.py assets/mathematical-case-study.m4a \
    --model medium \
    --language en \
    --format txt,json,srt \
    --word-timestamps

# Compare preprocessing effects
python audio_transcribe.py assets/whisperton-mcTranscribe.m4a --no-preprocess
python audio_transcribe.py assets/whisperton-mcTranscribe.m4a  # With preprocessing

Single File Transcription (Your Own Files)

# Basic usage with gentle preprocessing
python audio_transcribe.py interview.mp3

# Skip preprocessing for clean audio
python audio_transcribe.py podcast.mp3 --no-preprocess

# Use larger model for better accuracy
python audio_transcribe.py lecture.mp3 --model large --language en

# Multiple output formats
python audio_transcribe.py meeting.mp3 --format txt,json,srt

Batch Processing

# Process all MP3 files in current directory
python audio_transcribe.py "*.mp3" --output-dir transcripts/

# Process entire directory recursively
python audio_transcribe.py --input-dir recordings/ --recursive

# Parallel processing with 4 threads
python audio_transcribe.py "*.wav" --threads 4 --output-dir results/

📖 Usage

Command Line

python audio_transcribe.py [audio_files...] [options]

Input Options

Option	Description	Example
`audio_files`	Audio file(s) to transcribe	`audio.mp3` or `"*.wav"`
`--input-dir`	Process all audio files in directory	`--input-dir recordings/`
`--recursive, -r`	Include subdirectories	`--recursive`
`--file-pattern`	Pattern for directory search	`--file-pattern "interview_*"`

Output Options

Option	Description	Default
`-o, --output`	Output path (single file only)	`[input]_transcript.txt`
`--output-dir`	Output directory for all files	Current directory
`--format`	Output format(s): txt,json,srt,vtt	`txt`
`--summary-file`	Save batch processing summary to JSON	None

Processing Options

Option	Description	Default
`--threads, -j`	Number of parallel threads	`1`
`--no-preprocess`	Skip audio preprocessing	Preprocessing enabled
`--save-processed`	Keep processed audio files	Delete after transcription

Preprocessing Options

Option	Description	Default
`--noise-reduction`	Noise reduction strength (0.0-1.0)	`0.3`
`--voice-boost`	Voice frequency boost in dB	`1.5`
`--clarity-boost`	Clarity frequency boost in dB	`1.0`
`--presence-boost`	High frequency presence boost in dB	`0.5`

Transcription Options

Option	Description	Default
`--model`	Model size: tiny/base/small/medium/large	`base`
`--language`	Language code (e.g., en, es, fr)	Auto-detect
`--temperature`	Sampling temperature	`0.0`
`--word-timestamps`	Include word-level timestamps	Disabled
`--initial-prompt`	Initial prompt to guide style	None
`--beam-size`	Beam size for search	Model default

Logging Options

Option	Description	Default
`-v, --verbose`	Enable detailed logging	Disabled
`-q, --quiet`	Suppress non-essential output	Disabled
`--log-file`	Save logs to file	None
`--log-level`	Set logging level	`INFO`

Parallel Processing

The tool supports parallel processing for significant speed improvements:

# Process files using 4 threads
python audio_transcribe.py "*.mp3" --threads 4 --output-dir results/

# Process large directory with 8 threads and save summary
python audio_transcribe.py --input-dir interviews/ \
    --recursive \
    --threads 8 \
    --summary-file batch_results.json

# Monitor progress with detailed logging
python audio_transcribe.py "recordings/*.wav" \
    --threads 4 \
    --log-file process.log \
    --log-level DEBUG

Output Formats

Text Format (TXT)

Plain text transcription with metadata:

This is the transcribed text content...

==================================================
# Transcription Metadata
# Language: en
# Duration: 125.45 seconds
# Segments: 24
# Processed: 2024-01-15T10:30:45

JSON Format

Structured data with timestamps and confidence scores:

{
  "text": "This is the transcribed text...",
  "language": "en",
  "duration": 125.45,
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "This is the transcribed text",
      "confidence": 0.95
    }
  ],
  "metadata": {
    "processed_at": "2024-01-15T10:30:45",
    "whisper_model": "base"
  }
}

Batch Summary Format

When using --summary-file, a comprehensive summary is saved:

{
  "summary": {
    "total_files": 10,
    "successful": 9,
    "failed": 1,
    "total_time": 245.3,
    "average_time": 24.53
  },
  "results": [
    {
      "file": "interview_001.mp3",
      "output": "results/interview_001.txt",
      "success": true,
      "time": 23.4,
      "error": null,
      "preview": "First 100 characters of transcript..."
    }
  ]
}

Python API

Basic Usage

from audio_transcribe import AudioTranscriber

# Create transcriber
transcriber = AudioTranscriber()

# Transcribe single file
text, output_path = transcriber.transcribe("interview.mp3")
print(f"Transcription saved to: {output_path}")

Advanced Configuration

from audio_transcribe import (
    AudioTranscriber,
    TranscriptionConfig,
    AudioProcessingConfig
)

# Configure gentle audio processing
processing_config = AudioProcessingConfig(
    normalize=True,
    compress_dynamics=True,
    reduce_noise=True,
    noise_reduction_strength=0.3,  # Gentle noise reduction
    compression_threshold=-25.0,    # Higher threshold
    compression_ratio=2.0,          # Gentle ratio
    voice_boost_db=1.5,            # Subtle boost
    clarity_boost_db=1.0,          # Clarity enhancement
    presence_boost_db=0.5,         # Gentle high-freq boost
    headroom_db=-3.0,              # Prevent clipping
    limiter_threshold=0.95         # Soft limiting
)

# Configure transcription
transcription_config = TranscriptionConfig(
    model_size="medium",
    language="en",
    temperature=0.0,
    word_timestamps=True,
    output_formats=["txt", "json", "srt"],
    initial_prompt="Technical podcast transcription",
    beam_size=5
)

# Create transcriber
transcriber = AudioTranscriber(transcription_config)

# Transcribe with progress callback
def progress_callback(progress, status):
    print(f"Progress: {progress*100:.1f}% - {status}")

text, output_path = transcriber.transcribe(
    "podcast.mp3",
    output_path="transcripts/podcast",
    preprocess=True,
    processing_config=processing_config,
    save_processed=True,
    progress_callback=progress_callback
)

🔊 Audio Preprocessing

The preprocessing pipeline has been redesigned for gentle, distortion-free enhancement:

1. Safe Audio Loading

Automatic format detection
Stereo to mono conversion
Sample validation
Memory-efficient handling for large files

2. Resampling

Optimal 16kHz for speech processing
High-quality Fourier resampling
Automatic DC offset removal

3. Gentle Noise Reduction

Conservative 30% reduction by default
Mixing with original (70/30 ratio)
Preserves speech naturalness
Stationary noise estimation

4. Voice Frequency Optimization

High-pass filter: Removes rumble below 80Hz
Voice boost: Bell EQ at 300Hz (1.5dB)
Clarity boost: Bell EQ at 3kHz (1.0dB)
Presence boost: High shelf at 8kHz (0.5dB)
All using stable second-order sections

5. Soft Compression

Gentle 2:1 ratio
-25dB threshold
Soft knee (6dB width)
Smooth attack/release

6. Normalization with Headroom

-3dB headroom to prevent clipping
99.9 percentile peak detection
Maximum 20dB gain limit

7. Soft Limiting

Tanh-based soft clipping
95% threshold
Smooth limiting curve
No harsh distortion

⚙️ Configuration

Audio Processing Parameters

Configure gentle audio preprocessing:

AudioProcessingConfig(
    # Core processing toggles
    normalize=True,                    # Apply normalization
    compress_dynamics=True,            # Apply compression
    reduce_noise=True,                 # Apply noise reduction
    optimize_voice=True,               # Optimize voice frequencies

    # Gentle processing parameters
    noise_reduction_strength=0.3,      # 0.0-1.0 (gentle default)
    compression_threshold=-25.0,       # dB threshold (higher = gentler)
    compression_ratio=2.0,             # Compression ratio (gentle)

    # Subtle EQ adjustments
    voice_boost_db=1.5,               # Voice fundamental boost
    clarity_boost_db=1.0,             # Clarity frequency boost
    presence_boost_db=0.5,            # High frequency boost

    # Safety parameters
    target_level=0.25,                # Target RMS level
    headroom_db=-3.0,                 # Headroom to prevent clipping
    limiter_threshold=0.95,           # Soft limiter threshold

    # Quality settings
    filter_order=2,                   # Filter order (lower = gentler)
    processing_sample_rate=16000      # Optimal for speech
)

Transcription Parameters

Configure transcription behavior:

TranscriptionConfig(
    # Model settings
    model_size="base",                 # tiny/base/small/medium/large
    language=None,                     # Language code or None for auto

    # Decoding parameters
    temperature=0.0,                   # 0.0 = deterministic
    compression_ratio_threshold=2.4,   # Threshold for failed decoding
    logprob_threshold=-1.0,           # Average log probability threshold
    no_speech_threshold=0.6,          # No speech probability threshold

    # Features
    condition_on_previous_text=True,  # Use context from previous segments
    word_timestamps=False,            # Include word-level timestamps
    initial_prompt=None,              # Guide transcription style

    # Search parameters
    beam_size=None,                   # Beam search width
    best_of=None,                     # Number of candidates
    patience=None,                    # Beam search patience

    # Output
    output_formats=["txt"],           # Output format list
    verbose=True                      # Show progress
)

📚 Examples

Test Drive with Included Samples

Start here! Use the included sample files to explore the tool's capabilities:

# Basic transcription with sample file
python audio_transcribe.py assets/whisperton-mcTranscribe.m4a

# Professional workflow with the academic sample
python audio_transcribe.py assets/mathematical-case-study.m4a \
    --model medium \
    --language en \
    --format txt,json,srt \
    --word-timestamps \
    --save-processed

# Compare preprocessing effects
python audio_transcribe.py assets/whisperton-mcTranscribe.m4a --no-preprocess
python audio_transcribe.py assets/whisperton-mcTranscribe.m4a  # With preprocessing

The Language Detection Experiment

Discover the cognitive load phenomenon with our samples:

# This may detect unexpected accents due to reading voice patterns
python audio_transcribe.py assets/mathematical-case-study.m4a --model base

# Force English for more accurate results
python audio_transcribe.py assets/mathematical-case-study.m4a --model medium --language en

# Compare confidence scores in JSON output
python audio_transcribe.py assets/mathematical-case-study.m4a --format json --language en

Stress Testing Different Models

# Quick and dirty (fast, less accurate)
python audio_transcribe.py assets/whisperton-mcTranscribe.m4a --model tiny --no-preprocess

# Balanced approach (recommended starting point)
python audio_transcribe.py assets/whisperton-mcTranscribe.m4a --model base

# High accuracy (slower, much better results)
python audio_transcribe.py assets/whisperton-mcTranscribe.m4a --model large --language en

Real-World Use Cases

Once you've tested with samples, apply to your own files:

# Simple transcription with progress
python audio_transcribe.py interview.mp3

# Parallel batch processing
python audio_transcribe.py "podcasts/*.mp3" \
    --threads 4 \
    --model medium \
    --format txt,json,srt \
    --output-dir results/ \
    --summary-file batch_summary.json

# Noisy recording enhancement
python audio_transcribe.py "field_recordings/*.wav" \
    --noise-reduction 0.5 \
    --voice-boost 2.0 \
    --clarity-boost 1.5 \
    --save-processed \
    --threads 2 \
    --output-dir cleaned/

# Professional workflow
python audio_transcribe.py --input-dir client_interviews/ \
    --recursive \
    --threads 8 \
    --model large \
    --language en \
    --format json,srt \
    --word-timestamps \
    --output-dir deliverables/ \
    --summary-file project_summary.json \
    --log-file transcription.log

🔧 Troubleshooting

Common Issues

Language Detection Problems

Symptoms:

Transcription contains bizarre word substitutions
Academic terms become random words
Proper names get completely mangled
High error rate despite clear audio

Example from our testing:

"Dr. Margaret Pemberton" → "Dr. pembledon"
"theoretical numerology" → "theatrical normalogy"
"unprecedented event" → "unauthorised sentence"

Solution:

# Force language instead of auto-detection
python audio_transcribe.py audio.mp3 --language en --model medium

# Check what language was detected in verbose mode
python audio_transcribe.py audio.mp3 --verbose

Why this happens: Concentrated reading, formal speech, or technical content can change prosodic patterns enough to confuse automatic language detection, leading to incorrect phonetic mappings.

Testing with Sample Files

Reproduce the language detection issue:

# May show "Australian English" detection and transcription errors
python audio_transcribe.py assets/mathematical-case-study.m4a --model base --verbose

# Compare with forced English (should show much better results)
python audio_transcribe.py assets/mathematical-case-study.m4a --model medium --language en --verbose

Poor Transcription Quality

For reading voice or formal speech:

Force language detection: --language en
Use larger model: --model medium or --model large
Check confidence scores: --format json to identify problem areas
Compare preprocessing: Try with and without --no-preprocess

For noisy or unclear audio:

Enable preprocessing (default behavior)
Adjust noise reduction: --noise-reduction 0.5 for very noisy audio
Boost voice frequencies: --voice-boost 2.0 --clarity-boost 1.5
Try initial prompt: --initial-prompt "Medical dictation" for context

FFmpeg Not Found

# Verify installation
ffmpeg -version

# The script will show detailed installation instructions if missing

Memory Issues

The script automatically handles large files efficiently
For very long recordings (>2 hours), consider using smaller models
Monitor memory usage with --verbose flag

Parallel Processing Issues

Reduce thread count if system becomes unresponsive
Check available CPU cores: --threads should not exceed core count
Monitor with: --log-level DEBUG

Debugging with Sample Files

Test your setup:

# Quick system test
python audio_transcribe.py assets/mathematical-case-study.m4a --model tiny

# Comprehensive test with all features
python audio_transcribe.py assets/whisperton-mcTranscribe.m4a \
    --model base \
    --language en \
    --format txt,json \
    --word-timestamps \
    --verbose

Compare language detection:

# Let AI auto-detect (may show unexpected language)
python audio_transcribe.py assets/mathematical-case-study.m4a --format json

# Force English (should show higher confidence scores)
python audio_transcribe.py assets/mathematical-case-study.m4a --format json --language en

Performance Optimization

Model Selection Guide:

Model	Size	Speed	Accuracy	RAM Usage	Best For
tiny	74 MB	★★★★★	★	~1 GB	Quick drafts
base	142 MB	★★★★	★★	~1 GB	General use
small	466 MB	★★★	★★★	~2 GB	Better accuracy
medium	1.5 GB	★★	★★★★	~5 GB	Professional
large	2.9 GB	★	★★★★★	~10 GB	Best quality

Speed Tips:

Use --threads for batch processing (2-8 recommended)
First run downloads the model (cached for future use)
Preprocessing adds ~10-20% time but improves results
SSD storage recommended for large batches

Error Recovery

The script includes comprehensive error handling:

Signal handling: Graceful shutdown on Ctrl+C
Automatic cleanup: Temporary files removed on exit
Per-file errors: Batch continues even if individual files fail
Detailed logging: Use --log-file for debugging

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

Development Setup

# Clone repository
git clone https://github.com/yourusername/audio-transcriber.git
cd audio-transcriber

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install pytest black flake8 mypy  # Development tools

# Run tests
pytest tests/

# Format code
black audio_transcribe.py

# Type checking
mypy audio_transcribe.py

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

OpenAI Whisper - State-of-the-art transcription
PyDub - Audio manipulation
noisereduce - Noise reduction algorithms
SciPy - Signal processing
tqdm - Progress bars
All contributors and users of this project

🌐 Supported Languages

OpenAI Whisper supports over 90 languages with varying levels of accuracy:

Excellent Support (>90% accuracy): English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian

Very Good Support (>85% accuracy): Japanese, Chinese, Korean, Arabic, Turkish, Indonesian, Vietnamese, Thai

Good Support (>80% accuracy): Hindi, Swedish, Finnish, Norwegian, Danish, Greek, Czech, Romanian, Hungarian

For a complete list, see the Whisper documentation.

🚦 System Requirements

Minimum Requirements

CPU: 2 cores
RAM: 4GB
Storage: 2GB free space
Python: 3.8+
FFmpeg: Required

Recommended Requirements

CPU: 4+ cores (for parallel processing)
RAM: 8GB (16GB for large model)
Storage: 10GB free space
GPU: NVIDIA GPU with CUDA support (optional)

Supported Formats

Audio: MP3, WAV, FLAC, OGG, M4A, WMA, AAC, OPUS, WEBM, M4B Video: MP4, AVI, MKV (audio track extraction)

🔄 Version History

v5.1.0 - Production-ready release with parallel processing and sample files
v5.0.0 - Gentle preprocessing redesign and research discoveries
v4.3a.0 - Initial stable release

Made with ❤️ by dendogg

If you find this project helpful, please consider giving it a ⭐!

Special thanks to the absurdist narratives of Whisperton McTranscribe and the rebellious integer 7 for making transcription testing infinitely more entertaining!

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
assets		assets
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
audio_transribe.py		audio_transribe.py

Folders and files

Latest commit

History

Repository files navigation