CLI Reference

All commands follow the pattern mazinger <command> [options]. Run mazinger <command> --help for the built-in help text.

Global Options

These options are available on every command:

| Flag | Default | Description |
|---|---|---|
| --base-dir | ./mazinger_output | Root directory for project output |
| --verbose | off | Enable debug logging |

dub

Run the full pipeline: download, transcribe, translate, synthesize, assemble.

mazinger dub <source> [options]

Positional arguments:

| Argument | Description |
|---|---|
| source | Video URL, local video file, or local audio file (required) |

Options:

| Flag | Default | Description |
|---|---|---|
| --slug | auto-generated | Project slug (directory name) |
| --quality | best available | Video quality: low, medium, high, or numeric height (e.g., 1080) |
| --cookies-from-browser | | Browser name for yt-dlp cookie extraction |
| --cookies | | Path to a Netscape cookies.txt file |
| --clone-profile | | Voice profile name from HuggingFace or local directory path |
| --voice-theme | | Pre-defined voice theme (e.g. narrator-m, warm-f). See mazinger profile list |
| --voice-sample | | Path to reference voice audio file |
| --voice-script | | Path to transcript of the voice sample (or inline text) |
| --transcribe-method | faster-whisper | openai, faster-whisper, whisperx, mlx-whisper, or deepgram |
| --whisper-model | varies by method | Whisper/Deepgram model name |
| --mlx-whisper-model | mlx-community/whisper-large-v3-turbo | MLX Whisper model name |
| --beam-size | | Beam size for decoding (faster-whisper/whisperx) |
| --deepgram-api-key | $DEEPGRAM_API_KEY | Deepgram API key (required for --transcribe-method deepgram) |
| --device | auto | auto, cuda, or cpu |
| --source-language | auto | Source language for translation (or auto to detect) |
| --target-language | English | Target language for translation |
| --words-per-second | 2.0 | Speech rate used for duration-aware word budgets |
| --duration-budget | 0.80 | Fraction of available time for dubbed speech |
| --translate-technical-terms | off | Translate technical terms instead of keeping them in English |
| --asr-review | off | Review ASR transcript with LLM to fix typos and punctuation |
| --keep-technical-english | off | Convert technical terms to English in the source transcript (requires --asr-review) |
| --youtube-subs | off | Download YouTube subtitles and compare with ASR to pick the best source |
| --tts-engine | qwen | qwen, chatterbox, or mlx |
| --tts-model | Qwen/Qwen3-TTS-12Hz-1.7B-Base | Qwen model ID |
| --mlx-tts-model | mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 | MLX TTS model name |
| --chatterbox-model | ResembleAI/chatterbox | Chatterbox model ID |
| --tts-language | same as --target-language | Language hint for TTS |
| --chatterbox-exaggeration | 0.5 | Emotion intensity (0.0–1.0) |
| --chatterbox-cfg | 0.5 | Pacing control (0.0–1.0) |
| --use-resegmented | off | Use resegmented SRT instead of raw transcription |
| --output-type | audio | audio (WAV only) or video (muxed MP4) |
| --embed-subtitles | off | Burn subtitles into output video (implies --output-type video) |
| --subtitle-source | translated | translated, original, or path to a custom SRT file |
| --dynamic-tempo | off | Per-segment speed matching |
| --fixed-tempo | | Constant speed multiplier (e.g., 1.1) |
| --max-tempo | 1.3 | Maximum speed-up for dynamic/auto tempo |
| --no-loudness-match | off | Skip loudness normalisation against the original audio |
| --no-mix-background | off | Skip mixing background audio from the original |
| --background-volume | 0.15 | Background audio mix level (0.0–1.0) |
| --start | | Start timestamp for slicing (e.g. 00:01:30 or 90) |
| --end | | End timestamp for slicing (e.g. 00:05:00 or 300) |
| --force-reset | off | Discard all cached outputs and re-run from scratch |
| --openai-api-key | $OPENAI_API_KEY | OpenAI API key |
| --openai-base-url | $OPENAI_BASE_URL | Custom API base URL |
| --llm-model | gpt-4.1 | LLM model for translation/analysis |
| --llm-think / --no-llm-think | | Enable/disable LLM thinking mode (use --no-llm-think for Ollama Qwen3) |

All --subtitle-* styling flags are also accepted. See Subtitle Styling.

Examples:

# Auto-clone the speaker's voice (simplest — no voice flags needed)
mazinger dub "https://youtube.com/watch?v=abc123" \
    --target-language Spanish

# Dub with a voice theme (easiest)
mazinger dub "https://youtube.com/watch?v=abc123" \
    --voice-theme narrator-m --target-language Spanish

# Basic dub with a profile
mazinger dub "https://youtube.com/watch?v=abc123" \
    --clone-profile abubakr --target-language Arabic

# Dub with Chatterbox, video output, subtitles
mazinger dub ./lecture.mp4 \
    --voice-sample speaker.m4a \
    --tts-engine chatterbox \
    --output-type video \
    --embed-subtitles \
    --target-language Spanish

# Dub only a portion of the video
mazinger dub "https://youtube.com/watch?v=abc123" \
    --clone-profile abubakr --target-language Arabic \
    --start 00:01:30 --end 00:05:00

# Local transcription, dynamic tempo
mazinger dub "https://youtube.com/watch?v=abc123" \
    --clone-profile abubakr \
    --transcribe-method faster-whisper \
    --dynamic-tempo --max-tempo 1.3
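
To see how the stages fit together, the dry-run sketch below prints individual commands that roughly correspond to the transcribe, translate, synthesize, and assemble stages of dub. It only echoes the commands and never runs mazinger. The file names (audio.mp3, subs.srt, translated.srt, dubbed.wav, output.mp4) and the show and staged_dub helpers are assumptions made for illustration; this reference does not state that the staged commands are exactly equivalent to a single dub run.

```shell
# Dry-run sketch: print (do not execute) one command per pipeline stage.
# All file names here are illustrative assumptions.

# Hypothetical helper: echo a command line instead of running it.
show() {
    printf '%s\n' "$*"
}

# $1 = local video file, $2 = its audio track, $3 = target language
staged_dub() {
    show mazinger transcribe --audio "$2" -o subs.srt
    show mazinger translate --srt subs.srt --target-language "$3" -o translated.srt
    show mazinger speak --srt translated.srt --original-audio "$2" --voice-theme narrator-m -o dubbed.wav
    show mazinger subtitle "$1" --srt translated.srt --audio dubbed.wav -o output.mp4
}

staged_dub lecture.mp4 audio.mp3 Spanish
```

Running stages separately like this may be useful when one intermediate file, such as the translated SRT, needs manual review before synthesis.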

download

Download a video and extract its audio track.

mazinger download <source> [options]

| Flag | Default | Description |
|---|---|---|
| --slug | auto-generated | Project slug |
| --quality | best | Video quality |
| --cookies-from-browser | | Browser name for yt-dlp cookies |
| --cookies | | Path to cookies.txt |
| --start | | Start timestamp for slicing (e.g. 00:01:30 or 90) |
| --end | | End timestamp for slicing (e.g. 00:05:00 or 300) |

Example:

mazinger download "https://youtube.com/watch?v=abc123" --base-dir ./output --quality 720

# Download and extract only a segment
mazinger download "https://youtube.com/watch?v=abc123" --start 00:02:00 --end 00:04:00

slice

Extract a time range from a video or audio file.

mazinger slice <source> [options]

| Flag | Default | Description |
|---|---|---|
| --slug | auto-generated | Project slug |
| --quality | best | Video quality |
| --cookies-from-browser | | Browser name for yt-dlp cookies |
| --cookies | | Path to cookies.txt |
| --start | | Start timestamp (e.g. 00:01:30 or 90) |
| --end | | End timestamp (e.g. 00:05:00 or 300) |

Examples:

# Extract a 3-minute clip from a YouTube video
mazinger slice "https://youtube.com/watch?v=abc123" --start 00:01:00 --end 00:04:00

# Extract from a local file
mazinger slice ./lecture.mp4 --start 90 --end 300
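
Since --start and --end accept either a clock-style timestamp or plain seconds, it can help to normalise both forms before calling slice, for example to confirm that the start precedes the end. The sketch below is one way to do that in a wrapper script. The to_seconds and check_range helpers are hypothetical, and how mazinger itself parses timestamps is not described in this reference.

```shell
# Convert HH:MM:SS (or plain seconds) to seconds.
# Each colon-separated field is folded in base 60, left to right.
to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; print s }'
}

# Succeed only when the start timestamp is strictly before the end.
# Integer comparison, so this sketch assumes whole-second timestamps.
check_range() {
    [ "$(to_seconds "$1")" -lt "$(to_seconds "$2")" ]
}

if check_range 00:01:30 00:05:00; then
    echo "clip length: $(( $(to_seconds 00:05:00) - $(to_seconds 00:01:30) ))s"
else
    echo "start must be before end" >&2
fi
```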

transcribe

Convert audio to SRT subtitles.

mazinger transcribe [source] [options]

If source is provided, the video is downloaded first and its audio is transcribed. Otherwise, use --audio to point to an existing audio file.

| Flag | Default | Description |
|---|---|---|
| --audio | | Path to audio file (overrides source) |
| -o, --output | | Output SRT path |
| --method | faster-whisper | openai, faster-whisper, whisperx, mlx-whisper, or deepgram |
| --model | varies | Model name (whisper-1 for OpenAI, large-v3 for local, nova-3 for Deepgram) |
| --device | auto | auto, cuda, cpu |
| --batch-size | 16 | Batch size for local transcription |
| --compute-type | float16 | Weight precision: float16, int8, int8_float16 |
| --beam-size | 5 | Beam size for decoding |
| --language | auto-detect | Force a language code (e.g., en, ar, fr) |
| --initial-prompt | | Initial text to condition Whisper (e.g., domain terms, video title) |
| --no-condition-on-previous-text | off | Disable conditioning on previous segment text |
| --max-chars | 84 | Max characters per subtitle entry |
| --max-duration | 5.0 | Max seconds per subtitle entry |
| --no-resegment | off | Skip the post-transcription resegmentation step |
| --refine | off | Use LLM to add punctuation and fix misheard words |
| --asr-review | off | Review transcript with LLM: fix typos, punctuation, and optionally normalise technical terms |
| --keep-technical-english | off | Convert technical terms to English (requires --asr-review) |
| --llm-model | gpt-4.1 | LLM model for refinement |
| --llm-think / --no-llm-think | | Enable/disable LLM thinking mode |
| --openai-api-key | $OPENAI_API_KEY | OpenAI API key (for cloud method) |
| --deepgram-api-key | $DEEPGRAM_API_KEY | Deepgram API key (for --method deepgram) |

Examples:

# Cloud transcription (OpenAI)
mazinger transcribe --audio recording.mp3 -o subs.srt --method openai

# Cloud transcription (Deepgram Nova 3 — free $200 credit, no card)
mazinger transcribe --audio recording.mp3 -o subs.srt \
    --method deepgram --language ar

# Local with faster-whisper on GPU
mazinger transcribe --audio recording.mp3 -o subs.srt \
    --method faster-whisper --device cuda

# From a URL, auto-download first
mazinger transcribe "https://youtube.com/watch?v=abc123" --base-dir ./output
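
Because the cloud methods read their keys from $OPENAI_API_KEY and $DEEPGRAM_API_KEY, a wrapper script can choose a method based on which key is present. The sketch below shows one possible policy. The pick_method helper and its preference order (Deepgram, then OpenAI, then local faster-whisper) are assumptions, not mazinger behaviour, and the final line only prints the command it would run.

```shell
# Hypothetical helper: choose a --method value from the available API keys.
# The preference order is an arbitrary choice for this sketch.
pick_method() {
    if [ -n "${DEEPGRAM_API_KEY:-}" ]; then
        echo deepgram
    elif [ -n "${OPENAI_API_KEY:-}" ]; then
        echo openai
    else
        # No cloud key set: fall back to local transcription.
        echo faster-whisper
    fi
}

# Print the resulting command rather than executing it.
echo "mazinger transcribe --audio recording.mp3 -o subs.srt --method $(pick_method)"
```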

thumbnails

Extract LLM-selected key frames from a video.

mazinger thumbnails [source] [options]

| Flag | Default | Description |
|---|---|---|
| --video | | Path to video file |
| --srt | | Path to SRT file |
| --output-dir | | Output directory for thumbnails |
| --meta | | Path to save metadata JSON |
| --openai-api-key | $OPENAI_API_KEY | OpenAI API key |
| --llm-model | gpt-4.1 | LLM model |
| --llm-think / --no-llm-think | | Enable/disable LLM thinking mode |
| --transcribe-method | openai | Transcription method (if SRT not provided) |
| --whisper-model | varies | Whisper model |

Example:

mazinger thumbnails --video video.mp4 --srt subs.srt --output-dir ./thumbs

describe

Generate a structured content analysis (title, summary, key points, keywords).

mazinger describe [source] [options]

| Flag | Default | Description |
|---|---|---|
| --srt | | Path to SRT file |
| --thumbnails-meta | | Path to thumbnails meta.json |
| -o, --output | | Output JSON path |
| --openai-api-key | $OPENAI_API_KEY | OpenAI API key |
| --llm-model | gpt-4.1 | LLM model |
| --llm-think / --no-llm-think | | Enable/disable LLM thinking mode |

Example:

mazinger describe --srt subs.srt --thumbnails-meta ./thumbs/meta.json -o description.json

translate

Translate SRT subtitles into another language.

mazinger translate [source] [options]

If source is provided, the video is downloaded, transcribed, and translated automatically.

| Flag | Default | Description |
|---|---|---|
| --srt | | Path to source SRT (overrides auto-transcription) |
| --description | | Path to description JSON |
| --thumbnails-meta | | Path to thumbnails meta.json |
| -o, --output | | Output SRT path |
| --source-language | auto | Source language |
| --target-language | English | Target language |
| --words-per-second | 2.0 | Speech rate for word budget calculation |
| --duration-budget | 0.80 | Fraction of time allocated for dubbed speech |
| --translate-technical-terms | off | Translate technical terms |
| --video | | Source video for subtitle embedding |
| --video-output | | Output video path (when embedding subtitles) |
| --embed-subtitles | off | Burn translated subtitles into video |
| --openai-api-key | $OPENAI_API_KEY | OpenAI API key |
| --llm-model | gpt-4.1 | LLM model |
| --llm-think / --no-llm-think | | Enable/disable LLM thinking mode |
| --transcribe-method | openai | Transcription method (if SRT not provided) |

All --subtitle-* styling flags are accepted when --embed-subtitles is set.
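
The --words-per-second and --duration-budget options together imply a per-segment word limit for the translation. The exact formula mazinger uses is not given here. The sketch below assumes the simplest reading: segment duration times speech rate times budget fraction, rounded down. The word_budget helper is hypothetical.

```shell
# Hypothetical helper: estimate how many translated words fit in a segment.
# Assumed formula (not confirmed by this reference):
#   floor(duration * words_per_second * duration_budget)
# $1 = segment duration in seconds
# $2 = words per second   (defaults to 2.0, the documented default)
# $3 = duration budget    (defaults to 0.80, the documented default)
word_budget() {
    awk -v d="$1" -v wps="${2:-2.0}" -v frac="${3:-0.80}" 'BEGIN {
        # The tiny epsilon guards against floating-point results such as 7.9999999
        printf "%d\n", d * wps * frac + 1e-9
    }'
}

word_budget 5            # 5 s at the defaults: prints 8
word_budget 3.5 2.0 0.80 # 5.6 words, rounded down: prints 5
```

Under this reading, lowering --duration-budget leaves more slack per segment at the cost of a tighter translation.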

Examples:

# Translate an existing SRT
mazinger translate --srt subs.srt --target-language French -o translated.srt

# Full auto: download, transcribe, translate, burn subtitles
mazinger translate "https://youtube.com/watch?v=abc123" \
    --target-language Arabic \
    --embed-subtitles \
    --subtitle-google-font "Noto Sans Arabic"

resegment

Re-segment subtitles for readability by merging fragments and splitting long entries.

mazinger resegment [options]

| Flag | Default | Description |
|---|---|---|
| --srt | | Path to input SRT (required) |
| -o, --output | | Output SRT path (required) |
| --max-chars | 84 | Max characters per subtitle entry |
| --max-dur | 4.0 | Max seconds per subtitle entry |
| --openai-api-key | $OPENAI_API_KEY | OpenAI API key (optional; falls back to rules) |
| --llm-model | gpt-4.1 | LLM model |
| --llm-think / --no-llm-think | | Enable/disable LLM thinking mode |

Example:

mazinger resegment --srt translated.srt -o final.srt --max-chars 80 --max-dur 5.0
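
To judge whether an SRT file needs resegmenting at all, it can be checked against a character limit first. The sketch below lists entries whose text exceeds the limit, using awk in paragraph mode. The long_entries helper is hypothetical, and counting the joined text of all lines in an entry is an assumption; mazinger may measure length differently.

```shell
# Hypothetical helper: report SRT entries whose text exceeds a character limit.
# $1 = SRT path, $2 = max characters (defaults to 84, the documented --max-chars default)
# Note: awk length() may count bytes rather than characters in some locales.
long_entries() {
    awk -v max="${2:-84}" 'BEGIN { RS = ""; FS = "\n" }
    {
        # Field 1 is the index, field 2 the timing line, the rest is text.
        text = $3
        for (i = 4; i <= NF; i++) text = text " " $i
        if (length(text) > max) print $1 ": " length(text) " chars"
    }' "$1"
}

# Small sample file to demonstrate the check.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
1
00:00:00,000 --> 00:00:02,000
Short line.

2
00:00:02,000 --> 00:00:09,000
This subtitle entry is deliberately long so that it goes past the limit used here.
EOF

long_entries "$sample" 40   # only entry 2 is reported
rm -f "$sample"
```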

speak

Synthesize dubbed audio from an SRT file using voice-cloned TTS.

mazinger speak [source] [options]

| Flag | Default | Description |
|---|---|---|
| --srt | | Path to translated SRT |
| --original-audio | | Original audio file (for duration matching) |
| --clone-profile | | Voice profile name from HuggingFace or local directory path |
| --voice-theme | | Pre-defined voice theme (e.g. narrator-m, warm-f) |
| --voice-sample | | Path to reference voice audio |
| --voice-script | | Path to transcript of voice sample |
| -o, --output | | Output WAV path |
| --segments-dir | | Directory for individual segment WAVs |
| --tts-engine | qwen | qwen, chatterbox, or mlx |
| --tts-model | Qwen/Qwen3-TTS-12Hz-1.7B-Base | Qwen model ID |
| --mlx-tts-model | mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 | MLX TTS model name |
| --chatterbox-model | ResembleAI/chatterbox | Chatterbox model ID |
| --tts-language | | Language hint for TTS |
| --chatterbox-exaggeration | 0.5 | Emotion intensity (0.0–1.0) |
| --chatterbox-cfg | 0.5 | Pacing control (0.0–1.0) |
| --device | auto | auto, cuda, cpu |
| --dtype | bfloat16 | Weight dtype for Qwen: bfloat16, float16, float32 |
| --dynamic-tempo | off | Per-segment tempo matching |
| --fixed-tempo | | Constant speed multiplier |
| --max-tempo | 1.3 | Maximum speed-up ratio |
| --force-reset | off | Re-synthesize all segments from scratch |

Examples:

# With a voice theme
mazinger speak --srt translated.srt --original-audio audio.mp3 \
    --voice-theme warm-f -o dubbed.wav

# Qwen with a profile
mazinger speak --srt translated.srt --original-audio audio.mp3 \
    --clone-profile abubakr -o dubbed.wav

# Chatterbox with emotion
mazinger speak --srt translated.srt --original-audio audio.mp3 \
    --voice-sample speaker.m4a \
    --tts-engine chatterbox \
    --chatterbox-exaggeration 0.7 --chatterbox-cfg 0.3 \
    -o dubbed.wav

# Fixed tempo speed-up
mazinger speak --srt translated.srt --original-audio audio.mp3 \
    --clone-profile abubakr --fixed-tempo 1.1 -o dubbed.wav
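
The interaction between --dynamic-tempo and --max-tempo can be pictured as a clamp on a per-segment speed ratio. The sketch below assumes the ratio is the synthesized duration divided by the available slot, never below 1.0 and never above the maximum. This is an illustrative reading of the flag descriptions, not mazinger's confirmed algorithm, and the segment_tempo helper is hypothetical.

```shell
# Hypothetical helper: speed multiplier needed to fit a synthesized segment
# into its subtitle slot, capped at a maximum tempo.
# $1 = synthesized duration (s), $2 = slot duration (s, must be non-zero)
# $3 = max tempo (defaults to 1.3, the documented --max-tempo default)
segment_tempo() {
    awk -v syn="$1" -v slot="$2" -v max="${3:-1.3}" 'BEGIN {
        t = syn / slot
        if (t < 1.0) t = 1.0   # assumption: speech is never slowed down
        if (t > max) t = max   # cap the speed-up
        printf "%.2f\n", t
    }'
}

segment_tempo 4.4 4.0   # slightly too long: prints 1.10
segment_tempo 6.0 4.0   # would need 1.50, capped: prints 1.30
segment_tempo 3.0 4.0   # already fits: prints 1.00
```

A segment that hits the cap would still overrun its slot under this model, which is one reason a tighter --duration-budget at translation time can matter.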

profile

List available voice themes or generate a reusable voice profile from a theme.

profile list

List all available voice themes.

mazinger profile list

Displays all 16 pre-defined themes with name, gender, and supported languages.

profile generate

Generate a voice profile directory from a theme.

mazinger profile generate <theme> <language> -o <output-dir> [options]

| Argument / Flag | Default | Description |
|---|---|---|
| theme | (required) | Theme name (from profile list) |
| language | (required) | Target language (e.g. English, Spanish) |
| -o, --output | (required) | Output directory for the profile |
| --device | auto | auto, cuda, or cpu |
| --dtype | bfloat16 | Weight dtype for VoiceDesign model |

The output directory will contain voice.wav and script.txt, suitable for use with --clone-profile <path>.
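
Given that a generated profile consists of voice.wav and script.txt, a script can check a directory for both files before passing it to --clone-profile. The is_profile_dir helper below is a hypothetical convenience, not part of mazinger, and it only checks that the files exist, not that their contents are valid.

```shell
# Hypothetical helper: succeed when a directory holds both profile files.
# $1 = candidate profile directory
is_profile_dir() {
    [ -f "$1/voice.wav" ] && [ -f "$1/script.txt" ]
}

if is_profile_dir ./my-narrator; then
    echo "profile looks complete"
else
    echo "profile is missing voice.wav or script.txt"
fi
```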

Examples:

# Generate a narrator profile for Spanish
mazinger profile generate narrator-m Spanish -o ./my-narrator

# Use the generated profile
mazinger dub "https://youtube.com/watch?v=abc123" \
    --clone-profile ./my-narrator --target-language Spanish

subtitle

Burn subtitles into a video file, optionally replacing the audio track.

mazinger subtitle [source] [options]

| Flag | Default | Description |
|---|---|---|
| --video | | Source video path |
| --srt | | SRT file path |
| --audio | | Replacement audio track |
| -o, --output | | Output video path |
| --openai-api-key | $OPENAI_API_KEY | OpenAI API key (if auto-translating) |
| --llm-model | gpt-4.1 | LLM model |
| --llm-think / --no-llm-think | | Enable/disable LLM thinking mode |
| --transcribe-method | openai | Transcription method |
| --source-language | auto | Source language |
| --target-language | English | Target language |

All --subtitle-* styling flags are accepted. See Subtitle Styling.

Examples:

# Burn subtitles, keep original audio
mazinger subtitle video.mp4 --srt translated.srt -o output.mp4

# Burn subtitles and replace audio
mazinger subtitle video.mp4 --srt translated.srt --audio dubbed.wav -o output.mp4

# With custom styling
mazinger subtitle video.mp4 --srt translated.srt -o output.mp4 \
    --subtitle-font-size 28 \
    --subtitle-font-color yellow \
    --subtitle-bg-color black \
    --subtitle-bg-alpha 0.8 \
    --subtitle-position bottom \
    --subtitle-bold