Create a tool that takes two singing videos and recreates the pitch sequence of video 1 by recutting clips from video 2 based on pitch matching.
Inputs:
- Video 1 (Guide): Person A singing song X - defines target pitch sequence
- Video 2 (Source): Person B singing song Y - provides clips to recut
Output:
- New video of Person B appearing to sing the melody from song X (by reordering their clips from song Y)
Video 1 (Guide) → [Onset Detection + Pitch Analysis] → Pitch Sequence (e.g., C4, D4, E4...)
↓
Video 2 (Source) → [Onset Detection + Pitch Analysis] → Pitch Database (all pitches + timestamps)
↓
[Pitch Matcher: Find matching pitches in source]
↓
[Video Assembler: Recut source to match guide]
↓
Output: Source video recut to guide's melody
Goal: Extract the pitch sequence from video 1 that we want to recreate
- Extract audio from video 1 using ffmpeg
- Sample rate: 22050 Hz (standard for pitch detection)
- Reuse existing code: onset_strength_analysis.py detects onsets (musical note beginnings) with frame-accurate precision
- Output: Array of onset times and strengths
- For each onset-to-onset segment, extract pitch information:
- Library options:
- CREPE (recommended): Deep learning, very accurate for singing
- librosa.pyin(): Built-in, decent accuracy
- Extract frame-by-frame pitch values for the segment
- Calculate median pitch for the segment (Hz)
- Convert to MIDI note number (0-127)
- Calculate confidence score (how stable is the pitch?)
- Store segment duration
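The per-segment extraction above can be sketched as follows, assuming frame-wise pitch values (e.g. from CREPE or librosa.pyin) are already in hand; `summarize_segment` and its stability-based confidence heuristic are illustrative, not a fixed API:

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def summarize_segment(frame_pitches_hz: np.ndarray) -> dict:
    """Summarize one onset-to-onset segment from frame-wise pitch estimates."""
    voiced = frame_pitches_hz[frame_pitches_hz > 0]          # drop unvoiced frames
    median_hz = float(np.median(voiced))
    midi = int(round(69 + 12 * np.log2(median_hz / 440.0)))  # A4 = 440 Hz = MIDI 69
    # Confidence heuristic: fraction of voiced frames within +/-50 cents of the median.
    cents_off = 1200.0 * np.abs(np.log2(voiced / median_hz))
    return {
        "pitch_hz": round(median_hz, 1),
        "pitch_midi": midi,
        "pitch_note": f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}",
        "pitch_confidence": round(float(np.mean(cents_off <= 50.0)), 2),
    }
```

For a segment hovering around 261.6 Hz this yields a C4 record like those in the guide_sequence JSON.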
{
"video_path": "guide_video.mp4",
"fps": 24,
"audio_sample_rate": 22050,
"guide_sequence": [
{
"index": 0,
"start_time": 0.5,
"end_time": 0.85,
"duration": 0.35,
"pitch_hz": 261.6,
"pitch_midi": 60,
"pitch_note": "C4",
"pitch_confidence": 0.95,
"onset_strength": 0.78
},
{
"index": 1,
"start_time": 0.85,
"end_time": 1.2,
"duration": 0.35,
"pitch_hz": 293.7,
"pitch_midi": 62,
"pitch_note": "D4",
"pitch_confidence": 0.92,
"onset_strength": 0.65
}
// ... more segments
]
}

Implementation: src/pitch_guide_analyzer.py
Goal: Build a searchable database of ALL pitches available in video 2
- Extract audio from video 2 (same settings as guide)
- Detect ALL onsets in source video (not just matching guide)
- More granular detection to capture every possible note/syllable
- Lower threshold to catch more potential clips
- For each detected onset segment:
- Extract median pitch (Hz + MIDI)
- Store start/end times in both audio and video
- Store frame numbers for precise video cutting
- Calculate duration and confidence
- Index by MIDI note for fast lookup
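The MIDI index from the last step reduces to a simple grouping pass; a sketch (`build_pitch_index` is a hypothetical helper, clip dicts follow the JSON schema in this plan):

```python
from collections import defaultdict

def build_pitch_index(pitch_database: list[dict]) -> dict[int, list[int]]:
    """Map MIDI note -> clip IDs carrying that note, for O(1) lookup."""
    index = defaultdict(list)
    for clip in pitch_database:
        index[clip["pitch_midi"]].append(clip["clip_id"])
    return dict(index)
```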
{
"video_path": "source_video.mp4",
"fps": 24,
"audio_sample_rate": 22050,
"pitch_database": [
{
"clip_id": 0,
"start_time": 1.25,
"end_time": 1.58,
"duration": 0.33,
"pitch_hz": 261.8,
"pitch_midi": 60,
"pitch_note": "C4",
"pitch_confidence": 0.89,
"video_start_frame": 30,
"video_end_frame": 38,
"onset_strength": 0.71
},
{
"clip_id": 1,
"start_time": 1.58,
"end_time": 1.95,
"duration": 0.37,
"pitch_hz": 329.6,
"pitch_midi": 64,
"pitch_note": "E4",
"pitch_confidence": 0.94,
"video_start_frame": 38,
"video_end_frame": 47,
"onset_strength": 0.82
}
// ... all detected segments
],
"pitch_index": {
"60": [0, 45, 89, 123], // Clip IDs with MIDI note 60 (C4)
"62": [12, 56, 78], // D4
"64": [1, 34, 67, 99] // E4
// ... all MIDI notes found
}
}

Implementation: src/pitch_source_analyzer.py
Goal: For each pitch in the guide sequence, find the best matching clip from source database
Strategy 1: Exact MIDI Match (Priority)
- For guide segment with MIDI note N:
- Look up all source clips with MIDI note N
- Rank by:
- Pitch confidence (higher is better)
- Duration similarity to guide segment
- Reuse count (prefer unused clips)
Strategy 2: Nearest Pitch Match (Fallback)
- If no exact MIDI match found:
- Search MIDI notes N±1 (adjacent semitones)
- Search MIDI notes N±2 (if still no match)
- Calculate pitch distance in cents (100 cents = 1 semitone)
- Prefer matches within ±50 cents
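Strategies 1 and 2 together amount to a widening index lookup; a sketch (`candidate_clips` is illustrative, `pitch_index` is the MIDI-note index from the source database):

```python
def candidate_clips(pitch_index: dict[int, list[int]], target_midi: int,
                    max_semitones: int = 2):
    """Exact MIDI match first; widen to +/-1 then +/-2 semitones if needed."""
    if pitch_index.get(target_midi):
        return 0, list(pitch_index[target_midi])
    for offset in range(1, max_semitones + 1):
        found = (pitch_index.get(target_midi - offset, [])
                 + pitch_index.get(target_midi + offset, []))
        if found:
            return offset, found
    return None, []  # no candidates; caller logs a warning
```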
Strategy 3: Duration Weighting
- Combined score: pitch_match * duration_similarity
- Duration similarity: 1 - abs(source_duration - guide_duration) / guide_duration
- Prefer clips close in length to the guide segment to avoid stretching
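A minimal sketch of Strategy 3's combined score using the formulas above; the linear decay of pitch_match to zero at 100 cents (one semitone) is an assumed weighting, not fixed by this plan:

```python
import math

def duration_similarity(source_duration: float, guide_duration: float) -> float:
    return max(0.0, 1.0 - abs(source_duration - guide_duration) / guide_duration)

def pitch_distance_cents(source_hz: float, guide_hz: float) -> float:
    return abs(1200.0 * math.log2(source_hz / guide_hz))

def match_score(source_hz, source_dur, guide_hz, guide_dur):
    # Pitch term decays linearly to 0 at one semitone (100 cents) off.
    pitch_match = max(0.0, 1.0 - pitch_distance_cents(source_hz, guide_hz) / 100.0)
    return pitch_match * duration_similarity(source_dur, guide_dur)
```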
Control how often source clips can be reused:
- none: Each source clip used max once (requires a very long source video)
- allow: Unlimited reuse (best pitch matches, may be repetitive)
- min_gap: Minimum 5 seconds between reuses of the same clip
- limited: Each clip reused max 3 times total
- percentage: Max 30% of final video can be reused clips
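The min_gap policy, for example, reduces to simple bookkeeping (sketch; function names are illustrative, and the other policies would be variations on the same idea):

```python
def can_reuse(clip_id: int, output_time_s: float,
              last_used: dict[int, float], min_gap_s: float = 5.0) -> bool:
    """True if the clip has never been used, or was last used >= min_gap_s ago."""
    prev = last_used.get(clip_id)
    return prev is None or output_time_s - prev >= min_gap_s

def mark_used(clip_id: int, output_time_s: float, last_used: dict[int, float]) -> None:
    """Record the output-timeline position where the clip was placed."""
    last_used[clip_id] = output_time_s
```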
{
"guide_video": "guide_video.mp4",
"source_video": "source_video.mp4",
"reuse_policy": "min_gap",
"matches": [
{
"guide_index": 0,
"guide_pitch_midi": 60,
"guide_duration": 0.35,
"source_clip_id": 45,
"source_pitch_midi": 60,
"pitch_distance_cents": 8.2,
"duration_ratio": 0.94,
"match_score": 0.91,
"source_start_time": 12.5,
"source_end_time": 12.83,
"source_start_frame": 300,
"source_end_frame": 308,
"reuse_count": 0
}
// ... one match per guide segment
],
"stats": {
"total_segments": 145,
"exact_matches": 132,
"semitone_matches": 11,
"no_match_found": 2,
"avg_pitch_distance": 12.3,
"clips_reused": 23
}
}

Implementation: src/pitch_matcher.py
Goal: Cut and reassemble source video based on matched pitch sequence
For each match:
- Extract video segment from source using ffmpeg
- Handle duration mismatches:
- Option A (Simple): Trim to exact guide duration
- Option B (Advanced): Time-stretch audio/video to match (±20% max)
- Save temporary clip files
Use ffmpeg concat filter to join all clips:
ffmpeg -f concat -safe 0 -i filelist.txt -c copy output.mp4

Option 1: Source Audio (Reconstructed Melody)
- Use the audio from matched source clips
- Result: Source person singing guide melody (pitch-matched reconstruction)
Option 2: Guide Audio (Original)
- Replace entire audio track with guide video's audio
- Result: Source person's video synced to guide's original singing
Option 3: Both Versions
- Generate both outputs for comparison
Output files:
- pitch_matched_source_audio.mp4 - Source clips with their own audio (reconstructed melody)
- pitch_matched_guide_audio.mp4 - Source clips with guide audio overlaid
- pitch_matched_source_audio_prores.mov - ProRes 422 (editing-friendly)
Quality Settings:
- H.264: CRF 18, slow preset
- Audio: 320kbps AAC or uncompressed PCM
- ProRes: 422 10-bit for professional editing
Implementation: src/pitch_video_assembler.py
Interactive HTML tool showing:
- Guide pitch sequence (line graph)
- Matched source pitches overlaid
- Pitch distance errors highlighted
- Audio playback synchronized with visualization
- Click to jump to problematic segments
Generate statistics:
- Percentage of exact pitch matches
- Average pitch error (cents)
- Duration match accuracy
- Reuse statistics
- Segments with low confidence
- Suggested threshold adjustments
Implementation: src/pitch_visualizer.py
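The match-quality numbers above could be aggregated from the matcher output like so (sketch against the JSON schema in this plan; field names as defined there):

```python
def match_stats(matches: list[dict]) -> dict:
    """Aggregate match quality from the matcher's per-segment records."""
    total = len(matches)
    exact = sum(m["guide_pitch_midi"] == m["source_pitch_midi"] for m in matches)
    return {
        "total_segments": total,
        "exact_matches": exact,
        "exact_match_pct": round(100.0 * exact / total, 1),
        "avg_pitch_distance": round(sum(m["pitch_distance_cents"] for m in matches) / total, 1),
    }
```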
- onset_strength_analysis.py - Core onset detection (lines 1-150)
  - Reuse OnsetStrengthAnalyzer class - already handles frame-accurate onset detection
- audio_segmenter.py - Audio segmentation logic
  - Reuse segment cutting at onset times
  - Modify to include pitch extraction per segment
- video_assembler.py - Video concatenation logic
  - Reuse ffmpeg concat approach
  - Reuse quality settings (CRF 18, ProRes options)
- pitch_guide_analyzer.py - Guide video pitch analysis
- pitch_source_analyzer.py - Source database builder
- pitch_matcher.py - Pitch matching algorithm
- pitch_video_assembler.py - Modified assembler for pitch-based cuts
- pitch_visualizer.py - Pitch contour visualization
Create src/pitch_utils.py for:
- Hz to MIDI conversion
- MIDI to note name conversion
- Pitch distance calculation (cents)
- Pitch confidence scoring
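The conversion helpers follow directly from equal-temperament math (A4 = 440 Hz = MIDI 69); a sketch of what pitch_utils.py could contain:

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def hz_to_midi(hz: float) -> int:
    """Nearest MIDI note for a frequency (A4 = 440 Hz = MIDI 69)."""
    return round(69 + 12 * math.log2(hz / 440.0))

def midi_to_note_name(midi: int) -> str:
    """MIDI note number to scientific pitch notation (60 -> 'C4')."""
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"

def cents_between(hz_a: float, hz_b: float) -> float:
    """Signed pitch distance in cents (100 cents = 1 semitone)."""
    return 1200.0 * math.log2(hz_a / hz_b)
```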
# Pitch detection
crepe>=0.0.12 # ML-based pitch tracking (recommended)
# OR use librosa's built-in pyin
# (librosa>=0.10.0 already installed)
# Optional: Time stretching
pyrubberband>=0.3.0 # For duration adjustment

Existing dependencies:
- librosa (onset detection, pitch detection)
- numpy (array processing)
- soundfile (audio I/O)
- ffmpeg (video/audio processing)
- opencv-python (if needed for video frames)
video-hacking/
├── src/
│ ├── onset_strength_analysis.py # [REUSE] Onset detection
│ ├── audio_segmenter.py # [REUSE] Audio cutting
│ ├── video_assembler.py # [REUSE] Video concatenation
│ ├── pitch_guide_analyzer.py # [NEW] Phase 1
│ ├── pitch_source_analyzer.py # [NEW] Phase 2
│ ├── pitch_matcher.py # [NEW] Phase 3
│ ├── pitch_video_assembler.py # [NEW] Phase 4
│ ├── pitch_visualizer.py # [NEW] Phase 5
│ └── pitch_utils.py # [NEW] Shared utilities
├── scripts/
│ ├── test_pitch_pipeline.sh # Complete pipeline
│ ├── test_pitch_analysis.sh # Test pitch detection
│ └── test_pitch_matching.sh # Test matching only
└── PITCH_TOOL_README.md # User documentation
- Install CREPE library
- Create pitch_utils.py with conversion functions
- Implement pitch_guide_analyzer.py
  - Integrate onset detection from existing code
  - Add CREPE pitch detection per segment
  - Export guide sequence JSON
- Test with a simple singing video
- Verify pitch accuracy (compare to ground truth)
- Implement pitch_source_analyzer.py
  - Reuse onset detection logic
  - Build comprehensive pitch database
  - Create MIDI note index for fast lookup
- Test with a longer source video (5-10 minutes)
- Verify database completeness (all pitches detected)
- Implement pitch_matcher.py
  - Exact MIDI matching
  - Nearest pitch fallback
  - Duration weighting
  - Reuse policy enforcement
- Test different reuse policies
- Generate match statistics
- Tune matching thresholds
- Implement pitch_video_assembler.py
  - Modify video_assembler.py for pitch-based cuts
  - Clip extraction and trimming
  - Concatenation with both audio options
- Generate first complete pitch-matched video
- Compare source audio vs guide audio versions
- Test with different video durations
- Implement pitch_visualizer.py
  - Pitch contour graph (guide vs matched)
  - Interactive playback
  - Error highlighting
- Create shell scripts for pipeline automation
- Write comprehensive README
- Add example videos and test cases
- Performance optimization
Decision: CREPE
- Rationale: Most accurate for singing (deep learning-based)
- Fallback: librosa.pyin() if CREPE installation fails
- Performance: ~0.3s per second of audio (acceptable)
Decision: Reuse existing onset_strength_analysis.py settings
- FPS: 24 (standard video)
- Power: 0.6 (balanced sensitivity)
- Threshold: 0.2 (adjustable per video)
Decision: Exact MIDI > Nearest Pitch > Duration Weighted
- Try exact MIDI match first (preferred)
- Fallback to ±1 semitone if needed
- Weight by duration for final ranking
Decision: Phase 1 = Simple trim, Phase 2 = Time-stretch
- v1: Trim clips to exact duration (simpler, no artifacts)
- v2: Add time-stretching for ±20% duration mismatches
- Avoid pitch-shifting (would alter the detected pitch)
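The v1/v2 duration handling reduces to a small decision rule (sketch; the actual stretching would go through pyrubberband, this only shows the trim-vs-stretch choice):

```python
def duration_plan(source_dur: float, guide_dur: float, max_stretch: float = 0.2):
    """Return ('stretch', rate) when within +/-20%, else ('trim', seconds)."""
    rate = guide_dur / source_dur
    if abs(rate - 1.0) <= max_stretch:
        return ("stretch", rate)       # rate would be handed to the time-stretcher
    return ("trim", min(source_dur, guide_dur))
```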
Decision: min_gap (5 seconds)
- Balances pitch accuracy with visual variety
- Prevents jarring back-to-back repetitions
- User can override with allow or none
- Guide video: 30-60 seconds, simple melody (e.g., "Twinkle Twinkle")
- Source video: 5-10 minutes, different song, same singer if possible
- Format: MP4, 24fps, clear audio (no background noise)
- Pitch accuracy: >85% exact MIDI matches, >95% within ±1 semitone
- Temporal alignment: Onset timing within ±100ms
- Visual quality: No visible encoding artifacts
- Audio quality: No clicks, pops, or volume jumps
- Visualize guide pitch sequence (are onsets detected correctly?)
- Check source database coverage (does it have all needed pitches?)
- Review match quality report (where are the errors?)
- Watch output video (does it look/sound natural?)
- Iterate on threshold/parameters
Open question: What if a guide pitch has no match in the source? Options:
- A) Use nearest available pitch (±1-2 semitones)
- B) Insert silence/black frame
- C) Pitch-shift nearest clip (adds artifacts)
- Recommendation: A, with warning logged
Open question: How to handle very short notes? Options:
- A) Extend to minimum duration (0.15s)
- B) Merge with adjacent note
- C) Skip entirely
- Recommendation: A, with crossfade
Open question: Match median pitch or full contour? Options:
- A) Use median pitch only (simpler)
- B) Match full pitch contour (complex)
- Recommendation: A for v1, B for v2
Open question: Which audio track for the output? Options:
- A) Source audio (reconstructed melody)
- B) Guide audio (original performance)
- C) Both versions
- Recommendation: C (generate both)
Open question: Hard cuts or crossfades? Options:
- A) Hard cuts (current approach)
- B) Short crossfades (50-100ms)
- Recommendation: A for v1, add B as option
Risk: Inaccurate pitch detection
- Mitigation: Use CREPE (most accurate), filter low-confidence segments
- Fallback: Manual pitch correction via JSON editing
Risk: Source video lacks needed pitches
- Mitigation: Require source video 10-20x longer than guide
- Fallback: Allow semitone mismatches, log warnings
Risk: Poor onset/timing alignment
- Mitigation: Frame-accurate onset detection, manual threshold tuning
- Fallback: Visualizer tool for debugging
Risk: Jarring cuts from duration mismatches
- Mitigation: Prefer clips with similar duration, add crossfade option
- Fallback: User can manually exclude jarring clips
- Facial expression matching (ImageBind embeddings)
  - Combine pitch match + visual coherence score
  - Prefer clips where mouth shape roughly matches
- Pitch correction/auto-tune
  - Fine-tune source clips to exact guide pitch
  - Use rubberband or librosa pitch_shift
- Vibrato preservation
  - Match full pitch contour, not just median
  - Use DTW (Dynamic Time Warping) for alignment
- Real-time preview
  - Web interface for adjusting thresholds
  - Instant re-matching and preview
- Batch processing
  - Process multiple guide/source pairs
  - Generate comparison videos automatically
- PITCH_TOOL_README.md - User guide with examples
- Code comments - Docstrings for all functions
- Example scripts - Shell scripts for common workflows
- Test cases - Sample videos and expected outputs
- Troubleshooting guide - Common errors and fixes
Minimum Viable Product (MVP):
- ✅ Accurately detect pitches in guide video
- ✅ Build searchable database from source video
- ✅ Match >85% of guide pitches exactly
- ✅ Generate playable output video with both audio options
- ✅ Complete pipeline script (one command to run all phases)
Stretch Goals:
- ✅ Interactive visualizer for pitch contours
- ✅ Time-stretching for duration matching
- ✅ Multiple reuse policies
- ✅ Comprehensive match quality report
- Review this plan - Confirm approach and priorities
- Prepare test videos - Find/create guide and source singing videos
- Install dependencies - Add CREPE to requirements.txt
- Start Sprint 1 - Implement pitch_guide_analyzer.py
- Iterate and test - Validate pitch detection before moving to matching
- Core implementation (Sprints 1-4): 8-10 days
- Visualization & polish (Sprint 5): 2-3 days
- Testing & refinement: 2-3 days
- Documentation: 1-2 days
Total: ~2 weeks for complete, polished tool
End of Plan