Gap Extraction Guide

Problem

We need to extract gap data from the correction process to manually review and annotate how each gap should be handled.

Quick Solution

Temporarily add the Python code from Step 1 to corrector.py, immediately before the gap processing loop, then run your normal correction command with DUMP_GAPS=1 set.

Step 1: Add Gap Dumping Code

In /Users/andrew/Projects/karaoke-gen/lyrics_transcriber_local/lyrics_transcriber/correction/corrector.py, find the line:

        for i, gap in enumerate(gap_sequences, 1):
            self.logger.info(f"Processing gap {i}/{len(gap_sequences)} at position {gap.transcription_position}")

Immediately before that loop, add this code:

        # === GAP EXTRACTION CODE (TEMPORARY) ===
        # Assumes `os` is already imported at the top of corrector.py and that
        # `word_map` and `gap_sequences` are in scope here. If `os` is not
        # imported yet, add `import os` at the top of the file.
        import sys
        import yaml  # PyYAML

        if os.getenv("DUMP_GAPS") == "1":
            # The reference context is identical for every gap, so build it once:
            # up to 150 words from the first non-empty reference source, capped at 300 characters.
            ref_context = ""
            for source, lyrics_data in self.reference_lyrics.items():
                if lyrics_data and lyrics_data.segments:
                    ref_words = []
                    for seg in lyrics_data.segments[:20]:
                        ref_words.extend(w.text for w in seg.words)
                    ref_context = " ".join(ref_words[:150])[:300]
                    break

            gaps_data = []
            for i, gap in enumerate(gap_sequences, 1):
                # Collect the transcribed words that fall inside this gap
                gap_words = []
                for word_id in gap.transcribed_word_ids:
                    if word_id in word_map:
                        word = word_map[word_id]
                        gap_words.append({
                            "id": word_id,
                            "text": word.text,
                            # "or 0" guards against timestamps that exist but are None
                            "start_time": round(getattr(word, "start_time", 0) or 0, 3),
                            "end_time": round(getattr(word, "end_time", 0) or 0, 3),
                        })

                gaps_data.append({
                    "gap_id": i,
                    "position": gap.transcription_position,
                    "gap_text": " ".join(w["text"] for w in gap_words),
                    "transcribed_words": gap_words,
                    "reference_context": ref_context,
                    "word_count": len(gap_words),
                    # Empty template for the reviewer to fill in by hand
                    "annotations": {
                        "your_decision": "",
                        "action_type": "# NO_ACTION | REPLACE | DELETE | INSERT | MERGE | SPLIT",
                        "target_word_ids": [],
                        "replacement_text": "",
                        "notes": "",
                    },
                })

            with open("gaps_review.yaml", "w", encoding="utf-8") as f:
                f.write("# Gap Review Data\n")
                f.write(f"# Total gaps: {len(gaps_data)}\n\n")
                yaml.dump({"gaps": gaps_data}, f, default_flow_style=False, allow_unicode=True, width=120, sort_keys=False)

            self.logger.info(f"📝 Dumped {len(gaps_data)} gaps to gaps_review.yaml")
            sys.exit(0)  # stop here so no corrections are applied
        # === END GAP EXTRACTION CODE ===

Step 2: Run with the Flag

DUMP_GAPS=1 USE_AGENTIC_AI=0 poetry run lyrics-transcriber Time-Bomb.flac

This will:

Find all gaps
Write them to gaps_review.yaml
Exit before any corrections are applied
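
Before starting the review, it can help to confirm that the dump looks sane. The following is a minimal sketch, assuming PyYAML is installed and that gaps_review.yaml has the gaps / word_count / transcribed_words layout written by the code in Step 1. The names summarize_gaps and summarize_file are hypothetical helpers for illustration, not part of lyrics-transcriber.

```python
"""Sanity-check gaps_review.yaml after a DUMP_GAPS=1 run (illustrative sketch)."""


def summarize_gaps(data):
    """Return basic counts for an already-loaded gaps_review.yaml document."""
    gaps = (data or {}).get("gaps") or []
    return {
        "total_gaps": len(gaps),
        "total_words": sum(g.get("word_count", 0) for g in gaps),
        # Gaps whose word IDs were not found in word_map end up with no words
        "empty_gaps": [g["gap_id"] for g in gaps if not g.get("transcribed_words")],
    }


def summarize_file(path="gaps_review.yaml"):
    """Load the dump from disk and summarize it."""
    import yaml  # PyYAML; imported here so summarize_gaps has no dependencies

    with open(path, encoding="utf-8") as f:
        # The leading "# Gap Review Data" lines are YAML comments and are ignored
        return summarize_gaps(yaml.safe_load(f))
```

Calling print(summarize_file()) from the directory where the command was run should report a gap count that matches the "Total gaps" comment at the top of the file. Any IDs listed under empty_gaps are worth a look, since they mean none of that gap's word IDs were found in word_map.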

Step 3: Review the Output

Open gaps_review.yaml and for each gap, fill in:

your_decision: Brief description of what should happen
action_type: NO_ACTION | REPLACE | DELETE | INSERT | MERGE | SPLIT
target_word_ids: Which word IDs to operate on (from transcribed_words)
replacement_text: The corrected text
notes: Any additional context
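
Hand-edited annotations are easy to get slightly wrong, so a quick check after filling them in can catch typos early. This is a sketch under stated assumptions: the field names are the ones written in Step 1, the rule that REPLACE requires replacement_text is my own assumption, and validate_gap and validate_file are hypothetical helper names.

```python
"""Check filled-in annotations in gaps_review.yaml for common mistakes (sketch)."""

VALID_ACTIONS = {"NO_ACTION", "REPLACE", "DELETE", "INSERT", "MERGE", "SPLIT"}


def validate_gap(gap):
    """Return a list of problems found in one gap's annotations."""
    problems = []
    ann = gap.get("annotations") or {}
    known_ids = {w["id"] for w in gap.get("transcribed_words") or []}

    # An untouched gap still holds the "# NO_ACTION | ..." placeholder and fails here
    action = ann.get("action_type", "")
    if action not in VALID_ACTIONS:
        problems.append(f"action_type {action!r} is not one of {sorted(VALID_ACTIONS)}")

    # Every target must be a word that actually belongs to this gap
    for word_id in ann.get("target_word_ids") or []:
        if word_id not in known_ids:
            problems.append(f"target word id {word_id!r} is not in transcribed_words")

    # Assumption: a REPLACE without replacement text is an unfinished annotation
    if action == "REPLACE" and not ann.get("replacement_text"):
        problems.append("REPLACE needs replacement_text")
    return problems


def validate_file(path="gaps_review.yaml"):
    """Return {gap_id: [problems]} for every gap that has at least one problem."""
    import yaml  # PyYAML

    with open(path, encoding="utf-8") as f:
        data = yaml.safe_load(f) or {}
    report = {}
    for gap in data.get("gaps") or []:
        problems = validate_gap(gap)
        if problems:
            report[gap["gap_id"]] = problems
    return report
```

An empty result from validate_file() means every gap has a recognized action_type and only refers to word IDs listed in its own transcribed_words.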

Example Annotation

- gap_id: 1
  position: 7
  gap_text: "out, I'm starting over I'm"
  transcribed_words:
    - {id: w7, text: "out,", start_time: 10.5, end_time: 10.8}
    - {id: w8, text: "I'm", start_time: 10.9, end_time: 11.1}
    # ... more words
  reference_context: "Starting now I'm starting over I'm gonna sleep..."
  annotations:
    your_decision: "Replace 'out,' with 'now' - transcription error"
    action_type: "REPLACE"
    target_word_ids: ["w7"]
    replacement_text: "now"
    notes: "Reference lyrics clearly say 'now' not 'out'"

Step 4: Remove the Temporary Code

After extracting gaps, remove the === GAP EXTRACTION CODE === block from corrector.py.
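
To double-check that nothing was left behind, the file can be scanned for the marker comments and the DUMP_GAPS flag. A minimal sketch, assuming the block was pasted with its marker lines unchanged; find_leftovers and check_file are hypothetical helper names.

```python
"""Check that the temporary gap-extraction block is gone (illustrative sketch)."""

# Strings that only appear in the temporary block from Step 1
MARKERS = ("GAP EXTRACTION CODE", "DUMP_GAPS")


def find_leftovers(source_text):
    """Return (line_number, stripped_line) pairs that still mention the temporary code."""
    return [
        (number, line.strip())
        for number, line in enumerate(source_text.splitlines(), 1)
        if any(marker in line for marker in MARKERS)
    ]


def check_file(path):
    """Read a source file and report any leftover lines."""
    with open(path, encoding="utf-8") as f:
        return find_leftovers(f.read())
```

Calling check_file() with the path to corrector.py should return an empty list once the block has been removed.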

Why This Approach?

The monkey-patching scripts were failing because:

Method names weren't what we expected
The correction logic is inline, not in a separate method

Adding temporary code directly to corrector.py is simpler and more reliable for a one-time extraction.
