[FEAT]: Offline Speaker Diarization for Multi-Speaker Environments #479

@Cubix33

📝 Description

Implement a 100% offline speaker diarization pipeline using pyannote.audio to distinguish between multiple voices in a single audio file. This will map transcriptions to specific speakers (e.g., SPEAKER_01, SPEAKER_02), allowing the LLM to differentiate between official First Responders and bystanders/patients.

💡 Rationale

Emergency scenes are chaotic. When a responder interviews a patient, both voices end up in the same .wav file. Currently, the LLM processes this as a single block of text, which creates a high risk of "hallucinating" data by confusing a panicked bystander's guess with a responder's official medical assessment. We need this to ensure factual accuracy while maintaining our strict zero-cloud privacy mandate.

🛠️ Proposed Solution

Implement Voice Activity Detection (VAD) to trim silence, then run pyannote.audio locally to segment the audio and cluster speech by speaker embedding. We will align the Whisper transcript with these speaker turns to produce a "movie script" format, then update the LLM prompt to explicitly filter facts based on the speaker.

  • Logic change in src/ (New module: src/diarization.py for pipeline execution and memory management)
  • Update to requirements.txt (Add pyannote.audio and torchaudio)
  • New prompt for Mistral/Ollama (Instruct LLM to identify the responder and only extract their confirmed facts)
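The alignment step above could look roughly like the sketch below: assign each Whisper segment the diarization speaker whose turn overlaps it most. The data shapes are assumptions, not the final interface — Whisper-style segments carrying `start`/`end`/`text`, and pyannote-style turns carrying `start`/`end`/`speaker`.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length (in seconds) of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose diarization
    turn overlaps it most, producing a "movie script" transcript."""
    script = []
    for seg in segments:
        speaker = "UNKNOWN"  # fallback when no turn overlaps the segment
        best_overlap = 0.0
        for turn in turns:
            ov = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if ov > best_overlap:
                best_overlap, speaker = ov, turn["speaker"]
        script.append(f"[{speaker}] {seg['text']}")
    return "\n".join(script)

# Illustrative data only (not real pipeline output):
segments = [{"start": 0.0, "end": 2.5, "text": "What happened here?"},
            {"start": 2.7, "end": 5.0, "text": "He just collapsed!"}]
turns = [{"start": 0.0, "end": 2.6, "speaker": "SPEAKER_01"},
         {"start": 2.6, "end": 5.2, "speaker": "SPEAKER_02"}]

print(assign_speakers(segments, turns))
# [SPEAKER_01] What happened here?
# [SPEAKER_02] He just collapsed!
```

A greedy max-overlap assignment keeps the logic simple; if testing shows frequent mid-segment speaker changes, we may need to split segments at turn boundaries instead.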

✅ Acceptance Criteria

How will we know this is finished?

  • Feature works in Docker container.
  • Documentation updated in docs/ (Specifically regarding local model caching for pyannote weights).
  • JSON output validates against the schema.
  • Pipeline successfully runs on a local machine without Out-Of-Memory (OOM) crashes by explicitly unloading the diarization model before Ollama inference.

📌 Additional Context

  • Hardware Constraints: Running Pyannote, Whisper, and Ollama simultaneously will crash most standard edge devices. The implementation must include sequential loading/unloading of models (e.g., del pipeline, torch.cuda.empty_cache()) to manage VRAM effectively.
  • Fallback: If pyannote proves too heavy for the target hardware during testing, we may need to pivot to an app-level "Push-to-Talk" segregation as a lighter alternative.
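The sequential loading/unloading described above could follow a generic "load, run, release" pattern like this sketch. The stage callables are hypothetical placeholders; in practice `load_model` would be, e.g., a pyannote `Pipeline.from_pretrained(...)` call or a Whisper model load, with Ollama inference running last.

```python
import gc

def run_stage(load_model, run, payload):
    """Load one model, apply it to payload, then release it before
    returning, so only one heavyweight model lives in memory at a time."""
    model = load_model()
    try:
        result = run(model, payload)
    finally:
        del model          # drop the reference so gc can reclaim it
        gc.collect()
        try:
            import torch   # optional: only frees GPU cache if torch is installed
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except ImportError:
            pass
    return result

# Hypothetical pipeline order, each stage releasing its model before
# the next loads:
#   turns      = run_stage(load_diarizer, diarize, wav_path)
#   transcript = run_stage(load_whisper, transcribe, wav_path)
#   summary    = query_ollama(aligned_script)   # Ollama holds its own memory
```

Wrapping the release in `finally` ensures the model is freed even if a stage raises, which matters for the OOM acceptance criterion.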
