This document describes how to evaluate reconstruction performance for SAC.
To evaluate speaker similarity, extract speaker embeddings from both the reconstructed speech and the reference speech. We use the WavLM-based model for speaker verification; set up its environment as described in its repository (e.g., install the "s3prl" package).
For reference, we provide an embedding extraction script at tools/speaker/extract_spk_emb.py.
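Once embeddings are available for a reconstructed/reference pair, a similarity score can be computed from them. The sketch below assumes cosine similarity, which is the usual choice for speaker verification embeddings; this document does not state the exact metric used by scripts/eval.sh. The function name cosine_similarity is hypothetical and is not part of this repository.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors.

    Assumption: cosine similarity is the scoring metric; the repository's
    own evaluation script may differ.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

A score near 1 means the reconstructed speech keeps the reference speaker's characteristics; averaging the score over all pairs gives a single number for the test set.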
In the scripts/eval.sh script, set the variable rec_path to the reconstructed audio directory, which should follow this structure:
/path/to/reconstructed_wav
├── speaker_embedding/
└── wav/
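Before running the evaluation, it can help to confirm that the directory matches the layout above. The following is a minimal sketch; the helper name missing_subdirs is hypothetical and is not part of this repository.

```python
from pathlib import Path
from typing import List, Sequence


def missing_subdirs(
    rec_path: str,
    required: Sequence[str] = ("wav", "speaker_embedding"),
) -> List[str]:
    """Return the required subdirectories that are absent under rec_path."""
    root = Path(rec_path)
    return [name for name in required if not (root / name).is_dir()]
```

An empty list means both wav/ and speaker_embedding/ are present under rec_path.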
To evaluate word error rate (WER), set ref_texts_path to the reference transcription file. The input .jsonl file should include:
wav_path: path to the reference audio
text: corresponding ground-truth transcript
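As an illustration of the expected input, the sketch below reads such a .jsonl file, checks that every record has both fields, and computes WER as word-level edit distance divided by the number of reference words. It assumes whitespace tokenization with no text normalization; the actual pipeline in scripts/eval.sh may normalize text or score differently. The names load_manifest and word_error_rate and the example path are hypothetical.

```python
import json
from typing import Dict, List

# Example line (hypothetical path and text):
# {"wav_path": "/path/to/reference/0001.wav", "text": "hello world"}


def load_manifest(path: str) -> List[Dict[str, str]]:
    """Read a .jsonl file and check each record has wav_path and text."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            for key in ("wav_path", "text"):
                if key not in record:
                    raise KeyError(f"line {line_no}: missing field '{key}'")
            records.append(record)
    return records


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Here the hypothesis would be the ASR transcript of the reconstructed audio, and the reference would be the text field of the matching record.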