Dixtral extends the DiCoW target-speaker ASR system with an LLM decoder, enabling per-speaker QA and summarization directly from meeting audio.
Live demo available: see demo/README.md for instructions on running the Gradio interface locally.
The architecture combines:
- DiCoW encoder: a Whisper encoder augmented with Frame-level Diarization-Dependent Transformations (FDDT) that inject Silence/Target/Non-Target/Overlap (STNO) diarization masks into every encoder layer.
- Voxtral decoder: mistralai/Voxtral-Mini-3B-2507 as the LLM backbone, fine-tuned via LoRA to answer questions about what a specific speaker said.
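To make the STNO conditioning more concrete, here is a minimal sketch of how the four per-frame masks could be derived from binary speaker activity. It assumes the usual reading of the acronym (silence: nobody speaks; target: only the target speaks; non-target: only other speakers speak; overlap: the target and someone else speak). The function name `stno_masks` and the input layout are hypothetical and are not taken from this repository.

```python
def stno_masks(activity, target):
    """Return per-frame Silence/Target/Non-target/Overlap masks as four 0/1 lists.

    activity: list of per-speaker lists, 1 where that speaker is active in a frame.
    target:   index of the target speaker in `activity`.
    (Hypothetical helper for illustration, not the repository's implementation.)
    """
    masks = {"silence": [], "target": [], "non_target": [], "overlap": []}
    for frame in zip(*activity):
        tgt = bool(frame[target])
        others = any(a for i, a in enumerate(frame) if i != target)
        masks["silence"].append(int(not tgt and not others))
        masks["target"].append(int(tgt and not others))
        masks["non_target"].append(int(not tgt and others))
        masks["overlap"].append(int(tgt and others))
    return masks


# Two speakers, five frames; speaker 0 is the target.
activity = [
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
]
masks = stno_masks(activity, target=0)
for name, values in masks.items():
    print(f"{name:>10}: {values}")
```

Exactly one of the four masks is active in every frame, so the masks partition the recording from the point of view of the chosen target speaker.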
| Model | Description | Link |
|---|---|---|
| DiCoW v3.3 large | Diarization-conditioned Whisper (encoder) | https://huggingface.co/BUT-FIT/DiCoW_v3_3_large |
| Dixtral | Diarization-conditioned Voxtral | https://huggingface.co/BUT-FIT/Dixtral |
| Dixtral QA | QA fine-tuned variant | https://huggingface.co/BUT-FIT/Dixtral_QA |
git clone https://github.com/BUTSpeechFIT/Dixtral.git
cd Dixtral
Conda:
conda create -n dixtral python=3.11
conda activate dixtral
venv:
python -m venv dixtral
source dixtral/bin/activate
Then install the dependencies:
pip install -r requirements.txt
Edit configs/local_paths.sh to set:
| Variable | Description |
|---|---|
| SRC_ROOT | Root of this repository |
| MANIFEST_DIR | Directory containing Lhotse manifest files |
| EXPERIMENT_PATH | Output directory for checkpoints and logs |
| MUSAN_ROOT | Path to MUSAN noise corpus (optional) |
| HF_HOME | Hugging Face cache directory |
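Before launching a job, it can help to confirm that these variables are actually visible to Python. The check below is only a sketch: the helper `missing_variables` is hypothetical, and treating every variable except MUSAN_ROOT as required is an assumption based on the table above.

```python
import os

# Assumption: everything except MUSAN_ROOT (marked optional above) is required.
REQUIRED = ("SRC_ROOT", "MANIFEST_DIR", "EXPERIMENT_PATH", "HF_HOME")


def missing_variables(env, required=REQUIRED):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables(os.environ)
    if missing:
        print("Unset variables:", ", ".join(missing))
    else:
        print("All required variables are set.")
```

Run it in the same shell after sourcing configs/local_paths.sh so that it sees the exported values.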
conda install -c conda-forge ffmpeg sox
# or
sudo apt install ffmpeg sox
For standard target-speaker ASR, prepare Lhotse manifests using the dedicated repository, mt-asr-data-prep.
Follow its instructions, then set MANIFEST_DIR in configs/local_paths.sh.
The codebase uses Hydra for configuration. All configs are in ./configs.
| Config group | Description |
|---|---|
| +train=base | Train Dixtral encoder + decoder for target-speaker ASR |
| +train=qa_ft | Fine-tune Dixtral with LoRA for QA / summarization |
| +train=dec_gt | Decode with ground-truth diarization (ASR) |
| +train=dec_gt_qa | Decode with a QA-fine-tuned checkpoint |
Scripts are written for SLURM. Drop sbatch to run locally.
sbatch ./scripts/submit_slurm.sh +train=base
Key config knobs (configs/train/base.yaml):
- model.dixtral_base_model: Voxtral model ID (default: mistralai/Voxtral-Mini-3B-2507)
- model.dixtral_load_fddt_from: path to a pretrained DiCoW checkpoint (provides FDDT weights)
- training.use_lora: enable LoRA on the LLM decoder
- data.train_cutsets: list of Lhotse manifest paths
Dixtral QA/summarization fine-tuning requires speaker-level question-answer pairs and summaries annotated on top of an existing Lhotse cutset. We provide annotations for NOTSOFAR1 via the NSF-QA dataset on Hugging Face.
python utils/download_nsf_qa.py --local-dir data/nsf_qa
This downloads only the QA (*_flat.json) and summary (*_summaries.json) files from popcornell/NSF-QA.
The downloaded directory will contain:
data/nsf_qa/
  qa_annotations/
    train_qa_flat.json    # flat list of {session_id, speaker, question, answer, category, type}
    dev_qa_flat.json
    eval_qa_flat.json
    summaries/
      train/<session_id>_summaries.json    # per-speaker GT summaries
      dev/<session_id>_summaries.json
      eval/<session_id>_summaries.json
Each *_qa_flat.json is a list of records:
[
  {"session_id": "MTG_30860", "speaker": "Peter", "question": "...", "answer": "...", "category": "content", "type": "entity"},
  ...
]
Each *_summaries.json file has:
{
  "speaker_summaries": {
    "<speaker_id>": ["summary text 1", "summary text 2"]
  }
}
utils/populate_cutset.py merges QA/summary annotations into an existing Lhotse cutset, storing all prompts and ground-truth answers in cut.custom["speakers"].
for SPLIT in train dev eval; do
    python utils/populate_cutset.py \
        --cutset_path ${MANIFEST_DIR}/notsofar1/notsofar1_sdm_${SPLIT}_set_*_cutset.jsonl.gz \
        --split ${SPLIT} \
        --qa_dir data/nsf_qa/qa_annotations \
        --summary_dir data/nsf_qa/qa_annotations/summaries
done
The populated cutset contains one entry per original cut, each with a custom.speakers dict:
cut.custom["speakers"] = {
    "SPK1": [
        {"prompt": "Summarize what this speaker said.", "gt_answer": "...", "qa_type": "summary", ...},
        {"prompt": "What did the speaker propose?", "gt_answer": "...", "qa_type": "content", ...},
    ],
    ...
}
This format is consumed directly by TS_QA_Dataset during training and evaluation.
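The merge step can be pictured roughly as follows. This is a simplified sketch, not the logic of utils/populate_cutset.py: the function `build_speaker_prompts` is hypothetical, copying `category` into `qa_type` is an assumption suggested by the examples above, the answer and summary texts are made up, and any mapping between annotation speaker names and cutset speaker IDs is left out.

```python
from collections import defaultdict

SUMMARY_PROMPT = "Summarize what this speaker said."


def build_speaker_prompts(qa_records, speaker_summaries, session_id):
    """Group summaries and QA records for one session into a per-speaker dict."""
    speakers = defaultdict(list)
    # One summary prompt per ground-truth summary text.
    for speaker, summaries in speaker_summaries.items():
        for text in summaries:
            speakers[speaker].append(
                {"prompt": SUMMARY_PROMPT, "gt_answer": text, "qa_type": "summary"}
            )
    # One QA prompt per record that belongs to this session.
    for rec in qa_records:
        if rec["session_id"] != session_id:
            continue
        speakers[rec["speaker"]].append(
            {"prompt": rec["question"], "gt_answer": rec["answer"], "qa_type": rec["category"]}
        )
    return dict(speakers)


# Illustrative records; the answer and summary texts are invented for this example.
qa_records = [
    {"session_id": "MTG_30860", "speaker": "Peter", "question": "What did the speaker propose?",
     "answer": "A new deadline.", "category": "content", "type": "entity"},
    {"session_id": "MTG_00000", "speaker": "Anna", "question": "Who agreed?",
     "answer": "Everyone.", "category": "content", "type": "entity"},
]
speaker_summaries = {"Peter": ["Peter proposed a new deadline."]}

speakers = build_speaker_prompts(qa_records, speaker_summaries, "MTG_30860")
print(speakers)
```

Records from other sessions are skipped, so each cut only carries the prompts that refer to its own recording.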
sbatch ./scripts/submit_slurm.sh +train=qa_ft
Key differences from ASR training (configs/train/qa_ft.yaml):
- training.train_for_qa: True (switches dataset, collator, and checkpoint selection)
- training.predict_with_generate: False (uses loss for checkpoint selection during training)
- training.metric_for_best_model: eval_<split>_loss
- data.train_cutsets / dev_cutsets / eval_cutsets (point to *_qa.jsonl.gz cutsets)
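Loss-based checkpoint selection amounts to keeping the checkpoint with the lowest value of the chosen metric. The sketch below illustrates the idea only: `best_checkpoint`, the checkpoint names, and the loss values are hypothetical, and `eval_dev_loss` is an assumed instance of the eval_<split>_loss pattern.

```python
def best_checkpoint(evaluations, metric):
    """Return the checkpoint name with the lowest `metric` (lower loss is better)."""
    scored = [(metrics[metric], name) for name, metrics in evaluations.items() if metric in metrics]
    if not scored:
        raise ValueError(f"no checkpoint reports {metric!r}")
    return min(scored)[1]


# Hypothetical evaluation results keyed by checkpoint name.
evaluations = {
    "checkpoint-500": {"eval_dev_loss": 1.42},
    "checkpoint-1000": {"eval_dev_loss": 1.17},
    "checkpoint-1500": {"eval_dev_loss": 1.23},
}
print(best_checkpoint(evaluations, "eval_dev_loss"))
```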
# ASR decode with GT diarization
sbatch ./scripts/submit_slurm.sh +decode=w_lora
# QA decode with GT diarization
sbatch ./scripts/submit_slurm.sh +decode=qa_ft
For optimal performance, recordings longer than ~5 minutes should be chunked before decoding. utils/chunk_longform_cutset.py splits a GT cutset (and optionally an aligned diarization-predicted cutset) at silence boundaries:
python utils/chunk_longform_cutset.py \
    ${MANIFEST_DIR}/<corpus>/<corpus>_cutset_test.jsonl.gz \
    [path/to/diar_predicted_cutset.jsonl.gz] \
    --target-duration 300 \
    --output-dir ${MANIFEST_DIR}/<corpus>/
Output filenames are derived from the input basenames with _<duration>s appended (e.g. <corpus>_cutset_test_300s.jsonl.gz).
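As an illustration of what splitting at silence boundaries can mean, the sketch below picks cut times greedily from the midpoints of silent gaps so that chunks stay at or below a target duration whenever a gap is available. It is one possible strategy under stated assumptions, not the algorithm used by utils/chunk_longform_cutset.py; the function `split_points` and its input format are hypothetical.

```python
def split_points(segments, total_duration, target=300.0):
    """Pick cut times inside silence so that chunks stay near `target` seconds.

    segments: (start, end) speech intervals in seconds, possibly overlapping.
    Returns the sorted cut times, excluding 0 and total_duration.
    """
    # Merge overlapping speech so that the remaining gaps are true silence.
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Candidate cuts are the midpoints of the silent gaps.
    candidates = [(a[1] + b[0]) / 2 for a, b in zip(merged, merged[1:])]

    cuts, chunk_start, previous = [], 0.0, None
    for cand in candidates + [total_duration]:
        # Extending the chunk to `cand` would overshoot: cut at the last gap seen.
        if cand - chunk_start > target and previous is not None:
            cuts.append(previous)
            chunk_start, previous = previous, None
        if cand < total_duration:
            previous = cand
    return cuts


segments = [(0, 100), (90, 105), (110, 250), (260, 420), (430, 600)]
cuts = split_points(segments, total_duration=600.0, target=300.0)
print(cuts)
```

If a stretch of speech longer than the target contains no silence, this sketch leaves it as one oversized chunk rather than cutting through speech.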
For optimal performance, multi-channel recordings should be reduced to a single channel before decoding. Two options:
Select a specific channel: utils/select_channel.py extracts one channel from a recordings + supervisions manifest pair into a cutset:
python utils/select_channel.py \
    --input-recset ${MANIFEST_DIR}/<corpus>/recordings.jsonl.gz \
    --input-supset ${MANIFEST_DIR}/<corpus>/supervisions.jsonl.gz \
    --channel 4 \
    --output ${MANIFEST_DIR}/<corpus>/<corpus>_cuts_ch4.jsonl.gz
Sum all channels: set data.load_signal_sum: true in your config to average all channels to mono at data-loading time (no manifest preprocessing needed).
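The two options differ only in how the channel dimension is collapsed. The following sketch shows both on plain Python lists; the helper names are hypothetical, and the repository's data loader presumably operates on arrays rather than lists.

```python
def select_channel(channels, index):
    """Return a copy of one channel from a list of equal-length sample lists."""
    return list(channels[index])


def average_channels(channels):
    """Average all channels sample by sample to obtain a mono signal."""
    n = len(channels)
    return [sum(frame) / n for frame in zip(*channels)]


# Two channels, three samples each (values chosen to be exact in floating point).
audio = [
    [0.5, 1.0, -0.5],
    [0.0, 0.0, 0.25],
]
single = select_channel(audio, 1)
mono = average_channels(audio)
print(single, mono)
```

Averaging keeps the mixed signal within the original amplitude range, whereas a plain sum could exceed it when channels are correlated.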
To decode with predicted (rather than ground-truth) diarization, first obtain diarized cutsets using the diarization pipeline from TS-ASR-Whisper (scripts/diarize.sh).
Then wire the resulting cutsets into your decode config via data.dev_diar_cutsets / data.eval_diar_cutsets, following the pattern in configs/decode/dicow_v3_beam_joint_diar.yaml.
Hydra configs are modular. Each file starts with:
# @package _global_
and overrides configs/base.yaml. A config can inherit from another using defaults:
# @package _global_
defaults:
  - /train/base
All available training/data/model parameters are documented in src/utils/training_args.py.
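As a rough mental model of this layering, an overriding config behaves like a recursive dictionary merge in which the later file wins on conflicting keys. The sketch below is plain Python, not Hydra or OmegaConf code; `merge_config` is a hypothetical helper and the values are illustrative.

```python
import copy


def merge_config(base, override):
    """Recursively merge `override` into a copy of `base`; `override` wins on conflicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative values only; the key names follow the options mentioned in this README.
base = {
    "training": {"use_lora": False, "train_for_qa": False},
    "data": {"train_cutsets": []},
}
qa_ft = {"training": {"use_lora": True, "train_for_qa": True}}

config = merge_config(base, qa_ft)
print(config)
```

Keys that the overriding config does not mention, such as `data.train_cutsets` here, keep their base values.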
| Variable | Used for |
|---|---|
| SRC_ROOT | Python path root |
| MANIFEST_DIR | Lhotse manifest directory |
| EXPERIMENT_PATH | Checkpoint and log output |
| MUSAN_ROOT | MUSAN noise augmentation |
Export a trained checkpoint to the Hugging Face Hub:
- Create a model card at export_sources/readmes/<HUB_MODEL_NAME>.md
- Optionally update export_sources/generation_config.json
- Run:
python ./export_dixtral.py \
    --model_path <MODEL_DIR> \
    --model_name <HUB_MODEL_NAME> \
    --org <HUB_ORG>
Source code is licensed under the Apache License 2.0.
If you use this code or models, please cite:
Issues and pull requests are welcome.