AVSD-GER

Identity-conditioned, closed-loop generative error correction for multi-speaker audio-visual speech recognition.

The project adds three contributions on top of the dual-hypothesis GER paradigm: a cross-modal identity pool (C1), ID-conditioned token-level alignment with speaker-aware GER (C2), and closed-loop, confidence-gated feedback (C3) that refines the pool online. See docs/ARCHITECTURE.md for the design and docs/RELATED_WORK.md for the comparison with DualHyp / AVSD / DiarizationLM. Raw meeting-video diarization frontends are tracked separately in docs/AVSD_FRONTENDS.md, which covers oracle turns, the common pyannote+ASD frontend, strong Sortformer/Precision-2 style frontends, and degraded robustness profiles.

Install (conda)

The project is tested with Python 3.10. Keep pip==24.0 because fairseq==0.12.2 depends on older metadata that newer pip versions reject.

Fast path: `conda env create`

git clone https://github.com/TroyHow0413/avsd_ger.git
cd avsd_ger

conda env create -f environment.yml
conda activate avsdger

Then force-install the correct PyTorch CUDA wheel for your GPU. This step is intentional: several pip packages in the environment depend on torch, so a plain conda env create may let pip choose a generic torch wheel before you pin the right CUDA build.

# RTX 50-series / Blackwell, e.g. RTX 5080 or 5090, sm_120:
pip install --upgrade --force-reinstall \
    "torch>=2.7" "torchaudio>=2.7" "torchvision>=0.22" \
    --index-url https://download.pytorch.org/whl/cu128

# Older GPUs, e.g. A100/H100/RTX 30/40-series:
# pip install --upgrade --force-reinstall \
#     "torch==2.6.*" "torchaudio==2.6.*" "torchvision==0.21.*" \
#     --index-url https://download.pytorch.org/whl/cu124

Do not add --no-deps to the torch command. The torch wheel must pull its exact CUDA runtime packages (cublas, cudnn, cufft, cusparse, nccl, nvtx) together with its pinned triton and sympy dependencies.
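The choice between the two wheel indexes can be written as a small lookup. This is only a sketch: torch_index_url is a hypothetical helper that does not exist in this repo, and it encodes nothing beyond the rule stated above (sm_120, i.e. compute capability 12.0, takes cu128; the older GPUs listed take cu124). Architectures not mentioned in this README are outside its scope.

```python
from typing import Tuple

PYTORCH_WHEEL_ROOT = "https://download.pytorch.org/whl"


def torch_index_url(compute_capability: Tuple[int, int]) -> str:
    """Return the PyTorch wheel index for a (major, minor) compute capability.

    Assumption: capability 12.0 (sm_120, RTX 50-series) and above need cu128;
    the older GPUs named in this README (A100, H100, RTX 30/40) can use cu124.
    """
    tag = "cu128" if compute_capability >= (12, 0) else "cu124"
    return f"{PYTORCH_WHEEL_ROOT}/{tag}"


if __name__ == "__main__":
    print(torch_index_url((12, 0)))  # RTX 5080 / 5090
    print(torch_index_url((8, 0)))   # A100
```

On a machine that already has a working torch, torch.cuda.get_device_capability() returns the tuple this helper expects.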

Clean deterministic path

This avoids the temporary "wrong torch first, correct torch second" behavior of the fast path:

conda create -n avsdger python=3.10 pip=24.0 -y
conda activate avsdger

conda install -c conda-forge -y \
    "numpy>=1.24,<2.0" "scipy>=1.11" "pyyaml>=6.0" "tqdm>=4.66" \
    libsndfile ffmpeg openh264

# Choose ONE torch line:
pip install \
    "torch>=2.7" "torchaudio>=2.7" "torchvision>=0.22" \
    --index-url https://download.pytorch.org/whl/cu128

# pip install \
#     "torch==2.6.*" "torchaudio==2.6.*" "torchvision==0.21.*" \
#     --index-url https://download.pytorch.org/whl/cu124

pip install -r requirements.txt

Dependency notes:

Dependency group	Packages	Why
Interpreter	`python=3.10`, `pip=24.0`	Python 3.11+ is fragile with AV-HuBERT/fairseq; pip 24.1+ rejects fairseq's old metadata.
Numeric / IO	`numpy>=1.24,<2.0`, `scipy`, `pyyaml`, `tqdm`	`numpy<2` keeps InsightFace/ONNX/OpenCV ABI compatibility. The pip block repeats the NumPy pin because transitive pip deps can otherwise upgrade it after conda solves the env.
System media libs	`libsndfile`, `ffmpeg`, `openh264`	Required by `soundfile`, `librosa`, and video/audio decode paths.
PyTorch	`torch`, `torchaudio`, `torchvision` from the PyTorch CUDA wheel index	RTX 50-series needs cu128 wheels with `sm_120`; older GPUs can use cu124.
ASR / text backbone	`faster-whisper`, `transformers`, `tokenizers`, `huggingface_hub>=0.34,<1.0`, `accelerate`	Whisper and transformer model loading. `huggingface_hub>=0.34` provides the `hf` CLI used below.
GER head	`peft`, `sentencepiece`, `bitsandbytes`	Llama-3 LoRA and optional quantized loading.
Identity encoders	`speechbrain`, `insightface`, `onnxruntime`	ECAPA-TDNN voice embeddings and ArcFace face embeddings.
Audio / video Python IO	`librosa`, `opencv-python`, `opencv-python-headless`, `soundfile`, `python_speech_features`	Feature extraction and AV-HuBERT logfbank support. Both OpenCV wheels are pinned `<4.10` so `albumentations`/`insightface` cannot pull an OpenCV build that forces NumPy 2.x.
Monitoring / logging	`nvidia-ml-py`, `psutil`, `wandb`	Power logging and experiment tracking.

Verify the environment

python -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print(torch.cuda.get_arch_list())"
python -c "import numpy, scipy, cv2, soundfile, librosa; print('core io ok')"
python -c "import transformers, faster_whisper, peft, speechbrain, insightface, onnxruntime; print('model deps ok')"

For RTX 5080/5090, torch.cuda.get_arch_list() must contain sm_120.
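The same check can be scripted so that it fails loudly instead of relying on reading the printed list. The helper names below are illustrative and not part of the repo; torch is imported lazily so the pure list check can be used without it.

```python
from typing import Iterable


def supports_arch(arch_list: Iterable[str], required: str = "sm_120") -> bool:
    """True if the required architecture string is in the given arch list."""
    return required in set(arch_list)


def check_installed_torch(required: str = "sm_120") -> bool:
    """Check the installed torch build; needs torch and a visible GPU."""
    import torch  # lazy import: only needed for the live check

    return torch.cuda.is_available() and supports_arch(
        torch.cuda.get_arch_list(), required
    )
```

Calling check_installed_torch() on an RTX 5080/5090 host should return True only when the cu128 wheel is the one actually installed.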

Conda troubleshooting

On some Windows Anaconda installs, conda env create can fail before dependency solving because the conda-anaconda-tos plugin cannot read its cache. Use:

conda --no-plugins env create -f environment.yml --solver classic

If this hangs during solving, use the clean deterministic path above.

Linux: keep openh264 installed. Without it, ffmpeg may fail at runtime with "libopenh264.so.5: cannot open shared object file".

If an already-created server env prints an error like "A module that was compiled using NumPy 1.x cannot be run in NumPy 2.2.6", pip upgraded NumPy after conda created the env. This usually comes from a transitive opencv-python-headless install. Repair it in place with:

conda activate avsdger
python -m pip uninstall -y numpy opencv-python opencv-python-headless
python -m pip install "numpy>=1.24,<2.0" "opencv-python>=4.9,<4.10" "opencv-python-headless>=4.9,<4.10"
python -c "import numpy, cv2; print(numpy.__version__, cv2.__version__)"

Expected: NumPy 1.26.x and OpenCV 4.9.x.
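To confirm the pins without eyeballing version strings, a small range check over installed package metadata is enough. This is a sketch with hypothetical helper names; the ranges are the ones from the repair command above, and the version parser is deliberately simple (leading digits of each dot-separated piece), not a full PEP 440 implementation.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Ranges copied from the repair command: lower bound inclusive, upper exclusive.
PINS = {
    "numpy": ("1.24", "2.0"),
    "opencv-python": ("4.9", "4.10"),
    "opencv-python-headless": ("4.9", "4.10"),
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.26.4' into (1, 26, 4) using the leading digits of each piece."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def in_range(version: str, lo: str, hi: str) -> bool:
    """True if lo <= version < hi under tuple comparison."""
    return parse_version(lo) <= parse_version(version) < parse_version(hi)


def check_pins(pins: Dict[str, Tuple[str, str]] = PINS) -> Dict[str, Optional[bool]]:
    """Map each package to True/False, or None if it is not installed."""
    report: Dict[str, Optional[bool]] = {}
    for name, (lo, hi) in pins.items():
        try:
            report[name] = in_range(metadata.version(name), lo, hi)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report
```

Running check_pins() inside the avsdger env returns a dict; any False entry means that package drifted outside its pin.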

Manual steps after `conda activate avsdger`

# fairseq from source — required by AV-HuBERT, PyPI fairseq is too stale
pip install "git+https://github.com/facebookresearch/fairseq.git@v0.12.2"

# AV-HuBERT repo on PYTHONPATH, plus a checkpoint-config compatibility patch.
git clone https://github.com/facebookresearch/av_hubert.git
source scripts/setup_avhubert_env.sh

# Equivalent one-off command if you only want to patch an existing server clone:
# python - <<'PY'
# from pathlib import Path
# p = Path("av_hubert/avhubert/hubert_pretraining.py")
# s = p.read_text()
# n = '    fine_tuning: bool = field(default=False, metadata={"help": "set to true if fine-tuning AV-Hubert"})\n'
# if "input_modality:" not in s:
#     p.write_text(s.replace(n, n + '    input_modality: Optional[str] = field(default="audiovisual", metadata={"help": "input modality: audio | video | audiovisual"})\n', 1))
# PY

# ⚠️  Windows 11 (native, not WSL): bash activate hooks don't run.
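# Use one of these alternatives instead, replacing the example
# D:\GitHub\avsd_ger_claude paths with the location of your own av_hubert clone: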
#
# Option A — conda env var (recommended, persists across activations):
#     conda env config vars set PYTHONPATH="D:\GitHub\avsd_ger_claude\av_hubert;D:\GitHub\avsd_ger_claude\av_hubert\avhubert"
#     conda deactivate && conda activate avsdger   # apply immediately
#
# Option B — batch activate hook (PowerShell users: create avhubert_path.bat):
#     New-Item -Force "$env:CONDA_PREFIX\etc\conda\activate.d\avhubert_path.bat"
#     Add-Content "$env:CONDA_PREFIX\etc\conda\activate.d\avhubert_path.bat" `
#         "@set PYTHONPATH=D:\GitHub\avsd_ger_claude\av_hubert;D:\GitHub\avsd_ger_claude\av_hubert\avhubert;%PYTHONPATH%"

# Gated Llama-3-8B-Instruct access (only when stub_backbones=false).
# The `hf` CLI requires huggingface_hub>=0.34. If `hf` is missing, upgrade
# the package inside the active env first:
python -m pip install --upgrade "huggingface_hub>=0.34,<1.0"
hf auth login                         # paste a Read-scope token from https://huggingface.co/settings/tokens
hf auth whoami                        # sanity check
# On older environments that you cannot upgrade yet, use: huggingface-cli login
# then request access at https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

Windows 11 — known quirks after `pip install -r requirements.txt`

Package	Issue	Fix
`bitsandbytes`	The PyPI wheel is Linux-only; the Windows build is at a different index.	`pip install bitsandbytes --index-url https://jllllll.github.io/bitsandbytes-windows-webui` — or skip it entirely if you're not doing 4-/8-bit Llama loading.
DataLoader `num_workers`	Windows uses `spawn` (not `fork`) for multiprocessing, so any script that sets `num_workers > 0` without a `__main__` guard can crash with a bootstrapping `RuntimeError` or hang while workers start.	Wrap the entry point of any custom script in `if __name__ == '__main__':`, or pass `--num-workers 0` / set `num_workers=0` in the DataLoader calls. All scripts in `scripts/` already have the guard.
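For custom scripts, both fixes from the table fit in a few lines. The sketch below is not taken from the repo: safe_num_workers is a hypothetical helper that falls back to 0 workers on native Windows, and the DataLoader construction is left as a comment so the snippet runs without torch.

```python
import sys


def safe_num_workers(requested: int, platform: str = sys.platform) -> int:
    """Return 0 on native Windows (spawn start method), else the requested count."""
    return 0 if platform == "win32" else requested


def main() -> None:
    workers = safe_num_workers(4)
    print(f"num_workers={workers}")
    # A real script would build its loader here, for example:
    # loader = torch.utils.data.DataLoader(dataset, num_workers=workers)


# The guard keeps spawned worker processes from re-running main() on import.
if __name__ == "__main__":
    main()
```

With the guard in place, num_workers > 0 is also usable on Windows; the fallback to 0 is just the more conservative of the two options.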

Backbone weights (only when running real models)

Whisper-large-v3 — auto-downloaded into checkpoints/whisper/ on first use (faster-whisper CT2 weights plus the Hugging Face encoder/rescore weights). This directory is portable; upload it with checkpoints/ to avoid slow server downloads.
AV-HuBERT Large — drop the .pt at checkpoints/avhubert_large_lrs3_iter5.pt (path in configs/default.yaml).
ECAPA-TDNN — auto from SpeechBrain.
InsightFace buffalo_l — auto on first use.
Llama-3-8B-Instruct — pulled by the GER head once hf auth login is done. The hf CLI ships with huggingface_hub>=0.34; older envs may only have the legacy huggingface-cli login.

The repo defaults to stub_backbones: true in configs/default.yaml, so you can verify wiring without any of the above.
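Of the weights listed above, only the AV-HuBERT checkpoint has to be placed by hand, so a preflight check before switching off stub mode can be very small. The function below is an illustrative sketch (its name and signature are assumptions, not repo code); it takes the stub flag as an argument rather than parsing configs/default.yaml.

```python
from pathlib import Path
from typing import List

# Path stated in this README (and configured in configs/default.yaml).
AVHUBERT_CKPT = Path("checkpoints/avhubert_large_lrs3_iter5.pt")


def missing_manual_weights(stub_backbones: bool, repo_root: Path) -> List[Path]:
    """List manually placed weight files that are required but absent.

    In stub mode nothing is required, so the list is always empty.
    """
    if stub_backbones:
        return []
    ckpt = repo_root / AVHUBERT_CKPT
    return [] if ckpt.is_file() else [ckpt]


if __name__ == "__main__":
    for path in missing_manual_weights(stub_backbones=False, repo_root=Path(".")):
        print(f"missing: {path}")
```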

Current Workflow

The old Phase 0 / A / B / C / D / E / F / G names were rollout labels from the first implementation pass. They are not runtime modes, and the training scripts do not read a --phase argument. The old notes were moved to docs/LEGACY_PHASE_ROLLOUT.md.

For current day-to-day work, use the concrete scripts:

Task	Script	Notes
Enroll speakers	`scripts/enroll_identity.py`	Builds or updates an identity pool.
Single utterance smoke test	`scripts/run_sample.py`	Runs C1 -> C2 -> C3 on one utterance.
Stage-1 training	`scripts/train_identity.py`	Trains the C1 identity fuser with bidirectional InfoNCE.
Stage-2 training	`scripts/train_stage2.py`	Trains alignment, CTC, GER LoRA/QFormer pieces depending on `--warmup`.
Ablation eval	`scripts/eval_ablations.py`	Runs the 5 ablation rows and writes metrics.
Convenience launcher	`one_go/train.py`	Optional wrapper around Stage-1 and Stage-2; useful for quick local runs, not required.
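For readers unfamiliar with the Stage-1 objective, bidirectional InfoNCE averages an audio-to-video and a video-to-audio contrastive cross-entropy over a batch of paired embeddings. The pure-Python sketch below illustrates that general technique only; the repo's actual loss under avsd_ger/training/ may differ, and the temperature value and the assumption of L2-normalized inputs are mine.

```python
import math
from typing import List, Sequence


def _cross_entropy_diag(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def bidirectional_info_nce(
    audio: Sequence[Sequence[float]],
    video: Sequence[Sequence[float]],
    temperature: float = 0.07,  # assumed value, not taken from the repo
) -> float:
    """Symmetric InfoNCE over paired (audio[i], video[i]) embeddings.

    Inputs are assumed to be L2-normalized, so dot products are cosine
    similarities.
    """
    sim = [
        [sum(a * v for a, v in zip(ai, vj)) / temperature for vj in video]
        for ai in audio
    ]
    sim_t = [list(col) for col in zip(*sim)]  # video -> audio direction
    return 0.5 * (_cross_entropy_diag(sim) + _cross_entropy_diag(sim_t))
```

Perfectly matched, mutually orthogonal pairs drive the loss toward zero, while embeddings that cannot be told apart give log(N) for a batch of N pairs.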

Minimal Stub Check

configs/default.yaml defaults to stub_backbones: true, so this checks wiring without downloading real model weights.

python scripts/enroll_identity.py --manifest data/sample_manifest.json
python scripts/run_sample.py --manifest data/sample_manifest.json --utt utt_0001

Optional stub ablation check:

python scripts/eval_ablations.py \
    --config configs/default.yaml \
    --manifest data/sample_session_manifest.json \
    --pool checkpoints/identity_pool.pt \
    --out out/ablation_report_stub.json \
    --no-power

Real Training

Prepare the real backbones first: set stub_backbones: false, put AV-HuBERT at checkpoints/avhubert_large_lrs3_iter5.pt, log in to Hugging Face for Llama-3, and let Whisper/ECAPA/InsightFace auto-cache on first use.

Stage-1:

python scripts/train_identity.py \
    --config configs/default.yaml \
    --manifest data/your_real_train_manifest.jsonl \
    --out checkpoints/stage1/ \
    --epochs 5 \
    --wandb-project avsd-ger \
    --wandb-run-name stage1-real-v1

Stage-2:

python scripts/train_stage2.py \
    --config configs/default.yaml \
    --manifest data/your_real_train_manifest.jsonl \
    --stage1-pool checkpoints/stage1/identity_pool_stage1.pt \
    --out checkpoints/stage2/ \
    --warmup joint \
    --wandb-project avsd-ger \
    --wandb-run-name stage2-real-v1

Useful Stage-2 --warmup modes:

Mode	Trains	When to use
`joint`	aligner + CTC + identity fuser + GER projectors/LoRA	Full Stage-2 run.
`align_ctc`	aligner + CTC only	Debug alignment/CTC before loading the LLM path.
`ger_lora`	GER LoRA path only	Lower-memory GER text n-best training. Can pair with `--no-encoder-context`.
`ger_qformer`	GER LoRA + QFormer/id projector	Train GER soft-prefix pieces without the full joint objective.
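The four modes behave like a closed set of choices. The parser below is a hypothetical stand-in, not the real train_stage2.py argument parser: it uses only the flag names that appear in this README, and it makes --warmup required because the script's real default is not documented here.

```python
import argparse

# Mode names copied from the table above.
WARMUP_MODES = ("joint", "align_ctc", "ger_lora", "ger_qformer")


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser that rejects any --warmup value outside the table."""
    parser = argparse.ArgumentParser(description="Illustrative Stage-2 flags")
    parser.add_argument("--warmup", choices=WARMUP_MODES, required=True)
    parser.add_argument("--no-encoder-context", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--warmup", "ger_lora", "--no-encoder-context"])
    print(args.warmup, args.no_encoder_context)
```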

Low-memory GER example:

python scripts/train_stage2.py \
    --config configs/default.yaml \
    --manifest data/your_real_train_manifest.jsonl \
    --stage1-pool checkpoints/stage1/identity_pool_stage1.pt \
    --out checkpoints/stage2_ger_lora/ \
    --warmup ger_lora \
    --no-encoder-context \
    --llm-quant 4bit

Final Eval

train_stage2.py saves trained fuser weights, but it does not enroll test speakers into identity_pool_stage2.pt. Use one of these two valid eval patterns.

Recommended for per-meeting AMI eval: let eval_ablations.py load the trained fuser and fresh-enroll speakers from each session manifest:

python scripts/eval_ablations.py \
    --config configs/default.yaml \
    --manifest data/your_real_test_session_manifest.json \
    --pool checkpoints/stage2/identity_pool_stage2.pt \
    --fresh-pool \
    --out out/ablation_report_real.json \
    --idle-calibrate-s 2.0 \
    --wandb-project avsd-ger \
    --wandb-run-name ablation-final-v1

Alternative for debugging or deployment-style fixed enrollment: pre-enroll once, then evaluate without --fresh-pool:

python scripts/enroll_identity.py \
    --manifest data/your_real_test_speakers.json \
    --in-pool checkpoints/stage2/identity_pool_stage2.pt \
    --out-pool checkpoints/stage2/identity_pool_stage2_enrolled.pt

python scripts/eval_ablations.py \
    --config configs/default.yaml \
    --manifest data/your_real_test_session_manifest.json \
    --pool checkpoints/stage2/identity_pool_stage2_enrolled.pt \
    --out out/ablation_report_real.json \
    --idle-calibrate-s 2.0 \
    --wandb-project avsd-ger \
    --wandb-run-name ablation-final-v1

When eval_ablations.py starts, check that the pool/enrollment log shows a non-zero speaker count.
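The two eval patterns differ only in which pool file is passed and whether --fresh-pool is set, which makes them easy to drive from a script. The builder below is a sketch under that reading; eval_command is a hypothetical helper and it uses only flags shown in the commands above.

```python
from typing import List


def eval_command(manifest: str, pool: str, out: str, fresh_pool: bool) -> List[str]:
    """Build the argv list for scripts/eval_ablations.py.

    fresh_pool=True  -> trained-fuser pool, speakers enrolled per session.
    fresh_pool=False -> pool that was pre-enrolled with enroll_identity.py.
    """
    cmd = [
        "python", "scripts/eval_ablations.py",
        "--config", "configs/default.yaml",
        "--manifest", manifest,
        "--pool", pool,
        "--out", out,
    ]
    if fresh_pool:
        cmd.append("--fresh-pool")
    return cmd


if __name__ == "__main__":
    print(" ".join(eval_command(
        "data/your_real_test_session_manifest.json",
        "checkpoints/stage2/identity_pool_stage2.pt",
        "out/ablation_report_real.json",
        fresh_pool=True,
    )))
```

The resulting list can be passed directly to subprocess.run(cmd, check=True).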

Optional One-Go Launcher

one_go/train.py is only a convenience wrapper. It writes a runtime config under one_go/runs/ and then calls the same Stage-1/Stage-2 scripts above.

python one_go/train.py --stage all --manifest data/your_real_train_manifest.jsonl --real --device cuda

W&B flags

All three scripts (train_identity.py, train_stage2.py, eval_ablations.py) share the same set of W&B CLI flags via avsd_ger.wandb_logger.add_wandb_args:

Flag	What it sets	Default
`--wandb-project NAME`	`wandb.init(project=...)`	`avsd-ger`
`--wandb-run-name NAME`	`wandb.init(name=...)`	auto-derived from script + manifest stem
`--wandb-entity TEAM`	`wandb.init(entity=...)`	`$WANDB_ENTITY` env, else your default
`--wandb-tags T1 T2 ...`	`wandb.init(tags=[...])`	none
`--no-wandb`	Disables logging entirely	(logging on)

If wandb is not installed or --no-wandb is set, the logger silently no-ops — the rest of the script runs unaffected. To enable live logging:

pip install wandb           # already in environment.yml / requirements.txt
wandb login                 # paste your W&B API key from https://wandb.ai/authorize
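The silent no-op behaviour described above is a common wrapper pattern. The class below sketches that pattern and is not the code in avsd_ger/wandb_logger.py; the class name and constructor arguments are assumptions, and only wandb.init, run.log, and run.finish are used from the wandb API.

```python
from typing import Any, Dict, Optional


class MaybeWandb:
    """Log to W&B when possible; otherwise do nothing, without raising."""

    def __init__(self, project: str, run_name: Optional[str] = None,
                 disabled: bool = False) -> None:
        self._run = None
        if disabled:
            return  # equivalent of passing --no-wandb
        try:
            import wandb  # optional dependency
        except ImportError:
            return  # wandb not installed: stay in no-op mode
        self._run = wandb.init(project=project, name=run_name)

    @property
    def enabled(self) -> bool:
        return self._run is not None

    def log(self, metrics: Dict[str, Any], step: Optional[int] = None) -> None:
        if self._run is not None:
            self._run.log(metrics, step=step)

    def finish(self) -> None:
        if self._run is not None:
            self._run.finish()
```

Training code can then call logger.log({...}) unconditionally, and the same script works with or without W&B.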

The metric namespaces written by each script:

Script	W&B keys
`train_identity.py`	`stage1/loss/{total,A->V,V->A}`, `stage1/acc/{A->V,V->A}`, `stage1/lr`, `stage1/cold_start/{K,n_unknown}`
`train_stage2.py`	`stage2/loss/{total,ctc,ger,info}`, `stage2/lr`, `stage2/epoch_end/{ctc,ger,info}`
`eval_ablations.py`	`ablation/<row>/{sa_wer,wer,scr,av_sid_acc,der,jer,energy_wh,avg_power_w}`, `summary/<row>/<metric>`, `summary/spec_check_c3_gate_pass`
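The brace notation in the table is shorthand for several concrete keys. A small expander makes the full key list explicit, which helps when building W&B dashboards or filters; expand_keys is an illustrative helper and not part of the repo.

```python
import re
from typing import List

_BRACES = re.compile(r"\{([^{}]*)\}")


def expand_keys(pattern: str) -> List[str]:
    """Expand 'a/{x,y}' into ['a/x', 'a/y']; patterns without braces pass through."""
    match = _BRACES.search(pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    keys: List[str] = []
    for option in match.group(1).split(","):
        # Recurse so that patterns with more than one brace group also expand.
        keys.extend(expand_keys(head + option + tail))
    return keys


if __name__ == "__main__":
    for key in expand_keys("stage2/loss/{total,ctc,ger,info}"):
        print(key)
```

Placeholders such as <row> and <metric> are left untouched, since their values depend on the ablation rows in the run.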

Scripts

Script	What it does	Detailed docs
`scripts/enroll_identity.py`	Enroll speakers into the cross-modal identity pool. Supports `--in-pool` to load a trained-fuser pool before evaluation enrollment.	`docs/ARCHITECTURE.md#c1`
`scripts/run_sample.py`	Run one utterance end-to-end (single-speaker path).	`docs/ARCHITECTURE.md`
`scripts/train_identity.py`	Stage-1: identity fuser training with bidirectional InfoNCE.	`docs/TRAINING.md`
`scripts/train_stage2.py`	Stage-2 multi-task training with selectable `--warmup` modes. Enforces the `lr_stage2` to `lr_stage1` ratio invariant at runtime.	`docs/TRAINING.md`
`scripts/eval_ablations.py`	Run the five spec ablation rows on a session manifest, write metrics + energy.	`docs/EVALUATION.md#ablation-runner`
`one_go/train.py`	Optional convenience wrapper that calls Stage-1 and/or Stage-2 training. Not required for normal training.	See Current Workflow.

Layout

avsd_ger/
├── backbones/        # Whisper + AV-HuBERT wrappers (frozen)
├── c1_identity/      # ECAPA + ArcFace + IdentityPool + dual-gate + cold-start
├── c2_alignment/     # ID-conditioned aligner + GER head (Llama-3-8B + LoRA)
├── c3_feedback/      # Composite confidence + closed-loop controller
├── training/         # CTC head, GER cross-entropy, identity (InfoNCE) loss
├── eval/             # SessionRunner, metrics (SA-WER/SCR/AV-SID/DER/JER), PowerMonitor
├── wandb_logger.py   # uniform W&B wrapper (no-op if wandb missing)
└── pipeline.py       # C1 -> C2 -> C3 orchestrator
configs/default.yaml  # all hyperparameters; ablation flags live under `ablation:`
data/                 # sample manifests for stub rehearsal
docs/                 # design + rollout docs (see table below)
scripts/              # CLI entry points

Documentation index

File	Purpose
`docs/ARCHITECTURE.md`	C1/C2/C3 module-level design, data shapes, key implementation choices.
`docs/RELATED_WORK.md`	Side-by-side comparison vs. DualHyp, AVSD, DiarizationLM.
`docs/TRAINING.md`	Stage-1 / Stage-2 recipes, loss weights, the spec §7 LR invariant.
`docs/REAL_MODEL_WORKFLOW.md`	Current real-model setup, manifest expectations, Stage-1/Stage-2 commands, re-enrollment, and eval workflow.
`docs/EVALUATION.md`	Manifest format, the five primary metrics, power monitor, ablation runner.
`docs/LEGACY_PHASE_ROLLOUT.md`	Archived Phase 0/A-G rollout notes from the old README. Not the current training workflow.

Status

The skeleton and spec-aligned wiring are complete, and the modules pass AST-parse and cross-import checks. Stub-mode enrollment, sample run, Stage-1, Stage-2, and ablation eval have working script paths. Current training uses scripts/train_identity.py, scripts/train_stage2.py, or the optional one_go/train.py wrapper.
