Industrial audio OPD training stack for ASR and TTS, distilling compact audio models from stronger teacher models.
Chinese documentation: README_zh.md
Announcements
- 2026.05.28: ARK-ASR-0.6B online demo is available. Try the ASR model directly in the Hugging Face Space.
- 2026.05.27: The ARK-ASR OPD paper is available on arXiv. Read Data-Efficient On-Policy Distillation for Automatic Speech Recognition.
- 2026.05.25: open-audio-opd is available on GitHub. The repository contains the industrial ASR online policy distillation training stack with FSDP2 distributed training.
- 2026.05.25: ARK-ASR-0.6B model weights are available. Download the compact ASR student checkpoint from Hugging Face.
- TTS OPD is on the roadmap. The planned TTS recipe will reuse online student rollout and teacher scoring, adapted for speech generation quality, alignment, and acoustic-token supervision.
open-audio-opd contains the production audio online policy distillation
(OPD) stack used to distill compact audio models from stronger teacher models.
The current release focuses on ASR: a student autoregressive ASR model rolls out
transcripts on-policy, a stronger teacher scores the same audio and transcript,
and the student is updated with token-level KL on the union top-k support.
The repository is based on THUNLP/OPD and
verl. A trimmed vendored copy of verl/
is included so the training script can use FSDP2 wrapping, gradient clipping,
and checkpoint management without depending on another local checkout.
No audio files, JSONL datasets, or private machine paths are included. All model, data, and output paths are explicit command-line arguments. ASR model weights are released separately as AutoArk-AI/ARK-ASR-0.6B.
Figure 1. Audio OPD trains a compact student from online rollouts and teacher scoring over union top-k token support.
Roadmap · Model Release · Experimental Results · Training Method · Install · Inference · Evaluation · Training · Citation
| Category | Item | Status |
|---|---|---|
| ASR OPD | FSDP2 online policy distillation trainer | Done |
| | Qwen3-ASR-style teacher scoring backend | Done |
| | Resumable FSDP2 checkpointing | Done |
| | Multi-node hostfile launcher | Done |
| | ASR inference and J/WER evaluation scripts | Done |
| Model Releases | ARK-ASR-0.6B | Done |
| TTS OPD | Online rollout and teacher-scoring recipe | Planned |
| | Speech generation quality and alignment objectives | Planned |
| | Acoustic-token supervision support | Planned |
| ARK-ASR-0.6B | |
|---|---|
| Checkpoint | AutoArk-AI/ARK-ASR-0.6B |
| Task | Autoregressive ASR |
| Languages | Chinese, English, German, Japanese, French, Korean |
| Training recipe | SFT baseline plus teacher-data adaptation and OPD |
| Repository use | Inference, evaluation, and OPD continued training workflows |
scripts/train/train_ark_asr_opd_fsdp2_resume.py # main FSDP2 ASR OPD trainer
scripts/run/run_ark_asr_opd_fsdp2_resume_hostfile.sh # multi-node launcher
scripts/infer/ark_asr_transformers.py # ASR inference
scripts/eval/eval_jwer_ark_asr_transformers.py # J/WER evaluation
scripts/eval/run_arkasr_eval.sh # multi-GPU evaluation launcher
configs/hostfile.example # hostfile format example
https://arxiv.org/abs/2605.28139 # arXiv paper
assets/opd_overview.png # OPD overview figure
verl/ # vendored verl runtime code
README.md / README_zh.md # usage docs
Ark-ASR is a 0.6B-parameter ASR student model. These OPD experiments use only 100k hours of ASR audio. Public Qwen3-ASR technical-report material describes a multi-stage training pipeline whose AuT encoder pretraining stage alone uses about 40M hours of pseudo-labeled ASR audio, followed by Omni training, ASR SFT, and ASR RL. By this comparison, Ark-ASR uses roughly 1/400 of the disclosed ASR pretraining audio (100k versus 40M hours) while reaching accuracy comparable to the Qwen3-ASR-0.6B baseline.
Ark-Base denotes the 0.6B checkpoint obtained by SFT on the 100k-hour ASR
dataset. TD denotes teacher-data adaptation using 2,000 hours of
teacher-generated ASR data. OPD denotes on-policy distillation with the
Qwen3-ASR teacher.
| Model | aishell-1 (CER) | Wenet-meeting (CER) | Wenet-net (CER) | Libri-clean (WER) | Libri-other (WER) |
|---|---|---|---|---|---|
| Ark-Base (0.6B) | 3.48% | 10.22% | 7.74% | 3.75% | 7.17% |
| Ark-Base+OPD (0.6B) | 3.00% | 7.18% | 6.13% | 2.88% | 5.50% |
| Ark-Base+TD+OPD (0.6B) | 1.95% | 5.92% | 5.39% | 2.45% | 4.56% |
| Qwen3-ASR-1.7B | 1.50% | 4.69% | 4.55% | 2.20% | 4.05% |
| Qwen3-ASR-0.6B | 2.07% | 5.57% | 5.45% | 2.81% | 5.05% |
Lower CER/WER is better.
Key takeaways:
- With only 100k hours of audio, Ark-ASR reaches a competitive level against Qwen3-ASR models trained with a much larger reported ASR data scale.
- Applying OPD on the same 0.6B student substantially improves every benchmark over Ark-Base, showing that OPD transfers additional ASR capability beyond standard supervised fine-tuning.
- Ark-Base+TD+OPD is the stronger recipe. Compared with Ark-Base+OPD, it reduces CER from 3.00% to 1.95% on aishell-1, from 7.18% to 5.92% on Wenet-meeting, and from 6.13% to 5.39% on Wenet-net, and reduces WER from 2.88% to 2.45% on Libri-clean and from 5.50% to 4.56% on Libri-other.
- At the same 0.6B scale, Ark-Base+TD+OPD is stronger overall than Qwen3-ASR-0.6B, with better aishell-1, Wenet-net, Libri-clean, and Libri-other results. It trails only on Wenet-meeting (5.92% versus 5.57% CER).
ASR OPD trains a student ASR model using online rollouts and teacher scores:
audio batch
-> student generates transcript tokens with no grad
-> teacher scores the same audio plus the student transcript
-> student scores its own transcript with gradients
-> build teacher/student union top-k support
-> optimize KL(teacher || student) on aligned transcript positions
-> save FSDP2 checkpoints that can be resumed
The key point is that the teacher is not used to provide a static transcript label. It scores what the student actually generated online, so the student is trained on its own current behavior.
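The loss step in the pipeline above can be sketched for a single transcript position as follows. This is a minimal illustration in plain Python, not the repository's trainer code: the function names are hypothetical, and renormalizing both distributions over the union support before computing KL is an assumption about how the restricted distributions are formed.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def top_k_ids(logits, k):
    """Indices of the k largest logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def union_topk_kl(teacher_logits, student_logits, k):
    """KL(teacher || student) at one position, restricted to the union of
    both models' top-k token ids (assumption: renormalized over that support)."""
    support = sorted(set(top_k_ids(teacher_logits, k)) | set(top_k_ids(student_logits, k)))
    t_logp = log_softmax([teacher_logits[i] for i in support])
    s_logp = log_softmax([student_logits[i] for i in support])
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))


def sequence_loss(teacher_rows, student_rows, k):
    """Mean per-position KL over aligned transcript positions."""
    losses = [union_topk_kl(t, s, k) for t, s in zip(teacher_rows, student_rows)]
    return sum(losses) / len(losses)
```

Taking the union rather than only the teacher's top-k keeps tokens that the student currently prefers inside the support, so the loss can also push probability away from them.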
--student_model is the trainable audio-capable ASR model. It must be loadable
with AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True) and its
processor/tokenizer must support the audio prompt format used by the script.
The student is wrapped with FSDP2 and receives gradients.
--teacher_model is the stronger ASR model used for scoring. It is loaded in
eval mode and does not receive gradients. Supported teacher backends are:
- qwen3_asr_teacher_forcing: default production path for Qwen3-ASR-style teachers.
- qwen3_asr_transformers: Transformers backend for Qwen3-ASR.
- qwen3_asr_vllm: vLLM backend when the matching vLLM stack is installed.
- hf_causal_lm: generic Hugging Face causal LM teacher path.
For qwen3_asr_* backends, pass --qwen3_asr_code_path to the local Qwen3-ASR
Transformers backend code. That backend code is not vendored here.
Use a CUDA/PyTorch environment that matches your cluster. Then install this repository and its Python dependencies:
pip install -e .

If your workflow expects verl to be installed as its own editable package:

pip install -e ./verl

For qwen3_asr_vllm, install a compatible vLLM stack separately:

pip install -e ".[vllm]"

Training data is JSONL. Each line is one ASR sample:

{"audio":"/path/to/audio.wav","text":"reference transcript","task":"asr","begin_time":-1,"end_time":-1}

Fields:

- audio: required audio path.
- text: required reference transcript used for ASR supervision and metadata.
- task: optional; if present, it must be asr.
- begin_time: optional segment start in seconds. Use -1 for full audio.
- end_time: optional segment end in seconds. Use -1 for full audio.
The script fails on missing audio paths. It does not silently replace bad samples with fallback audio.
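Before launching a run, it can help to check a JSONL file against the field rules above. The following is a small standalone sketch, not the training script's own loader. The helper name `validate_sample` is hypothetical, and two rules are assumptions added for illustration: absent begin_time/end_time are treated as -1, and a real segment must have end_time greater than begin_time.

```python
import json
import os


def validate_sample(line, check_audio_exists=True):
    """Parse one JSONL line and enforce the documented field rules."""
    sample = json.loads(line)
    for key in ("audio", "text"):
        if not isinstance(sample.get(key), str) or not sample[key]:
            raise ValueError(f"missing required field: {key}")
    if "task" in sample and sample["task"] != "asr":
        raise ValueError(f"unsupported task: {sample['task']!r}")
    # -1 means "use the full audio"; absent values are treated the same way (assumption).
    begin = sample.get("begin_time", -1)
    end = sample.get("end_time", -1)
    # Assumption: a real segment needs end_time > begin_time.
    if begin != -1 and end != -1 and end <= begin:
        raise ValueError("end_time must be greater than begin_time")
    # Fail loudly on a missing file instead of substituting fallback audio.
    if check_audio_exists and not os.path.isfile(sample["audio"]):
        raise FileNotFoundError(sample["audio"])
    return sample
```

Running this over every line of a dataset surfaces bad rows up front rather than partway through a distributed job.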
Run ASR inference with Hugging Face Transformers:
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_path = "AutoArk-AI/ARK-ASR-0.6B"
audio_path = "assets/libai.wav"
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch_dtype,
    attn_implementation="sdpa",
).to(device)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": audio_path},
            {"type": "text", "text": "Please transcribe this audio."},
        ],
    }
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
)
inputs = inputs.to(device)
# Cast audio features to the model dtype when the processor returns them.
if "audios" in inputs:
    inputs["audios"] = inputs["audios"].to(dtype=torch_dtype)

# Block every special token except EOS so only transcript text is generated.
bad_words_ids = [[token_id] for token_id in tokenizer.all_special_ids if token_id != tokenizer.eos_token_id]
outputs = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=256,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    bad_words_ids=bad_words_ids,
)
# Decode only the newly generated tokens, skipping the prompt.
decoded_outputs = tokenizer.batch_decode(
    outputs[:, inputs.input_ids.shape[1] :],
    skip_special_tokens=True,
)
print(decoded_outputs)

For batch JSONL inference, use:
python scripts/infer/ark_asr_transformers.py \
  --input /path/to/input.jsonl \
  --output runs/infer/predictions.jsonl \
  --model_path AutoArk-AI/ARK-ASR-0.6B \
  --processor_path AutoArk-AI/ARK-ASR-0.6B \
  --batch_size 40 \
  --dtype float16 \
  --attn_impl sdpa

Ark-ASR can also be served with the vLLM adapter in scripts/vllm/ark_asr_vllm.
The verified runtime on arki-dev-h20 is:
/root/miniforge3/envs/asr_vlm
Python 3.10
PyTorch 2.9.0+cu128
Transformers 4.57.3
vLLM 0.12.0
Start the online service:
cd /data/yumu/open-audio-opd
GPU=2 PORT=8025 scripts/vllm/deploy_ark_asr_vllm_service.sh start

Check service status and token masking:

scripts/vllm/deploy_ark_asr_vllm_service.sh status
curl -sS http://127.0.0.1:8025/health
curl -sS http://127.0.0.1:8025/token-mask

Run one ASR request:

curl -sS -X POST http://127.0.0.1:8025/asr \
  -F file=@assets/libai.wav \
  -F max_new_tokens=64

The service also exposes an OpenAI-style endpoint:

curl -sS -X POST http://127.0.0.1:8025/v1/audio/transcriptions \
  -F file=@assets/libai.wav \
  -F model=ark-asr

Stop the service:

scripts/vllm/deploy_ark_asr_vllm_service.sh stop

Logs and PID files are written to runs/vllm/. The vLLM service applies
generation-time token masking with allowed_token_ids, so non-ASR control
tokens such as <|user|>, <|assistant|>, <|audio|>,
<|begin_of_audio|>, and <|end_of_audio|> are blocked during decoding.
<|im_end|> is kept as the stop token. Additional adaptation notes are in
docs/ark_asr_vllm_adaptation.md.
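The allow-list idea behind this masking can be illustrated with a toy vocabulary. This is a sketch under stated assumptions, not the adapter's code: `build_allowed_token_ids` and `toy_vocab` are hypothetical, and real token ids would come from the model's tokenizer.

```python
def build_allowed_token_ids(vocab, blocked_tokens, stop_token):
    """Return the sorted token ids that decoding may emit.

    vocab maps token string -> id. Every id is allowed except the blocked
    control tokens; the stop token is always kept so generation can end.
    """
    blocked_ids = {vocab[t] for t in blocked_tokens if t in vocab}
    blocked_ids.discard(vocab[stop_token])
    return sorted(i for i in vocab.values() if i not in blocked_ids)


# Toy vocabulary for illustration only; real ids come from the tokenizer.
toy_vocab = {
    "hello": 0, "world": 1,
    "<|user|>": 2, "<|assistant|>": 3, "<|audio|>": 4,
    "<|begin_of_audio|>": 5, "<|end_of_audio|>": 6, "<|im_end|>": 7,
}
control = ["<|user|>", "<|assistant|>", "<|audio|>", "<|begin_of_audio|>", "<|end_of_audio|>"]
allowed = build_allowed_token_ids(toy_vocab, control, stop_token="<|im_end|>")
print(allowed)  # [0, 1, 7]
```

A list built this way is the kind of value vLLM's SamplingParams accepts through its allowed_token_ids argument. How the service actually constructs its list is described in the adaptation notes, not here.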
A local browser test page is available at tools/ark_asr_vllm_test.html. It
defaults to http://172.31.0.3:8025 and supports file upload, microphone
recording, health checks, and token-mask checks.
Run J/WER evaluation for one JSONL file:
python scripts/eval/eval_jwer_ark_asr_transformers.py \
  --input /path/to/test_aishell.jsonl \
  --output runs/eval/test_aishell_result.jsonl \
  --model_path AutoArk-AI/ARK-ASR-0.6B \
  --processor_path AutoArk-AI/ARK-ASR-0.6B \
  --batch_size 40 \
  --dtype float16 \
  --attn_impl sdpa

The eval output is sorted by cer_errors descending to make bad cases easy to
inspect. Each row includes ref_text, pred_text, cleaned text fields,
wer_errors, cer_errors, ref_words, and ref_chars.
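For readers unfamiliar with how such per-row fields combine into corpus-level rates, here is a minimal sketch. It is not the evaluation script itself: the function names are hypothetical, text normalization is omitted, whitespace is dropped before counting characters, and aggregating as total errors over total reference units is an assumption about the scoring convention.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def score_row(ref_text, pred_text):
    """Per-utterance error counts at word and character level."""
    ref_words, hyp_words = ref_text.split(), pred_text.split()
    ref_chars = list(ref_text.replace(" ", ""))
    hyp_chars = list(pred_text.replace(" ", ""))
    return {
        "ref_text": ref_text,
        "pred_text": pred_text,
        "wer_errors": edit_distance(ref_words, hyp_words),
        "cer_errors": edit_distance(ref_chars, hyp_chars),
        "ref_words": len(ref_words),
        "ref_chars": len(ref_chars),
    }


def corpus_rate(rows, errors_key, ref_key):
    """Total errors divided by total reference units across all rows."""
    return sum(r[errors_key] for r in rows) / sum(r[ref_key] for r in rows)


rows = [score_row("the cat sat", "the cat sit"), score_row("hello world", "hello world")]
rows.sort(key=lambda r: r["cer_errors"], reverse=True)  # worst cases first
wer = corpus_rate(rows, "wer_errors", "ref_words")
cer = corpus_rate(rows, "cer_errors", "ref_chars")
```

Summing errors before dividing weights long utterances more heavily than averaging per-row rates would, which is the usual convention for reporting CER/WER.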
If your environment has the text_process normalizer in a separate Python env,
pass:
--text_normalize_python /path/to/wetext/bin/python

For the five-preset multi-GPU evaluation pattern used by the internal
run_arkasr_step30000_eval.sh, use the open-source launcher without hard-coded
data paths:
MODEL_PATH=AutoArk-AI/ARK-ASR-0.6B \
EVAL_DATA_DIR=/path/to/eval_jsonl_dir \
OUTPUT_DIR=runs/eval/arkasr_step30000 \
SUFFIX=step30000 \
GPUS="0 1 2 3 4" \
PRESETS="aishell clean meeting net other" \
scripts/eval/run_arkasr_eval.sh

The launcher expects files named test_${preset}.jsonl under EVAL_DATA_DIR.
It writes logs, pid files, and result JSONL files to OUTPUT_DIR. No eval data
is included in this repository.
torchrun --nproc_per_node 8 scripts/train/train_ark_asr_opd_fsdp2_resume.py \
  --student_model AutoArk-AI/ARK-ASR-0.6B \
  --teacher_model /path/to/qwen3_asr_model \
  --qwen3_asr_code_path /path/to/qwen3-asr/backend \
  --train_data /path/to/train.jsonl \
  --output_dir runs/ark_asr_opd_fsdp2 \
  --teacher_backend qwen3_asr_teacher_forcing \
  --calibrate_only False \
  --per_device_train_batch_size 1 \
  --learning_rate 1e-6 \
  --opd_top_k 32 \
  --asr_opd_max_new_tokens 256 \
  --save_freq 1000

Start with a small batch and small --asr_opd_max_new_tokens, then scale after
checking generation length, non-empty generation ratio, teacher alignment, and
opd_valid_topk_mean.
Create a hostfile:
node0 slots=8
node1 slots=8
node2 slots=8
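To sanity-check a hostfile before launching, the total process count can be derived from its slots entries. The parser below is an illustrative sketch, not part of the launcher: `parse_hostfile` is a hypothetical helper, and the handling of `#` comments and the default of one slot when slots= is absent are assumptions.

```python
def parse_hostfile(text):
    """Parse 'host slots=N' lines into (host, slots) pairs."""
    nodes = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line:
            continue
        host, *options = line.split()
        slots = 1  # assumed default when slots= is absent
        for opt in options:
            key, _, value = opt.partition("=")
            if key == "slots":
                slots = int(value)
        nodes.append((host, slots))
    return nodes


example = "node0 slots=8\nnode1 slots=8\nnode2 slots=8\n"
nodes = parse_hostfile(example)
world_size = sum(slots for _, slots in nodes)
print(len(nodes), world_size)  # 3 24
```

For the three-node example above this gives 24 processes, one per GPU slot.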
Launch:
HOSTFILE=/path/to/hostfile \
STUDENT_MODEL=AutoArk-AI/ARK-ASR-0.6B \
TEACHER_MODEL=/path/to/qwen3_asr_model \
QWEN3_ASR_CODE_PATH=/path/to/qwen3-asr/backend \
TRAIN_DATA=/path/to/train.jsonl \
OUTPUT_DIR=runs/ark_asr_opd_fsdp2 \
NCCL_SOCKET_IFNAME=hpn0 \
GLOO_SOCKET_IFNAME=hpn0 \
scripts/run/run_ark_asr_opd_fsdp2_resume_hostfile.sh

The launcher requires HOSTFILE, STUDENT_MODEL, TEACHER_MODEL, and
TRAIN_DATA. It also requires QWEN3_ASR_CODE_PATH when TEACHER_BACKEND
starts with qwen3_asr_.
Resume a specific checkpoint:
torchrun --nproc_per_node 8 scripts/train/train_ark_asr_opd_fsdp2_resume.py \
  --student_model AutoArk-AI/ARK-ASR-0.6B \
  --teacher_model /path/to/qwen3_asr_model \
  --qwen3_asr_code_path /path/to/qwen3-asr/backend \
  --train_data /path/to/train.jsonl \
  --output_dir runs/ark_asr_opd_fsdp2 \
  --resume_from_checkpoint runs/ark_asr_opd_fsdp2/checkpoints/global_step_1000 \
  --calibrate_only False

Resume the latest checkpoint under output_dir/checkpoints:

--resume_from_checkpoint latest

auto is accepted as an alias for latest.
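One plausible way to resolve latest is to pick the highest-numbered global_step_N directory. The sketch below illustrates that idea and is not the trainer's implementation: `resolve_checkpoint` is a hypothetical helper, and the assumption that every checkpoint directory is named global_step_<N> is taken from the example path above.

```python
import os
import re

_STEP_DIR = re.compile(r"^global_step_(\d+)$")


def resolve_checkpoint(output_dir, resume_from):
    """Map a --resume_from_checkpoint style value to a directory path, or None."""
    if resume_from not in ("latest", "auto"):
        return resume_from  # explicit checkpoint directory
    ckpt_root = os.path.join(output_dir, "checkpoints")
    if not os.path.isdir(ckpt_root):
        return None
    steps = []
    for name in os.listdir(ckpt_root):
        match = _STEP_DIR.match(name)
        if match and os.path.isdir(os.path.join(ckpt_root, name)):
            steps.append((int(match.group(1)), name))
    if not steps:
        return None
    # Compare step numbers as integers so global_step_10000 beats global_step_2000.
    return os.path.join(ckpt_root, max(steps)[1])
```

Comparing the step numbers as integers matters because a plain string sort would rank global_step_2000 above global_step_10000.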
By default, --calibrate_only is True. Calibration runs forward passes and prints
initial loss/generation metrics without optimizer steps. Use it before a real
run to verify model loading, data loading, rollout, teacher scoring, and OPD
alignment.
For actual training, pass:
--calibrate_only False

- --student_model: trainable student ASR model path or HF repo id.
- --teacher_model: teacher ASR model path or HF repo id.
- --teacher_backend: teacher scoring implementation.
- --qwen3_asr_code_path: required for Qwen3-ASR teacher backends.
- --train_data: JSONL training data.
- --output_dir: logs and FSDP2 checkpoints.
- --hf_cache_dir: Hugging Face datasets cache directory.
- --opd_top_k: teacher/student top-k support size.
- --opd_temperature: temperature for OPD distribution.
- --asr_block_token_id_from: masks non-ASR token ids during student generation.
- --asr_opd_max_new_tokens: rollout length cap.
- --save_freq: checkpoint save interval. Set -1 to disable saving.
- --resume_from_checkpoint: checkpoint dir, latest, or auto.
These checks do not require model weights:
python3 -m py_compile scripts/train/train_ark_asr_opd_fsdp2_resume.py
python3 -m py_compile scripts/infer/ark_asr_transformers.py
python3 -m py_compile scripts/eval/eval_jwer_ark_asr_transformers.py
bash -n scripts/run/run_ark_asr_opd_fsdp2_resume_hostfile.sh
bash -n scripts/eval/run_arkasr_eval.sh
python scripts/train/train_ark_asr_opd_fsdp2_resume.py --help

The final --help command must be run in an environment with the training
dependencies installed, including numpy, torch, datasets,
transformers, omegaconf, and the verl dependencies listed in
pyproject.toml.
If you use this repository or ARK-ASR in your work, please cite:
@misc{lin2026dataefficientopd,
  title={Data-Efficient On-Policy Distillation for Automatic Speech Recognition},
  author={Lin, Yu and Wang, Yiming and Cai, Runyuan and Zeng, Xiaodong},
  year={2026},
  eprint={2605.28139},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2605.28139}
}