This repository contains the official implementation for end-to-end streaming sign language translation (SLT) using a dense temporal grounding framework. The system jointly localizes and translates sign language events from continuous pose streams without requiring gloss annotations.
The model processes continuous pose sequences through a sliding window mechanism and performs:
- Temporal event localization — detecting when individual sentences occur within a stream
- Dense captioning — translating each localized event into the target spoken language
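The sliding-window step can be pictured with a short sketch. The helper name `sliding_windows`, the example window length, and the way the final partial window is handled are assumptions for illustration. Only the idea of a stride expressed as a fraction of the window (compare the `--stride_ratio` flag, default 0.9) comes from this README.

```python
def sliding_windows(num_frames: int, window: int, stride_ratio: float = 0.9):
    """Return (start, end) frame spans (end exclusive) that cover a pose stream.

    Hypothetical helper: the repository's loader may pad or clip differently.
    """
    if window <= 0 or not 0.0 < stride_ratio <= 1.0:
        raise ValueError("window must be positive and 0 < stride_ratio <= 1")
    if num_frames <= 0:
        return []
    # Stride in frames; a ratio below 1.0 makes consecutive windows overlap.
    stride = max(1, int(round(window * stride_ratio)))
    spans = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        spans.append((start, end))
        if end >= num_frames:  # the last window reaches the end of the stream
            break
        start += stride
    return spans
```

With a 1000-frame stream, a 400-frame window, and a ratio of 0.9, consecutive windows start 360 frames apart and overlap by 40 frames, so an event cut off at one window boundary can still appear whole in the next window.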
The architecture combines:
- A pose backbone (CoSign ST-GCN or MSKA) for spatiotemporal feature extraction from skeleton keypoints
- A Deformable DETR encoder-decoder for temporal event detection
- A trimmed mBART caption decoder for multilingual translation
Training follows a two-stage procedure:
- Stage 1: Visual-language contrastive pre-training (InfoNCE with masked pose views)
- Stage 2: Joint localization and captioning with Hungarian matching
A cascaded baseline (GFSLT-VLP) is also provided for comparison.
| Dataset | Target language (sign language) | Source | Pose format |
|---|---|---|---|
| BOBSL | English (BSL) | Auto-aligned broadcast subtitles | COCO-WholeBody-133 via DWPose |
| PHOENIX-2014T | German (DGS) | Synthesized streams from per-clip pickles | COCO-WholeBody-133 |
| CSL-Daily | Chinese (CSL) | Synthesized streams from per-clip pickles | COCO-WholeBody-133 |
| How2Sign | English (ASL) | Real signer-aligned timestamps | COCO-WholeBody-133 (from OpenPose) |
Switch datasets via the `DATASET` environment variable: `DATASET={BOBSL,PHOENIX,CSL,H2S}`. All paths, target language codes, and preprocessing parameters are resolved automatically in `config.py`.
Install the dependencies:

```bash
pip install -r requirements.txt
```

Key dependencies: PyTorch 2.6+, Transformers 4.57+, accelerate, sacrebleu, pycocoevalcap, BLEURT.
Each dataset follows a unified directory structure:
```
data/<dataset>/
├── poses/<video_id>.npy     # (T, 133, 3) float32 at target FPS
├── vtt/<video_id>.vtt       # WebVTT subtitles (one sentence per cue)
└── subset2episode.json      # {"train": [...], "val": [...], "test": [...]}
```
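Before training, it can help to confirm that every video listed in the split file has both a pose file and a subtitle file. The sketch below assumes exactly the layout shown above; the function name `find_missing_files` is hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def find_missing_files(root: str) -> list[str]:
    """List pose/subtitle files named by subset2episode.json but absent on disk.

    Assumes the layout shown above: poses/<video_id>.npy and vtt/<video_id>.vtt.
    """
    root_path = Path(root)
    splits = json.loads((root_path / "subset2episode.json").read_text())
    missing = []
    for split_name, video_ids in splits.items():
        for video_id in video_ids:
            for rel in (f"poses/{video_id}.npy", f"vtt/{video_id}.vtt"):
                if not (root_path / rel).is_file():
                    missing.append(f"{split_name}: {rel}")
    return missing
```

Calling `find_missing_files` on a dataset root returns an empty list when the directory is complete.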
For datasets other than BOBSL, synthesize streaming benchmarks from pre-segmented data:
```bash
DATASET=PHOENIX python -m data_synth.synthesize_streams --out_root data/synth/phoenix
DATASET=CSL python -m data_synth.synthesize_streams --out_root data/synth/csl
DATASET=H2S python -m data_synth.synthesize_h2s --out_root data/synth/h2s
```

See `data_synth/README.md` for details on the co-articulation synthesis pipeline.
Before training, trim the mBART vocabulary to the target dataset's subtitle tokens:
```bash
DATASET=BOBSL python captioners/trim_mbart.py
DATASET=PHOENIX python captioners/trim_mbart.py
DATASET=CSL python captioners/trim_mbart.py
DATASET=H2S python captioners/trim_mbart.py
```

All training uses the Hugging Face `Trainer` with `HfArgumentParser`. Key hyperparameters are exposed as CLI flags.
Stage 1 (`--mode 1`): visual-language contrastive pre-training.

- Frozen: encoder, decoder, detection heads, caption head
- Trained: pose backbone + text encoder
- Loss: 3-way InfoNCE (view1↔view2, view1↔text, view2↔text) with temperature τ=0.07
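The 3-way InfoNCE objective can be sketched without any framework. This is an illustration under assumptions, not the repository's loss code: the function names are hypothetical, each pairwise term is taken to be symmetric (both directions averaged), and the three terms are simply summed. The actual weighting in `main.py` may differ.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _diag_cross_entropy(logits):
    """Mean cross-entropy where row i's correct column is i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: row i of `a` is the positive for row i of `b`."""
    a = [_normalize(v) for v in a]
    b = [_normalize(v) for v in b]
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(transposed))


def three_way_info_nce(view1, view2, text, temperature=0.07):
    """Sum of the three pairwise terms (assumed equal weighting)."""
    return (
        info_nce(view1, view2, temperature)
        + info_nce(view1, text, temperature)
        + info_nce(view2, text, temperature)
    )
```

Here `view1` and `view2` stand for embeddings of two masked views of the same pose clip and `text` for the matching sentence embedding. The loss is small when each row aligns with its own counterpart and large when it aligns with another sample in the batch.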
```bash
torchrun --nproc_per_node <NUM_GPUS> main.py \
  --mode 1 \
  --output_dir ./checkpoints/mode1 \
  --max_event_tokens 40 \
  --d_model 1024 \
  --encoder_layers 2 \
  --decoder_layers 2 \
  --num_cap_layers 3 \
  --num_queries 30 \
  --num_train_epochs 50 \
  --learning_rate 5e-4 \
  --per_device_train_batch_size 32
```

Stage 2 (`--mode 2`): joint localization and captioning.

- Unfrozen: all parameters
- Losses: classification + GIoU + event count + caption (Hungarian matching)
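Hungarian matching assigns each ground-truth event to exactly one query so that the total matching cost is minimal. The sketch below is an assumption-laden illustration rather than the code in `loss.py`: the cost combines only a classification term and a 1D GIoU term (weighted like the `--class_cost` and `--giou_cost` defaults), and it finds the optimum by exhaustive search over assignments instead of the Hungarian algorithm itself, which is practical only for tiny examples. Real implementations typically call `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations


def giou_1d(a, b):
    """Generalized IoU of two temporal segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    hull = max(a[1], b[1]) - min(a[0], b[0])  # smallest span enclosing both
    iou = inter / union if union > 0 else 0.0
    return iou - (hull - union) / hull if hull > 0 else iou


def match_queries(pred_segments, pred_scores, gt_segments,
                  class_cost=2.0, giou_cost=4.0):
    """Return {gt_index: query_index} with minimal total cost.

    Brute-force stand-in for the Hungarian algorithm (same optimum,
    exponential time), so keep the inputs small.
    """
    n_q, n_gt = len(pred_segments), len(gt_segments)
    if n_gt > n_q:
        raise ValueError("need at least as many queries as ground-truth events")
    # Higher score and higher GIoU both lower the cost of a pairing.
    cost = [
        [-class_cost * pred_scores[q]
         - giou_cost * giou_1d(pred_segments[q], gt_segments[g])
         for g in range(n_gt)]
        for q in range(n_q)
    ]
    best, best_total = (), float("inf")
    for queries in permutations(range(n_q), n_gt):
        total = sum(cost[q][g] for g, q in enumerate(queries))
        if total < best_total:
            best, best_total = queries, total
    return {g: q for g, q in enumerate(best)}
```

Queries left unmatched are the ones a DETR-style loss would train toward the "no event" class.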
```bash
torchrun --nproc_per_node <NUM_GPUS> main.py \
  --mode 2 \
  --mode1_checkpoint ./checkpoints/mode1/mode1_final \
  --output_dir ./checkpoints/mode2 \
  --max_event_tokens 40 \
  --d_model 1024 \
  --encoder_layers 2 \
  --decoder_layers 2 \
  --num_cap_layers 3 \
  --num_queries 30 \
  --num_train_epochs 100 \
  --learning_rate 2e-4 \
  --per_device_train_batch_size 32
```

| Category | Flag | Default | Description |
|---|---|---|---|
| Data | `--max_event_tokens` | 40 | Max tokens per caption |
| Data | `--stride_ratio` | 0.9 | Sliding window stride (val/test) |
| Data | `--noise_rate` | 0.15 | Token masking rate for contrastive learning |
| Data | `--pose_augment` | False | Apply pose augmentation (train only) |
| Model | `--d_model` | 1024 | Hidden dimension |
| Model | `--num_queries` | 30 | Max detected events per window |
| Model | `--encoder_layers` | 2 | Deformable DETR encoder layers |
| Model | `--decoder_layers` | 2 | Deformable DETR decoder layers |
| Model | `--num_cap_layers` | 3 | Caption decoder layers |
| Model | `--captioner_type` | mbart | Caption head type (`mbart` or `lstms`) |
| Loss | `--class_cost` | 2 | Classification loss weight |
| Loss | `--giou_cost` | 4 | GIoU loss weight |
| Loss | `--counter_cost` | 2 | Event count loss weight |
| Loss | `--caption_cost` | 2 | Caption loss weight |
| Backbone | `BACKBONE` env var | cosign | Pose backbone (`cosign` or `mska`) |

The cascaded GFSLT-VLP baseline trains in two stages:
```bash
# Stage 1: CLIP-style VLP + Masked LM
torchrun --nproc_per_node <NUM_GPUS> gfslt_stage1.py \
  --output_dir ./checkpoints/gfslt_stage1 \
  --num_train_epochs 50 \
  --learning_rate 1e-4

# Stage 2: End-to-end translation (encoder initialized from Stage 1)
torchrun --nproc_per_node <NUM_GPUS> gfslt_stage2.py \
  --stage1_checkpoint ./checkpoints/gfslt_stage1 \
  --output_dir ./checkpoints/gfslt_stage2 \
  --num_train_epochs 80 \
  --learning_rate 5e-5
```

Use `eval.py` for the proposed model. Run on a single GPU to avoid distributed overhead:
```bash
# Evaluate on both val and test
CUDA_VISIBLE_DEVICES=0 python eval.py \
  --checkpoint_path checkpoints/mode2/mode2_final

# Evaluate only test set with custom settings
CUDA_VISIBLE_DEVICES=0 python eval.py \
  --checkpoint_path checkpoints/mode2/mode2_final \
  --eval_val False \
  --max_event_tokens 40 \
  --num_queries 30 \
  --per_device_eval_batch_size 32 \
  --ranking_temperature 2.0 \
  --top_k 20 \
  --aggregation_mode video
```

For the cascaded (GFSLT + DETR localization) evaluation:
```bash
CUDA_VISIBLE_DEVICES=0 python gfslt_cascaded_eval.py \
  --detr_checkpoint_path checkpoints/mode2/pytorch_model.bin \
  --gfslt_checkpoint_path checkpoints/gfslt_stage2/pytorch_model.bin
```

The evaluation reports metrics at three levels:
- Localization: Precision / Recall / F1 at IoU thresholds {0.3, 0.5, 0.7, 0.9}
- Dense captioning: BLEU-4, METEOR, ROUGE-L, CIDEr, BLEURT for matched (pred, GT) pairs at each IoU threshold, plus SODA_c storytelling F1
- Paragraph-level: Sort predicted captions by time → join into paragraph → compare against GT paragraph (same translation metrics)
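The localization metrics above can be illustrated with a small sketch. It assumes the convention common in dense video captioning evaluation: a prediction counts toward precision if it overlaps any ground-truth event at or above the IoU threshold, and a ground-truth event counts toward recall if any prediction covers it. The function names are hypothetical, and `evaluation/metrics.py` may use a stricter one-to-one matching.

```python
def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def localization_prf(preds, gts, threshold):
    """Return (precision, recall, F1) at a single IoU threshold."""
    hit_preds = sum(
        any(temporal_iou(p, g) >= threshold for g in gts) for p in preds
    )
    hit_gts = sum(
        any(temporal_iou(p, g) >= threshold for p in preds) for g in gts
    )
    precision = hit_preds / len(preds) if preds else 0.0
    recall = hit_gts / len(gts) if gts else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1
```

Running `localization_prf` once per threshold in {0.3, 0.5, 0.7, 0.9} gives one row of scores per threshold.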
Aggregation modes: --aggregation_mode {corpus, window, video}
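For the paragraph-level comparison listed above, the assembly step amounts to ordering events by time and concatenating their captions. A minimal sketch, assuming events are `(start_sec, end_sec, caption)` tuples and a single space as the separator (both are illustrative choices, not taken from `evaluation/helpers.py`):

```python
def build_paragraph(events):
    """Join captions into one paragraph, ordered by start time, then end time."""
    ordered = sorted(events, key=lambda event: (event[0], event[1]))
    # Skip empty captions so they do not leave double spaces behind.
    return " ".join(
        caption.strip() for _, _, caption in ordered if caption.strip()
    )
```

The same function would be applied to the predicted and the ground-truth events before scoring the two paragraphs with the translation metrics.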
Verify the forward/backward pass on a small batch:
```bash
python pdvc.py
```

Repository layout:

```
├── main.py                     # Proposed model training (Stage 1 + 2)
├── eval.py                     # Proposed model evaluation
├── gfslt_stage1.py             # GFSLT baseline Stage 1 (VLP)
├── gfslt_stage2.py             # GFSLT baseline Stage 2 (translation)
├── gfslt_cascaded_eval.py      # Cascaded evaluation (DETR loc + GFSLT cap)
├── gfslt_models.py             # GFSLT model definitions
├── pdvc.py                     # Deformable DETR + caption head model
├── loader.py                   # DVCDataset (sliding window, streaming)
├── config.py                   # Dataset/backbone/path configuration
├── loss.py                     # Hungarian matcher + losses
├── postprocess.py              # NMS and top-k event extraction
├── utils.py                    # VTT parsing, helpers
├── backbones/
│   ├── cosign.py               # CoSign ST-GCN backbone
│   └── mska_backbone.py        # MSKA pose encoder backbone
├── captioners/
│   ├── mbart.py                # Trimmed mBART decoder captioner
│   ├── lstm.py                 # LSTM captioner (ablation)
│   └── trim_mbart.py           # Vocabulary trimming script
├── deformable_detr/            # Deformable DETR encoder/decoder
├── evaluation/
│   ├── metrics.py              # Trainer-integrated metric computation
│   ├── helpers.py              # Top-k selection, aggregation utilities
│   └── soda_c.py               # SODA_c implementation
├── data_synth/                 # Stream synthesis for PHOENIX/CSL/H2S
│   ├── synthesize_streams.py   # PHOENIX/CSL synthesis
│   ├── synthesize_h2s.py       # How2Sign synthesis
│   └── README.md               # Synthesis documentation
└── poses/
    ├── preprocessing.py        # Keypoint normalization
    └── augmentation.py         # Pose augmentation
```
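`postprocess.py` is listed above as handling NMS and top-k event extraction. The general technique, greedy temporal non-maximum suppression followed by a cap on the number of kept events, can be sketched as follows. The function names and the 0.5 IoU threshold are assumptions, and the default of 20 merely mirrors the `--top_k 20` value used in the evaluation example. The repository's actual ranking, including `--ranking_temperature`, is not modeled here.

```python
def _segment_iou(a, b):
    """IoU of two events; only the (start, end) fields are used."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(events, iou_threshold=0.5, top_k=20):
    """Greedy NMS over (start, end, score) events, keeping at most top_k.

    Candidates are visited from highest to lowest score; one is kept only
    if it overlaps every already-kept event by less than iou_threshold.
    """
    kept = []
    for candidate in sorted(events, key=lambda e: e[2], reverse=True):
        if len(kept) >= top_k:
            break
        if all(_segment_iou(candidate, k) < iou_threshold for k in kept):
            kept.append(candidate)
    # Return survivors in temporal order, ready for captioning or scoring.
    return sorted(kept, key=lambda e: e[0])
```

Overlapping windows from the sliding-window pass tend to produce near-duplicate detections of the same sentence, which is the situation this kind of filtering addresses.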
- VTT parsing: `webvtt-py` is used if installed; otherwise a built-in parser is used. Ensure `.vtt` files are under the configured `VTT_DIR`.
- Poses: Ensure `POSE_ROOT/<video_id>/*.npy` (BOBSL) or `POSE_ROOT/<stream_id>.npy` (synth) exists for each video in your split JSON.
- Multi-GPU evaluation: Always set `CUDA_VISIBLE_DEVICES=0` for `eval.py` to avoid unnecessary DDP initialization.
- BLEURT: The BLEURT-20 checkpoint is downloaded automatically to `/tmp/BLEURT-20` on first use.
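To show what a dependency-free VTT fallback might look like, here is a minimal cue parser. It is a sketch, not the parser in `utils.py`: the name `parse_vtt` is hypothetical, and it handles only plain cues with `HH:MM:SS.mmm` or `MM:SS.mmm` timestamps, ignoring cue settings, styling, and `NOTE` blocks.

```python
import re

# Matches "00:00:01.000 --> 00:00:03.500"; the hours field is optional.
_TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+"
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def _to_seconds(hours, minutes, seconds, millis):
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000.0


def parse_vtt(text):
    """Return [(start_sec, end_sec, cue_text), ...] from WebVTT content."""
    cues = []
    # Cues are separated by blank lines; blocks without a timing line
    # (the WEBVTT header, NOTE comments) are skipped.
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for index, line in enumerate(lines):
            match = _TIMING.search(line)
            if match:
                groups = match.groups()
                payload = " ".join(
                    part.strip() for part in lines[index + 1:] if part.strip()
                )
                cues.append((_to_seconds(*groups[:4]), _to_seconds(*groups[4:]), payload))
                break
    return cues
```

Multi-line cues are joined with a single space, which matches the one-sentence-per-cue convention described for the `vtt/` directory.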