This codebase supports the 3rd Workshop on Sign Language Recognition, Translation, and Production (SLRTP) at CVPR 2025. For more details, visit the workshop website.
We have seen an increase in Sign Language Production (SLP) approaches over the last few years. However, the lack of standardized evaluation metrics for SLP approaches hampers meaningful comparisons across different systems. The goal of this challenge is to develop a system that can translate spoken language to sign language. We release a standardized evaluation network, establishing a consistent baseline for the SLP field and enabling future researchers to compare their work against a broader range of methods.
```bash
# Create and activate a new conda environment (recommended)
conda create --name slrtp python=3.8
conda activate slrtp

# Install PyTorch with CUDA support
# Please change the PyTorch CUDA version to match your system!
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=11.8 -c pytorch -c nvidia

# Install the remaining dependencies
pip install -r ./requirements.txt
```

The evaluation framework uses a back-translation model to convert generated sign poses back to text, which is then compared against the reference text.
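Conceptually, the evaluation loop back-translates each predicted pose sequence to text and scores it against the reference. The sketch below illustrates this with a stand-in `back_translate` function and a toy unigram-precision score; all names and the example sentences are hypothetical, and the real pipeline lives in `main.py` and the Sign Language Transformers codebase.

```python
# Illustrative sketch of back-translation evaluation (hypothetical names).
from collections import Counter

def bleu1(hypothesis: str, reference: str) -> float:
    """Unigram precision, i.e. BLEU-1 without brevity penalty (toy version)."""
    hyp, ref = hypothesis.split(), reference.split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    return 100.0 * overlap / max(len(hyp), 1)

def back_translate(poses):
    """Stand-in for the pretrained back-translation model: poses -> text."""
    return "am tag regen"  # placeholder output

predicted_poses = object()  # would be an (N x K x D) pose tensor
reference_text = "am tag regen und schnee"

hypothesis = back_translate(predicted_poses)
print(round(bleu1(hypothesis, reference_text), 2))  # 100.0 for this toy case
```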
- Original Source: Sign Language Transformers (GitHub Repository)
- Pretrained Models and Phix Data: Download from Google Drive (alternative to training your own)
Please cite the original paper appropriately when using this code.
```bash
python main.py <input_path> <gt_path> <model_dir> --tag <evaluation_name> --fps <frame_rate>
```

| Argument | Description | Default |
|---|---|---|
| `input_path` | Path to predicted skeleton keypoints (`.pt` or `.pkl` file) | Required |
| `gt_path` | Path to ground truth skeleton keypoints and text | Required |
| `model_dir` | Directory containing the pretrained back-translation model | Required |
| `--tag` | Name for the output results file | `results` |
```bash
python main.py ./demo_data/progressive_transformer_baseline_dev.pt ./demo_data/ground_truth_dev.pt ./backTranslation_PHIX_model --tag demo --fps 12
```

```json
{
    "bleu": {
        "bleu1": 15.946522413242281,
        "bleu2": 4.580890661810554,
        "bleu3": 0.0,
        "bleu4": 0.0
    },
    "chrf": 19.755041720753972,
    "rouge": 15.536963615448718,
    "wer": 101.83727034120736,
    "dtw_mje": 0.04589659720659256,
    "total_distance": 0.30942216515541077
}
```

The framework evaluates sign language production using both text-based and pose-based metrics:
- BLEU (Bilingual Evaluation Understudy): Measures n-gram precision between predicted and reference text
- BLEU-1 through BLEU-4 values represent different n-gram sizes
- CHRF (Character n-gram F-score): Character-level metric that balances precision and recall
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Focuses on recall of n-grams
- WER (Word Error Rate): Measures the minimum number of word-level edits needed to turn the predicted text into the reference, normalized by the reference length
- Total Distance: This metric measures the overall distance the signer's hands move in 3D space, serving as a proxy for how expressive the productions are. The predicted distance is normalized by the ground truth distance, so a score of 1 is optimal.
- Average Duration: This metric assesses the accuracy of signed duration, a key component of prosody. It reports the predicted duration relative to the ground truth duration, averaged over the evaluation set, so a score of 1 is optimal.
- DTW MJE (Dynamic Time Warping Mean Joint Error): Measures pose accuracy while accounting for timing differences
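A minimal sketch of DTW-MJE, assuming poses are sequences of (K x D) keypoint frames: dynamic time warping aligns prediction and ground truth frames before averaging the joint error, so timing differences are not penalized. This is an illustrative pure-Python version; the challenge code may normalize and batch differently.

```python
import math

def frame_distance(a, b):
    """Mean Euclidean distance between corresponding joints of two frames."""
    per_joint = [math.dist(pa, pb) for pa, pb in zip(a, b)]
    return sum(per_joint) / len(per_joint)

def dtw_mje(pred, gt):
    """Dynamic time warping over frames; mean joint error along the optimal path."""
    n, m = len(pred), len(gt)
    INF = float("inf")
    # cost[i][j] = min cumulative distance aligning pred[:i] with gt[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(pred[i - 1], gt[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / max(n, m)  # simple path normalization; implementations vary

# Two toy sequences of 2 joints in 3D: identical content, different timing.
gt = [[[0, 0, 0], [1, 0, 0]], [[0, 1, 0], [1, 1, 0]]]
pred = [[[0, 0, 0], [1, 0, 0]], [[0, 0, 0], [1, 0, 0]], [[0, 1, 0], [1, 1, 0]]]
print(dtw_mje(pred, gt))  # 0.0 — warping absorbs the repeated frame
```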
Table 1: RWTH-PHOENIX-Weather-2014T Test Set
| Teams | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CHRF | ROUGE | WER ↓ | DTW-MJE ↓ | Total Distance | Average Duration |
|---|---|---|---|---|---|---|---|---|---|---|
| Ground Truth | 34.40 | 22.04 | 16.09 | 12.78 | 34.59 | 35.20 | 85.77 | 0.0000 | 1.000 | 1.000 |
| Team 1 (USTC-MoE) | 34.85 | 21.96 | 15.65 | 12.06 | 36.83 | 36.59 | 93.49 | 0.0448 | 1.631 | 1.438 |
| Team 2 (hfut-lmc) | 16.96 | 6.56 | 3.38 | 2.05 | 25.88 | 19.77 | 147.85 | 0.0403 | 2.512 | 2.463 |
| Team 3 (Hacettepe) | 30.44 | 17.75 | 12.42 | 9.59 | 29.70 | 30.64 | 88.88 | 0.0423 | 0.798 | 1.026 |
| Progressive Transformer | 22.17 | 10.71 | 7.09 | 5.43 | 24.13 | 21.98 | 101.45 | 0.0418 | 0.257 | 0.999 |
Caption: SLP Challenge Results on the PHOENIX-2014T test set.
Table 2: Hidden Test Set
| Teams | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CHRF | ROUGE | WER ↓ | DTW-MJE ↓ | Total Distance |
|---|---|---|---|---|---|---|---|---|---|
| Ground Truth | 37.94 | 19.87 | 10.67 | 5.90 | 30.64 | 38.60 | 101.25 | 0.0000 | 1.000 |
| Team 1 (USTC-MoE) | 31.40 | 17.09 | 9.43 | 5.86 | 31.73 | 33.75 | 109.38 | 0.0574 | 1.185 |
| Team 2 (hfut-lmc) | 30.54 | 16.22 | 9.33 | 5.66 | 30.17 | 32.92 | 107.93 | 0.0492 | 0.971 |
| Team 3 (Hacettepe) | 27.51 | 11.13 | 5.36 | 2.91 | 23.37 | 27.29 | 105.49 | 0.0531 | 0.761 |
| Progressive Transformer [cite] | 18.33 | 4.99 | 1.74 | 0.78 | 21.65 | 21.10 | 141.93 | 0.0467 | 0.322 |
Caption: SLP Challenge Results on the Hidden Test Set.
For the SLRTP 2025 CVPR challenge, we use the RWTH-PHOENIX-Weather-2014T dataset. Access the challenge dataset and back translation model at Google Drive.
File Content:
```
├── backTranslation_PHIX_model
│   ├── best.ckpt
│   ├── config.yaml
│   ├── gls.vocab
│   ├── txt.vocab
│   └── validations.txt
└── data
    ├── dev.pt
    ├── PT_baseline_dev.pt
    ├── PT_baseline_test.pt
    ├── test.pt
    └── train.pt
```

Note that the data folder contains the PHOENIX-2014T train, test, and dev splits, as well as the Progressive Transformer baseline predictions.
The RWTH-PHOENIX-Weather-2014T dataset is a large-scale dataset used for research in sign language recognition and translation. It contains continuous sign language videos along with their corresponding gloss annotations and spoken language translations, and is derived from weather forecast broadcasts in German Sign Language (DGS).
The dataset is split into training, development, and test sets. The training set contains 7096 videos, the development set contains 519 videos, and the test set contains 642 videos.
Inside the provided `.pt` files, you will find a dictionary with the following structure:

```
{
    "01April_2010_Thursday_heute-6704": {
        "name": (string),
        "text": (string),
        "gloss": (string),
        "poses_3d": (N x K x D),
        "speaker": (string),
    },
    "30September_2012_Sunday_tagesschau-4038": { ... },
    ...
    "27October_2009_Tuesday_tagesschau-6148": { ... },
}
```

The data is split 80:10:10 for training, validation, and test sets.
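A quick sketch of iterating over one of these dictionaries. In practice you would obtain `data` via `torch.load("./data/dev.pt")`; here a toy entry with the same structure (made-up text, gloss, and shapes) is built inline so the snippet runs standalone.

```python
# Toy stand-in for the dictionary loaded from a challenge .pt file.
data = {
    "01April_2010_Thursday_heute-6704": {
        "name": "01April_2010_Thursday_heute-6704",
        "text": "und nun die wettervorhersage",    # made-up example sentence
        "gloss": "JETZT WETTER",                    # made-up example gloss
        "poses_3d": [[[0.0, 0.0, 0.0]] * 178] * 40, # N frames x K=178 joints x D=3
        "speaker": "Signer01",                      # made-up speaker id
    },
}

for name, sample in data.items():
    n_frames = len(sample["poses_3d"])
    n_joints = len(sample["poses_3d"][0])
    # Duration in seconds at the dataset's 25 fps frame rate.
    print(name, n_frames, n_joints, n_frames / 25.0)
```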
The frame rate is equal to 25 frames per second.
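Since the evaluation script accepts a `--fps` argument (the demo command uses 12 while the data is at 25 fps), one simple way to change the frame rate of a pose sequence is uniform index subsampling. This is an illustrative sketch under that assumption, not the challenge's own resampling code.

```python
def resample_indices(n_frames: int, src_fps: float, dst_fps: float) -> list:
    """Pick source-frame indices approximating the target frame rate."""
    n_out = max(1, round(n_frames * dst_fps / src_fps))
    return [min(n_frames - 1, round(i * src_fps / dst_fps)) for i in range(n_out)]

frames = list(range(50))  # 2 seconds of poses at 25 fps
idx = resample_indices(len(frames), 25, 12)
print(len(idx))  # 24 frames ≈ 2 seconds at 12 fps
```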
For each video, we extract Mediapipe Holistic keypoints and use the approach from "Improving 3D Pose Estimation For Sign Language" by Ivashechkin, Mendez, and Bowden to uplift the predictions to 3D. We process the skeleton and provide a 178-keypoint representation:
- 21 keypoints for each hand
- 128 keypoints for the face
- 8 keypoints for the body
The face is a subset of the 478-keypoint representation from the Mediapipe face mesh. See `make_128_face_from_478.py` to create the 128 face keypoints from Mediapipe's 478 (`mp_face_mesh.FaceMesh` with `refine_landmarks=True`).
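The 178-keypoint frame can be sliced into its parts as sketched below. The ordering (left hand, right hand, face, body) is an assumption made for illustration; check the challenge's skeleton documentation for the actual layout.

```python
# Keypoint counts from the skeleton description above.
LEFT_HAND, RIGHT_HAND, FACE, BODY = 21, 21, 128, 8
assert LEFT_HAND + RIGHT_HAND + FACE + BODY == 178

frame = [[0.0, 0.0, 0.0] for _ in range(178)]  # one frame: K=178 joints, D=3

# ASSUMED ordering — verify against the challenge's skeleton information.
left = frame[:21]
right = frame[21:42]
face = frame[42:170]
body = frame[170:178]
print(len(left), len(right), len(face), len(body))  # 21 21 128 8
```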
Example visualizations can be found in the ./demo_plots/ directory.
The skeleton format follows the standard used in the SLRTP CVPR 2025 Challenge. For detailed information on the format specification, visit the challenge Skeleton Information.
Command:
```bash
python main.py ./data/dev.pt ./data/dev.pt ./backTranslation_PHIX_model --tag ground_truth_dev --fps 25
```

Results:

```json
{
    "bleu": {
        "bleu1": 35.64289683751959,
        "bleu2": 23.060216602791453,
        "bleu3": 16.907030146272817,
        "bleu4": 13.378651856913775
    },
    "chrf": 34.82214188714514,
    "rouge": 36.85085849846843,
    "wer": 83.37019018133569,
    "dtw_mje": 0.0,
    "total_distance": 1.0
}
```

Command:
```bash
python main.py ./data/test.pt ./data/test.pt ./backTranslation_PHIX_model --tag ground_truth_test --fps 25
```

Results:

```json
{
    "bleu": {
        "bleu1": 34.39639390982878,
        "bleu2": 22.038589437767946,
        "bleu3": 16.092729031718463,
        "bleu4": 12.777148564612425
    },
    "chrf": 34.58509769785496,
    "rouge": 35.19608041310144,
    "wer": 85.77470203767781,
    "dtw_mje": 0.0,
    "total_distance": 1.0
}
```

When using this evaluation framework or participating in the challenge, please cite:
```bibtex
@inproceedings{walsh2025slrtp,
  title={SLRTP2025 Sign Language Production Challenge: Methodology, Results, and Future Work},
  author={Walsh, Harry and Fish, Ed and Sincan, Ozge Mercanoglu and Lakhal, Mohamed Ilyes and Bowden, Richard and Fox, Neil and Cormier, Kearsy and Woll, Bencie and Wu, Kepeng and Li, Zecheng and Zhao, Weichao and Wang, Haodong and Zhou, Wengang and Li, Houqiang and Tang, Shengeng and He, Jiayi and Wang, Xu and Zhang, Ruobei and Wang, Yaxiong and Cheng, Lechao and Tasyurek, Meryem and Kiziltepe, Tugce and Keles, Hacer Yalim},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
```


