GitHub - Justlovesmile/SynCLIP: [CVPR 2026] Official Code for “SynCLIP: Synonym-Coherent Language-Image Pretraining for Robust Open-Vocabulary Dense Perception”

SynCLIP: Synonym-Coherent Language-Image Pretraining for Robust Open-Vocabulary Dense Perception

CVPR 2026 Accept

Abstract

Open-vocabulary dense perception (OVDP) aims to localize objects unseen during training by leveraging textual knowledge. Despite the remarkable progress of recent CLIP-based approaches, we identify a critical limitation: synonym-induced grounding inconsistency, where semantically equivalent expressions yield disparate spatial attention patterns. This inconsistency undermines the robustness and performance of existing methods in real-world OVDP applications. To address this issue, we propose SynCLIP, a Synonym-Coherent Language-Image Pretraining framework that enhances synonym-robust grounding for OVDP tasks. SynCLIP introduces a Semantic-consistent Spatial Attention alignment (SSA) module to enhance spatial attention consistency by minimizing discrepancies between attention maps of original and synonymous expressions. Furthermore, a Spatial Attention Refinement (SAR) module selectively strengthens the most semantically relevant spatial regions within aligned maps, resulting in more precise and stable grounding. To support synonym-coherent pretraining, we also construct a Synonym-Enriched Visual Corpus (SEViC), which augments each category with multiple synonyms and textual definitions. Extensive experiments on multiple benchmarks demonstrate that SynCLIP substantially improves grounding consistency under diverse linguistic variants and achieves state-of-the-art performance among CLIP-based OVDP methods.

Dataset

The main experiments are conducted based on COCO and LVIS datasets. Please prepare datasets and organize them like the following:

SynCLIP/
├── dataset
    ├── coco
        ├── annotations
            ├── instances_train2017.json  # only access images
            ├── sevic.json             # only access synonyms and definitions 
            ├── panoptic_val2017.json  # for validation
            ├── panoptic_val2017       # for validation
        ├── train2017
        ├── val2017
    ├── lvis_v1
        ├── annotations
            ├── lvis_v1_train.json  # only access images
        ├── train2017    # the same with coco
        ├── val2017      # the same with coco

The distillation process of SynCLIP only requires images and expressions from SEViC. The downstream finetuning process of SynCLIP is built on OV-COCO and OV-LVIS, following CLIPSelf and DeCLIP.

Model

Please download the pretrained weights from EVA-CLIP, DeCLIP and DINOv2 and organize them as shown below.

SynCLIP/
├── ckpts
    ├── EVA02_CLIP_B_psz16_s8B.pt
    ├── EVA02_CLIP_L_336_psz14_s6B.pt
    ├── DeCLIP_EVA-B_DINOv2-B_csa_0.05_2.0.pt
    ├── DeCLIP_EVA-L_DINOv2-L_csa_0.05_2.0.pt
    ├── dinov2_vitb14_reg4_pretrain.pth
    ├── dinov2_vitl14_reg4_pretrain.pth

Statement

Acknowledgement

This work is built on many amazing research works and open-source projects, thanks a lot to all the authors for sharing!

Citation

If you find our work useful in your research, please consider giving a star ⭐ and citing the following paper 📝.

@inproceedings{xie2026synclip,
  title={SynCLIP: Synonym-Coherent Language-Image Pretraining for Robust Open-Vocabulary Dense Perception},
  author={Xie, Mingjie and He, Guangjun and Xu, Dongli and Lin, Youtian and Li, Hongjue and Feng, Pengming and Guan, Jian and Deng, Yue},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
assets		assets
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SynCLIP: Synonym-Coherent Language-Image Pretraining for Robust Open-Vocabulary Dense Perception

CVPR 2026 Accept

Abstract

Dataset

Model

Statement

Acknowledgement

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

SynCLIP: Synonym-Coherent Language-Image Pretraining for Robust Open-Vocabulary Dense Perception

CVPR 2026 Accept

Abstract

Dataset

Model

Statement

Acknowledgement

Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages