RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Chest X-ray with Zero-Shot Multi-Task Capability [NeurIPS 2025]

📝 Paper • 🤗 Model • 🧩 Codes

Introduction

Figure 1. Comparison of attention maps and the proposed VL similarity map for visualizing VL alignment. (a) While traditional attention maps inevitably exhibit high values at certain points due to softmax activation, the proposed VL similarity maps yield low values for unrelated image-text pairs. (b) Their fixed scale, originating from cosine similarity, enables open-vocabulary semantic segmentation through simple thresholding.

Figure 2. Overview of the RadZero framework. Finding-sentences are extracted from reports and aligned with local image patch features through similarity-based cross-attention (VL-CABS), enabling zero-shot classification, grounding, and segmentation.
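
The contrast described in Figure 1 can be reproduced with a toy example. The sketch below is not RadZero code: the embeddings and the helper names (`softmax`, `cosine`, `compare`) are made up for illustration. It only shows why softmax attention always assigns a high weight to some patch, while cosine similarity can stay low for every patch when the text is unrelated.

```python
import math


def softmax(scores):
    """Softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def compare(patches, text):
    """Return (softmax attention, cosine similarities) of one text over all patches."""
    dots = [sum(a * b for a, b in zip(p, text)) for p in patches]
    return softmax(dots), [cosine(p, text) for p in patches]


# Toy data (hypothetical): three patch embeddings and two text embeddings.
patches = [[1.0, 0.0, 0.1], [0.9, 0.2, 0.0], [0.8, 0.1, 0.1]]
related_text = [1.0, 0.1, 0.0]
unrelated_text = [-1.0, -0.1, 0.0]

for name, text in [("related", related_text), ("unrelated", unrelated_text)]:
    attention, similarity = compare(patches, text)
    # Attention weights always sum to 1, so their maximum is at least 1/len(patches).
    # Cosine similarity has no such constraint and stays in the fixed range [-1, 1].
    print(name, "max attention:", max(attention), "max similarity:", max(similarity))
```

Because cosine similarity lives on a fixed scale, a single threshold can be reused across images and prompts, which is the property Figure 1(b) relies on.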

Abstract

Recent advancements in multimodal models have significantly improved vision-language (VL) alignment in radiology. However, existing approaches struggle to effectively utilize complex radiology reports for learning and offer limited interpretability through attention probability visualizations. To address these challenges, we introduce RadZero, a novel framework for VL alignment in chest X-ray with zero-shot multi-task capability. A key component of our approach is VL-CABS (Vision-Language Cross-Attention Based on Similarity), which aligns text embeddings with local image features for interpretable, fine-grained VL reasoning. RadZero leverages large language models to extract concise semantic sentences from radiology reports and employs multi-positive contrastive training to effectively capture relationships between images and multiple relevant textual descriptions. It uses a pre-trained vision encoder with additional trainable Transformer layers, allowing efficient high-resolution image processing. By computing similarity between text embeddings and local image patch features, VL-CABS enables zero-shot inference with similarity probability for classification, and pixel-level VL similarity maps for grounding and segmentation. Experimental results on public chest radiograph benchmarks show that RadZero outperforms state-of-the-art methods in zero-shot classification, grounding, and segmentation. Furthermore, VL similarity map analysis highlights the potential of VL-CABS for improving explainability in VL alignment. Additionally, qualitative evaluation demonstrates RadZero's capability for open-vocabulary semantic segmentation, further validating its effectiveness in medical imaging.
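
To make the idea of multi-positive contrastive training more concrete, here is a minimal sketch of one common formulation, in which each positive sentence is contrasted against the negatives only. This is an illustration under assumptions and may differ from the exact loss used in the paper. The function name `multi_positive_nce` and the temperature value are hypothetical.

```python
import math


def multi_positive_nce(sim_row, positive_idx, temperature=0.07):
    """Contrastive loss for one image with several positive sentences.

    sim_row: similarities between the image and every sentence in the batch.
    positive_idx: indices of the sentences that describe this image.
    temperature: hypothetical scaling value, not taken from the paper.

    Each positive is scored against the negatives only, so sentences that
    describe the same image do not compete with one another.
    """
    positives = sorted(set(positive_idx))
    if not positives:
        raise ValueError("at least one positive sentence is required")
    logits = [s / temperature for s in sim_row]
    # Sum of exponentiated logits over negative sentences only.
    neg_sum = sum(math.exp(l) for i, l in enumerate(logits) if i not in positives)
    losses = []
    for p in positives:
        pos = math.exp(logits[p])
        losses.append(-math.log(pos / (pos + neg_sum)))
    return sum(losses) / len(losses)


# Toy similarities (made up): sentences 0 and 1 match the image, 2 and 3 do not.
sims = [0.9, 0.8, 0.1, -0.2]
print("aligned positives:", multi_positive_nce(sims, [0, 1]))
print("misaligned positives:", multi_positive_nce(sims, [2, 3]))
```

The loss is small when all labeled positives score higher than the negatives and grows when they do not, regardless of how many positives an image has.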

Updates

2025-11-23: Code and model checkpoints of RadZero have been released. 🚀
2025-10-19: RadZero3D (an extension of RadZero to Chest CT) has been published at the ICCV 2025 VLM3D Workshop! 🎉 You can read the paper here.
2025-09-18: RadZero has been accepted to NeurIPS 2025! 🎉 (5,290 / 21,575 = 24.52% acceptance rate)

RadZero Model Inference

Install dependencies

pip install -r requirements.txt

Model Inference Codes

RadZero performs zero-shot classification, grounding, and segmentation on chest X-rays using the RadZero model hosted on 🤗 Hugging Face.

import warnings

import torch
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

from utils import model_inference

# Suppress specific warnings for cleaner logs
warnings.filterwarnings("ignore", category=UserWarning)


def load_model(device, dtype):
    """Load the RadZero tokenizer, image processor, and model from Hugging Face."""
    tokenizer = AutoTokenizer.from_pretrained("Deepnoid/RadZero")
    image_processor = AutoImageProcessor.from_pretrained("Deepnoid/RadZero")

    model = AutoModel.from_pretrained(
        "Deepnoid/RadZero",
        trust_remote_code=True,
        torch_dtype=dtype,
        device_map=device,
    )

    models = {
        "tokenizer": tokenizer,
        "image_processor": image_processor,
        "model": model,
    }
    return models


if __name__ == "__main__":
    # Set up constants; fall back to CPU when no GPU is available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    dtype = torch.float32

    # load models
    models = load_model(device, dtype)

    # path to the input chest X-ray image
    image_path = "cxr_image.jpg"

    # inference
    similarity_prob, similarity_map = model_inference(
        image_path, "There is fibrosis", **models
    )

    print(similarity_prob)
    print(similarity_map.min())
    print(similarity_map.max())
    print(similarity_map.shape)
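
As a possible next step, the similarity map can be turned into a segmentation mask and a grounding box by simple thresholding. The sketch below is an assumption-laden example, not part of the repository: `mask_and_box` is a hypothetical helper, the threshold of 0.5 is arbitrary, and it expects a 2D array. If `similarity_map` is a torch tensor with extra dimensions or on the GPU, squeeze it and convert it with `.detach().cpu().numpy()` first, and choose a threshold that suits the range reported by the min/max prints above.

```python
import numpy as np


def mask_and_box(similarity_map, threshold=0.5):
    """Binarize a 2D similarity map and return (mask, bounding box or None).

    The box is (x_min, y_min, x_max, y_max) in pixel coordinates of the map.
    The default threshold is a placeholder, not a value from the paper.
    """
    sim = np.asarray(similarity_map, dtype=np.float32)
    if sim.ndim != 2:
        raise ValueError(f"expected a 2D map, got shape {sim.shape}")
    mask = sim >= threshold
    if not mask.any():
        # No pixel passes the threshold: the finding is treated as absent.
        return mask, None
    ys, xs = np.nonzero(mask)
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    return mask, box


# Synthetic map standing in for a real RadZero output.
demo = np.zeros((4, 5), dtype=np.float32)
demo[1:3, 2:4] = 0.9
mask, box = mask_and_box(demo, threshold=0.5)
print(mask.astype(int))
print(box)
```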

Training

Download dataset

Training

MIMIC-CXR (preprocessed JSON used in RadZero)
- Hugging Face: Deepnoid/RadZero – data/MIMIC-CXR

Evaluation

All evaluation benchmarks used in the paper are provided under the data/ directory of the Hugging Face repo:
Deepnoid/RadZero – data/

The original sources of the preprocessed data are listed here: link

Zero-shot classification datasets
- OpenI
- PadChest
- CheXpert
- ChestXray14
- ChestXDet10
Zero-shot grounding datasets
- ChestXDet10
- MS-CXR
Zero-shot segmentation datasets
- SIIM
- RSNA

Run Command

Set the appropriate data and output paths for your environment in exp/cxr_pt/configs/paths.yaml.

Make sure to update these fields to reflect where your dataset is stored and where experiment outputs should be saved.

Use the command below to start training the RadZero model.

cd RadZero
PYTHONPATH=. torchrun --nproc_per_node=4 --nnodes=1 exp/cxr_pt/run.py --add_cfg_list radzero paths

References

Dataset
- MIMIC-CXR
Zero-shot evaluation
- CARZero (CVPR 2024)
Pretrained models
- Vision encoder: XrayDINOv2
- Text encoder: all-mpnet-base-v2
Finetuning baseline
- MGCA (NeurIPS 2022)

Acknowledgments

This work was supported by the Technology Innovation Program (RS-2025-02221011, Development of Medical-Specialized Multimodal Hyperscale Generative AI Technology for Global Integration) funded by the Ministry of Trade, Industry & Energy (MOTIE, South Korea).
