- Install PyTorch.

```bash
conda create -n d4c python=3.10 scipy
conda activate d4c
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

- Install OpenAI CLIP.

```bash
cd CLIP
python setup.py install
cd ..
```

- Install CLIP_benchmark.

```bash
cd CLIP_benchmark
python setup.py install
cd ..
```

- Install other requirements.

```bash
pip install easydict
```
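To sanity-check the environment after installation, the stdlib-only snippet below reports whether each expected package is importable. The import names (in particular `clip` and `clip_benchmark`) are assumptions inferred from the install steps above, not verified against the repositories:

```python
import importlib.util

# Import names assumed from the install steps above (clip / clip_benchmark
# are guesses at the installed module names, not confirmed by the repos).
packages = ["torch", "torchvision", "scipy", "clip", "clip_benchmark", "easydict"]

def check_packages(names):
    """Return a dict mapping each package name to True if it is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_packages(packages)
for name, ok in status.items():
    print(f"{name}: {'OK' if ok else 'MISSING'}")
```

Any package reported `MISSING` should be reinstalled before running the quantization scripts.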
If a valid --dataset_root is provided, CLIP_benchmark will automatically download the CIFAR-10 and CIFAR-100 datasets. For ImageNet-1K, you may either place the original ILSVRC .tar archive in the specified root, or reformat your existing ImageNet dataset using the following structure:
```
|-- ImageNet-1K
|   |-- train
|   |   |-- n01440764
|   |   |-- ...
|   |-- val
|   |   |-- n01440764
|   |   |-- ...
|   |-- meta.bin
```
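A minimal stdlib sketch for checking that a reformatted ImageNet directory matches this layout. The split names, class-folder convention, and `meta.bin` filename come from the tree above; everything else (function name, return format) is illustrative:

```python
from pathlib import Path

def check_imagenet_layout(root):
    """Check that `root` contains train/, val/ (each with class subfolders
    such as n01440764) and a meta.bin file, per the tree above.
    Returns a list of human-readable problems; an empty list means OK."""
    root = Path(root)
    problems = []
    for split in ("train", "val"):
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing directory: {split_dir}")
        elif not any(p.is_dir() for p in split_dir.iterdir()):
            problems.append(f"no class subfolders under: {split_dir}")
    if not (root / "meta.bin").is_file():
        problems.append(f"missing file: {root / 'meta.bin'}")
    return problems
```

For example, `check_imagenet_layout("<DATASET_ROOT>/ImageNet-1K")` should return an empty list before you point `--dataset_root` at that directory.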
We use the original pre-trained CLIP models provided by OpenAI for our experiments. Once a valid --model name is specified, CLIP will automatically download the corresponding pre-trained weights.
Use the following command to perform D4C quantization:

```bash
python d4c/solver/test_quant.py \
    --dataset <DATASET_NAME> \
    --dataset_root <DATASET_ROOT> \
    --model <MODEL_NAME> \
    --q_config ./exp/<QCONFIG>.yaml \
    --recon \
    --dfq \
    --gen_img
```
Here, <DATASET_NAME> and <DATASET_ROOT> refer to the name and directory of the dataset, <MODEL_NAME> specifies the encoder model, and <QCONFIG>.yaml is the corresponding quantization configuration file.
For example, to perform W6A6 quantization on ViT-B/32 evaluated on CIFAR-10, use the following command:

```bash
python d4c/solver/test_quant.py \
    --dataset cifar10 \
    --dataset_root ./your/dataset/root \
    --model ViT-B/32 \
    --q_config ./exp/config66.yaml \
    --recon \
    --dfq \
    --gen_img
```
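For intuition on the notation: W6A6 means weights and activations are each mapped onto a 6-bit integer grid. The sketch below shows plain symmetric uniform quantize-dequantize as an illustration of that idea only; D4C's actual quantizers are defined by the `q_config` YAML files, not by this code:

```python
def quantize_dequantize(values, n_bits=6):
    """Symmetric uniform quantization: scale floats onto a signed n-bit
    integer grid, round, clamp, and map back. Illustrative stand-in for
    a W6A6 weight/activation quantizer."""
    qmax = 2 ** (n_bits - 1) - 1              # e.g. 31 for 6 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return [q * scale for q in quantized]

weights = [0.31, -0.82, 0.05, 1.0, -0.44]
print(quantize_dequantize(weights, n_bits=6))
```

Lower bit-widths shrink the grid (e.g. 16 levels at 4 bits), which is why reconstruction-based PTQ and better calibration data matter more as precision drops.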
A detailed description of the arguments is provided below for your reference:
| Argument | Description | Options |
|---|---|---|
| dataset | Dataset for evaluation. | cifar10, cifar100, imagenet1k |
| dataset_root | Directory for dataset download. | NA |
| model | CLIP model (image encoder). | RN50, RN101, RN50x4, RN50x16, RN50x64, ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336px |
| fp_model | Run FP model without quantization. | NA |
| q_config | Quantization configuration file. | NA |
| recon | Apply PTQ using reconstruction. | NA |
| custom_file | PGSI prompt and template. | NA |
| dfq | Activate DFQ model. Run PTQ without this argument. | NA |
| gen_img | Generate pseudo images for DFQ. | NA |
| gen_method | Select method for pseudo image generation. | baseline, d4c |
| d4c_config | Ablation with different combinations of PGSI, SCG, and PAE. | 0, 1, 2, 3 |
| gen_batch_size | Batch size for pseudo image generation. | NA |
| gen_lr | Learning rate for pseudo image generation. | NA |
| gen_iter | Total iterations for pseudo image generation. | NA |
| img_path | Path for saving and visualizing the generated samples. | your path, None |
All experiments were conducted on an RTX A6000 GPU with 48 GB of memory. However, we believe that a more commonly available GPU with 16 GB of memory is sufficient to reproduce the results reported in the paper. For reference, the reconstruction for ViT-B/32 requires 6,593 seconds, and the time (in seconds) to generate 128 pseudo images on the RTX A6000 (with a batch size of 16) is listed below:
| Method | RN50 | RN50x16 | ViT-B/32 | ViT-B/16 |
|---|---|---|---|---|
| BNS | 1,280 | 9,515 | NA | NA |
| PSE | NA | NA | 3,488 | 44,398 (bs=8) |
| D4C | 1,623 | 12,491 | 1,434 | 5,346 |
Data-Free Quantization (DFQ) offers a practical solution for model compression without requiring access to real data, making it particularly attractive in privacy-sensitive scenarios. While DFQ has shown promise for unimodal models, its extension to Vision-Language Models such as Contrastive Language-Image Pre-training (CLIP) models remains underexplored. In this work, we reveal that directly applying existing DFQ techniques to CLIP results in substantial performance degradation due to two key limitations: insufficient semantic content and low intra-image diversity in synthesized samples. To tackle these challenges, we propose D4C, the first DFQ framework tailored for CLIP. D4C synthesizes semantically rich and structurally diverse pseudo images through three key components: **1)** Prompt-Guided Semantic Injection aligns generated images with real-world semantics using text prompts; **2)** Structural Contrastive Generation reproduces compositional structures of natural images by leveraging foreground-background contrastive synthesis; and **3)** Perturbation-Aware Enhancement applies controlled perturbations to improve sample diversity and robustness. These components jointly empower D4C to synthesize images that are both semantically informative and structurally diverse, effectively bridging the performance gap of DFQ on CLIP. Extensive experiments validate the effectiveness of D4C, showing significant performance improvements on various bit-widths and models.
If you find this repo useful, please cite our paper. Thanks.

```bibtex
@article{zhang2025d4c,
  title={D4C: Data-Free Quantization for Contrastive Language-Image Pre-training Models},
  author={Zhang, Wenlun and Zhong, Yunshan and Ding, Zihao and Li, Xinyu and Yoshioka, Kentaro},
  journal={arXiv preprint arXiv:2511.15411},
  year={2025}
}
```

Our work builds upon QDrop, CLIP, and CLIP_benchmark. We sincerely appreciate their pioneering efforts, which provided the foundation and codebase for developing and evaluating D4C.