
DynamiCS

DynamiCS is an efficient and long-tail-aware data sampling method for vision-language model (VLM) pre-training. This repository contains the code used to build dynamic cluster-based sampling probabilities and to plug them into an OpenCLIP-style training pipeline.


Main Results

Zero-shot top-1 classification accuracy (%) on ImageNet-1K and Let It Wag!, compared against full-training CLIP baselines. All models use a ViT-B/16 image encoder.

Figure: GPU-hours vs. accuracy for full-training CLIP baselines and DynamiCS.

| Models | Dataset (Data Size) | Samples Seen @ Resolution | Tokens | ImageNet-1K | Let It Wag! | GPU-hours |
|---|---|---|---|---|---|---|
| OpenAI-WIT | --- (400M) | 12.8B@224 | 274 | 68.3 | 37.9 | 10700 |
| MetaCLIP-400M | --- (400M) | 12.8B@224 | 274 | 70.8 | 46.5 | ~10700 |
| OpenCLIP | LAION-400M (400M) | 12.8B@224 | 274 | 67.1 | 39.1 | 10736 |
| DynamiCS (Ours) | LAION-400M (298M) | 2.56B@112 + 128M@224 | 81 | 67.5 | 45.5 | 299 |
| DynamiCS (Ours) [ckpt] | DataComp-DFN (130M) | 1.28B@112 + 128M@224 | 81 | 71.3 | 50.2 | 163 |
| DynamiCS (Ours) [ckpt] | DataComp-DFN (130M) | 2.56B@112 + 128M@224 | 81 | 72.6 | 52.0 | 299 |

Key takeaways from the paper:

  • On LAION-400M, DynamiCS reaches 67.5 on ImageNet-1K and 45.5 on Let It Wag! using 299 GPU-hours, compared with 10736 GPU-hours for full-training OpenCLIP.
  • On DataComp-DFN, DynamiCS reaches 72.6 on ImageNet-1K and 52.0 on Let It Wag! with only 81 tokens and 299 GPU-hours.
  • DynamiCS is especially strong on long-tail recognition, outperforming OpenCLIP, MetaCLIP-400M, and OpenAI-WIT on Let It Wag! while using substantially less compute.

The DataComp-DFN checkpoints and released SHA256-keyed sampling file are hosted on Hugging Face: MingliangLiang3/DynamiCS-ViT-B-16-DataComp-DFN

Installation

This repository follows the standard OpenCLIP installation flow.

Virtualenv

python3 -m venv .venv
source .venv/bin/activate
make install
make install-training

Other Dependencies

For the DynamiCS preprocessing pipeline and sampling-aware training, you will also need:

  • orjson for loading sampling-probability JSON files in src/open_clip_train/data.py
  • pyarrow for parquet metadata written by tests/DynamiCS/embedding_dinov2.py
  • a FAISS build such as faiss-cpu or faiss-gpu for clustering and nearest-neighbor search
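These extras are not installed by the OpenCLIP make targets. A minimal install sketch (package names as listed above; version pins omitted):

```shell
# DynamiCS-specific extras on top of the OpenCLIP environment.
pip install orjson pyarrow

# Pick ONE FAISS build; faiss-gpu additionally requires a matching CUDA setup.
pip install faiss-cpu
```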

For the exact environment used in our experiments, see myenv.yml.

Data

Data Download

We use LAION-400M and a DataComp subset filtered by DFN. Choose one of the following options:

Option A — Download LAION-400M and DataComp directly:

  • Follow the instructions at DataComp.
  • Follow the instructions at img2dataset to download LAION-400M.

Option B — Build the DFN-filtered subset from scratch:

  1. Download the DFN filter index from apf1/datafilteringnetworks_2b.
  2. Match the index with DataComp using adams-story/dfn-200m.
  3. Download the matched subset using img2dataset.

DynamiCS Pipeline

The full DynamiCS workflow, including DINOv2 embedding extraction, FAISS clustering, sampling-probability generation, OpenCLIP training examples, and Slurm script references, is documented on a separate page:

tests/DynamiCS/README.md
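For intuition, the kind of cluster-size-aware reweighting such a sampling file encodes can be sketched as follows. The temperature-style formula below is an illustrative assumption, not the exact rule used by DynamiCS (see tests/DynamiCS/README.md for the real pipeline):

```python
import numpy as np

def cluster_sampling_probs(cluster_ids, temperature=0.5):
    """Per-sample sampling probabilities that up-weight small (tail) clusters.

    Illustrative only: each sample in a cluster of size n_c gets weight
    n_c ** (temperature - 1), so probability mass shifts from head clusters
    toward tail clusters as the temperature drops below 1.
    """
    _, inverse, counts = np.unique(
        cluster_ids, return_inverse=True, return_counts=True
    )
    weights = counts[inverse].astype(np.float64) ** (temperature - 1.0)
    return weights / weights.sum()

# Example: cluster 0 holds 4 samples (head), cluster 1 holds 1 sample (tail);
# the lone tail sample receives a higher sampling probability.
probs = cluster_sampling_probs(np.array([0, 0, 0, 0, 1]))
```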

That guide also explains the difference between the filename-keyed sampling JSON used during training and the SHA256-keyed companion file released for open-source distribution.
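To use the released SHA256-keyed file with locally downloaded images, a lookup sketch (assuming, as a guess at the key convention, that keys are hex SHA256 digests of the raw image bytes):

```python
import hashlib
from pathlib import Path

def image_sha256(path):
    """Hex SHA256 digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def lookup_prob(path, sha_keyed_probs):
    """Look up a sampling probability by image hash; None if absent.

    sha_keyed_probs would typically be loaded from the released JSON,
    e.g. orjson.loads(Path(json_path).read_bytes()).
    """
    return sha_keyed_probs.get(image_sha256(path))
```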

Evaluation

We use CLIP Benchmark to evaluate on a standard suite of 38 datasets in zero-shot classification and retrieval settings.

Acknowledgements

DynamiCS is implemented on top of OpenCLIP. Please also cite and acknowledge the OpenCLIP project if you use this repository in your work.

This work used the Dutch national e-infrastructure with the support of the SURF Cooperative. The computations were carried out on the Snellius supercomputer.

License

This repository is released under the MIT License.

Citation

If you use DynamiCS in your research, please cite the paper:

@article{liang2026dynamics,
  title={Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training},
  author={Mingliang Liang and Zhuoran Liu and Arjen P. de Vries and Martha Larson},
  journal={arXiv preprint arXiv:2604.27932},
  year={2026}
}

The repository metadata is also available in CITATION.cff.

Contact

For questions about DynamiCS, checkpoint access, or potential collaboration, please open an issue on this repository or contact the authors.
