Vision-Function-Layer in Multimodal LLMs

⚠️ The Hugging Face package version must be exactly 4.50.0. If you use a different version, modify the vision token swapping code to match it.
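
A quick way to confirm the pinned version before running anything is a small check like the sketch below. It assumes the "huggingface package" refers to the `transformers` distribution, which this README does not state explicitly; the helper name `check_package_version` is hypothetical and not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED_VERSION = "4.50.0"  # version stated in this README


def check_package_version(package: str = "transformers",
                          required: str = REQUIRED_VERSION) -> bool:
    """Return True only if `package` is installed at exactly `required`.

    Assumption: `transformers` is the package the README means.
    """
    try:
        installed = version(package)
    except PackageNotFoundError:
        # Package is not installed at all.
        return False
    return installed == required


if __name__ == "__main__":
    if check_package_version():
        print(f"transformers=={REQUIRED_VERSION} found.")
    else:
        print(
            f"transformers=={REQUIRED_VERSION} not found; install that version "
            "or adapt the vision token swapping code to your version."
        )
```

If the check fails, either install the pinned version or adjust the token swapping code as noted above.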

Vision Token Dropping

This repository contains the implementation of Vision Token Dropping.
For a detailed explanation and the code, see the Vision-Token-Dropping folder.

🚀 Experiments

All experiments are conducted under the VFL-LoRA setup.
See our VFL-LoRA repository for the base code and environment setup.

✅ TODO List

[ ] Training data for VFL-LoRA
[✅] Open-Source Code
[✅] Publish arXiv Paper

Citation

If you find this work useful, please cite our paper:

@article{shi2025vision,
  title={Vision Function Layer in Multimodal LLMs},
  author={Shi, Cheng and Yu, Yizhou and Yang, Sibei},
  journal={arXiv preprint arXiv:2509.24791},
  year={2025}
}
