Enhancing Multi-Image Understanding Through Delimiter Token Scaling (ICLR 2026)

by Minyoung Lee, Yeji Park, Dongjun Hwang, Yejin Kim, Seong Joon Oh, Junsuk Choe

This repository contains the code for the paper "Enhancing Multi-Image Understanding Through Delimiter Token Scaling" presented at ICLR 2026.

Abstract: Large Vision-Language Models (LVLMs) achieve strong performance on single-image tasks, but their performance declines when multiple images are provided as input. One major reason is the cross-image information leakage, where the model struggles to distinguish information across different images. Existing LVLMs already employ delimiter tokens to mark the start and end of each image, yet our analysis reveals that these tokens fail to effectively block cross-image information leakage. To enhance their effectiveness, we propose a method that scales the hidden states of delimiter tokens. This enhances the model’s ability to preserve image-specific information by reinforcing intra-image interaction and limiting undesired cross-image interactions. Consequently, the model is better able to distinguish between images and reason over them more accurately. Experiments show performance gains on multi-image benchmarks such as Mantis, MuirBench, MIRB and QBench2. We further evaluate our method on text-only tasks that require clear distinction. The method improves performance on multi-document and multi-table understanding benchmarks, including TQABench, MultiNews and WCEP-10. Notably, our method requires no additional training or inference cost.
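The core operation described in the abstract, multiplying the hidden states of delimiter tokens by a constant, can be illustrated with a toy sketch. This is not the repository's implementation: the function name `scale_delimiter_states`, the list-based representation, and the choice to scale the whole hidden vector are assumptions made for illustration only. Where exactly the released code applies the scaling inside the model is not shown here.

```python
from typing import List, Sequence


def scale_delimiter_states(
    hidden: Sequence[Sequence[float]],
    is_delimiter: Sequence[bool],
    scale: float,
) -> List[List[float]]:
    """Return a copy of `hidden` with delimiter-token vectors multiplied by `scale`.

    hidden:       one hidden-state vector per token (toy stand-in for a layer's states)
    is_delimiter: True at positions holding an image start/end delimiter token
    """
    if len(hidden) != len(is_delimiter):
        raise ValueError("need exactly one delimiter flag per token")
    return [
        [value * scale for value in row] if flag else list(row)
        for row, flag in zip(hidden, is_delimiter)
    ]


# Toy sequence (hypothetical layout): <start> patch patch <end> <start> patch <end>
hidden = [[1.0, 1.0] for _ in range(7)]
is_delimiter = [True, False, False, True, True, False, True]
scaled = scale_delimiter_states(hidden, is_delimiter, scale=8.0)
```

Only the delimiter positions change; the image-patch vectors are copied through untouched, and the input list is not modified.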

TODO / Code Release Plan

We are in the process of cleaning up and preparing the codebase for public release. The following components will be released progressively:

  • Multi-image understanding evaluation code
    (Delimiter token scaling integrated into LVLM inference and evaluation pipelines)

  • LLM benchmark code
    (Multi-document and multi-table benchmarks including TQABench, MultiNews, and WCEP-10)

  • Visualization code
    (Attention maps and interaction analysis for delimiter tokens)

The full code will be released upon final preparation.

Installation

Pull the Docker image

docker pull myelena/delim_scaling:slim

Install dependencies

Inside the container:

git clone https://github.com/MYMY-young/DelimScaling.git
cd DelimScaling

cd transformers
pip install -e .

cd ../qwen-vl-utils
pip install -e .

pip install flash-attn==2.7.4.post1

Running Evaluation

accelerate launch --num_processes 4 --main_process_port 12345 -m lmms_eval \
    --model qwen2_5_vl \
    --model_args pretrained=Qwen/Qwen2.5-VL-3B-Instruct,device_map=cuda,attn_implementation=flash_attention_2 \
    --tasks mantis \
    --batch_size 1 \
    --delim_scaling True \
    --scale 8 \
    --select_layer 0,1,2,3
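To try several scaling factors, the command above can be assembled programmatically. The sketch below is a hypothetical helper, not part of this repository: `build_eval_command` is an invented name, the flags are copied verbatim from the command above, and the scale values 2 and 4 are arbitrary placeholders rather than recommended settings.

```python
import shlex
from typing import List, Sequence


def build_eval_command(scale: int, select_layer: Sequence[int], task: str = "mantis") -> List[str]:
    """Assemble the evaluation command shown above for one (scale, layers) setting."""
    return [
        "accelerate", "launch", "--num_processes", "4", "--main_process_port", "12345",
        "-m", "lmms_eval",
        "--model", "qwen2_5_vl",
        "--model_args",
        "pretrained=Qwen/Qwen2.5-VL-3B-Instruct,device_map=cuda,attn_implementation=flash_attention_2",
        "--tasks", task,
        "--batch_size", "1",
        "--delim_scaling", "True",
        "--scale", str(scale),
        "--select_layer", ",".join(str(i) for i in select_layer),
    ]


# Build one command per candidate scale and print it as a shell-ready string.
commands = [build_eval_command(scale, [0, 1, 2, 3]) for scale in (2, 4, 8)]
for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line can be pasted into the container shell, or the argument lists can be passed to `subprocess.run(cmd, check=True)` to run the evaluations one after another.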

Key Arguments

  • --model_args pretrained=... : Pretrained model to evaluate, given as either a local path or a Hugging Face model identifier (for example, Qwen/Qwen2.5-VL-3B-Instruct).
  • --tasks : Evaluation task to run, such as mantis.
  • --delim_scaling : Set to True to enable delimiter token scaling.
  • --scale : Scaling factor applied to the hidden states of delimiter tokens (8 in the example above).
  • --select_layer : Comma-separated indices of the layers where scaling is applied (0,1,2,3 in the example above).
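As a rough illustration of how the last three arguments fit together, the sketch below parses a `--select_layer` string and decides whether scaling would apply at a given layer. The helper names `parse_select_layer` and `should_scale` are hypothetical and do not come from this repository, and the assumption that layer indices are non-negative integers is ours.

```python
from typing import List


def parse_select_layer(value: str) -> List[int]:
    """Turn a comma-separated string such as "0,1,2,3" into sorted unique layer indices."""
    layers = {int(part) for part in value.split(",") if part.strip()}
    if any(idx < 0 for idx in layers):
        raise ValueError("layer indices must be non-negative")
    return sorted(layers)


def should_scale(layer_idx: int, delim_scaling: bool, select_layer: List[int]) -> bool:
    """Scaling applies only when the feature is enabled and the layer is selected."""
    return delim_scaling and layer_idx in select_layer


layers = parse_select_layer("0,1,2,3")
```

With `--delim_scaling True` and `--select_layer 0,1,2,3`, this check would pass for the four listed layers and fail for every other layer, where hidden states would be left unchanged.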

Supported Tasks

Multi-Image Understanding benchmarks:

Supported Models

For Multi-Image Understanding:

Acknowledgments

Our code is based on lmms-eval and Hugging Face Transformers. If you use our work, please consider citing these works as well.

Citation

If you find this work useful for your research, please consider citing:

@inproceedings{lee2026delimscale,
  title={Enhancing Multi-Image Understanding through Delimiter Token Scaling},
  author={Lee, Minyoung and Park, Yeji and Hwang, Dongjun and Kim, Yejin and Oh, Seong Joon and Choe, Junsuk},
  booktitle={Proceedings of the 14th International Conference on Learning Representations},
  year={2026}
}
