
STTS: Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

arXiv | Molmo2-4B-Lite-30% | Molmo2-4B-Lite-50%

Welcome to the official repository for the paper: Unified Spatio-Temporal Token Scoring for Efficient Video VLMs.

This guide will help you understand and implement STTS. While our implementation is built on top of Molmo2, the STTS mechanism is highly portable to any Vision-Language Model (VLM) architecture. We highly encourage you to adapt this code to fit your own custom codebase!

Setup

Our codebase relies entirely on the Molmo2 environment. To get started:

  1. Visit the official Molmo2 repository.
  2. Follow their provided setup instructions to configure your environment.

Note: STTS does not require any additional dependencies or packages beyond what is needed for Molmo2.

Training

The core STTS algorithm is implemented in olmo/nn/temporal_image_vit.py. Other included files, such as olmo/nn/temporal_vision_backbone.py and olmo/models/temporal_video_olmo, handle the token pruning and packing procedure as the token reduction propagates from the ViT to the LLM.
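The core idea — scoring tokens and keeping only the top fraction, while tracking the surviving indices so downstream packing knows which positions remain — can be sketched as follows. This is a minimal illustration; the function and variable names are ours, not the repo's actual API:

```python
import torch

def prune_tokens(tokens: torch.Tensor, scores: torch.Tensor, topk: float):
    """Keep the top `topk` fraction of tokens by score.

    tokens: (num_tokens, dim) vision tokens after some ViT layer
    scores: (num_tokens,) one score per token from a scorer head
    Returns the kept tokens and their original indices, so later modules
    (connector / LLM packing) can account for which positions survived.
    """
    num_keep = max(1, int(tokens.shape[0] * topk))
    # Take the highest-scoring tokens, then restore their original order.
    keep_idx = scores.topk(num_keep).indices.sort().values
    return tokens[keep_idx], keep_idx

tokens = torch.randn(100, 32)
scores = torch.randn(100)
kept, idx = prune_tokens(tokens, scores, topk=0.5)
print(kept.shape)  # torch.Size([50, 32])
```

With topk=0.5, half of the 100 tokens survive; the sorted index tensor is what lets the reduction propagate cleanly into the LLM's packed sequence.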

Example: Single-Node Training (8 GPUs)

Below is a sample bash command to train STTS on a single node with 8 GPUs. To enable multi-node training, please refer back to the Molmo2 repository documentation.

torchrun --nproc-per-node 8 \
    launch_scripts/train_image_video_sft.py /path/to/pretrained/model all-v5-video-academic-only \
    --wandb.allow_resume=False \
    --global_batch_size 64 \
    --device_batch_size=2 \
    --max_duration=6250 \
    --seq_len 9686 \
    --packing \
    --model.mm_preprocessor.max_frames=64 \
    --model.mm_preprocessor.max_subtitle_tokens=null \
    --model.mm_preprocessor.use_frame_special_tokens=True \
    --model.vision_backbone.vit.sdpa_backend=efficient \
    --model.vision_backbone.compile_vit=null \
    --model.vision_backbone.compile_connector=null \
    --data.num_workers=4 \
    --save_folder=/path/to/save/folder \
    --use_temporal_video_olmo \
    --model.mm_preprocessor.topk=0.5 \
    --model.vision_backbone.vit.topk=0.5 \
    --model.vision_backbone.vit.prune_at=3 \
    --model.vision_backbone.vit.prune_method=scorer

Key STTS Parameters

The following arguments are specific to configuring STTS as described in our paper:

  • /path/to/pretrained/model: We fine-tune all STTS variants from this checkpoint. Although it is pretrained with Qwen3-4B rather than Qwen3-4B-Instruct (which Molmo2 uses), this results in a minimal performance difference (as demonstrated in our paper).
  • all-v5-video-academic-only: The data mixture we use, representing the video-QA subset of Molmo2-Data. Ensure you have downloaded the required data following the Molmo2 guidelines.
  • --use_temporal_video_olmo: A mandatory flag required to enable the STTS architecture.
  • --model.mm_preprocessor.topk=0.5: The pruning ratio $k$ used by STTS (e.g., 0.5 removes 50% of all vision tokens from the input). Note: our code currently only supports 3x3 pooling.
  • --model.vision_backbone.vit.topk=0.5: Must match the mm_preprocessor.topk value above.
  • --model.vision_backbone.vit.prune_at=3: Designates the ViT layer after which the STTS scorer is injected (referred to as $l$ in the paper).
  • --model.vision_backbone.vit.prune_method=scorer: Determines the pruning method. Supported options:
      • scorer: STTS pruning (our method).
      • easy: Heuristic baseline that sorts and prunes tokens by neighbor-frame cosine similarity.
      • random: Random token pruning.
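For intuition, the easy baseline can be sketched roughly as follows. This is illustrative only — the names are hypothetical, and we assume tokens are arranged as (frames, patches, dim); tokens that closely match the same spatial position in the previous frame are treated as redundant and pruned first:

```python
import torch
import torch.nn.functional as F

def easy_prune(frames: torch.Tensor, topk: float) -> torch.Tensor:
    """frames: (T, N, D) — T frames, N patch tokens each.

    Scores each token by 1 minus its cosine similarity with the token at
    the same spatial position in the previous frame (first-frame tokens
    get the maximum score of 1), then keeps the top `topk` fraction.
    Returns flat indices into the (T*N,) token sequence.
    """
    T, N, D = frames.shape
    sim = F.cosine_similarity(frames[1:], frames[:-1], dim=-1)  # (T-1, N)
    scores = torch.ones(T, N)
    scores[1:] = 1 - sim  # similar (redundant) tokens score low
    flat = scores.flatten()
    num_keep = max(1, int(flat.numel() * topk))
    return flat.topk(num_keep).indices.sort().values

frames = torch.randn(1, 4, 8).repeat(4, 1, 1)  # 4 identical frames
keep = easy_prune(frames, topk=0.25)
print(keep.tolist())  # [0, 1, 2, 3] — only the first frame's tokens survive
```

On a static clip, later frames add no information, so the heuristic keeps only the first frame's tokens; STTS's learned scorer replaces this hand-crafted redundancy signal.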

For other customizable parameters, please review the TemporalVitConfig class inside olmo/nn/temporal_image_vit.py.

Evaluation

To evaluate the model, you can run the following sample command:

torchrun --nproc-per-node 8 launch_scripts/eval_molmo2.py /path/to/save/folder \
    --device_batch_size=2 \
    --mode=video

The evaluation process operates largely the same as in Molmo2. You can specify different evaluation subsets using the --mode flag (e.g., --mode=video for video-QA only, --mode=video_pointing for pointing tasks, etc.).

Citation

If you find this work useful in your research, please consider citing our paper:

@misc{zhang2026stts,
      title={Unified Spatio-Temporal Token Scoring for Efficient Video VLMs}, 
      author={Jianrui Zhang and Yue Yang and Rohun Tripathi and Winson Han and Ranjay Krishna and Christopher Clark and Yong Jae Lee and Sangho Lee},
      year={2026},
      eprint={2603.18004},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
