fix(qwen3_vl): align video timestamp token placement with transformers by kcz358 · Pull Request #154 · EvolvingLMMs-Lab/lmms-engine

kcz358 · 2026-04-13T09:27:12Z

Summary

Fixes #132

Timestamp token ordering: The Qwen3 VL processor was placing timestamp tokens after <|vision_start|>, producing <|vision_start|><timestamp><video_pads><|vision_end|>. The canonical Qwen3VLProcessor in transformers places timestamps before <|vision_start|>, producing <timestamp><|vision_start|><video_pads><|vision_end|>. This PR aligns lmms-engine with the upstream behavior.
temporal_patch_size vs merge_size: Changed _calculate_timestamps to use temporal_patch_size instead of merge_size, matching the transformers implementation. Both default to 2, so the bug was silent and the resulting timestamps were identical, but using the semantically correct parameter prevents wrong timestamps if the two values ever diverge.
Simplified the per-frame expansion logic from a 3-case if/elif/else (first/middle/last frame) into a single uniform pattern, since all frames now follow the same structure.
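For reference, the uniform per-frame pattern described above can be sketched roughly as follows. This is an illustration, not the code in qwen3_vl_processor.py: the function name `expand_video_placeholder`, the `<|video_pad|>` token string, and the `<X.X seconds>` timestamp format are assumptions made for the example.

```python
# Minimal sketch of the per-frame expansion after this change.
# Assumptions (not taken from the PR diff): the helper name, the
# "<|video_pad|>" pad token, and the "<X.X seconds>" timestamp format.
VISION_START = "<|vision_start|>"
VISION_END = "<|vision_end|>"
VIDEO_PAD = "<|video_pad|>"


def expand_video_placeholder(timestamps, tokens_per_frame):
    """Return the expanded token string for one video.

    Every frame uses the same structure, so no first/middle/last
    special cases are needed:
        <timestamp><|vision_start|><video_pads><|vision_end|>
    """
    parts = []
    for ts in timestamps:
        timestamp_token = f"<{ts:.1f} seconds>"
        # The timestamp goes BEFORE <|vision_start|>, not after it.
        parts.append(
            timestamp_token
            + VISION_START
            + VIDEO_PAD * tokens_per_frame
            + VISION_END
        )
    return "".join(parts)


expanded = expand_video_placeholder([0.5, 2.5], tokens_per_frame=2)
```

Because each frame is emitted by the same expression, the old ordering bug cannot reappear for only the first or last frame.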

Timestamp tokens were placed after <|vision_start|> instead of before it, misaligning with the canonical Qwen3VLProcessor in transformers. Also use temporal_patch_size instead of merge_size for timestamp calculation to match upstream. Fixes #132

kcz358 · 2026-04-13T09:33:47Z

I do not think this fully fixes #132.

This PR does fix the timestamp-token ordering issue and aligns qwen3_vl_processor.py with upstream transformers:
<timestamp><|vision_start|><video_tokens><|vision_end|>

It also correctly switches the timestamp calculation to use temporal_patch_size, which matches the upstream processor behavior.
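To make the temporal_patch_size point concrete, here is a rough sketch of how per-patch timestamps could be derived from sampled frame indices. It is modeled on my understanding of the upstream logic (pad the index list to a multiple of the temporal patch size, then average the first and last timestamp of each group); the standalone function name `calculate_timestamps` and its signature are assumptions for illustration only.

```python
# Sketch of timestamp calculation per temporal patch.
# Assumption: frames are grouped in chunks of `temporal_patch_size`, and
# each chunk is labeled with the midpoint of its first and last frame.
def calculate_timestamps(frame_indices, video_fps, temporal_patch_size=2):
    indices = list(frame_indices)

    # Pad with the last index so the length is a multiple of the patch size.
    remainder = len(indices) % temporal_patch_size
    if remainder != 0:
        indices.extend([indices[-1]] * (temporal_patch_size - remainder))

    # Convert frame indices to seconds.
    seconds = [idx / video_fps for idx in indices]

    # One timestamp per temporal patch: average of first and last frame.
    return [
        (seconds[i] + seconds[i + temporal_patch_size - 1]) / 2
        for i in range(0, len(seconds), temporal_patch_size)
    ]


even = calculate_timestamps([0, 4, 8, 12], video_fps=4.0)
odd = calculate_timestamps([0, 4, 8], video_fps=4.0)
```

With a temporal patch size of 2 this gives the same result as grouping by a merge size of 2, which is why the earlier behavior went unnoticed.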

However, issue #132 also reports a separate dataset-side problem: the Qwen3-VL dataset resize / patch-size handling appears to be inconsistent (the issue mentions missing patch size 16 alignment). This PR only changes src/lmms_engine/datasets/processor/qwen3_vl_processor.py and does not touch src/lmms_engine/datasets/iterable/qwen3_vl_iterable_dataset.py, so that part still seems unresolved.

So I think this PR fixes the first part of #132, but Fixes #132 is probably too strong unless the dataset-side issue is handled elsewhere.

qwen_vl_utils.fetch_video() defaults to image_patch_size=14 (Qwen2 VL), causing video frames to be resized with factor=28 instead of the correct factor=32 for Qwen3 VL (patch_size=16). Fixes the dataset-side issue reported in #132.
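A small arithmetic sketch of why the factor matters, assuming the resize factor is the patch size multiplied by a spatial merge size of 2 (which is consistent with the 28 and 32 quoted above). The helpers `resize_factor` and `round_to_factor` are hypothetical and only approximate the rounding step; they are not the qwen_vl_utils API.

```python
# Sketch: how a wrong patch size yields frame sizes that are not aligned
# to the Qwen3 VL patch grid.
# Assumptions: factor = patch_size * spatial merge size (2), and frame
# dimensions are rounded to the nearest multiple of that factor.
SPATIAL_MERGE_SIZE = 2


def resize_factor(patch_size, merge_size=SPATIAL_MERGE_SIZE):
    """Multiple that frame height and width are rounded to."""
    return patch_size * merge_size


def round_to_factor(value, factor):
    """Round a dimension to the nearest multiple of `factor` (at least one)."""
    return max(factor, round(value / factor) * factor)


qwen2_factor = resize_factor(14)  # 28
qwen3_factor = resize_factor(16)  # 32

height = 360
wrong = round_to_factor(height, qwen2_factor)  # 364, not a multiple of 32
right = round_to_factor(height, qwen3_factor)  # 352, a multiple of 32
```

A 360-pixel side rounded with factor 28 becomes 364, which does not divide evenly into 32-pixel merged patches, whereas factor 32 gives 352.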

kcz358 merged commit 8260c77 into main Apr 13, 2026
3 checks passed

kcz358 deleted the fix/qwen3-vl-timestamp-token-alignment branch April 13, 2026 09:45

kcz358 merged 2 commits into main from fix/qwen3-vl-timestamp-token-alignment
