
fix(qwen3_vl): align video timestamp token placement with transformers#154

Merged
kcz358 merged 2 commits into main from fix/qwen3-vl-timestamp-token-alignment
Apr 13, 2026

Conversation

@kcz358 (Collaborator) commented Apr 13, 2026

Summary

Fixes #132

  • Timestamp token ordering: The Qwen3 VL processor was placing timestamp tokens after <|vision_start|>, producing <|vision_start|><timestamp><video_pads><|vision_end|>. The canonical Qwen3VLProcessor in transformers places timestamps before <|vision_start|>, producing <timestamp><|vision_start|><video_pads><|vision_end|>. This PR aligns lmms-engine with the upstream behavior.
  • temporal_patch_size vs merge_size: Changed _calculate_timestamps to use temporal_patch_size instead of merge_size, matching the transformers implementation. Both default to 2 so this was a silent bug, but using the semantically correct parameter prevents issues if the values ever diverge.
  • Simplified the per-frame expansion logic from a 3-case if/elif/else (first/middle/last frame) into a single uniform pattern, since all frames now follow the same structure.
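The corrected expansion described above can be sketched as follows. This is a hedged illustration, not the actual `qwen3_vl_processor.py` code: the function name, the timestamp text format, and the special-token literals are assumptions chosen to show the ordering fix (timestamp before `<|vision_start|>`) and the uniform per-frame pattern.

```python
# Hypothetical sketch of the corrected per-frame expansion. Each frame follows
# one uniform pattern: <timestamp><|vision_start|><video_pads><|vision_end|>,
# with the timestamp text BEFORE the vision block, matching upstream
# transformers. The "<X seconds>" timestamp format is an assumption.
def expand_video_placeholder(timestamps, tokens_per_frame):
    """Build the full video token string, one group per sampled frame."""
    parts = []
    for ts in timestamps:
        parts.append(
            f"<{ts:.1f} seconds>"            # timestamp precedes the vision block
            + "<|vision_start|>"
            + "<|video_pad|>" * tokens_per_frame
            + "<|vision_end|>"
        )
    # No first/middle/last special-casing is needed: every frame is identical.
    return "".join(parts)
```

Because every frame now uses the same template, the previous 3-case if/elif/else collapses into the single loop body shown here.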

Timestamp tokens were placed after <|vision_start|> instead of before
it, misaligning with the canonical Qwen3VLProcessor in transformers.
Also use temporal_patch_size instead of merge_size for timestamp
calculation to match upstream.

Fixes #132
@kcz358 (Collaborator, Author) commented Apr 13, 2026

I do not think this fully fixes #132.

This PR does fix the timestamp-token ordering issue and aligns qwen3_vl_processor.py with upstream transformers:
<timestamp><|vision_start|><video_tokens><|vision_end|>

It also correctly switches the timestamp calculation to use temporal_patch_size, which matches the upstream processor behavior.

However, issue #132 also reports a separate dataset-side problem: the Qwen3-VL dataset resize / patch-size handling appears to be inconsistent (the issue mentions missing patch size 16 alignment). This PR only changes src/lmms_engine/datasets/processor/qwen3_vl_processor.py and does not touch src/lmms_engine/datasets/iterable/qwen3_vl_iterable_dataset.py, so that part still seems unresolved.

So I think this PR fixes the first part of #132, but "Fixes #132" is probably too strong unless the dataset-side issue is handled elsewhere.
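The `temporal_patch_size` change discussed above can be illustrated with a minimal sketch. The function name and exact grouping formula are assumptions, not the upstream `_calculate_timestamps` implementation: the point is only that frames are grouped by the *temporal* patch size (how many raw frames collapse into one token group along the time axis), not by the spatial `merge_size`, even though both default to 2.

```python
# Hedged sketch: one timestamp per temporal patch of frames. Grouping by
# temporal_patch_size (not merge_size) is the semantically correct choice,
# since merge_size controls spatial token merging, not temporal grouping.
def calculate_timestamps(frame_indices, video_fps, temporal_patch_size=2):
    # keep one representative frame index per temporal patch
    grouped = frame_indices[::temporal_patch_size]
    # convert frame indices to seconds
    return [idx / video_fps for idx in grouped]
```

With both values at their default of 2 the output is identical either way, which is why the original bug was silent.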

qwen_vl_utils.fetch_video() defaults to image_patch_size=14 (Qwen2 VL),
causing video frames to be resized with factor=28 instead of the correct
factor=32 for Qwen3 VL (patch_size=16).

Fixes the dataset-side issue reported in #132.
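The factor mismatch in that commit message can be sketched numerically. This is an illustrative reconstruction, not the `qwen_vl_utils` code: the helper names and the simplified rounding are assumptions, but the arithmetic (patch size × spatial merge = resize factor, so 14 → 28 for Qwen2-VL and 16 → 32 for Qwen3-VL) follows the description above.

```python
# Sketch of the dataset-side mismatch: resizing frames to multiples of the
# Qwen2-VL factor (28) instead of the Qwen3-VL factor (32) produces frame
# dimensions whose patch grid disagrees with the processor's pad-token count.
def resize_factor(image_patch_size, spatial_merge_size=2):
    # each resize factor is patch size times the spatial merge size
    return image_patch_size * spatial_merge_size

def round_to_factor(dim, factor):
    # simplified stand-in for the real "smart resize": snap a frame dimension
    # to the nearest multiple of the factor, never going below one factor
    return max(factor, round(dim / factor) * factor)

qwen2_factor = resize_factor(14)  # 28, the wrong default for Qwen3-VL
qwen3_factor = resize_factor(16)  # 32, correct for patch_size=16
```

For a 720-pixel dimension, the two factors snap to different sizes (728 vs. 704), so the resulting patch grids, and hence the number of `<|video_pad|>` tokens, diverge.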
@kcz358 kcz358 merged commit 8260c77 into main Apr 13, 2026
3 checks passed
@kcz358 kcz358 deleted the fix/qwen3-vl-timestamp-token-alignment branch April 13, 2026 09:45

Development

Successfully merging this pull request may close these issues.

qwen3vl's video timestamp token placement does not match the transformers library
