fix(qwen3_vl): align video timestamp token placement with transformers #154
Timestamp tokens were placed after <|vision_start|> instead of before it, misaligning with the canonical Qwen3VLProcessor in transformers. Also use temporal_patch_size instead of merge_size for timestamp calculation to match upstream. Fixes #132
I do not think this fully fixes #132. This PR does fix the timestamp-token ordering issue and aligns It also correctly switches the timestamp calculation to use However, issue #132 also reports a separate dataset-side problem: the Qwen3-VL dataset resize / patch-size handling appears to be inconsistent (the issue mentions missing patch size 16 alignment). This PR only changes So I think this PR fixes the first part of #132, but |
qwen_vl_utils.fetch_video() defaults to image_patch_size=14 (Qwen2 VL), causing video frames to be resized with factor=28 instead of the correct factor=32 for Qwen3 VL (patch_size=16). Fixes the dataset-side issue reported in #132.
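The 28 vs 32 mismatch can be illustrated with a small sketch. It assumes the resize factor is `patch_size * merge_size` with `merge_size=2`, which is consistent with the numbers quoted above; `resize_factor` and `round_to_factor` are hypothetical helpers written for this illustration, not functions from `qwen_vl_utils`.

```python
def resize_factor(patch_size: int, merge_size: int = 2) -> int:
    """Assumed relationship: frame sides must be multiples of patch_size * merge_size."""
    return patch_size * merge_size


def round_to_factor(value: int, factor: int) -> int:
    """Round a frame dimension to the nearest multiple of `factor` (at least one factor)."""
    return max(factor, round(value / factor) * factor)


# Hypothetical frame height, chosen to show the two factors disagreeing.
height = 420
qwen2_factor = resize_factor(patch_size=14)  # 28
qwen3_factor = resize_factor(patch_size=16)  # 32

print(round_to_factor(height, qwen2_factor))  # 420, not a multiple of 32
print(round_to_factor(height, qwen3_factor))  # 416, aligned to patch size 16
```

A frame resized with the Qwen2 VL factor can therefore end up with dimensions that are not divisible by 32, which is the misalignment the commit addresses.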
Summary
Fixes #132
- Timestamp tokens were previously placed after `<|vision_start|>`, producing `<|vision_start|><timestamp><video_pads><|vision_end|>`. The canonical `Qwen3VLProcessor` in transformers places timestamps before `<|vision_start|>`, producing `<timestamp><|vision_start|><video_pads><|vision_end|>`. This PR aligns lmms-engine with the upstream behavior.
- `temporal_patch_size` vs `merge_size`: Changed `_calculate_timestamps` to use `temporal_patch_size` instead of `merge_size`, matching the transformers implementation. Both default to 2, so the bug was silent, but using the semantically correct parameter prevents issues if the values ever diverge.
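As a rough illustration of the two changes, the sketch below builds one video segment in both orderings and computes one timestamp per group of `temporal_patch_size` frames. The helper names, the `<|video_pad|>` placeholder, the `<0.2 seconds>` timestamp text, and the averaging scheme are all assumptions made for this example; they are not copied from lmms-engine or transformers.

```python
def build_video_segment(timestamp: str, num_pads: int, timestamp_first: bool = True) -> str:
    """Assemble one temporal chunk of a video prompt (hypothetical helper)."""
    pads = "<|video_pad|>" * num_pads  # placeholder pad token, assumed for illustration
    if timestamp_first:
        # Ordering described as the upstream behavior: timestamp before <|vision_start|>.
        return f"{timestamp}<|vision_start|>{pads}<|vision_end|>"
    # Previous ordering: timestamp after <|vision_start|>.
    return f"<|vision_start|>{timestamp}{pads}<|vision_end|>"


def group_timestamps(frame_indices: list[int], fps: float, temporal_patch_size: int = 2) -> list[float]:
    """One timestamp per group of `temporal_patch_size` frames (assumed scheme:
    pad with the last frame, then average the first and last frame time of each group)."""
    indices = list(frame_indices)
    remainder = len(indices) % temporal_patch_size
    if remainder:
        indices.extend([indices[-1]] * (temporal_patch_size - remainder))
    times = [i / fps for i in indices]
    return [
        (times[i] + times[i + temporal_patch_size - 1]) / 2
        for i in range(0, len(times), temporal_patch_size)
    ]


stamps = group_timestamps([0, 1, 2, 3], fps=2.0)
print(stamps)
print(build_video_segment(f"<{stamps[0]:.1f} seconds>", num_pads=2))
```

Because the grouping is driven by `temporal_patch_size`, a configuration where it differs from `merge_size` would still yield one timestamp per temporal chunk.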