[Config] Token ID mismatch between config.json and tokenizer_config.json #67

@yuanheng-zhao

Issue

Model config and tokenizer config mismatch

In the llm_config section of config.json in the HF model repo:
https://huggingface.co/inclusionAI/Ming-flash-omni-2.0/blob/main/config.json#L96-L99

    "image_patch_token": 157157,
    "video_patch_token": 157175,
    "image_start_token": 157158,
    "video_start_token": 157159,

Here, video_start_token is set to 157159.

However, in both tokenizer_config.json and tokenizer.json, ID 157159 maps to a different token:

Ming/tokenizer_config.json, lines 2149 to 2156 in 2a0c02a:

    "157159": {
      "content": "</image>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },

This appears to be the image end token, not the video start token.

For comparison, the tokenizer config assigns the video start token the next ID:

Ming/tokenizer_config.json, lines 2157 to 2164 in 2a0c02a:

    "157160": {
      "content": "<video>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },

Should video_start_token be updated to 157160 in the HF repo's config.json?
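
For anyone who wants to reproduce the check, here is a minimal sketch that cross-references the token IDs in llm_config against the tokenizer's token table. The helper name `find_token_mismatches` and the `expected_content` mapping are hypothetical, introduced only for illustration. The inline data is copied from the excerpts quoted in this issue; I am assuming those entries live under the `added_tokens_decoder` key of tokenizer_config.json, which is the usual layout for Hugging Face tokenizers but is not shown in the excerpt.

```python
def find_token_mismatches(llm_config, added_tokens_decoder, expected_content):
    """Compare token IDs in the model config against the tokenizer's token table.

    Returns a list of tuples:
    (field, configured_id, actual_content, expected_content, suggested_id)
    """
    # Reverse lookup: token string -> ID, used to suggest the correct ID.
    content_to_id = {
        entry["content"]: int(token_id)
        for token_id, entry in added_tokens_decoder.items()
    }
    mismatches = []
    for field, expected in expected_content.items():
        configured_id = llm_config[field]
        entry = added_tokens_decoder.get(str(configured_id))
        actual = entry["content"] if entry else None
        if actual != expected:
            mismatches.append(
                (field, configured_id, actual, expected, content_to_id.get(expected))
            )
    return mismatches


# Values copied from the config.json excerpt quoted in this issue.
llm_config = {
    "image_patch_token": 157157,
    "video_patch_token": 157175,
    "image_start_token": 157158,
    "video_start_token": 157159,
}

# Only the two tokenizer entries quoted in this issue (other fields omitted).
# Assumption: in the real file these sit under "added_tokens_decoder".
added_tokens_decoder = {
    "157159": {"content": "</image>"},
    "157160": {"content": "<video>"},
}

# Hypothetical mapping of config field -> token string it should point to.
expected_content = {"video_start_token": "<video>"}

mismatches = find_token_mismatches(llm_config, added_tokens_decoder, expected_content)
for field, configured_id, actual, expected, suggested_id in mismatches:
    print(
        f"{field}={configured_id} maps to {actual!r}, "
        f"expected {expected!r} (tokenizer ID {suggested_id})"
    )
```

To run this against the real files, load both with `json.load` and pass `config["llm_config"]` and `tokenizer_config["added_tokens_decoder"]` in place of the inline dictionaries, assuming those keys are present. The other token fields could be checked the same way by extending `expected_content`.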
