
[Bug Report] tokenize_and_concatenate doesn't tokenize correctly. #1133

@BorisTheBrave


Describe the bug

tokenize_and_concatenate slices the text into 20 chunks by character offset before tokenizing. A chunk boundary can therefore cut a token in two, producing token pairs at the seam that would never occur in normally tokenized text.

Code example
I don't have an example, I found this while debugging a larger project.
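For illustration, though, here is a minimal sketch of what the failure looks like. It is hypothetical, not taken from my project, and assumes the Hugging Face GPT-2 tokenizer, which matches the token ids in the debugger session below:

# Hypothetical sketch: character-offset chunking can cut a token in two.
# The text and cut point are chosen to mirror the ' Mil' -> ' M' + 'il'
# split shown in the debugger output below.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Report on the Military News budget"
cut = text.index("Military") + 1  # a chunk boundary landing just after the 'M'

whole = tokenizer.encode(text)  # ' Military' starts with token 4460, ' Mil'
split = tokenizer.encode(text[:cut]) + tokenizer.encode(text[cut:])
# split may instead contain 337 (' M') followed by 346 ('il'), a pair that
# whole-text tokenization never emits.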

In a debugger, inside that method, after tokenizing and dropping the padding tokens, I observed the following:

> chunks[2][-10:], chunks[3][:10]
('t on the M', 'ilitary Ne')
> tokenizer.decode([4460])
' Mil'
> tokenizer.decode([337])
' M'
> tokenizer.decode([346])
'il'
> np.where((tokens[:-1] == 337) & (tokens[1:] == 346))[0]
array([79848])
> tokens[79848:79848+2]
array([337, 346]) #  SHOULD NEVER OCCUR

I think the problem is obvious from inspecting the code.
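For reference, the pattern in question looks roughly like this (paraphrased; the variable names are my reconstruction, not a verbatim quote of the function):

num_chunks = 20
chunk_length = (len(full_text) - 1) // num_chunks + 1
# Each boundary is a raw character offset, so it can land mid-word / mid-token.
chunks = [full_text[i * chunk_length : (i + 1) * chunk_length] for i in range(num_chunks)]

One possible mitigation (a sketch only, not a tested fix) is to nudge each boundary back so it falls on whitespace, with the whitespace starting the next chunk; that keeps leading-space BPE tokens such as ' Mil' intact:

def chunk_on_whitespace(full_text: str, num_chunks: int) -> list[str]:
    # Split full_text into num_chunks pieces whose boundaries sit on
    # whitespace characters, so no word (and hence, for typical BPE
    # vocabularies, no token) is cut in half. Trailing chunks may be
    # empty for very short texts.
    approx = max(1, len(full_text) // num_chunks)
    chunks, start = [], 0
    for _ in range(num_chunks - 1):
        end = min(start + approx, len(full_text))
        # Walk the boundary back until it sits on a whitespace character.
        while end > start and end < len(full_text) and not full_text[end].isspace():
            end -= 1
        if end == start:  # no whitespace found in range; accept a raw cut
            end = min(start + approx, len(full_text))
        chunks.append(full_text[start:end])
        start = end
    chunks.append(full_text[start:])
    return chunks

Even whitespace-aligned boundaries aren't guaranteed to be byte-identical for every tokenizer; tokenizing each document in a single pass is the only fully faithful option, at the cost of the memory the chunking was presumably meant to save.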

Checklist

  • I have checked that there is no similar issue in the repo (required)
