Describe the bug
tokenize_and_concatenate splits each string into 20 chunks by character position before tokenizing. A chunk boundary can cut a token in two, producing token pairs that would never occur when the intact text is tokenized.
Code example
I don't have a minimal example; I found this while debugging a larger project.
In a debugger inside that method, after tokenizing and dropping padding tokens, I observed the following:
> chunks[2][-10:], chunks[3][:10]
('t on the M', 'ilitary Ne')
> tokenizer.decode([4460])
' Mil'
> tokenizer.decode([337])
' M'
> tokenizer.decode([346])
'il'
> np.where((tokens[:-1] == 337) & (tokens[1:] == 346))[0]
array([79848])
> tokens[79848:79848+2]
array([337, 346]) # SHOULD NEVER OCCUR
I think the problem is obvious from code inspection.
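For illustration, here is a minimal sketch of the effect. The text and split point are made up, and it assumes the GPT-2 tokenizer (consistent with the token IDs decoded above); it is not code from the original project.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "a report on the Military Network"

# Tokenizing the intact string lets BPE see the whole word " Military".
whole = tokenizer.encode(text)

# Splitting by character position before tokenizing, as the 20-way chunking does,
# can cut the word at an arbitrary boundary.
split_at = text.index(" Military") + 2  # boundary falls between "M" and "ilitary"
chunked = tokenizer.encode(text[:split_at]) + tokenizer.encode(text[split_at:])

print(tokenizer.convert_ids_to_tokens(whole))
print(tokenizer.convert_ids_to_tokens(chunked))
# The chunked version contains an adjacent pair such as " M", "il" that the
# intact tokenization never produces.
```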
Checklist