[Bug Report] tokenize_and_concatenate can generate invalid tokens #1139

@BorisTheBrave

Description

Describe the bug

If you call tokenize_and_concatenate on a dataset that is too short to fill a single sequence, it pads the data so that there is at least one batch. It does this even if the tokenizer has no padding token, because it silently adds a padding token to the tokenizer for you. The new pad token id (50257 for GPT-2) is equal to the vocab size, so the returned tokens are out of range for the model's embedding.

Code example

from datasets import Dataset
from transformer_lens.utils import tokenize_and_concatenate
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(f"Vocab size: {tokenizer.vocab_size}") # 50257
print(tokenizer.pad_token) # None


dataset = Dataset.from_dict({"text": ["Hello", "world", "today"]})

result = tokenize_and_concatenate(
    dataset,
    tokenizer,
    column_name="text",
    add_bos_token=True,
)
print(tokenizer.pad_token) # <PAD>
print(tokenizer.pad_token_id) # 50257
print(result[0]) # {'tokens': tensor([50256,    39,    68,  ..., 50257, 50257, 50257])}
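
A possible workaround (a minimal sketch, not part of the library's documented behaviour): pre-set a pad token that maps to an existing id, e.g. the EOS token, before calling tokenize_and_concatenate. This assumes the padding path uses tokenizer.pad_token_id (which the 50257 values above suggest) and that padding with the EOS id is acceptable for the caller.

from datasets import Dataset
from transformer_lens.utils import tokenize_and_concatenate
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Reuse the existing EOS token as the pad token so the helper does not
# register a new <PAD> token with an out-of-vocabulary id.
tokenizer.pad_token = tokenizer.eos_token

dataset = Dataset.from_dict({"text": ["Hello", "world", "today"]})
result = tokenize_and_concatenate(
    dataset,
    tokenizer,
    column_name="text",
    add_bos_token=True,
)

# All token ids should now stay within GPT-2's vocabulary (0 .. 50256).
print(result[0]["tokens"].max().item() < tokenizer.vocab_size)  # True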

System Info
Describe the characteristics of your environment:

  • How transformer_lens was installed: uv
  • OS: Linux
  • Python version: 3.11.11

Checklist

  • I have checked that there is no similar issue in the repo (required)
