Describe the bug
If you call tokenize_and_concatenate on a dataset that is too short to fill a single context, it pads the data so that you still get at least one batch. It does this even when the tokenizer has no padding token: it silently creates one for you, mutating the tokenizer in the process, and the new pad token id (50257 for GPT-2) falls outside the original vocabulary.
Code example
from datasets import Dataset
from transformer_lens.utils import tokenize_and_concatenate
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(f"Vocab size: {tokenizer.vocab_size}") # 50257
print(tokenizer.pad_token) # None
dataset = Dataset.from_dict({"text": ["Hello", "world", "today"]})
result = tokenize_and_concatenate(
    dataset,
    tokenizer,
    column_name="text",
    add_bos_token=True,
)
print(tokenizer.pad_token) # <PAD>
print(tokenizer.pad_token_id) # 50257
print(result[0]) # {'tokens': tensor([50256, 39, 68, ..., 50257, 50257, 50257])}
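A minimal workaround sketch (not part of the library, and assuming tokenize_and_concatenate only adds a pad token when tokenizer.pad_token is None): reuse the existing EOS token as the pad token before calling, so nothing outside the original vocab gets created.

from datasets import Dataset
from transformer_lens.utils import tokenize_and_concatenate
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Reuse the existing EOS token as the pad token so the helper does not
# need to invent a new <PAD> token with id 50257 (outside the vocab).
tokenizer.pad_token = tokenizer.eos_token

dataset = Dataset.from_dict({"text": ["Hello", "world", "today"]})
result = tokenize_and_concatenate(
    dataset,
    tokenizer,
    column_name="text",
    add_bos_token=True,
)
print(tokenizer.pad_token_id)  # 50256, still inside the original vocab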
System Info
Describe the characteristics of your environment:
- How transformer_lens was installed: uv
- OS: Linux
- Python version: 3.11.11
Checklist
- I have checked that there is no similar issue in the repo (required)