Fix data processing in chapter3 #1139

Open
CaptainArshia wants to merge 2 commits into huggingface:main from CaptainArshia:fix-processing-the-data-bug
Conversation

@CaptainArshia commented Nov 30, 2025

This commit fixes a data processing bug in the tokenization examples across all language translations of Chapter 3, Section 2.

The Problem:
The code examples passed dataset columns directly to the tokenizer, which caused compatibility issues because the tokenizer expects plain Python lists of strings.

The Fix:
Converted the dataset columns to lists before tokenization by wrapping them in list():

Changed: raw_datasets["train"]["sentence1"]
To: list(raw_datasets["train"]["sentence1"])
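To illustrate why the `list()` wrapper helps, here is a minimal, standard-library-only sketch. It does not use the real `datasets` or `transformers` packages: `LazyColumn` and `strict_tokenize` are hypothetical stand-ins, assuming the column object is sequence-like but not an actual `list`, and that the tokenizer only accepts real lists of strings. The exact error raised by the real tokenizer may differ.

```python
from collections.abc import Sequence


class LazyColumn(Sequence):
    """Hypothetical stand-in for a dataset column: sequence-like, but not a list."""

    def __init__(self, values):
        self._values = tuple(values)

    def __getitem__(self, index):
        return self._values[index]

    def __len__(self):
        return len(self._values)


def strict_tokenize(sentences1, sentences2):
    """Hypothetical tokenizer stand-in that only accepts real lists of strings."""
    for batch in (sentences1, sentences2):
        if not isinstance(batch, list):
            raise TypeError(f"expected a list of str, got {type(batch).__name__}")
    # Whitespace split as a toy "tokenization" of each sentence pair.
    return [(a.split(), b.split()) for a, b in zip(sentences1, sentences2)]


# Toy data shaped like raw_datasets["train"]; the sentences are made up.
raw_train = {
    "sentence1": LazyColumn(["He said hello .", "The cat sat ."]),
    "sentence2": LazyColumn(["He greeted them .", "A dog barked ."]),
}

# Before the fix: passing the column objects directly is rejected.
try:
    strict_tokenize(raw_train["sentence1"], raw_train["sentence2"])
    direct_call_failed = False
except TypeError:
    direct_call_failed = True

# After the fix: wrap each column in list() before tokenizing.
pairs = strict_tokenize(
    list(raw_train["sentence1"]),
    list(raw_train["sentence2"]),
)
```

Note that `list()` materializes the whole column in memory, which is fine for a dataset the size of MRPC but may matter for much larger ones.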

Impact:
This change was applied consistently across all language versions so that the code examples work correctly when tokenizing sentence pairs from the MRPC dataset.

@HuggingFaceDocBuilderDev
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
