
Fixes SFT checkpointing #1435

Closed

finbarrtimbers wants to merge 2 commits into main from finbarr/fix-sft-tokenization

Conversation

@finbarrtimbers (Collaborator)

No description provided.

finbarrtimbers and others added 2 commits January 28, 2026 10:10
- Add scripts/train/debug/sft-tokenization.sh for quick testing (~5 min)
- Add Beaker progress updates via maybe_update_beaker_description
- Only save checkpoints when --resume flag is enabled

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use incremental append-only checkpointing instead of serializing
  all tokens to JSON (was taking ~2 hours per checkpoint with 19B tokens)
- Stream tokens to binary files, only save metadata in JSON checkpoint
- Remove GPU requirement from tokenization script (CPU-only task)
- Switch from print statements to logger.info

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@gemini-code-assist (Contributor)

Summary of Changes

Hello @finbarrtimbers, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness and scalability of the SFT data conversion process by refactoring its checkpointing logic. The core change involves moving from an in-memory storage of processed token data to an incremental, disk-based approach. This ensures that large datasets can be processed without exhausting memory resources and allows for reliable resumption of jobs, improving overall operational efficiency and stability. The changes also include better logging practices and a dedicated debug script to streamline development and testing.

Highlights

  • Incremental Checkpointing: Implemented a new checkpointing mechanism that flushes token data to binary files on disk incrementally, significantly reducing memory usage during SFT data conversion.
  • Memory Efficiency: The script now avoids holding all token IDs, labels, and document boundaries in memory, making it more scalable for very large datasets and preventing out-of-memory errors.
  • Improved Resumption: Jobs can now resume more reliably from checkpoints, as the partial token data is persisted on disk, allowing for seamless continuation of interrupted conversion processes.
  • Enhanced Logging: Replaced standard print statements with logger.info for better structured and configurable logging throughout the data conversion script.
  • Debug Script Added: Introduced a new debug script (sft-tokenization.sh) to easily test and verify the SFT tokenization process within a Beaker environment.
  • GPU Resource Optimization: Updated the 7b-hybrid-sft-tokenization.sh script to specify --gpus 0, indicating that the tokenization step no longer requires GPU resources.
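The incremental flush described in these highlights can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: the function names and `_partial_*.bin` / `_checkpoint.json` filenames mirror the snippets quoted later in the review, but the signatures and dtypes are assumptions.

```python
# Sketch of append-only checkpointing: token data streams to binary files,
# while the JSON checkpoint holds only small metadata (counts/positions).
import json
import os

import numpy as np


def append_tokens_to_disk(output_dir, token_ids, labels_mask, boundaries, token_dtype="uint32"):
    """Append buffered samples to partial binary files instead of rewriting JSON."""
    with open(os.path.join(output_dir, "_partial_tokens.bin"), "ab") as f:
        np.asarray(token_ids, dtype=token_dtype).tofile(f)
    with open(os.path.join(output_dir, "_partial_labels.bin"), "ab") as f:
        np.asarray(labels_mask, dtype=np.uint8).tofile(f)
    with open(os.path.join(output_dir, "_partial_boundaries.bin"), "ab") as f:
        # document boundaries stored as (start, end) int64 pairs
        np.asarray(boundaries, dtype=np.int64).tofile(f)


def save_checkpoint_metadata(output_dir, state):
    """Only lightweight metadata goes to JSON; the token stream stays binary."""
    with open(os.path.join(output_dir, "_checkpoint.json"), "w") as f:
        json.dump(state, f)
```

Because each flush is an `ab` append of a fixed-width dtype, checkpoint cost is proportional to the new tokens since the last flush rather than to the full 19B-token corpus, which is what eliminates the ~2-hour JSON serialization per checkpoint.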


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request significantly improves the SFT data conversion script by refactoring the checkpointing mechanism to be more memory-efficient. It moves from storing all token data in a single JSON file to streaming it to partial binary files, which is a crucial enhancement for handling large datasets. The consistent use of a logger instead of print statements is also a welcome improvement.

My review identifies a critical bug in the new resume logic that could lead to data duplication. I've also included a couple of medium-severity suggestions to enhance the script's robustness and maintainability.

print(f" Remaining samples: {len(train_dataset) - state['samples_processed']:,}")
print(f"================================")
start_idx = checkpoint["samples_processed"]
current_position, _ = load_progress(output_dir, token_dtype)


critical

There's a critical issue in the resume logic that can lead to data duplication. When resuming, current_position is determined by the size of partial files on disk (load_progress), while start_idx (the sample to resume from) is read from the checkpoint metadata. If the script crashes after appending data to partial files but before the metadata is updated, on resume, it will re-process samples that are already on disk, leading to duplicated data in the final output.

To ensure consistency, current_position should be loaded from the checkpoint metadata, not from the partial files. Additionally, the partial files should be truncated to match the state recorded in the checkpoint. This prevents re-processing and data duplication.

I've suggested changing this line to load current_position from the checkpoint. You'll also need to add logic to truncate the partial files (_partial_tokens.bin, _partial_labels.bin, _partial_boundaries.bin) to the sizes consistent with current_position and start_idx from the checkpoint to fully resolve the issue.

Suggested change
current_position, _ = load_progress(output_dir, token_dtype)
current_position = checkpoint["current_position"]
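The truncation step this comment asks for could look something like the sketch below. It is a hypothetical helper, not code from the PR: the filenames match the ones quoted in the review, but `current_position` (tokens recorded in the checkpoint) and `boundary_count` (boundary pairs recorded in the checkpoint) are assumed parameters.

```python
# Cut the partial files back to the state recorded in _checkpoint.json,
# discarding any bytes appended after the last successful metadata update.
import os

import numpy as np


def truncate_partial_files(output_dir, current_position, boundary_count, token_dtype="uint32"):
    """Make the on-disk partial files consistent with the checkpoint metadata."""
    target_sizes = {
        "_partial_tokens.bin": current_position * np.dtype(token_dtype).itemsize,
        "_partial_labels.bin": current_position * np.dtype(np.uint8).itemsize,
        # boundaries are (start, end) int64 pairs: 16 bytes per document
        "_partial_boundaries.bin": boundary_count * 2 * np.dtype(np.int64).itemsize,
    }
    for name, size in target_sizes.items():
        path = os.path.join(output_dir, name)
        if os.path.exists(path) and os.path.getsize(path) > size:
            with open(path, "r+b") as f:
                f.truncate(size)
```

With this in place, a crash between the data flush and the metadata write is harmless: resume truncates the orphaned tail and re-processes those samples exactly once.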

os.remove(checkpoint_path)
print(f"Removed checkpoint file: {checkpoint_path}")
logger.info(f"Removed checkpoint file: {checkpoint_path}")
for partial_file in ["_partial_tokens.bin", "_partial_labels.bin", "_partial_boundaries.bin"]:


medium

The filenames for partial checkpoint data (_partial_tokens.bin, etc.) are hardcoded as string literals here and in other functions (append_tokens_to_disk, load_progress, main). This can make maintenance difficult and risks inconsistencies if a filename needs to be changed.

Consider defining these filenames as constants at the module level and reusing them. This would centralize the definitions and improve code clarity and maintainability.
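One way to apply this suggestion is a small module-level block along these lines (a sketch; the constant names and `partial_paths` helper are invented for illustration, only the filenames come from the review):

```python
# Centralize the partial-checkpoint filenames so every call site
# (append_tokens_to_disk, load_progress, main, cleanup) uses one definition.
import os

PARTIAL_TOKENS_FILE = "_partial_tokens.bin"
PARTIAL_LABELS_FILE = "_partial_labels.bin"
PARTIAL_BOUNDARIES_FILE = "_partial_boundaries.bin"
PARTIAL_FILES = (PARTIAL_TOKENS_FILE, PARTIAL_LABELS_FILE, PARTIAL_BOUNDARIES_FILE)


def partial_paths(output_dir):
    """Resolve all partial-file paths through the shared constants."""
    return [os.path.join(output_dir, name) for name in PARTIAL_FILES]
```

A rename then touches one tuple instead of several string literals scattered across functions.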

Comment on lines +129 to +133
if os.path.exists(tokens_path):
token_count = os.path.getsize(tokens_path) // np.dtype(token_dtype).itemsize
boundaries_data = np.fromfile(boundaries_path, dtype=np.int64).reshape(-1, 2)
boundaries = [tuple(b) for b in boundaries_data.tolist()]
return token_count, boundaries


medium

The function checks for the existence of tokens_path but then unconditionally reads from boundaries_path. If the script is interrupted after writing the tokens file but before the boundaries file, this will cause a FileNotFoundError. To make the function more robust, you should also check if boundaries_path exists before attempting to read from it.

Suggested change
if os.path.exists(tokens_path):
token_count = os.path.getsize(tokens_path) // np.dtype(token_dtype).itemsize
boundaries_data = np.fromfile(boundaries_path, dtype=np.int64).reshape(-1, 2)
boundaries = [tuple(b) for b in boundaries_data.tolist()]
return token_count, boundaries
if os.path.exists(tokens_path) and os.path.exists(boundaries_path):
token_count = os.path.getsize(tokens_path) // np.dtype(token_dtype).itemsize
boundaries_data = np.fromfile(boundaries_path, dtype=np.int64).reshape(-1, 2)
boundaries = [tuple(b) for b in boundaries_data.tolist()]
return token_count, boundaries


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c9ae74053

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +393 to +397
append_tokens_to_disk(output_dir, token_ids, labels_mask, document_boundaries, token_dtype)
token_ids.clear()
labels_mask.clear()
document_boundaries.clear()
save_checkpoint_metadata(output_dir, {


P1: Make checkpoint flush atomic to avoid duplicates

If the process is interrupted after append_tokens_to_disk runs but before save_checkpoint_metadata updates _checkpoint.json, the partial files will already include the latest samples while samples_processed still points to the previous checkpoint. On resume, the loop restarts from the older start_idx but uses current_position from the partial files, so those samples are reprocessed and appended again, corrupting the token stream and document boundaries. This can happen on any crash/kill during the small window between the two calls; consider making the flush+metadata update atomic (e.g., write metadata first with a temp marker, or validate/truncate partial files against samples_processed on resume).
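The "write metadata first with a temp marker" option mentioned here is the standard write-then-rename pattern. A minimal sketch, assuming the `_checkpoint.json` filename from the review (the function name is hypothetical):

```python
# Atomic metadata update: write to a temp file, fsync, then rename.
# os.replace is atomic on POSIX, so a crash mid-write leaves either the
# old checkpoint or the new one on disk, never a half-written file.
import json
import os


def save_checkpoint_metadata_atomic(output_dir, state):
    """Persist checkpoint metadata without a torn-write window."""
    path = os.path.join(output_dir, "_checkpoint.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # ensure bytes hit disk before the rename
    os.replace(tmp, path)
```

This closes the torn-metadata window, but note it does not by itself remove the flush/metadata race the comment describes; it needs to be paired with resume-time validation or truncation of the partial files against `samples_processed`.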

Useful? React with 👍 / 👎.

@hamishivi closed this on Apr 8, 2026
