
feat: pretokenize dataset #12

Merged: Neonkraft merged 4 commits into main from feat/tokenize-only on Mar 18, 2026

@Neonkraft
Collaborator

Summary

This PR adds a --tokenize-only switch to submit.py that exits the program once the dataset has been tokenized and, as a result, cached. This makes it possible to pretokenize the dataset on a single node with a single GPU before launching the actual training runs.
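
For readers who want a picture of the control flow, here is a minimal sketch of how such a switch might be wired up. It is not the code from this PR: only the flag name --tokenize-only and the file name submit.py come from the description above. The helpers `build_parser`, `tokenize_dataset`, and `main`, the in-memory `cache` dict, and the toy whitespace tokenizer are all hypothetical stand-ins.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real submit.py presumably takes many more options.
    parser = argparse.ArgumentParser(description="Illustrative stand-in for submit.py")
    parser.add_argument(
        "--tokenize-only",
        action="store_true",
        help="Tokenize (and cache) the dataset, then exit without training.",
    )
    return parser


def tokenize_dataset(cache: dict) -> list:
    # Assumption: tokenization is memoized. A dict stands in for an on-disk cache,
    # and a whitespace split stands in for the real tokenizer.
    if "tokens" not in cache:
        documents = ["hello world", "pretokenize me"]
        cache["tokens"] = [doc.split() for doc in documents]
    return cache["tokens"]


def main(argv=None, cache=None) -> str:
    args = build_parser().parse_args(argv)
    cache = {} if cache is None else cache

    # Tokenization always runs first, so the cache is populated either way.
    tokenize_dataset(cache)

    if args.tokenize_only:
        # Stop here: the cache is warm and no training resources are needed.
        return "tokenized"

    # A real script would start the training run at this point.
    return "trained"
```

Under these assumptions, a first invocation with --tokenize-only fills the cache and returns early, and a later invocation without the flag reuses the cached tokens instead of tokenizing again.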

Type of change

  • Bug fix
  • New feature
  • Refactor
  • Performance
  • Documentation
  • Maintenance

Validation

Verified with trial runs on the Leonardo cluster.

@Neonkraft Neonkraft merged commit 5702251 into main Mar 18, 2026
2 checks passed