Use metadata-backed document boundaries for SFT datasets#653
Open
taivu1998 wants to merge 1 commit into
Open
Conversation
Contributor
Author
|
Hi @tyler-romero, could you help review this PR? Thanks! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR fixes issue #594 by making document-aware SFT paths prefer sibling
.csv.gzboundary metadata over rediscovering boundaries from tokenizer EOS tokens.The key behavior changes are:
DocumentBoundaryModesupport withauto,metadata, andtokenizermodesNumpyPackedFSLDataset,NumpyPaddedFSLDataset, andNumpyDocumentSourceuse metadata-backed boundaries inautomode when metadata exists for all source filesautomode instead of silently mixing modesdoc_lensfrom the packed document spans instead of rescanning token valuesMotivation
Issue #594 showed that Qwen-style SFT data can break when the tokenizer EOS token also appears at message boundaries. In that case, recovering document boundaries from tokenizer EOS fragments one conversation into many smaller documents.
The upstream SFT conversion pipeline already writes authoritative conversation boundaries into sibling
.csv.gzmetadata files, so this PR uses that metadata directly in the document-aware dataset paths implicated by the issue.Design Notes
The change is intentionally narrow:
doc_lensnow come from known packed spans, which removes an unnecessary EOS-token dependency from the masking pathTesting
Focused regression coverage added for:
doc_lensgeneration from metadata-backed spansNumpyDocumentSourcemetadata-backed offsetsCommands run:
uv run pytest src/test/data/utils_test.py src/test/data/numpy_dataset_test.py src/test/data/composable/numpy_document_source_test.py -quv run pytest src/test/data -k 'not mixes_test' -qUV_CACHE_DIR=/tmp/uv-cache uv run ruff check src/olmo_core/data/types.py src/olmo_core/data/utils.py src/olmo_core/data/numpy_dataset.py src/olmo_core/data/composable/numpy_document_source.py src/olmo_core/data/__init__.py src/test/data/utils_test.py src/test/data/numpy_dataset_test.py src/test/data/composable/numpy_document_source_test.pyRelated