Use metadata-backed document boundaries for SFT datasets by taivu1998 · Pull Request #653 · allenai/OLMo-core

taivu1998 · 2026-04-02T08:31:56Z

Summary

This PR fixes issue #594 by making document-aware SFT paths prefer sibling .csv.gz boundary metadata over rediscovering boundaries from tokenizer EOS tokens.

The key behavior changes are:

add explicit DocumentBoundaryMode support with auto, metadata, and tokenizer modes
make NumpyPackedFSLDataset, NumpyPaddedFSLDataset, and NumpyDocumentSource use metadata-backed boundaries in auto mode when metadata exists for all source files
reject mixed metadata presence in auto mode instead of silently mixing modes
compute packed doc_lens from the packed document spans instead of rescanning token values
include effective boundary mode and metadata file identity in fingerprints/cache keys when metadata-backed boundaries are used
update the SFT README to reflect the new metadata-backed behavior
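
As a rough illustration of the mode selection described above, here is a minimal sketch, not the code in this PR. The enum name DocumentBoundaryMode and its three values come from the description; the helper names `metadata_path_for` and `resolve_boundary_mode`, and the rule that the sibling file shares the source file's stem, are assumptions made for the example.

```python
from enum import Enum
from pathlib import Path
from typing import Sequence, Union


class DocumentBoundaryMode(str, Enum):
    auto = "auto"
    metadata = "metadata"
    tokenizer = "tokenizer"


def metadata_path_for(source: str) -> Path:
    # Assumption: the sibling metadata file has the same stem as the token
    # file, with the extension swapped (part-0.npy -> part-0.csv.gz).
    return Path(source).with_suffix(".csv.gz")


def resolve_boundary_mode(
    mode: Union[str, DocumentBoundaryMode], sources: Sequence[str]
) -> DocumentBoundaryMode:
    """Return the effective mode (metadata or tokenizer) for a set of sources."""
    mode = DocumentBoundaryMode(mode)
    if mode is DocumentBoundaryMode.tokenizer:
        return mode

    present = [metadata_path_for(s).is_file() for s in sources]

    if mode is DocumentBoundaryMode.metadata:
        # Explicit metadata mode: every source must have a sibling file.
        if not (present and all(present)):
            raise FileNotFoundError("metadata mode requires metadata for every source file")
        return mode

    # auto: all-or-nothing, never a silent mix of the two strategies.
    if present and all(present):
        return DocumentBoundaryMode.metadata
    if any(present):
        raise ValueError("metadata exists for only some source files")
    return DocumentBoundaryMode.tokenizer
```

Raising on partial metadata keeps a half-converted dataset from being segmented two different ways without anyone noticing.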

Motivation

Issue #594 showed that Qwen-style SFT data can break when the tokenizer EOS token also appears at message boundaries. In that case, recovering document boundaries by scanning for the tokenizer EOS token splits one conversation into many smaller documents.
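
The failure mode can be reproduced on a toy token array. This is an illustrative sketch only: the EOS id, the token values, and the helper name `boundaries_from_eos` are made up for the example and do not come from the codebase.

```python
import numpy as np

EOS = 0  # hypothetical EOS token id

# One conversation made of three messages, each terminated by EOS.
tokens = np.array([11, 12, EOS, 21, 22, 23, EOS, 31, EOS])


def boundaries_from_eos(tokens: np.ndarray, eos_id: int):
    """Recover (start, end) spans by treating every EOS as a document end.

    Tokens after the final EOS, if any, are ignored in this sketch.
    """
    ends = np.flatnonzero(tokens == eos_id) + 1
    starts = np.concatenate(([0], ends[:-1]))
    return list(zip(starts.tolist(), ends.tolist()))


# EOS scanning reports three documents for what is really one conversation.
eos_spans = boundaries_from_eos(tokens, EOS)

# Authoritative metadata would record the conversation as a single span.
metadata_spans = [(0, len(tokens))]
```

Here `eos_spans` is `[(0, 3), (3, 7), (7, 9)]`, while the metadata view keeps the conversation whole as `[(0, 9)]`.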

The upstream SFT conversion pipeline already writes authoritative conversation boundaries into sibling .csv.gz metadata files, so this PR uses that metadata directly in the document-aware dataset paths implicated by the issue.
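
Reading boundaries from a sibling metadata file could look roughly like the sketch below. The column layout is an assumption for illustration (first column is the start token offset, second is the exclusive end offset, no header row); the description above does not specify the format, and `load_metadata_boundaries` is a hypothetical helper name.

```python
import csv
import gzip
from typing import List, Tuple


def load_metadata_boundaries(path: str) -> List[Tuple[int, int]]:
    """Read (start, end) token offsets from a gzip-compressed CSV file.

    Assumption: each row begins with `start,end`; any further columns
    (for example a document id) are ignored.
    """
    spans: List[Tuple[int, int]] = []
    with gzip.open(path, "rt", newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            start, end = int(row[0]), int(row[1])
            if end <= start:
                raise ValueError(f"empty or inverted span in {path}: {row}")
            spans.append((start, end))
    return spans
```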

Design Notes

The change is intentionally narrow:

low-level utility helpers still default to tokenizer-based behavior so unrelated dataset paths do not change semantics accidentally
metadata-first behavior is enabled explicitly in the document-aware packed, padded, and composable source paths that need it
packed runtime doc_lens now come from known packed spans, which removes an unnecessary EOS-token dependency from the masking path
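
The last point can be sketched as follows: once the packer knows which document spans went into an instance, the lengths follow from the spans alone and token values never need to be inspected. The function name `doc_lens_from_spans` and the handling of over-long instances are assumptions for the example; how padding is represented is left out.

```python
from typing import List, Sequence, Tuple


def doc_lens_from_spans(
    spans: Sequence[Tuple[int, int]], sequence_length: int
) -> List[int]:
    """Compute per-document lengths for one packed instance.

    `spans` are the (start, end) source offsets of the documents placed in
    the instance, in packing order. No EOS lookup is involved.
    """
    lens = [end - start for start, end in spans]
    if any(n <= 0 for n in lens):
        raise ValueError("every packed span must be non-empty")
    if sum(lens) > sequence_length:
        raise ValueError("packed spans exceed the sequence length")
    return lens


# Two documents of 5 and 3 tokens packed into a length-10 instance.
example_doc_lens = doc_lens_from_spans([(0, 5), (12, 15)], sequence_length=10)
```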

Testing

Focused regression coverage added for:

metadata precedence over local EOS scanning
mixed metadata presence failures
packed doc_lens generation from metadata-backed spans
padded metadata-backed segmentation
composable NumpyDocumentSource metadata-backed offsets
fingerprint/cache changes when metadata changes
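
The fingerprint behavior in the last item can be pictured with a small sketch. This is not the hashing scheme used in the repository: `dataset_fingerprint` is a hypothetical helper, and representing a metadata file's identity as a (path, size) pair is an assumption made for the example.

```python
import hashlib
from typing import Optional, Sequence, Tuple


def dataset_fingerprint(
    sources: Sequence[str],
    boundary_mode: str,
    metadata_identities: Optional[Sequence[Tuple[str, int]]] = None,
) -> str:
    """Hash the source paths plus the effective boundary mode.

    Metadata identities only contribute when boundaries are metadata-backed,
    so tokenizer-mode fingerprints do not depend on them.
    """
    h = hashlib.sha256()
    for s in sources:
        h.update(s.encode("utf-8") + b"\0")
    h.update(f"boundary_mode={boundary_mode}".encode("utf-8") + b"\0")
    if boundary_mode == "metadata":
        for path, size in metadata_identities or ():
            h.update(f"{path}:{size}".encode("utf-8") + b"\0")
    return h.hexdigest()
```

With this shape, editing a metadata file changes the fingerprint and invalidates cached packing results, while datasets that fall back to tokenizer boundaries are unaffected by metadata changes.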

Commands run:

uv run pytest src/test/data/utils_test.py src/test/data/numpy_dataset_test.py src/test/data/composable/numpy_document_source_test.py -q
uv run pytest src/test/data -k 'not mixes_test' -q
UV_CACHE_DIR=/tmp/uv-cache uv run ruff check src/olmo_core/data/types.py src/olmo_core/data/utils.py src/olmo_core/data/numpy_dataset.py src/olmo_core/data/composable/numpy_document_source.py src/olmo_core/data/__init__.py src/test/data/utils_test.py src/test/data/numpy_dataset_test.py src/test/data/composable/numpy_document_source_test.py

taivu1998 wants to merge 1 commit into allenai:main from taivu1998:tdv/issue-594-boundary-metadata
