Use metadata-backed document boundaries for SFT datasets #653

Open
taivu1998 wants to merge 1 commit into allenai:main from taivu1998:tdv/issue-594-boundary-metadata

Conversation

@taivu1998
Contributor

Summary

This PR fixes issue #594 by making document-aware SFT paths prefer sibling .csv.gz boundary metadata over rediscovering boundaries from tokenizer EOS tokens.

The key behavior changes are:

  • add explicit DocumentBoundaryMode support with auto, metadata, and tokenizer modes
  • make NumpyPackedFSLDataset, NumpyPaddedFSLDataset, and NumpyDocumentSource use metadata-backed boundaries in auto mode when metadata exists for all source files
  • reject mixed metadata presence in auto mode instead of silently mixing modes
  • compute packed doc_lens from the packed document spans instead of rescanning token values
  • include effective boundary mode and metadata file identity in fingerprints/cache keys when metadata-backed boundaries are used
  • update the SFT README to reflect the new metadata-backed behavior
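The mode-selection rules in the list above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the three mode names come from the description, while `metadata_path_for`, `resolve_boundary_mode`, and the rule that maps a source path to its sibling `.csv.gz` file are hypothetical.

```python
from enum import Enum
from pathlib import Path
from typing import Sequence


class DocumentBoundaryMode(str, Enum):
    # Mode names are taken from the PR description; the enum layout is assumed.
    auto = "auto"
    metadata = "metadata"
    tokenizer = "tokenizer"


def metadata_path_for(source: str) -> Path:
    # Hypothetical naming rule: "part-0.npy" -> "part-0.csv.gz".
    return Path(source).with_suffix(".csv.gz")


def resolve_boundary_mode(mode: str, sources: Sequence[str]) -> DocumentBoundaryMode:
    """Return the effective mode (never ``auto``) for a set of source files."""
    requested = DocumentBoundaryMode(mode)
    present = [metadata_path_for(s).exists() for s in sources]

    if requested is DocumentBoundaryMode.tokenizer:
        return requested
    if requested is DocumentBoundaryMode.metadata:
        if not present or not all(present):
            raise FileNotFoundError("metadata mode requires metadata for every source")
        return requested

    # auto: use metadata only when every source has it.
    if present and all(present):
        return DocumentBoundaryMode.metadata
    if any(present):
        # Refuse to mix metadata-backed and tokenizer-backed sources silently.
        raise ValueError("metadata exists for only some source files")
    return DocumentBoundaryMode.tokenizer
```

Returning the effective mode, rather than the requested one, is what lets downstream code such as cache keys record which boundary source was actually used.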

Motivation

Issue #594 showed that Qwen-style SFT data can break when the tokenizer EOS token also appears at message boundaries. In that case, recovering document boundaries by scanning for the EOS token splits one conversation into many smaller documents.

The upstream SFT conversion pipeline already writes authoritative conversation boundaries into sibling .csv.gz metadata files, so this PR uses that metadata directly in the document-aware dataset paths implicated by the issue.
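A toy reproduction of the failure mode, and of how metadata avoids it, might look like the sketch below. The token IDs, the `EOS_ID` value, the file name, and the assumption that the first two CSV columns hold each document's start and end offsets are all illustrative; the real metadata layout should be checked against the conversion pipeline.

```python
import csv
import gzip
import os
import tempfile

EOS_ID = 0  # hypothetical EOS token id, chosen for illustration


def spans_from_eos(tokens, eos_id):
    """Split at every EOS token; each span ends just after an EOS."""
    spans, start = [], 0
    for i, tok in enumerate(tokens):
        if tok == eos_id:
            spans.append((start, i + 1))
            start = i + 1
    if start < len(tokens):
        spans.append((start, len(tokens)))
    return spans


def spans_from_metadata(path):
    """Read (start, end) pairs from a gzipped CSV (assumed column layout)."""
    with gzip.open(path, "rt", newline="") as f:
        return [(int(row[0]), int(row[1])) for row in csv.reader(f) if row]


# Two conversations: the first has three messages, each closed by EOS.
tokens = [5, 6, EOS_ID, 7, 8, EOS_ID, 9, EOS_ID, 3, 4, EOS_ID]

with tempfile.TemporaryDirectory() as tmp:
    meta_path = os.path.join(tmp, "part-0.csv.gz")
    with gzip.open(meta_path, "wt", newline="") as f:
        csv.writer(f).writerows([(0, 8), (8, 11)])
    eos_spans = spans_from_eos(tokens, EOS_ID)
    meta_spans = spans_from_metadata(meta_path)

print(eos_spans)   # [(0, 3), (3, 6), (6, 8), (8, 11)]
print(meta_spans)  # [(0, 8), (8, 11)]
```

EOS scanning reports four documents where the metadata records two, which is the fragmentation described above.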

Design Notes

The change is intentionally narrow:

  • low-level utility helpers still default to tokenizer-based behavior so unrelated dataset paths do not change semantics accidentally
  • metadata-first behavior is enabled explicitly in the document-aware packed, padded, and composable source paths that need it
  • packed runtime doc_lens now come from known packed spans, which removes an unnecessary EOS-token dependency from the masking path
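The last point can be illustrated with a small sketch: once the spans packed into an instance are known, per-document lengths follow from arithmetic on the span offsets, with no need to inspect token values. The function names are hypothetical, and reporting trailing padding as its own segment is an assumption made here for illustration rather than something the PR description states.

```python
from itertools import accumulate


def doc_lens_from_spans(spans, sequence_length):
    """Derive per-document lengths for one packed instance from its spans."""
    lens = [end - start for start, end in spans]
    if any(n <= 0 for n in lens):
        raise ValueError("every span must have positive length")
    used = sum(lens)
    if used > sequence_length:
        raise ValueError("spans do not fit in the packed instance")
    if used < sequence_length:
        # Assumption: trailing padding is reported as one extra segment.
        lens.append(sequence_length - used)
    return lens


def cumulative_doc_lens(doc_lens):
    """Cumulative offsets, the form typically consumed by document masking."""
    return [0, *accumulate(doc_lens)]


doc_lens = doc_lens_from_spans([(0, 8), (8, 11)], sequence_length=16)
print(doc_lens)                       # [8, 3, 5]
print(cumulative_doc_lens(doc_lens))  # [0, 8, 11, 16]
```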

Testing

Focused regression coverage added for:

  • metadata precedence over local EOS scanning
  • mixed metadata presence failures
  • packed doc_lens generation from metadata-backed spans
  • padded metadata-backed segmentation
  • composable NumpyDocumentSource metadata-backed offsets
  • fingerprint/cache changes when metadata changes
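The property checked by the last item can be sketched as follows. This is a simplified stand-in, not the fingerprint code in this PR: `dataset_fingerprint` is a hypothetical helper, and treating "metadata file identity" as a digest of the file contents is an assumption.

```python
import hashlib
import os
import tempfile


def dataset_fingerprint(sources, boundary_mode, metadata_paths=()):
    """Hash source names, the effective boundary mode, and metadata contents."""
    h = hashlib.sha256()
    for source in sources:
        h.update(source.encode("utf-8") + b"\0")
    h.update(f"boundary_mode={boundary_mode}".encode("utf-8") + b"\0")
    if boundary_mode == "metadata":
        # Assumption: file identity is approximated by a content digest.
        for path in metadata_paths:
            with open(path, "rb") as f:
                h.update(hashlib.sha256(f.read()).digest())
    return h.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    meta = os.path.join(tmp, "part-0.csv.gz")
    with open(meta, "wb") as f:
        f.write(b"0,8\n8,11\n")
    before = dataset_fingerprint(["part-0.npy"], "metadata", [meta])
    tokenizer_before = dataset_fingerprint(["part-0.npy"], "tokenizer", [meta])

    with open(meta, "wb") as f:
        f.write(b"0,5\n5,11\n")
    after = dataset_fingerprint(["part-0.npy"], "metadata", [meta])
    tokenizer_after = dataset_fingerprint(["part-0.npy"], "tokenizer", [meta])

# Metadata edits invalidate metadata-backed caches only.
assert before != after
assert tokenizer_before == tokenizer_after
assert before != tokenizer_before
```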

Commands run:

  • uv run pytest src/test/data/utils_test.py src/test/data/numpy_dataset_test.py src/test/data/composable/numpy_document_source_test.py -q
  • uv run pytest src/test/data -k 'not mixes_test' -q
  • UV_CACHE_DIR=/tmp/uv-cache uv run ruff check src/olmo_core/data/types.py src/olmo_core/data/utils.py src/olmo_core/data/numpy_dataset.py src/olmo_core/data/composable/numpy_document_source.py src/olmo_core/data/__init__.py src/test/data/utils_test.py src/test/data/numpy_dataset_test.py src/test/data/composable/numpy_document_source_test.py

Related

@taivu1998
Contributor Author

taivu1998 commented Apr 19, 2026

Hi @tyler-romero, could you help review this PR? Thanks!

Development

Successfully merging this pull request may close these issues.

SFT for Qwen Models
