Add step-by-step data pipeline and resumable training #1

Open
mjnchen wants to merge 1 commit into main from feature/step-by-step-pipeline

Conversation

mjnchen commented Mar 10, 2026

Summary

  • Add scripts/prepare_data.py to separate data download and tokenization into inspectable steps (data/raw/*.jsonl and data/tokenized/*.pt)
  • Update minillm/dataset.py to load pre-tokenized .pt files from disk instead of re-downloading/tokenizing on every training run
  • Add --resume and --max-steps flags to scripts/train.py for incremental, checkpoint-resumable training
  • Update README with the new step-by-step workflow and inspection commands
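
To make the two inspectable steps concrete, here is a minimal sketch of the download-then-tokenize split. It is not the code in scripts/prepare_data.py: the function names, the `text` key, and the file name are hypothetical, and the byte-level `toy_tokenize` is a stand-in for the project's real tokenizer. The sketch also stops at an in-memory list of token IDs rather than writing a .pt file, so it runs with the standard library only.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Step 1: persist raw text records, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Yield records back from a JSONL file so they can be inspected."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def toy_tokenize(text):
    """Stand-in tokenizer: UTF-8 byte values, NOT the project's tokenizer."""
    return list(text.encode("utf-8"))


def tokenize_jsonl(path, text_key="text"):
    """Step 2: read the raw file and flatten it into one list of token IDs."""
    ids = []
    for record in read_jsonl(path):
        ids.extend(toy_tokenize(record[text_key]))
    return ids


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw_path = Path(tmp) / "train.jsonl"  # hypothetical file name
        write_jsonl([{"text": "hi"}, {"text": "ok"}], raw_path)
        print(raw_path.read_text(encoding="utf-8"))  # inspect the raw step
        print(tokenize_jsonl(raw_path))  # inspect the tokenized step
```

Because each step reads and writes its own file, the raw JSONL can be checked before any tokenization happens, and the tokenized output can be regenerated without downloading again.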

Test plan

  • All 28 tests pass (python -m pytest tests/ -v)
  • No linter errors
  • Run python scripts/prepare_data.py end-to-end
  • Run python scripts/train.py --max-steps 500 then resume with --resume
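
The last test-plan item can be pictured with a small sketch of how the two flags might interact. This is an assumption-laden illustration, not the parser in scripts/train.py: it assumes `--resume` is a boolean switch and that `--max-steps` caps the number of steps run in a single invocation. The helper `steps_to_run` and the `total_steps` value are hypothetical.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the two flags named in this PR."""
    parser = argparse.ArgumentParser(description="Sketch of resumable training flags")
    parser.add_argument(
        "--resume",
        action="store_true",
        help="continue from the last checkpoint (assumed to be a boolean flag)",
    )
    parser.add_argument(
        "--max-steps",
        type=int,
        default=None,
        help="steps to run in this invocation (assumed semantics)",
    )
    return parser


def steps_to_run(start_step, max_steps, total_steps):
    """Return the global step indices this invocation should execute."""
    if max_steps is None:
        end = total_steps
    else:
        end = min(total_steps, start_step + max_steps)
    return range(start_step, end)


if __name__ == "__main__":
    args = build_parser().parse_args()
    # In a real run, start_step would come from the checkpoint when resuming.
    start_step = 0
    steps = steps_to_run(start_step, args.max_steps, total_steps=2000)
    print(f"resume={args.resume}, running steps {steps.start}..{steps.stop - 1}")
```

Under these assumptions, a first run with `--max-steps 500` covers steps 0 to 499, and a resumed run that loads step 500 from the checkpoint continues from there.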

- Add scripts/prepare_data.py: separate download (JSONL) and tokenize (.pt) steps
  so raw and tokenized data can be inspected independently
- Update minillm/dataset.py to load pre-tokenized .pt files from disk
  instead of downloading and tokenizing on every run
- Add --resume and --max-steps flags to scripts/train.py for incremental
  training with full checkpoint resumption (model, optimizer, step, best val loss)
- Update README with step-by-step workflow documentation
- Fix flaky test_respects_max_new_tokens test (BPE re-encode round-trip issue)
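
The commit message lists four pieces of state for full resumption: model, optimizer, step, and best val loss. A minimal sketch of saving and restoring them together follows. It is not the project's implementation: the function names and dictionary keys are hypothetical, and `pickle` stands in for whatever serializer the training script actually uses (PyTorch projects typically use `torch.save`), so that the sketch runs without extra dependencies.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, model_state, optimizer_state, step, best_val_loss):
    """Write all four pieces of resumable state into one file."""
    payload = {
        "model": model_state,
        "optimizer": optimizer_state,
        "step": step,
        "best_val_loss": best_val_loss,
    }
    # Write to a temp file first, then rename, so an interrupted save
    # never leaves a half-written checkpoint at `path`.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(payload, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh-start state if no file exists."""
    if not os.path.exists(path):
        return {
            "model": None,
            "optimizer": None,
            "step": 0,
            "best_val_loss": float("inf"),
        }
    with open(path, "rb") as f:
        return pickle.load(f)
```

Restoring the optimizer state and the best validation loss alongside the weights matters: without the former, momentum-style statistics restart from zero, and without the latter, a resumed run could overwrite a better checkpoint from an earlier session.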