Add step-by-step data pipeline and resumable training by mjnchen · Pull Request #1 · mjnchen/miniLLM

mjnchen · 2026-03-10T19:50:11Z

Summary

Add scripts/prepare_data.py to separate data download and tokenization into inspectable steps (data/raw/*.jsonl and data/tokenized/*.pt)
Update minillm/dataset.py to load pre-tokenized .pt files from disk instead of re-downloading/tokenizing on every training run
Add --resume and --max-steps flags to scripts/train.py for incremental, checkpoint-resumable training
Update README with the new step-by-step workflow and inspection commands
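The download/tokenize split described above can be sketched roughly as follows. This is a dependency-free illustration, not the code in scripts/prepare_data.py: the function names are hypothetical, a byte-level encoding stands in for the real tokenizer, and the token ids are stored as JSON rather than in a .pt file so the sketch runs without PyTorch.

```python
import json
import tempfile
from pathlib import Path


def write_raw(records, raw_path):
    """Step 1: persist raw text records as JSONL, one JSON object per line."""
    raw_path.parent.mkdir(parents=True, exist_ok=True)
    with raw_path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")


def tokenize_raw(raw_path, tok_path):
    """Step 2: read the JSONL file and cache the token ids on disk."""
    ids = []
    with raw_path.open(encoding="utf-8") as f:
        for line in f:
            text = json.loads(line)["text"]
            ids.extend(text.encode("utf-8"))  # toy byte-level stand-in for a tokenizer
    tok_path.parent.mkdir(parents=True, exist_ok=True)
    tok_path.write_text(json.dumps(ids))


def load_tokenized(tok_path):
    """Training-time loader: reads the cached ids, no download or re-tokenization."""
    return json.loads(tok_path.read_text())


# In-memory records stand in for the downloaded dataset (assumption).
root = Path(tempfile.mkdtemp())
raw_file = root / "raw" / "train.jsonl"
tok_file = root / "tokenized" / "train.json"
write_raw([{"text": "hi"}, {"text": "ok"}], raw_file)
tokenize_raw(raw_file, tok_file)
tokens = load_tokenized(tok_file)
```

Because each step writes a file, the raw JSONL and the cached ids can be opened and inspected separately before any training run.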

Test plan

All 28 tests pass (python -m pytest tests/ -v)
No linter errors
Run python scripts/prepare_data.py end-to-end
Run python scripts/train.py --max-steps 500 then resume with --resume
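For reference, the two flags exercised in the test plan could be declared along these lines. This is only a sketch: it assumes --resume is a boolean switch and --max-steps an optional integer, which matches the commands above but may differ from the actual definitions in scripts/train.py.

```python
import argparse


def build_parser():
    # Hypothetical parser; help texts and defaults are assumptions.
    parser = argparse.ArgumentParser(description="train.py CLI sketch")
    parser.add_argument(
        "--resume",
        action="store_true",
        help="continue from the last saved checkpoint",
    )
    parser.add_argument(
        "--max-steps",
        type=int,
        default=None,
        help="stop after this many steps (no limit if omitted)",
    )
    return parser


# The two invocations from the test plan, parsed without running any training.
first_run = build_parser().parse_args(["--max-steps", "500"])
second_run = build_parser().parse_args(["--resume"])
```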

- Add scripts/prepare_data.py: separate download (JSONL) and tokenize (.pt) steps so raw and tokenized data can be inspected independently
- Update minillm/dataset.py to load pre-tokenized .pt files from disk instead of downloading and tokenizing on every run
- Add --resume and --max-steps flags to scripts/train.py for incremental training with full checkpoint resumption (model, optimizer, step, best val loss)
- Update README with step-by-step workflow documentation
- Fix flaky test_respects_max_new_tokens test (BPE re-encode round-trip issue)
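The checkpoint contents listed above (model, optimizer, step, best val loss) suggest a resume loop along the following lines. This is a toy sketch under stated assumptions: plain dicts stand in for the model and optimizer state, JSON stands in for the real checkpoint format, the train function is hypothetical, and max_steps is treated as a per-invocation step budget, which the PR does not confirm.

```python
import json
import tempfile
from pathlib import Path


def train(ckpt_path, max_steps, resume=False):
    # Fresh state; "model" and "optimizer" are toy stand-ins for real state dicts.
    state = {
        "model": {"w": 0.0},
        "optimizer": {"lr": 0.1},
        "step": 0,
        "best_val_loss": None,
    }
    if resume and ckpt_path.exists():
        # Restore every field so the run continues exactly where it stopped.
        state = json.loads(ckpt_path.read_text())

    for _ in range(max_steps):
        # Toy update: move w toward the target value 1.0.
        w = state["model"]["w"]
        state["model"]["w"] = w + state["optimizer"]["lr"] * (1.0 - w)
        state["step"] += 1
        val_loss = (1.0 - state["model"]["w"]) ** 2
        if state["best_val_loss"] is None or val_loss < state["best_val_loss"]:
            state["best_val_loss"] = val_loss

    ckpt_path.write_text(json.dumps(state))
    return state


ckpt = Path(tempfile.mkdtemp()) / "ckpt.json"
first = train(ckpt, max_steps=3)
resumed = train(ckpt, max_steps=2, resume=True)
```

Saving the step counter and best validation loss alongside the weights is what lets a 3-step run followed by a 2-step resume end in the same state as a single 5-step run.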


mjnchen wants to merge 1 commit into main from feature/step-by-step-pipeline
