
gpu-compat-bugs-and-computational-bottlenecks #20

@david-thrower

Description

| # | Bug | Location | Impact |
|---|-----|----------|--------|
| 1 | `HelixDataset` lazy mode re-tokenizes on every `__getitem__` | `dataset.py`, `HelixDataset.__getitem__` | ~1-2s per sample; the single biggest per-batch blocker if `HelixDataset` is ever used (e.g. via `create_helix_dataloader`). |
| 2 | `create_document_loader` defaults to `num_workers=0`, `pin_memory=False` | `dataset.py`, `create_document_loader` | Main-thread tokenization plus synchronous CPU→GPU transfers leave the GPU idle between batches. |
| 3 | `load_texts()` fully materializes streaming HF datasets into a Python list | `nas_helixlm.py`, `load_texts()` | For the 400M-token dataset, this exhausts RAM and forces the OS to swap, making everything crawl. |
| 4 | AMP loss reporting uses the scaled loss (not a throughput bug, but it corrupts metrics) | `trainer.py`, `train_epoch()` | `scaler.scale(loss)` is called before `loss.item()`, so the reported loss is `loss * 65536` (65536 being `GradScaler`'s default initial scale). |
| 5 | Trainer scheduler init calls `len(train_loader)`, which forces eager chunking of lazy datasets at epoch start | `trainer.py`, `train_epoch()` | Causes a long main-thread stall right when the epoch starts. |

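
For bug 3, a possible alternative to collecting every text into a list is to consume the stream lazily and emit fixed-length token chunks as they fill up, so only a small buffer is resident at any time. This is a rough sketch under stated assumptions: `iter_token_chunks`, `fake_stream`, and `word_length_tokenize` are hypothetical names, the stream is assumed to be any iterable of strings, and whether the callers of `load_texts()` can accept an iterator instead of a list has not been checked.

```python
from typing import Callable, Iterable, Iterator, List


def iter_token_chunks(
    texts: Iterable[str],
    tokenize: Callable[[str], List[int]],
    chunk_len: int,
) -> Iterator[List[int]]:
    """Yield fixed-length token chunks without materializing the whole stream."""
    buffer: List[int] = []
    for text in texts:
        buffer.extend(tokenize(text))
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= chunk_len:
            yield buffer[:chunk_len]
            buffer = buffer[chunk_len:]
    # Any trailing partial chunk is dropped in this sketch.


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming dataset (assumption): yields texts one at a time."""
    for i in range(5):
        yield f"doc {i} has some words"


def word_length_tokenize(text: str) -> List[int]:
    """Toy tokenizer (assumption): one 'token id' per word, equal to its length."""
    return [len(word) for word in text.split()]


chunks = list(iter_token_chunks(fake_stream(), word_length_tokenize, chunk_len=4))
```

The `list(...)` call at the end is only there to make the toy result inspectable; in a training loop the generator would be iterated directly, which also avoids needing `len()` up front (relevant to bug 5).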