Description

| # | Bug | Location | Impact |
|---|-----|----------|--------|
| 1 | `HelixDataset` lazy mode re-tokenizes on every `__getitem__` | `dataset.py` `HelixDataset.__getitem__` | ~1-2s per sample; the single biggest per-batch blocker if `HelixDataset` is ever used (e.g. via `create_helix_dataloader`). |
| 2 | `create_document_loader` defaults to `num_workers=0`, `pin_memory=False` | `dataset.py` `create_document_loader` | Main-thread tokenization + sync CPU→GPU transfers choke the GPU between batches. |
| 3 | `load_texts()` fully materializes streaming HF datasets into a Python `list` | `nas_helixlm.py` `load_texts()` | For the 400M-token dataset, this explodes RAM and forces the OS to swap, making everything crawl. |
| 4 | AMP loss reporting uses scaled loss (not a throughput bug, but corrupts metrics) | `trainer.py` `train_epoch()` | `scaler.scale(loss)` is called before `loss.item()`, so reported loss is `loss * 65536`. |
| 5 | `Trainer` scheduler init calls `len(train_loader)` which forces eager chunking of lazy datasets at epoch start | `trainer.py` `train_epoch()` | Causes a massive main-thread hiccup right when the epoch starts. |
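
For bug 1, one possible mitigation is to memoize tokenization per index so that repeated `__getitem__` calls reuse earlier work. This is a minimal sketch, not the actual `HelixDataset`: `CachedLazyDataset` and `toy_tokenize` are hypothetical names, and the real tokenizer and chunking logic are assumed to be slower and more involved.

```python
from typing import Callable, Dict, List, Sequence


class CachedLazyDataset:
    """Hypothetical lazy dataset that tokenizes each sample at most once."""

    def __init__(self, texts: Sequence[str], tokenize: Callable[[str], List[int]]):
        self._texts = texts
        self._tokenize = tokenize
        self._cache: Dict[int, List[int]] = {}

    def __len__(self) -> int:
        return len(self._texts)

    def __getitem__(self, idx: int) -> List[int]:
        # Tokenize on first access only; later accesses hit the cache.
        if idx not in self._cache:
            self._cache[idx] = self._tokenize(self._texts[idx])
        return self._cache[idx]


calls = 0


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (assumption): counts calls, maps chars to code points."""
    global calls
    calls += 1
    return [ord(ch) for ch in text]


dataset = CachedLazyDataset(["ab", "cd"], toy_tokenize)
first = dataset[0]
second = dataset[0]  # served from the cache, no second tokenizer call
```

The cache lives in process memory, so it grows with the number of distinct samples touched, and each DataLoader worker process would hold its own copy. For a large corpus, tokenizing once up front and storing the result on disk may be the better trade.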
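
For bug 3, the general alternative to building a full `list` is to consume the stream in bounded batches. The sketch below shows that pattern with the standard library only; `iter_batches` and `fake_stream` are hypothetical helpers, and `fake_stream` merely stands in for whatever iterable `load_texts()` reads from.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_batches(texts: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield fixed-size batches from any iterable without materializing it."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(texts)
    while True:
        # Pull at most batch_size records; only this batch is held in memory.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_stream(n: int) -> Iterator[str]:
    """Stand-in for a streaming dataset (assumption): yields records one by one."""
    for i in range(n):
        yield f"doc-{i}"


# list() is used here only to inspect the result of a tiny demo stream.
batches = list(iter_batches(fake_stream(5), batch_size=2))
```

With this shape, peak memory is bounded by `batch_size` records rather than by the size of the whole dataset.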
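
For bug 4, the issue does not show the trainer code, so the following is only an illustration of how the symptom can arise: if the result of `scaler.scale(loss)` is bound back to the same name, any later read reports the scaled value. `StandInScaler` is a hypothetical stand-in that mimics only the multiplication done by a gradient scaler, and plain floats replace tensors so the sketch runs without PyTorch.

```python
class StandInScaler:
    """Hypothetical stand-in: mimics only the multiply done by scale()."""

    def __init__(self, init_scale: float = 65536.0):
        self._scale = init_scale

    def scale(self, loss: float) -> float:
        return loss * self._scale


def reported_loss_buggy(loss: float, scaler: StandInScaler) -> float:
    # Rebinding 'loss' means the value read afterwards is already scaled.
    loss = scaler.scale(loss)
    return loss


def reported_loss_fixed(loss: float, scaler: StandInScaler) -> float:
    # Capture the value to report before scaling, and keep the scaled
    # value under a separate name (it would be used for backward()).
    reported = loss
    scaled = scaler.scale(loss)
    del scaled  # unused in this sketch
    return reported


scaler = StandInScaler()
buggy = reported_loss_buggy(0.5, scaler)
fixed = reported_loss_fixed(0.5, scaler)
```

In a real training loop the same idea applies: read the value to log from the unscaled loss, and pass only the scaled one to the backward pass.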