🐛 Bug
The train_test_split code in litdata/utilities/train_test_split.py seems to rely on the lexicographic (alphabetical) ordering of chunk filenames, which does not match the numeric chunk order.
Stepping through the code, I noticed that on L91 the very first split contains these files:
curr_chunk_filename=['chunk-0-0.bin', 'chunk-1-0.bin', 'chunk-10-0.bin', 'chunk-2-0.bin', 'chunk-3-0.bin', 'chunk-4-0.bin', 'chunk-5-0.bin', 'chunk-6-0.bin']
Notice that chunk-10-0.bin snuck in between chunk-1-0.bin and chunk-2-0.bin.
From testing, there seem to be two failure paths:
- more than 10 chunks are written per worker
- more than 10 workers are used
As soon as a chunk is named chunk-10-x or chunk-x-10, the filenames no longer sort in numeric order.
This is especially relevant when working with temporal data, where a chunk landing out of order can put later samples into an earlier split.
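
To illustrate the difference, here is a minimal sketch that compares plain string sorting with a numeric sort key. It is not taken from litdata and is not a proposed patch. It assumes chunk filenames always look like chunk-<int>-<int>.bin, and the helper name chunk_sort_key is hypothetical.

```python
import re
from typing import Tuple

# Assumption: chunk files are named chunk-<int>-<int>.bin.
_CHUNK_RE = re.compile(r"chunk-(\d+)-(\d+)\.bin$")


def chunk_sort_key(filename: str) -> Tuple[int, int]:
    """Hypothetical helper: parse both numeric fields so they compare as ints."""
    match = _CHUNK_RE.search(filename)
    if match is None:
        raise ValueError(f"unexpected chunk filename: {filename!r}")
    return int(match.group(1)), int(match.group(2))


names = ["chunk-2-0.bin", "chunk-10-0.bin", "chunk-0-0.bin", "chunk-1-0.bin"]

# Plain string sorting compares character by character, so "10" lands before "2".
lexicographic = sorted(names)

# Sorting on the parsed integers restores numeric order.
numeric = sorted(names, key=chunk_sort_key)

print(lexicographic)
print(numeric)
```

With a key like this, chunk-10-0.bin sorts after chunk-2-0.bin instead of right after chunk-1-0.bin. An alternative would be to avoid sorting filenames at all and take the chunk order from whatever metadata records it, if that is available.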