Description
🚀 Feature
Being able to interrupt and resume the merge_datasets process
Motivation
When dealing with very large datasets, it is not feasible to run ld.optimize on the whole dataset in a single pass on one machine. The workflow then becomes:
split the dataset into pieces -> run ld.optimize on each piece (possibly in parallel) -> call merge_datasets to reconstruct the full dataset
merge_datasets can take a significant amount of time and sometimes crashes (interrupted connection, dead workers). In that situation, it would be great if re-executing merge_datasets resumed the merge instead of failing. Right now, I have to delete the partially merged output folder and restart from the beginning.
Pitch
When merge_datasets crashes and I execute it again, instead of failing it should scan the already created output folder and resume the merging operation.
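To make the requested behaviour concrete, here is a minimal sketch of a "scan and resume" copy loop using only the Python standard library. It is not how merge_datasets is implemented: the function name `resumable_copy` and the `part{i}-` naming scheme are hypothetical, and it assumes that merging mostly amounts to copying files into the output folder. It only illustrates the idea that a re-run skips files that were already fully written.

```python
import os
import shutil
from pathlib import Path
from typing import Iterable, List


def resumable_copy(input_dirs: Iterable[str], output_dir: str) -> List[str]:
    """Copy every file from input_dirs into output_dir, skipping finished ones.

    Returns the names of the destination files written during this call.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    copied = []
    for i, src_dir in enumerate(input_dirs):
        for src in sorted(Path(src_dir).iterdir()):
            if not src.is_file():
                continue
            # Hypothetical naming scheme: prefix with the part index to avoid clashes.
            dst = out / f"part{i}-{src.name}"
            if dst.exists():
                continue  # already completed in an earlier run
            # Write to a temporary name first, so a crash never leaves a
            # half-written file under the final name.
            tmp = dst.with_name(dst.name + ".tmp")
            shutil.copyfile(src, tmp)
            os.replace(tmp, dst)  # atomic rename on the same filesystem
            copied.append(dst.name)
    return copied
```

With this pattern, a second call after a crash only copies what is still missing; any index or metadata file would have to be written last, once all copies are done.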
Alternatives
In the meantime, would it be viable to call merge_datasets "recursively"? Say my dataset is split into 20 parts: I would call merge_datasets separately on parts 0-4, 5-9, 10-14 and 15-19, and then call merge_datasets again on the four resulting folders. Would that be equivalent to calling merge_datasets once on all 20 parts?
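The two-level merge I have in mind would look roughly like the sketch below. The helper `merge_in_groups`, the `tmp_root` folder layout and the `merge_fn` parameter are hypothetical names for illustration; `merge_fn` stands for any callable that takes a list of input directories and an output directory, which is how I assume merge_datasets would be plugged in.

```python
from typing import Callable, List, Sequence

# Any callable taking (input_dirs, output_dir); assumed shape for illustration.
MergeFn = Callable[[List[str], str], None]


def merge_in_groups(
    input_dirs: Sequence[str],
    output_dir: str,
    merge_fn: MergeFn,
    group_size: int = 5,
    tmp_root: str = "merge_tmp",
) -> List[str]:
    """Merge input_dirs in groups, then merge the intermediate results.

    Returns the list of intermediate folders that were created.
    """
    # Split the parts into consecutive groups: 0-4, 5-9, 10-14, ...
    groups = [
        list(input_dirs[i:i + group_size])
        for i in range(0, len(input_dirs), group_size)
    ]
    intermediate = []
    for g, group in enumerate(groups):
        group_out = f"{tmp_root}/group_{g}"
        merge_fn(group, group_out)  # first level: one merge per group
        intermediate.append(group_out)
    # Second level: merge the intermediate folders into the final dataset.
    merge_fn(intermediate, output_dir)
    return intermediate
```

With 20 parts and `group_size=5` this issues four first-level merges and one final merge, so a crash would cost at most one group instead of the whole run. Whether the final result is equivalent to a single merge over all parts is exactly the open question.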