Interrupt and resume merge_datasets #784

@aRI0U

🚀 Feature

The ability to interrupt and resume the merge_datasets process.

Motivation

When dealing with very large datasets, one cannot run ld.optimize on the whole dataset at once on a single machine. The workflow then becomes:

split the dataset into pieces -> run ld.optimize on every piece (possibly in parallel) -> run merge_datasets to reconstruct the full dataset

merge_datasets can take a significant amount of time and sometimes crashes (interrupted connection, dead workers). In that situation, it would be great if re-executing merge_datasets resumed the merging operation instead of failing. Right now, I have to delete the partially merged output folder and restart from the beginning.

Pitch

When merge_datasets crashes and I execute it again, instead of failing it should scan the already created output folder and resume the merging operation from where it stopped.
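To make the requested behaviour concrete, here is a minimal sketch of a resumable merge using only the Python standard library. It is not how merge_datasets works internally and it does not rebuild any dataset index; it only illustrates the "scan what is already there and continue" idea. The function name resumable_merge, the state file _merge_state.json, and the part{idx}- naming scheme are all hypothetical.

```python
import json
import os
import shutil
from pathlib import Path


def resumable_merge(input_dirs, output_dir, state_name="_merge_state.json"):
    """Copy every file of every input dir into output_dir, skipping files
    that a previous (possibly interrupted) run already recorded as done.

    Hypothetical illustration only: the state file and naming scheme are
    assumptions, not part of any existing library.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    state_path = out / state_name

    # Load the list of files completed by an earlier run, if any.
    done = set(json.loads(state_path.read_text())) if state_path.exists() else set()

    copied = []
    for idx, src_dir in enumerate(input_dirs):
        for src in sorted(Path(src_dir).iterdir()):
            if not src.is_file():
                continue
            # Prefix with the part index so equal file names do not collide.
            dst_name = f"part{idx}-{src.name}"
            if dst_name in done:
                continue  # already merged by a previous run

            # Copy to a temporary name, then rename atomically, so a crash
            # never leaves a half-written file under its final name.
            tmp = out / (dst_name + ".tmp")
            shutil.copyfile(src, tmp)
            os.replace(tmp, out / dst_name)

            # Persist progress after each file, also via an atomic rename.
            done.add(dst_name)
            state_tmp = out / (state_name + ".tmp")
            state_tmp.write_text(json.dumps(sorted(done)))
            os.replace(state_tmp, state_path)
            copied.append(dst_name)
    return copied
```

Calling the function a second time with the same arguments copies only the files that are not yet recorded in the state file, so a crashed run can be continued by re-running the same command.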

Alternatives

In the meantime, would it be viable to call merge_datasets "recursively"? Let's say my dataset is split into 20 parts: I would call merge_datasets separately on parts 0-4, 5-9, 10-14 and 15-19, and then call merge_datasets again on the four resulting folders. Would that be equivalent to calling merge_datasets only once on all 20 parts?
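The staged approach described above can be sketched as follows. The merge function is passed in as merge_fn and is assumed to take a list of input directories and an output directory; the helper names, the stage-XX folder names and the final folder name are hypothetical. The sketch only shows the grouping logic and does not answer whether the result is equivalent to a single merge.

```python
from typing import Callable, List, Sequence


def batched(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be >= 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def merge_in_stages(
    part_dirs: Sequence[str],
    batch_size: int,
    merge_fn: Callable[[List[str], str], None],
    stage_dir_fmt: str = "stage-{:02d}",  # hypothetical intermediate folder names
    final_dir: str = "merged",            # hypothetical final folder name
) -> List[str]:
    """Merge parts group by group, then merge the intermediate results.

    Returns the list of intermediate output directories.
    """
    intermediates = []
    for i, group in enumerate(batched(part_dirs, batch_size)):
        stage_out = stage_dir_fmt.format(i)
        merge_fn(group, stage_out)  # first level: e.g. parts 0-4 -> stage-00
        intermediates.append(stage_out)

    # Second level: merge the intermediate folders into the final dataset.
    merge_fn(intermediates, final_dir)
    return intermediates
```

With 20 parts and batch_size=5 this makes five calls in total: four first-level merges and one final merge. If one call crashes, only that group has to be redone rather than the whole merge.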

Labels: enhancement