[FEATURE] Make it possible to resume training mid-epoch, rather than only from checkpoints captured at the start of an epoch. #55

@MauritsBleeker

Description

Is your feature request related to a problem? Please describe.
Currently, it is not possible to resume training from a checkpoint captured mid-epoch. Resuming at epoch boundaries works for most use cases. However, with very large datasets that are trained for only a few epochs, restarting from the beginning of an epoch discards a significant amount of work. Hence, we need to make it possible to resume at a specific update step as well.

Describe the solution you'd like
Technically, resuming training mid-epoch is already possible; we only have to accept that we lose control over data shuffling. We need to remove the asserts that prevent mid-epoch resumption and validate that the pipeline remains correct.
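A minimal sketch of the idea, in plain Python rather than this project's code: the checkpoint records how many batches of the current epoch were already consumed, and on resume the loader is fast-forwarded by that count. All names here (`save_checkpoint`, `resume`, `make_loader`) are illustrative, not the project's API. Note that if the loader reshuffles on resume, the skipped batches are not the exact ones seen before; that is the loss of control over shuffling mentioned above.

```python
import itertools
import json
import os
import tempfile

def make_loader(dataset, batch_size):
    """Yield successive batches of the dataset (no shuffling, for simplicity)."""
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]

def save_checkpoint(path, epoch, batches_seen, weights):
    """Mid-epoch checkpoint: also stores how many batches were consumed."""
    with open(path, "w") as f:
        json.dump({"epoch": epoch, "batches_seen": batches_seen,
                   "weights": weights}, f)

def resume(path, dataset, batch_size):
    """Load state and fast-forward the loader past already-seen batches."""
    with open(path) as f:
        state = json.load(f)
    loader = make_loader(dataset, batch_size)
    remaining = itertools.islice(loader, state["batches_seen"], None)
    return state, list(remaining)

dataset = list(range(10))
path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
# Interrupted after 2 of 5 batches in epoch 3.
save_checkpoint(path, epoch=3, batches_seen=2, weights=[0.1, 0.2])
state, remaining = resume(path, dataset, batch_size=2)
print(remaining)  # [[4, 5], [6, 7], [8, 9]]
```

With a reshuffled loader, `batches_seen` still bounds the epoch correctly (the right *number* of batches is skipped), which is the trade-off the proposal accepts.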

Describe alternatives you've considered

  • Only initialize the weights and manually set the LR schedule to the state where the previous checkpoint ended. However, the optimizer state is then lost.
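A toy illustration (plain Python, not the project's code) of why that alternative is lossy: with an optimizer like Adam, the moment estimates are part of the optimizer state, and resetting them changes the very next update even if the weights and LR are restored exactly.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update on a scalar weight; returns new (w, m, v)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)   # bias-corrected second moment
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Train for three steps with varying gradients, then "checkpoint".
w, m, v = 1.0, 0.0, 0.0
for t, grad in enumerate([1.0, -1.0, 1.0], start=1):
    w, m, v = adam_step(w, grad, m, v, t)

# Full resume: weights AND optimizer state (m, v, step count) restored.
w_full, _, _ = adam_step(w, grad=-1.0, m=m, v=v, t=4)

# Weights-only resume: m, v reset to zero and the step count restarts,
# so the momentum accumulated before the checkpoint is gone.
w_partial, _, _ = adam_step(w, grad=-1.0, m=0.0, v=0.0, t=1)

# The two trajectories diverge immediately after resume.
assert abs(w_full - w_partial) > 1e-6
```

This is why restoring the full checkpoint (weights plus optimizer state) is preferable to re-initializing and replaying only the LR schedule.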

Additional context

Metadata

Assignees

No one assigned

    Labels

    No labels

    Projects

    No projects

    Milestone

    No milestone
