Is your feature request related to a problem? Please describe.
Currently, it is not possible to resume training from a checkpoint that was saved mid-epoch. This is fine in most use cases. However, with very large datasets that are only trained for a few epochs, being restricted to resuming at an epoch boundary is very limiting. Hence, we need to make it possible to resume at a specific update step as well.
Describe the solution you'd like
Technically, it is already possible to resume training mid-epoch; we only have to accept that we don't have control over the data shuffling, since the shuffle order of the interrupted epoch cannot be reproduced exactly. Concretely, we have to remove the asserts that prevent resuming mid-epoch and validate that the pipeline still behaves correctly.
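A minimal sketch of the idea, independent of any particular framework (all names here are hypothetical, not from the actual codebase): the checkpoint records the global update step, and on resume we re-shuffle the epoch (with whatever RNG state we now have) and skip the batches that were already consumed before the checkpoint was written. The skipped batches will not match the ones seen before the interruption, which is exactly the loss of control over shuffling we accept.

```python
import random


def make_epoch_batches(num_examples, batch_size, seed):
    # Shuffle the example indices for one epoch. After a restart the seed /
    # RNG state may differ from the interrupted run, so the order is not
    # reproducible -- we accept that trade-off.
    order = list(range(num_examples))
    random.Random(seed).shuffle(order)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def resume_mid_epoch(num_examples, batch_size, seed, resume_step, steps_per_epoch):
    """Return the remaining batches of the epoch that `resume_step` falls into.

    `resume_step` is the global update step stored in the checkpoint.
    """
    step_in_epoch = resume_step % steps_per_epoch
    batches = make_epoch_batches(num_examples, batch_size, seed)
    # Skip the batches already consumed before the checkpoint was written.
    return batches[step_in_epoch:]
```

For example, with 10 examples, batch size 2 (5 steps per epoch) and a checkpoint at global step 3, resuming yields the last 2 batches of that epoch.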
Describe alternatives you've considered
- Only initialize the weights, and manually set the LR schedule to the state where the previous checkpoint ended. However, the optimizer state is then missing.
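To illustrate why this alternative falls short, here is a hedged sketch (all names and the schedule shape are hypothetical): the weights and the LR can be recovered, but accumulated optimizer statistics such as momentum or Adam moment estimates are re-initialized to zero, which can perturb training right after the restart.

```python
def lr_at_step(step, base_lr=0.1, decay=0.9, decay_every=100):
    # Hypothetical step-wise exponential decay schedule; with a schedule that
    # is a pure function of the step, the LR can be fast-forwarded exactly.
    return base_lr * (decay ** (step // decay_every))


def weights_only_resume(checkpoint, resume_step):
    """Restore weights and fast-forward the LR, but lose the optimizer state."""
    weights = checkpoint["weights"]           # recovered from the checkpoint
    lr = lr_at_step(resume_step)              # manually fast-forwarded
    # Lost: momentum / Adam moments restart from zero instead of their
    # accumulated values at the time of the checkpoint.
    optimizer_state = {"m": 0.0, "v": 0.0}
    return weights, lr, optimizer_state
```

This is why restoring the full checkpoint, including the optimizer state, is preferable to the weights-only workaround.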
Additional context