Is your feature request related to a problem? Please describe.
Currently, it is not possible to resume training from a checkpoint that was saved mid-epoch. This is fine in most use cases. However, with very large datasets that are only trained for a few epochs, being restricted to resuming at an epoch boundary is very limiting. Hence, we need to make it possible to resume at a specific update step as well.
Describe the solution you'd like
Technically, it is already possible to resume training mid-epoch; we only have to accept that we don't have control over the data shuffling, since the shuffle order of the interrupted epoch cannot be reproduced exactly. Concretely, we have to remove the asserts that prevent resuming mid-epoch and validate that the pipeline still behaves correctly.
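A minimal sketch of the idea, independent of any particular framework (all names here are hypothetical, not from the actual codebase): the checkpoint records the global update step, and on resume we re-shuffle the epoch (with whatever RNG state we now have) and skip the batches that were already consumed before the checkpoint was written. The skipped batches will not match the ones seen before the interruption, which is exactly the loss of control over shuffling we accept.

```python
import random


def make_epoch_batches(num_examples, batch_size, seed):
    # Shuffle the example indices for one epoch. After a restart the seed /
    # RNG state may differ from the interrupted run, so the order is not
    # reproducible -- we accept that trade-off.
    order = list(range(num_examples))
    random.Random(seed).shuffle(order)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def resume_mid_epoch(num_examples, batch_size, seed, resume_step, steps_per_epoch):
    """Return the remaining batches of the epoch that `resume_step` falls into.

    `resume_step` is the global update step stored in the checkpoint.
    """
    step_in_epoch = resume_step % steps_per_epoch
    batches = make_epoch_batches(num_examples, batch_size, seed)
    # Skip the batches already consumed before the checkpoint was written.
    return batches[step_in_epoch:]
```

For example, with 10 examples, batch size 2 (5 steps per epoch) and a checkpoint at global step 3, resuming yields the last 2 batches of that epoch.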
Describe alternatives you've considered
- Only initialize the weights, and manually set the LR schedule to the state where the previous checkpoint ended. However, the optimizer state is then missing.
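To illustrate why this alternative falls short, here is a hedged sketch (all names and the schedule shape are hypothetical): the weights and the LR can be recovered, but accumulated optimizer statistics such as momentum or Adam moment estimates are re-initialized to zero, which can perturb training right after the restart.

```python
def lr_at_step(step, base_lr=0.1, decay=0.9, decay_every=100):
    # Hypothetical step-wise exponential decay schedule; with a schedule that
    # is a pure function of the step, the LR can be fast-forwarded exactly.
    return base_lr * (decay ** (step // decay_every))


def weights_only_resume(checkpoint, resume_step):
    """Restore weights and fast-forward the LR, but lose the optimizer state."""
    weights = checkpoint["weights"]           # recovered from the checkpoint
    lr = lr_at_step(resume_step)              # manually fast-forwarded
    # Lost: momentum / Adam moments restart from zero instead of their
    # accumulated values at the time of the checkpoint.
    optimizer_state = {"m": 0.0, "v": 0.0}
    return weights, lr, optimizer_state
```

This is why restoring the full checkpoint, including the optimizer state, is preferable to the weights-only workaround.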
Additional context