Add options to precompute the epoch #569
Open
Add the option to pre-generate the epoch. This should save us a lot of time when there is a lot of work happening between creating the StreamingDataset and iterating it.
Pre-generating can happen concurrently with the last third of init and beyond by providing which epoch and sample offset to generate (`init_pregen_epoch`, `init_pregen_sample`). Note that this is before any `load_state_dict()`, so if a resumption is going to land somewhere other than `0:0`, we won't know it at that time, although the user might. Also, we can't just yolo all the epochs at once because of RAM/scale concerns. Finally, we need to be provided the DataLoader `num_workers` for this to work, as we won't otherwise know it in a rank process without resorting to the garbage collector trampoline.

Pre-generating can happen on the fly as well, more easily so, by setting the bool arg `pregen_next_epoch`, which simply pre-generates `epoch + 1:0` in the background when done generating (or loading pre-generated) the current epoch.

Details are managed by `pregen_epoch_timeout` (defaults to 12 min) and `pregen_epoch_tick` (defaults to `0xCAFE / 1337 / 42`, or just under a second).
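
For reviewers who want a mental model of the on-the-fly mode, here is a rough, self-contained sketch of the general pattern: generate an epoch's sample order in a background thread, poll for it every tick until a timeout, then kick off the next epoch. It is not the code in this PR. `EpochPregenerator`, its methods, and the seeded shuffle standing in for real epoch generation are all hypothetical; only the default timeout and tick values are taken from the description above.

```python
import random
import threading
import time

# Defaults quoted in the PR description.
DEFAULT_TIMEOUT = 12 * 60           # 12 minutes, in seconds
DEFAULT_TICK = 0xCAFE / 1337 / 42   # just under a second


class EpochPregenerator:
    """Hypothetical helper: builds per-epoch sample orders in the background."""

    def __init__(self, num_samples, seed=0,
                 timeout=DEFAULT_TIMEOUT, tick=DEFAULT_TICK):
        self.num_samples = num_samples
        self.seed = seed
        self.timeout = timeout
        self.tick = tick
        self._lock = threading.Lock()
        self._started = set()   # epochs whose generation has been kicked off
        self._ready = {}        # epoch -> finished sample order

    def _work(self, epoch):
        # Stand-in for the real (expensive) epoch generation: a seeded shuffle.
        order = list(range(self.num_samples))
        random.Random(self.seed + epoch).shuffle(order)
        with self._lock:
            self._ready[epoch] = order

    def pregen(self, epoch):
        """Start generating `epoch` in the background (no-op if already started)."""
        with self._lock:
            if epoch in self._started:
                return
            self._started.add(epoch)
        threading.Thread(target=self._work, args=(epoch,), daemon=True).start()

    def get(self, epoch, sample=0, pregen_next=True):
        """Return the order for `epoch` from offset `sample`, waiting if needed."""
        self.pregen(epoch)
        deadline = time.monotonic() + self.timeout
        while True:
            with self._lock:
                if epoch in self._ready:
                    # Drop the stored copy so only one epoch stays resident.
                    order = self._ready.pop(epoch)
                    self._started.discard(epoch)
                    break
            if time.monotonic() >= deadline:
                raise TimeoutError(f"epoch {epoch} not generated in time")
            time.sleep(self.tick)
        if pregen_next:
            self.pregen(epoch + 1)  # corresponds to pre-generating epoch + 1:0
        return order[sample:]
```

In this sketch, calling `pregen(epoch)` early plays the role of the init-time option, and `get(..., pregen_next=True)` plays the role of `pregen_next_epoch`. Only one epoch ahead is ever generated, which reflects the RAM/scale concern about not generating every epoch up front.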