At a high level, my training setup works like this:
- Run training, keep some fixed epochs (either via the default predefined pattern, or custom), and the N best epochs according to the train/dev scores.
- Run recog (or translation or whatever inference) on fixed epochs + M best epochs on some other dev set (e.g. Hub500 for Switchboard).
- Select the best epoch from the recog results.
- Run recog for all relevant eval sets on the selected best epoch.
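The selection logic in these steps can be sketched in plain Python. This is only an illustration: the function names, the example scores, and the assumption that lower scores are better (e.g. dev loss, WER) are all made up here and are not taken from the existing code.

```python
from typing import Dict, Iterable, List, Set


def n_best_epochs(scores: Dict[int, float], n: int) -> List[int]:
    """Return the n epochs with the lowest score (assumption: lower is better)."""
    # Ties are broken by epoch number so the result is deterministic.
    return sorted(scores, key=lambda ep: (scores[ep], ep))[:n]


def epochs_for_recog(
    fixed_epochs: Iterable[int], train_dev_scores: Dict[int, float], m: int
) -> List[int]:
    """Union of the fixed epochs and the m best epochs; overlaps count once."""
    selected: Set[int] = set(fixed_epochs) | set(n_best_epochs(train_dev_scores, m))
    return sorted(selected)


def select_best_epoch(recog_scores: Dict[int, float]) -> int:
    """Pick the epoch with the lowest recog score on the dev set."""
    return min(recog_scores, key=lambda ep: (recog_scores[ep], ep))


# Hypothetical train/dev scores per kept epoch.
dev_scores = {20: 1.30, 40: 1.12, 60: 1.05, 80: 1.01, 100: 1.03}

# Fixed epochs 40, 80, 100 plus the 2 best epochs (80 and 100 here).
# Both best epochs are already fixed epochs, so only 3 recogs are needed.
recog_epochs = epochs_for_recog([40, 80, 100], dev_scores, m=2)

# Hypothetical recog results (e.g. WER) for those epochs.
recog_scores = {40: 14.2, 80: 13.1, 100: 13.4}
best_epoch = select_best_epoch(recog_scores)
```

The final step would then run recog on all eval sets for `best_epoch` only.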
I want the recog on the fixed epochs to run as soon as those epochs are ready. I do that via Job.update. The recog on the other epochs needs the final learning-rate file with the scores, so it depends on that file. This is also done via Job.update, which dynamically adds the recogs. Note that the number of epochs on which recog is performed is variable, because the fixed and best sets can overlap.
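To make the incremental behaviour concrete, here is a minimal sketch of the bookkeeping such an update step has to do. It deliberately does not use the real Job.update API: `RecogScheduler` and its `update` method are hypothetical stand-ins, and it assumes that lower scores are better and that the best epochs have kept checkpoints.

```python
from typing import Dict, Iterable, List, Optional, Set


class RecogScheduler:
    """Hypothetical stand-in for what an update-style hook would track."""

    def __init__(self, fixed_epochs: Iterable[int], m_best: int):
        self.fixed_epochs: Set[int] = set(fixed_epochs)
        self.m_best = m_best
        self.scheduled: Set[int] = set()  # epochs that already have a recog

    def update(
        self,
        finished_epochs: Iterable[int],
        final_scores: Optional[Dict[int, float]] = None,
    ) -> List[int]:
        """Return the epochs for which a recog should newly be started."""
        # Fixed epochs can be scheduled as soon as their checkpoint exists.
        wanted = self.fixed_epochs & set(finished_epochs)
        # The best epochs are only known once the final scores are available.
        if final_scores is not None:
            ranked = sorted(final_scores, key=lambda ep: (final_scores[ep], ep))
            wanted |= set(ranked[: self.m_best])
        new = sorted(wanted - self.scheduled)
        self.scheduled.update(new)
        return new


sched = RecogScheduler(fixed_epochs=[40, 80], m_best=2)

# Training is still running: only fixed epoch 40 is ready.
first = sched.update(finished_epochs=[20, 40])

# Training finished: the remaining fixed epoch and the 2 best epochs are added.
scores = {20: 1.30, 40: 1.12, 60: 1.05, 80: 1.01, 100: 1.03}
second = sched.update(finished_epochs=[20, 40, 60, 80, 100], final_scores=scores)
```

Calling `update` repeatedly is safe in this sketch because already scheduled epochs are skipped, which also covers the overlap between fixed and best epochs.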
I assume this is a fairly reasonable and common pipeline, and that you are probably doing the same or something similar.
I think it would be good to have some common pipeline or helper code for this, rather than everyone having their own custom solution.
So I want to discuss this here. We can implement something new, or use some existing code. For example, I have already implemented exactly this. See my GetBestRecogTrainExp job, the recog_training_exp function, and related code.