- Clean up the `evaluation.ipynb` notebook so it is essentially plug and play:
  - fix the issue where GT and EST sources get mixed up/swapped
  - make it easy for people to evaluate their own results
- Refactor `plot_runs.py`:
  - store data automatically while experiments are running
  - make no hard-coded assumptions (e.g. the line that assumes validation episodes run every 5 episodes)
- Anything else that makes it easier to evaluate the trained model