## To-do

- [ ] We don't want to duplicate data.
- [ ] The "truth of data" should be in our GitHub repo.
- [ ] Implement a real dataset loader (with a config and version number) that downloads the dataset from GitHub and preprocesses it.
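
A minimal sketch of what such a loader could look like, not a finished implementation. It assumes the dataset is a single CSV file in the repo and that the "version number" maps to a git tag or commit. The names `DatasetConfig`, `fetch_text`, `preprocess`, and `load_dataset`, as well as the owner, repo, and path values in the usage note, are hypothetical placeholders.

```python
import csv
import io
import urllib.request
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class DatasetConfig:
    """Hypothetical config: pins the dataset to one file at one git ref."""
    owner: str    # GitHub user or organization (placeholder)
    repo: str     # repository that holds the data (placeholder)
    version: str  # git tag or commit hash, used as the dataset version
    path: str     # path of the CSV file inside the repo (assumed format)

    @property
    def url(self) -> str:
        # Raw file URL on GitHub for the pinned version.
        return (
            "https://raw.githubusercontent.com/"
            f"{self.owner}/{self.repo}/{self.version}/{self.path}"
        )


def fetch_text(url: str) -> str:
    """Download a text file and decode it as UTF-8."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def preprocess(raw_csv: str) -> List[Dict[str, str]]:
    """Parse CSV text, strip whitespace, and drop rows with no values."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        cleaned = {
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key is not None  # ignore overflow columns without a header
        }
        if any(cleaned.values()):
            rows.append(cleaned)
    return rows


def load_dataset(
    config: DatasetConfig,
    fetch: Callable[[str], str] = fetch_text,
) -> List[Dict[str, str]]:
    """Fetch the pinned file from GitHub and return preprocessed rows."""
    return preprocess(fetch(config.url))
```

Usage would be along the lines of `load_dataset(DatasetConfig("example-org", "example-data", "v1.0.0", "data/train.csv"))`. Because the data is always fetched from the repo at a pinned version, no second copy needs to be checked in elsewhere, and the `fetch` parameter allows the download step to be replaced in tests.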