Add the dataset as a Hugging Face dataset.

## To-do
- [ ] avoid duplicating the data between our GitHub repo and Hugging Face
- [ ] the single source of truth for the data should be our GitHub repo
- [ ] implement a real dataset loader (with a config and version number) that downloads the dataset from GitHub and preprocesses it
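
A rough sketch of what the last item could look like, using only the standard library so that the data stays on GitHub and is fetched at load time. Everything specific here is an assumption for illustration: the `OWNER`/`REPO` placeholders, the `data/dataset.csv` path, the CSV format, the `v<version>` tag convention, and the names `DatasetConfig`, `preprocess`, and `load_records`.

```python
import csv
import io
import urllib.request
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetConfig:
    # All defaults are hypothetical placeholders, not the real repo layout.
    owner: str = "OWNER"
    repo: str = "REPO"
    path: str = "data/dataset.csv"
    version: str = "1.0.0"  # assumed to correspond to a Git tag "v1.0.0"

    @property
    def url(self) -> str:
        # Pinning the URL to a tag ties each dataset version to one Git revision.
        return (
            f"https://raw.githubusercontent.com/{self.owner}/{self.repo}/"
            f"v{self.version}/{self.path}"
        )


def preprocess(raw_text: str) -> list[dict[str, str]]:
    """Parse CSV text, strip whitespace, and drop rows with no content."""
    reader = csv.DictReader(io.StringIO(raw_text))
    records = []
    for row in reader:
        # Skip overflow fields (key None) and treat missing values as empty.
        cleaned = {
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key is not None
        }
        if any(cleaned.values()):
            records.append(cleaned)
    return records


def load_records(config: DatasetConfig) -> list[dict[str, str]]:
    """Download the file for the configured version and preprocess it."""
    with urllib.request.urlopen(config.url) as response:
        return preprocess(response.read().decode("utf-8"))
```

The resulting list of dicts could then be handed to something like `datasets.Dataset.from_list(records)`, so no copy of the data needs to be committed outside the GitHub repo. Whether this should instead be wrapped in a proper Hugging Face builder class is left open here.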