In terms of repository structure, I think it would be beneficial to split each data source into separate files. The idea would be to create a standardized "recipe" format that includes all the info about the dataset (e.g. where to download it, BibTeX citation, name of the cleaning script, date updated), plus a cleaning script that does all the processing we need.
I use something like that locally: a YAML file that specifies all the info, and an accompanying Python script that I use for cleaning.
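To make the recipe idea concrete, here is a minimal sketch of what the parsed metadata could look like on the Python side. The field names (`name`, `url`, `bibtex`, `cleaning_script`, `updated`) are my assumptions based on the list above, not an agreed format, and the input is taken to be the mapping a YAML parser would return:

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass(frozen=True)
class Recipe:
    """Metadata for one data source (all field names are hypothetical)."""
    name: str
    url: str              # where to download the raw data
    bibtex: str           # citation entry for the dataset
    cleaning_script: str  # file name of the accompanying cleaning script
    updated: date         # date the recipe was last updated

    @classmethod
    def from_mapping(cls, raw: dict) -> "Recipe":
        """Build a Recipe from a parsed mapping, rejecting missing keys."""
        required = {f.name for f in fields(cls)}
        missing = required - set(raw)
        if missing:
            raise ValueError(f"recipe is missing fields: {sorted(missing)}")
        data = {key: raw[key] for key in required}
        # Accept either a date object or an ISO string such as "2014-01-15".
        if isinstance(data["updated"], str):
            data["updated"] = date.fromisoformat(data["updated"])
        return cls(**data)


# Placeholder values only; this is not a real dataset.
example = Recipe.from_mapping({
    "name": "example_source",
    "url": "https://example.org/data.csv",
    "bibtex": "@misc{example, title={Example Dataset}}",
    "cleaning_script": "example_source.py",
    "updated": "2014-01-15",
})
```

Validating at load time means a contributed recipe with a missing field fails immediately, rather than halfway through a download.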
This makes user contributions very easy. Contributors just copy and paste an existing "recipe", adapt it, and include an R script that does the cleaning. The only thing psData has to do is provide a proper API to parse the recipe, download the data, and run the cleaning script.
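A rough sketch of that parse / download / clean flow, written in Python because that is what I use locally (psData itself would do the equivalent in R). Everything here is hypothetical: the `CLEANERS` registry, the `cleaner` decorator, `get_data`, and the recipe keys are illustrative names, not an existing psData API:

```python
import csv
import io
from typing import Callable, Dict
from urllib.request import urlopen

# Hypothetical registry: recipe "cleaning_script" name -> cleaning function.
CLEANERS: Dict[str, Callable[[bytes], list]] = {}


def cleaner(name: str):
    """Register a cleaning function under the name a recipe refers to."""
    def register(func):
        CLEANERS[name] = func
        return func
    return register


def fetch_url(url: str) -> bytes:
    """Default downloader: return the raw bytes behind the URL."""
    with urlopen(url) as response:
        return response.read()


def get_data(recipe: dict, fetch: Callable[[str], bytes] = fetch_url) -> list:
    """Check the recipe, download the raw data, and run its cleaning step."""
    for key in ("url", "cleaning_script"):
        if key not in recipe:
            raise KeyError(f"recipe is missing '{key}'")
    clean = CLEANERS[recipe["cleaning_script"]]
    return clean(fetch(recipe["url"]))


# What a contributor would add: one cleaning function per data source.
@cleaner("example_source")
def clean_example(raw: bytes) -> list:
    """Parse CSV bytes and drop rows with an empty 'value' column."""
    rows = csv.DictReader(io.StringIO(raw.decode("utf-8")))
    return [dict(row) for row in rows if row["value"] != ""]
```

The `fetch` argument is there so the pipeline can be exercised without network access, e.g. `get_data(recipe, fetch=lambda url: b"country,value\nA,1\n")`.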
Think of something like Homebrew, the package installer for Mac, and its library of "formulas":
https://github.com/Homebrew/homebrew/tree/master/Library/Formula