## Context

We're starting to generate more sources, but we don't update our training data.

## Requirements

- [x] We need a script which we can automate to update our training data, manually for now, just before we retrain (which is infrequent).
  - [x] check for newly labeled stuff from our source collector app
    - [x] We want to grab not only items labeled as "relevant" and in our db, but also items labeled _not_ relevant.
      - https://github.com/Police-Data-Accessibility-Project/data-source-identification/issues/324
  - [ ] optional: apply [keyword extraction](https://github.com/Police-Data-Accessibility-Project/data-source-identification/pull/58#issuecomment-2123699871) to labeled data
  - [x] #152
  - [x] update the Hugging Face training-urls dataset, with batch ID
    - #141
    - place it there as raw data; we can transform it into more specific datasets as needed
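
As a rough illustration of the requirements above, here is a minimal sketch of the "collect newly labeled items, keep both labels, attach the batch ID" step. It is not the project's actual script: `LabeledUrl`, `build_raw_rows`, and the field names `url`, `label`, and `batch_id` are hypothetical, and it deliberately does not call the source collector app or the Hugging Face API. It only shows how raw rows might be assembled before upload.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class LabeledUrl:
    """One labeled item as it might come from the source collector (hypothetical shape)."""
    url: str
    relevant: bool
    batch_id: int


def build_raw_rows(labeled: Iterable[LabeledUrl], already_exported: Iterable[str]) -> list[dict]:
    """Return rows for newly labeled URLs, keeping relevant and not-relevant items alike.

    URLs that were already exported (or that repeat within this run) are skipped,
    so rerunning the script before each retrain only adds new rows.
    """
    seen = set(already_exported)
    rows = []
    for item in labeled:
        if item.url in seen:
            continue
        seen.add(item.url)
        rows.append(
            {
                "url": item.url,
                # Keep the negative label too; it is training signal, not noise.
                "label": "relevant" if item.relevant else "not relevant",
                "batch_id": item.batch_id,
            }
        )
    return rows
```

Keeping the rows raw (URL, label, batch ID and nothing derived) matches the intent of storing raw data first and transforming it into more specific datasets later.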