Current State
BacDive data is currently fetched from a Google Drive file (see PR #349). The download_bacdive.py utility script exists (PR #273, thanks @realmarcin) and was updated to use .env credentials (PR #314), but it's not integrated into the standard kg download workflow.
Proposal
Integrate BacDive API fetching into kg download, similar to how MediaDive bulk download works (see _post_download_mediadive_bulk() in download.py).
This would make the build fully reproducible from source for users with BacDive API credentials.
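A minimal sketch of what a credential-gated post-download step could look like. Everything here is an assumption for illustration: the function names, the environment variable names (BACDIVE_USERNAME, BACDIVE_PASSWORD), and the injected fetch_strains callable are hypothetical and do not reflect the existing download.py or download_bacdive.py code.

```python
import json
import os
from pathlib import Path
from typing import Callable, Iterable, Mapping, Optional, Tuple


def load_bacdive_credentials(
    env: Mapping[str, str] = os.environ,
) -> Optional[Tuple[str, str]]:
    """Return (user, password) if both variables are set, else None.

    The variable names are assumptions, not the project's actual .env keys.
    """
    user = env.get("BACDIVE_USERNAME")
    password = env.get("BACDIVE_PASSWORD")
    if user and password:
        return user, password
    return None


def post_download_bacdive(
    output_path: Path,
    fetch_strains: Callable[[str, str], Iterable[dict]],
    env: Mapping[str, str] = os.environ,
) -> bool:
    """Hypothetical hook: write strain records fetched via the API.

    Returns True if the file was written from the API, or False if
    credentials are missing and the step was skipped.
    """
    creds = load_bacdive_credentials(env)
    if creds is None:
        return False
    strains = list(fetch_strains(*creds))
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(json.dumps(strains))
    return True
```

Returning False instead of raising lets a caller keep whatever fallback source it already uses when no credentials are configured. Whether that is the desired behavior is a design choice for this issue.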
Notes
download_bacdive.py scans 200k IDs sequentially (~97k actually exist) - may need optimization
Related: [...] kg download but are not specified in pyproject.toml #333, Is DSMZ "giving us" the bacdive_strains.json file, or are we scraping it with https://pypi.org/project/bacdive/ ? #335
Acceptance Criteria
kg download can fetch BacDive data via API (with credentials)
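On the optimization note about the sequential 200k-ID scan, one possible direction is to split the ID range into batches and fetch the batches concurrently. The sketch below is illustrative only: chunked_ids, scan_ids, and the injected fetch_batch callable are hypothetical names. The batch size and worker count are guesses that would need checking against the BacDive API's actual limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterator, List


def chunked_ids(start: int, stop: int, size: int) -> Iterator[List[int]]:
    """Yield consecutive ID batches covering the half-open range [start, stop)."""
    for lo in range(start, stop, size):
        yield list(range(lo, min(lo + size, stop)))


def scan_ids(
    fetch_batch: Callable[[List[int]], List[Dict]],
    start: int = 1,
    stop: int = 200_001,
    batch_size: int = 100,  # assumed value, not a documented API limit
    workers: int = 4,       # assumed value; keep low to respect rate limits
) -> List[Dict]:
    """Fetch all batches with a thread pool and concatenate the results.

    fetch_batch is expected to return records only for IDs that exist,
    so missing IDs are simply absent from the output.
    """
    records: List[Dict] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves input order, so results stay in ID order.
        for batch_records in pool.map(
            fetch_batch, chunked_ids(start, stop, batch_size)
        ):
            records.extend(batch_records)
    return records
```

Because the fetcher is passed in, the batching logic can be tested without network access or credentials.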