
Move datasets off Git LFS to Hugging Face mirror (fixes #60) #62

Open
Ruiying-Ma wants to merge 1 commit into main from fix/datasets-off-lfs-hf-mirror


@Ruiying-Ma

Collaborator

Summary

Fixes #60. The repository's Git LFS budget is exhausted, so on a fresh clone the dataset files come down as
pointer stubs and 10 of 17 datasets fail to load. This PR removes the dependency on Git LFS entirely by
mirroring all dataset files on the Hugging Face Hub and fetching them via download.sh.

Changes

  • Mirror: all 36 dataset files (former LFS files + the patents DB previously on Google Drive) uploaded to
    ruiyingm/DataAgentBench-data, preserving
    repo-relative paths.
  • download.sh: rewritten to be manifest-driven — downloads every file from the HF mirror, verifies each
    against a sha256 checksum, and skips files already present and intact (re-runnable; VERIFY_ALL=1 re-hashes
    everything).
  • dataset_manifest.tsv: new source of truth listing each file's path, sha256, and size.
  • .gitattributes: removed all filter=lfs rules so these paths are no longer LFS-tracked.
  • .gitignore: the dataset files are now ignored by Git, since they are download-only.
  • upload_datasets_to_hf.py: maintainer tool to (re)publish the dataset files to HF.
  • README.md: updated clone/setup instructions to use bash download.sh instead of git lfs.
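
The manifest-driven flow described in the list above could be sketched in Python roughly as follows. This is not the contents of download.sh (a bash script that is not shown here). The tab-separated column order (path, sha256, size), the function names, the size-only fast path, and the resolve-URL layout are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# Assumption: the mirror is reachable through the Hub's "resolve" URL layout.
HF_BASE = "https://huggingface.co/datasets/ruiyingm/DataAgentBench-data/resolve/main"


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large databases do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(text: str):
    """Yield (path, sha256, size) rows from TSV text (column order assumed)."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        rel_path, digest, size = line.split("\t")
        yield rel_path, digest.lower(), int(size)


def needs_download(root: Path, rel_path: str, digest: str, size: int,
                   verify_all: bool = False) -> bool:
    """Return True when the local copy is missing, truncated, or corrupt."""
    local = root / rel_path
    if not local.is_file() or local.stat().st_size != size:
        return True
    # Hypothetical fast path: trust a size match unless a full re-hash
    # is requested (mirroring the VERIFY_ALL=1 switch in the description).
    if not verify_all:
        return False
    return sha256_of(local) != digest


def plan(root: Path, manifest_text: str, verify_all: bool = False):
    """List (url, destination) pairs for every file that must be fetched."""
    return [
        (f"{HF_BASE}/{rel_path}", root / rel_path)
        for rel_path, digest, size in parse_manifest(manifest_text)
        if needs_download(root, rel_path, digest, size, verify_all)
    ]
```

Keeping the decision logic (`needs_download`) separate from the actual transfer is what makes a script like this safe to re-run: a second invocation produces an empty plan when everything is already present.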

Verification

  • All 35 former-LFS files' sha256 hashes match their original Git LFS OIDs (byte-for-byte identical).
  • Round-trip confirmed: uploaded to HF, downloaded back, checksum verified.
  • Downloads need no token — the HF dataset is public, and egress is free (so this can't hit a bandwidth budget
    like LFS did).
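
The OID comparison in the first bullet works because a Git LFS pointer records the sha256 of the real file's content. A minimal sketch of such a check, assuming the standard pointer layout (`version`, `oid sha256:<hex>`, `size <bytes>`); the helper names are hypothetical and this is not the script used for the PR:

```python
import hashlib


def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into {'oid': hex digest, 'size': int}."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    if not fields.get("version", "").startswith("https://git-lfs.github.com/spec/"):
        raise ValueError("not a Git LFS pointer")
    algo, _, hex_digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    return {"oid": hex_digest, "size": int(fields["size"])}


def matches_pointer(data: bytes, pointer_text: str) -> bool:
    """True when data has the size and sha256 recorded in the pointer."""
    pointer = parse_lfs_pointer(pointer_text)
    return (len(data) == pointer["size"]
            and hashlib.sha256(data).hexdigest() == pointer["oid"])
```

The same parser doubles as a stub detector on a fresh clone: a dataset file that parses as a pointer is a stub rather than real data.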

Notes

  • History is left untouched; old commits still reference the LFS objects (harmless — fresh clones no longer
    fetch them).
  • The mirror currently lives under a personal HF namespace (ruiyingm) and can be transferred to an org later
    without breaking links.


Development

Successfully merging this pull request may close these issues.

Git LFS budget exceeded — 10 dataset files unfetchable, blocking 10 of 17 datasets
