Git LFS budget exceeded — 10 dataset files unfetchable, blocking 10 of 17 datasets #60

Description

@Jarus77

Summary

git lfs pull on a fresh clone fails for 10 LFS-tracked dataset files. The LFS endpoint returns a repository-wide budget error, so the files stay on disk as ~130-byte pointer stubs instead of real content. This blocks full loading of 10 of the 17 datasets: every query that needs the missing PostgreSQL/MongoDB half of a dataset is unanswerable, which silently caps the achievable score for every agent.

This is an account-side (repository owner) budget, so it affects every cloner, not a single user.

batch response: This repository exceeded its LFS budget. The account responsible
for the budget should increase it to restore access.
Failed to fetch some objects from 'https://github.com/ucbepic/DataAgentBench.git/info/lfs'

Reproduction

git lfs install
git clone https://github.com/ucbepic/DataAgentBench.git
cd DataAgentBench
git lfs pull         # fails with the budget error above
git lfs ls-files     # the 10 files below show '-' (pointer) instead of '*' (downloaded)

Environment: git 2.54.0, git-lfs 3.7.1, macOS arm64. Confirmed still failing as of this report: a single-file retry (git lfs pull --include=query_cve/query_dataset/kev.sql) returns the same budget error, and the file remains a 132-byte pointer stub.
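For anyone who wants to confirm which files in a local clone are still pointer stubs without relying on git lfs ls-files, here is a minimal sketch. It relies on the Git LFS pointer format (a small text file whose first line is the spec version, followed by oid and size lines). The function names read_lfs_pointer and find_pointer_stubs are hypothetical helpers, not part of the repository.

```python
from pathlib import Path
from typing import Optional

# First line of every Git LFS pointer file, per the Git LFS pointer spec.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def read_lfs_pointer(path: Path) -> Optional[dict]:
    """Return {'oid': ..., 'size': ...} if path is an LFS pointer stub, else None."""
    with open(path, "rb") as fh:
        head = fh.read(1024)  # pointer stubs are tiny, so 1 KiB is plenty
    if not head.startswith(LFS_POINTER_PREFIX):
        return None
    fields = {}
    for line in head.decode("utf-8", errors="replace").splitlines():
        key, _, value = line.partition(" ")
        if key and value:
            fields[key] = value
    # A well-formed pointer carries both an oid and a decimal size.
    if "oid" not in fields or not fields.get("size", "").isdigit():
        return None
    return {"oid": fields["oid"], "size": int(fields["size"])}


def find_pointer_stubs(root: Path) -> dict:
    """Map each pointer stub under root (relative POSIX path) to its parsed pointer."""
    stubs = {}
    for path in sorted(root.rglob("*")):
        if ".git" in path.parts or not path.is_file():
            continue
        pointer = read_lfs_pointer(path)
        if pointer is not None:
            stubs[path.relative_to(root).as_posix()] = pointer
    return stubs
```

Calling find_pointer_stubs(Path("DataAgentBench")) on an affected clone should list the unfetched files. The size field in each pointer is the size of the real object, so it also shows how much data is missing.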

Affected files (pointer stubs after git lfs pull)

| File | LFS OID (prefix) | Real size |
| --- | --- | --- |
| query_imdb/query_dataset/movies.sql | 9ef5452628ec | 1.62 GB |
| query_krama/query_dataset/domain_docs/domain_docs_db/files.bson | 3bab163bfef5 | 529 MB |
| query_PATENTS/query_dataset/patent_CPCDefinition.sql | f888382228bf | 135 MB |
| query_crmarenapro/query_dataset/support.sql | 5248cd64cddf | 8.9 MB |
| query_PANCANCER_ATLAS/query_dataset/pancancer_clinical.sql | ce57356f7a0f | 7.6 MB |
| query_usaspending/query_dataset/contracts.sql | 7fdde155670e | 4.8 MB |
| query_cve/query_dataset/kev.sql | 76f15a345a8d | 1.0 MB |
| query_bookreview/query_dataset/books_info.sql | 80acb3ece574 | 649 KB |
| query_civic_unstructured/query_dataset/civic_docs_dump/civic_db/civic_docs.bson | 57fd5ff32b81 | 238 KB |
| query_googlelocal/query_dataset/business_description.sql | cac36db1c60a | 38 KB |

The 5 GB query_PATENTS/query_dataset/patent_publication.db is unaffected — it downloads fine via download.sh (Google Drive). download.sh covers only that one file; there is no fallback for the 10 above.

Impact

Each affected dataset is federated across multiple engines (SQLite/DuckDB + PostgreSQL/MongoDB). The SQLite/DuckDB halves are present and load fine; the missing files are eight PostgreSQL .sql dumps and two MongoDB BSON dumps. Any query that reads the missing half cannot be answered, so those queries score 0 regardless of agent quality.

Datasets with at least one missing constituent database (10 of 17): imdb, krama, PATENTS, crmarenapro, PANCANCER_ATLAS, usaspending, cve, bookreview, civic_unstructured, googlelocal.

The remaining 7 (DEPS_DEV_V1, GITHUB_REPOS, stockindex, stockmarket, yelp, agnews, music_brainz_20k) load cleanly.
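The dataset list above can be cross-checked mechanically against the affected file paths. The sketch below assumes that a dataset's name is its top-level directory with the query_ prefix removed, which matches the names used in this report; dataset_of and affected_datasets are hypothetical helper names.

```python
from pathlib import PurePosixPath

# The 10 paths listed under "Affected files" in this report.
AFFECTED_FILES = [
    "query_imdb/query_dataset/movies.sql",
    "query_krama/query_dataset/domain_docs/domain_docs_db/files.bson",
    "query_PATENTS/query_dataset/patent_CPCDefinition.sql",
    "query_crmarenapro/query_dataset/support.sql",
    "query_PANCANCER_ATLAS/query_dataset/pancancer_clinical.sql",
    "query_usaspending/query_dataset/contracts.sql",
    "query_cve/query_dataset/kev.sql",
    "query_bookreview/query_dataset/books_info.sql",
    "query_civic_unstructured/query_dataset/civic_docs_dump/civic_db/civic_docs.bson",
    "query_googlelocal/query_dataset/business_description.sql",
]


def dataset_of(path: str) -> str:
    """Assumed convention: dataset name = top-level directory minus 'query_'."""
    top = PurePosixPath(path).parts[0]
    return top[len("query_"):] if top.startswith("query_") else top


def affected_datasets(paths):
    """Unique dataset names, in first-seen order."""
    return list(dict.fromkeys(dataset_of(p) for p in paths))


if __name__ == "__main__":
    names = affected_datasets(AFFECTED_FILES)
    print(len(names), "affected datasets:", ", ".join(names))
```

Each affected file maps to a distinct dataset, which is why 10 missing files block exactly 10 datasets.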

Note on PATENTS / PR #59

PR #59 regenerated the PATENTS ground truths "from the released data," but 2 of the 3 queries still require the missing patent_CPCDefinition.sql:

  • The present patent_publication.db exposes CPC codes (publicationinfo.cpc) but no titles.
  • The CPC code→title mapping (cpc_definition.titleFull) lives only in patent_CPCDefinition.sql, and query2 (titleFull, cpc_group, best_year) and query3 (CPC subclass titles) both emit those titles.
  • query1 (codes only) is answerable from the present SQLite alone.

So even post-PR #59, a budget-blocked cloner cannot fully evaluate PATENTS query2/query3. The other 9 datasets above have not had their ground truths audited the same way, so their missing-data queries are likely similarly affected but undocumented.

Request

Either of the following would restore the benchmark for external users:

  1. Increase / restore the repository's Git LFS budget (or add a data pack) so git lfs pull works again, or
  2. Provide an alternate download for the 10 files — e.g. a Google Drive / Hugging Face mirror wired into download.sh, mirroring how patent_publication.db is already handled.
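If option 2 is taken, files fetched from a mirror could be checked against the pointers already committed in the repository, because a Git LFS OID is the SHA-256 of the object's content. A minimal sketch, assuming the full OID and size are read from the pointer stub (this report lists only 12-character prefixes); sha256_of and matches_lfs_oid are hypothetical helpers, not part of download.sh.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-GB dumps do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_lfs_oid(path: Path, oid: str, size: Optional[int] = None) -> bool:
    """True if the file's SHA-256 (and size, when given) match the LFS pointer."""
    expected = oid.split(":", 1)[-1].lower()  # accepts 'sha256:<hex>' or bare hex
    if size is not None and path.stat().st_size != size:
        return False  # cheap check before hashing
    return sha256_of(path) == expected
```

A mirror download that fails this check is either corrupt or a different revision of the dump than the one the ground truths were generated against.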
