Summary
git lfs pull on a fresh clone fails for 10 LFS-tracked dataset files. The LFS endpoint returns a repository-wide budget error, so the files stay on disk as ~130-byte pointer stubs instead of real content. This blocks full loading of 10 of the 17 datasets: every query that needs the missing PostgreSQL/MongoDB half of a dataset is unanswerable, which silently caps achievable scores for all agents.
This is an account-side (repository owner) budget, so it affects every cloner, not a single user.
batch response: This repository exceeded its LFS budget. The account responsible
for the budget should increase it to restore access.
Failed to fetch some objects from 'https://github.com/ucbepic/DataAgentBench.git/info/lfs'
Reproduction
git lfs install
git clone https://github.com/ucbepic/DataAgentBench.git
cd DataAgentBench
git lfs pull # fails with the budget error above
git lfs ls-files # the 10 files below show '-' (pointer) instead of '*' (downloaded)
Environment: git 2.54.0, git-lfs 3.7.1, macOS arm64. Confirmed still failing as of this report — a single-file retry (git lfs pull --include=query_cve/query_dataset/kev.sql) returns the same budget error and the file remains 132 bytes.
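To check a clone without relying on the git lfs ls-files markers, the stubs can also be detected directly from file contents. The following is a minimal sketch, assuming the standard Git LFS pointer format (a small text file whose first line is the spec version URL). The helper names is_lfs_pointer and find_pointer_stubs are hypothetical and not part of the repository.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the LFS pointer spec.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"
# The spec requires pointer files to be smaller than 1024 bytes.
MAX_POINTER_BYTES = 1024


def is_lfs_pointer(path: Path) -> bool:
    """Return True if `path` looks like a pointer stub rather than real content."""
    if not path.is_file() or path.stat().st_size >= MAX_POINTER_BYTES:
        return False
    with path.open("rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_pointer_stubs(root: Path) -> list[Path]:
    """List files under `root` (relative paths) that are still pointer stubs."""
    stubs = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if ".git" in rel.parts:  # skip git's own object store
            continue
        if is_lfs_pointer(path):
            stubs.append(rel)
    return stubs
```

Calling find_pointer_stubs(Path("DataAgentBench")) on an affected clone would be expected to list the files in the table below.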
Affected files (pointer stubs after git lfs pull)
| File | LFS OID (prefix) | Real size |
| --- | --- | --- |
| query_imdb/query_dataset/movies.sql | 9ef5452628ec | 1.62 GB |
| query_krama/query_dataset/domain_docs/domain_docs_db/files.bson | 3bab163bfef5 | 529 MB |
| query_PATENTS/query_dataset/patent_CPCDefinition.sql | f888382228bf | 135 MB |
| query_crmarenapro/query_dataset/support.sql | 5248cd64cddf | 8.9 MB |
| query_PANCANCER_ATLAS/query_dataset/pancancer_clinical.sql | ce57356f7a0f | 7.6 MB |
| query_usaspending/query_dataset/contracts.sql | 7fdde155670e | 4.8 MB |
| query_cve/query_dataset/kev.sql | 76f15a345a8d | 1.0 MB |
| query_bookreview/query_dataset/books_info.sql | 80acb3ece574 | 649 KB |
| query_civic_unstructured/query_dataset/civic_docs_dump/civic_db/civic_docs.bson | 57fd5ff32b81 | 238 KB |
| query_googlelocal/query_dataset/business_description.sql | cac36db1c60a | 38 KB |
The 5 GB query_PATENTS/query_dataset/patent_publication.db is unaffected — it downloads fine via download.sh (Google Drive). download.sh covers only that one file; there is no fallback for the 10 above.
Impact
Each affected dataset is federated across multiple engines (SQLite/DuckDB + PostgreSQL/MongoDB). The SQLite/DuckDB halves are present and load fine; the missing files are the PostgreSQL .sql dumps and two MongoDB BSON dumps. Any query that reads the missing half cannot be answered, so those queries score 0 regardless of agent quality.
Datasets with at least one missing constituent database (10 of 17): imdb, krama, PATENTS, crmarenapro, PANCANCER_ATLAS, usaspending, cve, bookreview, civic_unstructured, googlelocal.
The remaining 7 (DEPS_DEV_V1, GITHUB_REPOS, stockindex, stockmarket, yelp, agnews, music_brainz_20k) load cleanly.
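The dataset list above follows from the file paths: each affected file sits under a top-level query_<name>/ directory. As a rough sketch, assuming that directory convention holds for every dataset (inferred from the paths in this report, not confirmed against the benchmark loader), stub paths can be grouped into blocked datasets like this. The function names are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List, Optional

# Assumption: one top-level "query_<name>/" directory per dataset.
DATASET_DIR_PREFIX = "query_"


def dataset_of(path: str) -> Optional[str]:
    """Return the dataset name for a repo-relative path, or None if it has none."""
    parts = PurePosixPath(path).parts
    if not parts or not parts[0].startswith(DATASET_DIR_PREFIX):
        return None
    return parts[0][len(DATASET_DIR_PREFIX):]


def blocked_datasets(stub_paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group pointer-stub paths by the dataset they belong to."""
    blocked = defaultdict(list)
    for path in stub_paths:
        name = dataset_of(path)
        if name is not None:
            blocked[name].append(path)
    return dict(blocked)
```

A harness could use such a mapping to skip or flag queries for blocked datasets instead of silently scoring them 0.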
Note on PATENTS / PR #59
PR #59 regenerated the PATENTS ground truths "from the released data," but 2 of the 3 queries still require the missing patent_CPCDefinition.sql:
- The present patent_publication.db exposes CPC codes (publicationinfo.cpc) but no titles.
- The CPC code→title mapping (cpc_definition.titleFull) lives only in patent_CPCDefinition.sql, and query2 (titleFull, cpc_group, best_year) and query3 (CPC subclass titles) both emit those titles.
- query1 (codes only) is answerable from the present SQLite alone.
So even post-PR #59, a budget-blocked cloner cannot fully evaluate PATENTS query2/query3. The other 9 datasets above have not had their ground truths audited the same way, so queries that depend on their missing files are likely affected in the same way, but this is not yet documented.
Request
Either of the following would restore the benchmark for external users:
- Increase / restore the repository's Git LFS budget (or add a data pack) so git lfs pull works again, or
- Provide an alternate download for the 10 files, e.g. a Google Drive / Hugging Face mirror wired into download.sh, mirroring how patent_publication.db is already handled.
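If a mirror is provided, the pointer stubs already in the repository give a way to check that a mirrored file is the exact LFS object: a pointer records the SHA-256 of the real content and its size in bytes. Below is a minimal sketch of that check, assuming the pointer text is read from the committed blob (for example with git cat-file -p HEAD:<path>). The function names are hypothetical, and the 12-character OID prefixes in the table above are not sufficient on their own; the full oid comes from the pointer.

```python
import hashlib
from pathlib import Path
from typing import Tuple


def parse_lfs_pointer(text: str) -> Tuple[str, int]:
    """Extract (sha256 hex digest, size in bytes) from Git LFS pointer text."""
    # Each pointer line has the form "<key> <value>".
    fields = dict(line.split(" ", 1) for line in text.splitlines() if " " in line)
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid type: {algo}")
    return digest, int(fields["size"])


def matches_pointer(data_file: Path, pointer_text: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if `data_file` has the size and SHA-256 recorded in the pointer."""
    expected_digest, expected_size = parse_lfs_pointer(pointer_text)
    if data_file.stat().st_size != expected_size:
        return False  # cheap check first; avoids hashing a clearly wrong file
    sha = hashlib.sha256()
    with data_file.open("rb") as fh:
        # Read in chunks so multi-gigabyte dumps are not loaded into memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest() == expected_digest
```

A fallback in download.sh could run this kind of check after each mirror download, so that a truncated or outdated mirror copy is caught before the database is loaded.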