Git LFS budget exceeded — 10 dataset files unfetchable, blocking 10 of 17 datasets

## Summary

`git lfs pull` on a fresh clone fails for **10 LFS-tracked dataset files**. The LFS endpoint returns a repository-wide budget error, so the files stay on disk as ~130-byte pointer stubs instead of real content. This blocks full loading of **10 of the 17 datasets**: every query that needs the missing PostgreSQL/MongoDB half of a dataset is unanswerable, which silently caps achievable scores for all agents.

This is an **account-side (repository owner) budget**, so it affects **every cloner**, not a single user.

```
batch response: This repository exceeded its LFS budget. The account responsible
for the budget should increase it to restore access.
Failed to fetch some objects from 'https://github.com/ucbepic/DataAgentBench.git/info/lfs'
```

## Reproduction

```bash
git lfs install
git clone https://github.com/ucbepic/DataAgentBench.git
cd DataAgentBench
git lfs pull         # fails with the budget error above
git lfs ls-files     # the 10 files below show '-' (pointer) instead of '*' (downloaded)
```

Environment: `git 2.54.0`, `git-lfs 3.7.1`, macOS arm64. Confirmed still failing as of this report — a single-file retry (`git lfs pull --include=query_cve/query_dataset/kev.sql`) returns the same budget error and the file remains 132 bytes.
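
To confirm which files are still stubs without reading the `git lfs ls-files` markers by eye, a small check like the following can help. It is a minimal sketch, not part of the repository: the function names `is_lfs_pointer` and `list_pointer_stubs` are hypothetical, and the 1024-byte cutoff is an assumption chosen to sit comfortably above the ~130-byte pointers seen here.

```bash
#!/usr/bin/env bash
# Sketch: find LFS-tracked files that are still pointer stubs in the working tree.

# is_lfs_pointer FILE: succeeds if FILE looks like a Git LFS pointer file.
# A pointer is a tiny text file whose first line names the LFS spec version.
is_lfs_pointer() {
  local file=$1 size first_line
  [ -f "$file" ] || return 1
  # Arithmetic expansion strips the padding some `wc` builds emit.
  size=$(( $(wc -c < "$file") ))
  # Assumed cutoff: real dataset files are far larger than a pointer.
  [ "$size" -lt 1024 ] || return 1
  IFS= read -r first_line < "$file" || true
  [ "$first_line" = "version https://git-lfs.github.com/spec/v1" ]
}

# list_pointer_stubs: print every LFS-tracked path that is still a pointer.
# Must be run from the repository root.
list_pointer_stubs() {
  local path
  git lfs ls-files --name-only | while IFS= read -r path; do
    if is_lfs_pointer "$path"; then
      printf '%s\n' "$path"
    fi
  done
}
```

After sourcing the snippet, running `list_pointer_stubs` from the repository root should print the 10 paths listed in the next section if the budget error is still in effect.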

## Affected files (pointer stubs after `git lfs pull`)

| File | LFS OID (prefix) | Real size |
|---|---|---|
| `query_imdb/query_dataset/movies.sql` | `9ef5452628ec` | 1.62 GB |
| `query_krama/query_dataset/domain_docs/domain_docs_db/files.bson` | `3bab163bfef5` | 529 MB |
| `query_PATENTS/query_dataset/patent_CPCDefinition.sql` | `f888382228bf` | 135 MB |
| `query_crmarenapro/query_dataset/support.sql` | `5248cd64cddf` | 8.9 MB |
| `query_PANCANCER_ATLAS/query_dataset/pancancer_clinical.sql` | `ce57356f7a0f` | 7.6 MB |
| `query_usaspending/query_dataset/contracts.sql` | `7fdde155670e` | 4.8 MB |
| `query_cve/query_dataset/kev.sql` | `76f15a345a8d` | 1.0 MB |
| `query_bookreview/query_dataset/books_info.sql` | `80acb3ece574` | 649 KB |
| `query_civic_unstructured/query_dataset/civic_docs_dump/civic_db/civic_docs.bson` | `57fd5ff32b81` | 238 KB |
| `query_googlelocal/query_dataset/business_description.sql` | `cac36db1c60a` | 38 KB |

The 5 GB `query_PATENTS/query_dataset/patent_publication.db` is unaffected — it downloads fine via `download.sh` (Google Drive). `download.sh` covers **only** that one file; there is no fallback for the 10 above.

## Impact

Each affected dataset is federated across multiple engines (SQLite/DuckDB + PostgreSQL/MongoDB). The SQLite/DuckDB halves are present and load fine; the missing files are the **PostgreSQL `.sql` dumps and two MongoDB BSON dumps**. Any query that reads the missing half cannot be answered, so those queries score 0 regardless of agent quality.

Datasets with at least one missing constituent database (10 of 17): **imdb, krama, PATENTS, crmarenapro, PANCANCER_ATLAS, usaspending, cve, bookreview, civic_unstructured, googlelocal**.

The remaining 7 (DEPS_DEV_V1, GITHUB_REPOS, stockindex, stockmarket, yelp, agnews, music_brainz_20k) load cleanly.
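
The dataset list above can be re-derived from a clone rather than maintained by hand. The sketch below assumes the repository layout shown in the affected-files table (a top-level `query_<dataset>/` directory per dataset); `blocked_datasets` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: turn a list of stub paths into the set of blocked dataset names.

# blocked_datasets: read repo-relative paths on stdin, print unique dataset names.
blocked_datasets() {
  local path top
  while IFS= read -r path; do
    top=${path%%/*}                 # first path component, e.g. query_cve
    case $top in
      query_*) printf '%s\n' "${top#query_}" ;;   # strip the query_ prefix
    esac
  done | sort -u
}
```

One way to feed it is from the `git lfs ls-files` markers, keeping only rows flagged `-` (pointer). This relies on the paths containing no spaces, which holds for the files listed in this issue:

```bash
git lfs ls-files | awk '$2 == "-" { print $3 }' | blocked_datasets
```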

### Note on PATENTS / PR #59

PR #59 regenerated the PATENTS ground truths "from the released data," but 2 of the 3 queries still require the missing `patent_CPCDefinition.sql`:

- The present `patent_publication.db` exposes CPC **codes** (`publicationinfo.cpc`) but no titles.
- The CPC code→title mapping (`cpc_definition.titleFull`) lives only in `patent_CPCDefinition.sql`, and query2 (`titleFull, cpc_group, best_year`) and query3 (CPC subclass titles) both emit those titles.
- query1 (codes only) is answerable from the present SQLite alone.

So even after PR #59, a budget-blocked cloner cannot fully evaluate PATENTS query2/query3. The ground truths for the other 9 datasets above have **not** been audited the same way, so queries that depend on their missing files are likely affected in the same way, but this is undocumented.

## Request

Either of the following would restore the benchmark for external users:

1. **Increase / restore the repository's Git LFS budget** (or add a data pack) so `git lfs pull` works again, or
2. **Provide an alternate download** for the 10 files — e.g. a Google Drive / Hugging Face mirror wired into `download.sh`, mirroring how `patent_publication.db` is already handled.
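
If option 2 is taken, the pointer stubs already in the repository can double as checksums, since a pointer's `oid sha256:` line records the SHA-256 of the real file. The following is a minimal sketch of that check, not an existing part of `download.sh`; the function names are hypothetical, and it assumes the mirrored file is downloaded to a temporary path before replacing the stub.

```bash
#!/usr/bin/env bash
# Sketch: verify a mirrored download against the OID stored in its LFS pointer.

# pointer_oid FILE: print the SHA-256 OID recorded in an LFS pointer file.
pointer_oid() {
  sed -n 's/^oid sha256:\([0-9a-f]\{64\}\)$/\1/p' "$1"
}

# file_sha256 FILE: print FILE's SHA-256 using whichever tool is installed
# (sha256sum on most Linux systems, shasum on macOS).
file_sha256() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d' ' -f1
  else
    shasum -a 256 "$1" | cut -d' ' -f1
  fi
}

# verify_against_pointer POINTER DOWNLOADED
# Returns 0 on a match, 1 on a mismatch, 2 if POINTER has no OID line.
verify_against_pointer() {
  local want got
  want=$(pointer_oid "$1")
  if [ -z "$want" ]; then
    echo "not an LFS pointer: $1" >&2
    return 2
  fi
  got=$(file_sha256 "$2")
  [ "$want" = "$got" ]
}
```

A download step could then replace the stub only after the check passes, for example (the `/tmp/kev.sql` path is illustrative):

```bash
verify_against_pointer query_cve/query_dataset/kev.sql /tmp/kev.sql \
  && mv /tmp/kev.sql query_cve/query_dataset/kev.sql
```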



