fix: gracefully handle missing HF tool-calls dataset in generate-data #27

Open
mvanhorn wants to merge 1 commit into cactus-compute:main from mvanhorn:fix/19-gracefully-handle-missing-hf-tool-calls-dataset-in

@mvanhorn

Summary

needle generate-data defaults to the HF dataset cactus-compute/tool-calls-mix, which currently 404s. The script surfaced the raw Hugging Face exception, which is opaque to a first-run user. This PR catches the 404, logs a clear message naming the dataset, and falls back to a small bundled prompt set so the README's quickstart still works without an extra flag.

Why this matters

This is exactly the reproducer from #19: a fresh checkout, then needle generate-data, then an immediate crash on the HF fetch. The reporter spent some time finding a working --dataset value as a workaround; the failure is far less hostile when the script names the missing dataset and falls back gracefully.

Changes

  • needle/dataset/generate.py - wrap the HF fetch in a try/except, log "Dataset X not found, falling back to bundled prompts" on HTTPError(404), and switch to a small synthetic prompt set defined inline.
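
For reviewers who want the shape of the change without opening the diff, here is a minimal sketch of the catch-and-fall-back pattern described above. It is not the code in needle/dataset/generate.py: the names `load_prompts`, `fetch`, and `BUNDLED_PROMPTS` are hypothetical, the prompts are placeholders, and the standard library's `urllib.error.HTTPError` stands in for whatever exception the Hugging Face client actually raises on a 404.

```python
import logging
from typing import Callable, Iterable, List
from urllib.error import HTTPError

logger = logging.getLogger("needle.generate")

# Hypothetical placeholder for the small inline prompt set;
# the real bundled prompts are not shown in this PR description.
BUNDLED_PROMPTS: List[str] = [
    "What is the weather in Paris today?",
    "Set a timer for ten minutes.",
]


def load_prompts(dataset: str, fetch: Callable[[str], Iterable[str]]) -> List[str]:
    """Return prompts from fetch(dataset), or the bundled set if the dataset 404s.

    `fetch` is an injected loader (an assumption of this sketch) so the
    fallback path can be exercised without network access.
    """
    try:
        return list(fetch(dataset))
    except HTTPError as err:
        if err.code != 404:
            # Only a missing dataset triggers the fallback; other HTTP
            # failures (auth, rate limits) are re-raised unchanged.
            raise
        logger.warning(
            "Dataset %s not found, falling back to bundled prompts", dataset
        )
        return list(BUNDLED_PROMPTS)
```

Re-raising non-404 errors is a design choice of this sketch rather than something the PR text confirms; it keeps authentication or rate-limit failures visible instead of silently swapping in the fallback data.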

Testing

needle generate-data --output /tmp/out.jsonl produces a usable file from the bundled fallback. Passing --dataset with a working dataset path still works.

Fixes #19

The default HF dataset path 404s; the script crashed instead of falling
back to a synthetic prompt set or surfacing the missing-dataset error.
Catch the 404, log a clear message, and fall back to a small bundled
prompt set so the example command from the README still works.

Fixes cactus-compute#19

Development

Successfully merging this pull request may close these issues.

Missing dataset for generate data
