fix(data): Handle missing papers.jsonl file #6

Open
VooDisss wants to merge 1 commit into UKPLab:main from VooDisss:main

This commit fixes a bug where the data processing pipeline would crash if the data/papers.jsonl file was missing or empty.

The PaperLoader class in peerqa/data_loader.py would unconditionally try to read papers.jsonl at initialization, causing a ValueError if the file didn't exist. This would prevent the extract_text_from_pdf.py script from running and creating the file in the first place.

This commit makes the PaperLoader more robust by:

  • Checking if papers.jsonl exists and is not empty before reading it.
  • Initializing an empty DataFrame if the file is missing, allowing the script to proceed.
  • Adding a safeguard to has_paper_id to handle an empty DataFrame.

This ensures that the data processing pipeline can be run from a clean state without errors.
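
The three changes above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the constructor signature, the `df` attribute, and the `paper_id` column name are assumptions made for the example, and the real `PaperLoader` in `peerqa/data_loader.py` may be structured differently.

```python
import os

import pandas as pd


class PaperLoader:
    """Hypothetical sketch of the guarded loader described in this PR."""

    def __init__(self, papers_file):
        self.papers_file = papers_file
        # Read the file only when it exists and has content.
        if os.path.exists(papers_file) and os.path.getsize(papers_file) > 0:
            self.df = pd.read_json(papers_file, lines=True)
        else:
            # Missing or empty file: start from an empty DataFrame so that
            # the extraction script can run and create the file.
            # The "paper_id" column name is an assumption for illustration.
            self.df = pd.DataFrame(columns=["paper_id"])

    def has_paper_id(self, paper_id):
        # Safeguard: an empty DataFrame cannot contain any paper.
        if self.df.empty:
            return False
        return bool((self.df["paper_id"] == paper_id).any())
```

With this shape, constructing the loader against a path that does not exist yet succeeds, and `has_paper_id` returns `False` for every ID until the file has been written.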
