This directory contains all dataset files used in the project, organized into three main subdirectories.
datasets/
├── original/              # Original dataset files
│   ├── bamboogle/
│   ├── hotpotqa/
│   ├── nq/
│   └── strategyqa/
├── prepared/              # Preprocessed datasets
│   ├── threePassages/     # Datasets with 3 passages per question
│   └── fivePassage/       # Datasets with 5 passages per question
└── noise_generated/       # Generated noise passages
    ├── bamboogle/
    ├── hotpotqa/
    ├── nq/
    └── strategyqa/
original/ contains the original dataset files in JSON format. Each dataset (bamboogle, hotpotqa, nq, strategyqa) has train and/or test splits.
prepared/ contains preprocessed datasets with structured passages:
- threePassages/: Each question is associated with exactly 3 passages
- fivePassage/: Each question is associated with 5 passages
These prepared datasets are used for inference tasks.
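The passage counts described above can be sanity-checked before running inference. The following is a minimal sketch, assuming each item carries a passages list as in the item format this README describes; find_wrong_passage_counts is a hypothetical helper, not part of the project, and the sample data is made up for illustration.

```python
def find_wrong_passage_counts(items, expected):
    """Return the ids of items whose passages list does not have `expected` entries."""
    bad = []
    for item in items:
        passages = item.get("passages", [])
        if len(passages) != expected:
            bad.append(item.get("id"))
    return bad


# Illustrative in-memory sample; real items would be loaded from a file under prepared/.
sample = [
    {"id": "q1", "passages": [{"content": "a"}, {"content": "b"}, {"content": "c"}]},
    {"id": "q2", "passages": [{"content": "a"}]},
]
print(find_wrong_passage_counts(sample, expected=3))  # ['q2']
```

An empty result for expected=3 on a threePassages/ file (or expected=5 on a fivePassage/ file) would indicate the file matches the layout described here.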
noise_generated/ contains noise passages created by the noise_generation/ module. Each dataset subdirectory holds passages of the following noise types:
- Counterfactual passages
- Relevant noise passages
- Irrelevant noise passages
- Consistent passages
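To see how many passages of each kind a file holds, the passages can be tallied by their type field. This is only a sketch under two assumptions: that noise files follow the same item format as the other dataset files, and that the type field records the noise kind. The label strings in the sample are invented for illustration and may not match the labels the project actually uses; count_passage_types is a hypothetical helper.

```python
from collections import Counter


def count_passage_types(items):
    """Tally passages across all items by their `type` field."""
    counts = Counter()
    for item in items:
        for passage in item.get("passages", []):
            # Passages without a type are grouped under "unknown".
            counts[passage.get("type", "unknown")] += 1
    return counts


# Made-up sample; the type labels here are placeholders, not the project's real values.
sample_items = [
    {"id": "q1", "passages": [
        {"content": "p1", "type": "counterfactual"},
        {"content": "p2", "type": "irrelevant"},
    ]},
    {"id": "q2", "passages": [
        {"content": "p3", "type": "counterfactual"},
        {"content": "p4"},
    ]},
]
type_counts = count_passage_types(sample_items)
```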
These datasets are used by:
- Noise Generation (../noise_generation/): reads from original/ and writes to noise_generated/
- Inference (../inference/): reads from prepared/ and noise_generated/ for model inference
Each dataset file is a JSON list where each item contains:
- id: Unique identifier
- question: The question text
- answer or gold_answers: Ground truth answer(s)
- passages: List of passage objects with content and type fields
- facts or other dataset-specific fields
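The item format above can be checked with a few lines of Python. The sketch below assumes that id, question, and passages are always present and that exactly one of answer or gold_answers is enough; the README does not say which fields are strictly required, so treat these checks as an assumption. load_items and item_problems are hypothetical names, not functions from the project.

```python
import json

REQUIRED_KEYS = ("id", "question", "passages")
ANSWER_KEYS = ("answer", "gold_answers")


def load_items(path):
    """Load a dataset file and confirm that its top level is a JSON list."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    if not isinstance(items, list):
        raise ValueError(f"{path}: expected a JSON list")
    return items


def item_problems(item):
    """Return human-readable problems for one item; an empty list means it looks fine."""
    problems = [f"missing {key}" for key in REQUIRED_KEYS if key not in item]
    if not any(key in item for key in ANSWER_KEYS):
        problems.append("missing answer or gold_answers")
    for i, passage in enumerate(item.get("passages", [])):
        for field in ("content", "type"):
            if field not in passage:
                problems.append(f"passage {i} missing {field}")
    return problems
```

Running item_problems over every item returned by load_items gives a quick report of files that drift from the documented format, without stopping at the first malformed item.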
- All paths in this directory are relative to the NAACL/ root directory
- Generated noise files can be regenerated using the noise_generation/ module
- Prepared datasets are typically created through preprocessing pipelines
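Because paths are relative to the NAACL/ root, scripts that may run from other working directories can build dataset paths from an explicit root. A minimal sketch, assuming datasets/ sits directly under the root (as the sibling references to ../noise_generation/ and ../inference/ suggest); dataset_path is a hypothetical helper.

```python
from pathlib import Path


def dataset_path(root, *parts):
    """Build a path under <root>/datasets/ from the given sub-path components."""
    return Path(root) / "datasets" / Path(*parts)


# Example: the hotpotqa noise directory, resolved from a root named NAACL.
noise_dir = dataset_path("NAACL", "noise_generated", "hotpotqa")
print(noise_dir.as_posix())  # NAACL/datasets/noise_generated/hotpotqa
```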