We will open-source the dataset and scripts after the review is finished.
- wash_data_scripts: Examples of the data cleaning steps we applied during data collection.
- src: The datasets, organized to show which are used for evaluation and which serve as corpora.
- knowledge_enhancement: This section demonstrates an implementation of RAGSynth.
- assemble: Synthesizes data using the content in components.
- components: A specific implementation of RAGSynth in the form of a pipeline, where each step produces an output that serves as the input for the next step. This includes the synthesis of both single-hop and multi-hop data.
- chunk_by_files: The logic for chunking documents.
- rag: Training scripts and logs.
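
As a rough illustration of the pipeline layout described for components, the sketch below chains steps so that each step's output becomes the next step's input. It is not taken from the repository: `run_pipeline`, `chunk`, and `make_questions` are hypothetical names, and the toy steps only stand in for the real chunking and synthesis logic.

```python
from functools import reduce
from typing import Any, Callable, Iterable

Step = Callable[[Any], Any]


def run_pipeline(steps: Iterable[Step], initial: Any) -> Any:
    """Pass `initial` through each step in order; each output is the next input."""
    return reduce(lambda data, step: step(data), steps, initial)


# Hypothetical steps, for illustration only (not the repository's code).
def chunk(text: str, size: int = 20) -> list:
    """Split text into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def make_questions(chunks: list) -> list:
    """Attach a placeholder question to each chunk."""
    return [
        {"context": c, "question": f"What does passage {i} state?"}
        for i, c in enumerate(chunks)
    ]


text = "RAGSynth synthesizes single-hop and multi-hop data."
result = run_pipeline([chunk, make_questions], text)
```

Keeping each step a plain function with one input and one output makes it easy to inspect or cache the intermediate result of any stage.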