[Question] Missing Data Preprocessing Scripts for SubgraphRAG Framework

Hello Dr. Li, Dr. Miao, and Dr. Li,

First, thank you for your excellent ICLR 2025 paper, "Simple Is Effective: The Roles of Graphs and Large Language Models in Knowledge-Graph-Based Retrieval-Augmented Generation." The SubgraphRAG framework you proposed is very impressive and insightful.

I have been exploring your official GitHub repository to gain a deeper understanding of the experimental setup. The provided code and processed data (like `entity_identifiers.txt` and `gpt_triples.pth`) have been incredibly helpful.

I am currently trying to follow the full data generation pipeline, but I have not been able to locate the scripts for the initial preprocessing steps. I would be very grateful if you could clarify where they are or how these steps were done.

Specifically, I'm looking for:

1.  The script or methodology used to process the original WebQSP and CWQ datasets to create the `WebQSP-sub` and `CWQ-sub` variants mentioned in the paper.
2.  The script used to interact with the GPT-4o API (using the prompt from Appendix E) to generate the labeled triples that are stored in `gpt_triples.pth`.
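
To make item 2 concrete, here is roughly the shape of script I have in mind. It is only my guess. The prompt text, the `(head, relation, tail)` line format, and the names `build_prompt`, `parse_triples`, and `label_question` are placeholders of mine, not taken from your repository or from Appendix E. The model call is passed in as a plain function, so no particular API client is assumed.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]

# Placeholder prompt (assumption): the real one is the Appendix E prompt,
# which is not reproduced here.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Answer(s): {answers}\n"
    "Candidate triples:\n{triples}\n"
    "List the triples needed to answer the question, one per line, "
    "formatted as (head, relation, tail)."
)


def build_prompt(question: str, answers: List[str], candidates: List[Triple]) -> str:
    """Fill the placeholder template with one question and its candidate triples."""
    triple_lines = "\n".join(f"({h}, {r}, {t})" for h, r, t in candidates)
    return PROMPT_TEMPLATE.format(
        question=question, answers=", ".join(answers), triples=triple_lines
    )


def parse_triples(response: str) -> List[Triple]:
    """Extract lines of the form '(head, relation, tail)' from a model reply.

    Assumed output format. Names that themselves contain commas would be
    skipped by this naive split, so a real script would need something sturdier.
    """
    triples: List[Triple] = []
    for line in response.splitlines():
        line = line.strip()
        if not (line.startswith("(") and line.endswith(")")):
            continue
        parts = [p.strip() for p in line[1:-1].split(",")]
        if len(parts) == 3 and all(parts):
            triples.append((parts[0], parts[1], parts[2]))
    return triples


def label_question(
    question: str,
    answers: List[str],
    candidates: List[Triple],
    ask_model: Callable[[str], str],
) -> List[Triple]:
    """Ask the model for relevant triples and keep only those that were candidates."""
    response = ask_model(build_prompt(question, answers, candidates))
    candidate_set = set(candidates)
    return [t for t in parse_triples(response) if t in candidate_set]
```

In a real run, `ask_model` would wrap the GPT-4o call, and the per-question results would presumably be collected and written out with `torch.save`. Whether that matches the actual structure of `gpt_triples.pth` is exactly what I am unsure about.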

Having access to these scripts or any pointers you could provide would be immensely helpful for my understanding and for potentially reproducing your valuable results.

Thank you again for your time and for this significant contribution to the field.

Best regards,
Martin Wang