[Question] Missing Data Preprocessing Scripts for SubgraphRAG Framework #32

@Martin1007Wang

Hello Dr. Li, Dr. Miao, and Dr. Li,

First, thank you for your excellent ICLR 2025 paper, "SIMPLE IS EFFECTIVE: THE ROLES OF GRAPHS AND LARGE LANGUAGE MODELS IN KNOWLEDGE-GRAPH-BASED RETRIEVAL-AUGMENTED GENERATION." The SubgraphRAG framework you proposed is very impressive and insightful.

I have been exploring your official GitHub repository to gain a deeper understanding of the experimental setup. The provided code and processed data (like entity_identifiers.txt and gpt_triples.pth) have been incredibly helpful.

I am currently trying to follow the full data generation pipeline, but I have not been able to locate the scripts for the initial preprocessing steps. I would be very grateful if you could clarify where they can be found.

Specifically, I'm looking for:

  1. The script or methodology used to process the original WebQSP and CWQ datasets to create the WebQSP-sub and CWQ-sub variants mentioned in the paper.
  2. The script used to interact with the GPT-4o API (using the prompt from Appendix E) to generate the labeled triples that are stored in gpt_triples.pth.

Having access to these scripts or any pointers you could provide would be immensely helpful for my understanding and for potentially reproducing your valuable results.

Thank you again for your time and for this significant contribution to the field.

Best regards,
Martin Wang
