Hi,
Thank you for sharing this interesting project and the accompanying paper! I am trying to replicate the results reported in the paper, specifically the citation quality score, but I am seeing a significant discrepancy.
The paper reports a citation quality score of around 0.8. However, when I replicate the experiments using the 20 papers provided in the dataset, my results are much lower, fluctuating between 0.3 and 0.4.
Here are the steps I followed for my experiment:
- I followed the instructions in the paper and considered the first 1500 tokens of the full text when generating the write-up.
- Other parts of the pipeline, such as literature retrieval, outline generation, and scoring, were used as-is without modifications.
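For reference, the truncation step I used is essentially the following (a minimal sketch; the whitespace split is only a stand-in tokenizer for illustration, since the paper does not specify which tokenizer defines "1500 tokens" — that choice could itself account for part of the discrepancy):

```python
# Sketch of how I truncate the full text before generating the write-up.
# NOTE: str.split() is a stand-in for a real tokenizer; the paper does not
# say which tokenizer "1500 tokens" refers to.
def truncate_to_tokens(text: str, max_tokens: int = 1500) -> str:
    tokens = text.split()
    return " ".join(tokens[:max_tokens])
```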
Additionally, I ran the evaluation on my own custom dataset, but the citation quality score remained in the 0.3–0.4 range.
To ensure my setup was correct, I used the following command for the evaluation:
```shell
python evaluation.py --topic "Domain Specialization of LLMs" \
    --gpu 0 \
    --saving_path ./output/ \
    --model gpt-4o-2024-05-13 \
    --db_path ./database \
    --embedding_model ./model/nomic-embed-text-v1 \
    --api_url \
    --api_key sk-
```

Could you please provide more details on how the citation quality score is computed, and on any additional considerations or settings that might help achieve the scores reported in the paper? I'd appreciate any guidance on how to ensure an accurate reproduction of the results.
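For context, my current guess is that citation quality is something like citation precision — the fraction of generated citations that are actually supported by the retrieved references. This is purely an assumption on my part (which is exactly why I'm asking), but it's the interpretation my replication is based on:

```python
# My assumed definition of citation quality: the fraction of citations in
# the generated write-up that match a reference in the retrieved set.
# This is a guess, NOT the repository's actual metric.
def citation_quality(predicted_citations: set, retrieved_refs: set) -> float:
    if not predicted_citations:
        return 0.0
    return len(predicted_citations & retrieved_refs) / len(predicted_citations)
```

If the actual metric differs (e.g. an entailment-based check of each cited claim rather than simple set overlap), that would likely explain the gap I'm seeing.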
Thank you!