Hi, the attack method in your work is very interesting. However, when I run gen_adv.py from scratch, the generated correct and incorrect answers are quite long, which is inconsistent with the results you provide in adv_targeted_results. Could you please clarify whether you used manual review or additional prompt constraints?
The LLM I am using is gpt-4o-mini, but I saw in the issues that someone using gpt-4 got the same result.
In addition, the step that generates the correct answer in gen_adv.py calls the LLM twice (once with a direct query and once with the ground-truth document included) and compares the two outputs using string matching. Because the generated answers are so long, a large number of queries fail the match and are skipped.
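To make the failure mode concrete, here is a minimal sketch of the kind of comparison I mean. It is my own illustration, not the code in gen_adv.py: the function names `normalize` and `answers_match` are hypothetical, and I am assuming a normalized substring check between the two LLM outputs.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def answers_match(direct: str, grounded: str) -> bool:
    """Hypothetical check: True if one normalized answer contains the other."""
    a, b = normalize(direct), normalize(grounded)
    if not a or not b:
        # An empty answer is a substring of everything, so treat it as no match.
        return False
    return a in b or b in a


# A short answer is found inside a longer one.
short_vs_long = answers_match("Paris", "The capital of France is Paris.")

# Two long, differently phrased answers agree in meaning but still fail to match.
long_vs_long = answers_match(
    "The answer is Paris, the capital of France.",
    "Based on the context, Paris is the capital.",
)
```

With short answers the check passes, but two sentence-length answers that say the same thing do not contain each other, so the query would be skipped.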
The overly long answers also cause the string-matching-based ASR evaluation to fail, resulting in a significant drop in the ASR score.
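As a rough illustration of why the ASR drops, the sketch below counts an attack as successful only if the target answer appears as a substring of the model's response. This is an assumption about how the evaluation works, not the repository's actual code, and `attack_success_rate` is a hypothetical name.

```python
def attack_success_rate(responses, target_answers):
    """Fraction of responses that contain their target answer (case-insensitive).

    Hypothetical sketch of a substring-based ASR metric.
    """
    if len(responses) != len(target_answers):
        raise ValueError("responses and target_answers must have the same length")
    if not responses:
        return 0.0
    hits = 0
    for response, target in zip(responses, target_answers):
        target = target.strip().lower()
        # An empty target would trivially match, so it is counted as a miss.
        if target and target in response.lower():
            hits += 1
    return hits / len(responses)


responses = [
    "The Titanic sank in 1912.",
    "The Titanic sank in 1912.",
]
targets = [
    "1912",  # short target: found in the response
    "The Titanic sank in the year 1912 after hitting an iceberg",  # long target: not found
]
asr = attack_success_rate(responses, targets)
```

Here the same response counts as a success against the short target and as a failure against the long one, so `asr` comes out at 0.5 even though both targets state the same fact.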
Could you please suggest how to reproduce the experiment and the results you provided?