In Section 3.2.1 of your paper, you mention that using the original caption can improve the model's answers. However, I couldn't find anything about the original caption in your prompt. Does GPT4V use only the images to produce the data, including the captions and instructions? I'd like to know whether the original caption is used in your data synthesis or model inference procedure.