Hi GRAM authors,
Thanks for your excellent work.
I'm trying to reproduce your results with your code.
Since I don’t have the VAST27M 150k subset, I tried fine-tuning on the MSRVTT dataset starting from three base checkpoints (the VAST model, 4modal, and 5modal) to reproduce your results. However, regardless of which dataset or base checkpoint I use, I cannot reach your reported 64% R@1.
Could you let me know which base model you fine-tuned on MSRVTT?
Here are my reproduced results:

| base model | T2D r1 | D2T r1 |
| --- | --- | --- |
| VAST model | 59 | 59 |
| 4modal | 60.5 | 61 |
| 5modal | 60.6 | 60.9 |