Thanks for your excellent work!
I noticed that the dataset released on Hugging Face mainly contains the final training annotations (e.g., reasoning traces and tool-augmented sequences). I would like to better understand the data construction pipeline, especially its intermediate stages.
Specifically, I am wondering:
- Are the video clip-level captions used during dataset construction available?
- Is there access to the ground-truth temporal annotations (i.e., start/end timestamps of relevant segments)?