Hi, thanks for your impressive work!
After reading your paper, I have a question about the frame-level interaction control. To my understanding, the actions are injected as a (1+n)-length sequence to generate (1+n) images together, and this chunk-by-chunk process is autoregressively extended into a long video.
So during inference, is it possible to provide one action at a time to generate the next frame? Or, if not, how exactly do you define frame-level control? Thanks a lot in advance.
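To make the question concrete, here is a rough sketch of the two inference modes I have in mind. All names here (`generate_chunk`, `generate_next_frame`, `policy`, etc.) are placeholders I made up for illustration, not your actual API:

```python
# Purely illustrative pseudocode; the model methods below are hypothetical
# placeholder names, not the real interface from the paper/repo.

# Mode A -- my current understanding: all (1+n) actions for a chunk are
# given up front, the (1+n) frames are generated together, and the process
# repeats autoregressively on the next chunk.
def rollout_chunked(model, first_frame, action_chunks):
    frames = [first_frame]
    for actions in action_chunks:                  # each chunk holds 1+n actions
        new_frames = model.generate_chunk(frames, actions)
        frames.extend(new_frames)
    return frames

# Mode B -- what I'm asking about: feed a single action, receive the next
# frame, then choose the following action interactively.
def rollout_stepwise(model, first_frame, policy, num_steps):
    frames = [first_frame]
    for _ in range(num_steps):
        action = policy(frames[-1])                # decided one step at a time
        next_frame = model.generate_next_frame(frames, action)
        frames.append(next_frame)
    return frames
```

Basically, I'd like to know whether Mode B is supported at inference time, or whether frame-level control only means per-frame actions inside a jointly generated chunk (Mode A).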