Hi InterFuser developers,
In the supplementary material of the InterFuser paper, you include a visualization of the attention weights between the object density map queries and the features from different views (Figure 6, attached). Given that the model is multimodal, I had a hard time figuring out how to visualize the attention weights between a specific input (such as the front view) and a specific output (such as the object density map). I couldn't find an explanation of how to generate it in the paper or on the GitHub page either.
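For reference, here is a minimal sketch of the kind of hook-based extraction I attempted. All module names, token counts, and the view layout below are my own placeholders for a toy decoder layer, not InterFuser's actual architecture, so I'm unsure whether this matches how Figure 6 was produced:

```python
import torch
import torch.nn as nn

captured = {}

def save_attn(module, inputs, output):
    # nn.MultiheadAttention returns (attn_output, attn_weights)
    captured["weights"] = output[1].detach()

# Toy stand-in for one decoder cross-attention layer (placeholder dims)
attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
attn.register_forward_hook(save_attn)

num_queries = 400             # e.g. a 20x20 object density map (assumed)
tokens_per_view = 50          # tokens per camera view (assumed)
num_feature_tokens = 6 * tokens_per_view  # 6 views concatenated (assumed)

queries = torch.randn(1, num_queries, 32)
features = torch.randn(1, num_feature_tokens, 32)
attn(queries, features, features,
     need_weights=True, average_attn_weights=True)

w = captured["weights"]             # (1, num_queries, num_feature_tokens)
front_view = w[0, :, :tokens_per_view]  # slice for one view's tokens
heatmap = front_view.mean(dim=0)        # per-token weight, averaged over queries
print(w.shape, heatmap.shape)
```

My uncertainty is mainly in the slicing: I don't know in what order (or at what resolution) the per-view feature tokens are concatenated in your decoder, so I can't map the weight slice back onto the front-view image correctly.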
Would you consider sharing the code, or at least describing the method, used to generate these visualizations? Thank you very much!
