Hello, author:
I am confused about text-to-pixel matching.
Images are embedded through Patch Merging into the ViT (image encoder) and become feature maps.
But these feature maps hold per-patch features, not per-pixel features. How can text-to-pixel matching be achieved?
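To make the question concrete, here is a minimal sketch of the shape arithmetic I have in mind. The 224x224 input and the patch size of 16 are my own assumptions for illustration, not values taken from your paper, and `patch_grid` and `pixels_per_feature` are hypothetical helpers, not functions from your code.

```python
def patch_grid(height, width, patch_size):
    """Return (rows, cols) of the patch-level feature map.

    Assumes non-overlapping square patches, as in a standard ViT-style
    patch embedding (an assumption, not necessarily your setup).
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch size")
    return height // patch_size, width // patch_size


def pixels_per_feature(patch_size):
    """Number of input pixels summarized by one patch feature."""
    return patch_size * patch_size


rows, cols = patch_grid(224, 224, 16)  # assumed input size and patch size
print(rows, cols, pixels_per_feature(16))
```

Under these assumptions each feature vector covers a whole block of pixels, so matching text against this map seems to give a patch-level result rather than a pixel-level one.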
I also see that your paper visualizes the CAM. Which output did you use to generate it?