Text-to-pixel matching

Hello author,

I am confused about **text-to-pixel** matching.

Images go through _Patch Merging_ to be embedded in the ViT (image encoder) and become feature maps.

But this **feature map** holds **features of patches**, **not pixels**. How can I achieve text-to-pixel matching?
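To make the question concrete, here is a minimal sketch of the resolution mismatch I mean. It assumes a plain ViT-style split into non-overlapping square patches. The image size (224) and patch size (16) are illustrative values I picked, not taken from your paper or code, and the function names are hypothetical.

```python
def patch_grid_shape(height, width, patch_size):
    """Return (rows, cols) of the patch-level feature map for an image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    return height // patch_size, width // patch_size


def pixel_to_patch(y, x, patch_size):
    """Map a pixel coordinate to the index of the patch that contains it."""
    return y // patch_size, x // patch_size


# Assumed example values: a 224x224 image with 16x16 patches.
rows, cols = patch_grid_shape(224, 224, 16)
print(rows, cols)                    # 14 14

# Every pixel inside the same 16x16 block maps to the same patch feature.
print(pixel_to_patch(0, 0, 16))      # (0, 0)
print(pixel_to_patch(15, 15, 16))    # (0, 0)
print(pixel_to_patch(100, 37, 16))   # (6, 2)
```

Under these assumptions, 256 pixels share one feature vector, so matching text against this map gives one score per patch rather than one per pixel.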


I also see that your paper visualizes the CAM. Which output did you use to generate it?