Hi,
In the standard ViT, there is no normalization layer after the nn.Conv2d in the patch embedding, whereas Swin Transformer applies a normalization layer right after the nn.Conv2d in its patch embedding.
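To make sure I'm describing the difference correctly, here is a rough sketch of the two patch-embedding variants I mean (simplified, with my own class names, not the exact code from this repo):

```python
import torch.nn as nn

class PatchEmbedViT(nn.Module):
    """ViT-style patch embedding: Conv2d projection only, no norm afterwards."""
    def __init__(self, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # (B, C, H, W) -> (B, embed_dim, H/ps, W/ps) -> (B, N, embed_dim)
        return self.proj(x).flatten(2).transpose(1, 2)

class PatchEmbedSwin(nn.Module):
    """Swin-style patch embedding: Conv2d projection followed by LayerNorm."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = self.proj(x).flatten(2).transpose(1, 2)  # (B, N, embed_dim)
        return self.norm(x)  # extra normalization after the projection
```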
Why did you decide to add normalization after that nn.Conv2d? Have you tried training Swin without the normalization layer after the nn.Conv2d in the patch embedding to see whether it performs better?