As described in the title: PyTorch's `TransformerEncoderLayer` defaults to `batch_first=False`, so it expects input of shape `(seq_len, batch, features)`. If you feed it batch-major data of shape `(batch, seq_len, features)` without setting `batch_first=True`, no error is raised — the layer silently attends over the wrong dimension, which shows up as high training loss and test error.

[Plot: using default `batch_first=False`]

[Plot: passing `batch_first=True`]
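A minimal sketch of the fix (the dimensions here are illustrative, not from the original post): construct the layer with `batch_first=True` so it interprets the first dimension of the input as the batch axis.

```python
import torch
import torch.nn as nn

# Batch-major input: (batch, seq_len, d_model)
x = torch.randn(32, 10, 512)

# Default layer expects (seq_len, batch, d_model). Feeding batch-major
# data into it runs without error, but attention is computed along the
# wrong axis, so the model trains on scrambled sequences.
bad_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)

# Fix: tell the layer that the batch dimension comes first.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
out = layer(x)
print(out.shape)  # torch.Size([32, 10, 512])
```

The shape mismatch is easy to miss precisely because both layouts are valid tensors of the same rank, so nothing fails at runtime; only the loss curves reveal the bug.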