Hi,
I have had good success so far reproducing results training R1/R2.
I am wondering if you attempted to train per-layer R4 rotation?
Bonus question: why is the orthogonality constraint required? Couldn't we just train over invertible matrices A and B, such that:

A @ B = I

and

y = x @ W.T = (x @ A) @ (W @ B.T).T

?
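For concreteness, here's a quick NumPy sketch of what I mean (shapes are hypothetical, and I'm taking B = inv(A) so the product telescopes): since (W @ B.T).T = B @ W.T, the rotated product is x @ A @ B @ W.T, which reduces to x @ W.T whenever A @ B = I — no orthogonality needed for exactness.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # activations (hypothetical shapes)
W = rng.standard_normal((16, 8))  # weight matrix
A = rng.standard_normal((8, 8))   # invertible with high probability
B = np.linalg.inv(A)              # assumption: B = A^{-1}, so A @ B = I

y = x @ W.T                        # original output
y_rot = (x @ A) @ (W @ B.T).T      # = x @ A @ B @ W.T = x @ W.T

assert np.allclose(y, y_rot)
```

So the identity itself only needs invertibility; I assume the orthogonality is about conditioning/quantization behavior rather than exactness, which is what I'm asking about.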
Thank you!