Hi,
I have had good success so far reproducing results training R1/R2.
I am wondering if you attempted to train per-layer R4 rotation?
Bonus question: why is the orthogonality constraint required? Couldn't we just train over invertible matrices A and B, such that:

A @ B = I

and

y = x @ W.T = (x @ A) @ (W @ B.T).T

?
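For concreteness, here's a quick NumPy sketch of what I mean (shapes are hypothetical, and I'm taking B = inv(A) so the product telescopes): since (W @ B.T).T = B @ W.T, the rotated product is x @ A @ B @ W.T, which reduces to x @ W.T whenever A @ B = I — no orthogonality needed for exactness.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # activations (hypothetical shapes)
W = rng.standard_normal((16, 8))  # weight matrix
A = rng.standard_normal((8, 8))   # invertible with high probability
B = np.linalg.inv(A)              # assumption: B = A^{-1}, so A @ B = I

y = x @ W.T                        # original output
y_rot = (x @ A) @ (W @ B.T).T      # = x @ A @ B @ W.T = x @ W.T

assert np.allclose(y, y_rot)
```

So the identity itself only needs invertibility; I assume the orthogonality is about conditioning/quantization behavior rather than exactness, which is what I'm asking about.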
Thank you!