Ref: https://arxiv.org/abs/2503.18866
Why
We lack data, but have plenty of compute for the current scale of models we're interested in. Being able to generate more data in a smart way might be key to actually improving models. BoLT seems like a promising way to do that.
Approach
Implement the method and run experiments on e.g. 7B CPT w/ Dynaword or similar scale data. This method can soak up a lot of compute, so it'll be important to cap compute usage for experiments, maybe 20k MI250X hours to show positive results.
Ref: https://arxiv.org/abs/2503.18866
Why
We lack data, but have plenty of compute for the current scale of models we're interested in. Being able to generate more data in a smart way might be key to actually improving models. BoLT seems like a promising way to do that.
Approach
Implement the method and run experiments on e.g. 7B CPT w/ Dynaword or similar scale data. This method can soak up a lot of compute, so it'll be important to cap compute usage for experiments, maybe 20k MI250X hours to show positive results.