Add weight_decay and mask arguments to adabelief optimizer. #1291
carlosgmartin wants to merge 1 commit into google-deepmind:main from
Conversation
Force-pushed from 405a6a3 to e8de17b.
Could we move towards a different name for the weight decay mask? If adabelief needs weight decay, then perhaps we can think of adding it to all the main optimizers in
Force-pushed from e8de17b to b314331.
@rdyro I've changed the argument's name from

I'll leave changing the other optimizers' argument names to a subsequent PR, to keep this one self-contained.
@rdyro Does this look good?
I'm not entirely sure about this change; the original adabelief paper explicitly discusses, but does not use, weight decay. The problem for optax is that weight decay is NOT scaled by the learning rate, so the user has two options for adding weight decay to an existing optimizer:
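To make the scaling issue concrete, here is a minimal plain-Python sketch (not optax code; the function names are hypothetical) contrasting weight decay that is folded into the gradient, and therefore scaled by the learning rate, with decay applied outside the learning-rate scaling:

```python
def step_coupled(p, g, lr, wd):
    # Decay folded into the gradient: the wd term IS scaled by lr
    # (equivalent to adding L2 regularization to the loss).
    return p - lr * (g + wd * p)

def step_decoupled(p, g, lr, wd):
    # Decay applied outside the scaled update: the wd term is NOT
    # scaled by lr (what you get by chaining a decay transformation
    # after the learning-rate scaling).
    return p - lr * g - wd * p
```

With `lr=0.1`, `wd=0.5`, and a zero gradient, the coupled step shrinks `p=1.0` to `0.95` while the decoupled step shrinks it to `0.5`, which is why the placement of the decay transformation relative to the learning-rate scaling matters.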
It'd be great if we could solve this problem more systematically, so we don't have to add an extra weight decay argument to every popular optimizer. Perhaps we can introduce another keyword argument to the

For a systematic fix, I'd prefer to remove the additional weight_decay keyword argument from pre-made optimizers, but we should keep the ones that explicitly include it (e.g.,
What does @vroulet think?
The repository of the original author seems to have some weight decay: https://github.com/juntang-zhuang/Adabelief-Optimizer/tree/update_0.2.0. So having a weight decay implementation makes sense. I agree with Robert that the current duplications of
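For reference, an AdaBelief step with a decoupled decay term can be sketched as below. This is a hedged single-parameter illustration, not the author's actual code; the decay placement (applied after the main update, unscaled by the learning rate, as discussed in this thread) is an assumption:

```python
import math

def adabelief_step(p, g, m, s, t, lr=1e-3, b1=0.9, b2=0.999,
                   eps=1e-16, wd=0.0):
    # AdaBelief tracks the variance of the gradient around its EMA
    # (the "belief"): s follows (g - m)**2 instead of g**2 as in Adam.
    m = b1 * m + (1 - b1) * g
    s = b2 * s + (1 - b2) * (g - m) ** 2
    m_hat = m / (1 - b1 ** t)   # bias correction, step count t >= 1
    s_hat = s / (1 - b2 ** t)
    p = p - lr * m_hat / (math.sqrt(s_hat) + eps)
    # Decoupled weight decay term, NOT multiplied by lr (assumption):
    p = p - wd * p
    return p, m, s
```

With `wd=0.0` this reduces to a plain AdaBelief step; a nonzero `wd` shrinks the parameter by a factor of `(1 - wd)` after each update, independently of the learning-rate schedule.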
Fixes #1290.