feat(aggregation): Add IMTL-L #725
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
If they are the same, shouldn't we merge them and write a note? I'm pretty sure the factor two will just double the effective LR of I think we could in principle name it after the first implementation of the two (does either cite the other?). I'm not so sure that adding duplicated methods is a good idea: it adds noise to the library, and it will also cost compute for people benchmarking all methods. Not sure what we should do.
@PierreQuinton I added docstrings saying that, but kept the code separate since we know they are different methods, right? (I mean different papers.)
The reason why I wanted to have two separate classes is so that it's easy for people in the field to find the method they want to benchmark against. If they're implementing the IMTL paper, they know they need IMTL-G + IMTL-L. They will never know that they can replace IMTL-L by UW. Also, these methods are not exactly the same, even if the difference is extremely minimal. So I guess it's OK to include this. It's not like this will happen very often, I think. It's more of a reviewer's mistake to let them claim IMTL-L as novel.
Co-authored-by: Valérian Rey <31951177+ValerianRey@users.noreply.github.com>
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
@ValerianRey I have made the changes.
ValerianRey left a comment
LGTM. @PierreQuinton are you OK with merging this? See my comment for an answer to your concern.
Yes, the solution to my concern is an improved onboarding, which is independent from this. Thanks a lot @ppraneth!
@PierreQuinton How about we work on the docs once I am done with the whole scalarization package?
I agree with that. I think our README is outdated and we're missing a simple getting-started tutorial. Also, we need to put much more emphasis on scalarization when the package becomes more complete. Instead of spending a lot of time explaining what Jacobian descent is, I would rather say that we can either combine the losses into a scalar loss and do gradient descent, or compute every gradient and combine them into a single gradient, which is Jacobian descent. Then explain a bit about the pros and cons.
Adds `IMTL`, the loss-balancing variant (IMTL-L) of Impartial Multi-Task Learning from "Towards Impartial Multi-Task Learning" (ICLR 2021). It's a stateful, trainable `Scalarizer`.

IMTL

Each value $L_i$ (typically a per-task loss) is assigned a learnable scale $s_i$, and the values are combined as:

$$\sum_i \left( e^{s_i} L_i - s_i \right)$$

This is the loss-balance objective (eq. 6 in the paper, with the default $a=e, b=1$), and it matches the loss-balancing part of the LibMTL implementation (`loss_scale.exp() * losses - loss_scale`).

The factor $e^{s_i}$ rescales each loss so the scaled losses stay at a comparable magnitude across tasks, and the $-s_i$ term is a regularizer that prevents the trivial solution $s_i \to -\infty$. The $s_i$ are stored as an `nn.Parameter`, so the scalarizer's parameters must be passed to the optimizer to be learned jointly with the model.

Design notes:

- `shape` is given at construction (`IMTL(3)` or `IMTL((2, 3))`), since the parameter has to exist before the optimizer is built. The shape is validated against the input at call time, like `Constant` and `UW`.
- The scales are initialized to `0`, so at the start of training the scalarization reduces to the plain sum of the values ($e^{0} = 1$ and the $-s_i$ terms vanish).
- The class implements `reset()` (from `Stateful`), which zeros the scales.

Relationship to `UW` (almost equivalent)

IMTL-L is almost equivalent to `UW`: it equals `UW` up to a constant factor of two and the sign of the learned parameter, namely

$$\mathrm{IMTL}(s) = 2 \, \mathrm{UW}(-s)$$

(the paper notes this in Appendix C.4, where `UW`'s regression form is written as $\tfrac{1}{2}(e^s L - s)$). They derive from different principles (`UW` from Gaussian/Laplace likelihoods, IMTL-L without any distribution assumption) but share the same per-task weighting and the same optima. `IMTL` is kept as its own discoverable class with its own direct formula; the docstring states the `UW` relationship, and a test locks it. The complementary gradient-balancing variant (IMTL-G) is already available as the `IMTLG` aggregator.

Tests

`tests/unit/scalarization/test_imtl.py` covers the value at init (reduces to `sum(values)`), int-vs-tuple shape equivalence, scalar output and gradient flow over all input shapes (0-dim, vector, matrix, higher-dim), gradient flow to `log_scale`, shape validation, `reset()`, that negative inputs are allowed, trainability via an optimizer step, the representations, and that `IMTL(s) == 2 * UW(-s)`.
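
For readers who want to sanity-check the properties described above without installing anything, here is a minimal numeric sketch in plain Python. It is not the code from this PR: `imtl_l` and `uw_like` are hypothetical helper names, the scales are ordinary floats rather than an `nn.Parameter`, and the `UW` form used here is only the one implied by the stated relationship `IMTL(s) == 2 * UW(-s)`, which is an assumption about the library's class.

```python
import math


def imtl_l(values, scales):
    """Loss-balance objective: sum over tasks of exp(s_i) * L_i - s_i."""
    return sum(math.exp(s) * v - s for v, s in zip(values, scales))


def uw_like(values, params):
    """Assumed UW form: sum over tasks of 0.5 * (exp(-t_i) * L_i + t_i)."""
    return sum(0.5 * (math.exp(-t) * v + t) for v, t in zip(values, params))


values = [0.5, 2.0, 8.0]          # illustrative per-task losses
zeros = [0.0] * len(values)       # scales as they would be at initialization
scales = [0.3, -1.2, 0.7]         # arbitrary non-zero scales

# With all scales at zero, the objective is the plain sum of the values.
at_init = imtl_l(values, zeros)

# Negating the parameter and doubling the UW-like value gives the same number.
lhs = imtl_l(values, scales)
rhs = 2.0 * uw_like(values, [-s for s in scales])

# For positive values, d/ds_i (exp(s_i) * L_i - s_i) = 0 gives s_i = -log(L_i),
# so each scaled value exp(s_i) * L_i equals 1 at the optimum.
optimal = [-math.log(v) for v in values]
scaled = [math.exp(s) * v for s, v in zip(optimal, values)]

print(at_init, lhs, rhs, scaled)
```

The last step illustrates the "comparable magnitude" claim: at the optimal scales, every task's scaled loss is the same regardless of its raw size. It only applies to positive values, since the closed form takes a logarithm.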