From bbe4957bc415353a0eabcca81a88d2f78227cd27 Mon Sep 17 00:00:00 2001
From: ppraneth
Date: Sun, 21 Jun 2026 09:46:17 +0530
Subject: [PATCH 1/5] add readme

---
 src/torchjd/scalarization/README.md | 47 +++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)
 create mode 100644 src/torchjd/scalarization/README.md

diff --git a/src/torchjd/scalarization/README.md b/src/torchjd/scalarization/README.md
new file mode 100644
index 00000000..0c8e2016
--- /dev/null
+++ b/src/torchjd/scalarization/README.md
@@ -0,0 +1,47 @@
+# Scalarization
+
+A `Scalarizer` reduces a tensor of values (typically a vector of per-task or per-instance losses)
+into a single scalar that can be optimized with a standard `loss.backward()`. Scalarizers are the
+simple baseline against which the Jacobian-descent [aggregators](../aggregation) are compared:
+instead of combining the per-loss gradients, a scalarizer combines the losses directly.
+
+Full documentation for every scalarizer is at
+[torchjd.org](https://torchjd.org/latest/docs/scalarization/).
+
+## Usage
+
+```python
+import torch
+from torch.nn import Linear
+from torchjd.scalarization import Mean
+
+model = Linear(3, 2)
+scalarizer = Mean()
+
+features = torch.randn(8, 3)
+losses = model(features).pow(2).mean(dim=0)  # one loss per output dimension
+loss = scalarizer(losses)
+loss.backward()  # gradients flow to the model parameters
+```
+
+## Available scalarizers
+
+- **Constant**: combines the values with constant, pre-determined weights.
+- **COSMOS**: linear scalarization minus a cosine-similarity penalty toward a preference direction.
+- **DWA**: weights each value by the relative rate at which its loss decreased over the two previous
+  epochs.
+- **FAMO**: decreases all task losses at an approximately equal rate, learning the task weights
+  internally.
+- **GeometricMean**: geometric mean of the values (also known as GLS).
+- **IMTLL**: learns a per-task scale and combines the values as the sum of `exp(s_i) * L_i - s_i`.
+- **Mean**: mean of the values.
+- **PBI**: decomposes the values along a preference direction and penalizes the perpendicular
+  distance.
+- **Random**: combines the values with positive random weights summing to one.
+- **STCH**: smooth approximation of the weighted, shifted maximum of the values.
+- **Sum**: sum of the values.
+- **UW**: weights the values using learned per-task uncertainties.
+
+`UW`, `IMTLL`, and `FAMO` are trainable, and `DWA` and `FAMO` carry state between calls, so they
+need a little more than a single call (an optimizer, a per-epoch `step()`, or an `update()`). See
+the documentation for the exact usage.

From 0d342cb1b97a332ce9cde37c28fe8ac93435660b Mon Sep 17 00:00:00 2001
From: ppraneth
Date: Sun, 21 Jun 2026 13:28:25 +0530
Subject: [PATCH 2/5] add readme v2

---
 src/torchjd/scalarization/README.md | 98 ++++++++++++++++++-----------
 1 file changed, 60 insertions(+), 38 deletions(-)

diff --git a/src/torchjd/scalarization/README.md b/src/torchjd/scalarization/README.md
index 0c8e2016..7a6beadf 100644
--- a/src/torchjd/scalarization/README.md
+++ b/src/torchjd/scalarization/README.md
@@ -1,47 +1,69 @@
 # Scalarization
 
-A `Scalarizer` reduces a tensor of values (typically a vector of per-task or per-instance losses)
-into a single scalar that can be optimized with a standard `loss.backward()`. Scalarizers are the
-simple baseline against which the Jacobian-descent [aggregators](../aggregation) are compared:
-instead of combining the per-loss gradients, a scalarizer combines the losses directly.
+This package implements the `Scalarizer`s: objects that reduce a tensor of values (typically a
+vector of losses) into a single scalar optimizable with a standard `loss.backward()`.
 
-Full documentation for every scalarizer is at
-[torchjd.org](https://torchjd.org/latest/docs/scalarization/).
+This file is for contributors working on scalarizers. For the list of available scalarizers and their
+full API, see [torchjd.org](https://torchjd.org/latest/docs/scalarization/).
 
-## Usage
+## The contract
+
+A scalarizer subclasses `Scalarizer` (in [`_scalarizer_base.py`](_scalarizer_base.py)) and implements
+one method:
 
 ```python
-import torch
-from torch.nn import Linear
-from torchjd.scalarization import Mean
+def forward(self, values: Tensor, /) -> Tensor:
+    ...
+```
 
-model = Linear(3, 2)
-scalarizer = Mean()
+- **Any shape in, scalar out:** it reduces over *all* elements of `values` (0-dim, vector, matrix,
+  higher-dim) into a 0-dim scalar.
+- **`values`, not `losses`:** a scalarizer is generic and not tied to losses.
+- **Pure and differentiable:** the output depends only on `values` and the configured parameters, so
+  that `scalarizer(values).backward()` produces the gradient.
 
-features = torch.randn(8, 3)
-losses = model(features).pow(2).mean(dim=0)  # one loss per output dimension
-loss = scalarizer(losses)
-loss.backward()  # gradients flow to the model parameters
-```
+## Adding one
+
+A new scalarizer is a class plus the files that register it. Mirror an existing scalarizer of the
+same kind:
+
+- `_.py` — the class.
+- `__init__.py` — the import and an `__all__` entry.
+- `docs/source/docs/scalarization/.rst` — the docs page, added to the `index.rst` toctree.
+- `tests/unit/scalarization/test_.py` — the tests.
+- `CHANGELOG.md` — an entry under `[Unreleased]`.
+
+## State
+
+Most scalarizers are stateless. Keep yours stateless unless the method genuinely needs state (learned
+weights, a loss history). When it does:
+
+- **Subclass `Stateful`** (`from torchjd._mixins import Stateful`) and implement `reset()` to restore
+  the initial state.
+- **Keep `forward` self-contained.** Do not hide cross-call state or side effects inside it. When the
+  method must carry information between calls, expose it through an explicit, named method and
+  document the protocol (e.g. a per-epoch `step()`, or an `update()` after the optimizer step).
+- **`nn.Parameter` vs buffer:** trainable state is an `nn.Parameter`; non-trained tensors that must
+  move with `.to()` are registered with `register_buffer`.
+
+## What is not a scalarizer
+
+A scalarizer only ever sees the loss values.
+
+Anything that needs the model, its parameters, or the per-task gradients belongs in the
+[aggregation](../aggregation) package as a `Weighting` / `Aggregator`, which operates on the Jacobian
+or its Gramian. If you reach for gradient norms or the network inside `forward`, you are writing an
+aggregator, not a scalarizer.
+
+## Things to be careful about
 
-## Available scalarizers
-
-- **Constant**: combines the values with constant, pre-determined weights.
-- **COSMOS**: linear scalarization minus a cosine-similarity penalty toward a preference direction.
-- **DWA**: weights each value by the relative rate at which its loss decreased over the two previous
-  epochs.
-- **FAMO**: decreases all task losses at an approximately equal rate, learning the task weights
-  internally.
-- **GeometricMean**: geometric mean of the values (also known as GLS).
-- **IMTLL**: learns a per-task scale and combines the values as the sum of `exp(s_i) * L_i - s_i`.
-- **Mean**: mean of the values.
-- **PBI**: decomposes the values along a preference direction and penalizes the perpendicular
-  distance.
-- **Random**: combines the values with positive random weights summing to one.
-- **STCH**: smooth approximation of the weighted, shifted maximum of the values.
-- **Sum**: sum of the values.
-- **UW**: weights the values using learned per-task uncertainties.
-
-`UW`, `IMTLL`, and `FAMO` are trainable, and `DWA` and `FAMO` carry state between calls, so they
-need a little more than a single call (an optimizer, a per-epoch `step()`, or an `update()`). See
-the documentation for the exact usage.
+- **Determinism and side effects:** the output should depend only on `values` and the configured
+  parameters. Any state change must be deliberate, explicit, and undone by `reset()`.
+- **Numerical stability:** keep the reduction finite on the edges of its domain (log-sum-exp
+  centering, an eps under a norm or in a denominator, etc.), and explain any value shift in a comment
+  and a `.. note::`.
+- **Hyperparameters:** when a coefficient has no single good value across problems, make it required
+  rather than guessing a default, and validate it in `__init__`.
+- **Shape validation:** check parameter shapes against `values` at call time and raise `ValueError`.
+- **Preconditions:** if the method is undefined on some inputs, document it in a `.. note::` and lock
+  it with a test (e.g. assert `nan` propagates rather than being silently clamped).

From ada66ad8f962ddb8a64ceed661e852d6f8056003 Mon Sep 17 00:00:00 2001
From: Praneth Paruchuri
Date: Sun, 21 Jun 2026 14:31:21 +0530
Subject: [PATCH 3/5] Update src/torchjd/scalarization/README.md

Co-authored-by: Pierre Quinton
---
 src/torchjd/scalarization/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/torchjd/scalarization/README.md b/src/torchjd/scalarization/README.md
index 7a6beadf..0515f61c 100644
--- a/src/torchjd/scalarization/README.md
+++ b/src/torchjd/scalarization/README.md
@@ -16,8 +16,8 @@ def forward(self, values: Tensor, /) -> Tensor:
     ...
 ```
 
-- **Any shape in, scalar out:** it reduces over *all* elements of `values` (0-dim, vector, matrix,
-  higher-dim) into a 0-dim scalar.
+- **Any shape in, scalar out:** it reduces over *all* dimensions of `values` (scalar, vector, matrix,
+  etc...) into a scalar.
 - **`values`, not `losses`:** a scalarizer is generic and not tied to losses.
 - **Pure and differentiable:** the output depends only on `values` and the configured parameters, so
   that `scalarizer(values).backward()` produces the gradient.

From 5990d23ad5d05b7a0947bd09206fb7ebad931d3f Mon Sep 17 00:00:00 2001
From: ppraneth
Date: Sun, 21 Jun 2026 14:42:03 +0530
Subject: [PATCH 4/5] add readme v3

---
 src/torchjd/scalarization/README.md | 54 +++++++++++++++++------------
 1 file changed, 31 insertions(+), 23 deletions(-)

diff --git a/src/torchjd/scalarization/README.md b/src/torchjd/scalarization/README.md
index 7a6beadf..88e1af20 100644
--- a/src/torchjd/scalarization/README.md
+++ b/src/torchjd/scalarization/README.md
@@ -6,32 +6,44 @@ vector of losses) into a single scalar optimizable with a standard `loss.backwar
 This file is for contributors working on scalarizers. For the list of available scalarizers and their
 full API, see [torchjd.org](https://torchjd.org/latest/docs/scalarization/).
 
-## The contract
+## The abstraction
 
-A scalarizer subclasses `Scalarizer` (in [`_scalarizer_base.py`](_scalarizer_base.py)) and implements
-one method:
+A scalarizer captures a single decision: **how to collapse a vector of objective values into one
+scalar to minimize**, using only those values. It is the value-level counterpart of an aggregator,
+which makes the same kind of decision at the gradient level. Everything after it (backpropagation,
+the optimizer step) is standard PyTorch.
+
+Concretely, it subclasses `Scalarizer` (in [`_scalarizer_base.py`](_scalarizer_base.py)) and
+implements one method:
 
 ```python
 def forward(self, values: Tensor, /) -> Tensor:
     ...
 ```
 
-- **Any shape in, scalar out:** it reduces over *all* elements of `values` (0-dim, vector, matrix,
-  higher-dim) into a 0-dim scalar.
-- **`values`, not `losses`:** a scalarizer is generic and not tied to losses.
-- **Pure and differentiable:** the output depends only on `values` and the configured parameters, so
-  that `scalarizer(values).backward()` produces the gradient.
+- It reduces over *all* elements of `values`, of any shape, into a 0-dim scalar.
+- The result is a **differentiable** function of `values` and the configured parameters, so that
+  `scalarizer(values).backward()` produces the gradient.
+
+## What is not a scalarizer
+
+A scalarizer sees only the values. Its gradient-level counterpart lives in the
+[aggregation](../aggregation) package: an `Aggregator` (which, like a scalarizer, can be stateful)
+combines the per-objective *gradients* (the Jacobian or its Gramian) into a single gradient.
+
+So if your method needs the model, its parameters, or the per-objective gradients (gradient norms,
+for instance), it is an aggregator, not a scalarizer.
 
 ## Adding one
 
 A new scalarizer is a class plus the files that register it. Mirror an existing scalarizer of the
 same kind:
 
-- `_.py` — the class.
-- `__init__.py` — the import and an `__all__` entry.
-- `docs/source/docs/scalarization/.rst` — the docs page, added to the `index.rst` toctree.
-- `tests/unit/scalarization/test_.py` — the tests.
-- `CHANGELOG.md` — an entry under `[Unreleased]`.
+- `_.py`: the class.
+- `__init__.py`: the import and an `__all__` entry.
+- `docs/source/docs/scalarization/.rst`: the docs page, added to the `index.rst` toctree.
+- `tests/unit/scalarization/test_.py`: the tests.
+- `CHANGELOG.md`: an entry under `[Unreleased]`.
 
 ## State
 
@@ -46,19 +58,15 @@ weights, a loss history). When it does:
 - **`nn.Parameter` vs buffer:** trainable state is an `nn.Parameter`; non-trained tensors that must
   move with `.to()` are registered with `register_buffer`.
 
-## What is not a scalarizer
-
-A scalarizer only ever sees the loss values.
-
-Anything that needs the model, its parameters, or the per-task gradients belongs in the
-[aggregation](../aggregation) package as a `Weighting` / `Aggregator`, which operates on the Jacobian
-or its Gramian. If you reach for gradient norms or the network inside `forward`, you are writing an
-aggregator, not a scalarizer.
+Randomness is not state: a scalarizer may draw fresh randomness on each call (like the random
+baseline) without being `Stateful`. There is no stochastic mixin; it just uses the global torch RNG,
+so document the behavior and let users seed it with `torch.manual_seed`.
 
 ## Things to be careful about
 
-- **Determinism and side effects:** the output should depend only on `values` and the configured
-  parameters. Any state change must be deliberate, explicit, and undone by `reset()`.
+- **Determinism and side effects:** the output should depend only on `values`, the configured
+  parameters, and (if the method is intentionally random) the global RNG. Any state change must be
+  deliberate, explicit, and undone by `reset()`.
 - **Numerical stability:** keep the reduction finite on the edges of its domain (log-sum-exp
   centering, an eps under a norm or in a denominator, etc.), and explain any value shift in a comment
   and a `.. note::`.

From 0156c1f3fba9ad4b059c7bf3721564f414ce7be0 Mon Sep 17 00:00:00 2001
From: ppraneth
Date: Sun, 21 Jun 2026 14:54:50 +0530
Subject: [PATCH 5/5] minor fix

---
 src/torchjd/scalarization/README.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/torchjd/scalarization/README.md b/src/torchjd/scalarization/README.md
index f72b750c..3aef1561 100644
--- a/src/torchjd/scalarization/README.md
+++ b/src/torchjd/scalarization/README.md
@@ -8,10 +8,11 @@ full API, see [torchjd.org](https://torchjd.org/latest/docs/scalarization/).
 
 ## The abstraction
 
-A scalarizer captures a single decision: **how to collapse a vector of objective values into one
-scalar to minimize**, using only those values. It is the value-level counterpart of an aggregator,
-which makes the same kind of decision at the gradient level. Everything after it (backpropagation,
-the optimizer step) is standard PyTorch.
+A scalarizer captures a single decision: **how to collapse a vector of values into one scalar to
+minimize**. It operates purely on those values: it has no notion of the losses, tasks, or model they
+come from, which is why its input is named `values` and not `losses`. It is the value-level
+counterpart of an aggregator, which makes the same decision at the gradient level. Everything after
+it (backpropagation, the optimizer step) is standard PyTorch.
 
 Concretely, it subclasses `Scalarizer` (in [`_scalarizer_base.py`](_scalarizer_base.py)) and
 implements one method:
@@ -21,9 +22,8 @@ def forward(self, values: Tensor, /) -> Tensor:
     ...
 ```
 
-- **Any shape in, scalar out:** it reduces over *all* dimensions of `values` (scalar, vector, matrix,
-  etc...) into a scalar.
-- **`values`, not `losses`:** a scalarizer is generic and not tied to losses.
+- **Any shape in, scalar out:** it reduces over *all* elements of `values` (scalar, vector, matrix,
+  higher-dim) into a single scalar.
 - **Pure and differentiable:** the output depends only on `values` and the configured parameters, so
   that `scalarizer(values).backward()` produces the gradient.
 
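
Reviewer note on the "Numerical stability" bullet: the README names log-sum-exp centering without showing it. Below is a minimal, dependency-free sketch of the idea, assuming a hypothetical smooth-maximum reduction `mu * log(sum(exp(v / mu)))`. The names `smooth_max`, `_flatten` and `mu` are illustrative only and are not part of torchjd, and plain Python floats stand in for tensors; a real scalarizer would do the equivalent with differentiable torch operations (for example `torch.logsumexp`) inside `forward`.

```python
import math
from collections.abc import Iterable


def _flatten(values):
    """Yield every element of an arbitrarily nested sequence of numbers."""
    if isinstance(values, Iterable):
        for item in values:
            yield from _flatten(item)
    else:
        yield float(values)


def smooth_max(values, mu: float = 1.0) -> float:
    """Hypothetical smooth maximum over *all* elements: mu * log(sum(exp(v / mu)))."""
    if mu <= 0.0:
        # Validate the coefficient up front instead of failing later.
        raise ValueError(f"mu must be positive, got {mu}")
    scaled = [v / mu for v in _flatten(values)]
    if not scaled:
        raise ValueError("values must contain at least one element")
    # Centering: subtract the largest exponent so the biggest term is exp(0) = 1.
    # The shift is added back outside the log, so the result is unchanged.
    shift = max(scaled)
    return mu * (shift + math.log(sum(math.exp(v - shift) for v in scaled)))


# Any shape in, scalar out: a nested list is reduced to a single float.
result = smooth_max([[1000.0, 1000.0], [999.0, 0.0]])
```

For finite inputs, the shifted sum always lies between 1 and the number of elements, so the logarithm is well defined. Without the shift, `math.exp(1000.0)` raises `OverflowError`.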