From bbe4957bc415353a0eabcca81a88d2f78227cd27 Mon Sep 17 00:00:00 2001
From: ppraneth <pranethparuchuri@gmail.com>
Date: Sun, 21 Jun 2026 09:46:17 +0530
Subject: [PATCH 1/5] add readme

---
 src/torchjd/scalarization/README.md | 47 +++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)
 create mode 100644 src/torchjd/scalarization/README.md

diff --git a/src/torchjd/scalarization/README.md b/src/torchjd/scalarization/README.md
new file mode 100644
index 00000000..0c8e2016
--- /dev/null
+++ b/src/torchjd/scalarization/README.md
@@ -0,0 +1,47 @@
+# Scalarization
+
+A `Scalarizer` reduces a tensor of values (typically a vector of per-task or per-instance losses)
+into a single scalar that can be optimized with a standard `loss.backward()`. Scalarizers are the
+simple baseline against which the Jacobian-descent [aggregators](../aggregation) are compared:
+instead of combining the per-loss gradients, a scalarizer combines the losses directly.
+
+Full documentation for every scalarizer is at
+[torchjd.org](https://torchjd.org/latest/docs/scalarization/).
+
+## Usage
+
+```python
+import torch
+from torch.nn import Linear
+from torchjd.scalarization import Mean
+
+model = Linear(3, 2)
+scalarizer = Mean()
+
+features = torch.randn(8, 3)
+losses = model(features).pow(2).mean(dim=0)  # one loss per output dimension
+loss = scalarizer(losses)
+loss.backward()  # gradients flow to the model parameters
+```
+
+## Available scalarizers
+
+- **Constant**: combines the values with constant, pre-determined weights.
+- **COSMOS**: linear scalarization minus a cosine-similarity penalty toward a preference direction.
+- **DWA**: weights each value by the relative rate at which its loss decreased over the two previous
+  epochs.
+- **FAMO**: decreases all task losses at an approximately equal rate, learning the task weights
+  internally.
+- **GeometricMean**: geometric mean of the values (also known as GLS).
+- **IMTLL**: learns a per-task scale and combines the values as the sum of `exp(s_i) * L_i - s_i`.
+- **Mean**: mean of the values.
+- **PBI**: decomposes the values along a preference direction and penalizes the perpendicular
+  distance.
+- **Random**: combines the values with positive random weights summing to one.
+- **STCH**: smooth approximation of the weighted, shifted maximum of the values.
+- **Sum**: sum of the values.
+- **UW**: weights the values using learned per-task uncertainties.
+
+`UW`, `IMTLL`, and `FAMO` are trainable, and `DWA` and `FAMO` carry state between calls, so they
+need a little more than a single call (an optimizer, a per-epoch `step()`, or an `update()`). See
+the documentation for the exact usage.

From 0d342cb1b97a332ce9cde37c28fe8ac93435660b Mon Sep 17 00:00:00 2001
From: ppraneth <pranethparuchuri@gmail.com>
Date: Sun, 21 Jun 2026 13:28:25 +0530
Subject: [PATCH 2/5] rewrite scalarization readme as a contributor guide

---
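The "State" section added in this patch asks for an explicit, named update method and a `reset()` that restores the initial state. A minimal sketch of that protocol follows. It is torch-free so that it runs anywhere: plain floats stand in for tensors, and it does not subclass the real `Scalarizer` or `Stateful`. The class name `RateWeightedSum`, the `temperature` argument, and the weighting rule are hypothetical illustrations, not torchjd's `DWA` implementation.

```python
import math


class RateWeightedSum:
    """Hypothetical stateful reducer: a weighted sum whose weights follow a loss history."""

    def __init__(self, n_values: int, *, temperature: float):
        # No universally good temperature exists, so it is required and validated here.
        if temperature <= 0:
            raise ValueError("temperature must be positive")
        self.n_values = n_values
        self.temperature = temperature
        self.reset()

    def reset(self) -> None:
        # Restore the initial state: empty history and uniform weights.
        self._history = []
        self.weights = [1.0] * self.n_values

    def step(self, epoch_values: list[float]) -> None:
        # Explicit per-epoch update: the only place where state changes.
        if len(epoch_values) != self.n_values:
            raise ValueError("expected one value per objective")
        if any(v <= 0 for v in epoch_values):
            raise ValueError("epoch values must be strictly positive")
        # Keep only the two most recent epochs.
        self._history = (self._history + [list(epoch_values)])[-2:]
        if len(self._history) == 2:
            older, newer = self._history
            ratios = [new / old for new, old in zip(newer, older)]
            exps = [math.exp(r / self.temperature) for r in ratios]
            total = sum(exps)
            # Softmax over the ratios, rescaled so the weights sum to n_values.
            self.weights = [self.n_values * e / total for e in exps]

    def __call__(self, values: list[float]) -> float:
        # Reads the state but never mutates it, so repeated calls are side-effect free.
        if len(values) != self.n_values:
            raise ValueError("expected one value per objective")
        return sum(w * v for w, v in zip(self.weights, values))
```

Calling the object twice in a row with the same input returns the same result; only `step()` and `reset()` move the state, which is the property the section asks contributors to preserve.
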
 src/torchjd/scalarization/README.md | 98 ++++++++++++++++++-----------
 1 file changed, 60 insertions(+), 38 deletions(-)

diff --git a/src/torchjd/scalarization/README.md b/src/torchjd/scalarization/README.md
index 0c8e2016..7a6beadf 100644
--- a/src/torchjd/scalarization/README.md
+++ b/src/torchjd/scalarization/README.md
@@ -1,47 +1,69 @@
 # Scalarization
 
-A `Scalarizer` reduces a tensor of values (typically a vector of per-task or per-instance losses)
-into a single scalar that can be optimized with a standard `loss.backward()`. Scalarizers are the
-simple baseline against which the Jacobian-descent [aggregators](../aggregation) are compared:
-instead of combining the per-loss gradients, a scalarizer combines the losses directly.
+This package implements the `Scalarizer`s: objects that reduce a tensor of values (typically a
+vector of losses) into a single scalar optimizable with a standard `loss.backward()`.
 
-Full documentation for every scalarizer is at
-[torchjd.org](https://torchjd.org/latest/docs/scalarization/).
+This file is for contributors working on scalarizers. For the list of available scalarizers and their
+full API, see [torchjd.org](https://torchjd.org/latest/docs/scalarization/).
 
-## Usage
+## The contract
+
+A scalarizer subclasses `Scalarizer` (in [`_scalarizer_base.py`](_scalarizer_base.py)) and implements
+one method:
 
 ```python
-import torch
-from torch.nn import Linear
-from torchjd.scalarization import Mean
+def forward(self, values: Tensor, /) -> Tensor:
+    ...
+```
 
-model = Linear(3, 2)
-scalarizer = Mean()
+- **Any shape in, scalar out:** it reduces over *all* elements of `values` (0-dim, vector, matrix,
+  higher-dim) into a 0-dim scalar.
+- **`values`, not `losses`:** a scalarizer is generic and not tied to losses.
+- **Pure and differentiable:** the output depends only on `values` and the configured parameters, so
+  that `scalarizer(values).backward()` produces the gradient.
 
-features = torch.randn(8, 3)
-losses = model(features).pow(2).mean(dim=0)  # one loss per output dimension
-loss = scalarizer(losses)
-loss.backward()  # gradients flow to the model parameters
-```
+## Adding one
+
+A new scalarizer is a class plus the files that register it. Mirror an existing scalarizer of the
+same kind:
+
+- `_<name>.py` — the class.
+- `__init__.py` — the import and an `__all__` entry.
+- `docs/source/docs/scalarization/<name>.rst` — the docs page, added to the `index.rst` toctree.
+- `tests/unit/scalarization/test_<name>.py` — the tests.
+- `CHANGELOG.md` — an entry under `[Unreleased]`.
+
+## State
+
+Most scalarizers are stateless. Keep yours stateless unless the method genuinely needs state (learned
+weights, a loss history). When it does:
+
+- **Subclass `Stateful`** (`from torchjd._mixins import Stateful`) and implement `reset()` to restore
+  the initial state.
+- **Keep `forward` self-contained.** Do not hide cross-call state or side effects inside it. When the
+  method must carry information between calls, expose it through an explicit, named method and
+  document the protocol (e.g. a per-epoch `step()`, or an `update()` after the optimizer step).
+- **`nn.Parameter` vs buffer:** trainable state is an `nn.Parameter`; non-trained tensors that must
+  move with `.to()` are registered with `register_buffer`.
+
+## What is not a scalarizer
+
+A scalarizer only ever sees the loss values.
+
+Anything that needs the model, its parameters, or the per-task gradients belongs in the
+[aggregation](../aggregation) package as a `Weighting` / `Aggregator`, which operates on the Jacobian
+or its Gramian. If you reach for gradient norms or the network inside `forward`, you are writing an
+aggregator, not a scalarizer.
+
+## Things to be careful about
 
-## Available scalarizers
-
-- **Constant**: combines the values with constant, pre-determined weights.
-- **COSMOS**: linear scalarization minus a cosine-similarity penalty toward a preference direction.
-- **DWA**: weights each value by the relative rate at which its loss decreased over the two previous
-  epochs.
-- **FAMO**: decreases all task losses at an approximately equal rate, learning the task weights
-  internally.
-- **GeometricMean**: geometric mean of the values (also known as GLS).
-- **IMTLL**: learns a per-task scale and combines the values as the sum of `exp(s_i) * L_i - s_i`.
-- **Mean**: mean of the values.
-- **PBI**: decomposes the values along a preference direction and penalizes the perpendicular
-  distance.
-- **Random**: combines the values with positive random weights summing to one.
-- **STCH**: smooth approximation of the weighted, shifted maximum of the values.
-- **Sum**: sum of the values.
-- **UW**: weights the values using learned per-task uncertainties.
-
-`UW`, `IMTLL`, and `FAMO` are trainable, and `DWA` and `FAMO` carry state between calls, so they
-need a little more than a single call (an optimizer, a per-epoch `step()`, or an `update()`). See
-the documentation for the exact usage.
+- **Determinism and side effects:** the output should depend only on `values` and the configured
+  parameters. Any state change must be deliberate, explicit, and undone by `reset()`.
+- **Numerical stability:** keep the reduction finite on the edges of its domain (log-sum-exp
+  centering, an eps under a norm or in a denominator, etc.), and explain any value shift in a comment
+  and a `.. note::`.
+- **Hyperparameters:** when a coefficient has no single good value across problems, make it required
+  rather than guessing a default, and validate it in `__init__`.
+- **Shape validation:** check parameter shapes against `values` at call time and raise `ValueError`.
+- **Preconditions:** if the method is undefined on some inputs, document it in a `.. note::` and lock
+  it with a test (e.g. assert `nan` propagates rather than being silently clamped).

From ada66ad8f962ddb8a64ceed661e852d6f8056003 Mon Sep 17 00:00:00 2001
From: Praneth Paruchuri <pranethparuchuri@gmail.com>
Date: Sun, 21 Jun 2026 14:31:21 +0530
Subject: [PATCH 3/5] Update src/torchjd/scalarization/README.md

Co-authored-by: Pierre Quinton <pierre.quinton@gmail.com>
---
 src/torchjd/scalarization/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/torchjd/scalarization/README.md b/src/torchjd/scalarization/README.md
index 7a6beadf..0515f61c 100644
--- a/src/torchjd/scalarization/README.md
+++ b/src/torchjd/scalarization/README.md
@@ -16,8 +16,8 @@ def forward(self, values: Tensor, /) -> Tensor:
     ...
 ```
 
-- **Any shape in, scalar out:** it reduces over *all* elements of `values` (0-dim, vector, matrix,
-  higher-dim) into a 0-dim scalar.
+- **Any shape in, scalar out:** it reduces over *all* dimensions of `values` (scalar, vector, matrix,
+  etc...) into a scalar.
 - **`values`, not `losses`:** a scalarizer is generic and not tied to losses.
 - **Pure and differentiable:** the output depends only on `values` and the configured parameters, so
   that `scalarizer(values).backward()` produces the gradient.

From 5990d23ad5d05b7a0947bd09206fb7ebad931d3f Mon Sep 17 00:00:00 2001
From: ppraneth <pranethparuchuri@gmail.com>
Date: Sun, 21 Jun 2026 14:42:03 +0530
Subject: [PATCH 4/5] readme: introduce the abstraction, document randomness

---
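The "Numerical stability" bullet kept as context in this patch mentions log-sum-exp centering. A minimal sketch of that trick follows, assuming plain Python floats instead of tensors so that it has no dependencies. The function name `smooth_max` and the parameter `mu` are hypothetical and do not come from torchjd.

```python
import math


def smooth_max(values: list[float], mu: float) -> float:
    """Return mu * log(sum(exp(v / mu))), a smooth upper bound on max(values)."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    if not values:
        raise ValueError("values must not be empty")
    scaled = [v / mu for v in values]
    # Value shift: subtracting the maximum makes every exponent <= 0, so exp() cannot
    # overflow. The shift is added back outside the log, leaving the result unchanged.
    shift = max(scaled)
    return mu * (shift + math.log(sum(math.exp(s - shift) for s in scaled)))
```

Without the shift, `math.exp(1000.0)` raises `OverflowError`, whereas `smooth_max([1000.0, 999.0], 1.0)` stays finite. The comment on the shift is the kind of explanation the bullet asks for.
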
 src/torchjd/scalarization/README.md | 54 +++++++++++++++++------------
 1 file changed, 31 insertions(+), 23 deletions(-)

diff --git a/src/torchjd/scalarization/README.md b/src/torchjd/scalarization/README.md
index 7a6beadf..88e1af20 100644
--- a/src/torchjd/scalarization/README.md
+++ b/src/torchjd/scalarization/README.md
@@ -6,32 +6,44 @@ vector of losses) into a single scalar optimizable with a standard `loss.backwar
 This file is for contributors working on scalarizers. For the list of available scalarizers and their
 full API, see [torchjd.org](https://torchjd.org/latest/docs/scalarization/).
 
-## The contract
+## The abstraction
 
-A scalarizer subclasses `Scalarizer` (in [`_scalarizer_base.py`](_scalarizer_base.py)) and implements
-one method:
+A scalarizer captures a single decision: **how to collapse a vector of objective values into one
+scalar to minimize**, using only those values. It is the value-level counterpart of an aggregator,
+which makes the same kind of decision at the gradient level. Everything after it (backpropagation,
+the optimizer step) is standard PyTorch.
+
+Concretely, it subclasses `Scalarizer` (in [`_scalarizer_base.py`](_scalarizer_base.py)) and
+implements one method:
 
 ```python
 def forward(self, values: Tensor, /) -> Tensor:
     ...
 ```
 
-- **Any shape in, scalar out:** it reduces over *all* elements of `values` (0-dim, vector, matrix,
-  higher-dim) into a 0-dim scalar.
-- **`values`, not `losses`:** a scalarizer is generic and not tied to losses.
-- **Pure and differentiable:** the output depends only on `values` and the configured parameters, so
-  that `scalarizer(values).backward()` produces the gradient.
+- It reduces over *all* elements of `values`, of any shape, into a 0-dim scalar.
+- The result is a **differentiable** function of `values` and the configured parameters, so that
+  `scalarizer(values).backward()` produces the gradient.
+
+## What is not a scalarizer
+
+A scalarizer sees only the values. Its gradient-level counterpart lives in the
+[aggregation](../aggregation) package: an `Aggregator` (which, like a scalarizer, can be stateful)
+combines the per-objective *gradients* (the Jacobian or its Gramian) into a single gradient.
+
+So if your method needs the model, its parameters, or the per-objective gradients (gradient norms,
+for instance), it is an aggregator, not a scalarizer.
 
 ## Adding one
 
 A new scalarizer is a class plus the files that register it. Mirror an existing scalarizer of the
 same kind:
 
-- `_<name>.py` — the class.
-- `__init__.py` — the import and an `__all__` entry.
-- `docs/source/docs/scalarization/<name>.rst` — the docs page, added to the `index.rst` toctree.
-- `tests/unit/scalarization/test_<name>.py` — the tests.
-- `CHANGELOG.md` — an entry under `[Unreleased]`.
+- `_<name>.py`: the class.
+- `__init__.py`: the import and an `__all__` entry.
+- `docs/source/docs/scalarization/<name>.rst`: the docs page, added to the `index.rst` toctree.
+- `tests/unit/scalarization/test_<name>.py`: the tests.
+- `CHANGELOG.md`: an entry under `[Unreleased]`.
 
 ## State
 
@@ -46,19 +58,15 @@ weights, a loss history). When it does:
 - **`nn.Parameter` vs buffer:** trainable state is an `nn.Parameter`; non-trained tensors that must
   move with `.to()` are registered with `register_buffer`.
 
-## What is not a scalarizer
-
-A scalarizer only ever sees the loss values.
-
-Anything that needs the model, its parameters, or the per-task gradients belongs in the
-[aggregation](../aggregation) package as a `Weighting` / `Aggregator`, which operates on the Jacobian
-or its Gramian. If you reach for gradient norms or the network inside `forward`, you are writing an
-aggregator, not a scalarizer.
+Randomness is not state: a scalarizer may draw fresh randomness on each call (like the random
+baseline) without being `Stateful`. There is no stochastic mixin; such a scalarizer simply uses the
+global torch RNG, so document the behavior and let users seed it with `torch.manual_seed`.
 
 ## Things to be careful about
 
-- **Determinism and side effects:** the output should depend only on `values` and the configured
-  parameters. Any state change must be deliberate, explicit, and undone by `reset()`.
+- **Determinism and side effects:** the output should depend only on `values`, the configured
+  parameters, and (if the method is intentionally random) the global RNG. Any state change must be
+  deliberate, explicit, and undone by `reset()`.
 - **Numerical stability:** keep the reduction finite on the edges of its domain (log-sum-exp
   centering, an eps under a norm or in a denominator, etc.), and explain any value shift in a comment
   and a `.. note::`.

From 0156c1f3fba9ad4b059c7bf3721564f414ce7be0 Mon Sep 17 00:00:00 2001
From: ppraneth <pranethparuchuri@gmail.com>
Date: Sun, 21 Jun 2026 14:54:50 +0530
Subject: [PATCH 5/5] readme: explain why the input is named `values`

---
 src/torchjd/scalarization/README.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/torchjd/scalarization/README.md b/src/torchjd/scalarization/README.md
index f72b750c..3aef1561 100644
--- a/src/torchjd/scalarization/README.md
+++ b/src/torchjd/scalarization/README.md
@@ -8,10 +8,11 @@ full API, see [torchjd.org](https://torchjd.org/latest/docs/scalarization/).
 
 ## The abstraction
 
-A scalarizer captures a single decision: **how to collapse a vector of objective values into one
-scalar to minimize**, using only those values. It is the value-level counterpart of an aggregator,
-which makes the same kind of decision at the gradient level. Everything after it (backpropagation,
-the optimizer step) is standard PyTorch.
+A scalarizer captures a single decision: **how to collapse a tensor of values into one scalar to
+minimize**. It operates purely on those values: it has no notion of the losses, tasks, or model they
+come from, which is why its input is named `values` and not `losses`. It is the value-level
+counterpart of an aggregator, which makes the same decision at the gradient level. Everything after
+it (backpropagation, the optimizer step) is standard PyTorch.
 
 Concretely, it subclasses `Scalarizer` (in [`_scalarizer_base.py`](_scalarizer_base.py)) and
 implements one method:
@@ -21,9 +22,8 @@ def forward(self, values: Tensor, /) -> Tensor:
     ...
 ```
 
-- **Any shape in, scalar out:** it reduces over *all* dimensions of `values` (scalar, vector, matrix,
-  etc...) into a scalar.
-- **`values`, not `losses`:** a scalarizer is generic and not tied to losses.
+- **Any shape in, scalar out:** it reduces over *all* elements of `values` (scalar, vector, matrix,
+  higher-dim) into a single scalar.
 - **Pure and differentiable:** the output depends only on `values` and the configured parameters, so
   that `scalarizer(values).backward()` produces the gradient.