feat(scalarization): Add readme#749
Conversation
PierreQuinton
left a comment
There was a problem hiding this comment.
So think that if this is a good idea to add, then it should be dev oriented only. Here it looks like documentation, which it should not be. The way I view this would be to explain how to create scalarizers, what to be careful about (when are they stateful, avoid non-atomic things, etc...), what are not scalarizers? etc...
What do you think of this?
|
Yea I will make it dev oriented |
|
|
||
| - **Any shape in, scalar out:** it reduces over *all* elements of `values` (0-dim, vector, matrix, | ||
| higher-dim) into a 0-dim scalar. | ||
| - **`values`, not `losses`:** a scalarizer is generic and not tied to losses. |
There was a problem hiding this comment.
I think we may want to speak more about the abstraction level rather than saying that. Not sure how though.
There was a problem hiding this comment.
rewrote it around the abstraction
| Anything that needs the model, its parameters, or the per-task gradients belongs in the | ||
| [aggregation](../aggregation) package as a `Weighting` / `Aggregator`, which operates on the Jacobian | ||
| or its Gramian. If you reach for gradient norms or the network inside `forward`, you are writing an | ||
| aggregator, not a scalarizer. |
There was a problem hiding this comment.
Well the aggregation package should not contain anything else than Scalarizers, it contains aggregators (that could be stateful).
There was a problem hiding this comment.
Fixed. It now says the gradient-level counterpart in the aggregation package is an Aggregator (which, like a scalarizer, can be stateful) that operates on the Jacobian or its Gramian.
| - **Subclass `Stateful`** (`from torchjd._mixins import Stateful`) and implement `reset()` to restore | ||
| the initial state. |
There was a problem hiding this comment.
Do we have Randomness? If so do we have the Stochastic mixin that we also use in aggregation? Maybe we can mention it here.
There was a problem hiding this comment.
We do have a random baseline, but there is no Stochastic mixin. It just calls torch.randn directly
There was a problem hiding this comment.
My bad, I thought that In the aggregation package, we consider random aggregators as stateful (the seed is the state), I think it would be beneficial to do that in scalarization. @ValerianRey Didn't we go in that direction? Did we abandon that idea?
There was a problem hiding this comment.
Yes we abandoned this idea because it was quite tricky to make it work on cuda:
See:
The biggest problems were that something can be stochastic without directly owning a generator (it can own another stochastic object that does own a generator), and generators need a device to be created, so they can't be simply created ahead of time.
Co-authored-by: Pierre Quinton <pierre.quinton@gmail.com>
There was a problem hiding this comment.
Really nice but IMO we should try to include some parts within public docstrings, some within comments, and some within a skill. So I don't think we actually need a README file here at all. It's a bit different from what other repos do, but it will really enable agentic development to a much stronger level, while avoiding duplication of information (which is a nightmare to maintain).
E.g.
# Scalarization
This package implements the `Scalarizer`s: objects that reduce a tensor of values (typically a
vector of losses) into a single scalar optimizable with a standard `loss.backward()`.
=> init.py public docstring
## The abstraction
=> Split between public docstring of Scalarizer.forward, comments, etc
## Adding one
## State
## Things to be careful about
=> skill
And somewhere in CONTRIBUTING.md we should tell people that they're supposed to read skills before implementing something, and to use those skills if they use agents to develop.
| ## The abstraction | ||
|
|
||
| A scalarizer captures a single decision: **how to collapse a vector of values into one scalar to | ||
| minimize**. It operates purely on those values: it has no notion of the losses, tasks, or model they |
There was a problem hiding this comment.
Well, the values are 99% of the time losses so it's a bit confusing to say that it has no notion of the losses. Also I think we should say that in most cases, values are losses, but the scalarization package has been designed to be able to scalarize any tensor of values.
| Concretely, it subclasses `Scalarizer` (in [`_scalarizer_base.py`](_scalarizer_base.py)) and | ||
| implements one method: | ||
|
|
||
| ```python | ||
| def forward(self, values: Tensor, /) -> Tensor: | ||
| ... | ||
| ``` | ||
|
|
||
| - **Any shape in, scalar out:** it reduces over *all* elements of `values` (scalar, vector, matrix, | ||
| higher-dim) into a single scalar. | ||
| - **Pure and differentiable:** the output depends only on `values` and the configured parameters, so | ||
| that `scalarizer(values).backward()` produces the gradient. |
There was a problem hiding this comment.
This part seems like something that should be explained in the code itself (if not already), either in a public docstring for things that are intended to be user-facing, or in a comment for things that are intended to be contributor-facing.
| ## Adding one | ||
|
|
||
| A new scalarizer is a class plus the files that register it. Mirror an existing scalarizer of the | ||
| same kind: | ||
|
|
||
| - `_<name>.py`: the class. | ||
| - `__init__.py`: the import and an `__all__` entry. | ||
| - `docs/source/docs/scalarization/<name>.rst`: the docs page, added to the `index.rst` toctree. | ||
| - `tests/unit/scalarization/test_<name>.py`: the tests. | ||
| - `CHANGELOG.md`: an entry under `[Unreleased]`. |
There was a problem hiding this comment.
Honestly I feel like this should be explained in a skill rather than here. But I also think skills should be addressed to humans and agents alike.
| - **Subclass `Stateful`** (`from torchjd._mixins import Stateful`) and implement `reset()` to restore | ||
| the initial state. |
There was a problem hiding this comment.
Yes we abandoned this idea because it was quite tricky to make it work on cuda:
See:
The biggest problems were that something can be stochastic without directly owning a generator (it can own another stochastic object that does own a generator), and generators need a device to be created, so they can't be simply created ahead of time.
| - **Determinism and side effects:** the output should depend only on `values`, the configured | ||
| parameters, and (if the method is intentionally random) the global RNG. Any state change must be | ||
| deliberate, explicit, and undone by `reset()`. |
There was a problem hiding this comment.
I don't necessarily agree with that. We want to be able to give extra information through setters. E.g. a scalarizer that takes the gramian every once in a while:
my_agg = MyAgg()
for i in ...:
losses = ...
if i % 1000 == 0:
gramian = engine.compute_gramian(losses)
my_agg.set_gramian(gramian)
loss = my_agg(losses)
loss.backward()
...Note that such a scalarizer could use internally for example an UPGradWeighting, return weights that would be used for the next 1000 steps, etc.
This would be equivalent to:
W = UPGradWeighting()
for i in ...:
losses = ...
if i % 1000 == 0:
gramian = engine.compute_gramian(losses)
weights = W(gramian)
losses.backward(weights)
...So my point is that really a scalarizer could use anything to make its decision, let's not be restrictive here.
Adds a simple README