[Feature] MAPPOLoss + IPPOLoss + MultiAgentGAE + ValueNorm #3748

Open · theap06 wants to merge 2 commits
Conversation
Contributor:
I think you're right about making MA training more straightforward, but I have some concerns:
Author:
I think the layers of abstraction make sense. I think the value estimators would help because they hook into the collectors as well. For the compatibility issues, I can write up some test cases to ensure it doesn't impact existing algos.
Context
Multi-agent RL is currently the weakest research surface in torchrl: the only multi-agent loss shipped is `QMixerLoss` (DQN family, discrete actions). For cooperative continuous-control MARL — where most modern benchmarks live (SMAC, VMAS, PettingZoo MPE, Hanabi, Overcooked) — users have to hand-assemble `ClipPPOLoss` + manual `set_keys(done=("agents", "done"), terminated=("agents", "terminated"))` + manual `make_value_estimator(GAE, ...)`. The existing `sota-implementations/multiagent/mappo_ippo.py` recipe shows what this boilerplate looks like; a condensed sketch follows. This PR adds MAPPO (Yu et al. 2022) and IPPO (de Witt et al. 2020) as first-class objectives, plus the two pieces of supporting infrastructure they need.
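For concreteness, a condensed, paraphrased sketch of that recipe (based on `sota-implementations/multiagent/mappo_ippo.py`; `policy`, `critic`, and `env` stand for the actor, critic, and environment built earlier in that script):

```python
from torchrl.objectives import ClipPPOLoss, ValueEstimators

loss_module = ClipPPOLoss(
    actor_network=policy,
    critic_network=critic,
    clip_epsilon=0.2,
)
# Multi-agent wiring every user currently has to remember by hand:
loss_module.set_keys(
    reward=env.reward_key,
    action=env.action_key,
    done=("agents", "done"),
    terminated=("agents", "terminated"),
)
loss_module.make_value_estimator(ValueEstimators.GAE, gamma=0.9, lmbda=0.9)
```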
What's new
- `torchrl.objectives.multiagent.MAPPOLoss` — centralised-critic, decentralised-actor PPO. Subclasses `ClipPPOLoss`; defaults the value estimator to `MultiAgentGAE`, defaults `normalize_advantage_exclude_dims=(-2,)`, and optionally accepts a `ValueNorm` for the critic-stability trick from the paper (see the usage sketch after this list).
- `torchrl.objectives.multiagent.IPPOLoss` — independent-learner counterpart. Each agent has its own local critic; no centralised state required.
- `torchrl.objectives.value.MultiAgentGAE` — `GAE` variant that broadcasts team-shared reward/done/terminated (shape `[*B, T, 1]`) across the agent dim before the vec-GAE recursion, so users don't have to manually replicate signals or override `set_keys`. New `ValueEstimators.MAGAE` enum entry.
- `torchrl.modules.ValueNorm` — PopArt-style running value normaliser (van Hasselt et al. 2019), used opt-in by `MAPPOLoss`. Yu et al. 2022 Table 13 credits this trick with the algorithm's strong SMAC results.
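A minimal usage sketch of the new surface. The actor/critic construction below follows the existing torchrl multi-agent tutorial (`MultiAgentMLP`, `ProbabilisticActor`); the `MAPPOLoss` keyword names (in particular `value_norm`) are inferred from this description and should be treated as assumptions, not the final signature:

```python
import torch
from tensordict.nn import NormalParamExtractor, TensorDictModule
from torchrl.modules import MultiAgentMLP, ProbabilisticActor, TanhNormal, ValueNorm
from torchrl.objectives.multiagent import MAPPOLoss  # new in this PR

n_agents, obs_dim, act_dim = 3, 8, 2

# Decentralised policy: each agent maps its own observation to loc/scale.
policy = ProbabilisticActor(
    module=TensorDictModule(
        torch.nn.Sequential(
            MultiAgentMLP(
                n_agent_inputs=obs_dim,
                n_agent_outputs=2 * act_dim,
                n_agents=n_agents,
                centralised=False,
                share_params=True,
            ),
            NormalParamExtractor(),
        ),
        in_keys=[("agents", "observation")],
        out_keys=[("agents", "loc"), ("agents", "scale")],
    ),
    in_keys=[("agents", "loc"), ("agents", "scale")],
    out_keys=[("agents", "action")],
    distribution_class=TanhNormal,
    return_log_prob=True,
)

# Centralised critic: each agent's value conditions on all agents' observations.
critic = TensorDictModule(
    MultiAgentMLP(
        n_agent_inputs=obs_dim,
        n_agent_outputs=1,
        n_agents=n_agents,
        centralised=True,
        share_params=True,
    ),
    in_keys=[("agents", "observation")],
    out_keys=[("agents", "state_value")],
)

loss_module = MAPPOLoss(
    actor_network=policy,
    critic_network=critic,
    clip_epsilon=0.2,
    value_norm=ValueNorm(),  # opt-in critic-stability trick; kwarg name assumed
)
# MultiAgentGAE is the default: no manual set_keys(done=("agents", "done"), ...)
# and no explicit make_value_estimator(ValueEstimators.GAE, ...) wiring needed.
loss_module.make_value_estimator(gamma=0.9, lmbda=0.9)
```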
Design notes

Two classes instead of a `centralized: bool` flag. The structural code difference between MAPPO and IPPO is small (~20 lines), but I made them separate named classes rather than a single class with a flag because `from torchrl.objectives.multiagent import MAPPOLoss` is self-documenting; the docstring spells out the full recipe (centralised critic construction, etc.) for each algorithm independently.

MAGAE dispatch in plain PPO / A2C / Reinforce. Adding `ValueEstimators.MAGAE` to the enum would break every parent test that parametrises over `list(ValueEstimators)` unless every `make_value_estimator` knows the new enum value. Two options: (a) update ~29 test parametrisations to skip MAGAE, or (b) have plain PPO / A2C / Reinforce dispatch MAGAE to `MultiAgentGAE`. I went with (b) — it's ~5 lines per file, leaves the enum exhaustive, and is the right thing semantically (any actor-critic with the right data shapes can use MAGAE); a user-side sketch follows.
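To show (b) from the user's side: a hypothetical call against plain `ClipPPOLoss`, reusing the `policy`/`critic` modules from the sketch above (estimator kwargs are assumptions):

```python
from torchrl.objectives import ClipPPOLoss, ValueEstimators

# Plain PPO, no multi-agent subclass involved; the usual multi-agent
# set_keys(...) wiring from the Context sketch still applies.
plain_loss = ClipPPOLoss(actor_network=policy, critic_network=critic)

# Dispatch option (b): the enum stays exhaustive, and plain losses resolve
# ValueEstimators.MAGAE to MultiAgentGAE internally.
plain_loss.make_value_estimator(ValueEstimators.MAGAE, gamma=0.9, lmbda=0.9)
```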
ValueNorm placement. Lives under `torchrl/modules/` rather than `torchrl/objectives/utils/` because it's a stateful learnable component that participates in `.to(device)` / `state_dict()`; a buffer-based sketch of the idea follows. Happy to move if reviewers prefer otherwise.
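To make the statefulness argument concrete, a minimal sketch of what a PopArt-style running normaliser looks like as an `nn.Module` (this is not the PR's implementation; the class name, hyperparameters, and EMA scheme are illustrative). Full PopArt additionally rescales the critic's output layer so predictions are preserved when the statistics shift; only the normalisation half is sketched here:

```python
import torch
from torch import nn

class RunningValueNorm(nn.Module):  # hypothetical name, not the PR's class
    """Debiased EMA of value-target mean/std, stored as buffers so the
    statistics follow .to(device) and are saved in state_dict()."""

    def __init__(self, beta: float = 0.995, eps: float = 1e-5):
        super().__init__()
        self.beta, self.eps = beta, eps
        self.register_buffer("mean", torch.zeros(1))
        self.register_buffer("mean_sq", torch.zeros(1))
        self.register_buffer("debias", torch.zeros(1))

    @torch.no_grad()
    def update(self, targets: torch.Tensor) -> None:
        # Exponential moving averages of the batch statistics.
        self.mean.mul_(self.beta).add_((1 - self.beta) * targets.mean())
        self.mean_sq.mul_(self.beta).add_((1 - self.beta) * (targets**2).mean())
        self.debias.mul_(self.beta).add_(1 - self.beta)

    def _stats(self) -> tuple[torch.Tensor, torch.Tensor]:
        # Debias the EMAs, then recover std via var = E[x^2] - E[x]^2.
        mean = self.mean / self.debias.clamp(min=self.eps)
        var = self.mean_sq / self.debias.clamp(min=self.eps) - mean**2
        return mean, var.clamp(min=self.eps).sqrt()

    def normalize(self, x: torch.Tensor) -> torch.Tensor:
        mean, std = self._stats()
        return (x - mean) / std

    def denormalize(self, x: torch.Tensor) -> torch.Tensor:
        mean, std = self._stats()
        return x * std + mean
```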
Out of scope (follow-up)

- Porting `sota-implementations/multiagent/mappo_ippo.py` to use the new classes — left untouched in this PR to keep the blast radius small; can be a one-line follow-up.
Verification

- `pytest test/objectives/test_mappo.py` — 16/16 passing. Synthetic-tensordict tests for forward shapes, backward, centralised-vs-decentralised critic semantics, share-params modes, ValueNorm convergence, and critic-loss boundedness under 10× reward inflation.
- `pytest test/test_cost.py -k "ppo or qmixer or a2c or reinforce"` — 2394/2394 passing (no regressions).
- Full `test_cost.py` — 8788 passing, 1 pre-existing unrelated failure (`test_exploration_compile` — `torch.compile` + `torch.utils.mkldnn` deprecation, no MAPPO involvement).
- `examples/multiagent/mappo_vmas.py --algo mappo --frames 200_000` provides a minimal end-to-end smoke recipe on VMAS Navigation.