
[Feature] MAPPOLoss + IPPOLoss + MultiAgentGAE + ValueNorm#3748

Open
theap06 wants to merge 2 commits into pytorch:main from theap06:feat/mappo-ippo

Conversation

Contributor

@theap06 theap06 commented May 13, 2026

Context

Multi-agent RL is currently the weakest research surface in torchrl: the only multi-agent loss shipped is QMixerLoss (DQN family, discrete actions). For cooperative continuous-control MARL — where most modern benchmarks live (SMAC, VMAS, PettingZoo MPE, Hanabi, Overcooked) — users have to hand-assemble ClipPPOLoss + manual set_keys(done=("agents", "done"), terminated=("agents", "terminated")) + manual make_value_estimator(GAE, ...). The existing sota-implementations/multiagent/mappo_ippo.py recipe shows what this boilerplate looks like.

This PR adds MAPPO (Yu et al. 2022) and IPPO (de Witt et al. 2020) as first-class objectives, plus the two pieces of supporting infrastructure they need.

What's new

  • torchrl.objectives.multiagent.MAPPOLoss — centralised-critic, decentralised-actor PPO. Subclasses ClipPPOLoss; defaults the value estimator to MultiAgentGAE, defaults normalize_advantage_exclude_dims=(-2,), and optionally accepts a ValueNorm for the critic-stability trick from the paper.
  • torchrl.objectives.multiagent.IPPOLoss — independent-learner counterpart. Each agent has its own local critic; no centralised state required.
  • torchrl.objectives.value.MultiAgentGAE — GAE variant that broadcasts team-shared reward / done / terminated (shape [*B, T, 1]) across the agent dim before the vec-GAE recursion, so users don't have to manually replicate signals or override set_keys. New ValueEstimators.MAGAE enum entry.
  • torchrl.modules.ValueNorm — PopArt-style running value normaliser (van Hasselt et al. 2019), used opt-in by MAPPOLoss. Yu et al. 2022 Table 13 credits this trick with the algorithm's strong SMAC results.
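
To make the MultiAgentGAE broadcast concrete, here is a torchrl-free sketch of the mechanics (a simplified illustration — the names, shapes, and function signature below are made up for this sketch, not the actual implementation): the team-shared per-timestep reward/done signals are replicated across the agent dimension, then the standard GAE backward recursion runs per agent.

```python
# Illustrative sketch only: broadcast a team-shared reward/done signal
# across the agent dim, then run the usual GAE backward recursion
# independently for each agent. Not the actual torchrl code.

def multi_agent_gae(rewards, dones, values, next_values, n_agents,
                    gamma=0.99, lmbda=0.95):
    """rewards/dones: team-shared per-timestep scalars, length T.
    values/next_values: per-agent estimates, shape [T][n_agents]."""
    T = len(rewards)
    # Broadcast step: replicate the shared signals for every agent.
    r = [[rewards[t]] * n_agents for t in range(T)]
    d = [[dones[t]] * n_agents for t in range(T)]
    adv = [[0.0] * n_agents for _ in range(T)]
    for a in range(n_agents):
        gae = 0.0
        for t in reversed(range(T)):
            not_done = 1.0 - d[t][a]
            delta = r[t][a] + gamma * next_values[t][a] * not_done - values[t][a]
            gae = delta + gamma * lmbda * not_done * gae
            adv[t][a] = gae
    return adv
```

Because the shared signal is replicated before the recursion, every agent sees the same reward/done trajectory, which is exactly why the manual set_keys(done=("agents", "done"), ...) wiring becomes unnecessary.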

Design notes

Two classes instead of a centralized: bool flag. The structural code difference between MAPPO and IPPO is small (~20 lines), but I made them separate named classes rather than a single class with a flag because:

  • Recent review feedback on HER was explicit about avoiding wrapper-in-wrapper / "sampler-in-sampler" APIs. A boolean flag on a single class is the same pattern, shifted to losses.
  • from torchrl.objectives.multiagent import MAPPOLoss is self-documenting; the docstring spells out the full recipe (centralised critic construction, etc.) for each algorithm independently.

MAGAE dispatch in plain PPO / A2C / Reinforce. Adding ValueEstimators.MAGAE to the enum would break every parent test that parametrises over list(ValueEstimators) unless every make_value_estimator knows the new enum value. Two options: (a) update ~29 test parametrisations to skip MAGAE, or (b) have plain PPO / A2C / Reinforce dispatch MAGAE to MultiAgentGAE. I went with (b) — it's ~5 lines per file, leaves the enum exhaustive, and is the right thing semantically (any actor-critic with the right data shapes can use MAGAE).
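
Option (b) boils down to a small dispatch branch keyed on the enum. A torchrl-free sketch of the pattern (the class and function names below are illustrative stand-ins, not the actual torchrl signatures):

```python
# Illustrative sketch of option (b): plain actor-critic losses dispatch
# the MAGAE enum value to the multi-agent estimator instead of raising,
# so the enum stays exhaustive across all make_value_estimator methods.
from enum import Enum


class ValueEstimators(Enum):  # stand-in for torchrl's enum
    GAE = "gae"
    MAGAE = "magae"


class GAE:
    """Stand-in for the single-agent GAE estimator."""


class MultiAgentGAE(GAE):
    """Stand-in for the broadcasting multi-agent variant."""


def make_value_estimator(value_type):
    # ~5 lines per loss file: MAGAE maps to the multi-agent estimator.
    if value_type is ValueEstimators.MAGAE:
        return MultiAgentGAE()
    if value_type is ValueEstimators.GAE:
        return GAE()
    raise NotImplementedError(value_type)
```

Since MultiAgentGAE subclasses GAE here, any code written against the base estimator keeps working when MAGAE is selected — which is the semantic point made above: any actor-critic with the right data shapes can use it.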

ValueNorm placement. Lives under torchrl/modules/ rather than torchrl/objectives/utils/ because it's a stateful learnable component that participates in .to(device) / state_dict(). Happy to move if reviewers prefer otherwise.
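
For reviewers unfamiliar with the trick: a minimal, framework-free sketch of a PopArt-style running value normaliser (EMA of the mean and mean-square with debiasing, in the spirit of van Hasselt et al. 2019). The class and method names here are hypothetical — the real module is a torch nn.Module with learnable state, as noted above.

```python
# Hypothetical sketch of a PopArt-style value normaliser: track an EMA
# of the mean and mean-square of value targets, normalise targets for
# the critic loss, denormalise critic outputs for advantage computation.
import math


class ValueNormSketch:
    def __init__(self, beta=0.995, eps=1e-5):
        self.beta = beta
        self.eps = eps
        self.mean = 0.0      # EMA of targets
        self.mean_sq = 0.0   # EMA of squared targets
        self.debias = 0.0    # corrects EMA bias at early steps

    def update(self, targets):
        batch_mean = sum(targets) / len(targets)
        batch_mean_sq = sum(t * t for t in targets) / len(targets)
        self.mean = self.beta * self.mean + (1 - self.beta) * batch_mean
        self.mean_sq = self.beta * self.mean_sq + (1 - self.beta) * batch_mean_sq
        self.debias = self.beta * self.debias + (1 - self.beta)

    def _stats(self):
        mean = self.mean / self.debias
        var = max(self.mean_sq / self.debias - mean * mean, 0.0)
        return mean, math.sqrt(var + self.eps)

    def normalize(self, x):
        mean, std = self._stats()
        return (x - mean) / std

    def denormalize(self, x):
        mean, std = self._stats()
        return x * std + mean
```

The normalise/denormalise round trip is what keeps the critic loss bounded when returns inflate (the 10× reward-inflation test below exercises exactly this), and the running statistics are why the component belongs with stateful modules rather than stateless objective utilities.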

Out of scope (follow-up)

  • HAPPO / sequential update scheme (Kuba et al. 2022)
  • Multi-Agent Transformer (MAT)
  • Refactoring sota-implementations/multiagent/mappo_ippo.py to use the new classes — left untouched in this PR to keep the blast radius small; can be a one-line follow-up.

Verification

  • pytest test/objectives/test_mappo.py — 16/16 passing. Synthetic-tensordict tests for forward shapes, backward, centralised-vs-decentralised critic semantics, share-params modes, ValueNorm convergence, and critic-loss bounded-ness under 10× reward inflation.
  • pytest test/test_cost.py -k "ppo or qmixer or a2c or reinforce" — 2394/2394 passing (no regressions).
  • Full test_cost.py — 8788 passing, 1 pre-existing unrelated failure (test_exploration_compile: torch.compile + torch.utils.mkldnn deprecation, no MAPPO involvement).
  • examples/multiagent/mappo_vmas.py --algo mappo --frames 200_000 provides a minimal end-to-end smoke recipe on VMAS Navigation.


pytorch-bot Bot commented May 13, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/3748

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

⚠️ 16 Awaiting Approval

As of commit 5c55c31 with merge base cc31dc3:

AWAITING APPROVAL - The following workflows need approval before CI can run:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 13, 2026
@theap06 theap06 force-pushed the feat/mappo-ippo branch from 0d8dfea to d1f1eb0 Compare May 13, 2026 09:47
@Xmaster6y
Contributor

I think you're right about making MA training more straightforward, but I have some concerns:

  • MultiAgentGAE.forward seems to duplicate most of GAE.forward; maybe an additional level of abstraction is needed.
  • ValueNorm should be more generic and less tied to MAPPO
  • We might need a registry for value estimators instead of enums
  • We should maybe handle/consider potential compatibility issues with MAGAE for other algs

@theap06
Contributor Author

theap06 commented May 13, 2026

> I think you're right about making MA training more straightforward, but I have some concerns:
>
>   • MultiAgentGAE.forward seems to duplicate most of GAE.forward; maybe an additional level of abstraction is needed.
>   • ValueNorm should be more generic and less tied to MAPPO
>   • We might need a registry for value estimators instead of enums
>   • We should maybe handle/consider potential compatibility issues with MAGAE for other algs

I think the extra layers of abstraction make sense. A value-estimator registry would also help, since the collectors make use of the estimators as well. For the compatibility concerns, I can write test cases to ensure MAGAE doesn't impact existing algorithms.


Labels

CLA Signed · Documentation · Examples · Feature · Integrations/torch_geometric · Modules · Objectives
