
Question about how dynamic context compression is handled during training #8

@googlelab123

Hi, thank you for sharing this amazing work!

I’m trying to understand how WebAgent-R1 handles dynamic context compression during M-GRPO training.
From the paper, it seems that after each step the previous observation is simplified into a short “Simplified HTML” version, while the agent keeps the full action history.
What I’m not fully sure about is how this works during training vs rollout:

  • During rollout, the model sees the full observation before it’s simplified, and you record the log-prob of the chosen action.
  • But during training, since earlier observations have been replaced by simplified versions, how do you make sure the new log-prob is computed on the same input tokens as during rollout?

Could you share the high-level logic or pseudocode for how this is implemented?
(For example, when exactly the old observation is replaced with the simplified one, and when the log-prob is stored.)
Thanks again. I’m mainly trying to understand how you keep the rollout and training contexts consistent so that the GRPO ratio stays valid.
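
To make the question concrete, here is a minimal sketch of what I would guess the logic looks like. Everything in it is my own assumption, not something taken from the paper or the repo: the names `StepRecord`, `simplify`, `build_context`, `rollout`, and `grpo_ratios` are hypothetical, the truncation in `simplify` is only a stand-in for the real "Simplified HTML" step, and `policy_logprob` is a placeholder for the actual model call. The idea is that each step stores the exact context the policy saw (simplified earlier observations plus the full current one), so training can recompute the log-prob on the same input.

```python
import math
from dataclasses import dataclass


@dataclass
class StepRecord:
    context: str        # exact prompt the policy saw at this step
    action: str         # action chosen at this step
    old_logprob: float  # log-prob recorded at rollout time


def simplify(obs: str) -> str:
    """Hypothetical stand-in for the 'Simplified HTML' compression."""
    return obs[:20] + "..." if len(obs) > 20 else obs


def build_context(history, current_obs):
    """History holds (simplified_obs, action) pairs; only the current obs is full."""
    parts = [f"[obs]{o}[act]{a}" for o, a in history]
    parts.append(f"[obs]{current_obs}")
    return "".join(parts)


def rollout(observations, policy_logprob, choose_action):
    """Collect one trajectory, snapshotting the per-step context."""
    history, records = [], []
    for obs in observations:
        ctx = build_context(history, obs)  # full current obs, simplified past
        action = choose_action(ctx)
        records.append(StepRecord(ctx, action, policy_logprob(ctx, action)))
        # Compress the observation only after the log-prob has been stored.
        history.append((simplify(obs), action))
    return records


def grpo_ratios(records, policy_logprob):
    """Recompute log-probs on the stored contexts and form pi_new / pi_old."""
    return [
        math.exp(policy_logprob(r.context, r.action) - r.old_logprob)
        for r in records
    ]
```

If this guess is right, each step would be a separate training sample with its own context, and the ratio would be exactly 1 before the first policy update. Is that roughly how it works, or is the trajectory trained as one concatenated sequence?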
