Hi, thank you for sharing this amazing work!
I’m trying to understand how WebAgent-R1 handles dynamic context compression during M-GRPO training.
From the paper, it seems that after each step the previous observation is simplified into a short “Simplified HTML” version, while the agent keeps the full action history.
What I’m not fully sure about is how this works during training vs rollout:
- During rollout, the model sees the full observation before it’s simplified, and you record the log-prob of the chosen action.
- But during training, since earlier observations have been replaced by simplified versions, how do you make sure the new log-prob is computed on the same input tokens as during rollout?
Could you maybe share the high-level logic or pseudocode of how this is implemented?
(e.g., when exactly you replace the old observation with the simplified one, and when the log-prob is stored).
Thanks again — I’m mainly trying to understand how you keep the rollout and training contexts consistent so the GRPO ratio stays valid.
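To make the question concrete, here is a minimal hypothetical sketch (not the actual WebAgent-R1 code; `simplify`, `build_context`, and the token layout are all assumptions) of one scheme that would keep the ratio valid: make context construction a pure function of the already-simplified history, and call that same function at both rollout and training time.

```python
def simplify(obs_tokens):
    # Stand-in for the "Simplified HTML" step: here it just keeps a short
    # prefix of the observation tokens. The real simplifier is presumably
    # rule-based or learned; this placeholder only needs to be deterministic.
    return obs_tokens[:4]

def build_context(history, current_obs_tokens):
    """Deterministically rebuild the model input for one step.

    `history` is a list of (observation_tokens, action_tokens) pairs from
    earlier steps. Earlier observations appear only in simplified form,
    while the current observation stays full. Because the same pure function
    runs at rollout and at training time, the log-prob of the chosen action
    is computed on identical token sequences in both phases, so the GRPO
    importance ratio exp(logp_new - logp_old) is well defined.
    """
    ctx = []
    for obs_tokens, action_tokens in history:
        ctx += simplify(obs_tokens) + action_tokens
    return ctx + current_obs_tokens

# During rollout, each step would store (history, current_obs_tokens,
# action_tokens, old_logprob); the training loop then calls build_context
# on the stored pieces instead of re-reading the live page.
history = [([1, 2, 3, 4, 5, 6], [9])]       # one earlier (obs, action) step
ctx = build_context(history, [7, 8])        # context for the current step
```

Under this scheme the full observation is never part of any *earlier* step's stored context, so no replacement ever has to happen retroactively; the alternative would be caching the exact rollout token ids per step. Clarifying which of these WebAgent-R1 uses is exactly what I'm after.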