Hi, thank you for sharing this amazing work!
I’m trying to understand how WebAgent-R1 handles dynamic context compression during M-GRPO training.
From the paper, it seems that after each step the previous observation is simplified into a short “Simplified HTML” version, while the agent keeps the full action history.
What I’m not fully sure about is how this works during training vs rollout:
- During rollout, the model sees the full observation before it’s simplified, and you record the log-prob of the chosen action.
- But during training, since earlier observations have been replaced by simplified versions, how do you make sure the new log-prob is computed on the same input tokens as during rollout?
Could you maybe share the high-level logic or pseudocode of how this is implemented?
(e.g., when exactly you replace the old observation with the simplified one, and when the log-prob is stored).
Thanks again — I’m mainly trying to understand how you keep the rollout and training contexts consistent so the GRPO ratio stays valid.
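To make the question concrete, here is a minimal hypothetical sketch (not the actual WebAgent-R1 code; `simplify`, `build_context`, and the token layout are all assumptions) of one scheme that would keep the ratio valid: make context construction a pure function of the already-simplified history, and call that same function at both rollout and training time.

```python
def simplify(obs_tokens):
    # Stand-in for the "Simplified HTML" step: here it just keeps a short
    # prefix of the observation tokens. The real simplifier is presumably
    # rule-based or learned; this placeholder only needs to be deterministic.
    return obs_tokens[:4]

def build_context(history, current_obs_tokens):
    """Deterministically rebuild the model input for one step.

    `history` is a list of (observation_tokens, action_tokens) pairs from
    earlier steps. Earlier observations appear only in simplified form,
    while the current observation stays full. Because the same pure function
    runs at rollout and at training time, the log-prob of the chosen action
    is computed on identical token sequences in both phases, so the GRPO
    importance ratio exp(logp_new - logp_old) is well defined.
    """
    ctx = []
    for obs_tokens, action_tokens in history:
        ctx += simplify(obs_tokens) + action_tokens
    return ctx + current_obs_tokens

# During rollout, each step would store (history, current_obs_tokens,
# action_tokens, old_logprob); the training loop then calls build_context
# on the stored pieces instead of re-reading the live page.
history = [([1, 2, 3, 4, 5, 6], [9])]       # one earlier (obs, action) step
ctx = build_context(history, [7, 8])        # context for the current step
```

Under this scheme the full observation is never part of any *earlier* step's stored context, so no replacement ever has to happen retroactively; the alternative would be caching the exact rollout token ids per step. Clarifying which of these WebAgent-R1 uses is exactly what I'm after.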