[RLlib] Reset EnvRunner on Connector error #61523

Open
ArturNiederfahrenhorst wants to merge 4 commits into ray-project:master from ArturNiederfahrenhorst:resetonconnectorerror

@ArturNiederfahrenhorst
Contributor

Description

This PR makes the next EnvRunner.sample() call reset the EnvRunner if the previous call raised an error after consuming self._cached_to_module but before setting it again.

This is useful when a (probably user-provided) connector is stochastic and fails on certain observations, such as an observation of a state that is not well defined. In that case, the connector raises an exception and leaves the EnvRunner sampling loop in a broken state, so we should reset.
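The failure mode described above can be illustrated with a minimal, self-contained sketch. ToyEnvRunner, flaky_connector, and num_resets are hypothetical names invented for this example; this is not RLlib's EnvRunner code, only a simplified model of a cache that is consumed and then never restored because the connector raised.

```python
import logging

logger = logging.getLogger(__name__)


class ToyEnvRunner:
    """Hypothetical stand-in for an EnvRunner; not RLlib's actual class."""

    def __init__(self, connector):
        self._connector = connector
        self._cached_to_module = None
        self.num_resets = 0
        self._obs = 0
        self._reset_envs()

    def _reset_envs(self):
        # Start over: fresh observation and a freshly computed connector output.
        self.num_resets += 1
        self._obs = 0
        self._cached_to_module = self._connector(self._obs)

    def sample(self, num_timesteps):
        # A missing cache means an earlier call consumed it and then failed
        # before storing a new one, so the loop state cannot be trusted.
        if self._cached_to_module is None:
            logger.warning("Error in sample call detected. Resetting.")
            self._reset_envs()

        collected = []
        for _ in range(num_timesteps):
            to_module = self._cached_to_module
            self._cached_to_module = None  # The cache is now consumed.
            collected.append(to_module)
            self._obs += 1
            # May raise; if it does, the cache stays None.
            self._cached_to_module = self._connector(self._obs)
        return collected


def flaky_connector(obs):
    # Hypothetical connector that fails on one "ill-defined" observation.
    if obs == 2:
        raise ValueError(f"ill-defined observation: {obs}")
    return {"obs": obs}
```

In this sketch, a sample() call that reaches the failing observation propagates the ValueError and leaves the cache empty. The following sample() call notices the empty cache, resets, and samples from a fresh start instead of continuing from a half-updated state.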

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
@ArturNiederfahrenhorst ArturNiederfahrenhorst requested a review from a team as a code owner March 5, 2026 10:50
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a recovery mechanism for EnvRunner when a connector error occurs during sampling. By checking if self._cached_to_module is None at the beginning of a sample() call, it detects a previously failed state and forces a reset of the environment runner. This is a good addition for robustness, especially with potentially unstable user-provided connectors.

My main feedback is regarding code duplication. The logic to handle the connector error is nearly identical in both MultiAgentEnvRunner and SingleAgentEnvRunner. I've added comments suggesting a refactoring to move this logic into the base EnvRunner class to improve maintainability.
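One possible shape for that refactoring is sketched below. All class and method names (BaseRunner, _needs_error_reset, SingleAgentRunner, MultiAgentRunner) are hypothetical and chosen for illustration; the sketch only shows the idea of keeping the check in one shared place, not how RLlib's classes are actually laid out.

```python
import logging

logger = logging.getLogger(__name__)


class BaseRunner:
    """Hypothetical base class; names do not mirror RLlib's real API."""

    def __init__(self):
        self._cached_to_module = None

    def _needs_error_reset(self, reset_required: bool) -> bool:
        # Shared check: a missing cache without an explicitly requested reset
        # is treated as the trace of a failed earlier sample call.
        if self._cached_to_module is None and not reset_required:
            logger.warning(
                "Error in sample call detected. Resetting envs and episodes."
            )
            return True
        return reset_required


class SingleAgentRunner(BaseRunner):
    def sample(self, reset_required: bool = False) -> bool:
        # Returns the effective reset decision; real sampling is omitted.
        return self._needs_error_reset(reset_required)


class MultiAgentRunner(BaseRunner):
    def sample(self, reset_required: bool = False) -> bool:
        # Same decision logic, inherited instead of duplicated.
        return self._needs_error_reset(reset_required)
```

With the check in the base class, a later change to the detection condition would only need to be made once.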

Note: Security Review did not run due to the size of the PR.

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
@ray-gardener ray-gardener bot added the rllib RLlib related issues label Mar 5, 2026

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

logger.warning(
"Error in sample call detected. Resetting envs and episodes to start over. You can ignore this warning if a connector is expectedly unstable."
)
reset_required = True
Spurious reset when module is None with random actions

Medium Severity

The condition self._cached_to_module is None and not reset_required assumes that a None value for _cached_to_module always indicates a prior error, but _cached_to_module is also legitimately None when self.module is None — a supported configuration for random_actions=True mode. Both _reset_envs() and the sampling loop only set _cached_to_module inside if self.module guards. This causes every subsequent _sample() call with num_timesteps to unnecessarily reset all environments, discarding ongoing episode state. The condition needs to also verify self.module is not None.
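A minimal sketch of the stricter condition suggested here, written as a standalone function so it can be checked in isolation. The helper name needs_error_reset and its parameters are assumptions made for this example; in the PR the check would presumably live inline in the sample logic and read the attributes from self.

```python
def needs_error_reset(cached_to_module, module, reset_required):
    """Hypothetical helper: does a missing cache signal an earlier error?

    With no module (for example, random-action sampling), the cache is never
    populated, so a missing cache carries no information about errors.
    """
    return (
        module is not None
        and cached_to_module is None
        and not reset_required
    )
```

Under this condition, a runner without a module never triggers the recovery reset, while a runner with a module and an empty cache still does.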

Additional Locations (1)
