[RLLib] Reset EvnRunner on Connector error by ArturNiederfahrenhorst · Pull Request #61523 · ray-project/ray

ArturNiederfahrenhorst · 2026-03-05T10:50:02Z

Description

With this PR, if an EnvRunner.sample() call errors out after self._cached_to_module has been consumed but before it has been set again, the next sample() call resets the EnvRunner.

This is useful when a (probably user-provided) connector has some stochasticity and fails when certain observations appear, such as an observation of a state that is not well defined. In that case, the connector raises an exception and leaves the EnvRunner sampling loop broken, so we should reset.
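The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not RLlib code: `ToyEnvRunner`, `FlakyConnector`, and the counter-style environment are hypothetical stand-ins, and only the attribute name `_cached_to_module` and the `reset_required` flag are taken from this PR.

```python
class FlakyConnector:
    """Hypothetical connector that fails once, on one chosen observation."""

    def __init__(self, bad_obs):
        self.bad_obs = bad_obs
        self.failed = False

    def __call__(self, obs):
        if obs == self.bad_obs and not self.failed:
            self.failed = True
            raise ValueError(f"cannot process observation {obs}")
        return {"obs": obs}


class ToyEnvRunner:
    """Toy stand-in for an EnvRunner (assumption: not the real class)."""

    def __init__(self, connector):
        self.connector = connector
        self.reset_count = 0
        self._cached_to_module = None
        self._needs_initial_reset = True

    def _reset(self):
        # The toy environment always restarts at observation 0.
        self.reset_count += 1
        self._needs_initial_reset = False
        self._cached_to_module = self.connector(0)

    def sample(self, num_timesteps, force_reset=False):
        reset_required = force_reset or self._needs_initial_reset
        # Recovery check: a missing cache without a pending reset means
        # the previous call died between consuming and re-setting it.
        if self._cached_to_module is None and not reset_required:
            reset_required = True
        if reset_required:
            self._reset()

        observations = []
        for _ in range(num_timesteps):
            to_module = self._cached_to_module
            self._cached_to_module = None  # consumed
            next_obs = to_module["obs"] + 1  # toy env step
            # The connector may raise here, leaving the cache as None.
            self._cached_to_module = self.connector(next_obs)
            observations.append(next_obs)
        return observations
```

In this sketch, a first `sample(3)` call with `FlakyConnector(bad_obs=2)` raises and leaves `_cached_to_module` as `None`; the following call notices that and resets before sampling again.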

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

rllib/env/single_agent_env_runner.py

gemini-code-assist

Code Review

This pull request introduces a recovery mechanism for EnvRunner when a connector error occurs during sampling. By checking if self._cached_to_module is None at the beginning of a sample() call, it detects a previously failed state and forces a reset of the environment runner. This is a good addition for robustness, especially with potentially unstable user-provided connectors.

My main feedback is regarding code duplication. The logic to handle the connector error is nearly identical in both MultiAgentEnvRunner and SingleAgentEnvRunner. I've added comments suggesting a refactoring to move this logic into the base EnvRunner class to improve maintainability.
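One way the suggested refactoring could look is sketched below. The class and method names (`BaseRunnerSketch`, `_check_broken_sample_state`) are hypothetical and do not exist in RLlib; the warn-once flag is a plain instance attribute standing in for whatever log-once mechanism the PR uses, and the `self.module is not None` guard anticipates the Cursor Bugbot comment further down.

```python
import logging

logger = logging.getLogger(__name__)


class BaseRunnerSketch:
    """Hypothetical shared parent; stands in for the base EnvRunner."""

    def __init__(self, module=None):
        self.module = module
        self._cached_to_module = None
        self._warned_about_reset = False

    def _check_broken_sample_state(self, reset_required: bool) -> bool:
        """Return the reset flag, upgraded to True if a prior call broke."""
        broken = (
            self.module is not None
            and self._cached_to_module is None
            and not reset_required
        )
        if broken and not self._warned_about_reset:
            # Warn only once per runner to avoid flooding the logs.
            logger.warning(
                "Error in sample call detected. "
                "Resetting envs and episodes to start over."
            )
            self._warned_about_reset = True
        return reset_required or broken


class SingleAgentSketch(BaseRunnerSketch):
    """Would call self._check_broken_sample_state() at the top of sample()."""


class MultiAgentSketch(BaseRunnerSketch):
    """Shares the same check instead of duplicating it."""
```

With the check in one place, both runner types would inherit any later fix to the condition without a second edit.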

Note: Security Review did not run due to the size of the PR.

rllib/env/multi_agent_env_runner.py

rllib/env/single_agent_env_runner.py

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-03-05T13:29:10Z

rllib/env/single_agent_env_runner.py

+                logger.warning(
+                    "Error in sample call detected. Resetting envs and episodes to start over. You can ignore this warning if a connector is expectedly unstable."
+                )
+            reset_required = True


Spurious reset when module is None with random actions

Medium Severity

The condition self._cached_to_module is None and not reset_required assumes that a None value for _cached_to_module always indicates a prior error, but _cached_to_module is also legitimately None when self.module is None — a supported configuration for random_actions=True mode. Both _reset_envs() and the sampling loop only set _cached_to_module inside if self.module guards. This causes every subsequent _sample() call with num_timesteps to unnecessarily reset all environments, discarding ongoing episode state. The condition needs to also verify self.module is not None.

Additional Locations (1)

rllib/env/multi_agent_env_runner.py#L293-L302
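The condition proposed in this comment can be sketched as a small predicate. `needs_recovery_reset` is a hypothetical helper written for illustration only; the actual change would live inline in the runners' `_sample()` methods.

```python
def needs_recovery_reset(cached_to_module, module, reset_required):
    """Return True only if a missing cache indicates a prior failed call.

    With module=None (random-actions mode), the cache is never populated,
    so a None cache is normal there and must not trigger a reset.
    """
    return (
        module is not None
        and cached_to_module is None
        and not reset_required
    )


# A module exists but the cache is gone: the previous call failed.
print(needs_recovery_reset(None, object(), False))
# Random-actions mode: no module, so a None cache is expected.
print(needs_recovery_reset(None, None, False))
```

The first call reports that a reset is needed, while the second does not, which avoids discarding ongoing episode state on every call in random-actions mode.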

ArturNiederfahrenhorst added 2 commits March 5, 2026 11:33

initial

67055b9

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

log once

2b30c7d

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

ArturNiederfahrenhorst requested a review from a team as a code owner March 5, 2026 10:50

cursor bot reviewed Mar 5, 2026

rllib/env/single_agent_env_runner.py (outdated)

gemini-code-assist bot reviewed Mar 5, 2026

rllib/env/multi_agent_env_runner.py (outdated)

rllib/env/single_agent_env_runner.py (outdated)

ArturNiederfahrenhorst added 2 commits March 5, 2026 11:57

cursor comment

ceecbf9

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

fix log_once

3871c2e

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

ray-gardener bot added the rllib RLlib related issues label Mar 5, 2026

cursor bot reviewed Mar 5, 2026

ArturNiederfahrenhorst wants to merge 4 commits into ray-project:master from ArturNiederfahrenhorst:resetonconnectorerror
