RL custom reward structure #348
Open
Currently, PyTAG uses the TAG score mechanism directly as the reinforcement-learning reward signal. While this is sufficient for some environments, it makes it hard to define intermediate rewards or reward functions that are non-monotonic with respect to the final game score, both of which are often necessary for long-horizon or phase-based games.

To address this, I extended the default AbstractState class with a new method, getReward(). By default, this method simply returns the game score, which keeps full backward compatibility and preserves the behavior of all existing RL environments. PyTAG now queries the state’s getReward() method rather than reading the score directly. Developers may optionally override this method to define a custom reward structure that differs from the final scoring mechanism.

In the case of Power Grid, this allows the score to remain consistent with the official game rules while enabling a separate reward signal (e.g., intermediate rewards during bureaucracy phases). As a result, reward design is cleanly decoupled from score computation, providing greater flexibility without breaking existing implementations.
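To illustrate the pattern described above, here is a minimal, language-neutral sketch written in Python. It is not the code in this PR: the real classes live in TAG's Java code, and every name below (`AbstractState` as a Python class, `get_score`, `get_reward`, `PhasedState`, `phase_bonus`, `step_reward`) is a hypothetical stand-in chosen for illustration. It only shows how a default reward that falls back to the score keeps existing environments unchanged, while a subclass can return a different signal during a given phase.

```python
class AbstractState:
    """Illustrative stand-in for the base state class (not the real TAG class)."""

    def __init__(self):
        self.scores = {}  # player id -> current game score

    def get_score(self, player_id):
        return self.scores.get(player_id, 0.0)

    def get_reward(self, player_id):
        # Default: the reward is the score, so states that do not
        # override this method behave exactly as before.
        return self.get_score(player_id)


class PhasedState(AbstractState):
    """Hypothetical phase-based game that pays an intermediate reward."""

    def __init__(self):
        super().__init__()
        self.phase = "auction"
        self.phase_bonus = {}  # player id -> assumed intermediate reward

    def get_reward(self, player_id):
        # Assumed rule for illustration: during the bureaucracy phase the
        # reward comes from a separate signal; the score is left untouched.
        if self.phase == "bureaucracy":
            return self.phase_bonus.get(player_id, 0.0)
        return self.get_score(player_id)


def step_reward(state, player_id):
    # The RL wrapper asks the state for its reward instead of its score.
    return state.get_reward(player_id)
```

With this shape, a state that never overrides `get_reward` yields the same reward as before, and only games that opt in (such as a phase-based one) see a different signal, while `get_score` keeps reporting the rules-defined score.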