This repository implements a multi-stage reinforcement learning framework that plays Pokémon Gen 9 VGC (Regulation H) in a fixed metagame. The training procedure combines trajectory bootstrapping via Behavioural Cloning with Proximal Policy Optimization (PPO) in a league-based self-play setup (partially inspired by AlphaStar and OpenAI Five).
The project relies on a local instance of the Pokémon Showdown server. Initialize the submodules to fetch the source:
git submodule update --init --recursive
uv is used for Python 3.13 environment and dependency management:
uv python install 3.13
uv sync
Install the Node.js dependencies for the Pokémon Showdown server:
cd pokemon-showdown && npm install && cd ..
A full training run follows a sequential pipeline:
1. Start a local Showdown server.
   cd pokemon-showdown && node pokemon-showdown start --no-security
2. Generate replays from heuristics.
   uv run python src/replay_gen.py -n 500
   After this step, you can stop the local showdown from step 1.
3. Create the initial pool of policies through behaviour cloning across the heuristic data.
   uv run python src/seed_pool.py
4. Launch the PPO training loop. This script manages its own pokemon-showdown server processes.
   uv run python src/train_loop.py
The pipeline begins by recording battles between several heuristic agents (FuzzyHeuristic, SimpleHeuristic, and MaxBasePower). MaxBasePower and SimpleHeuristic come from the implementations in poke-env, and FuzzyHeuristic can be found in src/heuristic.py. They aren't particularly strong (an average human player should be able to beat them comfortably), but they provide a baseline for weeding out poor moves, such as unnecessarily KOing your own teammate.
The replays from the previous step are used in supervised learning to "seed" the neural network. By training the model to predict the actions of these heuristics, we get a semi-decent starting pool of policies for the league and avoid wasting compute on exploring extremely poor moves.
The agent is optimized via Proximal Policy Optimization (PPO) within a league-based environment. The opponent pool is meant to push the agent towards more general strategies, rather than producing a chain of policies that each only learn to exploit the previous model. Currently, the latest policy faces the following opponents:
- Latest Self: The current version of the training policy (50% probability).
- Opponent Pool: Historical snapshots of the agent and the initial seeds. PFSP (prioritized fictitious self-play) is used for sampling, so the agent is more likely to face opponents that currently have a good win rate against the previous few policies. Once the pool is large enough, the policy with the lowest win rate is evicted.
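The sampling and eviction rules above can be sketched roughly as follows. This is a minimal illustration, not the code in src/train_loop.py: the `PoolEntry` class, the `power` exponent, the neutral 0.5 prior for unplayed opponents, and the `max_size` parameter are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class PoolEntry:
    # Hypothetical record for one pooled opponent.
    name: str
    wins: int = 0   # games this opponent won against recent learner policies
    games: int = 0  # games this opponent played against them

    @property
    def win_rate(self) -> float:
        # Assumption: unplayed opponents get a neutral prior of 0.5.
        return self.wins / self.games if self.games else 0.5


def sample_opponent(pool, rng, p_self=0.5, power=2.0):
    """Return "latest_self" with probability p_self, otherwise the name of a
    pool entry drawn with weight win_rate ** power, so opponents that beat
    the recent policies more often are picked more often."""
    if not pool or rng.random() < p_self:
        return "latest_self"
    weights = [max(entry.win_rate, 1e-3) ** power for entry in pool]
    return rng.choices(pool, weights=weights, k=1)[0].name


def add_snapshot(pool, entry, max_size):
    """Append a snapshot; if the pool is now over max_size, evict the entry
    with the lowest win rate."""
    pool.append(entry)
    if len(pool) > max_size:
        pool.remove(min(pool, key=lambda e: e.win_rate))
```

Raising the win rate to a power is one common way to sharpen the preference for hard opponents; a `power` of 0 would reduce this to uniform sampling over the pool.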
The observation given to the model is a static embedding. Text descriptions of the observed features of the ongoing battle are passed through TinyBERT, and each embedding is the concatenation of the CLS token embedding and the mean of the token embeddings. It consists of:
- H1, H2, H3: Text summary embeddings of the outcomes of the last 3 turns.
- Field State: Text embedding of the global and local effects on the field (Weather, Terrain, Trick Room, Tailwind, etc).
- Global Info: Text embedding containing any other global info. It contains only the terastallization info for now, but has space to encode more.
- Player and opponent Pokémon: 2 text embeddings per Pokémon for both player and opponent. These contain info about ability, type, moveset, etc.
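As a rough illustration of the "CLS plus mean" pooling described above, the sketch below pools a sequence of already-computed token vectors. It uses plain Python lists so it runs without dependencies; real code would operate on the encoder's output tensors. The function name `cls_mean_pool`, the use of an attention mask to skip padding, and the choice to include the CLS token in the mean are assumptions, not details taken from the repository.

```python
def cls_mean_pool(token_embeddings, attention_mask):
    """Pool one encoded text into a single vector.

    token_embeddings: list of per-token vectors, with the [CLS] token first.
    attention_mask:   1 for real tokens, 0 for padding.
    Returns the [CLS] vector concatenated with the masked mean vector, so the
    result is twice the encoder's hidden size.
    """
    cls_vec = list(token_embeddings[0])
    dim = len(cls_vec)
    # Keep only real tokens so padding does not dilute the mean.
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    mean_vec = [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]
    return cls_vec + mean_vec


# Toy input: 3 tokens of dimension 2, the last one is padding.
pooled = cls_mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [1, 1, 0])
```

Because the embeddings are static, they can be computed once per distinct text and cached.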
The remaining features are purely numerical and are also extracted from the poke-env Battle object.
- Field Nums: Turn counts for each type of global / local field effect.
- Player and opponent Pokémon nums: Base stats, boosts, status condition (one hot), etc.
The model can be thought of as a spatio-temporal model. It uses an encoder-only Transformer to attend over the spatial features (Pokémon + field info) from the static encoding, plus one time-dependent hidden state token (HG).
Over the time axis, the model acts as an RNN, updating the hidden state (HG) with the new information from the latest turn.
The update logic follows:
The CLS token becomes the internal state representation for the turn. It is also the output of the shared backbone, from which the policy and value networks split.
Doubles coordination is modeled as a sequential decision process:
- First Action Prediction: The policy predicts the action for the first Pokémon slot, $P(a_1 \mid z)$.
- Conditional Action Prediction: The first action is embedded and used to condition the prediction for the second slot autoregressively, $P(a_2 \mid z, a_1)$.
- Sequential Masking: Legal move filtering is applied at each step, preventing coordination errors such as double-terastallization or invalid switch-ins.
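The factorisation $P(a_1, a_2 \mid z) = P(a_1 \mid z)\,P(a_2 \mid z, a_1)$ with per-step masking can be sketched as below. This is a toy, dependency-free illustration: the four-action space, `toy_logits2`, and `toy_mask2` are hypothetical stand-ins for the real policy head and legality filter, and the actual action encoding in the repository is likely different.

```python
import math
import random


def masked_softmax(logits, mask):
    """Softmax over legal entries only; illegal entries get probability 0."""
    legal = [l for l, m in zip(logits, mask) if m]
    if not legal:
        raise ValueError("no legal action")
    top = max(legal)  # subtract the max for numerical stability
    exps = [math.exp(l - top) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def sample_joint_action(logits1, mask1, logits2_fn, mask2_fn, rng):
    """Sample a1 ~ P(a1 | z), then a2 ~ P(a2 | z, a1).

    logits2_fn(a1) and mask2_fn(a1) are hypothetical callables standing in
    for the conditioned second head and the legality filter."""
    p1 = masked_softmax(logits1, mask1)
    a1 = rng.choices(range(len(p1)), weights=p1, k=1)[0]
    p2 = masked_softmax(logits2_fn(a1), mask2_fn(a1))
    a2 = rng.choices(range(len(p2)), weights=p2, k=1)[0]
    # The joint log-probability is the sum of the two conditional terms.
    log_prob = math.log(p1[a1]) + math.log(p2[a2])
    return a1, a2, log_prob


# Toy action space: 0 = move, 1 = move + tera, 2 = switch to A, 3 = switch to B.
def toy_mask2(a1):
    mask = [1, 1, 1, 1]
    if a1 == 1:
        mask[1] = 0   # only one terastallization per turn
    if a1 in (2, 3):
        mask[a1] = 0  # both slots cannot switch to the same Pokémon
    return mask


def toy_logits2(a1):
    # Stand-in for a head conditioned on the embedding of a1.
    return [0.1 * a1, 0.5, 0.0, -0.2]


a1, a2, log_prob = sample_joint_action(
    [0.0, 1.0, 0.5, 0.5], [1, 1, 1, 1], toy_logits2, toy_mask2, random.Random(0)
)
```

Computing the second mask after the first action is sampled is what rules out combinations that are individually legal but jointly invalid.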
The value head is fairly standard: a small MLP with GELU activations and a single scalar output. In the latest policy, it also has a custom gradient scaler that scales down the value-loss gradients before they enter the shared backbone, which reduces interference from the value loss with the shared representation.
NOTE: In the old policy, the model was not autoregressive, instead it generated
Hyperparameters (learning rates, entropy coefficients, batch sizes) can be modified via a .ppoconfig file in the root directory. If absent, the system defaults to the parameters defined in src/ppo_utils.py.
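The override-or-default behaviour can be pictured with the sketch below. Everything specific here is an assumption made for illustration: the README does not state the file format of .ppoconfig (JSON is assumed), and the `PPOConfig` fields and their default values are hypothetical rather than the ones in src/ppo_utils.py.

```python
import json
from dataclasses import dataclass, fields, replace
from pathlib import Path


@dataclass(frozen=True)
class PPOConfig:
    # Hypothetical names and defaults; the real ones live in src/ppo_utils.py.
    learning_rate: float = 3e-4
    entropy_coef: float = 0.01
    batch_size: int = 256


def load_config(path=".ppoconfig"):
    """Return the defaults, overridden by any keys found in the config file.

    Assumes the file is a JSON object mapping hyperparameter names to values.
    """
    defaults = PPOConfig()
    file = Path(path)
    if not file.exists():
        return defaults  # no file: fall back to the built-in defaults
    overrides = json.loads(file.read_text())
    known = {f.name for f in fields(PPOConfig)}
    unknown = set(overrides) - known
    if unknown:
        # Fail loudly on typos instead of silently ignoring them.
        raise KeyError(f"unknown hyperparameters: {sorted(unknown)}")
    return replace(defaults, **overrides)
```

Under these assumptions, a file containing `{"learning_rate": 1e-4}` would change only the learning rate and leave the other values at their defaults.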
- TensorBoard: Training metrics (Win Rate, KL Divergence, Explained Variance) are logged to runs/ppo_training/.
- Logging: A text output of the same is saved to training.log.
- Checkpoints: The primary PPO checkpoint is stored at checkpoints/ppo_checkpoint.pt, with opponent snapshots archived in checkpoints/pool/. checkpoints/pool/pool_state.json gives you more information about how the latest checkpoint is doing against the other policies in the pool.
Will be added soon. Unfortunately, I will not be able to obtain a stable Elo rating for the agent on the ladder, both because it is trained on a closed metagame and because Reg H is no longer a format on the official Pokémon Showdown server.