Stable Baselines3

[Tool]

Presentation : Stable Baselines3 is a Python library offering reliable implementations of reinforcement learning algorithms. In the case of Transcendence, we are interested in the PPO (Proximal Policy Optimization) algorithm.


Setup

Stable Baselines3 requires PyTorch and Gymnasium.
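A minimal install sketch, assuming pip; the [extra] variant pulls in optional dependencies such as Atari support and TensorBoard, and PyTorch is installed automatically as a core dependency:

pip install "stable-baselines3[extra]"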

PPO parameters

  • policy : usually "MlpPolicy" for feature vectors, "CnnPolicy" for image observations
  • env : the Gymnasium environment (or vectorized env) to train on
  • learning_rate : step size of the optimizer (default 3e-4)
  • n_steps : number of steps collected per environment before each update (default 2048)
  • batch_size : minibatch size used when optimizing the surrogate loss (default 64)
  • n_epochs : number of optimization passes over each rollout (default 10)
  • gamma : discount factor for future rewards (typically 0.99)
  • gae_lambda : factor for the trade-off of bias vs. variance in Generalized Advantage Estimation (default 0.95)
  • clip_range : clipping parameter of the PPO surrogate objective (default 0.2)
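
As a sketch of how these map onto the constructor, with the documented defaults written out explicitly (the CartPole-v1 environment is only a placeholder):

import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")  # placeholder environment
model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,   # optimizer step size
    n_steps=2048,         # rollout length per update
    batch_size=64,        # minibatch size
    n_epochs=10,          # optimization passes per rollout
    gamma=0.99,           # discount factor
    gae_lambda=0.95,      # GAE bias/variance trade-off
    clip_range=0.2,       # PPO clipping parameter
)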

Use cases

Initialize

import gymnasium as gym
from stable_baselines3 import PPO

# Atari Pong emits image observations; "MlpPolicy" suits feature vectors,
# while "CnnPolicy" is the usual choice for raw frames.
env = gym.make("Pong-v4")
model = PPO(
    "MlpPolicy",
    env,
    verbose=1,              # print training progress
    learning_rate=0.0003,
    n_steps=2048,           # rollout length per update
)
model.learn(total_timesteps=10000)
model.save("ppo_pong_agent")
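
Once saved, the agent can be reloaded and queried for actions; a minimal sketch, assuming the same environment as above:

model = PPO.load("ppo_pong_agent")
obs, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    # deterministic=True picks the most likely action instead of sampling
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)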

Important

Ensure observations (the input data) are normalized, for example with the VecNormalize wrapper, to help the PPO algorithm converge faster.
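
A minimal sketch of the wrapper: VecNormalize needs a vectorized environment, and its running statistics must be saved alongside the model so they can be restored at inference time (the file name here is illustrative):

import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

vec_env = DummyVecEnv([lambda: gym.make("Pong-v4")])
vec_env = VecNormalize(vec_env, norm_obs=True, norm_reward=True)

model = PPO("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=10000)
vec_env.save("vec_normalize.pkl")  # running mean/std; restore with VecNormalize.load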

Do's & Don'ts

✅ Do
  • Normalize inputs: scale coordinates and velocities to [-1, 1].
  • Save checkpoints: periodically save models during long training sessions.
  • Use TensorBoard: monitor reward curves and loss values in real time.

❌ Don't
  • Too large updates: avoid high learning rates that cause the policy to collapse.
  • Hardcode environment: don't link game logic directly to the agent; keep the Gym interface decoupled.
  • Ignore seeds: don't forget to set manual seeds for reproducible AI experiments.
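
The checkpoint, TensorBoard, and seed recommendations above combine as follows; save frequency and paths are illustrative, and env is the environment created earlier:

from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import CheckpointCallback

checkpoint_callback = CheckpointCallback(
    save_freq=10_000,              # save every 10k steps
    save_path="./checkpoints/",
    name_prefix="ppo_pong",
)
model = PPO(
    "MlpPolicy",
    env,
    seed=42,                       # reproducible runs
    tensorboard_log="./tb_logs/",  # inspect with: tensorboard --logdir ./tb_logs/
    verbose=1,
)
model.learn(total_timesteps=100_000, callback=checkpoint_callback)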

Resources

  • 📄 Official Docs (PPO implementation)
  • 📄 The 37 Implementation Details of Proximal Policy Optimization (blog post)
  • 🎥 An introduction to Policy Gradient methods (Arxiv Insights)

Legend: 📄 Doc, 📘 Book, 🎥 Video, 💻 GitHub, 📦 Package, 💡 Blog

๐Ÿ—๏ธ Architecture

๐ŸŒ Web Technologies

Backend

Frontend

๐Ÿ”ง Core Technologies

๐Ÿ” Security

โ›“๏ธ Blockchain

๐Ÿ› ๏ธ Dev Tools & Quality


๐Ÿ“ Page model

Clone this wiki locally