PPO Algorithm

Proximal Policy Optimization (PPO) is a reinforcement learning algorithm first introduced by OpenAI in 2017. It learns from a dynamic environment, which makes it well suited to a Pong game, where the data evolve over time.

Architecture and steps

The key idea is to train the AI so that it maximizes an estimated advantage.

  1. Establish two neural networks:
  • actor : predicts the probability of each action
  • critic : estimates the state value
  2. Collect information (action, reward, next state...) by interacting with the environment (the game service in our case). Reward weighting for a Pong game can be, for example : +1 for scoring, -1 for missing the ball, +0.1 for touching it.
  3. Compute the advantage : GAE (generalized advantage estimation) is used to measure whether an action is better than average (a minimal sketch follows this list).
  4. Optimize with a clipped ratio to prevent brutal policy changes. A loss function is applied, taking into account the probability ratio and the estimated advantage. Not fully understood here, but check this article for more info.
  5. Iterate : the collect / compute / optimize cycle is repeated many times to train the model.
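
The sketch below illustrates steps 1 to 3: two small networks and a GAE routine. It is a minimal sketch assuming PyTorch, an invented 6-value game state and the three Pong actions; the real model and state encoding may differ.

```python
# Minimal sketch (not the project's actual code): tiny actor/critic networks
# and a GAE computation, assuming PyTorch and a 3-action Pong policy.
import torch
import torch.nn as nn

STATE_DIM = 6   # assumed: ball x/y, ball vx/vy, own paddle y, opponent paddle y
N_ACTIONS = 3   # up, down, stop

actor = nn.Sequential(            # predicts action probabilities
    nn.Linear(STATE_DIM, 64), nn.Tanh(),
    nn.Linear(64, N_ACTIONS), nn.Softmax(dim=-1),
)
critic = nn.Sequential(           # estimates the state value V(s)
    nn.Linear(STATE_DIM, 64), nn.Tanh(),
    nn.Linear(64, 1),
)

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one collected rollout.

    rewards and dones are 1-D tensors of equal length; values must contain
    one extra element (the bootstrap value of the state after the last step).
    """
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages
```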

Formula and example

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ clip(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\,\hat{A}_t\right)\right]$$

Source: OpenAI

  • $\theta$ (policy parameters) are the weights of the actor neural network (as opposed to the critic network)
  • $\hat{\mathbb{E}}_t$ (empirical expectation) is the average over the timesteps of the collected batch
  • $r_t(\theta)$ (probability ratio) is the probability of having chosen action A under the new policy divided by its probability under the old policy
  • $\hat{A}_t$ (estimated advantage) is computed with $\hat{A}_t = Q(s_t, a_t) - V(s_t)$ (estimated with GAE in practice), Q being the expected future reward for action A, V being the average future reward from state $s_t$. If $\hat{A}_t > 0$, it is a good action
  • $clip(r_t(\theta), 1 - \epsilon, 1 + \epsilon)$ (ratio clipping) aims at limiting the extent of the policy update
  • taking the $min$ of the unclipped and clipped terms ensures the policy change is not too brutal
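
As a rough illustration of the formula (not the project's actual code), the clipped objective fits in a few lines of PyTorch. The function name and arguments are assumed; log-probabilities are used so the ratio is obtained by exponentiating their difference.

```python
# Minimal sketch of the clipped surrogate objective from the formula above.
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Negative clipped surrogate objective (to minimize with an optimizer)."""
    ratio = torch.exp(new_log_probs - old_log_probs)            # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()
```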

Example with Pong

Old policy $\pi_{old}$ assigns the following probabilities to the possible actions (closer to 1 means the action is more likely to be chosen):

  • up : 0.6
  • down : 0.1
  • stop : 0.3

New policy $\pi_\theta$

  • up : 0.8
  • down : 0.005
  • stop : 0.15

Advantages with GAE

  • up : +0.5
  • down : -0.3
  • stop : -0.1

Hyperparameter $\epsilon$ = 0.2

| Step | Description | Computation |
|------|-------------|-------------|
| $s_t$ | the ball moves towards the paddle | |
| $r_t(\theta)$ | ratio for action "up" | $0.8 / 0.6 \approx 1.33$ |
| $clip(r_t(\theta), 1 - \epsilon, 1 + \epsilon)$ | clipping | $clip(1.33, 0.8, 1.2) = 1.2$ |
| $r_t(\theta)\,\hat{A}_t$ | non-clipped loss term | $1.33 \times 0.5 = 0.665$ |
| $clip(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\,\hat{A}_t$ | clipped loss term | $1.2 \times 0.5 = 0.6$ |
| $min$ | final loss term | $min(0.665, 0.6) = 0.6$ |
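
For readers who prefer code, the same arithmetic in plain Python:

```python
# Quick numeric check of the worked example above.
old_p, new_p = 0.6, 0.8                               # P("up") under old / new policy
advantage = 0.5                                       # GAE advantage for "up"
epsilon = 0.2

ratio = new_p / old_p                                 # ~1.33
clipped_ratio = max(1 - epsilon, min(ratio, 1 + epsilon))  # 1.2
unclipped_term = ratio * advantage                    # ~0.667
clipped_term = clipped_ratio * advantage              # 0.6
print(min(unclipped_term, clipped_term))              # 0.6
```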

Differences with supervised learning and other RL algorithms

The key point for building a Pong model is that the dataset is not fixed: variables such as ball position, ball speed and opponent paddle position evolve over time, unlike in supervised learning, where the model is trained on a static dataset.

  • A2C : a single neural network is shared between actor and critic, but it is less stable than PPO.
  • TRPO (Trust Region Policy Optimization) : stable but complex to implement.

Differences with other neural network algorithms

  • DQN (Deep Q-Network) combines Q-learning (value-based learning) with a neural network.
  • MLP (multilayer perceptron) is used to extract information from tabular data in order to classify or deduce patterns for prediction. It relies on back-propagation and optimization techniques.
  • CNN (convolutional neural network) is used to extract information from grid-shaped data (typically images).
  • RNN (recurrent neural network) and LSTM (long short-term memory) are used to extract sequential information (time series, text).

Known limitations

PPO is sensitive to hyperparameters (batch size, learning rate, clipping parameter $\epsilon$...), which can be mitigated with attention mechanisms, curriculum learning or distributed PPO. However, a Pong game, even with customizations, remains pretty basic.
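
As an illustration, these hyperparameters are typically exposed as constructor arguments when training with Stable Baselines3 (listed under "See also"). `PongEnv` is a hypothetical custom environment wrapping the game service, and the values shown are common defaults, not tuned settings:

```python
# Hedged sketch: PPO training with Stable Baselines3.
from stable_baselines3 import PPO

env = PongEnv()                      # hypothetical custom Gymnasium environment
model = PPO(
    "MlpPolicy", env,
    learning_rate=3e-4,              # sensitive: too high destabilizes training
    batch_size=64,                   # minibatch size for each optimization epoch
    n_epochs=10,                     # optimization passes per collected rollout
    gamma=0.99, gae_lambda=0.95,     # discount factor and GAE smoothing
    clip_range=0.2,                  # the epsilon clipping parameter
)
model.learn(total_timesteps=100_000)
```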

See also

  • Stable Baselines3
  • PyTorch
  • NumPy

Resources

| Type | Resource | Notes |
|------|----------|-------|
| 📄 | PPO Wiki | |

Legend: 📄 Doc, 📘 Book, 🎥 Video, 💻 GitHub, 📦 Package, 💡 Blog

๐Ÿ—๏ธ Architecture

๐ŸŒ Web Technologies

Backend

Frontend

๐Ÿ”ง Core Technologies

๐Ÿ” Security

โ›“๏ธ Blockchain

๐Ÿ› ๏ธ Dev Tools & Quality


๐Ÿ“ Page model

Clone this wiki locally