PPO Algorithm

Proximal Policy Optimization (PPO) is a reinforcement learning algorithm first introduced by OpenAI in 2017. It learns from a dynamic environment, which makes it well suited to a Pong game, where the data evolve over time.

Architecture and steps

The key idea is to train the AI so that it maximizes an estimated advantage.

  1. Establish two neural networks:
  • actor : predicts the probability of each action
  • critic : estimates the state value
  2. Collect information (action, reward, next state...) by interacting with the environment (the game service in our case). Reward weighting for a Pong game can be, for example : +1 for scoring, -1 for missing the ball, +0.1 for touching it.
  3. Compute the advantage : GAE (generalized advantage estimation) is used to measure whether an action is better than average (a minimal sketch follows this list).
  4. Optimize with a clipped ratio to prevent brutal policy changes. A loss function is applied, taking into account the probability ratio and the estimated advantage. Not fully understood here, but check this article for more info.
  5. Iterate : the collect / compute / optimize cycle is repeated many times to train the model.
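
The sketch below illustrates steps 1 to 3: two small networks and a GAE routine. It is a minimal sketch assuming PyTorch, an invented 6-value game state and the three Pong actions; the real model and state encoding may differ.

```python
# Minimal sketch (not the project's actual code): tiny actor/critic networks
# and a GAE computation, assuming PyTorch and a 3-action Pong policy.
import torch
import torch.nn as nn

STATE_DIM = 6   # assumed: ball x/y, ball vx/vy, own paddle y, opponent paddle y
N_ACTIONS = 3   # up, down, stop

actor = nn.Sequential(            # predicts action probabilities
    nn.Linear(STATE_DIM, 64), nn.Tanh(),
    nn.Linear(64, N_ACTIONS), nn.Softmax(dim=-1),
)
critic = nn.Sequential(           # estimates the state value V(s)
    nn.Linear(STATE_DIM, 64), nn.Tanh(),
    nn.Linear(64, 1),
)

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one collected rollout.

    rewards and dones are 1-D tensors of equal length; values must contain
    one extra element (the bootstrap value of the state after the last step).
    """
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages
```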

Formula and example

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ clip(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\,\hat{A}_t\right)\right]$$

Source: OpenAI

  • $\theta$ (policy parameters) are the weights of the actor neural network (as opposed to the critic network)
  • $\hat{\mathbb{E}}_t$ (empirical expectation) is the average over the timesteps of the collected batch
  • $r_t(\theta)$ (probability ratio) is the probability of having chosen action A under the new policy divided by its probability under the old policy
  • $\hat{A}_t$ (estimated advantage) is computed with $\hat{A}_t = Q(s_t, a_t) - V(s_t)$ (estimated with GAE in practice), Q being the expected future reward for action A, V being the average future reward from state $s_t$. If $\hat{A}_t > 0$, it is a good action
  • $clip(r_t(\theta), 1 - \epsilon, 1 + \epsilon)$ (ratio clipping) aims at limiting the extent of the policy update
  • taking the $min$ of the unclipped and clipped terms ensures the policy change is not too brutal
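
As a rough illustration of the formula (not the project's actual code), the clipped objective fits in a few lines of PyTorch. The function name and arguments are assumed; log-probabilities are used so the ratio is obtained by exponentiating their difference.

```python
# Minimal sketch of the clipped surrogate objective from the formula above.
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Negative clipped surrogate objective (to minimize with an optimizer)."""
    ratio = torch.exp(new_log_probs - old_log_probs)            # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()
```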

Example with Pong

Old policy $\pi_{old}$ assigns the following probabilities to the possible actions (closer to 1 means the action is more likely to be chosen):

  • up : 0.6
  • down : 0.1
  • stop : 0.3

New policy $\pi_\theta$

  • up : 0.8
  • down : 0.005
  • stop : 0.15

Advantages with GAE

  • up : +0.5
  • down : -0.3
  • stop : -0.1

Hyperparameter $\epsilon$ = 0.2

| Step | Description | Computation |
|------|-------------|-------------|
| $s_t$ | the ball moves towards the paddle | |
| $r_t(\theta)$ | ratio for action "up" | $0.8 / 0.6 \approx 1.33$ |
| $clip(r_t(\theta), 1 - \epsilon, 1 + \epsilon)$ | clipping | $clip(1.33, 0.8, 1.2) = 1.2$ |
| $r_t(\theta)\,\hat{A}_t$ | non-clipped loss term | $1.33 \times 0.5 = 0.665$ |
| $clip(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\,\hat{A}_t$ | clipped loss term | $1.2 \times 0.5 = 0.6$ |
| $min$ | final loss term | $min(0.665, 0.6) = 0.6$ |
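
For readers who prefer code, the same arithmetic in plain Python:

```python
# Quick numeric check of the worked example above.
old_p, new_p = 0.6, 0.8                               # P("up") under old / new policy
advantage = 0.5                                       # GAE advantage for "up"
epsilon = 0.2

ratio = new_p / old_p                                 # ~1.33
clipped_ratio = max(1 - epsilon, min(ratio, 1 + epsilon))  # 1.2
unclipped_term = ratio * advantage                    # ~0.667
clipped_term = clipped_ratio * advantage              # 0.6
print(min(unclipped_term, clipped_term))              # 0.6
```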

Differences with supervised learning and other RL algorithms

The key point for building a Pong model is that the dataset is not fixed: variables such as ball position, ball speed and opponent paddle position evolve over time, unlike in supervised learning, where the model is trained on a static dataset.

  • A2C : a single neural network is shared between actor and critic, but it is less stable than PPO.
  • TRPO (Trust Region Policy Optimization) : stable but complex to implement.

Differences with other neural network algorithms

  • DQN (Deep Q-Network) combines Q-learning (value-based learning) with a neural network.
  • MLP (multilayer perceptron) is used to extract information from tabular data in order to classify or deduce patterns for prediction. It relies on back-propagation and optimization techniques.
  • CNN (convolutional neural network) is used to extract information from grid-shaped data (typically images).
  • RNN (recurrent neural network) and LSTM (long short-term memory) are used to extract sequential information (time series, text).

Known limitations

PPO is sensitive to hyperparameters (batch size, learning rate, clipping parameter $\epsilon$...), which can be mitigated with attention mechanisms, curriculum learning or distributed PPO. However, a Pong game, even with customizations, remains pretty basic.
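
As an illustration, these hyperparameters are typically exposed as constructor arguments when training with Stable Baselines3 (listed under "See also"). `PongEnv` is a hypothetical custom environment wrapping the game service, and the values shown are common defaults, not tuned settings:

```python
# Hedged sketch: PPO training with Stable Baselines3.
from stable_baselines3 import PPO

env = PongEnv()                      # hypothetical custom Gymnasium environment
model = PPO(
    "MlpPolicy", env,
    learning_rate=3e-4,              # sensitive: too high destabilizes training
    batch_size=64,                   # minibatch size for each optimization epoch
    n_epochs=10,                     # optimization passes per collected rollout
    gamma=0.99, gae_lambda=0.95,     # discount factor and GAE smoothing
    clip_range=0.2,                  # the epsilon clipping parameter
)
model.learn(total_timesteps=100_000)
```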

See also

  • Stable Baselines3
  • PyTorch
  • NumPy

Resources

| Type | Resource | Notes |
|------|----------|-------|
| 📄 | PPO Wiki | |

Legend: 📄 Doc, 📘 Book, 🎥 Video, 💻 GitHub, 📦 Package, 💡 Blog

๐Ÿ—๏ธ Architecture

๐ŸŒ Web Technologies

Backend

Frontend

๐Ÿ”ง Core Technologies

๐Ÿ” Security

โ›“๏ธ Blockchain

๐Ÿ› ๏ธ Dev Tools & Quality


๐Ÿ“ Page model

Clone this wiki locally