PPO Algorithm
Proximal Policy Optimization (PPO) is a reinforcement learning algorithm first developed by OpenAI in 2017. It can learn from a dynamic environment, which makes it a good fit for a Pong game where the data evolve over time.
The key idea is to train the AI to maximize an advantage.
- establish two neural networks:
  - actor: predicts the probability of each action
  - critic: estimates the state value
- collect information (action, reward, next state...) by interacting with the environment (the game service in our case). Reward weighting for a Pong game can be, for example: +1 for scoring, -1 for missing the ball, +0.1 for touching it.
- compute the advantage: GAE (generalized advantage estimation) is used to determine whether an action is better than average.
- optimize with a clipped ratio to prevent drastic policy changes. A loss function is applied that combines the probability ratio and the estimated advantage (see the sketch after this list). Not fully understood here, but check this article for more info.
- iterate: the collect/compute/optimize cycle is repeated many times to train the model.
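As an illustration of these steps, here is a minimal PyTorch sketch of the two networks and the clipped update. The state features, layer sizes and helper names are assumptions made for this page, not the actual AI-service code.

```python
# Illustrative sketch only (assumed feature layout and layer sizes).
import torch
import torch.nn as nn

N_FEATURES = 6  # assumption: ball x/y, ball vx/vy, own paddle y, opponent paddle y
N_ACTIONS = 3   # up / down / stop

class Actor(nn.Module):
    """Predicts a probability for each action given the game state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(N_FEATURES, 64), nn.Tanh(),
                                 nn.Linear(64, N_ACTIONS))

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

class Critic(nn.Module):
    """Estimates the value V(s) of a game state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(N_FEATURES, 64), nn.Tanh(),
                                 nn.Linear(64, 1))

    def forward(self, state):
        return self.net(state).squeeze(-1)

def ppo_update(actor, critic, optimizer, states, actions, old_log_probs,
               advantages, returns, clip_eps=0.2):
    """One clipped-objective update on a batch collected from the game service."""
    dist = actor(states)
    ratio = torch.exp(dist.log_prob(actions) - old_log_probs)   # r_t(theta)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = (critic(states) - returns).pow(2).mean()       # critic regression
    loss = policy_loss + 0.5 * value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```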
Source: OpenAI
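For reference, the clipped surrogate objective from the OpenAI PPO paper, whose terms are detailed below:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$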
- $\theta$ (policy parameters): the weights of the actor (policy) neural network
- $\hat{\mathbb{E}}_t$ (empirical expectation): the average over the timesteps collected in a rollout
- $r_t(\theta)$ (probability ratio): the probability of the chosen action under the new policy divided by its probability under the old policy
- $\hat{A}_t$ (estimated advantage): computed as $\hat{A}_t = Q(s_t, a_t) - V(s_t)$, Q being the expected future reward for action $a_t$ and V the average future reward from state $s_t$. If $\hat{A}_t > 0$, the action is better than average.
- $clip(r_t(\theta), 1 - \epsilon, 1 + \epsilon)$ (ratio clipping): limits the extent of the policy update
- $min$ between the unclipped term and the clipped term ensures the policy change is never too drastic
Old policy
- up: 0.6
- down: 0.1
- stop: 0.3
New policy
- up: 0.8
- down: 0.005
- stop: 0.15
Advantages with GAE
- up: +0.5
- down: -0.3
- stop: -0.1
Hyperparameter: clipping $\epsilon = 0.2$, so the ratio is clipped to the range [0.8, 1.2].
| Step | Desc | Computation |
|---|---|---|
| 1 | ball arrives towards paddle | |
| 2 | ratio for action "up" | 0.8 / 0.6 ≈ 1.33 |
| 3 | clipping | clip(1.33, 0.8, 1.2) = 1.2 |
| 4 | non-clipped loss | 1.33 × 0.5 ≈ 0.67 |
| 5 | clipped loss | 1.2 × 0.5 = 0.6 |
| 6 | final loss | min(0.67, 0.6) = 0.6 |
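The numbers in this table can be checked with a few lines of Python (assuming the clipping parameter $\epsilon = 0.2$ used above):

```python
# Worked check of the table above; values come from the old/new policy example.
old_prob, new_prob = 0.6, 0.8   # probability of "up" under the old / new policy
advantage = 0.5                 # GAE advantage for "up"
eps = 0.2                       # clipping hyperparameter

ratio = new_prob / old_prob                          # ~1.33
clipped_ratio = max(1 - eps, min(ratio, 1 + eps))    # clip(1.33, 0.8, 1.2) = 1.2

unclipped = ratio * advantage        # ~0.67
clipped = clipped_ratio * advantage  # 0.6
final = min(unclipped, clipped)      # 0.6 -> the pessimistic (clipped) value is kept
```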
The key point for building a Pong model is that the dataset is not fixed: variables such as ball position, ball speed and opponent paddle position evolve over time.
- A2C: a single neural network shared between actor and critic, but less stable than PPO.
- TRPO (Trust Region Policy Optimization): stable but complex to implement.
- DQN (Deep Q-Network): combines Q-learning (value-based learning) with a neural network.
- MLP (Multilayer Perceptron) is used to extract information from tabular data in order to classify or detect patterns for prediction. It uses back-propagation and optimization techniques.
- CNN (Convolutional Neural Network) is used to extract information from grid-shaped data (typically images).
- RNN (Recurrent Neural Network) and LSTM (Long Short-Term Memory) are used to extract sequential information (time series, text).
PPO is sensitive to hyperparameters (batch size, learning rate, clipping parameter $\epsilon$...), which can be mitigated with attention mechanisms, curriculum learning or distributed PPO. However, a Pong game, even with customizations, remains fairly basic.
Stable Baselines3 PyTorch NumPy
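As an example of how those hyperparameters are exposed by the stack above, here is a Stable Baselines3 sketch; `PongEnv` is a placeholder for a Gym-compatible wrapper around the game service, not an existing class in the project.

```python
# Sketch only: PongEnv is a hypothetical Gym-compatible wrapper around the game service.
from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",        # MLP actor-critic, suited to tabular game state
    PongEnv(),          # placeholder env exposing ball/paddle state and the +1/-1/+0.1 rewards
    learning_rate=3e-4,
    n_steps=2048,       # rollout length collected before each update
    batch_size=64,
    n_epochs=10,        # optimization passes over each rollout
    gamma=0.99,
    gae_lambda=0.95,    # GAE parameter
    clip_range=0.2,     # the epsilon of the clipped objective
)
model.learn(total_timesteps=1_000_000)
```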
| Type | Ressource | Notes |
|---|---|---|
| 📄 | PPO | Wiki |
Legend: 📄 Doc, 📖 Book, 🎥 Video, 💻 GitHub, 📦 Package, 💡 Blog
- Gateway Service - API Gateway & JWT validation
- Auth Service - Authentication & 2FA/TOTP
- AI Service - AI opponent
- API Documentation - OpenAPI/Swagger
- Fastify - Web framework
- Prisma - ORM
- WebSockets - Real-time communication
- RESTful API - API standards
- React - UI library
- CSS - Styling
- Tailwind - CSS framework
- Accessibility - WCAG compliance
- TypeScript - Language
- Zod - Schema validation
- Nginx - Reverse proxy
- Logging and Error management - Observability
- OAuth 2.0 - Authentication flows
- Two-factor authentication - 2FA/TOTP
- Avalanche - Blockchain network
- Hardhat - Development framework
- Solidity - Smart contracts language
- OpenZeppelin - Security standards
- ESLint - Linting
- Vitest - Testing
- GitHub Actions - CI/CD
- Husky, commitlint and Git hooks - Git hooks
- ELK - Logging stack
📄 Page model