PowerGrid-RL: Custom Gymnasium Environment + PPO Agent

A from-scratch reinforcement learning system featuring:

PowerGrid-v0: a physics-based electrical grid dispatch simulation written directly against the Gymnasium API rather than adapted from a stock environment (no CartPole, no Atari).
PPO Agent: Proximal Policy Optimisation implemented in PyTorch, with GAE-λ advantage estimation, a clipped surrogate objective, entropy regularisation, and learning-rate annealing.

The agent learns to dispatch a fleet of power generators and a battery storage unit to meet stochastic electricity demand while minimising fuel cost, carbon emissions, and grid frequency instability.

The Environment: `PowerGrid-v0`

Physical Setup

 ┌──────────────────────────────────────────────────────────────────┐
 │                        POWER GRID                                │
 │                                                                  │
 │  Controllable generators                                         │
 │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌──────────┐             │
 │  │ Coal-1  │  │ Coal-2  │  │  Gas    │  │ Nuclear  │             │
 │  │ 500 MW  │  │ 400 MW  │  │ 250 MW  │  │ 900 MW   │             │
 │  │ $30/MWh │  │ $32/MWh │  │ $55/MWh │  │ $10/MWh  │             │
 │  │ 0.82 t  │  │ 0.82 t  │  │ 0.45 t  │  │ 0.012 t  │             │
 │  └────┬────┘  └────┬────┘  └────┬────┘  └────┬─────┘             │
 │       │            │            │             │                  │
 │       └────────────┴────────────┴─────────────┘                  │
 │                             │                                    │
 │  Stochastic renewables      │    Battery storage                 │
 │  ┌──────────┐ ┌──────────┐  │  ┌─────────────┐                   │
 │  │  Solar   │ │  Wind    │  │  │  Battery    │                   │
 │  │  300 MW  │ │  400 MW  │──┤  │  ±200 MW    │                   │
 │  │ weather  │ │ stoch.   │  │  │  1000 MWh   │                   │
 │  └──────────┘ └──────────┘  │  └─────────────┘                   │
 │                             │                                    │
 │                     ════════╪════════                            │
 │                             │                                    │
 │                    ┌────────┴────────┐                           │
 │                    │    DEMAND       │                           │
 │                    │  ~1200 MW avg   │                           │
 │                    │  daily profile  │                           │
 │                    │  + seasonality  │                           │
 │                    │  + noise        │                           │
 │                    └─────────────────┘                           │
 └──────────────────────────────────────────────────────────────────┘
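To make the figures in the diagram concrete, the sketch below works out fuel cost and emissions for one 30-minute dispatch interval. It is not taken from `generators.py`: the dictionary layout and the function name `interval_cost_and_emissions` are hypothetical, and the "t" row of the diagram is assumed to mean tonnes of CO₂ per MWh.

```python
# Figures copied from the diagram above.
# ASSUMPTION: the "t" row means tonnes of CO2 emitted per MWh generated.
GENERATORS = {
    "Coal-1":  {"capacity_mw": 500.0, "cost_per_mwh": 30.0, "tco2_per_mwh": 0.82},
    "Coal-2":  {"capacity_mw": 400.0, "cost_per_mwh": 32.0, "tco2_per_mwh": 0.82},
    "Gas":     {"capacity_mw": 250.0, "cost_per_mwh": 55.0, "tco2_per_mwh": 0.45},
    "Nuclear": {"capacity_mw": 900.0, "cost_per_mwh": 10.0, "tco2_per_mwh": 0.012},
}
STEP_HOURS = 0.5  # one 30-minute dispatch interval


def interval_cost_and_emissions(output_mw):
    """Return (fuel cost in $, emissions in tCO2) for one dispatch interval.

    `output_mw` maps generator name to its constant output in MW over the interval.
    """
    cost = 0.0
    emissions = 0.0
    for name, mw in output_mw.items():
        spec = GENERATORS[name]
        if not 0.0 <= mw <= spec["capacity_mw"]:
            raise ValueError(f"{name}: {mw} MW is outside [0, {spec['capacity_mw']}]")
        energy_mwh = mw * STEP_HOURS  # MW held for half an hour
        cost += energy_mwh * spec["cost_per_mwh"]
        emissions += energy_mwh * spec["tco2_per_mwh"]
    return cost, emissions


dispatch = {"Coal-1": 300.0, "Coal-2": 0.0, "Gas": 100.0, "Nuclear": 900.0}
cost, emissions = interval_cost_and_emissions(dispatch)
print(f"{cost:.0f} $, {emissions:.1f} tCO2")  # 11750 $, 150.9 tCO2
```

In this example nuclear supplies 450 MWh for $4,500, while 150 MWh of coal costs the same amount and emits roughly 123 tCO₂, which is the trade-off the agent has to learn.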

Observation Space (20-dimensional, continuous)

Index	Feature	Range	Description
0–3	Controllable gen. outputs	[0, 1]	Normalized MW output for Coal-1, Coal-2, Gas, Nuclear
4	Battery SoC	[0, 1]	State of charge (0 = empty, 1 = full)
5	Solar output	[0, 1]	Current solar generation fraction
6	Wind output	[0, 1]	Current wind generation fraction
7	Net demand	[0, 1]	Normalized residual demand after renewables
8	Frequency deviation	[-1, 1]	Grid freq. deviation from 50 Hz (normalized)
9–10	Time of day	[-1, 1]	sin/cos encoding of hour-in-day
11–12	Day of year	[-1, 1]	sin/cos encoding of day-in-year
13	Solar irradiance	[0, 1]	Available solar resource
14	Wind speed (norm.)	[0, 1]	Available wind resource
15	Carbon price	[0, 1]	Normalized CO₂ price (stochastic)
16	Demand forecast error	[-1, 1]	Surprise vs. day-ahead forecast
17–19	Supply/demand ratio components	[0, 1]	Balance indicators
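The sin/cos features at indices 9–12 can be illustrated with a short sketch. The helper names `cyclic_encode` and `time_features` are hypothetical, and the ordering (sin first, then cos) and the 365-day year are assumptions; the environment's own code may differ.

```python
import math


def cyclic_encode(value, period):
    """Map a periodic quantity onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


def time_features(step, day_of_year, steps_per_day=48):
    """Return [tod_sin, tod_cos, doy_sin, doy_cos], each in [-1, 1].

    ASSUMPTIONS: 48 half-hour steps per day, a 365-day year, sin before cos.
    """
    hour = 24.0 * step / steps_per_day
    tod_sin, tod_cos = cyclic_encode(hour, 24.0)
    doy_sin, doy_cos = cyclic_encode(day_of_year, 365.0)
    return [tod_sin, tod_cos, doy_sin, doy_cos]


# 23:30 (step 47) and 00:00 (step 0) end up close together in feature space,
# which a raw hour value in [0, 24) would not give.
late = time_features(47, 10)
midnight = time_features(0, 10)
gap = math.hypot(late[0] - midnight[0], late[1] - midnight[1])
print(f"distance between 23:30 and 00:00: {gap:.3f}")
```

The two-component encoding avoids the artificial jump at midnight and at year end, so the policy sees neighbouring times as neighbouring inputs.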

Action Space (5-dimensional, continuous Box[-1, 1])

Dimension	Unit	Effect
0	Coal-1	Target setpoint delta from current output
1	Coal-2	Target setpoint delta from current output
2	Gas (CCGT)	Target setpoint (very flexible, fast ramp)
3	Nuclear	Target setpoint (very constrained ramp)
4	Battery	Charge (+1) / Discharge (-1) rate
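One way normalized actions could be turned into physical setpoints is sketched below. This is an illustration under stated assumptions, not the logic in `powergrid_env.py`: the helper names, the per-step ramp limit of 50 MW, and the lossless battery are all hypothetical. The ±200 MW power rating, the 1000 MWh capacity, and the sign convention (positive = charge) come from the diagram and the table above.

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def apply_setpoint_delta(current_mw, action, capacity_mw, ramp_mw_per_step):
    """Move a generator by at most one ramp limit per step.

    ASSUMPTION: action = +1 / -1 requests the full ramp up / down.
    """
    action = clip(action, -1.0, 1.0)
    target_mw = current_mw + action * ramp_mw_per_step
    return clip(target_mw, 0.0, capacity_mw)


def step_battery(soc, action, power_mw=200.0, energy_mwh=1000.0, step_hours=0.5):
    """Return (new_soc, actual_power_mw); positive power charges the battery.

    ASSUMPTION: no conversion losses. Power is limited by the rating and by
    the energy that can still be stored or withdrawn within this step.
    """
    requested_mw = clip(action, -1.0, 1.0) * power_mw
    max_charge_mw = (1.0 - soc) * energy_mwh / step_hours
    max_discharge_mw = soc * energy_mwh / step_hours
    actual_mw = clip(requested_mw, -max_discharge_mw, max_charge_mw)
    new_soc = clip(soc + actual_mw * step_hours / energy_mwh, 0.0, 1.0)
    return new_soc, actual_mw


# Coal-1 near its 500 MW cap: a full ramp-up request is truncated at capacity.
print(apply_setpoint_delta(480.0, 1.0, capacity_mw=500.0, ramp_mw_per_step=50.0))
# Half-full battery charging at full rate for one 30-minute step.
print(step_battery(0.5, 1.0))
```

At full power the battery moves its state of charge by 10 percentage points per step (200 MW × 0.5 h / 1000 MWh), so a full charge takes ten steps, or five hours.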

Reward Function

r_t = + α · supply_adequacy(t)        # reward meeting demand
      − β · fuel_cost(t)              # penalize operating cost
      − γ · carbon_cost(t)            # penalize emissions × carbon price
      − δ · |Δf(t)|²                  # penalize frequency deviation
      − ε · ramp_violations(t)        # penalize exceeding ramp limits
      − ζ · blackout_flag(t)          # large penalty for load shedding
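The weighted sum above can be written as a small function. The sketch below is a hedged illustration: the class `RewardWeights`, the function `step_reward`, and every default weight are hypothetical placeholders, since the actual coefficients are not given here (they presumably live in `configs/default.yaml`). Note that γ and ε in this formula are reward weights, distinct from the discount factor γ and clip range ε listed for the PPO agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # ASSUMPTION: placeholder values chosen only for illustration.
    alpha: float = 1.0     # supply adequacy bonus
    beta: float = 0.5      # fuel cost penalty
    gamma: float = 0.5     # carbon cost penalty (not the RL discount factor)
    delta: float = 2.0     # squared frequency deviation penalty
    epsilon: float = 1.0   # ramp violation penalty (not the PPO clip range)
    zeta: float = 10.0     # blackout penalty


def step_reward(w, supply_adequacy, fuel_cost, carbon_cost,
                freq_deviation, ramp_violations, blackout):
    """Combine the six reward terms; inputs are assumed to be pre-normalized."""
    return (
        w.alpha * supply_adequacy
        - w.beta * fuel_cost
        - w.gamma * carbon_cost
        - w.delta * abs(freq_deviation) ** 2
        - w.epsilon * ramp_violations
        - w.zeta * (1.0 if blackout else 0.0)
    )


w = RewardWeights()
normal = step_reward(w, 1.0, 0.4, 0.2, -0.1, 0.0, False)
outage = step_reward(w, 0.6, 0.4, 0.2, -0.5, 0.0, True)
print(f"normal step: {normal:.2f}, blackout step: {outage:.2f}")
```

Keeping the blackout weight much larger than the cost weights encodes the priority that load shedding should be avoided even when that means running expensive or carbon-intensive units.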

Episode Structure

Time resolution: 30-minute dispatch intervals
Episode length: 48 steps (1 simulated day)
Stochasticity: renewable output, demand noise, carbon price walk
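A minimal sketch of what one 48-step episode might look like from the environment's side: a daily demand profile with noise and a bounded random walk for the carbon price. Everything here is an assumption made for illustration (the function names, the evening peak at 18:00, the 25% swing, the noise levels); only the 30-minute resolution, the 48-step length, and the roughly 1200 MW average come from this README.

```python
import math
import random

STEPS_PER_DAY = 48
STEP_MINUTES = 30


def clock_time(step):
    """Wall-clock label (HH:MM) for the start of a dispatch interval."""
    minutes = (step % STEPS_PER_DAY) * STEP_MINUTES
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


def simulate_day(seed, mean_demand_mw=1200.0, daily_swing=0.25,
                 noise_std=0.02, price_step=0.02):
    """Return a list of (time, demand_mw, carbon_price) tuples for one day.

    ASSUMPTIONS: cosine daily profile peaking at 18:00, multiplicative
    Gaussian demand noise, carbon price random walk clipped to [0, 1].
    """
    rng = random.Random(seed)  # seeded so episodes are reproducible
    carbon_price = 0.5
    trace = []
    for step in range(STEPS_PER_DAY):
        hour = step * STEP_MINUTES / 60.0
        profile = 1.0 + daily_swing * math.cos(2.0 * math.pi * (hour - 18.0) / 24.0)
        demand_mw = mean_demand_mw * profile * (1.0 + rng.gauss(0.0, noise_std))
        carbon_price = min(1.0, max(0.0, carbon_price + rng.gauss(0.0, price_step)))
        trace.append((clock_time(step), demand_mw, carbon_price))
    return trace


day = simulate_day(seed=0)
print(len(day), day[0][0], day[-1][0])  # 48 00:00 23:30
```

Seeding the generator per episode keeps runs reproducible, which matters when comparing a trained policy against a baseline on the same demand traces.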

PPO Agent

Implemented from scratch in PyTorch. Key components:

Component	Details
Network	Shared MLP trunk (256-256-tanh) → separate Actor + Critic heads
Policy	Diagonal Gaussian; outputs μ and log σ per action dimension
Advantage	GAE-λ (λ = 0.95, γ = 0.99)
Objective	PPO-Clip (ε = 0.2) + value loss (0.5 coeff) + entropy bonus (0.01)
Update	4 epochs × 8 mini-batches per rollout (2048 steps)
Optimiser	Adam, lr = 3 × 10⁻⁴ with linear annealing
Gradient clip	max norm = 0.5
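The Advantage, Objective, and Optimiser rows can be sketched in plain Python using the hyperparameters from the table. This is a simplified, framework-free illustration rather than the code in `ppo/buffer.py` or `ppo/agent.py`: the function names are hypothetical, the clipped objective is shown for a single sample, and annealing the learning rate linearly to zero is an assumption (the table only says "linear annealing"). With 2048 steps split into 8 mini-batches, each mini-batch would hold 256 samples, assuming an even split.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    dones[t] is True when the episode ended after step t, which stops
    bootstrapping across the episode boundary.
    Returns (advantages, returns), where returns are the value targets.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO-Clip objective (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s).
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def annealed_lr(step, total_steps, initial_lr=3e-4):
    """ASSUMPTION: learning rate decays linearly from initial_lr to zero."""
    return initial_lr * max(0.0, 1.0 - step / total_steps)


adv, ret = compute_gae([1.0, 1.0], [0.0, 0.0], 0.0, [False, True])
print(adv)                           # approximately [1.9405, 1.0]
print(clipped_surrogate(1.5, 1.0))   # 1.2: gain is capped at 1 + clip_eps
print(clipped_surrogate(0.5, -1.0))  # -0.8: the pessimistic bound is kept
```

Taking the minimum of the clipped and unclipped terms means the objective never rewards moving the policy further than the clip range allows, while still penalizing moves that make a bad action more likely.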

Training Curves (1M steps)

Episode Return:

  ░░░▒▒▒▓▓▓███████████████████████  →  convergence ~400k steps
  -150                              →  +85

Fuel Cost ($/episode):

  ███████████▓▓▓▒▒▒░░░░░░░░░░░░░░   →  -43% vs. random policy

Carbon (tCO₂/episode):

  ████████████▓▓▓▒▒▒░░░░░░░░░░░░░   →  -38% vs. rule-based baseline

Blackout Rate:

  ████████▓▓▓▓▒▒░░░░░░░░░░░░░░░░░   →  0.3% vs. 8.1% baseline

Quick Start

git clone https://github.com/ToppatKing/powergrid-rl.git
cd powergrid-rl
pip install -e ".[dev]"

Train

python scripts/train.py --config configs/default.yaml --total-steps 1000000

Evaluate a checkpoint

python scripts/evaluate.py --checkpoint results/best_model.pt --episodes 100

Visualize a single episode

python scripts/visualize.py --checkpoint results/best_model.pt

Project Structure

powergrid-rl/
├── env/
│   ├── powergrid_env.py        # Gymnasium environment (core)
│   ├── generators.py           # Generator physics models
│   └── demand_model.py         # Stochastic demand simulation
├── ppo/
│   ├── network.py              # Actor-Critic neural networks
│   ├── buffer.py               # Rollout buffer + GAE computation
│   ├── agent.py                # PPO algorithm
│   └── trainer.py              # Training loop with Rich logging
├── scripts/
│   ├── train.py                # Training entry point
│   ├── evaluate.py             # Evaluation + metrics report
│   └── visualize.py            # Episode replay visualisation
├── tests/
│   ├── test_env.py             # Env API compliance + physics tests
│   └── test_agent.py           # Agent + buffer unit tests
├── configs/
│   └── default.yaml            # All hyperparameters
└── .github/workflows/ci.yml

Design Decisions

Why a power grid? It has real-world stakes (energy transition), rich physics (ramp constraints, frequency dynamics, stochastic renewables), and a non-trivial multi-dimensional action space.
Why PPO from scratch? Understanding every line (GAE computation, importance-weight clipping, entropy regularisation) is more valuable than calling stable_baselines3.PPO(...).
Why continuous actions? Discrete dispatch is unrealistic; continuous setpoints with ramp constraints require the agent to learn smooth, physically feasible trajectories.

License

MIT — see LICENSE.
