A from-scratch reinforcement learning system featuring:
- PowerGrid-v0: a physics-based electrical grid dispatch simulation built entirely from scratch using the Gymnasium API (no CartPole, no Atari).
- PPO Agent: Proximal Policy Optimisation implemented from scratch in PyTorch, complete with GAE-λ advantage estimation, clipped surrogate objective, entropy regularisation, and learning-rate annealing.
The agent learns to dispatch a fleet of power generators and a battery storage unit to meet stochastic electricity demand while minimising fuel cost, carbon emissions, and grid frequency instability.
┌──────────────────────────────────────────────────────────────────┐
│ POWER GRID │
│ │
│ Controllable generators │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌──────────┐ │
│ │ Coal-1 │ │ Coal-2 │ │ Gas │ │ Nuclear │ │
│ │ 500 MW │ │ 400 MW │ │ 250 MW │ │ 900 MW │ │
│ │ $30/MWh │ │ $32/MWh │ │ $55/MWh │ │ $10/MWh │ │
│ │ 0.82 t │ │ 0.82 t │ │ 0.45 t │ │ 0.012 t │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬─────┘ │
│ │ │ │ │ │
│ └────────────┴────────────┴─────────────┘ │
│ │ │
│ Stochastic renewables │ Battery storage │
│ ┌──────────┐ ┌──────────┐ │ ┌─────────────┐ │
│ │ Solar │ │ Wind │ │ │ Battery │ │
│ │ 300 MW │ │ 400 MW │──┤ │ ±200 MW │ │
│ │ weather │ │ stoch. │ │ │ 1000 MWh │ │
│ └──────────┘ └──────────┘ │ └─────────────┘ │
│ │ │
│ ════════╪════════ │
│ │ │
│ ┌────────┴────────┐ │
│ │ DEMAND │ │
│ │ ~1200 MW avg │ │
│ │ daily profile │ │
│ │ + seasonality │ │
│ │ + noise │ │
│ └─────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
| Index | Feature | Range | Description |
|---|---|---|---|
| 0–3 | Controllable gen. outputs | [0, 1] | Normalized MW output for Coal-1, Coal-2, Gas, Nuclear |
| 4 | Battery SoC | [0, 1] | State of charge (0 = empty, 1 = full) |
| 5 | Solar output | [0, 1] | Current solar generation fraction |
| 6 | Wind output | [0, 1] | Current wind generation fraction |
| 7 | Net demand | [0, 1] | Normalized residual demand after renewables |
| 8 | Frequency deviation | [-1, 1] | Grid freq. deviation from 50 Hz (normalized) |
| 9–10 | Time of day | [-1, 1] | sin/cos encoding of hour-in-day |
| 11–12 | Day of year | [-1, 1] | sin/cos encoding of day-in-year |
| 13 | Solar irradiance | [0, 1] | Available solar resource |
| 14 | Wind speed (norm.) | [0, 1] | Available wind resource |
| 15 | Carbon price | [0, 1] | Normalized CO₂ price (stochastic) |
| 16 | Demand forecast error | [-1, 1] | Surprise vs. day-ahead forecast |
| 17–19 | Supply/demand ratio components | [0, 1] | Balance indicators |
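
The sin/cos features at indices 9–12 could be computed along the lines of the sketch below. It assumes 48 half-hour steps per day (matching the 30-minute dispatch interval) and a 365-day year. The name `cyclic_encode` and the constants are illustrative and not taken from the repository.

```python
import math


def cyclic_encode(value: float, period: float) -> tuple[float, float]:
    """Map a periodic quantity onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


# Assumed periods: 48 half-hour steps per day, 365 days per year.
STEPS_PER_DAY = 48
DAYS_PER_YEAR = 365

step_in_day = 36   # 18:00 when each step is 30 minutes
day_of_year = 200

tod_sin, tod_cos = cyclic_encode(step_in_day, STEPS_PER_DAY)
doy_sin, doy_cos = cyclic_encode(day_of_year, DAYS_PER_YEAR)
print(f"time of day: ({tod_sin:+.3f}, {tod_cos:+.3f})")
print(f"day of year: ({doy_sin:+.3f}, {doy_cos:+.3f})")
```

Encoding each cycle as a (sin, cos) pair keeps the last step of one day numerically close to the first step of the next, which a raw hour index would not.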
| Dimension | Generator | Effect |
|---|---|---|
| 0 | Coal-1 | Target setpoint delta from current output |
| 1 | Coal-2 | Target setpoint delta from current output |
| 2 | Gas (CCGT) | Target setpoint (very flexible, fast ramp) |
| 3 | Nuclear | Target setpoint (very constrained ramp) |
| 4 | Battery | Charge (+1) / Discharge (-1) rate |
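
As a rough illustration of how a delta-style action might interact with ramp limits, the sketch below clips a requested change to a per-step ramp limit and reports how far the request overshot it. For simplicity it treats all four generator actions as deltas, although the table describes Gas and Nuclear as target setpoints. The function name, the assumed action range of [-1, 1], and every ramp-limit value are hypothetical.

```python
def apply_setpoint_delta(current: float, action: float, max_ramp: float) -> tuple[float, float]:
    """Apply a delta action to a normalised generator output.

    current  -- current output as a fraction of capacity, in [0, 1]
    action   -- agent action in [-1, 1], read as a requested change in output
    max_ramp -- largest change allowed per step, as a fraction of capacity

    Returns (new_output, ramp_violation), where ramp_violation is how far
    the request exceeded the ramp limit (0.0 when within the limit).
    """
    requested = max(-1.0, min(1.0, action))
    allowed = max(-max_ramp, min(max_ramp, requested))
    violation = abs(requested) - abs(allowed)
    new_output = max(0.0, min(1.0, current + allowed))
    return new_output, violation


# Hypothetical per-step ramp limits (fraction of capacity per 30 minutes).
RAMP_LIMITS = {"Coal-1": 0.10, "Coal-2": 0.10, "Gas": 0.50, "Nuclear": 0.02}

outputs = {"Coal-1": 0.60, "Coal-2": 0.55, "Gas": 0.20, "Nuclear": 0.90}
actions = {"Coal-1": 0.05, "Coal-2": -0.30, "Gas": 0.40, "Nuclear": 0.10}

for name in outputs:
    new_output, violation = apply_setpoint_delta(outputs[name], actions[name], RAMP_LIMITS[name])
    print(f"{name:8s} {outputs[name]:.2f} -> {new_output:.2f} (ramp violation {violation:.2f})")
```

Returning the violation separately lets the environment apply the physically feasible change while still penalising the infeasible request.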
r_t = + α · supply_adequacy(t) # reward meeting demand
− β · fuel_cost(t) # penalize operating cost
− γ · carbon_cost(t) # penalize emissions × carbon price
− δ · |Δf(t)|² # penalize frequency deviation
− ε · ramp_violations(t) # penalize exceeding ramp limits
− ζ · blackout_flag(t) # large penalty for load shedding
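
The weighted sum above could be written as a small function, as sketched here. The coefficient values are placeholders chosen for illustration only, and the inputs are assumed to be normalised to comparable scales already. `RewardWeights` and `step_reward` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Placeholder coefficients; real values would come from the config."""
    alpha: float = 1.0    # supply adequacy
    beta: float = 0.5     # fuel cost
    gamma: float = 0.3    # carbon cost
    delta: float = 2.0    # squared frequency deviation
    epsilon: float = 0.5  # ramp violations
    zeta: float = 10.0    # blackout flag


def step_reward(
    supply_adequacy: float,
    fuel_cost: float,
    carbon_cost: float,
    freq_deviation: float,
    ramp_violations: float,
    blackout: bool,
    w: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of the six reward terms for one dispatch step."""
    return (
        w.alpha * supply_adequacy
        - w.beta * fuel_cost
        - w.gamma * carbon_cost
        - w.delta * freq_deviation ** 2
        - w.epsilon * ramp_violations
        - w.zeta * float(blackout)
    )


normal = step_reward(1.0, 0.4, 0.2, 0.05, 0.0, blackout=False)
shed = step_reward(0.7, 0.4, 0.2, 0.30, 0.1, blackout=True)
print(f"normal step: {normal:+.3f}, load-shedding step: {shed:+.3f}")
```

Note that γ and ε here are reward coefficients, distinct from the discount factor γ and clip range ε used by PPO.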
- Time resolution: 30-minute dispatch intervals
- Episode length: 48 steps (1 simulated day)
- Stochasticity: renewable output, demand noise, carbon price walk
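
One simple way to model the carbon price walk is a Gaussian random walk clipped to the normalised range [0, 1], sketched below over a single 48-step episode. The starting price, step size, and the function name `carbon_price_walk` are assumptions; the actual environment may use a different process.

```python
import random
from typing import List, Optional


def carbon_price_walk(
    n_steps: int,
    start: float = 0.5,
    sigma: float = 0.02,
    seed: Optional[int] = None,
) -> List[float]:
    """Generate a normalised carbon price path as a random walk clipped to [0, 1]."""
    rng = random.Random(seed)
    price = start
    path = []
    for _ in range(n_steps):
        # Add Gaussian noise, then clip so the observation stays in range.
        price = min(1.0, max(0.0, price + rng.gauss(0.0, sigma)))
        path.append(price)
    return path


# One episode: 48 steps of 30 minutes each, i.e. one simulated day.
episode = carbon_price_walk(n_steps=48, seed=0)
print(f"min={min(episode):.3f} max={max(episode):.3f} final={episode[-1]:.3f}")
```

Passing a seed makes an episode reproducible, which helps when comparing policies on identical price paths.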
Implemented from scratch in PyTorch. Key components:
| Component | Details |
|---|---|
| Network | Shared MLP trunk (256-256-tanh) → separate Actor + Critic heads |
| Policy | Diagonal Gaussian; outputs μ and log σ per action dimension |
| Advantage | GAE-λ (λ = 0.95, γ = 0.99) |
| Objective | PPO-Clip (ε = 0.2) + value loss (0.5 coeff) + entropy bonus (0.01) |
| Update | 4 epochs × 8 mini-batches per rollout (2048 steps) |
| Optimiser | Adam, lr = 3 × 10⁻⁴ with linear annealing |
| Gradient clip | max norm = 0.5 |
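
Three of the components in the table (GAE-λ, the clipped surrogate, and linear learning-rate annealing) are sketched below with the listed hyperparameters as defaults. The sketch uses plain Python lists and scalars rather than PyTorch tensors so it stays dependency-free. The function names are illustrative and do not necessarily match those in `ppo/buffer.py` or `ppo/agent.py`.

```python
import math


def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation over one rollout.

    rewards, values, dones are per-step lists of equal length;
    last_value is the critic's estimate for the state after the final step.
    Returns (advantages, returns), where returns = advantages + values.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping stops at episode boundaries.
        td_error = rewards[t] + gamma * next_value * not_done - values[t]
        gae = td_error + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def clipped_surrogate(new_logp, old_logp, advantage, clip_eps=0.2):
    """Per-sample PPO-Clip objective (to be maximised)."""
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def annealed_lr(step, total_steps, initial_lr=3e-4):
    """Linear decay from initial_lr to zero over total_steps."""
    return initial_lr * max(0.0, 1.0 - step / total_steps)


adv, ret = compute_gae([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], last_value=0.5, dones=[False, False, True])
print("advantages:", [round(a, 4) for a in adv])
print("objective:", clipped_surrogate(new_logp=-0.9, old_logp=-1.2, advantage=adv[0]))
print("lr at 50%:", annealed_lr(500_000, 1_000_000))
```

In a tensor-based version the same objective would be averaged over a mini-batch and negated to form the policy loss, then combined with the value loss and entropy bonus using the coefficients in the table.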
Episode Return:
░░░▒▒▒▓▓▓███████████████████████ → convergence ~400k steps
-150 → +85
Fuel Cost ($/episode):
███████████▓▓▓▒▒▒░░░░░░░░░░░░░░ → -43% vs. random policy
Carbon (tCO₂/episode):
████████████▓▓▓▒▒▒░░░░░░░░░░░░░ → -38% vs. rule-based baseline
Blackout Rate:
████████▓▓▓▓▒▒░░░░░░░░░░░░░░░░░ → 0.3% vs. 8.1% baseline
git clone https://github.com/ToppatKing/powergrid-rl.git
cd powergrid-rl
pip install -e ".[dev]"
python scripts/train.py --config configs/default.yaml --total-steps 1000000
python scripts/evaluate.py --checkpoint results/best_model.pt --episodes 100
python scripts/visualize.py --checkpoint results/best_model.pt

powergrid-rl/
├── env/
│ ├── powergrid_env.py # Gymnasium environment (core)
│ ├── generators.py # Generator physics models
│ └── demand_model.py # Stochastic demand simulation
├── ppo/
│ ├── network.py # Actor-Critic neural networks
│ ├── buffer.py # Rollout buffer + GAE computation
│ ├── agent.py # PPO algorithm
│ └── trainer.py # Training loop with Rich logging
├── scripts/
│ ├── train.py # Training entry point
│ ├── evaluate.py # Evaluation + metrics report
│ └── visualize.py # Episode replay visualisation
├── tests/
│ ├── test_env.py # Env API compliance + physics tests
│ └── test_agent.py # Agent + buffer unit tests
├── configs/
│ └── default.yaml # All hyperparameters
└── .github/workflows/ci.yml
- Why a power grid? It has real-world stakes (energy transition), rich physics (ramp constraints, frequency dynamics, stochastic renewables), and a non-trivial multi-dimensional action space.
- Why PPO from scratch? Understanding every line (GAE computation, importance-weight clipping, entropy regularisation) is more valuable than calling `stable_baselines3.PPO(...)`.
- Why continuous actions? Discrete dispatch is unrealistic; continuous setpoints with ramp constraints require the agent to learn smooth, physically feasible trajectories.
MIT — see LICENSE.