This repository contains a custom implementation of a PPO-based vision agent focused on applied reinforcement learning for visual control. The agent is trained end-to-end on raw-pixel inputs to navigate procedurally generated environments; the project also includes automated hyperparameter sweeps and reproducible evaluation workflows.
- Orchestrated PPO-based vision agent training for 10M+ environment steps on raw-pixel inputs; achieved a 75% success rate on the StarPilot benchmark for autonomous navigation.
- Implemented 4-frame temporal stacking to resolve partial observability, generating a 45% reward increase via velocity and direction inference from sequential image data.
- Automated experimentation via Weights & Biases for 50+ hyperparameter sweeps; streamlined training iteration cycles by 60% through modular pipeline architecture.
The agent uses an actor-critic PPO objective with an Impala-style convolutional encoder and temporal context aggregation for policy learning from vision.
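The clipped surrogate at the heart of that objective can be sketched numerically. This is a minimal NumPy illustration, not the repository's training code; the clip coefficient of 0.2 is an assumed default:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_coef=0.2):
    """Clipped PPO surrogate: -E[min(r*A, clip(r, 1-eps, 1+eps)*A)]."""
    ratio = np.exp(logp_new - logp_old)  # importance ratio r_t
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_coef, 1 + clip_coef) * advantages
    # Pessimistic bound: take the smaller of the two objectives, negate for a loss.
    return -np.mean(np.minimum(unclipped, clipped))
```

When the updated policy drifts too far from the one that collected the data, clipping removes the incentive to move further, which keeps each policy update conservative.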
| Component | Configuration | Purpose |
|---|---|---|
| Input observations | 64x64x3 RGB frames | Preserves high-information visual state directly from the environment |
| Visual encoder | Impala CNN with residual blocks (`ResidualBlock`, `ConvSequence`) | Learns robust spatial representations under procedural variability |
| Temporal context | 4-frame stacking (t-3 to t) | Provides motion and trajectory cues from sequential image dynamics |
| Policy head | Categorical actor over discrete actions | Outputs navigation actions for StarPilot control |
| Value head | Scalar critic | Estimates state value for PPO advantage learning |
The network backbone follows the Impala CNN pattern: convolution, downsampling, and residual blocks repeated across stages. This architecture improves representation quality and optimization stability when learning from high-dimensional visual observations.
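A minimal PyTorch sketch of this pattern, using the `ResidualBlock` and `ConvSequence` names mentioned above. The channel counts and the three-stage layout are illustrative assumptions, not the exact repository configuration:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convs with a skip connection; preserves spatial size."""
    def __init__(self, channels):
        super().__init__()
        self.conv0 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        out = self.conv0(torch.relu(x))
        out = self.conv1(torch.relu(out))
        return x + out

class ConvSequence(nn.Module):
    """One Impala stage: conv -> 2x downsample -> two residual blocks."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.res0 = ResidualBlock(out_channels)
        self.res1 = ResidualBlock(out_channels)

    def forward(self, x):
        x = self.pool(self.conv(x))
        return self.res1(self.res0(x))

# Three stages halve the 64x64 input each time, yielding 8x8 feature maps.
# 12 input channels = 4 stacked RGB frames (assumed layout).
encoder = nn.Sequential(
    ConvSequence(3 * 4, 16),
    ConvSequence(16, 32),
    ConvSequence(32, 32),
)
```

The skip connections keep gradients flowing through the deep stack, which is the main reason this backbone optimizes more stably than a plain CNN on pixel observations.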
At each policy step, the pipeline constructs state from the four most recent frames. This temporal window allows the model to infer direction, relative speed, and short-horizon trajectories that are not observable from a single image.
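The stacking step can be sketched as a rolling buffer. This is a simplified NumPy illustration; the actual pipeline lives in `cleanrl/ppo_procgen.py`, and the `FrameStacker` name is hypothetical:

```python
from collections import deque
import numpy as np

class FrameStacker:
    """Keeps the 4 most recent frames and concatenates them along channels."""
    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, frame):
        # On episode start, fill the buffer with copies of the first frame.
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return self.observation()

    def step(self, frame):
        self.frames.append(frame)  # oldest frame (t-3) drops out automatically
        return self.observation()

    def observation(self):
        # Four (64, 64, 3) frames -> one (64, 64, 12) state, ordered t-3 .. t.
        return np.concatenate(list(self.frames), axis=-1)
```

Because the deque has a fixed length, each `step` call both appends the newest frame and evicts the oldest, so the policy always sees a constant-size observation.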
| Path | Description |
|---|---|
| `cleanrl/ppo_procgen.py` | Core training engine with temporal stacking and Impala CNN |
| `cleanrl_utils/` | Utilities for W&B tracking and hyperparameter sweeps |
| `record_hd.py` | Custom evaluation script for HD rendering and inference with the trained `.cleanrl_model` |
| `streamlit_app.py` | Interactive dashboard for run analysis |
| `assets/` | Pre-trained weights and demo footage |
Set up the environment and install dependencies:

```shell
python -m venv .venv
# Linux/macOS
source .venv/bin/activate
# Windows PowerShell
# .venv\Scripts\Activate.ps1
pip install -r requirements.txt
pip install -r requirements/requirements-procgen.txt
```

Run HD evaluation with the pre-trained model:

```shell
python record_hd.py
```

Launch training:

```shell
python cleanrl/ppo_procgen.py \
  --env-id starpilot \
  --total-timesteps 10000000 \
  --track \
  --wandb-project-name <your-project> \
  --wandb-entity <your-entity>
```

Architecture base: core PPO algorithms adapted from the CleanRL library, modified for custom temporal stacking and Impala CNN feature extraction.


