This project implements a Deep Q-Network (DQN) agent for an environment with continuous, pixel-level observations and a discrete action space. The goal was to learn a near-optimal policy directly from pixel-level observations while ensuring stable, convergent training within a limited budget.
- Train a reinforcement learning agent to maximize long-term rewards.
- Use DQN components to stabilize training and balance exploration and exploitation.
- Evaluate performance across internal validation and official challenge submissions.
- Observations: Pixel-level environment frames.
- DQN components (a sketch of how these pieces fit together follows this list):
- Replay buffer (capacity of 100,000 transitions).
- Target network with soft updates.
- ε-greedy exploration: ε annealed from 1.0 → 0.2 over 100,000 steps.
- Delayed training start after 2,000 steps to ensure buffer diversity.
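
A minimal sketch of these components is shown below. The buffer capacity, ε schedule endpoints, and warm-up threshold follow the values above; the PyTorch framework, the linear annealing shape, and the Polyak coefficient τ = 0.005 are assumptions, since the report does not specify them.

```python
import random
from collections import deque

import numpy as np
import torch


class ReplayBuffer:
    """Fixed-size FIFO buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)


def epsilon_by_step(step, eps_start=1.0, eps_end=0.2, anneal_steps=100_000):
    """ε schedule: 1.0 → 0.2 over 100,000 steps (linear shape is an assumption)."""
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)


def soft_update(online_net, target_net, tau=0.005):
    """Polyak-average the online weights into the target network (τ value is an assumption)."""
    with torch.no_grad():
        for p, p_target in zip(online_net.parameters(), target_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p)
```

The delayed training start simply means no gradient updates are taken until 2,000 environment steps have filled the buffer.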
- Hyperparameters (used in the single-update sketch after this list):
- Minibatch size: 256
- Learning rate: 0.0001
- Discount factor (γ): 0.99
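
With these values, one gradient step looks roughly like the sketch below. The Huber loss, the Adam optimizer, and the tensor handling are assumptions; the report only fixes the batch size, learning rate, and discount factor.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99
BATCH_SIZE = 256
LEARNING_RATE = 1e-4


def dqn_update(online_net, target_net, optimizer, buffer, device="cpu"):
    """One Q-learning update on a sampled minibatch (Huber loss assumed)."""
    states, actions, rewards, next_states, dones = buffer.sample(BATCH_SIZE)

    states = torch.as_tensor(states, dtype=torch.float32, device=device)
    actions = torch.as_tensor(actions, dtype=torch.int64, device=device)
    rewards = torch.as_tensor(rewards, dtype=torch.float32, device=device)
    next_states = torch.as_tensor(next_states, dtype=torch.float32, device=device)
    dones = torch.as_tensor(dones, dtype=torch.float32, device=device)

    # Q(s, a) for the actions actually taken.
    q_values = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped target: r + γ · max_a' Q_target(s', a'), zeroed at terminal states.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + GAMMA * (1.0 - dones) * next_q

    loss = F.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The optimizer would be constructed once, e.g. `torch.optim.Adam(online_net.parameters(), lr=LEARNING_RATE)` (optimizer choice is likewise an assumption).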
- Training: 1,000 episodes total; an outline of the episode loop follows below.
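
The loop below outlines how the pieces above connect over 1,000 episodes, reusing the helpers sketched earlier. The Gymnasium-style `reset`/`step` API and the one-update-per-environment-step schedule are assumptions not stated in the report.

```python
import random

import torch


def train(env, online_net, target_net, optimizer, buffer,
          num_episodes=1_000, warmup_steps=2_000, device="cpu"):
    """1,000-episode training loop with ε-greedy acting and soft target updates."""
    episode_returns, losses = [], []
    global_step = 0

    for _ in range(num_episodes):
        state, _ = env.reset()
        done, episode_return = False, 0.0

        while not done:
            # ε-greedy action selection with the annealed schedule.
            if random.random() < epsilon_by_step(global_step):
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    obs = torch.as_tensor(state, dtype=torch.float32, device=device)
                    action = int(online_net(obs.unsqueeze(0)).argmax(dim=1))

            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            buffer.push(state, action, reward, next_state, float(done))
            state = next_state
            episode_return += reward
            global_step += 1

            # Delayed start: updates begin only after 2,000 environment steps.
            if global_step >= warmup_steps:
                losses.append(dqn_update(online_net, target_net, optimizer, buffer, device))
                soft_update(online_net, target_net)

        episode_returns.append(episode_return)
    return episode_returns, losses
```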
- Monitoring: Reward and loss plots were used to track convergence (plotting sketch below).
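
The monitoring amounts to something like the following; matplotlib and the two-panel layout are assumptions, as the report only states that reward and loss curves were tracked.

```python
import matplotlib.pyplot as plt


def plot_training_curves(episode_returns, losses):
    """Per-episode return and per-update TD loss, used to eyeball convergence."""
    fig, (ax_r, ax_l) = plt.subplots(1, 2, figsize=(10, 4))
    ax_r.plot(episode_returns)
    ax_r.set_xlabel("Episode")
    ax_r.set_ylabel("Return")
    ax_l.plot(losses)
    ax_l.set_xlabel("Update step")
    ax_l.set_ylabel("TD loss")
    fig.tight_layout()
    plt.show()
```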
- Agent performance improved rapidly in early episodes and stabilized at high reward levels (>0.9).
- The loss curve followed expected DQN dynamics: initial instability, then structured updates with persistent variance, since the bootstrapped TD targets keep shifting as the target network is updated.
- Final evaluation (an evaluation-loop sketch follows this list):
- Internal test set: average return 0.9632 over 50 episodes.
- Challenge server: 0.964 score.
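
The internal figure was obtained by running the greedy policy for 50 episodes; an evaluation loop of roughly this shape would produce it (the environment API follows the same Gymnasium-style assumptions as the training sketch).

```python
import torch


def evaluate(env, online_net, num_episodes=50, device="cpu"):
    """Average undiscounted return of the greedy (ε = 0) policy over 50 episodes."""
    returns = []
    for _ in range(num_episodes):
        state, _ = env.reset()
        done, total = False, 0.0
        while not done:
            with torch.no_grad():
                obs = torch.as_tensor(state, dtype=torch.float32, device=device)
                action = int(online_net(obs.unsqueeze(0)).argmax(dim=1))
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)
```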
Takeaway: The chosen hyperparameters and DQN design yielded a robust policy that converged quickly and matched the challenge benchmark.