Objective: Prevent a pole attached to a cart from falling over.
- State (s): [cart_position, cart_velocity, pole_angle, pole_angular_velocity]
- Action (a): 0 (push left) or 1 (push right)
- Reward: +1 for every timestep the pole remains upright
- Termination: pole angle > ±12° or cart moves > ±2.4 units from center
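For concreteness, here's a minimal sketch of this environment through Gymnasium's CartPole-v1 API (an assumption; the original doesn't name a library). Note that CartPole-v1 also truncates episodes at 500 steps, which caps the maximum score:

```python
import gymnasium as gym  # assuming Gymnasium's CartPole-v1

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
print(obs)               # [cart_position, cart_velocity, pole_angle, pole_angular_velocity]
print(env.action_space)  # Discrete(2): 0 = push left, 1 = push right

obs, reward, terminated, truncated, info = env.step(1)  # push right, reward == 1.0
```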
We'll use linear function approximation for both actor and critic.
We'll use polynomial features to capture interactions between the state variables. The feature vector consists of a bias term, the four state variables, and all of their pairwise products (including squares):

$$\phi(s) = \big[\,1,\ s_1,\ s_2,\ s_3,\ s_4,\ s_1 s_2,\ s_1 s_3,\ \dots,\ s_4^2\,\big]^\top$$
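A minimal sketch of this featurization in NumPy (assuming all pairwise products including squares, as written above, for 1 + 4 + 10 = 15 features):

```python
import numpy as np

def poly_features(s):
    """Bias term, the 4 state variables, and all pairwise products (incl. squares)."""
    s = np.asarray(s, dtype=np.float64)
    pairwise = [s[i] * s[j] for i in range(len(s)) for j in range(i, len(s))]
    return np.concatenate(([1.0], s, pairwise))  # shape: (15,)
```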
We use the softmax policy:

$$\pi(a \mid s, \theta) = \frac{\exp\!\big(\theta_a^\top \phi(s)\big)}{\sum_{b} \exp\!\big(\theta_b^\top \phi(s)\big)}$$

where $\theta_a$ is the actor's weight vector for action $a$ and $\phi(s)$ is the feature vector defined above.
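As a sketch, the softmax policy over the two actions, with one weight vector per action (the shapes are assumptions consistent with the 15-dimensional features above):

```python
def softmax_policy(theta, phi):
    """theta: (2, 15) actor weights; phi: (15,) features. Returns P(a|s) for a in {0, 1}."""
    logits = theta @ phi
    logits = logits - logits.max()  # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()
```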
Linear function approximation for the critic:

$$\hat{v}(s, w) = w^\top \phi(s)$$

where $w$ is the critic's weight vector.
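In code the critic is a single dot product:

```python
def value(w, phi):
    """Linear state-value estimate: v_hat(s, w) = w . phi(s)."""
    return float(w @ phi)
```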
Update Rules:

- TD Error:
  $$\delta_t = r_{t+1} + \gamma\,\hat{v}(s_{t+1}, w) - \hat{v}(s_t, w)$$
- Critic Update (Semi-gradient TD(0)):
  $$w \leftarrow w + \alpha_w\,\delta_t\,\phi(s_t)$$
- Actor Update (Policy Gradient):
  $$\theta \leftarrow \theta + \alpha_\theta\,\delta_t\,\nabla_\theta \ln \pi(a_t \mid s_t, \theta)$$

where the gradient of the log-softmax policy with respect to each action's weight vector is:

$$\nabla_{\theta_{a'}} \ln \pi(a_t \mid s_t, \theta) = \big(\mathbb{1}[a' = a_t] - \pi(a' \mid s_t, \theta)\big)\,\phi(s_t)$$

Here $\mathbb{1}[a' = a_t]$ equals 1 when $a'$ is the action actually taken and 0 otherwise; $\gamma$ is the discount factor, and $\alpha_w$, $\alpha_\theta$ are the critic and actor step sizes.
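Putting the three rules together, here's a sketch of one actor-critic update, reusing softmax_policy from above (gamma and the step sizes alpha_w, alpha_theta are assumed hyperparameters, not values from the original):

```python
def actor_critic_step(theta, w, phi_t, a_t, r_t1, phi_t1, terminated,
                      gamma=0.99, alpha_w=0.01, alpha_theta=0.001):
    """One semi-gradient TD(0) actor-critic update; returns the TD error."""
    v_t1 = 0.0 if terminated else float(w @ phi_t1)  # terminal states have value 0
    delta = r_t1 + gamma * v_t1 - float(w @ phi_t)   # TD error
    w += alpha_w * delta * phi_t                     # critic: semi-gradient TD(0)
    probs = softmax_policy(theta, phi_t)
    for a in range(theta.shape[0]):                  # actor: log-softmax gradient
        indicator = 1.0 if a == a_t else 0.0
        theta[a] += alpha_theta * delta * (indicator - probs[a]) * phi_t
    return delta
```

Inside the episode loop this is called once per transition; tracking the returned TD errors is also a handy diagnostic for the instability discussed below.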
The graph shows three distinct phases:
- The Ascent (Episodes 0-750): The agent starts with no knowledge. The moving average reward slowly and noisily climbs as both the actor and critic begin to form a useful model of the environment.
- The Peak (Episodes 750-1250): The agent reaches a point where it can solve the task, achieving the maximum score, but its performance remains highly unstable.
- The Collapse (Episodes 1250+): The moving average plummets.
This happens because the agent seems to "forget" everything it learned: a series of updates, driven by the method's inherent instability, pushes the policy or value-function parameters into a bad region from which it is difficult to recover.