
Neural Network with heuristic vs Deep Q Network

Artificial Intelligence 7750 - Graduate final project

Relevant papers

Demo video

image

Summary

This project compares a Neural Network with a heuristic against Deep Q-Network (DQN) learning, each training an agent to incrementally improve at playing a Snake game.

Environment

Actions

The snake's 3 possible actions are represented with one-hot encoding, where 1 = take the action and 0 = do not take it.

[1,0,0] = forward (continues in current direction)

[0,1,0] = turn right

[0,0,1] = turn left

State

The state represents 11 conditions using one-hot encoding, with 1 = condition met and 0 = condition not met:

  • Whether danger (the snake colliding with its own body or the game window boundary) lies forward, right, and/or left of the snake.
  • Whether the snake's current direction is left, right, up, or down.
  • Whether the mice (food) is left, right, up, and/or down of the snake (two flags can be set when it is diagonal).
    [danger_forward, danger_right_turn, danger_left_turn,
     going_left, going_right, going_up, going_down,
     mice_left, mice_right, mice_up, mice_down]
    

Ex: state = [0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0] = danger to the left of the snake, snake moving downward, and the mice (food) to the right of and above the snake.
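As an illustration, the 11-element state could be assembled like this (the helper and its signature are hypothetical; the repo's own code may differ):

```python
def build_state(danger_forward, danger_right, danger_left,
                direction, mice_dx, mice_dy):
    """Assemble the 11-element one-hot state described above.

    `direction` is one of "left", "right", "up", "down"; `mice_dx`/`mice_dy`
    are signed pixel offsets from the snake's head to the mice (food),
    with y growing downward as in typical game coordinates.
    """
    return [
        int(danger_forward), int(danger_right), int(danger_left),
        int(direction == "left"), int(direction == "right"),
        int(direction == "up"), int(direction == "down"),
        int(mice_dx < 0),  # mice left of snake
        int(mice_dx > 0),  # mice right of snake
        int(mice_dy < 0),  # mice above snake
        int(mice_dy > 0),  # mice below snake
    ]
```

With danger to the left, the snake heading down, and the mice up-and-right, this reproduces the example vector above.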

Model

image

Neural Network with heuristic

Uses a heuristic function to determine the target action to take:

  1. decided_action = Keep only the action(s) where there is no danger
  2. decided_action = If the mice is in the direction the snake is already heading, return the "go forward" action
  3. decided_action = If the mice is in a direction the snake can turn toward, return that turn
  4. decided_action = If no previous condition matched (danger everywhere), return a random action
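The steps above can be sketched as follows. This is a hedged approximation: step 3 here falls back to any safe turn rather than computing which absolute direction each turn maps to, and all names are illustrative:

```python
import random

def heuristic_action(state):
    """Approximate the 4-step heuristic above. `state` follows the layout
    [danger_fwd, danger_right, danger_left,
     going_left, going_right, going_up, going_down,
     mice_left, mice_right, mice_up, mice_down].
    Returns a one-hot action [forward, turn_right, turn_left]."""
    safe = [i for i in range(3) if not state[i]]   # 1. actions with no danger
    if not safe:                                   # 4. danger everywhere: random
        choice = random.randrange(3)
    else:
        heading = state[3:7]   # one-hot current direction (L, R, U, D)
        mice = state[7:11]     # mice relative position flags (L, R, U, D)
        # 2. mice lies in the direction we're heading -> go forward if safe
        if 0 in safe and any(h and m for h, m in zip(heading, mice)):
            choice = 0
        else:
            # 3. otherwise prefer a safe turn; fall back to any safe action
            turns = [i for i in safe if i != 0]
            choice = turns[0] if turns else safe[0]
    action = [0, 0, 0]
    action[choice] = 1
    return action
```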

image

DQN

Reward & Penalty

  • eat_mice = +10
  • game_over = -10
  • idle_steps_after_long_time = -10 (the idle/useless-step limit is proportional to the snake's length × 100)
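The reward scheme above can be sketched as a single function. The zero reward for an ordinary step is an assumption; the idle-step cutoff of 100 × snake length comes from the text:

```python
def reward(ate_mice, game_over, idle_steps, snake_length):
    """Per-step reward for the DQN agent (sketch of the scheme above)."""
    if game_over:
        return -10
    if ate_mice:
        return +10
    if idle_steps > 100 * snake_length:   # wandering too long without eating
        return -10
    return 0                              # assumption: ordinary steps are neutral
```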

Q learning

Uses the Bellman equation to calculate new Q values: `Q_new(s, a) = r + γ · max_a' Q(s', a')`
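A minimal sketch of this update, assuming the simplified one-step target (no learning-rate term) and a terminal-state special case:

```python
def q_target(reward, q_next_max, gamma=0.9, done=False):
    """Bellman target for the taken action: r + gamma * max_a' Q(s', a').
    Terminal states use the reward alone, since there is no next state."""
    if done:
        return reward
    return reward + gamma * q_next_max
```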

Gamma & Epsilon

epsilon = 80 - m. Random exploration if randint(0, 200) < epsilon; otherwise exploitation.

gamma = 0.9. Results fare better when gamma (the discount factor) is set closer to 1, i.e., future rewards are valued almost as much as immediate rewards.
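The epsilon-greedy rule above can be sketched as follows, assuming `m` is a counter that grows during training (e.g. games played), so exploration decays over time:

```python
import random

def select_action(q_values, m):
    """Epsilon-greedy selection: explore while epsilon = 80 - m is large,
    otherwise exploit by taking the argmax of the model's Q values."""
    epsilon = 80 - m
    action = [0, 0, 0]
    if random.randint(0, 200) < epsilon:                  # explore
        action[random.randrange(3)] = 1
    else:                                                 # exploit
        action[q_values.index(max(q_values))] = 1
    return action
```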

Results comparison

Both experiments ran for 10 minutes.

Conclusion

As can be seen, the Neural Network with heuristic approach improves quickly, but its performance plateaus as time passes. The DQN improves more slowly at first, yet shows a clear and continuous increase in performance, with no sign of plateauing even after 10 minutes.

Neural Network with heuristic

image

image

DQN

image

image

References