
This page enumerates the different configurable parameters in Qualia and describes their role.

Q-Learning / Sarsa parameters

Reference class: QLearningAgent

Discount factor (γ) ("gamma")

  • Variable: gamma (float)
  • Range: [0, 1) (should not exceed 1, otherwise learning may diverge)
  • Typical: [0.9, 1)

"The discount factor determines the importance of future rewards. A factor of 0 will make the agent "opportunistic" by only considering current rewards, while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the Q values may diverge."

Source: http://en.wikipedia.org/wiki/Q-learning#Discount_factor
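
For intuition, here is a minimal stand-alone sketch (not part of the Qualia API) showing how γ weights a sequence of rewards: with γ = 0 only the immediate reward counts, while values close to 1 make distant rewards almost as important as immediate ones.

```cpp
#include <cstdio>
#include <vector>

// Discounted return of a reward sequence: G = r0 + γ·r1 + γ²·r2 + ...
float discountedReturn(const std::vector<float>& rewards, float gamma) {
  float g = 0.0f, weight = 1.0f;
  for (float r : rewards) {
    g += weight * r;
    weight *= gamma;
  }
  return g;
}

int main() {
  std::vector<float> rewards(5, 1.0f); // five consecutive rewards of 1
  std::printf("gamma = 0.0 -> %.2f\n", discountedReturn(rewards, 0.0f)); // 1.00 (only current reward)
  std::printf("gamma = 0.9 -> %.2f\n", discountedReturn(rewards, 0.9f)); // 4.10 (future rewards matter)
  return 0;
}
```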

Trace decay (λ)

  • Variable: lambda (float)
  • Range: [0, 1]
  • Typical: (0, 0.1]

"Heuristic parameter controlling the temporal credit assignment of how an error detected at a given time step feeds back to correct previous estimates. When λ = 0, no feedback occurs beyond the current time step, while when λ = 1, the error feeds back without decay arbitrarily far in time. Intermediate values of lambda provide a smooth way to interpolate between these two limiting cases."

Source: http://www.research.ibm.com/massive/tdl.html
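
The stand-alone sketch below (a generic accumulating-traces update, not Qualia's actual implementation) shows how λ controls how far back the feedback reaches: at each step every trace is decayed by γλ, and each parameter is corrected in proportion to its trace.

```cpp
#include <vector>

// Generic accumulating eligibility traces (illustrative only, not Qualia's code).
// Called once per time step, after observing the TD error.
void eligibilityStep(std::vector<float>& weights, std::vector<float>& traces,
                     int currentIndex, float tdError,
                     float learningRate, float gamma, float lambda) {
  traces[currentIndex] += 1.0f;              // mark the state-action just visited
  for (size_t i = 0; i < weights.size(); ++i) {
    weights[i] += learningRate * tdError * traces[i]; // credit previous steps
    traces[i]  *= gamma * lambda;            // lambda = 0: no feedback beyond this step
  }
}
```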

Off-policy vs on-policy learning

NOTE: Off-policy learning is no longer supported in QLearningAgent because it is known to diverge when combined with function approximators such as neural networks.

  • Variable: offPolicy (boolean)
  • Default: false

"An off-policy learner learns the value of the optimal policy independently of the agent's actions. An on-policy learner learns the value of the policy being carried out by the agent, including the exploration steps." Source: http://artint.info/html/ArtInt_268.html

This variable controls whether to use the off-policy learning algorithm (Q-Learning) or the on-policy algorithm (Sarsa).

NOTE: Off-policy learning should always be used when training on a pre-generated dataset. When the agent is trained online (e.g. in real time), the on-policy algorithm will give the agent better online performance at the expense of settling on a sub-optimal solution. Conversely, the off-policy strategy will converge to the optimal solution but will usually show lower online performance, since it makes mistakes more often.
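
The difference between the two algorithms boils down to the TD target they use. The hypothetical helper below (not the actual QLearningAgent code) illustrates it: Q-Learning bootstraps on the best next action, while Sarsa bootstraps on the action the policy actually took.

```cpp
#include <algorithm>
#include <vector>

// Q-Learning (off-policy): target = r + γ · max_a Q(s', a)
// Sarsa      (on-policy):  target = r + γ · Q(s', a'), where a' is the action
//                          actually selected by the policy (exploration included).
float tdTarget(float reward, float gamma,
               const std::vector<float>& nextQ, // Q(s', ·) for every action
               int nextAction,                  // a' actually chosen
               bool offPolicy) {
  if (offPolicy)
    return reward + gamma * *std::max_element(nextQ.begin(), nextQ.end());
  return reward + gamma * nextQ[nextAction];
}
```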

Policy

  • Variable: policy (Policy*)
  • Reference classes: Policy, QLearningPolicy, QLearningEGreedyPolicy, QLearningSoftmaxPolicy

See section on policies below for a complete description of the policies and their parameters.

Policies

Reference classes: Policy, QLearningPolicy, QLearningEGreedyPolicy, QLearningSoftmaxPolicy

The policy is the strategy an agent follows to choose its next action. In reinforcement learning applications, we typically use two kinds of policies: ε-greedy and softmax.

ε-greedy

Reference class: QLearningEGreedyPolicy

"Most of the time the action with the highest estimated reward is chosen, called the greediest action. Every once in a while, say with a small probability ε, an action is selected at random. The action is selected uniformly, independant of the action-value estimates. This method ensures that if enough trials are done, each action will be tried an infinite number of times, thus ensuring optimal actions are discovered."

Source: http://www.cse.unsw.edu.au/~cs9417ml/RL1/tdlearning.html

Epsilon (ε)

  • Variable: epsilon (float)
  • Range: [0, 1]
  • Typical: (0, 0.1]

The larger the value, the more random moves the agent will take, which results in more exploration but poorer exploitation (lower short-term rewards). The lower the value, the "greedier" the agent is: it will tend to exploit its knowledge more but will explore less. If ε = 1, the agent takes purely random actions, while if ε = 0 it only takes greedy actions (i.e. the actions it thinks are best).

A good learning strategy is to start training with a higher ε and slowly decrease it. Once the agent is trained, you can even stop it from exploring altogether by setting its ε to zero (0).
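
As a rough sketch of the selection rule (illustrative only; the actual logic lives in QLearningEGreedyPolicy):

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// With probability epsilon, pick a uniformly random action;
// otherwise pick the action with the highest estimated value.
int eGreedyAction(const std::vector<float>& q, float epsilon) {
  float u = std::rand() / (float)RAND_MAX;
  if (u < epsilon)
    return std::rand() % (int)q.size();                            // explore
  return (int)(std::max_element(q.begin(), q.end()) - q.begin()); // exploit
}
```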

Softmax

Reference class: QLearningSoftmaxPolicy

One drawback of ε-greedy is that it selects random actions uniformly. "The worst possible action is just as likely to be selected as the second best. Softmax remedies this by assigning a rank or weight to each of the actions, according to their action-value estimate. A random action is selected with regards to the weight associated with each action, meaning the worst actions are unlikely to be chosen. This is a good approach to take where the worst actions are very unfavourable."

Source: http://www.cse.unsw.edu.au/~cs9417ml/RL1/tdlearning.html

Temperature (τ)

  • Variable: temperature (float)
  • Range: (0, ∞)
  • Typical: around 1
  • Default: 1

"For high temperatures (τ → ∞), all actions have nearly the same probability and the lower the temperature, the more expected rewards affect the probability. For a low temperature (τ → 0+), the probability of the action with the highest expected reward tends to 1 [ie. the policy becomes "greedy"]."

Source: http://en.wikipedia.org/wiki/Softmax_activation_function
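
The sketch below (illustrative only, not QLearningSoftmaxPolicy itself) computes the Boltzmann action probabilities for a given temperature: a large τ flattens the distribution towards uniform, a small τ concentrates it on the greedy action.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// P(a) = exp(Q(a)/τ) / Σ_b exp(Q(b)/τ)
std::vector<float> softmaxProbabilities(const std::vector<float>& q, float temperature) {
  std::vector<float> p(q.size());
  float maxQ = *std::max_element(q.begin(), q.end()); // subtracted for numerical stability
  float sum = 0.0f;
  for (size_t i = 0; i < q.size(); ++i) {
    p[i] = std::exp((q[i] - maxQ) / temperature);
    sum += p[i];
  }
  for (float& pi : p)
    pi /= sum;
  return p;
}
```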

Epsilon (softmax class) (ε)

  • Variable: epsilon (float)
  • Range: [0, 1]
  • Typical: (0, 0.1]
  • Default: 0

The QLearningSoftmaxPolicy class also has an ε parameter which acts exactly as in the ε-greedy policy, i.e. there is a probability ε of simply taking a random action. By default it is set to zero (pure softmax policy).

Neural network (stochastic gradient)

Reference class: NeuralNetwork

Learning rate

  • Variable: learningRate (float)
  • Range: [0, 1]
  • Typical: (0, 0.1]
  • Default: 0.01

"The learning rate is used to adjust the speed of training. The higher the learning rate the faster the network is trained. However, the network has a better chance of being trained to a local minimum solution. A local minimum is a point at which the network stabilizes on a solution which is not the most optimal global solution." Source: http://pages.cs.wisc.edu/~bolo/shipyard/neural/tort.html

In the case of reinforcement learning, "the learning rate determines to what extent the newly acquired information will override the old information. A factor of 0 will make the agent not learn anything, while a factor of 1 would make the agent consider only the most recent information." Source: http://en.wikipedia.org/wiki/Q-learning#Learning_rate
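
In terms of stochastic gradient descent, the learning rate simply scales each weight correction. A generic sketch (not the NeuralNetwork class itself):

```cpp
#include <vector>

// w[i] <- w[i] - learningRate * dE/dw[i]
void sgdStep(std::vector<float>& weights, const std::vector<float>& gradient,
             float learningRate) {
  for (size_t i = 0; i < weights.size(); ++i)
    weights[i] -= learningRate * gradient[i];
}
```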

Number of hidden neurons

  • Variable: none (specified in the class constructor)
  • Range: (0, ∞)
  • Typical: 3 to 10 (could be more depending on the application and the number of available samples)

The number of hidden neurons modifies the capacity of the neural network to adapt to the data. The more hidden neurons there are, the more complex the network's equation is. The number of hidden neurons should be neither too large nor too small. "Having too many hidden neurons is analogous to a system of equations with more equations than there are free variables: the system is over specified, and is incapable of generalization. Having too few hidden neurons, conversely, can prevent the system from properly fitting the input data, and reduces the robustness of the system." Source: http://en.wikibooks.org/wiki/Artificial_Neural_Networks/Neural_Network_Basics#Number_of_neurons_in_the_hidden_layer

Decrease constant (optional)

  • Variable: decreaseConstant (float)
  • Range: [0, 1]
  • Typical: (0, 1e-3]
  • Default: 0

The decrease constant slowly reduces the learning rate during gradient descent to help convergence to a better minimum. At iteration step t, the learning rate used for the gradient descent is: learningRate / (1 + t * decreaseConstant). The higher the decrease constant, the faster the learning rate converges to zero (0). A decrease constant of zero (0) leaves the learning rate unchanged.
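
For example, assuming the default learningRate of 0.01 and a decreaseConstant of 1e-3 (hypothetical values for illustration), the effective learning rate evolves as follows:

```cpp
// Effective learning rate at iteration t (formula above, illustrative helper):
float effectiveLearningRate(float learningRate, float decreaseConstant, long t) {
  return learningRate / (1.0f + t * decreaseConstant);
}
// learningRate = 0.01, decreaseConstant = 1e-3:
//   t = 0      -> 0.01
//   t = 1000   -> 0.005
//   t = 10000  -> ~0.0009
```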

Weight decay (optional)

  • Variable: weightDecay (float)
  • Range: [0, 1]
  • Typical: (0, 1e-3]
  • Default: 0

Weight decay is a simple regularization method that penalizes large weights, thus limiting the effective freedom of the model so as to prevent over-fitting (in other words, to obtain better generalization).
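
Concretely, a standard form of weight decay adds a term to the gradient step that pulls every weight towards zero (generic sketch, same caveats as above):

```cpp
#include <vector>

// w[i] <- w[i] - learningRate * (dE/dw[i] + weightDecay * w[i])
void sgdStepWithDecay(std::vector<float>& weights, const std::vector<float>& gradient,
                      float learningRate, float weightDecay) {
  for (size_t i = 0; i < weights.size(); ++i)
    weights[i] -= learningRate * (gradient[i] + weightDecay * weights[i]);
}
```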
