Right now the neural network used for the Q(s,a) function always represents actions the same way: one graded (scalar) input per action dimension. We should add the following options:
- treat each action as a separate output
- encode each action as a "one hot" input
- encode some action types as "one hot" and others as graded
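
To make the options concrete, here is a rough sketch of how each encoding could build the network input (or index the network output). It is illustrative only: the function names, the `n_values` argument (number of discrete values per action dimension, assumed to be at least 2), and the scaling of graded inputs to [0, 1] are all assumptions, not the current code.

```python
import numpy as np

# Hypothetical helpers; none of these names come from the existing code base.
# Assumes discrete actions given as one integer index per action dimension.

def graded_input(state, action, n_values):
    """Current behaviour (as assumed here): one scalar per action dimension,
    scaled to [0, 1] and appended to the state vector."""
    scaled = np.asarray(action, dtype=float) / (np.asarray(n_values) - 1)
    return np.concatenate([state, scaled])

def one_hot_input(state, action, n_values):
    """Every action dimension becomes a one-hot vector of length n."""
    parts = [np.eye(n)[a] for a, n in zip(action, n_values)]
    return np.concatenate([state, *parts])

def mixed_input(state, action, n_values, one_hot_dims):
    """Dimensions listed in one_hot_dims are one-hot, the rest stay graded."""
    parts = []
    for i, (a, n) in enumerate(zip(action, n_values)):
        if i in one_hot_dims:
            parts.append(np.eye(n)[a])
        else:
            parts.append(np.array([a / (n - 1)]))
    return np.concatenate([state, *parts])

def output_index(action, n_values):
    """Separate-output variant: the network sees only the state and emits one
    Q-value per joint action; this maps an action to its output unit."""
    return int(np.ravel_multi_index(tuple(action), tuple(n_values)))

def n_outputs(n_values):
    """Number of output units needed for the separate-output variant."""
    return int(np.prod(n_values))
```

For a 2-dimensional state and two action dimensions with 3 and 2 values, the graded input has 4 entries, the one-hot input has 7, and the separate-output network needs 6 outputs. The separate-output count grows with the product of `n_values`, which is one reason the mixed option may be worth having.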