I am having trouble understanding the SARSA algorithm: http://en.wikipedia.org/wiki/SARSA
In particular, when updating the Q value, what is gamma, and what values are used for s(t+1) and a(t+1)?
Can someone explain this algorithm to me?
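Gamma (γ) is the discount factor: it determines how much the estimated value of the next state-action pair counts relative to the immediate reward. With γ = 0.0 the algorithm only cares about the immediate reward; with values close to 1.0 it weighs long-term reward almost as heavily as immediate reward. How much memory your algorithm has is determined by the other parameter in the update, the learning rate alpha (α). If you set α to 0.0, then your algorithm will not update the value function Q at all. If you set it to 1.0, then each new experience completely overwrites the previous estimate. The best values for both usually lie in between and have to be determined experimentally.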
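For reference, this is the update rule as I understand it from the linked Wikipedia page, written out so the role of each parameter is visible. The symbols follow the question's notation; the name $\delta_t$ for the error term is my own shorthand, not something from the page.

```latex
\begin{aligned}
\delta_t &= r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \\
Q(s_t, a_t) &\leftarrow Q(s_t, a_t) + \alpha \, \delta_t
\end{aligned}
```

Here $\alpha$ scales how far the old estimate moves toward the new target, and $\gamma$ scales how much the next pair's estimated value contributes to that target.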
Here is how it works:
- In your first step, you just get a state. Simply store it as s(t). Also, use your policy (for example, ε-greedy with respect to your value function Q) to choose an action for this state and store it as a(t).
- In each subsequent step, you get r(t+1) and s(t+1). Again, use your policy to choose an action for the new state: a(t+1). These are the s(t+1) and a(t+1) that go into the update. The error for the transition from your previous action to the new one is r(t+1) + γ·Q(s(t+1), a(t+1)) - Q(s(t), a(t)), where γ is the discount factor. Add α times this error (α being the learning rate) to your long-term estimate of the previous action's value, Q(s(t), a(t)). Finally, store s(t+1) and a(t+1) as s(t) and a(t) for the next step.
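The two steps above can be sketched in Python roughly as follows. This is a minimal illustration under my own assumptions, not a reference implementation: the corridor environment in `step`, the ε-greedy policy, and all parameter values are hypothetical choices made only to get something runnable.

```python
import random

def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """With probability epsilon pick a random action, else a greedy one.
    Ties between equally valued actions are broken at random."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    values = [Q[(state, a)] for a in range(n_actions)]
    best = max(values)
    return rng.choice([a for a, v in enumerate(values) if v == best])

def step(state, action, n_states):
    """Hypothetical toy environment: a corridor of n_states cells.
    Action 1 moves right, action 0 moves left. Reaching the last
    cell gives reward 1 and ends the episode."""
    if action == 1:
        next_state = min(state + 1, n_states - 1)
    else:
        next_state = max(state - 1, 0)
    done = next_state == n_states - 1
    reward = 1.0 if done else 0.0
    return reward, next_state, done

def sarsa(n_states=5, n_actions=2, episodes=500,
          alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(n_states) for a in range(n_actions)}
    for _ in range(episodes):
        # First step: observe s(t) and choose a(t) with the policy.
        s = 0
        a = epsilon_greedy(Q, s, n_actions, epsilon, rng)
        done = False
        while not done:
            # Subsequent steps: observe r(t+1), s(t+1), choose a(t+1).
            r, s_next, done = step(s, a, n_states)
            a_next = epsilon_greedy(Q, s_next, n_actions, epsilon, rng)
            # A terminal state has no future value to bootstrap from.
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # Shift: s(t+1), a(t+1) become s(t), a(t) for the next step.
            s, a = s_next, a_next
    return Q

Q = sarsa()
```

Note that `a_next` is the action the agent will actually take next, which is what makes this on-policy; swapping it for the greedy action in the target would turn the sketch into Q-learning instead.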
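In effect, the value function is just a running average, kept separately for each action in every state, of the targets r(t+1) + γ·Q(s(t+1), a(t+1)) seen so far, with recent targets weighted more heavily than old ones.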
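To make the running-average view concrete, here is a small sketch for a single state-action pair. The function name `running_average` and the list of target values are made up for illustration; `alpha` plays the role of the step size α from the Wikipedia update rule.

```python
def running_average(targets, alpha=None):
    """Fold targets into an estimate with q += alpha * (target - q).
    If alpha is None, use a step size of 1/n, which gives the plain mean."""
    q = 0.0
    for n, target in enumerate(targets, start=1):
        step_size = alpha if alpha is not None else 1.0 / n
        q += step_size * (target - q)
    return q

targets = [1.0, 0.0, 2.0, 1.0]                    # made-up update targets
mean_estimate = running_average(targets)          # step size 1/n: sample mean
recent_estimate = running_average(targets, 0.5)   # constant step: recency-weighted
frozen_estimate = running_average(targets, 0.0)   # never moves from 0.0
last_only = running_average(targets, 1.0)         # always equals the last target
```

A constant step size is the usual choice in practice because the targets themselves change as Q improves, so older targets deserve less weight.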