Transcript of "Outline: MDP (brief) – Background – Learning MDP – Q learning – Game theory (brief) – Background..."
Posted 15-Jan-2016
Outline
• MDP (brief)
  – Background
  – Learning MDP
• Q learning
• Game theory (brief)
  – Background
• Markov games (2-player)
  – Background
  – Learning Markov games
• Littman’s Minimax Q learning (zero-sum)
• Hu & Wellman’s Nash Q learning (general-sum)
MDP / SG / POSG
• Stochastic games (SG)
• Partially observable SG (POSG)
V(s) = max_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V(s') ]
(immediate reward, plus the expectation over next states of the value of next state)
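The pieces labeled above (immediate reward, expectation over next states, value of next state) combine into the Bellman backup. A minimal sketch with value iteration, on a hypothetical two-state MDP whose transitions and rewards are invented for illustration:

```python
# Value iteration: V(s) = max_a [ R(s,a) + gamma * sum_{s'} T(s,a,s') * V(s') ]
# The two-state MDP below is a made-up example, not from the slides.
GAMMA = 0.9

# T[s][a] maps next state -> probability; R[s][a] is the immediate reward.
T = {
    0: {"stay": {0: 1.0}, "go": {0: 0.2, 1: 0.8}},
    1: {"stay": {1: 1.0}, "go": {0: 1.0}},
}
R = {
    0: {"stay": 0.0, "go": 0.0},
    1: {"stay": 1.0, "go": 0.0},
}

def value_iteration(T, R, gamma=GAMMA, tol=1e-8):
    V = {s: 0.0 for s in T}
    while True:
        V_new = {}
        for s in T:
            V_new[s] = max(
                R[s][a] + gamma * sum(p * V[sp] for sp, p in T[s][a].items())
                for a in T[s]
            )
        if max(abs(V_new[s] - V[s]) for s in T) < tol:
            return V_new
        V = V_new

V = value_iteration(T, R)
# State 1 earns 1 per step by staying: V(1) = 1/(1-0.9) = 10.
# State 0's best action is "go": V(0) = 0.9*(0.2*V(0) + 0.8*V(1)) = 7.2/0.82.
```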
• Model-based reinforcement learning:
  1. Learn the reward function and the state transition function
  2. Solve for the optimal policy
• Model-free reinforcement learning:
  1. Directly learn the optimal policy without knowing the reward function or the state transition function
T(s, a, s') ≈ n(s, a, s') / n(s, a)
  n(s, a): #times action a has been executed in state s
  n(s, a, s'): #times action a causes state transition s → s'
R(s, a) ≈ (total reward accrued when applying a in s) / n(s, a)
v(s') is then obtained by solving the estimated model
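These counts can be sketched in code. The dynamics below (state 0, action "go", a 70/30 split between two next states, constant reward 1) are invented for illustration:

```python
import random
from collections import defaultdict

# Count-based estimation of T and R for one state-action pair
# (model-based RL, step 1). Dynamics are a made-up example.
random.seed(0)

n_sa = defaultdict(int)        # #times action a has been executed in state s
n_sas = defaultdict(int)       # #times a caused the transition s -> s'
total_r = defaultdict(float)   # total reward accrued when applying a in s

for _ in range(10000):
    s, a = 0, "go"
    s_next = 1 if random.random() < 0.7 else 2   # true T(0,"go",1) = 0.7
    n_sa[(s, a)] += 1
    n_sas[(s, a, s_next)] += 1
    total_r[(s, a)] += 1.0                       # true R(0,"go") = 1.0

T_hat = n_sas[(0, "go", 1)] / n_sa[(0, "go")]   # close to 0.7
R_hat = total_r[(0, "go")] / n_sa[(0, "go")]    # exactly 1.0
```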
1. Start with arbitrary initial values of Q(s, a), for all s ∈ S, a ∈ A
2. At each time t the agent chooses an action and observes its reward r_t
3. The agent then updates its Q-values based on the Q-learning rule:
   Q(s_t, a_t) ← (1 − α_t) Q(s_t, a_t) + α_t [r_t + γ max_a Q(s_{t+1}, a)]
4. The learning rate α_t needs to decay over time in order for the learning algorithm to converge
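A minimal sketch of these steps on a hypothetical 4-state corridor (states 0–3, reward 1 for reaching state 3). The environment here is deterministic, so a fixed learning rate still converges; step 4's decaying α_t is needed in general:

```python
import random

# Tabular Q-learning on an invented corridor: states 0..3, actions L/R,
# episode ends on reaching state 3 with reward 1 (all other rewards 0).
random.seed(0)
GAMMA, ALPHA = 0.9, 0.5   # fixed alpha is fine only because the env is deterministic

Q = {(s, a): 0.0 for s in range(3) for a in "LR"}

def step(s, a):
    s2 = min(s + 1, 3) if a == "R" else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0)

for _ in range(2000):
    s = 0
    while s != 3:
        a = random.choice("LR")   # pure exploration, for simplicity
        s2, r = step(s, a)
        bootstrap = 0.0 if s2 == 3 else max(Q[(s2, b)] for b in "LR")
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * bootstrap)
        s = s2

# Converges to Q*(2,'R') = 1, Q*(1,'R') = 0.9, Q*(0,'R') = 0.81
```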
Famous game theory example
A co-operative game
Mixed strategy
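The famous example itself is not reproduced in this transcript; the prisoner's dilemma, with standard textbook payoffs assumed here, illustrates best responses and pure-strategy equilibria:

```python
# Prisoner's dilemma with standard textbook payoffs (assumed values).
# payoff[(row, col)] = (row player's payoff, column player's payoff)
C, D = 0, 1   # cooperate, defect
payoff = {
    (C, C): (3, 3), (C, D): (0, 5),
    (D, C): (5, 0), (D, D): (1, 1),
}

def is_pure_nash(a_row, a_col):
    """True if neither player gains by unilaterally switching actions."""
    row_best = max(payoff[(a, a_col)][0] for a in (C, D))
    col_best = max(payoff[(a_row, a)][1] for a in (C, D))
    return (payoff[(a_row, a_col)][0] == row_best
            and payoff[(a_row, a_col)][1] == col_best)

equilibria = [(r, c) for r in (C, D) for c in (C, D) if is_pure_nash(r, c)]
# Only (D, D) survives: both defect, even though (C, C) pays each player more.
```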
Generalization of MDP
Stationary: the agent’s policy does not change over time
Deterministic: the same action is always chosen whenever the agent is in state s
Example

Rock–paper–scissors (zero-sum):
   0   1  -1
  -1   0   1
   1  -1   0

A 2×2 zero-sum game:
   1  -1
  -1   1

State 1:
   2   1   1
   1   2   1
   1   1   2

State 2:
   1   1
   1   1
π* is optimal if v(s, π*) ≥ v(s, π) for all s ∈ S and all policies π

Max V
Such that: the expected payoff against every opponent action is at least V,
and π_rock + π_paper + π_scissors = 1 (each probability ≥ 0)
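To see what the LP maximizes: the worst-case (security) value of a candidate mixed strategy can be computed directly. The payoff numbers are taken from the matrix above; the row/column orientation is an assumption:

```python
# Worst-case (security) value of a mixed strategy in a zero-sum matrix game.
# M[a][o]: row player's payoff for our action a vs opponent action o
# (numbers from the rock-paper-scissors matrix above; orientation assumed).
M = [
    [0, 1, -1],
    [-1, 0, 1],
    [1, -1, 0],
]

def worst_case(pi, M):
    """min over opponent actions o of sum_a pi[a] * M[a][o]."""
    return min(sum(pi[a] * M[a][o] for a in range(3)) for o in range(3))

uniform = [1 / 3, 1 / 3, 1 / 3]   # the LP's optimum: guarantees the value 0
pure_first = [1.0, 0.0, 0.0]      # any pure strategy guarantees only -1
```

The LP searches over all distributions π for the one whose worst case is largest; here no pure strategy can guarantee better than -1, while the uniform mix guarantees 0.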
v(s) = max_π min_o Σ_a π(s, a) Q(s, a, o)
  (best response π, worst case over opponent actions o, expectation over all actions)
Q(s, a, o) = R(s, a, o) + γ Σ_{s'} T(s, a, o, s') v(s')
  (quality of a state-action pair: immediate reward plus the discounted value of all succeeding states weighted by their likelihood)
The model-free update replaces this sum with the discounted value of the sampled succeeding state
This learning rule converges to the correct values of Q and v
The exploration parameter explor controls how often the agent will deviate from its current policy
Q(s, a, o) is the expected reward for taking action a when the opponent chooses o from state s
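A sketch of minimax Q learning on a hypothetical single-state matching-pennies game; a closed-form 2×2 matrix-game solver stands in for the linear program that Littman's algorithm solves at each step:

```python
import random

# Minimax Q learning (zero-sum) on an invented single-state
# matching-pennies game. A closed-form 2x2 solver replaces the LP.
random.seed(0)
GAMMA = 0.9
R = [[1, -1], [-1, 1]]   # row agent's reward; the opponent gets the negative

def game_value_2x2(Q):
    """Value of the zero-sum matrix game Q for the row maximizer."""
    (a, b), (c, d) = Q[0], Q[1]
    lower = max(min(a, b), min(c, d))    # best pure-strategy row guarantee
    upper = min(max(a, c), max(b, d))    # best pure-strategy column cap
    if lower == upper:                   # pure saddle point exists
        return lower
    return (a * d - b * c) / (a - b - c + d)   # mixed-strategy value

Q = [[0.0, 0.0], [0.0, 0.0]]
v = 0.0
counts = [[0, 0], [0, 0]]
for _ in range(5000):
    act, opp = random.randrange(2), random.randrange(2)  # full exploration
    counts[act][opp] += 1
    alpha = 1.0 / counts[act][opp]                       # decaying learning rate
    # Single state: the succeeding state's value is v itself.
    Q[act][opp] = (1 - alpha) * Q[act][opp] + alpha * (R[act][opp] + GAMMA * v)
    v = game_value_2x2(Q)

# v approaches the game value 0, so Q(a, o) approaches R(a, o).
```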
Hu and Wellman general-sum Markov games as a framework for RL
Theorem (Nash, 1951) There exists a mixed strategy Nash equilibrium for any finite bimatrix game