
Outline

• MDP (brief)
  – Background
  – Learning MDP

• Q learning

• Game theory (brief)
  – Background

• Markov games (2-player)
  – Background
  – Learning Markov games
    • Littman’s Minimax Q learning (zero-sum)
    • Hu & Wellman’s Nash Q learning (general-sum)

MDP / SG / POSG

Stochastic games (SG)

Partially observable SG (POSG)

v(s) = max_a [ R(s, a) + γ Σ_{s'} T(s, a, s') v(s') ]

– R(s, a): immediate reward
– Σ_{s'} T(s, a, s') …: expectation over next states
– v(s'): value of next state
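As a concrete illustration, here is a minimal value iteration sketch that applies this backup until convergence. The array shapes for R and T, the discount factor, and the tolerance are assumptions for the example, not taken from the slides.

```python
import numpy as np

def value_iteration(R, T, gamma=0.9, tol=1e-6):
    """R[s, a]: immediate reward R(s, a); T[s, a, s2]: transition probability P(s2 | s, a)."""
    V = np.zeros(R.shape[0])
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') * V(s')
        Q = R + gamma * (T @ V)
        V_new = Q.max(axis=1)                  # greedy backup over actions
        if np.max(np.abs(V_new - V)) < tol:    # stop once values have converged
            return V_new, Q.argmax(axis=1)     # state values and a greedy policy
        V = V_new
```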

• Model-based reinforcement learning:
  1. Learn the reward function and the state transition function
  2. Solve for the optimal policy

• Model-free reinforcement learning:
  1. Directly learn the optimal policy without knowing the reward function or the state transition function

Learning the MDP model from experience counts:

T(s, a, s') ≈ (#times action a causes state transition s → s') / (#times action a has been executed in state s)

R(s, a) ≈ (total reward accrued when applying a in s) / (#times action a has been executed in state s)

These estimates are then plugged into the backup above together with v(s').
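A minimal sketch of how these counts turn into a model estimate; the class and method names are hypothetical and only mirror the count descriptions above.

```python
from collections import defaultdict

class ModelEstimator:
    """Tabular maximum-likelihood model estimate from experience.

    n_sa[s, a]       -> #times action a has been executed in state s
    n_sas[s, a, s2]  -> #times action a caused the transition s -> s2
    r_sum[s, a]      -> total reward accrued when applying a in s
    """

    def __init__(self):
        self.n_sa = defaultdict(int)
        self.n_sas = defaultdict(int)
        self.r_sum = defaultdict(float)

    def update(self, s, a, r, s_next):
        self.n_sa[s, a] += 1
        self.n_sas[s, a, s_next] += 1
        self.r_sum[s, a] += r

    def T(self, s, a, s_next):
        # Estimated transition probability T(s, a, s').
        return self.n_sas[s, a, s_next] / self.n_sa[s, a] if self.n_sa[s, a] else 0.0

    def R(self, s, a):
        # Estimated expected immediate reward R(s, a).
        return self.r_sum[s, a] / self.n_sa[s, a] if self.n_sa[s, a] else 0.0
```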

1. Start with arbitrary initial values of Q(s, a), for all s ∈ S, a ∈ A

2. At each time t the agent chooses an action and observes its reward r_t

3. The agent then updates its Q-values based on the Q-learning rule:

   Q(s_t, a_t) ← (1 − α_t) Q(s_t, a_t) + α_t [ r_t + γ max_a Q(s_{t+1}, a) ]

4. The learning rate α_t needs to decay over time in order for the learning algorithm to converge
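A compact tabular Q-learning sketch following steps 1–4 above. The Gym-style env interface (reset/step), the epsilon-greedy action choice, and the 1/n(s, a) learning-rate decay are assumptions made for illustration.

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=500, gamma=0.9, epsilon=0.1):
    """Assumes env.reset() -> s and env.step(a) -> (s_next, r, done),
    with actions indexed 0 .. env.num_actions - 1."""
    Q = defaultdict(float)          # Q[s, a], arbitrary (zero) initial values
    n = defaultdict(int)            # visit counts used to decay the learning rate
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy choice between exploring and acting greedily
            if random.random() < epsilon:
                a = random.randrange(env.num_actions)
            else:
                a = max(range(env.num_actions), key=lambda a_: Q[s, a_])
            s_next, r, done = env.step(a)
            n[s, a] += 1
            alpha = 1.0 / n[s, a]   # decaying learning rate alpha_t
            target = r + gamma * max(Q[s_next, a_] for a_ in range(env.num_actions))
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
            s = s_next
    return Q
```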

Famous game theory example

A co-operative game

Mixed strategy

Generalization of MDP

Stationary: the agent’s policy does not change over time

Deterministic: the same action is always chosen whenever the agent is in state s

Example

Rock-paper-scissors (zero-sum), payoffs for the row player:

  0   1  -1
 -1   0   1
  1  -1   0

A 2×2 zero-sum game:

  1  -1
 -1   1

A two-state example, with one reward matrix per state:

State 1:
  2  1  1
  1  2  1
  1  1  2

State 2:
  1  1
  1  1

Optimal policy π*: v(s, π*) ≥ v(s, π) for all s ∈ S and all policies π

The optimal mixed strategy for rock-paper-scissors solves a linear program (R(a, o) is the payoff when the agent plays a and the opponent plays o):

Max V

Such that:
  π(rock) R(rock, o) + π(paper) R(paper, o) + π(scissors) R(scissors, o) ≥ V  for every opponent action o
  π(rock) + π(paper) + π(scissors) = 1
  π(rock), π(paper), π(scissors) ≥ 0
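A sketch of solving this linear program with scipy.optimize.linprog for rock-paper-scissors; the variable layout (the three probabilities followed by V) and the row/column ordering of the payoff matrix are illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

# Row player's payoffs (rows: my action, columns: opponent's action).
A = np.array([[0, 1, -1],
              [-1, 0, 1],
              [1, -1, 0]])

n = A.shape[0]
# Variables: [pi_rock, pi_paper, pi_scissors, V]; linprog minimizes, so minimize -V.
c = np.zeros(n + 1)
c[-1] = -1.0

# For every opponent action o: sum_a pi(a) * A[a, o] >= V   <=>   -A[:, o]^T pi + V <= 0
A_ub = np.hstack([-A.T, np.ones((n, 1))])
b_ub = np.zeros(n)

# The probabilities sum to one.
A_eq = np.array([[1.0] * n + [0.0]])
b_eq = np.array([1.0])

bounds = [(0, 1)] * n + [(None, None)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
pi, V = res.x[:n], res.x[-1]
print(pi, V)   # expected: roughly [1/3, 1/3, 1/3] and V = 0
```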

Minimax value of a state:

v(s) = max_π min_o Σ_a π(a) Q(s, a, o)

– max over π: best response
– min over o: worst case
– Σ_a π(a) …: expectation over all actions

Quality of a state-action pair:

Q(s, a, o) = R(s, a, o) + γ Σ_{s'} T(s, a, o, s') v(s')

– γ Σ_{s'} T(s, a, o, s') v(s'): discounted value of all succeeding states, weighted by their likelihood

In the learning rule this sum is replaced by the sampled target r + γ v(s'), the discounted value of the succeeding state.

This learning rule converges to the correct values of Q and v

The parameter explor controls how often the agent will deviate from its current policy

Q(s, a, o): expected reward for taking action a when the opponent chooses o from state s
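A compact sketch of a minimax-Q style update: the max-min mixed strategy and v(s') come from a linear program like the one sketched earlier, generalized to an |A| × |O| payoff table, and the action choice uses the explor parameter described above. The data structures and learning-rate handling are assumptions for illustration, not Littman's exact pseudocode.

```python
import numpy as np
from scipy.optimize import linprog

def maxmin_strategy(Q_s):
    """Solve max_pi min_o sum_a pi(a) * Q_s[a, o] as an LP.
    Q_s is the |A| x |O| table of Q(s, a, o) for one state s."""
    n_a, n_o = Q_s.shape
    c = np.zeros(n_a + 1)
    c[-1] = -1.0                                        # maximize V
    A_ub = np.hstack([-Q_s.T, np.ones((n_o, 1))])       # sum_a pi(a) Q_s[a, o] >= V for all o
    A_eq = np.array([[1.0] * n_a + [0.0]])              # probabilities sum to 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n_o),
                  A_eq=A_eq, b_eq=np.array([1.0]),
                  bounds=[(0, 1)] * n_a + [(None, None)])
    pi = np.clip(res.x[:n_a], 0.0, None)                # clean up numerical noise
    return pi / pi.sum(), res.x[-1]                     # (pi, v): strategy and state value

def choose_action(Q, s, explor):
    """With probability explor deviate from the current policy (explore);
    otherwise sample an action from the max-min mixed strategy for state s."""
    n_a = Q[s].shape[0]
    if np.random.random() < explor:
        return np.random.randint(n_a)
    pi, _ = maxmin_strategy(Q[s])
    return np.random.choice(n_a, p=pi)

def minimax_q_update(Q, s, a, o, r, s_next, alpha, gamma=0.9):
    """One update after observing (s, a, o, r, s_next).
    Q maps each state to an |A| x |O| array of Q(s, a, o) values."""
    _, v_next = maxmin_strategy(Q[s_next])              # v(s') via the LP
    Q[s][a, o] = (1 - alpha) * Q[s][a, o] + alpha * (r + gamma * v_next)
```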

Hu and Wellman: general-sum Markov games as a framework for RL

Theorem (Nash, 1951) There exists a mixed strategy Nash equilibrium for any finite bimatrix game
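A small sketch illustrating the theorem on the smallest interesting case: for a 2×2 bimatrix game it enumerates the pure strategy profiles and the fully mixed profile obtained by making each player indifferent. Matching pennies (no pure equilibrium, one mixed equilibrium at 1/2, 1/2) is used as the test case; the function name and layout are illustrative.

```python
import numpy as np

def nash_2x2(A, B):
    """Enumerate equilibria of a 2x2 bimatrix game.
    A: row player's payoffs, B: column player's payoffs."""
    equilibria = []
    # Pure equilibria: neither player gains by deviating unilaterally.
    for i in range(2):
        for j in range(2):
            if A[i, j] >= A[1 - i, j] and B[i, j] >= B[i, 1 - j]:
                equilibria.append((f"row {i}", f"col {j}"))
    # Fully mixed equilibrium: p = P(row 0), q = P(col 0) chosen so that the
    # opponent is indifferent between their two actions.
    denom_q = A[0, 0] - A[0, 1] - A[1, 0] + A[1, 1]
    denom_p = B[0, 0] - B[1, 0] - B[0, 1] + B[1, 1]
    if denom_q != 0 and denom_p != 0:
        q = (A[1, 1] - A[0, 1]) / denom_q
        p = (B[1, 1] - B[1, 0]) / denom_p
        if 0 <= p <= 1 and 0 <= q <= 1:
            equilibria.append((f"p(row 0)={p:.2f}", f"q(col 0)={q:.2f}"))
    return equilibria

# Matching pennies: prints [('p(row 0)=0.50', 'q(col 0)=0.50')]
A = np.array([[1, -1], [-1, 1]])
print(nash_2x2(A, -A))
```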