

Outline

• MDP (brief)
  – Background
  – Learning MDP

• Q learning

• Game theory (brief)
  – Background

• Markov games (2-player)
  – Background
  – Learning Markov games

• Littman’s Minimax Q learning (zero-sum)

• Hu & Wellman’s Nash Q learning (general-sum)

Page 3

MDP / SG / POSG

Stochastic games (SG)

Partially observable SG (POSG)

Page 4

Bellman optimality equation for the value of a state:

    V(s) = max_a [ R(s, a) + γ Σ_s' T(s, a, s') V(s') ]

where R(s, a) is the immediate reward, the sum over s' weighted by T(s, a, s') is the expectation over next states, and V(s') is the value of the next state.
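The same backup can be iterated to a fixed point to solve an MDP whose model is known. Below is a minimal value-iteration sketch in Python; the array names R[s, a] (rewards), T[s, a, s'] (transition probabilities) and the discount gamma are generic assumptions, not notation from the slides.

    import numpy as np

    def value_iteration(R, T, gamma=0.9, tol=1e-6):
        # R[s, a]: immediate reward, T[s, a, s2]: transition probability (assumed inputs)
        n_states, n_actions = R.shape
        v = np.zeros(n_states)                        # current value of each state
        while True:
            # Q[s, a] = immediate reward + discounted expectation over next-state values
            q = R + gamma * np.einsum("sat,t->sa", T, v)
            v_new = q.max(axis=1)                     # greedy Bellman backup
            if np.max(np.abs(v_new - v)) < tol:
                return v_new, q.argmax(axis=1)        # state values and a greedy policy
            v = v_new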

Page 5

• Model-based reinforcement learning:
  1. Learn the reward function and the state transition function
  2. Solve for the optimal policy

• Model-free reinforcement learning:
  1. Directly learn the optimal policy without knowing the reward function or the state transition function

Page 6

Estimating the model from counts:

    T(s, a, s') ≈ (#times action a causes the state transition s → s') / (#times action a has been executed in state s)

    R(s, a) ≈ (total reward accrued when applying a in s) / (#times action a has been executed in state s)
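These counters are exactly what step 1 of the model-based recipe above needs: divide them to estimate the transition and reward functions. A small sketch, assuming the counters are kept in NumPy arrays (the array names are illustrative, not from the slides):

    import numpy as np

    def estimate_model(n_sa, n_sas, r_total):
        # n_sa[s, a]      : #times action a has been executed in state s
        # n_sas[s, a, s2] : #times action a caused the transition s -> s2
        # r_total[s, a]   : total reward accrued when applying a in s
        counts = np.maximum(n_sa, 1)                  # avoid division by zero for unvisited pairs
        T_hat = n_sas / counts[:, :, None]            # estimated transition probabilities
        R_hat = r_total / counts                      # estimated mean immediate reward
        return T_hat, R_hat

The estimated model (T_hat, R_hat) can then be handed to a planner such as the value-iteration sketch above (step 2).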


Page 8

1. Start with arbitrary initial values of Q(s, a), for all s ∈ S, a ∈ A

2. At each time step t the agent chooses an action and observes its reward r_t

3. The agent then updates its Q-values based on the Q-learning rule

4. The learning rate α_t needs to decay over time in order for the learning algorithm to converge
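A minimal tabular sketch of these four steps. The environment interface env.reset()/env.step(a), the epsilon-greedy exploration rule and the 1/n(s, a) learning-rate decay are illustrative assumptions; the decay is one schedule that satisfies the usual convergence conditions.

    import numpy as np

    def q_learning(env, n_states, n_actions, gamma=0.9, episodes=1000, epsilon=0.1):
        Q = np.zeros((n_states, n_actions))           # step 1: arbitrary initial Q(s, a)
        visits = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # choose an action (here: epsilon-greedy exploration)
                if np.random.rand() < epsilon:
                    a = np.random.randint(n_actions)
                else:
                    a = int(Q[s].argmax())
                s2, r, done = env.step(a)             # step 2: observe the reward r_t
                visits[s, a] += 1
                alpha = 1.0 / visits[s, a]            # step 4: decaying learning rate
                # step 3: Q-learning update rule
                Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
                s = s2
        return Q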

Page 10

Famous game theory example

Page 13

A co-operative game

Page 16

Mixed strategy

Generalization of MDP

Page 18

Stationary: the agent’s policy does not change over time

Deterministic: the same action is always chosen whenever the agent is in state s

Page 19

Example

     0  1 -1           1 -1
    -1  0  1          -1  1
     1 -1  0

State 1:           State 2:
     2  1  1            1  1
     1  2  1            1  1
     1  1  2

Page 20

A policy π* is optimal if v(s, π*) ≥ v(s, π) for all s ∈ S and all policies π.

Page 21

Max V

Such that:
    rock + paper + scissors = 1
    the expected payoff of (rock, paper, scissors) against each opponent action is at least V

where rock, paper and scissors denote the probabilities the mixed strategy assigns to each action.
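A sketch of this linear program for the rock-paper-scissors payoff matrix, solved with scipy.optimize.linprog (one convenient LP solver; the variable layout is my choice, not from the slides):

    import numpy as np
    from scipy.optimize import linprog

    R = np.array([[ 0,  1, -1],      # rows: our action (rock, paper, scissors)
                  [-1,  0,  1],      # columns: the opponent's action
                  [ 1, -1,  0]])

    # Decision variables x = (rock, paper, scissors, V); linprog minimizes, so use -V.
    c = [0, 0, 0, -1]
    # For each opponent action o: rock*R[0,o] + paper*R[1,o] + scissors*R[2,o] >= V,
    A_ub = np.hstack([-R.T, np.ones((3, 1))])     # rewritten as -pi @ R[:, o] + V <= 0
    b_ub = np.zeros(3)
    A_eq = [[1, 1, 1, 0]]                         # rock + paper + scissors = 1
    b_eq = [1]
    bounds = [(0, 1)] * 3 + [(None, None)]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    print(res.x[:3], res.x[3])                    # roughly [1/3, 1/3, 1/3] with value V = 0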

Page 22

    V(s) = max_π min_o Σ_a π(a) Q(s, a, o)

The max over mixed strategies π picks the best response, the min over opponent actions o takes the worst case, and the sum Σ_a π(a) Q(s, a, o) is the expectation over all of the agent's actions.
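A quick numerical illustration of the worst-case term, assuming the rock-paper-scissors payoffs from the example: the uniform mixed strategy guarantees an expected payoff of 0 against every opponent action, while a deterministic strategy can be held to -1.

    import numpy as np

    R = np.array([[0, 1, -1], [-1, 0, 1], [1, -1, 0]])

    def worst_case(pi, R):
        # expected payoff of mixed strategy pi for each opponent action, then the minimum
        return (pi @ R).min()

    print(worst_case(np.array([1/3, 1/3, 1/3]), R))   #  0.0  (the maximin value)
    print(worst_case(np.array([1.0, 0.0, 0.0]), R))   # -1.0  (always playing rock is exploitable)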

Page 24

    Q(s, a, o) = R(s, a, o) + γ Σ_s' T(s, a, o, s') V(s')

Q(s, a, o) is the quality of a state-action pair (against opponent action o); the weighted sum is the discounted value of all succeeding states, weighted by their likelihood.

Sample-based update after observing reward r and next state s':

    Q(s, a, o) ← (1 − α) Q(s, a, o) + α ( r + γ V(s') )

where γ V(s') is the discounted value of the succeeding state. This learning rule converges to the correct values of Q and v.
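A minimal sketch of this rule, assuming the table is stored as Q[s, a, o], the state values as a vector V, and that the worst-case value at s is re-solved with the same kind of linear program used for the maximin strategy above:

    import numpy as np
    from scipy.optimize import linprog

    def maximin(Q_s):
        # Solve max_pi min_o sum_a pi(a) * Q_s[a, o] as a small LP.
        n_a, n_o = Q_s.shape
        c = np.zeros(n_a + 1); c[-1] = -1.0                       # maximize V
        A_ub = np.hstack([-Q_s.T, np.ones((n_o, 1))])             # -pi @ Q_s[:, o] + V <= 0
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n_o),
                      A_eq=[[1.0] * n_a + [0.0]], b_eq=[1.0],
                      bounds=[(0, 1)] * n_a + [(None, None)])
        return res.x[:n_a], res.x[-1]                             # policy pi(s, .), value V(s)

    def minimax_q_update(Q, V, s, a, o, r, s2, alpha, gamma=0.9):
        # Q(s,a,o) <- (1 - alpha) * Q(s,a,o) + alpha * (r + gamma * V(s'))
        Q[s, a, o] = (1 - alpha) * Q[s, a, o] + alpha * (r + gamma * V[s2])
        pi_s, V[s] = maximin(Q[s])                                # refresh the worst-case value at s
        return pi_s                                               # updated mixed policy for state s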

Page 25

explor controls how often the agent will deviate from its current policy (the probability of taking a random exploratory action).

Q(s, a, o): expected reward for taking action a when the opponent chooses o from state s.
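A sketch of action selection driven by this parameter, assuming the current mixed policy at state s is a probability vector pi_s (the names explor and pi_s are illustrative):

    import numpy as np

    def choose_action(pi_s, explor, rng=None):
        # With probability explor, deviate from the current policy and explore at random;
        # otherwise sample an action from the current mixed policy pi_s.
        if rng is None:
            rng = np.random.default_rng()
        n_actions = len(pi_s)
        if rng.random() < explor:
            return int(rng.integers(n_actions))
        return int(rng.choice(n_actions, p=pi_s))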

Page 31

Hu and Wellman: general-sum Markov games as a framework for RL.

Theorem (Nash, 1951). There exists a mixed-strategy Nash equilibrium for any finite bimatrix game.
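A quick numerical check of the theorem on matching pennies (my example, not from the slides): the game has no pure-strategy equilibrium, yet the mixed profile (1/2, 1/2) for both players is a Nash equilibrium, because no pure deviation improves either player's expected payoff.

    import numpy as np

    A = np.array([[ 1, -1], [-1,  1]])   # row player's payoffs
    B = -A                               # column player's payoffs (zero-sum in this case)

    x = y = np.array([0.5, 0.5])
    # Every pure action earns exactly the equilibrium payoff against the opponent's mix,
    # so neither player can gain by deviating.
    print(np.allclose(A @ y, x @ A @ y))   # True for the row player
    print(np.allclose(x @ B, x @ B @ y))   # True for the column player

Hu and Wellman's Nash Q learning builds on this guarantee: the bootstrap value of the next state is taken from a Nash equilibrium of the stage game defined by the learned Q tables, rather than from the minimax value used in the zero-sum case.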
