Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

23
Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005
  • date post

    22-Dec-2015
  • Category

    Documents

  • view

    237
  • download

    4

Transcript of Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

Page 1: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

Learning: Reinforcement Learning

Russell and Norvig: ch 21

CMSC421 – Fall 2005

Page 2: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

Project

Teams: 2-3 You should have emailed me your

team members!

Two components: Define Environment Learning Agent

Page 3: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

Example: Agent_with_Personality

State: mood: happy, sad, mad, bored sensor: smile, cry, glare, snoreAction: smile, hit, tell-joke, tickleDefine: S X A X S’ X P with probabilities and output

string

Define S X {-10,10}

Page 4: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

Example cont:

State: happy (s0), sad (s1), mad (s2), bored (s3) smile (p0), cry(p1), glare(p2), snore (p3)Action: smile (a0), hit (a1), tell-joke (a2), tickle (a3)Define:

S X A X S’ X P with probabilities and output string i.e. 0 0 0 0 0.8 “It makes me happy when you smile” 0 0 2 2 0.2 “Argh! Quit smiling at me!!!” 0 1 0 0 0.1 “Oh, I’m so happy I don’t care if you hit me” 0 1 2 2 0.6 “HEY!!! Quit hitting me” 0 1 1 1 0.3 “Boo hoo, don’t be hitting me”

Define S X {-10,10} i.e. 0 10 1 -10 2 -5 3 0

Page 5: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

Example: Robot Navigation

State: locationAction: forward, back, left, rightState -> Reward: define rewards of states in your grid

State x Action -> State defined by movements

Page 6: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

Learning Agent

Calls Environment Program to get a training setOutputs a Q function: Q(S x A)

We will evaluate the output of your learning program, by using it to execute and computing the reward given.

Page 7: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

Schedule

Monday, Dec. 5 Electronically submit your

environment

Monday, Dec. 12 Submit your learning agent

Wednesday, Dec 13 Submit your writeup

Page 8: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

Reinforcement Learningsupervised learning is simplest and best-studied type of learninganother type of learning tasks is learning behaviors when we don’t have a teacher to tell us howthe agent has a task to perform; it takes some actions in the world; at some later point gets feedback telling it how well it did on performing taskthe agent performs the same task over and over againit gets carrots for good behavior and sticks for bad behavior called reinforcement learning because the agent gets positive reinforcement for tasks done well and negative reinforcement for tasks done poorly

Page 9: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

Reinforcement Learning

The problem of getting an agent to act in the world so as to maximize its rewards. Consider teaching a dog a new trick: you cannot tell it what to do, but you can reward/punish it if it does the right/wrong thing. It has to figure out what it did that made it get the reward/punishment, which is known as the credit assignment problem. We can use a similar method to train computers to do many tasks, such as playing backgammon or chess, scheduling jobs, and controlling robot limbs.

Page 10: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

Reinforcement Learning

for blackjackfor robot motionfor controller

Page 11: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

Formalization

we have a state space Swe have a set of actions a1, …, akwe want to learn which action to take at every state in the spaceAt the end of a trial, we get some reward, positive or negativewant the agent to learn how to behave in the environment, a mapping from states to actions

example: Alvinnstate: configuration of the car

learn a steering action for each state

Page 12: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

Accessible orobservable state

Repeat: s sensed state If s is terminal then exit a choose action (given s) Perform a

Reactive Agent AlgorithmReactive Agent Algorithm

Page 13: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

Policy Policy (Reactive/Closed-Loop (Reactive/Closed-Loop Strategy)Strategy)

• A policy is a complete mapping from states to actions

-1

+1

2

3

1

4321

Page 14: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

Repeat: s sensed state If s is terminal then exit a (s) Perform a

Reactive Agent AlgorithmReactive Agent Algorithm

Page 15: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

Approaches

learn policy directly– function mapping from states to actionslearn utility values for states, the value function

Page 16: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

Value FunctionAn agent knows what state it is in and it has a number of actions it can perform in each state. Initially it doesn't know the value of any of the states. If the outcome of performing an action at a state is deterministic then the agent can update the utility value U() of a state whenever it makes a transition from one state to another (by taking what it believes to be the best possible action and thus maximizing): U(oldstate) = reward + U(newstate) The agent learns the utility values of states as it works its way through the state space.

Page 17: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

ExplorationThe agent may occasionally choose to explore suboptimal moves in the hopes of finding better outcomes. Only by visiting all the states frequently enough can we guarantee learning the true values of all the states. A discount factor is often introduced to prevent utility values from diverging and to promote the use of shorter (more efficient) sequences of actions to attain rewards. The update equation using a discount factor gamma is: U(oldstate) = reward + gamma * U(newstate) Normally gamma is set between 0 and 1.

Page 18: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

Q-Learning

augments value iteration by maintaining a utility value Q(s,a) for every action at every state. utility of a state U(s) or Q(s) is simply the maximum Q value over all the possible actions at that state.

Page 19: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

Q-Learningforeach state s foreach action a Q(s,a)=0 s=currentstate do forever a = select an action do action a r = reward from doing a t = resulting state from doing a Q(s,a) = (1 – alpha) Q(s,a) + alpha * (r + gamma * Q(t)) s = t Notice that a learning coefficient, alpha, has been introduced into the update equation. Normally alpha is set to a small positive constant less than 1.

Page 20: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

Selecting an Action

simply choose action with highest expected utility?problem: action has two effects gains reward on current sequence information received and used in

learning for future sequences

trade-off immediate good for long-term well-being

stuck in a rut

jumping off a cliff just because you’ve never done

it before…

Page 21: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

Exploration policy

wacky approach: act randomly in hopes of eventually exploring entire environmentgreedy approach: act to maximize utility using current estimateneed to find some balance: act more wacky when agent has little idea of environment and more greedy when the model is close to correctexample: one-armed bandits…

Page 22: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.
Page 23: Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.

RL Summary

active area of researchboth in OR and AIseveral more sophisticated algorithms that we have not discussedapplicable to game-playing, robot controllers, others