
Introduction to Real-Life Reinforcement Learning

Michael L. Littman

Rutgers University

Department of Computer Science

Brief History

Idea for the symposium came out of a discussion I had with Satinder Singh @ ICML 2003 (DC).

Both were starting new labs. Wanted to highlight an important challenge in RL.

Felt we could help create some momentum by bringing together like-minded researchers.

Attendees (Part I)

• ABRAMSON, MYRIAM
• BAGNELL, JAMES
• BENTIVEGNA, DARRIN
• BLANK, DOUGLAS
• BOOKER, LASHON
• DIUK, CARLOS
• FAGG, ANDREW
• FIDELMAN, PEGGY
• FOX, DIETER
• GORDON, GEOFFREY
• GREENWALD, LLOYD
• GROUNDS, MATTHEW
• JONG, NICHOLAS
• LANE, TERRAN
• LANGFORD, JOHN
• LEROUX, DAVE
• LITTMAN, MICHAEL
• MCGLOHON, MARY
• MCGOVERN, AMY
• MEEDEN, LISA

Attendees (More)

• MIIKKULAINEN, RISTO
• MUSLINER, DAVID
• PETERS, JAN
• PINEAU, JOELLE
• PROPER, SCOTT

Definitions

What is “reinforcement learning”?

• Decision making driven to maximize a measurable performance objective.

What is “real life”?

• “Measured” experience. Data doesn’t come from a model with known or pre-defined properties/assumptions.

Multiple Lives

• Real-life learning (us): use real data, possibly small (even toy) problems

• Life-sized learning (Kaelbling): large state spaces, possibly artificial problems

• Life-long learning (Thrun): same learning system, different problems (somewhat orthogonal)

Find The Ball

Learn:

• which way to turn

• to minimize steps

• to see goal (ball)

• from camera input

• given experience.

The RL Problem

Input: <s_1, a_1, s_2, r_1>, <s_2, a_2, s_3, r_2>, …, s_t

Output: a_t's to maximize the discounted sum of the r_i's.

(Slide example, camera images omitted: <image, right, image, +1>.)
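
Written out (one standard way to state the objective; the indexing convention is my own, following the r_i notation above), the a_t's are chosen to maximize the expected discounted return with discount factor γ, 0 ≤ γ < 1:

```latex
% Expected discounted return from time t; gamma is the discount factor.
G_t \;=\; \mathbb{E}\!\left[\,\sum_{i \ge t} \gamma^{\,i-t}\, r_i \right],
\qquad 0 \le \gamma < 1 .
```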

Problem Formalization: MDP

Most popular formalization: Markov decision process

Assume:

• States/sensations, actions discrete.

• Transitions, rewards stationary and Markov.

• Transition function: Pr(s’|s,a) = T(s,a,s’).

• Reward function: E[r|s,a] = R(s,a).

Then:

• Optimal policy: π*(s) = argmax_a Q*(s,a)

• where Q*(s,a) = R(s,a) + γ Σ_{s'} T(s,a,s') max_{a'} Q*(s',a')

Find the Ball: MDP Version

• Actions: rotate left/right

• States: orientation

• Reward: +1 for facing ball, 0 otherwise
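
To make the Q* recursion from the previous slide concrete on this toy problem, here is a minimal sketch (mine, not from the talk): it assumes 8 discrete orientations, deterministic rotate-left/rotate-right transitions, and reward +1 in the orientation facing the ball, and computes Q* by repeatedly applying the recursion (value iteration).

```python
# Minimal sketch (illustrative, not from the talk): the "Find the Ball" MDP
# with 8 discrete orientations. Orientation 0 faces the ball. Actions rotate
# one step left or right; reward is +1 whenever the current orientation is 0.
import numpy as np

N = 8                      # number of orientations (an assumption; any N works)
ACTIONS = (-1, +1)         # rotate left, rotate right
GAMMA = 0.9                # discount factor

def T(s, a, s2):
    """Transition function Pr(s2 | s, a): deterministic rotation."""
    return 1.0 if s2 == (s + a) % N else 0.0

def R(s, a):
    """Reward function E[r | s, a]: +1 for facing the ball, 0 otherwise."""
    return 1.0 if s == 0 else 0.0

# Value iteration: repeatedly apply
#   Q(s,a) <- R(s,a) + gamma * sum_s' T(s,a,s') * max_a' Q(s',a')
Q = np.zeros((N, len(ACTIONS)))
for _ in range(1000):
    Q = np.array([[R(s, a) + GAMMA * sum(T(s, a, s2) * Q[s2].max() for s2 in range(N))
                   for a in ACTIONS]
                  for s in range(N)])

# Greedy policy: which way to turn from each orientation.
policy = [ACTIONS[int(np.argmax(Q[s]))] for s in range(N)]
print("greedy rotation per orientation:", policy)
```

The printed greedy policy rotates each orientation toward the ball along the shorter direction (with ties at the orientation directly opposite the ball and at the ball itself).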

It Can Be Done: Q-learning

Since the optimal Q function is sufficient, use experience to estimate it (Watkins & Dayan, 1992).

Given <s, a, s', r>:
Q(s,a) ← Q(s,a) + α_t ( r + γ max_{a'} Q(s',a') − Q(s,a) )

If:

• all (s,a) pairs updated infinitely often

• Pr(s’|s,a) = T(s,a,s’), E[r|s,a] = R(s,a)

• Σ_t α_t = ∞, Σ_t α_t² < ∞

Then: Q(s,a) → Q*(s,a)
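
And a minimal tabular Q-learning sketch (again mine, on the same assumed toy orientation task) showing exactly this update, with epsilon-greedy exploration and a 1/n(s,a) step size that satisfies the conditions above:

```python
# Minimal sketch of tabular Q-learning (Watkins & Dayan, 1992), illustrated on
# the same assumed toy task: 8 orientations, rotate left/right, reward +1 when
# facing the ball (orientation 0). Not the speaker's code; just the update rule.
import random

N, ACTIONS, GAMMA = 8, (-1, +1), 0.9
Q = [[0.0, 0.0] for _ in range(N)]        # Q[s][action_index]
count = [[0, 0] for _ in range(N)]        # visit counts, used for alpha_t = 1/n(s,a)

def step(s, a):
    """Environment: deterministic rotation; reward +1 when facing the ball."""
    r = 1.0 if s == 0 else 0.0
    return (s + a) % N, r

s = random.randrange(N)
for _ in range(100_000):
    # epsilon-greedy exploration keeps every (s,a) pair updated infinitely often
    i = random.randrange(2) if random.random() < 0.1 else max((0, 1), key=lambda j: Q[s][j])
    s2, r = step(s, ACTIONS[i])
    count[s][i] += 1
    alpha = 1.0 / count[s][i]             # sum(alpha) = inf, sum(alpha^2) < inf
    Q[s][i] += alpha * (r + GAMMA * max(Q[s2]) - Q[s][i])
    s = s2

print("learned greedy rotation per orientation:",
      [ACTIONS[max((0, 1), key=lambda j: Q[s][j])] for s in range(N)])
```

After enough steps, the greedy policy it learns should match the value-iteration result shown earlier.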

Real-Life Reinforcement Learning

Emphasize learning with real* data.

Q-learning good, but might not be right here…

Mismatches to “Find the Ball” MDP:

• Efficient exploration: data is expensive

• Rich sensors: never see the same thing twice

• Aliasing: different states can look similar

• Non-stationarity: details change over time

* Or, if simulated, from simulators developed outside the AI community

RL2: A Spectrum

(Spectrum diagram. Points along the spectrum: unmodified physical world, controlled physical world, electronic-only world, detailed simulation, lab-created simulation, pure math world. The real-world end is labeled RLRL, the math end RL, with an "RLRL gray zone" between them.)

Unmodified Physical World

weight loss (BodyMedia)

helicopter (Bagnell)

Controlled Physical World

Mahadevan and Connell, 1990

Electronic-only World

Recovery from corrupted network interface configuration.

Java/Windows XP: Minimize time to repair.

Littman, Ravi, Fenson, Howard, 2004

After 95 failure episodes

Learning to sort fast (Littman & Lagoudakis)

Pure Math World

backgammon (Tesauro)

Detailed Simulation

• Independently developed

elevator control (Crites, Barto)

RARS video game

Robocup Simulator

Lab-created Simulation

Car on the Hill

Taxi World

The Plan

Talks, Panels

Talk slot: 30 minutes, shoot for 25 minutes to leave time for switchover, questions, etc.

Try plugging in during a break.

Panel slot: 5 minutes per panelist (slides optional), will use the discussion time

Friday, October 22nd, AM

9:00 Michael Littman, Introduction to Real-Life Reinforcement Learning

9:30 Darrin Bentivegna, Learning From Observation and Practice Using Primitives

10:00 Jan Peters, Learning Motor Primitives with Reinforcement Learning

10:30 break

11:00 Dave LeRoux, Instance-Based Reinforcement Learning on the Sony Aibo Robot

11:30 Bill Smart, Applying Reinforcement Learning to Real Robots: Problems and Possible Solutions

12:00 HUMAN-LEVEL AI PANEL, Roy

12:30 lunch break

Friday, October 22nd, PM

2:00 Andy Fagg, Learning Dexterous Manipulation Skills Using the Control Basis

2:30 Dan Stronger, Simultaneous Calibration of Action and Sensor Models on a Mobile Robot

3:00 Dieter Fox, Reinforcement Learning for Sensing Strategies

3:30 break

4:00 Roberto Santiago, What is Real Life? Using Simulation to Mature Reinforcement Learning

4:30 OTHER MODELS PANEL, Diuk, Greenwald, Lane

5:00 Gerry Tesauro, RL-Based Online Resource Allocation in Multi-Workload Computing Systems

5:30 session ends

Saturday, October 23rd, AM

9:00 Drew Bagnell, Practical Policy Search

9:30 John Moody, Learning to Trade via Direct Reinforcement

10:00 Risto Miikkulainen, Learning Robust Control and Complex Behavior Through Neuroevolution

10:30 break

11:00 Michael Littman, Real Life Multiagent Reinforcement Learning

11:30 MULTIAGENT PANEL, Stone, Riedmiller, Moody, Bowling

12:00 HIERARCHY/STRUCTURED REPRESENTATIONS PANEL, Tadepalli, McGovern, Jong, Grounds

12:30 lunch break

(Joint with Artificial Multi-Agent Learning)

Saturday, October 23rd, PM

2:00 Lisa Meeden, Self-Motivated, Task-Independent Reinforcement Learning for Robots

2:30 Marge Skubic and David Noelle, A Biologically Inspired Adaptive Working Memory for Robots

3:00 COGNITIVE ROBOTICS PANEL, Blank, Noelle, Booksbaum

3:30 break

4:00 Peggy Fidelman, Learning Ball Acquisition and Fast Quadrupedal Locomotion on a Physical Robot

4:30 John Langford, Real World Reinforcement Learning Theory

5:00 OTHER TOPICS PANEL, Abramson, Proper, Pineau

5:30 session ends

(Joint with Cognitive Robotics)

Sunday, October 24th, AM

9:00 Satinder Singh, RL for Human Level AI

9:30 Geoff Gordon, Learning Valid Predictive Representations

10:00 Yasutake Takahashi, Abstraction of State/Action based on State Value Function

10:30 break

11:00 Martin Riedmiller/Stephan Timmer, RL for technical process control

11:30 Matthew Taylor, Speeding Up Reinforcement Learning with Behavior Transfer

12:00 Discussion: Wrap Up, Future Plans

12:30 symposium ends

Plenary

Saturday (tomorrow) night

6pm-7:30pm Plenary

Each symposium gets a 10-minute slot

Ours: Video. I need today’s speakers to join me for lunch and also immediately after the session today.

Darrin’s Summary

• extract features

• domain knowledge

• function approximators

• bootstrap learning/behavior transfer

• improve current skill

• learn skill initially using other methods

• start with low-level skills

What Next?

• Collect successes to point to

– Contribute to newly created page:

http://neuromancer.eecs.umich.edu/cgi-bin/twiki/view/Main/SuccessesOfRL

– We’re already succeeding (ideas are spreading)

– rejoice: control theorists are scared of us

• Sources of information

– This workshop web site:

http://www.cs.rutgers.edu/~mlittman/rl3/rl2/

– Will include pointers to slides, papers

– Can include twiki links or a pointer from RL repository.

– Michael requesting slides / URLs / videos (up front).

– Newly created Myth Page: http://neuromancer.eecs.umich.edu/cgi-bin/twiki/view/Main/MythsofRL

Other Activities

• Possible Publication Activities

– special issue of a journal (JMLR? JAIR?)

– edited book

– other workshops

– guidebook for newbies

– textbook?

• Benchmarks

– Upcoming NIPS workshop on benchmarks

– We need to push for including real-life examples

– greater set of domains, make an effort to widen applications

Future Challenges

• How can we better talk about the inherent problem difficulty? Problem classes?

• Can we clarify the distinction between control theory and AI problems?

• Stress making sequential decisions (outside robotics as well).

• What about structure? Can we say more?

• Need to encourage a fresh perspective.

• Help convey how to see problems as RL problems.