Introduction to Real-Life Reinforcement Learning
Michael L. Littman
Rutgers University
Department of Computer Science
Brief History
Idea for symposium came out of a discussion I had with Satinder Singh @ ICML 2003 (DC).
Both were starting new labs. Wanted to highlight an important challenge in RL.
Felt we could help create some momentum by bringing together like-minded researchers.
Attendees (Part I)
• ABRAMSON, MYRIAM
• BAGNELL, JAMES
• BENTIVEGNA, DARRIN
• BLANK, DOUGLAS
• BOOKER, LASHON
• DIUK, CARLOS
• FAGG, ANDREW
• FIDELMAN, PEGGY
• FOX, DIETER
• GORDON, GEOFFREY
• GREENWALD, LLOYD
• GROUNDS, MATTHEW
• JONG, NICHOLAS
• LANE, TERRAN
• LANGFORD, JOHN
• LEROUX, DAVE
• LITTMAN, MICHAEL
• MCGLOHON, MARY
• MCGOVERN, AMY
• MEEDEN, LISA
Attendees (More)
• MIIKKULAINEN, RISTO
• MUSLINER, DAVID
• PETERS, JAN
• PINEAU, JOELLE
• PROPER, SCOTT
Definitions
What is “reinforcement learning”?
• Decision making driven to maximize a measurable performance objective.
What is “real life”?
• “Measured” experience. Data doesn’t come from a model with known or pre-defined properties/assumptions.
Multiple Lives
• Real-life learning (us): use real data, possibly small (even toy) problems
• Life-sized learning (Kaelbling): large state spaces, possibly artificial problems
• Life-long learning (Thrun): same learning system, different problems (somewhat orthogonal)
Find The Ball
Learn:
• which way to turn
• to minimize steps
• to see goal (ball)
• from camera input
• given experience.
The RL Problem
Input: <s1, a1, s2, r1>, <s2, a2, s3, r2>, …, st
Output: actions at to maximize the discounted sum of rewards ri.
(Example transition: camera image, action “right”, next camera image, reward +1)
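The input/output description above is the standard agent-environment interaction loop. A minimal sketch, assuming a hypothetical `env_step` function as a stand-in for the real environment:

```python
import random

def run_episode(env_step, choose_action, s0, horizon=100):
    """Generic RL interaction loop: the agent observes state s, picks
    action a, and the environment returns (s', r). The resulting list
    of <s, a, s', r> tuples is exactly the learner's input stream."""
    history, s = [], s0
    for _ in range(horizon):
        a = choose_action(s)
        s_next, r = env_step(s, a)
        history.append((s, a, s_next, r))
        s = s_next
    return history

# Toy stand-in environment (hypothetical): two states; taking
# action 1 from state 0 pays reward +1, everything else pays 0.
def env_step(s, a):
    return (a, 1 if (s == 0 and a == 1) else 0)

history = run_episode(env_step, lambda s: random.choice([0, 1]), s0=0, horizon=5)
print(len(history))  # 5 tuples of <s, a, s', r>
```

The agent here acts uniformly at random; a learner would instead choose actions from its current estimate of the Q function.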
Problem Formalization: MDP
Most popular formalization: Markov decision process
Assume:
• States/sensations, actions discrete.
• Transitions, rewards stationary and Markov.
• Transition function: Pr(s’|s,a) = T(s,a,s’).
• Reward function: E[r|s,a] = R(s,a).
Then:
• Optimal policy π*(s) = argmaxa Q*(s,a)
• where Q*(s,a) = R(s,a) + γ Σs’ T(s,a,s’) maxa’ Q*(s’,a’)
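The Q* equation above is a fixed-point equation, so it can be solved by repeatedly applying the right-hand side (value iteration). A minimal tabular sketch, assuming T and R are given as dictionaries (the two-state chain at the bottom is an illustrative made-up example):

```python
def value_iteration(states, actions, T, R, gamma=0.9, iters=200):
    """Compute Q*(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') * max_a' Q*(s',a')
    by iterating the Bellman optimality operator until it converges."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(iters):
        Q = {(s, a): R[s, a] + gamma * sum(
                 T[s, a, s2] * max(Q[s2, a2] for a2 in actions)
                 for s2 in states)
             for s in states for a in actions}
    return Q

# Two-state chain: action a moves deterministically to state a;
# being in state 1 pays reward 1, state 0 pays nothing.
states, actions = [0, 1], [0, 1]
T = {(s, a, s2): 1.0 if s2 == a else 0.0
     for s in states for a in actions for s2 in states}
R = {(s, a): 1.0 if s == 1 else 0.0 for s in states for a in actions}

Q = value_iteration(states, actions, T, R)
policy = {s: max(actions, key=lambda a: Q[s, a]) for s in states}
print(policy)  # both states prefer action 1 (head to the rewarding state)
```

With gamma = 0.9 the fixed point gives Q*(1,1) = 1 + 0.9·Q*(1,1) = 10, and the greedy policy π*(s) = argmaxa Q*(s,a) chooses action 1 everywhere.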
Find the Ball: MDP Version
• Actions: rotate left/right
• States: orientation
• Reward: +1 for facing ball, 0 otherwise
It Can Be Done: Q-learning
Since the optimal Q function is sufficient, use experience to estimate it (Watkins & Dayan 92)
Given <s, a, s’, r>:
Q(s,a) ← Q(s,a) + αt (r + γ maxa’ Q(s’,a’) – Q(s,a))
If:
• all (s,a) pairs updated infinitely often
• Pr(s’|s,a) = T(s,a,s’), E[r|s,a] = R(s,a)
• Σαt = ∞, Σαt² < ∞
Then: Q(s,a) → Q*(s,a)
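The update rule above can be run on a toy version of the Find-the-Ball MDP from the previous slide (discrete orientations as states, rotate left/right as actions, +1 for facing the ball). The grid size, learning rate, and step count below are illustrative assumptions, not values from the talk:

```python
import random

def q_learn_find_ball(n_orient=8, ball=0, gamma=0.9, alpha=0.1, steps=20000):
    """Tabular Q-learning: Q(s,a) += alpha*(r + gamma*max_a' Q(s',a') - Q(s,a)).
    States are n_orient discrete orientations; action 0 rotates left (-1),
    action 1 rotates right (+1); reward is +1 when facing the ball."""
    Q = [[0.0, 0.0] for _ in range(n_orient)]
    s = random.randrange(n_orient)
    for _ in range(steps):
        a = random.randrange(2)  # explore uniformly so all (s,a) get updated
        s2 = (s + (1 if a == 1 else -1)) % n_orient
        r = 1.0 if s2 == ball else 0.0
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
    return Q

random.seed(0)
Q = q_learn_find_ball()
# Greedy check: one step left of the ball, turning left should look best,
# and one step right of the ball, turning right should look best.
print(Q[1][0] > Q[1][1], Q[7][1] > Q[7][0])
```

Because the toy dynamics and rewards are deterministic, the constant learning rate still drives Q toward Q*; with noisy real data the decaying αt condition above matters.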
Real-Life Reinforcement Learning
Emphasize learning with real* data.
Q-learning good, but might not be right here…
Mismatches to “Find the Ball” MDP:
• Efficient exploration: data is expensive
• Rich sensors: never see the same thing twice
• Aliasing: different states can look similar
• Non-stationarity: details change over time
* Or, if simulated, from simulators developed outside the AI community
RL2: A Spectrum
• Unmodified physical world
• Controlled physical world
• Electronic-only world
• Pure math world
• Detailed simulation
• Lab-created simulation
(Diagram: RLRL covers the real end of the spectrum, RL covers the whole spectrum, and the simulations fall in an RLRL gray zone.)
Unmodified Physical World
weight loss (BodyMedia)
helicopter (Bagnell)
Controlled Physical World
Mahadevan and Connell, 1990
Electronic-only World
Recovery from corrupted network interface configuration.
Java/Windows XP: Minimize time to repair.
Littman, Ravi, Fenson, Howard, 2004
After 95 failure episodes
Learning to sort fast (Littman & Lagoudakis)
Pure Math World
backgammon (Tesauro)
Detailed Simulation
• Independently developed
elevator control (Crites, Barto)
RARS video game
Robocup Simulator
Lab-created Simulation
Car on the Hill
Taxi World
The Plan
Talks, Panels
Talk slot: 30 minutes, shoot for 25 minutes to leave time for switchover, questions, etc.
Try plugging in during a break.
Panel slot: 5 minutes per panelist (slides optional), will use the discussion time
Friday, October 22nd, AM
9:00 Michael Littman, Introduction to Real-Life Reinforcement Learning
9:30 Darrin Bentivegna, Learning From Observation and Practice Using Primitives
10:00 Jan Peters, Learning Motor Primitives with Reinforcement Learning
10:30 break
11:00 Dave LeRoux, Instance-Based Reinforcement Learning on the Sony Aibo Robot
11:30 Bill Smart, Applying Reinforcement Learning to Real Robots: Problems and Possible Solutions
12:00 HUMAN-LEVEL AI PANEL, Roy
12:30 lunch break
Friday, October 22nd, PM
2:00 Andy Fagg, Learning Dexterous Manipulation Skills Using the Control Basis
2:30 Dan Stronger, Simultaneous Calibration of Action and Sensor Models on a Mobile Robot
3:00 Dieter Fox, Reinforcement Learning for Sensing Strategies
3:30 break
4:00 Roberto Santiago, What is Real Life? Using Simulation to Mature Reinforcement Learning
4:30 OTHER MODELS PANEL, Diuk, Greenwald, Lane
5:00 Gerry Tesauro, RL-Based Online Resource Allocation in Multi-Workload Computing Systems
5:30 session ends
Saturday, October 23rd, AM
9:00 Drew Bagnell, Practical Policy Search
9:30 John Moody, Learning to Trade via Direct Reinforcement
10:00 Risto Miikkulainen, Learning Robust Control and Complex Behavior Through Neuroevolution
10:30 break
11:00 Michael Littman, Real Life Multiagent Reinforcement Learning
11:30 MULTIAGENT PANEL, Stone, Riedmiller, Moody, Bowling
12:00 HIERARCHY/STRUCTURED REPRESENTATIONS PANEL, Tadepalli, McGovern, Jong, Grounds
12:30 lunch break
(Joint with Artificial Multi-Agent Learning)
Saturday, October 23rd, PM
2:00 Lisa Meeden, Self-Motivated, Task-Independent Reinforcement Learning for Robots
2:30 Marge Skubic and David Noelle, A Biologically Inspired Adaptive Working Memory for Robots
3:00 COGNITIVE ROBOTICS PANEL, Blank, Noelle, Booksbaum
3:30 break
4:00 Peggy Fidelman, Learning Ball Acquisition and Fast Quadrupedal Locomotion on a Physical Robot
4:30 John Langford, Real World Reinforcement Learning Theory
5:00 OTHER TOPICS PANEL, Abramson, Proper, Pineau
5:30 session ends
(Joint with Cognitive Robotics)
Sunday, October 24th, AM
9:00 Satinder Singh, RL for Human Level AI
9:30 Geoff Gordon, Learning Valid Predictive Representations
10:00 Yasutake Takahashi, Abstraction of State/Actionbased on State Value Function
10:30 break
11:00 Martin Riedmiller/Stephan Timmer, RL for technical process control
11:30 Matthew Taylor, Speeding Up Reinforcement Learning with Behavior Transfer
12:00 Discussion: Wrap Up, Future Plans
12:30 symposium ends
Plenary
Saturday (tomorrow) night
6pm-7:30pm Plenary
Each symposium gets a 10-minute slot
Ours: Video. I need today’s speakers to join me for lunch and also immediately after the session today.
Darrin’s Summary
• extract features
• domain knowledge
• function approximators
• bootstrap learning/behavior transfer
• improve current skill
• learn skill initially using other methods
• start with low-level skills
What Next?
• Collect successes to point to
– Contribute to newly created page:
http://neuromancer.eecs.umich.edu/cgi-bin/twiki/view/Main/SuccessesOfRL
– We’re already succeeding (ideas are spreading)
– rejoice: control theorists are scared of us
• Sources of information
– This workshop web site:
http://www.cs.rutgers.edu/~mlittman/rl3/rl2/
– Will include pointers to slides, papers
– Can include twiki links or a pointer from RL repository.
– Michael requesting slides / URLs / videos (up front).
– Newly created Myth Page:
http://neuromancer.eecs.umich.edu/cgi-bin/twiki/view/Main/MythsofRL
Other Activities
• Possible Publication Activities
– special issue of a journal (JMLR? JAIR?)
– edited book
– other workshops
– guidebook for newbies
– textbook?
• Benchmarks
– Upcoming NIPS workshop on benchmarks
– We need to push for including real-life examples
– greater set of domains, make an effort to widen applications
Future Challenges
• How can we better talk about the inherent problem difficulty? Problem classes?
• Can we clarify the distinction between control theory and AI problems?
• Stress making sequential decisions (outside robotics as well).
• What about structure? Can we say more?
• Need to encourage a fresh perspective.
• Help convey how to see problems as RL problems.