Interactively Shaping Agents via Human Reinforcement
![Page 1: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/1.jpg)
Interactively Shaping Agents via Human Reinforcement
W. Bradley Knox and Peter Stone
The University of Texas at Austin
Department of Computer Science
The TAMER Framework
![Page 2: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/2.jpg)
Learning agents
©1997-2009 Adam Dorman
![Page 3: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/3.jpg)
Autonomous Learning
• human defines and programs an evaluation function and then steps back
![Page 4: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/4.jpg)
Autonomous Learning
• can be called reinforcement learning
– types:
• value function approximation
• policy search
• dominant in research
Kohl and Stone (2004)
![Page 5: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/5.jpg)
Shaping
Def. - creating a desired behavior by reinforcing successive approximations of the behavior
LOOK magazine, 1952
![Page 6: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/6.jpg)
The Shaping Scenario (in this context)
A human trainer observes an agent and manually delivers reinforcement (a scalar value), signaling approval or disapproval.
![Page 7: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/7.jpg)
Why shaping?
Potential benefits over purely autonomous learners:
• No evaluation function needed
• Allows lay users to teach agents the policies that they prefer (no programming!)
• Decreases sample size
• Learns in more challenging domains
![Page 8: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/8.jpg)
Research Question
How can agents harness the information contained in signals of positive and negative evaluation from a human to learn sequential decision-making tasks?
![Page 9: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/9.jpg)
Talk Goals
• the usual
• get suggestions for cognitive science directions
• collaborate
![Page 10: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/10.jpg)
Types of Natural Knowledge Transfer from Humans
• Imitation learning (i.e., Programming by Demonstration)
• Natural Language Advice
• Shaping
![Page 11: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/11.jpg)
Imitation Learning
Schaal et al., 2003
Def. - agent observes demonstrations from a human expert and learns to imitate the human’s behavior
• often the goal is to generalize to unseen states
![Page 12: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/12.jpg)
Natural Language Advice
Robot soccer: “If player 2 has the ball and is near the opponent’s goal, player 2 should shoot the ball at the goal.”
• Kuhlmann et al., 2004
Def. - using natural language, a human gives advice in the form of conditions and suggested behavior under those conditions
![Page 13: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/13.jpg)
If limited to one form of knowledge transfer...
[Decision diagram. Questions, each answered yes or no: “cheap samples?”; “can define and program an evaluation function?”; “requisite expertise and interface to control?”. Outcomes: autonomous learning via an evaluation function; programming by demonstration; shaping.]
![Page 14: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/14.jpg)
Outline
• Intro to shaping
• Our approach
• Future work
![Page 15: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/15.jpg)
Previous work on human-shapable agents
• Clicker training for entertainment agents (Blumberg et al., 2002; Kaplan et al., 2002)
• Sophie’s World (Thomaz & Breazeal, 2006)
– RL with reward = environmental (MDP) reward + human reinforcement
• Social software agent Cobot in LambdaMOO (Isbell et al., 2006)
– RL with reward = human reinforcement
![Page 16: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/16.jpg)
The Shaped Agent’s Perspective
• Each time step, agent:
– receives state description
– might receive a real-valued human reinforcement signal
– chooses an action
– does not receive an MDP reward signal
![Page 17: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/17.jpg)
MDP reward vs. Human reinforcement
• MDP reward
– Key problem: credit assignment from sparse, delayed rewards
I won!
But why did I win?
![Page 18: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/18.jpg)
MDP reward vs. Human reinforcement
Reinforcement from a human trainer:
– Trainer has long-term impact in mind
– Reinforcement is within a small temporal window of the targeted behavior
– Credit assignment problem is largely removed
BAD ROBOT!!
I just did something bad…
![Page 19: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/19.jpg)
Teaching an Agent Manually via Evaluative Reinforcement (TAMER)
• TAMER approach:
– Learn a model of human reinforcement
– Directly exploit the model to determine policy
• If greedy:
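A greedy policy over the learned model can be written compactly. The symbol $\hat{H}(s, a)$ for the learned model of human reinforcement is notation assumed here for illustration; this is a sketch of the idea on the slide, not a reproduction of its formula:

```latex
\pi(s) = \operatorname*{arg\,max}_{a} \hat{H}(s, a)
```

That is, in state $s$ the agent picks the action whose predicted human reinforcement is highest.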
![Page 20: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/20.jpg)
Teaching an Agent Manually via Evaluative Reinforcement (TAMER)
Learning from targeted human reinforcement is a supervised learning problem, not a reinforcement learning problem.
![Page 21: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/21.jpg)
Teaching an Agent Manually via Evaluative Reinforcement (TAMER)
![Page 22: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/22.jpg)
Tetris
– Drop blocks to make solid horizontal lines, which then disappear
– |state space| > 2^200
– Challenging (NP-hard) but slow
– 21 features extracted from (s, a)
– TAMER model:
• Linear model over features
• Gradient descent updates
– Greedy action selection
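The three ingredients listed above (a linear model over features, gradient descent updates, greedy action selection) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name, the step size, and the two-feature `toy_features` function are all hypothetical stand-ins for the 21 Tetris features.

```python
from typing import Callable, List, Sequence


class LinearReinforcementModel:
    """Linear model that predicts human reinforcement from (s, a) features."""

    def __init__(self, num_features: int, step_size: float = 0.1) -> None:
        self.weights: List[float] = [0.0] * num_features
        self.step_size = step_size  # hypothetical value, chosen for the demo

    def predict(self, features: Sequence[float]) -> float:
        return sum(w * x for w, x in zip(self.weights, features))

    def update(self, features: Sequence[float], human_reinforcement: float) -> None:
        # One gradient descent step on the squared prediction error.
        error = human_reinforcement - self.predict(features)
        self.weights = [
            w + self.step_size * error * x for w, x in zip(self.weights, features)
        ]


def greedy_action(
    model: LinearReinforcementModel,
    state: object,
    actions: Sequence[int],
    extract_features: Callable[[object, int], Sequence[float]],
) -> int:
    """Pick the action with the highest predicted human reinforcement."""
    return max(actions, key=lambda a: model.predict(extract_features(state, a)))


def toy_features(state: object, action: int) -> List[float]:
    # Hypothetical features: a bias term and the action index.
    return [1.0, float(action)]


model = LinearReinforcementModel(num_features=2)
for _ in range(50):
    model.update(toy_features(None, 1), +1.0)  # trainer approves of action 1
    model.update(toy_features(None, 0), -1.0)  # trainer disapproves of action 0

best = greedy_action(model, None, [0, 1], toy_features)
```

Because each update is an ordinary regression step toward the trainer's signal, the sketch treats learning as supervised learning and uses the model directly for action selection.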
![Page 23: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/23.jpg)
TAMER in action: Tetris
Before training:
Training:
After training:
![Page 24: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/24.jpg)
TAMER Results: Tetris (9 subjects)
![Page 25: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/25.jpg)
TAMER Results: Tetris (9 subjects)
![Page 26: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/26.jpg)
Credit assignment
Tasks with several time steps per second
Hockley (1984)
P(response delay)
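One plausible reading of this slide is that, when time steps are shorter than a human's reaction time, a reinforcement signal is shared among recent time steps in proportion to the probability that the trainer's response delay points back to each step. The sketch below illustrates that idea only. The uniform delay distribution and its 0.2 to 0.8 second bounds are hypothetical placeholders, as are all function names; the distribution actually used on the slide is not given in this transcript.

```python
from typing import Callable, List, Sequence, Tuple


def uniform_delay_cdf(delay: float, low: float = 0.2, high: float = 0.8) -> float:
    """CDF of a hypothetical uniform response-delay distribution (seconds)."""
    if delay <= low:
        return 0.0
    if delay >= high:
        return 1.0
    return (delay - low) / (high - low)


def credit_weights(
    steps: Sequence[Tuple[float, float]],
    feedback_time: float,
    delay_cdf: Callable[[float], float] = uniform_delay_cdf,
) -> List[float]:
    """Probability that feedback at `feedback_time` targets each (start, end) step."""
    weights = []
    for start, end in steps:
        # Feedback aimed at this step implies a delay between these two bounds.
        shortest_delay = max(feedback_time - end, 0.0)
        longest_delay = max(feedback_time - start, 0.0)
        weights.append(delay_cdf(longest_delay) - delay_cdf(shortest_delay))
    return weights


# Four time steps of 0.25 s each; the trainer signals disapproval (-1) at t = 1.0 s.
steps = [(0.0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]
weights = credit_weights(steps, feedback_time=1.0)
credited = [w * -1.0 for w in weights]  # each step's share of the reinforcement
```

With these placeholder bounds, most of the credit goes to the two middle steps, since the most recent step is too close to the feedback to have plausibly triggered it.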
![Page 27: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/27.jpg)
TAMER in action: Mountain Car
Before training:
Training:
![Page 28: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/28.jpg)
TAMER Results: Mountain Car (19 subjects)
![Page 29: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/29.jpg)
Contributions of finished work
• new learning paradigm: explicitly defining it and showing its power
– baseline algorithms with great results
• guidance and justification of which algorithms to use
• novel credit assignment method
![Page 30: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/30.jpg)
Publications
• W. Bradley Knox and Peter Stone. TAMER: Training an Agent Manually via Evaluative Reinforcement. In IEEE 7th International Conference on Development and Learning (ICDL-08), August 2008.
• W. Bradley Knox, Ian Fasel, and Peter Stone. Design Principles for Creating Human-Shapable Agents. In AAAI Spring 2009 Symposium on Agents that Learn from Human Teachers, March 2009.
• W. Bradley Knox and Peter Stone. Interactively Shaping Agents via Human Reinforcement: The TAMER Framework. To appear in Proceedings of The Fifth International Conference on Knowledge Capture (KCAP-09), September 2009.
![Page 31: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/31.jpg)
Outline
• Intro to shaping
• Our approach
• Future work
![Page 32: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/32.jpg)
Future work
1. Identify TAMER’s strengths and weaknesses
2. TAMER+R
– Human reinforcement: rich but flawed
– MDP Reward (R): sparse but flawless
– How to use the two signals together?
![Page 33: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/33.jpg)
Future work
3. Extend TAMER to training scenarios that violate our current assumptions
4. What about the human?
- Investigate how humans train via reinforcement.
5. Other cognitive science directions...
![Page 34: Interactively Shaping Agents via Human Reinforcement](https://reader031.fdocuments.us/reader031/viewer/2022021800/620ca919ab36ec0c366a34a1/html5/thumbnails/34.jpg)
The end