Flexible and fast convergent learning agent
Miguel A. Soto Santibanez, Michael M. Marefat
Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ
Background and Motivation
“A computer program is said to LEARN from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
A robot driving learning problem:
Task T: driving on public four-lane highways using vision sensors
Performance measure P: average distance traveled before an error (as judged by a human overseer)
Training experience E: a sequence of images and steering commands recorded while observing a human driver
Background and Motivation II
1) Artificial Neural Networks
Advantage: robust to errors in the training data.
Shortcoming: depends on the availability of good and extensive training examples.
2) Instance-Based Learning
Advantage: able to model complex policies by making use of less complex local approximations.
Shortcoming: depends on the availability of good and extensive training examples.
3) Reinforcement Learning
Advantage: independent of the availability of good and extensive training examples.
Shortcoming: convergence to the optimal policy can be extremely slow.
Background and Motivation III
Motivation:
Is it possible to get the best of both worlds?
Is it possible for a Learning Agent to be flexible and fast-convergent at the same time?
The Problem
Formalization:
Given: a) a set of actions A = {a1, a2, a3, ...}, b) a set of situations S = {s1, s2, s3, ...}, c) and a function TR(a, s) → tr, where tr is the total reward associated with applying action a while at situation s,
the LA needs to construct a set of rules P = {rule(s1, a1), rule(s2, a2), ...} such that ∀ rule(s, a) ∈ P, a = amax, where TR(amax, s) = max(TR(a1, s), TR(a2, s), ...).
Also: 1) increase flexibility, 2) increase speed of convergence.
The Solution
The Q-Learning Algorithm:
1: ∀ rule(s, a) ∈ P, TR(a, s) ← 0
2: find out the current situation si
3: do forever:
4:   select an action ai ∈ A and execute it
5:   find out the immediate reward r
6:   find out the new current situation si′
7:   TR(ai, si) ← r + Factor · maxa TR(a, si′)
8:   si ← si′
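The eight steps above can be sketched in code. This is a minimal tabular Q-learning sketch, not the authors' implementation: the epsilon-greedy action selection, the bounded step count, and the toy corridor world are all illustrative assumptions.

```python
import random

def q_learning(states, actions, step, terminal, start,
               steps=5000, factor=0.9, epsilon=0.5, seed=0):
    """Tabular Q-learning as on the slide:
    TR(a, s) <- r + factor * max_a' TR(a', s')."""
    rng = random.Random(seed)
    TR = {(s, a): 0.0 for s in states for a in actions}  # line 1: init all to 0
    s = start                                            # line 2: current situation
    for _ in range(steps):                               # line 3: "do forever", bounded here
        if rng.random() < epsilon:                       # line 4: select an action
            a = rng.choice(actions)                      #         (epsilon-greedy, an assumption)
        else:
            a = max(actions, key=lambda b: TR[(s, b)])
        r, s2 = step(s, a)                               # lines 5-6: reward and new situation
        future = 0.0 if s2 in terminal else max(TR[(s2, b)] for b in actions)
        TR[(s, a)] = r + factor * future                 # line 7: the update rule
        s = start if s2 in terminal else s2              # line 8 (restart at the goal)
    return TR

# A hypothetical 4-state corridor: moving right from state 2 reaches the
# goal state 3 and earns reward 100; every other move earns 0.
def corridor(s, a):
    s2 = min(s + 1, 3) if a == "R" else max(s - 1, 0)
    return (100 if s2 == 3 else 0), s2

TR = q_learning([0, 1, 2, 3], ["L", "R"], corridor, terminal={3}, start=0)
# With factor 0.9 the values propagate back as 81, 90, 100, matching the
# grid example later in the slides.
```

Note how each new value appears only one state earlier per visit; this is exactly the slow propagation the later slides address.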
The Solution II
Advantages:
1) The LA does not depend on the availability of good and extensive training examples.
Reason: a) this method learns from experimentation instead of given training examples.
Shortcomings:
1) Convergence to the optimal policy can be very slow.
Reasons: a) the Q-learning algorithm propagates “good findings” very slowly; b) the speed of convergence is tied to the number of situations that need to be handled.
2) It may not be possible to use this method on high-dimensionality problems.
Reason: a) the memory requirements grow exponentially as we add more dimensions to the problem.
The Solution III
Speed of convergence is tied to the number of situations:
more situations ⇒ more rules of P that need to be found ⇒ more experiments needed ⇒ slower convergence.
[Figure: a 120,000-situation world contrasted with a 12-situation world.]
The Solution IV
Slow propagation of “good findings”:
[Figure: a 12-situation grid world (possible situations A through L) with its table of intrinsic rewards; Factor = 0.9. After visiting A, B, ..., G once, only the values 90 and 100 appear next to the goal; after visiting A, ..., G twice, 81 has propagated one step further; only after five visits do the values 59, 66, 73, 81, 90 and 100 cover the whole path.]
The Solution V
First Sub-problem: slow propagation of “good findings”.
Solution: develop a method that propagates “good findings” beyond the previous state.
[Figure: the same grid world (states A through L; intrinsic value of F = 100, intrinsic value of all others = 0; Factor = 0.9). Without propagation, one pass through A, ..., G produces only the values 90 and 100; with propagation, the same pass already yields 59, 66, 73, 81, 90 and 100.]
The Solution VI
Solution to the First Sub-problem:
a) Use a buffer, which we call “short term memory”, to keep track of the last n situations.
b) After each learning experience apply the following algorithm:
Begin: t ← currentTime − 1
1) Is the entry visited at time t stored in the “short term memory”? If NO, End.
2) Is the total reward (coming from the entry at time t + 1) bigger than the official value? If NO, End.
3) Otherwise update P, set t ← t − 1, and return to step 1.
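Read as code, the loop above walks the short-term memory backwards, re-applying the Q update while it still improves the stored value. A minimal sketch; the buffer layout (situation, action, reward), the `TR` table, and the example corridor are assumptions consistent with the slides, not the authors' exact data structures.

```python
def propagate(TR, memory, actions, factor=0.9):
    """Walk the "short term memory" backwards, pushing the newest reward
    estimates toward the (situation, action) pairs visited earlier."""
    # memory holds the last n experiences as (situation, action, reward),
    # oldest first; TR is the table of total rewards behind the policy P.
    for t in range(len(memory) - 2, -1, -1):    # t = currentTime - 1, then t - 1, ...
        s, a, r = memory[t]
        s_next = memory[t + 1][0]               # the entry at time t + 1
        candidate = r + factor * max(TR[(s_next, b)] for b in actions)
        if candidate <= TR[(s, a)]:             # not bigger than the official value?
            break                               # then End
        TR[(s, a)] = candidate                  # else update P and continue

# A hypothetical 4-state corridor run that reached the reward at state 2.
actions = ["R"]
TR = {(s, "R"): 0.0 for s in range(4)}
memory = [(0, "R", 0), (1, "R", 0), (2, "R", 100), (3, "R", 0)]
propagate(TR, memory, actions)
# One backward pass yields 81, 90, 100 along the whole path, instead of
# a single new value per visit as in plain Q-learning.
```

The early `break` is what keeps the pass cheap: as soon as a backed-up value stops improving the table, no earlier entry can improve either.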
The Solution VII
The Second and Third Sub-problems:
a) Memory requirements grow exponentially as we add more dimensions to the problem.
b) Speed of convergence is tied to the number of situations that need to be handled.
Solution:
1) We keep only a few examples of the policy (also called prototypes).
2) We generate the policy for situations not described explicitly by these prototypes by “generalizing” from “nearby” prototypes.
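Point (2) amounts to nearest-prototype generalization: the policy is stored only at a few prototype situations, and any other situation borrows the action of its closest prototype. A minimal sketch; the Euclidean distance and the example prototypes are illustrative assumptions, not the paper's exact scheme:

```python
def nearest_action(prototypes, situation):
    """prototypes maps a few situation tuples to their best known action;
    any other situation generalizes from the nearest stored prototype."""
    def sq_dist(p):
        # squared Euclidean distance between prototype p and the query
        return sum((u - v) ** 2 for u, v in zip(p, situation))
    return prototypes[min(prototypes, key=sq_dist)]

# Hypothetical 2-D situations (e.g. speed, distance to obstacle).
prototypes = {(0.0, 0.0): "brake", (10.0, 10.0): "accelerate"}
nearest_action(prototypes, (2.0, 1.0))   # -> "brake" (nearest prototype: (0, 0))
nearest_action(prototypes, (9.0, 8.0))   # -> "accelerate"
```

Memory now grows with the number of prototypes rather than with the number of situations, which is the point of the second and third sub-problem solutions.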
The Solution VIII
[Figure: Kanerva Coding and Tile Coding contrasted with Moving Prototypes.]
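For context, tile coding (one of the fixed-prototype schemes named above) covers the situation space with several overlapping grids, so nearby situations activate mostly the same tiles. A generic sketch; the tiling count, tile size, and diagonal offsets are illustrative assumptions:

```python
def tile_indices(x, y, tilings=4, tile_size=1.0):
    """Return one active tile per tiling for a 2-D situation.  Each tiling
    is shifted by a fraction of the tile size, so nearby situations share
    most (but not all) of their active tiles."""
    active = []
    for k in range(tilings):
        off = k * tile_size / tilings          # simple diagonal offsets
        ix = int((x + off) // tile_size)
        iy = int((y + off) // tile_size)
        active.append((k, ix, iy))
    return active

a = tile_indices(0.20, 0.20)
b = tile_indices(0.25, 0.25)
# The two nearby situations share 3 of their 4 active tiles, so a value
# learned for one generalizes partly to the other.
```

Unlike Moving Prototypes, these grids are fixed in advance, and their memory cost grows exponentially with the number of input dimensions.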
The Solution IX
The Solution X
The Solution XI
The Solution XII
A sound tree:
a) all the “areas” are mutually exclusive,
b) their merging is exhaustive,
c) the merging of any two sibling “areas” is equal to their parent's “area”.
[Figure: two parent/children tree diagrams.]
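The three soundness conditions can be checked mechanically. Below is a minimal sketch for a 1-D region tree, where each node's “area” is a half-open interval [lo, hi); the interval encoding and the left-before-right sibling order are illustrative assumptions, not the paper's representation.

```python
class Node:
    """A region-tree node whose "area" is the half-open interval [lo, hi)."""
    def __init__(self, lo, hi, left=None, right=None):
        self.lo, self.hi, self.left, self.right = lo, hi, left, right

def is_sound(node):
    """Sibling areas must be mutually exclusive and merge exactly into
    their parent's area: no overlap, no gap, nothing left over."""
    if node.left is None and node.right is None:   # a leaf is trivially sound
        return True
    l, r = node.left, node.right
    ok = (l.lo == node.lo and r.hi == node.hi      # merging is exhaustive
          and l.hi == r.lo)                        # mutually exclusive, no gap
    return ok and is_sound(l) and is_sound(r)

sound = Node(0, 10, Node(0, 4), Node(4, 10))
broken = Node(0, 10, Node(0, 3), Node(4, 10))      # gap between the siblings
is_sound(sound)   # -> True
is_sound(broken)  # -> False
```

The `broken` tree illustrates the kind of sibling pair the next slide calls an impossible merge: their areas cannot be combined into the parent's area.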
The Solution XIII
Impossible Merge
The Solution XIV
“Smallest predecessor”
The Solution XV
The Solution XVI
Possible ways of breaking the existing nodes:
[Figure: the node being inserted.]
The Solution XVII
[Figure: List 1 is split into List 1.1 and List 1.2.]
The Solution XVIII
The Solution XIX
The Solution XX
Results
The performance of the algorithm “Propagation of Good Findings” is especially good when the world is large:
[Chart: experiences needed (0 to 1,000,000) vs. world size (0 to 35) for the Look Around, Q-Learning, and Propagation algorithms; Memory Size = 100, Seed = 9642, Factor = 0.99.]
The algorithm “Propagation of Good Findings” is more efficient when the size of its “Short Term Memory” is large:
[Chart: experiences needed (0 to 10,000) vs. memory size (0 to 7) for the Look Around, Q-Learning, and Propagation algorithms; Seed = 2129, World Size = 7×7, Factor = 0.9.]
Results II
The algorithm “Propagation of Good Findings” is more efficient when the value of the parameter “discount factor” is large:
[Chart: experiences needed (0 to 60,000) vs. discount factor (0 to 1.2) for the Look Around, Q-Learning, and Propagation algorithms; Memory Size = 100, World Size = 7×7, Seed = 2129.]
Results do not depend on the sequence of random numbers.
Conclusions
The proposed Learning Agent combines Moving Prototypes, the Q-Learning Algorithm, and Propagation of Good Findings.
Q-Learning Algorithm ⇒ the LA becomes more flexible
Propagation concept ⇒ convergence is accelerated
Moving Prototypes concept ⇒ the LA becomes more flexible
Moving Prototypes concept ⇒ convergence is accelerated
Conclusions II
What is left to do:
Obtain results on the advantages of using regression trees and linear approximation over other similar methods (just as we have already done with the method “Propagation of Good Findings”).
Apply the proposed model to example applications, such as a self-optimizing middleman between a high-level planner and the actuators in a robot.
Develop more precisely the limits on the use of this model.