Definition of univariate B-splines
The B-splines are employed to specify the linguistic terms, and the knots are chosen to be pairwise different (periodic model). Visually, the selection of k (the order of the B-splines) determines the shape and smoothness of the fuzzy sets modeling the linguistic terms.
Assume x is a general input variable of a control system that is defined on the universe of discourse $[x_1, x_m]$. Given a sequence of ordered parameters (knots) $x_1, x_2, \dots$, the i-th B-spline $N_{i,k}$ of order k (degree k − 1) is recursively defined as follows:
$$
N_{i,k}(x) =
\begin{cases}
\begin{cases} 1 & \text{for } x \in [x_i, x_{i+1}) \\ 0 & \text{otherwise} \end{cases} & \text{if } k = 1 \\[2ex]
\dfrac{x - x_i}{x_{i+k-1} - x_i}\, N_{i,k-1}(x) + \dfrac{x_{i+k} - x}{x_{i+k} - x_{i+1}}\, N_{i+1,k-1}(x) & \text{if } k > 1
\end{cases}
\tag{1}
$$

with $i = 1, \dots, m - k$.
Therefore, $m$ knots $x_i$ $(i = 1, \dots, m)$ form $l = m - k$ B-splines (Figure 1).
Figure 1: Nine B-splines of order 3 defined over 12 non-uniformly distributed knots.
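Since everything below builds on the recursion in (1), a minimal Python sketch of it may be useful; the knot values are illustrative assumptions, not the knots of Figure 1.

```python
def bspline_basis(i, k, x, knots):
    """N_{i,k}(x) from (1); 0-based index i, order k, half-open knot spans."""
    if k == 1:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    value = 0.0
    if knots[i + k - 1] > knots[i]:          # guard repeated knots
        value += (x - knots[i]) / (knots[i + k - 1] - knots[i]) \
                 * bspline_basis(i, k - 1, x, knots)
    if knots[i + k] > knots[i + 1]:
        value += (knots[i + k] - x) / (knots[i + k] - knots[i + 1]) \
                 * bspline_basis(i + 1, k - 1, x, knots)
    return value

# m = 12 knots and order k = 3 give l = m - k = 9 B-splines, as in Figure 1.
knots = [0.0, 0.5, 1.2, 2.0, 2.4, 3.1, 4.0, 4.3, 5.0, 5.8, 6.5, 7.0]
print([round(bspline_basis(i, 3, 2.5, knots), 3) for i in range(9)])
```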
Examples of B-splines of order 1, 2, 3 and 4 with their knots are shown in Figure 2.
Figure 2: Non-uniform univariate B-splines of order 1 to 4 defined on a parameter x.
In each interval $[x_j, x_{j+1}]$, $k$ non-zero B-splines overlap.
The example of order 3 is shown in Figure 3.
Figure 3: B-splines of order 3 on an interval $[x_j, x_{j+1}]$, defined on a parameter x.
Properties of B-Splines
The recursive definition is a basic feature of B-splines: it enables the generation of B-splines of arbitrary order, with increasing smoothness, for a given set of knots. The other most important properties of B-splines with respect to modeling and control are:
• Partition of unity: $\sum_{i=0}^{l} N_{i,k}(x) = 1$.
• Positivity: $N_{i,k}(x) \ge 0$ for all $x$.
• Local support: $N_{i,k}(x) = 0$ for $x \notin [x_i, x_{i+k}]$.
• $C^{k-2}$ continuity: if the knots $\{x_i\}$ are pairwise different from each other, then $N_{i,k}(x) \in C^{k-2}$, i.e., $N_{i,k}(x)$ is $(k-2)$ times continuously differentiable.
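These properties can be checked numerically; the sketch below does so with SciPy's `BSpline.basis_element`, again over an assumed knot vector.

```python
import numpy as np
from scipy.interpolate import BSpline

k = 3                                           # order (degree k - 1 = 2)
knots = np.array([0.0, 0.5, 1.2, 2.0, 2.4, 3.1, 4.0, 4.3, 5.0, 5.8, 6.5, 7.0])
m = len(knots)

# N_{i,k} is determined by the k + 1 knots x_i, ..., x_{i+k} (local support).
basis = [BSpline.basis_element(knots[i:i + k + 1], extrapolate=False)
         for i in range(m - k)]

# Sample strictly inside the region where k splines overlap in every span.
x = np.linspace(knots[k - 1] + 1e-9, knots[m - k] - 1e-9, 200)
vals = np.array([np.nan_to_num(b(x)) for b in basis])

print(np.all(vals >= -1e-12))                   # positivity
print(np.allclose(vals.sum(axis=0), 1.0))       # partition of unity
```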
Lattice
Figure 4: The B-spline model – a two-dimensional illustration.
Each n-dimensional rectangle (n > 1) of the lattice is covered by multivariate B-splines $N_{\mathbf{k}}(\mathbf{x})$, each formed by taking the tensor product of $n$ univariate B-splines:

$$ N_{\mathbf{k}}(\mathbf{x}) = \prod_{j=1}^{n} N^{j}_{i_j,k_j}(x_j) \tag{2} $$
Therefore the shape of each univariate B-spline, and thus the shape of the multivariate ones (Figure 5), is implicitly set by their order and the given knot distribution on each input interval.
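A small sketch of the tensor product in (2) for n = 2 may make the indexing concrete; the recursion is repeated so the snippet is self-contained, and all knot vectors and indices are invented for illustration.

```python
def bspline_basis(i, k, x, knots):
    """Univariate N_{i,k}(x), the recursion from (1)."""
    if k == 1:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    v = 0.0
    if knots[i + k - 1] > knots[i]:
        v += (x - knots[i]) / (knots[i + k - 1] - knots[i]) \
             * bspline_basis(i, k - 1, x, knots)
    if knots[i + k] > knots[i + 1]:
        v += (knots[i + k] - x) / (knots[i + k] - knots[i + 1]) \
             * bspline_basis(i + 1, k - 1, x, knots)
    return v

def tensor_bspline(indices, orders, point, knot_vectors):
    """prod_j N_{i_j,k_j}(x_j): one univariate factor per input dimension."""
    prod = 1.0
    for i, k, x, t in zip(indices, orders, point, knot_vectors):
        prod *= bspline_basis(i, k, x, t)
    return prod

# Bivariate case as in Figure 5(c): order 3 in both directions.
tx = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ty = [0.0, 0.8, 1.5, 3.0, 4.2, 5.0]
print(tensor_bspline((1, 2), (3, 3), (2.5, 2.0), (tx, ty)))
```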
(a) Tensor product of two order-2 univariate B-splines.
(b) Tensor product of one order-3 and one order-2 univariate B-spline.
(c) Tensor product of two univariate B-splines of order 3.
Figure 5: Bivariate B-splines formed by taking the tensor product of two univariate B-splines.
Fuzzy Controller of a MISO System - I
Conditions of B-spline Fuzzy Controllers:
• periodic B-spline basis functions as membership functions for the inputs,
• fuzzy singletons as membership functions for the outputs,
• "product" as the fuzzy conjunction,
• "centroid" as the defuzzification method,
• addition of "virtual linguistic terms" at both ends of each input variable, and
• extension of the rule base to the "virtual linguistic terms" by copying the output values of the "nearest" neighbours.
B-Spline Fuzzy Controller of a MISO System - II
For a MISO system with n inputs $x_1, x_2, \dots, x_n$, the rules, with n conjunctive terms in the premise, are given in the following form:

Rule$(i_1, i_2, \dots, i_n)$: IF ($x_1$ is $N^{1}_{i_1,k_1}$) and ($x_2$ is $N^{2}_{i_2,k_2}$) and ... and ($x_n$ is $N^{n}_{i_n,k_n}$) THEN $y$ is $Y_{i_1 i_2 \dots i_n}$,

where
• $x_j$: the j-th input ($j = 1, \dots, n$),
• $k_j$: the order of the B-spline basis functions used for $x_j$,
• $N^{j}_{i_j,k_j}$: the $i_j$-th linguistic term of $x_j$, defined by B-spline basis functions,
• $i_j = 0, \dots, m_j$, representing how finely the j-th input is fuzzily partitioned,
• $Y_{i_1 i_2 \dots i_n}$: the control vertex (de Boor point) of Rule$(i_1, i_2, \dots, i_n)$.
Fuzzy Controller of a MISO System - III
The output y of a MISO fuzzy controller is:
$$
y = \frac{\displaystyle\sum_{i_1=0}^{m_1} \cdots \sum_{i_n=0}^{m_n} \Bigl( Y_{i_1,\dots,i_n} \prod_{j=1}^{n} N^{j}_{i_j,k_j}(x_j) \Bigr)}{\displaystyle\sum_{i_1=0}^{m_1} \cdots \sum_{i_n=0}^{m_n} \prod_{j=1}^{n} N^{j}_{i_j,k_j}(x_j)}
\tag{3}
$$

$$
= \sum_{i_1=0}^{m_1} \cdots \sum_{i_n=0}^{m_n} \Bigl( Y_{i_1,\dots,i_n} \prod_{j=1}^{n} N^{j}_{i_j,k_j}(x_j) \Bigr)
\tag{4}
$$
This is called a general NUBS hypersurface, which possesses the following properties:
• If B-spline basis functions of order $k_1, k_2, \dots, k_n$ are employed to specify the linguistic terms of the input variables $x_1, x_2, \dots, x_n$, it is guaranteed that the output variable $y$ is $(k_j - 2)$ times continuously differentiable with respect to the input variable $x_j$, $j = 1, \dots, n$.
• If the input space is partitioned finely enough and at the correct positions, the interpolation with the B-spline hypersurface can reach a given precision.
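As a sketch, the evaluation of (4) can be written as a sum over all rule-index tuples; the partition of unity makes the denominator of (3) equal to one inside the knot range, which the code below assumes. All names, knots and vertex values are illustrative.

```python
import itertools

def bspline_basis(i, k, x, knots):
    """Univariate N_{i,k}(x), the recursion from (1)."""
    if k == 1:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    v = 0.0
    if knots[i + k - 1] > knots[i]:
        v += (x - knots[i]) / (knots[i + k - 1] - knots[i]) \
             * bspline_basis(i, k - 1, x, knots)
    if knots[i + k] > knots[i + 1]:
        v += (knots[i + k] - x) / (knots[i + k] - knots[i + 1]) \
             * bspline_basis(i + 1, k - 1, x, knots)
    return v

def fuzzy_output(x, orders, knot_vectors, Y):
    """Evaluate (4): y = sum over idx of Y[idx] * prod_j N_{i_j,k_j}(x_j)."""
    ranges = [range(len(t) - k) for t, k in zip(knot_vectors, orders)]
    y = 0.0
    for idx in itertools.product(*ranges):
        w = 1.0
        for j, i in enumerate(idx):
            w *= bspline_basis(i, orders[j], x[j], knot_vectors[j])
            if w == 0.0:
                break              # local support: this rule does not fire
        y += w * Y[idx]
    return y

# Two inputs with order 2 (triangular terms) and invented control vertices.
tx = [0.0, 1.0, 2.0, 3.0]
ty = [0.0, 1.0, 2.0, 3.0]
Y = {(i, j): float(i + j) for i in range(2) for j in range(2)}
print(fuzzy_output((1.5, 1.5), (2, 2), (tx, ty), Y))   # -> 1.0
```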
B-spline Type: SISO Systems
A SISO system with B-spline basis functions of order 2 ($X_i(x)$: firing strength of rule $i$; $y_i$: the contribution of rule $i$ to the output).
MISO Systems - A 2D Example
An example with two input variables ($x$ and $y$) and an output $z$. The control vertices of the output are $Z_1, Z_2, Z_3, Z_4$.
The linguistic terms of the inputs:
The linguistic terms of the output:
A 2D Example - The Rule Base
The rule base consists of four rules:
1) IF x is X1 and y is Y1 THEN z is Z1
2) IF x is X1 and y is Y2 THEN z is Z2
3) IF x is X2 and y is Y1 THEN z is Z3
4) IF x is X2 and y is Y2 THEN z is Z4
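A hedged numeric walk-through of this rule base, assuming order-2 (triangular) linguistic terms over [0, 1] for both inputs and invented vertex values Z1 to Z4:

```python
def mu(a):
    """Clamp to [0, 1]; a ramp, i.e. an order-2 B-spline membership."""
    return max(0.0, min(1.0, a))

def infer(x, y, Z1=0.0, Z2=1.0, Z3=2.0, Z4=3.0):
    X1, X2 = mu(1 - x), mu(x)      # partition of unity: X1 + X2 = 1 on [0, 1]
    Y1, Y2 = mu(1 - y), mu(y)
    # product conjunction, singleton outputs, centroid defuzzification:
    return X1*Y1*Z1 + X1*Y2*Z2 + X2*Y1*Z3 + X2*Y2*Z4

# Firing strengths at (0.25, 0.5): X1 = 0.75, X2 = 0.25, Y1 = Y2 = 0.5,
# so z = 0.375*Z1 + 0.375*Z2 + 0.125*Z3 + 0.125*Z4 = 1.0.
print(infer(0.25, 0.5))
```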
A 2D Example - Inference
A 2D Example - Defuzzification
Supervised Learning
Supervised learning assumes that a "teacher" provides the complete desired system output for each input datum.
Based on the complete set of these input/output vectors, B-spline type fuzzy controllers can be trained very rapidly.
Computing the parameters of such a B-spline fuzzy system is divided into two steps: one for the IF-part and one for the THEN-part.
Considering the granularity of the input space and, if known, the maximal point distribution of the control space, the fuzzy sets can be generated using the recursive computation of the B-spline basis functions.
The control vertices of the THEN-parts can be obtained automatically through a learning procedure.
Learning algorithm - I
Assume $\{(X, y_d)\}$ is a set of training data, where
• $X = (x_1, x_2, \dots, x_n)$: the input data vector,
• $y_d$: the desired output for $X$.
The squared error is computed as:

$$ E = \frac{1}{2}(y_r - y_d)^2, \tag{5} $$

where $y_r$ is the current real output value during training.
The parameters to be found are $Y_{i_1,i_2,\dots,i_n}$, which make the error in (5) as small as possible, i.e.

$$ E = \frac{1}{2}(y_r - y_d)^2 \equiv \mathrm{MIN}. \tag{6} $$

Each control vertex $Y_{i_1,\dots,i_n}$ can be modified by using the gradient descent method:

$$ \Delta Y_{i_1,\dots,i_n} = -\varepsilon \frac{\partial E}{\partial Y_{i_1,\dots,i_n}} \tag{7} $$

$$ = -\varepsilon\,(y_r - y_d) \prod_{j=1}^{n} N^{j}_{i_j,k_j}(x_j), \tag{8} $$

where $0 < \varepsilon \le 1$.
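A minimal sketch of one training step according to (7)/(8); the `activations` dictionary (rule index tuple → $\prod_j N^{j}_{i_j,k_j}(x_j)$) is assumed to be computed as in the earlier sketches, and ε and the toy data are illustrative.

```python
def train_step(activations, Y, y_d, eps=0.5):
    """Gradient descent on the control vertices for one sample (X, y_d)."""
    y_r = sum(w * Y[idx] for idx, w in activations.items())  # current output
    for idx, w in activations.items():
        if w != 0.0:                    # local support: only firing rules move
            Y[idx] -= eps * (y_r - y_d) * w   # Delta Y = -eps*(yr - yd)*prod N
    return y_r

# Toy usage: two rules firing with strengths 0.25 / 0.75, target output 1.0.
Y = {(0,): 0.0, (1,): 0.0}
for _ in range(20):
    y_r = train_step({(0,): 0.25, (1,): 0.75}, Y, y_d=1.0)
print(round(y_r, 3), {k: round(v, 3) for k, v in Y.items()})
```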
The gradient descent method guarantees that the learning algorithm converges to the global minimum of the error function, because the second partial derivative of the quadratic error function with respect to $Y_{i_1,i_2,\dots,i_n}$ is constant:

$$ \frac{\partial^2 E}{\partial Y_{i_1,\dots,i_n}^2} = \Bigl( \prod_{j=1}^{n} N^{j}_{i_j,k_j}(x_j) \Bigr)^2 \ge 0. \tag{9} $$

This means that the error function (5) is convex in the space of the $Y_{i_1,i_2,\dots,i_n}$ and therefore possesses only one (global) minimum.
Immediate learning by self-evaluation
A fuzzy system can learn under supervision.
Such a learning process needs a teacher, i.e. for each input vector the desired output should be known. The fuzzy controller then attempts to interpolate these input/output vectors to provide a continuous (hyper-)surface for the whole control space.
In reality, it is not always simple to find the goal function of the output for a complex system. An unsupervised learning approach should therefore be developed.
Based on a B-spline fuzzy controller, the parameters to be learned are still mainly the control vertices of the "THEN" part.
The key problem of unsupervised learning with such a model is then how to modify the control vertices after each learning step, i.e. the direction of the change (+ or −) and its magnitude.
Inspiration by Supervised Learning
We first discuss a control system with $(X_1, X_2, \dots, X_n)$ as input and $Y$ as output. Let us rewrite the modification of the control vertices for supervised learning:

$$ \Delta y_{i_1,\dots,i_n} = -\varepsilon \frac{\partial E}{\partial y_{i_1,\dots,i_n}} = -\varepsilon\,(y_r - y_d) \prod_{j=1}^{n} X_{i_j,k_j}(x_j) = -\operatorname{sign}(y_r - y_d)\;\varepsilon\;|y_r - y_d| \prod_{j=1}^{n} X_{i_j,k_j}(x_j) \tag{10} $$

$-\operatorname{sign}(y_r - y_d)$ indicates the direction of the modification of $y_{i_1,\dots,i_n}$ in each learning step, while the product $\varepsilon \cdot |y_r - y_d| \cdot \prod_{j=1}^{n} X_{i_j,k_j}(x_j)$ determines the magnitude of the modification.
Evaluation Function - I
In unsupervised learning, it is usually possible to define an "evaluation function". Such an evaluation function should describe how "good" the current system state $((x_1, x_2, \dots, x_n), y)$ is.
For each input vector, an output is generated. With this output, the system transits to another state. The new state is compared with the old one; an adaptation is performed if necessary.
Assume the evaluation function, denoted by $V(\cdot)$, takes a larger value for a better state, i.e. for two states $s_t$ and $s_{t+1}$: if $s_t$ is better than $s_{t+1}$, then $V(s_t) \ge V(s_{t+1})$. The adaptation of the control vertices can then be performed with a representation similar to that of supervised learning.
Evaluation Function - II
Let us reconsider the modification of the control vertices through equation (10). State $s_t$ transits to $s_{t+1}$ by the output $y_r$. The desired state is $s_d$. We replace $y_r$ in (10) with $V(s_{t+1})$ and $y_d$ with $V(s_d)$.
Assume two system states $s_t$ and $s_{t+1}$, and $s_t$ is better than $s_{t+1}$, i.e. $V(s_t) \ge V(s_{t+1})$, where $V(\cdot)$ is the evaluation function.
We consider those systems for which a function $V(\cdot)$ can be found that fulfills the following condition:
Assume $s_t$ is the current state and $y$ an arbitrary output. With $y$ the system transits to the state $s_{t+1}$. If another output $y'$ fulfills $y \times y' \le 0$, and with $y'$ the system transits to $s'_{t+1}$, then the following relation of the evaluation functions is valid:

$$ \bigl( V(s_{t+1}) - V(s_t) \bigr) \times \bigl( V(s'_{t+1}) - V(s_t) \bigr) \le 0. \tag{11} $$
Modifying Control Vertices in Reinforcement Learning - I
At the moment $t$ the system is in the state $s_t$. The ideal state at the moment $t + 1$ would be $s_d$.
With the controller output $y_r$ generated at the moment $t$, the system transits to the state $s_{t+1}$.
Considering the state transition from $s_t$ to $s_{t+1}$, there are three possible constellations of $s_t$, $s_{t+1}$ and $s_d$, cases (a), (b) and (c):
Modifying Control Vertices in Reinforcement Learning - II
(a): The system state becomes worse, i.e. the system acts incorrectly. According to the condition in (11), the direction of change is $-\operatorname{sign}(y)$.
(b): The system acts in the correct direction. The magnitude of the output should be enlarged. The direction of change is then $\operatorname{sign}(y)$.
(c): This case is the inverse of case (b). The direction of change should be $-\operatorname{sign}(y)$.
These three cases can be synthesized by

$$ S = \operatorname{sign}\bigl(V(s_t) - V(s_{t+1})\bigr) \cdot \operatorname{sign}\bigl(V(s_{t+1}) - V(s_d)\bigr) \cdot \operatorname{sign}(y). \tag{12} $$

The change of the control vertices can finally be written as:

$$ \Delta y_{i_1,\dots,i_n} = S \cdot \varepsilon \cdot |V(s_{t+1}) - V(s_d)| \cdot \prod_{j=1}^{n} X_{i_j,k_j}(x_j). \tag{13} $$
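A minimal sketch of the update (12)/(13) for a single control vertex; all arguments are supplied by the surrounding controller and are placeholders here.

```python
import math

def delta_vertex(V_st, V_st1, V_sd, y, eps, activation):
    """Change of one control vertex after the transition s_t -> s_{t+1}.

    activation = prod_j X_{ij,kj}(x_j), the firing strength of the rule.
    """
    sign = lambda v: math.copysign(1.0, v) if v != 0.0 else 0.0
    S = sign(V_st - V_st1) * sign(V_st1 - V_sd) * sign(y)        # (12)
    return S * eps * abs(V_st1 - V_sd) * activation              # (13)
```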
Learning of Cart-Pole Balancing
The pendulum possesses an initial state $(\theta, \dot\theta)$. To be found is the force $f$ to be exerted that brings the cart-pole system to the balanced final state $\theta = 0$ and $\dot\theta = 0$.
The inputs of the system are:
• angle: $\theta\,(^\circ) \in [-15, +15]$ and
• angular velocity: $\dot\theta\,(^\circ/s) \in [-20, +20]$.
Each of the two input variables is covered with 7 B-spline basis functions of order 3.
The output of the system is the force f to be exerted on the cart.
For learning we choose the evaluation function as

$$ V(s_t) = V(\theta, \dot\theta) \stackrel{\mathrm{def}}{=} -|2\theta + \dot\theta|, $$

and, relating the evaluations of the desired state $s_d$ and the current state $s_t$: $V(s_d) \stackrel{\mathrm{def}}{=} 0.5 \cdot V(s_t)$.
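A hedged sketch of one learning cycle for this task, wiring the evaluation function above into the update (13); the one-line Euler "dynamics" and all constants are stand-ins, not the simulation used in the lecture.

```python
import math

def V(theta, theta_dot):
    """Evaluation of a state, as chosen above."""
    return -abs(2.0 * theta + theta_dot)

def sign(v):
    return math.copysign(1.0, v) if v != 0.0 else 0.0

def learning_cycle(theta, theta_dot, f, vertices, activations, eps=0.1):
    V_st = V(theta, theta_dot)
    V_sd = 0.5 * V_st                              # desired-state evaluation
    # stand-in dynamics: crude Euler step of a linearized pendulum
    theta_dot1 = theta_dot + 0.02 * (9.81 * theta - f)
    theta1 = theta + 0.02 * theta_dot1
    V_st1 = V(theta1, theta_dot1)
    S = sign(V_st - V_st1) * sign(V_st1 - V_sd) * sign(f)        # (12)
    for idx, w in activations.items():                           # (13)
        vertices[idx] += S * eps * abs(V_st1 - V_sd) * w
    return theta1, theta_dot1
```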
CP-Balancing: Control Surfaces
at the beginning · after 100 learning steps · after 3000 learning steps
CP-Balancing - Validation
The motion profiles of the pendulum from the starting state ($\theta = -10$, $\dot\theta = 10$):
angle:
angular velocity:
applied force:
Inverted Pendulum: I
Problem: balance the pendulum P by controlling the motor M.
Input: two state variables:
• angle $\theta$; • angular velocity $\dot\theta$,
expressed as the difference $\Delta\theta_t = \theta_t - \theta_{t-1}$
Output: one control variable, the motor current → motor speed $v$
Quantization of the three linguistic variables into seven fuzzy sets (linguistic terms) each:
{NB, NM, NS, Z, PS, PM, PB}
Example: rule (NM, Z; PM)
IF the angle $\theta$ is in its medium negative range and the angular velocity $\dot\theta$ is approximately zero,
THEN the motor speed $v$ should be in its medium positive range.
The rule base in table form:
                          θ
          NB    NM    NS    Z     PS    PM    PB
    NB                      PB
    NM                      PM
    NS                PS          NS
∆θ  Z     PB    PM    PS    Z     NS    NM    NB
    PS                PS          NS
    PM                      NM
    PB                      NB
The KHEPERA Miniature Robot
• Motorola 68331 microcontroller
• 128 KByte RAM, 128 KByte ROM
• connection to the outside world via a serial cable
• 2 stepper motors, 600 steps/revolution, i.e. one step corresponds to 1/12 mm
• 8 proximity sensors (infrared), Siemens SFH900, sensitive up to at most 5 cm
KHEPERA — Sensors
Input for the controller: IR sensors
0: SL85, 1: SL45, mean of 2 and 3: SLR0,
4: SR45, 5: SR85
Sensor readings versus distance:
The Obstacle Avoidance Problem
Output: speeds of the left and right motors ⇒ robot speed $v$, steering angle $s$
Goal: collision avoidance, i.e. driving around obstacles as "smoothly" as possible
Structure of the fuzzy controller:
Membership functions of the inputs and outputs
IR sensor values:
Robot speed $v$:
Steering angle $s$:
The Rules of the System: I
Evasive maneuver in free space when an obstacle is detected on the right:
Fuzzy input variables / output variables:
SL85 SL45 SLR0 SR45 SR85 speed steer
vl vl vl vl low high n
vl vl vl low low low nm
vl vl low low low low nb
vl low low low low low nb
vl vl vl vl high low nm
vl vl vl low high vl nb
vl vl low low high vl nb
vl vl vl high high vl nb
vl vl high high high vl nb
vl vl vl vl vh vl nb
vl vl vl low vh vl nb
vl vl vl high vh vl nb
vl vl low high vh vl nb
vl low high high vh vl nb
vl vl vl vh vh vl nb
vl vl low vh vh vl nb
vl vl vh vh vh vl nb
vl low vh vh vh vl nb
low high vh vh vh vl nb
Autonomous Mobile Robots: 1
Goal: goal-directed driving and collision avoidance
Special features:
• fuzzification of the sensor signals;
(Figure: (b) laser range finder)
Autonomous Mobile Robots: 2
• fuzzy rules for the realization of behavior patterns ("behaviors");
GO → SC: 1 rule
OP → SC: 4 rules
GO → TC: 3 rules
"Far" OP → TC: 2 rules
"Near" OP → TC: 2 rules
"Very close" OP → TC: 3 rules
where SC ("speed control") and TC ("turn control") are functions of GO ("goal orientation") and OP ("obstacle proximity").
Autonomous Mobile Robots: 3
• representation of the "goal-tracking" behavior:
• on-board VLSI chip
→ all rules can be processed in 30 µs.
Reinforcement Learning
In each control cycle the robot receives both sensor data and a reinforcement signal; it then executes an action, which changes its state.
Reinforcement learning lies between supervised and unsupervised learning.
The robot agent can also learn from a "delayed reinforcement" signal. In this case, an action of the robot agent is also rewarded if it only indirectly led to the goal. This can happen if the corresponding action had to be executed in order to enable further actions towards the goal state.
Skill Acquisition by a Robot
Skill acquisition: "improvement of motor or cognitive abilities through training. Reading instructions provides only the initial knowledge, which must then be successively improved and refined."
(Carbonell et al., 1983)
Illustration of the reinforcement learning problem:
Markov Decision Process (MDP)
An MDP is given by
• a set $S$ of discrete states,
• a set $A$ of possible actions,
• a reward function $r_t = r(s_t, a_t)$,
• a successor function $s_{t+1} = \delta(s_t, a_t)$.
The functions $r$ and $\delta$ are part of the environment and are not necessarily known to the agent.
Graph of a Markov Decision Process
The Problem of Incomplete State Information
One also speaks of hidden states.
Example of incomplete state information:
Execution of the MDP
At each time step $t$, the agent runs through the following steps:
1. Determine the current state $s_t$.
2. Choose an action $a_t$.
3. Execute $a_t$.
4. Receive the reward $r_t = r(s_t, a_t)$.
In reaction to $a_t$, the environment transitions into a new state $s_{t+1} = \delta(s_t, a_t)$.
Policy
A function

$$ \pi : S \to A $$

is called a policy.
It represents a strategy for how the agent chooses an action $a = \pi(s)$ in a given state $s$.
The task is to learn this function $\pi$.
Cumulative Value
The cumulative value $V^\pi(s_t)$ is the accumulated reward the agent obtains when it starts from a state $s_t$ and follows a policy $\pi$.
There are different definitions of $V^\pi(s_t)$, which take future rewards into account in different ways.
Definitions of $V^\pi(s_t)$
• "Discounted cumulative reward": $V^\pi(s_t) = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$
• "Finite horizon reward": $V^\pi(s_t) = \sum_{i=0}^{h} r_{t+i}$
• "Average reward": $V^\pi(s_t) = \lim_{h \to \infty} \frac{1}{h} \sum_{i=0}^{h} r_{t+i}$
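Minimal sketches of the three definitions for a finite reward sequence (the infinite discounted sum is necessarily truncated here):

```python
def discounted_cumulative(rewards, gamma=0.9):
    """Truncated discounted cumulative reward: sum_i gamma^i * r_{t+i}."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

def finite_horizon(rewards, h):
    """Finite horizon reward over steps t, ..., t + h."""
    return sum(rewards[:h + 1])

def average_reward(rewards):
    """Average reward, approximated over the available sequence."""
    return sum(rewards) / len(rewards)
```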
Optimal Policy
A policy that maximizes $V^\pi(s)$ for all states $s$ is called an optimal policy $\pi^*$:

$$ \pi^* \equiv \arg\max_{\pi} V^\pi(s), \quad \forall s $$

The cumulative value of an optimal policy is also denoted by $V^*(s)$:

$$ V^*(s) \equiv V^{\pi^*}(s) $$
Learning the Optimal Policy
From the definition of $V^\pi(s_t)$,

$$ V^\pi(s_t) = \sum_{i=0}^{\infty} \gamma^i r_{t+i}, $$

it follows immediately for $\pi^*(s)$:

$$ \pi^*(s) = \arg\max_{a} \bigl[ r(s, a) + \gamma V^*(\delta(s, a)) \bigr] $$

That is: the optimal policy can be learned by learning $V^*$, provided that $r$ and $\delta$ are known.
But this is often not the case!
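A sketch of extracting $\pi^*$ from $V^*$ when $r$ and $\delta$ are known, following the equation above; `S`, `A`, `r`, `delta` and `V_star` are placeholders for a concrete MDP.

```python
def greedy_policy(S, A, r, delta, V_star, gamma=0.9):
    """pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ] for each s."""
    return {s: max(A, key=lambda a: r(s, a) + gamma * V_star[delta(s, a)])
            for s in S}
```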
Model-based or model-free?
Model-based reinforcement learning:
e.g. with dynamic programming.
Compare with A* search.
Application example: e.g. collision-free path planning with a known representation of the environment.
Model-free reinforcement learning:
$r$ and $\delta$ are unknown.
⇒ Q-learning
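A minimal tabular Q-learning sketch, since the slides point to Q-learning as the model-free alternative; the `env` interface (reset/step/actions) and all parameters are assumptions for illustration.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning; learns Q(s, a) without a model of r or delta."""
    Q = defaultdict(float)                       # Q[(s, a)], initialized to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            acts = env.actions(s)
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda a: Q[(s, a)])
            s1, r, done = env.step(a)            # environment reaction
            best_next = max(Q[(s1, a1)] for a1 in env.actions(s1))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s1
    return Q
```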