-
AASS Örebro University- PhD Seminar 1
PhD SEMINAR
REINFORCEMENT LEARNING
MACHINE LEARNING COURSE: AASS, ÖREBRO UNIV
Presented by:
Muhammad Rehan Ahmed, PhD Student
Intelligent Control Lab
-
AGENDA
• Simple Learning Taxonomy
• Reinforcement Learning
  • Basic terminology: policy, goal, rewards, tasks, etc.
  • Value functions, and how to estimate them
• General RL Algorithm
• Reinforcement Comparison
• Actor-Critic Methods
• Research Paper
-
Simple Learning Taxonomy
• Supervised Learning:
  • The learner has access to a teacher who corrects it.
  • The teacher provides the required response to inputs.
  • The desired behavior is known.
  • The correct target output is known for each input pattern.
  • Sometimes costly (expert examples are expensive and scarce).
• Unsupervised Learning:
  • No access to a teacher.
  • The learner must search for order/patterns in the environment.
  • No target output; no "right answer".
  • Learning proceeds by repeatedly searching for patterns in the inputs.
-
Continued......
• Reinforcement Learning: the brief concept
  • The learner is not told what actions to take, but gets reward/punishment from the environment and learns, by trial-and-error search, which action to pick next time.
  • Learning from interaction (the agent-environment interface):
    • With the environment
    • To achieve some goal
    • Cheap and plentiful
  • Examples: a baby playing, learning to drive a car, holding a conversation, etc.
  1. The environment's response affects our subsequent actions.
  2. The agent finds out the effect of its actions later.
-
RL vs. Supervised Learning: Evaluation vs. Instruction
• RL:
  • Training information evaluates the action.
  • It does not say whether the action was best or correct relative to all other actions.
  • Must try all actions and compare to see which is best.
  • Procedure:
    • Trial-and-error search for actions
    • Must try all actions
    • Reward is a scalar (other actions can be better or worse)
    • Learning by selection (selectively choose the actions that prove to be better)
• Supervised Learning:
  • Training instructs.
  • It gives the correct answer regardless of the action chosen.
  • No search in the action space.
-
Agent-Environment Interface
Agent and environment interact at discrete time steps t = 0, 1, 2, ...
• Agent observes state at step t:       s_t ∈ S
• produces action at step t:            a_t ∈ A(s_t)
• gets resulting immediate reward:      r_{t+1} ∈ ℝ
• and resulting next state:             s_{t+1}

Trajectory:  ... s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3} ...
-
Reinforcement Learning
• Objective of RL:
  • Learn a policy (a mapping from states to actions) that maximizes a scalar reward/reinforcement signal.
• How to learn?
  1. Trial-and-error search:
     • Try out actions to learn which ones produce the highest rewards.
  2. Delayed effects, delayed rewards:
     • Actions affect the immediate reward, the next state, and all subsequent rewards.
• Sequence: sense states → choose actions → achieve goals.
-
Key Features of RL
• Trial-and-error search
• The environment is stochastic (uncertain)
• Reward may be delayed, so the agent may need to sacrifice short-term gains for greater long-term gains
• The agent has to balance the need to explore its environment against the need to exploit its current knowledge
`Exploration – Exploitation Tradeoff´
-
Exploration vs. Exploitation
• The learner actively interacts with the environment:
  • At the beginning the learner does not know anything about the environment.
  • It gradually gains experience and learns how to react to the environment.
• Dilemma: after some number of steps, should the learner select the best current choice (exploitation) or try to learn more about the environment (exploration)?
  • Exploitation may involve selecting a sub-optimal action and prevent learning the optimal choice.
  • Exploration may spend too much time trying currently suboptimal actions.
• Must do both.
-
RL Framework
-
Policy
• Telling the agent what to do is its POLICY.
• POLICY:
  • A mapping from states to action probabilities
  • A way of behaving (a strategy)
  • The probability of taking action `a´ in state `s´
  • Given state `s´ at time t, the policy gives the probability that the agent's action will be `a´:

    Policy at step t, π_t:   π_t(s, a) = probability that a_t = a when s_t = s

Reinforcement learning: learn the POLICY.
-
Agent Learns a Policy
• The agent detects a state, chooses an action, and gets a reward.
• The agent's aim is to learn a POLICY that maximizes the reward.
• Maximize the reward over the long term, not necessarily the immediate reward.
• For example: watch TV now and panic over the homework later, vs. do the homework now and watch TV while all your pals are panicking.
Reinforcement learning methods specify how the agent changes its policy as a result of experience.
-
Goal, Reward & Return
• Goal: maximize the total reward received.
• Immediate reward `r´ at each step.
• The agent must maximize the expected cumulative reward.
• Suppose the sequence of rewards after step t is r_{t+1}, r_{t+2}, r_{t+3}, ...
• Then the return is the total reward `R_t´.
`Maximize the Expected Return, E{R_t}, for each step t´
-
Task Types
• Episodic Tasks:
  • Interaction breaks naturally into episodes, e.g., playing a game, a trip through a maze.
  • R_t = r_{t+1} + r_{t+2} + r_{t+3} + ... + r_T
  • where `T´ is the final step at which the terminal state is reached, ending an episode/trial.
• Continuing Tasks:
  • Interaction does not break into natural episodes, meaning `T´ = ∞
  • Discounted return:

    R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^∞ γ^k r_{t+k+1},

    where 0 ≤ γ ≤ 1 is the discount rate.

  shortsighted 0 ← γ → 1 farsighted
  (the closer γ is to 1, the more future rewards count)
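The two kinds of return above can be sketched in code. This is a minimal illustration, not from the slides; the reward sequence is invented:

```python
# Sketch: computing returns for the two task types.

def episodic_return(rewards):
    """R_t = r_{t+1} + r_{t+2} + ... + r_T (undiscounted, episodic)."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k+1}, with 0 <= gamma <= 1."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0]               # rewards r_{t+1}, r_{t+2}, r_{t+3}
print(episodic_return(rewards))          # 3.0
print(discounted_return(rewards, 0.9))   # 1 + 0.9 + 0.81 = 2.71
```

With γ near 0 only the first reward matters (shortsighted); with γ near 1 the discounted return approaches the episodic sum (farsighted).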
-
Example 1
Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.
• As an episodic task, where an episode ends upon failure:
  reward = +1 for each step before failure
  ⇒ return = number of steps before failure
• As a continuing task with discounted return:
  reward = −1 upon failure; 0 otherwise
  ⇒ return = −γ^k, for k steps before failure
In either case, the return is maximized by avoiding failure for as long as possible.
-
Combined Notation
• In episodic tasks, we number the time steps of each episode starting from zero.
• We can cover all cases by writing

  R_t = Σ_{k=0}^∞ γ^k r_{t+k+1},

  where γ can be 1 only if a zero-reward absorbing state is always reached.
-
Value Functions
• Value of a state
• Value of a state-action pair (value of an action)
-
1: Value of State
• The expected return, which estimates how good it is for the agent to be in a given state.
• "How good" is defined in terms of future rewards.
• The value of a state is the expected return starting from that state; it depends on the agent's policy.

State-value function for policy π:

  V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s }

Or, without the expectation operator:

  V^π(s) = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
-
2: Value of State-Action Pair
• The value of a state-action pair under a policy estimates how good it is to perform a given action in a given state.
• "How good" is defined in terms of future rewards (expected return).
• The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π.

Action-value function for policy π:

  Q^π(s,a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a }

The optimal action-value function satisfies:

  Q*(s,a) = E{ r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a }
          = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ max_{a'} Q*(s', a') ]
-
Value of Action: Q
• Q: the value of an action
  • The expected reward from that action
  • We have only estimates of Q
  • We build these estimates from the experience of rewards
How to estimate Q?
• Now suppose we have value estimates for all actions in a given state. How to select the action? Two choices:
  1. Greedy actions, with the highest estimated Q: EXPLOITATION
  2. Other actions, with lower estimated Q's: EXPLORATION
• The agent can't exploit all the time; it must sometimes explore to see if an action that currently looks bad eventually turns out to be good (in the long run).
• What if we knew the Q values exactly? Then we would choose the action with the highest Q value, and we would have found the optimal Q.
-
1: How to Estimate Q
• Q*(a): the optimal value of action `a´
• Q_t(a): the estimated value of action `a´ at time `t´
• How to estimate Q? By a running mean:
  • Suppose we choose action `a´ k_a times and observe a reward r_i on play i.
  • Then we can estimate Q* from the running mean:
    Q_t(a) = (r_1 + r_2 + r_3 + ... + r_{k_a}) / k_a
  • If k_a = 0 (the action has not yet been tried), Q_t(a) is set to a default, e.g. 0.
  • But as k_a → ∞, Q_t(a) → Q*(a).
This is called the sample-average method for calculating Q.
-
Incremental Update Equation
• Estimate Q* by the running mean (if we have tried action `a´ k times):

  Q_k = (r_1 + r_2 + r_3 + ... + r_k) / k

• Incremental equation:

  Q_{k+1} = (1/(k+1)) Σ_{i=1}^{k+1} r_i = Q_k + (1/(k+1)) [ r_{k+1} − Q_k ]

• New Estimate = Old Estimate + Step Size [Target − Old Estimate]
• Step size (α): depends on k
  • Often kept constant, e.g. α = 0.1
  • A constant step size gives more weight to recent rewards
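The incremental equation above can be checked in a few lines. This is a minimal sketch (the reward list is invented) showing that the incremental form reproduces the running mean exactly:

```python
# Sketch: the incremental update gives the same estimate as the running mean.

def incremental_mean(rewards):
    q, k = 0.0, 0
    for r in rewards:
        k += 1
        q = q + (1.0 / k) * (r - q)   # NewEst = OldEst + StepSize*(Target - OldEst)
    return q

rewards = [1.0, 0.0, 2.0, 1.0]
print(incremental_mean(rewards))       # equals sum(rewards)/len(rewards) = 1.0
```

Replacing `1.0 / k` with a constant α (e.g. 0.1) turns this into an exponentially weighted average that favors recent rewards, as the slide notes.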
-
2: Action Selection
• Greedy:
  • Select the action a* for which Q is highest:
    Q_t(a*) = max_a Q_t(a)   `the action that maximizes the value function´
  • So, a* = arg max_a Q_t(a)   (`*´ means best, `arg´ means argument)
• Example: 10-armed bandit. Suppose at time `t´ we have actions 1 to 10 with estimates
  Q_t(a) = [0, 0.3, 0.1, 0.4, 0.2, 0, 0, 0.05, 0, 0.10]
  Then Q_t(a*) = 0.4 and a* = the 4th action.
• ε-greedy:
  • Select a random action ε of the time; otherwise select the greedy action.
  • Sample all actions infinitely many times, so that as k_a → ∞ the Q's converge to Q*.
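The ε-greedy rule can be sketched directly; the Q estimates below are the slide's 10-armed-bandit example:

```python
import random

# Sketch of epsilon-greedy selection: explore with prob. epsilon, else be greedy.
def epsilon_greedy(q, epsilon, rng=random):
    if rng.random() < epsilon:                      # explore: random action
        return rng.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])   # exploit: greedy action

q = [0, 0.3, 0.1, 0.4, 0.2, 0, 0, 0.05, 0, 0.10]
print(epsilon_greedy(q, epsilon=0.0))   # epsilon=0 is pure greedy: index 3, the 4th action
```

With ε > 0 every action keeps a nonzero selection probability, which is what guarantees the "sample all actions infinitely many times" condition.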
-
General RL Algorithm
• Initialize the agent's internal state (e.g. Q values, other statistics)
• Do for a long time (looping):
  • Observe the current world state `s´
  • Choose action `a´ using the policy (greedy or ε-greedy)
  • Execute action `a´
  • Let `r´ be the immediate reward and s´ the new world state
  • Update the internal state based on s, a, r, s´ (and the previous internal state)
After many trials, the agent will learn the optimal Q values (for state-action pairs) and hence reach the best policy, i.e., the best mapping from states to actions.
• Output a policy based on (e.g.) the learnt Q values and follow it.
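The loop above can be made concrete. This sketch instantiates it with the Q-learning update (introduced later in these slides as the off-policy TD method) on an invented 2-state toy world; the environment and all constants are hypothetical:

```python
import random
from collections import defaultdict

# Hypothetical 2-state toy world, just to make the general loop concrete.
def step(s, a):
    """Returns (reward, next_state); action 1 in state 0 earns the only reward."""
    if s == 0:
        return (1.0, 1) if a == 1 else (0.0, 0)
    return (0.0, 0)

random.seed(0)
Q = defaultdict(float)                 # internal state: Q values
alpha, gamma, epsilon = 0.1, 0.9, 0.1
actions = [0, 1]
s = 0
for _ in range(5000):                  # "do for a long time"
    if random.random() < epsilon:      # choose a with an epsilon-greedy policy
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda b: Q[(s, b)])
    r, s_next = step(s, a)             # execute a; observe r and s'
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # update internal state
    s = s_next

policy = {st: max(actions, key=lambda b: Q[(st, b)]) for st in (0, 1)}
print(policy[0])   # the learned greedy action in state 0
```

After many trials the greedy policy read off the Q table picks the rewarding action, matching the slide's claim that the loop converges to the best state-to-action mapping.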
-
RL Framework (Again)
• Task: one instance of the RL problem
• Learning: how should the agent change its policy? (the RL algorithm)
• Overall goal: maximize the amount of reward received over time
-
What RL Algorithms Do
• Continual, on-line learning
• Many RL methods can be understood as trying to solve the Bellman optimality equations in an approximate way.
-
Dynamics / Model of the Environment
• The environment provides:
  • Transition function/probability:

    P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a }   for all s, s' ∈ S, a ∈ A(s)

  • Reward function:

    R^a_{ss'} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' }   for all s, s' ∈ S, a ∈ A(s)

• If the agent knows `P´ and `R´, it has complete information about the environment.
• The agent has to DISCOVER these while exploring the world. How?
• The agent has to evaluate its policy and improve it at every instance until it reaches the OPTIMAL POLICY.
-
Policy Evaluation & Policy Improvement
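The slide's diagram is not preserved in this transcript. As a sketch of the evaluate-then-improve cycle it names, here is iterative policy evaluation plus greedy improvement on an invented deterministic 2-state MDP with known `P´ and `R´ (everything below is hypothetical):

```python
# Hypothetical deterministic 2-state, 2-action MDP with known P and R.
P = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}        # next state
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 0.0}  # expected reward
gamma, states, actions = 0.9, [0, 1], [0, 1]

def evaluate(policy, sweeps=100):
    """Iterative policy evaluation: repeatedly apply the Bellman equation for pi."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: R[(s, policy[s])] + gamma * V[P[(s, policy[s])]] for s in states}
    return V

def improve(V):
    """Greedy policy improvement with respect to V."""
    return {s: max(actions, key=lambda a: R[(s, a)] + gamma * V[P[(s, a)]])
            for s in states}

policy = {0: 0, 1: 0}
for _ in range(10):            # evaluate & improve until the policy stabilizes
    policy = improve(evaluate(policy))
print(policy)                  # {0: 1, 1: 0}: collect the reward, then return
```

Alternating these two steps is the general pattern the later slides call Generalized Policy Iteration.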
-
Model-Free Learning
• No model is available (no `P´ and `R´)
• Model-free methods:
  • Learn the optimal policy without learning a model
  • Temporal Difference (TD) learning is a model-free, bootstrapping method based on sampling the state-action space
-
Temporal Difference Prediction
• Policy evaluation (the prediction problem):
  • Trying to predict how much return we will get from being in state `s´ and following a policy π, by learning the state-value function V^π.
• TD and MC methods solve the prediction problem from experience:
  • Given some experience following a policy π, both methods update their estimate V of V^π.
• Monte Carlo update:

  V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]

  • Return R_t: the actual return from s_t to the end of the episode
  • Must wait until the end of the episode to determine the increment to V(s_t)
  • Target for the Monte Carlo update: `R_t´
-
Continued.......
• Simplest temporal-difference update, TD(0):

  V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

• Target for the TD update: an estimate of the return, r_{t+1} + γ V(s_{t+1})
• The TD method waits only until the next time step.
• Bootstrapping method: its update is based in part on existing estimates.
• Learn a guess from a guess.
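The TD(0) update above can be sketched on an invented 3-state chain (states, policy, and rewards are all hypothetical, chosen so the true values are obviously 1.0):

```python
# Sketch: TD(0) prediction of V for a fixed policy on a 3-state chain.
# States 0 -> 1 -> 2 (terminal); reward +1 on entering the terminal state.
V = [0.0, 0.0, 0.0]
alpha, gamma = 0.1, 1.0
for _ in range(2000):                    # many episodes
    s = 0
    while s != 2:
        s_next = s + 1                   # the fixed policy always moves right
        r = 1.0 if s_next == 2 else 0.0
        V[s] += alpha * (r + gamma * V[s_next] - V[s])   # TD(0) update
        s = s_next
print(V[0], V[1])    # both approach the true value 1.0
```

Note that V(0) is updated from the *estimate* V(1) before the episode ends; that is the "learn a guess from a guess" bootstrapping the slide describes, in contrast to Monte Carlo, which would wait for the actual return.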
-
Temporal Difference Learning
• TD learning: a combination of MC and DP ideas
• Like MC methods: the TD method learns directly from raw experience, without a model of the environment dynamics (P^a_{ss'}, R^a_{ss'})
• Like DP: the TD method updates estimates based in part on other learned estimates, without waiting for a final outcome (it bootstraps: estimates are self-generated)
• In brief, TD methods:
  • Do not need a model
  • Learn directly from experience
  • Update estimates of V(s_t) based on what happens after visiting state `s´
-
Advantages of TD Learning Methods
• Do not need a model of the environment, of its rewards, or of its next-state probability distributions
• On-line and incremental: FAST
• Need not wait until the end of the episode, so they need less memory and computation
• Updates are based on actual experience (r_{t+1})
• Converge to V^π(s)
  • But the step size must be decreased as learning continues
-
TD for the Control Problem
• Control methods aim to find the optimal policy/solution.
• Temporal-difference learning, TD(0), is used to compute the values for a given policy.
• We now come to TD for the control problem: learning Q values
  • Using Generalized Policy Iteration (GPI)
  • With TD methods for the evaluation/prediction part
  • Again we face the need to trade off exploration and exploitation
• Two main classes result:
  • On-policy: SARSA
  • Off-policy: Q-Learning
• We consider transitions from state-action pair to state-action pair and learn the values of the state-action pairs.
• First step: learn the action-value function Q.
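The two classes named above differ only in the backup they use. A minimal sketch of both tabular updates, with an invented single transition to show the difference:

```python
# Sketch: the two TD control updates side by side (tabular form).

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: backs up the action a_next actually selected in s_next."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy: backs up the greedy (max) action in s_next."""
    best = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
Q[(1, 0)], Q[(1, 1)] = 0.0, 2.0          # invented values in the next state
sarsa_update(Q, 0, 0, 1.0, 1, 0, alpha=0.5, gamma=0.9)            # uses Q(1,0)=0
q_learning_update(Q, 0, 1, 1.0, 1, (0, 1), alpha=0.5, gamma=0.9)  # uses max=2
print(Q[(0, 0)], Q[(0, 1)])               # 0.5 vs 1.4
```

On the same transition, SARSA's target depends on which exploratory action was actually taken next, while Q-learning always backs up the greedy value; that is the on-policy/off-policy distinction.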
-
Softmax Action Selection
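This slide's body (a formula/figure) did not survive extraction. As a sketch, the standard Gibbs/Boltzmann softmax rule with temperature τ, consistent with the preference-based softmax given later in these slides (the Q estimates below are invented):

```python
import math
import random

# Sketch of softmax (Gibbs/Boltzmann) action selection with temperature tau.
def softmax_probs(q, tau):
    exps = [math.exp(v / tau) for v in q]
    z = sum(exps)
    return [e / z for e in exps]

def softmax_select(q, tau, rng=random):
    return rng.choices(range(len(q)), weights=softmax_probs(q, tau))[0]

q = [0.1, 0.9, 0.2]
print(softmax_probs(q, tau=100.0))   # high tau: nearly equiprobable
print(softmax_probs(q, tau=0.01))    # low tau: mass concentrates on the greedy action
```

Unlike ε-greedy, softmax grades exploration by value: clearly bad actions are sampled less often than near-greedy ones.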
-
Continued......
• Effect of τ (the temperature):
  • As τ → ∞, the action probabilities become nearly equiprobable (≈ 1/n)
  • As τ → 0, the probabilities approach greedy selection
• Softmax action selection becomes the same as greedy action selection in the limit as the temperature goes to zero.
• Which one is better, softmax or ε-greedy? UNCLEAR.
-
Reinforcement Comparison
• Central theme:
  Large rewards → reoccurrence of those actions (high)
  Small rewards → reoccurrence of those actions (low)
• How to judge what is large and small (a judgment problem)?
  • Compare the reward with a reference level (reference reward)
  • Reference reward: an average of previously received rewards
• Basic idea of reinforcement comparison:
  • A large reward means higher than average
  • A small reward means lower than average
• Learning methods based on this basic idea are called reinforcement comparison methods.
-
Continued......
• Reinforcement comparison methods maintain an overall reference reward.
• To pick among the actions, they maintain a separate measure of their preference for each action:
  • p_t(a): the preference for action `a´ on play t
• The preference is used to determine action-selection probabilities by the softmax relationship:

  π_t(a) = Pr{ a_t = a } = e^{p_t(a)} / Σ_{b=1}^n e^{p_t(b)}

• After each play, the preference for the action selected on that play is updated:

  p_{t+1}(a_t) = p_t(a_t) + β [ r_t − r̄_t ]

• where β is a positive step-size parameter.
-
Continued......
• Higher rewards increase the probability of reselecting that action, while lower rewards decrease it.
• The reference reward is updated:

  r̄_{t+1} = r̄_t + α [ r_t − r̄_t ]

• where α is a step-size parameter.
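The two updates above can be combined into a small learner. This sketch runs a reinforcement-comparison agent on an invented 3-armed bandit (the payoff means, step sizes, and seed are all hypothetical):

```python
import math
import random

# Sketch: a reinforcement-comparison learner on a hypothetical 3-armed bandit.
n, alpha, beta = 3, 0.1, 0.1
p = [0.0] * n                        # action preferences p_t(a)
r_bar = 0.0                          # reference reward (running average)
true_means = [0.1, 0.8, 0.3]         # invented bandit payoffs

def softmax(prefs):
    exps = [math.exp(x) for x in prefs]
    z = sum(exps)
    return [e / z for e in exps]

random.seed(0)
for _ in range(3000):
    a = random.choices(range(n), weights=softmax(p))[0]
    r = random.gauss(true_means[a], 0.1)
    p[a] += beta * (r - r_bar)       # preference update: compare r to the reference
    r_bar += alpha * (r - r_bar)     # reference-reward update
print(max(range(n), key=lambda a: p[a]))   # the preferred arm
```

Because rewards are judged against r̄ rather than against zero, an arm paying 0.3 stops being reinforced once the reference reward rises above 0.3, which is exactly the "large means higher than average" idea.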
-
Actor-Critic Methods
• TD methods with a separate memory structure to represent the policy independently of the value function:
  1. Policy structure → the Actor: used to select actions
  2. Estimated value function → the Critic: criticizes the actions made by the actor (typically a state-value function)
-
Continued......
• An extension of the idea of reinforcement comparison methods to TD learning and to the full reinforcement problem.
• The critic critiques whatever policy is being followed by the actor.
• The criticism (a scalar signal) takes the form of the TD error; it is the sole output of the critic and drives all learning.
• Working concept:
  • After each action selection, the critic evaluates the new state to determine whether things have gone better or worse than expected. That evaluation is the TD error:

    δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)

  • V is the current value function implemented by the critic.
  • The TD error is used to evaluate the action just selected:
    • If the TD error is positive → the tendency to select a_t is strengthened for the future
    • If the TD error is negative → the tendency to select a_t is weakened for the future
-
Continued......
• Suppose actions are generated by the Gibbs softmax method:

  π_t(s, a) = Pr{ a_t = a | s_t = s } = e^{p(s,a)} / Σ_{b=1}^n e^{p(s,b)}

• p(s,a) indicates the tendency to select (the preference for) each action `a´ when in each state `s´.
• The p(s,a) are the values, at time t, of the modifiable policy parameters of the actor.
• Preference update rule (for strengthening and weakening):

  p(s_t, a_t) ← p(s_t, a_t) + β δ_t

• where β is a positive step-size parameter.
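Putting the critic's TD error and the actor's preference update together gives a complete one-step actor-critic loop. This is a sketch on an invented 2-state world (environment, constants, and seed are hypothetical):

```python
import math
import random

# Sketch: one-step actor-critic on a hypothetical 2-state world.
states, actions = (0, 1), (0, 1)
V = {s: 0.0 for s in states}                          # critic: state values
p = {(s, a): 0.0 for s in states for a in actions}    # actor: preferences p(s,a)
alpha, beta, gamma = 0.1, 0.1, 0.9

def policy(s):
    """Gibbs softmax over the actor's preferences in state s."""
    exps = {a: math.exp(p[(s, a)]) for a in actions}
    z = sum(exps.values())
    return random.choices(actions, weights=[exps[a] / z for a in actions])[0]

def step(s, a):
    """Action 1 in state 0 earns the only reward."""
    return (1.0, 1) if (s == 0 and a == 1) else (0.0, 0)

random.seed(0)
s = 0
for _ in range(5000):
    a = policy(s)
    r, s_next = step(s, a)
    delta = r + gamma * V[s_next] - V[s]   # critic's TD error: better/worse than expected?
    V[s] += alpha * delta                  # critic update
    p[(s, a)] += beta * delta              # actor: strengthen/weaken the action just taken
    s = s_next
print(p[(0, 1)] > p[(0, 0)])   # the rewarded action ends up preferred
```

Note that the actor never sees the reward directly; the scalar δ is its only learning signal, as the slide states.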
-
Significant Advantages
• Minimal computation time to select actions:
  • The policy is stored explicitly, which avoids extensive computation.
• Can learn a stochastic policy:
  • Learn the optimal probabilities of selecting the various actions.
-
Research Paper
Direct Reinforcement Adaptive Learning (DRAL) Fuzzy Logic Control for a Class of Nonlinear Systems
(Proceedings of the 12th IEEE International Symposium on Intelligent Control)
-
Paper Abstract
• Application of RL techniques to feedback control of nonlinear systems using an adaptive Fuzzy Logic System (FLS).
• Nonlinear systems are difficult to control without a good model.
• This paper presents an adaptive FLS that can handle nonlinearities through RL, with no preliminary off-line learning.
• Approach presented:
  • The FLS is indirectly told about the effects of its control action on the system performance.
  • On-line learning is based on a binary reinforcement signal from a critic, without knowing the nonlinearity appearing in the system.
`Direct Reinforcement Learning´
-
Major Modules
• Fuzzy Logic System (FLS)
• Reinforcement Learning (actor-critic method)
• FLS + RL training algorithm → adaptive FLS
• System to be controlled:
  • Nonlinear plant model, having:
    • Nonlinearity
    • Disturbances
-
FLS, RL & Adaptive FLS
• Fuzzy control: a model-free approach
  • A well-known nonlinear control technique
• Adaptive FLS = FLS + RL training algorithm
  • FLS: a collection of IF-THEN rules
  • Training algorithm: adjusts the parameters of the FLS membership functions according to input/output data
• Reinforcement learning:
  Large rewards → reoccurrence of those actions (high)
  Small rewards → reoccurrence of those actions (low)
• Extending this idea so that action selection depends on state information brings in aspects of feedback control, pattern recognition, and associative learning.
-
Continued......
• Why is RL so important here?
  • The ability to generate the correct control action in situations where it is difficult to define a priori the desired output for each input.
• Critic in the RL scheme: a scalar evaluation signal
  • Much less information than the desired outputs required in supervised learning
• Supervised learning and direct inverse control: the a priori desired output must be known.
-
Adaptive Fuzzy Logic System
Basic Configuration of Adaptive Fuzzy Logic System
-
Continued......
• Parameters to be adjusted in the adaptive FLS:
  • The control representative value of each rule `L´ (θ)
  • The parameters of the membership functions (m, σ)
• Fuzzy Basis Function (FBF): the strength of the rule
-
Continued......
• The centroid m^l_i and width σ^l_i are real-valued parameters of the Gaussian membership function for the i-th input variable and l-th rule.
• Adjustable parameters (adaptive FBF): m^l_i and σ^l_i
• Constraints: m^l_i ∈ U_i (the input's physical domain) and σ^l_i ≠ 0
• The FLS output follows.
-
Continued......
• The nonlinear function to be approximated can be represented with parameters provided by the adaptation learning algorithm.
• The adaptation learning algorithm is presented later.
-
DRAL Controller Design
Proposed DRAL Controller, illustrating overall Adaptive Learning Scheme
-
DRAL Controller Architecture
• The DRAL controller is a combination of:
  • An action-generating FLS (using the reinforcement signal as an adaptive-learning signal), and
  • A fixed-gain controller in the performance-measurement loop, which uses an error based on a given reference trajectory
• As the adaptive learning process proceeds:
  • The fixed-gain controller has less and less influence on the control action
  • The FLS gets more and more influence on the control action
• The DRAL controller is designed to learn how to generate the best control action at each time instant in the absence of complete information about the plant and the disturbances.
-
Continued......
• Critic: a binary sign function (output +1 or −1), based on a performance measure
• This evaluative signal (the binary reinforcement signal) is used for the generation of correct control actions by the FLS
• The critic works as a supervisor to the FLS and controls its learning by checking the actual performance against the performance measures
-
System Dynamics
• The plant can be represented as:
-
Continued......
• The full tracking error r(t) is not allowed to be used for tuning the overall FLS; only a reduced reinforcement signal `R´ is allowed:

  R = sign(r)

• Proposed control law (input to the plant):
-
Adaptive Learning Algorithm
• The derived adaptive learning algorithm for tuning the FLS parameters
• Design matrices determine the learning rate, and K_i > 0 is a design parameter governing the speed of convergence
-
Simulation Results
• The dynamic equation for an n-link manipulator is as follows:
• The dynamic equation for a 2-link robot manipulator is:
• Assuming M(q) is known; the nonlinearities are unknown
-
Continued......
• Simulation parameters:
  • Lengths: l1 = l2 = 1 m; masses: m1 = m2 = 1 kg
  • Desired trajectory:
    • q_d1(t) = sin(t) (rad)
    • q_d2(t) = cos(t) (rad)
  • Linear gain: Kv = diag[20 20]
• States → 4 input dimensions
  • Three Gaussian membership functions selected per input dimension
  • 81 fuzzy IF-THEN rules can be generated: rules = (M.F.s)^inputs = 3^4 = 81
• Trajectory-tracking performance was studied for 3 cases:
  • With only the fixed gain in the performance measure
  • Action-generating FLS using the DRAL algorithm
  • DRAL algorithm in the presence of mass changes
• Programming: Turbo C
-
Continued......
• Performance with only the fixed gain in the performance measure
Performance with Kv only
-
Continued......
• Performance with the action-generating FLS using the DRAL algorithm
  • Trajectory-following ability: fairly good
  • Kv = diag[20 20]
• The DRAL controller:
  • Cancels the nonlinearity in the robot system
  • The robot dynamics are unknown to the DRAL controller
-
Continued......
• Performance with the DRAL algorithm in the presence of mass changes
  • Mass m2 changes (to verify adaptation and robustness)
  • At time t1 = 5 s, the mass changes from 1.0 kg to 2.5 kg (picked up)
  • At time t2 = 12 s, the mass changes from 2.5 kg to 1.0 kg (released)
  • Same parameters
  • Fairly satisfactory performance
  • The system dynamics are unknown to the DRAL controller
-
Key Features
• No need for off-line training
• The binary reinforcement signal is used directly for adjusting the FLS parameters
• Controller reusability: the same controller works even if the dynamics change
-
Reference Books & Research Papers
• Reference books:
  • Sutton & Barto, Reinforcement Learning: An Introduction
  • "Reinforcement Learning: A Survey", Journal of Artificial Intelligence Research 4 (1996), 237-285
• Lecture notes:
  • Machine Learning course lecture slides (by Denny & Achim)
  • Ivan's lecture notes on Soft Computing & Control
• Research papers:
  • Young H. Kim & Frank L. Lewis, "Direct Reinforcement Adaptive Learning Fuzzy Logic Control"
  • Ihsan Omur, Mohamed Ali Zohdy, "Reinforcement Learning Control of Nonlinear Multilink System"
-
THANKS