
IEEE TRANSACTIONS ON CONTROL SYSTEMS TECHNOLOGY, VOL. 22, NO. 1, JANUARY 2014

Kernel-Based Approximate Dynamic Programming for Real-Time Online Learning Control:

An Experimental Study

Xin Xu, Senior Member, IEEE, Chuanqiang Lian, Lei Zuo, and Haibo He, Senior Member, IEEE

Abstract—In the past decade, there has been considerable research interest in learning control methods based on reinforcement learning (RL) and approximate dynamic programming (ADP). As an important class of function approximation techniques, kernel methods have been recently applied to improve the generalization ability of RL and ADP methods, but most previous works were only based on simulation. This paper focuses on experimental studies of real-time online learning control for nonlinear systems using kernel-based ADP methods. Specifically, the kernel-based dual heuristic programming (KDHP) method is applied and tested on real-time control systems. Two kernel-based online learning control schemes are presented for uncertain nonlinear systems by using simulation data and online sampling data, respectively. Learning control experiments were performed on a single-link inverted pendulum system as well as a double-link inverted pendulum system. From the experimental results, it is shown that both online learning control schemes, either using simulation data or using real sampling data, are effective for approximating near-optimal control policies of nonlinear dynamical systems with model uncertainties. In addition, it is demonstrated that KDHP can achieve better performance than conventional DHP, which uses multilayer perceptron neural networks.

Index Terms—Approximate dynamic programming (ADP), learning control, Markov decision processes (MDPs), online learning, reinforcement learning (RL).

Manuscript received April 25, 2012; accepted February 3, 2013. Manuscript received in final form February 10, 2013. Date of publication April 2, 2013; date of current version December 17, 2013. This work was supported in part by the National Natural Science Foundation of China under Grant 61075072, Grant 61005077, Grant 91220301, and Grant 51228701, the New-Century Excellent Talent Plan of the Ministry of Education of China under Grant NCET-10-0901, and the U.S. National Science Foundation under Grant ECCS 1053717. Recommended by Associate Editor A. Alessandri.

X. Xu, C. Lian, and L. Zuo are with the Institute of Unmanned Systems, College of Mechatronics and Automation, National University of Defense Technology, Changsha 410073, China (e-mail: [email protected]; [email protected]; [email protected]).

H. He is with the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCST.2013.2246866

I. INTRODUCTION

REINFORCEMENT learning (RL) is a machine learning framework for solving sequential decision-making problems which can be modeled using the Markov decision process (MDP) formalism. In the past decades, RL has been widely studied not only by the machine learning and neural network community but also in control theory and operations research [1]–[7]. In RL, the learning agent interacts with an initially unknown environment and modifies its action policies to maximize its cumulative payoffs [2]. Thus, RL provides a very promising framework to solve learning control problems that are difficult or even impossible for supervised learning and mathematical programming methods. In the early stages of RL research, most effort was focused on learning control algorithms for MDPs with discrete state and action spaces. But in many real-world applications, a learning controller has to deal with MDPs with large or continuous state and action spaces. In such cases, earlier RL algorithms such as Q-learning and Sarsa-learning may converge very slowly when tabular representations of value functions are used. This issue is known as the "curse of dimensionality": the exponential growth of the number of parameters to be learned with the size of any compact encoding of system state [3].

To solve the curse of dimensionality, approximate RL methods, also called approximate dynamic programming or adaptive dynamic programming (ADP), have received increasing attention in recent years. Although the concept of ADP was originally proposed in the areas of operations research and control theory, because of their similar research goals, RL and ADP have become a common interdisciplinary research area in recent years. Currently, there are three main categories of research work on approximate RL methods: value function approximation (VFA) [8], policy search [9], and actor–critic methods [10]. Among these three classes of approximate RL methods, the actor–critic algorithms, viewed as a hybrid of VFA and policy search, have been shown to be more effective than VFA or policy search in online learning control tasks with continuous spaces [11]–[13]. In an actor–critic learning control architecture, there is an actor for policy learning and a critic for VFA or policy evaluation. One pioneering work on RL algorithms using the actor–critic architecture was published in [12]. Recently, there has been much research interest in actor–critic methods for RL, where adaptive critic designs (ACDs) [11]–[17] have been widely studied as an important class of learning control methods for nonlinear dynamical systems. Generally, ACDs can be categorized into the following major groups: heuristic dynamic programming (HDP), dual heuristic programming (DHP), globalized dual heuristic programming (GDHP), and their action-dependent (AD) versions [8]. Among these ACD architectures, DHP is the most popular one and has been proven to be more efficient than HDP [18].



In the past decade, approximate RL algorithms such as ACDs have been studied in various learning control problems [18]–[21], such as aircraft control, automotive engine control, and power system control. In addition to neural networks, kernel methods have been studied to improve the generalization ability of RL algorithms. The main idea of kernel methods is to implicitly construct a nonlinear mapping in a reproducing kernel Hilbert space (RKHS) while all the computation can still be in linear forms by using the kernel trick. One of the early works in kernel RL was published in [22], where kernel-based locally weighted averaging was used to approximate the state value functions of MDPs. Recently, there have been some attempts to apply Gaussian processes (GPs) or support vector machines (SVMs) to RL problems, e.g., GPs in temporal-difference (TD) learning [23], SVMs for RL [24], and GPs in model-based policy iteration learning [25]. Xu et al. [26] proposed the kernel-based least-squares policy iteration (KLSPI) algorithm, where a kernel least-squares TD algorithm (KLSTD-Q) was used for efficient policy evaluation. The work on GPs in TD learning (GPTD) [23] has some similarities with KLSTD-Q, but GPTD defines a probabilistic generative model for the state value function by imposing a Gaussian prior over value functions and assuming a Gaussian noise model. In [24], support vector regression was applied to batch learning of the state value functions of MDPs, but only in discrete state spaces, and there were no theoretical results on the policies obtained. The GP-based policy iteration method [25] uses support points, which are usually selected by manual discretization of the state spaces, and the policy evaluation is performed using the state transition model approximated by a GP model.

Until now, kernel methods have been shown to be effective in improving the nonlinear approximation and generalization ability of RL algorithms [27], [28]. However, most of the previous results [23]–[26] were based on simulation studies, and there are few works on testing and evaluating kernel-based online learning control methods on real-world dynamical systems. In this paper, two kernel-based online learning control schemes are developed for real-time control systems by making use of simulation data and online sampling data from a real system, respectively. The online learning control schemes are evaluated with DHP and kernel DHP (KDHP), which is a kernelized version of DHP. Online learning control experiments are performed on a single-link as well as a double-link inverted pendulum system. From the experimental results, it is shown that both online learning control schemes, either using simulation data or using real sampling data, are effective for approximating near-optimal control policies of real-time dynamical systems with model uncertainties. In addition, it is demonstrated that the KDHP approach can achieve much better performance than conventional DHP with multilayer perceptron neural networks (MLPNNs), both in terms of learning efficiency and the quality of final policies.

The rest of this paper is organized as follows. In Section II, an introduction to MDPs as well as the ACD algorithms is given. The real-time online learning control schemes based on KDHP are presented in Section III, where the performance of KDHP is analyzed. In Section IV, experimental studies are presented to evaluate the performance of the online learning control schemes using KDHP. Section V draws conclusions and suggests future work.

II. SHORT OVERVIEW OF MDP AND ACDS

A. Markov Decision Processes

An MDP M is denoted as a quadruple {X, A, R, P}, where X is the state space, A is the action space, P is the state transition probability, and R is the reward function. A stochastic stationary policy π (or just stationary policy) maps states to distributions over the action space. When referring to such a policy π, we use π(a|x) to denote the probability of selecting action a in state x by π. A deterministic stationary policy directly maps states to actions, denoted as

$$a_t = \pi(x_t), \quad t \ge 0. \qquad (1)$$

When the actions $a_t$ $(t \ge 0)$ satisfy (1), policy π is followed in the MDP M. A stochastic stationary policy π is said to be followed in the MDP M if $a_t \sim \pi(a|x_t)$, $t \ge 0$.

The objective of a decision maker is to estimate the optimal policy π* satisfying

$$J_{\pi^*} = \max_{\pi} J_{\pi} = \max_{\pi} E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] \qquad (2)$$

where $0 < \gamma < 1$ is the discount factor, $r_t$ is the reward at time step t, $E_{\pi}[\cdot]$ stands for the expectation with respect to the policy π and the state transition probabilities, and $J_{\pi}$ is the expected total reward along the state trajectories obtained by following policy π. In this paper, $J_{\pi}$ is also called the performance value of policy π.

The state value function $V^{\pi}(x)$ of a policy π is the expected discounted total reward when starting from x and following the policy π thereafter:

$$V^{\pi}(x) = E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, x_0 = x\right]. \qquad (3)$$

Similarly, the state–action value function $Q^{\pi}(x, a)$ is defined as the expected discounted total reward when taking action a in state x and following policy π thereafter:

$$Q^{\pi}(x, a) = E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, x_0 = x, a_0 = a\right]. \qquad (4)$$

For an MDP, a deterministic optimal policy $\pi^*(x)$ maximizes the expected discounted total reward of state x:

$$\pi^*(x) = \arg\max_{a} Q^{\pi^*}(x, a). \qquad (5)$$
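For concreteness, the discounted return in (2)–(4) and the greedy rule in (5) can be written down directly. The following minimal Python sketch is ours, not the paper's; the reward sequence and Q-values are placeholders, and a finite action set is assumed.

```python
import numpy as np

def discounted_return(rewards, gamma=0.95):
    """Finite-horizon estimate of sum_t gamma^t * r_t, cf. (2)-(4)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def greedy_action(q_row):
    """Deterministic optimal policy of (5) over a finite action set,
    where q_row[a] approximates Q*(x, a) for the current state x."""
    return int(np.argmax(q_row))

# Example: a short reward sequence and a two-action Q row.
print(discounted_return([1.0, 0.5, 0.25]))   # 1.0 + 0.95*0.5 + 0.95**2*0.25
print(greedy_action(np.array([0.2, 1.3])))   # selects action 1
```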

B. Adaptive Critic Designs

As a popular class of actor–critic RL methods, ACDs make use of both value function approximation and policy gradient learning to search for near-optimal control policies in continuous spaces. Furthermore, the information from plant models can also be incorporated into some ACD methods such as DHP and GDHP.

In all ACDs, there is a critic and an actor in the learning control structure. However, there are some variations in the critic for different ACD methods. Different ACD structures can be distinguished from three aspects in the critic training process: the inputs of the critic, the outputs of the critic, and the requirements for a plant model in the training process of the critic. For the first aspect, the plant states are usually used as the inputs of the critic network, while in action-dependent structures the critic also receives the action outputs from the actor. For the second aspect, the critic outputs can be either the approximated value function or its derivatives. For example, in the HDP structure, the critic approximates the value function V(x(t)). To simplify notation, in the following we will use $V(x_t)$ instead of $V^{\pi}(x(t))$ and $x_t$ instead of x(t). In the DHP structure, the critic approximates $\lambda(t)$, which is the gradient of $V(x_t)$; in GDHP, it approximates both $V(x_t)$ and its gradient. For the third aspect, a plant model can be either used or unused during the learning process of the critic. Usually, to approximate the value function $V(x_t)$ alone, the plant model may not be used, e.g., in HDP. It is necessary to use the plant model to approximate the derivative of $V(x_t)$, such as in DHP.

III. REAL-TIME ONLINE LEARNING CONTROL BASED ON KDHP

In the following, a framework of KDHP is introduced by integrating kernel methods into the critic learning of DHP. Different from previous works, our focus is on the extension of KDHP to real-time online learning control tasks. The aim is to study whether kernel machines can be used effectively in real-time online learning control problems and whether better performance can be obtained when compared to previous DHPs with MLPNNs. In this section, we will present two real-time online learning control schemes using KDHP. Specifically, we will investigate the real-time implementation, online learning, and performance testing of KDHP.

A. Framework of KDHP

A framework of KDHP is shown in Fig. 1. The main components include a critic, a module for kernel-based feature construction, a reward function, an actor/controller, and a model of the plant. The module for kernel-based feature construction implements data-driven feature representation and selection in order to obtain better learning efficiency and generalization performance. The critic is used to approximate the derivatives of value functions, i.e., $\lambda_t$. The actor receives measurement data about the plant's current state $x_t$ and outputs the control $u_t$. The output of the critic is used in the training process of the actor so that policy gradients can be computed. The plant model receives the control $u_t$ and estimates the next state $x_{t+1}$. The state data are provided to both the critic and the reward function.

Mercer kernels are used in the feature construction process. A Mercer kernel is a kernel function that is positive definite, i.e., for any finite set of points $\{s_1, s_2, \ldots, s_n\}$, the kernel matrix $K = [k(s_i, s_j)]$ is positive definite. Let X be the original state space of an MDP. According to the Mercer theorem [28], there exists a Hilbert space H and a mapping φ from X to H such that

$$k(s_i, s_j) = \langle \phi(s_i), \phi(s_j) \rangle \qquad (6)$$

Fig. 1. Learning control structure of KDHP (critic, actor, plant model, reward function, and kernel-based feature construction module).

where $\langle \cdot, \cdot \rangle$ is the inner product in H. Although the dimension of H may be infinite and the nonlinear mapping φ is usually unknown, all the computation in the feature space can still be performed if it is in the form of inner products. Because of these properties of kernel functions, kernel methods have attracted much research interest for kernelizing or designing new forms of machine learning algorithms in linear spaces, so that nonlinear feature extraction or function approximation can be realized simply by selecting appropriate kernel functions.

In KDHP, the value function derivatives are approximated in the critic, and a recursive algorithm of KLSTD [26] using kernel-based feature representation is used. The value function derivative is approximated as

$$\lambda(x) = \frac{\partial V(x)}{\partial x} = \sum_{i=1}^{t} \alpha_i k(x, x_i) \qquad (7)$$

where $\alpha_i$ $(i = 1, 2, \ldots, t)$ are the weights, and $x_i$ $(i = 1, 2, \ldots, t)$ are the selected states in the sample data, i.e., trajectories generated from the MDP.

In the actor of DHP, MLPs are used to approximate the policy function

$$a_t = g(x_t, \vec{\theta}_t). \qquad (8)$$

B. Online Learning Control Based on Simulated or Real Data

In this section, we study two online learning control schemes that are based on simulated data and real sampled data, respectively. In the first scheme, a simulation model is used to realize data-driven controller design and optimization. Using the simulation model, online learning control algorithms can be performed based on the simulated data. In the simulation process, there are three main steps. The first step is to perform kernel-based feature construction using simulated samples. The second step is online learning control using KDHP. The third step is to test the final control policy's performance using the simulation model. After these three steps, the final control policies in the actor are fixed and used in real-time control of the real plant. Although there may be some differences between the simulation model and the real plant, because of the near-optimality of the final policies, the learning controller can be expected to have robustness against uncertainties and disturbances.
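The three-step procedure of this first scheme can be summarized as in the sketch below. This outline is ours, not the authors'; the five callables are placeholders for the corresponding components described in this section.

```python
def scheme1_learning_control(collect_samples, build_features, kdhp_learn, test_policy, deploy):
    """Outline of the simulation-data scheme; each argument is a user-supplied callable."""
    samples = collect_samples()            # generate simulated trajectories from the model
    features = build_features(samples)     # Step 1: ALD-based kernel feature construction
    policy = kdhp_learn(features)          # Step 2: online learning control using KDHP
    if test_policy(policy):                # Step 3: test the final policy on the simulation model
        deploy(policy)                     # fixed-weight real-time control of the real plant
    return policy
```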


Fig. 2. Flow of the KDHP algorithm.

The second online learning control scheme uses sampled data from the real plant for both feature construction and policy learning. During the policy learning process, all the iteration and computation of actor–critic learning must be performed in an online, real-time manner. After convergence of the learning controller, the final policy in the actor is fixed and tested on the plant. Because of the observation noise and disturbances in the physical system, this online learning control scheme may require more learning episodes to obtain a near-optimal policy with good performance. The experimental results in Section IV confirm this observation.

Fig. 2 shows a flowchart of the KDHP algorithm. The threshold ε for the termination condition can be selected as a small positive value. For the two learning control schemes, the main differences lie in the way the data are generated and in how the final policies are tested. Details about the online learning control schemes based on KDHP are discussed in the following.

1) Feature Construction Using Simulated/Real Samples: After the sample collection process, the kernel-based features are constructed in a data-driven way, and a sparsification method based on ALD analysis [26] is employed. Let $S_n = \{x_1, x_2, \ldots, x_n\}$ denote a set of data samples and φ be a feature mapping on the data, which can be determined by the Mercer kernel function defined in (6). A feature vector set can be obtained as $\Phi_n = \{\phi(x_1), \phi(x_2), \ldots, \phi(x_n)\}$, $\phi(x_i) \in R^{m \times 1}$, $i = 1, 2, \ldots, n$. To perform ALD analysis on the feature set, a data dictionary is defined as a subset of the feature vector set. The data dictionary D is initially empty, and ALD analysis is implemented by testing every feature vector in $\Phi_n$, one at a time. If a feature vector φ(x) cannot be approximated, within a predefined precision, by a linear combination of the feature vectors in the dictionary, it is added to the dictionary; otherwise, it is not. Thus, after the ALD analysis process, all the feature vectors of the data samples in $S_n$ can be approximately represented by linear combinations of the feature vectors in the dictionary within a given precision.

In KDHP, the ALD-based sparsification procedure mainly includes the following steps.

Step 1: Input the sample set $\{x_i\}$ $(i = 1, 2, \ldots, n)$, and initialize the dictionary as an empty set, i.e., $D_1 = \emptyset$, $t = 1$.

Step 2: Compute the optimization solution

$$\delta_t = \min_{c} \left\| \sum_{x_j \in D_t} c_j \phi(x_j) - \phi(x_t) \right\|^2. \qquad (9)$$

Because of the kernel trick, after substituting (6) into (9), we can obtain

$$\delta_t = \min_{c} \left\{ c^T K_{t-1} c - 2 c^T k_{t-1}(x_t) + k_{tt} \right\} \qquad (10)$$

where $[K_{t-1}]_{i,j} = k(x_i, x_j)$, $x_i$ $(i = 1, 2, \ldots, d(t-1))$ are the elements in the dictionary, $d(t-1)$ is the length of the data dictionary, $k_{t-1}(x_t) = [k(x_1, x_t), k(x_2, x_t), \ldots, k(x_{d(t-1)}, x_t)]^T$, $c = [c_1, c_2, \ldots, c_d]^T$, and $k_{tt} = k(x_t, x_t)$.

The optimal solution of (10) is

$$c_t = K_{t-1}^{-1} k_{t-1}(x_t) \qquad (11)$$

$$\delta_t = k_{tt} - k_{t-1}^T(x_t)\, c_t. \qquad (12)$$

Step 3: Update the data dictionary by comparing $\delta_t$ with a predefined threshold μ. If $\delta_t < \mu$, the dictionary is unchanged; otherwise, $x_t$ is added to the dictionary, i.e., $D_t = D_{t-1} \cup \{x_t\}$.

Step 4: If $t < n$, set $t = t + 1$ and go to Step 2; otherwise, end the procedure.
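A minimal NumPy sketch of this ALD procedure, following (9)–(12), is given below. The function name, interface, and default threshold are ours, not the paper's, and `kernel` can be any Mercer kernel such as (32).

```python
import numpy as np

def ald_sparsification(samples, kernel, mu=1e-3):
    """ALD-based dictionary construction (Steps 1-4): a sample x_t is added only if
    its feature vector cannot be approximated, within threshold mu, by a linear
    combination of the dictionary features (delta_t from Eqs. (10)-(12))."""
    dictionary = []                                    # Step 1: the dictionary is initially empty
    for x_t in samples:
        if not dictionary:
            delta = kernel(x_t, x_t)                   # empty dictionary: delta_t = k(x_t, x_t)
        else:
            K = np.array([[kernel(a, b) for b in dictionary] for a in dictionary])
            k_vec = np.array([kernel(a, x_t) for a in dictionary])
            c = np.linalg.solve(K, k_vec)              # Eq. (11): c_t = K_{t-1}^{-1} k_{t-1}(x_t)
            delta = kernel(x_t, x_t) - k_vec @ c       # Eq. (12): approximation error delta_t
        if delta >= mu:                                # Step 3: add x_t if it is not ALD on D
            dictionary.append(x_t)
    return dictionary

# Example with a Gaussian kernel (Eq. (32)), sigma = 1.8 as in Table I.
gauss = lambda a, b: np.exp(-np.linalg.norm(np.asarray(a) - np.asarray(b)) ** 2 / (2 * 1.8 ** 2))
D = ald_sparsification(np.random.randn(200, 4), gauss)
```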

2) Online Learning Control Using KDHP: The critic learning in KDHP is to approximate the derivatives of state value functions that satisfy the following Bellman equation:

$$\frac{\partial V(x_t)}{\partial x_t} = \frac{\partial r_t}{\partial x_t} + \gamma \frac{\partial E[V(x_{t+1})]}{\partial x_t} \qquad (13)$$

where $r_t$ is the expected reward, and $E[\cdot]$ is taken with respect to the state transition probability under a stationary policy.

Let

$$D(x_t) = \frac{\partial x_{t+1}}{\partial x_t} + \frac{\partial x_{t+1}}{\partial a_t} \frac{\partial a_t}{\partial x_t}. \qquad (14)$$

Let $\vec{\alpha}_t$ and $\beta_t$ denote the weight vector and the learning rate in the critic, respectively, and let $P_t$ denote the variance matrix computed in the weight updating process. To realize online learning in the critic, by using the kernel trick, the following update rules based on the kernel RLS-TD(0) algorithm [26] are used in the critic of KDHP:

$$\beta_{t+1} = \frac{P_t \vec{k}(x_t)}{\mu + \left(\vec{k}^T(x_t) - \gamma D(x_t)\vec{k}^T(x_{t+1})\right) P_t \vec{k}(x_t)} \qquad (15)$$

$$\vec{\alpha}_{t+1} = \vec{\alpha}_t + \beta_{t+1}\left(\frac{\partial r_t}{\partial x_t} - \left(\vec{k}^T(x_t) - \gamma D(x_t)\vec{k}^T(x_{t+1})\right)\vec{\alpha}_t\right) \qquad (16)$$

$$P_{t+1} = \frac{1}{\mu}\left[P_t - \frac{P_t \vec{k}(x_t)\left(\vec{k}^T(x_t) - \gamma D(x_t)\vec{k}^T(x_{t+1})\right) P_t}{\mu + \left(\vec{k}^T(x_t) - \gamma D(x_t)\vec{k}^T(x_{t+1})\right) P_t \vec{k}(x_t)}\right] \qquad (17)$$

where $\beta_t$ is the step size in the critic, $\mu$ $(0 < \mu \le 1)$ is the forgetting factor, $P_0 = \delta I$, δ is a positive number, and I is the identity matrix.
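For intuition, the Python sketch below implements the scalar kernel RLS-TD(0) recursion of [26] on which (15)–(17) are based; in the KDHP critic the same pattern is applied to the value gradient, with the reward replaced by $\partial r_t/\partial x_t$ and the model Jacobian $D(x_t)$ entering the temporal-difference term. The class name and interface are our assumptions, not the authors' implementation.

```python
import numpy as np

class KernelRLSTD:
    """Scalar kernel RLS-TD(0) recursion (cf. [26] and Eqs. (15)-(17))."""
    def __init__(self, dict_size, mu=1.0, delta=1.0):
        self.alpha = np.zeros(dict_size)        # kernel weights, cf. Eq. (7)
        self.P = delta * np.eye(dict_size)      # variance matrix, P_0 = delta * I
        self.mu = mu                            # forgetting factor, 0 < mu <= 1

    def update(self, k_t, k_t1, r_t, gamma=0.95):
        """One update from kernel feature vectors k(x_t), k(x_{t+1}) and reward r_t."""
        d = k_t - gamma * k_t1                                    # TD feature difference
        denom = self.mu + d @ self.P @ k_t
        beta = (self.P @ k_t) / denom                             # gain, cf. Eq. (15)
        self.alpha = self.alpha + beta * (r_t - d @ self.alpha)   # weights, cf. Eq. (16)
        self.P = (self.P - np.outer(self.P @ k_t, d @ self.P) / denom) / self.mu  # cf. Eq. (17)
        return self.alpha
```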

The actor network is used to generate the control actions based on the observed states of the plant. The output of the actor is given by (8). The learning objective of the actor is to minimize the performance index of the closed-loop system, which can be computed from the value functions of the MDP:

$$J(x) = V(x) = E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, x_0 = x\right]. \qquad (18)$$

In order to optimize the objective function, gradient-based learning techniques are commonly employed in the actor learning process. In particular, based on the outputs of the critic, the following policy gradient method can be used to train the actor:

$$\vec{\theta}_{t+1} = \vec{\theta}_t - \eta_t \Delta\vec{\theta}_t = \vec{\theta}_t - \eta_t \frac{\partial V(x_{t+1})}{\partial a_t}\frac{\partial a_t}{\partial \vec{\theta}_t} = \vec{\theta}_t - \eta_t \lambda(x_{t+1})\frac{\partial x_{t+1}}{\partial a_t}\frac{\partial a_t}{\partial \vec{\theta}_t}. \qquad (19)$$

Since $\lambda(x_{t+1})$ and $\partial x_{t+1}/\partial a_t$ can be computed by the critic and the model network, respectively, and $\partial a_t/\partial \vec{\theta}_t$ is given by (8), the above policy gradient learning can be implemented along with the critic learning.
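A dimensionally explicit sketch of one actor step of (19) is shown below; the array shapes and the function name are our assumptions (n states, m actions, p actor parameters), not a transcription from the paper.

```python
import numpy as np

def actor_update(theta, eta, lam_next, dxnext_da, da_dtheta):
    """One policy-gradient step of Eq. (19).
    theta:       actor parameters, shape (p,)
    lam_next:    lambda(x_{t+1}) from the critic, shape (n,)
    dxnext_da:   d x_{t+1} / d a_t from the plant model, shape (n, m)
    da_dtheta:   d a_t / d theta_t from the actor network (8), shape (m, p)
    """
    grad = lam_next @ dxnext_da @ da_dtheta   # dV(x_{t+1})/dtheta, shape (p,)
    return theta - eta * grad
```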

3) Performance Test of the Final Policy: To test the performance of the final policy, the policy function in the actor is fixed after learning, and it is used as a real-time controller for the plant.

Let $\vec{\theta}_T$ be the weights of the final policy in the actor; the real-time controller is determined as

$$a = g(x, \vec{\theta}_T). \qquad (20)$$

For online learning control using real sampled data, the performance of the final near-optimal policy can be evaluated directly from the experimental results. For the learning control scheme using simulated data, the performance of the final policy needs to be tested under different simulation conditions, e.g., different initial states, disturbances, etc. When the performance of the final policy is satisfactory under different simulation conditions, it is finally tested on the real plant. Otherwise, more learning episodes are needed for the learning controller using the simulation model.

C. Convergence Analysis

Actor–critic methods are online approximations to policy iteration in which the value function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient methods. In [10], a class of actor–critic algorithms with linear function approximators was proved to converge when the features for the critic span a subspace prescribed by the choice of parameterization of the actor. Recently, Bhatnagar et al. [29] proved the convergence of natural actor–critic algorithms, where the critic also uses linear function approximators. All of the above theoretical results are based on the theory of two-timescale stochastic approximation, and one of the most important conditions for convergence is that the learning update in the critic is a faster recursion than that of the actor.

In KDHP, the critic uses linear forms of value function approximators with sparse kernel features. Furthermore, RLS-TD learning is employed in the critic training process, which has the following update rule:

$$\vec{\alpha}_{t+1} = \vec{\alpha}_t + \beta_{t+1}\left(\frac{\partial R_t}{\partial x_t} - \left(\vec{k}^T(x_t) - \gamma D(x_t)\vec{k}^T(x_{t+1})\right)\vec{\alpha}_t\right) \qquad (21)$$

where $\beta_{t+1}$ is determined by (15).

In the actor of KDHP, policy gradient learning is performed based on the output from the critic, where the weight update rule has the following form:

$$\vec{\theta}_{t+1} = \vec{\theta}_t - \gamma_{t+1}\Delta\vec{\theta}_t = \vec{\theta}_t - \gamma_{t+1}\frac{\partial J(t)}{\partial \vec{\theta}_t}. \qquad (22)$$

Similar to the analysis in [17], the update rules (21) and (22) in KDHP can be modeled as a general two-timescale stochastic approximation:

$$X_{t+1} = X_t + \beta_t\left(f(X_t, Y_t) + N^1_{t+1}\right) \qquad (23)$$

$$Y_{t+1} = Y_t + \gamma_t\left(g(X_t, Y_t) + N^2_{t+1}\right) \qquad (24)$$

where f, g are Lipschitz continuous functions and $\{N^1_{t+1}\}$, $\{N^2_{t+1}\}$ are martingale difference sequences with respect to the σ-fields $\mathcal{F}_t$, satisfying

$$E\left[\|N^i_{t+1}\|^2 \mid \mathcal{F}_t\right] \le D_1\left(1 + \|X_t\|^2 + \|Y_t\|^2\right), \quad i = 1, 2, \; t \ge 0 \qquad (25)$$

for some constant $D_1 < \infty$.

It is assumed that the step sizes $\beta_t$, $\gamma_t$ are deterministic and nonincreasing, and satisfy

$$\sum_t \beta_t = \sum_t \gamma_t = \infty, \qquad \sum_t \beta_t^2 < \infty, \qquad \sum_t \gamma_t^2 < \infty \qquad (26)$$

$$\sum_t \left(\frac{\gamma_t}{\beta_t}\right)^d < \infty \qquad (27)$$

where d is a positive constant.
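As an illustrative example (ours, not from the paper), deterministic schedules such as $\beta_t = t^{-0.6}$ and $\gamma_t = t^{-0.9}$ satisfy these conditions:

$$\sum_t t^{-0.6} = \sum_t t^{-0.9} = \infty, \qquad \sum_t t^{-1.2} < \infty, \qquad \sum_t t^{-1.8} < \infty, \qquad \sum_t \left(\frac{\gamma_t}{\beta_t}\right)^d = \sum_t t^{-0.3d} < \infty \ \text{ for any } d \ge 4,$$

so that the actor step size $\gamma_t$ decays faster than the critic step size $\beta_t$.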

In KDHP, the update rules in the critic use recursive least-squares methods, and the step sizes are adaptively determined by the online computation rule (15). When the step sizes satisfy (26) and (27), the update in the critic is a faster recursion than the update in the actor, and the weights in the critic have uniformly higher increments compared with the weights in the actor.

In [17], it was proved that, when (26) and (27) are satisfied, a class of actor–critic algorithms converges almost surely to a small neighborhood of a local minimum of the averaged reward J. This result provides the theoretical basis for the convergence analysis of actor–critic algorithms. In KDHP, by making use of kernel-based linear features and the RLS-TD learning algorithm in the critic, the update in the critic can be a faster recursion than that in the actor. Thus, it is more beneficial for ensuring the convergence of the online learning process. In the following section, extensive performance tests and comparisons on the real-time control of single-link and double-link inverted pendulums are conducted, and it is shown that KDHP has better performance than DHP in terms of both convergence speed and the quality of the final policies.

IV. EXPERIMENTAL STUDIES

In this section, we use two real-time online learning control problems to evaluate and compare the performance of KDHP and previous DHP algorithms. One problem is the balancing control of a single-link inverted pendulum system, and the other is the more difficult problem of controlling a double-link inverted pendulum. The inverted pendulum problem has been widely studied as a benchmark control problem with nonlinearity and instability. The controller design for inverted pendulums becomes more difficult when there are model uncertainties and unknown disturbances in the plant dynamics. In recent years, various intelligent control methods have been developed for real-time control of inverted pendulums [30]–[32], but there are still some difficulties, such as how to improve robustness under model uncertainties, the absence of online learning ability, and so on. In the following, KDHP is applied based on the two online learning control schemes, and it is shown that KDHP not only has good real-time control performance but can also obtain good near-optimal policies after learning, which are superior to previous DHP approaches with MLPNNs.

A. Online Learning Control Using Simulated Data—The Case for a Single-Link Inverted Pendulum

The dynamics of a single-link inverted pendulum can be described by the following equations:

$$\begin{cases} (M + m)\ddot{x} + b\dot{x} - ml\ddot{\theta}\cos\theta + ml\dot{\theta}^2\sin\theta = F \\ (I + ml^2)\ddot{\theta} + mgl\sin\theta = ml\ddot{x}\cos\theta \end{cases} \qquad (28)$$

where g is the acceleration due to gravity, which is 9.8 m/s², the mass of the cart is M = 1.096 kg, the mass of the pole is m = 0.109 kg, the half-pole length is l = 0.25 m, b is the friction coefficient of the cart on the track, and I is the inertia of the pole on the cart.

In the learning control problem of the single-link inverted pendulum, the state space is spanned by four continuous state variables, i.e., $x$, $\dot{x}$, $\theta$, and $\dot{\theta}$, where $x$ and $\dot{x}$ are the cart position and velocity, and $\theta$ and $\dot{\theta}$ are the pole angle and angular velocity, respectively. The action is the expected acceleration of the cart, which is a continuous variable. The reward function is defined as follows:

$$r(t) = 0.25\left(\theta(t) - \theta_d(t)\right)^2 + 0.25\left(x(t) - x_d(t)\right)^2 \qquad (29)$$

where $\theta_d(t)$ is the expected angle of the pole and $x_d(t)$ is the expected position of the cart, both of which are zero in the experiments. The discount factor γ is 0.95.

In DHP, three-layer neural networks are used to construct the action module and the critic module, which are called the action neural network (ANN) and the critic neural network (CNN), respectively. The structures of ANN and CNN are determined by the number of nodes in each layer. For ANN, the numbers of nodes in the input layer, hidden layer, and output layer are set as 4-5-1, while for CNN, they are set as 4-5-4. Both ANN and CNN have the same transfer functions: from the input layer to the hidden layer, the transfer function is $f(x) = (1 + e^{-x})^{-1}$, and from the hidden layer to the output layer, the transfer function is $L(x) = kx$. Both ANN and CNN have the same learning rate α = 0.3. The weights of ANN and CNN are randomly initialized between −0.5 and 0.5. In the simulation, the initial state vector $[x, \dot{x}, \theta, \dot{\theta}]$ is randomly initialized within the intervals [−0.1 m, 0.1 m] for $x$, [−0.05 m/s, 0.05 m/s] for $\dot{x}$, [−2°, 2°] for $\theta$, and [−2°/s, 2°/s] for $\dot{\theta}$.

For the RLS-TD learning algorithm, the forgetting factor is set to 1.0 and λ is equal to 0.6. The time step is 0.02 s in both simulation and experiments. A learning trial starts from an initial position and ends after 10 000 time steps or when the controller fails. If the pole can be stabilized at the expected position for 10 000 time steps, the controller is regarded as successful. One run of the learning control process consists of at most 200 trials. The initial conditions are set independently among different runs. If a successful controller is obtained in one run, this run ends and a new run starts. For the learning control algorithms, the performance is evaluated based on the averaged results of 100 independent runs.

In the KDHP algorithm, different types of kernel functions may influence the performance of the learning controller. In the simulation, four types of kernel functions, namely the linear kernel, polynomial kernel, Gaussian kernel, and sigmoid kernel, are used for performance evaluation. The four types of kernel functions are given by (30)–(33), respectively. From the simulation and experimental results, it is shown that KDHP with linear and polynomial kernel functions cannot obtain satisfactory performance. The main reason is that linear and polynomial kernel functions have only limited approximation ability, as shown in [33]. The Gaussian kernel and the sigmoid kernel exhibit good performance with appropriate parameters:

$$k(x_i, x_j) = x_i \cdot x_j \qquad (30)$$

$$k(x_i, x_j) = (1 + x_i \cdot x_j)^d \qquad (31)$$

$$k(x_i, x_j) = \exp\left(-\left\|x_i - x_j\right\|^2 / 2\sigma^2\right) \qquad (32)$$

$$k(x_i, x_j) = \tanh(k \cdot x_i \cdot x_j + \theta). \qquad (33)$$
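The four kernels (30)–(33) translate directly into code; the sketch below is ours, with default parameters taken from the best-performing settings reported in Tables I and II (σ = 1.8; k = 2, θ = 0.5).

```python
import numpy as np

def linear_kernel(xi, xj):                      # Eq. (30)
    return float(np.dot(xi, xj))

def polynomial_kernel(xi, xj, d=2):             # Eq. (31); the degree d is a free parameter
    return float((1.0 + np.dot(xi, xj)) ** d)

def gaussian_kernel(xi, xj, sigma=1.8):         # Eq. (32); sigma = 1.8 was best in Table I
    diff = np.asarray(xi) - np.asarray(xj)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def sigmoid_kernel(xi, xj, k=2.0, theta=0.5):   # Eq. (33); k = 2, theta = 0.5 was best in Table II
    return float(np.tanh(k * np.dot(xi, xj) + theta))
```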


Fig. 3. Learning control process of DHP for the single-link inverted pendulum system using simulation data (pole angle versus time step for the 1st, 3rd, 8th, 14th, 19th, and 24th trials).

TABLE I
AVERAGE LEARNING CONTROL PERFORMANCE EMPLOYING GAUSSIAN KERNEL FUNCTION WITH DIFFERENT PARAMETERS

Parameter   Average Trials   Success Rate (%)   J
σ = 2.7     37.1             12                 12.745
σ = 2.4     25.7             41                 7.397
σ = 2.1     22.4             75                 5.214
σ = 1.8     15.9             93                 4.548
σ = 1.5     17.1             90                 4.801
σ = 1.2     17.8             84                 5.069
σ = 0.9     19.5             86                 5.281

TABLE II
AVERAGE LEARNING CONTROL PERFORMANCE EMPLOYING SIGMOID KERNEL FUNCTION WITH DIFFERENT PARAMETERS

Parameters          Average Trials   Success Rate (%)   J
k = 0.5, θ = 0.5    62.4             22                 11.483
k = 0.5, θ = 1      55.7             34                 9.827
k = 1, θ = 0.5      36.1             52                 7.688
k = 1, θ = 1        45.5             45                 7.894
k = 2, θ = 0.5      19.3             82                 5.104
k = 2, θ = 1.5      23.9             61                 6.217
k = 2, θ = 2        31.3             47                 7.300

Tables I and II show the average learning control performance over 100 independent runs when Gaussian and sigmoid kernel functions are employed with different parameters, respectively. It is shown that the choice of the parameters is critical to the proposed algorithm, and the best control performance is obtained when employing the Gaussian kernel function with σ = 1.8.

Figs. 3 and 4 show typical learning processes of DHP and KDHP using simulation data, respectively.

Fig. 4. Learning control process of KDHP for the single-link inverted pendulum system using simulation data (pole angle versus time step for the 1st, 3rd, 7th, 11th, 15th, and 18th trials).

Fig. 5. Real-time responses of the pole angle and the cart position using the control policies obtained by DHP, KDHP, and LQR.

TABLE III
AVERAGE LEARNING CONTROL PERFORMANCE IN 100 RUNS (WITH SIMULATION DATA)

        Min. No. of Trials   Max. No. of Trials   Average Trials   Success Rate (%)
DHP     10                   44                   25.3             67
KDHP    5                    29                   15.9             93

TABLE IV
AVERAGE LEARNING CONTROL PERFORMANCE IN 10 RUNS

        Min. No. of Trials   Max. No. of Trials   Average Trials   Success Rate (%)
DHP     46                   82                   61.8             40
KDHP    31                   68                   44.3             70

It is shown that, during the learning process, the control policies are improved: the DHP algorithm needs 24 trials to balance the pole, while the KDHP algorithm needs only 18 trials and takes less time to keep the pole stabilized around the equilibrium. Table III shows the performance comparison of KDHP and DHP in terms of the average learning control performance in 100 runs. It is clear that the performance of KDHP is much better than that of DHP. Thus, KDHP has faster convergence and can obtain better control policies than DHP.


Fig. 6. Pole angle variation using DHP with real sampled data for the inverted pendulum system (1st, 4th, 13th, 27th, 46th, and 58th trials).

TABLE V
AVERAGE LEARNING CONTROL PERFORMANCE IN 100 RUNS

        Min. No. of Trials   Max. No. of Trials   Average Trials   Success Rate (%)
DHP     56                   143                  95.2             31
KDHP    29                   115                  67.5             53

To test the performance of the different control policies obtained from simulated data, the real-time control system of a single-link pendulum, developed by Googol Technology Ltd., was used for experimental studies. Due to the unmodeled dynamics and disturbances in real systems, the control policies obtained from simulation data may not be near-optimal, because the weights of ANN and CNN are still being updated in the last successful learning trial in simulation. Thus, the control policies need to be tested with fixed weights in the simulation until a satisfactory control policy is found. In our experiments, after the above performance tests, the control policies obtained from the simulation could always control the real pendulum system with good performance.

Fig. 5 shows that, in the real-time control experiments, the proposed KDHP algorithm takes a shorter time than DHP to balance the pole near the upright position. Moreover, the pole angle can be stabilized within a smaller region around the upright position and the angle variations are minimized. For the DHP algorithm, although the pole can be balanced, it takes a longer time to bring the pole near the upright position and there are larger oscillations in the pole angle.

Fig. 5 also depicts the state variations when a linear quadratic regulator (LQR) is applied to the real pendulum system. The LQR is an optimal controller based on the linear simulation model. It is shown that the performance of KDHP is close to that of LQR. For LQR, due to the observation noise and unknown disturbances in the real system, there are small oscillations in the pole angle. For KDHP, due to the generalization ability of kernel machines, the angle oscillations are smaller.

B. Online Learning Control Using Real Sampled Data—The Case for a Single-Link Inverted Pendulum

The above simulation and experimental results show that, when simulated data are used, the near-optimal policies obtained by DHP and KDHP can both be used to control the real system even if there are some unmodeled dynamics and unknown disturbances in the physical system. It is illustrated that KDHP has better performance than DHP in terms of higher convergence speed and improved quality of final policies. However, the difference between the simulation model and the real system cannot be too large. For many complex systems, it is difficult to obtain a good simulation model that is close to the real system. Therefore, online learning control using real sampled data is necessary in such cases. In the following, experimental results are given for both DHP and KDHP in online learning control with sampled data from the real single-link inverted pendulum system.

Fig. 7. Pole angle variation using KDHP with real sampled data for the inverted pendulum system (1st, 3rd, 10th, 18th, 27th, and 35th trials).

In the online learning control experiments, the settings of the parameters, such as those of the actor module and the critic module, are the same as in the simulation. In each trial, the cart position and pole angle are initialized around the upright position. If a control policy cannot balance the pole or the cart exceeds the boundary, the trial is unsuccessful. The cart position and pole angle are then reinitialized and the next trial is started, until the controller can successfully balance the pole or the trial number reaches 200, which is regarded as an unsuccessful run. After 200 trials, all parameters are reinitialized and a new run starts.

In the online learning control experiments using real sampled data, learning control performance was evaluated based on 10 independent runs. The averaged results are shown in Table IV. Because of the disturbances and noise in the real system, it takes more trials to learn a successful control policy for both DHP and KDHP. Nevertheless, it is shown that KDHP still exhibits much better performance than DHP. In the experiments based on simulation data, there is no noise in the state observations and control actions, and the control process can be modeled as a deterministic MDP, so the success rate of KDHP is 93%. However, in the experiments on real sampled data, due to the properties of physical sensors and actuators, there is unknown noise in both the state observations and the control actions. The control process can then be modeled as a stochastic MDP, which makes it harder for a learning controller to obtain a near-optimal control policy. Therefore, the success rate of KDHP declines to 70%.

Fig. 8. A double-link inverted pendulum system.

Fig. 9. State variations of the double-link inverted pendulum system (simulation test): pole-1 angle, pole-2 angle, and cart position for DHP and KDHP.

Fig. 10. State variations of the double-link inverted pendulum system (experimental test): pole-1 angle, pole-2 angle, and cart position for DHP, KDHP, and LQR.

Figs. 6 and 7 show typical learning processes of the two algorithms. It is shown that the performance of both controllers can be gradually improved during the learning process, and the balancing time of the pole becomes longer and longer. In DHP, the pole angle was stabilized around the upright position after 58 trials, while in the KDHP algorithm only 39 trials were needed. Furthermore, in KDHP, the pole angle can be stabilized within a smaller region around the upright position, which shows a better control performance.


C. Learning Control Experiments on the Double-Link Inverted Pendulum

As shown in Fig. 8, the double-link inverted pendulum system, developed by Googol Technology Ltd., was used in the experiments. The dynamics model of the double-link inverted pendulum can be described by the following equation:

$$\begin{bmatrix} \dot{x} \\ \dot{\theta}_1 \\ \dot{\theta}_2 \\ \ddot{x} \\ \ddot{\theta}_1 \\ \ddot{\theta}_2 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 86.69 & -21.62 & 0 & 0 & 0 \\ 0 & -40.31 & 39.45 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} x \\ \theta_1 \\ \theta_2 \\ \dot{x} \\ \dot{\theta}_1 \\ \dot{\theta}_2 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \\ 6.64 \\ -0.088 \end{bmatrix} u. \qquad (34)$$
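For reference, the linear model (34) can be stepped forward in discrete time (0.02 s, as in the single-link experiments) with a simple Euler scheme. The sketch below is ours and only illustrates how simulated training data could be generated from (34).

```python
import numpy as np

# System matrices of the linearized double-link model, Eq. (34).
A = np.array([
    [0, 0,       0,      1, 0, 0],
    [0, 0,       0,      0, 1, 0],
    [0, 0,       0,      0, 0, 1],
    [0, 0,       0,      0, 0, 0],
    [0, 86.69,  -21.62,  0, 0, 0],
    [0, -40.31,  39.45,  0, 0, 0],
])
B = np.array([0, 0, 0, 1, 6.64, -0.088])

def step(state, u, dt=0.02):
    """Euler integration of x_dot = A x + B u over one control period."""
    return state + dt * (A @ state + B * u)

# Example: one step from a small initial perturbation with zero control.
x0 = np.array([0.0, 0.02, -0.01, 0.0, 0.0, 0.0])   # [x, th1, th2, xdot, th1dot, th2dot]
print(step(x0, u=0.0))
```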

In the learning control of the double-link inverted pendulum system, the state space is spanned by six continuous state variables, i.e., $x$, $\dot{x}$, $\theta_1$, $\dot{\theta}_1$, $\theta_2$, and $\dot{\theta}_2$, where $x$ and $\dot{x}$ are the cart position and velocity, $\theta_1$ and $\dot{\theta}_1$ are the pole-1 angle and angular velocity, and $\theta_2$ and $\dot{\theta}_2$ are the pole-2 angle and angular velocity, respectively. The action is the acceleration of the cart, which is a continuous variable. The reward function at time step t is defined as follows:

$$r(t) = \left[\left(\theta_1(t) - \theta_{1d}(t)\right)^2 + \left(\theta_2(t) - \theta_{2d}(t)\right)^2 + \left(x(t) - x_d(t)\right)^2\right]/4$$

where $\theta_{1d}(t)$ and $\theta_{2d}(t)$ are the expected angles of pole-1 and pole-2, respectively, and $x_d(t)$ is the expected position of the cart. The structures of ANN and CNN are set as 6-5-1 and 6-5-6, respectively, and the other parameters are the same as in the previous experiments.

Simulation data based on the simplified model in (34) were used to train the learning controllers based on DHP or KDHP. Learning control performance was evaluated based on 10 independent runs, and the averaged results are shown in Table V. Because of the increased complexity of the system, both learning controllers need more trials to balance the two poles. It is also shown that KDHP performs better than DHP. The success rate of KDHP in the double-link inverted pendulum system is 53%, since the learning control problem becomes more difficult due to the increased state dimension and the more complex dynamics of this system.

Before the real-time control experiments, the control policies obtained from simulation still need to be tested with fixed weights until stable control policies are found. However, in some cases, even if a control policy is stable in the simulation, it may still be unstable in the experiment. The reason may be that there are larger modeling uncertainties and disturbances in the double-link inverted pendulum system.

Fig. 9 shows the online control results of the two algorithms in simulation tests. It is shown that the control policy obtained by KDHP takes a shorter time to balance the double-link inverted pendulum within a very small region around the upright position. In the experiment, as shown in Fig. 10, both algorithms can balance the poles in a region around the upright position, but there are larger oscillations of the pole angles. The first reason is that, in the simulation, the friction between the cart and the track and the friction in the joints of the two poles are ignored. The second is that, since only a simplified linear model is used in the simulation, there are more uncertainties and parameter errors.

From Fig. 10, it is demonstrated that, although there are oscillations of the pole angles for both algorithms, the oscillations are much smaller with KDHP than with DHP, which means better control performance.

Fig. 10 also shows the performance of an optimal controller based on LQR. Compared with DHP and KDHP, LQR exhibits better control performance, since DHP and KDHP can only obtain near-optimal solutions while LQR calculates the optimal solution exactly. However, almost all optimal control methods, including LQR, need accurate model information, while DHP and KDHP do not, which is a clear advantage.

V. CONCLUSION

This paper investigated real-time online learning control methods using KDHP. Two kernel-based online learning control schemes were presented for real-time control systems by using simulation data and online sampling data, respectively. Learning control experiments were performed on a single-link as well as a double-link inverted pendulum system. From the experimental results, it was shown that both online learning control schemes, either using simulation data or using real sampling data, are effective for approximating near-optimal control policies of real-time dynamical systems with model uncertainties. In addition, it was demonstrated that the KDHP approach can achieve much better performance than conventional DHP with MLPNNs. Future work may include the design of stable control strategies combined with the online learning method based on KDHP, so that the closed-loop system can have stable performance during the online learning control process. Since DHP has been applied in various nonlinear systems such as aircraft control, automotive engine control, and power system control [17]–[19], KDHP can also be applied in many other nonlinear control systems with better learning efficiency and generalization performance.

REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.

[2] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," J. Artif. Intell. Res., vol. 4, pp. 237–285, May 1996.

[3] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality. New York, NY, USA: Wiley, 2007.

[4] A. Alessandri, M. Gaggero, and R. Zoppoli, "Feedback optimal control of distributed parameter systems by using finite-dimensional approximation schemes," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 6, pp. 984–996, Jun. 2012.

[5] F. Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: An introduction," IEEE Comput. Intell. Mag., vol. 4, no. 2, pp. 39–47, May 2009.

[6] M. A. Wiering and H. van Hasselt, "Ensemble algorithms in reinforcement learning," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 930–936, Aug. 2008.

[7] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits Syst. Mag., vol. 9, no. 3, pp. 32–50, Jul.–Sep. 2009.

[8] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora, "Fast gradient-descent methods for temporal-difference learning with linear function approximation," in Proc. Int. Conf. Mach. Learn., 2009, pp. 993–1000.

[9] J. Baxter and P. L. Bartlett, “Infinite-horizon policy-gradient estimation,” J. Artif. Intell. Res., vol. 15, pp. 319–350, Nov. 2001.

[10] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2000.

[11] D. V. Prokhorov and D. C. Wunsch, “Adaptive critic designs,” IEEE Trans. Neural Netw., vol. 8, no. 5, pp. 997–1007, Sep. 1997.

[12] S. Bhasin, N. Sharma, P. Patre, and W. E. Dixon, “Asymptotic tracking by a reinforcement learning-based adaptive critic controller,” J. Control Theory Appl., vol. 9, no. 3, pp. 400–409, Aug. 2011.

[13] Z. Sun, “Adaptive critic design for energy minimization of portable video communication devices,” IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 1, pp. 27–37, Jun. 2010.

[14] J.-W. Sheu and W.-S. Lin, “Adaptive optimal control for designing automatic train regulation for metro line,” IEEE Trans. Control Syst. Technol., vol. 20, no. 5, pp. 1319–1327, Sep. 2012.

[15] H. Zhang, L. Cui, X. Zhang, and Y. Luo, “Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method,” IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 2226–2236, Dec. 2011.

[16] J. Peters and S. Schaal, “Natural actor-critic,” Neurocomputing, vol. 71, nos. 7–9, pp. 1180–1190, Mar. 2008.

[17] G. K. Venayagamoorthy, R. G. Harley, and D. C. Wunsch, “Comparison of heuristic dynamic programming and dual heuristic programming adaptive critics for neurocontrol of a turbogenerator,” IEEE Trans. Neural Netw., vol. 13, no. 3, pp. 764–773, May 2002.

[18] K. G. Vamvoudakis and F. L. Lewis, “Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem,” Automatica, vol. 46, no. 5, pp. 878–888, May 2010.

[19] R. Enns and J. Si, “Helicopter trimming and tracking control using direct neural dynamic programming,” IEEE Trans. Neural Netw., vol. 14, no. 4, pp. 929–939, Jul. 2003.

[20] C. Lu, J. Si, and X. Xie, “Direct heuristic dynamic programming for damping oscillations in a large power system,” IEEE Trans. Syst. Man Cybern. B, Cybern., vol. 38, no. 4, pp. 1008–1013, Aug. 2008.

[21] P. Shih, B. C. Kaul, S. Jagannathan, and J. A. Drallmeier, “Reinforcement-learning-based dual-control methodology for complex nonlinear discrete-time systems with application to spark engine EGR operation,” IEEE Trans. Neural Netw., vol. 19, no. 8, pp. 1369–1388, Aug. 2008.

[22] D. Ormoneit and S. Sen, “Kernel-based reinforcement learning,” Mach. Learn., vol. 49, nos. 2–3, pp. 161–178, Nov. 2002.

[23] Y. Engel, S. Mannor, and R. Meir, “Bayes meets Bellman: The Gaussian process approach to temporal difference learning,” in Proc. 20th Int. Conf. Mach. Learn., 2003, pp. 154–161.

[24] T. G. Dietterich and X. Wang, “Batch value function approximation via support vectors,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2002.

[25] C. E. Rasmussen and M. Kuss, “Gaussian processes in reinforcement learning,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2004.

[26] X. Xu, D. Hu, and X. Lu, “Kernel-based least squares policy iteration for reinforcement learning,” IEEE Trans. Neural Netw., vol. 19, no. 4, pp. 973–992, Jul. 2007.

[27] V. Vapnik, Statistical Learning Theory. New York, NY, USA: Wiley, 1998.

[28] B. Schölkopf and A. Smola, Learning with Kernels. Cambridge, MA, USA: MIT Press, 2002.

[29] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, “Natural actor-critic algorithms,” Automatica, vol. 45, no. 11, pp. 2471–2482, Nov. 2009.

[30] C. H. Chiu, “The design and implementation of a wheeled inverted pendulum using an adaptive output recurrent cerebellar model articulation controller,” IEEE Trans. Ind. Electron., vol. 57, no. 5, pp. 1814–1822, May 2010.

[31] A. Casavola, E. Mosca, and M. Papini, “Control under constraints: An application of the command governor approach to an inverted pendulum,” IEEE Trans. Control Syst. Technol., vol. 12, no. 1, pp. 193–204, Jan. 2004.

[32] M. I. El-Hawwary, A. L. Elshafei, H. M. Emara, and H. A. A. Fattah, “Adaptive fuzzy control of the inverted pendulum problem,” IEEE Trans. Control Syst. Technol., vol. 14, no. 6, pp. 1135–1144, Nov. 2006.

[33] C. Cervellera, M. Gaggero, and D. Maccio, “Efficient kernel models for learning and approximate minimization problems,” Neurocomputing, vol. 97, pp. 74–85, Nov. 2012.

Xin Xu (M’07–SM’12) received the B.S. degree in electrical engineering from the Department of Automatic Control, National University of Defense Technology (NUDT), Changsha, China, in 1996, where he received the Ph.D. degree in control science and engineering from the College of Mechatronics and Automation in 2002.

He is currently a Full Professor with the College of Mechatronics and Automation, NUDT. He has authored or co-authored more than 100 papers in international journals and conferences, and co-authored four books. His current research interests include reinforcement learning, approximate dynamic programming, machine learning, robotics, and autonomous vehicles.

Dr. Xu was a recipient of the 2nd Class National Natural Science Award of China in 2012. He is currently an Associate Editor of Information Sciences and a Guest Editor of the International Journal of Adaptive Control and Signal Processing. He is a Committee Member of the IEEE TC on Approximate Dynamic Programming and Reinforcement Learning and the IEEE TC on Robot Learning.

Chuanqiang Lian received the Bachelor’s degree from the Department of Automation, Qinghua University, Beijing, China, in 2008, and the Master’s degree from the College of Mechatronics and Automation, National University of Defense Technology, Changsha, China, in 2010, where he is currently pursuing the Ph.D. degree with the Institute of Unmanned Systems.

He has authored or co-authored more than 10 papers in international journals and conferences. His current research interests include reinforcement learning, approximate dynamic programming, and autonomous vehicles.

Lei Zuo received the Bachelor’s and Master’s degrees from the College of Mechatronics and Automation, National University of Defense Technology, Changsha, China, in 2009 and 2011, respectively, where he is currently pursuing the Ph.D. degree with the Institute of Unmanned Systems.

He has authored or co-authored more than 10 papers in international journals and conferences. His current research interests include reinforcement learning, approximate dynamic programming, and autonomous vehicles.

Haibo He (SM’11) received the B.S. and M.S. degrees from the Huazhong University of Science and Technology, Wuhan, China, in 1999 and 2002, respectively, and the Ph.D. degree from Ohio University, Athens, OH, USA, in 2006, all in electrical engineering.

He is currently an Associate Professor with the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI, USA. From 2006 to 2009, he was an Assistant Professor with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ, USA. His current research interests include adaptive dynamic programming, machine learning, computational intelligence, hardware design for machine intelligence, and various applications such as smart grid. He has authored or co-authored over 100 papers in peer-reviewed journals and conferences, one research book (Wiley), and edited six conference proceedings (Springer). His research has been covered by numerous media such as the IEEE Smart Grid Newsletter, The Wall Street Journal, and Providence Business News.

Dr. He was a recipient of the National Science Foundation CAREER Award in 2011 and the Providence Business News Rising Star Innovator Award in 2011. He is currently an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS and the IEEE TRANSACTIONS ON SMART GRID.