
JOTA manuscript No. (will be inserted by the editor)

Game-Theory-Based Consensus Learning of Double-Integrator Agents in the Presence of Worst-Case Adversaries

Kyriakos G. Vamvoudakis · João P. Hespanha

Received: date / Accepted: date

Abstract This work proposes a game-theory-based technique for guaranteeing consensus in unreliable networks by satisfying local objectives. This multi-agent problem is addressed under a distributed framework, in which every agent has to find the best controller against a worst-case adversary so that agreement is reached amongst the agents in the networked team. The construction of such controllers requires the solution of a system of coupled partial differential equations, which is typically not feasible. The algorithm proposed uses instead three approximators for each agent: one to approximate the value function, one to approximate the control law, and a third one to approximate a worst-case adversary. The tuning laws for every controller and adversary are driven by their neighboring controllers and adversaries, respectively, and neither the controller nor the adversary knows the other's policy. A Lyapunov stability proof ensures that all the signals remain bounded and consensus is asymptotically reached. Simulation results are provided to demonstrate the efficacy of the proposed approach.

Keywords Game-Theory · Consensus · Hamilton-Jacobi Equations · Optimization · Security

K. G. Vamvoudakis, Corresponding author
Kevin T. Crofton Department of Aerospace and Ocean Engineering
Virginia Tech
Blacksburg, VA
[email protected]

J. P. Hespanha
Department of Electrical and Computer Engineering
University of California
Santa Barbara, CA
[email protected]


1 Introduction

In recent years, complex dynamic systems consisting of interacting subsystems have received significant attention. Such systems arise, e.g., in vehicle formation and maneuvering, power systems [44], altitude alignment [7], the rendezvous problem [28], flocking [31], and consensus seeking and agreement [33]. While large-scale distributed systems are especially vulnerable to attacks and intrusions, traditional consensus algorithms [23, 38, 45] typically exhibit large fragilities with respect to attacks. There is thus a need for architectures that are resilient against attacks.

Machine learning is an attractive approach to achieve optimal behavior when classical optimization techniques are infeasible, and its importance in real applications has been demonstrated, e.g., in [1], which applies apprenticeship-learning algorithms to learn control policies for helicopters flying a very wide range of highly aerobatic maneuvers with a performance close to that of an expert human pilot. However, the application of machine learning techniques in a distributed multi-agent setting is complicated by the fact that all agents will likely be learning and adapting simultaneously, which may prevent the learning process from converging. Game theory provides the appropriate framework to study learning and autonomy in a distributed setting, as recognized in [41, 48].

Reinforcement learning [43] is useful when agents take a sequence of actions based on information or rewards received from the environment. This technique is inspired by dynamic programming and lays the foundation for developing algorithms that update expected utilities and use them to explore the system's state space. Actor-critic frameworks [50, 52] are based on reinforcement learning methods and use an actor component that generates actions and a critic component that assesses the costs of these actions. Based on this assessment, the actor policy is updated at each learning step [9], [43]. Current reinforcement learning proofs of convergence do not hold for most multi-agent systems problems.

Related work

A survey of existing threats in multi-agent systems and models of realistic and rational adversaries are presented in [12] and [13]. Work on consensus in the presence of persistent adversaries has focused on detecting, identifying, and isolating the failing nodes [34], which can be computationally expensive and often requires the use of global information and specific graph connectivity. The adversaries can easily drive the system unstable and make it operate with an undesired behavior. The authors in [2, 5, 47] show the advantages of using game theory in network security. The work of [20] presents a framework to show how models from game theory can be used to defend an electric power system against antagonistic attacks.

Robust consensus algorithms from the information-theory side have been proposed in [51] and [55]. Specifically, distributed averaging in the presence of channel uncertainties is presented in [51], where the network is used twice per cycle without any optimality guarantees. The work of [55] proposes consensus-based distributed-decoding algorithms (e.g., the Viterbi algorithm) that are robust to random link failures and additive noise present in the inter-sensor links. Performance bounds are provided to determine the number of iterations needed to guarantee a given level of average decoding errors at any given sensor.

Most of the consensus algorithms proposed in the literature to provide resilience against adversaries either do not optimize a specific performance criterion [16], [30] or require the off-line solution of Riccati equations [39, 40]. The work of [26] formulates the problem of measurement corruption and jamming attacks as centralized finite-horizon problems where the adversary wants to maximize the Euclidean distance between the nodes' state and the consensus line; the computations have to be performed offline in order to compute the optimal strategy. The authors in [6] propose to evaluate the cost of every agent by considering constant states for the other agents, and the authors in [17] present a suboptimal phase-synchronization adaptive control scheme for networked nonlinear Euler-Lagrange systems with a centralized performance criterion that does not incorporate the adversarial inputs. A distributed clock-synchronization protocol based on a consensus algorithm for non-identical double integrators is presented in [14]; the authors provide conditions on the protocol parameters in terms of rate of convergence and of steady-state error in the presence of an additive noise by solving offline a centralized optimization problem. In [53], the authors propose a controller that suppresses the effect of constant and time-varying disturbances by using information of the agent's and neighbors' states. Vamvoudakis et al. [46] presented reinforcement learning algorithms that guarantee optimal performance in networked systems with leaders, but did not consider the effect of adversaries that corrupt the measured data. In [54], the authors propose a distributed resilient formation control algorithm that consists of a formation control block that adapts online to minimize a local formation error, as well as a learning block that collects information in real time to update the estimates of adversaries. The book [2] models the malware traffic as an H-infinity problem (zero-sum game) whose optimal filtering strategies can also be used in spam filtering, distributed denial-of-service attacks, etc. This H-infinity framework allows for dynamically changing filtering rules in order to ensure a certain performance level. Moreover, the objectives of the defense and the adversaries are diametrically opposed, so the zero-sum assumption is accurate, with a performance guarantee in the form of a minimum security level. The work of [42] provides methods for reaching consensus in the presence of malicious agents, but the algorithms proposed are combinatorial in nature and thus computationally expensive.

A multi-agent synchronization problem of continuous-time Markov jump linear systems is presented in [56], where one subsystem aims to mislead the other subsystem to an unfavorable system state. The authors provide a set of coupled Riccati differential equations derived from centralized performance criteria to characterize the feedback Nash equilibrium solution, but the approach requires offline computations and comes without stability guarantees. The work of [15] achieves tracking for a case of double-integrator agents when sufficient communication for path distribution is permitted, but without any optimality or robustness guarantees. In [35] the authors consider a network consisting of identical agents in which the only measurement given to each agent is a linear combination of the output of the agent relative to that of its neighbors; their aim was to design protocols that attenuate the impact of disturbances in the sense of the H-infinity norm of the corresponding transfer function, without enforcing any optimization criteria. Coupled Hamilton-Jacobi equations that arise in such problems are nonlinear partial differential equations, and it is well known that in general such equations do not admit global classical solutions, and if they do, these solutions may not be smooth; they may, however, have so-called viscosity solutions [4, 18]. Under certain local reachability and observability assumptions, they have local smooth solutions [5, 49]. Various other assumptions guarantee existence of smooth solutions, such as that the dynamics not be bilinear and that the value not contain cross-terms in the state and control input.

Our algorithm, in contrast, provides plug-and-play policies while also satisfying user-defined distributed performance criteria formulated as a graphical game. In this graphical game, only neighbors in the graph interact and the payoffs depend on the actions of the neighbors, which gives a representation that is exponential in the size of the largest neighborhood. This is in contrast to games in normal form, where every agent interacts with every other agent, the payoffs depend on the actions of all agents, and the representation is exponential in the number of players. The result of this work is an adaptive control system that learns, based on the interplay of agents in a game, to deliver true online gaming behavior.

Contributions

This paper proposes a game-theory-based consensus learning algorithm for networked systems with N double-integrator dynamics that are being attacked by persistent adversaries. The problem is solved by combining adaptive dynamic programming and game theory. We consider double-integrator dynamics, which are often used to represent single-degree-of-freedom rotational or translational motion [37].

Furthermore, this work provides a relationship between the optimal consensus problem for multi-agent systems with adversarial inputs and graphical Nash equilibria. The derived coupled Hamilton-Jacobi equations for multi-agent systems are established by Bellman's dynamic programming, and then a formal stability analysis is developed for the learning scheme. The game has 2N players, since the input to each double-integrator is the sum of two signals controlled by players with opposite goals: a controller that wants to achieve consensus and an adversary that wants to prevent this goal. We shall see that this problem formulation results in a distributed consensus algorithm that is significantly more robust than the usual linear consensus.


The criteria to be minimized by each agent depend only on local information, and the problem is formulated as a graphical game. Namely, the criteria depend on every agent's own state and controller/adversarial inputs, as well as on the controller/adversarial inputs of its neighbors. These optimization criteria let us design the decision policies based on the graph structure and the local control and adversarial inputs, and hence enable the development of an optimal and distributed consensus algorithm. It is worth noting that even though we are using distributed performance criteria, this does not allow one to find a closed-form solution to the coupled Bellman equations even for the quadratic case, due to the state coupling in the neighborhood.

The computation of an exact equilibrium to the game formulated requires the solution of a system of coupled partial differential equations, one for each pair of agents, which does not appear to be feasible. The algorithm proposed uses instead three approximators for every agent that use distributed information. Every agent uses a critic to approximate its value function, an actor to approximate the optimal controller, and a second actor to approximate the worst-case adversary. The tuning laws for each controller and each adversary are driven by the values of their distributed criteria and of their control signals, as well as the criteria and control signals of their neighbors. In particular, the controllers do not require explicit measurements of the adversarial signals and vice versa. Convergence of the game-theoretic actor-critic algorithm and stability of the closed-loop system are analyzed using Lyapunov-based adaptive control methods. These results require an appropriate notion of persistence of excitation (PE) to guarantee exponential convergence to a bounded region.

Organization

In Section 2 we introduce the problem formulation and an existence result for the Nash equilibrium. A game-theory-based consensus architecture that uses only local information is presented in Section 3, where we also provide an online learning algorithm. The effectiveness of the proposed approach is illustrated in Section 4 through simulations. Finally, in Section 5 we conclude and discuss future work.

Notation

The notation used in this paper is standard: $\mathbb{R}$ denotes the set of real numbers, $\mathbb{R}^{+}$ denotes the set $\{x \in \mathbb{R} : x > 0\}$, $\mathbb{R}^{n_1 \times n_2}$ is the set of $n_1 \times n_2$ real matrices, $\|\cdot\|$ denotes the Euclidean norm, $(\cdot)^T$ denotes the transpose of a matrix, and $|S|$ denotes the cardinality of the set $S$. A continuous function $\alpha : [0, a) \to [0, \infty)$ is said to belong to class $\mathcal{K}$ if it is strictly increasing and $\alpha(0) = 0$.


2 Problem Formulation

We consider a system consisting of $N$ two-input agents, each modeled by a two-input double integrator:
$$\begin{cases} \dot q_i = p_i, & q_i \in \mathbb{R}^m \\ \dot p_i = u_i + v_i, & p_i \in \mathbb{R}^m \end{cases} \qquad \forall i \in \mathcal{N} := \{1,\dots,N\}, \tag{1}$$
where $u_i, v_i \in \mathbb{R}^m$ are the two inputs to agent $i$, which are controlled by players with opposite goals. We thus have a total of $2N$ players: $N$ of them, which we call controllers, select values for $u_i(t)$, $t \ge 0$, $i \in \mathcal{N}$ within appropriate sets $\mathcal{U}_i$, and the other $N$, which we call adversaries, select values for $v_i(t)$, $t \ge 0$, $i \in \mathcal{N}$ within appropriate sets $\mathcal{V}_i$.

2.1 Distributed Performance Criteria

For each $i \in \mathcal{N}$, the criteria of the players associated with the inputs $u_i$ and $v_i$ are symmetric and depend on their state $[q_i^T\ p_i^T]^T$, their inputs $[u_i^T\ v_i^T]^T$, and also on the inputs of a nonempty subset $\mathcal{N}_i \subset \mathcal{N}$ of the other agents. Specifically, $u_i$ and $v_i$ want to minimize and maximize, respectively, the following criteria
$$J_i(p_i(0), q_i(0), p_{\mathcal{N}_i}(0), q_{\mathcal{N}_i}(0); u_i, u_{\mathcal{N}_i}, v_i, v_{\mathcal{N}_i}) = \frac{1}{2}\int_0^\infty \Big( \|s_i\|^2 + \|u_i\|^2 - \gamma_{ii}^2\|v_i\|^2 + \sum_{j\in\mathcal{N}_i}\|u_j\|^2 - \sum_{j\in\mathcal{N}_i}\gamma_{ij}^2\|v_j\|^2 \Big)\,dt \tag{2}$$
where
$$s_i(t) = \sum_{j\in\mathcal{N}_i}\left( \begin{bmatrix} q_i \\ p_i \end{bmatrix} - \begin{bmatrix} q_j \\ p_j \end{bmatrix} \right) \in \mathbb{R}^{2m}, \quad \forall t \ge 0,\ i \in \mathcal{N}, \tag{3}$$
and $\gamma_{ii}, \gamma_{ij}$, $\forall i, j \in \mathcal{N}$ are positive constants. In (2) and in the sequel, we use the subscript $\cdot_{\mathcal{N}_i}$ as shorthand notation for all the subscripts $\cdot_j$, $j \in \mathcal{N}_i$, associated with the neighbors of $i$. The game just defined is called a graphical game [24], because the coupling between the criteria of the different players can be naturally associated with a graph $\mathcal{G}$ with $N$ nodes (one for each pair of opposing agents) and edges defined by the neighboring relations expressed by the sets $\mathcal{N}_i$.

Remark 2.1 Note that, with the performance index (2), we consider attack scenarios where the adversary's goal is to drive the agents to a non-consensus state while remaining stealthy. These kinds of attacks leave more autonomy to the adversary and consider attacks that may be theoretically discernible, but are still stealthy since they do not cause any alarm at the anomaly detector. We shall see in the subsequent theorem (cf. Theorem 2.1) that in order for the systems to be stabilized one needs to pick $\gamma_{ii} > 1$ and $\gamma_{ij} < \gamma_{jj}^2$, $\forall i \in \mathcal{N}, j \in \mathcal{N}_i$. These conditions are similar to the ones in [2] (see Chapter 7), [5, 49], where one needs to pick a $\gamma > \gamma^* \ge 0$, with $\gamma^*$ the smallest $\gamma$ such that the system is stabilized. $\square$

The variables $s_i(t)$ should be viewed as consensus tracking errors that express the (weighted) sum of the errors between the state of agent $i$ and the states of its neighbors in $\mathcal{N}_i$. When all the $s_i(t)$ converge to zero, we say that consensus is asymptotically reached, in the sense that $q_i = q_j$, $p_i = p_j$, $\forall i, j \in \mathcal{N}$ [33].
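To make the quantities in (2)-(3) concrete, the following minimal sketch (an illustration only; the neighborhood, gains, states, and inputs are hypothetical placeholders, not taken from the paper's simulations) computes the consensus tracking error $s_i$ of one agent and the integrand of its criterion.

```python
import numpy as np

# Hypothetical neighborhood and gains for agent i (illustration only).
m = 1                                  # dimension of q_i and p_i
neighbors = [1, 2]                     # indices j in N_i
gamma_ii, gamma_ij = 5.0, 1.0          # attenuation gains from (2)

def tracking_error(q, p, i, neighbors):
    """Consensus tracking error s_i from (3): sum of state differences
    between agent i and its neighbors, stacked as [position; velocity]."""
    s = np.zeros(2 * m)
    for j in neighbors:
        s += np.concatenate([q[i] - q[j], p[i] - p[j]])
    return s

def running_cost(s_i, u, v, i, neighbors):
    """Integrand of the criterion (2) that u_i minimizes and v_i maximizes."""
    cost = np.dot(s_i, s_i) + np.dot(u[i], u[i]) - gamma_ii**2 * np.dot(v[i], v[i])
    for j in neighbors:
        cost += np.dot(u[j], u[j]) - gamma_ij**2 * np.dot(v[j], v[j])
    return 0.5 * cost

# Example with three agents on a line (agent 0 plus its two neighbors).
q = [np.array([0.0]), np.array([1.0]), np.array([-0.5])]
p = [np.array([0.2]), np.array([0.0]), np.array([0.1])]
u = [np.array([0.1]), np.array([0.0]), np.array([0.0])]
v = [np.array([0.05]), np.array([0.0]), np.array([0.0])]
s0 = tracking_error(q, p, 0, neighbors)
print(s0, running_cost(s0, u, v, 0, neighbors))
```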

We are interested in finding Nash equilibrium policies $u_i^*$ and $v_i^*$, $\forall i \in \mathcal{N}$, for the $2N$-player game, in the sense that
$$J_i(\cdot; u_i^*, u_{\mathcal{N}_i}^*, v_i, v_{\mathcal{N}_i}^*) \le J_i^*(\cdot; u_i^*, u_{\mathcal{N}_i}^*, v_i^*, v_{\mathcal{N}_i}^*) \le J_i(\cdot; u_i, u_{\mathcal{N}_i}^*, v_i^*, v_{\mathcal{N}_i}^*), \quad \forall v_i, u_i,\ i \in \mathcal{N}, \tag{4}$$
where, for simplicity, we omitted the dependence of the criteria (2) on the initial conditions. Essentially, each pair of players $u_i$ and $v_i$ is engaged in a zero-sum game, but the outcomes of all these zero-sum games are coupled through the neighboring relations expressed by the sets $\mathcal{N}_i$. The positive terms on the $u_i$ and the negative terms on the $v_i$ in (2) express the fact that the players that select the $u_i$ and the $v_i$ want to achieve their objectives while keeping these signals small. Recall that $J_i$ is a cost for the player that selects $u_i$, but a reward for the player that selects $v_i$. It is clear that the criteria (2) enable us to define a general neighborhood optimization framework, while the different signs are due to the zero-sum game formulation. One could also remove the neighboring terms (the fourth and fifth terms of (2)), since (3) already carries the information from the neighborhood.

To solve this problem, it is convenient to re-write the agent dynamics (1) in terms of the consensus tracking errors (3):
$$\dot s_i = \begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i + d_i \begin{bmatrix} 0 \\ I \end{bmatrix}(u_i + v_i) - \sum_{j\in\mathcal{N}_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(u_j + v_j), \quad i \in \mathcal{N}, \tag{5}$$
where $d_i := |\mathcal{N}_i| > 0$, and $0$ and $I$ denote the $m \times m$ zero and identity matrices, respectively. This shows that $J_i(\cdot)$ in (2) depends on the initial conditions $p_i(0), q_i(0), p_{\mathcal{N}_i}(0), q_{\mathcal{N}_i}(0)$ only through $s_i(0)$. In the sequel, we thus simply write $J_i(s_i(0); u_i, u_{\mathcal{N}_i}, v_i, v_{\mathcal{N}_i})$ for the left-hand side of (2).
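A minimal numerical sketch of the error dynamics (5) for a single agent follows (a forward-Euler step with hypothetical inputs and neighbor signals; the block matrices mirror the structure used throughout the paper).

```python
import numpy as np

m = 1                                   # single-axis motion
A = np.block([[np.zeros((m, m)), np.eye(m)],
              [np.zeros((m, m)), np.zeros((m, m))]])   # [[0, I], [0, 0]]
B = np.vstack([np.zeros((m, m)), np.eye(m)])           # [0; I]

def s_dot(s_i, u_i, v_i, neighbor_inputs, d_i):
    """Right-hand side of (5): consensus-error dynamics of agent i."""
    rhs = A @ s_i + d_i * B @ (u_i + v_i)
    for (u_j, v_j) in neighbor_inputs:
        rhs -= B @ (u_j + v_j)
    return rhs

# One Euler step with hypothetical values (illustration only).
s_i = np.array([1.0, -0.3])
step = 0.01 * s_dot(s_i, np.array([0.1]), np.array([0.02]),
                    [(np.array([0.0]), np.array([0.0]))], d_i=1)
print(s_i + step)
```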

2.2 Existence and Stability of Equilibria

The saddle-point conditions (4) can be expressed by $2N$ coupled optimizations:
$$J_i(s_i(0); u_i^*, u_{\mathcal{N}_i}^*, v_i^*, v_{\mathcal{N}_i}^*) = \max_{v_i} J_i(s_i(0); u_i^*, u_{\mathcal{N}_i}^*, v_i, v_{\mathcal{N}_i}^*), \quad i \in \mathcal{N}, \tag{6a}$$
$$J_i(s_i(0); u_i^*, u_{\mathcal{N}_i}^*, v_i^*, v_{\mathcal{N}_i}^*) = \min_{u_i} J_i(s_i(0); u_i, u_{\mathcal{N}_i}^*, v_i^*, v_{\mathcal{N}_i}^*), \quad i \in \mathcal{N}, \tag{6b}$$
which share the dynamics (5) and the Hamiltonians
$$\begin{aligned} H_i(s_i, \rho_i, u_i, u_{\mathcal{N}_i}, v_i, v_{\mathcal{N}_i}) :={}& \rho_i^T\bigg( \begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i + d_i\begin{bmatrix} 0 \\ I \end{bmatrix}(u_i + v_i) - \sum_{j\in\mathcal{N}_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(u_j + v_j)\bigg) \\ &+ \frac{1}{2}\bigg( \|s_i\|^2 + \|u_i\|^2 - \gamma_{ii}^2\|v_i\|^2 + \sum_{j\in\mathcal{N}_i}\|u_j\|^2 - \sum_{j\in\mathcal{N}_i}\gamma_{ij}^2\|v_j\|^2 \bigg), \quad i \in \mathcal{N}, \end{aligned}$$
where $\rho_i \in \mathbb{R}^{2m}$ is the co-state (adjoint) variable. For each $i \in \mathcal{N}$, the left-hand side of the two optimizations in (6) can be viewed as a (common) value function for both optimizations
$$V_i^*(s_i(0)) := \max_{v_i} J_i(s_i(0); u_i^*, u_{\mathcal{N}_i}^*, v_i, v_{\mathcal{N}_i}^*) = \min_{u_i} J_i(s_i(0); u_i, u_{\mathcal{N}_i}^*, v_i^*, v_{\mathcal{N}_i}^*), \quad i \in \mathcal{N}, \tag{7}$$
that should satisfy the following coupled Bellman equations:
$$H_i\Big(s_i, \frac{\partial V_i^*}{\partial s_i}, u_i^*, u_{\mathcal{N}_i}^*, v_i^*, v_{\mathcal{N}_i}^*\Big) = 0, \quad i \in \mathcal{N}, \tag{8}$$
with boundary conditions $V_i^*(0) = 0$ and controller/adversarial policies given by
$$u_i^* = \arg\min_{u_i} H_i\Big(s_i, \frac{\partial V_i^*}{\partial s_i}, u_i, u_{\mathcal{N}_i}^*, v_i^*, v_{\mathcal{N}_i}^*\Big) = -d_i\begin{bmatrix} 0 & I \end{bmatrix}\frac{\partial V_i^*}{\partial s_i}, \quad i \in \mathcal{N}, \tag{9}$$
$$v_i^* = \arg\max_{v_i} H_i\Big(s_i, \frac{\partial V_i^*}{\partial s_i}, u_i^*, u_{\mathcal{N}_i}^*, v_i, v_{\mathcal{N}_i}^*\Big) = \frac{d_i}{\gamma_{ii}^2}\begin{bmatrix} 0 & I \end{bmatrix}\frac{\partial V_i^*}{\partial s_i} = -\frac{1}{\gamma_{ii}^2}u_i^*, \quad i \in \mathcal{N}. \tag{10}$$
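Before proceeding, a small sketch of the policy expressions (9)-(10) may help; it assumes the value gradient $\partial V_i^*/\partial s_i$ is available and, purely for illustration, evaluates it from a hypothetical quadratic value function. It also verifies numerically the proportionality $v_i^* = -u_i^*/\gamma_{ii}^2$.

```python
import numpy as np

m, d_i, gamma_ii = 1, 2, 5.0
E = np.hstack([np.zeros((m, m)), np.eye(m)])     # selector [0  I]

def policies(grad_V):
    """Optimal controller (9) and worst-case adversary (10) from a value gradient."""
    u_star = -d_i * E @ grad_V
    v_star = (d_i / gamma_ii**2) * E @ grad_V    # equals -u_star / gamma_ii**2
    return u_star, v_star

# Hypothetical quadratic value V(s) = 0.5 s^T P s, so grad = P s (illustration only).
P = np.array([[2.0, 0.5], [0.5, 1.0]])
s = np.array([1.0, -0.2])
u_star, v_star = policies(P @ s)
print(u_star, v_star, -u_star / gamma_ii**2)     # last two coincide
```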

One can see from (9) and (10) that when one player accelerates ($u_i^* > 0$) the other decelerates ($v_i^* < 0$) and vice versa.

Remark 2.2 One could represent the value functions as quadratic in the consensus tracking error, i.e., $V_i^*(s_i) : \mathbb{R}^n \to \mathbb{R}$,
$$V_i^*(s_i) = \frac{1}{2}s_i^T P_i s_i, \quad \forall s_i,\ \forall i \in \mathcal{N},$$
where $P_i \in \mathbb{R}^{n \times n}$, $\forall i \in \mathcal{N}$, are the unique symmetric positive definite matrices that solve the following complicated distributed coupled equations,
$$\begin{aligned}
& s_i^T P_i\bigg( \begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i - d_i^2\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\Big(1-\frac{1}{\gamma_{ii}^2}\Big)P_i s_i + \sum_{j\in\mathcal{N}_i} d_j\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\Big(1-\frac{1}{\gamma_{ii}^2}\Big)P_j s_j\bigg) \\
&\quad + \bigg( \begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i - d_i^2\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\Big(1-\frac{1}{\gamma_{ii}^2}\Big)P_i s_i + \sum_{j\in\mathcal{N}_i} d_j\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\Big(1-\frac{1}{\gamma_{ii}^2}\Big)P_j s_j\bigg)^T P_i s_i \\
&\quad + \sum_{j\in\mathcal{N}_i} d_j^2\, s_j^T P_j\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\Big(1-\frac{\gamma_{ij}^2}{\gamma_{jj}^4}\Big)P_j s_j + d_i^2\, s_i^T P_i\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\Big(1-\frac{1}{\gamma_{ii}^2}\Big)P_i s_i + \|s_i\|^2 = 0, \quad \forall i \in \mathcal{N}.
\end{aligned}$$
It is obvious that in the above equation the coupling of each state $s_i$ with the neighborhood states $s_j$, $j \in \mathcal{N}_i$, makes it difficult to provide a closed-form solution, thereby rendering optimal control techniques of limited use for non-quadratic costs such as the one considered here [8, 50]. $\square$

Remark 2.3 For every fixed $u_{\mathcal{N}_i}^*, v_i^*, v_{\mathcal{N}_i}^*$, we have $\frac{\partial^2 H_i}{\partial u_i^2} > 0$, and therefore the Hamiltonian is minimized at the stationarity point (9). Conversely, for every fixed $u_i^*, u_{\mathcal{N}_i}^*, v_{\mathcal{N}_i}^*$, we have $\frac{\partial^2 H_i}{\partial v_i^2} < 0$, and therefore the Hamiltonian is maximized at the stationarity point (10). $\square$

Remark 2.4 As we shall see in the following theorem, every agent participates in a zero-sum game, but all the agents together participate in a nonzero-sum game. The controller and the adversarial input of each agent are two players competing with each other and, as explained in [2], the intelligent maximizing adversary plays her strategy to ensure a certain performance level. When one player adopts a constant strategy, the problem becomes an optimal control problem. In general, each player (controller or adversary) tries to achieve the best outcome while taking into account that the opponent does the same. The two players, controller and adversary, measure (3) (due to the presence of $\frac{\partial V_i^*}{\partial s_i}$ in (9)-(10)), which includes the positions and the velocities of the agents in the neighborhood. $\square$

The following theorem shows that, under the assumption of a smooth solution of the coupled Bellman equations (8) and some other conditions, the Nash equilibrium is attained, with the equality to zero replaced by less than or equal to zero (as in the coupled Hamilton-Jacobi-Isaacs (HJI) inequalities [5, 10, 49]).

Theorem 2.1 Suppose that there exist continuously differentiable, positive definite functions $V_i^* \in C^1$, $i \in \mathcal{N}$, that satisfy the following inequalities,
$$\begin{aligned}
& \frac{\partial V_i^*}{\partial s_i}^T\begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i + \frac{d_i^2}{2}\,\frac{1-\gamma_{ii}^2}{\gamma_{ii}^2}\,\frac{\partial V_i^*}{\partial s_i}^T\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial V_i^*}{\partial s_i} + \sum_{j\in\mathcal{N}_i}\frac{\gamma_{jj}^2-1}{\gamma_{jj}^2}\,\frac{\partial V_i^*}{\partial s_i}^T\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial V_j^*}{\partial s_j} \\
&\quad + \frac{1}{2}\bigg( \|s_i\|^2 + \sum_{j\in\mathcal{N}_i} d_j^2\,\frac{\gamma_{jj}^4-\gamma_{ij}^2}{\gamma_{jj}^4}\,\frac{\partial V_j^*}{\partial s_j}^T\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial V_j^*}{\partial s_j}\bigg) \le 0, \quad \forall s_i, \end{aligned} \tag{11}$$
with $V_i^*(0) = 0$, $\forall i \in \mathcal{N}$, and $\gamma_{ii} > 1$ and $\gamma_{ij} < \gamma_{jj}^2$, $\forall i \in \mathcal{N}, j \in \mathcal{N}_i$. The closed-loop system (5) with
$$u_i = u_i^* := -d_i\begin{bmatrix} 0 & I \end{bmatrix}\frac{\partial V_i^*}{\partial s_i}, \qquad v_i = v_i^* := \frac{d_i}{\gamma_{ii}^2}\begin{bmatrix} 0 & I \end{bmatrix}\frac{\partial V_i^*}{\partial s_i}, \quad \forall i \in \mathcal{N}, \tag{12}$$
is asymptotically stable. Moreover, the controller/adversarial policies (12) form a Nash equilibrium and
$$J_i^*(s_i(0); u_i^*, u_{\mathcal{N}_i}^*, v_i^*, v_{\mathcal{N}_i}^*) = V_i^*(s_i(0)), \quad \forall i \in \mathcal{N}.$$

Proof of Theorem 2.1. The orbital derivatives of the $V_i^*$ along solutions to (5), (12) are given by
$$\dot V_i^* = \frac{\partial V_i^*}{\partial s_i}^T\dot s_i = \frac{\partial V_i^*}{\partial s_i}^T\bigg( \begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i + d_i\begin{bmatrix} 0 \\ I \end{bmatrix}(u_i + v_i) - \sum_{j\in\mathcal{N}_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(u_j + v_j)\bigg) \tag{13}$$
and, after substituting the policies (12) in (13), one obtains
$$\dot V_i^* = \frac{\partial V_i^*}{\partial s_i}^T\bigg( \begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i + \frac{d_i^2}{\gamma_{ii}^2}(1-\gamma_{ii}^2)\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial V_i^*}{\partial s_i} + \sum_{j\in\mathcal{N}_i}\frac{\gamma_{jj}^2-1}{\gamma_{jj}^2}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial V_j^*}{\partial s_j}\bigg).$$
By using the inequalities (11) we have that
$$\dot V_i^* \le -\frac{1}{2}\bigg( \|s_i\|^2 + \frac{d_i^2}{\gamma_{ii}^2}(\gamma_{ii}^2-1)\frac{\partial V_i^*}{\partial s_i}^T\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial V_i^*}{\partial s_i} + \sum_{j\in\mathcal{N}_i}\frac{d_j^2}{\gamma_{jj}^4}\big(\gamma_{jj}^4-\gamma_{ij}^2\big)\frac{\partial V_j^*}{\partial s_j}^T\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial V_j^*}{\partial s_j}\bigg).$$
Since $\gamma_{ii} > 1$ and $\gamma_{ij} < \gamma_{jj}^2$, $\forall i \in \mathcal{N}, j \in \mathcal{N}_i$, we further conclude that $\dot V_i^* \le -\frac{1}{2}\|s_i\|^2$, from which asymptotic stability follows using the Lyapunov Theorem [25].

Next we need to prove that the controller/adversarial policies form a Nash equilibrium. Since the functions $V_i^*$, $i \in \mathcal{N}$, are smooth, zero at zero, and converge to zero as $t \to \infty$ (due to the asymptotic stability), we can write (2) as
$$J_i(s_i(0); u_i, u_{\mathcal{N}_i}, v_i, v_{\mathcal{N}_i}) = \frac{1}{2}\int_0^\infty\bigg( \|s_i\|^2 + \|u_i\|^2 + \sum_{j\in\mathcal{N}_i}\|u_j\|^2 - \gamma_{ii}^2\|v_i\|^2 - \sum_{j\in\mathcal{N}_i}\gamma_{ij}^2\|v_j\|^2\bigg)dt + V_i^*(s_i(0)) + \int_0^\infty\dot V_i^*(s(t))\,dt, \quad i \in \mathcal{N},$$
and, in view of (5), we can re-write this equation as
$$\begin{aligned} J_i(s_i(0); u_i, u_{\mathcal{N}_i}, v_i, v_{\mathcal{N}_i}) ={}& \frac{1}{2}\int_0^\infty\bigg( \|s_i\|^2 + \|u_i\|^2 + \sum_{j\in\mathcal{N}_i}\|u_j\|^2 - \gamma_{ii}^2\|v_i\|^2 - \sum_{j\in\mathcal{N}_i}\gamma_{ij}^2\|v_j\|^2\bigg)dt + V_i^*(s_i(0)) \\ &+ \int_0^\infty\frac{\partial V_i^*}{\partial s_i}^T\bigg( \begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i + d_i\begin{bmatrix} 0 \\ I \end{bmatrix}(u_i + v_i) - \sum_{j\in\mathcal{N}_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(u_j + v_j)\bigg)dt, \quad i \in \mathcal{N}. \end{aligned}$$
Writing (8) as
$$\frac{\partial V_i^*}{\partial s_i}^T\begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i = -\frac{\partial V_i^*}{\partial s_i}^T\bigg( d_i\begin{bmatrix} 0 \\ I \end{bmatrix}(u_i^*+v_i^*) - \sum_{j\in\mathcal{N}_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(u_j^*+v_j^*)\bigg) - \frac{1}{2}\bigg( \|s_i\|^2 + \|u_i^*\|^2 - \gamma_{ii}^2\|v_i^*\|^2 + \sum_{j\in\mathcal{N}_i}\|u_j^*\|^2 - \sum_{j\in\mathcal{N}_i}\gamma_{ij}^2\|v_j^*\|^2\bigg),$$
we can re-write the previous equations as
$$\begin{aligned} J_i(s_i(0); u_i, u_{\mathcal{N}_i}, v_i, v_{\mathcal{N}_i}) ={}& \frac{1}{2}\int_0^\infty\bigg( \|u_i-u_i^*\|^2 + \sum_{j\in\mathcal{N}_i}\|u_j-u_j^*\|^2 - \gamma_{ii}^2\|v_i-v_i^*\|^2 - \sum_{j\in\mathcal{N}_i}\gamma_{ij}^2\|v_j-v_j^*\|^2 \\ &\quad - 2\frac{\partial V_i^*}{\partial s_i}^T\sum_{j\in\mathcal{N}_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(u_j-u_j^*) + 2\sum_{j\in\mathcal{N}_i}u_j^{*T}(u_j-u_j^*) \\ &\quad - 2\frac{\partial V_i^*}{\partial s_i}^T\sum_{j\in\mathcal{N}_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(v_j-v_j^*) - 2\sum_{j\in\mathcal{N}_i}\gamma_{ij}^2 v_j^{*T}(v_j-v_j^*)\bigg)dt + V_i^*(s_i(0)), \quad i \in \mathcal{N}. \end{aligned}$$
Assuming that all the neighbors of the $i$th node play at the proposed equilibrium (i.e., $u_j = u_j^*$ and $v_j = v_j^*$, $\forall j \in \mathcal{N}_i$), we obtain
$$J_i(s_i(0); u_i, u_{\mathcal{N}_i}^*, v_i, v_{\mathcal{N}_i}^*) = \frac{1}{2}\int_0^\infty\Big( \|u_i-u_i^*\|^2 - \gamma_{ii}^2\|v_i-v_i^*\|^2\Big)dt + V_i^*(s_i(0)), \quad i \in \mathcal{N}. \tag{14}$$
Moreover, setting $u_i = u_i^*$ in (14) leads to
$$J_i(s_i(0); u_i^*, u_{\mathcal{N}_i}^*, v_i, v_{\mathcal{N}_i}^*) = \frac{1}{2}\int_0^\infty\Big(-\gamma_{ii}^2\|v_i-v_i^*\|^2\Big)dt + V_i^*(s_i(0)), \quad i \in \mathcal{N}; \tag{15}$$
setting $v_i = v_i^*$ in (14) leads to
$$J_i(s_i(0); u_i, u_{\mathcal{N}_i}^*, v_i^*, v_{\mathcal{N}_i}^*) = \frac{1}{2}\int_0^\infty\|u_i-u_i^*\|^2\,dt + V_i^*(s_i(0)), \quad i \in \mathcal{N}; \tag{16}$$
and setting $u_i = u_i^*$ and $v_i = v_i^*$ in (14) leads to
$$J_i^*(s_i(0); u_i^*, u_{\mathcal{N}_i}^*, v_i^*, v_{\mathcal{N}_i}^*) = V_i^*(s_i(0)), \quad i \in \mathcal{N}. \tag{17}$$
The Nash equilibrium condition (4) then follows directly from (15), (16), and (17). $\square$

3 Reaching Consensus with Learning Ideas

Since in general it is not easy to find solutions $V_i^*$ to the Bellman equations (8), based on which one would construct the optimal controllers (9) and adversaries (10), we shall use an actor-critic architecture to compute approximations to these functions. Specifically, $N$ critic approximators will provide approximations to the value functions $V_i^*$, $i \in \mathcal{N}$, in (8), and $2N$ actor approximators will provide approximations to the optimal controllers and adversaries $u_i^*, v_i^*$, $i \in \mathcal{N}$, given by (9) and (10), respectively.

We use linearly parametrized critics with $h$ basis functions, defined using smooth basis functions $\phi_i := [\phi_{i1}\ \phi_{i2}\ \cdots\ \phi_{ih}] : \mathbb{R}^{2m} \to \mathbb{R}^h$, $i \in \mathcal{N}$, which allow us to write
$$V_i^*(s_i) = W_i^T\phi_i(s_i) + \varepsilon_{ci}(s_i), \quad \forall s_i,\ i \in \mathcal{N}, \tag{18}$$
where the $W_i \in \mathbb{R}^h$ denote ideal weights and the $\varepsilon_{ci}(s_i)$ the corresponding residual errors. Based on this, the optimal controllers in (9) can be re-written as
$$u_i^*(s_i) = -d_i\begin{bmatrix} 0 \\ I \end{bmatrix}^T\Big(\frac{\partial\phi_i}{\partial s_i}(s_i)^T W_i + \frac{\partial\varepsilon_{ci}}{\partial s_i}\Big), \quad \forall s_i,\ i \in \mathcal{N}, \tag{19}$$
and the adversaries in (10) can be re-written as
$$v_i^*(s_i) = \frac{d_i}{\gamma_{ii}^2}\begin{bmatrix} 0 \\ I \end{bmatrix}^T\Big(\frac{\partial\phi_i}{\partial s_i}(s_i)^T W_i + \frac{\partial\varepsilon_{ci}}{\partial s_i}\Big), \quad \forall s_i,\ i \in \mathcal{N}. \tag{20}$$
For brevity, from now on we will omit the dependence of the basis functions on the state $s_i$ and simply write $\phi_i$ for $\phi_i(s_i)$.
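As an illustration of the parametrization (18)-(20), the sketch below uses quadratic monomials of $s_i$ as basis functions (the quadratic form suggested in Section 4) and evaluates $\phi_i$ and $\partial\phi_i/\partial s_i$; the critic weights shown are arbitrary placeholders rather than the ideal weights $W_i$.

```python
import numpy as np

def phi(s):
    """Quadratic basis: all monomials s_k * s_l with k <= l (Kronecker-style)."""
    n = len(s)
    return np.array([s[k] * s[l] for k in range(n) for l in range(k, n)])

def dphi_ds(s):
    """Jacobian of phi with respect to s (rows index basis functions)."""
    n = len(s)
    rows = []
    for k in range(n):
        for l in range(k, n):
            g = np.zeros(n)
            g[k] += s[l]
            g[l] += s[k]
            rows.append(g)
    return np.array(rows)

s = np.array([1.0, -0.5])          # 2m = 2 for scalar motion
W = np.array([0.8, 0.1, 0.4])      # placeholder critic weights (not the ideal W_i)
V_hat = W @ phi(s)                 # value approximation as in (18), without the residual
grad_V_hat = dphi_ds(s).T @ W      # gradient used by the policies (19)-(20)
print(V_hat, grad_V_hat)
```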


Remark 3.1 Note that the adversarial input $v_i^*$ in (20) is not the actual input applied by the adversary, but the worst-case input that she can introduce to our system. The actual adversary has the freedom to do anything she likes, as long as she remains stealthy. $\square$

Assumption 3.1 The following three statements must be satisfied $\forall s_i$ and $\forall i \in \mathcal{N}$:

– The basis functions $\phi_i$ and their derivatives $\frac{\partial\phi_i}{\partial s_i}$ are bounded.
– The ideal weights are bounded by known constants: $\|W_i\| \le W_{i\max}$.
– The residual errors and their derivatives are bounded by known constants: $\|\varepsilon_{ci}\| \le \varepsilon_{ci\max}$ and $\big\|\frac{\partial\varepsilon_{ci}}{\partial s_i}\big\| \le \nabla\varepsilon_{ci\max}$. $\square$

Assumption 3.1 is a rather standard assumption in neuro-adaptive control [19, 21, 32]. The first two points are satisfied because $W_i$ and $\phi_i$ provide the value functions [50]. In order to satisfy the third one, one can pick sigmoidal (e.g., hyperbolic tangent) and Gaussian functions as basis functions.

Remark 3.2 According to the Weierstrass Higher-Order Approximation Theorem [19], as the number of basis functions $h$ increases, the approximation errors go to zero $\forall i \in \mathcal{N}$, i.e., $\varepsilon_{ci} \to 0$ and $\big\|\frac{\partial\varepsilon_{ci}}{\partial s_i}\big\| \to 0$ as $h \to \infty$. We shall require a form of uniformity in this approximation result that is common in neuro-adaptive control and other approximation techniques [19, 21, 50]. $\square$

3.1 Actor-Critic Control Architecture

To find the optimal controller/adversarial policies for every agent one needs to compute the gradient of each optimal value function. While it would seem that a single set of weights would suffice for the approximation, since one can easily differentiate (18), we will independently adjust three sets of weights: the critic weights $\hat W_{ci} \in \mathbb{R}^h$ that approximate the optimal value functions (7) according to
$$\hat V_i = \hat W_{ci}^T\phi_i(s_i), \quad i \in \mathcal{N};$$
the controller actor weights $\hat W_{u_i} \in \mathbb{R}^h$ that approximate the optimal controller in (12) according to
$$\hat u_i = -d_i\begin{bmatrix} 0 \\ I \end{bmatrix}^T\frac{\partial\phi_i}{\partial s_i}^T\hat W_{u_i}, \quad i \in \mathcal{N}; \tag{21}$$
and the actor weights $\hat W_{v_i} \in \mathbb{R}^h$ that approximate the optimal adversary in (12) according to
$$\hat v_i = \frac{d_i}{\gamma_{ii}^2}\begin{bmatrix} 0 \\ I \end{bmatrix}^T\frac{\partial\phi_i}{\partial s_i}^T\hat W_{v_i}, \quad i \in \mathcal{N}. \tag{22}$$


All these approximators share the same set of basis functions $\phi_i$. Adjusting three independent sets of weights carries additional computational burden, but the flexibility introduced by this "over-parametrization" will enable us to establish convergence to a Nash equilibrium and guarantee Lyapunov-based stability.

Remark 3.3 Note that the approximated adversarial input $\hat v_i$ in (22) is an approximation of the worst-case adversarial input (20) that the controller is using. The actual adversarial inputs have the freedom to do anything they want, as long as they remain stealthy. $\square$

Defining the errors $e_i \in \mathbb{R}$, $\forall i \in \mathcal{N}$, between the values of the Hamiltonians in (8) at the optimal value function/policies and their values at the estimated value function/policies leads to
$$\begin{aligned}
e_i \equiv{}& H_i\Big(s_i, \hat W_{ci}^T\frac{\partial\phi_i}{\partial s_i}, \hat u_i, \hat u_{\mathcal{N}_i}, \hat v_i, \hat v_{\mathcal{N}_i}\Big) - H_i\Big(s_i, \frac{\partial V_i^*}{\partial s_i}, u_i^*, u_{\mathcal{N}_i}^*, v_i^*, v_{\mathcal{N}_i}^*\Big) \\
={}& \hat W_{ci}^T\frac{\partial\phi_i}{\partial s_i}\bigg( \begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i + d_i\begin{bmatrix} 0 \\ I \end{bmatrix}(\hat u_i+\hat v_i) - \sum_{j\in\mathcal{N}_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(\hat u_j+\hat v_j)\bigg) \\
&+ \frac{1}{2}\bigg( \|s_i\|^2 + \|\hat u_i\|^2 + \sum_{j\in\mathcal{N}_i}\|\hat u_j\|^2 - \gamma_{ii}^2\|\hat v_i\|^2 - \sum_{j\in\mathcal{N}_i}\gamma_{ij}^2\|\hat v_j\|^2\bigg) \\
={}& \hat W_{ci}^T\frac{\partial\phi_i}{\partial s_i}\bigg( \begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i + d_i\begin{bmatrix} 0 \\ I \end{bmatrix}(\hat u_i+\hat v_i) - \sum_{j\in\mathcal{N}_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(\hat u_j+\hat v_j)\bigg) + r_i, \quad i \in \mathcal{N}, \end{aligned} \tag{23}$$
where $r_i := \frac{1}{2}\big( \|s_i\|^2 + \|\hat u_i\|^2 + \sum_{j\in\mathcal{N}_i}\|\hat u_j\|^2 - \gamma_{ii}^2\|\hat v_i\|^2 - \sum_{j\in\mathcal{N}_i}\gamma_{ij}^2\|\hat v_j\|^2\big)$ and $H_i\big(s_i, \frac{\partial V_i^*}{\partial s_i}, u_i^*, u_{\mathcal{N}_i}^*, v_i^*, v_{\mathcal{N}_i}^*\big) = 0$ from (8). Our goal is to design tuning laws for the critic weights that minimize the squared norm of the errors $e_i$, $\forall i \in \mathcal{N}$:
$$K_i(t) = \frac{1}{2}\|e_i\|^2, \quad i \in \mathcal{N}. \tag{24}$$
We should note, however, that while it is true that if the functions $s_i \mapsto \hat W_{ci}^T\phi_i(s_i)$ satisfy the Bellman equations $\forall i \in \mathcal{N}$, $s_i \in \mathbb{R}^{2m}$, then the errors $e_i(t)$ and their squared errors $K_i(t)$ are zero $\forall i \in \mathcal{N}$, $t \ge 0$, the converse is not necessarily true. Because of this, it will not suffice to show that the $e_i$ converge to zero; instead, we will need to prove explicitly that the critics (and actors) asymptotically approximate their optimal values.
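For concreteness, a tiny sketch of the Bellman residual (23) and the squared objective (24) for one agent follows; it exploits the fact that the optimal Hamiltonian is zero by (8), and all numerical values (weights, regressor, running cost) are hypothetical placeholders.

```python
import numpy as np

def bellman_residual(W_c_hat, omega_i, r_i):
    """Residual e_i from (23): approximate Hamiltonian under the critic weights;
    the optimal Hamiltonian contributes zero by (8)."""
    return W_c_hat @ omega_i + r_i

def critic_objective(e_i):
    """Squared-norm objective K_i from (24) minimized by the critic tuning law."""
    return 0.5 * float(e_i**2)

# Hypothetical values (illustration only): omega_i collects the state/input terms
# multiplying the critic gradient, and r_i is the running cost defined below (23).
W_c_hat = np.array([0.8, 0.1, 0.4])
omega_i = np.array([0.3, -0.2, 0.5])
r_i = 0.7
e_i = bellman_residual(W_c_hat, omega_i, r_i)
print(e_i, critic_objective(e_i))
```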


3.2 Learning Algorithm

The gradient-descent estimate $\hat W_{ci}(t)$ for the critic's weights can be constructed by differentiating $K_i$ in (24) as follows,
$$\dot{\hat W}_{ci} = -\alpha_i\frac{\partial K_i}{\partial\hat W_{ci}} = -\alpha_i\frac{\omega_i}{(1+\omega_i^T\omega_i)^2}e_i^T, \quad i \in \mathcal{N}, \tag{25}$$
where $\omega_i = \frac{\partial\phi_i}{\partial s_i}\Big( \begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i + d_i\begin{bmatrix} 0 \\ I \end{bmatrix}(\hat u_i+\hat v_i) - \sum_{j\in\mathcal{N}_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(\hat u_j+\hat v_j)\Big)$, and the constant $\alpha_i \in \mathbb{R}^+$ essentially determines the speed of convergence.

We shall see below that the following tuning laws for the actors (controllers and adversaries) guarantee Lyapunov stability of the overall system:
$$\begin{aligned}
\dot{\hat W}_{u_i} = \mathrm{Pr}\bigg[\hat W_{u_i},\ \alpha_{u_i}\bigg\{& d_i^2\frac{\partial\phi_i}{\partial s_i}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial\phi_i}{\partial s_i}^T\hat W_{u_i}\frac{\omega_i^T}{(\omega_i^T\omega_i+1)}\hat W_{ci} \\
&+ \sum_{j\in\mathcal{N}_i}d_j^2\frac{\partial\phi_j}{\partial s_j}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial\phi_j}{\partial s_j}^T\hat W_{u_j}\frac{\omega_i^T}{(\omega_i^T\omega_i+1)}\hat W_{cj} + \sigma_{u_i}\big(\hat W_{ci}-\hat W_{u_i}\big)\bigg\}\bigg], \quad i \in \mathcal{N}, \end{aligned} \tag{26}$$
$$\begin{aligned}
\dot{\hat W}_{v_i} = \mathrm{Pr}\bigg[\hat W_{v_i},\ \alpha_{v_i}\bigg\{& -\frac{d_i^2}{\gamma_{ii}^2}\frac{\partial\phi_i}{\partial s_i}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial\phi_i}{\partial s_i}^T\hat W_{v_i}\frac{\omega_i^T}{(\omega_i^T\omega_i+1)}\hat W_{ci} \\
&- \sum_{j\in\mathcal{N}_i}\frac{d_j^2\gamma_{ij}^2}{\gamma_{jj}^4}\frac{\partial\phi_j}{\partial s_j}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial\phi_j}{\partial s_j}^T\hat W_{v_j}\frac{\omega_i^T}{(\omega_i^T\omega_i+1)}\hat W_{cj} + \sigma_{v_i}\big(\hat W_{ci}-\hat W_{v_i}\big)\bigg\}\bigg], \quad i \in \mathcal{N}, \end{aligned} \tag{27}$$
where the constants $\alpha_{u_i}, \alpha_{v_i} \in \mathbb{R}^+$ determine the speed of convergence; $\sigma_{u_i}, \sigma_{v_i} \in \mathbb{R}^+$ are adaptation gains; and $\bar\omega_i := \frac{\omega_i}{\omega_i^T\omega_i+1}$ is a bounded signal, with $\|\bar\omega_i\| \le \bar\omega_{i\max} := \frac{1}{2}$, $\forall i \in \mathcal{N}$.

The symbol $\mathrm{Pr}$ in (26)-(27) denotes the smooth projection operator that is used to modify the adaptation laws. The inclusion of this operator in the tuning laws guarantees that $\hat W_{u_i}$ and $\hat W_{v_i}$ remain inside $S$. The projection operator used in the update laws (26) and (27) provides an effective way [11, 21, 27, 29, 36] to eliminate parameter drift and keeps the basis weights within a priori defined bounds (a bounded convex set). In other words, as stated in Assumption 3.1, the projection operator makes sure that $W_{i\max}$ specifies the boundary and $\varepsilon_{ci\max}$ specifies the boundary tolerance. A block diagram showing the proposed control architecture for agent $i$ is given in Figure 1, where we can see that the controller and the adversary of every agent $i$ are driven by the controllers and adversaries, respectively, of her neighborhood. The intuition behind this is that every agent needs information on the neighborhood "performance" to see how well the agent does (that is why the neighborhood critic weights appear in both (26) and (27)) and on the inputs of the agents in the neighborhood that share a common goal (controllers vs. adversaries).
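The following sketch is a minimal per-agent, forward-Euler realization of the critic update (25) and the actor updates (26)-(27) for an isolated agent (the neighbor sums are omitted for brevity), with the smooth projection replaced by a crude norm-clipping stand-in; all gains, signals, and the projection radius are hypothetical placeholders rather than values prescribed by the paper.

```python
import numpy as np

def project(W, radius=10.0):
    """Crude stand-in for the smooth projection operator Pr[.,.]:
    clips the weight vector onto a ball of the given radius."""
    norm = np.linalg.norm(W)
    return W if norm <= radius else W * (radius / norm)

def learning_step(Wc, Wu, Wv, omega, e, D_i, dt,
                  alpha_c=10.0, alpha_u=1.0, alpha_v=1.0,
                  sigma_u=5.0, sigma_v=5.0, gamma_ii=5.0, d_i=1):
    """One Euler step of (25)-(27) for an isolated agent (neighbor terms dropped).
    omega: regressor vector; e: Bellman residual (23); D_i: dphi/ds [0 0;0 I] dphi/ds^T."""
    norm_w = omega @ omega + 1.0
    Wc = Wc - dt * alpha_c * (omega / norm_w**2) * e                      # critic, (25)
    Wu = project(Wu + dt * alpha_u * (d_i**2 * D_i @ Wu * (omega @ Wc) / norm_w
                                      + sigma_u * (Wc - Wu)))             # controller actor, (26)
    Wv = project(Wv + dt * alpha_v * (-(d_i**2 / gamma_ii**2) * D_i @ Wv * (omega @ Wc) / norm_w
                                      + sigma_v * (Wc - Wv)))             # adversary actor, (27)
    return Wc, Wu, Wv

# Example call with placeholder signals (illustration only).
h = 3
Wc1, Wu1, Wv1 = learning_step(np.zeros(h), np.zeros(h), np.zeros(h),
                              omega=np.array([0.3, -0.2, 0.5]), e=0.7,
                              D_i=np.eye(h), dt=0.01)
print(Wc1, Wu1, Wv1)
```

In the full algorithm, each agent's update also contains the corresponding neighbor terms of (26)-(27), driven by the neighbors' actor and critic weights.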


Fig. 1 Visualization of the proposed control architecture for every agent $i$. The scheme shown is implemented inside the controller of each agent, and uses an approximation of the worst-case adversary.

3.3 Convergence and Stability Analysis

Define the critic and actor estimation errors $\forall i \in \mathcal{N}$ as
$$\tilde W_{ci} = W_i - \hat W_{ci}, \qquad \tilde W_{u_i} = W_i - \hat W_{u_i}, \qquad \tilde W_{v_i} = W_i - \hat W_{v_i}.$$
The following lemma, proved in the Appendix, provides the critic error dynamics for each agent $i \in \mathcal{N}$ in the form of a nominal system plus a perturbation.

Lemma 3.1 The critic estimation error dynamics for agent $i \in \mathcal{N}$ can be written in the following form:
$$\dot{\tilde W}_{ci} = F_{i0} + F_{i1}, \quad \forall i \in \mathcal{N}, \tag{28}$$
where $F_{i0} := -\alpha_i\bar\omega_i\bar\omega_i^T\tilde W_{ci}$ can be viewed as a nominal system;
$$\begin{aligned}
F_{i1} :={}& \frac{\alpha_i\omega_i}{(\omega_i^T\omega_i+1)^2}\bigg( -W_i^T\frac{\partial\phi_i}{\partial s_i}\bigg( d_i\begin{bmatrix} 0 \\ I \end{bmatrix}(\tilde u_i+\tilde v_i) - \sum_{j\in\mathcal{N}_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(\tilde u_j+\tilde v_j)\bigg) \\
&+ \frac{1}{2}\|\tilde u_i\|^2 - \frac{1}{2}\gamma_{ii}^2\|\tilde v_i\|^2 - \tilde u_i^T u_i^* + \gamma_{ii}^2\tilde v_i^T v_i^* + \frac{1}{2}\sum_{j\in\mathcal{N}_i}\Big( \|\tilde u_j\|^2 - \gamma_{ij}^2\|\tilde v_j\|^2 - 2\tilde u_j^T u_j^* + 2\gamma_{ij}^2\tilde v_j^T v_j^*\Big) \\
&- \frac{\partial\varepsilon_{ci}}{\partial s_i}^T\bigg( \begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i + d_i\begin{bmatrix} 0 \\ I \end{bmatrix}(u_i^*+v_i^*) - \sum_{j\in\mathcal{N}_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(u_j^*+v_j^*)\bigg)\bigg), \quad \forall i \in \mathcal{N}, \end{aligned}$$
as a perturbation; $\bar\omega_i$ is the normalized signal defined after (27); and $\tilde u_i := u_i^* - \hat u_i$, $\tilde v_i := v_i^* - \hat v_i$, with $u_i^*, v_i^*, \hat u_i, \hat v_i$ given by (19), (20), (21), and (22), respectively. $\square$

Theorem 3.1 Let the tuning of the critic be given by (25), for each agent $i \in \mathcal{N}$. Then the nominal system $F_{i0}$ from (28) is globally exponentially stable, with its trajectories satisfying $\|\tilde W_{ci}(t)\| \le \|\tilde W_{ci}(t_0)\|\kappa_{i1}e^{-\kappa_{i2}(t-t_0)}$ for some $\kappa_{i1}, \kappa_{i2} \in \mathbb{R}^+$, $\forall t > t_0 \ge 0$, provided that there exists a constant $T > 0$ such that $\bar\omega_i$ is persistently exciting (PE) over every interval of length $T$ (i.e., the matrix $\bar\omega_i\bar\omega_i^T$ is positive definite over any finite interval), in the sense that
$$\int_t^{t+T}\bar\omega_i\bar\omega_i^T\,d\tau \ge \beta_i I, \quad \forall i \in \mathcal{N},\ t \ge t_0,$$
with $\beta_i \in \mathbb{R}^+$ and $I$ the identity matrix of appropriate dimensions.

Proof of Theorem 3.1. Consider the following Lyapunov function, $\forall t \ge 0$,
$$L_i = \frac{1}{2\alpha_i}\tilde W_{ci}^T\tilde W_{ci}, \quad i \in \mathcal{N}, \tag{29}$$
which is decrescent and radially unbounded in the space of $\tilde W_{ci}$. By differentiating (29) along the trajectories of the nominal error-dynamics system one has
$$\dot L_i = -\tilde W_{ci}^T\bar\omega_i\bar\omega_i^T\tilde W_{ci} \le 0, \quad i \in \mathcal{N}.$$
Viewing the nominal system $F_{i0}$ given in (28) as a linear time-varying system, the solution $\tilde W_{ci}$ is given by (the reader is directed to [3] and [22] for the details)
$$\tilde W_{ci}(t) = \Phi_i(t, t_0)\tilde W_{ci}(t_0), \quad i \in \mathcal{N},\ \forall t, t_0 > 0, \tag{30}$$
where the state transition matrix satisfies $\frac{\partial\Phi_i(t,t_0)}{\partial t} := -\alpha_i\bar\omega_i\bar\omega_i^T\Phi_i(t, t_0)$. We can prove, by following Theorem 1 from [3], that for the nominal system the equilibrium point is exponentially stable provided that $\bar\omega_i$ is PE, and therefore for some $\kappa_{i1}, \kappa_{i2} \in \mathbb{R}^+$ we can write
$$\|\Phi_i(t, t_0)\| \le \kappa_{i1}e^{-\kappa_{i2}(t-t_0)}, \quad i \in \mathcal{N},\ \forall t, t_0 > 0. \tag{31}$$
Finally, by combining (30) and (31) we have
$$\|\tilde W_{ci}(t)\| \le \|\tilde W_{ci}(t_0)\|\kappa_{i1}e^{-\kappa_{i2}(t-t_0)}, \quad i \in \mathcal{N},\ \forall t, t_0 > 0,$$
from which the result follows. $\square$
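A quick numerical illustration of Theorem 3.1: integrating the nominal error system $\dot{\tilde W}_{ci} = -\alpha_i\bar\omega_i\bar\omega_i^T\tilde W_{ci}$ with a persistently exciting regressor (the sinusoidal signal below is a hypothetical choice made only to satisfy PE) shows the decay of $\|\tilde W_{ci}\|$ toward zero.

```python
import numpy as np

alpha, dt, T = 10.0, 1e-3, 5.0
W_tilde = np.array([1.0, -2.0])                 # initial critic weight error
norms = []
for k in range(int(T / dt)):
    t = k * dt
    omega = np.array([np.sin(t), np.cos(3 * t)])            # persistently exciting regressor
    omega_bar = omega / (omega @ omega + 1.0)                # normalized signal
    W_tilde = W_tilde - dt * alpha * np.outer(omega_bar, omega_bar) @ W_tilde
    norms.append(np.linalg.norm(W_tilde))
print(norms[0], norms[len(norms) // 2], norms[-1])           # monotone decay toward zero
```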


Remark 3.4 Note that the parameters $\kappa_{i1}$ and $\kappa_{i2}$ depend on the PE condition; the reader is directed to [3], [22] for a more detailed explanation. Typically, one can ensure PE of the vector signal $\bar\omega_i$ by adding sinusoids of different frequencies to the inputs. This condition is equivalent to the state-space exploration required in reinforcement learning [43] for convergence to the optimal policies. For linear systems of order $n$ one can guarantee PE by adding $\frac{n(n+1)}{2}$ different frequencies [22]. It has been shown in [22] and [50] (cf. Chapter 7) that gradient-descent algorithms of the above form converge exponentially fast, with a rate that depends on the level of excitation $\beta_i$ and the size of the time interval $T$. $\square$

The main theorem is presented next and provides a Lyapunov-based stability proof for the proposed game-theoretic learning controller.

Theorem 3.2 Consider the dynamics given by (5), the controller given by (21), the adversary given by (22), the critic tuning law given by (25), the controller tuning law given by (26), and the adversarial tuning law given by (27). Assume that the coupled Bellman equations (8) have locally smooth solutions $V_i^*(s_i)$, $\forall s_i$, $\forall i \in \mathcal{N}$. Assume that the signal $\bar\omega_i$ is persistently exciting, that Assumption 3.1 holds, and that $d_i \neq 0$ for all $i \in \mathcal{N}$. Then the solution $\big(s_i(t), \tilde W_{ci}(t), \tilde W_{u_i}(t), \tilde W_{v_i}(t)\big)$, for all $\big(s_i(0), \tilde W_{ci}(0), \tilde W_{u_i}(0), \tilde W_{v_i}(0)\big)$, $\forall i \in \mathcal{N}$, is Uniformly Ultimately Bounded (UUB). $\square$

Remark 3.5 By picking some parameters appropriately, we can guarantee the asymptotic reduction of the tracking errors $s_i$, $\forall i \in \mathcal{N}$, and of the actor/critic weight estimation errors to a small neighborhood of zero (cf. Remark 3.8). $\square$

Remark 3.6 Although we are considering fixed-topology graphs for the networked team, the proof still goes through if the team breaks into multiple disconnected sub-teams, provided that each agent retains at least one neighbor ($\mathcal{N}_i \neq \emptyset$). In this case, the sub-teams formed will reach separate consensus values. $\square$

Remark 3.7 An online learning framework such as the one in Theorem 3.2 allows the controller to adapt to changing conditions. Each agent might learn defense strategies online while adversaries perform sequences of simple strategies. The learning algorithm is not known to the adversary, since the adversary can apply any policy. If the adversary wants to know which learning framework the controller uses, there are analyses that could enable the adversary to learn the framework. Our Theorem 3.2 (and the corollary that follows) shows that in order to achieve a proper approximation one has to use a large number of basis functions, and hence, if the controller and the controller's estimate of the worst-case adversary do not use a good approximation, they will deviate from the Nash equilibrium. The learning framework would also allow the distributed performance criteria to be changed on the fly. The use of the same basis functions in (21)-(22) is only to avoid introducing additional symbols. The real adversary can conceivably do anything. Finally, it will be difficult for the adversary to have any information regarding the type of approximation used by the controller, since this process could require an amount of experimentation that could lead the adversary to perform expensive computations over the state space. $\square$

Corollary 3.1 Suppose that the hypotheses and the statements of Theorem 3.2 hold. Then the policies $\hat u_i, \hat v_i$, $\forall i \in \mathcal{N}$, form a Nash equilibrium for a sufficiently large number of basis functions $h$ (cf. Remark 3.2).

Proof. First we compute $u_i^* - \hat u_i$ by using (19), (21), and the fact that $\tilde W_{u_i} = W_i - \hat W_{u_i}$, as
$$\tilde u_i := u_i^* - \hat u_i = -d_i\begin{bmatrix} 0 \\ I \end{bmatrix}^T\Big(\frac{\partial\phi_i}{\partial s_i}(s_i)^T W_i + \frac{\partial\varepsilon_{ci}}{\partial s_i}\Big) + d_i\begin{bmatrix} 0 \\ I \end{bmatrix}^T\frac{\partial\phi_i}{\partial s_i}^T\hat W_{u_i} = -d_i\begin{bmatrix} 0 \\ I \end{bmatrix}^T\Big(\frac{\partial\phi_i}{\partial s_i}(s_i)^T\tilde W_{u_i} + \frac{\partial\varepsilon_{ci}}{\partial s_i}\Big).$$
Similarly, for $v_i^* - \hat v_i$, by using (20), (22), and $\tilde W_{v_i} = W_i - \hat W_{v_i}$,
$$\tilde v_i := v_i^* - \hat v_i = \frac{d_i}{\gamma_{ii}^2}\begin{bmatrix} 0 \\ I \end{bmatrix}^T\Big(\frac{\partial\phi_i}{\partial s_i}(s_i)^T\tilde W_{v_i} + \frac{\partial\varepsilon_{ci}}{\partial s_i}\Big).$$
According to the result of Theorem 3.2, we have proved that $\tilde W_{u_i}$ and $\tilde W_{v_i}$ are UUB. Moreover, using Assumption 3.1 we have upper bounds on $\frac{\partial\varepsilon_{ci}}{\partial s_i}$ and $\frac{\partial\phi_i}{\partial s_i}$, and thus all the quantities on the right-hand sides of $\tilde u_i$ and $\tilde v_i$ are bounded. Then, following the results from [46], we can conclude that the approximated policies, as one selects a large number of basis functions (cf. Remark 3.8), satisfy (4). $\square$

3.4 Proof of Theorem 3.2

Consider the following continuously differentiable positive definite candidate Lyapunov function $V_L : \mathbb{R}^{2m}\times\mathbb{R}^h\times\mathbb{R}^h\times\mathbb{R}^h \to \mathbb{R}$ defined as
$$V_L := \sum_{i=1}^N V_{Li}(s_i, \tilde W_{ci}, \tilde W_{u_i}, \tilde W_{v_i}),$$
with
$$V_{Li}(s_i, \tilde W_{ci}, \tilde W_{u_i}, \tilde W_{v_i}) := V_i^*(s_i) + V_{ci}(\tilde W_{ci}) + \frac{\alpha_{u_i}^{-1}}{2}\tilde W_{u_i}^T\tilde W_{u_i} + \frac{\alpha_{v_i}^{-1}}{2}\tilde W_{v_i}^T\tilde W_{v_i},$$
where $V_i^*(s_i)$ is the Lyapunov function for (5), (12) used in the proof of Theorem 2.1, $V_{ci}(\tilde W_{ci}) := \|\tilde W_{ci}\|^2$, and $\alpha_{u_i}$ and $\alpha_{v_i}$ are positive scalars.

From the positive definiteness and radial unboundedness of the $V_i^*$, we conclude that there exist class $\mathcal{K}$ functions [25] $k_{i1}$ and $k_{i2}$ such that $V_i^*$ satisfies
$$k_{i1}(\|s_i\|) \le V_i^*(s_i) \le k_{i2}(\|s_i\|), \quad \forall s_i,\ i \in \mathcal{N}.$$
We can write
$$k_{i1}(\|s_i\|) + \|\tilde W_{ci}\|^2 + \frac{1}{2\alpha_{u_i}}\|\tilde W_{u_i}\|^2 + \frac{1}{2\alpha_{v_i}}\|\tilde W_{v_i}\|^2 \le V_{Li} \le k_{i2}(\|s_i\|) + \|\tilde W_{ci}\|^2 + \frac{1}{2\alpha_{u_i}}\|\tilde W_{u_i}\|^2 + \frac{1}{2\alpha_{v_i}}\|\tilde W_{v_i}\|^2.$$

Computing the time derivative of $V_i^*$, $\forall i \in \mathcal{N}$, along the solutions of the closed-loop system (5) with controller $\hat u_i$ and adversary $\hat v_i$, and of $V_{ci}$ along the perturbed system (40), leads to
$$\dot V_{Li} = \frac{\partial V_i^*}{\partial s_i}^T\bigg( \begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i + d_i\begin{bmatrix} 0 \\ I \end{bmatrix}(\hat u_i+\hat v_i) - \sum_{j\in\mathcal{N}_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(\hat u_j+\hat v_j)\bigg) + \frac{\partial V_{ci}}{\partial\tilde W_{ci}}\big(F_{i0}+F_{i1}\big) - \dot{\hat W}_{u_i}^T\alpha_{u_i}^{-1}\tilde W_{u_i} - \dot{\hat W}_{v_i}^T\alpha_{v_i}^{-1}\tilde W_{v_i}. \tag{32}$$
We can write the coupled Bellman equations (8) as
$$\frac{\partial V_i^*}{\partial s_i}^T\begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i = -\frac{\partial V_i^*}{\partial s_i}^T\bigg( d_i\begin{bmatrix} 0 \\ I \end{bmatrix}(u_i^*+v_i^*) - \sum_{j\in\mathcal{N}_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(u_j^*+v_j^*)\bigg) - \frac{1}{2}\bigg( \|s_i\|^2 + \|u_i^*\|^2 + \sum_{j\in\mathcal{N}_i}\|u_j^*\|^2 - \gamma_{ii}^2\|v_i^*\|^2 - \sum_{j\in\mathcal{N}_i}\gamma_{ij}^2\|v_j^*\|^2\bigg). \tag{33}$$

Setting $\hat u_i = u_i^* - \tilde u_i$, $\hat v_i = v_i^* - \tilde v_i$, and using (26), (27), and (33), we can bound (32) as
$$\begin{aligned}
\dot V_{Li} \le{}& -\frac{\partial V_i^*}{\partial s_i}^T\bigg( d_i\begin{bmatrix} 0 \\ I \end{bmatrix}(\tilde u_i+\tilde v_i) - \sum_{j\in\mathcal{N}_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(\tilde u_j+\tilde v_j)\bigg) - \frac{1}{2}\bigg( \|s_i\|^2 + \|u_i^*\|^2 + \sum_{j\in\mathcal{N}_i}\|u_j^*\|^2 - \gamma_{ii}^2\|v_i^*\|^2 - \sum_{j\in\mathcal{N}_i}\gamma_{ij}^2\|v_j^*\|^2\bigg) \\
&- \alpha_i\|\tilde W_{ci}\|^2 + 2\tilde W_{ci}^T F_{i1} \\
&+ \tilde W_{u_i}^T\bigg( \sigma_{u_i}\big(\hat W_{ci}-\hat W_{u_i}\big) - d_i^2\frac{\partial\phi_i}{\partial s_i}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial\phi_i}{\partial s_i}^T\hat W_{u_i}\frac{\omega_i^T}{(\omega_i^T\omega_i+1)}\hat W_{ci} - \sum_{j\in\mathcal{N}_i}d_j^2\frac{\partial\phi_j}{\partial s_j}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial\phi_j}{\partial s_j}^T\hat W_{u_j}\frac{\omega_i^T}{(\omega_i^T\omega_i+1)}\hat W_{cj}\bigg) \\
&+ \tilde W_{v_i}^T\bigg( \sigma_{v_i}\big(\hat W_{ci}-\hat W_{v_i}\big) + \frac{d_i^2}{\gamma_{ii}^2}\frac{\partial\phi_i}{\partial s_i}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial\phi_i}{\partial s_i}^T\hat W_{v_i}\frac{\omega_i^T}{(\omega_i^T\omega_i+1)}\hat W_{ci} + \sum_{j\in\mathcal{N}_i}\frac{d_j^2\gamma_{ij}^2}{\gamma_{jj}^4}\frac{\partial\phi_j}{\partial s_j}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial\phi_j}{\partial s_j}^T\hat W_{v_j}\frac{\omega_i^T}{(\omega_i^T\omega_i+1)}\hat W_{cj}\bigg). \end{aligned} \tag{34}$$
Using $-d_i\frac{\partial V_i^*}{\partial s_i}^T\begin{bmatrix} 0 \\ I \end{bmatrix} = u_i^{*T}$ and $\frac{d_i}{\gamma_{ii}^2}\frac{\partial V_i^*}{\partial s_i}^T\begin{bmatrix} 0 \\ I \end{bmatrix} = v_i^{*T}$, (34) can be written as
$$\begin{aligned}
\dot V_{Li} \le{}& u_i^{*T}\tilde u_i - v_i^{*T}\tilde v_i - \frac{1}{d_i}u_i^{*T}\sum_{j\in\mathcal{N}_i}\tilde u_j + \frac{\gamma_{ii}^2}{d_i}v_i^{*T}\sum_{j\in\mathcal{N}_i}\tilde v_j - \frac{1}{2}\bigg( \|s_i\|^2 + \|u_i^*\|^2 + \sum_{j\in\mathcal{N}_i}\|u_j^*\|^2 - \gamma_{ii}^2\|v_i^*\|^2 - \sum_{j\in\mathcal{N}_i}\gamma_{ij}^2\|v_j^*\|^2\bigg) \\
&- \alpha_i\|\tilde W_{ci}\|^2 + 2\tilde W_{ci}^T\bigg( -W_i^T\frac{\partial\phi_i}{\partial s_i}\bigg( d_i\begin{bmatrix} 0 \\ I \end{bmatrix}(\tilde u_i+\tilde v_i) - \sum_{j\in\mathcal{N}_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(\tilde u_j+\tilde v_j)\bigg) + \frac{1}{2}\|\tilde u_i\|^2 - \frac{1}{2}\gamma_{ii}^2\|\tilde v_i\|^2 - \tilde u_i^T u_i^* + \gamma_{ii}^2\tilde v_i^T v_i^* \\
&\quad + \frac{1}{2}\sum_{j\in\mathcal{N}_i}\Big( \|\tilde u_j\|^2 - \gamma_{ij}^2\|\tilde v_j\|^2 - \tilde u_j^T u_j^* + \gamma_{ij}^2\tilde v_j^T v_j^*\Big) - \frac{\partial\varepsilon_{ci}}{\partial s_i}^T\bigg( \begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i + d_i\begin{bmatrix} 0 \\ I \end{bmatrix}(u_i^*+v_i^*) - \sum_{j\in\mathcal{N}_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(u_j^*+v_j^*)\bigg)\bigg) \\
&+ \tilde W_{u_i}^T\bigg( \sigma_{u_i}\big(\hat W_{ci}-\hat W_{u_i}\big) - d_i^2\frac{\partial\phi_i}{\partial s_i}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial\phi_i}{\partial s_i}^T\big(W_i-\tilde W_{u_i}\big)\frac{\omega_i^T}{(\omega_i^T\omega_i+1)}\big(W_i-\tilde W_{ci}\big) - \sum_{j\in\mathcal{N}_i}d_j^2\frac{\partial\phi_j}{\partial s_j}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial\phi_j}{\partial s_j}^T\big(W_j-\tilde W_{u_j}\big)\frac{\omega_i^T}{(\omega_i^T\omega_i+1)}\big(W_j-\tilde W_{cj}\big)\bigg) \\
&+ \tilde W_{v_i}^T\bigg( \sigma_{v_i}\big(\hat W_{ci}-\hat W_{v_i}\big) + \frac{d_i^2}{\gamma_{ii}^2}\frac{\partial\phi_i}{\partial s_i}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial\phi_i}{\partial s_i}^T\big(W_i-\tilde W_{v_i}\big)\frac{\omega_i^T}{(\omega_i^T\omega_i+1)}\big(W_i-\tilde W_{ci}\big) + \sum_{j\in\mathcal{N}_i}\frac{d_j^2\gamma_{ij}^2}{\gamma_{jj}^4}\frac{\partial\phi_j}{\partial s_j}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial\phi_j}{\partial s_j}^T\big(W_j-\tilde W_{v_j}\big)\frac{\omega_i^T}{(\omega_i^T\omega_i+1)}\big(W_j-\tilde W_{cj}\big)\bigg). \end{aligned} \tag{35}$$

Using Fact 1 in the Appendix and expanding the parentheses, we can further upper bound (35) as
$$\begin{aligned}
\dot V_{Li} \le{}& -\sigma_{u_i}\|\tilde W_{u_i}\|^2 - \sigma_{v_i}\|\tilde W_{v_i}\|^2 - \frac{1}{2}\|s_i\|^2 - \alpha_i\|\tilde W_{ci}\|^2 + b_{i5} \\
&+ \Big( 2b_{i4} + \sigma_{u_i}b_{i2} + \sigma_{v_i}b_{i3} + \bar\omega_{i\max}b_{i2}b_{i1}d_i^2W_{i\max} + \bar\omega_{i\max}b_{i1}b_{i2}^2d_i^2 + \frac{\bar\omega_{i\max}b_{i1}b_{i3}d_i^2}{\gamma_{ii}^2}W_{i\max} + \frac{\bar\omega_{i\max}b_{i1}b_{i3}^2d_i^2}{\gamma_{ii}^2}\Big)\|\tilde W_{ci}\| \\
&+ \sum_{j\in\mathcal{N}_i}\Big( \bar\omega_{i\max}b_{i2}d_j^2b_{j1}W_{j\max} + \bar\omega_{i\max}b_{i2}d_j^2b_{j1}b_{j2} + \bar\omega_{i\max}b_{i3}\frac{d_j^2b_{j1}\gamma_{ij}^2}{\gamma_{jj}^4}W_{j\max} + \bar\omega_{i\max}b_{i3}b_{j1}b_{j3}\frac{d_j^2\gamma_{ij}^2}{\gamma_{jj}^4}\Big)\|\tilde W_{cj}\|. \end{aligned} \tag{36}$$

Summing the last term in the equation above over all agents, we conclude that
$$\sum_{i=1}^N\bigg( \sum_{j\in\mathcal{N}_i}\Big( \bar\omega_{i\max}b_{i2}d_j^2b_{j1}W_{j\max} + \bar\omega_{i\max}b_{i2}d_j^2b_{j1}b_{j2} + \bar\omega_{i\max}b_{i3}\frac{d_j^2b_{j1}\gamma_{ij}^2}{\gamma_{jj}^4}W_{j\max} + \bar\omega_{i\max}b_{i3}b_{j1}b_{j3}\frac{d_j^2\gamma_{ij}^2}{\gamma_{jj}^4}\Big)\|\tilde W_{cj}\|\bigg) \le \sum_{i=1}^N b_{i7}\|\tilde W_{ci}\|$$

for sufficiently large constants $b_{i7}$. Completing the square in (36), we thus obtain
$$\dot V_L := \sum_{i=1}^N\dot V_{Li} \le \sum_{i=1}^N\Big( -\sigma_{u_i}\|\tilde W_{u_i}\|^2 - \sigma_{v_i}\|\tilde W_{v_i}\|^2 - \frac{1}{2}\|s_i\|^2 - (1-\delta)\alpha_i\|\tilde W_{ci}\|^2 + \mu_i\Big), \tag{37}$$
where $\delta \in (0, 1)$ and
$$\mu_i := b_{i5} + \frac{1}{4\alpha_i\delta}\Big( 2b_{i4} + \sigma_{u_i}b_{i2} + \sigma_{v_i}b_{i3} + \bar\omega_{i\max}b_{i2}b_{i1}d_i^2W_{i\max} + \bar\omega_{i\max}b_{i1}b_{i2}^2d_i^2 + \frac{\bar\omega_{i\max}b_{i1}b_{i3}d_i^2}{\gamma_{ii}^2}W_{i\max} + \frac{\bar\omega_{i\max}b_{i1}b_{i3}^2d_i^2}{\gamma_{ii}^2} + b_{i7}\Big)^2. \tag{38}$$

According to Lemma 4.3 in [25] and defining $Q_i(t) := \big[s_i(t)\ \tilde W_{ci}(t)\ \tilde W_{u_i}(t)\ \tilde W_{v_i}(t)\big]$, there exist class $\mathcal{K}$ functions $k_{i5}, k_{i6}$ such that
$$k_{i5}(\|Q_i\|) \le \sigma_{u_i}\|\tilde W_{u_i}\|^2 + \sigma_{v_i}\|\tilde W_{v_i}\|^2 + \frac{1}{2}\|s_i\|^2 + (1-\delta)\alpha_i\|\tilde W_{ci}\|^2 \le k_{i6}(\|Q_i\|),$$
which can be used to further upper bound (37):
$$\dot V_L \le \sum_{i=1}^N\Big( -k_{i5}(\|Q_i\|) + \mu_i\Big).$$
From this, we conclude that $\dot V_L < 0$ whenever the combined state $(Q_1, Q_2, \dots, Q_N)$ lies outside the compact set
$$\Omega_Q = \Big\{ (Q_1, Q_2, \dots, Q_N) : \sum_{i=1}^N k_{i5}(\|Q_i\|) \le \sum_{i=1}^N\mu_i\Big\}.$$
Therefore, there exists a $T_u$ such that for all $t \ge T_u$ the state $(Q_1, Q_2, \dots, Q_N)$ remains inside $\Omega_Q$, from which Uniform Ultimate Boundedness (UUB) follows [25]. $\square$

Remark 3.8 One can make the set $\Omega_Q$ small by decreasing the values of the $\mu_i$ in (38). To accomplish this, one can select a large value for the critic tuning gain $\alpha_i$ (which decreases the second term in (38)) and select a large number $h$ of basis functions to decrease the errors $\varepsilon_{ci}$ and consequently the constant $b_{i5}$ in (38) (see Fact 1 in the Appendix). $\square$

4 Simulation Results

We now illustrate the theoretical developments for a graph $\mathcal{G}_1$ with 5 agents networked as shown in Figure 2. The gains in the tuning laws are selected as $\sigma_{u_i} = 5$, $\sigma_{v_i} = 5$, $\alpha_i = 10$, $\alpha_{u_i} = 1$, $\alpha_{v_i} = 1$, $\forall i \in \mathcal{N}$, and all the weights of the tuning laws are initialized randomly inside $[-1, 1]$. Although Assumption 3.1 requires bounded basis functions, in practice one could pick these functions to be quadratic forms, e.g., $s_i \otimes s_i$, and still obtain satisfactory results. To ensure PE, a probing signal $\rho(t) = \tanh(t)\big(\sin(2t)\cos(t) + 4\sin(t^2) + 10\cos(5t)\sin(11t)\big)$ is added to the input signals for the first 2 seconds of the simulation.
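For concreteness, a short sketch of the gain selections and the probing signal described above is given below; the full closed-loop simulation (graph, agent dynamics, and learning updates) is omitted, and the random seed is an arbitrary choice.

```python
import numpy as np

# Gains as stated in the text (illustration of the stated selections).
sigma_u, sigma_v, alpha_c, alpha_u, alpha_v = 5.0, 5.0, 10.0, 1.0, 1.0
rng = np.random.default_rng(0)
W_init = rng.uniform(-1.0, 1.0, size=3)        # weights initialized randomly in [-1, 1]

def probing(t):
    """Exploratory signal rho(t) added to the inputs for the first 2 s to ensure PE."""
    if t > 2.0:
        return 0.0
    return np.tanh(t) * (np.sin(2 * t) * np.cos(t)
                         + 4 * np.sin(t**2)
                         + 10 * np.cos(5 * t) * np.sin(11 * t))

print(W_init, probing(0.5), probing(3.0))
```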

We start by considering agents that move on a line ($m = 1$) with random initial conditions.

Figures 3(a)-3(b) show a simulation of the standard consensus algorithm [33] under two different attack scenarios, in which a constant bias and a time-varying bias, respectively, are added to the measurements; these represent two very basic forms of attack. The simulation shows that even these simple attacks can result in unstable behavior. This is in contrast to what can be observed in Figures 3(c)-3(d), which show the behavior of the algorithm proposed in this paper (with $\gamma_{ii} = 5$, $\gamma_{ij} = 1$, $\forall i \neq j \in \mathcal{N}$) and which results in position and velocity agreement. We now consider a prototypical case that tests our algorithm to its limits: the actual adversary uses quadratic basis functions of the form $s_i \otimes s_i$, while the controller's adversarial estimate uses a subset of those basis functions. In this case, Figures 4(a)-4(b) show that the actual adversary is able to make one agent diverge and the velocities of the rest oscillate; it is worth noting that when the controller does not use a good set of basis functions, the performance degrades. Conversely, in the more likely scenario in which the actual adversary uses fewer basis functions than the controller's adversarial estimate, the optimal consensus looks similar to Figures 3(c)-3(d). Another interesting scenario is when the adversaries and the controllers play their Nash solution but the adversaries additionally do not let the agents agree on a common velocity value; this case is shown in Figures 5(a)-5(b).

We shall now consider agents moving in the plane ($m = 2$) with random initial conditions. Figure 6(a) illustrates how even a simple adversarial input consisting of adding a constant bias to the measurements can lead to instability for the regular consensus algorithm. Figure 6(b) shows that, also here, the algorithm proposed in this paper (with $\gamma_{ii} = 5$, $\gamma_{ij} = 1$, $\forall i \neq j \in \mathcal{N}$) results in asymptotic consensus.

Fig. 2 Graph G1 describing the topology of a network of 5 agents.

(a) Adversary adds constant biases. (b) Adversary adds time varying biases.

(c) Positions with the proposed algorithm. (d) Velocities with the proposed algorithm.

Fig. 3 Agents perturbed by adversaries connected as in G1 moving in a line.


(a) Positions when the adversaries use the complete quadratic basis set while the controller's adversarial estimate uses a subset of it. (b) Velocities when the adversaries use the complete quadratic basis set while the controller's adversarial estimate uses a subset of it.

Fig. 4 The proposed algorithm when the agents are perturbed by adversaries that use the complete quadratic basis set while the controller's adversarial estimate uses a subset of it, connected as in G1 and moving on a line.

(a) Positions with the proposed framework when the intelligent adversaries do not let the agents reach a steady-state velocity value. (b) Velocities with the proposed framework when the intelligent adversaries do not let the agents reach a steady-state velocity value.

Fig. 5 The proposed algorithm when the agents are perturbed by adversaries that do not let the agents reach a steady-state velocity value, connected as in G1 and moving on a line.

5 Conclusions

We have derived a game-theoretic actor-critic algorithm for reaching consensus of multi-agent systems with guaranteed performance when the dynamics are perturbed by persistent adversaries. It was shown that the proposed architecture is able to optimally reject adversarial inputs using a distributed algorithm. The optimization is performed online and results in agreement among all the agents in the team.

Three approximators are used for each agent: one to approximate the optimal value function, another to approximate the optimal controller, and a final one to approximate the adversary. Each of these approximators uses a specially designed tuning law. A Lyapunov-based argument is used to ensure that all the signals remain bounded. Simulation results show the effectiveness of the proposed approach. This work focused on adversarial inputs that involve measurement corruption. More sophisticated adversarial inputs that utilize multiple points or multiple methods are important problems for future research. Other problems of interest include coordinated actions, where multiple adversaries cooperate to maximize damage, as well as disguised attacks, where the adversary can mask its actions to induce an erroneous reaction for mitigation.

(a) Without the proposed algorithm. (b) With the proposed algorithm.

Fig. 6 Agents perturbed by adversaries, connected as in G1 and moving in the two-dimensional space.

Appendix

Proof of Lemma 3.1. The error in equation (23), after subtracting zero, becomes
\[
\begin{aligned}
e_i ={}& \hat W_{ci}^{T}\frac{\partial \phi_i}{\partial s_i}\left(\begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i + d_i \begin{bmatrix} 0 \\ I \end{bmatrix}(\hat u_i + \hat v_i) - \sum_{j\in N_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(\hat u_j + \hat v_j)\right)\\
&- W_i^{T}\frac{\partial \phi_i}{\partial s_i}\left(\begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i + d_i \begin{bmatrix} 0 \\ I \end{bmatrix}(u_i^{*} + v_i^{*}) - \sum_{j\in N_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(u_j^{*} + v_j^{*})\right)\\
&+ \frac{1}{2}\left(\|\hat u_i\|^{2} + \sum_{j\in N_i}\|\hat u_j\|^{2} - \gamma_{ii}^{2}\|\hat v_i\|^{2} - \sum_{j\in N_i}\gamma_{ij}^{2}\|\hat v_j\|^{2}\right)\\
&- \frac{1}{2}\left(\|u_i^{*}\|^{2} + \sum_{j\in N_i}\|u_j^{*}\|^{2} - \gamma_{ii}^{2}\|v_i^{*}\|^{2} - \sum_{j\in N_i}\gamma_{ij}^{2}\|v_j^{*}\|^{2}\right)\\
&- \frac{\partial \varepsilon_{ci}}{\partial s_i}^{T}\left(\begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i + d_i \begin{bmatrix} 0 \\ I \end{bmatrix}(u_i^{*} + v_i^{*}) - \sum_{j\in N_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(u_j^{*} + v_j^{*})\right), \quad \forall i \in N.
\end{aligned}
\]
Completing the squares we have
\[
\begin{aligned}
e_i ={}& -\tilde W_{ci}^{T}\omega_i - W_i^{T}\frac{\partial \phi_i}{\partial s_i}\left(d_i \begin{bmatrix} 0 \\ I \end{bmatrix}(\hat u_i + \hat v_i) - \sum_{j\in N_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(\hat u_j + \hat v_j)\right)\\
&+ \frac{1}{2}\|\hat u_i\|^{2} - \frac{1}{2}\gamma_{ii}^{2}\|\hat v_i\|^{2} - \hat u_i^{T}u_i^{*} + \gamma_{ii}^{2}\hat v_i^{T}v_i^{*}\\
&+ \frac{1}{2}\sum_{j\in N_i}\left(\|\hat u_j\|^{2} - \gamma_{ij}^{2}\|\hat v_j\|^{2} - 2\hat u_j^{T}u_j^{*} + 2\gamma_{ij}^{2}\hat v_j^{T}v_j^{*}\right)\\
&- \frac{\partial \varepsilon_{ci}}{\partial s_i}^{T}\left(\begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i + d_i \begin{bmatrix} 0 \\ I \end{bmatrix}(u_i^{*} + v_i^{*}) - \sum_{j\in N_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(u_j^{*} + v_j^{*})\right), \quad \forall i \in N. \qquad (39)
\end{aligned}
\]
The dynamics of the critic estimation error $\tilde W_{ci}$ are found by substituting (39) into (25), which gives
\[
\begin{aligned}
\dot{\tilde W}_{ci} ={}& -\alpha_i \omega_i \omega_i^{T}\tilde W_{ci} + \frac{\alpha_i \omega_i}{(\omega_i^{T}\omega_i + 1)^{2}}\Bigg(- W_i^{T}\frac{\partial \phi_i}{\partial s_i}\left(d_i \begin{bmatrix} 0 \\ I \end{bmatrix}(\hat u_i + \hat v_i) - \sum_{j\in N_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(\hat u_j + \hat v_j)\right)\\
&+ \frac{1}{2}\|\hat u_i\|^{2} - \frac{1}{2}\gamma_{ii}^{2}\|\hat v_i\|^{2} - \hat u_i^{T}u_i^{*} + \gamma_{ii}^{2}\hat v_i^{T}v_i^{*}
+ \frac{1}{2}\sum_{j\in N_i}\left(\|\hat u_j\|^{2} - \gamma_{ij}^{2}\|\hat v_j\|^{2} - 2\hat u_j^{T}u_j^{*} + 2\gamma_{ij}^{2}\hat v_j^{T}v_j^{*}\right)\\
&- \frac{\partial \varepsilon_{ci}}{\partial s_i}^{T}\left(\begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i + d_i \begin{bmatrix} 0 \\ I \end{bmatrix}(u_i^{*} + v_i^{*}) - \sum_{j\in N_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(u_j^{*} + v_j^{*})\right)\Bigg), \quad \forall i \in N, \qquad (40)
\end{aligned}
\]
from which the result follows. □

The following fact is a consequence of Assumption 3.1 and the properties of the projection operator $\mathrm{Pr}[\cdot]$.

Fact 1 There exist constants $b_{i1}, b_{i2}, b_{i3}, b_{i4}, b_{i5} \in \mathbb{R}^{+}$ for which the following bounds hold, for every agent $i \in N$ and all times $t \ge 0$:
\[
\left\|\frac{\partial \phi_i}{\partial s_i}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial \phi_i}{\partial s_i}^{T}\right\| \le b_{i1}, \qquad \|\hat W_{ui}\| \le b_{i2}, \qquad \|\hat W_{vi}\| \le b_{i3},
\]
\[
\begin{aligned}
\Bigg\| -W_i^{T}\frac{\partial \phi_i}{\partial s_i}&\left(d_i \begin{bmatrix} 0 \\ I \end{bmatrix}(\hat u_i + \hat v_i) - \sum_{j\in N_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(\hat u_j + \hat v_j)\right)
+ \frac{1}{2}\|\hat u_i\|^{2} - \frac{1}{2}\gamma_{ii}^{2}\|\hat v_i\|^{2} - \hat u_i^{T}u_i^{*} + \gamma_{ii}^{2}\hat v_i^{T}v_i^{*}\\
&+ \frac{1}{2}\sum_{j\in N_i}\left(\|\hat u_j\|^{2} - \gamma_{ij}^{2}\|\hat v_j\|^{2} - 2\hat u_j^{T}u_j^{*} + 2\gamma_{ij}^{2}\hat v_j^{T}v_j^{*}\right)\\
&- \frac{\partial \varepsilon_{ci}}{\partial s_i}^{T}\left(\begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} s_i + d_i \begin{bmatrix} 0 \\ I \end{bmatrix}(u_i^{*} + v_i^{*}) - \sum_{j\in N_i}\begin{bmatrix} 0 \\ I \end{bmatrix}(u_j^{*} + v_j^{*})\right)\Bigg\| \le b_{i4}, \qquad (41)
\end{aligned}
\]
\[
\begin{aligned}
\Bigg\| u_i^{*T}\bar u_i - v_i^{*T}\bar v_i &- \frac{u_i^{*T}}{d_i}\sum_{j\in N_i}\hat u_j + \frac{\gamma_{ii}^{2}}{d_i}v_i^{*T}\sum_{j\in N_i}\hat v_j
- \frac{1}{2}\left(\|u_i^{*}\|^{2} + \sum_{j\in N_i}\|u_j^{*}\|^{2} - \gamma_{ii}^{2}\|v_i^{*}\|^{2} - \sum_{j\in N_i}\gamma_{ij}^{2}\|v_j^{*}\|^{2}\right)\\
&- \hat W_{ui}^{T}\Bigg(d_i^{2}\frac{\partial \phi_i}{\partial s_i}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial \phi_i}{\partial s_i}^{T} W_i\,\frac{\omega_i^{T}W_i}{\omega_i^{T}\omega_i + 1}
- d_i^{2}\frac{\partial \phi_i}{\partial s_i}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial \phi_i}{\partial s_i}^{T} \hat W_{ui}\,\frac{\omega_i^{T}W_i}{\omega_i^{T}\omega_i + 1}\\
&\qquad + \sum_{j\in N_i} d_j^{2}\frac{\partial \phi_j}{\partial s_j}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial \phi_j}{\partial s_j}^{T} W_j\,\frac{\omega_i^{T}W_j}{\omega_i^{T}\omega_i + 1}
- \sum_{j\in N_i} d_j^{2}\frac{\partial \phi_j}{\partial s_j}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial \phi_j}{\partial s_j}^{T} \hat W_{uj}\,\frac{\omega_i^{T}W_j}{\omega_i^{T}\omega_i + 1}\Bigg)\\
&+ \hat W_{vi}^{T}\Bigg(\frac{d_i^{2}}{\gamma_{ii}^{2}}\frac{\partial \phi_i}{\partial s_i}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial \phi_i}{\partial s_i}^{T} W_i\,\frac{\omega_i^{T}W_i}{\omega_i^{T}\omega_i + 1}
- \frac{d_i^{2}}{\gamma_{ii}^{2}}\frac{\partial \phi_i}{\partial s_i}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial \phi_i}{\partial s_i}^{T} \hat W_{vi}\,\frac{\omega_i^{T}W_i}{\omega_i^{T}\omega_i + 1}\\
&\qquad + \sum_{j\in N_i} \frac{d_j^{2}\gamma_{ij}^{2}}{\gamma_{jj}^{4}}\frac{\partial \phi_j}{\partial s_j}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial \phi_j}{\partial s_j}^{T} W_j\,\frac{\omega_i^{T}W_j}{\omega_i^{T}\omega_i + 1}
- \sum_{j\in N_i} \frac{d_j^{2}\gamma_{ij}^{2}}{\gamma_{jj}^{4}}\frac{\partial \phi_j}{\partial s_j}\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}\frac{\partial \phi_j}{\partial s_j}^{T} \hat W_{vj}\,\frac{\omega_i^{T}W_j}{\omega_i^{T}\omega_i + 1}\Bigg)\Bigg\| \le b_{i5}, \qquad (42)
\end{aligned}
\]
where $\bar u_i = -d_i \begin{bmatrix} 0 \\ I \end{bmatrix}^{T}\!\left(\frac{\partial \phi_i}{\partial s_i}^{T}\hat W_{ui} + \frac{\partial \varepsilon_{ci}}{\partial s_i}\right)$, $\bar v_i = \frac{d_i}{\gamma_{ii}^{2}}\begin{bmatrix} 0 \\ I \end{bmatrix}^{T}\!\left(\frac{\partial \phi_i}{\partial s_i}^{T}\hat W_{vi} + \frac{\partial \varepsilon_{ci}}{\partial s_i}\right)$, and $u_i^{*}$, $v_i^{*}$, $\hat u_i$, $\hat v_i$ are given by (19), (20), (21), (22), respectively. The inequalities (41) and (42) follow because every quantity appearing on their left-hand sides has a known, definable bound. Moreover, all the residual errors $\varepsilon_{ci}$ that appear in $u_i^{*}$ [see (19)], $v_i^{*}$ [see (20)], and in $\bar u_i$ and $\bar v_i$ above can be made smaller by increasing the number of basis functions. □
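For concreteness, the sketch below (Python, with assumed array shapes and placeholder numbers; it is not the paper's implementation) shows how feedback terms of the form appearing above, namely $-d_i\,[0\;\; I]\,(\partial\phi_i/\partial s_i)^{T}\hat W_{ui}$ and $(d_i/\gamma_{ii}^{2})\,[0\;\; I]\,(\partial\phi_i/\partial s_i)^{T}\hat W_{vi}$ with the residual terms $\partial\varepsilon_{ci}/\partial s_i$ dropped, can be evaluated from a basis-function Jacobian and the current weight estimates.

```python
import numpy as np

# Minimal sketch (assumed shapes, not the paper's implementation) of evaluating
# the approximate control and adversarial estimates from a basis Jacobian.
# s_i lives in R^{2m} (stacked position/velocity disagreement), and [0 I] selects
# the velocity channels, mirroring the expressions for \bar u_i and \bar v_i above.

def controls_from_weights(phi_grad, W_ui, W_vi, d_i, gamma_ii):
    """phi_grad: (n_basis, 2m) Jacobian dphi_i/ds_i evaluated at s_i;
    W_ui, W_vi: (n_basis,) actor / adversary weight estimates;
    d_i: scalar gain (e.g., related to the agent's degree); gamma_ii: attenuation gain."""
    two_m = phi_grad.shape[1]
    m = two_m // 2
    B = np.vstack([np.zeros((m, m)), np.eye(m)])              # [0; I]
    u_hat = -d_i * B.T @ (phi_grad.T @ W_ui)                   # ~ -d_i [0 I] (dphi/ds)^T W_ui
    v_hat = (d_i / gamma_ii**2) * B.T @ (phi_grad.T @ W_vi)    # ~ (d_i / gamma_ii^2) [0 I] (dphi/ds)^T W_vi
    return u_hat, v_hat

# Example call with an arbitrary 4-element basis for s_i in R^2 (m = 1); numbers are placeholders.
phi_grad = np.array([[0.8, 0.0],
                     [-0.2, 0.4],
                     [0.0, 0.8],
                     [0.4, -0.2]])
W_ui = np.array([1.0, 0.5, -0.3, 0.2])
W_vi = np.array([0.1, 0.0, 0.2, -0.1])
u_hat, v_hat = controls_from_weights(phi_grad, W_ui, W_vi, d_i=2.0, gamma_ii=5.0)
print(u_hat, v_hat)
```

The point of the sketch is only the wiring: both players read the same basis Jacobian, but the adversarial estimate is scaled down by $\gamma_{ii}^{2}$, which is how larger attenuation gains shrink the anticipated worst-case input.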

Acknowledgments This material is based upon work supported in part by a U.S. Office of Naval Research MURI grant No. N00014-16-1-2710 and by NATO grant SPS G5176.

References

1. Abbeel P., Coates A., Ng A. Y.: Autonomous Helicopter Aerobatics through Apprenticeship Learning. International Journal of Robotics Research (IJRR). 29 (13), 1608-1639 (2010)

2. Alpcan T., Basar T.: Network Security: A Decision and Game Theoretic Approach. Cambridge University Press (2010)

3. Anderson B. D. O.: Exponential stability of linear equations arising in adaptive identification. IEEE Trans. on Automatic Control. 22 (1), 83-88 (1977)

4. Bardi M., Capuzzo-Dolcetta I.: Optimal control and viscosity solutions of Hamilton-Jacobi-Bellman equations. Springer (2008)

5. Basar T., Olsder G. J.: Dynamic non-cooperative game theory. Philadelphia, PA: SIAM (1999)

6. Bauso D., Giarre L., Pesenti R.: Mechanism design for optimal consensus problems. Proc. Conference on Decision and Control. 3381-3386 (2006)

7. Beard R. W., McLain T. W., Nelson D. B., Kingston D., Johanson D.: Decentralized Cooperative Aerial Surveillance Using Fixed-Wing Miniature UAVs. Proc. IEEE. 94 (7), 1306-1324 (2006)

8. Beard R. W., Saridis G. N., Wen J. T.: Approximate solutions to the time-invariant Hamilton-Jacobi-Bellman equation. Journal of Optimization Theory and Applications. 96 (3), 589-626 (1998)

9. Bertsekas D. P., Tsitsiklis J. N.: Neuro-dynamic programming. MA: Athena Scientific (1996)

10. Bryson A. E., Ho Y.: Applied Optimal Control. London, England: Hemisphere Publishing Corp. (1975)

11. Cao C., Hovakimyan N.: Novel L1 Neural Network Adaptive Control Architecture With Guaranteed Transient Performance. IEEE Transactions on Neural Networks. 18 (4), 1160-1171 (2007)

12. Cardenas A., Amin S., Sinopoli B., Giani A., Perrig A., Sastry S.: Challenges for securing cyber physical systems. Workshop on Future Directions in Cyber-physical Systems Security (2009)

13. Cardenas A., Amin S., Sastry S. S.: Secure control: Towards survivable cyber-physical systems. First International Workshop on Cyber-Physical Systems. 495-500 (2008)

14. Carli R., Zampieri S.: Networked clock synchronization based on second order linear consensus algorithms. Proc. 49th IEEE Conference on Decision and Control. 7259-7264 (2010)

15. Chen L., Roy S., Saberi A.: On the information flow required for tracking control in networks of mobile sensing agents. IEEE Transactions on Mobile Computing. 10 (4), 519-531 (2011)

16. Chung S.-J., Slotine J.-J. E.: Cooperative robot control and concurrent synchronization of Lagrangian systems. IEEE Trans. Rob. 25 (3), 686-700 (2009)

17. Chung S.-J., Bandyopadhyay S., Chang I., Hadaegh F. Y.: Phase synchronization control of complex networks of Lagrangian systems on adaptive digraphs. Automatica. 49 (5), 1148-1161 (2013)

18. Crandall M. G., Lions P. L.: Viscosity solutions of Hamilton-Jacobi equations. Transactions of the American Mathematical Soc. 277 (1), 1-42 (1983)

19. Hornik K., Stinchcombe M., White H.: Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks. 3, 551-560 (1990)

20. Holmgren A. J., Jenelius E., Westin J.: Evaluating strategies for defending electric power networks against antagonistic attacks. IEEE Trans. Power Syst. 22 (1), 76-84 (2007)

21. Ioannou P. A., Fidan B.: Adaptive Control Tutorial. Advances in Design and Control. SIAM, Philadelphia, PA (2006)

22. Ioannou P. A., Tao G.: Dominant richness and improvement of performance of robust adaptive control. Automatica. 25 (2), 287-291 (1989)

23. Jadbabaie A., Lin J., Morse A.: Coordination of groups of mobile autonomous agents using nearest neighbor rules. IEEE Transactions on Automatic Control. 48 (6), 988-1001 (2003)

24. Kearns M., Littman M., Singh S.: Graphical models for game theory. Proc. 17th Annual Conference on Uncertainty in Artificial Intelligence. 253-260 (2001)

25. Khalil H. K.: Nonlinear Systems. Prentice Hall (2002)

26. Khanafer A., Touri B., Basar T.: Consensus in the presence of an adversary. 3rd IFAC Workshop on Distributed Estimation and Control in Networked Systems (NecSys) (2012)

27. Krstic M., Kanellakopoulos I., Kokotovic P. V.: Nonlinear and Adaptive Control Design. John Wiley and Sons (1995)

28. Kunwar F., Benhabib B.: Rendezvous-guidance trajectory planning for robotic dynamic obstacle avoidance and interception. IEEE Trans. Syst., Man, Cybern. 36 (6), 1432-1441 (2006)

29. Lavretsky E., Wise K. A.: Robust adaptive control. In: Robust and Adaptive Control, Springer London, 317-353 (2013)

30. LeBlanc H. J., Koutsoukos X. D.: Low complexity resilient consensus in networked multi-agent systems with adversaries. Proc. of the 15th ACM International Conference on Hybrid Systems: Computation and Control. 5-14 (2012)

31. Lee D., Spong M. W.: Stable flocking of inertial agents on balanced communication graphs. IEEE Trans. Automat. Control. 52 (8), 1469-1475 (2007)

32. Lewis F. L., Jagannathan S., Yesildirek A.: Neural network control of robot manipulators and nonlinear systems. Taylor and Francis (1999)

33. Olfati-Saber R., Fax J., Murray R.: Consensus and cooperation in networked multi-agent systems. Proceedings of the IEEE. 95 (1), 215-233 (2007)

34. Pasqualetti F., Bicchi A., Bullo F.: Consensus Computation in Unreliable Networks: A System Theoretic Approach. IEEE Transactions on Automatic Control. 57 (1), 90-104 (2012)

35. Peymani E., Grip H. F., Saberi A., Wang X., Fossen T. I.: Almost output synchronization for heterogeneous networks of introspective agents under external disturbances. Automatica. 50 (4), 1026-1036 (2014)

36. Pomet J. B., Praly L.: Adaptive nonlinear regulation: Estimation from the Lyapunov equation. IEEE Trans. on Automatic Control. 37, 729-740 (1992)

37. Rao V. G., Bernstein D. S.: Naive control of the double integrator. IEEE Control Syst. 21 (5), 86-97 (2001)

38. Ren W., Beard R., Atkins E.: A survey of consensus problems in multiagent coordination. Proc. Amer. Control Conf. 1859-1864 (2005)

39. Semsar E., Khorasani K.: Optimal control and game theoretic approaches to cooperative control of a team of multi-vehicle unmanned systems. Proc. IEEE International Conference on Networking, Sensing and Control. 628-633 (2007)

40. Semsar-Kazerooni E., Khorasani K.: An LMI Approach to Optimal Consensus Seeking in Multi-Agent Systems. Proc. Amer. Control Conf. 4519-4524 (2009)

41. Shoham Y., Leyton-Brown K.: Multiagent systems: algorithmic, game-theoretic, and logical foundations. Cambridge University Press (2009)

42. Sundaram S., Hadjicostis C. N.: Distributed function calculation via linear iterative strategies in the presence of malicious agents. IEEE Transactions on Automatic Control. 56 (7), 1495-1508 (2011)

43. Sutton R. S., Barto A. G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA (1998)

44. Teixeira A., Sandberg H., Johansson K. H.: Networked control systems under cyber attacks with applications to power networks. Proc. American Control Conf. 3690-3696 (2010)

45. Tsitsiklis J.: Problems in decentralized decision making and computation. Ph.D. Dissertation, Dept. of Elect. Eng. and Comput. Sci., MIT, Cambridge, MA (1984)

46. Vamvoudakis K. G., Lewis F. L., Hudas G. R.: Multi-Agent Differential Graphical Games: Online Adaptive Learning Solution for Synchronization with Optimality. Automatica. 48 (8), 1598-1611 (2012)

47. Vamvoudakis K. G., Hespanha J. P., Sinopoli B., Mo Y.: Detection in Adversarial Environments. IEEE Transactions on Automatic Control. 59 (12), 3209-3223 (2014)

48. Vamvoudakis K. G., Antsaklis P. J., Dixon W. E., Hespanha J. P., Lewis F. L., Modares H., Kiumarsi B.: Autonomy and Machine Intelligence in Complex Systems: A Tutorial. Proc. American Control Conf. 5062-5079 (2015)

49. Van der Schaft A. J.: L2-gain analysis of nonlinear systems and nonlinear state feedback H∞ control. IEEE Transactions on Automatic Control. 37 (6), 770-784 (1992)

50. Vrabie D., Vamvoudakis K. G., Lewis F. L.: Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles. Control Engineering Series, IET Press (2012)

51. Wang J., Elia N.: Distributed Averaging Algorithms Resilient to Communication Noise and Dropouts. IEEE Trans. on Signal Processing. 61 (9), 2231-2242 (2013)

52. Werbos P. J.: Approximate dynamic programming for real-time control and neural modeling. In: White D. A., Sofge D. A. (eds.) Handbook of Intelligent Control. New York: Van Nostrand Reinhold (1992)

53. Yucelen T., Egerstedt M.: Control of Multiagent Systems under Persistent Disturbances. Proc. Amer. Control Conf. 5263-5269 (2012)

54. Zhu M., Martinez S.: Attack-resilient distributed formation control via online adaptation. Proc. CDC-ECE Conf. 6624-6629 (2011)

55. Zhu H., Giannakis G. B., Cano A.: Distributed in-network channel decoding. IEEE Trans. Signal Processing. 57 (10), 3970-3983 (2009)

56. Zhu Q., Bushnell L., Basar T.: Resilient distributed control of multi-agent cyber-physical systems. Proc. Workshop on Control of Cyber-Physical Systems, Johns Hopkins University (2013)