
Graph Neural Networks for Scalable Radio Resource Management: Architecture Design and Theoretical Analysis

Yifei Shen, Student Member, IEEE, Yuanming Shi, Member, IEEE, Jun Zhang, Senior Member, IEEE, and Khaled B. Letaief, Fellow, IEEE

Abstract—Deep learning has recently emerged as a disruptive technology to solve challenging radio resource management problems in wireless networks. However, the neural network architectures adopted by existing works suffer from poor scalability and generalization, and lack of interpretability. A long-standing approach to improve scalability and generalization is to incorporate the structures of the target task into the neural network architecture. In this paper, we propose to apply graph neural networks (GNNs) to solve large-scale radio resource management problems, supported by effective neural network architecture design and theoretical analysis. Specifically, we first demonstrate that radio resource management problems can be formulated as graph optimization problems that enjoy a universal permutation equivariance property. We then identify a family of neural networks, named message passing graph neural networks (MPGNNs). It is demonstrated that they not only satisfy the permutation equivariance property, but also can generalize to large-scale problems, while enjoying a high computational efficiency. For interpretability and theoretical guarantees, we prove the equivalence between MPGNNs and a family of distributed optimization algorithms, which is then used to analyze the performance and generalization of MPGNN-based methods. Extensive simulations, with power control and beamforming as two examples, demonstrate that the proposed method, trained in an unsupervised manner with unlabeled samples, matches or even outperforms classic optimization-based algorithms without domain-specific knowledge. Remarkably, the proposed method is highly scalable and can solve the beamforming problem in an interference channel with 1000 transceiver pairs within 6 milliseconds on a single GPU.

Index Terms—Radio resource management, wireless networks, graph neural networks, distributed algorithms, permutation equivariance.

I. INTRODUCTION

Radio resource management, e.g., power control [2] and beamforming [3], plays a crucial role in wireless networks. Unfortunately, many of these problems are non-convex and computationally challenging. Moreover, they need to be solved in a real-time manner given the time-varying wireless channels and the latency requirement of many mobile applications. Great efforts have been put forward to develop effective algorithms for these challenging problems. Existing algorithms are mainly based on convex optimization approaches [4], [5], which have a limited capability in dealing with non-convex problems and scale poorly with the problem size. Problem-specific algorithms can be developed, which, however, is a laborious process and requires much problem-specific knowledge.

The materials in this paper were presented in part at the IEEE Global Communications Conference (Globecom) Workshops, 2019 [1]. This work was supported by the General Research Fund (Project No. 16210719) from the Research Grants Council of Hong Kong.

Y. Shen and K. B. Letaief are with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong (E-mail: {yshenaw, eekhaled}@ust.hk). K. B. Letaief is also with Peng Cheng Lab in Shenzhen. Y. Shi is with the School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China (E-mail: [email protected]). J. Zhang is with the Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong (E-mail: [email protected]). (The corresponding author is J. Zhang.)

Inspired by the recent successes of deep learning in many application domains, e.g., computer vision and natural language processing [6], researchers have attempted to apply deep learning based methods, particularly "learning to optimize" approaches, to solve difficult optimization problems in wireless networks [7]–[15]. The goal of such methods is to achieve near-optimal performance in a real-time manner without domain knowledge, i.e., to automate the algorithm design process. There are two common paradigms on this topic [16], [17]. The first one is "end-to-end learning", which directly employs a neural network to approximate the optimal solution of an optimization problem. For example, in [7], to solve the power control problem, a multi-layer perceptron (MLP) was used to approximate the input-output mapping of the classic weighted minimum mean square error (WMMSE) algorithm [18] to speed up the computation. The second paradigm is "learning alongside optimization", which replaces an ineffective policy in a traditional algorithm with a neural network. For example, an MLP was utilized in [11] to replace the pruning policy in the branch-and-bound algorithm. Accordingly, significant speedup and performance gain in the access point selection problem were achieved compared with the optimization-based methods in [19], [20].

A key design ingredient underlying both paradigms of "learning to optimize" is the neural network architecture. Most of the existing works adopt MLPs [7], [9], [11], [21] or convolutional neural networks (CNNs) [8], [12]. These architectures are inherited from the ones developed for image processing tasks and thus are not tailored to problems in wireless networks. Although near-optimal performance is achieved for small-scale wireless networks, they fail to exploit the wireless network structure and thus suffer from poor scalability and generalization in large-scale radio resource management problems. Specifically, the performance of these


methods degrades dramatically when the wireless network size becomes large. For example, it was shown in [7] that the performance gap to the WMMSE algorithm is 2% when $K = 10$, and it becomes 12% when $K = 30$. Moreover, these methods generalize poorly when the number of agents in the test dataset is larger than that in the training dataset. In dense wireless networks, resource management may involve thousands of users simultaneously and the number of users changes dynamically, thus making the wide application of these learning-based methods very difficult.

A long-standing idea to improve scalability and generalization is to incorporate the structures of the target task into the neural network architecture [16], [21]–[23]. A prominent example is the development of CNNs for computer vision, which is inspired by the fact that the neighboring pixels of an image are useful when they are considered together [24]. This idea has also been successfully applied in many applications, e.g., visual reasoning [23], combinatorial optimization [25], and route planning [26]. To achieve better scalability of learning-based radio resource management, structures in a single-antenna system with homogeneous agents have recently been exploited for effective neural network architecture design [10], [14]. In static channels, observing that channel states are deterministic functions of users' geo-locations in a 2D Euclidean space, spatial convolution was developed in [10], which is applicable in wireless networks with thousands of users but cannot handle fading channels. With fading channels, it was observed that the channel matrix can be viewed as the adjacency matrix of a graph [14]. From this perspective, a random edge graph neural network (REGNN) operating on such a graph was developed, which exhibits a good generalization property when the number of users in the wireless network changes. However, in a multi-antenna system or a single-antenna system with heterogeneous agents, the channel matrix no longer fits the form of an adjacency matrix and the REGNN cannot be applied.

In this paper, we address the limitations of existing works by modeling wireless networks as wireless channel graphs and developing neural networks to exploit the graph topology. Specifically, we treat the agents as nodes in a graph, communication channels as directed edges, agent-specific parameters as node features, and channel-related parameters as edge features. Subsequently, low-complexity neural network architectures operating on wireless channel graphs will be proposed.

Existing works (e.g., [7], [11], [13]) also have another major limitation, namely, they treat the adopted neural network as a black box. Despite the superior performance in specific applications, it is hard to interpret what is learned by the neural networks. To ensure reliability, it is crucial to understand when the algorithm works and when it fails. Thus, a good theoretical understanding is demanded for the learning-based radio resource management methods. Compared with learning-based methods, conventional optimization-based methods are well studied. This inspires us to build a relationship between these two types of methods. In particular, we shall prove the equivalence between the proposed neural networks and a favorable family of optimization-based methods. This equivalence will allow the development of tractable analysis for the performance and generalization of the learning-based methods through the study of their equivalent optimization-based methods.

A. Contributions

In this paper, we develop scalable learning-based methods to solve radio resource management problems in dense wireless networks. The major contributions are summarized as follows:

1) We model wireless networks as wireless channel graphs and formulate radio resource management problems as graph optimization problems. We then show that a permutation equivariance property holds in general radio resource management problems, which can be exploited for effective neural network architecture design.

2) We identify a favorable family of neural networks operating on wireless channel graphs, namely MPGNNs. It is shown that MPGNNs satisfy the permutation equivariance property, and have the ability to generalize to large-scale problems while enjoying a high computational efficiency.

3) For an effective implementation, we propose a wireless channel graph convolution network (WCGCN) within the MPGNN class. Besides inheriting the advantages of MPGNNs, the WCGCN enjoys several unique advantages for solving radio resource management problems. First, it can effectively exploit both agent-related features and channel-related features. Second, it is insensitive to the corruption of features, e.g., channel state information (CSI), implying that it can be applied with partial and imperfect CSI.

4) To provide interpretability and theoretical guarantees, we prove the equivalence between MPGNNs and a family of distributed optimization algorithms, which includes many classic algorithms for radio resource management, e.g., WMMSE [18]. Based on this equivalence, we analyze the performance and generalization of MPGNN-based methods in the weighted sum rate maximization problem.

5) We test the effectiveness of the WCGCN for power control and beamforming problems, training with unlabeled data. Extensive simulations will demonstrate that the proposed WCGCN matches or outperforms classic optimization-based algorithms without domain knowledge, and with significant speedups. Remarkably, the WCGCN can solve the beamforming problem with 1000 users within 6 milliseconds on a single GPU.1

B. Notations

Throughout this paper, superscripts $(\cdot)^H$, $(\cdot)^T$, and $(\cdot)^{-1}$ denote conjugate transpose, transpose, and inverse, respectively. The symbol $X_{(i_1,\cdots,i_n)}$ denotes the element in tensor $X$ indexed by $i_1, \cdots, i_n$. For example, $X_{(2,3)}$ is the element in the second row and third column of matrix $X$. The set symbol $\{\,\}$ in this paper denotes a multiset. A multiset is a 2-tuple $X = (S, m)$, where $S$ is the underlying set of $X$ that is formed from its distinct elements, and $m: S \to \mathbb{N}_{\geq 1}$ gives the multiplicity of the elements. For example, $\{a, a, b\}$ is a multiset where element $a$ has multiplicity 2 and element $b$ has multiplicity 1.

1The codes to reproduce the simulation results are available at https://github.com/yshenaw/GNN-Resource-Management.

II. GRAPH MODELING OF WIRELESS NETWORKS

In this section, we model wireless networks as graphs, and formulate radio resource management problems as graph optimization problems. Key properties of radio resource management problems will be identified, which will then be exploited to design effective neural network architectures.

A. Directed Graphs and Permutation Equivariance Property

A directed graph can be represented as an ordered pair $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. The adjacency matrix of a graph is an $n \times n$ matrix $\mathbf{A} \in \{0,1\}^{n \times n}$, where $\mathbf{A}_{i,j} = 1$ if and only if $(i,j) \in E$ for all $i, j \in V$. Let $[n] = \{1, \cdots, n\}$ and denote the permutation operator as $\pi: [n] \to [n]$. Given the permutation $\pi$ and a graph adjacency matrix $\mathbf{A}$, the permutation of nodes is denoted by $\pi \star \mathbf{A}$ and defined as
$$(\pi \star \mathbf{A})_{(\pi(i_1), \pi(i_2))} = \mathbf{A}_{(i_1, i_2)},$$
for indices $1 \leq i_1, i_2 \leq |V|$. Two graphs $\mathbf{A}$ and $\mathbf{B}$ are said to be isomorphic if there is a permutation $\pi$ such that $\pi \star \mathbf{A} = \mathbf{B}$, and this relationship is denoted by $\mathbf{A} \cong \mathbf{B}$.

We now introduce optimization problems defined on directed graphs, and identify their permutation invariance and equivariance properties. We assign each node $v_i \in V$ an optimization variable $\gamma_i \in \mathbb{R}$. We denote the optimization variable as $\boldsymbol{\gamma} = [\gamma_1, \cdots, \gamma_{|V|}]^T$ and the permutation of the optimization variable as
$$(\pi \star \boldsymbol{\gamma})_{(\pi(i_1))} = \boldsymbol{\gamma}_{(i_1)}.$$

The optimization problem defined on a graph $\mathbf{A}$ can be written as
$$\mathcal{Q}: \quad \underset{\boldsymbol{\gamma}}{\text{minimize}} \;\; g(\boldsymbol{\gamma}, \mathbf{A}) \quad \text{subject to} \;\; Q(\boldsymbol{\gamma}, \mathbf{A}) \leq 0, \qquad (1)$$
where $g(\cdot, \cdot)$ represents the objective function and $Q(\cdot, \cdot)$ represents the constraint.

As $\mathbf{A} \cong \pi \star \mathbf{A}$, optimization problems defined on graphs have the permutation invariance property stated below.

Proposition II.1. (Permutation invariance) The optimization problem defined in (1) has the following property:
$$g(\boldsymbol{\gamma}, \mathbf{A}) = g(\pi \star \boldsymbol{\gamma}, \pi \star \mathbf{A}), \qquad Q(\boldsymbol{\gamma}, \mathbf{A}) = Q(\pi \star \boldsymbol{\gamma}, \pi \star \mathbf{A}),$$
for any permutation operator $\pi$.

Proof. Since the adjacency matrices $\mathbf{A}$ and $\pi \star \mathbf{A}$ represent the same graph, permuting $\boldsymbol{\gamma}$ and $\mathbf{A}$ simultaneously is simply a reordering of the variables. As a result, we have $g(\boldsymbol{\gamma}, \mathbf{A}) = g(\pi \star \boldsymbol{\gamma}, \pi \star \mathbf{A})$ and $Q(\boldsymbol{\gamma}, \mathbf{A}) = Q(\pi \star \boldsymbol{\gamma}, \pi \star \mathbf{A})$. ∎

The permutation invariance property of the objective value and constraint leads to a corresponding property of sublevel sets. We first define sublevel sets.

Definition II.1. (Sublevel sets) The $\alpha$-sublevel set of a function $f: \mathbb{C}^n \to \mathbb{R}$ is defined as
$$R_f^{\alpha} = \{x \in \operatorname{dom} f \mid f(x) \leq \alpha\},$$
where $\operatorname{dom} f$ is the feasible domain.

Denote the optimal objective value of (1) as $z^*$, and the set of $\epsilon$-accurate solutions as $R^{z^*+\epsilon}$. Thus, the properties of sublevel sets imply the properties of near-optimal solutions. Specifically, the permutation invariance property of the objective function implies the permutation equivariance property of the sublevel sets, which is stated in the next proposition.

Proposition II.2. (Permutation equivariance) Denote $R_g^{\alpha}$ as the sublevel set of $g(\cdot, \cdot)$ in (1), and define $F: \mathbf{A} \mapsto R_g^{\alpha}$. Then,
$$F(\pi \star \mathbf{A}) = \{\pi \star \boldsymbol{\gamma} \mid \boldsymbol{\gamma} \in R_g^{\alpha}\},$$
for any permutation operator $\pi$.

Remark. The permutation equivariance property of sublevel sets is a direct result of the permutation invariance of the objective function. Please refer to Appendix A for a detailed proof.

In the next subsection, by modeling wireless networks as graphs, we show that the permutation equivariance property is universal in radio resource management problems.
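To make the permutation invariance behind Propositions II.1 and II.2 concrete, the following minimal NumPy sketch (an illustrative example rather than the paper's code; the single-antenna sum-rate objective and all variable names are chosen here for illustration) permutes a power vector and both axes of a channel matrix simultaneously and checks that the objective value is unchanged.

import numpy as np

def sum_rate(p, H, sigma2=1.0):
    # Sum-rate objective g(p, H) for a K-user single-antenna interference channel.
    # p[k] is the transmit power of pair k; H[j, k] is the channel h_{j,k} from
    # transmitter j to receiver k.
    gain = np.abs(H) ** 2
    signal = np.diag(gain) * p
    interference = gain.T @ p - np.diag(gain) * p   # sum_{j != k} |h_{j,k}|^2 p_j
    return np.sum(np.log2(1.0 + signal / (interference + sigma2)))

rng = np.random.default_rng(0)
K = 5
H = (rng.standard_normal((K, K)) + 1j * rng.standard_normal((K, K))) / np.sqrt(2)
p = rng.uniform(0, 1, K)

# Apply the same permutation pi to the variable p and to both axes of H.
pi = rng.permutation(K)
p_perm = np.empty_like(p)
p_perm[pi] = p                     # (pi * p)_{pi(i)} = p_i
H_perm = np.empty_like(H)
H_perm[np.ix_(pi, pi)] = H         # (pi * H)_{pi(i1), pi(i2)} = H_{i1, i2}

# Permutation invariance: the objective value is unchanged.
print(np.isclose(sum_rate(p, H), sum_rate(p_perm, H_perm)))   # True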

B. Wireless Network as a Graph

A wireless network can be modeled as a directed graph with node and edge features. Naturally, we treat each agent of a wireless network, e.g., a mobile user or a base station, as a node in the graph. An edge is drawn from node $i$ to node $j$ if there is a direct communication or interference link with node $i$ as the transmitter and node $j$ as the receiver. The node feature incorporates the properties of the agent, e.g., the users' weights in the weighted sum rate maximization problem [18]. The edge feature includes the properties of the corresponding channel, e.g., a scalar (or matrix) denoting the channel state of a single-antenna (or multi-antenna) system. We call the graphs generated by the wireless network topology wireless channel graphs. Formally, a wireless channel graph is an ordered tuple $G = (V, E, s, t)$, where $V$ is the set of nodes, $E$ is the set of edges, $s: V \to \mathbb{C}^{d_1}$ maps a node to its feature, and $t: E \to \mathbb{C}^{d_2}$ maps an edge to its feature. Denote $V = \{v_1, v_2, \cdots, v_{|V|}\}$. Also define the node feature matrix as $\mathbf{Z} \in \mathbb{C}^{|V| \times d_1}$ with $\mathbf{Z}_{(i,:)} = s(v_i)$, and the adjacency feature tensor $A \in \mathbb{C}^{|V| \times |V| \times d_2}$ as
$$A_{(i,j,:)} = \begin{cases} \mathbf{0}, & \text{if } \{i,j\} \notin E \\ t(\{i,j\}), & \text{otherwise}, \end{cases} \qquad (2)$$
where $\mathbf{0}$ is a zero vector in $\mathbb{C}^{d_2}$. Given the permutation $\pi$ and a graph $G$ with its node feature matrix $\mathbf{Z}$ and adjacency feature tensor $A$, the permutation of nodes is denoted by $(\pi \star \mathbf{Z}, \pi \star A)$ and defined as
$$(\pi \star \mathbf{Z})_{(\pi(i_1),:)} = \mathbf{Z}_{(i_1,:)}, \qquad (\pi \star A)_{(\pi(i_1), \pi(i_2),:)} = A_{(i_1, i_2,:)}.$$

We assign each node $v_i \in V$ an optimization variable $\boldsymbol{\gamma}_i \in \mathbb{C}^n$. Let $\boldsymbol{\Gamma} = [\boldsymbol{\gamma}_1, \cdots, \boldsymbol{\gamma}_{|V|}]^T \in \mathbb{C}^{|V| \times n}$; then an optimization problem defined on a wireless channel graph can be written as


Fig. 1. An illustration of the graph modeling of a $K$-user interference channel: each transceiver pair is modeled as a node, and direct links and interference links are modeled as edges.

$$\mathcal{P}: \quad \underset{\boldsymbol{\Gamma}}{\text{minimize}} \;\; g(\boldsymbol{\Gamma}, \mathbf{Z}, A) \quad \text{subject to} \;\; Q(\boldsymbol{\Gamma}, \mathbf{Z}, A) \leq 0, \qquad (3)$$
where $g(\cdot, \cdot, \cdot)$ denotes the objective function and $Q(\cdot, \cdot, \cdot)$ denotes the constraint.

Next, we elaborate on the properties of radio resource management problems on wireless channel graphs. Without node features or edge features, a wireless channel graph is a directed graph, so the properties of wireless channel graphs follow from those of directed graphs. In particular, the permutation equivariance property carries over. Define the permutation of the optimization variable as
$$(\pi \star \boldsymbol{\Gamma})_{(\pi(i_1),:)} = \boldsymbol{\Gamma}_{(i_1,:)}.$$

Similar to optimization problems on directed graphs, the ones defined on wireless channel graphs have the permutation invariance property. As a result, the sublevel sets of $g(\cdot, \cdot, \cdot)$ in (3) also have the permutation equivariance property, which is stated below.

Proposition II.3. (Permutation equivariance) Let $R_g^{\alpha}$ denote the sublevel set of $g(\cdot, \cdot, \cdot)$ in (3), and define $F: (\mathbf{Z}, A) \mapsto R_g^{\alpha}$. Then,
$$F((\pi \star \mathbf{Z}, \pi \star A)) = \{\pi \star \boldsymbol{\Gamma} \mid \boldsymbol{\Gamma} \in R_g^{\alpha}\},$$
for any permutation operator $\pi$.

Remark. This result establishes a general permutation equivariance property for radio resource management problems. Proposition II.3 reduces to the results in [14] if $\mathbf{Z}$ is an all-one matrix and $A \in \mathbb{R}^{|V| \times |V| \times 1}$. By modeling the node heterogeneity in $\mathbf{Z}$, Proposition II.3 is applicable to heterogeneous agents. By introducing the adjacency feature tensor instead of using the adjacency matrix, this graph modeling technique can incorporate multi-antenna channel states. The proof is the same as that of Proposition II.2 with only a change of notation.

C. Graph Modeling of 𝐾-user Interference Channels

In this subsection, as a specific example, we present the graph modeling of a classic radio resource management problem, i.e., beamforming for weighted sum rate maximization in a $K$-user interference channel. It will be used as the main test setting for the theoretical study in Section IV-C and the simulations in Section V. There are in total $K$ transceiver pairs, where each transmitter is equipped with $N_t$ antennas and each receiver is equipped with a single antenna. Let $\mathbf{v}_k$ denote the beamformer of the $k$-th transmitter. The received signal at receiver $k$ is
$$y_k = \mathbf{h}_{k,k}^H \mathbf{v}_k s_k + \sum_{j \neq k}^{K} \mathbf{h}_{j,k}^H \mathbf{v}_j s_j + n_k,$$
where $\mathbf{h}_{j,k} \in \mathbb{C}^{N_t}$ denotes the channel state from transmitter $j$ to receiver $k$ and $n_k \in \mathbb{C}$ denotes the additive noise following the complex Gaussian distribution $\mathcal{CN}(0, \sigma_k^2)$.

The signal-to-interference-plus-noise ratio (SINR) for receiver $k$ is given by
$$\mathrm{SINR}_k = \frac{|\mathbf{h}_{k,k}^H \mathbf{v}_k|^2}{\sum_{j \neq k}^{K} |\mathbf{h}_{j,k}^H \mathbf{v}_j|^2 + \sigma_k^2}. \qquad (4)$$

Denote $\mathbf{V} = [\mathbf{v}_1, \cdots, \mathbf{v}_K]^T \in \mathbb{C}^{K \times N_t}$ as the beamforming matrix. The objective is to find the optimal beamformers that maximize the weighted sum rate, and the problem is formulated as
$$\begin{aligned} \underset{\mathbf{V}}{\text{maximize}} \;\; & \sum_{k=1}^{K} w_k \log_2(1 + \mathrm{SINR}_k) \\ \text{subject to} \;\; & \|\mathbf{v}_k\|_2^2 \leq P_{\max}, \; \forall k, \end{aligned} \qquad (5)$$

where $w_k$ is the weight for the $k$-th pair.

a) Graph Modeling: We view the $k$-th transceiver pair as the $k$-th node in the graph. As distant agents cause little interference, we draw a directed edge from node $j$ to node $k$ only if the distance between transmitter $j$ and receiver $k$ is below a certain threshold $D$. An illustration of such a graph modeling is shown in Fig. 1. The node feature matrix $\mathbf{Z} \in \mathbb{C}^{|V| \times (N_t+2)}$ is given by
$$\mathbf{Z}_{(k,:)} = [\mathbf{h}_{k,k}, w_k, \sigma_k^2]^T,$$
and the adjacency feature tensor $A \in \mathbb{C}^{|V| \times |V| \times N_t}$ is given by
$$A_{(j,k,:)} = \begin{cases} \mathbf{0}, & \text{if } \{j,k\} \notin E \\ \mathbf{h}_{j,k}, & \text{otherwise}, \end{cases}$$


where $\mathbf{0} \in \mathbb{C}^{N_t}$ is a zero vector. With the notations $\mathbf{Z}$, $A$, and $\mathbf{V}$, the SINR can be written as
$$\mathrm{SINR}_k = \frac{|\mathbf{Z}_{(k,1:N_t)}^H \mathbf{v}_k|^2}{\sum_{j \neq k}^{K} |A_{(j,k,:)}^H \mathbf{v}_j|^2 + \mathbf{Z}_{(k,N_t+2)}},$$
and (5) can be written as
$$\begin{aligned} \underset{\mathbf{V}}{\text{maximize}} \;\; & g(\mathbf{V}, \mathbf{Z}, A) = \sum_{k=1}^{K} \mathbf{Z}_{(k,N_t+1)} \log_2(1 + \mathrm{SINR}_k) \\ \text{subject to} \;\; & Q(\mathbf{V}, \mathbf{Z}, A) = \|\mathbf{v}_k\|_2^2 - P_{\max} \leq 0, \; \forall k. \end{aligned} \qquad (6)$$

Problem (6) has the permutation equivariance property with respect to $\mathbf{V}$, $\mathbf{Z}$, and $A$, as shown in Proposition II.3. To solve this problem efficiently and effectively, the adopted neural network should exploit the permutation equivariance property, and incorporate both node features and edge features. We shall develop an effective neural network architecture to achieve this goal in the next section.
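As a concrete illustration of this graph construction, the following NumPy sketch (illustrative only; the function name, the toy data, and the distance-threshold rule are assumptions for this example, not code from the paper) assembles the node feature matrix Z and the adjacency feature tensor A used in (6).

import numpy as np

def build_wireless_channel_graph(h, w, sigma2, dist, D):
    # h[j, k, :]  : channel vector h_{j,k} in C^{N_t} from transmitter j to receiver k
    # w[k]        : weight of the k-th pair
    # sigma2[k]   : noise power at receiver k
    # dist[j, k]  : distance between transmitter j and receiver k
    # D           : distance threshold for drawing an interference edge
    K, _, Nt = h.shape
    # Node features: direct channel, weight, and noise power of each pair.
    Z = np.zeros((K, Nt + 2), dtype=complex)
    for k in range(K):
        Z[k, :Nt] = h[k, k, :]
        Z[k, Nt] = w[k]
        Z[k, Nt + 1] = sigma2[k]
    # Edge features: the interfering channel h_{j,k} if the link is close enough.
    A = np.zeros((K, K, Nt), dtype=complex)
    for j in range(K):
        for k in range(K):
            if j != k and dist[j, k] <= D:
                A[j, k, :] = h[j, k, :]
    return Z, A

# Toy usage with random data (illustrative only).
K, Nt = 4, 2
rng = np.random.default_rng(0)
h = (rng.standard_normal((K, K, Nt)) + 1j * rng.standard_normal((K, K, Nt))) / np.sqrt(2)
Z, A = build_wireless_channel_graph(h, w=np.ones(K), sigma2=0.1 * np.ones(K),
                                    dist=rng.uniform(0, 1000, (K, K)), D=500.0)
print(Z.shape, A.shape)   # (K, N_t + 2) and (K, K, N_t)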

III. NEURAL NETWORK ARCHITECTURE DESIGN FOR RADIO RESOURCE MANAGEMENT

In this section, we endeavor to develop a scalable neural network architecture for radio resource management problems. A favorable family of GNNs, named message passing graph neural networks, will be identified. Their key properties and an effective implementation will also be discussed.

A. Optimizing Wireless Networks via Graph Neural Networks

Most existing works on "learning to optimize" approaches for solving problems in wireless networks adopted MLPs as the neural network architecture [7], [9], [11]. Although MLPs can approximate well-behaved functions [27], they suffer from poor data efficiency, robustness, and generalization. A long-standing idea for improving the performance and generalization is to incorporate the structures of the target task into the neural network architecture. In this way, there is no need for the neural network to learn such structures from data, which leads to more efficient training and better generalization, both empirically [14], [21], [22], [28] and provably [23].

As discussed above, radio resource management problems can be formulated as optimization problems on wireless channel graphs, which enjoy the permutation equivariance property. In machine learning, there are two classes of neural networks that are able to exploit the permutation equivariance property, i.e., graph neural networks (GNNs) [29] and Deep Sets [30]. Compared with Deep Sets, GNNs not only respect the permutation equivariance property but can also model the interactions among the agents. In wireless networks, the agents interact with each other through channels. Thus, GNNs are more favorable than Deep Sets in wireless networks. This motivates us to adopt GNNs to solve radio resource management problems.

B. Message Passing Graph Neural Networks

In this subsection, we shall identify a family of GNNs for radio resource management problems, which extends CNNs to wireless channel graphs. In traditional machine learning tasks, the data can typically be embedded in a Euclidean space, e.g., images. Recently, an increasing number of applications generate data from non-Euclidean spaces that can naturally be modeled as graphs, e.g., point clouds [32] and combinatorial problems [25]. This has motivated researchers to develop GNNs [29], which effectively exploit the graph structure. GNNs generalize traditional CNNs, recurrent neural networks, and auto-encoders to graph tasks. In wireless networks, while the agents are located in a Euclidean space, channel states cannot be embedded in a Euclidean space. Thus, the data in radio resource management problems are also non-Euclidean, and neural networks operating on non-Euclidean data are necessary when adopting "learning to optimize" approaches in wireless networks.

As background, we first introduce CNNs, which operate on Euclidean data. Compared with MLPs, CNNs have shown superior performance in image processing tasks. The motivation for CNNs is that adjacent pixels of an image are meaningful when considered together [24]. Like MLPs, CNNs have a layer-wise structure. In each layer, a 2D convolution is applied to the input. Here we consider a simple CNN with a rectified linear unit and without pooling. In the $k$-th layer, for a pixel located at $(i,j)$, the update is

$$\mathbf{x}_{i,j}^{(k)} = \mathrm{ReLU}\left( \sum_{(p,l) \in \mathcal{N}(i,j)} \mathbf{W}_{i-p,\, j-l}^{(k)} \mathbf{x}_{p,l}^{(k-1)} \right), \qquad (7)$$
where $\mathbf{x}_{i,j}^{(0)}$ denotes pixel $(i,j)$ of the input image, $\mathbf{x}_{i,j}^{(k)}$ denotes the hidden state of pixel $(i,j)$ at the $k$-th layer, $\mathrm{ReLU}(x) = \max(0, x)$, $\mathbf{W}_{\cdot,\cdot}^{(k)}$ denotes the weight matrix in the $k$-th layer, and $\mathcal{N}(i,j)$ denotes the neighbor pixels of pixel $(i,j)$. Specifically, for a convolution kernel of size $N \times N$, we have
$$\mathcal{N}(i,j) = \left\{ (p,l) : |p-i| \leq \frac{N-1}{2},\; |l-j| \leq \frac{N-1}{2} \right\},$$
and a common choice of $N$ is 3.

Despite the great success of CNNs in computer vision, they cannot be applied to non-Euclidean data. In [31], CNNs are extended to graphs from a spatial perspective, yielding an architecture that is as efficient as CNNs while enjoying performance guarantees on the graph isomorphism test. We refer to this architecture as spatial graph convolutional networks (SGNNs). In each layer of a CNN (7), each pixel aggregates information from neighbor pixels and then updates its state. As an analogy, in each layer of an SGNN, each node updates its representation by aggregating features from its neighbor nodes. Specifically, the update rule of the $k$-th layer at vertex $i$ in an SGNN is

$$\mathbf{x}_i^{(k)} = \alpha^{(k)}\left( \mathbf{x}_i^{(k-1)},\; \phi^{(k)}\left( \left\{ \mathbf{x}_j^{(k-1)} : j \in \mathcal{N}(i) \right\} \right) \right), \qquad (8)$$
where $\mathbf{x}_i^{(0)} = \mathbf{Z}_{(i,:)}$ is the input feature of node $i$, $\mathbf{x}_i^{(k)}$ denotes the hidden state of node $i$ at the $k$-th layer, $\mathcal{N}(i)$ denotes the set of neighbors of node $i$, $\phi^{(k)}(\cdot)$ is a set function that aggregates information from the node's neighbors, and $\alpha^{(k)}(\cdot)$ is a function that combines the aggregated information with the node's own information. An illustration of the extension from CNNs to SGNNs is shown in Fig. 2. In particular, SGNNs include spatial deep learning for wireless scheduling [10] as a special case.

Fig. 2. Illustrations of CNNs and SGNNs. (a) In each layer of a CNN, each pixel convolves itself and its neighbor pixels. (b) In each layer of an SGNN [31], each node aggregates features from its neighbor nodes and combines them with its own hidden state. A CNN can be viewed as a special SGNN on the grid graph.

Despite the success of SGNNs in graph problems, it is difficult to directly apply SGNNs to radio resource allocation problems, as they cannot exploit the edge features. This means that they cannot incorporate channel states in wireless networks. We modify the definition in (8) to exploit edge features and will refer to the result as message passing graph neural networks (MPGNNs). The update rule for the $k$-th layer at vertex $i$ in an MPGNN is
$$\mathbf{x}_i^{(k)} = \alpha^{(k)}\left( \mathbf{x}_i^{(k-1)},\; \phi^{(k)}\left( \left\{ \left[ \mathbf{x}_j^{(k-1)}, \mathbf{e}_{j,i} \right] : j \in \mathcal{N}(i) \right\} \right) \right), \qquad (9)$$
where $\mathbf{e}_{j,i} = A_{(j,i,:)}$ is the feature of the edge $(j,i)$. We represent the output of an $S$-layer MPGNN as
$$\mathbf{X} = \left[ \mathbf{x}_1^{(S)}, \cdots, \mathbf{x}_{|V|}^{(S)} \right]^T \in \mathbb{C}^{|V| \times n}. \qquad (10)$$

The extension from SGNNs to MPGNNs is simple but crucial, for two reasons. First, MPGNNs respect the permutation equivariance property in Proposition II.3. Second, MPGNNs enjoy theoretical guarantees in radio resource management problems (as discussed in Section IV). These two properties are unique to MPGNNs and are not enjoyed by SGNNs.
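To make the update rule (9) concrete, the following NumPy sketch (purely illustrative; the toy dimensions, the max aggregation, and the linear-plus-tanh combination are assumptions for this example, not the paper's configuration) computes one MPGNN layer over an adjacency feature tensor in which an all-zero edge feature marks an absent edge.

import numpy as np

rng = np.random.default_rng(0)
K, d_node, d_edge = 4, 3, 2            # toy sizes (illustrative only)
X = rng.standard_normal((K, d_node))   # node hidden states x_i
A = rng.standard_normal((K, K, d_edge))
for i in range(K):
    A[i, i, :] = 0                     # no self-edges

# Toy aggregation phi (element-wise max over messages) and combination alpha.
W_self = rng.standard_normal((d_node, d_node))
W_msg = rng.standard_normal((d_node, d_node + d_edge))
phi = lambda msgs: np.max(np.stack(msgs), axis=0)
alpha = lambda x_i, agg: np.tanh(W_self @ x_i + W_msg @ agg)

def mpgnn_layer(X, A):
    # One MPGNN layer implementing (9): each node i aggregates the multiset of
    # concatenated [x_j, e_{j,i}] over its neighbors and combines the result
    # with its own hidden state.
    K = X.shape[0]
    out = []
    for i in range(K):
        msgs = [np.concatenate([X[j], A[j, i]]) for j in range(K)
                if j != i and np.any(A[j, i] != 0)]
        out.append(alpha(X[i], phi(msgs)))
    return np.stack(out)

print(mpgnn_layer(X, A).shape)         # (K, d_node)

Because the same phi and alpha are applied at every node, the layer's parameter count does not depend on the number of nodes, and the output permutes together with the input, in the sense formalized in the next subsection.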

C. Key Properties of MPGNNs

MPGNNs enjoy several properties that are favorable for solving large-scale radio resource management problems, as discussed in the sequel.

a) Permutation equivariance: We first show that MPGNNs satisfy the permutation equivariance property.

Proposition III.1. (Permutation equivariance in MPGNNs) Viewing the input-output mapping of an MPGNN defined in (9) as $\Phi: (\mathbf{Z}, A) \mapsto \mathbf{X}$, where $\mathbf{Z}$ is the node feature matrix, $A$ is the adjacency feature tensor, and $\mathbf{X}$ is the output of the MPGNN in (10), we have
$$\Phi((\pi \star \mathbf{Z}, \pi \star A)) = \pi \star \Phi((\mathbf{Z}, A)),$$
for any permutation operator $\pi$.

Remark. Please refer to Appendix B for a detailed proof.

The permutation equivariance property of GNNs improves the generalization of the neural networks. It also reduces the training sample complexity and the training time. As shown in Proposition II.3, radio resource management problems enjoy a permutation equivariance property. This means that the near-optimal solutions to a permuted problem are permutations of those to the original problem. GNNs respect this property well, while MLPs and CNNs do not. If a GNN performs well on a specific input, good generalization is guaranteed for any permutation of this input, which is not guaranteed by MLPs or CNNs. Thus, in radio resource management problems, GNNs enjoy better generalization than MLPs and CNNs. In contrast, to keep the input-output mapping of MLPs or CNNs permutation equivariant, data augmentation is needed: in principle, for each training sample, all of its permutations should be added to the training dataset. This leads to a higher training sample complexity and more training time for MLPs and CNNs.

b) Ability to generalize to different problem scales: In MLPs, the input and output sizes must be the same during training and testing. Hence, the number of agents in the test dataset must be equal to or less than that in the training dataset [7]. This means that MLP-based methods cannot be directly applied to a different problem size. In MPGNNs, each node has a copy of two sub-neural networks, i.e., $\alpha^{(k)}(\cdot)$ and $\phi^{(k)}(\cdot)$, whose input and output dimensions are invariant to the number of agents. Thus, we can train MPGNNs on small-scale problems and apply them to large-scale problems.

c) Fewer training samples: The required number of training samples for MPGNNs is much smaller than that for MLPs. The first reason is training sample reuse. Note that the neural network at each node is identical. For each training sample, each node receives a permuted version of it and processes it with $\alpha^{(k)}(\cdot)$ and $\phi^{(k)}(\cdot)$. Thus, each training sample is reused $L$ times for training $\{\alpha^{(k)}\}$ and $\{\phi^{(k)}\}$, where $L$ is the number of nodes. Second, the input and output dimensions of the aggregation and combination functions in MPGNNs are much smaller than those of the original problem, which allows the use of far fewer parameters in the neural networks.


d) High computational efficiency: In each layer, an aggregation function is applied to all the edges and a combination function is applied to all the nodes. Thus, the time complexity of each layer is $\mathcal{O}(|E| + |V|)$ and the overall time complexity of an $L$-layer MPGNN is $\mathcal{O}(L(|E| + |V|))$. The time complexity grows linearly with the number of agents when the maximal degree of the graph is bounded. Note that in MPGNNs, the aggregation and combination functions at each node can be executed in parallel. When an MPGNN is fully parallelized, e.g., on powerful GPUs, the time complexity is $\mathcal{O}(LD)$, where $D$ is the maximal degree of the graph. This is a constant time complexity when the maximal degree of the graph is bounded. We will verify this observation via simulations in Fig. 4.

D. An Effective Implementation of MPGNNs

In this subsection, we propose an effective implementation of MPGNNs for radio resource management problems, named the wireless channel graph convolution network (WCGCN), which is able to effectively incorporate both agent-related features and channel-related features. The design space for MPGNNs (9) is the choice of the set aggregation function $\phi(\cdot)$ and the combination function $\alpha(\cdot)$.

As general set functions are difficult to implement, an efficient implementation of $\phi(\cdot)$ was proposed in [33], which has the following form:
$$\phi(\{x_1, \cdots, x_n\}) = \psi(\{h(x_1), \cdots, h(x_n)\}),$$
where $x_i$, $1 \leq i \leq n$, are the elements in the set, $\psi(\cdot)$ is a simple function, e.g., max or sum, and $h(\cdot)$ is some existing neural network architecture, e.g., a linear mapping or an MLP. For $\alpha(\cdot)$ and $h(\cdot)$, linear mappings are adopted in popular GNN architectures (e.g., GCN [34] and S2V [35]). Nevertheless, as discussed in Section IV of [13], linear mappings have difficulty handling continuous features, which are ubiquitous in wireless networks (e.g., CSI). We adopt MLPs as $\alpha(\cdot)$ and $h(\cdot)$ for their approximation ability [27]. The MLP processing units enable the WCGCN to exploit complicated agent-related features and channel-related features in wireless networks.

For the aggregation function $\psi(\cdot)$, we notice that the following property holds if we use $\psi(\cdot) = \mathrm{MAX}(\cdot)$.

Theorem III.1. (Robustness to feature corruptions) [36] Suppose $u: \mathcal{X} \to \mathbb{R}^p$ such that $u = \mathrm{MAX}_{x_i \in S}\{h(x_i)\}$ and $f = \gamma \circ u$. Then,
(a) $\forall S$, $\exists\, \mathcal{C}_S, \mathcal{N}_S \subseteq \mathcal{X}$ such that $f(T) = f(S)$ if $\mathcal{C}_S \subseteq T \subseteq \mathcal{N}_S$;
(b) $|\mathcal{C}_S| \leq p$.

Theorem III.1 states that $f(S)$ remains unchanged under corruptions of the input as long as all the features in the critical set $\mathcal{C}_S$ are preserved, and that $\mathcal{C}_S$ contains only a limited number of features, no more than $p$. Specializing this to problems in wireless networks, the output of a layer remains unchanged even when the CSI on some links is heavily corrupted. In other words, the WCGCN is robust to missing CSI.
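A toy numerical illustration of this robustness (with arbitrary numbers chosen for this example, not taken from the paper): with max aggregation, dropping any message that attains no coordinate-wise maximum leaves the aggregated output, and hence the layer output, unchanged.

import numpy as np

# Messages h(x_i) received by a node from four neighbors (toy values).
messages = np.array([
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.1],
    [0.1, 0.2, 0.7],
    [0.3, 0.4, 0.2],   # attains no coordinate-wise maximum
])

full = np.max(messages, axis=0)            # aggregate over all messages
corrupted = np.max(messages[:3], axis=0)   # last message lost (e.g., missing CSI)

print(full)                                # [0.9 0.8 0.7]
print(np.allclose(full, corrupted))        # True: the aggregation is unchanged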

We next specify the architecture of the WCGCN, which aligns with traditional optimization algorithms. First, in traditional optimization algorithms, each iteration outputs an updated version of the optimization variables; in the WCGCN, each layer outputs an updated version of the optimization variables. Second, these algorithms are often time-invariant systems, e.g., gradient descent, WMMSE [18], and FPlinQ [37]. Thus, we share weights among the different layers of the WCGCN. The update of the $i$-th node in the $k$-th layer can be written as
$$\begin{aligned} \mathbf{y}_i^{(k)} &= \mathrm{MLP}_2\left( \mathbf{x}_i^{(k-1)},\; \mathrm{MAX}_{j \in \mathcal{N}(i)}\left\{ \mathrm{MLP}_1\left( \mathbf{x}_j^{(k-1)}, \mathbf{e}_{j,i} \right) \right\} \right), \\ \mathbf{x}_i^{(k)} &= \beta\left( \mathbf{y}_i^{(k)} \right), \end{aligned} \qquad (11)$$
where $\mathrm{MLP}_1$ and $\mathrm{MLP}_2$ are two different MLPs, $\beta$ is a differentiable normalization function that depends on the application, $\mathbf{y}_i^{(k)}$ denotes the output of $\mathrm{MLP}_2$ at the $i$-th node in the $k$-th layer, $\mathbf{x}_i^{(k)}$ denotes the hidden state, and $\mathcal{N}(i)$ denotes the set of neighbor nodes of node $i$. For the power control problem, we constrain the power between 0 and 1, and $\beta$ can be a sigmoid function, i.e., $\beta(x) = \frac{1}{1+\exp(-x)}$. For more general constraints, $\beta$ can be a differentiable projection layer [38].
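As an implementation sketch of one layer of (11) (assuming the PyTorch Geometric MessagePassing interface; the class name WCGCNConv and the hidden sizes are illustrative choices for this example and do not reproduce the paper's released code), the message, max aggregation, combination, and sigmoid normalization can be organized as follows.

import torch
import torch.nn as nn
from torch_geometric.nn import MessagePassing

class WCGCNConv(MessagePassing):
    # One WCGCN-style layer in the spirit of (11): messages are MLP1([x_j, e_ji]),
    # aggregated by max, combined with the node's own state by MLP2, and
    # normalized by a sigmoid (suitable for power control).
    def __init__(self, in_dim, edge_dim, hidden_dim, out_dim):
        super().__init__(aggr='max')
        self.mlp1 = nn.Sequential(
            nn.Linear(in_dim + edge_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))
        self.mlp2 = nn.Sequential(
            nn.Linear(in_dim + hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim))

    def forward(self, x, edge_index, edge_attr):
        return self.propagate(edge_index, x=x, edge_attr=edge_attr)

    def message(self, x_j, edge_attr):
        # MLP1 applied to the concatenation [x_j, e_{j,i}] on every edge.
        return self.mlp1(torch.cat([x_j, edge_attr], dim=-1))

    def update(self, aggr_out, x):
        # MLP2 combines the max-aggregated messages with the node's own state.
        return torch.sigmoid(self.mlp2(torch.cat([x, aggr_out], dim=-1)))

Stacking a few such layers with shared weights, as the paper's design suggests, and reading the final node outputs as transmit powers gives the power control variant of the WCGCN.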

Besides the benign properties of MPGNNs, the WCGCN enjoys several desirable properties for solving large-scale radio resource management problems. First, the WCGCN can effectively exploit features in multi-antenna systems with heterogeneous agents (e.g., channel states in multi-antenna systems and users' weights in weighted sum rate maximization). This is because the WCGCN adopts MLPs as processing units instead of linear mappings, which enables it to solve a wider class of radio resource management tasks than existing works [10], [13], [14] (e.g., beamforming problems and weighted sum rate maximization). Second, it is robust to partial and imperfect CSI, as suggested by Theorem III.1.

IV. THEORETICAL ANALYSIS OF MPGNN-BASED RADIO RESOURCE MANAGEMENT

In this section, we investigate the performance and generalization of MPGNNs. We first prove the equivalence between MPGNNs and a family of distributed algorithms, which includes many classic algorithms for radio resource management as special cases, e.g., WMMSE [18]. Based on this observation, we analyze the performance of MPGNN-based methods for the weighted sum rate maximization problem.

A. Simplifications

To provide theoretical guarantees for "learning to optimize" approaches to radio resource management problems, it is critical to understand the performance and generalization of neural network-based methods. Unfortunately, the training and generalization of neural networks are still open problems. We make several commonly adopted simplifications to make the performance analysis tractable. First, we focus on the MPGNN class instead of any specific neural network architecture such as GCNs. Following Lemma 5 and Corollary 6 in [31], we can design an MPGNN with MLP processing units that is as powerful as the MPGNN class, and thus this simplification serves our purpose well. Second, we aim at proving the existence of an MPGNN with a performance guarantee. Because


we train the neural network with stochastic gradient descent and limited training samples in the simulations, we may not find the corresponding neural network parameters. While this may leave some gap between theory and practice, our result is an important first step. These two simplifications have been commonly adopted in the performance analysis of GNNs [31], [39], [40].

B. Equivalence of MPGNNs and Distributed Optimization

Compared with neural network-based radio resource management, optimization-based radio resource management has been well studied. Thus, it is desirable to make connections between these two types of methods. In [39], the equivalence between SGNNs in (8) and graph optimization algorithms was proved. Based on this result, we shall establish the equivalence between MPGNNs and a family of distributed radio resource management algorithms.

We first give a brief introduction to distributed local algorithms, following [41]. The maximal degree of the nodes in the graph is assumed to be bounded. Distributed local algorithms are a family of iterative algorithms in a multi-agent system. In each iteration, each agent sends messages to its neighbors, receives messages from its neighbors, and updates its state based on the received messages. The algorithm terminates after a constant number of iterations.

We focus on a sub-class of distributed local algorithms, termed multiset broadcasting distributed local algorithms (MB-DLAs) [41], which includes a wide range of radio resource management algorithms in wireless networks, e.g., DTP [42], WMMSE [18], FPlinQ [37], and first-order methods for network utility problems [4]. "Multiset" and "broadcasting" refer to the ways of receiving and sending messages, respectively. Denote $\mathbf{x}_i^{(l)}$ as the state of node $i$ at the $l$-th iteration; the MB-DLA is shown in Algorithm 1.

Algorithm 1 Multiset broadcasting distributed local algorithm [41]
1: Initialize all internal states $\mathbf{x}_k^{(0)}$, $\forall k$.
2: for communication round $t = 1, \cdots, T$ do
3:   agent $k$ sends $h_1^{(t)}(\mathbf{x}_k^{(t-1)})$ to all its edges, $\forall k$
4:   agent $k$ receives $\left\{ \mathbf{m}_{j,k}^{(t)} \;\middle|\; \mathbf{m}_{j,k}^{(t)} = h_2^{(t)}\left( h_1^{(t)}(\mathbf{x}_j^{(t-1)}), \mathbf{e}_{j,k} \right),\; j \in \mathcal{N}(k) \right\}$ from the edges, $\forall k$
5:   agent $k$ updates its internal state $\mathbf{x}_k^{(t)} = g_2^{(t)}\left( \mathbf{x}_k^{(t-1)},\; g_1^{(t)}\left( \left\{ \mathbf{m}_{j,k}^{(t)} : j \in \mathcal{N}(k) \right\} \right) \right)$, $\forall k$.
6: end for
7: Output $\mathbf{x}_k^{(T)}$.
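The following minimal Python sketch (illustrative only; the maps h1, h2, g1, g2 and the toy values are placeholders for the algorithm-specific functions in Algorithm 1, not a real resource management algorithm) spells out one communication round of an MB-DLA and mirrors the layer structure of an MPGNN in (9).

def mb_dla_round(x, E, e, h1, h2, g1, g2):
    # x[k]     : internal state of agent k
    # E        : list of directed edges (j, k)
    # e[(j,k)] : feature of edge (j, k)
    # h1, h2   : broadcast / edge-message maps;  g1, g2 : aggregation / update maps
    K = len(x)
    broadcast = [h1(x[k]) for k in range(K)]                  # line 3
    inbox = {k: [] for k in range(K)}
    for (j, k) in E:                                          # line 4
        inbox[k].append(h2(broadcast[j], e[(j, k)]))
    return [g2(x[k], g1(inbox[k])) for k in range(K)]         # line 5

# Toy instantiation (placeholders only).
h1 = lambda xk: xk
h2 = lambda msg, edge: msg * edge
g1 = lambda msgs: max(msgs) if msgs else 0.0
g2 = lambda xk, agg: 0.5 * xk + 0.5 * agg

x = [1.0, 2.0, 3.0]
E = [(0, 1), (1, 0), (1, 2), (2, 1)]
e = {edge: 0.1 for edge in E}
print(mb_dla_round(x, E, e, h1, h2, g1, g2))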

The equivalence between MPGNNs and MB-DLAs is rooted in the similarity of their definitions. In each iteration of an MB-DLA, each agent aggregates messages from neighbor agents and updates its local state. In each layer of an MPGNN, each node aggregates features from neighbor nodes. The equivalence can be drawn if we view the agents as nodes in a graph and the messages as features. The following theorem states the equivalence of MPGNNs and MB-DLAs formally.

Theorem IV.1. Let MB-DLA($T$) denote the family of MB-DLAs with $T$ iterations and MPGNN($T$) the family of MPGNNs with $T$ layers. Then the following two conclusions hold.

1) For any MPGNN($T$), there exists a distributed local algorithm in MB-DLA($T$) that solves the same set of problems as the MPGNN($T$).

2) For any algorithm in MB-DLA($T$), there exists an MPGNN($T$) that solves the same set of problems as this algorithm.

Remark. Please refer to Appendix C for a detailed proof.

The equivalence allows us to analyze the performance of MPGNNs by studying the performance of MB-DLAs. The first result shows that MPGNNs are at most as powerful as MB-DLAs. The implication is that if we can prove that no MB-DLA is capable of solving a specific radio resource management problem, then MPGNNs cannot solve it either. This can be used to prove a performance upper bound for MPGNNs. The second result shows that MPGNNs are as powerful as MB-DLAs in radio resource management problems. This implies that if we can identify an MB-DLA that solves a radio resource management problem well, then there exists an MPGNN that performs better or at least competitively. The generalization is also as good as that of the corresponding MB-DLA. We shall give a specific example on sum rate maximization in the next subsection.

C. Performance and Generalization of MPGNNs

In this subsection, we use the tools developed in the last subsection to analyze the performance and generalization of MPGNNs for the sum rate maximization problem. The analysis is built on the observation that a classic algorithm for the sum rate maximization problem, i.e., WMMSE, is an MB-DLA under some conditions, which is formally stated below. We shall refer to the MB-DLA corresponding to WMMSE as WMMSE-DLA.

Proposition IV.1. When the maximal number of interference neighbors is bounded by some constant, WMMSE with a constant number of iterations is an MB-DLA.

Remark. When the problem sizes in the training dataset and the test dataset are the same, we can always assume that the number of interference neighbors is a common constant. The restriction to a constant number of interference neighbors only influences the generalization. Please refer to Appendix D for a detailed proof.

a) Performance: Proposition IV.1 shows that WMMSE is an MB-DLA. Thus, when the problem sizes in the training dataset and the test dataset are the same, there exists an MPGNN whose performance is as good as that of WMMSE. As WMMSE is hand-crafted, it is not optimal in terms of the number of iterations. By employing an unsupervised loss function, we expect that MPGNNs can learn an algorithm that requires fewer iterations and possibly enjoys better performance. In Fig. 3, we observe that a 1-layer MPGNN outperforms WMMSE with 10 iterations and a 2-layer MPGNN outperforms WMMSE with 30 iterations.

b) Generalization: To avoid excessive training costs, it is desirable to first train a neural network on small-scale problems and then generalize it to large-scale ones. An intriguing question is when such generalization is reliable. Compared with WMMSE, the WMMSE-DLA has two constraints: both the number of iterations and the maximal number of interference neighbors should be bounded by some constants. As agents that are far away cause little interference, the number of interference neighbors can be assumed to be fixed when the user density is kept the same. As a result, the performance of MPGNNs is stable when the user density in the test dataset is a constant multiple of the user density in the training dataset. We will verify this by simulations in Table IV and Table VII.

V. SIMULATION RESULTS

In this section, we provide simulation results to verify the effectiveness of the proposed neural network architecture in three applications. The first application is sum rate maximization in a Gaussian interference channel, which is a classic application for deep learning-based methods. We use this application to compare the proposed method with MLP-based methods [9] and optimization-based methods [18]. The second application is weighted sum rate maximization, and the third is beamformer design. The last two problems cannot be solved by the existing methods in [10], [13], [14].

For the neural network setting, we adopt a 3-layer WCGCN, implemented with PyTorch Geometric [33]. During training, the neural network takes the channel states and the users' weights as input and outputs the beamforming vector for each user. We apply the following loss function at the last layer of the neural network:

$$\ell(\boldsymbol{\Theta}) = -\mathbb{E}\left( \sum_{k=1}^{K} w_k \log_2\left( 1 + \frac{|\mathbf{h}_{k,k}^H \mathbf{v}_k(\boldsymbol{\Theta})|^2}{\sum_{j \neq k}^{K} |\mathbf{h}_{j,k}^H \mathbf{v}_j(\boldsymbol{\Theta})|^2 + \sigma_k^2} \right) \right),$$
where $\boldsymbol{\Theta}$ denotes the weights of the neural network and the expectation is taken over all channel realizations. With this loss function, no labels are required, and thus it is an unsupervised learning method. In the training stage, we adopt the Adam optimizer [43] with a learning rate of 0.001 to optimize the neural network. In the test stage, the input of the neural network consists of the channel states and the users' weights, and the output is the beamforming vectors. The SGD (Adam) optimizer is not needed in the test stage.
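A minimal PyTorch sketch of this unsupervised loss (illustrative only; the tensor shapes, names, and the use of complex tensors are assumptions for this example, not the paper's code) averages the negative weighted sum rate over a mini-batch of channel realizations.

import torch

def neg_weighted_sum_rate(V, H, w, sigma2):
    # Unsupervised loss: negative weighted sum rate, averaged over a batch.
    # V : (B, K, Nt) complex beamformers v_k(Theta) produced by the network
    # H : (B, K, K, Nt) complex channels, H[:, j, k, :] = h_{j,k}
    # w : (B, K) weights;  sigma2 : (B, K) noise powers
    inner = torch.einsum('bjkn,bjn->bjk', H.conj(), V)       # h_{j,k}^H v_j
    power = inner.abs() ** 2
    signal = torch.diagonal(power, dim1=1, dim2=2)           # |h_{k,k}^H v_k|^2
    interference = power.sum(dim=1) - signal                 # sum over j != k
    rate = torch.log2(1.0 + signal / (interference + sigma2))
    return -(w * rate).sum(dim=1).mean()

# Toy usage with random data (illustrative only).
B, K, Nt = 8, 5, 2
H = torch.randn(B, K, K, Nt, dtype=torch.cfloat)
V = torch.randn(B, K, Nt, dtype=torch.cfloat, requires_grad=True)
loss = neg_weighted_sum_rate(V, H, torch.ones(B, K), 0.1 * torch.ones(B, K))
loss.backward()   # gradients flow back to the beamformers / network weights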

A. Sum Rate Maximization

We first consider the sum rate maximization problem in a single-antenna Gaussian interference channel. This problem is a special case of (5) with $N_t = 1$, $h_{j,k} \sim \mathcal{CN}(0,1)$, $\forall j,k$, and $w_k = 1$, $\forall k$.

We consider the following benchmarks for comparison.
• WMMSE [18]: This is a classic optimization-based algorithm for sum utility maximization in MIMO interfering broadcast channels. We run WMMSE for 100 iterations with random initialization.
• WMMSE 100 times: For each channel realization, we run the WMMSE algorithm 100 times and take the best result as the performance. This is often used as a performance upper bound.
• Strongest: We find a fixed proportion of pairs with the largest direct channel gain $|h_{i,i}|$ and set the power of these pairs to $P_{\max}$, while the power levels of the remaining pairs are set to 0. This is a simple baseline algorithm without any knowledge of the interference links.
• PCNet [9]: PCNet is an MLP-based method particularly designed for the sum rate maximization problem with single-antenna channels.

We use $10^4$ training samples for the WCGCN and $10^7$ training samples for PCNet. For the specific parameter setting of the WCGCN (11), we set the hidden units of MLP$_1$ in (11) as $\{5, 32, 32\}$, MLP$_2$ as $\{35, 16, 1\}$, and $\beta(\cdot)$ as the sigmoid function.2 The performance of the different methods is shown in Table I. The SNR and the number of users are kept the same in the training and test datasets. For all the tables shown in this section, the entries are the (weighted) sum rates achieved by different methods normalized by the sum rate of WMMSE. We see that both PCNet and the WCGCN achieve near-optimal performance when the problem scale is small. As the problem scale becomes large, the performance of PCNet approaches that of Strongest, which shows that it can hardly learn any valuable information about the interference links. Nevertheless, the performance of the WCGCN remains stable as the problem size increases. Thus, GNNs are more favorable than MLPs for medium-scale or large-scale problems.

TABLE I
AVERAGE SUM RATE UNDER EACH SETTING. THE RESULTS ARE NORMALIZED BY THE SUM RATE OF WMMSE.

SNR    Links   WCGCN    PCNet    Strongest   WMMSE 100 times
0dB    10      100.0%   98.9%    87.1%       102.0%
0dB    30      97.9%    87.4%    82.8%       101.3%
0dB    50      97.1%    79.7%    80.6%       101.4%
10dB   10      103.1%   101.8%   74.4%       104.8%
10dB   30      103.4%   74.0%    70.0%       105.0%
10dB   50      102.5%   67.0%    68.9%       104.7%

We further compare the performance of the WCGCN and WMMSE with different numbers of iterations. We use the system setting $K = 50$, SNR = 10dB. Both WMMSE and the WCGCN start from the same initialization point. The results are shown in Fig. 3. From the figure, we see that a 1-layer WCGCN outperforms WMMSE with 10 iterations and a 2-layer WCGCN outperforms WMMSE with 30 iterations. This indicates that, by adopting the unsupervised loss function, the WCGCN can learn a much better message-passing algorithm than the handcrafted WMMSE.

B. Weighted Sum Rate Maximization

In this application, we consider $K$ single-antenna transceiver pairs within an $A \times A$ area. The transmitters are randomly located in the $A \times A$ area, while each receiver is uniformly distributed within $[d_{\min}, d_{\max}]$ from the corresponding transmitter. We adopt the channel model from [19] and use 10000 training samples for each setting. To reduce the CSI training overhead,

2The performance of the WCGCN is not sensitive to the number of hidden units.


Fig. 3. A comparison between the WCGCN and WMMSE with different numbers of iterations (x-axis: number of WMMSE iterations; y-axis: sum rate in bits/s/Hz; curves: WMMSE, 1-layer WCGCN, 2-layer WCGCN).

we assume that $h_{j,k}$ is available to the WCGCN only if the distance between transmitter $j$ and receiver $k$ is within 500 meters. To provide a performance upper bound, global CSI is assumed to be available to WMMSE. The weights for weighted sum rate maximization, i.e., $w_k$ in (5), are generated from a uniform distribution on $[0,1]$ in both the training and test datasets. For the specific parameter setting of the WCGCN (11), we set the hidden units of MLP$_1$ as $\{5, 32, 32\}$, MLP$_2$ as $\{35, 16, 1\}$, and $\beta(\cdot)$ as the sigmoid function.

a) Performance comparison: We first test the performance of the WCGCN when the number of pairs is the same in the training and test datasets. Specifically, we consider $K = 50$ pairs in a 1000m × 1000m region. We test the performance of the WCGCN with different values of $d_{\min}$ and $d_{\max}$, as shown in Table II. The entries in the table are the sum rates achieved by different methods. We observe that the WCGCN with local CSI achieves competitive performance to WMMSE with global CSI.

TABLE II
AVERAGE SUM RATE PERFORMANCE OF 50 TRANSCEIVER PAIRS.

(d_min, d_max)   (2m, 65m)   (10m, 50m)   (30m, 70m)   (30m, 30m)
WCGCN            97.8%       97.5%        96.5%        96.8%

Next, to test the generalization capability of the proposedmethod, we train WCGCN on a wireless network with tensof users and test it on wireless networks with hundreds orthousands of users, as shown in the following two simulations.

b) Generalization to larger scales: We first train the WCGCN with 50 pairs in a 1000m × 1000m region. We then change the number of pairs in the test set while the density of users (i.e., 𝐴²/𝐾) is fixed. The results are shown in Table III. It can be observed that the performance is stable as the number of users increases. It also shows that WCGCN generalizes well to larger problem scales, which is consistent with our analysis.

TABLE III
GENERALIZATION TO LARGER PROBLEM SCALES BUT SAME DENSITY.

Links   Size (m²)     (𝑑min, 𝑑max) = (10m, 50m)   (𝑑min, 𝑑max) = (30m, 30m)
200     2000 × 2000   98.3%                       98.1%
400     2828 × 2828   98.9%                       98.2%
600     3464 × 3464   98.8%                       98.7%
800     4000 × 4000   98.9%                       98.6%
1000    4772 × 4772   98.9%                       98.7%

c) Generalization to higher densities: In this test, we first train the WCGCN with 50 pairs in a 1000m × 1000m region. We then change the number of pairs in the test set while fixing the area size. The results are shown in Table IV, and the performance loss compared with 𝐾 = 50 is shown in brackets. The performance is stable up to a 4-fold increase in the density, and good performance is achieved even when there is a 10-fold increase in the density.

TABLE IV
GENERALIZATION OVER DIFFERENT LINK DENSITIES. THE PERFORMANCE LOSS COMPARED TO 𝐾 = 50 IS SHOWN IN BRACKETS.

Links   Size (m²)     (𝑑min, 𝑑max) = (10m, 50m)   (𝑑min, 𝑑max) = (30m, 30m)
100     1000 × 1000   97.6% (+0.1%)               96.4% (−0.1%)
200     1000 × 1000   97.0% (−0.5%)               96.0% (−0.5%)
300     1000 × 1000   95.9% (−1.6%)               94.9% (−1.6%)
400     1000 × 1000   95.6% (−1.9%)               94.5% (−2.0%)
500     1000 × 1000   95.3% (−2.2%)               94.5% (−2.0%)

C. Beamformer Design

In this subsection, we consider beamforming for sum rate maximization in (5). Specifically, we consider 𝐾 transceiver pairs within an 𝐴 × 𝐴 area, where the transmitters are equipped with multiple antennas and each receiver is equipped with a single antenna. The transmitters are generated uniformly in the area and the receivers are generated uniformly within [𝑑min, 𝑑max] from the corresponding transmitters. We adopt the channel model in [19] and use 50000 training samples for each setting. The assumption on the available CSI for WCGCN and WMMSE is the same as in the previous subsection. In WCGCN, a complex number is treated as two real numbers. For a specific parameter setting of WCGCN (11), we set the hidden units of MLP1 as {6𝑁𝑡, 64, 64}, MLP2 as {64 + 4𝑁𝑡, 32, 2𝑁𝑡}, and 𝛽(x) = x / max(‖x‖₂, 1).
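The output mapping 𝛽(x) = x / max(‖x‖₂, 1) is a projection onto the unit ball; below is a small sketch, assuming each node outputs 2𝑁𝑡 real numbers that encode the real and imaginary parts of its beamformer (a subsequent scaling by √𝑃max, not shown, would then enforce the power constraint).

# A small sketch of beta(x) = x / max(||x||_2, 1) applied row-wise to the
# per-node outputs, each a 2*N_t real vector encoding a complex beamformer.
import torch


def beta(x):
    norm = x.norm(dim=-1, keepdim=True)
    return x / torch.clamp(norm, min=1.0)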

a) Performance comparison: We first test the performance of WCGCN when the number of pairs in the training dataset and the number of pairs in the test dataset are the same. Specifically, we consider 𝐾 = 50 pairs in a 1000 meters by 1000 meters region, and each transmitter is equipped with 2 antennas. We test the performance of WCGCN with different 𝑑min and 𝑑max. The results are shown in Table V. We observe that WCGCN with local CSI achieves comparable performance to WMMSE, demonstrating the applicability of the proposed method to multi-antenna systems.

b) Generalization to larger scales: We first train the WCGCN with 50 pairs in a 1000 meters by 1000 meters region with 𝑁𝑡 = 2. We then change the number of pairs while the density of users (i.e., 𝐴²/𝐾) is fixed. The results are shown in Table VI.


TABLE V
AVERAGE SUM RATE PERFORMANCE OF 50 TRANSCEIVER PAIRS WITH 𝑁𝑡 = 2. THE RESULTS ARE NORMALIZED BY THE SUM RATE OF WMMSE.

(𝑑min, 𝑑max)   (2m, 65m)   (10m, 50m)   (30m, 70m)   (30m, 30m)
WCGCN          97.1%       96.0%        94.1%        96.2%

The performance is stable as the number of users increases, which is consistent with our theoretical analysis.

TABLE VI
GENERALIZATION TO LARGER PROBLEM SCALES BUT SAME DENSITY.

Links   Size (m²)     (𝑑min, 𝑑max) = (2m, 65m)   (𝑑min, 𝑑max) = (10m, 50m)
200     2000 × 2000   97.3%                      96.8%
400     2828 × 2828   97.3%                      96.7%
600     3464 × 3464   97.2%                      96.5%
800     4000 × 4000   97.2%                      96.5%
1000    4772 × 4772   97.2%                      96.4%

c) Generalization to larger densities: We first train the WCGCN with 50 pairs in a 1000 meters by 1000 meters region with 𝑁𝑡 = 2. We then change the number of pairs while fixing the area size. The results are shown in Table VII, and the performance loss compared with 𝐾 = 50 is shown in brackets. The performance is stable up to a 2-fold increase in the density, and satisfactory performance is achieved up to a 4-fold increase in the density. The performance deteriorates when the density grows further, which indicates that extra training is needed when the density in the test dataset is much larger than that of the training dataset.

TABLE VII
GENERALIZATION OVER DIFFERENT LINK DENSITIES. THE PERFORMANCE LOSS COMPARED TO 𝐾 = 50 IS SHOWN IN BRACKETS.

Links   Size (m²)     (𝑑min, 𝑑max) = (10m, 50m)   (𝑑min, 𝑑max) = (30m, 30m)
100     1000 × 1000   97.0% (−0.1%)               95.7% (−0.3%)
200     1000 × 1000   95.8% (−1.3%)               94.4% (−1.6%)
300     1000 × 1000   94.5% (−2.6%)               93.0% (−3.0%)
400     1000 × 1000   92.5% (−4.6%)               92.0% (−4.0%)
500     1000 × 1000   91.4% (−5.7%)               90.7% (−5.3%)

d) Computation time comparison: This test compares the running time of different methods for different problem scales. We run "WCGCN GPU" on a GeForce GTX 1080 Ti, while the other methods run on an Intel(R) Xeon(R) CPU E5-2643 v4 @ 3.40GHz. The implementation of the neural networks exploits the parallel computation of the GPU, while WMMSE cannot do so due to its sequential computation flow. The running time is averaged over 50 problem instances and shown in Fig. 4. The speedup compared with WMMSE becomes larger as the problem scale increases, which benefits from the low computational complexity of WCGCN. As shown in the figure, the computation time of WCGCN CPU grows linearly with the number of users while that of WCGCN GPU is nearly constant, which is consistent with our analysis in Section III-C. Remarkably, WCGCN is able to solve the problem with 1000 users within 6 milliseconds.
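As a methodological note, GPU inference time has to be measured with explicit synchronization because CUDA kernels are launched asynchronously; the snippet below is a hedged sketch of how the per-instance running time could be measured, with model and data as placeholders.

# A hedged sketch of timing GPU inference; model and data are placeholders.
import time
import torch


def gpu_inference_time(model, data, n_runs=50):
    model.eval()
    with torch.no_grad():
        model(data)                              # warm-up run
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model(data)
        torch.cuda.synchronize()                 # wait for all kernels to finish
    return (time.perf_counter() - start) / n_runs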

Fig. 4. Computation time comparison of different methods (computation time in seconds versus the number of users, for WMMSE, WCGCN CPU, and WCGCN GPU).

VI. CONCLUSIONS

In this paper, we developed a scalable neural network architecture based on GNNs to solve radio resource management problems. In contrast to existing learning-based methods, we focused on the neural architecture design to meet the key performance requirements, including low training cost, high computational efficiency, and good generalization. Moreover, we theoretically connected learning-based methods and optimization-based methods, which sheds light on the performance guarantees of learning-to-optimize approaches. We believe that this investigation will lead to profound implications in both theoretical and practical aspects. As for future directions, it will be interesting to investigate the distributed deployment of MPGNNs for radio resource management in wireless networks, and to extend our theoretical results to more general application scenarios.

APPENDIX A
PROOF OF PROPOSITION II.2

Following Proposition II.1, we have

𝑔(𝚪, A) = 𝑔(𝜋 ★ 𝚪, 𝜋 ★ A),   𝑄(𝚪, A) = 𝑄(𝜋 ★ 𝚪, 𝜋 ★ A),    (12)

for any variable 𝚪, adjacency matrix A, and permutation 𝜋.

For any 𝚪 ∈ R_g^α, we have

𝑔(𝜋 ★ 𝚪, 𝜋 ★ A) = 𝑔(𝚪, A) ≤ 𝛼,   𝑄(𝜋 ★ 𝚪, 𝜋 ★ A) = 𝑄(𝚪, A) ≤ 0.    (13)

Combining (12) and (13), we have

𝐹(𝜋 ★ A) = {𝚪 | 𝑔(𝚪, 𝜋 ★ A) ≤ 𝛼, 𝑄(𝚪, 𝜋 ★ A) ≤ 0}
         = {𝜋 ★ 𝚪 | 𝑔(𝜋 ★ 𝚪, 𝜋 ★ A) ≤ 𝛼, 𝑄(𝜋 ★ 𝚪, 𝜋 ★ A) ≤ 0}
         = {𝜋 ★ 𝚪 | 𝑔(𝚪, A) ≤ 𝛼, 𝑄(𝚪, A) ≤ 0}
         = {𝜋 ★ 𝚪 | 𝚪 ∈ R_g^α}.


APPENDIX B
PROOF OF PROPOSITION III.1

In the original graph, denote the input feature of node i by x_i^(0), the edge feature of edge (j, i) by e_{j,i}, and the output of the k-th layer at node i by x_i^(k). In the permuted graph, denote the corresponding quantities by x'_i^(0), e'_{j,i}, and x'_i^(k). Due to the permutation relationship, we have

e'_{𝜋(j),𝜋(i)} = e_{j,i},   x'_{𝜋(i)}^(0) = x_i^(0),   N(𝜋(i)) = {𝜋(j) : j ∈ N(i)}.    (14)

For any fixed 𝜋 and n, we prove x_i^(n) = x'_{𝜋(i)}^(n), ∀i, by induction.

1) The base case n = 0 follows from (14).

2) Assume x_i^(s−1) = x'_{𝜋(i)}^(s−1), ∀i, holds for n = s − 1; we show that the claim holds for n = s. In the s-th layer, the following update rule is applied:

x_i^(s) = 𝛼^(s)( x_i^(s−1), 𝜙^(s)({ [x_j^(s−1), e_{j,i}] : j ∈ N(i) }) ),
x'_{𝜋(i)}^(s) = 𝛼^(s)( x'_{𝜋(i)}^(s−1), 𝜙^(s)({ [x'_j^(s−1), e'_{j,𝜋(i)}] : j ∈ N(𝜋(i)) }) ).    (15)

Following (14), (15), and the induction hypothesis, we have

e'_{𝜋(j),𝜋(i)} = e_{j,i},   x'_{𝜋(i)}^(s−1) = x_i^(s−1),   N(𝜋(i)) = {𝜋(j) : j ∈ N(i)}.    (16)

Plugging (16) into (15), we obtain x_i^(s) = x'_{𝜋(i)}^(s). Thus, for an S-layer MPGNN, the output satisfies

x_i^(S) = x'_{𝜋(i)}^(S).    (17)

The output matrix of the original graph is X = [x_1, · · ·, x_{|V|}]^T and the output matrix of the permuted graph is X' = [x'_1, · · ·, x'_{|V|}]^T. Thus X_{(i,:)} = X'_{(𝜋(i),:)} and we have

Φ(𝜋 ★ Z, 𝜋 ★ A) = X' = 𝜋 ★ X = 𝜋 ★ Φ(Z, A).
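The property proved above is easy to check numerically; the toy sketch below uses a max-aggregation update as a stand-in for an MPGNN layer (not the exact update (11)) and verifies that permuting the nodes, together with the edge features, permutes the outputs in the same way.

# A numerical check of permutation equivariance for a toy message passing layer.
import numpy as np


def mpgnn_layer(X, E):
    # X: (K, d) node states; E: (K, K, e) edge features on a fully connected graph
    K, d = X.shape
    msgs = np.concatenate([np.broadcast_to(X[:, None, :], (K, K, d)), E], axis=-1)
    agg = msgs.max(axis=0)                       # max over senders j, for each receiver i
    return np.concatenate([X, agg], axis=-1)     # [x_i, aggregated message]


K, d, e = 5, 3, 2
X, E = np.random.randn(K, d), np.random.randn(K, K, e)
perm = np.random.permutation(K)
out = mpgnn_layer(X, E)
out_perm = mpgnn_layer(X[perm], E[np.ix_(perm, perm)])
assert np.allclose(out_perm, out[perm])          # Phi(pi*Z, pi*A) = pi*Phi(Z, A)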

APPENDIX C
PROOF OF THEOREM IV.1

In MB-DLAs, the maximal node degree is assumed to be bounded by some constant, denoted by Δ. The total number of iterations of the MB-DLA and the number of layers of the MPGNN are both denoted by S. The update of the MB-DLA at the t-th iteration can be written as

m_{i,j}^(t) = h_2^(t)( h_1^(t)(x_i^(t−1)), e_{i,j} ),
x_i^(t) = g_2^(t)( x_i^(t−1), g_1^(t)({ m_{j,i}^(t) : j ∈ N(i) }) ).    (18)

The update of an MPGNN at the k-th layer can be written as

x'_i^(k) = 𝛼^(k)( x'_i^(k−1), 𝜙^(k)({ [x'_j^(k−1), e_{j,i}] : j ∈ N(i) }) ).    (19)

1) We first show that the inference stage of an MPGNN can be viewed as an MB-DLA, i.e., for all 0 ≤ s ≤ S and {𝛼^(k), 𝜙^(k)}_{1≤k≤s}, there exist {h_2^(t), h_1^(t), g_2^(t), g_1^(t)}_{1≤t≤s} such that x_i^(s) = x'_i^(s). We prove it by induction. The base case x_i^(0) = x'_i^(0) holds because both x_i^(0) and x'_i^(0) are the node features of the same node. We then assume x_i^(n) = x'_i^(n) holds for n = s − 1. For n = s, we construct

y_i = h_1^(s)(x_i^(s−1)) = x_i^(s−1),    h_2^(s)(y_i, e_{i,j}) = [y_i, e_{i,j}],

and thus

m_{j,i}^(s) = [x_j^(s−1), e_{j,i}].

Letting g_2^(s) = 𝛼^(s) and g_1^(s) = 𝜙^(s), we have x_i^(s) = x'_i^(s).

2) We then show that an MB-DLA can be viewed as an MPGNN, i.e., for all 0 ≤ s ≤ S and {h_2^(t), h_1^(t), g_2^(t), g_1^(t)}_{1≤t≤s}, there exist {𝛼^(k), 𝜙^(k)}_{1≤k≤s} such that x_i^(s) = x'_i^(s). We prove it by induction. The base case x_i^(0) = x'_i^(0) holds because both are the node features of the same node. We then assume x_i^(n) = x'_i^(n) holds for n = s − 1. We first introduce some notation. For a set of vectors Z = {z_1, · · ·, z_Δ} with z_i ∈ C^n, we order the elements of the set by the real part of their first coordinate, and let 𝜏_i(Z), 1 ≤ i ≤ Δ, denote the function that selects the i-th element of the multiset Z. Denote y_{j,i} = [x_j^(s−1), e_{j,i}], Y_i = {y_{1,i}, · · ·, y_{Δ,i}}, and define 𝜒_1 : x ↦ x_(1:d_1) and 𝜒_2 : x ↦ x_(d_1+1:d_2). We construct

𝜙^(s)(Y_i) = g_1^(s)( { h_2^(s)(h_1^(s)(𝜒_1(𝜏_1(Y_i))), 𝜒_2(𝜏_1(Y_i))), · · ·, h_2^(s)(h_1^(s)(𝜒_1(𝜏_Δ(Y_i))), 𝜒_2(𝜏_Δ(Y_i))) } )

and 𝛼^(s) = g_2^(s); we then obtain x_i^(s) = x'_i^(s).

APPENDIX D
PROOF OF PROPOSITION IV.1

Here, we consider WMMSE [18] in the original paper's setting, which includes (5) as a special case. The WMMSE algorithm considers a K-cell interfering broadcast channel where base station (BS) k serves I_k users. Denote H_{i_k,j} as the channel from BS j to user i_k, V_{i_k} as the beamformer that BS k uses to transmit symbols to user i_k, w_{i_k} as the weight of user i_k, and σ²_{i_k} as the noise variance at user i_k. The problem formulation is

maximize_V   Σ_{k=1}^{K} Σ_{i=1}^{I_k} w_{i_k} log det( I + H_{i_k,k} V_{i_k} V_{i_k}^H H_{i_k,k}^H ( Σ_{(l,j)≠(i,k)} H_{i_k,j} V_{l_j} V_{l_j}^H H_{i_k,j}^H + σ²_{i_k} I )^{−1} )
subject to   Tr(V_{i_k} V_{i_k}^H) ≤ P_max, ∀k.

The WMMSE algorithm is shown in Algorithm 2. We first model this system as a graph. We treat the i_k-th user as the i_k-th node in the graph. The node features are [w_{i_k}, σ_{i_k}, H_{i_k,k}]. The internal state of node i_k at the (p−1)-th iteration is [U_{i_k}^(p−1), W_{i_k}^(p−1), V_{i_k}^(p−1), w_{i_k}, σ_{i_k}, H_{i_k,k}]. An edge is drawn from the l_j-th node to the i_k-th node for all l if there is an interference link between the j-th BS and the i_k-th user. The edge feature of the edge (l_j, i_k) is e_{l_j,i_k} = [H_{i_k,j}, H_{l_j,k}].

We show that the WMMSE algorithm with T iterations is an MB-DLA with at most 2T iterations. In the corresponding MB-DLA, we update the variables U_{i_k} and W_{i_k} at the odd iterations and update the variable V_{i_k} at the even iterations. Specifically, at the p-th iteration with p odd, the l_j-th node broadcasts its state V_{l_j}^(p−1) along its edges. The edge (l_j, i_k) processes the message by forming m_{l_j,i_k} = H_{i_k,j} V_{l_j}^(p−1) V_{l_j}^(p−1)H H_{i_k,j}^H, and node i_k receives the message set {m_{l_j,i_k}, ∀l, j}. Node i_k first sums the messages, M_{i_k} = Σ_{l,j} m_{l_j,i_k}. Then the i_k-th node updates its internal state as U_{i_k}^(p) = (M_{i_k} + σ²_{i_k} I)^{−1} H_{i_k,k} V_{i_k}^(p−1) and W_{i_k}^(p) = (I − U_{i_k}^(p)H H_{i_k,k} V_{i_k}^(p−1))^{−1}. Specifically, at the p-th layer, we construct

x_{l_j}^(p−1) = [U_{l_j}^(p−1), W_{l_j}^(p−1), V_{l_j}^(p−1), w_{l_j}, σ_{l_j}, H_{l_j,j}],
e_{l_j,i_k} = [H_{i_k,j}, H_{l_j,k}],
y_{l_j} = h_1^(p)(x_{l_j}^(p−1)) = x_{l_j}^(p−1),
m_{l_j,i_k} = h_2^(p)(y_{l_j}, e_{l_j,i_k}) = H_{i_k,j} V_{l_j}^(p−1) V_{l_j}^(p−1)H H_{i_k,j}^H,
M_{i_k} = g_1^(p)({m_{l_j,i_k}, ∀l, j}) = Σ_{l,j} m_{l_j,i_k},
U_{i_k}^(p) = (M_{i_k} + σ²_{i_k} I)^{−1} H_{i_k,k} V_{i_k}^(p−1),
W_{i_k}^(p) = (I − U_{i_k}^(p)H H_{i_k,k} V_{i_k}^(p−1))^{−1},
x_{i_k}^(p) = g_2^(p)(x_{i_k}^(p−1), M_{i_k}) = [U_{i_k}^(p), W_{i_k}^(p), V_{i_k}^(p−1), w_{i_k}, σ_{i_k}, H_{i_k,k}].

At the p-th iteration with p even, the l_j-th node broadcasts its state [U_{l_j}^(p−1), W_{l_j}^(p−1)] along its edges. The edge (l_j, i_k) processes the message by forming m_{l_j,i_k} = w_{l_j} H_{l_j,k}^H U_{l_j}^(p−1) W_{l_j}^(p−1) U_{l_j}^(p−1)H H_{l_j,k}. Node i_k receives the message set {m_{l_j,i_k}, ∀l, j} and first sums the messages, M_{i_k} = Σ_{l,j} m_{l_j,i_k}. Then the i_k-th node updates its internal state as V_{i_k}^(p) = w_{i_k} (M_{i_k} + μ*_{i_k} I)^{−1} H_{i_k,k}^H U_{i_k}^(p−1) W_{i_k}^(p−1). Specifically, we construct

x_{l_j}^(p−1) = [U_{l_j}^(p−1), W_{l_j}^(p−1), V_{l_j}^(p−1), w_{l_j}, σ_{l_j}, H_{l_j,j}],
e_{l_j,i_k} = [H_{i_k,j}, H_{l_j,k}],
y_{l_j} = h_1^(p)(x_{l_j}^(p−1)) = x_{l_j}^(p−1),
m_{l_j,i_k} = h_2^(p)(y_{l_j}, e_{l_j,i_k}) = w_{l_j} H_{l_j,k}^H U_{l_j}^(p−1) W_{l_j}^(p−1) U_{l_j}^(p−1)H H_{l_j,k},
M_{i_k} = g_1^(p)({m_{l_j,i_k}, ∀l, j}) = Σ_{l,j} m_{l_j,i_k},
V_{i_k}(μ_{i_k}) = w_{i_k} (M_{i_k} + μ_{i_k} I)^{−1} H_{i_k,k}^H U_{i_k}^(p−1) W_{i_k}^(p−1),
μ*_{i_k} = argmin { μ_{i_k} ≥ 0 : Tr(V_{i_k}(μ_{i_k})^H V_{i_k}(μ_{i_k})) ≤ P_max },
V_{i_k}^(p) = V_{i_k}(μ*_{i_k}),
x_{i_k}^(p) = g_2^(p)(x_{i_k}^(p−1), M_{i_k}) = [U_{i_k}^(p−1), W_{i_k}^(p−1), V_{i_k}^(p), w_{i_k}, σ_{i_k}, H_{i_k,k}].

This completes the proof for Proposition IV.1.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers and the editors for their constructive comments.

Algorithm 2 WMMSE Algorithm [18]
1: Initialize V_{i_k}.
2: for t = 0, ..., T do
3:    U_{i_k} ← ( Σ_{(l,j)} H_{i_k,j} V_{l_j} V_{l_j}^H H_{i_k,j}^H + σ²_{i_k} I )^{−1} H_{i_k,k} V_{i_k}
4:    W_{i_k} ← ( I − U_{i_k}^H H_{i_k,k} V_{i_k} )^{−1}
5:    V_{i_k} ← w_{i_k} ( Σ_{l,j} w_{l_j} H_{l_j,k}^H U_{l_j} W_{l_j} U_{l_j}^H H_{l_j,k} + μ*_{i_k} I )^{−1} H_{i_k,k}^H U_{i_k} W_{i_k}
6: end for
7: Output V_{i_k}.
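For reference, the following is a hedged numpy sketch of the single-antenna (scalar) special case of Algorithm 2, the form relevant to the power control experiments: each beamformer V_{i_k} reduces to a scalar square-root power in [0, √P_max], so the μ* step reduces to a projection onto that interval. Variable names and the fixed iteration count are illustrative.

# A hedged sketch of scalar WMMSE for K single-antenna transceiver pairs.
# A[j, k] denotes the channel amplitude |h_{j,k}| from transmitter j to receiver k.
import numpy as np


def wmmse_scalar(A, P_max=1.0, sigma2=1.0, alpha=None, n_iter=100):
    K = A.shape[0]
    alpha = np.ones(K) if alpha is None else alpha         # rate weights w_k in (5)
    G = A ** 2                                             # channel gains
    v = np.full(K, np.sqrt(P_max))                         # initialize sqrt-powers
    for _ in range(n_iter):
        u = np.diag(A) * v / (G.T @ (v ** 2) + sigma2)      # receive filter U
        w = 1.0 / (1.0 - u * np.diag(A) * v)                # MMSE weight W
        v = alpha * w * u * np.diag(A) / (G @ (alpha * w * u ** 2))
        v = np.clip(v, 0.0, np.sqrt(P_max))                 # projection replaces the mu* step
    return v ** 2                                           # transmit powers


# Example: powers for a random 10-user interference channel
p = wmmse_scalar(np.abs(np.random.randn(10, 10)))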

REFERENCES

[1] Y. Shen, Y. Shi, J. Zhang, and K. B. Letaief, “A graph neural network approach for scalable wireless power control,” in Proc. IEEE Global Commun. Conf. (GLOBECOM’19) Workshops, Waikoloa, HI, USA, Dec. 2019.
[2] M. Chiang, P. Hande, and T. Lan, “Power control in wireless cellular networks,” Found. Trends Networking, vol. 2, no. 4, pp. 381–533, 2008.
[3] E. Björnson, M. Bengtsson, and B. Ottersten, “Optimal multiuser transmit beamforming: A difficult problem with a simple solution structure,” IEEE Signal Process. Mag., vol. 31, no. 4, pp. 142–148, 2014.
[4] Y. Shi, J. Zhang, B. O’Donoghue, and K. B. Letaief, “Large-scale convex optimization for dense wireless cooperative networks,” IEEE Trans. Signal Process., vol. 63, pp. 4729–4743, Sept. 2015.
[5] Y. Shi, J. Zhang, K. B. Letaief, B. Bai, and W. Chen, “Large-scale convex optimization for ultra-dense cloud-RAN,” IEEE Wireless Commun., vol. 22, pp. 84–91, Jun. 2015.
[6] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[7] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to optimize: Training deep neural networks for interference management,” IEEE Trans. Signal Process., vol. 66, pp. 5438–5453, Oct. 2018.
[8] W. Lee, M. Kim, and D. Cho, “Deep power control: Transmit power control scheme based on convolutional neural network,” IEEE Commun. Lett., vol. 22, pp. 1276–1279, Apr. 2018.
[9] F. Liang, C. Shen, W. Yu, and F. Wu, “Towards optimal power control via ensembling deep neural networks,” IEEE Trans. Commun., vol. 68, no. 3, pp. 1760–1776, 2020.
[10] W. Cui, K. Shen, and W. Yu, “Spatial deep learning for wireless scheduling,” IEEE J. Sel. Areas Commun., vol. 37, Jun. 2019.
[11] Y. Shen, Y. Shi, J. Zhang, and K. B. Letaief, “LORM: Learning to optimize for resource management in wireless networks with few training samples,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 665–679, 2020.
[12] W. Xia, G. Zheng, Y. Zhu, J. Zhang, J. Wang, and A. P. Petropulu, “A deep learning framework for optimization of MISO downlink beamforming,” IEEE Trans. Commun., vol. 68, no. 3, pp. 1866–1880, 2020.
[13] M. Lee, G. Yu, and G. Y. Li, “Graph embedding based wireless link scheduling with few training samples,” arXiv preprint arXiv:1906.02871, 2019.
[14] M. Eisen and A. Ribeiro, “Optimal wireless resource allocation with random edge graph neural networks,” IEEE Trans. Signal Process., vol. 68, pp. 2977–2991, 2020.
[15] J. Dong, J. Zhang, Y. Shi, and J. H. Wang, “Faster activity and data detection in massive random access: A multi-armed bandit approach,” arXiv preprint arXiv:2001.10237, 2020.
[16] Y. Bengio, A. Lodi, and A. Prouvost, “Machine learning for combinatorial optimization: A methodological tour d’horizon,” arXiv preprint arXiv:1811.06128, 2018.
[17] L. Liang, H. Ye, G. Yu, and G. Y. Li, “Deep-learning-based wireless resource allocation with application to vehicular networks,” Proc. IEEE, 2019.
[18] Q. Shi, M. Razaviyayn, Z. Luo, and C. He, “An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel,” IEEE Trans. Signal Process., vol. 59, pp. 4331–4340, Sept. 2011.
[19] Y. Shi, J. Zhang, and K. B. Letaief, “Group sparse beamforming for green cloud-RAN,” IEEE Trans. Wireless Commun., vol. 13, pp. 2809–2823, May 2014.
[20] Y. Shi, J. Zhang, W. Chen, and K. B. Letaief, “Enhanced group sparse beamforming for green cloud-RAN: A random matrix approach,” IEEE Trans. Wireless Commun., vol. 17, pp. 2511–2524, Apr. 2018.


[21] C. Sun, J. Wu, and C. Yang, “Data representation for deep learning with priori knowledge of symmetric wireless tasks,” arXiv preprint arXiv:2005.08510, 2020.
[22] S. Ravanbakhsh, J. Schneider, and B. Poczos, “Equivariance through parameter-sharing,” in Proc. Int. Conf. Mach. Learning, pp. 2892–2901, Jul. 2017.
[23] K. Xu, J. Li, M. Zhang, S. Du, K. Kawarabayashi, and S. Jegelka, “What can neural networks reason about?,” in Proc. Int. Conf. Learning Representations, Apr. 2020.
[24] W. Brendel and M. Bethge, “Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet,” Proc. Int. Conf. Learning Representations, May 2019.
[25] Z. Li, Q. Chen, and V. Koltun, “Combinatorial optimization with graph convolutional networks and guided tree search,” in Proc. Adv. Neural Inform. Process. Syst., pp. 539–548, Dec. 2018.
[26] Z. Zhuang, J. Wang, Q. Qi, H. Sun, and J. Liao, “Toward greater intelligence in route planning: A graph-aware deep learning approach,” IEEE Syst. J., vol. 14, no. 2, pp. 1658–1669, 2020.
[27] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
[28] K. Pratik, B. D. Rao, and M. Welling, “RE-MIMO: Recurrent and permutation equivariant neural MIMO detection,” arXiv preprint arXiv:2007.00140, 2020.
[29] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, “A comprehensive survey on graph neural networks,” IEEE Trans. Neural Networks Learning Syst., pp. 1–21, 2020.
[30] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola, “Deep sets,” in Proc. Adv. Neural Inform. Process. Syst., pp. 3391–3401, Dec. 2017.
[31] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?,” Proc. Int. Conf. Learning Representations, May 2019.
[32] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph CNN for learning on point clouds,” ACM Trans. Graphics, vol. 38, no. 5, pp. 1–12, 2019.
[33] M. Fey and J. E. Lenssen, “Fast graph representation learning with PyTorch Geometric,” in Proc. Int. Conf. Learning Representations Workshops, May 2019.
[34] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” Proc. Int. Conf. Learning Representations, Apr. 2017.
[35] H. Dai, B. Dai, and L. Song, “Discriminative embeddings of latent variable models for structured data,” in Proc. Int. Conf. Mach. Learning, pp. 2702–2711, Jun. 2016.
[36] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” in Proc. IEEE Conf. Comput. Vision Pattern Recognition, pp. 652–660, Honolulu, HI, USA, Jul. 2017.
[37] K. Shen and W. Yu, “FPLinQ: A cooperative spectrum sharing strategy for device-to-device communications,” in Proc. IEEE Int. Symp. Inform. Theory, pp. 2323–2327, Jun. 2017.
[38] A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and J. Z. Kolter, “Differentiable convex optimization layers,” in Proc. Adv. Neural Inform. Process. Syst., pp. 9562–9574, Dec. 2019.
[39] R. Sato, M. Yamada, and H. Kashima, “Approximation ratios of graph neural networks for combinatorial problems,” in Proc. Adv. Neural Inform. Process. Syst., pp. 4083–4092, Dec. 2019.
[40] P. Barceló, E. V. Kostylev, M. Monet, J. Pérez, J. Reutter, and J. P. Silva, “The logical expressiveness of graph neural networks,” in Proc. Int. Conf. Learning Representations, Apr. 2020.
[41] L. Hella, M. Järvisalo, A. Kuusisto, J. Laurinharju, T. Lempiäinen, K. Luosto, J. Suomela, and J. Virtema, “Weak models of distributed computing, with connections to modal logic,” Distributed Comput., vol. 28, no. 1, pp. 31–53, 2015.
[42] M. Kubisch, H. Karl, A. Wolisz, L. C. Zhong, and J. Rabaey, “Distributed algorithms for transmission power control in wireless sensor networks,” in Proc. IEEE Wireless Commun. Networking Conf., pp. 558–563, New Orleans, LA, USA, Mar. 2003.
[43] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learning Representations, May 2014.