
Graph neural network initialisation of quantum approximate optimisation

Nishant Jain,1 Brian Coyle,2 Elham Kashefi,2, 3 Niraj Kumar2

1 Indian Institute of Technology, Roorkee, India
2 School of Informatics, University of Edinburgh, EH8 9AB Edinburgh, United Kingdom

3 LIP6, CNRS, Sorbonne Université, 4 place Jussieu, 75005 Paris, France
(Dated: November 5, 2021)

Approximate combinatorial optimisation has emerged as one of the most promising application areas for quantum computers, particularly those in the near term. In this work, we focus on the quantum approximate optimisation algorithm (QAOA) for solving the Max-Cut problem. Specifically, we address two problems in the QAOA: how to select initial parameters, and how to subsequently train the parameters to find an optimal solution. For the former, we propose graph neural networks (GNNs) as an initialisation routine for the QAOA parameters, adding to the literature on warm-starting techniques. We show the GNN approach generalises across not only graph instances, but also to increasing graph sizes, a feature not available to other warm-starting techniques. For training the QAOA, we test several optimisers for the Max-Cut problem up to 14 qubits. These include quantum aware/agnostic optimisers proposed in the literature, and we also incorporate machine learning techniques such as reinforcement and meta-learning, both of which we demonstrate outperform vanilla stochastic gradient descent. With the incorporation of these initialisation and optimisation toolkits, we demonstrate how the QAOA can be trained as an end-to-end differentiable pipeline.

I. INTRODUCTION

Among the forerunners for use cases of near-term quantum computers, dubbed noisy intermediate-scale quantum (NISQ) [1], are the variational quantum algorithms (VQAs). The most well known of these are the variational quantum eigensolver [2] and the quantum approximate optimisation algorithm (QAOA) [3]. In their wake, many new algorithms have been proposed in this variational framework, tackling problems in a variety of areas [4–6]. The primary workhorse in such algorithms is typically the parameterised quantum circuit (PQC), and due to the heuristic and trainable nature of VQAs they have also become synonymous with 'modern' quantum machine learning [7, 8]. This is particularly evident with the adoption of PQCs as the quantum version of neural networks [9, 10].

In this work, we focus on one particular VQA, the QAOA, primarily used for approximate discrete combinatorial optimisation. The canonical example of such a problem is finding the 'maximum cut' (Max-Cut) of a graph, where one aims to partition the graph nodes into two sets such that the sets have as many edges connecting them as possible. Discrete optimisation problems such as Max-Cut are extremely challenging to solve (specifically NP-Hard) and accurate solutions to such problems take exponential time in general. Aside from its theoretical relevance, Max-Cut finds applications across various fields such as the study of the spin glass model, network design, VLSI and other circuit layout designs [11], and data clustering [12]. While it is not believed quantum computers can solve such problems efficiently, it is hoped that quantum algorithms such as the QAOA may be able to outperform classical algorithms by some benchmark. For example, provided one believes the unique games conjecture (UGC) holds true for classical solvers [13, 14], we can hope for a violation by quantum solvers [15]. Given the ubiquity of combinatorial optimisation problems in the real world, even incremental improvements may have large financial and quality impacts.

Due to this potential, there has been a rapid development in the study of the QAOA algorithm and its components, including (but not limited to) theoretical observations and limitations [16–23], variations on the circuit structure (ansatz) [24–28] used, the cost function [29–31], and the initialisation and optimisation methods [32–37] used for finding optimal solutions. Since the algorithm is suitable for near-term devices, there has also been substantial progress in experimental or numerical benchmarks [37–40] and the effect of quantum noise on the algorithm [41, 42].

However, due to the limitations in running real experiments on small and unreliable NISQ devices, which currently are typically only accessible via an expensive cloud computing platform [43], it is important to limit the quantum resource (i.e. the overall number of runs, or the time for a single run on quantum hardware) required to solve a problem to the bare minimum. Therefore, effective initialisation and optimisation strategies for VQAs can dramatically accelerate the search for optimal problem solutions. The former ensures the algorithm begins 'close' to a solution in the parameter space (a local or global optimum), while the latter enables smooth and efficient traversal of the landscape. This is especially relevant given the existence of difficult optimisation landscapes in VQAs plagued by barren plateaus [44–47], local minima [48, 49] and narrow gorges [50]. To avoid these, and to enable efficient optimisation, several initialisation techniques have been proposed for VQAs, including using tensor networks [51], meta-learning [52–54] and algorithm-specific techniques [32, 33].

Returning to the specifics of combinatorial optimisation, the use of machine and deep learning has been shown to be an effective means of solving this family of problems, see for example [55–57]. Of primary interest for our purposes are the works of [58] and [57]. The former [58] trains a graph neural network (GNN) to solve Max-Cut, while the latter [57] extends this to more general optimisation problems, up to millions of variables. Based on these insights, and the recent trend in the quantum domain of equipping VQAs with neural networks (with software libraries developed for this purpose [59, 60]), using both classical and quantum learning architectures synergistically has much promise. We extend this hybridisation in this work.

This paper is divided into two parts. In the first part (Sections II-IV), we discuss the previous works in QAOA initialisation and give our first contribution: an initialisation strategy using graph neural networks. Specifically, we merge GNN solvers with the warm-starting technique for QAOA of [32], and demonstrate the effectiveness of this via numerical results in Section IV. By then examining the conclusion of [57], we can see how our GNN approach would allow QAOA initialisation to scale far beyond the capabilities of near-term quantum devices. In the second part of the paper (Section V), we complement this by evaluating several optimisation techniques for the QAOA proposed in the literature, including quantum-aware and quantum-agnostic optimisers, and neural network based optimisation approaches.

A. QAOA for solving Max-Cut

For concreteness in this work, we focus on the discrete optimisation problem known as Max-Cut. It involves finding a division (a cut) of the vertices of a (weighted) graph into two sets which maximises the sum of the weights over all the edges across the vertex subsets. For unweighted graphs, this cut simply maximises the number of edges across the two subsets.

The problem can be recast as minimising a weighted sum of operators acting on the vertices of a given graph. Mathematically, this can be stated as follows. Given a graph G := (V, E) with vertices V and edges E = {(i, j) | i, j ∈ V and i ≠ j}, the Max-Cut can be found by minimising the following cost function,

C(z) = -\sum_{\langle i,j \rangle \in E} w_{ij} (1 - z_i z_j)    (1)

where z := z_1 z_2 ... z_n are variables for each vertex i, such that z_i ∈ {+1, −1}, and w_ij is the corresponding weight of the edge between vertices i and j. In this case, the value (sign) of z_i determines on which side of the cut the node resides. The Max-Cut problem is a canonical NP-complete problem [62], meaning there is no known polynomial time algorithm for it in general. Further, Max-Cut is also known to be APX-hard [63], meaning there is no polynomial time approximation scheme for it (unless P = NP). The current best known polynomial time approximate classical approach is the Goemans-Williamson (GW) algorithm, which is able to achieve an approximation ratio, r ≈ 0.878, where:

r = \frac{\text{Approximate cut}}{\text{Optimal cut}} = \frac{2}{\pi} \min_{0 \le \theta \le \pi} \frac{\theta}{1 - \cos\theta} \approx 0.878    (2)

Assuming the UGC holds classically, the GW algorithm achieves the best possible approximation ratio for Max-Cut. Without this conjecture, it has been proven that it is NP-hard to approximate the Max-Cut value with an approximation ratio better than 16/17 ≈ 0.941. With the celebrated result of [15] stating the UGC is false for quantum solvers, one may hope for a quantum algorithm with a better approximation ratio than is possible classically.
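
Before moving to the quantum formulation, the following short sketch makes the classical cost of Eq. (1) concrete. It is our own illustration (not code from the paper), assuming numpy and networkx and using an arbitrary toy weighted graph; it brute-forces the optimal cut and also checks the graph-Laplacian form of the problem that reappears later in Section II A.

```python
import itertools
import networkx as nx
import numpy as np

# Toy weighted graph: a triangle plus one pendant edge (illustrative choice only).
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0), (2, 3, 2.0)])

def cost(z, graph):
    """Eq. (1): C(z) = -sum_{(i,j) in E} w_ij (1 - z_i z_j), with z_i in {+1, -1}."""
    return -sum(w * (1 - z[i] * z[j]) for i, j, w in graph.edges.data("weight"))

# Brute-force the Max-Cut (only feasible for very small graphs).
best = min(itertools.product([1, -1], repeat=G.number_of_nodes()),
           key=lambda z: cost(z, G))
print("optimal assignment:", best, "cut weight:", -cost(best, G) / 2)

# The same quantity via the graph Laplacian L_G = D - A used later in Eq. (9):
# for z in {+1, -1}^n, z^T L_G z = 4 * (cut weight).
L = nx.laplacian_matrix(G).toarray().astype(float)
z = np.array(best)
print("z^T L z / 4 =", z @ L @ z / 4)
```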

To address Max-Cut quantum mechanically, one can quantise the cost function Eq. (1) by replacing the variables with operators, z_i → Z_i, where Z is the Pauli-Z matrix. The cost function can now be described with a Hamiltonian:

H_C = \sum_{\langle i,j \rangle \in E} w_{ij} (1 - Z_i Z_j)    (3)

where the actual cost, corresponding to the cut size, is extracted as the expectation value of this Hamiltonian with respect to a quantum state, |ψ⟩:

C(z) := −〈ψ|HC|ψ〉 (4)

The goal of a quantum algorithm is then to find the state |ψ⟩_G := arg min_{|ψ⟩} C = arg max_{|ψ⟩} ⟨ψ|H_C|ψ⟩, i.e. the state which minimises the cost of Eq. (4). Constructing the Hamiltonian as in Eq. (3) ensures that this state is exactly the state encoding the Max-Cut of the problem graph:

|ψ〉G = |ψ〉Max-Cut (5)

However, since the Max-Cut problem is NP-Hard, we expect that finding this state, |ψ⟩_Max-Cut, will also be hard in general. The QAOA attempts to solve this by initialising with respect to an easy Hamiltonian (also called a 'mixer' Hamiltonian):

H_M = \sum_{i=1}^{n} X_i    (6)

which has as an eigenstate the simple product state |ψ⟩_init = |+⟩^{⊗n} = H^{⊗n}|0⟩^{⊗n}, where X is the Pauli-X operator and H is the Hadamard operator. This can be viewed as an initialisation which is a superposition of all possible candidate solutions. The QAOA then attempts to simulate adiabatic evolution from |ψ⟩_init to the target state |ψ⟩_Max-Cut by an alternating 'bang-bang' application of two unitaries derived from the two Hamiltonians, Eq. (3) and Eq. (6), which are respectively:

U_C(\gamma) = e^{-i\gamma H_C} \quad \text{and} \quad U_M(\beta) = e^{-i\beta H_M}    (7)

Figure 1: Initialisation techniques for QAOA. (a) This part highlights the difference between the warm-starting techniques (using SDP relaxations or GNNs) and TQA. TQA produces initial angles, {β, γ}, whereas warm-starting techniques initialise the QAOA state. 'Init' here refers to any parameter initialisation scheme. In this work, we choose a specific initialisation technique called the Xavier initialisation [61] for warm-starting techniques. (b) The GNN takes an embedding of the initial graph and applies updates to the embeddings based on the neighbours of each node, using a parameterised function, f_θ. In this case, the GNN outputs a probability for each node being on either side of the cut. (c) A unified p-layer QAOA circuit for all initialisation schemes. In TQA, fixed choices for angles θ, ϕ initialise vanilla QAOA with the standard mixer, whereas warm-starting produces an initial state and mixer encoding a probabilistic solution given by the SDP or GNN. Here, z* implicitly encodes the regularisation parameter ε discussed in [32]. Note that both the GNN and QAOA parameters can be trained in an end-to-end differentiable manner, in contrast to other schemes.

In the QAOA, the parameters γ, β are trainable, and govern the length of time each operator is applied for. These two unitaries are alternated in p 'layers' acting on the initial state, so the final state is prepared using 2p parameters, {β, γ} := {β_1, β_2, ..., β_p, γ_1, γ_2, ..., γ_p}:

|\psi_{\beta,\gamma}\rangle = U_M(\beta_p) U_C(\gamma_p) \cdots U_M(\beta_1) U_C(\gamma_1) |+\rangle^{\otimes n}    (8)

Optimising the parameters {β, γ} serves as a proxy for finding this optimal state, and so the aim is that after a finite depth p we achieve a state |ψ_{β,γ}⟩ which is close to the target state |ψ⟩_Max-Cut.
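
To illustrate Eq. (8), the dense-matrix sketch below (our own construction in numpy/scipy, not the TensorFlow Quantum implementation used for the paper's experiments) builds |ψ_{β,γ}⟩ for a toy unweighted graph by exponentiating H_C and H_M directly, and evaluates the cost of Eq. (4). The graph, depth and angles are arbitrary, and this brute-force approach is only practical for a handful of qubits.

```python
import numpy as np
from functools import reduce
from scipy.linalg import expm

I2 = np.eye(2); X = np.array([[0., 1.], [1., 0.]]); Z = np.diag([1., -1.])

def kron_at(op, i, n):
    """Embed a single-qubit operator `op` on qubit i of an n-qubit register."""
    return reduce(np.kron, [op if k == i else I2 for k in range(n)])

def qaoa_state(edges, n, betas, gammas):
    """|psi_{beta,gamma}> of Eq. (8) for an unweighted graph, via dense expm."""
    H_C = sum(np.eye(2**n) - kron_at(Z, i, n) @ kron_at(Z, j, n) for i, j in edges)
    H_M = sum(kron_at(X, i, n) for i in range(n))
    psi = np.ones(2**n) / np.sqrt(2**n)           # |+>^{(x)n}
    for beta, gamma in zip(betas, gammas):         # p alternating layers
        psi = expm(-1j * gamma * H_C) @ psi        # U_C(gamma)
        psi = expm(-1j * beta * H_M) @ psi         # U_M(beta)
    return psi, H_C

# Depth p = 2 on a 4-node ring, with arbitrary illustrative angles.
edges, n = [(0, 1), (1, 2), (2, 3), (3, 0)], 4
psi, H_C = qaoa_state(edges, n, betas=[0.4, 0.2], gammas=[0.3, 0.6])
print("cost C = -<psi|H_C|psi> =", -np.real(np.vdot(psi, H_C @ psi)))
```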

II. INITIALISING THE QAOA

Since searching over the non-convex parameter landscape for an optimal setting of the {γ, β} parameters directly on quantum hardware may be expensive and/or challenging, any attempt to initialise the QAOA parameters near a candidate solution is extremely valuable, as the algorithm would then start its search from an already good approximate solution. Such approaches are dubbed 'warm-starts' [32], in contrast to 'cold-starts'. One could consider a cold-start to be a random initialisation of {γ, β}, or the use of an initial state which encodes no problem information, e.g. |+⟩^{⊗n}, as in vanilla QAOA. In this work, we refer to cold-start as the latter, and 'random initialisation' to mean a random setting of the parameters {γ, β}. We first revisit and summarise two previous approaches [32, 33], before presenting our approach to QAOA initialisation. We illustrate the two previous initialisation approaches in Fig. 1, which we review briefly in Section II A and Section II B, along with our approach based on graph neural networks, which we introduce in Section III. For simplicity, we focus on the simplest version of QAOA, but the methods could be extended to other variants, for example recursive QAOA (RQAOA) [32, 64, 65].

A. Continuous relaxations

The first approach, proposed in [32], is a warm-starting method for the QAOA which can be applied to Max-Cut as a special case of a quadratic unconstrained binary optimisation (QUBO) problem. In that work, two sub-approaches were discussed. The former converts the QUBO into its continuous quadratic relaxation form, which is efficiently solvable, and directly uses the output to initialise the QAOA parameters. The latter approach applies the random-hyperplane rounding method of the GW algorithm to generate a candidate solution for the QUBO.

For Max-Cut, this QUBO can be written in terms of the graph Laplacian, L_G = D − A (where D is the diagonal degree matrix and A is the adjacency matrix of G), as follows (we utilise this form later in this work):

\max_{z \in \{-1,1\}^n} z^T L_G z    (9)

However, by removing the requirement that each z_i is binary, one can obtain a continuous relaxation which is efficiently solvable and serves as a warm-start for solving Eq. (9). Since L_G is a positive semidefinite (PSD) matrix, the relaxed form can be trivially written as,

\max_{z \in [-1,1]^n} z^T L_G z    (10)

If the matrix in the QUBO is not PSD, however, then one can obtain another continuous relaxation as a semidefinite programme (SDP) [32, 66]. The output of this optimisation is a real vector z*, which, once a rounding procedure is performed (i.e. the GW algorithm), gives a candidate solution for the original Max-Cut. In order to use this relaxed solution to initialise the QAOA, [32] also demonstrated that the initial state from Eq. (8) and the mixer Hamiltonian Eq. (6) must be altered as:

|\psi\rangle^{\mathrm{CS}}_{\mathrm{init}} = |+\rangle^{\otimes n} \;\longrightarrow\; |\psi\rangle^{\mathrm{WS}}_{\mathrm{init}} = \bigotimes_{i=1}^{n} R_y(\theta_i)\,|0\rangle^{\otimes n},

H_M \;\longrightarrow\; \sum_{i=1}^{n} H^{\mathrm{WS}}_{M,i}, \qquad
H^{\mathrm{WS}}_{M,i} = \begin{pmatrix} 2z^*_i - 1 & -2\sqrt{z^*_i(1-z^*_i)} \\ -2\sqrt{z^*_i(1-z^*_i)} & 1 - 2z^*_i \end{pmatrix}    (11)

where θ_i = 2 sin^{−1}(√z*_i). One can immediately see that |ψ⟩^{WS}_init is the ground state of H_M with eigenvalue −n.

One possible issue that may arise with this warm-start is if the relaxed solution z*_i is either 0 or 1. When this happens, qubit i would be initialised to the state |0⟩ or |1⟩, respectively. This means the qubit would be unaffected by the problem Hamiltonian, Eq. (3), which only contains Pauli-Z terms. To account for this possibility, [32] modifies θ_i in Eq. (11) with a regularisation parameter ε ∈ [0, 0.5] if the candidate solution z*_i is too close to 0 or 1.

Examining Fig. 1, this initialisation scheme is achieved by setting the angles in the initial state to ϕ_i = 0 and θ_i = 2 sin^{−1}(√z*_i) ∀i, where the initial state can be expressed as |ψ⟩^{WS}_init = ⊗_i R_x(ϕ_i) R_y(θ_i)|0⟩^{⊗n}. The 'Init' in this figure implies that one is free to choose any QAOA parameter initialisation method, as this warm-start approach only modifies the input state and the mixer Hamiltonian.
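
The mapping from a relaxed solution z* to the warm-start angles and single-qubit mixer blocks of Eq. (11) is simple to express in code. The sketch below is our own illustration: the z* values are arbitrary, and clipping to [ε, 1 − ε] is one simple reading of the regularisation of [32], not necessarily their exact prescription.

```python
import numpy as np

def ws_angles(z_star, eps=0.25):
    """theta_i = 2 arcsin(sqrt(z*_i)); components too close to 0 or 1 are
    clipped to [eps, 1 - eps] (one simple reading of the regularisation in [32])."""
    z = np.clip(np.asarray(z_star, dtype=float), eps, 1.0 - eps)
    return 2 * np.arcsin(np.sqrt(z)), z

def ws_mixer_blocks(z):
    """Single-qubit warm-start mixer Hamiltonians H^WS_{M,i} of Eq. (11)."""
    blocks = []
    for zi in z:
        off = -2 * np.sqrt(zi * (1 - zi))
        blocks.append(np.array([[2 * zi - 1, off],
                                [off, 1 - 2 * zi]]))
    return blocks

# Example relaxed solution (illustrative values only; z* = 1.0 triggers the clipping).
z_star = [0.9, 0.1, 0.5, 1.0]
thetas, z_reg = ws_angles(z_star)
for t, H in zip(thetas, ws_mixer_blocks(z_reg)):
    psi = np.array([np.cos(t / 2), np.sin(t / 2)])   # R_y(theta_i)|0>
    print(np.allclose(H @ psi, -psi))                # -1 eigenstate, as stated in the text
```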

B. Trotterised quantum annealing

A second proposed method [33] to initialise the QAOA uses concepts from quantum annealing [67, 68], which is a popular method for solving QUBO problems of the form Eq. (9). QAOA was proposed as a discrete, gate-based method to emulate quantum adiabatic evolution, or quantum annealing. Therefore, one may hope that insights from quantum annealing may be useful in setting the initial angles for the QAOA circuit parameters. In the method proposed by [33], one fixes the QAOA circuit depth, p, and sets the parameters as:

\gamma_k = \frac{k}{p}\,\delta t, \qquad \beta_k = \left(1 - \frac{k}{p}\right)\delta t    (12)

where k = 1, ..., p and δt is a time interval which is a priori unknown, given as a fraction of the (unknown) total optimal anneal time T*, so that δt = T*/p. The authors of [33] observed the optimal time step δt to be ≈ 0.75 for 3-regular graphs and O(1) for other graph ensembles. In [33], the QAOA was initialised with δt = 0.75 and this was observed to help avoid local minima and find near-optimal minima close to the global minimum.

Note that, in contrast to the warm-starting method from the previous section, the TQA approach initialises the parameters {β, γ} rather than the initial QAOA state (and mixer Hamiltonian), which is set as in vanilla QAOA to |+⟩^{⊗n}. Again revisiting Fig. 1, this initial state can be achieved by choosing ϕ_j = π, θ_j = π/2, ∀j. This is due to the fact that R_x(π)R_y(π/2)|0⟩ ∝ XR_y(π/2)|0⟩ = XH|0⟩ = |+⟩. Similarly, for the mixer Hamiltonian, we have R_y(π/2)R_z(−2β_k)R_y(−π/2) ∝ HXR_z(−2β_k)XH = HR_z(2β_k)H = R_x(2β_k), which is the single-qubit mixer unitary from vanilla QAOA, up to a redefinition of β_k.
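
The TQA schedule of Eq. (12) is a two-line computation; the sketch below is our own illustration, with δt = 0.75 simply taken from the value for 3-regular graphs quoted above and p chosen arbitrarily.

```python
import numpy as np

def tqa_init(p, dt=0.75):
    """TQA initial QAOA angles of Eq. (12); dt ~ 0.75 is the value reported
    for 3-regular graphs in [33]."""
    k = np.arange(1, p + 1)
    gammas = (k / p) * dt
    betas = (1 - k / p) * dt
    return gammas, betas

gammas, betas = tqa_init(p=5)
print(np.round(gammas, 3), np.round(betas, 3))
```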

III. GRAPH NEURAL NETWORK WARM-STARTING OF QAOA

Now that we have introduced the QAOA and alternative methods for warm-starting its initial state and/or initial algorithm parameters, let us turn to our proposed method: the use of graph neural networks. This approach is closest to the relaxation method of Section II A, in that the GNN provides an alternative initial state to vanilla QAOA, and so it is to this method that we primarily compare. One of the main drawbacks of using SDP relaxations and the GW algorithm is that every graph for which the Max-Cut must be found generates a new problem instance to be initialised. On the other hand, the GW algorithm comes equipped with performance guarantees (generating an approximate solution within ≈ 88% of the optimal answer).

As we shall see, using graph neural networks as an initialiser allows a generalisation across many graph instances at once. Importantly, even increasing the number of qubits will not significantly affect the time complexity of such approaches, as the GNN can be interpreted as a learned prior over graphs. We also demonstrate how the model can be trained on a small number of qubits, and still perform well on larger problem instances (size generalisation), a feature not present in any of the previous initialisation methods for QAOA. Furthermore, the incorporation of a differentiable initialisation structure allows the entire QAOA pipeline to become end-to-end differentiable, which is particularly advantageous since it makes the problem amenable to the automatic differentiation functionality of many deep learning libraries [69, 70]. First, let us begin by introducing graph neural networks.

A. Graph Neural Networks

Graph neural networks (GNNs) [71] are a specific neural network model designed to operate on graph-structured data, and typically function via some message-passing process across the graph. They operate by taking an encoding of the input graph describing a problem of interest and outputting a transformed encoding. Each graph node is initially encoded as a vector, and these are then updated by the GNN to incorporate information about the relative features of the graph into each node. This is done by taking into account the connections and directions of the edges within the graph, using quantities such as the node degree and the adjacency matrix. This transformed graph information is then used to solve the problem of interest. There are many possible architectures for how this graph information is utilised in the GNN, including attention-based mechanisms [72] or graph convolutions [73]; see e.g. [74] for a review.

In order to transform the feature embeddings, the GNN is trained for a certain number of iterations (a hyperparameter). For a given graph node, n_ν, we associate a vector, h^t_{n_ν}, where t is the current iteration. In the next iteration (t+1), to update the vector for node n_ν, we first compute some function of the vector embeddings of the nodes in a neighbourhood of n_ν, denoted as N(n_ν) = {n_j}_{j: n_j ∼ n_ν}. A priori, there is no limitation on how large this neighbourhood can be; making it larger will increase the training difficulty and time, but also increase representational power. These function values are then aggregated (for example by taking an average) and combined with the node vector at the previous iteration (with perhaps a non-linear activation function) to generate h^{t+1}_{n_ν}. Each node's update increases the information it contains relative to a larger subset of nodes in the graph. The collective action of these operations can be described by a parameterised, trainable function, f_θ(h^t_{n_ν}, {h^t_{n_j}}_{j: n_j ∼ n_ν}) (see Fig. 1), whose parameters, θ, are suitably trained to minimise a cost function. In all cases here, we initialise all the elements of the feature vectors h^0_{n_ν} to be the degree of the node n_ν. For the specific GNN architecture we use in the majority of this work, we choose the line graph neural network (LGNN) [75], shown to be competitive on combinatorial optimisation problems [58]. However, we also incorporate the graph convolutional network (GCN) proposed for combinatorial optimisation by [57] in some numerical results in Section IV. We give further details about these two architectures in Appendix A.
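
The sketch below is a library-agnostic illustration of one message-passing step of the generic form f_θ described above (mean aggregation over neighbours, a linear map and a ReLU). It is our own simplification: the actual LGNN and GCN architectures used in this work are more involved (Appendix A), and the toy graph and random weights are placeholders.

```python
import numpy as np

def message_passing_step(H, adj, W_self, W_neigh):
    """One generic update: h_v^{t+1} = relu(W_self h_v^t + W_neigh * mean_{u~v} h_u^t).
    H: (n_nodes, d) node embeddings, adj: (n_nodes, n_nodes) 0/1 adjacency matrix."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)     # avoid division by zero
    neigh_mean = (adj @ H) / deg                          # aggregate neighbour embeddings
    return np.maximum(0.0, H @ W_self.T + neigh_mean @ W_neigh.T)

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], float)
H = adj.sum(axis=1, keepdims=True)      # initial feature: node degree, as in the text
W_self = rng.normal(size=(8, 1))        # random placeholder weights
W_neigh = rng.normal(size=(8, 1))
H1 = message_passing_step(H, adj, W_self, W_neigh)
print(H1.shape)                          # (4, 8): updated embeddings for each node
```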

Once we have trained the GNN for a certain number of iterations, T, we can use the information encoded in {h^T_{n_j}}_j for the problem at hand. A simple example would be to attach a multi-layer perceptron and perform classification on each node, where {h^T_{n_j}} behaves as a feature vector encoding the graph structure. For our purposes, we use these vectors to generate probabilities on the nodes. These are the probabilities that the node is on a given side of the Max-Cut, which are then taken as the values z* in the warm-started QAOA circuit and its mixer Hamiltonian Eq. (11).

B. Graph neural networks for Max-Cut

To attach this probability, there are at least two possible methods one could apply. Firstly, one could consider using reinforcement learning or long short-term memories (LSTMs) [76, 77]. These methods generate probabilities in a step-wise fashion by employing a sequential dependency. To train these, one may use a policy gradient method [78] (dubbed an 'autoregressive decoding' approach [79]).

The second, simpler, method is to treat each edge independently, and generate the probability of each edge being present in the Max-Cut or not (a 'non-autoregressive decoding'). This can be formulated as a vector, p, where each element corresponds to a node, n_ν, generated by applying a softmax to the final output feature vectors of the GNN, {h^T_{n_j}}_j [57, 58]:

p_{n_\nu}(\theta) = \frac{\exp\left(h^{T}_{n_\nu,0}\right)}{\sum_{j \in \{0,1\}} \exp\left(h^{T}_{n_\nu,j}\right)}    (13)

In [58], for each node n_ν, the final output is the two-dimensional vector [h^T_{n_ν,0}, h^T_{n_ν,1}], constructed as the output of a final two-output linear layer. The probability for each node is then taken as one of these outputs (say j = 0) via the softmax in Eq. (13), and then used in the cost function described in the next section.

1. Unsupervised training

Now that we have defined the structure and output of the GNN, it must be suitably trained. One approach is to use supervised training; however, this may require a large number of example graphs to serve as the ground truth. Instead, following [57, 58], we opt for an unsupervised approach, bypassing the need for labels. To do so, we choose the cost function of [58], which is given by the Max-Cut QUBO itself in terms of the graph Laplacian (Eq. (9)):

C_{\mathrm{GNN}} = -\min_{\theta} \frac{1}{T} \sum_{t=1}^{T} \left( -C^{t}_{\mathrm{GNN}}(\theta) \right)    (14)

C^{t}_{\mathrm{GNN}}(\theta) = \frac{1}{4} (2\mathbf{p} - \mathbf{1})^{T} L_{G} (2\mathbf{p} - \mathbf{1})    (15)

where p ∈ [0, 1]^n is the probability vector from the GNN, Eq. (13). In Eq. (14), we define the cost function as an average over a training set of T graphs, {G_t}^T_{t=1}. Note that the graphs in the training set do not have to be the same size as the graphs of interest; the GNN can be trained on an ensemble of graphs of different sizes, or graphs which are strictly smaller than the test graph. We utilise this feature to demonstrate the generalisation capabilities of the GNN in Section IV B. See [57] for how the cost function and GNN structure could be adapted to alternative QUBO-type problems.
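
The unsupervised objective is straightforward to express in code. The sketch below is our illustration only: it computes the node probabilities of Eq. (13) from stand-in GNN outputs and the quantity appearing in Eq. (15) with numpy/networkx, whereas in practice the loss would be written in an automatic-differentiation framework so that gradients flow back to the GNN parameters θ; the random graph and logits are placeholders.

```python
import numpy as np
import networkx as nx

def node_probabilities(logits):
    """Eq. (13): softmax over the two output channels of each node; keep channel 0."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return (e / e.sum(axis=1, keepdims=True))[:, 0]

def soft_cut(p, graph):
    """The quantity in Eq. (15): (1/4)(2p - 1)^T L_G (2p - 1).
    For hard assignments p in {0,1}^n this is exactly the cut weight."""
    L = nx.laplacian_matrix(graph).toarray().astype(float)
    s = 2 * np.asarray(p) - 1
    return 0.25 * s @ L @ s

G = nx.random_regular_graph(3, 8, seed=1)
logits = np.random.default_rng(0).normal(size=(8, 2))   # stand-in for the GNN outputs h_T
p = node_probabilities(logits)
# Per Eq. (14), training minimises the negative soft cut, averaged over the training graphs.
loss = -soft_cut(p, G)
print("soft cut value:", -loss)
```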

IV. INITIALISATION NUMERICAL RESULTS

Let us first study the impact of the initialisation schemes discussed above on the QAOA numerically. In all of the below, the approximation ratio, r, will be the figure of merit. We also use Xavier initialisation [61] for the QAOA parameters in all cases except for the TQA initialisation method. This initialises each parameter from a uniform distribution over an interval depending on the number of parameters in the QAOA circuit.

We begin by benchmarking the graph neural network QAOA itself in Fig. 2 for the two architectures discussed above, the line graph neural network (LGNN) and the graph convolutional network (GCN). Fig. 2(a, b) compares the approximation ratio outputted by the GNN directly against the GNN-initialised solution further optimised by QAOA. Here, we increase the depth of the QAOA proportionally with the number of qubits (p = 3n/4 in (a) and p = n/2 in (b)). We see that the GNN QAOA outperforms both the GNN and vanilla QAOA individually. This advantage is more pronounced at lower relative depth, but is diminished at higher depth. However, since the depth of quantum circuits is limited by decoherence in NISQ devices, the advantage of the GNN in the low-depth regime is promising. Fig. 2(c) shows the success probabilities of vanilla (cold-started) QAOA against the GNN initialisation. Since we observe that the LGNN appears to outperform the GCN architecture (due to its more complex message-passing functionality), we opt for the former in the remainder of this work. However, for larger problem sizes, the LGNN may be less scalable [57].

A. Graph neural network versus SDP relaxations

Next, we benchmark the GNN against the SDP relaxation approach directly in Fig. 3. Here, we fix the number of qubits to be 6 (i.e. graph instances with 6 nodes) and compare the approximation ratio for the SDP relaxation method (rSDP) to the GNN generated ratio (rGNN) as a function of the QAOA circuit depth. Here we see the comparable performance of the two methods, with the GNN approach being able to generate solutions which are between 85-90% of the quality of those generated by the relaxation approach. However, if we then examine Fig. 3b, the tradeoff we are making becomes apparent. With a small sacrifice in solution quality, the GNN (once trained) is able to generate candidate solutions significantly faster than the GW algorithm. As mentioned in the introduction, this is because the GW algorithm must be run separately for each graph instance, whereas the GNN has the ability to generalise across multiple instances once suitably trained. The inference time for the GNN is approximately linear, whereas the SDP algorithm behaves as some higher degree polynomial; this was numerically estimated to be O(n^{3.5}) in [57]. In Section IV B, we take this even further and demonstrate how the GNN approach is able to generalise not only across graph instances of the same size, but also to larger graph instances, which is a feature clearly not possible via the relaxation (or TQA) approach.

As a final remark, we note that our simulation results are run only for small qubit numbers, n < 25. This is primarily due to the inherent exponential overhead in the classical simulation of quantum circuits as one increases the qubit count. At this relatively small scale, the GNN does not necessarily outperform the SDP relaxations in the quality of the final solution; however, the real merit of the GNN approach would be evident in graph problems with n ∼ 500 nodes or higher, where the GNN demonstrates improved performance and even has the capability of outperforming the GW algorithm [58]. As such, it is possible that the GNN approach will improve over the GW algorithm for warm-starting QAOA, in both running time and solution quality, only as larger quantum computers become available.

As a final comparison, we benchmark the GNN initialisation technique against all other techniques in Fig. 4. We compare against the warm-starting technique using relaxations of [32] ('Warm-start'), and the Trotterised quantum annealing ('TQA') based approach of [33], as a function of depth and training epochs.

B. Generalisation Capabilities of GNNs

A key feature of using neural networks for certain tasks is the capability to generalise. This generalisation ability is one of the driving motivations behind machine learning, and tests the ability of an algorithm to actually learn (rather than just memorise a solution). Similarly, we can test the generalisation ability of GNNs in warm-starting the QAOA. To do so, we train instantiations of GNNs on small graph instances, and then directly apply them to larger instances.


Figure 2: Performance of the graph neural network on 3-regular graphs for Max-Cut. (a, b) Comparing Max-Cut approximation ratios achieved with LGNN/GCN initialisation versus cold-start (vanilla) QAOA. We also plot the raw values outputted by the LGNN with a simple rounding scheme. In (a), the QAOA depth is set to be p = 3n/4, while in (b), p = n/2. Each datapoint is generated via 1000 runs of the LGNN/GCN on random instances of 3-regular graphs, of the appropriate size to the number of qubits. (c) Histogram shows the number of graphs on which GNN QAOA versus cold-start QAOA can achieve a certain ratio of the optimal cut. Here, we set p = 5 and n = 12 qubits and generate the percentages over 50 random graphs.


Figure 3: Time versus quality tradeoff between GNN versus relaxation initialisation methods. (a) Max-Cut approximation ratios generated by the GNN and the continuous relaxation (without QAOA), as a function of graph size. r_j is the approximation ratio generated by method j ∈ {Relax, GNN} on 6 qubits. We use a simple rounding technique to generate the discrete values from the soft outcomes in both cases. (b) Comparison of time taken by relaxation initialisation versus GNN initialisation as a function of qubit number. The GNN enables much faster inference for Max-Cut. This does not include pre-training time for the GNN, but as an example, training on 1000 graphs for 18 qubits takes only 6 minutes.

We test this for examples of between 6 and 14 nodes in Table I. Here, we see this generalisation feature directly, as the GNN is capable of performing well on graphs larger than those in the training set. Note that a related generalisation behaviour was demonstrated via meta-learning [52, 53] for the parameters of a variational circuit. The work of [52] utilises recurrent neural networks (we revisit this strategy in Section V C 2) for training the QAOA, but generalisation in this case was possible due to the structure of the algorithm and parameter concentration [21]. The FLexible Initializer for arbitrarily-sized Parametrized quantum circuits (FLIP) [53] also has related parameter generalisation capabilities, which would be interesting to compare and incorporate with warm-starting initialisers in future work.


Figure 4: Comparison of all initialisation techniques. We use 3-regular graphs over 8 qubits. (a) Convergence of initialisations as a function of training iteration. The depth of QAOA is fixed to 5. (b) Comparison of initialisations as a function of QAOA depth p. We plot the average for each method over 10 runs.

The rows in Table I correspond to the graph size on which the model was trained, and the columns correspond to the graph size on which the GNN was then tested. For example, both training and testing on graphs with 10 nodes gives an approximation ratio of ≈ 0.93, and if we instead train the GNN using only graphs of 6 nodes, there is no drop in performance. Even reducing the training graph size from 14 to 6 nodes only incurs a drop in the approximation ratio of 7% (on test graphs of 14 nodes). Again, we also mention that we are limited to small problem sizes due to the overhead of simulating the QAOA circuit. As with the discussions above, it has been observed that GNNs trained on 30-node graphs have the ability to generalise to 300 nodes [57, 58]. As such, we would only expect to see improved performance as we gain access to larger hardware.

Train size \ Test size     8       10      12      14
6                          0.91    0.93    0.89    0.89
8                          0.93    0.92    0.89    0.91
10                                 0.93    0.90    0.89
12                                         0.90    0.89
14                                                 0.96

Table I: Value of the approximation ratio, r, as a function of training and test graph size. In each case, both the train and test set consist of 1000 random graphs of the appropriate sizes. The diagonal entries correspond to instances where the train graph size is the same as the test size.

V. OPTIMISATION OF THE QAOA

Now, we move to the second focus of this work, which is a comparison between a wide range of different optimisers which can be used to train the QAOA when solving the Max-Cut problem. A large variety of optimisers for VQAs have been proposed in the literature, and each has its own respective advantages and disadvantages when solving a particular problem. Due to the hybrid quantum-classical nature of these algorithms, many of the optimisation techniques have been taken directly from the classical literature on, for example, the training of classical neural networks. However, due to the need to improve the performance of near-term quantum algorithms, there has also been much effort put into the discovery of quantum-aware optimisers, which may include many non-classical hyperparameters such as, for example, the number of measurement shots to be taken to compute quantities of interest from quantum states [80, 81]. In the following sections, we implement and compare a number of these optimisers. We begin by evaluating gradient-based and gradient-free optimisers in Section V A. We then compare quantum and classical methods for simultaneous perturbation stochastic optimisation in Section V B, which typically have lower overheads than the gradient-free or gradient-based optimisers since all parameters are updated in a single optimisation step, as opposed to parameter-wise updates. Finally, we implement some neural network based optimisers in Section V C, which operate via reinforcement learning and the meta-learning optimiser mentioned above. In all of the above cases, we use vanilla stochastic gradient descent optimisation as a benchmark.

A. Gradient-based versus gradient-free optimisation

For a given parameterised quantum circuit instance, equipped with parameters θ_t at iteration (epoch) t, the parameters at iteration t+1 are given by an update rule:

\theta_{t+1} = \theta_t - \Delta C(\theta_t)    (16)

Here ∆C(θ_t) is the update rule, which contains information about the cost function to be optimised at epoch t, C(θ_t). In gradient-based optimisers, the update contains information about the gradients of C(θ_t) with respect to the parameters θ_t. In contrast, gradient-free (or zeroth-order) optimisation methods use only information about C(θ_t) itself. One may also incorporate second-order derivative information; such methods tend to outperform the previous ones, but are typically more expensive as a result. In the following, we test techniques which fall into all of these categories, for a range of qubit numbers and QAOA depths.

A general form of this update rule, when incorporating gradient information, can be written as:

\theta_{t+1} = \theta_t - \eta(t, \theta)\, g(\theta)^{-1} \nabla C(\theta_t)    (17)

where η(t, θ) is a learning rate, which determines the speed of convergence and may depend on the previous parameters θ and on t. The quantity g(θ) ∈ R^{d×d} is a metric tensor, which incorporates information about the parameter landscape. This tensor can be the classical or quantum Fisher information (QFI), for example. In the case of the latter, the elements of g(θ), when dealing with a parameterised state |ψ_θ⟩, are given by:

g_{ij}(\theta) := \mathrm{Re}\left\{ \left\langle \frac{\partial \psi_\theta}{\partial \theta_i} \,\middle|\, \frac{\partial \psi_\theta}{\partial \theta_j} \right\rangle - \left\langle \frac{\partial \psi_\theta}{\partial \theta_i} \,\middle|\, \psi_\theta \right\rangle \left\langle \psi_\theta \,\middle|\, \frac{\partial \psi_\theta}{\partial \theta_j} \right\rangle \right\}    (18)

In this form, the gradient update Eq. (17) updates theparameters according to the quantum natural gradient(QNG) [82].
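
Schematically, one step of the preconditioned update of Eq. (17) can be written as in the sketch below. This is our own illustration: the gradient and the metric tensor g(θ) (e.g. the QFI of Eq. (18)) are treated as inputs supplied by circuit evaluations (parameter-shift and QFI measurements), the numerical values are arbitrary, and the small regulariser added before inversion is a standard practical choice rather than something prescribed in the text.

```python
import numpy as np

def preconditioned_step(theta, grad, metric=None, eta=0.05, reg=1e-3):
    """One update of Eq. (17): theta <- theta - eta * g(theta)^{-1} grad C(theta).
    With metric=None (g = identity) this reduces to vanilla gradient descent;
    passing an estimate of the QFI of Eq. (18) gives the QNG step."""
    if metric is None:
        return theta - eta * grad
    g = metric + reg * np.eye(len(theta))      # regularise before solving
    return theta - eta * np.linalg.solve(g, grad)

# Illustrative numbers only; in practice grad and metric come from the quantum device.
theta = np.array([0.1, 0.7, 0.3, 0.9])
grad = np.array([0.4, -0.2, 0.1, 0.05])
qfi = np.diag([0.25, 0.25, 1.0, 1.0])
print(preconditioned_step(theta, grad, qfi))
```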

If we further simplify by taking g = 1 to be the identity and choosing different functions for η(θ, t), we recover many popular optimisation routines such as Adam [83] or Adadelta [84], which incorporate notions such as momentum into the update rule and make the learning rate time-dependent. Such behaviour is desired in order to, for example, allow the parameters to make large steps at the beginning of optimisation (when far from the target), and take smaller ones towards the latter stages when one is close to the optimal solution. The simplest form of gradient descent is the vanilla version, which takes η(θ, t) := η to be a constant. The 'stochastic' versions of gradient descent use an approximation of the cost gradient computed with only a few training examples. In Fig. 5, we begin by comparing some examples of the above gradient-based optimisation rule (specifically using QNG, Adam and RMSProp) to a gradient-free method (COBYLA) [85]. We also add a method known as model gradient descent (MGD), a gradient-based method introduced by [40] that involves quadratic model fitting as a gradient estimation (see Appendix A of [86] for pseudocode).

The results are shown in Fig. 5 for Max-Cut on 3-regular graphs for up to 14 qubits. We observe that optimisation using the QNG outperforms the other methods; however, it does so with a large resource requirement, which is needed to compute the quantum Fisher information (QFI) using quantum circuit evaluations.


Figure 5: Gradient-based versus gradient-free optimisers. We set the QAOA depth to p = 4 and vary the number of qubits. All optimisers have been run 10 times and average values are plotted.

We next examine the convergence speed of the 'quantum-aware' QNG optimiser versus the standard Adam and RMSProp in Fig. 6. Again, QNG significantly reduces convergence time, but at the expense of being a more computationally taxing optimisation method.


Figure 6: Comparison of optimisers relative to convergence speed. We fix the QAOA depth at p = 6 with 14 qubits.

B. Simultaneous perturbation stochastic approximation optimisation

From the above, using the QNG as an optimisation routine is very effective, but it has a large computational burden due to the evaluation of the quantum Fisher information. A strategy to bypass this inefficiency was proposed by [87], who suggested combining the QNG with the simultaneous perturbation stochastic approximation (SPSA) algorithm. SPSA is an efficient method to bypass the linear scaling in the number of parameters incurred by the standard parameter-shift rule [88, 89] for computing quantum gradients. For example, in the expression Eq. (17), when restricted to vanilla gradient descent, one gradient term must be computed for each of the d (2p in the case of the QAOA) parameters. In contrast, SPSA approximates the entire gradient vector by choosing a random direction in parameter space and estimating the gradient in this direction using, for example, a finite difference method. This requires a constant amount of computation relative to the number of parameters. To incorporate the quantum Fisher information, [87] actually uplifts a second-order version of SPSA (called 2-SPSA), which exploits the Hessian of the cost function to be optimised. The update rules for 1-SPSA, 2-SPSA and QN-SPSA are given by:

\theta_{t+1} = \theta_t - \eta \times
\begin{cases}
\nabla C(\theta_t) & \text{1-SPSA} \\
H^{-1}(\theta_t)\, \nabla C(\theta_t) & \text{2-SPSA} \\
g^{-1}(\theta_t)\, \nabla C(\theta_t) & \text{QN-SPSA}
\end{cases}    (19)

where the stochastic approximations to the Hessian¹, H(θ), and to the quantum Fisher information are given by:

H_t := -\frac{1}{2}\, \frac{\delta C}{2\varepsilon^2}\, \frac{\Delta^t_1 (\Delta^t_2)^T + \Delta^t_2 (\Delta^t_1)^T}{2}    (20)

\delta C := C(\theta_t + \varepsilon\Delta^t_1 + \varepsilon\Delta^t_2) - C(\theta_t + \varepsilon\Delta^t_1) - C(\theta_t - \varepsilon\Delta^t_1 + \varepsilon\Delta^t_2) + C(\theta_t - \varepsilon\Delta^t_1)    (21)

and

g_t := -\frac{1}{2}\, \frac{\delta F}{2\varepsilon^2}\, \frac{\Delta^t_1 (\Delta^t_2)^T + \Delta^t_2 (\Delta^t_1)^T}{2}    (22)

\delta F := F(\theta_t, \theta_t + \varepsilon\Delta^t_1 + \varepsilon\Delta^t_2) - F(\theta_t, \theta_t + \varepsilon\Delta^t_1) - F(\theta_t, \theta_t - \varepsilon\Delta^t_1 + \varepsilon\Delta^t_2) + F(\theta_t, \theta_t - \varepsilon\Delta^t_1)    (23)

respectively. Here, F(θ, θ+α) = |⟨ψ_θ|ψ_{θ+α}⟩|² is the fidelity between the parameterised state prepared with angles θ and a 'shifted' version with angles θ+α. The quantities ∆_1, ∆_2 are uniformly random vectors sampled over {−1, 1}^d. Also, ε is a small constant arising from finite-differencing approximations of the gradient; for example, we approximate the true gradient ∇C(θ) by \widehat{\nabla} C(θ), given by:

\widehat{\nabla} C(\theta_t) := \frac{C(\theta_t + \varepsilon\Delta_t) - C(\theta_t - \varepsilon\Delta_t)}{2\varepsilon\, \Delta_t}    (24)

¹ The actual quantity used in [87] is a weighted quantity combining these approximations at all previous time steps.
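
A minimal sketch of plain 1-SPSA built from the gradient estimate of Eq. (24) is given below. It is our own illustration: the demonstration cost is a classical quadratic stand-in (in the QAOA setting C(θ) would instead be estimated from circuit measurements), and the step size, perturbation size and iteration count are arbitrary.

```python
import numpy as np

def spsa_gradient(cost, theta, eps=0.1, rng=np.random.default_rng()):
    """Eq. (24): finite-difference estimate along a random +/-1 direction,
    using two cost evaluations regardless of the number of parameters."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    diff = cost(theta + eps * delta) - cost(theta - eps * delta)
    return diff / (2 * eps * delta)          # elementwise: 1/delta_i = delta_i

def spsa_minimise(cost, theta0, eta=0.1, steps=200, seed=0):
    """Plain 1-SPSA update (first line of Eq. (19))."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        theta -= eta * spsa_gradient(cost, theta, rng=rng)
    return theta

# Classical stand-in cost with minimum at (1, -2, 0.5).
target = np.array([1.0, -2.0, 0.5])
cost = lambda th: np.sum((th - target) ** 2)
print(spsa_minimise(cost, np.zeros(3)))      # converges towards the target
```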

We compare these three perturbation methods against SGD once again in Fig. 7, with a fixed learning rate of η = 0.01. Notice that SGD performs comparably to 1-SPSA, but at the expense of more cost function evaluations.

C. Neural optimisation

In this section, we move to a different methodology for finding optimal QAOA parameters from those presented in the previous sections. Specifically, as with the incorporation of graph neural networks in the initialisation of the algorithm, we can test neural network based methods for the optimisation itself. We test two proposals given in the literature to optimise parameterised quantum circuits. The first, based on a reinforcement learning approach, uses the method of [36]. The second is derived from using meta-learning to optimise quantum circuits, as proposed by [52]. Both of these approaches involve neural networks outputting the optimised parameters, by either predicting the update rule or directly predicting the QAOA parameters.

1. Reinforcement learning optimisation

The work of [36] frames the QAOA optimisation as a reinforcement learning problem, adapting [90] to the problem-specific nature of the QAOA. The primary idea is to construct and learn a policy, π(a, s), via which a reinforcement learning agent associates a state, s_t ∈ S, to an action, a_t ∈ A. In [36], an action is the update applied to the parameters (similarly to Eq. (16)), ∆γ, ∆β. A state, s_t ∈ S, consists of the finite differences of the QAOA cost, ∆C(θ_{tl}), and of the parameters, ∆γ_{tl}, ∆β_{tl}. Here, l ∈ {t−1, ..., t−L} ranges over the L iterations preceding the current iteration, t. The possible corresponding actions, a_t ∈ A, are the set of parameter differences, {∆γ_{tl}, ∆β_{tl}}_{l=t−1}. The goal of the reinforcement learning agent is to maximise the reward, R(s_t, a_t, s_{t+1}), which in this case is the change of C between two consecutive iterations, t and t+1. The agent will aim to maximise a discounted version of the total reward over iterations.
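
To make the state/reward construction above concrete, the following sketch accumulates the finite differences that form the agent's observation and the per-step reward. It is entirely our own illustration: the class name, the history length L, the sign convention for the reward and the bookkeeping details are illustrative, and the actor-critic/PPO training loop of [36] is omitted.

```python
import numpy as np
from collections import deque

class QAOAHistory:
    """Keeps the last L (cost difference, parameter difference) pairs, which form
    the RL state s_t described in the text; the reward is the change in the cost
    between consecutive iterations (taken positive here when the cost decreases)."""
    def __init__(self, L=4):
        self.records = deque(maxlen=L)
        self.prev_cost = None
        self.prev_params = None

    def update(self, cost, params):
        params = np.asarray(params, dtype=float)
        reward = 0.0
        if self.prev_cost is not None:
            reward = self.prev_cost - cost
            self.records.append((reward, params - self.prev_params))
        self.prev_cost, self.prev_params = cost, params
        return reward

    def state(self):
        """Flatten the stored history into the agent's observation vector."""
        if not self.records:
            return np.zeros(0)
        return np.concatenate([np.append(dp, dc) for dc, dp in self.records])

# Toy usage: two QAOA iterations with made-up costs and 2p = 4 parameters.
hist = QAOAHistory(L=3)
hist.update(cost=-2.0, params=[0.1, 0.2, 0.3, 0.4])
r = hist.update(cost=-2.5, params=[0.15, 0.25, 0.3, 0.45])
print(r, hist.state().shape)
```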

The specific approach used to search for a policy, proposed by [36], is an actor-critic network within the proximal policy optimisation (PPO) algorithm [91], with a fully connected two-hidden-layer perceptron of 64 neurons for both the actor and the critic. The authors observed an eight-fold improvement in the approximation ratio compared to the gradient-free Nelder-Mead optimiser.²

² This was achieved by a hybrid approach where Nelder-Mead was applied to optimise further after a near-optimal set of parameters was found by the RL agent.


Figure 7: Simultaneous perturbation stochastic approximation (1-, 2- and QN-SPSA) for Max-Cut with QAOA. Plots show the approximation ratio as a function of (a) depth (qubit number fixed at 10), (b) number of qubits (QAOA depth fixed to p = 5) and (c) training iteration (qubit number and QAOA depth fixed to 10 and 7 respectively). In all cases, the average is taken over 10 independent optimisation runs.

Furthermore, the ability of this method to generalise across different graph sizes is reminiscent of our above QAOA initialisation approach using GNNs.

2. Meta-learning optimisation

A second method to incorporate neural networks is via meta-learning. In the classical realm, this is commonly used in the form of one neural network predicting parameters for another. The method we adopt here is one proposed by [52, 92], which uses a neural optimiser that, when given information about the current state of the optimisation routine (in a VQA this could be the current expectation value of a quantum Hamiltonian relative to the parameterised state), proposes a new set of parameters for the quantum algorithm. Specifically, [52, 92] adopt a long short-term memory (LSTM) network as the neural optimiser (with trainable parameters, ϕ), an example of a recurrent neural network (RNN). Using this architecture, the parameters at iteration t+1 are output as:

s_{t+1}, \theta_{t+1} = \mathrm{RNN}_{\phi}(s_t, \theta_t, C_t)    (25)

Here, s_t is the hidden state of the LSTM at iteration t, and the next hidden state is also suggested by the neural optimiser along with the QAOA parameters. C_t is used as a training input to the neural optimiser, which in the case of a VQA is an approximation to the expectation value of the problem Hamiltonian, i.e. Eq. (3). The cost function for the RNN chosen by [52] incorporates the averaged history of the cost at previous iterations, as well as a term that encourages exploration of the parameter landscape. We compare this approach against SGD (with a fixed learning rate of η = 0.01) and the previous RL-based approach in Fig. 8.
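
The recurrent update of Eq. (25) can be sketched as below in PyTorch. This is our illustration only: the paper's experiments use TensorFlow Quantum, and the hidden size, the way the cost is fed to the cell, and the single unrolled step are assumptions rather than the architecture of [52, 92].

```python
import torch
import torch.nn as nn

class RNNOptimiser(nn.Module):
    """Schematic of Eq. (25): s_{t+1}, theta_{t+1} = RNN_phi(s_t, theta_t, C_t)."""
    def __init__(self, n_params, hidden=32):
        super().__init__()
        self.cell = nn.LSTMCell(n_params + 1, hidden)   # input: [theta_t, C_t]
        self.head = nn.Linear(hidden, n_params)          # proposes theta_{t+1}

    def forward(self, theta, cost, state):
        x = torch.cat([theta, cost.view(1)]).unsqueeze(0)
        h, c = self.cell(x, state)
        return self.head(h).squeeze(0), (h, c)

# One unrolled step on a stand-in cost; a real run would query the QAOA circuit
# for C_t and train phi by backpropagating through the unrolled trajectory.
n_params = 6                                   # 2p angles for depth p = 3
opt_net = RNNOptimiser(n_params)
theta = torch.zeros(n_params)
state = (torch.zeros(1, 32), torch.zeros(1, 32))
cost = (theta ** 2).sum()                      # placeholder for the measured <H_C>
theta, state = opt_net(theta, cost.detach(), state)
print(theta.shape)
```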


Figure 8: Comparison of neural optimisers. We compare the LSTM-based meta-learning against a reinforcement learning optimiser, with vanilla stochastic gradient descent (SGD) used as a benchmark. Once each neural optimiser has converged, we continue the optimisation with SGD. Here we use 10 qubits for Max-Cut and QAOA depth p = 6. The results are averaged over 10 independent runs.

VI. CONCLUSION AND OUTLOOK

The work presented in this paper builds new techniques for analysing the QAOA algorithm, a powerful quantum algorithm for combinatorial optimisation problems. Here, we build an efficient and differentiable process using the powerful machinery of graph neural networks. GNNs have been extensively studied in the classical domain for a variety of graph problems, and we adopt them as an initialisation technique for the QAOA, a necessary step to ensure the QAOA is capable of finding solutions efficiently. Good initialisation techniques are especially crucial for variational algorithms to achieve good performance when implemented on depth-limited near-term quantum hardware. Contrary to previous works on QAOA initialisation, our GNN approach does not require a separate instance to be solved each time one encounters a new problem graph, and can therefore speed up inference time across graphs. We demonstrated this in the case of the QAOA by showing good generalisation capabilities on new (even larger) test graphs than the family of graphs on which the GNNs were trained. To complement the initialisation of the algorithm, we investigated the search for optimal QAOA parameters, or optimisation, with a variety of methods. In particular, we incorporated gradient-based and gradient-free classical and quantum-aware optimisers, along with more sophisticated optimisation methods incorporating meta- and reinforcement learning.

There is a large scope for future work, particularly in the further incorporation and investigation of classical machine learning techniques and models to improve near-term quantum algorithms. One could consider alternative GNN structures for initialisation, other neural optimisers, or utilising transfer learning techniques. For example, one could study the combination of the warm-starting abilities of GNNs with generic quantum circuit initialisers such as FLIP [53], both of which exhibit generalisation capabilities to larger problem sizes.

A second extension of our proposal could be to other graph problems besides simply Max-Cut, and at much larger scales, following [57]. Finally, one could consider the use of truly quantum machine learning models such as quantum graph neural networks [93] or others [10, 94].

ACKNOWLEDGEMENTS

The results of this paper were produced using TensorFlow Quantum [60]. NJ wrote the code and ran the experiments. BC, EK and NK supervised the project. All authors contributed to manuscript writing.

[1] John Preskill. Quantum Computing in the NISQ era and beyond. Quantum, 2:79, August 2018. URL: https://quantum-journal.org/papers/q-2018-08-06-79/, doi:10.22331/q-2018-08-06-79.
[2] Alberto Peruzzo, Jarrod McClean, Peter Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J. Love, Alan Aspuru-Guzik, and Jeremy L. O'Brien. A variational eigenvalue solver on a photonic quantum processor. Nature Communications, 5(1):1–7, July 2014. URL: https://www.nature.com/articles/ncomms5213, doi:10.1038/ncomms5213.
[3] Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. A Quantum Approximate Optimization Algorithm. arXiv:1411.4028 [quant-ph], November 2014. URL: http://arxiv.org/abs/1411.4028.
[4] Jarrod R. McClean, Jonathan Romero, Ryan Babbush, and Alan Aspuru-Guzik. The theory of variational hybrid quantum-classical algorithms. New Journal of Physics, 18(2):023023, February 2016. URL: https://doi.org/10.1088%2F1367-2630%2F18%2F2%2F023023, doi:10.1088/1367-2630/18/2/023023.
[5] M. Cerezo, Andrew Arrasmith, Ryan Babbush, Simon C. Benjamin, Suguru Endo, Keisuke Fujii, Jarrod R. McClean, Kosuke Mitarai, Xiao Yuan, Lukasz Cincio, and Patrick J. Coles. Variational quantum algorithms. Nature Reviews Physics, 3(9):625–644, September 2021. URL: https://www.nature.com/articles/s42254-021-00348-9, doi:10.1038/s42254-021-00348-9.
[6] Kishor Bharti, Alba Cervera-Lierta, Thi Ha Kyaw, Tobias Haug, Sumner Alperin-Lea, Abhinav Anand, Matthias Degroote, Hermanni Heimonen, Jakob S. Kottmann, Tim Menke, Wai-Keong Mok, Sukin Sim, Leong-Chuan Kwek, and Alan Aspuru-Guzik. Noisy intermediate-scale quantum (NISQ) algorithms. arXiv:2101.08448 [cond-mat, physics:quant-ph], January 2021. URL: http://arxiv.org/abs/2101.08448.
[7] Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, and Seth Lloyd. Quantum machine learning. Nature, 549(7671):195–202, September 2017. URL: https://www.nature.com/articles/nature23474, doi:10.1038/nature23474.
[8] Jacob Biamonte. On the Theory of Modern Quantum Algorithms. arXiv:2009.10088 [math-ph, physics:quant-ph], September 2020. URL: http://arxiv.org/abs/2009.10088.
[9] Edward Farhi and Hartmut Neven. Classification with Quantum Neural Networks on Near Term Processors. arXiv:1802.06002 [quant-ph], February 2018. URL: http://arxiv.org/abs/1802.06002.
[10] Marcello Benedetti, Erika Lloyd, Stefan Sack, and Mattia Fiorentini. Parameterized quantum circuits as machine learning models. Quantum Sci. Technol., 4(4):043001, November 2019. URL: https://doi.org/10.1088%2F2058-9565%2Fab4eb5, doi:10.1088/2058-9565/ab4eb5.
[11] Francisco Barahona, Martin Grotschel, Michael Junger, and Gerhard Reinelt. An application of combinatorial optimization to statistical physics and circuit layout design. Operations Research, 36(3):493–513, 1988. URL: http://jstor.org/stable/170992.
[12] Jan Poland and Thomas Zeugmann. Clustering Pairwise Distances with Missing Data: Maximum Cuts Versus Normalized Cuts. In Ljupco Todorovski, Nada Lavrac, and Klaus P. Jantke, editors, Discovery Science, 9th International Conference, DS 2006, Barcelona, Spain, October 7-10, 2006, Proceedings, volume 4265 of Lecture Notes in Computer Science, pages 197–208. Springer, 2006. URL: https://doi.org/10.1007/11893318_21, doi:10.1007/11893318_21.

[13] Subhash Khot. On the power of unique 2-prover 1-round games. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing, pages 767–775. ACM Press, 2002.
[14] Subhash Khot, Guy Kindler, Elchanan Mossel, and Ryan O'Donnell. Optimal Inapproximability Results for MAX-CUT and Other 2-Variable CSPs? SIAM Journal on Computing, 37(1):319–357, January 2007. URL: https://epubs.siam.org/doi/10.1137/S0097539705447372, doi:10.1137/S0097539705447372.
[15] Julia Kempe, Oded Regev, and Ben Toner. The Unique Games Conjecture with Entangled Provers is False. In Algebraic Methods in Computational Complexity, 2007.
[16] M. B. Hastings. Classical and Quantum Bounded Depth Approximation Algorithms. arXiv:1905.07047 [quant-ph], August 2019. URL: http://arxiv.org/abs/1905.07047.
[17] Edward Farhi, Jeffrey Goldstone, Sam Gutmann, and Leo Zhou. The Quantum Approximate Optimization Algorithm and the Sherrington-Kirkpatrick Model at Infinite Size. arXiv:1910.08187 [cond-mat, physics:quant-ph], December 2020. URL: http://arxiv.org/abs/1910.08187.
[18] Daniel Stilck Franca and Raul Garcia-Patron. Limitations of optimization algorithms on noisy quantum devices. arXiv:2009.05532 [quant-ph], September 2020. URL: http://arxiv.org/abs/2009.05532.
[19] V. Akshay, H. Philathong, M. E. S. Morales, and J. D. Biamonte. Reachability Deficits in Quantum Approximate Optimization. Phys. Rev. Lett., 124(9):090504, March 2020. URL: https://link.aps.org/doi/10.1103/PhysRevLett.124.090504, doi:10.1103/PhysRevLett.124.090504.
[20] Sami Boulebnane. Improving the Quantum Approximate Optimization Algorithm with postselection. arXiv:2011.05425 [quant-ph], November 2020. URL: http://arxiv.org/abs/2011.05425.
[21] V. Akshay, D. Rabinovich, E. Campos, and J. Biamonte. Parameter Concentration in Quantum Approximate Optimization. Physical Review A, 104(1):L010401, July 2021. URL: http://arxiv.org/abs/2103.11976, doi:10.1103/PhysRevA.104.L010401.
[22] D. Rabinovich, R. Sengupta, E. Campos, V. Akshay, and J. Biamonte. Progress towards analytically optimal angles in quantum approximate optimisation. arXiv:2109.11566 [math-ph, physics:quant-ph], September 2021. URL: http://arxiv.org/abs/2109.11566.
[23] Joao Basso, Edward Farhi, Kunal Marwaha, Benjamin Villalonga, and Leo Zhou. The Quantum Approximate Optimization Algorithm at High Depth for MaxCut on Large-Girth Regular Graphs and the Sherrington-Kirkpatrick Model. arXiv:2110.14206 [quant-ph], October 2021. URL: http://arxiv.org/abs/2110.14206.
[24] Stuart Hadfield, Zhihui Wang, Bryan O'Gorman, Eleanor G. Rieffel, Davide Venturelli, and Rupak Biswas. From the Quantum Approximate Optimization Algorithm to a Quantum Alternating Operator Ansatz. Algorithms, 12(2):34, February 2019. URL: https://www.mdpi.com/1999-4893/12/2/34, doi:10.3390/a12020034.
[25] Ryan LaRose, Eleanor Rieffel, and Davide Venturelli. Mixer-Phaser Ansatze for Quantum Optimization with Hard Constraints. arXiv:2107.06651 [quant-ph], July 2021. URL: http://arxiv.org/abs/2107.06651.
[26] Linghua Zhu, Ho Lun Tang, George S. Barron, F. A. Calderon-Vargas, Nicholas J. Mayhall, Edwin Barnes, and Sophia E. Economou. An adaptive quantum approximate optimization algorithm for solving combinatorial problems on a quantum computer. arXiv:2005.10258 [quant-ph], December 2020. URL: http://arxiv.org/abs/2005.10258.
[27] Stuart Hadfield, Tad Hogg, and Eleanor G. Rieffel. Analytical Framework for Quantum Alternating Operator Ansätze. arXiv:2105.06996 [quant-ph], May 2021. URL: http://arxiv.org/abs/2105.06996.
[28] Guillaume Verdon, Juan Miguel Arrazola, Kamil Bradler, and Nathan Killoran. A Quantum Approximate Optimization Algorithm for continuous problems. arXiv:1902.00409 [quant-ph], February 2019. URL: http://arxiv.org/abs/1902.00409.
[29] Panagiotis Kl Barkoutsos, Giacomo Nannicini, Anton Robert, Ivano Tavernelli, and Stefan Woerner. Improving Variational Quantum Optimization using CVaR. Quantum, 4:256, April 2020. URL: https://quantum-journal.org/papers/q-2020-04-20-256/, doi:10.22331/q-2020-04-20-256.
[30] Ioannis Kolotouros and Petros Wallden. An evolving objective function for improved variational quantum optimisation. arXiv:2105.11766 [quant-ph], May 2021. URL: http://arxiv.org/abs/2105.11766.
[31] David Amaro, Carlo Modica, Matthias Rosenkranz, Mattia Fiorentini, Marcello Benedetti, and Michael Lubasch. Filtering variational quantum algorithms for combinatorial optimization. arXiv:2106.10055 [quant-ph], June 2021. URL: http://arxiv.org/abs/2106.10055.
[32] Daniel J. Egger, Jakub Marecek, and Stefan Woerner. Warm-starting quantum optimization. Quantum, 5:479, June 2021. URL: http://dx.doi.org/10.22331/q-2021-06-17-479, doi:10.22331/q-2021-06-17-479.
[33] Stefan H. Sack and Maksym Serbyn. Quantum annealing initialization of the quantum approximate optimization algorithm. Quantum, 5:491, July 2021. URL: http://dx.doi.org/10.22331/q-2021-07-01-491, doi:10.22331/q-2021-07-01-491.
[34] Gian Giacomo Guerreschi and Mikhail Smelyanskiy. Practical optimization for hybrid quantum-classical algorithms. arXiv:1701.01450 [quant-ph], January 2017. URL: http://arxiv.org/abs/1701.01450.
[35] Nikolaj Moll, Panagiotis Barkoutsos, Lev S Bishop, Jerry M Chow, Andrew Cross, Daniel J Egger, Stefan Filipp, Andreas Fuhrer, Jay M Gambetta, Marc Ganzhorn, and et al. Quantum optimization using variational algorithms on near-term quantum devices. Quantum Science and Technology, 3(3):030503, June 2018. URL: http://dx.doi.org/10.1088/2058-9565/aab822, doi:10.1088/2058-9565/aab822.
[36] Sami Khairy, Ruslan Shaydulin, Lukasz Cincio, Yuri Alexeev, and Prasanna Balaprakash. Reinforcement-Learning-Based Variational Quantum Circuits Optimization for Combinatorial Problems. arXiv:1911.04574 [quant-ph, stat], November 2019. URL: http://arxiv.org/abs/1911.04574.
[37] Michael Streif and Martin Leib. Training the quantum approximate optimization algorithm without access to a quantum processing unit. Quantum Science and Technology, 5(3):034008, May 2020. doi:10.1088/2058-9565/ab8c2b.

[38] Leo Zhou, Sheng-Tao Wang, Soonwon Choi, Hannes Pichler, and Mikhail D. Lukin. Quantum Approximate Optimization Algorithm: Performance, Mechanism, and Implementation on Near-Term Devices. Phys. Rev. X, 10(2):021067, June 2020. URL: https://link.aps.org/doi/10.1103/PhysRevX.10.021067, doi:10.1103/PhysRevX.10.021067.
[39] David Amaro, Matthias Rosenkranz, Nathan Fitzpatrick, Koji Hirano, and Mattia Fiorentini. A case study of variational quantum algorithms for a job shop scheduling problem. arXiv:2109.03745 [quant-ph], September 2021. URL: http://arxiv.org/abs/2109.03745.
[40] Matthew P. Harrigan, Kevin J. Sung, Matthew Neeley, Kevin J. Satzinger, Frank Arute, Kunal Arya, Juan Atalaya, Joseph C. Bardin, Rami Barends, Sergio Boixo, Michael Broughton, Bob B. Buckley, David A. Buell, Brian Burkett, Nicholas Bushnell, Yu Chen, Zijun Chen, Ben Chiaro, Roberto Collins, William Courtney, Sean Demura, Andrew Dunsworth, Daniel Eppens, Austin Fowler, Brooks Foxen, Craig Gidney, Marissa Giustina, Rob Graff, Steve Habegger, Alan Ho, Sabrina Hong, Trent Huang, L. B. Ioffe, Sergei V. Isakov, Evan Jeffrey, Zhang Jiang, Cody Jones, Dvir Kafri, Kostyantyn Kechedzhi, Julian Kelly, Seon Kim, Paul V. Klimov, Alexander N. Korotkov, Fedor Kostritsa, David Landhuis, Pavel Laptev, Mike Lindmark, Martin Leib, Orion Martin, John M. Martinis, Jarrod R. McClean, Matt McEwen, Anthony Megrant, Xiao Mi, Masoud Mohseni, Wojciech Mruczkiewicz, Josh Mutus, Ofer Naaman, Charles Neill, Florian Neukart, Murphy Yuezhen Niu, Thomas E. O'Brien, Bryan O'Gorman, Eric Ostby, Andre Petukhov, Harald Putterman, Chris Quintana, Pedram Roushan, Nicholas C. Rubin, Daniel Sank, Andrea Skolik, Vadim Smelyanskiy, Doug Strain, Michael Streif, Marco Szalay, Amit Vainsencher, Theodore White, Z. Jamie Yao, Ping Yeh, Adam Zalcman, Leo Zhou, Hartmut Neven, Dave Bacon, Erik Lucero, Edward Farhi, and Ryan Babbush. Quantum approximate optimization of non-planar graph problems on a planar superconducting processor. Nature Physics, 17(3):332–336, March 2021. URL: https://www.nature.com/articles/s41567-020-01105-y, doi:10.1038/s41567-020-01105-y.
[41] Cheng Xue, Zhao-Yun Chen, Yu-Chun Wu, and Guo-Ping Guo. Effects of Quantum Noise on Quantum Approximate Optimization Algorithm. arXiv:1909.02196 [quant-ph], December 2019. URL: http://arxiv.org/abs/1909.02196.
[42] Jeffrey Marshall, Filip Wudarski, Stuart Hadfield, and Tad Hogg. Characterizing local noise in QAOA circuits. IOP SciNotes, 1(2):025208, August 2020. doi:10.1088/2633-1357/abb0d7.
[43] Ryan LaRose. Overview and Comparison of Gate Level Quantum Software Platforms. Quantum, 3:130, March 2019. URL: https://quantum-journal.org/papers/q-2019-03-25-130/, doi:10.22331/q-2019-03-25-130.
[44] Jarrod R. McClean, Sergio Boixo, Vadim N. Smelyanskiy, Ryan Babbush, and Hartmut Neven. Barren plateaus in quantum neural network training landscapes. Nature Communications, 9(1):4812, November 2018. URL: https://www.nature.com/articles/s41467-018-07090-4, doi:10.1038/s41467-018-07090-4.
[45] Roeland Wiersema, Cunlu Zhou, Yvette de Sereville, Juan Felipe Carrasquilla, Yong Baek Kim, and Henry Yuen. Exploring Entanglement and Optimization within the Hamiltonian Variational Ansatz. PRX Quantum, 1(2):020319, December 2020. URL: https://link.aps.org/doi/10.1103/PRXQuantum.1.020319, doi:10.1103/PRXQuantum.1.020319.
[46] M. Cerezo, Akira Sone, Tyler Volkoff, Lukasz Cincio, and Patrick J. Coles. Cost function dependent barren plateaus in shallow parametrized quantum circuits. Nature Communications, 12(1):1791, March 2021. URL: https://www.nature.com/articles/s41467-021-21728-w, doi:10.1038/s41467-021-21728-w.
[47] Martin Larocca, Piotr Czarnik, Kunal Sharma, Gopikrishnan Muraleedharan, Patrick J. Coles, and M. Cerezo. Diagnosing barren plateaus with tools from quantum optimal control. arXiv:2105.14377 [quant-ph], May 2021.
[48] Xuchen You and Xiaodi Wu. Exponentially Many Local Minima in Quantum Neural Networks. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12144–12155. PMLR, July 2021. URL: https://proceedings.mlr.press/v139/you21c.html.
[49] Javier Rivera-Dean, Patrick Huembeli, Antonio Acín, and Joseph Bowles. Avoiding local minima in Variational Quantum Algorithms with Neural Networks. arXiv:2104.02955 [quant-ph], April 2021. URL: http://arxiv.org/abs/2104.02955.
[50] Andrew Arrasmith, Zoe Holmes, M. Cerezo, and Patrick J. Coles. Equivalence of quantum barren plateaus to cost concentration and narrow gorges. arXiv:2104.05868 [quant-ph], April 2021. URL: http://arxiv.org/abs/2104.05868.
[51] James Dborin, Fergus Barratt, Vinul Wimalaweera, Lewis Wright, and Andrew G. Green. Matrix Product State Pre-Training for Quantum Machine Learning. arXiv:2106.05742 [quant-ph], July 2021. URL: http://arxiv.org/abs/2106.05742.
[52] Guillaume Verdon, Michael Broughton, Jarrod R. McClean, Kevin J. Sung, Ryan Babbush, Zhang Jiang, Hartmut Neven, and Masoud Mohseni. Learning to learn with quantum neural networks via classical neural networks. arXiv:1907.05415 [quant-ph], July 2019. URL: http://arxiv.org/abs/1907.05415.
[53] Frederic Sauvage, Sukin Sim, Alexander A. Kunitsa, William A. Simon, Marta Mauri, and Alejandro Perdomo-Ortiz. FLIP: A flexible initializer for arbitrarily-sized parametrized quantum circuits. arXiv:2103.08572 [quant-ph], May 2021. URL: http://arxiv.org/abs/2103.08572.
[54] Alba Cervera-Lierta, Jakob S. Kottmann, and Alan Aspuru-Guzik. Meta-Variational Quantum Eigensolver: Learning Energy Profiles of Parameterized Hamiltonians for Quantum Simulation. PRX Quantum, 2(2):020329, May 2021. URL: https://link.aps.org/doi/10.1103/PRXQuantum.2.020329, doi:10.1103/PRXQuantum.2.020329.


[55] Quentin Cappart, Didier Chetelat, Elias B. Khalil, Andrea Lodi, Christopher Morris, and Petar Velickovic. Combinatorial Optimization and Reasoning with Graph Neural Networks. In Zhi-Hua Zhou, editor, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 4348–4355. International Joint Conferences on Artificial Intelligence Organization, August 2021. doi:10.24963/ijcai.2021/595.
[56] James Kotary, Ferdinando Fioretto, Pascal Van Hentenryck, and Bryan Wilder. End-to-End Constrained Optimization Learning: A Survey. In Zhi-Hua Zhou, editor, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 4475–4482. International Joint Conferences on Artificial Intelligence Organization, August 2021. doi:10.24963/ijcai.2021/610.
[57] Martin J. A. Schuetz, J. Kyle Brubaker, and Helmut G. Katzgraber. Combinatorial Optimization with Physics-Inspired Graph Neural Networks. arXiv:2107.01188 [cond-mat, physics:quant-ph], July 2021. URL: http://arxiv.org/abs/2107.01188.
[58] Weichi Yao, Afonso S. Bandeira, and Soledad Villar. Experimental performance of graph neural networks on random instances of max-cut. In Wavelets and Sparsity XVIII, volume 11138, page 111380S. International Society for Optics and Photonics, September 2019. URL: https://www.spiedigitallibrary.org/conference-proceedings-of-spie/11138/111380S/Experimental-performance-of-graph-neural-networks-on-random-instances-of/10.1117/12.2529608.short, doi:10.1117/12.2529608.
[59] Ville Bergholm, Josh Izaac, Maria Schuld, Christian Gogolin, M. Sohaib Alam, Shahnawaz Ahmed, Juan Miguel Arrazola, Carsten Blank, Alain Delgado, Soran Jahangiri, Keri McKiernan, Johannes Jakob Meyer, Zeyue Niu, Antal Szava, and Nathan Killoran. PennyLane: Automatic differentiation of hybrid quantum-classical computations. arXiv:1811.04968 [physics, physics:quant-ph], February 2020. URL: http://arxiv.org/abs/1811.04968.
[60] Michael Broughton, Guillaume Verdon, Trevor McCourt, Antonio J. Martinez, Jae Hyeon Yoo, Sergei V. Isakov, Philip Massey, Murphy Yuezhen Niu, Ramin Halavati, Evan Peters, Martin Leib, Andrea Skolik, Michael Streif, David Von Dollen, Jarrod R. McClean, Sergio Boixo, Dave Bacon, Alan K. Ho, Hartmut Neven, and Masoud Mohseni. TensorFlow Quantum: A Software Framework for Quantum Machine Learning. arXiv:2003.02989 [cond-mat, physics:quant-ph], March 2020. URL: http://arxiv.org/abs/2003.02989.
[61] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, May 2010. PMLR. URL: https://proceedings.mlr.press/v9/glorot10a.html.
[62] Michael R. Garey and David S. Johnson. Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., USA, 1990.
[63] Christos H. Papadimitriou and Mihalis Yannakakis. Optimization, approximation, and complexity classes. Journal of Computer and System Sciences, 43(3):425–440, December 1991. URL: https://www.sciencedirect.com/science/article/pii/002200009190023X, doi:10.1016/0022-0000(91)90023-X.
[64] Sergey Bravyi, Alexander Kliesch, Robert Koenig, and Eugene Tang. Hybrid quantum-classical algorithms for approximate graph coloring. arXiv:2011.13420 [cond-mat, physics:quant-ph], November 2020. URL: http://arxiv.org/abs/2011.13420.
[65] Sergey Bravyi, Alexander Kliesch, Robert Koenig, and Eugene Tang. Obstacles to Variational Quantum Optimization from Symmetry Protection. Phys. Rev. Lett., 125(26):260505, December 2020. Publisher: American Physical Society. URL: https://link.aps.org/doi/10.1103/PhysRevLett.125.260505, doi:10.1103/PhysRevLett.125.260505.
[66] Michael Overton and Henry Wolkowicz. Semidefinite programming. Mathematical Programming, 77:105–109, April 1997. doi:10.1007/BF02614431.
[67] Tadashi Kadowaki and Hidetoshi Nishimori. Quantum annealing in the transverse Ising model. Physical Review E, 58(5):5355–5363, November 1998. URL: http://dx.doi.org/10.1103/PhysRevE.58.5355, doi:10.1103/physreve.58.5355.
[68] Philipp Hauke, Helmut G Katzgraber, Wolfgang Lechner, Hidetoshi Nishimori, and William D Oliver. Perspectives of quantum annealing: methods and implementations. Reports on Progress in Physics, 83(5):054401, May 2020. URL: http://dx.doi.org/10.1088/1361-6633/ab85b8, doi:10.1088/1361-6633/ab85b8.
[69] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019. URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[70] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. arXiv:1605.08695 [cs], May 2016. URL: http://arxiv.org/abs/1605.08695.
[71] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The Graph Neural Network Model. IEEE Transactions on Neural Networks, 20(1):61–80, January 2009. doi:10.1109/TNN.2008.2005605.
[72] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph Attention Networks. arXiv:1710.10903 [cs, stat], February 2018. URL: http://arxiv.org/abs/1710.10903.


[73] Si Zhang, Hanghang Tong, Jiejun Xu, and Ross Maciejewski. Graph convolutional networks: a comprehensive review. Computational Social Networks, 6(1):11, November 2019. doi:10.1186/s40649-019-0069-y.
[74] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI Open, 1:57–81, January 2020. URL: https://www.sciencedirect.com/science/article/pii/S2666651021000012, doi:10.1016/j.aiopen.2021.01.001.
[75] Zhengdao Chen, Lisha Li, and Joan Bruna. Supervised Community Detection with Line Graph Neural Networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL: https://openreview.net/forum?id=H1g0Z3A9Fm.
[76] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning Combinatorial Optimization Algorithms over Graphs. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper/2017/file/d9896106ca98d3d05b8cbdf4fd8b13a1-Paper.pdf.
[77] Michel Deudon, Pierre Cournut, Alexandre Lacoste, Y. Adulyasak, and Louis-Martin Rousseau. Learning Heuristics for the TSP by Policy Gradient. In CPAIOR, 2018.
[78] Wouter Kool, Herke van Hoof, and Max Welling. Attention, Learn to Solve Routing Problems! In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL: https://openreview.net/forum?id=ByxBFsRqYm.
[79] Chaitanya K. Joshi, Quentin Cappart, Louis-Martin Rousseau, Thomas Laurent, and Xavier Bresson. Learning TSP Requires Rethinking Generalization. arXiv:2006.07054 [cs, stat], June 2020. URL: http://arxiv.org/abs/2006.07054.
[80] Ryan Sweke, Frederik Wilde, Johannes Jakob Meyer, Maria Schuld, Paul K. Fahrmann, Barthelemy Meynard-Piganeau, and Jens Eisert. Stochastic gradient descent for hybrid quantum-classical optimization. Quantum, 4:314, August 2020. URL: https://quantum-journal.org/papers/q-2020-08-31-314/, doi:10.22331/q-2020-08-31-314.
[81] Jonas M. Kubler, Andrew Arrasmith, Lukasz Cincio, and Patrick J. Coles. An Adaptive Optimizer for Measurement-Frugal Variational Algorithms. arXiv:1909.09083 [quant-ph], October 2019. URL: http://arxiv.org/abs/1909.09083.
[82] James Stokes, Josh Izaac, Nathan Killoran, and Giuseppe Carleo. Quantum Natural Gradient. Quantum, 4:269, May 2020. URL: https://quantum-journal.org/papers/q-2020-05-25-269/, doi:10.22331/q-2020-05-25-269.
[83] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL: http://arxiv.org/abs/1412.6980.
[84] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv:1212.5701 [cs], December 2012.
[85] M. J. D. Powell. A Direct Search Optimization Method That Models the Objective and Constraint Functions by Linear Interpolation. In Susana Gomez and Jean-Pierre Hennart, editors, Advances in Optimization and Numerical Analysis, pages 51–67. Springer Netherlands, Dordrecht, 1994. doi:10.1007/978-94-015-8330-5_4.
[86] Kevin J. Sung, Jiahao Yao, Matthew P. Harrigan, Nicholas C. Rubin, Zhang Jiang, Lin Lin, Ryan Babbush, and Jarrod R. McClean. Using models to improve optimizers for variational quantum algorithms. Quantum Science and Technology, 5(4):044008, October 2020. doi:10.1088/2058-9565/abb6d9.
[87] Julien Gacon, Christa Zoufal, Giuseppe Carleo, and Stefan Woerner. Simultaneous Perturbation Stochastic Approximation of the Quantum Fisher Information. arXiv:2103.09232 [quant-ph], March 2021. URL: http://arxiv.org/abs/2103.09232.
[88] K. Mitarai, M. Negoro, M. Kitagawa, and K. Fujii. Quantum circuit learning. Phys. Rev. A, 98(3):032309, September 2018. URL: https://link.aps.org/doi/10.1103/PhysRevA.98.032309, doi:10.1103/PhysRevA.98.032309.
[89] Maria Schuld, Ville Bergholm, Christian Gogolin, Josh Izaac, and Nathan Killoran. Evaluating analytic gradients on quantum hardware. Phys. Rev. A, 99(3):032331, March 2019. URL: https://link.aps.org/doi/10.1103/PhysRevA.99.032331, doi:10.1103/PhysRevA.99.032331.
[90] Ke Li and Jitendra Malik. Learning to Optimize. arXiv:1606.01885 [cs, math, stat], June 2016. URL: http://arxiv.org/abs/1606.01885.
[91] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv:1707.06347 [cs], August 2017. URL: http://arxiv.org/abs/1707.06347.
[92] Max Wilson, Sam Stromswold, Filip Wudarski, Stuart Hadfield, Norm M. Tubman, and Eleanor Rieffel. Optimizing quantum heuristics with meta-learning. arXiv:1908.03185 [quant-ph], August 2019. URL: http://arxiv.org/abs/1908.03185.
[93] Guillaume Verdon, Jacob Marks, Sasha Nanda, Stefan Leichenauer, and Jack Hidary. Quantum Hamiltonian-Based Models and the Variational Quantum Thermalizer Algorithm. arXiv:1910.02071 [quant-ph], October 2019. URL: http://arxiv.org/abs/1910.02071.
[94] Amira Abbas, David Sutter, Christa Zoufal, Aurelien Lucchi, Alessio Figalli, and Stefan Woerner. The power of quantum neural networks. Nature Computational Science, 1(6):403–409, June 2021. URL: https://www.nature.com/articles/s43588-021-00084-1, doi:10.1038/s43588-021-00084-1.
[95] Florent Krzakala, Cristopher Moore, Elchanan Mossel, Joe Neeman, Allan Sly, Lenka Zdeborova, and Pan Zhang. Spectral redemption in clustering sparse networks. Proceedings of the National Academy of Sciences, 110(52):20935–20940, 2013. URL: https://www.pnas.org/content/110/52/20935, doi:10.1073/pnas.1312486110.


Appendix A: Graph neural network architectures

1. Line graph neural network

As mentioned in the main text, the primary GNN architecture we choose is a line graph neural network (LGNN), also adopted by [58] for combinatorial optimisation and proposed by [75]. Given a graph G := (V_G, E_G) with vertices V_G and edges E_G = {(i, j) | i, j ∈ V_G and i ≠ j}, the line graph, denoted L(G), is constructed by taking the edges of G as the nodes of L(G), E_G → V_{L(G)}. L(G) has an edge between two of its vertices only if the corresponding edges share a node in the original graph G. For example, if G had three nodes a, b and c connected as a−b and b−c, the vertex set of L(G) would contain nodes labelled (a−b) and (b−c), with an edge between them since both contain the vertex b. This behaviour is described by a 'non-backtracking' operator [58, 75], introduced by [95], which enables information to propagate in a directed fashion on L(G).
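To make the construction concrete, the following is a minimal sketch of the line graph for the small example above, written with networkx purely for illustration (this is not the tooling described in this paper); nx.line_graph returns a graph whose nodes are the edges of G.

```python
# Illustrative sketch only: build L(G) for the path a-b-c.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c")])

LG = nx.line_graph(G)      # nodes of L(G) are the edges of G
print(sorted(LG.nodes()))  # e.g. [('a', 'b'), ('b', 'c')]
print(list(LG.edges()))    # a single edge joining them, since both share node b
```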

The LGNN in fact contains two separate graph neural networks, one defined on G and another defined on L(G). The GNN on G has feature vectors h_t^{n_ν} ∈ R^d for each node n_ν and each iteration t. Similarly, the GNN on L(G) has feature vectors g_t^{n_μ} for every node n_μ in L(G).

Without the information from the line graph, the feature vectors for G would be updated as:

y_{t+1}^{n_\nu} := h_t^{n_\nu}\,\theta_t^{0} + D\,h_t^{n_\nu}\,\theta_t^{1} + \sum_{j=1}^{J} A_j\,h_t^{n_\nu}\,\theta_t^{j}    (A1)

\bar{y}_{t+1}^{n_\nu} := f\left(y_{t+1}^{n_\nu}\right)    (A2)

h_{t+1}^{n_\nu} = \left[y_{t+1}^{n_\nu},\ \bar{y}_{t+1}^{n_\nu}\right]    (A3)

where h_{t+1}^{n_ν} results from the concatenation of the two vectors y_{t+1}^{n_ν} and \bar{y}_{t+1}^{n_ν}. D is the degree matrix of G and [A_j]_{lm} := min(1, [A^{2^j}]_{lm}) are power-graph adjacency matrices (where A is the adjacency matrix of G), which allow information to be aggregated from different neighbourhoods. The matrix element [A^{2^j}]_{lm} gives the number of walks of length 2^j between node l and node m, and A_j converts this information into a binary matrix describing whether a walk (of length 2^j) exists between l and m. f is a nonlinear function, taken in [58, 75] to be the ReLU, f(x) = max(0, x).
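As an illustration only, a minimal NumPy sketch of one iteration of Eqs. (A1)-(A3) follows. The weight list theta and the number of power-graph terms J are hypothetical placeholders for the trainable matrices θ_t^k, and the power graphs are formed from A^{2^j} as described above; this is not the implementation used in our experiments.

```python
# Sketch of the node-graph update of Eqs. (A1)-(A3), vectorised over all nodes.
import numpy as np

def lgnn_node_update(H, A, theta, f=lambda x: np.maximum(x, 0.0)):
    """H: (n, d) node features h_t; A: (n, n) adjacency of G;
    theta: list [theta^0, theta^1, theta^2, ..., theta^{J+1}] of (d, d') weights."""
    D = np.diag(A.sum(axis=1))                 # degree matrix of G
    J = len(theta) - 2
    # Power-graph adjacencies: [A_j]_lm = min(1, [A^(2^j)]_lm)
    A_pow = [np.minimum(1, np.linalg.matrix_power(A, 2 ** j)) for j in range(1, J + 1)]

    Y = H @ theta[0] + D @ H @ theta[1]        # Eq. (A1), first two terms
    for j, Aj in enumerate(A_pow, start=1):
        Y += Aj @ H @ theta[j + 1]             # Eq. (A1), sum over j
    Y_bar = f(Y)                               # Eq. (A2)
    return np.concatenate([Y, Y_bar], axis=1)  # Eq. (A3): h_{t+1} = [y, ybar]
```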

Now, including updates from the line graph into the GNN, the feature vectors from each graph are updated in tandem as follows:

y_{t+1}^{n_\nu} := h_t^{n_\nu}\,\theta_t^{0} + D\,h_t^{n_\nu}\,\theta_t^{1} + \sum_{j=1}^{J} A_j\,h_t^{n_\nu}\,\theta_t^{j} + S\,g_t^{n_\mu}\,\theta_t^{J+1} + U\,g_t^{n_\mu}\,\theta_t^{J+2}    (A4)

z_{t+1}^{n_\mu} := g_t^{n_\mu}\,\varphi_t^{0} + D_{L(G)}\,g_t^{n_\mu}\,\varphi_t^{1} + \sum_{j=1}^{J} B_j\,g_t^{n_\mu}\,\varphi_t^{j} + S\,h_t^{n_\nu}\,\varphi_t^{J+1} + U\,h_t^{n_\nu}\,\varphi_t^{J+2}    (A5)

\bar{y}_{t+1}^{n_\nu} := f\left(y_{t+1}^{n_\nu}\right), \qquad \bar{z}_{t+1}^{n_\mu} := f\left(z_{t+1}^{n_\mu}\right)    (A6)

h_{t+1}^{n_\nu} = \left[y_{t+1}^{n_\nu},\ \bar{y}_{t+1}^{n_\nu}\right], \qquad g_{t+1}^{n_\mu} = \left[z_{t+1}^{n_\mu},\ \bar{z}_{t+1}^{n_\mu}\right]    (A7)

Here, θ and ϕ are the trainable parameters of the GNN over G and the GNN over L(G), respectively. The matrices S and U are the signed and unsigned incidence matrices. These are defined for every node i of G and every node (k → l) of L(G) (which are the edges of G) as:

U_{i,(k \to l)} = \begin{cases} 1 & \text{if } i = k \\ 0 & \text{otherwise} \end{cases}, \qquad S_{i,(k \to l)} = \begin{cases} 1 & \text{if } i = k \\ -1 & \text{if } i = l \\ 0 & \text{otherwise} \end{cases}    (A8)

Finally, B is the non-backtracking operator describing L(G), defined as:

B_{(i \to j),(k \to l)} = \begin{cases} 1 & \text{if } j = k \text{ and } i \neq l \\ 0 & \text{otherwise} \end{cases}    (A9)

with its power graphs B_j defined analogously to A_j.
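For concreteness, the sketch below (again illustrative NumPy, not the code used for our experiments) constructs the incidence matrices U and S of Eq. (A8) and the non-backtracking operator B of Eq. (A9) from a list of directed edges of G, which label the nodes of L(G).

```python
# Sketch: incidence matrices and non-backtracking operator for a directed edge list.
import numpy as np

def incidence_and_backtracking(num_nodes, directed_edges):
    m = len(directed_edges)
    U = np.zeros((num_nodes, m))
    S = np.zeros((num_nodes, m))
    for e, (k, l) in enumerate(directed_edges):
        U[k, e] = 1.0              # U_{i,(k->l)} = 1 if i = k
        S[k, e] = 1.0              # S_{i,(k->l)} = 1 if i = k
        S[l, e] = -1.0             #              = -1 if i = l
    B = np.zeros((m, m))
    for e1, (i, j) in enumerate(directed_edges):
        for e2, (k, l) in enumerate(directed_edges):
            if j == k and i != l:  # B_{(i->j),(k->l)} = 1 iff j = k and i != l
                B[e1, e2] = 1.0
    return U, S, B

# Example: the path a-b-c over nodes {0, 1, 2}, with each edge in both directions.
U, S, B = incidence_and_backtracking(3, [(0, 1), (1, 0), (1, 2), (2, 1)])
```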


2. Graph convolutional network

The alternative architecture, chosen by [57], is the graph convolutional network (GCN), which is simpler than the line graph neural network above. Here, the embedding vector updates have the following form:

h_{t+1}^{n_\nu} = f\left( \theta_t^{0} \sum_{j : n_j \in \mathcal{N}(n_\nu)} \frac{h_t^{n_j}}{|\mathcal{N}(n_\nu)|} + \theta_t^{1}\, h_t^{n_\nu} \right)    (A10)

where N(n_ν) is the local neighbourhood of n_ν, N(n_ν) = {n_j ∈ V | (n_ν, n_j) ∈ E}, and | · | denotes cardinality. This is a simpler architecture since a single update step only passes information to a given node from its immediate neighbours, whereas a single LGNN step aggregates contributions from across the graph.
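A minimal NumPy sketch of the update in Eq. (A10), assuming a dense adjacency matrix and ReLU for f (an illustration rather than the implementation used here), is:

```python
# Sketch of the graph convolutional update of Eq. (A10): each node averages its
# neighbours' features, mixes them with theta0, adds its own features mixed with
# theta1, and applies the nonlinearity f.
import numpy as np

def gcn_update(H, A, theta0, theta1, f=lambda x: np.maximum(x, 0.0)):
    """H: (n, d) node features h_t; A: (n, n) adjacency matrix of G."""
    deg = A.sum(axis=1, keepdims=True)             # |N(n_nu)| for each node
    neighbour_mean = (A @ H) / np.maximum(deg, 1)  # sum_{j in N(n_nu)} h_j / |N(n_nu)|
    return f(neighbour_mean @ theta0 + H @ theta1)
```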