Software Pipelining for Quantum Loop Programs

23
Software Pipelining for Quantum Loop Programs Guo Jingzhe Department of Computer Science and Technology Tsinghua University China [email protected] Mingsheng Ying CQSI, FEIT, University of Technology Sydney Australia Institute of Software, CAS, China Tsinghua University, China [email protected] Abstract We propose a method for performing software pipelining on quantum for-loop programs, exploiting parallelism in and across iterations. We redefine concepts that are useful in pro- gram optimization, including array aliasing, instruction de- pendency and resource conflict, this time in optimization of quantum programs. Using the redefined concepts, we present a software pipelining algorithm exploiting instruction-level parallelism in quantum loop programs. The optimization method is then evaluated on some test cases, including pop- ular applications like QAOA, and compared with several baseline results. The evaluation results show that our ap- proach outperforms loop optimizers exploiting only in-loop optimization chances by reducing total depth of the loop program to close to the optimal program depth obtained by full loop unrolling, while generating much smaller code in size. This is the first step towards optimization of a quantum program with such loop control flow as far as we know. Keywords: quantum program scheduling, quantum program compilation 1 Introduction Quantum computer hardware has reached the so-called quan- tum supremacy showing that quantum computation can ac- tually outperform classical computation for certain tasks, but it is still in the NISQ (Noisy-Intermediate-Scale-Quantum) era where there are no sufficient quantum bits (qubits, for short) for quantum error correction. Program optimization is particularly important for ex- ecuting a quantum program on NISQ hardware in order to reduce the number of required qubits, the length of gate pipeline, and to mitigate quantum noise. Indeed, there has already been plenty of work on optimization and paralleliza- tion of quantum programs. Theoretically, it was proved in [5] that compilation of quantum circuits with discretized time and parallel execution can be NP complete. Practically, quantum hardware architectures, especially those based on superconducting qubits, provide instruction level support for exploiting parallelism in quantum programs; for example, Rigetti’s Quil [20] allows programmers to explicitly spec- ify multiple instructions that do not involve same qubits to be executed together, while in Qiskit, ASAP or ALAP scheduling is performed implicitly [23]. Furthermore, sev- eral compilers have been implemented that can optimize quantum circuits by exploiting instruction level parallelism; for example, ScaffCC [11] introduces critical path analysis to find the β€œdepth” of a quantum program efficiently, re- vealing how much parallelism there is in a quantum circuit; commutativity-aware logic scheduling is proposed in [18] to adopt a more relaxing quantum dependency graph than β€œqubit dependency” by taking in mind commutativity be- tween the gates and CNOT gates as well as high-level commutative blocks while scheduling circuits. There are also some more sophisticated optimization strategies reported in in previous works [10, 13, 19, 22]. Quantum hardware will soon be able to execute quan- tum programs with more complex program constructs, e.g. for-loops. However, most of the optimization techniques in previous work only deal with sequential quantum circuits. Some methods allow loop programs as their input, but those loops will be unrolled immediately and optimization will be performed on the unrolled code. Loop unrolling is the technique that allows optimization across all iterations of a loop, but comes at a price of long compilation time, re- dundant final code and run-time compulsory cache misses. As quantum hardware in the near future may allow up to hundreds of qubits, it will often be helpful to preserve loop structure during optimization since the growth in number of qubits will also lead to increment in total gate count, as well as increment in difficulty unrolling the entire program. Software pipelining [12] is a common technique in op- timizing classical loop prosgrams. Inspired by the execution of an unrolled loop on an out-of-order machine, software pipelining reorganizes the loop by a software compiler in- stead of by hardware. There are two major approaches for software pipelining: β€’ Unrolling-based software pipelining usually unrolls loop for several iterations and finds repeating pattern in the unrolled part; see for example [2]. β€’ Modulo scheduling guesses an initiation interval first and try to schedule instructions one by one under dependency constraints and resource constraints; see for example [12]. Our Contributions: We hereby presents a software pipelin- ing algorithm for parallelizing a certain kind of quantum loop programs. Our parallelization technique is based on a arXiv:2012.12700v1 [quant-ph] 23 Dec 2020

Transcript of Software Pipelining for Quantum Loop Programs

Page 1: Software Pipelining for Quantum Loop Programs

Software Pipelining for Quantum Loop ProgramsGuo Jingzhe

Department of Computer Scienceand Technology

Tsinghua UniversityChina

[email protected]

Mingsheng YingCQSI, FEIT, University of Technology Sydney

AustraliaInstitute of Software, CAS, China

Tsinghua University, [email protected]

AbstractWe propose a method for performing software pipelining onquantum for-loop programs, exploiting parallelism in andacross iterations. We redefine concepts that are useful in pro-gram optimization, including array aliasing, instruction de-pendency and resource conflict, this time in optimization ofquantum programs. Using the redefined concepts, we presenta software pipelining algorithm exploiting instruction-levelparallelism in quantum loop programs. The optimizationmethod is then evaluated on some test cases, including pop-ular applications like QAOA, and compared with severalbaseline results. The evaluation results show that our ap-proach outperforms loop optimizers exploiting only in-loopoptimization chances by reducing total depth of the loopprogram to close to the optimal program depth obtained byfull loop unrolling, while generating much smaller code insize. This is the first step towards optimization of a quantumprogram with such loop control flow as far as we know.

Keywords: quantumprogram scheduling, quantumprogramcompilation

1 IntroductionQuantum computer hardware has reached the so-called quan-tum supremacy showing that quantum computation can ac-tually outperform classical computation for certain tasks, butit is still in the NISQ (Noisy-Intermediate-Scale-Quantum)era where there are no sufficient quantum bits (qubits, forshort) for quantum error correction.

Program optimization is particularly important for ex-ecuting a quantum program on NISQ hardware in order toreduce the number of required qubits, the length of gatepipeline, and to mitigate quantum noise. Indeed, there hasalready been plenty of work on optimization and paralleliza-tion of quantum programs. Theoretically, it was proved in[5] that compilation of quantum circuits with discretizedtime and parallel execution can be NP complete. Practically,quantum hardware architectures, especially those based onsuperconducting qubits, provide instruction level support forexploiting parallelism in quantum programs; for example,Rigetti’s Quil [20] allows programmers to explicitly spec-ify multiple instructions that do not involve same qubitsto be executed together, while in Qiskit, ASAP or ALAP

scheduling is performed implicitly [23]. Furthermore, sev-eral compilers have been implemented that can optimizequantum circuits by exploiting instruction level parallelism;for example, ScaffCC [11] introduces critical path analysisto find the β€œdepth” of a quantum program efficiently, re-vealing how much parallelism there is in a quantum circuit;commutativity-aware logic scheduling is proposed in [18]to adopt a more relaxing quantum dependency graph thanβ€œqubit dependency” by taking in mind commutativity be-tween the 𝑅𝑍 gates and CNOT gates as well as high-levelcommutative blocks while scheduling circuits. There are alsosome more sophisticated optimization strategies reported inin previous works [10, 13, 19, 22] .Quantum hardware will soon be able to execute quan-

tum programs with more complex program constructs, e.g.for-loops. However, most of the optimization techniques inprevious work only deal with sequential quantum circuits.Some methods allow loop programs as their input, but thoseloops will be unrolled immediately and optimization willbe performed on the unrolled code. Loop unrolling is thetechnique that allows optimization across all iterations ofa loop, but comes at a price of long compilation time, re-dundant final code and run-time compulsory cache misses.As quantum hardware in the near future may allow up tohundreds of qubits, it will often be helpful to preserve loopstructure during optimization since the growth in numberof qubits will also lead to increment in total gate count, aswell as increment in difficulty unrolling the entire program.

Software pipelining [12] is a common technique in op-timizing classical loop prosgrams. Inspired by the executionof an unrolled loop on an out-of-order machine, softwarepipelining reorganizes the loop by a software compiler in-stead of by hardware. There are two major approaches forsoftware pipelining:

β€’ Unrolling-based software pipelining usually unrollsloop for several iterations and finds repeating patternin the unrolled part; see for example [2].

β€’ Modulo scheduling guesses an initiation interval firstand try to schedule instructions one by one underdependency constraints and resource constraints; seefor example [12].

OurContributions:We hereby presents a software pipelin-ing algorithm for parallelizing a certain kind of quantumloop programs. Our parallelization technique is based on a

arX

iv:2

012.

1270

0v1

[qu

ant-

ph]

23

Dec

202

0

Page 2: Software Pipelining for Quantum Loop Programs

A Preprint, December 23, 2020 Guo, et al.

novel and more relaxed set of dependency rules on a CZ-architecture (Theorems 1 and 2). The algorithm is essentiallya combination of unrolling-based software pipelining andmodulo scheduling [12], with several modifications to makeit work on quantum loop programs.

We carried out experiments on several examples and com-pared the results with the baseline result obtained by loopunrolling. Our approach proves to be a steady step towardbridging the gap between optimization results without con-sidering across-loop optimization and fully unrolling resultswhile restraining the increase in code size.

Organization of the Paper: In Section 2, we review somebasic definitions used in this paper. The theoretical tools fordefining and exploiting parallelism in quantum loop programare developed in Section 3. In Section 4, we present our ap-proach of rescheduling instructions across loops, extractingprologue and epilogue so that depth of the loop kernel canbe reduced. The evaluation results of our experiments aregiven in Section 5. The conclusion is drawn in the Section 6.[For conciseness, all proofs are given in the Appendices.]

2 Preliminaries and ExamplesThis section provides some backgrounds [14, 25] on quantumcomputing and quantum programming.

2.1 Basics of quantum computingThe quantum counterparts of bits are qubits. Mathematically,a state of a single qubit is represented by a 2-dimensionalcomplex column vector (𝛼, 𝛽)𝑇 , where𝑇 stands for transpose.It is often written in the Dirac’s notation as |πœ“ ⟩ = 𝛼 |0⟩+𝛽 |1⟩with |0⟩ = (1, 0)𝑇 , |1⟩ = (0, 1)𝑇 corresponding to classicalbits 0 and 1, respectively. It is required that |πœ“ ⟩ be unit:βˆ₯𝛼 βˆ₯2+βˆ₯𝛽 βˆ₯2 = 1. Intuitively, the qubit is in a superposition of0 and 1, andwhenmeasuring it, wewill get 0with probabilityβˆ₯𝛼 βˆ₯2 and 1 with probability βˆ₯𝛽 βˆ₯2. A gate on the qubit isthen modelled by a 2 Γ— 2 complex matrix π‘ˆ . The output ofπ‘ˆ on an input |πœ“ ⟩ is quantum state |πœ“ β€²βŸ©. Its mathematicalrepresentation as a vector is obtained by the ordinary matrixmultiplicationπ‘ˆ |πœ“ ⟩. To guarantee that |πœ“ β€²βŸ© is always unit,π‘ˆ must be unitary in the sense that π‘ˆ β€ π‘ˆ = 𝐼 where π‘ˆ † isthe adjoint of π‘ˆ obtained by transposing and then complexconjugatingπ‘ˆ . In general, a state of 𝑛 qubits is representedby a 2𝑛-dimensional unit vector, and a gate on 𝑛 qubits isdescribed by a 2𝑛 Γ— 2𝑛 unitary matrix. [For convenience ofthe readers, we present the basic gates used in this paper inAppendix A.]

2.2 Quantum execution environmentSoftware pipelining is a highly machine-dependent approachof optimization. So we must give out some basic assumptionsabout the underlying machine that our algorithm requires.State-of-the-art universal quantum computers differ in manyways:

β€’ Instruction set: A quantum computer chooses a uni-versal set of quantum gates as its low-level instructions.For example, IBM Q[4] uses controlled-NOT CNOTand three one-qubit gatesπ‘ˆ1,π‘ˆ2,π‘ˆ3, but Rigetti Quil[20]uses controlled-Z CZ and one-qubit rotations 𝑅𝑋 , 𝑅𝑍 .We use the universal gate set {π‘ˆ3,𝐢𝑍 } for the reasonthatπ‘ˆ3 itself is universal for single qubit gates, whichallows us to merge single qubit gates at compile time.[see Appendix A for the definition of these gates.]

β€’ Instruction parallelism: Different quantum comput-ers are implemented on different technologies, con-straining their power to execute multiple instructionssimultaneously. Usually superconductive quantum com-puters support parallelism while ion-trap ones do not.We assume qubit-level parallelism: instructions on dif-ferent qubits can always be executed simultaneously.

β€’ Timing: Different quantum computers may use differ-ent timing strategies, using continuous time or discretetime. Also execution time of different instructions maydiffer and is highly machine-dependent. Usually a two-qubit gate (e.g. CZ and CNOT ) costs much longertime than single qubit gates. We use a discrete timemodel with every gate requiring 1 tick equally.

β€’ Qubit connectivity: Different machines may havedifferent qubit topologies. However, we assume that allgates in the input are directly executable, which mayrequire a layout synthesis step before our optimization.

β€’ Classical control. The support for classical controlflow varies among different quantum computers; forexample, IBM Q does not support any complex controlflow, while Rigetti Quil supports branch statements.We assume such classical controls [see Appendix C].

The above assumptions do not fit into the existing quan-tum hardware architecture perfectly (for instance, IBM Qrequires CNOT and Quil disallowsπ‘ˆ3), while the architec-ture of Google’s devices[22] fits these requirements most.With some slight modifications, however, our method can beeasily adapted to unsupported architectures [see AppendixL].

2.3 Quantum loop programsWe focus on a special family of quantum loop programs,called one-dimensional for-loop programs, defined as below:

program :=header statementβˆ—header :=[(qdef | udef)βˆ—]

qdef :=π‘žπ‘’π‘π‘–π‘‘ ident[N];udef :=𝑑𝑒 𝑓 π‘”π‘Žπ‘‘π‘’ ident[N] = gate;

gate :=[(C2Γ—2)βˆ—] | 𝑅𝑍 | 𝑅+𝑍 | π‘ˆπ‘›π‘˜π‘›π‘œπ‘€π‘›

gateref :=ident[expr]qubit :=ident[expr]

op :=𝑆𝑄 (gateref) qubit | 𝐢𝑍 qubit, qubit;

Page 3: Software Pipelining for Quantum Loop Programs

Software Pipelining for Quantum Loop Programs A Preprint, December 23, 2020

statement :=op | 𝑓 π‘œπ‘Ÿ ident 𝑖𝑛 Z π‘‘π‘œ Z{opβˆ—}| 𝑓 π‘œπ‘Ÿ ident 𝑖𝑛 ident π‘‘π‘œ ident{opβˆ—}

expr :=Z βˆ— ident + Z

where:β€’ The loop involves a group of one-dimensional qubitarray variables defined by qdef.

β€’ The loop has only one iteration variable 𝑖 starting fromπ‘Ž to 𝑏 with stride 1. The range [π‘Ž, 𝑏] is completelyknown at compile time, or completely unknown untilexecution. This allows our algorithm to be performedon a program with parametric loop range.

β€’ All array index expressions are in the form (π‘˜π‘– + 𝑏),where 𝑖 is the iteration variable, andπ‘˜, 𝑏 ∈ Z are knownconstants.

β€’ All operations in the loop body are either an one-qubitgate, or a 𝐢𝑍 gate on two qubits. We don’t considermeasurement operations.

β€’ One-qubit gates are defined by udef. They are givenas known matrices, or β€œan element in an array of un-known matrices” when a hint on whether the matrixarray is diagonal or antidiagonal can be given. Thisallows our algorithm to be performed on a programwith parametric gates or performing different gates ondifferent iterations.

At the very start of the entire program, all qubit arrays are ini-tialized as |0⟩. Our optimization may introduce some branchstatements if the endpoints π‘Ž and 𝑏 are unknown before codeexecution. As a result, the output language of the compileris a superset of the input language above, with support forbranch statements [see Appendix C for one possible definitionof output language]. To show versatility of the above loop,let us consider several popular quantum algorithms.

Example 1. Grover algorithm [9] is designed for the black-box searching problem: given a function 𝑓 : {0, 1}𝑛 β†’ {0, 1},find a bitstring π‘₯ : {0, 1}𝑛 such that 𝑓 (π‘₯) = 1. While a classicalalgorithm requires Ξ©(𝑛) calls to the oracle, Grover search canfind a solution in 𝑂 (

βˆšπ‘›) calls of quantum oracle π‘ˆπ‘“ ( |π‘₯⟩ βŠ—

|π‘βŸ©) = |π‘₯⟩ βŠ— |𝑏 βŠ• 𝑓 (π‘₯)⟩. This is done by repeating a series ofquantum gates, called Grover iteration. Grover search can bewritten as the loop program:

for i in 0 to N-1 do𝐻 [π‘ž [𝑖]]

end for𝐻 [π‘žπ‘€π‘œπ‘Ÿπ‘˜ ]for i in 1 to 𝑂 (

βˆšπ‘ ) do

π‘ˆπ‘“ [π‘ž, π‘žπ‘€π‘œπ‘Ÿπ‘˜ ]; (2 |πœ“ ⟩ βŸ¨πœ“ | βˆ’ 𝐼 ) [π‘ž]end for

Example 2. A Quantum Approximate Optimization Algo-rithm (QAOA for short) is designed in [8] to solve the MaxCutproblem on a given graph 𝐺 = βŸ¨π‘‰ , 𝐸⟩. It can be written as aparametric quantum loop program:

for i=0 to (N-1) do𝐻 [π‘ž [𝑖]]

end forfor i=1 to p dofor (π‘Ž, 𝑏) ∈ 𝐸 do𝐢𝑁𝑂𝑇 [π‘ž [π‘Ž], π‘ž[𝑏]];π‘ˆπ΅ [𝑖] [π‘ž [𝑏]]; 𝐢𝑁𝑂𝑇 [π‘ž [π‘Ž], π‘ž[𝑏]]

end forfor j=0 to (N-1) doπ‘ˆπΆ [ 𝑗] [π‘ž [ 𝑗]]

end forend for

Here, we use parametric gate arrays π‘ˆπΆ [𝑖] = 𝑅𝑋 (𝛽𝑖 , 𝑗) andπ‘ˆπ΅ [𝑖] = 𝑅𝑍 (βˆ’πœ”π‘Žπ‘π›Ύπ‘– ) of rotations. The two innermost loops canbe unrolled to satisfy our input language requirements. SinceQAOA repeatedly executes the circuit but each time with dif-ferent sets of angles {𝛽𝑖 } and {𝛾𝑖 }, an optimizer has to supportcompilation of the circuit above without knowing all parame-ters in advance. Note that the compiler can know in advancethat π‘ˆπ΅ [𝑖] are diagonal matrices, and this hint might be usedduring optimization. [for a further explanation of QAOA seesee Appendix B]

3 Theoretical toolsIn this section, we develop a handful of theoretical techniquesrequired in our optimization. To start, let us identify someof the most critical challenges in optimizing quantum loopprograms:

β€’ Instructions may bemerged together at compile time,potentially reducing the total depth. However, merginginstructions needs to know which instructions may beadjacent in the unrolled pattern, thus requiring us toresolve all possible qubit aliasings.

β€’ Data dependency graph in a quantum program isusually much denser than that in classical program,since generally two matrices are not commutable, thatis, 𝐴𝐡 β‰  𝐡𝐴.

β€’ Resource constraint, which prevents instructionsthat do not have dependency from executing together,is quite different in quantum case from classical case.

We will show how much optimization can be done by miti-gating these challenges in loop reordering.

3.1 Gate mergingOur assumptions allow several instructions to be mergedinto a single instruction with the same effect:

β€’ Two adjacent one-qubit gates on the same qubit canbe merged, since we are usingπ‘ˆ3.

β€’ Two adjacent 𝐢𝑍 gates on the same qubits can canceleach other.

Example 3. Figure 1 is a simple case for periodical gate merg-ing pattern. The two one-qubit gates in different iterations may

Page 4: Software Pipelining for Quantum Loop Programs

A Preprint, December 23, 2020 Guo, et al.

f o r i =0 t o 3 doU q [ i ] ; V q [ i + 1 ] ; W q [ i + 2 ] ;

end f o r

(a) Loop program.|q0〉 U

|q1〉 V U

|q2〉 W V U

|q3〉 W V U

|q4〉 W V

|q5〉 W

(b) Unrolled circuit.

|q0〉 U

|q1〉 UV

|q2〉 UVW

|q3〉 UVW

|q4〉 VW

|q5〉 W

(c) Merged.

Figure 1. Single qubit gates can be merged periodically.

|i〉 β€’

|o〉 H H

|j〉 β€’(a) 𝑖 β‰  0 ∧ 𝑖 β‰  βˆ’1

|o〉 H β€’ H

|j〉 β€’(b) 𝑖 = 0

|i〉 β€’

|o〉 H β€’ H

(c) 𝑖 = βˆ’1

Figure 2. The 𝐢𝑍 gate prevents the two Hadamard gatesfrom merging, due to potential qubit aliasing.

merge with each other, thus simplifying the dependency graphand introducing new opportunities for optimization.

Gate merging allows us to decrease count of gates, andthus reduce total execution time. However, the existence ofpotential aliasing adds to the difficulty of finding β€œadjacent”pairs of gates. Figuring out pairs of gates that can be safelymerged is one of the critical problems when scheduling theprogram.

Example 4. Even for a simple program, it can be hard todecide whether two adjacent instructions on a qubit can bemerged. Consider the simple program:

for i=a to b do𝐻 [π‘ž [0]]; 𝐢𝑍 [π‘ž [𝑖], π‘ž[𝑖 + 1]]; 𝐻 [π‘ž [0]];

end forWe can merge the Hadamard gates if and only if βˆ€π‘–, 𝑖 β‰ 

0 ∧ (𝑖 + 1) β‰  0. Three possible cases of 𝑖 lead to three differentresults, as Figure 2 shows.

The above example reveals that resolving qubit aliasingsis crucial in gate merging.

f o r i =0 to 3 doH q [ 1 ] ;CZ q [ i ] , q [ i + 1 ] ;H q [ 1 ] ;

end f o r

(a) Loop program.|q0〉 β€’

|q1〉 H β€’ H H β€’ H H H H H

|q2〉 β€’ β€’

|q3〉 β€’ β€’

|q4〉 β€’

(b) Unrolled circuit.

Figure 3. Unrolled loop does not reveal periodic feature dueto qubis aliasing.

f o r i =0 to 3 doH q [ i ] ;CZ q [ i ] , q [ i + 1 ] ;H q [ i + 1 ] ;

end f o r

(a) Loop program.|q0〉 H β€’

|q1〉 β€’ H H β€’

|q2〉 β€’ H H β€’

|q3〉 β€’ H H β€’

|q4〉 β€’ H

(b) Unrolled circuit.

Figure 4. Periodic feature in the unrolled loop can be cap-tured.

3.2 Qubit aliasing resolutionAllowing arbitrary linear expressions being used to indexqubit arrays introduces the problem of qubit aliasing bothin a single iteration and across iterations. Potential aliasingin quantum programs leads two kinds of problems: lack ofperiodic features in unrolled schedule, and extra complexity indetecting aliasings.

The first problem is that non-periodic features cannot becaptured using software-pipelining (or other loop schedul-ing methods). For example, in Figure 3, the situation where𝐢𝑍 blocks two Hadamards from merging only occurs in oneor two iterations of the loop program, but it prevents themerging in all iterations, since software pipelining can onlygenerate a periodic pattern and has to generate conservativecode. The only kind of aliasing (two different qubit expres-sions refering to the same qubit) that software pipeliningcan capture is those expressions on the same qubit array andwith the same slope, as shown in Figure 4.

Page 5: Software Pipelining for Quantum Loop Programs

Software Pipelining for Quantum Loop Programs A Preprint, December 23, 2020

Figure 5. An example for across-loop qubit aliasing withπ‘˜1 = 3 and π‘˜2 = 2. For 𝑇 = Z, Δ𝑖 = 1, while for 𝑇 = [4, 10],Δ𝑖 = 2.

To see the second problem, we note that detection of mem-ory aliasing [1] is usually solved by an Integer Linear Pro-gramming (ILP) problem solver such as Z3[7]. However, ageneral ILP problem is NP-complete in theory and may takelong time to solve in practice. Fortunately, we will see thatall problems that we are facing can be solved efficiently in𝑂 (1) time without an ILP solver.

We consider two references to a same qubit array:π‘ž [π‘˜1𝑖 + 𝑏1] ,π‘ž [π‘˜2𝑖 + 𝑏2] , 𝑖 ∈ 𝑇 , where 𝑇 is the loop interval when theloop range is known and Z when unknown.

Definition 1. In-loop qubit aliasing: To check whethertwo instructions can always be executed together, we haveto check if one qubit reference may be an alias of another, thatis, (βˆƒπ‘– ∈ 𝑇 ) (π‘˜1𝑖 + 𝑏1 = π‘˜2𝑖 + 𝑏2) .

This problem can be easily solved by checking whether(𝑏2 βˆ’ 𝑏1) is a multiple of (π‘˜1 βˆ’ π‘˜2) and 𝑏2βˆ’π‘1

π‘˜1βˆ’π‘˜2lies in 𝑇 .

Definition 2. Across-loop qubit aliasing: To checkwhetherthere is an across-loop dependency between two instructions,we have to check if one qubit reference may be an alias of an-other qubit reference several iterations later. Thus, we needto find the minimal increment Δ𝑖 β©Ύ 1, s.t.

(βˆƒπ‘– ∈ 𝑇 ) ((𝑖 + Δ𝑖 ∈ 𝑇 ) ∧ (π‘˜1𝑖 + 𝑏1 = π‘˜2 (𝑖 + Δ𝑖) + 𝑏2)) . (1)

This issue can be reduced to the Diophantine equation(π‘˜2 βˆ’ π‘˜1)𝑖 + π‘˜2 (Δ𝑖) = 𝑏1 βˆ’ 𝑏2, 𝑖 ∈ 𝑇, 𝑖 + Δ𝑖 ∈ 𝑇,Δ𝑖 β©Ύ 1, (2)

which can be solved in𝑂 (1) time [see Appendix D]. We solvethe equation every time when needed rather than memoriz-ing its solution. A visualization of across-loop qubit aliasingis presented in Figure 5.

3.3 Instruction data dependencyOne most important step in rescheduling a loop is to find thedata dependencies - instrucions that can not be reorderedwhile scheduling. Previous work mostly defined instructiondependency according to matrix commutativity: the order oftwo instructions can change if their unitary matrices satisfy𝐴𝐡 = 𝐡𝐴. This captures most commutativity between gates,but not all. Here, we relax this requirement by establishingseveral novel and more relexed commutativity rules betweenquantum instructions. Since 𝐢𝑍 gates is the only two-qubit

gate we use and any two𝐢𝑍 gates commute with each other,what we need to care about is commutativity between 𝐢𝑍gates and one-qubit gates.

Definition 3. (CZ conjugation) If for one-qubit gatesπ‘ˆπ΄,π‘ˆπ΅ ,𝑉𝐴 and 𝑉𝐡 , we have πΆπ‘π‘ˆπ΄π‘ˆπ΅πΆπ‘ = 𝑉𝐴𝑉𝐡 , we say 𝐢𝑍 conju-gatesπ‘ˆπ΄ βŠ— π‘ˆπ΅ into 𝑉𝐴 βŠ— 𝑉𝐡 .

Conjugation allows us to swap a 𝐢𝑍 gate with a pair ofone-qubit gates, at the price of changing π‘ˆπ΄ and π‘ˆπ΅ to 𝑉𝐴and 𝑉𝐡 correspondingly. The following theorem identifiesall possible conjugations.

Theorem 1. (CZ conjugation of single qubit gates) 𝐢𝑍 con-jugatesπ‘ˆπ΄ βŠ—π‘ˆπ΅ into some𝑉𝐴 βŠ—π‘‰π΅ if and only ifπ‘ˆπ΄ andπ‘ˆπ΅

are diagonal or anti-diagonal: π‘ˆπ‘– = 𝑅𝑍 (\ ) or π‘ˆπ‘– = 𝑅+𝑍(\ ) for

𝑖 ∈ {𝐴, 𝐡}.Note 1. The antidiagonal rule has been named β€œEjectPhased-Paulis” in [22]. However we propose the rules for both necessityand sufficiency: no more commutation rules can be obtainedat gate level.

Since identity matrix 𝐼 is diagonal, π‘ˆπ΄ and π‘ˆπ΅ can bethought of as going under conjugation separately. Thus, weonly need to consider two special cases: 𝐼 βŠ— 𝑅𝑍 and 𝐼 βŠ— 𝑅+

𝑍.

Note that in conjugation rules 𝑅+𝑍will always introduce a 𝑍

gate to the other qubit. This inspires us to generalize Theo-rem 1 for a generalized form of 𝐢𝑍 defined in the following:

Definition 4. (Generalized 𝐢𝑍 gates) For π‘₯,𝑦 ∈ {0, 1}, wedefine following variants of 𝐢𝑍 gate:𝐢𝑍11 [π‘Ž, 𝑏] = 𝐢𝑍 [π‘Ž, 𝑏], 𝐢𝑍00 [π‘Ž, 𝑏] = βˆ’π‘ [π‘Ž]𝑍 [𝑏]𝐢𝑍 [π‘Ž, 𝑏]𝐢𝑍10 [π‘Ž, 𝑏] = 𝑍 [π‘Ž]𝐢𝑍 [π‘Ž, 𝑏], 𝐢𝑍01 [π‘Ž, 𝑏] = 𝑍 [𝑏]𝐢𝑍 [π‘Ž, 𝑏]

Equivalently,𝐢𝑍π‘₯𝑦 can be defined as follows:𝐢𝑍π‘₯𝑦 |π‘Žπ‘βŸ© =(βˆ’1)π›Ώπ‘Žπ‘₯𝛿𝑏𝑦 |π‘Žπ‘βŸ©, where 𝛿𝑖 𝑗 is Kronecker delta. Now we havethe following commutativity rules for generalized 𝐢𝑍 :

Theorem 2. (Generalized 𝐢𝑍 conjugation of single qubitgates) When exchanged with 𝑅+

𝑍, 𝐢𝑍 gate changes into one of

its variants by toggling the corresponding bit.1. 𝑅𝑍 (𝛼) [𝑏]𝐢𝑍π‘₯𝑦 [π‘Ž, 𝑏] = 𝐢𝑍π‘₯𝑦 [π‘Ž, 𝑏]𝑅𝑍 (𝛼) [𝑏];2. 𝑅+

𝑍(𝛼) [𝑏]𝐢𝑍π‘₯𝑦 [π‘Ž, 𝑏] = 𝐢𝑍π‘₯ (1βˆ’π‘¦) [π‘Ž, 𝑏]𝑅+

𝑍(𝛼) [𝑏].

Since generalized 𝐢𝑍 gates are also diagonal, they com-mute with each other and can be scheduled just as ordinary𝐢𝑍 gate and converted back to 𝐢𝑍 by adding 𝑍 gates.

3.4 Instruction resource constraintQubits have properties that resemble both data and resource:qubits work as quantum data registers and carry quantumdata; meanwhile, qubit-level parallelism allows all instruc-tions, if they operate on different qubits, to be executed simul-taneously. This results in a surprising property for quantumprograms: the resources should be described using linear ex-pressions, instead of by a static β€œresource reservation table”as in the classical case. Using the rules for detecting qubit

Page 6: Software Pipelining for Quantum Loop Programs

A Preprint, December 23, 2020 Guo, et al.

Kernel Compaction

Loop Unrolling

Loop Rotation

Modulo Scheduling #1

Modulo Scheduling #2

Loop Rotation Prologue

Loop Rotation Epilogue

MS #1 Prologue

MS #2 Prologue

MS #2 Epilogue

MS #1 Epilogue

Loop Unrolling Epilogue

Kernel

Branch by m mod C

Figure 6. The entire compilation flow of our approach.

aliasings, we simply check if there is an aliasing betweenthe qubit references from two instructions, that is, the twoinstructions share a same qubit at some iteration and cannotbe executed simultaneously.

4 Rescheduling loop bodyNow we are ready to present the main algorithm for pipelin-ing quantum loop programs. It is based on modulo schedul-ing via hierarchical reduction [3], but several modificationsto the original algorithm are required to fit into schedulingquantum instructions on qubits. The entire flow of our ap-proach is depicted in Figure 6. For simplicity we suppose thenumber of iterations is large enough so that we don’t worryabout generating a long prologue/epilogue.

4.1 Loop body compactionAt first we compact the loop kernel to merge the gates thatcan be trivially merged, including: (a) adjacent single qubitgates; (b) diagonal or antidiagonal single qubit gates andtheir nearby single qubit gates, maybe at the other side of a𝐢𝑍 gate; and (c) adjacent 𝐢𝑍 gates. To this end, we definethe following compaction procedure, which considers thepotential aliasing between qubits:

Definition 5. A greedy procedure for compacting loop kernel:

β€’ Initialize all qubits with an ideneity gate.β€’ Place all instructions one by one. Initialize operationto β€œBlocked”. Check the new instruction (A) against allplaced instructions (B). Update operation according toTable 1.

β€’ Perform the last operation according to the table.– β€œBlocked” means the instruction is put at the end ofthe instruction list.

– β€œMerge with B” means the single qubit instructionis merged with the placed single qubit gate B. If theplaced gate is an antidiagonal,𝑍 gates should be addedfor uncancelled 𝐢𝑍 gates that occur earlier but areplaced after the antidiagonal.

– β€œCancelled” means two 𝐢𝑍 gates are cancelled. Notethat the added 𝑍 gates are not cancelled. Also, a thirdarriving𝐢𝑍 can β€œuncancel” a cancelled𝐢𝑍 , which wealso call as β€œCancelled”.

This compaction can be done in two directions: compact-ing to the left or to the right. They can be seen as the resultsof ASAP schedule and ALAP correspondingly. However, thisprocedure does not guarantee compacting once will con-verge: not all the outputs from the procedure are fixpoints ofthe procedure. For example, the circuit in Figure 7 only con-verges after three applications of left compaction. In general,we have the following:

Theorem 3. Compacting three times results in a fixpoint ofthe compaction procedure.

Note that we allow using unknown single-qubit gates. Ifall components are known to be diagonal or antidiagonal,the product of these matrices is also diagonal or antidiagonal[see Appendix F]. Otherwise, we can only see the productas a general matrix. However, this does not affect our resultof three-time compaction. Also compacting in one directiondoes not capture all chances of merging. Figure 8 shows thatsome single-qubit merging changes are missed out. In prac-tice we perform a left compaction after a right compaction.

4.1.1 Loop unrolling and rotation. Loop kernel com-paction can only discover gate merging and cancellationin one iteration. However, gate merging and cancellationcan also occur across iterations. For example, in Figure 4the last 𝐻 gate in the previous iteration can be merged andcancelled with the first𝐻 gate in the next iteration. This kindof cancellation cannot be discovered by software pipeliningeither, since it is a reordering technique and cannot cancelinstructions out.An instruction 𝑖 in one iteration may merge or cancel

with instruction 𝑗 from 𝑑 β©Ύ 1 iterations later. All poten-tial merging of single qubit gates and cancellable 𝐢𝑍 gatescan be written out by enumerating all pairs of instructions.Loop rotation[15] is an optimization technique to convertacross-loop dependency to in-loop dependency (so that somevariables can be privatized and optimized out). Consider aloop ranging fromπ‘š to 𝑛: {𝐴𝑖𝐡𝑖𝐢𝑖 }π‘›π‘š . Here, 𝐴𝑖 can be ro-tated to the tail of the loop: π΄π‘š {𝐡𝑖𝐢𝑖𝐴𝑖+1}π‘›βˆ’1π‘š 𝐡𝑛𝐢𝑛, and 𝐢𝑖

and 𝐴𝑖+1 are now in one iteration. If 𝐢𝑖 writes into a tem-porary variable and 𝐴𝑖+1 reads from it, this variable can beprivatized. For merging candidates with 𝑑 = 1, we can use asimilar procedure:

Definition 6. An instruction is considered movable if it sat-isfies one of following conditions:

β€’ The instruction is a single-qubit gate, and there are nogates on the same qubit or on an aliasing qubit before it;in this case the instruction can be rotated to the right.

Page 7: Software Pipelining for Quantum Loop Programs

Software Pipelining for Quantum Loop Programs A Preprint, December 23, 2020

A\B SQ with same qubit SQ with in-loop aliasing CZ with same qubit CZ with aliasing qubitDiagonal SQ Merge with B BlockedAntiDiagonal SQ Merge with B Blocked BlockedGeneral SQ Blocked Blocked Blocked BlockedCZ Blocked Blocked If exactly-same then Cancel

Table 1. Operation table for loop kernel compaction. Empty cell means using previous operation. Check is performed fromleft to right, so antidiagonal can pass through 𝐢𝑍 with a same qubit and an aliasing qubit.

|a〉 Z β€’

|b〉 X β€’ H Z H

(a) Original circuit

|a〉 Z β€’

|b〉 X β€’ X

(b) Compacting #1

|a〉 Z β€’ Z

|b〉 β€’(c) Compacting#2

|a〉 β€’

|b〉 β€’(d) Compacting#3

Figure 7. Compacting more than once yields better result.

|a〉 β€’

|b〉 Z β€’ H

Figure 8. Left compaction will miss the chance of compact-ing the 𝑍 gate and the 𝐻 gate.

β€’ The instruction is a 𝐢𝑍 gate, and there are no single-qubit gates on the same qubit or on ailasing qubits; inthis case the instruction can be rotated to the right.

β€’ The instruction is a 𝐢𝑍 gate, and there are no single-qubit gates on the same qubit or on ailasing qubitsexcept the 𝐢𝑍 gate has only one linear offset referencewith π‘˜ = 0 and there is a single-qubit gate on this qubit.In this case, the instruction will be rotated to the rightalong with this single qubit gate.

This definition of movable instructions guarantees theprograms before and after the rotation are equivalent. Weuse the following procedure to rotate one instruction fromleft to right:

1. Find the first unmarked movable instruction that,there exists another instruction to merge or cancelwith 𝑑 = 1.

2. Mark the chosen instruction, and rotate the instruc-tion to the right. The instruction is added to prologueand the others added to epilogue.

3. Perform left compaction on the new loop kernel. Notethat the left-compaction algorithm is modified, so thatmerging single-qubit gates or cancelling𝐢𝑍 gates willclear the mark.

4. If there is no rotatable instruction, stop the procedure.

Corollary 4. If the original loop has only candidates with𝑑 = 1 and no one-qubit gate merges with itself, this procedureeliminates all across-loop merging or cancellation. That is, ifwe unroll the loop after rotation, the unrolled quantum β€œcircuit”should be a fixpoint of compaction procedure.

However, loop rotation can only handle potential gatemerging across one iteraion (i.e. from nearby iterations). Tohandle potential merging across many iteraions, we adoptloop unrolling from classical loop optimization. While themajor objective for loop unrolling is usually to reduce branchdelay, Aiken et al. [2] also used loop unrolling to unroll firstfew iterations of loop and schedule them ASAP, so that re-peating patterns can be recognized into an optimal softwarepipelining schedule. Our approach uses modulo schedulinginstead of kernel recognition, but we can still exploit thepower of loop unrolling to capture patterns that requiremany iterations to reveal. The key point is that unrollingdecreases 𝑑 . Suppose we use a graph to represent all β€œcandi-dates for instruction merging”, with edge𝐴 π‘‘βˆ’β†’ 𝐡 indicatinginstruction 𝐴 will merge with or cancel out instruction 𝐡

from 𝑑 iterations later, if we unroll the loop by 𝐢 times, theweight of the edges in the graph will decrease.

Example 5. Figure 9 gives an example showing the connec-tion between the β€œmerging graph” before unrolling and the oneafter unrolling: if βˆ€π‘‘,𝐢 β©Ύ 𝑑 , there are no edges with 𝑑 > 1.

There is a tradeoff between generated code length (deter-mined by𝐢) and remaining 𝑑 > 1 edges. For example, if thereis an edge with 𝑑 = 10000, we are not likely to unroll theloop for 10000 times just to merge the two single qubit gates.Also for eliminating self-cancelling 𝐢𝑍 gates (i.e. 𝐢𝑍 gateson a pair of constant qubits), we may want 𝐢 β©Ύ 2 and 𝐢

even. In the following discussion we use 𝐢 as a configurablevariable in our algorithm determining the maximal allowedunroll time (and the minimal time of iterations of the loop).The new unrolled loop will be in the form

𝑓 π‘œπ‘Ÿ (𝑖 =π‘š; 𝑖 β©½ 𝑛; 𝑖+ = 𝐢) {π‘œπ‘ (π‘˜π‘‘ 𝑖 + 𝑏𝑑 )}𝑓 π‘œπ‘Ÿ (𝑖 =π‘šβ€²; 𝑖 β©½ 𝑛; 𝑖 + 1) {π‘œπ‘ (π‘˜π‘‘ 𝑖 + 𝑏𝑑 )}

(3)

and the first loop should be written into𝑓 π‘œπ‘Ÿ (𝑖 = 0; 𝑖 β©½ 𝑛′; 𝑖+ = 1) {π‘œπ‘ (πΆπ‘˜π‘‘ 𝑖 + 𝑏𝑑 +π‘šπ‘˜π‘‘ )} (4)

where 𝑛′ =βŒŠπ‘›βˆ’π‘š+1

𝐢

βŒ‹βˆ’ 1 andπ‘šβ€² = 𝐢 (𝑛′ + 1) +π‘š. This step

of transformation makes sure the loop stride is still 1 after

Page 8: Software Pipelining for Quantum Loop Programs

A Preprint, December 23, 2020 Guo, et al.

Figure 9. Example for the QDGs of loop β€œπ΄ 4βˆ’β†’ 𝐡” unrolled2, 3, 4 and 5 times. Unrolling the loop decreases the edgeweight 𝑑 . When 𝐢 =π‘šπ‘Žπ‘₯ {𝑑} all edges will be decreased toweight 1.

loop unrolling. Note that item (π‘šπ‘˜π‘‘ ) appears in every offsetof the loop body. Ifπ‘š is unknown we can’t proceed with ouralgorithm. Fortunately, sinceπ‘š = 𝑝𝐢 +π‘ž, π‘ž =π‘š mod 𝐢 , wehave πΆπ‘˜π‘‘ 𝑖 + 𝑏𝑑 +π‘šπ‘˜π‘‘ = πΆπ‘˜π‘‘ (𝑖 + 𝑝) + 𝑏𝑑 + π‘žπ‘˜π‘‘ , showing thatwhen the range is unknown, the results of array dependencydepend only on the Euclidean modulo π‘ž =π‘š mod 𝐢 . In thiscase, we can generate 𝐢 copies of code for each case of π‘ž,and perform following parts of the algorithm on each copy.Let us briefly summarize our compilation flow till now:

we compact the loop kernel, unroll the loop by 𝐢 , and rotatesome instructions in the unrolled loop kernel. The unrollingstep may copy the loop by 𝐢 times, and steps after unrolling(including rotation) will be performed on each copy.

4.2 Modulo schedulingOur next step is modulo scheduling borrowed from [12]:

1. Find in-loop and loop-carried dependencies.2. Estimate an initialization interval 𝐼 𝐼 . For simplicity

we use binary search and the maximum 𝐼 𝐼 is totalinstruction count. Use Floyd to check validity.

3. Using Tarjan algorithm to find strong connected com-ponents and schedule all SCCs by in-loop dependencysubgraph.

4. Merge every SCC in DDG into one node, obtaining anew DDG.

5. Schedule the new DDG by list scheduling.There are some major differences between quantum pro-grams and the classical programs considered in [12]:

4.2.1 Quantum dependency graph. The instruction de-pendency for quantum programs is described by a QDG

(Quantum Dependency Graph) as a generalization of DDG(Data Dependency Graph), where vertices represent instruc-tions and edges represent precedence constraints that mustbe satisfies while reordering. In modulo scheduling, a de-pendency edge is described by two integers: π‘šπ‘–π‘› and 𝑑𝑖 𝑓 .Suppose there is an edge pointing from instruction 𝐴 to in-struction 𝐡 with parameter (π‘šπ‘–π‘›,𝑑𝑖 𝑓 ), it means β€œinstruction𝐡 from 𝑑𝑖 𝑓 iterations later should be scheduled at leastπ‘šπ‘–π‘›

ticks later than instruction 𝐴 in this iteration”. Recall fromSection 3.2 and 3.3, our dependency is defined by the rules:

1. There are no dependencies between 𝐢𝑍 gates, or be-tween a 𝐢𝑍 and a diagonal single qubit gate.

2. In-loop dependency: if two offsets are on the samequbit array and reveal in-loop qubit aliasing, there isa dependency edge (1, 0) between the correspondinginstructions. To unify with across-loop, we set Δ𝑖 = 0.

3. Across-loop dependency: if two offsets are on the samequbit array and reveal across-loop qubit aliasing withΔ𝑖 , there is a dependency edge (1,Δ𝑖) between thecorresponding instructions.

4. Exception on antidiagonal gates: if the qubit (π‘˜1𝑖 +𝑏1)of an antidiagonal gate aliases with one operand π‘˜2𝑖 +𝑏2 of a 𝐢𝑍 gate and π‘˜1 = π‘˜2, we remove the edge ifthere’s no aliasing on the other operand.

5. Exception on single qubit gates: if two single qubitgates operate on the same qubit array where offsets(π‘˜1𝑖 + 𝑏1) and (π‘˜2𝑖 + 𝑏2) aliases with each other andπ‘˜1 = π‘˜2, we specify the dependency edge to be valued(0,Δ𝑖), that is,π‘šπ‘–π‘› = 0 rather thanπ‘šπ‘–π‘› = 1.

There may be multiple edges in the graph connecting thesame pair of instructions; for example, an in-loop depen-dency and an across-loop dependency between the two in-structions. Since we are going to use Floyd algorithm on thegraph to compute largest distance in modulo scheduling, weonly need the edge with the maximal (π‘šπ‘–π‘› βˆ’ 𝐼 𝐼 Β· 𝑑𝑖 𝑓 ) afterassigning 𝐼 𝐼 . Fortunately we don’t need to save all multipleedges, since the following theorem guarantees that we cancompare (π‘šπ‘–π‘› βˆ’ 𝐼 𝐼 Β· 𝑑𝑖 𝑓 ) before assigning different 𝐼 𝐼s.

Theorem 5. Suppose (π‘šπ‘–π‘›1, 𝑑𝑖 𝑓1), (π‘šπ‘–π‘›2, 𝑑𝑖 𝑓2) are two edgeswithπ‘šπ‘–π‘›1 β©½ 1,π‘šπ‘–π‘›2 β©½ 1 and 𝑑𝑖 𝑓1 > 𝑑𝑖 𝑓2. Then for all 𝐼 𝐼 β©Ύ 1,we have:π‘šπ‘–π‘›1 βˆ’ 𝐼 𝐼 Β· 𝑑𝑖 𝑓1 β©½ π‘šπ‘–π‘›2 βˆ’ 𝐼 𝐼 Β· 𝑑𝑖 𝑓2.

This theorem allows us to sort multiple edges by lexicalordering of (𝑑𝑖 𝑓 ,βˆ’π‘šπ‘–π‘›) (i.e. compare 𝑑𝑖 𝑓 first, and compare(βˆ’π‘šπ‘–π‘›) if 𝑑𝑖 𝑓1 = 𝑑𝑖 𝑓2) and the smallest one is exactly theedge with maximal (π‘šπ‘–π‘› βˆ’ 𝐼 𝐼 Β· 𝑑𝑖 𝑓 ).

4.2.2 Resource conflict handling. Another importantissue when inserting an instruction into modulo schedul-ing table or merging two strong connected components isresource conflict: there is no dependency between two 𝐢𝑍gates, yet they may not be executed together because theymay share a same qubit. To solve this issue, let us first intro-duce several notations:

Page 9: Software Pipelining for Quantum Loop Programs

Software Pipelining for Quantum Loop Programs A Preprint, December 23, 2020

f o r x=m to n doCNOT q1 [ x βˆ’50 ] , q0 [ x + 0 ] ;CNOT q1 [ x βˆ’50 ] , q0 [ x + 0 ] ;

end f o r

(a) Loop program.

(b) Corresponding QDG.

Figure 10. Quantum dependency graph example. Tuplesrepresent (π‘šπ‘–π‘›,𝑑𝑖 𝑓 ).

1. 𝐼 𝐼 is the current iteration interval being tested.2. 𝐿 is the length of the original loop kernel.3. The 𝑐-th instruction in the original loop is placed in

the modulo scheduling table at tick 𝑑 = 𝑝𝐼𝐼 + π‘ž, where𝑝 β©Ύ 0, 0 β©½ π‘ž < 𝐼 𝐼 .

Example 6. Figure 11 is a simple example for modulo sched-uling. In this case, 𝐼 𝐼 = 2 and 𝐿 = 4. Instructions are placedat time slot 0, 2, 3, 4. Thus, 𝐴 from one iteration, 𝐡 from a pre-vious iteration, and 𝐷 from previous 2 iterations are executedsimultaneously, while 𝐢 is executed alone.

We use the retrying scheme: if a resource conflict is de-tected, try next tick. The basic approach to detect resourceconflict is detecting in-loop qubit aliasing. This leads to twonew problems that do not exist in the classical case:

1. The array offsets of instruction operands may increase.As 𝑑 increases, 𝑝 also increase, and the instructioncomes from one more iteration earlier, thus changingarray offsets.

2. The pair of instructions for resource conflict checkingmay not both exist in some iterations. Increasing 𝑑

leads to a long prologue and long epilogue, shrinkingthe range for loop kernel, and may eliminate the re-source conflict that once existed (when the loop rangeis known).

(a) Rescheduled sin-gle iteration. 𝐼 𝐼=2.

(b) Issuingeach iterationreveals loopkernel.

(c) Modulo schedul-ing table. Column in-dex represents origi-nal iteration.

Figure 11. Example for modulo scheduling loop 𝐴𝑖𝐡𝑖𝐢𝑖𝐷𝑖 .In this case 𝐼 𝐼 = 2, 𝐿 = 4, 𝑇 = [0, 2].

Example 7. Suppose when generating the schedule in Figure11, we have inserted instructions 𝐴, 𝐡 and 𝐢 , and are ready toinsert 𝐷 at time slot 4.

1. Since 4 = 2𝐼 𝐼 + 0, the 𝐷 in the loop kernel is from twoiterations earlier compared with the iteration that the𝐴 is in. We have to decrease offset of 𝐷 operands by 2𝑖 .The offseted index may no longer conflict with 𝐴.

2. When checking if there is resource conflict between𝐷 and𝐴, we only need to check the case where both iterationsare valid; that is, 𝑖 = 2. This means the scheduling is stillvalid even if 𝐴0 has a resource conflict with π·βˆ’2, sinceπ·βˆ’2 does not even exist.

In the original modulo scheduling and other classicalscheduling algorithms, the retry strategy only allows 𝐼 𝐼 re-tries. For example, if there is not enough π΄πΏπ‘ˆ or πΉπ‘ƒπ‘ˆ forinstruction 𝐴𝑖 in modulo scheduling table tick π‘ž, there isalso not enough resource for instruction π΄π‘–βˆ’1 from previousiteration. However, this is not true for our case, and we haveto modify the strategy.

Example 8. Suppose we perform modulo scheduling on theprogram in Figure 12. Since the three𝐢𝑍 s are exactly the same,we may expect 𝐼 𝐼 = 3 due to resource conflict. However, if weallow more retries, these 𝐢𝑍 s can be separated into differentiterations and can be executed concurrently with 𝐢𝑍 s fromother iterations.

We consider the general case where loop range is un-known. When placing an instruction in the modulo schedul-ing table, we check its operands with all operands scheduledat this tick. Suppose now we check operand (π‘˜2 (𝑖 βˆ’π‘2) +𝑏2)with operand (π‘˜1 (π‘–βˆ’π‘1)+𝑏1), and we find an aliasing, that is,βˆƒπ‘–0 ∈ Z, π‘˜2 (𝑖0 βˆ’ 𝑝2) + 𝑏2 = π‘˜1 (𝑖0 βˆ’ 𝑝1) + 𝑏1. In case π‘˜1 = π‘˜2,βˆ€π‘– ∈ Z, π‘˜2 (𝑖 βˆ’ 𝑝2) + 𝑏2 = π‘˜1 (𝑖 βˆ’ 𝑝1) + 𝑏1. When π‘˜1 = 0,

Page 10: Software Pipelining for Quantum Loop Programs

A Preprint, December 23, 2020 Guo, et al.

f o r x=0 to 6 doCZ q [ x ] , q [ x + 1 ] ;CZ q [ x ] , q [ x + 1 ] ;CZ q [ x ] , q [ x + 1 ] ;

end f o r

(a) Original Program.

|q0〉 β€’ β€’ β€’

|q1〉 β€’ β€’ β€’ β€’ β€’ β€’

|q2〉 β€’ β€’ β€’ β€’ β€’ β€’

|q3〉 β€’ β€’ β€’ β€’ β€’ β€’

|q4〉 β€’ β€’ β€’ β€’ β€’ β€’

|q5〉 β€’ β€’ β€’ β€’ β€’ β€’

|q6〉 β€’ β€’ β€’ β€’ β€’ β€’

|q7〉 β€’ β€’ β€’

(b)Unrolled Program, for a clearerview.

CZ q [ 0 ] , q [ 1 ] ;CZ q [ 1 ] , q [ 2 ] ;CZ q [ 0 ] , q [ 1 ] ; CZ q [ 2 ] , q [ 3 ] ;CZ q [ 1 ] , q [ 2 ] ; CZ q [ 3 ] , q [ 4 ] ;f o r x=4 to 6 p a r a l l e l do

CZ q [ x βˆ’4 ] , q [ x βˆ’ 3 ] ;CZ q [ x βˆ’2 ] , q [ x βˆ’ 1 ] ;CZ q [ x ] , q [ x + 1 ] ;

end f o rCZ q [ 3 ] , q [ 4 ] ; CZ q [ 5 ] , q [ 6 ] ;CZ q [ 4 ] , q [ 5 ] ; CZ q [ 6 ] , q [ 7 ] ;CZ q [ 5 ] , q [ 6 ] ;CZ q [ 6 ] , q [ 7 ] ;

(c) Software pipelined version.

|q0〉 β€’ β€’ β€’

|q1〉 β€’ β€’ β€’ β€’ β€’ β€’

|q2〉 β€’ β€’ β€’ β€’ β€’ β€’

|q3〉 β€’ β€’ β€’ β€’ β€’ β€’

|q4〉 β€’ β€’ β€’ β€’ β€’ β€’

|q5〉 β€’ β€’ β€’ β€’ β€’ β€’

|q6〉 β€’ β€’ β€’ β€’ β€’ β€’

|q7〉 β€’ β€’ β€’

(d) Software pipelined version, un-rolled.

Figure 12. Three 𝐢𝑍 gates in a row. Although there seems to be resource conflicts, the minimal 𝐼 𝐼 = 1.

this is the same as classical resource scheduling; otherwise,βˆ€Ξ”π‘ β‰  0,βˆ€π‘– ∈ Z, π‘˜2 (𝑖 βˆ’ 𝑝2 βˆ’ Δ𝑝) +𝑏2 β‰  π‘˜1 (𝑖 βˆ’ 𝑝1) +𝑏1. Thismeans if we delay the instruction by Δ𝑝𝐼𝐼 ticks, the conflictwill be resolved. We call it false conflict. In case π‘˜1 β‰  π‘˜2,after Δ𝑝𝐼𝐼 ticks it will fall in the same time slot. There is stilla conflict iff βˆƒπ‘–1 ∈ Z, π‘˜2 (𝑖1 βˆ’π‘2 βˆ’Ξ”π‘) +𝑏2 = π‘˜1 (𝑖1 βˆ’π‘1) +𝑏1;that is, 𝑖1 = 𝑖0 + Ξ”π‘π‘˜2

π‘˜2βˆ’π‘˜1, which means (π‘˜2 βˆ’ π‘˜1) |Ξ”π‘π‘˜2. The

conflict appears periodically as Δ𝑝 increases. However, inthe worst case where (π‘˜2 βˆ’ π‘˜1) |π‘˜2, there is always a conflictand can be seen as classical resource scheduling. We call it,together with the case where π‘˜1 = π‘˜2 = 0, true conflict.We insert an instruction or an entire schedule into the

modulo scheduling table in the following way: if there isno conflict, we insert the instructions; if there is only falseconflict, we try next tick. As an exception, false conflictsbetween two single qubit gates are also seen as no conflict;and if there is true conflict, we start a β€œdeath countdown”before trying next tick: if next (𝐼 𝐼 βˆ’1) retries do not succeed,give up, as we do in classical retry scheme.

4.2.3 Inversion pair correction. The commutativity be-tween antidiagonal 𝑅+

𝑍gates and 𝐢𝑍 gates comes at a price

of a Z gate. In modulo scheduling stage we allowed themto commute freely, ignoring the generated Z gates. Now wehave to fill them back to ensure equivalence. By the term

β€œinversion”, we mean that our scheduling alters the executionorder of instructions compared with original ordering:Definition 7. If the original 𝑐th instruction ismodulo-scheduledat 𝑑 = 𝑝𝐼𝐼 + π‘ž in new loop (where the π‘˜th original loop is is-sued), we define the absolute order of the instruction to be𝑇 = (π‘˜ βˆ’ 𝑝)𝐿 + 𝑐 = π‘˜πΏ + (𝑐 βˆ’ 𝑝𝐿).Example 9. Suppose 𝐿 = 4 and 𝐡 in Figure 11 is the secondinstruction in the original loop (𝑐 = 1). 𝐡 is placed in themodulo scheduling table at 𝑝 = 1 and π‘ž = 0.

1. The first 𝐡 instruction is issued in the prologue (incom-plete loop kernel) where the second (π‘˜ = 1) iterationis issued. Thus the absolute order of the instruction is𝑇 = 1.

2. The second 𝐡 instruction is issued in the loop kernelwhere the third (π‘˜ = 2) iteration is issued. Thus theabsolute order is 𝑇 = 5.

3. The third 𝐡 instruction is issued in the epilogue (againincomplete loop kernel) where the fourth (π‘˜ = 3) itera-tion is issued (or, should be issued). The absolute order is𝑇 = 9.

We see that the absolute order is exactly the time when theinstruction is executed in the original loop.

Our idea is to check all inversion pairs in the moduloschedule. There are two kind of order-inversions:

Page 11: Software Pipelining for Quantum Loop Programs

Software Pipelining for Quantum Loop Programs A Preprint, December 23, 2020

Figure 13. An example of inverted pairs of instructionsacross loop iterations.

Definition 8. 1. In-loop inversion: For two instructionsin the π‘˜-iteration in new scheduling (i.e. the iterationwhere π‘˜th iteration of original loop is issued), if the firstprecedes the second while its absolute order succeeds theabsolute order of the second instruction:π‘˜πΏ+(𝑐1βˆ’π‘1𝐿) >π‘˜πΏ + (𝑐2 βˆ’ 𝑝2𝐿), there is an in-loop inversion.

2. Loop-carried inversion: For two instructions inπ‘˜-iterationand (π‘˜ + π‘Ÿ )-iteration (π‘Ÿ β©Ύ 1), if π‘˜πΏ + (𝑐1 βˆ’ 𝑝1𝐿) >

(π‘˜ + π‘Ÿ )𝐿 + (𝑐2 βˆ’ 𝑝2𝐿), there is an across-loop inversion.

Since the π‘˜πΏ term can be cancelled, inversion pairs inmodulo schedule also reveals periodicity. Figure 13 shows anexample with periodic π‘Ÿ = 1 inversions, and π‘Ÿ = 2 inversions.Since the term (π‘˜ + π‘Ÿ )𝐿 + (𝑐2 βˆ’ 𝑝2𝐿) increases as π‘Ÿ increases,there exists π‘Ÿ0 s.t. βˆ€π‘Ÿ > π‘Ÿ0 there is no across-loop inversion.We can increase π‘Ÿ and find pairs of inversion from iterationπ‘˜ and (π‘˜ + π‘Ÿ ), until there is no inversion pair. When findingall inversion pairs, we can check the pairs to see if one is𝐢𝑍and the other is antidiagonal on one of 𝐢𝑍 ’s operand. If so,we add a 𝑍 gate at the tick where 𝐢𝑍 is placed.

4.2.4 Code generation for kernel, prologue and epi-logue. We generate prologue and epilogue by removingnon-existing instructions from the loop kernel.

Example 10. Consider in Figure 11 (remember 𝑇 = [0, 2]),the iteration where π‘˜th original iteration is issued (or shouldbe issued) by enumerating π‘˜ from βˆ’βˆž to∞:

1. For π‘˜ < 0, {π‘˜, π‘˜ βˆ’ 1, π‘˜ βˆ’ 2} ∩ 𝑇 = Ξ¦, no instruction isput.

2. For π‘˜ = 0, {π‘˜, π‘˜ βˆ’ 1, π‘˜ βˆ’ 2} βˆ©π‘‡ = {π‘˜}, only 𝐴 is put.3. For π‘˜ = 1, {π‘˜, π‘˜ βˆ’ 1, π‘˜ βˆ’ 2} βˆ©π‘‡ = {π‘˜, π‘˜ βˆ’ 1}, 𝐴, 𝐡,𝐢 are

put.4. For π‘˜ = 2, {π‘˜, π‘˜ βˆ’ 1, π‘˜ βˆ’ 2} βˆ©π‘‡ = {π‘˜, π‘˜ βˆ’ 1, π‘˜ βˆ’ 2}. This

is the complete loop kernel.5. For π‘˜ = 3, {π‘˜, π‘˜ βˆ’ 1, π‘˜ βˆ’ 2}βˆ©π‘‡ = {π‘˜ βˆ’ 1, π‘˜ βˆ’ 2}, 𝐡,𝐢, 𝐷

are put.6. For π‘˜ = 4, {π‘˜, π‘˜ βˆ’ 1, π‘˜ βˆ’ 2} βˆ©π‘‡ = {π‘˜ βˆ’ 2}, 𝐷 is put.

7. For π‘˜ > 4, {π‘˜, π‘˜ βˆ’ 1, π‘˜ βˆ’ 2} ∩ 𝑇 = Ξ¦, no instruction isput.

For prologue and epilogue, we have to remove instructionsfrom iterations that do not exist; for extra 𝑍 gates from theinversion of a 𝐢𝑍 and an antidiagonal, removing either gatewill make the 𝑍 gate disappear. After removing non-existinginstructions, we perform compaction and ASAP schedule onthe two parts.For loop kernel, we need to merge the single qubit gates

on the same qubit in the same time slot (from the resourceconflict exception) by their absolute order.

4.3 Modulo scheduling againIn the first round of modulo scheduling, inversion of 𝐢𝑍and antidiagonal gates may introduce 𝑍 gates overlapping𝐢𝑍s, resulting an illegal schedule. To generate an executableschedule, we performmodulo scheduling again, but this timewe no longer allow β€œcommutativity” between antidiagonalsand𝐢𝑍s, and thus the inversion-fix step can be skipped. Thescheduled loop by this second round of modulo schedulingis directly executable on the device.[An analysis on the complexity of our algorithm presented

in this section is given in Appendix K.]

5 EvaluationWe have implemented our method and carried out exper-iments on several quantum programs. Some of them areintrinsically parallel, while others are not. Baselines for ourevaluation come from the following sources:

β€’ Kernel-ASAP performs compaction and ASAP sched-uling on the loop kernel. We expect our work to out-perform this naive approach.

β€’ Unroll unrolls the loop and performs compaction aswell as ASAP scheduling on the unrolled circuit. Thesoftware-pipelined version should generate a programwith similar depth but much smaller code size.

β€’ Cirq uses the optimization passes in [22] to unroll theloop. This gives another perspective of loop unrollingbesides our implementation.

The experiment results are in Table 2. We hereby analyzesome of the important examples:

5.1 Grover SearchGrover search is a test case with long dependency chain andlittle space for optimization. Yet our approach can reducethe overall depth by merging adjacent gates in iterationand across iterations. We use the 𝐢𝐢𝑁𝑂𝑇 case from [6] andSudoku solver from [4]. Since Grover search is a hard-to-optimize case, we inspected the optimized code and got thefollowing findings:

Although examples do not revealmuch optimization chance,there is a pitfall for ASAP optimizers that may cause a di-agonal 𝑇 † gate to be scheduled at the first tick alone. This

Page 12: Software Pipelining for Quantum Loop Programs

A Preprint, December 23, 2020 Guo, et al.

Test case Input Loop Output Loop Known range resultsASAP 𝐢 𝐢-ASAP Pre K Post #Iter K-ASAP Unroll Cirq QSP#Iter QSP

Cluster 4 2 5 4 1 4 200 800 203 203 96 104Array 1 5 2 10 8 4 5 100 500 500 500 48 205Array 2 3 2 5 4 1 4 100 300 201 201 46 54Array 3 11 2 17 12 12 17 100 1100 605 606 48 605Grover 1 13 2 26 26 24 871 99 1287 1287 1288 15 1257Grover 2 71 2 141 141 135 40881 1000 71000 70001 71001 207 68967QAOA-Hard 1 21 2 41 41 40 2021 1001 21021 20021 20021 449 20022QAOA-Hard 2 21 2 41 41 40 2061 1001 21021 20021 20021 448 20022QAOA-Hard 3 16 2 27 41 18 1121 1001 16016 11016 11016 448 9226QAOA-Hard 4 33 2 47 60 31 3882 1000 33000 14019 14019 360 15102QAOA-Par 1 15 2 26 46 20 943 201 3015 2215 2215 56 2109QAOA-Par 2 15 2 26 45 20 1009 201 3015 2215 2215 53 2114QAOA-Par 3 18 2 29 43 18 1080 201 3618 2218 2218 50 2023QAOA-Par 4 15 2 29 29 25 3668 1000 15000 14001 14001 368 12897

Table 2. Evaluation results. ASAP is the minimal depth of original loop body. 𝐢-ASAP is the minimal depth of the originalloop body unrolled by 𝐢 times. Pre, K and Post represents prologue, kernel and epilogue. For each test case a range sized #Iteris assigned, and the span of the output loop is QSP#Iter.

|a〉 β€’ H

|b〉 β€’ β€’

|c〉 β€’

(a) Original program,depth=3.

|a〉 β€’ H

|b〉 β€’ β€’

|c〉 β€’

(b) New program by acci-dental inversion of two𝐢𝑍s,depth=2.

Figure 14. The accidental inversion of 𝐢𝑍s reduced kerneldepth by 1.

is prevented in our approach by performing bidirectionalcompactions. Moreover, the depth cut mainly comes from in-version of a pair of 𝐢𝑍 s while scheduling, which indeed ourapproach does not consider. (see Figure 14). This inspires usto find more optimization chances while placing instructionswithout dependency, like a program with many 𝐢𝑍s.

5.2 QAOAThe QAOA programs in [8] (in Figure 15), as well as theQAOA example in [22] are used in our experiment, but witha 𝑝 (i.e. the number of iterations) large enough. Since thedecomposition of QAOA into gates affects how it can be op-timized on our architecture, we consider two different ways:QAOA-Par where QAOA is decomposed to expose morecommutativity (see the details in Appendix J), and QAOA-Hard, where QAOA is decomposed into a harder form, witha long dependency chain formed by cross-qubit operationsthat is unable to be detected by gate-level optimizers.

(a) (b) (c)

Figure 15. QAOA-MaxCut examples in [8].

The evaluation results in Table 2 show that in all cases,our approach can reduce the loop kernel size compared withKernel-ASAP, and can sometimes outperform unrollingresults. This advantage is more evident in the QAOA-Parcases than in the QAOA-Hard cases, since QAOA-Par revealsmore commutativity chances than QAOA-Hard. Anotherfinding is that QAOA-Hard generates larger code thanQAOA-Par, and thus requires more iterations for software-pipeliningto take effect.

[More discussions on examples are in Appendix M.]

6 ConclusionWe proposed a compilation flow for optimizing quantumprograms with control flow of for-loops. In particular, datadependencies and resource dependencies are redefined toexposes more chances for optimization algorithms. Our ap-proach is tested against several important quantum algo-rithms, revealing code-size advantages over the existing ap-proaches while keeping depth advantage close to loop rolling.Yet there is still gap for optimization of more complex quan-tum programs, on different architectures, and with lowercomplexity, which could be filled in future works.

Page 13: Software Pipelining for Quantum Loop Programs

Software Pipelining for Quantum Loop Programs A Preprint, December 23, 2020

References[1] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. 2006.

Compilers: Principles, Techniques, and Tools (2nd Edition). Addison-Wesley Longman Publishing Co., Inc., USA.

[2] Alexander Aiken and Alexandru Nicolau. 1988. Optimal Loop Paral-lelization. Technical Report. 308–317 pages. https://doi.org/10.1145/53990.54021

[3] Vicki H. Allan, Reese B. Jones, Randall M. Lee, and Stephen J. Allan.1995. Software Pipelining. ACM Comput. Surv. 27, 3 (1995), 367–432.https://doi.org/10.1145/212094.212131

[4] Abraham Asfaw, Luciano Bello, Yael Ben-Haim, Sergey Bravyi,Nicholas Bronn, Lauren Capelluto, Almudena Carrera Vazquez, JackCeroni, Richard Chen, Albert Frisch, Jay Gambetta, Shelly Garion,Leron Gil, Salvador De La Puente Gonzalez, Francis Harkins, TakashiImamichi, David McKay, Antonio Mezzacapo, Zlatko Minev, RamisMovassagh, Giacomo Nannicni, Paul Nation, Anna Phan, Marco Pis-toia, Arthur Rattew, Joachim Schaefer, Javad Shabani, John Smolin,Kristan Temme, Madeleine Tod, Stephen Wood, and James Woot-ton. 2020. Learn Quantum Computation Using Qiskit. http://community.qiskit.org/textbook

[5] Adi Botea, Akihiro Kishimoto, and Radu Marinescu. 2018. On theComplexity of Quantum Circuit Compilation. In Proceedings of theEleventh International Symposium on Combinatorial Search, SOCS 2018,Stockholm, Sweden - 14-15 July 2018, Vadim Bulitko and Sabine Storandt(Eds.). AAAI Press, 138–142. https://aaai.org/ocs/index.php/SOCS/SOCS18/paper/view/17959

[6] Patrick J. Coles, Stephan J. Eidenbenz, Scott Pakin, Adetokunbo Ade-doyin, John Ambrosiano, Petr M. Anisimov, William Casper, GopinathChennupati, Carleton Coffrin, Hristo Djidjev, David Gunter, SatishKarra, Nathan Lemons, Shizeng Lin, Andrey Y. Lokhov, Alexander Ma-lyzhenkov, David Dennis Lee Mascarenas, Susan M. Mniszewski, BaluNadiga, Dan O’Malley, Diane Oyen, Lakshman Prasad, Randy Roberts,Philip Romero, Nandakishore Santhi, Nikolai Sinitsyn, Pieter Swart,Marc Vuffray, Jim Wendelberger, Boram Yoon, Richard J. Zamora, andWei Zhu. 2018. Quantum Algorithm Implementations for Beginners.CoRR abs/1804.03719 (2018). arXiv:1804.03719 http://arxiv.org/abs/1804.03719

[7] Leonardo de Moura and Nikolaj BjΓΈrner. 2008. Z3: An Efficient SMTSolver. In Tools and Algorithms for the Construction and Analysis ofSystems, C. R. Ramakrishnan and Jakob Rehof (Eds.). Springer BerlinHeidelberg, Berlin, Heidelberg, 337–340.

[8] Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. 2014. A QuantumApproximate Optimization Algorithm. arXiv:quant-ph/1411.4028

[9] Lov K. Grover. 1996. A Fast Quantum Mechanical Algorithm forDatabase Search. In Proceedings of the Twenty-Eighth Annual ACMSymposium on Theory of Computing (Philadelphia, Pennsylvania, USA)(STOC ’96). Association for ComputingMachinery, New York, NY, USA,212–219. https://doi.org/10.1145/237814.237866

[10] Gian Giacomo Guerreschi and Jongsoo Park. 2018. Two-step approachto scheduling quantum circuits. Quantum Science and Technology 3, 4(Jul 2018), 045003. https://doi.org/10.1088/2058-9565/aacf0b

[11] Ali JavadiAbhari, Shruti Patil, Daniel Kudrow, Jeff Heckey, AlexeyLvov, Frederic T. Chong, and Margaret Martonosi. 2015. ScaffCC:Scalable compilation and analysis of quantum programs. ParallelComput. 45 (2015), 2–17. https://doi.org/10.1016/j.parco.2014.12.001

[12] Monica S. Lam. 1988. Software Pipelining: An Effective SchedulingTechnique for VLIW Machines. In Proceedings of the ACM SIGPLAN’88Conference on Programming Language Design and Implementation(PLDI), Atlanta, Georgia, USA, June 22-24, 1988, Richard L. Wexelblat(Ed.). ACM, 318–328. https://doi.org/10.1145/53990.54022

[13] Prakash Murali, David C. McKay, Margaret Martonosi, and Ali Javadi-Abhari. 2020. Software Mitigation of Crosstalk on Noisy Intermediate-Scale Quantum Computers. In ASPLOS ’20: Architectural Support forProgramming Languages and Operating Systems, Lausanne, Switzerland,

March 16-20, 2020 [ASPLOS 2020 was canceled because of COVID-19],James R. Larus, Luis Ceze, and Karin Strauss (Eds.). ACM, 1001–1016.https://doi.org/10.1145/3373376.3378477

[14] Michael A. Nielsen and Isaac L. Chuang. 2011. Quantum Computa-tion and Quantum Information: 10th Anniversary Edition (10th ed.).Cambridge University Press, USA.

[15] Bill Pottenger. [n.d.]. Loop Rotation. http://polaris.cs.uiuc.edu/projects/rec/node8.html

[16] Robert Raussendorf, Daniel E Browne, and Hans J Briegel. 2003.Measurement-based quantum computation on cluster states. Physicalreview A 68, 2 (2003), 022312.

[17] Vivek V. Shende, Stephen S. Bullock, and Igor L. Markov. 2006. Synthe-sis of quantum-logic circuits. IEEE Trans. on CAD of Integrated Circuitsand Systems 25, 6 (2006), 1000–1010. https://doi.org/10.1109/TCAD.2005.855930

[18] Yunong Shi, Nelson Leung, Pranav Gokhale, Zane Rossi, David I. Schus-ter, Henry Hoffmann, and Frederic T. Chong. 2019. Optimized Compi-lation of Aggregated Instructions for Realistic Quantum Computers.In Proceedings of the Twenty-Fourth International Conference on Archi-tectural Support for Programming Languages and Operating Systems,ASPLOS 2019, Providence, RI, USA, April 13-17, 2019, Iris Bahar, MauriceHerlihy, Emmett Witchel, and Alvin R. Lebeck (Eds.). ACM, 1031–1044.https://doi.org/10.1145/3297858.3304018

[19] Seyon Sivarajah, Silas Dilkes, Alexander Cowtan, Will Simmons, AlecEdgington, and Ross Duncan. 2020. t |π‘˜π‘’π‘‘ ⟩: a retargetable compilerfor NISQ devices. Quantum Science and Technology 6, 1 (nov 2020),014003. https://doi.org/10.1088/2058-9565/ab8e92

[20] Robert S. Smith, Michael J. Curtis, and William J. Zeng. 2016. APractical Quantum Instruction Set Architecture. CoRR abs/1608.03355(2016). arXiv:1608.03355 http://arxiv.org/abs/1608.03355

[21] Bochen Tan and Jason Cong. 2020. Optimal Layout Synthesis forQuantum Computing. arXiv:cs.AR/2007.15671

[22] Quantum AI team and collaborators. 2020. Cirq. https://doi.org/10.5281/zenodo.4062499

[23] Qiskit Development Team. [n.d.]. Qiskit Terra basic sched-ulers. https://qiskit.org/documentation/stubs/qiskit.scheduler.methods.basic.html#module-qiskit.scheduler.methods.basic

[24] Mingsheng Ying. 2009. Commutativity between CNOT and one-qubitgates (Unpublished notes). (2009).

[25] Mingsheng Ying. 2016. Foundations of Quantum Programming. Mor-gan Kaufmann, Boston. https://doi.org/10.1016/B978-0-12-802306-8.00002-1

Page 14: Software Pipelining for Quantum Loop Programs

A Preprint, December 23, 2020 Guo, et al.

A Basic quantum gatesThe following are the frequently-used one-qubit gates repre-sented in 2 Γ— 2 unitary matrices:

Pauli gates : 𝑋 =

[0 11 0

],

π‘Œ =

[0 βˆ’π‘–π‘– 0

],

𝑍 =

[1 00 βˆ’1

],

Hadamard gate : 𝐻 =1√2

[1 11 βˆ’1

],

Phase andπœ‹

8gates : 𝑆 =

[1 00 𝑖

],

𝑇 =

[1 0

0 π‘’π‘–πœ‹4

],

Pauli Rotations : 𝑅𝑋 (𝛼) =[

π‘π‘œπ‘  𝛼2 βˆ’π‘–π‘ π‘–π‘› 𝛼2

βˆ’π‘–π‘ π‘–π‘› 𝛼2 π‘π‘œπ‘  𝛼2

],

π‘…π‘Œ (𝛼) =[π‘π‘œπ‘  𝛼2 βˆ’π‘ π‘–π‘› 𝛼

2𝑠𝑖𝑛 𝛼

2 π‘π‘œπ‘  𝛼2

],

𝑅𝑍 (𝛼) =[π‘’βˆ’

𝑖𝛼2 0

0 𝑒𝑖𝛼2

].

They combined with one of the (two-qubit) controlled gates

CNOT =

1

11

1

,𝐢𝑍 =

1

11

βˆ’1

.are universal for quantum computing; that is, they can beused to construct arbitrary quantum gate of any size.

Beside the above, we will use the following auxiliary gatesto simplify the presentation of our approach:

π‘…βˆ’π‘‹ (𝛼) =

[cos \

2 βˆ’π‘– sin \2

𝑖 sin \2 βˆ’ cos \

2

],

𝑅+𝑍 (𝛼) =

[0 𝑒𝑖𝛼/2

π‘’βˆ’π‘–π›Ό/2 0

]= 𝑋𝑅𝑍 (𝛼),

𝐻 (𝛼) = 1√2

[1 1𝑒𝑖𝛼 βˆ’π‘’π‘–π›Ό

]= 𝑅𝑍 (𝛼)𝐻,

π»βˆ’ (𝛼) = 1√2

[1 βˆ’1𝑒𝑖𝛼 𝑒𝑖𝛼

]= 𝑅𝑍 (𝛼)𝐻𝑍 .

Note that parameter 𝛼 in the above gates is a real number.The 𝑅+

𝑍(𝛼) gate can represent all single qubit gates that are

anti-diagonal, i.e. only anti-diagonal entries are not 0. Theother three notations are used in Appendix I.

For real-world quantum computers, a quantum devicemay only support a discrete or contiguous set of single qubitgates while keeping the device universal. For example, IBM’sdevices allow the following three kinds of single qubit gatesto be executed directly[4]:

π‘ˆ1 (_) =[1 00 𝑒𝑖_

],

π‘ˆ2 (πœ™, _) =1√2

[1 βˆ’π‘’π‘–_π‘’π‘–πœ™ 𝑒𝑖_+π‘–πœ™

],

π‘ˆ3 (\, πœ™, _) =[

π‘π‘œπ‘  ( \2 ) βˆ’π‘’π‘–_𝑠𝑖𝑛( \2 )π‘’π‘–πœ™π‘ π‘–π‘›( \2 ) 𝑒𝑖_+π‘–πœ™π‘π‘œπ‘  ( \2 )

]Note that π‘ˆ2 (πœ™, _) = π‘ˆ3 ( πœ‹2 , πœ™, _) and π‘ˆ1 (_) = π‘ˆ3 (0, 0, _).Also note that gate π‘ˆ3 itself is universal for single-qubitgates, and the main reasons for supportingπ‘ˆ1 andπ‘ˆ2 is tomitigate error, which is beyond our consideration.

B More Examples for quantum loopprograms

We hereby presents more quantum algorithms that can bewritten into quantum loop programs and can thus be poten-tially optimized by our approach.

B.1 One-way quantum computingPreparation circuit for simulating one-way quantum com-putation on quantum circuit is another example that allowseach iteration to be performed on different qubits.

Example 11. One-way quantum computing 𝑄𝐢C[16] is aquantum computing scheme that is quite different from thecommonly used quantum-circuit based schemes. Instead ofstarting from |0⟩,𝑄𝐢C initializes all qubits (on a 2-dimensionalqubit grid) in a highly-entangled state, called cluster state.After the preparation step, 𝑄𝐢C performs single-qubit mea-surements on all qubits and extract the computation resultfrom these measurement outcomes.To simulate one-way quantum computing with quantum

circuit, we first need to prepare the cluster state from |0⟩. Thiscan be done by first performing Hadamard gates on all qubits,then performing 𝐢𝑍 gate on each pair of adjacent qubits onthe qubit grid.

The preparation circuit can be written in a nested loop man-ner. If we assume the grid has a fixed width (3 in our case), wecan unroll the innermost loop to get the flattened loop:𝐻 [π‘ž [0]]𝐻 [π‘ž [1]]𝐻 [π‘ž [2]]𝐢𝑍 [π‘ž [0], π‘ž[1]]𝐢𝑍 [π‘ž [1], π‘ž[2]]for i=1 to (L-1) do𝐻 [π‘ž [3𝑖]]𝐻 [π‘ž [3𝑖 + 1]]𝐻 [π‘ž [3𝑖 + 2]]

Page 15: Software Pipelining for Quantum Loop Programs

Software Pipelining for Quantum Loop Programs A Preprint, December 23, 2020

𝐢𝑍 [π‘ž [3𝑖], π‘ž[3𝑖 + 1]]𝐢𝑍 [π‘ž [3𝑖 + 1], π‘ž[3𝑖 + 2]]𝐢𝑍 [π‘ž [3𝑖], π‘ž1 [3𝑖 βˆ’ 3]]𝐢𝑍 [π‘ž [3𝑖 + 1], π‘ž2 [3𝑖 βˆ’ 2]]𝐢𝑍 [π‘ž [3𝑖 + 2], π‘ž3 [3𝑖 βˆ’ 1]]

end forFigure 16 shows the gates and qubits involved in each iter-

ation where 𝐿 = 5. The optimization of this program will bediscussed in Appendix M.

B.2 Quantum Approximate OptimizationAlgorithm

Example 12. QuantumApproximate Optimization Algorithm(QAOA)[8] can be used to solve MaxSat problems, for example,MaxCut problems on 3-regular graphs, say 𝐺 = βŸ¨π‘‰ , 𝐸⟩. QAOAperforms quantum computation and classical computation al-ternatively. On the quantum part, it requires us to create thestate:

|𝛾, π›½βŸ© =π‘βˆπ‘–=1

π‘ˆ (𝐡, 𝛽𝑖 )π‘ˆ (𝐢,𝛾𝑖 ) |+⟩ (5)

where:

π‘ˆ (𝐢, 𝛽𝑖 ) =∏

(π‘Ž,𝑏) ∈𝐸

1

π‘’βˆ’π‘–πœ”π‘Žπ‘π›Ύπ‘–

π‘’βˆ’π‘–πœ”π‘Žπ‘π›Ύπ‘–

1

(6)

π‘ˆ (𝐡,𝛾𝑖 ) =π‘›βˆ’1βˆπ‘—=0

𝑅𝑋 (𝛽𝑖 , 𝑗). (7)

The sets of parameters {𝛽𝑖 } and {𝛾𝑖 } are computed in theclassical computation between every two quantum epochs. Thisrequires the optimizer to support compilation of the circuitabove without knowing all parameters in advance.π‘ˆ (𝐡,𝛾𝑖 ) are products of Pauli𝑋 rotations on all qubits. Since

in our caseπ‘ˆ (𝐢, 𝛽𝑖 ) can be decomposed in the following way:1

π‘’βˆ’π‘–πœ”π‘Žπ‘π›Ύπ‘–

π‘’βˆ’π‘–πœ”π‘Žπ‘π›Ύπ‘–

1

=|a〉 β€’ β€’

|b〉 βŠ• RZ(βˆ’Ο‰abΞ³i) βŠ•,

(8)we can define parametric gate arrays π‘ˆπΆ [𝑖] = 𝑅𝑋 (𝛽𝑖 , 𝑗) andπ‘ˆπ΅ [𝑖] = 𝑅𝑍 (βˆ’πœ”π‘Žπ‘π›Ύπ‘– ), and the QAOA quantum part can bewritten as a parametric quantum loop program:

for i=0 to (N-1) do𝐻 [π‘ž [𝑖]]

end forfor i=1 to p dofor (π‘Ž, 𝑏) ∈ 𝐸 do𝐢𝑁𝑂𝑇 [π‘ž [π‘Ž], π‘ž[𝑏]]π‘ˆπ΅ [𝑖] [π‘ž [𝑏]]𝐢𝑁𝑂𝑇 [π‘ž [π‘Ž], π‘ž[𝑏]]

end forfor j=0 to (N-1) do

π‘ˆπΆ [ 𝑗] [π‘ž [ 𝑗]]end for

end for

The two nested loops can be fully unrolled by hand, andthe outcome loop satisfies our requirements for optimization.

C Output languageIf the input range of the loop program is unknown, we mayhave to add guard statements into the orginal program, forexample, when we want to check if the range is large enoughfor us to use the software-pipelined version. Those featuressuch as guard statements, unfortunately, are not supportedin our definition of input language. So we have to define thefollowing language for the optimization result:

program :=header statementβˆ—header :=[(qdef | udef)βˆ—]

qdef :=π‘žπ‘’π‘π‘–π‘‘ ident[N];udef :=𝑑𝑒 𝑓 π‘”π‘Žπ‘‘π‘’ ident[N] = gate;

gate :=[(C2Γ—2)βˆ—] | 𝑅𝑍 | 𝑅+𝑍 | π‘ˆπ‘›π‘˜π‘›π‘œπ‘€π‘›

gateref :=ident[expr]qubit :=ident[expr]

op :=𝑆𝑄 (gateref) qubit;| 𝐢𝑍 qubit, qubit;

statement :=op| 𝑓 π‘œπ‘Ÿ ident 𝑖𝑛 expr π‘‘π‘œ expr{statementβˆ—}| π‘π‘Žπ‘Ÿπ‘Žπ‘™π‘™π‘’π‘™{statementβˆ—}| π‘”π‘’π‘Žπ‘Ÿπ‘‘{(compare => {statementβˆ—})βˆ—

π‘œπ‘‘β„Žπ‘’π‘Ÿπ‘€π‘–π‘ π‘’ => {statementβˆ—}}

expr :=ident | 𝑒π‘₯π‘π‘Ÿ + 𝑒π‘₯π‘π‘Ÿ | 𝑒π‘₯π‘π‘Ÿ βˆ’ 𝑒π‘₯π‘π‘Ÿ

| 𝑒π‘₯π‘π‘Ÿ βˆ— 𝑒π‘₯π‘π‘Ÿ | 𝑒π‘₯π‘π‘Ÿ/𝑒π‘₯π‘π‘Ÿ | 𝑒π‘₯π‘π‘Ÿ%𝑒π‘₯π‘π‘Ÿ | Zcompare :=expr ordering exprordering := == | ! = | > | < | >= | <=

The main differences between the input language and theoutput language are:

1. The π‘π‘Žπ‘Ÿπ‘Žπ‘™π‘™π‘’π‘™ notation is added to explicitly point outwhich instructions are scheduled together.

2. The π‘”π‘’π‘Žπ‘Ÿπ‘‘ statement is added to check whether theinput range is suitable for the software-pipelined ver-sion if the range is unknown at compilation time, andto separate cases with different (π‘šπ‘šπ‘œπ‘‘ 𝐢). The π‘”π‘’π‘Žπ‘Ÿπ‘‘statement executes the first statement block with asatisfied guard condition.

Page 16: Software Pipelining for Quantum Loop Programs

A Preprint, December 23, 2020 Guo, et al.

Figure 16. Converting cluster state preparation circuit into loop program. Fig (a) is a 3 Γ— 5 two-dimensional qubit network.The preparation is done by performing a layer of Hadamard gates (Fig (b)) and a layer of𝐢𝑍 gates (Fig (c)). One way to performthose 𝐢𝑍 gates without qubit conflict is to split them into four non-overlapping groups and execute each group separately, asin Fig (d) to Fig (g). The procedure can also be written into loop program, as in Fig (h) to Fig (l).

3. The 𝑒π‘₯π‘π‘Ÿ allows for more general indexing into qubitarrays and gate arrays. Note that the division and mod-ulo operators are Euclidean, i.e. it always holds that{

𝑠𝑖𝑔𝑛(π‘Ž%𝑏) = 𝑠𝑖𝑔𝑛(𝑏)π‘Ž%𝑏 + (π‘Ž/𝑏) βˆ— 𝑏 = π‘Ž

(9)

D Solving Diophantine equationsIn this appendix we focus on solving the Diophantine equa-tion:

(π‘˜2 βˆ’π‘˜1)𝑖 +π‘˜2 (Δ𝑖) = 𝑏1 βˆ’π‘2, 𝑖 ∈ 𝑇, 𝑖 +Δ𝑖 ∈ 𝑇,Δ𝑖 β©Ύ 1. (10)

We rewrite it into:

π‘Žπ‘₯ + 𝑏𝑦 = 𝑐, π‘₯ ∈ 𝑇, π‘₯ + 𝑦 ∈ 𝑇,𝑦 β©Ύ 1. (11)

We recall the solutions 𝑆 for linear Diophantine equationswith two variables:

Lemma 1. Solutions for linear Diophantine equationswith two variables

π‘Žπ‘₯ + 𝑏𝑦 = 𝑐, π‘₯ ∈ Z, 𝑦 ∈ Z. (12)

1. If π‘Ž = 0 and 𝑏 = 0, 𝑆 = Ξ¦ if 𝑐 β‰  0 and 𝑆 = Z Γ— Z if𝑐 = 0.

2. If π‘Ž = 0 but 𝑏 β‰  0 (similar for 𝑏 = 0 but π‘Ž β‰  0),a. If 𝑏 |𝑐 , 𝑆 = Z Γ—

{𝑐𝑏

}.

b. Otherwise, 𝑆 = Ξ¦.3. If π‘Ž β‰  0 and 𝑏 β‰  0:a. If 𝑐 = 𝑑 Β· 𝑔𝑐𝑑 (π‘Ž, 𝑏),

β€’ Special solution (π‘₯0, 𝑦0) where

π‘Žπ‘₯0 + 𝑏𝑦0 = 𝑔𝑐𝑑 (π‘Ž, 𝑏) (13)

can be solved using extended Euclidean algorithm.β€’ General solution

(π‘˜ 𝑏𝑔𝑐𝑑 (π‘Ž,𝑏) ,βˆ’π‘˜

π‘Žπ‘”π‘π‘‘ (π‘Ž,𝑏)

)for equa-

tionπ‘Žπ‘₯ + 𝑏𝑦 = 0 (14)

is known.β€’ The total solution space is

𝑆 =

{(π‘₯0 + π‘˜

𝑏

𝑔𝑐𝑑 (π‘Ž, 𝑏) , 𝑦0 βˆ’ π‘˜π‘Ž

𝑔𝑐𝑑 (π‘Ž, 𝑏)

)|π‘˜ ∈ Z

}. (15)

We rewrite the equation into:

𝑆 = {(π‘₯0 + π‘˜Ξ”π‘₯,𝑦0 + π‘˜Ξ”π‘¦) |π‘˜ ∈ Z} . (16)

b. Otherwise, 𝑆 = Ξ¦.

For our original question with constraints, we only con-sider the cases where π‘Ž β‰  0 and 𝑏 β‰  0.

When 𝑇 = Z, the constraints no longer exist and we onlyneed to find the minimal positive integer in set {𝑦0 + π‘˜Ξ”π‘¦},which can be solved by an Euclidean division. With loss ofgenerality, we can just let π‘˜ = 0 by choosing 𝑦0 to be exactlythe smallest positive integer in {𝑦0 + π‘˜Ξ”π‘¦} and adjust π‘₯0accordingly, without affecting the solution set 𝑆 .When 𝑇 = [π‘Ž, 𝑏], the corresponding π‘₯0 may not lie in

𝑇 . In this case we may want to find a secondary-minimalpositive integer. Without loss of generality we assume Δ𝑦 >

Page 17: Software Pipelining for Quantum Loop Programs

Software Pipelining for Quantum Loop Programs A Preprint, December 23, 2020

0 (otherwise choose Ξ”π‘₯ = βˆ’Ξ”π‘₯ and Δ𝑦 = βˆ’Ξ”π‘¦). Then theproblem becomes: find minimal π‘˜ ∈ 𝑁+ s.t.{

π‘₯0 + π‘˜Ξ”π‘₯ >= π‘Ž

π‘₯0 + π‘˜Ξ”π‘₯ <= 𝑏, (17)

which is equivalent to{π‘˜Ξ”π‘₯ >= π‘Ž βˆ’ π‘₯0

π‘˜Ξ”π‘₯ <= 𝑏 βˆ’ π‘₯0(18)

which can thus be solved by a routine calculation: a minimalπ‘˜ exists, or does not exist at all.

E Proofs of Theorems 1 (CZ conjugationrules)

In this section we give out proof for our new rules of instruc-tion data dependency. We will show that our definition ofdependency is β€œsufficient and necessary” for quantum gatesets using 𝐢𝑍 .

We first restate Theorem 1 as follows:

πΆπ‘π‘ˆπ΄π‘ˆπ΅πΆπ‘ = 𝑉𝐴𝑉𝐡,

if and only ifπ‘ˆπ΄ andπ‘ˆπ΅ are diagonal or anti-diagonal. Thatis,π‘ˆπ‘– = 𝑅𝑍 (\ ) orπ‘ˆπ‘– = 𝑅+

𝑍(\ ) for 𝑖 ∈ {𝐴, 𝐡}.

Proof. We here introduce our methodology of proving quan-tum gate algebra equations: first we give a necessary condi-tion by trying several input states, and show that the condi-tion is also sufficient for the equation to hold.The first lemma is a criteria for deciding whether a state

is separable or entangled:

Lemma 2. Two-qubit state |πœ“ ⟩ = (π‘Ž, 𝑏, 𝑐, 𝑑)𝑇 is separable ifand only if:

π‘Žπ‘‘ βˆ’ 𝑏𝑐 = 0. (19)

Proof. (Necessity) If |πœ“ ⟩ is separable, there exists two singlequbit states |πœ“1⟩ and |πœ“2⟩, s.t.

|πœ“ ⟩ = |πœ“1⟩ βŠ— |πœ“2⟩ (20)

Suppose|πœ“1⟩ = (𝛼1, 𝛽1)𝑇 , (21)|πœ“2⟩ = (𝛼2, 𝛽2)𝑇 , (22)

We have

|πœ“ ⟩ = (𝛼1𝛼2, 𝛼1𝛽2, 𝛽1𝛼2, 𝛽1𝛽2)𝑇 , (23)

and it can be easily verified that π‘Žπ‘‘ βˆ’ 𝑏𝑐 = 0.(Sufficiency) If

|πœ“ ⟩ = (π‘Ž, 𝑏, 𝑐, 𝑑)𝑇 (24)

with π‘Žπ‘‘ βˆ’ 𝑏𝑐 = 0,1. If 𝑏 = 0, this indicates π‘Ž = 0 or 𝑑 = 0. If π‘Ž = 0, let{

|πœ“1⟩ = |1⟩|πœ“2⟩ = 𝑐 |0⟩ + 𝑑 |1⟩

; (25)

otherwise 𝑑 = 0, and let{|πœ“1⟩ = π‘Ž |0⟩ + 𝑐 |1⟩|πœ“2⟩ = |0⟩

. (26)

2. If 𝑐 = 0, this indicates π‘Ž = 0 or 𝑑 = 0. If π‘Ž = 0, let{|πœ“1⟩ = 𝑏 |0⟩ + 𝑑 |1⟩|πœ“2⟩ = |1⟩

; (27)

otherwise 𝑑 = 0, and let{|πœ“1⟩ = |0⟩|πœ“2⟩ = π‘Ž |0⟩ + 𝑏 |1⟩

. (28)

3. Otherwise π‘Ž, 𝑏, 𝑐, 𝑑 β‰  0. Let|πœ“1⟩ =

(π‘βˆš

βˆ₯𝑏 βˆ₯2+βˆ₯𝑑 βˆ₯2, π‘‘βˆš

βˆ₯𝑏 βˆ₯2+βˆ₯𝑑 βˆ₯2

)𝑇|πœ“2⟩ =

(π‘Ž

π‘βˆš

βˆ₯ π‘Žπ‘βˆ₯2+βˆ₯1βˆ₯2

, 1√βˆ₯ π‘Žπ‘βˆ₯2+βˆ₯1βˆ₯2

)𝑇 . (29)

It can be verified that βˆ₯ |πœ“1⟩ βˆ₯ = βˆ₯ |πœ“2⟩ βˆ₯ = 1, and that

|πœ“1⟩ βŠ— |πœ“2⟩ =(π‘Ž, 𝑏, 𝑐, 𝑑)π‘‡βˆšοΈƒ

(βˆ₯𝑏βˆ₯2 + βˆ₯𝑑 βˆ₯2) (βˆ₯ π‘Žπ‘βˆ₯2 + βˆ₯1βˆ₯2)

, (30)

which is exactly (π‘Ž, 𝑏, 𝑐, 𝑑)𝑇 since tensor product pre-serves norm.

β–‘

Lemma 3. (Necessity) For the equation to hold,π‘ˆπ΄ andπ‘ˆπ΅

have to be diagonal or anti-diagonal. This meansπ‘ˆπ‘– transforms|0⟩ to |0⟩ or |1⟩, up to a global phase.

Proof. Suppose |πœ™βŸ© = π‘ˆπ΄ |0⟩ = (π‘Ž, 𝑏)𝑇 , thus

πΆπ‘π‘ˆπ΄π‘ˆπ΅πΆπ‘ ( |0⟩ βŠ— (π‘ˆ †𝐡|πœ™βŸ©)) (31)

=𝐢𝑍 |πœ™βŸ© βŠ— |πœ™βŸ© (32)=(π‘Ž2, π‘Žπ‘, π‘Žπ‘,βˆ’π‘2)𝑇 , (33)

which should be a separable state since this is also𝑉𝐴𝑉𝐡 ( |0βŸ©βŠ—(π‘ˆ †

𝐡|πœ™βŸ©)), which is separable. Thus π‘Ž2𝑏2 = 0, so π‘Ž = 0 (𝑅+

𝑍

case) or 𝑏 = 0 (𝑅𝑍 case). This is the same forπ‘ˆπ΅ . β–‘

Lemma 4. (Sufficiency) 𝑅𝑍 and 𝑅+𝑍satisfies the conjugation

rules.

Proof. Note that 𝑅+𝑍= 𝑋𝑅𝑍 and𝐢𝑍𝑋𝐴 = 𝑋𝐴𝑍𝐡𝐢𝑍 . By simple

computation we can see the conjugation holds. β–‘

β–‘

F Proof of Theorem 3 (Convergence ofcompaction)

We show that compaction procedure will converge afterapplying the procedure three times.If we look at the factors that prevents compaction proce-

dure from reaching its fixpoint, there are two main reasons:

Page 18: Software Pipelining for Quantum Loop Programs

A Preprint, December 23, 2020 Guo, et al.

1. Single qubit merging results in new diagonal gatesor antidiagonal gates, which is not recognized whenthe first gate is placed. Compacting #1 in Figure 7shows an example where three gates merge into anantidiagonal 𝑋 gate, which can merge through the𝐢𝑍gate on next compaction.

2. Antidiagonal and 𝐢𝑍 changing order will add 𝑍 gatesto the circuit. Compacting #2 in Figure 7 shows anexample.

Fortunately, these problems will not occur at the thirdtime of compaction. This is because diagonal gates and an-tidiagonal gates forms a subgroup ofπ‘ˆ2:

Lemma 5. Let

𝐺𝑍 = {𝑅𝑍 (\ ) |\ ∈ [0, 2πœ‹)} , (34)𝐺+𝑍 =

{𝑅+𝑍 (\ ) |\ ∈ [0, 2πœ‹)

}, (35)

𝐺 = 𝐺𝑍 βˆͺ𝐺+𝑍 , (36)

thus𝐺𝑍 ,𝐺 are subgroups ofπ‘ˆ2, whileβˆ€π‘”1, 𝑔2 ∈ 𝐺+𝑍, 𝑔1𝑔2 ∈ 𝐺𝑍 .

Corollary 6. βˆ€π‘”1 ∈ π‘ˆ2\𝐺,𝑔2 ∈ 𝐺,𝑔2𝑔1 ∈ π‘ˆ2\𝐺 .

On #2 compaction, single qubit gates can only mergewhen they are on different sides of a 𝐢𝑍 gate and one isdiagonal or antidiagonal (otherwise they should have beenmerged on #1 compaction). According to corollary 6, thismerging will not add new diagonals or antidiagonals, andall new gates from compaction #2 come from moving an-tidiagonal through 𝐢𝑍 . The last compaction merges theseadditional 𝑍 gates to their left.

G Proof of Theorem 5 (Remove multipleedges)

In the QDG defined in Section 4, Theorem 5 is proposed sothat multiple edges can be removed before 𝐼 𝐼 is assigned.The proof of Theorem 5 is listed below:

Proof. Since 𝑑𝑖 𝑓1 and 𝑑𝑖 𝑓2 are integers,1 + 𝑑𝑖 𝑓2 β©½ 𝑑𝑖 𝑓1, (37)

Since 𝐼 𝐼 β©Ύ 1,βˆ’ 𝐼 𝐼 Β· 𝑑𝑖 𝑓1 β©½ βˆ’πΌ 𝐼 βˆ’ 𝐼 𝐼 Β· 𝑑𝑖 𝑓2 β©½ βˆ’1 βˆ’ 𝐼 𝐼 Β· 𝑑𝑖 𝑓2. (38)

Sinceπ‘šπ‘–π‘›1 β©½ 1 andπ‘šπ‘–π‘›2 β©½ 1,π‘šπ‘–π‘›1 β©½ π‘šπ‘–π‘›2 + 1. (39)

Adding up Equation 38 and 39 shows the result. β–‘

H Resource scheduling complexityanalysis

In Secion IV we mentioned that we can keep retrying if thereis a β€œresource conflict” and the death countdown is not timed-out (i.e. resource conflict are all caused by false conflicts),which may lead to too many retries that may dominate thecomplexity of the algorithm. This requires us to give an

upper bound of maximum number of retries to estimate thetotal complexity.

Recall how we perform resource checking when insertinginstructions into the schedule:

β€’ For every time slot, we have scheduled a bunch ofinstructions in this time slot.

β€’ When adding an instruction or a group of instructions,we check the operands of each instruction to be addedagainst instructions in the time slot where it will beadded.

β€’ If there is a resource conflict, we have to try next tick(and perhaps start a death countdown).

We first show that if there is only false conflict, the loopcan be written into an equivalent form where all π‘˜ = 1. Infact, this is achieved by the fact:

π‘˜π‘– + 𝑏 = π‘˜ (𝑖 + (𝑏/π‘˜)) + (𝑏 mod π‘˜), (40)

where

(𝑏 mod π‘˜) ∈ [0, βˆ₯π‘˜ βˆ₯) , π‘˜ (𝑏/π‘˜) + (𝑏 mod π‘˜) = 𝑏. (41)

According to this fact, the array can be split into βˆ₯π‘˜ βˆ₯ slices,and resource conflict can occur if the two qubit referencesfall into the same slice. Figure 17 is an example for π‘˜ = 3.Offsets 3𝑖 and (3𝑖 βˆ’ 1) will never conflict with each other,since they fall into different slices π‘ž0 and π‘ž2.This splitting allows us to use one integer 𝑏 β€² = (𝑏/π‘˜) to

represent an expression in the slice: in the Figure 17 case wecan use 0 for π‘ž [3𝑖] in slice π‘ž0, 0 for π‘ž [3𝑖 + 1] in slice π‘ž1, and(βˆ’1) for π‘ž [3𝑖 βˆ’ 1] in slice π‘ž2.

Corollary 7. For the modulo scheduling, if a resource is sched-uled 𝐼 𝐼 ticks later, the integer 𝑏 β€² representing the resource de-creases by 1.

This allows to use a stricter model for upper-bound esti-mation:

β€’ For the entire schedule, we use a universal set to storeall integer representations {𝑏 β€²} of linear expressions.

β€’ When adding an instruction or a group of instructions,we check the operands to be added against the univer-sal set, rather than the time-slot set. This means twoinstructions with the same operand but scheduled atdifferent ticks will also be seen as conflicted.

β€’ If the integer representation of operand is already inthe set, there is a resource conflict. To find the worstcase, we suppose the next (𝐼 𝐼 βˆ’ 1) tries will definitelyfail. The next retry that will possibly success is the𝐼 𝐼 -th retry where the instruction is going to be placedin the same time slot again.

β€’ The array indexπ‘ž and slice index𝑏 mod π‘˜ are ignored.For example, operands π‘ž [3𝑖] and π‘ž [3𝑖 + 1] will be seenas conflicted since they have the same representation0, even though the two expressions will never be equalto each other.

Page 19: Software Pipelining for Quantum Loop Programs

Software Pipelining for Quantum Loop Programs A Preprint, December 23, 2020

Figure 17. Example for splitting the qubit array when π‘˜ = 3. Resource conflict can only occur inside each slice, and resourcesin each slice can be represented by one integer.

This strict set of rules reduces our upper bound problemto a clearer problem:

Theorem 8. For finite set 𝐴 βŠ‚ 𝑍 standing for resources (inte-gers representing each resource) already scheduled, and 𝐡 βŠ‚ 𝑍

being resources to be scheduled. Define

𝐡 βˆ’ (π‘˜ ∈ 𝑁 ) = {π‘₯ βˆ’ π‘˜ |π‘₯ ∈ 𝐡} (42)

to be the resource set of 𝐡 after π‘˜πΌπΌ retries. Let π‘˜π‘šπ‘–π‘› be theminimal π‘˜ , s.t.

𝐴 ∩ (𝐡 βˆ’ π‘˜) = Ξ¦, (43)then π‘˜π‘šπ‘–π‘›πΌ 𝐼 retries is required at most in our algorithm.

Figure 18. Resource 3(π‘ž [π‘₯ + 3]) and 5(π‘ž [π‘₯ + 5]) are nowoccupied, and resource 4 to 6 required to scheduled. Nowπ‘˜π‘šπ‘–π‘› = 4.

A naive estimation of π‘˜π‘šπ‘–π‘› would be

π‘˜π‘šπ‘–π‘› β©½ π‘šπ‘Žπ‘₯ (𝐡) βˆ’π‘šπ‘–π‘›(𝐴), (44)

which is not acceptable. Fortunately, we can give out a moreprecise estimation not in the values in 𝐴 or 𝐡, but only inthe size of sets.

Theorem 9. Let βˆ₯𝑆 βˆ₯ be size of set 𝑆 ,π‘˜π‘šπ‘–π‘› β©½ βˆ₯𝐴βˆ₯βˆ₯𝐡βˆ₯. (45)

Proof. Consider the set

𝐷 = {𝑏 βˆ’ π‘Ž |π‘Ž ∈ 𝐴,𝑏 ∈ 𝐡, (𝑏 βˆ’ π‘Ž) β©Ύ 0} . (46)

thus π‘˜ βˆ‰ 𝐷 if and only if 𝐴 ∩ (𝐡 βˆ’ π‘˜) = Ξ¦. Thus π‘˜π‘šπ‘–π‘› is thefirst natural number not appearing in 𝐷 . However, βˆ₯𝐷 βˆ₯ β©½βˆ₯𝐴βˆ₯βˆ₯𝐡βˆ₯ according to its definition, so π‘˜ β©½ βˆ₯𝐴βˆ₯βˆ₯𝐡βˆ₯. β–‘

Corollary 10. Insertingπ‘š instructions at one time (e.g. merg-ing to scheduled blocks) into a schedule with 𝑛 instructionsrequires at most 𝑂 (π‘šπ‘›πΌπΌ ) retries. If each retry takes 𝑂 (π‘šπ‘›)queries to find a conflict, the total complexity is atmost𝑂 (π‘š2𝑛2𝐼 𝐼 ).

According to the theorem, we can get some several impor-tant results on the complexity:

Corollary 11. 1. Inserting one instruction into the mod-ulo scheduling table sized 𝑏 requires 𝑂 (𝑏𝐼𝐼 ) retries and𝑂 (𝑏2𝐼 𝐼 ) time. Thus inserting all 𝑏 instructions require𝑂 (𝑏3𝐼 𝐼 ) time.

2. The span of themodulo scheduling table above is boundedby 𝑂 (𝑏2𝐼 𝐼 ).

3. Suppose the loop kernel sized 𝑛 is split into π‘Ž β©Ύ 2 strongconnected components sized 𝑏, the total complexity forscheduling all SCCs is π‘Žπ‘‚ (𝑏3𝐼 𝐼 ) = 𝑂 (π‘Žπ‘3𝐼 𝐼 ) = 𝑂 (𝑛4),and the total time required to merge all SCCs together is

π‘Žβˆ’1βˆ‘οΈπ‘–=1

𝑂 (𝑏2 (𝑖𝑏)2𝐼 𝐼 ) = 𝑂 (π‘Ž3𝑏4𝐼 𝐼 ) = 𝑂 (𝑛5). (47)

4. The span of the total schedule is

π‘Žπ‘‚ (𝑏2𝐼 𝐼 ) +π‘Žβˆ’1βˆ‘οΈπ‘–=1

𝑏 (𝑖𝑏)𝐼 𝐼 = 𝑂 (π‘Žπ‘2𝐼 𝐼 +π‘Ž2𝑏2𝐼 𝐼 ) = 𝑂 (𝑛2𝐼 𝐼 ). (48)

Thus we expect the length of prologue and epilogue to be

𝑂 (𝑛2)βˆ‘οΈπ‘–=1

𝑖 Β· 𝐼 𝐼 = 𝑂 (𝑛3). (49)

Page 20: Software Pipelining for Quantum Loop Programs

A Preprint, December 23, 2020 Guo, et al.

I CNOT conjugation rulesThese results are taken directly from [24].

Theorem 12. (𝐢𝑁𝑂𝑇 conjugation) 𝐢𝑁𝑂𝑇 conjugates singlequbit gates if and only if the conjugation satisfies one of thefollowing eight cases:

1.

|a〉 β€’ RZ(Ξ±) β€’

|b〉 βŠ• βŠ•=|a〉 RZ(Ξ±)

|b〉(50)

2.

|a〉 β€’ R+Z (Ξ±) β€’

|b〉 βŠ• βŠ•=|a〉 R+

Z (Ξ±)

|b〉 X(51)

3.

|a〉 β€’ β€’

|b〉 βŠ• RX(Ξ±) βŠ•=

|a〉

|b〉 RX(Ξ±)(52)

4.

|a〉 β€’ Rβˆ’X(Ξ±) β€’

|b〉 βŠ• βŠ•=

|a〉 Z

|b〉 Rβˆ’X(Ξ±)

(53)

5.

|a〉 βŠ• H(Ξ±) β€’

|b〉 β€’ H(Ξ²)† βŠ•=

|a〉 H(Ξ±)

|b〉 H(Ξ²)†(54)

6.

|a〉 βŠ• Hβˆ’(Ξ±) β€’

|b〉 β€’ H(Ξ²)† βŠ•=

|a〉 Hβˆ’(Ξ±)

|b〉 H(Ξ² + Ο€)†(55)

7.

|a〉 βŠ• H(Ξ±) β€’

|b〉 β€’ Hβˆ’(Ξ²)† βŠ•=

|a〉 H(Ξ±+ Ο€)

|b〉 Hβˆ’(Ξ²)†(56)

8.

|a〉 βŠ• Hβˆ’(Ξ±) β€’

|b〉 β€’ Hβˆ’(Ξ²)† βŠ•=

|a〉 Hβˆ’(Ξ±+ Ο€)

|b〉 Hβˆ’(Ξ² + Ο€)†(57)

It is easy to check that 𝐢𝑁𝑂𝑇 conjugation rules and 𝐢𝑍conjugation rules are equivalent to each other, by converting𝐢𝑁𝑂𝑇 to 𝐢𝑍 and vice versa.

J Parallel QAOA DecompositionQAOA is one of the fashionable algorithms in NISQ era. Wewill use the QAOA program for solving MaxCut problemsas our optimization test cases.However, we face the problem of lacking commutativ-

ity when optimizing 𝑄𝐴𝑂𝐴 programs: our device can’t ex-ecute π‘ˆ (𝐡, 𝛽𝑖 ) operation directly and it has to be decom-posed into basic gates according to Equation 8, and the block-commutativity optimization chances by commutativity be-tweenπ‘ˆ (𝐡, 𝛽𝑖 ) matrices are missed.There have been different ways to optimize QAOA cir-

cuits with π‘ˆ (𝐡, 𝛽𝑖 ) commutable with each other in mind.For example, [18] detects all two-qubit diagonal structuresin the circuit and aggregate them, so that commutativitydetection can be performed on aggregated blocks. Anotherlayout synthesis algorithm (scheduling considering devicelayout) QAOA-OLSQ[21] schedules QAOA circuits twice,the first time on a large granularity (named TB-OLSQ) andthe second time on a small granularity (named OLSQ). Thelarge-granularity pass allows block commutativity to be con-sidered and gates are placed in blocks. The small-granularitypass finishes the scheduling.

However, these two approaches both require the optimiza-tion algorithm to perform coarse-grain block-level schedul-ing in addition to fine-grain gate-level scheduling. We maywant to find another way to give commutativity hints to agate-scheduling algorithm without modifying the algorithmitself.Equation 8 inspires us with the fact that the shape of

decomposed form of π‘ˆ (𝐡, 𝛽𝑖 ) is a bit like 𝐢𝑁𝑂𝑇 gate: ithas a β€œcontroller” qubit and a β€œcontrolled” qubit; multipleblocks with the same β€œcontroller” qubit can be commutedand interleaved freely at gate level, and can be finished in 2ticks on average instead of 3, as in Figure 19.

|a〉 βŠ• RZ(βˆ’Ο‰abΞ³i) βŠ•

|b〉 β€’ β€’ β€’ β€’

|c〉 βŠ• RZ(βˆ’Ο‰bcΞ³i) βŠ•

Figure 19. The two blocks can be executed interleavingly.

The level of β€œblocks” according to the discovery abovecan be derived by directing and coloring all edges in theundirected graph 𝐺 = βŸ¨π‘‰ , 𝐸⟩:

β€’ First, we assign every edge with the direction in whichwe would perform the 8 decomposition (i.e. assignthe graph with an orientation). Suppose the directionpoints from the controller qubit to the controlled qubit.

β€’ Then, we colour all edges with minimal number ofcolours under the following constraints:

Page 21: Software Pipelining for Quantum Loop Programs

Software Pipelining for Quantum Loop Programs A Preprint, December 23, 2020

(a) Graph for QAOA. (b) One orientation for thegraph.

(c) One coloring satisfyingthe constraints.

(d) The equialent vertex-coloring problem.

Figure 20. Example for one possible orientation and layeringof a graph.

1. All in-degree edges of a vertex should be coloureddifferently from each other.

2. Out-degree edges of a vertex should be coloureddifferently from all in-degree edges of the vertex.

The minimal number of required colors over all possibleorientations is the minimal number of layers we can putthese gates into.Note that finding the minimal edge colouring under the

constraints can be reduced to the problem of finding minimalvertex colouring of a new graph. In the new graph, verticesrepresent original edges; vertices for out-degree edges arefully connected; vertices for in-degree edges are connectedwith those for out-degree edges. Figure 20 is an exampleof assigning directions and colours for edges in the graph,and the equivalent vertex-colouring problem to the edge-colouring one.

One direct way to compute the block placement strategyis to use an SMT solver, for example, 𝑄𝐴𝑂𝐴 βˆ’ π‘ƒπ‘Žπ‘Ÿ test casesin our evaluation are generated using Z3 Solver[7]. We leaveit as an open problem whether there is an efficient approach.

K Complexity AnalysisIn this section we give a rough estimation of complexity ofthe scheduling algorithm above. We put the main complexityresults in table 3, with some notes below to explain.

K.1 Complexity of loop compactionComplexity for compacting a piece of loop program sized𝑂 (𝑛) once is𝑂 (𝑛2), since when adding every instruction wecheck it against all instructions that are previously added.

K.2 Complexity of loop unrollingFinding merging or cancelling candidates requires 𝑂 (𝑛2)time. Suppose the loop range is unknown, we have to per-form the following steps on 𝐢 loops sizedπ‘š = 𝑂 (𝐢𝑛).

Step Time Code SizeCompaction 𝑂 (𝑛2) 𝑂 (𝑛)Unrolling 𝑂 (𝑛2 +𝐢2𝑛) 𝐢 loops sized 𝑂 (𝐢𝑛)

For each loop sized 𝑂 (π‘š)Rotation 𝑂 (π‘š3) 𝑂 (π‘š3)Try 𝐼 𝐼 𝑂 (π‘™π‘œπ‘”π‘š) -Tarjan 𝑂 (π‘š2) -Floyd 𝑂 (π‘š3) -Scheduling 𝑂 (π‘š5) Span=𝑂 (π‘š3)Add 𝑍 𝑂 (π‘š4) -Codegen 𝑂 (π‘š6) 𝑂 (π‘š3)Total 𝑂 (π‘š6)π‘™π‘œπ‘”π‘š 𝑂 (π‘š3)

In TotalOverall 𝑂 (𝐢6𝑛6 (π‘™π‘œπ‘”πΆπ‘›)) 𝑂 (𝐢4𝑛3)

Table 3. Complexity of our software pipelining approach.

K.3 Complexity of loop rotationA loop sized 𝑂 (𝑛) can be rotated for at most 𝑂 (𝑛2) times,since loop rotation will not introduce new β€œqubit” into theloop, and the 𝑂 (𝑛) qubits can be placed in an partial order:π‘žπ‘Ž β‰Ί π‘žπ‘ if a single qubit gate onπ‘žπ‘Ž will be onπ‘žπ‘ after rotation.

This will create a prologue sized 𝑂 (𝑛2), an epilogue sized𝑂 (𝑛3) and a new loop sized 𝑂 (𝑛). Each rotation requires𝑂 (𝑛2) time (to find a rotatable gate) so the total complexityis 𝑂 (𝑛4).

K.4 Complexity of modulo schedulingWe need 𝑂 (π‘™π‘œπ‘”π‘š) retries to binary-search the minimal 𝐼 𝐼 .Complexity of Tarjan algorithm on a dense graph is 𝑂 (π‘š2),and complexity of Floyd algorithm is 𝑂 (π‘š3).We leave the proof of complexity from retrying due to

resource conflict in Appendix H.

K.5 Inversion pair detectionThe complexity for detecting in-loop inversion pair if𝑂 (π‘š2).The complexity for detecting across-loop inversion dependson the span of the total schedule. Note that according toDefinition 8:

π‘Ÿ ≀ (𝑝2 βˆ’ 𝑝1) +𝑐1 βˆ’ 𝑐1

𝐿, (58)

where 𝑝1, 𝑝2 = 𝑂 (π‘š2). Thus

π‘Ÿ = 𝑂 (π‘š2). (59)

The total complexity of checking 𝑂 (π‘š2) pairs of instruc-tions across π‘Ÿ iterations is 𝑂 (π‘š4).

K.6 Code generationThe complexity for code generation is just the length of pro-logue and epilogue, 𝑂 (π‘š3). The compaction is of quadraticcomplexity so the total complexity is 𝑂 (π‘š6). However, forcases where the loop range is known, using a hash set to store

Page 22: Software Pipelining for Quantum Loop Programs

A Preprint, December 23, 2020 Guo, et al.

the last operation on each qubit can reduce the complexityto 𝑂 (π‘š3).

Theorem 13. The total time complexity for our algorithm is

𝑂 (𝐢6𝑛6 (π‘™π‘œπ‘”πΆπ‘›)), (60)and the size of the generated code is

𝑂 (𝐢4𝑛3). (61)

L Adapting to existing architecturesNote that we are building our approach of optimization basedon a specific quantum circuit model as specified in Section2.2. Recall some of the features of the model that we use:

β€’ Classical computation and loop guards can be carriedout instantly.

β€’ The hardware can execute arbitrary single qubit oper-ations and 𝐢𝑍 gates between arbitrary qubit pairs. Allinstructions can finish in one cycle.

β€’ Instructions on totally different qubits can be carriedout at the same time.

L.1 Powerful classical controlA quantum processor is usually split into classical part andquantum part, and all the classical logics (i.e. branch state-ments) are run on the classical part.To implement fast classical guard for 𝑓 π‘œπ‘Ÿ -loops, we can

use several classical architecture mechanisms, such as su-perscalar, classical branch prediction and speculative exe-cution. As long as classical part commits instructions fasterthan quantum part executing instructions, we may keep thequantum part fully-loaded without introducing unnecessarybubbles.

If we want classical operations that affect the control flowof quantum part (e.g. classical branch statements), one waywould be converting them to their quantum version. Onepractical example would be measurements with feedback:if we want to use the measurement outcome to control thefollowing operations, we can just use a qubit array to replaceclassical memory, use 𝐢𝑁𝑂𝑇 gate to replace measurement,and use controlled gate to replace classical control. The clas-sical trick of register renaming can be adopted when convert-ing measurement to quantum gates: different iterations canβ€œmeasure to” different qubits to prevent unnecessary namedependency.Also on real quantum processors the full-parallelism is

not likely to be achieved, for example, there may be a limitof instruction issuing width on the device. For this case, wecan just limit the maximal issuing width in resource conflictchecking.

L.2 CNOT-based instruction setOne major difference between our assumptions and the real-world architectures is that most existing models and archi-tectures adopt a 𝐢𝑁𝑂𝑇 -based instruction set, instead of a

𝐢𝑍 -based one. We provide two possible approaches for ex-tending our method to the 𝐢𝑁𝑂𝑇 -architecture case.One approach is to convert the original circuit to 𝐢𝑍 -

version directly, using the equation 𝑋 [𝑏]𝐢𝑍 [π‘Ž, 𝑏]𝑋 [𝑏] =

𝐢𝑁𝑂𝑇 [𝑏]. After optimization, an additional step is requiredto convert each𝐢𝑍 gate into𝐢𝑁𝑂𝑇 gates by addingHadamardgates. Note that the way of adding Hadamard gates can affectthe depth of the kernel.

Example 13. Adding Hadamard gates on the same qubit oftwo adjacent 𝐢𝑍 gates saves gate depth by 1, compared to theversion adding Hadamard gates on different qubits of the two𝐢𝑍 gates.

|a〉 β€’

|b〉 β€’

|c〉 β€’ β€’

=

|a〉 H βŠ•βŠ• H

|b〉 β€’

|c〉 β€’

=

|a〉 H βŠ• H β€’

|b〉 β€’

|c〉 H βŠ• H

(62)

However, deciding all directions of 𝐢𝑁𝑂𝑇 gates can bea hard problem. We can formulate the problem as an ILPproblem. A rough description is as follows:

β€’ Each 𝐢𝑍 is given a boolean variable, indicating the di-rection of 𝐢𝑁𝑂𝑇 (and where to add Hadamard gates).

β€’ If one 𝐢𝑍 is adjacent to a single qubit gate, the 𝐻 canbe absorbed.

β€’ If one 𝐢𝑍 is adjacent to another 𝐢𝑍 and if they addHadamard on the same qubit, the two Hadamard canbe cancelled and no depth is added.

β€’ Otherwise the depth is added by 1 from Hadamard.If there is an aliasing, the depth need to be added bymore than 1 so that 𝐻 gates on qubits with aliasingwill be placed at two different ticks.

β€’ The objective is to minimize the depth on all qubits.We leave the best conversion from𝐢𝑍 program into𝐢𝑁𝑂𝑇

program with minimal depth as a remaining problem.Another way to port our approach is to modify our QDG

definition to the 𝐢𝑁𝑂𝑇 -based instruction set. But in fact,the most commonly used 𝐢𝑁𝑂𝑇 commutation rules thatare based on intuition are only part of the complete 𝐢𝑁𝑂𝑇

conjugation rules:

Lemma 6. (𝐢𝑁𝑂𝑇 conjugation rules)[24] There are 8 rulesin total for 𝐢𝑁𝑂𝑇 conjugation rules, similar to 𝐢𝑍 rules. SeeAppendix I.

If we want to exploit full power of these rules, we haveto consider all these rules while building QDG, instead ofconsidering only the intuitive rules (usually the first 4 rules).

Page 23: Software Pipelining for Quantum Loop Programs

Software Pipelining for Quantum Loop Programs A Preprint, December 23, 2020

(a) Before. (b) After. The numbers correspond to the inter-cept 𝑏 in expression π‘ž [6𝑖 + 𝑏].

Figure 21. Loop kernel for cluster state preparation (𝑁 = 3).Shaded dots are qubits for Hadamard operands and closeddots are 𝐢𝑍 operands.

But this time, the rewriting trick in Theorem 2 no longerworks for 𝐢𝑁𝑂𝑇 rules. How to use these rules directly forQDG construction remains an open problem.

L.3 Working with device topologyOne problem about a controlled-Z architecture is that itcan be hard to perform long-distance 𝐢𝑍 operation. For the𝐢𝑁𝑂𝑇 case, a long distance 𝐢𝑁𝑂𝑇 gate with length π‘˜ canbe implemented using (4π‘˜ βˆ’ 4) according to [17]. However,this is not true for 𝐢𝑍 gates, as β€œamplitude” can’t propagatethrough 𝐢𝑍 gates.

A direct conversion approach can be taken by converting𝐢𝑍 to𝐢𝑁𝑂𝑇 and back forth. Since every𝐢𝑁𝑂𝑇 is on criticalpath and no adjacent controlled bits can be found on criticalpath, this would require (8π‘˜ βˆ’ 8 + 1) = (8π‘˜ βˆ’ 7) gates oncritical path. The exception is π‘˜ = 2, since the lastπ»π‘Žπ‘‘π‘Žπ‘šπ‘Žπ‘Ÿπ‘‘

on the critical path should be removed and total depth is 8.

M Optimization of Cluster StatePreparation, etc.

This chapter introduces the Cluster and Array test casesused in our evaluation.

Cluster is an example of cluster-state preparation pro-gram, which is a for-all loop: increasing count of iterationsdoes not add to the overall depth of the program, which onthe 2-dimensional grid is a constant 5 (4 for 𝐢𝑍s in fourdirections and 1 for Hadamard). Despite that, we can stillperform loop optimization on this program to get a loop withkernel sized 1.For 𝐢 = 2, the loop kernels before and after rotation fol-

lowed by software-pipelining is given in Figure 21. Our ap-proach split 𝐢𝑍 gates that conflicts with each other intodifferent iterations so that they can be executed together,and the kernel size is reduced to 1, the best result for anyloop-optimization approach except fully-unrolling.

Array series are several artificially-crafted loop programson qubit arrays.Array 1 performs three𝐢𝑍 gates as in Figure12, while two Hadamard gates are added between 𝐢𝑍s toprevent cancellation. Array 2 performs non-cancelling 𝐢𝑍gates so that they can be parallelized maximally. Array 3constructs a huge Toffoli gate using Toffoli gates and ancillas:

in each iteration, a Toffoli is performed on a source qubit, anancilla and the next ancilla.The instruction operands of these examples contain the

iteration variable and are thus simpler to optimize comparedwith those on fixed set of qubits.