Method of Moments Modeling of Microstrip Patch Antennas with Automatic GPU Acceleration

Alex Cerjanic 1, Boris Sheikman 2, and Indira Chatterjee 1

1 University of Nevada, Reno, Department of Electrical and Biomedical Engineering, MS260, Reno, NV 89557
2 General Electric, 1631 Bently Parkway South, Minden, NV 89423

Abstract — Using interpreted languages such as MATLAB to prototype electromagnetic modeling codes often provides efficiency at the cost of performance. To sidestep this tradeoff, we demonstrate how a frequency domain method of moments routine written in MATLAB for microstrip patch antennas can be automatically accelerated on GPU hardware. The performance of the interpreted MATLAB code is compared with that of the accelerated code, and the relative complexity of the automatic translation using the toolkit is discussed. In the results described, the matrix filling routine was observed to run around 99 times faster than the interpreted MATLAB code.

Index Terms — Method of Moments, microstrip antennas, GPU acceleration.

I. INTRODUCTION

Microwave engineers are often faced with challenges in designing microstrip patch antennas. While design equations exist for many patch antennas, their derivations often require assumptions or approximations that limit the ultimate accuracy of these equations. Computational Electromagnetic (CEM) modeling can deliver much greater accuracy at the cost of computational time. For custom and research CEM codes, performance is often a significant issue since performance optimization can be challenging. Recently, the application of Graphics Processing Units (GPUs) to CEM codes has radically increased the amount of processing power available for modeling. While some approaches, such as the Finite Difference Time Domain method, are easily adapted for processing on GPUs, with impressive speed-up factors of 30-50x reported in the literature [1], other CEM methods such as the Method of Moments (MoM) are not as easily translated into the highly parallel code necessary for efficient computation on GPUs. Previous efforts to port MoM code to GPU programming environments such as NVIDIA's CUDA API have taken one of two approaches: either the standard matrix filling routines are ported directly to CUDA via rewritten custom code in C or Fortran [2], or the MoM routines are simplified via the use of approximations such as transforming multidimensional integrals into quasi-1D integrals [3].

While these approaches have successfully applied GPUs to the MoM, the highly technical nature of porting the code limits these techniques to programmers with extensive experience in low-level GPU programming environments. New tools have made scripting languages such as MATLAB able to dynamically shift code to run on GPUs, saving significant time and effort. While the translation is automatic, only code that is appropriately parallel will see a speed-up as a result of the shift to GPU computing. With the application of appropriate methods and tools to the MoM impedance matrix in MATLAB, it is possible to GPU accelerate the code with very little rewriting. Utilizing these techniques, we have adapted a CPU based MoM code written in MATLAB to take advantage of GPU acceleration and compared the modeling results for a microstrip patch antenna over the frequency range 2.5-3.5 GHz with those obtained using a CPU clustered version of the same code.

II. METHODS

A. Method of Moments

By transforming an integral equation formulation of Maxwell's equations into a system of linear equations, an approximate solution for the current distribution on a patch antenna can be found. In applying the MoM, basis and testing functions must be chosen: the basis functions to efficiently model the true current distribution, and a set of testing functions to measure and minimize the error between the modeled current distribution and the true distribution. Most approaches fall into one of two groups: those following a Galerkin method, which chooses the basis functions to be the same as the testing functions, and pulse testing routines, which choose the testing functions to be pulse or rectangular functions. In practical terms, for a 2D current distribution, a Galerkin method requires the calculation of 4D integrals, whereas an equivalent pulse testing routine requires the calculation of only 2D integrals [4].
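
As an illustration of the resulting linear system (a minimal sketch, not the authors' code; the kernels below are hypothetical placeholders rather than the microstrip Green's function reaction integrals), the MoM solve step in MATLAB reduces to filling an impedance matrix and performing a dense solve:

    % Minimal MoM solve sketch. fillZ/fillV are hypothetical placeholder kernels,
    % NOT the microstrip Green's function integrals used in the paper.
    N = 200;                                                   % number of basis functions
    fillZ = @(m, n) exp(-1j*abs(m - n)) ./ (1 + abs(m - n));   % placeholder reaction term
    fillV = @(m) double(m == 1);                               % placeholder feed excitation
    [mIdx, nIdx] = ndgrid(1:N, 1:N);
    Z = fillZ(mIdx, nIdx);                                     % fill the N x N impedance matrix
    V = fillV((1:N).');                                        % fill the excitation vector
    I = Z \ V;                                                 % solve Z*I = V for the current coefficients

Once the coefficient vector I is known, quantities such as the input impedance and S11 follow from post-processing.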

B. Quasi Monte Carlo Integration

Generally, the most computationally intensive step in the application of the MoM is the numerical integration of the impedance matrix terms. If Gaussian quadrature is used, the 4D integrals required by a Galerkin method are very burdensome, as the quadrature rules must be chained for each additional dimension. Monte Carlo integration (MCI) (1) presents a potentially more efficient approach, as the error estimate decreases with the number of samples (N) independently of the number of dimensions. In (1), the samples x_i are distributed according to (2), a uniform random distribution over the integration interval.

\int_a^b f(x)\,dx \approx \frac{b-a}{N} \sum_{i=1}^{N} f(x_i)    (1)

x_i \sim U(a, b)    (2)
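
To make (1) and (2) concrete, the short MATLAB sketch below estimates a simple 1D integral with pseudorandom samples; it illustrates the quadrature rule only, not the impedance-matrix integrand used in the paper:

    % Plain MCI estimate of an integral over [a, b], following (1) and (2).
    f = @(x) sin(x).^2;            % example integrand (not the MoM kernel)
    a = 0; b = pi; N = 1e5;
    x = a + (b - a)*rand(N, 1);    % x_i ~ U(a, b), eq. (2)
    I_mc = (b - a)/N * sum(f(x));  % eq. (1); the exact value here is pi/2

The same estimator extends directly to the 4D Galerkin integrals by drawing four-column samples and averaging the integrand over them.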


Recent work on applying MCI to the MoM has shown that careful application of MCI can eliminate the negative effect of the singularity in overlapping field and source point evaluations of the Green's functions appearing in the impedance matrix calculations. A very elegant approach uses quasi-MCI (QMCI), substituting quasi-random numbers, such as the Halton series, to generate independent source and field points that will never overlap [5].
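
A minimal sketch of this quasi-random substitution is shown below, assuming MATLAB's Statistics Toolbox haltonset is available (the paper does not describe the authors' own sample generator):

    % Quasi-MCI: draw low-discrepancy Halton points instead of rand, so that the
    % mapped source and field sample points never coincide on the singularity.
    d = 4;                             % dimensionality of a Galerkin-style integral
    N = 2^14;                          % number of quasi-random samples
    p = haltonset(d, 'Skip', 1000);    % Halton sequence, skipping the initial points
    u = net(p, N);                     % N x d points in [0, 1)^d
    % Map u onto the source/testing cells and average the integrand as in (1).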

C. GPU Acceleration

Additionally, since QMCI is fundamentally based on a mean estimator, the evaluations of the integrand are independent and embarrassingly parallel when implemented. Thus, QMCI presents an additional way to parallelize the computation of the impedance matrix. Since the MoM code we have worked with is written in MATLAB, the ideal parallelization strategy would simply execute the preexisting QMCI based impedance matrix routines on a GPU rather than a CPU. While no such tool currently exists, several tools allow us to parallelize the MoM code with only minimal modifications to the existing CPU based code, with varying efficiencies. Fig. 1 shows the steps in the workflow that must be rewritten to allow the vast majority of the impedance matrix calculation routine to execute on the GPU.

Fig. 1 MoM workflow with GPU and CPU specific workflows.
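
As a rough illustration of why the QMCI estimator maps well to a GPU (the authors used the Jacket toolbox described in Section III; the sketch below instead uses MathWorks' Parallel Computing Toolbox gpuArray as an analogous, but different, mechanism, reuses the Statistics Toolbox haltonset from the previous sketch, and evaluates a placeholder integrand):

    % Embarrassingly parallel QMCI evaluation of one impedance-matrix entry on the GPU.
    N = 1e6;
    u = gpuArray(net(haltonset(4), N));                   % quasi-random samples moved to the device
    g = @(u) exp(-1j*2*pi*sum(u, 2)) ./ (1 + sum(u, 2));  % placeholder integrand, NOT the microstrip Green's function
    Zmn = gather(mean(g(u)));                             % sample mean, returned to the host

Every sample is evaluated independently, which is exactly the access pattern GPUs execute efficiently.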

III. RESULTS AND DISCUSSION

A. Test Antenna

A circularly polarized, coaxially fed microstrip antenna was designed and fabricated to serve as a benchmark for the evaluation of the CPU and GPU versions of the code. Since current generation GPUs are not fully compliant with IEEE 754, it was felt that results from the CPU and GPU versions of the code should each be compared to the measured results to allow the potential impact of the hardware to be estimated.

A photograph of the antenna is displayed in Fig. 2a and the geometry is displayed in Fig. 2b. The antenna was designed on 20 mil thick Rogers 4350B, with a relative dielectric constant of 3.66.

Fig. 2a (left) Photograph of fabricated test antenna. Fig. 2b (right) Geometry of antenna with dimensions and location of coaxial probe feed.


B. CPU Cluster Results

The initial version of the MoM code was written in MATLAB 2011a, utilizing the Parallel Computing Toolbox R2011a (PCT) to parallelize the frequency sweeps presented in this paper. All CPU based parallelization was accomplished by using the parfor language structure provided by the PCT. Actual execution was performed on a custom built HPC cluster utilizing 6 available nodes. All nodes were equipped with dual Intel Xeon E5605 quad-core CPUs, 12 GB RAM, and SDR Infiniband interconnect. The cluster was equipped with MATLAB Distributed Computing Server R2011a. Apart from any GPU or CPU specific operations, all code was identical in all GPU and CPU versions of the MoM code. The CPU cluster version of the MoM code took approximately 1200 seconds to run on 6 nodes, with the results compared to the measured S11 in Fig. 3.
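
A schematic of this CPU-side parallelization is sketched below (not the authors' code; the matrix-fill and excitation handles are crude placeholders, and an open worker pool, e.g. via parpool, is assumed):

    % CPU cluster pattern: parallelize the frequency sweep with parfor (PCT).
    buildZ = @(f, n) exp(-1j*2*pi*f*1e-9*abs((1:n)' - (1:n))) ./ (1 + abs((1:n)' - (1:n)));  % placeholder fill
    buildV = @(n) [1; zeros(n - 1, 1)];                                                      % placeholder feed
    n = 128;                                   % basis functions
    freqs = linspace(2.5e9, 3.5e9, 41);        % 2.5-3.5 GHz sweep
    out = zeros(size(freqs));
    parfor k = 1:numel(freqs)
        Z = buildZ(freqs(k), n);               % impedance matrix fill at this frequency
        I = Z \ buildV(n);                     % solve for the current coefficients
        out(k) = abs(I(1));                    % placeholder output metric (not a true S11)
    end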

Fig. 3 S11 results for CPU clustered MoM as compared to the measured antenna S11 parameter.

Measurements were performed using an HP 8720B vector network analyzer. A difference of 3 dB was observed between the measured and modeled results. The computed bandwidth also differed from the measured bandwidth. This discrepancy may be due to the MoM model using perfect electric conductors, as well as due to possible variations in how the code was automatically translated to the GPU and its nonstandard numerical precision.


C. GPU Accelerated Results

The GPU accelerated version of the MoM code was implemented by utilizing Jacket (Ver. 1.8.2, AccelerEyes, Inc.), a proprietary GPU acceleration toolbox for MATLAB. Jacket provides a number of language extensions for MATLAB to automatically translate MATLAB code for execution on NVIDIA CUDA supported GPUs. All GPU code execution was performed on a single Dell Precision T5500 with a single NVIDIA Tesla C2050 card. The only modifications made to the GPU accelerated version of the code were to utilize the gfor language extension provided by Jacket to parallelize the matrix filling step at each frequency point in the frequency sweep.

As the Jacket documentation states that Jacket uses Just-In-Time compilation, caching, and code translation, as well as its own pre-translated operations, comparisons between the runtimes of each version of the code were limited to comparing the entire frequency sweep and the matrix filling steps. While the speed of GPU calculations can depend on the problem parameters and how efficiently the problem distributes across the GPU hardware, no attempt to manually tweak the parallelization was made, as we relied on the automatic GPU acceleration provided by Jacket. Fig. 4 shows the GPU accelerated MoM results as compared to the measured S11 parameter of the test antenna. A smaller discrepancy of less than 0.5 dB was obtained at resonance.
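
One simple way to instrument the two spans compared here (the entire sweep and the per-frequency matrix fill) is with MATLAB's tic/toc, as in the sketch below (a dummy workload, not the authors' code or the MoM kernel):

    % Time the entire sweep and the matrix filling step separately.
    nFreq = 41; n = 512;
    fillTime = zeros(nFreq, 1);
    tSweep = tic;
    for k = 1:nFreq
        tFill = tic;
        Z = rand(n) + 1j*rand(n);      % stand-in for the QMCI matrix fill
        fillTime(k) = toc(tFill);      % matrix filling time at this frequency point
        I = Z \ ones(n, 1);            % solve step, excluded from the fill timing
    end
    sweepTime = toc(tSweep);           % entire frequency sweep runtime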

D. Runtime Comparison

The runtimes for the CPU and GPU versions of the code are compared in Table I. The runtime for the GPU case was much longer, at 3056 seconds. While this was longer than the CPU clustered code, the GPU runtime was achieved with only one machine and one GPU allocated to the task. The majority of the GPU accelerated version runtime was taken up by the computation of the Green's functions for each frequency point, as this was not parallelized in the GPU version. In the CPU case, each frequency point was fully parallelized, including the computation of the Green's functions.
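
The speed-up factors in Table I follow directly from the raw runtimes; a two-line check in MATLAB using the table's own numbers gives:

    % Speed-up factors recomputed from the runtimes reported in Table I.
    fillSpeedup  = 765 / 7.6759;     % matrix filling step: ~99.7x in favor of the single GPU
    sweepSpeedup = 1182 / 3056;      % entire sweep: ~0.39x (the 6-node CPU cluster is faster end-to-end)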

IV. CONCLUSION

Through the use of an automatic GPU acceleration toolbox, MoM code was parallelized for computation on GPUs. In line with the parallel efficiency of GPUs, a speed-up factor of over 99 was seen for the matrix filling step. While the entire frequency sweep was faster on the CPU cluster as compared to the single GPU, the GPU accelerated version was competitive with far less hardware and complexity than a cluster. Strategies to accelerate the computation of the Green's functions, such as the discrete complex image method (DCIM), may be one way to further accelerate the MoM on GPUs. Examining the matrix filling step performance, the benefit of GPU acceleration for the MoM is clear. For microwave engineers who often do not have access to HPC clusters, GPU acceleration of the MoM may be one cost effective strategy for CEM simulation.

ACKNOWLEDGEMENT

This research was supported by a grant from General Electric. The authors wish to acknowledge the assistance and support of David O’Connor (General Electric), Jennilyn Vallejera, and Lara LaDage.

REFERENCES

[1] S. Adams, J. Payne, and R. Boppana, "Finite Difference Time Domain (FDTD) Simulations Using Graphics Processors," in HPCMP Users Group Conf., Los Alamitos, CA, USA, 2007, vol. 0, pp. 334-338.

[2] E. Lezar and D. B. Davidson, “GPU acceleration of method of moments matrix assembly using Rao-Wilton-Glisson basis functions,” in Electronics and Information Engineering (ICEIE), 2010 Int. Conf. On, 2010, vol. 1, pp. V1-56-V1-60.

[3] D. De Donno, A. Esposito, G. Monti, and L. Tarricone, “GPU-based acceleration of MPIE/MoM matrix calculation for the analysis of microstrip circuits,” in Proceedings of the 5th European Conference on Antennas and Propagation (EUCAP), 2011, pp. 3921-3924.

[4] R. F. Harrington, Field Computation by Moment Methods. Wiley-IEEE Press, 1993.

[5] M. Mishra and N. Gupta, “Application of quasi monte carlo integration technique in EM scattering from finite cylinders,” Progress In Electromagnetics Research Letters, vol. 9, pp. 109-118, 2009.

TABLE I
SUMMARY OF RUNTIMES AND SPEED-UP FACTORS

                         Entire Runtime                  Matrix Filling Step per Frequency Point
                         CPU Cluster     GPU Version     CPU Cluster     GPU Version
Runtime                  1182 s          3056 s          765 s           7.6759 s
Speed-Up Factor          1               0.39            1               99.66

Fig. 4 S11 results for GPU accelerated MoM as compared to the measured antenna S11 parameter.
