Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases


20 years supporting research, development and innovation in Galicia

Objective
• Check the amount of work needed to use the Intel Xeon Phi.
• Minimal modifications, using only pragmas.
• Two applications:
– CalcuNetW: tests the MKL libraries.
– GammaMaps: tests pragmas.
• Two modes:
– Native: compiled only to execute on the Xeon Phi.
– Offload: uses the host plus the Xeon Phi.

CalcuNetW: Calculate Measurements in Complex Networks
• Complex networks consist of sets of nodes (vertices) joined together in pairs by links (edges).
• For each network, the application calculates:
– Subgraph Centrality (SC): characterises the participation of each node in all subgraphs of the network.
– SC odd: counts only closed walks of odd length.
– SC even: counts only closed walks of even length.
– Bipartivity: the proportion of even closed walks to the total number of closed walks in the network.
– Network Communicability for connected nodes, C(p,q): measures how well two nodes in the network communicate.
– Network Communicability C(G): the mean of all the C(p,q).

Mouriño J.C., Estrada E., Gomez A., "CalcuNetw: Calculate Measurements in Complex Networks", Informe Técnico CESGA-2005-003.
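Written with the network's adjacency matrix A, the measures above take the usual spectral form introduced by Estrada and co-workers; we assume CalcuNetW follows these definitions. The entry (A^k)_ii counts the closed walks of length k that start and end at node i, so splitting the series into even and odd powers gives SC even and SC odd.

```latex
SC(i) = \bigl(e^{A}\bigr)_{ii} = \sum_{k=0}^{\infty} \frac{\bigl(A^{k}\bigr)_{ii}}{k!}
SC_{\mathrm{even}}(i) = (\cosh A)_{ii}, \qquad SC_{\mathrm{odd}}(i) = (\sinh A)_{ii}
\beta(G) = \frac{\sum_i SC_{\mathrm{even}}(i)}{\sum_i SC(i)} = \frac{\operatorname{tr}(\cosh A)}{\operatorname{tr}(e^{A})}
C(p,q) = \bigl(e^{A}\bigr)_{pq}
```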

CalcuNetW
• Uses DGEMM from BLAS intensively.
• Calculates the parameters for the input network.
• Also calculates them for n random matrices.
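One way to see why DGEMM dominates: if e^A is evaluated by a power series, every new term costs one matrix-matrix product. The sketch below is a minimal pure-Python version under our own assumptions, not the CalcuNetW code: a naive `matmul` stands in for DGEMM, the series is truncated after a fixed number of terms (the hypothetical `terms` parameter), and C(G) is averaged over all distinct node pairs. A production version would call an optimised BLAS such as MKL instead.

```python
import math


def matmul(a, b):
    """Dense matrix product A * B (a pure-Python stand-in for BLAS DGEMM)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def cosh_sinh(adj, terms=30):
    """Split the truncated Taylor series of e^A into cosh(A) and sinh(A).

    Even powers of A accumulate into cosh(A), odd powers into sinh(A),
    so e^A = cosh(A) + sinh(A). Each new term costs one matrix product.
    """
    n = len(adj)
    term = [[float(i == j) for j in range(n)] for i in range(n)]  # A^0 / 0!
    even = [[0.0] * n for _ in range(n)]
    odd = [[0.0] * n for _ in range(n)]
    for k in range(terms):
        acc = even if k % 2 == 0 else odd
        for i in range(n):
            for j in range(n):
                acc[i][j] += term[i][j]
        # A^(k+1) / (k+1)!  =  (A^k / k!) * A / (k+1)
        term = [[x / (k + 1) for x in row] for row in matmul(term, adj)]
    return even, odd


def network_measures(adj, terms=30):
    """Subgraph centrality, its even/odd parts, bipartivity and communicability."""
    even, odd = cosh_sinh(adj, terms)
    n = len(adj)
    exp_a = [[even[i][j] + odd[i][j] for j in range(n)] for i in range(n)]
    sc = [exp_a[i][i] for i in range(n)]
    # Assumption: C(G) is the mean of C(p, q) over all distinct pairs p < q.
    pairs = [(p, q) for p in range(n) for q in range(p + 1, n)]
    c_graph = sum(exp_a[p][q] for p, q in pairs) / len(pairs) if pairs else 0.0
    return {
        "SC": sc,
        "SC_even": [even[i][i] for i in range(n)],
        "SC_odd": [odd[i][i] for i in range(n)],
        "bipartivity": sum(even[i][i] for i in range(n)) / sum(sc),
        "C": exp_a,  # C(p, q) = (e^A)_pq
        "C_G": c_graph,
    }


if __name__ == "__main__":
    # A single edge (two nodes): e^A = [[cosh 1, sinh 1], [sinh 1, cosh 1]]
    m = network_measures([[0, 1], [1, 0]])
    print(m["SC"][0], math.cosh(1))  # the two values should agree
    print(m["bipartivity"])          # a bipartite graph has no odd closed walks
```

For a single edge, e^A has cosh 1 ≈ 1.543 on the diagonal and sinh 1 off it, and the bipartivity is 1, as expected for a bipartite graph; a triangle, which has odd cycles, gives a bipartivity below 1.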

GammaMaps: A figure-of-merit in Radiation Therapy

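[Figure: reference and test dose distributions on an X, Y, Z voxel grid; d(r) is the dose in voxel (i, j, k) at position r, for both the reference dose and the test dose.]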
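The comparison in the figure is usually summarised by the gamma index. In its standard form, for a reference point r_r and a test point r_t, with a distance-to-agreement criterion Δd and a dose-difference criterion ΔD (commonly 3 mm and 3%), it reads as follows; a reference point passes when γ ≤ 1.

```latex
\Gamma(\mathbf{r}_r, \mathbf{r}_t) = \sqrt{\frac{\lVert \mathbf{r}_t - \mathbf{r}_r \rVert^{2}}{\Delta d^{2}} + \frac{\bigl(d_t(\mathbf{r}_t) - d_r(\mathbf{r}_r)\bigr)^{2}}{\Delta D^{2}}}
\gamma(\mathbf{r}_r) = \min_{\mathbf{r}_t} \Gamma(\mathbf{r}_r, \mathbf{r}_t)
```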


Read doses → Initialise and normalise → Compute gamma → Store gamma

• Application written in FORTRAN 90
• Parallelised using OpenMP
• Geometric algorithm*
• 512 × 512 × 128 = 33,554,432 voxels
• Auto-vectorization
• Pragmas for offload

* T. Ju, T. Simpson, J. O. Deasy, and D. A. Low, “Geometric interpretation of the γ dose distribution comparison technique: Interpolation-free calculation,” Medical Physics, vol. 35, no. 3, p. 879, 2008.
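The "normalise" and "compute gamma" steps can be illustrated on a 1-D profile. This is a minimal sketch under stated assumptions: doses are globally normalised to the reference maximum, both distributions share one grid, and the search is the plain brute-force minimum over all test points, swapped in for the geometric algorithm of Ju et al. that GammaMaps uses. The function names are hypothetical, and the comment marks the loop that an OpenMP version would parallelise.

```python
import math


def normalise(dose, ref_max):
    """Global normalisation: express doses as a fraction of the reference maximum."""
    return [d / ref_max for d in dose]


def gamma_1d(ref, test, spacing, dta, dd):
    """Brute-force gamma index for a 1-D profile on a shared grid.

    ref, test : normalised doses, same length and grid
    spacing   : distance between grid points (mm)
    dta       : distance-to-agreement criterion Delta d (mm)
    dd        : dose-difference criterion Delta D (fraction of the maximum)
    """
    gammas = []
    # Each reference point is independent of the others: this outer loop is
    # the one an OpenMP parallel region would split across threads.
    for i, d_ref in enumerate(ref):
        best = math.inf
        for j, d_test in enumerate(test):
            dist = (j - i) * spacing
            g2 = (dist / dta) ** 2 + ((d_test - d_ref) / dd) ** 2
            best = min(best, g2)
        gammas.append(math.sqrt(best))
    return gammas


if __name__ == "__main__":
    raw_ref = [0.0, 1.0, 2.0, 1.0, 0.0]
    raw_test = [0.0, 0.0, 1.0, 2.0, 1.0]  # same profile shifted by one voxel
    ref_max = max(raw_ref)
    g = gamma_1d(normalise(raw_ref, ref_max), normalise(raw_test, ref_max),
                 spacing=1.0, dta=3.0, dd=0.03)
    passed = sum(x <= 1.0 for x in g)
    print(g, passed, "of", len(g), "points pass")
```

The brute-force search visits every test point for every reference point, so its cost grows with the square of the number of voxels; at 33,554,432 voxels that is impractical, which is why a faster method such as the geometric algorithm matters.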

Results of Experiments

Platform

Host
CPU model: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
Nr. of cores: 16
Memory: 32788 MB
Operating system: Linux 2.6.32-279.el6.x86_64
Compiler version: 2013U2

Intel Xeon Phi
Model: Beta0 Engineering Sample
Nr. of cores: 61 at 1.09 GHz
Memory: 7936 MB
Operating system: MPSS Gold U1
Compiler version: 2013U2
GDDR technology: GDDR5
GDDR frequency: 2750000 kHz

• Remote access to Intel systems

• Feb. 2013

Intel Xeon Phi Affinity Policies

[Figure: placement of 8 OpenMP threads (0 to 7) on 4 cores (C1 to C4), each with 4 hardware threads (HT1 to HT4).
COMPACT - FINE: C1: 0 1 2 3; C2: 4 5 6 7; C3 and C4 idle.
SCATTER - FINE: C1: 0 4; C2: 1 5; C3: 2 6; C4: 3 7.
BALANCED - FINE: C1: 0 1; C2: 2 3; C3: 4 5; C4: 6 7.
BALANCED - CORE: C1: {0,1}; C2: {2,3}; C3: {4,5}; C4: {6,7}, each pair bound to the whole core.]

• Type
– Compact
– Scatter
– Balanced
• Granularity
– Fine (or Thread)
– Core
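The placements in the figure can be reproduced with a small model of the three affinity types at fine granularity. This is our simplified reading of the documented behaviour, not the Intel runtime itself; the function `place` and its arguments are hypothetical. On the coprocessor the policy is selected through the Intel OpenMP runtime's `KMP_AFFINITY` environment variable, for example `KMP_AFFINITY=granularity=fine,balanced`. With core granularity only the core number matters, since a thread may then run on any hardware thread of its core.

```python
def place(policy, n_threads, n_cores, ht_per_core=4):
    """Return (core, hw_thread) for each OpenMP thread, both 0-based.

    Simplified model of the compact, scatter and balanced affinity types
    with fine granularity; assumes n_threads <= n_cores * ht_per_core.
    """
    if n_threads > n_cores * ht_per_core:
        raise ValueError("more threads than hardware contexts")
    if policy == "compact":
        # Fill every hardware thread of a core before moving to the next core.
        return [(t // ht_per_core, t % ht_per_core) for t in range(n_threads)]
    if policy == "scatter":
        # Round-robin over cores: consecutive threads land on different cores.
        return [(t % n_cores, t // n_cores) for t in range(n_threads)]
    if policy == "balanced":
        # Spread threads evenly over cores, keeping consecutive numbers together.
        base, extra = divmod(n_threads, n_cores)
        out = []
        for core in range(n_cores):
            count = base + (1 if core < extra else 0)
            out.extend((core, ht) for ht in range(count))
        return out
    raise ValueError("unknown policy: " + policy)


if __name__ == "__main__":
    for policy in ("compact", "scatter", "balanced"):
        placement = place(policy, 8, 4)
        # Thread numbers listed in hardware-thread order (C1 HT1, C1 HT2, ...).
        order = sorted(range(8), key=lambda t: placement[t])
        print(policy, order)
```

Sorting the scatter placement by (core, hardware thread) gives the thread sequence 0 4 1 5 2 6 3 7, while balanced puts threads 0 and 1 on C1, 2 and 3 on C2, and so on.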

Results for CalcuNetW

[Figure: CalcuNetW results.]

Results for GammaMaps

[Figure: GammaMaps on the host. Elapsed time (s), 0 to 1400, versus number of threads, 0 to 18, for the affinity settings local-compact-core, local-compact-fine, local-scatter-fine and local-scatter-core.]

[Figure: GammaMaps, annotated "Xeon Phi poor I/O".]

Conclusions
• Using the MKL library is easy and does not require changes to the code.
• Simple pragmas in the code allow the Xeon Phi to be used quickly.
• The Xeon Phi showed I/O performance issues.
• In these tests, one Xeon Phi performed roughly like one Xeon E5-2680.
• Improving performance requires additional work.

Acknowledgements

The authors would like to thank Intel for providing access to the Intel Xeon Phi coprocessor.

Questions

The Team

Andrés Gómez

José Carlos Mouriño

Carmen Cotelo

Aurelio Rodríguez
