Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases
20 years supporting research,
development and innovation
in Galicia
Objective
• Assess how much work is needed to use the Intel Xeon Phi.
• Make minimal modifications, using only pragmas.
• Two applications:
– CalcuNetW: tests the MKL libraries.
– GammaMaps: tests pragmas.
• Two modes:
– Native: compiled to execute only on the Xeon Phi.
– Offload: uses the host plus the Xeon Phi.
CalcuNetW: Calculate Measurements in Complex Networks
• Complex networks consist of sets of nodes (vertices) joined together in pairs by links (edges).
• For each network, the application calculates:
– Subgraph Centrality (SC): characterizes the participation of each node in all subgraphs of the network.
– SC odd: accounts only for paths of odd length.
– SC even: accounts only for paths of even length.
– Bipartivity: the proportion of even closed walks to the total number of closed walks in the network.
– Network Communicability for Connected Nodes, C(p,q): measures how well two nodes in the network communicate.
– Network Communicability, C(G): the mean of all the C(p,q).
Mouriño J.C., Estrada E., Gomez A., “CalcuNetw: Calculate Measurements in Complex Networks”, Informe Técnico CESGA-2005-003.
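The measures listed above can be illustrated with a small sketch. The slides give no formulas, so this assumes the usual definitions from Estrada's work: SC is the diagonal of exp(A) for the adjacency matrix A, the even and odd parts are the diagonals of cosh(A) and sinh(A), bipartivity is Tr cosh(A) / Tr exp(A), and C(p,q) is the (p,q) entry of exp(A). The function names, the truncated power series, and the choice to average C(p,q) over distinct pairs p < q are assumptions for illustration, not CalcuNetW's actual code. The repeated matrix product is where a real implementation would call DGEMM.

```python
def matmul(a, b):
    """Dense matrix product, the operation DGEMM performs in BLAS."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def network_measures(adj, terms=30):
    """Return (sc, sc_odd, sc_even, bipartivity, comm, mean_comm).

    adj is a symmetric 0/1 adjacency matrix with at least two nodes.
    Assumes the truncated series exp(A) ~ sum of A^k / k! for k < terms.
    """
    n = len(adj)
    # First series term: A^0 / 0! (the identity matrix).
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    even = [[0.0] * n for _ in range(n)]  # accumulates cosh(A)
    odd = [[0.0] * n for _ in range(n)]   # accumulates sinh(A)
    for k in range(terms):
        target = even if k % 2 == 0 else odd
        for i in range(n):
            for j in range(n):
                target[i][j] += term[i][j]
        # Next term: A^(k+1) / (k+1)!
        term = [[x / (k + 1) for x in row] for row in matmul(term, adj)]
    # Communicability matrix exp(A) = cosh(A) + sinh(A).
    comm = [[even[i][j] + odd[i][j] for j in range(n)] for i in range(n)]
    sc_even = [even[i][i] for i in range(n)]
    sc_odd = [odd[i][i] for i in range(n)]
    sc = [comm[i][i] for i in range(n)]
    bipartivity = sum(sc_even) / sum(sc)
    # Assumption: C(G) averages C(p,q) over distinct node pairs p < q.
    pairs = [comm[p][q] for p in range(n) for q in range(p + 1, n)]
    mean_comm = sum(pairs) / len(pairs)
    return sc, sc_odd, sc_even, bipartivity, comm, mean_comm


# Hypothetical inputs: a 3-node path (bipartite) and a triangle (not bipartite).
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path_result = network_measures(path)
triangle_result = network_measures(triangle)
```

A bipartite network has no closed walks of odd length, so its bipartivity is 1; the triangle contains odd cycles and scores lower.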
CalcuNetW
• Makes intensive use of DGEMM from BLAS.
• Calculates the parameters for the input.
• Plus n random matrices.
GammaMaps: A figure-of-merit in Radiation Therapy
[Figure: two 3D dose grids with axes X, Y, Z, labelled "Dose Reference" and "Dose Test"; annotations: "Dose in voxel i,j,k", r, d(r).]
GammaMaps: A figure-of-merit in Radiation Therapy
Read Doses
Initialise and normalise
Compute Gamma
Store Gamma
• Application in FORTRAN 90
• Parallelised using OpenMP
• Geometric algorithm*
• 512 x 512 x 128 = 33,554,432 voxels
• Auto-vectorization
• Pragmas for offload
* T. Ju, T. Simpson, J. O. Deasy, and D. A. Low, “Geometric interpretation of the γ dose distribution comparison technique: Interpolation-free calculation,” Medical Physics, vol. 35, no. 3, p. 879, 2008.
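To make the "Compute Gamma" step concrete, here is a minimal sketch of the gamma figure of merit in its textbook brute-force form: for each reference point, take the minimum over test points of the combined, normalised distance and dose difference. This is not the interpolation-free geometric algorithm of Ju et al. that GammaMaps uses, and it works on a 1D profile rather than a 512 x 512 x 128 grid. The function name, parameters and sample values are hypothetical.

```python
import math


def gamma_map_1d(ref, test, spacing, dose_tol, dist_tol):
    """Brute-force gamma for two 1D dose profiles on the same regular grid.

    ref, test : lists of doses sampled at positions i * spacing
    dose_tol  : dose-difference criterion (same units as the doses)
    dist_tol  : distance-to-agreement criterion (same units as spacing)
    """
    gamma = []
    for i, d_ref in enumerate(ref):
        best = math.inf
        for j, d_test in enumerate(test):
            dist = (j - i) * spacing
            # Combined distance in normalised (space, dose) coordinates.
            value = math.sqrt((dist / dist_tol) ** 2
                              + ((d_test - d_ref) / dose_tol) ** 2)
            best = min(best, value)
        gamma.append(best)
    return gamma


# Hypothetical profiles: the test dose is the reference shifted by one voxel.
reference = [0.0, 1.0, 2.0, 3.0]
shifted = [1.0, 2.0, 3.0, 4.0]
gamma_same = gamma_map_1d(reference, reference, 1.0, 1.0, 2.0)
gamma_shift = gamma_map_1d(reference, shifted, 1.0, 1.0, 2.0)
# A point is usually counted as passing when its gamma is at most 1.
passed = [g <= 1.0 for g in gamma_shift]
```

The outer loop over reference points has independent iterations, which is the kind of loop that OpenMP parallelisation and offload pragmas target in the Fortran code.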
Results of Experiments
Platform

Host
  CPU Model         Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
  Nr. of cores      16
  Memory            32788 MB
  Operating System  Linux 2.6.32-279.el6.x86_64
  Compiler Version  2013U2

Intel Xeon Phi
  Model             Beta0 Engineering Sample
  Nr. of cores      61 at 1.09GHz
  Memory            7936 MB
  Operating System  MPSS Gold U1
  Compiler Version  2013U2
  GDDR Technology   GDDR5
  GDDR Frequency    2750000 kHz
• Remote access to Intel systems
• Feb. 2013
Intel Xeon Phi Affinity Policies
[Figure: four diagrams, each showing cores C1 to C4 with hardware threads HT1 to HT4, and the thread numbers placed on them.]
COMPACT - FINE: 0 1 2 3 4 5 6 7
SCATTER - FINE: 0 4 1 5 2 6 3 7
BALANCED - FINE: 0 1 2 3 4 5 6 7
BALANCED - CORE: {0,1} {2,3} {4,5} {6,7}
• Type
– Compact
– Scatter
– Balanced
• Granularity
– Fine or Thread
– Core
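The three placement types can be sketched as a toy model, assuming the usual behaviour of the Intel OpenMP affinity types: compact fills all hardware threads of one core before moving to the next, scatter deals threads round-robin across cores, and balanced spreads threads evenly across cores while keeping consecutive thread numbers on the same core. The function name and the per-core list representation are hypothetical, and granularity (fine versus core) is not modelled.

```python
def place_threads(policy, n_threads, n_cores, ht_per_core):
    """Return one list of thread numbers per core for the given policy."""
    if n_threads > n_cores * ht_per_core:
        raise ValueError("more threads than hardware threads")
    cores = [[] for _ in range(n_cores)]
    if policy == "compact":
        # Fill every hardware thread of a core before using the next core.
        for t in range(n_threads):
            cores[t // ht_per_core].append(t)
    elif policy == "scatter":
        # Deal threads round-robin, one core at a time.
        for t in range(n_threads):
            cores[t % n_cores].append(t)
    elif policy == "balanced":
        # Even spread, with consecutive thread numbers sharing a core.
        base, extra = divmod(n_threads, n_cores)
        t = 0
        for c in range(n_cores):
            count = base + (1 if c < extra else 0)
            cores[c] = list(range(t, t + count))
            t += count
    else:
        raise ValueError("unknown policy: " + policy)
    return cores


# The layout used in the diagrams: 8 threads, 4 cores, 4 hardware threads each.
compact = place_threads("compact", 8, 4, 4)
scatter = place_threads("scatter", 8, 4, 4)
balanced = place_threads("balanced", 8, 4, 4)
```

With 8 threads on 4 cores, compact leaves two cores idle, while scatter and balanced both use every core and differ only in which thread numbers end up as neighbours.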
Results for CalcuNetW
Results for GammaMaps
GammaMaps
[Figure: Host. Elapsed Time (s), axis from 0 to 1400, versus Nr. of Threads, axis from 0 to 18. Series: local-compact-core, local-compact-fine, local-scatter-fine, local-scatter-core.]
GammaMaps
Xeon Phi poor I/O
Conclusions
• Using the MKL library is easy and does not require changes in the code.
• Simple pragmas in the code allow fast usage.
• I/O performance issues on the Xeon Phi.
• 1 Xeon Phi ~ 1 Xeon E5-2680.
• Improving performance requires additional work.
Acknowledgements
The authors would like to thank Intel for providing access to the Intel Xeon Phi coprocessor.
Questions
Andrés Gómez
José Carlos Mouriño
Carmen Cotelo
Aurelio Rodríguez
The TEAM