Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

Objective

• Check the amount of work needed to use Intel Xeon Phi.

• Minimal modifications using only pragmas.

• Two applications:
  – CalcuNetW: tests the MKL libraries.
  – GammaMaps: tests pragmas.

• Two modes:
  – Native: compiled to run only on the Xeon Phi.
  – Offload: uses the host plus the Xeon Phi.

CalcuNetw: Calculate Measurements in Complex Networks

• Complex networks consist of sets of nodes or vertices joined together in pairs by links or edges.

• The application calculates, for each network:
  – Subgraph Centrality (SC): characterizes the participation of each node in all subgraphs in the network.
  – SC odd: counts only closed walks of odd length.
  – SC even: counts only closed walks of even length.
  – Bipartivity: the proportion of even closed walks to the total number of closed walks in the network.
  – Network Communicability for Connected Nodes, C(p,q): measures how well two nodes p and q of the network communicate.
  – Network Communicability, C(G): the mean of all the C(p,q).

Mouriño J.C., Estrada E., Gómez A., "CalcuNetw: Calculate Measurements in Complex Networks", Informe Técnico CESGA-2005-003.
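The measures listed above can be illustrated with a small sketch. It assumes the usual definitions based on the matrix exponential of the adjacency matrix A: SC(i) is the i-th diagonal entry of exp(A), the even and odd parts come from the even and odd terms of the series, and C(p,q) is the (p,q) entry of exp(A). The function names, the truncated Taylor series and the choice to average C(p,q) over all distinct pairs are assumptions made for illustration; the slides do not show how CalcuNetW computes these values, and a production code would rely on a library such as MKL instead.

```python
def exp_walk_parts(adj, terms=60):
    """Split exp(A) = sum_k A^k / k! into its even-k and odd-k parts."""
    n = len(adj)
    # term holds A^k / k!, starting with the identity (k = 0).
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    even = [[0.0] * n for _ in range(n)]
    odd = [[0.0] * n for _ in range(n)]
    for k in range(terms):
        target = even if k % 2 == 0 else odd
        for i in range(n):
            for j in range(n):
                target[i][j] += term[i][j]
        # Next term: A^(k+1) / (k+1)!
        term = [[sum(term[i][m] * adj[m][j] for m in range(n)) / (k + 1)
                 for j in range(n)] for i in range(n)]
    return even, odd


def network_measures(adj):
    """Compute the slide's measures for an undirected 0/1 adjacency matrix."""
    n = len(adj)
    even, odd = exp_walk_parts(adj)
    sc_even = [even[i][i] for i in range(n)]   # closed walks of even length
    sc_odd = [odd[i][i] for i in range(n)]     # closed walks of odd length
    sc = [e + o for e, o in zip(sc_even, sc_odd)]
    # Assumption: C(p,q) is evaluated for every distinct pair p < q.
    comm = {(p, q): even[p][q] + odd[p][q]
            for p in range(n) for q in range(p + 1, n)}
    return {
        "sc": sc,
        "sc_odd": sc_odd,
        "sc_even": sc_even,
        "bipartivity": sum(sc_even) / sum(sc),
        "communicability": comm,
        "mean_communicability": sum(comm.values()) / len(comm) if comm else 0.0,
    }


if __name__ == "__main__":
    triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
    result = network_measures(triangle)
    print("bipartivity:", result["bipartivity"])
    print("C(G):", result["mean_communicability"])
```

A graph with no odd closed walks, such as a single edge, gives a bipartivity of 1 in this sketch, while the triangle gives a value below 1.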

CalcuNetW

GammaMaps: A figure-of-merit in Radiation Therapy

[Figure: 3D dose grids with X, Y and Z axes; label: dose in voxel i,j,k]

GammaMaps: A figure-of-merit in Radiation Therapy

[Flow chart: Read doses → Initialise and normalise → Compute gamma → Store gamma]

• Application in FORTRAN 90

• Parallelised using OpenMP

• Geometric algorithm*

• 512 x 512 x 128 = 33,554,432 voxels

• Auto-vectorization

• Pragmas for offload

* T. Ju, T. Simpson, J. O. Deasy, and D. A. Low, "Geometric interpretation of the γ dose distribution comparison technique: Interpolation-free calculation," Medical Physics, vol. 35, no. 3, p. 879, 2008.
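To make the "Compute Gamma" step more concrete, here is a minimal brute-force sketch of the gamma figure-of-merit in one dimension. It is not the interpolation-free geometric algorithm of Ju et al. cited above, and it is written in Python, not the application's FORTRAN 90. It assumes the commonly used definition, in which gamma at a reference point is the minimum over evaluated points of sqrt((distance / DTA)^2 + (dose difference / dose tolerance)^2). The function name, the 3 mm and 3% tolerances and the normalisation to the maximum reference dose are illustrative assumptions.

```python
from math import sqrt, inf


def gamma_1d(ref, evl, spacing_mm, dta_mm=3.0, dose_tol=0.03):
    """Brute-force gamma value for each point of a 1D reference dose profile.

    ref, evl   : reference and evaluated doses on the same regular grid
    spacing_mm : grid spacing in millimetres
    dta_mm     : distance-to-agreement tolerance (assumed 3 mm)
    dose_tol   : dose tolerance as a fraction of the maximum reference dose
    """
    d_max = max(ref)
    gammas = []
    for i, d_ref in enumerate(ref):
        best = inf
        for j, d_evl in enumerate(evl):
            dist = (j - i) * spacing_mm        # spatial separation
            diff = (d_evl - d_ref) / d_max     # normalised dose difference
            g = sqrt((dist / dta_mm) ** 2 + (diff / dose_tol) ** 2)
            best = min(best, g)
        gammas.append(best)
    return gammas


if __name__ == "__main__":
    reference = [0.0, 0.0, 1.0, 0.0, 0.0]
    evaluated = [0.0, 0.0, 0.0, 1.0, 0.0]   # same peak, shifted by one voxel
    values = gamma_1d(reference, evaluated, spacing_mm=1.0)
    passed = sum(1 for g in values if g <= 1.0)
    print(values, "points passing:", passed)
```

This exhaustive search compares every reference point with every evaluated point, so its cost grows with the square of the number of points. That would be impractical for the 33,554,432 voxels mentioned above, which is presumably one reason the application uses a geometric algorithm instead.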

Results of Experiments

Platform: Host
  CPU Model          Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
  Nr. of cores       16
  Memory             32788 MB
  Operating System   Linux 2.6.32-279.el6.x86_64
  Compiler Version   2013U2

Platform: Intel Xeon Phi
  Model              Beta0 Engineering Sample
  Nr. of cores       61 at 1.09GHz
  Memory             7936 MB
  Operating System   MPSS Gold U1
  Compiler Version   2013U2
  GDDR Technology    GDDR5
  GDDR Frequency     2750000 kHz

• Remote access to Intel systems

• Feb. 2013

Intel Xeon Phi Affinity Policies

[Figure: one panel per policy, each showing 8 threads placed on 4 cores (C1 to C4) with 4 hardware threads each (HT1 to HT4). Thread labels as they appear in each panel, left to right:]

  COMPACT - FINE:   0 1 2 3 4 5 6 7
  SCATTER - FINE:   0 4 1 5 2 6 3 7
  BALANCED - FINE:  0 1 2 3 4 5 6 7
  BALANCED - CORE:  {0,1} {2,3} {4,5} {6,7}

• Type:
  – Compact
  – Scatter
  – Balanced

• Granularity:
  – Fine or Thread
  – Core
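The three placement types can be sketched as a small function that assigns thread ids to cores. It reproduces the example on this slide (8 threads, 4 cores, 4 hardware threads per core) and is only a model of the policies, not the OpenMP runtime itself. The function name and the handling of thread counts that do not divide evenly among the cores are assumptions. In the Intel OpenMP runtime these policies are normally selected through the KMP_AFFINITY environment variable, for example `KMP_AFFINITY=granularity=fine,compact`.

```python
def place_threads(policy, n_threads, n_cores, ht_per_core):
    """Return, for each core, the list of thread ids placed on it."""
    if n_threads > n_cores * ht_per_core:
        raise ValueError("more threads than hardware threads")
    cores = [[] for _ in range(n_cores)]
    if policy == "compact":
        # Fill every hardware thread of a core before moving to the next core.
        for t in range(n_threads):
            cores[t // ht_per_core].append(t)
    elif policy == "scatter":
        # Deal threads round-robin across the cores.
        for t in range(n_threads):
            cores[t % n_cores].append(t)
    elif policy == "balanced":
        # Spread threads evenly, keeping consecutive ids on the same core.
        # Assumption: when the split is uneven, the first cores get one extra.
        base, extra = divmod(n_threads, n_cores)
        t = 0
        for c in range(n_cores):
            for _ in range(base + (1 if c < extra else 0)):
                cores[c].append(t)
                t += 1
    else:
        raise ValueError("unknown policy: " + policy)
    return cores


if __name__ == "__main__":
    for policy in ("compact", "scatter", "balanced"):
        print(policy, place_threads(policy, 8, 4, 4))
```

Granularity is not modelled here. With fine (thread) granularity each thread is pinned to one hardware thread, whereas with core granularity the threads assigned to a core may move among that core's hardware threads, which is why the BALANCED - CORE panel shows sets such as {0,1}.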

Results for CalcuNetW

[Figure: CalcuNetW results plot]

Results for GammaMaps

[Figure: GammaMaps elapsed time (s), axis from 0 to 1400, against number of threads, axis from 0 to 20. Legend: Host, local-compact-core, local-compact-fine, local-scatter-fine, local-scatter-core]

GammaMaps

Xeon Phi poor I/O

Conclusions

• Using the MKL library is easy and requires no changes to the code.

• Simple pragmas in the code allow it to be used quickly.

• I/O performance issues on Xeon Phi.

• 1 Xeon Phi ~ 1 Xeon E5-2680

• Improving performance requires additional work.

Acknowledgements

The authors would like to thank Intel for providing access to the Intel Xeon Phi coprocessor.

Questions

Andrés Gómez

José Carlos Mouriño

Carmen Cotelo

Aurelio Rodríguez

The TEAM
