AMD technologies for HPC


Page 1: AMD technologies for HPC

Hands on work on AMD technologies for HPC solutions

[email protected]

ABSTRACT:

The goal of this talk is to present in a practical way (through a hands-on session) how the latest AMD technology works and meets current high performance computing requirements. The session reviews performance metrics (GFLOP/s and GB/s), the performance efficiency of the FPU and of the memory controllers/channels, scalability of multi-socket platforms, tuning tips such as process/thread affinity, multi-InfiniBand/GPU configurations and their I/O affinity, the impact of appropriate math libraries and compilers, and the power consumption characteristics of a system when heavily stressed with different HPC workloads. By the end of the session you should walk away with a good foundation on which building-block technologies matter for you and how to design and exploit your own HPC solutions.

ISUM 2012, Guanajuato, Mexico

Page 2: AMD technologies for HPC

Performance metrics

– GFLOP/s (SP,DP) (SSE, FMA)

– GB/s (SP,DP) (streaming stores)

– Memory Latency (local/remote)

– Memory Bandwidth (local/remote)

– Network Latency

– Network Bandwidth

– Message rate (Network)

– IOPS, sustained reads/writes (storage)

– Roofline model (performance modeling)

Page 3: AMD technologies for HPC

Roofline model (figure)
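The roofline figure itself did not survive in this transcript. As a minimal sketch of the idea, the attainable performance of a kernel is the smaller of the compute peak and the memory bandwidth multiplied by the kernel's arithmetic intensity (FLOPs per byte). The peak values below are the per-numanode Interlagos figures quoted on the socket block diagram slide (60 DP GF/s, 18.5 GB/s); the kernel names and their intensities are illustrative assumptions, not measurements from the talk.

```python
def roofline_gflops(arithmetic_intensity, peak_gflops, peak_gbs):
    """Attainable GFLOP/s for a kernel with the given FLOPs-per-byte ratio."""
    return min(peak_gflops, arithmetic_intensity * peak_gbs)


# Per-numanode figures quoted in the deck for Interlagos.
PEAK_GFLOPS = 60.0  # DP GF/s
PEAK_GBS = 18.5     # GB/s

# Intensity at which a kernel stops being memory bound.
ridge_point = PEAK_GFLOPS / PEAK_GBS

# Illustrative kernels (assumed intensities, in FLOPs per byte).
kernels = [
    ("triad-like (2 FLOPs / 24 bytes)", 2.0 / 24.0),
    ("hypothetical stencil", 0.5),
    ("DGEMM-like (blocked)", 8.0),
]

if __name__ == "__main__":
    print(f"ridge point: {ridge_point:.2f} FLOPs/byte")
    for name, intensity in kernels:
        attainable = roofline_gflops(intensity, PEAK_GFLOPS, PEAK_GBS)
        bound = "memory" if intensity < ridge_point else "compute"
        print(f"{name}: {attainable:.2f} GF/s ({bound} bound)")
```

Kernels to the left of the ridge point benefit from memory tuning (affinity, streaming stores); kernels to the right benefit from FPU tuning (FMA, math libraries).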

Page 4: AMD technologies for HPC

Scalability

• Hardware based:

– Multicore

– Numanodes in socket package

– Multisocket

– Probe filter (HT Assist)

– Multichipset

• Software based:

– Compiler, math libraries, MPI, OpenMP, affinity

– Algorithm, computation/communication overlap, non-blocking collectives

Page 5: AMD technologies for HPC

Probe filter (HT Assist): necessary for scaling memory-bound applications. It keeps a cache directory in L3 that tracks where data resides, so when cores request that data again the request can be directed to the right place instead of probing every numanode.

Memory bandwidth aggregated (GB/s)

Processor       SHANGHAI   ISTANBUL   MAGNYCOURS   INTERLAGOS
Probe filter    No         Yes        Yes          Yes
# numanodes
1               8          10         13           18.5
2               16         20         26           37
4               21         40         52           74
8               22         80         104          148

FLOPs aggregated (GF/s), assuming 2.3 GHz core frequency and 80% HPL efficiency

Processor       SHANGHAI   ISTANBUL   MAGNYCOURS   INTERLAGOS
Probe filter    No         Yes        Yes          Yes
# numanodes
1               29.44      44.16      44.16        58.88
2               58.88      88.32      88.32        117.76
4               117.76     176.64     176.64       235.52
8               235.52     353.28     353.28       471.04
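The aggregated FLOPs figures on this slide can be reproduced with simple arithmetic. The sketch below assumes 4 DP FLOPs per clock per core and the cores-per-numanode counts shown in the code; both are assumptions inferred from the numbers, not stated on the slide. The 2.3 GHz frequency and 80% HPL efficiency are taken from the slide.

```python
# Assumed cores per numanode (inferred from the slide's numbers).
CORES_PER_NUMANODE = {
    "SHANGHAI": 4,
    "ISTANBUL": 6,
    "MAGNYCOURS": 6,
    "INTERLAGOS": 8,
}
DP_FLOPS_PER_CLK_PER_CORE = 4  # assumption: sustained with all cores busy
FREQ_GHZ = 2.3                 # from the slide
HPL_EFFICIENCY = 0.80          # from the slide


def aggregated_gflops(processor, numanodes):
    """HPL-level GF/s for the given number of numanodes, rounded to 2 places."""
    per_node = (CORES_PER_NUMANODE[processor] * DP_FLOPS_PER_CLK_PER_CORE
                * FREQ_GHZ * HPL_EFFICIENCY)
    return round(per_node * numanodes, 2)


if __name__ == "__main__":
    for nodes in (1, 2, 4, 8):
        row = [aggregated_gflops(p, nodes) for p in CORES_PER_NUMANODE]
        print(nodes, row)
```

Unlike the FLOPs figures, the bandwidth figures for Shanghai (no probe filter) do not scale linearly beyond 2 numanodes, which is the point the slide makes about the probe filter.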

Page 6: AMD technologies for HPC

Bulldozer architecture

• Bulldozer compute unit

– Core pair

• Core shared resources

– L2 cache

– Floating Point Unit

– Instruction scheduler

– Power management

• Core independent resources

– L1 Data cache

– Integer Unit

Page 7: AMD technologies for HPC

Bulldozer block diagram

• HPC workloads use all the cores for the same kind of computation, and the cores run synchronized.

• The design offers high workload flexibility, such as in cloud deployments under a power budget.

Example: a cloud workload can use one core for integer work while the other core uses the whole FPU for number crunching.

Page 8: AMD technologies for HPC

Socket block diagram

16 cores are grouped into 8 compute units (core pairs), which are in turn grouped into 2 numanodes. Each numanode has 2 memory channels. The numanodes are interconnected through cHT. The socket delivers 2 x 18.5 GB/s and 2 x 60 DP GF/s under 130 W.

Page 9: AMD technologies for HPC

Bulldozer architecture (cont)

• Flexible Floating Point Unit

– When 1 core of the pair uses the FPU alone: 8 DP FLOPs/clk

– When the 2 cores share the FPU: 4 DP FLOPs/clk each

• Example of DGEMM from ACML.

• FMA4 and FMA3 instructions

– FMA4 on Interlagos: d = a (+/-) b*c

– FMA3 on Abu Dhabi: c = a (+/-) b*c

• AVX instructions

– Increase IPC by compacting instructions
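The flexible FPU figures can be read as one shared budget of 8 DP FLOPs per clock per compute unit. The sketch below is an idealized model under that reading: it assumes the budget splits evenly when both cores of the pair are active, and it uses the 8 compute units and 2.3 GHz clock quoted elsewhere in the deck to derive a theoretical socket peak. Function names are illustrative.

```python
FPU_DP_FLOPS_PER_CLK = 8  # per compute unit, from the slide


def flops_per_clk_per_core(active_cores):
    """DP FLOPs/clk available to each core when 1 or 2 cores use the FPU."""
    if active_cores not in (1, 2):
        raise ValueError("a compute unit has exactly two cores")
    # Assumption: the shared FPU is split evenly between active cores.
    return FPU_DP_FLOPS_PER_CLK / active_cores


def socket_peak_gflops(compute_units=8, freq_ghz=2.3):
    """Theoretical DP peak: the unit total does not depend on the split."""
    return compute_units * FPU_DP_FLOPS_PER_CLK * freq_ghz


if __name__ == "__main__":
    print(flops_per_clk_per_core(1))  # one core owns the whole FPU
    print(flops_per_clk_per_core(2))  # two cores share it
    print(f"{socket_peak_gflops():.1f} GF/s theoretical per socket")
```

Under this model the theoretical socket peak is about 147 GF/s, and 80% of that is close to the 2 x 60 DP GF/s quoted on the socket block diagram slide.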


Page 10: AMD technologies for HPC

Where are FMA instructions used?
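The transcript does not record the answer given on this slide. One common case, consistent with the DGEMM example mentioned earlier, is the multiply-accumulate inner loop of a matrix product, where every step has the d = a + b*c shape. The sketch below only counts those steps; plain Python does not fuse the multiply and the add, so it illustrates the pattern rather than the instruction.

```python
def matmul_count_fma(a, b):
    """Naive matrix product that also counts multiply-accumulate steps."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0.0] * cols for _ in range(rows)]
    fma_steps = 0
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for p in range(inner):
                # Same shape as an FMA: acc = acc + a[i][p] * b[p][j]
                acc = acc + a[i][p] * b[p][j]
                fma_steps += 1
            c[i][j] = acc
    return c, fma_steps


if __name__ == "__main__":
    product, steps = matmul_count_fma([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(product)
    print(f"{steps} multiply-accumulate steps = {2 * steps} FLOPs")
```

For square matrices of size n this gives n^3 multiply-accumulate steps, or 2n^3 FLOPs, which is why FMA throughput dominates DGEMM performance.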

Page 11: AMD technologies for HPC

Bulldozer architecture (cont)

• Power management:

– Max power (e.g. 135 W), TDP (115 W), ACP (85 W)

– Power capping (to limit power consumption)

• Boost states

– Pstates (HW and SW views)

• HPC mode (mostly for HPL benchmark)

• Throttling

– Power (too much power consumption, HPL)

– Thermal (too hot, not enough cooling, protection)


Page 12: AMD technologies for HPC

Power management

[Figure: Measured dynamic power, as a percentage (axis 0% to 120%) with the TDP level marked, for MaxPower128, HLT, NOP and the workloads Wupwise, Swim, Mgrid, Applu, Mesa, Galgel, Art, Equake, Facerec, Ammp, Lucas, Fma3d, Sixtrack, Apsi, Gzip, Vpr, Gcc, Mcf, Crafty, Parser, Eon, Perlbmk, Gap, Vortex, Bzip2 and Twolf. Annotation: power headroom available for boost.]

[Figure: P-states in the SW view and the HW view (P0 to P7 and P0 to P5), marking the base P-state and the boost P-states.]

Page 13: AMD technologies for HPC

Coherent and non coherent fabric

• Coherent HyperTransport (cHT) fabric

– Connects the numanodes with cache coherence

– MOESI protocol

– x8 cHT links, x16 cHT links

– Scenic routing: reroutes traffic to even out the use of x8 and x16 links

• Non-coherent HyperTransport

– RD890 chipset (PCIe Gen2)

– Connects the numanodes with PCI devices

– Multichipset

Page 14: AMD technologies for HPC

Coherent and non coherent fabric


Page 15: AMD technologies for HPC

Software Ecosystem

• Operating Systems

• Compilers

– Open64, GCC, PGI

• Math library

– ACML, AMD LibM

• Profilers

– CodeAnalyst

• Instruction Based Profiling


Page 16: AMD technologies for HPC

Operating systems for Interlagos

• Basic list of operating systems providing proper performance

– Windows Server 2008 R2

– RHEL 6.2

– CentOS 6.2

– SLES 11 SP2

– Scientific Linux 6.2

Older versions need specific patches in order to perform well.

Page 17: AMD technologies for HPC

Compiler flags

• Open64 version >= 4.2.5

• GCC version >= 4.6

• PGI version >= 11.9

• Open64 and GCC

– Compile/link flags: -Ofast -march=bdver1

• PGI

– Compile/link flags: -fast -tp Interlagos-64
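As a small sketch of how these flags might be wired into a build script, the helper below maps each compiler family to the flags listed on this slide. The driver executable names (`opencc`, `gcc`, `pgcc`) and the helper itself are assumptions for illustration; only the flags come from the slide.

```python
import shlex

# Flags exactly as listed on the slide.
FLAGS = {
    "open64": ["-Ofast", "-march=bdver1"],
    "gcc": ["-Ofast", "-march=bdver1"],
    "pgi": ["-fast", "-tp", "Interlagos-64"],
}

# Assumed C driver names for each compiler family.
DRIVERS = {"open64": "opencc", "gcc": "gcc", "pgi": "pgcc"}


def compile_command(compiler, source, output="a.out"):
    """Return the argument list for compiling one C source file."""
    return [DRIVERS[compiler], *FLAGS[compiler], source, "-o", output]


if __name__ == "__main__":
    for family in FLAGS:
        # "dgemm_test.c" is a hypothetical file name.
        print(shlex.join(compile_command(family, "dgemm_test.c")))
```

The same flags are passed at link time, as the slide notes ("Compile/link flags").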


Page 18: AMD technologies for HPC

AMD Core Math Library, download @ developer.amd.com


Page 19: AMD technologies for HPC

AMD Code Analyst Profiler, download @ developer.amd.com


Page 20: AMD technologies for HPC

NUMA definition


Page 21: AMD technologies for HPC

Feeding locally versus remotely

• Locally: cores 0 to 3 of NUMA node 0 are fed from the two memory channels (Channel 0, Channel 1) of their own node. E.g. 12 GB/s.

• Remotely: cores are fed from the memory channels of the other NUMA node (NUMA node 0 to NUMA node 1) across the cHT link (x8 or x16). Bandwidth is constrained and latency is higher (1 hop). E.g. 7 GB/s at x16, 5 GB/s at x8.
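To get a feel for what these example figures mean when only part of the data is remote, the sketch below blends the local and remote rates. It assumes a simple model in which local and remote bytes are transferred one after the other, so the times add up; real hardware overlaps traffic, so treat the result as a rough estimate only.

```python
LOCAL_GBS = 12.0                       # example figure from the slide
REMOTE_GBS = {"x16": 7.0, "x8": 5.0}   # example figures from the slide


def effective_bandwidth(remote_fraction, link="x16"):
    """Blended GB/s when a fraction of the bytes comes from the remote node."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    # Time to move 1 GB: local part plus remote part (assumed not overlapped).
    seconds_per_gb = ((1.0 - remote_fraction) / LOCAL_GBS
                      + remote_fraction / REMOTE_GBS[link])
    return 1.0 / seconds_per_gb


if __name__ == "__main__":
    for fraction in (0.0, 0.25, 0.5, 1.0):
        print(fraction,
              round(effective_bandwidth(fraction, "x16"), 2),
              round(effective_bandwidth(fraction, "x8"), 2))
```

Even a modest remote fraction pulls the blended rate well below the local figure, which is the motivation for the affinity tools on the next slide.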

Page 22: AMD technologies for HPC

Affinity

• numactl / numastat tools (Linux)

• start command (Windows)

• HWLOC toolset (Windows, Linux)

– www.open-mpi.org/projects/hwloc

• LIKWID toolset (Windows, Linux)

– http://code.google.com/p/likwid/

• OpenMP environment variables

– E.g. Open64: O64_OMP_AFFINITY_MAP

• MPI runtime flags

– E.g. OpenMPI: --bind-to-core
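Besides the tools listed above, affinity can also be set from inside a program. The sketch below uses Python's `os.sched_setaffinity`, which is available on Linux only, to pin the current process to one core per compute unit. It assumes that consecutive core ids (0 and 1, 2 and 3, and so on) belong to the same compute unit; verify this on the target system with one of the tools above before relying on it.

```python
import os


def one_core_per_compute_unit(core_ids):
    """Keep the first core of each assumed core pair: 0, 2, 4, ..."""
    return sorted(core_ids)[::2]


def pin_current_process(core_ids):
    """Pin the calling process; return False where the call is unavailable."""
    if not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(0, set(core_ids))
    return True


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        available = sorted(os.sched_getaffinity(0))
    else:
        available = list(range(os.cpu_count() or 1))
    chosen = one_core_per_compute_unit(available)
    pinned = pin_current_process(chosen)
    print(f"cores {chosen}, pinned: {pinned}")
```

With one thread per compute unit, each thread has the shared FPU and L2 cache to itself.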


Page 23: AMD technologies for HPC

numactl --hardware and numastat

• numastat: good, no misses.

• numactl --hardware: physical memory on each numa node and how much of it is available (free).

• numactl --hardware: core ids for numa node 3.

• Detecting a wrong BIOS configuration: if NODE INTERLEAVE were ENABLED, the system would show only 1 numa node, with core ids 0, 1, 2, ..., 30, 31 and 64 GB of memory.
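The BIOS check described above can be automated by parsing the tool output. The sketch below parses text in the usual `numactl --hardware` layout and reports the nodes found. The sample text is hypothetical (it is not the output shown on the slide) and is sized to match the 32-core, 64 GB system the slide describes; the expected node count is something the caller has to supply.

```python
import re

# Hypothetical sample in the usual `numactl --hardware` layout.
SAMPLE = """\
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16384 MB
node 0 free: 15821 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 16384 MB
node 1 free: 16020 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 16384 MB
node 2 free: 16011 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 16384 MB
node 3 free: 15990 MB
"""


def parse_numactl_hardware(text):
    """Return {node_id: {"cpus": [...], "size": MB, "free": MB}}."""
    nodes = {}
    for line in text.splitlines():
        match = re.match(r"node (\d+) cpus:(.*)", line)
        if match:
            cpus = [int(c) for c in match.group(2).split()]
            nodes.setdefault(int(match.group(1)), {})["cpus"] = cpus
            continue
        match = re.match(r"node (\d+) (size|free): (\d+) MB", line)
        if match:
            node = nodes.setdefault(int(match.group(1)), {})
            node[match.group(2)] = int(match.group(3))
    return nodes


def looks_node_interleaved(nodes, expected_nodes):
    """Fewer nodes than the hardware has suggests node interleaving is on."""
    return len(nodes) < expected_nodes


if __name__ == "__main__":
    found = parse_numactl_hardware(SAMPLE)
    print(len(found), "nodes; interleaved:", looks_node_interleaved(found, 4))
```

On a live system the text would come from running the tool, for example through `subprocess.run(["numactl", "--hardware"], capture_output=True, text=True)`.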


Page 24: AMD technologies for HPC

Example using likwid: hybrid MPI+OpenMP

• Build the application file (appfile) and launch the MPI job with hybrid OpenMP, with 1 thread per compute unit on 2 . Using 4 compute nodes.

• export OMP_NUM_THREADS=4

• mpirun -app ./appfile

• Where appfile is

-h node1 -np 1 likwid-pin -q -c 0,0,2,4,6 ./application
-h node1 -np 1 likwid-pin -q -c 8,8,10,12,14 ./application
-h node1 -np 1 likwid-pin -q -c 16,16,18,20,22 ./application
-h node1 -np 1 likwid-pin -q -c 24,24,26,28,30 ./application

…………………………………………….

-h node4 -np 1 likwid-pin -q -c 0,0,2,4,6 ./application
-h node4 -np 1 likwid-pin -q -c 8,8,10,12,14 ./application
-h node4 -np 1 likwid-pin -q -c 16,16,18,20,22 ./application
-h node4 -np 1 likwid-pin -q -c 24,24,26,28,30 ./application

The first core id of each list is repeated, to bind the MPI process plus the 4 worker threads.
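Writing the appfile by hand becomes tedious as the node count grows. The sketch below generates lines in the same pattern as the example: 4 ranks per node, each rank owning 8 consecutive core ids, with the first id repeated and then every second core used for the worker threads. The host names `node1` to `node4` and the helper name are hypothetical; adapt them to the actual cluster.

```python
def appfile_lines(hosts, ranks_per_host=4, threads_per_rank=4,
                  cores_per_rank=8, binary="./application"):
    """Build appfile lines following the pattern shown on the slide."""
    lines = []
    for host in hosts:
        for rank in range(ranks_per_host):
            first = rank * cores_per_rank
            # One worker thread per compute unit: every second core id.
            workers = [first + 2 * t for t in range(threads_per_rank)]
            # The first id is repeated for the MPI process itself.
            core_list = ",".join(str(c) for c in [first] + workers)
            lines.append(
                f"-h {host} -np 1 likwid-pin -q -c {core_list} {binary}"
            )
    return lines


if __name__ == "__main__":
    hosts = [f"node{i}" for i in range(1, 5)]  # hypothetical host names
    print("\n".join(appfile_lines(hosts)))
```

Redirecting the output to a file gives the appfile passed to mpirun.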


Page 25: AMD technologies for HPC

Putting it all together

Pre-exascale (high computing density) system

– Multicore

– Multisocket

– Multichipset

– Multirail

– MultiGPU

– Dynamically reconfigurable multi-root PCI devices through workload analysis

Page 26: AMD technologies for HPC


Page 27: AMD technologies for HPC

More @ http://developer.amd.com

• X86 Open64 Compilers Suite (http://developer.amd.com/tools/open64/)

• AMD Developer Tools (http://developer.amd.com/tools/)

• AMD Libraries (ACML, LibM, etc.) (http://developer.amd.com/libraries/)

• AMD Opteron™ 4200/6200 Series processors Compiler Options Quick Guide (http://developer.amd.com/Assets/CompilerOptQuickRef-62004200.pdf)

• AMD OpenCL™ Zone (http://developer.amd.com/zones/OpenCLZone/)

• AMD HPC (www.amd.com/hpc)

• AMD APP SDK Documentation (http://developer.amd.com/sdks/AMDAPPSDK/documentation/Pages/default.aspx)

• Using the x86 Open64 Compiler Suite (http://developer.amd.com/tools/open64/Documents/open64.html)

• x86 Open64 4.2.5.2 Release Notes (http://developer.amd.com/tools/open64/assets/ReleaseNotes.txt)

• ACML 5.0 Information (http://developer.amd.com/libraries/acml/features/pages/default.aspx)

• Software Optimization Guide for “Bulldozer” processors (http://support.amd.com/us/Processor_TechDocs/47414.pdf)

• AMD64 Architecture Programmer’s Manual Volume 6: 128-Bit and 256-Bit XOP and FMA4 Instructions (http://support.amd.com/us/Embedded_TechDocs/43479.pdf)

• Links to the 2- and 4-socket results for the AMD Opteron™ 6276 Series processors (16 core, 2.3 GHz). The SPEC runs used the X86 Open64 Compiler Suite.

http://www.spec.org/cpu2006/results/res2011q4/cpu2006-20111025-18742.pdf

http://www.spec.org/cpu2006/results/res2011q4/cpu2006-20111025-18748.pdf