
Evaluation of AMD EPYC
Chris Hollowell <hollowec@bnl.gov>

HEPiX Fall 2018, PIC Spain

2

What is EPYC?

EPYC is a new line of x86_64 server CPUs from AMD based on their Zen microarchitecture

Same microarchitecture used in their Ryzen desktop processors
Released June 2017

First new high-performance series of server CPUs offered by AMD since 2012

Last were the Piledriver-based Opterons
Steamroller Opteron products were cancelled

AMD had focused on low-power server CPUs instead

x86_64 Jaguar APUs
ARM-based Opteron A CPUs

Many vendors are now offering EPYC-based servers, including Dell, HP and Supermicro

3

How Does EPYC Differ From Skylake-SP?

Intel’s Skylake-SP Xeon x86_64 server CPU line also released in 2017

Both Skylake-SP and EPYC CPU dies manufactured using 14 nm process

Skylake-SP introduced AVX512 vector instruction support in Xeon
AVX512 not available in EPYC
HS06 official GCC compilation options exclude autovectorization
Stock SL6/7 GCC doesn’t support AVX512

Support added in GCC 4.9+
Not heavily used (yet) in HEP/NP offline computing
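Since the official HS06 compiler flags exclude autovectorization, AVX512 matters little for these benchmarks, but whether a given host offers it can be checked from its CPU flags. A minimal sketch (the flag strings below are illustrative excerpts, not full /proc/cpuinfo dumps):

```python
# Check a CPU's flag list for AVX-512 support, e.g. before enabling
# autovectorization with -march=native. On a live Linux system the flags
# would come from /proc/cpuinfo; here we use shortened sample strings.
def has_avx512(flags: str) -> bool:
    # avx512f (the "foundation" subset) is present on every AVX-512 CPU
    return "avx512f" in flags.split()

skylake_sp_flags = "fpu sse2 avx avx2 avx512f avx512dq avx512cd avx512bw avx512vl"
epyc_flags = "fpu sse2 avx avx2 sha_ni"

print(has_avx512(skylake_sp_flags))  # True
print(has_avx512(epyc_flags))        # False
```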

Both have models supporting 2666 MHz DDR4 memory

Skylake-SP
6 memory channels per processor
3 TB max (2-socket system, extended memory models)

EPYC
8 memory channels per processor
4 TB max (2-socket system)

4

How Does EPYC Differ From Skylake (Cont)?

Some Skylake-SP processors include built-in Omni-Path networking, or FPGA coprocessors

Not available in EPYC

Both Skylake-SP and EPYC have SMT (HT) support
2 logical cores per physical core (absent in some Xeon Bronze models)

Maximum core count (per socket)
Skylake-SP – 28 physical / 56 logical (Xeon Platinum 8180M)
EPYC – 32 physical / 64 logical (EPYC 7601)

Maximum socket count
Skylake-SP – 8 (Xeon Platinum)
EPYC – 2

Processor Interconnect
Skylake-SP – UltraPath Interconnect (UPI)
EPYC – Infinity Fabric (IF)

PCIe lanes (2-socket system)
Skylake-SP – 96
EPYC – 128 (some used by SoC functionality)

Same number available in single socket configuration

5

EPYC: MCM/SoC Design

EPYC utilizes an SoC design
Many functions normally found in the motherboard chipset are on the CPU

SATA controllers, USB controllers, etc.

Each EPYC processor consists of four CPU dies, interconnected via Infinity Fabric

Multi-Chip Module (MCM) architecture
"CPU Complexes" (CCX)

Each CCX attached to its own memory
2 memory channels per CCX

All Skylake-SP cores are on a single die

AMD claims MCM results in a cost reduction by improving yields

Believed to scale better than the monolithic-die approach as core counts continue to increase
Drawback: higher memory latency for non-NUMA-aware applications

6

EPYC: MCM/SoC Design (Cont.)

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 1
Model name:            AMD EPYC 7351 16-Core Processor
Stepping:              2
CPU MHz:               2400.000
CPU max MHz:           2400.0000
CPU min MHz:           1200.0000
BogoMIPS:              4799.41
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             64K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-3,32-35
NUMA node1 CPU(s):     4-7,36-39
NUMA node2 CPU(s):     8-11,40-43
NUMA node3 CPU(s):     12-15,44-47
NUMA node4 CPU(s):     16-19,48-51
NUMA node5 CPU(s):     20-23,52-55
NUMA node6 CPU(s):     24-27,56-59
NUMA node7 CPU(s):     28-31,60-63

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                72
On-line CPU(s) list:   0-71
Thread(s) per core:    2
Core(s) per socket:    18
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
Stepping:              4
CPU MHz:               2700.000
BogoMIPS:              5404.41
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71

EPYC vs Skylake-SP (SNC Disabled) NUMA Configuration
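The topology contrast above (8 NUMA nodes on the dual EPYC vs 2 on the dual Xeon) can be pulled out of lscpu output programmatically, which is useful when pinning jobs on MCM systems. A minimal sketch, using an excerpt of the EPYC dump as input:

```python
# Count logical CPUs per NUMA node from lscpu text. The LSCPU string is
# a three-node excerpt of the EPYC 7351 dump above; on a live system the
# text would come from running lscpu itself.
import re

LSCPU = """\
NUMA node0 CPU(s):     0-3,32-35
NUMA node1 CPU(s):     4-7,36-39
NUMA node7 CPU(s):     28-31,60-63
"""

def cpus_in(spec: str) -> set:
    """Expand a cpulist like '0-3,32-35' into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

nodes = {int(m[1]): cpus_in(m[2])
         for m in re.finditer(r"NUMA node(\d+) CPU\(s\):\s+(\S+)", LSCPU)}
for node, cpus in sorted(nodes.items()):
    print(f"node{node}: {len(cpus)} logical CPUs")
```

Each EPYC node here carries 8 logical CPUs (4 physical cores with SMT), versus 36 per node on the SNC-disabled Skylake-SP box.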

7

Socket LGA 3647 & SP3

Both CPUs/sockets are quite large

Visible quadrants in the SP3 socket for the four CPU complexes in the EPYC processor

Skylake-SP – Socket LGA 3647
EPYC – Socket SP3

8

Skylake and EPYC Model Lineup Comparison

Model  Base Frequency  Cores  SMT  TDP  Memory  Retail

Xeon Bronze 3104 1.7 GHz (no turbo) 6 No 85W 2133 MHz DDR4 $213

Xeon Silver 4110 2.1 GHz 8 Yes 85W 2400 MHz DDR4 $501

Xeon Gold 5115 2.4 GHz 10 Yes 85W 2666 MHz DDR4 $1,221

Xeon Gold 6130 2.1 GHz 16 Yes 125W 2666 MHz DDR4 $1,900

Xeon Gold 6136 3.0 GHz 12 Yes 150W 2666 MHz DDR4 $2,460

Xeon Gold 6148 2.4 GHz 20 Yes 150W 2666 MHz DDR4 $3,072

Xeon Gold 6150 2.7 GHz 18 Yes 165W 2666 MHz DDR4 $3,358

Xeon Platinum 8170 2.1 GHz 28 Yes 165W 2666 MHz DDR4 $7,405

Xeon Platinum 8180M 2.5 GHz 28 Yes 205W 2666 MHz DDR4 $13,011

EPYC 7251 2.1 GHz 8 Yes 120W 2400 MHz DDR4 $475

EPYC 7351 2.4 GHz 16 Yes 170W 2666 MHz DDR4 $1,110 (uniprocessor 7351P – $750)

EPYC 7401 2.0 GHz 24 Yes 170W 2666 MHz DDR4 $1,850 (uniprocessor 7401P – $1,075)

EPYC 7451 2.3 GHz 24 Yes 180W 2666 MHz DDR4 $2,400

EPYC 7551 2.0 GHz 32 Yes 180W 2666 MHz DDR4 $3,400

EPYC 7601 2.2 GHz 32 Yes 180W 2666 MHz DDR4 $4,200

9

EPYC vs Skylake-SP: HEP/NP Performance Benchmarks

HEPSPEC06
“all_cpp” subset of SPEC CPU2006, run in parallel

CERN Cloud Benchmark Suite
Various benchmarks, run in parallel

DB12, Whetstone, ATLAS KV

Unless noted, memory configured to utilize all 8 channels per CPU on EPYC, and 6 channels per CPU for Skylake-SP, with at least 2 GB RAM/logical core

~11% HS06 performance degradation seen for the EPYC 7451 when only populating half of the memory channels
All 2666 MHz DDR4

Note: dual-rank (DR) DIMMs downclocked to 2400 MHz for EPYC

All run under SL/CentOS/RHEL 7

SMT/Hyperthreading enabled, unless otherwise indicated

Systems are dual CPU, unless noted

10

EPYC HEPSPEC06: SMT Off vs On

[Bar chart: HS06 per system; each model shown with SMT off / SMT on]
EPYC 7401P@2.0 GHz [uniprocessor]: 368 (24 threads) / 489 (48 threads)
EPYC 7351@2.4 GHz: 541 (32 threads) / 780 (64 threads)
EPYC 7451@2.3 GHz, DDR4-2400: 883 (48 threads) / 1101 (96 threads)
EPYC 7551@2.0 GHz: 872 (64 threads) / 1148 (128 threads)
EPYC 7601@2.2 GHz, DDR4-2400: 1078 (64 threads) / 1296 (128 threads)

25%+ HS06 performance improvement with SMT (“hyperthreading”) enabled
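Assuming the bars pair off as SMT-off/SMT-on per model in legend order, the per-model SMT gain works out as follows (a sketch; the HS06 values are transcribed from the chart):

```python
# SMT speedup per EPYC model: (HS06 with SMT on) / (HS06 with SMT off).
# Values transcribed from the slide's bar chart, paired per model.
pairs = {  # model: (SMT off, SMT on)
    "EPYC 7401P": (368, 489),
    "EPYC 7351": (541, 780),
    "EPYC 7451": (883, 1101),
    "EPYC 7551": (872, 1148),
    "EPYC 7601": (1078, 1296),
}
for model, (off, on) in pairs.items():
    print(f"{model}: +{(on / off - 1) * 100:.0f}%")  # gains range ~+20% to +44%
```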

11

EPYC vs Skylake-SP: HEPSPEC06

[Bar chart: HS06 per system, all threads]
Xeon Gold 5115@2.4 GHz [40 threads] +: 394
Xeon Gold 6130@2.1 GHz [64 threads]: 729
Xeon Gold 6136@3.0 GHz [48 threads]: 790
Xeon Gold 6148@2.4 GHz [80 threads]: 1068
Xeon Gold 6150@2.7 GHz [72 threads]: 1035
Xeon Platinum 8170@2.1 GHz [104 threads] *: 1261
EPYC 7401P@2.0 GHz [uniprocessor, 48 threads]: 489
EPYC 7351@2.4 GHz [64 threads]: 780
EPYC 7451@2.3 GHz [96 threads], DDR4-2400: 1101
EPYC 7551@2.0 GHz [128 threads]: 1148
EPYC 7601@2.2 GHz [128 threads], DDR4-2400: 1296

+ = System using only 3 memory channels per CPU
* = Value reported by CERN

12

EPYC vs Skylake: HEPSPEC06 (Cont.)

Larger values are better

Similar maximum HS06 (~1,275) performance for the models tested
Data for the highest-level EPYC (7601), but not the highest-model Skylake-SP (8180M)
Can assume the Xeon Skylake 8180M would perform better than the 8170 value listed

Same number of cores/threads as the 8170, but higher clock speed
2.5 GHz vs 2.1 GHz

Mid-range model HS06 performance also similar
~700 HS06 - ~1100 HS06

TDP somewhat higher for EPYC CPUs vs Xeon Gold, in general
165 W max Xeon Gold, vs 180 W max EPYC
Can likely expect EPYC to use a bit more power as a result

13

EPYC vs Skylake-SP: CERN Cloud Benchmarks

[Bar chart: aggregate results per system]
                                       DB12 (Dirac HS06 est.)  Whetstone (BWIPS)  ATLAS KV (events/sec)
Xeon Gold 5115@2.4 GHz [40 threads] +          220                   114                  15
Xeon Gold 6150@2.7 GHz [72 threads]            998                   262                  65
EPYC 7351@2.4 GHz [64 threads]                 733                   210                  67
EPYC 7551@2.0 GHz [128 threads]               1256                   361                 120

+ = System using only 3 memory channels per CPU

14

Results for a limited number of CPUs: only possible to run the full suite (including KV) on systems with the CVMFS client installed/setup

By default the CERN cloud benchmarks run one instance of a benchmark per logical core in parallel

However, the suite only reports performance per logical core
Interested in aggregate system performance, not performance/logical core

For DB12 and Whetstone, simply multiplied the result by the number of logical cores
For KV, average seconds/event per logical core is reported

Took the inverse, and multiplied by the number of logical cores to obtain total events/sec
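The aggregation steps above can be sketched as follows (the input numbers in the example are hypothetical, not measured results):

```python
# Scale the CERN suite's per-logical-core results up to whole-system
# aggregates, as described above.
def aggregate_db12(per_core_score: float, logical_cores: int) -> float:
    """DB12/Whetstone: multiply the per-core result by the core count."""
    return per_core_score * logical_cores

def aggregate_kv(sec_per_event_per_core: float, logical_cores: int) -> float:
    """KV reports average seconds/event per core: invert, then scale."""
    return (1.0 / sec_per_event_per_core) * logical_cores

# Hypothetical example: 64 logical cores, DB12 score of 11.5 per core,
# KV averaging 60 s/event per core.
print(aggregate_db12(11.5, 64))  # aggregate DB12 estimate
print(aggregate_kv(60.0, 64))    # aggregate events/sec
```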

Larger graphed values are better

DB12 and Whetstone results fairly in line with HS06

Expected somewhat better KV performance for the Xeon Gold 6150

EPYC/Skylake: CERN Cloud Benchmarks (Cont.)

15

EPYC vs Skylake-SP: CPU HS06/Dollar

[Bar chart: HS06 per retail CPU dollar]
Xeon Gold 5115@2.4 GHz [40 threads] +: 0.32
Xeon Gold 6130@2.1 GHz [64 threads]: 0.19
Xeon Gold 6136@3.0 GHz [48 threads]: 0.16
Xeon Gold 6148@2.4 GHz [80 threads]: 0.17
Xeon Gold 6150@2.7 GHz [72 threads]: 0.15
Xeon Platinum 8170@2.1 GHz [104 threads]: 0.09
EPYC 7401P@2.0 GHz [uniprocessor, 48 threads]: 0.45
EPYC 7351@2.4 GHz [64 threads]: 0.35
EPYC 7451@2.3 GHz [96 threads], DDR4-2400: 0.23
EPYC 7551@2.0 GHz [128 threads]: 0.17
EPYC 7601@2.2 GHz [128 threads], DDR4-2400: 0.15

Only retail CPU cost accounted for in calculations
Does not represent reality given memory and base server pricing

+ = System using only 3 memory channels per CPU

16

EPYC vs Skylake-SP: Estimated 25kHS06 Cost

[Bar chart: estimated cost in $1k]
Xeon Gold 5115@2.4 GHz [6*16 GB DIMMs, 64 servers] +: 332
Xeon Gold 6130@2.1 GHz [12*16 GB DIMMs, 34 servers]: 256
Xeon Gold 6136@3.0 GHz [12*8 GB DIMMs, 32 servers]: 258
Xeon Gold 6148@2.4 GHz [12*16 GB DIMMs, 23 servers]: 233
Xeon Gold 6150@2.7 GHz [12*16 GB DIMMs, 24 servers]: 257
Xeon Platinum 8170@2.1 GHz [12*32 GB DIMMs, 20 servers]: 417
EPYC 7401P@2.0 GHz [8*16 GB DIMMs, 51 servers]: 256
EPYC 7351@2.4 GHz [16*8 GB DIMMs, 32 servers]: 189
EPYC 7451@2.3 GHz [16*16 GB DIMMs, 23 servers]: 221
EPYC 7551@2.0 GHz [16*16 GB DIMMs, 22 servers]: 256
EPYC 7601@2.2 GHz [16*16 GB DIMMs, 19 servers]: 251

Estimated total cost of 25kHS06 +-500HS06
Assuming $1,500 irreducible server cost, and retail CPU/memory pricing

+ = System using only 3 memory channels per CPU

17

Server counts to achieve 25kHS06 +-500HS06 (+-2%)

Majority of compute node cost in CPU and memory
Assuming no excessive local storage space or IOPs requirements

Typically the case for HEP/NP

Retail CPU costs used in estimate
Likely to receive volume or competitive discounts

Estimate assumes a server without CPUs and memory costs $1,500
Includes power supply, disk, NIC, etc.
Only accounts for the cost of the servers themselves. Associated costs such as racks, network switches, integration, shipping, etc. not included
Server vendors typically increase base prices for servers which support higher-performing CPU models with higher TDP (i.e. due to bigger PSUs, etc.)

$1,500 server base cost may be lower than reality for systems with higher-end CPUs

Memory costs: retail Samsung server DIMM pricing
DDR4 2666 MHz, ECC, registered

8 GB DIMM - $137
16 GB DIMM - $208
32 GB DIMM - $380
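The cost model above can be reproduced as a short calculation. This sketch uses the retail DIMM and CPU prices plus per-system HS06 figures from earlier slides; server counts are rounded to the nearest count landing within the 25k ±500 HS06 target:

```python
# Estimate total cost to reach ~25k HS06, per the model described above:
# per-server cost = $1,500 base + CPUs + DIMMs, times the server count.
BASE_SERVER = 1_500                        # PSU, disk, NIC, etc.
DIMM_PRICE = {8: 137, 16: 208, 32: 380}    # retail DDR4-2666 ECC REG

def servers_for(target_hs06: float, hs06_per_server: float) -> int:
    # Nearest server count; the slides target 25k +-500 HS06
    return round(target_hs06 / hs06_per_server)

def cost_25k(hs06_per_server, cpu_price, n_cpus, n_dimms, dimm_gb):
    n = servers_for(25_000, hs06_per_server)
    per_server = BASE_SERVER + n_cpus * cpu_price + n_dimms * DIMM_PRICE[dimm_gb]
    return n, n * per_server

# Dual EPYC 7351: 780 HS06/server, $1,110/CPU, 16x8 GB DIMMs
n, total = cost_25k(780, 1_110, 2, 16, 8)
print(n, total)  # 32 servers, $189,184 (~$189k, as charted)
```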

EPYC vs Skylake-SP: Est. 25kHS06 Cost (Cont.)

18

Enough memory populated per system to provide 2 GB/logical core
Fairly standard for HEP/NP

Ensured all 6 (Skylake) or 8 (EPYC) memory channels per CPU utilized for maximum bandwidth and NUMA performance

Often ended up with more RAM than required to satisfy 2 GB/logical core
6 channels makes this particularly difficult for Skylake: installed memory not a power of two

Problem compounded by server manufacturers not offering smaller (i.e. 4 GB), less expensive DDR4 DIMMs

Estimated cost for 25kHS06 fairly similar between Skylake-SP Xeon Gold and EPYC servers: most close to $250k

Dual-CPU EPYC 7351 systems appear very cost effective, however: est. $189k
~25% less than the Xeon Gold 6148
16*8 GB DIMMs exactly satisfy the required memory/logical core with DIMM channel parity
The CPU itself is inexpensive compared to its Skylake/EPYC counterparts

EPYC 7401P uniprocessor system HS06/$ for CPU cost initially looked promising

Large number of servers required: irreducible per-server cost added up
Estimated in the $250k range like many of the other CPUs

EPYC vs Skylake-SP: Est. 25kHS06 Cost (Cont.)

19

Side-Channel Attacks

Jan 2018 - New class of side-channel information disclosure vulnerabilities in CPU hardware made public

Meltdown, Spectre
Exploit speculative execution and caching optimizations in CPUs

List of similar side-channel attack vectors continues to grow

Speculative Store Bypass Vulnerability
Foreshadow (L1TF)

Microcode updates for Skylake-SP released for all of the above

AMD claims EPYC is not vulnerable to Meltdown or Foreshadow
Due to existing protections in their paging architecture
Released microcode updates for Spectre

20

The Future: EPYC 2 and Cascade Lake

EPYC 2 - Rome
Expected in early 2019
7 nm process
Support for DDR4-3200 DIMMs expected

Still 8 channels
Max core count per socket increased to 64 (128 threads)
AVX512?

Cascade Lake Xeon
Expected end of 2018
14 nm process

10 nm process expected in Ice Lake in 2020
Max memory speed: DDR4-2666 DIMMs

Still 6 channels
Max core count per socket remains at 28 (56 threads)
Expected to support VNNI instructions

Utilizes AVX512 units
Support for Optane 3D XPoint memory
Announced that the Frontera supercomputer at TACC will be Cascade Lake based – estimated to provide 35-40 PFLOPS without GPUs

Both CPUs will have Spectre mitigations built into hardware

21

Conclusions

The EPYC MCM architecture is considerably different from Skylake-SP’s single die configuration

Similar HEP/NP benchmark performance from mid/upper range Skylake-SP Xeon Gold and mid/upper range AMD EPYC CPUs

Pricing also similar
However, dual-CPU EPYC 7351-based systems appear to be a sweet spot for applications requiring 2 GB/logical core (somewhat typical for HEP/NP software)

Competition in the server CPU market will likely help reduce cost and spur innovation

EPYC 2 (Rome), with its 7 nm process and up to 64 physical cores per socket, appears poised to disrupt the existing balance of server CPU market share