Tightly Coupled Accelerator for HA-PACSwill be installed around Mar. 2013 2012/03/19 LBNL and...

18
C Ce en nt te er r f fo or r C Co om mp pu ut ta at ti io on na al l S Sc ci ie en nc ce es s, , U Un ni iv v. . o of f T T s su uk ku ub ba a Yuetsu KODAMA Center for Computational Sciences University of Tsukuba [email protected] Tightly Coupled Accelerator for HA-PACS 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012

Transcript of Tightly Coupled Accelerator for HA-PACSwill be installed around Mar. 2013 2012/03/19 LBNL and...

Page 1: Tightly Coupled Accelerator for HA-PACSwill be installed around Mar. 2013 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 Center

CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa

Yuetsu KODAMA Center for Computational Sciences

University of Tsukuba [email protected]

Tightly Coupled Accelerator for HA-PACS

2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012

Page 2: Tightly Coupled Accelerator for HA-PACSwill be installed around Mar. 2013 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 Center

CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa

HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences)

!! Two parts !! Base Cluster:

!! for development of GPU-accelerated code for target fields, and performing product-run of them

!! TCA: (Tightly Coupled Accelerators) !! for elementary research on new communication technology for

reducing latency between accelerators. !! Our original communication system based on PCI-Express named

“PEARL”, and a prototype communication chip named “PEACH” !! Improve PEACH for TCA : PEACH2

2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012

Page 3: Tightly Coupled Accelerator for HA-PACSwill be installed around Mar. 2013 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 Center

CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa

Problems of GPU Cluster!! Problems of GPGPU for HPC

!! Data I/O performance limitation !! Ex) GPGPU: PCIe gen2 x16 (8GB/s ) 665 GFLOPS (NVIDIA M2090)

!! Memory size limitation !! Ex) M2090: 6GByte vs CPU: 128GByte

!! Communication between accelerators: no direct path (external) communication latency via CPU becomes large

!! Researches for direct communication between GPUs are required

TThhee ttaarrggeett ooff TTCCAA iiss ddeevveellooppiinngg aa ddiirreecctt ccoommmmuunniiccaattiioonn ssyysstteemm bbeettwweeeenn eexxtteerrnnaall GGPPUUss ffoorr aa ffeeaassiibbiilliittyy ssttuuddyy ffoorr ffuuttuurree aacccceelleerraatteedd ccoommppuuttiinngg

2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012

PCIe

Node

PCIe

PCIe

PCIeGPU

GPUCPU

CPU

IB HCA

IB HCA

IB Switch

MEMMEM

MEM MEM

Page 4: Tightly Coupled Accelerator for HA-PACSwill be installed around Mar. 2013 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 Center

CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa

HA-PACS: TCA (Tightly Coupled Accelerator)

!! TCA: Tightly Coupled Accelerator !! Direct connection between accelerators (GPUs) !! Using PCIe as a communication device between accelerator

!! Most acceleration device and other I/O device are connected by PCIe as PCIe end-point (slave device)

!! An intelligent PCIe device logically enables an end-point device to directly communicate with other end-point devices

!! PEARL: PCI Express Adaptive and Reliable Link !! We already developed such PCIe device (PEACH, PCI Express Adaptive

Communication Hub) on JST-CREST project “low power and dependable network for embedded system”

Improving PEACH for HPC to realize TCA : PEACH2

LBNL and CCS-Tsukuba Joint Workshop 20122012/03/19

Page 5: Tightly Coupled Accelerator for HA-PACSwill be installed around Mar. 2013 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 Center

CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa

PEACH, PCI Express Adaptive Communication Hub

!! Applying PCIe link as “a direct link between nodes” !! PCI-E link edge control feature: “root complex” and “end points” are

automatically switched (flipped) according to the connection handling !! Each PCIe is an independent bus, and Intelligent controller on PEACH

transfers each packet between PCIe buses.

2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012

PCIePCIe PCIe

PEACHCPUPCIe

RC RCEPRC

EP

CPUPCIe

PEACH

RCEPEP

EP

RC

PEACHCPUPCIe

RC RCEPRC

EP

CPUPCIe

PEACH

RCEPEP

EP

RC

Page 6: Tightly Coupled Accelerator for HA-PACSwill be installed around Mar. 2013 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 Center

CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa

PEACH chip [Otani et al., ISSCC2011]

!"#$%&'()

*+,-./0

*+,-./'

Transfer Processing block

Control Processing block

*+,-./&

*+,-./1

23435678493:7;

*83<7;;38<387

!! CPU: Renesas M32R 4core SMP (max. 400MHz) !! small core size, low power!! SMP !! Controlling PCIe 4 port!! health check for the node, link!! communication link management!! route reconfiguration

PCIe#0

L2 Cache

PCIe#3

PCIe#2

CPU SRAM

DD

R3

I/F

PCIe#1

•! comm. link PCI Express Gen2 x4 lanes (20Gbps) x4 port

•! SuperHyway DMAC

•! Low power –! Controlling #lanes for each port –! gen1/gen2 switching –! core frequency control

2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012

Page 7: Tightly Coupled Accelerator for HA-PACSwill be installed around Mar. 2013 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 Center

CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa

HA-PACS/TCA (Tightly Coupled Accelerator)!! Enhanced version of PEACH

PPEEAACCHH22 !! x4 lanes -> x8 lanes !! hardwired on main data path and

PCIe interface fabric

PEACH2CPU

PCIe

CPUPCIe

Node

PEACH2

PCIe

PCIe

PCIe

GPU

GPU

!! True GPU-direct !! current GPU clusters require 3-

hop communication (3-5 times memory copy)

!! For strong scaling, Inter-GPU direct communication protocol is needed for lower latency and higher throughput

PCIe

PCIe

Node

PCIe

PCIe

PCIeGPU

GPUCPU

CPU

IB HCA

IB HCA

IB Switch

MEMMEM

MEM

MEMMEM MEM

2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012

Page 8: Tightly Coupled Accelerator for HA-PACSwill be installed around Mar. 2013 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 Center

CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa

Implementation of PEACH2: ASIC FPGA!! FPGA based implementation

!! today’s advanced FPGA allows to use PCIe hub with multiple ports !! currently gen2 x 8 lanes x 4 ports are available

soon gen3 will be available (?) !! easy modification and enhancement !! fits to standard (full-size) PCIe board !! internal multi-core general purpose CPU with programmability is available

easily split hardwired/firmware partitioning on certain level on control layer

!! Controlling PEACH2 for GPU communication protocol !! collaboration with NVIDIA for information sharing and discussion !! based on CUDA4.0 device to device direct memory copy protocol

2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012

Page 9: Tightly Coupled Accelerator for HA-PACSwill be installed around Mar. 2013 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 Center

HA-PACS/TCANode Cluster = NC

PEACH2

C x 2

PPEEAARRLL RRiinngg NNeettwwoorrkk

IInnffiinniibbaanndd LLiinnkk

Node Cluster with 16 nodes •! GPUx64 (G) •! CPUx32 (C) •! GPU comm with PCIe •! IB link / node •! CPU: Xeon E5 •! GPU: Kepler

Node Cluster

Node Cluster

Node Cluster

Node Cluster

Node Cluster

IInnffiinniibbaanndd NNeettwwoorrkk

................

44 NNCC wwiitthh 1166 nnooddeess,, oorr 88 NNCC wwiitthh 88 nnooddeess == 336600 TTFFLLOOPPSS eexxtteennssiioonn ttoo bbaassee cclluusstteerr

•!High speed GPU-GPU comm. by PEACH within NC (PCI-E gen2x8 = 4GB/s/link) •!Infiniband QDR (x2) for NC-NC comm. (4GB/s/link)

Gx4

PEACH2

C x 2

Gx4

PEACH2

C x 2

Gx4

.....

2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012

Page 10: Tightly Coupled Accelerator for HA-PACSwill be installed around Mar. 2013 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 Center

CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa

PEARL/PEACH2 variation (1)

!! Option 1: !! GPU host driver is available as-is !! 4 GPUs are equivalent from PEACH2

C C

C C

C C

C CQPI

PCIe

GPU GPU GPU IB HCA

PEACH2

GPU

G2 x8

G3 x16

G3 x16

G3 x8

2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012

Page 11: Tightly Coupled Accelerator for HA-PACSwill be installed around Mar. 2013 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 Center

CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa

PEARL/PEACH2 variation (2)

C C

C C

C C

C CQPI

PCIe

GPU GPU GPUIB

HCA

GPUPEACH2

PCIe SWG2 x8

G3 x16

G3 x16

G3 x8

G3 x8

!! Option 1: !! Performance comparison among IB and PEARL can be evenly

compared !! Additional latency by PCIe switch

2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012

Page 12: Tightly Coupled Accelerator for HA-PACSwill be installed around Mar. 2013 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 Center

CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa

PEARL/PEACH2 variation (3)

C C

C C

C C

C CQPI

PCIe

GPU GPU

GPU

IB HCA

GPU PEACH2

PCIe SWG2 x8

G3 x16

G3 x16

G3 x16

G3 x16

G3 x8

!! Option 3: !! Requires only 72 lanes in total !! asymmetric connection among 3 blocks of GPUs

2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012

Page 13: Tightly Coupled Accelerator for HA-PACSwill be installed around Mar. 2013 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 Center

CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa

PEACH2 FPGA test bedHSMC-PCIe converter board (newly

developed) -! HSMC (High-Speed Mezzanine Card)

General I/O port -! PCIe x8 cable connector -! RootComplex / Endpoint switchable -! both x4 / x8 available

generic FPGA evaluation board (DEV-4SGX530N)

-! PCIe x8 endpoint -! HSMC support PCIe x 8 Gen2 -! HSMC support PCIe x 4 Gen2 -! Stratix IV GX530KH40

(1517pin, 531K LE, 20Mbit, 4 PCIe IP, 24 8.5Gbps Transceiver)

2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012

Read: 3.2GB/s, Write 3.3GB/s in 64K

Page 14: Tightly Coupled Accelerator for HA-PACSwill be installed around Mar. 2013 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 Center

CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa

PEACH2 PCIe compatible board(~ Mar. 2012)

Altera Stratix IV GX D

DR3-

SODIMM

PCI-Express gen2 x8 card edge

PCI-Express gen2 x8 cable connector

PCI-Express gen2 x8 cable connector

PCI-Express gen2 x16 (using only x8) cable

connector

Stratix IV GX530NF45 (1932pin, 32 8.5Gbps Transceiver)

2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012

Gate Usage: 16% Clock: 250MHz Latency: 376ns

Page 15: Tightly Coupled Accelerator for HA-PACSwill be installed around Mar. 2013 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 Center

CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa

Summary!! HA-PACS/TCA for elementary study for advanced technology on

direct communication among accelerating devices (GPUs) !! FPGA implementation of PEACH2 will be finished for the

prototype version on Mar. 2012 and enhanced for final version in following 6 months

!! HA-PACS/TCA with at least 200 TFLOPS additional performance will be installed around Mar. 2013

2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012

Page 16: Tightly Coupled Accelerator for HA-PACSwill be installed around Mar. 2013 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 Center

CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012

Page 17: Tightly Coupled Accelerator for HA-PACSwill be installed around Mar. 2013 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 Center

CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa

PCI express IP core performance (Gen2 x 8)

2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012

0

500

1000

1500

2000

2500

3000

3500

4000

128 512 2048 8192 32768 131072

bbaanndd

wwiidd

tthh ((MM

bbyyttee

ss//ssee

cc))

ddaattaa ssiizzee ((bbyytteess))

read MB/s write MB/s

Page 18: Tightly Coupled Accelerator for HA-PACSwill be installed around Mar. 2013 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012 Center

CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa

PCI Express

!! I/O Interface for PC after PCI, PCI-X !! Standard by PCI-SIG !! Used in connecton between PC and NIC

!! Serial communication !! Speed of 1 lane: 2.5Gbps(Gen1),

5.0Gbps(Gen2) !! Multilane is supported for high

throughput !! x1, x2, x4, x8, x16

!! Line length is limited !! 30cm for Gen2 on board !! 2m with external cable !! Enough for Embeded or small scale

cluster

CPU

Root Complex (RC)

Device Device

Switch

EndPoint (EP)