Page 1:

Performance Evaluation of Power-aware Multi-tree Ethernet for HPC Interconnects

Michihiro Koibuchi, Takafumi Watanabe, Atsushi Minamihata, Masahiro Nakao, Tomoyuki Hiroyasu, Hiroki Matsutani, and Hideharu Amano

koibuchi@nii.ac.jp

Page 2:

HPC PC Clusters with Ethernet

• Host/CPU
  – Various low-power techniques are used
    • DVFS
    • Power gating
• Ethernet switch
  – Always active, ready for packet injection

We evaluate our power-aware on/off link activation for Ethernet on PC clusters.

[Figure: a PC connected to an Ethernet switch]

Interconnect share in the TOP500 (Nov. 2011): Gigabit Ethernet (GbE) 45%

Page 3:

Outline

• Ethernet for HPC
  – Link aggregation (channel group) + multi-paths
• Our on/off link activation method
• Evaluations
  – Performance and power consumption of PC clusters

Page 4:

Ethernet on HPC systems

• Increasing the number of ports of GbE switches
  – 24/48-port switches provide the lowest cost per port
• Improving the computation power of hosts (> 10 GFlops)
• Link aggregation [IEEE 802.3ad] + multi-path topology [Kudoh, IEEE Cluster, 2004][Viking, Infocom2004][Koibuchi et al, IEEE TPDS2011]
  – Drastically increasing the number of links

[Figure: hosts 0 to 7 connected through switches; link aggregation using 2 links, 2 paths]

Page 5:

Power consumption of GbE switches

• Power consumption is almost constant regardless of traffic load
• The number of activated ports dominates the power consumption of switches
  – The power consumption of a port is reduced to zero by the port-shutdown operation

Product   Port   Other (Xbar)   Total (ratio of ports)
PC5324    1.2    14.9           42.9 (65%)
PC6224    2.0    42.5           91.1 (53%)
PC6248    2.1    56.8           155.2 (63%)
SF-420    1.0    32.6           55.4 (41%)
C-3750    1.8    84.5           127.7 (34%)

Unit: W
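The "ratio of ports" column can be reproduced from the other two columns, assuming it is defined as (Total − Other) / Total. That definition is an inference from the numbers, not something the slide states. A minimal sketch in Python, with the dictionary layout and function name chosen only for illustration:

```python
# Figures copied from the table above, in watts.
# "other" is the non-port part (crossbar etc.), "total" is the whole switch.
SWITCHES = {
    "PC5324": {"other": 14.9, "total": 42.9},
    "PC6224": {"other": 42.5, "total": 91.1},
    "PC6248": {"other": 56.8, "total": 155.2},
    "SF-420": {"other": 32.6, "total": 55.4},
    "C-3750": {"other": 84.5, "total": 127.7},
}


def port_share(other_w, total_w):
    """Fraction of switch power attributed to ports (assumed: total minus other)."""
    return (total_w - other_w) / total_w


# Percentage of total power consumed by the ports, rounded to whole percent.
shares = {
    name: round(100 * port_share(v["other"], v["total"]))
    for name, v in SWITCHES.items()
}
print(shares)
```

Under this assumption the rounded percentages agree with the parenthesised values in the table for all five products.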

Page 6:

Overview of the on/off link method

• Network load is not always high (e.g., during computation time)
• Switch ports consume 40-60% of the total power

[Figure: hosts 0 to 7 and switches, shown twice; when the traffic load becomes low, a part of the links is turned off]
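To get a feel for what turning off ports can save, the sketch below uses a simple linear model: a fixed part plus a per-port part for every active port. The model itself is an assumption for illustration (the slides only say that a shut-down port drops to zero and that power is almost independent of traffic). The PC6248 figures are taken from the power table on page 5, and shutting down half of the ports is an arbitrary example.

```python
def switch_power(other_w, port_w, active_ports):
    """Assumed linear model: fixed power plus per-port power for each active port."""
    return other_w + port_w * active_ports


# PC6248 figures from the power table: 2.1 W per port, 56.8 W for the rest.
OTHER_W, PORT_W, PORTS = 56.8, 2.1, 48

full = switch_power(OTHER_W, PORT_W, PORTS)       # every port active
half = switch_power(OTHER_W, PORT_W, PORTS // 2)  # half of the ports shut down
saving = 1 - half / full

print(f"all ports on: {full:.1f} W, half off: {half:.1f} W, saving: {saving:.0%}")
```

The model gives a slightly higher full-load figure than the measured total in the table, so it should be read as a rough estimate rather than a measurement.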

Page 7:

Outline

• Ethernet for HPC
  – Link aggregation (channel group) + multi-paths
• Our on/off link activation method
• Evaluations
  – Performance and power consumption of PC clusters

Page 8:

A framework of the on/off link method

• (1) Traffic monitoring (e.g., port monitor, IPTraf, pilot execution)
• (2) Do low- or high-load links appear? No: keep monitoring. Yes: go to (3)
• (3) Selection of on/off links and paths
• (4) Update of the on/off link operation

How is it implemented on Ethernet?

[Figure: hosts 0 to 7 and switches; when the traffic load becomes low, the paths before and after the update are shown, and the "before" path is deactivated]
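The framework loop can be sketched as a single decision pass over the monitored links. Everything specific here is hypothetical: the `Link` type, the `LOW`/`HIGH` thresholds, the link names, and the rule of keeping at least one link up are illustrative choices, not the selection algorithm used in the evaluation.

```python
from dataclasses import dataclass

# Hypothetical utilization thresholds (fraction of link capacity).
LOW, HIGH = 0.2, 0.8


@dataclass
class Link:
    name: str
    utilization: float  # measured load, e.g. from a port monitor
    active: bool = True


def plan_updates(links):
    """Return (links_to_turn_off, links_to_turn_on) for one monitoring pass."""
    active = [link for link in links if link.active]
    inactive = [link for link in links if not link.active]
    # A high-load link appeared: bring one deactivated link back, turn nothing off.
    if any(link.utilization > HIGH for link in active):
        return [], [link.name for link in inactive[:1]]
    low = [link.name for link in active if link.utilization < LOW]
    # Never turn off every active link; keep the first one up.
    if low and len(low) == len(active):
        low = low[1:]
    return low, []


quiet = [Link("sw0-sw1/1", 0.05), Link("sw0-sw1/2", 0.10)]
busy = [Link("sw0-sw1/1", 0.95), Link("sw0-sw1/2", 0.0, active=False)]
print(plan_updates(quiet))
print(plan_updates(busy))
```

An empty result in both positions corresponds to the "No" branch of the flowchart, where monitoring simply continues.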

Page 9:

Requirements for the on/off link method

• No update of the MPI communication library
• Hide the overhead to activate the link
• Stabilize the MAC address tables while updating paths

[Figure: hosts 0 to 7 and switches, with the paths before and after the update]

Page 10:

Changing the paths for the on/off link operation

• Using the switch-tagged VLAN routing method [Otsuka, ICPP06]
  – Specifying the path by attaching the VLAN tag to a frame (Port VLAN ID: PVID)
  – Each host sends and receives usual (untagged) frames
    • When a frame arrives at a switch from a host, the switch adds a VLAN tag (PVID) to it
    • When the frame leaves for a host, the switch removes the VLAN tag

[Figure: hosts 0 to 7 and switches, shown twice; the path of PVID #v0 (VLAN v0) and the path of PVID #v1 (VLAN v1); port labels PVID v0 and v1; VLAN tag #v0 is attached]
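The tagging behaviour described above can be illustrated with a toy model in which a frame is a dictionary. The function names, the `pvid` mapping, and the host port names are hypothetical; a real switch does this in hardware according to its port VLAN configuration.

```python
def ingress_from_host(frame, host_port, pvid):
    """Switch receives an untagged frame from a host: attach the port's PVID as VLAN tag."""
    return {**frame, "vlan": pvid[host_port]}


def egress_to_host(frame):
    """Switch delivers the frame to a host: strip the VLAN tag again."""
    return {key: value for key, value in frame.items() if key != "vlan"}


# Hypothetical PVID assignment for two host-facing ports.
pvid = {"host0": "v0", "host4": "v0"}

frame = {"src": "host0", "dst": "host4", "payload": "hello"}
tagged = ingress_from_host(frame, "host0", pvid)
delivered = egress_to_host(tagged)

# Re-pointing the port to another VLAN changes the path without touching the host.
pvid["host0"] = "v1"
retagged = ingress_from_host(frame, "host0", pvid)
print(tagged["vlan"], retagged["vlan"], delivered == frame)
```

The point of the sketch is that the host-side frame is identical before and after the PVID change, which is why the MPI library and the hosts need no modification.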

Page 11:

When a deactivated link is activated

• (1) Activate the target link
  – Using the no-shutdown command of the switch
• (2) Create VLAN v0 for the new path set that includes the target link, and make its MAC address table
• (3) Update the PVIDs of the ports connecting hosts to v0

[Figure: three diagrams of hosts 0 to 7 and switches as the traffic increases: before; steps 1 and 2 (activate links, VLAN v0); step 3 (updating PVID to v0)]

Page 12:

When an activated link is deactivated

• (1) Create VLAN v1 for the new path set that avoids the target link, and make its MAC address table
• (2) Update the PVIDs of the ports connecting hosts to v1
• (3) Deactivate the link

[Figure: three diagrams of hosts 0 to 7 and switches as the traffic decreases: before (the path of PVID v0); steps 1 and 2 (the path of PVID v1, PVID changed from #v0 to v1); step 3 (deactivating the link)]
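One way to read the step ordering on this slide and the previous one is that the paths currently selected by the PVID should never cross a link that is down. The sketch below models that invariant with plain sets. The `Fabric` class, its methods, and the link names "A" and "B" are hypothetical and only mirror the order of the steps; they do not issue any switch commands.

```python
class Fabric:
    """Toy model: which links are up, which links each VLAN's paths use, current PVID."""

    def __init__(self, up_links, vlan, vlan_links):
        self.up = set(up_links)
        self.vlans = {vlan: set(vlan_links)}
        self.pvid = vlan

    def consistent(self):
        """True if every link on the currently selected paths is up."""
        return self.vlans[self.pvid] <= self.up

    def activate(self, link, vlan, vlan_links):
        self.up.add(link)                   # (1) no-shutdown on the target link
        assert self.consistent()
        self.vlans[vlan] = set(vlan_links)  # (2) VLAN for the path set using the link
        assert self.consistent()
        self.pvid = vlan                    # (3) move the host ports to the new PVID
        assert self.consistent()

    def deactivate(self, link, vlan, vlan_links):
        assert link not in vlan_links       # the new path set must avoid the link
        self.vlans[vlan] = set(vlan_links)  # (1) VLAN for the path set avoiding the link
        assert self.consistent()
        self.pvid = vlan                    # (2) move the host ports to the new PVID
        assert self.consistent()
        self.up.discard(link)               # (3) shut the link down last
        assert self.consistent()


fabric = Fabric({"A", "B"}, "v0", {"A", "B"})
fabric.deactivate("B", "v1", {"A"})
state_after_off = (sorted(fabric.up), fabric.pvid)
fabric.activate("B", "v0", {"A", "B"})
state_after_on = (sorted(fabric.up), fabric.pvid)
print(state_after_off, state_after_on)
```

In this model, shutting the link down before moving the PVID would leave the selected paths pointing at a dead link, which is the situation the orderings on both slides avoid.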

Page 13:

Outline

• Ethernet for HPC
  – Link aggregation (channel group) + multi-paths
• On/off link activation method
• Evaluations
  – Performance and power consumption of PC clusters

Page 14:

Performance evaluation on a PC cluster

• PC cluster
  – 66 hosts, 528 cores
  – CPU: Quad-Core AMD Opteron 2.3 GHz
  – Memory: DDR2 667 MHz, 8 GB
  – NIC & driver: Broadcom BCM95721, Tigon3
  – Kernel: 2.6.9-67.0.15.ELsmp
• GbE switch
  – Dell PC 6248
    • 48 ports @ 8
• Application
  – NPB 3.2 / HPL (OpenMPI 1.3 / MPICH-1.2.7p1)

[Photo: Dell PC6248 switch]

Page 15:

Topology of the cluster

• Tree or completely connected graph
  – Up to 5 links between switches
• Enabling link aggregation (IEEE 802.3ad)
• Pre-executing the applications to estimate the traffic amount
  – Set up the on/off link set before executing
• Performing our simple link regulation algorithm

[Figure: completely (fully) connected topology and tree topology]

Page 16:

Pre-evaluation (even link removal)

[Figure, three panels comparing Tree and Compl (completely connected) topologies with 1 to 5 links between switches:
(1) Synthetic traffic: throughput (Mbps/host) for matrix transpose and bit-reversal
(2) Linpack (HPL): performance (Tflops) against the ideal; Rmax/Rpeak=61%
(3) NPB, Class C: relative Mop/s for CG, FT, IS, LU, MG, BT, SP against the ideal]

The performance of all the applications drops drastically if links are removed uniformly.

Page 17:

Performance and Power in HPL

Rmax/Rpeak=61%

Over 20% power reduction with almost the same performance

Page 18:

Performance and Power in NPB64

Rmax/Rpeak=61%

CLASS C

Over 25% power reduction with almost the same performance

IS, LU, BT, SP keep performance

Page 19:

Performance and Power in NPB128

Rmax/Rpeak=61%

CLASS C

Over 20% power reduction with almost the same performance

LU, MG keep performance

Page 20:

Conclusions

• We evaluated our on/off link method on Ethernet
  – Multi-tree topologies and link aggregation are enabled
  – Using the port-shutdown command to reduce power consumption
    • Ports consume up to 60% of switch power
• Network power is reduced by up to 37% in the 528-core PC cluster