Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand
Matthew Koop, Wei Huang, Abhinav Vishnu, Dhabaleswar K. Panda
Network-Based Computing Laboratory
Department of Computer Science & Engineering
The Ohio State University
Introduction
• Computer systems have increased significantly in processing capability over the last few years in various ways
– Multi-core architectures are becoming more prevalent
– High-speed I/O interfaces, such as PCI-Express, have enabled high-speed interconnects such as InfiniBand to deliver higher performance
• The area that has improved the least during this time is the memory controller
Traditional Memory Design
• Traditional memory controller design has limited the number of DIMMs per memory channel as signal rates have increased
• Due to the high pin count (240) required for each channel, adding additional channels is costly
• The end result is equal or lower memory capacity in recent years
Fully-Buffered DIMMs (FB-DIMMs)
• FB-DIMM uses serial lanes with a buffer on each DIMM to eliminate this tradeoff
• Each channel requires only 69 pins
• Using the buffer allows larger numbers of DIMMs per channel as well as increased parallelism
[Figure: FB-DIMM channel architecture (image courtesy of Intel Corporation)]
Evaluation
• With multi-core systems coming, a scalable memory subsystem is increasingly important
• Our goal is to compare FB-DIMM against a traditional design and evaluate its scalability
• Evaluation process:
– Test the memory subsystem on a single node
– Evaluate network-level performance with two InfiniBand Host Channel Adapters (HCAs)
Outline
• Introduction & Goals
• Memory Subsystem Evaluation
– Experimental testbed
– Latency and throughput
• Network results
• Conclusions and Future work
Evaluation Testbed
• Intel “Bensley” system
– Two 3.2 GHz dual-core Intel Xeon “Dempsey” processors
– FB-DIMM-based memory subsystem
• Intel Lindenhurst system
– Two 3.4 GHz Intel Xeon processors
– Traditional memory subsystem (2 channels)
• Both contain:
– Two 8x PCI-Express slots
– DDR2 533-based memory
– Two dual-port Mellanox MT25208 InfiniBand HCAs
Bensley Memory Configurations
[Figure: Bensley memory layout: Processors 0 and 1 share a memory controller with two branches (Branch 0, Branch 1); each branch has two channels (Channels 0–3), and each channel has four DIMM slots (Slots 0–3)]
• The FB-DIMM standard allows up to 6 channels with 8 DIMMs/channel, for up to 192 GB
• Our systems have 4 channels, each with 4 DIMM slots
• To fill 4 DIMM slots we have 3 combinations (spreading the DIMMs across 1, 2, or 4 channels)
Subsystem Evaluation Tool
lmbench 3.0-a5: open-source benchmark suite for evaluating system-level performance
• Latency
– Memory read latency
• Throughput
– Memory read benchmark
– Memory write benchmark
– Memory copy benchmark
Aggregate performance is obtained by running multiple long-running processes and reporting the sum of averages
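To make the throughput side of the methodology concrete, below is a minimal C sketch (not lmbench source) of the kind of streaming-read loop behind a memory read throughput figure; the buffer size and repetition count are illustrative choices. The aggregate numbers on the following slides come from several such processes running concurrently, with their averages summed.

```c
/* Minimal streaming-read throughput sketch in the spirit of lmbench's
 * memory read benchmark.  Not lmbench source; the 64 MB working set and
 * repetition count are illustrative choices. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define BUF_BYTES (64UL * 1024 * 1024)   /* working set larger than cache */
#define REPS      10

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    size_t n = BUF_BYTES / sizeof(long);
    long *buf = malloc(BUF_BYTES);
    long sum = 0;

    memset(buf, 1, BUF_BYTES);           /* touch every page before timing */

    double start = now_sec();
    for (int r = 0; r < REPS; r++)
        for (size_t i = 0; i < n; i++)
            sum += buf[i];               /* stream through the buffer */
    double elapsed = now_sec() - start;

    double mbytes = (double)BUF_BYTES * REPS / (1024.0 * 1024.0);
    printf("read throughput: %.1f MB/sec (checksum %ld)\n",
           mbytes / elapsed, sum);
    free(buf);
    return 0;
}
```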
Bensley Memory Throughput
• To study the impact of additional channels, we evaluated configurations with 1, 2, and 4 channels
• Throughput increases significantly from one to two channels in all operations
[Figure: Bensley aggregate throughput (MB/sec) vs. number of processes (1, 2, 4) for 1, 2, and 4 channels; three panels: Read, Write, Copy]
Access Latency Comparison
[Figure: Read latency (ns), unloaded vs. loaded, for Bensley (4 GB, 8 GB, 16 GB) and Lindenhurst (2 GB, 4 GB)]
• Comparison when unloaded and loaded
• Loaded is when a memory read throughput test is run in the background while the latency test is running (a minimal latency-loop sketch follows this list)
• From unloaded to loaded latency:
– Lindenhurst: 40% increase
– Bensley: 10% increase
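As a reference for how such latency numbers are typically obtained, here is a minimal C pointer-chasing sketch (not lat_mem_rd source); the array size, stride, and hop count are illustrative. The "loaded" case corresponds to running a read-throughput process such as the one sketched earlier in the background while this loop executes.

```c
/* Minimal pointer-chasing read-latency sketch in the spirit of lmbench's
 * lat_mem_rd.  Not lmbench source; array size, stride, and hop count are
 * illustrative.  Each load depends on the previous one, so the average
 * time per hop approximates the memory read latency. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define ENTRIES (16UL * 1024 * 1024)   /* 128 MB of pointers, beyond cache */
#define HOPS    (10UL * 1000 * 1000)

int main(void)
{
    void **chain = malloc(ENTRIES * sizeof(void *));
    size_t stride = 9973;               /* odd stride to defeat prefetching */
    size_t idx = 0;

    /* Link every entry into one long cycle with a large stride. */
    for (size_t i = 0; i < ENTRIES; i++) {
        size_t next = (idx + stride) % ENTRIES;
        chain[idx] = &chain[next];
        idx = next;
    }

    struct timeval t0, t1;
    void **p = &chain[0];
    gettimeofday(&t0, NULL);
    for (size_t i = 0; i < HOPS; i++)
        p = (void **)*p;                 /* dependent load chain */
    gettimeofday(&t1, NULL);

    double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                 (t1.tv_usec - t0.tv_usec) * 1e3) / (double)HOPS;
    printf("average read latency: %.1f ns (p=%p)\n", ns, (void *)p);
    free(chain);
    return 0;
}
```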
Memory Throughput Comparison
• Comparison of the Lindenhurst and Bensley platforms with increasing memory size
• Performance increases with two concurrent read or write operations on the Bensley platform
[Figure: Aggregate throughput (MB/sec) vs. number of processes (1, 2, 4) for Bensley 4 GB/8 GB/16 GB and Lindenhurst 2 GB/4 GB; three panels: Read, Write, Copy]
Outline
• Introduction & Goals
• Memory Subsystem Evaluation
– Experimental testbed
– Latency and throughput
• Network results
• Conclusions and Future work
OSU MPI over InfiniBand
• Open Source High Performance Implementations
– MPI-1 (MVAPICH)
– MPI-2 (MVAPICH2)
• Has enabled a large number of production IB clusters all over the world to take advantage of InfiniBand
– Largest being the Sandia Thunderbird Cluster (4512 nodes with 9024 processors)
• Have been directly downloaded and used by more than 395 organizations worldwide (in 30 countries)
– Time-tested and stable code base with novel features
• Available in software stack distributions of many vendors
• Available in the OpenFabrics (OpenIB) Gen2 stack and OFED
• More details at http://nowlab.cse.ohio-state.edu/projects/mpi-iba/
Experimental Setup
• The evaluation uses two InfiniBand DDR HCAs via the “multi-rail” feature of MVAPICH
• Results with one process use both rails in a round-robin pattern
• Results with 2 and 4 process pairs use a process binding assignment (see the sketch after the figure below)
[Figure: Process-to-HCA assignment with four processes (P0–P3) and two HCAs (HCA 0, HCA 1): round robin (left) vs. process binding (right)]
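The following is only a conceptual C sketch of the two rail-scheduling policies in the figure, not MVAPICH source; NUM_RAILS and the helper names are made up for illustration.

```c
/* Conceptual sketch of the two rail-scheduling policies compared in the
 * slides -- not MVAPICH source.  NUM_RAILS and the helper names are
 * illustrative.  Round robin: a single process alternates its messages
 * across both HCAs.  Binding: each process is pinned to one HCA. */
#include <stdio.h>

#define NUM_RAILS 2   /* two HCAs in the testbed */

/* Round robin: the i-th message from a process goes out on rail i mod 2. */
static int rail_round_robin(unsigned long msg_seq)
{
    return (int)(msg_seq % NUM_RAILS);
}

/* Binding: the process with rank r always uses rail r mod 2. */
static int rail_bound(int rank)
{
    return rank % NUM_RAILS;
}

int main(void)
{
    for (unsigned long i = 0; i < 4; i++)
        printf("round robin: message %lu -> HCA %d\n", i, rail_round_robin(i));
    for (int rank = 0; rank < 4; rank++)
        printf("binding:     process P%d -> HCA %d\n", rank, rail_bound(rank));
    return 0;
}
```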
Uni-Directional Bandwidth
• Comparison of Lindenhurst and Bensley with dual DDR HCAs
• Due to higher memory copy bandwidth, Bensley significantly outperforms Lindenhurst for medium-sized messages (a minimal bandwidth-test sketch follows the figure)
[Figure: Uni-directional bandwidth (MB/sec) vs. message size (1 byte to 1 MB) for Bensley with 1, 2, and 4 processes and Lindenhurst with 1 and 2 processes]
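For context, here is a minimal MPI sketch of a window-based uni-directional bandwidth test in the spirit of the OSU micro-benchmarks; it is not the exact benchmark source, and the message size, window depth, and iteration count are illustrative.

```c
/* Minimal window-based uni-directional bandwidth sketch between two MPI
 * ranks.  Not the OSU benchmark source; MSG_SIZE, WINDOW, and ITERS are
 * illustrative.  Rank 0 streams messages to rank 1, which acknowledges
 * each window with a zero-byte message. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE  (256 * 1024)   /* 256 KB messages */
#define WINDOW    64             /* sends in flight per iteration */
#define ITERS     100

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(MSG_SIZE);
    MPI_Request reqs[WINDOW];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double start = MPI_Wtime();
    for (int it = 0; it < ITERS; it++) {
        if (rank == 0) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf, MSG_SIZE, MPI_CHAR, 1, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Recv(NULL, 0, MPI_CHAR, 1, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                       /* ack */
        } else if (rank == 1) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf, MSG_SIZE, MPI_CHAR, 0, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Send(NULL, 0, MPI_CHAR, 0, 1, MPI_COMM_WORLD); /* ack */
        }
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0) {
        double mbytes = (double)MSG_SIZE * WINDOW * ITERS / (1024.0 * 1024.0);
        printf("uni-directional bandwidth: %.1f MB/sec\n", mbytes / elapsed);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Run between two nodes with two ranks, this reports one point for a fixed message size; sweeping the message size produces curves of the kind plotted above.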
Bi-Directional Bandwidth
• Improvement at a 1 KB message size:
– Lindenhurst: 1 to 2 processes: 15%
– Bensley: 1 to 2 processes: 75%, 2 to 4 processes: 45%
• Lindenhurst peak bi-directional bandwidth is only 100 MB/sec greater than uni-directional
[Figure: Bi-directional bandwidth (MB/sec) vs. message size (1 byte to 1 MB) for Bensley with 1, 2, and 4 processes and Lindenhurst with 1 and 2 processes]
Messaging Rate
• For very small messages, both show similar performance
• At 512 bytes, the Lindenhurst 2-process case is only 52% higher than 1 process, while Bensley still shows a 100% improvement
[Figure: Messaging rate (millions of messages/sec) vs. message size (1 byte to 256 KB) for Bensley with 1, 2, and 4 processes and Lindenhurst with 1 and 2 processes]
Outline
• Introduction & Goals
• Memory Subsystem Evaluation
– Experimental testbed
– Latency and throughput
• Network results
• Conclusions and Future work
Conclusions and Future Work
• Performed a detailed analysis of the memory subsystem scalability of Bensley and Lindenhurst
• Bensley shows a significant advantage in scalable throughput and capacity across all measures tested
• Future work:
– Profile real-world applications on a larger cluster and observe the effects of contention in multi-core architectures
– Expand the evaluation to include NUMA-based architectures
Acknowledgements
Our research is supported by the following organizations:
• Current Equipment support by
• Current Funding support by
Web Pointers
http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/
{koop, huanwei, vishnu, panda}@cse.ohio-state.edu