Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand
Matthew Koop, Wei Huang, Abhinav Vishnu, Dhabaleswar K. Panda
Network-Based Computing Laboratory
Department of Computer Science & Engineering
The Ohio State University
Introduction
• Computer systems have increased significantly in processing capability over the last few years in various ways
– Multi-core architectures are becoming more prevalent
– High-speed I/O interfaces, such as PCI-Express, have enabled high-speed interconnects such as InfiniBand to deliver higher performance
• The area that has improved the least during this time is the memory controller
Traditional Memory Design
• Traditional memory controller design has limited the number of DIMMs per memory channel as signal rates have increased
• Due to the high pin count (240) required for each channel, adding additional channels is costly
• The end result is equal or lower memory capacity in recent years
Fully-Buffered DIMMs (FB-DIMMs)
• FB-DIMM uses serial lanes with a buffer on each DIMM to eliminate this tradeoff
• Each channel requires only 69 pins
• Using the buffer allows larger numbers of DIMMs per channel as well as increased parallelism
[Figure: FB-DIMM channel architecture (image courtesy of Intel Corporation)]
Evaluation
• With multi-core systems coming, a scalable memory subsystem is increasingly important
• Our goal is to compare FB-DIMM against a traditional design and evaluate its scalability
• Evaluation process:
– Test the memory subsystem on a single node
– Evaluate network-level performance with two InfiniBand Host Channel Adapters (HCAs)
Outline
• Introduction & Goals
• Memory Subsystem Evaluation
– Experimental testbed
– Latency and throughput
• Network results
• Conclusions and Future work
Evaluation Testbed
• Intel “Bensley” system
– Two 3.2 GHz dual-core Intel Xeon “Dempsey” processors
– FB-DIMM-based memory subsystem
• Intel Lindenhurst system
– Two 3.4 GHz Intel Xeon processors
– Traditional memory subsystem (2 channels)
• Both contain:
– Two 8x PCI-Express slots
– DDR2 533-based memory
– Two dual-port Mellanox MT25208 InfiniBand HCAs
Bensley Memory Configurations
[Figure: Bensley memory layout: Processors 0 and 1 share a memory controller with two branches (Branch 0, Branch 1); each branch has two channels (Channels 0–3), and each channel has four DIMM slots (Slots 0–3)]
• The FB-DIMM standard allows up to 6 channels with 8 DIMMs/channel, for up to 192 GB
• Our systems have 4 channels, each with 4 DIMM slots
• To fill 4 DIMM slots we have 3 combinations (spreading the DIMMs across 1, 2, or 4 channels)
Subsystem Evaluation Tool
lmbench 3.0-a5: open-source benchmark suite for evaluating system-level performance
• Latency
– Memory read latency
• Throughput
– Memory read benchmark
– Memory write benchmark
– Memory copy benchmark
Aggregate performance is obtained by running multiple long-running processes and reporting the sum of averages
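To make the throughput side of the methodology concrete, below is a minimal C sketch (not lmbench source) of the kind of streaming-read loop behind a memory read throughput figure; the buffer size and repetition count are illustrative choices. The aggregate numbers on the following slides come from several such processes running concurrently, with their averages summed.

```c
/* Minimal streaming-read throughput sketch in the spirit of lmbench's
 * memory read benchmark.  Not lmbench source; the 64 MB working set and
 * repetition count are illustrative choices. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define BUF_BYTES (64UL * 1024 * 1024)   /* working set larger than cache */
#define REPS      10

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    size_t n = BUF_BYTES / sizeof(long);
    long *buf = malloc(BUF_BYTES);
    long sum = 0;

    memset(buf, 1, BUF_BYTES);           /* touch every page before timing */

    double start = now_sec();
    for (int r = 0; r < REPS; r++)
        for (size_t i = 0; i < n; i++)
            sum += buf[i];               /* stream through the buffer */
    double elapsed = now_sec() - start;

    double mbytes = (double)BUF_BYTES * REPS / (1024.0 * 1024.0);
    printf("read throughput: %.1f MB/sec (checksum %ld)\n",
           mbytes / elapsed, sum);
    free(buf);
    return 0;
}
```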
Bensley Memory Throughput
• To study the impact of additional channels, we evaluated configurations with 1, 2, and 4 channels
• Throughput increases significantly from one to two channels in all operations
[Figure: Bensley aggregate throughput (MB/sec) vs. number of processes (1, 2, 4) for 1, 2, and 4 channels; three panels: Read, Write, Copy]
Access Latency Comparison
[Figure: Read latency (ns), unloaded vs. loaded, for Bensley (4 GB, 8 GB, 16 GB) and Lindenhurst (2 GB, 4 GB)]
• Comparison when unloaded and loaded
• Loaded is when a memory read throughput test is run in the background while the latency test is running (a minimal latency-loop sketch follows this list)
• From unloaded to loaded latency:
– Lindenhurst: 40% increase
– Bensley: 10% increase
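As a reference for how such latency numbers are typically obtained, here is a minimal C pointer-chasing sketch (not lat_mem_rd source); the array size, stride, and hop count are illustrative. The "loaded" case corresponds to running a read-throughput process such as the one sketched earlier in the background while this loop executes.

```c
/* Minimal pointer-chasing read-latency sketch in the spirit of lmbench's
 * lat_mem_rd.  Not lmbench source; array size, stride, and hop count are
 * illustrative.  Each load depends on the previous one, so the average
 * time per hop approximates the memory read latency. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define ENTRIES (16UL * 1024 * 1024)   /* 128 MB of pointers, beyond cache */
#define HOPS    (10UL * 1000 * 1000)

int main(void)
{
    void **chain = malloc(ENTRIES * sizeof(void *));
    size_t stride = 9973;               /* odd stride to defeat prefetching */
    size_t idx = 0;

    /* Link every entry into one long cycle with a large stride. */
    for (size_t i = 0; i < ENTRIES; i++) {
        size_t next = (idx + stride) % ENTRIES;
        chain[idx] = &chain[next];
        idx = next;
    }

    struct timeval t0, t1;
    void **p = &chain[0];
    gettimeofday(&t0, NULL);
    for (size_t i = 0; i < HOPS; i++)
        p = (void **)*p;                 /* dependent load chain */
    gettimeofday(&t1, NULL);

    double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                 (t1.tv_usec - t0.tv_usec) * 1e3) / (double)HOPS;
    printf("average read latency: %.1f ns (p=%p)\n", ns, (void *)p);
    free(chain);
    return 0;
}
```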
Memory Throughput Comparison
• Comparison of the Lindenhurst and Bensley platforms with increasing memory size
• Performance increases with two concurrent read or write operations on the Bensley platform
[Figure: Aggregate throughput (MB/sec) vs. number of processes (1, 2, 4) for Bensley 4 GB/8 GB/16 GB and Lindenhurst 2 GB/4 GB; three panels: Read, Write, Copy]
Outline
• Introduction & Goals
• Memory Subsystem Evaluation
– Experimental testbed
– Latency and throughput
• Network results
• Conclusions and Future work
OSU MPI over InfiniBand
• Open Source High Performance Implementations
– MPI-1 (MVAPICH)
– MPI-2 (MVAPICH2)
• Has enabled a large number of production IB clusters all over the world to take advantage of InfiniBand
– Largest being the Sandia Thunderbird Cluster (4512 nodes with 9024 processors)
• Have been directly downloaded and used by more than 395 organizations worldwide (in 30 countries)
– Time-tested and stable code base with novel features
• Available in software stack distributions of many vendors
• Available in the OpenFabrics (OpenIB) Gen2 stack and OFED
• More details at http://nowlab.cse.ohio-state.edu/projects/mpi-iba/
Experimental Setup
• The evaluation uses two InfiniBand DDR HCAs via the “multi-rail” feature of MVAPICH
• Results with one process use both rails in a round-robin pattern
• Results with 2 and 4 process pairs use a process binding assignment (see the sketch after the figure below)
[Figure: Process-to-HCA assignment with four processes (P0–P3) and two HCAs (HCA 0, HCA 1): round robin (left) vs. process binding (right)]
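The following is only a conceptual C sketch of the two rail-scheduling policies in the figure, not MVAPICH source; NUM_RAILS and the helper names are made up for illustration.

```c
/* Conceptual sketch of the two rail-scheduling policies compared in the
 * slides -- not MVAPICH source.  NUM_RAILS and the helper names are
 * illustrative.  Round robin: a single process alternates its messages
 * across both HCAs.  Binding: each process is pinned to one HCA. */
#include <stdio.h>

#define NUM_RAILS 2   /* two HCAs in the testbed */

/* Round robin: the i-th message from a process goes out on rail i mod 2. */
static int rail_round_robin(unsigned long msg_seq)
{
    return (int)(msg_seq % NUM_RAILS);
}

/* Binding: the process with rank r always uses rail r mod 2. */
static int rail_bound(int rank)
{
    return rank % NUM_RAILS;
}

int main(void)
{
    for (unsigned long i = 0; i < 4; i++)
        printf("round robin: message %lu -> HCA %d\n", i, rail_round_robin(i));
    for (int rank = 0; rank < 4; rank++)
        printf("binding:     process P%d -> HCA %d\n", rank, rail_bound(rank));
    return 0;
}
```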
Uni-Directional Bandwidth
• Comparison of Lindenhurst and Bensley with dual DDR HCAs
• Due to higher memory copy bandwidth, Bensley significantly outperforms Lindenhurst for medium-sized messages (a minimal bandwidth-test sketch follows the figure)
[Figure: Uni-directional bandwidth (MB/sec) vs. message size (1 byte to 1 MB) for Bensley with 1, 2, and 4 processes and Lindenhurst with 1 and 2 processes]
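For context, here is a minimal MPI sketch of a window-based uni-directional bandwidth test in the spirit of the OSU micro-benchmarks; it is not the exact benchmark source, and the message size, window depth, and iteration count are illustrative.

```c
/* Minimal window-based uni-directional bandwidth sketch between two MPI
 * ranks.  Not the OSU benchmark source; MSG_SIZE, WINDOW, and ITERS are
 * illustrative.  Rank 0 streams messages to rank 1, which acknowledges
 * each window with a zero-byte message. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE  (256 * 1024)   /* 256 KB messages */
#define WINDOW    64             /* sends in flight per iteration */
#define ITERS     100

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(MSG_SIZE);
    MPI_Request reqs[WINDOW];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double start = MPI_Wtime();
    for (int it = 0; it < ITERS; it++) {
        if (rank == 0) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf, MSG_SIZE, MPI_CHAR, 1, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Recv(NULL, 0, MPI_CHAR, 1, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                       /* ack */
        } else if (rank == 1) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf, MSG_SIZE, MPI_CHAR, 0, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Send(NULL, 0, MPI_CHAR, 0, 1, MPI_COMM_WORLD); /* ack */
        }
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0) {
        double mbytes = (double)MSG_SIZE * WINDOW * ITERS / (1024.0 * 1024.0);
        printf("uni-directional bandwidth: %.1f MB/sec\n", mbytes / elapsed);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Run between two nodes with two ranks, this reports one point for a fixed message size; sweeping the message size produces curves of the kind plotted above.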
Bi-Directional Bandwidth
• Improvement at a 1 KB message size:
– Lindenhurst: 1 to 2 processes: 15%
– Bensley: 1 to 2 processes: 75%, 2 to 4 processes: 45%
• Lindenhurst peak bi-directional bandwidth is only 100 MB/sec greater than uni-directional
[Figure: Bi-directional bandwidth (MB/sec) vs. message size (1 byte to 1 MB) for Bensley with 1, 2, and 4 processes and Lindenhurst with 1 and 2 processes]
Messaging Rate
• For very small messages, both show similar performance
• At 512 bytes, the Lindenhurst 2-process case is only 52% higher than 1 process, while Bensley still shows a 100% improvement
[Figure: Messaging rate (millions of messages/sec) vs. message size (1 byte to 256 KB) for Bensley with 1, 2, and 4 processes and Lindenhurst with 1 and 2 processes]
Outline
• Introduction & Goals
• Memory Subsystem Evaluation
– Experimental testbed
– Latency and throughput
• Network results
• Conclusions and Future work
Conclusions and Future Work
• Performed a detailed analysis of the memory subsystem scalability of Bensley and Lindenhurst
• Bensley shows a significant advantage in scalable throughput and capacity across all measures tested
• Future work:
– Profile real-world applications on a larger cluster and observe the effects of contention in multi-core architectures
– Expand the evaluation to include NUMA-based architectures
Acknowledgements
Our research is supported by the following organizations:
• Current Equipment support by
• Current Funding support by
Web Pointers
http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/
{koop, huanwei, vishnu, panda}@cse.ohio-state.edu