ECLIPSE Performance Benchmarks and Profiling

January 2009


Note

• The following research was performed under the HPC Advisory Council activities

– AMD, Dell, Mellanox, Schlumberger

– HPC Advisory Council Cluster Center

• For more info please refer to

– www.mellanox.com, www.dell.com/hpc, www.amd.com, www.slb.com


Schlumberger ECLIPSE

• Oil and gas reservoir simulation software

– Developed by Schlumberger

• Offers multiple numerical simulation techniques for accurate and fast simulation of

– Black-oil

– Compositional

– Thermal

– Streamline

– Others

• ECLIPSE supports MPI to achieve high performance and scalability


Objectives

• The presented research was done to provide best practices

– ECLIPSE performance benchmarking

– Interconnect performance comparisons

– Ways to increase ECLIPSE productivity

– Understanding ECLIPSE communication patterns

– Power-efficient simulations


Test Cluster Configuration – System Upgrade

• Dell™ PowerEdge™ SC 1435 24-node cluster

• Quad-Core AMD Opteron™ Model 2382 processors (“Shanghai”)

• Mellanox® InfiniBand ConnectX® DDR HCAs

• Mellanox® InfiniBand DDR Switch

• Memory: 16GB memory, DDR2 800MHz per node

• OS: RHEL5U2, OFED 1.3 InfiniBand SW stack

• MPI: Platform MPI 5.6.5

• Application: Schlumberger ECLIPSE Simulators 2008.2

• Benchmark Workload

– 4 million cell model (2048 × 200 × 10), black-oil 3-phase model with ~800 wells
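
To give a rough sense of scale for this workload, the cell count and the average share per MPI rank can be worked out as follows. This is a sketch under stated assumptions, not something the deck spells out: the three figures in the workload description are read as grid dimensions, each server is taken to run 8 MPI ranks (two quad-core CPUs, one rank per core), and cells are assumed to be split evenly across ranks.

```python
# Grid dimensions as read from the workload description
# (assumption: the three numbers are nx, ny, nz).
NX, NY, NZ = 2048, 200, 10

# Assumption: 8 cores per server (two quad-core CPUs), one MPI rank per core.
CORES_PER_NODE = 8


def total_cells(nx=NX, ny=NY, nz=NZ):
    """Total number of grid cells in the model."""
    return nx * ny * nz


def cells_per_rank(nodes, cores_per_node=CORES_PER_NODE):
    """Average cells per MPI rank, assuming an even split (hypothetical)."""
    return total_cells() / (nodes * cores_per_node)


if __name__ == "__main__":
    print(f"total cells: {total_cells():,}")
    for nodes in (4, 8, 12, 16, 20, 22, 24):
        print(f"{nodes:2d} nodes: {cells_per_rank(nodes):,.0f} cells per rank")
```

The product is 4,096,000 cells, which matches the "4 million cell" description. Under the even-split assumption, the work per rank shrinks as nodes are added, so communication makes up a growing share of the run time.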


Mellanox InfiniBand Solutions

• Industry Standard

– Hardware, software, cabling, management

– Design for clustering and storage interconnect

• Price and Performance

– 40Gb/s node-to-node

– 120Gb/s switch-to-switch

– 1us application latency

– Most aggressive roadmap in the industry

• Reliable with congestion management

• Efficient

– RDMA and Transport Offload

– Kernel bypass

– CPU focuses on application processing

• Scalable for Petascale computing & beyond

• End-to-end quality of service

• Virtualization acceleration

• I/O consolidation including storage

[Figure: "InfiniBand Delivers the Lowest Latency" and "The InfiniBand Performance Gap is Increasing". Bandwidth roadmap comparing InfiniBand with Fibre Channel and Ethernet; labeled InfiniBand rates: 20Gb/s, 40Gb/s and 80Gb/s (4X); 60Gb/s, 120Gb/s and 240Gb/s (12X).]


Quad-Core AMD Opteron™ Processor

• Performance

– Quad-Core

  • Enhanced CPU IPC

  • 4x 512K L2 cache

  • 6MB L3 Cache

– Direct Connect Architecture

  • HyperTransport™ technology

  • Up to 24 GB/s peak per processor

– Floating Point

  • 128-bit FPU per core

  • 4 FLOPS/clk peak per core

– Integrated Memory Controller

  • Up to 12.8 GB/s

  • DDR2-800 MHz or DDR2-667 MHz

• Scalability

– 48-bit Physical Addressing

• Compatibility

– Same power/thermal envelopes as 2nd / 3rd generation AMD Opteron™ processor

[Figure: processor block diagram showing PCI-E® bridges, I/O hub, USB, PCI, dual-channel registered DDR2 memory, and links labeled 8 GB/s.]
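
The peak figures on this slide can be checked with simple arithmetic. The sketch below takes 4 FLOPS/clk per core, four cores per CPU, and dual-channel DDR2-800 from the slide. The 64-bit (8-byte) channel width is standard for DDR2. The 2.6 GHz clock for the Model 2382 is an assumption that the deck does not state.

```python
# Figures taken from the slide.
FLOPS_PER_CLOCK_PER_CORE = 4
CORES_PER_CPU = 4
DDR2_TRANSFERS_PER_SEC = 800e6  # DDR2-800
MEMORY_CHANNELS = 2             # "dual channel" in the block diagram

# Values not stated in the deck.
BYTES_PER_TRANSFER = 8          # 64-bit DDR2 channel
CLOCK_HZ = 2.6e9                # assumed core clock for the Model 2382


def peak_memory_bandwidth_gbs():
    """Peak memory bandwidth per CPU in GB/s."""
    return DDR2_TRANSFERS_PER_SEC * BYTES_PER_TRANSFER * MEMORY_CHANNELS / 1e9


def peak_gflops_per_cpu(clock_hz=CLOCK_HZ):
    """Peak floating-point rate per CPU in GFLOPS at the assumed clock."""
    return FLOPS_PER_CLOCK_PER_CORE * CORES_PER_CPU * clock_hz / 1e9


if __name__ == "__main__":
    print(f"memory bandwidth: {peak_memory_bandwidth_gbs():.1f} GB/s per CPU")
    print(f"peak compute:     {peak_gflops_per_cpu():.1f} GFLOPS per CPU")
```

The memory result reproduces the slide's "Up to 12.8 GB/s". The compute figure is only as reliable as the assumed clock.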


Dell PowerEdge Servers helping Simplify IT

• System Structure and Sizing Guidelines

– 24-node cluster built with Dell PowerEdge™ SC 1435 Servers

– Servers optimized for High Performance Computing environments

– Building Block Foundations for best price/performance and performance/watt

• Dell HPC Solutions

– Scalable Architectures for High Performance and Productivity

– Dell's comprehensive HPC services help manage the lifecycle requirements

– Integrated, Tested and Validated Architectures

• Workload Modeling

– Optimized System Size, Configuration and Workloads

– Test-bed Benchmarks

– ISV Applications Characterization

– Best Practices & Usage Analysis


ECLIPSE Performance Results - Interconnect

• InfiniBand enables highest scalability

– Performance accelerates with cluster size

• Performance over GigE and 10GigE is not scaling

– Slowdown occurs beyond 8 nodes

[Figure: Schlumberger ECLIPSE (FOURMILL). X-axis: Number of Nodes (4, 8, 12, 16, 20, 22, 24); Y-axis: Elapsed Time (Seconds), 0 to 6000; series: GigE, 10GigE, InfiniBand. Lower is better; single job per cluster size.]


ECLIPSE Performance Results - Interconnect

• InfiniBand outperforms GigE by up to 500% and 10GigE by up to 457%

– The advantage grows as the number of nodes increases

[Figure: Schlumberger ECLIPSE (InfiniBand vs GigE & 10GigE). X-axis: Number of Nodes (4, 8, 12, 16, 20, 22, 24); Y-axis: Performance Advantage, 0% to 600%; series: GigE, 10GigE.]
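
The deck does not say how the "performance advantage" percentage is derived. One plausible reading, taken here as an assumption, is the relative reduction in elapsed time at a given node count: how much longer the Ethernet run takes than the InfiniBand run, as a percentage. A minimal sketch under that assumption, using placeholder timings that are not read from the chart:

```python
def advantage_pct(elapsed_other_s, elapsed_infiniband_s):
    """Percent by which InfiniBand is faster, under the assumed definition
    (elapsed_other / elapsed_infiniband - 1) * 100."""
    if elapsed_other_s <= 0 or elapsed_infiniband_s <= 0:
        raise ValueError("elapsed times must be positive")
    return (elapsed_other_s / elapsed_infiniband_s - 1.0) * 100.0


# Placeholder elapsed times in seconds, NOT taken from the benchmark results.
EXAMPLE_ELAPSED_S = {"GigE": 3000.0, "10GigE": 2400.0, "InfiniBand": 1200.0}

if __name__ == "__main__":
    ib = EXAMPLE_ELAPSED_S["InfiniBand"]
    for name in ("GigE", "10GigE"):
        print(f"InfiniBand vs {name}: {advantage_pct(EXAMPLE_ELAPSED_S[name], ib):.0f}%")
```

Under this definition, a 500% advantage would mean the Ethernet run takes six times as long as the InfiniBand run.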


ECLIPSE Performance Results - Productivity

• InfiniBand increases productivity by allowing multiple jobs to run simultaneously

– Providing required productivity for reservoir simulations

• Three cases are presented

– Single job over the entire system

– Four jobs, each on two cores per CPU per server

– Eight jobs, each on one CPU core per server

• Eight jobs per node increases productivity by up to 142%

[Figure: X-axis: Number of Nodes (8, 12, 16, 20, 22, 24); Y-axis: Number of Jobs, 0 to 300; series: 1 Job per Node, 4 Jobs per Node, 8 Jobs per Node. Higher is better; InfiniBand.]
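
Productivity here appears to mean completed jobs per day, a reading supported by the Conclusions slide but not defined on this one. A minimal sketch under that assumption shows why running several slower jobs side by side can beat one fast job. The timings are placeholders, not benchmark results.

```python
SECONDS_PER_DAY = 86_400


def jobs_per_day(concurrent_jobs, elapsed_s):
    """Jobs completed per day when `concurrent_jobs` run side by side
    and each takes `elapsed_s` seconds (assumed definition)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return concurrent_jobs * SECONDS_PER_DAY / elapsed_s


def productivity_gain_pct(baseline_jobs_per_day, new_jobs_per_day):
    """Relative gain of the new configuration over the baseline, in percent."""
    return (new_jobs_per_day / baseline_jobs_per_day - 1.0) * 100.0


if __name__ == "__main__":
    # Placeholder timings: one job using the whole cluster vs. eight
    # concurrent jobs that each run four times longer.
    single = jobs_per_day(1, 900.0)
    eight = jobs_per_day(8, 3600.0)
    print(f"single job: {single:.0f} jobs/day")
    print(f"eight jobs: {eight:.0f} jobs/day")
    print(f"gain: {productivity_gain_pct(single, eight):.0f}%")
```

Each concurrent job takes longer than a single job spread over the whole cluster, but throughput rises as long as the slowdown factor stays below the number of concurrent jobs.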


ECLIPSE Performance Results - Productivity

• InfiniBand offers unparalleled productivity compared to Ethernet

– GigE shows performance decrease beyond 8 nodes

– 10GigE demonstrates no scaling beyond 16 nodes

[Figure: X-axis: Number of Nodes (4, 8, 12, 16, 20, 22); Y-axis: Number of Jobs, 0 to 250. 4 jobs on each node; higher is better.]


ECLIPSE Profiling – Data Transferred

[Figure: ECLIPSE MPI Profiling, MPI_Isend. X-axis: Message Size ([0..128B], [128B..1KB], [1..8KB], [8..256KB], [256KB..1M], [1M..Infinity]); Y-axis: Number of Messages (Millions), 0 to 10; series: 4, 8, 12, 16, 20 and 24 nodes.]


ECLIPSE Profiling – Data Transferred

[Figure: ECLIPSE MPI Profiling, MPI_Recv. X-axis: Message Size ([0..128B], [128B..1KB], [1..8KB], [8..256KB], [256KB..1M], [1M..Infinity]); Y-axis: Number of Messages (Millions), 0 to 7; series: 4, 8, 12, 16, 20 and 24 nodes.]


ECLIPSE Profiling – Message Distribution

• The majority of MPI messages are large

• Demonstrating the need for highest throughput

[Figure: ECLIPSE MPI Profiling. X-axis: Number of Nodes (4, 8, 12, 16, 20, 24); Y-axis: Percentage of Messages, 30% to 60%; series: MPI_Isend < 128 Bytes, MPI_Isend < 256 KB, MPI_Recv < 128 Bytes, MPI_Recv < 256 KB.]


Interconnect Usage by ECLIPSE

• Total server throughput increases rapidly with cluster size

[Figure: four panels titled "Data Sent", one each for 4, 8, 16 and 24 nodes. X-axis: Timing (s); Y-axis: Data Transferred (MB/s), 0 to 1400. Data shown is per node.]


ECLIPSE Profiling Summary - Interconnect

• ECLIPSE was profiled to determine networking dependency

• Majority of data transferred between compute nodes

– Done with 8KB-256KB message size

– Data transferred increases with cluster size

• Most used message sizes

– <128B messages: mainly synchronizations

– 8KB-256KB messages: data transfer

• Message size distribution

– Percentage of smaller messages (<128B) slightly decreases with cluster size

– Percentage of mid-size messages (8KB-256KB) increases with cluster size

• ECLIPSE interconnect sensitivity points

– Interconnect latency and throughput for <256KB message range

– As node number increases, interconnect throughput becomes more critical
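
The message-size breakdown summarized above can be reproduced from any trace of per-message byte counts. The sketch below groups sizes into the six bins used on the profiling charts. The deck does not name the profiling tool or say how bin edges are handled, so two things here are assumptions: the half-open convention (a 128-byte message falls in the second bin) and 1 KB = 1024 bytes. The function names are illustrative only.

```python
from bisect import bisect_right
from collections import Counter

KB = 1024        # assumption: binary kilobytes
MB = 1024 * KB

# Bin edges and labels as they appear on the profiling charts.
EDGES = [128, 1 * KB, 8 * KB, 256 * KB, 1 * MB]
LABELS = [
    "[0..128B]",
    "[128B..1KB]",
    "[1..8KB]",
    "[8..256KB]",
    "[256KB..1M]",
    "[1M..Infinity]",
]


def bin_label(size_bytes):
    """Return the chart bin for a message size (half-open bins, assumed)."""
    if size_bytes < 0:
        raise ValueError("message size cannot be negative")
    return LABELS[bisect_right(EDGES, size_bytes)]


def histogram(sizes):
    """Count messages per bin for an iterable of sizes in bytes."""
    return Counter(bin_label(s) for s in sizes)


if __name__ == "__main__":
    # Hypothetical trace: a few small synchronization messages and
    # a few mid-size data messages.
    trace = [8, 64, 64, 16 * KB, 64 * KB, 200 * KB]
    counts = histogram(trace)
    for label in LABELS:
        print(f"{label:>15}: {counts[label]}")
```

Summing bytes per bin rather than counting messages would give the data-volume view that the "Majority of data transferred" bullet refers to.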


Power Consumption/Productivity Comparison

• InfiniBand enables power-efficient simulations

• Reduces system power consumption per job by up to 66% vs GigE and 33% vs 10GigE

– For the productivity case (4 jobs per node)

– With the single-job approach, InfiniBand reduces power consumption per job by more than 82% compared to 10GigE

[Figure: Power Consumption. Y-axis: Power per Job (Wh), 0 to 2000. 4 jobs on each node; annotated reductions: 66% and 33%.]
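
The chart reports consumption per job in watt-hours, which is an energy figure. The deck does not show how it was measured. A minimal sketch, assuming it is average system power multiplied by run time and divided by the number of jobs completed in that time, follows. All input values are placeholders, not measurements from the benchmark.

```python
def energy_per_job_wh(avg_power_w, elapsed_s, jobs_completed):
    """Energy per job in Wh: average power (W) x run time (h) / jobs.
    This formula is an assumption about how the chart was derived."""
    if jobs_completed <= 0:
        raise ValueError("at least one job must complete")
    return avg_power_w * (elapsed_s / 3600.0) / jobs_completed


def reduction_pct(baseline_wh, improved_wh):
    """Percentage reduction of improved_wh relative to baseline_wh."""
    return (1.0 - improved_wh / baseline_wh) * 100.0


if __name__ == "__main__":
    # Placeholder values: same power draw and job count, but the faster
    # interconnect finishes the batch in half the time.
    slow = energy_per_job_wh(avg_power_w=6000.0, elapsed_s=3600.0, jobs_completed=4)
    fast = energy_per_job_wh(avg_power_w=6000.0, elapsed_s=1800.0, jobs_completed=4)
    print(f"slow interconnect: {slow:.0f} Wh per job")
    print(f"fast interconnect: {fast:.0f} Wh per job")
    print(f"reduction: {reduction_pct(slow, fast):.0f}%")
```

Under this model, if system power draw is roughly constant, the energy saving per job follows directly from the reduction in elapsed time.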


Conclusions

• ECLIPSE is widely used to perform reservoir simulation

– Developed by Schlumberger

• ECLIPSE performance and productivity rely on

– Scalable HPC systems and interconnect solutions

– Low-latency and high-throughput interconnect technology

– NUMA-aware application for fast access to memory

– Reasonable job distribution, which can dramatically improve productivity

  • Increasing number of jobs per day while maintaining fast run time

• Interconnect comparison shows

– InfiniBand delivers superior performance and productivity in every cluster size

– Scalability requires low latency and “zero” scalable latency

• InfiniBand enables lowest power consumption per job

– Optimizing power/job ratio


Thank You

HPC Advisory Council

[email protected]

All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.
