IBM ATS Deep Computing

© 2007 IBM Corporation

HPC Workshop – University of Kentucky
May 9, 2007 – May 10, 2007

Balaji Veeraraghavan, Ph.D.
Andrew Komornicki, Ph.D.
Gregory Verstraeten


Agenda

Introduction

Intel Cluster Architecture

Single Core Optimization

Software Tools

Parallel Programming

Power5+ Environment vs Intel

(Some) User Environment

Wrap-up


Software Priorities to Keep in Mind

Correctness

Numerical Stability

(Frequently) Accurate Discretization

Flexibility

Performance or Efficiency (Memory and Speed)


Generic Look at CPU Industry

Clock Speed, execution optimization, Cache size

Physical Limitation

Moore’s Law Over?

Hyper-threading, Multi-core, Cache, Memory subsystem, I/O subsystem

Concurrency

Intel CPU / wikipedia


Does Dual-core Mean 2x Speed?


Going Forward

Performance Optimization is a High Priority

But what does Performance Optimization mean?
– Serial vs Parallel

– Awareness to the Operating Environment

– Portability vs Fine Tuning

– Rethinking/Re-Engineering Algorithms and Data Structures


Performance

Hardware

Software Algorithm

Hardware: CPU, Memory, I/O, Network

Software: O/S, Compilers, Libraries

Algorithm: Data structures, Data Locality, procedures
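Since data locality comes up repeatedly in what follows, a small C sketch (matrix size illustrative) shows why it belongs on this list: both loop nests compute the same sum, but only the first touches the matrix in the row-major order C stores it, so it makes far better use of cache.

#include <stdio.h>

#define N 4096

static double a[N][N];              /* ~128 MB, zero-initialized in BSS */

int main(void)
{
    double sum = 0.0;

    /* Good locality: the inner index walks consecutive addresses. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Poor locality: each access strides N*sizeof(double) bytes. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);            /* keeps the loops from being optimized away */
    return 0;
}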


Current Architecture at University of Kentucky

Hardware Layer


HPC Server Environment Architecture – Existing UKY User Network

[Architecture diagram summary:]

p575 nodes: 16 CPUs @ 1.9 GHz, 64 or 128 GB memory, 2 disks @ 74 GB each

4 x3650 storage nodes, 1 x3650 management server, 1 p520Q management server, 2 user-login blades

Storage: 1 DS4800 with 5 EXP810 drawers, 80 500 GB/7.2K SATA drives, direct fiber-channel attached

HS21: 9 racks, 25 9U chassis, 340 blades. Each blade: 4 Woodcrest cores, 3.0 GHz, 8 GB, 73 GB SAS

p575: 8 systems

2 InfiniBand switches (Voltaire 9288, 288 ports)

2 Force10 GbE switches – 1 used for admin, 1 used for GPFS and other user activities


Cluster Basic Building Block: IBM HS21 Bladecenter System

IBM HS21 Bladecenter
Processors: 4 cores @ 3.0 GHz, Intel Woodcrest
Memory: 8 GB per blade
OS: Linux SuSE SLES V9
Integrated network: 1 Gbit Ethernet and InfiniBand 4X
Applications supported: all applications
Compilers: Intel C and C++ V8.0, Intel Fortran V9.0

An HS21 Bladecenter chassis contains 14 blades

UKY installed 25 chassis with 340 blades


SMP Building Block: IBM p5-575 Server

[Photo: a single IBM p5-575+ system]

Installed at UKY: 8 p575 systems, 128 total processors

p575+ System
Processors: 16 x 1.9 GHz POWER5+
Memory: 64 or 128 GB
Integrated network: 1 Gbit Ethernet and InfiniBand
OS: Linux SuSE SLES V9
Compilers: IBM XL Fortran, C and C++
Applications supported: Gaussian and other apps that need large memory or SMP


Intel Subsystem


Basic Layout (Blade level)

2 Sockets / Blade

2 Cores / Socket

4 MB L2 Cache / Socket

1333 MHz Front-side Bus

8 GB RAM Fully-Buffered DIMM

[Diagram: two dual-core sockets (Core 0 / Core 1 each) on the front-side system bus, sharing the path to memory.]


Xeon 5100 Series (Woodcrest) DP Architecture


Processor Subsystem


What Is Important to Software Performance, As Far As the CPU Is Concerned?

CPU Speed

L1/L2 cache size

L1/L2 Latency

Execution rate (keeping the processor busy)

Taking advantage of the Instruction Set

Support for Threading


Intel Core Micro-Architecture From: http://www.intel.com/technology/architecture/coremicro/#anchor2

First in the Xeon 5100 Series (Woodcrest), then Tigerton MP processors (later)
– Major change from NetBurst (current Xeon DP and Xeon MP)
– NetBurst – Socket 604; Core processors – LGA 771
• Dempsey/Tulsa are the last of the NetBurst processor family
– Completely new core based on both the NetBurst and Mobile cores
– Key features:
• Wide Dynamic Execution
• Advanced Smart Cache
• Smart Memory Access
• Advanced Digital Media Boost
• Intelligent Power Capability


Wide Dynamic Execution (From: http://www.intel.com/technology/architecture/coremicro/#anchor2)

Executes 4 instructions per clock cycle, compared to 3 instructions per cycle for NetBurst

[Diagram: NetBurst pipeline (3-wide) vs. Core microarchitecture pipeline (4-wide).]


Xeon vs. Core™ Dual-Core Design (Smart Cache)

Intel Core™ Architecture: cache-to-cache data sharing is now done through the shared cache
[Diagram: CPU0 and CPU1 share a 4 MB cache in front of a single bus interface.]

Intel Xeon Dual-Core Architecture: cache-to-cache data sharing was done through the bus interface (slow)
[Diagram: CPU0 and CPU1 each have a private 2 MB L2 cache behind the bus interface.]

In the Xeon 5100 Series (Woodcrest) the L2 cache can be shared dynamically: if one core needs all of the cache it can use it, or the cache can be shared equally


Smart Memory Access (From: http://www.intel.com/technology/architecture/coremicro/#anchor2)

Improved prefetch: cores can speculatively load data even before previous store instructions are flushed
– In NetBurst, speculation or prefetch cannot progress while a previous store is in the pipeline, because the logic did not know whether that store was in conflict


Advanced Digital Media Boost (From: http://www.intel.com/technology/architecture/coremicro/#anchor2)

Enables 128-bit SSE instructions to be executed in one clock cycle
– SSE is the Streaming SIMD instruction set used in multimedia and array computation
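To make the 128-bit width concrete, here is a minimal sketch using SSE intrinsics (values are arbitrary): the single _mm_add_ps below performs four single-precision additions in one SIMD instruction.

#include <stdio.h>
#include <xmmintrin.h>              /* SSE intrinsics */

int main(void)
{
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);     /* 4 packed floats */
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
    __m128 c = _mm_add_ps(a, b);    /* one 128-bit SIMD add: 4 sums at once */

    float out[4];
    _mm_storeu_ps(out, c);          /* unaligned store back to plain memory */
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}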


Intelligent Power Capability (From: http://www.intel.com/technology/architecture/coremicro/#anchor2)

Intelligent Power management uses ultra-fine grained chip control to power down areas of the chip which are not active and turn them back on in an instant when needed for execution


Hyper-Threading

To improve Single Core performance of
– Multi-threaded Application

– Multi-threaded Operating System

– Single-threaded Application in Multi-tasking environment


[Diagram: a physical core with one architectural state (AS) vs. a Hyper-Threaded core presenting two architectural states – one physical core, two logical processors. AS = Architectural State]


Memory Subsystem


Memory Operation – Bandwidth vs. Latency

Memory bandwidth is the sustainable throughput of a memory configuration for a particular workload
– Usually measured under ideal and optimal conditions
• Sequential cache-line reads as rapidly as possible with no I/O – aka the STREAM Benchmark

Unloaded memory latency, usually called "memory latency", refers to the time it takes to read memory when the system is idle
– Unloaded latencies are typically bandied about by technical experts
– Usually expressed in nSec for the fastest possible access supported by the memory configuration
• Typical x64 unloaded memory latencies are 50 – 200 nSec
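To make "sustainable throughput" concrete, here is a minimal STREAM-style triad sketch (array size and scalar illustrative; the official benchmark does more careful timing and validation): it streams three large arrays and reports sustained MB/s.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)                 /* 16M doubles per array - larger than any cache */

int main(void)
{
    double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b),
           *c = malloc(N * sizeof *c);
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];   /* the triad: 2 reads + 1 write per element */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = 3.0 * N * sizeof(double);      /* traffic moved through memory */
    printf("triad: %.2f MB/s (a[0]=%g)\n", bytes / sec / 1e6, a[0]);
    free(a); free(b); free(c);
    return 0;
}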


Loaded memory latency is the average time to read and write memory while the system is running a particular application
– Loaded memory latency is critically important to system performance
– Loaded latency depends upon the application workload
• Sensitive to read/write, cache-hit, and local/remote memory ratios
• Usually measured running application workloads

It is important to appreciate how these characteristics correlate to system-level performance

So let's first learn how memory works!


Basic Memory Read - Overview

[Diagram: CPU → memory controller → DIMMs. The address passes through decode, a row address strobe (RAS), the RAS-to-CAS delay, and a column address strobe (CAS) before the data transfers back to the CPU.]

Review: Steps To Access Memory
1. Memory Controller Decode Latency
2. RAS Latency
3. RAS to CAS Latency
4. CAS Latency
5. Data Transfer
6. Pre-charge *


Basic Memory Read Continued – Sequential Access

[Diagram, repeated over three animation steps for Data1, Data2, and Data3: once the row is open, each successive word needs only a column address strobe – CAS latency, then data transfer from the DIMMs to the CPU.]


Sequential Memory Operation Overview

[Diagram: with a row open, a burst of words (Data0 … DataF) streams from the DIMMs with only CAS latency per access. The CPU-to-memory interface becomes a potential bandwidth bottleneck!]


Random Memory Operation Overview

[Diagram: every random access pays the full sequence – RAS latency, RAS-to-CAS latency, CAS latency, data transfer, and pre-charge. Memory latency becomes the bottleneck!]


Summary of Memory Operation

Sequential memory accesses are very fast and can saturate the bandwidth of the processor-to-memory interface
– No Row Address Latency (which is a long time)
– Only fast Column Address Latency (which is usually short)
– Very low memory decode latency for the N+1 address
• Just increment the address
– Data Transfer

But each new random address must incur long latency:
– Full memory controller decode latency
– Row Address Latency
– RAS to CAS latency
– CAS Latency
– Data Transfer
– Pre-charge*
• This was omitted earlier to simplify the discussion – pre-charge is the time to close a row (or page) and prepare a new row for reading
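The sequential/random gap is easy to demonstrate from user space. This sketch (array size illustrative; single-threaded) times a linear sweep against a dependent random pointer chase over the same array; the chase pays close to the full random-access latency on every step.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)                       /* 16M entries, far larger than cache */

static double seconds(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + t.tv_nsec / 1e9;
}

int main(void)
{
    size_t *next = malloc(N * sizeof *next);
    /* Build one random cycle through the array (Sattolo's algorithm). */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = rand() % i;            /* 0 <= j < i */
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    double t0 = seconds();
    size_t sum = 0;
    for (size_t i = 0; i < N; i++) sum += next[i];     /* sequential sweep */
    double seq = seconds() - t0;

    t0 = seconds();
    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = next[p];        /* dependent random chase */
    double rnd = seconds() - t0;

    printf("sequential %.3fs  random %.3fs  (sum=%zu p=%zu)\n", seq, rnd, sum, p);
    free(next);
    return 0;
}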


Memory Bandwidth Observations

As the number of threads or cores increases…
– The randomness of memory accesses tends to increase as well
– So for systems with greater numbers of processors, random loaded memory latency often has a greater effect on system performance than DIMM bandwidth

But for applications that use few threads that access memory sequentially, the sustainable bandwidth of the system has a greater effect on performance than memory latency


CPU Bottleneck Performance Fundamentals

Core Intensive - Processor is executing instructions as fast as CPU core can process

Latency Intensive - Processor is executing instructions as fast as memory latency allows

Bandwidth Intensive - Processor is executing instructions as fast as memory bandwidth allows

Potential Processor Bottlenecks

[Diagram: a triangle whose corners are Core Intensive, Bandwidth Intensive, and Latency Intensive.]


Xeon vs. Opteron Performance Fundamentals

Potential Processor Bottlenecks

[Diagram: the same bottleneck triangle, annotated. Core intensive: Woodcrest and Tulsa win by as much as 20+%. Latency intensive: Woodcrest and Opteron are about the same. Bandwidth intensive: Opteron wins by as much as 2X.]


Question: Which Design Has The Lower Unloaded Latency?

[Diagram: a CPU with a 2-channel memory controller vs. a CPU with a 4-channel memory controller, each channel populated with DRAM. Both designs walk the same access sequence: RAS latency, RAS-to-CAS latency, CAS latency, data transfer, pre-charge.]

2-channel design: 85 nSeconds total. 4-channel design: 100 nSeconds total – an extra 15 nSec of latency for the decode of 4 channels.


But Which Design Has The Lower Loaded Latency?

[Diagram: the same 2-channel and 4-channel designs, now servicing several outstanding requests at once.]

2-channel design: 2 transfers in 90 nSec – average loaded latency = 45 nSec. 4-channel design: 4 transfers in 115 nSec – average loaded latency = 29 nSec!


As Memory Gets Faster There is Another Challenge: Capacity vs. Clock Speed

Memory capacity is limited by the number of DIMMs designers can economically engineer into the system

But with most memory technologies, the sustainable clock speed of the memory decreases as the number of DIMMs on a memory channel increases
– Due to the capacitance loading of each successive DIMM installed

The evolution to solve this problem has been…
– SDRAM evolves to DDR
– DDR evolves to DDR2
– DDR2 evolves to FBD


And We Still Have The Capacity vs. Speed Trade-off

DDR2 DIMMs add electrical loading to the memory bus
– This means that as memory clock speed increases, the number of DIMMs that can be supported on the memory channel decreases because of electrical loading

[Diagram: three memory buses at 400 MHz, 533 MHz, and 667 MHz, each supporting progressively fewer DIMMs. Not representative of any particular system; intended to illustrate speed and DIMM-count limitations.]


FBDIMM Solves This Problem With a Serial Memory Bus And an On-DIMM Advanced Memory Buffer (AMB)

[Diagram: the memory controller reaches the DIMMs over a serial address bus and a serial data bus; the DIMMs use the same DDR2 DRAM technology.]


FBDIMM Serial Bus Adds Latency Due to Hops

[Diagram: the address travels out and the data travels back DIMM-to-DIMM along the serial bus, adding a hop of latency at each DIMM between the memory controller and the target.]


FBDIMM Serial Interface Reduces Wiring Complexity, Which Enables a Greater Number of Memory Channels

[Diagram: board layouts of DIMM connectors and memory-controller wiring – 2 channels of FB-DIMMs fit in roughly the wiring space of 1 channel of DDR2 DIMMs.]


Additional Memory Channels = Greater Capacity And Greater Throughput Which Offsets Additional Latency Under Load

[Diagram: a DDR2 memory controller driving 2 channels of DIMMs (less memory bandwidth) vs. an FBD memory controller driving 4 channels (greater memory bandwidth).]


Additional Memory Channels = Greater Capacity And Greater Throughput Which Offsets Additional Latency Under Load

Source: Intel


Measured DDR2 vs. FBD Memory Throughput

[Bar chart: memory throughput (bytes/sec) for a 3.2 GHz Xeon with DDR2 vs. a 3.0 GHz Woodcrest with FBD. Sequential reads: 39% increase. Random reads: 2.8x increase. Source – System x Performance Lab]


Memory Summary

Existing DDR2 memory employs a multi-drop parallel bus
– Electrical loading increases as DIMMs are added to the bus
• This limits the speed of the memory bus
– The parallel bus limits the number of memory channels in a system
• Physical wiring space limits the number of memory channels on planar boards
• Memory controller pin-count is too great with more than two channels

FBDIMM solves the problem by placing an Advanced Memory Buffer (AMB) on the DDR2 DIMM and employing a serial memory bus
– The serial bus greatly reduces wiring requirements and enables a greater number of memory buses in a system
• This increases capacity and throughput
– The serial AMB adds latency and increases DIMM power consumption
• ~5 Watts/DIMM
– Expect second-generation AMBs to consume even less power
• But greater throughput results in LOWER average latency when under load, improving performance


So What Does It Mean?

FBD memory is a technical solution to the problems encountered by using standard DDR1 or DDR2 DIMMs, which require a parallel bus
– FBD adds an Advanced Memory Buffer (AMB) to the standard DDR2 DIMM to enable a serial interface
• This adds hardware to the DIMM that consumes additional power – about 3 – 5 Watts per DIMM
– By using less board space than the parallel interface of DDR, FBD enables 4 channels of memory vs. 2 channels (standard DDR1 or 2)
– FBD enables full-duplex operation (concurrent reads and writes)
• DDR is half-duplex (either a read OR a write)
– Four channels and concurrent reads/writes translate to much higher memory performance, especially for random workloads

Bottom line – FBD has nearly 3x higher throughput for multi-threaded applications, but consumes slightly more power and adds some latency


HPC Application Spectrum

Bandwidth and processor compute capability assessed

– Applications span the spectrum

– No single industry accepted metric exists

[Spectrum diagram with Bandwidth Limited at one end and Core Limited at the other (~1 byte per flop marks the crossover). Applications placed along it include: Stream, DAXPY, DDOT, and SparseMV toward the bandwidth-limited end; Seismic, Comp Chem, Auto Crash, Weather, Petro Reservoir, Auto NVH, Simple Fluid Dynamics, and Ocean Models across the middle; SPECfp2000, Linpack, and DGEMM toward the core-limited end. Opteron leadership increases toward the bandwidth-limited end; Xeon leadership increases toward the core-limited end.]


Relative HPC Benchmark Results – HPC Workloads: Memory Bandwidth Constrained???

[Bar chart: relative performance gain of a 3.0 GHz Woodcrest compared to a 2.4 GHz Opteron, all 2-socket configurations]

ABAQUS STD: 1.31   Fluent: 1.26   LS-Dyna 3 Car: 1.19   SEISM: 1.4   CPMD 64 Atom: 1.46   CHARMm: 1.32


I/O (Local) Subsystem

SAS drives: 73 GB, 10K RPM

We will look at GPFS later


Serial SCSI (SAS) vs Parallel SCSI

Parallel SCSI: 320 MB/s, half-duplex, shared bus; race condition on bits

Serial SCSI: 600 MB/s, full-duplex, point-to-point; no bit race
– Differential signal pairs, each direction @ 300 MB/s


SATA, SAS, FC Comparison


Network Subsystem


PCI-E Bus

Point-to-Point, Serial, Low-Voltage Interconnect

Low-latency communications to maximize data throughput and efficiency

Uses chip-to-chip or board-to-board (cabling) interconnect

Scalable performance via aggregate Lanes

Data Integrity and Error-handling focus



Node Communication Considerations

Protocol
– MPI
– TCP & UDP
– RDMA
– Multicast

Traffic Patterns (hierarchical and non-hierarchical)

Packet Size Distribution of messages (large vs small)
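To see what these choices mean in practice, a classic ping-pong sketch (message size illustrative) measures round-trip latency between two MPI ranks; swept over message sizes it also exposes the large- vs. small-message behavior noted above.

#include <mpi.h>
#include <stdio.h>

#define SIZE 8                      /* message size in bytes (illustrative) */
#define REPS 1000                   /* round trips to average over */

int main(int argc, char **argv)
{
    int rank;
    char buf[SIZE];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {            /* rank 0: send, then wait for the echo */
            MPI_Send(buf, SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {     /* rank 1: echo it back */
            MPI_Recv(buf, SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = MPI_Wtime() - t0;
    if (rank == 0)
        printf("half round-trip latency: %.2f usec\n", t / REPS / 2 * 1e6);
    MPI_Finalize();
    return 0;
}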


Node-to-Node Interconnect options

Ethernet
– Ubiquitous, low cost, low complexity
– High message latency

Optimized Ethernet
– RDMA, TCP offload engines
– Optimal for single server - multi-client

Specialized Interconnect
– Low latency and high bandwidth
– E.g., Myrinet, Infiniband, Quadrics

Voltaire Infiniband 9288 – 10 Gb/s


Infiniband Characteristics

Standards-based

Optimized for HPC

Supports Server and Storage attachments

Built-in RDMA capabilities

Bandwidth for 4x (SDR) is 10 Gb/s – (measured: 8Gb/s)
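The measured 8 Gb/s is what the link encoding predicts: a 4x SDR link carries 4 lanes x 2.5 Gb/s = 10 Gb/s of signaling, and 8b/10b encoding leaves at most 8/10 of that as payload, i.e. 8 Gb/s.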

[Diagram: the traditional path runs application → socket layer → TCP/IP transport → driver (crossing from user space into the kernel) → hardware; the RDMA path bypasses the kernel, going directly from the application to the hardware.]


IB Protocols

SCSI RDMA Protocol (SRP) – Fibre Channel SAN attachment

Direct Access Programming Library (uDAPL) – flexible RDMA programming API (Oracle RAC)

Sockets Direct Protocol (SDP) – accelerates socket-based applications that use RC or RDMA

MPI – HPC applications, low latency

IP over IB (IPoIB) – enables IP-based applications over IB


Voltaire IB Performance


Software Layer 1 (Operating System)


Remember: Users typically do not have root access


Linux

Monolithic kernel, but modular like a micro-kernel
Dynamic loading of kernel modules
Preemptive; SMP supported
Threads are just like any other processes
OO device model
Eliminates Unix features that are considered poor
Free

[Diagram: applications (App. 1–3) in user space call through the System Call Interface into the kernel subsystems and device drivers in kernel space, which sit above the hardware.]

Source: Linux Kernel Development by Robert Love


Some Kernel parameters relevant to HPC users - 1

The kernel parameters can be set in /etc/sysctl.conf; run "sysctl -p" to apply them.

Shared memory

– SHMMAX: defines the maximum size (in bytes) of a shared memory segment
• kernel.shmmax = 2147483648 (default: 33554432)

– SHMMNI: defines the maximum number of shared memory segments system wide
• kernel.shmmni = 4096 (default)

– SHMALL: defines the total amount of shared memory (in pages) that can be used at one time on the system. To be set at least to ceil(SHMMAX/PAGE_SIZE)
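To see SHMMAX from the application side, here is a minimal System V shared memory sketch (the 1 GB request is illustrative): shmget fails with EINVAL when the requested size exceeds the limit.

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    size_t size = 1UL << 30;              /* 1 GB request - must be <= SHMMAX */
    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id < 0) {                         /* EINVAL if size > SHMMAX */
        fprintf(stderr, "shmget: %s\n", strerror(errno));
        return 1;
    }
    void *p = shmat(id, NULL, 0);         /* attach the segment */
    if (p != (void *) -1)
        shmdt(p);                         /* detach */
    shmctl(id, IPC_RMID, NULL);           /* mark the segment for removal */
    return 0;
}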


Some Kernel parameters relevant to HPC users - 2

Semaphores
– SEMMSL: controls the maximum number of semaphores per semaphore set.
– SEMMNI: controls the maximum number of semaphore sets on the entire Linux system.
– SEMMNS: controls the maximum number of semaphores (not semaphore sets) on the entire Linux system.
– SEMOPM: controls the number of semaphore operations that can be performed per semop system call.

cat /proc/sys/kernel/sem
250 256000 32 1024
(SEMMSL SEMMNS SEMOPM SEMMNI)

kernel.sem="250 32000 100 128"


Some Kernel parameters relevant to HPC users - 3

Large pages

vm.nr_hugepages = 1000

vm.disable_cap_mlock = 1

Maximum number of open files

fs.file-max=65536

Other parameters: I/O scheduler, network receive/send buffers, etc.


User Limits

/etc/security/limits.conf
<domain> <type> <item> <value>

Items:
core - limits the core file size (KB)
data - max data size (KB)
fsize - maximum filesize (KB)
memlock - max locked-in-memory address space (KB)
nofile - max number of open files
rss - max resident set size (KB)
stack - max stack size (KB)
cpu - max CPU time (MIN)
nproc - max number of processes
as - address space limit
maxlogins - max number of logins for this user
maxsyslogins - max number of logins on the system
priority - the priority to run user process with
locks - max number of file locks the user can hold
sigpending - max number of pending signals
msgqueue - max memory used by POSIX message queues (bytes)
nice - max nice priority allowed to raise to
rtprio - max realtime priority


ulimit – user command to control limits


IPCS and IPCRM – Interprocess communication

ipcs -a shows all the active message queues, semaphores, shared memory segments

ipcs -q for active message queues

ipcs -m for active shared memory segments

ipcs -s for active semaphores

ipcrm [-q msgid | -m shmid | -s semid] to delete the particular identifier


Server Performance Indicators

CPU

Memory

Storage IO

Network IO

Application internals

Application performance


CPU - /proc/cpuinfo

cat /proc/cpuinfo

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 14
model name      : Intel(R) Xeon(TM) CPU 000 @ 2.00GHz
stepping        : 8
cpu MHz         : 2000.361
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe constant_tsc pni monitor vmx est tm2 xtpr
bogomips        : 4005.92


CPU - monitoring the utilization

vmstat: show the vmstat output with an interval of 10 sec:
vmstat 10
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd  free  buff  cache  si  so  bi  bo   in   cs us sy id wa
 1  0 327308 11552 10860 138800   2   2  24  35   17   81 10  2 87  2
 0  0 327308 11428 10876 138800  12   0  12  84 1160 1514 10  2 85  3
 0  0 327308 10428 10892 138800  28   0  28 128 1134 1563 12  9 76  3
 0  0 327308 10056 10896 139048  72   0 328   0 1164 1534 15 14 61 10

sar:
- collect the system statistics every 10 s, 1000 times, and store them in file.sar:
sar -A -o file.sar 10 1000
- show the CPU utilisation for the recorded period:
sar -u -f file.sar
- show the process queue length and load averages:
sar -q -f file.sar


Memory - /proc/meminfo

cat /proc/meminfo
MemTotal:        8309276 kB
MemFree:         6550956 kB
Buffers:          182356 kB
Cached:          1484032 kB
SwapCached:            0 kB
Active:           760512 kB
Inactive:         915900 kB
HighTotal:       7470784 kB
HighFree:        5969668 kB
LowTotal:         838492 kB
LowFree:          581288 kB
SwapTotal:       4192956 kB
SwapFree:        4192956 kB
Dirty:                 4 kB
Writeback:             0 kB
Mapped:            21592 kB
Slab:              62376 kB
CommitLimit:     8347592 kB
Committed_AS:      68376 kB
PageTables:          600 kB
VmallocTotal:     112632 kB
VmallocUsed:        5516 kB
VmallocChunk:     106524 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
Hugepagesize:       2048 kB


Memory - monitoring the utilization

vmstat:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd  free  buff  cache  si  so  bi  bo   in   cs us sy id wa
 1  0 327308 11552 10860 138800   2   2  24  35   17   81 10  2 87  2
 0  0 327308 11428 10860 138800   0   0   0   0 1202 1573  9  3 88  0
 0  0 327308 11428 10876 138800  12   0  12  84 1160 1514 10  2 85  3

sar:
- show the paging activity for the recorded period:
sar -B -f file.sar
- show the memory and swap space utilization statistics:
sar -r -f file.sar


IO - monitoring the utilization

vmstat:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd  free  buff  cache  si  so  bi  bo   in   cs us sy id wa
 1  0 327308 11552 10860 138800   2   2  24  35   17   81 10  2 87  2
 0  0 327308 11428 10860 138800   0   0   0   0 1202 1573  9  3 88  0
 0  0 327308 11428 10876 138800  12   0  12  84 1160 1514 10  2 85  3

sar:
- show the IO activity globally for the system:
sar -b -f file.sar
- show the IO activity for each device (sector = 512 bytes):
sar -d -f file.sar

iostat:
avg-cpu:  %user  %nice  %sys  %iowait  %idle
           0.03   0.00  0.01     0.02  99.94
Device:  tps  Blk_read/s  Blk_wrtn/s  Blk_read   Blk_wrtn
sda     0.26        0.58        5.36   2932784   26939926
sdb     0.06        1.45        7.44   7293650   37386696
sdc     0.00        0.00        0.00      8182          0
sdd     0.00        0.00        0.00      8182          0


Network - monitoring the utilization

sar:
- show the network device statistics for the recorded period:
sar -n DEV -f file.sar

01:00:01 PM  IFACE  rxpck/s  txpck/s  rxbyt/s  txbyt/s  rxcmp/s  txcmp/s
01:10:01 PM  lo        0.00     0.00     0.00     0.00     0.00     0.00
01:10:01 PM  eth0      2.33     0.00   186.47     0.00     0.00     0.00
01:10:01 PM  eth1      0.00     0.00     0.00     0.00     0.00     0.00
01:10:01 PM  eth2      0.00     0.00     0.00     0.00     0.00     0.00
Average:     lo        0.00     0.00     0.10     0.10     0.00     0.00
Average:     eth0      2.36     0.02   187.89     3.02     0.00     0.00
Average:     eth1      0.00     0.00     0.00     0.00     0.00     0.00
Average:     eth2      0.00     0.00     0.00     0.00     0.00     0.00


Network - monitoring the utilization

Ntop provides detailed and graphical network statistics

www.ntop.org


NMON Performance Tool

CPU Utilization

Memory Use

Kernel Statistics and run queue information

Disk I/O information

Network I/O information

Paging space and rate

etc

http://www-128.ibm.com/developerworks/aix/library/au-analyze_aix/index.html

http://www-941.haw.ibm.com/collaboration/wiki/display/WikiPtype/nmon


Process affinity

taskset

usage: taskset [options] [mask | cpu-list] [pid | cmd [args...]]
set or get the affinity of a process

-p, --pid        operate on existing given pid
-c, --cpu-list   display and specify cpus in list format
-h, --help       display this help
-v, --version    output version information

The default behavior is to run a new command:
taskset 03 sshd -b 1024

You can retrieve the mask of an existing task:
taskset -p 700

Or set it:
taskset -p 03 700

List format uses a comma-separated list instead of a mask:
taskset -pc 0,3,7-11 700

Ranges in list format can take a stride argument:
e.g. 0-31:2 is equivalent to mask 0x55555555
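For programs that want to pin themselves rather than rely on taskset, a minimal sketch using the Linux sched_setaffinity interface (the CPU number is an illustrative choice) looks like this:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);                 /* start with an empty CPU mask */
    CPU_SET(1, &set);               /* pin to CPU 1 (illustrative) */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* pid 0 = this process */
        perror("sched_setaffinity");
        return 1;
    }
    printf("pid %d now runs only on CPU 1\n", (int)getpid());
    return 0;
}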


Process scheduling

chrt

usage: chrt [options] [prio] [pid | cmd [args...]]
manipulate real-time attributes of a process

-f, --fifo      set policy to SCHED_FIFO
-p, --pid       operate on existing given pid
-m, --max       show min and max valid priorities
-o, --other     set policy to SCHED_OTHER
-r, --rr        set policy to SCHED_RR (default)
-h, --help      display this help
-v, --verbose   display status information
-V, --version   output version information

You must give a priority if changing policy.


GPFS – General Parallel File System

Parallel Cluster File System Based on Shared Disk (SAN) Model

Cluster – fabric-attached nodes (IP, SAN, …)

Shared disk - all data and metadata on fabric-attached disk

Parallel - data and metadata flow from all of the nodes to all of the disks in parallel

[Diagram: GPFS file system nodes connected through a switching fabric (system or storage area network) to shared disks (SAN-attached or network block device).]


What GPFS is not

Not a client-server file system like NFS, CIFS, or AFS/DFS:
– no single-server bottleneck
– no protocol overhead for data transfer
– no distinct metadata server


Why is GPFS needed?

Clustered applications impose new requirements on the file system

Parallel applications need fine-grained access within a file from multiple nodes
Serial applications are dynamically assigned to processors based on load
– need high-performance access to their data from wherever they run

Both require good availability of data and normal file system semantics

GPFS supports this via:

uniform access – single-system image across cluster
conventional Posix interface – no program modification
high capacity – multi-TB files, petabyte file systems
high throughput – wide striping, large blocks, many GB/sec to one file
parallel data and metadata access – shared disk and distributed locking
reliability and fault-tolerance – node and disk failures
online system management – dynamic configuration and monitoring


Parallel File Access from Multiple Nodes

GPFS allows parallel applications on multiple nodes to access non-overlapping ranges of a file with no conflict

Byte-range locks serialize access to overlapping ranges of a file

[Diagram: node0–node3 accessing one GPFS file; nodes 2 and 3 are both trying to access the same section of the file.]

Concurrency achieved by a token-based distributed lock manager


Large File Block Size

GPFS is designed assuming that most files in the file system are large and need to be accessed quickly

Conventional file systems store data in small blocks to pack data more densely and use disk more efficiently

GPFS uses large blocks (256 KB default) to optimize disk transfer speed

This means that realized file-system performance can be much better.

This also means that GPFS does not store small files efficiently


Sequential Access patterns are best

Advice:

Access records sequentially

Multi-node: make every process responsible for a 1/n contiguous chunk of the file
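A minimal POSIX sketch of that 1/n-chunk advice (path illustrative; rank and nprocs would come from MPI; error handling omitted for brevity):

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>

/* Read this process's 1/n contiguous chunk of a shared file. */
void read_my_chunk(const char *path, int rank, int nprocs)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    off_t chunk = st.st_size / nprocs;      /* size of each chunk */
    off_t off   = (off_t)rank * chunk;      /* my starting offset */
    if (rank == nprocs - 1)                 /* last rank also takes */
        chunk = st.st_size - off;           /* the remainder */
    char *buf = malloc(chunk);
    pread(fd, buf, chunk, off);             /* sequential within my range */
    /* ... process buf ... */
    free(buf);
    close(fd);
}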



GPFS Usage Model

Naïve Model
Ignore that it is a parallel file system – treat it like any other
For sequential I/O, this is okay

Standard Posix Model
Use standard Posix file functions (open, lseek, write, close, etc.)
Low level; great performance using direct-access files

MPI-IO (MPI-2) Model
Full parallel I/O features – best suited for HPC applications


MPI-IO Models (some)

Node 0 gathers and writes sequential Posix I/O files

Each node independently and in parallel doing sequential posix I/O to separate files

Each node independently and in parallel doing MPI-IO to separate files

Each node independently and in parallel doing MPI-IO to a single file

Reading using individual file pointers using MPI version of lseek

Collective IO
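As a sketch of the "MPI-IO to a single file" model (file name and block size illustrative; error checks omitted), each rank writes its own 1/n slot of a shared file with a collective call:

#include <mpi.h>

#define N 1024                                   /* doubles per rank (illustrative) */

int main(int argc, char **argv)
{
    int rank;
    double buf[N];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < N; i++) buf[i] = rank;   /* fill with something recognizable */

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset off = (MPI_Offset)rank * N * sizeof(double);   /* my 1/n slot */
    MPI_File_write_at_all(fh, off, buf, N, MPI_DOUBLE,        /* collective write */
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}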


For Efficient use of GPFS:

Make friends with System administrators to fine tune GPFS parameters

Block Size

Stripe Method

Indirect Block Size

(just the basic parameters you need to know)


GPFS Resources

Websites
– Main GPFS website:
• http://www-1.ibm.com/servers/eserver/clusters/software/gpfs.htm
– GPFS Documentation:
• http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfsbooks.html
– GPFS FAQs:
• http://publib.boulder.ibm.com/infocenter/clresctr/index.jsp?topic=/com.ibm.cluster.gpfs.doc/gpfs_faqs/gpfs_faqs.html
– Clusters Literature:
• http://www-03.ibm.com/servers/eserver/clusters/library/wp_aix_lit.html
• http://www.broadcastpapers.com/asset/IBMGPFS01.htm