Lecture 4: Parallel Computing Performance (ece.uprm.edu/~wrivera/ICOM6025/Lecture4.pdf)
Dr. Wilson Rivera
ICOM 6025: High Performance Computing
Electrical and Computer Engineering Department
University of Puerto Rico
Lecture 4: Parallel Computing Performance
Outline

• Goal: understand different methodologies and metrics to monitor, evaluate, and act on performance considerations.
  – Performance challenges
  – Performance metrics
  – Energy efficiency
  – Benchmarking
  – Monitoring tools
Performance Model Challenges

[Diagram: a performance model maps workloads (transactions, video streaming, batch jobs), user behavior, and service level objectives to resource allocation across CPU, memory, network, storage, and power.]
Performance Model Challenges

• Need realistic metrics to measure scalability
• Impact on design of architectures and applications
• Multiple parameters involved

[Diagram: parallel algorithm + parallel machine = parallel system]
Units of Measure

• High Performance Computing (HPC) units are:
  – Flop: floating point operation
  – Flop/s: floating point operations per second
  – Bytes: size of data (a double precision floating point number is 8 bytes)
• Typical sizes are millions, billions, trillions, …

  Mega   Mflop/s = 10^6 flop/sec    Mbyte = 2^20 ≈ 10^6 bytes
  Giga   Gflop/s = 10^9 flop/sec    Gbyte = 2^30 ≈ 10^9 bytes
  Tera   Tflop/s = 10^12 flop/sec   Tbyte = 2^40 ≈ 10^12 bytes
  Peta   Pflop/s = 10^15 flop/sec   Pbyte = 2^50 ≈ 10^15 bytes
  Exa    Eflop/s = 10^18 flop/sec   Ebyte = 2^60 ≈ 10^18 bytes
  Zetta  Zflop/s = 10^21 flop/sec   Zbyte = 2^70 ≈ 10^21 bytes
  Yotta  Yflop/s = 10^24 flop/sec   Ybyte = 2^80 ≈ 10^24 bytes
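As a quick sanity check on these units, a minimal sketch (the function name is illustrative, not from the lecture):

```python
# Sketch: how long a fixed amount of floating point work takes at a
# given machine rate, using the prefixes above.
def time_seconds(total_flops, rate_flops_per_sec):
    """Seconds needed to execute total_flops at the given rate."""
    return total_flops / rate_flops_per_sec

# One exaflop (10^18 flops) of work on a 1 Pflop/s (10^15 flop/s) machine:
print(time_seconds(1e18, 1e15))  # -> 1000.0 seconds
```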
Overhead of Parallelism
• Parallelism overheads include:
  – Cost of starting a thread or process (latency)
  – Cost of communicating or sharing data (bandwidth)
  – Cost of synchronizing
  – Redundant computation
• Each of these can be in the range of milliseconds (= millions of flops) on some systems
• Tradeoff: an algorithm needs sufficiently large units of work (coarse granularity) to run fast in parallel, but not so large that there is not enough parallel work
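The granularity tradeoff can be made concrete with a toy cost model (the model and its constants are assumptions for illustration, not from the lecture):

```python
# Toy cost model: startup latency, work divided p ways, and a
# synchronization cost that grows with the number of workers.
def parallel_time(work, p, startup=1e-3, sync=1e-4):
    """Estimated wall time for 'work' seconds of computation on p workers."""
    return startup + work / p + sync * p

# Adding workers helps at first, then synchronization dominates:
# somewhere in between lies the optimal granularity.
for p in (1, 10, 100, 1000):
    print(p, parallel_time(1.0, p))
```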
Scalability
• Horizontal scalability
  – Add nodes to the system
• Vertical scalability
  – Add resources (CPU, memory) per node
Scalability Models
• Fixed-Problem Size Model
  – Speedup
  – Amdahl's Law
• Memory-Constrained Model
  – Scaled speedup (Gustafson)
  – Scaled speedup is less than linear (Flatt & Kennedy)
  – Isoefficiency (Kumar & Gupta)
• Fixed-Time Scaling Model
  – Isospeed (Sun & Rover)
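Gustafson's scaled speedup from the list above can be sketched as follows (a minimal sketch; the function name is illustrative):

```python
def gustafson_speedup(f, p):
    """Scaled speedup: the serial fraction f stays fixed while the
    parallel part of the workload grows with p, so S = p - f(p - 1)."""
    return p - f * (p - 1)

# With a 5% serial fraction on 100 processors the scaled speedup is
# about 95, far above the fixed-size (Amdahl) bound of 1/f = 20.
print(gustafson_speedup(0.05, 100))
```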
Speedup
Let W be the problem size (total work, so Ts = W) and T0(W, p) the total parallel overhead on p processors:

  Sp = Ts / Tp

  Tp ≅ (W + T0(W, p)) / p

  E = Sp / p = 1 / (1 + T0(W, p) / W)

  p ↑ → E ↓ (for fixed W)
  W ↑ → E ↑ (for fixed p)

[Figure: speedup vs. number of processors, ideal (linear) and actual curves]
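The efficiency relation E = 1/(1 + T0/W) can be explored numerically; a minimal sketch with illustrative numbers:

```python
def efficiency(W, T0):
    """Parallel efficiency E = 1 / (1 + T0/W): overhead T0 relative to useful work W."""
    return 1.0 / (1.0 + T0 / W)

# Fixed overhead, growing problem size: efficiency climbs toward 1,
# which is why growing W with p can hold E constant (isoefficiency).
print(efficiency(1e6, 1e4))  # about 0.990
print(efficiency(1e8, 1e4))  # about 0.9999
```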
Amdahl’s Law
• Let f be the fraction of a program that is sequential; 1-f is the fraction that can be parallelized
• Let T1 be the execution time on 1 processor
• Let Tp be the execution time on p processors
• Sp is the speedup

  Sp = T1 / Tp
     = T1 / (fT1 + (1-f)T1/p)
     = 1 / (f + (1-f)/p)

• As p → ∞, Sp → 1/f
Amdahl’s Law places a strict limit on the speedup that can be realized by using multiple processors.
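A direct transcription of the formula (a sketch; the function name is illustrative):

```python
def amdahl_speedup(f, p):
    """Sp = 1 / (f + (1 - f)/p) for serial fraction f on p processors."""
    return 1.0 / (f + (1.0 - f) / p)

# With f = 0.1 the speedup is bounded by 1/f = 10 no matter how
# many processors are added:
print(amdahl_speedup(0.1, 10))     # about 5.26
print(amdahl_speedup(0.1, 10**6))  # just under 10.0
```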
Amdahl’s Law and Scalability
• Scalability
  – Ability of a parallel algorithm to achieve performance gains proportional to the number of processors and the size of the problem
• When does Amdahl's Law apply?
  – When the problem size is fixed
  – Strong scaling (p → ∞, Sp = S∞ → 1/f)
  – The speedup bound is determined by the degree of sequential execution time in the computation, not the number of processors!
  – Perfect efficiency is hard to achieve
Introduction to Parallel Computing, University of Oregon, IPCC
Isoefficiency
• Relates the problem size to the maximum number of processors that can be used in a cost-optimal fashion
• A parallel system is cost-optimal iff pTp = O(W)
• A parallel system is scalable iff its isoefficiency function exists
  – If W must grow exponentially with p, the parallel system is poorly scalable
  – If W grows nearly linearly with p, the parallel system is highly scalable
Example: Maximum Element
In: a[]
Out: maximum element in a

sequential_maximum(a) {
    n = a.length
    max = a[0]
    for i = 1 to n - 1 {
        if (a[i] > max)
            max = a[i]
    }
    return max
}

Trace on a = [21, 11, 23, 17, 48, 33, 22, 41]:
  running max: 21, 23, 23, 48, 48, 48, 48
Time: O(n)
Example: Parallel Maximum
Reduction tree for a = [21, 11, 23, 17, 48, 33, 22, 41]:

  21  11  23  17  48  33  22  41
    21      23      48      41
        23              48
                48

Time: O(lg n)
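The reduction tree above can be sketched in code (a serial simulation of the parallel combining pattern, not an actual parallel implementation):

```python
# Serial simulation of the O(lg n) parallel maximum: each pass pairs
# up neighbors and keeps the larger, halving the candidates per level.
def tree_max(a):
    level = list(a)
    while len(level) > 1:
        nxt = [max(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # odd element carries over to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]

print(tree_max([21, 11, 23, 17, 48, 33, 22, 41]))  # -> 48
```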
Isoefficiency Analysis

• Adding n numbers using p = n processors
  – Ts = Θ(n)
  – Tp = Θ(log n)
  – E = Θ(1/log n)
  – The larger the problem, the less efficiently we use the processors
• Adding n numbers using p < n processors
  – Ts = Θ(n)
  – Tp = Θ(n/p + log n)
  – E = Θ(1/(1 + p(log p)/n))
  – The problem size must grow at least as fast as p log p to balance the overhead of the parallel reduction
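The second case can be checked numerically; a minimal sketch (the function name is illustrative):

```python
import math

def reduction_efficiency(n, p):
    """E = 1 / (1 + p*log2(p)/n) for adding n numbers on p processors."""
    return 1.0 / (1.0 + p * math.log2(p) / n)

# Holding p = 64 and growing n restores efficiency, as isoefficiency predicts:
print(reduction_efficiency(1024, 64))   # about 0.73
print(reduction_efficiency(65536, 64))  # about 0.994
```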
Effectiveness
[Equations garbled in extraction. Recoverable pieces: performance = w(H, p, α, ξ); cost = t(H, p, α, ξ); TA = min over p of Tp; an optimal processor count p_opt ∈ Γ is selected from these quantities.]
Application of Effectiveness
[Figure: Scalable Parallel Genetic Algorithms]
Application of Effectiveness
Energy Efficiency

RACK
• 4 enclosures × 64 blades × 20 VMs ≈ 5,000 VMs
• 10 kW ≈ $20k/month

DATA CENTER
• 2,500 square feet = 100 racks ≈ 500,000 VMs
• 1000 kW ≈ $2M/month

CO2 Emissions
• 1000 kW × 9000 h/year ≈ 9M kWh/year
• 1.3 lb/kWh × 9M kWh/year ≈ 11.7M lb
• 11.7M lb / 2,200 lb per metric ton ≈ 5,300 metric tons
• 5,300 × $40/ton ≈ $212k/year
http://www.environmentalleader.com/2008/07/27/data-centers-and-carbon-pricing/
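The estimate above can be reproduced step by step (constants taken from the slide; note that 5,300 tons × $40/ton works out to roughly $212k):

```python
# Back-of-the-envelope CO2 cost, following the slide's figures.
annual_kwh  = 1000 * 9000        # 1000 kW x 9000 h/year = 9,000,000 kWh/year
co2_lb      = 1.3 * annual_kwh   # 1.3 lb CO2 per kWh -> 11,700,000 lb
co2_tons    = co2_lb / 2200      # ~2,200 lb per metric ton -> ~5,318 tons
annual_cost = co2_tons * 40      # at $40/ton -> ~$212,700/year
print(round(co2_tons), round(annual_cost))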
Energy Efficient Data Centers
• Facebook Oregon data center
• Microsoft GFS Datacenter Tour
• Time Lapse of Data Center Construction
Energy Efficiency
The Green Grid, Using Virtualization to improve data centres Efficiency, 14 January 2009
• 75% of servers were running below 5% utilization
• An idle server consumes more than 40% of the power of a fully utilized server
• At 10% utilization, a server used 173 watts of power
• Energy reduction example
  – 10 servers @ 10% utilization (173 watts each) = 1730 watts
  – 1 server @ 50% utilization = 230 watts
• http://www.spec.org/power_ssj2008/results/power_ssj2008.html
Barroso and Holzle (Google), The Case for Energy-Proportional Computing, IEEE Computer 2007
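The consolidation arithmetic above works out to roughly an 87% power reduction; a short check:

```python
# Server consolidation example from the slide.
before = 10 * 173            # ten servers at 10% utilization: 1730 W
after  = 1 * 230             # one server at 50% utilization:   230 W
saving = 1 - after / before  # fraction of power saved
print(before, after, round(saving, 3))
```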
Data Center Metrics
• WUE = annual water use / IT equipment energy
• PUE = total facility power / IT equipment power
  – Lower is better, e.g. Google = 1.11
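PUE is a simple ratio; a minimal sketch (the numbers are illustrative):

```python
def pue(total_facility_power_kw, it_equipment_power_kw):
    """Power Usage Effectiveness: 1.0 is the ideal lower bound."""
    return total_facility_power_kw / it_equipment_power_kw

# A facility drawing 1110 kW to deliver 1000 kW to IT gear matches
# the Google figure cited above:
print(pue(1110.0, 1000.0))  # -> 1.11
```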
MicroBenchmarks
• Bonnie++ (hard drive performance)
• Stream (memory performance)
• Netperf (network performance)
• LMbench (low level system)
• Netpipe (network performance)
• Intel MPI Benchmarks (low and high level)
MacroBenchmarks
• High Performance Linpack (HPL)
  – Solves a dense linear system using LU factorization with partial pivoting
• Gromacs
  – Molecular dynamics; a good measure of floating point performance and maximum power
• NAS Parallel Suite
  – Computational fluid dynamics kernels with self-checking of results
• Intel's MPI Benchmark
• HPC Challenge Suite
  – Includes several benchmark programs to test computation, communication, and memory bandwidth
NAS Parallel Benchmarks (NPB)
• http://www.nas.nasa.gov/Software/NPB
• Numerical Aerodynamic Simulation Program at NASA Ames Research Center
• Benchmarks run with little or no tuning
• SP and BT are simulated CFD applications that solve systems of equations resulting from an approximately factored implicit finite-difference discretization of the Navier-Stokes equations
  – BT solves block-tridiagonal systems of 5x5 blocks
  – SP solves scalar pentadiagonal systems
Performance Monitoring
• PAPI
  – Performance Application Programming Interface
  – Access to hardware performance counters
• ompP
  – Profiling OpenMP code
• IPM
  – Integrated Performance Monitoring
• Paradyn
  – Monitoring message passing applications
• Other performance monitoring tools
  – Vampir, KOJAK, TAU, Scalasca, Periscope, PerfSuite, HPCToolkit, CrayPat, Spin
  – Intel's Parallel Studio
    • Parallel Advisor
    • Inspector XE
Summary
• Performance challenges
• Performance metrics
• Energy efficiency
• Benchmarking
• Monitoring tools