IBM Research
© 2009
Multicore Programming Challenges
Michael Perrone, IBM Master Inventor, Mgr., Multicore Computing Dept.
Take Home Messages
“Who needs 100 cores to run MS Word?”- Dave Patterson, Berkeley
• Performance is critical and it's not free!
• Data movement is critical to performance!
Which curve are you on?
[Figure: performance vs. # of cores, contrasting a curve that keeps scaling with one that flattens]
Outline
• What’s happening?
• Why is it happening?
• What are the implications?
• What can we do about it?
What’s happening?
• Industry shift to multicore
– Intel, IBM, AMD, Sun, nVidia, Cray, etc.
• Increasing
– # Cores
– Heterogeneity (e.g., Cell processor, system level)
• Decreasing
– Core complexity (e.g., Cell processor, GPUs), falling since the single-core Pentium 4
– Bytes per FLOP
[Diagram: single core → homogeneous multicore → heterogeneous multicore]
Heterogeneity: Amdahl’s Law for Multicore
[Figure: speedup vs. number of cores (split between serial and parallel work) for unicore, homogeneous, and heterogeneous designs; heterogeneous wins even for square-root performance growth (Hill & Marty, 2008)]
Loophole: have cores work in concert on serial code…
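For reference, the model behind this chart (as published in Hill & Marty, "Amdahl's Law in the Multicore Era," 2008) assumes a chip of n base-core equivalents (BCEs), a larger core built from r of them delivering perf(r) ≈ √r, and parallel fraction f:

\[
\text{Speedup}_{\text{symmetric}} = \frac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f\,r}{\mathrm{perf}(r)\,n}},
\qquad
\text{Speedup}_{\text{asymmetric}} = \frac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f}{\mathrm{perf}(r) + n - r}}
\]

The asymmetric (heterogeneous) design wins because the big core speeds up the serial fraction while the remaining n − r simple cores work on the parallel fraction.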
Good & Bad News
GOOD NEWS
Multicore programming is parallel programming
BAD NEWS
Multicore programming is parallel programming
Many Levels of Parallelism
• Node
• Socket
• Chip
• Core
• Thread
• Register/SIMD
• Multiple instruction pipelines
• Need to be aware of all of them!
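A minimal sketch (not from the talk) of how two of these levels appear in a single loop: OpenMP distributes iterations across cores at the thread level, and the loop body is written so the compiler can map it onto SIMD registers.

#include <stddef.h>

/* SAXPY: y = a*x + y. Thread level via OpenMP; register/SIMD level via
   compiler vectorization of the inner body ("#pragma omp simd" in
   OpenMP 4+ makes it explicit). Node/socket levels would sit above
   this, e.g. MPI ranks calling this routine on their own partition. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}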
Additional System Types
[Diagrams: four ways to attach accelerators]
• Heterogeneous bus attached: accelerators share the system bus and main memory with the host cores (e.g., a Power core plus accelerators behind an on-chip bridge)
• Homogeneous bus attached: multiple multicore CPUs share the system bus and main memory
• IO bus attached: accelerators with their own memory hang off a bridge on the I/O bus (e.g., PCIe)
• Network attached: accelerators with their own memory sit behind a NIC (e.g., InfiniBand or Ethernet)
Multicore Programming Challenge
[Figure: performance (lower → higher) vs. programmability (harder → easier). Quadrants: hard & high performance = interesting research!; easy & high = Nirvana; easy & low = “lazy” programming; hard & low = danger zone! Better tools and better programming both move you toward Nirvana.]
Outline
• What’s happening?
• Why is it happening?
– HW Challenges
– BW Challenges
• What are the implications?
• What can we do about it?
Power Density – The fundamental problem
[Figure: power density (W/cm², log scale 1–1000) vs. gate length (1.5 µm down to 0.07 µm) for i386, i486, Pentium, Pentium Pro, Pentium II, and Pentium III; the trend passes “hot plate” and heads toward “nuclear reactor” levels]
Source: Fred Pollack, Intel. New Microprocessor Challenges in the Coming Generations of CMOS Technologies, Micro32
What’s causing the problem?
Gate dielectric is approaching a fundamental limit (a few atomic layers).
[Images: gate-stack cross-section with Tox ≈ 11 Å; plot of active and passive power density (W/cm²) vs. gate length (microns), 1994–2004, with passive (leakage) power overtaking active power near the 65 nm node]
Microprocessor Clock Speed Trends
[Figure: clock frequency (MHz, log scale 10²–10⁴) vs. year, 1990–2010; the curve flattens after ~2004]
Managing power dissipation is limiting clock speed increases.
Intuition: Power vs. Performance Trade Off
[Figure: relative power vs. relative performance for a single core; power rises superlinearly with performance, so a modest drop in per-core performance buys a large drop in power]
Outline
• What’s happening?
• Why is it happening?
– HW Challenges
– BW Challenges
• What are the implications?
• What can we do about it?
The Hungry Beast
[Diagram: data (“food”) flows through a data pipe to the processor (“beast”)]
Pipe too small = starved beast
Pipe big enough = well-fed beast
Pipe too big = wasted resources
If flops grow faster than pipe capacity… the beast gets hungrier!
Move the food closer: Cache
[Diagram: a cache now sits between the data and the processor]
Load more food while the beast eats.
What happens if the beast is still hungry?
If the data set doesn’t fit in cache:
– Cache misses
– Memory latency exposed
– Performance degraded
Several important application classes don’t fit:
– Graph searching algorithms
– Network security
– Natural language processing
– Bioinformatics
– Many HPC workloads
Make the food bowl larger: Cache
Cache size is steadily increasing. Implications:
– Chip real estate reserved for cache
– Less space on chip for computes
– More power required for fewer FLOPS
But…
– Important application working sets are growing faster
– Multicore is even more demanding on cache than unicore
The beast had babies
• Multicore makes the data problem worse!
– Efficient data movement is critical
– Latency hiding is critical
Outline
• What’s happening?
• Why is it happening?
• What are the implications?
• What can we do about it?
Feeding the Cell Processor
• 8 SPEs, each with
– LS (local store)
– MFC (memory flow controller)
– SXU (synergistic execution unit)
• PPE (64-bit Power Architecture with VMX)
– OS functions
– Disk IO
– Network IO
[Diagram: Cell BE block diagram; the EIB carries up to 96 B/cycle, with 16 B/cycle ports to each SPE, to the MIC (dual XDR memory), and to the BIC (FlexIO), and 32 B/cycle between the PPE’s PPU/L1 and its L2]
Cell Approach: Feed the beast more efficiently
Explicitly “orchestrate” the data flow
• Enables detailed programmer control of data flow
– Get/Put data when & where you want it
– Hides latency: simultaneous reads, writes & computes
• Avoids restrictive HW cache management
– HW is unlikely to determine the optimal data flow
– Can be very inefficient
• Allows more efficient use of the existing bandwidth
BOTTOM LINE:
It’s all about the data!
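A minimal SPE-side sketch of this orchestration using the MFC DMA intrinsics from the Cell SDK (spu_mfcio.h); the chunk size, tag usage, and process() routine are illustrative, not from the talk.

#include <spu_mfcio.h>

#define CHUNK 4096   /* bytes per DMA; multiple of 128 for full-line transfers */

volatile char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process(volatile char *data, int n);   /* hypothetical kernel */

void stream(unsigned long long ea, int nchunks)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);        /* prime the pipeline */
    for (int i = 0; i < nchunks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nchunks)                        /* start next DMA early */
            mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);               /* wait for current tag only */
        mfc_read_tag_status_all();
        process(buf[cur], CHUNK);                   /* compute overlaps next DMA */
        cur = nxt;
    }
}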
Lessons Learned: Cell Processor
• Core simplicity impacted algorithmic design
– Increased predictability
– Avoid recursion & branches
– Simpler code is better code (e.g., bubble vs. comb sort)
• Heterogeneity
– Serial core must balance parallel cores well
• Programmability suffered
– Forced to address data flow directly
– Led to better algorithms & performance portability
What are the implications?
• Computational Complexity
• Parallel programming
• Communication
• Synchronization
• Collecting metadata
• Merging Operations
• Grouping Operations
• Memory Layout
• Memory Conflicts
• Debugging
Some general, some Cell specific.
Computational complexity is inadequate
• Focus on computes: O(N), O(N²), O(ln N), etc.
• Ignores BW analysis
– Memory flows are now the bottlenecks
– Memory hierarchies are critical to performance
– Need to incorporate memory into the picture
• Need “data complexity”
– Necessarily HW dependent
– Calculate the data movement (tracking where the data come from) and divide by BW to get the data-transfer time
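A hedged worked example (hardware numbers assumed, not from the talk): for y ← a·x + y over N doubles, each iteration moves 24 bytes (read x, read y, write y) for 2 flops. With peak compute rate π and bandwidth β,

\[
T \approx \max\!\left(\frac{2N}{\pi},\ \frac{24N}{\beta}\right),
\]

so at, say, β = 25.6 GB/s the data term alone caps throughput at (2/24) × 25.6×10⁹ ≈ 2.1 GFLOP/s, no matter how large π is.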
Don’t apply computational complexity blindly
O(N) isn’t always better than O(N²)
[Figure: run time vs. N; below the crossover (“you are here”), the O(N²) implementation runs faster than the O(N) one]
More cores can lead to smaller N per core…
Where is your data?
[Figure: run time vs. N (“locality”) for data resident in L1 cache, L2 cache, L3 cache, disk, and tape]
Put your data where you want it when you want it!
Localize your data!
Example: Compression
• Compress to reduce data flow
• Increases the slope of the O(N) compute cost
• But reduces run time
[Diagram: read → compute → write timelines; compression shrinks the read and write phases and adds a little compute, cutting total run time]
Implication: Communication Overhead
• BW can swamp compute
• Minimize communication
Implication: Communication Overhead
• Modify partitioning to reduce communications
• Trade off with synchronization
[Diagram: two partitionings of the same domain; strips exchange a boundary of 9L, blocks only 4L]
Implications: Synchronization Overhead
[Diagram: thread timelines; threads that finish early sit idle until the slowest arrives: the synchronization overhead]
Implications: Synchronization – Load Balancing
• Modify data partitioning to balance workloads
[Diagram: uniform vs. adaptive partitioning of a nonuniform workload]
Implications: Synchronization – Nondeterminism
[Figure: probability vs. run time; a deterministic task is a sharp spike, the average nondeterministic task a distribution, and the max over N threads a distribution shifted toward longer run times]
Implications: Metadata - Parallel sort example
• Collect histogram in first pass
• Use histogram to parallelize second pass
[Diagram: unsorted data → metadata (histogram) → sorted data]
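A sketch of the two-pass idea (my reconstruction, not the talk’s code), shown with 8-bit keys for brevity. Once the prefix sum is done, each key value owns a disjoint output range, so the second pass can be split across cores without synchronization.

#include <stddef.h>

void histogram_sort(const unsigned char *in, unsigned char *out, size_t n)
{
    size_t hist[256] = {0}, start[256];

    for (size_t i = 0; i < n; i++)         /* pass 1: collect the metadata */
        hist[in[i]]++;

    size_t acc = 0;                        /* prefix sum -> output offsets */
    for (int k = 0; k < 256; k++) {
        start[k] = acc;
        acc += hist[k];
    }

    for (size_t i = 0; i < n; i++)         /* pass 2: scatter (parallelizable) */
        out[start[in[i]]++] = in[i];
}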
Implications: Merge Operations – FFT Example
• Naïve
– 1D FFT (x axis)
– Transpose
– 1D FFT (y axis)
– Transpose
• Improved: merge steps
– FFT/Transpose (x axis)
– FFT/Transpose (y axis)
• Avoid unnecessary data movement
[Diagram: input image → tile → transposed tile → transposed buffer → transposed image]
Implications: Restructure to Avoid Data Movement
Before: Compute A → Transform A to B → Compute B → Transform B to A, repeated for every pass
After: Compute A ×3 → Transform A to B → Compute B ×3
Grouping the computes on each representation eliminates the repeated transforms.
Implications: Streaming Data & Finite Automata
Replicate the DFA & overlap the data stream across the copies
[Diagram: one DFA scanning the whole stream vs. several replicated DFAs scanning overlapping chunks]
Enables loop unrolling & software pipelining
Implications: Streaming Data – NID Example
• Find (lots of) substrings in a (long) string
• Build a graph of the words & represent it as a DFA
• Sample word list: “the”, “that”, “math”
Implications: Streaming Data – NID Example (cont.)
Random access to a large state transition table (STT)
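A sketch of the scan loop this implies (illustrative; the transition table and accept set would be compiled from the word list, Aho–Corasick style, and the table size here is assumed):

#include <stddef.h>
#include <stdint.h>

enum { NSTATES = 16 };                   /* assumed size for the sample list */
extern const uint8_t stt[NSTATES][256];  /* state transition table */
extern const uint8_t accept[NSTATES];    /* 1 if the state emits a match */

size_t count_matches(const uint8_t *data, size_t n)
{
    size_t hits = 0;
    uint8_t s = 0;
    for (size_t i = 0; i < n; i++) {
        s = stt[s][data[i]];   /* one random STT access per input byte */
        hits += accept[s];     /* branch-free match accounting */
    }
    return hits;
}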
Implications: Streaming Data – Hiding Latency
Enables loop unrolling & software pipelining
Roofline Model (S. Williams)
[Figure: processing rate vs. data locality (low → high); latency-bound at low locality, compute-bound at high locality, with software pipelining raising the latency-bound region]
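For reference, Williams’ roofline is usually stated with operational intensity I (flops per byte of memory traffic) and peak bandwidth β rather than the “data locality” axis sketched here:

\[
P_{\text{attainable}} = \min\left(P_{\text{peak}},\ \beta \cdot I\right)
\]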
Implications: Group Like Operations – Tokenization Ex.
• Intuitive approach, per input symbol:
– Get data (serial)
– State transition (serial)
– Action (branchy & nondeterministic)
– Repeat
[Diagram: the DFA consumes data and executes an action inline at every step]
Implications: Group Like Operations – Tokenization Ex.
• Better:
– Get data (serial)
– State transition (serial)
– Add action to a list (serial)
– Repeat
– Process the action lists (serial)
[Diagram: the DFA appends to action lists 1–3 instead of executing actions inline]
• Enables loop unrolling, SIMD, load balancing
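A sketch of the restructuring (my illustration; next_state and action_of are hypothetical, and bounds checks are elided):

#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t pos; } Ev;
enum { NACT = 3, MAXEV = 1 << 16 };

static Ev lists[NACT][MAXEV];
static size_t len[NACT];

extern uint8_t next_state(uint8_t s, uint8_t c);  /* hypothetical DFA step */
extern uint8_t action_of(uint8_t s);              /* 0 = none, 1..NACT    */

void scan(const uint8_t *data, size_t n)
{
    uint8_t s = 0;
    for (size_t i = 0; i < n; i++) {              /* serial, predictable  */
        s = next_state(s, data[i]);
        uint8_t a = action_of(s);
        if (a)                                    /* defer the real work  */
            lists[a - 1][len[a - 1]++] = (Ev){ (uint32_t)i };
    }
    /* Each list now applies one action type to many events: a tight loop
       that can be unrolled, SIMDized, and spread across cores. */
}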
Implications: Convert BW Bound to Compute Bound – NN Example
• Neural net function F(X): RBF, MLP, KNN, etc.
• N basis functions: dot product + nonlinearity
• D input dimensions
• D×N matrix of parameters
• If too big for cache, BW bound
[Diagram: input X feeds the network, producing output F]
Implications: Convert BW Bound to Compute Bound – NN Example (cont.)
• Split the function over multiple SPEs
– Avoids unnecessary memory traffic
– Reduces compute time per SPE
– Minimal merge overhead
[Diagram: each SPE evaluates a slice of the network; the partial results are merged]
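A sketch of the split (shapes assumed; weights for basis function j stored contiguously), written with an OpenMP reduction standing in for the SPE merge step:

#include <math.h>
#include <stddef.h>

/* N basis functions over D inputs; W holds one D-vector per basis function.
   Each core owns a range of j, keeping its slice of W resident locally;
   the reduction is the "minimal merge overhead". RBF nonlinearity shown. */
double nn_output(const double *x, const double *W, size_t D, size_t N)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (size_t j = 0; j < N; j++) {
        double d2 = 0.0;
        for (size_t i = 0; i < D; i++) {      /* dot-product-like inner loop */
            double t = x[i] - W[j * D + i];
            d2 += t * t;
        }
        sum += exp(-d2);
    }
    return sum;
}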
Implications: Pay Attention to Memory Hierarchy
[Diagram: register file → L1 → L2 → main memory]
Moving outward: BW high → low, latency low → high, size small → larger
Implications: Pay Attention to Memory Hierarchy (cont.)
• Data eviction rate
• Optimal tiling
• Shared memory space can impact load balancing
[Diagrams: multicore cache topologies, e.g. cores with private L1/L2 and a shared L3, or cores with private L1s sharing an L2]
Implications: Memory Hierarchy & Tiling
[Diagram: tiled matrix multiply, C = A × B computed block by block]
Optimal tiling depends on cache size
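A sketch of cache tiling for C += A × B; TILE is a tunable, HW-dependent constant chosen so three tiles fit in the target cache level:

#include <stddef.h>

#define TILE 64   /* assumed; tune to the cache size */

void matmul_tiled(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += TILE)
      for (size_t kk = 0; kk < n; kk += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
          /* One tile's worth of work: the three blocks stay cache-resident
             instead of streaming the whole matrices through every pass. */
          for (size_t i = ii; i < ii + TILE && i < n; i++)
            for (size_t k = kk; k < kk + TILE && k < n; k++) {
              double a = A[i * n + k];
              for (size_t j = jj; j < jj + TILE && j < n; j++)
                C[i * n + j] += a * B[k * n + j];
            }
}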
Implications: Data Re-Use – FFT Revisited
• Long strides thrash the cache
• Use full cachelines where possible
[Diagram: in an N×N array, stride-1 access uses a full cacheline data envelope, while stride-N access wastes all but a single element per line]
Implications: Handle Race Conditions (Debugging)
• Heisenberg Uncertainty Principle
– Instrumenting the code changes behavior
– Problem with maintaining exact timing
[Diagram: thread 1 writes the data, thread 2 reads it; depending on which write the read falls after, the value seen is good, bad, or indeterminate]
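A sketch of the pictured race and one fix, using C11 atomics (which postdate this talk): without ordering, the reader may see the flag before the data; release/acquire ordering forces the “good” interleaving.

#include <stdatomic.h>

int payload;                        /* data written by thread 1 */
atomic_int ready;                   /* publication flag         */

void writer(void)                   /* thread 1 */
{
    payload = 42;                   /* write data */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int reader(void)                    /* thread 2 */
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                           /* spin until the write is published */
    return payload;                 /* guaranteed to see 42 */
}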
Implications: More Cores – More Memory Conflicts
• Avoid bank conflicts
– Plan data layout
– Avoid multiples of the number of banks
– Randomize start points
– Make critical data sizes and number of threads relatively prime
[Diagrams: 8 threads vs. 8 memory banks; a good layout spreads the threads across banks 1–8, while a bad layout funnels them into a single bank, creating a hot spot]
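A sketch of the layout advice, assuming 8 banks interleaved element by element: a row stride that is a multiple of the bank count lands every column access in one bank, while padding by one element rotates accesses across all banks.

enum { NBANKS = 8, N = 1024 };

/* Bad: stride N is a multiple of NBANKS, so walking a column hits the
   same bank every time (the hot spot in the figure). */
float conflicted[N][N];

/* Better: one element of padding makes the row stride relatively prime
   to the bank count; column walks now cycle through all 8 banks. */
float padded[N][N + 1];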
Implications: Reduce Data Movement
Convolve the data with a spatially varying Green’s function: \(\sum_{i,j} D(x{+}i,\, y{+}j)\, G(x, y, i, j)\), with a new G at each (x, y)
Radial symmetry of G reduces BW requirements
Implications: Reduce Data Movement (cont.)
[Diagram: the data array partitioned into column blocks across SPE 0–7]
Implications: Reduce Data Movement (cont.)
For each X:
– Load the next column of data
– Load the next column of indices
– For each Y:
• Load Green’s functions
• SIMDize Green’s functions
• Compute the convolution at (X, Y)
– Cycle buffers
[Diagram: a data buffer of height H and width 2R+1 centered on (X, Y), plus a Green’s index buffer; buffers are cycled as X advances]
Outline
• What’s happening?
• Why is it happening?
• What are the implications?
• What can we do about it?
What can we do about it?
• We want
– High performance
– Low power
– Easy programmability
• We need
– “Magic” compiler
– Multicore enabled libraries
– Multicore enabled tools
– New algorithms
Choose any two!
What can we do about it?
• Compiler “magic”
– OpenMP, autovectorization
– BUT… doesn’t encourage parallel thinking
• Programming models
– CUDA, OpenCL, Pthreads, UPC, PGAS, etc.
• Tools
– Cell SDK, RapidMind (Intel), PeakStream (Google), Cilk (Intel), Gedae, VSIPL++, Charm++, Atlas, FFTW, PHiPAC
• If you want performance…
– No substitute for better algorithms & hand-tuning!
– Performance analyzers: HPCToolkit, FDPR-Pro, Code Analyzer, Diablo, TAU, Paraver, VTune, Sun Studio Performance Analyzer, PDT, Trace Analyzer, Thor, etc.
What can we do about it? Example: OpenCL
• Open “standard”
• Based on C - not difficult to learn
• Allows natural transition from (proprietary) CUDA programs
• Interoperates with MPI
• Provides application portability
– Hides specifics of underlying accelerator architecture
– Avoids HW lock-in: “future-proofs” applications
• Weaknesses
– No double precision (DP), no recursion & accelerator model only
Portability does not equal performance portability!
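A minimal OpenCL 1.0 sketch (error handling and resource release elided); the kernel is built from source at run time for whatever device is found, which is what buys the portability described above:

#include <CL/cl.h>

static const char *src =
    "__kernel void saxpy(float a, __global const float *x,\n"
    "                    __global float *y) {\n"
    "    size_t i = get_global_id(0);\n"
    "    y[i] = a * x[i] + y[i];\n"
    "}\n";

void saxpy_cl(float a, const float *x, float *y, size_t n)
{
    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    cl_mem bx = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), (void *)x, NULL);
    cl_mem by = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), y, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);  /* JIT for this device */
    cl_kernel k = clCreateKernel(prog, "saxpy", NULL);

    clSetKernelArg(k, 0, sizeof(float), &a);
    clSetKernelArg(k, 1, sizeof(cl_mem), &bx);
    clSetKernelArg(k, 2, sizeof(cl_mem), &by);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, by, CL_TRUE, 0, n * sizeof(float), y,
                        0, NULL, NULL);
}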
What can we do about it?
Hide Complexity in Libraries
• Manually
– Slow, expensive, new library for each architecture
• Autotuners
– Search program space for optimal performance
– Examples: Atlas (BLAS), FFTW (FFT), Spiral (DSP), OSKI (Sparse BLAS), PHiPAC (BLAS)
• Local Optimality Problem:
– F() & G() may be optimal, but will F(G()) be?
What can we do about it?
It’s all about the data! The data problem is growing.
Intelligent software prefetching
– Use DMA engines
– Don’t rely on HW prefetching
Efficient data management
– Multibuffering: Hide the latency!
– BW utilization: Make every byte count!
– SIMDization: Make every vector count!
– Problem/data partitioning: Make every core work!
– Software multithreading: Keep every core busy!
Conclusions
• Programmability will continue to suffer
– No pain, no gain
• Incorporate data flow into algorithmic development
– Computational complexity vs. “data flow” complexity
• Restructure algorithms to minimize:
– Synchronization, communication, nondeterminism, load imbalance, nonlocality
• Data management is the key to better performance
– Merge/group data operations to minimize memory traffic
– Restructure data traffic: tile, align, SIMDize, compress
– Minimize memory bottlenecks
Abstract
The computer industry is facing fundamental challenges that are driving a major change in the design of computer
processors. Due to restrictions imposed by quantum physics, one historical path to higher computer processor performance - by increased clock frequency - has come to an end. Increasing clock frequency now leads to power consumption costs that are too high to justify. As a result, we have seen in recent years that the processor frequencies have peaked and are receding from their high point. At the same time, competitive market conditions are giving business advantage to those companies that can field new streaming applications, handle larger data sets, and update their models to market conditions faster. This desire for newer, faster and larger is driving continued demand for higher computer performance.
The industry’s response to address these challenges has been to embrace “multicore” technology by designing processors that have multiple processing cores on each silicon chip. Increasing the number of cores per chip has enabled processor peak performance to double with each doubling of the number of cores. With performance doubling occurring at approximately constant clock frequency so that energy costs can be controlled, multicore technology is poised to deliver the performance users need for their next generation applications while at the same time reducing total cost of ownership per FLOP.
The multicore solution to the clock frequency problem comes at a cost: Performance scaling on multicore is generally sub-linear and frequently decreases beyond some number of cores. For a variety of technical reasons, off-chip bandwidth is not increasing as fast as the number of cores per chip which is making memory and communication bottlenecks the main barriers to improved performance. What these bottlenecks mean to multicore users is that precise and flexible control of data flows will be crucial to achieving high performance. Simple mappings of their existing algorithms to multicore will not result in the naïve performance scaling one might expect from increasing the number of cores per chip. Algorithmic changes, in many cases major, will have to be made to get value out of multicore. Multicore users will have to re-think and in many cases re-write their applications if they want to achieve high performance. Multicore forces each programmer to become a parallel programmer; to think of their chips as clusters; and to deal with the issues of communication, synchronization, data transfer and non-determinism as integral elements of their algorithms. And for those already familiar with parallel programming, multicore processors add a new level of parallelism and additional layers of complexity.
This talk will highlight some of the challenges that need to be overcome in order to get better performance scaling on multicore, and will suggest some solutions.
Cell Comparison: ~4x the FLOPS @ ~½ the power
[Images: die photos of the two chips, to scale; both 65 nm technology]
Intel Multi-Core Forum (2006)
[Figure: SDET benchmark throughput vs. number of processors (0–24) under Linux; throughput scaling tops out near 9.8x, illustrating “The Issue”]
The “Yale Patt Ladder”
Problem
Algorithm
Program
ISA (Instruction Set Architecture)
Microarchitecture
Circuits
Electrons
To improve performance we need people who can cross between levels.