IBM Research
© 2009
Multicore Programming Challenges
Michael Perrone, IBM Master Inventor, Mgr., Multicore Computing Dept.
Take Home Messages
“Who needs 100 cores to run MS Word?”- Dave Patterson, Berkeley
• Performance is critical and it's not free!
• Data movement is critical to performance!
Which curve are you on?
[Figure: performance vs. # of cores, contrasting a curve that keeps scaling with one that flattens]
Outline
• What’s happening?
• Why is it happening?
• What are the implications?
• What can we do about it?
What’s happening?
• Industry shift to multicore
– Intel, IBM, AMD, Sun, nVidia, Cray, etc.
• Increasing
– # Cores
– Heterogeneity (e.g., Cell processor, system level)
• Decreasing
– Core complexity (e.g., Cell processor, GPUs), falling since the single-core Pentium 4
– Bytes per FLOP
[Diagram: single core → homogeneous multicore → heterogeneous multicore]
Heterogeneity: Amdahl’s Law for Multicore
[Figure: speedup vs. number of cores (split between serial and parallel work) for unicore, homogeneous, and heterogeneous designs; heterogeneous wins even for square-root performance growth (Hill & Marty, 2008)]
Loophole: have cores work in concert on serial code…
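For reference, the model behind this chart (as published in Hill & Marty, "Amdahl's Law in the Multicore Era," 2008) assumes a chip of n base-core equivalents (BCEs), a larger core built from r of them delivering perf(r) ≈ √r, and parallel fraction f:

\[
\text{Speedup}_{\text{symmetric}} = \frac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f\,r}{\mathrm{perf}(r)\,n}},
\qquad
\text{Speedup}_{\text{asymmetric}} = \frac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f}{\mathrm{perf}(r) + n - r}}
\]

The asymmetric (heterogeneous) design wins because the big core speeds up the serial fraction while the remaining n − r simple cores work on the parallel fraction.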
Good & Bad News
GOOD NEWS
Multicore programming is parallel programming
BAD NEWS
Multicore programming is parallel programming
Many Levels of Parallelism
• Node
• Socket
• Chip
• Core
• Thread
• Register/SIMD
• Multiple instruction pipelines
• Need to be aware of all of them!
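A minimal sketch (not from the talk) of how two of these levels appear in a single loop: OpenMP distributes iterations across cores at the thread level, and the loop body is written so the compiler can map it onto SIMD registers.

#include <stddef.h>

/* SAXPY: y = a*x + y. Thread level via OpenMP; register/SIMD level via
   compiler vectorization of the inner body ("#pragma omp simd" in
   OpenMP 4+ makes it explicit). Node/socket levels would sit above
   this, e.g. MPI ranks calling this routine on their own partition. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}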
Additional System Types
[Diagrams: four ways to attach accelerators]
• Heterogeneous bus attached: accelerators share the system bus and main memory with the host cores (e.g., a Power core plus accelerators behind an on-chip bridge)
• Homogeneous bus attached: multiple multicore CPUs share the system bus and main memory
• IO bus attached: accelerators with their own memory hang off a bridge on the I/O bus (e.g., PCIe)
• Network attached: accelerators with their own memory sit behind a NIC (e.g., InfiniBand or Ethernet)
Multicore Programming Challenge
[Figure: performance (lower → higher) vs. programmability (harder → easier). Quadrants: hard & high performance = interesting research!; easy & high = Nirvana; easy & low = “lazy” programming; hard & low = danger zone! Better tools and better programming both move you toward Nirvana.]
Outline
• What’s happening?
• Why is it happening?
– HW Challenges
– BW Challenges
• What are the implications?
• What can we do about it?
Power Density – The fundamental problem
[Figure: power density (W/cm², log scale 1–1000) vs. gate length (1.5 µm down to 0.07 µm) for i386, i486, Pentium, Pentium Pro, Pentium II, and Pentium III; the trend passes “hot plate” and heads toward “nuclear reactor” levels]
Source: Fred Pollack, Intel. New Microprocessor Challenges in the Coming Generations of CMOS Technologies, Micro32
What’s causing the problem?
Gate dielectric is approaching a fundamental limit (a few atomic layers).
[Images: gate-stack cross-section with Tox ≈ 11 Å; plot of active and passive power density (W/cm²) vs. gate length (microns), 1994–2004, with passive (leakage) power overtaking active power near the 65 nm node]
Microprocessor Clock Speed Trends
[Figure: clock frequency (MHz, log scale 10²–10⁴) vs. year, 1990–2010; the curve flattens after ~2004]
Managing power dissipation is limiting clock speed increases.
Intuition: Power vs. Performance Trade Off
[Figure: relative power vs. relative performance for a single core; power rises superlinearly with performance, so a modest drop in per-core performance buys a large drop in power]
Outline
• What’s happening?
• Why is it happening?
– HW Challenges
– BW Challenges
• What are the implications?
• What can we do about it?
The Hungry Beast
[Diagram: data (“food”) flows through a data pipe to the processor (“beast”)]
Pipe too small = starved beast
Pipe big enough = well-fed beast
Pipe too big = wasted resources
If flops grow faster than pipe capacity… the beast gets hungrier!
Move the food closer: Cache
[Diagram: a cache now sits between the data and the processor]
Load more food while the beast eats.
What happens if the beast is still hungry?
If the data set doesn’t fit in cache:
– Cache misses
– Memory latency exposed
– Performance degraded
Several important application classes don’t fit:
– Graph searching algorithms
– Network security
– Natural language processing
– Bioinformatics
– Many HPC workloads
Make the food bowl larger: Cache
Cache size is steadily increasing. Implications:
– Chip real estate reserved for cache
– Less space on chip for computes
– More power required for fewer FLOPS
But…
– Important application working sets are growing faster
– Multicore is even more demanding on cache than unicore
The beast had babies
• Multicore makes the data problem worse!
– Efficient data movement is critical
– Latency hiding is critical
Outline
• What’s happening?
• Why is it happening?
• What are the implications?
• What can we do about it?
Feeding the Cell Processor
• 8 SPEs, each with
– LS (local store)
– MFC (memory flow controller)
– SXU (synergistic execution unit)
• PPE (64-bit Power Architecture with VMX)
– OS functions
– Disk IO
– Network IO
[Diagram: Cell BE block diagram; the EIB carries up to 96 B/cycle, with 16 B/cycle ports to each SPE, to the MIC (dual XDR memory), and to the BIC (FlexIO), and 32 B/cycle between the PPE’s PPU/L1 and its L2]
Cell Approach: Feed the beast more efficiently
Explicitly “orchestrate” the data flow
• Enables detailed programmer control of data flow
– Get/Put data when & where you want it
– Hides latency: simultaneous reads, writes & computes
• Avoids restrictive HW cache management
– HW is unlikely to determine the optimal data flow
– Can be very inefficient
• Allows more efficient use of the existing bandwidth
BOTTOM LINE:
It’s all about the data!
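A minimal SPE-side sketch of this orchestration using the MFC DMA intrinsics from the Cell SDK (spu_mfcio.h); the chunk size, tag usage, and process() routine are illustrative, not from the talk.

#include <spu_mfcio.h>

#define CHUNK 4096   /* bytes per DMA; multiple of 128 for full-line transfers */

volatile char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process(volatile char *data, int n);   /* hypothetical kernel */

void stream(unsigned long long ea, int nchunks)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);        /* prime the pipeline */
    for (int i = 0; i < nchunks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nchunks)                        /* start next DMA early */
            mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);               /* wait for current tag only */
        mfc_read_tag_status_all();
        process(buf[cur], CHUNK);                   /* compute overlaps next DMA */
        cur = nxt;
    }
}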
Lessons Learned: Cell Processor
• Core simplicity impacted algorithmic design
– Increased predictability
– Avoid recursion & branches
– Simpler code is better code (e.g., bubble vs. comb sort)
• Heterogeneity
– Serial core must balance parallel cores well
• Programmability suffered
– Forced to address data flow directly
– Led to better algorithms & performance portability
What are the implications?
• Computational Complexity
• Parallel programming
• Communication
• Synchronization
• Collecting metadata
• Merging Operations
• Grouping Operations
• Memory Layout
• Memory Conflicts
• Debugging
Some general, some Cell specific.
Computational complexity is inadequate
• Focus on computes: O(N), O(N²), O(ln N), etc.
• Ignores BW analysis
– Memory flows are now the bottlenecks
– Memory hierarchies are critical to performance
– Need to incorporate memory into the picture
• Need “data complexity”
– Necessarily HW dependent
– Calculate the data movement (tracking where the data come from) and divide by BW to get the data-transfer time
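A hedged worked example (hardware numbers assumed, not from the talk): for y ← a·x + y over N doubles, each iteration moves 24 bytes (read x, read y, write y) for 2 flops. With peak compute rate π and bandwidth β,

\[
T \approx \max\!\left(\frac{2N}{\pi},\ \frac{24N}{\beta}\right),
\]

so at, say, β = 25.6 GB/s the data term alone caps throughput at (2/24) × 25.6×10⁹ ≈ 2.1 GFLOP/s, no matter how large π is.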
Don’t apply computational complexity blindly
O(N) isn’t always better than O(N²)
[Figure: run time vs. N; below the crossover (“you are here”), the O(N²) implementation runs faster than the O(N) one]
More cores can lead to smaller N per core…
Where is your data?
[Figure: run time vs. N (“locality”) for data resident in L1 cache, L2 cache, L3 cache, disk, and tape]
Put your data where you want it when you want it!
Localize your data!
Example: Compression
• Compress to reduce data flow
• Increases the slope of the O(N) compute cost
• But reduces run time
[Diagram: read → compute → write timelines; compression shrinks the read and write phases and adds a little compute, cutting total run time]
Implication: Communication Overhead
• BW can swamp compute
• Minimize communication
Implication: Communication Overhead
• Modify partitioning to reduce communications
• Trade off with synchronization
[Diagram: two partitionings of the same domain; strips exchange a boundary of 9L, blocks only 4L]
Implications: Synchronization Overhead
[Diagram: thread timelines; threads that finish early sit idle until the slowest arrives: the synchronization overhead]
Implications: Synchronization – Load Balancing
• Modify data partitioning to balance workloads
[Diagram: uniform vs. adaptive partitioning of a nonuniform workload]
Implications: Synchronization – Nondeterminism
[Figure: probability vs. run time; a deterministic task is a sharp spike, the average nondeterministic task a distribution, and the max over N threads a distribution shifted toward longer run times]
Implications: Metadata - Parallel sort example
• Collect histogram in first pass
• Use histogram to parallelize second pass
[Diagram: unsorted data → metadata (histogram) → sorted data]
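A sketch of the two-pass idea (my reconstruction, not the talk’s code), shown with 8-bit keys for brevity. Once the prefix sum is done, each key value owns a disjoint output range, so the second pass can be split across cores without synchronization.

#include <stddef.h>

void histogram_sort(const unsigned char *in, unsigned char *out, size_t n)
{
    size_t hist[256] = {0}, start[256];

    for (size_t i = 0; i < n; i++)         /* pass 1: collect the metadata */
        hist[in[i]]++;

    size_t acc = 0;                        /* prefix sum -> output offsets */
    for (int k = 0; k < 256; k++) {
        start[k] = acc;
        acc += hist[k];
    }

    for (size_t i = 0; i < n; i++)         /* pass 2: scatter (parallelizable) */
        out[start[in[i]]++] = in[i];
}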
Implications: Merge Operations – FFT Example
• Naïve
– 1D FFT (x axis)
– Transpose
– 1D FFT (y axis)
– Transpose
• Improved: merge steps
– FFT/Transpose (x axis)
– FFT/Transpose (y axis)
• Avoid unnecessary data movement
[Diagram: input image → tile → transposed tile → transposed buffer → transposed image]
Implications: Restructure to Avoid Data Movement
Before: Compute A → Transform A to B → Compute B → Transform B to A, repeated for every pass
After: Compute A ×3 → Transform A to B → Compute B ×3
Grouping the computes on each representation eliminates the repeated transforms.
Implications: Streaming Data & Finite Automata
Replicate the DFA & overlap the data stream across the copies
[Diagram: one DFA scanning the whole stream vs. several replicated DFAs scanning overlapping chunks]
Enables loop unrolling & software pipelining
Implications: Streaming Data – NID Example
• Find (lots of) substrings in a (long) string
• Build a graph of the words & represent it as a DFA
• Sample word list: “the”, “that”, “math”
Implications: Streaming Data – NID Example (cont.)
Random access to a large state transition table (STT)
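A sketch of the scan loop this implies (illustrative; the transition table and accept set would be compiled from the word list, Aho–Corasick style, and the table size here is assumed):

#include <stddef.h>
#include <stdint.h>

enum { NSTATES = 16 };                   /* assumed size for the sample list */
extern const uint8_t stt[NSTATES][256];  /* state transition table */
extern const uint8_t accept[NSTATES];    /* 1 if the state emits a match */

size_t count_matches(const uint8_t *data, size_t n)
{
    size_t hits = 0;
    uint8_t s = 0;
    for (size_t i = 0; i < n; i++) {
        s = stt[s][data[i]];   /* one random STT access per input byte */
        hits += accept[s];     /* branch-free match accounting */
    }
    return hits;
}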
Implications: Streaming Data – Hiding Latency
Enables loop unrolling & software pipelining
Roofline Model (S. Williams)
[Figure: processing rate vs. data locality (low → high); latency-bound at low locality, compute-bound at high locality, with software pipelining raising the latency-bound region]
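For reference, Williams’ roofline is usually stated with operational intensity I (flops per byte of memory traffic) and peak bandwidth β rather than the “data locality” axis sketched here:

\[
P_{\text{attainable}} = \min\left(P_{\text{peak}},\ \beta \cdot I\right)
\]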
Implications: Group Like Operations – Tokenization Ex.
• Intuitive approach, per input symbol:
– Get data (serial)
– State transition (serial)
– Action (branchy & nondeterministic)
– Repeat
[Diagram: the DFA consumes data and executes an action inline at every step]
Implications: Group Like Operations – Tokenization Ex.
• Better:
– Get data (serial)
– State transition (serial)
– Add action to a list (serial)
– Repeat
– Process the action lists (serial)
[Diagram: the DFA appends to action lists 1–3 instead of executing actions inline]
• Enables loop unrolling, SIMD, load balancing
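A sketch of the restructuring (my illustration; next_state and action_of are hypothetical, and bounds checks are elided):

#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t pos; } Ev;
enum { NACT = 3, MAXEV = 1 << 16 };

static Ev lists[NACT][MAXEV];
static size_t len[NACT];

extern uint8_t next_state(uint8_t s, uint8_t c);  /* hypothetical DFA step */
extern uint8_t action_of(uint8_t s);              /* 0 = none, 1..NACT    */

void scan(const uint8_t *data, size_t n)
{
    uint8_t s = 0;
    for (size_t i = 0; i < n; i++) {              /* serial, predictable  */
        s = next_state(s, data[i]);
        uint8_t a = action_of(s);
        if (a)                                    /* defer the real work  */
            lists[a - 1][len[a - 1]++] = (Ev){ (uint32_t)i };
    }
    /* Each list now applies one action type to many events: a tight loop
       that can be unrolled, SIMDized, and spread across cores. */
}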
Implications: Convert BW Bound to Compute Bound – NN Example
• Neural net function F(X): RBF, MLP, KNN, etc.
• N basis functions: dot product + nonlinearity
• D input dimensions
• D×N matrix of parameters
• If too big for cache, BW bound
[Diagram: input X feeds the network, producing output F]
Implications: Convert BW Bound to Compute Bound – NN Example (cont.)
• Split the function over multiple SPEs
– Avoids unnecessary memory traffic
– Reduces compute time per SPE
– Minimal merge overhead
[Diagram: each SPE evaluates a slice of the network; the partial results are merged]
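A sketch of the split (shapes assumed; weights for basis function j stored contiguously), written with an OpenMP reduction standing in for the SPE merge step:

#include <math.h>
#include <stddef.h>

/* N basis functions over D inputs; W holds one D-vector per basis function.
   Each core owns a range of j, keeping its slice of W resident locally;
   the reduction is the "minimal merge overhead". RBF nonlinearity shown. */
double nn_output(const double *x, const double *W, size_t D, size_t N)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (size_t j = 0; j < N; j++) {
        double d2 = 0.0;
        for (size_t i = 0; i < D; i++) {      /* dot-product-like inner loop */
            double t = x[i] - W[j * D + i];
            d2 += t * t;
        }
        sum += exp(-d2);
    }
    return sum;
}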
Implications: Pay Attention to Memory Hierarchy
[Diagram: register file → L1 → L2 → main memory]
Moving outward: BW high → low, latency low → high, size small → larger
Implications: Pay Attention to Memory Hierarchy (cont.)
• Data eviction rate
• Optimal tiling
• Shared memory space can impact load balancing
[Diagrams: multicore cache topologies, e.g. cores with private L1/L2 and a shared L3, or cores with private L1s sharing an L2]
Implications: Memory Hierarchy & Tiling
[Diagram: tiled matrix multiply, C = A × B computed block by block]
Optimal tiling depends on cache size
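A sketch of cache tiling for C += A × B; TILE is a tunable, HW-dependent constant chosen so three tiles fit in the target cache level:

#include <stddef.h>

#define TILE 64   /* assumed; tune to the cache size */

void matmul_tiled(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += TILE)
      for (size_t kk = 0; kk < n; kk += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
          /* One tile's worth of work: the three blocks stay cache-resident
             instead of streaming the whole matrices through every pass. */
          for (size_t i = ii; i < ii + TILE && i < n; i++)
            for (size_t k = kk; k < kk + TILE && k < n; k++) {
              double a = A[i * n + k];
              for (size_t j = jj; j < jj + TILE && j < n; j++)
                C[i * n + j] += a * B[k * n + j];
            }
}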
Implications: Data Re-Use – FFT Revisited
• Long strides thrash the cache
• Use full cachelines where possible
[Diagram: in an N×N array, stride-1 access uses a full cacheline data envelope, while stride-N access wastes all but a single element per line]
Implications: Handle Race Conditions (Debugging)
• Heisenberg Uncertainty Principle
– Instrumenting the code changes behavior
– Problem with maintaining exact timing
[Diagram: thread 1 writes the data, thread 2 reads it; depending on which write the read falls after, the value seen is good, bad, or indeterminate]
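A sketch of the pictured race and one fix, using C11 atomics (which postdate this talk): without ordering, the reader may see the flag before the data; release/acquire ordering forces the “good” interleaving.

#include <stdatomic.h>

int payload;                        /* data written by thread 1 */
atomic_int ready;                   /* publication flag         */

void writer(void)                   /* thread 1 */
{
    payload = 42;                   /* write data */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int reader(void)                    /* thread 2 */
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                           /* spin until the write is published */
    return payload;                 /* guaranteed to see 42 */
}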
Implications: More Cores – More Memory Conflicts
• Avoid bank conflicts
– Plan data layout
– Avoid multiples of the number of banks
– Randomize start points
– Make critical data sizes and number of threads relatively prime
[Diagrams: 8 threads vs. 8 memory banks; a good layout spreads the threads across banks 1–8, while a bad layout funnels them into a single bank, creating a hot spot]
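A sketch of the layout advice, assuming 8 banks interleaved element by element: a row stride that is a multiple of the bank count lands every column access in one bank, while padding by one element rotates accesses across all banks.

enum { NBANKS = 8, N = 1024 };

/* Bad: stride N is a multiple of NBANKS, so walking a column hits the
   same bank every time (the hot spot in the figure). */
float conflicted[N][N];

/* Better: one element of padding makes the row stride relatively prime
   to the bank count; column walks now cycle through all 8 banks. */
float padded[N][N + 1];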
Implications: Reduce Data Movement
Convolve the data with a spatially varying Green’s function: \(\sum_{i,j} D(x{+}i,\, y{+}j)\, G(x, y, i, j)\), with a new G at each (x, y)
Radial symmetry of G reduces BW requirements
Implications: Reduce Data Movement (cont.)
[Diagram: the data array partitioned into column blocks across SPE 0–7]
Implications: Reduce Data Movement (cont.)
For each X:
– Load the next column of data
– Load the next column of indices
– For each Y:
• Load Green’s functions
• SIMDize Green’s functions
• Compute the convolution at (X, Y)
– Cycle buffers
[Diagram: a data buffer of height H and width 2R+1 centered on (X, Y), plus a Green’s index buffer; buffers are cycled as X advances]
Outline
• What’s happening?
• Why is it happening?
• What are the implications?
• What can we do about it?
What can we do about it?
• We want
– High performance
– Low power
– Easy programmability
• We need
– “Magic” compiler
– Multicore enabled libraries
– Multicore enabled tools
– New algorithms
Choose any two!
What can we do about it?
• Compiler “magic”
– OpenMP, autovectorization
– BUT… doesn’t encourage parallel thinking
• Programming models
– CUDA, OpenCL, Pthreads, UPC, PGAS, etc.
• Tools
– Cell SDK, RapidMind (Intel), PeakStream (Google), Cilk (Intel), Gedae, VSIPL++, Charm++, Atlas, FFTW, PHiPAC
• If you want performance…
– No substitute for better algorithms & hand-tuning!
– Performance analyzers: HPCToolkit, FDPR-Pro, Code Analyzer, Diablo, TAU, Paraver, VTune, Sun Studio Performance Analyzer, PDT, Trace Analyzer, Thor, etc.
What can we do about it? Example: OpenCL
• Open “standard”
• Based on C - not difficult to learn
• Allows natural transition from (proprietary) CUDA programs
• Interoperates with MPI
• Provides application portability
– Hides specifics of underlying accelerator architecture
– Avoids HW lock-in: “future-proofs” applications
• Weaknesses
– No double precision (DP), no recursion & accelerator model only
Portability does not equal performance portability!
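A minimal OpenCL 1.0 sketch (error handling and resource release elided); the kernel is built from source at run time for whatever device is found, which is what buys the portability described above:

#include <CL/cl.h>

static const char *src =
    "__kernel void saxpy(float a, __global const float *x,\n"
    "                    __global float *y) {\n"
    "    size_t i = get_global_id(0);\n"
    "    y[i] = a * x[i] + y[i];\n"
    "}\n";

void saxpy_cl(float a, const float *x, float *y, size_t n)
{
    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    cl_mem bx = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), (void *)x, NULL);
    cl_mem by = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), y, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);  /* JIT for this device */
    cl_kernel k = clCreateKernel(prog, "saxpy", NULL);

    clSetKernelArg(k, 0, sizeof(float), &a);
    clSetKernelArg(k, 1, sizeof(cl_mem), &bx);
    clSetKernelArg(k, 2, sizeof(cl_mem), &by);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, by, CL_TRUE, 0, n * sizeof(float), y,
                        0, NULL, NULL);
}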
What can we do about it?
Hide Complexity in Libraries
• Manually
– Slow, expensive, new library for each architecture
• Autotuners
– Search program space for optimal performance
– Examples: Atlas (BLAS), FFTW (FFT), Spiral (DSP), OSKI (Sparse BLAS), PHiPAC (BLAS)
• Local Optimality Problem:
– F() & G() may be optimal, but will F(G()) be?
What can we do about it?
It’s all about the data! The data problem is growing.
Intelligent software prefetching
– Use DMA engines
– Don’t rely on HW prefetching
Efficient data management
– Multibuffering: Hide the latency!
– BW utilization: Make every byte count!
– SIMDization: Make every vector count!
– Problem/data partitioning: Make every core work!
– Software multithreading: Keep every core busy!
Conclusions
• Programmability will continue to suffer
– No pain, no gain
• Incorporate data flow into algorithmic development
– Computational complexity vs. “data flow” complexity
• Restructure algorithms to minimize:
– Synchronization, communication, nondeterminism, load imbalance, nonlocality
• Data management is the key to better performance
– Merge/group data operations to minimize memory traffic
– Restructure data traffic: tile, align, SIMDize, compress
– Minimize memory bottlenecks
Abstract
The computer industry is facing fundamental challenges that are driving a major change in the design of computer
processors. Due to restrictions imposed by quantum physics, one historical path to higher computer processor performance - by increased clock frequency - has come to an end. Increasing clock frequency now leads to power consumption costs that are too high to justify. As a result, we have seen in recent years that the processor frequencies have peaked and are receding from their high point. At the same time, competitive market conditions are giving business advantage to those companies that can field new streaming applications, handle larger data sets, and update their models to market conditions faster. This desire for newer, faster and larger is driving continued demand for higher computer performance.
The industry’s response to address these challenges has been to embrace “multicore” technology by designing processors that have multiple processing cores on each silicon chip. Increasing the number of cores per chip has enabled processor peak performance to double with each doubling of the number of cores. With performance doubling occurring at approximately constant clock frequency so that energy costs can be controlled, multicore technology is poised to deliver the performance users need for their next generation applications while at the same time reducing total cost of ownership per FLOP.
The multicore solution to the clock frequency problem comes at a cost: Performance scaling on multicore is generally sub-linear and frequently decreases beyond some number of cores. For a variety of technical reasons, off-chip bandwidth is not increasing as fast as the number of cores per chip which is making memory and communication bottlenecks the main barriers to improved performance. What these bottlenecks mean to multicore users is that precise and flexible control of data flows will be crucial to achieving high performance. Simple mappings of their existing algorithms to multicore will not result in the naïve performance scaling one might expect from increasing the number of cores per chip. Algorithmic changes, in many cases major, will have to be made to get value out of multicore. Multicore users will have to re-think and in many cases re-write their applications if they want to achieve high performance. Multicore forces each programmer to become a parallel programmer; to think of their chips as clusters; and to deal with the issues of communication, synchronization, data transfer and non-determinism as integral elements of their algorithms. And for those already familiar with parallel programming, multicore processors add a new level of parallelism and additional layers of complexity.
This talk will highlight some of the challenges that need to be overcome in order to get better performance scaling on multicore, and will suggest some solutions.
Cell Comparison: ~4x the FLOPS @ ~½ the power
[Images: die photos of the two chips, to scale; both 65 nm technology]
Intel Multi-Core Forum (2006)
[Figure: SDET benchmark throughput vs. number of processors (0–24) under Linux; throughput scaling tops out near 9.8x, illustrating “The Issue”]
The “Yale Patt Ladder”
Problem
Algorithm
Program
ISA (Instruction Set Architecture)
Microarchitecture
Circuits
Electrons
To improve performance we need people who can cross between levels.