Costin Iancu
Lawrence Berkeley National Laboratory
WPSE 2009
• Unified Parallel C
  – SPMD programming model, shared memory space abstraction
  – Communication is either implicit or explicit – one-sided
  – Memory model: relaxed and strict
• Berkeley UPC implementation
  – Compiler based on the Open64 framework
  – Source-to-source translation
  – GASNet communication libraries
    - PUT/GET primitives
    - Vector/Index/Strided (VIS) primitives
    - Synchronization, collective operations
• Provide integration across all levels of the software stack
• Mechanisms for finer-grained control over system resources
• Application-level resource usage policies
• Language and compiler support
[Diagram: software stack – UPC code → UPC compiler → compiler-generated C code → UPC runtime system → GASNet communication system → network hardware]
Emphasize production-quality development tools
• Productivity = performance without pain + portability
• Provide support for application adaptation (load balance, comm/comp overlap, scheduling, synchronization)
• Challenges: scale, heterogeneity, convergence of shared- and distributed-memory optimizations
• Broad spectrum of approaches (distributed / shared memory)
  - Fine-grained communication optimizations (PACT’05)
  - Automatic non-blocking communication (PACT’05, ICS’07)
  - Performance models for loop nest optimizations (PPoPP’07, ICS’08, PACT’08)
  - Applications (IPDPS’05, SC’07, PPoPP’08, IPDPS’09)
Adoption: >7 years of concerted effort, DOE support and encouragement, one big government user
• One of the highest-scaling FFT (NAS) results to date (~2 TFlops)
• Communication is aggressively overlapped with computation
• UPC vs MPI – 10%-70% faster; one-sided is more effective
• Best performance of “primitive” operations
  – Select the best implementation available for “primitive” operations (put/get, sync)
  – Provide efficient implementations for library “abstractions” (collectives)
• Optimizations
  – Single-node performance
  – Mechanisms to efficiently map the application to hardware/OS
  – Program transformations – minimize processor “idle” waiting

Runtime Adaptation
• Multi-level optimizations (distributed and shared memory)
• Compile-time, static optimizations are not sufficient
• Adaptation = runtime
  – Program description
  – Performance models vs autotuning
  – Parameter estimation/classification
    Instantaneous vs asymptotic
    Guided vs automatic
    Offline vs online
  – Feedback loop
  – Static vs dynamic topology mapping
[Diagram: compile-time transformations vs runtime mechanisms. Compile time: communication-oblivious transformations and communication-aware analysis – message vectorization, message strip-mining, data redistribution. Runtime: estimation of performance parameters from a program description plus code templates, a performance database, performance models, and a memory manager (cache); the runtime estimates parameters (categorical and numerical), analyzes communication requirements, estimates load, instantiates a communication plan, eliminates redundant communication and reshapes, and generates code.]
• Describe program behavior with a lightweight representation (Paek’s LMAD for perfect nests)
  - Easily extended for symbolic analysis
  - RT-LMAD, similar to SSA, for irregular loops
• Decouple serial transformations from communication transformations
  - Serial transformations – cache parameters (static/conservative)
  - Communication transformations – network parameters (dynamic)
• No performance loss when decoupling optimizations
  - Coarse-grained characteristics
  - Blocking for cache and network at different scales
  - Compute-bound and communication-bound are categories
  - Multithreading
  - No global communication scheduling (intrinsic to the computation)
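The strip-mining idea above – blocking for the network so that communication of one strip overlaps computation on the previous one – can be sketched in plain C. This is a minimal illustration, not the talk's implementation: `memcpy` stands in for a non-blocking remote get, and the strip size is an assumed tuning parameter.

```c
#include <string.h>
#include <stddef.h>

/* Strip-mining sketch: move n doubles in strips of `strip` elements,
 * processing each strip as soon as it "arrives". In a real UPC/GASNet
 * runtime the memcpy would be a non-blocking get synced one strip late;
 * here memcpy is a local stand-in so the control flow is runnable. */
static void strip_mined_sum(double *dst, const double *src, size_t n,
                            size_t strip, double *sum_out)
{
    double sum = 0.0;
    for (size_t off = 0; off < n; off += strip) {
        size_t len = (n - off < strip) ? (n - off) : strip;
        memcpy(dst + off, src + off, len * sizeof(double)); /* "communication" */
        for (size_t i = 0; i < len; i++)                    /* overlapped compute */
            sum += dst[off + i];
    }
    *sum_out = sum;
}
```

The point of the decoupling argument is that `strip` can be chosen from network parameters at runtime while the inner compute loop is blocked for cache statically.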
COMMUNICATION OPTIMIZATIONS
• Domain decomposition and scheduling for code generation
• Efficient high-level communication primitives (collectives, point-to-point)
• Application-level performance-determining factors:
  – Computation
  – Spatial – topology (point-to-point, one-from-many, many-from-one, many-to-many)
  – Temporal – schedule (burst, peer order)
• System-level performance-determining factors:
  – Multiple available implementations
  – Resource constraints (issue queue, TLB footprint)
  – Interaction with the OS (mapping, scheduling)

Adaptation: offline search, easy-to-evaluate heuristics, lightweight analysis
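The offline-search style of adaptation can be sketched as a loop over candidate settings scored by a cost model. Everything here is a hypothetical stand-in: the candidate strip sizes and the overhead-plus-bandwidth cost function are made up for illustration; the real system explores measured, system-specific spaces.

```c
#include <stddef.h>

/* Offline-search sketch (hypothetical cost model): enumerate candidate
 * strip sizes for a transfer of n bytes, score each with a toy
 * overhead-plus-bandwidth cost, and return the cheapest. A production
 * system would measure rather than model, and cache the winner per
 * message-size class in a performance database. */
static size_t pick_strip_size(size_t n, double per_msg_overhead,
                              double byte_cost)
{
    static const size_t candidates[] = { 512, 1024, 4096, 16384, 65536 };
    size_t best = candidates[0];
    double best_cost = -1.0;
    for (size_t c = 0; c < sizeof candidates / sizeof candidates[0]; c++) {
        size_t s = candidates[c];
        double msgs = (double)((n + s - 1) / s);
        /* cost = per-message overhead for all strips, plus the serialized
         * bytes of the first strip (later strips assumed overlapped) */
        double cost = msgs * per_msg_overhead
                    + (double)(n < s ? n : s) * byte_cost;
        if (best_cost < 0.0 || cost < best_cost) { best_cost = cost; best = s; }
    }
    return best;
}
```

With high per-message overhead the search favors few large strips; with negligible overhead it favors small strips that expose overlap earlier.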
[Diagram: load vs overhead OR inverse bandwidth – models are asymptotic, optimizations are instantaneous; flow control and fairness]
Throttling load is desirable for performance: > 2X
• Deployed systems are under-provisioned, unfair, noisy – two processors saturate the network, four processors overwhelm it (Underwood et al, SC’07)
• Performance is unpredictable and unreproducible
• Simple models can’t capture variation
[Chart: InfiniBand bandwidth repartition for 128 procs across the bisection – bandwidth (KB/s) vs size (bytes, 10 to 10,000,000)]
Quantitative or Qualitative?
• Previous approaches measure asymptotic values; optimizations need instantaneous values
• Existing “time accurate” performance models do not account well for system scale OR wide SMP nodes
• Qualitative models: which is faster, not how fast! (PPoPP’07, ICS’08)
  Not time accurate; understand errors and model robustness; allow for imprecision/noise
• Spatiotemporal exploration of network performance:
  - Short and long time scales – account for variability and system noise
  - Small and large system scales – SMP node, full system
• Preserve ordering
  – Sample the implementation space, transformation specific
  – Be pessimistic – determine the worst case
  – Track derivatives, not absolute values
• Analytical performance models (strip-mining transformations, PPoPP’07): > 90% efficiency
• Multiprotocol implementation of vector operations (ICS’08, PACT’08)
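A minimal sketch of the qualitative, pessimistic comparison above: instead of predicting absolute times, keep the worst observed cost per implementation and report only the ordering, with a tolerance that absorbs noise. The function names and the tolerance scheme are illustrative assumptions, not the talk's actual model.

```c
/* Qualitative-model sketch: compare two implementations by their
 * pessimistic (worst observed) costs and report only the ordering.
 * Within `eps` the answer is "no confident winner" (0); otherwise
 * -1 means a is faster, +1 means b is faster. */
static double max_of(const double *samples, int n)
{
    double m = samples[0];
    for (int i = 1; i < n; i++)
        if (samples[i] > m) m = samples[i];
    return m;
}

static int faster(const double *a, int na,
                  const double *b, int nb, double eps)
{
    double wa = max_of(a, na), wb = max_of(b, nb);
    if (wa < wb - eps) return -1;  /* a wins even in its worst case */
    if (wb < wa - eps) return  1;  /* b wins even in its worst case */
    return 0;                      /* too close to call under noise */
}
```

Because only the ordering matters, the model stays robust when absolute measurements drift with load or scale.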
TUNING OF VECTOR OPERATIONS
• Vector operations – copy disjoint memory regions in one logical step (scatter/gather)
• Often used in applications: boundary data in finite difference, particle-mesh, sparse matrices, MPI derived datatypes
• Well supported:
  • Native: Elan, InfiniBand, IBM LAPI/DCMF
  • Third-party comm libraries: GASNet, ARMCI, MPI
  • “Frameworks”: UPC, Titanium, CAF, GA, LAPI
  • Interfaces: strided, indexed
• Previous studies show the need for a multi-protocol approach
• Implementations:
  – Blocking – no overlap (BLOCK)
  – Pipelining – flow control and fairness are problems (PIPE)
  – Packing – flow control and attentiveness are problems (VIS)
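Of the three protocols, packing is the least obvious, so here is a runnable sketch of its data movement: gather the disjoint regions into one contiguous bounce buffer, move it in a single operation, then scatter on arrival. `memcpy` stands in for the one network transfer; the function name and buffer handling are illustrative, not the GASNet VIS implementation.

```c
#include <string.h>
#include <stddef.h>

/* Packing (VIS-style) sketch: gather s disjoint source regions into a
 * contiguous bounce buffer, move it in one "transfer" (the middle
 * memcpy stands in for the single network operation), then scatter at
 * the target. BLOCK would instead issue s separate blocking transfers;
 * PIPE would issue all s non-blocking transfers and sync once. */
static void packed_gather(void *dst[], const void *src[],
                          const size_t len[], size_t s,
                          unsigned char *bounce_src,
                          unsigned char *bounce_dst)
{
    size_t off = 0;
    for (size_t i = 0; i < s; i++) {           /* pack at the source */
        memcpy(bounce_src + off, src[i], len[i]);
        off += len[i];
    }
    memcpy(bounce_dst, bounce_src, off);       /* one bulk transfer */
    off = 0;
    for (size_t i = 0; i < s; i++) {           /* unpack at the target */
        memcpy(dst[i], bounce_dst + off, len[i]);
        off += len[i];
    }
}
```

Packing trades CPU time (the copy loops) and attentiveness at the target for far fewer messages, which is why it wins for many small regions and loses for few large ones.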
Blocking (BLOCK):
  foreach (S)
    start_time()
    for (iters)
      foreach (N)
        get(S)
    end_time()

Non-blocking (PIPE):
  foreach (S)
    start_time()
    for (iters)
      foreach (N)
        get_nb(S)
      sync_all
    end_time()

Packing (VIS):
  foreach (S)
    start_time()
    for (iters)
      foreach (N)
        vector_get(N, S)
    end_time()
• Protocols: Blocking, Non-Blocking, Packing (AM-based)
• Empirical approach based on optimization-space exploration
  - Transfer structure (N, S)
  - Application characteristics: active processors, communication topology, system size, instantaneous load
• For each setting – which implementation is faster?
• Fast, lightweight decision mechanism – prune the parameter space
• Strategy: best OR worst case scenario?
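The "fast, lightweight decision mechanism" can be sketched as a categorical chooser over coarse inputs. The thresholds below are invented placeholders; in the talk's approach the cut-offs come from offline exploration of the (N, S) space on each system.

```c
#include <stddef.h>

typedef enum { PROTO_BLOCK, PROTO_PIPE, PROTO_VIS } proto_t;

/* Decision-mechanism sketch: pick a vector-op protocol from coarse
 * categories. The numeric thresholds are hypothetical; a real system
 * derives them per platform from offline search. */
static proto_t choose_protocol(size_t nregions, size_t region_bytes,
                               int node_load_high)
{
    if (nregions * region_bytes < 4096)
        return PROTO_BLOCK;   /* tiny total volume: simplest path wins */
    if (region_bytes < 512 || node_load_high)
        return PROTO_VIS;     /* many small pieces, or a contended node:
                                 pack into fewer messages */
    return PROTO_PIPE;        /* large pieces: pipeline them */
}
```

Because the choice is categorical (which protocol, not how fast), the decision costs a few comparisons and can run per transfer without measurable overhead.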
Best algorithm determined by SMP arity and load
Resource constraints determine algorithm change
[Charts: protocol crossover as a function of transfer structure – regions where VIS beats BLOCK and where VIS beats PIPE]
See PACT’08 paper for details
• Changing system size or topology does not cause protocol changes
• The magnitude of performance differences is lowered (40x to 20x)
• Accuracy > 90%, less than 2x performance loss
[Heatmaps: interrupt-vs-poll performance ratio as a function of message count and size (doubles) – BLOCK Inter/Poll stays in the 0-2x range, VIS Inter/Poll ranges from 0-1x up to 4-5x]
• Polling vs interrupts
• Different event-notification mechanisms are required for different protocols (event inter-arrival rate)
• Categorical choice
> 5X performance difference
Bassi – Power5/Federation
Pessimistic (max) predictors obtained under high load work best.
Our micro-benchmarks and models are always concerned about worst case performance.
• UPC compiler, GASNet communication layer
  - 2 x 2068 x 2.6 GHz Opteron, Cray (BigBen)
  – 2 x 320 x 2.2 GHz Opteron, InfiniBand 4x cluster (Jacquard)
  – 8 x 111 x 1.9 GHz Power5, Federation (Bassi)
  – 16 x 3936 x 1.9 GHz Barcelona, InfiniBand (Ranger)
• NAS Parallel Benchmarks – manual optimizations vs compiler-optimized
  – MG: point-to-point Put, dynamic granularity across one run
  – SP: point-to-point VIS Put, “static”
  – BT: point-to-point VIS Put/Get, “static”
• Node load (category) is the determining performance factor for wide SMPs
• Categories can be further refined into numerical values, e.g. instantaneous load estimation

Workload: 22% improvement
Load estimation?
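One plausible shape for the instantaneous load estimation mentioned above: smooth observed per-message service times with an exponentially weighted moving average and compare against an unloaded baseline. The struct, the EWMA scheme, and every parameter here are assumptions for illustration, not the talk's estimator.

```c
/* Instantaneous-load sketch: EWMA of observed per-message service
 * times, classified against an unloaded baseline. Returns 1 for
 * "heavily loaded" when the smoothed time exceeds baseline * factor.
 * All names and parameters are hypothetical. */
typedef struct {
    double ewma;   /* smoothed service time */
    double alpha;  /* smoothing weight in (0, 1] */
} load_est_t;

static void load_observe(load_est_t *e, double sample)
{
    e->ewma = e->alpha * sample + (1.0 - e->alpha) * e->ewma;
}

static int load_is_high(const load_est_t *e, double baseline,
                        double factor)
{
    return e->ewma > baseline * factor;
}
```

A categorical low/high answer is enough to flip the protocol choice on a contended node, while the underlying EWMA remains a numerical refinement when one is needed.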
[Chart: performance of VIS, PIPE, BLOCK, and ADAPTIVE relative to the VIS implementation for BT, SP, and MG (classes 16-A through 256-C) on IBM p575 – higher is better]
• Communication optimizations: qualitative models, worst-case performance, offline/guided exploration
• First-order performance-determining factors are system dependent; the number of correlations tends to be constant; large ranges
  - Strip-mining optimizations: fat-tree and torus
  - Vector optimizations: thin nodes and wide nodes
• Instantaneous behavior is important and can be coarsely categorized (#pragma)
• Runtime analysis is feasible: algorithms O(n*log n) in transfers, O(enest) faster than RTT
• Decoupling transformations (comm/comp) works – no whole-program analysis
• SPMD performance can be enhanced by RT/OS mechanisms
Thank You!
• Large number of network performance models (LogGP variants) – measurement methodology and validation on applications (asymptotic values)
  – Su et al (SC’05)
  – Cameron et al (IEEE ToC’07)
• Implementations:
  – Tipparaju et al (IPDPS’04) – InfiniBand
  – Nieplocha et al (HPCA’04) – Quadrics
  – Santhanaraman et al (PVM/MPI ’04) – InfiniBand
• PGAS compilers
  – CAF: message vectorization
  – Titanium: array copy operations, inspector-executor
Ideal Development Environment
J. Demmel, M. Hall, C. Iancu, D. Quinlan, K. Yelick…
[Diagram: DOD application codes (C/C++, Fortran, UPC/CAF/Chapel, OpenMP, HPC language extensions & libraries) flow through program analysis, automated task recognition, and source-to-source code transformations; guided by micro-benchmarks, processing-system characterization, a knowledge & experience base, architecture and network models, a configuration file & system model, and autotuning/optimization with learning & reasoning; feeding a back-end processor-specific compiler and the OS & runtime system to produce an optimized parallel executable, all within a component framework.]
• All protocols are chosen across the whole workload and systems
• Two types of systems:
  – IBM – N-N estimators – static estimators are enough
  – Sun – P-N, P-HN, P-P – heuristics to change predictors with scale, or use instantaneous load estimation

Overall improved scalability and performance
Improvement: 22% workload, 3x speedup max
Load estimation?
NAS Application Benchmarks – InfiniBand Cluster
[Chart: performance relative to the UNOPTIMIZED implementation for MG, SP, BT, CG, IS, FT, and FT-NLE (classes A-4 through C-128) – hand-optimized (HAND OPT) bars reach 2.96 and 2.15]
Improvement: 22% workload, 3x speedup max (Sun: 2.5% workload, 15% speedup)
Iancu, Yelick
Instantaneous load estimation is required for these results (SMP load, comm topology, comm distance)