UPC at CRD/LBNL
Kathy Yelick
Dan Bonachea, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome, Christian Bell
What is UPC?
• UPC is an explicitly parallel language
– Global address space; can read/write remote memory
– Programmer control over layout and scheduling
– From Split-C, AC, PCP
• Why a new language?
– Easier to use than MPI, especially for programs with complicated data structures
– Possibly faster on some machines, but current goal is comparable performance
[Figure: diagram with processors p0, p1, p2]
Background
• UPC efforts elsewhere
– IDA: Bill Carlson, UPC promoter
– GMU (documentation) and UMC (benchmarking)
– HP (Alpha cluster and C+MPI compiler (with MTU))
– Cray (implementations)
– Intrepid (SGI and T3E compiler)
• UPC Book:
– T. El-Ghazawi, B. Carlson, T. Sterling, K. Yelick
– 3 chapters in draft form; goal is to have proofs by SC03
• Three components of NERSC effort
1) Compilers for DOE machines (SP and PC clusters)
2) Runtime systems for ours and other compilers
3) Applications and benchmarks
UPC Funding
• Base program funding K52004
– Compiler/translator work
– Applications
– Runtime for DOE machines
• Part of Pmodels Center K52018
– Runtime support common to Titanium (and hopefully CoArray Fortran, at some point)
– Collaboration with ARMCI group
• NSA funding
– UPC for “clusters”
Compiler Status
• NERSC compiler/translator
– Costin Iancu and Wei Chen
– Translates UPC to C + “Berkeley UPC Runtime”
– Based on Open64 compiler for C
– Status
  • Complete in prototype form
  • Debugging, tuning, extensions ongoing
  • Release planned for next month: Quadrics, Myrinet, IBM/SP, and MPI
  • Shared memory/process implementation is next
– Investigating optimization opportunities
  • Communication optimizations
  • UPC language optimizations
UPC Compiler
• Compiler based on Open64
– Multiple front-ends, including gcc
– Intermediate form called WHIRL
• Leverage standard optimizations and analyses
– Pointer analysis
– Loop optimizations
• Current focus on C backend
– IA64 possible in future
• UPC Runtime built on GASNet
– Portable
– Language-independent
[Figure: compiler pipeline diagram. Stages shown: UPC, Higher WHIRL, optimizing transformations, Lower WHIRL. Outputs shown: C + Runtime; Assembly (IA64, MIPS, …) + Runtime]
Portable Runtime Support
• Developing a runtime layer that can be easily ported and tuned to multiple architectures
• Runtime: global pointers (opaque type with rich set of pointer operations), memory management, job startup, etc.
• GASNet Extended API: supports put, get, locks, barrier, bulk, scatter/gather
• GASNet Core API: small interface based on “Active Messages”
• Generic support for UPC, CAF, Titanium
• Core sufficient for functional implementation
• Direct implementations of parts of full GASNet
• GASNet released 1/03
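The claim that the core is sufficient for a functional implementation can be pictured with a toy model. The sketch below is not the GASNet API: it is a single-process simulation in which every name (`am_request`, `toy_put`, `put_handler`, `toy_peek`) is hypothetical. It only shows the layering idea, namely that a put can be expressed entirely as an active message whose handler runs at the target.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy single-process model. NOT the GASNet API: all names are hypothetical. */
#define NODES 4
#define SEG_BYTES 64

/* One memory segment per simulated node (zero-initialized). */
static unsigned char segment[NODES][SEG_BYTES];

/* An active message names a handler that runs on the target node. */
typedef void (*am_handler)(int target, size_t offset,
                           const void *payload, size_t nbytes);

/* Handler executed "at" the target: copy the payload into its segment. */
void put_handler(int target, size_t offset,
                 const void *payload, size_t nbytes) {
    memcpy(&segment[target][offset], payload, nbytes);
}

/* "Core" layer: deliver a message by invoking its handler on the target.
   A real network layer would marshal the request; here it is a direct call. */
void am_request(int target, am_handler handler, size_t offset,
                const void *payload, size_t nbytes) {
    assert(target >= 0 && target < NODES);
    handler(target, offset, payload, nbytes);
}

/* "Extended" layer: a put written only in terms of the core request. */
void toy_put(int target, size_t offset, const void *src, size_t nbytes) {
    assert(offset + nbytes <= SEG_BYTES);
    am_request(target, put_handler, offset, src, nbytes);
}

/* Read-back helper used only to inspect results in this model. */
unsigned char toy_peek(int target, size_t offset) {
    assert(target >= 0 && target < NODES && offset < SEG_BYTES);
    return segment[target][offset];
}
```

In this model, a platform with native remote writes would replace the body of `toy_put` with a direct hardware operation, which is one way to read "direct implementations of parts of full GASNet".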
Communication Optimizations
• Characterizing performance of current machines
– Latency, overlap (communication & computation)
• Plan to optimize automatically using a communication performance model
• Preliminary results: 10x improvement on Matmul
[Figure: bar chart of communication cost in usec (axis 0 to 25) for T3E/Shm, T3E/E-Reg, T3E/MPI, IBM/LAPI, IBM/MPI, Quadrics/Shm, Quadrics/MPI, Myrinet/GM, Myrinet/MPI, GigE/VIPL, GigE/MPI. Series: Added Latency, Send Overhead (Alone), Send & Rec Overhead, Rec Overhead (Alone)]
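To make the idea of a communication performance model concrete, here is a minimal sketch in the spirit of the latency and overhead breakdown above. It is a generic LogP-style estimate, not the model used in this project, and the struct, function names, and cost formulas are assumptions chosen for illustration. It compares issuing n small messages one at a time against overlapping them.

```c
#include <assert.h>

/* Hypothetical per-message cost parameters, in microseconds. */
typedef struct {
    double send_overhead; /* CPU time the sender spends per message */
    double recv_overhead; /* CPU time the receiver spends per message */
    double latency;       /* added network latency */
} comm_params;

/* Blocking: each message completes before the next one is issued. */
double blocking_cost(comm_params p, int n) {
    assert(n >= 0);
    return n * (p.send_overhead + p.latency + p.recv_overhead);
}

/* Pipelined: messages are issued back to back, so latency is paid once
   and the slower of the two overheads limits the steady-state rate.
   Assumes the network itself is never the bottleneck. */
double pipelined_cost(comm_params p, int n) {
    assert(n >= 0);
    if (n == 0) return 0.0;
    double stage = p.send_overhead > p.recv_overhead ? p.send_overhead
                                                     : p.recv_overhead;
    return p.send_overhead + p.latency + p.recv_overhead + (n - 1) * stage;
}
```

A compiler could evaluate both estimates with measured parameters for the target network and choose the cheaper code shape. The gap between them grows with the ratio of latency to overhead.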
Performance without Communication
[Figure: "Vector Addition" chart, time (ms, axis 0 to 30) vs. number of threads (1, 2, 4, 6, 8). Series: Compaq Cyclic, Compaq bsize = 10, Berkeley Cyclic, Berkeley bsize = 10]
Preliminary Parallel Performance
[Figure: "Gups" chart, time (seconds, axis 0 to 0.4) vs. number of threads (1, 2, 4, 6, 8, 16)]
[Figure: "IS -- Class B" chart, Mops/second (axis 0 to 40) vs. number of threads (1, 2, 4, 8, 16). Series: Compaq, Berkeley]
Costs of Pointer-to-Shared Arithmetic – Berkeley vs. HP
[Figure: "Pointer-to-shared operations" chart, time (microseconds, axis 0 to 0.09) by type of pointer (generic, cyclic, indefinite). Series: HP ptr + int, Berkeley ptr + int, HP ptr - ptr, Berkeley ptr - ptr, HP ptr == ptr, Berkeley ptr == ptr]
• HP is faster for most operations, since HP generates assembly code
• Both compilers optimize for “phaseless” pointers
• For some operations (pointer comparison), Berkeley can beat HP
• Expect gap to narrow once the proper optimizations are built into Berkeley UPC
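The three pointer types in the chart differ in how much work `ptr + int` requires. The sketch below models this in plain C under assumed names. The `sptr` struct and its functions are hypothetical, handle only non-negative increments, and do not reflect how either compiler actually packs or manipulates pointers. It shows why cyclic and indefinite ("phaseless") pointers allow a cheaper path than the generic block-cyclic case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pointer-to-shared representation. */
typedef struct {
    size_t thread; /* owning thread */
    size_t phase;  /* position inside the current block, < block size */
    size_t offset; /* element offset in the owner's local storage */
} sptr;

/* Generic block-cyclic case: carry from phase into thread, then into
   the local offset each time the pointer wraps past the last thread. */
sptr sptr_add_generic(sptr p, size_t k, size_t bsize, size_t threads) {
    assert(bsize > 0 && threads > 0);
    assert(p.phase < bsize && p.thread < threads);
    size_t phase_sum = p.phase + k;
    size_t thread_sum = p.thread + phase_sum / bsize;
    sptr r;
    r.phase = phase_sum % bsize;
    r.thread = thread_sum % threads;
    r.offset = p.offset - p.phase + (thread_sum / threads) * bsize + r.phase;
    return r;
}

/* Cyclic (block size 1): phase is always 0, so the phase step vanishes. */
sptr sptr_add_cyclic(sptr p, size_t k, size_t threads) {
    assert(threads > 0 && p.phase == 0);
    size_t thread_sum = p.thread + k;
    sptr r = { thread_sum % threads, 0, p.offset + thread_sum / threads };
    return r;
}

/* Indefinite: all elements live on one thread, so this is plain addition. */
sptr sptr_add_indefinite(sptr p, size_t k) {
    sptr r = { p.thread, 0, p.offset + k };
    return r;
}

/* Field-by-field comparison of two pointers. */
int sptr_equal(sptr a, sptr b) {
    return a.thread == b.thread && a.phase == b.phase && a.offset == b.offset;
}
```

The generic path needs two divide/modulo pairs, the cyclic path one, and the indefinite path none, which matches the general ordering one would expect between the three pointer types.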
Applications
• NAS Parallel Benchmark Sized Apps
– UPC MG complete
– UPC CG complete
– UPC GUPS
– GWU has done IS, EP, and FT
• Planning on
– Several Splash benchmarks
– Sparse Cholesky
– Possibly AMR
Mesh Generation
• Parallel Mesh Generation in UPC
• 2D Delaunay triangulation
• Based on Triangle software by Shewchuk (UCB)
• Parallel version from NERSC uses dynamic load balancing, software caching, and parallel sorting
Summary
• Lots of progress on
– Compiler
– Runtime
– Portable communication layer (GASNet)
– Applications
• Working on developing a large application that depends on UPC
– Mesh generation
– AMR (?), Sparse LU (?)