Eldorado
description
Transcript of Eldorado
![Page 1: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/1.jpg)
Eldorado
John Feo
Cray Inc
![Page 2: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/2.jpg)
2
Outline
Why multithreaded architectures
The Cray Eldorado
Programming environment
Program examples
![Page 3: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/3.jpg)
3
Overview
Eldorado is a peak in the North Cascades.
Internal Cray project name for the next MTA system.
Make the MTA-2 cost-effective and supportable.
Retain proven performance and programmability advantages of the MTA-2.
Eldorado
Hardware based on Red Storm.
Programming model and software based on MTA-2.
![Page 4: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/4.jpg)
4
~1970’s Generic computer
CPU Memory
Memory keeps pace with CPUMemory keeps pace with CPU
EverythingEverything is in Balance…
Instructions
![Page 5: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/5.jpg)
5
Processors have gotten much faster (~ 1000x)
Memories have gotten a little faster (~ 10x), and much larger(~ 1,000,000x)
System diameters have gotten much larger (~ 1000x)
Flops are for free, but bandwidth is very expensive
Systems are no longer balanced, processors are starved for data
… but balance is short-lived
![Page 6: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/6.jpg)
6
20th Century parallel computer
Avoid the straws and there are no flaws;otherwise, good luck to you
CPU
Memory
Cache
CPU Cache
CPU Cache
![Page 7: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/7.jpg)
7
Constraints on parallel programs
Place data near computation
Avoid modifying shared data
Access data in order and reuse
Avoid indirection and linked data-structures
Partition program into independent, balanced computations
Avoid adaptive and dynamic computations
Avoid synchronization and minimize inter-process communications
![Page 8: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/8.jpg)
8
A tiny solution space
![Page 9: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/9.jpg)
9
Multithreading
Hide latencies via parallelism
Maintain multiple active threads per processor, so that gaps introduced by long latency operations in one thread are filled by instructions in other threads
Requires special hardware support
Multiple threads
Single-cycle context switch
Multiple outstanding memory requests per thread
Fine-grain synchronization
![Page 10: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/10.jpg)
10
The generic transaction
Allocate memory
Write/read memory
Deallocate memory
![Page 11: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/11.jpg)
11
Execution time line – 1 thread
![Page 12: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/12.jpg)
12
Execution time line – 2 threads
![Page 13: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/13.jpg)
13
Execution time line – 3 threads
![Page 14: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/14.jpg)
14
Transaction per second on MTA
Threads P1 P2 P4
1
10
20
30
40
50
1,490.55
14,438.27
26,209.85
31,300.51
32,629.65
32,896.10
65,890.24 123,261.82
![Page 15: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/15.jpg)
15
Eldorado Processor (logical view)
i = n
i = 3
i = 2
i = 1
. . .
1 2 3 4
Sub- problem
A
i = n
i = 1
i = 0
. . .
Sub- problem
BSubproblem A
Serial Code
Unused streams
. . . .
Programs running in parallel
Concurrent threads of computation
Hardware streams (128)
Instruction Ready Pool;
Pipeline of executing instructions
![Page 16: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/16.jpg)
16
Eldorado System (logical view)
i = n
i = 3
i = 2
i = 1
. . .
1 2 3 4
Sub- problem
A
i = n
i = 1
i = 0
. . .
Sub- problem
BSubproblem A
Serial Code
Programs running in parallel
Concurrent threads of computation
Multithreaded across multiple processors
. . . . . . . . . . . .
![Page 17: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/17.jpg)
17
What is not important
Placing data near computation
Modifying shared data
Accessing data in order
Using indirection or linked data-structures
Partitioning program into independent, balanced computations
Using adaptive or dynamic computations
Minimizing synchronization operations
![Page 18: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/18.jpg)
18
What is important
![Page 19: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/19.jpg)
19
Eldorado system goals
Make the Cray MTA-2 cost-effective and supportable Mainstream Cray manufacturing Mainstream Cray service and support
Eldorado is a Red Storm with MTA-2 processors Eldorado leverages Red Storm technology Red Storm cabinet, cooling, power distribution Red Storm circuit boards and system interconnect Red Storm RAS system and I/O system (almost) Red Storm manufacturing, service, and support teams
Eldorado retains key MTA technology Instruction set Operating system Programming model and environment
![Page 20: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/20.jpg)
20
Red Storm
Red Storm consists of over 10,000 AMD Opteron™ processors connected by an innovative high speed, high bandwidth 3D mesh interconnect designed by Cray (Seastar)
Cray is responsible for the design, development, and delivery of the Red Storm system to support the Department of Energy's Nuclear stockpile stewardship program for advanced 3D modeling and simulation
Red Storm uses a distributed memory programming model (MPI)
![Page 21: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/21.jpg)
21
4 DIMM Slots4 DIMM Slots
CRAYSeastar™
CRAYSeastar™
CRAYSeastar™
CRAYSeastar™
L0 RAS ComputerL0 RAS Computer
Redundant VRMsRedundant VRMs
Red Storm Compute Board
![Page 22: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/22.jpg)
22
4 DIMM Slots4 DIMM Slots
CRAYSeastar2™
CRAYSeastar2™
CRAYSeastar2™
CRAYSeastar2™
L0 RAS ComputerL0 RAS Computer
Redundant VRMsRedundant VRMs
Eldorado Compute Board
![Page 23: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/23.jpg)
23
MTX Linux
Compute Service & IO
Service Partition• Linux OS• Specialized Linux nodes
Login PEs
IO Server PEs
Network Server PEs
FS Metadata Server PEs
Database Server PEs
Compute Partition
MTX (BSD)
RAID Controllers
Network
PCI-X
10 GigE
Fiber ChannelPCI-X
Eldorado system architecture
![Page 24: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/24.jpg)
24
Eldorado CPU
![Page 25: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/25.jpg)
25
Speeds and Feeds
CPU ASIC
140M memory ops
500M memory ops
1.5 GFlops
500M memory ops
100M memory ops
90M30M memory ops (1 4K processors)
16 GB DDR DRAM
Sustained memory rates are for random single word accesses over entire address space.
![Page 26: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/26.jpg)
26
Shared memory Some memory can be reserved as local memory at boot time Only compiler and runtime system have access to local memory
Memory module cache Decreases latency and increases bandwidth No coherency issues
8 word data segments randomly distributed across the memory system Eliminates stride sensitivity and hotspots Makes programming for data locality impossible Segment moves to cache, but only word moves to processor
Full/empty bits on all data words
Eldorado memory
tag bits data values
063
forwardtrap 1trap 2full-empty
![Page 27: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/27.jpg)
27
MTA-2 / Eldorado Comparisons
MTA-2 Eldorado
CPU clock speed 220 MHz 500 MHz
Max system size 256 P 8192 P
Max memory capacity 1 TB (4 GB/P) 128 TB (16 GB/P)
TLB reach 128 GB 128 TB
Network topology Modified Cayley graph 3D torus
Network bisection bandwidth 3.5 * P GB/s 15.36 * P2/3 GB/s
Network injection rate 220 MW/s per processor Variable (next slide)
![Page 28: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/28.jpg)
28
Eldorado Scaling
Example Topology 6x12x8 11x12x8 11x12x16 22x12x16 14x24x24
Processors 576 1056 2112 4224 8064
Memory capacity 9 TB 16.5 TB 33 TB 66 TB 126 TB
Sustainable remote memory reference rate (per processor)
60 MW/s 60 MW/s 45 MW/s 33 MW/s 30 MW/s
Sustainable remote memory reference rate (aggregate)
34.6 GW/s 63.4 GW/s 95.0 GW/s 139.4 GW/s 241.9 GW/s
Relative size 1.0 1.8 3.7 7.3 14.0
Relative performance 1.0 1.8 2.8 4.0 7.0
![Page 29: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/29.jpg)
29
Operating Systems LINUX on service and I/O nodes MTX on compute nodes Syscall offload for I/O
Run-Time System Job launch Node allocator Logarithmic loader Batch system – PBS Pro
Software
Programming Environment– Cray MTA compiler - Fortran, C, C++
– Debugger - mdb
– Performance tools: canal, traceview High Performance File Systems
Lustre
System Mgmt and Admin Accounting Red Storm Management System RSMS Graphical User Interface
The compute nodes run MTX, a multithreaded unix operating system
The service nodes run Linux
I/O service calls are offloaded to the service nodes
The programming environment runs on the service nodes
![Page 30: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/30.jpg)
30
CANAL
Compiler ANALysis
Static tool
Shows how the code is compiled and why
![Page 31: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/31.jpg)
31
Traceview
![Page 32: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/32.jpg)
32
Dashboard
![Page 33: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/33.jpg)
33
What is Eldorado’s sweet spot?
Any cache-unfriendly parallel application is likely to outperform on Eldorado
Any application whose performance depends upon ... Random access tables (GUPS, hash tables) Linked data structures (binary trees, relational graphs) Highly unstructured, sparse methods Sorting
Some candidate application areas: Adaptive meshes Graph problems (intelligence, protein folding, bioinformatics) Optimization problems (branch-and-bound, linear programming) Computational geometry (graphics, scene recognition and tracking)
![Page 34: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/34.jpg)
34
Sparse Matrix – Vector Multiply
C n x 1 = A n x m * B m x 1
Store A in packed row form A[nz], where nz is the number of non-zeros cols[nz] stores the column index of the non-zeros rows[n] stores the start index of each row in A
#pragma mta use 100 streams#pragma mta assert no dependencefor (i = 0; i < n; i++) { int j; double sum = 0.0; for (j = rows[i]; j < rows[i+1]; j++) sum += A[j] * B[cols[j]]; C[i] = sum;}
![Page 35: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/35.jpg)
35
Canal report
| #pragma mta use 100 streams | #pragma mta assert no dependence | for (i = 0; i < n; i++) { | int j; 3 P | double sum = 0.0; 4 P- | for (j = rows[i]; j < rows[i+1]; j++) | sum += A[j] * B[cols[j]]; 3 P | C[i] = sum; | }
Parallel region 2 in SpMVM Multiple processor implementation Requesting at least 100 streams
Loop 3 in SpMVM at line 33 in region 2 In parallel phase 1 Dynamically scheduled
Loop 4 in SpMVM at line 34 in loop 3 Loop summary: 3 memory operations, 2 floating point operations 3 instructions, needs 30 streams for full utilization, pipelined
![Page 36: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/36.jpg)
36
Performance
N = M = 1,000,000
Non-zeros 0 to 1000 per row, uniform distribution Nz = 499,902,410
T SpP
1
2
4
8
7.11
3.59
1.83
0.94
1.0
1.98
3.88
7.56
Time = (3 cycles * 499902410 iterations) / 220000000 cycles/sec = 6.82 sec
96% utilization
IBM Power4 1.7 GHz
26.1 s
![Page 37: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/37.jpg)
37
Integer sort
Any parallel sort works well on Eldorado
Bucket sort is best if universe is large and elements don’t cluster in just a few buckets
Bucket Sort
Count the number of elements in each bucket
Calculate the start of each bucket in dst
Copy elements from src to dst, placing each element in the correct segment
Sort the elements in each segment
![Page 38: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/38.jpg)
38
Step 1
for (i = 0; i < n; i++) count[src[i] >> shift] += 1;
Compiler automatically parallelizes the loop using an int_fetch_add instruction to increment count
src
Buckets
count
![Page 39: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/39.jpg)
39
Step 2
start(0) = 0;
for (i = 0; i < n_buckets; i++) start(i) = start(i-1) + count(i)
Compiler automatically parallelizes most first-order recurrences
src
dst
start 0 3 9 14 15 22 24 26 32
![Page 40: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/40.jpg)
40
Step 3
#pragma mta assert parallel#pragma mta assert noalias *src, *dstfor (i = 0; i < n; i++) { bucket = src(i) >> shift; index = start[bucket]++; dst(index) = src(I);}
Compiler can not automatically parallelize the loop because the order in which elements are copied to dst is not determinant
noalias pragma lets compiler generate optimal code (3 instructions)
The compiler uses an int_fetch_add instruction to increment start
![Page 41: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/41.jpg)
41
Step 4
Sort each segment
Since there are many more buckets than streams and we assume the number of elements per bucket is small, any sort algorithm will do we used a simple insertion sort
Most of the time is spent skipping through buckets of size 0 or 1
![Page 42: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/42.jpg)
42
Performance
N = 100,000,000
T SpP
1
2
4
8
9.14
4.56
2.41
1.26
1.0
2.0
3.79
7.25
9.14 sec * 220000000 cycles/sec / 100000000 elements
=20.1 cycles/elements
![Page 43: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/43.jpg)
43
Prefix operations on lists
P(i) = P(i - 1) + V(i)
List ranking - determine the rank of each node in the list A common procedure that occurs in many graph algorithms
1512
1
73
16
95
1413
6
10
11
4
28
N
R
1 2
2 13
3 4
4 14
5 6
7 10
7 8
3 15
9 10
6 11
11 1
12 1
12 13
1 9
14 15
8 16
16
5
![Page 44: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/44.jpg)
44
Hellman and JaJa algorithm
Mark nwalk nodes including the first node This step divides the list into nwalk sublists
Traverse the sublists computing each node’s rank in the sublist
From the lengths of the sublists, compute the start of each sublist
Re-traverse each sublist, incrementing each node’s local rank by the sublist’s start
![Page 45: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/45.jpg)
45
Steps 1 and 2
1512
1
73
16
95
1413
6
10
11
4
28
N 1 2 3 4 5 6 7 8 9 10 11 112 13 14 15 16
R 1 4 5 1 3 2
R 2 11
R 2 1 4 1 3
R 11
R 1 132 4
![Page 46: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/46.jpg)
46
Step 3
W
L
0 1
0 2
2 3
4 5
4 5
1 4
W
L
0 1
0 2
2 3
6 9
4 5
6 5
W
L
0 1
0 2
2 3
6 11
4 5
12 14
W
L
0 1
0 2
2 3
6 11
4 5
12 16
![Page 47: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/47.jpg)
47
Step 4
1512
1
73
16
95
1413
6
10
11
4
28
N 1 2 3 4 5 6 7 8 9 10 11 112 13 14 15 16
R + 6 7 10 11 1 9 8
R + 0 2 11
R + 2 4 3 6 1 5
R + 11 112
R + 12 13 11514 16
![Page 48: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/48.jpg)
48
Performance Comparison
Performance of List Ranking on Cray MTA (For Ordered List)
Problem Size (M)
0 20 40 60 80 100
Execution T
ime (S
econds)
0
1
2
3
4
1 Proc2 Proc4 Proc8 Proc
Performance of List Ranking on SUN E4500 (For Random List)
Problem Size (M)
0 20 40 60 80 100
Execution T
ime (S
econds)
0
20
40
60
80
100
120
140
1 Proc2 Proc4 Proc8 Proc
Performance of List Ranking on Cray MTA (For Random List)
Problem Size (M)
0 20 40 60 80 100
Execution T
ime (S
econds)
0
1
2
3
4
1 Proc2 Proc4 Proc8 Proc
Performance of List Ranking on Sun E4500 (For Ordered List)
Problem Size (M)
0 20 40 60 80 100
Execution T
ime (S
econds)
0
10
20
30
40
1 Proc2 Proc4 Proc8 Proc
![Page 49: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/49.jpg)
49
Two more kernels
Linked List Search
N = 1000 N = 10000
SunFire 880 MHz (1P)
Intel Xeon 2.8 GHz (32b) (1P)
Cray MTA-2 (1P)
Cray MTA-2 (10P)
9.30
7.15
0.49
0.05
107.0
40.0
1.98
0.20
Random Access (GUPs)
Giga updates per second
IBM Power4 1.7 GHz (256P)
Cray MTA-2 (1P)
Cray MTA-2 (5P)
Cray MTA-2 (10P)
0.0055
0.41
0.204
0.405
![Page 50: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/50.jpg)
50
Eldorado Timeline
ASIC Tape-out
Prototype Systems
Early Systems
(moderate-size of 500-1000P)
2004 2005 2006
![Page 51: Eldorado](https://reader036.fdocuments.us/reader036/viewer/2022062808/56815476550346895dc28d99/html5/thumbnails/51.jpg)
51
Eldorado Summary
Eldorado is a low-risk development project because it reuses much of the Red Storm infrastructure which is part of Cray’s standard product path Ranier module Cascade LWP
Eldorado retains the MTA-2’s easy to program model with parallelism managed by compiler and run-time Ideal for applications that do not run well on SMP clusters High productivity system for algorithms research and development
Applications scale to very large system sizes
15-18 months to system availability