Transcript of Many-core computing class 2 - Vrije Universiteit Amsterdam
PARALLEL PROGRAMMING
MANY-CORE COMPUTING
FOR THE LOFAR TELESCOPE
ROB VAN NIEUWPOORT
Who am I?
10 years of Grid / Cloud computing
6 years of many-core computing, radio astronomy
Netherlands eScience center
Software engineering, many-core solutions, astronomy
ASTRON: Netherlands institute for radio astronomy
LOFAR
SKA
Accelerators
Big Data
Novel instruments and applications produce lots of data
LHC, telescopes, climate simulations, genetics, medical scanners, Facebook, Twitter, …
These instruments and applications cannot work without computer science anymore
A lot of unexploited knowledge that we can only find if the data across disciplines is accessible and usable…
Challenges: data handling, processing
Need large-scale parallelism
eScience
Enhanced science
Apply ICT in the broadest sense
Data-driven research across all scientific disciplines
Develop generic Big Data tools
Collaboration between cross-disciplinary researchers
Schedule
LOFAR application
Introduction to radio astronomy
LOFAR & Big Data
High-performance computing in LOFAR
Many-core computing for LOFAR
THE LOFAR SOFTWARE TELESCOPE
Why Radio?
Credit: NASA/IPAC
Centaurus A, visible light and radio
The Dwingeloo telescope
Dwingeloo telescope, 1954 - 1990s
25 m dish, at the time the largest steerable telescope in the world
Hydrogen line (21 cm), discovered the galaxies Dwingeloo I & II
Now a national monument
Westerbork Synthesis Radio Telescope
14 dishes of 25 m along a 3 km baseline
Signals combined in hardware
Built in 1970, upgraded in 1999
120 MHz - 8.3 GHz
Software radio telescopes (1)
We cannot keep on building larger dishes
Replace dishes with thousands of small antennas
Combine signals in software
Software radio telescopes (2)
Software telescopes are being built now
LOFAR: LOw Frequency Array (Netherlands, Europe)
ASKAP: Australian Square Kilometre Array Pathfinder
MeerKAT: Karoo Array Telescope (South Africa)
2020: SKA, Square Kilometre Array
Exa-scale! (10^18: giga, tera, peta, exa)
LOFAR overview
Hierarchical: receiver, tile, station, telescope
Central processing: IBM BG/P in Groningen
Dedicated fibers
LOFAR
Largest telescope in the world
88,000 omni-directional antennas
Hundreds of Gbit/s (14x LHC)
Hundreds of teraFLOPS
10-250 MHz
100x more sensitive
LOFAR low-band antennas
LOFAR high-band antennas
Station (150m)
2x3 km
Station cabinet
Station processing
Special-purpose hardware, FPGAs
200 MHz ADC, filtering
Send to BG/P
Dedicated fiber
UDP
LOFAR science
Imaging
Epoch of re-ionization
Cosmic rays
Extragalactic surveys
Transients
Pulsars
A LOFAR observation
Cas A
Supernova remnant
115–160 MHz
12 stations
Searching for pulsars
Rotating neutron stars
Discovered in 1967
About 2000 are known
Big mass, precise period
“Lab in the sky”
Probe space and gravitation
Investigate general relativity
Processing pipeline
Astronomy pipelines: real time and offline
Data volume:
10 terabit/s (265 DVDs/s)
200 gigabit/s (5 DVDs/s)
50 gigabit/s (1.3 DVDs/s)
Flexibility
Data intensiveness
Processing overview
Online pipelines
Stella, the IBM Blue Gene/P
Was #2, now #266
850 MHz PowerPC cores
Designed for energy efficiency
Hardware support for complex numbers
3-D torus, collective, barrier, 10 GbE, and JTAG networks
2.5 racks = 10,880 cores = 37 TFLOP/s + 160 x 10 Gb/s
Optimizations
We need high bandwidth, high performance, real-time behavior
Use assembly for performance-critical code [SPAA'06]
Avoid resource contention by smart scheduling [PPoPP'10]
Run part of application on I/O node [PPoPP'08]
Use optimized network protocol [PDPTA'09]
Modify OS to avoid software TLB miss handler [IJHPC'10]
Use real-time scheduler [PPoPP'10]
Drop data if running behind [PPoPP'10]
Use asynchronous I/O [PPoPP'10]
Correlator
Polyphase filter (a prism in software): FIR filter, FFT
Beamform
Correlate, integrate over time and frequency
BG/P performance
Correlator is O(n^2)
Achieves 96% of the theoretical peak
Problem: processing is challenging
Special-purpose hardware
Inflexible
Expensive to design
Long time from design to production
Supercomputer
Flexible
Expensive to purchase
Expensive maintenance
Expensive due to electrical power costs
For SKA, we need orders of magnitude more!
Many-core advantages
Fast and cheap
Latest ATI HD 6990 has 3072 cores, 5.1 TFLOPS
Costs only 575 euro!
Comparison: the entire 72-node DAS-4 VU cluster has 4.4 TFLOPS
Potentially more power efficient
Example: in theory, the ATI 4870 GPU is 15 times more power efficient than the BG/P
Many-cores are becoming more general
CPUs are incorporating many-core techniques
Research questions
Architectural problems:
Which part of the theoretical performance can be achieved in practice?
Can we get the data into the accelerators fast enough?
Is the performance consistent enough for real-time use?
Which architectural properties are essential?
Many-cores
Intel Core i7 (quad core + hyperthreading + SSE)
Sony/Toshiba/IBM Cell/B.E. QS21 blade
GPUs: NVIDIA Tesla C1060/GTX280, ATI 4870
Compare with the production code on the BG/P
To compare architectures fairly, everything was implemented in assembly
Reader: Rob V. van Nieuwpoort and John W. Romein, "Correlating Radio Astronomy Signals with Many-Core Hardware"
architecture                                Intel Core i7  IBM BG/P     ATI 4870          NVIDIA C1060      STI Cell
cores x FPUs per core = total FPUs          4 x 4 = 16     4 x 2 = 8    160 x 5 = 800     30 x 8 = 240      8 x 4 = 32
gflops                                      85             14           1200              936               205
registers/core x width (floats)             16 x 4 = 64    32 x 2 = 64  1024 x 4 = 4096   2048 x 1 = 2048   128 x 4 = 512
device RAM bandwidth (GB/s)                 n.a.           n.a.         115.2             102               n.a.
host RAM bandwidth (GB/s)                   25.6           13.6         8.0 (4.6 meas.)   8.0 (5.6 meas.)   25.8
per-operation bandwidth slowdown vs. BG/P   3.3            1.0          10.4 (host: 150)  9.2 (host: 117)   7.9
Essential many-core properties
Correlator algorithm
for all channels (63488)
  for all combinations of two stations (2080)
    for all combinations of polarizations (4)
      complex float sum = 0;
      for the time integration interval (768 samples)
        sum += sample1 * sample2;  // complex multiplication
      store sum in memory
Correlator optimization
Overlap data transfers and computations
Exploit caches / shared memory / local store
Loop unrolling
Tiling
Scheduling
SIMD operations
Assembly
...
Correlator: Arithmetic Intensity
complex multiply-add: 8 flops
sample: complex float (real + imaginary, 2 x 4 bytes = 8 bytes)
AI: 8 flops / 2 samples (16 bytes) = 0.5 flop/byte
Correlator inner loop:
for (time = 0; time < integrationTime; time++) {
  sum += samples[ch][station1][time][pol1] *
         samples[ch][station2][time][pol2];
}
Correlator AI optimization
Combine polarizations
complex multiply-add: 8 flops
2 polarizations: X, Y
calculate XX, XY, YX, YY: 4 x 8 = 32 flops per square
complex XY-sample: 16 bytes (x2 stations = 32 bytes)
32 flops / 32 bytes = 1 flop/byte
Tiling
1 flop/byte -> 2.4 flops/byte
but we need registers: a 1x1 tile already needs 16!
Tuning the tile size

tile size   flops   memory loads (bytes)   arithmetic intensity   min. # registers (floats)
1 x 1         32            32                     1.00                     16
2 x 1         64            48                     1.33                     24
2 x 2        128            64                     2.00                     44
3 x 2        192            80                     2.40                     60
3 x 3        288            96                     3.00                     88
4 x 3        384           112                     3.43                    112
4 x 4        512           128                     4.00                    148
Correlator implementation
Intel Core i7 CPU
The Cell Broadband Engine
ATI & NVIDIA GPUs
Implementation strategy on the CPU
Partition frequencies over the cores
Independent
Multithreading
Each core computes its own correlation triangle
Use tiling: 2x2
Vectorize with SSE
Unroll time loop; compute 4 time steps in parallel
Implementation strategy on the Cell/B.E.
Partition frequencies over the SPEs
Independent
Each SPE computes its own correlation triangle
Use tiling: 4x4 (128 registers!)
Keep a strip of tiles in the local store: more reuse
Use double buffering from memory to local store
Overlap communication and computation
Vectorize
Different vector elements compute different
polarizations
Implementation strategy on GPUs
Partition frequencies over the streaming multiprocessors
Independent
Double buffering between GPU and host
Exploit data reuse as much as possible
Each streaming multiprocessor computes a correlation
triangle
Threads/cores within a SM cooperate on a single triangle
Load samples into shared memory
Use tiling (4x3 on ATI, 3x2 on NVIDIA)
Evaluation
How to cheat with speedups
How can this be? The Core i7 CPU has 154 GFLOPS; the NVIDIA GTX 580 GPU has 1581 GFLOPS (10.3x more)
Heavily optimize the GPU version:
Coalescing, shared memory
Tiling, loop unrolling
…
Do not optimize the CPU version:
1 core only
No SSE
Cache unaware
No loop unrolling or tiling, …
Result: very high speedups!
Exception: kernels that do interpolations (texturing hardware)
Solution: optimize the CPU version
Use efficiencies: % of peak performance, Roofline
Theoretical performance bounds
Distinguish between global and local (host vs. device)
Local AI = 1 .. 4
Depends on tile size and # registers
Max performance = AI * memory bandwidth
ATI (4x3 tile): 3.43 * 115.2 = 395 gflops
Peak of 1200 needs an AI of 10.4, or 350 GB/s bandwidth
NVIDIA (3x2 tile): 2.40 * 102.0 = 245 gflops
Peak of 996 needs an AI of 9.8, or 415 GB/s bandwidth
Can we achieve more than this?
Theoretical performance bounds
Global AI = #stations + 1 (LOFAR: 65)
Max performance = AI * memory bandwidth
Use the bandwidth of PCI-e 2.0
Max performance of the GPUs with the global AI:
ATI: 65 * 4.6 = 300 gflops (needs 19 GB/s for peak)
NVIDIA: 65 * 5.6 = 363 gflops (needs 15 GB/s for peak)
Correlator performance
Measured power efficiency
Current CPUs (even at 45 nm) are still less power efficient than the BG/P (90 nm)
GPUs are not 15x, but only 2-3x more power efficient than the BG/P
The 65 nm Cell is 4x more power efficient than the BG/P
Scalability on NVIDIA GTX480
[Plot: gflops (0-1000) vs. number of stations (16-512)]
Weak and strong points

Intel Core i7:
+ well-known toolchain
- few registers
- limited shuffling

IBM BG/P:
+ L2 prefetch unit
+ high memory bandwidth
- double precision only
- expensive

ATI 4870:
+ largest # cores
+ shuffling support
- low PCI-e bandwidth (4.6 GB/s)
- transfer slows down kernel
- CAL is low-level
- bad Brook+ performance
- not well documented

NVIDIA Tesla C1060:
+ CUDA is high-level
- low PCI-e bandwidth (5.6 GB/s)

STI Cell:
+ explicit cache (LS)
+ shuffle capabilities
+ power efficiency
- multiple parallelism levels (6!)
- no increment in odd pipeline
Conclusions
Software telescopes are the future, extremely challenging
Software provides the required flexibility
Many-core architectures show great potential (28x)
PCI-e is a bottleneck
Compared to the BG/P or CPUs, the many-cores have low memory bandwidth per operation
This is OK if the architecture allows efficient data reuse
Optimal use of registers (tile size + SIMD strategy)
Exploit caches / local memories / shared memories
The Cell has 8 times lower memory bandwidth per operation, but still works thanks to explicit cache control and large number of registers
Backup slides
Vectorizing the correlator
How do we efficiently use the vectors ?
for (pol1 = 0; pol1 < nrPolarizations; pol1++) {
for (pol2 = 0; pol2 < nrPolarizations; pol2++) {
float sum = 0.0;
for (time = 0; time < integrationTime; time++) {
sum += samples[ch][station1][time][pol1]
* samples[ch][station2][time][pol2];
}
}
}
Vectorizing the correlator
Option 1: vectorize over time
Unroll time loop 4 times
for (pol1 = 0; pol1 < nrPolarizations; pol1++) {
for (pol2 = 0; pol2 < nrPolarizations; pol2++) {
float sum = 0.0;
for (time = 0; time < integrationTime; time += 4) {
sum += samples[ch][station1][time+0][pol1]
* samples[ch][station2][time+0][pol2];
sum += samples[ch][station1][time+1][pol1]
* samples[ch][station2][time+1][pol2];
sum += samples[ch][station1][time+2][pol1]
* samples[ch][station2][time+2][pol2];
sum += samples[ch][station1][time+3][pol1]
* samples[ch][station2][time+3][pol2];
}
}
}
Vectorizing the correlator
for (pol1 = 0; pol1 < nrPolarizations; pol1++) {
for (pol2 = 0; pol2 < nrPolarizations; pol2++) {
vector float sum = {0.0, 0.0, 0.0, 0.0};
for (time = 0; time < integrationTime; time += 4) {
vector float s1 = {
samples[ch][station1][time+0][pol1],
samples[ch][station1][time+1][pol1],
samples[ch][station1][time+2][pol1],
samples[ch][station1][time+3][pol1],
};
vector float s2 = {
samples[ch][station2][time+0][pol2],
samples[ch][station2][time+1][pol2],
samples[ch][station2][time+2][pol2],
samples[ch][station2][time+3][pol2],
};
sum = spu_madd(s1, s2, sum); // sum = sum + s1 * s2
}
result = spu_extract(sum, 0) + spu_extract(sum, 1) +
         spu_extract(sum, 2) + spu_extract(sum, 3);
}
}
Vectorizing the correlator
Option 2: vectorize over polarization
for (pol1 = 0; pol1 < nrPolarizations; pol1++) {
for (pol2 = 0; pol2 < nrPolarizations; pol2++) {
float sum = 0.0;
for (time = 0; time < integrationTime; time++) {
sum += samples[ch][station1][time][pol1]
* samples[ch][station2][time][pol2];
}
}
}
Vectorizing the correlator
Option 2: vectorize over polarization
Remove polarization loops (4 combinations)
float sumXX = 0.0, sumXY = 0.0, sumYX = 0.0, sumYY = 0.0;
for (time = 0; time < integrationTime; time++) {
  sumXX += samples[ch][station1][time][0]
         * samples[ch][station2][time][0]; // XX
  sumXY += samples[ch][station1][time][0]
         * samples[ch][station2][time][1]; // XY
  sumYX += samples[ch][station1][time][1]
         * samples[ch][station2][time][0]; // YX
  sumYY += samples[ch][station1][time][1]
         * samples[ch][station2][time][1]; // YY
}
Vectorizing the correlator
vector float sum = {0.0, 0.0, 0.0, 0.0};
for (time = 0; time < integrationTime; time++) {
vector float s1 = {
samples[ch][station1][time][0],
samples[ch][station1][time][0],
samples[ch][station1][time][1],
samples[ch][station1][time][1],
};
vector float s2 = {
samples[ch][station2][time][0],
samples[ch][station2][time][1],
samples[ch][station2][time][0],
samples[ch][station2][time][1],
};
sum = spu_madd(s1, s2, sum); // sum = sum + s1 * s2
// sum now contains {XX, XY, YX, YY}
}
Delay Compensation
feature                     Cell/B.E.                        GPUs
access times                uniform                          non-uniform
cache sharing level         single thread (SPE)              all threads in a multiprocessor
access to off-chip memory   not possible, only through DMA   supported
memory access overlapping   asynchronous DMA                 hardware-managed thread preemption (tens of thousands of threads)
communication               between SPEs through the EIB     independent thread blocks + shared memory within a block
It's all about the memory