Many-core computing class 2 - Vrije Universiteit Amsterdam


PARALLEL PROGRAMMING

MANY-CORE COMPUTING

FOR THE LOFAR TELESCOPE

ROB VAN NIEUWPOORT

Rob van Nieuwpoort

rob@cs.vu.nl


Who am I

10 years of Grid / Cloud computing

6 years of many-core computing, radio astronomy

Netherlands eScience center

Software engineering, many-core solutions, astronomy

ASTRON: Netherlands institute for radio astronomy

LOFAR

SKA

Accelerators

2

Big Data

Novel instruments and applications produce lots of data

LHC, telescopes, climate simulations, genetics, medical scanners, Facebook, Twitter, …

These instruments and applications cannot work without computer science anymore

A lot of unexploited knowledge that we can only find if the data across disciplines is accessible and usable…

Challenges: data handling, processing

Need large-scale parallelism

3

eScience

Enhanced science

Apply ICT in the broadest sense

Data-driven research across all scientific disciplines

Develop generic Big Data tools

Collaboration between cross-disciplinary researchers

4

Schedule

LOFAR application

Introduction to radio astronomy

LOFAR & Big Data

High-performance computing in LOFAR

Many-core computing for LOFAR

5

THE LOFAR SOFTWARE TELESCOPE

Why Radio?

Credit: NASA/IPAC

Centaurus A, visible light and radio

The Dwingeloo telescope

Dwingeloo telescope, 1954–1990s

25 m dish, largest steerable telescope in the world

Hydrogen line (21 cm), galaxies Dwingeloo I & II

Now a national monument

Westerbork synthesis radio telescope

14 dishes of 25 m, spread over 3 km

Combined in hardware

Built in 1970, upgraded in 1999

120 MHz - 8.3 GHz

Software radio telescopes (1) 11

We cannot keep on building larger dishes

Replace dishes with thousands of small antennas

Combine signals in software

Software radio telescopes (2)

Software telescopes are being built now

LOFAR: LOw Frequency Array (Netherlands, Europe)

ASKAP: Australian Square Kilometre Array Pathfinder

MeerKAT: Karoo Array Telescope (South Africa)

2020: SKA, Square Kilometre Array

Exa-scale! (10^18: giga, tera, peta, exa)

LOFAR overview

Hierarchical

Receiver

Tile

Station

Telescope

Central processing: Groningen, IBM BG/P

Dedicated fibers

LOFAR 14

Largest telescope in the world

88,000 omni-directional antennas

Hundreds of Gbit/s (14x LHC)

Hundreds of teraFLOPS

10–250 MHz

100x more sensitive

LOFAR low-band antennas

15

LOFAR high-band antennas

16

Station (150m)

2x3 km

Station cabinet

Station processing

Special-purpose hardware, FPGAs

200 MHz ADC, filtering

Send to BG/P

Dedicated fiber

UDP

LOFAR science 21

Imaging

Epoch of re-ionization

Cosmic rays

Extragalactic surveys

Transients

Pulsars

A LOFAR observation

Cas A

Supernova remnant

115–160 MHz

12 stations

Searching for pulsars

Rotating neutron stars

Discovered in 1967

About 2000 are known

Big mass, precise period

“Lab in the sky”

Probe space and gravitation

Investigate general relativity

Processing pipeline

Astronomy pipelines: real time + offline

Data volume:
10 terabit/s (265 DVDs/s)
200 gigabit/s (5 DVDs/s)
50 gigabit/s (1.3 DVDs/s)

Flexibility

Data intensiveness

Processing overview 28

29

Online pipelines

Stella, the IBM Blue Gene/P 30

Was #2, now #266

850 MHz PowerPC cores

Designed for energy efficiency

Hardware support for complex numbers

3-D torus, collective, barrier, 10 GbE, JTAG networks

2½ racks = 10,880 cores = 37 TFLOP/s + 160*10 Gb/s

Optimizations

We need high bandwidth, high performance, real-time behavior

Use assembly for performance-critical code [SPAA'06]

Avoid resource contention by smart scheduling [PPoPP'10]

Run part of application on I/O node [PPoPP'08]

Use optimized network protocol [PDPTA'09]

Modify OS to avoid software TLB miss handler [IJHPC'10]

Use real-time scheduler [PPoPP'10]

Drop data if running behind [PPoPP'10]

Use asynchronous I/O [PPoPP'10]

Correlator 32

Polyphase filter (prism): FIR filter, FFT

Beamform

Correlate, integrate

[Figure: time / frequency plane]

BG/P performance 33

Correlator is O(n²) in the number of stations

Achieves 96% of the theoretical peak

Problem: processing is challenging

Special-purpose hardware

Inflexible

Expensive to design

Long time from design to production

Supercomputer

Flexible

Expensive to purchase

Expensive maintenance

Expensive due to electrical power costs

For SKA, we need orders of magnitude more!

Many-core advantages 35

Fast and cheap

Latest ATI HD 6990 has 3072 cores, 5.1 TFLOPS

Costs only 575 euro!

Comparison: the entire 72-node DAS-4 VU cluster has 4.4 TFLOPS

Potentially more power efficient

Example: in theory, the ATI 4870 GPU is 15 times more power efficient than the BG/P

Many-cores are becoming more general

CPUs are incorporating many-core techniques

Research questions

Architectural problems:

Which part of the theoretical performance can be achieved in practice?

Can we get the data into the accelerators fast enough?

Is the performance consistent enough for real-time use?

Which architectural properties are essential?

Many-cores

Intel Core i7: quad core + hyperthreading + SSE

Sony/Toshiba/IBM Cell/B.E. QS21 blade

GPUs: NVIDIA Tesla C1060/GTX280, ATI 4870

Compare with production code on BG/P

Compare architectures: implemented everything in assembly

Reader: Rob V. van Nieuwpoort and John W. Romein, "Correlating Radio Astronomy Signals with Many-Core Hardware"

architecture                          Intel Core i7   IBM BG/P      ATI 4870           NVIDIA C1060       STI Cell
cores x FPUs per core = total FPUs    4 x 4 = 16      4 x 2 = 8     160 x 5 = 800      30 x 8 = 240       8 x 4 = 32
gflops                                85              14            1200               936                205
registers/core x width = floats       16 x 4 = 64     32 x 2 = 64   1024 x 4 = 4096    2048 x 1 = 2048    128 x 4 = 512
device RAM bandwidth (GB/s)           n.a.            n.a.          115.2              102                n.a.
host RAM bandwidth (GB/s)             25.6            13.6          8.0 (PCI-e: 4.6)   8.0 (PCI-e: 5.6)   25.8
per-op bandwidth slowdown vs BG/P     3.3             1.0           10.4 (host: 150)   9.2 (host: 117)    7.9

Essential many-core properties

38

Correlator algorithm 39

For all channels (63488)
  For all combinations of two stations (2080)
    For all combinations of polarizations (4)
      Complex float sum = 0;
      For the time integration interval (768 samples)
        sum += sample1 * sample2   (complex multiplication)
      Store sum in memory
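As a concrete illustration of this loop nest, here is a minimal C sketch. It is not the production LOFAR code: the array layout, the complex type, and the output format are assumptions, and the conjugation that a real correlator applies to one of the inputs is omitted, as on the slide.

#include <complex.h>

#define CHANNELS 63488
#define STATIONS 64        /* 64 stations -> 64 * 65 / 2 = 2080 pairs */
#define POLS     2
#define TIMES    768       /* samples per integration interval        */

typedef float complex fcomplex;

/* samples[channel][station][time][polarization] */
void correlate(const fcomplex (*samples)[STATIONS][TIMES][POLS],
               fcomplex (*vis)[STATIONS][STATIONS][POLS][POLS])
{
  for (int ch = 0; ch < CHANNELS; ch++)
    for (int s1 = 0; s1 < STATIONS; s1++)
      for (int s2 = 0; s2 <= s1; s2++)              /* correlation triangle */
        for (int p1 = 0; p1 < POLS; p1++)
          for (int p2 = 0; p2 < POLS; p2++) {
            fcomplex sum = 0;
            for (int t = 0; t < TIMES; t++)         /* time integration     */
              sum += samples[ch][s1][t][p1] * samples[ch][s2][t][p2];
            vis[ch][s1][s2][p1][p2] = sum;          /* store the visibility */
          }
}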

Correlator optimization 40

Overlap data transfers and computations

Exploit caches / shared memory / local store

Loop unrolling

Tiling

Scheduling

SIMD operations

Assembly

...

Correlator: Arithmetic Intensity 41/42

complex multiply-add: 8 flops

sample: complex float (real + imaginary part): 8 bytes

Correlator inner loop:

for (time = 0; time < integrationTime; time++) {
  sum += samples[ch][station1][time][pol1] *
         samples[ch][station2][time][pol2];
}

AI: 8 flops, 2 samples loaded: 8 / 16 = 0.5 flop/byte

Correlator AI optimization 43

Combine polarizations

complex multiply-add: 8 flops

2 polarizations, X and Y: calculate XX, XY, YX, YY

32 flops per station pair

Complex XY sample: 16 bytes (x2 stations = 32 bytes)

1 flop/byte

Tiling

1 flop/byte -> 2.4 flops/byte

but we need registers: a 1x1 tile already needs 16!

Tuning the tile size

tile size   flops   memory loads (bytes)   arithmetic intensity   minimum # registers (floats)
1 x 1         32      32                    1.00                    16
2 x 1         64      48                    1.33                    24
2 x 2        128      64                    2.00                    44
3 x 2        192      80                    2.40                    60
3 x 3        288      96                    3.00                    88
4 x 3        384     112                    3.43                   112
4 x 4        512     128                    4.00                   148
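The flops and bytes columns follow directly from the tile shape: a w x h tile covers w*h station pairs at 32 flops each per time sample and loads one 16-byte XY sample per station, so AI = (32*w*h) / (16*(w+h)). The small sketch below (an assumed helper, not from the slides) reproduces the AI column; the register count depends on the SIMD strategy and is not modeled here.

#include <stdio.h>

/* Reproduce the arithmetic-intensity column of the tile-size table:
   a w x h tile does 32*w*h flops per time sample and loads one
   16-byte XY sample from each of the w+h stations.                 */
int main(void)
{
  const int tiles[][2] = { {1,1}, {2,1}, {2,2}, {3,2}, {3,3}, {4,3}, {4,4} };
  for (int i = 0; i < (int)(sizeof tiles / sizeof tiles[0]); i++) {
    int w = tiles[i][0], h = tiles[i][1];
    int flops = 32 * w * h;
    int bytes = 16 * (w + h);
    printf("%d x %d: %3d flops, %3d bytes, AI = %.2f\n",
           w, h, flops, bytes, (double)flops / bytes);
  }
  return 0;
}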

Correlator implementation

Intel Core i7 CPU

The Cell Broadband Engine

ATI & NVIDIA GPUs

Implementation strategy on CPU 46

Partition frequencies over the cores

Independent

Multithreading

Each core computes its own correlation triangle

Use tiling: 2x2

Vectorize with SSE

Unroll time loop; compute 4 time steps in parallel
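A minimal SSE sketch of this time-vectorized inner loop is given below as a rough illustration only: it assumes real-valued samples, a single polarization combination, and a time count that is a multiple of 4, rather than the hand-tuned assembly actually used.

#include <xmmintrin.h>   /* SSE intrinsics */

/* Correlate one station pair for one channel and one polarization
   combination, processing 4 time samples per SSE register.        */
static float correlate_pair_sse(const float *s1, const float *s2, int n)
{
  __m128 sum = _mm_setzero_ps();
  for (int t = 0; t < n; t += 4) {
    __m128 a = _mm_loadu_ps(&s1[t]);           /* 4 consecutive time samples */
    __m128 b = _mm_loadu_ps(&s2[t]);
    sum = _mm_add_ps(sum, _mm_mul_ps(a, b));   /* SSE has no fused madd      */
  }
  float partial[4];
  _mm_storeu_ps(partial, sum);                 /* horizontal reduction       */
  return partial[0] + partial[1] + partial[2] + partial[3];
}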

Implementation strategy on the Cell/BE 47

Partition frequencies over the SPEs

Independent

Each SPE computes its own correlation triangle

Use tiling: 4x4 (128 registers!)

Keep a strip of tiles in the local store: more reuse

Use double buffering from memory to local store

Overlap communication and computation

Vectorize

Different vector elements compute different

polarizations
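A rough sketch of the double-buffering pattern on an SPE is shown below, using the Cell SDK's MFC intrinsics from spu_mfcio.h; the chunk size, tag usage, and effective-address list are illustrative assumptions, not the LOFAR implementation.

#include <spu_mfcio.h>

#define CHUNK 4096   /* bytes of samples per DMA transfer (example size) */

static char buf[2][CHUNK] __attribute__((aligned(128)));

/* Fetch nChunks chunks from main-memory addresses ea[i] into the local
   store, overlapping the DMA of chunk i+1 with the work on chunk i.    */
void process_chunks(const unsigned long long *ea, int nChunks)
{
  mfc_get(buf[0], ea[0], CHUNK, 0, 0, 0);                /* prefetch chunk 0, tag 0 */
  for (int i = 0; i < nChunks; i++) {
    int cur = i & 1, next = cur ^ 1;
    if (i + 1 < nChunks)
      mfc_get(buf[next], ea[i + 1], CHUNK, next, 0, 0);  /* start next transfer     */
    mfc_write_tag_mask(1 << cur);                        /* wait for current chunk  */
    mfc_read_tag_status_all();
    /* ... correlate the samples in buf[cur] here ... */
  }
}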

Implementation strategy on GPUs 48

Partition frequencies over the streaming multiprocessors

Independent

Double buffering between GPU and host

Exploit data reuse as much as possible

Each streaming multiprocessor computes a correlation

triangle

Threads/cores within a SM cooperate on a single triangle

Load samples into shared memory

Use tiling (4x3 on ATI, 3x2 on NVIDIA)
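A host-side sketch of the double-buffering step between host and GPU, using the CUDA runtime C API; the buffer sizes, names, and the elided kernel launch are illustrative assumptions, and the host buffer should be pinned (e.g. with cudaMallocHost) for the copies to really overlap.

#include <cuda_runtime.h>

#define CHUNK (1 << 20)   /* bytes of samples per transfer (example size) */

/* Stream nChunks chunks of samples to the device on two CUDA streams so
   that the copy of chunk i+1 can overlap with the processing of chunk i. */
void feed_gpu(const char *host_samples, int nChunks)
{
  cudaStream_t stream[2];
  char *dev[2];
  for (int b = 0; b < 2; b++) {
    cudaStreamCreate(&stream[b]);
    cudaMalloc((void **)&dev[b], CHUNK);
  }
  for (int i = 0; i < nChunks; i++) {
    int cur = i & 1;
    cudaMemcpyAsync(dev[cur], host_samples + (size_t)i * CHUNK, CHUNK,
                    cudaMemcpyHostToDevice, stream[cur]);
    /* ... launch the correlator kernel for this chunk on stream[cur] ... */
  }
  for (int b = 0; b < 2; b++) {
    cudaStreamSynchronize(stream[b]);
    cudaFree(dev[b]);
    cudaStreamDestroy(stream[b]);
  }
}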

Evaluation 49

How to cheat with speedups 50

How can this be? A Core i7 CPU has 154 GFLOPS; an NVIDIA GTX 580 GPU has 1581 GFLOPS (10.3x more)

How to cheat with speedups 51

Heavily optimize the GPU version: coalescing, shared memory, tiling, loop unrolling

Do not optimize the CPU version: 1 core only, no SSE, cache unaware, no loop unrolling or tiling, …

Result: very high speedups!

Exception: kernels that do interpolations (texturing hardware)

Solution: optimize the CPU version

Use efficiencies: % of peak performance, Roofline

Theoretical performance bounds 52

Distinguish between global and local (host vs device)

Local AI = 1 .. 4

Depends on tile size, and # registers

Max performance = AI * memory bandwidth

ATI (4x3): 3.43 * 115.2 = 395 gflops

Peak of 1200 needs AI of 10.4 or 350 GB/s bandwidth

NVIDIA (3x2): 2.40 * 102.0 = 245 gflops

Peak of 996 needs AI of 9.8 or 415 GB/s bandwidth

Can we achieve more than this?

Theoretical performance bounds 53

Global AI = #stations + 1 (LOFAR: 65): each sample crosses the PCI-e bus once, but is reused in correlations with all stations

Max performance = AI * memory bandwidth

Use bandwidth of PCI-e 2.0

Max performance GPUs, with AI global:

ATI: 65 * 4.6 = 300 gflops

need 19 GB/s for peak

NVIDIA: 65 * 5.6 = 363 gflops

need 15 GB/s for peak
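Both bounds are simply the minimum of the compute peak and AI x bandwidth (a basic roofline model). A small sketch using the peak, AI, and bandwidth figures quoted on these two slides:

#include <stdio.h>

/* Attainable gflops = min(peak, AI * bandwidth) -- simple roofline model. */
static double bound(double peak, double ai, double bw)
{
  return peak < ai * bw ? peak : ai * bw;
}

int main(void)
{
  /* device-side bound: local AI (tile-dependent) x device memory bandwidth */
  printf("ATI 4870, 4x3 tile:  %.0f gflops\n", bound(1200.0, 3.43, 115.2)); /* ~395 */
  printf("NVIDIA C1060, 3x2:   %.0f gflops\n", bound(996.0,  2.40, 102.0)); /* ~245 */
  /* host-side bound: global AI (#stations + 1) x PCI-e bandwidth           */
  printf("ATI 4870, PCI-e:     %.0f gflops\n", bound(1200.0, 65.0, 4.6));   /* ~300 */
  printf("NVIDIA C1060, PCI-e: %.0f gflops\n", bound(996.0,  65.0, 5.6));   /* ~364 */
  return 0;
}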

Correlator performance 54

Measured power efficiency 55

Current CPUs (even at 45 nm) are still less power efficient than the BG/P (90 nm)

GPUs are not 15, but only 2-3x more power efficient than BG/P

65 nm Cell is 4x more power efficient than the BG/P

Scalability on NVIDIA GTX480 56

[Figure: achieved gflops (0–1000) versus number of stations (16–512)]

Weak and strong points 57

Intel Core i7:
+ well-known toolchain
- few registers
- limited shuffling

IBM BG/P:
+ L2 prefetch unit
+ high memory bandwidth
- double precision only
- expensive

ATI 4870:
+ largest # cores
+ shuffling support
- low PCI-e bandwidth (4.6 GB/s)
- transfer slows down kernel
- CAL is low-level
- bad Brook+ performance
- not well documented

NVIDIA Tesla C1060:
+ CUDA is high-level
- low PCI-e bandwidth (5.6 GB/s)

STI Cell:
+ explicit cache (LS)
+ shuffle capabilities
+ power efficiency
- multiple parallelism levels (6!)
- no increment in odd pipeline

Conclusions

Software telescopes are the future, extremely challenging

Software provides the required flexibility

Many-core architectures show great potential (28x)

PCI-e is a bottleneck

Compared to the BG/P or CPUs, the many-cores have low memory bandwidth per operation

This is OK if the architecture allows efficient data reuse

Optimal use of registers (tile size + SIMD strategy)

Exploit caches / local memories / shared memories

The Cell has 8 times lower memory bandwidth per operation, but still works thanks to explicit cache control and large number of registers

Backup slides

Vectorizing the correlator 60

How do we efficiently use the vectors?

for (pol1 = 0; pol1 < nrPolarizations; pol1++) {

for (pol2 = 0; pol2 < nrPolarizations; pol2++) {

float sum = 0.0;

for (time = 0; time < integrationTime; time++) {

sum += samples[ch][station1][time][pol1]

* samples[ch][station2][time][pol2];

}

}

}

Vectorizing the correlator 61

Option 1: vectorize over time

Unroll time loop 4 times

for (pol1 = 0; pol1 < nrPolarizations; pol1++) {

for (pol2 = 0; pol2 < nrPolarizations; pol2++) {

float sum = 0.0;

for (time = 0; time < integrationTime; time += 4) {

sum += samples[ch][station1][time+0][pol1]

* samples[ch][station2][time+0][pol2];

sum += samples[ch][station1][time+1][pol1]

* samples[ch][station2][time+1][pol2];

sum += samples[ch][station1][time+2][pol1]

* samples[ch][station2][time+2][pol2];

sum += samples[ch][station1][time+3][pol1]

* samples[ch][station2][time+3][pol2];

}

}

}

Vectorizing the correlator 62

for (pol1 = 0; pol1 < nrPolarizations; pol1++) {

for (pol2 = 0; pol2 < nrPolarizations; pol2++) {

vector float sum = {0.0, 0.0, 0.0, 0.0};

for (time = 0; time < integrationTime; time += 4) {

vector float s1 = {

samples[ch][station1][time+0][pol1],

samples[ch][station1][time+1][pol1],

samples[ch][station1][time+2][pol1],

samples[ch][station1][time+3][pol1],

};

vector float s2 = {

samples[ch][station2][time+0][pol2],

samples[ch][station2][time+1][pol2],

samples[ch][station2][time+2][pol2],

samples[ch][station2][time+3][pol2],

};

sum = spu_madd(s1, s2, sum); // sum = sum + s1 * s2

}

result = spu_extract(sum, 0) + spu_extract(sum, 1) + spu_extract(sum, 2) + spu_extract(sum, 3); // horizontal sum of the 4 partial sums

}

}

Vectorizing the correlator 63

Option 2: vectorize over polarization

for (pol1 = 0; pol1 < nrPolarizations; pol1++) {

for (pol2 = 0; pol2 < nrPolarizations; pol2++) {

float sum = 0.0;

for (time = 0; time < integrationTime; time++) {

sum += samples[ch][station1][time][pol1]

* samples[ch][station2][time][pol2];

}

}

}

Vectorizing the correlator 64

Option 2: vectorize over polarization

Remove polarization loops (4 combinations)

float sumXX = 0.0, sumXY = 0.0, sumYX = 0.0, sumYY = 0.0;

for (time = 0; time < integrationTime; time++) {

  sumXX += samples[ch][station1][time][0]
         * samples[ch][station2][time][0]; // XX

  sumXY += samples[ch][station1][time][0]
         * samples[ch][station2][time][1]; // XY

  sumYX += samples[ch][station1][time][1]
         * samples[ch][station2][time][0]; // YX

  sumYY += samples[ch][station1][time][1]
         * samples[ch][station2][time][1]; // YY

}

Vectorizing the correlator 65

vector float sum = {0.0, 0.0, 0.0, 0.0};

for (time = 0; time < integrationTime; time++) {

vector float s1 = {

samples[ch][station1][time][0],

samples[ch][station1][time][0],

samples[ch][station1][time][1],

samples[ch][station1][time][1],

};

vector float s2 = {

samples[ch][station2][time][0],

samples[ch][station2][time][1],

samples[ch][station2][time][0],

samples[ch][station2][time][1],

};

sum = spu_madd(s1, s2, sum); // sum = sum + s1 * s2

// sum now contains {XX, XY, YX, YY}

}

Delay Compensation

It's all about the memory 67

feature                      Cell/B.E.                           GPUs
access times                 uniform                             non-uniform
cache sharing level          single thread (SPE)                 all threads in a multiprocessor
access to off-chip memory    not possible, only through DMA      supported
memory access overlapping    asynchronous DMA                    hardware-managed thread preemption (tens of thousands of threads)
communication                between SPEs through the EIB        independent thread blocks + shared memory within a block