Transcript of "Many-core computing class 2" (Parallel Programming: Many-Core Computing for the LOFAR Telescope), Rob van Nieuwpoort, Vrije Universiteit Amsterdam.

Page 1:

PARALLEL PROGRAMMING

MANY-CORE COMPUTING

FOR THE LOFAR TELESCOPE

ROB VAN NIEUWPOORT

Rob van Nieuwpoort

[email protected]


Page 2:

Who am I

10 years of Grid / Cloud computing

6 years of many-core computing, radio astronomy

Netherlands eScience Center

Software engineering, many-core solutions, astronomy

ASTRON: Netherlands Institute for Radio Astronomy

LOFAR

SKA

Accelerators


Page 3:

Big Data

Novel instruments and applications produce lots of data

LHC, telescopes, climate simulations, genetics, medical scanners, Facebook, Twitter, …

These instruments and applications cannot work without computer science anymore

A lot of unexploited knowledge that we can only find if the data across disciplines is accessible and usable…

Challenges: data handling, processing

Need large-scale parallelism


Page 4:

eScience

Enhanced science

Apply ICT in the broadest sense

Data-driven research across all scientific disciplines

Develop generic Big Data tools

Collaboration between cross-disciplinary researchers


Page 5:

Schedule

LOFAR application

Introduction to radio astronomy

LOFAR & Big Data

High-performance computing in LOFAR

Many-core computing for LOFAR


Page 6:

THE LOFAR SOFTWARE TELESCOPE

Page 7:

Why Radio?

Credit: NASA/IPAC

Page 8:

Centaurus A, visible light and radio

Page 9:

The Dwingeloo telescope

Dwingeloo telescope, 1954 – 1990s

25 m dish, at the time the largest steerable telescope in the world

Hydrogen line (21cm), galaxies Dwingeloo I & II

Now a national monument

Page 10:

Westerbork synthesis radio telescope

14 dishes of 25 m, spread over 3 km

Combined in hardware

Built in 1970, upgraded in 1999

120 MHz - 8.3 GHz

Page 11:

Software radio telescopes (1)

We cannot keep on building larger dishes

Replace dishes with thousands of small antennas

Combine signals in software

Page 12:

Software radio telescopes (2)

Software telescopes are being built now

LOFAR: LOw Frequency Array (Netherlands, Europe)

ASKAP: Australian Square Kilometre Array Pathfinder

MeerKAT: Karoo Array Telescope (South Africa)

2020: SKA, Square Kilometre Array

Exa-scale! (10^18: giga, tera, peta, exa)

Page 13:

LOFAR overview

Hierarchical

Receiver

Tile

Station

Telescope

Central processing

Groningen

IBM BG/P

Dedicated fibers

Page 14:

LOFAR

Largest telescope in the world

88,000 omni-directional antennas

Hundreds of Gbit/s

14x LHC

Hundreds of teraFLOPS

10–250 MHz

100x more sensitive

Page 15:

LOFAR low-band antennas

Page 16:

LOFAR high-band antennas

Page 17:

Station (150m)

Page 18:

2x3 km

Page 19:

Station cabinet

Page 20:

Station processing

Special-purpose hardware, FPGAs

200 MHz ADC, filtering

Send to BG/P

Dedicated fiber

UDP

Page 21:

LOFAR science

Imaging

Epoch of re-ionization

Cosmic rays

Extragalactic surveys

Transients

Pulsars

Page 22:

A LOFAR observation

Cas A

Supernova remnant

115–160 MHz

12 stations

Page 23:

Searching for pulsars

Rotating neutron stars

Discovered in 1967

About 2000 are known

Big mass, precise period

“Lab in the sky”

Probe space and gravitation

Investigate general relativity

Page 24:

Page 25:

Processing pipeline

Astronomy pipelines: real time and offline

Data volume:
10 terabit/s  (265 DVDs/s)
200 gigabit/s (5 DVDs/s)
50 gigabit/s  (1.3 DVDs/s)

Page 26:

Processing pipeline (continued)

Flexibility

Page 27:

Processing pipeline (continued)

Flexibility

Data intensiveness

Page 28:

Processing overview

Page 29:

Online pipelines

Page 30:

Stella, the IBM Blue Gene/P

Was #2, now #266

850 MHz PowerPC

Designed for energy efficiency

Complex numbers

3-D torus, collective, barrier, 10 GbE, JTAG networks

2½ racks = 10,880 cores = 37 TFLOP/s + 160*10 Gb/s
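As a quick sanity check (not on the slide): the 37 TFLOP/s figure is consistent with the stated core count and clock, assuming the BG/P's usual 4 floating-point operations per core per cycle (two fused multiply-adds from the dual FPU):

\[
10{,}880\ \text{cores} \times 0.85\ \text{GHz} \times 4\ \text{flops/cycle} \approx 37\ \text{TFLOP/s}.
\]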

Page 31:

Optimizations

We need high bandwidth, high performance, real-time behavior

Use assembly for performance-critical code [SPAA'06]

Avoid resource contention by smart scheduling [PPoPP'10]

Run part of application on I/O node [PPoPP'08]

Use optimized network protocol [PDPTA'09]

Modify OS to avoid software TLB miss handler [IJHPC'10]

Use real-time scheduler [PPoPP'10]

Drop data if running behind [PPoPP'10]

Use asynchronous I/O [PPoPP'10]

Page 32:

Correlator

Polyphase filter (prism)

FIR filter, FFT

Beam Form

Correlate, integrate

(diagram axes: time, frequency)

Page 33:

BG/P performance

Correlator is O(n²)

Achieves 96% of the theoretical peak

Page 34:

Problem: processing is challenging

Special-purpose hardware

Inflexible

Expensive to design

Long time from design to production

Supercomputer

Flexible

Expensive to purchase

Expensive maintenance

Expensive due to electrical power costs

For SKA, we need orders of magnitude more!

Page 35:

Many-core advantages

Fast and cheap

Latest ATI HD 6990 has 3072 cores, 5.1 tflops

Costs only 575 euro!

Comparison: entire 72-node DAS-4 VU cluster has 4.4 tflops

Potentially more power efficient

Example: In theory, the ATI 4870 GPU is 15 times more power efficient than the BG/P

Many-cores are becoming more general

CPUs are incorporating many-core techniques

Page 36:

Research questions

Architectural problems:

Which part of the theoretical performance can be achieved in practice?

Can we get the data into the accelerators fast enough?

Performance consistent enough for real-time use?

Which architectural properties are essential?

Page 37:

Many-cores

Intel Core i7: quad core + hyperthreading + SSE

Sony/Toshiba/IBM Cell/B.E. QS21 blade

GPUs: NVIDIA Tesla C1060/GTX280, ATI 4870

Compare with production code on BG/P

Compare architectures: implemented everything in assembly

Reader: Rob V. van Nieuwpoort and John W. Romein, "Correlating Radio Astronomy Signals with Many-Core Hardware"

Page 38:

Essential many-core properties

architecture:                               Intel Core i7 | IBM BG/P | ATI 4870 | NVIDIA C1060 | STI Cell
cores x FPUs per core = total FPUs:         4 x 4 = 16 | 4 x 2 = 8 | 160 x 5 = 800 | 30 x 8 = 240 | 8 x 4 = 32
gflops:                                     85 | 14 | 1200 | 936 | 205
registers/core x width (floats):            16 x 4 = 64 | 32 x 2 = 64 | 1024 x 4 = 4096 | 2048 x 1 = 2048 | 128 x 4 = 512
device RAM bandwidth (GB/s):                n.a. | n.a. | 115.2 | 102 | n.a.
host RAM bandwidth (GB/s):                  25.6 | 13.6 | 8.0 (4.6 achieved) | 8.0 (5.6 achieved) | 25.8
per-operation bandwidth slowdown vs. BG/P:  3.3 | 1.0 | 10.4 (host: 150) | 9.2 (host: 117) | 7.9

Page 39:

Correlator algorithm

For all channels (63488)
  For all combinations of two stations (2080)
    For the combinations of polarizations (4)
      complex float sum = 0;
      For the time integration interval (768 samples)
        sum += sample1 * sample2   (complex multiplication)
      Store sum in memory
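A minimal C sketch of this loop nest, to make the structure concrete. The array sizes, the sample layout, and the use of conjf (a correlator multiplies one signal by the complex conjugate of the other, which the slide abbreviates as sample1 * sample2) are illustrative assumptions; the production LOFAR code is organized differently and partly written in assembly.

/* Plain C99 sketch of the correlator loop nest above (sizes shrunk for illustration). */
#include <complex.h>

#define NR_CHANNELS      16    /* 63488 in the real pipeline                 */
#define NR_STATIONS      8     /* 64 LOFAR stations -> 2080 station pairs    */
#define NR_POLARIZATIONS 2     /* X and Y                                    */
#define NR_TIMES         768   /* samples per integration time               */

typedef float complex Sample;  /* one polarization sample: real + imaginary float */

/* samples[channel][station][time][polarization] */
static void correlate(Sample samples[NR_CHANNELS][NR_STATIONS][NR_TIMES][NR_POLARIZATIONS],
                      Sample visibilities[NR_CHANNELS][NR_STATIONS][NR_STATIONS]
                                         [NR_POLARIZATIONS][NR_POLARIZATIONS])
{
  for (int ch = 0; ch < NR_CHANNELS; ch++)
    for (int s1 = 0; s1 < NR_STATIONS; s1++)
      for (int s2 = 0; s2 <= s1; s2++)                /* all station pairs, incl. autocorrelations */
        for (int p1 = 0; p1 < NR_POLARIZATIONS; p1++)
          for (int p2 = 0; p2 < NR_POLARIZATIONS; p2++) {
            Sample sum = 0;
            for (int t = 0; t < NR_TIMES; t++)        /* integrate over time */
              sum += samples[ch][s1][t][p1] * conjf(samples[ch][s2][t][p2]);
            visibilities[ch][s1][s2][p1][p2] = sum;   /* store sum in memory */
          }
}

int main(void)
{
  static Sample samples[NR_CHANNELS][NR_STATIONS][NR_TIMES][NR_POLARIZATIONS];
  static Sample visibilities[NR_CHANNELS][NR_STATIONS][NR_STATIONS]
                            [NR_POLARIZATIONS][NR_POLARIZATIONS];
  correlate(samples, visibilities);                   /* zero input -> zero output; structure only */
  return 0;
}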

Page 40:

Correlator optimization

Overlap data transfers and computations

Exploit caches / shared memory / local store

Loop unrolling

Tiling

Scheduling

SIMD operations

Assembly

...

Page 41:

Correlator: Arithmetic Intensity

complex multiply-add: 8 flops

sample: complex float, real + imaginary (2 * 4 bytes)

Correlator inner loop:

for (time = 0; time < integrationTime; time++) {
  sum += samples[ch][station1][time][pol1] *
         samples[ch][station2][time][pol2];
}

Page 42:

Correlator: Arithmetic Intensity

complex multiply-add: 8 flops

sample: complex float (real + imaginary): 8 bytes

AI: 8 flops, 2 samples of 8 bytes: 8 / 16 = 0.5 flops/byte

Correlator inner loop:

for (time = 0; time < integrationTime; time++) {
  sum += samples[ch][station1][time][pol1] *
         samples[ch][station2][time][pol2];
}

Page 43:

Correlator AI optimization

Combine polarizations

complex multiply-add: 8 flops

2 polarizations: X, Y → calculate XX, XY, YX, YY

32 flops per station pair

complex XY-sample: 16 bytes (× 2 stations = 32 bytes)

→ 1 flop/byte

Tiling

1 flop/byte → 2.4 flops/byte

but we need registers: a 1x1 tile already needs 16!

Page 44:

Tuning the tile size

tile size | floating point operations | memory loads (bytes) | arithmetic intensity | minimum # registers (floats)
1 x 1     | 32                        | 32                   | 1.00                 | 16
2 x 1     | 64                        | 48                   | 1.33                 | 24
2 x 2     | 128                       | 64                   | 2.00                 | 44
3 x 2     | 192                       | 80                   | 2.40                 | 60
3 x 3     | 288                       | 96                   | 3.00                 | 88
4 x 3     | 384                       | 112                  | 3.43                 | 112
4 x 4     | 512                       | 128                  | 4.00                 | 148
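The table's numbers follow from a simple counting argument (a reconstruction, consistent with every row): a w x h tile correlates w stations against h stations, each station pair costing 4 polarization products of 8 flops, while only w + h complex XY-samples of 16 bytes each must be loaded per time step.

\begin{align*}
\text{flops}(w,h) &= 32\,wh, \qquad \text{bytes}(w,h) = 16\,(w+h),\\
\text{AI}(w,h) &= \frac{32\,wh}{16\,(w+h)} = \frac{2\,wh}{w+h},
\qquad \text{e.g.}\ \text{AI}(3,2) = \tfrac{12}{5} = 2.4,\quad \text{AI}(4,4) = 4.0 .
\end{align*}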

Page 45:

Intel Core i7 CPU

The Cell Broadband Engine

ATI & NVIDIA GPUs

Correlator implementation

Page 46:

Implementation strategy on CPU

Partition frequencies over the cores

Independent

Multithreading

Each core computes its own correlation triangle

Use tiling: 2x2

Vectorize with SSE

Unroll time loop; compute 4 time steps in parallel
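A minimal sketch of this strategy, assuming real-valued samples stored contiguously in time; the real correlator works on complex data and interleaves polarizations, so this only illustrates how four time steps map onto one SSE vector. The intrinsics are standard SSE from <xmmintrin.h>.

/* Vectorize over time with SSE: four consecutive time samples per vector operation. */
#include <xmmintrin.h>   /* __m128, _mm_setzero_ps, _mm_loadu_ps, _mm_mul_ps, _mm_add_ps, _mm_storeu_ps */

float correlate_sse(const float *s1, const float *s2, int nrTimes /* multiple of 4 */)
{
  __m128 sum = _mm_setzero_ps();
  for (int t = 0; t < nrTimes; t += 4) {
    __m128 a = _mm_loadu_ps(&s1[t]);            /* 4 time samples of station 1 */
    __m128 b = _mm_loadu_ps(&s2[t]);            /* 4 time samples of station 2 */
    sum = _mm_add_ps(sum, _mm_mul_ps(a, b));    /* 4 partial sums in parallel  */
  }
  float partial[4];
  _mm_storeu_ps(partial, sum);                  /* horizontal reduction after the loop */
  return partial[0] + partial[1] + partial[2] + partial[3];
}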

Page 47:

Implementation strategy on the Cell/BE

Partition frequencies over the SPEs

Independent

Each SPE computes its own correlation triangle

Use tiling: 4x4 (128 registers!)

Keep a strip of tiles in the local store: more reuse

Use double buffering from memory to local store (see the sketch after this list)

Overlap communication and computation

Vectorize

Different vector elements compute different polarizations
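A sketch of the double-buffering pattern in plain C. start_fetch and wait_fetch are hypothetical placeholders for the Cell's asynchronous DMA into the local store (an mfc_get plus a tag-group wait); the point is only that the transfer of block i+1 overlaps with the computation on block i.

/* Double buffering: fetch the next block while computing on the current one. */
#include <stddef.h>

#define BLOCK_SIZE 4096

extern void start_fetch(float *dst, const float *src, size_t bytes); /* asynchronous copy (hypothetical stand-in) */
extern void wait_fetch(float *dst);                                  /* wait for that copy  (hypothetical stand-in) */
extern void compute(const float *block, size_t n);                   /* e.g., correlate one strip of tiles */

void process_blocks(const float *mainMemory, int nrBlocks)
{
  static float buffer[2][BLOCK_SIZE];   /* two local-store buffers */

  if (nrBlocks <= 0)
    return;
  start_fetch(buffer[0], &mainMemory[0], sizeof buffer[0]);
  for (int i = 0; i < nrBlocks; i++) {
    int cur = i & 1, next = cur ^ 1;
    if (i + 1 < nrBlocks)               /* prefetch block i+1 ...              */
      start_fetch(buffer[next], &mainMemory[(size_t) (i + 1) * BLOCK_SIZE], sizeof buffer[next]);
    wait_fetch(buffer[cur]);            /* ... while block i finishes arriving */
    compute(buffer[cur], BLOCK_SIZE);   /* DMA for block i+1 runs during this call */
  }
}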

Page 48:

Implementation strategy on GPUs

Partition frequencies over the streaming multiprocessors

Independent

Double buffering between GPU and host

Exploit data reuse as much as possible

Each streaming multiprocessor computes a correlation triangle

Threads/cores within an SM cooperate on a single triangle

Load samples into shared memory

Use tiling (4x3 on ATI, 3x2 on NVIDIA)

Page 49:

Evaluation

Page 50:

How to cheat with speedups

How can this be? The Core i7 CPU has 154 gflops; the NVIDIA GTX 580 GPU has 1581 gflops (10.3x more)

Page 51:

How to cheat with speedups

Heavily optimize the GPU version: coalescing, shared memory, tiling, loop unrolling

Do not optimize the CPU version: 1 core only, no SSE, cache unaware, no loop unrolling or tiling, …

Result: very high speedups!

Exception: kernels that do interpolations (texturing hardware)

Solution: optimize the CPU version

Use efficiencies: % of peak performance, Roofline

Page 52:

Theoretical performance bounds

Distinguish between global and local (host vs device)

Local AI = 1 .. 4

Depends on tile size, and # registers

Max performance = AI * memory bandwidth

ATI (4x3): 3.43 * 115.2 = 395 gflops

Peak of 1200 needs AI of 10.4 or 350 GB/s bandwidth

NVIDIA (3x2): 2.40 * 102.0 = 245 gflops

Peak of 936 needs AI of 9.2 or 390 GB/s bandwidth

Can we achieve more than this?
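In roofline terms (the slide is computing exactly this bound), attainable performance is the minimum of the compute peak and arithmetic intensity times memory bandwidth; for the ATI 4870 with a 4x3 tile:

\[
P_{\text{attainable}} = \min\bigl(P_{\text{peak}},\ \text{AI} \times BW_{\text{mem}}\bigr),
\qquad \min\bigl(1200,\ 3.43 \times 115.2\bigr) \approx 395\ \text{gflops}.
\]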

Page 53:

Theoretical performance bounds

Global AI = #stations + 1 (LOFAR: 65)

Max performance = AI * memory bandwidth

Use bandwidth of PCI-e 2.0

Max performance GPUs, with AI global:

ATI: 65 * 4.6 = 300 gflops

need 19 GB/s for peak

NVIDIA: 65 * 5.6 = 363 gflops

need 15 GB/s for peak
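A sketch of where "global AI = #stations + 1" comes from (a reconstruction of the counting argument): per channel and per time step, every station ships one 16-byte XY-sample across PCI-e, while all n(n+1)/2 station pairs (2080 for LOFAR's 64 stations) are correlated at 32 flops each.

\[
\text{AI}_{\text{global}} = \frac{32 \cdot \tfrac{n(n+1)}{2}}{16\,n} = n + 1,
\qquad n = 64 \ \Rightarrow\ \text{AI}_{\text{global}} = 65 .
\]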

Page 54:

Correlator performance

Page 55:

Measured power efficiency

Current CPUs (even at 45 nm) are still less power efficient than the BG/P (90 nm)

GPUs are not 15x, but only 2-3x more power efficient than the BG/P

65 nm Cell is 4x more power efficient than the BG/P

Page 56:

Scalability on NVIDIA GTX480

(Plot: achieved gflops vs. number of stations, 16 to 512 stations; y-axis 0–1000 gflops)

Page 57:

Weak and strong points

Intel Core i7:
+ well-known toolchain
- few registers
- limited shuffling

IBM BG/P:
+ L2 prefetch unit
+ high memory bandwidth
- double precision only
- expensive

ATI 4870:
+ largest # cores
+ shuffling support
- low PCI-e bandwidth (4.6 GB/s)
- transfer slows down kernel
- CAL is low-level
- bad Brook+ performance
- not well documented

NVIDIA Tesla C1060:
+ CUDA is high-level
- low PCI-e bandwidth (5.6 GB/s)

STI Cell:
+ explicit cache (LS)
+ shuffle capabilities
+ power efficiency
- multiple parallelism levels (6!)
- no increment in odd pipeline

Page 58:

Conclusions

Software telescopes are the future, extremely challenging

Software provides the required flexibility

Many-core architectures show great potential (28x)

PCI-e is a bottleneck

Compared to the BG/P or CPUs, the many-cores have low memory bandwidth per operation

This is OK if the architecture allows efficient data reuse

Optimal use of registers (tile size + SIMD strategy)

Exploit caches / local memories / shared memories

The Cell has 8 times lower memory bandwidth per operation, but still works thanks to explicit cache control and large number of registers

Page 59:

Backup slides

Page 60:

Vectorizing the correlator

How do we efficiently use the vectors?

for (pol1 = 0; pol1 < nrPolarizations; pol1++) {
  for (pol2 = 0; pol2 < nrPolarizations; pol2++) {
    float sum = 0.0;
    for (time = 0; time < integrationTime; time++) {
      sum += samples[ch][station1][time][pol1]
           * samples[ch][station2][time][pol2];
    }
  }
}

Page 61:

Vectorizing the correlator

Option 1: vectorize over time

Unroll time loop 4 times

for (pol1 = 0; pol1 < nrPolarizations; pol1++) {
  for (pol2 = 0; pol2 < nrPolarizations; pol2++) {
    float sum = 0.0;
    for (time = 0; time < integrationTime; time += 4) {
      sum += samples[ch][station1][time+0][pol1]
           * samples[ch][station2][time+0][pol2];
      sum += samples[ch][station1][time+1][pol1]
           * samples[ch][station2][time+1][pol2];
      sum += samples[ch][station1][time+2][pol1]
           * samples[ch][station2][time+2][pol2];
      sum += samples[ch][station1][time+3][pol1]
           * samples[ch][station2][time+3][pol2];
    }
  }
}

Page 62:

Vectorizing the correlator

// Cell SPU version: vector float and spu_madd come from <spu_intrinsics.h>
for (pol1 = 0; pol1 < nrPolarizations; pol1++) {
  for (pol2 = 0; pol2 < nrPolarizations; pol2++) {
    vector float sum = {0.0, 0.0, 0.0, 0.0};
    for (time = 0; time < integrationTime; time += 4) {
      vector float s1 = {
        samples[ch][station1][time+0][pol1],
        samples[ch][station1][time+1][pol1],
        samples[ch][station1][time+2][pol1],
        samples[ch][station1][time+3][pol1],
      };
      vector float s2 = {
        samples[ch][station2][time+0][pol2],
        samples[ch][station2][time+1][pol2],
        samples[ch][station2][time+2][pol2],
        samples[ch][station2][time+3][pol2],
      };
      sum = spu_madd(s1, s2, sum); // sum = sum + s1 * s2
    }
    // horizontal reduction of the four partial sums
    result = spu_extract(sum, 0) + spu_extract(sum, 1)
           + spu_extract(sum, 2) + spu_extract(sum, 3);
  }
}

Page 63:

Vectorizing the correlator

Option 2: vectorize over polarization

for (pol1 = 0; pol1 < nrPolarizations; pol1++) {
  for (pol2 = 0; pol2 < nrPolarizations; pol2++) {
    float sum = 0.0;
    for (time = 0; time < integrationTime; time++) {
      sum += samples[ch][station1][time][pol1]
           * samples[ch][station2][time][pol2];
    }
  }
}

Page 64:

Vectorizing the correlator

Option 2: vectorize over polarization

Remove polarization loops (4 combinations, each with its own sum)

float sumXX = 0.0, sumXY = 0.0, sumYX = 0.0, sumYY = 0.0;
for (time = 0; time < integrationTime; time++) {
  sumXX += samples[ch][station1][time][0]
         * samples[ch][station2][time][0]; // XX
  sumXY += samples[ch][station1][time][0]
         * samples[ch][station2][time][1]; // XY
  sumYX += samples[ch][station1][time][1]
         * samples[ch][station2][time][0]; // YX
  sumYY += samples[ch][station1][time][1]
         * samples[ch][station2][time][1]; // YY
}

Page 65:

Vectorizing the correlator

vector float sum = {0.0, 0.0, 0.0, 0.0};
for (time = 0; time < integrationTime; time++) {
  vector float s1 = {
    samples[ch][station1][time][0],
    samples[ch][station1][time][0],
    samples[ch][station1][time][1],
    samples[ch][station1][time][1],
  };
  vector float s2 = {
    samples[ch][station2][time][0],
    samples[ch][station2][time][1],
    samples[ch][station2][time][0],
    samples[ch][station2][time][1],
  };
  sum = spu_madd(s1, s2, sum); // sum = sum + s1 * s2
  // sum now contains {XX, XY, YX, YY}
}

Page 66:

Delay Compensation

Page 67:

feature                    | Cell/B.E.                        | GPUs
access times               | uniform                          | non-uniform
cache sharing level        | single thread (SPE)              | all threads in a multiprocessor
access to off-chip memory  | not possible, only through DMA   | supported
memory access overlapping  | asynchronous DMA                 | hardware-managed thread preemption (tens of thousands of threads)
communication              | between SPEs through the EIB     | independent thread blocks + shared memory within a block

It's all about the memory
