V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer...

59
V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies, London and Palo Alto Michael Flynn Stanford University, Palo Alto 1/60

Transcript of V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer...

Page 1: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran

University of Belgrade

Oskar MencerImperial College, London

Oliver Pell Maxeler Technologies, London and Palo Alto

Michael FlynnStanford University, Palo Alto

1/60

Page 2: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

For Big Data algorithms and for the same hardware price as before, achieving:

a) speed-up, 20-200 b) monthly electricity bills, reduced

20 timesc) size, 20 times smaller

2/60

Page 3: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

1. BigData2. WORM3. Tolerance to latency4. Over 95% of run time in loops5. Reusability of the data6. Skills

3/60

Page 4: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

Absolutely all results achieved with:

a) all hardware produced in Europe, specifically UK

b) all software generated by programmers

of EU and WB

4/60

Page 5: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

ControlFlow (MultiFlow and ManyFlow): Top500 ranks using Linpack (Japanese

K,…)

DataFlow: Coarse Grain (HEP) vs. Fine Grain

(Maxeler)

5/60

Page 6: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

Compiling below the machine code level brings speedups;also a smaller power, size, and cost.

The price to pay:The machine is more difficult to program.

Consequently:Ideal for WORM applications :)

Examples using Maxeler:GeoPhysics (20-40), Banking (200-1000, with JP Morgan

20%), M&C (New York City), Datamining (Google), …

6/60

Page 7: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

7

Page 8: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

8/60

Page 9: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

9/60

Page 10: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

10

Page 11: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

11

Page 12: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

Assumptions: 1. Software includes enough parallelism to keep all cores busy 2. The only limiting factor is the number of cores.

tGPU = N * NOPS * CGPU*TclkGPU / NcoresGPU

tCPU = N * NOPS * CCPU*TclkCPU /NcoresCPU

tDF = NOPS * CDF * TclkDF + (N – 1) * TclkDF / NDF

12/60

Page 13: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

DualCore?

Which way are the horses going?

13/60

Page 14: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

Is it possibleto use 2000 chicken instead of two horses?

?==

14/60

What is better, real and anecdotic?

Page 15: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

2 x 1000 chickens (CUDA and rCUDA) 15/60

Page 16: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

How about 2 000 000 ants?

16/60

Dat

a

Page 17: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

Marmalade

Big Data Input Results

17/60

Page 18: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

Factor: 20 to 200

MultiCore/ManyCore

Dataflow

Machine Level Code

Gate Transfer Level

18/60

Page 19: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

Factor: 20

MultiCore/ManyCore

Dataflow

19/60

Page 20: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

Factor: 20

Data Processing

Process ControlData Processing

Process Control

MultiCore/ManyCore

DataFlow

20/60

Page 21: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

MultiCore:Explain what to do, to the driverCaches, instruction buffers, and predictors needed

ManyCore:Explain what to do, to many sub-driversReduced caches and instruction buffers needed

DataFlow:Make a field of processing gates: 1C+2nJava+3JavaNo caches, etc. (300 students/year: BGD, BCN, LjU,

ICL,…)

21/60

Page 22: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

MultiCore:Business as usual

ManyCore:More difficult

DataFlow:Much more difficultDebugging both, application and configuration

code

22/60

Page 23: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

MultiCore/ManyCore:Several minutes

DataFlow:Several hours for the real hardwareFortunately, only several minutes for the simulatorThe simulator supports

both the large JPMorgan machineas well as the smallest “University Support” machine

Good news:Tabula@2GHz

23/60

Page 24: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

24/60

Page 25: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

MultiCore:Horse stable

ManyCore:Chicken house

DataFlow:Ant hole

25/60

Page 26: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

MultiCore:Haystack

ManyCore:Cornbits

DataFlow:Crumbs

26/60

Page 27: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

27/60

Small Data: Toy Benchmarks (e.g., Linpack)

Page 28: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

28/60

Medium Data (benchmarks favorising NVidia,compared to Intel,…)

Page 29: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

29/60

Big Data

Page 30: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

Revisiting the Top 500 SuperComputer BenchmarksOur paper in Communications of the ACM

Revisiting all major Big Data DM algorithmsMassive static parallelism at low clock frequencies

Concurrency and communicationConcurrency between millions of tiny cores difficult,

“jitter” between cores will harm performance at synchronization points

Reliability and fault tolerance10-100x fewer nodes, failures much less often

Memory bandwidth and FLOP/byte ratioOptimize data choreography, data movement,

and the algorithmic computation

30/60

Page 31: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

Maxeler Hardware

CPUs plus DFEsIntel Xeon CPU cores and up to

4 DFEs with 192GB of RAM

DFEs shared over Infiniband Up to 8 DFEs with 384GB of RAM and dynamic allocation

of DFEs to CPU servers

Low latency connectivityIntel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet

connections

MaxWorkstationDesktop development system

MaxCloudOn-demand scalable accelerated compute resource, hosted in London

3131/60/60

Page 32: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

1. Coarse grained, stateful: Business– CPU requires DFE for minutes or hours

2. Fine grained, transactional with shared database: DM– CPU utilizes DFE for ms to s– Many short computations, accessing common database data

3. Fine grained, stateless transactional: Science (FF)– CPU requires DFE for ms to s– Many short computations

3232/60/60

Major Classes of Algorithms, from the Computational Perspective

Page 33: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

• Long runtime, but:• Memory requirements

change dramatically based on modelled frequency

• Number of DFEs allocated to a CPU process can be easily varied to increase available memory

• Streaming compression• Boundary data exchanged

over chassis MaxRing

3333/60/60

Coarse Grained: Modeling

0

200

400

600

800

1,000

1,200

1,400

1,600

1,800

2,000

1 4 8

Equi

vale

nt C

PU c

ores

Number of MAX2 cards

15Hz peak frequency

30Hz peak frequency

45Hz peak frequency

70Hz peak frequency

0

10

20

30

40

50

60

70

80

0 10 20 30 40 50 60 70 80Peak Frequency (Hz)

Timesteps (thousand)

Domain points (billion)

Total computed points (trillion)

Page 34: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

• DFE DRAM contains the database to be searched• CPUs issue transactions find(x, db)• Complex search function

– Text search against documents– Shortest distance to coordinate (multi-dimensional)– Smith Waterman sequence alignment for genomes

• Any CPU runs on any DFE that has been loaded with the database– MaxelerOS may add or remove DFEs

from the processing group to balance system demands– New DFEs must be loaded with the search DB before use

3434/60/60

Fine Grained, Shared Data: Monitoring

Page 35: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

• Analyse > 1,000,000 scenarios• Many CPU processes run on many DFEs

– Each transaction executes on any DFE in the assigned group atomically

• ~50x MPC-X vs. multi-core x86 node

3535/60/60

Fine Grained, Stateless: The BSOP Control

CPU DFE Loop over instrumentsLoop over instruments

Random number generator and

sampling of underliers

Random number generator and

sampling of underliers

Price instruments using Black

Scholes

Price instruments using Black

Scholes

Tail analysis on CPU

Tail analysis on CPU

CPU DFE Loop over instrumentsLoop over instruments

Random number generator and

sampling of underliers

Random number generator and

sampling of underliers

Price instruments using Black

Scholes

Price instruments using Black

Scholes

Tail analysis on CPU

Tail analysis on CPU

CPU DFE Loop over instrumentsLoop over instruments

Random number generator and

sampling of underliers

Random number generator and

sampling of underliers

Price instruments using Black

Scholes

Price instruments using Black

Scholes

Tail analysis on CPU

Tail analysis on CPU

CPU DFE Loop over instrumentsLoop over instruments

Random number generator and

sampling of underliers

Random number generator and

sampling of underliers

Price instruments using Black

Scholes

Price instruments using Black

Scholes

Tail analysis on CPU

Tail analysis on CPU

DFE Loop over instrumentsLoop over instrumentsCPUMarket and instruments data

Random number generator and

sampling of underliers

Random number generator and

sampling of underliers

Price instruments using Black

Scholes

Price instruments using Black

ScholesInstrument values

Tail analysis on CPU

Tail analysis on CPU

Page 36: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

3636/60/60

Selected Examples

Page 37: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

37373737/60/60

Page 38: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

Performance of one MAX2 card vs. 1 CPU core

Land case (8 params), speedup of 230x

Marine case (6 params), speedup of 190x

The CRS Results

CPU Coherency MAX2 Coherency

3838/60/60

Page 39: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

Seismic Imaging

• Running on MaxNode servers- 8 parallel compute pipelines per chip- 150MHz => low power consumption!- 30x faster than microprocessors

An Implementation of the Acoustic Wave Equation on FPGAs T. Nemeth†, J. Stefani†, W. Liu†, R. Dimond‡, O. Pell‡, R.Ergas§

†Chevron, ‡Maxeler, §Formerly Chevron, SEG 20083939/60/60

Page 40: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

4040

Page 41: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,
Page 42: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

• DM for Monitoring and Control in Seismic processing • Velocity independent / data driven method

to obtain a stack of traces, based on 8 parameters– Search for every sample of each output trace

Trace Stacking: Speed-up 217P. Marchetti et al, 2010

parameters( emergence angle & azimuth

Normal Wave front parametersKN,11; KN,12 ; KN22

NIP Wave front parameters( KNip,11; KNip,12 ; KNip22 )

hHKHhmHKHmmw TzyNIPzy

TTzyNzy

TT

0

0

2

00

2 22

v

t

vtthyp

4242/60/60

Page 43: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

4343

Page 44: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

4444

Page 45: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

4545

Page 46: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

This is about algorithmic changes, to maximize

the algorithm to architecture match:Data choreography, process modifications,

anddecision precision.

The winning paradigm of Big Data ExaScale?

4646/60/60

Conclusion: Nota Bene

Page 47: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

47/8

The TriPeak

Siena+ BSC+ Imperial College + Maxeler+ Belgrade

46/60

Page 48: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

48/8

The TriPeakMontBlanc = A ManyCore (NVidia) + a MultiCore (ARM)Maxeler = A FineGrain DataFlow (FPGA)

How about a happy marriage?MontBlanc (ompSS) and Maxeler (an accelerator)

In each happy marriage,it is known who does what :)

The Big Data DM algorithms:What part goes to MontBlanc and what to Maxeler?

48/60

Page 49: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

49/8

Core of the Symbiotic SuccessAn intelligent DM algorithmic scheduler,partially implemented for compile time,and partially for run time.

At compile time:Checking what part of code fits where(MontBlanc or Maxeler): LoC 1M vs 2K vs 20K

At run time:Rechecking the compile time decision,based on the current data values.

49/60

Page 50: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

50/850/850

Page 51: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

51/851/8

Maxeler: Teaching (Google: prof vm)TEACHING, VLSI, PowerPoints, Maxeler:

Maxeler Veljko Explanations, August 2012Maxeler Veljko Anegdotic, Maxeler Oskar Talk, August 2012Maxeler Forbes ArticleFlyer by JP MorganFlyer by Maxeler HPCTutorial Slides by Sasha and Veljko: Practice (Current Update)Paper, unconditionally accepted for Advances in Computers by ElsevierPaper, unconditionally accepted for Communications of the ACMTutorial Slides by Oskar: Theory (7 parts)Slides by Jacob, New YorkSlides by Jacob, AlabamaSlides by Sasha: Practice (Current Update)Maxeler in MeteorologyMaxeler in MathematicsExamples generated in Belgrade and Worldwide

THE COURSE ALSO INCLUDES DARPA METHODOLOGY FOR MICROPROCESSOR DESIGN, with an example

51/60

Page 52: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

52/852/8© H. Maurer52

Page 53: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

53/853/8

Maxeler: Research (Google: good method)

Structure of a Typical Research Paper: Scenario #1[Comparison of Platforms for One Algorithm]Curve A: MultiCore of approximately the same PurchasePriceCurve B: ManyCore of approximately the same PurchasePriceCurve C: Maxeler after a direct algorithm migrationCurve D: Maxeler after algorithmic improvementsCurve E: Maxeler after data choreographyCurve F: Maxeler after precision modifications

Structure of a Typical Research Paper: Scenario #2[Ranking of Algorithms for One Application]CurveSet A: Comparison of Algorithms on a MultiCoreCurveSet B: Comparison of Algorithms on a ManyCoreCurveSet C: Comparison on Maxeler, after a direct algorithm migrationCurveSet D: Comparison on Maxeler, after algorithmic improvementsCurveSet E: Comparison on Maxeler, after data choreographyCurveSet F: Comparison on Maxeler, after precision modifications

53/60

Page 54: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

54/854/8© H. Maurer54

Page 55: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

55/855/8

Maxeler Tutorials

Google:

IEEE HiPeak, Berlin, Germany, January 2013

ACM iSAC, Coimbra, Portugal, March 2013

IEEE MECO, Budva, Montenegro, June 2013

ACM ISCA, Tel Aviv, Israel, June 2013

55/60

Page 56: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

56/856/8

Maxeler Research in Serbia

KG: Blood Flow, Prof. Filipovic and Tijana Djukic

NS: Combinatorial Math, Prof. Senk and Ivan Stanojevic

MISANU: The SAT Math, Ognjanovic

ETF: Meteorology, Radomir Radojicic and Marko Stankovic

ETF: Physics (Gross Pitaevskii 3D real), Lena Parezanovic

ETF: Physics (Gross Pitaevskii 3D imaginary), Sasa Stojanovic

56/60

Page 57: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

57/857/8

Maxeler: DAFNE

MISANU, IMP, KG, NS, BSC, UPV, U of Siena, U of Roma,IJS, FRI,IRB, QPLAN, Bogazici, U of Istanbul,U of Bucharest, U of Arad,U of Tuzla, Technion, Maxeler Israel, IPSI

57/60

Page 58: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

58/858/8

Maxeler: Advances in Computers

MISANU, IMP, KG, NS, BSC, UPV, U of Siena, U of Roma,IJS, FRI,IRB, QPLAN, Bogazici, U of Istanbul,U of Bucharest, U of Arad,U of Tuzla, Technion, Maxeler Israel, IPSI

58/60

Page 59: V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

59/859/8© H. Maurer59 [email protected]

Q&A