Data flow super computing valentina balas
![Page 1: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/1.jpg)
V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran, University of Belgrade
Oskar Mencer, Imperial College, London
Oliver Pell, Maxeler Technologies, London and Palo Alto
Michael Flynn, Stanford University, Palo Alto
Valentina E. Balas, Aurel Vlaicu University of Arad
1/52
![Page 2: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/2.jpg)
For Big Data algorithms and for the same hardware price as before, achieving:
a) speed-up of 20-200 times
b) monthly electricity bills reduced 20 times
c) size 20 times smaller
2/52
![Page 3: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/3.jpg)
Absolutely all results achieved with:
a) all hardware produced in Europe, specifically the UK
b) all software generated by programmers of EU and WB
3/52
![Page 4: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/4.jpg)
ControlFlow (MultiFlow and ManyFlow): Top500 ranks using Linpack (Japanese K, …)
DataFlow: Coarse Grain (HEP) vs. Fine Grain (Maxeler)
4/52
![Page 5: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/5.jpg)
Compiling below the machine code level brings speedups, and also lower power, smaller size, and lower cost.
The price to pay: The machine is more difficult to program.
Consequently: Ideal for WORM applications :)
Examples using Maxeler: GeoPhysics (20-40), Banking (200-1000, with JP Morgan 20%), M&C (New York City), Datamining (Google), …
5/52
![Page 6: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/6.jpg)
6
![Page 7: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/7.jpg)
7/52
![Page 8: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/8.jpg)
8/52
![Page 9: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/9.jpg)
9
![Page 10: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/10.jpg)
10
![Page 11: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/11.jpg)
Assumptions:
1. Software includes enough parallelism to keep all cores busy
2. The only limiting factor is the number of cores.
$$t_{GPU} = \frac{N \cdot N_{OPS} \cdot C_{GPU} \cdot T_{clkGPU}}{N_{coresGPU}}$$
$$t_{CPU} = \frac{N \cdot N_{OPS} \cdot C_{CPU} \cdot T_{clkCPU}}{N_{coresCPU}}$$
$$t_{DF} = N_{OPS} \cdot C_{DF} \cdot T_{clkDF} + \frac{(N - 1) \cdot T_{clkDF}}{N_{DF}}$$
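The three timing formulas above can be tried out with a few lines of Python. This is only a sketch: the slide does not define its symbols, so the reading used here (N as the number of data items, NOPS as operations per item, C as cycles per operation, Tclk as the clock period) is an assumption, and the numbers in the usage section are hypothetical values picked to exercise the formulas, not measurements.

```python
def control_flow_time(n, n_ops, cycles_per_op, t_clk, n_cores):
    """t = N * NOPS * C * Tclk / Ncores (the form shared by tCPU and tGPU)."""
    return n * n_ops * cycles_per_op * t_clk / n_cores


def dataflow_time(n, n_ops, cycles_per_op, t_clk, n_df):
    """t = NOPS * CDF * TclkDF + (N - 1) * TclkDF / NDF.

    The first term is the time to fill the pipeline once; the second is
    the time for the remaining N - 1 items to stream through.
    """
    return n_ops * cycles_per_op * t_clk + (n - 1) * t_clk / n_df


if __name__ == "__main__":
    # Hypothetical inputs (assumed, not taken from the slides).
    n, n_ops = 1_000_000, 100
    t_cpu = control_flow_time(n, n_ops, cycles_per_op=1, t_clk=1 / 3e9, n_cores=8)
    t_df = dataflow_time(n, n_ops, cycles_per_op=1, t_clk=1 / 150e6, n_df=4)
    print(f"CPU: {t_cpu:.6f} s  DataFlow: {t_df:.6f} s  ratio: {t_cpu / t_df:.1f}")
```

Under this reading, NOPS multiplies every data item in the control-flow formulas, while in the dataflow formula it appears only in the one-time pipeline fill term.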
11/52
![Page 12: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/12.jpg)
DualCore?
Which way are the horses going?
12/52
![Page 13: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/13.jpg)
Is it possible to use 2000 chickens instead of two horses?
13/52
What is better, real and anecdotal?
![Page 14: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/14.jpg)
2 x 1000 chickens (CUDA and rCUDA)
14/52
![Page 15: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/15.jpg)
How about 2 000 000 ants?
15/52
Data
![Page 16: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/16.jpg)
Marmalade
Big Data Input Results
16/52
![Page 17: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/17.jpg)
Factor: 20 to 200
MultiCore/ManyCore
Dataflow
Machine Level Code
Gate Transfer Level
17/52
![Page 18: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/18.jpg)
Factor: 20
MultiCore/ManyCore
Dataflow
18/52
![Page 19: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/19.jpg)
Factor: 20
Data Processing
Process Control
MultiCore/ManyCore
DataFlow
19/52
![Page 20: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/20.jpg)
MultiCore: Explain what to do, to the driver. Caches, instruction buffers, and predictors needed.
ManyCore: Explain what to do, to many sub-drivers. Reduced caches and instruction buffers needed.
DataFlow: Make a field of processing gates: 1C+2nJava+3Java. No caches, etc. (300 students/year: BGD, BCN, LjU, ICL, …)
20/52
![Page 21: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/21.jpg)
MultiCore: Business as usual
ManyCore: More difficult
DataFlow: Much more difficult. Debugging both application and configuration code.
21/52
![Page 22: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/22.jpg)
MultiCore/ManyCore: Several minutes
DataFlow: Several hours for the real hardware. Fortunately, only several minutes for the simulator. The simulator supports both the large JPMorgan machine and the smallest “University Support” machine.
Good news: Tabula@2GHz
22/52
![Page 23: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/23.jpg)
23/52
![Page 24: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/24.jpg)
MultiCore: Horse stable
ManyCore: Chicken house
DataFlow: Ant hole
24/52
![Page 25: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/25.jpg)
MultiCore: Haystack
ManyCore: Cornbits
DataFlow: Crumbs
25/52
![Page 26: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/26.jpg)
26/52
Small Data: Toy Benchmarks (e.g., Linpack)
![Page 27: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/27.jpg)
27/52
Medium Data (benchmarks favoring NVidia, compared to Intel, …)
![Page 28: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/28.jpg)
28/52
Big Data
![Page 29: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/29.jpg)
Revisiting the Top 500 SuperComputer Benchmarks: Our paper in Communications of the ACM
Revisiting all major Big Data DM algorithms: Massive static parallelism at low clock frequencies
Concurrency and communication: Concurrency between millions of tiny cores is difficult; “jitter” between cores will harm performance at synchronization points
Reliability and fault tolerance: 10-100x fewer nodes, so failures occur much less often
Memory bandwidth and FLOP/byte ratio: Optimize data choreography, data movement, and the algorithmic computation
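The memory bandwidth and FLOP/byte point can be illustrated with a simple bound, often called the roofline bound. The slide does not name this model; it is swapped in here only as an illustration, and the peak and bandwidth figures below are hypothetical.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flop_per_byte):
    """Upper bound on throughput: compute-limited or memory-limited.

    flop_per_byte is the arithmetic intensity of the kernel, i.e. how many
    floating-point operations it performs per byte moved from memory.
    """
    return min(peak_gflops, bandwidth_gb_s * flop_per_byte)


if __name__ == "__main__":
    # Hypothetical machine: 100 GFLOP/s peak, 20 GB/s memory bandwidth.
    for intensity in (0.25, 1.0, 5.0, 10.0):
        print(intensity, attainable_gflops(100.0, 20.0, intensity))
```

On this hypothetical machine a kernel with a low FLOP/byte ratio is limited by memory bandwidth, so rearranging data movement to raise the ratio helps more than adding arithmetic units.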
29/52
![Page 30: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/30.jpg)
Maxeler Hardware
CPUs plus DFEs: Intel Xeon CPU cores and up to 4 DFEs with 192GB of RAM
DFEs shared over Infiniband: Up to 8 DFEs with 384GB of RAM and dynamic allocation of DFEs to CPU servers
Low latency connectivity: Intel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet connections
MaxWorkstation: Desktop development system
MaxCloud: On-demand scalable accelerated compute resource, hosted in London
30/52
![Page 31: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/31.jpg)
Major Classes of Algorithms, from the Computational Perspective
1. Coarse grained, stateful: Business
- CPU requires DFE for minutes or hours
2. Fine grained, transactional with shared database: DM
- CPU utilizes DFE for ms to s
- Many short computations, accessing common database data
3. Fine grained, stateless transactional: Science (FF)
- CPU requires DFE for ms to s
- Many short computations
31/52
![Page 32: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/32.jpg)
Coarse Grained: Modeling
• Long runtime, but:
• Memory requirements change dramatically based on modelled frequency
• Number of DFEs allocated to a CPU process can be easily varied to increase available memory
• Streaming compression
• Boundary data exchanged over chassis MaxRing
[Figure: equivalent CPU cores vs. number of MAX2 cards (1, 4, 8), for 15Hz, 30Hz, 45Hz, and 70Hz peak frequency]
[Figure: timesteps (thousand), domain points (billion), and total computed points (trillion) vs. peak frequency (Hz)]
32/52
![Page 33: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/33.jpg)
Fine Grained, Shared Data: Monitoring
• DFE DRAM contains the database to be searched
• CPUs issue transactions find(x, db)
• Complex search function
- Text search against documents
- Shortest distance to coordinate (multi-dimensional)
- Smith Waterman sequence alignment for genomes
• Any CPU runs on any DFE that has been loaded with the database
- MaxelerOS may add or remove DFEs from the processing group to balance system demands
- New DFEs must be loaded with the search DB before use
33/52
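The find(x, db) pattern on this slide can be mimicked in plain Python to show the bookkeeping involved. This is a toy model under stated assumptions, not the MaxelerOS API: the class names `Dfe` and `ProcessingGroup`, the round-robin dispatch, and the choice of "shortest distance to coordinate" as the search function are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class Dfe:
    """Hypothetical stand-in for one DFE; `db` models its DRAM contents."""
    name: str
    db: Optional[List[Point]] = None  # None until the search DB is loaded

    def find(self, x: Point) -> Point:
        if self.db is None:
            raise RuntimeError(f"{self.name} has no database loaded")
        # Example search function: shortest squared distance to coordinate x.
        return min(self.db, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, x)))


class ProcessingGroup:
    """Toy model of a group of DFEs that all hold the same database."""

    def __init__(self, db: List[Point]) -> None:
        self._db = db
        self._members: List[Dfe] = []
        self._turn = 0

    def add(self, dfe: Dfe) -> None:
        dfe.db = list(self._db)  # a new DFE is loaded with the DB before use
        self._members.append(dfe)

    def remove(self, name: str) -> None:
        self._members = [d for d in self._members if d.name != name]

    def find(self, x: Point) -> Tuple[str, Point]:
        if not self._members:
            raise RuntimeError("no DFE in the processing group")
        dfe = self._members[self._turn % len(self._members)]  # round robin (assumed)
        self._turn += 1
        return dfe.name, dfe.find(x)


if __name__ == "__main__":
    group = ProcessingGroup([(0.0, 0.0), (5.0, 5.0), (10.0, 10.0)])
    group.add(Dfe("dfe0"))
    group.add(Dfe("dfe1"))
    print(group.find((4.0, 4.0)))
    print(group.find((9.0, 9.0)))
```

Because every member holds a full copy of the database, any transaction can run on any member, which is what lets the group grow or shrink without the callers noticing.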
![Page 34: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/34.jpg)
Fine Grained, Stateless: The BSOP Control
• Analyse > 1,000,000 scenarios
• Many CPU processes run on many DFEs
- Each transaction executes on any DFE in the assigned group atomically
• ~50x MPC-X vs. multi-core x86 node
[Diagram: the CPU sends market and instruments data to the DFE; the DFE loops over instruments, runs the random number generator and sampling of underliers, and prices instruments using Black Scholes; instrument values return to the CPU for tail analysis]
34/52
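The stages named on the BSOP slide (loop over instruments, random number generation and sampling of underliers, Black Scholes pricing, tail analysis) can be sketched in plain Python. This is a minimal single-machine sketch, not the Maxeler implementation: the instruments are assumed to be European calls, the underliers are assumed to follow independent geometric Brownian motions, and all function and parameter names are hypothetical.

```python
import math
import random


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot, strike, rate, vol, maturity):
    """Black Scholes price of a European call option."""
    sd = vol * math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * maturity) / sd
    d2 = d1 - sd
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


def scenario_losses(instruments, rate, horizon, n_scenarios, seed=0):
    """Loss of the portfolio in each simulated scenario.

    instruments: list of (spot, strike, vol, maturity) tuples (assumed layout).
    On the slide, this inner part is what runs on the DFE.
    """
    rng = random.Random(seed)
    today = sum(black_scholes_call(s, k, rate, v, t) for s, k, v, t in instruments)
    losses = []
    for _ in range(n_scenarios):
        value = 0.0
        for spot, strike, vol, maturity in instruments:  # loop over instruments
            z = rng.gauss(0.0, 1.0)                      # random number generator
            drift = (rate - 0.5 * vol * vol) * horizon
            shocked = spot * math.exp(drift + vol * math.sqrt(horizon) * z)  # sampled underlier
            value += black_scholes_call(shocked, strike, rate, vol, maturity - horizon)
        losses.append(today - value)
    return losses


def tail_loss(losses, confidence=0.99):
    """Tail analysis (the CPU stage on the slide): loss at the given quantile."""
    ordered = sorted(losses)
    return ordered[min(len(ordered) - 1, int(confidence * len(ordered)))]


if __name__ == "__main__":
    book = [(100.0, 100.0, 0.2, 1.0), (50.0, 55.0, 0.3, 0.5)]
    losses = scenario_losses(book, rate=0.01, horizon=10 / 252, n_scenarios=5000)
    print(f"99% tail loss: {tail_loss(losses):.4f}")
```

The scenario loop has no dependence between iterations, which is why the slide classes this workload as stateless and transactional: each scenario can be priced on any available DFE.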
![Page 35: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/35.jpg)
35/52
Selected Examples
![Page 36: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/36.jpg)
36/52
![Page 37: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/37.jpg)
The CRS Results
Performance of one MAX2 card vs. 1 CPU core
Land case (8 params), speedup of 230x
Marine case (6 params), speedup of 190x
[Figure panels: CPU Coherency, MAX2 Coherency]
37/52
![Page 38: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/38.jpg)
Seismic Imaging
• Running on MaxNode servers
- 8 parallel compute pipelines per chip
- 150MHz => low power consumption!
- 30x faster than microprocessors
An Implementation of the Acoustic Wave Equation on FPGAs. T. Nemeth†, J. Stefani†, W. Liu†, R. Dimond‡, O. Pell‡, R. Ergas§
†Chevron, ‡Maxeler, §Formerly Chevron, SEG 2008
38/52
![Page 39: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/39.jpg)
39/52
![Page 40: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/40.jpg)
![Page 41: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/41.jpg)
Trace Stacking: Speed-up 217
P. Marchetti et al., 2010
• DM for Monitoring and Control in Seismic processing
• Velocity independent / data driven method to obtain a stack of traces, based on 8 parameters
- Search for every sample of each output trace
Parameters:
- emergence angle & azimuth
- Normal wave front parameters (KN,11; KN,12; KN,22)
- NIP wave front parameters (KNip,11; KNip,12; KNip,22)
$$t_{hyp}^2 = \left(t_0 + \frac{2}{v_0}\,\mathbf{w}^T\mathbf{m}\right)^2 + \frac{2 t_0}{v_0}\left(\mathbf{m}^T \mathbf{H}_{zy}\mathbf{K}_{N}\mathbf{H}_{zy}^T\mathbf{m} + \mathbf{h}^T \mathbf{H}_{zy}\mathbf{K}_{NIP}\mathbf{H}_{zy}^T\mathbf{h}\right)$$
41/52
![Page 42: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/42.jpg)
42/52
![Page 43: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/43.jpg)
43/52
![Page 44: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/44.jpg)
44/52
![Page 45: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/45.jpg)
Conclusion: Nota Bene
This is about algorithmic changes, to maximize the algorithm to architecture match: data choreography, process modifications, and decision precision.
The winning paradigm of Big Data ExaScale?
45/52
![Page 46: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/46.jpg)
The TriPeak
Siena + BSC + Imperial College + Maxeler + Belgrade
46/52
![Page 47: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/47.jpg)
The TriPeak
MontBlanc = A ManyCore (NVidia) + a MultiCore (ARM)
Maxeler = A FineGrain DataFlow (FPGA)
How about a happy marriage? MontBlanc (ompSS) and Maxeler (an accelerator)
In each happy marriage, it is known who does what :)
The Big Data DM algorithms: What part goes to MontBlanc and what to Maxeler?
47/52
![Page 48: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/48.jpg)
Core of the Symbiotic Success
An intelligent DM algorithmic scheduler, partially implemented for compile time, and partially for run time.
At compile time: Checking what part of code fits where (MontBlanc or Maxeler): LoC 1M vs 2K vs 20K
At run time: Rechecking the compile time decision, based on the current data values.
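The two-phase decision described on this slide can be sketched as a pair of functions. The slide does not say which criteria the scheduler uses, so everything below is an assumption made for illustration: the `Kernel` fields, the data-volume threshold, and the rule that only streaming loops are candidates for the dataflow side.

```python
from dataclasses import dataclass

# Hypothetical threshold: below this many data items, moving work to the
# accelerator is assumed not to pay off.
MIN_ITEMS_FOR_DATAFLOW = 1_000_000


@dataclass
class Kernel:
    """Hypothetical description of one part of the application code."""
    name: str
    is_streaming_loop: bool  # known at compile time
    estimated_items: int     # compile-time estimate of the data volume


def compile_time_target(kernel: Kernel) -> str:
    """Compile-time check of what part of the code fits where."""
    if kernel.is_streaming_loop and kernel.estimated_items >= MIN_ITEMS_FOR_DATAFLOW:
        return "Maxeler"
    return "MontBlanc"


def run_time_target(kernel: Kernel, actual_items: int) -> str:
    """Run-time recheck of the compile-time decision, using current data."""
    if not kernel.is_streaming_loop:
        return "MontBlanc"  # control-heavy code never moves in this sketch
    if actual_items >= MIN_ITEMS_FOR_DATAFLOW:
        return "Maxeler"
    return "MontBlanc"


if __name__ == "__main__":
    stencil = Kernel("stencil", is_streaming_loop=True, estimated_items=5_000_000)
    print(compile_time_target(stencil))     # planned placement
    print(run_time_target(stencil, 1_000))  # rechecked for a small input
```

The run-time function can overturn the compile-time plan in either direction, which is the point of splitting the scheduler across the two phases.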
48/52
![Page 49: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/49.jpg)
Maxeler: Teaching (Google: prof vm)
TEACHING, VLSI, PowerPoints, Maxeler:
Maxeler Veljko Explanations, August 2012
Maxeler Veljko Anegdotic
Maxeler Oskar Talk, August 2012
Maxeler Forbes Article
Flyer by JP Morgan
Flyer by Maxeler HPC
Tutorial Slides by Sasha and Veljko: Practice (Current Update)
Paper, unconditionally accepted for Advances in Computers by Elsevier
Paper, unconditionally accepted for Communications of the ACM
Tutorial Slides by Oskar: Theory (7 parts)
Slides by Jacob, New York
Slides by Jacob, Alabama
Slides by Sasha: Practice (Current Update)
Maxeler in Meteorology
Maxeler in Mathematics
Examples generated in Belgrade and Worldwide
THE COURSE ALSO INCLUDES DARPA METHODOLOGY FOR MICROPROCESSOR DESIGN, with an example
49/52
![Page 50: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/50.jpg)
Maxeler: Research (Google: good method)
Structure of a Typical Research Paper: Scenario #1 [Comparison of Platforms for One Algorithm]
Curve A: MultiCore of approximately the same PurchasePrice
Curve B: ManyCore of approximately the same PurchasePrice
Curve C: Maxeler after a direct algorithm migration
Curve D: Maxeler after algorithmic improvements
Curve E: Maxeler after data choreography
Curve F: Maxeler after precision modifications
Structure of a Typical Research Paper: Scenario #2 [Ranking of Algorithms for One Application]
CurveSet A: Comparison of Algorithms on a MultiCore
CurveSet B: Comparison of Algorithms on a ManyCore
CurveSet C: Comparison on Maxeler, after a direct algorithm migration
CurveSet D: Comparison on Maxeler, after algorithmic improvements
CurveSet E: Comparison on Maxeler, after data choreography
CurveSet F: Comparison on Maxeler, after precision modifications
50/52
![Page 51: Data flow super computing valentina balas](https://reader030.fdocuments.us/reader030/viewer/2022013003/54bde22d4a7959bb608b46c6/html5/thumbnails/51.jpg)
Maxeler: Topics (Google: HiPeac Berlin)
SRB (TR):
KG: Blood Flow
NS: Combinatorial Math
BG1: MiSANU Math
BG2: Meteos Meteorology
BG3: Physics (Gross Pitaevskii 3D real)
BG4: Physics (Gross Pitaevskii 3D imaginary)
(reusability with MPI/OpenMP vs effort to accelerate)
FP7 (Call 11):
University of Siena, Italy, ICL, UK, BSC, Spain, QPLAN, Greece, ETF, Serbia, IJS, Slovenia, …
51/52