![Page 1: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/1.jpg)
From the latency to the throughput age
Prof. Jesús Labarta
Director, Computer Science Dept (BSC)
UPC
ETP4HPC Post-H2020 HPC Vision
Frankfurt, June 24th 2018
![Page 2: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/2.jpg)
To exascale ... and beyond
![Page 3: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/3.jpg)
Vision
The multicore and memory revolution
– ISA leak …
– Plethora of architectures
  • Heterogeneity
  • Memory hierarchies
Complexity + variability = Divergence
– Between our mental models and actual system behavior
[Figure labels: Applications; ISA / API]
The power wall made us go multicore and the ISA interface leak: our world is shaking
What do programmers need ? HOPE !!!
![Page 4: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/4.jpg)
Vision
• … similar effect at system level/coarse grain
  • Plethora of architectures
  • Heterogeneity
  • Memory hierarchies
• New usage practices
  • Online simulation, analytics and visualization
  • Interactive supercomputing, response time
  • Value based computing
  • Urgent computing
• Important
  • Integration of concurrency and data
  • Dynamic resource sharing
[Figure labels: data1, Simulation1, Simul2, data2]
BSC vision. BDEC. Fukuoka. Feb 2014
![Page 5: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/5.jpg)
Evolution vs. revolution
• Revolutions
  • Change of mindset (before / after)
• Do we think outside the box ?
![Page 6: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/6.jpg)
![Page 7: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/7.jpg)
![Page 8: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/8.jpg)
Do we think outside the box ?
• Very strong walls in the HPC box !!!
• Sometimes we try to blow them up
• But the walls are in our mind !!!
![Page 9: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/9.jpg)
Do we think outside the box ?
• We do (I may be exaggerating … or maybe not that much)
  • Proudly show the performance we achieve and not the code we write
  • Use variables about resources (cores, GPUs)
    • omp_get_num_threads(), …
  • Run sequences of jobs with 5K cores because each of them takes 20% less time than with 2K cores
  • Believe that overlap == changing sends to isends or using one-sided calls
  • Burn millions of hours to estimate a good configuration
  • Integrate simulation, analytics, visualization in a single MPI binary
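The 5K versus 2K cores bullet can be worked through with simple arithmetic. The sketch below uses only the two figures from the slide (5K cores, 20% less time than with 2K cores). The normalized run time and the assumption that two 2K-core jobs can run side by side in a 5K-core allocation are illustrative simplifications, not data from the talk.

```python
# Numbers from the slide: a job on 5K cores takes 20% less time
# than the same job on 2K cores. Everything else is normalized.
cores_small, cores_large = 2000, 5000
time_small = 1.0                  # normalized run time on 2K cores
time_large = 0.8 * time_small     # 20% less time on 5K cores

# Latency view: each job finishes sooner.
speedup = time_small / time_large

# Efficiency view: speedup obtained per unit of extra resources.
relative_efficiency = speedup / (cores_large / cores_small)

# Cost view: core-time consumed per job.
cost_ratio = (cores_large * time_large) / (cores_small * time_small)

# Throughput view on a fixed 5K-core allocation, assuming (hypothetically)
# that two 2K-core jobs can run concurrently and the rest stays idle.
throughput_large = 1 / time_large
throughput_small = (cores_large // cores_small) / time_small

print(f"speedup {speedup:.2f}x, efficiency {relative_efficiency:.2f}, "
      f"cost {cost_ratio:.1f}x")
print(f"jobs per time unit: {throughput_large:.2f} (5K each) "
      f"vs {throughput_small:.2f} (2K each)")
```

Under these assumptions each job is 1.25 times faster but costs twice the core-time, and the sequence as a whole completes more jobs per unit time with the smaller configuration.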
![Page 10: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/10.jpg)
Do we think outside the box ?
• Do we ?
  • Interleave processes ?
  • Think of using MPI + OpenMP with just 1 OpenMP thread ?
  • Share nodes among jobs ?
  • Serialize (and overlap) reductions ?
  • Taskify MPI calls to allow their out of order execution ?
  • Spawn packing and unpacking tasks to allow for fast draining of incoming messages by main process ?
  • Parallelize packing and unpacking of messages ? Depending on message size ?
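The idea of spawning unpacking tasks so that the main process only drains incoming messages can be sketched outside MPI. In the example below a `queue.Queue` stands in for the incoming message stream and a thread pool stands in for a tasking runtime. The `unpack` and `drain` functions and the comma-separated payload format are hypothetical, chosen only to keep the sketch self-contained.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def unpack(message):
    """Hypothetical unpacking step: decode a comma-separated payload."""
    return [int(field) for field in message.split(",")]


def drain(incoming, pool):
    """The main loop only receives; unpacking is handed off as tasks."""
    futures = []
    while True:
        message = incoming.get()
        if message is None:           # sentinel: no more messages
            break
        futures.append(pool.submit(unpack, message))
    # Results are collected in arrival order once draining is done.
    return [future.result() for future in futures]


# Stand-in for messages arriving from other processes.
incoming = queue.Queue()
for payload in ("1,2,3", "4,5", "6"):
    incoming.put(payload)
incoming.put(None)

with ThreadPoolExecutor(max_workers=2) as pool:
    unpacked = drain(incoming, pool)

print(unpacked)
```

The receiving loop never waits for an unpack to finish before taking the next message, which is the property the slide asks about. Whether this pays off for small messages is exactly the "depending on message size" question.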
![Page 11: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/11.jpg)
Do we think outside the box ?
• Why ?
  • Follow “recommended best practices”
  • Never thought of it ?
  • Some bad experience: never again
  • I can do it better !!!!!
  • Dazzled by performance !!
![Page 12: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/12.jpg)
All about the mindset
• The real parallel programming revolution
  • … is in the mindset of programmers
    • From the latency to the throughput age !!!
  • … and can/should be achieved productively
    • Incrementally
    • On a standard programming model/language (MPI+OpenMP, Python, …)
• Real revolution, real effort
  • Issue everywhere. At home first.
  • Shape minds vs. reshape minds
![Page 13: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/13.jpg)
Key aspects
• Actual behavior/Performance analysis
  • Avoid flying blind !!
  • Towards insight and understanding of fundamental issues
  • For application & system developers
• Programming practices and models
  • Decouple programmer from machine
    • Programs to convey ideas to humans … that happen to be executable by machines
  • Enable productive/evolutionary/composable approaches
  • Can we avoid/contain the complexity explosion ?
  • Dynamic resource sharing
![Page 14: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/14.jpg)
Behavior awareness
• A common language about fundamental issues
• Evolution of bottlenecks
• Methodology
• 195 studies:
  • ~25% industry
  • Awareness
  • Opportunity to improve
    • And examples of how
  • Co-design input
![Page 15: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/15.jpg)
Behavior awareness
Tracking scaling behavior of computation regions (strong scaling MPI+OpenMP example)
![Page 16: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/16.jpg)
Behavior awareness
• Coupled codes
  • Multiple physics, domains
  • Compute & I/O
[Figure: EC-EARTH trace (Atmosphere, Ocean), 1600 cores, 2.5 s; 26.7 MB trace; Eff: 0.43; LB: 0.52; Comm: 0.81]
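The slide reports Eff, LB and Comm without defining them. A minimal sketch follows, assuming they are the usual multiplicative efficiency metrics: load balance is average over maximum useful computation time, communication efficiency is maximum useful time over run time, and parallel efficiency is their product. The function name and the per-process timings are hypothetical.

```python
def efficiencies(useful_times, runtime):
    """Return (parallel efficiency, load balance, communication efficiency).

    useful_times: per-process time spent in useful computation.
    runtime: elapsed wall-clock time of the whole run.
    Assumes the multiplicative model Eff = LB * Comm.
    """
    average = sum(useful_times) / len(useful_times)
    peak = max(useful_times)
    load_balance = average / peak          # 1.0 means perfectly balanced
    communication = peak / runtime         # 1.0 means no time lost outside computation
    return load_balance * communication, load_balance, communication


# Hypothetical run: four processes, one of them doing most of the work.
useful = [4.0, 2.0, 1.0, 1.0]
eff, lb, comm = efficiencies(useful, runtime=5.0)
print(f"Eff: {eff:.2f}; LB: {lb:.2f}; Comm: {comm:.2f}")
```

Under the same assumption, the figures on the slide multiply to 0.52 × 0.81 ≈ 0.42, close to the reported 0.43, which would be consistent with the factors having been rounded before display.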
![Page 17: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/17.jpg)
Vision in the programming revolution
ISA / API
Applications
Power to the runtime
PM: High-level, clean, abstract interface
General purpose
Decouple
Forget about resources
Minimal & sufficient permeability?
Intelligence & resource management
“Reuse & expand” old architectural ideas under new constraints
![Page 18: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/18.jpg)
Vision in the programming revolution
ISA / API
Special purpose
Must be easy to develop/maintain
Fast prototyping
Applications
Power to the runtime
PM: High-level, clean, abstract interface
DSL1 DSL2 DSL3
![Page 19: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/19.jpg)
Task based parallel programming
Integrate concurrency and data
• Single mechanism
• Concurrency
  • Dependences built from data accesses
  • Lookahead: about instantiating work
• Locality & data management
  • From data accesses
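"Dependences built from data accesses" can be illustrated with a small sketch: each task declares what it reads and writes, and the ordering constraints follow from those declarations alone. The `build_dependences` function, the task names and the data names below are hypothetical, and this is not the implementation of any particular runtime.

```python
def build_dependences(tasks):
    """Derive task dependences from declared data accesses.

    tasks: list of (name, inputs, outputs) in program order.
    Returns a dict mapping each task name to the set of earlier
    tasks it must wait for.
    """
    last_writer = {}          # datum -> task that last wrote it
    readers_since_write = {}  # datum -> tasks that read it since that write
    deps = {}
    for name, inputs, outputs in tasks:
        waits = set()
        for datum in inputs:
            if datum in last_writer:
                waits.add(last_writer[datum])              # read after write
        for datum in outputs:
            if datum in last_writer:
                waits.add(last_writer[datum])              # write after write
            waits.update(readers_since_write.get(datum, ()))  # write after read
        deps[name] = waits
        for datum in inputs:
            readers_since_write.setdefault(datum, set()).add(name)
        for datum in outputs:
            last_writer[datum] = name
            readers_since_write[datum] = set()
    return deps


# Hypothetical program: the runtime sees only the access declarations.
program = [
    ("init_a", [], ["a"]),
    ("init_b", [], ["b"]),
    ("add", ["a", "b"], ["c"]),
    ("scale_a", ["a"], ["a"]),
    ("report", ["c"], []),
]
deps = build_dependences(program)
for task, waits in deps.items():
    print(task, "waits for", sorted(waits))
```

Here `init_a` and `init_b` have no dependences and may run concurrently, while `report` only waits for `add` and can overlap with `scale_a`. The same access information also says which data each task touches, which is what locality and data management decisions would be based on.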
![Page 20: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/20.jpg)
Task based parallel programming
• Some important features
  • Dependences, Lookahead
  • Taskloops
  • Nesting
  • Array sections / Regions
  • Exploiting malleability:
    • Dynamic Load Balance (DLB)
    • Within App, across apps
  • MPI+OpenMP interoperability
• Think global, specify local
![Page 21: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/21.jpg)
Towards the throughput age
• By
  • Express potential concurrency
  • Malleability
  • Dynamic resource sharing/management
• Configuration independence
  • Amount of resources is what really matters
• Side effects
  • Nx1 can be better than pure MPI !!!
  • Hope for lazy programmers
![Page 22: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/22.jpg)
Infrastructures for new usage modes
• Persistent KVS
  • Alternative for parallel programs I/O ?
  • Flexible querying: 3D indexing, Data-thinning
• Need/opportunity of clean integration of concurrency and data
  • Within one app
  • Shared communication space between multiple apps
• Malleable/Elastic/opportunistic resource management/sharing
![Page 23: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/23.jpg)
Impact on architecture ?
• High throughput devices
  • Long Vectors
  • Decouple Front end - Back end engines, reduce front end pressure, optimize memory throughput, explicit locality management
  • Specialized compute and data motion engines
  • Tuned numerical precision
• ISA is important
  • Decouple/hide again hardware details, reuse SW technologies (compilers, OS, …)
  • Specific instructions ?
  • “Limited” number of control flows
• Hierarchical Acceleration
  • Nesting
  • Homogenize heterogeneity
• Runtime aware architectures (RAA)
![Page 24: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/24.jpg)
Age before beauty
• Behavior (insight/models) before syntax
• Detailed performance analytics before aggregated profiles
• Work instantiation and order before overhead
• Malleability before fitted rigid structure
• Possibilities before how-tos
• Elegance before one-day shine
![Page 25: From the latency to the throughput age - Post H2020 Vision - Jesus Labarta.pdf · • From the latency to the throughput age !!! • … and can/should be achieved productively •](https://reader035.fdocuments.us/reader035/viewer/2022070111/604df1220c0b666b8d381358/html5/thumbnails/25.jpg)
The challenge
• Think of fundamentals, think out of the box
• Revolution: change everything so that nothing changes
• Should we: change as little as possible so that everything is different ?
• Programmers !!!!
• Develop a culture of
  • Efficiency awareness
  • Latency throughput mindset
  • Dynamic sharing of resources
• To exascale … and before