3rd SRA-4 work-session-summary charts
May 20th, Frankfurt
28/03/2019 · Photonics PPP Annual Meeting 2019
Research clusters for SRA-4 – the result:
Working group domains:
• System Architecture
• System Hardware Components
• System Software & Management
• Programming Environment
• IO & Storage
• Math & Algorithms
• Application co-design
• Centre-to-edge framework
For each research cluster define:
• relevance & impact (why chosen?)
• maturity (time to market)
• hurdles to overcome
• driving competence in Europe
• cost of research to gain significant uptake
Research domains:
• Data everywhere
• AI everywhere
• Energy efficiency
• Resilience
• Development methods and standards
• HPC and the digital continuum
How do we get to the final set of research clusters for SRA-4?
STEP 1: 44 suggested cluster elements (cluster #1 with 2nd-level clusters x, y, z, …; … cluster #X with 2nd-level clusters a, …)
STEP 2: bundle & select (X < 10)
STEP 3: review / correct 1st-to-2nd-level grouping
STEP 4: review / define "anchoring" of 1st-level clusters
STEP 5: name "champions" for selected 1st-level clusters (1-3 champions per cluster, responsible for "5 parameters and intro")
STEP 6: review "5 cluster parameters": keep/change?
• Relevance & impact (why chosen?)
• Maturity (time to market)
• Hurdles to overcome
• Driving competence in Europe
• Cost of research to gain significant uptake
Open points: "size" and naming? total number? any last-minute additions?
#1, Champions: Adrian Tate (Cray), Francois Bodin (University of Rennes)
Research cluster: Development methods & standards

Cluster elements:
• Co-design of application/runtime/architecture, including proof-of-concept and best-practice demonstrations improving scalability and performance of applications
• Modelling of system components, data and the whole system
• Full system architectures (including technology integration)
• Standards convergence, in order to achieve portability and avoid "technology islands"
• Performance analysis and programming best practices
• Performance portability and future-proofing

Why chosen: Enabling applications to perform across varying and heterogeneous architectures is essential for ensuring sustainable performance on emerging Exascale computing systems, and for preventing development investments from binding applications to a specific solution path. Applied correctly, the approach should enable optimal design choices between multiple hardware and software options. Frameworks consisting of domain-specific languages, libraries, programming and abstraction frameworks, models and toolchains have proven to be a good practical approach to sustaining the highest possible performance of complex applications on multiple computing architectures.

Cluster anchoring: Independent of the specific scientific/industrial/societal problems to solve, holistic approaches based on co-design, i.e. an "end-to-end" design including modelling and integration aspects, are widely accepted as a necessary prerequisite for success.
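The abstraction-framework idea above (one application-facing kernel interface, multiple architecture-specific implementations) can be sketched minimally. All names here (BACKENDS, axpy, the backend functions) are illustrative inventions, not any real framework's API:

```python
# Hypothetical sketch of the abstraction-framework idea: the application
# codes against one kernel entry point, and the framework dispatches to
# the implementation best suited to the target architecture.

def axpy_scalar(a, x, y):
    """Portable reference backend: simple scalar loop."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def axpy_fused(a, x, y):
    """Stand-in for an architecture-tuned backend (e.g. SIMD or GPU)."""
    out = [0.0] * len(x)
    for i in range(len(x)):   # in a real framework: vectorised or offloaded
        out[i] = a * x[i] + y[i]
    return out

BACKENDS = {"scalar": axpy_scalar, "fused": axpy_fused}

def axpy(a, x, y, backend="scalar"):
    # Single entry point: moving to new hardware means swapping the backend
    # name, not rewriting the application.
    return BACKENDS[backend](a, x, y)

print(axpy(2.0, [1.0, 2.0], [3.0, 4.0], backend="fused"))  # [5.0, 8.0]
```

This is the performance-portability pattern in miniature: development investment goes into the portable interface, not into one solution path.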
#2, Champions: Laurent Cargemel (Atos), Paul Carpenter (BSC) and a BSC volunteer
Research cluster: Energy efficiency

Cluster elements:
• Heterogeneous acceleration — reaching Exascale-class performance on real-world applications requires applications to be able to exploit different acceleration techniques (GPUs, FPGAs, etc.). Proving that this assumption holds, and that applications can exploit programming techniques enabling the use of different acceleration components, sets a proof point for future work programmes.
• Heterogeneous computing — Exascale hardware will be heterogeneous; this needs to be addressed for application development and runtime.
• Scalable, energy-efficient solvers
• Algorithmic changes
• Auto-tuning systems — the idea behind this title is to investigate what could be done to have the HPC system (in the sense of the continuum) improve its behaviour automatically (more generic than performance alone). This includes the use of AI for the HPC system, but also older techniques.

Cluster anchoring: Innovation in energy efficiency is essential across all scientific and industrial use scenarios as a prerequisite for gaining the desired scalability, reducing TCO, etc.
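The auto-tuning element described above, in its oldest and simplest form, is empirical search: measure candidate configurations and let the system adopt the best one. A minimal sketch, where the tuned parameter (a block size for a summation) is purely illustrative:

```python
# Minimal empirical auto-tuning sketch: time each candidate configuration
# and keep the fastest. Real systems tune tile sizes, thread counts,
# clock/power settings, etc., possibly guided by AI models.

import time

def blocked_sum(data, block):
    """The kernel being tuned: a summation processed in blocks."""
    total = 0.0
    for i in range(0, len(data), block):
        total += sum(data[i:i + block])
    return total

def autotune(data, candidates):
    """Measure each candidate block size and return the fastest one."""
    timings = {}
    for block in candidates:
        t0 = time.perf_counter()
        blocked_sum(data, block)
        timings[block] = time.perf_counter() - t0
    return min(timings, key=timings.get)

data = [1.0] * 100_000
best = autotune(data, candidates=[64, 1024, 16384])
print("chosen block size:", best)
```

AI-based approaches replace the exhaustive timing loop with a learned model, but the feedback principle (observe, then reconfigure) is the same.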
#3, Champions: Maria Perez (UPM), Benny Koren (Mellanox)
Research cluster: AI everywhere

Cluster elements:
• Application of AI — AI is one of the main technologies driving industry today. Its impact will only increase in the near future, making this a priority target for HPC tools and infrastructure.
• Learning across the continuum — there is "AI for HPC" and "HPC for AI"; both directions will be needed (this will be the focus of one of the two BDEC demonstrators).
• AI & data analytics
• Distributed AI
• Network acceleration

Cluster anchoring: There will be no future for HPC infrastructure without strong support for AI, and a limited future for AI without HPC.
#4, Champions: Sai Narasimhamurthy (Seagate), Hans-Christian Hoppe (Intel), Gabriel Antoniu (Inria)
Research cluster: Data everywhere

Cluster elements:
• Data-centric computing
• NVM and its use as persistent memory and for persistent I/O
• Data life-cycle management in distributed scientific environments
• Data sharing (data flexibility) — data sharing, or flexibility, is the ability for different users to access the same data, and for the data to be usable from heterogeneous frameworks. It is fundamental for developing a federation, as well as for optimising the usage and efficiency of such a federation.
• New addressing schemes for persistent memory
• Byte-addressable versus block model

Cluster anchoring: Understanding data-centric requirements (and data logistics) is essential for "HPC in a digital continuum".
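The "byte-addressable versus block model" contrast above can be illustrated with ordinary files: block storage forces a read-modify-write of a whole block to change one byte, while persistent-memory-style access can store a single byte directly. The sketch below uses `mmap` on a regular file as a stand-in for NVM, which is only an analogy; the file name and block size are arbitrary:

```python
# Illustrative contrast: block-model I/O vs byte-addressable access.
# mmap on an ordinary file stands in for persistent memory here.

import mmap, os, tempfile

path = os.path.join(tempfile.gettempdir(), "nvm_sketch.bin")
BLOCK = 512

with open(path, "wb") as f:
    f.write(b"\x00" * BLOCK * 4)       # a tiny 4-block "device"

# Block model: to change one byte, rewrite the whole containing block.
with open(path, "r+b") as f:
    f.seek(1 * BLOCK)
    block = bytearray(f.read(BLOCK))   # read 512 bytes...
    block[37] = 0xAB                   # ...change one...
    f.seek(1 * BLOCK)
    f.write(bytes(block))              # ...write 512 bytes back

# Byte-addressable model: map the "device" and store a single byte.
with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    mm[2 * BLOCK + 5] = 0xCD           # one-byte store
    mm.flush()                         # persistence point (like a cache flush)
    mm.close()
```

New addressing schemes for persistent memory aim to make the second style safe and efficient at scale.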
#5, Champions: Marc Duranton (CEA), Gabriel Antoniu (Inria), Francois Bodin (University of Rennes)
Research cluster: HPC and the digital continuum

Cluster elements:
• Unified data storage and processing across the digital continuum: edge-cloud-HPC system
• HPC on the edge
• HPC in the loop — this makes HPC more relevant, as it integrates deeply into larger society-wide and industry-wide workflows. It will increase the value of HPC simulations by making them more available to stakeholders.
• Support for workflows on heterogeneous systems
• Seamless heterogeneous architectures (and software for them)
• Privacy and security in the edge-HPC-centre continuum
• Orchestration and mediation of resources and workflows
• Coupling of HPC & HTC; ensembles
• Full system design of converged HPC/cloud architectures
• HPC (or HTC) in the cloud/edge
• Containers
• Simulation and experimental-data assimilation models

Cluster anchoring: The link and motivation are obvious (see the blueprint); industrial and scientific use cases (CoEs).
#6, Champions: Manolis Marazakis (FORTH-ICS), Petar Radojkovic (BSC)
Research cluster: Resilience

Cluster elements:
• HPC resilience — system resilience is one of the most important Exascale requirements. The European HPC community, however, lacks a strong research effort in resilience, which makes resilience one of the greatest challenges of the EU HPC initiative (ETP4HPC Strategic Research Agenda: Achieving HPC Leadership in Europe, 2013, page 42). Assuring the resiliency of large-scale HPC systems is complex and requires research and engineering effort for the analysis, development and evaluation of reliability features. A further problem is that resilience is a vertical problem that needs holistic solutions. For all these reasons, we have to make sure that HPC system resilience is properly represented in future work programmes.
• Resilience in Exascale HPC — resilience is widely recognized as a critical challenge for high-performance computing (HPC) systems, as a result of increasing complexity, both at the level of individual hardware and software components and at the level of subsystems and complete heterogeneous system configurations. At scale, we can no longer assume faults, errors, and failures to be uncommon events. Moreover, even more challenging failure modes have emerged, beyond the assumptions of the commonly assumed fail-stop model, raising concern about the integrity of computations and of data at rest and in transit. Application correctness and execution efficiency, in spite of frequent faults, errors, and failures, is therefore essential to ensure the success of extreme-scale HPC systems, and more broadly of data-centre-scale systems such as cloud infrastructure. Further challenges arise from the interplay between resiliency and energy consumption: improving resilience often relies on redundancy (replication and/or checkpointing, rollback and recovery), which consumes extra energy. Based on these observations, a cross-cutting activity focused on resilience concerns in future Exascale HPC systems is proposed.
• Adaptivity, uncertainty quantification

Cluster anchoring: Improved system-level resiliency has always been at the top of the requirements list of the users we interviewed, and will stay that way in the pursuit of extreme-scale computing. Algorithmic resilience is included as well.
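The redundancy techniques the resilience text names (checkpointing, rollback and recovery) reduce to a simple loop: persist state periodically, and on restart resume from the last checkpoint instead of from the beginning. A minimal sketch, with an invented file name and a simulated fault standing in for a real node failure:

```python
# Minimal checkpoint/restart sketch of the rollback-recovery pattern.
# The checkpoint file, state layout and "failure" are illustrative only.

import json, os, tempfile

CKPT = os.path.join(tempfile.gettempdir(), "sketch_ckpt.json")
if os.path.exists(CKPT):
    os.remove(CKPT)                    # start the demo from a clean slate

def save_checkpoint(state):
    with open(CKPT, "w") as f:
        json.dump(state, f)

def load_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)        # recovery: resume persisted state
    return {"step": 0, "value": 0}

def run(total_steps, fail_at=None):
    state = load_checkpoint()
    while state["step"] < total_steps:
        if state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["value"] += state["step"]
        state["step"] += 1
        save_checkpoint(state)         # periodic checkpoint (here: every step)
    return state["value"]

try:
    run(10, fail_at=7)                 # first attempt dies mid-run
except RuntimeError:
    pass
final = run(10)                        # restart resumes from step 7
print(final)                           # 45 (= sum of steps 0..9)
```

The energy-consumption interplay mentioned above shows up here directly: every `save_checkpoint` call is redundant work whose frequency must be traded off against the cost of recomputation after a failure.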
Proposed topics to be moved to working groups:
• Scalable applications → Maths & Algorithms WG (Guy Lonsdale, scapos AG)
• Infrastructure computing (Benny Koren, Mellanox) — the emergence of smart NICs in HPC and the cloud calls for a different partitioning of the functionalities that were all running on the hosts. Some of the infrastructure services (e.g. firewalls, storage) will migrate from the main CPU into the smart-NIC processing engines. This is a major change in compute-node architecture that opens the door to many innovations and new capabilities.
• Adoption of upcoming hardware architectures (Dirk Pleiter, Forschungszentrum Juelich) → Architecture WG
• Configurable architectures (Laurent Cargemel)
• Reconfigurable computing (Xavier Martorell, Barcelona Supercomputing Center)
• Urgent computing (CoE ChEESE) — applications that provide decision makers with information during critical emergencies cannot waste time waiting in job queues and need access to computational resources as soon as possible.
• Active networking (Valeria Bartsch, Fraunhofer ITWM) — future networking devices (such as smart NICs) will allow us to prepare data (i.e. filtering, compression, aggregation) before sending it. Such approaches can reduce network congestion, and applications (such as AI) can immediately reap the benefits.
• GPU disintegration and scaling across the network (Benny Koren, Mellanox)
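The active-networking idea (filter, aggregate and compress data before it hits the wire) can be sketched on the host side; in a real system these stages would run on the smart NIC itself. The record layout and thresholds below are invented for illustration:

```python
# Sketch of "prepare data before sending": filter, aggregate, compress.
# In an active-networking design these steps run on the NIC, not the CPU.

import json, zlib

records = [{"sensor": i, "value": float(i % 5)} for i in range(100)]

# Filter: drop records the receiver does not need.
kept = [r for r in records if r["value"] > 0.0]

# Aggregate: send one summary instead of every raw record.
summary = {"count": len(kept), "total": sum(r["value"] for r in kept)}

# Compress: shrink the payload that actually crosses the network.
payload = zlib.compress(json.dumps(summary).encode())

print(len(json.dumps(records)), "->", len(payload), "bytes on the wire")
```

Each stage trades a little NIC-side computation for less traffic, which is exactly the congestion-reduction argument made above.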
Structuring the SRA text contributions:
Working group overviews (NOW to September 10th):
▪ From working groups: "State of the art" & "challenges" overview … 2-3 pages max
➢ Current state of the art in the WGx domain
➢ Challenges for 2021-2024 in the area of the WGx domain
Research clusters (agreed-upon set of 1st-level clusters):
▪ From cluster champions (NOW to August 1st):
❖ Intro: which topics are covered by this cluster? (refer to the set of 2nd-level clusters) … 1 page max
❖ "5 parameters" … 1 page max (not all parameters might be relevant for every cluster)
▪ From working groups (August 2nd to September 10th): describe the WGx-domain facets of the cluster … 3 pages max
➢ What is the overlap of the WGx domain with the cluster?
➢ What are the specific technical challenges?
➢ Approaches and options for solutions
➢ What should be researched further?
• Relevance & impact (why chosen?)
• Maturity (time to market)
• Hurdles to overcome
• Driving competence in Europe
• Cost of research to gain significant uptake
…
e.g. “Programming Environment”
SRA-4 timeline end-to-end:
● March 19th: SRA-4 process communicated at General Assembly
● March 21st : Invitation to apply for SRA-4 working groups to be sent out to ETP4HPC members
● March 31st : FP9 vision document electronic version available, registration for working groups incl. suggestions for research clusters
● April 12th : Deadline for working group registration and collection of proposals for research clusters
● April 15th - April 19th : we analyse your input and set up working groups
● May 17th : SRA-4 working group leaders internal workshop during European HPC Summit week (May 13th to 17th)
● May 20th – June 14th: Kick-off calls with working groups (8 calls, set up by office)
● June 19th : SRA-4 working group leaders internal workshop during ISC 19, start 18:30, Citadines Hotel, Frankfurt
● July/August/early September: writing complete text, individual working group calls (organized by working group leaders)
● Sept. 19th : first integration of SRA-4 document, technical part
● Oct. 4th : all other doc-parts integrated, document complete (first rendition)
● Oct. 17th : work session during European Big Data Value Forum
● October/November: text refinements, reviews, corrections
● Week of Dec. 9th: closing SRA-4 workshop, IBM ZRL Rueschlikon
● December: language checks, document design, release
Between now and September 10th
● “Keep things flowing”: a sync call in July, August and the 1st week of September between all SRA-4 working-group leads
● Any question in between? Contact Michael, Maike or Marcin
● Any interlock between working groups and cluster champions: use the email connections (next page)
How to connect with working group leaders…
1 - System Architecture, led by Laurent Cargemel (Atos) and Estela Suarez (Juelich SC)
2 - System Hardware Components, led by Marc Duranton (CEA) and Benny Koren (Mellanox)
3 - System Software & Management, led by Pascale Rosse-Laurent (Atos), María S. Pérez-Hernández (Universidad Politécnica de Madrid) and Manolis Marazakis (FORTH)
4 - Programming Environment, led by Guy Lonsdale (Scapos), Paul Carpenter (BSC) and Gabriel Antoniu (Inria)
5 - I/O & Storage, led by Sai Narasimhamurthy (Seagate) and André Brinkmann (Universität Mainz – JGU)
6 - Mathematics & Algorithms, led by Dirk Pleiter (Juelich SC) and Adrian Tate (Cray)
7 - Centre-to-Edge Framework, led by Jens Krueger (Fraunhofer) and Hans-Christian Hoppe (Intel)
8 - Application co-design, led by Erwin Laure (KTH) and Andreas Wierse (SICOS)
SRA-4 working group mailing lists
• System Architecture
• System Hardware Components
• System Software and Management
• Programming Environment
• I/O & Storage
• Mathematics & Algorithms
• Application co-design
• Centre-to-edge-framework
all working group leaders only: [email protected]
SRA-4: the increasing interplay of Simulation, AI, IoT and Analytics
SRA-4 content and structure / size
• Intro based on updated “Blueprint” document (20 out of 36 pages)
• Strategic directions (with input from RIAG – Axel Auweter, Maria Perez) (3 pages)
• Technical Research priorities 2021 – 2024 (see also next page)
• “State of the art” & “challenges” per each working group – 2-3 pages max per working group
(appr. 16-20 pages)
• Examples of relevant use cases (4 pages)
• Research clusters (agreed upon set)
Intro & “5 parameters” (1-2 pages by ‘sponsors’(tbd)). Detailed descriptions by working group (2-3 pages)
(max 50 pages)
• Upstream Technologies – focus for 2021 – 2024
• General recommendations for work programmes 2021 & 2022 (focus calls, large-scale pilots, collaborative aspects, …)
• Non-technical topics:
• HPC and HPDA in Europe/China/US/Japan – (BDEC-2 , M. Asch)
• Gap Analysis proposed (WPs in H2020) vs actual research (J.F. Lavignon)
• Contributing European organisational eco-system:
• CoE, BDEC-2, PRACE, HiPEAC, BDVA, AIOTI, ECSO, Eurolab4HPC
SRA-4 working groups and leads
• System Architecture
• Laurent Cargemel, ATOS
• Estela Suarez, JSC
• System Hardware Components
• Marc Duranton, CEA (HiPEAC)
• Benny Koren, MELLANOX
• System Software and Management
• Pascale Rosse-Laurent, ATOS
• Maria Perez, UPM (BDVA)
• Manolis Marazakis (FORTH)
• Programming Environment
• Guy Lonsdale, SCAPOS
• Paul Carpenter, BSC
• Gabriel Antoniu, INRIA (BDVA)
• I/O & Storage
• Sai Narasimhamurthy, SEAGATE
• Andre Brinkmann, JGU
• Mathematics & Algorithms
• Dirk Pleiter, JSC
• Adrian Tate, CRAY
• Application co-design
• Erwin Laure, KTH
• Andreas Wierse, SICOS
• Centre-to-edge-framework
• Jens Krueger, FRAUNHOFER
• Hans-Christian Hoppe, INTEL