HPC's Pivot to Data
Suzanne Parete-Koon (OLCF)
What is HPC’s Pivot to Data?
• The boundaries of the scientific computing ecosystem are being pushed beyond single centers providing HPC resources or any single experimental facility.
• Increasing data volume, variety, and velocity require additional resources to foster a productive environment for scientific discovery. This focus on data does not imply a reduction in the importance of large-scale simulation capabilities.
• Predicting how and why scientific workflows achieve their observed end-to-end performance is a growing challenge for scientists.
• This talk will focus on data movement, because it is an excellent example of a task that requires multi-facility collaboration to enable multi-facility scientific workflows.
Data Growth
[Figure: NERSC storage usage projection (log scale), 2000-2022, petabytes; series: HPSS Total PB and NGF Total PB]
[Figure: NCCS/OLCF HPSS usage, Jan 2010 - Mar 2014, petabytes]
Data Movement
[Figure: ESnet accepted traffic per month, Jan 1990 - Mar 2014, petabytes (log scale)]
Inter-Workflow: One HPC Center
Even projects with workflows designed to utilize only one HPC center need efficient data transfer, because data is typically moved to more permanent storage at the end of a project.
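The deck does not prescribe a tool here, but at OLCF and NERSC the permanent store is HPSS. A minimal sketch, assuming HPSS's htar utility is installed and authenticated; the directory and archive paths are hypothetical placeholders.

```python
# Minimal sketch: bundle a project directory into an HPSS-resident tar
# archive with htar at the end of a project. Assumes htar is on PATH and
# HPSS credentials are configured; all paths are hypothetical.
import subprocess

def archive_to_hpss(local_dir: str, hpss_archive: str) -> None:
    subprocess.run(["htar", "-cvf", hpss_archive, local_dir], check=True)

archive_to_hpss("run_output", "/proj/myproj/run_output_2014.tar")
```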
Intra-workflows: Simulation
Lingerfelt et al., "Near real-time data analysis of core-collapse supernova simulations with Bellerophon," in Proceedings of the International Conference on Computational Science (ICCS), in press, 2014.
Intra-workflows: Experiment
Single-particle X-ray crystallography: ~10^6 diffractive images reconstructed to reproduce angstrom-scale 3-D structure.
Enabling Intra-workflow: Remote Transfers

Centers
• How can we enable secure data movement?
• What transfer tools are best for the system?
• How can we efficiently monitor our network?

Users
• Can I script the transfer? Click and forget?
• How long will the transfer take? (See the sketch below.)
• What tools do you recommend?
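A back-of-the-envelope answer to "how long will the transfer take?" is data size divided by the achievable rate. A minimal sketch in Python; the size and rate in the example are illustrative, not measurements from this talk.

```python
# Estimate wall-clock transfer time from data size and achievable rate.
def transfer_time_hours(size_tb: float, rate_mbit_s: float) -> float:
    bits = size_tb * 1e12 * 8              # decimal terabytes -> bits
    seconds = bits / (rate_mbit_s * 1e6)   # Mbit/s -> bit/s
    return seconds / 3600.0

# Example: 10 TB at an effective 2 Gbit/s is roughly 11 hours.
print(f"{transfer_time_hours(10, 2000):.1f} h")
```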
The Science DMZ
• A network architecture explicitly designed for high-performance applications, where the science network is distinct from the general-purpose network
• The use of dedicated systems for data transfer
• Performance measurement and network testing systems that are regularly used to characterize the network and are available for troubleshooting
• Security policies and enforcement mechanisms that are tailored for high-performance science environments
OLCF Infrastructure
[Diagram: Titan (27 PF, 299,008 cores, 18,688 Cray XK7 nodes), Eos (200 TF, 11,904 cores, 744 Cray XC30 nodes), Rhea, Focus, and EVEREST connect through the InfiniBand core (SION) to the Atlas1 and Atlas2 file systems (18 SFA12K, 500 GB/s each) and to HPSS (35 PB, 5 GB/s ingest); data transfer nodes bridge SION to the Ethernet core over 10 GbE and 100 GbE links.]
NERSC Infrastructure
Perfsonar: How well is the transfer performing?
http://fasterdata.es.net/performance-testing/perfsonar/perfsonar-dashboard/
Perfsonar: How is my transfer performing?
• Open to the public.
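Where a perfSONAR dashboard is not available for the path in question, a quick point-to-point check with iperf3 is a common first step. A minimal sketch; the test host is a hypothetical placeholder, an iperf3 server must already be listening there, and this ad hoc test stands in for perfSONAR's regular scheduled measurements.

```python
# Ad hoc throughput check with iperf3, reading its JSON report.
# Assumes iperf3 is installed locally and a server is running on the
# hypothetical host below.
import json
import subprocess

result = subprocess.run(
    ["iperf3", "-c", "perf.example.gov", "-J"],
    capture_output=True, text=True, check=True,
)
report = json.loads(result.stdout)
bps = report["end"]["sum_received"]["bits_per_second"]
print(f"Achieved {bps / 1e9:.2f} Gbit/s")
```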
The Pros and Cons of SCP and Rsync
Cons:
• Single stream, therefore slow; takes poor advantage of the WAN.
• The common versions do not allow much control over buffer size and fault checking.
Pros:
• SCP is common on all Unix-like systems.
• Can be scripted into a workflow when the transfer destination allows passwordless SSH logins (see the sketch below).
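A minimal sketch of that "script it and forget it" pattern with rsync over passwordless SSH; the host and paths are hypothetical placeholders.

```python
# Scripted single-stream transfer with rsync over SSH. Assumes
# passwordless SSH keys to the destination, per the slide; the host and
# paths are hypothetical.
import subprocess

def rsync_push(src: str, dest: str) -> None:
    # -a preserves permissions and timestamps; --partial keeps partially
    # transferred files so an interrupted run can resume.
    subprocess.run(["rsync", "-a", "--partial", src, dest], check=True)

rsync_push("run_output/", "dtn.example.gov:/project/archive/run_output/")
```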
The Pros and Cons of Parallel Streams: GridFTP, Globus Online, bbcp
Cons:
• The software must be available at both ends of the transfer, and its availability varies. Setup is required (the "activation energy" of these tools).
• Security policies impact ease of use.
Pros:
• Multiple streams allow fast transfers on the WAN (a bbcp sketch follows below). Many options for customization.
• Globus is extremely user-friendly once it is set up.
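For comparison, a sketch of a multi-stream bbcp invocation; the stream count, window size, host, and paths are illustrative assumptions, not tuning advice from the talk.

```python
# Multi-stream transfer with bbcp: -s sets the number of parallel TCP
# streams and -w the window size. bbcp must be installed at both ends.
# Host, paths, and tuning values are hypothetical.
import subprocess

subprocess.run(
    ["bbcp", "-s", "16", "-w", "2M",
     "big_dataset.h5",
     "user@dtn.example.gov:/project/staging/big_dataset.h5"],
    check=True,
)
```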
What rates can I expect?
0"
1000"
2000"
3000"
4000"
GUC" GUC"(stripe)" Globus" BBCP" RSYNC"
!Rate!Mbits!
1"TB" 10"100"GB"files"
0"20"40"60"80"100"120"
GUC" GUC"(stripe)" Globus" BBCP"
Average'Time'
1"TB" 10"100"GB"files"
Best tool features for the system?
Filesystem benchmarking with fio shows that the performance improvement from GUC to GUC with striping comes from parallelism at the Lustre client (see the striping sketch below).
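Since the gain comes from client-side parallelism, striping the Lustre destination directory before a parallel transfer is the corresponding knob. A sketch using the standard lfs utility; the stripe count and path are illustrative assumptions.

```python
# Stripe a Lustre destination directory across several OSTs so parallel
# transfer streams can write concurrently. Requires Lustre's lfs tool;
# the stripe count and path below are hypothetical.
import subprocess

def set_stripe(path: str, count: int) -> None:
    subprocess.run(["lfs", "setstripe", "-c", str(count), path], check=True)

set_stripe("/lustre/atlas/scratch/incoming", 8)
```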
Case Studies
• Combustion research challenge for a multi-lab workflow (2007-2009): DTNs / SCP / Kepler workflow; 10 TB, OLCF → NERSC
• 20th Century Climate Reanalysis (2011): HPSS Direct; 40 TB, OLCF → NERSC
• DNS Combustion (2013): DTNs / Globus; 80 TB, OLCF → ALCF
• LCLS (2013-14): DTNs / bbcp / Globus; 130 TB, SLAC → NERSC
Road Maps: Data-Centric Services
• ORNL is establishing a Compute & Data Environment for Science (CADES). OLCF is a key partner.
• CADES will be a hub to share data infrastructure and compute & data science capabilities with and among many projects.
Conclusion
• Success for workflows with large data movement depends critically on a "Science DMZ"-like approach to connectivity.
• HPC centers and their users benefit from collaboration between different centers and facilities, and from collaboration between centers and projects.
• Computing centers traditionally focused on large-scale simulation are expanding their repertoire to include user-facing data services.
• Data services are built around a long-lived resource: the data itself. Both HPC architecture and resource allocation will need to adapt to this longevity.
Data Transfer Working Group
This manuscript has been authored by an author at Lawrence Berkeley National Laboratory under Contract No. DE-AC02-05CH11231 with the U.S. Department of Energy. This work used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
The U.S. Government retains, and the publisher, by accepting the article for publication, acknowledges, that the U.S. Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for U.S. Government purposes.
OLCF: B. Caldwell, J. Hill, C. Layton, S. Parete-Koon, D. Pelfrey, H. Nam, J. Wells
CADES: G. Shipman
NERSC: S. Cannon, D. Hazen, J. Hick, D. Skinner
ESnet: E. Dart, J. Zurawski