High Performance GridFTP Transport of Earth System Grid
(ESG) Data
Center for Enabling Distributed Petascale Science
Description
Transfer 10 TB of climate data into the SC09 show floor from three sites: the Argonne Leadership Computing Facility (ALCF), the National Energy Research Scientific Computing Center (NERSC) and LLNL.
As the data arrives at its destination in the University of Utah’s SC09 booth, it will be stored on disks provided by Data Direct Networks.
The data will be processed using climate data analysis and visualization tools and then publicly displayed along with graphs depicting the characteristics of the transfer.
End-to-End Flow
Scientific Purpose
Climate data is moved in this challenge.
Climate science is a highly collaborative discipline, and its datasets are distributed across the globe.
An interesting feature of climate data is that individual files are not very large compared with those of other sciences.
Climate researchers, however, need to move hundreds or thousands of files in a single transfer, so the total volume of data to be moved across the network is massive.
Multiple TB of data from the Climate Research Program Coupled Model Intercomparison Project, Phase 3 (CMIP3) is moved.
This data was used in the Intergovernmental Panel on Climate Change (IPCC) Fourth Assessment Report (AR4).
This data is used in anticipation of the approaching IPCC Fifth Assessment Report (AR5).
How Computing and Network map into Climate Modeling Efforts
Each climate modeling task maps onto these strategic objectives.
Network Challenges in ESG
Independent gateways federating metadata and users
Individual data nodes responsible for publishing services
Designed for model output data sets
Technical Approach and Methods
Transfers initiated by the climate community can be between a client and a server, or between two remote servers initiated by the user from a third machine.
GridFTP and other data movement tools developed by the Center for Enabling Distributed Petascale Science (CEDPS) are ideal for these types of transfers.
GridFTP is optimized for high-bandwidth, wide-area networks.
The Globus implementation of GridFTP provides a software suite optimized for a broad range of data access applications, including bulk file transfer and data extraction from complex storage systems.
GridFTP Advantages
Performance - Orders of magnitude performance improvements over standard FTP. Uses parallel TCP streams and non-TCP protocols such as UDT, and supports coordinated transfers using multiple computers at the source and destination.
Secure - GridFTP supports the PKI/X.509-based Grid Security Infrastructure (GSI), with simple options to encrypt and integrity-check data. GridFTP also supports SSH security.
Robust - Restart markers allow interrupted transfers to restart with minimal delay overhead.
Extensible - Clear abstractions to interface with various transport protocols and with different storage systems. Completely shields the user from the complexities of underlying storage systems, including tape archives such as HPSS.
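The restart-marker idea can be illustrated with a small sketch. It assumes, purely for illustration, that the receiver records the byte ranges it has safely written as half-open (start, end) pairs; the function name `remaining_ranges` and this representation are hypothetical and are not GridFTP's actual marker format.

```python
def remaining_ranges(file_size, received):
    """Return the byte ranges that still need to be transferred.

    `received` is a list of half-open (start, end) byte ranges that the
    receiver has already written (a hypothetical stand-in for restart markers).
    """
    missing = []
    cursor = 0  # first byte not yet known to be received
    for start, end in sorted(received):
        if start > cursor:
            missing.append((cursor, start))  # gap before this range
        cursor = max(cursor, end)
    if cursor < file_size:
        missing.append((cursor, file_size))  # tail of the file
    return missing
```

For a 100-byte file where bytes 0-29 and 50-69 arrived before the interruption, only the ranges (30, 50) and (70, 100) would need to be resent.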
Key GridFTP Features used in the Challenge
Concurrency and Pipelining
Pipelining allows the client to simultaneously maintain multiple outstanding, unacknowledged transfer commands.
This greatly improves performance for transfers of lots of small files.
[Figure: message sequences for Traditional and Pipelining transfers, each showing File Request 1-3, DATA 1-3 and ACK 1-3.]
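Why pipelining helps with many small files can be seen in a deliberately simplified timing model. The sketch below assumes every request costs one network round trip before data flows, that each file takes the same time to send, and that pipelined requests overlap perfectly; the function names and parameters are illustrative, not measurements from the challenge.

```python
def traditional_time(n_files, rtt, per_file):
    """Each request waits for the previous file's data and acknowledgement."""
    return n_files * (rtt + per_file)


def pipelined_time(n_files, rtt, per_file):
    """Requests are sent back-to-back, so only the first pays the round trip."""
    return rtt + n_files * per_file


if __name__ == "__main__":
    # Hypothetical numbers: 1000 small files, 50 ms round trip, 10 ms per file.
    print(traditional_time(1000, 0.05, 0.01))
    print(pipelined_time(1000, 0.05, 0.01))
```

With these assumed numbers the traditional exchange takes about 60 seconds and the pipelined one about 10 seconds, because the round-trip cost is paid once rather than once per file.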
GridFTP Clients and NetLogger
Three different GridFTP clients are used to move the 10 TB data set for the challenge:
Globus.org - hosted data movement service
BDM - Bulk Data Mover
globus-url-copy
NetLogger is used to monitor transfers and troubleshoot problems. It provides:
Distributed performance analysis and troubleshooting
Standard log format and best practices
Log collection tools
Log parser
Data analysis tools
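To give a flavor of what parsing a standard log format involves, here is a minimal sketch that reads lines of space-separated `key=value` pairs and totals the bytes moved. The field names (`ts`, `event`, `nbytes`) and the event name `transfer.end` are assumptions made for this example; this is not NetLogger's own parser or API.

```python
import shlex


def parse_line(line):
    """Split one 'key=value key=value ...' log line into a dict."""
    fields = {}
    for token in shlex.split(line):  # shlex keeps quoted values together
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens without '='
            fields[key] = value
    return fields


def total_bytes(lines, event="transfer.end"):
    """Sum the assumed 'nbytes' field over all lines with the given event."""
    total = 0
    for fields in map(parse_line, lines):
        if fields.get("event") == event and "nbytes" in fields:
            total += int(fields["nbytes"])
    return total
```

A script like this could feed a plot of cumulative bytes over time, which is the kind of transfer characteristic the challenge displays.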
What is the Globus.org Data Movement Service (a.k.a. DataKoa)?
A new Globus data movement service: the same vision, but an updated implementation. It is hosted, domain-independent and multi-use.
Enables scientists to focus on domain-specific work: manages technology failures and sends notifications of interesting events.
Enables non-experts to move data easily and efficiently: no operations overhead, minimal user-side software installation, user interfaces that require no special expertise, and built-in data transport configuration expertise.
Globus.org Data Movement Service
[Figure: a laptop, Globus.org, GridFTP Server A and GridFTP Server B.]
The client connects to Globus.org and submits requests. It can then disappear from the network.
Globus.org orchestrates the transfer between the GridFTP servers.
What is BDM?
BDM: Bulk Data Mover
A scalable data movement management tool that calls GridFTP file transfers.
Designed for the needs of the climate community (Earth System Grid):
Efficient and reliable transfer management from the user’s point of view
Simple for a novice user to install and maintain
Scalable to large data volumes
Scalable to large numbers of files
Efficient handling of extreme variance in file sizes
Scalable to future performance expectations: network performance improvements (100 Gbps and beyond) and storage performance improvements (distributed, parallel, SSD, etc.)
Multiple transfer protocol support
Able to work with other applications with similar needs
Information: http://sdm.lbl.gov/bdm
Contact: Dean Williams [email protected]
globus-url-copy
A commonly used, scriptable command-line GridFTP client.
Supports various transfer optimizations, including parallel TCP streams and concurrent file transfers.
New features:
Fault tolerance: state is stored in a file, and restarting globus-url-copy transfers only the remaining data.
Multiple physical endpoints can be associated with a single logical endpoint, with load balanced across all the physical endpoints.
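As a rough illustration of scripting the client, the sketch below assembles a globus-url-copy argument list that turns on parallel streams (`-p`), concurrent file transfers (`-cc`) and pipelining (`-pp`). The host names and paths are hypothetical, the default values are arbitrary, and the flags should be checked against the documentation of the installed version; the newer fault-tolerance and multi-endpoint options are not shown.

```python
def build_command(src_url, dst_url, parallel_streams=4, concurrent_files=4,
                  recursive=True):
    """Build (but do not run) a globus-url-copy command line."""
    cmd = ["globus-url-copy", "-vb", "-pp"]  # verbose performance, pipelining
    cmd += ["-p", str(parallel_streams)]     # parallel TCP streams per file
    cmd += ["-cc", str(concurrent_files)]    # files transferred concurrently
    if recursive:
        cmd.append("-r")                     # copy a directory tree
    cmd += [src_url, dst_url]
    return cmd


if __name__ == "__main__":
    # Hypothetical endpoints; both URLs are remote, i.e. a third-party transfer.
    print(" ".join(build_command(
        "gsiftp://source.example.org/data/cmip3/",
        "gsiftp://dest.example.org/scratch/cmip3/",
        parallel_streams=8,
    )))
```

The resulting list can be passed to `subprocess.run` on a machine where the client and valid credentials are available.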
9/15/09 Argonne National Laboratory
NetLogger BWC Deployment
[Figure: deployment diagram. Labels: ALCF, LLNL, NERSC, LBNL, SC09 Show Floor, GridFTP servers, Data, Logs, NetLogger DB, Plots on the web.]
Data Direct Networks Silicon Storage Architecture (S2A)
ESnet Science Data Network
A good network is as important as having the right tools and applications.
We needed a good network that would move these datasets at high speeds to the convention center.
ESnet was the perfect fit to pull data from the national labs.
The Science Data Network (SDN) and the On-Demand Secure Circuit and Advance Reservation System (OSCARS) guarantee that we will have a dedicated circuit on the network for the duration of the challenge, so we don’t have to compete with anyone else for bandwidth.
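A back-of-the-envelope calculation shows why sustained bandwidth matters for a 10 TB transfer. The sketch below assumes decimal units (1 TB = 10^12 bytes) and an ideal, fully utilized link; the rates used are illustrative assumptions, since the slides do not state the capacity of the reserved circuit.

```python
def transfer_hours(terabytes, gigabits_per_second):
    """Ideal time in hours to move `terabytes` at a sustained rate."""
    bits = terabytes * 1e12 * 8                    # decimal terabytes to bits
    seconds = bits / (gigabits_per_second * 1e9)   # rate in bits per second
    return seconds / 3600


if __name__ == "__main__":
    for rate in (1, 10):  # hypothetical sustained rates in Gbps
        print(rate, "Gbps:", round(transfer_hours(10, rate), 1), "hours")
```

Under these assumptions, 10 TB needs roughly 22 hours at 1 Gbps but only a little over 2 hours at 10 Gbps, before any protocol or storage overhead.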
Data Analysis and Visualization
The data were analyzed using the Climate Data Analysis Tools (CDAT) developed by the Program for Climate Model Diagnosis and Intercomparison (PCMDI).
CDAT is a suite of interrelated diagnostic software tools: flexible, portable, adaptable, efficient, easy-to-use, shareable and free. It is capable of operating in a distributed environment.
A 3D interface is provided by the ViSUS plugin developed at the SCI Institute at the University of Utah and LLNL. It offers streaming and progressive data flow, and integrated analysis and illustration tools.
Data Analysis and Visualization
Full Video is available at http://www.sci.utah.edu/~pascucci/tmp/climate_video/
Overarching Research Agenda
The climate community is expecting to generate petabytes of simulated data for analysis and future climate predictions.
In the next few years, climate researchers will be moving terabytes of data to collaborators across the globe for the IPCC Fifth Assessment Report (AR5), which will be published in 2013.
Moving large amounts of data seamlessly, reliably and quickly is required to make sense of the enormous AR5 climate data set.
This will help scientists understand climatic imbalances and the potential impacts of future climate change scenarios.
Overarching Research Agenda
This demonstration highlights the tools and services that will help climate researchers transport their data quickly and reliably.
We hope that the lessons learned in this experiment will help us to do this better.
We aim to improve the transport and monitoring tools further, helping not only climate researchers but also other researchers get their science done faster than before.