Super Computing 2000
DOE SCIENCE ON THE GRID
Storage Resource Management for the Earth Science Grid
Scientific Data Management Research Group, NERSC, LBNL
Alex Sim, Arie Shoshani

What is Earth System Grid?
- Climate modeling: a mission-critical application area
- High-resolution, long-duration climate modeling simulations produce tens of petabytes of data
- Earth System Grid (ESG): a virtual collaborative environment connecting distributed centers, users, models, and data

Earth System Grid
ESG provides scientists with:
- virtual proximity to the distributed data
- resources comprising a collaborative environment
ESG supports:
- rapid transport of climate data between storage centers and users upon a user's request
- integrated middleware and network mechanisms that broker and manage high-speed, secure, and reliable access to data and other resources in a wide-area system
- a persistent testbed that provides virtual proximity and demonstrates reliable, high-performance data transport across a heterogeneous environment
Problems addressed:
- data volume and transmittal
- high-speed data transfer in heterogeneous Grid environments

In this Demo
Will show:
- managing user requests
- accessing files from multiple sites in a secure manner
- selecting the best replica
Participating institutions for file replicas:
- SDSC: all the files for the demo on HPSS
- about 15 disjoint files on disk in each of 5 locations: ISI, ANL, NCAR, LBNL, LLNL
- some files are only on tape
- size of files: MBs
- the entire dataset is stored on HPSS at NERSC (LBNL)
- HRM (via CORBA) is used to request staging of files to HRM's disk
- GSI-ftp (security-enhanced FTP) is used to transfer each file after it is staged

Request Manager Coordination
[diagram]

Request Manager
Request Manager: developed at LBNL
- accepts a request to cache a set of logical file names
- checks replica locations for each file
- gets NWS bandwidth/latency estimates for each replica location
- selects the lowest-cost location
- initiates the transfer using GSI-FTP
- monitors file transfer progress and responds to status commands
Client: PCMDI software (LLNL)
- has its own metadata catalog
- a lookup in the catalog generates the set of files needed to satisfy a user's request
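The selection loop above reads naturally as pseudocode. The following is a minimal Python sketch of that logic only, not the LBNL implementation: lookup_replicas() and nws_estimate() are hypothetical stand-ins for the Globus Replica Catalog and NWS LDAP queries, and cost is taken as file size divided by estimated bandwidth.

    # Minimal sketch of the Request Manager's selection loop; not the LBNL
    # implementation. lookup_replicas() and nws_estimate() are hypothetical
    # stand-ins for the Globus Replica Catalog and NWS LDAP queries.

    def lookup_replicas(logical_name):
        # Hypothetical catalog lookup: logical file name -> replica URLs.
        demo_catalog = {
            "pcmdi/tas_2000.nc": [
                "gsiftp://host.ncar.edu/data/tas_2000.nc",
                "gsiftp://host.llnl.gov/cache/tas_2000.nc",
            ],
        }
        return demo_catalog.get(logical_name, [])

    def nws_estimate(host):
        # Hypothetical NWS query: estimated bandwidth to `host` in MB/s.
        return {"host.ncar.edu": 2.5, "host.llnl.gov": 4.0}.get(host, 0.1)

    def host_of(url):
        # "gsiftp://host/path" -> "host"
        return url.split("//", 1)[1].split("/", 1)[0]

    def select_replica(logical_name, size_mb):
        # Lowest-cost location = smallest (file size / estimated bandwidth).
        replicas = lookup_replicas(logical_name)
        return min(replicas, key=lambda u: size_mb / nws_estimate(host_of(u)))

    def cache_request(logical_names, sizes_mb):
        # For each logical file: pick the best replica, then initiate the
        # transfer (GSI-FTP in the demo; reduced to a print here) and monitor.
        for name in logical_names:
            url = select_replica(name, sizes_mb[name])
            print("transfer", url, "via GSI-FTP")

    cache_request(["pcmdi/tas_2000.nc"], {"pcmdi/tas_2000.nc": 120})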

FTP Services for GRID
Secured FTPs used for the Grid:
- GridFTP (developed at ANL)
  - support for both client and server
  - secured with Grid Security Infrastructure (GSI)
  - parallel streaming capability
- gsi-wuftpd server (developed at WU)
  - wu-ftpd server with Grid Security Infrastructure
- gsi-ncftp client (ncftp.com)
  - NcFTP client with Grid Security Infrastructure
- gsi-pftpd (developed at SDSC)
  - for access to HPSS
  - parallel FTP server with Grid Security Infrastructure

Replica Catalog Service
Globus Replica Catalog:
- developed using LDAP
- has the concept of a logical file collection
- registers logical file names by collection
- uses URL format for the location of each replica; this includes host machine, (port), path, file_name, and may contain other parameters, e.g. file size
- provides hierarchical partitioning of a collection in the catalog (does not have to reflect the physical organization at any site)
- provides a C API
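To make the catalog structure concrete, here is a minimal sketch, assuming a plain Python dict in place of the LDAP-backed Globus Replica Catalog; the collection, file, and host names are invented for illustration.

    # Hypothetical, simplified stand-in for the LDAP-backed Globus Replica
    # Catalog: collections map logical file names to replica location URLs.
    # Collection, file, and host names are invented for illustration.
    from urllib.parse import urlparse

    catalog = {
        "pcmdi/ncar-ccsm": {                 # a logical file collection
            "tas_2000.nc": [                 # a logical file name
                # URL format: host machine, (port), path, file_name
                "gsiftp://host.ncar.edu:2811/data/ccsm/tas_2000.nc",
                "gsiftp://host.llnl.gov/cache/ccsm/tas_2000.nc",
            ],
        },
    }

    def replicas(collection, logical_name):
        return catalog.get(collection, {}).get(logical_name, [])

    for url in replicas("pcmdi/ncar-ccsm", "tas_2000.nc"):
        loc = urlparse(url)
        print(loc.hostname, loc.port, loc.path)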

Network Weather Service
Network Weather Service (NWS):
- developed by U of Tennessee
- requires installation at each participating host
- provides pair-wise bandwidth/latency estimates
- accessible through LDAP query

Hierarchical Resource Manager
HRM: for managing access to tape resources (and staging to local disk)
- an HRM uses a disk cache for staging
- functionality is generic but needs to be specialized for specific mass storage systems, e.g. HRM-HPSS, HRM-Enstore, ...
DRM: for managing disk resources
- under development

HRM Functionality
HRM functionality includes:
- queuing of file transfer requests
- reordering of requests to optimize parallel FTP (ordered by files on the same tape)
- monitoring progress and error messages
- re-scheduling failed transfers
- enforcing local resource policy:
  - number of simultaneous file transfer requests
  - number of total file transfer requests per user
  - priority of users
  - fair treatment of users

Current Implementation of an HRM System
- currently implemented for the HPSS system
- all transfers go through HRM disk; reasons:
  - flexibility of pre-staging
  - disk is sufficiently cheap for a large cache
  - opportunity to optimize repeated requests for the same file
Functionality:
- queuing file transfers
- file queue management
- file clustering parameter
- transfer rate estimation
- query estimation (total time)
- error handling

Queuing File Transfers
- the number of parallel FTPs to HPSS is limited
  - the limit is set by a parameter
  - the parameter can be changed dynamically
- HRM is multi-threaded
  - issues and monitors multiple parallel FTPs
- all requests beyond the PFTP limit are queued
- a File Catalog provides, for each file:
  - HPSS path/file_name
  - disk cache path/file_name
  - file size
  - tape ID

File Queue Management
Goal:
- minimize tape mounts
- still respect the order of requests
- do not postpone unpopular tapes forever
File clustering parameter (FCP), sketched below:
- if the file at the top of the queue is on tape Ti and FCP > 1 (e.g. 4), then up to 4 files from tape Ti are selected to be transferred next
- then, go back to the file at the top of the queue
- the parameter can be set dynamically
[Figure: order of file service, e.g. F1(Ti), F3(Ti), F2(Ti), F4(Ti) served from the same tape Ti]
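Below is a minimal Python sketch of the clustering policy just described, under stated assumptions: requests arrive as (file, tape) pairs, and only the service order is computed; PFTP issuing, dynamic parameter changes, and per-user policy enforcement are omitted.

    from collections import deque

    def service_order(requests, fcp):
        # requests: (file_name, tape_id) pairs in arrival order.
        # fcp: file clustering parameter. With fcp > 1, up to fcp queued
        # files on the same tape as the head of the queue are served before
        # returning to the new head, so tape mounts are reduced while no
        # tape is postponed forever.
        queue = deque(requests)
        order = []
        while queue:
            _, head_tape = queue[0]
            cluster = [r for r in queue if r[1] == head_tape][:max(1, fcp)]
            for r in cluster:
                queue.remove(r)   # serve these files on one tape mount
                order.append(r)
        return order

    # Example: with FCP = 2, both files on tape T1 are read on one mount.
    reqs = [("F1", "T1"), ("F2", "T2"), ("F3", "T1")]
    print(service_order(reqs, fcp=2))
    # -> [('F1', 'T1'), ('F3', 'T1'), ('F2', 'T2')]

In the actual HRM, the resulting order feeds the pool of parallel FTP transfers, capped by the dynamically adjustable PFTP limit described on the previous slide.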
Reading Order from Tape for Different File Clustering Parameters
[Figure: reading order from tape with File Clustering Parameter = 1 vs. File Clustering Parameter = 10]

Typical Processing Flow
[diagram]

Typical Processing Flow with HRM
[diagram]

Conclusion
- the demo ran successfully at SC2000
- received the "Hottest Infrastructure" award
- proved the ability to put together multiple middleware components using common standards, interfaces, and protocols
- proved the usefulness of the Storage Resource Management (SRM) concept for Grid applications
- most difficult problem for the future: robustness in the face of
  - hardware failures
  - network failures
  - system failures
  - client failures