San Diego Supercomputer Center & National Partnership for Advanced Computational Infrastructure
Storage Resource Broker
Reagan W. Moore
San Diego Supercomputer Center
[email protected]
http://www.npaci.edu/DICE/
Data Management Objectives
• Automate all aspects of data management
  – Discovery (without knowing the file name)
  – Access (without knowing its location)
  – Retrieval (using your preferred API)
  – Control (without having a personal account at the remote storage system)
  – Performance (use latency management mechanisms to minimize the impact of wide-area networks)
Collections Replicated via SRB onto TeraGrid
• 2MASS
  – 10 TBs, 5 million images
• DPOSS
  – 3 TBs, 6000 images
• USNO-B
  – In progress
• SDSS
  – In progress
• MACHO
  – In negotiation
SRB Implementations
• Data collecting
  – Sensor systems, object ring buffers and portals
• Data organization
  – Collections, manage data context
• Data sharing
  – Data grids, manage heterogeneity
• Data publication
  – Digital libraries, support discovery
• Data preservation
  – Persistent archives, manage technology evolution
• Data analysis
  – Processing pipelines, manage knowledge extraction
NSF Infrastructure Projects Using SRB
• Partnership for Advanced Computational Infrastructure - PACI
  – Data grid - Storage Resource Broker
• Distributed Terascale Facility - DTF/ETF
  – Compute, storage, network resources
• Digital Library Initiative, Phase II - DLI2
  – Publication, discovery, access
• Information Technology Research projects - ITR
  – SCEC Southern California Earthquake Center
  – GEON GeoSciences Network
  – SEEK Science Environment for Ecological Knowledge
  – GriPhyN Grid Physics Network
  – NVO National Virtual Observatory
• National Science Digital Library - NSDL
  – Support for education curricula modules
Federal Infrastructure Projects Using SRB
• NASA
  – Information Power Grid - IPG
  – Advanced Data Grid - ADG
  – Data Management System - Data Assimilation Office
    • Integration of DODS with Storage Resource Broker data grid
  – Earth Observing Satellite EOS data pools
  – Consortium of Earth Observing Satellites CEOS data grid
• Library of Congress
  – National Digital Information Infrastructure and Preservation Program - NDIIPP
• National Archives and Records Administration and National Historical Publications and Records Commission
  – Prototype persistent archives
• NIH
  – Biomedical Informatics Research Network data grid
• DOE
  – Particle Physics Data Grid - BaBar, CMS
SDSC Collaborations
• Hayden Planetarium Simulation & Visualization
• Knowledge Network for BioComplexity (NSF)
• Mol Science – JCSG, AfCS
• Visual Embryo Project (NLM)
• RoadNet (NSF)
• Earth System Sciences – CEED, Bionome, SIO Explorer
• Hyper LTER
• Grid Portal (NPACI)
• Tera Scale Computing (NSF)
• Long Term Archiving Project (NARA)
• Education – Transana (NPACI)
• NSDL – National Science Digital Library (NSF)
• Digital Libraries – ADL, Stanford, UMichigan, UBerkeley, CDL
• … 31 additional collaborations
Approach
• Use collections to organize digital entities
  – Digital entity - file, URL, SQL, directory, table, …
• Create logical name space
  – Location independent naming convention
  – Map state information created by data access services to the logical name space
  – Manage consistency constraints on the metadata update
• Build an interoperability mechanism
  – Map from storage repository protocols to preferred APIs
Basic Concepts
• Logical name space
  – Map administrative, descriptive, authenticity, consistency metadata onto the logical name
• Storage repository abstraction
  – Standard operations performed at remote storage
• Information repository abstraction
  – Standard operations to manage collection in a database
• Access abstraction
  – Standard operations supported for metadata and data access
• Authentication abstraction
  – Collection-owned data, ACLs for data and metadata
• Latency management mechanisms
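As an illustration of the storage repository abstraction, here is a minimal Python sketch in which two very different repositories expose the same small set of operations. The class names, the `put`/`get` operation set, and the in-memory stand-ins are assumptions made for this example; they are not the SRB driver interface.

```python
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Hypothetical uniform interface to one kind of storage repository."""

    @abstractmethod
    def put(self, physical_path: str, data: bytes) -> None:
        """Store data under a repository-specific path."""

    @abstractmethod
    def get(self, physical_path: str) -> bytes:
        """Return the data stored under a repository-specific path."""


class FileSystemDriver(StorageDriver):
    """In-memory stand-in for a Unix-style file system."""

    def __init__(self) -> None:
        self._files = {}

    def put(self, physical_path, data):
        self._files[physical_path] = data

    def get(self, physical_path):
        return self._files[physical_path]


class ArchiveDriver(StorageDriver):
    """In-memory stand-in for a tape archive that stages data before a read."""

    def __init__(self) -> None:
        self._tape = {}
        self._disk_cache = {}
        self.stage_count = 0

    def put(self, physical_path, data):
        self._tape[physical_path] = data

    def get(self, physical_path):
        if physical_path not in self._disk_cache:
            # Simulate the tape-to-disk staging step of an archive.
            self._disk_cache[physical_path] = self._tape[physical_path]
            self.stage_count += 1
        return self._disk_cache[physical_path]


def copy_object(src, src_path, dst, dst_path):
    """Copy between any two repositories using only the shared operations."""
    dst.put(dst_path, src.get(src_path))
```

Because `copy_object` depends only on the shared operations, the caller does not need to know whether either end is a file system or an archive.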
SDSC Storage Resource Broker & Meta-data Catalog
[Figure: SRB architecture; labels grouped by layer]
• Application
• Access APIs
  – C, C++ libraries
  – Unix shell
  – Java, NT browsers
  – OAI, WSDL
  – GridFTP
  – Linux I/O
  – DLL / Python
• Consistency Management / Authorization-Authentication; Prime Server
• Logical Name Space; Latency Management; Data Transport; Metadata Transport
• Catalog Abstraction
  – Databases: DB2, Oracle, Postgres, SQLServer, Informix
• Storage Abstraction
  – Archives: HPSS, ADSM, UniTree, DMF
  – Databases: DB2, Oracle, Postgres
  – File Systems: Unix, NT, Mac OSX
  – HRM
  – ORB
• Servers
Production Data Grid
• SDSC Storage Resource Broker
  – Federated client-server system, managing
    • Over 70 TBs of data at SDSC
    • Over 10 million files
  – Manages data collections stored in
    • Archives (HPSS, UniTree, ADSM, DMF)
    • Hierarchical Resource Managers
    • Tapes, tape robots
    • File systems (Unix, Linux, Mac OS X, Windows)
    • FTP sites
    • Databases (Oracle, DB2, Postgres, SQLserver, Sybase, Informix)
    • Virtual Object Ring Buffers
Federated SRB server model
[Figure: a read application, two SRB servers (each with an SRB agent), the MCAT catalog, and two replicas R1 and R2, connected by numbered steps 1 through 6]
• Request input: logical name or attribute condition
• MCAT functions
  1. Logical-to-physical mapping
  2. Identification of replicas
  3. Access & audit control
• Other labels: peer-to-peer brokering; server(s) spawning; data access; parallel data access (steps 5/6)
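The read path in the federated server model can be sketched roughly as follows. This is a simplified Python illustration, not SRB code: the class names, the shared `peers` dictionary, and the order of the steps are assumptions, and the agent-spawning and parallel-access parts of the figure are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    server: str          # name of the server that holds this copy
    physical_path: str   # repository-specific location on that server


@dataclass
class Mcat:
    """Hypothetical catalog: mapping, replica list, access and audit control."""
    replicas: dict = field(default_factory=dict)   # logical name -> [Replica]
    readers: dict = field(default_factory=dict)    # logical name -> {user}
    audit_log: list = field(default_factory=list)

    def resolve(self, user, logical_name):
        self.audit_log.append((user, logical_name))          # audit every request
        if user not in self.readers.get(logical_name, set()):
            raise PermissionError(f"{user} may not read {logical_name}")
        return self.replicas[logical_name]                   # logical -> physical


class SrbServer:
    def __init__(self, name, mcat, peers):
        self.name = name
        self.mcat = mcat
        self.peers = peers        # shared registry of all federated servers
        self.storage = {}         # physical path -> bytes held locally
        peers[name] = self

    def read(self, user, logical_name):
        replicas = self.mcat.resolve(user, logical_name)
        for replica in replicas:
            if replica.server == self.name:                  # prefer a local copy
                return self.storage[replica.physical_path]
        replica = replicas[0]                                # otherwise ask a peer
        return self.peers[replica.server].storage[replica.physical_path]
```

The client can contact any server in the federation by logical name; the server that receives the request brokers the read to whichever peer holds a replica.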
Logical Name Space Example - Hayden Planetarium
• Generate “fly-through” of the evolution of the solar system
• Access data distributed across multiple administration domains
• Gigabyte files, total data size was 7 TBytes
• Very tight production schedule - 3 months
Hayden Data Flow
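[Figure: data flow among NCSA, SDSC, and AMNH (NYC)]
• Storage and systems shown: GPFS 7.5 TB; HPSS 7.5 TB; 2.5 TB UniTree; IBM SP2; SGI
• Flows shown: data simulation; visualization; production parameters, movies, images
• Other labels: UVa, NY, CalTech, BIRN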
Logical Name Space
• Global, location-independent identifiers for digital entities
  – Organized as collection hierarchy
  – Attributes mapped to logical name space
• Attributes managed in a database
• Types of system metadata
  – Physical location of file
  – Owner, size, creation time, update time
  – Access controls
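To make the idea of a logical name space backed by a database concrete, here is a small Python sketch using the standard `sqlite3` module. The table layout, column names, and helper functions are assumptions for illustration only (the actual MCAT schema is not described here), and access controls are omitted for brevity.

```python
import sqlite3


def open_catalog():
    """Create an in-memory catalog with system metadata and replica locations."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE objects (
            logical_name TEXT PRIMARY KEY,
            owner        TEXT NOT NULL,
            size         INTEGER NOT NULL,
            created      TEXT NOT NULL,
            updated      TEXT NOT NULL
        );
        CREATE TABLE replicas (
            logical_name  TEXT NOT NULL REFERENCES objects(logical_name),
            resource      TEXT NOT NULL,
            physical_path TEXT NOT NULL
        );
    """)
    return db


def add_replica(db, logical_name, resource, physical_path):
    """Record one more physical location for an existing logical name."""
    db.execute("INSERT INTO replicas VALUES (?, ?, ?)",
               (logical_name, resource, physical_path))


def register(db, logical_name, owner, size, when, resource, physical_path):
    """Register a new digital entity together with its first physical copy."""
    db.execute("INSERT INTO objects VALUES (?, ?, ?, ?, ?)",
               (logical_name, owner, size, when, when))
    add_replica(db, logical_name, resource, physical_path)


def list_collection(db, collection):
    """List logical names under a collection, treating names as a hierarchy."""
    prefix = collection.rstrip("/") + "/"
    rows = db.execute(
        "SELECT logical_name FROM objects "
        "WHERE substr(logical_name, 1, ?) = ? ORDER BY logical_name",
        (len(prefix), prefix))
    return [row[0] for row in rows]


def locate(db, logical_name):
    """Return (resource, physical_path) pairs for a logical name."""
    rows = db.execute(
        "SELECT resource, physical_path FROM replicas "
        "WHERE logical_name = ? ORDER BY resource", (logical_name,))
    return rows.fetchall()
```

Adding or moving a physical copy changes only the `replicas` table, so the logical name that users see stays the same.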
Mappings on Name Space
• Define logical resource name
  – List of physical resources
• Replication
  – Write to a logical resource completes when all physical resources have a copy
• Load balancing
  – Write to a logical resource completes when a copy exists on the next physical resource in the list
• Fault tolerance
  – Write to a logical resource completes when copies exist on “k” of “n” physical resources
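The three completion rules for a write to a logical resource can be compared side by side in a short Python sketch. The class and method names are hypothetical, and physical resources are simulated in memory with an `online` flag standing in for real failures.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalResource:
    name: str
    online: bool = True
    objects: dict = field(default_factory=dict)

    def write(self, key, data):
        """Store a copy; return False if this resource is unavailable."""
        if not self.online:
            return False
        self.objects[key] = data
        return True


@dataclass
class LogicalResource:
    members: list          # ordered list of physical resources
    next_index: int = 0    # round-robin position used for load balancing

    def write_replicated(self, key, data):
        """Replication: complete only when every physical resource has a copy."""
        results = [member.write(key, data) for member in self.members]
        return all(results)

    def write_balanced(self, key, data):
        """Load balancing: complete when the next resource in the list has a copy."""
        target = self.members[self.next_index % len(self.members)]
        self.next_index += 1
        return target.write(key, data)

    def write_fault_tolerant(self, key, data, k):
        """Fault tolerance: complete as soon as k of the n resources have a copy."""
        copies = 0
        for member in self.members:
            if member.write(key, data):
                copies += 1
            if copies >= k:
                return True   # a real system might finish the rest in the background
        return False
```

With one of three resources offline, a replicated write reports failure while a fault-tolerant write with k = 2 still completes.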
Latency Management Example - Digital Sky Project
• 2MASS (2 Micron All Sky Survey):
  – Bruce Berriman, IPAC, Caltech; John Good, IPAC, Caltech; Wen-Piao Lee, IPAC, Caltech
• NVO (National Virtual Observatory):
  – Tom Prince, Caltech; Roy Williams, CACR, Caltech; John Good, IPAC, Caltech
• SDSC – SRB:
  – Arcot Rajasekar, Mike Wan, George Kremenek, Reagan Moore
Digital Sky - 2MASS
• http://www.ipac.caltech.edu/2mass
• The input data was originally written to DLT tapes in the order seen by the telescope
  – 10 TBytes of data, 5 million files
• Ingestion took nearly 1.5 years - almost daily reading of tapes, one at a time
• Images aggregated into 147,000 containers by SRB
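A quick back-of-the-envelope calculation shows what these figures imply on average. The sketch below assumes decimal units (1 TByte = 10^12 bytes) and treats the rounded slide figures as exact, so the results are only approximate.

```python
TB = 10**12  # assumption: decimal terabytes
MB = 10**6   # assumption: decimal megabytes

total_bytes = 10 * TB        # "10 TBytes of data"
image_count = 5_000_000      # "5 million files"
container_count = 147_000    # "147,000 containers"

images_per_container = image_count / container_count
avg_image_mb = total_bytes / image_count / MB
avg_container_mb = total_bytes / container_count / MB

print(f"about {images_per_container:.0f} images per container")
print(f"about {avg_image_mb:.0f} MB per image")
print(f"about {avg_container_mb:.0f} MB per container")
```

On these assumptions each container holds roughly 34 images and about 68 MB, so the archive handles far fewer and much larger objects than the 5 million individual image files.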
Digital Sky Data Ingestion
[Figure: ingestion path from IPAC Caltech to SDSC]
• Input tapes from telescopes
• Systems shown: Informix, SUN, SRB, SUN E10K, HPSS
• Data shown: star catalog; data cache
• Sizes shown: 800 GB; 10 TB
SRB Latency Management
[Figure: latency management mechanisms between a source and a destination across networks]
• Replication
• Server-initiated I/O
• Streaming
• Parallel I/O
• Caching
• Client-initiated I/O
• Remote proxies, staging
• Data aggregation, containers
• Prefetch
Containers
• Images sorted by spatial location
  – Retrieving one container accesses related images
• Minimizes impact on archive name space
  – HPSS stores 680 TBytes in 17 million files
• Minimizes distribution of images across tapes
• Bulk unload by transport of containers
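The container idea can be illustrated with a short Python sketch that sorts images by sky position and packs neighbors into the same container. The grid-cell key, the size limit, and all names are assumptions chosen for the example; the slides do not say how SRB actually ordered or sized its containers.

```python
from dataclasses import dataclass, field


@dataclass
class Image:
    name: str
    ra: float    # right ascension in degrees (assumed coordinate convention)
    dec: float   # declination in degrees
    size: int    # bytes


@dataclass
class Container:
    images: list = field(default_factory=list)
    size: int = 0


def spatial_key(image, cell_deg=1.0):
    """Map a position to a coarse grid cell so nearby images sort together."""
    return (int((image.dec + 90) // cell_deg), int(image.ra // cell_deg))


def pack(images, max_bytes, cell_deg=1.0):
    """Fill containers in spatial order, starting a new one when the limit is hit."""
    containers = []
    current = Container()
    for image in sorted(images, key=lambda im: spatial_key(im, cell_deg)):
        if current.images and current.size + image.size > max_bytes:
            containers.append(current)
            current = Container()
        current.images.append(image)
        current.size += image.size
    if current.images:
        containers.append(current)
    return containers
```

Because each container is stored as a single archive object, fetching it brings spatially related images along in one retrieval, and the archive tracks one name instead of many.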
SRB Development
• Peer-to-peer federation
  – Support multiple independent MCAT catalogs
  – Replicate metadata
• MySQL/BerkeleyDB port
• OGSA/OGSI compliant interface
• GridFTP interfaces
  – Waiting for next release of the software (4th quarter)
MySRB Features
• Data & File Management
• Collection Creation and Management
• Collection of Varied Objects
  – Files, SQL Objects, Databases, URLs, directories, archives, …
• Metadata Handling
• Browsing & Querying Interface
• Access Control
• Version Control (soon)
• Support proxy (remote) operations
MySRB
• Web-based Access to the SRB
• Secure HTTP
• Uses Cookies for Session Control
• Self Registration of Users Supported
  – Currently limited to SDSC users
• Self Registration of Resources (soon)
• Access to Both Data and Metadata
Data Management
• Browse in Hierarchical Collections
• Registration of (remote) Legacy Files & Directories
• Registration of SQL Objects
• Registration of URLs
• Data Movement Operations
  – Ingest & Re-Ingest, Delete, Unlink
  – Replicate, Copy, Move, S-Link
• Access Control Operations
  – Read, Write, Own, Curate, Annotate, …
  – Ticket-based Access
• Version Control Operations (soon)
  – Read Lock, Write Lock, Unlock
  – Check In, Check Out
Types of Metadata
• System-level Metadata
  – Size, resource, owner, date, access control, …
• User-defined Metadata
  – For data & collections
  – <name, value, unit> triples
  – No limit on the number of metadata triples
  – Support for Collection-level schemas
    • Comments, default values, drop-down lists
  – Support for Standardized Schemas
    • (e.g., Dublin Core)
• Annotations
  – Supports textual annotations
  – Annotator, date, context also registered
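One way to picture user-defined triples together with a collection-level schema is the Python sketch below. The `Triple` and `SchemaField` types and the `apply_schema` helper are hypothetical; they only illustrate how comments, default values, and drop-down choices might constrain otherwise free-form metadata.

```python
from dataclasses import dataclass
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """One user-defined metadata item: <name, value, unit>."""
    name: str
    value: str
    unit: str = ""


@dataclass(frozen=True)
class SchemaField:
    """Collection-level rules for one attribute name (all parts optional)."""
    comment: str = ""
    default: Optional[str] = None
    choices: tuple = ()      # allowed values, as in a drop-down list
    unit: str = ""


def apply_schema(triples, schema):
    """Validate drop-down choices and fill in defaults for missing attributes."""
    for triple in triples:
        spec = schema.get(triple.name)
        if spec is not None and spec.choices and triple.value not in spec.choices:
            raise ValueError(
                f"{triple.name}={triple.value!r} is not one of {spec.choices}")
    present = {triple.name for triple in triples}
    filled = list(triples)
    for name, spec in schema.items():
        if name not in present and spec.default is not None:
            filled.append(Triple(name, spec.default, spec.unit))
    return filled
```

Attribute names that the schema does not mention pass through unchanged, which keeps the number and kind of user-defined triples unrestricted.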
Meta Data Management
• Insert, Update and Delete of Metadata
• Access Control for Metadata (soon in MySRB)
• Querying across system-level, user-defined metadata and annotations
  – Query under collections & across collections
• Browsing on user-defined metadata
• Metadata supported for legacy files & directories
• Extract Metadata (using proxy operations)
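Querying under one collection versus across all collections can be sketched as below. The dictionary-based catalog, the `(name, operator, value)` condition format, and the `query` function are assumptions for illustration; they do not reflect the MySRB query interface.

```python
import operator

# Supported comparison operators; "contains" is meant for annotation text.
OPS = {
    "=": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    "contains": operator.contains,
}


def query(catalog, conditions, under=None):
    """Return logical names whose attributes satisfy every condition.

    catalog:    {logical_name: {attribute_name: value}}, mixing system-level
                metadata, user-defined metadata, and annotations
    conditions: list of (attribute_name, operator_symbol, value)
    under:      optional collection path; None searches across all collections
    """
    prefix = None if under is None else under.rstrip("/") + "/"
    hits = []
    for logical_name, attrs in sorted(catalog.items()):
        if prefix is not None and not logical_name.startswith(prefix):
            continue
        if all(name in attrs and OPS[op](attrs[name], value)
               for name, op, value in conditions):
            hits.append(logical_name)
    return hits


catalog = {
    "/home/demo/sky/img1": {"size": 2048, "owner": "alice", "band": "J",
                            "annotation": "cloudy night"},
    "/home/demo/sky/img2": {"size": 4096, "owner": "alice", "band": "K"},
    "/home/demo/other/img3": {"size": 8192, "owner": "bob", "band": "J"},
}

across = query(catalog, [("band", "=", "J")])
within = query(catalog, [("band", "=", "J")], under="/home/demo/sky")
```

Here `across` matches objects in both collections, while `within` is limited to `/home/demo/sky`; a single condition list can mix system-level attributes such as `size` with user-defined ones such as `band`.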