National Partnership for Advanced Computational Infrastructure Data Intensive Computing Information...

  • date post

    20-Dec-2015
Transcript of National Partnership for Advanced Computational Infrastructure Data Intensive Computing Information...

  • Slide 1
  • National Partnership for Advanced Computational Infrastructure
    Data Intensive Computing
    Information Based Computing
    Digital Libraries / Metacomputing Services
    Reagan W. Moore
    San Diego Supercomputer Center
    moore@sdsc.edu
    http://www.npaci.edu/DICE
  • Slide 2
  • [Diagram labels: Distributed Archives; Application; Digital Library; Data Mining; Information Based Computing; Information Discovery; Collection Building]
  • Slide 3
  • Co-evolution of Technology
    • Supercomputer Centers and Digital Libraries
    • Both support large scale processing & storage of data
    • Will the supercomputer centers of the future be digital libraries?
  • Slide 4
  • Researchers
    • Chaitanya Baru
    • Amarnath Gupta
    • Bertram Ludaescher
    • Richard Marciano
    • Yannis Papakonstantinou
    • Arcot Rajasekar
    • Wayne Schroeder
    • Michael Wan
  • Slide 5
  • Outline
    • Two views of computing
      • Execution environment - metacomputing systems
      • Data management environment - digital library
    • Analysis for moving data to the process or the process to the data
    • Data Management Environment
    • Information Based Computing
  • Slide 6
  • [Diagram labels: Digital Libraries; Multimedia / GIS / MVD / XML / LDAP / CORBA / Z39.50; Publication / Services Environment; Presentation Interface; Object Based Information Model; Data Management for publication; Data Resources; Parallel I/O - MPI; Constructors: turning data sets into objects; Data Resources; Data Management for execution; Metacomputing Environment; Execution Environment]
  • Slide 7
  • Choice between Environments
    • Should we provide services for manipulating information?
      • Move the process to the data
    • Should we provide execution environments?
      • Move data to the process
  • Slide 8
  • Data Distribution Comparison
    • Data Handling Platform
    • Supercomputer
    • Execution rate r
  • Complexity Analysis
    • Moving all of the data is faster, T(Super) < T(Archive)
    • Sufficiently complex analysis: O > o (1 - s/S) [1 + r/R + r/(ob)] / (1 - r/R)
    • Note: as the execution ratio r/R approaches 1, the required complexity becomes infinite
    • Also, as the amount of data reduction goes to zero (s approaches S), the required complexity goes to zero
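The complexity criterion above can be evaluated numerically. The following is a minimal Python sketch, not part of the original slides. The function name `min_complexity` is hypothetical, and the symbol meanings are inferred from the surviving formulas because the defining slides are missing from this transcript: S and s are assumed to be the data sizes before and after reduction, r and R the execution rates of the data handling platform and the supercomputer, b the network bandwidth, and o a per-unit data handling cost.

```python
def min_complexity(o, s, S, r, R, b):
    """Return the threshold that O must exceed for T(Super) < T(Archive).

    Implements the criterion quoted on the slide:
        O > o (1 - s/S) [1 + r/R + r/(o b)] / (1 - r/R)
    Symbol meanings are assumptions inferred from the formulas.
    """
    if not (0 < r < R):
        raise ValueError("the criterion assumes 0 < r < R")
    if not (S > 0 and 0 <= s <= S):
        raise ValueError("expected 0 <= s <= S with S > 0")
    return o * (1 - s / S) * (1 + r / R + r / (o * b)) / (1 - r / R)


# Illustrative values only; none of them come from the slides.
print(min_complexity(o=1.0, s=0.0, S=1.0, r=1.0, R=2.0, b=1.0))  # execution ratio 0.5
print(min_complexity(o=1.0, s=0.0, S=1.0, r=1.9, R=2.0, b=1.0))  # ratio near 1
print(min_complexity(o=1.0, s=1.0, S=1.0, r=1.0, R=2.0, b=1.0))  # no data reduction
```

The three calls illustrate the two limits noted on the slide: the threshold grows without bound as r/R approaches 1, and it falls to zero when s equals S.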
  • Slide 14
  • Bandwidth Optimization
    • Moving all of the data is faster, T(Super) < T(Archive)
    • Sufficiently fast network: b > (r/O) (1 - s/S) / [1 - r/R - (o/O) (1 + r/R) (1 - s/S)]
    • Note the denominator changes sign when O < o (1 + r/R) (1 - s/S) / (1 - r/R)
    • Even with an infinitely fast network, it is better to do the processing at the archive if the complexity is too small
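As a hedged illustration of the bandwidth criterion, the sketch below returns the bandwidth that b must exceed, or infinity when the denominator is not positive, which is the case where no network is fast enough. The function name `min_bandwidth` is hypothetical, and the symbol meanings (S and s as data sizes before and after reduction, r and R as execution rates, O and o as per-unit operation counts) are assumptions inferred from the formulas.

```python
import math


def min_bandwidth(O, o, s, S, r, R):
    """Return the bandwidth b must exceed for T(Super) < T(Archive).

    Implements b > (r/O) (1 - s/S) / [1 - r/R - (o/O) (1 + r/R) (1 - s/S)].
    Returns math.inf when the denominator is zero or negative, meaning
    that processing at the archive wins regardless of network speed.
    """
    reduction = 1 - s / S
    denom = 1 - r / R - (o / O) * (1 + r / R) * reduction
    if denom <= 0:
        return math.inf
    return (r / O) * reduction / denom


# Illustrative values only; none of them come from the slides.
print(min_bandwidth(O=10.0, o=1.0, s=0.0, S=1.0, r=1.0, R=2.0))  # finite threshold
print(min_bandwidth(O=1.0, o=1.0, s=0.0, S=1.0, r=1.0, R=2.0))   # complexity too small
```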
  • Slide 15
  • Execution Rate Optimization
    • Moving all of the data is faster, T(Super) < T(Archive)
    • Sufficiently fast supercomputer: R > r [1 + (o/O) (1 - s/S)] / [1 - (o/O) (1 - s/S) (1 + r/(ob))]
    • Note the denominator changes sign when O < o (1 - s/S) [1 + r/(ob)]
    • Even with an infinitely fast supercomputer, it is better to process at the archive if the complexity is too small
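The execution rate criterion can be sketched the same way. This is an illustrative Python fragment rather than anything from the slides; `min_supercomputer_rate` is a hypothetical name, and the symbol meanings (r as the data handling platform's execution rate, b as bandwidth, S and s as data sizes before and after reduction, O and o as per-unit operation counts) are inferred assumptions.

```python
import math


def min_supercomputer_rate(O, o, s, S, r, b):
    """Return the execution rate R must exceed for T(Super) < T(Archive).

    Implements R > r [1 + (o/O)(1 - s/S)] / [1 - (o/O)(1 - s/S)(1 + r/(o b))].
    Returns math.inf when the denominator is zero or negative, meaning
    that no supercomputer is fast enough to justify moving the data.
    """
    reduction = 1 - s / S
    denom = 1 - (o / O) * reduction * (1 + r / (o * b))
    if denom <= 0:
        return math.inf
    return r * (1 + (o / O) * reduction) / denom


# Illustrative values only; none of them come from the slides.
print(min_supercomputer_rate(O=10.0, o=1.0, s=0.0, S=1.0, r=1.0, b=1.0))
print(min_supercomputer_rate(O=1.0, o=1.0, s=0.0, S=1.0, r=1.0, b=1.0))
```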
  • Slide 16
  • Data Reduction Optimization
    • Moving all of the data is faster, T(Super) < T(Archive)
    • Data reduction is small enough: s > S {1 - (O/o) (1 - r/R) / [1 + r/R + r/(ob)]}
    • Note the criterion changes sign when O > o [1 + r/R + r/(ob)] / (1 - r/R)
    • When the complexity is sufficiently large, it is faster to process on the supercomputer even when data can be reduced to one bit
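The four criteria (complexity, bandwidth, execution rate, data reduction) appear to be rearrangements of a single inequality. Assuming r < R, so that 1 - r/R is positive, multiplying the complexity criterion through by (1 - r/R) gives the common form below; each slide then isolates one of O, b, R, or s. This restatement is an algebraic observation, not a formula shown on the slides.

```latex
\[
O\left(1 - \frac{r}{R}\right)
  \;>\;
o\left(1 - \frac{s}{S}\right)\left[\,1 + \frac{r}{R} + \frac{r}{o\,b}\,\right]
\]
```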
  • Slide 17
  • Is the Future Environment a Metacomputer or a Digital Library?
    • Sufficiently high complexity
      • Move data to processing engine
      • Digital Library execution of remote services
      • Traditional supercomputer processing of applications
    • Sufficiently low complexity
      • Move process to the data source
      • Metacomputing execution of remote applications
      • Traditional digital library service
  • Slide 18
  • The IBM Digital Library Architecture
    [Diagram labels: Application (DL client); Metadata in DB2 or Oracle; Videocharger; DB2; ADSM; Oracle; Library Server; Text and Image indices; Federated search; Object Server; Distributed storage resources (SRB); (MCAT)]
  • Slide 19
  • Generalization of Digital Library
    • Scaling transparency
      • Support for arbitrary size data sets
      • Support for arbitrary data type
    • Location transparency
      • Access to remote data
      • Access to heterogeneous (non-uniform) storage systems
      • Remove restriction of local disk space size
    • Name service transparency
      • Support for multiple views (naming conventions) for data
    • Presentation transparency
      • Support for alternate representations of data
  • Slide 20
  • Describing Information Content
  • Slide 21
  • State-of-the-art Information Management: Digital Library
  • Slide 22
  • High Performance Storage
    • Provide access to tertiary storage - scale size of repository
      • Disk caches
      • Tape robots
      • Manage migration of data between disk and tape
    • High Performance Storage System - IBM
      • Provides service classes
      • Support for parallel I/O
      • Support for terabyte sized data sets
      • Provide recoverable name space
  • Slide 23
  • State-of-the-art Storage: HPSS
    • Store Teraflops computer output
      • Growth - 200 TB data per year
      • Data access rate - 7 TB/day = 80 MB/sec
      • 2-week data cache - 10 TB
    • Scalable control platform
      • 8-node SP (32 processors)
    • Support digital libraries
      • Support for millions of data sets
      • Integration with database meta-data catalogs
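The access rate figure on this slide can be checked with a short unit conversion. The sketch below is illustrative only; the helper name `tb_per_day_to_mb_per_sec` is hypothetical, and it assumes decimal units (1 TB = 10^6 MB), which the slide does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tb_per_day_to_mb_per_sec(tb_per_day):
    """Convert TB/day to MB/s, assuming decimal units (1 TB = 10**6 MB)."""
    return tb_per_day * 1_000_000 / SECONDS_PER_DAY


rate = tb_per_day_to_mb_per_sec(7)
print(f"{rate:.1f} MB/s")  # prints "81.0 MB/s"
```

Under that assumption, 7 TB/day works out to about 81 MB/s, consistent with the rounded 80 MB/sec quoted on the slide.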
  • Slide 24
  • HPSS Archival Storage System
    [Diagram labels: High Performance Gateway Node; High Node and Wide Node disk movers with HiPPI drivers; Silver Nodes for Storage / Purge, Bitfile / Migration, Nameservice / PVL, Log Daemon; Silver Nodes for Tape / disk mover, DCE / FTP / HIS, Log Client; SSA RAID arrays (54 GB, 108 GB, 160 GB); 830 GB MaxStrat RAID; 9490 robots with four, eight, and seven tape drives; 3490 tape and Magstar 3590 tape; RS6000 Tape Mover PVR (9490); HiPPI Switch; TrailBlazer3 Switch]
  • Slide 25
  • HPSS Bandwidths
    • SDSC has achieved:
    • Striping required to achieve desired I/O rates
  • Slide 26
  • Turning Archives into Digital Libraries
    • Meta-data based access to data sets
    • Support for application of methods (procedures) to data sets
    • Support for information discovery
    • Support for publication of data sets
    • Research issue - optimization of data distribution between database and archive
  • Slide 27
  • DB2/HPSS Integration
    [Diagram labels: Database Table (columns C1 to C5); DB2; HPSS; DB2 Disk buffer; HPSS Disk cache]
    • Collaboration with IBM TJ Watson Research Center
      • Ming-Ling Lo, Sriram Padmanabhan, Vibby Gottemukkala
    • Features:
      • Prototype, works with DB2 UDB (Version 5)
      • DB2 is able to use an HPSS file as a tablespace container
      • DB2 handles DCE authentication to HPSS
      • Regular as well as long (LOB) data can be stored in HPSS
      • Optional disk buffer between DB2 and HPSS
  • Slide 28
  • Generalizing Digital Libraries
    • SRB - Location transparency
      • Access to heterogeneous systems
      • Access to remote systems
    • MCAT - Name service transparency
      • Extensible Schema support
    • MIX - Presentation transparency
      • Mediation of information with XML
      • Support for semi-structured data
    • Access scaling
      • MPI-I/O access to data sets using parallel I/O
  • Slide 29
  • SRB Software Architecture
    [Diagram labels: Application (SRB client); SRB APIs; Metadata Catalog (MCAT); SRB functions: User Authentication, Dataset Location, Access Control, Type, Replication, Logging; storage systems: UniTree, HPSS, DB2, Illustra, Unix]
  • Slide 30
  • 14 Installed SRB Sites
    [Map labels: Rutgers; NCSA; Montana State University; Large Archives]
  • Slide 31
  • SRB / MCAT Features
    • Support for Collection hierarchy allows grouping of heterogeneous data sets into a s