Data Intensive Computing Information Based Computing Digital Libraries / Metacomputing Services


Transcript of Data Intensive Computing Information Based Computing Digital Libraries / Metacomputing Services

  • Data Intensive Computing

    Information Based Computing

    Digital Libraries / Metacomputing Services
    Reagan W. Moore
    San Diego Supercomputer Center
    moore@sdsc.edu

  • Information Based Computing
    Distributed Archives / Application / Digital Library / Data Mining / Information Discovery / Collection Building

  • Co-evolution of Technology
    Supercomputer Centers and Digital Libraries
    Both support large-scale processing & storage of data

    Will the supercomputer centers of the future be digital libraries?

  • Researchers
    Chaitanya Baru, Amarnath Gupta, Bertram Ludaescher, Richard Marciano, Yannis Papakonstantinou, Arcot Rajasekar, Wayne Schroeder, Michael Wan

  • Outline
    Two views of computing:
    Execution environment - metacomputing systems
    Data management environment - digital library
    Analysis for moving data to the process or the process to the data
    Data Management Environment
    Information Based Computing

  • Digital Libraries
    Multimedia / GIS / MVD / XML / LDAP / CORBA / Z39.50
    Publication / Services Environment
    Presentation Interface
    Execution Environment

  • Choice between Environments
    Should we provide services for manipulating information?
    Move the process to the data.

    Should we provide execution environments?
    Move data to the process.

  • Data Distribution Comparison
    Data Handling Platform vs. Supercomputer
    Execution rate: r vs. r [1 + (o/O)(1 - s/S)] / [1 - (o/O)(1 - s/S)(1 + r/(ob))]

    Note the denominator changes sign when O < o (1 - s/S) [1 + r/(ob)]

    Even with an infinitely fast supercomputer, it is better to process at the archive if the complexity is too small.
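
    The sign-change condition above can be evaluated directly. A minimal sketch, assuming the following variable meanings (inferred from context, not defined on the slide): O is the complexity of the analysis (operations per bit of data), o the data-handling overhead (operations per bit moved), r the execution rate of the data handling platform, b the network bandwidth, and s/S the ratio of reduced to original data size:

```python
def archive_wins(O, o, s, S, r, b):
    """True when the slide's denominator changes sign, i.e. when
    O < o * (1 - s/S) * (1 + r/(o*b)).  Below this complexity
    threshold, processing at the archive beats shipping the data,
    even to an infinitely fast supercomputer."""
    return O < o * (1 - s / S) * (1 + r / (o * b))

# Illustrative numbers (assumptions, not from the slide):
# o = 10 ops/bit, r = 1e8 ops/s, b = 1e7 bits/s, 10x reduction
# (s/S = 0.1) -> threshold = 10 * 0.9 * 2 = 18 ops/bit.
print(archive_wins(O=5,   o=10, s=1, S=10, r=1e8, b=1e7))   # low complexity: True
print(archive_wins(O=100, o=10, s=1, S=10, r=1e8, b=1e7))   # high complexity: False
```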

  • Data Reduction Optimization
    Moving all of the data is faster, T(Super) < T(Archive), when the data reduction is small enough:

    s > S {1 - (O/o)(1 - r/R) / [1 + r/R + r/(ob)]}

    Note the criterion changes sign when O > o [1 + r/R + r/(ob)] / (1 - r/R)

    When the complexity is sufficiently large, it is faster to process on the supercomputer even when the data can be reduced to one bit.

  • Is the Future Environment a Metacomputer or a Digital Library?
    Sufficiently high complexity - move data to the processing engine:
    Digital library execution of remote services
    Traditional supercomputer processing of applications
    Sufficiently low complexity - move the process to the data source:
    Metacomputing execution of remote applications
    Traditional digital library service

  • The IBM Digital Library Architecture
    Application (DL client)
    Library Server - metadata in DB2 or Oracle, text and image indices, federated search (MCAT)
    Object Server - distributed storage resources: Videocharger, DB2, ADSM, Oracle (SRB)

  • Generalization of Digital Library
    Scaling transparency: support for arbitrary-size data sets; support for arbitrary data types
    Location transparency: access to remote data; access to heterogeneous (non-uniform) storage systems; remove the restriction of local disk space size
    Name service transparency: support for multiple views (naming conventions) for data
    Presentation transparency: support for alternate representations of data

  • Describing Information Content

  • State-of-the-art Information Management: Digital Library

  • High Performance Storage
    Provide access to tertiary storage - scale the size of the repository:
    Disk caches
    Tape robots
    Manage migration of data between disk and tape
    High Performance Storage System (HPSS) - IBM:
    Provides service classes
    Support for parallel I/O
    Support for terabyte-sized data sets
    Provides a recoverable name space

  • State-of-the-art Storage: HPSS
    Store Teraflops computer output:
    Growth - 200 TB of data per year
    Data access rate - 7 TB/day = 80 MB/sec
    2-week data cache - 10 TB
    Scalable control platform:
    8-node SP (32 processors)
    Support digital libraries:
    Support for millions of data sets
    Integration with database metadata catalogs

  • HPSS Archival Storage System

  • HPSS Bandwidths
    SDSC has achieved:

    Node-HPGN: 90 MB/s
    Texas Memory Box: 80 MB/s
    Max Strat disk: 60 MB/s
    SSA RAID: 20-30 MB/s

    Striping required to achieve the desired I/O rates

  • Turning Archives into Digital Libraries
    Metadata-based access to data sets
    Support for application of methods (procedures) to data sets
    Support for information discovery
    Support for publication of data sets

    Research issue - optimization of data distribution between the database and the archive

  • DB2/HPSS Integration
    Collaboration with IBM TJ Watson Research Center: Ming-Ling Lo, Sriram Padmanabhan, Vibby Gottemukkala
    Features:
    Prototype; works with DB2 UDB (Version 5)
    DB2 is able to use an HPSS file as a tablespace container
    DB2 handles DCE authentication to HPSS
    Regular as well as long (LOB) data can be stored in HPSS
    Optional disk buffer between DB2 and HPSS
    (Diagram: a database table's columns C1-C5 stored via a DB2 disk buffer and the HPSS disk cache)

  • Generalizing Digital Libraries
    SRB - location transparency: access to heterogeneous systems; access to remote systems
    MCAT - name service transparency: extensible schema support
    MIX - presentation transparency: mediation of information with XML; support for semi-structured data
    Access scaling: MPI-I/O access to data sets using parallel I/O

  • SRB Software Architecture
    Application (SRB client) - SRB APIs
    SRB server: user authentication, dataset location, access control, type, replication, logging
    Metadata Catalog (MCAT)
    Storage systems: UniTree, HPSS, DB2, Illustra, Unix

  • 14 Installed SRB Sites
    Rutgers, NCSA, Montana State University, large archives

  • SRB / MCAT Features
    Support for a collection hierarchy:
    Allows grouping of heterogeneous data sets into a single logical collection
    Hierarchical access control, with a ticket mechanism
    Replication:
    Optional replication at the time of creation
    Can choose a replica on read
    Proxy operations:
    Supports proxy (remote) move and copy operations
    Monitoring capability
    Supports storing/querying of system- and user-defined metadata for data sets and resources
    API for ad hoc querying of metadata
    Ability to extend schemas and define new schemas
    Ability to associate data sets with multiple metadata schemas
    Ability to relate attributes across schemas
    Implemented in Oracle and DB2

  • MCAT Schema Integration
    Publish a schema for each collection:
    Clusters of attributes form a table
    Tables implement the schema
    Use tokens to define semantic meaning:
    Associate a token with each attribute
    Use a DAG to automate queries:
    Specify directed linkage between clusters of attributes
    Tokens - Clusters - Attributes
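
    The "DAG to automate queries" idea can be illustrated with a toy graph: if each node is a cluster of attributes and each directed edge a published linkage between clusters, then the join chain needed to answer a query spanning two clusters is just a path search over the DAG. The cluster names below are hypothetical, not taken from MCAT:

```python
from collections import deque

# Hypothetical cluster-linkage DAG: directed edges between clusters
# of attributes, in the spirit of the slide's description.
links = {
    "dataset": ["resource", "owner"],
    "resource": ["location"],
    "owner": [],
    "location": [],
}

def join_path(start, goal):
    """BFS over the cluster DAG: returns the chain of linkages
    (i.e., the joins) needed to relate two clusters, or None."""
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in links.get(path[-1], []):
            queue.append(path + [nxt])
    return None

print(join_path("dataset", "location"))  # ['dataset', 'resource', 'location']
```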

  • Publishing a New Schema

  • Adding Attributes to the New Schema

  • Displaying Attributes from Selected Schemas

  • Security
    Integration of the SDSC Encryption and Authentication system (SEA) with Globus GSI
    Kerberos within a security domain
    Globus for inter-realm authentication
    Access control lists per data set
    Audit trails of usage
    Need support for third-party authentication:
    User A accesses data under the control of digital library B when the data is stored at site C

  • MIX: Mediation of Information using XML
    BBQ interfaces issue XMAS queries to the Mediator; the Mediator sends XMAS query fragments to Wrappers
    Wrappers convert the XMAS query to the local query language, and data in native format (SQL database, spreadsheet, HTML files, local data repository) to XML
    XML data flows back through the Mediator
    Support for active views (ActiveView 1, ActiveView 2)

  • Integration of Digital Library with Metacomputing Systems
    NTON OC-192 network (LLNL - Caltech - SDSC)
    HPSS archive
    Globus metacomputing system
    SRB data handling system
    MCAT extensible metadata
    MIX semi-structured data mediation using XML
    ICE collaboration environment
    Feature extraction

  • INFORMATION SERVICES
    Data Intensive and High-Performance Distributed Computing, layered:
    Application Toolkits layer
    Domain Specific Services layer
    Generic Services layer: resource management, fault detection, resource discovery, resource brokering, end-to-end QoS, remote data access, interdomain security, scheduling, metadata, network caching, communication libs, grid-enabled libs, visualization
    Resources layer: local resource management, data repositories

  • Research Activities
    Support for remote execution of data manipulation procedures
    Globus - SRB integration
    Automated feature extraction
    XML-based tagging of features
    XML query language for storing attributes into the Intelligent Archive
    Integration with RIO - parallel I/O transport

  • Views of Software Infrastructure
    Software infrastructure supports user applications
    The reason software exists is to provide explicit capabilities required by applications

    What is the user perspective for building new software systems?
    Is the integration of digital library and metacomputing systems the final version?

  • Software Integration Projects
    NSF: Computational Grid - middleware using distributed state information to support metacomputing services
    DOE: Data Visualization Corridor - collaboratively visualize multi-terabyte-sized data sets
    NASA: Information Power Grid - integrate data repositories with applications and visualization systems
    DARPA: Quorum - provide quality-of-service guarantees

  • User Requirements - Five Software Environments
    Code development - resources support
    Run-time - parallel tools and libraries
    Distributed run-time - metacomputing environment
    Interaction environments - collaboration, presentation
    Publication / discovery / retrieval - data intensive computing environment

  • Metacomputing Environment - Data Flow Perspective
    Archival Storage System / Remote Data Manipulation / Data Handling System / Data Staging System / Data Caching System / Distributed Execution Environment / Object Oriented Interface / Application

  • Publication Environment - Data Flow Perspective
    Archival Storage System / Remote Data Manipulation / Data Handling System / Collection Management Software / Digital Library Services / Data Set Constructor / Run-time Access / Application

  • Run-time Environment Data Flow Perspective

  • Interaction Environment - Data Flow Perspective
    Archiv