NeSC Data Projects and Initiatives Dr. Dave Berry Research Manager.

Page 1

NeSC Data Projects and Initiatives

Dr. Dave Berry, Research Manager

Page 2

Contents

• The Data Deluge
• Web Services
• The DAI vision
• The OGSA-DAI Project and GGF
• The OGSA-DAI Software
• Edikt
• Other relevant projects in the UK

Page 3

Acknowledgements

This talk includes material prepared by:
• The OGSA-DAI project
• The e-Diamond project
• The BRIDGES project
• The GGF OGSA Working Group
• and others…

Page 4

The Data Deluge

[Image labels: Mont Blanc (4810 m); Downtown Geneva]

Entering an age of data
• CERN: LHC will generate 1 GB/s = 10 PB/y
• VLBA (NRAO) generates 1 GB/s today
• Pixar generate 100 TB/Movie

Data stored in many different ways
• Relational databases
• XML databases
• Flat files

Need ways to facilitate
• Data discovery
• Data access
• Data integration
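The CERN figure on this slide equates 1 GB/s with 10 PB/y. A sustained 1 GB/s over a full calendar year would come to roughly three times that, so the slide's figure only works out if data is recorded for about 10^7 seconds a year. That running time is an assumption made here for illustration; it is not stated on the slide. A minimal worked calculation, using decimal units:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 seconds in a calendar year


def petabytes_per_year(rate_gb_per_s, live_seconds=SECONDS_PER_YEAR):
    """Convert a data rate in GB/s into PB/year (decimal units: 1 PB = 1e6 GB)."""
    return rate_gb_per_s * live_seconds / 1e6


# Recording continuously, all year round.
continuous = petabytes_per_year(1.0)

# Assumption (not from the slide): about 1e7 seconds of data taking per year.
duty_cycled = petabytes_per_year(1.0, live_seconds=1e7)

print(f"continuous: {continuous:.1f} PB/y, duty-cycled: {duty_cycled:.1f} PB/y")
```

Under that assumption the duty-cycled figure matches the 10 PB/y quoted on the slide.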

Page 5

Astronomical Databases

Number and sizes of data sets as of mid-2002, grouped by wavelength
• 12 waveband coverage of large areas of the sky
• Total about 200 TB data
• Doubling every 12 months
• Largest catalogues near 1 billion objects

Data and images courtesy Alex Szalay, Johns Hopkins
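To make the "doubling every 12 months" claim concrete, the sketch below projects the total forward from the slide's mid-2002 figure of about 200 TB. It assumes the doubling rate simply continues unchanged, which is an extrapolation and not a claim made in the talk.

```python
import math


def projected_tb(start_tb, years, doubling_months=12):
    """Size after `years`, assuming steady doubling every `doubling_months`."""
    return start_tb * 2 ** (years * 12 / doubling_months)


def years_to_reach(start_tb, target_tb, doubling_months=12):
    """Years of steady doubling needed to grow from start_tb to target_tb."""
    return math.log2(target_tb / start_tb) * doubling_months / 12


after_three_years = projected_tb(200, 3)      # 200 TB doubled three times
years_to_petabyte = years_to_reach(200, 1000)  # time to reach 1 PB (1000 TB)

print(f"{after_three_years:.0f} TB after 3 years; 1 PB after {years_to_petabyte:.1f} years")
```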

Page 6

Bioinformatics Databases

• Bibliographic (MedLine, …)
• Amino Acid Seq (SWISS-PROT, …)
• 3D Molecular Structure (PDB, …)
• Nucleotide Seq (GenBank, EMBL, …)
• Biochemical Pathways (KEGG, WIT, …)
• Molecular Classifications (SCOP, CATH, …)
• Motif Libraries (PROSITE, Blocks, …)

[Chart: PDB Content Growth]

Page 7

Web Services

Using the protocols and ideas that have made the web a success for humans…
And applying them to distributed programming
• HTTP
• Single networking port
• Autonomy & failure handling
• Open standards

Tools & Platforms
• Apache Axis
• WebSphere, .NET, Oracle Application Server, Sun ONE, …

Page 8

From Browsing to Programming

              Browsing the web        Programming the web
Readers       People                  Software
Discovery     Google, Altavista, …    UDDI, …
Description   N/A                     WSDL
Operations    Get, post, …            Service-specific
Protocol      HTTP                    SOAP over HTTP
Format        HTML, XHTML             XML + Schema
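The "Programming the web" column can be illustrated by building the kind of message a program would send: a service-specific operation wrapped in a SOAP envelope and expressed as XML. The sketch below uses the standard SOAP 1.1 envelope namespace; the application namespace, the operation name `findObjects` and its parameters are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example:catalogue"  # hypothetical service namespace


def build_request(operation, params):
    """Wrap a service-specific operation and its parameters in a SOAP envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{APP_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical operation and parameters; a real service would publish these in its WSDL.
request_xml = build_request("findObjects", {"waveband": "infrared", "limit": 10})
print(request_xml)
```

In practice the resulting text would be sent as the body of an HTTP POST, which is what "SOAP over HTTP" in the table refers to.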

Page 9

A Perspective on WS Specifications

Page 10

Open Grid Services Architecture

[Diagram labels]
Web Services: Business integration; Secure and universal access; Applications on demand
Grid Protocols: Vast resource scalability; Global accessibility; Resources on demand; Continuous availability
Access resource; Manage resource; Share resource

The architecture of the Global Grid Forum

Page 11

[Diagram: OGSA service categories]
Service categories: Context Services; Information Services; Infrastructure Services; Security Services; Resource Mgmt Services; Execution Mgmt Services; Data Services; Self Mgmt Services
Other labels: Policy Mgmt; VO Mgmt; Access; Integration; Provisioning; Cataloging; Boundary Traversal; Integrity; Authorization; Authentication; WSRF, WSN, WSDM; Event Mgmt; Trouble-shooting; Discovery; Job Mgmt; Logging; Execution Planning; Workflow Mgmt; Workload Mgmt; Application Mgmt; Deployment, Configuration, Reservation; Naming; Heterogeneity Mgmt; Service Level Attainment; QoS Mgmt; Optimization

GGF11: OGSA specification, informational document

Page 12

Data Access and Integration

Web Services for querying and integrating structured data resources

The foundation framework for:
• Building tailored DAI applications
• Higher-level services:
  • Replication: Data located in multiple locations
  • Federation: Composition of multiple sources
  • Provenance: How was data generated?

Page 13

The OGSA-DAI Project

Powered by …

Funded by the Grid Core Programme

OGSA-DAI
• £3 million, 18 months, from Feb 2002
• Three major releases, three interim releases

DAIT (DAI-Two)
• Keep the OGSA-DAI brand name
• £1.5 million, 24 months, from Oct 2003
• Four major releases

Page 14

DAI in GGF and OGSA

Data Access and Integration Services WG
• Strong involvement from OGSA-DAI members
• Standardise the interfaces – WS-DAI
• OGSA-DAI a reference implementation
• Experience informing specification work

OGSA WG Data Design Team
• Designing the data-oriented aspects of OGSA
• Created after GGF10 (March 2004)
• Led by NeSC

Page 15

OGSA Design Teams

[Diagram: the same OGSA service category labels as on Page 11, annotated with OGSA-WG design teams]

OGSA-WG design teams:
• Information Service design team
• Data Service design team
• EMS design team
• Resource Mgmt design team
• Security Service design team
• Self Mgmt design team
• Core (roadmap) design team
• Naming design team

Page 16

Data Services design team

• Informal domain expert groups within OGSA
• May include co-chairs of other WG/RGs
• Output is included in OGSA specification

[Diagram labels: OGSA-WG; OGSA Data Service Design team; DAIS-WG; GSM-WG; GFS-WG; Info-D WG; ADF, OREP, …; Telecons, F2F meetings]

Page 17

OGSA v2 Document Deliverables

Root Documents:                  Use case doc; Architecture v2; Glossary
Design team Documents:           Service descriptions; Scenarios
Working Group Specifications:    GGF Recommendation documents

Page 18

How OGSA-DAI works

[Diagram components: Client; Registry; Factory; Grid Data Service (GDS); XML / Relational database. Arrow legend: SOAP/HTTP; service creation; API interactions]

1a. Request to Registry for sources of data about “x”
1b. Registry responds with Factory handle
2a. Request to Factory for access to database
2b. Factory creates GridDataService to manage access
2c. Factory returns handle of GDS to client
3a. Client queries GDS with XPath, SQL, etc
3b. GDS interacts with database
3c. Results of query returned to client as XML
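The registry, factory and Grid Data Service steps above can be sketched as a small in-process simulation. This is not the OGSA-DAI API: the class and method names (`Registry.find`, `Factory.create_service`, `GridDataService.perform`) and the result XML layout are hypothetical, an in-memory SQLite database stands in for the real data resource, and ordinary method calls stand in for the SOAP/HTTP messages.

```python
import sqlite3
import xml.etree.ElementTree as ET


class GridDataService:
    """Stand-in for a GDS: wraps one database and answers queries as XML."""

    def __init__(self, connection):
        self._conn = connection

    def perform(self, sql):
        cursor = self._conn.execute(sql)  # 3b. GDS interacts with the database
        columns = [description[0] for description in cursor.description]
        results = ET.Element("results")
        for row in cursor:
            row_element = ET.SubElement(results, "row")
            for name, value in zip(columns, row):
                ET.SubElement(row_element, name).text = str(value)
        return ET.tostring(results, encoding="unicode")  # 3c. results as XML


class Factory:
    """Creates a GDS on request (2b) and returns it to the caller (2c)."""

    def __init__(self, connection):
        self._conn = connection

    def create_service(self):
        return GridDataService(self._conn)


class Registry:
    """Maps a topic to the Factory that can provide access to data about it."""

    def __init__(self):
        self._factories = {}

    def register(self, topic, factory):
        self._factories[topic] = factory

    def find(self, topic):
        return self._factories[topic]  # 1b. respond with the Factory handle


# Set-up: one data resource, published in the registry under the topic "stars".
database = sqlite3.connect(":memory:")
database.execute("CREATE TABLE stars (name TEXT, magnitude REAL)")
database.execute("INSERT INTO stars VALUES ('Sirius', -1.46)")
registry = Registry()
registry.register("stars", Factory(database))

# Client side, following the numbered steps.
factory = registry.find("stars")                                # 1a, 1b
gds = factory.create_service()                                  # 2a, 2b, 2c
xml_result = gds.perform("SELECT name, magnitude FROM stars")   # 3a, 3b, 3c
print(xml_result)
```

The point of the indirection is that the client only ever holds handles: it never needs a database driver or connection string of its own.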

Page 19

OGSA-DAI compared to JDBC

• Language independence at the client end
• Platform independence
  • Do not have to worry about connection technology, drivers, etc
• Can handle XML resources
• Can embed additional functionality at the service end
  • Transformations
  • Third party delivery
  • Avoiding unnecessary data movement
• Provision of Metadata is powerful
• Usefulness of the Registry for service discovery
  • Dynamic service binding process

Page 20

Future DAI Services

[Diagram components: Client; Analyst; Application Code; Data Registry; Data Access & Integration master; GDS1, GDS2, GDS3; GDTS1, GDTS2; data sources Sx and Sy (XML database, Relational database). Arrow legend: SOAP/HTTP; service creation; API interactions]

1a. Request to Registry for sources of data about “x” & “y”
1b. Registry responds with Factory handle
2a. Request to Factory for access and integration from resources Sx and Sy
2b. Factory creates GridDataServices network
2c. Factory returns handle of GDS to client
3a. Client submits a sequence of scripts, each containing a set of queries to GDS in XPath, SQL, etc
3b. Client tells analyst
3c. Sequences of result sets returned to analyst as formatted binary described in a standard XML notation
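The integration of two resources Sx and Sy described in the steps above can be illustrated with a deliberately small sketch: each source is queried where it lives and the results are joined on a shared key. Two in-memory SQLite databases stand in for Sx and Sy; the table names, columns and values are made up for illustration, and the `integrate` function is a hypothetical stand-in for whatever a network of data services would do.

```python
import sqlite3


def make_source(table, schema, rows):
    """Create an independent in-memory database holding a single table."""
    connection = sqlite3.connect(":memory:")
    connection.execute(f"CREATE TABLE {table} ({schema})")
    connection.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    return connection


# Illustrative data only: Sx holds gene symbols, Sy holds measurements.
source_x = make_source("genes", "gene_id TEXT, symbol TEXT",
                       [("g1", "BRCA1"), ("g2", "TP53")])
source_y = make_source("levels", "gene_id TEXT, expression REAL",
                       [("g1", 0.82), ("g2", 0.15)])


def integrate(sx, sy):
    """Query each source separately, then join the two result sets on gene_id."""
    symbols = dict(sx.execute("SELECT gene_id, symbol FROM genes"))
    levels = sy.execute("SELECT gene_id, expression FROM levels ORDER BY gene_id")
    return [(symbols[gene_id], level) for gene_id, level in levels if gene_id in symbols]


combined = integrate(source_x, source_y)
print(combined)
```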

Page 21

Activities are the drivers

• Express a task to be performed by a GDS
• Three broad classes of activities:
  • Statement
  • Transformations
  • Delivery
• Extensible:
  • Easy to add new functionality
  • Does not require modification to the service interface
  • Extensions operate within the OGSA-DAI framework
• Functionality:
  • Implemented at the service
  • Work where the data is (no need to move data back)
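The three classes of activity can be sketched as a chain in which each activity consumes the previous one's output. The class names and the `run` and `perform` signatures below are hypothetical and are not the OGSA-DAI interfaces; an in-memory SQLite database stands in for the data resource and a plain list stands in for a delivery destination.

```python
import csv
import io
import sqlite3


class SqlStatement:
    """Statement activity: runs a query where the data lives."""

    def __init__(self, sql):
        self.sql = sql

    def run(self, conn, _incoming):
        return conn.execute(self.sql).fetchall()


class ToCsv:
    """Transformation activity: turns rows into CSV text."""

    def run(self, _conn, rows):
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(rows)
        return buffer.getvalue()


class DeliverTo:
    """Delivery activity: hands the result to a destination (here, a list)."""

    def __init__(self, destination):
        self.destination = destination

    def run(self, _conn, data):
        self.destination.append(data)
        return data


def perform(conn, activities):
    """Run the activities in order, feeding each output into the next."""
    data = None
    for activity in activities:
        data = activity.run(conn, data)
    return data


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER, label TEXT)")
conn.executemany("INSERT INTO samples VALUES (?, ?)", [(1, "a"), (2, "b")])

outbox = []
perform(conn, [
    SqlStatement("SELECT id, label FROM samples ORDER BY id"),
    ToCsv(),
    DeliverTo(outbox),
])
print(outbox)
```

Adding a new kind of activity here means adding a class with a `run` method; `perform` itself does not change, which mirrors the extensibility point on the slide.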

Page 22

OGSA-DAI Deck

Page 23

Building Applications

• Activities are grouped together
  • Perform document
• Data can flow between activities
  • Optimisation
  • Avoids multiple message exchanges
• Can deliver to other GDSs
  • Prerequisite for data integration
• Base middleware for projects requiring data access
• Some capability for data integration

Page 24

Release 4, April 2004

• Provides Data Access components, an extensible framework for building applications and some integration components
• Built on top of Globus Toolkit 3.2
• Supports relational, XML and some files
  • MySQL, Oracle, DB2, SQL Server, Postgres, Xindice, CSV
• Supports various delivery options
  • SOAP, FTP, GridFTP, HTTP, files, email, inter-service
• Supports various transforms
  • XSLT, ZIP, GZip
• Supports message level security using X509 certificates
• Client Toolkit library for application developers
• GUI data browser (contributed by FirstDIG project)
• Separate Distributed Query Processing components
• Comprehensive documentation and tutorials in XHTML format

Page 25

Downloads by Release

[Chart: downloads (axis 0 to 3000) against date (15/01/2003 to 15/07/2004, in two-month steps), with releases R1, R2, R3 and R4 marked]

2746 downloads (~4.7 downloads a day)

Page 26

Downloads by country

792 registered users @ 23/8/04

Page 27

Release 5, October 2004

• Re-engineered interface-independent core OGSA-DAI functionality
• Improved dependability and security integration
• New file data resources representing flat files queried using full text searches (e.g. EMBL format)
• Installation and Configuration Wizard, including “all-in-one installer”
• Improved Data Browser which allows XPath querying
• Set of standard benchmarks
• JSP Quick View interface
• Support for other databases (e.g. Access, eXist, HSQL)

Page 28

Release 6, April 2006

• Data Integration applications supporting identified scenarios
• OGSA-DQP as an integrated part of release
• Fully compliant JDBC Driver for OGSA-DAI
• Support for WS-Security implementations
• Support for stored procedures on all supported databases
• Improved support for different database specific SQL types
• SQL translation between vendor dialects for subset of queries
• Support for XQuery data resources
• We expect to comply with a version of the emerging DAIS specification at this release.

Page 29

Who is Using OGSA-DAI?

OGSA-DAI (http://www.ogsadai.org.uk)

• AstroGrid (http://www.astrogrid.org/)
• BioSimGrid (http://www.biosimgrid.org/)
• BioGrid (http://www.biogrid.jp/)
• Bridges (http://www.brc.dcs.gla.ac.uk/projects/bridges/)
• eDiaMoND (http://www.ediamond.ox.ac.uk/)
• FirstDig (http://www.epcc.ed.ac.uk/~firstdig/)
• GeneGrid (http://www.qub.ac.uk/escience/projects.php#genegrid)
• GEON (http://www.geongrid.org/)
• IU RGRBench (http://www.cs.indiana.edu/~plale/projects/RGR/OGSA-DAI.html)
• myGrid (http://www.mygrid.org.uk/)
• N2Grid (http://www.cs.univie.ac.at/institute/index.html?project-80=80)
• ODD-Genes (http://www.epcc.ed.ac.uk/oddgenes/)
• OGSA-WebDB (http://www.gtrc.aist.go.jp/dbgrid/)
• MCS (http://www.isi.edu/~deelman/MCS/)
• INWA (http://www.epcc.ed.ac.uk/projects/inwa/)
• GridMiner (http://www.gridminer.org/)

Page 30

Project classification

[Diagram: projects using OGSA-DAI, grouped under Biological Sciences, Physical Sciences, Commercial Applications and Computer Sciences]
Projects shown: FirstDig; INWA; Bridges; AstroGrid; BioSimGrid; BioGrid; eDiamond; myGrid; ODD-Genes; N2Grid; GEON; MCS; IU RGRBench; OGSA-WebDB; GeneGrid; GridMiner

Page 31

Edikt

• The team: 8 professional software engineers, support staff, project manager, commercialisation manager, architect, and SAB
• SHEFC funded research and development grant
  • 3 years funding: May 2002 – 2005
  • +3 years funding upon successful project and review

[Diagram labels: Standards; CS Research; Commercial SW components and skills; Edikt project (Requirements analysis, Technology matchmaking, Gap filling, Rigorous engineering); Grid Services for e-Science Data Management; E-Science Apps]

Page 32

ELDAS – Data Access Service

• Implemented using Enterprise Java Beans
• Data Access Components interface to distinct DBMSs
• Accessible as a grid data service or a web data service
• Another (partial) implementation of the GGF WS-DAI specifications

[Diagram labels: Java Framework; ELDAS; EJB - DAS; DAC (×4); DB2 DB; MySQL DB; Xindice DB; Oracle 9i DB; Web Servlet; Grid Proxy; Web User1; Grid User1; Grid User2]

Page 33

BinX – accessing legacy binary data

The Problem:
• Many binary data files
• Applications must “know” the data format
• Binary data formats are machine-specific

The Solution:
• Write a “stand-aside” format description in XML
• Provide a library to
  • Interpret the description
  • Provide file access across different machines
• Build higher-level services

[Diagram labels: e-Science Application; simulations; BinX Library; Binary Data Files; BinX file describes binary file structure]
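The "stand-aside" idea can be sketched in a few lines: an XML document describes the layout of a binary record, and a small reader interprets that description instead of hard-coding the format. The XML vocabulary below (`layout`, `field`, `byteOrder`, the type names) is invented for illustration and is not the actual BinX schema; the reader relies on Python's standard `struct` module.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical description format, kept separate from the binary data itself.
DESCRIPTION = """<layout byteOrder="bigEndian">
  <field name="step" type="int32"/>
  <field name="temperature" type="float64"/>
</layout>"""

STRUCT_CODES = {"int32": "i", "float64": "d"}          # struct format characters
BYTE_ORDERS = {"bigEndian": ">", "littleEndian": "<"}  # explicit byte order, no padding


def read_record(description_xml, raw_bytes):
    """Decode one binary record according to the XML layout description."""
    layout = ET.fromstring(description_xml)
    fields = layout.findall("field")
    fmt = BYTE_ORDERS[layout.get("byteOrder")] + "".join(
        STRUCT_CODES[field.get("type")] for field in fields
    )
    values = struct.unpack(fmt, raw_bytes)
    return {field.get("name"): value for field, value in zip(fields, values)}


# Simulate a legacy file written on a big-endian machine.
raw = struct.pack(">id", 42, 273.15)
record = read_record(DESCRIPTION, raw)
print(record)
```

Because the byte order is stated in the description and not assumed from the host machine, the same reader gives the same values on any platform, which is the machine-independence the slide asks for.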

Page 34

Mammography

Mammograms have different appearances, depending on image settings and acquisition systems

[Diagram labels: Standard Mammo Format (×2); Temporal mammography; Computer Aided Detection; 3D View]

A prototype of a national database of mammographic images in support of the UK breast screening programme

Page 35

[Diagram labels: sites UCL, KCL, UED, CHU; DB2 Content Manager (×4); DB2 Federation; OGSA-DAI; Core Services (×4); DataLoad; TrainingApp; Training Services; Core API; Training API; Training Application; Core & Training API; Database; Files]

Page 36

The BRIDGES Project

Biomedical Research Informatics Delivered by Grid Enabled Services

NeSC (Edinburgh and Glasgow) and IBM
www.brc.dcs.gla.ac.uk/projects/bridges

• Supporting project for CFG project
  • Generating data on hypertension
  • Rat, Mouse, Human genome databases
• Variety of tools used
  • BLAST, BLAT, Gene Prediction, visualisation, …
• Variety of data sources and formats
  • Microarray data, genome DBs, project partner research data, medical records, …
• Aim is integrated infrastructure supporting
  • Data federation
  • Security

Page 37

BRIDGES

[Diagram: CFG Virtual Organisation. Labels: sites Glasgow, Edinburgh, Leicester, Oxford, London, Netherlands; Private data (×6); Publicly Curated Data: Ensembl, MGI, HUGO, OMIM, SWISS-PROT, RGD, …; DATA HUB; SyntenyGrid Service; blast; VO Authorisation; Information Integrator; OGSA-DAI]

Page 38

INWA Project

Innovation Node Western Australia
Informing Business & Regional Policy: Grid-enabled fusion of global data and local knowledge

• Involved 10 partners (6 UK + 4 Australia)
• Aim
  • Data mine commercially sensitive data
  • Security an absolute MUST
• Employ Grid technologies
  • Need access to data and computational resources
• OGSA-DAI
  • Access data resources
• SunDCG's TOG (Transfer-queue Over Globus)
  • Handle job submission to analyse micro array data

Page 39

INWA

[Diagram labels: sites Curtin, Australia and EPCC, UK; user@australia; user@edinburgh; Data Browser (×2); TOG (×2); Grid Engine (×2); OGSA-DAI (×4); Bank and Telco (×2); data sets: Telco data, Bank data, Australian property, UK Property]

Page 40

Further Information on OGSA-DAI

• The OGSA-DAI Project Site: http://www.ogsadai.org.uk
• The DAIS-WG site: http://cs.man.ac.uk/grid-db
• OGSA-DAI Users Mailing List: [email protected]
  • Discussion on grid DAI matters
• Formal support for OGSA-DAI releases: http://www.ogsadai.org.uk/[email protected]
• OGSA-DAI training courses