myGrid: Upper level Grid Services for the Bioinformatician

myGrid: Upper level Grid Services for the Bioinformatician. Prof. Carole Goble, http://www.mygrid.org.uk. Sun Microsystems BioGrid Symposium, Baltimore, USA, 4th-5th December 2002


myGrid: Upper level Grid Services for the Bioinformatician

Prof. Carole Goble
http://www.mygrid.org.uk

Sun Microsystems BioGrid Symposium, Baltimore, USA, 4th-5th December 2002

UK eScience Programme

• Grid-enabled eScience
• Emphasis on information integration and knowledge management
• The Virtual Organisation view
• $180 million + industrial contributions
• Complete infrastructure of regional eScience centres, support and a UK computational Grid
• Started on Globus, though Unicore used in EuroGrid with great success
• Centres donated equipment: highly heterogeneous
• Core component of the EU Grid FP6 programme

[Map labels: Cambridge; Newcastle; Edinburgh; Oxford; Glasgow; Manchester; Cardiff; Southampton; London; Belfast; DL; RAL; Hinxton]

myGrid

• EPSRC UK eScience pilot project
• 01/01/02 - end 30/03/05
• Uses the UK Grid infrastructure

IBM, Lion BioSciences, Millennium Pharmaceuticals & Oracle

myGrid

• Not a computational grid project
• Building Grid middleware
• Higher level services: workflow, databases, knowledge management, provenance…
• Service-based: Open Grid Service Architecture early adopter
• Bioinformatics services are published as Web services and Grid Services
• Working with publicly available biological resources: e.g. EMBL-EBI

What is the Grid?

• Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations
• On-demand, ubiquitous access to computing, data, and all kinds of services
• New capabilities constructed dynamically and transparently from distributed services
• No central location, no central control, no existing trust relationships, little predetermination
• Uniformity, Pooling & Virtualisation

What is the Grid?

• In silico experiments
– Information harvesting & PSE
– Dynamically forming virtual organisations to solve problems.
– Describing, searching for and weaving resources: people, applications, db, content, instruments
– Orchestrating resources
– Support for scientific method: provenance, argumentation, opinion contextualisation etc
• BioUtility & communities of practice

[Diagram layers: Knowledge Grid; Information Grid; Data/Computation Grid; “E-Scientists” Environment]

Information Weaving

• Large amounts of different kinds of data & many applications.
• Highly heterogeneous.
– Different types, algorithms, forms, implementations, communities, service providers
• High autonomy.
• Highly complex and interrelated, & volatile.
• Much of it textual narrative

Circadian Rhythms

1. Has anyone else studied the effect of neurotransmitters on the circadian rhythms in Drosophila?

2. I’ve got a cluster of proteins from my experiment. How do their functions interrelate? And what are the proteins with a particular function?

3. Is a structure known for my protein? What other proteins have a similar structure?

4. Can I build a homology 3D model?

5. What is known about a homologous protein?

e-Science Q & A

Who else has asked this question & can I use/adapt their approach?
– Workflow.

What were the results at each stage?
– Dynamic Data Repositories.

When was P12345 last updated? Which BLAST did I use?
– Provenance.

Has PDB changed since I last ran this?
– Notification.

Personalisation.

Courtesy of Mark Wilkinson (BioMOBY)

myGrid

• Service based architecture
– Publication, discovery, interoperation, composition, decommissioning of myGrid services
• Resource Interoperation
– Workflow coordination & Database integration.
– Experimental workflows rather than production workflows.
• Experimentation
– Provenance & Change Propagation
– Personalisation & Collaborative working.
• Security & ownership
• Knowledge based using metadata and ontologies

[Architecture diagram; labels as recovered from the slide]

Web Portal; Carp gene expression analysis; TALISMAN annotation workbench; Workbench; RASMOL

BioMedical Services Library: DAS, workflow sets, integrated databases

Upper level knowledge-based Grid Common Services: semantic integration, knowledge based querying, workflow composition, visualisation, provenance management, semantic service discovery

Middle level Grid Common Services: database access, distributed query processing, service discovery, workflow enactment, event notification

Low level Grid Common Services (OGSI): co-scheduling, data shipping, authentication, job execution, resource monitoring, database access

Vertical labels: Provenance; Personalisation; Security

Other labels: Metadata; Knowledge (ontologies)

Who is myGrid for?

[Diagram labels: myGrid users; biologists; IS specialists; infrequent; problem specific; bioinformaticians; tool builders; service provider; systems administrators; bioinformatics tool builders]

myGrid Outcomes

• e-Scientists
– Environment built on toolkits for service access, personalisation & community.
– Talisman: Interpro family of pattern databases annotation
– UTOPIA: visual multiple sequence alignment
– Workbench for gene expression in Carp & Graves disease
• Developers
– Protocols and service descriptions.
– myGrid-in-a-Box developers kit of core services.
– Reference implementation services & applications.
– Bio services.

Service based architecture

• Each bio resource is a service
– Database, archive, analysis, tool, person, instrument, a workflow …
• Each myGrid architectural component is a service
– Workflow enactment engine, event notification, registry, scheduler…
• OGSA early adopter.

[Diagram labels: Web services; Grid protocols; Open Grid Service Architecture]

Metadata+ontology

• Service registration, discovery, publication, composition, management.
• Data types & ontologies
• Service matchmaking
• Ontology editor, deployment server & reasoner
• Typing inputs and outputs of workflows
• Semantic Database integration
• Portal driving ….

[Diagram labels: Web services; Grid protocols; OGSA; Semantic Web; W3C: RDF, DAML+OIL, OWL]
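The idea of typing workflow inputs and outputs against an ontology can be illustrated with a small sketch. It is not the myGrid implementation: the concept names and the `IS_A` table are invented for illustration, and a real deployment would use a description-logic reasoner over DAML+OIL or OWL rather than a dictionary walk.

```python
# Hypothetical is-a hierarchy; the concept names are illustrative only.
IS_A = {
    "protein_sequence": "sequence",
    "dna_sequence": "sequence",
    "sequence": "bio_data",
    "protein_function": "annotation",
    "annotation": "bio_data",
}


def subsumes(general, specific):
    """Return True if `general` equals `specific` or is one of its ancestors."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = IS_A.get(current)  # None once the top of the hierarchy is passed
    return False


def can_link(producer_output, consumer_input):
    """An output may feed an input whose declared type subsumes it."""
    return subsumes(consumer_input, producer_output)


print(can_link("protein_sequence", "sequence"))   # True
print(can_link("sequence", "protein_sequence"))   # False
```

A service that accepts any `sequence` can be linked after one that emits a `protein_sequence`, but not the other way round, which is the kind of restriction the ontology places on composition.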

1. User selects values from a drop down list to create a property based description of their required service. Values are constrained to provide only sensible alternatives.

2. Once the user has entered a partial description they submit it for matching. The results are displayed below.

3. The user adds the operation to the growing workflow.

4. The workflow specification is complete and ready to match against those in the workflow repository.
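Steps 1 and 2 above can be sketched as matching a partial, property-based description against registered service descriptions. This is only a sketch under assumptions: the service names, property names and values are hypothetical, and plain equality stands in for the ontology-based matching the slides describe.

```python
# Hypothetical registry entries; names, properties and values are illustrative.
SERVICES = [
    {"name": "protein_blast", "task": "similarity_search",
     "input": "protein_sequence", "output": "alignment"},
    {"name": "dna_blast", "task": "similarity_search",
     "input": "dna_sequence", "output": "alignment"},
    {"name": "function_lookup", "task": "function_annotation",
     "input": "protein_sequence", "output": "protein_function"},
]


def _agrees(service, partial):
    """True if the service has every property value given in the partial description."""
    return all(service.get(key) == value for key, value in partial.items())


def match(partial):
    """Return the names of services that agree with the partial description."""
    return [s["name"] for s in SERVICES if _agrees(s, partial)]


def sensible_values(partial, prop):
    """Values of `prop` still possible given the description so far (the drop-down list)."""
    return sorted({s[prop] for s in SERVICES if _agrees(s, partial)})


description = {"input": "protein_sequence"}
print(sensible_values(description, "task"))  # ['function_annotation', 'similarity_search']
description["task"] = "function_annotation"
print(match(description))                    # ['function_lookup']
```

Constraining each drop-down to the values that still have at least one matching service is one simple way to offer "only sensible alternatives".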

Integration & Coordination

• View-based Information Repository for XML data
• Database integration
– Access XML and RDBMS with OGSA-DAI
– Semantic database integration.
– Distributed query processing.
• Workflow
– Dynamic workflow enactment engine.
– Workflow repository
– User interactivity.
– Workflows linked with results

E-Science Support

• Data provenance and resource change management
– Workflow logs.
– Event notification service.
– Incremental view management.
– Workflow and query evolution.
• Personalisation
– Management of views over repositories.
– Personalisation of process flows.
– Annotation of data sets and workflows
– Dynamic creation of personal data sets.
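How workflow logs and event notification combine to answer a question like "Has PDB changed since I last ran this?" can be sketched as follows. The class, the run identifiers and the version labels are all invented for illustration; the slides do not describe the actual myGrid interfaces.

```python
from collections import defaultdict


class ProvenanceLog:
    """Records which resource versions each workflow run used."""

    def __init__(self):
        self.runs = {}                        # run_id -> {resource: version}
        self.subscribers = defaultdict(list)  # resource -> list of callbacks

    def record(self, run_id, resource, version):
        """Log that a run used a particular version of a resource."""
        self.runs.setdefault(run_id, {})[resource] = version

    def subscribe(self, resource, callback):
        """Register interest in changes to a resource."""
        self.subscribers[resource].append(callback)

    def resource_changed(self, resource, new_version):
        """Notify subscribers and return the runs whose inputs are now out of date."""
        stale = sorted(run_id for run_id, used in self.runs.items()
                       if resource in used and used[resource] != new_version)
        for callback in self.subscribers[resource]:
            callback(resource, new_version, stale)
        return stale


log = ProvenanceLog()
log.record("run-1", "PDB", "release-A")    # made-up version labels
log.record("run-1", "BLAST", "build-1")
log.record("run-2", "PDB", "release-B")

messages = []
log.subscribe("PDB", lambda res, ver, stale: messages.append((res, ver, stale)))
print(log.resource_changed("PDB", "release-B"))  # ['run-1']
```

Because each run's log names the version it used, a change event is enough to identify which stored results may need to be recomputed.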

Bio-Science services

• Grid-enabled BioServices by the EMBL-European Bioinformatics Institute
– EMBOSS, SRS, Open BQS, BLAST, XEmbl and EmblFetch, Flybase, Gadfly …
• Applications using Gateway API
– TALISMAN (annotation tool used by Interpro)
– UTOPIA (sequence fingerprint analysis)
• Portal
• Workbench application

How do the functions of a cluster of proteins interrelate?

Some proteins in my personal repository

[Diagram labels: Portal; Personal Repository; Meta Data: Ontology; Workflow Repository; Meta Data: Service Type Directory; Repository Client; Ontology Client; Workflow Client]

Find services that take a protein and give its function, and pick the best match.


Find another that displays the proteins based on their function. The ontology restricts inputs & outputs.


Build a workflow of composed services linked together


See if an appropriate workflow already exists. It could have been made by anyone who will share with you.


Pick one and enact it.


While it’s running, it picks the best service instance that can run the service at that time.

[Diagram labels: Workflow Client; Service Selection Client; Repository Client; Workflow Enactment; Service Directory; Bioinformatic Services; Personal Repository; Provenance; Data; steps numbered 1 to 4]


Or you choose.

The workflow finishes with the final display service


Results are put into your personal repository, with a concept from the ontology to tell you and myGrid what they mean.


And a full provenance record is kept and linked with the results. We could redo or reuse the workflow.

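The walkthrough above (enact a workflow, bind each step to the best available instance at run time, store the result with an ontology concept and its provenance) can be sketched in a few lines. Everything named here is hypothetical: the protein identifiers, the site names, the "lowest load" selection rule and the data structures are assumptions made for illustration, not the myGrid enactment engine.

```python
# Toy data: protein identifiers and functions are invented for illustration.
FUNCTIONS = {"P1": "kinase", "P2": "transporter", "P3": "kinase"}


def annotate(proteins):
    """Stand-in for a service that maps each protein to its function."""
    return {p: FUNCTIONS[p] for p in proteins}


def group(annotated):
    """Stand-in for a display service that groups proteins by function."""
    groups = {}
    for protein, function in annotated.items():
        groups.setdefault(function, []).append(protein)
    return groups


# Hypothetical service directory: service type -> candidate instances.
DIRECTORY = {
    "get_function": [
        {"name": "site-a", "available": True, "load": 0.9, "call": annotate},
        {"name": "site-b", "available": True, "load": 0.2, "call": annotate},
    ],
    "group_by_function": [
        {"name": "site-c", "available": True, "load": 0.1, "call": group},
    ],
}


def pick_instance(instances):
    """Choose the available instance with the lowest current load (assumed rule)."""
    available = [i for i in instances if i["available"]]
    if not available:
        raise RuntimeError("no instance available")
    return min(available, key=lambda i: i["load"])


def enact(workflow, data, repository):
    """Run the steps in order, binding each to an instance only when it is reached."""
    provenance = []
    for step in workflow:
        instance = pick_instance(DIRECTORY[step["service_type"]])
        data = instance["call"](data)
        provenance.append((step["service_type"], instance["name"]))
    # Store the result with a concept saying what it means, linked to its provenance.
    repository.append({"value": data, "concept": workflow[-1]["concept"],
                       "provenance": provenance})
    return data


WORKFLOW = [
    {"service_type": "get_function", "concept": "protein_function"},
    {"service_type": "group_by_function", "concept": "function_cluster"},
]

repository = []
result = enact(WORKFLOW, ["P1", "P2", "P3"], repository)
print(result)                       # {'kinase': ['P1', 'P3'], 'transporter': ['P2']}
print(repository[0]["provenance"])  # [('get_function', 'site-b'), ('group_by_function', 'site-c')]
```

Keeping the provenance list alongside the stored value is what makes it possible to redo or reuse the workflow later.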

HPC vs Bioinformatics

• Computational Biology vs Bioinformatics => HPC vs Info Grid
– Relationship between them? Shared components? Architectures?
– Information management matters! Accelerating scientific process is not just accelerating compute intensive processes.
• HPC style BioGrid
– Provenance? Personalisation? Metadata? Interactivity? Knowledge? Intermediate results to db; annotated logs…

We are not alone

• Other efforts: we are not alone
– W3C semantic web, BioMOBY, I3C, OMG LSR, active ontology development in the community, DARPA
• Open Grid Service Architecture
– We believe!! Links with Web Services give many benefits.
– But it’s a moving target …
– GGF is a zoo … over 40 RG and WG, often overlapping.

Service Providers

• It’s hard to get Service Provider buy-in
– lower the barriers of entry
– make it reliable.
– security & intellectual property management
– programmatic interfaces
• How do we migrate legacy applications?
– Whole bunch of apps and databases on the web
• Accounting matters
– Who is going to pay for all this?

Hotch potch

• Heterogeneity sucks
– Multi-policy of everything: security, access, accounting really matters in EU
– Getting a UK Grid to work is non-trivial
– Huge investment in system admin.
• Doing more than you could do before.
– Not just another predictable BLAST service over a bunch of machines
– Non-predictable analysis.

Not a silver bullet! It’s just middleware, not magic

• Data quality
• Content management of databases (controlled vocabularies)
• Provenance and versioning policies
• Appropriate use of tools
• Computational inaccessibility of free text annotation
• Database accessibility through means other than point and click web interfaces.

Independent of the Grid!

Life Sciences Grid (LSG)

http://people.cs.uchicago.edu/~dangulo/LSG/

The sum up

• If you ignore the multi-organisational aspect of Grid

• If you ignore the heterogeneous aspect of Grid

• If you assume it’s safe and free and fair

• Then it’s not so hard.

The myGrid Team

• Carole Goble
• Norman Paton
• Alvaro Fernandes
• Stephen Pettifer
• Luc Moreau
• Dave De Roure
• Chris Greenhalgh
• Tom Rodden
• John Brooke
• Paul Watson
• Alan Robinson
• Rob Gaizauskas
• Robert Stevens
• Neil Wipat
• Matthew Addis
• Nick Sharman
• Rich Cawley
• Simon Harper
• Karon Mee
• Simon Miles
• Vijay Dailani
• Xiaojian Liu
• Tom Oinn
• Martin Senger
• Milena Radenkovic
• Kevin Glover
• Angus Roberts
• Chris Wroe
• Mark Greenwood
• Phil Lord
• Neil Davis
• Darren Marvin
• Justin Ferris
• Peter Li
• Nedim Alpdemir
• Luca Toldo
• Robin McEntire
• Anne Westcott
• Tony Storey
• Bernard Horan
• Paul Smart
• Robert Haynes

Spares

[Diagram labels: Knowledge services; Knowledge-based data/computation services; Knowledge-based information services; Data/computation services; Information services; e-Scientist environment]

[Diagram labels: Base services; Semantic services; Knowledge services; Knowledge applications & networks; Text mining; Annotation; Collaboratory; Prediction; Applications; Resources]

myGrid Stack

[Stack diagram; component labels as recovered from the slide]

Web Portal; Gateway API; Workbench Apps Builder (Talisman); Custom Application; Demonstrator Application; UTOPIA; Workbench Demonstrator; Cold Carp Gene Expression; MSD Sequence annotation

BioMedical Services Library, e.g. Distributed Annotation Service

User Agent; Presentation Services; Collaboration Support; Management Tools

Semantic Data Integration; Provenance metadata; Versioning; QoS; Distributed Query; Database; Provenance Validation & Assessment; MIR Database Access; Workflow Enactment; Job Execution; Semantic Workflow Design; Third Party; Ontology Service; Event Notification; Semantic Discovery; Syntactic Discovery; ‘White Pages’ & ‘Yellow Pages’ Discovery; Device Access; Information Extraction; Knowledge; Metadata; Annotation; Preferences; Reasoner; Availability; Service matcher

Vertical labels: Provenance; Personalisation; Security; Base Services; Semantic aware services; Fabric

myGrid Stack 0.1

[Stack diagram with the same component labels as the myGrid Stack slide]

myGrid Stack 0.2

[Stack diagram with the same component labels as the myGrid Stack slide]

Service based architecture

Find them.
Publication, registration, discovery, matchmaking, deregistration.

Organise them.
Interoperation, composition, substitution.

Run them.
Execution, monitoring, exception handling.
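The find / organise / run cycle can be illustrated with a toy registry that supports registration, discovery and deregistration, plus a runner that substitutes another service of the same type when a call fails. The `Registry` class, the service names and the "sequence_length" service type are all hypothetical and chosen only to keep the sketch short.

```python
class Registry:
    """Minimal in-memory service registry (illustrative only)."""

    def __init__(self):
        self._services = {}  # name -> (service_type, callable)

    def register(self, name, service_type, func):
        self._services[name] = (service_type, func)

    def deregister(self, name):
        self._services.pop(name, None)

    def discover(self, service_type):
        """Return (name, callable) pairs of that type, in registration order."""
        return [(name, func) for name, (stype, func) in self._services.items()
                if stype == service_type]


def run_with_substitution(registry, service_type, payload):
    """Try each matching service in turn; substitute the next one if a call fails."""
    errors = []
    for name, func in registry.discover(service_type):
        try:
            return name, func(payload)
        except Exception as exc:  # exception handling: record the failure and move on
            errors.append((name, str(exc)))
    raise RuntimeError(f"all {service_type} services failed: {errors}")


def unreachable(sequence):
    """Stand-in for a service instance that is currently down."""
    raise ConnectionError("timeout")


registry = Registry()
registry.register("primary", "sequence_length", unreachable)
registry.register("mirror", "sequence_length", len)

print(run_with_substitution(registry, "sequence_length", "MKT"))  # ('mirror', 3)
```

Because callers ask for a service type rather than a named instance, a failed provider can be replaced without changing the workflow that uses it.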