Software for the Data-Driven Researcher of the Future Dr. Paul Fisher Paul.Fisher@manchester.ac.uk...

Post on 19-Dec-2015


Software for the Data-Driven Researcher of the Future

Dr. Paul Fisher

Paul.Fisher@manchester.ac.uk

http://www.cs.man.ac.uk/~fisherp

What is myGrid?

• An e-Science Collaboration Since 2001

• Numerous partners involved:
– Manchester
– Southampton
– Oxford
– EMBL-EBI

• It provides sustainable and production-quality software
– Supported by OMII-UK, EPSRC and BBSRC

• Mixture of developers, bioinformaticians and researchers

Software | Services | Content | Skills | Community

myGrid Open Suite of Tools

Client User Interfaces

Workflow GUI Workbench and 3rd party plug-ins

Workflow Repository

Service Catalogue

Programming and APIs

Web Portals

Activity and Service Plug-in Manager

Provenance Store

Workflow Server

Open Provenance Model

Secure Service Access, and Programming APIs

Huge amounts of data

QTL regions

100+ Genes

Microarray

1000+ Genes

How do I look at ALL the genes systematically?

Next Gen Sequencing

10,000+ Genes

Issues with current approaches

• Scale of analysis task overwhelms researchers – lots of data

• User bias and premature filtering of datasets – cherry picking

• Hypothesis-Driven approach to data analysis

• Constant changes in data - problems with re-analysis of data

• Implicit methodologies (hyper-linking through web pages)

• Error proliferation from any of the listed issues – notably human error

Solution Automate

• Web Services
– Technology and standard for exposing code and data resources by a means that can be consumed by a third party remotely
– Describes how to interact with it, e.g. service parameters

• Workflows
– General technique for describing and executing a process
– Describes what you want to do, including the services to use
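The distinction above can be sketched in a few lines of Python: a service exposes one operation, and a workflow is an explicit, ordered description of which services to call. This is only an illustration. The `Service` and `Workflow` classes and the two toy services are hypothetical and are not part of Taverna or any myGrid API.

```python
from typing import Any, Callable, Dict, List


class Service:
    """Hypothetical stand-in for a remote service operation."""

    def __init__(self, name: str, func: Callable[[Any], Any]):
        self.name = name
        self.func = func

    def invoke(self, data: Any) -> Any:
        return self.func(data)


class Workflow:
    """An explicit, ordered description of the services to run."""

    def __init__(self, steps: List[Service]):
        self.steps = steps

    def run(self, data: Any) -> Dict[str, Any]:
        trace = {}  # intermediate result of every step, keyed by service name
        for step in self.steps:
            data = step.invoke(data)
            trace[step.name] = data
        return {"output": data, "trace": trace}


# Two toy services: tidy up gene identifiers, then remove duplicates.
normalise = Service("normalise", lambda ids: [i.strip().upper() for i in ids])
deduplicate = Service("deduplicate", lambda ids: sorted(set(ids)))

workflow = Workflow([normalise, deduplicate])
result = workflow.run([" brca1", "tp53 ", "BRCA1"])
print(result["output"])  # ['BRCA1', 'TP53']
```

Because every step and its intermediate result are recorded, the method is explicit and can be re-run unchanged when the input data changes.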

What kind of Services?

• WSDL Web Services

• REST

• BioMart

• R-processor

• BioMoby

• SoapLab

• Grid Services

• Local Java services

• Beanshell

• Workflows

Who Provides the Services?

• Open domain services and resources

• Taverna accesses 3500+ services (11,874 operations)

• Third party – we don’t own them – we didn’t build them

• All the major providers
– NCBI, DDBJ, EBI …

• Enforce NO common data model.

Can include your own services and resources too !!!

Where can I find these services?

• A public centralised and curated registry of Life Science Web Services

• ‘Web 2.0’-style website and API

• Allow anyone to register, discover and curate Web Services

• Community oriented with expert guidance

• Open content, open source, open platform

www.BioCatalogue.org

Available services

http://www.taverna.org.uk

Workflow diagram

Workflow Explorer

What are Workflows used for?

Taverna

• Taverna first released 2004

• Current version Taverna 2.2

• Currently 1500+ users per month, 350+ organizations, ~40 countries, 80,000+ downloads across versions

• Freely available, open source LGPL

• Windows, Mac OS, and Linux

• http://www.taverna.org.uk

• User and developer workshops

• Documentation

• Public Mailing list and direct email support

http://www.genomics.liv.ac.uk/tryps/trypsindex.html

Trypanosomiasis in Africa

Andy Brass

Steve Kemp

+ many others

Reuse, Recycle, Repurpose Workflows

Dr Paul Fisher

Dr Jo Pennock

Identify biological pathways implicated in resistance to Trypanosomiasis in cattle using mouse as a model organism.

Identify the biological pathways involved in colitis and helminth infections in the mouse model

DOI: 10.1002/ibd.21326 | PMID: 20687192

Where can I find workflows?

Recycling, Reuse, Repurposing

http://www.myexperiment.org/

• Share

• Search

• Re-use

• Re-purpose

• Execute

• Communicate

• Record

Bringing myExperiment to the Taverna user

Taverna Plug-in

Take a breath…..

• myGrid

• Taverna
– Workflows good for automation
– Reduce errors

• BioCatalogue
– Publicly curated repository of Web Services

• myExperiment
– Web 2.0 repository supporting Workflow discovery and re-use

Taverna and the ‘Cloud’

Analysing Next Generation Sequencing Data

+

Analysing African Cattle with Taverna 2.2

Different breeds of African Cattle
• 10,000 years separation

African Livestock adaptations:
• More productive
• Increased disease resistance

Potential outcomes:
• Food security
• Understanding resistance
• Understanding environmental
• Understanding diversity

http://www.bbc.co.uk/news/10403254

The study

• Lots of sites involved in Study:
– University of Liverpool
– University of Manchester
– ILRI (Nairobi)……

• Genetic variation in cattle species
– African breeds: N’Dama, Boran and Sahiwal

• Resistance to African trypanosomiasis infection (sleeping sickness)
– Genetic differences to make one species more resistant?
– Potential consequences of those genetic differences?
– Which pathways are affected by those changes?

The Analysis Problem

• Sequenced DNA from 3 cattle breeds using SOLiD / Illumina

• 22 million SNPs for Sahiwal alone
– N’Dama, Boran ~11 million SNPs each
– Large data

• Comparing new data with reference genomes

• Identifying interesting differences
– e.g. non-synonymous SNPs, stop lost, stop gained, splicing regions etc.
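As a rough illustration of what "interesting differences" means for a coding SNP, the sketch below classifies a single-base change within one codon using the standard genetic code. It is a simplified teaching example, not the method used in the study: the function name `classify_snp` and its category labels are assumptions, and splice-site effects are not covered.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(ref_codon: str, position: int, alt_base: str) -> str:
    """Classify a single-base substitution at `position` (0-2) of a codon."""
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:position] + alt_base.upper() + ref_codon[position + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return "non_synonymous"


print(classify_snp("GAA", 2, "G"))  # GAA -> GAG, Glu -> Glu
print(classify_snp("GAA", 1, "T"))  # GAA -> GTA, Glu -> Val
print(classify_snp("TGG", 2, "A"))  # TGG -> TGA, Trp -> stop
print(classify_snp("TAA", 0, "C"))  # TAA -> CAA, stop -> Gln
```

A real analysis also needs the transcript, strand and reading frame of each SNP, which is why the pipeline relies on external annotation resources rather than a lookup like this one.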

The Analysis Pipeline (in Perl)

MAP

FILTER

ANALYSIS

Input SNP data from sequencer

Map between Genome Builds (Liftover)

Filter for SNPs in Exons

SNP consequences

Identifying damaging SNPs (PolyPhen)

Harry Noyes – University of Liverpool

Workflow and phases

Input SNP file

Populate DB with start SNPs and resource version numbers

Lift-over: maps between UMD3 and BTA4 cow assemblies

Exon positions from Ensembl

Find SNPs in Exon regions

PolyPhen to mark “dangerous” SNPs

The result can be either a MySQL database or TSV / CSV download
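The "find SNPs in exon regions" phase and the TSV output can be sketched as follows. This is a minimal illustration under stated assumptions: coordinates are 1-based and inclusive, exons on a chromosome do not overlap, and the function names, column names and example records are all invented for the example rather than taken from the actual workflow.

```python
import csv
import io
from bisect import bisect_right
from collections import defaultdict


def build_exon_index(exons):
    """Group (chrom, start, end) exons by chromosome, sorted by start."""
    index = defaultdict(list)
    for chrom, start, end in exons:
        index[chrom].append((start, end))
    for intervals in index.values():
        intervals.sort()
    return index


def in_exon(index, chrom, pos):
    """True if pos lies inside an exon (assumes non-overlapping exons)."""
    intervals = index.get(chrom, [])
    # Locate the last exon whose start is <= pos, then check its end.
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos <= intervals[i][1]


def filter_exonic(snps, index):
    """Keep only (chrom, pos, ref, alt) records that fall in an exon."""
    return [snp for snp in snps if in_exon(index, snp[0], snp[1])]


def to_tsv(snps):
    """Serialise SNP records as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["chrom", "pos", "ref", "alt"])
    writer.writerows(snps)
    return buffer.getvalue()


# Invented example data.
exons = [("chr1", 100, 200), ("chr1", 500, 650), ("chr2", 10, 50)]
snps = [
    ("chr1", 150, "A", "G"),
    ("chr1", 300, "C", "T"),
    ("chr1", 500, "G", "A"),
    ("chr2", 51, "T", "C"),
]

index = build_exon_index(exons)
exonic = filter_exonic(snps, index)
print(to_tsv(exonic))
```

Sorting the exons once and using binary search keeps each lookup logarithmic in the number of exons, which matters when millions of SNPs have to be tested.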

MSc Student - Mohammad Khodadadi

Taverna and the ‘Cloud’

+

What we will demonstrate

1. Uploading Next Generation Sequencing SNP data to the cloud

2. Creating a new experiment

3. Running a workflow on multiple cloud instances

4. Showing result output, including links to annotated SNPs
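Step 3 follows a scatter and gather pattern: split the input, run the same workflow on each piece, then merge the results. The sketch below shows that pattern locally, with a thread pool standing in for the cloud instances. It does not use Taverna Server or any cloud API, and `run_workflow` is a hypothetical placeholder for one workflow run.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Split records into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def run_workflow(chunk):
    """Hypothetical stand-in for one workflow run: keep coding SNPs."""
    return [snp for snp in chunk if snp["coding"]]


def run_on_workers(records, n_workers):
    """Scatter chunks across workers, then gather results in input order."""
    chunks = split_into_chunks(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the chunks.
        partial_results = list(pool.map(run_workflow, chunks))
    return [snp for part in partial_results for snp in part]


# Invented example data: every third SNP is flagged as coding.
snps = [{"id": f"snp{i}", "coding": i % 3 == 0} for i in range(10)]
results = run_on_workers(snps, n_workers=4)
print([snp["id"] for snp in results])  # ['snp0', 'snp3', 'snp6', 'snp9']
```

This only works cleanly when each SNP can be analysed independently of the others, which is the case for per-SNP filtering and annotation.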

Demo

Managing and Processing Data

Accessing Taverna on the Cloud

Jobs Status

Input Provenance

Experiment Metadata

Input data summary

Loading inputs

Summary of Workflow Output

Non-synonymous coding SNPs

Polyphen predictions: probably damaging

11 million SNPs for N’Dama

N.B. Numbers vary due to the workflow and PolyPhen filtering process

New Developments in myGrid

Essential for cloud

Taverna

• Taverna 2.2 execution engine
– Large data processing
– Pausing, resuming and cancelling workflows
– Retry and parallelisation layer

• Taverna 2.2 server
– Remote workflow execution
– Workflows launched from web pages
– Workflows executed on the cloud

Other New features

• Validation reporting

• Loading and sharing service sets

• Support for offline editing

• New provenance features
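The retry layer listed for the execution engine matters because third-party services fail intermittently. The general idea can be sketched as a small wrapper that re-invokes a failing call a limited number of times. This is an illustration of the concept only, not Taverna's implementation. `call_with_retry`, its parameters and `flaky_service` are all hypothetical.

```python
import time


def call_with_retry(func, max_attempts=3, delay_seconds=0.0, backoff=2.0):
    """Call func(), retrying on any exception up to max_attempts times."""
    attempt = 1
    while True:
        try:
            return func()
        except Exception:
            if attempt >= max_attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay_seconds)
            delay_seconds *= backoff  # wait longer before each new attempt
            attempt += 1


# A toy service that fails twice before succeeding.
calls = {"count": 0}


def flaky_service():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("service temporarily unavailable")
    return "ok"


outcome = call_with_retry(flaky_service, max_attempts=5)
print(outcome, calls["count"])  # ok 3
```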

ISMB 10

BioCatalogue Plug-in

Training

• Tutorials and Training
– 58+ tutorials to >900 people.
– >20 universities, Life Science Institutes, and networks.
– Major Bio conferences
– Summer schools in Biology and Middleware

• Developer and User Days
– Annotation Jamborees

• Undergraduate and Postgraduate Bioinformatics in > 30 universities.

More Information

• myGrid
– http://www.mygrid.org.uk

• Taverna
– http://www.taverna.org.uk

• myExperiment
– http://www.myexperiment.org

• BioCatalogue
– http://www.biocatalogue.org

Visit us at the myGrid Silver Sponsor Stand

FIN