Page 1: Software for the Data-Driven Researcher of the Future Dr. Paul Fisher Paul.Fisher@manchester.ac.uk fisherp.

Software for the Data-Driven Researcher of the Future

Dr. Paul Fisher

Paul.Fisher@manchester.ac.uk
http://www.cs.man.ac.uk/~fisherp

Page 2

What is myGrid?

• An e-Science Collaboration Since 2001

• Numerous partners involved:
  – Manchester
  – Southampton
  – Oxford
  – EMBL-EBI

• It provides sustainable and production quality software
  – Supported by OMII-UK, EPSRC and BBSRC

• Mixture of developers, bioinformaticians and researchers

Software | Services | Content | Skills | Community

Page 3

myGrid Open Suite of Tools

Client User Interfaces

Workflow GUI Workbench and 3rd party plug-ins

Workflow Repository

Service Catalogue

Programming and APIs

Web Portals

Activity and Service Plug-in Manager

Provenance Store

Workflow Server

Open Provenance Model

Secure Service Access, and Programming APIs

Page 4

Huge amounts of data

• QTL regions: 100+ Genes

• Microarray: 1000+ Genes

• Next Gen Sequencing: 10,000+ Genes

How do I look at ALL the genes systematically?

Page 5

Issues with current approaches

• Scale of analysis task overwhelms researchers – lots of data

• User bias and premature filtering of datasets – cherry picking

• Hypothesis-Driven approach to data analysis

• Constant changes in data - problems with re-analysis of data

• Implicit methodologies (hyper-linking through web pages)

• Error proliferation from any of the listed issues – notably human error

Solution: Automate

Page 6

• Web Services
  – Technology and standard for exposing code and data resources by a means that can be consumed remotely by a third party
  – Describes how to interact with it, e.g. service parameters

• Workflows
  – General technique for describing and executing a process
  – Describes what you want to do, including the services to use
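The relationship between the two ideas can be sketched in a few lines of Python. This is only an illustration, not how Taverna represents workflows: the two "services" are hypothetical local functions standing in for remote Web Services, and the region, gene and pathway names are made up.

```python
from typing import Any, Callable, Dict, List

Service = Callable[[Any], Any]


# Hypothetical stand-ins for remote Web Services. In a real setting each
# function would wrap a network call to a third-party provider.
def genes_in_region(region: str) -> List[str]:
    lookup = {"chr17:1-1000": ["geneA", "geneB"]}  # made-up data
    return lookup.get(region, [])


def pathways_for_genes(genes: List[str]) -> Dict[str, List[str]]:
    lookup = {"geneA": ["pathway1"], "geneB": ["pathway1", "pathway2"]}  # made-up data
    return {gene: lookup.get(gene, []) for gene in genes}


# The service registry says how each service can be reached.
SERVICES: Dict[str, Service] = {
    "genes_in_region": genes_in_region,
    "pathways_for_genes": pathways_for_genes,
}

# The workflow says what to do: an explicit, ordered list of services to use.
WORKFLOW: List[str] = ["genes_in_region", "pathways_for_genes"]


def run_workflow(steps: List[str], services: Dict[str, Service], data: Any) -> Any:
    """Feed the output of each service into the next one, in order."""
    for name in steps:
        data = services[name](data)
    return data
```

Because the steps are written down explicitly, the same analysis can be re-run unchanged whenever the underlying data changes, which is the point of the "Automate" solution.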

Page 7

What kind of Services?

• WSDL Web Services
• REST
• BioMart
• R-processor
• BioMoby
• SoapLab
• Grid Services
• Local Java services
• Beanshell
• Workflows

Page 8

Who Provides the Services?

• Open domain services and resources
• Taverna accesses 3500+ services (11,874 operations)
• Third party – we don’t own them – we didn’t build them
• All the major providers
  – NCBI, DDBJ, EBI …
• Enforce NO common data model.

You can include your own services and resources too!

Page 9

Where can I find these services?

Page 10

• A public centralised and curated registry of Life Science Web Services

• ‘Web 2.0’-style website and API

• Allow anyone to register, discover and curate Web Services

• Community oriented with expert guidance

• Open content, open source, open platform

www.BioCatalogue.org

Page 11
Page 12

Available services

http://www.taverna.org.uk

Workflow diagram

Workflow Explorer

Page 13

What are Workflows used for?

Page 14

Taverna

• Taverna first released 2004
• Current version Taverna 2.2
• Currently 1500+ users per month, 350+ organizations, ~40 countries, 80,000+ downloads across versions
• Freely available, open source LGPL
• Windows, Mac OS, and Linux
• http://www.taverna.org.uk
• User and developer workshops
• Documentation
• Public Mailing list and direct email support

Page 15

http://www.genomics.liv.ac.uk/tryps/trypsindex.html

Trypanosomiasis in Africa

Andy Brass

Steve Kemp

+ many others

Page 16

Reuse, Recycle, Repurpose Workflows

Dr Paul Fisher

Dr Jo Pennock

Identify biological pathways implicated in resistance to Trypanosomiasis in cattle using mouse as a model organism.

Identify the biological pathways involved in colitis and helminth infections in the mouse model

DOI: 10.1002/ibd.21326 | PMID: 20687192

Page 17

Where can I find workflows?

Page 18

Recycling, Reuse, Repurposing

http://www.myexperiment.org/

• Share

• Search

• Re-use

• Re-purpose

• Execute

• Communicate

• Record

Page 19
Page 20

Bringing myExperiment to the Taverna user

Taverna Plug-in

Page 21

Take a breath…

• myGrid

• Taverna
  – Workflows good for automation
  – Reduce errors

• BioCatalogue
  – Publicly curated repository of Web Services

• myExperiment
  – Web 2.0 repository supporting Workflow discovery and re-use

Page 22

Taverna and the ‘Cloud’

Analysing Next Generation Sequencing Data

Page 23

Analysing African Cattle with Taverna 2.2

Different breeds of African Cattle
• 10,000 years separation

African Livestock adaptations:
• More productive
• Increased disease resistance

Potential outcomes:
• Food security
• Understanding resistance
• Understanding environmental
• Understanding diversity

http://www.bbc.co.uk/news/10403254

Page 24

The study

• Lots of sites involved in the study:
  – University of Liverpool
  – University of Manchester
  – ILRI (Nairobi) …

• Genetic variation in cattle species
  – African breeds: N’Dama, Boran and Sahiwal

• Resistance to African trypanosomiasis infection (sleeping sickness)
  – Genetic differences that make one species more resistant?
  – Potential consequences of those genetic differences?
  – Which pathways are affected by those changes?

Page 25

The Analysis Problem

• Sequenced DNA from 3 cattle breeds using SOLiD / Illumina

• 22 million SNPs for Sahiwal alone
  – N’Dama, Boran ~11 million SNPs each
  – Large data

• Comparing new data with reference genomes

• Identifying interesting differences
  – e.g. non-synonymous SNPs, stop lost, stop gained, splicing regions etc
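As a rough illustration of what "interesting differences" means for a coding SNP, the sketch below classifies a single codon change using the standard genetic code. It is not the method used in the study: the function name and category labels are chosen here for illustration, and splicing-region effects are out of scope because they depend on a SNP's position relative to exon boundaries rather than on the codon.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Label the consequence of replacing ref_codon with alt_codon."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"      # same amino acid (or stop retained)
    if alt_aa == "*":
        return "stop_gained"     # coding codon becomes a stop codon
    if ref_aa == "*":
        return "stop_lost"       # stop codon becomes a coding codon
    return "non_synonymous"      # amino acid changes
```

For example, `classify_codon_change("TGG", "TGA")` turns tryptophan into a stop codon and is reported as `stop_gained`.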

Page 26

The Analysis Pipeline (in Perl)

MAP

FILTER

ANALYSIS

Input SNP data from sequencer

Map between Genome Builds (Liftover)

Filter for SNPs in Exons

SNP consequences

Identifying damaging SNPs (Polyphen)

Harry Noyes – University of Liverpool

Page 27

Workflow and phases

Input SNP file

Populate DB with start SNPs and resource version numbers

Lift-over: maps between UMD3 and BTA4 cow assemblies

Exon positions from Ensembl

Find SNPs in Exon regions

PolyPhen to mark “dangerous” SNPs

The result can be either a MySQL database or TSV / CSV download

MSc Student - Mohammad Khodadadi
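The "Find SNPs in Exon regions" phase and the TSV output can be sketched as follows. This is a minimal sketch under stated assumptions, not the workflow's actual implementation: exons are assumed to be non-overlapping, 1-based inclusive intervals per chromosome, SNPs are plain `(chrom, pos, ref, alt)` tuples, and all coordinates used with it are made up.

```python
import bisect
import csv
import io
from typing import Dict, List, Tuple

Exon = Tuple[int, int]            # (start, end), 1-based inclusive (assumption)
Snp = Tuple[str, int, str, str]   # (chrom, pos, ref, alt) (assumption)
ExonIndex = Dict[str, Tuple[List[int], List[int]]]


def build_index(exons: Dict[str, List[Exon]]) -> ExonIndex:
    """Sort exons per chromosome so each lookup is a binary search."""
    index: ExonIndex = {}
    for chrom, intervals in exons.items():
        ordered = sorted(intervals)
        index[chrom] = ([s for s, _ in ordered], [e for _, e in ordered])
    return index


def in_exon(index: ExonIndex, chrom: str, pos: int) -> bool:
    """True if pos falls inside an exon (assumes non-overlapping exons)."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    i = bisect.bisect_right(starts, pos) - 1  # last exon starting at or before pos
    return i >= 0 and pos <= ends[i]


def filter_snps(snps: List[Snp], exons: Dict[str, List[Exon]]) -> List[Snp]:
    """Keep only the SNPs located in an exon, preserving input order."""
    index = build_index(exons)
    return [snp for snp in snps if in_exon(index, snp[0], snp[1])]


def to_tsv(snps: List[Snp]) -> str:
    """Serialise the filtered SNPs as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["chrom", "pos", "ref", "alt"])
    writer.writerows(snps)
    return buffer.getvalue()
```

Binary search keeps each lookup at O(log n) in the number of exons on a chromosome, which matters when millions of SNPs are checked; overlapping exons would need an interval tree or a merge step first.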

Page 28

Taverna and the ‘Cloud’

Page 29

What we will demonstrate

1. Uploading Next Generation Sequencing SNP data to the cloud

2. Creating a new experiment

3. Running a workflow on multiple cloud instances

4. Showing result output, including links to annotated SNPs

Page 30

Demo

Page 31

Managing and Processing Data

Page 32

Accessing Taverna on the Cloud

Page 33

Jobs Status

Input Provenance

Experiment Metadata

Input data summary

Loading inputs

Page 34

Summary of Workflow Output

Non-synonymous coding SNPs

Polyphen predictions: probably damaging

11 million SNPs for N’Dama

N.B. Numbers vary due to the workflow and PolyPhen filtering process

Page 35
Page 36

New Developments in myGrid

Page 37

Essential for cloud

Taverna

• Taverna 2.2 execution engine
  – Large data processing
  – Pause, resume and cancelling workflows
  – Retry and parallelisation layer

• Taverna 2.2 server
  – Remote workflow execution
  – Workflows launched from web pages
  – Workflows executed on the cloud
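The idea behind a retry and parallelisation layer can be sketched generically in Python. This is not Taverna's engine or API: `call_with_retry`, `run_parallel` and the deliberately flaky `flaky_annotate` service are hypothetical names used only to show the pattern of retrying a failed service call while processing many inputs concurrently.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


def call_with_retry(service: Callable[[str], str], item: str, max_attempts: int = 3) -> str:
    """Call the service, retrying on any exception up to max_attempts times."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return service(item)
        except Exception as error:  # a real layer would retry only transient errors
            last_error = error
    raise RuntimeError(f"gave up on {item!r} after {max_attempts} attempts") from last_error


def run_parallel(service: Callable[[str], str], items: List[str],
                 workers: int = 4, max_attempts: int = 3) -> List[str]:
    """Process all items concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda item: call_with_retry(service, item, max_attempts), items))


# Hypothetical service that fails the first time it sees each input,
# simulating a transient network error from a remote provider.
_calls: Dict[str, int] = {}
_lock = threading.Lock()


def flaky_annotate(snp_id: str) -> str:
    with _lock:
        _calls[snp_id] = _calls.get(snp_id, 0) + 1
        first_call = _calls[snp_id] == 1
    if first_call:
        raise ConnectionError("simulated transient failure")
    return snp_id + ":annotated"
```

Threads suit this case because calls to remote services spend most of their time waiting on the network; a usage example would be `run_parallel(flaky_annotate, ["snp1", "snp2"])`.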

Page 38

Other New Features

• Validation reporting
• Loading and sharing service sets
• Support for offline editing
• New provenance features

Page 39

ISMB 10

BioCatalogue Plug-in

Page 40

Training

• Tutorials and Training
  – 58+ tutorials to >900 people.
  – >20 universities, Life Science Institutes, and networks.
  – Major Bio conferences
  – Summer schools in Biology and Middleware

• Developer and User Days
  – Annotation Jamborees

• Undergraduate and Postgraduate Bioinformatics in >30 universities.

Page 41
Page 42

More Information

• myGrid
  – http://www.mygrid.org.uk

• Taverna
  – http://www.taverna.org.uk

• myExperiment
  – http://www.myexperiment.org

• BioCatalogue
  – http://www.biocatalogue.org

Page 43

Visit us at the myGrid Silver Sponsor Stand

Page 44

FIN