Handling next generation sequence data
a pilot to run data analysis on the Dutch Life Sciences Grid
Barbera van Schaik
Bioinformatics Laboratory - KEBB
Academic Medical Center
Amsterdam
Very short intro on high throughput sequencing
• Sanger sequencing
• High throughput sequencing
23-01-2009 2
DNA building blocks
http://en.wikipedia.org/wiki/DNA
>chr1taaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccaaccctaaccctaaccctaaccctaaccctaaccctaacccctaaccctaaccctaaccctaaccctaacctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccctaaccctaaccctaaaccctaaaccctaaccctaaccctaaccctaaccctaaccccaaccccaaccccaaccccaaccccaaccccaaccctaacccctaaccctaaccctaaccctaccctaaccctaaccctaaccctaaccctaaccctaacccctaacccctaaccctaaccctaaccctaaccctaaccctaaccctaacccctaaccctaaccctaaccctaaccctcgcggtaccctcagccggcccgcccgcccgggtctgacctgaggagaactgtgctccgccttcagagtaccaccgaaatctgtgcagaggacaacgcagctccgccctcgcggtgctctccgggtctgtgctgaggagaacgcaactccgccggcgcaggcgcagagaggcgcgccgcgccggcgcaggcgcagacacatgctagcgcgtcggggtggaggcgtggcgcaggcgcagagaggcgcgccgcgccggcgcaggcgcagagacacatgctaccgcgtccaggggtggaggcgtggcgcaggcgcagagaggcgcaccgcgccggcgcaggcgcagagacacatgctagcgcgtccaggggtggaggcgtggcgcaggcgcagagacgcaagcctacgggcgggggttgggggggcgtgtgttgcaggagcaaagtcgcacggcgccgggctggggcggggggagggtggcgccgtgcacgcgcagaaactcacgtcacggtggcgcggcgcagagacgggtagaacctcagtaatccgaaaagccgggatcgaccgccccttgcttgcagccgggcactacaggacccgcttgctcacggtgctgtgccagggcgccccctgctggcgactagggcaactgcagggctctcttgcttagagtggtggccagcgccccctgctggcgccggggcactgcagggccctcttgcttactgtatagtggtggcacgccgcctgctggcagctagggacattgcagggtcctcttgctcaaggtgtagtggcagcacgcccacctgctggcagctggggacactgccgggccctcttgctCCAACAGTACTGGCGGATTATAGGGAAACACCCGGAGCATATGCTGTTTGGTCTCAgtagactcctaaatatgggattcctgggtttaaaagtaaaaaataaatatgtttaatttgtgaactgattaccatcagaattgtactgttctgtatcccaccagcaatgtctaggaatgcctgtttctccacaaagtgtttacttttggatttttgccagtctaacaggtgaAGccctggagattcttattagtgatttgggctggggcctggccatgtgtatttttttaaatttccactgatgattttgctgcatggccggtgttgagaatgactgCGCAAAT
Sanger sequencing
Bentley 2006, Curr opinion in genetics
http://en.wikipedia.org/wiki/DNA_sequencing
Mix of "standard" nucleotides and labelled dideoxynucleotides (chain-terminating nucleotides)
Gel or capillary sequencing
One sequence at a time
Robots: up to 384 samples in one run
Overview sequencing methods
Synthetic chain-terminator chemistry (Sanger sequencing)
Sequencing by hybridisation
Pyrosequencing
Base-by-base sequencing by synthesis
Sequencing by ligation
Nanopore technology
Single-molecule sequencing by synthesis in real time
People and labs
Bioinformatics laboratory - KEBB
Angela Luyf, Silvia D Olabarriaga
Tristan Glatard
Barbera van Schaik
Antoine van Kampen
Sequence facility
Marja Jakobs
Ted Bradley
Frank Baas
Laboratory Division
Roche (454) sequencer
Pyrosequencing - Roche FLX (454) Sample preparation
Nature Methods 2007 advertisement 454.com
Pyrosequencing - Roche FLX (454) Sequencing process
Nature Methods 2008
Pyrosequencing - Roche FLX (454) Summary
One sequence per bead
Amplification in oil/water emulsion
Fix bead in container (picotiterplate)
Put plate with containers in machine
Wash one nucleotide at a time over plate -> light emission
Take picture
Wash next nucleotide over plate
Nature Methods 2008 and 454.com
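The wash-and-image cycle summarised above can be illustrated with a toy base caller. This is a minimal sketch, not the Roche software: it assumes the nucleotides are washed in a fixed repeating order (`TACG` is used here as an assumption), that each wash yields one light intensity per bead, and that the intensity is roughly proportional to the number of identical bases incorporated during that wash. The function name and the example intensities are made up for illustration.

```python
def call_bases(intensities, flow_order="TACG"):
    """Turn per-wash light intensities for one bead into a base sequence.

    Each wash flows a single nucleotide over the plate. The rounded
    intensity is taken as the number of bases incorporated in that wash
    (0 means the nucleotide did not match the template at that position).
    """
    bases = []
    for flow_index, signal in enumerate(intensities):
        nucleotide = flow_order[flow_index % len(flow_order)]
        count = int(round(signal))
        bases.append(nucleotide * count)
    return "".join(bases)


# Made-up intensities for eight washes: T, A, C, G, T, A, C, G
example = [1.02, 0.05, 0.97, 2.10, 0.03, 1.04, 0.00, 0.96]
print(call_bases(example))
```

An intensity near 2 is read as two identical bases in a row, which is why runs of the same nucleotide are resolved from signal strength rather than from separate washes.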
Data pre-processing
CACTC
CGACA
TGCGT
TGCGT
>E9108QN01BVB2T length=238 xy=0649_3411 region=1 run=R_2008_05_06_17_51_52_
CACTCCAGGAAACAGCTATGACCTCTGCCTGGAAAGCCAGGTGCCTGTGGGCAGAGCCCAGGACCACAGGGCCAGGGGTATCTCGTGTTCCTGTCCTGGCCGCGGATCTTCTTCTCCATCTCAGCGTCTGTCAGAGTCTCCAGCAGTGGGCACCACTGGTCCGCATCGCCCGTGTTCCGGATGGCAATCTCCACTGTGGGCAGAGGGTTCTCGCTACGAGGAGGGAGGCAGTGAGAGG
10011 00101 01010
Binary to Fasta
>seqA
CACTC CAGGA AACAG
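The 454 FASTA header shown above carries the read name followed by `key=value` fields. A minimal sketch of pulling those apart, assuming whitespace-separated fields exactly as they appear on the slide; the function name is hypothetical.

```python
def parse_read_header(header_line):
    """Split a '>name key=value ...' FASTA header into a name and a dict of fields."""
    if not header_line.startswith(">"):
        raise ValueError("not a FASTA header: %r" % header_line)
    parts = header_line[1:].split()
    if not parts:
        raise ValueError("header has no read name")
    name = parts[0]
    # Split each remaining token on the first '=' only
    fields = dict(part.split("=", 1) for part in parts[1:])
    return name, fields


header = ">E9108QN01BVB2T length=238 xy=0649_3411 region=1 run=R_2008_05_06_17_51_52_"
name, fields = parse_read_header(header)
print(name, int(fields["length"]), fields["run"])
```

All values are kept as strings; converting `length` to an integer is left to the caller, as in the last line.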
Further analysis
Mutation detection: BLAST against reference
Virus discovery: BLAST against virus database
Gene expression: BLA(S)T against gene reference set
Chip-on-sequencing: BLA(S)T against genome sequence
Preprocessing
BLAT
BLAST
Feature count
Quality check
High throughput sequence data explosion
One sequence run: 2 GB (>400,000 reads)
Per day: 6 GB (1,200,000 reads)
Per week (5d): 30 GB (6,000,000 reads)
Per year (52 weeks): roughly 1500 GB (312,000,000 reads)
This becomes worse when the Roche system is upgraded!
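The figures above follow from simple multiplication. A small sketch of that arithmetic, assuming three runs per day, five sequencing days per week and 52 weeks per year (the runs-per-day and weeks-per-year values are inferred from the slide's numbers, not stated on it):

```python
GB_PER_RUN = 2
READS_PER_RUN = 400_000      # the slide says ">400,000", so this is a lower bound
RUNS_PER_DAY = 3             # assumption: 6 GB per day / 2 GB per run
DAYS_PER_WEEK = 5
WEEKS_PER_YEAR = 52          # assumption: matches 312,000,000 reads per year


def volume(runs):
    """Return (gigabytes, reads) produced by a number of sequence runs."""
    return runs * GB_PER_RUN, runs * READS_PER_RUN


per_day = volume(RUNS_PER_DAY)
per_week = volume(RUNS_PER_DAY * DAYS_PER_WEEK)
per_year = volume(RUNS_PER_DAY * DAYS_PER_WEEK * WEEKS_PER_YEAR)

for label, (gigabytes, reads) in [("day", per_day), ("week", per_week), ("year", per_year)]:
    print(f"per {label}: {gigabytes} GB, {reads:,} reads")
```

Under these assumptions the yearly total comes to 1,560 GB and 312,000,000 reads; a 50-week year would give 1,500 GB and 300,000,000 reads instead.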
Pilot: run bioinformatics tools on the Grid
Experience with earlier projects
Many computation intensive tasks
This pilot: BLAST as (small) test case
Advantages of Grid
Sharing of data storage and computing power
Parallel computing (multiple jobs at same time)
Disadvantage of Grid
Complex system to work with
Bioinformatician-friendly systems are now available
End-user interface for Grid usage
Workbench for building workflows
System to run workflows on the Grid
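Running multiple jobs at the same time works well for sequence searches because each read can be searched independently of the others. A minimal sketch of splitting a set of reads into batches that could each be submitted as a separate job; the batch size and function name are illustrative assumptions, and the actual job submission is left out.

```python
def split_into_batches(reads, batch_size):
    """Yield consecutive lists of at most batch_size reads."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(reads), batch_size):
        yield reads[start:start + batch_size]


# Ten placeholder read names split into batches of four
reads = ["read%02d" % i for i in range(10)]
batches = list(split_into_batches(reads, 4))
print([len(batch) for batch in batches])
```

The last batch is allowed to be smaller than the others, so no read is dropped when the total is not a multiple of the batch size.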
Outline
Components
Dutch Life Sciences Grid
VBrowser
Workflows
Taverna
Moteur
GASW webservices
Interaction between the components
Discussion
Experiences so far and considerations
Wish list for Life Sciences Grid
Current status and future work
Interaction between the components
Taverna
VBrowser
lsgrid
Scufl file (XML)
export
import
workflow management system
Virtual Laboratory for eScience
http://www.vl-e.nl/
Bioinformatics
Dutch Life Sciences Grid
Roll out GRID infrastructure in the Netherlands
Sharing of data storage and computer power
http://www.biggrid.nl/
VBrowser
http://www.vl-e.nl/vbrowser/
SARA
AMC
Workflows
We want to create a bioinformatics pipeline for sequence analysis
Modular building blocks that perform a single task
Connect blocks to create a program
Sequence files
Pre-processing
BLAST
BLAST files
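The idea of single-task blocks connected into a program can be sketched in a few lines. This is only an illustration of the concept, not how Taverna or Moteur represent workflows; both step functions are hypothetical stand-ins, and the second one does not run BLAST, it only records what would be searched.

```python
def preprocess(sequences):
    """Block 1: normalise sequences (strip whitespace, upper-case, drop empties)."""
    cleaned = [seq.strip().upper() for seq in sequences]
    return [seq for seq in cleaned if seq]


def fake_blast(sequences):
    """Block 2: stand-in for a BLAST step; returns one result record per sequence."""
    return [{"query": seq, "length": len(seq)} for seq in sequences]


def run_workflow(data, blocks):
    """Connect blocks: the output of each block is the input of the next."""
    for block in blocks:
        data = block(data)
    return data


results = run_workflow([" cactc ", "", "TGCGT"], [preprocess, fake_blast])
print(results)
```

Because every block takes a list and returns a list, a block can be swapped or a new one inserted without touching the others.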
Taverna
http://taverna.sourceforge.net/
http://www.ebi.ac.uk/Tools/webservices/tutorials/workflow/taverna
Workflow management systems Difference between Taverna and Moteur
[Figure: the pipeline Sequence files -> Pre-processing -> BLAST -> BLAST files, drawn twice; figure labels: NCBI, AMC, Job on Grid node (twice)]
Generic Application Service Wrapper (GASW)
GASW services
Configuration files (XML)
http://rainbow.i3s.unice.fr/wiki/dokuwiki/doku.php?id=public_namespace:moteur
Example config file for wrapping a Perl script with GASW
<description>
  <executable name="sff2fasta.pl">
    <access type="LFN"/>
    <value value="/grid/lsgrid/angela/Sequence_WF/Michel_28_10_2008/perlScripts/sff2fasta.pl"/>
    <input name="tarFile" option="no0">
      <access type="LFN"/>
    </input>
    <input name="sffFile" option="no1">
      <access type="LFN"/>
    </input>
    <output name="out_sff2fasta.txt" option="no2">
      <template value="/grid/lsgrid/angela/Sequence_WF/Michel_28_10_2008/sff2fasta_out/%s_fasta.fna"/>
      <access type="LFN"/>
    </output>
    <output name="out_sff2fasta.txt" option="no3">
      <template value="/grid/lsgrid/angela/Sequence_WF/Michel_28_10_2008/sff2fasta_out/%s
l l"/>
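A descriptor of this shape can be inspected with a standard XML parser. The sketch below is an assumption-laden illustration, not part of GASW or Moteur: it uses a shortened, hypothetical descriptor that mirrors the element and attribute names on the slide, with placeholder paths instead of the real Grid locations, and a made-up function name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened descriptor in the same shape as the slide's example.
# The paths and the output name are placeholders, not real Grid locations.
DESCRIPTOR = """
<description>
  <executable name="sff2fasta.pl">
    <access type="LFN"/>
    <value value="/grid/example/perlScripts/sff2fasta.pl"/>
    <input name="sffFile" option="no1">
      <access type="LFN"/>
    </input>
    <output name="fastaFile" option="no2">
      <template value="/grid/example/out/%s_fasta.fna"/>
      <access type="LFN"/>
    </output>
  </executable>
</description>
"""


def summarize(descriptor_xml):
    """Collect the executable, its inputs and its output templates."""
    root = ET.fromstring(descriptor_xml.strip())
    exe = root.find("executable")
    return {
        "executable": exe.get("name"),
        "location": exe.find("value").get("value"),
        "inputs": [(i.get("name"), i.get("option")) for i in exe.findall("input")],
        "outputs": [(o.get("name"), o.find("template").get("value"))
                    for o in exe.findall("output")],
    }


summary = summarize(DESCRIPTOR)
print(summary["executable"], summary["inputs"], summary["outputs"])
```

The `%s` in an output template is kept as-is here; how it is filled in at run time is not shown on the slide.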
Interaction between the components
Taverna
VBrowser
lsgrid
Scufl file (XML)
export
import
Grid certificate: I am me
workflow management system
Screenshot VBrowser/Moteur (1)
http://rainbow.i3s.unice.fr/wiki/dokuwiki/doku.php?id=public_namespace:moteur
Screenshot VBrowser/Moteur (2)
Outline
Components
Dutch Life Sciences Grid
VBrowser
Workflows
Taverna
Moteur
GASW webservices
Interaction between the components
Discussion
Experiences so far and considerations
Wish list for Life Sciences Grid
Current status and future work
Experiences so far and considerations
Request certificate for the Life Sciences Grid
Learn how all components work
Wrap our applications for use in Grid workflows
Ship databases and blast executables to the Grid
Wish list for Life Sciences Grid related to sequence analysis
Public databases
GenBank
Bioinformatics tools
BLAST
EMBOSS
BioPerl
Outline
Components
Dutch Life Sciences Grid
VBrowser
Workflows
Taverna
Moteur
GASW webservices
Interaction between the components
Discussion
Experiences so far and considerations
Wish list for Life Sciences Grid
Current status and future work
Status
Current status
Wrapped with GASW:
Perl scripts for pre-processing of Roche sequence data
BLAST and BLAT
Built a workflow of these components in Taverna
Ran workflow successfully on the Life Sciences Grid with Moteur
Future work
Submit workflows for multiple sequence runs
Real (computation intensive) application
Examine how we can build a system for end-users
NBIC Bioassist - sequencing platform
Angela Luijf, Bioinformatics Laboratory
Silvia D Olabarriaga, Bioinformatics Laboratory
Tristan Glatard, Creatis-LRMN Lyon
Barbera van Schaik, Bioinformatics Laboratory
Frank Baas, Neurogenetics / Sequencing
Antoine van Kampen, Bioinformatics Laboratory