IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Post on 27-Aug-2014

Final project for the Computer Engineering degree, Universidad de La Laguna, Spain, July 2014. Ion Torrent technology allows genome sequencing at reduced cost; however, its major drawback is the lack of tools dedicated to processing and assembling Ion Torrent reads. IonGAP is a free, graphical, integrated pipeline designed for the assembly and subsequent analysis of Ion Torrent sequencing data. Both its components and their configuration are the result of research aimed at finding the combination of tools that gives the best results from single-end reads generated by the Ion Torrent PGM sequencer, mainly from bacterial genomic material.

Author

Adrián Báez Ortega

Supervisors

Marcos Colebrook Santamaría

José Luis Roda García

Date

17/07/2014

IonGAP

Contents

1. Introduction

2. Objective of the project

3. State of the art

4. The genome assembler

5. A genome assembly and analysis pipeline

6. IonGAP Web service

7. Parallel assembly of large genomes

8. Conclusions

Introduction

DNA · Genomics · Genome · Proteins · Genes · Double helix · Biomedicine · Life

Genome sequencing

Genome de novo assembly

Adapted from: http://en.wikipedia.org/wiki/Genomic_library#mediaviewer/File:Whole_genome_shotgun_sequencing_versus_Hierarchical_shotgun_sequencing.png

Genomics (Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias)

Computer Science (Escuela Técnica Superior de Ingeniería Informática)

Bioinformatics

Objective of the project

To develop an easy-to-use integrated software platform that offers optimally configured processing and de novo assembly of genomic data obtained by Ion Torrent sequencing, complemented by several result analysis stages.

State of the art

Genome sequencing: genome fragments → FASTQ file

• Most sequencing technologies: paired-end short reads (two reads of 25-250 bp separated by a gap)

• IUETSPC’s sequencing technology: single-end long reads (200-400 bp)
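
The sequencer's output is a FASTQ file of single-end reads. As a rough illustration only, the sketch below reads such records and summarizes read lengths. It assumes four-line records and Phred+33 quality encoding; `parse_fastq` and the two example reads are hypothetical and not part of IonGAP.

```python
import io
from statistics import mean


def parse_fastq(handle):
    """Yield (read_id, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Assumption: Phred+33 encoding (quality = ASCII code - 33)
        yield header[1:], seq, [ord(c) - 33 for c in qual]


# Two made-up single-end reads, for illustration only
example = io.StringIO("@read1\nACGTAC\n+\nIIIIII\n@read2\nACGT\n+\n!!II\n")
reads = list(parse_fastq(example))
lengths = [len(seq) for _, seq, _ in reads]
print(len(reads), min(lengths), max(lengths), mean(lengths))
```

For real data the same generator could be fed an open file handle instead of the in-memory example.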

Genome assembly

Source: http://gcat.davidson.edu/phast/img/contig.png

• Genome assembler

– Overlap-layout-consensus (OLC) assemblers

– De Bruijn graph (DBG) assemblers
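
The two assembler families above differ in what they compare: OLC assemblers look for overlaps between whole reads, whereas DBG assemblers first cut every read into k-mers. The toy sketch below shows both ideas on two short strings. It is a simplification under stated assumptions (exact matching, no sequencing errors, no reverse complements); the function names are hypothetical and do not come from any of the assemblers discussed here.

```python
from collections import defaultdict


def suffix_prefix_overlap(a, b, min_len=1):
    """OLC idea: length of the longest suffix of a that equals a prefix of b.

    Returns 0 when no exact overlap of at least min_len exists.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def de_bruijn_graph(reads, k):
    """DBG idea: nodes are (k-1)-mers, and every k-mer in a read adds one edge."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


reads = ["ACGTC", "CGTCA"]
print(suffix_prefix_overlap(reads[0], reads[1], min_len=3))
print(de_bruijn_graph(reads, k=3))
```

Because a DBG discards the identity of the read each k-mer came from, the long-range information carried by a 200-400 bp read is largely lost, which is consistent with the comparison results reported later in the deck.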

Adapted from: http://gcat.davidson.edu/phast

Source: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2874646

Data preprocessing

• Removing adapters

• Quality control

Genome finishing

• Scaffolding

• Correction of assembly errors

– Discrepancies with reads or reference genome

– Repeat correction

The genome assembler

Data preprocessing → Genome assembly → Genome finishing → Genome analysis

Data set: Streptococcus agalactiae (686,800 reads)

Source: http://ngm.nationalgeographic.com/wallpaper/img/2013/01/08-streptococcus_1600.jpg

Comparative study of assemblers

• OLC assemblers

– MIRA

– Celera Assembler

– SGA

• DBG assemblers

– ABySS

– Ray

– Velvet

– SparseAssembler

– Minia

Results

• Number of contigs ≥ 500 bp

• N50 length

Conclusions

• MIRA is the most suitable assembler

• DBG assemblers are not well suited to long-read assembly

N50: 50% of the assembled genome is in contigs of length N50 or larger

Source: http://schatzlab.cshl.edu/teaching/2012/CSHL.Sequencing/Whole%20Genome%20Assembly%20and%20Alignment.pdf
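
The two metrics used in the comparison can be computed directly from a list of contig lengths. The sketch below is a minimal illustration; the function names and the example lengths are made up and are not results from the study.

```python
def n50(contig_lengths):
    """Largest length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def count_large_contigs(contig_lengths, min_len=500):
    """Number of contigs at or above the reporting threshold (500 bp in the deck)."""
    return sum(1 for length in contig_lengths if length >= min_len)


# Made-up contig lengths, for illustration only
lengths = [80, 70, 50, 40, 30, 20]
print(n50(lengths), count_large_contigs(lengths, min_len=50))
```

A good assembly tends to combine a small number of contigs with a large N50, which is why the two figures are reported together.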

MIRA assembler

Data preprocessing → Fast read comparison → Smith-Waterman alignment → Contig assembly → Automatic editing → Finished project
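
One of the stages named above is Smith-Waterman alignment, a dynamic-programming method for scoring the best local alignment between two sequences. The sketch below is a textbook version with a linear gap penalty, given only to illustrate the technique. The scoring values are arbitrary assumptions and this is not MIRA's implementation.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty).

    Only the previous row of the scoring matrix is kept, so memory is O(len(b)).
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("GGACGT", "ACGTCC"))
```

In an OLC setting a score like this could be compared against a minimum overlap score to decide whether two reads really overlap.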

Assembly parameter optimization

• Number of assembly iterations

• Uniform read distribution

• Separation of long repeats in different contigs

• Maximum number of times a contig can be rebuilt during an iteration

• Minimum number of reads per contig

• Minimum size of a contig for being considered as "large"

• Minimum read length

• Minimum repeat length

• Minimum overlap length

• Minimum overlap score

Conclusion

By default, the assembler is already set to its optimal configuration

A genome assembly and analysis pipeline

Data preprocessing → Genome assembly → Genome finishing → Genome analysis

aagttttggaaccattcgaaacagcacagctctaaaacttaccgattagaacatcatcta

aggtaatcgttttggaaccattcgaaacagcacagctctaaaactatcgctcaagcattc

gtatttgttttggagttttggaaccattcgaaacagcacagctctaaaacaacatttaac

tcataactatcatttagagtgttttggaaccattcgaaacagcacagctctaaaactaag

taacaagacagacttgaaactgttaagttttggaaccattcgaaacagcacagctctaaa

acttaccgattagaacatcatctaaggtaatcgttttggaaccattcgaaacagcacagc

tctaaaactatcgctcaagcattcgtatttgttttggagttttggaaccattcgaaacag

cacagctctaaaacatttccagtaagttcaaatttaacaaatgtgttttggaaccattcg

aaacagcacagctctaaaacagttttaacattaaatcacgtcttaaataagttttggaac

cattcgaaacagcacagctctaaaactaccgcaataagatcaccaatgttgtttgagttt

tggaaccattcgaaacagcacagctctaaaacgctattagtggaaacttttgaacgttat

gtgttttggaaccattcgaaacagcacagctctaaaacgaacaagatgtagatatgaaat

taacatttgttttggaaccattcgaaacagcacagctctaaaacctccaagtgctttaaa

gtcatttattttttgttttggaaccattcgaaacagcacagctctaaaacccatcatcaa

cctgaatgactccacatttcgttttggaaccattcgaaacagcacagctctaaaacgacc

cttatcaaacccaagcagaagtaactgttttggaaccattcgaaacagcacagctctaaa

acgatggtcgagcacttagaaaaccaataaaagttttggaaccattcgaaacagcacagc

tctaaaacgcttgtttcgctgtcgctcttgtttgacgggttttggaaccattcgaaacag

cacagctctaaaacaagcacaagaagcaactgttagaagacatagttttggaaccattcg

aaacagcacagctctaaaacacagctgaagagttagaaaaggctaatgttgttttggaac

cattcgaaacagcacagctctaaaacacatgacctgctgaacctgtccaccatatcgttt

tggaaccattcgaaacagcacagctctaaaactctgagatgagaacatatacttattctt

ttgttttggaaccattcgaaacagcacagctctaaaactctgagatgagaacatatactt

attcttttgttttggaaccattcgaaacagcacagctctaaaacctcgtagaaaattttc

ttttgagctttcgtaatcgcgccattcgtctcagcaggacttcagtttcgatgattcctt

gttattactgtgcttttactaatattataccatattttcgcctatcaagaaataatcctt

atcaataacatattgcggtaaatcatagagtcttctaggttctagaaagagtactgactt

ttgcattaaattgatgtattcacataattttataacttcatctttggtaagataagctcc

gctattaacaaaaaccaagagattctttttcgttaaataatggtaaacttgtataatttc

aaaacatttttcaaagatagtgtcgctctgtgtctcaattttgactcccagtgccttaat

gagttctaaaatcgtaatttcatcgtattctaaatcaagctcattctctagacactcaaa

tgcgataagttctgtaatagtagctgctaatttttctaccattgatttcacttctggctt

gene cas2

inference ab initio prediction:Prodigal:2.60

inference similar to AA sequence:UniProtKB:G3ECR3

locus_tag Sagalactiae_00003

product CRISPR-associated endoribonuclease Cas2

protein_id gnl|Prokka|Sagalactiae_00003

Contig name | Subject name | Score | % Identity

Sagalactiae_c8 | Streptococcus agalactiae 2603V/R strain 2603V/R 16S ribosomal RNA, complete sequence | 2846 | 100.00

Sagalactiae_c8 | Streptococcus agalactiae ATCC 13813 strain JCM 5671 16S ribosomal RNA, complete sequence | 2772 | 100.00

Sagalactiae_c10 | Streptococcus agalactiae 2603V/R strain 2603V/R 16S ribosomal RNA, complete sequence | 2846 | 100.00

Sagalactiae_c10 | Streptococcus agalactiae ATCC 13813 strain JCM 5671 16S ribosomal RNA, complete sequence | 2772 | 100.00

Data preprocessing

• Comparative study of trimmers (PRINSEQ, ERNE-filter, Trimmomatic)

– Removing adapters → 5’ trimming

– Discarding useless reads → Minimum length

– Removing low-quality regions

• Internal quality control of MIRA

– Sliding window trimming

Trimming parameters: maximum length; sliding window trimming (window length, quality threshold)
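
The preprocessing steps listed above (5’ trimming, sliding window trimming by window length and quality threshold, and a minimum-length filter) can be sketched as follows. This is a simplified illustration of the general technique, not the exact behavior of PRINSEQ, ERNE-filter, Trimmomatic or MIRA; the function names and default values are assumptions.

```python
def sliding_window_trim(seq, quals, window=4, threshold=20):
    """Cut the read at the start of the first window whose mean quality is below threshold."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    for start in range(len(quals) - window + 1):
        chunk = quals[start:start + window]
        if sum(chunk) / window < threshold:
            return seq[:start], quals[:start]
    return seq, quals


def preprocess(seq, quals, trim_5prime=0, min_length=1, window=4, threshold=20):
    """Apply 5' trimming, sliding window trimming, then the minimum-length filter.

    Returns None when the read is discarded as too short.
    """
    seq, quals = seq[trim_5prime:], quals[trim_5prime:]
    seq, quals = sliding_window_trim(seq, quals, window, threshold)
    return (seq, quals) if len(seq) >= min_length else None


# Made-up read whose quality drops towards the 3' end
read = "ACGTACGTAC"
qualities = [30, 30, 30, 30, 30, 30, 10, 10, 10, 10]
print(sliding_window_trim(read, qualities))
print(preprocess(read, qualities, trim_5prime=2, min_length=4))
```

A larger window smooths over isolated low-quality bases, while a higher threshold trims more aggressively and shortens reads.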

Mauve Assembly Metrics

Data preprocessing

Conclusion

Read preprocessing has negative effects on the assembly

• An extensive evaluation of read trimming effects on Illumina NGS data analysis (Del Fabbro C, Scalabrin S, Morgante M, Giorgi FM. PLoS ONE 2013):

"For high quality values, trimmed datasets produce slightly more fragmented assemblies, probably due to a more stringent trimming that reflects also on lower computational needs."

• MIRA user manual (Chevreux B):

"For heavens' sake: do NOT try to clip or trim by quality yourself. Do NOT try to remove standard sequencing adaptors yourself. Just leave the data alone!"

Genome finishing

• Scaffolding

– Impossible: no mate-pair reads

• Correction of assembly errors

– Simplifier: selective elimination of redundant sequences

Simplifier

• Only eliminates completely redundant contigs

• Time-consuming

• Natural repeats in genome → Risky

Conclusion

It is better to leave postprocessing in the user's hands
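
To make the idea of eliminating completely redundant contigs concrete, the sketch below drops any contig that occurs in full, on either strand, inside another retained contig. It is a hypothetical illustration using exact string matching and is not the Simplifier tool's algorithm; the contig names and sequences are made up.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T in either case)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def drop_contained_contigs(contigs):
    """Remove contigs found in full, on either strand, inside an already kept contig."""
    kept = {}
    # Longest first, so each contig is only compared against ones at least as long
    for name, seq in sorted(contigs.items(), key=lambda item: len(item[1]), reverse=True):
        rc = reverse_complement(seq)
        if any(seq in other or rc in other for other in kept.values()):
            continue  # fully redundant: skip it
        kept[name] = seq
    return kept


contigs = {"c1": "ACGTTGCAAGG", "c2": "TTGCA", "c3": "CCTTG", "c4": "GGGGG"}
print(sorted(drop_contained_contigs(contigs)))
```

The pairwise comparison grows quadratically with the number of contigs, and a contig that is a genuine repeat in the genome would also be dropped, which matches the cost and risk noted above.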

Genome analysis

• Quality analysis of reads and contigs (FastQC)

• Taxonomic classification (BLAST)

• Genome annotation (Prokka)

If reference sequence provided:

• Genome alignment and coverage analysis (MUMmer, Circos, BLAST, Circoletto, Mauve, genoPlotR)

• Contig reordering (Mauve)

Genome annotation (Prokka) shown in the UGENE genome viewer

Genome alignment and coverage plot generated by Circos, BLAST and Circoletto

Contig reordering (Mauve) shown in the Mauve genome viewer

IonGAP Web service

Functioning and implementation

• Web user interface

• Input Web form

• Two independent modules (daemons)

– Assembly module

– Analysis module

• User notification via email

• Hosting: ETSII’s Computing Center

– Virtual machine (Ubuntu 12.04)

– Dual-core 64-bit processor

– 17 GB RAM

Web service demo

IonGAP | an integrated Genome Assembly Platform for Ion Torrent data (http://193.145.101.223/)

Genome assembly with IonGAP

Trypanosoma cruzi

• Extremely repetitive genome

• Data explosion

• Data filtering: 900 MB = 1,500,000 reads

Parallel assembly of large genomes

Parallel genome assembly

• Parallel computing: Computer cluster

• Contrail

– Parallel assembly on Hadoop

• ETSII’s Computing Center

– Cluster of 108 computers

– Hadoop installation
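
Hadoop distributes work using the MapReduce pattern. The single-process sketch below applies that pattern to k-mer counting, a typical first step when building a de Bruijn graph, purely to show how the work splits into independent map and reduce steps. It uses no Hadoop API and is not Contrail's code; all function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_read(read, k):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the values collected for each k-mer."""
    return {kmer: sum(values) for kmer, values in groups.items()}


def count_kmers(reads, k):
    """Run map, shuffle and reduce in one process; a cluster would spread reads across nodes."""
    mapped = chain.from_iterable(map_read(read, k) for read in reads)
    return reduce_counts(shuffle(mapped))


print(count_kmers(["ACGTA", "CGTAC"], k=3))
```

Each read is mapped independently of the others, which is what lets the map step scale across many machines.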

Parallel assembly with Contrail

Conclusions

• Good performance

– Parallel computing is the future of assembly

• Poor results

– Contrail uses DBG → Not suitable for long reads

Conclusions

• IonGAP meets IUETSPC's need for an automated tool for the assembly and preliminary analysis of Ion Torrent data

• Making it available to the scientific community is intended to stimulate low-cost genome research and the development of other customized solutions

• The S. agalactiae genome has been successfully assembled, and a manuscript is being prepared for publication in a scientific journal

Future work

• New options and features

• Cloud assembly with Amazon Web Services

• Parallel OLC assembly on Hadoop

• High performance computing

– ITER’s Teide HPC – September 2014

Conclusions

Multidisciplinary work is the way to tackle the new science of the 21st century

Many thanks for your attention