IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Author Adrián Báez Ortega Supervisors Marcos Colebrook Santamaría José Luis Roda García Date 17/07/2014 IonGAP

description

Computer Engineering Degree Final Project. Universidad de La Laguna, Spain, July 2014. Ion Torrent technology allows genome sequencing at reduced cost; however, its major drawback is the lack of tools dedicated to processing and assembling Ion Torrent reads. IonGAP is a free, graphical, integrated pipeline designed for the assembly and subsequent analysis of Ion Torrent sequencing data. Both its components and their configuration are based on a research process aimed at finding the optimal combination of tools for obtaining good results from single-end reads generated by the Ion Torrent PGM sequencer, mainly from bacterial genomic material.

Transcript of IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Page 1: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Author

Adrián Báez Ortega

Supervisors

Marcos Colebrook Santamaría

José Luis Roda García

Date

17/07/2014

IonGAP

Page 2: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Contents

1. Introduction

2. Objective of the project

3. State of the art

4. The genome assembler

5. A genome assembly and analysis pipeline

6. IonGAP Web service

7. Parallel assembly of large genomes

8. Conclusions

IonGAP 1

Page 3: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

DNA

Genomics

Genome Proteins

Genes

Double helix

Biomedicine Life

Introduction

IonGAP 2

Page 4: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Genome sequencing

Genome de novo assembly

Adapted from: http://en.wikipedia.org/wiki/Genomic_library#mediaviewer/File:Whole_genome_shotgun_sequencing_versus_Hierarchical_shotgun_sequencing.png

Introduction

IonGAP 3

Page 5: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Introduction

Genomics

Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias

Computer Science

Escuela Técnica Superior de Ingeniería Informática

Bioinformatics

IonGAP 4

Page 6: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Objective of the project

To develop an easy-to-use integrated software platform that offers optimally configured processing and de novo assembly of genomic data obtained by Ion Torrent sequencing, complemented with several result analysis stages.

IonGAP 5

Page 7: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Most sequencing technologies: paired-end short reads (two reads of 25-250 bp each, separated by a gap)

IUETSPC’s sequencing technology: single-end long reads (200-400 bp)

Genome sequencing: genome fragments → FASTQ file

State of the art

IonGAP 6

Page 8: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Source: http://gcat.davidson.edu/phast/img/contig.png

Genome assembly

State of the art

IonGAP 7

Page 9: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Genome assembly

• Genome assembler

– Overlap-layout-consensus (OLC) assemblers

– De Bruijn graph (DBG) assemblers
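The difference between the two assembler families can be illustrated with a toy sketch. It is not how any production assembler is implemented; the function names, the three short reads, the minimum overlap of 3 and k = 4 are all assumptions chosen for illustration. In the OLC view the graph nodes are whole reads joined by suffix-prefix overlaps, whereas in the DBG view each read is first chopped into k-mers.

```python
from collections import defaultdict


def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def overlap_graph(reads, min_len=3):
    """OLC view: nodes are whole reads, edges are suffix-prefix overlaps."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                length = suffix_prefix_overlap(a, b, min_len)
                if length:
                    edges[(i, j)] = length
    return edges


def de_bruijn_graph(reads, k=4):
    """DBG view: nodes are (k-1)-mers; every k-mer in a read adds one edge."""
    graph = defaultdict(set)
    for read in reads:
        for pos in range(len(read) - k + 1):
            kmer = read[pos:pos + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Made-up reads, for illustration only.
reads = ["ACGTAC", "GTACGG", "ACGGTT"]
print(overlap_graph(reads))
print(de_bruijn_graph(reads))
```

The OLC graph keeps each read intact, while the DBG discards read identity once the k-mers are extracted, which hints at why read length matters when choosing between the two approaches.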

State of the art

IonGAP 8

Page 10: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Genome assembly

• Genome assembler

– Overlap-layout-consensus (OLC) assemblers

– De Bruijn graph (DBG) assemblers

Adapted from: http://gcat.davidson.edu/phast

State of the art

IonGAP 9

Page 12: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Source: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2874646

State of the art

IonGAP 11

Page 13: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Data preprocessing

• Removing adapters

• Quality control

State of the art

IonGAP 12

Page 14: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Data preprocessing

• Quality control

State of the art

IonGAP 13

Page 15: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Genome finishing

• Scaffolding

• Correction of assembly errors

– Discrepancies with reads or reference genome

– Repeat correction

State of the art

IonGAP 14

Page 18: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

The genome assembler

IonGAP 17

Data preprocessing

Genome assembly

Genome finishing

Genome analysis

Page 19: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

The genome assembler

Data set

Streptococcus agalactiae (686,800 reads)

IonGAP 18

Source: http://ngm.nationalgeographic.com/wallpaper/img/2013/01/08-streptococcus_1600.jpg

Page 20: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

The genome assembler

Comparative study of assemblers

• OLC assemblers

– MIRA

– Celera Assembler

– SGA

• DBG assemblers

– ABySS

– Ray

– Velvet

– SparseAssembler

– Minia

IonGAP 19

Page 21: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Results

• Number of contigs ≥ 500 bp

• N50 length

Conclusions

• MIRA is the most suitable assembler

• DBG is not suitable for long-read assembly

The genome assembler

IonGAP 20

Page 22: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Results

• Number of contigs ≥ 500 bp

• N50 length

Conclusions

• MIRA is the most suitable assembler

• DBG is not suitable for long-read assembly

N50: 50% of the genome is in contigs of length N50 or longer

Source: http://schatzlab.cshl.edu/teaching/2012/CSHL.Sequencing/Whole%20Genome%20Assembly%20and%20Alignment.pdf
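The two metrics used in the comparison can be computed from a list of contig lengths. The sketch below follows the usual definition of N50; the function names and the contig lengths are made up for illustration and are not results from the study.

```python
def n50(contig_lengths):
    """Return the N50 length: the largest length L such that contigs of
    length >= L together cover at least half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def count_large_contigs(contig_lengths, min_length=500):
    """Number of contigs of at least `min_length` base pairs."""
    return sum(1 for length in contig_lengths if length >= min_length)


# Hypothetical contig lengths in bp, for illustration only.
lengths = [8000, 7000, 5000, 4000, 3000, 450, 200]
print("N50:", n50(lengths))
print("Contigs >= 500 bp:", count_large_contigs(lengths))
```

A higher N50 together with a lower contig count generally indicates a less fragmented assembly, although neither metric says anything about correctness.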

The genome assembler

IonGAP 21

Page 26: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

MIRA assembler

The genome assembler

IonGAP 25

Automatic editing

Data preprocessing

Fast read comparison

Smith-Waterman alignment

Contig assembly

Finished project
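The Smith-Waterman alignment stage can be illustrated with the textbook dynamic-programming recurrence. This is only a sketch of the general algorithm: the scoring values (match 2, mismatch -1, linear gap -2) and the function name are assumptions, and MIRA's own implementation and scoring scheme are not reproduced here.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between `a` and `b` (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two made-up reads sharing the segment "ACGT".
print(smith_waterman_score("TTACGTAA", "GGACGTCC"))
```

Because the full recurrence is quadratic in read length, it makes sense for an assembler to run it only on read pairs that a faster comparison step has already flagged as likely to overlap.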

Page 27: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Assembly parameter optimization

• Number of assembly iterations

• Uniform read distribution

• Separation of long repeats into different contigs

• Maximum number of times a contig can be rebuilt during an iteration

• Minimum number of reads per contig

• Minimum size for a contig to be considered "large"

• Minimum read length

• Minimum repeat length

• Minimum overlap length

• Minimum overlap score

Conclusion

The assembler's default settings are already its optimal configuration

The genome assembler

IonGAP 26

Page 28: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

A genome assembly and analysis pipeline

IonGAP 27

Data preprocessing

Genome assembly

Genome finishing

Genome analysis

Page 29: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

aagttttggaaccattcgaaacagcacagctctaaaacttaccgattagaacatcatcta

aggtaatcgttttggaaccattcgaaacagcacagctctaaaactatcgctcaagcattc

gtatttgttttggagttttggaaccattcgaaacagcacagctctaaaacaacatttaac

tcataactatcatttagagtgttttggaaccattcgaaacagcacagctctaaaactaag

taacaagacagacttgaaactgttaagttttggaaccattcgaaacagcacagctctaaa

acttaccgattagaacatcatctaaggtaatcgttttggaaccattcgaaacagcacagc

tctaaaactatcgctcaagcattcgtatttgttttggagttttggaaccattcgaaacag

cacagctctaaaacatttccagtaagttcaaatttaacaaatgtgttttggaaccattcg

aaacagcacagctctaaaacagttttaacattaaatcacgtcttaaataagttttggaac

cattcgaaacagcacagctctaaaactaccgcaataagatcaccaatgttgtttgagttt

tggaaccattcgaaacagcacagctctaaaacgctattagtggaaacttttgaacgttat

gtgttttggaaccattcgaaacagcacagctctaaaacgaacaagatgtagatatgaaat

taacatttgttttggaaccattcgaaacagcacagctctaaaacctccaagtgctttaaa

gtcatttattttttgttttggaaccattcgaaacagcacagctctaaaacccatcatcaa

cctgaatgactccacatttcgttttggaaccattcgaaacagcacagctctaaaacgacc

cttatcaaacccaagcagaagtaactgttttggaaccattcgaaacagcacagctctaaa

acgatggtcgagcacttagaaaaccaataaaagttttggaaccattcgaaacagcacagc

tctaaaacgcttgtttcgctgtcgctcttgtttgacgggttttggaaccattcgaaacag

cacagctctaaaacaagcacaagaagcaactgttagaagacatagttttggaaccattcg

aaacagcacagctctaaaacacagctgaagagttagaaaaggctaatgttgttttggaac

cattcgaaacagcacagctctaaaacacatgacctgctgaacctgtccaccatatcgttt

tggaaccattcgaaacagcacagctctaaaactctgagatgagaacatatacttattctt

ttgttttggaaccattcgaaacagcacagctctaaaactctgagatgagaacatatactt

attcttttgttttggaaccattcgaaacagcacagctctaaaacctcgtagaaaattttc

ttttgagctttcgtaatcgcgccattcgtctcagcaggacttcagtttcgatgattcctt

gttattactgtgcttttactaatattataccatattttcgcctatcaagaaataatcctt

atcaataacatattgcggtaaatcatagagtcttctaggttctagaaagagtactgactt

ttgcattaaattgatgtattcacataattttataacttcatctttggtaagataagctcc

gctattaacaaaaaccaagagattctttttcgttaaataatggtaaacttgtataatttc

aaaacatttttcaaagatagtgtcgctctgtgtctcaattttgactcccagtgccttaat

gagttctaaaatcgtaatttcatcgtattctaaatcaagctcattctctagacactcaaa

tgcgataagttctgtaatagtagctgctaatttttctaccattgatttcacttctggctt

gene cas2

inference ab initio prediction:Prodigal:2.60

inference similar to AA sequence:UniProtKB:G3ECR3

locus_tag Sagalactiae_00003

product CRISPR-associated endoribonuclease Cas2

protein_id gnl|Prokka|Sagalactiae_00003

Contig name | Subject name | Score | % Identity

Sagalactiae_c8 | Streptococcus agalactiae 2603V/R strain 2603V/R 16S ribosomal RNA, complete sequence | 2846 | 100.00

Sagalactiae_c8 | Streptococcus agalactiae ATCC 13813 strain JCM 5671 16S ribosomal RNA, complete sequence | 2772 | 100.00

Sagalactiae_c10 | Streptococcus agalactiae 2603V/R strain 2603V/R 16S ribosomal RNA, complete sequence | 2846 | 100.00

Sagalactiae_c10 | Streptococcus agalactiae ATCC 13813 strain JCM 5671 16S ribosomal RNA, complete sequence | 2772 | 100.00

A genome assembly and analysis pipeline

IonGAP 28

Page 30: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

A genome assembly and analysis pipeline

IonGAP 29

Genome assembly

Data preprocessing

Genome finishing

Genome analysis

Page 31: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Data preprocessing

• Comparative study of trimmers (PRINSEQ, ERNE-filter, Trimmomatic)

– Removing adapters → 5’ trimming

– Discarding useless reads → Minimum length

– Removing low-quality regions

• Internal quality control of MIRA

– Sliding window trimming

Maximum length

Sliding window trimming

Window length

Quality threshold
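The trimming operations listed above (5’ trimming, sliding window trimming, minimum length filter) can be sketched as follows. This is one common variant of sliding window trimming, which cuts the read at the first window whose mean quality falls below the threshold; it is not claimed to match the exact behaviour of PRINSEQ, ERNE-filter, Trimmomatic or MIRA. The function names and all parameter values are assumptions for illustration, not the settings used in the study.

```python
def sliding_window_trim(sequence, qualities, window=4, threshold=20):
    """Cut the read at the start of the first window whose mean quality
    is below `threshold`; return the read unchanged if no window fails."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    for start in range(len(qualities) - window + 1):
        chunk = qualities[start:start + window]
        if sum(chunk) / window < threshold:
            return sequence[:start], qualities[:start]
    return sequence, qualities


def preprocess(sequence, qualities, trim_5prime=0, min_length=50,
               window=4, threshold=20):
    """5' trimming, then sliding window trimming, then minimum length filter.
    Returns None when the read is discarded."""
    sequence, qualities = sequence[trim_5prime:], qualities[trim_5prime:]
    sequence, qualities = sliding_window_trim(sequence, qualities,
                                              window, threshold)
    if len(sequence) < min_length:
        return None
    return sequence, qualities


# Made-up read whose quality drops towards the 3' end.
seq = "ACGTACGTAC"
quals = [30, 30, 30, 30, 30, 30, 10, 10, 10, 10]
print(sliding_window_trim(seq, quals))
print(preprocess(seq, quals, trim_5prime=2, min_length=3))
```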

A genome assembly and analysis pipeline

IonGAP 30

Page 32: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

A genome assembly and analysis pipeline

Data preprocessing

Mauve Assembly Metrics

IonGAP 31

Page 33: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Data preprocessing

Conclusion

Read preprocessing has negative effects on the assembly

• An extensive evaluation of read trimming effects on Illumina NGS data analysis (Del Fabbro C, Scalabrin S, Morgante M, Giorgi FM. PLoS ONE 2013):

"For high quality values, trimmed datasets produce slightly more fragmented assemblies, probably due to a more stringent trimming that reflects also on lower computational needs."

• MIRA user manual (Chevreux B):

"For heavens' sake: do NOT try to clip or trim by quality yourself. Do NOT try to remove standard sequencing adaptors yourself. Just leave the data alone!"

A genome assembly and analysis pipeline

IonGAP 32

Page 34: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

A genome assembly and analysis pipeline

IonGAP 33

Data preprocessing

Genome finishing

Genome assembly

Genome analysis

Page 35: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Genome finishing

• Scaffolding

– Impossible: no mate-pair reads

• Correction of assembly errors

– Simplifier: selective elimination of redundant sequences

A genome assembly and analysis pipeline

IonGAP 34

Page 36: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Genome finishing

Simplifier

• Only eliminates fully redundant contigs

• Time-consuming

• Natural repeats in genome → Risky
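The idea of eliminating fully redundant contigs can be illustrated with a small sketch. This is not Simplifier's actual algorithm: it only drops contigs that occur exactly, on either strand, inside a contig that is at least as long, and the function names and sequences are made up for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def drop_contained_contigs(contigs):
    """Given {name: sequence}, drop every contig found exactly (on either
    strand) inside an already kept contig of equal or greater length."""
    # Longest first, so each contig is only compared against longer ones.
    ordered = sorted(contigs.items(), key=lambda item: len(item[1]),
                     reverse=True)
    kept = {}
    for name, seq in ordered:
        rc = reverse_complement(seq)
        redundant = any(seq in other or rc in other
                        for other in kept.values())
        if not redundant:
            kept[name] = seq
    return kept


# Hypothetical contigs: c2 lies inside c1, c3 inside its reverse strand.
contigs = {"c1": "ACGTTGCAAC", "c2": "GTTGC", "c3": "CAACG", "c4": "TTTTT"}
print(sorted(drop_contained_contigs(contigs)))
```

The sketch also makes the risk visible: a short contig that represents a genuine extra copy of a repeat would be removed just like an assembly artifact, because containment alone cannot tell the two apart.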

Conclusion

It is better to leave postprocessing in the user's hands

A genome assembly and analysis pipeline

IonGAP 35

Page 37: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

A genome assembly and analysis pipeline

IonGAP 36

Data preprocessing

Genome analysis

Genome assembly

Genome finishing

Page 38: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Genome analysis

• Quality analysis of reads and contigs (FastQC)

• Taxonomic classification (BLAST)

• Genome annotation (Prokka)

If reference sequence provided:

• Genome alignment and coverage analysis (MUMmer, Circos, BLAST, Circoletto, Mauve, genoPlotR)

• Contig reordering (Mauve)

A genome assembly and analysis pipeline

IonGAP 37

Page 39: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Genome analysis

• Taxonomic classification (BLAST)

• Genome annotation (Prokka)

A genome assembly and analysis pipeline

IonGAP 38

Page 40: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Genome analysis

• Genome annotation (Prokka)

UGENE genome viewer

A genome assembly and analysis pipeline

IonGAP 39

Page 41: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Genome analysis

If reference sequence provided:

• Genome alignment and coverage analysis (MUMmer, Circos, BLAST, Circoletto, Mauve, genoPlotR)

A genome assembly and analysis pipeline

IonGAP 40

Page 42: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Generated by Circos, BLAST and Circoletto

A genome assembly and analysis pipeline

IonGAP 41

Page 43: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Genome analysis

If reference sequence provided:

• Contig reordering (Mauve)

A genome assembly and analysis pipeline

IonGAP 42

Mauve genome viewer

Page 45: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Functioning and implementation

• Web user interface

• Input Web form

• Two independent modules (daemons)

– Assembly module

– Analysis module

• User notification via email

IonGAP Web service

IonGAP 44

Page 46: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Functioning and implementation

• Hosting: ETSII’s Computing Center

– Virtual machine (Ubuntu 12.04)

– Dual-core 64-bit processor

– 17 GB RAM

IonGAP Web service

IonGAP 45

Page 47: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

IonGAP Web service

IonGAP 46

Page 48: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

IonGAP Web service

IonGAP 47

Page 49: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Web service demo

IonGAP | an integrated Genome Assembly Platform for Ion Torrent data

IonGAP Web service

IonGAP 48

(http://193.145.101.223/)

Page 50: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Genome assembly with IonGAP

Trypanosoma cruzi

• Extremely repetitive genome

• Data explosion

• Data filtering: 900 MB = 1,500,000 reads

IonGAP Web service

IonGAP 49

Page 51: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Parallel assembly of large genomes

Parallel genome assembly

• Parallel computing: Computer cluster

• Contrail

– Parallel assembly on Hadoop

• ETSII’s Computing Center

– Cluster of 108 computers

– Hadoop installation
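The MapReduce pattern that underlies Hadoop-based assembly can be sketched in plain Python with k-mer counting, a typical first step when building a de Bruijn graph. This is only an illustration of the programming model on a single machine; it does not use Hadoop and does not reproduce Contrail's actual jobs. The function names, the reads and k = 3 are assumptions.

```python
from collections import defaultdict
from itertools import chain


def map_read(read, k):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    for pos in range(len(read) - k + 1):
        yield read[pos:pos + k], 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the occurrences of one k-mer."""
    return key, sum(values)


def count_kmers(reads, k=3):
    """Run map, shuffle and reduce in sequence."""
    mapped = chain.from_iterable(map_read(read, k) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Made-up reads, for illustration only.
print(count_kmers(["ACGT", "CGTA"], k=3))
```

On a cluster, the map calls would run on different nodes over different chunks of the read file, and the framework would take care of the shuffle between them and the reducers.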

IonGAP 50

Page 52: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Parallel assembly of large genomes

Parallel assembly with Contrail

IonGAP 51

Page 53: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Parallel assembly with Contrail

Conclusions

• Good performance

– Parallel computing is the future of assembly

• Bad results

– Contrail uses DBG → Not suitable for long reads

Parallel assembly of large genomes

IonGAP 52

Page 54: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

• IonGAP meets IUETSPC's need for an automated tool for the assembly and preliminary analysis of Ion Torrent data

• It is made available to the scientific community to stimulate low-cost genome research and the development of other customized solutions

• The S. agalactiae genome has been successfully assembled, and a manuscript is being prepared for publication in a scientific journal

Conclusions

IonGAP 53

Page 55: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Future work

• New options and features

• Cloud assembly with Amazon Web Services

• Parallel OLC assembly on Hadoop

• High performance computing

– ITER’s Teide HPC – September 2014

Conclusions

IonGAP 54

Page 56: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Conclusions

Multidisciplinary work is the way to tackle the new science of the 21st century

IonGAP 55

Genomics

Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias

Computer Science

Escuela Técnica Superior de Ingeniería Informática

Bioinformatics

Page 57: IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

Many thanks for your attention

IonGAP 56