IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

  • Date posted

    27-Aug-2014
  • Category

    Software

description

Computer Engineering degree final project. Universidad de La Laguna, Spain, July 2014. Ion Torrent technology allows genome sequencing at reduced cost; however, its major drawback is the lack of tools dedicated to processing and assembling Ion Torrent reads. IonGAP is a free graphical integrated pipeline designed for the assembly and subsequent analysis of Ion Torrent sequencing data. Both its components and their configuration are based on a research process aimed at discovering the optimal combination of tools for obtaining good results from single-end reads generated by the Ion Torrent PGM sequencer, mainly from bacterial genomic material.

Transcript of IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

  • Author: Adrián Báez Ortega. Supervisors: Marcos Colebrook Santamaría, José Luis Roda García. Date: 17/07/2014. IonGAP
  • Contents: 1. Introduction; 2. Objective of the project; 3. State of the art; 4. The genome assembler; 5. A genome assembly and analysis pipeline; 6. IonGAP Web service; 7. Parallel assembly of large genomes; 8. Conclusions
  • Introduction: DNA, genomics, genome, proteins, genes, double helix, biomedicine, life
  • Introduction: genome sequencing; genome de novo assembly. Adapted from: http://en.wikipedia.org/wiki/Genomic_library#mediaviewer/File:Whole_genome_shotgun_sequencing_versus_Hierarchical_shotgun_sequencing.png
  • Introduction: Genomics (Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias) and Computer Science (Escuela Técnica Superior de Ingeniería Informática): Bioinformatics
  • Objective of the project: the development of an easy-to-use integrated software platform that offers optimally configured processing and de novo assembly of genomic data obtained by Ion Torrent sequencing, complemented with several result analysis stages.
  • State of the art, genome sequencing: most sequencing technologies produce paired-end short reads (two reads of 25-250 bp separated by a gap); the IUETSPC's sequencing technology produces single-end long reads (200-400 bp). DNA is drawn 5' to 3'. Genome sequencing turns genome fragments into a FASTQ file.
  • State of the art, genome assembly. Source: http://gcat.davidson.edu/phast/img/contig.png
  • State of the art, genome assembly. Genome assembler types: overlap-layout-consensus (OLC) assemblers; De Bruijn graph (DBG) assemblers. Adapted from: http://gcat.davidson.edu/phast
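The two assembler families named above differ in the graph they build from the reads. As a rough illustration only (not the implementation of any assembler mentioned in these slides), the sketch below computes a suffix-prefix overlap, the basic operation behind OLC assembly, and builds a De Bruijn graph from k-mers. The function names, the example reads, and the values of `k` and `min_length` are assumptions chosen for the example.

```python
from collections import defaultdict


def overlap_length(a, b, min_length=3):
    """Longest suffix of `a` that equals a prefix of `b` (0 if shorter than min_length)."""
    for n in range(min(len(a), len(b)), min_length - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in at least one read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Hypothetical toy reads, far shorter than real Ion Torrent reads.
reads = ["ACGTAC", "GTACG"]
print(overlap_length(reads[0], reads[1]))  # OLC view: pairwise read overlap
print(de_bruijn_graph(reads, k=3))         # DBG view: k-mer adjacency
```

In the OLC view whole reads stay intact and are compared pairwise, whereas the DBG view breaks every read into k-mers, which discards part of the information a long read carries.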
  • State of the art. Source: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2874646
  • State of the art, data preprocessing: removing adapters; quality control.
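A minimal sketch of what a read-level quality control step can look like, assuming FASTQ quality strings in Phred+33 encoding. The thresholds `min_length` and `min_mean_quality` and both function names are hypothetical; they are not the settings used by IonGAP or by any trimmer named in these slides.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def passes_quality_control(sequence, quality_line, min_length=50, min_mean_quality=20):
    """Keep a read only if it is long enough and its mean quality is high enough."""
    if len(sequence) < min_length:
        return False
    scores = phred_scores(quality_line)
    if not scores:
        return False
    return sum(scores) / len(scores) >= min_mean_quality


# Hypothetical reads: 'I' encodes Phred 40, '!' encodes Phred 0.
print(passes_quality_control("ACGT", "IIII", min_length=4))
print(passes_quality_control("ACGT", "!!!!", min_length=4))
```

Real trimmers usually go further and cut low-quality regions out of a read rather than discarding the whole read.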
  • State of the art, genome finishing: scaffolding; correction of assembly errors; discrepancies with reads or reference genome; repeat correction.
  • The genome assembler: data preprocessing, genome assembly, genome finishing, genome analysis.
  • The genome assembler, data set: Streptococcus agalactiae (686,800 reads). Source: http://ngm.nationalgeographic.com/wallpaper/img/2013/01/08-streptococcus_1600.jpg
  • The genome assembler, comparative study of assemblers. OLC assemblers: MIRA, Celera Assembler, SGA. DBG assemblers: ABySS, Ray, Velvet, SparseAssembler, Minia.
  • The genome assembler, results. Metrics: number of contigs ≥ 500 bp; N50 length (50% of the genome is in contigs at least as long as the N50). Conclusions: MIRA is the most suitable assembler; DBG assemblers are not well suited to long-read assembly. Source: http://schatzlab.cshl.edu/teaching/2012/CSHL.Sequencing/Whole%20Genome%20Assembly%20and%20Alignment.pdf
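The N50 length used as a metric here can be computed directly from the contig lengths of an assembly. The sketch below is one common way to do it; the function name and the example lengths are made up for illustration and are not results from the study.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


# Hypothetical contig lengths in bp (total 1500, so half is 750).
print(n50([100, 200, 300, 400, 500]))  # prints 400
```

A higher N50 with fewer contigs generally indicates a more contiguous assembly, which is why the two metrics are reported together.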
  • The genome assembler, MIRA assembler: automatic editing; data preprocessing; fast read comparison; Smith-Waterman alignment; contig assembly; finished project.
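Smith-Waterman alignment, listed above, is a dynamic programming method for local alignment. The following is a minimal score-only sketch with a linear gap penalty; the scoring values (`match`, `mismatch`, `gap`) and the function name are assumptions for illustration and do not reflect MIRA's actual implementation or parameters.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                                  # start a new local alignment
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
            best = max(best, score[i][j])
    return best


# Hypothetical sequences sharing the substring "ACG".
print(smith_waterman_score("TTACGTT", "GGACGGG"))  # prints 6
```

Because the full matrix costs time proportional to the product of the two lengths, a fast read comparison step is typically used first to decide which read pairs are worth aligning this way.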
  • The genome assembler, assembly parameter optimization. Parameters: number of assembly iterations; uniform read distribution; separation of long repeats in different contigs; maximum number of times a contig can be rebuilt during an iteration; minimum number of reads per contig; minimum size of a contig to be considered "large"; minimum read length; minimum repeat length; minimum overlap length; minimum overlap score. Conclusion: the assembler's default settings are already its optimal configuration.
  • A genome assembly and analysis pipeline: data preprocessing, genome assembly, genome finishing, genome analysis.
  • A genome assembly and analysis pipeline (example outputs)

    Sequence:

    aagttttggaaccattcgaaacagcacagctctaaaacttaccgattagaacatcatcta
    aggtaatcgttttggaaccattcgaaacagcacagctctaaaactatcgctcaagcattc
    gtatttgttttggagttttggaaccattcgaaacagcacagctctaaaacaacatttaac
    tcataactatcatttagagtgttttggaaccattcgaaacagcacagctctaaaactaag
    taacaagacagacttgaaactgttaagttttggaaccattcgaaacagcacagctctaaa
    acttaccgattagaacatcatctaaggtaatcgttttggaaccattcgaaacagcacagc
    tctaaaactatcgctcaagcattcgtatttgttttggagttttggaaccattcgaaacag
    cacagctctaaaacatttccagtaagttcaaatttaacaaatgtgttttggaaccattcg
    aaacagcacagctctaaaacagttttaacattaaatcacgtcttaaataagttttggaac
    cattcgaaacagcacagctctaaaactaccgcaataagatcaccaatgttgtttgagttt
    tggaaccattcgaaacagcacagctctaaaacgctattagtggaaacttttgaacgttat
    gtgttttggaaccattcgaaacagcacagctctaaaacgaacaagatgtagatatgaaat
    taacatttgttttggaaccattcgaaacagcacagctctaaaacctccaagtgctttaaa
    gtcatttattttttgttttggaaccattcgaaacagcacagctctaaaacccatcatcaa
    cctgaatgactccacatttcgttttggaaccattcgaaacagcacagctctaaaacgacc
    cttatcaaacccaagcagaagtaactgttttggaaccattcgaaacagcacagctctaaa
    acgatggtcgagcacttagaaaaccaataaaagttttggaaccattcgaaacagcacagc
    tctaaaacgcttgtttcgctgtcgctcttgtttgacgggttttggaaccattcgaaacag
    cacagctctaaaacaagcacaagaagcaactgttagaagacatagttttggaaccattcg
    aaacagcacagctctaaaacacagctgaagagttagaaaaggctaatgttgttttggaac
    cattcgaaacagcacagctctaaaacacatgacctgctgaacctgtccaccatatcgttt
    tggaaccattcgaaacagcacagctctaaaactctgagatgagaacatatacttattctt
    ttgttttggaaccattcgaaacagcacagctctaaaactctgagatgagaacatatactt
    attcttttgttttggaaccattcgaaacagcacagctctaaaacctcgtagaaaattttc
    ttttgagctttcgtaatcgcgccattcgtctcagcaggacttcagtttcgatgattcctt
    gttattactgtgcttttactaatattataccatattttcgcctatcaagaaataatcctt
    atcaataacatattgcggtaaatcatagagtcttctaggttctagaaagagtactgactt
    ttgcattaaattgatgtattcacataattttataacttcatctttggtaagataagctcc
    gctattaacaaaaaccaagagattctttttcgttaaataatggtaaacttgtataatttc
    aaaacatttttcaaagatagtgtcgctctgtgtctcaattttgactcccagtgccttaat
    gagttctaaaatcgtaatttcatcgtattctaaatcaagctcattctctagacactcaaa

    Gene annotation:

    gene: cas2
    inference: ab initio prediction:Prodigal:2.60
    inference: similar to AA sequence:UniProtKB:G3ECR3
    locus_tag: Sagalactiae_00003
    product: CRISPR-associated endoribonuclease Cas2
    protein_id: gnl|Prokka|Sagalactiae_00003

    Contig name | Subject name | Score | % Identity
    Sagalactiae_c8 | Streptococcus agalactiae 2603V/R strain 2603V/R 16S ribosomal RNA, complete sequence | 2846 | 100.00
    Sagalactiae_c8 | Streptococcus agalactiae ATCC 13813 strain JCM 5671 16S ribosomal RNA, complete sequence | 2772 | 100.00
    Sagalactiae_c10 | Streptococcus agalactiae 2603V/R strain 2603V/R 16S ribosomal RNA, complete sequence | 2846 | 100.00
    Sagalactiae_c10 | Streptococcus agalactiae ATCC 13813 strain JCM 5671 16S ribosomal RNA, complete sequence | 2772 | 100.00
  • Data preprocessing: comparative study of trimmers (PRINSEQ, ERNE-filter, Trimmomatic); removing adapters; 5' trimming; discarding useless reads; minimum length; removing low-quality regions; internal quality control of MIRA; sliding window trimming; Maximum leng