download.e-bookshelf.de · 2016. 9. 15. · Title: Computational methods for next generation...


    COMPUTATIONAL METHODS FOR NEXT GENERATION SEQUENCING DATA ANALYSIS


    Wiley Series on

    Bioinformatics: Computational Techniques and Engineering

    A complete list of the titles in this series appears at the end of this volume.

    COMPUTATIONAL METHODS FOR NEXT GENERATION SEQUENCING DATA ANALYSIS

    Edited by

    ION I. MĂNDOIU

    ALEXANDER ZELIKOVSKY


    Copyright © 2016 by John Wiley & Sons, Inc. All rights reserved

    Published by John Wiley & Sons, Inc., Hoboken, New Jersey

    Published simultaneously in Canada

    No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

    Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

    For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

    Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

    Library of Congress Cataloging-in-Publication Data:

    Names: Măndoiu, I. Ion, editor of compilation. | Zelikovsky, Alexander, editor of compilation.

    Title: Computational methods for next generation sequencing data analysis / edited by Ion I. Măndoiu, Alexander Zelikovsky.

    Description: Hoboken, New Jersey : John Wiley & Sons, 2016. | Includes bibliographical references and index.

    Identifiers: LCCN 2016010861 (print) | LCCN 2016014704 (ebook) | ISBN 9781118169483 (cloth) | ISBN 9781119272168 (pdf) | ISBN 9781119272175 (epub)

    Subjects: LCSH: Nucleotide sequence–Methodology. | Nucleotide sequence–Data processing.

    Classification: LCC QP620 .C648 2016 (print) | LCC QP620 (ebook) | DDC 611/.0181663–dc23

    LC record available at http://lccn.loc.gov/2016010861

    Cover image courtesy of Gettyimages/Andrew Brookes

    Printed in the United States of America

    10 9 8 7 6 5 4 3 2 1


    CONTENTS IN BRIEF

    CONTRIBUTORS

    PREFACE

    ABOUT THE COMPANION WEBSITE

    PART I COMPUTING AND EXPERIMENTAL INFRASTRUCTURE FOR NGS

    1 Cloud Computing for Next-Generation Sequencing Data Analysis
    Xuan Guo, Ning Yu, Bing Li, and Yi Pan

    2 Introduction to the Analysis of Environmental Sequence Information Using Metapathways
    Niels W. Hanson, Kishori M. Konwar, Shang-Ju Wu, and Steven J. Hallam

    3 Pooling Strategy for Massive Viral Sequencing
    Pavel Skums, Alexander Artyomenko, Olga Glebova, Sumathi Ramachandran, David S. Campo, Zoya Dimitrova, Ion I. Măndoiu, Alexander Zelikovsky, and Yury Khudyakov

    4 Applications of High-Fidelity Sequencing Protocol to RNA Viruses
    Serghei Mangul, Nicholas C. Wu, Ekaterina Nenastyeva, Nicholas Mancuso, Alexander Zelikovsky, Ren Sun, and Eleazar Eskin


    PART II GENOMICS AND EPIGENOMICS

    5 Scaffolding Algorithms
    Igor Mandric, James Lindsay, Ion I. Măndoiu, and Alexander Zelikovsky

    6 Genomic Variants Detection and Genotyping
    Jorge Duitama

    7 Discovering and Genotyping Twilight Zone Deletions
    Tobias Marschall and Alexander Schönhuth

    8 Computational Approaches for Finding Long Insertions and Deletions with NGS Data
    Jin Zhang, Chong Chu, and Yufeng Wu

    9 Computational Approaches in Next-Generation Sequencing Data Analysis for Genome-Wide DNA Methylation Studies
    Jeong-Hyeon Choi and Huidong Shi

    10 Bisulfite-Conversion-Based Methods for DNA Methylation Sequencing Data Analysis
    Elena Harris and Stefano Lonardi

    PART III TRANSCRIPTOMICS

    11 Computational Methods for Transcript Assembly from RNA-SEQ Reads
    Stefan Canzar and Liliana Florea

    12 An Overview and Comparison of Tools for RNA-Seq Assembly
    Rasiah Loganantharaj and Thomas A. Randall

    13 Computational Approaches for Studying Alternative Splicing in Nonmodel Organisms from RNA-SEQ Data
    Sing-Hoi Sze

    14 Transcriptome Quantification and Differential Expression from NGS Data
    Olga Glebova, Yvette Temate-Tiagueu, Adrian Caciula, Sahar Al Seesi, Alexander Artyomenko, Serghei Mangul, James Lindsay, Ion I. Măndoiu, and Alexander Zelikovsky


    PART IV MICROBIOMICS

    15 Error Correction of NGS Reads from Viral Populations
    Pavel Skums, Alexander Artyomenko, Olga Glebova, David S. Campo, Zoya Dimitrova, Alexander Zelikovsky, and Yury Khudyakov

    16 Probabilistic Viral Quasispecies Assembly
    Armin Töpfer and Niko Beerenwinkel

    17 Reconstruction of Infectious Bronchitis Virus Quasispecies from NGS Data
    Bassam Tork, Ekaterina Nenastyeva, Alexander Artyomenko, Nicholas Mancuso, Mazhar I. Khan, Rachel O’Neill, Ion I. Măndoiu, and Alexander Zelikovsky

    18 Microbiome Analysis: State of the Art and Future Trends
    Mitch Fernandez, Vanessa Aguiar-Pulido, Juan Riveros, Wenrui Huang, Jonathan Segal, Erliang Zeng, Michael Campos, Kalai Mathee, and Giri Narasimhan

    INDEX


    CONTENTS

    CONTRIBUTORS

    PREFACE

    ABOUT THE COMPANION WEBSITE

    PART I COMPUTING AND EXPERIMENTAL INFRASTRUCTURE FOR NGS

    1 Cloud Computing for Next-Generation Sequencing Data Analysis
    Xuan Guo, Ning Yu, Bing Li, and Yi Pan
        1.1 Introduction
        1.2 Challenges for NGS Data Analysis
        1.3 Background for Cloud Computing and its Programming Models
            1.3.1 Overview of Cloud Computing
            1.3.2 Cloud Service Providers
            1.3.3 Programming Models
        1.4 Cloud Computing Services for NGS Data Analysis
            1.4.1 Hardware as a Service (HaaS)
            1.4.2 Platform as a Service (PaaS)
            1.4.3 Software as a Service (SaaS)
            1.4.4 Data as a Service (DaaS)
        1.5 Conclusions and Future Directions
        References


    2 Introduction to the Analysis of Environmental Sequence Information Using Metapathways
    Niels W. Hanson, Kishori M. Konwar, Shang-Ju Wu, and Steven J. Hallam
        2.1 Introduction & Overview
        2.2 Background
        2.3 Metapathways Processes
            2.3.1 Open Reading Frame (ORF) Prediction
            2.3.2 Functional Annotation
            2.3.3 Analysis Modules
            2.3.4 ePGDB Construction
        2.4 Big Data Processing
            2.4.1 A Master–Worker Model for Grid Distribution
            2.4.2 GUI and Data Integration
        2.5 Downstream Analyses
            2.5.1 Large Table Comparisons
            2.5.2 Pathway Tools Cellular Overview
            2.5.3 Statistical Analysis with R
            2.5.4 Venn Diagram
            2.5.5 Clustering and Relating Samples by Pathways
            2.5.6 Faceting Variables with ggplot2
        2.6 Conclusions
        References

    3 Pooling Strategy for Massive Viral Sequencing
    Pavel Skums, Alexander Artyomenko, Olga Glebova, Sumathi Ramachandran, David S. Campo, Zoya Dimitrova, Ion I. Măndoiu, Alexander Zelikovsky, and Yury Khudyakov
        3.1 Introduction
        3.2 Design of Pools for Big Viral Data
            3.2.1 Pool Design Optimization Formulation
            3.2.2 Greedy Heuristic for VSPD Problem
            3.2.3 The Tabu Search Heuristic for the OCBG Problem
        3.3 Deconvolution of Viral Samples from Pools
            3.3.1 Deconvolution Using Generalized Intersections and Differences of Pools
            3.3.2 Maximum Likelihood k-Clustering
        3.4 Performance of Pooling Methods on Simulated Data
            3.4.1 Performance of the Viral Sample Pool Design Algorithm
            3.4.2 Performance of the Pool Deconvolution Algorithm
        3.5 Experimental Validation of Pooling Strategy
            3.5.1 Experimental Pools and Sequencing
            3.5.2 Results
        3.6 Conclusion
        References

    4 Applications of High-Fidelity Sequencing Protocol to RNA Viruses
    Serghei Mangul, Nicholas C. Wu, Ekaterina Nenastyeva, Nicholas Mancuso, Alexander Zelikovsky, Ren Sun, and Eleazar Eskin
        4.1 Introduction
        4.2 High-Fidelity Sequencing Protocol
        4.3 Assembly of High-Fidelity Sequencing Data
            4.3.1 Consensus Construction
            4.3.2 Reads Mapping
            4.3.3 Viral Genome Assembler (VGA)
            4.3.4 Viral Population Quantification
        4.4 Performance of VGA on Simulated Data
        4.5 Performance of Existing Viral Assemblers on Simulated Consensus Error-Corrected Reads
        4.6 Performance of VGA on Real HIV Data
            4.6.1 Validation of de novo Consensus
        4.7 Comparison of Alignment on Error-Corrected Reads
        4.8 Evaluating of Error Correction Tools Based on High-Fidelity Sequencing Reads
        Acknowledgment
        References

    PART II GENOMICS AND EPIGENOMICS

    5 Scaffolding Algorithms
    Igor Mandric, James Lindsay, Ion I. Măndoiu, and Alexander Zelikovsky
        5.1 Scaffolding
        5.2 State-of-the-Art Scaffolding Tools
            5.2.1 SSPACE
            5.2.2 OPERA
            5.2.3 SOPRA
            5.2.4 MIP
            5.2.5 SCARPA
        5.3 Recent Scaffolding Tools
            5.3.1 SILP2
            5.3.2 ScaffMatch
        5.4 Scaffolding Software Evaluation
            5.4.1 Data Sets
            5.4.2 Quality Metrics
            5.4.3 Evaluation and Comparison
        References


    6 Genomic Variants Detection and Genotyping
    Jorge Duitama
        6.1 Introduction
        6.2 Methods for Detection and Genotyping of SNPs and Small Indels
            6.2.1 Description of the Problem
            6.2.2 Bayesian Model
            6.2.3 Common Issues Affecting Genotype Quality
            6.2.4 Population Variability
        6.3 Methods for Detection and Genotyping of CNVs
            6.3.1 Mean-Shift Approach for CNVs within a Sample
            6.3.2 Identifying CNVs between Samples
        6.4 Putting Everything Together
        References

    7 Discovering and Genotyping Twilight Zone Deletions
    Tobias Marschall and Alexander Schönhuth
        7.1 Introduction
            7.1.1 Twilight Zone Deletions
        7.2 Notation
            7.2.1 Alignments
            7.2.2 Gaps/Splits
            7.2.3 Deletions
        7.3 Non-Twilight-Zone Deletion Discovery
            7.3.1 Internal Segment Size-Based Approaches
            7.3.2 Split-Read Mapping Approaches
            7.3.3 Hybrid Approaches
            7.3.4 The “Twilight Zone”: Definition
        7.4 Discovering “Twilight Zone” Deletions: New Solutions
            7.4.1 CLEVER
            7.4.2 Mate-Clever
            7.4.3 Pindel
        7.5 Genotyping “Twilight Zone” Deletions
            7.5.1 A Maximum Likelihood Approach under Read Alignment Uncertainty
        7.6 Results
            7.6.1 Data Set
            7.6.2 Tools
            7.6.3 Discovery
            7.6.4 Genotyping
        7.7 Discussion
            7.7.1 HiSeq
            7.7.2 MiSeq
            7.7.3 Conclusion
        7.8 Availability
        Acknowledgments
        References

    8 Computational Approaches for Finding Long Insertions and Deletions with NGS Data
    Jin Zhang, Chong Chu, and Yufeng Wu
        8.1 Background
        8.2 Methods
            8.2.1 Signatures of Long Indels in Sequence Reads
            8.2.2 Methods for Discovering Long Indels without Exact Breakpoints
            8.2.3 Methods for Discovering Long Indels with Exact Breakpoints
            8.2.4 Combined Approaches
        8.3 Applications
            8.3.1 Population SV Calling
            8.3.2 Cancer Genomics
        8.4 Conclusions and Future Directions
        Acknowledgment
        References

    9 Computational Approaches in Next-Generation Sequencing Data Analysis for Genome-Wide DNA Methylation Studies
    Jeong-Hyeon Choi and Huidong Shi
        9.1 Introduction
        9.2 Enrichment-Based Approaches
            9.2.1 Data Analysis Procedure
            9.2.2 Available Approaches
        9.3 Bisulfite Treatment-Based Approaches
            9.3.1 Data Analysis Procedure
            9.3.2 Available Approaches
        9.4 Conclusion
        References

    10 Bisulfite-Conversion-Based Methods for DNA Methylation Sequencing Data Analysis
    Elena Harris and Stefano Lonardi
        10.1 Introduction
        10.2 The Problem of Mapping BS-Treated Reads
        10.3 Algorithmic Approaches to the Problem of Mapping BS-Treated Reads
        10.4 Methylation Estimation
        10.5 Possible Biases in Estimation of Methylation Level
        10.6 Bisulfite Conversion Rate
        10.7 Reduced Representation Bisulfite Sequencing
        10.8 Accuracy as a Performance Measurement
        References

    PART III TRANSCRIPTOMICS

    11 Computational Methods for Transcript Assembly from RNA-SEQ Reads
    Stefan Canzar and Liliana Florea
        11.1 Introduction
        11.2 De Novo Assembly
            11.2.1 Preprocessing of Reads
            11.2.2 The De Bruijn Graph for RNA-seq Read Assembly
            11.2.3 Contig Assembly
            11.2.4 Filtering and Error Correction
            11.2.5 Variations
        11.3 Genome-Based Assembly
            11.3.1 Candidate Isoforms
            11.3.2 Minimality
            11.3.3 Accuracy
            11.3.4 Completeness
            11.3.5 Extensions
        11.4 Conclusions
        Acknowledgment
        References

    12 An Overview and Comparison of Tools for RNA-Seq Assembly
    Rasiah Loganantharaj and Thomas A. Randall
        12.1 Quality Assessment
        12.2 Experimental Considerations
        12.3 Assembly
        12.4 Experiment
        12.5 Comparison
        12.6 Results
        12.7 Summary and Conclusion
        Acknowledgments
        References


    13 Computational Approaches for Studying Alternative Splicing in Nonmodel Organisms from RNA-SEQ Data
    Sing-Hoi Sze
        13.1 Introduction
            13.1.1 Alternative Splicing
            13.1.2 Nonmodel Organisms
            13.1.3 RNA-Seq Data
        13.2 Representation of Alternative Splicing
            13.2.1 de Bruijn Graph
            13.2.2 A Set of Transcripts
            13.2.3 Splicing Graph
        13.3 Comparison to Model Organisms
            13.3.1 A Set of Transcripts
            13.3.2 Splicing Graph
        13.4 Accuracy of Algorithms
            13.4.1 Assembly Results
            13.4.2 mRNA BLAST Results
            13.4.3 Alternative Splicing Junctions
        13.5 Discussion
        References

    14 Transcriptome Quantification and Differential Expression from NGS Data
    Olga Glebova, Yvette Temate-Tiagueu, Adrian Caciula, Sahar Al Seesi, Alexander Artyomenko, Serghei Mangul, James Lindsay, Ion I. Măndoiu, and Alexander Zelikovsky
        14.1 Introduction
            14.1.1 Motivation and Problems Description
            14.1.2 RNA-Seq Protocol
        14.2 Overview of the State-of-the-Art Methods
            14.2.1 Quantification Methods
            14.2.2 Differential Expression Methods
        14.3 Recent Algorithms
            14.3.1 SimReg: Simulated Regression Method for Transcriptome Quantification
            14.3.2 Differential Gene Expression Analysis: IsoDE
        14.4 Experimental Setup
            14.4.1 Quantification Methods
            14.4.2 Differential Expression Methods
        14.5 Evaluation
            14.5.1 Transcriptome Quantification Methods Evaluation
            14.5.2 Differential Expression Methods Evaluation
        Acknowledgments
        References

    PART IV MICROBIOMICS

    15 Error Correction of NGS Reads from Viral Populations
    Pavel Skums, Alexander Artyomenko, Olga Glebova, David S. Campo, Zoya Dimitrova, Alexander Zelikovsky, and Yury Khudyakov
        15.1 Next-Generation Sequencing of Heterogeneous Viral Populations and Sequencing Errors
        15.2 Methods and Algorithms for the NGS Error Correction in Viral Data
            15.2.1 Clustering-Based Algorithms
            15.2.2 k-Mer-Based Algorithms
            15.2.3 Alignment-Based Algorithms
        15.3 Algorithm Comparison
            15.3.1 Benchmark Data
            15.3.2 Results and Discussion
        References

    16 Probabilistic Viral Quasispecies Assembly
    Armin Töpfer and Niko Beerenwinkel
        16.1 Intra-Host Virus Populations
            16.1.1 Viral Quasispecies
            16.1.2 Fitness
            16.1.3 HIV-1 as a Model System
            16.1.4 Recombination
            16.1.5 Clinical Implications
            16.1.6 Genotyping
        16.2 Next-Generation Sequencing for Viral Genomics
            16.2.1 Library Preparation
            16.2.2 Sequencing Approaches
            16.2.3 Specialized Viral Sequencing Methods
            16.2.4 Data Preprocessing and Read Alignment
            16.2.5 Spatial Scales of Viral Haplotype Reconstruction
            16.2.6 Quasispecies Assembly Performance
        16.3 Probabilistic Reconstruction Methods
            16.3.1 From Human to Viral Haplotype Reconstruction
            16.3.2 Viral Haplotype Inference Methods Overview
            16.3.3 Local Viral Haplotype Inference Approaches
            16.3.4 Quasispecies Assembly
            16.3.5 Recombinant Quasispecies Assembly
        16.4 Conclusion
        References


    17 Reconstruction of Infectious Bronchitis Virus Quasispecies from NGS Data
    Bassam Tork, Ekaterina Nenastyeva, Alexander Artyomenko, Nicholas Mancuso, Mazhar I. Khan, Rachel O’Neill, Ion I. Măndoiu, and Alexander Zelikovsky
        17.1 Introduction
        17.2 Background
            17.2.1 Infectious Bronchitis Virus
            17.2.2 High-Throughput Sequencing
        17.3 Methods
            17.3.1 Compared Methods
        17.4 Results and Discussion
            17.4.1 Data Sets
            17.4.2 Validation of Error Correction Methods
            17.4.3 Tuning, Comparison, and Validation of Methods for Quasispecies Reconstruction
        Acknowledgments
        References

    18 Microbiome Analysis: State of the Art and Future Trends
    Mitch Fernandez, Vanessa Aguiar-Pulido, Juan Riveros, Wenrui Huang, Jonathan Segal, Erliang Zeng, Michael Campos, Kalai Mathee, and Giri Narasimhan
        18.1 Introduction
        18.2 The Metagenomics Analysis Pipeline
        18.3 Data Limitations and Sources of Errors
            18.3.1 Designing Degenerate Primers for Microbiome Work
        18.4 Diversity and Richness Measures
        18.5 Correlations and Association Rules
        18.6 Microbial Functional Profiles
        18.7 Microbial Social Interactions and Visualizations
        18.8 Bayesian Inferences
        18.9 Conclusion
        References

    INDEX


    CONTRIBUTORS

    Vanessa Aguiar-Pulido, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA

    Sahar Al Seesi, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA

    Alexander Artyomenko, Department of Computer Science, Georgia State University, Atlanta, GA, USA

    Niko Beerenwinkel, Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland

    Adrian Caciula, Department of Computer Science, Georgia State University, Atlanta, GA, USA

    David S. Campo, Division of Viral Hepatitis, Centers for Disease Control and Prevention, Atlanta, GA, USA

    Michael Campos, Miller School of Medicine, University of Miami, Miami, FL, USA

    Stefan Canzar, Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, and Toyota Technological Institute at Chicago, Chicago, IL, USA

    Jeong-Hyeon Choi, Cancer Center, Medical College of Georgia, Georgia Regents University, Augusta, GA, USA; Department of Biostatistics and Epidemiology, Medical College of Georgia, Georgia Regents University, Augusta, GA, USA

    Chong Chu, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA


    Zoya Dimitrova, Division of Viral Hepatitis, Centers for Disease Control and Prevention, Atlanta, GA, USA

    Jorge Duitama, Agrobiodiversity Research Area, International Center for Tropical Agriculture (CIAT), Cali, Colombia

    Eleazar Eskin, Department of Computer Science, University of California, Los Angeles, CA, USA

    Mitch Fernandez, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA

    Liliana Florea, Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA

    Olga Glebova, Department of Computer Science, Georgia State University, Atlanta, GA, USA

    Xuan Guo, Department of Computer Science, Department of Biology, Georgia State University, Atlanta, GA, USA

    Steven J. Hallam, Graduate Program in Bioinformatics and Department of Microbiology and Immunology, University of British Columbia, Vancouver, BC, Canada

    Niels W. Hanson, Graduate Program in Bioinformatics, University of British Columbia, Vancouver, BC, Canada

    Elena Harris, Department of Computer Science, California State University, Chico, CA, USA

    Wenrui Huang, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA

    Mazhar I. Khan, Department of Pathobiology and Veterinary Science, University of Connecticut, Storrs, CT, USA

    Yury Khudyakov, Division of Viral Hepatitis, Centers for Disease Control and Prevention, Atlanta, GA, USA

    Kishori M. Konwar, Department of Microbiology and Immunology, University of British Columbia, Vancouver, BC, Canada

    Bing Li, Department of Computer Science, Department of Biology, Georgia State University, Atlanta, GA, USA

    James Lindsay, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA

    Rasiah Loganantharaj, Bioinformatics Research Lab, The Center for Advanced Computer Studies, University of Louisiana, Lafayette, LA, USA

    Stefano Lonardi, Department of Computer Science and Engineering, University of California, Riverside, CA, USA


    Nicholas Mancuso, Department of Computer Science, Georgia State University, Atlanta, GA, USA

    Ion I. Măndoiu, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA

    Igor Mandric, Department of Computer Science, Georgia State University, Atlanta, GA, USA

    Serghei Mangul, Department of Computer Science, University of California, Los Angeles, CA, USA

    Tobias Marschall, Centrum Wiskunde & Informatica, Amsterdam, Netherlands

    Kalai Mathee, Herbert Wertheim College of Medicine, Florida International University, Miami, FL, USA

    Giri Narasimhan, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA

    Ekaterina Nenastyeva, Department of Computer Science, Georgia State University, Atlanta, GA, USA

    Rachel O’Neill, Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA

    Yi Pan, Department of Computer Science, Department of Biology, Georgia State University, Atlanta, GA, USA

    Sumathi Ramachandran, Division of Viral Hepatitis, Centers for Disease Control and Prevention, Atlanta, GA, USA

    Thomas A. Randall, Integrative Bioinformatics, National Institute of Environmental Health Sciences, Research Triangle Park, NC, USA

    Juan Riveros, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA

    Alexander Schönhuth, Centrum Wiskunde & Informatica, Amsterdam, Netherlands

    Jonathan Segal, Herbert Wertheim College of Medicine, Florida International University, Miami, FL, USA

    Huidong Shi, Cancer Center, Medical College of Georgia, Georgia Regents University, Augusta, GA, USA; Department of Biochemistry, Medical College of Georgia, Georgia Regents University, Augusta, GA, USA

    Pavel Skums, Division of Viral Hepatitis, Centers for Disease Control and Prevention, Atlanta, GA, USA

    Ren Sun, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA, USA

    Sing-Hoi Sze, Department of Computer Science and Engineering and Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX, USA


    Yvette Temate-Tiagueu, Department of Computer Science, Georgia State University, Atlanta, GA, USA

    Armin Töpfer, Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland

    Bassam Tork, Department of Computer Science, Georgia State University, Atlanta, GA, USA

    Nicholas C. Wu, Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA

    Shang-Ju Wu, Department of Computer Science, University of British Columbia, Vancouver, BC, Canada

    Yufeng Wu, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA

    Ning Yu, Department of Computer Science, Department of Biology, Georgia State University, Atlanta, GA, USA

    Alexander Zelikovsky, Department of Computer Science, Georgia State University, Atlanta, GA, USA

    Erliang Zeng, Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA

    Jin Zhang, McDonnell Genome Institute, Washington University in St. Louis, MO, USA


    PREFACE

    Massively parallel DNA sequencing and RNA sequencing have become widely available, reducing the cost by several orders of magnitude and placing the capacity to generate gigabases to terabases of sequence data into the hands of individual investigators. These so-called next-generation sequencing (NGS) technologies have dramatically accelerated biological and biomedical research by enabling the comprehensive analysis of genomes and transcriptomes to become inexpensive, routine, and widespread. The ensuing explosion in the volume of data has spurred numerous advances in computational methods for NGS data analysis.

    This book aims to provide an in-depth survey of some of the most important recent developments in this area. It is intended neither as an introductory text nor as a comprehensive review of existing bioinformatics tools and active research areas in NGS data analysis. Rather, our intention is to make a carefully selected set of advanced computational techniques accessible to a broad readership, including graduate students in bioinformatics and related areas and biomedical professionals who want to expand their repertoire of computational techniques for NGS data analysis. We hope that our emphasis on in-depth presentation of both algorithms and software for computational data analysis of current high-throughput sequencing technologies will best prepare the readers for developing their own algorithmic techniques and for successfully implementing them in existing and novel NGS applications.

    The book features 18 chapters authored by bioinformatics experts who are active contributors to the respective subjects. The chapters are intended to be largely independent, so that readers need not read every chapter or read them in a particular order. The chapters are grouped into the following four parts:

    • Part I focuses on computing and experimental infrastructure for NGS data analysis, including chapters on cloud computing, a modular pipeline for metabolic pathway reconstruction, pooling strategies for massive viral sequencing, and high-fidelity sequencing protocols.


    • Part II concentrates on analyses of DNA sequencing data and includes chapters on the classic scaffolding problem, detection of genomic variants, two chapters on finding insertions and deletions, and two chapters on the analysis of DNA methylation sequencing data.

    • Part III is devoted to analyses of RNA-seq data. Two chapters describe algorithms and compare software tools for transcriptome assembly; a third chapter focuses on methods for alternative splicing analysis, and the fourth focuses on tools for transcriptome quantification and differential expression analysis.

    • Part IV explores computational tools for NGS applications in microbiomics. The first chapter concentrates on error correction of NGS reads from viral populations, then two chapters describe methods for viral quasispecies reconstruction, and the last chapter surveys the state of the art and future trends in microbiome analysis.

    We are grateful to all the authors for their excellent contributions, without which this book would not have been possible. We hope that their deep insights and fresh enthusiasm will help in attracting new generations of researchers to this dynamic field. We would also like to thank Yi Pan and Albert Y. Zomaya for nurturing this project since its inception, and the editorial staff at Wiley Interscience for their patience and assistance throughout the project. Finally, we wish to thank our friends and families for their continuous support.

    Ion I. Măndoiu

    Storrs, Connecticut

    Alexander Zelikovsky

    Atlanta, Georgia


    ABOUT THE COMPANION WEBSITE

    This book is accompanied by a companion website:

    www.wiley.com/go/Mandoiu/NextGenerationSequencing

    The book companion website contains color versions of a few selected figures:

    Figure 2.3, Figure 2.5, Figure 2.6, Figure 2.13, Figure 3.1, Figure 3.9,

    Figure 7.5, Figure 8.3, Figure 8.4, Figure 9.4, Figure 9.8, Figure 9.9,

    Figure 9.12, Figure 9.14, Figure 12.3, Figure 12.4, Figure 12.5, Figure 15.3,

    Figure 16.1, Figure 16.6, Figure 16.7, Figure 16.11, Figure 16.12, Figure 16.13,

    Figure 18.1, Figure 18.2, Figure 18.3, Figure 18.4, Figure 18.5, Figure 18.7.


    PART I

    COMPUTING AND EXPERIMENTAL INFRASTRUCTURE FOR NGS
