download.e-bookshelf.de · 2016. 9. 15. · Title: Computational methods for next generation...


COMPUTATIONAL METHODS FOR NEXT GENERATION SEQUENCING DATA ANALYSIS


Wiley Series on

Bioinformatics: Computational Techniques and Engineering

A complete list of the titles in this series appears at the end of this volume.


COMPUTATIONAL METHODS FOR NEXT GENERATION SEQUENCING DATA ANALYSIS

Edited by

ION I. MĂNDOIU

ALEXANDER ZELIKOVSKY


Copyright © 2016 by John Wiley & Sons, Inc. All rights reserved

Published by John Wiley & Sons, Inc., Hoboken, New Jersey

Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Names: Măndoiu, I. Ion, editor of compilation. | Zelikovsky, Alexander, editor of compilation.

Title: Computational methods for next generation sequencing data analysis / edited by Ion I. Măndoiu, Alexander Zelikovsky.

Description: Hoboken, New Jersey : John Wiley & Sons, 2016. | Includes bibliographical references and index.

Identifiers: LCCN 2016010861 (print) | LCCN 2016014704 (ebook) | ISBN 9781118169483 (cloth) | ISBN 9781119272168 (pdf) | ISBN 9781119272175 (epub)

Subjects: LCSH: Nucleotide sequence–Methodology. | Nucleotide sequence–Data processing.

Classification: LCC QP620 .C648 2016 (print) | LCC QP620 (ebook) | DDC 611/.0181663–dc23

LC record available at http://lccn.loc.gov/2016010861

Cover image courtesy of Gettyimages/Andrew Brookes

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1


CONTENTS IN BRIEF

CONTRIBUTORS

PREFACE

ABOUT THE COMPANION WEBSITE

PART I COMPUTING AND EXPERIMENTAL INFRASTRUCTURE FOR NGS

1 Cloud Computing for Next-Generation Sequencing Data Analysis
Xuan Guo, Ning Yu, Bing Li, and Yi Pan

2 Introduction to the Analysis of Environmental Sequence Information Using MetaPathways
Niels W. Hanson, Kishori M. Konwar, Shang-Ju Wu, and Steven J. Hallam

3 Pooling Strategy for Massive Viral Sequencing
Pavel Skums, Alexander Artyomenko, Olga Glebova, Sumathi Ramachandran, David S. Campo, Zoya Dimitrova, Ion I. Măndoiu, Alexander Zelikovsky, and Yury Khudyakov

4 Applications of High-Fidelity Sequencing Protocol to RNA Viruses
Serghei Mangul, Nicholas C. Wu, Ekaterina Nenastyeva, Nicholas Mancuso, Alexander Zelikovsky, Ren Sun, and Eleazar Eskin


PART II GENOMICS AND EPIGENOMICS

5 Scaffolding Algorithms
Igor Mandric, James Lindsay, Ion I. Măndoiu, and Alexander Zelikovsky

6 Genomic Variants Detection and Genotyping
Jorge Duitama

7 Discovering and Genotyping Twilight Zone Deletions
Tobias Marschall and Alexander Schönhuth

8 Computational Approaches for Finding Long Insertions and Deletions with NGS Data
Jin Zhang, Chong Chu, and Yufeng Wu

9 Computational Approaches in Next-Generation Sequencing Data Analysis for Genome-Wide DNA Methylation Studies
Jeong-Hyeon Choi and Huidong Shi

10 Bisulfite-Conversion-Based Methods for DNA Methylation Sequencing Data Analysis
Elena Harris and Stefano Lonardi

PART III TRANSCRIPTOMICS

11 Computational Methods for Transcript Assembly from RNA-Seq Reads
Stefan Canzar and Liliana Florea

12 An Overview and Comparison of Tools for RNA-Seq Assembly
Rasiah Loganantharaj and Thomas A. Randall

13 Computational Approaches for Studying Alternative Splicing in Nonmodel Organisms from RNA-Seq Data
Sing-Hoi Sze

14 Transcriptome Quantification and Differential Expression from NGS Data
Olga Glebova, Yvette Temate-Tiagueu, Adrian Caciula, Sahar Al Seesi, Alexander Artyomenko, Serghei Mangul, James Lindsay, Ion I. Măndoiu, and Alexander Zelikovsky


PART IV MICROBIOMICS

15 Error Correction of NGS Reads from Viral Populations
Pavel Skums, Alexander Artyomenko, Olga Glebova, David S. Campo, Zoya Dimitrova, Alexander Zelikovsky, and Yury Khudyakov

16 Probabilistic Viral Quasispecies Assembly
Armin Töpfer and Niko Beerenwinkel

17 Reconstruction of Infectious Bronchitis Virus Quasispecies from NGS Data
Bassam Tork, Ekaterina Nenastyeva, Alexander Artyomenko, Nicholas Mancuso, Mazhar I. Khan, Rachel O’Neill, Ion I. Măndoiu, and Alexander Zelikovsky

18 Microbiome Analysis: State of the Art and Future Trends
Mitch Fernandez, Vanessa Aguiar-Pulido, Juan Riveros, Wenrui Huang, Jonathan Segal, Erliang Zeng, Michael Campos, Kalai Mathee, and Giri Narasimhan

INDEX


CONTENTS

CONTRIBUTORS

PREFACE

ABOUT THE COMPANION WEBSITE

PART I COMPUTING AND EXPERIMENTAL INFRASTRUCTURE FOR NGS

1 Cloud Computing for Next-Generation Sequencing Data Analysis
Xuan Guo, Ning Yu, Bing Li, and Yi Pan
    1.1 Introduction
    1.2 Challenges for NGS Data Analysis
    1.3 Background for Cloud Computing and Its Programming Models
        1.3.1 Overview of Cloud Computing
        1.3.2 Cloud Service Providers
        1.3.3 Programming Models
    1.4 Cloud Computing Services for NGS Data Analysis
        1.4.1 Hardware as a Service (HaaS)
        1.4.2 Platform as a Service (PaaS)
        1.4.3 Software as a Service (SaaS)
        1.4.4 Data as a Service (DaaS)
    1.5 Conclusions and Future Directions
    References


2 Introduction to the Analysis of Environmental Sequence Information Using MetaPathways
Niels W. Hanson, Kishori M. Konwar, Shang-Ju Wu, and Steven J. Hallam
    2.1 Introduction & Overview
    2.2 Background
    2.3 MetaPathways Processes
        2.3.1 Open Reading Frame (ORF) Prediction
        2.3.2 Functional Annotation
        2.3.3 Analysis Modules
        2.3.4 ePGDB Construction
    2.4 Big Data Processing
        2.4.1 A Master–Worker Model for Grid Distribution
        2.4.2 GUI and Data Integration
    2.5 Downstream Analyses
        2.5.1 Large Table Comparisons
        2.5.2 Pathway Tools Cellular Overview
        2.5.3 Statistical Analysis with R
        2.5.4 Venn Diagram
        2.5.5 Clustering and Relating Samples by Pathways
        2.5.6 Faceting Variables with ggplot2
    2.6 Conclusions
    References

3 Pooling Strategy for Massive Viral Sequencing
Pavel Skums, Alexander Artyomenko, Olga Glebova, Sumathi Ramachandran, David S. Campo, Zoya Dimitrova, Ion I. Măndoiu, Alexander Zelikovsky, and Yury Khudyakov
    3.1 Introduction
    3.2 Design of Pools for Big Viral Data
        3.2.1 Pool Design Optimization Formulation
        3.2.2 Greedy Heuristic for VSPD Problem
        3.2.3 The Tabu Search Heuristic for the OCBG Problem
    3.3 Deconvolution of Viral Samples from Pools
        3.3.1 Deconvolution Using Generalized Intersections and Differences of Pools
        3.3.2 Maximum Likelihood k-Clustering
    3.4 Performance of Pooling Methods on Simulated Data
        3.4.1 Performance of the Viral Sample Pool Design Algorithm
        3.4.2 Performance of the Pool Deconvolution Algorithm
    3.5 Experimental Validation of Pooling Strategy
        3.5.1 Experimental Pools and Sequencing
        3.5.2 Results


    3.6 Conclusion
    References

4 Applications of High-Fidelity Sequencing Protocol to RNA Viruses
Serghei Mangul, Nicholas C. Wu, Ekaterina Nenastyeva, Nicholas Mancuso, Alexander Zelikovsky, Ren Sun, and Eleazar Eskin
    4.1 Introduction
    4.2 High-Fidelity Sequencing Protocol
    4.3 Assembly of High-Fidelity Sequencing Data
        4.3.1 Consensus Construction
        4.3.2 Reads Mapping
        4.3.3 Viral Genome Assembler (VGA)
        4.3.4 Viral Population Quantification
    4.4 Performance of VGA on Simulated Data
    4.5 Performance of Existing Viral Assemblers on Simulated Consensus Error-Corrected Reads
    4.6 Performance of VGA on Real HIV Data
        4.6.1 Validation of de novo Consensus
    4.7 Comparison of Alignment on Error-Corrected Reads
    4.8 Evaluating of Error Correction Tools Based on High-Fidelity Sequencing Reads
    Acknowledgment
    References

PART II GENOMICS AND EPIGENOMICS

5 Scaffolding Algorithms
Igor Mandric, James Lindsay, Ion I. Măndoiu, and Alexander Zelikovsky
    5.1 Scaffolding
    5.2 State-of-the-Art Scaffolding Tools
        5.2.1 SSPACE
        5.2.2 OPERA
        5.2.3 SOPRA
        5.2.4 MIP
        5.2.5 SCARPA
    5.3 Recent Scaffolding Tools
        5.3.1 SILP2
        5.3.2 ScaffMatch
    5.4 Scaffolding Software Evaluation
        5.4.1 Data Sets
        5.4.2 Quality Metrics
        5.4.3 Evaluation and Comparison
    References


6 Genomic Variants Detection and Genotyping
Jorge Duitama
    6.1 Introduction
    6.2 Methods for Detection and Genotyping of SNPs and Small Indels
        6.2.1 Description of the Problem
        6.2.2 Bayesian Model
        6.2.3 Common Issues Affecting Genotype Quality
        6.2.4 Population Variability
    6.3 Methods for Detection and Genotyping of CNVs
        6.3.1 Mean-Shift Approach for CNVs within a Sample
        6.3.2 Identifying CNVs between Samples
    6.4 Putting Everything Together
    References

7 Discovering and Genotyping Twilight Zone Deletions
Tobias Marschall and Alexander Schönhuth
    7.1 Introduction
        7.1.1 Twilight Zone Deletions
    7.2 Notation
        7.2.1 Alignments
        7.2.2 Gaps/Splits
        7.2.3 Deletions
    7.3 Non-Twilight-Zone Deletion Discovery
        7.3.1 Internal Segment Size-Based Approaches
        7.3.2 Split-Read Mapping Approaches
        7.3.3 Hybrid Approaches
        7.3.4 The “Twilight Zone”: Definition
    7.4 Discovering “Twilight Zone” Deletions: New Solutions
        7.4.1 CLEVER
        7.4.2 MATE-CLEVER
        7.4.3 Pindel
    7.5 Genotyping “Twilight Zone” Deletions
        7.5.1 A Maximum Likelihood Approach under Read Alignment Uncertainty
    7.6 Results
        7.6.1 Data Set
        7.6.2 Tools
        7.6.3 Discovery
        7.6.4 Genotyping
    7.7 Discussion
        7.7.1 HiSeq
        7.7.2 MiSeq
        7.7.3 Conclusion


    7.8 Availability
    Acknowledgments
    References

8 Computational Approaches for Finding Long Insertions and Deletions with NGS Data
Jin Zhang, Chong Chu, and Yufeng Wu
    8.1 Background
    8.2 Methods
        8.2.1 Signatures of Long Indels in Sequence Reads
        8.2.2 Methods for Discovering Long Indels without Exact Breakpoints
        8.2.3 Methods for Discovering Long Indels with Exact Breakpoints
        8.2.4 Combined Approaches
    8.3 Applications
        8.3.1 Population SV Calling
        8.3.2 Cancer Genomics
    8.4 Conclusions and Future Directions
    Acknowledgment
    References

9 Computational Approaches in Next-Generation Sequencing Data Analysis for Genome-Wide DNA Methylation Studies
Jeong-Hyeon Choi and Huidong Shi
    9.1 Introduction
    9.2 Enrichment-Based Approaches
        9.2.1 Data Analysis Procedure
        9.2.2 Available Approaches
    9.3 Bisulfite Treatment-Based Approaches
        9.3.1 Data Analysis Procedure
        9.3.2 Available Approaches


10 Bisulfite-Conversion-Based Methods for DNA Methylation Sequencing Data Analysis
Elena Harris and Stefano Lonardi
    10.1 Introduction
    10.2 The Problem of Mapping BS-Treated Reads
    10.3 Algorithmic Approaches to the Problem of Mapping BS-Treated Reads
    10.4 Methylation Estimation


    10.5 Possible Biases in Estimation of Methylation Level
    10.6 Bisulfite Conversion Rate
    10.7 Reduced Representation Bisulfite Sequencing
    10.8 Accuracy as a Performance Measurement
    References

PART III TRANSCRIPTOMICS

11 Computational Methods for Transcript Assembly from RNA-Seq Reads
Stefan Canzar and Liliana Florea
    11.1 Introduction
    11.2 De Novo Assembly
        11.2.1 Preprocessing of Reads
        11.2.2 The De Bruijn Graph for RNA-seq Read Assembly
        11.2.3 Contig Assembly
        11.2.4 Filtering and Error Correction
        11.2.5 Variations
    11.3 Genome-Based Assembly
        11.3.1 Candidate Isoforms
        11.3.2 Minimality
        11.3.3 Accuracy
        11.3.4 Completeness
        11.3.5 Extensions
    11.4 Conclusions
    Acknowledgment
    References

12 An Overview and Comparison of Tools for RNA-Seq Assembly
Rasiah Loganantharaj and Thomas A. Randall
    12.1 Quality Assessment
    12.2 Experimental Considerations
    12.3 Assembly
    12.4 Experiment
    12.5 Comparison
    12.6 Results
    12.7 Summary and Conclusion
    Acknowledgments
    References


13 Computational Approaches for Studying Alternative Splicing in Nonmodel Organisms from RNA-Seq Data
Sing-Hoi Sze
    13.1 Introduction
        13.1.1 Alternative Splicing
        13.1.2 Nonmodel Organisms
        13.1.3 RNA-Seq Data
    13.2 Representation of Alternative Splicing
        13.2.1 de Bruijn Graph
        13.2.2 A Set of Transcripts
        13.2.3 Splicing Graph
    13.3 Comparison to Model Organisms
        13.3.1 A Set of Transcripts
        13.3.2 Splicing Graph
    13.4 Accuracy of Algorithms
        13.4.1 Assembly Results
        13.4.2 mRNA BLAST Results
        13.4.3 Alternative Splicing Junctions
    13.5 Discussion
    References

14 Transcriptome Quantification and Differential Expression from NGS Data
Olga Glebova, Yvette Temate-Tiagueu, Adrian Caciula, Sahar Al Seesi, Alexander Artyomenko, Serghei Mangul, James Lindsay, Ion I. Măndoiu, and Alexander Zelikovsky
    14.1 Introduction
        14.1.1 Motivation and Problems Description
        14.1.2 RNA-Seq Protocol
    14.2 Overview of the State-of-the-Art Methods
        14.2.1 Quantification Methods
        14.2.2 Differential Expression Methods
    14.3 Recent Algorithms
        14.3.1 SimReg: Simulated Regression Method for Transcriptome Quantification
        14.3.2 Differential Gene Expression Analysis: IsoDE
    14.4 Experimental Setup
        14.4.1 Quantification Methods
        14.4.2 Differential Expression Methods
    14.5 Evaluation
        14.5.1 Transcriptome Quantification Methods Evaluation


        14.5.2 Differential Expression Methods Evaluation
    Acknowledgments
    References

PART IV MICROBIOMICS

15 Error Correction of NGS Reads from Viral Populations
Pavel Skums, Alexander Artyomenko, Olga Glebova, David S. Campo, Zoya Dimitrova, Alexander Zelikovsky, and Yury Khudyakov
    15.1 Next-Generation Sequencing of Heterogeneous Viral Populations and Sequencing Errors
    15.2 Methods and Algorithms for the NGS Error Correction in Viral Data
        15.2.1 Clustering-Based Algorithms
        15.2.2 k-Mer-Based Algorithms
        15.2.3 Alignment-Based Algorithms
    15.3 Algorithm Comparison
        15.3.1 Benchmark Data
        15.3.2 Results and Discussion
    References

16 Probabilistic Viral Quasispecies Assembly
Armin Töpfer and Niko Beerenwinkel
    16.1 Intra-Host Virus Populations
        16.1.1 Viral Quasispecies
        16.1.2 Fitness
        16.1.3 HIV-1 as a Model System
        16.1.4 Recombination
        16.1.5 Clinical Implications
        16.1.6 Genotyping
    16.2 Next-Generation Sequencing for Viral Genomics
        16.2.1 Library Preparation
        16.2.2 Sequencing Approaches
        16.2.3 Specialized Viral Sequencing Methods
        16.2.4 Data Preprocessing and Read Alignment
        16.2.5 Spatial Scales of Viral Haplotype Reconstruction
        16.2.6 Quasispecies Assembly Performance
    16.3 Probabilistic Reconstruction Methods
        16.3.1 From Human to Viral Haplotype Reconstruction
        16.3.2 Viral Haplotype Inference Methods Overview
        16.3.3 Local Viral Haplotype Inference Approaches
        16.3.4 Quasispecies Assembly
        16.3.5 Recombinant Quasispecies Assembly



17 Reconstruction of Infectious Bronchitis Virus Quasispecies from NGS Data
Bassam Tork, Ekaterina Nenastyeva, Alexander Artyomenko, Nicholas Mancuso, Mazhar I. Khan, Rachel O’Neill, Ion I. Măndoiu, and Alexander Zelikovsky
    17.1 Introduction
    17.2 Background
        17.2.1 Infectious Bronchitis Virus
        17.2.2 High-Throughput Sequencing
    17.3 Methods
        17.3.1 Compared Methods
    17.4 Results and Discussion
        17.4.1 Data Sets
        17.4.2 Validation of Error Correction Methods
        17.4.3 Tuning, Comparison, and Validation of Methods for Quasispecies Reconstruction
    Acknowledgments
    References

18 Microbiome Analysis: State of the Art and Future Trends
Mitch Fernandez, Vanessa Aguiar-Pulido, Juan Riveros, Wenrui Huang, Jonathan Segal, Erliang Zeng, Michael Campos, Kalai Mathee, and Giri Narasimhan
    18.1 Introduction
    18.2 The Metagenomics Analysis Pipeline
    18.3 Data Limitations and Sources of Errors
        18.3.1 Designing Degenerate Primers for Microbiome Work
    18.4 Diversity and Richness Measures
    18.5 Correlations and Association Rules
    18.6 Microbial Functional Profiles
    18.7 Microbial Social Interactions and Visualizations
    18.8 Bayesian Inferences
    18.9 Conclusion
    References

INDEX


CONTRIBUTORS

Vanessa Aguiar-Pulido, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA

Sahar Al Seesi, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA

Alexander Artyomenko, Department of Computer Science, Georgia State University, Atlanta, GA, USA

Niko Beerenwinkel, Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland

Adrian Caciula, Department of Computer Science, Georgia State University, Atlanta, GA, USA

David S. Campo, Division of Viral Hepatitis, Centers for Disease Control and Prevention, Atlanta, GA, USA

Michael Campos, Miller School of Medicine, University of Miami, Miami, FL, USA

Stefan Canzar, Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, and Toyota Technological Institute at Chicago, Chicago, IL, USA

Jeong-Hyeon Choi, Cancer Center, Medical College of Georgia, Georgia Regents University, Augusta, GA, USA; Department of Biostatistics and Epidemiology, Medical College of Georgia, Georgia Regents University, Augusta, GA, USA

Chong Chu, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA


Zoya Dimitrova, Division of Viral Hepatitis, Centers for Disease Control and Prevention, Atlanta, GA, USA

Jorge Duitama, Agrobiodiversity Research Area, International Center for Tropical Agriculture (CIAT), Cali, Colombia

Eleazar Eskin, Department of Computer Science, University of California, Los Angeles, CA, USA

Mitch Fernandez, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA

Liliana Florea, Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA

Olga Glebova, Department of Computer Science, Georgia State University, Atlanta, GA, USA

Xuan Guo, Department of Computer Science, Department of Biology, Georgia State University, Atlanta, GA, USA

Steven J. Hallam, Graduate Program in Bioinformatics and Department of Microbiology and Immunology, University of British Columbia, Vancouver, BC, Canada

Niels W. Hanson, Graduate Program in Bioinformatics, University of British Columbia, Vancouver, BC, Canada

Elena Harris, Department of Computer Science, California State University, Chico, CA, USA

Wenrui Huang, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA

Mazhar I. Khan, Department of Pathobiology and Veterinary Science, University of Connecticut, Storrs, CT, USA

Yury Khudyakov, Division of Viral Hepatitis, Centers for Disease Control and Prevention, Atlanta, GA, USA

Kishori M. Konwar, Department of Microbiology and Immunology, University of British Columbia, Vancouver, BC, Canada

Bing Li, Department of Computer Science, Department of Biology, Georgia State University, Atlanta, GA, USA

James Lindsay, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA

Rasiah Loganantharaj, Bioinformatics Research Lab, The Center for Advanced Computer Studies, University of Louisiana, Lafayette, LA, USA

Stefano Lonardi, Department of Computer Science and Engineering, University of California, Riverside, CA, USA


Nicholas Mancuso, Department of Computer Science, Georgia State University, Atlanta, GA, USA

Ion I. Măndoiu, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA

Igor Mandric, Department of Computer Science, Georgia State University, Atlanta, GA, USA

Serghei Mangul, Department of Computer Science, University of California, Los Angeles, CA, USA

Tobias Marschall, Centrum Wiskunde & Informatica, Amsterdam, Netherlands

Kalai Mathee, Herbert Wertheim College of Medicine, Florida International University, Miami, FL, USA

Giri Narasimhan, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA

Ekaterina Nenastyeva, Department of Computer Science, Georgia State University, Atlanta, GA, USA

Rachel O’Neill, Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA

Yi Pan, Department of Computer Science, Department of Biology, Georgia State University, Atlanta, GA, USA

Sumathi Ramachandran, Division of Viral Hepatitis, Centers for Disease Control and Prevention, Atlanta, GA, USA

Thomas A. Randall, Integrative Bioinformatics, National Institute of Environmental Health Sciences, Research Triangle Park, NC, USA

Juan Riveros, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA

Alexander Schönhuth, Centrum Wiskunde & Informatica, Amsterdam, Netherlands

Jonathan Segal, Herbert Wertheim College of Medicine, Florida International University, Miami, FL, USA

Huidong Shi, Cancer Center, Medical College of Georgia, Georgia Regents University, Augusta, GA, USA; Department of Biochemistry, Medical College of Georgia, Georgia Regents University, Augusta, GA, USA

Pavel Skums, Division of Viral Hepatitis, Centers for Disease Control and Prevention, Atlanta, GA, USA

Ren Sun, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA, USA

Sing-Hoi Sze, Department of Computer Science and Engineering and Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX, USA


Yvette Temate-Tiagueu, Department of Computer Science, Georgia State University, Atlanta, GA, USA

Armin Töpfer, Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland

Bassam Tork, Department of Computer Science, Georgia State University, Atlanta, GA, USA

Nicholas C. Wu, Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA

Shang-Ju Wu, Department of Computer Science, University of British Columbia, Vancouver, BC, Canada

Yufeng Wu, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA

Ning Yu, Department of Computer Science, Department of Biology, Georgia State University, Atlanta, GA, USA

Alexander Zelikovsky, Department of Computer Science, Georgia State University, Atlanta, GA, USA

Erliang Zeng, Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA

Jin Zhang, McDonnell Genome Institute, Washington University in St. Louis, MO, USA


PREFACE

Massively parallel DNA sequencing and RNA sequencing have become widely available, reducing the cost by several orders of magnitude and placing the capacity to generate gigabases to terabases of sequence data into the hands of individual investigators. These so-called next-generation sequencing (NGS) technologies have dramatically accelerated biological and biomedical research by making the comprehensive analysis of genomes and transcriptomes inexpensive, routine, and widespread. The ensuing explosion in the volume of data has spurred numerous advances in computational methods for NGS data analysis.

This book aims to provide an in-depth survey of some of the most important recent developments in this area. It is intended neither as an introductory text nor as a comprehensive review of existing bioinformatics tools and active research areas in NGS data analysis. Rather, our intention is to make a carefully selected set of advanced computational techniques accessible to a broad readership, including graduate students in bioinformatics and related areas and biomedical professionals who want to expand their repertoire of computational techniques for NGS data analysis. We hope that our emphasis on in-depth presentation of both algorithms and software for computational data analysis of current high-throughput sequencing technologies will best prepare the readers for developing their own algorithmic techniques and for successfully implementing them in existing and novel NGS applications.

The book features 18 chapters authored by bioinformatics experts who are active contributors to the respective subjects. The chapters are intended to be largely independent, so that readers need not read every chapter or read them in a particular order. The chapters are grouped into the following four parts:

• Part I focuses on computing and experimental infrastructure for NGS data analysis, including chapters on cloud computing, a modular pipeline for metabolic pathway reconstruction, pooling strategies for massive viral sequencing, and high-fidelity sequencing protocols.

• Part II concentrates on analyses of DNA sequencing data and includes chapters on the classic scaffolding problem, detection of genomic variants, two chapters on finding insertions and deletions, and two chapters on the analysis of DNA methylation sequencing data.

• Part III is devoted to analyses of RNA-seq data. Two chapters describe algorithms and compare software tools for transcriptome assembly, one chapter focuses on methods for alternative splicing analysis, and another focuses on tools for transcriptome quantification and differential expression analysis.

• Part IV explores computational tools for NGS applications in microbiomics. The first chapter concentrates on error correction of NGS reads from viral populations, then two chapters describe methods for viral quasispecies reconstruction, and the last chapter surveys the state of the art and future trends in microbiome analysis.

We are grateful to all the authors for their excellent contributions, without which this book would not have been possible. We hope that their deep insights and fresh enthusiasm will help in attracting new generations of researchers to this dynamic field. We would also like to thank Yi Pan and Albert Y. Zomaya for nurturing this project since its inception, and the editorial staff at Wiley Interscience for their patience and assistance throughout the project. Finally, we wish to thank our friends and families for their continuous support.

Ion I. Măndoiu

Storrs, Connecticut

Alexander Zelikovsky

Atlanta, Georgia


ABOUT THE COMPANION WEBSITE

This book is accompanied by a companion website:

www.wiley.com/go/Mandoiu/NextGenerationSequencing

The companion website contains color versions of a few selected figures:

Figure 2.3, Figure 2.5, Figure 2.6, Figure 2.13, Figure 3.1, Figure 3.9,




Figure 18.1, Figure 18.2, Figure 18.3, Figure 18.4, Figure 18.5, Figure 18.7.


PART I

COMPUTING AND EXPERIMENTAL INFRASTRUCTURE FOR NGS
