download.e-bookshelf.de · 2016. 9. 15. · Title: Computational methods for next generation...
COMPUTATIONAL METHODS FOR NEXT GENERATION SEQUENCING DATA ANALYSIS
Wiley Series on
Bioinformatics: Computational Techniques and Engineering
A complete list of the titles in this series appears at the end of this volume.
COMPUTATIONAL METHODS FOR NEXT GENERATION SEQUENCING DATA ANALYSIS
Edited by
ION I. MĂNDOIU
ALEXANDER ZELIKOVSKY
Copyright © 2016 by John Wiley & Sons, Inc. All rights reserved
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Names: Măndoiu, I. Ion, editor of compilation. | Zelikovsky, Alexander, editor of compilation.
Title: Computational methods for next generation sequencing data analysis / edited by Ion I. Măndoiu, Alexander Zelikovsky.
Description: Hoboken, New Jersey : John Wiley & Sons, 2016. | Includes bibliographical references and index.
Identifiers: LCCN 2016010861 (print) | LCCN 2016014704 (ebook) | ISBN 9781118169483 (cloth) | ISBN 9781119272168 (pdf) | ISBN 9781119272175 (epub)
Subjects: LCSH: Nucleotide sequence–Methodology. | Nucleotide sequence–Data processing.
Classification: LCC QP620 .C648 2016 (print) | LCC QP620 (ebook) | DDC 611/.0181663–dc23
LC record available at http://lccn.loc.gov/2016010861
Cover image courtesy of Gettyimages/Andrew Brookes
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
CONTENTS IN BRIEF
CONTRIBUTORS
PREFACE
ABOUT THE COMPANION WEBSITE
PART I COMPUTING AND EXPERIMENTAL INFRASTRUCTURE FOR NGS
1 Cloud Computing for Next-Generation Sequencing Data Analysis
Xuan Guo, Ning Yu, Bing Li, and Yi Pan
2 Introduction to the Analysis of Environmental Sequence Information Using MetaPathways
Niels W. Hanson, Kishori M. Konwar, Shang-Ju Wu, and Steven J. Hallam
3 Pooling Strategy for Massive Viral Sequencing
Pavel Skums, Alexander Artyomenko, Olga Glebova, Sumathi Ramachandran, David S. Campo, Zoya Dimitrova, Ion I. Măndoiu, Alexander Zelikovsky, and Yury Khudyakov
4 Applications of High-Fidelity Sequencing Protocol to RNA Viruses
Serghei Mangul, Nicholas C. Wu, Ekaterina Nenastyeva, Nicholas Mancuso, Alexander Zelikovsky, Ren Sun, and Eleazar Eskin
PART II GENOMICS AND EPIGENOMICS
5 Scaffolding Algorithms
Igor Mandric, James Lindsay, Ion I. Măndoiu, and Alexander Zelikovsky
6 Genomic Variants Detection and Genotyping
Jorge Duitama
7 Discovering and Genotyping Twilight Zone Deletions
Tobias Marschall and Alexander Schönhuth
8 Computational Approaches for Finding Long Insertions and Deletions with NGS Data
Jin Zhang, Chong Chu, and Yufeng Wu
9 Computational Approaches in Next-Generation Sequencing Data Analysis for Genome-Wide DNA Methylation Studies
Jeong-Hyeon Choi and Huidong Shi
10 Bisulfite-Conversion-Based Methods for DNA Methylation Sequencing Data Analysis
Elena Harris and Stefano Lonardi
PART III TRANSCRIPTOMICS
11 Computational Methods for Transcript Assembly from RNA-Seq Reads
Stefan Canzar and Liliana Florea
12 An Overview and Comparison of Tools for RNA-Seq Assembly
Rasiah Loganantharaj and Thomas A. Randall
13 Computational Approaches for Studying Alternative Splicing in Nonmodel Organisms from RNA-Seq Data
Sing-Hoi Sze
14 Transcriptome Quantification and Differential Expression from NGS Data
Olga Glebova, Yvette Temate-Tiagueu, Adrian Caciula, Sahar Al Seesi, Alexander Artyomenko, Serghei Mangul, James Lindsay, Ion I. Măndoiu, and Alexander Zelikovsky
PART IV MICROBIOMICS
15 Error Correction of NGS Reads from Viral Populations
Pavel Skums, Alexander Artyomenko, Olga Glebova, David S. Campo, Zoya Dimitrova, Alexander Zelikovsky, and Yury Khudyakov
16 Probabilistic Viral Quasispecies Assembly
Armin Töpfer and Niko Beerenwinkel
17 Reconstruction of Infectious Bronchitis Virus Quasispecies from NGS Data
Bassam Tork, Ekaterina Nenastyeva, Alexander Artyomenko, Nicholas Mancuso, Mazhar I. Khan, Rachel O’Neill, Ion I. Măndoiu, and Alexander Zelikovsky
18 Microbiome Analysis: State of the Art and Future Trends
Mitch Fernandez, Vanessa Aguiar-Pulido, Juan Riveros, Wenrui Huang, Jonathan Segal, Erliang Zeng, Michael Campos, Kalai Mathee, and Giri Narasimhan
INDEX
CONTENTS
CONTRIBUTORS
PREFACE
ABOUT THE COMPANION WEBSITE
PART I COMPUTING AND EXPERIMENTAL INFRASTRUCTURE FOR NGS
1 Cloud Computing for Next-Generation Sequencing Data Analysis
Xuan Guo, Ning Yu, Bing Li, and Yi Pan
1.1 Introduction
1.2 Challenges for NGS Data Analysis
1.3 Background for Cloud Computing and Its Programming Models
1.3.1 Overview of Cloud Computing
1.3.2 Cloud Service Providers
1.3.3 Programming Models
1.4 Cloud Computing Services for NGS Data Analysis
1.4.1 Hardware as a Service (HaaS)
1.4.2 Platform as a Service (PaaS)
1.4.3 Software as a Service (SaaS)
1.4.4 Data as a Service (DaaS)
1.5 Conclusions and Future Directions
References
2 Introduction to the Analysis of Environmental Sequence Information Using MetaPathways
Niels W. Hanson, Kishori M. Konwar, Shang-Ju Wu, and Steven J. Hallam
2.1 Introduction & Overview
2.2 Background
2.3 MetaPathways Processes
2.3.1 Open Reading Frame (ORF) Prediction
2.3.2 Functional Annotation
2.3.3 Analysis Modules
2.3.4 ePGDB Construction
2.4 Big Data Processing
2.4.1 A Master–Worker Model for Grid Distribution
2.4.2 GUI and Data Integration
2.5 Downstream Analyses
2.5.1 Large Table Comparisons
2.5.2 Pathway Tools Cellular Overview
2.5.3 Statistical Analysis with R
2.5.4 Venn Diagram
2.5.5 Clustering and Relating Samples by Pathways
2.5.6 Faceting Variables with ggplot2
2.6 Conclusions
References
3 Pooling Strategy for Massive Viral Sequencing
Pavel Skums, Alexander Artyomenko, Olga Glebova, Sumathi Ramachandran, David S. Campo, Zoya Dimitrova, Ion I. Măndoiu, Alexander Zelikovsky, and Yury Khudyakov
3.1 Introduction
3.2 Design of Pools for Big Viral Data
3.2.1 Pool Design Optimization Formulation
3.2.2 Greedy Heuristic for VSPD Problem
3.2.3 The Tabu Search Heuristic for the OCBG Problem
3.3 Deconvolution of Viral Samples from Pools
3.3.1 Deconvolution Using Generalized Intersections and Differences of Pools
3.3.2 Maximum Likelihood k-Clustering
3.4 Performance of Pooling Methods on Simulated Data
3.4.1 Performance of the Viral Sample Pool Design Algorithm
3.4.2 Performance of the Pool Deconvolution Algorithm
3.5 Experimental Validation of Pooling Strategy
3.5.1 Experimental Pools and Sequencing
3.5.2 Results
3.6 Conclusion
References
4 Applications of High-Fidelity Sequencing Protocol to RNA Viruses
Serghei Mangul, Nicholas C. Wu, Ekaterina Nenastyeva, Nicholas Mancuso, Alexander Zelikovsky, Ren Sun, and Eleazar Eskin
4.1 Introduction
4.2 High-Fidelity Sequencing Protocol
4.3 Assembly of High-Fidelity Sequencing Data
4.3.1 Consensus Construction
4.3.2 Reads Mapping
4.3.3 Viral Genome Assembler (VGA)
4.3.4 Viral Population Quantification
4.4 Performance of VGA on Simulated Data
4.5 Performance of Existing Viral Assemblers on Simulated Consensus Error-Corrected Reads
4.6 Performance of VGA on Real HIV Data
4.6.1 Validation of de novo Consensus
4.7 Comparison of Alignment on Error-Corrected Reads
4.8 Evaluation of Error Correction Tools Based on High-Fidelity Sequencing Reads
Acknowledgment
References
PART II GENOMICS AND EPIGENOMICS
5 Scaffolding Algorithms
Igor Mandric, James Lindsay, Ion I. Măndoiu, and Alexander Zelikovsky
5.1 Scaffolding
5.2 State-of-the-Art Scaffolding Tools
5.2.1 SSPACE
5.2.2 OPERA
5.2.3 SOPRA
5.2.4 MIP
5.2.5 SCARPA
5.3 Recent Scaffolding Tools
5.3.1 SILP2
5.3.2 ScaffMatch
5.4 Scaffolding Software Evaluation
5.4.1 Data Sets
5.4.2 Quality Metrics
5.4.3 Evaluation and Comparison
References
6 Genomic Variants Detection and Genotyping
Jorge Duitama
6.1 Introduction
6.2 Methods for Detection and Genotyping of SNPs and Small Indels
6.2.1 Description of the Problem
6.2.2 Bayesian Model
6.2.3 Common Issues Affecting Genotype Quality
6.2.4 Population Variability
6.3 Methods for Detection and Genotyping of CNVs
6.3.1 Mean-Shift Approach for CNVs within a Sample
6.3.2 Identifying CNVs between Samples
6.4 Putting Everything Together
References
7 Discovering and Genotyping Twilight Zone Deletions
Tobias Marschall and Alexander Schönhuth
7.1 Introduction
7.1.1 Twilight Zone Deletions
7.2 Notation
7.2.1 Alignments
7.2.2 Gaps/Splits
7.2.3 Deletions
7.3 Non-Twilight-Zone Deletion Discovery
7.3.1 Internal Segment Size-Based Approaches
7.3.2 Split-Read Mapping Approaches
7.3.3 Hybrid Approaches
7.3.4 The “Twilight Zone”: Definition
7.4 Discovering “Twilight Zone” Deletions: New Solutions
7.4.1 CLEVER
7.4.2 MATE-CLEVER
7.4.3 Pindel
7.5 Genotyping “Twilight Zone” Deletions
7.5.1 A Maximum Likelihood Approach under Read Alignment Uncertainty
7.6 Results
7.6.1 Data Set
7.6.2 Tools
7.6.3 Discovery
7.6.4 Genotyping
7.7 Discussion
7.7.1 HiSeq
7.7.2 MiSeq
7.7.3 Conclusion
7.8 Availability
Acknowledgments
References
8 Computational Approaches for Finding Long Insertions and Deletions with NGS Data
Jin Zhang, Chong Chu, and Yufeng Wu
8.1 Background
8.2 Methods
8.2.1 Signatures of Long Indels in Sequence Reads
8.2.2 Methods for Discovering Long Indels without Exact Breakpoints
8.2.3 Methods for Discovering Long Indels with Exact Breakpoints
8.2.4 Combined Approaches
8.3 Applications
8.3.1 Population SV Calling
8.3.2 Cancer Genomics
8.4 Conclusions and Future Directions
Acknowledgment
References
9 Computational Approaches in Next-Generation Sequencing Data Analysis for Genome-Wide DNA Methylation Studies
Jeong-Hyeon Choi and Huidong Shi
9.1 Introduction
9.2 Enrichment-Based Approaches
9.2.1 Data Analysis Procedure
9.2.2 Available Approaches
9.3 Bisulfite Treatment-Based Approaches
9.3.1 Data Analysis Procedure
9.3.2 Available Approaches
9.4 Conclusion
References
10 Bisulfite-Conversion-Based Methods for DNA Methylation Sequencing Data Analysis
Elena Harris and Stefano Lonardi
10.1 Introduction
10.2 The Problem of Mapping BS-Treated Reads
10.3 Algorithmic Approaches to the Problem of Mapping BS-Treated Reads
10.4 Methylation Estimation
10.5 Possible Biases in Estimation of Methylation Level
10.6 Bisulfite Conversion Rate
10.7 Reduced Representation Bisulfite Sequencing
10.8 Accuracy as a Performance Measurement
References
PART III TRANSCRIPTOMICS
11 Computational Methods for Transcript Assembly from RNA-Seq Reads
Stefan Canzar and Liliana Florea
11.1 Introduction
11.2 De Novo Assembly
11.2.1 Preprocessing of Reads
11.2.2 The De Bruijn Graph for RNA-Seq Read Assembly
11.2.3 Contig Assembly
11.2.4 Filtering and Error Correction
11.2.5 Variations
11.3 Genome-Based Assembly
11.3.1 Candidate Isoforms
11.3.2 Minimality
11.3.3 Accuracy
11.3.4 Completeness
11.3.5 Extensions
11.4 Conclusions
Acknowledgment
References
12 An Overview and Comparison of Tools for RNA-Seq Assembly
Rasiah Loganantharaj and Thomas A. Randall
12.1 Quality Assessment
12.2 Experimental Considerations
12.3 Assembly
12.4 Experiment
12.5 Comparison
12.6 Results
12.7 Summary and Conclusion
Acknowledgments
References
13 Computational Approaches for Studying Alternative Splicing in Nonmodel Organisms from RNA-Seq Data
Sing-Hoi Sze
13.1 Introduction
13.1.1 Alternative Splicing
13.1.2 Nonmodel Organisms
13.1.3 RNA-Seq Data
13.2 Representation of Alternative Splicing
13.2.1 De Bruijn Graph
13.2.2 A Set of Transcripts
13.2.3 Splicing Graph
13.3 Comparison to Model Organisms
13.3.1 A Set of Transcripts
13.3.2 Splicing Graph
13.4 Accuracy of Algorithms
13.4.1 Assembly Results
13.4.2 mRNA BLAST Results
13.4.3 Alternative Splicing Junctions
13.5 Discussion
References
14 Transcriptome Quantification and Differential Expression from NGS Data
Olga Glebova, Yvette Temate-Tiagueu, Adrian Caciula, Sahar Al Seesi, Alexander Artyomenko, Serghei Mangul, James Lindsay, Ion I. Măndoiu, and Alexander Zelikovsky
14.1 Introduction
14.1.1 Motivation and Problems Description
14.1.2 RNA-Seq Protocol
14.2 Overview of the State-of-the-Art Methods
14.2.1 Quantification Methods
14.2.2 Differential Expression Methods
14.3 Recent Algorithms
14.3.1 SimReg: Simulated Regression Method for Transcriptome Quantification
14.3.2 Differential Gene Expression Analysis: IsoDE
14.4 Experimental Setup
14.4.1 Quantification Methods
14.4.2 Differential Expression Methods
14.5 Evaluation
14.5.1 Transcriptome Quantification Methods Evaluation
14.5.2 Differential Expression Methods Evaluation
Acknowledgments
References
PART IV MICROBIOMICS
15 Error Correction of NGS Reads from Viral Populations
Pavel Skums, Alexander Artyomenko, Olga Glebova, David S. Campo, Zoya Dimitrova, Alexander Zelikovsky, and Yury Khudyakov
15.1 Next-Generation Sequencing of Heterogeneous Viral Populations and Sequencing Errors
15.2 Methods and Algorithms for the NGS Error Correction in Viral Data
15.2.1 Clustering-Based Algorithms
15.2.2 k-Mer-Based Algorithms
15.2.3 Alignment-Based Algorithms
15.3 Algorithm Comparison
15.3.1 Benchmark Data
15.3.2 Results and Discussion
References
16 Probabilistic Viral Quasispecies Assembly
Armin Töpfer and Niko Beerenwinkel
16.1 Intra-Host Virus Populations
16.1.1 Viral Quasispecies
16.1.2 Fitness
16.1.3 HIV-1 as a Model System
16.1.4 Recombination
16.1.5 Clinical Implications
16.1.6 Genotyping
16.2 Next-Generation Sequencing for Viral Genomics
16.2.1 Library Preparation
16.2.2 Sequencing Approaches
16.2.3 Specialized Viral Sequencing Methods
16.2.4 Data Preprocessing and Read Alignment
16.2.5 Spatial Scales of Viral Haplotype Reconstruction
16.2.6 Quasispecies Assembly Performance
16.3 Probabilistic Reconstruction Methods
16.3.1 From Human to Viral Haplotype Reconstruction
16.3.2 Viral Haplotype Inference Methods Overview
16.3.3 Local Viral Haplotype Inference Approaches
16.3.4 Quasispecies Assembly
16.3.5 Recombinant Quasispecies Assembly
16.4 Conclusion
References
17 Reconstruction of Infectious Bronchitis Virus Quasispecies from NGS Data
Bassam Tork, Ekaterina Nenastyeva, Alexander Artyomenko, Nicholas Mancuso, Mazhar I. Khan, Rachel O’Neill, Ion I. Măndoiu, and Alexander Zelikovsky
17.1 Introduction
17.2 Background
17.2.1 Infectious Bronchitis Virus
17.2.2 High-Throughput Sequencing
17.3 Methods
17.3.1 Compared Methods
17.4 Results and Discussion
17.4.1 Data Sets
17.4.2 Validation of Error Correction Methods
17.4.3 Tuning, Comparison, and Validation of Methods for Quasispecies Reconstruction
Acknowledgments
References
18 Microbiome Analysis: State of the Art and Future Trends
Mitch Fernandez, Vanessa Aguiar-Pulido, Juan Riveros, Wenrui Huang, Jonathan Segal, Erliang Zeng, Michael Campos, Kalai Mathee, and Giri Narasimhan
18.1 Introduction
18.2 The Metagenomics Analysis Pipeline
18.3 Data Limitations and Sources of Errors
18.3.1 Designing Degenerate Primers for Microbiome Work
18.4 Diversity and Richness Measures
18.5 Correlations and Association Rules
18.6 Microbial Functional Profiles
18.7 Microbial Social Interactions and Visualizations
18.8 Bayesian Inferences
18.9 Conclusion
References
INDEX
CONTRIBUTORS
Vanessa Aguiar-Pulido, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA
Sahar Al Seesi, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA
Alexander Artyomenko, Department of Computer Science, Georgia State University, Atlanta, GA, USA
Niko Beerenwinkel, Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
Adrian Caciula, Department of Computer Science, Georgia State University, Atlanta, GA, USA
David S. Campo, Division of Viral Hepatitis, Centers for Disease Control and Prevention, Atlanta, GA, USA
Michael Campos, Miller School of Medicine, University of Miami, Miami, FL, USA
Stefan Canzar, Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, and Toyota Technological Institute at Chicago, Chicago, IL, USA
Jeong-Hyeon Choi, Cancer Center, Medical College of Georgia, Georgia Regents University, Augusta, GA, USA; Department of Biostatistics and Epidemiology, Medical College of Georgia, Georgia Regents University, Augusta, GA, USA
Chong Chu, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA
Zoya Dimitrova, Division of Viral Hepatitis, Centers for Disease Control and Prevention, Atlanta, GA, USA
Jorge Duitama, Agrobiodiversity Research Area, International Center for Tropical Agriculture (CIAT), Cali, Colombia
Eleazar Eskin, Department of Computer Science, University of California, Los Angeles, CA, USA
Mitch Fernandez, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA
Liliana Florea, Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
Olga Glebova, Department of Computer Science, Georgia State University, Atlanta, GA, USA
Xuan Guo, Department of Computer Science, Department of Biology, Georgia State University, Atlanta, GA, USA
Steven J. Hallam, Graduate Program in Bioinformatics and Department of Microbiology and Immunology, University of British Columbia, Vancouver, BC, Canada
Niels W. Hanson, Graduate Program in Bioinformatics, University of British Columbia, Vancouver, BC, Canada
Elena Harris, Department of Computer Science, California State University, Chico, CA, USA
Wenrui Huang, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA
Mazhar I. Khan, Department of Pathobiology and Veterinary Science, University of Connecticut, Storrs, CT, USA
Yury Khudyakov, Division of Viral Hepatitis, Centers for Disease Control and Prevention, Atlanta, GA, USA
Kishori M. Konwar, Department of Microbiology and Immunology, University of British Columbia, Vancouver, BC, Canada
Bing Li, Department of Computer Science, Department of Biology, Georgia State University, Atlanta, GA, USA
James Lindsay, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA
Rasiah Loganantharaj, Bioinformatics Research Lab, The Center for Advanced Computer Studies, University of Louisiana, Lafayette, LA, USA
Stefano Lonardi, Department of Computer Science and Engineering, University of California, Riverside, CA, USA
Nicholas Mancuso, Department of Computer Science, Georgia State University, Atlanta, GA, USA
Ion I. Măndoiu, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA
Igor Mandric, Department of Computer Science, Georgia State University, Atlanta, GA, USA
Serghei Mangul, Department of Computer Science, University of California, Los Angeles, CA, USA
Tobias Marschall, Centrum Wiskunde & Informatica, Amsterdam, Netherlands
Kalai Mathee, Herbert Wertheim College of Medicine, Florida International University, Miami, FL, USA
Giri Narasimhan, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA
Ekaterina Nenastyeva, Department of Computer Science, Georgia State University, Atlanta, GA, USA
Rachel O’Neill, Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
Yi Pan, Department of Computer Science, Department of Biology, Georgia State University, Atlanta, GA, USA
Sumathi Ramachandran, Division of Viral Hepatitis, Centers for Disease Control and Prevention, Atlanta, GA, USA
Thomas A. Randall, Integrative Bioinformatics, National Institute of Environmental Health Sciences, Research Triangle Park, NC, USA
Juan Riveros, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA
Alexander Schönhuth, Centrum Wiskunde & Informatica, Amsterdam, Netherlands
Jonathan Segal, Herbert Wertheim College of Medicine, Florida International University, Miami, FL, USA
Huidong Shi, Cancer Center, Medical College of Georgia, Georgia Regents University, Augusta, GA, USA; Department of Biochemistry, Medical College of Georgia, Georgia Regents University, Augusta, GA, USA
Pavel Skums, Division of Viral Hepatitis, Centers for Disease Control and Prevention, Atlanta, GA, USA
Ren Sun, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA, USA
Sing-Hoi Sze, Department of Computer Science and Engineering and Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX, USA
Yvette Temate-Tiagueu, Department of Computer Science, Georgia State University, Atlanta, GA, USA
Armin Töpfer, Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
Bassam Tork, Department of Computer Science, Georgia State University, Atlanta, GA, USA
Nicholas C. Wu, Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
Shang-Ju Wu, Department of Computer Science, University of British Columbia, Vancouver, BC, Canada
Yufeng Wu, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA
Ning Yu, Department of Computer Science, Department of Biology, Georgia State University, Atlanta, GA, USA
Alexander Zelikovsky, Department of Computer Science, Georgia State University, Atlanta, GA, USA
Erliang Zeng, Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA
Jin Zhang, McDonnell Genome Institute, Washington University in St. Louis, MO, USA
PREFACE
Massively parallel DNA sequencing and RNA sequencing have become widely available, reducing the cost of sequencing by several orders of magnitude and placing the capacity to generate gigabases to terabases of sequence data into the hands of individual investigators. These so-called next-generation sequencing (NGS) technologies have dramatically accelerated biological and biomedical research by making the comprehensive analysis of genomes and transcriptomes inexpensive, routine, and widespread. The ensuing explosion in the volume of data has spurred numerous advances in computational methods for NGS data analysis.
This book aims to provide an in-depth survey of some of the most important recent developments in this area. It is intended neither as an introductory text nor as a comprehensive review of existing bioinformatics tools and active research areas in NGS data analysis. Rather, our intention is to make a carefully selected set of advanced computational techniques accessible to a broad readership, including graduate students in bioinformatics and related areas and biomedical professionals who want to expand their repertoire of computational techniques for NGS data analysis. We hope that our emphasis on in-depth presentation of both algorithms and software for computational data analysis of current high-throughput sequencing technologies will best prepare the readers for developing their own algorithmic techniques and for successfully implementing them in existing and novel NGS applications.
The book features 18 chapters authored by bioinformatics experts who are active contributors to the respective subjects. The chapters are intended to be largely independent, so that readers need not read every chapter or read them in a particular order. The chapters are grouped into the following four parts:
• Part I focuses on computing and experimental infrastructure for NGS data analysis, including chapters on cloud computing, a modular pipeline for metabolic pathway reconstruction, pooling strategies for massive viral sequencing, and high-fidelity sequencing protocols.
• Part II concentrates on analyses of DNA sequencing data and includes chapters on the classic scaffolding problem, detection of genomic variants, two chapters on finding insertions and deletions, and two chapters on the analysis of DNA methylation sequencing data.
• Part III is devoted to analyses of RNA-seq data. Two chapters describe algorithms and compare software tools for transcriptome assembly, one chapter focuses on methods for alternative splicing analysis, and another focuses on tools for transcriptome quantification and differential expression analysis.
• Part IV explores computational tools for NGS applications in microbiomics. The first chapter concentrates on error correction of NGS reads from viral populations, then two chapters describe methods for viral quasispecies reconstruction, and the last chapter surveys the state of the art and future trends in microbiome analysis.
We are grateful to all the authors for their excellent contributions, without which this book would not have been possible. We hope that their deep insights and fresh enthusiasm will help in attracting new generations of researchers to this dynamic field. We would also like to thank Yi Pan and Albert Y. Zomaya for nurturing this project since its inception, and the editorial staff at Wiley Interscience for their patience and assistance throughout the project. Finally, we wish to thank our friends and families for their continuous support.
Ion I. Măndoiu
Storrs, Connecticut
Alexander Zelikovsky
Atlanta, Georgia
ABOUT THE COMPANION WEBSITE
This book is accompanied by a companion website:
www.wiley.com/go/Mandoiu/NextGenerationSequencing
The book companion website contains color versions of the following selected figures:
Figure 2.3, Figure 2.5, Figure 2.6, Figure 2.13, Figure 3.1, Figure 3.9,
Figure 7.5, Figure 8.3, Figure 8.4, Figure 9.4, Figure 9.8, Figure 9.9,
Figure 9.12, Figure 9.14, Figure 12.3, Figure 12.4, Figure 12.5, Figure 15.3,
Figure 16.1, Figure 16.6, Figure 16.7, Figure 16.11, Figure 16.12, Figure 16.13,
Figure 18.1, Figure 18.2, Figure 18.3, Figure 18.4, Figure 18.5, Figure 18.7.
PART I
COMPUTING AND EXPERIMENTAL INFRASTRUCTURE FOR NGS