Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis
![Page 1: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/1.jpg)
Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis
Han Liang, Ph.D.
Department of Bioinformatics and Computational Biology
3/25/2014 @ Rice University
![Page 2: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/2.jpg)
Outline
• History
• NGS Platforms
• Applications
• Bioinformatics Analysis
• Challenges
![Page 3: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/3.jpg)
Central Dogma
![Page 4: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/4.jpg)
Sanger sequencing
• DNA is fragmented
• Cloned to a plasmid vector
• Cyclic sequencing reaction
• Separation by electrophoresis
• Readout with fluorescent tags
![Page 5: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/5.jpg)
‘Sanger sequencing’ has been the only DNA sequencing method for 30 years, but there is a hunger for ever greater sequencing throughput and more economical sequencing technology.
NGS can process millions of sequence reads in parallel rather than 96 at a time (at about 1/6 of the cost).
Objections: fidelity, read length, infrastructure cost, and the need to handle large volumes of data.
![Page 6: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/6.jpg)
• Roche/454 FLX: 2004
• Illumina Solexa Genome Analyzer: 2006
• Applied Biosystems SOLiD™ System: 2007
• Helicos HeliScope™: recently available
• Pacific Biosciences SMRT: launching 2010
![Page 7: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/7.jpg)
Rapidly Falling Cost
![Page 8: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/8.jpg)
Three Leading Sequencing Platforms
• Roche 454
• Illumina Solexa
• Applied Biosystems SOLiD
![Page 9: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/9.jpg)
The general experimental procedure
Wang et al. Nature Reviews Genetics 2009
![Page 10: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/10.jpg)
454
bead microreactor
Mardis, Annu. Rev. Genomics Hum. Genet. 2008
![Page 11: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/11.jpg)
![Page 12: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/12.jpg)
Illumina (Solexa)
Bridge amplification
Mardis, Annu. Rev. Genomics Hum. Genet. 2008
![Page 13: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/13.jpg)
SOLiD
color coding
Mardis, Annu. Rev. Genomics Hum. Genet. 2008
![Page 14: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/14.jpg)
Comparison of existing methods
![Page 15: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/15.jpg)
Real Data – nucleotide space
@SRR002051.1 :8:1:325:773 length=33
AAAGAACATTAAAGCTATATTATAAGCAAAGAT
+SRR002051.1 :8:1:325:773 length=33
IIIIIIIIIIIIIIIIIIIIIIIII'II@I$)[…]
@SRR002051.2 :8:1:409:432 length=33
AAGTTATGAAATTGTAATTCCAATATCGTAAGC
+SRR002051.2 :8:1:409:432 length=33
[…]
@SRR002051.3 :8:1:488:490 length=33
AATTTCTTACCATATTAGACAAGGCACTATCTT
+SRR002051.3 :8:1:488:490 length=33
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII&I
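Records like the ones on this slide follow the four-line FASTQ layout: a header starting with `@`, the bases, a `+` separator line, and one quality character per base. The sketch below is a minimal parser, not a replacement for an established library. It assumes the Phred+33 quality encoding (an assumption; older Solexa pipelines used other offsets), and the names `FastqRecord`, `parse_fastq`, `phred_scores` and `error_probability` are illustrative only. It uses the third record from the slide as input.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield records from 4-line FASTQ text (header, bases, '+' line, qualities).

    Blank lines are skipped; an incomplete trailing record is ignored.
    """
    lines = [line.rstrip("\n") for line in text.splitlines() if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], seq, qual)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert quality characters to Phred scores (assumes Phred+33 by default)."""
    return [ord(char) - offset for char in quality]


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)


EXAMPLE = """\
@SRR002051.3 :8:1:488:490 length=33
AATTTCTTACCATATTAGACAAGGCACTATCTT
+SRR002051.3 :8:1:488:490 length=33
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII&I
"""

records = list(parse_fastq(EXAMPLE))
scores = phred_scores(records[0].quality)
# Report the positions whose estimated error rate exceeds 1%.
low_quality = [i for i, q in enumerate(scores) if error_probability(q) > 0.01]
print(records[0].name, low_quality)
```

Under the Phred+33 assumption, `I` corresponds to Q40 (about 1 error in 10,000) and `&` to Q5, so the single `&` near the end of the read flags one unreliable base call.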
![Page 16: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/16.jpg)
Real Data – color space
• SOLiD Data
>1_24_47_F3
T1.1.23..0120230.320033300030030010022.00.0201.0201
>1_24_52_F3
T2.3.21..2122321.213110332101132321002.11.0111.1222
>1_24_836_F3
T0.2.22..2222222.010203032021102220200.01.2211.2211
>1_24_1404_F3
T2.3.30..2013222.222103131323012313233.22.2220.0213
>1_25_202_F3
T0.3213.111202312203021101111330201000313.121122211
>1_25_296_F3
T0.1130.100123202213120023121112113212121.013301210
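In color space each digit describes a transition between two adjacent bases rather than a base itself, and the leading letter is the last base of the sequencing primer. A minimal sketch of the commonly described two-base encoding follows, assuming the 2-bit codes A=0, C=1, G=2, T=3, so that a color is the XOR of the codes of neighboring bases. The function names are illustrative, and `.` is treated as a missing color call, after which bases cannot be recovered.

```python
BASES = "ACGT"
CODE = {base: index for index, base in enumerate(BASES)}  # A=0, C=1, G=2, T=3


def encode_colors(primer_base: str, sequence: str) -> str:
    """Encode nucleotides as a color-space read: primer base, then one color per base."""
    colors = []
    previous = CODE[primer_base]
    for base in sequence:
        current = CODE[base]
        colors.append(str(previous ^ current))  # color = XOR of adjacent base codes
        previous = current
    return primer_base + "".join(colors)


def decode_colors(read: str) -> str:
    """Decode a color-space read into bases (primer base excluded).

    Every base depends on the previous one, so all positions from the first
    missing call ('.') onward are reported as 'N'.
    """
    previous = CODE[read[0]]
    decoded = []
    known = True
    for color in read[1:]:
        if color == "." or not known:
            known = False
            decoded.append("N")
            continue
        previous ^= int(color)
        decoded.append(BASES[previous])
    return "".join(decoded)


print(encode_colors("T", "ACGT"))   # T3131
print(decode_colors("T3131"))       # ACGT
print(decode_colors("T0.3213"))     # TNNNNN
```

The last call shows why naive translation is fragile: one missing or wrong color corrupts every downstream base, which is one reason SOLiD reads are usually aligned in color space rather than decoded first.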
![Page 17: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/17.jpg)
Data output difference among the three platforms
• Nucleotide space vs. color space
• Length of short reads: 454 (400~500 bp) > SOLiD (70 bp) ~ Solexa (36~120 bp)
![Page 18: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/18.jpg)
Applications with “Digital output”
• De novo genome assembly
• Genome re-sequencing
• RNA-Seq (gene expression, exon-intron structure, small RNA profiling, and mutation)
• ChIP-Seq (protein-DNA interaction)
• Epigenetic profiling
![Page 19: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/19.jpg)
• Degraded state of the sample: mtDNA sequencing
• Nuclear genomes of ancient remains: cave bear, mammoth, Neanderthal (10^6 bp)
Problems: contamination by modern human DNA and co-isolation of bacterial DNA
![Page 20: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/20.jpg)
![Page 21: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/21.jpg)
• Key part in regulating gene expression
• ChIP: a technique for studying DNA-protein interactions
• Recently genome-wide ChIP-based studies of DNA-protein interactions
• Readout of ChIP-derived DNA sequences onto NGS platforms
• Insights into transcription factor/histone binding sites in the human genome
• Enhance our understanding of the gene expression in the context of specific environmental stimuli
![Page 22: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/22.jpg)
• The presence of ncRNAs in a genome is difficult to predict computationally with high certainty because of their evolutionary diversity
• Detecting expression-level changes that correlate with changes in environmental factors, with disease onset and progression, and with complex disease onset or severity
• Enhances the annotation of sequenced genomes (making the impact of mutations more interpretable)
![Page 23: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/23.jpg)
• Characterizing the biodiversity found on Earth
• The growing number of sequenced genomes enables us to interpret partial sequences obtained by direct sampling of specific environmental niches.
• Examples: ocean, acid mine site, soil, coral reefs, and the human microbiome, which may vary according to the health status of the individual
![Page 24: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/24.jpg)
• Common variants have not yet completely explained complex disease genetics; rare alleles also contribute
• Also structural variants, large and small insertions and deletions
• Accelerating biomedical research
![Page 25: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/25.jpg)
• Enables the study of genome-wide patterns of methylation and how these patterns change through the course of an organism’s development.
• Enhanced potential to combine the results of different experiments, for example correlative analyses of genome-wide methylation, histone binding patterns and gene expression.
![Page 26: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/26.jpg)
Kahvejian et al. 2008
Integrating Omics
mRNA expression
Alternative Splicing
microRNA expression
Protein-DNA interaction
Mutation discovery
Copy number variation
![Page 27: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/27.jpg)
Data Analysis Flow
SOLiD machine: Raw data
Central Server: Basic processing (decoding, filtering and mapping)
Local Machine: Downstream analysis
![Page 28: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/28.jpg)
Short Read Mapping
• DNA-Resequencing: BLAST-like approach
• RNA-Seq
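A BLAST-like approach to resequencing can be sketched as seed-and-extend: index every k-mer of the reference, look up exact seeds taken from the read, then verify the full read at each candidate position. The code below is a toy illustration under stated assumptions (forward strand only, no gaps, a made-up reference string, hypothetical function names); it is not how any particular production aligner is implemented.

```python
from collections import defaultdict
from typing import Dict, List


def build_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer in the reference to the positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read: str, reference: str, index: Dict[str, List[int]],
             k: int, max_mismatches: int = 1) -> List[int]:
    """Return sorted reference positions where the read aligns without gaps."""
    hits = set()
    # Non-overlapping seeds: if the read has fewer mismatches than seeds,
    # at least one seed must match exactly (pigeonhole argument).
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)


# Toy reference, invented for this example.
REFERENCE = "ACGTACGTTAGCCGATAGGCTTAACG"
K = 4
INDEX = build_index(REFERENCE, K)

exact_read = "TAGCCGATAGGC"      # copied from reference positions 8..19
mutated_read = "TAGCCGTTAGGC"    # same read with one substitution

print(map_read(exact_read, REFERENCE, INDEX, K))                       # [8]
print(map_read(mutated_read, REFERENCE, INDEX, K))                     # [8]
print(map_read(mutated_read, REFERENCE, INDEX, K, max_mismatches=0))   # []
```

Tolerating a mismatch matters for resequencing because the differences between the sample and the reference are exactly what SNP calling looks for later.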
![Page 29: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/29.jpg)
![Page 30: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/30.jpg)
![Page 31: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/31.jpg)
Read length and pairing
• Short reads are problematic, because short sequences do not map uniquely to the genome.
• Solution #1: Get longer reads.
• Solution #2: Get paired reads.
ACTTAAGGCTGACTAGC TCGTACCGATATGCTG
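The benefit of pairing can be illustrated with a toy example: a short read that falls in a repeat matches several places, but requiring its mate to land within the expected insert size leaves only one consistent placement. The sketch assumes exact matching and both mates on the forward strand (real paired-end libraries report the mate reverse-complemented); the reference string, insert range and function names are invented for illustration.

```python
from typing import List, Tuple


def find_all(reference: str, read: str) -> List[int]:
    """Return every start position of an exact match, including overlapping ones."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def paired_positions(reference: str, read1: str, read2: str,
                     min_insert: int, max_insert: int) -> List[Tuple[int, int]]:
    """Return (pos1, pos2) placements whose outer span fits the insert-size range."""
    pairs = []
    for p1 in find_all(reference, read1):
        for p2 in find_all(reference, read2):
            insert = p2 + len(read2) - p1  # distance from start of read1 to end of read2
            if p2 > p1 and min_insert <= insert <= max_insert:
                pairs.append((p1, p2))
    return pairs


# Toy reference: the repeat ACTTAAGG occurs twice, the mate TCGTACCG once.
REFERENCE = "ACTTAAGG" + "C" * 10 + "ACTTAAGG" + "G" * 5 + "TCGTACCG"

print(find_all(REFERENCE, "ACTTAAGG"))                              # [0, 18]
print(paired_positions(REFERENCE, "ACTTAAGG", "TCGTACCG", 15, 25))  # [(18, 31)]
```

Alone, the first read is ambiguous between positions 0 and 18; with the mate and an insert size of 15 to 25 bp, only the placement at 18 survives.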
![Page 32: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/32.jpg)
Post-alignment Analysis
• DNA-Seq: SNP calling
• RNA-Seq: Quantifying gene expression level
![Page 33: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/33.jpg)
Concepts
The reference genome: hg19 (GRCh37)
Main assembly: Chr1-22, X, and Y
3,095,677,412 bp
Target region: exome
Ensembl: 85.3 million bp (2.94%)
RefSeq: 67.7 million bp (2.34%)
CCDS: 31,266,049 bp (1.08%), consisting of 185,446 non-redundant exons
![Page 34: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/34.jpg)
Target Coverage
![Page 35: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/35.jpg)
![Page 36: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/36.jpg)
SOLiD
color coding
Mardis, Annu. Rev. Genomics Hum. Genet. 2008
![Page 37: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/37.jpg)
SNP calling
![Page 38: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/38.jpg)
![Page 39: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/39.jpg)
Array-based High-throughput Dataset
![Page 40: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/40.jpg)
Limitations of hybridization-based approach
• Reliance on existing knowledge about genome sequence
• Background noise and a limited dynamic detecting range
• Cross-experiment comparison is difficult
• Requiring complicated normalization methods
Wang et al. Nature Reviews Genetics 2009
![Page 41: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/41.jpg)
Quantifying gene expression using RNA-Seq data
RPKM: Reads Per Kilobase of exon length per Million mapped reads
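Written out, the definition is RPKM = 10^9 × C / (N × L), where C is the number of reads mapped to a gene's exons, N is the total number of mapped reads in the sample, and L is the total exon length of the gene in base pairs. The factor 10^9 combines the per-kilobase (10^3) and per-million (10^6) scalings. The short sketch below applies the formula; the function name and the example numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def rpkm(read_count: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase of exon length per Million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical gene: 1,500 reads on 2,000 bp of exons, 75 million mapped reads.
value = rpkm(read_count=1500, exon_length_bp=2000, total_mapped_reads=75_000_000)
print(value)  # 10.0

# Doubling both the gene length and the read count leaves RPKM unchanged,
# which is the point of the length normalization.
same = rpkm(read_count=3000, exon_length_bp=4000, total_mapped_reads=75_000_000)
print(same)   # 10.0
```

Normalizing by length allows comparison between genes within a sample, and normalizing by total mapped reads allows comparison of the same gene across samples sequenced to different depths.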
![Page 42: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/42.jpg)
Large Dynamic Range
Mortazavi et al. Nature Methods 2008
![Page 43: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/43.jpg)
High reproducibility
Mortazavi et al. Nature Methods 2008
![Page 44: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/44.jpg)
High Accuracy
Wang et al. Nature 2008
![Page 45: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/45.jpg)
Advantages of RNA-Seq
• Not limited to the existing genomic sequence
• Very low (if any) background signal
• Large dynamic detecting range
• High reproducibility
• High accuracy
• Less sample required
• Low cost per base
Wang et al. Nature Reviews Genetics 2009
![Page 46: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/46.jpg)
Huge amount of data!
• For a typical RNA-Seq SOLiD run:
  ~2 TB of image files
  ~120 GB of text files for downstream analysis
  ~75 M short reads per sample
Efficient methods for data storage and management
![Page 47: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/47.jpg)
Considerable sequencing error
High-quality image analysis for base calling
![Page 48: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/48.jpg)
Genome alignment and assembly: time consuming and memory demanding
• To perform genome mapping for SOLiD data
32-Opteron HP DL785 with 128 GB of RAM: 12~14 hours per sample
High-performance parallel computing
![Page 49: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/49.jpg)
Bioinformatics Challenges
• Efficient methods to store, retrieve and process huge amounts of data
• Reducing errors in image analysis and base calling
• Fast and accurate methods for genome alignment and assembly
• New algorithms in downstream analyses
![Page 50: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis](https://reader036.fdocuments.us/reader036/viewer/2022062323/568152d6550346895dc0f324/html5/thumbnails/50.jpg)
Experimental Challenges
Library fragmentation
Strand specific
Wang et al. Nature Reviews Genetics 2009