Sequence Assembly and Next Generation Sequencing...
Transcript of Sequence Assembly and Next Generation Sequencing...
Sequence Assembly and Next Generation Sequencing Informatics
CBPS7711Oct 11 2011Oct 11, 2011
Sonia Leach, PhDSonia Leach, PhDAssistant Professor
Center for Genes, Environment, and Health
National Jewish Health
Center for Genes, Environment, and Health
Next Generation Sequencing
• Increasing PresenceIncreasing Presence– ISMB 2010 Tutorial on Next Gen Sequencing
– ECCB 2010 Tutorial on Next Gen Sequencingq g
– RECOMB 2011 focus on Next Gen Sequencing
• Current Trends/Discussions– ChIP-seq, RNA-seq (mRNA, miRNA), Methyl-Seq
– Whole-genome sequencing
– Single molecule sequencing technologies
2Center for Genes, Environment, and Health
Introduction• Sequencing technologies
– 1st Gen: Gel based sequencing and Capillary electrophoresis
Introduction
– 2nd Gen: Next-Generation Sequencers:• Sequencing by Synthesis: Pyrosequencing with ePCR (Roche 454), Reversible Dye-termination with
cluster/bridge amplification (Illumina Solexa)
• Sequencing by Ligation: ePCR and diColor system (ABI SOLiD)
– 3rd Generation Sequencers: • Single molecule sequencing (ABI SMS, PacBio SMRT, Helicos), nanopore sequencing, …
• De novo assembly versus mapping to reference sequence– Human Genome Project (Hierarchical versus Shotgun Sequencing)– Human Genome Project (Hierarchical versus Shotgun Sequencing)
• Contig assembly and ordering
– Mapping/Alignment Programs• Repetitive regions (interspersed repeats (SINEs, LINEs, LTR elements, and DNA transposons),
satellite sequences, and low-complexity sequences)satellite sequences, and low complexity sequences)
• Polymorphisms and structural variants (insertions, deletions, substitutions, inversions, translocation)
• Sequencing applications– ChIP Seq – transcription factor binding and chromatin modification
– RNA Seq – transcript detection (boundaries, novel exons), quantification
– Whole Genome/Exome Sequencing - sequence variant detection
3Center for Genes, Environment, and Health
Early Sequencing Technologies (1977)• Maxam-Gilbert Sequencing
– double-stranded DNA segment ends labeled with 32P, divided into 4 samples, treated with chemical specific to 1-2 of 4 bases to nick DNA at few sites, treated with piperidine to destroy nicked base, creates fragments of length from 32P to nicked base, run fragments in 4 lanes on acrylamide gel to separate by size, , g y g p y ,auto-radiograph, read sequence off gel
– fallen out of favor due to technical complexity prohibiting its use in standard molecular biology kits, extensive use of hazardous chemicals, and difficulties with scale-upwith scale up
• Sanger Sequencing (chain termination method)– single-stranded DNA template and short primer for one end, add polymerase
with normal bases (dNTPs) and special dideoxynucleotides (ddNTPs) that cannot ( ) p y ( )extend chain at certain ratio to ensure variable length fragments, each ending with ddNTP is incorporated. Create one sample with each ddNTP flavor, run in 4 lanes on gel,, auto-radiograph and read sequence off gel
• Gilbert and Sanger shared 1980 Nobel Prize in Chemistry for • Gilbert and Sanger shared 1980 Nobel Prize in Chemistry for sequencing
4Center for Genes, Environment, and Health
Maxam-Gilbert Sequencing Sanger Sequencing
5From Recombinant DNA, 2nd Ed, Watson et al
Evolution of Sequencing
Gel Sequencing and
radioactive tags
Capillary Sequencing
and fluorescent ttags
Adapted from D. Pollock
Base call Quality• Used to assess sequence
quality and determine quality and determine accurate consensus sequence (originally from Phred program 1998).
Q alit scores Q • Quality scores Q logarithmically linked to base-calling error probabilities P:
Q = 10 log PQ = -10 log10 P
• Thus 10 means 1 in 10 probability base called incorrectly, 20 means 1 in 100 ( 99% b ll 100 (or 99% base call accuracy), 30 means 1 in 1000 (or 99.9%). Typical Q20 or Q30 for ‘high
li ’ b (
7Center for Genes, Environment, and Health
Roseisis 21:57, 22 February 2007 (UTC), image created by authors,quality’ bases (max out at 35-40, <10 unusable)
Rate of sequencing over past 30 years
8Center for Genes, Environment, and HealthMR Stratton et al. Nature 458, 719-724 (2009)
Evolution of SequencingRoche 454 (2005)
Gel Sequencing and
radioactive tags
Capillary Sequencing
and fluorescent ttags
Next GenerationSequencing
Illumina GAII (2006)
Sequencing
ABI SOLiD (2007)
Adapted from D. Pollock
10Center for Genes, Environment, and Health
• Adapter-ligated DNA fragments (400-500 bp) fixed
Roche 454
to small DNA-capture beads in a water-in-oil emulsion for emulsion-based PCR in PicoTiterPlate, a fiber optic
i i i http://www dkfz de/gpcf/886 htmlchip, mixed with Enzyme Beads containing also ATP sulfurylase + luciferase.
• 4 DNA nucleotides added
http://www.dkfz.de/gpcf/886.html
sequentially in fixed order, extending DNA strand, 'sequencing-by-synthesis’ (pyrosequencing) whereby pyrophosphate released upon nucleotide incorporation. Signal strength proportional to number of nucleotides -linear up to 8 homopolymers(3.3% error A5,~50% for A8)
11Center for Genes, Environment, and HealthRoche / 454 FLX
Roche 454
12Center for Genes, Environment, and Health
http://mammoth.psu.edu/howToSeqMammoth.html
Roche 454Roche 454
13Center for Genes, Environment, and Health
http://mammoth.psu.edu/howToSeqMammoth.html
Are homopolymers a problem, really?TCR alpha and Ig Light Chain
V’s CJ’sGermline DNA
Somatic DNA
mRNA
• T-cell receptor alpha chain (John Kappler NJH)
• Gene undergoes somatic mutation in T-cells so every cell has different VJ combination from original germlinehas different VJ combination from original germlinetemplate
• Each mRNA sequenced has one V and one J region, must q J g ,map each separately to ‘reference’
14Center for Genes, Environment, and Health
Are homopolymers a problem, really?
• Representative Representative sequences for J
• Note high sequence g qsimilarity among different J and many homopolymerstretchesstretches
15Center for Genes, Environment, and Health
Ill i GAIIIllumina GAII• 'PCR on a chip' - primer-ligated fragments (1 on each end) hybridize to
slide which has complement of both primers, DNA extended, original molecule denatured and washed away. Extended molecule bends and hybrides to 2nd PCR primer (bridge amplification). Molecule extended on bridge, denatured, rebridged, extended, denatured, etc for 35 cycles, slide contains 'cluster' of double-stranded bridges - denatured
– For excellent graphical explanation of bridge amplication: http://seq.molbiol.ru/sch_clon_ampl.html
• Sequencing by Synthesis, with reversible termination so have competitive incorporation of all 4 bases per cycle, compile images across cycles to get sequence
– Technical issues: dephasing, error increases with length, polymerase problems with transversions/transitions
Illumina / Solexa
16Center for Genes, Environment, and Health
Genetic Analyzer
http://www.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf
ABI SOLiD
Note: unlike 454/Illumina paired-ends, mate-pairs sequenced in same direction
Applied BiosystemsSOLiD
17Center for Genes, Environment, and Health
SOLiD
Images copyright 2009 ABI
ABI SOLiD• Sequencing by
ligation with ligation with fluorescently-labeled di-base probes compete (2 bases, 3 (degenerate N, 3 universal Z)
• Successive cycles h di b means each di-base
seen twice.
Applied BiosystemsSOLiD
18Center for Genes, Environment, and Health
SOLiD
Images copyright 2009 ABI
ABI SOLiD• Di-base encoding allows accurate and fast detection/disambiguation
f i / lid l hiof sequencing errors/valid polymorphisms
Applied BiosystemsSOLiD
19Center for Genes, Environment, and Health
SOLiD
Images copyright 2009 ABI
Paired End/Mate Pair Reads• Illumina: paired end is both ends
Known Distancep
of ~300 bp fragment (shorter than a 454 read, shorter than most TEs)
454 i d d 3 8 20 Kb
Read 1 Read 2
• 454: paired ends span ~3,8,20 Kb
• SOLiD: paired ends span ~1, 2, 3, 5, 10 Kb5, 0 b
http://obi-linux.lbl.gov/alexa-seq-lbl.org/alexa_seq/BCCL/MCF7.htm
Paired read often map uniquely
20Center for Genes, Environment, and Health
Single reads can map to multiple positions Adapted from D. Pollock
Evolution of Sequencing3rd Generation
Sequencing(Single Molecule)
Adapted from D. Pollock
3rd-Generation Sequencing• Real time Single Molecule/Nanopore:
polymerase and single DNA molecule fixed to f E h f 4 DNA b l b ll d i h surface. Each of 4 DNA bases labelled with
fluorescent dye and when incorporated by polymerase, generates flash of light. System records time and color series of flashes (Pacific Biosystems, Helicos, ABI, Illumina)y , , , )
22Center for Genes, Environment, and Health
Example Single Molecule• Currently plagued by high
t (d h t error rates (dye perhaps not incorporated on dNTP, reaction too slow or fast for camera, etc) causes errors and phasing
bl i 2nd ti problems, as in 2nd generation technologies
– PacBio approaches problem by creating circular construct with DNA for repeated re-reads of same pmolecule, Helicos 2-pass sequencing
• Because real-time capture of incorporation, can distinquish nucleotide type by lag-time between incorporation (PacBio reports exciting implications for detecting methylated C, or A ( l ) ) l b A (plants) …), also strobe sequencing - PacBio
23Center for Genes, Environment, and Healthhttp://www.helicosbio.com/Technology/TrueSingleMoleculeSequencing/tabid/64/Default.aspx
C i f N G T h l iComparison of Next-Gen TechnologiesSanger Roche 454
GS FLXIllumina GAII
ABI SOLiD Illumina HiSeq
Helicos tSMS
S l i 1 f / 1 f / i d 2 f / 5 20 2 f lSample requirement 1 µg frag/ 5µg paired
<1 µg frag/paired 2 µg frag/ 5‐20 µg paired
<2 µg frag only
Time for library prep
- 3-4 days 2 days 2-4.5 days <6 hours 1 day
Amplification - bead-based Bridge Bead-based Bridge n/a (single method emulsion PCR amplification on
flow cellemulsion PCR amplification on
flow cellmolecule sequencing)
Time for amplification
- 2 days 4 hours 2 days
Sequencing method Dye pyrosequencing Reversible dye Sequence-by- Reversible dye Virtual Sequencing method Dye terminators
pyrosequencing Reversible dye terminators SBS
Sequence byligation (dibase)
Reversible dye terminators
Virtual terminator SBS
Read length 800 bp 400 bases (800 by 12/2010)
36-100 bases 36/ 2x50/ 2x100 bases
45-50 bases
Time for sequencer 3 hours 10 hours 2.5 days 6-10 days 1.5 -8 days 8-9 days
Total bases per run 800 bp 400-500 Mb 3-6 Gb 10-20 Gb 26-200 Gb 21-28 Gb
Total bases per day 6400 bp ~1 Gb 1.5 Gb 1.7-2 Gb 25 Gb (2x100) 2.5 Gb
Cost per run $4-5 for ~750bp
~$8k ~$9k $17k < $10k(2 human, 30x)
24Center for Genes, Environment, and Health
Cost per Mb ~$84.39 ~$5.97 ~$5.80
Data per run 15 Gb 1 TB 15 TB
Compiled from Turner et al, Mamm Genome. 2009; Voelkerding et al, Clin Chem 2009; Illumina.com 2010; wikipedia 2008
PR Space versus Science Space• Alignment problems from length - 97% E coli uniquely aligned
with 18bp reads, only 90% of humna with 30bp reads – see Voelkerding, 2009
• Flow and phasing - skip basecall in cycle
• Data quality and Error rateData quality and Error rate– Variation along sequence (higher error at 3’ end as reagents begin to fail, error for next
base in elongation depends on identity of previous base, homomers)
– Quality scores (Equivalence? – can use GATK for calibration)
e
• Length distribution versus average– mate pair/paired end insert size
• Coverage versus GC content Log (cov wind/cov chrom)GC
per
cen
tag
Coverage versus GC content
• Raw versus recovered sequence - Uncallable bases waste reads -How much coverage with different methods? What about multi-mapped reads?
T i (b d ) d l i l i
Log (cov wind/cov chrom)
ABI BioScope Manual June 2010
25Center for Genes, Environment, and HealthAdapted from D. Pollock
• Tagging (barcodes) and multiplexing - Variation in coverage for each individual in multiplex
Comparison and Usage• Short Reads (ABI/Illumina) • Long Reads (454/single mol)
– Resequencing b/c high throughput runs (coverage)
– SNP detection (coverage)Mi RNA (23 b )
– Many fewer reads (lower coverage)
– Much longer (implications for – Micro RNA (23 bp)– Counting (e.g.,
transcriptome profiling, ChIPSeq)
assembly, indel detection)
– Better for de novo sequencing (can bridge scaffolds)
B tt i bilit t • Adjustable dynamic range ($$)
– Hard to place near repetitive elements
– Better in ability to span repetitive regions (find unique seq)
– Amplicon and tagging– Harder to assemble de novo– 75-100 bp reads intermediate
Amplicon and tagging• longer reads detect varying copy
numbers btwn individuals
– Meta-genomics 16S RNA d 500b d t • 16S RNA, need 500bp reads to see small variant region among species
26Center for Genes, Environment, and Health
Adapted from D. Pollock
27Center for Genes, Environment, and Health
28Center for Genes, Environment, and Health
RNA SRNA-Seq• cDNA made from transcript, sequenced, mapped to genome
– Seq more sensitive than micro-arrays since do not have to design array probes or decide chip content, especially important for microRNA (~22nt total)
– Seq can be used to distinguish isoforms, allelic expression, sequence variants
– Seq more expensive than microarrays and suffer from 5’/3’ biases, high abundance transcripts (housekeeping ribosomal) which make up the majority abundance transcripts (housekeeping, ribosomal) which make up the majority of sequencing data (e.g. in some tissues 5% of the genes represent up to 75% of the reads sequenced).
• Mapping to reference genome means recognizingover known/novel splice junctions, determining exon boundaries, potentiallydiscovering novel exons (TopHat, Cufflinks)
• Quantification done at exon or transcriptlevel (what about when exons shared?)
• Differential RNA expression between samples is measured as differences incoverage
29Center for Genes, Environment, and HealthImage from wikipedia (user:Rgocs)
ChIP S ChIP-Seq (Chromatin Immunoprecipitation + Sequencing)
• Requires read alignment and
k fi dipeak finding
30Center for Genes, Environment, and HealthImage from wikipedia Data from Dr Barbara Wold
Sequence AssemblySequence Assembly
Center for Genes, Environment, and Health 31
32Center for Genes, Environment, and Health
33Center for Genes, Environment, and Health
34Center for Genes, Environment, and Health
35Center for Genes, Environment, and Health
36Center for Genes, Environment, and Health
37Center for Genes, Environment, and Health
Short Read AlignmentShort Read Alignment
Center for Genes, Environment, and Health 38
39Center for Genes, Environment, and Health
40Center for Genes, Environment, and Health
41Center for Genes, Environment, and Health
42Center for Genes, Environment, and Health
43Center for Genes, Environment, and Health
44Center for Genes, Environment, and Health(Bioinformatics (2009) 25 (14): 1754-1760)
Example of BWT
Let W=‘go’ be a query substring of X. Then the SA interval of all occurrences of W is [1,2]. The suffix array values for this interval are 3 and 0 which give allpositions of the occurrences of ‘go’. Procedure called ‘backward search’ usesproperties of W and B to efficiently calculate. See Li and Durbin (Bioinformatics (2009) 25 (14): 1754-1760)
45Center for Genes, Environment, and Health
(2009) 25 (14): 1754-1760)
46Center for Genes, Environment, and Health
47Center for Genes, Environment, and Health
48Center for Genes, Environment, and Health
49Center for Genes, Environment, and Health
50Center for Genes, Environment, and Health
51Center for Genes, Environment, and Health
Sequence Variant Detection
52Center for Genes, Environment, and Health
W Lee et al. Nature 465, 473-477 (2010)
Sequence Variant Detection• Single Nucleotide Polymorphismsg y p
– Compute genotype from distribution of 4 nucleotides for reads mapping to same reference location
• Libraries can contain duplicate reads, inflates coverage, use Libraries can contain duplicate reads, inflates coverage, use unique
– Use coverage, read position to determine accuracy• SOLiD accurate dueSOLiD accurate due
to dibase encoding andrules for valid adjacent
– Homozygous SNPs Likely sequencing
Likely SNPs
more accurate thanheterozygous SNPs
• Typically cut-off for
sequencing errorNon-called
bases or indels?
Phasing?minimum number of readscalling each variant of het
53Center for Genes, Environment, and Health
g
Voelkerding et al, Clin Chem 2009
Sequence Variant Detection
• Small Insertion-Deletions can be detected with Small Insertion Deletions can be detected with gapped-alignment of reads
• Large Insertion-Deletions suggested by g gg ydeviations from expected insert size for paired reads
54Center for Genes, Environment, and Health
ABI BioScope Manual June 2010
Sequence Variant Detection
• Transversions suggested by correct spacing Transversions suggested by correct spacing but unexpected orientation of paired ends, also suggested by drop in coverage for normal pairs
• Translocations suggested by unexpected, mapping of paired ends to different chromosomes
• In all examples, note importance of good quality reference genome, long length of insert for paired reads (mix of insert size also beneficial)
55Center for Genes, Environment, and Health
Sequence Variant Detection
• Copy Number VariationCopy Number Variation• Can use difference from expected coverage to
indicate CNVs (e.g. see CNV-seq by Xie and Tammi, Bioinformatics, 2009)
• SOLiD algorithm: compute log-ratio compute log-ratio of coverage to expected coverage (GC t d) (GC corrected), use HMM to estimate copy number, take
56Center for Genes, Environment, and Health
ABI BioScope Manual June 2010
pyViterbi sequence
57Center for Genes, Environment, and Health
58Center for Genes, Environment, and Health
Additional Resources• Voelkerding et al, Clinical Chemistry 55(4):641 (2009) g , y ( ) ( )
“Next-Generation Sequencing: from basic research to diagnostics”
M t l S i 287 2196 (2000) “A h l • Myers et al Science 287:2196 (2000) “A whole-genome assembly of Drosophila”
• Medvedev et al, Nat Methods. 2009 Nov;6(11 Medvedev et a , Nat Met ods 009 Nov;6(Suppl):S13-20. “Computational methods for discovering structural variation with next-generation sequencing”
D id P ll k’ lid f l t ( i ll th • David Pollock’s slides from last year (especially math for detecting coverage needed to detect sequence errors))
59Center for Genes, Environment, and Health