Post on 05-Jan-2016
Data analysis methods for next-generation sequencing technologies
Gabor T. Marth
Boston College Biology Department
Epigenomics & Sequencing Meeting
July 14-15, 2008, Boston, MA
T1. Roche / 454 FLX system
• pyrosequencing technology
• variable read-length
• the only new technology with >100bp reads
• tested in many published applications
• supports paired-end read protocols with up to 10kb separation size
T2. Illumina / Solexa Genome Analyzer
• fixed-length short-read sequencer
• read properties are very close to those of traditional capillary sequences
• very low INDEL error rate
• tested in many published applications
• paired-end read protocols support short (<600bp) separation
T3. AB / SOLiD system
2-base color encoding (1st base in rows, 2nd base in columns):

             2nd Base
             A  C  G  T
1st Base  A  0  1  2  3
          C  1  0  3  2
          G  2  3  0  1
          T  3  2  1  0
• fixed-length short-read sequencer
• employs a 2-base encoding system that can be used for error reduction and improving SNP calling accuracy
• requires color-space informatics
• published applications underway / in review
• paired-end read protocols support up to 10kb separation size
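The 2-base encoding can be sketched in a few lines. This is a minimal illustration, not vendor software: it assumes the standard color code in which the color of a base pair equals the XOR of the 2-bit base codes (A=0, C=1, G=2, T=3), and the function names are made up for this example.

```python
# Assumed 2-bit codes; the color of a base pair is the XOR of its two codes.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colors(seq):
    """Return the first base plus one color digit per adjacent base pair."""
    colors = [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode_colors(first_base, colors):
    """Rebuild the base sequence from the known first base and the colors."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


def color_differences(seq_a, seq_b):
    """Positions at which the color strings of two equal-length sequences differ."""
    _, ca = encode_colors(seq_a)
    _, cb = encode_colors(seq_b)
    return [i for i, (x, y) in enumerate(zip(ca, cb)) if x != y]
```

A true single-base substitution changes two adjacent colors, whereas a single miscalled color changes only one. That difference is what makes the encoding useful for separating sequencing errors from SNPs.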
T4. Helicos / Heliscope system
• experimental short-read sequencer system
• single molecule sequencing
• no amplification
• variable read-length
• error rate reduced with 2-pass template sequencing
A1. Variation discovery: SNPs and short-INDELs
1. sequence alignment
2. dealing with non-unique mapping
3. looking for allelic differences
A2. Structural variation detection
• structural variations (deletions, insertions, inversions and translocations) from paired-end read map locations
• copy number (for amplifications, deletions) from depth of read coverage
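The depth-of-coverage idea can be illustrated with a small sketch. It assumes per-base read depths are already available as a list, uses the median window depth as the "normal" baseline, and applies arbitrary ratio thresholds; the window size, thresholds, and function names are all illustrative assumptions rather than values from the talk.

```python
from statistics import median


def window_means(depths, window):
    """Mean read depth in consecutive non-overlapping windows."""
    return [
        sum(depths[i:i + window]) / window
        for i in range(0, len(depths) - window + 1, window)
    ]


def copy_number_calls(depths, window=4, ploidy=2, loss=0.75, gain=1.25):
    """Label each window by its depth ratio relative to the median window."""
    means = window_means(depths, window)
    baseline = median(means)
    if baseline == 0:
        raise ValueError("no coverage to normalize against")
    calls = []
    for mean_depth in means:
        ratio = mean_depth / baseline
        if ratio < loss:
            label = "deletion"
        elif ratio > gain:
            label = "amplification"
        else:
            label = "normal"
        # Estimated copy number assumes the baseline corresponds to `ploidy` copies.
        calls.append((round(ratio * ploidy, 2), label))
    return calls
```

In real data the dispersal of read coverage makes such thresholds noisy, so wider windows or deeper coverage would be needed.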
A3. Identification of protein-bound DNA
[Figure: reads aligned to the genome sequence]
Chromatin structure (CHIP-SEQ) (Mikkelsen et al. Nature 2007)
Transcription factor binding sites (Robertson et al. Nature Methods, 2007)
A4. Novel transcript discovery (genes)
Mortazavi et al. Nature Methods
A5. Novel transcript discovery (miRNAs)
Ruby et al. Cell, 2006
A6. Expression profiling by tag counting
[Figure: aligned reads counted over two genes]
Jones-Rhoades et al. PLoS Genetics, 2007
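Tag counting reduces to counting aligned reads per annotated gene. The sketch below assumes each read is represented only by its mapped start position and each gene by a half-open interval; the function name and the gene names in the usage note are hypothetical.

```python
def count_tags(read_positions, genes):
    """Count reads whose mapped start falls inside each gene interval.

    genes maps a gene name to a half-open (start, end) interval.
    """
    counts = {name: 0 for name in genes}
    for pos in read_positions:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
    return counts
```

For example, `count_tags([120, 130, 950], {"geneA": (100, 200), "geneB": (900, 1000)})` assigns two reads to `geneA` and one to `geneB`. Comparing counts between genes would additionally require normalizing for gene length and total read number.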
A7. De novo organismal genome sequencing
[Figure: short reads, longer reads, and read pairs combined into assembled sequence contigs]
Lander et al. Nature 2001
C1. Read length
[Figure: read length in bp on a 0-400 scale; ranges shown: ~200-450 (var), 25-40 (fixed), 25-35 (fixed), 20-35 (var)]
When does read length matter?
• short reads often sufficient where the entire read length can be used for mapping:
SNPs, short-INDELs, SVs
CHIP-SEQ
short RNA discovery
counting (mRNA, miRNA)
• longer reads are needed where one must use parts of reads for mapping:
de novo sequencing
novel transcript discovery
[Figure: example reads aacttagacttacagacttacatacgta and accgattactatacta shown against known exon 1 and known exon 2]
C2. Read error rate
• error rate dictates the stringency of the read mapper
• error rate typically 0.4 - 1%
• the more errors the aligner must tolerate, the lower the fraction of the reads that can be uniquely aligned
[Figure: fraction of genome (0.00 to 0.40) vs. number of mismatches allowed (0, 1, 2)]
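The link between error rate and aligner stringency can be made concrete with a binomial calculation. This sketch assumes errors are independent and occur at the same rate at every read position, which is a simplification given that the error rate grows with each cycle; the function name is illustrative.

```python
from math import comb


def prob_at_most_k_errors(read_length, error_rate, k):
    """Probability that a read contains at most k sequencing errors,
    assuming independent errors at a uniform per-base rate."""
    return sum(
        comb(read_length, i)
        * error_rate ** i
        * (1.0 - error_rate) ** (read_length - i)
        for i in range(k + 1)
    )
```

Under these assumptions, calling `prob_at_most_k_errors(35, 0.01, 0)` gives the fraction of 35bp reads that would still map if no mismatches were tolerated, and raising `k` shows how many more reads are recovered at the cost of less unique placement.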
[Figure: error rate (0% to 10%) vs. position on read (0 to 40)]
Error rate grows with each cycle
• this phenomenon limits useful read length
Substitutions vs. INDEL errors
C3. Representational biases / library complexity
• fragmentation biases
• amplification biases (PCR)
• sequencing biases
[Figure: read coverage ranging from low/no representation to high representation]
Dispersal of read coverage
• this affects variation discovery (deeper starting read coverage is needed)
• its major impact should be on counting applications
Amplification errors
many reads from clonal copies of a single fragment
• early PCR errors in “clonal” read copies lead to false positive allele calls
early amplification error gets propagated onto every clonal copy
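One common way to limit the effect of clonal copies is to keep a single read per mapped start position and strand before calling alleles. The sketch below assumes that identical start and strand is an acceptable proxy for "clonal copy of a single fragment"; the record fields and the function name are illustrative, and this is not presented as the method used in the talk.

```python
def collapse_clonal_reads(reads):
    """Keep one read per (start, strand), preferring the highest mean quality.

    Each read is a dict with 'name', 'start', 'strand' and 'mean_quality' keys.
    """
    best = {}
    for read in reads:
        key = (read["start"], read["strand"])
        if key not in best or read["mean_quality"] > best[key]["mean_quality"]:
            best[key] = read
    return sorted(best.values(), key=lambda r: (r["start"], r["strand"]))
```

At very deep coverage, independent fragments can share a start position by chance, so this filter trades some true coverage for fewer false positive allele calls.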
C4. Paired-end reads
• fragment amplification: fragment length 100 - 600 bp
• fragment length limited by amplification efficiency
• circularization: 500bp - 10kb (sweet spot ~3kb)
• fragment length limited by library complexity
Korbel et al. Science 2007
• paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if a fragment length constraint is used to rescue non-uniquely mapping ends)
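The rescue of a non-uniquely mapping end can be sketched as a simple filter. This example assumes one end maps uniquely, the other end has several candidate positions, and the expected separation lies in the 100 - 600bp range quoted for fragment amplification; it ignores read length and orientation, and the function name is hypothetical.

```python
def rescue_mate(anchor_pos, candidates, min_frag=100, max_frag=600):
    """Return the single candidate position consistent with the fragment
    length constraint, or None if zero or several candidates qualify."""
    consistent = [
        pos for pos in candidates
        if min_frag <= abs(pos - anchor_pos) <= max_frag
    ]
    return consistent[0] if len(consistent) == 1 else None
```

If more than one candidate satisfies the constraint, the end stays ambiguous, which is why the function returns `None` in that case rather than picking arbitrarily.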
Technologies / properties / applications
Technology                 Roche/454              Illumina/Solexa   AB/SOLiD

Read properties
Read length                200-450bp              20-50bp           25-50bp
Error rate                 <0.5%                  <1.0%             <0.5%
Dominant error type        INDEL                  SUB               SUB
Quality values available   yes                    yes               not really
Paired-end separation      < 10kb (3kb optimal)   100 - 600bp       500bp - 10kb (3kb optimal)
Applications
SNP discovery ● ● ○
short-INDEL discovery ● ○
SV discovery ○ ○ ●
CHIP-SEQ ○ ● ●
small RNA/gene discovery ○ ● ●
mRNA Xcript discovery ● ○ ○
Expression profiling ○ ● ●
De novo sequencing ● ? ?
Resequencing-based SNP discovery
(i) base calling
(ii) micro-repeat analysis
(iii) read mapping (pair-wise alignment to genome reference)
(iv) read assembly
(v) SNP calling
(vi) SNP validation
(vii) data viewing, hypothesis generation
[Figure labels: REF, IND]
The “toolbox”
• base callers
• microrepeat finders
• read mappers
• SNP callers
• structural variation callers
• assembly viewers
Reference guided read mapping
Reference-sequence guided mapping: …you get the pieces… AND they give you the cover on the box
Some pieces are more unique than others
MOSAIK: an anchored aligner / assembler
Step 1. initial short-hash scan for possible read locations
Step 2. evaluation of candidate locations with SW method
Michael Stromberg
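The two steps above can be sketched generically as a seed-and-extend mapper. This is not MOSAIK's implementation: the k-mer size, scoring parameters, padding, and all function names are assumptions chosen for illustration, and the Smith-Waterman routine returns only the score, without traceback.

```python
from collections import defaultdict


def build_hash_index(reference, k):
    """Map every k-mer in the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_starts(read, index, k):
    """Step 1: each k-mer hit proposes the reference offset of the read start."""
    starts = set()
    for j in range(len(read) - k + 1):
        for pos in index.get(read[j:j + k], ()):
            starts.add(max(0, pos - j))
    return sorted(starts)


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Step 2: best local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


def map_read(read, reference, index, k, pad=2):
    """Score every candidate location and return (best score, start) or None."""
    scored = []
    for start in candidate_starts(read, index, k):
        lo = max(0, start - pad)
        hi = min(len(reference), start + len(read) + pad)
        scored.append((smith_waterman_score(read, reference[lo:hi]), start))
    return max(scored) if scored else None
```

The hash scan keeps the expensive dynamic programming confined to a few short reference windows, while the Smith-Waterman step tolerates mismatches and gaps within each window.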
Non-unique mapping, gapped alignments
1. Non-unique read mapping: optionally either only report uniquely mapped reads or report all map locations for each read (mapping quality values for all mapped reads are being implemented)
2. Gapped alignments: allow for mapping reads with insertion or deletion sequencing errors, and reads with bona fide INDEL alleles
Read types aligned, paired-end read strategy
3. Aligns and co-assembles customary read types: ABI/capillary, Illumina/Solexa, AB/SOLiD, Roche/454, Helicos/Heliscope
[Figure: co-assembly of ABI/capillary, 454 FLX, 454 GS20, and Illumina reads]
4. Paired-end read alignments
Other mainstream read mappers
• ELAND (Tony Cox, Illumina) -- the “official” read mapper supplied by Illumina, fast
• MAQ (Li Heng + Richard Durbin, Sanger) -- the most widely used read mapper, low RAM footprint
• SOAP (Beijing Genomics Institute) -- a new mapper developed for human next-gen reads
• SHRiMP (Michael Brudno, University of Toronto) -- full Smith-Waterman
Speed
Polymorphism / mutation detection
sequencing error
polymorphism
Determining genotype directly from sequence
AACGTTAGCATA  AACGTTAGCATA  AACGTTCGCATA  AACGTTCGCATA
AACGTTCGCATA  AACGTTCGCATA  AACGTTCGCATA  AACGTTCGCATA
AACGTTAGCATA  AACGTTAGCATA
individual 1
individual 3
individual 2
A/C
C/C
A/A
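A naive version of reading a genotype directly from the reads can be sketched as follows. It assumes the reads of one individual are already aligned so that the variable site sits at the same index in each read, and it uses an arbitrary minimum allele fraction; it ignores base qualities, and the function name and threshold are illustrative.

```python
from collections import Counter


def call_genotype(reads, position, min_fraction=0.2):
    """Call a diploid genotype from the bases observed at one aligned position."""
    counts = Counter(read[position] for read in reads)
    depth = sum(counts.values())
    alleles = sorted(b for b, n in counts.items() if n / depth >= min_fraction)
    if len(alleles) == 1:
        return alleles[0] + "/" + alleles[0]
    if len(alleles) == 2:
        return "/".join(alleles)
    return None  # more than two alleles: not a clean diploid call


# The three groups of reads shown on the slide; the variable site is index 6.
group_1 = ["AACGTTAGCATA"] * 2 + ["AACGTTCGCATA"] * 2
group_2 = ["AACGTTCGCATA"] * 4
group_3 = ["AACGTTAGCATA"] * 2

genotypes = [call_genotype(g, 6) for g in (group_1, group_2, group_3)]
```

With only two reads, as in the third group, a heterozygote could easily be miscalled as a homozygote, which is one motivation for a probabilistic treatment.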
Software
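\[
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{\dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \cdot P_{\mathrm{Prior}}(S_1, \ldots, S_N)}
{\displaystyle \sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]} \dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \cdot P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})}
\]
GigaBayes
[Figure labels: SNP, INS]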
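The Bayesian SNP probability on this slide can be evaluated by brute force for a handful of reads. The sketch below is not GigaBayes: it assumes S_i is the true base in read i and R_i the observed read data, takes P(S_i | R_i) from the called base and its error probability, uses a uniform per-base prior of 0.25, and spreads an assumed prior SNP rate evenly over all variable base combinations. All names and default values are illustrative.

```python
from itertools import product

BASES = "ACGT"


def base_given_read(called, error_prob):
    """P(S_i | R_i): the called base gets 1 - e, the other three share e."""
    return {b: 1.0 - error_prob if b == called else error_prob / 3.0 for b in BASES}


def snp_probability(calls, error_probs, prior_snp=0.001):
    """Sum the normalized posterior over all variable base combinations."""
    n = len(calls)
    if n < 2:
        raise ValueError("need at least two reads to observe a difference")
    per_read = [base_given_read(c, e) for c, e in zip(calls, error_probs)]
    prior_base = 0.25                        # assumed uniform P_Prior(S_i)
    prior_mono = (1.0 - prior_snp) / 4       # each all-identical combination
    prior_poly = prior_snp / (4 ** n - 4)    # each variable combination
    variable = total = 0.0
    for column in product(BASES, repeat=n):  # 4**n terms: small n only
        is_variable = len(set(column)) > 1
        term = prior_poly if is_variable else prior_mono
        for probs, base in zip(per_read, column):
            term *= probs[base] / prior_base
        total += term
        if is_variable:
            variable += term
    return variable / total
```

Four confident reads split two against two give a probability close to 1, while four agreeing reads give a probability close to 0. The enumeration grows as 4^N, so a practical caller would need a more efficient formulation.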
Data visualization
1. aid software development: integration of trace data viewing, fast navigation, zooming/panning
2. facilitate data validation (e.g. SNP validation): co-viewing of multiple read types, quality value displays
3. promote hypothesis generation: integration of annotation tracks
Weichun Huang
Applications
1. SNP discovery in shallow, single-read 454 coverage (Drosophila melanogaster)
2. SNP and INDEL discovery in deep Illumina short-read coverage (Caenorhabditis elegans)
3. Mutational profiling in deep 454 and Illumina read data (Pichia stipitis)
(image from Nature Biotech.)
Our software is available for testing
http://bioinformatics.bc.edu/marthlab/Beta_Release
Credits
http://bioinformatics.bc.edu/marthlab
Elaine Mardis (Washington University)
Andy Clark (Cornell University)
Doug Smith (Agencourt)
Research supported by: NHGRI (G.T.M.), BC Presidential Scholarship (A.R.Q.)
Derek Barnett
Eric Tsung
Aaron Quinlan
Damien Croteau-Chonka
Weichun Huang
Michael Stromberg
Chip Stewart
Michele Busby
Accuracy
• As is the case for all heuristic alignment algorithms, accuracy and speed are option- and parameter-dependent
C3. Quality values are important for allele calling
• PHRED base quality values represent the estimated likelihood of sequencing error and help us pick out true alternate alleles
• inaccurate or not well calibrated base quality values hinder allele calling
Q-values should be accurate … and high!
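The standard PHRED definition, Q = -10 log10(P_error), shows what "accurate and high" means in numbers. The helper names below are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Estimated probability that a base call with quality q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """PHRED quality value corresponding to an error probability p."""
    return -10 * math.log10(p)
```

A base with Q20 has an estimated 1% chance of being a sequencing error and a base with Q30 has 0.1%, so a few high-quality reads carrying the alternate allele are more convincing than many low-quality ones.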
Software tools for next-gen sequence analysis
Next-generation sequencing technologies and applications