Mapping of Next Generation Sequencing Data · 2019. 3. 28. · Sequencing done in a massively...
Mapping of Next Generation Sequencing Data
Agnes Hotz-Wagenblatt
Bioinformatik (HUSAR)
Next Generation Sequencers
Next (or 2nd) generation sequencers came onto the scene in the early 2000s. General characteristics include:
• Amplification of genetic material by PCR
• Ligation of amplified material to a solid surface
• Sequence of the target genetic material is determined using sequencing-by-synthesis (using labelled nucleotides or pyrosequencing for detection) or sequencing by ligation
• Sequencing is done in a massively parallel fashion and sequence information is captured by a computer
Sanger sequencing
• DNA is fragmented
• Cloned to a plasmid vector
• Cyclic sequencing reaction
• Separation by electrophoresis
• Readout with fluorescent tags
Cyclic-array methods
• DNA is fragmented
• Adaptors ligated to fragments
• Several possible protocols yield an array of PCR colonies.
• Enzymatic extension with fluorescently tagged nucleotides.
• Cyclic readout by imaging the array.
Emulsion PCR
• Fragments, with adaptors, are PCR amplified within a water drop in oil.
• One primer is attached to the surface of a bead.
• Used by 454, Polonator and SOLiD.
Bridge PCR
• DNA fragments are flanked with adaptors.
• A flat surface is coated with two types of primers, corresponding to the adaptors.
• Amplification proceeds in cycles, with one end of each bridge tethered to the surface.
• Used by Solexa.
Comparison of existing methods
Read length and pairing
Short reads are problematic, because short sequences do not map uniquely to the genome.
Solution #1: Get longer reads.
Solution #2: Get paired reads.
ACTTAAGGCTGACTAGC TCGTACCGATATGCTG
Third generation
Nanopore sequencing
• Nucleic acids driven through a nanopore.
• Differences in conductance of pore provide readout.
Real-time monitoring of PCR activity
• Read-out by fluorescence resonance energy transfer between polymerase and nucleotides, or
• Waveguides allow direct observation of polymerase and fluorescently labeled nucleotides
Analysis tasks
• Base calling / polymorphism detection
• Mapping to a reference genome
• De novo or assisted genome assembly
Next Gen. Sequencers Cont.
Sequencing platform         Sequencing chemistry                                   Template amplification method       Read length   Sequencing throughput
ABI3730xl Genome Analyzer   Automated Sanger sequencing                            In vivo amplification via cloning   700–900 bp    0.03–0.07 Mb/h
Roche (454) FLX             Pyrosequencing on solid support                        Emulsion PCR                        200–300 bp    13 Mb/h
Illumina Genome Analyzer    Sequencing-by-synthesis with reversible terminators    Bridge PCR                          32–40 bp      25 Mb/h
ABI SOLiD                   Sequencing by ligation                                 Emulsion PCR                        35 bp         21–28 Mb/h
HeliScope                   Sequencing-by-synthesis with virtual terminators       None (single molecule)              25–35 bp      83 Mb/h
Usage of Sequencing
Resequencing
− Map reads back to genome
− Call bases
RNA-seq
− Map reads back to genome
− Count tags to determine gene expression levels
ChIP-seq
− Map reads back to genome
− Peaks determine binding sites.
Nearly all experiments have the same first step!
Bioinformatics
Because of the massively parallel nature of next gen sequencers, huge amounts of data are produced quickly, requiring terabytes of storage.
New bioinformatics tools were developed to utilize the huge number of much shorter reads (~35 bp vs ~800 bp):
• Bowtie - Ultrafast, memory-efficient short read aligner
• SOAPdenovo - Part of the SOAP suite, used to build reference genome
• TopHat - Fast splice junction mapper for RNA-Seq reads
Mapping Methods
Hash table (lookup table)
- fast, but requires perfect matches
Array scanning
- can handle mismatches, but not gaps
Dynamic programming, e.g. Smith-Waterman
- handles indels, mathematically optimal, slow
- most programs use hash mapping as a prefilter
Burrows-Wheeler Transform
- fast and memory efficient, but less suited for gaps and mismatches
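To make the dynamic programming entry concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. The scoring values (match=2, mismatch=-1, gap=-2) are illustrative assumptions, not taken from the slides, and the function returns only the best score, without traceback.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring parameters are illustrative assumptions; a linear gap penalty is used.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            H[i][j] = max(0, diag, up, left)  # 0 restarts the local alignment
            best = max(best, H[i][j])
    return best


# A read with one inserted base still aligns, at the cost of one gap penalty.
print(smith_waterman("ACGTACGT", "ACGTTACGT"))
```

The table has len(a) x len(b) cells, which is why this method is slow on whole genomes and is usually applied only to candidate regions found by a faster prefilter.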
Programs
Heng Li and Nils Homer. A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics, Advance Access published online on May 11, 2010.
Hash tables: Eland, SOAP, SeqMap, MAQ, RMAP, ZOOM, Novoalign
BW transform, FM index: Bowtie, BWA, SOAP2, BWA-SW
Hash table
A hash table is a data structure that stores items and allows insertions, lookups, and deletions to be performed in O(1) time on average.
An algorithm converts an object, typically a string, to a number. Then the number is compressed according to the size of the table and used as an index.
Distinct items may be mapped to the same index. This is called a collision and must be resolved.
Figure: Key -> Hash Code Generator -> Number -> Compression -> Index. The key "Smith" yields index 7 in a table with slots 0 to 9; slot 7 holds the record for Bob Smith (address and phone number).
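The pipeline key -> hash code -> compression -> index can be sketched as below. This is a toy illustration, not how any particular aligner stores its tables: the class name `ChainedHashTable`, the polynomial hash, and collision resolution by chaining are all assumptions made for the example.

```python
class ChainedHashTable:
    """Toy hash table: hash code -> compression -> bucket index; collisions are chained."""

    def __init__(self, size=10):
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        code = 0
        for ch in key:                  # hash code generator (polynomial hash)
            code = code * 31 + ord(ch)
        return code % self.size         # compression to a valid table index

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                # key already stored: overwrite its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))     # new key, possibly sharing the bucket

    def lookup(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        idx = self._index(key)
        self.buckets[idx] = [(k, v) for k, v in self.buckets[idx] if k != key]


table = ChainedHashTable()
table.insert("Smith", "Bob Smith, 123 Main St.")
print(table.lookup("Smith"))
```

With a single bucket (size=1) every key collides, which is a quick way to check that chaining keeps distinct keys apart.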
First Hash Table Lookup
BLAST was the first program to use this algorithm.
A k-mer seed (11-mer by default) from the query is searched in all database sequences; the results are kept in hash tables.
Hash table algorithm
Each tool builds a hash table of short oligomers present in
- either the reads (SHRiMP, Maq, RMAP, and ZOOM)
- or the reference (SOAP).
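As a rough sketch of the "hash the reference" variant, the code below indexes every k-mer of a reference and uses the first k-mer of a read as a seed, then verifies the full read. The function names, the toy sequences, and the exact-match-only verification are assumptions for illustration; real tools allow mismatches and use more compact encodings.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map each k-mer of the reference to the list of its 0-based start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read_exact(read, index, reference, k):
    """Seed with the read's first k-mer, then verify the whole read at each seed hit."""
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


reference = "ACGTACGTTAGC"
index = build_kmer_index(reference, k=4)
# The seed ACGT occurs twice, but only one occurrence extends to the full read.
print(map_read_exact("ACGTT", index, reference, k=4))
```

Hashing the reads instead swaps the roles: the reads are indexed once and the reference is scanned against that index.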
ZOOM uses 'spaced seeds' to significantly outperform RMAP, which is based on a simpler algorithm developed by Baeza-Yates and Perleberg.
Spaced seeds have been shown to yield higher sensitivity than contiguous seeds of the same length.
SHRiMP employs a combination of spaced seeds and the Smith-Waterman algorithm to align reads at the expense of speed.
Eland is a commercial alignment program available from Illumina that uses a hash-based algorithm with spaced seeds to align reads.
Spaced seeds
A template ‘111010010100110111’ requiring 11 matches at the ‘1’ positions is 55% more sensitive than BLAST’s default template ‘11111111111’ for two sequences of 70% similarity.
A seed allowing internal mismatches is called a spaced seed; the number of matches in the seed is its weight.
Eland was the first program that utilized spaced seeds in short-read alignment. It uses six seed templates spanning the entire short read such that a two-mismatch hit is guaranteed to be identified by at least one of the templates.
SOAP adopts almost the same strategy except that it indexes the genome rather than reads.
SeqMap and MAQ extend the method to allow k mismatches.
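A small sketch of how a spaced seed template is applied: only the '1' positions have to agree, so a mismatch under a '0' does not prevent a hit. The helper names and the toy 18-mers are assumptions for illustration; the template is the one quoted above.

```python
TEMPLATE = "111010010100110111"  # template quoted in the slides


def seed_weight(template):
    """Weight = number of positions that must match."""
    return template.count("1")


def spaced_key(window, template):
    """Keep only the characters under '1' positions; usable as a hash table key."""
    return "".join(c for c, t in zip(window, template) if t == "1")


def spaced_seed_hit(a, b, template):
    """True if two windows of template length agree at every '1' position."""
    return spaced_key(a, template) == spaced_key(b, template)


a = "ACGTACGTACGTACGTAC"          # 18 bases, same length as the template
b = a[:3] + "A" + a[4:]           # mismatch at index 3, a '0' position
c = "T" + a[1:]                   # mismatch at index 0, a '1' position

print(seed_weight(TEMPLATE))
print(spaced_seed_hit(a, b, TEMPLATE), spaced_seed_hit(a, c, TEMPLATE))
```

Because `spaced_key` drops the '0' positions, windows that differ only there hash to the same key, which is how a hash-based aligner can tolerate mismatches during seeding.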
Burrows-Wheeler Transform
Used by Bowtie: Langmead et al. Genome Biology 2009 10:R25 doi:10.1186/gb-2009-10-3-r25
Two steps: identifying exact matches, then building inexact alignments supported by exact matches.
The first step uses a suffix tree, enhanced suffix array or FM-index.
The design of the FM-index is based upon the relationship between the Burrows-Wheeler compression algorithm and the suffix array data structure.
The advantage of using a trie is that alignment to multiple identical copies of a substring in the reference needs to be done only once.
What is a suffix tree?
S = M A L A Y A L A M $
    1 2 3 4 5 6 7 8 9 10
Figure: suffix tree of S. Each leaf is labelled with the start position (1 to 10) of the suffix it spells.
Finding a (short) Pattern in a (long) String
• Build a suffix tree of the string.
• Starting from the root, traverse a path matching characters of the pattern.
• If stuck, pattern not present in string.
• Otherwise, each leaf below gives a position of the pattern in the string.
Finding a Pattern in a String
Find “ALA”
Figure: the path spelling ALA from the root of the suffix tree of MALAYALAM$ ends at a node with two leaves below it.
Two matches - at 6 and 2
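The search procedure can be sketched with an uncompressed suffix trie. This is a simplification, assumed for brevity: a real suffix tree merges single-child paths and is built in linear time, whereas this version takes quadratic time and space. Positions are 1-based to match the MALAYALAM$ example.

```python
def build_suffix_trie(text):
    """Insert every suffix of text; each node records the 1-based starts of suffixes passing through it."""
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)
    return root


def find(root, pattern):
    """Walk down from the root along the pattern; return the sorted match positions."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:        # stuck: pattern is not present in the string
            return []
    return sorted(node["starts"])


trie = build_suffix_trie("MALAYALAM$")
print(find(trie, "ALA"))
```

The search time depends on the pattern length, not on the length of the indexed string, which is the property that makes these structures attractive for mapping many short reads.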
Suffix Array
Suffixes of abracadabra (11): abracadabra, bracadabra, racadabra, etc.
Order lexicographically:
a, abra, abracadabra, acadabra, adabra, bra, bracadabra, cadabra, dabra, ra, racadabra
The suffix array is an array of the suffix start indices (counted from 1 or 0), listed in the lexicographical order of the suffixes. For the string "abracadabra" the suffix array is {11,8,1,4,6,9,2,5,7,10,3}, because suffix "a" starts at the 11th letter, "abra" starts at the 8th letter, etc.
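A minimal sketch that reproduces the suffix array above and uses it for pattern lookup by binary search. Sorting full suffixes is a naive construction chosen for clarity; the function names are illustrative and indices are 1-based as in the example.

```python
def suffix_array(text):
    """1-based start positions of all suffixes, in lexicographical order of the suffixes."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def find_occurrences(text, sa, pattern):
    """Binary search for the first suffix >= pattern, then collect suffixes starting with it."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid] - 1:] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and text[sa[lo] - 1:].startswith(pattern):
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)


text = "abracadabra"
sa = suffix_array(text)
print(sa)
print(find_occurrences(text, sa, "abra"))
```

All suffixes that start with the pattern are adjacent in the array, so one binary search locates the whole block of matches.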
All rotations of T = mississippi#:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows:
F            L
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i
• Every column is a permutation of T.
• Given row i, char L[i] precedes F[i] in original T.
• Consecutive chars in L are adjacent to similar strings in T.
• Therefore L usually contains long runs of identical chars.
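The construction can be sketched directly from its definition: build all rotations, sort them, and read off the first and last columns. Sorting explicit rotations is an illustrative choice that is too slow for genomes; the sketch assumes the terminator ('#') sorts before every other character, which holds for ASCII letters.

```python
def bwt(text):
    """Return (F, L): first and last columns of the sorted rotation matrix of text.

    Assumes text ends with a unique terminator that sorts before all other characters.
    """
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    first = "".join(row[0] for row in rotations)
    last = "".join(row[-1] for row in rotations)
    return first, last


F, L = bwt("mississippi#")
print(F)
print(L)
```

L is the Burrows-Wheeler transform of the text; F is simply the characters of the text in sorted order.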
Reminder: Recovering T from L
1. Find F by sorting L
2. First char of T? m
3. Find m in L
4. L[i] precedes F[i] in T. Therefore we get mi
5. How do we choose the correct i in L? The i’s are in the same order in L and F, as are the rest of the chars
6. i is followed by s: mis
7. And so on….
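The recovery steps can be sketched as follows. A stable sort of L gives F while preserving the relative order of equal characters, which is exactly the property used in step 5. The function name and the `terminator` parameter are assumptions for illustration; the terminator must occur exactly once.

```python
def inverse_bwt(last, terminator="#"):
    """Recover T from its BWT column L; assumes T ends with a unique terminator."""
    n = len(last)
    # Stable sort: order[j] is the row in L holding the same character occurrence as F[j].
    order = sorted(range(n), key=lambda i: last[i])
    first = [last[i] for i in order]
    # The row whose last character is the terminator is T itself.
    row = last.index(terminator)
    out = []
    for _ in range(n):
        out.append(first[row])   # next character of T
        row = order[row]         # jump to the row that starts one character later in T
    return "".join(out)


print(inverse_bwt("ipssm#pissii"))
```

Only L and the sort are needed, so the transform is reversible without storing the rotation matrix.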
MTF
• Replace each char in L with the number of distinct chars seen since its last occurrence.
• Keep MTF[1,…,|Σ|] array, sorted lexicographically.
• Runs of identical chars are transformed into runs of zeroes in L(MTF)
Example:
L      = i p s s m # p i s s i i
L(MTF) = 1 3 4 0 4 4 3 4 4 0 1 0
MTF array (positions 0 1 2 3 4):
# i m p s   (initial)
i # m p s   (after reading i)
p i # m s   (after reading p)
s p i # m   (after reading s)
And so on…
• Bad example
• For larger texts we will receive more runs of zeroes, and dominance of smaller numbers.
• The reason being that BWT creates clusters of similar chars.
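A short sketch of move-to-front coding as described above, applied to the BWT of mississippi#. The function name and the list-based symbol table are illustrative choices.

```python
def move_to_front(text, alphabet=None):
    """Encode text as positions in a symbol list that moves each used symbol to the front."""
    symbols = sorted(set(text)) if alphabet is None else list(alphabet)
    out = []
    for ch in text:
        i = symbols.index(ch)              # current position of the symbol
        out.append(i)
        symbols.insert(0, symbols.pop(i))  # move it to the front
    return out


print(move_to_front("ipssm#pissii"))
```

A repeated character always encodes as 0, so the runs produced by the BWT become runs of zeroes that a later entropy coder can compress well.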
Burrows-Wheeler transform. (a) The Burrows-Wheeler matrix and transformation for 'acaacg'. (b) Steps taken by EXACTMATCH to identify the range of rows, and thus the set of reference suffixes, prefixed by 'aac'. (c) UNPERMUTE repeatedly applies the last first (LF) mapping to recover the original text (in red on the top line) from the Burrows-Wheeler transform (in black in the rightmost column).
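The exact matching step in panel (b) can be sketched as a backward search over the BWT. This is a simplified illustration, not Bowtie's implementation: occurrence counts are computed by scanning L, where a real FM-index uses precomputed, sampled rank tables. The function names and the half-open row range are assumptions.

```python
def bwt_last_column(text):
    """Last column of the sorted rotation matrix; text must end with a unique terminator."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def exact_match(last, pattern):
    """Return the half-open range (top, bottom) of sorted rows prefixed by pattern."""
    first = sorted(last)
    C = {}                                  # C[c] = number of characters smaller than c
    for i, c in enumerate(first):
        C.setdefault(c, i)
    top, bottom = 0, len(last)
    for c in reversed(pattern):             # process the pattern right to left
        if c not in C:
            return (0, 0)
        top = C[c] + last[:top].count(c)        # naive rank query
        bottom = C[c] + last[:bottom].count(c)
        if top >= bottom:                   # empty range: no occurrence
            return (0, 0)
    return (top, bottom)


L = bwt_last_column("acaacg$")
print(L)
print(exact_match(L, "aac"))
```

The size of the returned range is the number of occurrences; each step narrows the range using only L and the character counts.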
Li, H. et al. Brief Bioinform 2010 0:bbq015v1-15; doi:10.1093/bib/bbq015
Data structures based on a prefix tree
(A) Prefix trie of string AGGAGC where symbol ^ marks the start of the string. The two numbers in each node give the suffix array interval of the substring represented by the node, which is the string concatenation of edge symbols from the node to the root. (B) Compressed prefix trie by contracting nodes with in- and out-degree both being one. (C) Prefix tree by representing the substring on each edge as the interval on the original string. (D) Prefix directed word graph (prefix DAWG) created by collapsing nodes of the prefix trie with identical suffix array interval. (E) Constructing the suffix array and Burrows–Wheeler transform of AGGAGC.
String: AGGAGC
Exact matching versus inexact alignment. Illustration of how EXACTMATCH (top) and Bowtie's aligner (bottom) proceed when there is no exact match for query 'ggta' but there is a one-mismatch alignment when 'a' is replaced by 'g'.
Role of paired-end and mate-pair mapping
Some sequencing technologies produce read pairs such that the two reads are known to be close to each other in physical chromosomal distance.
These reads are called paired-end or mate-pair reads.
- With this mate-pair information, a repetitive read will be reliably placed if its mate can be placed unambiguously.
- Alignment errors may be detected and fixed when wrong alignments break the mate-pair requirement
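How a uniquely placed mate rescues a repetitive read can be sketched as a simple distance filter. The function name, the positions, and the insert-size bounds are hypothetical; real aligners also check strand and orientation and score the pair jointly.

```python
def place_with_mate(read_hits, mate_hit, min_insert, max_insert):
    """Keep candidate positions of a repetitive read that lie within the
    expected distance range of its uniquely placed mate."""
    return [pos for pos in read_hits
            if min_insert <= abs(pos - mate_hit) <= max_insert]


# Hypothetical example: the read maps to three repeat copies, the mate maps once.
candidates = [100, 5000, 90000]
print(place_with_mate(candidates, mate_hit=5250, min_insert=200, max_insert=500))
```

An empty result for a read that was placed on its own would flag a possible alignment error, since the pair then breaks the mate-pair requirement.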
Effect of paired end alignment
Effect of quality values
Aligning bisulfite-treated reads
Bisulfite sequencing is a technology to identify methylation patterns
- Cytosines with underlines are not methylated.
- Denaturation and bisulfite treatment will convert these cytosines to uracils.
- After amplification, four different sequences from the original double-strand DNA result.
Aligning bisulfite reads
1) Increased search space due to the cytosine-thymine conversion in the bisulfite treatment.
2) Mapping asymmetry: thymines in bisulfite reads can be aligned with cytosines in the reference (illustrated in blue) but not the reverse.
Xi and Li BMC Bioinformatics 2009 10:232 doi:10.1186/1471-2105-10-232
Aligning bisulfite-treated reads
- Two reference sequences: one with all ‘C’ bases converted to ‘T’ bases (the C-to-T reference), the other with all ‘G’ bases converted to ‘A’ bases (the G-to-A reference).
- Alignment: ‘C’ bases are converted to ‘T’ bases for reads, which are then mapped to the C-to-T reference (a C–T mismatch is then effectively regarded as a match); a similar procedure is performed for the G-to-A conversion in the next round of alignment.
- The results from the two rounds of alignment are combined to generate the final report. If there are no mutations or sequencing errors, a bisulfite-treated read can always be mapped exactly in one of the two rounds.
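The two-round scheme can be sketched with plain string search standing in for the aligner. The function names and the toy reference are assumptions for illustration; a real pipeline would use an indexed aligner, tolerate sequencing errors, and compare the hit against the unconverted reference afterwards to call methylation.

```python
def c_to_t(seq):
    return seq.replace("C", "T")


def g_to_a(seq):
    return seq.replace("G", "A")


def bisulfite_map(read, reference):
    """Round 1: C-to-T converted read vs C-to-T reference.
    Round 2: G-to-A converted read vs G-to-A reference.
    Returns a combined list of (round label, 0-based position) exact hits."""
    hits = []
    for label, convert in (("C-to-T", c_to_t), ("G-to-A", g_to_a)):
        conv_ref = convert(reference)
        conv_read = convert(read)
        start = conv_ref.find(conv_read)
        while start != -1:
            hits.append((label, start))
            start = conv_ref.find(conv_read, start + 1)
    return hits


reference = "ACGTCCGATTGCA"
# Read from reference[2:9] = GTCCGAT: the unmethylated C became T, the methylated C stayed C.
read = "GTTCGAT"
print(bisulfite_map(read, reference))
```

Converting both read and reference removes the C/T difference before matching, so the read maps exactly even though it no longer equals the original genomic sequence.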
Aligning spliced reads
RNA-seq produces reads from transcribed sequences with introns and intergenic regions excluded.
When RNA-seq reads are aligned against the genomic sequence, a read may be mapped to a splicing junction.
This will fail with a standard alignment algorithm.
-> Special alignment e.g. TopHat
Trapnell, C. et al. Bioinformatics 2009 25:1105-1111; doi:10.1093/bioinformatics/btp120
TOPHAT Pipeline
Figure: Eukaryotic genes (exons & introns); splicing and alternative splicing yield mature splice variants I and II, each followed by translation.
Alternative splicing: One gene, several proteins!
Types of alternative splicing
TopHat and Cufflinks
- Use next generation sequence data for alternative splicing
Comparison of some mapping programs
Table 1: Popular short-read alignment software
Program        Algorithm       SOLiD    Long(a)  Gapped   PE(b)  Q(c)
Bfast          hashing ref.    Yes      No       Yes      Yes    No
Bowtie         FM-index        Yes      No       No       Yes    Yes
BWA            FM-index        Yes(d)   Yes(e)   Yes      Yes    No
MAQ            hashing reads   Yes      No       Yes(f)   Yes    Yes
Mosaik         hashing ref.    Yes      Yes      Yes      Yes    No
Novoalign(g)   hashing ref.    No       No       Yes      Yes    Yes
(a) Work well for Sanger and 454 reads, allowing gaps and clipping.
(b) Paired-end mapping.
(c) Make use of base quality in alignment.
(d) BWA trims the primer base and the first color for a color read.
(e) Long-read alignment implemented in the BWA-SW module.
(f) MAQ only does gapped alignment for Illumina paired-end reads.
(g) Free executable for non-profit projects only.