Genomics Method Seminar - BWA October 15, 2014 Sora Kim Researcher [email protected] Yonsei Biomedical...
Genomics Method Seminar- BWA
October 15, 2014
Sora Kim
Yonsei Biomedical Science Institute, Yonsei University College of Medicine
2/12
Today’s paper
• Heng Li, PhD
– a research scientist at the Broad Institute, working with David Reich and David Altshuler.
– principal developer of several projects including SAMtools, BWA, MAQ, TreeSoft and TreeFam, most of which he started as a postdoctoral fellow with Richard Durbin at the Wellcome Trust Sanger Institute.
3/12
Software information
• Purpose
– BWA-MEM is a new alignment algorithm for aligning sequence reads or assembly contigs against a large reference genome such as human.
• Category
– aligner
• Software URL
– http://bio-bwa.sourceforge.net/
• License
– Free, open source (the full BWA package is distributed under GPLv3)
4/12
RNA-seq
ChIP-seq
WGS, WES
5/12
Previous work
• Bowtie
– BWT + FM index
– LF mapping
– Backtracking
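A minimal sketch of the BWT + FM-index idea listed above, assuming a naive suffix-array construction that is only practical for toy strings. The function names are illustrative and not taken from Bowtie or BWA; backtracking for mismatches is not shown, only exact backward search via LF mapping.

```python
def bwt_and_sa(text):
    """Build the suffix array and BWT of text + '$' by naive sorting."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is '$' when i == 0
    return bwt, sa


def fm_index(bwt):
    """C[c]: number of characters smaller than c; occ[c][i]: count of c in bwt[:i]."""
    alphabet = sorted(set(bwt))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt.count(c)
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return C, occ


def backward_search(pattern, C, occ, n):
    """Return the half-open suffix-array interval [lo, hi) of exact matches."""
    lo, hi = 0, n
    for ch in reversed(pattern):  # LF mapping, one character at a time
        if ch not in C:
            return 0, 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


bwt, sa = bwt_and_sa("ACGTACGT")
C, occ = fm_index(bwt)
lo, hi = backward_search("ACG", C, occ, len(bwt))
positions = sorted(sa[k] for k in range(lo, hi))
```

Each step of the backward search narrows the suffix-array interval, so the number of occurrences is simply `hi - lo`.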
6/12
Conceptual Overview
BWA
• For short reads
BWA-SW
• For long reads
BWA-MEM
• For both
7/12
CUSHAW2 - MEMs
• Long read alignment based on maximal exact match seeds, Yongchao Liu and Bertil Schmidt, Bioinformatics (2012) 28 (18): i318-i324
• CUSHAW2 is a parallelized, accurate, and memory-efficient long read aligner. It is based on the seed-and-extend approach and uses maximal exact matches (MEMs) as seeds to find gapped alignments.
9/12
CUSHAW2 - MEMs
1. Estimation of the minimal seed size
2. Generation of maximal exact matches
10/12
1. Estimation of the minimal seed size
• The q-gram lemma states that two strings P and S with an edit distance of e share at least t q-grams (substrings of length q), where t = max(|P|,|S|) - q + 1 - q*e (Exact and complete short-read alignment to microbial genomes using Graphics Processing Unit programming, Bioinformatics, Vol. 27 no. 10, 2011, pages 1351–1358)
• That is, each error may destroy up to q overlapping q-grams, so e errors may destroy up to q*e of them.
• For non-overlapping q-grams, one error can destroy only the q-gram in which it is located.
• Given this assumption, we define the length q of the q-grams as the largest value satisfying the condition below [formula not preserved in this transcript]
11/12
1. Estimation of the minimal seed size
• A = ACGT
• B = ACTT
• Assume q = 2, e = 1
q(A) = {AC, CG, GT}
q(B) = {AC, CT, TT}
• t = max(|A|,|B|) - q + 1 - q*e
t = max(4, 4) - 2 + 1 - 2*1 = 1
• q(A) and q(B) must therefore share at least t = 1 q-gram (here, AC).
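The worked example can be checked with a few lines of Python. This is only a sketch; the helper names are illustrative and do not come from CUSHAW2.

```python
from collections import Counter


def qgrams(s, q):
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_lower_bound(p, s, q, e):
    """Lower bound t from the q-gram lemma: max(|P|,|S|) - q + 1 - q*e."""
    return max(len(p), len(s)) - q + 1 - q * e


def shared_qgrams(p, s, q):
    """Number of q-grams shared by p and s (multiset intersection)."""
    common = Counter(qgrams(p, q)) & Counter(qgrams(s, q))
    return sum(common.values())


A, B = "ACGT", "ACTT"
t = qgram_lower_bound(A, B, q=2, e=1)
shared = shared_qgrams(A, B, q=2)
```

Here `shared` is at least `t`, as the lemma requires for two strings at edit distance 1.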
12/12
1. Estimation of the minimal seed size
• The estimation is based on the pigeonhole principle for non-overlapping q-grams, meaning that at least one q-gram of length Q is shared by S and its aligned substring mate on the genome.
• QL: global lower bound (default 13)
• QH: global upper bound (default 49)
• A simplified error model for ungapped alignments is used to estimate e; the number of mismatches w is assumed to follow a binomial distribution.
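A minimal sketch of the pigeonhole argument, under an assumption that is not stated in the slides: a read of length |S| is cut into non-overlapping q-grams, and at least e + 1 of them are required so that one survives e errors. The exact condition used by CUSHAW2 is not preserved in this transcript, so the formula below is illustrative only; the bounds 13 and 49 are the defaults quoted above.

```python
def estimate_min_seed_size(read_len, e, q_low=13, q_high=49):
    """Largest q with floor(read_len / q) >= e + 1, clamped to [q_low, q_high].

    Assumption: with e errors, e + 1 non-overlapping q-grams guarantee
    that at least one q-gram is error-free (pigeonhole principle).
    """
    q = read_len // (e + 1)  # largest q that still yields e + 1 pieces
    return min(max(q, q_low), q_high)


q_for_3_errors = estimate_min_seed_size(100, e=3)
q_for_0_errors = estimate_min_seed_size(100, e=0)
q_for_10_errors = estimate_min_seed_size(100, e=10)
```

More expected errors give shorter seeds, until the global lower bound takes over.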
13/12
2. Generation of maximal exact matches
• To identify MEMs between S and T, we advance the starting position p in S, from left to right, to find the longest exact matches (LEMs) using the BWT and the FM-index.
• An LEM is right-maximal by construction; it is also left-maximal, and hence a MEM, if it is not part of any previously identified MEM.
• Discard the MEMs whose lengths are less than Q.
– For each remaining MEM, we keep only its first h (h = 1024 by default) occurrences and discard the others.
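The left-to-right scan can be sketched as follows. This is a simplification, not CUSHAW2's implementation: a naive substring test stands in for the BWT/FM-index, matches are tracked as query intervals only, and the cap of h occurrences is not modelled.

```python
def longest_exact_match(query, ref, p):
    """Length of the longest prefix of query[p:] that occurs somewhere in ref."""
    length = 0
    while p + length < len(query) and query[p:p + length + 1] in ref:
        length += 1
    return length


def find_mems(query, ref, min_len):
    """Query intervals [start, end) of LEMs that are not contained in an earlier LEM."""
    mems = []
    prev_end = 0
    for p in range(len(query)):
        end = p + longest_exact_match(query, ref, p)
        # An LEM that does not reach past the previous end lies inside an
        # earlier match, so it is not left-maximal.
        if end > prev_end and end - p >= min_len:
            mems.append((p, end))
        prev_end = max(prev_end, end)
    return mems


mems = find_mems("ACGTTTGA", "ACGTAAATTTGA", min_len=3)
```

For this toy input the scan reports two seeds, "ACGT" and "TTTGA", which overlap by one base in the query.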
14/12
BWA-MEM
1. Aligning a single query sequence
a. Seeding and re-seeding
b. Chaining and chain filtering
c. Seed extension
2. Paired-end mapping
a. Rescuing missing hits
b. Pairing
15/12
SE. Seeding and re-seeding
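• BWA-MEM follows the canonical seed-and-extend paradigm.
• It seeds an alignment with SMEMs (super-maximal exact matches), which essentially means finding, at each query position, the longest exact match covering that position.
• To reduce mismappings caused by missing seeds, BWA-MEM introduces re-seeding.
• Suppose we have an SMEM of length l with k occurrences in the reference genome. If l is too large (over 28 bp by default), BWA-MEM re-seeds with the longest exact matches that cover the middle base of the SMEM and occur at least k+1 times in the genome.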
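A toy sketch of re-seeding, assuming the rule described in the BWA-MEM paper: for an SMEM of length l with k occurrences, if l exceeds a threshold (28 bp by default), look for the longest exact matches that cover the middle base of the SMEM and occur at least k + 1 times. Naive string search replaces the FM-index here, and all function names are hypothetical.

```python
def count_occurrences(ref, s):
    """Count (possibly overlapping) occurrences of s in ref."""
    count, start = 0, ref.find(s)
    while start != -1:
        count += 1
        start = ref.find(s, start + 1)
    return count


def reseed(query, ref, smem, min_len=28):
    """Return query intervals covering the SMEM midpoint with >= k + 1 hits."""
    begin, end = smem
    if end - begin <= min_len:
        return []  # SMEM is short enough; no re-seeding
    k = count_occurrences(ref, query[begin:end])
    mid = (begin + end) // 2
    seeds, last_end = [], mid
    for i in range(0, mid + 1):
        # Longest match starting at i that still covers the middle base.
        for j in range(len(query), mid, -1):
            if count_occurrences(ref, query[i:j]) >= k + 1:
                if j > last_end:  # skip matches contained in an earlier one
                    seeds.append((i, j))
                    last_end = j
                break
    return seeds


ref = "ACGTACTTCGTAGG"
extra_seeds = reseed("ACGTAC", ref, smem=(0, 6), min_len=5)
```

In this example the SMEM "ACGTAC" occurs once, while the shorter "CGTA" covers its middle base and occurs twice, so it is returned as an additional seed. The threshold is lowered to 5 only to keep the example small.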
16/12
SE. Chaining and chain filtering
• A chain is a group of seeds that are collinear and close to each other.
• We greedily chain the seeds while seeding and then filter out short chains that are largely contained in a long chain and are much worse than the long chain (by default, both 50% and 38bp shorter than the long chain).
• Chain filtering aims to reduce unsuccessful seed extension at a later step.
• Chains detected here do not need to be accurate.
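A rough sketch of greedy chaining and chain filtering. Several things here are assumptions made for illustration: a seed is a tuple (query_start, ref_start, length), the chain weight is the plain sum of seed lengths, "largely contained" is read as at least 50% query overlap, and `max_gap` and `band` are hypothetical parameters. Only the 50% and 38 bp cut-offs come from the slide.

```python
def chain_seeds(seeds, max_gap=100, band=10):
    """Greedily append each seed to the first chain it is collinear with."""
    chains = []
    for seed in sorted(seeds, key=lambda s: (s[1], s[0])):
        q, r, _ = seed
        for chain in chains:
            lq, lr, ln = chain[-1]
            dq, dr = q - lq, r - lr
            close = 0 <= dq <= ln + max_gap and 0 <= dr <= ln + max_gap
            if close and abs(dq - dr) <= band:  # near the same diagonal
                chain.append(seed)
                break
        else:
            chains.append([seed])
    return chains


def chain_weight(chain):
    return sum(n for _, _, n in chain)


def query_span(chain):
    return chain[0][0], max(q + n for q, _, n in chain)


def filter_chains(chains, min_overlap=0.5, drop_ratio=0.5, min_diff=38):
    """Drop chains that overlap a heavier chain and are much lighter than it."""
    kept = []
    for chain in sorted(chains, key=chain_weight, reverse=True):
        b, e = query_span(chain)
        w = chain_weight(chain)
        dominated = False
        for big in kept:
            bb, be = query_span(big)
            overlap = min(e, be) - max(b, bb)
            bw = chain_weight(big)
            if (overlap >= min_overlap * (e - b)
                    and w < drop_ratio * bw and bw - w >= min_diff):
                dominated = True
                break
        if not dominated:
            kept.append(chain)
    return kept


seeds = [(0, 1000, 40), (45, 1045, 40), (10, 5000, 20)]
chains = chain_seeds(seeds)
kept = filter_chains(chains)
```

The two collinear seeds form one chain of weight 80; the lone 20 bp seed falls inside its query span and is both less than 50% of its weight and more than 38 bp lighter, so it is filtered out.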
17/12
SE. Seed extension
• Rank a seed by the length of the chain it belongs to and then by the seed length.
• Drop the seed if it is already contained in an alignment found before, or extend the seed with a banded affine-gap-penalty dynamic programming (DP) if it potentially leads to a new alignment.
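The ranking and the containment test can be sketched as below. The tuple layouts are assumptions made for this example: a seed is (query_start, ref_start, length) and an alignment is (query_begin, query_end, ref_begin, ref_end).

```python
def rank_seeds(chains):
    """Order seeds by chain length (query span), then by seed length, longest first."""
    ranked = []
    for chain in chains:
        start = min(q for q, _, _ in chain)
        end = max(q + n for q, _, n in chain)
        for seed in chain:
            ranked.append((end - start, seed[2], seed))
    ranked.sort(key=lambda item: (-item[0], -item[1]))
    return [seed for _, _, seed in ranked]


def is_contained(seed, alignments):
    """True if the seed lies inside an alignment on both query and reference."""
    q, r, n = seed
    return any(qb <= q and q + n <= qe and rb <= r and r + n <= re_
               for qb, qe, rb, re_ in alignments)


chains = [[(0, 100, 20), (25, 125, 30)], [(5, 900, 40)]]
ranked = rank_seeds(chains)
found = [(0, 60, 100, 160)]  # hypothetical alignment from an earlier extension
skip_first_chain_seed = is_contained((0, 100, 20), found)
skip_other_seed = is_contained((5, 900, 40), found)
```

A seed for which `is_contained` returns True would be dropped, and any other seed would be passed on to extension.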
18/12
SE. Seed extension
• banded affine-gap-penalty dynamic programming
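A compact sketch of banded affine-gap extension (Gotoh-style recurrences restricted to a diagonal band). It is not BWA-MEM's code: the alignment is anchored at the start of both sequences, may end anywhere, and only scores are computed. The scoring values follow BWA-MEM's documented defaults (match 1, mismatch 4, gap open 6, gap extension 1); the band width is an arbitrary choice.

```python
NEG = float("-inf")


def banded_extend(query, ref, band=5, match=1, mismatch=4, gap_open=6, gap_ext=1):
    """Return (best_score, query_end, ref_end) for an extension starting at (0, 0)."""
    n, m = len(query), len(ref)
    H = [[NEG] * (m + 1) for _ in range(n + 1)]  # best score ending at (i, j)
    E = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends with a gap that consumes ref
    F = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends with a gap that consumes query
    H[0][0] = 0
    best = (0, 0, 0)
    for i in range(n + 1):
        # Only cells with |i - j| <= band are filled; the rest stay at -inf.
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if i == 0 and j == 0:
                continue
            if j > 0:
                E[i][j] = max(H[i][j - 1] - gap_open - gap_ext, E[i][j - 1] - gap_ext)
            if i > 0:
                F[i][j] = max(H[i - 1][j] - gap_open - gap_ext, F[i - 1][j] - gap_ext)
            diag = NEG
            if i > 0 and j > 0:
                s = match if query[i - 1] == ref[j - 1] else -mismatch
                diag = H[i - 1][j - 1] + s
            H[i][j] = max(diag, E[i][j], F[i][j])
            if H[i][j] > best[0]:
                best = (H[i][j], i, j)
    return best


exact = banded_extend("ACGT", "ACGT")
with_gap = banded_extend("ACGT" + "ACGTACGTAC", "ACGTT" + "ACGTACGTAC")
```

In the second call the reference carries one extra base: 14 matches minus one gap of length 1 (cost 6 + 1) gives a score of 7.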
19/12
SE. Seed extension
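• BWA-MEM’s seed extension differs from the standard seed extension in two aspects.
1. Suppose at a certain extension step we come to reference position x with the best extension score achieved at query position y. BWA-MEM stops the extension if the difference between the best extension score and the score at (x, y) is greater than Z + |x - y| * (gap extension penalty), where Z is a cutoff (Z-dropoff).
2. While extending a seed, BWA-MEM tries to keep track of the best extension score reaching the end of the query sequence. If the best local score exceeds it by less than a threshold, the end-to-end alignment is preferred over the local one.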
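The two heuristics can be written as small predicates. This sketch rests on assumptions drawn from the BWA-MEM paper and the `bwa mem` defaults rather than from the slide itself: extension stops when the score drops by more than Z + |x - y| * gap extension (default Z-dropoff 100), and an end-to-end alignment is preferred when the local score beats it by less than the clipping penalty (default 5). The function names are hypothetical.

```python
def should_stop_extension(best_score, current_score, x, y, zdrop=100, gap_ext=1):
    """Z-dropoff: stop once the score has fallen too far below the best seen.

    The |x - y| term keeps a long gap from triggering the stop by itself.
    """
    return best_score - current_score > zdrop + abs(x - y) * gap_ext


def choose_alignment(best_local_score, best_to_end_score, clip_penalty=5):
    """Prefer end-to-end unless soft clipping gains at least clip_penalty."""
    if best_local_score - best_to_end_score < clip_penalty:
        return "end-to-end"
    return "local"


stop_small_drop = should_stop_extension(150, 60, x=200, y=200)
stop_large_drop = should_stop_extension(150, 40, x=200, y=200)
stop_with_long_gap = should_stop_extension(150, 40, x=230, y=200)
mode_close = choose_alignment(96, 93)
mode_far = choose_alignment(96, 80)
```

With a 30-base offset between x and y, the same score drop of 110 no longer stops the extension, which is how this rule differs from a plain X-dropoff.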
20/12
PE. Rescuing missing hits
• BWA-MEM estimates the mean and the variance of the insert size distribution from reliable single-end hits.
• For the top 100 hits (by default) of either end, if the mate is unmapped in a window [bounds missing in this transcript] from each hit, BWA-MEM performs SSE2-based Smith-Waterman alignment for the mate within the window.
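A small sketch of how a rescue window could be derived from the insert-size estimate. The window bounds are not preserved in the slide, so the multiplier `k` is a hypothetical parameter, and the sketch assumes a forward-strand hit whose mate lies downstream.

```python
import statistics


def insert_size_stats(insert_sizes):
    """Mean and standard deviation of insert sizes from reliable pairs."""
    return statistics.mean(insert_sizes), statistics.pstdev(insert_sizes)


def rescue_window(hit_pos, mean, sd, k=4.0):
    """Reference interval in which to search for the mate (k is hypothetical)."""
    lo = hit_pos + mean - k * sd
    hi = hit_pos + mean + k * sd
    return max(0, int(lo)), int(hi)


mean, sd = insert_size_stats([290, 300, 310])
window = rescue_window(1000, mean=300, sd=10, k=4.0)
```

The mate would then be aligned with Smith-Waterman against the reference slice covered by `window` only, which is far cheaper than a genome-wide search.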
21/12
PE. Rescuing missing hits
• Hits found from both the single-sequence alignment and SW rescuing will be used for pairing.
23/12
PE. Pairing
• Given the i-th hit for the first read and the j-th hit for the second read, BWA-MEM computes their distance d_ij if the two hits are in the right orientation, or sets d_ij to infinity otherwise.
• It scores the pair (i, j) as S_ij = S_i + S_j - min{-a*log4 P(d_ij), U}
– S_i and S_j are the SW scores of the two hits, and a is the matching score.
– P(d) gives the probability of observing an insert size larger than d assuming a normal distribution.
– ‘log4’ arises when we interpret the SW score as an odds ratio.
– U is a threshold that controls pairing: if d_ij is small enough such that -a*log4 P(d_ij) < U, BWA-MEM prefers to pair the two ends; otherwise it prefers the unpaired alignments.
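A sketch of the pair score, assuming the formula from the BWA-MEM paper, S_ij = S_i + S_j - min{-a*log4 P(d_ij), U}. Two choices below are assumptions: P(d) is taken as a two-sided normal tail probability, and the defaults a = 1 and U = 17 are the `bwa mem` defaults for the matching score and the unpaired penalty.

```python
import math


def tail_prob(d, mean, sd):
    """Two-sided normal tail probability of a deviation as large as |d - mean|."""
    z = abs(d - mean) / sd
    return math.erfc(z / math.sqrt(2))


def pair_score(s_i, s_j, d, mean, sd, a=1, U=17):
    """Score of pairing hits i and j whose distance is d (infinite if misoriented)."""
    if math.isinf(d):
        penalty = U
    else:
        p = tail_prob(d, mean, sd)
        penalty = U if p <= 0 else min(-a * math.log(p, 4), U)
    return s_i + s_j - penalty


typical = pair_score(100, 100, d=300, mean=300, sd=30)
far_apart = pair_score(100, 100, d=600, mean=300, sd=30)
misoriented = pair_score(100, 100, d=math.inf, mean=300, sd=30)
```

A pair at the expected distance keeps the full score of 200, while an implausible distance is capped at the penalty U, the same cost as leaving the two ends unpaired.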
24/12
Results
25/12
Running Operation
• MEM mode
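A minimal sketch of driving the MEM mode from Python. It assumes the `bwa` executable is on the PATH and that the reference has already been indexed with `bwa index`. The file names are placeholders and the wrapper functions are hypothetical; only the `bwa mem` subcommand and its `-t` (threads) option are taken from the BWA manual.

```python
import subprocess


def bwa_mem_command(ref_fasta, reads1, reads2=None, threads=1):
    """Build the argument list for `bwa mem` (single-end or paired-end)."""
    cmd = ["bwa", "mem", "-t", str(threads), ref_fasta, reads1]
    if reads2 is not None:
        cmd.append(reads2)
    return cmd


def run_bwa_mem(ref_fasta, reads1, reads2=None, out_sam="aln.sam", threads=1):
    """Run bwa mem and write its SAM output (printed on stdout) to out_sam."""
    with open(out_sam, "w") as out:
        subprocess.run(bwa_mem_command(ref_fasta, reads1, reads2, threads),
                       stdout=out, check=True)


paired_cmd = bwa_mem_command("ref.fa", "reads_1.fq", "reads_2.fq", threads=4)
```

`run_bwa_mem` is defined but not called here, so the snippet can be inspected without BWA installed.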
26/12
SAM format - spec
27/12
SAM format - example
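Since the example slide did not survive in this transcript, here is a small sketch that parses one SAM alignment line into its 11 mandatory fields plus optional tags. The record itself is made up for illustration and is not output from the seminar.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: dict


def parse_sam_line(line):
    """Split a tab-separated SAM alignment line into typed fields."""
    f = line.rstrip("\n").split("\t")
    tags = {}
    for item in f[11:]:  # optional fields look like TAG:TYPE:VALUE
        tag, typ, value = item.split(":", 2)
        tags[tag] = int(value) if typ == "i" else value
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], tags)


# Hypothetical record: first read of a properly paired, forward-strand alignment.
line = "read1\t99\tchr1\t1000\t60\t8M\t=\t1200\t208\tACGTACGT\tIIIIIIII\tNM:i:0\tAS:i:8"
rec = parse_sam_line(line)
is_reverse = bool(rec.flag & 16)       # 0x10: read mapped to the reverse strand
is_proper_pair = bool(rec.flag & 2)    # 0x2: each segment properly aligned
```

Bits of the FLAG field are tested with bitwise AND; 99 = 1 + 2 + 32 + 64 (paired, proper pair, mate reverse, first in pair).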
28/12
Discussion
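• For clearly long reads of 100bp or more, the MEM mode is mainly used; for short reads of 100bp or less, aln is recommended.
• Because of seed extension and local alignment, alignments can come out split into unnecessarily many pieces; options need to be tuned to correct or post-process such results.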