Genomics Method Seminar - BWA October 15, 2014 Sora Kim Researcher [email protected] Yonsei Biomedical...

Genomics Method Seminar- BWA

October 15, 2014

Sora Kim

[email protected]

Yonsei Biomedical Science Institute, Yonsei University College of Medicine

2/12

Today’s paper

• Heng Li, PhD

– a research scientist at the Broad Institute, working with David Reich and David Altshuler.

– principal developer of several projects including SAMtools, BWA, MAQ, TreeSoft and TreeFam, most of which were started when he was a postdoctoral fellow with Richard Durbin at the Wellcome Trust Sanger Institute.

3/12

Software information

• Purpose

– BWA-MEM is a new alignment algorithm for aligning sequence reads or assembly contigs against a large reference genome such as human.

• Category

– aligner

• Software URL

– http://bio-bwa.sourceforge.net/

• License

– Free, open source under GPLv3

4/12

RNA-seq

ChIP-seq

WGS, WES

5/12

Previous work

• Bowtie

– BWT + FM index

– LF mapping

– Backtracking
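To make the BWT, FM-index and LF-mapping items concrete, here is a minimal sketch of exact-match backward search. It builds the suffix array naively, so it only suits toy inputs, and it does not implement the backtracking step used for inexact matching. The function names `bwt_index` and `backward_search` are illustrative, not taken from Bowtie or BWA.

```python
from collections import Counter

def bwt_index(text):
    """Build the suffix array, C table and Occ table for text plus a '$' sentinel."""
    text += "$"
    # Naive suffix array: sort suffix start positions lexicographically.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix (cyclically).
    bwt = "".join(text[i - 1] for i in sa)
    # c_table[ch]: number of characters in text strictly smaller than ch.
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i]: number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in counts:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c_table, occ

def backward_search(pattern, sa, c_table, occ):
    """Return the sorted text positions where pattern occurs exactly."""
    lo, hi = 0, len(sa)  # half-open suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        # LF mapping: narrow the interval to suffixes preceded by ch.
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))
```

Each pattern character costs two table lookups, so the search time depends on the pattern length rather than on the reference length.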

6/12

Conceptual Overview

BWA

• For short reads

BWA-SW

• For long reads

BWA-MEM

• For both

7/12

CUSHAW2 - MEMs

• Long read alignment based on maximal exact match seeds, Yongchao Liu and Bertil Schmidt, Bioinformatics (2012) 28 (18):i318-i324

• CUSHAW2 is a parallelized, accurate, and memory-efficient long read aligner. It is based on the seed-and-extend approach and uses maximal exact matches as seeds to find gapped alignments.

8/12

CUSHAW2 - MEMs

9/12

CUSHAW2 - MEMs

1. Estimation of the minimal seed size

2. Generation of maximal exact matches

10/12

1. Estimation of the minimal seed size

• The q-gram lemma states that two strings P and S with an edit distance of e share at least t q-grams, that is, substrings of length q, where t = max(|P|,|S|)-q+1-q*e (Exact and complete short-read alignment to microbial genomes using Graphics Processing Unit programming, Bioinformatics, Vol. 27 no. 10 2011, pages 1351–1358)

• That means that every error may destroy up to q overlapping q-grams, so e errors may destroy up to q*e of them.

• For non-overlapping q-grams, one error can destroy only the q-gram in which it is located.

• Given this assumption, we define the length q of the q-grams as the largest value satisfying the condition below: [equation not preserved in the transcript]

11/12


• Example: assume A = ACGT, B = ACTT, q = 2, e = 1

q(A) = {AC, CG, GT}

q(B) = {AC, CT, TT}

• t = max(|A|,|B|)-q+1-q*e = max(4, 4)-2+1-2*1 = 1

• q(A) and q(B) must therefore share at least t = 1 q-gram (here, AC).
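The worked example can be checked mechanically. The sketch below enumerates the overlapping q-grams of both strings and compares the number of shared q-grams with the lower bound t from the q-gram lemma. The helper names are illustrative.

```python
def qgrams(s, q):
    """Return the overlapping substrings of length q in s, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def qgram_lower_bound(p, s, q, e):
    """Lower bound t on shared q-grams: max(|P|,|S|) - q + 1 - q*e."""
    return max(len(p), len(s)) - q + 1 - q * e

a, b, q, e = "ACGT", "ACTT", 2, 1
shared = set(qgrams(a, q)) & set(qgrams(b, q))
t = qgram_lower_bound(a, b, q, e)
print("q(A) =", qgrams(a, q))
print("q(B) =", qgrams(b, q))
print("shared =", shared, "lower bound t =", t)
```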

12/12


• The estimation is based on the pigeonhole principle for non-overlapping q-grams, meaning that at least one q-gram of length Q is shared by S and its aligned substring mate on the genome.

• QL: global lower bound = 13 (default)

• QH: global upper bound = 49 (default)

• A simplified error model for ungapped alignments is employed to estimate e; the number of substitutions w is assumed to follow a binomial distribution.
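One way to read the bullets above is sketched below. By the pigeonhole principle, a read of length |S| with e errors, cut into e+1 non-overlapping pieces, has at least one error-free piece, so the sketch takes floor(|S|/(e+1)) and clamps it to [QL, QH]. The per-base error rate (0.02) and the tolerated probability of underestimating e (0.04) are illustrative assumptions, not values from the slides, and the exact CUSHAW2 formula may differ.

```python
from math import comb

def estimate_errors(read_len, error_rate=0.02, missing_prob=0.04):
    """Smallest e with P(w <= e) >= 1 - missing_prob, w ~ Binomial(read_len, error_rate).

    error_rate and missing_prob are assumed values for illustration only.
    """
    cumulative = 0.0
    for e in range(read_len + 1):
        cumulative += (comb(read_len, e)
                       * error_rate ** e
                       * (1 - error_rate) ** (read_len - e))
        if cumulative >= 1.0 - missing_prob:
            return e
    return read_len

def min_seed_size(read_len, e, q_low=13, q_high=49):
    """Pigeonhole seed size clamped to the global bounds QL and QH."""
    # With e errors and e + 1 non-overlapping pieces, one piece is error-free.
    q = read_len // (e + 1)
    return min(q_high, max(q_low, q))
```

For example, `min_seed_size(100, estimate_errors(100))` gives the seed size for a 100 bp read under these assumed parameters.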

13/12

2. Generation of maximal exact matches

• To identify MEMs between S and T, we advance the starting position p in S, from left to right, to find the longest exact matches (LEMs) using the BWT and the FM-index.

• An LEM is a MEM (both right- and left-maximal) if it is not part of any previously identified MEM.

• Discard the MEMs whose lengths are less than Q.

– For a MEM, we only keep its first h (h=1024 by default) occurrences and discard the others.
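The procedure above can be sketched with plain string search standing in for the BWT and FM-index; this is a simplification for illustration and far slower than an indexed search. `min_len` plays the role of Q and `max_occ` the role of h; the function names are hypothetical.

```python
def find_all(text, pattern):
    """Return every start position of pattern in text (overlaps allowed)."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits

def maximal_exact_matches(s, t, min_len, max_occ=1024):
    """Return (query_start, length, reference_positions) for each kept MEM."""
    mems, prev_end = [], 0
    for p in range(len(s)):
        # Longest exact match (LEM) starting at p: grow while it still occurs in t.
        length = 0
        while p + length < len(s) and s[p:p + length + 1] in t:
            length += 1
        end = p + length
        # An LEM that ends no further right than an earlier one is contained in it.
        if length > 0 and end > prev_end:
            prev_end = end
            if length >= min_len:
                # Keep at most max_occ occurrences of the match in t.
                mems.append((p, length, find_all(t, s[p:end])[:max_occ]))
    return mems
```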

14/12

BWA-MEM

1. Aligning a single query sequence

a. Seeding and re-seeding

b. Chaining and chain filtering

c. Seed extension

2. Paired-end mapping

a. Rescuing missing hits

b. Pairing

15/12

SE. Seeding and re-seeding

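• BWA-MEM follows the canonical seed-and-extend paradigm.

• It seeds an alignment with SMEMs (Super Maximal Exact Matches); the SMEM algorithm essentially finds, at each query position, the longest exact match covering the position.

• To reduce mismappings caused by missing seeds, we introduce re-seeding.

• Suppose we have a SMEM of length l with k occurrences in the reference genome. If l is too large (over 28 bp by default), BWA-MEM re-seeds with the longest exact matches that cover the middle base of the SMEM and occur at least k+1 times in the genome.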
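A brute-force sketch of re-seeding follows, assuming the rule described in the BWA-MEM paper: an SMEM longer than a threshold (28 bp by default) is re-seeded with the longest exact matches that cover its middle base and occur at least k+1 times, where k is the occurrence count of the SMEM itself. Plain substring counting replaces the FM-index here, and the function names are hypothetical.

```python
def count_occurrences(ref, pattern):
    """Count occurrences of pattern in ref, allowing overlaps."""
    count, start = 0, ref.find(pattern)
    while start != -1:
        count += 1
        start = ref.find(pattern, start + 1)
    return count

def reseed(query, ref, smem_start, smem_len, max_smem_len=28):
    """Return query intervals (start, end) to use as extra seeds for a long SMEM."""
    if smem_len <= max_smem_len:
        return []  # short SMEMs are not re-seeded
    mid = smem_start + smem_len // 2  # middle base of the SMEM
    k = count_occurrences(ref, query[smem_start:smem_start + smem_len])
    candidates = []
    for start in range(0, mid + 1):
        # Occurrence counts only drop as a match grows, so the first hit
        # scanning from the longest end is the longest match for this start.
        for end in range(len(query), mid, -1):
            if count_occurrences(ref, query[start:end]) >= k + 1:
                candidates.append((start, end))
                break
    # Keep only intervals that are not contained in another candidate.
    return [c for c in candidates
            if not any(o != c and o[0] <= c[0] and c[1] <= o[1]
                       for o in candidates)]
```

In the toy call `reseed("AAACCCGGGTTT", "AAACCCGGGTTTNNCCCGG", 0, 12, max_smem_len=8)`, the full-length SMEM occurs once, and the re-seed is the shorter match `CCCGG`, which occurs twice.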

16/12

SE. Chaining and chain filtering

• We call a group of seeds that are colinear and close to each other a chain.

• We greedily chain the seeds while seeding and then filter out short chains that are largely contained in a long chain and are much worse than the long chain (by default, both 50% and 38bp shorter than the long chain).

• Chain filtering aims to reduce unsuccessful seed extension at a later step.

• Chains detected here do not need to be accurate.
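A rough sketch of greedy chaining and chain filtering is given below. Seeds are `(query_pos, ref_pos, length)` tuples. The colinearity test, `max_gap`, `band` and the 50% overlap requirement for "largely contained" are assumptions made for illustration; only the 50% and 38 bp filter defaults come from the slide.

```python
def chain_seeds(seeds, max_gap=100, band=10):
    """Greedily group colinear, nearby seeds; max_gap and band are assumed values."""
    chains = []
    for q, r, n in sorted(seeds, key=lambda s: (s[1], s[0])):
        for chain in chains:
            lq, lr, ln = chain[-1]
            # Colinear: same order on query and reference, similar diagonal.
            colinear = q >= lq and r >= lr and abs((q - lq) - (r - lr)) <= band
            close = q - (lq + ln) <= max_gap and r - (lr + ln) <= max_gap
            if colinear and close:
                chain.append((q, r, n))
                break
        else:
            chains.append([(q, r, n)])  # no chain accepted the seed
    return chains

def chain_weight(chain):
    """Number of query bases covered by the seeds of a chain."""
    covered = set()
    for q, _, n in chain:
        covered.update(range(q, q + n))
    return len(covered)

def query_span(chain):
    """Query interval (begin, end) spanned by a chain."""
    return min(q for q, _, _ in chain), max(q + n for q, _, n in chain)

def filter_chains(chains, min_ratio=0.5, min_diff=38, min_overlap=0.5):
    """Drop chains largely contained in, and much worse than, a longer chain."""
    kept = []
    for chain in sorted(chains, key=chain_weight, reverse=True):
        w = chain_weight(chain)
        b, e = query_span(chain)
        dominated = False
        for long_chain in kept:
            lw = chain_weight(long_chain)
            lb, le = query_span(long_chain)
            overlap = min(e, le) - max(b, lb)
            largely_contained = overlap >= min_overlap * (e - b)
            much_worse = w < min_ratio * lw and lw - w >= min_diff
            if largely_contained and much_worse:
                dominated = True
                break
        if not dominated:
            kept.append(chain)
    return kept
```

Because chains only need to be approximately right at this stage, a cheap greedy pass like this is enough; the later extension step decides which alignments survive.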

17/12

SE. Seed extension

• rank a seed by length of the chain it belongs to and then by the seed length.

• drop the seed if it is already contained in an alignment found before, or extend the seed with a banded affine-gap-penalty dynamic programming (DP) if it potentially leads to a new alignment.

18/12

SE. Seed extension

• banded affine-gap-penalty dynamic programming
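The idea of banded affine-gap DP can be sketched as a Gotoh-style recurrence that only fills cells with |i - j| <= band and reports the best-scoring extension from the seed boundary. The scoring parameters (match 1, mismatch 4, gap open 6, gap extension 1) and the band width are assumptions for this sketch, and the code is not BWA-MEM's SSE2 implementation.

```python
NEG = float("-inf")

def banded_extend(query, ref, band=5, match=1, mismatch=4,
                  gap_open=6, gap_ext=1, h0=0):
    """Return (best_score, query_end, ref_end) for an extension starting at (0, 0).

    A gap of length k costs gap_open + k * gap_ext. Cells outside the band
    |i - j| <= band are never filled and stay at -inf.
    """
    n, m = len(query), len(ref)
    H = [[NEG] * (m + 1) for _ in range(n + 1)]  # best score ending at (i, j)
    E = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends by consuming a reference base only
    F = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends by consuming a query base only
    H[0][0] = h0
    best = (h0, 0, 0)
    for i in range(n + 1):
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if i == 0 and j == 0:
                continue
            if j > 0:
                E[i][j] = max(H[i][j - 1] - gap_open - gap_ext, E[i][j - 1] - gap_ext)
            if i > 0:
                F[i][j] = max(H[i - 1][j] - gap_open - gap_ext, F[i - 1][j] - gap_ext)
            diag = NEG
            if i > 0 and j > 0:
                step = match if query[i - 1] == ref[j - 1] else -mismatch
                diag = H[i - 1][j - 1] + step
            H[i][j] = max(diag, E[i][j], F[i][j])
            if H[i][j] > best[0]:
                best = (H[i][j], i, j)
    return best
```

With `band=0` only the main diagonal is filled, so no gap can be opened; widening the band trades time for the ability to cross longer indels.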

19/12

SE. Seed extension

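• BWA-MEM’s seed extension differs from the standard seed extension in two aspects.

1. Suppose at a certain extension step we come to reference position x with the best extension score achieved at query position y. BWA-MEM stops the extension if the difference between the best extension score and the score at (x, y) is larger than Z + |x - y| * pgapExt, where pgapExt is the gap extension penalty and Z is a cutoff called the Z-dropoff.

2. While extending a seed, BWA-MEM tries to keep track of the best extension score reaching the end of the query sequence. If the difference between the best local alignment score and the best score reaching the end is below a threshold, BWA-MEM rejects the local alignment even if it has the higher score.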
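The two heuristics can be written as small predicates. This sketch assumes the stopping rule from the BWA-MEM paper (stop when the score falls more than Z + |x - y| * gap extension penalty below the best score) and treats the end-of-query preference as a simple threshold test. The default values `z_drop=100` and `clip_penalty=5` are assumptions, and the function names are hypothetical.

```python
def should_stop(best_score, current_score, x, y, z_drop=100, gap_ext=1):
    """Z-dropoff test: x is the reference position, y the query position."""
    # The |x - y| term keeps a long gap in one sequence from triggering the stop.
    return best_score - current_score > z_drop + abs(x - y) * gap_ext

def choose_alignment(best_local, best_to_end, clip_penalty=5):
    """Prefer the end-to-end extension unless the local score is clearly better."""
    if best_local - best_to_end < clip_penalty:
        return "end-to-end"
    return "local"
```

The first rule avoids extending through a poorly aligned region; the second avoids clipping a few bases off the read end for a marginal score gain.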

20/12

PE. Rescuing missing hits

• BWA-MEM estimates the mean μ and the variance σ² of the insert size distribution from reliable single-end hits.

• For the top 100 hits (by default) of either end, if the mate is unmapped in a window [μ - 4σ, μ + 4σ] from each hit, BWA-MEM performs SSE2-based Smith-Waterman alignment for the mate within the window.
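A minimal sketch of the two steps follows: estimating the insert size distribution, then deriving the window in which to search for a missing mate. It assumes a window of mean ± 4 standard deviations and a mate lying downstream of a forward-strand hit; strand handling and the Smith-Waterman rescue itself are left out. The function names are illustrative.

```python
from statistics import mean, pstdev

def insert_size_stats(insert_sizes):
    """Mean and standard deviation of insert sizes from reliable pairs."""
    return mean(insert_sizes), pstdev(insert_sizes)

def rescue_window(hit_pos, mu, sigma, n_sigma=4):
    """Reference interval in which to look for the mate of a forward-strand hit.

    n_sigma=4 is an assumed window half-width in standard deviations.
    """
    return hit_pos + mu - n_sigma * sigma, hit_pos + mu + n_sigma * sigma
```

For instance, `rescue_window(1000, *insert_size_stats(sizes))` yields the interval to hand to a local aligner for the unmapped mate.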

21/12


• Hits found from both the single-sequence alignment and SW rescuing will be used for pairing.


23/12

PE. Pairing

• Given the i-th hit for the first read and the j-th hit for the second read, BWA-MEM computes their distance d_ij if the two hits are in the right orientation, or sets d_ij to infinity otherwise.

• It scores the pair (i, j) as S_ij = S_i + S_j - min{-a*log4 P(d_ij), U}, where S_i and S_j are the SW scores of the two hits and a is the matching score.

– P(d) gives the probability of observing an insert size larger than d assuming a normal distribution

– ‘log4’ arises when we interpret SW score as odds ratio.

– U is a threshold that controls pairing: if d_ij is small enough such that -a*log4 P(d_ij) < U, BWA-MEM prefers to pair the two ends; otherwise it prefers the unpaired alignments.
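The pairing rule can be sketched as follows, assuming the score from the BWA-MEM paper, S_ij = S_i + S_j - min{-a*log4 P(d_ij), U}. Here P(d) is modelled as the two-sided normal tail probability of a deviation at least as large as |d - mean|, which is an assumption of this sketch, as are the defaults `a=1` and `u=17`.

```python
from math import erfc, isinf, log, sqrt

def pair_score(s_i, s_j, d, mu, sigma, a=1, u=17):
    """Score a candidate pair; d is infinite when the orientation is wrong."""
    if isinf(d):
        penalty = u  # wrong orientation: pay the full unpaired penalty
    else:
        # Assumed model: two-sided tail probability under Normal(mu, sigma).
        p = erfc(abs(d - mu) / (sigma * sqrt(2)))
        # p can underflow to 0 for extreme distances; cap the penalty at u.
        penalty = u if p <= 0.0 else min(-a * log(p, 4), u)
    return s_i + s_j - penalty
```

A pair whose distance is close to the mean insert size pays almost no penalty, while an implausible distance costs exactly U, the same as treating the two ends as unpaired.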

24/12

Results

25/12

Running Operation

• MEM mode

26/12

SAM format - spec

27/12

SAM format - example
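As a reading aid for the SAM slides, the sketch below splits one alignment line into the 11 mandatory tab-separated fields defined by the SAM specification and decodes three common FLAG bits. The example record is made up for illustration and is not taken from the slides.

```python
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise FLAG
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities
    tags: list   # optional TAG:TYPE:VALUE fields

def parse_sam_line(line):
    """Parse one non-header SAM line into a SamRecord."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])

def describe_flag(flag):
    """Decode a few FLAG bits: 0x1 paired, 0x4 unmapped, 0x10 reverse strand."""
    return {"paired": bool(flag & 0x1),
            "unmapped": bool(flag & 0x4),
            "reverse": bool(flag & 0x10)}

# Hypothetical record, for illustration only.
example = "read1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII\tNM:i:0"
record = parse_sam_line(example)
print(record.qname, record.pos, record.cigar, describe_flag(record.flag))
```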

28/12

Discussion

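• Recommendation: use the MEM mode mainly for reads that are clearly long (100bp or more), and use aln for short reads (100bp or less).

• Because of seed extension and local alignment, alignments can come out split into unnecessarily many pieces; options need to be adjusted to correct or post-process such results.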
