Aligning Reads Ramesh Hariharan Strand Life Sciences IISc
Aligning Reads
Ramesh Hariharan
Strand Life Sciences, IISc
What is Read Alignment?
AGGCTACGCATTTCCCATAAAGACCCACGCTTAAGTTC
Subject’s Genome
AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC
Reference Genome
Where do these match in the Reference?
Close but not quite the same as the Subject’s Genome
What does “Match” mean?
AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC
Reference Genome
GCTACGCA
Exact Match
CATAAAGAC
With Mismatches
CACTT_AGT
With Gaps
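The exact and mismatch cases above can be checked directly against the reference on the slide. The following is a minimal sketch; the function name `mismatches_at` and the start positions 2 and 15 are assumptions worked out for this illustration and do not come from the talk. Gaps are not handled here, since they need an alignment step rather than a position-by-position comparison.

```python
REFERENCE = "AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC"

def mismatches_at(reference, read, start):
    """Count mismatching bases when `read` is laid on `reference` at `start` (no gaps)."""
    window = reference[start:start + len(read)]
    if start < 0 or len(window) < len(read):
        raise ValueError("read does not fit inside the reference at this position")
    return sum(1 for ref_base, read_base in zip(window, read) if ref_base != read_base)

print(mismatches_at(REFERENCE, "GCTACGCA", 2))    # 0: exact match
print(mismatches_at(REFERENCE, "CATAAAGAC", 15))  # 1: match with one mismatch
```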
Why mismatches and gaps?
The subject genome could be different from the reference
Reads
Reference Genome
SNP
Deletion
Mismatches and Gaps
The reading process could be erroneous
How many mismatches and gaps?
Short reads, ~50, few mismatches and gaps
Long reads, ~1000, many more mismatches and gaps
How do aligners fare?
BWA: Very few mismatches and gaps
CoBWeb, BWA-SW: Many mismatches and gaps
BowTie: only mismatches, no gaps
No paired read handling
No handling of adaptor trimming for small RNA
Separate handling for RNASeq
BowTie2
How does an Aligner work?
For simplicity, assume Exact Match
For each read, scan the entire reference genome sequence
SLOW!!!!
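A minimal sketch of the scan described above, assuming exact matching only. The name `naive_scan` is made up for this illustration. Every read is compared against every reference position, so the work grows with (reference length) x (read length) per read, which is what makes this approach slow on a whole genome.

```python
def naive_scan(reference, read):
    """Return every start position where `read` matches `reference` exactly."""
    hits = []
    # Try each possible start position in the reference in turn.
    for start in range(len(reference) - len(read) + 1):
        if reference[start:start + len(read)] == read:
            hits.append(start)
    return hits

reference = "AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC"
for read in ["GCTACGCA", "CAT", "TTTT"]:
    print(read, naive_scan(reference, read))
```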
The Reference
[Figure: the reference sequence with a tree index built over it; edge labels are not recoverable from this transcript]
Index the Reference
How can we find Exact Matches of a read quickly with this index?
The Reference
[Figure: tree index over the reference; edge labels are not recoverable from this transcript]
C G C
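One way to picture a tree index of this kind is an uncompressed suffix trie. The sketch below is an assumption made for illustration, not the structure used in the talk: it inserts every suffix of the reference into a nested dictionary and records, at each node, the start positions of the suffixes passing through it. A lookup then walks one edge per base of the read, so its cost depends on the read length rather than on the reference length. The price is memory, since this naive form stores far more than the reference itself.

```python
def build_suffix_trie(reference):
    """Build an uncompressed suffix trie; each node records the suffix starts that reach it."""
    root = {"children": {}, "starts": []}
    for start in range(len(reference)):
        node = root
        for base in reference[start:]:
            node = node["children"].setdefault(base, {"children": {}, "starts": []})
            node["starts"].append(start)
    return root

def lookup(trie, read):
    """Return all positions where `read` occurs exactly, following one edge per base."""
    node = trie
    for base in read:
        node = node["children"].get(base)
        if node is None:
            return []
    return node["starts"]

trie = build_suffix_trie("AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC")
print(lookup(trie, "GCTACGCA"))
print(lookup(trie, "CAT"))
```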
The problem: 24GB
Can this structure be compressed?
The Reference
C G A C $
All its circular shifts, sorted lexicographically
A C $ C G
C G A C $
C $ C G A
G A C $ C
$ C G A C
The last column (G $ A C C) is the BWT
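The construction can be sketched in a few lines. This is a naive version for illustration only (the function name `bwt` is an assumption, and real indexes avoid materialising all rotations). It uses plain character ordering, in which '$' sorts before the letters; that is a common convention, but the row order and therefore the resulting string may differ from the slide if the slide orders '$' differently.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations; `text` must end with a unique '$'."""
    if text.count("$") != 1 or not text.endswith("$"):
        raise ValueError("text must end with a single '$' sentinel")
    # All circular shifts of the text, sorted lexicographically.
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("CGAC$"))  # CGA$C when '$' sorts first
```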
The Index: now an array instead of a tree
The Burrows-Wheeler based Index
Sampled to reduce memory at the expense of speed (Ferragina and Manzini)
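A minimal sketch of exact-match search over a Burrows-Wheeler based index, in the style of Ferragina and Manzini's backward search. The names `build_fm_index` and `backward_search` are assumptions for this illustration. For clarity the sketch keeps the full suffix array and full occurrence counts in memory; a practical index would store only samples of both and recompute the rest on demand, which is the memory-for-speed trade mentioned above.

```python
def build_fm_index(text):
    """Build a toy FM-index; `text` must end with a unique '$' sentinel."""
    n = len(text)
    # Suffix array built naively by sorting suffixes (fine for small inputs only).
    sa = sorted(range(n), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to '$' for suffix 0).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c] = number of characters in the text that sort strictly before c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (n + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, first, occ

def backward_search(read, sa, first, occ):
    """Return sorted start positions of exact occurrences of `read`."""
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows still matching
    for c in reversed(read):  # extend the match one base to the left each step
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])

index = build_fm_index("AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC" + "$")
print(backward_search("GCTACGCA", *index))
print(backward_search("CAT", *index))
```

Each step of the search touches only two counters, so the time to locate a read's matching range depends on the read length, not on the size of the reference.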
How about Mismatches and Gaps?
BWA, BWA-SW and BowTie force mismatches and gaps into the BW Index searching procedure
CoBWeb uses the BW Index to find a ‘seed’ exact match and does Smith-Waterman around this seed
This 15-mer occurs at locations x1, x2…
This 15-mer occurs at locations x3, x4…
This whole 30-mer occurs at location x5
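The seeding idea can be sketched as follows. This is an illustration under stated assumptions, not CoBWeb's implementation: the exact-match lookup is done with Python's `str.find` as a stand-in for the BW Index, the seed length is a parameter rather than a fixed 15, and the names `find_all` and `seed_candidates` are made up. Each seed hit votes for the reference position where the whole read would start; positions supported by several seeds correspond to the case where the whole read occurs at one location.

```python
def find_all(reference, seed):
    """All exact occurrences of `seed` (stand-in for an index lookup)."""
    hits = []
    pos = reference.find(seed)
    while pos != -1:
        hits.append(pos)
        pos = reference.find(seed, pos + 1)
    return hits

def seed_candidates(reference, read, seed_len):
    """Map each candidate read start to the number of seeds that support it."""
    votes = {}
    # Cut the read into consecutive, non-overlapping seeds.
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[offset:offset + seed_len]
        for hit in find_all(reference, seed):
            start = hit - offset  # where the read would begin if this seed is right
            if start >= 0:
                votes[start] = votes.get(start, 0) + 1
    return votes

reference = "AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC"
print(seed_candidates(reference, "GCTACGCA", 4))   # both seeds agree on one start
print(seed_candidates(reference, "CATAAAGAC", 4))  # only the first seed matches exactly
```

A candidate with fewer supporting seeds is not discarded; it is simply a location where the remaining bases have to be checked by a more tolerant comparison.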
Dynamic Programming
• Given a location in the reference with a read anchor, how well does the read match here?
Reference
Read
Anchor 14 mer
• Smith-Waterman (optimized for large gaps)
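A minimal sketch of the dynamic programming step, assuming a plain Smith-Waterman local alignment with a linear gap penalty. It does not include the large-gap optimisation mentioned above, and the scoring values (match +2, mismatch -1, gap -2) and the function name are assumptions chosen for illustration. In practice the reference argument would be a window around the anchor rather than the whole genome.

```python
def smith_waterman_score(ref_window, read, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of `read` against `ref_window`."""
    rows, cols = len(read) + 1, len(ref_window) + 1
    # score[i][j] = best score of an alignment ending at read[i-1], ref_window[j-1].
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            same = read[i - 1] == ref_window[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            read_extra = score[i - 1][j] + gap  # read base aligned to a gap
            ref_extra = score[i][j - 1] + gap   # reference base aligned to a gap
            score[i][j] = max(0, diagonal, read_extra, ref_extra)
            best = max(best, score[i][j])
    return best

window = "CACTTAAGT"  # slice of the slide's reference around the gapped example
print(smith_waterman_score(window, "CACTTAGT"))   # 14: eight matches, one gap
print(smith_waterman_score(window, "CACTTAAGT"))  # 18: nine matches
```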
Comparison with BWA
Read Length 50
Read Length 150
20% faster than BWA with comparable results
CoBWeb: 3 mismatches and 2 gaps
BWA: 2 mismatches + 1 gap of possibly multiple length
Comparison with BWA-SW
Read Length 400
8 mismatches plus 10 gaps
                     CoBWeb    BWA-SW
Reads                1m        1m
Time taken           1130s     2242s
Incorrectly Mapped   12598     9819
5650 mapped incorrectly by BWA-SW
The remainder has poor BWA mapping quality
Avadis NGS
Avadis NGS: Alignment, DNA Var Detection, RNASeq, ChIPSeq, Small RNASeq
Thank You