SHRiMP: The SHort Read Mapping Package
Michael Brudno
Department of Computer Science
University of Toronto
11/09/08
Handling NGS Data
• NGS: at least 3 distinct read types:
– Illumina/Solexa, 454: letter-space
– AB SOLiD: color-space (di-base sequencing)
– 2-pass SMS (Helicos): 2 reads of the same location, higher error rates
• Need new algorithms
– SOLiD: Biologists want letters, not colors
– 2-pass: How to best handle two reads?
SHRiMP Overview
Isolate similarity in stages:
1. Spaced Seed Filtering
2. Vectorized Smith-Waterman
3. Full Alignment
– Specialized for SOLiD, 2-pass, Letter-space
4. Compute p-values (and other statistics)
[Slide annotation: a brace labeled "Common" groups the stages shared across read types]
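The spaced seed filtering stage can be illustrated with a minimal sketch. This is not SHRiMP's code: the function names, the toy genome and the toy read are made up for illustration, and only the seed pattern `1111001111` is taken from the comparison slide later in the talk. A `1` in the mask means the position must match; a `0` is a "don't care" column.

```python
def spaced_seed_keys(seq, mask):
    """Yield (position, key) for every window of seq, keeping only the
    characters at positions where mask has a '1'."""
    span = len(mask)
    keep = [i for i, bit in enumerate(mask) if bit == "1"]
    for pos in range(len(seq) - span + 1):
        window = seq[pos:pos + span]
        yield pos, "".join(window[i] for i in keep)


def seed_hits(read, genome, mask):
    """Return sorted (read_pos, genome_pos) pairs whose spaced-seed keys agree."""
    index = {}
    for gpos, key in spaced_seed_keys(genome, mask):
        index.setdefault(key, []).append(gpos)
    hits = []
    for rpos, key in spaced_seed_keys(read, mask):
        for gpos in index.get(key, []):
            hits.append((rpos, gpos))
    return sorted(hits)


mask = "1111001111"               # weight 8, span 10
genome = "ACGTACGTTAGCCGATAGGCTA"  # toy genome (hypothetical)
read = "TTAGAAGATAGG"              # genome[7:19] with two substitutions

print(seed_hits(read, genome, mask))      # [(0, 7)]
print(seed_hits(read, genome, "1" * 10))  # []
```

In this toy case the two substitutions fall under the mask's `0` columns, so the spaced seed still finds the location while a contiguous 10-mer does not.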
Outline
1. AB SOLiD Reads
2. 2-pass (SMS) Reads
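AB SOLiD: Dibase Sequencing
AB SOLiD reads look like this:
T012233102
Color code for each pair of adjacent bases:
  A C G T
A 0 1 2 3
C 1 0 3 2
G 2 3 0 1
T 3 2 1 0
[Figure: the four bases as states, with transitions labeled by color: 0 stays on the same base, 1 links A-C and G-T, 2 links A-G and C-T, 3 links A-T and C-G]
T012233102: TGAGCGTTC
T012033102 (one color changed): TGAATAGGA
TGAGCGTTC
|||
TGAATAGGA
HMM!!! hmm???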
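A color-space read can be turned into letters by walking the color code from the known first base. The sketch below is only an illustration of that walk; the names `MATRIX`, `NEXT_BASE` and `decode` are made up here, and the table is copied from the slide.

```python
BASES = "ACGT"
# Rows and columns in A, C, G, T order, as on the slide.
MATRIX = ["0123", "1032", "2301", "3210"]

# NEXT_BASE[prev][color] is the base that follows `prev` when `color` is read.
NEXT_BASE = {
    prev: {row[j]: BASES[j] for j in range(4)}
    for prev, row in zip(BASES, MATRIX)
}


def decode(read):
    """Translate a color-space read (first base + color digits) into letters.

    The leading base is used only as the starting point and is not
    included in the result.
    """
    prev, colors = read[0], read[1:]
    letters = []
    for color in colors:
        prev = NEXT_BASE[prev][color]
        letters.append(prev)
    return "".join(letters)


print(decode("T012233102"))  # TGAGCGTTC
```

Because each letter depends on the one before it, a single wrong color changes every letter from that point on, which is why the two reads on the slide agree only on their first three letters.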
AB SOLiD: Color space is complex!
G: TTGAGTTATGGAT 012210331023
R: TTGACTTATGGAT 012120331023
SNPs
TGAGTT 12210
TGACTT 12120
TGAATT 12030
TGATTT 12300
INDELS
TGAGTTA 122103
TGA-TTA 12-303
TGAGTTTA 1221003
TGAGTATA 1221333
It's bloody complicated!
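The SNP examples above can be reproduced with a short sketch. It relies on one observation about the color code shown earlier: if the bases are numbered A=0, C=1, G=2, T=3, the color of a pair equals the XOR of the two numbers. The function names are illustrative only.

```python
BASES = "ACGT"


def encode(letters):
    """Colors for each adjacent pair of bases (no leading primer base)."""
    codes = [BASES.index(b) for b in letters]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))


def changed_positions(a, b):
    """Indices at which two equal-length color strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


ref = encode("TGAGTT")
print("TGAGTT", ref)
for variant in ("TGACTT", "TGAATT", "TGATTT"):
    colors = encode(variant)
    print(variant, colors, changed_positions(ref, colors))
```

Each single-letter substitution changes two adjacent colors, whereas a lone sequencing error changes only one color; that difference is what lets an aligner tell SNPs from errors.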
AB SOLiD: Translations
• Look at: 012233102
• Recall: 012033102
• 4 translations for every color sequence
   0 1 2 0 3 3 1 0 2
 A A C T T A T G G A
 C C A G G C G T T C
 G G T C C G C A A G
 T T G A A T A C C T
[Figure: the four bases as states, with transitions labeled by colors 0-3]
TGAGCGTTC
|||
TGAATAGGA

TGAGCGTTC
|||||||||
TGAGCGTTC
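The four translations of a color sequence differ only in the assumed starting base. A minimal sketch, using the same A=0, C=1, G=2, T=3 numbering under which a color is the XOR of two adjacent bases (function names are illustrative):

```python
BASES = "ACGT"


def translate(colors, first):
    """Letters implied by `colors` when the base before the first color is `first`.

    The starting base is included as the first letter of the result.
    """
    code = BASES.index(first)
    out = [first]
    for c in colors:
        code ^= int(c)          # applying a color moves to the next base
        out.append(BASES[code])
    return "".join(out)


def all_translations(colors):
    """One translation per possible starting base."""
    return {b: translate(colors, b) for b in BASES}


for first, letters in all_translations("012033102").items():
    print(first, " ".join(letters))
```

The printed rows correspond to the four rows of the table on the slide.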
AB SOLiD: Modified Smith-Waterman
• 4 S-W matrices, one per translation
• Errors transition into other matrix
• ‘Crossover’ penalty charged for errors
[Figure: S-W matrices for Translation A and Translation C, each aligning a translation of the read against the genome, illustrating a transition from one matrix into another]
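The crossover idea can be sketched in a deliberately simplified form. The code below is not SHRiMP's algorithm: it ignores gaps, compares the read against a genome window of equal length, lets the path start in any translation, and uses made-up score values. It only shows how a path can follow one translation, pay a penalty, and continue in another.

```python
BASES = "ACGT"
MATCH, MISMATCH, CROSSOVER = 4, -3, -3   # illustrative values (assumptions)


def translate(colors, first):
    """Letters after the starting base `first` implied by the color string."""
    code = BASES.index(first)
    out = []
    for c in colors:
        code ^= int(c)
        out.append(BASES[code])
    return "".join(out)


def best_ungapped_score(colors, genome):
    """Best score over paths that may switch translation at a cost.

    Returns (score, path), where path names the translation used at
    each position.
    """
    assert len(colors) == len(genome)
    trans = {b: translate(colors, b) for b in BASES}
    score = {b: 0 for b in BASES}
    path = {b: "" for b in BASES}
    for i, g in enumerate(genome):
        new_score, new_path = {}, {}
        for b in BASES:
            def entering(p):
                # Staying in the same translation is free; switching costs.
                return score[p] + (CROSSOVER if (p != b and i > 0) else 0)
            prev = max(BASES, key=entering)
            step = MATCH if trans[b][i] == g else MISMATCH
            new_score[b] = entering(prev) + step
            new_path[b] = path[prev] + b
        score, path = new_score, new_path
    best = max(BASES, key=lambda b: score[b])
    return score[best], path[best]


print(best_ungapped_score("012233102", "TGAGCGTTC"))  # (36, 'TTTTTTTTT')
print(best_ungapped_score("012033102", "TGAGCGTTC"))  # (33, 'TTTCCCCCC')
```

With the single color error, the best path follows the T translation for three letters and then crosses into the C translation, paying one crossover penalty instead of six mismatches.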
AB SOLiD: Obligatory Comparison
• SHRiMP and AB Mapper (1.6)
– SHRiMP seed weight 8 (1111001111)
– AB 35_2, 35_3 schemas
• 10,000 35bp reads
– C. savignyi (173Mb), very high polymorphism
• Considering single top hits only
            SHRiMP   AB 35_2   AB 35_3
% mapped    19.83    6.67      10.94
Runtime     13m04    1h24      2h25
AB SOLiD: Resultant Alignments
• SHRiMP emits letter-space alignments
– Clear to biologists
– Color-space need not be scary!
G: 798  GAACCCCTTACAACTGAACCCCTTAC  823
        ||X||||||||||||||||||| |||
T:      GAaCCCCTTACAACTGAACCCC-TAC
R:   1 T1211000203110121201000-231  25
Outline
1. AB SOLiD Reads
2. 2-pass (SMS) Reads
2-pass SMS Reads
• SMS reads have high error rates
– “Dark bases” (skipped letters)
– Multiple passes are possible
– Ameliorate errors over passes
• Good chance of missing base in one read
• Acceptable chance of getting it in at least one
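As a rough illustration of the last two bullets, the arithmetic can be sketched under two assumptions that are not from the slide: the passes miss bases independently, and each pass skips a given base with probability 0.07 (an example value only).

```python
def pass_stats(p_miss, passes=2):
    """Per-base probabilities, assuming each pass misses independently."""
    missed_by_all = p_miss ** passes
    return {
        "missed_by_all": missed_by_all,
        "seen_at_least_once": 1 - missed_by_all,
        "seen_in_every_pass": (1 - p_miss) ** passes,
    }


for name, value in pass_stats(0.07).items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions a base is lost from both passes far less often than from either pass alone, although only about 86% of bases appear in both reads, which is why the two reads need to be combined rather than mapped separately.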
Mapping 2-pass Reads
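[Figure: an original sequence and its two reads, with skipped bases shown as gaps (sequence text as extracted: C-GACTTTACTGACTTA, CTGA-T---)]
Reference Genome: ?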
SMS 2-pass: SHRiMP with 2 reads
Match = +4 Mismatch = -3 Gap = -2
[Figure: DP matrix aligning the two reads, CTGACT (columns) against CAGCAT (rows)]
CTG-ACT
CAGCA-T
S=9
CTGCACT
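The score S=9 can be checked with a plain global alignment of the two reads. The sketch below is a textbook Needleman-Wunsch with a linear gap penalty, using the scores from the slide; it is an illustration, not SHRiMP's implementation, and when several alignments tie it returns only one of them.

```python
MATCH, MISMATCH, GAP = 4, -3, -2   # scores from the slide


def global_alignment(a, b):
    """Needleman-Wunsch with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * GAP
    for j in range(1, m + 1):
        F[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i - 1][j] + GAP,
                          F[i][j - 1] + GAP)
    # Trace back one optimal path from the bottom-right corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            if F[i][j] == F[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + GAP:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = global_alignment("CTGACT", "CAGCAT")
print(score)
print(top)
print(bottom)
```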
SMS 2-pass: SHRiMP with 2 reads
Match = +4 Mismatch = -3 Gap = -2
[Figure: DP matrix aligning the two reads, CTGACT (columns) against CAGCAT (rows)]
S=9
CTG-ACT
CAGCA-T
CTGCACT

CTGAC-T
CAG-CAT
CTGACAT
SMS 2-pass: SHRiMP with 2 reads
Match = +4 Mismatch = -3 Gap = -2
[Figure: DP matrix aligning the two reads, CTGACT (columns) against CAGCAT (rows)]
S=9
CTG-ACT
CAGCA-T
CTGCACT

CTGAC-T
CAG-CAT
CTGACAT

S=8
C-TG-ACT
CA-GCA-T
CATGCACT

CT-GAC-T
C-AG-CAT
CTAGACAT

C-TGAC-T
CA-G-CAT
SMS 2-pass: Near-optimal Alignments
Match = +4 Mismatch = -3 Gap = -2
• Compute a DP matrix
• Sum it up with the DP matrix computed in reverse
DP matrix (columns: C T G A C T; rows: C A G C A T):
   0  -2  -4  -6  -8 -10 -12
  -2   4   2   0  -2  -4  -6
  -4   2   1  -1   4   2   0
  -6   0  -1   5   3   1  -1
  -8  -2  -3   3   2   7   5
 -10  -4  -5   1   7   5   4
 -12  -6   0  -1   5   4   9
+
DP matrix computed in reverse:
   9   3   5   6   0  -6 -12
   3   5   6   8   2  -4 -10
   4   6   8   2   4  -2  -8
   2   1   3   4   6   0  -6
  -4  -2   0   6   1   2  -4
  -6  -4  -2   0   2   4  -2
 -12 -10  -8  -6  -4  -2   0
SMS 2-pass: Near-optimal Alignments
Match = +4 Mismatch = -3 Gap = -2
• Compute a DP matrix
• Sum it up with the DP matrix computed in reverse
• Leave only near optimal alignments
=
Summed matrix (columns: C T G A C T; rows: C A G C A T):
   9   1   1   0  -8 -16 -24
   1   9   8   7   0  -8 -16
   0   8   9   1   7   0  -8
  -4   0   1   9   9   1  -7
 -12  -4  -3   9   3   9   1
 -16  -8  -7   1   9   9   2
 -24 -16  -8  -7   1   2   9
Near-optimal cells only:
   9   .   .   .   .   .   .
   .   9   8   .   .   .   .
   .   8   9   .   .   .   .
   .   .   .   9   9   .   .
   .   .   .   9   .   9   .
   .   .   .   .   9   9   .
   .   .   .   .   .   .   9
Represent the remaining cells as a directed graph (Schwikowski & Vingron, 2003)
[Figure: directed graph whose nodes are aligned column pairs: AT, -T, A-, CC, A-, -T, GG, CC, A-, -A, AA, -C, C-]
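The forward-plus-reverse construction can be sketched as follows. This is an illustrative reimplementation with made-up function names, not SHRiMP's code: cell (i, j) of the sum holds the best score of any global alignment forced to pass through that cell, so keeping the cells within some slack of the optimum keeps exactly the cells that lie on near-optimal alignments. The reverse matrix is computed here with standard suffix-alignment scores, so individual off-path entries may not match the slide's figure cell for cell.

```python
MATCH, MISMATCH, GAP = 4, -3, -2   # scores from the slide


def forward_matrix(a, b):
    """F[i][j] = best score for aligning the prefixes a[:i] and b[:j]."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0 and j > 0:
                sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
                options.append(F[i - 1][j - 1] + sub)
            if i > 0:
                options.append(F[i - 1][j] + GAP)
            if j > 0:
                options.append(F[i][j - 1] + GAP)
            F[i][j] = max(options)
    return F


def reverse_matrix(a, b):
    """B[i][j] = best score for aligning the suffixes a[i:] and b[j:]."""
    R = forward_matrix(a[::-1], b[::-1])
    n, m = len(a), len(b)
    return [[R[n - i][m - j] for j in range(m + 1)] for i in range(n + 1)]


def near_optimal_cells(a, b, slack):
    """Optimal score and the cells whose best through-score is within slack."""
    F, B = forward_matrix(a, b), reverse_matrix(a, b)
    best = F[len(a)][len(b)]
    cells = {(i, j)
             for i in range(len(a) + 1)
             for j in range(len(b) + 1)
             if F[i][j] + B[i][j] >= best - slack}
    return best, cells


rows, cols = "CAGCAT", "CTGACT"
best, optimal = near_optimal_cells(rows, cols, slack=0)
print(best)
print(sorted(optimal))
```

Raising `slack` to 1 adds the cells that lie on alignments scoring 8; the remaining cells and the moves between them form the directed graph described on the slide.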
SMS 2-pass: SHRiMP with 2-pass data
• Build a DAG representing the (near) optimal alignments of the two reads
• Generate seeds (short paths) from the DAG
• Do k-mer scan; if seeds encountered align both reads to the location using vectorized SW.
• Do full alignment for top hits
[Figure: DAG of aligned column pairs: AT, -T, A-, CC, A-, -T, GG, CC, A-, -A, AA, -C, C-, TT]
SMS 2-pass: Results (in brief)
• 10,000 synthetic reads (~25-65 bp)
– 7% deletion, 1% insertion, 1% substitution rate
• Mapped to Human chromosome 1
– Spaced seed weight 8: 111101111
Type               Separate   Profile   WSG
No hits %          0.13       4.91      4.31
Multiple %         26.45      9.34      9.13
Unique correct %   63.00      74.90     75.84
Runtime            9m         11m       12m
SHRiMP Summary
• Fast mapping of short reads to a genome
– Handles:
  • color-space (SOLiD) reads
  • 2-pass (SMS) reads
  • insertions and deletions
– Easy to parallelize
• Computation of p-values & other statistics for hits
SHRiMP TODO List
• Faster Mapping (biggest complaint)
• Matepair data support
• Transcriptome Data
• Suggestions?
Acknowledgements
SHRiMP is brought to you by:
University of Toronto
– Steve Rumble
– Vlad Yanovsky
– Adrian Dalca
– Marc Fiume
Stanford University
– Phil Lacroute
– Arend Sidow
http://compbio.cs.toronto.edu/shrimp