Presented by: Xia Li
![Page 1: Presented by: Xia Li](https://reader033.fdocuments.us/reader033/viewer/2022051020/56815fe9550346895dceee2e/html5/thumbnails/1.jpg)
SeqMap: mapping massive amount of oligonucleotides to the genome
Hui Jiang et al. Bioinformatics (2008) 24: 2395-2396
The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing
Nathan Clement et al. Bioinformatics (2010) 26: 38-45
Presented by: Xia Li
![Page 2: Presented by: Xia Li](https://reader033.fdocuments.us/reader033/viewer/2022051020/56815fe9550346895dceee2e/html5/thumbnails/2.jpg)
Short-read mapping software

| Software | Technique | Reference |
|---|---|---|
| GNUMAP | Hashing refs + base quality + repeated regions | Clement et al., 2010 |
| Novoalign | Hashing refs | Novocraft, unpublished |
| SOAP | Hashing refs | Li et al., 2008 |
| SeqMap | Hashing reads | Jiang et al., 2008 |
| RMAP | Hashing reads + read quality | Smith et al., 2008 |
| Eland | Hashing reads | Cox, unpublished |
| Bowtie | BWT | Langmead et al., 2009 |
| Slider | Lexicographically sorting + base quality | Malhis et al., 2009 |
![Page 3: Presented by: Xia Li](https://reader033.fdocuments.us/reader033/viewer/2022051020/56815fe9550346895dceee2e/html5/thumbnails/3.jpg)
SeqMap
• Motivation
  – Hashing the genome usually needs large memory (e.g. SOAP needs 14GB memory when mapping to the human genome)
  – Allow more substitutions and insertion/deletion
![Page 4: Presented by: Xia Li](https://reader033.fdocuments.us/reader033/viewer/2022051020/56815fe9550346895dceee2e/html5/thumbnails/4.jpg)
SeqMap
• Pigeonhole principle
  – Spaced seed alignment
  – ELAND, SOAP, RMAP
• Hash reads
• Insertion/deletion: 2/4 combinations with 1/2 shifted one nucleotide to its left or right

[Figure labels: Short Read -> Split into 4 parts -> All combinations of 2/4 parts -> Short read look up table (indexed by 2 parts); Reference Genome]
Image credit: J. Ruan
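The pigeonhole idea on this slide can be illustrated with a small sketch: if a read has at most 2 substitutions and is split into 4 parts, at least 2 parts must match the genome exactly, so indexing each read under every choice of 2 of its 4 parts is enough to find all such hits. The code below is only an illustration under that assumption, not SeqMap's implementation; the function names (`split_parts`, `build_read_index`, `map_reads`) are hypothetical, all reads are assumed to share one length, and the shifted-part handling of insertions/deletions is not modeled.

```python
from collections import defaultdict
from itertools import combinations

N_PARTS = 4  # each read is split into 4 parts
N_KEEP = 2   # any 2 exactly matching parts form a seed

def split_parts(seq):
    """Split seq into N_PARTS contiguous pieces (the last piece takes any remainder)."""
    size = len(seq) // N_PARTS
    bounds = [i * size for i in range(N_PARTS)] + [len(seq)]
    return [seq[bounds[i]:bounds[i + 1]] for i in range(N_PARTS)]

def build_read_index(reads):
    """Hash every read under each of the 6 ways to choose 2 of its 4 parts."""
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        parts = split_parts(read)
        for combo in combinations(range(N_PARTS), N_KEEP):
            index[(combo, tuple(parts[i] for i in combo))].append(rid)
    return index

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def map_reads(genome, reads, max_mismatches=2):
    """Scan the genome once and return sorted (read id, position, mismatches) hits."""
    read_len = len(reads[0])  # assumption: all reads have the same length
    index = build_read_index(reads)
    hits = set()
    for pos in range(len(genome) - read_len + 1):
        window = genome[pos:pos + read_len]
        parts = split_parts(window)
        for combo in combinations(range(N_PARTS), N_KEEP):
            key = (combo, tuple(parts[i] for i in combo))
            for rid in index.get(key, ()):
                # A seed match is only a candidate; verify the whole window.
                d = hamming(window, reads[rid])
                if d <= max_mismatches:
                    hits.add((rid, pos, d))
    return sorted(hits)

if __name__ == "__main__":
    genome = "ACTGAACCATACGGGTACTGAACCATGAA"
    reads = ["TGGTACTCAACC",   # genome[12:24] with 2 substitutions
             "ACTGAACCATAC"]   # genome[0:12], exact
    print(map_reads(genome, reads))
```

Because the reads are hashed and the genome is only scanned, memory grows with the number of reads rather than with the genome size, which matches the motivation given for SeqMap.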
![Page 5: Presented by: Xia Li](https://reader033.fdocuments.us/reader033/viewer/2022051020/56815fe9550346895dceee2e/html5/thumbnails/5.jpg)
Experiment & Result
![Page 6: Presented by: Xia Li](https://reader033.fdocuments.us/reader033/viewer/2022051020/56815fe9550346895dceee2e/html5/thumbnails/6.jpg)
Experiment & Result
• Deal with more substitutions and insertions/deletions
  – Randomly generate a DNA sequence 1Mb long, then add 100Kb of random substitutions, N’s and insertions/deletions
![Page 7: Presented by: Xia Li](https://reader033.fdocuments.us/reader033/viewer/2022051020/56815fe9550346895dceee2e/html5/thumbnails/7.jpg)
GNUMAP
• Motivation
  – Base uncertainty
    • Such as nearly equal or low probabilities to A, C, G or T
    • Filter low quality reads [RMAP] -> discard up to half of the reads (Harismendy et al., 2009)
  – Repeated regions in the genome
    • Discard them -> loss of up to half of the data (Harismendy et al., 2009)
    • Record one -> unequal mapping to some of the repeat regions
    • Record all -> each location having 3 times the correct score
![Page 8: Presented by: Xia Li](https://reader033.fdocuments.us/reader033/viewer/2022051020/56815fe9550346895dceee2e/html5/thumbnails/8.jpg)
GNUMAP
• Flow-chart
![Page 9: Presented by: Xia Li](https://reader033.fdocuments.us/reader033/viewer/2022051020/56815fe9550346895dceee2e/html5/thumbnails/9.jpg)
Probabilistic Needleman-Wunsch
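As a rough illustration of what a probabilistic Needleman-Wunsch can look like, the sketch below runs a standard global alignment in which each read position is a probability vector over A, C, G, T instead of a single called base, and the substitution score is the expected match/mismatch score under those probabilities. The scoring constants (`match`, `mismatch`, `gap`) and the function names are assumptions made for this example; the scoring used by GNUMAP itself may differ.

```python
BASES = "ACGT"

def one_hot(seq):
    """Represent a confidently called sequence as per-position probability vectors."""
    return [[1.0 if b == base else 0.0 for b in BASES] for base in seq]

def expected_score(probs, ref_base, match=1.0, mismatch=-1.0):
    """Expected substitution score of one read position against one reference base."""
    p = probs[BASES.index(ref_base)]
    return match * p + mismatch * (1.0 - p)

def probabilistic_nw(read_probs, ref, gap=-2.0):
    """Global alignment score of a probabilistic read against a reference string."""
    n, m = len(read_probs), len(ref)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + expected_score(read_probs[i - 1], ref[j - 1])
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[n][m]

if __name__ == "__main__":
    confident = one_hot("ACGT")
    uncertain = one_hot("ACG") + [[0.25, 0.25, 0.25, 0.25]]  # last base unknown
    print(probabilistic_nw(confident, "ACGT"))
    print(probabilistic_nw(uncertain, "ACGT"))
```

With this kind of scoring, a low-confidence base lowers the alignment score smoothly instead of forcing the read to be discarded, which is the behaviour the GNUMAP motivation slide asks for.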
![Page 10: Presented by: Xia Li](https://reader033.fdocuments.us/reader033/viewer/2022051020/56815fe9550346895dceee2e/html5/thumbnails/10.jpg)
Alignment Score
ACTGAACCATACGGGTACTGAACCATGAA
AACCAT
GGGTACAACCATTAC
Read from sequencer
GGGTACAACCAT
Read is added to both repeat regions proportionally to their match quality, weighted by its # of occurrences in the genome
Slide credit: N. Clement
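The proportional allocation described on this slide can be sketched as a normalization of per-location alignment scores, so that a read contributes a total weight of 1 spread across all locations where it matches. This is a simplified illustration: the helper names are hypothetical, the scores are assumed to be non-negative match qualities, and the additional weighting by number of occurrences in the genome mentioned on the slide is not modeled.

```python
def exact_hits(genome, read):
    """All start positions where read occurs exactly in genome."""
    return [i for i in range(len(genome) - len(read) + 1)
            if genome.startswith(read, i)]

def allocate_read(scores):
    """Turn per-location match scores into weights that sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        return {loc: 0.0 for loc in scores}
    return {loc: s / total for loc, s in scores.items()}

if __name__ == "__main__":
    genome = "ACTGAACCATACGGGTACTGAACCATGAA"  # sequence from the slide
    read = "AACCAT"                           # occurs in two repeat regions
    locations = exact_hits(genome, read)
    print(locations)  # [4, 20]
    # Equal match quality at both locations: each receives half of the read.
    print(allocate_read({loc: 1.0 for loc in locations}))
```

Compared with recording the read once ("record one") or adding a full count at every location ("record all"), this keeps the total contribution of each read fixed while still using every candidate location.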
![Page 11: Presented by: Xia Li](https://reader033.fdocuments.us/reader033/viewer/2022051020/56815fe9550346895dceee2e/html5/thumbnails/11.jpg)
Experiment & Result
![Page 12: Presented by: Xia Li](https://reader033.fdocuments.us/reader033/viewer/2022051020/56815fe9550346895dceee2e/html5/thumbnails/12.jpg)
Comments
• SeqMap
  – Pros: deals with more substitutions/insertions/deletions
  – Cons: memory consuming, not fast
• GNUMAP
  – Pros: considers base quality and repeated regions -> generates more useful information and achieves the best performance (~15% increase)
  – Cons: memory consuming, slow, more noise