SOAP 2.0 - Speed up and with scoring system
BGI
2008-05-27
Indexing reference genome
• SOAP 2 is designed specifically for longer (>36 bp) reads. Only 3 index tables are needed to find all 2-mismatch hits;
• BWT-based compressed suffix array, ~7 GB of RAM for the human genome;
• The reference genome is loaded into RAM only once, which significantly reduces I/O;
• Using reads as queries makes multi-threaded parallel computation straightforward, which suits multi-core CPUs well;
• Supports reads of varying lengths within one file;
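To make the BWT-based index concrete, here is a minimal sketch of an FM-index with exact backward search. It is a toy under stated assumptions, not SOAP 2's implementation: the suffix array is built naively, nothing is compressed, mismatches are not handled, and the names `build_fm_index` and `find_exact` are hypothetical.

```python
from collections import Counter


def build_fm_index(reference):
    """Build a toy FM-index (suffix array, C table, Occ table) for `reference`."""
    text = reference + "$"  # '$' is a sentinel that sorts before A, C, G, T
    # Naive suffix array: fine for a demo, far too slow for a real genome.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (index -1 wraps to '$').
    bwt = "".join(text[i - 1] for i in suffix_array)

    # first[c] = number of characters in the text strictly smaller than c.
    counts = Counter(text)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] for c in counts}
    for ch in bwt:
        for c in occ:
            occ[c].append(occ[c][-1] + (ch == c))
    return suffix_array, first, occ


def find_exact(index, read):
    """Return all 0-based positions where `read` occurs exactly."""
    suffix_array, first, occ = index
    lo, hi = 0, len(suffix_array)  # half-open interval of matching suffixes
    for ch in reversed(read):      # backward search: last base first
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[i] for i in range(lo, hi))


index = build_fm_index("ACGTACGTTACG")  # built once, then reused for every read
print(find_exact(index, "ACG"))
```

The index is built once and shared by every query, which is why reads can be dispatched to separate threads without touching the reference again.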
Alignment strategy
• “XOR” + lookup table;
• <20 min to align 1M reads to the human genome; 4 h for 1X data against the 3 Gb reference on an 8-core node; even faster for paired-end read mapping;
• Allows more mismatches at the 3' end of reads;
• Falls back to gapped alignment (by enumeration) if no ungapped hit exists;
• Can report all hits if necessary.
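The “XOR” + lookup table idea can be sketched as follows. The slide does not give details, so the 2-bit base encoding, the byte-wide table, and the function names are all assumptions made for illustration.

```python
# Assumed 2-bit encoding; any encoding with distinct codes per base works.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(seq):
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


# MISMATCHES_IN_BYTE[b] = number of non-zero 2-bit groups in byte b,
# i.e. how many of the 4 bases covered by that byte differ.
MISMATCHES_IN_BYTE = [
    sum(1 for shift in (0, 2, 4, 6) if (b >> shift) & 3) for b in range(256)
]


def count_mismatches(read, window):
    """Count mismatching bases between two equal-length sequences."""
    if len(read) != len(window):
        raise ValueError("sequences must have the same length")
    diff = pack(read) ^ pack(window)  # non-zero 2-bit groups mark mismatches
    total = 0
    while diff:
        total += MISMATCHES_IN_BYTE[diff & 0xFF]  # one table lookup per 4 bases
        diff >>= 8
    return total


print(count_mismatches("ACGTACGT", "ACCTACGA"))
```

XOR compares many bases in one machine operation, and the table replaces a per-base loop with one lookup per byte of the result.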
Scoring system
Two methods are being tried:
1. Heng Li's method, as implemented in Maq;
2. A method similar in principle:
– Set a quality cutoff (Q10?) and do not count mismatches at low-quality bases;
– If there are multiple equally good best hits, treat them as repeat hits;
– If there is one best hit and multiple second-best hits, P = 1/(1 + a·N_second), where N_second is the number of second-best hits with one more mismatch and a is the estimated average error probability (a = 0.01?).
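The second method can be sketched as below. The cutoff of Q10 and a = 0.01 are the tentative values from the slide; the function names, the return shape, and the choice to report no probability for repeat hits are assumptions for illustration.

```python
def effective_mismatches(read, reference, qualities, cutoff=10):
    """Count mismatches, ignoring bases whose quality is below `cutoff`."""
    return sum(
        1
        for r, g, q in zip(read, reference, qualities)
        if r != g and q >= cutoff
    )


def score_hits(mismatch_counts, a=0.01):
    """Classify a read from the effective mismatch count of each of its hits.

    Returns ("repeat", None) when several hits tie for best, otherwise
    ("unique", P) with P = 1 / (1 + a * N_second).
    """
    if not mismatch_counts:
        raise ValueError("at least one hit is required")
    best = min(mismatch_counts)
    if mismatch_counts.count(best) > 1:
        return "repeat", None
    # Second-best hits: exactly one more mismatch than the best hit.
    n_second = mismatch_counts.count(best + 1)
    return "unique", 1.0 / (1.0 + a * n_second)


# One perfect hit plus two hits with one mismatch each.
print(score_hits([0, 1, 1, 2]))
```

With no second-best hits P is 1, and each additional second-best hit lowers it slightly, so a read in a near-repeat region gets less confidence than a read with a clearly unique placement.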
Input & Output
• Input
– Text (.fa, .fq)
– gzipped
• Output
– SOAP
– .glz (GLF)
– gzipped
– binary
– ACE
– Others as necessary
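As an illustration of the input side, the following sketch reads plain or gzipped FASTA/FASTQ text. It is not SOAP 2's reader: it loads the whole file into memory, assumes 4-line FASTQ records, detects gzip by a `.gz` suffix, and the function names are hypothetical.

```python
import gzip


def open_text(path):
    """Open a plain or gzipped text file for reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def read_sequences(path):
    """Return a list of (name, sequence, quality_or_None) records."""
    with open_text(path) as handle:
        lines = [line.rstrip("\r\n") for line in handle if line.strip()]
    if not lines:
        return []
    if lines[0].startswith(">"):
        return _parse_fasta(lines)
    if lines[0].startswith("@"):
        return _parse_fastq(lines)
    raise ValueError("unrecognised format: expected FASTA or FASTQ")


def _parse_fasta(lines):
    records, name, chunks = [], None, []
    for line in lines:
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks), None))
            name, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    records.append((name, "".join(chunks), None))
    return records


def _parse_fastq(lines):
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must have 4 lines per record")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        records.append((header[1:], seq, qual))
    return records
```

Each record carries its own sequence, so reads of different lengths in the same file need no special handling.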