SOAP 2.0 - Speed up and with scoring system

SOAP 2.0 - Speed up and with scoring system, BGI, 2008-05-27



Page 1: SOAP 2.0 - Speed up and with scoring system

SOAP 2.0 - Speed up and with scoring system

BGI

2008-05-27

Page 2: SOAP 2.0 - Speed up and with scoring system

Indexing reference genome

• SOAP 2 is designed specifically for longer (>36 bp) reads. Only 3 index tables are needed to find all 2-mismatch hits;

• BWT-based compressed suffix array; ~7 GB of RAM for the human genome;

• The reference genome is loaded into RAM once, which significantly reduces I/O;

• Using reads as queries facilitates multi-threaded parallel computation, which suits multi-core CPUs well;

• Supports reads of varying length within a single file.
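
To make the BWT-based index concrete, here is a minimal Python sketch of exact matching by backward search over a Burrows-Wheeler transform. It is a toy illustration under stated assumptions, not SOAP 2's implementation: the function names (`build_index`, `exact_hits`) are hypothetical, the suffix array is built naively and kept uncompressed, and mismatches are not handled.

```python
def suffix_array(text):
    # Naive construction by sorting suffixes; acceptable for a toy example only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_index(reference):
    """Build a toy BWT index (hypothetical helper, not SOAP 2's data layout)."""
    text = reference + "$"  # "$" sorts before A, C, G, T
    sa = suffix_array(text)
    # BWT: the character preceding each sorted suffix (wraps to "$" for i == 0).
    bwt = "".join(text[i - 1] for i in sa)

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    # c_table[ch]: number of characters in text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    # occ[ch][i]: number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in counts:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c_table, occ


def exact_hits(read, index):
    """Return sorted 0-based reference positions where read matches exactly."""
    sa, c_table, occ = index
    lo, hi = 0, len(sa)  # half-open interval of suffix-array rows
    for ch in reversed(read):  # backward search: last base first
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])
```

Because the index is built once and each read is an independent query against it, a lookup such as `exact_hits("ACG", build_index("ACGTACGTTAGC"))` touches only in-memory tables, which is the property the bullets above rely on for low I/O and easy parallelism.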

Page 3: SOAP 2.0 - Speed up and with scoring system

Alignment strategy

• “XOR” + lookup table;

• <20 min to align 1M reads to the human genome; 4 h for 1X data against a 3 Gb reference on an 8-core node; even faster for paired-end read mapping;

• Allows more mismatches at the 3’ end of reads;

• Falls back to gapped alignment (by enumeration) if no ungapped hit exists;

• Can report all hits if necessary.
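
The “XOR” + lookup table idea can be sketched as follows. The slide does not give details, so the 2-bit base encoding, the byte-wide table, and the names `pack` and `count_mismatches` are all assumptions made for illustration: two packed sequences are XORed, and every non-zero 2-bit field in the result marks a mismatch, which a precomputed table counts four bases at a time.

```python
# Assumed 2-bit encoding; the slide does not specify one.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(seq):
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


# For each byte (four 2-bit fields), the number of non-zero fields.
MISMATCHES_PER_BYTE = [
    sum(1 for shift in (0, 2, 4, 6) if (byte >> shift) & 3)
    for byte in range(256)
]


def count_mismatches(read, window):
    """Count mismatching bases between two equal-length sequences."""
    if len(read) != len(window):
        raise ValueError("sequences must have the same length")
    diff = pack(read) ^ pack(window)  # zero fields where bases agree
    total = 0
    while diff:
        total += MISMATCHES_PER_BYTE[diff & 0xFF]  # table lookup per byte
        diff >>= 8
    return total
```

For example, `count_mismatches("ACGT", "ACGA")` returns 1. The appeal of this design is that the comparison costs one XOR plus a few table lookups per read, regardless of where the mismatches fall.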

Page 4: SOAP 2.0 - Speed up and with scoring system

Scoring system

Two methods are being tried:

1. Heng Li’s method, as implemented in Maq;

2. A method similar in principle:

– Set a quality cutoff (Q10?); mismatches at low-quality bases are not counted;

– If there are multiple equally good best hits, treat them as repeat hits;

– If there is one best hit and multiple second-best hits, P = 1/(1 + a × N_second), where N_second is the number of second-best hits (those with one more mismatch than the best hit) and a is the estimated average error probability (a = 0.01?).
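
The second method can be sketched in a few lines of Python. The function names are hypothetical, the defaults (Q10, a = 0.01) are the tentative values from the slide, and treating a base with quality exactly at the cutoff as countable is an assumption.

```python
def count_quality_mismatches(read, ref, quals, cutoff=10):
    """Count mismatches, ignoring bases whose quality is below the cutoff.

    quals holds one Phred score per read base. Counting bases with quality
    equal to the cutoff is an assumption; the slide does not say.
    """
    return sum(
        1 for r, g, q in zip(read, ref, quals) if r != g and q >= cutoff
    )


def best_hit_probability(n_best, n_second, a=0.01):
    """Return P = 1 / (1 + a * N_second) for a unique best hit.

    Returns None when several hits tie for best, i.e. a repeat hit.
    """
    if n_best < 1:
        raise ValueError("at least one best hit is required")
    if n_best > 1:
        return None  # multiple equal best hits: repeat
    return 1.0 / (1.0 + a * n_second)
```

With a = 0.01, a unique best hit with no second-best hits gets P = 1, while one competing against 100 second-best hits gets P = 1/(1 + 1) = 0.5.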

Page 5: SOAP 2.0 - Speed up and with scoring system

Input & Output

• Input

– Text (.fa, .fq)

– gzipped

• Output

– SOAP

– .glz (GLF)

– gzipped

– binary

– ACE

– Others as necessary
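
As an illustration of the input side, here is a minimal Python sketch that reads FASTA or FASTQ text, plain or gzipped. It is not SOAP 2's reader: the function names are hypothetical, the format is guessed from the file extension (`.fq` for FASTQ, anything else as FASTA, with an optional `.gz` suffix), and FASTQ records are assumed to be exactly four lines each.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fasta(handle):
    """Yield (name, sequence, None) for each FASTA record."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks), None
            name, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    if name is not None:
        yield name, "".join(chunks), None


def read_fastq(handle):
    """Yield (name, sequence, quality) assuming four-line FASTQ records."""
    while True:
        header = handle.readline().strip()
        if not header:
            return  # end of file (or a blank line) stops parsing
        seq = handle.readline().strip()
        handle.readline()  # the "+" separator line
        qual = handle.readline().strip()
        yield header[1:], seq, qual


def read_sequences(path):
    """Dispatch on extension: .fq or .fq.gz is FASTQ, otherwise FASTA."""
    base = path[:-3] if path.endswith(".gz") else path
    parser = read_fastq if base.endswith(".fq") else read_fasta
    with open_text(path) as handle:
        yield from parser(handle)
```

Records are yielded one at a time, so reads of different lengths in the same file need no special handling.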