Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger...

25
Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster Department of Computer Science and Engineering, Washington University in Saint Louis, MO Supported by an NIH STTR Grant & NSF Grants DBI-0237902, ITR-0313203, CCR-0217334

description

Slide # 3Washington University in St. Louis Basic Local Alignment Search Tool Biosequence comparison software –Query sequence (new genome) to large database of known biosequences Look for similar regions Exponential growth of genomic databases –Longer time for searches to complete –Solutions Perform comparison over multiple machines Specialized hardware - Our Approach

Transcript of Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger...

Page 1: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Biosequence Similarity Search on the Mercury System

Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and

Joseph LancasterDepartment of Computer Science and Engineering,

Washington University in Saint Louis, MO

Supported by an NIH STTR Grant &NSF Grants DBI-0237902, ITR-0313203, CCR-0217334

Page 2: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 2

Outline

• Overview of BLAST• Overview of the Mercury system• Description of BLASTN algorithm• Algorithmic changes to BLASTN• Improvement in performance• Related work• Conclusion

Page 3: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 3

Basic Local Alignment Search Tool

• Biosequence comparison software– Query sequence (new genome) to large

database of known biosequences• Look for similar regions

• Exponential growth of genomic databases– Longer time for searches to complete– Solutions

• Perform comparison over multiple machines• Specialized hardware - Our Approach

Page 4: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 4

The Mercury System

Page 5: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 5

The Mercury System

• Proximity to disk– Simple operations performed close to disk

• Avoids CPU use– 400 Mbytes/s throughput from the disk

• Concurrent Independent operation– Does not use processor cache cycles,

memory or I/O buses• Reconfigurable logic

– Logic can be tuned to the particular need of the application

Page 6: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 6

BLASTN

• BLASTN– Both the query and the database are long DNA strings– Consist of {A, C, T, G} and some unknowns

• Each stage processes lesser data• The stages become more computationally

expensive

Page 7: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 7

BLASTN - Terminology

…ACTGTGTTTCACTGACGGGTGT…

…CTGTGTCCCCAACACTGCTGACGTAGAATCGTGTAG…

Query

Database

‘w-mer’ is a sequence of ‘w’ consecutive bases

Page 8: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 8

BLASTN - Pipeline - Stage 1

• Matches each ‘11-mer’ in query to database– Exact string matching

• 83% of overall time is spent in this stage• Filters 92% of data entering this stage

– Only 8% of data proceeds to the next stage

Page 9: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 9

BLASTN - Pipeline - Stage 2

• Extends the matches from stage 1…ACTGTGTTTCACTGACGGGTGT…

…GTGTCCCCAACATTTCACTGACGAGAATCGTGTAG…

Page 10: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 10

BLASTN - Pipeline - Stage 2

• Extends the matches from stage 1– Allows mismatches of individual bases– Does not allow gaps in either the query or the database– Match score should be higher than threshold to proceed

• 16% of pipeline time is spent is this stage• Only 2/100,000 of data entering this stage proceeds to the

next stage

Page 11: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 11

BLASTN - Pipeline - Stage 3

• Extends the matches from stage 2…ACCACTGTTTCACTGACG_GA_T_GT…

…CTGTGTCCCCAC_GTTTCACTGACGAGAATCGTGTAG…

Page 12: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 12

BLASTN - Pipeline - Stage 3

• Extends the matches from stage 2– Scores matches with Gaps inserted in both the

sequences– Smith-Waterman dynamic programming algorithm

• <1% of pipeline time is spent is this stage

Page 13: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 13

NCBI - BLASTN

• Stage 1 (word matching) is implemented as a lookup table– Efficient only for certain word lengths (w=

11)• Performance degrades dramatically for

larger query sizesQuery Size

10 Kbases

100 Kbases 1 Mbases Units

Throughput 66.9 8.76 0.657 Mbases/s

Pentium-4 2.6GHz 1Gbyte RAM

Page 14: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 14

Firmware implementation - Stage 1

BloomFilters

HashLookup

RedundancyEliminator

Eliminates false-positivesfrom Bloom filters,

obtain offset in query

Discards matches that are close to one another

Matches ‘11-mers’ to query, but generates false-positives

Page 15: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 15

Bloom filters operation

‘11-mer’

KHash

Functions

Programming the query into the bloom filter (processing query)

‘m-bit’ vector

1

1

1

query

Page 16: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 16

Bloom filters operation

database KHash

Functions

Finding matches in the database

‘m-bit’ vector

?

?

?

1: Potential match

0: Not a match

‘11-mer’

Page 17: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 17

Bloom filters operation

KHash

Functions

Finding matches in the database

‘m-bit’ vector

?

?

?

1*: Potential match

0: Not a match

* False positives are eliminated using a hash table

database‘11-mer’

Page 18: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 18

0.0001

0.001

0.01

0.1

1

50 250 500 750 950Memory (KB)

False Positive Rate

50K Bases100K Bases200K Bases500K Bases1M Bases

Bloom filter performance

Page 19: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 19

Performance analysis

Firmware Vs. Software Stage 1

Query Size

10 Kbases

100 Kbases

1 Mbases

Units

NCBI BLASTN 77.4 10.5 0.771 Mbases/s

Mercury BLASTN 800 800 80 Mbases/s

Speedup 10.3 76.1 104

Page 20: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 20

Overall system throughput

Query Size 10 Kbases 100

Kbases 1 Mbases Units

NCBI BLASTN 67.0 8.76 0.657 Mbases/s

Mercury BLASTN 497 52.6 4.47 Mbases/s

Speedup 7.42 6.01 6.80

Tputoverall = min (Tput1, Tput(2&3))

Page 21: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 21

Stage 2 in firmware - Throughput

1

10

100

1000

5 10 15 20 25Stage 2 speedup

Overall system throughput

(Mbases/sec)

10 Kbase query100 Kbase query1 Mbase query

A

Page 22: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 22

Stage 2 in firmware - Speedup

0

20

40

60

80

100

120

140

5 10 15 20 25Stage 2 speedup

Overall system speedup

10 Kbase query100 Kbase query1 Mbase query

B

Page 23: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 23

Related work• Hardware based commercial systems

– Paracel GeneMatcherTM, used ASIC, and hence is inflexible

– RDisk, FPGA based system with throughput of 60 Mbases/s for stage 1

• High-end commercial system– Paracel BLASTMachine2TM, 32 CPU linux cluster

• 2.93 Mbases/s for 2.8 Mbase query• 2 times faster than 1-node Mercury BLASTN

– TimeLogic DeCypherBLASTTM, FPGA based• 213 Kbases/s for a 16 Mbase query• Comparable to 1-node Mercury BLASTN

Page 24: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 24

Conclusion

• BLASTN on the Mercury system– Bloom filters to improve performance of stage 1

• Efficient hash functions in hardware

– 7x improvement in speed with only stage 1 firmware

– >50x speedup with stage 2 implemented in firmware

• Future work– Algorithmic changes to stage 2

• Efficient use of hardware capabilities

– Other apps• BLASTP, BLASTX etc.

Page 25: Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Washington University in St. Louis Slide # 25

Thank you