
Cluster Computing Applications Project: Parallelizing BLAST

The field of Bioinformatics needs faster string-matching algorithms.

What Exactly is BLAST?

BLAST (Basic Local Alignment Search Tool) is a heuristic algorithm that efficiently finds matches between query strings and a target database of strings.

Abstract

Parallelizing the BLAST Algorithm: Feasible or Not?

Bioinformatics research, especially the coding and classification of genes, needs fast string-matching algorithms. At Oak Ridge National Laboratory (ORNL), in the Computer Science and Mathematics Division, High Performance Cluster (HPC) computing has been applied to many areas, from Computational Biology to Computational Materials Science. The purpose of this project is to study the Basic Local Alignment Search Tool (BLAST) algorithm: to define its structure, to state why it is valuable as a Bioinformatics database tool, and to explore ways of increasing its effectiveness and speed. BLAST is used in Bioinformatics to find alignments between strings. It is a heuristic algorithm that first looks for matches between fragments of a query string and the strings in a target database. This eliminates much of the database without a full letter-by-letter comparison against the search string. Where a fragment matches a database string within a certain threshold, the full strings are then aligned. Several methods of parallelizing BLAST have been explored, and this paper summarizes them. It concludes with a number of potential methods for increasing the speed and effectiveness of BLAST.
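The fragment-matching idea in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NCBI BLAST implementation: fragments are exact matches of a fixed length, extension stops at the first mismatch, and the names `index_words`, `extend`, `search`, `WORD_LEN`, and `MIN_LEN` are hypothetical. Real BLAST scores alignments with substitution matrices rather than requiring exact letter agreement.

```python
from collections import defaultdict

WORD_LEN = 3   # assumed fragment length (illustrative value)
MIN_LEN = 5    # assumed threshold: shortest extended match worth reporting


def index_words(seq, k=WORD_LEN):
    """Map each length-k fragment of seq to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def extend(query, target, qi, ti, k=WORD_LEN):
    """Grow an exact fragment match left and right while the letters agree."""
    left = 0
    while (qi - left > 0 and ti - left > 0
           and query[qi - left - 1] == target[ti - left - 1]):
        left += 1
    right = k
    while (qi + right < len(query) and ti + right < len(target)
           and query[qi + right] == target[ti + right]):
        right += 1
    # (query start, target start, length of the extended match)
    return qi - left, ti - left, left + right


def search(query, database, k=WORD_LEN, min_len=MIN_LEN):
    """Return sorted (name, query_start, target_start, length) matches."""
    words = index_words(query, k)
    hits = set()
    for name, target in database.items():
        for ti in range(len(target) - k + 1):
            # Only positions that share a fragment with the query are
            # extended; everything else in the database is skipped.
            for qi in words.get(target[ti:ti + k], ()):
                q0, t0, length = extend(query, target, qi, ti, k)
                if length >= min_len:
                    hits.add((name, q0, t0, length))
    return sorted(hits)
```

Calling `search("ACGTACGT", {"s1": "TTACGTACGG", "s2": "GGGGGGGG"})` reports matches only in `s1`; `s2` shares no fragment with the query, so it is never compared letter by letter.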


This research was performed under the Research Alliance for Minorities Program administered through the Computer Science and Mathematics Division, Oak Ridge National Laboratory. This Program is sponsored by the Mathematical, Information, and Computational Sciences Division; Office of Advanced Scientific Computing Research; U.S. Department of Energy. Oak Ridge National Laboratory is managed by UT-Battelle, LLC, for the U.S. Department of Energy under contract DE-AC05-00OR22725. This research used resources of the Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science, U.S. Department of Energy. This work has been authored by a contractor of the U.S. Government under contract DE-AC05-00OR22725. Accordingly, the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.

I would like to extend my thanks to Stephen L. Scott, Ph.D., John Mugler, Thomas Naughton, and Brian Luethke for their invaluable mentoring; to Michaelangelo Salcedo, Ph.D., for his guidance; and to Debbie McCoy and Cheryl Hamby for their support in the RAM program.

• This project began with learning cluster-computing infrastructure. My training included tools developed at Oak Ridge National Laboratory: the Open Source Cluster Application Resources (OSCAR) tool and the Cluster Command and Control (C3) tool. OSCAR, a robust and user-friendly application, is used to install clusters. C3, a suite of cluster tools, is used to administer them.

• Once you have a cluster, what do you apply it to? The second half of the project answers this question: it investigates a Bioinformatics application called BLAST and explores known parallelization schemes for cluster computing.
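One commonly described way to parallelize a database search is database segmentation: the database is split into pieces, each piece is searched independently, and the partial results are merged. The sketch below illustrates that scheme only and makes no claim about which scheme this project selected. It assumes threads standing in for cluster nodes and a toy substring test standing in for a real BLAST search; the names `partition`, `search_partition`, and `parallel_search` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(database, n_parts):
    """Split a {name: sequence} database into n_parts roughly equal pieces."""
    items = sorted(database.items())
    return [dict(items[i::n_parts]) for i in range(n_parts)]


def search_partition(query, part):
    """Toy stand-in for a BLAST search: names of sequences containing query."""
    return [name for name, seq in part.items() if query in seq]


def parallel_search(query, database, n_workers=2):
    """Search every partition concurrently, then merge the partial results."""
    parts = partition(database, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each worker sees only its own slice of the database.
        partial = list(pool.map(lambda p: search_partition(query, p), parts))
    return sorted(name for hits in partial for name in hits)
```

On a cluster, each partition would instead be handed to a separate node, for example through a message-passing library, with one node collecting and merging the results.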


Introduction

Infrastructure Overview

• Red Hat Linux 7.2

• OSCAR 1.3
  – C3 - http://www.csm.ornl.gov/torc/C3/
  – LAM/MPI - http://www.lam-mpi.org/
  – Maui Scheduler - http://supercluster.org/maui/
  – MPICH - http://www-unix.mcs.anl.gov/mpi/mpich/
  – OpenSSH - http://www.openssh.com/
  – OpenSSL - http://www.openssl.org/
  – PBS - http://www.openpbs.org/
  – PVM - http://www.csm.ornl.gov/pvm/
  – System Installation Suite (SIS) - http://www.sisuite.org/

Applications Overview

• BLAST, a Bioinformatics tool.
  – BLAST - http://www.ncbi.nlm.nih.gov/BLAST/blast_overview.html

• Parallelize BLAST’s algorithm.

William Burke, York College, City University of New York

Stephen L. Scott & John Mugler, Oak Ridge National Laboratory

Research Alliance for Minorities (RAM), Computer Science and Mathematics Division: Poster Session 2002

[Logos: C3 (Cluster Command & Control); OSCAR (Open Source Cluster Application Resources); eXtreme TORC; TORC; HighTORC]