BLAST: Basic Local Alignment Search Tool
BLAST
Finds regions of local similarity between sequences.
Compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of the matches.
Can be used to:
  infer functional and evolutionary relationships between sequences
  identify members of gene families
BLAST
Most users run BLAST by submitting a sequence as a query against a specified public sequence database.
The query is sent over the internet, the search is performed on the NCBI databases and servers, and the results are posted back to the user's browser.
Standalone BLAST
Often utilized by biotech companies, genome scientists and bioinformatics personnel:
  to query their own local databases
  to customize BLAST to their own specific needs
Comes in two forms:
  Executables that can be run from the command line
  Standalone WWW BLAST Server (allows users to set up their own in-house versions of the BLAST Web pages)
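As a rough illustration of the command-line form, the sketch below assembles a blastp invocation in Python. It assumes the NCBI BLAST+ executables are installed and on the PATH; the file names and the database name `my_proteins` are hypothetical placeholders, and the local database is assumed to have been built beforehand (for example with `makeblastdb -in proteins.fasta -dbtype prot -out my_proteins`).

```python
import shlex


def build_blastp_command(query_path, db_name, out_path, evalue=10.0):
    """Return the argument list for a command-line blastp search.

    Assumes the BLAST+ 'blastp' executable; all paths are placeholders.
    """
    return [
        "blastp",
        "-query", query_path,     # FASTA file holding the query sequence(s)
        "-db", db_name,           # name of a local protein database
        "-out", out_path,         # where the report is written
        "-evalue", str(evalue),   # expectation value threshold
        "-outfmt", "6",           # tabular output, one hit per line
    ]


cmd = build_blastp_command("query.fasta", "my_proteins", "hits.tsv", evalue=1e-5)
print(shlex.join(cmd))
# To actually run the search (requires BLAST+ and the database to exist):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when file names contain spaces.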
BLAST variations
DNA query to a DNA database (blastn)
Protein query to a protein database (blastp)
DNA query translated in all 6 reading frames to a protein sequence database (blastx)
Other adaptations:
  PSI-BLAST (iterative protein sequence similarity searches using a position-specific score matrix)
  RPS-BLAST (searching for protein domains in the Conserved Domains Database)
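To make "translated in all 6 reading frames" concrete, here is a minimal Python sketch of six-frame translation using the standard genetic code. It only illustrates the idea and is not how BLAST itself is implemented; the function names are made up for this example, and any codon containing a character other than A, C, G or T is reported as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCC")
for label, protein in frames.items():
    print(label, protein)
```

Each of the six translations is then compared to the protein database as if it were a protein query.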
Using BLAST: Choosing the BLAST Program
Using BLAST: Entering the Query Sequence
After choosing the search we want to perform, we next need to enter the query sequence.
Our example query protein:
>gi|4503323|ref|NP_000782.1| dihydrofolate reductase [Homo sapiens]
MVGSLNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQNLVIMGKKTWFSIPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHPGHLKLFVTRIMQDFESDTFFPEIDLEKYKLLPEYPGVLSDVQEEKGIKYKFEVYEKND
Using BLAST: Choosing the Database to Search
Using BLAST: Choosing the Search Parameters
BLAST Output: The Report Header
BLAST Output: Graphical Overview
BLAST Output: Another Example of the Graphical Overview
BLAST Output: Report Descriptions
BLAST Output: Pairwise Sequence Alignment(s)
Most of the information contained in this presentation can be obtained through the following links:
http://www.ncbi.nlm.nih.gov/books/NBK21097/
http://etutorials.org/Misc/blast/Part+I+Introduction/Chapter+1.+Hello+BLAST/1.2+Using+NCBI-BLAST/
Other Helpful Links
BLAST: http://blast.ncbi.nlm.nih.gov/Blast.cgi
NCBI: http://www.ncbi.nlm.nih.gov/
VMD Tutorial: http://www.ks.uiuc.edu/Training/Tutorials/vmd/tutorial-html/vmd-tutorial-2009.html
VMD User's Guide: http://www.ks.uiuc.edu/Research/vmd/current/ug/ug.html
MEGA: http://www.megasoftware.net/
CLUSTAL: http://www.genome.jp/tools/clustalw/
Without explaining all the options just yet, let's get started with an example where we perform a protein-protein BLAST search (blastp) which compares a given protein query sequence to a database of protein sequences.
We can enter a query sequence in various ways:
1. FASTA format (as is our example sequence)
2. Bare sequence (lines of sequence data without a FASTA definition line)
3. Identifiers (accession, accession.version, or gi's)
4. Upload query file (users can upload a text file containing a single sequence or a list of sequences in the same formats described previously). This option may depend on the BLAST program being used.
Although certain conventions are required with regard to input, the format of the input is determined automatically once the query has been entered.
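The difference between FASTA input and a bare sequence can be sketched with a small parser. This is an illustrative Python example, not the logic the BLAST web page actually uses; the function name and the toy records are hypothetical. A line starting with ">" is treated as a definition line, and sequence lines that have no definition line are returned with a header of None.

```python
def parse_fasta(text):
    """Split FASTA or bare-sequence text into (header, sequence) tuples.

    The header is None for sequence lines not preceded by a '>' line.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new definition line closes the record collected so far.
            if header is not None or chunks:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None or chunks:
        records.append((header, "".join(chunks)))
    return records


fasta_text = ">seq1 toy protein\nMVGS\nLNCI\n>seq2\nAAA\n"
print(parse_fasta(fasta_text))
print(parse_fasta("MVGS\nLNCI"))  # bare sequence, no definition line
```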
Another option is the query subrange, which allows one to enter coordinates for a subrange of the query sequence (from 1 to the sequence length).
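As a small sketch of how such coordinates map onto a sequence, the Python function below assumes the subrange is 1-based and inclusive at both ends, which is consistent with the "from 1 to the sequence length" wording. The function name is hypothetical.

```python
def query_subrange(sequence, start, stop):
    """Return residues start..stop of sequence (1-based, inclusive)."""
    if not 1 <= start <= stop <= len(sequence):
        raise ValueError(
            f"subrange {start}-{stop} is outside 1-{len(sequence)}"
        )
    # Convert the 1-based inclusive range to a 0-based half-open slice.
    return sequence[start - 1:stop]


# Residues 3 to 6 of the first ten residues of the example query.
print(query_subrange("MVGSLNCIVA", 3, 6))
```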
? - One can always refer to this for further explanation.
For this example we'll leave the default database as nr. Because of its comprehensive nature, nr is usually a good first choice when trying to identify a novel sequence or when determining whether related sequences have been described previously. The database is curated by the NCBI and consists of protein sequences from all of GenBank, PDB, SwissProt, PIR, and PRF, excluding environmental samples from WGS projects.
As of 2012/10/15 the protein nr database contained 21,062,489 sequences.
For this test case, just use the default parameters.
And click BLAST.
The top line gives information about the type of program (in this case, BLASTP), the version (2.2.1), and a version release date. The research paper that describes BLAST is then cited, followed by the request ID (issued by QBLAST), the query sequence definition line, and a summary of the database searched. The Taxonomy reports link displays this BLAST result on the basis of information in the Taxonomy database (Chapter 4).
The graphical display offers an overview of the database hits as they align to the query. At the top of the display, you can see that 100 BLAST hits passed the threshold of our search criteria. After the color key, the top line represents the query sequence as a solid red line with the sequence coordinates. Each line below represents one subject match, showing its position in relation to the query and the color-coded relative strength of the similarity. You can move your mouse over each line to see the definition line, and if you click on it, you will be taken to the actual alignment. In this example, all the database matches shown are high-scoring (as indicated by red).
In this case, there are three high-scoring database matches that align to most of the query sequence. The next twelve bars represent lower-scoring matches that align to two regions of the query, from about residues 3-60 and residues 220-500. The cross-hatched parts of these bars indicate that the two regions of similarity are on the same protein, but that the intervening region does not match. The remaining bars show lower-scoring alignments.
The report lists the one-line descriptions of the database matches and contains information about: (a) the accession number, gi number or database designation; (b) a brief textual description of the sequence (organism from which the sequence was derived, type of sequence, information about function or phenotype), which is usually truncated; (c) the scores.
The hits are listed from best to worst, with high scores and low E-values being better. The E-value provides an estimate of statistical significance. Note: a match with an E-value of 0.5 or above may have arisen by chance alone.
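The link between score and E-value can be sketched with the usual Karlin-Altschul relations: a raw score S is converted to a bit score S' = (lambda * S - ln K) / ln 2, and the expected number of chance hits is E = m * n * 2^(-S'), where m is the query length and n is the total database length. The Python sketch below is only an approximation of what BLAST reports: the function names are made up here, lambda and K depend on the scoring system and are left as parameters, and BLAST itself uses edge-corrected "effective" lengths, so reported values will differ somewhat.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score.

    lam and k are the Karlin-Altschul parameters of the scoring system.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


def expected_hits(bits, query_length, database_length):
    """Estimate the E-value from a bit score and the search-space size.

    Uses plain lengths; BLAST applies effective-length corrections.
    """
    return query_length * database_length * 2.0 ** (-bits)


# Each additional bit of score halves the number of hits expected by chance.
for bits in (20, 30, 40):
    print(bits, expected_hits(bits, query_length=187, database_length=10**9))
```

Because E scales with the size of the search space, the same alignment is less significant in a larger database.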
Also included in this part are links to other NCBI curated databases with more information about each hit.
The alignment is preceded by the sequence identifier, the full definition line, and the length of the matched sequence in amino acids. Next comes the bit score (the raw score is in parentheses) and then the E-value. The following line contains information on the number of identical residues in this alignment (Identities), the number of conservative substitutions (Positives), and, if applicable, the number of gaps in the alignment. Finally, the actual alignment is shown, with the query on top and the database match, labeled Sbjct, below. The numbers at left and right refer to the position in the amino acid sequence. One or more dashes (-) within a sequence indicate insertions or deletions. Amino acid residues in the query sequence that have been masked because of low complexity are replaced by Xs (see, for example, the fourth and last blocks).
The line between the two sequences indicates the similarities between the sequences. If the query and the subject have the same amino acid at a given location, the residue itself is shown. Conservative substitutions, as judged by the substitution matrix, are indicated with +.
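The way the middle line and the Identities, Positives and Gaps counts relate to the two aligned sequences can be sketched as follows. This is an illustrative Python example, not BLAST's own code. In place of a full substitution matrix it uses a small hand-picked set of residue pairs that score positively in BLOSUM62 (an assumption made to keep the example short), and it counts identical residues toward Positives, as the BLAST report does.

```python
# Illustrative subset of residue pairs with positive BLOSUM62 scores.
POSITIVE_PAIRS = {frozenset(p) for p in ("IV", "IL", "KR", "DE", "ST", "FY")}


def summarize_alignment(query, sbjct, positive_pairs=POSITIVE_PAIRS):
    """Build the midline and count identities, positives and gaps.

    query and sbjct are aligned strings of equal length; '-' marks a gap.
    """
    if len(query) != len(sbjct):
        raise ValueError("aligned sequences must have the same length")
    midline = []
    identities = positives = gaps = 0
    for q, s in zip(query, sbjct):
        if q == "-" or s == "-":
            gaps += 1
            midline.append(" ")
        elif q == s:
            identities += 1
            positives += 1          # identical residues also count as positives
            midline.append(q)
        elif frozenset((q, s)) in positive_pairs:
            positives += 1
            midline.append("+")     # conservative substitution
        else:
            midline.append(" ")
    return {
        "midline": "".join(midline),
        "identities": identities,
        "positives": positives,
        "gaps": gaps,
        "length": len(query),
    }


summary = summarize_alignment("MKVL-A", "MRILGA")
print("Query  MKVL-A")
print("       " + summary["midline"])
print("Sbjct  MRILGA")
print(summary)
```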