Blast Introduction
-
Upload
keri-gobin -
Category
Documents
-
view
28 -
download
3
description
Transcript of Blast Introduction
![Page 1: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/1.jpg)
Introduction to BLAST
David FristromBibliographer/LibrarianScience and Engineering [email protected] 358-4124
![Page 2: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/2.jpg)
What is BLAST?
Free, online service from National Center for Biotechnology Information (NCBI)
http://blast.ncbi.nlm.nih.gov/Blast.cgi
![Page 3: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/3.jpg)
What is BLAST?
as
Google : Internet
Nucleotide/Protein Sequence DatabasesBLAST :
![Page 4: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/4.jpg)
Some Uses for BLAST
• Identify an unknown sequence
• Build a homology tree for a protein
• Get clues about protein structure by finding similar proteins with known structures
• Map a sequence in a genome
• Etc., etc.
![Page 5: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/5.jpg)
What is BLAST?
Basic Local Alignment Search Tool
![Page 6: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/6.jpg)
Alignment
AACGTTTCCAGTCCAAATAGCTAGGC===--=== =-===-==-====== AACCGTTC TACAATTACCTAGGC
Hits(+1): 18Misses (-2): 5Gaps (existence -2, extension -1): 1 Length: 3Score = 18 * 1 + 5 * (-2) – 2 – 2 = 6
![Page 7: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/7.jpg)
Global Alignment
• Compares total length of two sequences
• Needleman, S.B. and Wunsch, C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 48(3):443-53(1970).
![Page 8: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/8.jpg)
Local Alignment
• Compares segments of sequences
• Finds cases when one sequence is a part of another sequence, or they only match in parts.
• Smith, T.F. and Waterman, M.S. Identification of common molecular subsequences. J Mol Biol. 147(1):195-7 (1981)
![Page 9: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/9.jpg)
Search Tool
• By aligning query sequence against all sequences in a database, alignment can be used to search database for similar sequences
• But alignment algorithms are slow
![Page 10: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/10.jpg)
What is BLAST?
• Quick, heuristic alignment algorithm• Divides query sequence into short words,
and initially only looks for (exact) matches of these words, then tries extending alignment.
• Much faster, but can miss some alignments
• Altschul, S.F. et al. Basic local alignment search tool. J Mol Biol. 215(3):403-10(1990).
![Page 11: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/11.jpg)
What is BLAST?
• BLAST is not Google
• BLAST is like doing an experiment: to get good, meaningful results, you need to optimize the experimental conditions
![Page 12: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/12.jpg)
Sample Search
• Human beta globin (HBB)– Subunit of hemoglobin
• Acquisition number: NP_000509
• Limit to mouse to more easily show differences between searches
![Page 13: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/13.jpg)
Interpreting Results
• Score: Normalized score of alignment (substitution matrix and gap penalty). Can be compared across searches
• Max score: Score of single best aligned sequence
• Total score: Sum of scores of all aligned sequences
![Page 14: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/14.jpg)
Interpreting Results
• Query coverage: What percent of query sequence is aligned
• E Value: Number of matches with same score expected by chance. For low values, equal to p, the probability of a random alignment
• Typically, E < .05 is required to be considered significant
![Page 15: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/15.jpg)
Getting the most out of BLAST
1. What kind of BLAST?
2. Pick an appropriate database
3. Pick the right algorithm
4. Choose parameters
![Page 16: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/16.jpg)
Step 0: Do you need to use BLAST?
![Page 17: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/17.jpg)
Step 1:Nucleotide BLAST vs. protein BLAST• Largely determined by your query sequence
BUT
• If your nucleotide sequence can be translated to a peptide sequence, you probably want to do it (use tool such as ExPASy Translate Tool)
• Protein blasts are more sensitive and biologically significant
• Sometimes it makes sense to use other blasts
![Page 18: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/18.jpg)
Specialized Search: blastx
• Search protein database using a translated nucleotide query
• Use to find homologous proteins to a nucleotide coding region
• Translates the query sequence in all six reading frames
• Often the first analysis performed with a newly determined nucleotide sequence
http://www.ncbi.nlm.nih.gov/blast/producttable.shtml#blastx
![Page 19: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/19.jpg)
Specialized Search: tblastn
• Search translated nucleotide database using a protein query
• Does six-frame translations of the nucleotide database
• Find homologous protein coding regions in unannotated nucleotide sequences such as expressed sequence tags (ESTs) and draft genome records (HTG)
http://www.ncbi.nlm.nih.gov/blast/producttable.shtml#tblastn
![Page 20: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/20.jpg)
Specialized Search: tblastx
• Search translated nucleotide database using a translated nucleotide query
• Both translations use all six frames
• Useful in identifying potential proteins encoded by single pass read ESTs
• Good tool for identifying novel genes
• Computationally intensive
http://www.ncbi.nlm.nih.gov/blast/producttable.shtml#tblastx
![Page 21: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/21.jpg)
Even More Specialized• Make specific primers with Primer-BLAST• Search trace archives• Find conserved domains in your sequence (cds)• Find sequences with similar conserved domain architecture (cdart)• Search sequences that have gene expression profiles (GEO)• Search immunoglobulins (IgBLAST)• Search for SNPs (snp)• Screen sequence for vector contamination (vecscreen)• Align two (or more) sequences using BLAST (bl2seq)• Search protein or nucleotide targets in PubChem BioAssay• Search SRA transcript libraries• Constraint Based Protein Multiple Alignment Tool
![Page 22: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/22.jpg)
Step 2: Choose a Database
• Too large:– Takes longer– Too many results– More random, meaningless matches
• Too small or wrong one:– Miss significant matches
![Page 23: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/23.jpg)
Protein Databases
• Non-redundant protein sequences (nr)– Kitchen-sink:
• Translations of GenBank coding sequences (CDS) • RefSeq Proteins • PDB (RCSB Protein Data Bank - 3d-structure) • SwissProt • Protein Information Resource (PIR) • Protein Research Foundation (Japanese DB)
• Reference proteins (refseq_protein) – NCBI Reference Sequences: Comprehensive, integrated, non-
redundant, well-annotated set of sequences
• Swissprot protein sequences (swissprot) – Swiss-Prot: European protein database (no incremental
updates)
![Page 24: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/24.jpg)
Protein Databases
• Patented protein sequences (pat) – Patented sequences
• Protein Data Bank proteins (pdb) – Sequences from RCSB Protein Data Bank
with experimentally determined structures
• Environmental samples (env_nr) – Protein sequences from environmental
samples (not associated with known organism)
![Page 25: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/25.jpg)
Nucleotide Databases
• Human genomic + transcript – http://www.ncbi.nlm.nih.gov/genome/guide/human/
• Mouse genomic + transcript – http://www.ncbi.nlm.nih.gov/genome/guide/mouse/
• Nucleotide collection (nr/nt)– “nr” stands for “non-redundant,” but it isn’t
• GenBank (NCBI)• EMBL (European Nucleotide Sequence Database)• DDBJ (DNA Databank of Japan)• PDB (RCSB Protein Data Bank - 3d-structure)
– Kitchen sink but not HTGS0,1,2, EST, GSS, STS, PAT, WGS
![Page 26: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/26.jpg)
Nucleotide Databases
• Reference mRNA sequences (refseq_rna) • Reference genomic sequences
(refseq_genomic) – NCBI Reference Sequences: Comprehensive,
integrated, non-redundant, well-annotated set of sequences
• NCBI Genomes (chromosome) – Complete genomes and chromosomes from
Reference Sequences
![Page 27: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/27.jpg)
Nucleotide Databases
• Expressed sequence tags (est)• Non-human, non-mouse ESTs (est_others)
– http://www.ncbi.nlm.nih.gov/About/primer/est.html – http://www.ncbi.nlm.nih.gov/dbEST/index.html
• Genomic survey sequences (gss)– Like EST, but genomic rather than cDNA (mRNA)
• random "single pass read" genome survey sequences.• cosmid/BAC/YAC end sequences• exon trapped genomic sequences• Alu PCR sequences• transposon-tagged sequences
– http://www.ncbi.nlm.nih.gov/dbGSS/index.html
![Page 28: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/28.jpg)
Nucleotide Databases
• High throughput genomic sequences (HTGS)– Unfinished sequences (phase 1-2). Finished are
already in nr/nt – http://www.ncbi.nlm.nih.gov/HTGS/
• Patent sequences (pat) – Patented genes
• Protein Data Bank (pdb) – Sequences from RCSB Protein Data Bank with
experimentally determined structures– http://www.rcsb.org/pdb/home/home.do
![Page 29: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/29.jpg)
Nucleotide Databases
• Human ALU repeat elements (alu_repeats) – Database of repetitive elements
• Sequence tagged sites (dbsts) – Short sequences with known locations from
GenBank, EMBL, DDBJ
• Whole-genome shotgun reads (wgs)– http://www.ncbi.nlm.nih.gov/Genbank/
wgs.html
![Page 30: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/30.jpg)
Nucleotide Databases
• Environmental samples (env_nt) – Nucleotide sequences from environmental
samples (not associated with known organism)
![Page 31: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/31.jpg)
Database Options
• Limit to (or exclude) an organism• Exclude Models (XM/XP)
– Model reference sequences produced by NCBI's Genome Annotation project. These records represent the transcripts and proteins that are annotated on the NCBI Contigs … which may have been generated from incomplete data.
• Entrez Query– Use Entrez query syntax to limit search
![Page 32: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/32.jpg)
Step 3:Choose an Algorithm
• How close a match are you looking for?
• Determines how similarities are “scored”
• Affects speed of search and chance of missing match
• Again, what is the goal of the search?
![Page 33: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/33.jpg)
blastp
• Protein-protein BLAST
• Standard protein BLAST
![Page 34: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/34.jpg)
PSI-BLAST
• Protein-protein BLAST
• Position-Specific Iterated BLAST
• Finds more distantly related matches
• Iterates: Initial search results provide information on “allowed” mutations; subsequent searches use these to create custom substitution matrix
![Page 35: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/35.jpg)
PHI-BLAST
• Protein-protein BLAST• Pattern Hit Initiated BLAST• Variation of PSI-BLAST• Specify a pattern that hits must match• Use when you know protein family has a
signature pattern: active site, structural domain, etc.
• Better chance of eliminating false positives• Example: VKAHGKKV
![Page 36: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/36.jpg)
megablast
• Nucleotide BLAST
• Finds highly similar sequences
• Very fast
• Use to identify a nucleotide sequence
![Page 37: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/37.jpg)
blastn
• Nucleotide BLAST
• Use to find less similar sequences
![Page 38: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/38.jpg)
discontiguous megablast
• Nucleotide BLAST– Bioinformatics. 2002 Mar;18(3):440-5.
PatternHunter: faster and more sensitive homology search. Ma B, Tromp J, Li M.
• Even more dissimilar sequences
• Use to find diverged sequences (possible homologies) from different organisms
![Page 39: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/39.jpg)
Step: 4Algorithm Parameters
Fine-tune the algorithm
• Short Queries
• Expect threshold: The lower it is, the fewer false positives (but you might miss real hits)
![Page 40: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/40.jpg)
Algorithm Parameters
Scoring Matrix: • PAM: Accepted Point Mutation
– Empirically derived chance a substitution will be accepted, based on closely related proteins
– Higher PAM numbers correspond to greater evolutionary distance
• BLOSUM: Blacks Substitution Matrix– Another empirically derived matrix, based on more distantly
related proteins– Lower BLOSUM numbers correspond to greater evolutionary
distance
• Compositional adjustment changes matrix to take into account overall composition of sequence
![Page 41: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/41.jpg)
Algorithm Parameters
Filters and Masking
• Can ignore low complexity regions in searching
![Page 42: Blast Introduction](https://reader035.fdocuments.us/reader035/viewer/2022062221/55cf9972550346d0339d6fa7/html5/thumbnails/42.jpg)
Additional Sources
• Pevsner, Jonathan Bioinformatics and Functional Genomics, 2nd ed. (Wiley-Blackwell, 2009)
• BLAST help pages: http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs
• Slides from class on similarity searching; lots of technical details on algorithms and similarity matrices: http://www.ncbi.nlm.nih.gov/Class/NAWBIS/Modules/Similarity/simsrchlast.html