Database search. Overview : 1. FastA : is suitable for protein sequence searching 2. BLAST :...


Page 1: Database search. Overview : 1. FastA : is suitable for protein sequence searching 2. BLAST : is suitable for DNA, RNA, protein sequence searching.

Database search

Page 2:

Overview :

1. FastA : is suitable for protein sequence searching

2. BLAST : is suitable for DNA, RNA, protein sequence searching

Page 3:

FastA

History: FastA grew out of the FASTP program published by Lipman and Pearson in 1985 (FASTA itself followed in 1988) and was one of the first widely used database search programs.

EBI provides a FastA service, available at

http://www.ebi.ac.uk/Tools/fasta/

Idea: identify short substrings (words) that match exactly between the query and the target sequence.
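The word-matching idea can be illustrated with a minimal Python sketch. It is not the FastA implementation: the function names and the word length k=2 are assumptions chosen for illustration. It indexes every k-letter word of the query, looks those words up in the target, and groups the hits by diagonal (target position minus query position), since many hits on one diagonal suggest a similar region.

```python
from collections import defaultdict

def shared_words(query, target, k=2):
    """Return (query_pos, target_pos) pairs where a k-letter word matches exactly."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    hits = []
    for j in range(len(target) - k + 1):
        # .get avoids adding empty entries for words missing from the query
        for i in index.get(target[j:j + k], []):
            hits.append((i, j))
    return hits

def best_diagonal(hits):
    """Group hits by diagonal (target_pos - query_pos); return (diagonal, hit_count)."""
    counts = defaultdict(int)
    for i, j in hits:
        counts[j - i] += 1
    if not counts:
        return None
    return max(counts.items(), key=lambda item: item[1])

# Toy example: the words "BC" and "CD" match on the same diagonal.
print(best_diagonal(shared_words("ABCD", "XBCD")))  # (0, 2)
```

A real search would then rescore and extend the best diagonals with a substitution matrix; that step is omitted here.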

Page 4:

Other commonly used software:

http://www.ebi.ac.uk/Tools/sss/

Page 5:

Example protein sequence: EDCIAVGQLCVFWNIGRPCCSGLCVFACTVKLP

[Screenshot of the search form. Labels: select database; input sequence; parameters]

Page 6:
Page 7:

Results:

100% identity

17/28 = 60.7% identity in a 28 aa overlap
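The identity figure is simply the number of identical aligned positions divided by the length of the overlap. A minimal Python sketch of that arithmetic follows; the function name and the toy strings are assumptions for illustration, not part of the FastA output.

```python
def identity(aligned_a, aligned_b):
    """Return (matches, overlap, percent) for two aligned strings of equal length.

    Gap characters ("-") never count as matches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    overlap = len(aligned_a)
    if overlap == 0:
        raise ValueError("empty alignment")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return matches, overlap, 100.0 * matches / overlap

# Toy alignment: 3 identical positions out of 4.
print(identity("ACDE", "ACDF"))  # (3, 4, 75.0)

# The figure reported on the slide: 17 identical residues in a 28 aa overlap.
print(f"{100.0 * 17 / 28:.1f}%")  # 60.7%
```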

Page 8:

BLAST

Basic Local Alignment Search Tool (BLAST).

BLAST was developed by NCBI.

BLAST finds regions of similarity between biological sequences.

Page 9:

Basic BLAST

Program   Query        Database     Program description
blastn    Nucleotide   Nucleotide   Search a nucleotide database using a nucleotide query. Algorithms: blastn, megablast, discontiguous megablast
blastp    Protein      Protein      Search a protein database using a protein query. Algorithms: blastp, psi-blast, phi-blast, delta-blast
blastx    Nucleotide   Protein      Search a protein database using a translated nucleotide query
tblastn   Protein      Nucleotide   Search a translated nucleotide database using a protein query
tblastx   Nucleotide   Nucleotide   Search a translated nucleotide database using a translated nucleotide query

t: translation; n: nucleotide; p: protein; x: cross
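The table can also be read as a lookup: given the type of the query and the type of the database, which programs apply? The Python sketch below encodes the five rows. The dictionary layout and the function name are assumptions made for illustration and do not correspond to any BLAST API.

```python
# program: (query type, database type, query translated?, database translated?)
BLAST_PROGRAMS = {
    "blastn": ("nucleotide", "nucleotide", False, False),
    "blastp": ("protein", "protein", False, False),
    "blastx": ("nucleotide", "protein", True, False),
    "tblastn": ("protein", "nucleotide", False, True),
    "tblastx": ("nucleotide", "nucleotide", True, True),
}

def candidate_programs(query_type, database_type):
    """Return the program names that accept this query/database combination."""
    return sorted(
        name
        for name, (q, d, _, _) in BLAST_PROGRAMS.items()
        if q == query_type and d == database_type
    )

print(candidate_programs("nucleotide", "protein"))     # ['blastx']
print(candidate_programs("nucleotide", "nucleotide"))  # ['blastn', 'tblastx']
```

For a nucleotide query against a nucleotide database two programs qualify; the choice depends on whether the comparison should happen at the nucleotide level (blastn) or at the translated protein level (tblastx).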

Page 10:

BLASTALL

[Diagram: query sequence type, program, and database searched]

Amino acid query sequence:
  BLASTp: protein database
  TBLASTn: nucleotide database (translated)

DNA query sequence:
  BLASTn: nucleotide database
  BLASTx: protein database (query translated)
  TBLASTx: nucleotide database (query and database translated)

Page 11:

Blast source

1. NCBI:
   http://blast.ncbi.nlm.nih.gov/Blast.cgi/ (online version)
   ftp://ftp.ncbi.nih.gov/blast/ (standalone)

2. Other websites:
   http://life.zsu.edu.cn/blast/
   http://www.fruitfly.org/blast/
   http://www.mcgb.uestc.edu.cn/blast/blast.html

Page 12:
Page 13:

BLAST

1. Online: from the website

2. Standalone: download the software

Page 14:

Comparison between them

Web server advantages:
1. Easy to use.
2. Kept up to date.
3. No need to download databases.

Web server disadvantages:
1. Not suitable for large data sets.
2. Cannot define your own database.

Page 15:

Web BLAST provided by NCBI

Blastn for nucleotide

Blastp for protein

http://blast.ncbi.nlm.nih.gov/Blast.cgi

Page 16:

An example:

1. cctggcgataaccgtcttgtcggcggttgcgctgacgttgcgtcgtgatatcatcagggcAgaccggttacatccccctaa

2. gatcgaaaaacgcttgtgttaaaaatttgctaaattttgccaatttggtaaaacagttgcAtcacaacaggagatagcaat

Page 17:

the first sequence

Page 18:

[Screenshot of the input form. Labels: the second sequence; sequence range; software; similarity from high to low; results shown in new window]

Page 19:

results of pairwise alignment

No significant similarity found

information of the two sequences

parameters selected

Page 20:

Blast (standalone version)

Why do we need the standalone version of BLAST?

1. specific database

2. privacy

3. batch processing

Page 21:

Blast (standalone version)

How to download BLAST: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release

Example installer: blast-2.2.23-ia32-win32.exe

Page 22:
Page 23:

After unzipping, we get three folders:

bin: all the exe files

data: data for BLAST

doc: readme

Page 24:

Blast (standalone version)

We need to format the database for BLAST.

First, save your database in FASTA format.

Second, use formatdb, provided in the BLAST package, to format the database.

DOS command: formatdb -i sequence.fa -p T/F -o T/F -n db_name

Page 25:

An example

1. There are 13 proteins in the file “Delta.txt” as the database.

2. One protein is selected as the query sequence and stored in the file “seq.txt”.
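Before formatting, it can help to confirm that the database file really is in FASTA format and holds the expected number of records. The Python sketch below is a minimal reader written for illustration: the function name is an assumption, the first record reuses the query sequence and the P83301|CXO_CONVE identifier that appear elsewhere on these slides, and the second record is invented.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a {header: sequence} dictionary."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]          # header line without the leading ">"
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequences may span several lines
    return {header: "".join(parts) for header, parts in records.items()}

example_lines = [
    ">P83301|CXO_CONVE",
    "EDCIAVGQLCVFWNIGRPCCSGLCVFACTVKLP",
    ">toy_record (hypothetical)",
    "ACDEFG",
    "HIK",
]
records = read_fasta(example_lines)
print(len(records), "records")
```

To check a real file, the same function can be given an open file handle, for example `read_fasta(open("Delta.txt"))`, which for the database described here should report 13 records.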

Page 26:

1. Format Delta.txt:

formatdb -i Delta.txt -p T

Parameters:
1. -i: database
2. -p: T for protein, F for nucleotide
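When the formatting step is scripted, the command can be assembled as an argument list rather than typed by hand. The Python sketch below only builds the list. The function and parameter names are assumptions for illustration, and the reading of -o as "parse sequence identifiers" is my understanding of the legacy formatdb option, which these slides do not explain.

```python
def formatdb_command(fasta_path, is_protein, parse_seqids=False, db_name=None):
    """Build the argument list for the legacy formatdb program."""
    cmd = ["formatdb", "-i", fasta_path, "-p", "T" if is_protein else "F"]
    if parse_seqids:
        cmd += ["-o", "T"]     # assumed meaning: parse sequence identifiers
    if db_name:
        cmd += ["-n", db_name]  # base name for the formatted database files
    return cmd

print(" ".join(formatdb_command("Delta.txt", is_protein=True)))
# formatdb -i Delta.txt -p T
```

With the legacy BLAST package installed and formatdb on the PATH, the list could be passed to `subprocess.run(cmd, check=True)`; that call is left out so the sketch runs without BLAST.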

Page 27:

2. Search Delta.txt using BLAST:

blastall -p blastp -d Delta.txt -i seq.txt -o out.txt

Parameters:
1. -p: program name: blastp, blastn, blastx, tblastn, tblastx
2. -d: database name
3. -i: query sequences
4. -o: output file
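Batch processing was one of the reasons given for the standalone version. A minimal Python sketch of that use: build one blastall command per query file, each with its own output file. The helper names and the ".out" naming scheme are assumptions for illustration; only the four flags listed above are used.

```python
from pathlib import Path

VALID_PROGRAMS = {"blastp", "blastn", "blastx", "tblastn", "tblastx"}

def blastall_command(program, database, query, output):
    """Build the argument list for one legacy blastall search."""
    if program not in VALID_PROGRAMS:
        raise ValueError(f"unknown BLAST program: {program}")
    return ["blastall", "-p", program, "-d", database, "-i", query, "-o", output]

def batch_commands(program, database, queries):
    """One command per query file; output goes to <query name>.out (assumed scheme)."""
    return [
        blastall_command(program, database, q, str(Path(q).with_suffix(".out")))
        for q in queries
    ]

for cmd in batch_commands("blastp", "Delta.txt", ["seq.txt", "seq2.txt"]):
    print(" ".join(cmd))
```

Each list could be run with `subprocess.run(cmd, check=True)` on a machine where the legacy BLAST binaries are installed; "seq2.txt" is a hypothetical second query file.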

Page 28:

3. To see the other parameters, just type blastall.

Page 29:

4. Results:

                                              Score    E
Sequences producing significant alignments:   (bits)   Value

P83301|CXO_CONVE                              69       1e-017
P69749|CXD6A_CONBU                            20       0.009
P69750|CXD6A_CONCN                            18       0.036
P24159|CXDB_CONTE P18511|CXDA_CONTE           18       0.042
P60179|CXD66_CONAA                            17       0.066
P60513|CXD6A_CONER                            17       0.11
P69751|CXD6E_CONCT P69748|CXD6A_CONAI         16       0.19
P69754|CXD6B_CONMA P69753|CXD6A_CONMA         14       0.56
P69752|CXD6B_CONER P58913|CXD6A_CONPU         14       0.62
P69756|CXD6D_CONMA P69755|CXD6C_CONMA         13       0.89
Q9XZK5|CXSO6_CONST P69757|CXD6A_CONSE         12       2.6
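Each hit in the summary ends with a bit score and an E value, so the list can be filtered by E value in a few lines. The Python sketch below assumes one hit per line with whitespace-separated fields, where the last two fields are the score and the E value. The function names and the 0.01 cutoff are illustrative choices, not BLAST defaults.

```python
def parse_hit(line):
    """Split a summary line into (name, bit_score, e_value).

    The name may itself contain spaces, so only the last two fields are split off.
    """
    name, score, e_value = line.rsplit(None, 2)
    return name, float(score), float(e_value)

def significant_hits(lines, max_e_value=0.01):
    """Keep hits whose E value is at or below the cutoff."""
    hits = [parse_hit(line) for line in lines if line.strip()]
    return [hit for hit in hits if hit[2] <= max_e_value]

summary = [
    "P83301|CXO_CONVE 69 1e-017",
    "P69749|CXD6A_CONBU 20 0.009",
    "P69750|CXD6A_CONCN 18 0.036",
    "P24159|CXDB_CONTE P18511|CXDA_CONTE 18 0.042",
]
for name, score, e_value in significant_hits(summary):
    print(name, score, e_value)
```

With the cutoff of 0.01, only the first two hits of this excerpt are kept.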

Page 30:

>P83301|CXO_CONVE
Length = 33

Score = 69.3 bits (168), Expect = 1e-017, Method: Compositional matrix adjust.
Identities = 33/33 (100%)

Query: 1  EDCIAVGQLCVFWNIGRPCCSGLCVFACTVKLP 33
          EDCIAVGQLCVFWNIGRPCCSGLCVFACTVKLP
Sbjct: 1  EDCIAVGQLCVFWNIGRPCCSGLCVFACTVKLP 33

>P69749|CXD6A_CONBU
Length = 27

Score = 20.0 bits (40), Expect = 0.009, Method: Compositional matrix adjust.
Identities = 13/30 (43%), Gaps = 6/30 (20%)

Query: 1  EDCIAVGQLCVFWNIGRP--CCSGLCVFAC 28
            C A G  C      RP  CCS  C FAC
Sbjct: 1  DECSAPGAFCLI----RPGLCCSEFCFFAC 26

Page 31:

5. Pairwise alignment:

bl2seq -p blastp -i seq.txt -j 1.txt -o out.txt

Parameters:
1. -p: program name: blastp, blastn, ...
2. -i: first sequence
3. -j: second sequence
4. -o: output file

To see the other parameters, just type bl2seq.

Page 32:

6. Databases can be downloaded from:

ftp://ftp.ncbi.nih.gov/blast/db/

Scoring matrices can be downloaded from:

ftp://ftp.ncbi.nih.gov/blast/matrices/

Page 33:

PSI-BLAST

Position-specific iterative BLAST (PSI-BLAST).

Altschul et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389-3402.

Target: proteins only

Page 34:

PSI-BLAST

Position-specific iterative BLAST (PSI-BLAST) refers to a feature of BLAST 2.0 in which a profile is automatically constructed from the first set of BLAST alignments. PSI-BLAST is similar to NCBI BLAST2 except that it uses position-specific scoring matrices derived during the search. This tool is used to detect distant evolutionary relationships.
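The notion of a position-specific scoring matrix can be illustrated with a small Python sketch: for each column of a set of aligned sequences, count the residues, add a pseudocount, and take the log-odds against a background frequency. This is a simplified illustration only. The four-letter alphabet, the uniform background, the fixed pseudocount, and the function names are all assumptions, and PSI-BLAST itself additionally weights sequences, handles gaps, and repeats the search with the updated matrix.

```python
import math
from collections import Counter

def build_pssm(aligned, background, pseudocount=1.0):
    """Return one {residue: log2-odds score} dictionary per alignment column.

    `aligned` holds ungapped sequences of equal length; `background` maps each
    residue of the alphabet to its background frequency.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    alphabet = sorted(background)
    total = len(aligned) + pseudocount * len(alphabet)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        column = {}
        for residue in alphabet:
            freq = (counts.get(residue, 0) + pseudocount) / total
            column[residue] = math.log2(freq / background[residue])
        pssm.append(column)
    return pssm

def score_segment(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(column[residue] for column, residue in zip(pssm, segment))

# Toy alphabet and alignment, chosen only to keep the example readable.
background = {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25}
pssm = build_pssm(["ACD", "ACE", "ACD"], background)
print(round(score_segment(pssm, "ACD"), 2), round(score_segment(pssm, "EEA"), 2))
```

A segment that follows the column preferences scores positively and one that does not scores negatively, which is why such a profile can pick up relatives that a single fixed substitution matrix would miss.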

Page 35:

Online sources:

http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_psiblast.html

http://blast.ncbi.nlm.nih.gov/Blast.cgi

http://www.ebi.ac.uk/Tools/blastpgp/