Psi-Blast Morten Nielsen, CBS, BioCentrum, DTU. Objectives Understand why BLAST often fails for low...

Psi-Blast

Morten Nielsen, CBS, BioCentrum, DTU

Objectives

• Understand why BLAST often fails for low sequence similarity

• See the beauty of sequence profiles: position-specific scoring matrices (PSSMs)

• Use BLAST to generate sequence profiles

• Use profiles to identify amino acids essential for protein function and structure

What goes wrong when Blast fails?

• Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acid matches in the two protein sequences

Alignment scoring matrices

• Blosum62 score matrix. Fg=1. Ng=0?

      L   A   G   D   S   D
  F   0  -2  -3  -3  -2  -3
  I   2  -1  -4  -3  -2  -3
  G  -4   0   6  -1   0  -1
  D  -4  -2  -1   6   0   6
  S  -2   1   0   0   4   0
  L   4  -1  -4  -4  -2  -4

LAGDS
I-GDS

• Score = 2 - 1 + 6 + 6 + 4 = 17

What goes wrong when Blast fails?

• Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acid matches in the two protein sequences

• This scoring matrix is identical at all positions in the protein sequence!

EVVFIGDSLVQLMHQC

AGDS.GGGDS

When Blast works!

1PLB._

When Blast fails!

1PMY._

When Blast fails

Sequence profiles

• In reality not all positions in a protein are equally likely to mutate

• Some amino acids (active sites) are highly conserved, and the penalty for a mismatch must be very high

• Other amino acids can mutate almost for free, and the penalty for a mismatch should be lower than the one given by the BLOSUM matrix

• Sequence profiles can capture these differences

What are sequence profiles?

Anchor positions

Binding Motif. MHC class I with peptide

SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAVLLDVPTAAV VLFRGGPRG MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTLHLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTIILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA SLPDFGISY KKREEAPSLLERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGVPLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGVILGFVFTLT LLFGYPVYV GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQMKLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSVKTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL IMDKNIILK SMVGNWAKVSLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYVILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLVTGAPVTYST VIYQYMDDL VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAAGAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLAKARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV GLHCYEQLV PLKQHFQIVAVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVVGLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLVVLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQCISNDVCAQV VKTDGNPPE SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGAYTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYINMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL STAPPHVNV LLLLTVLTVVVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQGLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYLEAAGIGILT TLDSQVMSL STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAVYLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV GLPVEYLQV KLIANNTRV FIYAGSLSA KLVANNTRLFLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE ALYVDSLFF SLLQHLIGL ELTLGEFLK MINAYLDKLAAGIGILTV FLPSDFFPS 
SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYIAAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV

Sequence information


Say that a peptide must have L at P2 in order to bind, and that A, F, W, and Y are found at P1. Which position has the most information? How many questions do I need to ask to tell if a peptide binds, looking at only P1 or P2?

• P1: 4 questions (at most)

• P2: 1 question (L or not)

• P2 has the most information

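• Calculate $p_a$ at each position

• Entropy: $S = -\sum_a p_a \log(p_a)$

• Information content: $I = \log(20) - S = \log(20) + \sum_a p_a \log(p_a)$

• Conserved positions: $p_V = 1$, $p_{a \neq V} = 0$ => $S = 0$, $I = \log(20)$

• Mutable positions: $p_a = 1/20$ => $S = \log(20)$, $I = 0$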
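The two limiting cases can be checked numerically with a minimal sketch. Base-2 logarithms are an assumption here; they match the S and I columns of the information content table in this deck, where S + I is about 4.32 = log2(20). The function names are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy S = -sum(p * log2(p)); terms with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information(probs, alphabet_size=20):
    """Information content I = log2(alphabet_size) - S."""
    return math.log2(alphabet_size) - entropy(probs)


conserved = [1.0] + [0.0] * 19  # one amino acid always present
mutable = [1 / 20] * 20         # all amino acids equally likely

print(entropy(conserved), information(conserved))
print(entropy(mutable), information(mutable))
```

A fully conserved position gives S = 0 and I = log2(20); a uniformly mutable position gives S = log2(20) and I = 0, up to floating-point rounding.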


Sequence information - I

$I = \log(20) + \sum_a p_a \log(p_a)$

Multiple sequence alignment

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

Frequencies at position 1:

PA = 6/10 = 0.6

PG = 2/10 = 0.2

PT = PK = 1/10 = 0.1

PC = PD = … = PV = 0.0
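The position-1 frequencies can be counted directly from the ten aligned peptides. This is a small sketch with an illustrative helper name; it uses raw counts only, with no sequence weighting or pseudocounts.

```python
from collections import Counter

# The ten aligned 9-mers from the slide.
ALIGNMENT = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]


def column_frequencies(alignment, position):
    """Return observed amino acid frequencies at a 1-based alignment position."""
    column = [seq[position - 1] for seq in alignment]
    counts = Counter(column)
    return {aa: n / len(column) for aa, n in counts.items()}


p1 = column_frequencies(ALIGNMENT, 1)
print(p1)  # A: 0.6, G: 0.2, T: 0.1, K: 0.1
```

Amino acids that never occur at a position are simply absent from the dictionary, which corresponds to a frequency of 0.0.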

Information content

Amino acid frequencies per position; the last two columns are the entropy S and the information content I.

Pos    A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V    S    I
  1 0.10 0.06 0.01 0.02 0.01 0.02 0.02 0.09 0.01 0.07 0.11 0.06 0.04 0.08 0.01 0.11 0.03 0.01 0.05 0.08 3.96 0.37
  2 0.07 0.00 0.00 0.01 0.01 0.00 0.01 0.01 0.00 0.08 0.59 0.01 0.07 0.01 0.00 0.01 0.06 0.00 0.01 0.08 2.16 2.16
  3 0.08 0.03 0.05 0.10 0.02 0.02 0.01 0.12 0.02 0.03 0.12 0.01 0.03 0.05 0.06 0.06 0.04 0.04 0.04 0.07 4.06 0.26
  4 0.07 0.04 0.02 0.11 0.01 0.04 0.08 0.15 0.01 0.10 0.04 0.03 0.01 0.02 0.09 0.07 0.04 0.02 0.00 0.05 3.87 0.45
  5 0.04 0.04 0.04 0.04 0.01 0.04 0.05 0.16 0.04 0.02 0.08 0.04 0.01 0.06 0.10 0.02 0.06 0.02 0.05 0.09 4.04 0.28
  6 0.04 0.03 0.03 0.01 0.02 0.03 0.03 0.04 0.02 0.14 0.13 0.02 0.03 0.07 0.03 0.05 0.08 0.01 0.03 0.15 3.92 0.40
  7 0.14 0.01 0.03 0.03 0.02 0.03 0.04 0.03 0.05 0.07 0.15 0.01 0.03 0.07 0.06 0.07 0.04 0.03 0.02 0.08 3.98 0.34
  8 0.05 0.09 0.04 0.01 0.01 0.05 0.07 0.05 0.02 0.04 0.14 0.04 0.02 0.05 0.05 0.08 0.10 0.01 0.04 0.03 4.04 0.28
  9 0.07 0.01 0.00 0.00 0.02 0.02 0.02 0.01 0.01 0.08 0.26 0.01 0.01 0.02 0.00 0.04 0.02 0.00 0.01 0.38 2.78 1.55

Sequence logos

• Height of a column equal to I

• Relative height of a letter is $p_a$

• Highly useful tool to visualize sequence motifs

High information positions

HLA-A0201

Sequence logos

• Height of a column equal to I

• Relative height of a letter is $p_a$

• Letters upside-down if $p_a < q_a$

High information positions

$I = \log(20) + \sum_a p_a \log(p_a)$
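The logo rules can be turned into letter heights with a short sketch: the column height is I and each letter takes the fraction $p_a$ of it. Base-2 logarithms and the function name are assumptions, and the upside-down rule (which needs background frequencies $q_a$) is not handled. The example frequencies are the position-1 values from the multiple sequence alignment slide.

```python
import math


def letter_heights(freqs, alphabet_size=20):
    """Return the height of each letter in one logo column.

    Column height: I = log2(alphabet_size) + sum(p * log2(p)).
    Letter height: p_a * I.
    """
    info = math.log2(alphabet_size) + sum(
        p * math.log2(p) for p in freqs.values() if p > 0
    )
    return {aa: p * info for aa, p in freqs.items()}


# Position 1 of the ten-peptide alignment: PA = 0.6, PG = 0.2, PT = PK = 0.1
p1 = {"A": 0.6, "G": 0.2, "T": 0.1, "K": 0.1}
heights = letter_heights(p1)

for aa, h in sorted(heights.items(), key=lambda item: -item[1]):
    print(f"{aa}: {h:.2f}")
```

Because the letter heights sum to I, a conserved column produces a tall stack dominated by one letter, while a variable column produces a short stack of many small letters.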

Protein world

Protein fold

Protein structure classification

Protein superfamily

Protein family

ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

Sequence profiles

• Conserved: matching anything but G => large negative score

• Non-conserved: anything can match

TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP

TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

How to make sequence profiles

1. Align (BLAST) sequence against large sequence database (Swiss-Prot)

2. Select significant alignments and make sequence profile

3. Use profile to align against sequence database to find new significant hits

4. Repeat 2 and 3 (normally 3 times!)
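The four steps above can be illustrated with a toy loop. This is a sketch of the idea only, not PSI-BLAST's actual algorithm: it assumes gapless sequences of equal length, a uniform background, simple pseudocounts, and a raw-score threshold instead of E-values. All names (`build_pssm`, `iterative_search`, `DATABASE`) and the tiny database are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(seqs, pseudocount=1.0):
    """Log-odds profile (base 2) from equal-length, gapless sequences."""
    total = len(seqs) + pseudocount * len(ALPHABET)
    pssm = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        pssm.append({
            a: math.log2(((counts[a] + pseudocount) / total) * len(ALPHABET))
            for a in ALPHABET
        })
    return pssm


def profile_score(pssm, seq):
    """Sum the position-specific scores of a sequence against the profile."""
    return sum(column[a] for column, a in zip(pssm, seq))


def iterative_search(query, database, threshold, iterations=3):
    """Steps 1-4: search, build profile from hits, search again, repeat."""
    hits = [query]
    for _ in range(iterations):
        pssm = build_pssm(hits)
        new_hits = [query] + [
            s for s in database
            if s != query and profile_score(pssm, s) >= threshold
        ]
        if set(new_hits) == set(hits):  # converged: no change in the hit set
            break
        hits = new_hits
    return hits


DATABASE = ["IAGDS", "IAGES", "WWWWW"]  # hypothetical toy database
print(iterative_search("LAGDS", DATABASE, threshold=3.0, iterations=1))
print(iterative_search("LAGDS", DATABASE, threshold=3.0))
```

In this toy setup the first iteration finds only the closest relative, and the profile built from that hit is what lets the second iteration pick up the more distant sequence. That is the motivation for repeating steps 2 and 3.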

Sequence profiles (1J2J.B)

>1J2J.B mol:aa PROTEIN TRANSPORT
NVIFEDEEKSKMLARLLKSSHPEDLRAANKLIKEMVQEDQKRMEK

       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
 1 N  -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
 2 V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
 3 I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
 4 F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
 5 E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
 6 D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
 7 E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
 8 E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
 9 K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
10 S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
11 K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
12 M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1

Blosum62 Sequence Profile

Example.

>1K7C.A
TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL

• What is the function?

• Where is the active site?

What would you do?

• Function
    • Run Blast against PDB
        • No significant hits
    • Run Blast against NR (Sequence database)
        • Function is Acetylesterase?

• Where is the active site?

Example. Where is the active site?

1WAB Acetylhydrolase

1G66 Acetylxylan esterase

1USW Hydrolase

When Blast fails!

1WAB._

Example. (SGNH active site)

• Sequence profiles might show you where to look!

• The active site could be around S9, G42, N74, and H195

Profile-profile scoring matrix

1WAB._

Align using sequence profiles

ALN 1K7C.A 1WAB._ RMSD = 5.29522. 14% ID

1K7C.A TVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDN
              S                                G                               N
1WAB._ EVVFIGDSLVQLMHQCE---IWRELFS---PLHALNFGIGGDSTQHVLW--RLENGELEHIRPKIVVVWVGTNNHG------

1K7C.A GRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAK--GAKVILSSQTPNNPWETGTFVNSPTRFVEYAEL-AAEVA
1WAB._ ---------------------HTAEQVTGGIKAIVQLVNERQPQARVVVLGLLPRGQ-HPNPLREKNRRVNELVRAALAGHP

1K7C.A GVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSL
                                       H
1WAB._ RAHFLDADPG---FVHSDG--TISHHDMYDYLHLSRLGYTPVCRALHSLLLRL---L
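To see which 1WAB residues line up with the proposed active-site positions S9, G42, N74, and H195, the alignment can be walked column by column. This is a minimal sketch with an illustrative helper name; it assumes the aligned 1K7C.A segment starts at residue 2 of the sequence given earlier, since the alignment begins at TVYLAGDS rather than TTVYLAGDS.

```python
# Alignment rows copied from the profile-profile alignment above.
QUERY_ROWS = [
    "TVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDN",
    "GRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAK--GAKVILSSQTPNNPWETGTFVNSPTRFVEYAEL-AAEVA",
    "GVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSL",
]
TEMPLATE_ROWS = [
    "EVVFIGDSLVQLMHQCE---IWRELFS---PLHALNFGIGGDSTQHVLW--RLENGELEHIRPKIVVVWVGTNNHG------",
    "---------------------HTAEQVTGGIKAIVQLVNERQPQARVVVLGLLPRGQ-HPNPLREKNRRVNELVRAALAGHP",
    "RAHFLDADPG---FVHSDG--TISHHDMYDYLHLSRLGYTPVCRALHSLLLRL---L",
]


def aligned_residues(query_aln, template_aln, query_start, positions):
    """Map selected query positions to the template residue in the same column."""
    result = {}
    pos = query_start - 1
    for q, t in zip(query_aln, template_aln):
        if q == "-":
            continue  # gap in the query: no query residue consumed
        pos += 1
        if pos in positions:
            result[f"{q}{pos}"] = t
    return result


pairs = aligned_residues(
    "".join(QUERY_ROWS), "".join(TEMPLATE_ROWS),
    query_start=2, positions={9, 42, 74, 195},
)
print(pairs)
```

A template character of "-" in the result would mean the query position falls in a gap of 1WAB; here each of the four positions is paired with an identical residue.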

Where is the active site?

Rhamnogalacturonan acetylesterase (1k7c)

How to do it?

Example

>QUERY1

MKDTDLSTLLSIIRLTELKESKRNALLSLIFQLSVAYFIALVIVSRFVRYVNYITYNNLV

EFIIVLSLIMLIIVTDIFIKKYISKFSNILLETLNLKINSDNNFRREIINASKNHNDKNK

LYDLINKTFEKDNIEIKQLGLFIISSVINNFAYIILLSIGFILLNEVYSNLFSSRYTTIS

IFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTTIGQDKQL

YDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENID

LKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQE

IDLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINIL

QGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLVVLE

Using Iterative Blast

Using Iterative Blast (1st iteration)

Using Iterative Blast (3rd iteration)

HHpred webserver

Take home message

• Blast will often fail to recognize sequence relationships for low homology sequence pairs

• Sequence profiles contain information on conserved/variable residues in a protein sequence

• Sequence profiles are calculated from (multiple) sequence alignments

• Iterative Blast enables homology recognition also for low sequence similarity

• Sequence profiles give information on residues essential for protein function and protein structure
