Post on 15-Jan-2016
Multiple Sequence Alignment
Based on slides by Irit Gat-Viks
Example
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
Why multiple sequence alignment?
• Structural similarity - amino acids that play the same role in each structure are in the same column
• Evolutionary similarity - amino acids descended from the same ancestral residue are in the same column
• Functional similarity - amino acids with the same function are in the same column
• When the sequences are closely related, structural, evolutionary and functional similarity are equivalent
Multiple Alignment Definition
CG copy Ron Shamir 065
Input: Sequences S1, S2, …, Sk over the same alphabet
Output: Gapped sequences S'1, S'2, …, S'k of equal length, such that
1. |S'1| = |S'2| = … = |S'k|
2. Removal of spaces from S'i gives Si, for all i
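The two conditions of the definition can be checked mechanically. The following is a minimal sketch, assuming gaps are written as `-`; the function name `is_valid_msa` is made up for illustration and is not part of the slides.

```python
def is_valid_msa(originals, gapped, gap="-"):
    """Check the two conditions of the multiple-alignment definition."""
    if len(originals) != len(gapped):
        return False
    # Condition 1: all gapped sequences have the same length.
    if len({len(row) for row in gapped}) > 1:
        return False
    # Condition 2: removing the gaps from S'i gives back Si.
    return all(row.replace(gap, "") == seq
               for seq, row in zip(originals, gapped))


seqs = ["AGGTC", "GTTCG", "TGAAC"]
print(is_valid_msa(seqs, ["AGGT-C-", "-G-TTCG", "TG-AAC-"]))
```

Note that the definition, and therefore this check, does not forbid columns that consist only of gaps.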
Example
S1=AGGTC
S2=GTTCG
S3=TGAAC
Possible alignment (one alignment column per line, characters in the order S1, S2, S3)
A-T
GGG
G--
TTA
-TA
CCC
-G-
Possible alignment (one alignment column per line, characters in the order S1, S2, S3)
AG-
GTT
GTG
T-A
--A
CCA
-GC
Example
Multiple sequence alignment of 7 neuroglobins using ClustalX
Human-centric beta globin multiple alignment (http://globin.cse.psu.edu)
MSA applications
• Generate protein families
• Extrapolation - assign an uncharacterized sequence to a protein family
• Understand evolution - a preliminary step in molecular evolution analysis for constructing phylogenetic trees, e.g. is the duck evolutionarily closer to a lion or to a fruit fly?
• Pattern identification - find the important (conserved) regions in the protein; conserved positions may characterize a function
• Domain identification - build a consensus/profile/motif that describes the protein family and helps to describe new members of the family
• DNA regulatory elements
• Structure prediction (secondary structure and 3D model)
• Alignment of multiple sequences may reveal weak signals
Protein Phylogenies - Example
Kinase domain
Scoring alignments
• Given input sequences S1, S2, …, Sk, find a multiple alignment of optimal score
• Scores preview:
  - Sum of pairs
  - Consensus
  - Tree
• Varying methods (and controversy)
Sum of Pairs score
Def: Induced pairwise alignment - a pairwise alignment induced by the multiple alignment
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
$S(M) = \sum_{k<l} \sigma(S'_k, S'_l)$
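An induced pairwise alignment is obtained by taking two rows of the multiple alignment and dropping every column in which both rows have a gap. A minimal sketch, assuming `-` as the gap character (the helper name `induced_pairwise` is illustrative):

```python
def induced_pairwise(row_a, row_b, gap="-"):
    """Project a multiple alignment onto two of its rows.

    Columns where both rows contain a gap carry no information for
    the pair, so they are removed.
    """
    kept = [(x, y) for x, y in zip(row_a, row_b)
            if not (x == gap and y == gap)]
    return ("".join(x for x, _ in kept),
            "".join(y for _, y in kept))


x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
print(induced_pairwise(x, y))
print(induced_pairwise(x, z))
print(induced_pairwise(y, z))
```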
SOP Score Example
Consider the following alignment:
AC-CDB-
-C-ADBD
A-BCDAD
Scoring scheme: match = 0, mismatch/indel = -1
SP score: (-3) + (-5) + (-4) = -12
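The sum-of-pairs score of this example can be reproduced with a few lines of code. This is a sketch under two assumptions: the gap character is `-`, and a column in which both rows of a pair have a gap contributes 0 (which is what the numbers on the slide imply). The function names are illustrative.

```python
from itertools import combinations


def pair_score(a, b, match=0, mismatch=-1, indel=-1, gap="-"):
    """Score one induced pairwise alignment column by column."""
    score = 0
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue              # gap against gap: assumed to cost nothing
        if x == gap or y == gap:
            score += indel
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


def sp_score(rows, **scheme):
    """Sum the pairwise scores over all pairs of rows."""
    return sum(pair_score(a, b, **scheme) for a, b in combinations(rows, 2))


rows = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print([pair_score(a, b) for a, b in combinations(rows, 2)])
print(sp_score(rows))
```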
Multiple alignment with SOP scores is NP-hard
The computational problem: given a number of sequences, find their alignment to a common frame (by inserting gaps) such that a distance function attains an optimal value.
Input:
SCGPFIRV
MSCGPGLRA
SCTPHLA
MSCPKIRG
MSLPLLRN
MSHKPALRA
Aligned:
-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA
Can we simply generalize the dynamic programming method? In theory yes, in practice no.
The running time and the memory size grow as N^K, where N is the sequence length and K is the number of sequences.
For K > 3 this is practically infeasible.
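To get a feeling for this growth, the size of the dynamic programming table can simply be computed. The sketch below assumes the straightforward generalization of the pairwise algorithm: a K-dimensional table with (N+1)^K cells, where each cell looks at 2^K - 1 neighbouring cells. The sequence length 300 is an arbitrary illustrative value.

```python
def dp_cells(n, k):
    """Number of cells in a k-dimensional DP table for sequences of length n."""
    return (n + 1) ** k


def neighbours_per_cell(k):
    """Each cell is computed from every non-empty subset of 'step back' moves."""
    return 2 ** k - 1


for k in range(2, 7):
    print(k, dp_cells(300, k), neighbours_per_cell(k))
```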
Consensus MSA
• Score - the sum of the distances from the sequences (with gaps as in the alignment) to some consensus sequence
• More difficult to find/define, since the consensus sequence itself is difficult to define
• Used mainly for computational proofs
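Scoring metrics - examples
Sum of pairs
-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
Identical positions per pair: (1,2) = 5, (1,3) = 3, (2,3) = 5; total = 13
Distance from consensus
-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA
Homogeneity per column (count of the most common character): 4 6 4 2 6 1 4 5 3, total 35
Distance per column (characters that differ from the most common one): 2 0 2 4 0 5 2 1 3, total 19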
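The per-column numbers of the consensus example can be recomputed as follows. This is a sketch that assumes the consensus character of a column is simply its most common character (a gap counts as an ordinary character) and that the distance of a column is the number of rows that differ from it.

```python
from collections import Counter

rows = ["-SCGPFIRV", "MSCGPGLRA", "-SCTPHL-A",
        "MSC-PKIRG", "MS-LPLLRN", "MSHKPALRA"]


def column_homogeneity(rows):
    """For each column, how many rows carry the most common character."""
    return [Counter(col).most_common(1)[0][1] for col in zip(*rows)]


def consensus_distance(rows):
    """For each column, how many rows differ from the most common character."""
    return [len(rows) - h for h in column_homogeneity(rows)]


print(column_homogeneity(rows), sum(column_homogeneity(rows)))
print(consensus_distance(rows), sum(consensus_distance(rows)))
```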
Tree MSA
• Input: a tree T and a string for each leaf
• Phylogenetic alignment for T: an assignment of a string to each internal node
• Score - the (weighted) sum of scores along the edges
• Goal: find a phylogenetic alignment of optimal score
• Consensus = phylogenetic alignment where T is a star
[Figure: example tree whose nodes are labelled CTGG, CCGG, GTTC, CTTG, GTTG, GTTG, CTGG]
Profile Representation of MA
- A G G C T A T C A C C T G
T A G - C T A C C A - - - G
C A G - C T A C C A - - - G
C A G - C T A T C A C - G G
C A G - C T A T C G C - G G

     1   2   3   4   5   6   7   8   9   10  11  12  13  14
A    0   1   0   0   0   0   1   0   0   .8  0   0   0   0
C    .6  0   0   0   1   0   0   .4  1   0   .6  .2  0   0
G    0   0   1   .2  0   0   0   0   0   .2  0   0   .4  1
T    .2  0   0   0   0   1   0   .6  0   0   0   0   .2  0
-    .2  0   0   .8  0   0   0   0   0   0   .4  .8  .4  0

• Alternatively, use log odds:
  - p_i(a) = fraction of a's in column i
  - p(a) = fraction of a's overall
  - score: log( p_i(a) / p(a) )
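The frequency profile and its log-odds variant can be computed directly from the alignment rows. The sketch below makes two assumptions that the slide does not state: the gap `-` is treated as a fifth character, and log odds are only computed for characters that actually occur in a column (the logarithm is undefined for a frequency of 0).

```python
import math
from collections import Counter

rows = ["-AGGCTATCACCTG",
        "TAG-CTACCA---G",
        "CAG-CTACCA---G",
        "CAG-CTATCAC-GG",
        "CAG-CTATCGC-GG"]


def build_profile(rows):
    """One dict per column: character -> fraction of rows with that character."""
    n = len(rows)
    return [{ch: c / n for ch, c in Counter(col).items()} for col in zip(*rows)]


def log_odds_profile(rows):
    """log(p_i(a) / p(a)) for every character a that occurs in column i."""
    overall = Counter("".join(rows))
    total = sum(overall.values())
    return [{ch: math.log(p / (overall[ch] / total)) for ch, p in col.items()}
            for col in build_profile(rows)]


profile = build_profile(rows)
print(profile[0])    # column 1
print(profile[12])   # column 13
```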
Aligning a sequence to a profile
• The key in pairwise alignment is scoring two positions x, y: σ(x,y)
• For a letter x and a column y in a profile, σ(x,y) = value of x in column y
• Invent a score for σ(x,-)
• Run the DP algorithm for pairwise alignment
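The recipe above can be sketched as a global (Needleman-Wunsch style) dynamic program in which one of the two "sequences" is a list of profile columns. The gap score of -1 is an invented value, as the slide suggests, and it is used both for skipping a letter and for skipping a profile column; `align_to_profile` is an illustrative name.

```python
def align_to_profile(seq, profile, gap_score=-1.0):
    """Best global score of aligning seq against a profile.

    profile is a list of dicts (character -> frequency), one per column;
    sigma(x, column) is the frequency of x in that column.
    """
    n, m = len(seq), len(profile)
    # dp[i][j] = best score of seq[:i] against the first j profile columns
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap_score
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap_score
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = profile[j - 1].get(seq[i - 1], 0.0)
            dp[i][j] = max(dp[i - 1][j - 1] + sigma,     # letter on column
                           dp[i - 1][j] + gap_score,     # letter on gap
                           dp[i][j - 1] + gap_score)     # column on gap
    return dp[n][m]


toy_profile = [{"A": 1.0}, {"C": 1.0}, {"G": 1.0}]
print(align_to_profile("ACG", toy_profile))
print(align_to_profile("AG", toy_profile))
```

Only the score is returned here; recovering the alignment itself would need the usual traceback.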
Aligning alignments
• Given two alignments, how can we align them?
• Hint: use DP on the corresponding profiles
Alignment 1:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
Alignment 2:
w GGACGTACC--
v GGACCT-----
Combined alignment:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
w GGACGTACC--
v GGACCT-----
Multiple Alignment: Greedy Heuristic
• Choose the most similar pair of sequences and combine them into a profile, thereby reducing the alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat.
k sequences:
u1 = ACGTACGTACGT…
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…
…
uk = CCGGCCGGCCGG
k-1 sequences/profiles:
u1 = ACgtTACgtTACgcT…
u2 = TTAATTAATTAA…
…
uk = CCGGCCGGCCGG…
ClustalW (Thompson, Higgins, Gibson '94)
• Popular multiple alignment tool today
• 'W' = 'weighted' (different parts of the alignment are weighted differently)
• Three-step process:
1) Construct pairwise alignments
2) Build a guide tree
3) Progressive alignment guided by the tree
Step 1: Pairwise Alignment
• Aligns each sequence against every other, giving a similarity matrix
• Similarity = exact matches / sequence length (percent identity)

      v1    v2    v3    v4
v1    -
v2    .17   -
v3    .87   .28   -
v4    .59   .33   .62   -

(.17 means 17% identical)
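Percent identity can be read off a finished pairwise alignment. The sketch below takes the two gapped rows as input and divides the number of exact matches by the length of the shorter ungapped sequence; the slide only says "sequence length", so that choice of denominator is an assumption.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Fraction of identical residues in a pairwise alignment (two gapped rows)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)
    # Assumed denominator: length of the shorter sequence without gaps.
    shorter = min(len(row_a.replace(gap, "")), len(row_b.replace(gap, "")))
    return matches / shorter if shorter else 0.0


print(percent_identity("ACGT", "ACGA"))
print(percent_identity("AC-GT", "ACGGT"))
```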
Step 2: Guide Tree
• Use the similarity matrix to create a guide tree by applying some clustering method
• The guide tree roughly reflects evolutionary relations
• ClustalW uses the neighbor-joining method, which iteratively:
  - Selects the closest pair of sequences/subtrees
  - Combines them into a single subtree
  - Re-computes the distances from the new subtree to all the other sequences/subtrees
Step 2: Guide Tree (cont'd)
[Figure: guide tree joining v1 and v3 first, then v4, then v2]

      v1    v2    v3    v4
v1    -
v2    .17   -
v3    .87   .28   -
v4    .59   .33   .62   -

Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)
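The order of the merges can be derived from the similarity matrix with a simple greedy clustering. Note that the sketch below is not neighbor-joining: as a simpler stand-in it uses average linkage (the mean similarity between the members of two clusters), which is enough to reproduce the merge order of this example. The matrix values are the ones from the slide.

```python
names = ["v1", "v2", "v3", "v4"]
sim = {frozenset(pair): value for pair, value in {
    ("v1", "v2"): 0.17, ("v1", "v3"): 0.87, ("v2", "v3"): 0.28,
    ("v1", "v4"): 0.59, ("v2", "v4"): 0.33, ("v3", "v4"): 0.62}.items()}


def merge_order(names, sim):
    """Repeatedly join the two most similar clusters (average linkage)."""
    clusters = [(name,) for name in names]

    def linkage(a, b):
        values = [sim[frozenset((x, y))] for x in a for y in b]
        return sum(values) / len(values)

    order = []
    while len(clusters) > 1:
        pairs = [(a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]]
        a, b = max(pairs, key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        order.append((a, b))
    return order


for a, b in merge_order(names, sim):
    print("align", a, "with", b)
```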
Step 3: Progressive Alignment
• Start by aligning the two most similar sequences
• Using the guide tree, add in the next most similar pair (seq-seq, seq-prof or prof-prof)
• Insert gaps as necessary
• Many ad-hoc rules: weighting, different matrices, special gap scores…

FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ

Dots and stars show how well conserved a column is
CLUSTALW algorithm
1. Build a table of edit distances between every pair of sequences
2. Build a tree by joining the similar sequences, in order of their similarity
[Figure: tree with leaves Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale]
3. Build the alignment in the order dictated by the tree. There are three cases:
  - aligning a pair of sequences
  - adding a single sequence to an existing alignment
  - aligning two alignments
CLUSTALW algorithm
Subtle points
• Different weights are given to the sequences, so that if there are several very similar sequences, the relative weight of each one is reduced
• A special method is used to set the cost of inserting gaps
Disadvantages
• Since the construction is irreversible, the result depends very strongly on the quality of the alignment of the first pairs
• Gaps never disappear ("once a gap, always a gap")
• High sensitivity to the gap penalty
Advantages
• Fast
• Reasonable
• Many parameters can be set
• Everybody uses it
CLUSTALW algorithm
• We can deduce a pairwise alignment for every two sequences in the multiple alignment (the projected pairwise alignment)
• The projected pairwise alignment is not necessarily the best pairwise alignment for the two sequences
• CLUSTALW is not an optimal algorithm: better alignments might exist. The algorithm yields a possible alignment, but not necessarily the best one
[Figure: best pairwise alignment (optimal) vs. projected pairwise alignment]
ClustalW at EMBL
http://www.ebi.ac.uk/clustalw (ClustalW at the SRS site at EBI)
httpwwwcstauacil~ulitskyicgGBAfastatxt
ClustalW Output: Aln format
MSA algorithms
• Progressive methods (CLUSTALW, T-Coffee)
• Iterative methods (Dialign)
• Direct optimization (Monte Carlo, genetic algorithms)
• Local methods: eMotifs, Blocks, PSI-BLAST
MSA Editing: Jalview
Conservation
www.es.embnet.org/Services/MolBio/jalview/index.html
MSA formats - FASTA
MSA formats - Aln
MSA formats - MSF
Example 1a: a good MSA
Example 1b: making an MSA of distantly related proteins
Example 1c: including more distant relatives in the MSA
Example 2: Isopenicillin N Synthase
• Mononuclear iron proteins - electron carrier proteins. Iron atoms are bound to amino acid side chains
• In IPNS the metal ion is coordinated by three protein residues
• IPNS is involved in the biosynthesis of penicillin
[Figure: reaction scheme ACV → Isopenicillin N, with Fe+2, ascorbate and O2, releasing 2 H2O; the Fe centre is drawn bound to His268, His212, Asp214, the sulfur of ACV, O2 (NO) and H2O]
Research: IPNS
• Goal: identify the Fe+2 binding residues
• Possible solutions:
1. In the lab
2. Bioinformatic approach (comparing different IPNS sequences)
Step 1
Multiple alignment of known IPNS
Implementation:
1. Obtain sequences (e.g. from NCBI): IPNS AND Bacteria[Organism]
2. MSA (ClustalW) and search for conserved residues in the MSA
MSA - bacteria only: not enough variation
MSA - bacteria & fungi: not enough variation
Step 2
Goal: add more enzymes similar to IPNS
Implementation:
• Search in http://www.expasy.org/tools/blast
• BLAST with IPNS_CEPAC as the query
• Select sequences similar to the query over its entire length
• Export in FASTA format
• Run CLUSTALW
• New multiple alignment: narrowing down the possibilities
Simple multiple alignment
• The known IPNS sequences are very similar
• The sequences of closely related enzymes are also quite similar
• Not enough variability to characterize the active sites
• We need to obtain even more distant sequences (distant homologs)
Step 3
Using the results of the MSA for further searches
Implementation:
1. Obtain an MSA (ClustalW)
2. Either construct a consensus sequence and perform a new search, or construct a profile and perform a new search
3. MSA (ClustalW) and search for conserved residues in the MSA
Consensus Sequence
• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column
• Consensus: each position reflects the most common character found at that position
A T C T T G T
A A C T T G T
A A C T T C T
Consensus:
A A C T T G T
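Taking the most frequent character of each column is a one-liner. A minimal sketch (ties are broken arbitrarily by whichever character is seen first, which is an assumption, since the slide does not say how ties are handled):

```python
from collections import Counter


def consensus(rows):
    """Most frequent character of every alignment column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


print(consensus(["ATCTTGT", "AACTTGT", "AACTTCT"]))
```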
Profile
• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters of the alignment at each column
• Profile: each position reflects the frequency of each character found at that position
A T C T T G T
A A C T T G T
A A C T T C T

     1     2     3     4     5     6     7
A    1     0.67  0     0     0     0     0
T    0     0.33  0     1     1     0     1
C    0     0     1     0     0     0.33  0
G    0     0     0     0     0     0.67  0
Profile vs Consensus
• The following multiple alignments have the same consensus
Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T
Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T
Consensus of both:
A A C T T G T
Profile vs Consensus
• But they have different profiles
Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

     1     2     3     4     5     6     7
A    0.67  1     0     0     0     0     0
T    0     0     0     1     0.67  0     0.67
C    0.33  0     0.67  0     0.33  0.33  0.33
G    0     0     0.33  0     0     0.67  0

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

     1     2     3     4     5     6     7
A    1     1     0     0     0     0     0
T    0     0     0     1     1     0     1
C    0     0     1     0     0     0.33  0
G    0     0     0     0     0     0.67  0
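The point that a consensus loses information which a profile keeps can be checked on the two alignments. In this sketch the consensus is read off the profile as the most frequent character of each column; the helper names are illustrative.

```python
from collections import Counter


def profile(rows):
    """One dict per column: character -> relative frequency."""
    n = len(rows)
    return [{ch: c / n for ch, c in Counter(col).items()} for col in zip(*rows)]


def consensus_from_profile(prof):
    """Pick the most frequent character of every profile column."""
    return "".join(max(col, key=col.get) for col in prof)


first = ["AACTTGC", "AAGTCGT", "CACTTCT"]
second = ["AACTTGT", "AACTTGT", "AACTTCT"]

print(consensus_from_profile(profile(first)))
print(consensus_from_profile(profile(second)))
print(profile(first) == profile(second))
```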
Sequence LOGO
A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T
Logo: A gc c g G C T
http://weblogo.berkeley.edu
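In a sequence logo the height of a column is usually its information content: the maximum entropy (2 bits for DNA) minus the observed entropy of the column. The sketch below computes this for the alignment above. It ignores gaps and leaves out the small-sample correction that logo tools may apply; both are simplifying assumptions.

```python
import math
from collections import Counter

rows = ["AACCGCTCTT", "AGCCGCGC-T", "A-CAGAGCCT",
        "AAGCACGC-T", "ACGGGTGCTT", "ATGC-CGC-T"]


def column_information(column, alphabet_size=4, gap="-"):
    """Information content of one column in bits (gaps are ignored)."""
    letters = [ch for ch in column if ch != gap]
    if not letters:
        return 0.0
    total = len(letters)
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in Counter(letters).values())
    return math.log2(alphabet_size) - entropy


for position, column in enumerate(zip(*rows), start=1):
    print(position, round(column_information(column), 2))
```

Fully conserved columns (1, 8 and 10 here) reach the maximum of 2 bits, which is why their letters appear tallest in the logo.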
PSI-BLAST
• Position-Specific Iterated BLAST: an automatic profile-like search
1. Regular BLAST
2. Construct a profile from the BLAST results
3. BLAST profile search
4. Final results
• Alignment with distantly related proteins
• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Isopenicillin N Synthase (IPNS)

Enzyme      Relative activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type   100                 0.4       388            969
His48Ala    16                  0.56      75             134
His63Ala    31                  1.0       142            142
His114Ala   28                  0.85      125            147
His124Ala   48                  0.84      321            381
His135Ala   22                  0.59      117            198
His212Ala   <0.007              nd        nd
His268Ala   <0.003              nd        nd
Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      238            528
Asp131Ala   68                  0.48      363            755
Asp203Ala   32                  0.91      123            135
Asp214Ala   <0.004              nd        nd
Why multiple sequence alignment
bull Structure similarity ndash aa that play the same role in each structure are in the same column
bull Evolutionary similarity ndash aa related to the same ancestor are in the same column
bull Functional similarity - aa with the same function are in the same column
bull When seqs are closely related structure-evolution-functional similarity equivalent
3
Multiple Alignment Definition
CG copy Ron Shamir 065
Input Sequences S1 S2 hellip Sk over the same alphabetOutput Gapped sequences Srsquo1 Srsquo2 hellip Srsquok of equal length
1 |Srsquo1|= |Srsquo2|=hellip= |Srsquok|
2 Removal of spaces from Srsquoi gives Si for all i
Example
CG copy Ron Shamir 066
S1=AGGTC
S2=GTTCG
S3=TGAACPossible alignment
A-T
GGG
G--
TTA
-TA
CCC
-G-
Possible alignment
AG-
GTT
GTG
T-A
--A
CCA
-GC
CG copy Ron Shamir 068
Example
CG copy Ron Shamir 069
Multiple sequence alignment of 7 neuroglobins using clustalx
Human-centric beta globin Multiple Alignment
CG copy Ron Shamir 0610 httpglobincsepsuedu
MSA applicationsbull Generate protein familiesbull Extrapolation ndash membership of uncharacterized sequence to a
protein familybull Understand evolution - preliminary step in molecular evolution
analysis for constructing phylogenetic trees eg is the duck evolutionary closer to a lion or to a fruit fly
bull Pattern identification ndash find the important (conserved) region in the protein - conserved positions may characterize a function
bull Domain identification ndash Build a consensusprofilemotif that describe the protein family help to describe new members of the family
bull DNA regulatory elementsbull Structure prediction (secondary and 3D model)bull Alignment of multiple sequences may reveal weak signals
11
Protein Phylogenies ndash Example
CG copy Ron Shamir 0612
Kinase domain
Scoring alignments
bull Given input seqs S1 S2 hellip Sk find a multiple alignment of optimal score
bull Scores previewndash Sum of pairsndash Consensusndash Tree
bull Varying methods (and controversy)
CG copy Ron Shamir 0615
Sum of Pairs scoreDef Induced pairwise alignment
A pairwise alignment induced by the multiple alignment
Example
x AC-GCGG-C y AC-GC-GAG z GCCGC-GAG
Induces
x ACGCGG-C x AC-GCGG-C y AC-GCGAGy ACGC-GAC z GCCGC-GAG z GCCGCGAG
CG copy Ron Shamir 0616
S(M) = kltl (Srsquok Srsquol)
SOP Score Example
CG copy Ron Shamir 0617
Consider the following alignment
AC-CDB--C-ADBDA-BCDAD
Scoring scheme match - 0mismatchindel - -1
SP score -3 -5 -4 =-12
Multiple Alignment with SOP scores is NP-hard
18
בתיאוריה כן באופן האם אפשר פשוט להכליל את שיטת התיכנות הדינמי פרקטי לא
מדובר בזמן ריצהובגודל זיכרון הגדלים
כאשר NKכ Nהוא אורך הרצף Kמספר הרצפים
למעשה בלתי אפשריKgt3 עבור
הבעיה החישובית בהינתן מספר רצפים מצא את ההתאמה שלהם למסגרת תקבל ערך אופטימלי שפונקצית מרחקמשותפת )עי הוספת רווחים( כך
-SCGPFIRVMSCGPGLRA-SCTPHL-AMSC-PKIRGMS-LPLLRNMSHKPALRA
SCGPFIRVMSCGPGLRASCTPHLAMSCPKIRGMSLPLLRNMSHKPALRA
Consensus MSA
bull Score ndashsum of the distances from the sequences (with gaps as in the alignment) to some consensus sequence
bull More difficult to finddefine as the consensus sequence itself is difficult to define
bull Used mainly for computational proofs
CG copy Ron Shamir 0619
20
-SCGPFIRVMSCGPGLRA-SCTPHL-A
-SCGPFIRVMSCGPGLRA
-SCGPFIRV-SCTPHL-A
MSCGPGLRA-SCTPHL-A
5 3 5 13
Scoring metrics -examplesSum of pairs
-SCGPFIRVMSCGPGLRA-SCTPHL-AMSC-PKIRGMS-LPLLRNMSHKPALRA
3 5 4 1 6 2 4 6 354סהכ הומוגניות
3 1 2 5 0 4 2 0 42 19סהכ מרחק
Distance from concensus
Tree MSA
bull Input Tree T a string for each leafbull Phylogenetic alignment for T Assignment of a string
to each internal node
bull Score ndash (weighted) sum of scores along edgesbull Goal find phyl alignment of optimal scorebull Consensus = phyl Alignment where T is a star
CG copy Ron Shamir 0621
CTGG
CCGG
GTTC
CTTG
GTTG
GTTG
CTGG
Profile Representation of MA
bull Alternatively use log oddsbull pi(a) = fraction of arsquos in col i bull p(a) = fraction of arsquos overallbull log pi(a)p(a)
CG copy Ron Shamir 0623
- A G G C T A T C A C C T G T A G ndash C T A C C A - - - G C A G ndash C T A C C A - - - G C A G ndash C T A T C A C ndash G G C A G ndash C T A T C G C ndash G G
A 1 1 8 C 6 1 4 1 6 2G 1 2 2 4 1T 2 1 6 2- 2 8 4 8 4
Aligning a sequence to a profile
bull Key in pairwise alignment is scoring two positions xy (xy)
bull For a letter x and a column y in a profile (xy)=value of x in col Y
bull Invent a score for (x-)bull Run the DP alg for pairwise alignment
CG copy Ron Shamir 0625
Aligning alignments
bull Given two alignments how can we align them
bull Hint use DP on the corresponding profiles
CG copy Ron Shamir 0626
x GGGCACTGCATy GGTTACGTC-- Alignment 1 z GGGAACTGCAG
w GGACGTACC-- Alignment 2v GGACCT-----
x GGGCACTGCATy GGTTACGTC--z GGGAACTGCAG
w GGACGTACC-- v GGACCT-----
Multiple Alignment Greedy Heuristic
bull Choose most similar pair of sequences and combine into a profile thereby reducing alignment of k sequences to an alignment of of k-1 sequencesprofiles Repeat
CG copy Ron Shamir 0627
u1= ACGTACGTACGThellip
u2 = TTAATTAATTAAhellip
u3 = ACTACTACTACThellip
hellip
uk = CCGGCCGGCCGG
u1= ACgtTACgtTACgcThellip
u2 = TTAATTAATTAAhellip
hellip
uk = CCGGCCGGCCGGhellip
k
k-1
ClustalW Thompson Higgins Gibson 94
bull Popular multiple alignment tool todaybull lsquoWrsquo = lsquoweightedrsquo (different parts of alignment
are weighted differently)bull Three-step process
1) Construct pairwise alignments2) Build Guide Tree3) Progressive alignment guided by the tree
CG copy Ron Shamir 0628
Step 1 Pairwise Alignment
bull Aligns each sequence against each other giving a similarity matrix
bull Similarity = exact matches sequence length (percent identity)
CG copy Ron Shamir 0629
v1 v2 v3 v4
v1 -v2 17 -v3 87 28 -v4 59 33 62 -
)17 means 17 identical(
Step 2: Guide Tree
• Use the similarity matrix to create a guide tree by applying some clustering method
• The guide tree roughly reflects evolutionary relations
• ClustalW uses the neighbor-joining method, which iteratively:
  • Selects the closest pair of sequences/subtrees
  • Combines them into a single subtree
  • Re-computes the distances from the new subtree to all the other sequences/subtrees
Step 2: Guide Tree (cont'd)
      v1   v2   v3   v4
v1     -
v2   .17    -
v3   .87  .28    -
v4   .59  .33  .62    -
[Figure: guide tree in which v1 and v3 are joined first, then v4, then v2]
Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)
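The merge order can be reproduced with a deliberately simplified clustering sketch. Note that this is not neighbor joining: it repeatedly merges the most similar pair and scores clusters by their average leaf-to-leaf similarity (an average-linkage assumption). The function name and data layout are invented for this example.

```python
# Similarity matrix from the slide, keyed by unordered pairs of sequence names.
SIM = {
    frozenset(("v1", "v2")): 0.17,
    frozenset(("v1", "v3")): 0.87,
    frozenset(("v2", "v3")): 0.28,
    frozenset(("v1", "v4")): 0.59,
    frozenset(("v2", "v4")): 0.33,
    frozenset(("v3", "v4")): 0.62,
}

def guide_tree_order(names, sim):
    """Return the merges (left cluster, right cluster) in the order they are made.

    Clusters are tuples of leaf names; the similarity of two clusters is the
    mean similarity over all pairs of their leaves.
    """
    clusters = [(n,) for n in names]

    def cluster_sim(c1, c2):
        pairs = [sim[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        # Pick the most similar pair among the current clusters.
        best = max(
            ((c1, c2) for i, c1 in enumerate(clusters) for c2 in clusters[i + 1:]),
            key=lambda pair: cluster_sim(*pair),
        )
        merges.append(best)
        clusters = [c for c in clusters if c not in best] + [best[0] + best[1]]
    return merges

if __name__ == "__main__":
    for left, right in guide_tree_order(["v1", "v2", "v3", "v4"], SIM):
        print(left, "+", right)
```

On the matrix above this joins v1 with v3, then adds v4, then v2, which is the order used in the "Calculate" steps.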
Step 3: Progressive Alignment
• Start by aligning the two most similar sequences
• Using the guide tree, add in the most similar pair (seq-seq, seq-prof or prof-prof)
• Insert gaps as necessary
• Many ad-hoc rules: weighting, different matrices, special gap scores…
FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ
Dots and stars show how well conserved a column is.
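CLUSTALW algorithm
1. Build a table of edit distances between every pair of sequences.
2. Build a tree by joining the similar sequences, in order of their similarity.
[Figure: guide tree with leaves Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale]
3. Build the alignment in the order dictated by the tree. There are three cases:
  - aligning a pair of sequences
  - aligning two alignments
  - adding a single sequence to an existing alignment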
CLUSTALW algorithm
Fine points
• Sequences are given different weights, so that if several sequences are very similar, the relative weight of each one is reduced
• A special method is used to set the cost of inserting gaps
Disadvantages
• Because the construction is irreversible, the result depends heavily on the quality of the alignment of the first pairs
• Gaps never disappear: "once a gap, always a gap"
• High sensitivity to the gap penalty
Advantages
• Fast
• Reasonable
• Many parameters can be set
• Everyone uses it
CLUSTALW algorithm
• We can deduce a pairwise alignment for every two sequences in the multiple alignment (the projected pairwise alignment)
• The projected pairwise alignment is NOT necessarily the best pairwise alignment for the two sequences
• CLUSTALW is not an optimal algorithm. Better alignments might exist. The algorithm yields a possible alignment, but not necessarily the best one.
[Figure: best (optimal) pairwise alignment vs. projected pairwise alignment]
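Projecting a multiple alignment onto two of its rows is simple enough to show directly: keep the two rows and drop every column in which both have a gap. The function name and the dict-based layout are assumptions made for this sketch.

```python
def project_pair(msa, a, b):
    """Pairwise alignment of rows a and b induced by the multiple alignment.

    msa maps sequence names to gapped rows of equal length; columns where
    both chosen rows hold a gap are removed.
    """
    cols = [(x, y) for x, y in zip(msa[a], msa[b]) if not (x == "-" and y == "-")]
    return "".join(x for x, _ in cols), "".join(y for _, y in cols)

# Rows of the five-sequence alignment from the "Aligning alignments" slide.
MSA = {
    "x": "GGGCACTGCAT",
    "y": "GGTTACGTC--",
    "z": "GGGAACTGCAG",
    "w": "GGACGTACC--",
    "v": "GGACCT-----",
}

if __name__ == "__main__":
    print(project_pair(MSA, "w", "v"))
```

Comparing the score of such a projection with the score of an optimal pairwise alignment of the same two sequences shows how much the multiple alignment gives up for that pair.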
ClustalW at EMBL
http://www.ebi.ac.uk/clustalw (ClustalW at the SRS site at EBI)
httpwwwcstauacil~ulitskyicgGBAfastatxt
ClustalW output: Aln format
MSA algorithms
• Progressive methods (CLUSTALW, T-Coffee)
• Iterative methods (Dialign)
• Direct optimization (Monte Carlo, genetic algorithms)
• Local methods: eMotifs, Blocks, Psi-blast
MSA Editing: Jalview
[Screenshot: Jalview alignment editor with a conservation track]
www.es.embnet.org/Services/MolBio/jalview/index.html
MSA formats - fasta
MSA formats - Aln
MSA formats - MSF
Example 1a: a good MSA
Example 1b: making an MSA of distantly related proteins
Example 1c: including more distant relatives in the MSA
Example 2: Isopenicillin N Synthase
• Mononuclear iron proteins – electron carrier proteins. Iron atoms are bound to amino acid side chains
• In IPNS the metal ion is coordinated by three protein residues
• IPNS is involved in the biosynthesis of penicillin
[Figure: IPNS converts ACV to isopenicillin N (Fe+2, ascorbate; O2 to 2 H2O). The active-site Fe is shown bound to His212, Asp214, His268, the sulfur of ACV, H2O and O2 (NO).]
Research: IPNS
• Goal: identify the Fe+2 binding residues
• Possible solutions:
  1. In the lab
  2. Bioinformatic approach (comparing different IPNS sequences)
Step 1
Multiple alignment of known IPNS
Implementation:
1. Obtain sequences (e.g. from NCBI): IPNS AND Bacteria[Organism]
2. MSA (clustalw) and search for conserved residues in the MSA
MSA – bacteria only
Not enough variation
MSA – bacteria & fungi
Not enough variation
Step 2
Goal: add more enzymes similar to IPNS
Implementation:
• Search in http://www.expasy.org/tools/blast
• Blast IPNS_CEPAC as the query
• Select sequences similar to the query over its entire length
• Export in FASTA format
• Run CLUSTALW
• New multiple alignment: narrowing down the possibilities
Simple multiple alignment
• The known IPNS sequences are very similar
• The sequences of closely related enzymes are also quite similar
• Not enough variability to characterize the active sites
• We need to obtain even more distant sequences (distant homologs)
Step 3
Using the results of the MSA for further searches
Implementation:
1. Obtain an MSA (clustalw)
2. - Construct a consensus sequence and perform a new search
   OR
   - Construct a profile and perform a new search
3. MSA (clustalw) and search for conserved residues in the MSA
Consensus Sequence
• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column.
• Consensus: each position reflects the most common character found at that position
A T C T T G T
A A C T T G T
A A C T T C T
Consensus:
A A C T T G T
Profile
• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters in the alignment at each column.
• Profile: each position reflects the frequency of each character found at that position
A T C T T G T
A A C T T G T
A A C T T C T
      1     2     3     4   …
A     1   0.67    0     0
T     0   0.33    0     1
C     0     0     1     0
G     0     0     0     0
Profile vs Consensus
• The following multiple alignments have the same consensus
Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T
Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T
Consensus (both):
A A C T T G T
Profile vs Consensus
• But they have different profiles
Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T
      1     2     3     4   …
A   0.67    1     0     0
T     0     0     0     1
C   0.33    0   0.67    0
G     0     0   0.33    0
Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T
      1     2     3     4   …
A     1     1     0     0
T     0     0     0     1
C     0     0     1     0
G     0     0     0     0
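The difference between the two summaries is easy to check in code. The sketch below computes both for the two alignments; the function names are invented for this example, and the rounding to two decimals is only for display.

```python
from collections import Counter

ALIGNMENT_1 = ["AACTTGC", "AAGTCGT", "CACTTCT"]
ALIGNMENT_2 = ["AACTTGT", "AACTTGT", "AACTTCT"]

def consensus(rows):
    """Most frequent character of each column (ties go to the first one seen)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def profile(rows):
    """Per-column character frequencies, rounded to two decimals."""
    n = len(rows)
    return [{ch: round(c / n, 2) for ch, c in Counter(col).items()}
            for col in zip(*rows)]

if __name__ == "__main__":
    print(consensus(ALIGNMENT_1), consensus(ALIGNMENT_2))
    print(profile(ALIGNMENT_1)[0], profile(ALIGNMENT_2)[0])
```

Both alignments collapse to the same consensus string, while their first profile columns already differ, which is why a profile search can be more sensitive than a consensus search.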
Sequence LOGO
A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T
[Figure: sequence logo of the alignment above]
http://weblogo.berkeley.edu
Psi Blast
• Position Specific Iterated: an automatic profile-like search
Regular blast → Construct profile from blast results → Blast profile search → Final results
• Alignment with distantly related proteins
• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme      Relative   Km     kcat      kcat/Km
            activity   (mM)   (min-1)   (mM-1 min-1)
Wild type   100        0.4    388       969
His48Ala    16         0.56   75        134
His63Ala    31         1.0    142       142
His114Ala   28         0.85   125       147
His124Ala   48         0.84   321       381
His135Ala   22         0.59   117       198
His212Ala   <0.007     nd     nd
His268Ala   <0.003     nd     nd
Asp14Ala    5          0.86   0.56      0.7
Asp113Ala   63         0.45   238       528
Asp131Ala   68         0.48   363       755
Asp203Ala   32         0.91   123       135
Asp214Ala   <0.004     nd     nd

Isopenicillin N Synthase – IPNS
Protein Phylogenies - Example
[Figure: protein phylogeny, kinase domain]
Scoring alignments
• Given input sequences S1, S2, …, Sk, find a multiple alignment of optimal score
• Scores preview:
  - Sum of pairs
  - Consensus
  - Tree
• Varying methods (and controversy)
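Sum of Pairs score
Def: Induced pairwise alignment: a pairwise alignment induced by the multiple alignment (take the two rows and drop every column in which both rows have a gap).

Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG

The sum-of-pairs score of a multiple alignment M is the sum of the scores of all induced pairwise alignments:
$S(M) = \sum_{k<l} \sigma(S'_k, S'_l)$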
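The projection step in the definition can be sketched in a few lines. This is an illustrative helper (the name `induced_pairwise` is an assumption, not from the slides) that keeps two rows of a multiple alignment and drops the columns where both have a gap.

```python
def induced_pairwise(row_a, row_b, gap="-"):
    """Project two rows of a multiple alignment onto a pairwise alignment."""
    # keep every column except those in which both rows hold a gap
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
for first, second in ((x, y), (x, z), (y, z)):
    print(induced_pairwise(first, second))
```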
SOP Score Example
Consider the following alignment:
AC-CDB-
-C-ADBD
A-BCDAD

Scoring scheme: match = 0, mismatch/indel = -1
SP score: -3 - 5 - 4 = -12

Multiple alignment with SOP scores is NP-hard.
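The SP arithmetic of this example can be reproduced with a short sketch. It assumes the flattened alignment splits into three rows of seven columns (AC-CDB-, -C-ADBD, A-BCDAD) and that a column with gaps in both rows scores 0; the function names are illustrative.

```python
from itertools import combinations

def pair_score(row_a, row_b, match=0, mismatch=-1, indel=-1, gap="-"):
    """Score the pairwise alignment induced by two rows."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            continue                    # column vanishes in the induced alignment
        if a == gap or b == gap:
            score += indel
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

def sp_score(alignment, **scoring):
    """Sum of the induced pairwise scores over all pairs of rows."""
    return sum(pair_score(a, b, **scoring) for a, b in combinations(alignment, 2))

rows = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print(sp_score(rows))
```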
The computational problem: given several sequences, find their alignment to a common frame (by inserting gaps) such that a distance function attains its optimal value.

Input:          Aligned:
SCGPFIRV        -SCGPFIRV
MSCGPGLRA       MSCGPGLRA
SCTPHLA         -SCTPHL-A
MSCPKIRG        MSC-PKIRG
MSLPLLRN        MS-LPLLRN
MSHKPALRA       MSHKPALRA

Can we simply generalize the dynamic programming method? In theory yes; in practice no:
• Running time and memory grow as N^K, where N is the sequence length and K is the number of sequences.
• For K > 3 this is infeasible in practice.
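To get a feeling for the N^K growth, the size of the K-dimensional dynamic programming table can be tabulated. This is a back-of-the-envelope sketch, assuming K sequences of equal length N; the helper names are illustrative.

```python
def dp_table_cells(n, k):
    """Number of cells in the k-dimensional DP table for k sequences of length n."""
    return (n + 1) ** k

def predecessors_per_cell(k):
    """Each cell looks back along every non-empty subset of the k sequences."""
    return 2 ** k - 1

for k in (2, 3, 4, 6):
    print(k, dp_table_cells(100, k), predecessors_per_cell(k))
```

Even for sequences of length 100 the table passes a million cells at K = 3 and a hundred million at K = 4, before any per-cell work is counted.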
Consensus MSA
• Score: the sum of the distances from the sequences (with gaps as in the alignment) to some consensus sequence
• Harder to find, because the consensus sequence itself is difficult to define
• Used mainly for computational proofs
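A consensus score can be sketched as follows. Two choices here are assumptions rather than part of the slides: the consensus is taken as the per-column majority character, and the distance is the number of mismatching positions (Hamming distance). For that distance the majority character minimizes the summed distance in each column.

```python
from collections import Counter

def consensus_score(alignment):
    """Return (consensus, total number of mismatches between rows and consensus)."""
    cons = "".join(Counter(col).most_common(1)[0][0] for col in zip(*alignment))
    total = sum(sum(a != c for a, c in zip(row, cons)) for row in alignment)
    return cons, total

cons, total = consensus_score(["ATCTTGT", "AACTTGT", "AACTTCT"])
print(cons, total)
```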
Scoring metrics - examples

Sum of pairs
-SCGPFIRV
MSCGPGLRA
-SCTPHL-A

Pair                      Identical positions
-SCGPFIRV / MSCGPGLRA     5
-SCGPFIRV / -SCTPHL-A     3
MSCGPGLRA / -SCTPHL-A     5
Total                     13

Distance from consensus
-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA

3 5 4 1 6 2 4 6 354    total homogeneity
3 1 2 5 0 4 2 0 42 19  total distance
Tree MSA
• Input: a tree T and a string for each leaf
• Phylogenetic alignment for T: an assignment of a string to each internal node
• Score: the (weighted) sum of scores along the edges
• Goal: find a phylogenetic alignment of optimal score
• Consensus = phylogenetic alignment where T is a star
[Figure: example tree whose nodes are labeled CTGG, CCGG, GTTC, CTTG, GTTG, GTTG, CTGG]
Profile Representation of MA

-  A  G  G  C  T  A  T  C  A  C  C  T  G
T  A  G  -  C  T  A  C  C  A  -  -  -  G
C  A  G  -  C  T  A  C  C  A  -  -  -  G
C  A  G  -  C  T  A  T  C  A  C  -  G  G
C  A  G  -  C  T  A  T  C  G  C  -  G  G

      1    2    3    4    5    6    7    8    9    10   11   12   13   14
A     0    1    0    0    0    0    1    0    0    0.8  0    0    0    0
C     0.6  0    0    0    1    0    0    0.4  1    0    0.6  0.2  0    0
G     0    0    1    0.2  0    0    0    0    0    0.2  0    0    0.4  1
T     0.2  0    0    0    0    1    0    0.6  0    0    0    0    0.2  0
-     0.2  0    0    0.8  0    0    0    0    0    0    0.4  0.8  0.4  0

• Alternatively, use log odds:
• p_i(a) = fraction of a's in column i
• p(a) = fraction of a's overall
• Score: log(p_i(a) / p(a))
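The log-odds variant can be sketched as below. Several choices are assumptions: the natural logarithm is used (the slide does not fix a base), the gap character is treated like any other symbol, the background p(a) is taken over all cells of the alignment, and only characters that actually occur in a column receive a score (so no log of zero is taken). The function name is illustrative.

```python
import math
from collections import Counter

def log_odds_profile(alignment):
    """Per column: log(p_i(a) / p(a)) for every character a seen in that column."""
    n_rows = len(alignment)
    background = Counter("".join(alignment))       # counts over the whole alignment
    total = sum(background.values())
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({
            ch: math.log((cnt / n_rows) / (background[ch] / total))
            for ch, cnt in counts.items()
        })
    return profile

rows = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
scores = log_odds_profile(rows)
print(round(scores[1]["A"], 3))
```

Column 2 contains only A, while A makes up 14 of the 70 cells overall, so its score is log(1 / 0.2) = log 5.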
Aligning a sequence to a profile
• The key in pairwise alignment is scoring two positions x, y: σ(x,y)
• For a letter x and a column y in a profile, σ(x,y) = value of x in column y
• Invent a score for σ(x,-)
• Run the DP algorithm for pairwise alignment
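The recipe above can be sketched as a global alignment DP in which the pairwise score is looked up in the profile column. This is a minimal sketch under stated assumptions: the profile stores plain frequencies, and a single hypothetical `gap_score` is used both for a letter against a gap and for a column against a gap. It returns only the best score, without traceback.

```python
from collections import Counter

def column_frequencies(alignment):
    """Profile as a list of dicts: character -> frequency in that column."""
    n_rows = len(alignment)
    return [{ch: cnt / n_rows for ch, cnt in Counter(col).items()}
            for col in zip(*alignment)]

def align_to_profile(seq, profile, gap_score=-1.0):
    """Best global score of aligning seq against the profile columns."""
    n, m = len(seq), len(profile)
    # dp[i][j]: best score for seq[:i] against the first j profile columns
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap_score
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap_score
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # sigma(x, y) = value of letter x in profile column y
            match = dp[i - 1][j - 1] + profile[j - 1].get(seq[i - 1], 0.0)
            dp[i][j] = max(match,
                           dp[i - 1][j] + gap_score,   # letter against gap
                           dp[i][j - 1] + gap_score)   # column against gap
    return dp[n][m]

profile = column_frequencies(["ATCTTGT", "AACTTGT", "AACTTCT"])
full = align_to_profile("AACTTGT", profile)
short = align_to_profile("ACTTGT", profile)
print(full, short)
```

Aligning two alignments works the same way once a column-against-column score is defined, which is the hint given on the next slide.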
Aligning alignments
• Given two alignments, how can we align them?
• Hint: use DP on the corresponding profiles.

Alignment 1:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG

Alignment 2:
w GGACGTACC--
v GGACCT-----

Combined alignment:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
w GGACGTACC--
v GGACCT-----
Multiple Alignment: Greedy Heuristic
• Choose the most similar pair of sequences and combine them into a profile, thereby reducing the alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat.

k sequences:
u1 = ACGTACGTACGT…
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…
…
uk = CCGGCCGGCCGG

k-1 sequences/profiles:
u1 = ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…
…
uk = CCGGCCGGCCGG…
ClustalW (Thompson, Higgins, Gibson 1994)
• Popular multiple alignment tool today
• 'W' = 'weighted' (different parts of the alignment are weighted differently)
• Three-step process:
  1) Construct pairwise alignments
  2) Build a guide tree
  3) Progressive alignment guided by the tree
Step 1: Pairwise Alignment
• Aligns each sequence against each other, giving a similarity matrix
• Similarity = exact matches / sequence length (percent identity)

     v1    v2    v3    v4
v1   -
v2   .17   -
v3   .87   .28   -
v4   .59   .33   .62   -

(.17 means 17% identical)
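The similarity measure can be sketched as follows. The sketch assumes the two sequences have already been aligned to equal length and uses the aligned length as the denominator; both are assumptions, since the slide does not say which length is meant. The example rows are taken from the sum-of-pairs example, and the names v1, v2, v3 are placeholders.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Fraction of aligned positions holding the same non-gap character."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must be aligned to the same length")
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return matches / len(row_a)

def similarity_matrix(rows):
    """Lower-triangular similarity matrix keyed by (later name, earlier name)."""
    names = list(rows)
    return {(p, q): percent_identity(rows[p], rows[q])
            for i, p in enumerate(names) for q in names[:i]}

rows = {"v1": "-SCGPFIRV", "v2": "MSCGPGLRA", "v3": "-SCTPHL-A"}
sim = similarity_matrix(rows)
for pair, value in sim.items():
    print(pair, round(value, 2))
```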
Step 2: Guide Tree
• Use the similarity matrix to create a guide tree by applying some clustering method
• The guide tree roughly reflects evolutionary relations
• ClustalW uses the neighbor-joining method, which iteratively:
  - Selects the closest pair of sequences/subtrees
  - Combines them into a single subtree
  - Re-computes the distances from the new subtree to all the other sequences/subtrees
Step 2: Guide Tree (cont'd)
[Figure: guide tree joining v1 and v3 first, then v4, then v2]

     v1    v2    v3    v4
v1   -
v2   .17   -
v3   .87   .28   -
v4   .59   .33   .62   -

Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)
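The merge order can be reproduced from the similarity matrix with a simple greedy clustering. Note that this sketch uses average linkage over the original similarities, not the neighbor-joining method that ClustalW applies; it is only meant to show how a join order falls out of the matrix. The function name and the label format for merged clusters are illustrative.

```python
def greedy_merge_order(names, sim):
    """Repeatedly join the two clusters with the highest average similarity."""
    clusters = {name: (name,) for name in names}      # label -> member sequences
    order = []

    def cluster_sim(a, b):
        pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
        return sum(sim[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        labels = list(clusters)
        a, b = max(((p, q) for i, p in enumerate(labels) for q in labels[i + 1:]),
                   key=lambda pq: cluster_sim(*pq))
        order.append((a, b))
        merged = clusters.pop(a) + clusters.pop(b)
        clusters["(" + a + "," + b + ")"] = merged
    return order

sim = {
    frozenset({"v1", "v2"}): 0.17,
    frozenset({"v1", "v3"}): 0.87,
    frozenset({"v2", "v3"}): 0.28,
    frozenset({"v1", "v4"}): 0.59,
    frozenset({"v2", "v4"}): 0.33,
    frozenset({"v3", "v4"}): 0.62,
}
order = greedy_merge_order(["v1", "v2", "v3", "v4"], sim)
print(order)
```

With the matrix from the slide, v1 and v3 are joined first, then v4 is added, and v2 comes last, matching the order of the three alignment steps listed above.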
Step 3: Progressive Alignment
• Start by aligning the two most similar sequences
• Using the guide tree, add in the most similar pair (seq-seq, seq-prof, or prof-prof)
• Insert gaps as necessary
• Many ad-hoc rules: weighting, different matrices, special gap scores…

FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ

Dots and stars show how well-conserved a column is.
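CLUSTALW algorithm
1. Build a table of the edit distance between every two sequences.
2. Build a tree by joining similar sequences in the order of their similarity.
   [Figure: tree over Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale]
3. Build the alignment in the order dictated by the tree. There are three cases:
   - aligning a pair of sequences
   - aligning two alignments
   - adding a single sequence to an existing alignment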
CLUSTALW algorithm
Fine points
• Sequences are given different weights, so that when several sequences are very similar the relative weight of each one decreases
• A special method sets the cost of inserting gaps
Disadvantages
• Because the construction is irreversible, the result depends heavily on the quality of the alignment of the first pairs
• Gaps never disappear ("once a gap, always a gap")
• High sensitivity to the gap penalty
Advantages
• Fast
• Reasonable
• Many parameters can be set
• Everyone uses it
CLUSTALW algorithm
• We can deduce a pairwise alignment for every two sequences in the multiple alignment (the projected pairwise alignment)
• The projected pairwise alignment is NOT necessarily the best pairwise alignment for the two sequences
• CLUSTALW is not an optimal algorithm: better alignments might exist. The algorithm yields a possible alignment, but not necessarily the best one
[Figure: best (optimal) pairwise alignment vs. projected pairwise alignment]
ClustalW at EMBL
http://www.ebi.ac.uk/clustalw (ClustalW at the SRS site at EBI)
httpwwwcstauacil~ulitskyicgGBAfastatxt
ClustalW output: Aln format
MSA algorithms
• Progressive methods (CLUSTALW, T-Coffee)
• Iterative methods (Dialign)
• Direct optimization (Monte Carlo, genetic algorithms)
• Local methods: eMotifs, Blocks, PSI-BLAST
MSA Editing: Jalview
www.es.embnet.org/Services/MolBio/jalview/index.html
[Screenshot: Jalview, with conservation track]
MSA formats: FASTA
MSA formats: Aln
MSA formats: MSF
Example 1a: a good MSA
Example 1b: making an MSA of distantly related proteins
Example 1c: including more distant relatives in the MSA
Example 2: Isopenicillin N Synthase
• Mononuclear iron proteins: electron carrier proteins. Iron atoms are bound to amino acid side chains
• In IPNS the metal ion is coordinated by three protein residues
• IPNS is involved in the biosynthesis of penicillin
[Figure: IPNS converts ACV to isopenicillin N, with Fe+2, ascorbate and O2 (releasing 2 H2O); in the active site the Fe is bound by His212, Asp214 and His268, together with the ACV sulfur, H2O and O2 (NO)]
Research: IPNS
• Goal: identify the Fe+2-binding residues
• Possible solutions:
  1. In the lab
  2. Bioinformatic approach (comparing different IPNS sequences)
Step 1
Multiple alignment of known IPNS sequences
Implementation:
1. Obtain sequences (e.g. from NCBI): IPNS AND Bacteria[Organism]
2. MSA (ClustalW) and search for conserved residues in the MSA

[Figure: MSA, bacteria only. Not enough variation]
[Figure: MSA, bacteria & fungi. Not enough variation]
Step 2
Goal: add more enzymes similar to IPNS
Implementation:
• Search in http://www.expasy.org/tools/blast
• BLAST with IPNS_CEPAC as the query
• Select sequences similar to the query over the entire length
• Export in FASTA format
• Run CLUSTALW
• New multiple alignment: narrowing down the possibilities
Simple multiple alignment
• The known IPNS sequences are very similar
• Sequences of closely related enzymes are also quite similar
• Not enough variability to categorize the active sites
• We need to obtain even more distant sequences (distant homologs)
Step 3
Using the results of the MSA for further searches
Implementation:
1. Obtain an MSA (ClustalW)
2. Either construct a consensus sequence and perform a new search, or construct a profile and perform a new search
3. MSA (ClustalW) and search for conserved residues in the MSA
Consensus Sequence
• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column.
• Consensus: each position reflects the most common character found at that position.

A T C T T G T
A A C T T G T
A A C T T C T

A A C T T G T   (consensus)
Profilebull We can deduce a statistical model describing the
multiple sequence alignment A Profile holds statistical information about characters in alignment at each column
bull Profile each position reflects the frequency of the character found at a position
A T C T T G T
A A C T T G T
A A C T T C T
1 2 3 4 5 6
A 1 067 0 0
T 0 033 1 1
C 0 0 0 0
G 0 0 0 0
58
Profile vs ConsensusbullThe following multiple alignments will have
the same consensusA A C T T G C
A A G T C G T
C A C T T C T
A A C T T G T
A A C T T G T
A A C T T C T
A A C T T G T
59
Profile vs ConsensusbullBut have a different profile
A A C T T G C
A A G T C G T
C A C T T C T
A A C T T G T
A A C T T G T
A A C T T C T
1 2 3 4 5 6
A 066 1 0 0
T 0 0 0 1
C 033 0 066
0
G 0 0 033
0
1 2 3 4 5 6
A 1 1 0 0
T 0 0 0 1
C 0 0 1 0
G 0 0 0 0
60
Sequence LOGO A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C ndash C G C - T
A gc c g G C T
httpweblogoberkeleyedu
61
Psi BlastbullPosition Specific Iterated - automatic profile-
like search
Regular blast
Construct profile from blast results
Blast profile search
Final results
62
bull Alignment with distantly related proteins
63
bull Experimental evidence supports the finding that His212 Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS
Enzyme Relative Km kcat kcatKmActivity (mM) (min-1) (mM-1
min-1)
Wild type 100 04 388 969His48Ala 16 056 75 134His63Ala 31 10 142 142His114Ala 28 085 125 147His124Ala 48 084 321 381His135Ala 22 059 117 198His212Ala lt0007 nd ndHis268Ala lt0003 nd nd Asp14Ala 5 086 056 07Asp113Ala 63 045 238 528Asp131Ala 68 048 363 755Asp203Ala 32 091 123 135Asp214Ala lt0004 nd nd
Isopenicillin N Synthase
64
ndash IPNS
65
66
Protein Phylogenies - Example
[Figure: protein phylogeny example; label: Kinase domain]
Scoring alignments
• Given input sequences S1, S2, ..., Sk, find a multiple alignment of optimal score
• Scores preview:
  - Sum of pairs
  - Consensus
  - Tree
• Varying methods (and controversy)
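Sum of Pairs score
Def: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment: keep the two rows and drop every column in which both rows have a gap

Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG

The sum-of-pairs score of a multiple alignment M is the sum of the scores of all induced pairwise alignments:

$S(M) = \sum_{k<l} \sigma(S'_k, S'_l)$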
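The projection described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name `induced_pairwise` is hypothetical, and rows are assumed to be equal-length strings with "-" as the gap character.

```python
def induced_pairwise(msa, i, j):
    """Project rows i and j of a multiple alignment, dropping columns where both rows have a gap."""
    kept = [(a, b) for a, b in zip(msa[i], msa[j]) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


# The three rows x, y, z from the slide example.
msa = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(induced_pairwise(msa, i, j))
```

Only the (x, y) and (y, z) projections lose a column here, because x and z never have a gap in the same position.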
SOP Score Example
Consider the following alignment:
AC-CDB-
-C-ADBD
A-BCDAD
Scoring scheme: match = 0, mismatch/indel = -1
SP score: -3 - 5 - 4 = -12
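As a check on the arithmetic, here is a small Python sketch of the sum-of-pairs score. It assumes the 21-character example splits into three rows of seven columns, and that a column with a gap in both rows contributes 0; neither assumption is stated on the slide. The names `pair_score` and `sp_score` are hypothetical.

```python
from itertools import combinations


def pair_score(a, b, match=0, mismatch=-1, indel=-1):
    """Score one induced pairwise alignment column by column."""
    total = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # assumed: a gap-gap column contributes nothing
        if x == "-" or y == "-":
            total += indel
        else:
            total += match if x == y else mismatch
    return total


def sp_score(msa, **scheme):
    """Sum the pairwise scores over all pairs of rows."""
    return sum(pair_score(a, b, **scheme) for a, b in combinations(msa, 2))


msa = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print(sp_score(msa))  # -12
```

The three pairwise terms are -3 (rows 1 and 2), -4 (rows 1 and 3) and -5 (rows 2 and 3).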
Multiple Alignment with SOP scores is NP-hard

The computational problem: given several sequences, find their alignment to a common frame (by inserting gaps) such that a distance function attains its optimal value.

Input:
SCGPFIRV
MSCGPGLRA
SCTPHLA
MSCPKIRG
MSLPLLRN
MSHKPALRA

Output:
-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA

Can we simply generalize the dynamic programming method? In theory yes, in practice no:
• Running time and memory grow roughly as N^K, where N is the sequence length and K is the number of sequences
• For K > 3 this is practically impossible
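To get a feel for the N^K growth of the generalized dynamic program, the following sketch counts table cells. It assumes the usual formulation (not spelled out on the slide) with a K-dimensional table of (N+1)^K cells, each computed from 2^K - 1 predecessor cells. The function names are hypothetical.

```python
def dp_cells(n, k):
    """Cells in the k-dimensional DP table for k sequences of length n."""
    return (n + 1) ** k


def predecessors_per_cell(k):
    """Each cell looks at every non-empty subset of sequences that advance by one."""
    return 2 ** k - 1


for k in (2, 3, 4, 6):
    print(k, dp_cells(100, k), predecessors_per_cell(k))
```

With N = 100, going from two to four sequences multiplies the table size by about ten thousand, which is why exact DP is rarely used beyond three sequences.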
Consensus MSA
• Score - sum of the distances from the sequences (with gaps as in the alignment) to some consensus sequence
• More difficult to find/define, as the consensus sequence itself is difficult to define
• Used mainly for computational proofs
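Scoring metrics - examples

Sum of pairs (score of a pair = number of identical aligned residues):
-SCGPFIRV
MSCGPGLRA
-SCTPHL-A

-SCGPFIRV / MSCGPGLRA    5
-SCGPFIRV / -SCTPHL-A    3
MSCGPGLRA / -SCTPHL-A    5
Total                    13

Distance from consensus:
-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA

Per column, left to right:
Homogeneity (count of the most common character)    4 6 4 2 6 1 4 5 3    total 35
Distance (rows that differ from it)                 2 0 2 4 0 5 2 1 3    total 19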
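The distance-from-consensus numbers can be reproduced with a short Python sketch. It assumes that a gap counts as an ordinary character when picking the most common symbol in a column, which is consistent with the totals on the slide. The function names are hypothetical.

```python
from collections import Counter


def column_homogeneity(msa):
    """Per column: how many rows carry the most common character (gaps included)."""
    return [Counter(col).most_common(1)[0][1] for col in zip(*msa)]


def distance_from_consensus(msa):
    """Per column: how many rows differ from the most common character."""
    return [len(msa) - h for h in column_homogeneity(msa)]


msa = ["-SCGPFIRV", "MSCGPGLRA", "-SCTPHL-A", "MSC-PKIRG", "MS-LPLLRN", "MSHKPALRA"]
hom = column_homogeneity(msa)
dist = distance_from_consensus(msa)
print(hom, sum(hom))    # [4, 6, 4, 2, 6, 1, 4, 5, 3] 35
print(dist, sum(dist))  # [2, 0, 2, 4, 0, 5, 2, 1, 3] 19
```

The values are listed left to right by alignment column.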
Tree MSA
• Input: a tree T and a string for each leaf
• Phylogenetic alignment for T: assignment of a string to each internal node
• Score - (weighted) sum of scores along edges
• Goal: find a phylogenetic alignment of optimal score
• Consensus = phylogenetic alignment where T is a star
[Figure: example tree with node strings CTGG, CCGG, GTTC, CTTG, GTTG, GTTG, CTGG]
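A tree-alignment score is easy to compute once every node carries a string. The sketch below uses Hamming distance as the edge score and a made-up topology, since the figure's tree shape is not recoverable from the text. The node names a, b, c, d, u, w and the edge list are assumptions for illustration only.

```python
def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def tree_score(edges, labels, weights=None):
    """(Weighted) sum of edge scores for a tree whose nodes are all labeled."""
    weights = weights or {}
    return sum(weights.get(e, 1) * hamming(labels[e[0]], labels[e[1]]) for e in edges)


# Hypothetical topology: leaves a, b hang off internal node u; leaves c, d off w.
labels = {"a": "CTGG", "b": "CCGG", "c": "GTTC", "d": "GTTG", "u": "CTGG", "w": "GTTG"}
edges = [("u", "a"), ("u", "b"), ("w", "c"), ("w", "d"), ("u", "w")]
print(tree_score(edges, labels))  # 4
```

With a star topology (one internal node joined to every leaf) the same function gives the consensus score, with the internal label playing the role of the consensus string.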
Profile Representation of MA
-  A  G  G  C  T  A  T  C  A  C  C  T  G
T  A  G  -  C  T  A  C  C  A  -  -  -  G
C  A  G  -  C  T  A  C  C  A  -  -  -  G
C  A  G  -  C  T  A  T  C  A  C  -  G  G
C  A  G  -  C  T  A  T  C  G  C  -  G  G

     1   2   3   4   5   6   7   8   9   10  11  12  13  14
A    0   1   0   0   0   0   1   0   0   .8  0   0   0   0
C    .6  0   0   0   1   0   0   .4  1   0   .6  .2  0   0
G    0   0   1   .2  0   0   0   0   0   .2  0   0   .4  1
T    .2  0   0   0   0   1   0   .6  0   0   0   0   .2  0
-    .2  0   0   .8  0   0   0   0   0   0   .4  .8  .4  0

• Alternatively, use log odds:
  - p_i(a) = fraction of a's in column i
  - p(a) = fraction of a's overall
  - log( p_i(a) / p(a) )
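Both profile representations can be computed directly from the alignment. The following is a minimal sketch under two assumptions that the slide does not fix: the gap symbol is treated as a fifth character, and the natural logarithm is used for the log odds. The function names are hypothetical.

```python
import math


def profile(msa, alphabet="ACGT-"):
    """Column-wise character frequencies of an alignment."""
    n = len(msa)
    return [{a: col.count(a) / n for a in alphabet} for col in zip(*msa)]


def log_odds(msa, alphabet="ACGT-"):
    """log(p_i(a) / p(a)) for every character that occurs in column i."""
    text = "".join(msa)
    background = {a: text.count(a) / len(text) for a in alphabet}
    return [{a: math.log(p / background[a]) for a, p in col.items() if p > 0}
            for col in profile(msa, alphabet)]


msa = ["-AGGCTATCACCTG", "TAG-CTACCA---G", "CAG-CTACCA---G",
       "CAG-CTATCAC-GG", "CAG-CTATCGC-GG"]
print(profile(msa)[0])   # column 1: C 0.6, T 0.2, gap 0.2
print(log_odds(msa)[1])  # column 2 holds only A
```

Characters that never occur in a column are left out of the log-odds dictionary, since their log odds would be minus infinity; in practice pseudocounts are often added instead.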
Aligning a sequence to a profile
• The key in pairwise alignment is scoring two positions x, y: σ(x, y)
• For a letter x and a column y in a profile, σ(x, y) = value of x in column y
• Invent a score for σ(x, -)
• Run the DP algorithm for pairwise alignment
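The recipe above can be sketched as a standard global-alignment DP in which the substitution score is looked up in the profile column. This is an illustration under stated assumptions, not the lecture's implementation: a profile is a list of frequency dictionaries, the "invented" gap score is a constant `gap`, and the function name `align_to_profile` is hypothetical.

```python
def align_to_profile(seq, prof, gap=-1.0):
    """Global DP of a sequence against profile columns.

    Returns (score, gapped sequence, column index per position or None).
    """
    n, m = len(seq), len(prof)
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + prof[j - 1].get(seq[i - 1], 0.0),  # letter vs column
                S[i - 1][j] + gap,  # letter opposite a new all-gap profile column
                S[i][j - 1] + gap,  # profile column opposite a gap in the sequence
            )
    # Trace back one optimal path.
    i, j, out_seq, out_cols = n, m, [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + prof[j - 1].get(seq[i - 1], 0.0):
            out_seq.append(seq[i - 1]); out_cols.append(j - 1); i -= 1; j -= 1
        elif j > 0 and S[i][j] == S[i][j - 1] + gap:
            out_seq.append("-"); out_cols.append(j - 1); j -= 1
        else:
            out_seq.append(seq[i - 1]); out_cols.append(None); i -= 1
    return S[n][m], "".join(reversed(out_seq)), out_cols[::-1]


# Hypothetical three-column profile and a two-letter sequence.
prof = [{"A": 1.0}, {"C": 0.5, "G": 0.5}, {"T": 1.0}]
print(align_to_profile("AT", prof))  # (1.0, 'A-T', [0, 1, 2])
```

Aligning two alignments works the same way once a score between two profile columns is chosen, for example the expected pairwise score under both column distributions.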
Aligning alignments
• Given two alignments, how can we align them?
• Hint: use DP on the corresponding profiles

Alignment 1:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG

Alignment 2:
w GGACGTACC--
v GGACCT-----

Combined alignment:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
w GGACGTACC--
v GGACCT-----
Multiple Alignment Greedy Heuristic
• Choose the most similar pair of sequences and combine them into a profile, thereby reducing the alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat.

k sequences:
u1 = ACGTACGTACGT...
u2 = TTAATTAATTAA...
u3 = ACTACTACTACT...
...
uk = CCGGCCGGCCGG

k-1 sequences/profiles:
u1 = ACg/tTACg/tTACg/cT...
u2 = TTAATTAATTAA...
...
uk = CCGGCCGGCCGG...
ClustalW (Thompson, Higgins, Gibson 1994)
• Popular multiple alignment tool today
• 'W' = 'weighted' (different parts of the alignment are weighted differently)
• Three-step process:
  1) Construct pairwise alignments
  2) Build guide tree
  3) Progressive alignment guided by the tree
Step 1: Pairwise Alignment
• Aligns each sequence against each other, giving a similarity matrix
• Similarity = exact matches / sequence length (percent identity)

      v1    v2    v3    v4
v1    -
v2    .17   -
v3    .87   .28   -
v4    .59   .33   .62   -

(.17 means 17% identical)
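A percent-identity matrix can be sketched as follows. The example assumes the pairs are already aligned (rows of equal length with "-" gaps) and divides by the number of columns that are not gap-gap; ClustalW itself first computes the pairwise alignments, which this sketch skips. The sequences are taken from the "Aligning alignments" slide and the function names are hypothetical.

```python
from itertools import combinations


def percent_identity(a, b):
    """Exact matches divided by the number of columns that are not gap-gap."""
    cols = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    return sum(x == y for x, y in cols) / len(cols)


def similarity_matrix(rows):
    """Map each unordered pair of names to its percent identity."""
    return {(p, q): percent_identity(rows[p], rows[q]) for p, q in combinations(rows, 2)}


rows = {"x": "GGGCACTGCAT", "y": "GGTTACGTC--", "z": "GGGAACTGCAG"}
for pair, value in similarity_matrix(rows).items():
    print(pair, round(value, 2))
```

Other conventions for the denominator exist (shorter sequence length, aligned length without gaps); the choice changes the numbers slightly but usually not the ranking of pairs.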
Step 2: Guide Tree
• Use the similarity matrix to create a guide tree by applying some clustering method
• The guide tree roughly reflects evolutionary relations
• ClustalW uses the neighbor-joining method, which iteratively:
  - Selects the closest pair of sequences/subtrees
  - Combines them into a single subtree
  - Re-computes the distances from the new subtree to all the other sequences/subtrees

Step 2: Guide Tree (cont'd)
[Figure: guide tree joining v1 and v3 first, then v4, then v2]

Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)

      v1    v2    v3    v4
v1    -
v2    .17   -
v3    .87   .28   -
v4    .59   .33   .62   -
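The merge order on this slide can be reproduced with a simple greedy clustering. Note the swap: the sketch below uses average linkage on similarities (UPGMA-style) rather than neighbor joining, purely to keep the example short, and it reads the matrix entries as .17, .87 and so on. The function name is hypothetical.

```python
def guide_tree_order(names, sim):
    """Greedy average-linkage clustering; returns the list of merges in order."""
    clusters = [(n,) for n in names]

    def avg(c1, c2):
        return sum(sim[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        pairs = [(x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]]
        c1, c2 = max(pairs, key=lambda p: avg(*p))  # most similar pair of clusters
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2))
    return merges


raw = {("v1", "v2"): 0.17, ("v1", "v3"): 0.87, ("v2", "v3"): 0.28,
       ("v1", "v4"): 0.59, ("v2", "v4"): 0.33, ("v3", "v4"): 0.62}
sim = {frozenset(k): v for k, v in raw.items()}
merges = guide_tree_order(["v1", "v2", "v3", "v4"], sim)
for step in merges:
    print(step)
```

On this matrix the merges are v1 with v3, then v4, then v2, matching the order of the three alignment calls listed above.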
Step 3: Progressive Alignment
• Start by aligning the two most similar sequences
• Following the guide tree, add in the next most similar pair (seq-seq, seq-prof, or prof-prof)
• Insert gaps as necessary
• Many ad-hoc rules: weighting, different matrices, special gap scores...

FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ

Dots and stars show how well-conserved a column is
CLUSTALW algorithm
1. Build a table of edit distances between every pair of sequences
2. Build a tree by joining similar sequences in order of their similarity
   [Figure: guide tree with leaves Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale]
3. Build the alignment in the order dictated by the tree. There are three cases:
   - aligning a pair of sequences
   - aligning two alignments
   - adding a single sequence to an existing alignment

CLUSTALW algorithm
Subtle points:
• Sequences are given different weights, so that if there are several very similar sequences, the relative weight of each one decreases
• A special method sets the cost of inserting gaps
Disadvantages:
• Since the construction is irreversible, the result depends very strongly on the quality of the alignment of the first pairs
• Gaps never disappear: "Once a gap, always a gap"
• High sensitivity to the gap penalty
Advantages:
• Fast
• Reasonable
• Many parameters can be set
• Everyone uses it
CLUSTALW algorithm
• We can deduce a pairwise alignment for each two sequences in the multiple alignment (projected pairwise alignment)
• The projected pairwise alignment is NOT necessarily the best pairwise alignment for the two sequences
• CLUSTALW is not an optimal algorithm. Better alignments might exist. The algorithm yields a possible alignment, but not necessarily the best one
[Figure: best (optimal) pairwise alignment compared with the projected pairwise alignment]
ClustalW at EMBL
http://www.ebi.ac.uk/clustalw - ClustalW at the SRS site at EBI
httpwwwcstauacil~ulitskyicgGBAfastatxt
ClustalW Output: Aln format

MSA algorithms
• Progressive methods (CLUSTALW, T-Coffee)
• Iterative methods (Dialign)
• Direct optimization (Monte Carlo, genetic algorithms)
• Local methods: eMotifs, Blocks, Psi-blast
MSA Editing: Jalview
Conservation
www.es.embnet.org/Services/MolBio/jalview/index.html

MSA formats - fasta
MSA formats - Aln
MSA formats - MSF

Example 1a: a good MSA
Example 1b: making an MSA of distantly related proteins
Example 1c: including more distant relatives in the MSA
Example 2: Isopenicillin N Synthase (IPNS)
• Mononuclear iron proteins - electron carrier proteins. Iron atoms are bound to amino acid side chains
• In IPNS the metal ion is coordinated by three protein residues
• IPNS is involved in the biosynthesis of penicillin
[Figure: reaction scheme ACV -> isopenicillin N (labels: Fe+2, ascorbate, O2, 2 H2O) and Fe coordination scheme (labels: His212, Asp214, His268, S-ACV, O2 (NO), H2O)]
Research: IPNS
• Goal: identify the Fe+2 binding residues
• Possible solutions:
  1. In the lab
  2. Bioinformatic approach (comparing different IPNS sequences)

Step 1
Multiple alignment of known IPNS
Implementation:
1. Obtain sequences (e.g. from NCBI): IPNS AND Bacteria[Organism]
2. MSA (ClustalW) and search for conserved residues in the MSA
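The "search for conserved residues" step can be automated once the MSA is available as a list of equal-length rows. The following is a minimal sketch: the toy rows are invented for illustration and are not IPNS sequences, and the function name is hypothetical.

```python
def conserved_columns(msa):
    """1-based indices of columns where every row has the same non-gap residue."""
    return [i for i, col in enumerate(zip(*msa), start=1)
            if col[0] != "-" and len(set(col)) == 1]


# Hypothetical toy alignment (not real IPNS data).
msa = ["MHKD-H", "MHRDAH", "LHKDGH"]
print(conserved_columns(msa))  # [2, 4, 6]
```

If the input sequences are all very similar, almost every column is reported as conserved, which is exactly the "not enough variation" problem: the columns that matter cannot be told apart from the rest.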
MSA - bacteria only
[Figure: MSA of bacterial IPNS sequences] Not enough variation

MSA - bacteria & fungi
[Figure: MSA of bacterial and fungal IPNS sequences] Not enough variation
Step 2
Goal: add more enzymes similar to IPNS
Implementation:
• Search in http://www.expasy.org/tools/blast
• Blast IPNS_CEPAC as query
• Select sequences similar to the query over the entire length
• Export in FASTA format
• Run CLUSTALW
• New multiple alignment: narrowing down the possibilities

Simple multiple alignment
• The known IPNS sequences are very similar
• Sequences of closely related enzymes are also quite similar
• Not enough variability to characterize the active sites
• We need to obtain even more distant sequences (distant homologs)
Step 3
Using the results of the MSA for further searches
Implementation:
1. Obtain an MSA (ClustalW)
2. Either construct a consensus sequence and perform a new search,
   or construct a profile and perform a new search
3. MSA (ClustalW) and search for conserved residues in the MSA
Consensus Sequence
• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column
• Consensus: each position reflects the most common character found at that position

A T C T T G T
A A C T T G T
A A C T T C T
-------------
A A C T T G T   (consensus)
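Deriving the consensus is a one-liner per column. In the sketch below the function name is hypothetical, and ties are broken in favour of the character seen first in the column, which is a choice the slide does not specify.

```python
from collections import Counter


def consensus(msa):
    """Most frequent character of each column (ties go to the first one seen)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*msa))


msa = ["ATCTTGT", "AACTTGT", "AACTTCT"]
print(consensus(msa))  # AACTTGT
```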
Profile
• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters in the alignment at each column
• Profile: each position reflects the frequency of each character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

     1     2     3     4     5     6     7
A    1     0.67  0     0     0     0     0
T    0     0.33  0     1     1     0     1
C    0     0     1     0     0     0.33  0
G    0     0     0     0     0     0.67  0
Profile vs Consensus
• The following multiple alignments have the same consensus

Alignment 1        Alignment 2
A A C T T G C      A A C T T G T
A A G T C G T      A A C T T G T
C A C T T C T      A A C T T C T
-------------      -------------
A A C T T G T      A A C T T G T   (consensus)

Profile vs Consensus
• But they have different profiles

Alignment 1:
     1     2     3     4     5     6     7
A    0.67  1     0     0     0     0     0
T    0     0     0     1     0.67  0     0.67
C    0.33  0     0.67  0     0.33  0.33  0.33
G    0     0     0.33  0     0     0.67  0

Alignment 2:
     1     2     3     4     5     6     7
A    1     1     0     0     0     0     0
T    0     0     0     1     1     0     1
C    0     0     1     0     0     0.33  0
G    0     0     0     0     0     0.67  0
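The contrast between the two representations is easy to check in code. This sketch recomputes the consensus and the frequency profile for the two alignments on the slides; the function names are hypothetical and frequencies are rounded to two decimals.

```python
from collections import Counter


def consensus(msa):
    """Most frequent character of each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*msa))


def profile(msa, alphabet="ACGT"):
    """Column-wise character frequencies, rounded to two decimals."""
    n = len(msa)
    return [{a: round(col.count(a) / n, 2) for a in alphabet} for col in zip(*msa)]


aln1 = ["AACTTGC", "AAGTCGT", "CACTTCT"]
aln2 = ["AACTTGT", "AACTTGT", "AACTTCT"]
print(consensus(aln1), consensus(aln2))    # AACTTGT AACTTGT
print(profile(aln1)[0], profile(aln2)[0])  # column 1 differs
```

A search with the consensus cannot tell the two families apart, whereas a profile search still sees that column 1 of the first alignment tolerates C.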
Sequence LOGO
A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T
[Figure: sequence logo of the alignment above]
http://weblogo.berkeley.edu
Psi Blast
• Position Specific Iterated - automatic profile-like search

Regular blast -> Construct profile from blast results -> Blast profile search -> Final results

• Alignment with distantly related proteins
• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme       Relative    Km      kcat       kcat/Km
             activity    (mM)    (min-1)    (mM-1 min-1)
Wild type    100         0.4     38.8       96.9
His48Ala     16          0.56    7.5        13.4
His63Ala     31          1.0     14.2       14.2
His114Ala    28          0.85    12.5       14.7
His124Ala    48          0.84    32.1       38.1
His135Ala    22          0.59    11.7       19.8
His212Ala    <0.007      nd      nd
His268Ala    <0.003      nd      nd
Asp14Ala     5           0.86    0.56       0.7
Asp113Ala    63          0.45    23.8       52.8
Asp131Ala    68          0.48    36.3       75.5
Asp203Ala    32          0.91    12.3       13.5
Asp214Ala    <0.004      nd      nd
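The reasoning behind the table can be sketched as a simple filter: a residue is a ligand candidate if replacing it with alanine abolishes activity. The relative activities below are taken from the table, read as percent of wild type; the "<" entries are entered as their upper bounds with the decimal placement assumed, and the 1% threshold is a hypothetical choice.

```python
relative_activity = {
    "Wild type": 100, "His48Ala": 16, "His63Ala": 31, "His114Ala": 28,
    "His124Ala": 48, "His135Ala": 22, "His212Ala": 0.007, "His268Ala": 0.003,
    "Asp14Ala": 5, "Asp113Ala": 63, "Asp131Ala": 68, "Asp203Ala": 32,
    "Asp214Ala": 0.004,
}


def essential_residues(activity, threshold=1.0):
    """Residues whose alanine mutant keeps less than `threshold` percent activity."""
    return sorted(name[:-3] for name, value in activity.items()
                  if name != "Wild type" and value < threshold)


print(essential_residues(relative_activity))  # ['Asp214', 'His212', 'His268']
```

The three residues flagged here are the same ones that stay conserved once distant homologs are added to the MSA.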
[Figure: Isopenicillin N Synthase (IPNS)]
Scoring alignments
bull Given input seqs S1 S2 hellip Sk find a multiple alignment of optimal score
bull Scores previewndash Sum of pairsndash Consensusndash Tree
bull Varying methods (and controversy)
CG copy Ron Shamir 0615
Sum of Pairs scoreDef Induced pairwise alignment
A pairwise alignment induced by the multiple alignment
Example
x AC-GCGG-C y AC-GC-GAG z GCCGC-GAG
Induces
x ACGCGG-C x AC-GCGG-C y AC-GCGAGy ACGC-GAC z GCCGC-GAG z GCCGCGAG
CG copy Ron Shamir 0616
S(M) = kltl (Srsquok Srsquol)
SOP Score Example
CG copy Ron Shamir 0617
Consider the following alignment
AC-CDB--C-ADBDA-BCDAD
Scoring scheme match - 0mismatchindel - -1
SP score -3 -5 -4 =-12
Multiple Alignment with SOP scores is NP-hard
18
בתיאוריה כן באופן האם אפשר פשוט להכליל את שיטת התיכנות הדינמי פרקטי לא
מדובר בזמן ריצהובגודל זיכרון הגדלים
כאשר NKכ Nהוא אורך הרצף Kמספר הרצפים
למעשה בלתי אפשריKgt3 עבור
הבעיה החישובית בהינתן מספר רצפים מצא את ההתאמה שלהם למסגרת תקבל ערך אופטימלי שפונקצית מרחקמשותפת )עי הוספת רווחים( כך
-SCGPFIRVMSCGPGLRA-SCTPHL-AMSC-PKIRGMS-LPLLRNMSHKPALRA
SCGPFIRVMSCGPGLRASCTPHLAMSCPKIRGMSLPLLRNMSHKPALRA
Consensus MSA
bull Score ndashsum of the distances from the sequences (with gaps as in the alignment) to some consensus sequence
bull More difficult to finddefine as the consensus sequence itself is difficult to define
bull Used mainly for computational proofs
CG copy Ron Shamir 0619
20
-SCGPFIRVMSCGPGLRA-SCTPHL-A
-SCGPFIRVMSCGPGLRA
-SCGPFIRV-SCTPHL-A
MSCGPGLRA-SCTPHL-A
5 3 5 13
Scoring metrics -examplesSum of pairs
-SCGPFIRVMSCGPGLRA-SCTPHL-AMSC-PKIRGMS-LPLLRNMSHKPALRA
3 5 4 1 6 2 4 6 354סהכ הומוגניות
3 1 2 5 0 4 2 0 42 19סהכ מרחק
Distance from concensus
Tree MSA
bull Input Tree T a string for each leafbull Phylogenetic alignment for T Assignment of a string
to each internal node
bull Score ndash (weighted) sum of scores along edgesbull Goal find phyl alignment of optimal scorebull Consensus = phyl Alignment where T is a star
CG copy Ron Shamir 0621
CTGG
CCGG
GTTC
CTTG
GTTG
GTTG
CTGG
Profile Representation of MA
bull Alternatively use log oddsbull pi(a) = fraction of arsquos in col i bull p(a) = fraction of arsquos overallbull log pi(a)p(a)
CG copy Ron Shamir 0623
- A G G C T A T C A C C T G T A G ndash C T A C C A - - - G C A G ndash C T A C C A - - - G C A G ndash C T A T C A C ndash G G C A G ndash C T A T C G C ndash G G
A 1 1 8 C 6 1 4 1 6 2G 1 2 2 4 1T 2 1 6 2- 2 8 4 8 4
Aligning a sequence to a profile
bull Key in pairwise alignment is scoring two positions xy (xy)
bull For a letter x and a column y in a profile (xy)=value of x in col Y
bull Invent a score for (x-)bull Run the DP alg for pairwise alignment
CG copy Ron Shamir 0625
Aligning alignments
bull Given two alignments how can we align them
bull Hint use DP on the corresponding profiles
CG copy Ron Shamir 0626
x GGGCACTGCATy GGTTACGTC-- Alignment 1 z GGGAACTGCAG
w GGACGTACC-- Alignment 2v GGACCT-----
x GGGCACTGCATy GGTTACGTC--z GGGAACTGCAG
w GGACGTACC-- v GGACCT-----
Multiple Alignment Greedy Heuristic
bull Choose most similar pair of sequences and combine into a profile thereby reducing alignment of k sequences to an alignment of of k-1 sequencesprofiles Repeat
CG copy Ron Shamir 0627
u1= ACGTACGTACGThellip
u2 = TTAATTAATTAAhellip
u3 = ACTACTACTACThellip
hellip
uk = CCGGCCGGCCGG
u1= ACgtTACgtTACgcThellip
u2 = TTAATTAATTAAhellip
hellip
uk = CCGGCCGGCCGGhellip
k
k-1
ClustalW Thompson Higgins Gibson 94
bull Popular multiple alignment tool todaybull lsquoWrsquo = lsquoweightedrsquo (different parts of alignment
are weighted differently)bull Three-step process
1) Construct pairwise alignments2) Build Guide Tree3) Progressive alignment guided by the tree
CG copy Ron Shamir 0628
Step 1 Pairwise Alignment
bull Aligns each sequence against each other giving a similarity matrix
bull Similarity = exact matches sequence length (percent identity)
CG copy Ron Shamir 0629
v1 v2 v3 v4
v1 -v2 17 -v3 87 28 -v4 59 33 62 -
)17 means 17 identical(
Step 2 Guide Tree
bull Use the similarity method to create a Guide Tree by applying some clustering methodbull Guide tree roughly reflects evolutionary relationsbull ClustalW uses the neighbor-joining method which
iterativelybull Selects the closest pair of sequencessubtreesbull Combines them into a single subtreebull Re-computes the distances from the new subtree to all the other
sequencessubtrees
CG copy Ron Shamir 0630
Step 2 Guide Tree (contrsquod)
CG copy Ron Shamir 0631
v1
v3
v4 v2
Calculatev13 = alignment (v1 v3)v134 = alignment((v13)v4)v1234 = alignment((v134)v2)
v1 v2 v3 v4
v1 -v2 17 -v3 87 28 -v4 59 33 62 -
Step 3 Progressive Alignmentbull Start by aligning the two most similar sequencesbull Using the guide tree add in the most similar pair
(seq-seq seq-prof or prof-prof)bull Insert gaps as necessarybull Many ad-hoc rules weighting different matrices
special gap scoreshellip
CG copy Ron Shamir 0632
FOS_RAT PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFDFOS_MOUSE PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFDFOS_CHICK SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFDFOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQFOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ
Dots and stars show how well-conserved a column is
33
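CLUSTALW algorithm
1. Build a table of edit distances between every pair of sequences
2. Build a tree by joining the similar sequences in order of their similarity
   [Figure: tree with leaves Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale]
3. Build the alignment in the order dictated by the tree. There are three cases:
   - aligning a pair of sequences
   - aligning two alignments
   - adding a single sequence to an existing alignment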
CLUSTALW algorithm
Subtle points
• Sequences are given different weights, so that if there are several very similar sequences the relative weight of each one is reduced
• A special method is used to set the cost of inserting gaps
Disadvantages
• Because the construction is irreversible, the result depends very strongly on the quality of the alignment of the first pairs
• Gaps never disappear ("once a gap, always a gap")
• High sensitivity to the gap penalties
Advantages
• Fast
• Reasonable
• Many parameters can be set
• Everyone uses it
CLUSTALW algorithm
• We can deduce a pairwise alignment for every two sequences in the multiple alignment (projected pairwise alignment)
• The projected pairwise alignment is NOT necessarily the best pairwise alignment for the two sequences
• CLUSTALW is not an optimal algorithm. Better alignments might exist. The algorithm yields a possible alignment, but not necessarily the best one
[Figure: best (optimal) pairwise alignment compared with the projected pairwise alignment]
ClustalW at EMBL
http://www.ebi.ac.uk/clustalw (ClustalW at the SRS site at EBI)
httpwwwcstauacil~ulitskyicgGBAfastatxt
ClustalW Output: Aln format
MSA algorithms
• Progressive methods (CLUSTALW, T-Coffee)
• Iterative methods (Dialign)
• Direct optimization (Monte Carlo, genetic algorithms)
• Local methods: eMotifs, Blocks, Psi-blast
MSA Editing: Jalview
Conservation
www.es.embnet.org/Services/MolBio/jalview/index.html
MSA formats - fasta
MSA formats - Aln
MSA formats - MSF
Example 1a: a good MSA
Example 1b: making an MSA of distantly related proteins
Example 1c: including more distant relatives in the MSA
Example 2: Isopenicillin N Synthase
• Mononuclear iron proteins - electron carrier proteins. Iron atoms are bound to amino acid side chains
• In IPNS the metal ion is coordinated by three protein residues
• IPNS is involved in the biosynthesis of penicillin
[Figure: reaction scheme ACV → Isopenicillin N (Fe+2, ascorbate, O2; 2H2O), and the Fe active site with ligands His212, Asp214, His268, S-ACV, O2 (NO) and H2O]
Research: IPNS
• Goal: identify Fe+2 binding residues
• Possible solutions:
1. In the lab
2. Bioinformatic approach (comparing different IPNS sequences)
Step 1
Multiple alignment of known IPNS
Implementation:
1. Obtain sequences (e.g. from NCBI): IPNS AND Bacteria[Organism]
2. MSA (clustalw) and search for conserved residues in the MSA
MSA - bacteria only
Not enough variation
MSA - bacteria & fungi
Not enough variation
Step 2
Goal: add more enzymes similar to IPNS
Implementation:
• Search in http://www.expasy.org/tools/blast
• Blast IPNS_CEPAC as query
• Select sequences similar to the query over its entire length
• Export in FASTA format
• Run CLUSTALW
• New multiple alignment: narrowing down the possibilities

Simple multiple alignment
• The known IPNS sequences are very similar
• Sequences of closely related enzymes are also quite similar
• Not enough variability to categorize the active sites
• We need to obtain even more distant sequences (distant homologs)
Step 3
Using the results of the MSA for further searches
Implementation:
1. Obtain an MSA (clustalw)
2. Construct a consensus sequence and perform a new search, OR construct a profile and perform a new search
3. MSA (clustalw) and search for conserved residues in the MSA
Consensus Sequence
• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column
• Consensus: each position reflects the most common character found at that position

Alignment:
A T C T T G T
A A C T T G T
A A C T T C T
Consensus:
A A C T T G T
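The column-wise majority rule can be written directly. This is a minimal sketch; the function name `consensus` is hypothetical, and ties are broken arbitrarily (here, by whichever character appears first in the column).

```python
from collections import Counter


def consensus(rows):
    """Most frequent character in each column of an alignment."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


alignment = ["ATCTTGT", "AACTTGT", "AACTTCT"]
result = consensus(alignment)
```

For the three-row alignment on this slide the result is AACTTGT.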
Profile
• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters in the alignment at each column
• Profile: each position reflects the frequency of each character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

Frequencies for the first four columns:
     1     2     3     4
A    1     0.67  0     0
T    0     0.33  0     1
C    0     0     1     0
G    0     0     0     0
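A frequency profile is just a per-column count divided by the number of rows. The sketch below assumes a DNA alphabet and no special treatment of gaps; the function name `profile` is hypothetical.

```python
from collections import Counter


def profile(rows, alphabet="ATCG"):
    """Map each character to its list of per-column frequencies."""
    columns = list(zip(*rows))
    return {
        ch: [Counter(col)[ch] / len(col) for col in columns]
        for ch in alphabet
    }


freqs = profile(["ATCTTGT", "AACTTGT", "AACTTCT"])
# freqs["A"][1] is the frequency of A in column 2 (two of three rows).
```

Unlike the consensus, the profile keeps the minority characters, which is why two alignments with the same consensus can still have different profiles.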
Profile vs Consensus
• The following multiple alignments have the same consensus

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

Consensus (both):
A A C T T G T
Profile vs Consensus
• But they have different profiles

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

Profile 1 (first four columns):
     1     2     3     4
A    0.66  1     0     0
T    0     0     0     1
C    0.33  0     0.66  0
G    0     0     0.33  0

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

Profile 2 (first four columns):
     1     2     3     4
A    1     1     0     0
T    0     0     0     1
C    0     0     1     0
G    0     0     0     0
Sequence LOGO
A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T
A gc c g G C T
http://weblogo.berkeley.edu
Psi-Blast
• Position Specific Iterated: automatic profile-like search
Regular blast → Construct profile from blast results → Blast profile search → Final results
• Alignment with distantly related proteins
• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme      Relative activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type   100                 0.4       388            969
His48Ala    16                  0.56      75             134
His63Ala    31                  1.0       142            142
His114Ala   28                  0.85      125            147
His124Ala   48                  0.84      321            381
His135Ala   22                  0.59      117            198
His212Ala   <0.007              nd        nd
His268Ala   <0.003              nd        nd
Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      238            528
Asp131Ala   68                  0.48      363            755
Asp203Ala   32                  0.91      123            135
Asp214Ala   <0.004              nd        nd

Isopenicillin N Synthase - IPNS
Sum of Pairs score
Def: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment

Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:
x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG

$S(M) = \sum_{k<l} \sigma(S'_k, S'_l)$
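Inducing a pairwise alignment amounts to taking two rows of the multiple alignment and dropping every column in which both rows have a gap. A minimal sketch, with the hypothetical function name `induced_pairwise`:

```python
def induced_pairwise(row_a: str, row_b: str):
    """Project two rows of a multiple alignment, removing gap-gap columns."""
    kept = [(x, y) for x, y in zip(row_a, row_b) if not (x == "-" and y == "-")]
    return "".join(x for x, _ in kept), "".join(y for _, y in kept)


x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
xy = induced_pairwise(x, y)
xz = induced_pairwise(x, z)
yz = induced_pairwise(y, z)
```

For the pair (x, z) no column is removed, because the two rows never have a gap in the same position.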
SOP Score Example
Consider the following alignment:
AC-CDB-
-C-ADBD
A-BCDAD
Scoring scheme: match = 0, mismatch/indel = -1
SP score: -3 - 5 - 4 = -12
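The example can be checked mechanically. The sketch below assumes the scoring scheme stated on the slide (match 0, mismatch or indel -1) and additionally assumes that a column where both rows have a gap scores 0; the function names are hypothetical.

```python
from itertools import combinations


def pair_score(a: str, b: str, mismatch: int = -1) -> int:
    """Score two rows column by column: 0 if equal (including gap-gap), else mismatch."""
    return sum(0 if x == y else mismatch for x, y in zip(a, b))


def sp_score(rows) -> int:
    """Sum-of-pairs score: add up the scores of all induced pairwise alignments."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))


rows = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
pair_scores = [pair_score(a, b) for a, b in combinations(rows, 2)]
total = sp_score(rows)
```

The three pairwise scores are -3, -4 and -5, and they add up to the SP score of -12 given above.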
Multiple Alignment with SOP scores is NP-hard

The computational problem: given several sequences, find their alignment to a common frame (by inserting gaps) such that a distance function attains its optimal value.

Unaligned:
SCGPFIRV
MSCGPGLRA
SCTPHLA
MSCPKIRG
MSLPLLRN
MSHKPALRA

Aligned:
-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA

Can we simply generalize the dynamic programming method? In theory yes; in practice no.
Running time and memory both grow as N^K, where N is the sequence length and K is the number of sequences.
For K > 3 this is practically impossible.
Consensus MSA
• Score: sum of the distances from the sequences (with gaps as in the alignment) to some consensus sequence
• More difficult to find/define, as the consensus sequence itself is difficult to define
• Used mainly for computational proofs
Scoring metrics - examples

Sum of pairs:
-SCGPFIRV
MSCGPGLRA
-SCTPHL-A

-SCGPFIRV     -SCGPFIRV     MSCGPGLRA
MSCGPGLRA     -SCTPHL-A     -SCTPHL-A
Identical positions per pair: 5, 3, 5 (total 13)

Distance from consensus:
-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA
Homogeneity per column: 4 6 4 2 6 1 4 5 3 (total 35)
Distance per column:    2 0 2 4 0 5 2 1 3 (total 19)
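The distance-from-consensus figures can be recomputed column by column. This sketch assumes that "homogeneity" is the count of the most common character in a column (a gap is treated as an ordinary character) and that the distance is the number of rows that differ from it; both function names are hypothetical.

```python
from collections import Counter


def column_homogeneity(rows):
    """Number of rows agreeing with the most common character, per column."""
    return [Counter(col).most_common(1)[0][1] for col in zip(*rows)]


def consensus_distance(rows):
    """Number of rows differing from the most common character, per column."""
    return [len(rows) - h for h in column_homogeneity(rows)]


rows = [
    "-SCGPFIRV",
    "MSCGPGLRA",
    "-SCTPHL-A",
    "MSC-PKIRG",
    "MS-LPLLRN",
    "MSHKPALRA",
]
homogeneity = column_homogeneity(rows)
distance = consensus_distance(rows)
```

With six rows and nine columns the two totals always add up to 54, and the total distance for this alignment is 19.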
Tree MSA
• Input: tree T, a string for each leaf
• Phylogenetic alignment for T: assignment of a string to each internal node
• Score: (weighted) sum of scores along edges
• Goal: find a phylogenetic alignment of optimal score
• Consensus = phylogenetic alignment where T is a star
[Figure: example tree with node strings CTGG, CCGG, GTTC, CTTG, GTTG, GTTG, CTGG]
Profile Representation of MA

- A G G C T A T C A C C T G
T A G - C T A C C A - - - G
C A G - C T A C C A - - - G
C A G - C T A T C A C - G G
C A G - C T A T C G C - G G

     1    2    3    4    5    6    7    8    9    10   11   12   13   14
A    0    1    0    0    0    0    1    0    0    0.8  0    0    0    0
C    0.6  0    0    0    1    0    0    0.4  1    0    0.6  0.2  0    0
G    0    0    1    0.2  0    0    0    0    0    0.2  0    0    0.4  1
T    0.2  0    0    0    0    1    0    0.6  0    0    0    0    0.2  0
-    0.2  0    0    0.8  0    0    0    0    0    0    0.4  0.8  0.4  0

• Alternatively, use log odds:
• p_i(a) = fraction of a's in column i
• p(a) = fraction of a's overall
• log p_i(a)/p(a)
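The log-odds variant can be sketched as follows. The assumptions here are mine: both p_i(a) and p(a) are taken as fractions over all cells including gaps, the natural logarithm is used, and only residues actually observed in a column get a score (an unobserved residue would give log 0). The function name `log_odds_profile` is hypothetical.

```python
import math
from collections import Counter


def log_odds_profile(rows, gap="-"):
    """Per column, map each observed residue a to log(p_i(a) / p(a))."""
    cells = [ch for row in rows for ch in row]
    background = Counter(cells)          # counts of each character overall
    total = len(cells)
    scores = []
    for col in zip(*rows):
        counts = Counter(col)
        scores.append({
            a: math.log((n / len(col)) / (background[a] / total))
            for a, n in counts.items()
            if a != gap
        })
    return scores


rows = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
scores = log_odds_profile(rows)
```

A fully conserved column gets a positive score, while a residue that is no more common in its column than in the whole alignment scores about zero.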
Aligning a sequence to a profile
• The key in pairwise alignment is scoring two positions x, y: σ(x,y)
• For a letter x and a column y in a profile, σ(x,y) = value of x in column y
• Invent a score for σ(x,-)
• Run the DP algorithm for pairwise alignment
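The recipe above can be sketched as a standard global-alignment DP in which one side is a profile. This is an illustration under stated assumptions: the profile is a list of per-column dictionaries of frequencies, the "invented" gap score is a constant -1 for gaps on either side, and only the optimal score is returned (no traceback). The function name `align_to_profile` is hypothetical.

```python
def align_to_profile(seq, profile, gap=-1.0):
    """Best global alignment score of a sequence against a profile."""
    n, m = len(seq), len(profile)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap               # sequence letters aligned to gaps
    for j in range(1, m + 1):
        dp[0][j] = j * gap               # profile columns aligned to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # sigma(x, y) = value of letter x in profile column y
            match = dp[i - 1][j - 1] + profile[j - 1].get(seq[i - 1], 0.0)
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]


toy_profile = [{"A": 1.0}, {"C": 0.5, "G": 0.5}, {"T": 1.0}]
full = align_to_profile("ACT", toy_profile)
short = align_to_profile("AT", toy_profile)
```

Aligning two alignments works the same way once σ is defined between two profile columns instead of between a letter and a column.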
Aligning alignments
• Given two alignments, how can we align them?
• Hint: use DP on the corresponding profiles

Alignment 1:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG

Alignment 2:
w GGACGTACC--
v GGACCT-----

Combined:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
w GGACGTACC--
v GGACCT-----
Multiple Alignment: Greedy Heuristic
• Choose the most similar pair of sequences and combine them into a profile, thereby reducing the alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat.

k sequences:
u1 = ACGTACGTACGT…
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…
…
uk = CCGGCCGGCCGG

k-1 sequences/profiles:
u1 = ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…
…
uk = CCGGCCGGCCGG…
bull Popular multiple alignment tool todaybull lsquoWrsquo = lsquoweightedrsquo (different parts of alignment
are weighted differently)bull Three-step process
1) Construct pairwise alignments2) Build Guide Tree3) Progressive alignment guided by the tree
CG copy Ron Shamir 0628
Step 1 Pairwise Alignment
bull Aligns each sequence against each other giving a similarity matrix
bull Similarity = exact matches sequence length (percent identity)
CG copy Ron Shamir 0629
v1 v2 v3 v4
v1 -v2 17 -v3 87 28 -v4 59 33 62 -
)17 means 17 identical(
Step 2 Guide Tree
bull Use the similarity method to create a Guide Tree by applying some clustering methodbull Guide tree roughly reflects evolutionary relationsbull ClustalW uses the neighbor-joining method which
iterativelybull Selects the closest pair of sequencessubtreesbull Combines them into a single subtreebull Re-computes the distances from the new subtree to all the other
sequencessubtrees
CG copy Ron Shamir 0630
Step 2 Guide Tree (contrsquod)
CG copy Ron Shamir 0631
v1
v3
v4 v2
Calculatev13 = alignment (v1 v3)v134 = alignment((v13)v4)v1234 = alignment((v134)v2)
v1 v2 v3 v4
v1 -v2 17 -v3 87 28 -v4 59 33 62 -
Step 3 Progressive Alignmentbull Start by aligning the two most similar sequencesbull Using the guide tree add in the most similar pair
(seq-seq seq-prof or prof-prof)bull Insert gaps as necessarybull Many ad-hoc rules weighting different matrices
special gap scoreshellip
CG copy Ron Shamir 0632
FOS_RAT PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFDFOS_MOUSE PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFDFOS_CHICK SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFDFOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQFOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ
Dots and stars show how well-conserved a column is
CLUSTALW algorithm

1. Build a table of edit distances between every pair of sequences
2. Build a tree by joining similar sequences in order of their similarity
   [Figure: tree with leaves Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale]
3. Build the alignment in the order dictated by the tree
   There are three cases:
   - Aligning a pair of sequences
   - Aligning two alignments
   - Adding a single sequence to an existing alignment

CLUSTALW algorithm

Fine points
• Sequences are given different weights, so that when several sequences are very similar the relative weight of each one is reduced
• A special method sets the cost of inserting gaps

Disadvantages
• Because the construction is irreversible, the result depends very strongly on the quality of the alignment of the first pairs
• Gaps never disappear ("once a gap, always a gap")
• High sensitivity to the gap penalty

Advantages
• Fast
• Reasonable
• Many parameters can be set
• Everyone uses it
CLUSTALW algorithm

• We can deduce a pairwise alignment for every two sequences in the multiple alignment (the projected pairwise alignment)
• The projected pairwise alignment is NOT necessarily the best pairwise alignment for the two sequences
• CLUSTALW is not an optimal algorithm: better alignments might exist. The algorithm yields a possible alignment, but not necessarily the best one

[Figure: best pairwise alignment (optimal) vs. projected pairwise alignment]
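A projected pairwise alignment can be sketched as follows: take two rows of the multiple alignment and drop every column in which both rows hold a gap. The function name `project_pair` is hypothetical; the example rows are taken from the scoring-metrics slide of this deck.

```python
def project_pair(msa, i, j, gap="-"):
    """Project rows i and j out of an MSA, dropping columns where both
    rows contain a gap."""
    kept = [(a, b) for a, b in zip(msa[i], msa[j])
            if not (a == gap and b == gap)]
    row_i = "".join(a for a, _ in kept)
    row_j = "".join(b for _, b in kept)
    return row_i, row_j


if __name__ == "__main__":
    msa = ["-SCGPFIRV", "MSCGPGLRA", "-SCTPHL-A"]
    print(project_pair(msa, 0, 2))
```

The projected pair keeps whatever gaps the multiple alignment forced on it, which is why it need not coincide with the optimal pairwise alignment of the same two sequences.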
ClustalW at EMBL

http://www.ebi.ac.uk/clustalw (ClustalW at the SRS site at EBI)
httpwwwcstauacil~ulitskyicgGBAfastatxt

ClustalW Output: Aln format
MSA algorithms

• Progressive methods (CLUSTALW, T-Coffee)
• Iterative methods (Dialign)
• Direct optimization (Monte Carlo, genetic algorithms)
• Local methods: eMotifs, Blocks, PSI-BLAST
MSA Editing: Jalview

[Figure: Jalview screenshot of an alignment, with the Conservation annotation marked]
www.es.embnet.org/Services/MolBio/jalview/index.html

MSA formats: FASTA
MSA formats: Aln
MSA formats: MSF

Example 1a: a good MSA
Example 1b: making an MSA of distantly related proteins
Example 1c: including more distant relatives in the MSA
Example 2: Isopenicillin N Synthase

• Mononuclear iron proteins: electron carrier proteins. Iron atoms are bound to amino acid side chains
• In IPNS the metal ion is coordinated by three protein residues
• IPNS is involved in the biosynthesis of penicillin

[Figure: IPNS converts ACV to isopenicillin N (Fe+2, ascorbate; O2 in, 2 H2O out). In the active site, Fe is coordinated by His268, His212, Asp214, H2O, the sulfur of ACV, and O2 (NO)]
Research: IPNS

• Goal: identify the Fe+2 binding residues
• Possible solutions:
    1. In the lab
    2. Bioinformatic approach (comparing different IPNS sequences)

Step 1

Multiple alignment of known IPNS sequences

Implementation:
1. Obtain sequences (e.g. from NCBI): IPNS AND Bacteria[Organism]
2. MSA (clustalw) and search for conserved residues in the MSA
MSA - bacteria only

Not enough variation

MSA - bacteria & fungi

Not enough variation

[Figure: alignment with rows grouped as "bacteria & fungi" and "bacteria"]
Step 2

Goal: Add more enzymes similar to IPNS

Implementation:
• Search in http://www.expasy.org/tools/blast
• BLAST with IPNS_CEPAC as query
• Select sequences similar to the query over its entire length
• Export in FASTA format
• Run CLUSTALW

• New multiple alignment: narrowing down the possibilities
Simple multiple alignment

• The known IPNS sequences are very similar
• Sequences of closely related enzymes are also quite similar
• Not enough variability to identify the active sites
• We need to obtain even more distant sequences (distant homologs)

Step 3

Using the results of the MSA for further searches

Implementation:
1. Obtain an MSA (clustalw)
2. - Construct a consensus sequence and perform a new search
   OR
   - Construct a profile and perform a new search
3. MSA (clustalw) and search for conserved residues in the MSA
Consensus Sequence

• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column
• Consensus: each position reflects the most common character found at that position

A T C T T G T
A A C T T G T
A A C T T C T
-------------
A A C T T G T   (consensus)
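The consensus of the three sequences above can be computed with a few lines. This is a minimal sketch; the function name `consensus` is hypothetical, and ties are broken in favor of the character seen first in the column, which is an assumption (the slides do not specify a tie rule).

```python
from collections import Counter


def consensus(rows):
    """Return the most frequent character of each alignment column.

    Ties go to the character that appears first in the column, because
    Counter.most_common keeps first-encountered order for equal counts.
    """
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*rows))


if __name__ == "__main__":
    rows = ["ATCTTGT", "AACTTGT", "AACTTCT"]
    print(consensus(rows))
```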
Profile

• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters of the alignment at each column
• Profile: each position reflects the frequency of each character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

     1     2     3     4     5     6
A    1     0.67  0     0     0     0
T    0     0.33  0     1     1     0
C    0     0     1     0     0     0.33
G    0     0     0     0     0     0.67
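A profile of this kind can be sketched as per-column relative frequencies. The function name `profile` and the fixed alphabet argument are assumptions made for the illustration.

```python
from collections import Counter


def profile(rows, alphabet="ATCG"):
    """Return one {character: relative frequency} dict per column."""
    n = len(rows)
    columns = []
    for column in zip(*rows):
        counts = Counter(column)
        columns.append({ch: counts[ch] / n for ch in alphabet})
    return columns


if __name__ == "__main__":
    rows = ["ATCTTGT", "AACTTGT", "AACTTCT"]
    for position, freqs in enumerate(profile(rows), start=1):
        print(position, {ch: round(f, 2) for ch, f in freqs.items()})
```

Unlike the consensus, the profile keeps the minority characters, so two alignments with the same consensus can still be told apart.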
Profile vs Consensus

• The following multiple alignments will have the same consensus

Alignment 1        Alignment 2
A A C T T G C      A A C T T G T
A A G T C G T      A A C T T G T
C A C T T C T      A A C T T C T

Consensus (both): A A C T T G T

Profile vs Consensus

• But they have different profiles

Profile of alignment 1:
     1     2     3     4     5     6
A    0.66  1     0     0     0     0
T    0     0     0     1     0.66  0
C    0.33  0     0.66  0     0.33  0.33
G    0     0     0.33  0     0     0.66

Profile of alignment 2:
     1     2     3     4     5     6
A    1     1     0     0     0     0
T    0     0     0     1     1     0
C    0     0     1     0     0     0.33
G    0     0     0     0     0     0.66
Sequence LOGO

A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T

A gc c g G C T

http://weblogo.berkeley.edu
PSI-BLAST

• Position-Specific Iterated BLAST: an automatic profile-like search

Regular BLAST -> Construct profile from BLAST results -> BLAST profile search -> Final results
• Alignment with distantly related proteins

• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme      Relative   Km     kcat      kcat/Km
            activity   (mM)   (min-1)   (mM-1 min-1)
Wild type   100        0.4    388       969
His48Ala    16         0.56   75        134
His63Ala    31         1.0    142       142
His114Ala   28         0.85   125       147
His124Ala   48         0.84   321       381
His135Ala   22         0.59   117       198
His212Ala   <0.007     nd     nd
His268Ala   <0.003     nd     nd
Asp14Ala    5          0.86   0.56      0.7
Asp113Ala   63         0.45   238       528
Asp131Ala   68         0.48   363       755
Asp203Ala   32         0.91   123       135
Asp214Ala   <0.004     nd     nd

Isopenicillin N Synthase - IPNS
Consensus MSA

• Score: sum of the distances from the sequences (with gaps as in the alignment) to some consensus sequence
• More difficult to find/define, as the consensus sequence itself is difficult to define
• Used mainly for computational proofs
Scoring metrics: examples

Sum of pairs

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A

Pair                       Matching columns
-SCGPFIRV / MSCGPGLRA      5
-SCGPFIRV / -SCTPHL-A      3
MSCGPGLRA / -SCTPHL-A      5
Sum of pairs               13

Distance from consensus

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA

Column        1  2  3  4  5  6  7  8  9   Total
Homogeneity   4  6  4  2  6  1  4  5  3   35
Distance      2  0  2  4  0  5  2  1  3   19
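Both example scores can be reproduced with a short sketch. The assumptions here are mine, not the slides': a pair scores 1 for every column in which both sequences carry the same non-gap character, and the distance from the consensus counts, per column, the characters that differ from the most frequent one, with a gap treated as an ordinary character. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_score(a, b, gap="-"):
    """Number of columns where both rows hold the same non-gap character."""
    return sum(1 for x, y in zip(a, b) if x == y and x != gap)


def sum_of_pairs(rows):
    """Sum of pair_score over all unordered pairs of rows."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))


def distance_from_consensus(rows):
    """Total number of characters that differ from the most frequent
    character of their column (gaps count as ordinary characters)."""
    total = 0
    for column in zip(*rows):
        most_common_count = Counter(column).most_common(1)[0][1]
        total += len(column) - most_common_count
    return total


if __name__ == "__main__":
    three = ["-SCGPFIRV", "MSCGPGLRA", "-SCTPHL-A"]
    six = three + ["MSC-PKIRG", "MS-LPLLRN", "MSHKPALRA"]
    print(sum_of_pairs(three))           # 5 + 3 + 5
    print(distance_from_consensus(six))  # total distance over 9 columns
```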
Tree MSA

• Input: a tree T and a string for each leaf
• Phylogenetic alignment for T: an assignment of a string to each internal node
• Score: (weighted) sum of scores along the edges
• Goal: find a phylogenetic alignment of optimal score
• Consensus = phylogenetic alignment where T is a star

[Figure: tree whose nodes are labeled CTGG, CCGG, GTTC, CTTG, GTTG, GTTG, CTGG]
Profile Representation of MA

-  A  G  G  C  T  A  T  C  A  C  C  T  G
T  A  G  -  C  T  A  C  C  A  -  -  -  G
C  A  G  -  C  T  A  C  C  A  -  -  -  G
C  A  G  -  C  T  A  T  C  A  C  -  G  G
C  A  G  -  C  T  A  T  C  G  C  -  G  G

     1   2   3   4   5   6   7   8   9   10  11  12  13  14
A    0   1   0   0   0   0   1   0   0   .8  0   0   0   0
C    .6  0   0   0   1   0   0   .4  1   0   .6  .2  0   0
G    0   0   1   .2  0   0   0   0   0   .2  0   0   .4  1
T    .2  0   0   0   0   1   0   .6  0   0   0   0   .2  0
-    .2  0   0   .8  0   0   0   0   0   0   .4  .8  .4  0

• Alternatively, use log odds:
    • p_i(a) = fraction of a's in column i
    • p(a) = fraction of a's overall
    • log( p_i(a) / p(a) )
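The log-odds variant can be sketched as below. Two choices are assumptions of this illustration: the overall fraction p(a) is taken over all cells of the alignment, gaps included, and characters absent from a column are simply left out of that column's dictionary (their log odds would be minus infinity). The function name `log_odds_profile` is hypothetical.

```python
import math
from collections import Counter


def log_odds_profile(rows):
    """Return, per column, {character: log(p_i(a) / p(a))} for every
    character a that occurs in that column."""
    n = len(rows)
    overall = Counter("".join(rows))      # counts over the whole alignment
    total = sum(overall.values())
    result = []
    for column in zip(*rows):
        scores = {}
        for ch, count in Counter(column).items():
            p_col = count / n             # p_i(a)
            p_all = overall[ch] / total   # p(a)
            scores[ch] = math.log(p_col / p_all)
        result.append(scores)
    return result


if __name__ == "__main__":
    rows = [
        "-AGGCTATCACCTG",
        "TAG-CTACCA---G",
        "CAG-CTACCA---G",
        "CAG-CTATCAC-GG",
        "CAG-CTATCGC-GG",
    ]
    for i, scores in enumerate(log_odds_profile(rows), start=1):
        print(i, {ch: round(s, 2) for ch, s in scores.items()})
```

A positive value means the character is more frequent in that column than in the alignment as a whole.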
Aligning a sequence to a profile

• Key in pairwise alignment is scoring two positions x, y: σ(x,y)
• For a letter x and a column y in a profile, σ(x,y) = value of x in column y
• Invent a score for σ(x,-)
• Run the DP algorithm for pairwise alignment
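The recipe above can be sketched as a global alignment DP in which the match score of a letter against a column is the letter's frequency in that column. The constant gap score of -0.5 is an invented value (the slide only says to invent one), and the function name `align_to_profile` and its return format are hypothetical.

```python
def align_to_profile(seq, prof, gap_score=-0.5):
    """Globally align `seq` to `prof`, a list of {char: frequency} dicts.

    Returns (score, pairs). Each pair is (character, column_index):
    ('-', j) means column j faces a gap in the sequence, and (ch, None)
    means ch is inserted between profile columns.
    """
    n, m = len(seq), len(prof)

    def sigma(i, j):
        # Score of letter seq[i-1] against profile column j-1.
        return prof[j - 1].get(seq[i - 1], 0.0)

    # S[i][j] = best score of seq[:i] against the first j columns.
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + gap_score
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + gap_score
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + sigma(i, j),
                          S[i - 1][j] + gap_score,
                          S[i][j - 1] + gap_score)

    # Trace back from the corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sigma(i, j):
            pairs.append((seq[i - 1], j - 1))
            i, j = i - 1, j - 1
        elif j > 0 and (i == 0 or S[i][j] == S[i][j - 1] + gap_score):
            pairs.append(("-", j - 1))
            j -= 1
        else:
            pairs.append((seq[i - 1], None))
            i -= 1
    pairs.reverse()
    return S[n][m], pairs


if __name__ == "__main__":
    prof = [{"A": 1.0}, {"C": 1.0}, {"G": 1.0}]
    print(align_to_profile("AG", prof))
```

In the usage example the sequence AG is placed on columns 1 and 3 and a gap faces column 2.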
Aligning alignments

• Given two alignments, how can we align them?
• Hint: use DP on the corresponding profiles

Alignment 1:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG

Alignment 2:
w GGACGTACC--
v GGACCT-----

Combined alignment:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
w GGACGTACC--
v GGACCT-----
Multiple Alignment: Greedy Heuristic

• Choose the most similar pair of sequences and combine them into a profile, thereby reducing the alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat

k sequences:
u1 = ACGTACGTACGT…
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…
…
uk = CCGGCCGGCCGG

k-1 sequences/profiles:
u1 = ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…
…
uk = CCGGCCGGCCGG…
ClustalW (Thompson, Higgins, Gibson '94)

• Popular multiple alignment tool today
• 'W' = 'weighted' (different parts of the alignment are weighted differently)
• Three-step process:
    1) Construct pairwise alignments
    2) Build guide tree
    3) Progressive alignment guided by the tree
Step 1: Pairwise Alignment

• Aligns each sequence against each other, giving a similarity matrix
• Similarity = exact matches / sequence length (percent identity)

      v1    v2    v3    v4
v1    -
v2   .17    -
v3   .87   .28    -
v4   .59   .33   .62    -

(.17 means 17% identical)
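The similarity measure can be sketched for two sequences that have already been aligned. Which length goes in the denominator is not stated on the slide; this sketch assumes the length of the pairwise alignment, and the function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Fraction of alignment columns that hold the same non-gap character.

    Assumes both strings come from one pairwise alignment, so they have
    equal length; the alignment length is used as the denominator.
    """
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b)
                  if x == y and x != gap)
    return matches / len(aligned_a)


if __name__ == "__main__":
    print(percent_identity("ACGT", "ACGA"))  # 3 of 4 columns match
    print(percent_identity("A-GT", "ACGT"))  # the gap column does not match
```

Filling a matrix like the one above amounts to calling this on the pairwise alignment of every pair of sequences.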
Tree MSA
bull Input Tree T a string for each leafbull Phylogenetic alignment for T Assignment of a string
to each internal node
bull Score ndash (weighted) sum of scores along edgesbull Goal find phyl alignment of optimal scorebull Consensus = phyl Alignment where T is a star
CG copy Ron Shamir 0621
CTGG
CCGG
GTTC
CTTG
GTTG
GTTG
CTGG
Profile Representation of MA
bull Alternatively use log oddsbull pi(a) = fraction of arsquos in col i bull p(a) = fraction of arsquos overallbull log pi(a)p(a)
CG copy Ron Shamir 0623
- A G G C T A T C A C C T G T A G ndash C T A C C A - - - G C A G ndash C T A C C A - - - G C A G ndash C T A T C A C ndash G G C A G ndash C T A T C G C ndash G G
A 1 1 8 C 6 1 4 1 6 2G 1 2 2 4 1T 2 1 6 2- 2 8 4 8 4
Aligning a sequence to a profile
bull Key in pairwise alignment is scoring two positions xy (xy)
bull For a letter x and a column y in a profile (xy)=value of x in col Y
bull Invent a score for (x-)bull Run the DP alg for pairwise alignment
CG copy Ron Shamir 0625
Aligning alignments
bull Given two alignments how can we align them
bull Hint use DP on the corresponding profiles
CG copy Ron Shamir 0626
x GGGCACTGCATy GGTTACGTC-- Alignment 1 z GGGAACTGCAG
w GGACGTACC-- Alignment 2v GGACCT-----
x GGGCACTGCATy GGTTACGTC--z GGGAACTGCAG
w GGACGTACC-- v GGACCT-----
Multiple Alignment Greedy Heuristic
bull Choose most similar pair of sequences and combine into a profile thereby reducing alignment of k sequences to an alignment of of k-1 sequencesprofiles Repeat
CG copy Ron Shamir 0627
u1= ACGTACGTACGThellip
u2 = TTAATTAATTAAhellip
u3 = ACTACTACTACThellip
hellip
uk = CCGGCCGGCCGG
u1= ACgtTACgtTACgcThellip
u2 = TTAATTAATTAAhellip
hellip
uk = CCGGCCGGCCGGhellip
k
k-1
ClustalW Thompson Higgins Gibson 94
bull Popular multiple alignment tool todaybull lsquoWrsquo = lsquoweightedrsquo (different parts of alignment
are weighted differently)bull Three-step process
1) Construct pairwise alignments2) Build Guide Tree3) Progressive alignment guided by the tree
CG copy Ron Shamir 0628
Step 1 Pairwise Alignment
bull Aligns each sequence against each other giving a similarity matrix
bull Similarity = exact matches sequence length (percent identity)
CG copy Ron Shamir 0629
v1 v2 v3 v4
v1 -v2 17 -v3 87 28 -v4 59 33 62 -
)17 means 17 identical(
Step 2 Guide Tree
bull Use the similarity method to create a Guide Tree by applying some clustering methodbull Guide tree roughly reflects evolutionary relationsbull ClustalW uses the neighbor-joining method which
iterativelybull Selects the closest pair of sequencessubtreesbull Combines them into a single subtreebull Re-computes the distances from the new subtree to all the other
sequencessubtrees
CG copy Ron Shamir 0630
Step 2 Guide Tree (contrsquod)
CG copy Ron Shamir 0631
v1
v3
v4 v2
Calculatev13 = alignment (v1 v3)v134 = alignment((v13)v4)v1234 = alignment((v134)v2)
v1 v2 v3 v4
v1 -v2 17 -v3 87 28 -v4 59 33 62 -
Step 3 Progressive Alignmentbull Start by aligning the two most similar sequencesbull Using the guide tree add in the most similar pair
(seq-seq seq-prof or prof-prof)bull Insert gaps as necessarybull Many ad-hoc rules weighting different matrices
special gap scoreshellip
CG copy Ron Shamir 0632
FOS_RAT PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFDFOS_MOUSE PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFDFOS_CHICK SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFDFOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQFOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ
Dots and stars show how well-conserved a column is
33
בונים עץ עי חיבור הרצפים הדומים לפי סדר התאמתם2
Hbb Hbb Hba Hba Myg Human Horse Human Horse Whale
בונים את ההתאמה לפי הסדר המוכתב3 עי העץ
ישנם שלושה מצביםהתאמת זוג רצפים - התאמה בין התאמות-הוספת רצף בודד להתאמה קיימת-
בונים טבלה של מרחק עריכה בין כל שני רצפים1CLUSTALW algorithm
34
נקודות עדינותנותנים משקלות שונים לרצפים כך שאם יש מספר רצפים מאד דומים bull
המשקל היחסי של כל אחד יקטןשיטה מיוחדת לקביעת מחיר להכנסת רווחיםbull
חסרונותמאחר שהבניה היא בלתי הפיכה יש תלות גדולה מאד באיכות ההתאמה bull
של הזוגות הראשוניםOnce a Gap Always a gapרווחים אינם נעלמים bullרגישות רבה לקנס הרווחיםbull
יתרונותמהירbullסבירbullאפשר לקבוע הרבה פרמטריםbullכולם משתמשיםbull
CLUSTALW algorithm
CLUSTALW algorithmbull We can deduce a pairwise alignment for each two
sequences in the multiple alignment (projected pairwise alignment)
bull The projected pairwise alignment is NOT the best pairwise alignment for the two sequences
bull CLUTALW is not an optimal algorithm Better alignments might exist The algorithm yields a possible alignment but not necessarily the best one35
Best Pairwise alignment (optimal)
Projected Pairwise alignment
ClustalW at EMBL
36
httpwwwebiacukclustalw Clustalw at the SRS site at EBI
httpwwwcstauacil~ulitskyicgGBAfastatxt
ClustalW Output Aln format
37
MSA algorithms
bull Progressive methods (CLUSTALWT-Coffee) bull Iterative methods (Dialign)bull Direct optimization (Monte Carlo genetic
algorithms)bull Local methods eMotifs Blocks Psi-blast
38
MSA Editing Jalview
39
Conservation
wwwesembnetorgServicesMolBiojalviewindexhtml
MSA formats - fasta
40
MSA formats - Aln
41
MSA formats - MSF
42
Example 1a a good MSA
43
44
Example 1b making MSA of distantly related proteins
45
Example 1c including more distant relatives in the MSA
Example 2 Isopenicillin N Synthasebull Mononuclear iron proteins ndash electron carrier proteins
Iron atoms are bound to amino acid side chainsbull In IPNS the metal ion is coordinated by three protein
residues bull IPNS is involved in biosynthesis of penicillin
46
N
SN
OCOOHO
H
H
Me
Me
COOH
NH2
N
SN
OCOOH
NH2 Me
Me
COOHO
ACV
Isopenicillin N
Fe+2
Ascorbate
O2
2H2O Fe
N
NHis268
H
N
NHis212
H
SACV
O2 (NO)
H2O
OAsp214
O
Research IPNS
bull Goal Identify Fe+2 binding residuesbull Possible solutions
1 In the lab2 Bioinformatic approach (comparing different
IPNS sequences)
47
Step 1
Multiple alignment of known IPNS
Implementation1 Obtain sequence (eg for NCBI) IPNS AND Bacteria[Organism] 2 MSA (clustalw) and search for conserved residues in the MSA
48
MSA ndash bacteria only
49
Not enough variation
50
MSA ndash bacteria amp fungi
Not enough variation
bacteria amp fungi
bacteria
Step 2Goal Add more enzymes similar to IPNS
Implementationbull Search in httpwwwexpasyorgtoolsblastbull Blast IPNS_CEPAC as querybull Select sequences similar to the query in the entire lengthbull Export in FASTA formatbull Run CLUSTALW
51
52
Step 2Goal Add more enzymes similar to IPNS
ImplementationbullSearch in httpwwwexpasyorgtoolsblast
bullBlast IPNS_CEPAC as querybullSelect sequences similar to the query in the entire length
bullExport in FASTA formatbullRun CLUSTALW
53
bull New multiple alignment narrowing down the possibilities
54
Simple multiple alignment
bull The known IPNS sequences are very similarbull Close enzymes sequences are also quite similarbull Not enough variability to categorize the active sitesbull We need to obtain even more distant sequences
(distant homologs)
55
Step 3
Using the results of the MSA for further searches
Implementation:
1. Obtain an MSA (ClustalW)
2. Either construct a consensus sequence and perform a new search, or construct a profile and perform a new search
3. Run an MSA (ClustalW) and search for conserved residues in the MSA
Consensus Sequence
• We can deduce a consensus sequence from the multiple sequence alignment: it holds the most frequent character of each alignment column.
           A T C T T G T
           A A C T T G T
           A A C T T C T
Consensus: A A C T T G T
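The consensus rule above can be sketched in a few lines of Python. This is only an illustration: the function name `consensus` and the tie-breaking rule (first character seen in the column wins) are assumptions of this sketch, not something the slides specify.

```python
from collections import Counter

def consensus(alignment):
    """Return the most frequent character of each column of an alignment."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows of an alignment must have equal length")
    # zip(*alignment) walks over the columns; most_common(1) picks the top character.
    # Ties go to the character that appears first in the column (assumption of this sketch).
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*alignment))

msa = ["ATCTTGT", "AACTTGT", "AACTTCT"]  # the three rows shown on the slide
print(consensus(msa))
```

Run on the three rows of the slide, it reproduces the consensus row A A C T T G T.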
Profile
• We can deduce a statistical model describing the multiple sequence alignment: a profile holds the frequency of each character in each alignment column.
A T C T T G T
A A C T T G T
A A C T T C T
Profile (columns 1-4 of the alignment):
     1     2     3     4
A    1     0.67  0     0
T    0     0.33  0     1
C    0     0     1     0
G    0     0     0     0
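A frequency profile can be computed column by column. The sketch below assumes a plain DNA alphabet and a dictionary layout (`profile`, `alphabet` and the output shape are illustrative choices, not taken from the slides).

```python
from collections import Counter

def profile(alignment, alphabet="ATCG"):
    """Return {character: [frequency in column 1, column 2, ...]}."""
    n_rows = len(alignment)
    table = {ch: [] for ch in alphabet}
    for col in zip(*alignment):          # iterate over alignment columns
        counts = Counter(col)
        for ch in alphabet:
            table[ch].append(counts[ch] / n_rows)
    return table

msa = ["ATCTTGT", "AACTTGT", "AACTTCT"]  # the three rows shown on the slide
for ch, freqs in profile(msa).items():
    print(ch, " ".join(f"{f:.2f}" for f in freqs))
```

Unlike the consensus, the profile keeps the minority characters of a column (for example the single T in column 2), which is what makes it more informative for later searches.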
Profile vs Consensus
• The following multiple alignments will have the same consensus:
Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T
Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T
Consensus (both alignments):
A A C T T G T
Profile vs Consensus
• But they have different profiles:
Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T
Profile of alignment 1 (columns 1-4):
     1     2     3     4
A    0.67  1     0     0
T    0     0     0     1
C    0.33  0     0.67  0
G    0     0     0.33  0
Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T
Profile of alignment 2 (columns 1-4):
     1     2     3     4
A    1     1     0     0
T    0     0     0     1
C    0     0     1     0
G    0     0     0     0
Sequence LOGO
A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T
[Figure: sequence logo of the alignment; the extracted letter row reads "A gc c g G C T"]
http://weblogo.berkeley.edu
PSI-BLAST
• Position-Specific Iterated BLAST: an automatic profile-like search
1. Regular BLAST
2. Construct a profile from the BLAST results
3. BLAST profile search
4. Final results
• Alignment with distantly related proteins
Isopenicillin N Synthase
• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe2+ in IPNS

Enzyme      Relative activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type   100                 0.4       388            969
His48Ala    16                  0.56      75             134
His63Ala    31                  1.0       142            142
His114Ala   28                  0.85      125            147
His124Ala   48                  0.84      321            381
His135Ala   22                  0.59      117            198
His212Ala   <0.007              nd        nd
His268Ala   <0.003              nd        nd
Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      238            528
Asp131Ala   68                  0.48      363            755
Asp203Ala   32                  0.91      123            135
Asp214Ala   <0.004              nd        nd
Profile Representation of MA

-  A  G  G  C  T  A  T  C  A  C  C  T  G
T  A  G  -  C  T  A  C  C  A  -  -  -  G
C  A  G  -  C  T  A  C  C  A  -  -  -  G
C  A  G  -  C  T  A  T  C  A  C  -  G  G
C  A  G  -  C  T  A  T  C  G  C  -  G  G

     1    2    3    4    5    6    7    8    9    10   11   12   13   14
A    0    1    0    0    0    0    1    0    0    0.8  0    0    0    0
C    0.6  0    0    0    1    0    0    0.4  1    0    0.6  0.2  0    0
G    0    0    1    0.2  0    0    0    0    0    0.2  0    0    0.4  1
T    0.2  0    0    0    0    1    0    0.6  0    0    0    0    0.2  0
-    0.2  0    0    0.8  0    0    0    0    0    0    0.4  0.8  0.4  0

• Alternatively, use log odds:
• p_i(a) = fraction of a's in column i
• p(a) = fraction of a's overall
• score of a in column i: log( p_i(a) / p(a) )
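The log-odds variant of the profile can be sketched as follows. Two things here are assumptions of the sketch and not part of the slides: the gap character is treated as an ordinary symbol, and a pseudocount is added to every column count so that characters absent from a column do not produce log(0).

```python
import math
from collections import Counter

def log_odds_profile(alignment, pseudocount=1.0):
    """Return one dict per column mapping character a -> log(p_i(a) / p(a))."""
    alphabet = sorted(set("".join(alignment)))
    n_rows = len(alignment)
    # p(a): fraction of a's in the whole alignment.
    overall = Counter("".join(alignment))
    total = sum(overall.values())
    background = {a: overall[a] / total for a in alphabet}
    columns = []
    for col in zip(*alignment):
        counts = Counter(col)
        # Smoothed p_i(a); the pseudocount is an assumption of this sketch.
        denom = n_rows + pseudocount * len(alphabet)
        columns.append({
            a: math.log(((counts[a] + pseudocount) / denom) / background[a])
            for a in alphabet
        })
    return columns

msa = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
scores = log_odds_profile(msa)
print(scores[1]["A"] > 0, scores[1]["C"] < 0)  # column 2 holds only A's
```

A positive score means the character is more common in that column than in the alignment as a whole; a negative score means it is rarer.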
Aligning a sequence to a profile
• The key in pairwise alignment is scoring two positions x, y: σ(x,y)
• For a letter x and a column y in a profile, σ(x,y) = value of x in column y
• Invent a score for σ(x,-)
• Run the DP algorithm for pairwise alignment
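A minimal sketch of the dynamic program described above, assuming global (Needleman-Wunsch style) alignment. The function name, the representation of a profile as a list of per-column dictionaries, and the constant gap score of -0.5 are all illustrative assumptions; the slides only say that a gap score has to be invented.

```python
def align_to_profile(seq, profile, gap=-0.5):
    """Best global alignment score of a sequence against profile columns.

    profile[j] maps a character to its value in column j, so the score of
    letter x against column j is profile[j].get(x, 0.0).
    `gap` is the invented score for a letter or a column facing a gap.
    """
    n, m = len(seq), len(profile)
    # best[i][j]: best score of aligning seq[:i] with the first j columns.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + profile[j - 1].get(seq[i - 1], 0.0),  # letter vs column
                best[i - 1][j] + gap,  # letter of seq against a gap
                best[i][j - 1] + gap,  # profile column against a gap
            )
    return best[n][m]

# First three columns of the small frequency profile used earlier in the slides.
columns = [{"A": 1.0}, {"A": 0.67, "T": 0.33}, {"C": 1.0}]
print(align_to_profile("AAC", columns))
print(align_to_profile("AC", columns))
```

Only the score is returned here; recovering the alignment itself would need the usual traceback through `best`.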
Aligning alignments
• Given two alignments, how can we align them?
• Hint: use DP on the corresponding profiles.

Alignment 1:
x  GGGCACTGCAT
y  GGTTACGTC--
z  GGGAACTGCAG
Alignment 2:
w  GGACGTACC--
v  GGACCT-----

Combined alignment:
x  GGGCACTGCAT
y  GGTTACGTC--
z  GGGAACTGCAG
w  GGACGTACC--
v  GGACCT-----
Multiple Alignment: Greedy Heuristic
• Choose the most similar pair of sequences and combine them into a profile, thereby reducing the alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat.

k sequences:
u1 = ACGTACGTACGT…
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…
…
uk = CCGGCCGGCCGG…

k-1 sequences/profiles:
u1 = ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…
…
uk = CCGGCCGGCCGG…
ClustalW (Thompson, Higgins, Gibson 1994)
• Popular multiple alignment tool today
• 'W' = 'weighted' (different parts of the alignment are weighted differently)
• Three-step process:
  1) Construct pairwise alignments
  2) Build a guide tree
  3) Progressive alignment guided by the tree
Step 1: Pairwise Alignment
• Aligns each sequence against every other sequence, giving a similarity matrix
• Similarity = exact matches / sequence length (percent identity)

      v1    v2    v3    v4
v1    -
v2    0.17  -
v3    0.87  0.28  -
v4    0.59  0.33  0.62  -

(0.17 means 17% identical)
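As a rough illustration of the similarity measure, the sketch below computes percent identity for rows that are already aligned (equal length, `-` for gaps) and collects the values for every pair. Dividing by the aligned length is an assumption of this sketch; the slide only says "sequence length". The toy sequences are hypothetical and need no gaps, whereas ClustalW would first align each pair.

```python
from itertools import combinations

def percent_identity(a, b):
    """Fraction of aligned columns in which both rows hold the same residue."""
    if len(a) != len(b):
        raise ValueError("rows must come from the same alignment")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)

def identity_matrix(seqs):
    """Map each unordered pair of sequence names to its percent identity."""
    return {(p, q): percent_identity(seqs[p], seqs[q])
            for p, q in combinations(seqs, 2)}

# Hypothetical toy sequences of equal length.
seqs = {"v1": "ACGTACGT", "v2": "TTGTACCA", "v3": "ACGTACGA"}
for pair, value in identity_matrix(seqs).items():
    print(pair, value)
```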
Step 2: Guide Tree
• Use the similarity matrix to create a guide tree by applying some clustering method
• The guide tree roughly reflects evolutionary relations
• ClustalW uses the neighbor-joining method, which iteratively:
  • Selects the closest pair of sequences/subtrees
  • Combines them into a single subtree
  • Re-computes the distances from the new subtree to all the other sequences/subtrees
Step 2: Guide Tree (cont'd)
[Figure: guide tree with leaves v1, v3, v4, v2]

      v1    v2    v3    v4
v1    -
v2    0.17  -
v3    0.87  0.28  -
v4    0.59  0.33  0.62  -

Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)
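The merge order above can be reproduced with a simple greedy clustering of the similarity matrix. Note that this sketch uses average-linkage clustering, not the neighbor-joining method that ClustalW applies; it is swapped in only because it is short and follows the same select, combine, re-compute loop. The function name and data layout are illustrative.

```python
def merge_order(similarity):
    """Greedy average-linkage clustering on a pairwise similarity table.

    `similarity` maps frozenset({a, b}) to the similarity of leaves a and b.
    Returns the list of merges, each a pair of clusters (tuples of leaf names).
    """
    clusters = sorted({(name,) for pair in similarity for name in pair})

    def avg(c1, c2):
        # Average similarity over all leaf pairs of the two clusters.
        return sum(similarity[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        # Select the most similar pair of clusters.
        i, j = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: avg(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        # Combine them and drop the two originals.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

sim = {
    frozenset(("v1", "v2")): 0.17,
    frozenset(("v1", "v3")): 0.87,
    frozenset(("v2", "v3")): 0.28,
    frozenset(("v1", "v4")): 0.59,
    frozenset(("v2", "v4")): 0.33,
    frozenset(("v3", "v4")): 0.62,
}
for left, right in merge_order(sim):
    print(left, "+", right)
```

On the matrix from the slide this joins v1 with v3 first, then adds v4, then v2, which is the order listed under "Calculate".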
Step 3: Progressive Alignment
• Start by aligning the two most similar sequences
• Using the guide tree, add in the next most similar pair (seq-seq, seq-prof or prof-prof)
• Insert gaps as necessary
• Many ad-hoc rules: weighting, different matrices, special gap scores…

FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ

Dots and stars show how well-conserved a column is
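The star part of that annotation line is easy to sketch: a column gets `*` when every row holds the same residue. The dot marks depend on ClustalW's own groups of similar amino acids, which are not reproduced here, so this illustration (with the hypothetical helper `star_line`) covers fully conserved columns only.

```python
def star_line(alignment):
    """Return '*' under fully conserved columns and ' ' elsewhere."""
    return "".join(
        "*" if len(set(col)) == 1 and col[0] != "-" else " "
        for col in zip(*alignment)
    )

# First 12 columns of FOS_RAT, FOS_MOUSE and FOS_CHICK from the alignment above.
rows = ["PEEMSVTS-LDL", "PEEMSVAS-LDL", "SEELAAATALDL"]
for row in rows:
    print(row)
print(star_line(rows))
```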
CLUSTALW algorithm
1. Build a table of edit distances between every pair of sequences
2. Build a tree by joining similar sequences in order of their similarity
   [Figure: tree with leaves Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale]
3. Build the alignment in the order dictated by the tree
   There are three cases:
   - aligning a pair of sequences
   - aligning two alignments
   - adding a single sequence to an existing alignment

CLUSTALW algorithm
Fine points
• Sequences receive different weights, so that when several sequences are very similar the relative weight of each one decreases
• A special method sets the cost of inserting gaps
Disadvantages
• Because the construction is irreversible, the result depends heavily on the quality of the alignment of the first pairs
• Gaps never disappear ("Once a gap, always a gap")
• High sensitivity to the gap penalty
Advantages
• Fast
• Reasonable
• Many parameters can be set
• Everyone uses it
CLUSTALW algorithm
• We can deduce a pairwise alignment for every two sequences in the multiple alignment (the projected pairwise alignment).
• The projected pairwise alignment is not necessarily the best pairwise alignment of the two sequences.
• CLUSTALW is not an optimal algorithm: better alignments might exist. The algorithm yields a possible alignment, but not necessarily the best one.
[Figure: best (optimal) pairwise alignment vs. projected pairwise alignment]
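Projecting a multiple alignment onto two of its rows can be sketched as: keep the two rows and drop every column in which both hold a gap. The helper name `project` is illustrative; the rows are those of the combined x, y, z, w, v alignment shown earlier in the slides.

```python
def project(msa, i, j):
    """Restrict an MSA to rows i and j and drop columns where both are gaps."""
    kept = [(a, b) for a, b in zip(msa[i], msa[j]) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

msa = [
    "GGGCACTGCAT",  # x
    "GGTTACGTC--",  # y
    "GGGAACTGCAG",  # z
    "GGACGTACC--",  # w
    "GGACCT-----",  # v
]
w_row, v_row = project(msa, 3, 4)
print(w_row)
print(v_row)
```

Comparing the score of such a projection with the score of an optimal pairwise alignment of the same two sequences is one way to see how far the heuristic result is from the pairwise optimum.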
ClustalW at EMBL
http://www.ebi.ac.uk/clustalw (ClustalW at the SRS site at EBI)
httpwwwcstauacil~ulitskyicgGBAfastatxt
ClustalW Output: Aln format
MSA algorithms
• Progressive methods (CLUSTALW, T-Coffee)
• Iterative methods (Dialign)
• Direct optimization (Monte Carlo, genetic algorithms)
• Local methods: eMotifs, Blocks, PSI-BLAST
MSA Editing: Jalview
Conservation
www.es.embnet.org/Services/MolBio/jalview/index.html
MSA formats - FASTA
MSA formats - Aln
MSA formats - MSF
Example 1a: a good MSA
Example 1b: making an MSA of distantly related proteins
Example 1c: including more distant relatives in the MSA
Example 2: Isopenicillin N Synthase
• Mononuclear iron proteins - electron carrier proteins. Iron atoms are bound to amino acid side chains.
• In IPNS the metal ion is coordinated by three protein residues.
• IPNS is involved in the biosynthesis of penicillin.
[Figure: reaction scheme, ACV -> isopenicillin N (labels: Fe+2, ascorbate, O2, 2 H2O); active-site diagram showing Fe bound to His212, Asp214, His268, S-ACV, O2 (NO) and H2O]
Research: IPNS
• Goal: identify the Fe+2-binding residues
• Possible solutions:
  1. In the lab
  2. Bioinformatic approach (comparing different IPNS sequences)
Step 1
Multiple alignment of known IPNS
Implementation:
1. Obtain sequences (e.g. from NCBI): IPNS AND Bacteria[Organism]
2. MSA (clustalw) and search for conserved residues in the MSA
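The search for conserved residues can be sketched as a scan over alignment columns. The function name `conserved_columns` and the toy rows below are assumptions for illustration; a real run would use the rows of the ClustalW alignment.

```python
def conserved_columns(rows: list[str], gap: str = "-") -> list[tuple[int, str]]:
    """Return (1-based column, residue) for every fully conserved column.

    A column counts as conserved only if all rows carry the same
    residue and none of them has a gap there.
    """
    if len({len(r) for r in rows}) != 1:
        raise ValueError("need at least one row, all of equal length")
    hits = []
    for index, column in enumerate(zip(*rows), start=1):
        residues = set(column)
        if len(residues) == 1 and gap not in residues:
            hits.append((index, column[0]))
    return hits


rows = ["AGGT-C-", "-G-TTCG", "TG-AAC-"]  # toy alignment
print(conserved_columns(rows))  # [(2, 'G'), (6, 'C')]
```

The reported numbers are alignment columns, not residue numbers in any one sequence, so a hit still has to be mapped back to a position such as His212 in the sequence of interest. When the sequences are all very similar, almost every column is reported, which is the "not enough variation" problem.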
MSA - bacteria only
Not enough variation
MSA - bacteria & fungi
Not enough variation
[Figure labels: bacteria & fungi; bacteria]
Step 2
Goal: add more enzymes similar to IPNS
Implementation:
• Search at http://www.expasy.org/tools/blast
• BLAST with IPNS_CEPAC as the query
• Select sequences that are similar to the query over its entire length
• Export in FASTA format
• Run CLUSTALW
• New multiple alignment: narrowing down the possibilities
Simple multiple alignment
• The known IPNS sequences are very similar
• The sequences of closely related enzymes are also quite similar
• Not enough variability to pinpoint the active-site residues
• We need to obtain even more distant sequences (distant homologs)
Step 3
Using the results of the MSA for further searches
Implementation:
1. Obtain an MSA (clustalw)
2. Either construct a consensus sequence and perform a new search, or construct a profile and perform a new search
3. MSA (clustalw) and search for conserved residues in the MSA
Consensus Sequence
• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column.
• Consensus: each position reflects the most common character found at that position.
A T C T T G T
A A C T T G T
A A C T T C T
Consensus:
A A C T T G T
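The consensus of the three rows above can be reproduced with a minimal sketch. The function name `consensus` is an assumption, gaps are treated like any other character, and ties are broken in favour of the character seen first from the top row.

```python
from collections import Counter


def consensus(rows: list[str]) -> str:
    """Most frequent character of each alignment column.

    Counter.most_common keeps first-seen order among equal counts,
    so a tie goes to the character that appears in the topmost row.
    """
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*rows))


rows = ["ATCTTGT", "AACTTGT", "AACTTCT"]
print(consensus(rows))  # AACTTGT
```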
Profile
• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters in each column of the alignment.
• Profile: each position reflects the frequency of each character found at that position.
A T C T T G T
A A C T T G T
A A C T T C T
    1     2     3     4     5     6
A   1     0.67  0     0
T   0     0.33  1     1
C   0     0     0     0
G   0     0     0     0
Profile vs Consensus
• The following multiple alignments will have the same consensus:
Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T
Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T
Consensus:
A A C T T G T
Profile vs Consensus
• But have a different profile:
Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T
Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T
Profile of alignment 1:
    1     2     3     4     5     6
A   0.66  1     0     0
T   0     0     0     1
C   0.33  0     0.66  0
G   0     0     0.33  0
Profile of alignment 2:
    1     2     3     4     5     6
A   1     1     0     0
T   0     0     0     1
C   0     0     1     0
G   0     0     0     0
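The two profiles can be recomputed from the alignments with a short sketch. The function name `profile` and the choice to divide by the number of rows (so a gap would still count in the denominator) are assumptions of this example. Exact thirds print as 0.67 and 0.33 after rounding, where the tables show 0.66 and 0.33.

```python
from collections import Counter


def profile(rows: list[str], alphabet: str = "ATCG") -> dict[str, list[float]]:
    """Relative frequency of each alphabet character in every column."""
    n = len(rows)
    columns = [Counter(column) for column in zip(*rows)]
    # Counter returns 0 for characters that never occur in a column.
    return {ch: [counts[ch] / n for counts in columns] for ch in alphabet}


aln1 = ["AACTTGC", "AAGTCGT", "CACTTCT"]
aln2 = ["AACTTGT", "AACTTGT", "AACTTCT"]

for label, rows in (("alignment 1", aln1), ("alignment 2", aln2)):
    print(label)
    for ch, freqs in profile(rows).items():
        # Show the first four columns, as in the tables above.
        print(ch, [round(f, 2) for f in freqs[:4]])
```

A search with the consensus treats the two families identically, while a search with the profile can still tell that column 1 of alignment 1 tolerates a C and column 3 tolerates a G.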
Sequence LOGO
A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T
A gc c g G C T
http://weblogo.berkeley.edu
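A logo draws each column with a total height equal to its information content, and each letter takes a share of that height proportional to its frequency. The sketch below computes the column height in bits as log2(4) minus the Shannon entropy of the column. It is a simplified illustration: gaps are ignored, no small-sample correction is applied (logo tools may apply one), and the function name `column_information` is an assumption.

```python
import math
from collections import Counter


def column_information(column: str, alphabet_size: int = 4, gap: str = "-") -> float:
    """Information content of one column in bits.

    Computed as log2(alphabet_size) - H, where H is the Shannon entropy
    of the residue frequencies. Gap characters are left out of the counts.
    """
    residues = [c for c in column if c != gap]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in Counter(residues).values()
    )
    return math.log2(alphabet_size) - entropy


rows = [
    "AACCGCTCTT",
    "AGCCGCGC-T",
    "A-CAGAGCCT",
    "AAGCACGC-T",
    "ACGGGTGCTT",
    "ATGC-CGC-T",
]
for position, column in enumerate(zip(*rows), start=1):
    print(position, round(column_information("".join(column)), 2))
```

Fully conserved columns (1, 8 and 10 here) reach the maximum of 2 bits for a four-letter alphabet, while a column split evenly between two letters, such as column 3, gets 1 bit.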
PSI-BLAST
• Position-Specific Iterated BLAST: an automatic profile-like search
Regular BLAST -> construct profile from BLAST results -> BLAST profile search -> final results
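The "profile search" step can be illustrated with a heavily simplified sketch: turn aligned hits into a position-specific scoring matrix of log-odds scores, then slide it along a target sequence. This is not how PSI-BLAST is implemented; the DNA alphabet, the uniform background of 0.25, the pseudocount of 1 and the function names `build_pssm` and `best_hit` are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_pssm(rows: list[str], pseudocount: float = 1.0, background: float = 0.25):
    """One dict per column mapping residue -> log2(frequency / background)."""
    pssm = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c in ALPHABET)  # gaps are skipped
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append(
            {
                ch: math.log2(((counts[ch] + pseudocount) / total) / background)
                for ch in ALPHABET
            }
        )
    return pssm


def best_hit(pssm, target: str):
    """Score every ungapped window of the target; return (score, start) or None."""
    width = len(pssm)
    best = None
    for start in range(len(target) - width + 1):
        # Characters outside the alphabet contribute a neutral score of 0.
        score = sum(pssm[i].get(target[start + i], 0.0) for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best


pssm = build_pssm(["AACTTGT", "AACTTGT", "AACTTCT"])
score, start = best_hit(pssm, "GGAACTTGTCC")
print(start, round(score, 2))  # best window starts at index 2
```

In the iterated scheme, the new hits found with the matrix would be added to the alignment and the matrix rebuilt before the next round.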
• Alignment with distantly related proteins
• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS.

Enzyme      Relative activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type   100                 0.4       388            969
His48Ala    16                  0.56      75             134
His63Ala    31                  1.0       142            142
His114Ala   28                  0.85      125            147
His124Ala   48                  0.84      321            381
His135Ala   22                  0.59      117            198
His212Ala   <0.007              nd        nd
His268Ala   <0.003              nd        nd
Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      238            528
Asp131Ala   68                  0.48      363            755
Asp203Ala   32                  0.91      123            135
Asp214Ala   <0.004              nd        nd
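The reading of the mutagenesis table can be made explicit with a small sketch that flags the mutants whose activity is essentially abolished. The relative activities are taken from the table, with the decimal points restored from the extracted text (values listed as "<x" are stored as their upper bound). The 1% cut-off and the function name `essential_residues` are assumptions made for this example.

```python
# Relative activity of each alanine mutant, wild type = 100.
relative_activity = {
    "His48Ala": 16, "His63Ala": 31, "His114Ala": 28, "His124Ala": 48,
    "His135Ala": 22, "His212Ala": 0.007, "His268Ala": 0.003,
    "Asp14Ala": 5, "Asp113Ala": 63, "Asp131Ala": 68, "Asp203Ala": 32,
    "Asp214Ala": 0.004,
}


def essential_residues(activity: dict[str, float], threshold: float = 1.0) -> list[str]:
    """Residues whose alanine mutant falls below the activity threshold."""
    # Strip the trailing "Ala" to recover the wild-type residue name.
    return sorted(name[:-3] for name, value in activity.items() if value < threshold)


print(essential_residues(relative_activity))  # ['Asp214', 'His212', 'His268']
```

Only the three residues picked out by the alignment lose activity by several orders of magnitude; every other mutant keeps at least a few percent of wild-type activity.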
Isopenicillin N Synthase - IPNS
ClustalW (Thompson, Higgins, Gibson 1994)
• Popular multiple alignment tool today
• 'W' = 'weighted' (different parts of the alignment are weighted differently)
• Three-step process:
  1) Construct pairwise alignments
  2) Build guide tree
  3) Progressive alignment guided by the tree
Step 1: Pairwise Alignment
• Aligns each sequence against every other sequence, giving a similarity matrix
• Similarity = exact matches / sequence length (percent identity)

      v1    v2    v3    v4
v1    -
v2    .17   -
v3    .87   .28   -
v4    .59   .33   .62   -

(.17 means 17% identical)
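Step 1 can be sketched with a plain global alignment followed by an identity count. This is a minimal illustration, not ClustalW's own pairwise stage: the scoring values (match 1, mismatch -1, gap -1), the choice of the shorter sequence as the "sequence length" in the denominator, and all function names are assumptions.

```python
def global_align(a: str, b: str, match=1, mismatch=-1, gap=-1) -> tuple[str, str]:
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    def sub(x, y):
        return match if x == y else mismatch

    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a: str, b: str) -> float:
    """Exact matches in the alignment divided by the shorter sequence length."""
    if not a or not b:
        raise ValueError("sequences must be non-empty")
    row_a, row_b = global_align(a, b)
    matches = sum(x == y and x != "-" for x, y in zip(row_a, row_b))
    return matches / min(len(a), len(b))


def identity_matrix(seqs: list[str]) -> list[list[float]]:
    """Symmetric matrix of pairwise identities (1.0 on the diagonal)."""
    k = len(seqs)
    matrix = [[1.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i):
            matrix[i][j] = matrix[j][i] = percent_identity(seqs[i], seqs[j])
    return matrix


for row in identity_matrix(["AGGTC", "GTTCG", "TGAAC"]):
    print([round(value, 2) for value in row])
```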
Step 2: Guide Tree
• Use the similarity matrix to create a guide tree by applying some clustering method
• The guide tree roughly reflects evolutionary relations
• ClustalW uses the neighbor-joining method, which iteratively:
  • Selects the closest pair of sequences/subtrees
  • Combines them into a single subtree
  • Re-computes the distances from the new subtree to all the other sequences/subtrees
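The select, combine, re-compute loop can be sketched with a simpler stand-in for neighbor joining: greedy average-linkage clustering applied directly to the similarity matrix. Neighbor joining itself works on distances and corrects for unequal rates between lineages, which this sketch does not attempt. The function names and the `frozenset` keys are assumptions; the similarity values are the ones from the Step 1 matrix, read as fractions.

```python
from itertools import combinations


def average_similarity(leaves_a, leaves_b, sim) -> float:
    """Mean similarity over all leaf pairs drawn from the two clusters."""
    pairs = [(a, b) for a in leaves_a for b in leaves_b]
    return sum(sim[frozenset(pair)] for pair in pairs) / len(pairs)


def guide_tree(names, sim):
    """Repeatedly merge the two most similar clusters; return a nested tuple."""
    # Each cluster is (subtree, leaves); a single sequence is its own subtree.
    clusters = [(name, (name,)) for name in names]
    while len(clusters) > 1:
        x, y = max(
            combinations(clusters, 2),
            key=lambda pair: average_similarity(pair[0][1], pair[1][1], sim),
        )
        clusters.remove(x)
        clusters.remove(y)
        clusters.append(((x[0], y[0]), x[1] + y[1]))
    return clusters[0][0]


sim = {
    frozenset(("v1", "v2")): 0.17,
    frozenset(("v1", "v3")): 0.87,
    frozenset(("v2", "v3")): 0.28,
    frozenset(("v1", "v4")): 0.59,
    frozenset(("v2", "v4")): 0.33,
    frozenset(("v3", "v4")): 0.62,
}
print(guide_tree(["v1", "v2", "v3", "v4"], sim))  # ('v2', ('v4', ('v1', 'v3')))
```

On this matrix the merge order is v1 with v3, then v4, then v2, so the most similar sequences are aligned first and the most divergent one is added last.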
Step 2: Guide Tree (cont'd)
[Figure: guide tree with leaves v1, v3, v4, v2]
Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)

      v1    v2    v3    v4
v1    -
v2    .17   -
v3    .87   .28   -
v4    .59   .33   .62   -
Step 3: Progressive Alignment
• Start by aligning the two most similar sequences
• Following the guide tree, add in the next most similar pair (seq-seq, seq-prof or prof-prof)
• Insert gaps as necessary
• Many ad-hoc rules: weighting, different matrices, special gap scores...
Step 3: Progressive Alignment
• Start by aligning the two most similar sequences
• Using the guide tree, add in the next most similar pair (seq-seq, seq-prof or prof-prof)
• Insert gaps as necessary
• Many ad-hoc rules: weighting, different matrices, special gap scores…
FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ
Dots and stars show how well-conserved a column is
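
CLUSTALW algorithm
1. Build a table of edit distances between every pair of sequences
2. Build a tree by joining the similar sequences in order of their similarity
   [Figure: guide tree with leaves Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale]
3. Build the alignment in the order dictated by the tree
There are three cases:
- Aligning a pair of sequences
- Adding a single sequence to an existing alignment
- Aligning two alignments
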
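The first two steps above can be sketched in a few lines. This is only an illustration under stated assumptions: plain Levenshtein distance stands in for the distance measure, and greedy average-linkage clustering stands in for guide-tree construction (ClustalW itself derives distances from pairwise alignments and builds its tree by neighbor joining). The function names `edit_distance` and `join_order` and the toy sequences are hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def join_order(seqs: dict) -> list:
    """Step 1: distance table. Step 2: merge the closest clusters first.

    Returns the merges in the order a progressive aligner would follow.
    """
    dist = {frozenset(pair): edit_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}
    clusters = {name: [name] for name in seqs}

    def cluster_dist(x, y):
        # Average distance over all cross-cluster sequence pairs.
        pairs = [(a, b) for a in clusters[x] for b in clusters[y]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=lambda p: cluster_dist(*p))
        merges.append((x, y))
        clusters[f"({x},{y})"] = clusters.pop(x) + clusters.pop(y)
    return merges


# Toy input (made-up sequences, not the globins from the slide).
toy = {"a": "ACGT", "b": "ACGA", "c": "TTTT"}
for left, right in join_order(toy):
    print("align", left, "with", right)
```

Each merge corresponds to one of the three cases listed above: two single sequences, a sequence joined to an existing cluster, or two clusters.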
CLUSTALW algorithm
Fine points
• Sequences are given different weights, so that when several sequences are very similar the relative weight of each one is reduced
• A special method is used to set the cost of inserting gaps
Disadvantages
• The construction is irreversible, so the result depends heavily on the quality of the alignment of the first pairs
• Gaps never disappear: "Once a gap, always a gap"
• Very sensitive to the gap penalty
Advantages
• Fast
• Reasonable results
• Many parameters can be set
• Everyone uses it

CLUSTALW algorithm
• We can deduce a pairwise alignment for every two sequences in the multiple alignment (the projected pairwise alignment)
• The projected pairwise alignment is NOT necessarily the best pairwise alignment for the two sequences
• CLUSTALW is not an optimal algorithm. Better alignments might exist: the algorithm yields a possible alignment, but not necessarily the best one
Best Pairwise alignment (optimal)
Projected Pairwise alignment
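
Projecting a pairwise alignment out of a multiple alignment is mechanical: keep the two rows and drop every column in which both rows hold a gap. A minimal sketch, using the first three-sequence example alignment from the definition slides (S1=AGGTC, S2=GTTCG, S3=TGAAC) written row by row; the function name `project_pair` is hypothetical.

```python
def project_pair(msa: list, i: int, j: int) -> tuple:
    """Rows i and j of the MSA, minus the columns that are gaps in both rows."""
    kept = [(a, b) for a, b in zip(msa[i], msa[j])
            if not (a == "-" and b == "-")]
    return ("".join(a for a, _ in kept), "".join(b for _, b in kept))


msa = [
    "AGGT-C-",   # S1'
    "-G-TTCG",   # S2'
    "TG-AAC-",   # S3'
]
s1, s3 = project_pair(msa, 0, 2)
print(s1)
print(s3)
```

The projected rows still spell out the original sequences once gaps are removed, but nothing forces them to be the best-scoring alignment of that pair.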

ClustalW at EMBL
http://www.ebi.ac.uk/clustalw
Clustalw at the SRS site at EBI
httpwwwcstauacil~ulitskyicgGBAfastatxt

ClustalW Output: Aln format

MSA algorithms
• Progressive methods (CLUSTALW, T-Coffee)
• Iterative methods (Dialign)
• Direct optimization (Monte Carlo, genetic algorithms)
• Local methods: eMotifs, Blocks, Psi-blast

MSA Editing: Jalview
Conservation
www.es.embnet.org/Services/MolBio/jalview/index.html

MSA formats - fasta

MSA formats - Aln

MSA formats - MSF

Example 1a: a good MSA

Example 1b: making an MSA of distantly related proteins

Example 1c: including more distant relatives in the MSA

Example 2: Isopenicillin N Synthase (IPNS)
• Mononuclear iron proteins - electron carrier proteins. Iron atoms are bound to amino acid side chains
• In IPNS the metal ion is coordinated by three protein residues
• IPNS is involved in biosynthesis of penicillin

[Figure: reaction scheme, ACV → Isopenicillin N; labels: Fe+2, Ascorbate, O2, 2H2O]
[Figure: active-site Fe with ligands His268, His212, Asp214, S-ACV, O2 (NO), H2O]

Research: IPNS
• Goal: identify the Fe+2 binding residues
• Possible solutions:
  1. In the lab
  2. Bioinformatic approach (comparing different IPNS sequences)

Step 1
Multiple alignment of known IPNS
Implementation
1. Obtain sequences (e.g. from NCBI): IPNS AND Bacteria[Organism]
2. MSA (clustalw) and search for conserved residues in the MSA

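"Search for conserved residues" can be made concrete as: find the alignment columns in which every sequence carries the same residue, then map each column back to a residue number in one reference sequence. A minimal sketch; the function names and the toy alignment are made up for illustration and are not real IPNS data.

```python
def conserved_columns(msa: list) -> list:
    """0-based indices of columns where all rows share one non-gap residue."""
    return [c for c, col in enumerate(zip(*msa))
            if col[0] != "-" and len(set(col)) == 1]


def residue_number(row: str, column: int) -> int:
    """1-based position, in the ungapped sequence, of the residue at a column."""
    if row[column] == "-":
        raise ValueError("column is a gap in this row")
    return sum(ch != "-" for ch in row[:column + 1])


# Toy alignment (hypothetical sequences).
msa = ["MH-DKH",
       "LHADRH",
       "MH-DQH"]
for c in conserved_columns(msa):
    print(msa[0][c], residue_number(msa[0], c))
```

When the sequences are all very similar, almost every column passes this test, which is the "not enough variation" problem the following slides run into.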
MSA - bacteria only
Not enough variation

MSA - bacteria & fungi
Not enough variation
[Figure labels: bacteria & fungi; bacteria]

Step 2
Goal: Add more enzymes similar to IPNS
Implementation
• Search in http://www.expasy.org/tools/blast
• Blast IPNS_CEPAC as query
• Select sequences similar to the query over its entire length
• Export in FASTA format
• Run CLUSTALW

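For readers who have not handled the FASTA format mentioned in the export step: each record is a header line starting with `>` followed by one or more sequence lines. The sketch below reads and writes that layout. It assumes every header has a non-empty first word, which is used as the record name; the helper names and the toy sequence are hypothetical.

```python
def parse_fasta(text: str) -> dict:
    """Map record name (first word of the header) to its joined sequence."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif line and name is not None:
            records[name] += line
    return records


def to_fasta(records: dict, width: int = 60) -> str:
    """Write records back out, wrapping sequence lines at `width` characters."""
    lines = []
    for name, seq in records.items():
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Toy record: the name comes from the slide, the sequence is made up.
example = ">IPNS_CEPAC toy record\nMGSVPVPVAN\nVPRIDVSPLF\n"
print(parse_fasta(example))
```

A file in this form is what the multiple aligner takes as input in the last step of the list.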
• New multiple alignment: narrowing down the possibilities

Simple multiple alignment
• The known IPNS sequences are very similar
• The sequences of closely related enzymes are also quite similar
• Not enough variability to categorize the active sites
• We need to obtain even more distant sequences (distant homologs)

Step 3
Using the results of the MSA for further searches
Implementation
1. Obtain an MSA (clustalw)
2. Either construct a consensus sequence and perform a new search,
   OR construct a profile and perform a new search
3. MSA (clustalw) and search for conserved residues in the MSA

Consensus Sequence
• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column
• Consensus: each position reflects the most common character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

Consensus: A A C T T G T

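The consensus rule is a per-column majority vote. A minimal sketch using the three-row alignment from the slide; the function name is hypothetical, and ties are broken by whichever character appears first in the column, which is an arbitrary choice rather than anything the slides specify.

```python
from collections import Counter


def consensus(msa: list) -> str:
    """Most frequent character in each alignment column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*msa))


msa = ["ATCTTGT",
       "AACTTGT",
       "AACTTCT"]
print(consensus(msa))
```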
Profile
• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters in each column of the alignment
• Profile: each position reflects the frequency of each character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

     1     2     3     4     5     6     7
A    1     0.67  0     0     0     0     0
T    0     0.33  0     1     1     0     1
C    0     0     1     0     0     0.33  0
G    0     0     0     0     0     0.67  0

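A profile in this simple sense is a table of per-column character frequencies. The sketch below computes it for the slide's three-row alignment. The function name is hypothetical, and gaps, if present, would simply count toward the column total here, which is one convention among several.

```python
from collections import Counter


def profile(msa: list, alphabet: str = "ATCG") -> list:
    """One dict per column: character -> fraction of rows carrying it."""
    n_rows = len(msa)
    columns = []
    for col in zip(*msa):
        counts = Counter(col)
        columns.append({ch: counts[ch] / n_rows for ch in alphabet})
    return columns


msa = ["ATCTTGT",
       "AACTTGT",
       "AACTTCT"]
cols = profile(msa)
for ch in "ATCG":
    print(ch, " ".join(f"{col[ch]:.2f}" for col in cols))
```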
Profile vs Consensus
• The following multiple alignments will have the same consensus

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

Consensus (both):
A A C T T G T

Profile vs Consensus
• But they have different profiles

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

     1     2     3     4     5     6     7
A    0.67  1     0     0     0     0     0
T    0     0     0     1     0.67  0     0.67
C    0.33  0     0.67  0     0.33  0.33  0.33
G    0     0     0.33  0     0     0.67  0

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

     1     2     3     4     5     6     7
A    1     1     0     0     0     0     0
T    0     0     0     1     1     0     1
C    0     0     1     0     0     0.33  0
G    0     0     0     0     0     0.67  0

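The point of these two slides can be checked directly: the two alignments collapse to the same consensus string, yet their column frequencies differ. A small self-contained sketch; the helper names are hypothetical.

```python
from collections import Counter


def column_counts(msa):
    return [Counter(col) for col in zip(*msa)]


def consensus(msa):
    return "".join(c.most_common(1)[0][0] for c in column_counts(msa))


def profile(msa):
    # Only characters that actually occur in a column are stored.
    return [{ch: n / len(msa) for ch, n in c.items()} for c in column_counts(msa)]


aln1 = ["AACTTGC", "AAGTCGT", "CACTTCT"]
aln2 = ["AACTTGT", "AACTTGT", "AACTTCT"]

same_consensus = consensus(aln1) == consensus(aln2)
differing = [i + 1 for i, (p, q) in enumerate(zip(profile(aln1), profile(aln2)))
             if p != q]
print("same consensus:", same_consensus)
print("columns whose frequencies differ:", differing)
```

A search built on the consensus alone therefore cannot distinguish the two families, while a profile-based search can.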
Sequence LOGO
A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T

A gc c g G C T
http://weblogo.berkeley.edu

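In a sequence logo the total height of each column reflects how conserved it is, commonly measured in bits as log2(alphabet size) minus the column's Shannon entropy. The sketch below computes that quantity for the alignment on the slide. It is a simplification under stated assumptions: gaps are ignored when computing frequencies and no small-sample correction is applied, so the numbers need not match what a logo tool draws. The function name is hypothetical.

```python
from collections import Counter
from math import log2


def information_content(msa: list, alphabet_size: int = 4) -> list:
    """Bits of information per column: log2(alphabet_size) - entropy."""
    bits = []
    for col in zip(*msa):
        residues = [ch for ch in col if ch != "-"]
        if not residues:              # all-gap column carries no information
            bits.append(0.0)
            continue
        freqs = [n / len(residues) for n in Counter(residues).values()]
        entropy = -sum(f * log2(f) for f in freqs)
        bits.append(log2(alphabet_size) - entropy)
    return bits


msa = ["AACCGCTCTT",
       "AGCCGCGC-T",
       "A-CAGAGCCT",
       "AAGCACGC-T",
       "ACGGGTGCTT",
       "ATGC-CGC-T"]
for position, b in enumerate(information_content(msa), start=1):
    print(position, f"{b:.2f}")
```

Fully conserved columns (positions 1, 8 and 10 here) reach the 2-bit maximum for a four-letter alphabet, which is why they appear as the tallest letters.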
Psi-Blast
• Position Specific Iterated - automatic profile-like search
1. Regular blast
2. Construct profile from blast results
3. Blast profile search
4. Final results

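The search, build-profile, search-again loop can be illustrated with a deliberately tiny toy. Everything below is an assumption made for illustration: sequences are equal-length and ungapped, the "score" is the mean profile frequency of a sequence's characters, and the threshold and pseudocount are arbitrary. The real program uses gapped local alignments, log-odds position-specific scores and E-value cutoffs, none of which is modelled here.

```python
from collections import Counter


def build_profile(seqs, alphabet="ACGT", pseudo=0.1):
    """Column frequencies with a small pseudocount so unseen letters score > 0."""
    prof = []
    for col in zip(*seqs):
        counts = Counter(col)
        total = len(col) + pseudo * len(alphabet)
        prof.append({ch: (counts[ch] + pseudo) / total for ch in alphabet})
    return prof


def score(prof, seq):
    """Mean profile frequency of the sequence's characters (toy score)."""
    return sum(col[ch] for col, ch in zip(prof, seq)) / len(prof)


def iterate_search(query, database, threshold=0.55, max_rounds=5):
    hits = {query}                       # round 0: the query alone
    for _ in range(max_rounds):
        prof = build_profile(sorted(hits))
        new_hits = {s for s in database if score(prof, s) >= threshold} | {query}
        if new_hits == hits:             # converged: no change in the hit set
            break
        hits = new_hits
    return hits


database = ["AAAT", "AATT", "CCCC"]
print(sorted(iterate_search("AAAA", database, max_rounds=1)))
print(sorted(iterate_search("AAAA", database)))
```

With a single round only the closest sequence is found; once it is folded into the profile, a more distant one crosses the threshold in the next round, which is the effect the iteration is meant to achieve.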
• Alignment with distantly related proteins

• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme      Relative activity  Km (mM)  kcat (min-1)  kcat/Km (mM-1 min-1)
Wild type   100                0.4      388           969
His48Ala    16                 0.56     75            134
His63Ala    31                 1.0      142           142
His114Ala   28                 0.85     125           147
His124Ala   48                 0.84     321           381
His135Ala   22                 0.59     117           198
His212Ala   <0.007             nd       nd
His268Ala   <0.003             nd       nd
Asp14Ala    5                  0.86     0.56          0.7
Asp113Ala   63                 0.45     238           528
Asp131Ala   68                 0.48     363           755
Asp203Ala   32                 0.91     123           135
Asp214Ala   <0.004             nd       nd

Isopenicillin N Synthase - IPNS
MSA algorithms
• Progressive methods (CLUSTALW, T-Coffee)
• Iterative methods (Dialign)
• Direct optimization (Monte Carlo, genetic algorithms)
• Local methods: eMotifs, Blocks, PSI-BLAST
MSA Editing: Jalview
Conservation
www.es.embnet.org/Services/MolBio/jalview/index.html
MSA formats - fasta
MSA formats - Aln
MSA formats - MSF
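For orientation, an alignment saved in FASTA format is one record per sequence, with gap characters padding every row to the same length. The sketch below reads such text; it assumes "-" as the gap symbol, and the function name read_aligned_fasta and the two example records are illustrative only, not taken from the slides.

```python
def read_aligned_fasta(text):
    """Parse FASTA-formatted alignment text into {name: gapped_sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""   # identifier = first word after '>'
            records[name] = ""
        elif name is not None:
            records[name] += line                # a sequence may wrap over several lines
    # In an alignment every gapped row must have the same length.
    if len({len(seq) for seq in records.values()}) > 1:
        raise ValueError("not an alignment: rows differ in length")
    return records


example = """>seq1
AGGT-C
>seq2
-GTTCG
"""
alignment = read_aligned_fasta(example)
print(alignment)
```

The Aln (CLUSTAL) and MSF formats carry the same rows in interleaved blocks with extra header lines, so they need a different reader.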
Example 1a: a good MSA
Example 1b: making MSA of distantly related proteins
Example 1c: including more distant relatives in the MSA
Example 2: Isopenicillin N Synthase
• Mononuclear iron proteins – electron carrier proteins. Iron atoms are bound to amino acid side chains
• In IPNS the metal ion is coordinated by three protein residues
• IPNS is involved in biosynthesis of penicillin
[Figure: reaction scheme, ACV to Isopenicillin N, labelled Fe+2, Ascorbate, O2 and 2H2O. Active-site drawing: Fe bound to His212 (N), His268 (N), Asp214 (O), the ACV sulfur, O2 (NO) and H2O.]
Research IPNS
• Goal: Identify Fe+2 binding residues
• Possible solutions:
1. In the lab
2. Bioinformatic approach (comparing different IPNS sequences)
Step 1: Multiple alignment of known IPNS
Implementation:
1. Obtain sequences (e.g. from NCBI): IPNS AND Bacteria[Organism]
2. MSA (clustalw) and search for conserved residues in the MSA
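The "search for conserved residues" step can be sketched as a scan for columns in which every row carries the same residue. This is a minimal illustration, assuming the alignment is already available as equal-length gapped strings; the function name conserved_columns and the toy rows are hypothetical and are not IPNS sequences.

```python
def conserved_columns(rows, gap="-"):
    """Return 1-based column numbers where all rows share one non-gap residue."""
    if not rows:
        return []
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("rows must have equal length")
    conserved = []
    for i in range(width):
        column = {r[i] for r in rows}            # distinct characters in this column
        if len(column) == 1 and gap not in column:
            conserved.append(i + 1)
    return conserved


toy = ["MKHLD-A", "MRHLDGA", "MKHIDGA"]
print(conserved_columns(toy))  # [1, 3, 5, 7]
```

When the input sequences are nearly identical, almost every column is reported, so the few functionally essential positions cannot be told apart from the rest.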
MSA – bacteria only
Not enough variation
MSA – bacteria & fungi
Not enough variation
[Figure: alignment panels labelled "bacteria & fungi" and "bacteria"]
Step 2
Goal: Add more enzymes similar to IPNS
Implementation:
• Search in http://www.expasy.org/tools/blast
• Blast IPNS_CEPAC as query
• Select sequences similar to the query over its entire length
• Export in FASTA format
• Run CLUSTALW
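Selecting hits that are similar "over the entire length" can be approximated by a query-coverage filter. The sketch below assumes each hit is reduced to an identifier plus the 1-based start and end of the aligned region on the query; the 0.9 cutoff, the hit names and the query length of 300 are hypothetical values chosen for illustration.

```python
def full_length_hits(hits, query_length, min_coverage=0.9):
    """Keep hits whose aligned query range covers at least min_coverage of the query.

    hits: iterable of (hit_id, query_start, query_end), 1-based and inclusive.
    """
    kept = []
    for hit_id, q_start, q_end in hits:
        coverage = (q_end - q_start + 1) / query_length
        if coverage >= min_coverage:
            kept.append(hit_id)
    return kept


hits = [("hitA", 1, 295), ("hitB", 120, 200), ("hitC", 10, 290)]
print(full_length_hits(hits, query_length=300))  # ['hitA', 'hitC']
```

A hit such as hitB, which matches only a short stretch of the query, would add a fragment to the alignment rather than a comparable full-length enzyme.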
• New multiple alignment: narrowing down the possibilities
Simple multiple alignment
• The known IPNS sequences are very similar
• Sequences of closely related enzymes are also quite similar
• Not enough variability to categorize the active sites
• We need to obtain even more distant sequences (distant homologs)
Step 3: Using the results of the MSA for further searches
Implementation:
1. Obtain an MSA (clustalw)
2. - Construct a consensus sequence and perform a new search
   OR - Construct a profile and perform a new search
3. MSA (clustalw) and search for conserved residues in the MSA
Consensus Sequence
• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column
• Consensus: each position reflects the most common character found at that position

A T C T T G T
A A C T T G T
A A C T T C T
-------------
A A C T T G T   (consensus)
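The consensus rule can be written in a few lines. This is a minimal sketch using the three sequences from the slide; the function name consensus is illustrative, and ties (which do not occur in this example) are simply resolved in favour of the character seen first in the column.

```python
from collections import Counter


def consensus(rows):
    """Most frequent character in each column of an alignment."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


rows = ["ATCTTGT", "AACTTGT", "AACTTCT"]
print(consensus(rows))  # AACTTGT
```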
Profile
• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters in the alignment at each column
• Profile: each position reflects the frequency of each character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

     1     2     3     4     5     6     7
A    1     0.67  0     0     0     0     0
T    0     0.33  0     1     1     0     1
C    0     0     1     0     0     0.33  0
G    0     0     0     0     0     0.67  0
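A frequency profile of this kind can be computed column by column. The sketch below uses the three sequences from the slide; the function name profile and the row order A, T, C, G are illustrative choices, and a gap character, if present, would simply lower the residue frequencies in its column.

```python
from collections import Counter


def profile(rows, alphabet="ATCG"):
    """Return {residue: [frequency in column 1, column 2, ...]}."""
    n = len(rows)
    columns = [Counter(col) for col in zip(*rows)]
    return {a: [c[a] / n for c in columns] for a in alphabet}


rows = ["ATCTTGT", "AACTTGT", "AACTTCT"]
for residue, freqs in profile(rows).items():
    print(residue, " ".join(f"{f:.2f}" for f in freqs))
```

Each column of the result sums to 1 when the alignment contains no gaps.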
Profile vs Consensus
• The following multiple alignments will have the same consensus

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

Consensus of both:
A A C T T G T
Profile vs Consensus
• But have a different profile

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

     1     2     3     4     5     6     7
A    0.67  1     0     0     0     0     0
T    0     0     0     1     0.67  0     0.67
C    0.33  0     0.67  0     0.33  0.33  0.33
G    0     0     0.33  0     0     0.67  0

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

     1     2     3     4     5     6     7
A    1     1     0     0     0     0     0
T    0     0     0     1     1     0     1
C    0     0     1     0     0     0.33  0
G    0     0     0     0     0     0.67  0
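The point of this comparison can be checked directly: the two alignments give identical consensus strings, while their column compositions differ. The sketch below uses the two alignments from the slides; the helper names are illustrative, and because both alignments have three rows, comparing raw column counts is equivalent to comparing frequencies.

```python
from collections import Counter


def consensus(rows):
    """Most frequent character in each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def differing_columns(rows_a, rows_b):
    """1-based columns whose residue counts differ between two alignments."""
    pairs = zip(zip(*rows_a), zip(*rows_b))
    return [i + 1 for i, (a, b) in enumerate(pairs) if Counter(a) != Counter(b)]


aln1 = ["AACTTGC", "AAGTCGT", "CACTTCT"]
aln2 = ["AACTTGT", "AACTTGT", "AACTTCT"]

print(consensus(aln1) == consensus(aln2))  # True
print(differing_columns(aln1, aln2))       # [1, 3, 5, 7]
```

A search with the consensus alone therefore cannot distinguish the two families, whereas a profile keeps the variability seen at columns 1, 3, 5 and 7.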
Sequence LOGO

A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T

[Logo: A gc c g G C T]
http://weblogo.berkeley.edu
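The letter heights in a sequence logo reflect how conserved each column is. A common measure is the information content, log2(alphabet size) minus the Shannon entropy of the column. The sketch below applies it to the six sequences from the slide; ignoring gaps and omitting the small-sample correction are simplifying assumptions, so the numbers need not match what a logo tool draws.

```python
import math
from collections import Counter


def information_content(rows, alphabet_size=4, gap="-"):
    """Bits per column: log2(alphabet_size) minus the column's Shannon entropy."""
    bits = []
    for col in zip(*rows):
        residues = [ch for ch in col if ch != gap]   # gaps are ignored
        if not residues:
            bits.append(0.0)
            continue
        total = len(residues)
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in Counter(residues).values())
        bits.append(math.log2(alphabet_size) - entropy)
    return bits


rows = [
    "AACCGCTCTT",
    "AGCCGCGC-T",
    "A-CAGAGCCT",
    "AAGCACGC-T",
    "ACGGGTGCTT",
    "ATGC-CGC-T",
]
for position, value in enumerate(information_content(rows), start=1):
    print(position, f"{value:.2f}")
```

Fully conserved columns (1, 8 and 10 here) reach the maximum of 2 bits for a four-letter alphabet, while mixed columns such as column 2 score close to zero.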
PSI-BLAST
• Position-Specific Iterated BLAST: automatic profile-like search

Regular blast
→ Construct profile from blast results
→ Blast profile search
→ Final results
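The loop of search, profile construction and profile search can be illustrated with a toy model. The sketch below is not PSI-BLAST itself: it assumes ungapped sequences of equal length, a uniform background, a simple pseudocount and a fixed score threshold, and all names (build_pssm, score, iterated_search) and the sequence GGGAACC are hypothetical. The real program uses gapped local alignments, sequence weighting and E-value cutoffs.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_pssm(seqs, pseudocount=1.0):
    """Log-odds score per column and residue against a uniform background."""
    background = 1.0 / len(ALPHABET)
    total = len(seqs) + pseudocount * len(ALPHABET)
    pssm = []
    for column in zip(*seqs):
        counts = Counter(column)
        pssm.append({a: math.log2((counts[a] + pseudocount) / total / background)
                     for a in ALPHABET})
    return pssm


def score(pssm, seq):
    """Sum of the per-column scores of seq under the profile."""
    return sum(col[ch] for col, ch in zip(pssm, seq))


def iterated_search(query, database, threshold, max_rounds=5):
    """Rebuild the profile from the current hits until the hit set stops changing."""
    hits = [query]
    for _ in range(max_rounds):
        pssm = build_pssm(hits)
        found = [query] + [s for s in database if score(pssm, s) >= threshold]
        if set(found) == set(hits):
            break
        hits = found
    return hits


database = ["AACTTCT", "ATCTTGT", "GGGAACC", "AAGTCGT"]
print(iterated_search("AACTTGT", database, threshold=3.0, max_rounds=1))
print(iterated_search("AACTTGT", database, threshold=3.0))
```

With a single round, only the two closest sequences pass the threshold. Once they are folded into the profile, the more divergent AAGTCGT also scores above it, which is the effect that makes an iterated profile search useful for finding distant homologs.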
• Alignment with distantly related proteins
• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme      Relative activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type   100                 0.4       388            969
His48Ala    16                  0.56      75             134
His63Ala    31                  1.0       142            142
His114Ala   28                  0.85      125            147
His124Ala   48                  0.84      321            381
His135Ala   22                  0.59      117            198
His212Ala   <0.007              nd        nd
His268Ala   <0.003              nd        nd
Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      238            528
Asp131Ala   68                  0.48      363            755
Asp203Ala   32                  0.91      123            135
Asp214Ala   <0.004              nd        nd

Isopenicillin N Synthase – IPNS
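The way the mutagenesis table singles out the ligands can be made explicit by flagging mutants whose activity is essentially abolished. The sketch below assumes the relative activities are percentages of wild type and that the three sub-unit entries are upper bounds (for example <0.007); the 1% cutoff and the function name abolished are illustrative choices.

```python
# Relative activity of each alanine mutant (wild type = 100).
# Entries reported as "<x" are stored as that upper bound (an assumption).
relative_activity = {
    "His48Ala": 16, "His63Ala": 31, "His114Ala": 28, "His124Ala": 48,
    "His135Ala": 22, "His212Ala": 0.007, "His268Ala": 0.003,
    "Asp14Ala": 5, "Asp113Ala": 63, "Asp131Ala": 68, "Asp203Ala": 32,
    "Asp214Ala": 0.004,
}


def abolished(activities, cutoff=1.0):
    """Mutants whose relative activity falls below the cutoff, sorted by name."""
    return sorted(m for m, a in activities.items() if a < cutoff)


print(abolished(relative_activity))  # ['Asp214Ala', 'His212Ala', 'His268Ala']
```

Only the three mutants at the positions named in the bullet lose essentially all activity; every other substitution retains a measurable fraction of it.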
45
Example 1c including more distant relatives in the MSA
Example 2 Isopenicillin N Synthasebull Mononuclear iron proteins ndash electron carrier proteins
Iron atoms are bound to amino acid side chainsbull In IPNS the metal ion is coordinated by three protein
residues bull IPNS is involved in biosynthesis of penicillin
46
N
SN
OCOOHO
H
H
Me
Me
COOH
NH2
N
SN
OCOOH
NH2 Me
Me
COOHO
ACV
Isopenicillin N
Fe+2
Ascorbate
O2
2H2O Fe
N
NHis268
H
N
NHis212
H
SACV
O2 (NO)
H2O
OAsp214
O
Research IPNS
bull Goal Identify Fe+2 binding residuesbull Possible solutions
1 In the lab2 Bioinformatic approach (comparing different
IPNS sequences)
47
Step 1
Multiple alignment of known IPNS
Implementation1 Obtain sequence (eg for NCBI) IPNS AND Bacteria[Organism] 2 MSA (clustalw) and search for conserved residues in the MSA
48
MSA ndash bacteria only
49
Not enough variation
50
MSA ndash bacteria amp fungi
Not enough variation
bacteria amp fungi
bacteria
Step 2Goal Add more enzymes similar to IPNS
Implementationbull Search in httpwwwexpasyorgtoolsblastbull Blast IPNS_CEPAC as querybull Select sequences similar to the query in the entire lengthbull Export in FASTA formatbull Run CLUSTALW
51
52
Step 2Goal Add more enzymes similar to IPNS
ImplementationbullSearch in httpwwwexpasyorgtoolsblast
bullBlast IPNS_CEPAC as querybullSelect sequences similar to the query in the entire length
bullExport in FASTA formatbullRun CLUSTALW
53
bull New multiple alignment narrowing down the possibilities
54
Simple multiple alignment
• The known IPNS sequences are very similar
• The sequences of closely related enzymes are also quite similar
• Not enough variability to identify the active-site residues
• We need to obtain even more distant sequences (distant homologs)
Step 3
Using the results of the MSA for further searches
Implementation
1. Obtain an MSA (clustalw)
2. Construct a consensus sequence and perform a new search, OR construct a profile and perform a new search
3. MSA (clustalw) and search for conserved residues in the MSA
Consensus Sequence
• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column

A T C T T G T
A A C T T G T
A A C T T C T

Consensus: A A C T T G T
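The consensus rule above (most frequent character per column) can be written out as a short Python sketch. The function name `consensus` is chosen for this example, and ties are not handled in any principled way here: `Counter.most_common` simply returns one of the most frequent characters.

```python
from collections import Counter


def consensus(alignment):
    """Return the most frequent character of each column, joined into one string."""
    columns = zip(*alignment)  # transpose: one tuple of characters per column
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)


# The three aligned sequences from the slide.
msa = ["ATCTTGT", "AACTTGT", "AACTTCT"]
print(consensus(msa))  # AACTTGT
```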
Profile
• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters in the alignment at each column
• Profile: each position reflects the frequency of each character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

     1     2     3     4     5     6     7
A    1     0.67  0     0     0     0     0
T    0     0.33  0     1     1     0     1
C    0     0     1     0     0     0.33  0
G    0     0     0     0     0     0.67  0
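A frequency profile of this kind can be computed directly from the alignment. The following is a minimal sketch under simple assumptions: the function name `profile` and the fixed DNA alphabet are choices made for this example, frequencies are plain relative counts, and no pseudocounts or sequence weighting are applied (real profile tools usually add both).

```python
def profile(alignment, alphabet="ATCG"):
    """Map each letter to its list of per-column relative frequencies."""
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    return {
        letter: [
            sum(row[i] == letter for row in alignment) / n_seqs
            for i in range(n_cols)
        ]
        for letter in alphabet
    }


# The three aligned sequences from the slide.
msa = ["ATCTTGT", "AACTTGT", "AACTTCT"]
for letter, freqs in profile(msa).items():
    print(letter, " ".join(f"{f:.2f}" for f in freqs))
```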
Profile vs Consensus
• The following multiple alignments will have the same consensus

A A C T T G C
A A G T C G T
C A C T T C T

A A C T T G T
A A C T T G T
A A C T T C T

Consensus: A A C T T G T
Profile vs Consensus
• But have a different profile

A A C T T G C
A A G T C G T
C A C T T C T

     1     2     3     4     5     6     7
A    0.67  1     0     0     0     0     0
T    0     0     0     1     0.67  0     0.67
C    0.33  0     0.67  0     0.33  0.33  0.33
G    0     0     0.33  0     0     0.67  0

A A C T T G T
A A C T T G T
A A C T T C T

     1     2     3     4     5     6     7
A    1     1     0     0     0     0     0
T    0     0     0     1     1     0     1
C    0     0     1     0     0     0.33  0
G    0     0     0     0     0     0.67  0
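The point of the two slides can be checked mechanically: the two alignments share one consensus, yet their per-column character counts differ. This sketch uses helper names (`column_counts`, `consensus`) invented for the example and compares raw counts rather than normalised frequencies, which is equivalent here because both alignments contain three sequences.

```python
from collections import Counter


def column_counts(alignment):
    """One Counter of characters per alignment column."""
    return [Counter(col) for col in zip(*alignment)]


def consensus(alignment):
    """Most frequent character of each column."""
    return "".join(c.most_common(1)[0][0] for c in column_counts(alignment))


first = ["AACTTGC", "AAGTCGT", "CACTTCT"]
second = ["AACTTGT", "AACTTGT", "AACTTCT"]

same_consensus = consensus(first) == consensus(second)
# 1-based column numbers where the two profiles disagree.
differing = [
    i + 1
    for i, (a, b) in enumerate(zip(column_counts(first), column_counts(second)))
    if a != b
]
print(same_consensus, differing)
```

A search with the consensus alone would treat both families identically, while a profile keeps the information that, for example, column 1 of the first family tolerates C.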
Sequence LOGO

A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T

A gc c g G C T

http://weblogo.berkeley.edu
PSI-BLAST
• Position-Specific Iterated BLAST - automatic profile-like search
1. Regular blast
2. Construct profile from blast results
3. Blast profile search
4. Final results
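The search, build-profile, search-again loop can be illustrated with a deliberately crude toy in Python. This is not how PSI-BLAST scores sequences: the real program builds a position-specific scoring matrix and works with gapped alignments and E-values. Here the "profile" is just the set of residues seen so far in each column, sequences are assumed to be pre-aligned and of equal length, and the function names, the 0.8 threshold and the toy database are all assumptions made for this sketch.

```python
def seen_residues(members):
    """Crude 'profile': the set of residues observed in each column so far."""
    return [set(col) for col in zip(*members)]


def score(seq, prof):
    """Fraction of positions whose residue has already been seen in that column."""
    return sum(res in col for res, col in zip(seq, prof)) / len(seq)


def iterated_search(query, database, threshold=0.8, max_rounds=10):
    """Repeat: build profile from query + hits, rescan database, stop when hits are stable."""
    hits = []
    for _ in range(max_rounds):
        prof = seen_residues([query] + hits)
        new_hits = [s for s in database if score(s, prof) >= threshold]
        if new_hits == hits:  # converged: no change in the hit list
            break
        hits = new_hits
    return hits


# Invented toy database of equal-length sequences.
database = ["ACGTACGA", "TCGTACGA", "TTTTTTTT"]
print(iterated_search("ACGTACGT", database))
```

In this toy run the second database sequence is too different from the query to pass in the first round, but it passes once the first hit has widened the profile, which is the effect that lets iterated profile searches reach more distant relatives.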
• Alignment with distantly related proteins
• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme      Relative activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type   100                 0.4       388            969
His48Ala    16                  0.56      75             134
His63Ala    31                  1.0       142            142
His114Ala   28                  0.85      125            147
His124Ala   48                  0.84      321            381
His135Ala   22                  0.59      117            198
His212Ala   <0.007              nd        nd
His268Ala   <0.003              nd        nd
Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      238            528
Asp131Ala   68                  0.48      363            755
Asp203Ala   32                  0.91      123            135
Asp214Ala   <0.004              nd        nd

Isopenicillin N Synthase - IPNS
55
Step 3Using the results of the MSA for further searches
Implementation1 Obtain an MSA (clustalw)2 - Construct a consensus sequence and perform a new search
OR - Construct a profile and perform a new search
3 MSA (clustalw) and search for conserved residues in the MSA
56
Consensus Sequence
bull We can deduce a consensus sequence from the multiple sequence alignment The consensus sequence holds the most frequent character of the alignment at each column
bull Consensus each position reflects the most common character found at a position
A T C T T G T
A A C T T G T
A A C T T C T
A A C T T G T
Profile
• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters of the alignment at each column.
• Profile: each position reflects the frequency of each character found at that position.
A T C T T G T
A A C T T G T
A A C T T C T
    1     2     3     4     5     6     7
A   1     0.67  0     0     0     0     0
T   0     0.33  0     1     1     0     1
C   0     0     1     0     0     0.33  0
G   0     0     0     0     0     0.67  0
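A frequency profile like the one above can be computed directly from the aligned rows. The following is a minimal sketch under stated assumptions: the function name `profile`, the alphabet order, and the choice to divide by the number of sequences (so gap characters, if present, would count in the denominator) are illustrative and not taken from the slides.

```python
from collections import Counter


def profile(alignment, alphabet="ATCG"):
    """Map each character to its list of per-column relative frequencies."""
    n_seqs = len(alignment)
    table = {ch: [] for ch in alphabet}
    for column in zip(*alignment):
        counts = Counter(column)
        for ch in alphabet:
            # Counter returns 0 for characters absent from the column.
            table[ch].append(counts[ch] / n_seqs)
    return table


alignment = ["ATCTTGT", "AACTTGT", "AACTTCT"]
for ch, freqs in profile(alignment).items():
    print(ch, " ".join(f"{f:.2f}" for f in freqs))
```

Each printed row corresponds to one character of the alphabet, and each value to one alignment column.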
Profile vs Consensus
• The following multiple alignments have the same consensus:
Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T
Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T
Consensus of both:
A A C T T G T
• But they have different profiles.
Profile of alignment 1:
    1     2     3     4     5     6     7
A   0.67  1     0     0     0     0     0
T   0     0     0     1     0.67  0     0.67
C   0.33  0     0.67  0     0.33  0.33  0.33
G   0     0     0.33  0     0     0.67  0
Profile of alignment 2:
    1     2     3     4     5     6     7
A   1     1     0     0     0     0     0
T   0     0     0     1     1     0     1
C   0     0     1     0     0     0.33  0
G   0     0     0     0     0     0.67  0
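To make the difference concrete, the sketch below checks that the two alignments share a consensus but not a profile, and then scores one sequence against each profile. The scoring rule (product of the column frequencies) and all function names are assumptions chosen for illustration; real profile searches use more elaborate scores.

```python
from collections import Counter


def column_frequencies(alignment):
    """Per-column dict of character -> relative frequency."""
    n = len(alignment)
    return [{ch: c / n for ch, c in Counter(col).items()}
            for col in zip(*alignment)]


def consensus(alignment):
    """Most frequent character per column (on ties, first seen wins)."""
    return "".join(Counter(col).most_common(1)[0][0]
                   for col in zip(*alignment))


def match_probability(sequence, freqs):
    """Product of the profile frequencies of the sequence's characters."""
    p = 1.0
    for ch, column in zip(sequence, freqs):
        p *= column.get(ch, 0.0)  # unseen character -> frequency 0
    return p


first = ["AACTTGC", "AAGTCGT", "CACTTCT"]
second = ["AACTTGT", "AACTTGT", "AACTTCT"]

print(consensus(first), consensus(second))      # same consensus
print(column_frequencies(first) == column_frequencies(second))  # False
print(match_probability("CACTTCT", column_frequencies(first)))
print(match_probability("CACTTCT", column_frequencies(second)))
```

The sequence CACTTCT gets a non-zero score under the first profile and zero under the second, although a consensus-based comparison could not tell the two alignments apart.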
Sequence LOGO
A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T
A gc c g G C T
http://weblogo.berkeley.edu
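A sequence logo is commonly drawn so that the height of each column stack reflects how conserved the column is, measured in bits. The sketch below computes that per-column value for the alignment above as log2(4) minus the Shannon entropy of the column. It is an illustration under assumptions that are not stated on the slide: gaps are ignored when computing frequencies, no small-sample correction is applied, and the function name `information_content` is made up here.

```python
import math
from collections import Counter


def information_content(alignment, alphabet="ACGT"):
    """Per-column information in bits: log2(|alphabet|) - Shannon entropy.

    Assumptions of this sketch: gap characters are ignored when computing
    frequencies, and no small-sample correction is applied.
    """
    max_bits = math.log2(len(alphabet))
    bits = []
    for column in zip(*alignment):
        counts = Counter(ch for ch in column if ch in alphabet)
        total = sum(counts.values())
        if total == 0:
            bits.append(0.0)  # a column of gaps carries no information
            continue
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in counts.values())
        bits.append(max_bits - entropy)
    return bits


alignment = [
    "AACCGCTCTT",
    "AGCCGCGC-T",
    "A-CAGAGCCT",
    "AAGCACGC-T",
    "ACGGGTGCTT",
    "ATGC-CGC-T",
]
bits = information_content(alignment)
print([round(b, 2) for b in bits])
```

Fully conserved columns (the first, eighth, and last) reach the maximum of 2 bits, while the third column, split evenly between C and G, carries 1 bit.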
PSI-BLAST
• Position-Specific Iterated BLAST: an automatic profile-like search
1. Regular BLAST
2. Construct a profile from the BLAST results
3. BLAST profile search
4. Final results
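The search-profile-search loop outlined above can be illustrated with a toy model. This is not BLAST: the sketch assumes ungapped DNA sequences of equal length, uses fraction identity for the first round and a log-odds profile with pseudocounts for later rounds, and every name, cutoff, and sequence in it is hypothetical. Real PSI-BLAST works on proteins with local alignment and E-value thresholds.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def build_profile(sequences, pseudocount=1.0):
    """Per-column log2-odds scores against a uniform background."""
    background = 1.0 / len(ALPHABET)
    total = len(sequences) + pseudocount * len(ALPHABET)
    profile = []
    for column in zip(*sequences):
        counts = Counter(column)
        profile.append({
            ch: math.log2(((counts[ch] + pseudocount) / total) / background)
            for ch in ALPHABET
        })
    return profile


def profile_score(sequence, profile):
    """Sum of the column scores of the sequence's characters."""
    return sum(col[ch] for ch, col in zip(sequence, profile))


def iterated_search(query, database, identity_cutoff=0.7,
                    score_cutoff=4.0, max_rounds=5):
    # Round 1: plain sequence-versus-sequence search.
    hits = {s for s in database if identity(query, s) >= identity_cutoff}
    for _ in range(max_rounds):
        # Build a profile from the current hits (plus the query itself).
        profile = build_profile(sorted(hits | {query}))
        # Search the database again, this time with the profile.
        new_hits = {s for s in database
                    if profile_score(s, profile) >= score_cutoff}
        if new_hits == hits:  # converged: no change in the hit set
            break
        hits = new_hits
    return hits


database = ["AACTTGT", "AACTTCT", "AAGTTCT", "CAGTTCT", "GGGAACC"]
print(sorted(iterated_search("AACTTGT", database)))
```

In this toy database, CAGTTCT matches the query at only 4 of 7 positions and is missed by the first round, but it is recovered once the profile built from the first-round hits is used, which is the effect the iteration is meant to achieve for distantly related sequences.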
• Alignment with distantly related proteins
• Experimental evidence supports the finding that His212, Asp214, and His268 are the endogenous ligands that bind Fe2+ in IPNS.
Enzyme      Relative   Km     kcat      kcat/Km
            activity   (mM)   (min-1)   (mM-1 min-1)
Wild type   100        0.4    38.8      96.9
His48Ala    16         0.56   7.5       13.4
His63Ala    31         1.0    14.2      14.2
His114Ala   28         0.85   12.5      14.7
His124Ala   48         0.84   32.1      38.1
His135Ala   22         0.59   11.7      19.8
His212Ala   <0.007     nd     nd
His268Ala   <0.003     nd     nd
Asp14Ala    5          0.86   0.56      0.7
Asp113Ala   63         0.45   23.8      52.8
Asp131Ala   68         0.48   36.3      75.5
Asp203Ala   32         0.91   12.3      13.5
Asp214Ala   <0.004     nd     nd
Isopenicillin N Synthase
- IPNS