Post on 02-Jan-2016
Biology 4900
Biocomputing
Chapter 3
Pairwise Sequence Alignment
Comparing protein sequences
• Comparing protein sequences is usually more informative than comparing nucleotide sequences. Why?
– Changing the base at the 3rd position in a codon often does not change the AA (Ex: Both UUU and UUC encode phenylalanine)
– Different AAs may share similar chemical properties (Ex: hydrophobic residues A, V, L, I)
– Relationships between related but mismatched AAs in sequence analysis can be accounted for using scoring systems (matrices).
– Protein sequence comparisons can ID sequence homologies from proteins sharing a common ancestor as far back as 1 × 10^9 years ago (vs. 600 × 10^6 for DNA).
Applications of Sequence Analyses
• Codons (3 RNA bases in sequence) determine each amino acid that will build the protein expressed
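The codon-to-amino-acid mapping can be sketched in a few lines of Python. This is only an illustration: the table below is a small subset of the standard genetic code, and the function name `translate` and the example RNA strings are made up for this sketch.

```python
# Partial codon table (subset of the standard genetic code).
# Only the codons needed for the example are listed.
CODON_TABLE = {
    "AUG": "M",                                       # methionine (start)
    "UUU": "F", "UUC": "F",                           # phenylalanine
    "GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A",   # alanine
}

def translate(rna):
    """Translate an RNA string codon by codon using the partial table above."""
    usable = len(rna) - len(rna) % 3          # ignore a trailing partial codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[c] for c in codons)

# Two RNA sequences that differ only at 3rd codon positions
print(translate("AUGUUUGCU"))
print(translate("AUGUUCGCC"))
```

Both calls give the same protein string, which is one reason protein comparisons can be more informative than nucleotide comparisons.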
Amino acids by similar biophysical properties
http://kimwootae.com.ne.kr/apbiology/chap2.htm
Sequence Homology
• Homology: Two sequences are homologous if they share a common ancestor.
• No “degrees of homology”: only homologous or not
• Almost always share similar 3D structure
– Ex. myoglobin and beta globin
– Sequences can change significantly over time, but 3D structure changes more slowly
Pevsner, Bioinformatics and Functional Genomics, 2009
Beta-globin sub-unit of adult hemoglobin (2H35.pdb, in blue), superimposed over myoglobin (3RGK.pdb, in red). These sequences probably separated 600 million years ago.
Sequence Identity and Similarity
• Identity: How closely two sequences match one another.
– Unlike homology, identity can be measured quantitatively
• Similarity: Pairs of residues that are structurally or functionally related (conservative substitutions).
Pevsner, Bioinformatics and Functional Genomics, 2009
>lcl|28245 3CLN:A|PDBID|CHAIN|SEQUENCE
Length=148

 Score = 268 bits (684), Expect = 3e-97, Method: Compositional matrix adjust.
 Identities = 130/148 (88%), Positives = 143/148 (97%), Gaps = 0/148 (0%)

Query  1    AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN  60
            A+QLTEEQIAEFKEAF+LFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN
Sbjct  1    ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN  60

Query  61   GTIDFPEFLSLMARKMKEQDSEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEKLTDDE  120
            GTIDFPEFL++MARKMK+ DSEEE+ EAF+VFD+DGNG ISAAELRHVMTNLGEKLTD+E
Sbjct  61   GTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEE  120

Query  121  VDEMIREADIDGDGHINYEEFVRMMVSK  148
            VDEMIREA+IDGDG +NYEEFV+MM +K
Sbjct  121  VDEMIREANIDGDGQVNYEEFVQMMTAK  148
88% of the aligned positions contain identical amino acids (identities). This rises to 97% when you also count positions where the amino acids differ but have similar properties (positives).
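As a rough sketch of how an identity percentage is counted, the snippet below compares two already-aligned strings position by position. It uses only the last block of the alignment above (residues 121-148), so its percentage differs from the 88% reported for the full alignment. The function name `percent_identity` is made up for this illustration.

```python
def percent_identity(query, subject):
    """Return (identical, aligned_length, percent) for two aligned strings.

    Both strings must already be aligned and of equal length;
    '-' marks a gap and never counts as an identity.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    identical = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return identical, len(query), 100.0 * identical / len(query)

# Residues 121-148 of the Query and Sbjct lines in the BLAST output
query = "VDEMIREADIDGDGHINYEEFVRMMVSK"
subject = "VDEMIREANIDGDGQVNYEEFVQMMTAK"

identical, length, pct = percent_identity(query, subject)
print(f"Identities = {identical}/{length} ({pct:.0f}%)")
```

Counting "positives" works the same way, except that a position also counts when the substitution has a positive score in the chosen scoring matrix.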
Orthologs and Paralogs
• Orthologs: Homologous sequences in different species that arose from a common ancestral gene during speciation.
– Ex. Humans and rats diverged around 80 million years ago, at which point their myoglobin genes began to diverge.
– Orthologs frequently have similar biological functions.
   • Human and rat myoglobin (oxygen transport)
   • Human and rat CaM (calmodulin)
• Paralogs: Homologous sequences that arose by a mechanism such as gene duplication.
– Within same organism/species
– Ex. Myoglobin and beta globin are paralogs
– Have distinct but related functions.
Pevsner, Bioinformatics and Functional Genomics, 2009
Pairwise Alignment
• Comparing two sequences of amino acids or nucleotides.
• Difficult to do visually.
• Computer algorithms help us by:
– Accelerating the comparison process
– Allowing for “gaps” in sequences (i.e., insertions, deletions)
– Identifying substituted amino acids that are structurally or functionally similar (e.g., D and E).
Pevsner, Bioinformatics and Functional Genomics, 2009
One way to do this is with BLAST (Basic Local Alignment Search Tool)
• Allows rapid sequence comparison of a query sequence against a database.
• The BLAST algorithm is fast, accurate, and web-accessible.
Why use BLAST?
• BLAST searching is fundamental to understanding the relatedness of any favorite query sequence to other known proteins or DNA sequences.
Applications include
• identifying orthologs and paralogs
• discovering new genes or proteins
• discovering variants of genes or proteins
• investigating expressed sequence tags (ESTs)
• exploring protein structure and function
Four steps to becoming a Master BLASTer
http://mestadelsbilder.wordpress.com/2011/10/23/master-blaster/
(1) Choose the sequence (query)
(2) Select the BLAST program
(3) Choose the database to search
(4) Choose optional parameters (may leave as default params the first time)
Then click “BLAST”
Step 1: Choose your sequence
Sequence can be input in FASTA format as text or by file upload, or as accession number
Example of the FASTA format for a BLAST query
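A FASTA record is a header line starting with ">" followed by one or more lines of sequence. The sketch below shows one way to read such text in Python. The header text and the function name `parse_fasta` are hypothetical and only for illustration; the sequence is a fragment of the calmodulin query used earlier in these slides.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Hypothetical header; sequence wrapped over two lines as FASTA allows
example = """>example_query illustrative header only
VDEMIREADIDGDG
HINYEEFVRMMVSK
"""
records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```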
Step 2: Choose the BLAST program
Blastn and blastp are the main programs you will want to use
Step 3: Choose the database to search
nr = non-redundant (most general database)
dbEST = database of expressed sequence tags
dbSTS = database of sequence tagged sites
gss = genomic survey sequences
protein databases
nucleotide databases
Step 4a: Select optional search parameters
Entrez!
algorithm
organism
Step 4a: optional blastp search parameters
Filter, mask
Scoring matrix
Word size
Expect
Right. So, what are these?
Step 4a: optional blastn search parameters
Filter, mask
Match/mismatch scores
Word size
Expect
Algorithm Parameters: Expect
• This setting specifies the statistical significance threshold for reporting matches against database sequences.
• The default value (10) means that 10 such matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990).
• If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported.
• Lower EXPECT thresholds (e.g., set expect to 6) are more stringent, leading to fewer chance matches being reported.
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/
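To make the Expect setting more concrete, here is a minimal sketch of the Karlin-Altschul relationship E = K·m·n·e^(−λS) and the corresponding bit score. The values of λ and K below are assumptions (they are commonly quoted for gapped BLOSUM62 searches and are not taken from these slides), and real BLAST uses effective rather than raw sequence lengths, so treat the numbers as illustrative only.

```python
import math

def bit_score(raw_score, lam, k):
    """Convert a raw alignment score S into a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score.

    query_len (m) and db_len (n) are used as-is here; BLAST itself
    applies edge-effect corrections to both.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Assumed statistical parameters (illustrative, gapped BLOSUM62)
LAMBDA, K = 0.267, 0.041

# Raw score 684 is the calmodulin alignment shown earlier in these slides
print(round(bit_score(684, LAMBDA, K)))

# A higher raw score gives a smaller E-value for the same search space
print(expect_value(50, 148, 1_000_000, LAMBDA, K))
print(expect_value(100, 148, 1_000_000, LAMBDA, K))
```

With these assumed parameters the raw score of 684 converts to about 268 bits, which agrees with the BLAST output shown earlier.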
Algorithm Parameters: Word Size
• BLAST is a heuristic that works by finding word-matches between the query and database sequences. One may think of this process as finding "hot-spots" that BLAST can then use to initiate extensions that might eventually lead to full-blown alignments.
• For nucleotide-nucleotide searches (i.e., "blastn") an exact match of the entire word is required before an extension is initiated, so that one normally regulates the sensitivity and speed of the search by increasing or decreasing the word-size.
• For other BLAST searches non-exact word matches are taken into account based upon the similarity between words. The amount of similarity can be varied so one normally uses just the word-sizes 2 and 3 for these searches.
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/
KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)
MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)
Hit! (then extend in both directions)
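The word-matching step can be sketched as follows, using the RBP and lactoglobulin fragments shown above. This is a simplification: it finds exact 3-letter matches only, whereas protein BLAST also accepts similar (non-exact) words, and it does not perform the extension step. The function name `word_hits` is made up for this illustration.

```python
def word_hits(query, subject, word_size=3):
    """Return (query_pos, subject_pos, word) for every exact shared word."""
    # Index every word of the query by its starting position
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)

    # Scan the subject and look each word up in the query index
    hits = []
    for j in range(len(subject) - word_size + 1):
        word = subject[j:j + word_size]
        for i in index.get(word, []):
            hits.append((i, j, word))
    return hits

query = "KENFDKARFSGTWYAMAKKDPEG"      # RBP fragment from the slide
subject = "MKGLDIQKVAGTWYSLAMAASD"     # lactoglobulin fragment from the slide

hits = word_hits(query, subject)
for i, j, word in hits:
    print(word, "query pos", i, "subject pos", j)
```

Each hit is a candidate "hot-spot" from which an extension in both directions could start.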
Algorithm Parameters: Filter, Mask
Filter
• BLAST searches consist of two phases, finding hits based upon a lookup table and then extending them.
• This option masks only for purposes of constructing the lookup table used by BLAST so that no hits are found based upon low-complexity sequence or repeats (if repeat filter is checked).
• The BLAST extensions are performed without masking and so they can be extended through low-complexity sequence.
Mask (Lower Case)
• With this option selected you can cut and paste a FASTA sequence in upper case characters and denote areas you would like filtered with lower case. This allows you to customize what is filtered from the sequence during the comparison to the BLAST databases.
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/
Advanced options you probably won’t need to use
Algorithm Parameters: Match/Mismatch Scores
• Many nucleotide searches use a simple scoring system that consists of a "reward" for a match and a "penalty" for a mismatch.
• The (absolute) reward/penalty ratio should be increased as one looks at more divergent sequences.
• A ratio of 0.33 (1/-3) is appropriate for sequences that are about 99% conserved
• A ratio of 0.5 (1/-2) is best for sequences that are 95% conserved
• A ratio of about one (1/-1) is best for sequences that are 75% conserved
States DJ, Gish W, and Altschul SF (1991)
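A minimal sketch of the reward/penalty scheme for an ungapped nucleotide alignment is shown below. The two 10-base sequences are invented for illustration, and the function name `nucleotide_score` is hypothetical.

```python
def nucleotide_score(seq_a, seq_b, reward=1, penalty=-3):
    """Score an ungapped nucleotide alignment with a match reward and a mismatch penalty."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(reward if a == b else penalty for a, b in zip(seq_a, seq_b))

# Made-up example: 9 matches and 1 mismatch (position 5)
a = "ACGTACGTAC"
b = "ACGTTCGTAC"

print(nucleotide_score(a, b, reward=1, penalty=-3))  # stringent scheme (1/-3)
print(nucleotide_score(a, b, reward=1, penalty=-1))  # permissive scheme (1/-1)
```

The same mismatch costs less under the 1/-1 scheme, which is why that ratio tolerates more divergent sequences.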
Algorithm Parameters: Matrices
• A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues.
• Some matrices are good for comparing sequences that diverge very little, while other matrices are good for comparing sequences that diverge a lot.
• The BLOSUM-62 matrix is among the best for detecting most weak protein similarities.
• The BLOSUM-45 matrix may be better for particularly long and weak alignments.
• The older PAM matrices may be better for short alignments, as these need to have a higher percentage of matching residues to exceed background noise (be detectable beyond random chance).
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/
Matrices and Gap Costs
Query Length   Substitution Matrix   Gap Costs
<35            PAM-30                (9,1)
35-50          PAM-70                (10,1)
50-85          BLOSUM-80             (10,1)
>85            BLOSUM-62             (10,1)
The raw score of an alignment is the sum of the scores for aligning pairs of residues and the scores for gaps. Gapped BLAST and PSI-BLAST use "affine gap costs" which charge the score -a for the existence of a gap, and the score -b for each residue in the gap. Thus a gap of k residues receives a total score of -(a+bk); specifically, a gap of length 1 receives the score -(a+b).
Calculate the score in BLOSUM-62 for a gap with 7 residues…
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/
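The affine gap formula −(a + bk) is easy to check in code. The sketch below works the 7-residue exercise, assuming the (10,1) gap costs listed for BLOSUM-62 in the table above; the function name `affine_gap_score` is made up for this illustration.

```python
def affine_gap_score(k, a, b):
    """Score of a gap of k residues: -a for existence plus -b per residue."""
    return -(a + b * k)

# Gap costs (a, b) = (10, 1), as listed for BLOSUM-62 in the table above
print(affine_gap_score(7, 10, 1))   # gap of 7 residues
print(affine_gap_score(1, 10, 1))   # gap of length 1: -(a + b)
```

One long gap is cheaper than several short ones of the same total length, because the existence cost a is charged only once.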
Conservative Substitutions in Matrices
Scoring may vary based on conserved substitutions of amino acids: i.e., amino acids with similar properties will not lose as many points as AAs with very different properties
• Basic AAs: K, R, H
• Acidic AAs: D, E
• Hydroxylated AAs: S, T
• Hydrophobic AAs: G, V, L, I, M, F, P, W, Y
Pevsner, Bioinformatics and Functional Genomics, 2009
Dayhoff Model: Building a Scoring Matrix
• In 1978, Margaret Dayhoff provided one of the first models of a scoring matrix
• Model was based on rules by which evolutionary changes occur in proteins
• Catalogued 1000’s of proteins, considered which specific amino acid substitutions occurred when 2 homologous proteins aligned
• Assumes substitution patterns in closely-related proteins can be extrapolated to more distantly-related proteins
• An accepted point mutation (PAM) is an AA replacement accepted by natural selection
• Based on observed mutations, not necessarily on related AA properties
• Probable mutations are rewarded, while unlikely mutations are penalized
• Scores for comparison of 2 residues (i, j) based on the following equation:
$s_{i,j} = 10 \log_{10}\left(q_{i,j} / p_i\right)$
Here, $q_{i,j}$ is the probability of an observed substitution, while $p_i$ is the likelihood of observing the AA (i) as a result of chance.
Pevsner, Bioinformatics and Functional Genomics, 2009
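As a worked example of the log-odds idea, the sketch below assumes the conventional Dayhoff form (10 times the base-10 logarithm of the ratio, rounded to an integer). The inputs are the tryptophan values from the PAM250 mutation probability matrix (55%, i.e. 0.55) and the normalized frequency table (0.012) that appear later in these slides. The function name `log_odds_score` is made up for this illustration.

```python
import math

def log_odds_score(q_ij, p_i):
    """Log-odds score: 10 * log10(observed probability / chance frequency), rounded."""
    return round(10 * math.log10(q_ij / p_i))

# Trp staying Trp after 250 PAMs: q = 0.55; background frequency of Trp: p = 0.012
print(log_odds_score(0.55, 0.012))
```

The result, 17, matches the Trp/Trp entry of the PAM250 log-odds matrix. A ratio above 1 (more frequent than chance) gives a positive score, and a ratio below 1 gives a negative score.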
Practical Lessons from the Dayhoff Model
Less mutable amino acids likely play more important structural and functional roles
Mutable amino acids fulfill functions that can be filled by other amino acids with similar properties
Common substitutions tend to require only a single nucleotide change in codon
Amino acids that can be created from more than 1 codon are more likely to be created as a substitute (See p. 63, textbook)
Changes to sequence that do not alter structure and function of protein likely to be more tolerated in nature
Pevsner, Bioinformatics and Functional Genomics, 2009
PAM250 Mutation Probability Matrix
Columns: original AA. Rows: replacement AA.
      Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
Ala A  13   6   9   9   5   8   9  12   6   8   6   7   7   4  11  11  11   2   4   9
Arg R   3  17   4   3   2   5   3   2   6   3   2   9   4   1   4   4   3   7   2   2
Asn N   4   4   6   7   2   5   6   4   6   3   2   5   3   2   4   5   4   2   3   3
Asp D   5   4   8  11   1   7  10   5   6   3   2   5   3   1   4   5   5   1   2   3
Cys C   2   1   1   1  52   1   1   2   2   2   1   1   1   1   2   3   2   1   4   2
Gln Q   3   5   5   6   1  10   7   3   7   2   3   5   3   1   4   3   3   1   2   3
Glu E   5   4   7  11   1   9  12   5   6   3   2   5   3   1   4   5   5   1   2   3
Gly G  12   5  10  10   4   7   9  27   5   5   4   6   5   3   8  11   9   2   3   7
His H   2   5   5   4   2   7   4   2  15   2   2   3   2   2   3   3   2   2   3   2
Ile I   3   2   2   2   2   2   2   2   2  10   6   2   6   5   2   3   4   1   3   9
Leu L   6   4   4   3   2   6   4   3   5  15  34   4  20  13   5   4   6   6   7  13
Lys K   6  18  10   8   2  10   8   5   8   5   4  24   9   2   6   8   8   4   3   5
Met M   1   1   1   1   0   1   1   1   1   2   3   2   6   2   1   1   1   1   1   2
Phe F   2   1   2   1   1   1   1   1   3   5   6   1   4  32   1   2   2   4  20   3
Pro P   7   5   5   4   3   5   4   5   5   3   3   4   3   2  20   6   5   1   2   4
Ser S   9   6   8   7   7   6   7   9   6   5   4   7   5   3   9  10   9   4   4   6
Thr T   8   5   6   6   4   5   5   6   4   6   4   6   5   3   6   8  11   2   3   6
Trp W   0   2   0   0   0   0   0   0   1   0   1   0   0   1   0   1   0  55   1   0
Tyr Y   1   1   2   1   3   1   1   1   3   2   2   1   2  15   1   2   2   3  31   2
Val V   7   4   4   4   4   4   4   4   5   4  15  10   4  10   5   5   5  72   4  17
Think of these values as percentages (columns sum to 100).
For example, there is an 18% (0.18) probability of R being replaced by K.
This probability matrix needs to be converted into a scoring matrix.
http://www.icp.ucl.ac.be/~opperd/private/pam250.html
Normalized Frequencies of Amino Acids
Ala 0.096    Asn 0.042
Gly 0.090    Pro 0.041
Lys 0.085    Ile 0.035
Leu 0.085    His 0.034
Val 0.078    Arg 0.034
Thr 0.062    Gln 0.032
Ser 0.057    Tyr 0.030
Asp 0.053    Cys 0.025
Glu 0.053    Met 0.012
Phe 0.045    Trp 0.012
http://www.icp.ucl.ac.be/~opperd/private/pam250.html
**How often a given amino acid appears in a protein (determined by empirical analyses)
PAM250 Log-Odds Matrix
Cys C  12
Ser S   0   2
Thr T  -2   1   3
Pro P  -3   1   0   6
Ala A  -2   1   1   1   2
Gly G  -3   1   0  -1   1   5
Asn N  -4   1   0  -1   0   0   2
Asp D  -5   0   0  -1   0   1   2   4
Glu E  -5   0   0  -1   0   0   1   3   4
Gln Q  -5  -1  -1   0   0  -1   1   2   2   4
His H  -3  -1  -1   0  -1  -2   2   1   1   3   6
Arg R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   8
Lys K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
Met M  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
Ile I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
Leu L  -8  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   8
Val V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
Phe F  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Tyr Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
Trp W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
        C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
      Cys Ser Thr Pro Ala Gly Asn Asp Glu Gln His Arg Lys Met Ile Leu Val Phe Tyr Trp
This is the PAM250 scoring matrix, calculated as follows: $s_{i,j} = 10 \log_{10}\left(q_{i,j} / p_i\right)$, rounded to the nearest integer.
http://www.icp.ucl.ac.be/~opperd/private/pam250.html
BLOSUM62 Scoring Matrix
• BLOck SUbstitution Matrix
• By Henikoff and Henikoff (1992)
• Default scoring matrix for pairwise alignment of sequences using BLAST
• Based on empirical observations of distantly-related proteins
Pevsner, Bioinformatics and Functional Genomics, 2009
A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -1   1   1  -2  -1  -3  -2   5
M  -1  -2  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
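To show how a scoring matrix is used, the sketch below sums BLOSUM62 scores over a short ungapped alignment. Only the five entries needed for the example are copied from the matrix above, and the aligned words GTWYA/GTWYS come from the RBP/lactoglobulin hit shown earlier. The names `BLOSUM62_SUBSET`, `pair_score` and `ungapped_score` are made up for this illustration.

```python
# Five entries copied from the BLOSUM62 matrix above (not the full matrix)
BLOSUM62_SUBSET = {
    ("G", "G"): 6,
    ("T", "T"): 5,
    ("W", "W"): 11,
    ("Y", "Y"): 7,
    ("A", "S"): 1,
}

def pair_score(a, b, matrix):
    """Look up a residue pair; the matrix is symmetric, so try both orders."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]

def ungapped_score(seq_a, seq_b, matrix):
    """Sum the substitution scores over an ungapped alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq_a, seq_b))

# RBP 'GTWYA' aligned with lactoglobulin 'GTWYS'
print(ungapped_score("GTWYA", "GTWYS", BLOSUM62_SUBSET))  # 6 + 5 + 11 + 7 + 1
```

The conservative A/S substitution still contributes a positive score (+1), so the mismatch does not penalize the alignment.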