Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.
Date posted: 21-Dec-2015
GTTCTTCGTGTTTATTTTTAGGAAATTGATGATTGTTTCTCCTTTTAAAATAGTACTGCTGTTTTTTACTAACGACACATTGAAGAAATCACTTTGGATACGCTTACCGTTATCCAGAGCTACAGCGCTACTAATATGTAATACTTCAGCTCCCCTTAATATTGAGATCTTTTTTAACTAGTTAGGTCTACCTTCTCCCCTTCTTCATTTTAGCCTGTTTGGACTAACATAACTTATTTACATAGTGCCATTGAACGATATTTCCCGTTGTGTTAAGGCTGAGAAGAATTTTCCCGACCATCAAGACAGGTGATTTATCATGCAAAAACTTTTTTTCACAGGGCTAACTTGCGTTTATTGTGTTTCCACTCAGTTAAAAAACGAAACGTACTTTAATATTTATAGTACTTCATTCGAACATGCTATTTTTCATACAGCAACCTCACATCTGCACTCATCATTAGATTAGAGGAACATGGATACTTTTCTTTATCTAAGCAGCTAACTCAACTATCAACATGCTATTGAACTAGAGATCCACCTATAACTAACATGACTTTAACAGGGCTAATTTACAGTACTAACTAATTAACTTAGAACATTAACATGATCACCGTCACATTTATTAGAATTTCAAACGCAGTGGAATTTTTTTTTCTAGAAATGGTATCGCTCTATGACCAATAAAAACAGACTGTACTTTCAAATGGTATTATTTATAACAGTTGAACATTTCATAAATATGCGATCAATATAGACCGTTGATATATTTTACTTTTTTTTTTTTAGGAGCTCCAAGAATTTATTTCCTTATAATACAGACACGGTTACATCGCAATTAATTTTCTAATAGTTTTTCATTTTGACCATCTTTCTTTTCCCCAGTGCTAAACACGAACCTTCTTTCTCATTCGTAGATTACTGTTGCAATTACTAACAGCTGTAATAGCCGACAAATTTCTCTCTGCGCGTCCAATTTAGCTATACTGTTGTTGTTTTGTTTTGTCGTACAGTGTTTGGAGAAAAACTTCCATTTCTTACATAGATCATCGCCATTCCTTTCCATAATTTATTCAGCGCTTTGGTATCGATTTACTATTTCCATTTAGACGTTGTTCAAAATTTACTAACAATACTTCAGTTTATAATGGATCCTATACTAACAATTTGTAGTTCATAAATAA
• Multiple alignment of acceptor sites from 268 yeast DNA sequences
– What is the biological signal around the site?
– What are the important positions?
– How can it be visualized?
Biological information
• Logo plot with Information Content
[Figure: Sequence-logo spanning Exon, Intron, Exon]
Entropy - Definition
• The entropy of a random variable is a measure of its uncertainty
• In thermodynamics: G = H - TS
– The entropy S of a system is its degree of disorder
Entropy - Definition
• Entropy of a distribution of amino acids
– The Shannon entropy:
$H(p) = -\sum_a p_a \log_2(p_a)$, where $p$ is an amino acid distribution.
$H(p)$ is measured in bits: $\log_2(2) = 1$, $\log_2(4) = 2$
Multiple alignment of 3 sequences
Seq1: A L P K
Seq2: A V P R
Seq3: A I K R
High entropy - high disorder
Low entropy - low disorder
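Entropy - example
$H(p) = -\sum_a p_a \log_2(p_a)$
Multiple alignment of 3 sequences
Seq1: A L R
Seq2: A V R
Seq3: A I K
Pos1: $H(p) = -[1 \cdot \log_2(1)] = 0$
Pos2: $H(p) = -[\tfrac{1}{3}\log_2(\tfrac{1}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3})] = \log_2(3) \approx 1.58$
Pos3: $H(p) = -[\tfrac{2}{3}\log_2(\tfrac{2}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3})] \approx 0.92$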
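The per-position entropies of this small alignment can be checked with a short script. This is a minimal sketch, not part of the original slides; the function name `column_entropy` is chosen here for illustration, and the frequencies are taken as raw counts in each column with no pseudocounts or sequence weighting.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the symbols observed in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    # Sum the terms -p_a * log2(p_a) over the symbols that occur in the column.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

# The three aligned sequences from the example (Seq1, Seq2, Seq3).
alignment = ["ALR", "AVR", "AIK"]

# zip(*alignment) walks the alignment column by column.
columns = ["".join(col) for col in zip(*alignment)]
entropies = [column_entropy(col) for col in columns]

for pos, h in enumerate(entropies, start=1):
    print(f"Pos{pos}: H(p) = {h:.3f} bits")
```

The fully conserved first column gives zero entropy, the column with three different residues gives log2(3), roughly 1.585 bits, and the 2:1 column gives roughly 0.918 bits.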
Relative Entropy
The Kullback-Leibler distance D
How different is an amino acid distribution $p_a$ from a background distribution $q_a$, i.e. what is the distance D between them?
$D(p\|q) = \sum_a p_a \log_2(p_a/q_a)$
Normally a background distribution of the amino acids is obtained as frequencies from a large database like UniProt.
Background frequencies (in percent):
Ala (A) 7.82   Gln (Q) 3.94   Leu (L) 9.62   Ser (S) 6.87
Arg (R) 5.32   Glu (E) 6.60   Lys (K) 5.93   Thr (T) 5.46
Asn (N) 4.20   Gly (G) 6.94   Met (M) 2.37   Trp (W) 1.16
Asp (D) 5.30   His (H) 2.27   Phe (F) 4.01   Tyr (Y) 3.07
Cys (C) 1.56   Ile (I) 5.90   Pro (P) 4.85   Val (V) 6.71
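As an illustration of the relative entropy, the sketch below evaluates D(p||q) against the background frequencies listed above. It is not from the original slides: the names `BACKGROUND_PERCENT` and `kl_divergence` are made up for this example, and the percentages are renormalised because the listed values sum to about 99.9 rather than exactly 100.

```python
import math

# Background frequencies in percent, copied from the table above.
BACKGROUND_PERCENT = {
    "A": 7.82, "R": 5.32, "N": 4.20, "D": 5.30, "C": 1.56,
    "Q": 3.94, "E": 6.60, "G": 6.94, "H": 2.27, "I": 5.90,
    "L": 9.62, "K": 5.93, "M": 2.37, "F": 4.01, "P": 4.85,
    "S": 6.87, "T": 5.46, "W": 1.16, "Y": 3.07, "V": 6.71,
}

# Assumption: renormalise so that the background sums to exactly 1.
_total = sum(BACKGROUND_PERCENT.values())
q = {aa: value / _total for aa, value in BACKGROUND_PERCENT.items()}

def kl_divergence(p, q):
    """D(p||q) in bits; terms with p_a = 0 contribute nothing."""
    return sum(pa * math.log2(pa / q[aa]) for aa, pa in p.items() if pa > 0)

# A position fully conserved as the rare Trp versus the common Leu.
d_trp = kl_divergence({"W": 1.0}, q)
d_leu = kl_divergence({"L": 1.0}, q)
# A position that simply follows the background distribution.
d_background = kl_divergence(q, q)

print(f"conserved W: {d_trp:.2f} bits")
print(f"conserved L: {d_leu:.2f} bits")
print(f"background:  {d_background:.2f} bits")
```

With a non-uniform background, a position conserved as a rare amino acid scores higher than one conserved as a common amino acid, and a position that matches the background scores zero.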
Information content
$D(p\|q) = \sum_a p_a \log_2(p_a/q_a)$
Often the Information content is used as a measure of the degree of conservation.
$I = \sum_a p_a \log_2(p_a/q_a)$
A special case is that where all amino acids have the same background frequency: $q_a = 1/20$
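Information content
$$
\begin{aligned}
I &= \sum_a p_a \log_2\left(\frac{p_a}{1/20}\right) \\
  &= \sum_a p_a \left[\log_2 p_a - \log_2(1/20)\right] \\
  &= -H(p) - \sum_a p_a \log_2(1/20) \\
  &= -H(p) + \sum_a p_a \log_2(20) \\
  &= -H(p) + \log_2(20) \\
  &= -H(p) + 4.32
\end{aligned}
$$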
Information content
• $I = -H(p) + 4.32 = \sum_a p_a \log_2 p_a + 4.32$
The Information content is at its maximum when the entropy is zero, i.e. at a fully conserved position in a multiple alignment.
Multiple alignment of 3 sequences:
Seq1: A L R
Seq2: A V R
Seq3: A I K
Pos1: $I = [1 \cdot \log_2(1)] + 4.32 = 4.32$
Pos2: $I = [\tfrac{1}{3}\log_2(\tfrac{1}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3})] + 4.32 \approx 2.74$
Pos3: $I = [\tfrac{2}{3}\log_2(\tfrac{2}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3})] + 4.32 \approx 3.40$
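To check the numbers, the information content can be computed two ways: directly as the relative entropy against a uniform background $q_a = 1/20$, and as $\log_2(20) - H(p)$. The following is a small sketch under that uniform-background assumption; the function names are illustrative only.

```python
import math

def info_via_relative_entropy(freqs, alphabet_size=20):
    """I = sum_a p_a * log2(p_a / q_a) with a uniform background q_a = 1/alphabet_size."""
    q = 1.0 / alphabet_size
    return sum(p * math.log2(p / q) for p in freqs if p > 0)

def info_via_entropy(freqs, alphabet_size=20):
    """I = log2(alphabet_size) - H(p)."""
    entropy = sum(-p * math.log2(p) for p in freqs if p > 0)
    return math.log2(alphabet_size) - entropy

# Observed frequencies at the three positions of the example alignment.
positions = {
    "Pos1": [1.0],             # A, A, A
    "Pos2": [1/3, 1/3, 1/3],   # L, V, I
    "Pos3": [2/3, 1/3],        # R, R, K
}

results = {}
for name, freqs in positions.items():
    direct = info_via_relative_entropy(freqs)
    from_entropy = info_via_entropy(freqs)
    results[name] = (direct, from_entropy)
    print(f"{name}: I = {direct:.2f} bits (via entropy: {from_entropy:.2f})")
```

Both routes agree, giving roughly 4.32, 2.74 and 3.40 bits for the three positions.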
Logo plots - HowTo
Count nucleotides at each position:
A 94 88 84 75 78 78 71 69 70 60 68 77 32 49 87 93 93 134 9 266 0 86 66 85 81 89 81 88 82
C 31 45 52 44 56 46 62 54 56 51 46 37 30 42 32 44 30 25 122 1 0 38 65 52 43 62 62 57 43
T 113 110 113 117 104 117 111 120 118 125 136 140 182 155 122 100 124 75 137 0 0 72 85 82 91 83 73 67 96
G 30 25 19 32 30 27 24 25 24 32 18 14 24 22 27 31 21 34 0 1 268 72 52 49 53 34 52 56 47
Convert to frequencies:
A 0.35 0.33 0.31 0.28 0.29 0.29 0.26 0.26 0.26 0.22 0.25 0.29 0.12 0.18 0.32 0.35 0.35 0.50 0.03 0.99 0.00 0.32 0.25 0.32 0.30 0.33 0.30 0.33 0.31
C 0.12 0.17 0.19 0.16 0.21 0.17 0.23 0.20 0.21 0.19 0.17 0.14 0.11 0.16 0.12 0.16 0.11 0.09 0.46 0.00 0.00 0.14 0.24 0.19 0.16 0.23 0.23 0.21 0.16
T 0.42 0.41 0.42 0.44 0.39 0.44 0.41 0.45 0.44 0.47 0.51 0.52 0.68 0.58 0.46 0.37 0.46 0.28 0.51 0.00 0.00 0.27 0.32 0.31 0.34 0.31 0.27 0.25 0.36
G 0.11 0.09 0.07 0.12 0.11 0.10 0.09 0.09 0.09 0.12 0.07 0.05 0.09 0.08 0.10 0.12 0.08 0.13 0.00 0.00 1.00 0.27 0.19 0.18 0.20 0.13 0.19 0.21 0.18
[Figure: Frequency-logo]
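The count-to-frequency step can be sketched in a few lines. The example below uses only three columns of the count table above (columns 19 to 21, counted from the left) to stay short; the variable and function names are illustrative and not from the slides.

```python
# Counts for columns 19-21 of the count table above (268 sequences in total).
counts_by_position = {
    19: {"A": 9, "C": 122, "T": 137, "G": 0},
    20: {"A": 266, "C": 1, "T": 0, "G": 1},
    21: {"A": 0, "C": 0, "T": 0, "G": 268},
}

def to_frequencies(counts):
    """Divide each count by the column total to obtain frequencies."""
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}

frequencies = {pos: to_frequencies(c) for pos, c in counts_by_position.items()}

for pos, freqs in frequencies.items():
    row = "  ".join(f"{base} {f:.2f}" for base, f in freqs.items())
    print(f"position {pos}: {row}")
```

Rounded to two decimals, these values agree with the corresponding columns of the frequency table.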
Logo plots - Information Content
Calculate Information Content
$I = \sum_a p_a \log_2 p_a + \log_2(4)$; the maximal value is 2 bits
• The total height at a position is the 'Information Content', measured in bits.
• The height of a letter is proportional to the frequency of that letter.
• A Logo plot is a visualization of a multiple alignment.
[Figure: Sequence-logo, annotated "~0.5 each" and "Completely conserved"]
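The two height rules can be combined into one small calculation: the stack height at a position is its information content, and each letter takes a share of that height equal to its frequency. The sketch below assumes a uniform DNA background (maximum 2 bits) and applies no small-sample correction; `letter_heights` is a name chosen for this example.

```python
import math

ALPHABET = "ACGT"

def letter_heights(freqs):
    """Return the height in bits of each letter at one logo position."""
    # Information content: I = log2(4) + sum_a p_a * log2(p_a).
    info = math.log2(len(ALPHABET)) + sum(
        freqs[base] * math.log2(freqs[base]) for base in ALPHABET if freqs[base] > 0
    )
    # Each letter is drawn with height p_a * I.
    return {base: freqs[base] * info for base in ALPHABET}

conserved = letter_heights({"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0})
two_way = letter_heights({"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.5})

print("fully conserved G:", conserved)
print("A and T at 0.5 each:", two_way)
```

A completely conserved position gives a single letter of height 2 bits, while a position split evenly between two nucleotides carries 1 bit in total, so each of the two letters is drawn 0.5 bits high.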
Programs to make a Logo plot
• WebLogo
– Requires a multiple alignment as input
– Protein or DNA sequences
– More output formats
• Blast2Logo
– Requires a fasta file as input
– Only protein sequences
– Runs PSI-blast and makes a table of frequencies
– pdf logo plot
WebLogo - http://weblogo.berkeley.edu/
Find important positions
>sp|Q00017|RHA1_ASPAC Rhamnogalacturonan acetylesterase
MKTAALAPLFFLPSALATTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL
What is the next step?
1. Find homologous sequences - how?
- Blast or PsiBlast
- Download sequences
- Make a multiple alignment
- ClustalW or others
- or use the Blast2Logo program
Multiple alignment programs
Blast2logo - http://www.cbs.dtu.dk/biotools/Blast2logo-1.0/
Important positions
Important positions in proteins are conserved positions => high Information Content.
Conserved for a reason:
• Functionally important positions
– Catalytic residues
• Structurally important positions
– Maintain the correct fold of the protein
Blast2logo
Runs iterative Blast, i.e. Psi-Blast
Searches for homologous sequences by use of Position Specific Scoring Matrices (PSSM).
1. Iteration - use the Blosum62 scoring matrix
2. Iteration - make a profile of the sequences found in iteration 1
3. Iteration - make a profile of the sequences found in iteration 2
4. Iteration - calculate amino acid frequencies at each position in the query sequence. Correct for low counts and weight the sequences such that very similar sequences are down-weighted
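The slides do not say how Blast2logo corrects for low counts, so the following is only a generic illustration of the idea, not the program's actual method. It adds a flat pseudocount to every amino acid, which is one common way to avoid zero frequencies when few sequences are aligned; the function name and the pseudocount value of 1.0 are assumptions made for this sketch.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def smoothed_frequencies(counts, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Frequencies with a flat pseudocount added to every symbol (illustrative only)."""
    total = sum(counts.get(a, 0) for a in alphabet) + pseudocount * len(alphabet)
    return {a: (counts.get(a, 0) + pseudocount) / total for a in alphabet}

# A column observed in only three sequences: R, R, K.
raw_counts = {"R": 2, "K": 1}
smoothed = smoothed_frequencies(raw_counts)

print(f"R: {smoothed['R']:.3f}  K: {smoothed['K']:.3f}  A (unobserved): {smoothed['A']:.3f}")
```

With only three sequences the raw frequencies would claim that 18 amino acids never occur at this position; the pseudocount keeps them at a small non-zero frequency, and its influence shrinks as more sequences are added.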
Important positions - counting
Example. Where is the active site?
• Sequence profiles might show you where to look!
• The active site could be around S9, G42, N74, and H195
Exercise
1. Calculate nucleotide frequencies from a multiple alignment of human donor sites
2. Calculate Entropy and Information content
3. Draw (by hand) a Logo plot
4. Use 2 Logo plot programs
5. Learn to interpret Logo & frequency plots
6. Active site residues & structural residues