Transcript: Position-specific scoring matrices - Decrease complexity through info analysis
Slide 1: Position-specific scoring matrices - Decrease complexity through info analysis
Training set including sequences from two Nostocs
71-devB CATTACTCCTTCAATCCCTCGCCCCTCATTTGTACAGTCTGTTACCTTTACCTGAAACAGATGAATGTAGAATTTA
Np-devB CCTTGACATTCATTCCCCCATCTCCCCATCTGTAGGCTCTGTTACGTTTTCGCGTCACAGATAAATGTAGAATTCA
71-glnA AGGTTAATATTACCTGTAATCCAGACGTTCTGTAACAAAGACTACAAAACTGTCTAATGTTTAGAATCTACGATAT
Np-glnA AGGTTAATATAACCTGATAATCCAGATATCTGTAACATAAGCTACAAAATCCGCTAATGTCTACTATTTAAGATAT
71-hetC GTTATTGTTAGGTTGCTATCGGAAAAAATCTGTAACATGAGATACACAATAGCATTTATATTTGCTTTAGTATCTC
71-nirA TATTAAACTTACGCATTAATACGAGAATTTTGTAGCTACTTATACTATTTTACCTGAGATCCCGACATAACCTTAG
Np-nirA CATCCATTTTCAGCAATTTTACTAAAAAATCGTAACAATTTATACGATTTTAACAGAAATCTCGTCTTAAGTTATG
71-ntcB ATTAATGAAATTTGTGTTAATTGCCAAAGCTGTAACAAAATCTACCAAATTGGGGAGCAAAATCAGCTAACTTAAT
Np-ntcB TTATACAAATGTAAATCACAGGAAAATTACTGTAACTAACTATACTAAATTGCGGAGAATAAACCGTTAACTTAGT
71-urt ATTAATTTTTATTTAAAGGAATTAGAATTTAGTATCAAAAATAACAATTCAATGGTTAAATATCAAACTAATATCA
Np-urt TTATTCTTCTGTAACAAAAATCAGGCGTTTGGTATCCAAGATAACTTTTTACTAGTAAACTATCGCACTATCATCA
We might increase the performance of our PSSM if we can filter out columns that don't have "enough information."
Not every column is equally well conserved; some seem to be more informative about what a binding site looks like!
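A minimal sketch (mine, not from the slides) of the first step toward such a filter: tallying which bases appear in each column of the training alignment. Only two of the eleven training sequences are shown, and the variable names are illustrative.

```python
from collections import Counter

# Two of the training sequences from the slide; the remaining nine
# would be added the same way.
alignment = [
    "CATTACTCCTTCAATCCCTCGCCCCTCATTTGTACAGTCTGTTACCTTTACCTGAAACAGATGAATGTAGAATTTA",  # 71-devB
    "CCTTGACATTCATTCCCCCATCTCCCCATCTGTAGGCTCTGTTACGTTTTCGCGTCACAGATAAATGTAGAATTCA",  # Np-devB
]

def column_counts(seqs, col):
    """Tally how often each base occurs in one column of the alignment."""
    return Counter(seq[col] for seq in seqs)

# Columns where one base dominates are the well-conserved, informative ones.
for col in range(len(alignment[0])):
    print(col, dict(column_counts(alignment, col)))
```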
Slide 2: Position-specific scoring matrices - Decrease complexity through info analysis
Uncertainty (Hc) = -Σi [Pic log2(Pic)]   (Pic = frequency of base i in column c)
[Figure: "Uncertainty Distribution" - fraction of columns plotted against uncertainty (H)]
Confusing!!!
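As a concrete illustration (my sketch, not from the slides), the per-column uncertainty behind a distribution like this could be computed as follows, with the observed column frequencies standing in for Pic:

```python
import math
from collections import Counter

def column_uncertainty(seqs, col):
    """Hc = -sum over bases i of Pic * log2(Pic), where Pic is the
    observed frequency of base i in column col of the alignment."""
    counts = Counter(seq[col] for seq in seqs)
    n = sum(counts.values())
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

# Tiny toy alignment: column 0 is perfectly conserved, column 1 is maximally mixed.
print(column_uncertainty(["AC", "AG", "AT", "AA"], 0))  # prints -0.0, i.e. 0 bits
print(column_uncertainty(["AC", "AG", "AT", "AA"], 1))  # 2.0 bits, the maximum
```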
Slide 3: Digression on information theory - Uncertainty when all outcomes are equally probable
Pretend we have a machine that spits out an infinitely long string of nucleotides, but that each one is EQUALLY LIKELY to occur:
G A T G A C T C …
How uncertain are we about the outcome BEFORE we see each new character produced by the machine?
Intuitively, this uncertainty will depend on how many possibilities exist.
Slide 4: Digression on information theory - Quantifying uncertainty when outcomes are equally probable
One way to quantify uncertainty is to ask: "What is the minimum number of questions required to remove all ambiguity about the outcome?"
If the possibilities are A or G or C or T, how many yes/no questions do we need to ask?
Slide 5: Digression on information theory - Quantifying uncertainty when outcomes are equally probable

Decision tree:
AGCT
AG   CT
A  G  C  T

M = 4 (alphabet size)
H = log2(M)
The number of decisions depends on the height of the decision tree.
With M = 4, we are uncertain by log2(4) = 2 bits before each new symbol is made by our machine.
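A quick check of this in code (my sketch, not from the slides): the logarithm of the alphabet size gives the number of yes/no questions, i.e. the height of the balanced decision tree.

```python
import math

# Two yes/no questions resolve four equally likely bases:
# 1) "Is it A or G?"  (splits {A,G,C,T} into {A,G} vs. {C,T})
# 2) "Is it A?"       (splits the remaining pair)
M = 4
print(math.log2(M))  # 2.0 bits of uncertainty per symbol
print(math.log2(2))  # a binary alphabet would need just 1 question
```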
Slide 6: Digression on information theory - Uncertainty when all outcomes are equally probable
After we have received a new symbol from our machine, we are less uncertain.
Intuitively, when we become less uncertain, it means we have gained information.
Information = uncertainty_before - uncertainty_after, i.e. Information = H_before - H_after
Note that only in the special case where no uncertainty remains afterward (H_after = 0) does information = H_before.
In the real world this never happens, because of noise in the system!
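A toy calculation (mine, using the idealized noiseless case the slide describes) to make the subtraction concrete:

```python
import math

H_before = math.log2(4)  # 2 bits of uncertainty before the symbol arrives
H_after = 0.0            # idealized noiseless case: no uncertainty remains
information = H_before - H_after
print(information)  # 2.0 bits gained; with noise, H_after > 0 and we gain less
```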
Slide 7: Digression on information theory
This is necessary when outcomes are not equally probable!
Fine, but where did we get H = -Σ (i = 1 to M) Pi log2(Pi)?
Slide 8: Digression on information theory - Uncertainty with unequal probabilities
Now our machine produces a string of symbols, but some are more likely to occur than others:
PA = 0.6, PG = 0.1, PC = 0.1, PT = 0.2
Slide 9: Digression on information theory - Uncertainty with unequal probabilities
Now our machine produces a string of symbols, but we know that some are more likely to occur than others:
A A A T A A G T C …
Now how uncertain are we about the outcome BEFORE we see each new character?
Are we more or less surprised when we see an "A" or a "C"?
Slide 10: Digression on information theory - Uncertainty with unequal probabilities
Now our machine produces a string of symbols, but we know that some are more likely to occur than others:
A G A A A A T T C …
Do you agree that we are less surprised to see an "A" than we are to see a "G"?
Do you think that the output of our new machine is more or less uncertain?
Slide 11: Digression on information theory - What about when outcomes are not equally probable?
log2(M) = -log2(M^-1) = -log2(1/M) = -log2(P)
where P = 1/M = probability of a symbol appearing
Slide 12: Digression on information theory - What about when outcomes are not equally probable?
PA = 0.6, PG = 0.1, PC = 0.1, PT = 0.2   (M = 4)
Remember that the probabilities of all possible symbols must sum to 1:
Σ (i = 1 to M) Pi = 1
Slide 13: Digression on information theory - How surprised are we to see a given symbol?
ui = -log2(Pi)   (where Pi = probability of the ith symbol)
uA = -log2(0.6) = 0.7
uG = -log2(0.1) = 3.3
uC = -log2(0.1) = 3.3
uT = -log2(0.2) = 2.3
ui is therefore called the surprisal for symbol i.
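In code, the surprisal is a one-liner; this sketch (mine, not from the slides) reproduces the slide's values from the machine's probabilities:

```python
import math

def surprisal(p):
    """u_i = -log2(P_i): how surprised we are to see a symbol of probability P_i."""
    return -math.log2(p)

probs = {"A": 0.6, "G": 0.1, "C": 0.1, "T": 0.2}
for base, p in probs.items():
    print(base, round(surprisal(p), 1))
# A 0.7, G 3.3, C 3.3, T 2.3, matching the slide's values
```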
Slide 14: Digression on information theory
What does the surprisal for a symbol have to do with uncertainty?
ui = -log2(Pi)   (the "surprisal")
Uncertainty is the average surprisal for the infinite string of symbols produced by our machine.
Slide 15: Digression on information theory
Let's first imagine that our machine only produces a finite string of N symbols:
N = Σ (i = 1 to M) Ni
where Ni is the number of times each symbol occurred in a string of length N.
For example, for the string "AAGTAACGA":
NA = 5, NG = 2, NC = 1, NT = 1
N = 9
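The slide's counts are exactly what a tally over the string gives; a short sketch (mine, not from the slides):

```python
from collections import Counter

counts = Counter("AAGTAACGA")  # the Ni for each symbol
N = sum(counts.values())       # total string length
print(counts)  # Counter({'A': 5, 'G': 2, 'T': 1, 'C': 1})
print(N)       # 9
```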
Slide 16: Digression on information theory
For every Ni there is a corresponding surprisal ui, so the average surprisal for N symbols will be:
[Σ (i = 1 to M) Ni ui] / [Σ (i = 1 to M) Ni] = [Σ (i = 1 to M) Ni ui] / N = Σ (i = 1 to M) (Ni/N) ui
Slide 17: Digression on information theory
For every Ni there is a corresponding surprisal ui, so the average surprisal for N symbols will be:
Σ (i = 1 to M) (Ni/N) ui = Σ (i = 1 to M) Pi ui
Remember that Pi is simply the probability of generating the ith symbol!
But wait! We also already defined ui!
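A sketch (mine, not from the slides) verifying that the two forms agree on the example string, using the empirical frequencies Pi = Ni/N as the probabilities:

```python
import math
from collections import Counter

s = "AAGTAACGA"
counts = Counter(s)
N = len(s)

# Form 1: weighted by raw counts, sum_i (Ni/N) * ui
by_counts = sum((n / N) * -math.log2(n / N) for n in counts.values())

# Form 2: weighted by probabilities Pi = Ni/N, sum_i Pi * ui
probs = [n / N for n in counts.values()]
by_probs = sum(p * -math.log2(p) for p in probs)

print(by_counts, by_probs)  # identical, ~1.66 bits
```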
Slide 18: Digression on information theory
H = Σ (i = 1 to M) Pi ui
Since ui = -log2(Pi), therefore:
H = -Σ (i = 1 to M) Pi log2(Pi)
Congratulations! This is Claude Shannon's famous formula defining uncertainty when the probability of each symbol is unequal!
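As a sketch (mine, not from the slides), Shannon's formula is a few lines of code, here applied to the machine's probabilities from the earlier slides:

```python
import math

def shannon_uncertainty(probs):
    """H = -sum_i Pi * log2(Pi), skipping zero-probability symbols."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_uncertainty([0.6, 0.1, 0.1, 0.2]))  # ~1.57 bits
```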
Slide 19: Digression on information theory
Uncertainty is largest when all symbols are equally probable!
How does H reduce assuming equiprobable symbols (Pi = 1/M)?
Heq = -Σ (i = 1 to M) (1/M) log2(1/M)
    = -M (1/M) log2(1/M)
    = -log2(1/M)
    = log2(M)
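A quick numerical check of this reduction (my sketch, not from the slides):

```python
import math

def H(probs):
    """Shannon uncertainty in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

for M in (2, 4, 8, 20):  # e.g. binary, DNA, octal, amino acids
    print(M, H([1 / M] * M), math.log2(M))  # the two values agree (up to float rounding)
```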
Slide 20: Digression on information theory
Uncertainty when M = 2:
H = -Σ (i = 1 to M) Pi log2(Pi)
Uncertainty is largest when all symbols are equally probable!
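For M = 2 the formula traces the familiar binary-entropy curve; this sketch (mine, not from the slides) tabulates it:

```python
import math

def H2(p):
    """Uncertainty for a two-symbol alphabet where P(first symbol) = p."""
    if p in (0.0, 1.0):
        return 0.0  # no uncertainty when one symbol is guaranteed
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(p, round(H2(p), 3))  # peaks at p = 0.5 with H = 1 bit
```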
Slide 21: Digression on information theory
OK, but how much information is present in each column?
Information (R) = H_before - H_after
R = log2(M) - [-Σ (i = 1 to M) Pi log2(Pi)]
Now "before" and "after" refer to before and after we examined the contents of a column.
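Putting the pieces together for a PSSM column, a sketch (my illustration, reusing the column-tally idea from earlier; practical logo tools also subtract a small-sample correction, omitted here):

```python
import math
from collections import Counter

def column_information(seqs, col, M=4):
    """R = H_before - H_after = log2(M) - (-sum_i Pic * log2(Pic))."""
    counts = Counter(seq[col] for seq in seqs)
    n = sum(counts.values())
    h_after = -sum((k / n) * math.log2(k / n) for k in counts.values())
    return math.log2(M) - h_after

# Perfectly conserved column: R = 2 bits; totally mixed column: R = 0 bits.
print(column_information(["AC", "AG", "AT", "AA"], 0))  # 2.0
print(column_information(["AC", "AG", "AT", "AA"], 1))  # 0.0
```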
Slide 22: Digression on information theory
Sequence logos graphically display how much information is present in each column.
http://weblogo.berkeley.edu/