Post on 19-Dec-2015
1. Protein structure study via residue environment – Residues Solvent Accessibility Environment in Globins Protein Family
2. Statistical linguistic study of DNA sequences*
Ka Lok Ng
Department of Information Management Ling Tung College
*In collaboration with S.P. Li,
Institute of Physics,
Academia Sinica
Statistical linguistic study of DNA sequences
1. Linguistic study models – Zipf law and Compound Poisson Distribution
2. Compound Poisson Distribution study of the Fortran language and DNA sequences
3. Entropic segmentation method
4. Compound Poisson Distribution study of the DNA segments
Zipf law
Zipf's law states that
rf = C
where r is the rank of a word, f is the frequency of occurrence of the word, and C is a constant that depends on the text being analyzed. The relation is linear in a double logarithmic plot, with a slope of approximately -1 for all languages studied.
DNA sequence study – coding and non-coding regions (mammals, invertebrates, eukaryotic viruses, bacteria)
Reference
Mantegna, R.N.; S.V. Buldyrev; A.L. Goldberger; S. Havlin; C.-K. Peng; M. Simons and H.E. Stanley. "Linguistic Features of Noncoding DNA Sequences." Physical Review Letters 73, no. 23, p. 3169-3172 (1994).
Sequence types: Zipf analysis of 6-tuples of mammal, invertebrate, yeast chromosome III, eukaryotic virus, prokaryotic and bacterial DNA sequences.
Results: They found that non-coding sequences have a consistently larger slope, suggesting that the non-coding sequences bear more resemblance to a natural language than the coding sequences do.
[Figure: double logarithmic Zipf plot; axes: log r and log f.]
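The Zipf analysis of n-tuples described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: words are taken as overlapping 6-tuples obtained by sliding a window one base at a time, and the slope is estimated by an ordinary least-squares fit of log f against log r. The function names `ntuple_counts` and `zipf_slope` are hypothetical and are not taken from the cited paper.

```python
import math
from collections import Counter


def ntuple_counts(sequence, n=6):
    """Count overlapping n-tuples ("words"), sliding the window one base at a time.

    Overlapping windows are an assumption of this sketch.
    """
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def zipf_slope(frequencies):
    """Least-squares slope of log10(f) against log10(rank).

    Expects at least two strictly positive frequencies.
    """
    ranked = sorted(frequencies, reverse=True)  # rank 1 = most frequent word
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log10(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx
```

For a DNA string `dna`, the slope would be obtained as `zipf_slope(ntuple_counts(dna, 6).values())`. A text obeying rf = C exactly gives a slope of -1.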
Word frequency distribution - Compound Poisson Distribution
Consider an author's total vocabulary of V words, with probabilities of occurrence $\pi_1 < \pi_2 < \dots < \pi_V$.
The frequency distribution of a specific word with probability of occurrence $\pi_i$ to appear r = 1, 2, ... times in a total word count of N tokens is given by
$$\phi(r \mid N) = \int_0^1 \binom{N}{r} \pi^r (1-\pi)^{N-r} \, \psi(\pi) \, d\pi$$
Replacing the binomial by the Poisson distribution, taking $\psi(\pi)$ as a mixing distribution, and integrating over the probability distribution, one obtains
$$\phi(r) = \frac{\left[(1-\theta)^{1/2}\right]^{\gamma}}{K_{\gamma}\!\left(\alpha(1-\theta)^{1/2}\right)} \, \frac{(\alpha\theta/2)^{r}}{r!} \, K_{r+\gamma}(\alpha)$$
where $-\infty < \gamma < \infty$, $0 < \theta < 1$ and $\alpha > 0$ are three parameters and $K_{r+\gamma}(\alpha)$ is the modified Bessel function of the second kind of order $r+\gamma$. For $\gamma = -0.5$, $\phi(r)$ is the inverse Gaussian-Poisson distribution.
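For the special case with the first parameter equal to -0.5 mentioned above, the distribution can be evaluated without a Bessel-function library, because half-integer-order modified Bessel functions obey a simple three-term recurrence. The sketch below assumes the untruncated form of the distribution (r starting at 0) and uses the resulting recurrence for phi(r) directly. The function name `inverse_gaussian_poisson` is hypothetical, and the parameter values in the usage note are arbitrary illustrations, not the fitted values from the slides.

```python
import math


def inverse_gaussian_poisson(alpha, theta, r_max):
    """Return [phi(0), ..., phi(r_max)] for the gamma = -1/2 case.

    Assumes alpha > 0 and 0 < theta < 1. Uses
      phi(0) = exp(-alpha * (1 - sqrt(1 - theta)))
      phi(1) = (alpha * theta / 2) * phi(0)
      phi(r) = (2r - 3) / (2r) * theta * phi(r - 1)
               + (alpha * theta)**2 / (4 r (r - 1)) * phi(r - 2)
    which follow from the Bessel recurrence K_{v+1}(x) = K_{v-1}(x) + (2v/x) K_v(x).
    """
    phi = [math.exp(-alpha * (1.0 - math.sqrt(1.0 - theta)))]
    if r_max >= 1:
        phi.append(0.5 * alpha * theta * phi[0])
    for r in range(2, r_max + 1):
        phi.append(
            (2 * r - 3) / (2 * r) * theta * phi[r - 1]
            + (alpha * theta) ** 2 / (4 * r * (r - 1)) * phi[r - 2]
        )
    return phi
```

For example, `inverse_gaussian_poisson(0.45, 0.85, 400)` returns probabilities that sum to 1 up to rounding, which is one quick way to check the normalization. If only r = 1, 2, ... is of interest, the values can be renormalized by dividing by 1 - phi(0).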
[Figure: φ(r) versus r for Fortran programs. Panels: COCO1 (a450t85) and CONVERT (a250t85).]
[Figure: φ(r) versus r for mammals. Panels: HUMHDABCD (a750t95) and HUMMMDBC (a770t95).]
[Figure: φ(r) versus r for invertebrates. Panels: CEC0749 (a640t95) and CELTWIMUSC (a660t95).]
[Figure: φ(r) versus r for eukaryotic viruses. Panels: ASFV55KB (a530t95) and HE1CG (a730t99).]
[Figure: φ(r) versus r for bacteria. Panels: ECOWU85 (a990t96) and ECOUW87 (a980t97).]
Chi-square test
$$\chi^2 = \sum_i \frac{(O_i - T_i)^2}{T_i}$$
O is the observed frequency
T is the theoretical frequency
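As a small illustration, the chi-square statistic can be computed from paired lists of observed and theoretical frequencies. The function name `chi_square` is hypothetical, and the sketch assumes every theoretical frequency is strictly positive.

```python
def chi_square(observed, theoretical):
    """Sum of (O - T)^2 / T over all bins; assumes every T is strictly positive."""
    if len(observed) != len(theoretical):
        raise ValueError("observed and theoretical must have the same length")
    return sum((o - t) ** 2 / t for o, t in zip(observed, theoretical))
```

A perfect fit gives 0, and larger values indicate a poorer agreement between the observed word frequencies and the fitted distribution.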
Segmentation method
• How to define a sentence?
• DNA sequences are not random sequences
• Examples: CpG islands and repeated sequences
• Look for subsequences different from the rest of the sequence
• Segmentation of DNA according to the {A,T,C,G} base composition by the entropic segmentation method (a method used in image segmentation)
• Let $S = \{a_1, a_2, \dots, a_N\}$ where the a's are symbols over the alphabet $A = \{A_1, \dots, A_k\}$, for example {A,T,C,G}
• Consider a segmentation at position n, which results in $S^{(1)} = \{a_1, a_2, \dots, a_n\}$ and $S^{(2)} = \{a_{n+1}, a_{n+2}, \dots, a_N\}$
• Let $F^{(1)} = \{f_1^{(1)}, \dots, f_k^{(1)}\}$ and $F^{(2)} = \{f_1^{(2)}, \dots, f_k^{(2)}\}$ be the relative nucleotide frequencies over alphabet A.
• The Jensen-Shannon divergence measure between the 2 distributions is given by
$$D_{JS}(F^{(1)}, F^{(2)}) = H(\pi_1 F^{(1)} + \pi_2 F^{(2)}) - \left(\pi_1 H(F^{(1)}) + \pi_2 H(F^{(2)})\right)$$
where
$$H(F) = -\sum_{i=1}^{k} f_i \log_2 f_i$$
is the Shannon entropy of the distribution F and $\pi_1 + \pi_2 = 1$.
To look for subsequences, one maximizes $D_{JS}$. Halting of the segmentation process is determined by the significance level.
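A single segmentation step can be sketched as follows. This is a minimal illustration, assuming the two weights are the length fractions n/N and (N-n)/N of the two subsequences, in which case the weighted mixture of the two frequency distributions is simply the composition of the whole sequence. The function names are hypothetical, and the significance-level halting criterion is not implemented here.

```python
import math
from collections import Counter


def shannon_entropy(counts, total):
    """Shannon entropy (in bits) of the relative frequencies counts[x] / total."""
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def js_divergence_at(sequence, n):
    """Jensen-Shannon divergence between sequence[:n] and sequence[n:].

    Assumption of this sketch: weights are the length fractions n/N and (N-n)/N.
    """
    total = len(sequence)
    left, right = Counter(sequence[:n]), Counter(sequence[n:])
    w1, w2 = n / total, (total - n) / total
    whole = left + right  # composition of the full sequence
    return shannon_entropy(whole, total) - (
        w1 * shannon_entropy(left, n) + w2 * shannon_entropy(right, total - n)
    )


def best_cut(sequence):
    """Cut position n (1 <= n < N) that maximizes the divergence."""
    return max(range(1, len(sequence)), key=lambda n: js_divergence_at(sequence, n))
```

Applying `best_cut` recursively to each resulting segment, and stopping when the maximum divergence is no longer statistically significant, would give the full segmentation. This version recounts both halves at every position, which is simple but quadratic in the sequence length.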
References
P. Bernaola-Galvan, R. Roman-Roldan, and J. L. Oliver, "Compositional segmentation and long range fractal correlations in DNA sequences." Phys. Rev. E 53, p. 5181-5189 (1996).
Summary
1. The compound Poisson distribution fits quite well for 6 bp and 7 bp long DNA sequences and for the segmentation domains; we consider it better than the Zipf law.
2. The compound Poisson distribution gives the correct overall normalization factor.
3. We noticed that, of the three parameters, one controls the long-range behavior (i.e. less frequently occurring, rare words), one controls the short-range behavior (i.e. more frequently occurring, frequent words), and one seems to control the overall slope (i.e. the syntax or style) of the distribution φ(r).
4. It is still premature to suggest that DNA sequences resemble natural language or that they can be modeled by linguistic methodology.
In linguistics: representation of linguistic expressions
Morpheme → word → phrase → sentence → text
Biological implications
Study the statistical significance of word frequency
• Naively, are some words rare because they disrupt replication or gene expression?
• Do words of significant frequency survive natural selection?