1. Protein structure study via residue environment – Residues Solvent Accessibility Environment in Globins Protein Family

2. Statistical linguistic study of DNA sequences*

Ka Lok Ng

Department of Information Management, Ling Tung College

*In collaboration with S.P. Li, Institute of Physics, Academia Sinica

Statistical linguistic study of DNA sequences

1. Linguistic study models – Zipf law and Compound Poisson Distribution

2. Compound Poisson Distribution study of the Fortran language and DNA sequences

3. Entropic segmentation method

4. Compound Poisson Distribution study of the DNA segments

Zipf Law

Zipf's law states that

rf = C

where r is the rank of a word, f is its frequency of occurrence, and C is a constant that depends on the text being analyzed. The law is linear in a double logarithmic plot, with a slope of approximately -1 for all languages studied.
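As a minimal sketch of how the slope in the double logarithmic plot can be estimated, the following fits a least-squares line to log10(f) against log10(r). The function name `zipf_slope` and the use of an ordinary least-squares fit over all ranks are assumptions made for illustration, not a procedure taken from the slides.

```python
import math

def zipf_slope(frequencies):
    """Least-squares slope of log10(frequency) versus log10(rank).

    Illustrative helper (hypothetical name). Zero counts are dropped and
    the remaining frequencies are ranked from most to least frequent.
    """
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two non-zero frequencies")
    xs = [math.log10(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log10(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Frequencies that follow r * f = C exactly give a slope of -1.
ideal = [1000.0 / rank for rank in range(1, 51)]
print(round(zipf_slope(ideal), 6))
```

For real texts the fit is usually restricted to a range of ranks, so the slope obtained this way depends on that choice.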

DNA sequence study – coding and non-coding regions (mammals, invertebrates, eukaryotic viruses, bacteria)

Reference

Mantegna, R.N.; S.V. Buldyrev; A.L. Goldberger; S. Havlin; C.-K. Peng; M. Simons and H.E. Stanley. "Linguistic Features of Noncoding DNA Sequences", Physical Review Letters 73, no. 23, p. 3169-3172 (1994).

Sequence Types : Zipf analysis of 6-tuples of the Mammals, Invertebrates, Yeast chromosome III, Eukaryotic Virus, Prokaryotic and Bacteria DNA sequences.

Results : They found that non-coding sequences have a Zipf slope that is consistently larger than that of coding sequences, suggesting that the non-coding sequences bear more resemblance to a natural language than the coding sequences.
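A Zipf analysis of 6-tuples starts from a table of how often each 6-base "word" occurs. The sketch below counts n-tuples with a window that slides one base at a time; the overlapping window, the function name `count_tuples` and the toy sequence are assumptions for illustration.

```python
from collections import Counter

def count_tuples(sequence, k=6):
    """Count overlapping k-tuples ("words") in a DNA string.

    Illustrative helper (hypothetical name); the window is shifted by one
    base at a time, which is an assumption about the counting scheme.
    """
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Toy example with k = 2 so the counts are easy to check by hand.
counts = count_tuples("ATATAT", k=2)
# Rank-frequency table: most frequent word first.
for rank, (word, freq) in enumerate(counts.most_common(), start=1):
    print(rank, word, freq)
```

The resulting frequencies, sorted in decreasing order, are the input for a rank-frequency (Zipf) plot.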

Word frequency distribution - Compound Poisson Distribution

Consider an author's total vocabulary of V words (with probabilities of occurrence $\pi_1 < \pi_2 < \dots < \pi_V$).

The frequency distribution of a specific word with probability of occurrence $\pi_i$ to appear r = 1, 2, … times in a total word count of N tokens is given by the binomial distribution

$$\phi(r \mid \pi_i) = \binom{N}{r} \pi_i^{\,r} (1 - \pi_i)^{N-r}$$

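Replacing the binomial by the Poisson distribution, assuming a mixing distribution for the probability of occurrence $\pi$, and integrating over that distribution, one obtains

$$\phi(r) = \frac{(1-\theta)^{\gamma/2}}{K_\gamma\left(\alpha(1-\theta)^{1/2}\right)} \, \frac{(\alpha\theta/2)^r}{r!} \, K_{r+\gamma}(\alpha)$$

where $-\infty < \gamma < \infty$, $0 < \theta < 1$ and $\alpha > 0$ are three parameters and $K_{r+\gamma}(\cdot)$ is the modified Bessel function of the second kind of order $r+\gamma$. For $\gamma = -0.5$, the mixing distribution is the inverse Gaussian distribution.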
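To make the compound Poisson form concrete, here is a minimal sketch for the special case γ = -0.5 only. It assumes the standard (untruncated, r = 0, 1, 2, …) form of the distribution, φ(r) = (1-θ)^(γ/2) / K_γ(α(1-θ)^(1/2)) · (αθ/2)^r / r! · K_(r+γ)(α), and uses the closed form K_(1/2)(x) = sqrt(π/(2x)) e^(-x) together with the Bessel recurrence K_(ν+1)(x) = K_(ν-1)(x) + (2ν/x) K_ν(x). The function names and the parameter values are illustrative and are not the fitted values from the slides.

```python
import math

def bessel_k_half(x):
    """K_{1/2}(x) = K_{-1/2}(x) = sqrt(pi / (2x)) * exp(-x)."""
    return math.sqrt(math.pi / (2.0 * x)) * math.exp(-x)

def inverse_gaussian_poisson_pmf(r_max, alpha, theta):
    """Return [phi(0), ..., phi(r_max)] for gamma = -0.5 (assumed form).

    Works with the scaled terms t[r] = (alpha*theta/2)**r / r! * K_{r+gamma}(alpha),
    which satisfy
        t[r+1] = theta*(r+gamma)/(r+1) * t[r] + c**2/(r*(r+1)) * t[r-1]
    (a rewriting of the Bessel recurrence), so no huge Bessel values appear.
    """
    gamma = -0.5
    c = alpha * theta / 2.0
    # t[0] = K_{-1/2}(alpha), t[1] = c * K_{1/2}(alpha); both use the closed form.
    t = [bessel_k_half(alpha), c * bessel_k_half(alpha)]
    for r in range(1, r_max):
        t.append(theta * (r + gamma) / (r + 1) * t[r]
                 + c * c / (r * (r + 1)) * t[r - 1])
    norm = (1.0 - theta) ** (gamma / 2.0) / bessel_k_half(alpha * math.sqrt(1.0 - theta))
    return [norm * value for value in t[: r_max + 1]]

# Illustrative parameters; the probabilities should sum to (almost) one.
pmf = inverse_gaussian_poisson_pmf(2000, alpha=2.0, theta=0.9)
print(abs(sum(pmf) - 1.0) < 1e-9)
```

For general γ the Bessel functions have no elementary closed form, and a numerical routine for K_ν would be needed instead of `bessel_k_half`.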

[Figures: observed word frequency distributions compared with compound Poisson fits. Panel labels as given on the slides:
Fortran program: COCO1 (a450t85), CONVERT (a250t85)
Mammals: HUMHDABCD (a750t95), HUMMMDBC (a770t95)
Invertebrate: CEC0749 (a640t95), CELTWIMUSC (a660t95)
Eukaryotic Virus: ASFV55KB (a530t95), HE1CG (a730t99)
Bacteria: ECOWU85 (a990t96), ECOUW87 (a980t97)]

Chi-square test

$$\chi^2 = \sum \frac{(O - T)^2}{T}$$

where O is the observed frequency and T is the theoretical frequency.
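A short sketch of the chi-square statistic, summing (O - T)^2 / T over the frequency classes. The function name `chi_square` and the numbers in the example are made up for illustration and are not data from the study.

```python
def chi_square(observed, theoretical):
    """Sum of (O - T)**2 / T over paired frequency classes.

    Illustrative helper (hypothetical name). Classes with a theoretical
    frequency of zero are rejected because the term would be undefined.
    """
    if len(observed) != len(theoretical):
        raise ValueError("observed and theoretical must have the same length")
    total = 0.0
    for o, t in zip(observed, theoretical):
        if t <= 0:
            raise ValueError("theoretical frequencies must be positive")
        total += (o - t) ** 2 / t
    return total

# Made-up counts: two classes, each expected 15 times.
print(chi_square([10, 20], [15, 15]))
```

A smaller value indicates closer agreement between the observed and theoretical frequencies; judging significance also requires the number of degrees of freedom.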

Segmentation method

• How to define a sentence?

• DNA sequences are not random sequences; they contain features such as CpG islands and repeated sequences.

• Look for subsequences that differ from the rest of the sequence.

• Segment the DNA according to its {A, T, C, G} base composition by the entropic segmentation method (a method used in image segmentation).

• Let $S = \{a_1, a_2, \dots, a_N\}$, where the a's are symbols over the alphabet $A = \{A_1, \dots, A_k\}$, for example {A, T, C, G}.

• Consider a segmentation at position n, which results in $S^{(1)} = \{a_1, a_2, \dots, a_n\}$ and $S^{(2)} = \{a_{n+1}, a_{n+2}, \dots, a_N\}$.

• Let $F^{(1)} = \{f_1^{(1)}, \dots, f_k^{(1)}\}$ and $F^{(2)} = \{f_1^{(2)}, \dots, f_k^{(2)}\}$ be the relative nucleotide frequencies of the two segments over alphabet A.

• The Jensen-Shannon divergence measure between the two distributions is given by

$$D_{JS}(F^{(1)}, F^{(2)}) = H(\pi_1 F^{(1)} + \pi_2 F^{(2)}) - \left(\pi_1 H(F^{(1)}) + \pi_2 H(F^{(2)})\right)$$

where

$$H(F) = -\sum_{i=1}^{k} f_i \log_2 f_i$$

is the Shannon entropy of the distribution F and $\pi_1 + \pi_2 = 1$.

To look for subsequences, one maximizes $D_{JS}$ over the cut position n. The segmentation process halts according to a chosen significance level.
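A single step of the segmentation can be sketched as follows: scan every cut position n, compute the Jensen-Shannon divergence between the base compositions of the two sides, and keep the maximum. The weights π1 = n/N and π2 = (N - n)/N are an assumption here (the text above only requires π1 + π2 = 1); with that choice the mixed distribution equals the composition of the whole sequence. The function names are hypothetical, and the significance test that decides when to stop is not implemented.

```python
import math
from collections import Counter

def shannon_entropy(counts, total):
    """Shannon entropy (in bits) of the relative frequencies counts/total."""
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)

def best_cut(sequence):
    """Return (n, D_JS) for the cut position n that maximizes D_JS.

    Assumes weights pi1 = n/N and pi2 = (N-n)/N, so that
    H(pi1*F1 + pi2*F2) is the entropy of the whole sequence.
    Returns (None, -1.0) if the sequence has fewer than two symbols.
    """
    total = len(sequence)
    whole = Counter(sequence)
    h_whole = shannon_entropy(whole, total)
    left = Counter()
    best_n, best_d = None, -1.0
    for n in range(1, total):
        left[sequence[n - 1]] += 1
        right = whole - left  # Counter subtraction drops zero counts
        pi1 = n / total
        pi2 = 1.0 - pi1
        d = h_whole - (pi1 * shannon_entropy(left, n)
                       + pi2 * shannon_entropy(right, total - n))
        if d > best_d:
            best_n, best_d = n, d
    return best_n, best_d

# Toy sequence with an obvious compositional boundary after position 4.
print(best_cut("AAAATTTT"))
```

Applying `best_cut` recursively to each resulting segment, and stopping when the maximal divergence is no longer significant, would give a full segmentation into domains.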

References

P. Bernaola-Galvan, R. Roman-Roldan, and J. L. Oliver, "Compositional segmentation and long range fractal correlations in DNA sequences", Phys. Rev. E 53, p. 5181-5189 (1996).

Summary

1. The compound Poisson distribution fits the 6 bp and 7 bp long DNA words and the segmentation domains quite well; we consider it a better description than the Zipf law.

2. The compound Poisson distribution gives the correct overall normalization factor.

3. We noticed that one parameter controls the long-range behavior (i.e. the less frequently occurring, rare words), a second controls the short-range behavior (i.e. the more frequently occurring, frequent words), and the third seems to control the overall slope (i.e. the syntax or style) of the distribution $\phi(r)$.

4. It is still premature to suggest that DNA sequences resemble a natural language or that they can be modeled by linguistic methodology.

In linguistics - representation of linguistic expressions

Morpheme → word → phrase → sentence → text

Biological implications

Study the statistical significance of word frequency

• Naively, are words rare because they disrupt replication or gene expression?

• Do words of significant frequency survive natural selection?
