Protein Sequencing Algorithms



Protein Sequencing Algorithms – A survey

Muhammad Usman (Author)

School of Science and Technology

University of Management and Technology

Lahore, Pakistan

[email protected]

Abstract— Protein sequencing is used in many fields. In this technique, the sequence of amino acids in a protein is determined by using an algorithm. This requires a sound understanding of the structures as well as the functions of proteins in living organisms. In this paper, different protein sequencing algorithms are discussed. Ten main algorithms are presented with the formulas they apply and the steps they follow, and are then illustrated with graphs. In the comparative analysis these algorithms are compared, and the paper concludes with the most suitable algorithm for protein sequencing.

Keywords— Adenine, Guanine, Thymine, Markov, Oligonucleotides, Nucleotides, RNA, DNA

I. INTRODUCTION

The word protein is derived from the Greek word “proteios”, which means primary. Proteins are one of the vital building blocks of a living organism. They are composed of a chain of amino acids of various types; 20 are commonly used and are referred to as the standard amino acids. Scientists have been studying their structure, function and use for over 200 years. Still, many questions in this domain remain unanswered, such as how a basic linear primary structure (a chain of amino acids) transforms into a useful 3D assembly. At its core this is a biological problem, but it is rooted in multiple domains. In computer science it can be mapped to an NP-hard problem, and in both physics and geometry the same problem can be classified as a “self-avoiding walk”. Solving such problems requires the integration of several domains, which makes them very interesting to address. An accurate prediction method could benefit many fields in the coming years.

Proteins are of various types, such as functional, structural and hormonal. Each protein is composed of a unique pattern of amino acids, which are either essential or non-essential. Essential amino acids are not produced by our body, so we obtain them from food. DNA is the structural part of a gene and has a double helix form. It is built from nucleotides, each consisting of one base, one phosphate group and one sugar group. The four nucleotide bases are adenine, guanine, thymine and cytosine. One amino acid is coded by three consecutive bases (a codon). For instance, the codon adenine-thymine-guanine (ATG) codes for the amino acid methionine. Protein formation follows two steps. First, the process called transcription produces messenger RNA (ribonucleic acid) from DNA. Then the messenger RNA is translated into a protein; this process of creating proteins from a nucleotide chain is called translation. The overall procedure is shown in Figure 1.
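As a minimal sketch of the transcription and translation steps described above, the following Python fragment converts a DNA coding strand into messenger RNA and then reads it codon by codon. The function names `transcribe` and `translate` and the example sequence are illustrative assumptions and do not come from any of the surveyed algorithms; the codon table is the standard genetic code written in T, C, A, G order, with one-letter amino acid codes and `*` for stop codons.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
# (third base varies fastest). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna_coding_strand):
    """Return the mRNA for a DNA coding strand (T is replaced by U)."""
    return dna_coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA into one-letter amino acids, stopping at a stop codon."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        codon = mrna[start:start + 3].replace("U", "T")
        amino = CODON_TABLE[codon]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCTAA")  # hypothetical example sequence
    print(mrna, translate(mrna))
```

Here ATG is read as methionine (M), GCC as alanine (A), and TAA ends translation. The sketch assumes the reading frame starts at the first base and does not search for a start codon.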


Various algorithms have been designed for transcription and translation. This paper compares such algorithms for different types of proteins.

II. RELATED WORK

One of the common and traditional ways to predict the structure (folding and formation) of proteins is the Genetic Algorithm (GA). It has good computational power for predicting protein structure, but when it comes to multiple optima (multiple proteins) GA is not considered efficient. Michael Scott Brown, Tommy Bennet and James A. Coker [3] proposed the Niche Genetic Algorithm (NGA), an extension of GA that addresses problems with multiple optima. They also compared NGA with DSGA (Dynamic Radius Species Conserving Genetic Algorithm) and reported promising results.

Alexander S. Krylov and Renad I. Zhdanov [5] worked on binding proteins. They experimented with short (short-chained) oligonucleotides and a microarray of hydrogel cells (a biochip). They first investigated the shortest single-stranded oligonucleotide that a protein can recognize, which they determined by binding oligonucleotides of 2 to 12 bases. They tried different numbers of bases in this range and constructed a microarray that contains 16 dinucleotides. That array was then tested for specific binding of proteins labeled with Texas Red or Bodipy.

Figure 1: Proteins Formation (DNA → Messenger RNA → Amino Acids → Proteins, via Transcription and Translation)

Gaëlle LENGLET and Sabine DEPAUW [6] proposed a method to recognize protein structure that involves glyceraldehyde (sugar group). They used a benzo-b-acronycine compound acting on the guanine nucleobases of the DNA helix; it is able to open the DNA double helix locally, a property that is associated with its cytotoxic (cell-destroying) activity. Since enzymes are required to generate proteins, they worked on a procedure that took a dehydrogenase enzyme and combined it with an alkyl group to generate the single- or double-stranded DNA coded by recorded telomerase (pattern repetition) activity. They used the cyclic amplification of sequence targeting (CASTing) algorithm to identify DNA-binding selectivity. Furthermore, an increase in the binding of GAPDH, as well as of its partner HMG-B1 (high-mobility group protein B1), to chromatin at the cellular level was observed.

Figure 2: CASTing Tests (number of samples for Test1 to Test5; series: Telomerase, Alkyl group, Dehydrogenase)

Pooya Zakeri and Yves Moreau [11] proposed a method to recognize proteins through geometric kernel data fusion. Rather than combining the base kernels linearly, they fuse them by taking a geometric mean of the individual kernel matrices. Computing the geometric mean of such kernels is computationally hard and expensive, and may be considered infeasible. This can be avoided by using the Log-Euclidean mean, which can be regarded as a compromise between the arithmetic and geometric means. With this kernel-based hybridization model they successfully captured the functional domain composition of proteins.
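The Log-Euclidean mean mentioned above can be sketched in a few lines: take the matrix logarithm of each kernel, average the logarithms, and map the result back with the matrix exponential. The fragment below is an illustration under stated assumptions, not the authors' implementation: it assumes symmetric positive definite kernel matrices, uses an eigendecomposition to evaluate the matrix functions, and the eigenvalue floor `min_eig` is a hypothetical safeguard added here for numerical stability.

```python
import numpy as np


def _symmetric_matrix_function(matrix, func):
    """Apply func to the eigenvalues of a symmetric matrix."""
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)
    return (eigenvectors * func(eigenvalues)) @ eigenvectors.T


def log_euclidean_mean(kernels, min_eig=1e-12):
    """Log-Euclidean mean: exp of the arithmetic mean of matrix logarithms.

    kernels: iterable of symmetric positive definite matrices of equal shape.
    min_eig: assumed floor for eigenvalues so that the logarithm is defined.
    """
    def safe_log(values):
        return np.log(np.maximum(values, min_eig))

    logs = [_symmetric_matrix_function(np.asarray(k, dtype=float), safe_log)
            for k in kernels]
    mean_log = np.mean(logs, axis=0)
    return _symmetric_matrix_function(mean_log, np.exp)


if __name__ == "__main__":
    k1 = np.diag([1.0, 4.0])
    k2 = np.diag([4.0, 1.0])
    print(log_euclidean_mean([k1, k2]))
```

For matrices that commute, such as the two diagonal matrices in the usage example, the result coincides with the ordinary geometric mean of the eigenvalues, which is why the fused kernel there is twice the identity matrix.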

Leonid Mirny [13] suggested a useful technique for studying folding and binding in protein-DNA interactions. Proteins can bind their target sites about 10^2 to 10^3 times faster than the diffusion limit. A two-stage mechanism was proposed: the first stage searches for the folding patterns of proteins, and the second stage recognizes the pattern using various mathematical models.

Figure 3: Coupling and Bonding

An effective landscape is built using the Random Energy Model to obtain the kinetics (fraction of time spent in S states) and the stability at the target site. It is a double-edged sword: the S states speed up the process, but they also slow it down because the target site may be missed. Correlated landscapes were also used to study the coupling of binding and folding.

Quentin R. Johnson and Richard J. Lindsay [12] worked on protein recognition via computer simulation. Their main focus was the use of computer simulations and biophysical models to evaluate the specificity and strength of carbohydrate recognition. They presented computational methods that assist the quantification of sugar recognition, and showed that traditional problems such as cooperativity, purification and specificity can be avoided by using those methods. In addition, they compared many other methods used to calculate the binding between protein and carbohydrate. Finally, successful examples of binding studies based on computer simulation were used to demonstrate the maturity of the technique, rather than to describe existing deficiencies.

Ilda D’Annessa and Cinzia Tesauro [4] worked on the role of flexibility in protein-DNA-drug recognition. Their main contribution concerns the processes by which the covalent DNA-topoisomerase complex is formed and reversibly stabilized. They found that, compared with the wild-type protein, the mutant exhibited a lower rate of religation of the DNA substrate.

Figure 4: DNA and Protein Tests (series: Protein, DNA, Substrate)

The authors also showed that the double mutant is more sensitive to camptothecin (CPT) than the wild type, and that CPT sensitivity is inversely proportional to the rate of religation. This indicates that the linker domain plays a critical role: mutations in this domain affect the catalytic site even though it lies far from the mutations. The paper thus also demonstrates communication between domains that are located far from one another.

Protein binding based on chemical structure is one of the areas currently receiving attention. Alexander S. Krylov and Renad I. Zhdanov [5] worked on protein recognition by chemical composition. As an initial stage, they performed recognition experiments to determine the shortest and longest single-stranded oligonucleotides. They then tested the binding behavior with oligonucleotides of 2 to 12 bases and found that tetranucleotides are well suited for protein binding. This results in the simplest protein-binding microarray.

Non-covalent interactions among DNA, carbohydrates, small molecules, proteins and lipids are critical events in many biological processes. Characterizing and discovering these interactions is essential for a better understanding of biochemical reactions and for designing therapeutic agents to treat many diseases and infections. Over the last twenty years, electrospray ionization mass spectrometry (ESI-MS) has become a major in vitro tool for identifying and quantifying protein–ligand interactions. In the work of Elena N. Kitova [8], ESI-MS is used to determine the binding stoichiometry and affinity of protein–ligand complexes. Common sources of error in these measurements and strategies for overcoming them are discussed, along with the challenges of implementing the assay and directions for future work.

Hoon Choi and Seungsoo Han [1] studied protein recognition by SAPs (stress-associated proteins). Plants contain zinc finger domains that help recognize ubiquitin, a regulatory signal in the cell. It was not clear, however, whether these domains in plants perform roles similar to those in animal cells. The authors showed a unique set of features among these domains: the highly conserved diaromatic patch is replaced by a dialiphatic patch. Their results show that AtSAP5 recognizes linear and K63-linked polyubiquitin chains better than K48-linked ones.

The entire PGLYRP1 gene from Macaca thibetana and Rhinopithecus roxellana was identified in order to explore the adaptive evolution of the peptidoglycan (PGN)-recognition protein 1 gene in primates and to shed light on the function of this antibacterial protein. Homology analysis showed that the identity of the nucleotide and deduced amino acid sequences of PGLYRP1 among 10 primates ranged from 82.0 to 99.0% and from 74.5 to 98.5%, respectively. Using the Bayes empirical Bayes procedure, the authors also found two positively selected codons (sites 121L and 141T) that are not affected by the PGN-binding and PGLYRP-specific regions; these were implied to be two potential key sites for the function of the PGLYRP1 protein.
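The identity percentages quoted above come from comparing aligned sequences position by position. A minimal sketch of such a calculation is given below. It assumes the two sequences have already been aligned to equal length with `-` marking gaps, and it skips gapped positions, which is only one of several conventions in use; the function name `percent_identity` and the example strings are hypothetical and are not taken from the cited study.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity of two pre-aligned sequences, ignoring gapped columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where neither sequence has a gap.
    columns = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
               if a != gap and b != gap]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # Hypothetical aligned fragments, not real PGLYRP1 data.
    print(percent_identity("ATGCCGTA", "ATGACG-A"))
```

The same function works for nucleotide and amino acid alignments, since it only compares characters column by column.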

Małgorzata Grabinska and Paweł Błazej [2] worked on Markov chains, the most commonly used approach for protein sequencing. They used matrices that describe the dependencies among the nucleotides of a sequence, and then predicted genes from a measure computed on the sequence content. The algorithm used was PMC, which takes six different Markov chains and describes the transitions among nucleotides separately for each DNA strand. They reported that the PMC algorithm shows better precision than the other Markov models.
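To illustrate how a transition matrix captures dependencies between neighboring nucleotides, the sketch below trains a single first-order Markov chain and scores a sequence by its log-likelihood. This is a simplified illustration and not the PMC algorithm itself, which according to the description above combines six chains; the function names, the pseudocount and the toy training sequences are assumptions made for the example.

```python
import math

ALPHABET = "ACGT"


def train_transitions(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | previous)."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    # Normalize each row so that it sums to one.
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
        for a in ALPHABET
    }


def log_likelihood(seq, transitions):
    """Log-probability of the transitions in seq under the given chain."""
    return sum(math.log(transitions[prev][nxt])
               for prev, nxt in zip(seq, seq[1:]))


if __name__ == "__main__":
    # Toy training sets; real models are trained on annotated genomes.
    model_a = train_transitions(["ACACACAC"])
    model_b = train_transitions(["AAAAAAAA"])
    query = "ACAC"
    print(log_likelihood(query, model_a), log_likelihood(query, model_b))
```

A sequence can then be assigned to whichever model gives the higher log-likelihood, which is the basic idea behind comparing coding and non-coding models.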

W. Liu, Y.F. Yao, L. Zhou, Q.Y. Ni and H.L. Xu [10] analyzed the peptidoglycan-recognition protein gene (PGLYRP1) in primates. They discussed the evolution of this gene in light of the previous work done on it. The authors stated that the immune (self-defense) system recognizes micro-organisms through many highly functional recognition receptors. Work on this recognition was started in 2002 by Hoffman and Reichhart. Takeda and Akira then discussed the immune system of mammals in 2005, showing that mammals have many essential protein members such as CD14. This line of work extends from insects to mammals, and many other proteins, such as PGRPs, PGLYRP-S and PGN-lytic enzymes, were discussed by various authors up to 2007. From 2007 onwards, non-human primates have commonly been used for experiments and transplantation studies, and many primates have also been used for vaccination purposes. The authors stated that PGLYRP1 is found in many parts of living organisms, such as corneal tissue, bone marrow, kidney, lungs and liver. It helps kill bacteria and helps activate a two-component protein-sensing system whenever the skin comes into contact with a complex external environment. They carried out their study using molecular evolutionary analysis, and the roles of insect PGRPs were documented as well.

Oleg V. Kovalenko, Andrea Olland, Nicole Piché-Nicholas, Adarsh Godbole, Daniel King, Kristine Svenson, Valerie Calabro, Mischa R. Müller, Caroline J. Barelle, William Somers, Davinder S. Gill, Lidia Mosyak and Lioudmila Tchistiakova [11] discussed a new category of immunoglobulin known as new antigen receptors (IgNARs). These receptors belong to the class of Ig-like molecules. The authors carried out experiments in several major steps covering the recognition and processing of these IgNARs. They discussed shark IgNARs, which they related to human, rat and mouse species. At the end of the paper they presented results according to the structure of each species and also considered many other elements of the molecule.

III. TECHNICAL FRAMEWORK

pad is processed with droplets of aqueous solutions of ON, and the ON are immobilized by reductive coupling of their amino groups with the aldehyde groups of the gel. Thus, the biochip is formed with single-stranded oligonucleotides immobilized inside the gel pads. The authors designed special software, built on the "LabVIEW virtual instrument interface", for experimental control and data processing.


Figure 5: Visual Image of Hybridization pattern through a

microchip

Gaëlle LENGLET and Sabine DEPAUW [6] used glyceraldehyde for protein recognition. Initially, the cell structure was analyzed and proteins were extracted using chromatographic techniques, electrophoresis and MS analysis.

Figure 6: Chromatographic isolation

For linear data they used chromatographic techniques, whereas for chained data electrophoresis was used and the result was then refined by MS analysis. The extracted data are then passed through a specific protocol, the electrophoretic mobility shift assay (EMSA).

Figure 7: EMSA protocol for different protein patterns

The protocol is used to detect complexes of proteins with nucleic acids and to analyze the quality and quantity of multiple interacting systems. After electrophoresis, the separation of the proteins bound to nucleic acid is obtained by autoradiography. The result is then suitable for CASTing (a cyclic process normally used for amplification and sequence targeting). The procedure takes DNA as input and dissolves it in a calculated buffer, the DNA is amplified by PCR, and finally the amplified PCR products are used to recognize protein patterns.

Figure 08: Protein Sequencing

Elena N. Kitova [8] used direct ESI-MS measurements for protein sequencing. Initially the algorithm detects and quantifies free and ligand-bound proteins. The association constant Ca for a given protein is obtained from a ratio K that describes the abundances (Ab) of the ligand-bound and free protein. The binding reaction is

$$P + L \leftrightarrow PL$$

and Ca is calculated by the following relation, where [L]a and [P]a are the initial concentrations of ligand and protein:

$$C_a = \frac{K}{[L]_a - \dfrac{K}{1+K}\,[P]_a}$$

where K is determined by

$$\frac{[PL]}{[P]} = \frac{Ab(PL)}{Ab(P)} = K$$

The abundances of all detected PL and P ions should be included in K. This relation is suitable for linear data; for nonlinear data it can be rewritten as

$$\frac{K}{K+1} = \frac{1 + C_a[P]_a + C_a[L]_a - \sqrt{\left(1 + C_a[P]_a - C_a[L]_a\right)^2 + 4C_a[L]_a}}{2C_a[P]_a}$$

In ESI-MS binding measurements the values of K normally range from 0.050 to 20, and the concentrations of P and L lie between 0.10 and 1000 μM. The relations above assume uniform response factors for P and PL. For non-uniform response factors, the following relation is suitable:

$$\frac{[PL]}{[P]} = 1 + CF_P - \frac{Ab(PL)}{CF_{PL}\,Ab(P)}$$
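The two relations for uniform response factors can be checked numerically. The sketch below computes Ca from a measured abundance ratio K and, in the other direction, the bound fraction K/(K+1) from a known Ca. It is an illustration only: the function and parameter names are assumptions, concentrations are taken in mol/L, and the example values are invented rather than taken from the cited measurements.

```python
import math


def association_constant(k_ratio, ligand_total, protein_total):
    """Ca from the abundance ratio K = Ab(PL)/Ab(P) and initial concentrations."""
    bound_fraction_value = k_ratio / (1.0 + k_ratio)
    free_ligand = ligand_total - bound_fraction_value * protein_total
    if free_ligand <= 0:
        raise ValueError("inputs imply a non-positive free ligand concentration")
    return k_ratio / free_ligand


def bound_fraction(c_a, ligand_total, protein_total):
    """K/(K+1), the fraction of protein bound, for a given Ca."""
    a = c_a * protein_total
    b = c_a * ligand_total
    return (1.0 + a + b - math.sqrt((1.0 + a - b) ** 2 + 4.0 * b)) / (2.0 * a)


if __name__ == "__main__":
    # Invented example: Ca = 1e5 per M, 10 uM protein, 20 uM ligand.
    fraction = bound_fraction(1e5, 20e-6, 10e-6)
    k_ratio = fraction / (1.0 - fraction)
    print(fraction, k_ratio, association_constant(k_ratio, 20e-6, 10e-6))
```

Feeding the ratio obtained from `bound_fraction` back into `association_constant` recovers the original Ca, which confirms that the two forms describe the same 1:1 binding equilibrium.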

W. Liu, Y.F. Yao, L. Zhou, Q.Y. Ni and H.L. Xu [10] used the peptidoglycan-recognition protein gene (PGLYRP1) of primates for their analysis. They took DNA samples from the muscle tissue of a monkey species and carried out their experiments at a wildlife protection laboratory in China. They also downloaded PGLYRP1 sequences of the crab-eating macaque and of human, using the Ensembl Genome Database to collect sequences from different species. After collection, they amplified, cloned and sequenced these genes. Amplification used the polymerase chain reaction (PCR), with primers designed on the basis of the PGLYRP1 sequence using the Primer 5.0 software. PCR was performed in a thermal cycler (Mastercycler gradient, Germany) with a total reaction volume of approximately 50 μL, containing 1 μL of 10 ng/μL genomic DNA, 0.5 μL of each primer, 5 μL of 2X buffer, 25 μL of 2X mix, 18 μL of double-distilled water and 5 μL of mineral oil. The authors also chose specific temperature and timing conditions for better results. A PCR gel extraction kit was then used to purify the PGLYRP1 product. After amplification, the product was cloned into a pMD 19-T Simple vector. For sequencing the cloned genes, the authors used the BigDye Terminator v3.1 cycle sequencing ready reaction kit. The sequences were then assembled using the DNASTAR software to obtain the complete coding sequence of PGLYRP1.

After amplification, cloning and sequencing of the gene, the data were analyzed. The obtained sequence was first checked using the Chromas 1.45 software, and any necessary corrections were made before further processing. The authors used many parameters and many sites in the MegAlign program for their analysis, and applied the previously discussed molecular evolutionary genetic analysis. They considered different values for different ratios. In the results, the authors used trees and tables to compare the values of these ratios and then discussed them in detail for each species.

IgNARs were discussed by Oleg V. Kovalenko, Andrea Olland, Nicole Piché-Nicholas, Adarsh Godbole, Daniel King, Kristine Svenson, Valerie Calabro, Mischa R. Müller, Caroline J. Barelle, William Somers, Davinder S. Gill, Lidia Mosyak and Lioudmila Tchistiakova [11]. They described the recognition process in nine major steps.

The first step was designing and cloning the humanized V-NAR variants. In this step, E06 variants were codon-optimized for expression in mammalian cells and synthesized by GeneArt AG. Control elements such as the murine CMV promoter were considered during cloning.

The second step was the expression and purification of V-NAR proteins. The authors used COS-1 cells to express the fusion protein V-NAR-hFc, and the cells were transfected with TransIT reagent according to the manufacturer's recommendations. Monomeric V-NARs were likewise expressed in COS-1 cells and purified by chromatography, using reagents such as sodium phosphate, NaCl and imidazole. The protein concentration was then determined from the optical density at 280 nm. For cells grown in serum-free medium, FreeStyle293 expression was used.

The third step was the isolation of E06 proteins. E06 was applied to Ni2+-NTA Superflow resin, and the resin was then washed with PBS supplemented with imidazole. Dialysis of E06 against PBS removed the excess imidazole and prepared it for the next step. To remove oligomeric species, PBS containing lipid-free HSA was used. After incubation for one hour, the mixture was applied to Superdex 200 to remove excess E06. Finally, the remaining fractions were pooled and prepared for crystallization.

The fourth step was ELISA. Serum albumin proteins were used in the binding experiments, and both direct and indirect ELISA were performed. In the direct ELISA, V-NAR binding was detected on Costar assay plates coated with PBS. Fusion proteins such as V-NAR-hFc were diluted in assay buffer, and for the sandwich ELISA plates coated with anti-hFc pAb were used.

The fifth step was crystallization, where the main consideration was fixing the temperature. E06 crystals were obtained in hanging drops at 18 degrees Celsius. Different quantities of solutions containing components such as the protein complex, NaCl and sodium acetate were used. Diamond-shaped crystals appeared overnight and continued to grow for approximately one week.

The sixth step was data collection and processing. Data were collected at APS beamline 22-ID on a MAR-300 detector. The program Xia2 was used for the integration and scaling of intensities, and the program autoProc was used for the same purpose.

The seventh step was phasing, model building and refinement of E06. PHASER was used for molecular replacement of the E06 complex with HSA, with apo HSA (PDB ID: 1AO6) as the model. Finally, Phenix was used for refinement. Different programs and models were used for different types of proteins in this step.

The eighth step was the measurement of E06. Kinetic constants of E06 were measured by surface plasmon resonance (Biacore T100, GE Life Sciences).

The last step was assigning accession numbers. The structure factors and coordinates were deposited with the Worldwide Protein Data Bank under PDB ID: 4HGK (E06) and PDB ID: 4HGM (huE06 v1.1).

IV. COMPARATIVE ANALYSIS

One of the techniques discussed in this paper was the Niche Genetic Algorithm of Michael Scott Brown [3]. These algorithms perform better for protein recognition, but they reduce the dimensionality: proteins are 3D, but this algorithm first converts them to 2D and then processes them further. By doing so, the search space is also reduced.

Another technique used Markov chains for the sequencing of proteins. Małgorzata Grabinska and Paweł Błazej [2] compared their work with the previously presented PMC algorithm. Supervised learning was used to train on the data, after which the original data were tested. The Gene Mark algorithm was proposed by Paweł Mackiewicz [2]. They also used Markov chains, but treated every protein sequence as having three unique Markov chains. They also compared their approach with the PMC algorithm, using ROC curves to evaluate performance. The true positive rates of these algorithms showed little variation.
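For readers unfamiliar with ROC curves, the following sketch shows how the points of such a curve and the area under it can be computed from classifier scores and true labels. It is a generic illustration and does not reproduce the evaluation in [2]; the function names and the toy scores are assumptions, and tied scores are not given special treatment.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points of a ROC curve.

    scores: higher means "more likely positive"; labels: 1 positive, 0 negative.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    # Lower the decision threshold one example at a time, highest score first.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = false_pos = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.7, 0.6]  # invented classifier outputs
    toy_labels = [1, 1, 0, 0]
    curve = roc_points(toy_scores, toy_labels)
    print(curve, area_under_curve(curve))
```

An area of 1.0 corresponds to a perfect ranking of positives above negatives, while 0.5 corresponds to a random one, which gives a single number for comparing two gene-finding algorithms.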


Figure 09: PMC & Three chained Algorithm Comparison

Protein sequencing by direct ESI-MS measurements was discussed by Elena N. Kitova [8]. Initially the algorithm detects and quantifies free and ligand-bound proteins, and the authors then used different formulas for linear and nonlinear data. In comparison, most of the authors used Markov chains for sequencing proteins, because Markov chains can be applied to data of any dimension. The deficiency of this technique is that the computational cost differs between linear and nonlinear data. The least expensive technique was that of Gaëlle LENGLET and Sabine DEPAUW [6]: chromatography is widely used and inexpensive. It also gives good results, but each stage of the process uses a different kind of technique.

Another protein recognition technique was presented by Ilda D’Annessa [4], who worked on the role of flexibility in protein-DNA-drug recognition. The authors used specially designed software for data processing. They ran many experiments using the "LabVIEW virtual instrument interface" and showed that their results are better than those of the other algorithms.

Glyceraldehyde was used by Gaëlle LENGLET and Sabine DEPAUW [6] for protein recognition. They used chromatographic techniques, electrophoresis and MS analysis for different types of data: chromatographic techniques for linear data, and electrophoresis refined by MS analysis for chained data. They used the electrophoretic mobility shift assay (EMSA) protocol to process the extracted data.

Conclusions

The major purpose of this paper was to survey and study protein sequencing and recognition, and to use this knowledge creatively to develop a novel approach for predicting protein-protein complexes. The foundation of this study is a Niche Genetic Algorithm function derived from a previously prepared Genetic Algorithm dataset. On the basis of its results, it was used for computational scanning to calculate changes in the binding of protein complexes. The computed and experimental values showed good correlation, and a PMS algorithm was therefore introduced to improve the predictive power. Based on these findings, the PMS algorithm was developed, which allows identifying scums in protein and performing. The results have shown that the PMS algorithm is state of the art not only with respect to predictive power but also in terms of computational speed. Markov chains were also successfully evaluated by re-scoring six different datasets that include bound and unbound protein predictions. Furthermore, the chained algorithm is useful when applied as an objective function in combination with different Markov chains to predict the 3D structures of protein-protein complexes. For this, model-based learning algorithms were used to test protein sequencing. The direct ESI-MS measurements approach showed average results for bound and restrained protein complex predictions. Few factors were identified that influence the success of the sequencing approach, such as the range of probable conformational changes of a protein. Finally, a large-scale validation study on the peptidoglycan-recognition protein was performed. The results thereby obtained allow the identification of those protein-protein interfaces that are best suited for molecular docking approaches.

ROC Curve (x-axis: 1 - specificity; y-axis: sensitivity)
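As a minimal sketch of how a curve of sensitivity against 1 - specificity is commonly constructed, the following Python code sweeps a decision threshold over classifier scores. The function names, the binary labels and the scores are hypothetical illustrations and are not taken from the experiments surveyed in this paper.

```python
def roc_points(labels, scores):
    """Return (1 - specificity, sensitivity) pairs, one per score threshold.

    labels: 1 for a positive sample, 0 for a negative sample.
    scores: higher score means "more likely positive".
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    demo_labels = [1, 1, 0, 1, 0, 0]
    demo_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    curve = roc_points(demo_labels, demo_scores)
    for fpr, tpr in curve:
        print(f"1 - specificity = {fpr:.2f}, sensitivity = {tpr:.2f}")
    print("AUC =", round(auc(curve), 4))
```

A curve that rises quickly towards the top-left corner, and hence has an area close to 1, indicates a classifier that separates the two classes well.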

Figure 12: Linear Data Analysis. Linear data, chromatographic technique (x-axis: Proteins Formation; y-axis: Samples). Fitted trend line: y = 2554x + 36508.

Figure 13: Non Linear Data Analysis. Non-linear data, electrophoresis technique (x-axis: Concentration; y-axis: Samples). Fitted trend line: y = -4.516x^2 + 4342x - 29016.
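The two trend lines reported with Figures 12 and 13 can be evaluated directly. The sketch below uses the coefficients as they are printed in the figures; the function names are hypothetical, and it is assumed that x denotes the horizontal-axis quantity of each plot.

```python
def linear_fit(x):
    """Trend line reported for Figure 12: y = 2554x + 36508."""
    return 2554.0 * x + 36508.0


def quadratic_fit(x):
    """Trend line reported for Figure 13: y = -4.516x^2 + 4342x - 29016."""
    return -4.516 * x ** 2 + 4342.0 * x - 29016.0


def quadratic_peak_x(a=-4.516, b=4342.0):
    """x at which a parabola a*x^2 + b*x + c reaches its extremum."""
    return -b / (2.0 * a)


if __name__ == "__main__":
    # Evaluate both fits at the tick positions of the horizontal axis.
    for x in (0, 200, 400, 600):
        print(x, linear_fit(x), round(quadratic_fit(x), 1))
    print("quadratic fit peaks near x =", round(quadratic_peak_x(), 1))
```

Because the leading coefficient of the quadratic fit is negative, that curve rises, reaches a maximum and then falls, whereas the linear fit grows at a constant rate over the whole range.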


REFERENCES

[1] Hoon Choi, Seungsoo Han, Donghyuk Shin, Sangho Lee. (2012), Polyubiquitin recognition by AtSAP5, an A20-type zinc finger containing protein from Arabidopsis thaliana.

[2] Małgorzata Grabinska, Paweł Błazej and Paweł Mackiewicz (Wrocław). (2013), Two Algorithms based on Markov Chains and their application to Recognition of Protein coding genes in Prokaryotic Genomes.

[3] Michael Scott Brown and James Coker. (2014), Niche Genetic Algorithms are better than traditional Genetic Algorithms for de novo Protein Folding.

[4] Ilda D’Annessa, Cinzia Tesauro, Paola Fiorani, Giovanni Chillemi, Silvia Castelli, Oscar Vassallo, Giovanni Capranico, and Alessandro Desideri. (2012), Role of Flexibility in Protein-DNA-Drug Recognition: The Case of Asp677Gly-Val703Ile Topoisomerase Mutant Hypersensitive to Camptothecin.

[5] Alexander S. Krylov and Renad I. Zhdanov. (2012), Nucleic acid – protein fingerprints. Novel protein classification based on nucleic acid – protein recognition.

[6] Gaëlle LENGLET, Sabine DEPAUW, Denise MENDY and Marie-Hélène DAVID-CORDONNIER. (2013), Protein recognition of the S23906-1–DNA adduct by nuclear proteins: direct involvement of glyceraldehyde-3 phosphate dehydrogenase (GAPDH).

[7] Alfred V. Aho. (2012), Algorithms for finding patterns in Strings.

[8] Elena N. Kitova, Amr El-Hawiet, Paul D. Schnier, John S. Klassen. (2012), Reliable Determinations of Protein–Ligand Interactions by Direct ESI-MS Measurements. Are We There Yet?

[9] Parwiz Abrahimi, William G. Chang, Martin S. Kluger, Yibing Qyang, George Tellides, W. Mark Saltzman, Jordan S. Pober. (2015), Efficient Gene Disruption in Cultured Primary Human Endothelial Cells by CRISPR/Cas9.

[10] W. Liu, Y.F. Yao, L. Zhou, Q.Y. Ni and H.L. Xu. (2013), Evolutionary analysis of the short-type peptidoglycan-recognition protein gene (PGLYRP1) in primates.

[11] Oleg V. Kovalenko, Andrea Olland, Nicole Piché-Nicholas, Adarsh Godbole, Daniel King, Kristine Svenson, Valerie Calabro, Mischa R. Müller, Caroline J. Barelle, William Somers, Davinder S. Gill, Lidia Mosyak and Lioudmila Tchistiakova. (2013), Atypical Antigen Recognition Mode of a Shark IgNAR Variable Domain Characterized by Humanization and Structural Analysis.

[12] Quentin R. Johnson, Richard J. Lindsay, Loukas Petridis and Tongye Shen. (2015), Investigation of Carbohydrate Recognition via Computer Simulation.

[13] Jiansheng Jiang, Bing-Rui Zhou, Rodolfo Ghirlando and Tsan Xiao. (2013), A conserved mechanism for centromeric nucleosome recognition by centromere protein CENP-C.

[14] Wei-Lun Hsu. (2013), Mechanisms of binding diversity in Protein Disorder: Molecular Recognition features mediating protein interaction Networks.

[15] Wells, J. A.; McClendon, C. L., Reaching for high-hanging fruit in drug discovery at protein-protein interfaces. Nature 2007, 450, (7172), 1001-9.

[16] Mulder, G. J., Ueber die Zusammensetzung einiger thierischen Substanzen. Journal für praktische Chemie 1839, 16, 129-151.

[17] Campbell, N. A., Biologie. Spektrum Akademischer Verlag: Heidelberg, Berlin, Oxford, 1997; p 80.

[18] Crick, F. H., The genetic code--yesterday, today, and tomorrow. Cold Spring Harb Symp Quant Biol 1966, 31, 1-9.

[19] Atkins, J. F.; Gesteland, R., Biochemistry. The 22nd amino acid. Science 2002, 296, (5572), 1409-10.

[20] Xu, X. M.; Carlson, B. A.; Mix, H.; Zhang, Y.; Saira, K.; Glass, R. S.; Berry, M. J.; Gladyshev, V. N.; Hatfield, D. L., Biosynthesis of selenocysteine on its tRNA in eukaryotes. PLoS Biol 2007, 5, (1), e4.