
Interdiscip Sci Comput Life Sci (2014) 6: 1–4

DOI: 10.1007/s12539-013-0042-7

Performance Evaluation of Warshall Algorithm and Dynamic Programming for Markov Chain in Local Sequence Alignment

Mohammad Ibrahim Khan, Md. Sarwar Kamal∗

(Department of Computer Science & Engineering, Chittagong University of Engineering and Technology, Chittagong 4349, Bangladesh)

Received 21 September 2013 / Revised 7 February 2014 / Accepted 23 February 2014

Abstract: Markov chains are very effective for prediction, particularly on long data sets. In DNA sequencing it is important to infer the presence of particular nucleotides from the previous history of the data set. We apply the Chapman-Kolmogorov equation to carry out the Markov chain prediction. The Chapman-Kolmogorov equation is the key to identifying the proper positions in the DNA chain, and it is a powerful tool in mathematics as well as in other prediction-based research. It incorporates the scores of DNA sequences calculated by various techniques. Our research uses the fundamentals of the Warshall Algorithm (WA) and Dynamic Programming (DP) to measure the scores of DNA segments. The experiments indicate that the Warshall Algorithm is well suited to short DNA sequences, whereas Dynamic Programming is well suited to long DNA sequences. Beyond these findings, it is important to measure the risk factors of local sequencing when matching local sequence alignments, whatever their length.

Key words: Hidden Markov model, Chapman-Kolmogorov formula, Warshall algorithm, Dynamic Programming, Score measurement.

1 Introduction

A Markov chain can easily be formulated on the state space of a simple model, for example 0 for the first nucleotide Adenine (A), 1 for the second nucleotide Cytosine (C), 2 for Thymine (T) and finally 3 for Guanine (G). The same data set can also be modelled with a second-order Markov chain, whose states are {00, 01, 02, 03, 10, ···}. Since DNA sequences contain four fundamental bases, there are $4^2 = 16$ possible states in this case. However, a higher-order model is more complex than a simple one, and for this reason the Markov process is kept on the simple state space. For example, to predict the 10000th nucleotide in a sequence, we only have to measure the 9999th nucleotide of the same sequence.
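The state spaces described above can be sketched in a few lines of Python. This is only an illustration: it assumes the bases are coded 0 to 3 in the order A, C, T, G, and the names `CODE`, `first_order_states` and `second_order_states` are hypothetical, not taken from the authors' implementation.

```python
from itertools import product

# Assumed encoding (illustrative): one integer code per base, in the order A, C, T, G.
BASES = "ACTG"
CODE = {base: index for index, base in enumerate(BASES)}


def first_order_states(sequence):
    """Map each nucleotide to its single-symbol state code."""
    return [CODE[base] for base in sequence]


def second_order_states(sequence):
    """Map each overlapping pair of nucleotides to a two-symbol state such as '01'."""
    codes = first_order_states(sequence)
    return [f"{a}{b}" for a, b in zip(codes, codes[1:])]


# Every possible second-order state: 4 ** 2 = 16 ordered pairs of codes.
ALL_SECOND_ORDER = ["".join(pair) for pair in product("0123", repeat=2)]
```

The second-order chain needs 16 states instead of 4, which is the growth in complexity that motivates staying with the simple model.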

At the same time, it is possible to quantify pairwise evolutionary distances using the Hamming distance. If U is the total number of mismatches in an alignment of length l, then the Hamming distance per 10000 sites is

$$H(U, l) = \frac{10000\,U}{l} \qquad (1)$$

∗Corresponding author. E-mail: [email protected]

Equation (1) works well when the DNA sequence space is discrete.
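As a small worked example of equation (1), the sketch below counts mismatches between two equal-length aligned sequences and scales the count to 10000 sites. It assumes the numerator of equation (1) is the mismatch count U; the function name `hamming_per_10000` is hypothetical.

```python
def hamming_per_10000(seq_a, seq_b):
    """Return the number of mismatches per 10000 aligned sites (equation 1)."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    # U: total number of positions where the two aligned sequences differ.
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    # l: alignment length.
    return 10000 * mismatches / len(seq_a)
```

For instance, one mismatch in an alignment of length 4 gives 10000 × 1 / 4 = 2500 mismatches per 10000 sites.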

For the discrete Markov process over the nucleotides A, G, C and T, the starting distribution is $P_0 = (\rho_C, \rho_A, \rho_G, \rho_T)$. Fig. 1 shows the basic state transitions of the Markov process for the discrete system in DNA sequencing.

Fig. 1 The transition states for the Markov chain (states A, T, C and G)

2 Motivation

Several algorithms implemented between 2001 and 2005, including those of Ning (Ning et al., 2001), Kent (Kent et al., 2002), Schwartz (Schwartz et al., 2002) and Watanabe (Watanabe et al., 2005), examined the accuracy of sequencing. Genome sequences may contain repeated or nearly repeated patterns in the given or collected data sets. As a result, the outcomes may be the same for many positions. Conversely, mutations and indels may lead to incorrect judgments in sequencing and mapping, whether the alignment or mapping is local, global or pairwise.

The algorithms above do not consider repeated patterns, mutations or incorrect mapping. In these situations we have observed that the system rejects data sets and shrinks the usable region, and as a consequence the calculations become complicated as well as wrong. Ewing and Green (Ewing et al., 1998) proposed a solution to overcome the ambiguity, but this method still struggles in low-quality regions. Here we have implemented the Markov chain concept under the Chapman-Kolmogorov equations to establish probable sequenced positions. In the field of sequencing, many software packages help to detect sequences; the frequently used systems are PolyPhred (Stephens et al., 2005), SNP detector (Zhang et al., 2005) and novoSNP (Weckx et al., 2006). However, these three systems only detect genotype samples. They are unable to handle dynamic patterns and sequences.

In this research we apply prediction using a Markov process, supported by Dynamic Programming and the Warshall Graph Algorithm. The fundamental step of the prediction is the Chapman-Kolmogorov calculation. We have compared the DP and WA algorithms on local sequence alignment with sequences of various lengths. Detection depends on homolog finding, which in turn depends mainly on database search (Stephens et al., 2006). Database search, however, is not always efficient because of the size and cost of the equipment. Training-based sequencing cannot identify the proper coding regions because of its lack of generalization (Weckx et al., 2006). Besides, predictions are vulnerable to many false-positive identifications and to sensitivity problems (Yetisgen-Yildiz et al., 2005).

3 Dynamic programming

Dynamic Programming decomposes the sequences into several parts and solves the problem recursively until it reaches a particular terminating condition. Sometimes the decomposition becomes difficult and it is hard to obtain a clear-cut solution. In that case we should investigate several possibilities, and a recursive solution is one of the acceptable methods to overcome the problem. Basically, Dynamic Programming is used to choose the Maximum Matching Sub Sequences (MMSS).

Suppose $P = p_1, p_2, \dots, p_n$ and $Q = q_1, q_2, \dots, q_m$ are two DNA sequences. An indel of length $g$ between the two sequences carries the weight $M(g)$. We first define the best alignment as $F(p, q) = \max \forall (p^{*}, q^{*})$, the maximum over all alignments $(p^{*}, q^{*})$ of $p$ and $q$, where $F$ is a function that relates the current alignment to the new alignment. Using Dynamic Programming, we can evaluate $F(p, q)$ recursively:

$$F(i, j) = F(p_1, p_2, \dots, p_i;\ q_1, q_2, \dots, q_j)$$

where $F(0, 0) = 0$, $F(0, j) = F(q_1, q_2, \dots, q_j) = M(j)$ and $F(i, 0) = M(i)$. Then:

$$F(i, j) = \max\Big\{ F(i-1, j-1) + F(p_i, q_j),\ \max_{1 \le k \le i}\{F(i-k, j) + M(k)\},\ \max_{1 \le l \le j}\{F(i, j-l) + M(l)\} \Big\}$$

For local sequence alignment, the function is

$$L(p, q) = \max\{F(p_u, p_{u+1}, \dots, p_v;\ q_x, q_{x+1}, \dots, q_y) : 1 \le u \le v \le n,\ 1 \le x \le y \le m\}.$$
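The recurrence above can be sketched in Python as follows. This is a minimal illustration, not the authors' Java implementation. It makes several assumptions that the text does not state: a match score of 2 and a mismatch score of -1 stand in for $F(p_i, q_j)$, the indel weight is taken as $M(k) = -2k$, and the maximum over all substring pairs in $L(p, q)$ is obtained with the standard Smith-Waterman device of flooring every cell at 0. The name `local_alignment_score` is hypothetical.

```python
def local_alignment_score(p, q, match=2, mismatch=-1, gap=lambda k: -2 * k):
    """Best local alignment score of p and q under a general indel weight gap(k)."""
    n, m = len(p), len(q)
    # F[i][j]: best score of a local alignment ending at p[i-1] and q[j-1].
    # Row 0 and column 0 stay at 0 because a local alignment may start anywhere.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Substitution term: align p_i with q_j.
            pair_score = match if p[i - 1] == q[j - 1] else mismatch
            diagonal = F[i - 1][j - 1] + pair_score
            # Indel of length k in q: skip k symbols of p.
            skip_p = max(F[i - k][j] + gap(k) for k in range(1, i + 1))
            # Indel of length l in p: skip l symbols of q.
            skip_q = max(F[i][j - l] + gap(l) for l in range(1, j + 1))
            # Zero floor: restart the alignment if every option is negative.
            F[i][j] = max(0, diagonal, skip_p, skip_q)
            best = max(best, F[i][j])
    return best
```

With a general indel weight the inner maxima make the running time grow as $O(nm(n+m))$. A linear or affine weight would reduce this to $O(nm)$, at the cost of restricting the form of $M$.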

4 Warshall algorithm alignment

The Warshall graph algorithm is a path-finding procedure that divides the entire data set into sets of intermediate nodes along a path. To perform the decomposition, it is convenient to label the nodes from 1 to n. The decomposition helps to reduce the shortest paths along the sequences. Describing a path with the variables I, J and K, if there is a path from I to J and a path from J to K, then there is a path from I to K. In short, P[I, J, K] is the shortest path from I to J using only the intermediate nodes 1, ···, K. The recursive process for solving the shortest path is as follows:

$$P[I, J, K] = \min\{P[I, J, K-1],\ P[I, K, K-1] + P[K, J, K-1]\} \qquad (2)$$

Equation (2) can be expressed as the following algorithmic steps.

Sequencing Graph (Adjacency Matrix: ADMR, n × n)

1. Initialize the graph weights for all nodes as Weight := ADMR (Weight = $w_{ij}$)

2. Loop K = 1 to n;

3. Inner loop J = 1 to n;

4. Inner loop I = 1 to n;

5. $w_{ij} = w_{ij} \lor (w_{ik} \land w_{kj})$

6. That is, if $w_{ik}$ is set, $w_{ij} = w_{ij} \lor w_{kj}$

7. Return Weight
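The steps above, together with the shortest-path recurrence of this section, can be sketched in Python as follows. This is an illustrative sketch rather than the authors' Java code. `transitive_closure` follows the Boolean update on a 0/1 adjacency matrix, and `shortest_paths` applies the min-plus recurrence to a matrix of edge weights, assuming `float("inf")` marks a missing edge. Both function names are hypothetical.

```python
INF = float("inf")


def transitive_closure(adjacency):
    """Boolean Warshall update: w[i][j] becomes True if j is reachable from i."""
    n = len(adjacency)
    w = [[bool(cell) for cell in row] for row in adjacency]  # Weight := ADMR
    for k in range(n):          # intermediate node K
        for j in range(n):      # inner loop J
            for i in range(n):  # inner loop I
                w[i][j] = w[i][j] or (w[i][k] and w[k][j])
    return w


def shortest_paths(weights):
    """Min-plus form: p[i][j] is the shortest distance from i to j."""
    n = len(weights)
    p = [row[:] for row in weights]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Either keep the path that avoids k, or route through k.
                p[i][j] = min(p[i][j], p[i][k] + p[k][j])
    return p
```

Both functions use three nested loops over n nodes, so their cost grows as $O(n^3)$ regardless of how many edges the graph has.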

Suppose $n = 0, 1, 2, 3, \dots$ and $m = 1, 2, 3, \dots$, and $i_0, \dots, i_m \in E$, where $E$ is the set of all possible values that the random variable $X_t$ can assume. Then

$$\Pr\{X_{n+1} = i_1, \dots, X_{n+m} = i_m \mid X_n = i_0\} = \mathrm{Pr}_{i_0 i_1} \cdot \mathrm{Pr}_{i_1 i_2} \cdots \mathrm{Pr}_{i_{m-1} i_m}$$

and

$$\begin{aligned}
\Pr\{X_{n+m} = j \mid X_n = i\}
&= \sum_{k=0}^{\infty} \Pr\{X_{n+m} = j \mid X_{n+1} = k, X_n = i\}\,\Pr\{X_{n+1} = k \mid X_n = i\} \\
&= \sum_{k=0}^{\infty} \Pr\{X_{n+m} = j \mid X_{n+1} = k\}\,\Pr\{X_{n+1} = k \mid X_n = i\} \\
&= \sum_{k=0}^{\infty} \mathrm{Pr}_{ik}\,\mathrm{Pr}^{m-1}_{kj} \\
&= \mathrm{Pr}^{m}_{ij}
\end{aligned}$$

In general,

$$\mathrm{Pr}^{n+m}_{ij} = \sum_{k \in E} \mathrm{Pr}^{n}_{ik}\,\mathrm{Pr}^{m}_{kj} \quad \text{for all } n, m \ge 0, \text{ all } i, j \in E.$$
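The Chapman-Kolmogorov relation can be checked numerically on a small example. The sketch below uses a made-up 4 × 4 transition matrix over the states A, C, T, G; the probabilities are illustrative values, not data from the experiments, and the helper names are hypothetical. It verifies that the (n + m)-step matrix equals the product of the n-step and m-step matrices.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def mat_pow(p, steps):
    """Return the `steps`-step transition matrix (identity for steps = 0)."""
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(steps):
        result = mat_mul(result, p)
    return result


# Hypothetical one-step transition probabilities; each row sums to 1.
P = [
    [0.40, 0.20, 0.20, 0.20],
    [0.10, 0.50, 0.20, 0.20],
    [0.30, 0.30, 0.30, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]


def check_chapman_kolmogorov(p, n, m, tol=1e-12):
    """True if the (n+m)-step matrix matches the n-step times the m-step matrix."""
    lhs = mat_pow(p, n + m)
    rhs = mat_mul(mat_pow(p, n), mat_pow(p, m))
    size = len(p)
    return all(abs(lhs[i][j] - rhs[i][j]) <= tol for i in range(size) for j in range(size))
```

Entry (i, j) of `mat_pow(P, m)` is the probability of being in state j exactly m steps after state i, so predicting a distant nucleotide reduces to repeated multiplication by the one-step matrix.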

5 Implementation

We implemented and ran the experiments in Java using the NetBeans Integrated Development Environment (IDE). The object-oriented implementation allowed us to represent each nucleotide (A, C, T and G) as a distinct object. In our previous work (Khan et al., 2013) we improved on the performance of (Waqaar et al., 2008) and found that our RSAM algorithm is significantly better in terms of speed, complexity, space, sensitivity, accuracy and risk. In this research we compare all of the above parameters for the Markov Process and the Warshall Graph Algorithm. For speed, sensitivity and accuracy we measured reference values as best, average and low. Here we have checked complexity, risk, accuracy and space for the first time, and many local sequence alignment tools have measured sensitivity without any standard parameter. MUMmer (Delcher et al., 2002) defines the parameter 'q' as the ratio of accurately aligned nucleotide pairs to the total number of nucleotides in the given sequence. Total Column Score (TCS) is another aspect of the MUMmer procedure. In AVID (Bray et al., 2002), the authors consider the alignment pairs whose score is greater than a predefined threshold value. Instead of the methods above, we have concentrated on set operations within a complete machine learning process on exons and introns. Intron measurement is also an essential part of the alignment, so that proper checking is maintained rather than checking only one parameter (exons). For speed, sensitivity and accuracy the reference values have been assigned in a fuzzy manner: best (H), average (M) and low (L). However, (Waqaar et al., 2008) gives no basis for comparing the sensitivity of the sequencing.

6 Result

In the outcomes of these two processes, a few interesting changes were observed. The Chapman-Kolmogorov equation in the Markov process is very efficient for arbitrary predictions in any DNA segment or sequence. The Markov Process clearly has the potential and the capability to handle the data set whatever the environment. We also found that MP has significantly better scope for long data sets. Figure 2 shows the experimental outcomes for the Markov Process.

Fig. 2 Markov Process outcomes for the data set.

In contrast, the Warshall Graph Algorithm has a more limited scope than the Markov Process, but it is faster on short data sets. Its processing speed is sharply better than that of the Markov Process. Other parameters such as complexity, risk, sensitivity, space and accuracy are also significantly better than for the Markov Process, owing to its faster solving capability. Its only pivotal drawback is that the Warshall Graph Algorithm works only for short data sets, whether protein, DNA or RNA. Figure 3 shows the performance impact of Dynamic Programming for the Markov Process.

Fig. 3 The impact of Dynamic Programming for Markov Process (x-axis: local sequence lengths, 5 to 20 ×10^5; y-axis: CPU utilization, ns).

Here the CPU utilization times for the long data sets, with lengths ranging from 1500000 to 2000000, are 8.2, 8.5, 10, 11.4, 12 and 12.5. In contrast, the Warshall Graph Algorithm takes more time for the same data sets, with time values of 8.2, 9, 10.8, 12, 12.9 and 14. For the shorter data sets, whose lengths are less than 1500000, the Warshall Graph Algorithm takes less time than Dynamic Programming. Figure 4 shows the impact for the Warshall Graph Algorithm.

Fig. 4 Impact of Warshall Graph Algorithm (x-axis: local sequence lengths, 5 to 20 ×10^5; y-axis: CPU utilization, ns).

Fig. 5 Comparative illustration of Dynamic Programming and Warshall Graph Algorithm (x-axis: local sequence lengths, 5 to 20 ×10^5; y-axis: CPU utilization, ns).

7 Conclusion

Both Dynamic Programming and the Warshall Graph Algorithm predict DNA base pairs according to the process. Dynamic Programming is better able to handle large data sets because of its randomness. The Warshall Graph Algorithm, on the other hand, works on predefined values and paths. That is why the Warshall Graph Algorithm has to check the entire path and its values, whether the path is short or long. On small data sets, however, Dynamic Programming requires more time. In future work we will investigate why randomness costs more time and why a deterministic process is better for small data sets.

References

[1] Bray, N., Dubchak, I., Pachter, L. 2002. “Avid: A global alignment program”, Genome Research. 2003 13: 97-102.

[2] Delcher, A.L., Kasif, S., Fleischmann, R.D., Peterson, J., White, O., Salzberg, S.L. 1999. “Alignment of whole genomes”, Nucleic Acids Res. 27: 2369-2376.

[3] Ewing, B., Green, P. 1998. “Base-calling of automated sequencer traces using phred. II. Error probabilities”, Genome Res. 8: 186-194.

[4] Kent, W.J. 2002. “BLAT—the BLAST-Like Alignment Tool”, Genome Res. 12: 656-664.

[5] Kent, W.J., Sugnet, C., Furey, T., Roskin, K., Pringle, T., Zahler, A., Haussler, D. 2002. “The human genome browser at UCSC”, Genome Res. 12: 996-1006.

[6] Khan, M.I., Kamal, M.S. 2013. “RSAM: An Integrated Algorithm for Local Sequence Alignment”, Archives Des Sciences, Vol. 66, No. 5, ISSN 1661-464X, 395-412.

[7] Ning, Z., Cox, A.J., Mullikin, J.C. 2001. “SSAHA: A fast search method for large DNA databases”, Genome Res. 11: 1725-1729.

[8] Schwarz, D.S., Hutvagner, G., Du, T., Xu, Z., Aronin, N., Zamore, P.D. 2003. “Asymmetry in the assembly of the RNAi enzyme complex”, Cell 115: 199.

[9] Stephens, M., Sloan, J.S., Robertson, P.D., Scheet, P., Nickerson, D.A. 2006. “Automating sequence-based detection and genotyping of SNPs from diploid samples”, Nat. Genet. Vol. 38, 375-381.

[10] Waqaar, H., Alex, A., Bharath, R. 2008. “An Efficient Algorithm for Local Sequence Alignment”, 20-24.

[11] Watanabe, T., Takeda, A., Mise, K., Okuno, T., Suzuki, T., Minami, N., Imai, H. 2005. “Stage-specific expression of microRNAs during Xenopus development”, FEBS Lett. 579: 318.

[12] Weckx, S., Del-Favero, J., Rademakers, R., Claes, L., Cruts, M., De Jonghe, P., Van Broeckhoven, C., De Rijk, P. 2005. “novoSNP, a novel computational tool for sequence variation discovery”, Genome Res. 15: 436-442.

[13] Yetisgen-Yildiz, M., Pratt, W. 2005. “The effect of feature representation on Medline document classification”, In AMIA Annual Symposium Proceedings, American Medical Informatics Association, Vol. 23, 849.

[14] Zhang, J., Wheeler, D.A., Yakub, I., Wei, S., Sood, R., Rowe, W., Liu, P.P., Gibbs, R.A., Buetow, K.H. 2005. “SNPdetector: A software tool for sensitive and accurate SNP detection”, PLoS Comput. Biol. Volume 1, Issue 5, e53, 0395-0404.