
The Research Bulletin of Jordan ACM, ISSN: 2078-7952, Volume II(III) Page |57

Adapting Decision Tree-Based Method to Index Large DNA-Protein Sequence Datasets

Khalid Mohammad Jaber, Faculty of Science and Information Technology, Al-Zaytoonah University of Jordan

Rosni Abdullah, School of Computer Sciences, Universiti Sains Malaysia

Nur’Aini Abdul Rashid, School of Computer Sciences, Universiti Sains Malaysia

Biological databases have grown significantly in size, along with the number of users and the rate of queries; some databases are now of terabyte size. Hence, there is an increasing need to access databases at the fastest possible rate. Biologists in particular need a means of fast, scalable and accurate searching in biological databases. This may seem to be a simple task, given the speed of currently available processors. However, this is far from the truth, because the amount of data deposited into the databases is ever increasing, so searching them becomes a difficult and time-consuming task. Here, the computer scientist can help to organize data in a way that allows biologists to quickly search existing information. In this paper, a decision tree indexing model for DNA and protein sequence datasets is proposed. This method of indexing can effectively and rapidly retrieve all the similar proteins from a large database for a given protein query. A theoretical and conceptual framework is derived, based on published works that use indexing techniques for different applications. The methodology was then validated by extensive experiments using 10 data sets of varying sizes for DNA and protein. The experimental results show that the proposed method reduced the search space by an average of 97.9% for DNA and 98% for protein, compared to the Top Down Disk-based suffix tree methods currently in use. Furthermore, the proposed method was about 2.35 times faster for DNA and 29 times faster for protein than the BLAST+ algorithm in terms of query processing time.

Additional Key Words and Phrases: Indexing Approaches, Searching Algorithms, Decision Tree, DNA, Protein, Bioinformatics, Data Mining

1. INTRODUCTION

One of the major challenges for computer science researchers is the increase in the amount of biological and genomic data, together with growing numbers of users and rates of queries; some databases are of terabyte size. In real-life situations, for any given query against a large database, the need to quickly retrieve all similar data is of utmost importance. With the ever-growing amount of data deposited in the databases, searching becomes a difficult and time-consuming task. This paper is intended to help biologists search existing biological data easily, quickly and effectively.

This paper focuses on a decision tree indexing algorithm that builds indexes for enormous DNA-protein datasets, in order to reduce the computation time of the searching algorithm as well as the space required to do the computation and store the indexes. The protein is the most prevalent bio-molecule. It consists of a long chain of amino acids that folds into a complex three-dimensional structure. Structurally, each amino acid is one of 20 molecules consisting of an amine group, a carboxyl group, and a variable molecular side chain that determines the identity and chemical activity of the amino acid. DeoxyriboNucleic Acid (DNA) consists of two strands wound around each other in a double helix. Each strand of the DNA consists of a sugar (deoxyribose) and a base (adenine, thymine, cytosine or guanine). The sequence of the four characters (A, T, C and G) in each strand forms the genetic code of human life. The strands are connected by hydrogen bonds between adenine and thymine (A, T) and between cytosine and guanine (C, G). Each of these pairings is called a base pair [Westhead et al. 2002].

This paper is organized in six sections. Section 1 explains the protein database problems and the contribution of this paper. Section 2 defines the workflow of adapting the decision tree for indexing, followed by the benchmark for the proposed method in Section 3. Section 4 presents the DNA and protein data sets used for the evaluation of the indexing model. Section 5 presents the experiments and analyzes the results. Finally, the authors conclude and present future work in Section 6.

2. ADAPTING THE DECISION TREE FOR INDEXING: FRAMEWORK

Database index searching is one of the approaches adopted to achieve fast, accurate and efficient searching. Indexed genomic searching is a method that reduces the cost and time of searching [Williams and Zobel 2002]. In other words, the index method can efficiently solve memory storage capacity problems by cutting down the memory storage requirement and access time [Williams 2003; Jiang et al. 2007]. In the absence of better methods, the index method is necessary to solve the serious problems arising from the use of exhaustive search methods [Williams and Zobel 2002]. The methods used for genomic indexing are classified under three categories, which are shown in Figure 1.

Fig. 1. Classification of Methods for Genomic Searching Approaches

These approaches were discussed in our previous works [Jaber et al. 2010b; 2010a]. In general, the searching approaches have many drawbacks, which can be classified under three categories:

(1) Storage problem: the results produced by the searching approaches are larger than the original data.

(2) Memory problem: this problem is related to the storage problem. Reading large data located on disk and trying to fit it into the main memory causes problems, since the capacity of the main memory is much smaller than the capacity of the storage disk.

(3) Poor accuracy: this problem is due to difficulties in finding the correct query solution. Large databases increase the size of the search space, which in turn increases the chance of choosing the wrong solution.

Several existing approaches illustrate these problems. For example, RAMdb requires a large index, about twice the size of the original flat file database (storage problem) [Jiang et al. 2007]. In FASTA and BLAST, all sequences must be in the main memory when a program executes, therefore a large space is needed (storage and memory problems) [Williams 2003]. Inverted files suffer from poor retrieval accuracy and as such are not very useful for small query proteins with few SSEs (poor accuracy) [Gao and Zaki 2008]. The suffix tree is a possible data structure for indexing, but it can only be built for small sequences due to the memory bottleneck (memory problem) [Farach et al. 1998; Tian et al. 2005]. Because of these limitations, and because searching large genomic data sets is time consuming and computationally complex, the adapted decision tree-based method is proposed to index data from large protein databases.

To the best of our knowledge, no research effort has so far used the decision tree as an indexing approach for biological data. However, the decision tree has been used as an indexing approach in other fields, such as information retrieval [Quellec et al. 2007], video indexing [Shearer et al. 2001], graph databases [Irniger and Bunke 2004; 2008] and others. The decision tree is also able to deal with large data sets, as several previous efforts applied it to large data, such as SLIQ [Mehta et al. 1996], SPRINT [Shafer et al. 1996] and CLOUDS [Alsabti et al. 1998]. Furthermore, a decision tree can be constructed without any domain knowledge or parameter setting, and it has good accuracy, speed, robustness, scalability and interpretability [Han and Kamber 2006]. Figure 2 shows the main steps of the proposed framework. In the subsequent subsections, the phases are described in detail. They involve:

(1) The downloading of DNA-protein datasets from public databases.
(2) Pre-processing phase.
(3) Building the index using decision tree algorithms.
(4) Storing the indices.
(5) Searching DNA-Protein queries in the decision tree indices.

2.1. Download DNA-Protein Datasets from Public Databases

The first step in this research is to download DNA and protein datasets from genomic databases such as GenBank and Swiss-Prot (an example of a DNA sequence is given in Figure 3). The sequences are downloaded and stored locally in a text file, and used in the next phase (the pre-processing phase). They are also used for the evaluation of the indexing model.

2.2. Pre-Processing Phase

The aim of this phase is to prepare the representation of the data sets to be stored in the databases. This phase is carried out in three steps. The first step is to retrieve data from the text file by opening the file and reading the data which it contains line by line. The second step is to extract the description of the sequence, such as a unique library accession number (ID), family name and so on. Figure 4 shows an example of the description of the sequence.
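The second step can be illustrated with a minimal sketch that splits a description line of the form shown in Figure 4 into its fields. The struct `SequenceDescription`, the function `parseHeader` and the rule for reading the chromosome number out of the free-text description are assumptions made for illustration; the paper does not specify how the description is parsed.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for the fields of one description line.
struct SequenceDescription {
    std::string gi;           // GenBank identifier, e.g. "157697810"
    std::string accession;    // accession number, e.g. "NW_001838472.1"
    std::string description;  // free-text part of the line
    std::string chromosome;   // e.g. "18"; empty when not mentioned
};

// Parses a line of the form ">gi|<id>|<db>|<accession>| <text>".
// Returns false when the line is not such a description line.
inline bool parseHeader(const std::string& line, SequenceDescription& out) {
    if (line.empty() || line[0] != '>') return false;

    // Split everything after '>' on the '|' separator.
    std::vector<std::string> fields;
    std::istringstream in(line.substr(1));
    std::string field;
    while (std::getline(in, field, '|')) fields.push_back(field);
    if (fields.size() < 5 || fields[0] != "gi") return false;

    out.gi = fields[1];
    out.accession = fields[3];
    out.description = fields[4];
    // Drop the blanks that follow the last separator.
    out.description.erase(0, out.description.find_first_not_of(' '));

    // Assumption: the chromosome is the token after the word "chromosome".
    out.chromosome.clear();
    const std::string key = "chromosome ";
    const std::size_t at = out.description.find(key);
    if (at != std::string::npos) {
        const std::size_t start = at + key.size();
        const std::size_t end = out.description.find_first_of(" ,", start);
        out.chromosome = out.description.substr(
            start, end == std::string::npos ? std::string::npos : end - start);
    }
    return true;
}
```

Under these assumptions, the chromosome field could serve as the target attribute for DNA data, in line with the choice of target described for the third step.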

Fig. 2. Workflow for Decision Tree Indexing

>gi|157697810|ref|NW_001838472.1| Homo sapiens chromosome 18 genomic contig, alternate assembly, whole genome shotgun sequence

GTGGTGATGGTGATGGTGATGATGGTGGTGGTGATGGTGATGGTGATGATGGTGGTGATGGTGGTAGTGATGGTGGTAATAGTGATTCTGAACTGCCTCTTAGGGGCTTAGGGGAAAGCAATGTGCAGAGTTGAAGTCACCTCAGCAGGGTGCCTGGAGATGAGACCAGTAGAAATACAGAAGTGATAAACAGAGCCTCAGGGGAAGGTTCTCCACGTAAGTAGAATGACCAATCTCCAAAAGAATTCATTGGCATTGAAACAGATTCTAACTGTTCAGAATTTTCTAAACGTTTCTCAAAAATGAAAATTCTCCCACTTGGGAAGGCAGCAGAGAAGAGGAGAAACAGC

Fig. 3. An Example of DNA Sequence

>gi|157697810|ref|NW_001838472.1| Homo sapiens chromosome 18 genomic contig, alternate assembly (based on HuRef), whole genome shotgun sequence

Fig. 4. An Example for Description of the Sequence

The third step is to process and build efficient data representations which are suitable for SQL databases, to reduce the processing time and space required. The data sets are represented as a collection of tuples (S) that may contain duplicates. Tuples are also known as records, rows, samples or instances. Each tuple is represented as a vector of attribute values (A). An attribute is also known as a field, variable or feature. The data set provides the description of the attributes and their domains. In this paper, data sets are denoted as D(A ∪ Y), where A denotes the set of n input attributes, A = {a_1, ..., a_i, ..., a_n}, as shown in matrix Equation 1, and Y represents the target attributes.


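\[
D(A \cup Y) =
\left[
\begin{array}{cccc|c}
a_{11} & a_{12} & \cdots & a_{1n} & y_{1,|pos(y)|} \\
a_{21} & a_{22} & \cdots & a_{2n} & y_{2,|pos(y)|} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn} & y_{m,|pos(y)|}
\end{array}
\right]
\tag{1}
\]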

The possible values of attribute a_i are denoted as pos(a_i) = {v_{i,1}, v_{i,2}, ..., v_{i,|pos(a_i)|}}. The set of all possible inputs is defined as the Cartesian product of the possible values of all input attributes: X = pos(a_1) × pos(a_2) × ... × pos(a_n). In a similar manner, pos(y_i) = {T_1, ..., T_{|pos(y)|}} represents the possible values of the target attributes. The training set consists of a set of m tuples. Formally, the training set is denoted as Tset(D) = (〈x_1, y_1〉, ..., 〈x_m, y_m〉), where x_q ∈ X and y_q ∈ pos(y). The difficulty of this step lies in selecting the target attributes. Here, the target is taken to be the protein family name for protein data and the type of chromosome for DNA data. After the data representations were built in the pre-processing phase, the data were stored in databases, to allow for multiple ways of accessing the stored data and to make it searchable. With a plain file system, in contrast, there is a fixed path for accessing data and it is difficult to search.

2.3. Building the Indices

The first step in this phase is to initialize some parameters, such as the number of target attributes, the data type (DNA or protein), the name of the database, the names of the tables, and so on. The next step is to adapt the building of the decision tree to large data sets. Figure 5 shows the flowchart used to build the decision tree. The building algorithm has several steps, which are as follows:

(1) Initialize the splitting criterion. Information gain, which is computed from the entropy and measured in units called bits, is used as the measure for splitting on attributes. During this step, a statistical method is used to calculate the information gain.

(2) Select the attribute with the highest information gain, place it at the root node and store it in the database (loading and storing data issues are discussed in Section 2.4).

(3) Create one branch for each possible value in pos(a_i) = {v_{i,1}, v_{i,2}, ..., v_{i,|pos(a_i)|}}.

(4) Select one possible value v_{i,j} from pos(a_i). Then calculate the information gain of the remaining attributes, place the attribute with the highest value at the branch node and store it in the database. After this, the process is repeated recursively for each branch, using only those instances which actually reach the branch.

(5) If all the instances at a node belong to a single target value, the node is not split further and the recursive process down that branch terminates.

(6) If the maximum possible size of the tree is reached, stop growing the tree. Otherwise, repeat steps 4, 5 and 6.
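The splitting measure used in steps (1), (2) and (4) can be sketched as follows, using the standard definitions of Shannon entropy and information gain. This is only an in-memory illustration under stated assumptions: the names `Sample`, `entropy`, `informationGain` and `bestAttribute` are hypothetical, each attribute is assumed to be a single character, and the database loading and storing described in the paper is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One training tuple: attributes[i] holds the value of attribute a_(i+1).
struct Sample {
    std::string attributes;
    std::string target;
};

// Shannon entropy, in bits, of the target values in `samples`.
inline double entropy(const std::vector<Sample>& samples) {
    if (samples.empty()) return 0.0;
    std::map<std::string, std::size_t> counts;
    for (const Sample& s : samples) ++counts[s.target];
    double h = 0.0;
    for (const auto& entry : counts) {
        const double p = static_cast<double>(entry.second) / samples.size();
        h -= p * std::log2(p);
    }
    return h;
}

// Information gain of splitting `samples` on the attribute at index `attr`:
// entropy before the split minus the weighted entropy of the partitions.
inline double informationGain(const std::vector<Sample>& samples, std::size_t attr) {
    if (samples.empty()) return 0.0;
    std::map<char, std::vector<Sample>> partitions;
    for (const Sample& s : samples) partitions[s.attributes.at(attr)].push_back(s);
    double remainder = 0.0;
    for (const auto& entry : partitions) {
        const double weight = static_cast<double>(entry.second.size()) / samples.size();
        remainder += weight * entropy(entry.second);
    }
    return entropy(samples) - remainder;
}

// Index of the attribute with the highest gain (lowest index wins ties).
inline std::size_t bestAttribute(const std::vector<Sample>& samples, std::size_t attributeCount) {
    std::size_t best = 0;
    double bestGain = -1.0;
    for (std::size_t a = 0; a < attributeCount; ++a) {
        const double gain = informationGain(samples, a);
        if (gain > bestGain) {
            bestGain = gain;
            best = a;
        }
    }
    return best;
}
```

For four tuples whose first attribute fully determines the target, the gain of the first attribute is one bit and the gain of the second is zero, so the first attribute would be placed at the root.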

2.4. Hybrid Decision Tree Indexing Model with Database Management System Using Structured Query Language

Previous research has several limitations. Some methods try to fit large data sets into memory before inducing the decision tree: the Catlett method requires that all data are loaded into memory before induction [Sug 2005; Rokach and Maimon 2005], and [Mehta et al. 1996] uses a data structure which scales with the dataset size and must be resident in the main memory all the time [Shafer et al. 1996]. Furthermore, in previous indexing algorithms all data sets reside on disk and are moved into memory while the index is built, as in [Joshi et al. 1998; Zaki et al. 1999; Srivastava et al. 1999; Sreenivas et al. 1999; Jin and Agrawal 2003; Jin et al. 2005; Ben-Haim and Tom-Tov 2010]. Merged trees also have performance drawbacks: [Chan and Stolfo 1997] proposed dividing the datasets into many portions, each of which is loaded alone into memory and used to induce a decision tree; after this, the decision trees are merged to create a single tree. Because of these limitations and the index storage size problem, the hybrid Decision Tree Indexing Model (DTIM) with a Database Management System (DBMS) using SQL is proposed in the current study, to address the memory bottleneck and storage problems. Substantial research effort has applied SQL to data mining techniques to improve scalability [Sattler and Dunemann 2001]. Further examples of SQL techniques are presented in [Sarawagi et al. 1998; 2000] for association rules, in [Ordonez and Cereghini 2000] for clustering and in [Slezak and Sakai 2009] for decision rules. This section discusses the hybrid method from several perspectives: connecting the proposed method with the database, loading the data for induction, and the data representation for storing the new indexes produced by the decision tree indexing model. The next section describes searching a query with the decision tree index in greater detail.

Fig. 5. Flowchart for Building the Decision Tree

2.4.1. Connect DTIM with Databases. Connecting the decision tree indexing model with a DBMS has several benefits. For example, it allows multiple ways of easily accessing, organizing, managing and retrieving data stored in the database, and makes the data searchable by asking simple questions in a query language. File systems, in contrast, offer fixed paths for accessing data, which are slow, difficult and expensive to search, and require all data sets to be resident in the main memory during processing.


 1  /* MySQL Connector/C++ specific headers */
 2  #include <mysql.h>
 3  #include <cppconn/driver.h>
 4  #include <cppconn/connection.h>
 5  #include <cppconn/statement.h>
 6  #include <cppconn/prepared_statement.h>
 7  #include <cppconn/resultset.h>
 8  #include <cppconn/metadata.h>
 9  #include <cppconn/resultset_metadata.h>
10  #include <cppconn/exception.h>
11  #include <cppconn/warning.h>
12  /* ========================== */
13
14  #define DBHOST "tcp://127.0.0.1:3306"
15  #define USER "kjaber"
16  #define PASSWORD "password123"
17  #define DATABASE "kjaber"
18  #define TABLE "`100k`"
19  using namespace sql;
20
21  /* ========================== */
22
23  Driver *driver = get_driver_instance();
24  /* create a database connection using the Driver */
25  auto_ptr<Connection> con(driver->connect(DBHOST, USER, PASSWORD));
26
27  /* select appropriate database schema */
28  con->setSchema(DATABASE);
29
30  /* create a statement object */
31  auto_ptr<sql::Statement> prep_stmt(con->createStatement());
32
33  auto_ptr<sql::ResultSet> res;
34
35  /* run a query which returns exactly one result set */
36  res.reset(prep_stmt->executeQuery(lbnv->get()));
37
38  /* retrieve and display the result set metadata */
39  ResultSetMetaData *res_meta = res->getMetaData();
40  numcols = res_meta->getColumnCount();

Fig. 6. Create Connection

To create a connection between the proposed method and the DBMS, the latest MySQL connectors, developed by Sun Microsystems, were used. MySQL Connector/C++ provides an object-oriented API and a database driver for connecting C++ applications to the MySQL Server. The main components for establishing the connection are the Driver, Connection, Statement, PreparedStatement, ResultSet and ResultSetMetaData.

Figure 6 presents the algorithm for establishing a connection to the DBMS by retrieving an sql::Connection object from the sql::Driver object. An sql::Driver object is returned by the sql::Driver::get_driver_instance method, and the sql::Driver::connect method returns the sql::Connection object. The first parameter of the connect call is the connection URL, which is used to connect the system with the DBMS server over TCP/IP by specifying "tcp://[hostname[:port]][/schemaname]" (Figure 6 lines 14 and 25). Specifying the schema name in the connection URL is optional; if it is not set, the database schema must be selected using the Connection::setSchema method (Figure 6 line 28).


ID Parent (P) Node (N) Element (E) Decision (y)

Fig. 7. Indexes Data Representation

To prevent memory leaks and to reduce memory usage, dynamic memory was managed with the class template auto_ptr, as shown in Figure 6 lines 25, 31 and 33. An auto_ptr object has the same semantics as a pointer. However, when it goes out of scope, it automatically releases the dynamic memory which it manages. In other words, when the auto_ptr template class is in use, there is no need to free the memory explicitly using the delete operator, e.g., delete con;
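The scope-based release described above can be demonstrated with a small sketch. Note the substitution: std::auto_ptr was deprecated in C++11 and removed in C++17, so the sketch uses std::unique_ptr, its standard replacement, rather than the auto_ptr of Figure 6. The `Resource` type is a hypothetical stand-in for an object such as sql::Connection, and `liveCounts` is an illustrative helper.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for a heap-allocated object such as a connection.
// It increments a counter on construction and decrements it on destruction.
struct Resource {
    explicit Resource(int& liveCount) : live(liveCount) { ++live; }
    ~Resource() { --live; }
    int& live;
};

// Returns the number of live Resource objects inside the scope of the
// smart pointer (first) and after that scope has ended (second).
inline std::pair<int, int> liveCounts() {
    int live = 0;
    int inside = 0;
    {
        std::unique_ptr<Resource> r(new Resource(live));
        inside = live;  // the object exists while r is in scope
    }                   // r goes out of scope: the object is deleted here
    return std::make_pair(inside, live);
}
```

No explicit delete appears anywhere in the sketch; the destructor runs when the owning pointer leaves its scope.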

A further advantage of the hybrid DTIM with DBMS is that there is no need to load all of a huge data set into the main memory to induce the decision tree. Instead, a connection is created, simple questions are asked using SQL, the results are retrieved, and the connection is closed automatically.

2.4.2. Indexes Data Representation. An efficient data representation is proposed for storing the indexes produced by the decision tree indexing model, in order to minimize the index size. Usually, the results produced by a decision tree are stored as rules, which has many drawbacks, such as the following [Zhou et al. 2003b; 2003a]:

(1) Generated rules are often more complex than necessary and contain redundant information.

(2) Storage problems for large data and inefficient speed.

(3) Memory bottleneck problems, since all the rules need to fit in memory to do the search.

On the other hand, the proposed data representation has several advantages as follows:

(1) Efficient way to insert, delete and retrieve the search tree elements.
(2) Minimizes the number of database queries by having only one query for each activity.
(3) Minimizes the storage size.
(4) Is generated efficiently and easily.
(5) Is a solution to the memory problem, with no need to fit all indexes into the memory.

The indexes are represented by five attributes: ID, Parent, Node, Element and Decision, as shown in Figure 7. The ID is the unique identification number of each node. Parent (P) is the ID of the parent of each node. Node (N) represents the position of an element in the sequence. Element (E) represents the value taken by an attribute from the set of attributes (A), where A contains n attributes: A = {a_1, ..., a_i, ..., a_n}. The possible values of an attribute a_i are denoted by pos(a_i) = {v_{i,1}, v_{i,2}, ..., v_{i,|pos(a_i)|}}. Finally, Decision represents either the target attribute (y) or the next step in the tree, where the possible values of y_i are pos(y_i) = {T_1, ..., T_{|pos(y)|}}.

To clarify the proposed index representation, Figure 8 gives an example. It consists of two parts. The first part, on the left, represents the tree from the root node a5 along one branch, element A, down to the last level of the indexes. The second part, on the right, is the indexes data representation. In the first record, ID = 0 and the parent P = −1, meaning that it is the root node; the decision is a1, meaning that the next step is node = a1 with parent = 0.


ID   Parent   Node   Element   Decision
0    -1       a5     A         a1
1    0        a1     T         a2
2    1        a2     A         3
3    1        a2     T         1
4    1        a2     G         3
5    0        a1     G         1

Fig. 8. Example of Indexes Data Representation
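One possible reading of how a query is resolved against these records is sketched below, using the six rows of Figure 8 held in memory. Several points are assumptions made for illustration and are not stated in the paper: attribute a_k is taken to be the k-th character of the query, a Decision that starts with a letter is taken to name the next node while any other Decision is taken to be a target label, and the names `IndexRow`, `figure8Index`, `attributeValue` and `lookup` are hypothetical. In the proposed method the records live in a DBMS table and are fetched with SQL rather than scanned in memory.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// One record of the index table (columns as in Fig. 7).
struct IndexRow {
    int id;
    int parent;            // -1 marks the root level
    std::string node;      // attribute name, e.g. "a5"
    char element;          // attribute value on this branch
    std::string decision;  // next attribute name, or a target label
};

// The six records shown in Fig. 8.
inline std::vector<IndexRow> figure8Index() {
    return {
        {0, -1, "a5", 'A', "a1"},
        {1, 0, "a1", 'T', "a2"},
        {2, 1, "a2", 'A', "3"},
        {3, 1, "a2", 'T', "1"},
        {4, 1, "a2", 'G', "3"},
        {5, 0, "a1", 'G', "1"},
    };
}

// Assumption: attribute "a<k>" is the k-th character (1-based) of the query.
// Returns '\0' when the query is too short for that attribute.
inline char attributeValue(const std::string& query, const std::string& node) {
    if (node.size() < 2) return '\0';
    const std::size_t pos = std::stoul(node.substr(1));
    if (pos == 0 || pos > query.size()) return '\0';
    return query[pos - 1];
}

// Follows the records from the root and returns the target label,
// or an empty string when no record matches the query.
inline std::string lookup(const std::vector<IndexRow>& index, const std::string& query) {
    int parent = -1;
    std::string node;
    for (const IndexRow& r : index) {
        if (r.parent == -1) { node = r.node; break; }
    }
    if (node.empty()) return "";  // empty index

    for (;;) {
        const char value = attributeValue(query, node);
        const IndexRow* match = nullptr;
        for (const IndexRow& r : index) {
            if (r.parent == parent && r.node == node && r.element == value) {
                match = &r;
                break;
            }
        }
        if (match == nullptr || match->decision.empty()) return "";
        // Assumption: a decision starting with a letter names the next node.
        if (std::isalpha(static_cast<unsigned char>(match->decision[0]))) {
            parent = match->id;
            node = match->decision;
        } else {
            return match->decision;
        }
    }
}
```

With these records, only positions 5, 1 and 2 of a five-character query are ever inspected; positions 3 and 4 are never read.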

2.5. Search DNA-Protein Queries in the Decision Tree Indices

The fifth phase involves the searching process, which uses the indexes stored in the previous phase. The searching process can be summarized in the following steps:

Step 1. Enter the query (Q) or queries.
Step 2. Tokenize the query, called query chopping.
Step 3. Search the query in the decision tree indices, using an SQL command.
Step 4. Save the result or display it.

The review of previous methods showed that some of the popular indexing methods, such as CAFE [Williams and Zobel 2002] and Top Down Disk-based (TDD) suffix trees [Tian et al. 2005; Tata et al. 2004], do not support searching for multiple queries at once, which is called query multiplexing or packing. To address this problem, the proposed hybrid method is equipped with query packing, to handle multiple queries and reduce the overhead of reading the queries repeatedly.

Additionally, to enhance the searching speed, the query chopping technique was adapted and merged with the proposed method. In its usual form, query chopping divides each individual query sequence into several segments, searches them independently and then merges the results back together. As adapted for the proposed method, query chopping divides each individual query sequence into several segments, uses only particular segments, searched in dependence on one another, and then shows the result. Figure 9 presents a query sequence which is divided into five segments; the only segments needed to process the query are a5, a1 and a2, whereas segments a3 and a4 are not necessary. For example, if a query has a length of 80 nucleotides and only 50 of them are used in the search process, the search is faster than a search over all 80 nucleotides.

Before moving on to the next section, the adapted decision tree (DTIM) is compared with a normal decision tree. The DTIM has many improvements. The first is the pre-processing phase, which is done in three steps; its important step is to retrieve and clean the data, a step which does not exist in normal decision tree algorithms. Moreover, the normal decision tree stores its data representation in a text file, which causes memory problems and difficulties in processing and building the tree, especially for large data, while the DTIM data representation is stored in a DBMS, which is faster and better suited to large data. Furthermore, the normal decision tree cannot deal with DNA-protein data directly, because DNA-protein data resembles English text and does not have a target value, while the DTIM can deal with DNA-protein data thanks to the pre-processing phase. The DTIM is connected with a DBMS instead of the file system used by the normal decision tree. In addition, the results produced by the normal decision tree are rules, whereas the results of the adapted decision tree are records in a table; the advantages and disadvantages are discussed in subsection 2.4.2. Finally, the normal decision tree has no concept of query multiplexing and packing, while both are available in DTIM.

Fig. 9. Query Chopping Technique

3. BENCHMARK

To evaluate the DTIM method, different tests were performed. First, the DTIM method was compared with the Top Down Disk-Based suffix tree [Tata et al. 2004; Tian et al. 2005] to test the indexing size. Secondly, the DTIM method was compared with BLAST+ version 2.2.22, downloaded from [NCB 2010], to test the query processing time. Finally, the performance of the DTIM was measured by accuracy, recall, precision and the elapsed time for building the indexes.

4. TEST DATA SETS

In the experiments, ten different DNA data sets were used, extracted from human chromosomes 18, 19 and 21, with sizes varying from 10 thousand to 5 million bases; they were downloaded from GenBank. Furthermore, ten different protein data sets were used, with sizes varying from 10 thousand to 5 million bases, extracted from the Swiss-Prot database. As query sequences, some subsequences were randomly extracted from the data sets.

5. EXPERIMENTS AND RESULTS

This section describes the experiments performed on the decision tree indexing model (DTIM) and their results. The first experiment compared the indexing size of DTIM with that of TDD (Top Down Disk-Based suffix tree), downloaded from http://www.eecs.umich.edu/tdd. The experiment was performed with two different data types, DNA and protein. Both types are used because DNA is written over an alphabet of 4 nucleotide characters, whereas protein data is written over 20 amino acids, which is more complicated. The sizes of the ten data sets range from 10 thousand to 5 million bases.

The second experiment compared DTIM with BLAST+ in terms of query processing time. The number of queries was similar to the number of sequences deposited in the databases, to show the effect on the performance of the proposed method.


The third experiment examined the time needed to build an index on one processor. The final experiment quantified the relative performance, or retrieval effectiveness, of the proposed method, measured by accuracy, precision and recall.

5.1. Indexing Size Experiments

Figure 10 shows the DTIM index sizes for the nucleotide data sets, while Figure 11 shows them for the amino acid data sets, with changing data sizes. Furthermore, the current research approach was compared with the Top Down Disk-Based suffix tree approach in respect of the index size for both DNA and protein. Top Down Disk-Based is a previous approach based on the suffix tree [Tian et al. 2005; Tata et al. 2004].

The experimental results show that the index size increases linearly in proportion to the data size in both approaches. However, in comparison with TDD, the current approach saves an average of about 97.9% of storage space for DNA and 98.2% for protein data. Compared with the original DNA-protein data sets, DTIM also reduces the size of the data significantly. For example, for data set number 10, which contains 5 million nucleotides, the original data is about 5M while the index is about 550K, a reduction of roughly 89%. This supports the claim that the proposed DTIM is appropriate for large data sets.

Data set size   TDD (KB)   DTIM (KB)
100 K           1210       39.6
200 K           2411.8     59.7
300 K           3606.4     82.4
400 K           4839       145.9
500 K           6052       181.6
1 M             12501      222
2 M             25004      387.9
3 M             37580      443
4 M             50115      488.5
5 M             62761      533.8

Fig. 10. Index Sizes for DNA Data Sets (DNA indexing size: storage size in KB against data set size)

Data set size   TDD (KB)   DTIM (KB)
100 K           1246       48.4
200 K           2501       55.1
300 K           3656       60.4
400 K           4891       65.1
500 K           5935       73.3
1 M             11742      222.8
2 M             24169      403.3
3 M             36283      548.4
4 M             48573      661
5 M             60495      746.2

Fig. 11. Index Sizes for Protein Data Sets (protein indexing size: storage size in KB against data set size)
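The average storage saving relative to TDD can be rechecked directly from the values reported in Figures 10 and 11. The sketch below assumes that the average is the mean of the per-data-set savings, 100 × (1 − DTIM/TDD); the paper does not state how its averages were formed, and the function and constant names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mean percentage of storage saved by `ours` relative to `baseline`,
// averaged over the data sets. Returns 0 for empty or mismatched inputs.
inline double meanSavingPercent(const std::vector<double>& baseline,
                                const std::vector<double>& ours) {
    if (baseline.empty() || baseline.size() != ours.size()) return 0.0;
    double sum = 0.0;
    for (std::size_t i = 0; i < baseline.size(); ++i) {
        sum += 100.0 * (1.0 - ours[i] / baseline[i]);
    }
    return sum / static_cast<double>(baseline.size());
}

// Index sizes in KB as reported in Fig. 10 (DNA).
const std::vector<double> kTddDna = {1210, 2411.8, 3606.4, 4839, 6052,
                                     12501, 25004, 37580, 50115, 62761};
const std::vector<double> kDtimDna = {39.6, 59.7, 82.4, 145.9, 181.6,
                                      222, 387.9, 443, 488.5, 533.8};

// Index sizes in KB as reported in Fig. 11 (protein).
const std::vector<double> kTddProtein = {1246, 2501, 3656, 4891, 5935,
                                         11742, 24169, 36283, 48573, 60495};
const std::vector<double> kDtimProtein = {48.4, 55.1, 60.4, 65.1, 73.3,
                                          222.8, 403.3, 548.4, 661, 746.2};
```

Calling meanSavingPercent(kTddDna, kDtimDna) and meanSavingPercent(kTddProtein, kDtimProtein) gives averages close to 98% for both data types under this definition.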


Data set size   BLAST+ (sec)   DTIM (sec)
100 K           0.93           1.095
200 K           1.978          2.585
300 K           3.266          4.428
400 K           4.919          8.765
500 K           6.878          12.424
1 M             31.287         28.569
2 M             137.742        88.272
3 M             320.855        147.105
4 M             635.264        211.139
5 M             733.334        292.64

Fig. 12. DNA Query Processing Time (time in seconds against data set size)

5.2. Query Processing Time

In this experiment, DTIM was compared with BLAST+ version 2.2.22 in respect of the elapsed time for query search. The elapsed time is the time spent on finding all the queries which are submitted to DTIM or BLAST+. The effect of varying data set sizes, with a changing number of queries, was also examined for both methods.

Figures 12 and 13 show the DNA and protein query processing times, respectively, of DTIM and BLAST+ for exact match queries. The X axis denotes the data set size, while the Y axis denotes the elapsed time in seconds. The query processing time increases with the data size in both approaches, as shown in Figures 12 and 13. For DNA (Figure 12), BLAST+ is about 1.48 times faster than DTIM when the data set size is 500K or less; such data sets are relatively small, so BLAST+ completes the search quickly. When the data set size exceeds 500K, however, DTIM achieves about 2.35 times speedup for DNA and 29 times speedup for protein compared with BLAST+ in terms of query processing time, as shown in Figures 12 and 13. There are two reasons for this. The first is the BLAST+ methodology, which examines every possible alternative to find a particular solution; in other words, BLAST+ needs to compare a query with each of the sequences in the database, using sequence alignment techniques. The second is the combination of query packing and query chopping with DTIM, which speeds up query retrieval. For example, DTIM packed 1412 queries together when it searched the 100K data set and 72753 queries together when it searched the 5M data set.

5.3. Building Index Time

In this experiment, the time taken to build the index on one processor was measured, with changing data set sizes, for the two groups of data, DNA and protein, as shown in Table I. The aim of this experiment was to examine the time cost of building the index for the current approach, DTIM.

Data set size   BLAST+ (sec)   DTIM (sec)
100 K           65.318         0.4747
200 K           123.077        1.138
300 K           183.973        1.6155
400 K           224.997        2.2659
500 K           245.02         2.867
1 M             430.943        12.342
2 M             940.003        32.227
3 M             1487.88        57.899
4 M             2147.42        82.921
5 M             2910.98        107.823

Fig. 13. Protein Query Processing Time (time in seconds against data set size)

The results show that the time required to build the index increases steeply, faster than linearly, with the data size. Building the index for 100K of DNA data takes about 72 seconds, while 100K of protein data takes about 105 seconds. For 500K, the DNA data takes about 2531 seconds, while the protein data takes about 1118 seconds, as shown in Table I. This indicates that large data sets take a long time to index and that building the DTIM index on a single processor is not practical for large data sets. To minimize the time required to build the index, a parallel version of DTIM will be proposed as future work.

Table I. Building Index Time using One Processor

Data      Size    Time (sec)
DNA       100k    72.2857
DNA       200k    205.76
DNA       300k    541.135
DNA       400k    1704.87
DNA       500k    2531.77
Protein   100k    105.103
Protein   200k    314.421
Protein   300k    542.843
Protein   400k    852.372
Protein   500k    1118.05
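The figures in Table I can be used to check how quickly the build time grows relative to the data size. The following is a minimal sketch under that reading; the names `BUILD_TIME` and `growth_factor` are illustrative and do not come from the DTIM implementation.

```python
# Index build times in seconds on one processor, as listed in Table I.
SIZES_K = [100, 200, 300, 400, 500]
BUILD_TIME = {
    "DNA": [72.2857, 205.76, 541.135, 1704.87, 2531.77],
    "Protein": [105.103, 314.421, 542.843, 852.372, 1118.05],
}


def growth_factor(times):
    """Ratio of the largest-set build time to the smallest-set build time."""
    return times[-1] / times[0]


size_factor = SIZES_K[-1] / SIZES_K[0]  # data grows 5x from 100k to 500k
for name, times in BUILD_TIME.items():
    factor = growth_factor(times)
    print(f"{name}: data x{size_factor:.0f}, build time x{factor:.1f}")
```

Both factors exceed the fivefold growth in data size (about 35 for DNA and about 10.6 for protein), so the build time grows faster than linearly in both groups, and far more steeply for DNA.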

5.4. Evaluating DTIM

In this experiment, the retrieval effectiveness of DTIM was examined using the measures of accuracy, precision and recall. Before doing this, the holdout method was applied, as shown in Figure 14. In this method, the data set is randomly partitioned into two independent sets: a training set and a test set [Han and Kamber 2006]. Typically, two-thirds of the data are allocated to the training set and the remaining one-third to the test set. The training set is used to produce the index, whose accuracy, precision and recall are then estimated on the test set using a confusion matrix. The confusion matrix is a useful tool for analyzing DTIM, since the accuracy, recall and precision can all be calculated from it.
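The holdout procedure and the confusion-matrix measures described above can be sketched as follows. This is a generic illustration, not the DTIM code: the function names, the fixed random seed and the "hit"/"miss" labels are assumptions made for the example, and the source does not state how DTIM defines its positive class.

```python
import random


def holdout_split(records, train_fraction=2 / 3, seed=0):
    """Randomly partition records into independent training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def confusion_counts(actual, predicted, positive):
    """Count TP, FP, FN, TN for one class treated as the positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Two-thirds of 30 hypothetical records go to training, one-third to testing.
train, test = holdout_split(range(30))

# Hypothetical test-set labels: "hit" means the query sequence was retrieved.
actual = ["hit", "hit", "miss", "miss", "hit", "miss"]
predicted = ["hit", "miss", "miss", "hit", "hit", "miss"]
print(metrics(*confusion_counts(actual, predicted, positive="hit")))
```

Precision falls as false positives rise and recall falls as false negatives rise, which is how the two measures are interpreted in the discussion of Figures 15 to 18.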

From the experimental results, it is observed that the accuracy of DTIM for DNA is relatively low, but according to [Doolittle 1986] it is acceptable, because when two amino acid or nucleotide sequences are more than 30% identical, the sequences are most likely homologous. The low accuracy has several causes; the most important is the sampling method (holdout), which selects the training set randomly and leads to a pessimistic estimate because only a portion of the initial data is used to derive the model.


Fig. 14. Estimating Accuracy, Precision and Recall with the Holdout Method

Probably, this portion was not sufficient to learn a model that achieves good accuracy. On the other hand, DTIM achieves better accuracy on the protein data. The average overall accuracy over the ten data sets is about 90.48% for protein, while for DNA it is about 55.89%. This also supports the view that the proposed DTIM is appropriate for large data sets in terms of accuracy.
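The 30% identity rule of thumb cited from [Doolittle 1986] in the accuracy discussion can be made concrete with a percent-identity calculation. The sketch below assumes two already-aligned, equal-length sequences with '-' marking gaps, and counts identity over non-gap positions only (other denominators are also in use); the function names and example sequences are illustrative assumptions and are not part of DTIM.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percentage of aligned, non-gap positions where both sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip positions where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def likely_homologous(seq_a, seq_b, threshold=30.0):
    """Apply the 30% identity rule of thumb to an aligned pair."""
    return percent_identity(seq_a, seq_b) > threshold


# Hypothetical aligned DNA fragments that differ at two of ten positions.
print(percent_identity("ACGTACGTAC", "ACGTTCGAAC"))
print(likely_homologous("ACGTACGTAC", "ACGTTCGAAC"))
```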

Figures 15 and 16 compare the average precision on the training set with the average precision on the testing set for DNA and protein, respectively, whereas Figures 17 and 18 compare the average recall on the training set with the average recall on the testing set for DNA and protein, respectively. The precision for the training sets was mostly 100% for both DNA and protein, which implies that DTIM produced no false positives on them. The DNA testing sets produce the most false positives, while the protein testing sets produce very few. However, the DNA recall in Figure 17 is better than the DNA testing precision in Figure 15, implying that DTIM for DNA produces fewer false negatives. The recall values for the protein testing sets are above 79%, indicating that few false negatives are produced.

Overall, DTIM for protein performs better than DTIM for DNA. This also supports the view that DTIM is appropriate for complicated data sets such as protein (20 amino acids) in terms of precision and recall.

Data size   Testing precision (%)   Training precision (%)
100K        34.96                   100
200K        35.39                   100
300K        33.14                   100
400K        32.68                   100
500K        32.76                   100
1M          25.39                   100
2M          26.56                   100
3M          25.94                   100
4M          25.89                   100
5M          26.20                   100

Fig. 15. Average Precision for DNA

Data size   Testing precision (%)   Training precision (%)
100K        89.78                   100
200K        94.21                   100
300K        93.40                   100
400K        96.81                   100
500K        96.43                   100
1M          97.01                   100
2M          94.77                   100
3M          77.85                   100
4M          95.44                   100
5M          96.05                   100

Fig. 16. Average Precision for Protein

Data size   Training recall (%)   Testing recall (%)
100K        100                   32.56
200K        100                   34.14
300K        100                   32.22
400K        100                   32.34
500K        100                   32.06
1M          100                   22.93
2M          100                   23.50
3M          100                   23.20
4M          100                   22.85
5M          99.98                 23.12

Fig. 17. Average Recall for DNA

Data size   Training recall (%)   Testing recall (%)
100K        100                   79.77
200K        100                   87.60
300K        100                   86.73
400K        100                   89.68
500K        100                   88.59
1M          97.94                 87.36
2M          97.23                 88.48
3M          97.06                 79.86
4M          96.93                 89.17
5M          96.85                 89.96

Fig. 18. Average Recall for Protein

6. CONCLUSION

Database indexing for large biological data has become an important task in bioinformatics. The main objective of indexing is to reduce the computation time of the search algorithm, as well as the space required to do the computation and store the indexes. To achieve this, the decision tree was adapted for indexing large DNA-protein data from many perspectives, such as data representation, connecting DTIM with a DBMS to store the indexes produced by the model, loading data from the DBMS to perform induction for DTIM, searching in DTIM, workflow, benchmarks, data sets, system requirements and so on. After this, the methodology was validated by extensive experiments using 10 data sets of varying sizes for DNA and protein.

From the experiments, it was observed that the proposed DTIM reduces the indexing space by an average of 97.9% for DNA and 98% for protein compared with TDD, and processes queries about 2.35 times faster for DNA and 29 times faster for protein compared with the BLAST+ algorithm. Furthermore, the experiments examined the time cost of building the index for DTIM and showed that the required time grew faster than linearly with the data size; therefore, to reduce the index-building time, a parallel version of DTIM will be proposed in future work. Finally, it can be concluded that DTIM achieved better accuracy, precision and recall for the protein data sets than for the DNA data sets in the ten data sets tested. However, the accuracy obtained on the DNA data was low and needs to be improved in future work.

REFERENCES

2010. NCBI website, http://blast.ncbi.nlm.nih.gov.

Alsabti, K., Ranka, S., and Singh, V. 1998. Clouds: A decision tree classifier for large datasets. In Conference on Knowledge Discovery and Data Mining KDD. 2.

Ben-Haim, Y. and Tom-Tov, E. 2010. A streaming parallel decision tree algorithm. Journal of Machine Learning Research 11, 849–872.

Chan, P. K. and Stolfo, S. J. 1997. On the accuracy of meta-learning for scalable data mining. Journal of Intelligent Information Systems 8, 1, 5–28.

Doolittle, R. F. 1986. Of Urfs and Orfs: A Primer on how to Analyze Derived Amino Acid Sequences. University Science Books.

Farach, M., Ferragina, P., and Muthukrishnan, S. 1998. Overcoming the memory bottleneck in suffix tree construction. In FOCS ’98: Proceedings of the 39th Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Washington, DC, USA, 174.

Gao, F. and Zaki, M. J. 2008. Psist: A scalable approach to indexing protein structures using suffix trees. Journal of Parallel and Distributed Computing 68, 1, 54–63.

Han, J. and Kamber, M. 2006. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc.

Irniger, C. and Bunke, H. 2004. Graph database filtering using decision trees. In ICPR ’04: Proceedings of the Pattern Recognition, 17th International Conference on (ICPR’04) Volume 3. IEEE Computer Society, Washington, DC, USA, 383–388.

Irniger, C. and Bunke, H. 2008. Decision trees for filtering large databases of graphs. Int. J. Intell. Syst. Technol. Appl. 3, 3/4, 166–187.

Jaber, K., Abdullah, R., and Abdul Rashid, N. 2010a. A framework for decision tree-based method to index data from large protein sequence databases. In 2010 IEEE EMBS Conference on Biomedical Engineering & Sciences (IECBES 2010). Kuala Lumpur, Malaysia.

Jaber, K., Abdullah, R., and Abdul Rashid, N. 2010b. Indexing protein sequence/structure databases using decision tree: A preliminary study. In 4th International Symposium on Information Technology 2010 (ITSim 2010). Vol. 2. Kuala Lumpur Convention Center, Kuala Lumpur, Malaysia, 844–849.

Jiang, X., Zhang, P., Liu, X., and Yau, S. S.-T. 2007. Survey on index based homology search algorithms. J. Supercomput. 40, 2, 185–212.

Jin, R. and Agrawal, G. 2003. Communication and memory efficient parallel decision tree construction. In Proceedings of Third SIAM Conference on Data Mining.

Jin, R., Yang, G., and Agrawal, G. 2005. Shared memory parallelization of data mining algorithms: Techniques, programming interface, and performance. IEEE Trans. on Knowl. and Data Eng. 17, 1, 71–89.


Joshi, M. V., Karypis, G., and Kumar, V. 1998. Scalparc: A new scalable and efficient parallel classification algorithm for mining large datasets. In IPPS ’98: Proceedings of the 12th. International Parallel Processing Symposium. IEEE Computer Society, Washington, DC, USA, 573.

Mehta, M., Agrawal, R., and Rissanen, J. 1996. Sliq: A fast scalable classifier for data mining. In EDBT ’96: Proceedings of the 5th International Conference on Extending Database Technology. Springer-Verlag, London, UK, 18–32.

Ordonez, C. and Cereghini, P. 2000. Sqlem: fast clustering in sql using the em algorithm. In SIGMOD ’00: Proceedings of the 2000 ACM SIGMOD international conference on Management of data. ACM, New York, NY, USA, 559–570.

Quellec, G., Lamard, M., Bekri, L., Cazuguel, G., Cochener, B., and Roux, C. 2007. Multimedia medical case retrieval using decision trees. In Engineering in Medicine and Biology Society, 2007. EMBS 2007. 29th Annual International Conference of the IEEE. 4536–4539.

Rokach, L. and Maimon, O. 2005. Decision trees. In Data Mining and Knowledge Discovery Handbook. 165–192.

Sarawagi, S., Thomas, S., and Agrawal, R. 1998. Integrating association rule mining with relational database systems: alternatives and implications. SIGMOD Rec. 27, 2, 343–354.

Sarawagi, S., Thomas, S., and Agrawal, R. 2000. Integrating association rule mining with relational database systems: Alternatives and implications. Data Mining and Knowledge Discovery 4, 2, 89–125.

Sattler, K.-U. and Dunemann, O. 2001. Sql database primitives for decision tree classifiers. In CIKM ’01: Proceedings of the tenth international conference on Information and knowledge management. ACM, New York, NY, USA, 379–386.

Shafer, J. C., Agrawal, R., and Mehta, M. 1996. Sprint: A scalable parallel classifier for data mining. In VLDB ’96: Proceedings of the 22th International Conference on Very Large Data Bases. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 544–555.

Shearer, K., Bunke, H., and Venkatesh, S. 2001. Video indexing and similarity retrieval by largest common subgraph detection using decision trees. Pattern Recognition 34, 5, 1075–1091.

Slezak, D. and Sakai, H. 2009. Automatic extraction of decision rules from non-deterministic data systems: Theoretical foundations and sql-based implementation. In Database Theory and Application. Springer Berlin Heidelberg, 151–162.

Sreenivas, M. K., Alsabti, K., and Ranka, S. 1999. Parallel out-of-core divide-and-conquer techniques with application to classification trees. In IPPS ’99/SPDP ’99: Proceedings of the 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing. IEEE Computer Society, Washington, DC, USA, 555–562.

Srivastava, A., Han, E.-H., Kumar, V., and Singh, V. 1999. Parallel formulations of decision-tree classification algorithms. Data Min. Knowl. Discov. 3, 3, 237–261.

Sug, H. 2005. A comprehensively sized decision tree generation method for interactive data mining of very large databases. In Advanced Data Mining and Applications. 141–148.

Tata, S., Hankins, R. A., and Patel, J. M. 2004. Practical suffix tree construction. In VLDB ’04: Proceedings of the Thirtieth international conference on Very large data bases. VLDB Endowment, 36–47.

Tian, Y., Tata, S., Hankins, R. A., and Patel, J. M. 2005. Practical methods for constructing suffix trees. The VLDB Journal 14, 3, 281–299.

Westhead, D., Parish, H., and Twyman, R. 2002. Instant Notes in Bioinformatics. Topeka Bindery.

Williams, H. E. 2003. Genomic information retrieval. In ADC ’03: Proceedings of the 14th Australasian database conference. Australian Computer Society, Inc., Darlinghurst, Australia, 27–35.

Williams, H. E. and Zobel, J. 2002. Indexing and retrieval for genomic databases. Knowledge and Data Engineering, IEEE Transactions on 14, 1, 63–78.

Zaki, M., Ho, C.-t., and Agrawal, R. 1999. Parallel classification for data mining on shared-memory multiprocessors. In ICDE ’99: Proceedings of the 15th International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 198.

Zhou, C., Xiao, W., Tirpak, T., and Nelson, P. 2003a. Evolving accurate and compact classification rules with gene expression programming. Evolutionary Computation, IEEE Transactions on 7, 6, 519–531.

Zhou, C., Xiao, W., Tirpak, T. M., and Nelson, P. C. 2003b. Evolving classification rules with gene expression programming. IEEE Transactions on Evolutionary Computation 7.
