Faculty of Cognitive Sciences and Human Development
PHYLOGENETIC TREE CLASSIFICATION SYSTEM USING MACHINE LEARNING ALGORITHM
Tan Jia Kae
Bachelor of Science with Honours (Cognitive Science)
2015
UNIVERSITI MALAYSIA SARAWAK
Grade: _____
Please tick one:
Final Year Project Report [x]   Masters [ ]   PhD [ ]
DECLARATION OF ORIGINAL WORK
This declaration is made on the 5th day of June 2015.
Student's Declaration: I, TAN JIA KAE (39023), FACULTY OF COGNITIVE SCIENCES AND HUMAN DEVELOPMENT, hereby declare that the work entitled PHYLOGENETIC TREE CLASSIFICATION SYSTEM USING MACHINE LEARNING ALGORITHM is my original work. I have not copied from any other student's work or from any other sources, except where due reference or acknowledgement is made explicitly in the text, nor has any part of the work been written for me by another person.
5 JUNE 2015
TAN JIA KAE (39023)
Supervisor's Declaration: I, DR LEE NUNG KION, hereby certify that the work entitled PHYLOGENETIC TREE CLASSIFICATION SYSTEM USING MACHINE LEARNING ALGORITHM was prepared by the above-mentioned student and was submitted to the FACULTY as a partial/full fulfilment for the conferment of BACHELOR OF SCIENCE WITH HONOURS (COGNITIVE SCIENCE), and that the aforementioned work, to the best of my knowledge, is the said student's work.
Received for examination by: ________   Date: 5 JUNE 2015
I declare this Project/Thesis is classified as (please tick):
[ ] CONFIDENTIAL (Contains confidential information under the Official Secrets Act 1972)
[ ] RESTRICTED (Contains restricted information as specified by the organisation where research was done)
[x] OPEN ACCESS
I declare this Project/Thesis is to be submitted to the Centre for Academic Information Services (CAIS) and uploaded into the UNIMAS Institutional Repository (UNIMAS IR) (please tick):
[x] YES
[ ] NO
Validation of Project/Thesis
I hereby duly affirm, with free consent and willingness, that this Project/Thesis shall be placed officially in the Centre for Academic Information Services with the abiding interest and rights as follows:
• This Project/Thesis is the sole legal property of Universiti Malaysia Sarawak (UNIMAS).
• The Centre for Academic Information Services has the lawful right to make copies of the Project/Thesis for academic and research purposes only and not for other purposes.
• The Centre for Academic Information Services has the lawful right to digitize the content to be uploaded into the Local Content Database.
• The Centre for Academic Information Services has the lawful right to make copies of the Project/Thesis, if required, for use by other parties for academic purposes or by other Higher Learning Institutes.
• No dispute or claim shall arise from the student himself/herself or from a third party on this Project/Thesis once it becomes the sole property of UNIMAS.
• This Project/Thesis, or any material, data, and information related to it, shall not be distributed, published, or disclosed to any party by the student himself/herself without first obtaining approval from UNIMAS.
Student's signature: ________   Date: ________
Supervisor's signature: ________   Date: 5 JUNE 2015
Current Address: Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak
Notes: If the Project/Thesis is CONFIDENTIAL or RESTRICTED, please attach as an annexure a letter from the organisation with the date of restriction indicated and the reasons for the confidentiality and restriction.
PHYLOGENETIC TREE CLASSIFICATION SYSTEM BY USING MACHINE LEARNING ALGORITHM
TAN JIA KAE
This project is submitted in partial fulfilment of the requirements for a
Bachelor of Science with Honours (Cognitive Science)
Faculty of Cognitive Sciences and Human Development
UNIVERSITI MALAYSIA SARAWAK
(2015)
The project entitled "Phylogenetic tree classification system by using machine learning algorithm" was prepared by Tan Jia Kae and submitted to the Faculty of Cognitive Sciences and Human Development in partial fulfilment of the requirements for a Bachelor of Science with Honours (Cognitive Science).
Received for examination by:
(Dr Lee Nung Kion)
Date: 5 June 2015
Grade
ACKNOWLEDGEMENTS
First and foremost, I would like to take this opportunity to express my deepest appreciation to my supervisor, Dr Lee Nung Kion, for generously and patiently spending his precious time giving me many remarks and sharing his superior knowledge, experience, and expertise while I completed my Final Year Project. Without his guidance, my project could not have been completed successfully in the limited time available.
Next, I am deeply indebted to my family for their unceasing encouragement, support, and attention throughout my Final Year Project, especially during the periods when I really needed their love to help me finish my Final Year Project thesis.
In addition, I would like to thank all my friends and course mates who supported and encouraged me in completing this project. While completing it, I faced difficulties that tempted me to give up. Luckily, they gave me plenty of advice and support, which gave me the strength and confidence to finish my Final Year Project thesis.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
ABSTRAK
CHAPTER ONE: INTRODUCTION
CHAPTER TWO: LITERATURE REVIEW
CHAPTER THREE: METHODOLOGY
CHAPTER FOUR: RESULT AND DISCUSSION
CHAPTER FIVE: CONCLUSION AND RECOMMENDATION
REFERENCE
APPENDIX A: PHYLOGENETIC TREE CLASSIFICATION SYSTEM MATLAB CODING
LIST OF TABLES
Table 1. Phylogenetic tree classification: cross-validation results based on different features
Table 2. 10-fold cross-validation results with 540 training data and 60 testing data in each fold
LIST OF FIGURES
Figure 1. The first evolution tree diagram sketched by Darwin
Figure 2. Cumulative number of publications in the Science Citation Index since 1981 that cite the terms "molecular" and "phylogeny" in the keywords or abstract
Figure 3. Non-phylogenetic tree: family tree
Figure 4. Phylogenetic rooted tree: rectangular cladogram
Figure 5. Phylogenetic rooted tree: slanted diagram
Figure 6. Phylogenetic unrooted tree: circular cladogram
Figure 7. Phylogenetic scaled tree
Figure 8. Phylogenetic unscaled tree
Figure 9. A quick review of phylogenetic tree
Figure 10. Object detection in computer perception
Figure 11. Feature representation
Figure 12. SIFT
Figure 13. RIFT
Figure 14. Spin image
Figure 15. Pre-processing of model objects
Figure 16. Recognition of object in the scene
Figure 17. TreeRipper
Figure 18. TreeSnatcher Plus
Figure 19. Windows Snipping Toolbox
Figure 20. Original 1.png
Figure 21. After thresholding
Figure 22. Grayscale image
Figure 23. SURF feature detection and extraction
Figure 24. GIST feature detection and extraction
Figure 25. 10-fold cross-validation accuracy
Figure 26. Example of Graphic User Interface for the Phylogenetic Tree Image Classification system
Figure 27. Graphic User Interface for the Phylogenetic Tree Image Classification system
ABSTRACT
A study was conducted to develop an automated phylogenetic tree image classification system using a machine learning algorithm. This study adopted a supervised machine learning algorithm, the Support Vector Machine (SVM), for classification. Image data were collected from the online databases PUBMED and ScienceDirect and from Bioinformatics journals. Performance comparisons of three types of features that characterize phylogenetic tree images are presented in this project. The aim is to determine suitable features for the phylogenetic tree image classification system. Leave-one-out cross-validation was used to calculate the accuracy of each feature. In addition, 10-fold cross-validation was conducted in the evaluation. Our results show that the suitable combination of features for the phylogenetic tree image classification system is SIFT, SURF, and GIST. The accuracy obtained from the combination of these three features is just over 82%. In comparison, the average accuracy obtained from the 10-fold cross-validation is 81.50%. Our evaluation results demonstrate the utility of SIFT, SURF, and GIST features for building a phylogenetic tree image classification system.
Keywords: phylogenetic tree, image classification system, image processing, feature extraction, SIFT, GIST, SURF
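The two evaluation schemes named in the abstract can be illustrated with a short sketch. The Python code below is an illustration only and is not the MATLAB code of this project. It assumes a data set of 600 images, a figure inferred from the 540 training and 60 testing images per fold listed for Table 2; the function names and the fixed random seed are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_samples, k, seed=0):
    """Yield (train, test) index lists, using each fold once as the test set."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def mean_accuracy(fold_accuracies):
    """Average the per-fold accuracies into a single cross-validation score."""
    return sum(fold_accuracies) / len(fold_accuracies)


# Assumed data set size: 600 images, so 10 folds give 540 training
# and 60 testing images in every fold.
splits = list(train_test_splits(600, 10))
```

Calling train_test_splits(n, n) gives leave-one-out cross-validation, in which each image is tested once against a model trained on the remaining n - 1 images.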
ABSTRAK
A study was carried out to produce an automated classification system for phylogenetic tree images using a machine learning algorithm. The study used a supervised machine learning algorithm, the Support Vector Machine (SVM). Image data were collected from the online databases PUBMED, ScienceDirect, and Bioinformatics. A comparison of the performance of three different features of phylogenetic trees is also presented in this project. The aim is to determine the features that are suitable for the phylogenetic tree image classification system. Leave-one-out cross-validation was used to calculate the accuracy of each feature. In addition, 10-fold cross-validation is measured in this study. The results show that the most suitable combined features for the phylogenetic tree image classification system are SIFT, SURF, and GIST. The accuracy obtained by combining the three features can exceed 82.19%. The results also show that the average accuracy obtained from the 10-fold cross-validation is 81.50%. The results of this study show that the combination of SIFT, SURF, and GIST features can be used to implement this phylogenetic tree classification system.
Keywords: image classification system, image processing, feature extraction, SIFT, GIST, SURF
CHAPTER ONE
INTRODUCTION
Overview
It is an undeniable fact that phylogenetic trees are widely used for the evolutionary analysis of different species, organisms, or genes from a common ancestor (Laubach, von Haeseler, & Lercher, 2012). According to Brinkman (2005), evolutionary analysis is a collection of methods for determining long-term phenotypic evolution that was developed during the 1990s. Evolutionary analysis also rests on evolution theory, which is the foundation of most bioinformatic analysis. This is because evolutionary analysis shows the ecological characterization of species, using the concept of frequency dependence from gene theory (Brinkman, 2005). This chapter discusses the background of the study, the problem statement, the research objectives, the research questions, the hypothesis and conceptual framework of the study, and the significance of the study. In addition, this chapter defines the relevant terms.
Introduction
The evolutionary tree, or phylogenetic tree, is a visualization that shows the relationships between entities according to the similarities and differences in their hereditary or physical characteristics (Baum, 2008). Therefore, the way in which a phylogenetic tree shows the relationships among species is also important. This is reflected in the way phylogenetic trees are used to demonstrate the evolutionary analysis of any species in the world. Evolutionary analysis generally includes the identification of analogous sequences, diverse calibration, phylogenetic rebuilding, and the graphic representation (or figure signification) of the inferred tree (Dereeper et al., 2008). These four terms can be explained through biological evolution. According to Dereeper et al. (2008), the identification of analogous sequences finds similar sequences, whereas diverse calibration determines the differences in their alignment. Phylogenetic rebuilding is the process of building the phylogenetic tree after the analogous sequence and diverse calibration steps, and the graphic representation shows the relationship between the species in the phylogenetic tree (Dereeper et al., 2008). This shows the increasing use of phylogenetic trees in the biological sciences, especially by biologists who carry out evolutionary analysis of species. Therefore, the phylogenetic tree is important for the evolutionary analysis of life on Earth.
Apart from that, phylogeny is the evolutionary history of a species or group of related species (Pagel, 1999). Phylogeny is central to systematics, the discipline that classifies organisms (Siegel-Causey, Brooks, & Funk, 1991), because systematists use it to determine the evolutionary relationships of organisms. According to Campbell and Reece (2008), the term systematist refers to a professional who uses fossil, molecular, and genetic data to infer evolutionary relationships. They also discuss PhyloCode, which can be used to depict a phylogenetic analysis as a branching phylogenetic tree. A phylogenetic analysis is presented as a collection of nodes and branches. For instance, taxa that are closely related in an evolutionary sense appear close to each other, whereas taxa that are distantly related lie on different branches of the tree or far from each other in the tree.
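The idea that closely related taxa sit close together in the tree can be made concrete by counting the branches on the path between two taxa. The Python sketch below is illustrative only: the taxa, the internal node names, and the function names are hypothetical and do not come from this study. The tree is stored as a map from each node to its parent.

```python
def path_to_root(parent, node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def branch_distance(parent, a, b):
    """Count the branches on the path between nodes a and b."""
    steps_from_a = {node: steps for steps, node in enumerate(path_to_root(parent, a))}
    for steps_from_b, node in enumerate(path_to_root(parent, b)):
        # The first shared node on the way up is the most recent common ancestor.
        if node in steps_from_a:
            return steps_from_a[node] + steps_from_b
    raise ValueError("the two nodes are not in the same tree")


# Hypothetical rooted tree ((human, chimp), (mouse, rat)), stored child -> parent.
parent = {
    "human": "primates", "chimp": "primates",
    "mouse": "rodents", "rat": "rodents",
    "primates": "root", "rodents": "root",
}

close = branch_distance(parent, "human", "chimp")    # sister taxa
distant = branch_distance(parent, "human", "mouse")  # taxa on different branches
```

In this example the sister taxa are separated by two branches and the distantly related pair by four, which matches the qualitative description above.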
Background of the study
In 1859, Darwin produced the first illustration of a phylogenetic tree (Darwin, 1859). Before that, in 1837, shortly after his famous five-year voyage as a naturalist on the Beagle, he sketched a tree diagram in his notebook (Darwin, 1859). As Figure 1 shows, the simple sketch is remarkably similar to modern diagrams of phylogenies (Darwin, 1987).
Figure 1. The first evolution tree diagram sketched by Darwin. Adapted from Charles Darwin's notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by Darwin, C., 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.). Adapted with permission.
Figure 2. Cumulative number of publications in the Science Citation Index since 1981 that cite the terms "molecular" and "phylogeny" in the keywords or abstract (horizontal axis: year, 1980 to 2000). Adapted from "Inferring the historical patterns of biological evolution," by Pagel, M., 1999, Nature, 401(6756), p. 844. Copyright 1999 by Pagel. Adapted with permission.
The first illustration of a phylogenetic tree is the first scientific argument for the theory of evolution by means of natural selection. Darwin (1998) stated, "The time will come, I believe, though I shall not live to see it, when we shall have fairly true genealogical trees of each great kingdom of Nature" (p. 18). In fact, he indicated that he would have been glad to see how modern genetics supports and confirms his own ideas. He provided evidence not only for what had happened in evolution but also for precisely how living things evolve. The forensic evidence he used for evolution was DNA (Darwin, 1859).
In fact, several approaches were used to study the evolution of species before molecular phylogenetics (Campbell & Reece, 2008). In the 1900s, immunochemical studies were used to detect cross-reactions, which are stronger for closely related organisms. Next, between the 1940s and the 1960s, biologists used protein sequencing, electrophoresis, DNA hybridization, and PCR, which contributed to a boom in molecular phylogeny. Meanwhile, after the publication of The Origin of Species by Darwin, many other biologists came to accept the idea of a universal Tree of Life (Darwin, 1987). Then, in the late 1970s, biologists started to study the evolution of organisms using molecular phylogeny. One expert who supported Darwin's Tree of Life was the German biologist Ernst Haeckel (Larget, 2011). Phylogenetic trees are very useful to biologists because they can use them to describe the relations between living creatures, genomes, and genes.
With the development of phylogenetic data techniques, the number of studies depicting phylogenies has exploded (Pagel, 1999). The number of articles publishing phylogenies based on gene-sequence information has been increasing exponentially. Figure 2 shows the cumulative number of such publications (Pagel, 1999). The phylogenies cover taxonomic groups ranging from viruses to bacteria, fungi, plants, and animals (Campbell & Reece, 2008). Thus, the phylogenetic tree has become popular and important for the evolutionary analysis of organisms. The phylogenetic tree is a branching diagram that shows the evolutionary relationships of organisms (Baum, 2008). According to Darwin (1859), evolution refers to a natural process from which inferences about populations can be drawn. It can be described as the change in the hereditary traits of a biological population over successive generations.
In addition, a phylogeny can show the similarities and differences in physical and hereditary traits. This is because taxa that are joined together in the tree are implied to have descended from a common node (Gregory, 2008). Thus, a phylogenetic tree can be considered similar to a family tree. Moreover, the construction of phylogenetic trees is based on the similarities or differences of physical or genetic features. A few years ago, scientists constructed phylogenetic trees only in the traditional way, which focused on physical features. Fortunately, advances in technology have led to the accumulation of huge amounts of biological data (Wan & Che, 2013). This may change the way biological studies are carried out in various respects.
As mentioned by Wan and Che (2013), phylogenetic trees can be built using information about interacting pathways. They applied hierarchical clustering to two domains of organisms, eukaryotes and prokaryotes. Using interacting pathways can increase the effectiveness of revealing the evolutionary relationships of species (Wan & Che, 2013). A phylogenetic tree is constructed from a variety of evidence, generally by comparing DNA (Kaizhong, Jason T., & Dennis, 1996). It is an undirected, acyclic, connected graph. Basically, the lengths of the branches represent the time since the groups split from each other, and the nodes of the tree are known as ancestors. The exterior nodes are called leaves.
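The graph definition above can be checked mechanically. The following Python sketch is an illustration under stated assumptions, not part of the system described in this thesis: the edge list and function names are hypothetical, and the edge list is assumed to contain no repeated edges. It relies on the standard fact that a connected undirected graph is acyclic exactly when it has one edge fewer than it has nodes.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def is_tree(edges):
    """Return True if the edges form an undirected, acyclic, connected graph."""
    adjacency = build_adjacency(edges)
    if not adjacency or len(edges) != len(adjacency) - 1:
        return False
    # Depth-first search from any node; a tree must reach every node.
    start = next(iter(adjacency))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(adjacency)


def leaves(edges):
    """Return the exterior nodes, that is, the nodes with exactly one branch."""
    adjacency = build_adjacency(edges)
    return sorted(node for node, nbrs in adjacency.items() if len(nbrs) == 1)


# Hypothetical unrooted tree: taxa A-D joined through internal nodes x and y.
edges = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")]
tree_ok = is_tree(edges)
taxa = leaves(edges)
```

Adding one more branch between two existing nodes, for example between A and B, creates a cycle, and is_tree then returns False.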
Apart from constructing phylogenetic trees, a newer approach can extract phylogenetic tree data from the literature. It uses content mining to extract the data from published papers (Mounce, 2012). The term content mining can be explained by its two parts, content and mining. Content can include anything, such as audio, video, metadata, text, and images. Mining refers to the extraction of large amounts of information from the content. Extracting phylogenetic tree data from the literature relies on content mining rather than text mining because the content is more than just text (Mounce, 2012).
In short, phylogenetic trees provide a framework that shows the evolution of features (Baum, 2008). This shows that related species share many similar features. Phylogenetic trees are also used in bio-prospecting, an optimal strategy that exploits phylogenetic information to target closely related species in the search for a shared feature of interest (Kelly, Grenyer, & Scotland, 2014). Therefore, phylogenetic trees are useful for conservation evaluation, in choosing sets of species that maximize the present utilitarian benefits of extant feature diversity as well as the range of evolutionary trajectories in the future.
Problem Statement of the study
With the increasing volume of publication databases, the number of published phylogenetic trees is growing. This is because, with the rapid accumulation of DNA sequence data, more and more phylogenetic trees are being constructed (Pagel, 1999). As a result, searching for relevant information is challenging and time-consuming for a researcher (Dereeper et al., 2008). Next, the types of content in these published documents vary and include images, audio, art, and tables. Search engines rely on the text or captions associated with a figure to perform a search. This makes it challenging and time-consuming for a researcher to classify phylogenetic tree images one by one. Moreover, if searching for a particular phylogenetic tree is challenging and time-consuming for biologists, their research work may be delayed. Furthermore, phylogenetic trees were invented to study the evolution of organisms. Nowadays, published phylogenetic trees are mainly reused by biologists. Therefore, an automated digitization application that searches for phylogenetic trees is truly needed, because it can replace this very challenging manual task and determine whether an image is a phylogenetic tree.
Therefore, the main purpose of this project is the automated digitization of phylogenetic tree image classification using a machine learning algorithm. The classification focuses on whether images in a PDF file or text file are phylogenetic trees or non-phylogenetic trees. Examples of phylogenetic trees are the cladogram, the phenogram, and tree terminology. On the other hand, examples of non-phylogenetic trees are the family tree, the life cycle of organisms, and the flow chart. Figure 3 shows an example of a non-phylogenetic tree, a family tree (Murdoch, 2013).
Figure 3. Non-phylogenetic tree: family tree. Adapted from "Murdoch Family Tree," by Murdoch, W., 2013. Retrieved from http://www.williammurdoch.net/articles_12_Murdoch_family_tree.html. Copyright 2013 by Murdoch. Adapted with permission.
General Objective
The main objective of this research study is to employ a machine learning algorithm that can classify images as phylogenetic trees or non-phylogenetic trees.
Specific Objective
The specific objectives of this study are:
i. To employ machine learning that can predict whether an image represents a phylogenetic tree.
ii. To compare and contrast the different features that represent a phylogenetic tree in an image.
Research Question
i. Can a neural network be used for the prediction of phylogenetic tree images?
ii. What discriminative features can be used for classifier learning?
Definition of Terms
i. A phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the lines of evolutionary relationship of different species, organisms, or genes from a common ancestor (Baum, 2008).
ii. Phylogeny is the evolutionary relationship between organisms (Baum, 2008).
iii. Evolutionary analysis is the foundation of phylogenetic trees, with cautionary notes (Brinkman, 2005).
iv. Content mining is defined as a significant part of figure mining, which concerns non-textual content (Mounce, 2014).
Significance of the Study
This research study hopes to advance knowledge on the automated digitization of images from PDF or text files and their classification as phylogenetic trees or non-phylogenetic trees.
This research study is mainly focused on the rooted tree (cladogram) and the unrooted tree.
In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary relationships of organisms in the form of phylogenetic trees. This project is not concerned with the reconstruction of phylogenetic trees. Rather, it classifies images in a PDF file or text file as phylogenetic trees or non-phylogenetic trees using a machine learning algorithm.
CHAPTER TWO
LITERATURE REVIEW
As mentioned by Mounce (2012), millions of papers about phylogenetic trees are now published each year, at an ever-growing rate. This is because the amount and [...] of species with at least partial sequence information is rapidly increasing. Thus, phylogenetic trees have become an integral part of various biological studies, with the exponential increase in sequence data being generated by various classical and next-generation sequencing studies (Baum, 2008). This chapter is divided into a few sections. The first section focuses on phylogenetic trees and explains their meaning and purpose and the types of phylogenetic trees. The next section concentrates on features in images and emphasizes the features that are suitable for the image classification process. In addition, this section reviews image recognition system frameworks.
Phylogenetic Tree
A phylogenetic tree, or evolutionary tree, is an illustrative representation of biological entities that are associated by common descent, such as species or higher-level taxonomic groupings (Gregory, 2008). The phylogenetic tree is a backbone for various other biological studies (Baum, 2008). Therefore, it is a graphical representation of a hypothesis about the evolution of a species, with branches that separate, hybridize, or terminate through extinction. Readers can read and understand the patterns of descent from phylogenetic trees. However, phylogenetic trees do not indicate when species evolved or how much genetic change occurred in a lineage (Kelly, Grenyer, & Scotland, 2014). For this reason, it should not be assumed from a phylogenetic tree that a taxon evolved from the taxon next to it.
Baum, Smith, and Donovan (2005) stated that the phylogenetic tree is the most direct illustration of the principle of common ancestry. This is because the phylogenetic tree is crucial for evolutionary theory. In fact, they were telling readers that a practical understanding of what a phylogenetic tree represents is really important for understanding the evolutionary relationships of species. Thus, phylogenetic trees have become important in the evolutionary analysis of any species, and biologists should increase their use of phylogenetic trees in the biological sciences. Next, phylogenetic trees provide an efficient structure for organizing biodiversity information. Moreover, they develop an accurate conception of the totality of evolutionary history. Therefore, it is important for aspiring biologists to develop an understanding of phylogenetic trees.
Types of Phylogenetic Tree
Phylogenetic trees can be divided into different kinds. There are two main categories: rooted phylogenetic trees and unrooted phylogenetic trees. Apart from these two main categories, a phylogenetic tree can be drawn in several forms: the slanted cladogram, Figure 4 (Phylogenetic tree, 2002); the rectangular cladogram, Figure 5 (Phylogenetic tree, 2002); and the circular cladogram, Figure 6 (Phylogenetic tree, 2002). A root can be added artificially to an unrooted tree by means of a species that unambiguously separated early from the other species being considered (Bacardit, 2009).
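Rooting an unrooted tree with an early-diverging species can be sketched as follows. This Python example is a minimal illustration and is not taken from this study: the tree, the choice of D as the early-diverging species, the node label "root", and the function name are all hypothetical. The new root is placed on the branch that joins the chosen species to the rest of the tree, and every other node is then oriented away from it.

```python
def root_with_outgroup(edges, outgroup, root="root"):
    """Return a child -> parent map for the tree rooted on the outgroup's branch."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    if root in adjacency:
        raise ValueError("the root label is already used by a node")
    if len(adjacency.get(outgroup, ())) != 1:
        raise ValueError("the outgroup must be a leaf of the tree")
    (neighbour,) = adjacency[outgroup]
    # Split the outgroup's branch by inserting the new root node on it.
    adjacency[outgroup] = {root}
    adjacency[neighbour].remove(outgroup)
    adjacency[neighbour].add(root)
    adjacency[root] = {outgroup, neighbour}
    # Walk outward from the root; the node we arrive from becomes the parent.
    parent = {}
    visited = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        for nxt in adjacency[node]:
            if nxt not in visited:
                visited.add(nxt)
                parent[nxt] = node
                stack.append(nxt)
    return parent


# Hypothetical unrooted tree; D is assumed to have separated earliest.
edges = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")]
parent = root_with_outgroup(edges, "D")
```

After rooting, the chosen species D and the rest of the tree (through node y) are the two children of the new root, so all remaining taxa share an ancestor that D does not.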
12
UNIVERSITI MALAYSIA SARAWAK
Grade _____
Please tick one
Final Year Project Report IZI Masters D PhD D
DECLARATION OF ORIGINAL WORK
This declaration is made on the 05 day of JUNE year 2015
Students Declaration I TAN JIA KAE 39023 FACULTY OF COGNITIVE SCIENCES AND HUMAN DEVELOPMENT hereby declare that the work entitled PHYLOGENETIC TREE CLASSIFICATION SYSTEM USING MACHINE LEARNING ALGORITHM is my original work I have not copied from any other students work or from any other sources with the exception where due reference or acknowledgement is made explicitly in the text nor has any part of the work been written for me by another person
5 JUNE 2015
TAN JIA KAE (39023)
Supervisors Declaration I DR LEE NUNG KlON hereby certify that the work entitled PHYLOGENETIC TREE CLASSIFICATION SYSTEM USING MACHINE LEARNING ALGORITHM was prepared by the aforementioned or above mentioned student and was submitted to the FACULTY as a partiallfull fulfillment for the conferment of BACHELOR OF SCIENCE WITH HONOURS (COGNITIVE SCIENCE) and the aforementioned work to the best of my knowledge is the said students work ~
5 JUNE 2015 Date ________Received for examination by
I declare this ProjectThesis is classified as (Please tick (Jraquo
o CONFIDENTIAL (Contains confidential information under the Official Secret Act 1972)
o RESTRICTED (Contains restricted information as specified by the organisation where research was done)
~ OPEN ACCESS
I declare this ProjectThesis is to be submitted to the Centre for Academic Information Services (CAIS) and uploaded into UNIMAS Institutional Repository (UNlMAS IR) (Please tick 0raquo
~ YES
o NO
Validation of ProjectJThesis
I hereby duly affirmed with free consent and willingness declared that this said ProjectThesis shall be placed officially in the Centre for Academic Information Services with the abide interest and rights as follows
bull This ProjectThesis is the sole legal property of Universiti Malaysia Sarawak (UNlMAS)
bull The Centre for Academic Information Services has the lawful right to make copies of the ProjectThesis for academic and research purposes only and not for other purposes
bull The Centre for Academic Information Services has the lawful right to digitize the content to be uploaded into Local Content Database
bull The Centre for Academic Information Services has the lawful right to make copies of the ProjectThesis if required for use by other parties for academic purposes or by other Higher Learning Institutes
bull No dispute or any claim shall arise from the student himself herself neither a third party on this ProjectThesis once it becomes the sole property of UNlMAS
bull This ProjectThesis or any material data and information related to it shall not be distributed published or disclosed to any party by the student himselflherself without
Supervisors signature ___~(f=I---------f-___ Date 5 JUN~15
Current Address Universiti Malaysia Sarawak 94300 Kota Samarahan Sarawak
Notes If the ProjectThesis is CONFIDENTIAL or RESTRICTED please attach together as annexure a letter from the organisation with the date of restriction indicated and the reasons for the confidentiality and restriction
first obtaining a proval from UNlMAS t
Students signature -1----------shyDate
I
ttusat KJIIdlBlt Maldut Akad~mik UNIVERSlTI MALAYSIA SAltAWAK
PHYLOGENETIC TREE CLASSIFICATION SYSTEM BY USING MACHINE LEARNING ALGORITHM
TANJIAKAE
This project is submitted in partial fulfilment of the requirements for a
Bachelor of Science with Honours (Cognitive Science)
I
I
Faculty of Cognitive Sciences and Human Development UNIVERSITI MALAYSIA SARA W AK
(2015)
-
The project entitled "Phylogenetic tree classification system by using machine learning algorithm" was prepared by Tan Jia Kae and submitted to the Faculty of Cognitive Sciences and Human Development in partial fulfilment of the requirements for a Bachelor of Science with Honours (Cognitive Science).
Received for examination by:
____________ (Dr Lee Nung Kion)
Date: 5 June 2015
Grade:
ACKNOWLEDGEMENTS
First and foremost, I would like to take this opportunity to express my deepest appreciation to my supervisor, Dr Lee Nung Kion, for his generosity and patience in spending his precious time giving me many remarks and sharing his knowledge, experience and expertise while I completed my Final Year Project. Without his guidance, my project could not have been completed successfully in the limited time available.
Next, I am deeply indebted to my family for their unceasing encouragement, support and attention throughout my Final Year Project, especially during the periods when I most needed their love to help me finish my Final Year Project thesis.
In addition, I would like to thank all my friends and course mates who supported and encouraged me in completing this project. During the project I faced difficulties that tempted me to give up. Luckily, they gave me plenty of advice and support, which gave me the strength and confidence to finish my Final Year Project thesis.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
ABSTRAK
CHAPTER ONE INTRODUCTION
CHAPTER TWO LITERATURE REVIEW
CHAPTER THREE METHODOLOGY
CHAPTER FOUR RESULT AND DISCUSSION
CHAPTER FIVE CONCLUSION AND RECOMMENDATION
REFERENCE
APPENDIX A PHYLOGENETIC TREE CLASSIFICATION SYSTEM MATLAB CODING
LIST OF TABLES
Table 1 Phylogenetic Tree Classification Cross-validation results based on different features
Table 2 10-fold cross-validation results with 540 training data and 60 testing data each fold
LIST OF FIGURES
Figure 1 The first evolution tree diagram sketched by Darwin
Figure 2 Cumulative number of publications in Science Citation Index since 1981 that cite the terms "molecular" and "phylogeny" in the keywords or abstract
Figure 3 Non-phylogenetic tree: family tree
Figure 4 Phylogenetic rooted tree: rectangular cladogram
Figure 5 Phylogenetic rooted tree: slanted diagram
Figure 6 Phylogenetic unrooted tree: circular cladogram
Figure 7 Phylogenetic scaled tree
Figure 8 Phylogenetic unscaled tree
Figure 9 A quick review of phylogenetic tree
Figure 10 Object detection in computer perception
Figure 11 Feature Representation
Figure 12 SIFT
Figure 13 RIFT
Figure 14 Spin image
Figure 15 Pre-processing of model objects
Figure 16 Recognition of object in the scene
Figure 17 TreeRipper
Figure 18 TreeSnatcher Plus
Figure 19 Windows Snipping Toolbox
Figure 20 Original 1.png
Figure 21 After Thresholding
Figure 22 Grayscale image
Figure 23 SURF Feature Detection and Extraction
Figure 24 GIST Feature Detection and Extraction
Figure 25 10-fold cross-validation accuracy
Figure 26 Example of Graphic User Interface for the Phylogenetic Tree Image Classification system
Figure 27 Graphic User Interface for the Phylogenetic Tree Image Classification system
ABSTRACT
A study was conducted to develop an automated phylogenetic tree image classification system using a machine learning algorithm. The study adopted a supervised machine learning algorithm, the Support Vector Machine (SVM), for classification. Image data were collected from online databases: PUBMED, ScienceDirect and Bioinformatics journals. Performance comparisons of three types of features used to characterize phylogenetic tree images are presented in this project. The aim is to determine the suitable features for the phylogenetic tree image classification system. Leave-one-out cross-validation was used to calculate the accuracy of each feature. In addition, 10-fold cross-validation was conducted in the evaluation. Our results show that the most suitable feature combination for the phylogenetic tree image classification system is SIFT, SURF and GIST. The accuracy obtained from the combination of these three features is just over 82%, and the average accuracy obtained from the 10-fold cross-validation is 81.50%. Our evaluation results demonstrate the utility of SIFT, SURF and GIST features for building a phylogenetic tree image classification system.
Keywords: phylogenetic tree, image classification system, image processing, feature extraction, SIFT, GIST, SURF
ABSTRAK
A study was conducted to produce an automated phylogenetic tree image classification system using a machine learning algorithm. The study used a supervised machine learning algorithm, namely the Support Vector Machine (SVM). Image data were collected from the online databases PUBMED, ScienceDirect and Bioinformatics. A comparison of the performance of three different phylogenetic tree features is also presented in this project. The aim is to determine the suitable features for the phylogenetic tree image classification system. Leave-one-out cross-validation was used to calculate the accuracy of each feature. In addition, 10-fold cross-validation was measured in this study. The results show that the most suitable combination of features for the phylogenetic tree image classification system is SIFT, SURF and GIST. The accuracy obtained from the combination of the three features exceeds 82.19%. The results also show that the average accuracy obtained from the 10-fold cross-validation is 81.50%. The results of this study demonstrate the use of the combined SIFT, SURF and GIST features to implement this phylogenetic tree classification system.
Keywords: image classification system, image processing, feature extraction, SIFT, GIST, SURF
CHAPTER ONE
INTRODUCTION
Overview
Phylogenetic trees are widely used for the evolutionary analysis of different species, organisms or genes descended from a common ancestor (Laubach, von Haeseler & Lercher, 2012). According to Brinkman (2005), evolutionary analysis is a collection of methods for ascertaining long-term phenotypic evolution, developed during the 1990s. Evolutionary analysis also rests on evolutionary theory, which is the foundation of most bioinformatic analysis. This is because evolutionary analysis shows the ecological characterization of species using the concept of frequency dependence from gene theory (Brinkman, 2005). This chapter discusses the background of the study, the problem statement, the research objectives, the research questions, the hypothesis, the conceptual framework and the significance of the study. In addition, this chapter defines the relevant terms.
Introduction
The evolutionary tree, or phylogenetic tree, is a visualization that shows the relationships between entities according to the similarities and differences in their hereditary or physical characteristics (Baum, 2008). The way a phylogenetic tree presents the relationships among species is therefore important, as reflected in how such trees are used to present the evolutionary analysis of any species. Evolutionary analysis generally includes the identification of homologous sequences, their multiple alignment, phylogenetic reconstruction and the graphical representation of the inferred tree (Dereeper et al., 2008). These four steps can be explained in terms of biological evolution. According to Dereeper et al. (2008), homologous sequence identification finds similar sequences, whereas multiple alignment determines the differences between the aligned sequences. Phylogenetic reconstruction is the process of building the phylogenetic tree after the homology search and alignment steps, and the graphical representation shows the relationships between the species in the tree (Dereeper et al., 2008). This reflects the increasing use of phylogenetic trees in the biological sciences, especially by biologists who carry out evolutionary analysis of species. The phylogenetic tree is therefore important for the evolutionary analysis of life on Earth.
Phylogeny is the evolutionary history of a species or group of related species (Pagel, 1999). It is central to systematics, the discipline that classifies organisms (Siegel-Causey, Brooks & Funk, 1991), because systematists use phylogeny to determine the evolutionary relationships of organisms. According to Campbell and Reece (2008), a systematist is a professional who uses fossil, molecular and genetic data to infer evolutionary relationships. They also describe PhyloCode, which can be used alongside the branching phylogenetic trees that depict a phylogenetic analysis. A phylogenetic analysis is presented as a collection of nodes and branches. For instance, taxa that are closely related in an evolutionary sense appear close to each other, whereas distantly related taxa lie on different branches of the tree, far from each other.
Background of the study
In 1859, Darwin published the first illustration of a phylogenetic tree (Darwin, 1859). Before that, in 1837, shortly after his famous five-year voyage as naturalist on the Beagle, he sketched a tree diagram in his notebook (Darwin, 1859). As Figure 1 shows, this simple sketch is remarkably similar to modern diagrams of phylogenies (Darwin, 1987).
Figure 1 The first evolution tree diagram sketched by Darwin. Adapted from Charles Darwin's notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by Darwin, C., 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.). Adapted with permission.
Figure 2 Cumulative number of publications in Science Citation Index since 1981 that cite the terms "molecular" and "phylogeny" in the keywords or abstract. Adapted from "Inferring the historical patterns of biological evolution", by Pagel, M., 1999, Nature, 401(6756), p. 844. Copyright 1999 by Pagel. Adapted with permission.
The first illustration of a phylogenetic tree accompanied the first scientific argument for the theory of evolution by means of natural selection. Darwin (1998) stated: "The time will come, I believe, though I shall not live to see it, when we shall have fairly true genealogical trees of each great kingdom of Nature" (p. 18). He would have welcomed seeing how modern genetics supports and confirms his ideas. He provided evidence not only for what had happened in evolution but also for precisely how living things evolve (Darwin, 1859). DNA, which was unknown in Darwin's time, now supplies the forensic evidence for evolution.
Several approaches were used to study the evolution of species before molecular phylogenetics (Campbell & Reece, 2008). In the 1900s, immunochemical studies were used to show that cross-reactions are stronger between closely related organisms. Between the 1940s and the 1960s, biologists used protein sequencing, electrophoresis and DNA hybridization, and these methods, followed later by PCR, contributed to a boom in molecular phylogeny. Meanwhile, after the publication of The Origin of Species by Darwin, many other biologists came to accept the idea of a universal Tree of Life (Darwin, 1987). Then, in the late 1970s, biologists started to study the evolution of organisms using molecular phylogeny. One German biologist who supported Darwin's Tree of Life was Ernst Haeckel (Larget, 2011). Phylogenetic trees are very useful to biologists because they can be used to describe the relations between living creatures, genomes and genes.
With the development of phylogenetic techniques, the number of studies depicting phylogenies has exploded (Pagel, 1999). The number of articles publishing phylogenies based on gene-sequence information has been increasing exponentially, as Figure 2 shows (Pagel, 1999). These phylogenies cover taxonomic groups ranging from viruses to bacteria, fungi, plants and animals (Campbell & Reece, 2008). The phylogenetic tree has thus become popular and important for the evolutionary analysis of organisms. The phylogenetic tree is a branching diagram that shows the evolutionary relationships of organisms (Baum, 2008). According to Darwin (1859), evolution is a natural process acting on populations. It can be described as the change in the hereditary traits of biological populations over successive generations.
Phylogeny can also show similarities and differences in physical and hereditary traits, because taxa that join at a node are inferred to descend from that node (Gregory, 2008). In this sense a phylogenetic tree is similar to a family tree. The construction of phylogenetic trees is based on similarities or differences in physical or genetic features. In the past, scientists used only the traditional approach, which focused on physical features, to construct phylogenetic trees. Advances in technology have since led to the accumulation of huge amounts of biological data (Wan & Che, 2013), which may change the way biological studies are carried out in various respects.
As mentioned by Wan and Che (2013), phylogenetic trees can be built using information on interacting pathways. They applied hierarchical clustering to two domains of organisms, eukaryotes and prokaryotes. Using interacting pathways can increase the effectiveness of revealing the evolutionary relationships of species (Wan & Che, 2013). Phylogenetic trees are constructed from a variety of evidence, most commonly by comparing DNA (Kaizhong, Jason T. & Dennis, 1996). A phylogenetic tree is an undirected, acyclic, connected graph. The lengths of its branches represent the time since groups split from each other, and the internal nodes of the tree represent ancestors. The exterior nodes are called leaves.
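The graph description above can be illustrated with a minimal Python sketch. It is not taken from the thesis: the taxa, the internal node names (n1, n2) and all function names are hypothetical, chosen only to show a tree stored as an undirected graph, a check that it is connected and acyclic, the leaves as the nodes with a single neighbour, and the number of branches separating two taxa.

```python
from collections import deque

# Hypothetical toy tree: "n1" and "n2" stand for inferred ancestors
# (internal nodes); the taxon names are made up for illustration.
EDGES = [("n1", "n2"), ("n1", "human"), ("n1", "chimp"),
         ("n2", "mouse"), ("n2", "rat")]


def build_adjacency(edges):
    """Store the undirected graph as node -> set of neighbouring nodes."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def branch_counts(adjacency, start):
    """Breadth-first search: number of branches from start to each reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def is_tree(adjacency, edges):
    """A connected graph with exactly n - 1 distinct edges has no cycle."""
    reached = branch_counts(adjacency, next(iter(adjacency)))
    return len(reached) == len(adjacency) and len(edges) == len(adjacency) - 1


def leaves(adjacency):
    """Exterior nodes: nodes attached to exactly one branch."""
    return sorted(n for n, nbs in adjacency.items() if len(nbs) == 1)


adjacency = build_adjacency(EDGES)
print(is_tree(adjacency, EDGES))   # connected and acyclic?
print(leaves(adjacency))           # the exterior nodes (taxa)
from_human = branch_counts(adjacency, "human")
print(from_human["chimp"], from_human["rat"])  # closer taxa are fewer branches apart
```

In this toy tree the two taxa that share the ancestor n1 are separated by fewer branches than taxa attached to different internal nodes, which mirrors the earlier remark that closely related taxa appear close to each other in the tree. Branch lengths are ignored here; a scaled tree would attach a length to each edge.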
Apart from constructing phylogenetic trees, a newer approach extracts phylogenetic tree data from the published literature using content mining (Mounce, 2012). The term can be split into "content" and "mining". Content can include anything, such as audio, video, metadata, text and images, while mining refers to extracting large amounts of information from that content. Extracting phylogenetic tree data from the literature relies on content mining rather than text mining because the content is more than just text (Mounce, 2012).
In short, phylogenetic trees provide a framework that shows the evolution of features (Baum, 2008), reflecting that related species share many similar features. Phylogenetic trees are also used in bio-prospecting, a strategy that exploits phylogenetic information to target closely related species when searching for a shared feature of interest (Kelly, Grenyer & Scotland, 2014). Phylogenetic trees are likewise useful in conservation evaluation, for choosing sets of species that maximize the present utilitarian benefits of extant feature diversity as well as the range of evolutionary trajectories in the future.
Problem Statement of the study
With the increasing volume of publication databases, the number of published phylogenetic trees is growing. This is because, with the rapid accumulation of DNA sequence data, more and more phylogenetic trees are being constructed (Pagel, 1999). This makes it challenging and time consuming for a researcher to search for relevant information (Dereeper et al., 2008). In addition, the contents of these published documents are of various types, such as images, audio, art and tables. Search engines rely on the text or captions associated with a figure to perform a search, so a researcher has to classify phylogenetic tree images one by one, which is challenging and wastes time. If searching for a particular phylogenetic tree is this difficult, it may delay biologists' research. Furthermore, phylogenetic trees were originally created to study the evolution of organisms, whereas published phylogenetic trees are now mainly reused by other biologists. An automated digitization application that searches for phylogenetic trees is therefore needed, because it can replace this very challenging manual task and determine whether an image is a phylogenetic tree.
Therefore, the main purpose of this project is to automate the digitization and classification of phylogenetic tree images using a machine learning algorithm. The classification focuses on deciding whether images in PDF or text files are phylogenetic trees or non-phylogenetic trees. Examples of phylogenetic trees are the cladogram, the phenogram and tree terminology. Examples of non-phylogenetic trees are the family tree, the life cycle of an organism and the flow chart. Figure 3 shows an example of a non-phylogenetic tree, a family tree (Murdoch, 2013).
Figure 3 Non-phylogenetic tree: family tree. Adapted from Murdoch Family Tree, by Murdoch, W., 2013. Retrieved from http://www.williammurdoch.net/articles_12_Murdoch_family_tree.html Copyright 2013 by Murdoch. Adapted with permission.
General Objective
The main objective of this research study is to employ a machine learning algorithm that can classify images as phylogenetic trees or non-phylogenetic trees.
Specific Objective
The specific objectives of this study are:
i. To employ machine learning to predict whether an image represents a phylogenetic tree.
ii. To compare and contrast the different features that represent a phylogenetic tree in an image.
Research Question
i. Can a neural network be used for the prediction of phylogenetic tree images?
ii. Which discriminative features can be used for classifier learning?
Definition of Terms
i. A phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the lines of evolutionary relationship of different kinds of species, organisms or genes from a common ancestor (Baum, 2008).
ii. Phylogeny is the evolutionary relationship between organisms (Baum, 2008).
iii. Evolution analysis is the foundation of phylogenetic trees and is to be applied with cautionary notes (Brinkman, 2005).
iv. Content mining is defined as including figure mining, which deals with non-textual content, as a significant part (Mounce, 2014).
This research study hopes to advance knowledge on the automated digitization and classification of images from PDF or text files as phylogenetic trees or non-phylogenetic trees.
This research study is mainly focused on the rooted tree (cladogram) and the unrooted tree.
In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary relationships of organisms in the form of phylogenetic trees. This project is not concerned with the reconstruction of phylogenetic trees. Rather, it classifies images in PDF or text files as phylogenetic trees or non-phylogenetic trees using a machine learning algorithm.
CHAPTER TWO
LITERATURE REVIEW
As mentioned by Mounce (2012), millions of papers about phylogenetic trees are now published each year, at an ever-growing rate. This is because the number and diversity of species with at least partial sequence information is rapidly increasing. Phylogenetic trees have thus become an integral part of various biological studies, with the exponential increase of sequence data being generated by various classical and next-generation sequencing studies (Baum, 2008). This chapter is divided into a few sections. The first section focuses on phylogenetic trees, explaining their meaning and purpose and the types of phylogenetic trees. The next section concentrates on features in images and emphasizes the features that are suitable for the image classification process. This section also reviews image recognition system frameworks.
Phylogenetic Tree
A phylogenetic tree, or evolutionary tree, is an illustrative representation of biological entities that are associated through common descent, such as species or higher-level taxonomic groups (Gregory, 2008). The phylogenetic tree is a backbone for various other biological studies (Baum, 2008). It is a graphical representation of a hypothesis about the evolution of a species, with branches that separate, hybridize or terminate by extinction. Readers can read and understand the patterns of descent from phylogenetic trees. However, phylogenetic trees do not indicate when species evolved or how much genetic change occurred in a lineage (Kelly, Grenyer & Scotland, 2014). Nor should it be assumed from a phylogenetic tree that a taxon evolved from the taxon next to it.
Baum, Smith and Donovan (2005) stated that the phylogenetic tree is the most direct representation of the principle of common ancestry, which makes it crucial for evolutionary theory. They emphasize that a practical understanding of what a phylogenetic tree represents is important for understanding the evolutionary relationships of species. Phylogenetic trees are thus important in the evolutionary analysis of any species, and biologists should increase their use in the biological sciences. In addition, phylogenetic trees provide an efficient structure for organizing biodiversity information and help develop an accurate conception of the totality of evolutionary history. It is therefore important for aspiring biologists to develop an understanding of phylogenetic trees.
Types of Phylogenetic Tree
Phylogenetic trees can be divided into different kinds. There are two main categories: rooted and unrooted phylogenetic trees. Apart from these two main categories, a phylogenetic tree can be drawn in several forms: slanted cladogram, Figure 4 (Phylogenetic tree, 2002), rectangular cladogram, Figure 5 (Phylogenetic tree, 2002), and circular cladogram, Figure 6 (Phylogenetic tree, 2002). Roots can be artificially added to unrooted trees by means of a species that unambiguously separated early from the other species being considered (Bacardit, 2009).
I declare this ProjectThesis is classified as (Please tick (Jraquo
o CONFIDENTIAL (Contains confidential information under the Official Secret Act 1972)
o RESTRICTED (Contains restricted information as specified by the organisation where research was done)
~ OPEN ACCESS
I declare this ProjectThesis is to be submitted to the Centre for Academic Information Services (CAIS) and uploaded into UNIMAS Institutional Repository (UNlMAS IR) (Please tick 0raquo
~ YES
o NO
Validation of ProjectJThesis
I hereby duly affirmed with free consent and willingness declared that this said ProjectThesis shall be placed officially in the Centre for Academic Information Services with the abide interest and rights as follows
bull This ProjectThesis is the sole legal property of Universiti Malaysia Sarawak (UNlMAS)
bull The Centre for Academic Information Services has the lawful right to make copies of the ProjectThesis for academic and research purposes only and not for other purposes
bull The Centre for Academic Information Services has the lawful right to digitize the content to be uploaded into Local Content Database
bull The Centre for Academic Information Services has the lawful right to make copies of the ProjectThesis if required for use by other parties for academic purposes or by other Higher Learning Institutes
bull No dispute or any claim shall arise from the student himself herself neither a third party on this ProjectThesis once it becomes the sole property of UNlMAS
bull This ProjectThesis or any material data and information related to it shall not be distributed published or disclosed to any party by the student himselflherself without
Supervisors signature ___~(f=I---------f-___ Date 5 JUN~15
Current Address Universiti Malaysia Sarawak 94300 Kota Samarahan Sarawak
Notes If the ProjectThesis is CONFIDENTIAL or RESTRICTED please attach together as annexure a letter from the organisation with the date of restriction indicated and the reasons for the confidentiality and restriction
first obtaining a proval from UNlMAS t
Students signature -1----------shyDate
I
ttusat KJIIdlBlt Maldut Akad~mik UNIVERSlTI MALAYSIA SAltAWAK
PHYLOGENETIC TREE CLASSIFICATION SYSTEM BY USING MACHINE LEARNING ALGORITHM
TANJIAKAE
This project is submitted in partial fulfilment of the requirements for a
Bachelor of Science with Honours (Cognitive Science)
I
I
Faculty of Cognitive Sciences and Human Development UNIVERSITI MALAYSIA SARA W AK
(2015)
-
The project entitled Phylogenetic tree classification system by using machine learning algorithm was prepared by Tan Jia Kae and submitted to the Faculty of Cognitive Sciences and Human Development in partial fulfilment of the requirements for a Bachelor of Science with Honours (Cognitive Science)
Received for examination by
--------------------~--(Dr Lee Nung Kion)
Date 5 June 2015
Grade
II
ACKNOWLEDGEMENTS
First and foremost I would like to take this opportunity to express my deepest
appreciation to my supervisor Dr Lee Nung Kion for his generous and patient by spending his
precious time in order to give me a lot of remarks as well as sharing his superior knowledge
experience and expertise during the process in completing my Final Year Project Without his
guidance my project would not be completed successfully at the limited of time
Next I am deeply indebted to my family for affording their unceasing encouragement
support and attention effluence to me during the whole process of doing my Final Year Project
study especially for those periods that I really need some of their love to help me finish my Final
Year Project Thesis
In addition I would like to thank to all my friends and course mates who supported and
encouraged me in completion of this project During the completion of this project I faced some
ofdifficulties that would pull me to give up Luckily they are giving me full of advices and
support that give me the strength and confidence to finish my Final Year Project Thesis
III
Pusa unnit MwumalA Oil (-1 bullbull
UNlVEKSITI MALAYSIA SAItAWAK
TABLE OF CONTENTS
LIST OF TABLES v
LIST OF FIGURES vi
ABSTRACT viii
ABSTRAK ix
CHAPTER ONE INTRODUCTION 1
CHAPTER TWO LITERATURE REVIEW 11
CHAPTER THREE METHODOLOGY 39
CHAPTER FOUR RESULT AND DISCUSSION 62
CHAPTER FIVE CONCLUSION AND RECOMMENDATION 69
REFERENCE 73
APPENDIX A PHYLOGENETIC TREE CLASSIFICATION SYSTEM MATLAB CODING79
IV
LIST OF TABLES
Table I Phylogenetic Tree Classification Cross-validation results based on different features 62
Table 2 lO-fold cross-validation results with 540 training data and 60 testing data each fold 66
v
LIST OF FIGURES
Figure 1 The first evolution tree diagram sketched by Darwin 3 I
Figure 2 Cumulative number of publications in Science Citation index since 1981 that cite the term molecular and phylogeny in the keywords or abstract 3
Figure 3 Non-phylogenetic tree- family tree 8
Figure 4 phylogenetic rooted-tree rectangular cladogram 13
Figure 5 phylogenetic rooted-tree Slanted diagram 13
Figure 6 phylogenetic unrooted-tree circular cladogram 14
Figure 7 phylogenetic scaled-tree 16
Figure 8 phylogenetic unscaled-tree 16
Figure 9 A quick review of phylogenetic tree 19
Figure 10 Object detection in computer perception 25
Figure 11 Feature Representation 25
Figure 12 SIFT 27
Figure 13 RIFT - 27
Figure 14 Spin image 28
Figure 15 Pre-pocessing of model Objects 32
Figure 16 Recognition of object in the scene 33
Figure 17 TreeRipper 36
Figure 18 TreeSnatcher Plus 37
Figure 19 Windows Snipping Toolbox 44
Figure 20 Original lpng 46
Figure 21 After Thresholding 46
Figure 22 Grayscale image 46
Figure 23 SURF Feature Detection and Extraction 53
vi
LIST OF FIGURES
Figure 24 GIST Feature Detection and Extraction 54
Figure 25 lO-fold cross-validation accuracy 63
Figure 26 Example of Graphic User Interface for the Phylogenetic Tree Image Classification system 67
Figure 27 Graphic User Interface for the Phylogenetic Tree Image Classification system 67
vii
ABSTRACT
A study is conducted to develop an automated phylogenetic tree image classification system by
using machine learning algorithm This study adopted supervised machine learning algorithm
which is the Support Vector Machine (SVM) for classification Image data were collected from
online databases PUBMED ScienceDirect and Bioinfonnatic journals Perfonnance
comparisons of three types of features to characterize the phylogenetic tree images are presented
in this project The aim is to detennine the suitable features for the phylogenetic tree image
classification systeIlJ The leave-out one cross-validation was used to calculate the accuracy of
each feature In addition to that 10-fold cross-validation is also conducted in the evaluation Our
results show that the suitable combination features for the phylogenetic tree image classification
system are SIFT SURF and GIST The accuracy obtained from these combinations of the three
features can achieve just over 82 On the other hands the results show the average accuracy
obtained from the 10-fold cross-validation is 8150 Our evaluation results demonstrate the
utility of using SIFT SURF and GIST features for building phylogenetic tree image
classification system
Keywords phylogenetic tree image classification system image processing feature extraction
SIFT GIST SURF
VIII
ABSTRAK
A study was conducted to produce an automated phylogenetic tree image classification system
using a machine learning algorithm. The study used a supervised machine learning algorithm,
namely the Support Vector Machine (SVM). Image data were collected from the online databases
PubMed and ScienceDirect and from bioinformatics journals. A comparison of the performance of
three different types of phylogenetic tree features is also presented in this project. The aim
is to determine the suitable features for the phylogenetic tree image classification system.
Leave-one-out cross-validation was used to calculate the accuracy of each feature. In addition,
10-fold cross-validation was carried out in this study. The results show that the most suitable
combination of features for the phylogenetic tree image classification system is SIFT, SURF and
GIST. Combining the three features achieves an accuracy of more than 82.19%. The results also
show that the average accuracy obtained from the 10-fold cross-validation is 81.50%. The
results of this study demonstrate the use of combined SIFT, SURF and GIST features for
implementing this phylogenetic tree classification system.
Keywords: image classification system, image processing, feature extraction, SIFT, GIST, SURF
CHAPTER ONE
INTRODUCTION
Overview
Phylogenetic trees are widely used for the evolutionary analysis of different species,
organisms or genes descended from a common ancestor (Laubach, von Haeseler & Lercher, 2012).
According to Brinkman (2005), evolutionary analysis is a collection of methods for ascertaining
long-term phenotypic evolution that was developed during the 1990s. Evolutionary analysis also
rests on evolutionary theory, which is the foundation of most bioinformatic analysis. This is
because evolutionary analysis characterizes species ecologically using the concept of frequency
dependence from gene theory (Brinkman, 2005). This chapter discusses the background of the
study, the problem statement, the research objectives, the research questions, the hypothesis
and conceptual framework of the study, and the significance of the study. In addition, this
chapter defines the relevant terms.
Introduction
The evolutionary tree, or phylogenetic tree, is a visualization that shows the relationships
among entities according to the similarities and differences in their hereditary or physical
characteristics (Baum, 2008). The way in which a phylogenetic tree depicts the relationships
among species is therefore important, because it is how the tree conveys the evolutionary
analysis of any species. Evolutionary analysis generally includes the identification of
homologous sequences, their multiple alignment, phylogenetic reconstruction and the graphical
representation of the inferred tree (Dereeper et al., 2008). These four steps can be explained
as follows. According to Dereeper et al. (2008), homologous sequence identification finds
similar sequences, whereas multiple alignment determines the differences among the aligned
sequences. Phylogenetic reconstruction then builds the phylogenetic tree from the identified
and aligned sequences, and the graphical representation shows the relationships among the
species in the resulting tree (Dereeper et al., 2008). This workflow reflects the increasing
use of phylogenetic trees in the biological sciences, especially among biologists who perform
evolutionary analysis of species. The phylogenetic tree is therefore important for the
evolutionary analysis of life on Earth.
Phylogeny is the evolutionary history of a species or group of related species (Pagel, 1999).
It underpins systematics, the discipline that classifies organisms (Siegel-Causey, Brooks &
Funk, 1991), because systematists use phylogeny to determine the evolutionary relationships of
organisms. According to Campbell and Reece (2008), a systematist is a professional who uses
fossil, molecular and genetic data to infer evolutionary relationships. They also discuss the
PhyloCode, which can be used to depict a phylogenetic analysis as a branching phylogenetic
tree. A phylogenetic analysis is presented as a collection of nodes and branches. For instance,
taxa that are closely related in an evolutionary sense appear close to each other, whereas taxa
that are distantly related lie on different branches of the tree, far from each other.
Background of the study
In 1859, Darwin published the first illustration of a phylogenetic tree (Darwin, 1859).
Earlier, in 1837, shortly after his famous five-year voyage as naturalist on the Beagle, he had
sketched a tree diagram in his notebook (Darwin, 1859). As Figure 1 shows, this simple sketch
is remarkably similar to modern diagrams of phylogenies (Darwin, 1987).
Figure 1. The first evolution tree diagram sketched by Darwin. Adapted from Charles Darwin's
notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by C.
Darwin, 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.).
Adapted with permission.
Figure 2. Cumulative number of publications in the Science Citation Index since 1981 that cite
the terms "molecular" and "phylogeny" in the keywords or abstract (horizontal axis: year, 1980
to 2000). Adapted from "Inferring the historical patterns of biological evolution" by M. Pagel,
1999, Nature, 401(6756), p. 844. Copyright 1999 by Pagel. Adapted with permission.
This first illustration of a phylogenetic tree accompanied the first scientific argument for
the theory of evolution by means of natural selection. Darwin (1998) stated: "The time will
come, I believe, though I shall not live to see it, when we shall have fairly true genealogical
trees of each great kingdom of Nature" (p. 18). He would have been eager to see how modern
genetics has supported and confirmed his ideas. The evidence now available shows not only what
happened in the course of evolution but precisely how living things evolve. DNA, which was
unknown in Darwin's time, now serves as the forensic evidence for the theory he set out
(Darwin, 1859).
Several approaches were used to study the evolution of species before molecular phylogenetics
(Campbell & Reece, 2008). In the 1900s, immunochemical studies were used to discover
cross-reactions, which are stronger between closely related organisms. Between the 1940s and
the 1960s, biologists adopted protein sequencing, electrophoresis and DNA hybridization, and
later PCR, all of which contributed to a boom in molecular phylogeny. After the publication of
The Origin of Species, many other biologists came to accept the idea of a universal Tree of
Life (Darwin, 1987). One prominent supporter of Darwin's Tree of Life was the German biologist
Ernst Haeckel (Larget, 2011). From the late 1970s, biologists began to analyse the evolution of
organisms using molecular phylogeny. Phylogenetic trees are very useful to biologists because
they describe the relationships among living creatures, genomes and genes.
With the development of techniques for obtaining phylogenetic data, the number of studies
depicting phylogenies has exploded (Pagel, 1999). The number of articles publishing phylogenies
based on gene-sequence information has been increasing exponentially, as Figure 2 shows (Pagel,
1999). These phylogenies cover taxonomic groups ranging from viruses to bacteria, fungi, plants
and animals (Campbell & Reece, 2008). The phylogenetic tree has thus become popular and
important for the evolutionary analysis of organisms. A phylogenetic tree is a branching
diagram that shows the evolutionary relationships of organisms (Baum, 2008). Following Darwin
(1859), evolution refers to a natural process acting on populations. It can be described as the
change in the hereditary traits of biological populations over successive generations.
A phylogeny also shows similarities and differences in physical and hereditary traits, because
taxa that join together at a node are thereby indicated to share descent from that node
(Gregory, 2008). In this respect a phylogenetic tree is similar to a family tree. Phylogenetic
trees are constructed from the similarities or differences in the physical or genetic features
of organisms. In the past, scientists used only the traditional approach to constructing
phylogenetic trees, which relied on physical features alone. Advances in technology have since
led to the accumulation of huge amounts of biological data (Wan & Che, 2013), which may change
the way biological studies are carried out in various respects.
As mentioned by Wan and Che (2013), phylogenetic trees can be built using information on
interacting pathways. They applied hierarchical clustering to two domains of organisms,
eukaryotes and prokaryotes. Using interacting pathways can make it more effective to reveal the
evolutionary relationships among species (Wan & Che, 2013). Phylogenetic trees are constructed
from a variety of evidence, generally by comparing DNA (Kaizhong, Jason T & Dennis, 1996). A
phylogenetic tree is an undirected, acyclic, connected graph. The lengths of the branches
represent the time since the groups split from each other, the interior nodes of the tree
represent ancestors, and the exterior nodes are called leaves.
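The graph view of a phylogenetic tree described above can be made concrete with a small sketch. It is an illustration in Python rather than part of the thesis system: the four-taxon tree, its node names and its branch lengths are invented for the example, and the helper names `is_tree` and `leaves` are hypothetical.

```python
from collections import deque


def is_tree(adjacency):
    """Return True if the undirected graph is connected and acyclic."""
    nodes = list(adjacency)
    if not nodes:
        return False
    # Each undirected branch is stored twice, once from each end.
    edge_count = sum(len(neighbours) for neighbours in adjacency.values()) // 2
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        for neighbour in adjacency[queue.popleft()]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    # A connected graph on n nodes is acyclic exactly when it has n - 1 edges.
    return len(seen) == len(nodes) and edge_count == len(nodes) - 1


def leaves(adjacency):
    """Exterior nodes: nodes attached to exactly one branch."""
    return sorted(node for node, nbrs in adjacency.items() if len(nbrs) == 1)


# Invented example: each node maps its neighbours to a branch length,
# which stands for the time since the two groups split.
tree = {
    "anc1": {"human": 1.0, "chimp": 1.0, "anc2": 2.0},
    "anc2": {"anc1": 2.0, "mouse": 4.0, "rat": 4.0},
    "human": {"anc1": 1.0},
    "chimp": {"anc1": 1.0},
    "mouse": {"anc2": 4.0},
    "rat": {"anc2": 4.0},
}

print(is_tree(tree), leaves(tree))
```

Here `anc1` and `anc2` play the role of ancestors (interior nodes), while the four named species are the leaves.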
Apart from constructing phylogenetic trees, a newer approach extracts phylogenetic tree data
from the published literature by means of content mining (Mounce, 2012). The term can be split
into "content" and "mining". Content can be anything, such as audio, video, metadata, text and
images. Mining refers to the large-scale extraction of information from that content.
Extracting phylogenetic tree data from the literature is a matter of content mining rather than
text mining, because the content is more than just text (Mounce, 2012).
In short, phylogenetic trees provide a framework that shows the evolution of features (Baum,
2008): related species share many similar features. Phylogenetic trees are also used in
bioprospecting, where an optimal strategy exploits phylogenetic information to target closely
related species when searching for a shared feature of interest (Kelly, Grenyer & Scotland,
2014). Phylogenetic trees are likewise useful in conservation evaluation, for choosing sets of
species that maximize the present utilitarian benefits of extant feature diversity as well as
the range of evolutionary trajectories in the future.
Problem Statement of the study
As the volume of publication databases increases, the number of published phylogenetic trees
grows with it: with the rapid accumulation of DNA sequence data, more and more phylogenetic
trees are being constructed (Pagel, 1999). Searching for relevant information has consequently
become challenging and time-consuming for researchers (Dereeper et al., 2008). The contents of
published documents are also varied and include images, audio, artwork and tables. Search
engines rely on the text or captions associated with a figure to perform a search, so a
researcher has to classify phylogenetic tree images one by one, which is challenging and wastes
time. The time that biologists spend searching for a particular phylogenetic tree may delay
their research. Furthermore, phylogenetic trees were devised to study the evolution of
organisms, but published trees are now mainly of interest to biologists for reuse. An automated
digitization application that searches for phylogenetic trees is therefore truly needed,
because it can take over the very challenging manual task of determining whether an image is a
phylogenetic tree.
The main purpose of this project is therefore to automate the digitization step of classifying
phylogenetic tree images using a machine learning algorithm. The classification determines
whether images in a PDF file or text file are phylogenetic trees or non-phylogenetic trees.
Examples of phylogenetic trees are cladograms, phenograms and tree terminology diagrams.
Examples of non-phylogenetic trees are family trees, life cycles of organisms and flow charts.
Figure 3 shows an example of a non-phylogenetic tree, a family tree (Murdoch, 2013).
Figure 3. Non-phylogenetic tree: family tree. Adapted from "Murdoch Family Tree" by W. Murdoch,
2013. Retrieved from http://www.williammurdoch.net/articles_12_Murdoch_family_tree.html
Copyright 2013 by Murdoch. Adapted with permission.
General Objective: The main objective of this research study is to employ a machine learning
algorithm that can classify images as phylogenetic trees or non-phylogenetic trees.
Specific Objective: The specific objectives of this study are
i. To employ machine learning to predict whether an image represents a phylogenetic tree.
ii. To compare and contrast the different features that represent a phylogenetic tree in an
image.
Research Question
i. Can a neural network be used to predict phylogenetic tree images?
ii. Which discriminative features can be used for classifier learning?
Definition of Terms
i. A phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the
lines of evolutionary relationship among different species, organisms or genes from a
common ancestor (Baum, 2008).
ii. Phylogeny is the evolutionary relationship between organisms (Baum, 2008).
iii. Evolutionary analysis is the foundation of phylogenetic trees and is to be read with
cautionary notes (Brinkman, 2005).
iv. Content mining is defined as mining in which figure mining, that is, the mining of
non-textual content, is a significant part (Mounce, 2014).
This research study hopes to advance knowledge on the automated digitization of phylogenetic
tree images, classifying images from a PDF file or text file as phylogenetic trees or
non-phylogenetic trees.
This research study focuses mainly on the rooted tree (cladogram) and the unrooted tree.
In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary
relationships of organisms in the form of phylogenetic trees. This project is not concerned
with the reconstruction of phylogenetic trees. Rather, it classifies images in a PDF file or
text file as phylogenetic trees or non-phylogenetic trees using a machine learning algorithm.
CHAPTER TWO
LITERATURE REVIEW
As mentioned by Mounce (2012), millions of papers are now published each year at an
ever-growing rate, including papers on phylogenetic trees. This is because the number of
species with at least partial sequence information is rapidly increasing. Phylogenetic trees
have thus become an integral part of various biological studies, given the exponential increase
in sequence data generated by classical and next-generation sequencing studies (Baum, 2008).
This chapter is divided into several sections. The first section focuses on phylogenetic trees
and explains their meaning, their purpose and their types. The next section concentrates on
image features and emphasizes the features that are suitable for the image classification
process. This section also reviews image recognition system frameworks.
Phylogenetic Tree
A phylogenetic tree, or evolution tree, is an illustrative representation of biological
entities that are associated through common descent, such as species or higher-level taxonomic
groupings (Gregory, 2008). The phylogenetic tree provides a backbone for various other
biological studies (Baum, 2008). It is a graphical representation of a hypothesis about the
evolution of a species, with branches that separated, hybridized or terminated by extinction.
Readers can read and understand the patterns of descent from a phylogenetic tree. However,
phylogenetic trees do not indicate when species evolved or how much genetic change occurred in
a lineage (Kelly, Grenyer & Scotland, 2014). Nor should it be assumed from a phylogenetic tree
that a taxon evolved from the taxon next to it.
Baum, Smith and Donovan (2005) stated that the phylogenetic tree is the most direct
illustration of the principle of common ancestry, which is why it is crucial to evolutionary
theory. Their point is that a practical understanding of what a phylogenetic tree represents is
essential for understanding the evolutionary relationships of species. Phylogenetic trees have
therefore become important in the evolutionary analysis of any species, and biologists should
make greater use of them in the biological sciences. Phylogenetic trees also provide an
efficient structure for organizing biodiversity information, and they help develop an accurate
conception of the totality of evolutionary history. It is therefore important for aspiring
biologists to develop an understanding of phylogenetic trees.
Types of Phylogenetic Tree
Phylogenetic trees can be divided into different kinds. There are two main categories: rooted
phylogenetic trees and unrooted phylogenetic trees. Apart from these two main categories, a
phylogenetic tree can be drawn in several forms: the slanted cladogram in Figure 4
(Phylogenetic tree, 2002), the rectangular cladogram in Figure 5 (Phylogenetic tree, 2002) and
the circular cladogram in Figure 6 (Phylogenetic tree, 2002). A root can be added artificially
to an unrooted tree by means of a species that unambiguously separated early from the other
species being considered (Bacardit, 2009).
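Rooting an unrooted tree with an early-diverging species (an outgroup) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the procedure of any cited work: the function name `root_with_outgroup`, the node names and the example tree are hypothetical, and the sketch assumes that no existing node is called "root".

```python
def root_with_outgroup(adjacency, outgroup):
    """Root an unrooted tree on the branch leading to the outgroup leaf.

    adjacency maps each node to the set of its neighbours. The result maps
    each node to its parent; the new node "root" has parent None.
    """
    # A leaf has exactly one neighbour; unpacking fails otherwise.
    (neighbour,) = adjacency[outgroup]
    # The new root sits on the branch between the outgroup and its neighbour.
    parent = {"root": None, outgroup: "root", neighbour: "root"}
    stack = [neighbour]
    while stack:
        node = stack.pop()
        for nxt in adjacency[node]:
            if nxt not in parent:
                # Orient every remaining branch away from the root.
                parent[nxt] = node
                stack.append(nxt)
    return parent


# Invented unrooted tree: x and y are interior nodes, O is the outgroup.
unrooted = {
    "A": {"x"},
    "B": {"x"},
    "C": {"y"},
    "O": {"y"},
    "x": {"A", "B", "y"},
    "y": {"x", "C", "O"},
}

rooted = root_with_outgroup(unrooted, "O")
print(rooted["O"], rooted["y"], rooted["x"])
```

In the rooted result, the outgroup and the rest of the tree become the two children of the new root, which matches the idea that the outgroup separated first.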
PHYLOGENETIC TREE CLASSIFICATION SYSTEM BY USING MACHINE LEARNING ALGORITHM
TAN JIA KAE
This project is submitted in partial fulfilment of the requirements for a
Bachelor of Science with Honours (Cognitive Science)
Faculty of Cognitive Sciences and Human Development UNIVERSITI MALAYSIA SARAWAK
(2015)
The project entitled Phylogenetic tree classification system by using machine learning algorithm was prepared by Tan Jia Kae and submitted to the Faculty of Cognitive Sciences and Human Development in partial fulfilment of the requirements for a Bachelor of Science with Honours (Cognitive Science)
Received for examination by
(Dr Lee Nung Kion)
Date: 5 June 2015
Grade:
ACKNOWLEDGEMENTS
First and foremost, I would like to express my deepest appreciation to my supervisor, Dr Lee
Nung Kion, for generously and patiently giving his precious time to offer me many remarks and
to share his knowledge, experience and expertise while I completed my Final Year Project.
Without his guidance, my project could not have been completed successfully in the limited time
available.
Next, I am deeply indebted to my family for their unceasing encouragement, support and
attention throughout my Final Year Project, especially during the periods when I most needed
their love to help me finish my Final Year Project thesis.
In addition, I would like to thank all my friends and course mates who supported and encouraged
me in completing this project. While working on it, I faced difficulties that tempted me to
give up. Fortunately, their advice and support gave me the strength and confidence to finish my
Final Year Project thesis.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
ABSTRAK
CHAPTER ONE INTRODUCTION
CHAPTER TWO LITERATURE REVIEW
CHAPTER THREE METHODOLOGY
CHAPTER FOUR RESULT AND DISCUSSION
CHAPTER FIVE CONCLUSION AND RECOMMENDATION
REFERENCE
APPENDIX A PHYLOGENETIC TREE CLASSIFICATION SYSTEM MATLAB CODING
LIST OF TABLES
Table 1 Phylogenetic Tree Classification Cross-validation results based on different features
Table 2 10-fold cross-validation results with 540 training data and 60 testing data each fold
LIST OF FIGURES
Figure 1 The first evolution tree diagram sketched by Darwin
Figure 2 Cumulative number of publications in Science Citation Index since 1981 that cite the terms molecular and phylogeny in the keywords or abstract
Figure 3 Non-phylogenetic tree: family tree
Figure 4 Phylogenetic rooted tree: rectangular cladogram
Figure 5 Phylogenetic rooted tree: slanted diagram
Figure 6 Phylogenetic unrooted tree: circular cladogram
Figure 7 Phylogenetic scaled tree
Figure 8 Phylogenetic unscaled tree
Figure 9 A quick review of phylogenetic tree
Figure 10 Object detection in computer perception
Figure 11 Feature Representation
Figure 12 SIFT
Figure 13 RIFT
Figure 14 Spin image
Figure 15 Pre-processing of model objects
Figure 16 Recognition of object in the scene
Figure 17 TreeRipper
Figure 18 TreeSnatcher Plus
Figure 19 Windows Snipping Toolbox
Figure 20 Original 1.png
Figure 21 After Thresholding
Figure 22 Grayscale image
Figure 23 SURF Feature Detection and Extraction
Figure 24 GIST Feature Detection and Extraction
Figure 25 10-fold cross-validation accuracy
Figure 26 Example of Graphic User Interface for the Phylogenetic Tree Image Classification system
Figure 27 Graphic User Interface for the Phylogenetic Tree Image Classification system
ABSTRACT
A study is conducted to develop an automated phylogenetic tree image classification system by
using machine learning algorithm This study adopted supervised machine learning algorithm
which is the Support Vector Machine (SVM) for classification Image data were collected from
online databases PUBMED ScienceDirect and Bioinfonnatic journals Perfonnance
comparisons of three types of features to characterize the phylogenetic tree images are presented
in this project The aim is to detennine the suitable features for the phylogenetic tree image
classification systeIlJ The leave-out one cross-validation was used to calculate the accuracy of
each feature In addition to that 10-fold cross-validation is also conducted in the evaluation Our
results show that the suitable combination features for the phylogenetic tree image classification
system are SIFT SURF and GIST The accuracy obtained from these combinations of the three
features can achieve just over 82 On the other hands the results show the average accuracy
obtained from the 10-fold cross-validation is 8150 Our evaluation results demonstrate the
utility of using SIFT SURF and GIST features for building phylogenetic tree image
classification system
Keywords phylogenetic tree image classification system image processing feature extraction
SIFT GIST SURF
VIII
ABSTRAK
Sebuah kajian telah dijalankan untuk meghasilkan sistem pengelasan automatik imej pokok
filogenetik dengan menggunakan algoritma mesin pembelajaran Kajian tersebut telah
menggunakan pembelajaran algoritma mesin diselia iaitu Mesin Vektor Sokongan (SVM) Data
imej telah dikumpulkan dari pangkalan data dalam talian PUBMED ScienceDirect dan
Bioinformatik Perbandingan antara prestasi tiga ciri-ciri pokokfilogenetik yang berbeza juga
telah ditunjukkan dalam projek ini Tujuannya adalah untuk menentukan ciri-ciri yang sesuai
untuk sistem klasifikasi pokok imej filogenetik Satu pengesahan cuti keluar salib telah
digunakan untuk mengira ketepatan bagi setiap ciri Tambahan pula 10 kali ganda silang
pengesahan akan diukurkan dalam kajian ini Hasil kajian ini telah menunjukkan bahawa cirishy
cjri gabungan yang paling sesuai bagi imej sistem klasifikasi pokokfilogenetik adalah SIFT
SURF dan GIST Ketepatan yang diperolehi daripada tiga ciri-ciri melalui gabungan boleh
memperolehi lebih daripada 8219 Selain itu hasilnya juga menunjukkan ketepatan purata
yang diperolehi daripada 10 kali ganda silang pengesahan iaitu sebanyak 8150 Hasil kajian
ini menunjukkan gabungan ciri ciri SIFT SURF dan GIST untuk melaksanakan sistem
filogenetik klasifikasi pokok ini
Kata Kunci sistem klasifikasi imej pemprosesan imej pengekstrakan ciri SIFT GIST SURF
IX
CHAPTER ONE
INTRODUCTION
Overview
It is an undeniable fact that the phylogenetic trees are diffusely used for evolutionary
analysis of different species organisms or genes from a collaborative ancestor (Laubach von
Haeseler amp Lercher 2012) According to the Brinkman (2005) evolution analysis is a collection
of expedients for ascertainment long-term phenotypical evolution which developed during the
year of 1990s Evolutionary analysis also refers to foundation of most bioinformatic analysis
which is evolution theory This is because the evolutionary analysis shows the ecological
characterization of the species that uses the concept of frequency dependence from gene theory
(Brinkman 2005) This chapter mainly discusses about the background of the study problem
statements research objectives research questions hypothesis and conceptual framework of the
study and significance of the study In addition this chapter also describes the definition of
relevant terms
Introduction
The evolutionary tree or phylogenetic tree is a visualization to show the relationship
between all entities according to the similarities and differences in their hereditary or physical
characteristics (Baum 2008) Therefore the way of phylogenetic tree shows the relationship
among the species was also important This can be reflectedby the way of phylogenetic tree to
demonstrate the evolution analysis of any species in this world Evolution analysis generally
iocludes the identification of analogous sequence diverse calibration phylogenetic rebuilding
and graphic representation or figure signification of the inferred tree (Dereeper et aI 2008)
Jbcse four terms can be explained through the biology evolutions According to Dereeper et ai
(2008) the analogous sequence is used to identify the similar sequence whereas the diverse
calibration is used to determine the difference of alignment Besides the phylogenetic rebuilding
is the process to build up the phylogenetic tree after the analogous seqence and the diverse
calibration process and then for the graphic representation or figure signification is used to show
the relationship between each species in the phylogenetic tree (Dereeper et aI 2008) This can
show that the increasing use of phylogenetic trees in biological sciences especially for biologists
who did the evolution analysis on the species Therefore the use of phylogentic tree is quite
important for the evolution analysis of life on Earth
Apart from that phylogeny is the evolutionary history of a species or group of related
species (Pagel 1999) The phylogeny can be called as the discipline of systematic classifies
organisms (Siegel-Causey Brooks amp Funk 1991) This is because phylogeny can be used to
determine organisms evolutionary relationship by systematist According to Campbell and Reece
(2008) the term systematist in this research refers to the professional who used fossil molecular
and genetic data to infer evolutionary relationships They also proposed PhyloCode which can
be used to depict the phylogenetic analysis in branching phylogenetic trees A phylogenetic
analysis presents as a collection of nodes and branch For instance the taxa that closely related
are in an evolutionary sense apppeared closely to each other whereas the taxa that distantly
related are in the different branches of the tree or there is a distance which is far from each other
in such tree
Background of the study
In the year of 1859 Darwin invented the first illustration of a phylogenetic tree (Darwin
1859) Before that shortly after his famous five years voyage as naturalist on Beagle in the year
2
2000
1000
of 1837 he sketched a tree diagram in his notebook (Darwin 1859) Based on the Figure I the
simple sketch was remarkably similar to modem diagrams of phylogenies (Darwin 1987)
9L-shy ~ ~ A 2$ ~laquo
~ r amp4 ~ lt- C ~ 7S _ ~ ~r p--~ -$ - 2gt
-z-a ~ ltZ- ~~-
~L-- F bull - L~ -~---r~ - - ~-------r rd 4=shy
Figure 1 The first evolution tree diagram sketched by Darwin Adapted from Charles Darwins
notebooks 1836-1844 Geology transmutation ospecies metaphysical enquiries (p 87) by Druwin c 1987 Cambridge Cambridge University Press Copyright 1987 by the P H Barrett (Ed) Adapted with pennission
o-l-lr=It=I-=-~=-lJ -------_ 1980 1985 1990 1995 2000
Year Figure 2 Cumulative number of publications in Science Citation index since 1981 that cite the tenn molecular and phylogeny in the keywords or abstract Adapted from Inferring the historical patterns of biological evolution by Pagel M 1999 Nature 401(6756) p 844 Copyright 1999 by the Pagel Adapted with pennission
3
First illustration of a phylogenetic tree is the first scientific argument for the theory of
advancement by means of innate selection Darwin (1998) stated that The time will come I
believe though I shall not live to see it when we shall have fairly true genealogical trees ofeach
great kingdom ofNature (p 18) In fact he mentioned that he would have the willingness to see
how modem genetics supported and confirmed by his owns ideas He provided evidence which
is not only for what had happened in the aspect of evolution but precisely how living things
evolve The forensic evidence he used for evolution was the DNA (Darwin 1859)
In fact there are few approaches used for discovering the evolution analysis of species
before the molecular phylogenetic (Campbell amp Reece 2008) In the year of 1990s
immunochemical studies were used to discover cross-reactions that stronger for closely related
organism Next in between the year of 1940s until 1960s biologists used the protein sequencing
method electrophoresis DNA hybridization and PCR that contributed to a boom in molecular
phylogeny On the other hand after publication of The Origin ofSpecies by Darwin many other
biologists came and accepted the truth of a universal Tree of Life (Darwin 1987) Then in the
late of 1970s biologists started to discover evolutionary analysis of organisms by using
molecular phylogeny One of the examples of experts from German biologists who supported
Darwins Tree of Life was the Ernst Haeckel (Larget 2011) It is very useful of using
phylogenetic trees for biologists because they can use them to describe the relations between
living creatures genomes atd genes
With the development of techniques for phylogenetic data, the number of studies depicting
phylogenies has exploded (Pagel, 1999). The number of articles publishing phylogenies based on
gene-sequence information has been increasing exponentially, and Figure 2 shows this growth
(Pagel, 1999). The phylogenies cover taxonomic groups ranging from viruses to bacteria, fungi,
plants and animals (Campbell & Reece, 2008). Thus, the phylogenetic tree has become popular and
important for the evolutionary analysis of organisms. The phylogenetic tree is a branching
diagram that shows the evolutionary relationships of organisms (Baum, 2008). According to Darwin
(1859), evolution is a natural process that acts on populations. It can be described as the
change in the hereditary traits of biological populations over successive generations.
On the other hand, a phylogeny can show the similarities and differences in physical and
hereditary traits. This is because taxa that are joined together in the tree are implied to have
descended from a common ancestor, represented by a node (Gregory, 2008). In this sense a
phylogenetic tree is similar to a family tree. Moreover, phylogenetic trees are constructed from
the similarities or differences in the physical or genetic features of organisms. In the past,
scientists used only the traditional approach, which constructed phylogenetic trees from physical
features alone. The advancement of technology has since led to the accumulation of huge amounts
of biological data (Wan & Che, 2013). This is changing the way biological studies are carried out
in various respects.
As mentioned by Wan and Che (2013), phylogenetic trees can be built from information on
interacting pathways. They applied hierarchical clustering to two domains of organisms,
eukaryotes and prokaryotes. Using interacting pathways can make the evolutionary relationships of
species easier to reveal (Wan & Che, 2013). A phylogenetic tree can be constructed from a variety
of evidence, most commonly by comparing DNA (Kaizhong, Jason T., & Dennis, 1996). Formally, it is
an undirected, acyclic, connected graph. The lengths of the branches represent the time since the
groups split from each other, and the interior nodes of the tree represent ancestors. The
exterior nodes are called leaves.
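The graph description above can be made concrete with a short sketch. The Python code below is
an illustration only and is not part of the system built in this project. It assumes the tree is
stored as an adjacency mapping, and the taxon names and the helper names is_tree and leaves are
hypothetical. It checks that a graph is connected and acyclic and then lists its leaves.

```python
from collections import deque


def is_tree(adjacency):
    """Return True if the undirected graph is connected and acyclic."""
    nodes = list(adjacency)
    if not nodes:
        return False
    # Each undirected branch is stored twice, once per endpoint.
    edge_count = sum(len(neighbours) for neighbours in adjacency.values()) // 2
    # Breadth-first search from an arbitrary node to test connectivity.
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    # A connected graph on n nodes is acyclic exactly when it has n - 1 edges.
    return len(seen) == len(nodes) and edge_count == len(nodes) - 1


def leaves(adjacency):
    """Return the exterior nodes, i.e. nodes attached to exactly one branch."""
    return sorted(node for node, nbrs in adjacency.items() if len(nbrs) == 1)


# Hypothetical four-taxon tree: two interior (ancestor) nodes and four leaves.
example_tree = {
    "ancestor_1": ["human", "chimp", "ancestor_2"],
    "ancestor_2": ["ancestor_1", "mouse", "rat"],
    "human": ["ancestor_1"],
    "chimp": ["ancestor_1"],
    "mouse": ["ancestor_2"],
    "rat": ["ancestor_2"],
}

print(is_tree(example_tree))
print(leaves(example_tree))
```

Adding one extra branch between two leaves would create a cycle, and is_tree would then return
False, which is the property that separates a tree from a general network diagram.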
Apart from constructing phylogenetic trees, a newer approach extracts phylogenetic tree data
from the published literature using content mining (Mounce, 2012). The term can be explained in
two parts, content and mining. Content can include anything, such as audio, video, metadata, text
and images, while mining refers to the large-scale extraction of information from that content.
Extracting phylogenetic tree data from the literature relies on content mining rather than text
mining alone, because the content is more than just text (Mounce, 2012).
In short, phylogenetic trees provide a framework that shows the evolution of features
(Baum, 2008), since related species share many features in common. Phylogenetic trees are also
used in bio-prospecting, an optimal strategy that exploits phylogenetic information to target
closely related species when searching for a shared feature of interest (Kelly, Grenyer, &
Scotland, 2014). Phylogenetic trees are likewise useful in conservation evaluation, for choosing
sets of species that maximize the present utilitarian benefits of extant feature diversity as
well as the range of evolutionary trajectories in the future.
Problem Statement of the study
As the volume of publication databases increases, the number of published phylogenetic trees
grows as well. This is because, with the rapid accumulation of DNA sequence data, more and more
phylogenetic trees are being constructed (Pagel, 1999). As a result, searching for relevant
information is challenging and time consuming for a researcher (Dereeper et al., 2008). In
addition, published documents contain various types of content, such as images, audio, art and
tables. Search engines rely on the text or captions associated with a figure to perform a search,
so a researcher must still classify candidate phylogenetic tree images one by one, which is
challenging and wasteful of time. If searching for a particular phylogenetic tree is this
difficult and time consuming, it may delay a biologist's research work. Furthermore, phylogenetic
trees were devised to study the evolution of organisms, and published trees are now mainly reused
by other biologists. An automated application that searches for phylogenetic trees on their
behalf is therefore truly needed, because it can take over this demanding manual task and
determine whether an image is a phylogenetic tree.
Therefore, the main purpose of this project is to automate the classification of phylogenetic
tree images using a machine learning algorithm. The classification focuses on determining whether
images in a PDF file or text file are phylogenetic trees or non-phylogenetic trees. Examples of
phylogenetic trees are the cladogram, the phenogram and tree terminology. Examples of
non-phylogenetic trees are the family tree, the life cycle of an organism and the flow chart.
Figure 3 shows one non-phylogenetic tree, a family tree (Murdoch, 2013).
Figure 3. Non-phylogenetic tree: family tree. Adapted from "Murdoch Family Tree," by W. Murdoch, 2013. Retrieved from http://www.williammurdoch.net/articles_12_Murdoch_family_tree.html
Copyright 2013 by Murdoch. Adapted with permission.
General Objective
The main objective of this research study is to employ a machine learning algorithm that can
classify images as phylogenetic trees or non-phylogenetic trees.
Specific Objective
The specific objectives of this study are:
i. To employ machine learning to predict whether an image represents a phylogenetic tree.
ii. To compare and contrast the different features that represent a phylogenetic tree in an
image.
Research Question
i. Can a neural network be used for the prediction of phylogenetic tree images?
ii. What discriminative features can be used for classifier learning?
Definition of Terms
i. A phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the
lines of evolutionary relationship among different species, organisms or genes descended
from a common ancestor (Baum, 2008).
ii. Phylogeny is the evolutionary relationship between organisms (Baum, 2008).
iii. Evolution analysis is the fundamental basis of phylogenetic trees, with cautionary notes
(Brinkman, 2005).
iv. Content mining is defined as a significant part of figure mining, which deals with
non-textual content (Mounce, 2014).
This research study hopes to advance knowledge on the automated classification of images from
PDF files or text files as phylogenetic trees or non-phylogenetic trees.
This research study is mainly focused on the rooted tree (cladogram) and the unrooted tree.
In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary
relationships of organisms in the form of phylogenetic trees. This project is not concerned with
the reconstruction of phylogenetic trees. Rather, it classifies images in PDF files or text files
as phylogenetic trees or non-phylogenetic trees using a machine learning algorithm.
CHAPTER TWO
LITERATURE REVIEW
As mentioned by Mounce (2012), millions of papers have recently been published each year, at an
ever-growing rate, including work on phylogenetic trees. This is because the amount and
diversity of species with at least partial sequence information is rapidly increasing. Thus,
phylogenetic trees have become an integral part of various biological studies, given the
exponential increase in sequence data generated by classical and next-generation sequencing
studies (Baum, 2008). This chapter is divided into a few sections. The first section focuses on
phylogenetic trees and explains their meaning, purpose and types. The next section concentrates
on image features and emphasizes the features that are suitable for the image classification
process. This section reviews image recognition system frameworks as well.
Phylogenetic Tree
A phylogenetic tree, or evolutionary tree, is an illustrative representation of biological
entities that are associated through common descent, such as species or higher-level taxonomic
groupings (Gregory, 2008). Phylogenetic trees form a backbone for various other biological
studies (Baum, 2008). A tree is therefore a graphical representation of a hypothesis about the
evolution of a species, with branches that separate, hybridize or terminate in extinction.
Readers can read and understand patterns of descent from phylogenetic trees. However,
phylogenetic trees do not indicate when species evolved or how much genetic change occurred in a
lineage (Kelly, Grenyer, & Scotland, 2014). In addition, a phylogenetic tree should not be read
as showing that a taxon evolved from the taxon next to it.
Baum, Smith and Donovan (2005) stated that the phylogenetic tree is the most direct
representation of the principle of common ancestry, which makes it crucial for evolutionary
theory. They stress that a practical understanding of what a phylogenetic tree represents is
important for understanding the evolutionary relationships of species. Thus, phylogenetic trees
are important in the evolutionary analysis of any species, and biologists should increase their
use of phylogenetic trees in the biological sciences. Phylogenetic trees also provide an
efficient structure for organizing biodiversity information, and they help develop an accurate
conception of the totality of evolutionary history. Therefore, it is important for aspiring
biologists to develop an understanding of phylogenetic trees.
Types of Phylogenetic Tree
Phylogenetic trees can be divided into different kinds. There are two main categories, rooted
phylogenetic trees and unrooted phylogenetic trees. Apart from these two main categories, a
phylogenetic tree can be drawn in several forms: the slanted cladogram, Figure 4 (Phylogenetic
tree, 2002), the rectangular cladogram, Figure 5 (Phylogenetic tree, 2002), and the circular
cladogram, Figure 6 (Phylogenetic tree, 2002). Roots can be artificially added to unrooted trees
by means of a species that unambiguously separated early from the species being considered
(Bacardit, 2009).
The project entitled Phylogenetic tree classification system by using machine learning algorithm was prepared by Tan Jia Kae and submitted to the Faculty of Cognitive Sciences and Human Development in partial fulfilment of the requirements for a Bachelor of Science with Honours (Cognitive Science)
Received for examination by
(Dr Lee Nung Kion)
Date: 5 June 2015
Grade:
ACKNOWLEDGEMENTS
First and foremost, I would like to take this opportunity to express my deepest appreciation to
my supervisor, Dr Lee Nung Kion, for generously and patiently spending his precious time giving
me many remarks and sharing his superior knowledge, experience and expertise while I completed
my Final Year Project. Without his guidance, my project would not have been completed
successfully in the limited time available.
Next, I am deeply indebted to my family for their unceasing encouragement, support and attention
throughout my Final Year Project, especially during the periods when I really needed their love
to help me finish my Final Year Project thesis.
In addition, I would like to thank all my friends and course mates who supported and encouraged
me in completing this project. During the project I faced some difficulties that tempted me to
give up. Luckily, they gave me plenty of advice and support, which gave me the strength and
confidence to finish my Final Year Project thesis.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
ABSTRAK
CHAPTER ONE INTRODUCTION
CHAPTER TWO LITERATURE REVIEW
CHAPTER THREE METHODOLOGY
CHAPTER FOUR RESULT AND DISCUSSION
CHAPTER FIVE CONCLUSION AND RECOMMENDATION
REFERENCE
APPENDIX A PHYLOGENETIC TREE CLASSIFICATION SYSTEM MATLAB CODING
LIST OF TABLES
Table 1 Phylogenetic Tree Classification Cross-validation results based on different features
Table 2 10-fold cross-validation results with 540 training data and 60 testing data each fold
LIST OF FIGURES
Figure 1 The first evolution tree diagram sketched by Darwin
Figure 2 Cumulative number of publications in Science Citation Index since 1981 that cite the terms molecular and phylogeny in the keywords or abstract
Figure 3 Non-phylogenetic tree: family tree
Figure 4 Phylogenetic rooted tree: rectangular cladogram
Figure 5 Phylogenetic rooted tree: slanted diagram
Figure 6 Phylogenetic unrooted tree: circular cladogram
Figure 7 Phylogenetic scaled tree
Figure 8 Phylogenetic unscaled tree
Figure 9 A quick review of phylogenetic tree
Figure 10 Object detection in computer perception
Figure 11 Feature Representation
Figure 12 SIFT
Figure 13 RIFT
Figure 14 Spin image
Figure 15 Pre-processing of model objects
Figure 16 Recognition of object in the scene
Figure 17 TreeRipper
Figure 18 TreeSnatcher Plus
Figure 19 Windows Snipping Toolbox
Figure 20 Original 1.png
Figure 21 After Thresholding
Figure 22 Grayscale image
Figure 23 SURF Feature Detection and Extraction
Figure 24 GIST Feature Detection and Extraction
Figure 25 10-fold cross-validation accuracy
Figure 26 Example of Graphic User Interface for the Phylogenetic Tree Image Classification system
Figure 27 Graphic User Interface for the Phylogenetic Tree Image Classification system
ABSTRACT
A study was conducted to develop an automated phylogenetic tree image classification system
using a machine learning algorithm. The study adopted a supervised machine learning algorithm,
the Support Vector Machine (SVM), for classification. Image data were collected from the online
databases PUBMED and ScienceDirect and from Bioinformatics journals. Performance comparisons of
three types of features for characterizing phylogenetic tree images are presented in this
project. The aim is to determine the suitable features for the phylogenetic tree image
classification system. Leave-one-out cross-validation was used to calculate the accuracy of each
feature. In addition, 10-fold cross-validation was conducted in the evaluation. Our results show
that the suitable combination of features for the phylogenetic tree image classification system
is SIFT, SURF and GIST. The accuracy obtained from the combination of the three features is just
over 82%. In addition, the average accuracy obtained from the 10-fold cross-validation is
81.50%. Our evaluation results demonstrate the utility of using SIFT, SURF and GIST features for
building a phylogenetic tree image classification system.
Keywords: phylogenetic tree, image classification system, image processing, feature extraction,
SIFT, GIST, SURF
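The two evaluation schemes named in the abstract can be sketched as index arithmetic. The Python
code below is a hedged illustration and not the MATLAB code of Appendix A. It assumes 600 images
in total, inferred from the 540 training and 60 testing images per fold listed for Table 2, and
it assumes contiguous, unshuffled folds. The function names k_fold_indices and mean_accuracy are
hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def mean_accuracy(fold_accuracies):
    """Average the per-fold accuracies into one cross-validation score."""
    return sum(fold_accuracies) / len(fold_accuracies)


# Assumed data set size: 600 images, so each of 10 folds tests on 60 images.
folds = list(k_fold_indices(600, 10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))

# Leave-one-out cross-validation is the special case k == n_samples.
loo_folds = list(k_fold_indices(5, 5))
print([test for _, test in loo_folds])
```

In practice the images would be shuffled before splitting so that each fold holds a mix of
phylogenetic and non-phylogenetic images. The sketch leaves that step out to keep the fold
arithmetic visible.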
ABSTRAK
A study was conducted to produce an automated phylogenetic tree image classification system
using a machine learning algorithm. The study used a supervised machine learning algorithm,
namely the Support Vector Machine (SVM). Image data were collected from the online databases
PUBMED, ScienceDirect and Bioinformatics. A comparison of the performance of three different
phylogenetic tree features is also presented in this project. The aim is to determine the
suitable features for the phylogenetic tree image classification system. Leave-one-out
cross-validation was used to calculate the accuracy of each feature. In addition, 10-fold
cross-validation is measured in this study. The results show that the most suitable combination
of features for the phylogenetic tree image classification system is SIFT, SURF and GIST. The
accuracy obtained from the combination of the three features exceeds 82.19%. In addition, the
results show that the average accuracy obtained from the 10-fold cross-validation is 81.50%. The
results demonstrate the use of the combined SIFT, SURF and GIST features to implement this
phylogenetic tree classification system.
Keywords: image classification system, image processing, feature extraction, SIFT, GIST, SURF
CHAPTER ONE
INTRODUCTION
Overview
It is an undeniable fact that phylogenetic trees are widely used for the evolutionary analysis
of different species, organisms or genes descended from a common ancestor (Laubach, von
Haeseler, & Lercher, 2012). According to Brinkman (2005), evolution analysis is a collection of
techniques for studying long-term phenotypic evolution that was developed during the 1990s.
Evolutionary analysis also rests on evolutionary theory, which is the foundation of most
bioinformatic analysis. It characterizes species ecologically and incorporates the concept of
frequency dependence from game theory (Brinkman, 2005). This chapter mainly discusses the
background of the study, problem statement, research objectives, research questions, hypothesis
and conceptual framework of the study, and significance of the study. In addition, this chapter
defines the relevant terms.
Introduction
The evolutionary tree, or phylogenetic tree, is a visualization of the relationships between
entities according to the similarities and differences in their hereditary or physical
characteristics (Baum, 2008). How a phylogenetic tree shows the relationships among species is
therefore important, as reflected in the way such trees are used to demonstrate the evolutionary
analysis of any species. Evolution analysis generally includes the identification of homologous
sequences, their multiple alignment, the phylogenetic reconstruction and the graphical
representation of the inferred tree (Dereeper et al., 2008). These four steps can be explained as
follows. According to Dereeper et al. (2008), the identification of homologous sequences finds
similar sequences, whereas multiple alignment lines them up so that their differences can be
determined. Phylogenetic reconstruction builds the phylogenetic tree once these two steps are
complete, and the graphical representation shows the relationship between the species in the tree
(Dereeper et al., 2008). This reflects the increasing use of phylogenetic trees in the biological
sciences, especially by biologists who analyse the evolution of species. The phylogenetic tree is
therefore important for the evolutionary analysis of life on Earth.
Apart from that, a phylogeny is the evolutionary history of a species or group of related
species (Pagel, 1999). Phylogeny is central to systematics, the discipline that classifies
organisms (Siegel-Causey, Brooks, & Funk, 1991), because systematists use it to determine the
evolutionary relationships of organisms. According to Campbell and Reece (2008), a systematist
is a professional who uses fossil, molecular and genetic data to infer evolutionary
relationships. They also mention the PhyloCode, which can be used to depict phylogenetic
analyses as branching phylogenetic trees. A phylogenetic analysis is presented as a collection
of nodes and branches. For instance, taxa that are closely related in an evolutionary sense
appear close to each other, whereas taxa that are distantly related lie on different branches of
the tree, far from each other.
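The idea that closely related taxa sit close together in the tree can be illustrated by counting
the branches on the path between two taxa. The Python sketch below is an illustration only. It
assumes an unweighted tree stored as an adjacency mapping, and the taxon names and the function
name branch_distance are hypothetical. Branch lengths, which a scaled tree would carry, are
ignored.

```python
from collections import deque


def branch_distance(adjacency, start, goal):
    """Count the branches on the path between two nodes of a tree."""
    distances = {start: 0}
    queue = deque([start])
    # Breadth-first search visits nodes in order of increasing branch count.
    while queue:
        current = queue.popleft()
        if current == goal:
            return distances[current]
        for neighbour in adjacency[current]:
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    raise ValueError("the two nodes are not connected")


# Hypothetical tree: human and chimp share one node, mouse and rat another.
toy_tree = {
    "node_a": ["human", "chimp", "node_b"],
    "node_b": ["node_a", "mouse", "rat"],
    "human": ["node_a"],
    "chimp": ["node_a"],
    "mouse": ["node_b"],
    "rat": ["node_b"],
}

print(branch_distance(toy_tree, "human", "chimp"))
print(branch_distance(toy_tree, "human", "mouse"))
```

In this toy tree the two taxa that share a node are separated by fewer branches than taxa on
different branches, which matches the description above.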
Background of the study
In 1859, Darwin published the first illustration of a phylogenetic tree (Darwin, 1859). Before
that, in 1837, shortly after his famous five-year voyage as naturalist on the Beagle, he sketched
a tree diagram in his notebook (Darwin, 1859). As Figure 1 shows, the simple sketch is remarkably
similar to modern diagrams of phylogenies (Darwin, 1987).
Figure 1. The first evolution tree diagram sketched by Darwin. Adapted from Charles Darwin's notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by C. Darwin, 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.). Adapted with permission.
Year Figure 2 Cumulative number of publications in Science Citation index since 1981 that cite the tenn molecular and phylogeny in the keywords or abstract Adapted from Inferring the historical patterns of biological evolution by Pagel M 1999 Nature 401(6756) p 844 Copyright 1999 by the Pagel Adapted with pennission
3
First illustration of a phylogenetic tree is the first scientific argument for the theory of
advancement by means of innate selection Darwin (1998) stated that The time will come I
believe though I shall not live to see it when we shall have fairly true genealogical trees ofeach
great kingdom ofNature (p 18) In fact he mentioned that he would have the willingness to see
how modem genetics supported and confirmed by his owns ideas He provided evidence which
is not only for what had happened in the aspect of evolution but precisely how living things
evolve The forensic evidence he used for evolution was the DNA (Darwin 1859)
In fact there are few approaches used for discovering the evolution analysis of species
before the molecular phylogenetic (Campbell amp Reece 2008) In the year of 1990s
immunochemical studies were used to discover cross-reactions that stronger for closely related
organism Next in between the year of 1940s until 1960s biologists used the protein sequencing
method electrophoresis DNA hybridization and PCR that contributed to a boom in molecular
phylogeny On the other hand after publication of The Origin ofSpecies by Darwin many other
biologists came and accepted the truth of a universal Tree of Life (Darwin 1987) Then in the
late of 1970s biologists started to discover evolutionary analysis of organisms by using
molecular phylogeny One of the examples of experts from German biologists who supported
Darwins Tree of Life was the Ernst Haeckel (Larget 2011) It is very useful of using
phylogenetic trees for biologists because they can use them to describe the relations between
living creatures genomes atd genes
With the development of phylogenetic data technique there are the numbers of studies
depicting phylogenetic exploded (Pagel 1999) The number of articles publishing phylogenies
based on gene-sequence information has been increasing exponentially Figure 2 shows the data
aoalysis by using the phylogenetic tree (Pagel 1999) The phylogenies taxonomic group ranging
4
Pu~at Khidmat MaklulDlt Akademillt UN1VERSm MALAYSIA SUAWA)(
from viruses to bacteria fungi plans and animals (Campbell amp Reece 2008) Thus the
phylogenetic tree becomes popular and important for the evolutionary analysis of organisms
nowadays The phylogenetic tree is a branching diagram that shows the evolutionary relationship
of the organisms (Baum D 2008) Based on Darwin (1859) evolution refers to a natural
procedure to infer about the populations It can be described as the platfonn to show the
transformation in the hereditary traits of biological population over continuous generation
On the other hand phylogeny can show the similarities and differences in physical and
hereditary traits This is because there are the taxa that can attach together in the affinnation
which indicated to posse descendant from a node (Gregory 2008) Thus phylogenetic tree can
be concluded that it was similar to a family tree Moreover the construction of phylogenetic
trees is based on the similarities or differences of their physical or genetic features Few years
ago the scientists only used the tradition way which only focused on physical features of
constructing phylogenetic trees Luckily the advancement of high technologies has been led to
accumulation of huge amounts of biological data (Wan amp Che 2013) This may lead to the
changing towards the way of biological studies in various aspects
As mentioned by Wan and Che (2013) building phylogenetic trees can use the
information of interacting pathways They did apply the hierarchical clustering on two domains
of organisms which were eukaryotes and prokaryotes Using interacting pathway can increase
the effectiveness on revearing evolutionary relationships ofthe species (Wan amp Che 2013)
Phylogenetic tree was constructed using variety evidence such as generally comparing DNA
(Kaizhong Jason T amp Dennis 1996) It was an undirected acyclic connected graph Basically
the lengths of branches represented time since the groups split from each other and the node for
he tree is known as ancestors The set of exterior nodes are called leaves
5
Apart from constructing the phylogenetic tree the new approach nowadays can extract
the phylogenetic tree data from the literacture review In fact it is using the content mining to
extract the data from the literature review (Mounce 2012) Content mining can be split into
content and mining in explanation Content can be included anything such as the audio video
metadata text and image Besides the mining shows the huge number of data information
extraction from the content Extracting phylogenetic tree data from literacture review uses more
content mining than text mining because the content was more than just text (Mounce 2012)
In short phylogenetic trees provides a framework that shows the evolution of features
(Baum D 2008) This shows that the related species shared in many common of similar
features Next the phylogenetic trees also uses in bio-prospecting which is an optimal strategy
that exploited phylogenetic information to target closely related species to search for shared
feature of interest (Kelly Grenyer amp Scotland 2014) This shows that related species can search
for shared features in common Therefore the phylogenetic trees are useful for conservation
evaluation in choosing sets of species that can maximized the present utilitarian benefits of
extant feature diversity as well as the range of evolutionary trajectories in the future
Problem Statement of the study
With the increase volume of publication databases volume of the phylogenetic trees is
getting bigger It is because with the rapid accumulation of DNA sequence data more and more ~
phylogenetic trees are being constructed (Pagel 1999) It is technically leads to challenge and
time consuming for a researcher to search for relevant information (Dereeper et aI 2008) Next
the types of contents in these published documents are various such as images audio arts and
tables Search engines rely on texts or captions are often associated with a figure to perform a
search This makes the classification of the phylogenetic trees image one by one by the
6
researcher becoming challenging and waste of time Moreover if the biologist becomes
challenging and time consuming when searching for the particular phylogenetic tree this may
delay their research works Furtermore the purpose for the invented phylogentic trees is to study
the evolution analysis of the organisms Nowadays the presented phylogenetic tree mainly is
used to reuse purpose for those biologists Therefore the use of automated digization application
to search the phylogenetic trees for them is truthly needed It is because this can replace the very
challenging task of human works and determine whether an image is a phylogenetic tree
Therefore the main purpose of conducting this project is to do the automated digitation
of phylogenetic tree image classification by using machine learning algorithm This classification
is mainly focusing on the classification the images in pdf file or text file whether they are
phylogenetic tree or non-phylogenetic trees The examples of phylogenetic tree are cladogram
phenogram and tree terminology On the other hands the examples of non-phylogenetic trees are
the family tree life cycle of organisms and flow chart Figure 3 shows the pictures of non-
phylogenetic trees- family tree (Murdoch 2013)
7
Ebenezer Murdoch - CID - Riddick Grizel John Young shy CID shy Ann lowden 1761-1806 i 1761-1834 Mason II Shoemaker I I
Samuel lowden Murdoch -shyCID -- shy Jane Young 1784-1830 1787-1879 Shoemaker
John Muir Ebenezer Andrew McCulloch Jane John Coupland Margaret Murdoch Murdoch Murdoch Murdoch Murdoch Murdoch
1809 1810 -1864 1812 - l860 1816 - 1894 1818-1879 1820- Infant death C Uln Captai~ Shoemaker
James Murdoch shy CID shy Agnes Cumming
Mary Murdoch
1841-1929
1814 - 1900 ClJplaln
Jane Murdoch
1848-1924
Jeanie Muirhead - CID shy Samuel Murdoch -1914 I 1843-1917
Mil5UMaf1ller
1865middot1869 1867 1906 1870middot1916 Wntdeath Chimist
1873 - 1912 1e oftiagtr01 the TI14R1C
~tn these ApI~ 191 2
Agnes Murdoch
1850-1944
1818-1891
William Murdoch 1856-1906
John Murdoch lS57 -1907
uptain Iltolaquoxr
I~ tilt sea 101 In rlie ~ Ap111906 Apr~ 1907
Margaret Elisabeth Murdoch 1882 -1973
teacher headmislress
Samuel Jr - CID shy ~artha Murdoch Patience Scott
1880middot1950 Merchant
1891 middot1976
Samuel Scott Murdoch
Grizzel Samuel Alexander Charles Donaldson Murdoch Murdoch Murdoch Murdoch
1822 -1877 1824 -1888 1827 - 1868 1829 - 1860 Shoemaker uDtain uptltn
OJowrerlln ~nt Nwy
HI~ cxItnl~ ~
Figure 1 Non-phylogenetic tree- family tree Adapted from Murdoch Family Tree by Murdoch W 2013 Retrieved from httpwwwwilliammurdochnetJarticles_12_Murdoch_family_treehtml
Copyright 2013 by the Murdoch Adapted with permission
8
General Objective. The main objective of this research study is to employ a machine learning algorithm that can classify images as phylogenetic trees or non-phylogenetic trees.
Specific Objectives. The specific objectives of this study are:
i. To employ a machine learning algorithm that can predict whether a phylogenetic tree is represented in an image.
ii. To compare and contrast the different features that represent a phylogenetic tree in an image.
Research Questions
i. Can a neural network be used to predict phylogenetic tree images?
ii. Which discriminative features can be used for classifier learning?
Definition of Terms
i. A phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the lines of evolutionary relationship of different species, organisms or genes from a common ancestor (Baum, D., 2008).
ii. Phylogeny is the evolutionary relationship between organisms (Baum, D., 2008).
iii. Evolution analysis is the fundamental analysis underlying phylogenetic trees, and it comes with cautionary notes (Brinkman, 2005).
iv. Content mining is defined as mining in which figure mining, that is, the mining of non-textual content, is a significant part (Mounce, 2014).
This research study hopes to advance knowledge on the automated digitization of phylogenetic tree images, in which images from PDF or text files are classified as phylogenetic trees or non-phylogenetic trees.
This research study is mainly focused on the rooted tree (cladogram) and the unrooted tree.
In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary relationships of organisms in the form of phylogenetic trees. This project is not concerned with the reconstruction of phylogenetic trees. Rather, it uses a machine learning algorithm to classify images in PDF or text files as phylogenetic trees or non-phylogenetic trees.
CHAPTER TWO
LITERATURE REVIEW
As mentioned by Mounce (2012), millions of papers about phylogenetic trees are now published each year at an ever-growing rate. This is because the number of species with at least partial sequence information is increasing rapidly. Thus, phylogenetic trees have become an integral part of various biological studies, driven by the exponential increase in sequence data generated by classical and next-generation sequencing studies (Baum, D., 2008). This chapter is divided into a few sections. The first section focuses on phylogenetic trees and explains their meaning, purpose and types. The next section concentrates on image features and emphasizes the features that are suitable for the image classification process. In addition, it reviews image recognition system frameworks.
Phylogenetic Tree
A phylogenetic tree, or evolutionary tree, is an illustrative representation of biological entities that are connected by common descent, such as species or higher-level taxonomic groups (Gregory, 2008). Phylogenetic trees form a backbone for various other biological studies (Baum, 2008). A phylogenetic tree is therefore a graphical representation of a hypothesis about the evolution of a species, with branches that separate, hybridize or terminate in extinction. Readers can read and understand the patterns of descent from a phylogenetic tree. Phylogenetic trees do not, however, indicate when species evolved or how much genetic change occurred in a lineage (Kelly, Grenyer, & Scotland, 2014). Likewise, it should not be assumed from a phylogenetic tree that a taxon evolved from the taxon next to it.
Baum, Smith and Donovan (2005) stated that the phylogenetic tree is the most direct representation of the principle of common ancestry, which makes it crucial for evolutionary theory. Their point is that a practical understanding of what a phylogenetic tree represents is essential for understanding the evolutionary relationships of species. Phylogenetic trees are therefore important in the evolution analysis of any species, and biologists should make greater use of them in the biological sciences. Phylogenetic trees also provide an efficient structure for organizing biodiversity information, and they help to develop an accurate conception of the totality of evolutionary history. It is therefore important for aspiring biologists to develop an understanding of phylogenetic trees.
Types of Phylogenetic Tree
Phylogenetic trees can be divided into different kinds. There are two main categories: rooted phylogenetic trees and unrooted phylogenetic trees. Apart from these two main categories, a phylogenetic tree can be drawn in several forms: the slanted cladogram (Figure 4; Phylogenetic tree, 2002), the rectangular cladogram (Figure 5; Phylogenetic tree, 2002) and the circular cladogram (Figure 6; Phylogenetic tree, 2002). A root can be added artificially to an unrooted tree by means of a species that unambiguously separated early from the other species being considered (Bacardit, 2009).
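Rooting an unrooted tree with an early-diverging species (an outgroup) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any cited work: the tree is assumed to be stored as an adjacency dictionary, the taxon names A to D and the internal nodes X and Y are hypothetical, and the function name `root_with_outgroup` and the node label "root" are invented for the example.

```python
def root_with_outgroup(adjacency, outgroup):
    """Root an unrooted tree on the branch leading to the outgroup leaf.

    Returns a dict mapping each node to its list of children, with a new
    node named "root" (assumed not to clash with an existing node name)
    placed between the outgroup and the remaining taxa.
    """
    (neighbour,) = adjacency[outgroup]  # a leaf has exactly one neighbour
    children = {"root": [outgroup, neighbour], outgroup: []}
    # Walk away from the root, orienting every branch from parent to child.
    stack = [(neighbour, outgroup)]
    while stack:
        node, parent = stack.pop()
        children[node] = [n for n in adjacency[node] if n != parent]
        stack.extend((child, node) for child in children[node])
    return children


# Hypothetical unrooted tree: taxa A-D joined through internal nodes X and Y.
unrooted = {
    "A": ["X"], "B": ["X"],
    "C": ["Y"], "D": ["Y"],
    "X": ["A", "B", "Y"],
    "Y": ["C", "D", "X"],
}

rooted = root_with_outgroup(unrooted, "D")
print(rooted["root"])  # ['D', 'Y']
```

Choosing D as the outgroup places the root on the branch between D and the rest of the tree, so every other taxon becomes a descendant of node Y.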
ACKNOWLEDGEMENTS
First and foremost, I would like to take this opportunity to express my deepest appreciation to my supervisor, Dr Lee Nung Kion, for generously and patiently spending his precious time giving me remarks and sharing his knowledge, experience and expertise while I completed my Final Year Project. Without his guidance, my project could not have been completed successfully within the limited time.
Next, I am deeply indebted to my family for their unceasing encouragement, support and attention throughout my Final Year Project, especially during the periods when I most needed their love to help me finish my Final Year Project thesis.
In addition, I would like to thank all my friends and course mates who supported and encouraged me in completing this project. During the project I faced difficulties that tempted me to give up. Luckily, their advice and support gave me the strength and confidence to finish my Final Year Project thesis.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
ABSTRAK
CHAPTER ONE INTRODUCTION
CHAPTER TWO LITERATURE REVIEW
CHAPTER THREE METHODOLOGY
CHAPTER FOUR RESULT AND DISCUSSION
CHAPTER FIVE CONCLUSION AND RECOMMENDATION
REFERENCE
APPENDIX A PHYLOGENETIC TREE CLASSIFICATION SYSTEM MATLAB CODING
LIST OF TABLES
Table 1. Phylogenetic tree classification: cross-validation results based on different features
Table 2. 10-fold cross-validation results with 540 training data and 60 testing data in each fold
LIST OF FIGURES
Figure 1. The first evolution tree diagram sketched by Darwin
Figure 2. Cumulative number of publications in the Science Citation Index since 1981 that cite the terms "molecular" and "phylogeny" in the keywords or abstract
Figure 3. Non-phylogenetic tree: family tree
Figure 4. Phylogenetic rooted tree: rectangular cladogram
Figure 5. Phylogenetic rooted tree: slanted diagram
Figure 6. Phylogenetic unrooted tree: circular cladogram
Figure 7. Phylogenetic scaled tree
Figure 8. Phylogenetic unscaled tree
Figure 9. A quick review of phylogenetic tree
Figure 10. Object detection in computer perception
Figure 11. Feature representation
Figure 12. SIFT
Figure 13. RIFT
Figure 14. Spin image
Figure 15. Pre-processing of model objects
Figure 16. Recognition of object in the scene
Figure 17. TreeRipper
Figure 18. TreeSnatcher Plus
Figure 19. Windows Snipping Toolbox
Figure 20. Original 1.png
Figure 21. After thresholding
Figure 22. Grayscale image
Figure 23. SURF feature detection and extraction
Figure 24. GIST feature detection and extraction
Figure 25. 10-fold cross-validation accuracy
Figure 26. Example of Graphic User Interface for the Phylogenetic Tree Image Classification system
Figure 27. Graphic User Interface for the Phylogenetic Tree Image Classification system
ABSTRACT
A study was conducted to develop an automated phylogenetic tree image classification system using a machine learning algorithm. The study adopted a supervised machine learning algorithm, the Support Vector Machine (SVM), for classification. Image data were collected from the online databases PUBMED, ScienceDirect and Bioinformatics journals. Performance comparisons of three types of features used to characterize phylogenetic tree images are presented in this project. The aim is to determine the suitable features for the phylogenetic tree image classification system. Leave-one-out cross-validation was used to calculate the accuracy of each feature. In addition, 10-fold cross-validation was conducted in the evaluation. Our results show that the suitable combination of features for the phylogenetic tree image classification system is SIFT, SURF and GIST. The accuracy obtained from the combination of these three features is just over 82%. The average accuracy obtained from the 10-fold cross-validation is 81.50%. Our evaluation results demonstrate the utility of SIFT, SURF and GIST features for building a phylogenetic tree image classification system.
Keywords: phylogenetic tree image classification system, image processing, feature extraction, SIFT, GIST, SURF
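The 10-fold cross-validation mentioned in the abstract can be illustrated with a short sketch. It is written in Python rather than the MATLAB used for the project, and it is not the project's code. It assumes a data set of 600 images, a figure inferred from the 540 training and 60 testing items per fold given in the title of Table 2; the function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once so folds are random
    fold_size = n_samples // k            # assumes n_samples is divisible by k
    for fold in range(k):
        start = fold * fold_size
        test = indices[start:start + fold_size]
        train = indices[:start] + indices[start + fold_size:]
        yield train, test


# Assumed data set size: 600 images (540 training + 60 testing per fold).
folds = list(k_fold_indices(600, k=10))
for train, test in folds:
    print(len(train), len(test))  # 540 60 for every fold
```

Each image is used for testing exactly once, and the reported accuracy is the average over the ten folds. Leave-one-out cross-validation is the special case in which the number of folds equals the number of images, so calling `k_fold_indices(600, k=600)` would give 600 folds with a single test image each.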
ABSTRAK
A study was conducted to produce an automated phylogenetic tree image classification system using a machine learning algorithm. The study used a supervised machine learning algorithm, the Support Vector Machine (SVM). Image data were collected from the online databases PUBMED, ScienceDirect and Bioinformatics. A comparison of the performance of three different phylogenetic tree features is also presented in this project. The aim is to determine the suitable features for the phylogenetic tree image classification system. Leave-one-out cross-validation was used to calculate the accuracy of each feature. In addition, 10-fold cross-validation was measured in this study. The results show that the most suitable combination of features for the phylogenetic tree image classification system is SIFT, SURF and GIST. The accuracy obtained by combining the three features exceeds 82.19%. The results also show that the average accuracy obtained from the 10-fold cross-validation is 81.50%. The results of this study support the combination of SIFT, SURF and GIST features for implementing this phylogenetic tree classification system.
Keywords: image classification system, image processing, feature extraction, SIFT, GIST, SURF
CHAPTER ONE
INTRODUCTION
Overview
It is an undeniable fact that phylogenetic trees are widely used for the evolutionary analysis of different species, organisms or genes descended from a common ancestor (Laubach, von Haeseler, & Lercher, 2012). According to Brinkman (2005), evolution analysis is a collection of techniques for studying long-term phenotypical evolution that was developed during the 1990s. Evolutionary analysis also rests on evolution theory, which is the foundation of most bioinformatic analysis. It provides an ecological characterization of species and uses the concept of frequency dependence from gene theory (Brinkman, 2005). This chapter discusses the background of the study, the problem statement, the research objectives, the research questions, the hypothesis, the conceptual framework and the significance of the study. In addition, this chapter gives the definitions of relevant terms.
Introduction
The evolutionary tree, or phylogenetic tree, is a visualization of the relationships between entities according to the similarities and differences in their hereditary or physical characteristics (Baum, 2008). The way a phylogenetic tree shows the relationships among species is therefore important, as reflected in its use to present the evolution analysis of any species. Evolution analysis generally includes the identification of homologous sequences, multiple alignment, phylogenetic reconstruction and the graphical representation of the inferred tree (Dereeper et al., 2008). According to Dereeper et al. (2008), the identification of homologous sequences finds similar sequences, whereas multiple alignment determines where the aligned sequences differ. Phylogenetic reconstruction then builds the phylogenetic tree from the identified and aligned sequences, and the graphical representation shows the relationships between the species in the tree (Dereeper et al., 2008). These steps illustrate the increasing use of phylogenetic trees in the biological sciences, especially by biologists who carry out evolution analysis of species. The phylogenetic tree is therefore important for the evolution analysis of life on Earth.
Apart from that, phylogeny is the evolutionary history of a species or group of related species (Pagel, 1999). Phylogeny is closely tied to systematics, the discipline that classifies organisms (Siegel-Causey, Brooks, & Funk, 1991), because systematists use phylogeny to determine the evolutionary relationships of organisms. According to Campbell and Reece (2008), a systematist is a professional who uses fossil, molecular and genetic data to infer evolutionary relationships. Systematists have also proposed PhyloCode, which can be used to depict a phylogenetic analysis as a branching phylogenetic tree. A phylogenetic analysis is presented as a collection of nodes and branches. For instance, taxa that are closely related in an evolutionary sense appear close to each other, whereas taxa that are distantly related lie on different branches of the tree, far from each other.
Background of the study
In 1859, Darwin produced the first illustration of a phylogenetic tree (Darwin, 1859). Before that, in 1837, shortly after his famous five-year voyage as naturalist on the Beagle, he sketched a tree diagram in his notebook (Darwin, 1859). As Figure 1 shows, this simple sketch was remarkably similar to modern diagrams of phylogenies (Darwin, 1987).
Figure 1. The first evolution tree diagram sketched by Darwin. Adapted from Charles Darwin's notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by C. Darwin, 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.). Adapted with permission.
Figure 2. Cumulative number of publications in the Science Citation Index since 1981 that cite the terms "molecular" and "phylogeny" in the keywords or abstract. Adapted from "Inferring the historical patterns of biological evolution," by M. Pagel, 1999, Nature, 401(6756), p. 844. Copyright 1999 by Pagel. Adapted with permission.
The first illustration of a phylogenetic tree accompanied the first scientific argument for the theory of evolution by means of natural selection. Darwin (1998) stated that "The time will come, I believe, though I shall not live to see it, when we shall have fairly true genealogical trees of each great kingdom of Nature" (p. 18). He would have been eager to see how modern genetics supports and confirms his ideas, providing evidence not only for what happened in evolution but also for precisely how living things evolve. The forensic evidence now used for evolution is DNA (Darwin, 1859).
In fact, a few approaches were used to study the evolution of species before molecular phylogenetics (Campbell & Reece, 2008). In the 1900s, immunochemical studies were used to detect cross-reactions, which are stronger for closely related organisms. Between the 1940s and the 1960s, biologists used protein sequencing methods, electrophoresis, DNA hybridization and PCR, which contributed to a boom in molecular phylogeny. In the late 1970s, biologists began to analyse the evolution of organisms by using molecular phylogeny. After the publication of The Origin of Species, many other biologists accepted the idea of a universal Tree of Life (Darwin, 1987). One German biologist who supported Darwin's Tree of Life was Ernst Haeckel (Larget, 2011). Phylogenetic trees are very useful for biologists because they can be used to describe the relationships between living creatures, genomes and genes.
With the development of techniques for generating phylogenetic data, the number of studies depicting phylogenies has exploded (Pagel, 1999). The number of articles publishing phylogenies based on gene-sequence information has been increasing exponentially, as Figure 2 shows (Pagel, 1999). These phylogenies cover taxonomic groups ranging from viruses to bacteria, fungi, plants and animals (Campbell & Reece, 2008). Thus, the phylogenetic tree has become popular and important for the evolutionary analysis of organisms. The phylogenetic tree is a branching diagram that shows the evolutionary relationships of organisms (Baum, D., 2008). Based on Darwin (1859), evolution is a natural process that acts on populations. It can be described as the change in the heritable traits of biological populations over successive generations.
On the other hand, a phylogeny can show the similarities and differences in physical and hereditary traits. This is because the taxa joined together in the tree are implied to have descended from a common ancestor at a node (Gregory, 2008). In this respect, a phylogenetic tree is similar to a family tree. Moreover, phylogenetic trees are constructed from the similarities or differences in the physical or genetic features of organisms. Until a few years ago, scientists constructed phylogenetic trees in the traditional way, which focused only on physical features. Advances in technology have since led to the accumulation of huge amounts of biological data (Wan & Che, 2013), which may change the way biological studies are carried out in various respects.
As mentioned by Wan and Che (2013), phylogenetic trees can be built from information on interacting pathways. They applied hierarchical clustering to two groups of organisms, eukaryotes and prokaryotes. Using interacting pathways can make it more effective to reveal the evolutionary relationships of species (Wan & Che, 2013). A phylogenetic tree is constructed from a variety of evidence, most commonly by comparing DNA (Kaizhong, Jason T., & Dennis, 1996). It is an undirected, acyclic, connected graph. The lengths of the branches represent the time since the groups split from each other, the nodes of the tree represent ancestors, and the exterior nodes are called leaves.
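The description of a phylogenetic tree as an undirected, acyclic, connected graph whose exterior nodes are leaves can be made concrete with a small Python sketch. It is an illustration only: the adjacency-dictionary representation, the taxon names A to D, the ancestor nodes X and Y, and the function names `is_tree` and `leaves` are all assumptions made for this example.

```python
from collections import deque


def is_tree(adjacency):
    """Return True if the undirected graph is connected and acyclic."""
    nodes = list(adjacency)
    if not nodes:
        return False
    # Each undirected edge appears twice in the adjacency lists.
    n_edges = sum(len(neighbours) for neighbours in adjacency.values()) // 2
    # Breadth-first search to count the nodes reachable from the first node.
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    # A connected graph with exactly n - 1 edges has no cycles.
    return len(seen) == len(nodes) and n_edges == len(nodes) - 1


def leaves(adjacency):
    """Return the exterior nodes: those attached to exactly one branch."""
    return sorted(node for node, nbrs in adjacency.items() if len(nbrs) == 1)


# Hypothetical tree: taxa A-D joined through ancestor nodes X and Y.
tree = {
    "A": ["X"], "B": ["X"],
    "C": ["Y"], "D": ["Y"],
    "X": ["A", "B", "Y"],
    "Y": ["C", "D", "X"],
}

print(is_tree(tree))  # True
print(leaves(tree))   # ['A', 'B', 'C', 'D']
```

Branch lengths are left out of this sketch; they could be stored as a weight on each edge if the time since each split needs to be represented.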
Apart from constructing phylogenetic trees, a newer approach extracts phylogenetic tree data from the published literature. This is done by content mining (Mounce, 2012). The term can be explained by splitting it into "content" and "mining". Content can include anything, such as audio, video, metadata, text and images, while mining refers to the large-scale extraction of information from that content. Extracting phylogenetic tree data from the literature relies on content mining rather than text mining because the content is more than just text (Mounce, 2012).
In short, phylogenetic trees provide a framework that shows the evolution of features (Baum, D., 2008): related species share many similar features. Phylogenetic trees are also used in bio-prospecting, where an optimal strategy exploits phylogenetic information to target closely related species in the search for a shared feature of interest (Kelly, Grenyer, & Scotland, 2014). Phylogenetic trees are likewise useful in conservation evaluation, for choosing sets of species that maximize the present utilitarian benefits of extant feature diversity as well as the range of evolutionary trajectories in the future.
Problem Statement of the study
With the increasing volume of publication databases, the volume of published phylogenetic trees is growing. With the rapid accumulation of DNA sequence data, more and more phylogenetic trees are being constructed (Pagel, 1999). This makes searching for relevant information challenging and time consuming for a researcher (Dereeper et al., 2008). Moreover, published documents contain various types of content, such as images, audio, art and tables. Search engines rely on the text or captions associated with a figure to perform a search, so a researcher who needs phylogenetic tree images has to classify them one by one, which is challenging and a waste of time. If searching for a particular phylogenetic tree is challenging and time consuming for biologists, their research work may be delayed. Furthermore, phylogenetic trees were devised to study the evolution of organisms, and published phylogenetic trees are nowadays mainly reused by other biologists. Therefore, an automated application that searches for phylogenetic trees is truly needed, because it can take over the very challenging human task of determining whether an image is a phylogenetic tree.
Therefore, the main purpose of this project is to automate phylogenetic tree image classification by using a machine learning algorithm. The classification focuses on deciding whether images in PDF or text files are phylogenetic trees or non-phylogenetic trees. Examples of phylogenetic trees are the cladogram, the phenogram and tree terminology. On the other hand, examples of non-phylogenetic trees are the family tree, the life cycle of an organism and the flow chart. Figure 3 shows an example of a non-phylogenetic tree, a family tree (Murdoch, 2013).
7
Ebenezer Murdoch - CID - Riddick Grizel John Young shy CID shy Ann lowden 1761-1806 i 1761-1834 Mason II Shoemaker I I
Samuel lowden Murdoch -shyCID -- shy Jane Young 1784-1830 1787-1879 Shoemaker
John Muir Ebenezer Andrew McCulloch Jane John Coupland Margaret Murdoch Murdoch Murdoch Murdoch Murdoch Murdoch
1809 1810 -1864 1812 - l860 1816 - 1894 1818-1879 1820- Infant death C Uln Captai~ Shoemaker
James Murdoch shy CID shy Agnes Cumming
Mary Murdoch
1841-1929
1814 - 1900 ClJplaln
Jane Murdoch
1848-1924
Jeanie Muirhead - CID shy Samuel Murdoch -1914 I 1843-1917
Mil5UMaf1ller
1865middot1869 1867 1906 1870middot1916 Wntdeath Chimist
1873 - 1912 1e oftiagtr01 the TI14R1C
~tn these ApI~ 191 2
Agnes Murdoch
1850-1944
1818-1891
William Murdoch 1856-1906
John Murdoch lS57 -1907
uptain Iltolaquoxr
I~ tilt sea 101 In rlie ~ Ap111906 Apr~ 1907
Margaret Elisabeth Murdoch 1882 -1973
teacher headmislress
Samuel Jr - CID shy ~artha Murdoch Patience Scott
1880middot1950 Merchant
1891 middot1976
Samuel Scott Murdoch
Grizzel Samuel Alexander Charles Donaldson Murdoch Murdoch Murdoch Murdoch
1822 -1877 1824 -1888 1827 - 1868 1829 - 1860 Shoemaker uDtain uptltn
OJowrerlln ~nt Nwy
HI~ cxItnl~ ~
Figure 1 Non-phylogenetic tree- family tree Adapted from Murdoch Family Tree by Murdoch W 2013 Retrieved from httpwwwwilliammurdochnetJarticles_12_Murdoch_family_treehtml
Copyright 2013 by the Murdoch Adapted with permission
8
General Objective The main objective of this research study is to employ a machine
learning algorithm that can classify images into phylogenetic tree or non-phylogenetic trees
Specific Objective The specific objectives of this study are
i To employ machine learning that can predict phylogenetic tree that represent in the
Image
II To compare and contrast the different features that represent phylogenetic tree on
image
Research Question
I Can neural network be used for prediction of phylogenetic tree images
II What are the discriminative features can be used for classifier learning
I Phylogenetic tree or called as a phylogeny is a branch diagram that can illustrate
the lines of evolutionary relationships of different kinds of species organism or
genes from a common ancestor (Baum D 2008)
II Phylogeny is the evolution relationship between organisms (Baum D 2008)
1II Evolution analysis is the fundamentals or foremost of phylogenetic trees with
cautionary notes (Brinkman 2005)
iv Content Mining is defined as a significant part of figure mining which is nonshy
textual content (Mounce 2014)
9
This research study hopes to advance knowledge on the automated digitization images of
phylogenetic trees from the pdf file or text file as phylogenetic tree or non-phylogenetic tree
This research study is mainly focused on the rooted tree (c1adogram) and the unrooted
In conclusion phylogenetic is the science of constructing hypothesis related to the
Iutionary relationship of organisms in the fonn of phylogenetic tree Then this project is not
laquomeemed with the reconstruction of phylogenetic trees Rather it was doing the classification of
phylogenetic trees image in pdf file or text file whether it is phylogenetic trees or nonshy
ylogenetic trees by using machine learning algorithm
10
CHAPTER TWO
LITERATURE REVIEW
As mentioned by Mounce (2012) recently there are millions of papers published each
at an ever growing rate about the phylogenetic tree This is because the amount and
mvllICImiddothI of species with at least partial sequence information was rapidly increasing Thus
phylogenetic trees become an integral part of various biological studies with the exponential
iDcrease of sequence data which is being generated by various classical and next generation
sequence studies (Baum D 2008) This chapter divides into few sections The first section
tbcuses on phylogenetic trees which explain more on the meaning and purpose for the
ylogenetic trees and types of phylogenetic trees The next section concentrates on the feature
mimage This section also emphasizes on the suitable features that were suitable used for image
ification process Besides this section reviewed on image recognition system frameworks as
nvaoSEeoletic Tree
Phylogenetic tree or evolution tree is an illustrative representation of biological entities
were associated with common descent such as species or higher-level taxonomic
___pmJ~ (Gregory 2008) Phylogenetic tree represents a backbone for various other biological (8aum 2008) Therefore it is a graphical representation of a hypothesis about the
_tlon of a species with branches that separated hybridized or terminated by extinction
readers can read and understand the patterns of descent from the phylogenetic trees
the phylogenetic trees do not indicate when species evolved or how much genetic
11
CD8ogeoccurred in a lineage (Kelly Grenyer amp Scotland 2014) This is because phylogenetic
should not be assumed that a taxon can be evolved from the taxon next to it
Baum Smitch and Donovan (2005) stated that phylogenetic tree is the most direct
itllltrgttilln of the principle of common ancestry This is because phylogenetic tree is very crucial
r evolutionary theory In fact they were trying to tell the readers that practical understanding
ofwhat phylogenetic tree represented is really important in understand the evolution relationship
( the species Thus the phylogenetic trees become important in the evolution analysis of any
species as the biologists should increase the use of phylogentic trees in biological sciences Next
ylogenetic trees provides an efficient structure for organizing biodiversity info Moreover it
elopes accurate conception of totality of evolutionary history Therefore it is important for
aspiring biologists to develop the understanding of phylogenetic trees
of Phylogenetic Tree
Phylogenetic trees can be divided into different kinds of trees There were two main
ories including the phylogenetic rooted trees and the phylogenetic unrooted trees Apart
the two main categories the phylogenetic tree can represent in several form slanted
iIIIiIIIWJrlm Figure 4 (Phylogenetic tree 2002) rectangular cladogram Figure 5 (Phylogenetic
2002) and circular cladogram Figure 6 (Phylogenetic tree 2002) Roots can be artificially
to unrooted trees by means of a species that had unambiguously separated early from
species being considered (Bacardit 2009)
12
Pusa unnit MwumalA Oil (-1 bullbull
UNlVEKSITI MALAYSIA SAItAWAK
TABLE OF CONTENTS
LIST OF TABLES v
LIST OF FIGURES vi
ABSTRACT viii
ABSTRAK ix
CHAPTER ONE INTRODUCTION 1
CHAPTER TWO LITERATURE REVIEW 11
CHAPTER THREE METHODOLOGY 39
CHAPTER FOUR RESULT AND DISCUSSION 62
CHAPTER FIVE CONCLUSION AND RECOMMENDATION 69
REFERENCE 73
APPENDIX A PHYLOGENETIC TREE CLASSIFICATION SYSTEM MATLAB CODING79
IV
LIST OF TABLES
Table I Phylogenetic Tree Classification Cross-validation results based on different features 62
Table 2 lO-fold cross-validation results with 540 training data and 60 testing data each fold 66
v
LIST OF FIGURES
Figure 1 The first evolution tree diagram sketched by Darwin 3 I
Figure 2 Cumulative number of publications in Science Citation index since 1981 that cite the term molecular and phylogeny in the keywords or abstract 3
Figure 3 Non-phylogenetic tree- family tree 8
Figure 4 phylogenetic rooted-tree rectangular cladogram 13
Figure 5 phylogenetic rooted-tree Slanted diagram 13
Figure 6 phylogenetic unrooted-tree circular cladogram 14
Figure 7 phylogenetic scaled-tree 16
Figure 8 phylogenetic unscaled-tree 16
Figure 9 A quick review of phylogenetic tree 19
Figure 10 Object detection in computer perception 25
Figure 11 Feature Representation 25
Figure 12 SIFT 27
Figure 13 RIFT - 27
Figure 14 Spin image 28
Figure 15 Pre-pocessing of model Objects 32
Figure 16 Recognition of object in the scene 33
Figure 17 TreeRipper 36
Figure 18 TreeSnatcher Plus 37
Figure 19 Windows Snipping Toolbox 44
Figure 20 Original lpng 46
Figure 21 After Thresholding 46
Figure 22 Grayscale image 46
Figure 23 SURF Feature Detection and Extraction 53
vi
LIST OF FIGURES
Figure 24 GIST Feature Detection and Extraction 54
Figure 25 lO-fold cross-validation accuracy 63
Figure 26 Example of Graphic User Interface for the Phylogenetic Tree Image Classification system 67
Figure 27 Graphic User Interface for the Phylogenetic Tree Image Classification system 67
vii
ABSTRACT
A study is conducted to develop an automated phylogenetic tree image classification system by
using machine learning algorithm This study adopted supervised machine learning algorithm
which is the Support Vector Machine (SVM) for classification Image data were collected from
online databases PUBMED ScienceDirect and Bioinfonnatic journals Perfonnance
comparisons of three types of features to characterize the phylogenetic tree images are presented
in this project The aim is to detennine the suitable features for the phylogenetic tree image
classification systeIlJ The leave-out one cross-validation was used to calculate the accuracy of
each feature In addition to that 10-fold cross-validation is also conducted in the evaluation Our
results show that the suitable combination features for the phylogenetic tree image classification
system are SIFT SURF and GIST The accuracy obtained from these combinations of the three
features can achieve just over 82 On the other hands the results show the average accuracy
obtained from the 10-fold cross-validation is 8150 Our evaluation results demonstrate the
utility of using SIFT SURF and GIST features for building phylogenetic tree image
classification system
Keywords phylogenetic tree image classification system image processing feature extraction
SIFT GIST SURF
VIII
ABSTRAK
Sebuah kajian telah dijalankan untuk meghasilkan sistem pengelasan automatik imej pokok
filogenetik dengan menggunakan algoritma mesin pembelajaran Kajian tersebut telah
menggunakan pembelajaran algoritma mesin diselia iaitu Mesin Vektor Sokongan (SVM) Data
imej telah dikumpulkan dari pangkalan data dalam talian PUBMED ScienceDirect dan
Bioinformatik Perbandingan antara prestasi tiga ciri-ciri pokokfilogenetik yang berbeza juga
telah ditunjukkan dalam projek ini Tujuannya adalah untuk menentukan ciri-ciri yang sesuai
untuk sistem klasifikasi pokok imej filogenetik Satu pengesahan cuti keluar salib telah
digunakan untuk mengira ketepatan bagi setiap ciri Tambahan pula 10 kali ganda silang
pengesahan akan diukurkan dalam kajian ini Hasil kajian ini telah menunjukkan bahawa cirishy
cjri gabungan yang paling sesuai bagi imej sistem klasifikasi pokokfilogenetik adalah SIFT
SURF dan GIST Ketepatan yang diperolehi daripada tiga ciri-ciri melalui gabungan boleh
memperolehi lebih daripada 8219 Selain itu hasilnya juga menunjukkan ketepatan purata
yang diperolehi daripada 10 kali ganda silang pengesahan iaitu sebanyak 8150 Hasil kajian
ini menunjukkan gabungan ciri ciri SIFT SURF dan GIST untuk melaksanakan sistem
filogenetik klasifikasi pokok ini
Kata Kunci sistem klasifikasi imej pemprosesan imej pengekstrakan ciri SIFT GIST SURF
IX
CHAPTER ONE
INTRODUCTION
Overview
Phylogenetic trees are widely used for the evolutionary analysis of different species,
organisms or genes descended from a common ancestor (Laubach, von Haeseler & Lercher, 2012).
According to Brinkman (2005), evolutionary analysis is a collection of methods, developed during
the 1990s, for ascertaining long-term phenotypic evolution. Evolutionary analysis also rests on
evolutionary theory, which is the foundation of most bioinformatic analysis. This is because
evolutionary analysis gives an ecological characterization of species using the concept of
frequency dependence from gene theory (Brinkman, 2005). This chapter discusses the background of
the study, the problem statement, the research objectives, the research questions, the
hypothesis and conceptual framework of the study, and the significance of the study. In
addition, this chapter defines the relevant terms.
Introduction
The evolutionary tree, or phylogenetic tree, is a visualization that shows the relationships
between entities according to the similarities and differences in their hereditary or physical
characteristics (Baum, 2008). The way in which a phylogenetic tree shows the relationships among
species is therefore important, as reflected in its use to present the evolutionary analysis of
any species. Evolutionary analysis generally includes the identification of homologous
sequences, their multiple alignment, phylogenetic reconstruction, and the graphical
representation of the inferred tree (Dereeper et al., 2008). These four steps can be explained
as follows. According to Dereeper et al. (2008), identifying homologous sequences finds the
sequences that are similar, whereas multiple alignment determines the differences between the
aligned sequences. Phylogenetic reconstruction is the process of building the phylogenetic tree
from the aligned homologous sequences, and the graphical representation shows the relationships
between the species in the tree (Dereeper et al., 2008). This reflects the increasing use of
phylogenetic trees in the biological sciences, especially by biologists who carry out
evolutionary analysis of species. The phylogenetic tree is therefore important for the
evolutionary analysis of life on Earth.
Phylogeny is the evolutionary history of a species or group of related species (Pagel, 1999).
It is central to systematics, the discipline that classifies organisms (Siegel-Causey, Brooks &
Funk, 1991), because systematists use phylogeny to determine the evolutionary relationships of
organisms. According to Campbell and Reece (2008), a systematist is a professional who uses
fossil, molecular and genetic data to infer evolutionary relationships. They also discuss the
PhyloCode, which can be used to depict a phylogenetic analysis as a branching phylogenetic tree.
A phylogenetic analysis is presented as a collection of nodes and branches. For instance, taxa
that are closely related in an evolutionary sense appear close to each other, whereas taxa that
are distantly related lie on different branches of the tree, far from each other.
Background of the study
In 1859, Darwin published the first illustration of a phylogenetic tree (Darwin, 1859). Before
that, in 1837, shortly after his famous five-year voyage as naturalist on the Beagle, he
sketched a tree diagram in his notebook (Darwin, 1859). As Figure 1 shows, the simple sketch was
remarkably similar to modern diagrams of phylogenies (Darwin, 1987).
Figure 1. The first evolution tree diagram sketched by Darwin. Adapted from Charles Darwin's notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by Darwin, C., 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.). Adapted with permission.
Figure 2. Cumulative number of publications in the Science Citation Index since 1981 that cite the terms "molecular" and "phylogeny" in the keywords or abstract (horizontal axis: year, 1980 to 2000). Adapted from "Inferring the historical patterns of biological evolution" by Pagel, M., 1999, Nature, 401(6756), p. 844. Copyright 1999 by Pagel. Adapted with permission.
The first illustration of a phylogenetic tree accompanied the first scientific argument for the
theory of evolution by means of natural selection. Darwin (1998) stated, "The time will come, I
believe, though I shall not live to see it, when we shall have fairly true genealogical trees of
each great kingdom of Nature" (p. 18). He would have been glad to see how modern genetics has
supported and confirmed his ideas. He provided evidence not only for what had happened in the
course of evolution but also for precisely how living things evolve (Darwin, 1859). The forensic
evidence for evolution now comes from DNA, which was unknown in Darwin's time.
Several approaches were used to study the evolutionary relationships of species before
molecular phylogenetics (Campbell & Reece, 2008). In the 1900s, immunochemical studies showed
that cross-reactions are stronger between closely related organisms. Next, between the 1940s and
the 1960s, biologists used protein sequencing, electrophoresis, DNA hybridization and PCR, which
contributed to a boom in molecular phylogeny. Meanwhile, after the publication of The Origin of
Species by Darwin, many other biologists came to accept the idea of a universal Tree of Life
(Darwin, 1987). Then, in the late 1970s, biologists started to study the evolutionary
relationships of organisms using molecular phylogeny. One German biologist who supported
Darwin's Tree of Life was Ernst Haeckel (Larget, 2011). Phylogenetic trees are very useful to
biologists because they describe the relationships between living creatures, genomes and genes.
With the development of techniques for producing phylogenetic data, the number of studies
depicting phylogenies has exploded (Pagel, 1999). The number of articles publishing phylogenies
based on gene-sequence information has been increasing exponentially, as Figure 2 shows (Pagel,
1999). These phylogenies cover taxonomic groups ranging from viruses to bacteria, fungi, plants
and animals (Campbell & Reece, 2008). The phylogenetic tree has thus become popular and
important for the evolutionary analysis of organisms. A phylogenetic tree is a branching diagram
that shows the evolutionary relationships of organisms (Baum, 2008). Following Darwin (1859),
evolution refers to a natural process in populations: the change in the hereditary traits of
biological populations over successive generations.
Phylogeny can also show the similarities and differences in physical and hereditary traits.
This is because taxa that are joined together in the tree are implied to have descended from a
common ancestor, represented by a node (Gregory, 2008). In this respect a phylogenetic tree is
similar to a family tree. Moreover, phylogenetic trees are constructed from the similarities or
differences in the physical or genetic features of organisms. In the past, scientists
constructed phylogenetic trees in the traditional way, using physical features only. The
advancement of technology has since led to the accumulation of huge amounts of biological data
(Wan & Che, 2013), which is changing the way biological studies are carried out in various
respects.
As mentioned by Wan and Che (2013), phylogenetic trees can be built using information from
interacting pathways. They applied hierarchical clustering to two domains of organisms,
eukaryotes and prokaryotes. Using interacting pathways can increase the effectiveness of
revealing the evolutionary relationships of species (Wan & Che, 2013). A phylogenetic tree is
constructed from a variety of evidence, most commonly by comparing DNA (Kaizhong, Jason T. &
Dennis, 1996). It is an undirected, acyclic, connected graph. Typically, the lengths of the
branches represent the time since the groups split from each other, and the internal nodes of
the tree represent ancestors. The exterior nodes are called leaves.
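The graph view of a phylogenetic tree described above can be made concrete with a minimal Python sketch. It is an illustration only and not part of the system developed in this study: the function names, the four-taxon example and the node labels (anc1, anc2) are hypothetical. The sketch checks that a set of branches forms an undirected, acyclic, connected graph and lists the exterior nodes (leaves).

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) branch pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


def is_tree(adjacency, n_edges):
    """A connected graph with exactly (nodes - 1) edges contains no cycle."""
    if not adjacency:
        return False
    start = next(iter(adjacency))
    seen = {start}
    stack = [start]
    while stack:  # depth-first search to test connectivity
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(adjacency) and n_edges == len(adjacency) - 1


def leaves(adjacency):
    """Exterior nodes (leaves) are attached to exactly one branch."""
    return sorted(node for node, nbrs in adjacency.items() if len(nbrs) == 1)


# Hypothetical example: two internal (ancestral) nodes and four leaves.
EDGES = [("anc1", "human"), ("anc1", "chimp"), ("anc1", "anc2"),
         ("anc2", "mouse"), ("anc2", "rat")]
ADJ = build_adjacency(EDGES)
```

In this example the internal nodes anc1 and anc2 play the role of ancestors, while the four named taxa are the leaves. Branch lengths are omitted for brevity.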
Apart from constructing phylogenetic trees, a newer approach extracts phylogenetic tree data
from the published literature using content mining (Mounce, 2012). The term can be explained by
splitting it into "content" and "mining". Content can include anything, such as audio, video,
metadata, text and images. Mining refers to the large-scale extraction of information from that
content. Extracting phylogenetic tree data from the literature relies on content mining rather
than text mining alone because the content is more than just text (Mounce, 2012).
In short, phylogenetic trees provide a framework that shows the evolution of features (Baum,
2008), since related species share many similar features. Phylogenetic trees are also used in
bio-prospecting, a strategy that exploits phylogenetic information to target closely related
species when searching for a shared feature of interest (Kelly, Grenyer & Scotland, 2014).
Phylogenetic trees are likewise useful for conservation evaluation, in choosing sets of species
that maximize the present utilitarian benefits of extant feature diversity as well as the range
of possible evolutionary trajectories in the future.
Problem Statement of the study
With the increasing volume of publication databases, the number of published phylogenetic
trees is growing. With the rapid accumulation of DNA sequence data, more and more phylogenetic
trees are being constructed (Pagel, 1999). This makes it challenging and time consuming for a
researcher to search for relevant information (Dereeper et al., 2008). In addition, published
documents contain various types of content, such as images, audio, artwork and tables. Search
engines often rely on the text or caption associated with a figure to perform a search. As a
result, researchers must classify phylogenetic tree images one by one, which is challenging and
wastes time, and a biologist's research may be delayed by the search for a particular
phylogenetic tree. Furthermore, phylogenetic trees were originally produced to study the
evolution of organisms, whereas published phylogenetic trees are now mainly reused by other
biologists. An automated digitization application that searches for phylogenetic trees on their
behalf is therefore truly needed, because it can take over the very challenging manual task of
determining whether an image is a phylogenetic tree.
The main purpose of this project is therefore to automate the classification of phylogenetic
tree images using a machine learning algorithm. The classification focuses on deciding whether
images in a PDF or text file are phylogenetic trees or non-phylogenetic trees. Examples of
phylogenetic trees are cladograms, phenograms and tree terminology. Examples of non-phylogenetic
trees are family trees, life cycles of organisms and flow charts. Figure 3 shows an example of a
non-phylogenetic tree, a family tree (Murdoch, 2013).
Figure 3. Non-phylogenetic tree: family tree. Adapted from Murdoch Family Tree by Murdoch, W., 2013. Retrieved from http://www.williammurdoch.net/articles_12_Murdoch_family_tree.html
Copyright 2013 by Murdoch. Adapted with permission.
General Objective
The main objective of this research study is to employ a machine learning algorithm that can
classify images as phylogenetic trees or non-phylogenetic trees.
Specific Objective
The specific objectives of this study are:
i. To employ machine learning that can predict whether a phylogenetic tree is represented in an
image.
ii. To compare and contrast the different features that represent a phylogenetic tree in an
image.
Research Question
i. Can a neural network be used to predict phylogenetic tree images?
ii. Which discriminative features can be used for classifier learning?
Definition of terms
i. A phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the
lines of evolutionary relationship of different kinds of species, organisms or genes from a
common ancestor (Baum, 2008).
ii. Phylogeny is the evolutionary relationship between organisms (Baum, 2008).
iii. Evolution analysis is the fundamentals or foremost of phylogenetic trees, with cautionary
notes (Brinkman, 2005).
iv. Content mining is defined as a significant part of figure mining, which deals with
non-textual content (Mounce, 2014).
This research study hopes to advance knowledge on the automated digitization of phylogenetic
tree images by classifying images from PDF or text files as phylogenetic trees or
non-phylogenetic trees.
This research study mainly focuses on the rooted tree (cladogram) and the unrooted tree.
In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary
relationships of organisms in the form of phylogenetic trees. This project is not concerned with
the reconstruction of phylogenetic trees. Rather, it classifies images in PDF or text files as
phylogenetic trees or non-phylogenetic trees using a machine learning algorithm.
CHAPTER TWO
LITERATURE REVIEW
As mentioned by Mounce (2012), millions of papers are now published each year, at an
ever-growing rate, about phylogenetic trees. This is because the number of species with at least
partial sequence information is rapidly increasing. Phylogenetic trees have thus become an
integral part of various biological studies, given the exponential increase in sequence data
generated by classical and next-generation sequencing studies (Baum, 2008). This chapter is
divided into several sections. The first section focuses on phylogenetic trees, explaining their
meaning and purpose and the types of phylogenetic trees. The next section concentrates on image
features and emphasizes the features that are suitable for the image classification process.
This section also reviews image recognition system frameworks.
Phylogenetic Tree
A phylogenetic tree, or evolutionary tree, is an illustrative representation of biological
entities that are related through common descent, such as species or higher-level taxonomic
groupings (Gregory, 2008). Phylogenetic trees form a backbone for various other biological
studies (Baum, 2008). A phylogenetic tree is therefore a graphical representation of a
hypothesis about the evolution of species, with branches that separate, hybridize or terminate
in extinction. Readers can read and understand patterns of descent from a phylogenetic tree.
However, phylogenetic trees do not indicate when species evolved or how much genetic change
occurred in a lineage (Kelly, Grenyer & Scotland, 2014). Nor should it be assumed from a
phylogenetic tree that a taxon evolved from the taxon next to it.
Baum, Smith and Donovan (2005) stated that the phylogenetic tree is the most direct
illustration of the principle of common ancestry, which makes it crucial for evolutionary
theory. They stress that a practical understanding of what a phylogenetic tree represents is
important for understanding the evolutionary relationships of species. Phylogenetic trees are
thus important in the evolutionary analysis of any species, and biologists should make greater
use of them in the biological sciences. Phylogenetic trees also provide an efficient structure
for organizing biodiversity information and help to develop an accurate conception of the
totality of evolutionary history. It is therefore important for aspiring biologists to develop
an understanding of phylogenetic trees.
Types of Phylogenetic Tree
Phylogenetic trees can be divided into different kinds. There are two main categories, rooted
phylogenetic trees and unrooted phylogenetic trees. Apart from these two main categories, a
phylogenetic tree can be drawn in several forms: the slanted cladogram (Figure 4; Phylogenetic
tree, 2002), the rectangular cladogram (Figure 5; Phylogenetic tree, 2002) and the circular
cladogram (Figure 6; Phylogenetic tree, 2002). A root can be added artificially to an unrooted
tree by means of a species that unambiguously separated early from the other species being
considered (Bacardit, 2009).
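The rooting step described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a method used in this study: the unrooted tree is assumed to be stored as an adjacency map, the early-diverging species (often called an outgroup) is assumed to be a leaf, and the function name, the placeholder label "root" and the example taxa A to D are hypothetical.

```python
def root_with_outgroup(adjacency, outgroup):
    """Return a parent-to-children map rooted on the outgroup's branch.

    `adjacency` maps each node of an unrooted tree to the set of its
    neighbours. The outgroup must be a leaf, i.e. have one neighbour.
    """
    (attachment,) = adjacency[outgroup]  # fails if the outgroup is not a leaf
    # The artificial root is placed on the branch joining the outgroup
    # to the rest of the tree ("root" is a hypothetical label).
    children = {"root": [outgroup, attachment], outgroup: []}
    stack = [(attachment, outgroup)]
    while stack:
        node, parent = stack.pop()
        # Every neighbour except the one we arrived from becomes a child.
        children[node] = sorted(n for n in adjacency[node] if n != parent)
        stack.extend((child, node) for child in children[node])
    return children


# Hypothetical unrooted tree with leaves A, B, C, D and internal nodes u, v.
UNROOTED = {
    "A": {"u"}, "B": {"u"}, "C": {"v"}, "D": {"v"},
    "u": {"A", "B", "v"}, "v": {"C", "D", "u"},
}
ROOTED = root_with_outgroup(UNROOTED, "D")
```

Taking D as the outgroup places the root between D and the remaining taxa, so that C branches off next and A and B form the innermost pair.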
Figure 16 Recognition of object in the scene 33
Figure 17 TreeRipper 36
Figure 18 TreeSnatcher Plus 37
Figure 19 Windows Snipping Toolbox 44
Figure 20 Original lpng 46
Figure 21 After Thresholding 46
Figure 22 Grayscale image 46
Figure 23 SURF Feature Detection and Extraction 53
vi
LIST OF FIGURES
Figure 24 GIST Feature Detection and Extraction 54
Figure 25 lO-fold cross-validation accuracy 63
Figure 26 Example of Graphic User Interface for the Phylogenetic Tree Image Classification system 67
Figure 27 Graphic User Interface for the Phylogenetic Tree Image Classification system 67
vii
ABSTRACT
A study is conducted to develop an automated phylogenetic tree image classification system by
using machine learning algorithm This study adopted supervised machine learning algorithm
which is the Support Vector Machine (SVM) for classification Image data were collected from
online databases PUBMED ScienceDirect and Bioinfonnatic journals Perfonnance
comparisons of three types of features to characterize the phylogenetic tree images are presented
in this project The aim is to detennine the suitable features for the phylogenetic tree image
classification systeIlJ The leave-out one cross-validation was used to calculate the accuracy of
each feature In addition to that 10-fold cross-validation is also conducted in the evaluation Our
results show that the suitable combination features for the phylogenetic tree image classification
system are SIFT SURF and GIST The accuracy obtained from these combinations of the three
features can achieve just over 82 On the other hands the results show the average accuracy
obtained from the 10-fold cross-validation is 8150 Our evaluation results demonstrate the
utility of using SIFT SURF and GIST features for building phylogenetic tree image
classification system
Keywords phylogenetic tree image classification system image processing feature extraction
SIFT GIST SURF
VIII
ABSTRAK
Sebuah kajian telah dijalankan untuk meghasilkan sistem pengelasan automatik imej pokok
filogenetik dengan menggunakan algoritma mesin pembelajaran Kajian tersebut telah
menggunakan pembelajaran algoritma mesin diselia iaitu Mesin Vektor Sokongan (SVM) Data
imej telah dikumpulkan dari pangkalan data dalam talian PUBMED ScienceDirect dan
Bioinformatik Perbandingan antara prestasi tiga ciri-ciri pokokfilogenetik yang berbeza juga
telah ditunjukkan dalam projek ini Tujuannya adalah untuk menentukan ciri-ciri yang sesuai
untuk sistem klasifikasi pokok imej filogenetik Satu pengesahan cuti keluar salib telah
digunakan untuk mengira ketepatan bagi setiap ciri Tambahan pula 10 kali ganda silang
pengesahan akan diukurkan dalam kajian ini Hasil kajian ini telah menunjukkan bahawa cirishy
cjri gabungan yang paling sesuai bagi imej sistem klasifikasi pokokfilogenetik adalah SIFT
SURF dan GIST Ketepatan yang diperolehi daripada tiga ciri-ciri melalui gabungan boleh
memperolehi lebih daripada 8219 Selain itu hasilnya juga menunjukkan ketepatan purata
yang diperolehi daripada 10 kali ganda silang pengesahan iaitu sebanyak 8150 Hasil kajian
ini menunjukkan gabungan ciri ciri SIFT SURF dan GIST untuk melaksanakan sistem
filogenetik klasifikasi pokok ini
Kata Kunci sistem klasifikasi imej pemprosesan imej pengekstrakan ciri SIFT GIST SURF
IX
CHAPTER ONE
INTRODUCTION
Overview
It is an undeniable fact that the phylogenetic trees are diffusely used for evolutionary
analysis of different species organisms or genes from a collaborative ancestor (Laubach von
Haeseler amp Lercher 2012) According to the Brinkman (2005) evolution analysis is a collection
of expedients for ascertainment long-term phenotypical evolution which developed during the
year of 1990s Evolutionary analysis also refers to foundation of most bioinformatic analysis
which is evolution theory This is because the evolutionary analysis shows the ecological
characterization of the species that uses the concept of frequency dependence from gene theory
(Brinkman 2005) This chapter mainly discusses about the background of the study problem
statements research objectives research questions hypothesis and conceptual framework of the
study and significance of the study In addition this chapter also describes the definition of
relevant terms
Introduction
The evolutionary tree or phylogenetic tree is a visualization to show the relationship
between all entities according to the similarities and differences in their hereditary or physical
characteristics (Baum 2008) Therefore the way of phylogenetic tree shows the relationship
among the species was also important This can be reflectedby the way of phylogenetic tree to
demonstrate the evolution analysis of any species in this world Evolution analysis generally
iocludes the identification of analogous sequence diverse calibration phylogenetic rebuilding
and graphic representation or figure signification of the inferred tree (Dereeper et aI 2008)
Jbcse four terms can be explained through the biology evolutions According to Dereeper et ai
(2008) the analogous sequence is used to identify the similar sequence whereas the diverse
calibration is used to determine the difference of alignment Besides the phylogenetic rebuilding
is the process to build up the phylogenetic tree after the analogous seqence and the diverse
calibration process and then for the graphic representation or figure signification is used to show
the relationship between each species in the phylogenetic tree (Dereeper et aI 2008) This can
show that the increasing use of phylogenetic trees in biological sciences especially for biologists
who did the evolution analysis on the species Therefore the use of phylogentic tree is quite
important for the evolution analysis of life on Earth
Apart from that phylogeny is the evolutionary history of a species or group of related
species (Pagel 1999) The phylogeny can be called as the discipline of systematic classifies
organisms (Siegel-Causey Brooks amp Funk 1991) This is because phylogeny can be used to
determine organisms evolutionary relationship by systematist According to Campbell and Reece
(2008) the term systematist in this research refers to the professional who used fossil molecular
and genetic data to infer evolutionary relationships They also proposed PhyloCode which can
be used to depict the phylogenetic analysis in branching phylogenetic trees A phylogenetic
analysis presents as a collection of nodes and branch For instance the taxa that closely related
are in an evolutionary sense apppeared closely to each other whereas the taxa that distantly
related are in the different branches of the tree or there is a distance which is far from each other
in such tree
Background of the study
In the year of 1859 Darwin invented the first illustration of a phylogenetic tree (Darwin
1859) Before that shortly after his famous five years voyage as naturalist on Beagle in the year
2
2000
1000
of 1837 he sketched a tree diagram in his notebook (Darwin 1859) Based on the Figure I the
simple sketch was remarkably similar to modem diagrams of phylogenies (Darwin 1987)
9L-shy ~ ~ A 2$ ~laquo
~ r amp4 ~ lt- C ~ 7S _ ~ ~r p--~ -$ - 2gt
-z-a ~ ltZ- ~~-
~L-- F bull - L~ -~---r~ - - ~-------r rd 4=shy
Figure 1 The first evolution tree diagram sketched by Darwin Adapted from Charles Darwins
notebooks 1836-1844 Geology transmutation ospecies metaphysical enquiries (p 87) by Druwin c 1987 Cambridge Cambridge University Press Copyright 1987 by the P H Barrett (Ed) Adapted with pennission
o-l-lr=It=I-=-~=-lJ -------_ 1980 1985 1990 1995 2000
Year Figure 2 Cumulative number of publications in Science Citation index since 1981 that cite the tenn molecular and phylogeny in the keywords or abstract Adapted from Inferring the historical patterns of biological evolution by Pagel M 1999 Nature 401(6756) p 844 Copyright 1999 by the Pagel Adapted with pennission
3
First illustration of a phylogenetic tree is the first scientific argument for the theory of
advancement by means of innate selection Darwin (1998) stated that The time will come I
believe though I shall not live to see it when we shall have fairly true genealogical trees ofeach
great kingdom ofNature (p 18) In fact he mentioned that he would have the willingness to see
how modem genetics supported and confirmed by his owns ideas He provided evidence which
is not only for what had happened in the aspect of evolution but precisely how living things
evolve The forensic evidence he used for evolution was the DNA (Darwin 1859)
In fact there are few approaches used for discovering the evolution analysis of species
before the molecular phylogenetic (Campbell amp Reece 2008) In the year of 1990s
immunochemical studies were used to discover cross-reactions that stronger for closely related
organism Next in between the year of 1940s until 1960s biologists used the protein sequencing
method electrophoresis DNA hybridization and PCR that contributed to a boom in molecular
phylogeny On the other hand after publication of The Origin ofSpecies by Darwin many other
biologists came and accepted the truth of a universal Tree of Life (Darwin 1987) Then in the
late of 1970s biologists started to discover evolutionary analysis of organisms by using
molecular phylogeny One of the examples of experts from German biologists who supported
Darwins Tree of Life was the Ernst Haeckel (Larget 2011) It is very useful of using
phylogenetic trees for biologists because they can use them to describe the relations between
living creatures genomes atd genes
With the development of phylogenetic data technique there are the numbers of studies
depicting phylogenetic exploded (Pagel 1999) The number of articles publishing phylogenies
based on gene-sequence information has been increasing exponentially Figure 2 shows the data
aoalysis by using the phylogenetic tree (Pagel 1999) The phylogenies taxonomic group ranging
4
Pu~at Khidmat MaklulDlt Akademillt UN1VERSm MALAYSIA SUAWA)(
from viruses to bacteria fungi plans and animals (Campbell amp Reece 2008) Thus the
phylogenetic tree becomes popular and important for the evolutionary analysis of organisms
nowadays The phylogenetic tree is a branching diagram that shows the evolutionary relationship
of the organisms (Baum D 2008) Based on Darwin (1859) evolution refers to a natural
procedure to infer about the populations It can be described as the platfonn to show the
transformation in the hereditary traits of biological population over continuous generation
On the other hand phylogeny can show the similarities and differences in physical and
hereditary traits This is because there are the taxa that can attach together in the affinnation
which indicated to posse descendant from a node (Gregory 2008) Thus phylogenetic tree can
be concluded that it was similar to a family tree Moreover the construction of phylogenetic
trees is based on the similarities or differences of their physical or genetic features Few years
ago the scientists only used the tradition way which only focused on physical features of
constructing phylogenetic trees Luckily the advancement of high technologies has been led to
accumulation of huge amounts of biological data (Wan amp Che 2013) This may lead to the
changing towards the way of biological studies in various aspects
As mentioned by Wan and Che (2013), phylogenetic trees can be built from information on
interacting pathways. They applied hierarchical clustering to two domains of organisms,
eukaryotes and prokaryotes. Using interacting pathways can increase the effectiveness of
revealing the evolutionary relationships of species (Wan & Che, 2013). A phylogenetic tree
is constructed using a variety of evidence, most commonly by comparing DNA
(Kaizhong, Jason T. & Dennis, 1996). It is an undirected, acyclic, connected graph.
Basically, the lengths of the branches represent the time since the groups split from each
other, and the interior nodes of the tree are known as ancestors. The set of exterior nodes
are called leaves.
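The graph view of a phylogenetic tree described above can be illustrated with a short sketch. The example below is only a minimal illustration under stated assumptions: the node names, the branch lengths and the helper functions (build_adjacency, is_tree, leaves, path_length) are hypothetical and are not taken from the cited works. It checks that a small graph is connected and acyclic, lists its leaves, and sums branch lengths along the path between two leaves.

```python
from collections import deque

# Hypothetical toy tree: each edge is (node_a, node_b, branch_length).
# "anc1" and "anc2" stand for inferred ancestors; all names and lengths
# are illustrative only, not data from the study.
EDGES = [
    ("anc1", "anc2", 1.0),
    ("anc1", "C", 3.0),
    ("anc2", "A", 2.0),
    ("anc2", "B", 2.0),
]


def build_adjacency(edges):
    """Return an undirected adjacency map {node: {neighbour: branch_length}}."""
    adjacency = {}
    for a, b, length in edges:
        adjacency.setdefault(a, {})[b] = length
        adjacency.setdefault(b, {})[a] = length
    return adjacency


def is_tree(adjacency):
    """A graph is a tree when it is connected and has exactly n - 1 edges."""
    if not adjacency:
        return False
    edge_count = sum(len(nbrs) for nbrs in adjacency.values()) // 2
    start = next(iter(adjacency))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search to test connectivity
        node = queue.popleft()
        for nbr in adjacency[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return len(seen) == len(adjacency) and edge_count == len(adjacency) - 1


def leaves(adjacency):
    """Exterior nodes (leaves) have exactly one neighbour."""
    return sorted(node for node, nbrs in adjacency.items() if len(nbrs) == 1)


def path_length(adjacency, source, target):
    """Sum of branch lengths along the unique path between two nodes of a tree."""
    stack = [(source, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == target:
            return dist
        for nbr, length in adjacency[node].items():
            if nbr != parent:
                stack.append((nbr, node, dist + length))
    return None


tree = build_adjacency(EDGES)
print(is_tree(tree))                # True
print(leaves(tree))                 # ['A', 'B', 'C']
print(path_length(tree, "A", "C"))  # 6.0
```

Adding one more edge between two existing leaves would create a cycle, and is_tree would then return False, which is the property that distinguishes a tree from a general network.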
Apart from constructing phylogenetic trees, a newer approach is to extract phylogenetic
tree data from the published literature. It uses content mining to extract the data from
the literature (Mounce, 2012). The term content mining can be explained in two parts,
content and mining. Content can include anything, such as audio, video, metadata, text and
images. Mining refers to the extraction of large amounts of information from the content.
Extracting phylogenetic tree data from the literature relies on content mining rather than
text mining because the content is more than just text (Mounce, 2012).
In short, phylogenetic trees provide a framework that shows the evolution of features
(Baum, D., 2008). Related species share many similar features. Phylogenetic trees are also
used in bio-prospecting, a strategy that exploits phylogenetic information to target
closely related species in the search for a shared feature of interest
(Kelly, Grenyer & Scotland, 2014). In addition, phylogenetic trees are useful for
conservation evaluation, in choosing sets of species that maximize the present utilitarian
benefits of extant feature diversity as well as the range of evolutionary trajectories in
the future.
Problem Statement of the study
With the increasing volume of publication databases, the number of published phylogenetic
trees is growing. This is because, with the rapid accumulation of DNA sequence data, more
and more phylogenetic trees are being constructed (Pagel, 1999). This makes it challenging
and time consuming for a researcher to search for relevant information
(Dereeper et al., 2008). In addition, the types of content in these published documents
vary, and include images, audio, art and tables. Search engines rely on the text or
captions associated with a figure to perform a search. As a result, a researcher has to
classify phylogenetic tree images one by one, which is challenging and a waste of time.
Moreover, if biologists find it difficult and time consuming to search for a particular
phylogenetic tree, their research work may be delayed. Furthermore, phylogenetic trees are
created to study the evolution of organisms, and nowadays biologists mainly want to reuse
published phylogenetic trees. Therefore, an automated application that finds phylogenetic
trees for them is truly needed. It can take over this very challenging manual task by
determining whether an image is a phylogenetic tree.
Therefore, the main purpose of this project is to automate the classification of
phylogenetic tree images by using a machine learning algorithm. The classification focuses
on deciding whether images in a PDF file or text file are phylogenetic trees or
non-phylogenetic trees. Examples of phylogenetic trees are the cladogram, phenogram and
tree terminology. On the other hand, examples of non-phylogenetic trees are the family
tree, the life cycle of organisms and the flow chart. Figure 3 shows an example of a
non-phylogenetic tree, a family tree (Murdoch, 2013).
[Figure: family tree diagram of the Murdoch family, showing names, life dates and occupations across several generations]
Figure 3. Non-phylogenetic tree: family tree. Adapted from Murdoch Family Tree, by Murdoch, W., 2013. Retrieved from http://www.williammurdoch.net/articles_12_Murdoch_family_tree.html
Copyright 2013 by the Murdoch. Adapted with permission.
General Objective: The main objective of this research study is to employ a machine
learning algorithm that can classify images into phylogenetic trees or non-phylogenetic trees.
Specific Objective: The specific objectives of this study are:
i. To employ machine learning to predict whether an image represents a phylogenetic tree.
ii. To compare and contrast the different features that represent a phylogenetic tree in an
image.
Research Question
i. Can a neural network be used for the prediction of phylogenetic tree images?
ii. What discriminative features can be used for classifier learning?
Definition of Terms
i. Phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates
the lines of evolutionary relationship of different kinds of species, organisms or
genes from a common ancestor (Baum, D., 2008).
ii. Phylogeny is the evolutionary relationship between organisms (Baum, D., 2008).
iii. Evolution analysis is the foundation of phylogenetic trees, with
cautionary notes (Brinkman, 2005).
iv. Content mining is defined as mining that includes non-textual content, of which
figure mining is a significant part (Mounce, 2014).
Significance of the Study
This research study hopes to advance knowledge on the automated classification of images
from PDF files or text files as phylogenetic trees or non-phylogenetic trees.
This research study is mainly focused on the rooted tree (cladogram) and the unrooted tree.
In conclusion, phylogenetics is the science of constructing hypotheses about the
evolutionary relationships of organisms in the form of phylogenetic trees. This project is
not concerned with the reconstruction of phylogenetic trees. Rather, it classifies images
in PDF files or text files as phylogenetic trees or non-phylogenetic trees by using a
machine learning algorithm.
CHAPTER TWO
LITERATURE REVIEW
As mentioned by Mounce (2012), millions of papers are now published each year, at an ever
growing rate, about phylogenetic trees. This is because the amount and diversity of species
with at least partial sequence information is rapidly increasing. Thus, phylogenetic trees
have become an integral part of various biological studies with the exponential increase of
sequence data generated by various classical and next generation sequencing studies
(Baum, D., 2008). This chapter is divided into a few sections. The first section focuses on
phylogenetic trees, and explains the meaning and purpose of phylogenetic trees and the
types of phylogenetic trees. The next section concentrates on features in images. It
emphasizes the features that are suitable for the image classification process. It also
reviews image recognition system frameworks.
Phylogenetic Tree
A phylogenetic tree, or evolutionary tree, is an illustrative representation of biological
entities that are associated by common descent, such as species or higher-level taxonomic
groups (Gregory, 2008). The phylogenetic tree provides a backbone for various other
biological studies (Baum, 2008). It is a graphical representation of a hypothesis about the
evolution of a species, with branches that separated, hybridized or terminated by
extinction. Readers can therefore read and understand the patterns of descent from
phylogenetic trees. However, phylogenetic trees do not indicate when species evolved or how
much genetic change occurred in a lineage (Kelly, Grenyer & Scotland, 2014). Likewise, it
should not be assumed from a phylogenetic tree that a taxon evolved from the taxon next to it.
Baum, Smith and Donovan (2005) stated that the phylogenetic tree is the most direct
illustration of the principle of common ancestry, which makes the phylogenetic tree crucial
for evolutionary theory. They were telling readers that a practical understanding of what a
phylogenetic tree represents is important for understanding the evolutionary relationships
of species. Thus, phylogenetic trees have become important in the evolution analysis of any
species, and biologists should increase their use of phylogenetic trees in the biological
sciences. Next, phylogenetic trees provide an efficient structure for organizing
biodiversity information. Moreover, they develop an accurate conception of the totality of
evolutionary history. Therefore, it is important for aspiring biologists to develop an
understanding of phylogenetic trees.
Types of Phylogenetic Tree
Phylogenetic trees can be divided into different kinds of trees. There are two main
categories, rooted phylogenetic trees and unrooted phylogenetic trees. Apart from the two
main categories, a phylogenetic tree can be drawn in several forms: the slanted cladogram
in Figure 4 (Phylogenetic tree, 2002), the rectangular cladogram in Figure 5 (Phylogenetic
tree, 2002) and the circular cladogram in Figure 6 (Phylogenetic tree, 2002). Roots can be
artificially added to unrooted trees by means of a species that unambiguously separated
early from the species being considered (Bacardit, 2009).
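The idea of adding a root to an unrooted tree by means of an early-diverging species (an outgroup) can be sketched as follows. This is a minimal illustration, not the procedure of any cited tool: the function names root_with_outgroup and lineage, the node label "root" and the toy taxa are all assumptions made for the example. The sketch places a new root on the branch leading to the outgroup and then orients every other branch away from that root.

```python
def root_with_outgroup(adjacency, outgroup):
    """Root an unrooted tree on the branch that leads to `outgroup`.

    `adjacency` maps each node to a list of its neighbours (undirected).
    A new node labelled "root" (a hypothetical label) is placed on the branch
    between the outgroup leaf and its single neighbour.
    Returns a {child: parent} map describing the rooted tree.
    """
    neighbours = adjacency[outgroup]
    if len(neighbours) != 1:
        raise ValueError("the outgroup must be a leaf of the unrooted tree")
    ingroup_side = neighbours[0]
    parent = {"root": None, outgroup: "root", ingroup_side: "root"}
    # Walk away from the outgroup so that every branch points towards the root.
    stack = [(ingroup_side, outgroup)]
    while stack:
        node, came_from = stack.pop()
        for nbr in adjacency[node]:
            if nbr != came_from:
                parent[nbr] = node
                stack.append((nbr, node))
    return parent


def lineage(parent, node):
    """Return the path from `node` up to the root of the rooted tree."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


# Toy unrooted tree: leaves A, B, C and the outgroup O; x and y are interior nodes.
unrooted = {
    "A": ["x"],
    "B": ["x"],
    "C": ["y"],
    "O": ["y"],
    "x": ["A", "B", "y"],
    "y": ["x", "C", "O"],
}

rooted = root_with_outgroup(unrooted, "O")
print(lineage(rooted, "A"))  # ['A', 'x', 'y', 'root']
```

Choosing a different outgroup changes which interior node sits next to the root, so the same unrooted tree can give different rooted readings.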
LIST OF FIGURES
Figure 1 The first evolution tree diagram sketched by Darwin
Figure 2 Cumulative number of publications in Science Citation Index since 1981 that cite the terms molecular and phylogeny in the keywords or abstract
Figure 3 Non-phylogenetic tree: family tree
Figure 4 Phylogenetic rooted tree: rectangular cladogram
Figure 5 Phylogenetic rooted tree: slanted diagram
Figure 6 Phylogenetic unrooted tree: circular cladogram
Figure 7 Phylogenetic scaled tree
Figure 8 Phylogenetic unscaled tree
Figure 9 A quick review of phylogenetic tree
Figure 10 Object detection in computer perception
Figure 11 Feature Representation
Figure 12 SIFT
Figure 13 RIFT
Figure 14 Spin image
Figure 15 Pre-processing of model objects
Figure 16 Recognition of object in the scene
Figure 17 TreeRipper
Figure 18 TreeSnatcher Plus
Figure 19 Windows Snipping Toolbox
Figure 20 Original 1.png
Figure 21 After Thresholding
Figure 22 Grayscale image
Figure 23 SURF Feature Detection and Extraction
Figure 24 GIST Feature Detection and Extraction
Figure 25 10-fold cross-validation accuracy
Figure 26 Example of Graphic User Interface for the Phylogenetic Tree Image Classification system
Figure 27 Graphic User Interface for the Phylogenetic Tree Image Classification system
ABSTRACT
This study develops an automated phylogenetic tree image classification system using a
machine learning algorithm. It adopts a supervised machine learning algorithm, the Support
Vector Machine (SVM), for classification. Image data were collected from the online
databases PUBMED and ScienceDirect and from bioinformatics journals. Performance
comparisons of three types of features that characterize phylogenetic tree images are
presented in this project. The aim is to determine the suitable features for the
phylogenetic tree image classification system. Leave-one-out cross-validation was used to
calculate the accuracy of each feature. In addition, 10-fold cross-validation was conducted
in the evaluation. Our results show that the suitable combination of features for the
phylogenetic tree image classification system is SIFT, SURF and GIST. The accuracy obtained
from the combination of these three features is just over 82%, and the average accuracy
obtained from the 10-fold cross-validation is 81.50%. Our evaluation results demonstrate
the utility of SIFT, SURF and GIST features for building a phylogenetic tree image
classification system.
Keywords: phylogenetic tree, image classification system, image processing, feature extraction,
SIFT, GIST, SURF
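The evaluation protocol named in the abstract (combined feature vectors scored by 10-fold and leave-one-out cross-validation) can be sketched in a few lines. The sketch below is an assumption-laden illustration rather than the system built in this study: a nearest-centroid rule stands in for the SVM, the feature values are made-up numbers rather than real SIFT, SURF or GIST descriptors, and all function names are hypothetical.

```python
def concat_features(*parts):
    """Concatenate per-image descriptor vectors (e.g. SIFT, SURF and GIST summaries)."""
    combined = []
    for part in parts:
        combined.extend(part)
    return combined


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; fold f tests samples f, f + k, f + 2k, ..."""
    for fold in range(k):
        test_idx = [i for i in range(n_samples) if i % k == fold]
        train_idx = [i for i in range(n_samples) if i % k != fold]
        yield train_idx, test_idx


def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_predict(train_x, train_y, test_x):
    """Stand-in classifier: nearest class centroid (NOT the SVM used in the study)."""
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = centroid(rows)
    return [
        min(centroids, key=lambda lab: squared_distance(x, centroids[lab]))
        for x in test_x
    ]


def cross_val_accuracy(features, labels, k):
    """Mean per-fold accuracy; k == len(features) gives leave-one-out."""
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(features), k):
        predicted = fit_predict(
            [features[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [features[i] for i in test_idx],
        )
        correct = sum(p == labels[i] for p, i in zip(predicted, test_idx))
        fold_scores.append(correct / len(test_idx))
    return sum(fold_scores) / len(fold_scores)


# Made-up, easily separable toy data: 1 = phylogenetic tree, 0 = non-phylogenetic tree.
features, labels = [], []
for i in range(20):
    label = i % 2
    sift_part = [label + 0.1 * (i % 3)]   # placeholder for a SIFT-based value
    surf_part = [2 * label - 0.05 * i]    # placeholder for a SURF-based value
    gist_part = [0.5]                     # placeholder for a GIST-based value
    features.append(concat_features(sift_part, surf_part, gist_part))
    labels.append(label)

ten_fold = cross_val_accuracy(features, labels, k=10)
leave_one_out = cross_val_accuracy(features, labels, k=len(features))
print(ten_fold, leave_one_out)
```

Because the toy data are deliberately easy to separate, the printed scores say nothing about real images; the point is only how the two cross-validation schemes partition the samples and average the per-fold accuracy.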
ABSTRAK
A study was conducted to produce an automated phylogenetic tree image classification
system using a machine learning algorithm. The study used a supervised machine learning
algorithm, the Support Vector Machine (SVM). Image data were collected from the online
databases PUBMED, ScienceDirect and Bioinformatics. A comparison of the performance of
three different features of phylogenetic trees is also presented in this project. The aim
is to determine the suitable features for the phylogenetic tree image classification
system. Leave-one-out cross-validation was used to calculate the accuracy of each feature.
In addition, 10-fold cross-validation was measured in this study. The results show that the
most suitable combination of features for the phylogenetic tree image classification system
is SIFT, SURF and GIST. The accuracy obtained from the combination of the three features
exceeds 82.19%. The results also show that the average accuracy obtained from the 10-fold
cross-validation is 81.50%. The results of this study demonstrate the use of the combined
SIFT, SURF and GIST features for implementing this phylogenetic tree classification system.
Keywords: image classification system, image processing, feature extraction, SIFT, GIST, SURF
CHAPTER ONE
INTRODUCTION
Overview
It is an undeniable fact that phylogenetic trees are widely used for the evolutionary
analysis of different species, organisms or genes from a common ancestor (Laubach, von
Haeseler & Lercher, 2012). According to Brinkman (2005), evolution analysis is a collection
of methods for determining long-term phenotypic evolution, developed during the 1990s.
Evolutionary analysis also refers to the foundation of most bioinformatic analysis, which
is evolutionary theory. This is because evolutionary analysis shows the ecological
characterization of species using the concept of frequency dependence from gene theory
(Brinkman, 2005). This chapter mainly discusses the background of the study, problem
statement, research objectives, research questions, hypothesis and conceptual framework of
the study, and significance of the study. In addition, this chapter gives the definitions
of relevant terms.
Introduction
The evolutionary tree, or phylogenetic tree, is a visualization that shows the
relationships between entities according to the similarities and differences in their
hereditary or physical characteristics (Baum, 2008). The way a phylogenetic tree shows the
relationships among species is therefore important. This is reflected in the use of
phylogenetic trees to present the evolution analysis of any species. Evolution analysis
generally includes the identification of homologous sequences, multiple alignment,
phylogenetic reconstruction and graphical representation of the inferred tree
(Dereeper et al., 2008). These four steps can be explained as follows. According to
Dereeper et al. (2008), the identification of homologous sequences finds similar sequences,
whereas multiple alignment determines the differences between the aligned sequences.
Phylogenetic reconstruction is the process of building the phylogenetic tree after the
first two steps, and the graphical representation shows the relationships between the
species in the phylogenetic tree (Dereeper et al., 2008). This explains the increasing use
of phylogenetic trees in the biological sciences, especially by biologists who carry out
evolution analysis of species. Therefore, the phylogenetic tree is important for the
evolution analysis of life on Earth.
Apart from that, phylogeny is the evolutionary history of a species or group of related
species (Pagel, 1999). Phylogeny can be regarded as part of systematics, the discipline
that classifies organisms (Siegel-Causey, Brooks & Funk, 1991). This is because
systematists use phylogeny to determine the evolutionary relationships of organisms.
According to Campbell and Reece (2008), a systematist is a professional who uses fossil,
molecular and genetic data to infer evolutionary relationships. They also discuss
PhyloCode, which can be used to depict a phylogenetic analysis as a branching phylogenetic
tree. A phylogenetic analysis is presented as a collection of nodes and branches. For
instance, taxa that are closely related in an evolutionary sense appear close to each
other, whereas taxa that are distantly related lie on different branches of the tree, far
from each other.
Background of the study
In 1859, Darwin produced the first illustration of a phylogenetic tree (Darwin, 1859).
Before that, in 1837, shortly after his famous five-year voyage as naturalist on the
Beagle, he sketched a tree diagram in his notebook (Darwin, 1859). As Figure 1 shows, the
simple sketch is remarkably similar to modern diagrams of phylogenies (Darwin, 1987).
[Figure: Darwin's notebook sketch of a branching tree diagram]
Figure 1. The first evolution tree diagram sketched by Darwin. Adapted from Charles Darwin's notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by Darwin, C., 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.). Adapted with permission.
[Figure: plot of the cumulative number of publications (vertical axis) against year, 1980 to 2000 (horizontal axis)]
Figure 2. Cumulative number of publications in Science Citation Index since 1981 that cite the terms molecular and phylogeny in the keywords or abstract. Adapted from Inferring the historical patterns of biological evolution, by Pagel, M., 1999, Nature, 401(6756), p. 844. Copyright 1999 by Pagel. Adapted with permission.
The first illustration of a phylogenetic tree accompanied the first scientific argument for
the theory of evolution by means of natural selection. Darwin (1998) stated, "The time will
come, I believe, though I shall not live to see it, when we shall have fairly true
genealogical trees of each great kingdom of Nature" (p. 18). He would have wished to see
how modern genetics has supported and confirmed his ideas (Darwin, 1859). Modern genetics
provides evidence not only of what happened in the course of evolution but of precisely how
living things evolve, and DNA now serves as the forensic evidence for evolution.
In fact, a few approaches were used to study the evolution of species before molecular
phylogenetics (Campbell & Reece, 2008). In the 1900s, immunochemical studies were used to
discover cross-reactions, which are stronger for closely related organisms. Next, between
the 1940s and the 1960s, biologists used protein sequencing, electrophoresis, DNA
hybridization and PCR, which contributed to a boom in molecular phylogeny. Meanwhile, after
the publication of The Origin of Species by Darwin, many other biologists came to accept
the idea of a universal Tree of Life (Darwin, 1987). Then, in the late 1970s, biologists
started to study the evolution of organisms by using molecular phylogeny. One example of a
German biologist who supported Darwin's Tree of Life was Ernst Haeckel (Larget, 2011).
Phylogenetic trees are very useful for biologists because they can use them to describe the
relations between living creatures, genomes and genes.
LIST OF FIGURES
Figure 24 GIST Feature Detection and Extraction
Figure 25 10-fold cross-validation accuracy
Figure 26 Example of Graphical User Interface for the Phylogenetic Tree Image Classification system
Figure 27 Graphical User Interface for the Phylogenetic Tree Image Classification system
ABSTRACT
A study was conducted to develop an automated phylogenetic tree image classification system
using a machine learning algorithm. The study adopted a supervised machine learning algorithm,
the Support Vector Machine (SVM), for classification. Image data were collected from the
online databases PUBMED, ScienceDirect and Bioinformatics journals. Performance
comparisons of three types of features used to characterize phylogenetic tree images are presented
in this project. The aim is to determine the suitable features for the phylogenetic tree image
classification system. Leave-one-out cross-validation was used to calculate the accuracy of
each feature. In addition, 10-fold cross-validation was conducted in the evaluation. Our
results show that the suitable combination of features for the phylogenetic tree image
classification system is SIFT, SURF and GIST. The accuracy obtained from the combination of
these three features is just over 82%. The average accuracy obtained from the 10-fold
cross-validation is 81.50%. Our evaluation results demonstrate the utility of SIFT, SURF and
GIST features for building a phylogenetic tree image classification system.
Keywords: phylogenetic tree, image classification system, image processing, feature extraction,
SIFT, GIST, SURF
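The evaluation protocol named in the abstract can be illustrated with a small sketch. It shows how k-fold cross-validation averages per-fold accuracy and how leave-one-out is the special case in which k equals the number of samples. Everything in it is an assumption made for illustration: the toy two-dimensional feature vectors stand in for SIFT, SURF or GIST descriptors, and a nearest-centroid rule stands in for the SVM, so the numbers it produces say nothing about the accuracies reported in this study.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def fit_centroids(features, labels):
    """Stand-in classifier (NOT an SVM): one mean vector per class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
    return min(centroids, key=sq_dist)


def cross_val_accuracy(features, labels, k):
    """Mean accuracy over k folds; k == len(features) is leave-one-out."""
    n = len(features)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of samples")
    scores = []
    for test_idx in k_fold_indices(n, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        model = fit_centroids([features[i] for i in train_idx],
                              [labels[i] for i in train_idx])
        hits = sum(predict(model, features[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


# Hypothetical toy data: two well separated clusters of 2-D "features".
toy_features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.2, 0.2],
                [5.0, 5.1], [5.2, 4.9], [4.8, 5.0], [5.1, 5.3], [4.9, 4.8]]
toy_labels = ["tree"] * 5 + ["non-tree"] * 5

five_fold = cross_val_accuracy(toy_features, toy_labels, k=5)
leave_one_out = cross_val_accuracy(toy_features, toy_labels, k=len(toy_features))
print(five_fold, leave_one_out)
```

Leave-one-out uses every sample as a test case exactly once, at the cost of training as many models as there are samples; 10-fold cross-validation trains only ten models, which is why both are often reported together.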
ABSTRAK
A study was conducted to produce an automated phylogenetic tree image classification system
using a machine learning algorithm. The study used a supervised machine learning algorithm,
namely the Support Vector Machine (SVM). Image data were collected from the online databases
PUBMED, ScienceDirect and Bioinformatics. A comparison of the performance of three different
phylogenetic tree features is also presented in this project. The aim is to determine the
suitable features for the phylogenetic tree image classification system. Leave-one-out
cross-validation was used to compute the accuracy of each feature. In addition, 10-fold
cross-validation was measured in this study. The results show that the most suitable
combination of features for the phylogenetic tree image classification system is SIFT, SURF
and GIST. The accuracy obtained from the combination of the three features exceeds 82.19%.
The results also show that the average accuracy obtained from 10-fold cross-validation is
81.50%. The results of this study demonstrate the use of the combined SIFT, SURF and GIST
features to implement this phylogenetic tree classification system.
Keywords: image classification system, image processing, feature extraction, SIFT, GIST, SURF
CHAPTER ONE
INTRODUCTION
Overview
Phylogenetic trees are widely used for the evolutionary analysis of different species,
organisms or genes descended from a common ancestor (Laubach, von Haeseler & Lercher 2012).
According to Brinkman (2005), evolutionary analysis is a collection of methods for determining
long-term phenotypic evolution that was developed during the 1990s. Evolutionary analysis
also rests on evolutionary theory, which is the foundation of most bioinformatic analysis.
This is because evolutionary analysis characterizes species ecologically using the concept of
frequency dependence from gene theory (Brinkman 2005). This chapter discusses the background
of the study, the problem statement, the research objectives, the research questions, the
hypothesis and conceptual framework of the study, and the significance of the study. In
addition, the chapter defines the relevant terms.
Introduction
The evolutionary tree, or phylogenetic tree, is a visualization that shows the relationships
between entities according to the similarities and differences in their hereditary or physical
characteristics (Baum 2008). The way a phylogenetic tree presents the relationships among
species is therefore important, because it is the means by which the evolutionary analysis of
any species is demonstrated. Evolutionary analysis generally includes the identification of
homologous sequences, their multiple alignment, phylogenetic reconstruction, and the graphical
representation of the inferred tree (Dereeper et al. 2008). These four steps can be explained
in biological terms. According to Dereeper et al. (2008), the identification of homologous
sequences finds similar sequences, whereas multiple alignment determines the differences
between the aligned sequences. Phylogenetic reconstruction is the process of building the
phylogenetic tree once the homologous sequences have been identified and aligned, and the
graphical representation shows the relationships between the species in the tree (Dereeper
et al. 2008). This reflects the increasing use of phylogenetic trees in the biological
sciences, especially by biologists who carry out evolutionary analysis of species. The
phylogenetic tree is therefore important for the evolutionary analysis of life on Earth.
Phylogeny is the evolutionary history of a species or group of related species (Pagel 1999).
It is closely tied to systematics, the discipline that classifies organisms (Siegel-Causey,
Brooks & Funk 1991), because systematists use phylogeny to determine the evolutionary
relationships of organisms. According to Campbell and Reece (2008), a systematist is a
professional who uses fossil, molecular and genetic data to infer evolutionary relationships.
They also describe PhyloCode, which can be used to express the results of a phylogenetic
analysis in terms of branching phylogenetic trees. A phylogenetic analysis is presented as a
collection of nodes and branches. For instance, taxa that are closely related in an
evolutionary sense appear close to each other, whereas distantly related taxa lie on different
branches of the tree, far from each other.
Background of the study
In 1859 Darwin published the first illustration of a phylogenetic tree (Darwin 1859). Before
that, in 1837, shortly after his famous five-year voyage as naturalist on the Beagle, he
sketched a tree diagram in his notebook (Darwin 1859). As Figure 1 shows, the simple sketch
was remarkably similar to modern diagrams of phylogenies (Darwin 1987).
Figure 1 The first evolution tree diagram sketched by Darwin. Adapted from Charles Darwin's
notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by
Darwin, C., 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.).
Adapted with permission.
Figure 2 Cumulative number of publications in the Science Citation Index since 1981 that cite
the terms molecular and phylogeny in the keywords or abstract (horizontal axis: year, 1980 to
2000). Adapted from Inferring the historical patterns of biological evolution, by Pagel, M.,
1999, Nature, 401(6756), p. 844. Copyright 1999 by Pagel. Adapted with permission.
The first illustration of a phylogenetic tree accompanied the first scientific argument for
the theory of evolution by means of natural selection. Darwin (1998) stated that "The time
will come, I believe, though I shall not live to see it, when we shall have fairly true
genealogical trees of each great kingdom of Nature" (p. 18). He would, in other words, have
welcomed the way modern genetics has supported and confirmed his ideas. He provided evidence
not only that evolution had happened but also of how living things evolve (Darwin 1859). The
forensic evidence for evolution used today is DNA, which was unknown in Darwin's time.
Several approaches were used to study the evolution of species before molecular phylogenetics
(Campbell & Reece 2008). In the early 1900s, immunochemical studies were used to show that
cross-reactions are stronger between closely related organisms. Between the 1940s and the
1960s, biologists used protein sequencing, electrophoresis and DNA hybridization; together
with the later development of PCR, these methods contributed to a boom in molecular phylogeny.
After Darwin published The Origin of Species, many other biologists came to accept the idea of
a universal Tree of Life (Darwin 1987). One German biologist who supported Darwin's Tree of
Life was Ernst Haeckel (Larget 2011). In the late 1970s, biologists started to study the
evolution of organisms using molecular phylogeny. Phylogenetic trees are very useful to
biologists because they can be used to describe the relationships between living creatures,
genomes and genes.
With the development of techniques for producing phylogenetic data, the number of studies
depicting phylogenies has exploded (Pagel 1999). The number of articles publishing phylogenies
based on gene-sequence information has been increasing exponentially, as Figure 2 shows
(Pagel 1999). The taxonomic groups covered by these phylogenies range from viruses to
bacteria, fungi, plants and animals (Campbell & Reece 2008). The phylogenetic tree has thus
become popular and important for the evolutionary analysis of organisms. The phylogenetic tree
is a branching diagram that shows the evolutionary relationships of organisms (Baum 2008).
Based on Darwin (1859), evolution refers to a natural process acting on populations. It can be
described as the change in the hereditary traits of biological populations over successive
generations.
Phylogeny can also show the similarities and differences in physical and hereditary traits.
This is because taxa that are joined together in the tree are thereby indicated to be
descendants of a common node (Gregory 2008). A phylogenetic tree can thus be regarded as
similar to a family tree. Moreover, phylogenetic trees are constructed from the similarities
or differences in the physical or genetic features of organisms. Until some years ago,
scientists constructed phylogenetic trees in the traditional way, which focused only on
physical features. Advances in technology have since led to the accumulation of huge amounts
of biological data (Wan & Che 2013). This may change the way biological studies are carried
out in various respects.
As mentioned by Wan and Che (2013), phylogenetic trees can be built using information on
interacting pathways. They applied hierarchical clustering to two domains of organisms,
eukaryotes and prokaryotes. Using interacting pathways can increase the effectiveness of
revealing the evolutionary relationships of species (Wan & Che 2013). A phylogenetic tree is
constructed using a variety of evidence, most commonly comparisons of DNA (Kaizhong, Jason T
& Dennis 1996). It is an undirected, acyclic, connected graph. The lengths of the branches
represent the time since the groups split from each other, and the internal nodes of the tree
represent ancestors. The exterior nodes are called leaves.
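The graph description above can be made concrete with a minimal sketch. It stores a small unrooted tree as an adjacency list, checks the two defining properties (connected and acyclic) and lists the leaves. The taxa A to D, the internal nodes x and y, and the function names are hypothetical and chosen only for illustration; the check assumes a simple graph in which every edge is listed in both directions.

```python
from collections import deque


def is_tree(adjacency):
    """Return True if the undirected graph is connected and acyclic."""
    nodes = list(adjacency)
    if not nodes:
        return False
    # Each undirected edge appears twice, once in each endpoint's list.
    n_edges = sum(len(neighbours) for neighbours in adjacency.values()) // 2
    # A connected graph is acyclic exactly when it has n - 1 edges.
    if n_edges != len(nodes) - 1:
        return False
    # Breadth-first search from any node must reach every node.
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(nodes)


def leaves(adjacency):
    """Exterior nodes: those attached to exactly one branch."""
    return sorted(n for n, neighbours in adjacency.items() if len(neighbours) == 1)


# Hypothetical unrooted tree: taxa A and B join at x, taxa C and D join at y.
unrooted = {
    "A": ["x"], "B": ["x"], "C": ["y"], "D": ["y"],
    "x": ["A", "B", "y"], "y": ["C", "D", "x"],
}

print(is_tree(unrooted))   # connected and acyclic
print(leaves(unrooted))    # the taxa at the tips of the tree
```

Branch lengths are left out of this sketch; they could be added by storing a weight for each edge alongside the neighbour name.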
Apart from constructing phylogenetic trees, a newer approach extracts phylogenetic tree data
from the published literature using content mining (Mounce 2012). The term can be explained by
splitting it into content and mining. Content can be anything, such as audio, video, metadata,
text and images. Mining refers to the extraction of large amounts of information from that
content. Extracting phylogenetic tree data from the literature relies on content mining rather
than text mining alone, because the content is more than just text (Mounce 2012).
In short, phylogenetic trees provide a framework that shows the evolution of features
(Baum 2008). Related species share many features in common. Phylogenetic trees are also used
in bio-prospecting, where phylogenetic information is exploited to target closely related
species in the search for a shared feature of interest (Kelly, Grenyer & Scotland 2014).
Phylogenetic trees are likewise useful in conservation evaluation, for choosing sets of
species that maximize the present utilitarian benefits of extant feature diversity as well as
the range of evolutionary trajectories in the future.
Problem Statement of the study
As publication databases grow, the number of published phylogenetic trees is also growing.
With the rapid accumulation of DNA sequence data, more and more phylogenetic trees are being
constructed (Pagel 1999). This makes it challenging and time consuming for a researcher to
search for relevant information (Dereeper et al. 2008). Moreover, published documents contain
various types of content, such as images, audio, art and tables. Search engines rely on the
text or captions associated with a figure to perform a search. Classifying phylogenetic tree
images one by one is therefore challenging and time consuming for the researcher, and a
biologist who struggles to find a particular phylogenetic tree may be delayed in their
research. Furthermore, phylogenetic trees were created to study the evolution of organisms,
whereas published phylogenetic trees are now mainly sought by biologists for reuse. An
automated digitization application that searches for phylogenetic trees is therefore needed,
because it can take over the very challenging manual task of determining whether an image is
a phylogenetic tree.
The main purpose of this project is therefore the automated classification of phylogenetic
tree images using a machine learning algorithm. The classification determines whether the
images in a PDF file or text file are phylogenetic trees or non-phylogenetic trees. Examples
of phylogenetic trees are the cladogram, the phenogram and tree terminology. Examples of
non-phylogenetic trees are the family tree, the life cycle of an organism and the flow chart.
Figure 3 shows an example of a non-phylogenetic tree, a family tree (Murdoch 2013).
Figure 3 Non-phylogenetic tree: family tree. Adapted from Murdoch Family Tree, by Murdoch, W.,
2013. Retrieved from http://www.williammurdoch.net/articles_12_Murdoch_family_tree.html
Copyright 2013 by Murdoch. Adapted with permission.
General Objective: The main objective of this research study is to employ a machine
learning algorithm that can classify images as phylogenetic trees or non-phylogenetic trees.
Specific Objective: The specific objectives of this study are
i. To employ machine learning that can predict whether an image represents a phylogenetic
tree.
ii. To compare and contrast the different features that represent a phylogenetic tree in an
image.
Research Question
i. Can a neural network be used to predict phylogenetic tree images?
ii. What discriminative features can be used for classifier learning?
Definition of terms
i. A phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates
the lines of evolutionary relationship of different species, organisms or genes from a
common ancestor (Baum 2008).
ii. Phylogeny is the evolutionary relationship between organisms (Baum 2008).
iii. Evolution analysis is the fundamental basis of phylogenetic trees, with cautionary
notes (Brinkman 2005).
iv. Content mining is defined here through figure mining, a significant part of it that
deals with non-textual content (Mounce 2014).
Significance of the study
This research study hopes to advance knowledge on the automated digitization of phylogenetic
tree images, classifying images from PDF or text files as phylogenetic trees or
non-phylogenetic trees.
This research study is mainly focused on the rooted tree (cladogram) and the unrooted tree.
In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary
relationships of organisms in the form of phylogenetic trees. This project is not concerned
with the reconstruction of phylogenetic trees. Rather, it classifies images in PDF or text
files as phylogenetic trees or non-phylogenetic trees using a machine learning algorithm.
CHAPTER TWO
LITERATURE REVIEW
As mentioned by Mounce (2012), millions of papers about phylogenetic trees are now published
each year, at an ever growing rate. This is because the amount and [illegible] of species
with at least partial sequence information is rapidly increasing. Phylogenetic trees have thus
become an integral part of various biological studies, given the exponential increase in
sequence data generated by classical and next-generation sequencing studies (Baum 2008). This
chapter is divided into several sections. The first section focuses on phylogenetic trees,
explaining their meaning and purpose and the types of phylogenetic tree. The next section
concentrates on image features and emphasizes the features that are suitable for the image
classification process. It also reviews image recognition system frameworks.
Phylogenetic Tree
A phylogenetic tree, or evolutionary tree, is an illustrative representation of biological
entities that are connected through common descent, such as species or higher-level taxonomic
groupings (Gregory 2008). Phylogenetic trees form a backbone for various other biological
studies (Baum 2008). A phylogenetic tree is therefore a graphical representation of a
hypothesis about the evolution of a species, with branches that have separated, hybridized or
terminated by extinction. Readers can read and understand patterns of descent from
phylogenetic trees. However, phylogenetic trees do not indicate when species evolved or how
much genetic change occurred in a lineage (Kelly, Grenyer & Scotland 2014). Nor should it be
assumed from a phylogenetic tree that a taxon evolved from the taxon next to it.
Baum, Smith and Donovan (2005) stated that the phylogenetic tree is the most direct
representation of the principle of common ancestry, which makes it crucial for evolutionary
theory. They stress that a practical understanding of what a phylogenetic tree represents is
important for understanding the evolutionary relationships of species. Phylogenetic trees are
thus important in the evolutionary analysis of any species, and biologists should increase
their use in the biological sciences. Phylogenetic trees also provide an efficient structure
for organizing biodiversity information, and they develop an accurate conception of the
totality of evolutionary history. It is therefore important for aspiring biologists to
develop an understanding of phylogenetic trees.
Types of Phylogenetic Tree
Phylogenetic trees can be divided into different kinds. There are two main categories, rooted
and unrooted phylogenetic trees. Apart from these two main categories, a phylogenetic tree can
be drawn in several forms: the slanted cladogram (Figure 4; Phylogenetic tree 2002), the
rectangular cladogram (Figure 5; Phylogenetic tree 2002) and the circular cladogram (Figure 6;
Phylogenetic tree 2002). A root can be added artificially to an unrooted tree by means of a
species that unambiguously separated early from the other species being considered
(Bacardit 2009).
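Rooting an unrooted tree with an early-diverging species (an outgroup) can be sketched as follows. The sketch places a new root node on the branch that attaches the outgroup and then orients every other branch away from it. The tree, the taxon names, the choice of D as outgroup and the label "root" are all hypothetical; this is one simple way to illustrate the idea, not the procedure used by any particular phylogenetics package.

```python
def root_with_outgroup(adjacency, outgroup, root_label="root"):
    """Return {parent: [children]} after rooting on the outgroup's branch.

    adjacency maps each node of an unrooted tree to its neighbours.
    outgroup must be a leaf, i.e. a node with exactly one neighbour.
    """
    if root_label in adjacency:
        raise ValueError("root label clashes with an existing node")
    if len(adjacency[outgroup]) != 1:
        raise ValueError("the outgroup must be a leaf of the tree")
    (attachment,) = adjacency[outgroup]
    # The new root splits the branch between the outgroup and the rest.
    children = {root_label: [outgroup, attachment]}
    # Walk away from the outgroup, recording each node's descendants.
    stack = [(attachment, outgroup)]
    while stack:
        node, parent = stack.pop()
        kids = [n for n in adjacency[node] if n != parent]
        if kids:
            children[node] = kids
        stack.extend((kid, node) for kid in kids)
    return children


# Hypothetical unrooted tree: taxa A and B join at x, taxa C and D join at y.
unrooted = {
    "A": ["x"], "B": ["x"], "C": ["y"], "D": ["y"],
    "x": ["A", "B", "y"], "y": ["C", "D", "x"],
}

rooted = root_with_outgroup(unrooted, outgroup="D")
for parent, kids in rooted.items():
    print(parent, "->", kids)
```

Choosing a different outgroup gives a different rooted tree from the same unrooted one, which is why the outgroup must be a species known to have separated early from the others.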
that exploited phylogenetic information to target closely related species to search for shared
feature of interest (Kelly Grenyer amp Scotland 2014) This shows that related species can search
for shared features in common Therefore the phylogenetic trees are useful for conservation
evaluation in choosing sets of species that can maximized the present utilitarian benefits of
extant feature diversity as well as the range of evolutionary trajectories in the future
Problem Statement of the study
With the increase volume of publication databases volume of the phylogenetic trees is
getting bigger It is because with the rapid accumulation of DNA sequence data more and more ~
phylogenetic trees are being constructed (Pagel 1999) It is technically leads to challenge and
time consuming for a researcher to search for relevant information (Dereeper et aI 2008) Next
the types of contents in these published documents are various such as images audio arts and
tables Search engines rely on texts or captions are often associated with a figure to perform a
search This makes the classification of the phylogenetic trees image one by one by the
6
researcher becoming challenging and waste of time Moreover if the biologist becomes
challenging and time consuming when searching for the particular phylogenetic tree this may
delay their research works Furtermore the purpose for the invented phylogentic trees is to study
the evolution analysis of the organisms Nowadays the presented phylogenetic tree mainly is
used to reuse purpose for those biologists Therefore the use of automated digization application
to search the phylogenetic trees for them is truthly needed It is because this can replace the very
challenging task of human works and determine whether an image is a phylogenetic tree
Therefore the main purpose of conducting this project is to do the automated digitation
of phylogenetic tree image classification by using machine learning algorithm This classification
is mainly focusing on the classification the images in pdf file or text file whether they are
phylogenetic tree or non-phylogenetic trees The examples of phylogenetic tree are cladogram
phenogram and tree terminology On the other hands the examples of non-phylogenetic trees are
the family tree life cycle of organisms and flow chart Figure 3 shows the pictures of non-
phylogenetic trees- family tree (Murdoch 2013)
7
Ebenezer Murdoch - CID - Riddick Grizel John Young shy CID shy Ann lowden 1761-1806 i 1761-1834 Mason II Shoemaker I I
Samuel lowden Murdoch -shyCID -- shy Jane Young 1784-1830 1787-1879 Shoemaker
John Muir Ebenezer Andrew McCulloch Jane John Coupland Margaret Murdoch Murdoch Murdoch Murdoch Murdoch Murdoch
1809 1810 -1864 1812 - l860 1816 - 1894 1818-1879 1820- Infant death C Uln Captai~ Shoemaker
James Murdoch shy CID shy Agnes Cumming
Mary Murdoch
1841-1929
1814 - 1900 ClJplaln
Jane Murdoch
1848-1924
Jeanie Muirhead - CID shy Samuel Murdoch -1914 I 1843-1917
Mil5UMaf1ller
1865middot1869 1867 1906 1870middot1916 Wntdeath Chimist
1873 - 1912 1e oftiagtr01 the TI14R1C
~tn these ApI~ 191 2
Agnes Murdoch
1850-1944
1818-1891
William Murdoch 1856-1906
John Murdoch lS57 -1907
uptain Iltolaquoxr
I~ tilt sea 101 In rlie ~ Ap111906 Apr~ 1907
Margaret Elisabeth Murdoch 1882 -1973
teacher headmislress
Samuel Jr - CID shy ~artha Murdoch Patience Scott
1880middot1950 Merchant
1891 middot1976
Samuel Scott Murdoch
Grizzel Samuel Alexander Charles Donaldson Murdoch Murdoch Murdoch Murdoch
1822 -1877 1824 -1888 1827 - 1868 1829 - 1860 Shoemaker uDtain uptltn
OJowrerlln ~nt Nwy
HI~ cxItnl~ ~
Figure 1 Non-phylogenetic tree- family tree Adapted from Murdoch Family Tree by Murdoch W 2013 Retrieved from httpwwwwilliammurdochnetJarticles_12_Murdoch_family_treehtml
Copyright 2013 by the Murdoch Adapted with permission
8
General Objective The main objective of this research study is to employ a machine
learning algorithm that can classify images into phylogenetic tree or non-phylogenetic trees
Specific Objective The specific objectives of this study are
i To employ machine learning that can predict phylogenetic tree that represent in the
Image
II To compare and contrast the different features that represent phylogenetic tree on
image
Research Question
I Can neural network be used for prediction of phylogenetic tree images
II What are the discriminative features can be used for classifier learning
I Phylogenetic tree or called as a phylogeny is a branch diagram that can illustrate
the lines of evolutionary relationships of different kinds of species organism or
genes from a common ancestor (Baum D 2008)
II Phylogeny is the evolution relationship between organisms (Baum D 2008)
1II Evolution analysis is the fundamentals or foremost of phylogenetic trees with
cautionary notes (Brinkman 2005)
iv Content Mining is defined as a significant part of figure mining which is nonshy
textual content (Mounce 2014)
9
This research study hopes to advance knowledge on the automated digitization images of
phylogenetic trees from the pdf file or text file as phylogenetic tree or non-phylogenetic tree
This research study is mainly focused on the rooted tree (c1adogram) and the unrooted
In conclusion phylogenetic is the science of constructing hypothesis related to the
Iutionary relationship of organisms in the fonn of phylogenetic tree Then this project is not
laquomeemed with the reconstruction of phylogenetic trees Rather it was doing the classification of
phylogenetic trees image in pdf file or text file whether it is phylogenetic trees or nonshy
ylogenetic trees by using machine learning algorithm
10
CHAPTER TWO
LITERATURE REVIEW
As mentioned by Mounce (2012) recently there are millions of papers published each
at an ever growing rate about the phylogenetic tree This is because the amount and
mvllICImiddothI of species with at least partial sequence information was rapidly increasing Thus
phylogenetic trees become an integral part of various biological studies with the exponential
iDcrease of sequence data which is being generated by various classical and next generation
sequence studies (Baum D 2008) This chapter divides into few sections The first section
tbcuses on phylogenetic trees which explain more on the meaning and purpose for the
ylogenetic trees and types of phylogenetic trees The next section concentrates on the feature
mimage This section also emphasizes on the suitable features that were suitable used for image
ification process Besides this section reviewed on image recognition system frameworks as
nvaoSEeoletic Tree
Phylogenetic tree or evolution tree is an illustrative representation of biological entities
were associated with common descent such as species or higher-level taxonomic
___pmJ~ (Gregory 2008) Phylogenetic tree represents a backbone for various other biological (8aum 2008) Therefore it is a graphical representation of a hypothesis about the
_tlon of a species with branches that separated hybridized or terminated by extinction
readers can read and understand the patterns of descent from the phylogenetic trees
the phylogenetic trees do not indicate when species evolved or how much genetic
11
CD8ogeoccurred in a lineage (Kelly Grenyer amp Scotland 2014) This is because phylogenetic
should not be assumed that a taxon can be evolved from the taxon next to it
Baum Smitch and Donovan (2005) stated that phylogenetic tree is the most direct
itllltrgttilln of the principle of common ancestry This is because phylogenetic tree is very crucial
r evolutionary theory In fact they were trying to tell the readers that practical understanding
ofwhat phylogenetic tree represented is really important in understand the evolution relationship
( the species Thus the phylogenetic trees become important in the evolution analysis of any
species as the biologists should increase the use of phylogentic trees in biological sciences Next
ylogenetic trees provides an efficient structure for organizing biodiversity info Moreover it
elopes accurate conception of totality of evolutionary history Therefore it is important for
aspiring biologists to develop the understanding of phylogenetic trees
of Phylogenetic Tree
Phylogenetic trees can be divided into different kinds of trees There were two main
ories including the phylogenetic rooted trees and the phylogenetic unrooted trees Apart
the two main categories the phylogenetic tree can represent in several form slanted
iIIIiIIIWJrlm Figure 4 (Phylogenetic tree 2002) rectangular cladogram Figure 5 (Phylogenetic
2002) and circular cladogram Figure 6 (Phylogenetic tree 2002) Roots can be artificially
to unrooted trees by means of a species that had unambiguously separated early from
species being considered (Bacardit 2009)
12
ABSTRACT
A study was carried out to produce an automatic classification system for phylogenetic tree
images using a machine learning algorithm. The study used a supervised machine learning
algorithm, the Support Vector Machine (SVM). Image data were collected from the online
databases PUBMED, ScienceDirect and Bioinformatics. The project also compares the
performance of three different phylogenetic tree features in order to determine which features
suit a phylogenetic tree image classification system. Leave-one-out cross-validation was used to
compute the accuracy of each feature, and 10-fold cross-validation was also measured. The
results show that the most suitable feature combination for the phylogenetic tree image
classification system is SIFT, SURF and GIST. Combining the three features gives an accuracy of
more than 82.19%, and the average accuracy obtained from 10-fold cross-validation is 81.50%.
These results support the use of the combined SIFT, SURF and GIST features to implement this
phylogenetic tree classification system.
Keywords: image classification system, image processing, feature extraction, SIFT, GIST, SURF
CHAPTER ONE
INTRODUCTION
Overview
Phylogenetic trees are widely used for the evolutionary analysis of different species, organisms
or genes descended from a common ancestor (Laubach, von Haeseler & Lercher, 2012). According
to Brinkman (2005), evolution analysis is a collection of techniques for studying long-term
phenotypic evolution that was developed during the 1990s. Evolutionary analysis also underlies
most bioinformatic analysis, because it rests on evolutionary theory: it characterizes species
ecologically using the concept of frequency dependence from gene theory (Brinkman, 2005). This
chapter discusses the background of the study, the problem statement, the research objectives,
the research questions, the hypothesis and conceptual framework, and the significance of the
study. It also defines the relevant terms.
Introduction
The evolutionary tree, or phylogenetic tree, is a visualization of the relationships among
entities according to the similarities and differences in their hereditary or physical
characteristics (Baum, 2008). How a phylogenetic tree presents the relationships among species
therefore matters, because this is how the tree conveys the evolution analysis of any species.
Evolution analysis generally comprises four steps: identification of homologous sequences,
multiple alignment, phylogenetic reconstruction, and graphical representation of the inferred
tree (Dereeper et al., 2008). According to Dereeper et al. (2008), the first step identifies similar
sequences and the second aligns them so that their differences can be determined. Phylogenetic
reconstruction then builds the tree from the aligned sequences, and the graphical representation
shows the relationships among the species in the tree (Dereeper et al., 2008). These steps reflect
the increasing use of phylogenetic trees in the biological sciences, especially by biologists who
analyse the evolution of species. The phylogenetic tree is therefore important for the evolution
analysis of life on Earth.
Phylogeny is the evolutionary history of a species or group of related species (Pagel, 1999). It
is closely tied to systematics, the discipline that classifies organisms (Siegel-Causey, Brooks &
Funk, 1991), because systematists use phylogeny to determine the evolutionary relationships of
organisms. According to Campbell and Reece (2008), a systematist is a professional who uses
fossil, molecular and genetic data to infer evolutionary relationships. They also describe
PhyloCode, which can be used to depict a phylogenetic analysis as a branching phylogenetic tree.
A phylogenetic analysis is presented as a collection of nodes and branches. Taxa that are closely
related in an evolutionary sense appear close to each other in the tree, whereas distantly related
taxa sit on different branches, far from each other.
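The idea that closely related taxa sit close together in the tree can be illustrated with a short sketch. It counts the branches on the path between two taxa, which is only one possible notion of closeness; the taxa, the internal node names n1 and n2, and the function name branch_distance are hypothetical and chosen purely for illustration.

```python
from collections import deque


def branch_distance(edges, start, goal):
    """Count the branches on the path between two nodes of a tree (breadth-first search)."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    if start not in adjacency or goal not in adjacency:
        raise ValueError("both nodes must appear in the tree")
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return distances[node]
        for neighbour in adjacency[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    raise ValueError("the two nodes are not connected")


# Hypothetical tree: n1 joins human and chimp, n2 joins mouse and rat.
TOY_TREE = [("n1", "human"), ("n1", "chimp"), ("n1", "n2"),
            ("n2", "mouse"), ("n2", "rat")]

close_pair = branch_distance(TOY_TREE, "human", "chimp")
distant_pair = branch_distance(TOY_TREE, "human", "mouse")
```

In this toy tree the human and chimp leaves are separated by fewer branches than the human and mouse leaves, which mirrors the statement above. Real analyses usually weight each branch by its length instead of counting branches.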
Background of the study
In 1859, Darwin published the first illustration of a phylogenetic tree (Darwin, 1859). Before
that, in 1837, shortly after his famous five-year voyage as naturalist on the Beagle, he sketched
a tree diagram in his notebook (Darwin, 1859). As Figure 1 shows, this simple sketch is
remarkably similar to modern diagrams of phylogenies (Darwin, 1987).
Figure 1. The first evolution tree diagram sketched by Darwin. Adapted from Charles Darwin's notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by C. Darwin, 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.). Adapted with permission.
Figure 2. Cumulative number of publications in the Science Citation Index since 1981 that cite the terms molecular and phylogeny in the keywords or abstract (horizontal axis: year, 1980 to 2000). Adapted from Inferring the historical patterns of biological evolution, by M. Pagel, 1999, Nature, 401(6756), p. 844. Copyright 1999 by Pagel. Adapted with permission.
The first illustration of a phylogenetic tree accompanied the first scientific argument for the
theory of evolution by means of natural selection. Darwin (1998) stated that "The time will
come, I believe, though I shall not live to see it, when we shall have fairly true genealogical
trees of each great kingdom of Nature" (p. 18). He would have welcomed seeing how modern
genetics supports and confirms his ideas. He provided evidence not only for what had happened
in the course of evolution but also for how living things evolve (Darwin, 1859). DNA, which
was unknown in Darwin's time, now supplies forensic evidence for evolution.
Several approaches were used to analyse the evolution of species before molecular
phylogenetics (Campbell & Reece, 2008). In the 1900s, immunochemical studies showed that
cross-reactions are stronger between closely related organisms. Between the 1940s and the
1960s, biologists used protein sequencing, electrophoresis, DNA hybridization and PCR, which
contributed to a boom in molecular phylogeny. Meanwhile, after Darwin published The Origin
of Species, many other biologists came to accept the idea of a universal Tree of Life (Darwin,
1987). One German biologist who supported Darwin's Tree of Life was Ernst Haeckel (Larget,
2011). In the late 1970s, biologists began to analyse the evolution of organisms using molecular
phylogeny. Phylogenetic trees are useful to biologists because they describe the relationships
between living creatures, genomes and genes.
With the development of techniques for producing phylogenetic data, the number of studies
depicting phylogenies has exploded (Pagel, 1999). The number of articles publishing phylogenies
based on gene-sequence information has been increasing exponentially; Figure 2 shows this
growth in the literature (Pagel, 1999). The taxonomic groups covered by these phylogenies range
from viruses to bacteria, fungi, plants and animals (Campbell & Reece, 2008). The phylogenetic
tree has thus become popular and important for the evolutionary analysis of organisms. A
phylogenetic tree is a branching diagram that shows the evolutionary relationships of organisms
(Baum, 2008). Following Darwin (1859), evolution refers to a natural process acting on
populations: the change in the hereditary traits of a biological population over successive
generations.
Phylogeny can also show the similarities and differences in physical and hereditary traits,
because taxa joined at the same node are indicated to have descended from that node (Gregory,
2008). In this respect a phylogenetic tree is similar to a family tree. The construction of
phylogenetic trees is based on the similarities or differences in physical or genetic features.
Until some years ago, scientists constructed phylogenetic trees in the traditional way, which
considered physical features only. Advances in technology have since led to the accumulation of
huge amounts of biological data (Wan & Che, 2013), which may change the way biological
studies are carried out in various respects.
As mentioned by Wan and Che (2013), phylogenetic trees can be built from information on
interacting pathways. They applied hierarchical clustering to two domains of organisms,
eukaryotes and prokaryotes. Using interacting pathways can make it more effective to reveal the
evolutionary relationships of species (Wan & Che, 2013). Phylogenetic trees are constructed
from a variety of evidence, most commonly by comparing DNA (Kaizhong, Jason T. & Dennis,
1996). A phylogenetic tree is an undirected, acyclic, connected graph. The lengths of its branches
represent the time since the groups split from each other, the interior nodes of the tree represent
ancestors, and the exterior nodes are called leaves.
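The graph description above can be checked mechanically. The following is a minimal sketch, not part of the original study, that tests whether a list of undirected edges is connected and acyclic and then lists its leaves; the function names is_tree and leaves and the example taxa are assumptions made for illustration.

```python
from collections import deque


def is_tree(edges):
    """Return True if the undirected edge list forms a connected, acyclic graph."""
    nodes = {node for edge in edges for node in edge}
    if not nodes:
        return False
    # A connected graph on n nodes is acyclic exactly when it has n - 1 edges.
    if len(edges) != len(nodes) - 1:
        return False
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen == nodes


def leaves(edges):
    """Return the exterior nodes (nodes touching exactly one branch), sorted by name."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return sorted(node for node, count in degree.items() if count == 1)


# Hypothetical unrooted tree with two interior nodes (n1, n2) and four taxa.
EXAMPLE_EDGES = [("n1", "human"), ("n1", "chimp"), ("n1", "n2"),
                 ("n2", "mouse"), ("n2", "rat")]
```

Adding an extra edge between two existing taxa creates a cycle, so is_tree rejects it. Branch lengths are left out here; they could be stored as a third element of each edge tuple.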
Apart from constructing phylogenetic trees, a newer approach extracts phylogenetic tree data
from the published literature. It uses content mining to extract the data (Mounce, 2012). The
term can be explained through its two parts. Content can be anything, such as audio, video,
metadata, text and images. Mining refers to extracting large amounts of information from that
content. Extracting phylogenetic tree data from the literature is a matter of content mining rather
than text mining, because the content is more than just text (Mounce, 2012).
In short, phylogenetic trees provide a framework that shows the evolution of features (Baum,
2008): related species share many similar features. Phylogenetic trees are also used in
bioprospecting, a strategy that exploits phylogenetic information to target closely related species
in the search for a shared feature of interest (Kelly, Grenyer & Scotland, 2014). Phylogenetic
trees are likewise useful in conservation evaluation, for choosing sets of species that maximize
both the present utilitarian benefits of extant feature diversity and the range of future
evolutionary trajectories.
Problem Statement of the study
As publication databases grow, the number of published phylogenetic trees is growing too,
because the rapid accumulation of DNA sequence data means that more and more phylogenetic
trees are being constructed (Pagel, 1999). Searching for relevant information has become
challenging and time consuming for researchers (Dereeper et al., 2008). The content of
published documents is also varied and includes images, audio, art and tables. Search engines
rely on the text or captions associated with a figure to perform a search, so a researcher who
needs phylogenetic tree images must classify them one by one, which is challenging and wastes
time. When finding a particular phylogenetic tree is difficult and slow, the biologist's research is
delayed. Furthermore, phylogenetic trees were devised to study the evolution of organisms, and
today published trees are mainly reused by other biologists. An automated digitization
application that searches for phylogenetic trees on their behalf is therefore needed: it can take
over this demanding manual task by determining whether an image is a phylogenetic tree.
The main purpose of this project is therefore to automate the classification of phylogenetic
tree images using a machine learning algorithm. The classification decides whether an image in
a PDF or text file is a phylogenetic tree or a non-phylogenetic tree. Examples of phylogenetic
trees are the cladogram, the phenogram and tree terminology. Examples of non-phylogenetic
trees are the family tree, the life cycle of an organism and the flow chart. Figure 3 shows a
non-phylogenetic tree, a family tree (Murdoch, 2013).
Figure 3. Non-phylogenetic tree: family tree. Adapted from Murdoch Family Tree, by W. Murdoch, 2013. Retrieved from http://www.williammurdoch.net/articles_12_Murdoch_family_tree.html. Copyright 2013 by Murdoch. Adapted with permission.
General Objective
The main objective of this research study is to employ a machine learning algorithm that can
classify images as phylogenetic trees or non-phylogenetic trees.
Specific Objective
The specific objectives of this study are:
i. To employ machine learning to predict whether an image represents a phylogenetic tree.
ii. To compare and contrast the different features that represent a phylogenetic tree in an
image.
Research Question
i. Can a neural network be used to predict phylogenetic tree images?
ii. Which discriminative features can be used for classifier learning?
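A classifier of the kind these objectives call for is commonly scored by k-fold cross-validation, and setting k to the number of samples gives leave-one-out cross-validation. The sketch below shows only that evaluation loop. It is an illustration under stated assumptions: the nearest-centroid classifier is a simple stand-in rather than the study's method, the two-dimensional feature vectors are invented placeholders for real image descriptors, and all function names are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds whose sizes differ by at most one."""
    if not 1 < k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def nearest_centroid_predict(train_x, train_y, sample):
    """Stand-in classifier: pick the label whose mean feature vector is closest."""
    best_label, best_dist = None, None
    for label in sorted(set(train_y)):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centre = [sum(column) / len(members) for column in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, centre))
        if best_dist is None or dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cross_validated_accuracy(features, labels, k):
    """Average the per-fold accuracy over k folds; k == len(features) is leave-one-out."""
    scores = []
    for fold in k_fold_indices(len(features), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(features) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        correct = sum(
            nearest_centroid_predict(train_x, train_y, features[i]) == labels[i]
            for i in fold
        )
        scores.append(correct / len(fold))
    return sum(scores) / len(scores)


# Invented toy data: label 1 = phylogenetic tree image, label 0 = other image.
TOY_FEATURES = [[0.9, 0.8], [0.1, 0.2], [0.8, 0.9], [0.2, 0.1], [0.85, 0.95], [0.15, 0.05]]
TOY_LABELS = [1, 0, 1, 0, 1, 0]

three_fold = cross_validated_accuracy(TOY_FEATURES, TOY_LABELS, k=3)
leave_one_out = cross_validated_accuracy(TOY_FEATURES, TOY_LABELS, k=len(TOY_FEATURES))
```

The folds here are contiguous, so real data should be shuffled first to keep both classes in every training split. Comparing feature sets then amounts to running the same loop once per feature representation and comparing the averaged accuracies.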
Definition of Terms
i. Phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the
lines of evolutionary relationship of different species, organisms or genes from a
common ancestor (Baum, 2008).
ii. Phylogeny is the evolutionary relationship between organisms (Baum, 2008).
iii. Evolution analysis is the fundamental basis of phylogenetic trees, to be applied with
caution (Brinkman, 2005).
iv. Content mining includes, as a significant part, figure mining, which deals with
non-textual content (Mounce, 2014).
This research study hopes to advance knowledge on the automated digitization of phylogenetic
tree images, that is, on classifying images from PDF or text files as phylogenetic or
non-phylogenetic trees.
This research study focuses mainly on the rooted tree (cladogram) and the unrooted tree.
In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary
relationships of organisms in the form of phylogenetic trees. This project is not concerned with
the reconstruction of phylogenetic trees. Rather, it classifies images in PDF or text files as
phylogenetic or non-phylogenetic trees using a machine learning algorithm.
CHAPTER TWO
LITERATURE REVIEW
As mentioned by Mounce (2012), millions of papers are published each year at an ever-growing
rate, including papers about phylogenetic trees. This is because the amount and diversity of
species with at least partial sequence information is increasing rapidly. Phylogenetic trees have
thus become an integral part of various biological studies, given the exponential increase in
sequence data generated by classical and next-generation sequencing studies (Baum, 2008). This
chapter is divided into a few sections. The first section focuses on phylogenetic trees: their
meaning, their purpose and their types. The next section concentrates on image features and
emphasizes the features suitable for the image classification process. It also reviews image
recognition system frameworks.
Phylogenetic Tree
A phylogenetic tree, or evolution tree, is an illustrative representation of biological entities
that are associated by common descent, such as species or higher-level taxonomic groups
(Gregory, 2008). The phylogenetic tree is a backbone for various other biological studies (Baum,
2008). It is a graphical representation of a hypothesis about the evolution of a species, with
branches that separate, hybridize or terminate in extinction. Readers can therefore read and
understand the patterns of descent from a phylogenetic tree. However, phylogenetic trees do not
indicate when species evolved or how much genetic change occurred in a lineage (Kelly, Grenyer
& Scotland, 2014). Nor should it be assumed from a phylogenetic tree that a taxon evolved from
the taxon next to it.
Baum, Smith and Donovan (2005) stated that the phylogenetic tree is the most direct
representation of the principle of common ancestry, which makes it crucial for evolutionary
theory. Their point is that a practical understanding of what a phylogenetic tree represents is
essential for understanding the evolutionary relationships of species. Phylogenetic trees are thus
important in the evolution analysis of any species, and biologists should make greater use of
them in the biological sciences. Phylogenetic trees also provide an efficient structure for
organizing biodiversity information and help develop an accurate conception of the totality of
evolutionary history. It is therefore important for aspiring biologists to develop an
understanding of phylogenetic trees.
Types of Phylogenetic Tree
Phylogenetic trees can be divided into different kinds. There are two main categories: rooted
and unrooted phylogenetic trees. Apart from these two categories, a phylogenetic tree can be
drawn in several forms: the slanted cladogram, Figure 4 (Phylogenetic tree, 2002); the
rectangular cladogram, Figure 5 (Phylogenetic tree, 2002); and the circular cladogram, Figure 6
(Phylogenetic tree, 2002). A root can be artificially added to an unrooted tree by means of a
species that unambiguously separated early from the other species being considered (Bacardit,
2009).
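Adding a root by means of an early-diverging species can be sketched as follows. The code places a new root node on the branch that leads to the chosen species and then orients every other branch away from that root. The function name root_with_outgroup, the node names and the example taxa are hypothetical, and the sketch assumes the input really is an unrooted tree given as a list of undirected edges.

```python
from collections import deque


def root_with_outgroup(edges, outgroup, root_name="root"):
    """Root an unrooted tree on the branch leading to the outgroup leaf.

    Returns a dict that maps each node to the sorted list of its children.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    if root_name in adjacency:
        raise ValueError("root_name is already used by a node")
    if outgroup not in adjacency or len(adjacency[outgroup]) != 1:
        raise ValueError("the outgroup must be a leaf of the tree")
    (ingroup_node,) = adjacency[outgroup]
    # Split the outgroup branch by inserting the new root node on it.
    adjacency[outgroup] = {root_name}
    adjacency[ingroup_node].discard(outgroup)
    adjacency[ingroup_node].add(root_name)
    adjacency[root_name] = {outgroup, ingroup_node}
    # Orient all branches away from the root with a breadth-first traversal.
    children = {}
    seen = {root_name}
    queue = deque([root_name])
    while queue:
        parent = queue.popleft()
        kids = sorted(node for node in adjacency[parent] if node not in seen)
        children[parent] = kids
        seen.update(kids)
        queue.extend(kids)
    return children


# Hypothetical unrooted tree; "fish" plays the early-diverging species.
UNROOTED = [("n1", "human"), ("n1", "chimp"), ("n1", "n2"),
            ("n2", "gorilla"), ("n2", "fish")]

rooted = root_with_outgroup(UNROOTED, "fish")
```

After rooting, the root has two children: the early-diverging species on one side and the remaining species on the other, so every interior node can be read as an ancestor of the nodes below it.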
12
CHAPTER ONE
INTRODUCTION
Overview
It is an undeniable fact that the phylogenetic trees are diffusely used for evolutionary
analysis of different species organisms or genes from a collaborative ancestor (Laubach von
Haeseler amp Lercher 2012) According to the Brinkman (2005) evolution analysis is a collection
of expedients for ascertainment long-term phenotypical evolution which developed during the
year of 1990s Evolutionary analysis also refers to foundation of most bioinformatic analysis
which is evolution theory This is because the evolutionary analysis shows the ecological
characterization of the species that uses the concept of frequency dependence from gene theory
(Brinkman 2005) This chapter mainly discusses about the background of the study problem
statements research objectives research questions hypothesis and conceptual framework of the
study and significance of the study In addition this chapter also describes the definition of
relevant terms
Introduction
The evolutionary tree, or phylogenetic tree, is a visualization of the relationships among entities according to the similarities and differences in their hereditary or physical characteristics (Baum, 2008). How a phylogenetic tree depicts the relationships among species therefore matters, because the tree is the means by which the evolutionary analysis of any species is presented. Evolutionary analysis generally includes the identification of analogous (homologous) sequences, their diverse calibration (multiple alignment), phylogenetic rebuilding (reconstruction), and the graphic representation of the inferred tree (Dereeper et al., 2008). According to Dereeper et al. (2008), the first step identifies similar sequences, whereas the second determines the differences in their alignment. Phylogenetic rebuilding is the process of building the phylogenetic tree once those two steps are complete, and the graphic representation shows the relationship between the species in the tree (Dereeper et al., 2008). This reflects the increasing use of phylogenetic trees in the biological sciences, especially by biologists who carry out evolutionary analysis of species. The phylogenetic tree is therefore important for the evolutionary analysis of life on Earth.
Phylogeny is the evolutionary history of a species or group of related species (Pagel, 1999). It belongs to systematics, the discipline that classifies organisms (Siegel-Causey, Brooks & Funk, 1991), because systematists use phylogeny to determine the evolutionary relationships of organisms. According to Campbell and Reece (2008), a systematist is a professional who uses fossil, molecular and genetic data to infer evolutionary relationships. They also describe PhyloCode, which can be used to depict a phylogenetic analysis as a branching phylogenetic tree. A phylogenetic analysis is presented as a collection of nodes and branches. Taxa that are closely related in an evolutionary sense appear close to each other in the tree, whereas distantly related taxa lie on different branches, far apart from each other.
Background of the study
In 1859, Darwin published the first illustration of a phylogenetic tree (Darwin, 1859). Earlier, in 1837, shortly after his famous five-year voyage as a naturalist on the Beagle, he had sketched a tree diagram in his notebook (Darwin, 1859). As Figure 1 shows, the simple sketch is remarkably similar to modern diagrams of phylogenies (Darwin, 1987).
Figure 1. The first evolutionary tree diagram sketched by Darwin. Adapted from Charles Darwin's notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by Darwin, C., 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.). Adapted with permission.
Figure 2. Cumulative number of publications in the Science Citation Index since 1981 that cite the terms "molecular" and "phylogeny" in the keywords or abstract (horizontal axis: year, 1980 to 2000). Adapted from "Inferring the historical patterns of biological evolution" by Pagel, M., 1999, Nature, 401(6756), p. 844. Copyright 1999 by Pagel. Adapted with permission.
The first illustration of a phylogenetic tree accompanied the first scientific argument for the theory of evolution by means of natural selection. Darwin (1998) stated: "The time will come, I believe, though I shall not live to see it, when we shall have fairly true genealogical trees of each great kingdom of Nature" (p. 18). He would have welcomed seeing how modern genetics has supported and confirmed his ideas. He provided evidence not only that evolution had happened but also of precisely how living things evolve. DNA, which was unknown in Darwin's lifetime, now provides forensic evidence for the evolution he described (Darwin, 1859).
Several approaches were used to study the evolution of species before molecular phylogenetics (Campbell & Reece, 2008). In the 1900s, immunochemical studies showed that cross-reactions are stronger between closely related organisms. Between the 1940s and the 1960s, biologists adopted protein sequencing, electrophoresis and DNA hybridization; together with the later invention of PCR, these methods contributed to a boom in molecular phylogeny. Meanwhile, after the publication of The Origin of Species, many other biologists came to accept the idea of a universal Tree of Life (Darwin, 1987). One of them was the German biologist Ernst Haeckel, who supported Darwin's Tree of Life (Larget, 2011). In the late 1970s, biologists began to study the evolution of organisms using molecular phylogeny. Phylogenetic trees are very useful to biologists because they describe the relationships between living creatures, genomes and genes.
With the development of techniques for phylogenetic data, the number of studies depicting phylogenies has exploded (Pagel, 1999). The number of articles publishing phylogenies based on gene-sequence information has been increasing exponentially, and Figure 2 shows this growth in publications (Pagel, 1999). The taxonomic groups covered by these phylogenies range from viruses to bacteria, fungi, plants and animals (Campbell & Reece, 2008). The phylogenetic tree has thus become popular and important for the evolutionary analysis of organisms. The phylogenetic tree is a branching diagram that shows the evolutionary relationships of organisms (Baum, 2008). Based on Darwin (1859), evolution is a natural process observed in populations: the change in the heritable traits of a biological population over successive generations.
Phylogeny can also show the similarities and differences in physical and hereditary traits. This is because taxa that are joined together in a tree are implied to have descended from a common node (Gregory, 2008). In this respect a phylogenetic tree is similar to a family tree. Moreover, phylogenetic trees are constructed from the similarities or differences in the physical or genetic features of organisms. In the past, scientists constructed phylogenetic trees in the traditional way, using only physical features. Advances in technology have since led to the accumulation of huge amounts of biological data (Wan & Che, 2013), which may change the way biological studies are conducted in various respects.
As Wan and Che (2013) note, phylogenetic trees can be built from information on interacting pathways. They applied hierarchical clustering to two domains of organisms, eukaryotes and prokaryotes. Using interacting pathways can make it more effective to reveal the evolutionary relationships of species (Wan & Che, 2013). Phylogenetic trees are constructed from a variety of evidence, most commonly by comparing DNA (Kaizhong, Jason T & Dennis, 1996). A phylogenetic tree is an undirected, acyclic, connected graph. The lengths of its branches represent the time since the groups split from each other, the interior nodes of the tree represent ancestors, and the exterior nodes are called leaves.
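The graph description above can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the taxa, the node names (root, anc1) and the branch lengths are hypothetical, and the helper functions are not part of this study. The sketch stores a tree as an undirected weighted graph, checks that it is connected and acyclic, lists its leaves, and sums branch lengths along the path between two taxa.

```python
from collections import deque

def build_adjacency(edges):
    """Return an undirected adjacency map {node: {neighbour: branch_length}}."""
    adj = {}
    for a, b, length in edges:
        adj.setdefault(a, {})[b] = length
        adj.setdefault(b, {})[a] = length
    return adj

def is_tree(adj):
    """True if the graph is connected and acyclic (edges == nodes - 1)."""
    if not adj:
        return False
    edge_count = sum(len(nbrs) for nbrs in adj.values()) // 2
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search to test connectivity
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return len(seen) == len(adj) and edge_count == len(adj) - 1

def leaves(adj):
    """Exterior nodes: nodes attached to exactly one branch."""
    return sorted(n for n, nbrs in adj.items() if len(nbrs) == 1)

def path_length(adj, start, goal):
    """Sum of branch lengths on the unique path between two nodes.

    Assumes is_tree(adj) is True, so there is exactly one path.
    """
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nbr, length in adj[node].items():
            if nbr != parent:
                stack.append((nbr, node, dist + length))
    return None

# Hypothetical toy tree; names and lengths are illustrative only.
EDGES = [
    ("root", "anc1", 2.0),
    ("anc1", "human", 1.0),
    ("anc1", "chimp", 1.0),
    ("root", "mouse", 3.0),
]
TREE = build_adjacency(EDGES)
```

In this toy tree the interior nodes root and anc1 play the role of ancestors, and the three named taxa are the leaves. Adding one more branch between two leaves would create a cycle, and is_tree would then return False.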
Apart from constructing phylogenetic trees, a newer approach extracts phylogenetic tree data from the published literature. This approach uses content mining (Mounce, 2012). The term can be explained in two parts. Content can be anything, such as audio, video, metadata, text and images. Mining refers to the large-scale extraction of data from that content. Extracting phylogenetic tree data from the literature is content mining rather than text mining, because the content is more than just text (Mounce, 2012).
In short, phylogenetic trees provide a framework that shows the evolution of features (Baum, 2008), and related species share many features. Phylogenetic trees are also used in bioprospecting, where an optimal strategy exploits phylogenetic information to target closely related species in the search for a shared feature of interest (Kelly, Grenyer & Scotland, 2014). Phylogenetic trees are likewise useful in conservation evaluation, for choosing sets of species that maximize the present utilitarian benefits of extant feature diversity as well as the range of evolutionary trajectories in the future.
Problem Statement of the study
As publication databases grow, the number of published phylogenetic trees grows with them: with the rapid accumulation of DNA sequence data, more and more phylogenetic trees are being constructed (Pagel, 1999). Searching for relevant information has therefore become challenging and time consuming for researchers (Dereeper et al., 2008). The content of published documents is also varied and includes images, audio, art and tables. Search engines rely on the text or captions associated with a figure to perform a search, so a researcher has to classify phylogenetic tree images one by one, which is challenging and wastes time. A biologist who struggles to find a particular phylogenetic tree may see their research delayed. Furthermore, phylogenetic trees are created to study the evolution of organisms, and biologists now mainly want to reuse trees that have already been published. An automated digitization application that searches for phylogenetic trees is therefore truly needed, because it can take over the challenging manual task of determining whether an image is a phylogenetic tree.
The main purpose of this project is therefore to automate the classification of phylogenetic tree images using a machine learning algorithm. The classification determines whether the images in a PDF file or text file are phylogenetic trees or non-phylogenetic trees. Examples of phylogenetic trees are the cladogram, the phenogram and tree terminology. Examples of non-phylogenetic trees are the family tree, the life cycle of an organism and the flow chart. Figure 3 shows an example of a non-phylogenetic tree, a family tree (Murdoch, 2013).
Figure 3. Non-phylogenetic tree: family tree. Adapted from Murdoch Family Tree by Murdoch, W., 2013. Retrieved from http://www.williammurdoch.net/articles_12_Murdoch_family_tree.html. Copyright 2013 by Murdoch. Adapted with permission.
General Objective
The main objective of this study is to employ a machine learning algorithm that can classify images as phylogenetic trees or non-phylogenetic trees.
Specific Objective
The specific objectives of this study are:
i. To employ machine learning to predict whether an image represents a phylogenetic tree.
ii. To compare and contrast the different features that represent a phylogenetic tree in an image.
Research Question
i. Can a neural network be used to predict phylogenetic tree images?
ii. Which discriminative features can be used for classifier learning?
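To make the first research question concrete, the following Python sketch trains a single sigmoid neuron, the simplest building block of a neural network, to separate two classes of images. It is a minimal illustration under stated assumptions, not the classifier built in this study: the two features (line_ratio and text_ratio), their values, and the training settings are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, epochs=500, lr=0.5):
    """Fit one sigmoid neuron with stochastic gradient descent on log loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss with respect to z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Return 1 (phylogenetic tree) or 0 (non-phylogenetic tree)."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0

# Hypothetical feature vectors: (line_ratio, text_ratio) per image.
# Both feature names and all values are invented for illustration.
SAMPLES = [
    (0.90, 0.10), (0.80, 0.20), (0.85, 0.15),  # labelled phylogenetic trees
    (0.30, 0.70), (0.20, 0.60), (0.40, 0.80),  # labelled non-phylogenetic trees
]
LABELS = [1, 1, 1, 0, 0, 0]

W, B = train(SAMPLES, LABELS)
```

A call such as predict(W, B, (0.95, 0.05)) then assigns an unseen feature vector to one of the two classes. Real images would first need a feature extraction step, which is the subject of the second research question.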
Definition of Terms
i. A phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the lines of evolutionary relationship among species, organisms or genes from a common ancestor (Baum, 2008).
ii. Phylogeny is the evolutionary relationship between organisms (Baum, 2008).
iii. Evolution analysis is the foundation of phylogenetic trees, to be read with cautionary notes (Brinkman, 2005).
iv. Content mining is mining that extends to non-textual content; figure mining is a significant part of it (Mounce, 2014).
Significance of the Study
This study hopes to advance knowledge of the automated digitization of images from PDF or text files by classifying them as phylogenetic trees or non-phylogenetic trees.
This study focuses mainly on the rooted tree (cladogram) and the unrooted tree.
In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary relationships of organisms in the form of phylogenetic trees. This project is not concerned with the reconstruction of phylogenetic trees. Rather, it classifies images in PDF or text files as phylogenetic trees or non-phylogenetic trees using a machine learning algorithm.
CHAPTER TWO
LITERATURE REVIEW
As mentioned by Mounce (2012), millions of papers about the phylogenetic tree are now published each year at an ever-growing rate. This is because the number of species with at least partial sequence information is increasing rapidly. Phylogenetic trees have thus become an integral part of various biological studies, driven by the exponential increase in sequence data generated by classical and next-generation sequencing studies (Baum, 2008). This chapter is divided into a few sections. The first section focuses on phylogenetic trees and explains their meaning, purpose and types. The next section concentrates on the features of an image and emphasizes the features that are suitable for the image classification process. It also reviews image recognition system frameworks.
Phylogenetic Tree
A phylogenetic tree, or evolutionary tree, is an illustrative representation of biological entities that are associated through common descent, such as species or higher-level taxonomic groupings (Gregory, 2008). The phylogenetic tree represents a backbone for various other biological studies (Baum, 2008). It is therefore a graphical representation of a hypothesis about the evolution of a species, with branches that separated, hybridized or were terminated by extinction. Readers can read and understand the patterns of descent from a phylogenetic tree. However, phylogenetic trees do not indicate when species evolved or how much genetic change occurred in a lineage (Kelly, Grenyer & Scotland, 2014). Nor should it be assumed from a phylogenetic tree that a taxon evolved from the taxon next to it.
Baum, Smith and Donovan (2005) stated that the phylogenetic tree is the most direct illustration of the principle of common ancestry, which is why the phylogenetic tree is crucial for evolutionary theory. They stress that a practical understanding of what a phylogenetic tree represents is important for understanding the evolutionary relationships of species. Phylogenetic trees have thus become important in the evolutionary analysis of any species, and biologists should increase their use of phylogenetic trees in the biological sciences. Phylogenetic trees also provide an efficient structure for organizing biodiversity information and develop an accurate conception of the totality of evolutionary history. It is therefore important for aspiring biologists to develop an understanding of phylogenetic trees.
Types of Phylogenetic Tree
Phylogenetic trees can be divided into different kinds. There are two main categories: rooted phylogenetic trees and unrooted phylogenetic trees. Apart from the two main categories, a phylogenetic tree can be represented in several forms: the slanted cladogram, Figure 4 (Phylogenetic tree, 2002); the rectangular cladogram, Figure 5 (Phylogenetic tree, 2002); and the circular cladogram, Figure 6 (Phylogenetic tree, 2002). Roots can be artificially added to unrooted trees by means of a species that unambiguously separated early from the species being considered (Bacardit, 2009).
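The rooting step described above, in which an early-diverging species (commonly called an outgroup) supplies the root, can be sketched in Python as follows. This is an illustrative sketch only: the taxa, the internal node labels x and y, and the function name are hypothetical, and the rooted tree is returned as nested tuples rather than in any standard file format.

```python
def root_with_outgroup(adj, outgroup):
    """Root an unrooted tree on the branch that leads to the outgroup leaf.

    adj maps each node to the set of its neighbours. The result is a nested
    tuple: (outgroup, everything else), with internal node labels dropped.
    """
    if len(adj[outgroup]) != 1:
        raise ValueError("the outgroup must be a leaf")
    (neighbour,) = adj[outgroup]

    def subtree(node, parent):
        # Walk away from the new root; sorted() keeps the output deterministic.
        children = [subtree(n, node) for n in sorted(adj[node]) if n != parent]
        return node if not children else tuple(children)

    return (outgroup, subtree(neighbour, outgroup))

# Hypothetical unrooted tree with four taxa and two internal nodes, x and y.
UNROOTED = {
    "human": {"x"},
    "chimp": {"x"},
    "mouse": {"y"},
    "fish": {"y"},
    "x": {"human", "chimp", "y"},
    "y": {"mouse", "fish", "x"},
}

ROOTED = root_with_outgroup(UNROOTED, "fish")
```

Here fish is assumed to have separated earliest, so the root is placed on its branch and the remaining taxa form a single group on the other side of the root.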
12
(2008) the analogous sequence is used to identify the similar sequence whereas the diverse
calibration is used to determine the difference of alignment Besides the phylogenetic rebuilding
is the process to build up the phylogenetic tree after the analogous seqence and the diverse
calibration process and then for the graphic representation or figure signification is used to show
the relationship between each species in the phylogenetic tree (Dereeper et aI 2008) This can
show that the increasing use of phylogenetic trees in biological sciences especially for biologists
who did the evolution analysis on the species Therefore the use of phylogentic tree is quite
important for the evolution analysis of life on Earth
Apart from that phylogeny is the evolutionary history of a species or group of related
species (Pagel 1999) The phylogeny can be called as the discipline of systematic classifies
organisms (Siegel-Causey Brooks amp Funk 1991) This is because phylogeny can be used to
determine organisms evolutionary relationship by systematist According to Campbell and Reece
(2008) the term systematist in this research refers to the professional who used fossil molecular
and genetic data to infer evolutionary relationships They also proposed PhyloCode which can
be used to depict the phylogenetic analysis in branching phylogenetic trees A phylogenetic
analysis presents as a collection of nodes and branch For instance the taxa that closely related
are in an evolutionary sense apppeared closely to each other whereas the taxa that distantly
related are in the different branches of the tree or there is a distance which is far from each other
in such tree
Background of the study
In the year of 1859 Darwin invented the first illustration of a phylogenetic tree (Darwin
1859) Before that shortly after his famous five years voyage as naturalist on Beagle in the year
2
2000
1000
of 1837 he sketched a tree diagram in his notebook (Darwin 1859) Based on the Figure I the
simple sketch was remarkably similar to modem diagrams of phylogenies (Darwin 1987)
9L-shy ~ ~ A 2$ ~laquo
~ r amp4 ~ lt- C ~ 7S _ ~ ~r p--~ -$ - 2gt
-z-a ~ ltZ- ~~-
~L-- F bull - L~ -~---r~ - - ~-------r rd 4=shy
Figure 1 The first evolution tree diagram sketched by Darwin Adapted from Charles Darwins
notebooks 1836-1844 Geology transmutation ospecies metaphysical enquiries (p 87) by Druwin c 1987 Cambridge Cambridge University Press Copyright 1987 by the P H Barrett (Ed) Adapted with pennission
o-l-lr=It=I-=-~=-lJ -------_ 1980 1985 1990 1995 2000
Year Figure 2 Cumulative number of publications in Science Citation index since 1981 that cite the tenn molecular and phylogeny in the keywords or abstract Adapted from Inferring the historical patterns of biological evolution by Pagel M 1999 Nature 401(6756) p 844 Copyright 1999 by the Pagel Adapted with pennission
3
First illustration of a phylogenetic tree is the first scientific argument for the theory of
advancement by means of innate selection Darwin (1998) stated that The time will come I
believe though I shall not live to see it when we shall have fairly true genealogical trees ofeach
great kingdom ofNature (p 18) In fact he mentioned that he would have the willingness to see
how modem genetics supported and confirmed by his owns ideas He provided evidence which
is not only for what had happened in the aspect of evolution but precisely how living things
evolve The forensic evidence he used for evolution was the DNA (Darwin 1859)
In fact there are few approaches used for discovering the evolution analysis of species
before the molecular phylogenetic (Campbell amp Reece 2008) In the year of 1990s
immunochemical studies were used to discover cross-reactions that stronger for closely related
organism Next in between the year of 1940s until 1960s biologists used the protein sequencing
method electrophoresis DNA hybridization and PCR that contributed to a boom in molecular
phylogeny On the other hand after publication of The Origin ofSpecies by Darwin many other
biologists came and accepted the truth of a universal Tree of Life (Darwin 1987) Then in the
late of 1970s biologists started to discover evolutionary analysis of organisms by using
molecular phylogeny One of the examples of experts from German biologists who supported
Darwins Tree of Life was the Ernst Haeckel (Larget 2011) It is very useful of using
phylogenetic trees for biologists because they can use them to describe the relations between
living creatures genomes atd genes
With the development of phylogenetic data technique there are the numbers of studies
depicting phylogenetic exploded (Pagel 1999) The number of articles publishing phylogenies
based on gene-sequence information has been increasing exponentially Figure 2 shows the data
aoalysis by using the phylogenetic tree (Pagel 1999) The phylogenies taxonomic group ranging
4
Pu~at Khidmat MaklulDlt Akademillt UN1VERSm MALAYSIA SUAWA)(
from viruses to bacteria fungi plans and animals (Campbell amp Reece 2008) Thus the
phylogenetic tree becomes popular and important for the evolutionary analysis of organisms
nowadays The phylogenetic tree is a branching diagram that shows the evolutionary relationship
of the organisms (Baum D 2008) Based on Darwin (1859) evolution refers to a natural
procedure to infer about the populations It can be described as the platfonn to show the
transformation in the hereditary traits of biological population over continuous generation
On the other hand phylogeny can show the similarities and differences in physical and
hereditary traits This is because there are the taxa that can attach together in the affinnation
which indicated to posse descendant from a node (Gregory 2008) Thus phylogenetic tree can
be concluded that it was similar to a family tree Moreover the construction of phylogenetic
trees is based on the similarities or differences of their physical or genetic features Few years
ago the scientists only used the tradition way which only focused on physical features of
constructing phylogenetic trees Luckily the advancement of high technologies has been led to
accumulation of huge amounts of biological data (Wan amp Che 2013) This may lead to the
changing towards the way of biological studies in various aspects
As mentioned by Wan and Che (2013) building phylogenetic trees can use the
information of interacting pathways They did apply the hierarchical clustering on two domains
of organisms which were eukaryotes and prokaryotes Using interacting pathway can increase
the effectiveness on revearing evolutionary relationships ofthe species (Wan amp Che 2013)
Phylogenetic tree was constructed using variety evidence such as generally comparing DNA
(Kaizhong Jason T amp Dennis 1996) It was an undirected acyclic connected graph Basically
the lengths of branches represented time since the groups split from each other and the node for
he tree is known as ancestors The set of exterior nodes are called leaves
5
Apart from constructing the phylogenetic tree the new approach nowadays can extract
the phylogenetic tree data from the literacture review In fact it is using the content mining to
extract the data from the literature review (Mounce 2012) Content mining can be split into
content and mining in explanation Content can be included anything such as the audio video
metadata text and image Besides the mining shows the huge number of data information
extraction from the content Extracting phylogenetic tree data from literacture review uses more
content mining than text mining because the content was more than just text (Mounce 2012)
In short phylogenetic trees provides a framework that shows the evolution of features
(Baum D 2008) This shows that the related species shared in many common of similar
features Next the phylogenetic trees also uses in bio-prospecting which is an optimal strategy
that exploited phylogenetic information to target closely related species to search for shared
feature of interest (Kelly Grenyer amp Scotland 2014) This shows that related species can search
for shared features in common Therefore the phylogenetic trees are useful for conservation
evaluation in choosing sets of species that can maximized the present utilitarian benefits of
extant feature diversity as well as the range of evolutionary trajectories in the future
Problem Statement of the study
With the increase volume of publication databases volume of the phylogenetic trees is
getting bigger It is because with the rapid accumulation of DNA sequence data more and more ~
phylogenetic trees are being constructed (Pagel 1999) It is technically leads to challenge and
time consuming for a researcher to search for relevant information (Dereeper et aI 2008) Next
the types of contents in these published documents are various such as images audio arts and
tables Search engines rely on texts or captions are often associated with a figure to perform a
search This makes the classification of the phylogenetic trees image one by one by the
6
researcher becoming challenging and waste of time Moreover if the biologist becomes
challenging and time consuming when searching for the particular phylogenetic tree this may
delay their research works Furtermore the purpose for the invented phylogentic trees is to study
the evolution analysis of the organisms Nowadays the presented phylogenetic tree mainly is
used to reuse purpose for those biologists Therefore the use of automated digization application
to search the phylogenetic trees for them is truthly needed It is because this can replace the very
challenging task of human works and determine whether an image is a phylogenetic tree
Therefore the main purpose of conducting this project is to do the automated digitation
of phylogenetic tree image classification by using machine learning algorithm This classification
is mainly focusing on the classification the images in pdf file or text file whether they are
phylogenetic tree or non-phylogenetic trees The examples of phylogenetic tree are cladogram
phenogram and tree terminology On the other hands the examples of non-phylogenetic trees are
the family tree life cycle of organisms and flow chart Figure 3 shows the pictures of non-
phylogenetic trees- family tree (Murdoch 2013)
7
Ebenezer Murdoch - CID - Riddick Grizel John Young shy CID shy Ann lowden 1761-1806 i 1761-1834 Mason II Shoemaker I I
Samuel lowden Murdoch -shyCID -- shy Jane Young 1784-1830 1787-1879 Shoemaker
John Muir Ebenezer Andrew McCulloch Jane John Coupland Margaret Murdoch Murdoch Murdoch Murdoch Murdoch Murdoch
1809 1810 -1864 1812 - l860 1816 - 1894 1818-1879 1820- Infant death C Uln Captai~ Shoemaker
James Murdoch shy CID shy Agnes Cumming
Mary Murdoch
1841-1929
1814 - 1900 ClJplaln
Jane Murdoch
1848-1924
Jeanie Muirhead - CID shy Samuel Murdoch -1914 I 1843-1917
Mil5UMaf1ller
1865middot1869 1867 1906 1870middot1916 Wntdeath Chimist
1873 - 1912 1e oftiagtr01 the TI14R1C
~tn these ApI~ 191 2
Agnes Murdoch
1850-1944
1818-1891
William Murdoch 1856-1906
John Murdoch lS57 -1907
uptain Iltolaquoxr
I~ tilt sea 101 In rlie ~ Ap111906 Apr~ 1907
Margaret Elisabeth Murdoch 1882 -1973
teacher headmislress
Samuel Jr - CID shy ~artha Murdoch Patience Scott
1880middot1950 Merchant
1891 middot1976
Samuel Scott Murdoch
Grizzel Samuel Alexander Charles Donaldson Murdoch Murdoch Murdoch Murdoch
1822 -1877 1824 -1888 1827 - 1868 1829 - 1860 Shoemaker uDtain uptltn
OJowrerlln ~nt Nwy
HI~ cxItnl~ ~
Figure 1 Non-phylogenetic tree- family tree Adapted from Murdoch Family Tree by Murdoch W 2013 Retrieved from httpwwwwilliammurdochnetJarticles_12_Murdoch_family_treehtml
Copyright 2013 by the Murdoch Adapted with permission
8
General Objective The main objective of this research study is to employ a machine
learning algorithm that can classify images into phylogenetic tree or non-phylogenetic trees
Specific Objective The specific objectives of this study are
i To employ machine learning that can predict phylogenetic tree that represent in the
Image
II To compare and contrast the different features that represent phylogenetic tree on
image
Research Question
I Can neural network be used for prediction of phylogenetic tree images
II What are the discriminative features can be used for classifier learning
I Phylogenetic tree or called as a phylogeny is a branch diagram that can illustrate
the lines of evolutionary relationships of different kinds of species organism or
genes from a common ancestor (Baum D 2008)
II Phylogeny is the evolution relationship between organisms (Baum D 2008)
1II Evolution analysis is the fundamentals or foremost of phylogenetic trees with
cautionary notes (Brinkman 2005)
iv Content Mining is defined as a significant part of figure mining which is nonshy
textual content (Mounce 2014)
9
This research study hopes to advance knowledge on the automated digitization images of
phylogenetic trees from the pdf file or text file as phylogenetic tree or non-phylogenetic tree
This research study is mainly focused on the rooted tree (c1adogram) and the unrooted
In conclusion phylogenetic is the science of constructing hypothesis related to the
Iutionary relationship of organisms in the fonn of phylogenetic tree Then this project is not
laquomeemed with the reconstruction of phylogenetic trees Rather it was doing the classification of
phylogenetic trees image in pdf file or text file whether it is phylogenetic trees or nonshy
ylogenetic trees by using machine learning algorithm
10
CHAPTER TWO
LITERATURE REVIEW
As mentioned by Mounce (2012) recently there are millions of papers published each
at an ever growing rate about the phylogenetic tree This is because the amount and
mvllICImiddothI of species with at least partial sequence information was rapidly increasing Thus
phylogenetic trees become an integral part of various biological studies with the exponential
iDcrease of sequence data which is being generated by various classical and next generation
sequence studies (Baum D 2008) This chapter divides into few sections The first section
tbcuses on phylogenetic trees which explain more on the meaning and purpose for the
ylogenetic trees and types of phylogenetic trees The next section concentrates on the feature
mimage This section also emphasizes on the suitable features that were suitable used for image
ification process Besides this section reviewed on image recognition system frameworks as
nvaoSEeoletic Tree
Phylogenetic tree or evolution tree is an illustrative representation of biological entities
were associated with common descent such as species or higher-level taxonomic
___pmJ~ (Gregory 2008) Phylogenetic tree represents a backbone for various other biological (8aum 2008) Therefore it is a graphical representation of a hypothesis about the
_tlon of a species with branches that separated hybridized or terminated by extinction
readers can read and understand the patterns of descent from the phylogenetic trees
the phylogenetic trees do not indicate when species evolved or how much genetic
11
CD8ogeoccurred in a lineage (Kelly Grenyer amp Scotland 2014) This is because phylogenetic
should not be assumed that a taxon can be evolved from the taxon next to it
Baum, Smith and Donovan (2005) stated that the phylogenetic tree is the most direct representation of the principle of common ancestry, which is why it is crucial for evolutionary theory. Their point is that a practical understanding of what a phylogenetic tree represents is essential to understanding the evolutionary relationships of species. Phylogenetic trees have thus become important in the evolutionary analysis of any species, and biologists should make greater use of them in the biological sciences. Phylogenetic trees also provide an efficient structure for organizing biodiversity information, and they help develop an accurate conception of the totality of evolutionary history. It is therefore important for aspiring biologists to develop an understanding of phylogenetic trees.
Types of Phylogenetic Tree
Phylogenetic trees can be divided into different kinds. There are two main categories: rooted phylogenetic trees and unrooted phylogenetic trees. Apart from these two main categories, a phylogenetic tree can be drawn in several forms: the slanted cladogram in Figure 4 (Phylogenetic tree, 2002), the rectangular cladogram in Figure 5 (Phylogenetic tree, 2002) and the circular cladogram in Figure 6 (Phylogenetic tree, 2002). A root can be added artificially to an unrooted tree by means of a species that unambiguously separated early from the species being considered (Bacardit, 2009).
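The rooting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the unrooted tree is given as a list of undirected edges, the early-diverging species (commonly called an outgroup) is a leaf of that tree, and the function name `root_with_outgroup`, the node labels and the `"root"` label are hypothetical and are not taken from the cited sources.

```python
from collections import defaultdict


def root_with_outgroup(edges, outgroup):
    """Root an unrooted tree on the edge that leads to `outgroup`.

    `edges` is a list of (node, node) pairs describing an unrooted tree.
    Returns a dict mapping each parent to its sorted list of children.
    The label "root" is a hypothetical name for the new root node and is
    assumed not to clash with an existing node label.
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    if len(adjacency[outgroup]) != 1:
        raise ValueError("outgroup must be a leaf of the unrooted tree")

    (neighbour,) = adjacency[outgroup]
    children = {"root": [outgroup, neighbour]}

    # Walk away from the new root, recording children as we go.
    stack = [(neighbour, outgroup)]
    while stack:
        node, came_from = stack.pop()
        kids = sorted(n for n in adjacency[node] if n != came_from)
        if kids:
            children[node] = kids
        stack.extend((kid, node) for kid in kids)
    return children


# Hypothetical unrooted tree: leaves A, B, C, O and interior nodes x, y.
example_edges = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("O", "y")]
rooted = root_with_outgroup(example_edges, "O")
```

Placing the root on the edge that leads to the outgroup makes the outgroup one child of the root and the rest of the tree the other child, which is the usual reading of a tree rooted in this way.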
of 1837 he sketched a tree diagram in his notebook (Darwin, 1859). As Figure 1 shows, the simple sketch is remarkably similar to modern diagrams of phylogenies (Darwin, 1987).
Figure 1. The first evolutionary tree diagram sketched by Darwin. Adapted from Charles Darwin's notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by Darwin, C., 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.). Adapted with permission.
Figure 2. Cumulative number of publications in the Science Citation Index since 1981 that cite the terms "molecular" and "phylogeny" in the keywords or abstract, plotted by year. Adapted from "Inferring the historical patterns of biological evolution," by Pagel, M., 1999, Nature, 401(6756), p. 844. Copyright 1999 by Pagel. Adapted with permission.
The first illustration of a phylogenetic tree accompanied the first scientific argument for the theory of evolution by means of natural selection. Darwin (1998) stated that "The time will come, I believe, though I shall not live to see it, when we shall have fairly true genealogical trees of each great kingdom of Nature" (p. 18). He would have welcomed the way modern genetics has supported and confirmed his ideas. He provided evidence not only for what had happened in evolution but also for how living things evolve (Darwin, 1859). The forensic evidence that modern genetics adds is DNA, which was unknown in Darwin's time.
Several approaches were used to analyse the evolution of species before molecular phylogenetics (Campbell & Reece, 2008). In the 1900s, immunochemical studies were used to detect cross-reactions, which are stronger for closely related organisms. Between the 1940s and the 1960s, biologists used protein sequencing, electrophoresis and DNA hybridization, and later PCR, all of which contributed to a boom in molecular phylogeny. After the publication of The Origin of Species by Darwin, many other biologists came to accept the idea of a universal Tree of Life (Darwin, 1987). One example is the German biologist Ernst Haeckel, who supported Darwin's Tree of Life (Larget, 2011). From the late 1970s, biologists began to analyse the evolution of organisms by using molecular phylogeny. Phylogenetic trees are very useful to biologists because they can be used to describe the relationships between living creatures, genomes and genes.
With the development of techniques for phylogenetic data, the number of studies depicting phylogenies has exploded (Pagel, 1999). The number of articles publishing phylogenies based on gene-sequence information has been increasing exponentially, as shown in Figure 2 (Pagel, 1999). These phylogenies cover taxonomic groups ranging from viruses to bacteria, fungi, plants and animals (Campbell & Reece, 2008). The phylogenetic tree has thus become popular and important for the evolutionary analysis of organisms. A phylogenetic tree is a branching diagram that shows the evolutionary relationships of organisms (Baum, D., 2008). Following Darwin (1859), evolution refers to a natural process acting on populations: the change in the hereditary traits of biological populations over successive generations.
A phylogeny can also show similarities and differences in physical and hereditary traits. This is because taxa that are joined together in the tree are thereby indicated to descend from a common node (Gregory, 2008). In this respect a phylogenetic tree is similar to a family tree. The construction of phylogenetic trees is based on similarities or differences in the physical or genetic features of organisms. In the past, scientists constructed phylogenetic trees in the traditional way, which focused only on physical features. Advances in technology have since led to the accumulation of huge amounts of biological data (Wan & Che, 2013), which may change the way biological studies are carried out in various respects.
As mentioned by Wan and Che (2013), phylogenetic trees can be built from information on interacting pathways. They applied hierarchical clustering to two groups of organisms, eukaryotes and prokaryotes. Using interacting pathways can make it more effective to reveal the evolutionary relationships of species (Wan & Che, 2013). Phylogenetic trees are constructed from a variety of evidence, most commonly by comparing DNA (Kaizhong, Jason T., & Dennis, 1996). Formally, a phylogenetic tree is an undirected, acyclic, connected graph. The lengths of its branches represent the time since the groups split from each other, the interior nodes of the tree represent ancestors, and the exterior nodes are called leaves.
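The graph view of a phylogenetic tree can be checked mechanically. The sketch below is a minimal illustration, not part of the study: it assumes the tree is supplied as a list of node labels and a list of undirected edges, and the function names `is_tree` and `leaves` are hypothetical. It relies on the standard fact that a connected undirected graph with n nodes and n - 1 edges contains no cycle.

```python
def is_tree(nodes, edges):
    """Return True if the undirected graph is connected and acyclic.

    Assumes every edge endpoint appears in `nodes`.
    """
    nodes = set(nodes)
    # A tree on n nodes has exactly n - 1 edges.
    if not nodes or len(edges) != len(nodes) - 1:
        return False

    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    # Depth-first search from an arbitrary node to test connectivity.
    seen = set()
    stack = [next(iter(nodes))]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adjacency[node] - seen)
    return seen == nodes


def leaves(nodes, edges):
    """Return the exterior nodes, i.e. the nodes with exactly one edge."""
    degree = {n: 0 for n in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(n for n, d in degree.items() if d == 1)


# Hypothetical example: three species (A, B, C) and two interior nodes.
example_nodes = ["A", "B", "C", "x", "y"]
example_tree_edges = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y")]
```

In this example the leaves are the three species, while `x` and `y` play the role of ancestors.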
Apart from constructing phylogenetic trees, a newer approach is to extract phylogenetic tree data from the published literature by means of content mining (Mounce, 2012). The term can be explained in two parts. Content can include anything, such as audio, video, metadata, text and images. Mining refers to the large-scale extraction of information from that content. Extracting phylogenetic tree data from the literature relies on content mining rather than text mining alone, because the content is more than just text (Mounce, 2012).
In short, phylogenetic trees provide a framework that shows the evolution of features (Baum, D., 2008): related species share many similar features. Phylogenetic trees are also used in bio-prospecting, a strategy that exploits phylogenetic information to target closely related species when searching for a shared feature of interest (Kelly, Grenyer, & Scotland, 2014). Phylogenetic trees are likewise useful in conservation evaluation, for choosing sets of species that maximize the present utilitarian benefits of extant feature diversity as well as the range of evolutionary trajectories in the future.
Problem Statement of the study
As publication databases grow, the volume of published phylogenetic trees is growing with them, because the rapid accumulation of DNA sequence data means that more and more phylogenetic trees are being constructed (Pagel, 1999). This makes it challenging and time consuming for a researcher to search for relevant information (Dereeper et al., 2008). The content of published documents is also varied and includes images, audio, art and tables. Search engines rely on the text or caption associated with a figure to perform a search, so a researcher is left to classify phylogenetic tree images one by one, which is challenging and wastes time. Such slow searches for a particular phylogenetic tree may delay a biologist's research. Phylogenetic trees are created to study the evolution of organisms, and published trees are now mainly valuable to biologists for reuse. An automated digitization application that searches for phylogenetic trees on their behalf is therefore truly needed, because it can replace a very challenging manual task by determining whether an image is a phylogenetic tree.
The main purpose of this project is therefore to automate the classification of phylogenetic tree images by using a machine learning algorithm. The classification decides whether the images in a PDF or text file are phylogenetic trees or non-phylogenetic trees. Examples of phylogenetic trees are cladograms, phenograms and tree terminology. Examples of non-phylogenetic trees are family trees, life cycles of organisms and flow charts. Figure 3 shows a non-phylogenetic tree, in this case a family tree (Murdoch, 2013).
Figure 3. Non-phylogenetic tree: family tree. Adapted from "Murdoch Family Tree," by Murdoch, W., 2013. Retrieved from http://www.williammurdoch.net/articles_12_Murdoch_family_tree.html Copyright 2013 by Murdoch. Adapted with permission.
General Objective
The main objective of this research study is to employ a machine learning algorithm that can classify images as phylogenetic trees or non-phylogenetic trees.
Specific Objective
The specific objectives of this study are:
i. To employ machine learning that can predict whether an image represents a phylogenetic tree.
ii. To compare and contrast the different features that represent a phylogenetic tree in an image.
Research Question
i. Can a neural network be used for the prediction of phylogenetic tree images?
ii. Which discriminative features can be used for classifier learning?
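To make the two research questions concrete, the sketch below trains a single perceptron, the simplest neural unit, on toy feature vectors. It is an illustration only and not the classifier used in this study: the two features (the share of an image covered by straight line segments and the share covered by text boxes), the toy values, and the function names `predict` and `train_perceptron` are all hypothetical assumptions.

```python
def predict(weights, bias, features):
    """Return 1 (phylogenetic tree) or 0 (non-phylogenetic tree)."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train_perceptron(samples, labels, epochs=20, rate=1.0):
    """Train one perceptron with the classic error-correction rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = label - predict(weights, bias, features)
            if error:
                # Move the decision boundary towards the misclassified sample.
                weights = [w + rate * error * x
                           for w, x in zip(weights, features)]
                bias += rate * error
    return weights, bias


# Hypothetical features: (line_segment_share, text_box_share).
toy_samples = [(0.9, 0.1), (0.8, 0.2), (0.2, 0.8), (0.1, 0.9)]
toy_labels = [1, 1, 0, 0]  # 1 = phylogenetic tree, 0 = non-phylogenetic tree

toy_weights, toy_bias = train_perceptron(toy_samples, toy_labels)
toy_predictions = [predict(toy_weights, toy_bias, s) for s in toy_samples]
```

A single perceptron can only separate classes with a straight line in feature space, so it serves here as a baseline for judging how discriminative a candidate feature set is. A multi-layer network would be needed for classes that are not linearly separable.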
i. A phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the lines of evolutionary relationship among different species, organisms or genes descended from a common ancestor (Baum, D., 2008).
ii. Phylogeny is the evolutionary relationship between organisms (Baum, D., 2008).
iii. Evolution analysis is the fundamental basis of phylogenetic trees, and it is to be read with cautionary notes (Brinkman, 2005).
iv. Content mining includes figure mining, the mining of non-textual content, as a significant part (Mounce, 2014).
This research study hopes to advance knowledge of the automated digitization of phylogenetic tree images, that is, the classification of images from PDF or text files as phylogenetic trees or non-phylogenetic trees. The study focuses mainly on the rooted tree (cladogram) and the unrooted tree.
In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary relationships of organisms in the form of phylogenetic trees. This project is not concerned with the reconstruction of phylogenetic trees. Rather, it classifies images in PDF or text files as phylogenetic trees or non-phylogenetic trees by using a machine learning algorithm.
CHAPTER TWO
LITERATURE REVIEW
As mentioned by Mounce (2012) recently there are millions of papers published each
at an ever growing rate about the phylogenetic tree This is because the amount and
mvllICImiddothI of species with at least partial sequence information was rapidly increasing Thus
phylogenetic trees become an integral part of various biological studies with the exponential
iDcrease of sequence data which is being generated by various classical and next generation
sequence studies (Baum D 2008) This chapter divides into few sections The first section
tbcuses on phylogenetic trees which explain more on the meaning and purpose for the
ylogenetic trees and types of phylogenetic trees The next section concentrates on the feature
mimage This section also emphasizes on the suitable features that were suitable used for image
ification process Besides this section reviewed on image recognition system frameworks as
nvaoSEeoletic Tree
Phylogenetic tree or evolution tree is an illustrative representation of biological entities
were associated with common descent such as species or higher-level taxonomic
___pmJ~ (Gregory 2008) Phylogenetic tree represents a backbone for various other biological (8aum 2008) Therefore it is a graphical representation of a hypothesis about the
_tlon of a species with branches that separated hybridized or terminated by extinction
readers can read and understand the patterns of descent from the phylogenetic trees
the phylogenetic trees do not indicate when species evolved or how much genetic
11
CD8ogeoccurred in a lineage (Kelly Grenyer amp Scotland 2014) This is because phylogenetic
should not be assumed that a taxon can be evolved from the taxon next to it
Baum Smitch and Donovan (2005) stated that phylogenetic tree is the most direct
itllltrgttilln of the principle of common ancestry This is because phylogenetic tree is very crucial
r evolutionary theory In fact they were trying to tell the readers that practical understanding
ofwhat phylogenetic tree represented is really important in understand the evolution relationship
( the species Thus the phylogenetic trees become important in the evolution analysis of any
species as the biologists should increase the use of phylogentic trees in biological sciences Next
ylogenetic trees provides an efficient structure for organizing biodiversity info Moreover it
elopes accurate conception of totality of evolutionary history Therefore it is important for
aspiring biologists to develop the understanding of phylogenetic trees
of Phylogenetic Tree
Phylogenetic trees can be divided into different kinds of trees There were two main
ories including the phylogenetic rooted trees and the phylogenetic unrooted trees Apart
the two main categories the phylogenetic tree can represent in several form slanted
iIIIiIIIWJrlm Figure 4 (Phylogenetic tree 2002) rectangular cladogram Figure 5 (Phylogenetic
2002) and circular cladogram Figure 6 (Phylogenetic tree 2002) Roots can be artificially
to unrooted trees by means of a species that had unambiguously separated early from
species being considered (Bacardit 2009)
12
Pu~at Khidmat MaklulDlt Akademillt UN1VERSm MALAYSIA SUAWA)(
from viruses to bacteria fungi plans and animals (Campbell amp Reece 2008) Thus the
phylogenetic tree becomes popular and important for the evolutionary analysis of organisms
nowadays The phylogenetic tree is a branching diagram that shows the evolutionary relationship
of the organisms (Baum D 2008) Based on Darwin (1859) evolution refers to a natural
procedure to infer about the populations It can be described as the platfonn to show the
transformation in the hereditary traits of biological population over continuous generation
On the other hand phylogeny can show the similarities and differences in physical and
hereditary traits This is because there are the taxa that can attach together in the affinnation
which indicated to posse descendant from a node (Gregory 2008) Thus phylogenetic tree can
be concluded that it was similar to a family tree Moreover the construction of phylogenetic
trees is based on the similarities or differences of their physical or genetic features Few years
ago the scientists only used the tradition way which only focused on physical features of
constructing phylogenetic trees Luckily the advancement of high technologies has been led to
accumulation of huge amounts of biological data (Wan amp Che 2013) This may lead to the
changing towards the way of biological studies in various aspects
As mentioned by Wan and Che (2013) building phylogenetic trees can use the
information of interacting pathways They did apply the hierarchical clustering on two domains
of organisms which were eukaryotes and prokaryotes Using interacting pathway can increase
the effectiveness on revearing evolutionary relationships ofthe species (Wan amp Che 2013)
Phylogenetic tree was constructed using variety evidence such as generally comparing DNA
(Kaizhong Jason T amp Dennis 1996) It was an undirected acyclic connected graph Basically
the lengths of branches represented time since the groups split from each other and the node for
he tree is known as ancestors The set of exterior nodes are called leaves
5
Apart from constructing phylogenetic trees, a newer approach extracts phylogenetic tree data
from the published literature using content mining (Mounce, 2012). The term can be explained in
two parts, content and mining. Content can include anything, such as audio, video, metadata,
text and images. Mining refers to the large-scale extraction of data and information from that
content. Extracting phylogenetic tree data from the literature relies on content mining rather
than text mining alone, because the content is more than just text (Mounce, 2012).
In short, phylogenetic trees provide a framework that shows the evolution of features
(Baum, D., 2008): related species share many similar features. Phylogenetic trees are also used
in bio-prospecting, a strategy that exploits phylogenetic information to target closely related
species in the search for a shared feature of interest (Kelly, Grenyer, & Scotland, 2014).
Phylogenetic trees are likewise useful for conservation evaluation, in choosing sets of species
that maximize the present utilitarian benefits of extant feature diversity as well as the range
of evolutionary trajectories in the future.
Problem Statement of the study
With the growing volume of publication databases, the number of published phylogenetic trees
is also increasing, because the rapid accumulation of DNA sequence data means that more and
more phylogenetic trees are being constructed (Pagel, 1999). This makes it challenging and time
consuming for a researcher to search for relevant information (Dereeper et al., 2008). In
addition, published documents contain various types of content, such as images, audio, art and
tables. Search engines rely on the text or captions associated with a figure to perform a
search, so a researcher who needs phylogenetic tree images must classify them one by one, which
is challenging and a waste of time. If biologists find it difficult and time consuming to
locate a particular phylogenetic tree, their research work may be delayed. Furthermore,
phylogenetic trees were devised to study the evolution of organisms, and today published
phylogenetic trees are mainly reused by biologists. An automated digitization application that
searches for phylogenetic trees on their behalf is therefore needed, because it can take over
this very challenging manual task and determine whether an image is a phylogenetic tree.
Therefore, the main purpose of this project is to automate the classification of phylogenetic
tree images using a machine learning algorithm. The classification focuses on whether images in
a PDF file or text file are phylogenetic trees or non-phylogenetic trees. Examples of
phylogenetic trees are the cladogram, the phenogram and tree terminology. Examples of
non-phylogenetic trees, on the other hand, are the family tree, the life cycle of an organism
and the flow chart. Figure 3 shows an example of a non-phylogenetic tree, a family tree
(Murdoch, 2013).
[Figure: family tree of the Murdoch family]
Figure 1 Non-phylogenetic tree: family tree. Adapted from Murdoch Family Tree, by Murdoch, W., 2013. Retrieved from httpwwwwilliammurdochnetJarticles_12_Murdoch_family_treehtml
Copyright 2013 by the Murdoch. Adapted with permission.
General Objective
The main objective of this research study is to employ a machine learning algorithm that can
classify images as phylogenetic trees or non-phylogenetic trees.
Specific Objective
The specific objectives of this study are:
i. To employ machine learning to predict whether an image represents a phylogenetic tree.
ii. To compare and contrast the different features that represent a phylogenetic tree in an
image.
Research Question
i. Can a neural network be used to predict phylogenetic tree images?
ii. Which discriminative features can be used for classifier learning?
i. A phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the
lines of evolutionary relationship of different species, organisms or genes from a common
ancestor (Baum, D., 2008).
ii. Phylogeny is the evolutionary relationship between organisms (Baum, D., 2008).
iii. Evolution analysis is the foundation of phylogenetic trees, with cautionary notes
(Brinkman, 2005).
iv. Content mining is defined as a significant part of figure mining, which concerns
non-textual content (Mounce, 2014).
This research study hopes to advance knowledge on the automated classification of images from
PDF or text files as phylogenetic trees or non-phylogenetic trees.
This research study is mainly focused on the rooted tree (cladogram) and the unrooted
In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary
relationships of organisms in the form of phylogenetic trees. This project is not concerned
with the reconstruction of phylogenetic trees. Rather, it classifies images in PDF or text
files as phylogenetic trees or non-phylogenetic trees using a machine learning algorithm.
CHAPTER TWO
LITERATURE REVIEW
As mentioned by Mounce (2012), millions of papers are now published each year, at an ever
growing rate, and many of them concern phylogenetic trees. This is because the number of
species with at least partial sequence information is rapidly increasing. Phylogenetic trees
have thus become an integral part of various biological studies, given the exponential increase
of sequence data generated by classical and next-generation sequencing studies (Baum, D., 2008).
This chapter is divided into a few sections. The first section focuses on phylogenetic trees,
explaining their meaning and purpose and the types of phylogenetic trees. The next section
concentrates on image features, emphasizing the features that are suitable for the image
classification process. This section also reviews image recognition system frameworks.
Phylogenetic Tree
A phylogenetic tree, or evolutionary tree, is an illustrative representation of biological
entities that are connected through common descent, such as species or higher-level taxonomic
groupings (Gregory, 2008). Phylogenetic trees form a backbone for various other biological
studies (Baum, 2008). A tree is therefore a graphical representation of a hypothesis about the
evolution of a species, with branches that separated, hybridized or terminated by extinction.
Readers can read and understand the patterns of descent from a phylogenetic tree, but the tree
does not indicate when species evolved or how much genetic change occurred in a lineage (Kelly,
Grenyer, & Scotland, 2014). Nor should it be assumed from a phylogenetic tree that a taxon
evolved from the taxon next to it.
Baum, Smith and Donovan (2005) stated that the phylogenetic tree is the most direct
illustration of the principle of common ancestry, which makes it crucial for evolutionary
theory. Their point is that a practical understanding of what a phylogenetic tree represents is
essential for understanding the evolutionary relationships of species. Phylogenetic trees have
thus become important in the evolutionary analysis of any species, and biologists should
increase their use of phylogenetic trees in the biological sciences. Phylogenetic trees also
provide an efficient structure for organizing biodiversity information and help to develop an
accurate conception of the totality of evolutionary history. It is therefore important for
aspiring biologists to develop an understanding of phylogenetic trees.
Types of Phylogenetic Tree
Phylogenetic trees can be divided into different kinds of trees. There are two main
categories, rooted phylogenetic trees and unrooted phylogenetic trees. Apart from these two
main categories, a phylogenetic tree can be drawn in several forms: the slanted cladogram,
Figure 4 (Phylogenetic tree, 2002), the rectangular cladogram, Figure 5 (Phylogenetic tree,
2002), and the circular cladogram, Figure 6 (Phylogenetic tree, 2002). A root can be
artificially added to an unrooted tree by means of a species that unambiguously separated early
from the species being considered (Bacardit, 2009).
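Adding a root to an unrooted tree by means of an early-diverging species, commonly called an outgroup, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure used in any cited work: the edge list, the node names, the function `root_with_outgroup` and the label "root" are all hypothetical, and the root is simply placed on the branch that joins the outgroup to the rest of the tree.

```python
from collections import defaultdict

# Hypothetical unrooted tree: A-D are species, n1 and n2 are internal nodes.
EDGES = [("A", "n1"), ("B", "n1"), ("n1", "n2"), ("C", "n2"), ("D", "n2")]


def root_with_outgroup(edges, outgroup, root_name="root"):
    """Return a parent -> children mapping rooted on the outgroup's branch."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    if len(adjacency[outgroup]) != 1:
        raise ValueError("the outgroup must be a leaf of the unrooted tree")
    (neighbour,) = adjacency[outgroup]
    # The new root splits the outgroup from everything else.
    children = {root_name: [outgroup, neighbour], outgroup: []}
    # Walk away from the outgroup; each unvisited neighbour becomes a child.
    stack = [(neighbour, outgroup)]
    while stack:
        node, parent = stack.pop()
        kids = sorted(n for n in adjacency[node] if n != parent)
        children[node] = kids
        stack.extend((kid, node) for kid in kids)
    return children


rooted = root_with_outgroup(EDGES, outgroup="D")
print(rooted["root"])  # ['D', 'n2']
print(rooted["n2"])    # ['C', 'n1']
print(rooted["n1"])    # ['A', 'B']
```

Choosing a different outgroup changes only where the root sits; the underlying branching pattern among the remaining species is unchanged.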
Ebenezer Murdoch - CID - Riddick Grizel John Young shy CID shy Ann lowden 1761-1806 i 1761-1834 Mason II Shoemaker I I
Samuel lowden Murdoch -shyCID -- shy Jane Young 1784-1830 1787-1879 Shoemaker
John Muir Ebenezer Andrew McCulloch Jane John Coupland Margaret Murdoch Murdoch Murdoch Murdoch Murdoch Murdoch
1809 1810 -1864 1812 - l860 1816 - 1894 1818-1879 1820- Infant death C Uln Captai~ Shoemaker
James Murdoch shy CID shy Agnes Cumming
Mary Murdoch
1841-1929
1814 - 1900 ClJplaln
Jane Murdoch
1848-1924
Jeanie Muirhead - CID shy Samuel Murdoch -1914 I 1843-1917
Mil5UMaf1ller
1865middot1869 1867 1906 1870middot1916 Wntdeath Chimist
1873 - 1912 1e oftiagtr01 the TI14R1C
~tn these ApI~ 191 2
Agnes Murdoch
1850-1944
1818-1891
William Murdoch 1856-1906
John Murdoch lS57 -1907
uptain Iltolaquoxr
I~ tilt sea 101 In rlie ~ Ap111906 Apr~ 1907
Margaret Elisabeth Murdoch 1882 -1973
teacher headmislress
Samuel Jr - CID shy ~artha Murdoch Patience Scott
1880middot1950 Merchant
1891 middot1976
Samuel Scott Murdoch
Grizzel Samuel Alexander Charles Donaldson Murdoch Murdoch Murdoch Murdoch
1822 -1877 1824 -1888 1827 - 1868 1829 - 1860 Shoemaker uDtain uptltn
OJowrerlln ~nt Nwy
HI~ cxItnl~ ~
Figure 1 Non-phylogenetic tree- family tree Adapted from Murdoch Family Tree by Murdoch W 2013 Retrieved from httpwwwwilliammurdochnetJarticles_12_Murdoch_family_treehtml
Copyright 2013 by the Murdoch Adapted with permission
8
General Objective The main objective of this research study is to employ a machine
learning algorithm that can classify images into phylogenetic tree or non-phylogenetic trees
Specific Objective The specific objectives of this study are
i To employ machine learning that can predict phylogenetic tree that represent in the
Image
II To compare and contrast the different features that represent phylogenetic tree on
image
Research Question
I Can neural network be used for prediction of phylogenetic tree images
II What are the discriminative features can be used for classifier learning
I Phylogenetic tree or called as a phylogeny is a branch diagram that can illustrate
the lines of evolutionary relationships of different kinds of species organism or
genes from a common ancestor (Baum D 2008)
II Phylogeny is the evolution relationship between organisms (Baum D 2008)
1II Evolution analysis is the fundamentals or foremost of phylogenetic trees with
cautionary notes (Brinkman 2005)
iv Content Mining is defined as a significant part of figure mining which is nonshy
textual content (Mounce 2014)
9
This research study hopes to advance knowledge on the automated digitization images of
phylogenetic trees from the pdf file or text file as phylogenetic tree or non-phylogenetic tree
This research study is mainly focused on the rooted tree (c1adogram) and the unrooted
In conclusion phylogenetic is the science of constructing hypothesis related to the
Iutionary relationship of organisms in the fonn of phylogenetic tree Then this project is not
laquomeemed with the reconstruction of phylogenetic trees Rather it was doing the classification of
phylogenetic trees image in pdf file or text file whether it is phylogenetic trees or nonshy
ylogenetic trees by using machine learning algorithm
10
CHAPTER TWO
LITERATURE REVIEW
As mentioned by Mounce (2012) recently there are millions of papers published each
at an ever growing rate about the phylogenetic tree This is because the amount and
mvllICImiddothI of species with at least partial sequence information was rapidly increasing Thus
phylogenetic trees become an integral part of various biological studies with the exponential
iDcrease of sequence data which is being generated by various classical and next generation
sequence studies (Baum D 2008) This chapter divides into few sections The first section
tbcuses on phylogenetic trees which explain more on the meaning and purpose for the
ylogenetic trees and types of phylogenetic trees The next section concentrates on the feature
mimage This section also emphasizes on the suitable features that were suitable used for image
ification process Besides this section reviewed on image recognition system frameworks as
nvaoSEeoletic Tree
Phylogenetic tree or evolution tree is an illustrative representation of biological entities
were associated with common descent such as species or higher-level taxonomic
___pmJ~ (Gregory 2008) Phylogenetic tree represents a backbone for various other biological (8aum 2008) Therefore it is a graphical representation of a hypothesis about the
_tlon of a species with branches that separated hybridized or terminated by extinction
readers can read and understand the patterns of descent from the phylogenetic trees
the phylogenetic trees do not indicate when species evolved or how much genetic
11
CD8ogeoccurred in a lineage (Kelly Grenyer amp Scotland 2014) This is because phylogenetic
should not be assumed that a taxon can be evolved from the taxon next to it
Baum, Smith and Donovan (2005) stated that the phylogenetic tree is the most direct
representation of the principle of common ancestry, which is why the phylogenetic tree is crucial
for evolutionary theory. In fact, they were telling readers that a practical understanding
of what a phylogenetic tree represents is important for understanding the evolutionary relationships
of species. Thus, phylogenetic trees have become important in the evolutionary analysis of any
species, and biologists should increase the use of phylogenetic trees in the biological sciences. Next,
phylogenetic trees provide an efficient structure for organizing biodiversity information. Moreover, they
develop an accurate conception of the totality of evolutionary history. Therefore, it is important for
aspiring biologists to develop an understanding of phylogenetic trees.
Types of Phylogenetic Tree
Phylogenetic trees can be divided into different kinds of trees. There are two main
categories: rooted phylogenetic trees and unrooted phylogenetic trees. Apart
from these two main categories, a phylogenetic tree can be represented in several forms: the slanted
cladogram, Figure 4 (Phylogenetic tree, 2002); the rectangular cladogram, Figure 5 (Phylogenetic
tree, 2002); and the circular cladogram, Figure 6 (Phylogenetic tree, 2002). Roots can be artificially
added to unrooted trees by means of a species that had unambiguously separated early from
the other species being considered (Bacardit, 2009).
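The rooting step described above can be illustrated with a minimal sketch. It is not part of the system developed in this study. It assumes the unrooted tree is given as a list of edges between named nodes, and that the early-diverging species (the outgroup) is a leaf. The function name root_with_outgroup, the node labels and the nested-tuple output are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def root_with_outgroup(edges, outgroup):
    """Root an unrooted tree on the branch that leads to `outgroup`.

    edges    -- list of (node_a, node_b) pairs describing an unrooted tree
    outgroup -- name of a leaf assumed to have diverged before the others
    Returns a nested tuple in which the new root has two children:
    the outgroup and the rest of the tree.
    """
    # Build an undirected adjacency list from the edge list.
    adjacency = defaultdict(list)
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)

    # A leaf has exactly one neighbour; anything else cannot be used here.
    if len(adjacency[outgroup]) != 1:
        raise ValueError("outgroup must be a leaf of the tree")

    def subtree(node, parent):
        # Walk away from `parent`; leaves are returned as plain names.
        children = [n for n in adjacency[node] if n != parent]
        if not children:
            return node
        return tuple(subtree(child, node) for child in children)

    neighbour = adjacency[outgroup][0]
    return (outgroup, subtree(neighbour, outgroup))


# Hypothetical unrooted tree: leaves A, B, C, O and internal nodes x, y.
edges = [("x", "A"), ("x", "B"), ("x", "y"), ("y", "C"), ("y", "O")]
rooted = root_with_outgroup(edges, "O")
print(rooted)
```

In the rooted result, the outgroup O sits on one side of the root and every other species sits on the other side, which is what gives the remaining branches a direction of descent.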