Using Phylogenetic Trees for Disease Diagnosis
Dissertation
submitted in partial fulfillment of the requirements
for the degree of
Master of Technology, Computer Engineering
by
Shamsudduha Tabish M Sabir Danish
Roll No:121122018
under the guidance of
Mr. Satish S Kumbhar
College of Engineering, Pune
DEPARTMENT OF COMPUTER ENGINEERING AND
INFORMATION TECHNOLOGY,
COLLEGE OF ENGINEERING, PUNE-411005
June, 2013
DEPARTMENT OF COMPUTER ENGINEERING AND
INFORMATION TECHNOLOGY,
COLLEGE OF ENGINEERING, PUNE
CERTIFICATE
This is to certify that the dissertation titled
Using Phylogenetic Trees for Disease Diagnosis
has been successfully completed
By
Shamsudduha Tabish M Sabir Danish
(121122018)
and is approved for the degree of
Master of Technology in Computer Engineering.
Date: June 2013. Prof. Satish S. Kumbhar
Place:Pune Department of Computer Engg.
and Information Technology,
College of Engineering Pune,
Shivajinagar, Pune - 411005.
Abstract
The Phylogenetic Tree is a tool for tracking the evolution process by looking into the changes in the
genome sequences under study. This tree is a graphical representation of the evolutionary relationships
among multiple genes or organisms. In this work we apply the this principle of phylogeny to diagnose
what disease an individual is suffering from. In our method the multiple sequence alignment is applied to
a set of omic (Genomic or Proteomic) sequences of the patient, a few family members of the patient and
the diseased sequences or reference sequences. Once we get the result of Multiple Sequence Alignment,
the similarity in the omic sequences of patients family members is found along with the loci of each com-
mon nucleotide/amino acid, and the dissimilar nucleotides or amino acid at respective loci are discarded
also from the patients and diseased sequences. Finally we create a phylogenetic tree from these sequences
which can now be used to visualize the distance among the patients genome sequence and the diseased
genome sequences. After applying this algorithm on the data available at the 1000 genome project and
dbSNP we got the expected ressults and hence the algorithms is proved for the accuracy.
Keywords: Disease diagnosis, evolution, medical diagnosis, Phylograms, cladograms, Phylogenetic
trees, Multiple Sequence Alignment.
Acknowledgments
I would like to take this opportunity to express my gratitude towards my guide Prof. Satish S Kumb-
har for his constant help and suppoert, encouragement and inspiration for the project work. Without
his invaluable guidance, this work would never have been a reached to this level. I would also like to
thank all the faculty members and staff of Computer and IT department for providing us ample facility
and flexibility and for making my journey of post-graduation successful.
Last, but not the least, I would like to thank my classmates for their valuable suggestions and helpful
discussions. I am thankful to them for their unconditional support and help throughout the year.
Shamsudduha Tabish
College of Engineering, Pune.
ii
Contents
Abstract ii
Acknowledgements i
List of Figures v
1 Introduction 1
1.1 DNA ( Deoxyribo Nucleic Acid ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 SNP (Single Nucleotide Polymorphism) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Mutation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.1 Mutagens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.2 Chemical Mutagens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.3 Radiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.4 Sunlight . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.5 Spontaneous mutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Literature Survey 7
2.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Multiple Sequence Alignment (MSA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Phylogenetic Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Constructing Phylogenetic Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.1 Distance Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.2 Character Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.3 Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3 Data Sets 18
3.1 The HapMap Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 dbSNP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 The 1000 Genomes Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4 Technologies 20
4.1 Tomcat Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2 Web Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
iii
4.3 JSP (Java Server Pages) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.4 HTML 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.5 Java Script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.6 Eclipse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5 DiagnosTree -The Tool 23
5.1 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.1.1 Required Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.1.2 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6 System Architecture 27
7 Results 30
8 Conclusion 31
9 Future Work 32
List of Figures
1.1 The Eukaryotic Cell Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The DNA Composition and Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 The Chemical Structures of Cytosine, Thymine, Adenine and Guanine . . . . . . . . . . . 4
2.1 A Phylogeny of Six Species . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Rooted and Unrooted Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Example: A distance Matrix M . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Unrooted tree from the given matrix of M nodes . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Comparison of two sequences with their ancestor shows several types of substitutions . . . 13
2.6 Set of Input sequences for Maximumparsimony Algorithm . . . . . . . . . . . . . . . . . . 14
2.7 Trees for first two sites of sequences A through E . . . . . . . . . . . . . . . . . . . . . . . 15
2.8 Pictorial Example Employing Fitch’s Algorithm for given site . . . . . . . . . . . . . . . . 16
2.9 Choosing the right algorithm that suits your needs . . . . . . . . . . . . . . . . . . . . . . 17
5.1 Set of Input Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.2 Aligned Sequences (Output of MSA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.3 Uncommon Nucleotieds to be omitted out of the sequences . . . . . . . . . . . . . . . . . 25
5.4 Set of Family Members Sequences to be removed from The Sequences . . . . . . . . . . . 26
5.5 Final set of Sequences to be used for creating The Phylogenetic Tree . . . . . . . . . . . . 26
5.6 The resultant Tree depicting relationship among the patients gene sequence and different
diseased sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.1 Layered System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.2 Component Based System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.3 Flowchart for the Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
v
Chapter 1
Introduction
Our work is completely based on the DNA/RNA/Protein found in the cell of almost all the living
organisms. To understand these elements lets get into the cell and find out where they are created and
what role do they play. The basic b building block of every living being on this planet is biologic al cell.
The Cell is composed of Nucleus, Mitochondria, cytoplasm, etc. There are two types of cells, prokaryotic
and eukaryotic cells. Most of single cellular organisms are made up of prokaryotic cells (eg. Bacteria),
where as the all the multi-cellular organisms are made up of eukaryotic cells. In this work we focus on
eukaryotic cellular organisms.
Figure 1.1: The Eukaryotic Cell Structure
The above figure 1.1 shows the structure of a cell in eukaryotic organism. The DNA is found in
1
almost every living organism. The chromosomes are composed of DNA and are found in the cell. The
Nucleus in the above figure 1.1 is the main part of the cell containing large amount of DNA, only a small
portion of the DNA is found in the Mitochondrion as shown in the figure. This DNA is called as mtDNA
or Mitochondrion DNA. The DNA is the code which encodes everything about the organism including
the behavior, appearance, diseases, resistance to diseases and every character an organism posses.
1.1 DNA ( Deoxyribo Nucleic Acid )
DeoxyriboNucleic Acid (DNA) is the hereditary material found in almost all living organisms. Nearly
every cell in the human body has exactly the same replica of DNA. Most of the DNA is located in the
nucleus of the call (called nuclear DNA), but a small amount of DNA is also be found in mitochondria
(mitochondrial DNA or mtDNA).
The DNA is composed of two strands having backbone madeup of phosphorous group and pentose
sugar. These strands are connected to each other by adenine (A), guanine (G), cytosine (C), and thymine
(T) as shown in the figure. The Human DNA has about 3 billion base pairs, and more than 99% of
those bases are the same in all human beings. The sequence of these bases determine the information
for building and maintaining an organism, in a similar way in which letters of the alphabet are arranged
in a certain order to form words and sentences.
The DNA bases, pair up with each other, A pairs with T and C pairs with G, to form units which
are called base pairs. Each base is also attached to a sugar molecule and a phosphate molecule which are
together the backbone for DNA. A base, sugar, and phosphate together are called a nucleotide. These
nucleotides are arranged in two long sequences called strands that together form a spiral called a double
helix. The structure of the double helix looks like a ladder, where the base pairs form the ladders rungs
and the sugar and phosphate molecules form the vertical sidepieces of the ladder but in a spiral form.
The figure 1.2 shows the chemical structure of DNA as explaind in the forth coming description and
figure 1.3 shows the chemical structure of the different nucleotides playing a vital role in the structure
and composition of DNA. Only because of these chemical compounds the DNA has the two strands
connected and a spiral shape.
1.2 SNP (Single Nucleotide Polymorphism)
Single Nucleotide Polymorphism also known as SNP (Snip) is a change of single nucleotide in the genome
a particular locus. If such a variation at a single locus is found common in more than 1% of the population,
only then it is considered as SNP. Around 90% of the variation in the genome is because of SNPs. SNPs
are scattered across the human genome by an approximate average of one SNP per thousand base pairs,
these SNPs directly affect the gene product that is the protein. Sequence variations in the genomes exist
at defined positions and are responsible for phenotypic characteristics, including a person’s tendency
towards complex diseases like heart disease and cancer.
Single nucleotide polymorphisms, frequently called SNPs (pronounced snips), are the most common
2
Figure 1.2: The DNA Composition and Structure
type of genetic variation among people. Each SNP represents a difference in a single DNA building block,
called a nucleotide. For example, a SNP may replace the nucleotide cytosine (C) with the nucleotide
thymine (T) in a certain stretch of DNA.
SNPs occur normally throughout a persons DNA. More precisely, they occur once in every 300
nucleotides on average, which means there are roughly 10 million SNPs in the human genome. Most
commonly, these variations are found in the DNA between genes. They can act as biological markers,
helping scientists locate genes that are associated with disease. When SNPs occur within a gene or in a
regulatory region near a gene, they may play a more direct role in disease by affecting the genes function.
Most SNPs have no effect on health or development. Some of these genetic differences, however, have
proven to be very important in the study of human health. Researchers have found SNPs that may help
predict an individuals response to certain drugs, susceptibility to environmental factors such as toxins,
and risk of developing particular diseases. SNPs can also be used to track the inheritance of disease
genes within families. There is a scope for future studies for identifying SNPs associated with complex
diseases such as heart disease, diabetes, and cancer.
At present there are a number of SNP analysis techniques available, some of these methods are
inefficient and others require manual intervention. Using a 5’ nuclease assay chemistry protocol is a fast
and simple way to get data results. The experiment protocol involves combining purified genomic DNA,
3
Figure 1.3: The Chemical Structures of Cytosine, Thymine, Adenine and Guanine
master mix, and a 5’ nuclease assay, then thermal cycling, reading, and analyzing the results.
For example a SNP might change the DNA sequence AAGGCTAA to ATGGCTAA. For a variation
to be considered a SNP, it must occur in at least 1% of the population. SNPs, which make up about
90% of all human genetic variation, occur every 100 to 300 bases along the 3-billion-base human genome.
Two of every three SNPs involve the replacement of cytosine (C) with thymine (T). SNPs can occur in
coding (gene) and non-coding regions of the genome. Many SNPs have no effect on cell function, but
scientists believe others could predispose people to disease or influence their response to a drug.
Although more than 99% of human DNA sequences are the same, variations in DNA sequence can
have a major impact on how humans respond to disease, environmental factors such as bacteria, viruses,
toxins, and chemicals and drugs and other therapies. This makes SNPs valuable for biomedical research
and for developing pharmaceutical products or medical diagnostics. SNPs are also evolutionarily stable
that is not changing much from generation to generation which make them easier to follow in population
studies. Scientists believe SNP maps will help them identify the multiple genes associated with complex
ailments such as cancer, diabetes, vascular disease, and some forms of mental illness. These associations
are difficult to establish with conventional gene-hunting methods because a single altered gene may make
only a small contribution to the disease.
Several previous contributions to find SNPs and ultimately create SNP maps of the human genome.
Among these were the U.S. Human Genome Project (HGP) and a large group of pharmaceutical compa-
nies called the SNP Consortium or TSC project. The likelihood of duplication among the groups is small
because of the estimated 3 million SNPs, and the potential payoff of a SNP map was high. In addition
to pharmacogenomic, diagnostic and biomedical research implications, SNP maps are being utilized to
identify thousands of additional markers in the genome, thus simplifying navigation of the much larger
genome map generated by HGP researchers. SNPs as risk factors in disease development SNPs do not
cause disease, but they can help determine the likelihood that someone will develop a particular illness.
One of the genes associated with Alzheimer’s disease, apolipoprotein E or ApoE, is a good example of
how SNPs affect disease development. ApoE contains two SNPs that result in three possible alleles for
this gene: E2, E3, and E4. Each allele differs by one DNA base, and the protein product of each gene
differs by one amino acid.
Each individual inherits one maternal copy of ApoE and one paternal copy of ApoE. Research has
4
shown that a person who inherits at least one E4 allele will have a greater chance of developing Alzheimer’s
disease. Apparently, the change of one amino acid in the E4 protein alters its structure and function
enough to make disease development more likely. Inheriting the E2 allele, on the other hand, seems to
indicate that a person is less likely to develop Alzheimer’s. Of course, SNPs are not absolute indicators of
disease development. Someone who has inherited two E4 alleles may never develop Alzheimer’s disease,
while another who has inherited two E2 alleles may. ApoE is just one gene that has been linked to
Alzheimer’s. Like most common chronic disorders such as heart disease, diabetes, or cancer, Alzheimer’s
is a disease that can be caused by variations in several genes. The polygenic nature of these disorders is
what makes genetic testing for them so complicated.
1.3 Mutation
A Mutation occurs when a DNA gene is damaged or changed in such a way as to alter the genetic
message carried by that gene.
A Mutagen is an agent of substance that can bring about a permanent alteration to the physical
composition of a DNA gene such that the genetic message is changed.
Once the gene has been damaged or changed the mRNA transcribed from that gene will now carry
an altered message. The polypeptide made by translating the altered mRNA will now contain a different
sequence of amino acids. The function of the protein made by folding this polypeptide will probably be
changed or lost. In this example, the enzyme that is catalyzing the production of flower color pigment
has been altered in such a way it no longer catalyzes the production of the red pigment.
No product (red pigment) is produced by the altered protein. In subtle or very obvious ways, the
phenotype of the organism carrying the mutation will be changed. In this case the flower, without the
pigment is no longer red.
1.3.1 Mutagens
A Mutagen is an agent of substance that is responsible for permanent alteration to the physical compo-
sition of a DNA such that the genetic message is changed. Such a change may impact the organism on
its physical appearance or in the other way which may not be directy visible.
1.3.2 Chemical Mutagens
change the sequence of bases in a DNA gene in a number of ways;
• It mimics the correct nucleotide bases in a DNA molecule, but fail to base pair correctly during
DNA replication.
• Remove parts of the nucleotide (such as the amino group on adenine), again causing improper base
pairing during DNA replication.
• Add hydrocarbon groups to various nucleotides, also causing incorrect base pairing during DNA
replication.
5
1.3.3 Radiation
High energy radiation from a radioactive material or from X-rays is absorbed by the atoms in water
molecules surrounding the DNA. This energy is transferred to the electrons which then fly away from
the atom. Left behind is a free radical, which is a highly dangerous and highly reactive molecule that
attacks the DNA molecule and alters it in many ways. Radiation can also cause double strand breaks in
the DNA molecule, which the cell’s repair mechanisms cannot put right.
1.3.4 Sunlight
contains ultraviolet radiation (the component that causes a suntan) which, when absorbed by the DNA
causes a cross link to form between certain adjacent bases. In most normal cases the cells can repair
this damage, but unrepaired dimmers of this sort cause the replicating system to skip over the mistake
leaving a gap, which is supposed to be filled in later. Unprotected exposure to UV radiation by the
human skin can cause serious damage and may lead to skin cancer and extensive skin tumors.
1.3.5 Spontaneous mutations
occur without exposure to any obvious mutagenic agent. Sometimes DNA nucleotides shift without
warning to a different chemical form (know as an isomer) which in turn will form a different series of
hydrogen bonds with it’s partner. This leads to mistakes at the time of DNA replication.
6
Chapter 2
Literature Survey
The current diagnosis methods are mostly based on the non genetic tests, which involve blood test,
urine test, thyroid test, stool test, saliva test etc, all of these look into the chemicals and microbes
found in their respective inputs. And X-Ray, MRI, CT scan, ultra sound etc, look for the physical
appearance and functioning of the organs. Whereas Electroencephalography (EEG), Electrocardiogram
(ECG) also known as Electrocardiography (EKG), Electromyography (EMG) etc, look into the accuracy
of functioning of the organs. So these tests may or may not be successful in diagnosis of disease also a
combination of such tests is required to reach the actual cause of the disease.
Another new method that is on its way is through the analysis of human genome. For this method the
patients genome needs to be sequenced. Then it is compared using Multiple Sequence Analysis (MSA)
with the other reference genome of diseased people known to be suffering from a particular disease, if the
similarity is found then patient is diagnosed to be suffering from the disease of most similar sequence in
the set of input, but this requires a long time, in order to cut short this time we propose our method to
be used for the diagnosis.
2.1 Problem statement
Many a times doctors come across a situation where the diagnosis of a disease (a patient is suffering
from) become quite difficult and this diagnosis process may take months of time, and during this time
the patient is given treatment based on assumptions, if the assumptions go wrong then the patient has
to take drugs targeted for the disease he/she is not suffering from. Such drugs may leave heavy side
effects. Hence its the requirement of the medical system to speed up the diagnosis process and increase
its accuracy.
To this end, a modern technique which employ genome sequencing has been discovered lately for
efficient diagnosis of diseases. In this method the patients genome is sequenced first and is then compared
with the reference sequences. Although existing methods offer good accuracy but are a bit slow. This
motivates a need for a faster yet accurate method to diagnose the diseases.
7
2.2 Multiple Sequence Alignment (MSA)
Multiple Sequence Alignment (MSA) is the alignment of multiple biological sequences (of protein
or nucleic acid) of equal length. From the output of the multiple sequence alignment homology is inferred
and the evolutionary relationships between the sequences can be studied by creating Phylogenetic Trees.
Multiple Sequence Alignment (MSA) is usually the alignment of three or more nitrogen base se-
quences or Nucleic acid sequences of similar length. Homology can be inferred from the output and the
evolutionary relationships between the sequences studied. Usually protein sequences are aligned using
multiple sequence alignment to find out the relationship among them. The multiple sequence alignment
tools compare these sequences and try to correlate each other by introducing gaps in the sequences in
order to match these sequences.
A multiple sequence alignment arranges protein or nucleotide sequences into a rectangular array with
the goal that residues in a given column are homologous (that is they are derived from a single ancestral
sequence), and in a rigid local structural alignment or play a common functional role. Although these
criteria are essentially equivalent for closely related proteins (most similar sequences of amino acids),
structure and function diverge over evolutionary time sequences, and different criteria may result in
different alignments of these sequences.
Most of the existing tools do not meet the efficiency / precision expectations because the length of
these sequences is very high, and a complex algorithm is required to accurately align these sequences
and hence continuous efforts are being put in to improve the method. Such an algorithms require a
huge amount of RAM and processing power because of the nature of the input and complex algorithms
involved for getting a solution. Homology is the similarity that is the result of inheritance from a common
ancestor, and identification and analysis of homologies is central to phylogenetic systematics.
An Alignment is an hypothesis of positional homology between bases/Amino Acids.
Many tools exist for finding the MSA of given set of omic sequences, namely: Clustalw2 Clustal
Omega from EBI UK, T-COFFEE from Lausanne Switzerland, VRIJE universitys PARALINE, bioin-
formatics.orgs STRAP, MAFFT from Tokyo, Japa, MUSCLE from EBI UK, and many more. We have
chosen the popular EMBL EBIs Clustal Omega for multiple sequence alignment in our work. Almost all
these tools are based on dynamic programming.
2.3 Phylogenetic Trees
A phylogenetic tree is described as, a branching diagram that shows, for each species, with which other
species it shares its most recent common ancestor. The evolutionary tree or cladograms were traditionally
used to draw evolutionary relationship among the organism; a more modern version of the same is phylo-
genetic tree which uses gene / protein sequences to draw the evolutionary relationship. These trees dictate
the relationship among the organisms based on the similarity and dissimilarity among the nucleotide or
nucleic acid sequences.
The tree construction can be done through variety of tree-building methods which include methods
8
based on distances, likelihood and characters. After a phylogenetic tree is constructed, it is important
to test its accuracy which refers to the degree to which a tree is close to the true tree.
Phylogenetics is the study of evolutionary relationships among organisms or genes. Below, we will
refer to the objects whose phylogeny we are studying as organisms or species, but the discussion of
methods is valid for the phylogeny of genes as well. We construct phylogenetic trees to illustrate the
evolutionary relationships among a group of organisms. The purpose of phylogenetic studies are (1)
to reconstruct evolutionary ties between organisms and (2) to estimate the time of divergence between
organisms since they last shared a common ancestor.
There are several types of data that can be used to build phylogenetic trees: Traditionally, phylo-
genetic trees were built from morphological features (e.g., beak shapes, presence of feathers, number of
legs, etc). Today, we use mostly molecular data like DNA sequences and protein sequences. A phy-
logeny example showing the evolutionary history of six species: Fish, Deer, Cow, Human, Monkey and
Chimpanzee is shown in Figure 2.1.
Figure 2.1: A Phylogeny of Six Species
Each of the organism has discrete characters each character has a finite number of states. For
example, discrete characters include the number of legs of an organism, or a column in an alignment of
DNA sequences. In the latter case, the number of states for the column character is 4 (A, C, T, G).
Comparative Numerical Data These data encode the distances between objects and are usually derived
from sequence data. For example, we could hypothetically say distance (man, mouse) = 500 and distance
(man, chimp) = 100.
External nodes are things under comparison, also called operational taxonomic units (OTUs). Internal
nodes are hypothetical ancestral units. They are used to group current-day units. In rooted trees, the root
is the common ancestor of all OTUs under study. The path from root to a node defines an evolutionary
path. An unrooted tree specifies relationships among OTUs but does not specify evolutionary paths
9
Figure 2.2: Rooted and Unrooted Trees
(Figure 2.2). We can root an unrooted tree by finding an outgroup (i.e., if we have some external reason
indicating that a certain OTU branched off first). For example, in Figure 2.2, the unrooted tree can be
transformed to the rooted tree by making E the outgroup.
The topology of a tree is the branching pattern of a tree. All internal nodes of a bifurcating tree
have 2 descendants if it is rooted or 3 neighbors if it is unrooted. It is sometimes useful to allow more
than 2 descendants (or more than 3 neighbors in the unrooted case), but we will focus on bifurcating
trees. The branch length can represent the number of changes that have occurred in that branch, or can
indicate the genetic distance between nodes connected by that branch, or can indicate the amount of
evolutionary time passed along the branch.
In every phylogenetic tree, a time axis is implicit. In our example, the time at C is more recent than
the time at B which is in turn more recent than that at A. In this phylogeny, it shows that monkey and
chimpanzee had the most recent common ancestor at the time C. Then, some time before this, at time
B, the most recent common ancestor of human, monkey and chimpanzee were found. Finally, the most
recent common ancestor of all six species was found at time A.
Phylogeny inference can be used for analysis of sequences of proteins and DNA. The concept of
phylogeny is extended to haplotype sequences. The sequences of the individuals replace the species in
the phylogenetic tree. In this case, the phylogeny shows the evolutionary history of the individuals. This
concept also makes sense for sequences coming from the same individual, as in our case of using phylogeny
for reconstructing the haplotype sequences from genotypes. This is because the two sequences of the
individual actually come from his/her father and mother. The phylogeny shows the common ancestor of
both father and mother of the individuals. In our algorithm, we further extend the concept of phylogeny
and use it to represent only a column of the set of haplotype sequences. In every phylogenetic tree,
a time axis is implicit. In our example, the time at C is more recent than the time at B which is in
turn more recent than that at A. In this phylogeny, it shows that monkey and chimpanzee had the most
recent common ancestor at the time C. Then, some time before this, at time B, the most recent common
10
ancestor of human, monkey and chimpanzee were found. Finally, the most recent common ancestor of
all six species was found at time A.
2.4 Constructing Phylogenetic Trees
The three major methods for constructing phylogenetic trees are:
• Distance methods: Evolutionary distances are computed for all OTUs and these are used to
construct trees.
• Maximum Parsimony: The tree is chosen to minimize the number of changes required to explain
the data.
• Maximum Likelihood: Under a model of sequence evolution, the tree which gives the highest
likelihood of the given data is found.
2.4.1 Distance Methods
The problem can be described as follows:
Input: Given an n X n matrix M where Mij ≥ 0 and Mij is the distance between objects i and j.
Goal: Build an edge-weighted tree where each leaf corresponds to one object of M, and such that the
distances measured on the tree between leaves i and j correspond exactly to the value of Mij . When
such a tree can be constructed, we say the distances in M are additive.
Example: Suppose we are given the distances as in Table below.
Figure 2.3: Example: A distance Matrix M
Distance methods do not use the actual molecular sequence alignment during the tree inference but
calculate a symmetric n X n matrix from the input alignment in the beginning. The entries of this matrix
are the pair-wise-distances of the n sequences. The actual tree inference is then performed solely on the
basis of this matrix. n provides a measure for the genetic distance of each pair of the n sequences in the
input alignment. In the simplest case this function would only count the number of differing characters
of the two sequences. More elaborate functions, however, utilize a sophisticated model of molecular
11
Figure 2.4: Unrooted tree from the given matrix of M nodes
evolution. The most frequently used distance-based approaches are probably the LS (Least-Squares)
method and the UPGMA (Un-weighted Pair Group Method with Arithmetic Mean) and NJ (Neighbor-
Joining) heuristics.
Least-Squares
The Least-Squares method estimates the branch lengths of a tree topology by matching the distances
described by them as closely as possible to the values of the pair-wise distances matrix. This is achieved by
minimizing the sum of squared differences between the given (by the distances matrix) and the predicted
distances. The predicted distance between two sequences is calculated as the sum of the branch lengths
along the path connecting both of them. The sum of all squared differences represents a measure for
the fit of the tree to the given sequence data: the tree with the minimal sum is the optimal tree. The
complexity of LS is O(n3).
UPGMA
UPGMA is a clustering algorithm that builds a rooted tree topology by stepwise addition. A molecular
clock is assumed for the evolutionary process, which means that all species contained in the phylogenetic
tree are supposed to evolve at the same rate. This assumption leads to the fact that trees obtained by
UPGMA are ultra metric trees, that is, all end nodes (representing the species of interest) are equidistant
from the root.
The algorithm works as follows: In the beginning, each node represents a cluster. At each step,
the two clusters whose associated sequences have minimal distance according to the distance matrix
are joined. Their entries are removed from the matrix and an entry for the new cluster is added. The
distance of the new cluster to other clusters is computed as the mean distance of the sequences contained
in each cluster. The algorithm terminates when all clusters have been joined into a single cluster. The
complexity of UPGMA is O(n2).
Neighbor-Joining
Neighbor-Joining is also a clustering algorithm and is based on the minimum-evolution criterion. The
tree that explains the sequence data with the minimal amount of change, i.e., the tree which minimizes
the sum of all branch lengths (the total tree length), is the optimal tree. The algorithm starts with a
12
star-tree. At each step, two nodes are removed from the tree and reconnected via a common newly added
internal node. The distance of both nodes to any other node of the tree (i.e., the sum of the branch
lengths on the path connecting the nodes) stays constant. Yet, the total tree length is reduced as two
rather long branches are replaced by three shorter branches. The nodes to be reorganized are selected
such that the greatest reduction of the tree length is achieved. This procedure is repeated until the tree
is fully resolved. The complexity of the original NJ implementation is O(n3) which can be reduced to
O(n2) by using a more sophisticated algorithm for selecting the nodes to be joined.
Computing Distances
We have looked at a couple of distance method heuristics for reconstructing trees, given distance
data. One question we could ask at this point is: how do we obtain the distance data? One answer is
that distance data can be obtained from sequence data. Let us compare the following two sequences:
Figure 2.5: Comparison of two sequences with their ancestor shows several types of substitutions
There are only 3 observed difference between the 2 sequences; however, considering the ancestral
sequence, we see that are actually 12 total substitutions. Thus, if multiple substitutions have occurred
at any site (e:g:, the convergent substitution at site 11), then the naive way of computing distance is an
underestimate. How can we correct for multiple substitutions? For DNA sequences, we can use models
for nucleotide substitution. For protein sequences, we have already talked about models for amino acid
substitution in our discussion of PAM matrices. (We will also use these models when we talk about
maximum likelihood methods for phylogenetic reconstruction.)
13
2.4.2 Character Based Methods
Discrete characters include morphological data (such as the absence or presence of feathers), protein
data (20 possible amino acids), and DNA data (four possible nucleotides). All character based methods
assume that different characters are independent of each other. Given character data, how does one find
a tree out of the given data? What criteria are used to pick the best tree?
Maximum Parsimony
One method is to use maximum parsimony. In this instance, we want to find the tree that minimizes
the number of changes needed to explain the data. For example, given the following DNA data, which
tree is most parsimonious?
Figure 2.6: Set of Input sequences for Maximumparsimony Algorithm
Sites 1 and 2 each require one change for the given tree. It turns out that the entire data can be
explained with a minimum of 9 changes using the tree in Figure below. However, changing the tree will
alter the minimum number of changes required. This example leads us to ask two important questions
relating to parsimony:
• Given a particular tree, how do you find the minimum number of changes needed to explain the
data? (Easy)
• How do you find the most parsimonious tree? (NP-hard)
To answer the easy first question, we use Fitch’s Algorithm. The idea is to construct a set of possible
states (eg: nucleotides) for internal nodes based on the states of the children. For each site, each leaf is
labeled by a singleton set containing, for example,
the nucleotide at that position. For each internal node i, with children j and k (labels Sj and Sk):
Si = SjUnionSk, ifSjIntersectionSk = φ
Si = SjIntersectionSkotherwise
The total number of changes equals the total number of union operations. This is illustrated by the
Figure 2.7. We can see from Figure 2.7 that there are three unions in the tree; this implies that this
site requires three changes. It is easy to implement this algorithm by post-order traversal of the tree. In
14
Figure 2.7: Trees for first two sites of sequences A through E
contrast, the answer to the second question, finding the most parsimonious tree, is not easy. There are
many heuristics for doing this. We will quickly talk about two techniques:
1) the branch-and-bound method (prunes search space, and find the most parsimonious tree) and
2) the nearest-neighbor interchange method (fast heuristic, which may not find most parsimonious tree).
Maximum Parsimony favors the tree topology which explains the given data (the multiple sequences
alignment) with the least amount of change, i.e., the lowest number of nucleotide or amino acid substi-
tutions. In this sense, it is similar to the minimum-evolution criterion of NJ. However, MP computes the
distance between two sequences on a per-column (per-site) basis and considers only so-called informative
sites. Those are the columns of the sequence alignment that contain at least two different kinds of charac-
ters, each of which is represented in at least two of the sequences. The distance between two sequences is
the number of differing characters at informative sites and is attributed as weight to the branch connect-
ing the two sequences. For the inner nodes of the tree hypothetical sequences are calculated such that the
distances between an inner node and its adjacent nodes are minimal. The Maximum Parsimony score of
a tree can be calculated by summing up the weights of all branches. The tree with minimal score is the
most parsimonious tree and thus the optimal tree under the Maximum Parsimony optimality criterion.
Since the Maximum Parsimony criterion is very similar to the minimum-evolution criterion, it also suffers
from identical shortcomings. Additionally, the phenomenon of so-called long branch attraction can be
observed on MP-inferred phylogenies: sequences which are connected to the tree by very long branches,
might be grouped together though they developed from very different lineages. Long branches indicate
a high rate of change, i.e., the sequence at the terminal node of the branch differs from the hypothetical
sequence at the internal node in many sites. Maximum Parsimony only accounts for the fact that some
substitution took place at a specific site and not which substitution. Thus, it groups the two nodes with
the long branches together solely because both highly differ from the other sequences. The fact that both
of them also are highly different to each other is neglected. Nevertheless, Maximum Parsimony is still
frequently used for phylogenetic inference for several reasons. Firstly, it is a character-based method and
15
as such considered to be superior to distance methods at it uses all information that is contained in the
input alignment for the tree reconstruction. Secondly, it is fast and therefore an alternative to Maximum
Likelihood for large-scale datasets if computational resources are restricted. Thirdly, the phenomenon of
long-branch attraction is only an issue for small datasets. Fourthly, many biologists appreciate the fact
that MP only makes few assumptions about the evolutionary process besides evolutionary change being
rare.
Branch and bound
The branch-and-bound method (as applied here) counts the number of changes for an initial tree
(e.g., an initial tree may be obtained using the neighbor-joining method). Then, starting from scratch,
we will search our space by building partial trees (i:e:, one branch is added at a time). That is, in the kth
level of the search, we will have nodes representing all possible phylogenetic trees with k leaves for the
first k species (the order is fixed beforehand arbitrarily). If the cost of any partial tree we are building is
greater than that of the initial tree, then search along this line is abandoned. We can improve our search
(potentially getting rid of more things) by computing an estimate of the minimum number of changes
required to add the additional species. There is no guarantee with branch and bound on how much of
the search space is eliminated.
Figure 2.8: Pictorial Example Employing Fitch’s Algorithm for given site
Nearest-neighbor interchange
The nearest-neighbor interchange method involves rearranging trees at the neighbor ” level and
choosing the neighbor” tree with the best score (ie. the least number of changes). There are many
possibilities for how you can define neighbors. Neighbors in this heuristic procedure are defined as
follows. Considering any internal edge, we break up our tree into 4 sub-trees. For example, in the tree
in Figure 4, the subtrees would consist of the leaves A, B, C and D, although in general these subtrees
consist of more than 1 leaf. This original tree (which has A and B branching separately from C and D)
has two neighbors : one with the roles of B and D switched (i.e., with A and D branching separately
from B and C) and one with the roles of B and C switched (i.e., with A and C branching separately
from B and D). Starting with one tree, we repeatedly choose the neighboring tree with the best score,
until there are no neighboring trees with better scores. This is a hill-climbing method, and there is no
guarantee that we will find the most parsimonious tree.
16
While the parsimony method makes very few assumptions, it ignores branch lengths in building trees.
If there are branches that diverge much more rapidly than others, it is easy to convince yourself that the
parsimony method can lead to incorrect topologies.
2.4.3 Maximum Likelihood
Maximum Likelihood is a method for the inference of phylogeny. It evaluates a hypothesis about evolu-
tionary history in terms of the probability that the proposed model and the hypothesized history would
give rise to the observed data set. The supposition is that a history with a higher probability of reaching
the observed state is preferred to a history with a lower probability. The method searches for the tree
with the highest probability or likelihood.
In general, Maximum Likelihood is a parametric statistical method for fitting a mathematical model
to some data. The principle of likelihood suggests that the explanation that makes the observed outcome
the most likely occurrence is the one to be preferred. Formally, given some data D and a hypothesis O,
the likelihood of that data is given by which the probability of obtaining D given v.
L(Dj|O) = f(Dj|O)
Though both terms are colloquially used synonymously, it is important to distinguish between probability
and likelihood here. Informally, probability allows one to predict unknown outcome based on known
parameters, whereas likelihood allows one to predict unknown parameters based on known outcome.
Figure 2.9: Choosing the right algorithm that suits your needs
17
Chapter 3
Data Sets
There are numerous open-source bioinformatics databanks available on internet. Every country is in a
race to develop a rich bioinformatics databank. In this work we select SCBIs DBSNP, EMBL EBIs 1000
genome as a data source from
3.1 The HapMap Project
We have identified one of the sources of data for inferring phylogenetic trees and analyzing them as the
international HapMap project. The International HapMap Project is an effort by multiple countries
to identify and catalog genetic similarities and differences in human beings. Using the information in
the HapMap, researchers will be able to find genes that affect health, disease, and individual responses
to medications and environmental factors. The Project is collaboration among scientists and funding
agencies from Japan, the United Kingdom, Canada, China, Nigeria, and the United States. All of the
information generated by the Project is publically available.
The goal of the International HapMap Project is to compare the genetic sequences of different indi-
viduals to identify chromosomal regions where genetic variants are shared. By making this information
freely available, the Project will help biomedical researchers find genes involved in disease and responses
to therapeutic drugs. In the initial phase of the Project, genetic data are being gathered from four
populations with African, Asian, and European ancestry. Ongoing interactions with members of these
populations are addressing potential ethical issues and providing valuable experience in conducting re-
search with identified populations.
Public and private organizations in six countries are participating in the International HapMap
Project. Data generated by the Project can be downloaded with minimal constraints.
This project is supposed to use the data available at the International Haplotype Map (HapMap Phase
II) for the purpose of conducting a fine-scale genome-wide scan of human genetic variations.Computationally
phased HapMap data is used for this analysis. Although what algorithms we have developed infers max-
imum parsimony phylogenies directly from un-phased data, these algorithms are not efficient enough for
use on a whole-genome scale. We restrict this project to the HapMap population of single subcontinent
because these subpopulations were genotyped for parent-child trios and can thus be expected to have
18
minimal phasing error. The other two HapMap data sets (Han Chinese in Beijing, China and Japanese
in Tokyo, Japan) were genotyped only for unrelated individuals and were omitted here due to the higher
likelihood of phasing errors. All HapMap data sets were downloaded in phased form from the HapMap
web site, where the PHASE program had been used to identify most likely phases from the trio data.
This HapMap build was based on the NCBI human genome assembly build 35. SNP location assign-
ments and genomic coordinates are therefore based on NCBI build 35. The resulting data contained 120
haplotypes from 60 unrelated individuals for each of the two populations typed at approximately 3.7
million SNPs.
Phylogeny inferences are proposed to run for window sizes of five, six, seven, eight, and nine consec-
utive SNPs at each overlapping window of the given size across the 22 autosomal human chromosomes
in each of the HapMap subcontinental populations.
3.2 dbSNP
The Single Nucleotide Polymorphism database (dbSNP) is a database which maintains the variation
(occurring in more than 1
dbSNP is a database that contains entries submitted by public laboratories and private organizations
for a large number of organisms across the globe. Each of these submissions include information about
the actual nucleotide variation and the 5 and 3 flanking sequences.
3.3 The 1000 Genomes Project
The 1000 Genomes Project is the first ever project to sequence the genomes of a large number of people,
to provide a comprehensive data set resource on human genetic variation. The goal of the 1000 Genomes
Project is to locate most genetic variants that have frequencies of at least 1% in the populations under
study. This goal is being attained by sequencing many individuals lightly. To sequence a person’s genome,
many copies of the persons DNA are broken into short pieces and each piece is sequenced individually.
The many copies of DNA indicate that the DNA pieces are more-or-less randomly distributed across the
genome. The pieces are then aligned with the reference sequence and merged together. To accurately
sequence the complete genomic sequence of one person with the existing sequencing platforms, it requires
sequencing that person’s DNA the equivalent of about 28 times. If the amount of sequence done is only
an average of once across the genome, then much of the sequence would be missed, since some genomic
locations will be covered by several pieces while others will have nothing. Deeper the sequencing coverage,
more of the genome will be covered at least once. Also, people are diploid; the deeper the sequencing
coverage, the more likely that both chromosomes at a loci will be included. In addition, deeper coverage
is mainly useful for diagnosing structural variants, and it corrects the sequencing errors.
The 1000 Genome Project offers genome sequences from various families across the geographic loca-
tions. It also maintains the relationship information about the individuals.
19
Chapter 4
Technologies
Following are the technologies we have used in our research.
4.1 Tomcat Server
We use Tomcat Server to provide web based access to our system. Also the comcat server is used to deploy
the webservice clients for the Multiple Sequence alignment through Clustal Omega and Phylogenetic
Trees through Clustal Phylogeny from EMBL EBI.
4.2 Web Services
Web services are application components providing access to certain methods and objects through in-
ternet. Web services communicate using open protocols like tcp/ip and http and make it easy to access
the components across the platforms. Web services are self contained and self describing services. All
this description is offered through an XML file with extension as wsdl (stands for web service description
language). Web services are discovered using UDDI (Universal Description Discovery and Integration)
which allows the client to connect with a specific web service running on that server. Web services can
also be used by other applications existing within the local area network of the server or through inter-
net. XML is the base for Web services as it offers interoperability across the platforms and simplifies the
communication through basic protocols.
The basic Web services platform is XML and HTTP protocol combination. XML offers a language
which can be used across different platforms and programming languages and still deliver complex mes-
sages and functions. The HTTP protocol is the core and most used Internet protocol. Web services
platform elements include:
• SOAP - (Simple Object Access Protocol)
• UDDI - (Universal Description, Discovery and Integration)
• WSDL - (Web Services Description Language)
20
Various web service are offered by the global bioinformatics community. And we have used a couple
of them offered by EMBL EBI. The web services that we have used are ClustalOmega for Multiple
Sequence Alignment And ClustalW2 Phylogeny for retrieving phylogenetic tree related data.
4.3 JSP (Java Server Pages)
Java Server Pages (JSP) is a technology for developing dynamic web pages that is to provide support
dynamic content. It helps developers insert java code in HTML pages by making use of special JSP
tags. A JSP component is a type of Java servlets that is designed to interact with the client offering
realtime contents using a Java web application. The JSP files are written as text files that combine
HTML or XHTML code, XML elements, and embedded JSP actions and commands in order to offer
dynamic contents. The User interface for JSP is offered through web browsers as JSP happens to be a
web application development language.
JavaServer Pages often offers the same applications as offered by Common Gateway Interface (CGI)
language but on the top of it has tons of benefits both functional and non functional. Performance is
significantly improved because JSP allows embedding Dynamic Elements in HTML Pages itself instead of
having a separate CGI files. JSP files are always compiled before it’s processed by the server as opposed
to CGI/Perl which requires the server to load an interpreter and the target script each time the page is
requested. JavaServer Pages are built using the base as the Java Servlets API, so like Servlets, JSP also
has access to all the powerful Enterprise Java APIs, including JDBC, EJB, JNDI, JAXP etc. JSP pages
can also be used in combination with servlets that are used to handle the business logic, the model that
is supported by Java servlet template engines. JSP is an integral part of J2EE, a complete platform for
enterprise standard applications. This implies that JSP can be used to develop simplest applications to
the most complex and demanding applications.
4.4 HTML 5
HTML5 is a co-operation between the (W3C) World Wide Web Consortium and the (WHATWG) Web
Hypertext Application Technology Working Group. HTML5 is the new standard for HTML. For HTML5
still a lot of work is in progress. However, Many browsers have incorporated support for HTML 5. It
heavily uses java script and CSS. By use of these technologies it reduces the use of external plugins like
flash, reduces use of scripting by incorporating new tags, and has improved on error handling. Also
HTML5 targets to be compatible with every device. In our research we have used the canvas tag in
combination with the java scrip language for rendering the results in the form of phylogenetic trees.
4.5 Java Script
A scripting language is a lightweight programming language used with the web applications. This is
client side scripting language mainly used for data validation, animations, and small calculations at the
21
client end. It is programming code that can be inserted into HTML pages. JavaScript when inserted
into HTML pages, is supported by all modern web browsers and hence can be executed with ease. It
can detect the browser the client is using so that respective code can be executed. The java script is an
interpreted language that is you do not need to compile it before execution, its directly interpreted by
the web browser.
4.6 Eclipse
Eclipse is an opensource IDE Integerated Development Environment. It is created by Open Source
Community and is used in several different areas, e.g. as a development environment for Java or Android
applications, python, c, c++ pearl etc. The Eclipse projects are governed by the Eclipse Foundation. The
Eclipse Foundation is a member supported, non-profit corporation that hosts the Eclipse Open Source
projects. Also helps to cultivate both an Open Source community and an Ecosystem of complementary
products and services. The Eclipse IDE can be easily extended with additional software components or
plugins. Several Open Source projects and companies have extended the Eclipse IDE and customized
according to their requirements in their working environment.
Eclipse is also used as a base for creating general purpose applications. These applications are known
as Eclipse Rich Client Platform applications (Eclipse RCP). The Eclipse Foundation uses Eclipse Public
License (EPL) and is an Open Source software license for its software. The EPL is specially designed to
be business-friendly. EPL Licence states that the EPL licensed programs can be used, modified, copied
and distributed free of cost. The consumer of EPL licensed software can go for using this software in
closed source programs. Any modifications in the original EPL code must also be released as EPL code
as stated by EPL.
We have extensively used Eclipse IDE for implementing our algorithm by implementing the web
service clients and our intermediate code and the HTML 5 with Java Script code.
22
Chapter 5
DiagnosTree -The Tool
We name our tool as DiagnosTree since it facilitates diagnosis of diseases through the use of phylogenetic
trees.
Although the diagnosis is possible with gene sequences, protein sequences, and the RNA sequences,
but for this paper we will stick to gene sequences. Our method is based on the similarity that the human
beings are having in their gene sequences and the assumption that any change in the gene sequence at
the loci where the nitrogen bases are usually common in all the human beings is responsible for the
abnormality an individual is having.
5.1 The Algorithm
5.1.1 Required Inputs
• Patients gene sequence
• A few of Patients family members gene sequences
• Diseased gene sequences (which will be downloaded from the Bioinformatics databases).
We consider patients family members sequences for analysis since their gene sequences are most close
to the patients gene sequence, with the help of these sequences we try to find out which mutation in the
sequence of the patient is responsible for the disorder. To diagnose the disease we need to compare the
sequence of the patient with the gene sequences of diseased genomes. To reduce the time required for
diagnosis (through computer processing) we suggest to find out the probable diseases the patient might
be suffering from based on the symptoms. In our method we find out the common nucleotides among
the patient and the family members gene sequences and discard the dissimilar nucleotides to retain the
common nucleotides with respect to their loci. Following are the steps that we suggest to diagnose the
disease.
Step 1: Align the gene sequences of
• The patient
23
• The family members of the patient
• And the diseased sequences.
Step 2: Find out the common nucleotides among the family members of the patient, and discard the
dissimilar nucleotides from all the sequences (of the patient, patients family members, and the diseased
sequences) from the respective loci after alignment.
Step 3: Now Discard the Patients family members gene sequences.
Step 4: Create a phlyogenetic tree (we prefer maximum parsimony based phylogenetic tree) based
on the sequences we got in the previous step (Modified gene sequences of the patient and the diseased
sequences).
Step 5: From this tree we can say that the patient is suffering from a disease which is having least
distance from the patients gene sequence.
5.1.2 Example
Lets consider the following hypothetical sequences.
Figure 5.1: Set of Input Sequences
Where P is the Patient, F1, F2, F3 and F4 are close relatives of the patient and D1, D2, D3 and D4
are People suffering from different diseases (Reference sequences).
Now we do apply multiple sequence alignment on these sequences and get the following output.
From the above result we discard the dissimilar nucleotides/characters from the family members
sequences and discard the nucleotides/characters at respective loci from the other sequences as follows.
And we get
Further we ignore the close relatives sequences and construct a phylogenetic tree based on the rest of
the sequences.
The tree shown in the above figure depicts that the patient is suffering from the disease D2.
24
Figure 5.2: Aligned Sequences (Output of MSA)
Figure 5.3: Uncommon Nucleotieds to be omitted out of the sequences
25
Figure 5.4: Set of Family Members Sequences to be removed from The Sequences
Figure 5.5: Final set of Sequences to be used for creating The Phylogenetic Tree
Figure 5.6: The resultant Tree depicting relationship among the patients gene sequence and different
diseased sequences
26
Chapter 7
Results
After using the sequences from 1000 genome project, dbSNP and HapMap databases we got accurate
results the algorithm . We have used the family memeber’s sequences from the 1000 genome project
and the reference disease sequences from dbSNP. After doing 30 such tests on the algorithm we found
that our algorithm gave all the results as correct. The only concern remains here is the set of Diseased
sequences. As of now there are thousands of diseases. And if the sequences are not opt properly there is
a danger of mis-diagnosis. To avoid this scenario we need to use the sequences of all the existing diseases
which incurr a lot of computational resourcers and is very time consuming, that is it may take months of
time to give the result. We have an agenda of working out on this scenario and sort out an optimmum
resultant algorithm.
Half the portion of the algorithm in our tool is executed on the other servers the portion of code
which excutes at the local system is very efiicient and has proved to take O(n2) time. The tool requires
a cluster of workstations if needed to execute the entire algorithm on the local system. Such a use and
analysis is out of the scope of this project thesis as of now.
Following are few input sequences on which we have tested the algorthm:
• Family ID:13291 Individual ID:NA06986
• Family ID:13291 Individual ID:NA06995
• Family ID:13291 Individual ID:NA06997
• Family ID:13291 Individual ID:NA07037
• Family ID:13291 Individual ID:NA07045
• Family ID:13291 Individual ID:NA07435
The result after aplication of our algorithm happens to be the individual ”NA07435” is suffering from
Alzhymers and is as provided with the database itself.
30
Chapter 8
Conclusion
The phylogenetic trees are beinng utilized to diagnose the disease after multiple sequence analysis on
the various sequences. This improves the diagnosis process and hence accelerating the process of treat-
ment. The open source bioinformatics resources should be utilized to improve the Disease Diagnosis and
Treatment process such that the current loop holes in the medical system must be closed and human
society must be benefited. To this end we present a novel algorithm which makes an effort to effectively
utilize the available bioinformatics services and databases to enhance the accuracy and performance of
diagnosis process. We believe that our work can be used to improve the current scenario in the medical
system and benefit the society to become more secure against the diseases. Since diagnosis of a disease
is half the recovery.
31
Chapter 9
Future Work
We believe that there is a scope to further improve the proposed algorithm so as to target the non-
genetic diseases which have an impact over the genome sequences. Also there is a scope to improve the
performance of this Algorithm by parallelizing it and improve the comparison methods used here. We
plan to implement our own service as an open application available in public, so that the research process
in this direction can be improved.
32
Top Related