CHAPTER 1: INTRODUCTION
1.1 Introduction
India is often called the diabetes capital of the world, so the design and development of
novel antidiabetic drugs requires immediate attention. Although many drugs exist for
Type 2 diabetes, none offers a fully effective cure. Drug design is based mainly on
protein-ligand interactions and active-site residues. To help drug designers, a consensus
database of 5 important candidate proteins implicated in diabetes has been developed.
Relational database concepts from computer science and information retrieval concepts
from digital libraries are important for understanding biological databases. Biological database
design, development, and long-term management is a core area of the discipline of
Bioinformatics. Data contents include gene sequences, textual descriptions, attributes and
ontology classifications, citations, and tabular data. These are often described as semi-
structured data, and can be represented as tables, key delimited records, and XML
structures. Cross-references among databases are common, using database accession
numbers.
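As a minimal sketch of what "semi-structured" means here, the Python fragment below renders one record both as key-delimited lines and as XML, with a cross-reference stored as an accession number. The record, its field names, and the accession values are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; field names and values are illustrative only.
record = {
    "accession": "X00001",
    "description": "example gene sequence",
    "sequence": "ATGGCC",
    "xrefs": ["P00001"],  # accession numbers of related entries in other databases
}

def to_key_delimited(rec):
    """Render the record as 'KEY: value' lines, one field per line."""
    lines = []
    for key, value in rec.items():
        if isinstance(value, list):
            value = ";".join(value)
        lines.append(f"{key.upper()}: {value}")
    return "\n".join(lines)

def to_xml(rec):
    """Render the same record as XML text."""
    root = ET.Element("entry", accession=rec["accession"])
    ET.SubElement(root, "description").text = rec["description"]
    ET.SubElement(root, "sequence").text = rec["sequence"]
    for ref in rec["xrefs"]:
        ET.SubElement(root, "xref", id=ref)
    return ET.tostring(root, encoding="unicode")
```

Both functions carry the same information; only the container format differs, which is why cross-references by accession number work across databases with different storage formats.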
A mutation in a protein may lead to a malfunction that results in disease. One protein
may be involved in several diseases, and one disease can be caused by many proteins.
Many proteins are implicated in diabetes, but not all of them have known ligands that
can correct their malfunction. Designing a drug by the conventional process is
time-consuming and expensive; bioinformatics tools can reduce both the time and the cost.
1.2 Background
Protein-coding genes related to diabetes were identified from the GeneCards website.
Many of them were screened out because they have no PDB ID, and some because they
have no ligands. After these were eliminated, only the closely linked proteins with
ligands were retained, and the 5 best-suited proteins were finally selected:
dipeptidyl-peptidase 4 (DPP-4), peroxisome proliferator-activated receptor gamma
(PPAR-γ), protein tyrosine phosphatase, non-receptor type 1 (PTPN1), glycogen synthase
kinase-3 beta (GSK-3β) and aldose reductase. Aldose reductase exhibits the highest
consensus among them. The average docking score of the ligands (inhibitors) is
-126.048 kcal/mol.
1.3 Problem Statement
The problem addressed in this study is to identify a protein ligand that binds with higher
affinity than the existing ligands associated with diabetes-causing proteins. Candidate
ligands are drawn from various sources, such as a plant database, a chalcone database
and the ZINC database, with the aim of finding a ligand whose docking score is better
than the average of -126.048 kcal/mol.
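The selection criterion can be sketched in Python as follows. The sketch assumes that a lower (more negative) docking score indicates stronger predicted binding; the ligand names and scores are hypothetical, and only the reference value comes from the text.

```python
REFERENCE_SCORE = -126.048  # kcal/mol, average docking score quoted in the text

def better_than_reference(scores, reference=REFERENCE_SCORE):
    """Return ligand names scoring below the reference, best (lowest) first.

    Assumption: a more negative score means higher predicted affinity.
    """
    hits = [(name, score) for name, score in scores.items() if score < reference]
    return [name for name, _ in sorted(hits, key=lambda item: item[1])]

# Hypothetical ligand names and scores, for illustration only.
example_scores = {"ligand_a": -131.2, "ligand_b": -119.7, "ligand_c": -140.5}
```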
Advances in computational techniques have enabled virtual screening to have a positive
impact on the discovery process. In ligand-based virtual screening, the strategy is to use
information provided by a compound or set of compounds that are known to bind to the
desired target and to use this to identify other compounds in the corporate database or
external databases with similar properties.
The goal is the design and development of a novel drug with fewer side effects and lower
cost using various bioinformatics tools. We developed a technique that minimizes human
intervention in the calculation of ligand properties.
1.4 Nature of Study
The current research extracted data from databases available online and assembled it in
the desired format. Root mean square deviation (RMSD) and the rank sum technique are
used to identify the high-consensus protein causing diabetes. In addition, protein-ligand
docking is performed to identify the ligand that binds with high affinity. The study also
compares the results of different software packages to identify the best ligand for the
diabetes-causing protein.
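The two measures named above can be illustrated with a short Python sketch. It is not the procedure used in this thesis: the RMSD function assumes the two structures are already superimposed, and the rank sum function assumes that a lower score is better under every criterion and that there are no ties. All names are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z) points.

    No superposition is performed; the structures are assumed to be pre-aligned.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))

def rank_sum(score_table):
    """Combine several rankings into one consensus ranking.

    score_table maps criterion -> {candidate: score}, lower score = better.
    Returns (candidate, rank_sum) pairs sorted so the best consensus comes first.
    Ties within a criterion are not treated specially in this sketch.
    """
    totals = {}
    for scores in score_table.values():
        ordered = sorted(scores, key=scores.get)
        for rank, candidate in enumerate(ordered, start=1):
            totals[candidate] = totals.get(candidate, 0) + rank
    return sorted(totals.items(), key=lambda item: item[1])
```

A candidate that ranks well under every criterion accumulates a small rank sum, which is what "high consensus" means in this sketch.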
1.5 Thesis Overview
Figure 1.1 Potential areas for in silico intervention in drug discovery process
The principal topic of this work is the application of root mean square deviation and the
rank sum technique to identify ligands with high affinity.
Chapter 1 provides the motivation, the role of bioinformatics in drug discovery, and
background information on the present research on diabetes and the diabetes-causing
protein aldose reductase.
Chapter 2 provides a literature review for the present study and describes antidiabetic
agents. In addition, various bioinformatics tools and techniques are reviewed.
Chapter 3 provides the materials and methodologies used to predict protein-ligand
interactions with root mean square deviation and the Tsar-rank sum technique in order to
identify high-affinity ligands.
Chapter 4 provides the results and discussion. Apigetrin ranked highest and was reported
to be the best compound that can bind with high affinity to the aldose reductase enzyme.
Similarly, ZINC00844930 (out of 1001 hits) and ZINC00702953 (out of 837 hits) from the
ZINC database, Allium38 from the in-house plant database, and
44[IC]COX,LOinhibitor-Me-UCH3 from the chalcone database gave the best results, with a
binding energy better than that of the original co-crystallized ligand of 1AH3.
Chapter 5 provides conclusions and directions for further work to improve the process
and optimize non-conventional drug discovery, leading to efficient drug design through
computer-aided drug discovery.
1.6 Research Questions
• What is the computational approach to highlight the crucial amino acid residues
responsible for functional attributes?
• How can the structures of ligands for a protein be predicted?
• How can the interactions at the binding sites of ligands be understood?
• How can computer technology, through computer-aided design, reduce the time spent
on synthesizing compounds and on experimental methods, and lead to effective drug
compounds?
1.7 Biological Databases
A biological database is a large, organized body of persistent data, usually associated
with computerized software designed to update, query, and retrieve components of the
data stored within the system.
As of 2006, there were over 1,000 public and commercial biological databases. These
biological databases usually contain genomics and proteomics data, but databases are also
used in taxonomy. The data are typically nucleotide sequences of genes or amino acid
sequences of proteins. Furthermore, information about function, structure, chromosomal
localisation, the clinical effects of mutations, and similarities between biological
sequences can be found [1].
1.7.1 Types of Biological Databases
There are many different types of databases but for routine sequence analysis, the
following are initially the most important:
• Primary databases
• Secondary databases
• Composite databases
1.7.1.1 Primary Databases
These contain sequence data, either nucleic acid or protein sequences. Some examples of
primary databases include:
Nucleic Acid Databases: EMBL, GenBank, DDBJ
Protein Databases: SWISS-PROT, TrEMBL, PIR
EMBL (European Molecular Biology Laboratory)
The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and
RNA sequences collected from the scientific literature and patent applications and
submitted directly by researchers and sequencing groups. It constitutes Europe's
primary nucleotide sequence resource. The database is produced in an international
collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ). Each of the
three groups collects a portion of the total sequence data reported worldwide, and all new
and updated database entries are exchanged between the groups on a daily basis. At the
time of writing, the current database release was Release 88, September 2006.
The EMBL Nucleotide Sequence Database can be accessed in a variety of ways. You can
query the database using the SRS system or choose another access method. EMBL also
provides access to feature tables, FAQs, manuals and guides.
Submission of sequence information to the nucleotide sequence database prior to
publication has become standard practice. The database assigns a unique accession
number that permanently identifies the submitted sequence. The database accession
number should be included in the manuscript, preferably on the first page of the journal
article, or as required by individual journal procedures. This procedure ensures the
availability and distribution of new sequence data in a timely fashion [2].
Figure 1.2 The International Sequence Database Collaboration
GenBank
GenBank (Genetic Sequence Databank) is one of the fastest growing repositories of
known genetic sequences. It has a flat file structure that is an ASCII text file, readable by
both humans and computers. In addition to sequence data, GenBank files contain
information like accession numbers and gene names, phylogenetic classification and
references to published literature.
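Because the flat file is plain ASCII text, a few lines of Python are enough to pull fields out of a record. The sketch below handles only the ACCESSION, single-line DEFINITION and ORIGIN parts of a GenBank-style record; the example record is made up and heavily abbreviated, and real files contain many more fields (including multi-line definitions) that this sketch ignores.

```python
def parse_genbank_minimal(text):
    """Extract accession, definition and sequence from one GenBank-style record."""
    record = {"accession": None, "definition": None, "sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_origin:
            # Sequence lines hold a position number followed by blocks of bases.
            record["sequence"] += "".join(c for c in line if c.isalpha()).upper()
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("DEFINITION"):
            record["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True
    return record

# Abbreviated, made-up record used only to exercise the parser.
example = """LOCUS       EXAMPLE1      12 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Hypothetical example sequence.
ACCESSION   X00000
ORIGIN
        1 atggccaaat ga
//
"""
```

For production work an established parser is the safer choice; the point here is only that the format is readable by both humans and simple programs.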
GenBank is the NIH genetic sequence database, an annotated collection of all publicly
available DNA sequences. GenBank, together with EMBL and DDBJ, has reached a
milestone of 100 billion bases from over 165,000 organisms. GenBank is part of the
International Nucleotide Sequence Database Collaboration, which comprises the DNA
Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and
GenBank at NCBI. These three organizations exchange data on a daily basis.
Many journals require submission of sequence information to a database prior to
publication so that an accession number may appear in the paper. A WWW-based
submission tool called BankIt allows convenient and quick submission of sequence data.
Sequin, NCBI's stand-alone submission software, is available by FTP. When using
Sequin, the output files for direct submission should be sent to GenBank by electronic
mail.
The GenBank database is designed to provide and encourage access within the scientific
community to the most up to date and comprehensive DNA sequence information.
Therefore, NCBI places no restrictions on the use or distribution of the GenBank data.
However, some submitters may claim patent, copyright, or other intellectual property
rights in all or a portion of the data they have submitted.[3]
Figure 1.3 Growth of the International Nucleotide Sequence Database Collaboration
DDBJ (DNA Data Bank of Japan)
DDBJ began its DNA data bank activities in 1986 at the National Institute of Genetics
(NIG). It has been functioning as an international nucleotide sequence database in
collaboration with EBI/EMBL and NCBI/GenBank.
DNA sequences record organismal evolution more directly than other biological
materials and are thus invaluable not only for research in the life sciences but also for
human welfare in general. The databases are a common treasure of humankind.
From the beginning, DDBJ has been functioning as one of the International DNA
Databases, together with EBI (European Bioinformatics Institute; responsible for the
EMBL database) in Europe and NCBI (National Center for Biotechnology Information;
responsible for the GenBank database) in the USA, the two other members. Consequently,
DDBJ has been collaborating with the two data banks by exchanging data and
information over the Internet and by regularly holding two meetings, the International
DNA Data Banks Advisory Meeting and the International DNA Data Banks Collaborative
Meeting.
DDBJ is the sole DNA data bank in Japan that is officially certified to collect DNA
sequences from researchers and to issue internationally recognized accession numbers
to data submitters. It collects data mainly from Japanese researchers, but it also accepts
data from, and issues accession numbers to, researchers in any other country. Since DDBJ
exchanges data with EMBL/EBI and GenBank/NCBI on a daily basis, the three data banks
share virtually the same data at any given time.
In DDBJ, nucleotide sequence data can be submitted using SAKURA (via a WWW server).
The Mass Submission System (MSS) can be used when the submission consists of a large
number of sequences, when it involves long nucleotide sequences that result in a complex
submission containing many features, such as genome data, or when the submission is
otherwise unsuitable for SAKURA. DDBJ can also accept submissions made with Sequin,
a stand-alone software tool that prepares the files for data submission [4].
SwissProt
The UniProtKB/Swiss-Prot Protein Knowledgebase is an annotated protein sequence
database established in 1986. It is a curated protein sequence database that provides a
high level of annotation, a minimal level of redundancy and a high level of integration
with other databases. Together with UniProtKB/TrEMBL, it constitutes the UniProt
Knowledgebase. It is maintained collaboratively by the Swiss Institute of Bioinformatics
(SIB) and the European Bioinformatics Institute (EBI).
The UniProtKB/Swiss-Prot group is headed by Rolf Apweiler. The current Swiss-Prot
release is version 51.2 as of 28-Nov-2006.
The UniProtKB/Swiss-Prot database can be accessed using the search engines SRS and
UniProt Power Search. SRS is the simplest and easiest method available to access the
UniProtKB/Swiss-Prot sequence database. This search tool can also be used for more
complex and/or multiple database queries. UniProt Power Search provides full text,
advanced search, set manipulation and search filtering on the Universal Protein Resource.
The UniProtKB/Swiss-Prot database can also be accessed using the ExPASy server in
Geneva, which offers the choice of full-text search or search of individual lines (e.g. ID,
AC, DE, OS, OG, GN, RL, RA), and SP-ML.
The UniProtKB/Swiss-Prot protein data bank provides accession numbers for protein
sequences when the peptide(s) have been directly sequenced. These sequences should be
submitted to UniProtKB/Swiss-Prot at the EBI. Swiss-Prot does not provide accession
numbers, in advance, for protein sequences that are the result of translation of nucleic
acid sequences. These translations will automatically be forwarded to Swiss-Prot from
the EMBL nucleotide database and are assigned UniProtKB/Swiss-Prot accession
numbers on incorporation into UniProtKB/TrEMBL.
SPIN is the web-based tool for submitting directly sequenced protein sequences and their
biological annotations to the UniProtKB/Swiss-Prot Protein Knowledgebase. SPIN
guides you through a sequence of WWW forms allowing interactive submission. The
information required to create a database entry will be collected during this process [5].
TrEMBL
UniProtKB/TrEMBL is a computer-annotated protein sequence database complementing
the UniProtKB/Swiss-Prot Protein Knowledgebase.
UniProtKB/TrEMBL contains the translations of all coding sequences (CDS) present in
the EMBL/GenBank/DDBJ Nucleotide Sequence Databases and also protein sequences
extracted from the literature or submitted to UniProtKB/Swiss-Prot. The database is
enriched with automated classification and annotation. The UniProtKB/TrEMBL group is
headed by Rolf Apweiler. The current TrEMBL release is version 34.2 as of 28-Nov-2006.
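The translation of a coding sequence (CDS) into a protein sequence, which is the step that produces TrEMBL entries from nucleotide records, can be sketched in Python as follows. The sketch uses the standard genetic code only, reads a single forward frame, and stops at the first stop codon; the function name is hypothetical and this is not the pipeline used by UniProt.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_cds(cds):
    """Translate a DNA coding sequence codon by codon up to the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks codons with ambiguous bases
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)
```

For example, the 12-base sequence ATGGCCAAATGA is read as the codons ATG, GCC, AAA and TGA, giving Met, Ala, Lys and then a stop.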
The UniProtKB/TrEMBL database is split into two main sections; SP-TrEMBL and
REM-TrEMBL. SP-TrEMBL (Swiss-Prot TrEMBL) contains the entries that will
eventually be incorporated into UniProtKB/Swiss-Prot and can be considered as a
preliminary section of UniProtKB/Swiss-Prot as all SP-TrEMBL entries have been
assigned UniProt accession numbers. REM-TrEMBL (REMaining TrEMBL) contains the
entries that are not intended for inclusion in UniProtKB/Swiss-Prot, such as
immunoglobulins and T-cell receptors, synthetic sequences, patent application sequences,
small fragments, pseudogenes and truncated proteins. REM-TrEMBL entries have no
accession numbers.
In addition, there is a weekly update to UniProtKB/TrEMBL called TrEMBLnew.
UniProtKB/TrEMBLnew is produced from nucleotide sequences deposited in the EMBL
nucleotide sequence database. At each UniProtKB/TrEMBL release the annotation of
UniProtKB/TrEMBLnew entries is upgraded, redundant entries are merged and the
remainder are then added to TrEMBL.
The UniProtKB/TrEMBL database can be accessed using the search engines SRS and
UniProt Power Search. SRS is the simplest and easiest method available to access the
UniProtKB/Swiss-Prot sequence database. This search tool can also be used for more
complex and/or multiple database queries. UniProt Power Search provides full text,
advanced search, set manipulation and search filtering on the Universal Protein Resource.
The UniProtKB/Swiss-Prot database can also be accessed using the ExPASy server in
Geneva, which offers the choice of full-text search or search of individual lines (e.g. ID,
AC, DE, OS, OG, GN, RL, RA), and SP-ML [6].
PIR (Protein Information Resource)
The PIR, located at Georgetown University Medical Center (GUMC), is an integrated
public bioinformatics resource to support genomic and proteomic research, and scientific
studies.
PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) as
a resource to assist researchers in the identification and interpretation of protein sequence
information [7]. Prior to that, the NBRF compiled the first comprehensive collection of
macromolecular sequences in the Atlas of Protein Sequence and Structure, published
from 1965-1978 under the editorship of Margaret O. Dayhoff. Dr. Dayhoff and her
research group pioneered the development of computer methods for the comparison of
protein sequences, for the detection of distantly related sequences and duplications within
sequences, and for the inference of evolutionary histories from alignments of protein
sequences.
For four decades, PIR has provided many protein databases and analysis tools freely
accessible to the scientific community, including the Protein Sequence Database (PSD),
the first international database, which grew out of Atlas of Protein Sequence and
Structure.
In 2002, PIR, along with its international partners EBI (European Bioinformatics
Institute) and SIB (Swiss Institute of Bioinformatics), was awarded a grant from NIH to
create UniProt, a single worldwide database of protein sequence and function, by
unifying the PIR-PSD, Swiss-Prot, and TrEMBL databases.
PIR can be searched in the following ways:
• Text Search: Enter text or identifiers to search against iProClass for individual
proteins, or PIRSF for protein families. An advanced search option is available for
restriction of search terms, options, matching, and operators.
• Batch Retrieval: Retrieve multiple entries from iProClass or multiple families
from PIRSF database using a specific identifier, or a combination of different
identifiers.
• BLAST/FASTA Search: Retrieve entries similar to your query using the BLAST
or FASTA program. Two UniProt databases are available to perform the search:
(1) UniProtKB, which contains functional information on proteins, with accurate,
consistent, and rich annotation; or (2) UniRef100, which combines identical
sequences and sub-fragments, from any organism, into a single entry.
A line summary containing sequence similarity results will be displayed.
• Related Sequences: Save time and have a glance at your protein sequence
similarity neighbors by retrieving sequences based on pre-computed BLAST
results.
• Peptide Match: Find an exact match for a peptide sequence (3 to 30 amino acids
long). Two UniProt databases can be used to perform the search: (1) UniProtKB,
which contains functional information on proteins, with accurate, consistent, and
rich annotation; or (2) UniRef100, which combines identical sequences and sub-
fragments, from any organism, into a single entry.
• Pattern Search: (1) Find proteins matching a user-defined or a PROSITE pattern
in the UniProtKB database; or (2) Look for PROSITE patterns present in a query
sequence.
• Multiple Alignment: Enter multiple sequences in FASTA format and/or multiple
UniProtKB identifiers in the ID box to get the CLUSTALW alignment of the
sequences along with a neighbor-joining tree and a PIR interactive tree and
alignment viewer.
• Pairwise Alignment: Insert two sequences using the single letter amino acid code
or enter two UniProtKB identifiers. The results show the SSearch Smith-
Waterman full-length alignments between the two sequences.
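The Smith-Waterman method mentioned for pairwise alignment can be illustrated with a short Python sketch that computes only the best local alignment score. The scoring values (match +2, mismatch -1, linear gap -1) are assumptions made for illustration; tools such as SSearch use amino acid substitution matrices and more elaborate gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Only two rows of the table are kept, so memory use grows with the length of the second sequence rather than with the product of both lengths; recovering the alignment itself would require keeping the full table.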
1.7.1.2 Secondary Databases
These databases are also known as pattern databases. They contain results from the
analysis of the sequences in primary databases. Some examples of secondary databases
include:
PROSITE, Pfam, BLOCKS, PRINTS.
PROSITE
PROSITE consists of documentation entries describing protein domains, families and
functional sites as well as associated patterns and profiles to identify them.
PROSITE is a method of determining the function of uncharacterized proteins
translated from genomic or cDNA sequences. It consists of a database of biologically
significant sites and patterns formulated in such a way that, with appropriate
computational tools, it can rapidly and reliably identify to which known family of
proteins (if any) the new sequence belongs [8].
The PROSITE tools are ScanProsite (for advanced scans) and PRATT, which allows the
interactive generation of conserved patterns from a series of unaligned proteins.
The PROSITE database can be browsed by documentation entry, ProRule description,
taxonomic scope and number of positive hits.
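To show how a PROSITE-style pattern can be applied to a sequence, the Python sketch below converts the common elements of the pattern syntax into a regular expression: `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, and `<` / `>` as terminal anchors. It is a simplified converter written for illustration, not the ScanProsite implementation, and it does not handle rarer constructs such as `>` inside square brackets.

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern such as 'N-{P}-[ST]-{P}' to a regex string."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        repeat = ""
        if "(" in element:  # optional repeat count such as (3) or (2,4)
            element, count = element.split("(", 1)
            repeat = "{" + count.rstrip(")") + "}"
        if element == "x":
            core = "."  # any amino acid
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"  # any amino acid except those listed
        else:
            core = element  # a literal residue or an [ABC] set
        parts.append(
            ("^" if anchor_start else "") + core + repeat + ("$" if anchor_end else "")
        )
    return "".join(parts)

def find_sites(pattern, sequence):
    """Return the start positions (0-based) of non-overlapping pattern matches."""
    return [m.start() for m in re.finditer(prosite_to_regex(pattern), sequence)]
```

The pattern `N-{P}-[ST]-{P}` becomes the regular expression `N[^P][ST][^P]`: an asparagine, any residue except proline, a serine or threonine, and again any residue except proline.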
Pfam
Pfam is a large collection of multiple sequence alignments and hidden Markov models
covering many common protein domains and families. For each family in Pfam it is
possible to look at multiple alignments, view protein domain architectures, examine
species distribution, follow links to other databases, and view known protein structures.
Pfam can be used to view the domain organization of proteins. A single protein can
belong to several Pfam families[9].
Pfam is a database in two parts. The first, Pfam-A, is the curated part, containing over
8957 protein families. The second, Pfam-B, contains a large number of small families
taken from the PRODOM database that do not overlap with Pfam-A. Although of lower
quality, Pfam-B families can be useful when no Pfam-A families are found.
BLOCKS
The BLOCKS database contains multiple alignments of conserved regions in protein
families. Blocks are multiply aligned ungapped segments corresponding to the most
highly conserved regions of proteins. The blocks for the BLOCKS database are made
automatically by looking for the most highly conserved regions in groups of proteins
represented in the PROSITE database. These blocks are then calibrated against the
Swiss-Prot database to obtain a measure of the chance distribution of matches. It is these
calibrated blocks that make up the BLOCKS database [10].
BLOCKS is a service for biological sequence analysis at the Fred Hutchinson Cancer
Research Center in Seattle, Washington, USA.
The BLOCKS database is based on InterPro entries with sequences from SWISS-PROT
and TrEMBL and with cross-references to PROSITE, PRINTS, SMART, Pfam and/or
ProDom entries.
The BLOCKS database can be searched either by keyword or by number. Block
Searcher is the tool for searching a sequence against BLOCKS.
PRINTS
The PRINTS database houses a collection of protein fingerprints, which may be used to
assign family and functional attributes to uncharacterized sequences, such as those
currently emanating from the various genome-sequencing projects.
Fingerprints are groups of conserved motifs that, taken together, provide diagnostic
protein family signatures. They derive much of their potency from the biological context
afforded by matching motif neighbors; this makes them at once more flexible and
powerful than single-motif approaches. The technique further departs from other
pattern-matching methods by readily allowing the creation of fingerprints at superfamily-,
family- and subfamily-specific levels, thereby allowing more fine-grained diagnoses [11].
The PRINTS database can be accessed by accession number, PRINTS code, database
code, text, sequence, title, number of motifs, author or query language. The database can
be searched using the tools FingerPRINTScan, FPScan, GRAPHScan, and MULScan.
PRINTS is a companion to the BLOCKS, PROSITE, Pfam and ProDom databases.
1.7.1.3 Composite Databases
These databases combine several primary databases. They make querying and searching
efficient, without the need to go to each of the databases.
Examples of composite databases include:
NRDB (Non-Redundant Database)
OWL
NRDB (Non-Redundant Databases)
NRDB is a so-called non-redundant composite of the following sources: PDB sequences,
SWISS-PROT, SWISS-PROTupdate, PIR, GenPept and GenPeptupdate. The database is
thus similar in content to OWL, but contains more up-to-date information. However,
strictly speaking, it is not non-redundant, but non-identical - i.e., only identical sequence
copies are removed from the database. As a result, NRDB is larger and less efficient to
search than OWL. To be rigorous, it is sensible to search NRDB, but it is more practical
to search OWL.
NRDB is a program that can be used to compare a batch of sequences and find those that
are identical to each other. This is useful for sorting new alleles that are not yet in the
MLST databases and for developing new MLST schemes.
This is a generic program which can process any sequence data (DNA or amino acid) and
is by no means restricted to MLST data. The nrdb program was written by Warren Gish
at Washington University [12].
OWL
OWL is a non-redundant composite of 4 publicly available primary sources:
SWISS-PROT, PIR (1-3), GenBank (translation) and NRL-3D. SWISS-PROT is the highest
priority source, all others being compared against it to eliminate identical and trivially-
different sequences. The strict redundancy criteria render OWL relatively "small" and
hence efficient in similarity searches.
The sources are assigned a priority with regard to their level of annotation and sequence
validation - SWISS-PROT has the highest priority, so all the others are compared against
it during the redundancy checking procedure (this process eliminates both identical
copies of sequences and those containing single amino acid differences). The contribution
to OWL made by each of the primary sources is shown in the following pie chart. Its non-
redundant status renders OWL highly "compact" and therefore efficient for use in
sequence comparisons [13].
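The two redundancy criteria described for NRDB and OWL can be contrasted with a small Python sketch. The first function removes only exact duplicates; the second also drops sequences that differ from an already kept sequence by a single residue, processing sources in priority order. Function names and the example sequences are hypothetical, and the pairwise comparison is quadratic, so this is an illustration of the criteria rather than a scalable implementation.

```python
def remove_identical(sequences):
    """NRDB-style criterion: drop exact duplicates, keeping first occurrences."""
    seen = set()
    kept = []
    for seq in sequences:
        if seq not in seen:
            seen.add(seq)
            kept.append(seq)
    return kept

def differ_by_at_most_one(a, b):
    """True if a and b have equal length and differ at no more than one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1

def remove_trivially_different(sources):
    """OWL-style criterion over a list of sequence lists given in priority order.

    A sequence is kept only if no already kept sequence is identical to it
    or differs from it by a single residue.
    """
    kept = []
    for source in sources:  # highest-priority source first
        for seq in source:
            if not any(differ_by_at_most_one(seq, k) for k in kept):
                kept.append(seq)
    return kept
```

Under the stricter second criterion fewer sequences survive, which matches the observation that OWL is smaller and faster to search than NRDB.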
1.7.2 Importance of Biological Databases
The importance of biological databases is to encourage new approaches to the
management, analysis and dissemination of biological knowledge that will enable both
the scientific community and the broader public to gain maximum benefit and utility.
Future advances in biological sciences will depend both upon the creation of new
knowledge and upon effective management of proliferating information. The biological
sciences have become increasingly data rich. Much of the biology of tomorrow will arise
through discovery based on information contained in community-accessible databases.
Much, if not all, of our accumulated knowledge of biology will be accessible in electronic
form. Future progress in biological research will be highly dependent on the ability of the
scientific community to both deposit and utilize stored information on-line. Thus, the
information management challenge for the future will be to develop new ways to acquire,
store and retrieve not only biological data per se, but also those data in the context of
biological knowledge.
1.7.2.1 Interpreting sequence data
The amount of data available from sequences is growing rapidly and outstripping the
ability of biologists to understand and assimilate it all. There is a pressing need for
improved systems that will allow the prediction of the function of a protein molecule
given the DNA sequence that codes for it and to link this sequence information to other
biological data. There is also a need to identify anomalous events, such as lateral transfer,
polymorphisms or mutations.
A range of IT and computer science tools are needed to tackle these problems. Some will
involve the integration of data from different sources, the development of structures of
data to enable more efficient querying, the development of better ways of interrogating
databases and presenting the information, etc.
Other approaches will involve the development of better ways of modeling data,
including whole systems. For example, a sequence could be placed within a model of a
metabolic or regulatory system to assist in the design of appropriate experiments to assess
its contribution to the phenotype. These approaches will need to be holistic in nature, taking
in as much of the whole picture as possible. This in turn will require the development and
application of new systems modeling methods.
1.7.2.2 Comparative genome analysis
Most bioinformatics tools were developed prior to the widespread availability of
complete genome sequences, and tend to support storage and presentation of information
at the gene level. There is thus a need for novel techniques and tools that address genome
level bioinformatics, with analyses that account for the location of genes or regulatory
elements within the genome and that encompass new data sets, such as those derived
from research on microarray and DNA chip technology, proteomics, analysis of 2-D gel
information, single nucleotide polymorphisms and transcriptomes.
In addition, researchers need to be able to compare and analyze the organization and
evolution of genomes within a single genome and between genomes of different
organisms. In particular, comparative genomics could uncover relationships between
model organisms, crops and domestic animals, and facilitate the exploitation of
conservation of synteny. Tools and techniques need to be developed to conduct these
analyses, including the development of novel algorithms, efficient storage techniques and
graphical displays to visualize the results of the comparisons.
1.7.2.3 Understanding, integrating and modeling cellular processes
Rapid advances are being made in understanding how cells differentiate, function and
interact. Descriptions of the molecular mechanisms of membrane activity, signal
transduction, metabolism, gene expression and other nuclear processes are fundamental
to our appreciation of cellular and intercellular function. They also provide the means by
which new molecules introduced to cells can enter, function and affect these cellular
processes. The picture of whole cell activity that is emerging from genomics, structural
biology and related areas of cytology and biochemistry will require, however, the use of
biological databases to organize, integrate and elucidate these complex data in
meaningful and realistic ways.
1.7.2.4 Methods to create biological databases from uncomputerised or unstructured
data
Much biological information is still published and stored in non-electronic forms (e.g.
books, journals). Biological nomenclature is of central importance to the recovery and
establishment of links between data from different sources and of different types.
Schemes to handle nomenclature instability are particularly challenging. In addition,
evolutionary studies based on sequence information must consider the whole organism,
yet the ways of linking sequence to organism reliably have not been developed.
There is a need to develop tools to capture data in their entirety and to extract data from
non-electronic media so as to create organized electronic forms of the data that can be
accessed and searched in the same way as other databases. This means not only
extracting information from free text sources, but also capturing data that are presented in
the form of tables, etc. Input from various aspects of computer science is expected
including, for example, the natural language processing community and those involved in
diagrammatic reasoning.
Research proposed under this heading must have strong links to the biological data
needed by the community. The feasibility of some of these techniques may require the
building of exemplar databases, which should, however, be of wider benefit to the
community. Creation of new databases by entering data into existing database structures
was not supported.
1.7.2.5 Integrity and maintenance of biological databases
There is an opportunity for novel computer science and IT approaches to be applied to
the maintenance of biological databases: in the processing of data, the addition of new
data, and the automation of quality control (validation, global integrity checking,
error checking, traceability of results, etc.). An important aspect of this process is the
systematic checking for anomalies and identifying whether these are errors or interesting
biological phenomena. Some of the technologies that might be applied in this area
include those for semi-structured data management.
Any proposed system will need to be scalable and applicable to the complex data that
exist in biology.[14]
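
As a minimal sketch of the automated quality control described above, the following Python fragment runs simple validation checks over sequence records and flags anomalies for review rather than rejecting them, since an anomaly may be an error or a genuine biological phenomenon. The record fields (`accession`, `organism`, `sequence`), the check rules, and the function names `check_record` and `validate` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Assumed alphabet for this sketch; real databases accept more IUPAC codes.
DNA_ALPHABET = set("ACGTN")


@dataclass
class Record:
    accession: str
    organism: str
    sequence: str


def check_record(record, seen_accessions):
    """Return a list of human-readable problems found in one record."""
    problems = []
    if not record.accession:
        problems.append("missing accession")
    elif record.accession in seen_accessions:
        problems.append("duplicate accession")  # global integrity check
    if not record.organism.strip():
        problems.append("missing organism")
    unexpected = set(record.sequence.upper()) - DNA_ALPHABET
    if unexpected:
        problems.append("unexpected symbols: " + "".join(sorted(unexpected)))
    if not record.sequence:
        problems.append("empty sequence")
    return problems


def validate(records):
    """Map each flagged accession (or record position) to its problems."""
    seen = set()
    report = {}
    for index, record in enumerate(records):
        problems = check_record(record, seen)
        if problems:
            report[record.accession or f"#{index}"] = problems
        seen.add(record.accession)
    return report
```

A curator would then inspect the flagged entries by hand; the sketch only detects anomalies and makes no attempt to decide which are errors.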
1.7.3 Applications of Biological Databases
• The sequencing of the human genome and the emerging intense interest in
proteomics and molecular structure have caused an enormous explosion in the
need for biological databases. These include genome and sequence databases such
as GenBank and Ensembl, protein databases such as PDB and SWISS-PROT,
and tools for analyzing, accessing, and manipulating sequence databases, such as
BLAST, multiple alignment programs, Perl, and gene-finding tools.
• Databases are used in many applications, spanning virtually the entire range of
computer software.
• Databases are the preferred method of storage for large multiuser applications,
where coordination between many users is needed. Even individual users find
them convenient, though, and many electronic mail programs and personal
organizers are based on standard database technology. Software database drivers
are available for most database platforms so that application software can use a
common application programming interface (API) to retrieve the information
stored in a database.
• Researchers are widening their scope of research. The explosion of available
sequence data from many organisms has enabled researchers to more readily
compare sequences of interest from many different species in combination with a
number of model organism databases. In addition to databases focused on a
single species, databases that deal with taxonomically related species have
emerged recently.
• With advances in, and in-depth applications of, computer technologies in biology,
database modeling for biological data management is emerging as a new
discipline.
• By means of database technology, large volumes of biological data with complex
structures can be modeled in conceptual data models and then stored in
databases. Biologists can then use these databases to handle and retrieve
the data, and to support teams analyzing and mining the data
throughout a biological discovery process.
1.7.4 Microorganism Databases
Due to the enormous explosion in the amount of sequence data available related to
microorganisms, a number of microorganism databases have been established. Five such
microbial databases are mentioned below:
1.7.4.1 Antimicrobial wild type distributions of Microorganisms
The EUCAST (European Committee on Antimicrobial Susceptibility Testing) under the
auspices of the ESCMID (European Society for Clinical Microbiology and Infectious
Diseases) offers this free website of distributions of MIC-values of wild type bacteria and
fungi. Each MIC-distribution is defined by the micro-organism and the antimicrobial
drug. It is the compound result of a number of separate distributions submitted to
EUCAST from organizations such as national breakpoint committees, pharmaceutical
industry, antimicrobial resistance surveillance programs and research projects. The
database is released for public use, drug by drug, by the EUCAST steering committee
and thereby also by the national breakpoint committees. The distributions are used by the
committee for defining epidemiological cut-off values for early detection and
surveillance of resistance development, and for the harmonization of European clinical
breakpoints.
Each graph contains information on the number of sources of data, the total number of
organisms, and when defined by EUCAST, clinical breakpoints and/or the
epidemiological cut-off value. The epidemiological cut-off value is related to the MIC
distribution of the wild type organism.
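
To illustrate how an epidemiological cut-off value relates to an MIC distribution, the sketch below tallies a distribution and labels isolates with an MIC at or below the cut-off as wild type. The MIC values and the cut-off of 0.064 mg/L are invented for illustration and are not EUCAST data; the function names are hypothetical.

```python
from collections import Counter


def mic_distribution(mics):
    """Count isolates at each MIC value (mg/L), sorted by concentration."""
    return dict(sorted(Counter(mics).items()))


def classify(mic, ecoff):
    """Label an isolate as wild type when its MIC does not exceed the cut-off."""
    return "wild type" if mic <= ecoff else "non-wild type"


# Invented example values, not taken from the EUCAST database.
example_mics = [0.016, 0.032, 0.032, 0.064, 0.5, 2.0]
example_ecoff = 0.064

distribution = mic_distribution(example_mics)
labels = [classify(m, example_ecoff) for m in example_mics]
non_wild_type = labels.count("non-wild type")
```

In this toy data set, the isolates above the cut-off are the ones a surveillance program would follow up as possible early signs of resistance development.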
The distributions of MIC values of wild-type Escherichia coli are calculated against the
following 35 compounds:
Amikacin, Aztreonam, Cefazolin, Cefepime, Cefotaxime, Cefoxitin, Cefpodoxime,
Ceftazidime, Ceftibuten, Ceftiofur, Ceftriaxone, Cefuroxime, Chloramphenicol,
Ciprofloxacin, Colistin, Enrofloxacin, Ertapenem, Florfenicol, Flumequine, Gentamicin,
Imipenem, Kanamycin, Levofloxacin, Meropenem, Moxifloxacin, Nalidixic acid,
Neomycin, Netilmicin, Nitrofurantoin, Norfloxacin, Ofloxacin, Streptomycin,
Tigecycline, Tobramycin, and Trimethoprim.[15]
1.7.4.2 RIDOM (Ribosomal Differentiation of Medical Micro-organisms Database)
This database differentiates medical microorganisms based on partial small subunit
ribosomal DNA (16S rDNA) sequence.
This web server is an evolving electronic resource designed to provide micro-organism
differentiation services for medical identification needs. The diagnostic procedure begins
with a partial small subunit ribosomal DNA (16S rDNA) sequence from the specimen. A
similarity search then returns a species or genus name for the specimen in question.
Where the first results are ambiguous or do not resolve to species level, hints for
further molecular, i.e. internal transcribed spacer, and conventional phenotypic
differentiation will be offered (‘sequential and polyphasic approach’). Additionally, each
entry in RIDOM contains detailed medical and taxonomic information linked, context-
sensitive, to external World Wide Web services. Nearly all sequences are newly
determined and the sequence chromatograms are available for intersubjective quality
control[16].
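
The similarity search step can be sketched as follows: a query sequence is compared with each reference sequence and the best-scoring name is returned. This is a toy illustration using position-by-position identity over equal-length fragments, not RIDOM's actual search method; the reference fragments and species names below are invented placeholders, not real 16S rDNA data.

```python
def percent_identity(a, b):
    """Percentage of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def best_match(query, references):
    """Return (name, identity) of the reference most similar to the query."""
    scored = [(name, percent_identity(query, seq)) for name, seq in references.items()]
    return max(scored, key=lambda item: item[1])


# Placeholder fragments for illustration only.
references = {
    "Species A": "ACGTACGTAC",
    "Species B": "ACGTTCGAAA",
    "Species C": "TTGTACGAAC",
}
name, identity = best_match("ACGTTCGTAC", references)
```

A real service would align sequences of different lengths before scoring and would report ambiguous results when several references score almost equally well.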
1.7.4.3 WDCM (World Data Centre for Microorganisms)
WFCC-MIRCEN World Data Centre for Microorganisms (WDCM) provides a
comprehensive directory of culture collections, databases on microbes and cell lines, and
the gateway to biodiversity, molecular biology and genome projects [17].
1.7.4.4 HBMMD (Harbor Branch Marine Microbe Database)
The Division of Biomedical Marine Research (DBMR) at Harbor Branch Oceanographic
Institution (HBOI) has one of the most comprehensive collections of deep-water sponges
in the world, having led numerous expeditions and submersible collections at more than
400 sites over nearly two decades. Through accumulation of new species and site records,
long-term surveys are effectively being conducted and biodiversity inventories and
catalogs (in the form of cruise reports) are a significant by-product of the division's primary
mission of drug discovery.
The HBMM Culture Collection consists of over 16,000 total marine bacteria and fungi
(9000 derived specifically from marine invertebrates). This collection is maintained as a
source of microbes for DBMR’s Fermentation Program which systematically cultures the
isolates for novel bioactive product discovery. However, except for an initial Gram-stain
and a description of basic cellular and colonial morphology, few of the strains in this
collection have been taxonomically classified.[18]
1.7.4.5 EchoBASE
This is a relational database designed to contain and manipulate information from post-
genomic experiments using the model bacterium Escherichia coli K-12. Its aim is to
collate information from a wide range of sources to provide clues to the functions of the
approximately 1500 gene products that have no confirmed cellular function. The database
is built on an enhanced annotation of the updated genome sequence of strain MG1655
and the association of experimental data with the E. coli genes and their products.
Experiments that can be held within EchoBASE include proteomics studies, microarray
data, protein–protein interaction data, structural data, and bioinformatics studies.
EchoBASE also contains annotated information on ‘orphan’ enzyme activities from this
microbe to aid characterization of the proteins that catalyze these elusive biochemical
reactions [19].
Databases are used in many applications, spanning virtually the entire range of
computer software. Databases are the preferred method of storage for large multiuser
applications, where coordination between many users is needed. Even individual users
find them convenient, and many electronic mail programs and personal organizers are
based on standard database technology. Software database drivers are available for most
database platforms so that application software can use a common Application
Programming Interface to retrieve the information stored in a database. Two commonly
used database APIs are JDBC and ODBC.
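
The same idea of a common programming interface can be sketched in Python, whose standard `sqlite3` module follows the language's shared database interface (DB-API 2.0). The example below stores and retrieves a few records; the table name, column names, and sample rows (including the placeholder accession values) are hypothetical and chosen only for illustration.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE protein (accession TEXT PRIMARY KEY, name TEXT, organism TEXT)"
)

# Hypothetical sample rows; the accession values are placeholders.
rows = [
    ("P0001", "aldose reductase", "Homo sapiens"),
    ("P0002", "glycogen synthase kinase-3 beta", "Homo sapiens"),
    ("P0003", "aldose reductase", "Rattus norvegicus"),
]
connection.executemany("INSERT INTO protein VALUES (?, ?, ?)", rows)
connection.commit()

# Parameterized query: the driver handles quoting of the search value.
cursor = connection.execute(
    "SELECT accession FROM protein WHERE name = ? ORDER BY accession",
    ("aldose reductase",),
)
accessions = [row[0] for row in cursor.fetchall()]
connection.close()
```

Because the application code talks only to the generic interface, the same pattern of connecting, executing, and fetching would carry over to another database engine with a different driver, which is the role JDBC and ODBC play on their own platforms.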
Importance of databases
1. Compactness
There is no need for bulky paper files.
2. Speed
The computer can retrieve and update stored data far faster than a human can
by hand.
3. Less drudgery
The computer takes over much of the routine work of maintaining files.
4. Currency
Accurate, up-to-date information is available whenever the database is queried [20].
Databases are easy to set-up, easy to manipulate and easy to use. A database allows you
to maintain order in what could be a very chaotic environment [21].
Biological databases are libraries of life sciences information, collected from scientific
experiments, published literature, high throughput experiment technology, and
computational analyses. They contain information from research areas including
genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics.
[22]
Information contained in biological databases includes gene function, structure,
localization (both cellular and chromosomal), clinical effects of mutations as well as
similarities of biological sequences and structures.
1.8 Diabetes
Diabetes mellitus, often referred to simply as diabetes, is a disease in which the body
does not produce enough, or properly respond to, insulin, a hormone produced in the
pancreas. Insulin is needed to turn sugar and other food into energy. In diabetes, the body
either does not make enough insulin or cannot use its own insulin as well as it should, or
both. This causes sugar to accumulate in the blood, often leading to various
complications [23, 24].
Types of Diabetes:
Many types of diabetes are recognized. The principal three are:
1. Type 1
2. Type 2
3. Gestational diabetes.
Type 1
Type 1 diabetes mellitus is characterized by loss of the insulin-producing beta cells
of the islets of Langerhans in the pancreas leading to a deficiency of insulin. This type of
diabetes can be further classified as immune-mediated or idiopathic. The majority of
type 1 diabetes is of the immune-mediated variety, where beta cell loss results from a T-cell
mediated autoimmune attack [23]. The principal treatment of type 1 diabetes, even in its
earliest stages, is the delivery of artificial insulin via injection combined with careful
monitoring of blood glucose levels using blood testing monitors. Without insulin, diabetic
ketoacidosis often develops, which may result in coma or death.
Type 2
Type 2 diabetes mellitus is the most common and is due to insulin resistance or
reduced insulin sensitivity, combined with relatively reduced insulin secretion. There are
numerous theories as to the exact cause and mechanism in type 2 diabetes. Central
obesity is known to predispose individuals to insulin resistance. Abdominal fat is
especially active hormonally, secreting a group of hormones called adipokines that may
impair glucose tolerance. Obesity is found in approximately 55% of patients
diagnosed with type 2 diabetes[25]. Environmental exposures may contribute to recent
increases in the rate of type 2 diabetes. A positive correlation has been found between the
concentration in the urine of bisphenol A, a constituent of polycarbonate plastic from
some producers, and the incidence of type 2 diabetes[26]. Type 2 diabetes is usually first
treated by increasing physical activity, decreasing carbohydrate intake, and losing weight.
These can restore insulin sensitivity even when the weight loss is modest, for example
around 5 kg, most especially when it is in abdominal fat deposits. It is sometimes
possible to achieve long-term, satisfactory glucose control with these measures alone.
However, the underlying tendency to insulin resistance is not lost, and so attention to
diet, exercise, and weight loss must continue. The usual next step, if necessary, is
treatment with oral antidiabetic drugs. Insulin production is initially only moderately
impaired in type 2 diabetes, so oral medication (often used in various combinations) can
be used to improve insulin production (e.g., sulfonylureas), to regulate inappropriate
release of glucose by the liver and attenuate insulin resistance to some extent (e.g.,
metformin), and to substantially attenuate insulin resistance (e.g., thiazolidinediones) [27].
Gestational diabetes mellitus
Pregnant women who have never had diabetes before but who have high blood sugar
(glucose) levels during pregnancy are said to have gestational diabetes mellitus. It is a
risk factor for type 2 diabetes in the mother [28]. It occurs in about 2%–5% of all
pregnancies and may improve or disappear after delivery. Gestational diabetes is fully
treatable but requires careful medical supervision throughout the pregnancy. About 20%–
50% of affected women develop type 2 diabetes later in life [29].
Long-term secondary complications of Diabetes mellitus
Long-term secondary complications are the main cause of morbidity and mortality in diabetic
patients [30]. The major microvascular complications of diabetes include nephropathy,
neuropathy, and retinopathy, whereas cataract is an avascular complication [31].
Several metabolic factors contribute to the dysfunction observed in diabetic
vasculopathy [32]; these include increased glucose flux through the polyol pathway,
increased production of reactive oxygen species by the mitochondrial respiratory chain,
nonenzymatic glycation, protein kinase C activation, and increased flux through the
hexosamine pathway [30]. The polyol pathway has received considerable attention.
Cataract
Diabetic cataract formation follows an increase in sugars in the lens. The excess sugar
within the lens is reduced by aldose reductase to its alcohol, but the lens capsule is
relatively impermeable to sugar alcohols. Because of the excess sugar alcohol (polyol),
the lens imbibes water, causing osmotic imbalance. Eventually, increased sodium and
decreased potassium and glutathione levels lead to cataract formation
[33].
Figure 1.4 Image showing the normal clear lens and the lens clouded by cataract
Figure 1.5 Image showing the vision through a cataract eye.
Retinopathy
Diabetic retinopathy is the result of microvascular retinal changes. Hyperglycemia-
induced pericyte death and thickening of the basement membrane lead to incompetence
of the vascular walls. This damage alters the blood-retinal barrier
and makes the retinal blood vessels more permeable [34].
Small blood vessels – such as those in the eye – are especially vulnerable to poor
blood sugar (blood glucose) control. An overaccumulation of glucose and/or fructose
damages the tiny blood vessels in the retina.
Figure 1.6 Image showing the normal retina and the retina with diabetic retinopathy.
Figure 1.7 Image showing the vision with diabetic retinopathy.
Nephropathy
Diabetic nephropathy is damage to the kidneys caused by diabetes. The kidneys have
many tiny blood vessels that filter waste from the blood. High blood sugar from
diabetes can destroy these blood vessels. Over time, the kidney is no longer able to do its job as
well. Later it may stop working completely.
Figure 1.8 Image showing the abnormal protein leaking in glomerulus of kidney due to nephropathy.
Neuropathy
Diabetic neuropathies are neuropathic disorders that are associated with diabetes mellitus.
These conditions are thought to result from diabetic microvascular injury involving small
blood vessels that supply nerves (vasa nervorum).
The Polyol Pathway
The polyol pathway of glucose metabolism becomes active when intracellular glucose
levels are elevated [35,36]. Aldose reductase (AR), the first and rate-limiting enzyme in
the pathway, reduces glucose to sorbitol using NADPH as a cofactor; sorbitol is then
metabolized to fructose by sorbitol dehydrogenase that uses NAD+ as a cofactor[35].
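
Written as reactions, the two steps described above are as follows (standard stoichiometry, with the proton shown explicitly):

```latex
\begin{align*}
\text{glucose} + \text{NADPH} + \text{H}^{+}
  &\xrightarrow{\text{aldose reductase}} \text{sorbitol} + \text{NADP}^{+} \\
\text{sorbitol} + \text{NAD}^{+}
  &\xrightarrow{\text{sorbitol dehydrogenase}} \text{fructose} + \text{NADH} + \text{H}^{+}
\end{align*}
```

The net effect is the conversion of glucose to fructose, with consumption of NADPH and production of NADH.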
Figure 1.9 Polyol (sorbitol) pathway; glucose-6-P, glucose 6-phosphate
Consequences of Polyol pathway
There are several effects of the Polyol pathway. Sorbitol is an alcohol, polyhydroxylated,
and strongly hydrophilic, and therefore does not diffuse readily through cell membranes
and accumulates intracellularly with possible osmotic consequences [35]. The fructose
produced by the polyol pathway can become phosphorylated to fructose-3-phosphate [37,
38], which is broken down to 3-deoxyglucosone; both compounds are powerful
glycosylating agents that participate in the formation of advanced glycation end products
(AGEs) [37]. The usage of NADPH by AR may result in less cofactor available for
glutathione reductase, which is critical for the maintenance of the intracellular pool of
reduced glutathione (GSH). This would lessen the capability of cells to respond to
oxidative stress [39]. Compensatory increased activity of the glucose monophosphate
shunt, the principal supplier of cellular NADPH, may occur [39]. The usage of NAD+ by
sorbitol dehydrogenase leads to an increased ratio of NADH/NAD+, which has been
termed “pseudohypoxia” and linked to a multitude of metabolic and signaling changes
known to alter cell function [40]. It has been proposed that the excess NADH may
become a substrate for NADH oxidase, and this would be a mechanism for generation of
intracellular oxidant species [41]. Thus, activation of the polyol pathway, by altering
intracellular tonicity, generating AGE precursors, and exposing cells to oxidative stress
perhaps through decreased antioxidant defenses and generation of oxidant species, can
initiate and multiply several mechanisms of cellular damage.
Diabetes causes increased oxidative stress in various tissues as evidenced by increased
levels of oxidized DNA, proteins, and lipids. Besides damaging the functions of these
molecules, oxidative stress also triggers a series of cellular responses, including the
activation of protein kinase C (PKC) [42,43], transcription factor NF-κB [44], and JNK
stress-associated kinases [45], and so forth. Inappropriate activation of these important
regulatory molecules would have deleterious effects on cellular functions, and it is
thought to contribute to the pathogenesis of various diabetic complications [46].
However, it is not clear how hyperglycemia leads to increased oxidative stress. It is most
likely the combined effects of increased levels of reactive oxygen species (ROS) and
decreased capacity of the cellular antioxidant defense system. Glucose auto-oxidation
[47], nonenzymatic glycation [48], the interaction between glycated products and
their receptors [49], overproduction of ROS by mitochondria [50], and the polyol
pathway [51,52] all are potential sources of hyperglycemia-induced oxidative stress.
Figure 1.10 A schematic diagram of possible interactions among factors involved in the pathogenesis
of diabetic complications.
(DAG), diacylglycerol; (β2 PKC), β2 isoform of protein kinase C; (NO), nitric oxide;
(GSH), glutathione; (SOD), superoxide dismutase.
1.9 Aldose Reductase
Aldose reductase (EC 1.1.1.21) is a small monomeric protein composed of 315 amino
acid residues. The primary structure, first determined for rat lens aldose reductase [53,54],
showed high similarity to another NADPH-dependent oxidoreductase, human
liver aldehyde reductase (EC 1.1.1.2) [55], and to ρ-crystallin, a major structural
component of the lens of the frog Rana pipiens [56]. The degree of similarity clearly
suggests that these proteins belong to the same family, namely the aldo-keto reductase
superfamily, with related structures and evolutionary origins.
Tertiary Structure of Aldose Reductase
Crystallographic structures have been determined for pig [57] and human aldose
reductases [58, 59]. The enzyme molecule contains a (β/α)8 barrel structural motif with a
large hydrophobic active site. The cofactor NADPH binds in an extended conformation
to the bottom of the active site, located at the center of the barrel. The holoenzyme
structure complexed with the enzyme inhibitor zopolrestat further demonstrated that the
inhibitor binds to the active site on top of the nicotinamide ring of the NADPH [60].
When zopolrestat was complexed with the holoenzyme, however, it perturbed the
position of two loops in the protein and changed the shape of the active site pocket. When
the enzyme was complexed with another inhibitor sorbinil, the inhibitor simply occupied
the active site pocket and did not induce further conformational change in the enzyme
molecule [61]. These findings suggest that many compounds with diverse chemical
structures can interact with the enzyme in different conformations. This illustrates the
dangers of using theoretical approaches to predict a rigid inhibitor-binding site of
aldose reductase, as the enzyme apparently retains considerable flexibility in its tertiary
structure [62].
Physiological Significance of Aldose Reductase
Osmoregulatory Role in the Kidney
In the previous decade, elevated extracellular NaCl was demonstrated to elicit marked
increase in aldose reductase expression and accumulation of intracellular sorbitol in the
cultured cell line from rabbit renal papilla [63]. In the kidney, aldose reductase mRNA
was abundantly expressed in the medulla compared with relatively low expression in the
cortex [64]. These findings were confirmed by biochemical and immunohistochemical
analyses of rat and human kidneys [65]. Sorbitol is one of the organic osmolytes that
balance the osmotic pressure of extracellular NaCl, fluctuating in accord with urine
osmolality [66]. These findings, therefore, suggest an osmoregulatory role for aldose
reductase in renal homeostasis.
Unique Tissue Distribution Pattern of Aldose Reductase
Recent investigations have revealed an unexpected distribution pattern of aldose
reductase, not only in different species but also in tissues other than “target” organs of diabetic
complications. In the mouse, aldose reductase mRNA was most abundantly expressed in the
testis, whereas a very low level of the transcript was detected in the sciatic nerve and lens
[67].
Diverse Substrates for Aldose Reductase
Other lines of investigation have demonstrated that aldose reductase exhibits broad
substrate specificity for both hydrophilic and hydrophobic aldehydes. Aldose reductase
and the structurally related enzyme in the aldo-keto reductase family, aldehyde reductase,
both catalyze the reduction of biogenic aldehydes derived from the catabolism of the
catecholamines and serotonin by the action of monoamine oxidase [68,69,70]. These two
enzymes also catalyze the reduction of isocorticosteroids, intermediates in the catabolism
of the corticosteroid hormones [71]. Recently, aldose reductase in the adrenal gland was
reported to be a major reductase for isocaproaldehyde, a product of side-chain cleavage of
cholesterol [72].
Variable Levels of Aldose Reductase in Diabetic Patients
Substantial variations in the levels of aldose reductase expression in various tissues exist
among individuals with or without diabetes. Marked variability in aldose reductase
activity was reported for enzyme preparations isolated from human placentas [73].
Aldose reductase purified from erythrocytes exhibited a nearly three-fold variation in
activity among diabetic patients [74]. Such differences in the activity of aldose reductase
may influence the susceptibility of patients to glucose toxicity via acceleration of the polyol
pathway when these individuals are maintained under equivalent glycemic control.
Table 1.1 Various proteins involved in causing diabetes
Diabetes type       Responsible protein
Type 1 Diabetes     1) Methylglyoxal (MGO)-induced hydroimidazolones
                    2) Aminoguanidine
                    3) Glutamate decarboxylase 2
                    4) Integrin, alpha M (complement component 3 receptor 3 subunit)
                    5) Tyrosine phosphatase
                    6) Reg family (hepatocellular carcinoma intestine pancreas
                       [HIP]/pancreatitis-associated protein [PAP])
                    7) Poly (ADP-ribose) polymerase-1 (PARP-1)
Type 2 Diabetes     1) Retinol binding protein-4 (RBP4), a binding protein for
                       retinol (vitamin A)
                    2) New C1q/TNF-related protein (CTRP-3)
                    3) Streptozotocin
                    4) AMP-activated protein kinase (AMPK)
                    5) Protein kinase C
                    6) Peroxisome proliferator–activated receptor (PPAR)
                    7) Glutamine:fructose-6-phosphate amidotransferase (GFAT)
                    8) Serum alanine aminotransferase (ALT)
                    9) Insulin-like growth factor binding protein-3 (IGFBP-3)
                    10) Glycogen synthase kinase-3 (GSK-3)
                    11) Aldose reductase