

CHAPTER 1: INTRODUCTION

1.1 Introduction

India is often described as the diabetes capital of the world, so the design and
development of novel antidiabetic drugs requires immediate attention. Although many
drugs exist for Type 2 diabetes, none offers a 100% effective cure. Drug design is based
mainly on protein-ligand interactions and on active-site residues. To help drug
designers, a consensus database covering five important candidate proteins implicated
in diabetes has been developed.

Relational database concepts of computer science and Information retrieval concepts of

digital libraries are important for understanding biological databases. Biological database

design, development, and long-term management is a core area of the discipline of

Bioinformatics. Data contents include gene sequences, textual descriptions, attributes and

ontology classifications, citations, and tabular data. These are often described as semi-

structured data, and can be represented as tables, key delimited records, and XML

structures. Cross-references among databases are common, using database accession

numbers.
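The three representations mentioned above can be illustrated with a short Python sketch. It is only an illustration: the record, the accession number ACC0001, and the element and field names are hypothetical and do not follow the schema of any particular database; only the PDB identifier 1AH3 is taken from this thesis.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: ACC0001 is a placeholder accession number.
record = {
    "accession": "ACC0001",
    "name": "Aldose reductase",
    "cross_refs": ["PDB:1AH3"],  # cross-reference to another database
}

def to_table_row(rec):
    """Tabular form: one tab-separated row per record."""
    return "\t".join([rec["accession"], rec["name"], ";".join(rec["cross_refs"])])

def to_key_delimited(rec):
    """Key-delimited form: a short key followed by the value on each line."""
    lines = [f"AC   {rec['accession']}", f"DE   {rec['name']}"]
    lines += [f"DR   {ref}" for ref in rec["cross_refs"]]
    return "\n".join(lines)

def to_xml(rec):
    """XML form: nested elements, with cross-references as child elements."""
    entry = ET.Element("entry", accession=rec["accession"])
    ET.SubElement(entry, "name").text = rec["name"]
    for ref in rec["cross_refs"]:
        db, ident = ref.split(":", 1)
        ET.SubElement(entry, "crossRef", db=db, id=ident)
    return ET.tostring(entry, encoding="unicode")

print(to_table_row(record))
print(to_key_delimited(record))
print(to_xml(record))
```

In all three forms the accession number is the key that another database would use to refer back to this record.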

A mutation in a protein may lead to a malfunction that results in disease. A single
protein may contribute to several diseases, and a single disease can be caused by many
proteins. Many proteins are implicated in diabetes, but not all of them have ligands
available to correct the malfunction. Designing a drug by the conventional process is
time consuming and expensive; bioinformatics tools can reduce both the time and the cost.


1.2 Background

Protein-coding genes related to diabetes were identified from the GeneCards website.
Many were screened out because they have no PDB ID, and others because they have no
ligands. After these were eliminated, only the closely linked proteins with ligands
remained, and the five best-suited proteins were finally selected: dipeptidyl peptidase 4
(DPP-4), peroxisome proliferator-activated receptor gamma (PPAR-γ), protein tyrosine
phosphatase, non-receptor type 1 (PTPN1), glycogen synthase kinase-3 beta (GSK-3β), and
aldose reductase. Of these, aldose reductase exhibits the highest consensus. The average
docking score of the ligands and inhibitors is -126.048 kcal/mol.

1.3 Problem Statement

The problem addressed in this study is to identify ligands that bind with higher
affinity than the existing ligands associated with diabetes-causing proteins. Candidate
ligands are drawn from several sources, such as a plant database, a chalcone database,
and the ZINC database, with the aim of finding a ligand whose docking score is better
than -126.048 kcal/mol.
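The selection criterion can be written down as a minimal Python sketch. It assumes that a more negative docking score means higher affinity, which is the usual convention for docking energies but is an assumption here; the ligand names and scores are hypothetical placeholders, not results from this study.

```python
REFERENCE_SCORE = -126.048  # kcal/mol, the average docking score quoted above

def better_than_reference(scores, reference=REFERENCE_SCORE):
    """Return (ligand, score) pairs that beat the reference, best first.

    Assumption: a lower (more negative) score means stronger binding.
    """
    hits = [(name, score) for name, score in scores.items() if score < reference]
    return sorted(hits, key=lambda pair: pair[1])

# Hypothetical scores, for illustration only.
example_scores = {"ligand_a": -131.2, "ligand_b": -119.7, "ligand_c": -140.5}
print(better_than_reference(example_scores))
```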

Advances in computational techniques have enabled virtual screening to have a positive

impact on the discovery process. In ligand-based virtual screening, the strategy is to use

information provided by a compound or set of compounds that are known to bind to the

desired target and to use this to identify other compounds in the corporate database or

external databases with similar properties.


The aim is to design and develop a novel drug with fewer side effects and lower cost
using various bioinformatics tools. We developed a technique that minimizes human
intervention in the calculation of ligand properties.

1.4 Nature of Study

The current research extracted data from databases available online and assembled it
in the desired format. The diabetes-causing protein with the highest consensus is
identified using root mean square deviation (RMSD) and the rank-sum technique. In
addition, protein-ligand docking is performed to identify the ligand that binds with the
highest affinity. The study also compares the results of different software packages to
identify the best ligand for the diabetes-causing protein.
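The two quantities named above can be sketched in Python as follows. This is a generic illustration, not the procedure used in this thesis: the RMSD is computed over paired coordinates without any prior superposition, the rank sum assumes that a lower score is better, ties are not treated specially, and the program and ligand names are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum((a - b) ** 2
                for p, q in zip(coords_a, coords_b)
                for a, b in zip(p, q))
    return math.sqrt(total / len(coords_a))

def rank_sum(score_table):
    """Sum each ligand's rank over several programs (rank 1 = lowest score).

    score_table maps a program name to a {ligand: score} dictionary.
    The ligand with the smallest rank sum has the best consensus.
    """
    totals = {}
    for scores in score_table.values():
        ordered = sorted(scores, key=scores.get)
        for rank, ligand in enumerate(ordered, start=1):
            totals[ligand] = totals.get(ligand, 0) + rank
    return totals

# Hypothetical scores from two hypothetical docking programs.
table = {
    "program_1": {"A": -10.0, "B": -5.0, "C": -1.0},
    "program_2": {"A": -9.0, "B": -2.0, "C": -4.0},
}
print(rank_sum(table))
```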

1.5 Thesis Overview

Figure 1.1 Potential areas for in silico intervention in the drug discovery process


The principal topic of this work is the application of the root mean square deviation
and rank-sum techniques to identify ligands with high affinity.

Chapter 1 provides the motivation, the role of bioinformatics in drug discovery, and
background information on the present research on diabetes and on the diabetes-causing
protein aldose reductase.

Chapter 2 reviews the literature relevant to the present study and describes
antidiabetic agents. In addition, various bioinformatics tools and techniques are
reviewed.

Chapter 3 presents the materials and methods used to predict protein-ligand
interactions, using root mean square deviation and the Tsar rank-sum technique to
identify high-affinity ligands.

Chapter 4 presents the results and discussion. Apigetrin ranked highest and is reported
as the best compound for binding with high affinity to the aldose reductase enzyme.
Similarly, ZINC00844930 (out of 1001 hits) and ZINC00702953 (out of 837 hits) from the
ZINC database, Allium38 from the in-house plant database, and
44[IC]COX,LOinhibitor-Me-UCH3 from the chalcone database gave the best results, with
binding energies better than that of the original co-crystallized ligand of 1AH3.

Chapter 5 presents conclusions and directions for further work to improve the process
and to optimize non-conventional drug discovery, leading to efficient drug design
through computer-aided drug discovery.


1.6 Research Questions

• What computational approach can highlight the crucial amino acid residues
responsible for functional attributes?

• How can the structures of ligands for a protein be predicted?

• How can the interactions at the binding sites of ligands be understood?

• How can computer-aided design reduce the time spent on the synthesis and
experimental testing of compounds, and lead to effective drug-like compounds?

1.7 Biological Databases

A biological database is a large, organized body of persistent data, usually associated

with computerized software designed to update, query, and retrieve components of the

data stored within the system.

As of 2006, there were over 1,000 public and commercial biological databases. These
biological databases usually contain genomics and proteomics data, but databases are also
used in taxonomy. The data are nucleotide sequences of genes or amino acid sequences
of proteins. Information about function, structure, chromosomal localisation, and the
clinical effects of mutations, as well as similarities between biological sequences,
can also be found [1].

1.7.1 Types of Biological Databases

There are many different types of databases but for routine sequence analysis, the

following are initially the most important:

• Primary databases

• Secondary databases

• Composite databases

1.7.1.1 Primary Databases

These contain sequence data, such as nucleic acid or protein sequences. Some examples
of primary databases include:

Nucleic Acid Databases: EMBL, Genbank, DDBJ

Protein Databases: SWISS-PROT, TREMBL, PIR

EMBL (European Molecular Biology Laboratory)

The EMBL-Nucleotide Sequence Database is a comprehensive database of DNA and

RNA sequences collected from the scientific literature and patent applications and

directly submitted from researchers and sequencing groups. It constitutes Europe’s

primary nucleotide sequence resource. The database is produced in an international

collaboration with Genbank (USA) and the DNA Database of Japan (DDBJ). Each of the

three groups collects a portion of the total sequence data reported worldwide, and all new

and updated database entries are exchanged between the groups on a daily basis. The

current database release is Release 88, September 2006.

The EMBL Nucleotide Sequence Database can be accessed in a variety of ways. You can
query the database using the SRS system or choose another access method. In EMBL, one
can also access feature tables, FAQs, manuals and guides.

Submission of sequence information to the nucleotide sequence database prior to
publication has become standard practice. A unique accession number is assigned by the
database, which permanently identifies the submitted sequence. The database accession
number should be included in the manuscript, preferably on the first page of the journal
article, or as required by individual journal procedures. This procedure ensures the
availability and distribution of new sequence data in a timely fashion [2].

Figure 1.2 The International Sequence Database Collaboration

Genbank

GenBank (Genetic Sequence Databank) is one of the fastest growing repositories of

known genetic sequences. It has a flat file structure that is an ASCII text file, readable by

both humans and computers. In addition to sequence data, GenBank files contain

information like accession numbers and gene names, phylogenetic classification and

references to published literature.
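Because the flat file is plain ASCII text, it can be read with a few lines of code. The Python sketch below is a simplified illustration rather than a complete GenBank parser: it assumes that top-level keywords start in the first column and occupy the first 12 columns, that indented lines continue the previous field, and that the sequence follows the ORIGIN keyword up to the `//` terminator. Indented sub-keywords and the feature table are not handled separately, and the sample record is hypothetical.

```python
def parse_flat_file(text):
    """Parse a single GenBank-style flat-file record into a dictionary."""
    fields, sequence = {}, []
    key, in_origin = None, False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_origin:                      # sequence lines: keep letters only
            sequence.append("".join(ch for ch in line if ch.isalpha()))
        elif line[:1].strip():             # a keyword starts in column 1
            key = line[:12].strip()
            in_origin = key == "ORIGIN"
            if not in_origin:
                fields[key] = line[12:].strip()
        elif key is not None and line.strip():
            fields[key] += " " + line.strip()   # continuation line
    fields["sequence"] = "".join(sequence)
    return fields

# Hypothetical record, built with ljust so the 12-column layout is explicit.
SAMPLE = "\n".join([
    "LOCUS".ljust(12) + "EXAMPLE01 12 bp DNA",
    "DEFINITION".ljust(12) + "Hypothetical example record,",
    " " * 12 + "continued on a second line.",
    "ACCESSION".ljust(12) + "EX000001",
    "ORIGIN",
    "        1 atgcatgcat gc",
    "//",
])
print(parse_flat_file(SAMPLE))
```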

GenBank is the NIH genetic sequence database, an annotated collection of all publicly
available DNA sequences. GenBank, together with EMBL and DDBJ, has reached a
milestone of 100 billion bases from over 165,000 organisms. GenBank is part of the
International Nucleotide Sequence Database Collaboration, which comprises the DNA
DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and
GenBank at NCBI. These three organizations exchange data on a daily basis.

Many journals require submission of sequence information to a database prior to
publication so that an accession number may appear in the paper. BankIt, a WWW-based
submission tool, allows convenient and quick submission of sequence data. Sequin,
NCBI's stand-alone submission software, is available by FTP. When using Sequin, the
output files for direct submission should be sent to GenBank by electronic mail.

The GenBank database is designed to provide and encourage access within the scientific

community to the most up to date and comprehensive DNA sequence information.

Therefore, NCBI places no restrictions on the use or distribution of the GenBank data.

However, some submitters may claim patent, copyright, or other intellectual property

rights in all or a portion of the data they have submitted.[3]

Figure 1.3 Growth of the International Nucleotide Sequence Database Collaboration


DDBJ (DNA Databank of Japan)

DDBJ began its DNA Databank activities in 1986 at the National Institute of Genetics
(NIG). It has been functioning as an international nucleotide sequence database in
collaboration with EBI/EMBL and NCBI/GenBank.

DNA sequences record organismic evolution more directly than other biological
materials and are thus invaluable not only for research in the life sciences but also
for human welfare in general. The databases are a common treasure of humankind.

From the beginning, DDBJ has been functioning as one of the International DNA

Databases, including EBI (European Bioinformatics Institute; responsible for the EMBL

database) in Europe and NCBI (National Center for Biotechnology Information;

responsible for GenBank database) in the USA as the two other members. Consequently,

DDBJ has been collaborating with the two data banks through exchanging data and

information on Internet and by regularly holding two meetings, the International DNA

Data Banks Advisory Meeting and the International DNA Data Banks Collaborative

Meeting.

DDBJ is the sole DNA data bank in Japan that is officially certified to collect DNA
sequences from researchers and to issue internationally recognized accession numbers
to data submitters. It collects data mainly from Japanese researchers, but also accepts
data from, and issues accession numbers to, researchers in any other country. Since
DDBJ exchanges data with EMBL/EBI and GenBank/NCBI on a daily basis, the three data
banks share virtually the same data at any given time.


In DDBJ, nucleotide sequence data can be submitted using SAKURA (via a WWW server).
The Mass Submission System (MSS) can be used when the submission consists of a large
number of sequences, when it involves long nucleotide sequences that result in a complex
submission containing many features, such as genome data, or when the submission is
otherwise unsuitable for SAKURA. DDBJ also accepts submissions made with Sequin, a
stand-alone software tool for preparing data submission files [4].

SwissProt

The UniProtKB/Swiss-Prot Protein Knowledgebase is an annotated protein sequence

database established in 1986. It is a curated protein sequence database that provides a

high level of annotation, a minimal level of redundancy and a high level of integration

with other databases. Together with UniProtKB/TrEMBL, it constitutes the UniProt

Knowledgebase. It is maintained collaboratively by the Swiss Institute for Bioinformatics

(SIB) and the European Bioinformatics Institute (EBI).

The UniProtKB/Swiss-Prot group is headed by Rolf Apweiler. The current Swiss-Prot
release is version 51.2, dated 28 November 2006.

The UniProtKB/Swiss-Prot database can be accessed using the search engines SRS and

UniProt Power Search. SRS is the simplest and easiest method available to access the

UniProtKB/Swiss-Prot sequence database. This search tool can also be used for more

complex and/or multiple database queries. UniProt Power Search provides full text,

advanced search, set manipulation and search filtering on the Universal Protein Resource.


The UniProtKB/Swiss-Prot database can also be accessed through the ExPASy server in
Geneva, which offers the choice of full-text search or search of individual lines
(e.g. ID, AC, DE, OS, OG, GN, RL, RA) and SP-ML.
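The difference between a full-text search and a search restricted to one line type can be sketched in Python. The sketch assumes the Swiss-Prot flat-file layout in which each line starts with a two-letter code followed by three blanks; the entries, identifiers, and function names are hypothetical, and this is not how the ExPASy server itself is implemented.

```python
def parse_entry(text):
    """Map each two-letter line code to the list of its line contents."""
    lines = {}
    for raw in text.splitlines():
        if len(raw) > 5 and raw[:2].isalpha():
            lines.setdefault(raw[:2], []).append(raw[5:].strip())
    return lines

def search(entries, term, line_code=None):
    """Return entry names whose lines contain term (case-insensitive).

    With line_code=None every line type is searched (full-text search);
    otherwise only lines of that type, e.g. "OS" for the organism.
    """
    term = term.lower()
    hits = []
    for text in entries:
        parsed = parse_entry(text)
        codes = [line_code] if line_code else list(parsed)
        if any(term in value.lower()
               for code in codes
               for value in parsed.get(code, [])):
            hits.append(parsed["ID"][0].split()[0])
    return hits

# Hypothetical entries, for illustration only.
ENTRY_A = "ID   EXAMPLE1_HUMAN\nAC   X00001;\nDE   Hypothetical reductase.\nOS   Homo sapiens.\n//"
ENTRY_B = "ID   EXAMPLE2_MOUSE\nAC   X00002;\nDE   Hypothetical kinase with a human-like fold.\nOS   Mus musculus.\n//"

print(search([ENTRY_A, ENTRY_B], "human"))        # full-text search
print(search([ENTRY_A, ENTRY_B], "homo", "OS"))   # organism lines only
```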

The UniProtKB/Swiss-Prot protein data bank provides accession numbers for protein

sequences when the peptide(s) have been directly sequenced. These sequences should be

submitted to UniProtKB/Swiss-Prot at the EBI. Swiss-Prot does not provide accession

numbers, in advance, for protein sequences that are the result of translation of nucleic

acid sequences. These translations will automatically be forwarded to Swiss-Prot from

the EMBL nucleotide database and are assigned UniProtKB/Swiss-Prot accession

numbers on incorporation into UniProtKB/TrEMBL.

SPIN is the web-based tool for submitting directly sequenced protein sequences and their

biological annotations to the UniProtKB/Swiss-Prot Protein Knowledgebase. SPIN

guides you through a sequence of WWW forms allowing interactive submission. The

information required to create a database entry will be collected during this process [5].

TrEMBL

UniProtKB/TrEMBL is a computer-annotated protein sequence database complementing

the UniProtKB/Swiss-Prot Protein Knowledgebase.

UniProtKB/TrEMBL contains the translations of all coding sequences (CDS) present in

the EMBL/GenBank/DDBJ Nucleotide Sequence Databases and also protein sequences

extracted from the literature or submitted to UniProtKB/Swiss-Prot. The database is


enriched with automated classification and annotation. The UniProtKB/TrEMBL group is

headed by Rolf Apweiler. The current TrEMBL Release is version 34.2 as of 28-Nov-

2006.

The UniProtKB/TrEMBL database is split into two main sections: SP-TrEMBL and

REM-TrEMBL. SP-TrEMBL (Swiss-Prot TrEMBL) contains the entries that will

eventually be incorporated into UniProtKB/Swiss-Prot and can be considered as a

preliminary section of UniProtKB/Swiss-Prot as all SP-TrEMBL entries have been

assigned UniProt accession numbers. REM-TrEMBL (REMaining TrEMBL) contains the

entries that we do not want to include in UniProtKB/Swiss-Prot, such as

immunoglobulins and T-cell receptors, synthetic sequences, patent application sequences,

small fragments, pseudogenes and truncated proteins. REM-TrEMBL entries have no

accession numbers.

In addition, there is a weekly update to UniProtKB/TrEMBL called TrEMBLnew.

UniProtKB/TrEMBLnew is produced from nucleotide sequences deposited in the EMBL

nucleotide sequence database. At each UniProtKB/TrEMBL release the annotation of

UniProtKB/TrEMBLnew entries is upgraded, redundant entries are merged and the

remainder are then added to TrEMBL.

The UniProtKB/TrEMBL database can be accessed using the search engines SRS and

UniProt Power Search. SRS is the simplest and easiest method available to access the

UniProtKB/Swiss-Prot sequence database. This search tool can also be used for more

complex and/or multiple database queries. UniProt Power Search provides full text,

advanced search, set manipulation and search filtering on the Universal Protein Resource.

The UniProtKB/Swiss-Prot database can also be accessed through the ExPASy server in
Geneva, which offers the choice of full-text search or search of individual lines
(e.g. ID, AC, DE, OS, OG, GN, RL, RA) and SP-ML [6].

PIR (Protein Information Resource)

The PIR, located at Georgetown University Medical Center (GUMC), is an integrated

public bioinformatics resource to support genomic and proteomic research, and scientific

studies.

PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) as

a resource to assist researchers in the identification and interpretation of protein sequence

information [7]. Prior to that, the NBRF compiled the first comprehensive collection of

macromolecular sequences in the Atlas of Protein Sequence and Structure, published

from 1965-1978 under the editorship of Margaret O. Dayhoff. Dr. Dayhoff and her

research group pioneered the development of computer methods for the comparison of

protein sequences, for the detection of distantly related sequences and duplications within

sequences, and for the inference of evolutionary histories from alignments of protein

sequences.

For four decades, PIR has provided many protein databases and analysis tools freely

accessible to the scientific community, including the Protein Sequence Database (PSD),

the first international database, which grew out of Atlas of Protein Sequence and

Structure.

In 2002, PIR, along with its international partners, EBI (European Bioinformatics

Institute) and SIB (Swiss Institute of Bioinformatics), was awarded a grant from NIH to

create UniProt, a single worldwide database of protein sequence and function, by

unifying the PIR-PSD, Swiss-Prot, and TrEMBL databases.


PIR can be searched in the following ways:

• Text Search: Enter text or identifiers to search against iProClass for individual

proteins, or PIRSF for protein families. An advanced search option is available for

restriction of search terms, options, matching, and operators.

• Batch Retrieval: Retrieve multiple entries from iProClass or multiple families

from PIRSF database using a specific identifier, or a combination of different

identifiers.

• BLAST/FASTA Search: Retrieve entries similar to your query using the BLAST

or FASTA program. Two UniProt databases are available to perform the search:

(1) UniProtKB, which contains functional information on proteins, with accurate,

consistent, and rich annotation; or (2) UniRef100, which combines identical

sequences and sub-fragments, from any organism, into a single entry.

A line summary containing sequence similarity results will be displayed.

• Related Sequences: Save time and have a glance at your protein sequence

similarity neighbors by retrieving sequences based on pre-computed BLAST

results.

• Peptide Match: Find an exact match for a peptide sequence (3 to 30 amino acids

long). Two UniProt databases can be used to perform the search: (1) UniProtKB,

which contains functional information on proteins, with accurate, consistent, and

rich annotation; or (2) UniRef100, which combines identical sequences and sub-

fragments, from any organism, into a single entry.


• Pattern Search: (1) Find proteins matching a user-defined or a PROSITE pattern

in the UniProtKB database; or (2) Look for PROSITE patterns present in a query

sequence.

• Multiple Alignment: Enter multiple sequences in FASTA format and/or multiple

UniProtKB identifiers in the ID box to get the CLUSTALW alignment of the

sequences along with a neighbor-joining tree and a PIR interactive tree and

alignment viewer.

• Pairwise Alignment: Insert two sequences using the single letter amino acid code

or enter two UniProtKB identifiers. The results show the SSearch Smith-

Waterman full-length alignments between the two sequences.
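The pairwise alignment option rests on the Smith-Waterman recurrence, which can be sketched in a few lines of Python. This is a simplified illustration that returns only the best local alignment score: it uses a fixed match and mismatch score and a linear gap penalty (all three values are arbitrary assumptions), whereas a tool such as SSearch would normally use an amino acid substitution matrix and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)      # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which makes the alignment local.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("XXACDEXX", "ACDE"))
```

Only two rows of the matrix are kept in memory, which is enough for the score; recovering the alignment itself would require storing the full matrix and tracing back from the best cell.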

1.7.1.2 Secondary Databases

These databases are also known as pattern databases. They contain results from the
analysis of the sequences in primary databases. Some examples of secondary databases
include:

PROSITE, Pfam, BLOCKS, PRINTS.

PROSITE

PROSITE consists of documentation entries describing protein domains, families and

functional sites as well as associated patterns and profiles to identify them.

PROSITE is a method of determining the function of uncharacterized proteins
translated from genomic or cDNA sequences. It consists of a database of biologically
significant sites and patterns formulated in such a way that, with appropriate
computational tools, it can rapidly and reliably identify to which known family of
proteins (if any) a new sequence belongs [8].

The PROSITE tools are ScanProsite (for advanced scans) and PRATT, which allows
conserved patterns to be generated interactively from a series of unaligned proteins.

The PROSITE database can be browsed by documentation entry, ProRule description,

taxonomic scope and number of positive hits.
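PROSITE patterns use a compact syntax in which elements are separated by hyphens, `x` stands for any residue, square brackets list allowed residues, curly brackets list forbidden residues, and a number in parentheses gives a repeat count. The Python sketch below translates the common parts of this syntax into a regular expression so that a sequence can be scanned. It is a simplified illustration, not the ScanProsite implementation, and it does not cover every feature of the pattern language.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):            # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):              # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":                        # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"     # forbidden residues
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

# The N-glycosylation site pattern, a well-known PROSITE example.
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex, bool(re.search(regex, "AANGSAA")))
```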

Pfam

Pfam is a large collection of multiple sequence alignments and hidden Markov models

covering many common protein domains and families. For each family in Pfam it is

possible to look at multiple alignments, view protein domain architectures, examine

species distribution, follow links to other databases, and view known protein structures.

Pfam can be used to view the domain organization of proteins. A single protein can

belong to several Pfam families[9].

Pfam is a database in two parts. The first, Pfam-A, is the curated part and contains
over 8957 protein families. The second, Pfam-B, contains a large number of small
families taken from the PRODOM database that do not overlap with Pfam-A. Although of
lower quality, Pfam-B families can be useful when no Pfam-A families are found.

BLOCKS

The BLOCKS database contains multiple alignments of conserved regions in protein

families. Blocks are multiply aligned ungapped segments corresponding to the most

highly conserved regions of proteins. The blocks for the BLOCKS database are made

automatically by looking for the most highly conserved regions in groups of proteins

represented in the PROSITE database. These blocks are then calibrated against the

Swiss-Prot database to obtain a measure of the chance distribution of matches. It is these

calibrated blocks that make up the BLOCKS database[10].

BLOCKS is a service for biological sequence analysis at the Fred Hutchinson Cancer

Research Center in Seattle, Washington, USA.

The BLOCKS Database is based on InterPro entries with sequences from SWISS-PROT

and TrEMBL and with cross-references to PROSITE and/or PRINTS and/or SMART,

and/or PFAM and/or ProDom entries.

The BLOCKS database can be searched either by key word or by number. Block

Searcher is the tool to search a sequence vs BLOCKS.

PRINTS

The PRINTS database houses a collection of protein fingerprints, which may be used to

assign family and functional attributes to uncharacterized sequences, such as those

currently emanating from the various genome-sequencing projects.

Fingerprints are groups of conserved motifs that, taken together, provide diagnostic

protein family signatures. They derive much of their potency from the biological context

afforded by matching motif neighbors; this makes them at once more flexible and

powerful than single-motif approaches. The technique further departs from other pattern-

matching methods by readily allowing the creation of fingerprints at superfamily-,

family- and subfamily-specific levels, thereby allowing more fine-grained diagnoses[11].


The PRINTS database can be accessed by accession number, PRINTS code, database

code, text, sequence, title, number of motifs, author or query language. The database can

be searched using the tools FingerPRINTScan, FPScan, GRAPHScan, and MULScan.

PRINTS is a companion to the BLOCKS, PROSITE, Pfam and ProDom databases.

1.7.1.3 Composite Databases

These databases combine different sources of primary databases. They make querying

and searching efficient, without the need to go to each of the databases.

Examples of composite databases include:

NRDB- Non Redundant Database

OWL

NRDB (Non-Redundant Databases)

NRDB is a so-called non-redundant composite of the following sources: PDB sequences,

SWISS-PROT, SWISS-PROTupdate, PIR, GenPept and GenPeptupdate. The database is

thus similar in content to OWL, but contains more up-to-date information. However,

strictly speaking, it is not non-redundant, but non-identical - i.e., only identical sequence

copies are removed from the database. As a result, NRDB is larger and less efficient to

search than OWL. To be rigorous, it is sensible to search NRDB, but it is more practical

to search OWL.


NRDB is a program that can be used to compare a batch of sequences and find those that

are identical to each other. This is useful for sorting new alleles that are not yet in the

MLST databases and for developing new MLST schemes.

This is a generic program which can process any sequence data (DNA or amino acid) and

is by no means restricted to MLST data. The nrdb program was written by Warren Gish

at Washington University [12].
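The idea of finding identical sequences in a batch can be sketched in Python by grouping entries on the sequence string itself. This is only an illustration of the principle, not the nrdb program; the allele names and sequences are hypothetical, and sequences are compared case-insensitively as an assumption.

```python
def group_identical(sequences):
    """Group sequence names by identical sequence.

    sequences maps a name to a sequence string (DNA or amino acid).
    Returns {sequence: [names]}; a list with more than one name marks
    entries that are exact copies of each other.
    """
    groups = {}
    for name, seq in sequences.items():
        groups.setdefault(seq.upper(), []).append(name)
    return groups

# Hypothetical batch of alleles, for illustration only.
batch = {
    "allele_1": "ATGCCGTA",
    "allele_2": "ATGCCGTT",
    "allele_3": "atgccgta",   # same as allele_1 apart from letter case
}
print(group_identical(batch))
```

Because only exact matches are merged, two sequences that differ by a single residue remain separate entries, which is why the resulting set is non-identical rather than strictly non-redundant.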

OWL

OWL is a non-redundant composite of 4 publicly-available primary sources: SWISS-

PROT, PIR (1-3), Genbank (translation) and NRL-3D. SWISS-PROT is the highest

priority source, all others being compared against it to eliminate identical and trivially-

different sequences. The strict redundancy criteria render OWL relatively "small" and

hence efficient in similarity searches.

The sources are assigned a priority with regard to their level of annotation and sequence

validation - SWISS-PROT has the highest priority, so all the others are compared against

it during the redundancy checking procedure (this process eliminates both identical

copies of sequences and those containing single amino acid differences). The contribution

to OWL made by each of the primary sources is shown in the following pie chart. Its non-

redundant status renders OWL highly "compact" and therefore efficient for use in

sequence comparisons [13].
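The priority-based redundancy check described above can be sketched in Python. The sketch is a simplified illustration, not the actual OWL build procedure: sources are processed in decreasing priority, and an entry is dropped if it is identical to, or differs by a single substitution from, a sequence already kept. Insertions and deletions are not considered, the pairwise comparison is quadratic, and the identifiers and sequences are hypothetical.

```python
def differs_by_at_most_one(a, b):
    """True if a and b have equal length and at most one mismatching residue."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1

def build_composite(sources):
    """Merge sources given as (source_name, {id: sequence}) in priority order.

    Returns the kept entries as (source_name, id, sequence) tuples.
    """
    kept = []
    for source, entries in sources:
        for ident, seq in entries.items():
            if not any(differs_by_at_most_one(seq, other) for _, _, other in kept):
                kept.append((source, ident, seq))
    return kept

# Hypothetical entries; SWISS-PROT is listed first as the highest priority.
sources = [
    ("SWISS-PROT", {"sp1": "MKTAY"}),
    ("PIR", {"pir1": "MKTAY", "pir2": "MKTAF", "pir3": "MLLAF"}),
]
print(build_composite(sources))
```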


1.7.2 Importance of Biological Databases

The importance of biological databases is to encourage new approaches to the

management, analysis and dissemination of biological knowledge that will enable both

the scientific community and the broader public to gain maximum benefit and utility.

Future advances in biological sciences will depend both upon the creation of new

knowledge and upon effective management of proliferating information. The biological

sciences have become increasingly data rich. Much of the biology of tomorrow will arise

through discovery based on information contained in community-accessible databases.

Much, if not all, of our accumulated knowledge of biology will be accessible in electronic

form. Future progress in biological research will be highly dependent on the ability of the

scientific community to both deposit and utilize stored information on-line. Thus, the

information management challenge for the future will be to develop new ways to acquire,

store and retrieve not only biological data per se, but also those data in the context of

biological knowledge.

1.7.2.1 Interpreting sequence data

The amount of data available from sequences is growing rapidly and outstripping the

ability of biologists to understand and assimilate it all. There is a pressing need for

improved systems that will allow the prediction of the function of a protein molecule

given the DNA sequence that codes for it and to link this sequence information to other

biological data. There is also a need to identify anomalous events, such as lateral transfer,

polymorphisms or mutations.


A range of IT and computer science tools are needed to tackle these problems. Some will

involve the integration of data from different sources, the development of structures of

data to enable more efficient querying, the development of better ways of interrogating

databases and presenting the information, etc.

Other approaches will involve the development of better ways of modeling data,

including whole systems. For example, modeling the sequence of assembly in a metabolic

or regulatory system could assist in the design of appropriate experiments to assess its

contribution to the phenotype. These approaches will need to be holistic in nature, taking

in as much of the whole picture as possible. This in turn will require the development and

application of new systems modeling methods.

1.7.2.2 Comparative genome analysis

Most bioinformatics tools were developed prior to the widespread availability of

complete genome sequences, and tend to support storage and presentation of information

at the gene level. There is thus a need for novel techniques and tools that address genome

level bioinformatics, with analyses that account for the location of genes or regulatory

elements within the genome and that encompass new data sets, such as those derived from

research on microarray and DNA chip technology, proteomics, analysis of 2-D gel

information, single nucleotide polymorphisms and transcriptomes.

In addition, researchers need to be able to compare and analyze the organization and

evolution of genomes within a single genome and between genomes of different


organisms. In particular, comparative genomics could uncover relationships between

model organisms, crops and domestic animals, and facilitate the exploitation of

conservation of synteny. Tools and techniques need to be developed to conduct these

analyses, including the development of novel algorithms, efficient storage techniques and

graphical displays to visualize the results of the comparisons.

1.7.2.3 Understanding, integrating and modeling cellular processes

Rapid advances are being made in understanding how cells differentiate, function and

interact. Descriptions of the molecular mechanisms of membrane activity, signal

transduction, metabolism, gene expression and other nuclear processes are fundamental

to our appreciation of cellular and intercellular function. They also provide the means by

which new molecules introduced to cells can enter, function and affect these cellular

processes. The picture of whole cell activity that is emerging from genomics, structural

biology and related areas of cytology and biochemistry will require, however, the use of

biological databases to organize, integrate and elucidate these complex data in

meaningful and realistic ways.

1.7.2.4 Methods to create biological databases from uncomputerised or unstructured

data

Much biological information is still published and stored in non-electronic forms (e.g.

books, journals). Biological nomenclature is of central importance to the recovery and

establishment of links between data from different sources and of different types.

Schemes to handle nomenclature instability are particularly challenging. In addition,


evolutionary studies based on sequence information must consider the whole organism,

yet the ways of linking sequence to organism reliably have not been developed.

There is a need to develop tools to capture data in their entirety and to extract data from

non-electronic media so as to create organized electronic forms of the data that can be

accessed and searched in the same way as other databases. This means not only

extracting information from free text sources, but also capturing data that are presented in

the form of tables, etc. Input from various aspects of computer science is expected

including, for example, the natural language processing community and those involved in

diagrammatic reasoning.

Research proposed under this heading must have strong links to the biological data

needed by the community. The feasibility of some of these techniques may require the

building of exemplar databases, which should, however, be of wider benefit to the

community. Creation of new databases by entering data into existing database structures

is not supported.

1.7.2.5 Integrity and maintenance of biological databases

There is an opportunity for novel computer science and IT approaches to be applied to

the maintenance of biological databases: in the processing of data, the addition of new

data, and the automation of quality control (validation, global integrity checking,

error checking, traceability of results, etc.). An important aspect of this process is the

systematic checking for anomalies and identifying whether these are errors or interesting

biological phenomena. Some of the technologies that might be applied in this area

include those for semi-structured data management.

Any proposed system will need to be scalable and applicable to the complex data that

exist in biology [14].
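As a minimal sketch of the automated quality control described in this section, the following Python fragment validates a small set of sequence records and separates hard errors from anomalies that merit manual review. The record fields (`accession`, `sequence`), the nucleotide alphabet, the accession values and the length-based anomaly rule are all assumptions chosen for illustration; they do not describe any particular database.

```python
from statistics import median

VALID_BASES = set("ACGTN")  # assumed nucleotide alphabet for this sketch


def check_records(records, length_factor=3.0):
    """Return (errors, anomalies) for a list of {'accession', 'sequence'} dicts."""
    errors, anomalies = [], []
    seen = set()
    for rec in records:
        acc, seq = rec["accession"], rec["sequence"].upper()
        if acc in seen:
            errors.append((acc, "duplicate accession"))
        seen.add(acc)
        if not seq:
            errors.append((acc, "empty sequence"))
        elif set(seq) - VALID_BASES:
            errors.append((acc, "invalid characters"))
    # Anomalies are not necessarily wrong: an unusually long or short entry
    # may be an error or an interesting biological case, so it is only flagged.
    lengths = [len(r["sequence"]) for r in records if r["sequence"]]
    if lengths:
        typical = median(lengths)
        for rec in records:
            n = len(rec["sequence"])
            if n and (n > typical * length_factor or n < typical / length_factor):
                anomalies.append((rec["accession"], "unusual length"))
    return errors, anomalies


# Hypothetical records; the accessions are placeholders, not real identifiers.
records = [
    {"accession": "X0001", "sequence": "ACGTACGTAC"},
    {"accession": "X0002", "sequence": "ACGTTCGTAA"},
    {"accession": "X0002", "sequence": "ACGTXCGTAA"},
    {"accession": "X0003", "sequence": "ACGTACGTACGT" * 5},
]
errors, anomalies = check_records(records)
```

Keeping errors and anomalies in separate lists mirrors the distinction drawn in the text between genuine mistakes and unusual but possibly valid biological data.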

1.7.3 Applications of Biological Databases

• The sequencing of the human genome and the emerging intense interest in

proteomics and molecular structure have caused an enormous explosion in the

need for biological databases. These include genome and sequence databases such

as GenBank and Ensembl and protein databases such as PDB and SWISS-PROT,

together with tools for accessing, analyzing and manipulating sequence data,

such as BLAST, multiple alignment programs, Perl, and gene finding tools.

• Databases are used in many applications, spanning virtually the entire range of

computer software.

• Databases are the preferred method of storage for large multiuser applications,

where coordination between many users is needed. Even individual users find

them convenient, though, and many electronic mail programs and personal

organizers are based on standard database technology. Software database drivers

are available for most database platforms so that application software can use a

common application programming interface (API) to retrieve the information

stored in a database.

• Researchers are widening their scope of research. The explosion of available

sequence data from many organisms has enabled researchers to more readily

compare sequences of interest from many different species in combination with a

number of model organism databases. In addition to databases focused on a

single species, databases that deal with taxonomically related species have

emerged recently.

• With advances in, and in-depth application of, computer technologies in biology,

database modeling for biological data management is emerging as a new

discipline.

• By means of database technology, large volumes of biological data with complex

structures can be modeled in conceptual data models and further stored in

databases. Then biologists can use the biological databases to handle and retrieve

these data and further support a team of biologists to analyze and mine their data

throughout a biological discovery process.

1.7.4 Microorganism Databases

Due to the enormous explosion in the amount of sequence data available related to

microorganisms, a number of microorganism databases have been established. Five such

microbial databases are mentioned below:

1.7.4.1 Antimicrobial wild type distributions of Microorganisms

The EUCAST (European Committee on Antimicrobial Susceptibility Testing) under the

auspices of the ESCMID (European Society for Clinical Microbiology and Infectious

Diseases) offers this free website of distributions of MIC-values of wild type bacteria and

fungi. Each MIC-distribution is defined by the micro-organism and the antimicrobial

drug. It is the compound result of a number of separate distributions submitted to

EUCAST from organizations such as national breakpoint committees, pharmaceutical

industry, antimicrobial resistance surveillance programs and research projects. The

database is released for public use, drug by drug, by the EUCAST steering committee

and thereby also by the national breakpoint committees. The distributions are used by the

committee for defining epidemiological cut-off values for early detection and

surveillance of resistance development, and for the harmonization of European clinical

breakpoints.

Each graph contains information on the number of sources of data, the total number of

organisms, and when defined by EUCAST, clinical breakpoints and/or the

epidemiological cut-off value. The epidemiological cut-off value is related to the MIC

distribution of the wild type organism.
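To illustrate how an epidemiological cut-off value (ECOFF) is applied, the sketch below labels isolates as wild type when their MIC does not exceed the cut-off, following the EUCAST convention that wild type corresponds to MIC ≤ ECOFF. The MIC values and the cut-off of 0.064 mg/L are invented for the example and are not taken from the EUCAST database.

```python
from collections import Counter


def classify_by_ecoff(mics_mg_per_l, ecoff_mg_per_l):
    """Label each MIC as 'wild type' (MIC <= ECOFF) or 'non-wild type'."""
    return [
        "wild type" if mic <= ecoff_mg_per_l else "non-wild type"
        for mic in mics_mg_per_l
    ]


# Hypothetical MICs (mg/L) on a two-fold dilution scale.
mics = [0.008, 0.016, 0.016, 0.032, 0.064, 0.5, 4.0]
labels = classify_by_ecoff(mics, ecoff_mg_per_l=0.064)
summary = Counter(labels)  # counts of isolates in each category
```

Counting the isolates that fall above the cut-off over time is one simple way such a value can support surveillance of resistance development.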

The distributions of MIC values of wild-type Escherichia coli are reported for the

following 35 compounds:

Amikacin, Aztreonam, Cefazolin, Cefepime, Cefotaxime, Cefoxitin, Cefpodoxime,

Ceftazidime, Ceftibuten, Ceftiofur, Ceftriaxone, Cefuroxime, Chloramphenicol,

Ciprofloxacin, Colistin, Enrofloxacin, Ertapenem, Florfenicol, Flumequine, Gentamicin,

Imipenem, Kanamycin, Levofloxacin, Meropenem, Moxifloxacin, Nalidixic acid,

Neomycin, Netilmicin, Nitrofurantoin, Norfloxacin, Ofloxacin, Streptomycin,

Tigecycline, Tobramycin, and Trimethoprim.[15]

1.7.4.2 RIDOM (Ribosomal Differentiation of Medical Micro-organisms Database)

This database differentiates medical microorganisms based on partial small subunit

ribosomal DNA (16S rDNA) sequence.

This web server is an evolving electronic resource designed to provide micro-organism

differentiation services for medical identification needs. The diagnostic procedure begins

with a specimen partial small subunit ribosomal DNA (16S rDNA) sequence. Resulting

from a similarity search, a species or genus name for the specimen in question will be

returned. Where the first results are ambiguous or do not define to species level, hints for

further molecular, i.e. internal transcribed spacer, and conventional phenotypic

differentiation will be offered (‘sequential and polyphasic approach’). Additionally, each

entry in RIDOM contains detailed medical and taxonomic information linked, context-

sensitive, to external World Wide Web services. Nearly all sequences are newly

determined and the sequence chromatograms are available for intersubjective quality

control[16].
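The identification workflow described for RIDOM (similarity search, then a species name, or only a genus name when the result is ambiguous) can be sketched in Python as follows. This is a toy illustration, not RIDOM's actual algorithm: the sequences are assumed to be pre-aligned and of equal length, the organism names and 20-base fragments are invented, and both cut-off values are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of matching positions in two pre-aligned, equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def identify(query, references, species_cutoff=99.0, genus_cutoff=90.0):
    """Return a species name, a genus name, or None.

    references maps 'Genus species' names to reference sequences (non-empty).
    Both cutoffs are illustrative assumptions, not values used by RIDOM.
    """
    scored = [(percent_identity(query, seq), name) for name, seq in references.items()]
    best = max(score for score, _ in scored)
    top = [name for score, name in scored if score == best]
    if best >= species_cutoff and len(top) == 1:
        return top[0]                      # unambiguous species-level hit
    genera = {name.split()[0] for name in top}
    if best >= genus_cutoff and len(genera) == 1:
        return genera.pop()                # ambiguous: report genus only
    return None                            # no usable match


# Hypothetical 20-base fragments; real 16S rDNA fragments are far longer.
references = {
    "Exampleus primus": "ACGTACGTACGTACGTACGT",
    "Exampleus secundus": "ACGTACGTACGTACGTACGA",
    "Otherus tertius": "TTGTACGTACCTACGTACGA",
}
species_hit = identify("ACGTACGTACGTACGTACGT", references)
genus_hit = identify("ACGTACGTACGTACGTACGC", references)
no_hit = identify("TTTTTTTTTTTTTTTTTTTT", references)
```

A genus-only or empty result corresponds to the situation in which further molecular or phenotypic differentiation would be suggested.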

1.7.4.3 WDCM (World Data Centre for Microorganisms)

WFCC-MIRCEN World Data Centre for Microorganisms (WDCM) provides a

comprehensive directory of culture collections, databases on microbes and cell lines, and

the gateway to biodiversity, molecular biology and genome projects [17].

1.7.4.4 HBMMD (Harbor Branch Marine Microbe Database)

The Division of Biomedical Marine Research (DBMR) at Harbor Branch Oceanographic

Institution (HBOI) has one of the most comprehensive collections of deep-water sponges

in the world, having led numerous expeditions and submersible collections at more than

400 sites over nearly two decades. Through accumulation of new species and site records,

long-term surveys are effectively being conducted and biodiversity inventories and

catalogs (in the form of cruise reports) are a significant by-product of its primary

mission of drug discovery.

The HBMM Culture Collection consists of over 16,000 total marine bacteria and fungi

(9000 derived specifically from marine invertebrates). This collection is maintained as a

source of microbes for DBMR’s Fermentation Program which systematically cultures the

isolates for novel bioactive product discovery. However, except for an initial Gram-stain

and a description of basic cellular and colonial morphology, few of the strains in this

collection have been taxonomically classified.[18]

1.7.4.5 EchoBASE

This is a relational database designed to contain and manipulate information from post-

genomic experiments using the model bacterium Escherichia coli K-12. Its aim is to

collate information from a wide range of sources to provide clues to the functions of the

approximately 1500 gene products that have no confirmed cellular function. The database

is built on an enhanced annotation of the updated genome sequence of strain MG1655

and the association of experimental data with the E. coli genes and their products.

Experiments that can be held within EchoBASE include proteomics studies, microarray

data, protein–protein interaction data, structural data and bioinformatics studies.

EchoBASE also contains annotated information on ‘orphan’ enzyme activities from this

microbe to aid characterization of the proteins that catalyze these elusive biochemical

reactions [19].

Databases are used in many applications, spanning virtually the entire range of

computer software. Databases are the preferred method of storage for large multiuser

applications, where coordination between many users is needed. Even individual users

find them convenient, and many electronic mail programs and personal organizers are

based on standard database technology. Software database drivers are available for most

database platforms so that application software can use a common Application

Programming Interface to retrieve the information stored in a database. Two commonly

used database APIs are JDBC and ODBC.
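The idea of a common programming interface can be sketched in Python, whose DB-API 2.0 (PEP 249) plays a role comparable to JDBC and ODBC. The example uses the standard-library `sqlite3` driver with an in-memory database; the `protein` table, its columns, the function name `find_proteins` and the single demonstration row are hypothetical and serve only to show the connect, execute and fetch pattern.

```python
import sqlite3


def find_proteins(conn, disease):
    """Return (accession, name) rows for a disease using a DB-API cursor."""
    cur = conn.cursor()
    # '?' is sqlite3's placeholder style; other DB-API drivers may use '%s'.
    cur.execute(
        "SELECT accession, name FROM protein WHERE disease = ? ORDER BY accession",
        (disease,),
    )
    rows = cur.fetchall()
    cur.close()
    return rows


# Build a throwaway in-memory database with one illustrative record.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein (accession TEXT PRIMARY KEY, name TEXT, disease TEXT)"
)
conn.execute(
    "INSERT INTO protein VALUES (?, ?, ?)",
    ("DEMO0001", "aldose reductase", "diabetes"),
)
conn.commit()
rows = find_proteins(conn, "diabetes")
conn.close()
```

Because the query function relies only on cursor methods defined by the common interface, switching to another database would mainly involve changing the connection call and, where necessary, the placeholder style.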

Importance of databases

1. Compactness

There is no need for bulky paper files.

2. Speed

The computer can retrieve and update stored data far faster than a human can

by hand.

3. Less drudgery

The computer takes over much of the tedious work of maintaining files.

4. Currency

Accurate, up-to-date information is available whenever it is requested [20].

Databases are easy to set up, easy to manipulate and easy to use. A database allows you

to maintain order in what could be a very chaotic environment [21].

Biological databases are libraries of life sciences information, collected from scientific

experiments, published literature, high throughput experiment technology, and

computational analyses. They contain information from research areas including

genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics [22].

Information contained in biological databases includes gene function, structure,

localization (both cellular and chromosomal), clinical effects of mutations as well as

similarities of biological sequences and structures.

1.8 Diabetes

Diabetes mellitus, often referred to simply as diabetes, is a disease in which the body

does not produce enough, or properly respond to, insulin, a hormone produced in the

pancreas. Insulin is needed to turn sugar and other food into energy. In diabetes, the body

either doesn't make enough insulin or can't use its own insulin as well as it should, or

both. This causes sugar to accumulate in the blood, often leading to various

complications [23, 24].

Types of Diabetes:

Many types of diabetes are recognized. The principal three are:

1. Type 1

2. Type 2

3. Gestational diabetes.

Type 1

Type 1 diabetes mellitus is characterized by loss of the insulin-producing beta cells

of the islets of Langerhans in the pancreas leading to a deficiency of insulin. This type of

diabetes can be further classified as immune-mediated or idiopathic. The majority of

type 1 diabetes is of the immune-mediated variety, where beta cell loss is a T-cell

mediated autoimmune attack [23]. The principal treatment of type 1 diabetes, even in its

earliest stages, is the delivery of artificial insulin via injection combined with careful

monitoring of blood glucose levels using blood testing monitors. Without insulin, diabetic

ketoacidosis often develops which may result in coma or death.

Type 2

Type 2 diabetes mellitus is the most common and is due to insulin resistance or

reduced insulin sensitivity, combined with relatively reduced insulin secretion. There are

numerous theories as to the exact cause and mechanism in type 2 diabetes. Central

obesity is known to predispose individuals to insulin resistance. Abdominal fat is

especially active hormonally, secreting a group of hormones called adipokines that may

impair glucose tolerance. Obesity is found in approximately 55% of patients

diagnosed with type 2 diabetes[25]. Environmental exposures may contribute to recent

increases in the rate of type 2 diabetes. A positive correlation has been found between the

concentration in the urine of bisphenol A, a constituent of polycarbonate plastic from

some producers, and the incidence of type 2 diabetes[26]. Type 2 diabetes is usually first

treated by increasing physical activity, decreasing carbohydrate intake, and losing weight.

These can restore insulin sensitivity even when the weight loss is modest, for example

around 5 kg, most especially when it is in abdominal fat deposits. It is sometimes

possible to achieve long-term, satisfactory glucose control with these measures alone.

However, the underlying tendency to insulin resistance is not lost, and so attention to

diet, exercise, and weight loss must continue. The usual next step, if necessary, is

treatment with oral antidiabetic drugs. Insulin production is initially only moderately

impaired in type 2 diabetes, so oral medication (often used in various combinations) can

be used to improve insulin production, to regulate inappropriate release of glucose by the

liver, and to attenuate insulin resistance to varying degrees [27].

Gestational diabetes mellitus

Pregnant women who have never had diabetes before but who have high blood sugar

(glucose) levels during pregnancy are said to have gestational diabetes mellitus. It is a

risk factor for type 2 diabetes in the mother [28]. It occurs in about 2%–5% of all

pregnancies and may improve or disappear after delivery. Gestational diabetes is fully

treatable but requires careful medical supervision throughout the pregnancy. About 20%–

50% of affected women develop type 2 diabetes later in life [29].

Long-term secondary complications of Diabetes mellitus

Long-term secondary complications are the main cause of morbidity and mortality in diabetic

patients [30]. The major microvascular complications of diabetes include nephropathy,

neuropathy and retinopathy, whereas cataract is an avascular complication [31].

Several metabolic factors contribute to the dysfunction observed in diabetic

vasculopathy [32], including increased glucose flux through the polyol pathway,

increased production of reactive oxygen species by the mitochondrial respiratory chain,

nonenzymatic glycations, protein kinase-C activation and increased flux through the

hexosamine pathway[30]. The polyol pathway has received considerable attention.

Cataract

Diabetic cataract formation follows an increase in sugars in the lens. The excess sugar

within the lens is reduced by aldose reductase to its alcohol, but the lens capsule is

relatively impermeable to sugar alcohols. Because of the excess sugar alcohol (polyol),

the lens imbibes water, causing osmotic imbalance. Eventually, increased sodium and

decreased potassium levels and decreased glutathione levels lead to cataract formation [33].

Figure 1.4 Image showing a normal clear lens and a lens clouded by cataract.

Figure 1.5 Image showing vision through an eye with cataract.

Retinopathy

Diabetic retinopathy is the result of microvascular retinal changes. Hyperglycemia-

induced pericyte death and thickening of the basement membrane lead to incompetence

of the vascular walls. This damage alters the blood-retinal barrier

and makes the retinal blood vessels more permeable [34].

Small blood vessels – such as those in the eye – are especially vulnerable to poor

blood sugar (blood glucose) control. An overaccumulation of glucose and/or fructose

damages the tiny blood vessels in the retina.

Figure 1.6 Image showing a normal retina and a retina with diabetic retinopathy.

Figure 1.7 Image showing vision with diabetic retinopathy.

Nephropathy

Diabetic nephropathy is damage to the kidneys caused by diabetes. The kidneys have

many tiny blood vessels that filter waste from the blood. High blood sugar from

diabetes can destroy these blood vessels. Over time, the kidneys become less able to do

their job, and they may eventually stop working completely.

Figure 1.8 Image showing abnormal protein leakage in the glomerulus of the kidney due to nephropathy.

Neuropathy

Diabetic neuropathies are neuropathic disorders that are associated with diabetes mellitus.

These conditions are thought to result from diabetic microvascular injury involving small

blood vessels that supply nerves (vasa nervorum).

The Polyol Pathway

The polyol pathway of glucose metabolism becomes active when intracellular glucose

levels are elevated [35,36]. Aldose reductase (AR), the first and rate-limiting enzyme in

the pathway, reduces glucose to sorbitol using NADPH as a cofactor; sorbitol is then

metabolized to fructose by sorbitol dehydrogenase, which uses NAD+ as a cofactor [35].
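Written as reaction equations, the two steps of the pathway are as follows. The protons are added here only to balance the equations and are not spelled out in the text.

```latex
\begin{align*}
\text{glucose} + \text{NADPH} + \text{H}^{+}
  &\xrightarrow{\ \text{aldose reductase}\ } \text{sorbitol} + \text{NADP}^{+} \\
\text{sorbitol} + \text{NAD}^{+}
  &\xrightarrow{\ \text{sorbitol dehydrogenase}\ } \text{fructose} + \text{NADH} + \text{H}^{+}
\end{align*}
```

The first step consumes NADPH and the second generates NADH, which is why flux through the pathway affects both cofactor pools.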

Figure 1.9 Polyol (sorbitol) pathway. Glucose-6-P, glucose 6-phosphate.

Consequences of Polyol pathway

The polyol pathway has several effects. Sorbitol is a polyhydroxylated, strongly

hydrophilic alcohol and therefore does not diffuse readily through cell membranes

and accumulates intracellularly with possible osmotic consequences [35]. The fructose

produced by the polyol pathway can become phosphorylated to fructose-3-phosphate [37,

38] which is broken down to 3-deoxyglucosone; both compounds are powerful

glycosylating agents that take part in the formation of advanced glycation end products

(AGEs) [37]. The usage of NADPH by AR may result in less cofactor available for

glutathione reductase, which is critical for the maintenance of the intracellular pool of

reduced glutathione (GSH). This would lessen the capability of cells to respond to

oxidative stress [39]. Compensatory increased activity of the glucose monophosphate

shunt, the principal supplier of cellular NADPH, may occur [39]. The usage of NAD+ by

sorbitol dehydrogenase leads to an increased ratio of NADH/NAD+, which has been

termed “pseudohypoxia” and linked to a multitude of metabolic and signaling changes

known to alter cell function [40]. It has been proposed that the excess NADH may

become a substrate for NADH oxidase, and this would be a mechanism for generation of

intracellular oxidant species [41]. Thus, activation of the polyol pathway, by altering

intracellular tonicity, generating AGE precursors, and exposing cells to oxidative stress

perhaps through decreased antioxidant defenses and generation of oxidant species, can

initiate and multiply several mechanisms of cellular damage.

Diabetes causes increased oxidative stress in various tissues as evidenced by increased

levels of oxidized DNA, proteins, and lipids. Besides damaging the functions of these

molecules, oxidative stress also triggers a series of cellular responses, including the

activation of protein kinase C (PKC) [42,43], transcription factor NF-κB [44], and JNK

stress-associated kinases [45], and so forth. Inappropriate activation of these important

regulatory molecules would have deleterious effects on cellular functions, and it is

thought to contribute to the pathogenesis of various diabetic complications [46].

However, it is not clear how hyperglycemia leads to increased oxidative stress. It is most

likely the combined effects of increased levels of reactive oxygen species (ROS) and

decreased capacity of the cellular antioxidant defense system. Glucose auto-oxidation

[47], nonenzymatic glycation [48], and the interaction between glycated products and

their receptors [49], overproduction of ROS by mitochondria [50], and the polyol

pathway [51,52] all are potential sources of hyperglycemia-induced oxidative stress.

Figure 1.10 A schematic diagram of possible interactions among factors involved in the pathogenesis

of diabetic complications.

(DAG), diacylglycerol; (β2 PKC), β2 isoform of protein kinase C; (NO), nitric oxide;

(GSH), glutathione; (SOD), superoxide dismutase.

1.9 Aldose Reductase

Aldose reductase (EC 1.1.1.21) is a small monomeric protein composed of 315 amino

acid residues. The primary structure, first determined on rat lens aldose reductase [53,54],

demonstrated high similarities to another NADPH-dependent oxidoreductase, human

liver aldehyde reductase (EC 1.1.1.2) [55] and to ρ-crystallin, a major structural

component of the lens of frog Rana pipiens [56]. The degree of similarity clearly

suggests that these proteins belong to the same family, namely the aldo-keto reductase

superfamily, with related structures and evolutionary origins.

Tertiary Structure of Aldose Reductase

Crystallographic structures have been determined for pig [57] and human aldose

reductases [58, 59]. The enzyme molecule contains a (β/α)8 barrel structural motif with a

large hydrophobic active site. The cofactor NADPH binds in an extended conformation

to the bottom of the active site, located at the center of the barrel. The holoenzyme

structure complexed with the enzyme inhibitor zopolrestat further demonstrated that the

inhibitor binds to the active site on top of the nicotinamide ring of the NADPH [60].

When zopolrestat was complexed with the holoenzyme, however, it perturbed the

position of two loops in the protein and changed the shape of the active site pocket. When

the enzyme was complexed with another inhibitor sorbinil, the inhibitor simply occupied

the active site pocket and did not induce further conformational change in the enzyme

molecule [61]. These findings suggest that many compounds with diverse chemical

structures can interact with the enzyme in different conformations. This illustrates the

dangers of using theoretical approaches to predict the rigid inhibitor binding site of

aldose reductase, as the enzyme apparently retains considerable flexibility in its tertiary

structure [62].

Physiological Significance of Aldose Reductase

Osmoregulatory Role in the Kidney

In the previous decade, elevated extracellular NaCl was demonstrated to elicit marked

increase in aldose reductase expression and accumulation of intracellular sorbitol in the

cultured cell line from rabbit renal papilla [63]. In the kidney, aldose reductase mRNA

was abundantly expressed in the medulla compared with relatively low expression in the

cortex [64]. These findings were confirmed by biochemical and immunohistochemical

analyses of rat and human kidneys [65]. Sorbitol is one of the organic osmolytes that

balance the osmotic pressure of extracellular NaCl, fluctuating in accord with urine

osmolality [66]. These findings, therefore, suggest an osmoregulatory role for aldose

reductase in renal homeostasis.

Unique Tissue Distribution Pattern of Aldose Reductase

Recent investigations disclosed the unexpected distribution pattern of aldose

reductase not only in different species but in tissues other than “target” organs of diabetic

complications. In mouse, aldose reductase mRNA was most abundantly expressed in the

testis, whereas a very low level of the transcript was detected in the sciatic nerve and lens

[67].

Diverse Substrates for Aldose Reductase

Other lines of investigation have demonstrated that aldose reductase exhibits broad

substrate specificity for both hydrophilic and hydrophobic aldehydes. Aldose reductase

and the structurally related enzyme in the aldo-keto reductase family, aldehyde reductase,

both catalyze the reduction of biogenic aldehydes derived from the catabolism of the

catecholamines and serotonin by the action of monoamine oxidase [68,69,70]. These two

enzymes also catalyze the reduction of isocorticosteroids, intermediates in the catabolism

of the corticosteroid hormones [71]. Recently, aldose reductase in the adrenal gland was

reported to be a major reductase for isocaproaldehyde, a product of sidechain cleavage of

cholesterol [72].

Variable Levels of Aldose Reductase in Diabetic Patients

Substantial variations in the levels of aldose reductase expression in various tissues exist

among individuals with or without diabetes. Marked variability in aldose reductase

activity was reported for enzyme preparations isolated from human placentas [73].

Aldose reductase purified from erythrocytes exhibited a nearly three-fold variation in

activity among diabetic patients [74]. Such differences in the activity of aldose reductase

may influence the susceptibility of patients to glucose toxicity via acceleration of polyol

pathway when these individuals are maintained under equivalent glycemic control.

Table 1.1 Various proteins implicated in causing diabetes

Diabetes type       Responsible protein

Type 1 Diabetes     1) Methylglyoxal (MGO)-induced hydroimidazolones
                    2) Aminoguanidine
                    3) Glutamate decarboxylase 2
                    4) Integrin, alpha M (complement component 3 receptor 3 subunit)
                    5) Tyrosine phosphatase
                    6) Reg family (hepatocellular carcinoma intestine pancreas [HIP]/pancreatitis-associated protein [PAP])
                    7) Poly (ADP-ribose) polymerase-1 (PARP-1)

Type 2 Diabetes     1) Retinol binding protein-4 (RBP4), a binding protein for retinol (vitamin A)
                    2) New C1q/TNF-related protein (CTRP-3)
                    3) Streptozotocin
                    4) AMP-activated protein kinase (AMPK)
                    5) Protein kinase C
                    6) Peroxisome proliferator–activated receptor (PPAR)
                    7) Glutamine:fructose-6-phosphate amidotransferase (GFAT)
                    8) Serum alanine aminotransferase (ALT)
                    9) Insulin-like growth factor binding protein-3 (IGFBP-3)
                    10) Glycogen synthase kinase-3 (GSK-3)
                    11) Aldose reductase