BIOINFORMATICS:
An Introduction
What is ‘Bioinformatics’?
• The term was first coined in 1988 by Dr. Hwa Lim
• The original definition was:
“a collective term for data compilation, organisation, analysis and dissemination”
Simple
Concept
• The application of computer science and
engineering techniques to biological
analysis.
• The creation of repeatable, reusable,
intelligent software
That means….
• Using information technology to help solve
biological problems by designing novel
and incisive algorithms and methods of
analyses
And…
• It also serves to establish innovative
software and create new/maintain existing
databases of information, allowing open
access to the records held within them
It’s a huge subject
• ‘Bioinformatics’ is the new ‘buzz word’ in the
scientific community
• It is an umbrella term spanning genomics, proteomics, evolution and computer science
• It is now necessary for scientists to be inter-disciplinary
Why?
• The data is collected from a variety of sources
• The terminology is specific to its particular
branch of science
• To make the data easily and universally
interpretable by scientists
What data?
• Biologists have been classifying data on
species of plants and animals since the
17th century
• The knowledge acquired has escalated in
harmony with the evolution of technology
A brief history
of progress….
• Genetics began when Mendel demonstrated his
laws of heredity with varieties of peas and
flowers in 1865
• The refinement of the compound microscope in
the 19th century (the instrument itself dates from around 1600)
• The first protein to be sequenced was insulin
• The first complete sequencing of an enzyme,
ribonuclease, followed in 1960
• The first complete genome of a free-living
organism (Haemophilus influenzae) was published
in 1995
• We have since moved on to technologies
permitting the sequencing, recombination and
cloning of DNA
The Human Genome Project
• In 1990 the Human Genome Project (HGP)
was launched by the United
States Department of Energy and the
National Institutes of Health
• Goals:
• To identify all chemical base pairs and all genes that make up the 23 chromosome pairs found in human DNA
• To develop the next generation of methods for simulating cellular behaviour and pathways
• Ultimately to devise means to apply IT to
the modelling of cellular functions as
specified by the enormous datasets
The ‘omic’ revolution
• Bioinformatics has been split into various
subjects:
• Genomics – the sequencing and annotation of genomes
• Proteomics – the description of the complete set of proteins a particular genome codes for
Sequence
Databases
• Since it became possible to elucidate
protein & nucleic acid sequences, they
have been determined at an ever
increasing rate.
• These sequences were originally printed in research
journals
• Their enormous numbers and lengths
(particularly for genome sequences) make
it no longer practical to do so
• It is far more useful to have sequences in
computer-accessible form
• As an example of a sequence database, let
us describe the annotated protein
sequence database named SWISS-PROT
• A sequence record in SWISS-PROT begins
with the protein’s ID code, which has the form X_Y
• X is a mnemonic of up to four characters
indicating the protein name
e.g., CYC for cytochrome c
e.g., HBA for hemoglobin α chain
• Y is an identification code of up to five characters
indicating the protein’s biological source; it usually consists of the first three letters of the genus and the first two letters of the species
e.g., CANFA for Canis familiaris (dog)
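The X_Y naming scheme described above can be sketched in a few lines of Python. The function names `parse_entry_id` and `source_code` are hypothetical, the length limits are the ones quoted on the slides (the live database may allow other lengths), and the ID `CYC_CANFA` is simply assembled from the two slide examples for illustration.

```python
def parse_entry_id(entry_id: str) -> dict:
    """Split an ID of the form X_Y into its protein and source parts.

    Length limits follow the slides: X up to four characters,
    Y up to five characters (an assumption taken from the text).
    """
    protein, sep, source = entry_id.partition("_")
    if not sep or not protein or not source:
        raise ValueError(f"expected an ID of the form X_Y, got {entry_id!r}")
    if len(protein) > 4 or len(source) > 5:
        raise ValueError(f"ID parts too long in {entry_id!r}")
    return {"protein": protein, "source": source}


def source_code(genus: str, species: str) -> str:
    """Build the usual source code: 3 letters of genus + 2 of species."""
    return (genus[:3] + species[:2]).upper()


if __name__ == "__main__":
    print(parse_entry_id("CYC_CANFA"))
    print(source_code("Canis", "familiaris"))
```

The "usual" rule in `source_code` is only the common case described in the text; exceptions would need a lookup table rather than string slicing.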
Sequence
alignment
• One can quantitate the sequence similarity
of two polypeptides or two DNAs by
determining the number of aligned
residues that are identical
• For example, human & dog cytochrome c,
which differ in 11 of their 104 residues, are
89% identical:
[(104 - 11) / 104] × 100 ≈ 89%
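The percent-identity calculation above can be expressed as a short Python sketch. The function name `percent_identity` is hypothetical, the toy sequences are made up for illustration, and the function assumes the two sequences have already been aligned to the same length.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of aligned positions that hold identical residues.

    Assumes both sequences are already aligned and of equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # Toy example: 4 of 5 positions agree.
    print(percent_identity("ACDEF", "ACDQF"))
    # The cytochrome c figures from the text: 11 differences in 104 residues.
    print(round(100 * (104 - 11) / 104))
```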
The Homology of Distantly Related
Proteins May Be Difficult to Recognize
• Mutation is a stochastic (probabilistic or
random) process
• At every stage of evolution each residue has
an equal chance of mutation
• The relative evolutionary distances between
neighboring branch points are expressed as
the number of amino acid differences per 100
residues of the protein, or PAM units
(PAM = Percentage of Accepted point Mutations)
• Assume that we have a 100-residue protein
in which all point mutations have an equal
probability of being accepted and occur at a
constant rate. Then, at an evolutionary
distance of one PAM unit, the original and
evolved proteins are 99% identical
• At an evolutionary distance of two PAM units
they are 98% identical
• Whereas at 50 PAM units they are 61%
identical
(0.99)^1 × 100 = 99%
(0.99)^2 × 100 ≈ 98%
(0.99)^50 × 100 ≈ 61%
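The three figures above follow from one formula, identity = (0.99)^n × 100 at n PAM units, which can be checked with a small Python sketch. The function name `identity_at_pam` and its `retention` parameter are hypothetical; the model is the simplified one stated on the slides (equal acceptance probability, constant rate).

```python
def identity_at_pam(pam_units: int, retention: float = 0.99) -> float:
    """Expected percent identity after a given number of PAM units.

    Uses the simplified model from the text: each PAM unit leaves
    99% of residues unchanged, compounded over successive units.
    """
    return 100.0 * retention ** pam_units


if __name__ == "__main__":
    for n in (1, 2, 50):
        print(f"{n:>3} PAM units: {identity_at_pam(n):.1f}% identical")
```

Because the decay is compounded, identity falls more slowly than a straight subtraction of 1% per unit would suggest, which is why 50 PAM units leave about 61% identity rather than 50%.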
• Mutational events may result in the insertion
or deletion of one or more residues within a
chain
SQMCILFKAQMNYGH
MFYACRLPMGAHYWL
• If we allowed unlimited gapping, almost any two sequences could be made to match at a few positions:
SQMCILFKAQMNYGH------------
--M---F-----Y--ACRLPMGAHYWL
• Thus we cannot allow unlimited gapping to maximize the match between two peptides, but neither can we forbid all gapping, because insertions and deletions really do occur
• Consequently, for each allowed gap we must impose some sort of penalty in our alignment algorithm that strikes a balance between:
- finding the best alignment between distantly related peptides
- rejecting improper alignments
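One way to see the effect of a gap penalty is a small dynamic-programming scorer in Python. This is a minimal sketch in the style of Needleman-Wunsch global alignment, a technique the slides do not name; the function name `global_alignment_score` and the score values (match +1, mismatch -1, gap -2) are arbitrary assumptions chosen for illustration, not values from the text.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score with a linear gap penalty.

    score[i][j] holds the best score for aligning a[:i] with b[:j].
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # a[:i] aligned against gaps only
        score[i][0] = i * gap
    for j in range(1, cols):          # b[:j] aligned against gaps only
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,   # align the two residues
                score[i - 1][j] + gap,        # gap in b
                score[i][j - 1] + gap,        # gap in a
            )
    return score[-1][-1]


if __name__ == "__main__":
    s1, s2 = "SQMCILFKAQMNYGH", "MFYACRLPMGAHYWL"
    # Free gapping: gaps and mismatches cost nothing, so matches pile up.
    print(global_alignment_score(s1, s2, mismatch=0, gap=0))
    # Penalized gapping: each gap must now pay for itself.
    print(global_alignment_score(s1, s2))
```

With `gap=0` and `mismatch=0` the score simply counts how many residues can be lined up by inserting gaps freely, which is the situation the unlimited-gapping example warns against. A negative gap value makes each gap worthwhile only if it buys enough additional matches.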
• Unrelated proteins will exhibit sequence
identities in the range 15% to 25%
• Yet distantly related proteins may have
similar levels of sequence identity
• This is the origin of the term ‘twilight zone’
• A plot of percent identity vs. evolutionary
distance is an exponential curve that
approaches but never equals zero
• Differentiating homologous proteins in the
twilight zone from those that are unrelated
requires sophisticated alignment algorithms
Sequence Alignment Using Dot Matrices
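The slides name the dot-matrix technique without describing it, so the following Python sketch assumes its simplest form: one sequence is written along the rows, the other along the columns, and a cell is marked wherever the two residues are identical. The function names `dot_matrix` and `render` are hypothetical, and no window filtering or scoring is applied.

```python
def dot_matrix(seq_a: str, seq_b: str) -> list:
    """Boolean grid: cell [i][j] is True when seq_a[i] equals seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(seq_a: str, seq_b: str) -> str:
    """Text picture of the dot matrix, '*' for a match and '.' otherwise."""
    lines = ["  " + seq_b]
    for residue, row in zip(seq_a, dot_matrix(seq_a, seq_b)):
        lines.append(residue + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("SQMCILFKAQMNYGH", "MFYACRLPMGAHYWL"))
```

In such a plot, runs of marks along a diagonal indicate stretches where the two sequences align without gaps, while a shift from one diagonal to another suggests an insertion or deletion.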
• The following sections of this introduction to
bioinformatics will be demonstrated from the
soft copy of the textbook BIOCHEMISTRY
by D. Voet & J. Voet, 3rd Edition,
Biochemical Interactions Software
• Learning Objectives:
1. To understand the alignment process
2. To understand how natural selection affects the likelihood of an amino acid substitution being accepted
3. To understand the basis of sophisticated alignment programs such as BLAST