BIOINFORMATICS:

An Introduction

What is ‘Bioinformatics’?

• The term was first coined in 1988 by Dr. Hwa Lim

• The original definition was: “a collective term for data compilation, organisation, analysis and dissemination”

Simple Concept

• The application of computer science and engineering techniques to biological analysis

• The creation of repeatable, reusable, intelligent software

That means….

• Using information technology to help solve biological problems by designing novel and incisive algorithms and methods of analysis

And…

• It also serves to establish innovative software and to create new databases of information and maintain existing ones, allowing open access to the records held within them

It’s a huge subject

• ‘Bioinformatics’ is the new ‘buzz word’ in the scientific community

• It is an umbrella term covering genomics, proteomics, evolution and computer science

• It is now necessary for scientists to be inter-disciplinary

Why?

• The data is collected from a variety of sources

• The terminology is specific to its particular branch of science

• To make the data easily and universally interpretable by scientists

What data?

• Biologists have been classifying data on species of plants and animals since the 17th century

• The knowledge acquired has escalated in harmony with the evolution of technology

A brief history of progress….

• Genetics began when Mendel proved his laws of heredity with varieties of peas and flowers in 1865

• The invention of the compound microscope (around 1600) and its refinement in the 19th century

• The first protein to be sequenced – insulin

• The first complete sequencing of an enzyme, ribonuclease, in 1960

• The sequencing of the first complete genome of a free-living organism (Haemophilus influenzae), published in 1995

• We have since moved on to technologies permitting the sequencing, recombination and cloning of DNA

The Human Genome Project

• In 1990 the Human Genome Project (HGP) was unveiled by the United States Department of Energy and the National Institutes of Health

• Goals:

• To identify all chemical base pairs and all genes that make up the 23 chromosome pairs found in human DNA

• To develop the next generation of methods for simulating cellular behaviour and pathways

• Ultimately to devise means to apply IT to the modelling of cellular functions as specified by the enormous datasets

The ‘omic’ revolution

• Bioinformatics has been split into various subjects:

• Genomics – the sequencing and annotation of genomes

• Proteomics – the description of the complete set of proteins a particular genome codes for

Sequence Databases

• Since it became possible to elucidate protein & nucleic acid sequences, they have been determined at an ever-increasing rate

• These sequences were originally printed in research journals

• The enormous numbers and lengths of sequences (particularly genome sequences) make it no longer practical to print them in journals

• It is far more useful to have sequences in computer-accessible form

• As an example of a sequence database, let us describe the annotated protein sequence database named SWISS-PROT

• A sequence record in SWISS-PROT begins with the protein’s ID code, which has the form X_Y

• X is a mnemonic of up to four characters indicating the protein name

e.g., CYC for cytochrome c

e.g., HBA for hemoglobin α chain

• Y is an identification code of up to five characters indicating the protein’s biological source; it usually consists of the first three letters of the genus and the first two letters of the species

e.g., CANFA for Canis familiaris (dog)
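The X_Y naming scheme described above can be sketched in a few lines of Python. This is only an illustration: the function name `parse_entry_name` is hypothetical, and the length limits (four characters for X, five for Y) are taken from the slides rather than from the database's current specification, which may differ.

```python
def parse_entry_name(entry_name):
    """Split an ID code of the form X_Y into (protein mnemonic, source code).

    Assumption: limits of 4 and 5 characters follow the slide text above.
    """
    protein, sep, source = entry_name.partition("_")
    if not sep or not protein or not source or "_" in source:
        raise ValueError(f"not of the form X_Y: {entry_name!r}")
    if len(protein) > 4:
        raise ValueError(f"protein mnemonic longer than 4 characters: {protein!r}")
    if len(source) > 5:
        raise ValueError(f"source code longer than 5 characters: {source!r}")
    return protein, source


# Combining the two examples from the slides: cytochrome c from dog.
protein, source = parse_entry_name("CYC_CANFA")
print(protein, source)
```

Here `CYC_CANFA` simply joins the mnemonic CYC with the source code CANFA from the examples above.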

Sequence alignment

• One can quantitate the sequence similarity of two polypeptides or two DNAs by determining the number of their aligned residues that are identical

• For example, human & dog cytochrome c, which differ in 11 of their 104 residues, are 89% identical

[ ( 104 - 11 ) / 104 ] × 100 = 89%
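The percent identity calculation above can be written as a short Python sketch. The function names are hypothetical, and the second function assumes the two sequences have already been aligned to equal length with no gaps.

```python
def percent_identity_from_counts(n_residues, n_differences):
    """Percent identity from the number of aligned residues and differences."""
    return (n_residues - n_differences) / n_residues * 100


def percent_identity(seq_a, seq_b):
    """Percent identity of two pre-aligned, equal-length sequences (no gaps)."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return percent_identity_from_counts(len(seq_a), differences)


# Human vs. dog cytochrome c: 11 differences in 104 residues.
print(round(percent_identity_from_counts(104, 11)))  # 89
```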

The Homology of Distantly Related Proteins May Be Difficult to Recognize

• Mutation is a stochastic (probabilistic or random) process

• At every stage of evolution each residue has an equal chance of mutation

• The relative evolutionary distances between neighboring branch points are expressed as the number of amino acid differences per 100 residues of the protein, or PAM units (Percentage of Accepted Point Mutations)



• Assume that we have a 100-residue protein in which all point mutations have an equal probability of being accepted and occur at a constant rate. Then, at an evolutionary distance of one PAM unit, the original and evolved proteins are 99% identical



• At an evolutionary distance of two PAM units they are 98% identical

• Whereas at 50 PAM units they are 61% identical

(0.99)^1 × 100 = 99%

(0.99)^2 × 100 ≈ 98%

(0.99)^50 × 100 ≈ 61%
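The arithmetic above follows from multiplying by 0.99 once per PAM unit. A minimal Python sketch of this simplified model (the function name `expected_identity` is hypothetical, and the model is the one assumed on the slides, where every accepted point mutation is equally likely):

```python
def expected_identity(pam_units, identity_per_unit=0.99):
    """Expected percent identity after a given number of PAM units."""
    return identity_per_unit ** pam_units * 100


for n in (1, 2, 50, 250):
    print(f"{n:>3} PAM units: {expected_identity(n):.1f}% identical")
```

Because each step multiplies by a factor between 0 and 1, the value keeps shrinking as the distance grows but stays above zero.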



• Mutational events may result in the insertion or deletion of one or more residues within a chain

SQMCILFKAQMNYGH

MFYACRLPMGAHYWL



• If we allowed unlimited gapping:

SQMCILFKAQMNYGH------------

--M---F-----Y--ACRLPMGAHYWL

• Thus we cannot allow unlimited gapping to maximize the match between two peptides, but neither can we forbid all gapping, because insertions and deletions really do occur

• Consequently, for each allowed gap we must impose some sort of penalty in our alignment algorithm that strikes a balance between:

- finding the best alignment between distantly related peptides

- rejecting improper alignments
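The effect of a gap penalty can be illustrated with a small dynamic-programming sketch in Python. The slides do not name a specific algorithm, so this uses a Needleman-Wunsch-style global alignment score with a linear gap penalty as a stand-in; the function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen only for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with a linear gap penalty."""
    # prev[j] is the best score for the residues of a seen so far against
    # the first j residues of b; row 0 aligns b against gaps only.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # a[i-1] opposite a gap
                curr[j - 1] + gap,   # b[j-1] opposite a gap
            )
        prev = curr
    return prev[-1]


# The two peptides from the slides, scored with free gaps and with a penalty.
p1, p2 = "SQMCILFKAQMNYGH", "MFYACRLPMGAHYWL"
print("free gaps:     ", global_alignment_score(p1, p2, gap=0))
print("penalized gaps:", global_alignment_score(p1, p2, gap=-2))
```

With `gap=0` the score rewards any scattered matches that gapping can line up, which is the unlimited-gapping situation described above; a negative `gap` makes each insertion or deletion cost something, so only gaps that buy enough matches are kept.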



• Unrelated proteins will exhibit sequence identities in the range 15% to 25%

• Yet distantly related proteins may have similar levels of sequence identity

• This is the origin of the term ‘twilight zone’



• A plot of percent identity vs. evolutionary distance is an exponential curve that approaches but never equals zero



• Differentiating homologous proteins in the twilight zone from those that are unrelated requires sophisticated alignment algorithms

Sequence Alignment Using Dot Matrices



• The following sections of this introduction to bioinformatics will be demonstrated from the soft copy of the textbook BIOCHEMISTRY by D. Voet & J. Voet, 3rd Edition, Biochemical Interactions Software

• Learning Objectives:

1. To understand the alignment process

2. To understand how natural selection affects the likelihood of an amino acid substitution being accepted

3. To understand the basis of sophisticated alignment programs such as BLAST