Download - Genetics, Evolution and Bioinformatics Wen-Hsiung Li, Ph.D. George Beadle Professor Department of Ecology & Evolution University of Chicago.

Genetics, Evolution and Bioinformatics

Wen-Hsiung Li, Ph.D. George Beadle Professor Department of Ecology & Evolution University of Chicago

Topics• DNA and Genes

• DNA Sequence Evolution

• Traditional Genetics

• Reverse Genetics

• The Human Genome Project

• What is Bioinformatics?

Deoxyribonucleic acid (DNA) from bacteria to man

Ribonucleic acid (RNA) in many viruses

4 types of nucleotides: A, C, G, and T (U)

Hereditary Material

Double-stranded DNA sequence:

5’-ATCGTAATGCGTAGCAGT-3’3’-TAGCATTACGCATCGTCA-5’

Single-stranded DNA sequence:

ATCGTAATGCGTAGCAGT

What is a Gene?

A gene is a DNA region that controls a certain function.

For example, the insulin gene codes for an amino acid sequence that is processed to form the insulin.

Transcription

Translation

Gene (DNA)

mRNA

polypeptideor protein

Mathematical Theory of DNA Sequence Evolution

To develop mathematical theory to describe the change of nucleotides in a DNA sequence in time.

Sequence at time 0

Sequence at time t

t years

A G

TC

α

α

α

αα

α

Given that the nucleotide at a particular site is A at time 0, what is the probability, PA(t) , that the nucleotide will be A at time t?

Starting with A: PA(0) =1

PA(1) = 1 - 3 α

PA(2) = (1 - 3 α) PA(1) + α [1 - PA(1) ]

t = 0

t = 1

t = 2

A

A

A

A

not A

A

No subst.

No subst.

Subst.

Subst.

P A ( t + 1 ) = ( 1 – 3 ) P A ( t ) + [ 1 – P A ( t ) ]

P A ( t + 1 ) – P A ( t ) = – 3P A ( t ) + [ 1 – P A ( t ) ]

P A ( t ) = – 4P A ( t ) +

d P A ( t )

d t= – 4P A ( t ) +

Discrete time

Continuous time approximation

P A ( t ) = + [ P A ( 0 ) – ] e – 4t

4

1

4

1

P ii ( t ) =

P A ( t ) = 4

1

4

1

4

3

4

3

+

+

e – 4t

e – 4t

P AA ( t ) = 4

1

4

3+ e – 4t

P ij ( t ) =

P GA( t ) = 4

1

4

1

4

1

4

1

_

–

e – 4t

e – 4t

Frequency

P AA ( t ) = 4

1

4

3+ e – 4t

Kimura (1980) : two parameters

A G

C T

r = r1 = r2 = r3 = r4 = + 2

Transition

Transversion

Purines

Pyrimidines

A T C G

A

T

C

G

Markov Chain : Stochastic Matrix

1-r11 r12 r13 r14

r21 1-r22 r23 r24

r31 r32 1-r33 r34

r41 r42 r43 1-r44

R =

r11 = r12 + r13 + r14

Rate matrix

12 parameters

Divergence between two DNA Sequences

Multiple hits

Parallel substitutions

Convergent substitutions

Ancestral Sequence

tt

Sequence 1 Sequence 2

1 GTCTGTTCCAAGGGCCTTTGCGTCAGG-TGGGC-T * # * # *2 GTCTGTTCCAAGGGCCTTCGAGCCAGTCTGGGCCC

1 TT---------------CCAGGGTGGCTGGACCCC * * ** *2 CTGCCCCACTCGGGGTTCCAGAGCAGTTGGACCCC

1 CCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTCG * **2 TCAGC---------GGGAGGGTGTGGCTGGGCTC-

Number of Nucleotide Substitutions between Two

Sequences

A basic quantity in Molecular Evolution:

1. Rates of evolution

2. Phylogenetic reconstruction

3. Dating of evolutionary

events

ACTGAACGTAACGC

ACTGAACGTAACGC

ACTGAACGTAACGC

A A A T T

C

C G A C

T

*

*

+

Single

T

Multiple

Coincidental

Parallel

Convergent

Back

Sequence Similarity

Proportion of identical nucleotides between 2 seqs

I ( t ) = P AA ( t )2

+ P AT ( t )2

+ P AC ( t )2

+ P AG ( t )2

parallel substitutions

4

1

4

3+ e – 8t

4

1

4

1+ e – 8t

2

1+ e – 4(t

= for one – p model

for two – p =

I ( t ) = Prob. of identical nucleotides at a site

A

A

A

A

A

A

T

T

C

C

G

G

Sequence dissimilarity

D(t) = 1 – I(t)

P(t) = prop. of transitional diff’s between 2 seq’s

One-Parameter model

P(t) = 2PAA(t) PAG(t) + 2PAT(t) 2PAC(t)

= ¼ + ¼ e -8αt

= Pij (2t) (transversional)

2 – P model

P(t) = Y(2t) = ¼ + ¼ e -8βt – ½ e –4(α+β)t

Q(t) = 2Z(t) = ½ - ½ e -8βt

Transitional difference

Transversional difference

Number of nucleotide substitutions between 2 seqs

K = # of substitutions / site

between 2 sequences

Not total #: N

K = N / L, L = # of sites compared

Protein–coding Seq’sNon-coding Seq’s

I(t) = ¼ + (¾) e -8βt

p = 1 - I(t)

p = ¾ - (¾) e -8αt

8αt = - ln [1 – (4/3) p]

K = 2(3αt)

= - (¾ ) ln [1 – (¾) p]

V(K) = p(1 - p) / L(1- ¾ p)2

An Exmple of Application

White-handed gibbon

Hylobates lar

Borneo orangutan

Western lowland gorilla

Common chimpanzee

Human

Chimpanzee

Gorilla

Orangutan

Gibbon

Family

Hominidae

Pongidae

Hylobatidae

Species GenusTraditional view

Human Chimp Gorilla

Chimp 1.24

Gorilla 1.62 1.63

Orang 3.08 3.12 3.09

Number of nucleotide substitutions per 100 sites between species

Human and chimpanzee genomes differ by only 1%!

Human

Chimpanzee

Gorilla

Orangutan

0.620.20

0.73

4.8~6.4 Mya (Million years ago)

6.3~8.5 Mya

12~16 Mya

Man’s closest relative is chimpanzee!

What is a Genetic Disease?

A genetic defect is a defect in a gene.

A genetic disease is a disease caused by one or more genetic defects.

D: disease allele (gene), dominant

d: normal allele (gene), recessive

Mating: Dd (disease) X dd (normal)

Offspring: Dd (disease)

dd (normal)

d: disease allele, recessive

D: normal allele, dominant

Mating: Dd (normal) X Dd (normal)

Offspring: DD (normal)

Dd (normal)

dd (disease)

Traditional Genetics

From a disease or trait (character), look for the gene

Gene cloning from partial protein sequences

Positional cloning: from patients and normal individuals of family trees

Time consuming and expensive

Reverse Genetics

Sequence DNA and look for potentially functional regions by bioinformatic tools. Then conduct experiments to determine the function of such a region.

Genes involved in complex diseases: Diabetes Hypertension Cancer

What is a Genome?

The entire set of the genetic complement (material) of an organism.

For example, the human genome consists of 22 autosomes, one X chromosome, and one Y chromosome.

The Human Genome Project

To sequence the following genomes Human Mouse Drosophila (fruit fly) C. elegans: worm (nematode) Yeast E. coli (bacterium)Purpose: To identify all human genes and genes in model organisms

Why Sequence non-Human Genomes?

Comparing genes from different species can help identify important DNA regions.

More convenient to do experiments in non-humans: e.g., Knock-outs in mice. Transgenic mice.

Why Sequence Bacterial Genomes?

•Pathogens•Drug Resistant strains: e.g., drug resistant TB strains•Bacteria that can clean wastes•Special enzymes: e.g., Enzymes that can function in very high temperature

New Challenges• Gene identification How many genes in the human genome?

• Gene network Which genes control which genes?

• Protein-protein interaction network Which proteins interact with which

proteins?

• Personalized medicine Treatment according to a person’s genetic

makeup

Immediate Challenges to Computer scientists (1)

• Storage of huge amounts of data: The size of the human genome is 3 billion

nucleotides.

• Good databases: How to make a database informative,

easy to search, and retrieve

• Search and Retrieval tools Which genes control which genes?

Immediate Challenges to Computer scientists (2)

How to find useful signals from a sequence or sequences

tattcatcaa gtgccctcta gctgttaagt cactctgatc tctgactgca gctcctactgttggacacac ctggccggtg cttcagttag atcaaaccat tgctgaaact gaagaggacatgtcaaatat tacagatcca cagatgtggg attttgatga tctaaatttc actggcatgccacctgcaga tgaagattac agcccctgta tgctagaaac tgagacactc aacaagtatgttgtgatcat cgcctatgcc ctagtgttcc tgctgagcct gctgggaaac tccctggtgatgctggtcat cttatacagc agggtcggcc gctccgtcac tgatgtctac ctgctgaacctggccttggc cgacctactc tttgccctga ccttgcccat ctgggccgcc tccaaggtgaatggctggat ttttggcaca ttcctgtgca aggtggtctc actcctgaag gaagtcaact

What is Bioinformatics?

The word “bioinformatics” means the application of informatics tools and skills to study biology.

But, existing informatics tools and skills are not sufficient for this purpose.

So, it requires the development of tools and skills.

These include statistical methods and computer programs or software packages.

It also requires databases that are especially created for the study of biology, that is, biological databases.

E.g. , GenBank

Obviously, it also requires special search and retrieval tools.

Definition of Bioinformatics

The study of biological problems by creating biological databases and by using existing or developing search and retrieval tools, and methods and computer programs for analyzing biological data.