Genetics, Evolution and Bioinformatics
Wen-Hsiung Li, Ph.D. George Beadle Professor Department of Ecology & Evolution University of Chicago
Topics• DNA and Genes
• DNA Sequence Evolution
• Traditional Genetics
• Reverse Genetics
• The Human Genome Project
• What is Bioinformatics?
Deoxyribonucleic acid (DNA) from bacteria to man
Ribonucleic acid (RNA) in many viruses
4 types of nucleotides: A, C, G, and T (U)
Hereditary Material
Double-stranded DNA sequence:
5’-ATCGTAATGCGTAGCAGT-3’3’-TAGCATTACGCATCGTCA-5’
Single-stranded DNA sequence:
ATCGTAATGCGTAGCAGT
What is a Gene?
A gene is a DNA region that controls a certain function.
For example, the insulin gene codes for an amino acid sequence that is processed to form the insulin.
Mathematical Theory of DNA Sequence Evolution
To develop mathematical theory to describe the change of nucleotides in a DNA sequence in time.
Given that the nucleotide at a particular site is A at time 0, what is the probability, PA(t) , that the nucleotide will be A at time t?
Starting with A: PA(0) =1
PA(1) = 1 - 3 α
PA(2) = (1 - 3 α) PA(1) + α [1 - PA(1) ]
t = 0
t = 1
t = 2
A
A
A
A
not A
A
No subst.
No subst.
Subst.
Subst.
P A ( t + 1 ) = ( 1 – 3 ) P A ( t ) + [ 1 – P A ( t ) ]
P A ( t + 1 ) – P A ( t ) = – 3P A ( t ) + [ 1 – P A ( t ) ]
P A ( t ) = – 4P A ( t ) +
d P A ( t )
d t= – 4P A ( t ) +
Discrete time
Continuous time approximation
P A ( t ) = + [ P A ( 0 ) – ] e – 4t
4
1
4
1
P ii ( t ) =
P A ( t ) = 4
1
4
1
4
3
4
3
+
+
e – 4t
e – 4t
P AA ( t ) = 4
1
4
3+ e – 4t
Kimura (1980) : two parameters
A G
C T
r = r1 = r2 = r3 = r4 = + 2
Transition
Transversion
Purines
Pyrimidines
1-r11 r12 r13 r14
r21 1-r22 r23 r24
r31 r32 1-r33 r34
r41 r42 r43 1-r44
R =
r11 = r12 + r13 + r14
Rate matrix
12 parameters
1 GTCTGTTCCAAGGGCCTTTGCGTCAGG-TGGGC-T * # * # *2 GTCTGTTCCAAGGGCCTTCGAGCCAGTCTGGGCCC
1 TT---------------CCAGGGTGGCTGGACCCC * * ** *2 CTGCCCCACTCGGGGTTCCAGAGCAGTTGGACCCC
1 CCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTCG * **2 TCAGC---------GGGAGGGTGTGGCTGGGCTC-
Number of Nucleotide Substitutions between Two
Sequences
A basic quantity in Molecular Evolution:
1. Rates of evolution
2. Phylogenetic reconstruction
3. Dating of evolutionary
events
ACTGAACGTAACGC
ACTGAACGTAACGC
ACTGAACGTAACGC
A A A T T
C
C G A C
T
*
*
+
Single
T
Multiple
Coincidental
Parallel
Convergent
Back
Sequence Similarity
Proportion of identical nucleotides between 2 seqs
I ( t ) = P AA ( t )2
+ P AT ( t )2
+ P AC ( t )2
+ P AG ( t )2
parallel substitutions
4
1
4
3+ e – 8t
4
1
4
1+ e – 8t
2
1+ e – 4(t
= for one – p model
for two – p =
I ( t ) = Prob. of identical nucleotides at a site
Sequence dissimilarity
D(t) = 1 – I(t)
P(t) = prop. of transitional diff’s between 2 seq’s
One-Parameter model
P(t) = 2PAA(t) PAG(t) + 2PAT(t) 2PAC(t)
= ¼ + ¼ e -8αt
= Pij (2t) (transversional)
2 – P model
P(t) = Y(2t) = ¼ + ¼ e -8βt – ½ e –4(α+β)t
Q(t) = 2Z(t) = ½ - ½ e -8βt
Transitional difference
Transversional difference
Number of nucleotide substitutions between 2 seqs
K = # of substitutions / site
between 2 sequences
Not total #: N
K = N / L, L = # of sites compared
Protein–coding Seq’sNon-coding Seq’s
I(t) = ¼ + (¾) e -8βt
p = 1 - I(t)
p = ¾ - (¾) e -8αt
8αt = - ln [1 – (4/3) p]
K = 2(3αt)
= - (¾ ) ln [1 – (¾) p]
V(K) = p(1 - p) / L(1- ¾ p)2
Human
Chimpanzee
Gorilla
Orangutan
Gibbon
Family
Hominidae
Pongidae
Hylobatidae
Species GenusTraditional view
Human Chimp Gorilla
Chimp 1.24
Gorilla 1.62 1.63
Orang 3.08 3.12 3.09
Number of nucleotide substitutions per 100 sites between species
Human and chimpanzee genomes differ by only 1%!
Human
Chimpanzee
Gorilla
Orangutan
0.620.20
0.73
4.8~6.4 Mya (Million years ago)
6.3~8.5 Mya
12~16 Mya
Man’s closest relative is chimpanzee!
What is a Genetic Disease?
A genetic defect is a defect in a gene.
A genetic disease is a disease caused by one or more genetic defects.
D: disease allele (gene), dominant
d: normal allele (gene), recessive
Mating: Dd (disease) X dd (normal)
Offspring: Dd (disease)
dd (normal)
d: disease allele, recessive
D: normal allele, dominant
Mating: Dd (normal) X Dd (normal)
Offspring: DD (normal)
Dd (normal)
dd (disease)
Traditional Genetics
From a disease or trait (character), look for the gene
Gene cloning from partial protein sequences
Positional cloning: from patients and normal individuals of family trees
Time consuming and expensive
Reverse Genetics
Sequence DNA and look for potentially functional regions by bioinformatic tools. Then conduct experiments to determine the function of such a region.
Genes involved in complex diseases: Diabetes Hypertension Cancer
What is a Genome?
The entire set of the genetic complement (material) of an organism.
For example, the human genome consists of 22 autosomes, one X chromosome, and one Y chromosome.
The Human Genome Project
To sequence the following genomes Human Mouse Drosophila (fruit fly) C. elegans: worm (nematode) Yeast E. coli (bacterium)Purpose: To identify all human genes and genes in model organisms
Why Sequence non-Human Genomes?
Comparing genes from different species can help identify important DNA regions.
More convenient to do experiments in non-humans: e.g., Knock-outs in mice. Transgenic mice.
Why Sequence Bacterial Genomes?
•Pathogens•Drug Resistant strains: e.g., drug resistant TB strains•Bacteria that can clean wastes•Special enzymes: e.g., Enzymes that can function in very high temperature
New Challenges• Gene identification How many genes in the human genome?
• Gene network Which genes control which genes?
• Protein-protein interaction network Which proteins interact with which
proteins?
• Personalized medicine Treatment according to a person’s genetic
makeup
Immediate Challenges to Computer scientists (1)
• Storage of huge amounts of data: The size of the human genome is 3 billion
nucleotides.
• Good databases: How to make a database informative,
easy to search, and retrieve
• Search and Retrieval tools Which genes control which genes?
Immediate Challenges to Computer scientists (2)
How to find useful signals from a sequence or sequences
tattcatcaa gtgccctcta gctgttaagt cactctgatc tctgactgca gctcctactgttggacacac ctggccggtg cttcagttag atcaaaccat tgctgaaact gaagaggacatgtcaaatat tacagatcca cagatgtggg attttgatga tctaaatttc actggcatgccacctgcaga tgaagattac agcccctgta tgctagaaac tgagacactc aacaagtatgttgtgatcat cgcctatgcc ctagtgttcc tgctgagcct gctgggaaac tccctggtgatgctggtcat cttatacagc agggtcggcc gctccgtcac tgatgtctac ctgctgaacctggccttggc cgacctactc tttgccctga ccttgcccat ctgggccgcc tccaaggtgaatggctggat ttttggcaca ttcctgtgca aggtggtctc actcctgaag gaagtcaact
What is Bioinformatics?
The word “bioinformatics” means the application of informatics tools and skills to study biology.
But, existing informatics tools and skills are not sufficient for this purpose.
So, it requires the development of tools and skills.
These include statistical methods and computer programs or software packages.
It also requires databases that are especially created for the study of biology, that is, biological databases.
E.g. , GenBank
Obviously, it also requires special search and retrieval tools.
Top Related