MSc in Genetics - UABbioinformatica.uab.es/base/documents/masterGP/Population...MSc in Genetics...
Antonio Barbadilla
Group Genomics, Bioinformatics & Evolution
Institut Biotecnologia I Biomedicina
Departament de Genètica i Microbiologia
UAB
Course 2012-13
Population Genomics: Theory
MSc in Genetics
MSc in Genetics Module Genomics & Proteomics
Antonio Barbadilla Population Genomics: Theory
Outline
Population thinking: population genetics
The explanation of genetic variation: neutral theory and selection
Detecting natural selection at the nucleotide level: The MKT
Measures of nucleotide variation
Linkage disequilibrium
The golden age of population genetics
Cataloguing nucleotide variation at the genome scale
Exercises
Lesson 6. Genome variation: I. nucleotide variation
Genetic variation is the cornerstone of biological evolution
R. C. Lewontin
Evolution: origin and substitution of genetic variants within populations
A G A G T T C T G C T C G
A G G G T T C T G C G C G
A G T G T T C T G C G C G
Population is the substrate where evolution occurs
Nothing in Biology Makes Sense Except in the Light of Evolution
Theodosius Dobzhansky
Nothing in Evolution Makes Sense Except in the Light of Population Genetics
Michael Lynch
Evolution and population genetics
Population is the substrate where evolution occurs
Mendelian population: a group of interbreeding individuals sharing a common gene pool
•Genetic variation or genetic polymorphism: existence in a population of two or more allelic forms at appreciable frequencies
•Gene or allelic frequency (population property, basic unit of evolution): f(A), the proportion of an allele in the population
One gene, two alleles (A and a): how does allelic frequency change over time?
The Hardy-Weinberg equilibrium law plays the role of the principle of inertia in dynamics: if no force is acting on the population, allelic and genotypic frequencies remain unchanged over time.
Allelic frequencies: f(A) = p, f(a) = q

              Sperm A (p)   Sperm a (q)
Egg A (p)     AA  p²        Aa  pq
Egg a (q)     Aa  pq        aa  q²
The Hardy-Weinberg law assumes that the alleles of an infinite population unite at random to form the genotypes of the next generation
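The random union of gametes described above can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names and the starting frequency p = 0.3 are illustrative assumptions. It computes the genotype frequencies p², 2pq and q², then recovers the allele frequency among the gametes of the next generation, which should equal p when no force is acting.

```python
def hardy_weinberg(p):
    """Genotype frequencies (AA, Aa, aa) under random mating when f(A) = p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def allele_frequency_in_gametes(genotypes):
    """Frequency of A among gametes: all gametes of AA plus half of those of Aa."""
    return genotypes["AA"] + 0.5 * genotypes["Aa"]


p = 0.3  # hypothetical starting frequency of allele A
genotypes = hardy_weinberg(p)
p_next = allele_frequency_in_gametes(genotypes)

print(genotypes)  # genotype frequencies after one round of random mating
print(p_next)     # allele frequency passed on to the next generation
```

Under these assumptions the allele frequency returned for the next generation matches the starting value, which is the "inertia" property the slide refers to.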
The central problem of population genetics is the description and explanation of genetic variation within and between populations
Theodosius Dobzhansky
Population genetics
The theory of population genetics
Structure of population genetics: a theory of forces
Founders of population genetics (1918-1932): Ronald Fisher, J. B. S. Haldane, Sewall Wright
Factors changing gene frequencies in populations:
•Genetic drift
•Natural selection
•Mutation
•Migration
The struggle for the measurement of genetic variation
The Great Obsession of population genetics (Gillespie 2004): what evolutionary forces led to the observed pattern of genetic variation?
Theories of genetic variation in the 1960s

Classical theory (H. J. Muller)
•Absence of variation
•Purifying selection
•The wild-type genotype is optimal
•Muller (laboratory geneticist)
•Eugenics

Balance theory (Th. Dobzhansky)
•Ubiquitous variation
•Balancing selection
•There is no wild-type genotype
•Dobzhansky (naturalist)
•Long live diversity! No interference
1960s-70s
•Electrophoretic variation
The struggle for the measurement of genetic variation
Neutral theory of molecular evolution (1968)
Motoo Kimura
Mutations are mainly neutral or strongly deleterious
Tomoko Ohta
Genome variation data
DGRP Freeze 1: patterns of polymorphism and divergence along chromosome arms in D. melanogaster
[Figure: chromosome arms 2L, 2R, 3L, 3R and X, each plotted between telomere (Tel) and centromere (Cen)]
PopDrowser -> http://popdrowser.uab.cat
Neutral theory of molecular evolution
Schrödinger equation (general)
F = ma (Newton’s dynamics 2nd law)
Nature of genetic variation
Mutation = Individual
Substitution = Population
Dynamics of neutral substitutions
Neutral theory of molecular evolution. Assumption: new mutations are mainly neutral or strongly deleterious
[Figure: allelic frequency (0 to 1) versus time for new neutral mutations. Labels: t_fix = 4N generations; t_lost = 2 ln(2N) generations; µ]
•Polymorphism
•Heterozygosity at equilibrium: H = 4Nµ
Neutral theory of molecular evolution
Polymorphism and divergence are coupled
[Figure: species A and species B diverging over the time from separation; substitutions in each lineage => divergence]
Neutral theory of molecular evolution
Polymorphism and divergence are coupled
• Divergence increases over time: D = 2Tµ
• Polymorphism reaches a dynamic equilibrium: H = 4Nµ
[Figure: species A and species B]
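A small numerical sketch may help to see the contrast between the two neutral expectations, D = 2Tµ and H = 4Nµ. The parameter values below (µ per site per generation, N, and the divergence times T in generations) are hypothetical and chosen only for illustration; the function names are not from the slides.

```python
def expected_divergence(T, mu):
    """Neutral divergence between two species separated for T generations: D = 2*T*mu."""
    return 2 * T * mu


def equilibrium_heterozygosity(N, mu):
    """Neutral equilibrium polymorphism in a diploid population of size N: H = 4*N*mu."""
    return 4 * N * mu


mu = 1e-9      # hypothetical neutral mutation rate per site per generation
N = 1_000_000  # hypothetical population size

H = equilibrium_heterozygosity(N, mu)
for T in (1e6, 5e6, 1e7):  # hypothetical times since separation, in generations
    D = expected_divergence(T, mu)
    print(f"T = {T:.0e}  D = {D:.4f}  H = {H:.4f}")
```

With these assumed values, D keeps growing as T increases while H stays at the same equilibrium level, which is the coupling-with-different-dynamics point made on the slide.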
The myth of Sisyphus and molecular evolution
The intellectual elegance of the neutral theory: it plays the role of the null hypothesis
[Figure: allelic frequency (0 to 1) versus time]
Dynamics of selectively advantageous mutations
II: Selectively advantageous mutations
[Figure: allelic frequency (0 to 1) versus time. Labels: t = (2/s) ln(2N) generations; 1/(4Ns)]
How to detect positive (adaptive) selection in a background of neutral mutations?
•Divergence: K = µ
•Polymorphism: neutral heterozygosity = 4Nµ
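The divergence result K = µ follows from a short standard argument, sketched here under the usual assumptions of a diploid population of constant size N and strictly neutral mutations. Each generation 2Nµ new neutral mutations appear, and each one has a probability 1/(2N) of eventually being fixed, so the population size cancels:

```latex
K \;=\; \underbrace{2N\mu}_{\text{new neutral mutations per generation}}
   \;\times\; \underbrace{\frac{1}{2N}}_{\text{fixation probability of each}}
   \;=\; \mu
```

The neutral substitution rate therefore depends only on the mutation rate, while the level of polymorphism (4Nµ) also depends on population size.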
Synonymous and non-synonymous sites
CTT (Leu) -> CTA (Leu): synonymous site (third codon position)
CTT (Leu) -> CCT (Pro): non-synonymous site (second codon position)
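The distinction between synonymous and non-synonymous changes can be made explicit with the standard genetic code. The sketch below is illustrative only: the helper name `classify_change` is hypothetical, and the codon table is built from the standard code written in the conventional T, C, A, G order.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (T, C, A, G).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_change(codon_before, codon_after):
    """Label a codon change as synonymous or non-synonymous."""
    before = CODON_TABLE[codon_before.upper()]
    after = CODON_TABLE[codon_after.upper()]
    return "synonymous" if before == after else "non-synonymous"


print(classify_change("CTT", "CTA"))  # Leu -> Leu
print(classify_change("CTT", "CCT"))  # Leu -> Pro
```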
McDonald and Kreitman Test (MKT)
How to detect adaptive fixation in a class of sites?
•Neutral hypothesis: polymorphism and divergence are correlated
•Adaptive hypothesis: adaptive fixation uncouples divergence and polymorphism
•Neutral sites (synonymous sites)
•Putative positively selected sites (non-synonymous sites)
[Figure legend: fixation by positive (adaptive) selection; fixation by neutral process]
McDonald, J. H. and M. Kreitman. 1991. Adaptive protein evolution at the Adh locus in Drosophila. Nature 351: 652-654.
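McDonald-Kreitman test (MKT)
Neutral expectation: $\pi = 4N\mu$, $K = \mu$
i = non-synonymous or putatively selected sites; neut = neutral (synonymous) sites
Ratio of divergence: $\omega = K_i / K_{neut} = \mu_{neut\_i} / \mu_{neut}$
Ratio of polymorphism: $\pi_i / \pi_{neut} = 4N\mu_{neut\_i} / (4N\mu_{neut}) = \mu_{neut\_i} / \mu_{neut}$
Only neutral fixation: $\pi_i / \pi_{neut} = K_i / K_{neut}$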
Extended McDonald-Kreitman test (ext MKT)
Only neutral fixation: $\pi_i / \pi_{neut} = k_i / k_{neut}$
+ Adaptive fixation: $\pi_i / \pi_{neut} < k_i / k_{neut}$
MKT Chi-square table. Observed (Expected) segregating sites

                  Fixed   Polymorphic
Synonymous        Ds      Ps
Non-synonymous    Dn      Pn

•Null hypothesis: the proportion of fixed versus polymorphic mutations is the same for synonymous and non-synonymous mutations.
•G or χ² test: variants are classified as (1) fixed or polymorphic, and (2) synonymous or non-synonymous
•Dn: a relative excess suggests directional selection
•Pn: a relative excess suggests directional selection or purifying selection
Adh gene of Drosophila melanogaster
Structure of the Adh gene (256 codons): 5’ Exon 1, Exon 2, Exon 3, Exon 4 3’
•13 synonymous polymorphisms
•1 non-synonymous polymorphism
Kreitman, M. 1983. Nucleotide polymorphism at the alcohol dehydrogenase locus of Drosophila melanogaster. Nature 304: 412-417.
Polymorphism and divergence at the Adh locus in the closely related species Drosophila melanogaster, D. simulans and D. yakuba
MKT Chi-square table. Observed (Expected) segregating sites

                  Fixed       Polymorphic
Non-synonymous    7 (3.2)     2 (5.8)
Synonymous        17 (20.8)   42 (38.2)

G = 7.43 or χ² = 8.1 **
•Null hypothesis: the proportion of fixed versus polymorphic mutations is the same for synonymous and non-synonymous mutations.
•G or χ² test: variants are classified as (1) fixed or polymorphic, and (2) synonymous or non-synonymous
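The Adh table above can be tested with a few lines of code. This is a minimal sketch, not the implementation used by the author or by the MKT web server: the function names are hypothetical, the chi-square statistic is computed without continuity correction, and Williams' correction is applied to G on the assumption that this is how the reported G was obtained. With unrounded expected counts the chi-square comes out slightly above the 8.1 shown on the slide.

```python
import math


def mkt_tests(dn, pn, ds, ps):
    """Chi-square and G tests of independence on the 2x2 MKT table."""
    observed = [[dn, pn], [ds, ps]]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    # Expected counts under the null hypothesis (rows and columns independent).
    expected = [[r * c / total for c in col_totals] for r in row_totals]

    cells = [(o, e)
             for obs_row, exp_row in zip(observed, expected)
             for o, e in zip(obs_row, exp_row)]
    chi2 = sum((o - e) ** 2 / e for o, e in cells)
    g = 2 * sum(o * math.log(o / e) for o, e in cells if o > 0)

    # Williams' correction for a 2x2 table (assumed, see the text above).
    q = 1 + ((sum(total / r for r in row_totals) - 1)
             * (sum(total / c for c in col_totals) - 1)) / (6 * total)

    return {"expected": expected, "chi2": chi2, "G": g, "G_williams": g / q}


def p_value_1df(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


adh = mkt_tests(dn=7, pn=2, ds=17, ps=42)
print([[round(e, 1) for e in row] for row in adh["expected"]])
print(round(adh["chi2"], 2), round(adh["G_williams"], 2))
print(p_value_1df(adh["chi2"]), p_value_1df(adh["G_williams"]))
```

Both statistics exceed 3.84, the 5% critical value for one degree of freedom, so the null hypothesis of equal fixed-to-polymorphic proportions is rejected for these data.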
Exercise 1: Go to the web page Standard & Generalized McDonald-Kreitman test (http://mkt.uab.cat). Test the examples and try to understand the difference between the standard and the generalized MKT. What is the parameter alpha whose value is given in the results of the test? How would you estimate it?
Single nucleotide polymorphism (SNP)
[Figure: a nucleotide site with two alleles, G and C]
Nature and classification of Single Nucleotide Polymorphisms (SNPs)
•A single nucleotide change: transition or transversion | indel
•Changes of two or a few nucleotides, or indels -> Simple Nucleotide Polymorphism
Classification
•Coding SNP: synonymous, or non-synonymous (replacement)
•Non-coding SNP: CNS, 5’ and 3’ UTR, intron, 5’ and 3’ intergenic
Description levels of genetic variation
SNPs BRCA2
Individual 1 acgtagcatcgtatgcgttagacgggggggtagcaccagtacag
Individual 2 acgtagcatcgtatgcgttagacgggggggtagcaccagtacag
Individual 3 acgtagcatcgtatgcgttagacgggggggtagcaccagtacag
Individual 4 acgtagcatcgtttgcgttagacgggggggtagcaccagtacag
Individual 5 acgtagcatcgtttgcgttagacgggggggtagcaccagtacag
Individual 6 acgtagcatcgtttgcgttagacggcatggcaccggcagtacag
Individual 7 acgtagcatcgtttgcgttagacggcatggcaccggcagtacag
Individual 8 acgtagcatcgtttgcgttagacggcatggcaccggcagtacag
Individual 9 acgtagcatcgtttgcgttagacggcatggcaccggcagtacag
one-dimensional: SNP to SNP
multi-dimensional: Haplotype
[Figure: SNPs per 10,000 bases along human chromosome 6, with the HLA region marked (Nature 409: 928-941, 2001)]
Functional genome regions
• Number of segregating sites per nucleotide (Watterson 1975): S/m
• Watterson θ estimator (1975): $\hat{\theta}_W = \dfrac{S/m}{\sum_{i=1}^{n-1} 1/i}$
• π, nucleotide diversity or expected nucleotide heterozygosity (Tajima 1983): average number of differences per site between pairs of randomly sampled sequences
$\pi = \dfrac{1}{\binom{n}{2}\, m} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} k_{ij}$
S = number of segregating sites
m = number of nucleotides analyzed
n = sample size (number of sequences)
$k_{ij}$ = number of differences between sequences i and j
Site frequency spectrum
Nucleotide variation in the Rhodopsin 3 gene of Drosophila simulans
Data summary: sample of n = 5 sequences; size m = 500 nucleotides
•Number of segregating sites per nucleotide: S/m = 16/500 = 0.0320
•Watterson θ estimator: θw = (16/500)/(1 + 1/2 + 1/3 + 1/4) = 0.0154
•Nucleotide diversity: π = 79/(500 × 10) = 0.0158

Site                   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Sequence 1             T  C  T  A  C  C  T  C  C  T  C  G  G  T  T  A
Sequence 2             T  C  C  T  A  C  C  T  C  C  T  G  G  T  T  T
Sequence 3             C  T  C  C  C  C  C  T  C  T  T  T  G  C  T  A
Sequence 4             C  T  C  C  C  C  C  T  T  C  T  G  A  C  T  T
Sequence 5             C  T  C  C  C  T  C  T  T  T  T  G  G  C  C  A
Pairwise differences   6  6  4  7  4  4  4  4  6  6  4  4  4  6  4  6
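The three summary statistics for the Rhodopsin 3 sample can be reproduced directly from the 16 variable sites. The sketch below is illustrative: the function names are hypothetical, only the segregating sites are stored (the remaining sites are monomorphic and contribute no differences), and the alignment length m = 500 is passed in separately.

```python
from itertools import combinations


def segregating_sites(seqs):
    """Number of alignment columns with more than one nucleotide state (S)."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def watterson_theta(seqs, m):
    """Watterson's estimator per site: (S/m) / sum_{i=1}^{n-1} 1/i."""
    n = len(seqs)
    a_n = sum(1 / i for i in range(1, n))
    return (segregating_sites(seqs) / m) / a_n


def nucleotide_diversity(seqs, m):
    """Average number of pairwise differences per site (pi)."""
    pairs = list(combinations(seqs, 2))
    total_differences = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return total_differences / (len(pairs) * m)


# The 16 variable sites of the five sequences shown on the slide.
rhodopsin = [
    "TCTACCTCCTCGGTTA",
    "TCCTACCTCCTGGTTT",
    "CTCCCCCTCTTTGCTA",
    "CTCCCCCTTCTGACTT",
    "CTCCCTCTTTTGGCCA",
]
m = 500  # number of nucleotides analyzed

S = segregating_sites(rhodopsin)
theta_w = watterson_theta(rhodopsin, m)
pi = nucleotide_diversity(rhodopsin, m)
print(S, S / m, round(theta_w, 4), round(pi, 4))
```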
Exercise 2: Given the data sets of DNA sequences in Figures 1 and 2 below: • Estimate the most common measures or summary statistics of nucleotide variation (S/m, θw, π and the site frequency spectrum)
Sequence 1 … A … G … C … G …
Sequence 2 … A … G … T … G …
Sequence 3 … A … A … T … T …
Sequence 4 … T … G … T … T …
Sequence 5 … T … G … T … T …
Figure 1
Figure 2
Exercise 1: Given the data set of DNA sequences in the figure below: • Estimate the most common summary statistics of nucleotide variation (S/m, θw, π and the site frequency spectrum)
Sequence 1 … A … G … C … G …
Sequence 2 … A … G … T … G …
Sequence 3 … A … A … T … T …
Sequence 4 … T … G … T … T …
Sequence 5 … T … G … T … T …
Responses
S/m = 4/m
θw = (4/m) / [1 + (1/2) + (1/3) + (1/4)] = (4/m) / 2.083 = 1.92/m
π = (6 + 4 + 4 + 6) / (m × 10) = 2/m
[Figure: site frequency spectrum. x axis: minor allele frequency (0.2, 0.4); y axis: number of SNPs (0 to 2.5)]
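The site frequency spectrum of the exercise can be tabulated with a short script. This is a sketch under stated assumptions: the function name is hypothetical, the spectrum is the folded one (based on the minor allele, since no outgroup is given), and sites are treated as biallelic.

```python
from collections import Counter


def folded_sfs(seqs):
    """Map each minor allele frequency to the number of SNPs observed at it."""
    n = len(seqs)
    spectrum = Counter()
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) > 1:  # keep segregating sites only
            spectrum[min(counts.values())] += 1
    return {count / n: snps for count, snps in sorted(spectrum.items())}


# The four variable sites of the five exercise sequences.
exercise = ["AGCG", "AGTG", "AATT", "TGTT", "TGTT"]
print(folded_sfs(exercise))
```

For these five sequences the result has two frequency classes, matching the two bars of the plot described above.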
Software for estimation of nucleotide variation
•DnaSP (DNA Sequence Polymorphism): a software package for the analysis of nucleotide polymorphism from aligned DNA sequence data.
•Variscan: a software package for the analysis of DNA sequence polymorphisms at the whole genome scale.
•MEGA (Molecular Evolutionary Genetics Analysis): a software package used for estimating rates of molecular evolution, generating phylogenetic trees and aligning DNA sequences. Available for Windows, Linux and Mac OS X (since ver. 5.x).
•Arlequin3: software for calculating nucleotide diversity and a variety of other statistical tests for intra-population and inter-population analyses. Available for Windows.
First estimates of nucleotide diversity in the Drosophila genus
[Table: columns Species; Number of genes; Non-coding region; Coding region (Total, Synonymous)]
First estimates of nucleotide diversity in the human genome
[Table: columns Chromosome; SNP number; Density; Nucleotide diversity (π). Row label shown: Autosomes 1-22]
Multidimensional structure of genetic variability
A G A G T T C T G C T C G
A G G G T T C T G C G C G
A G G G T T A T G C G C G
A G A G T T C T G C T C G
A G A G T T C T G C T C G
A G A G T T C T G C T C G
A G A G T T C T G C T C G
A G A G T T C T G C T C G
A G G G T T A T G C G C G
A G G G T T A T G C G C G
A G G G T T A T G C G C G
A G G G T T A T G C G C G
A G G G T T A T G C G C G
Linkage disequilibrium

          B1                 B2                 Total
A1        p11 = p1q1 + D     p12 = p1q2 - D     p1
A2        p21 = p2q1 - D     p22 = p2q2 + D     p2
Total     q1                 q2                 1

1. DAB = pAB - pApB (Lewontin & Kojima 1960)
2. D’ = D / Dmax (Lewontin 1964)
3. r²AB = D² / [pA(1-pA) qB(1-qB)] (Hill & Robertson 1968)
Linkage disequilibrium

          B1    B2    Total
A1        9     1     10
A2        2     4     6
Total     11    5     16

p̂1 = n1. / 2n = 10 / 16 = 0.625
q̂1 = n.1 / 2n = 11 / 16 = 0.6875
D = 9/16 - 0.625 × 0.6875 = 0.5625 - 0.4297 = 0.1328
χ² = 5.605 > 3.84 *
Linkage disequilibrium

          B1    B2    Total
A1        9     1     10
A2        2     4     6
Total     11    5     16

For D > 0, Dmax = min(p1q2, p2q1), with p1q2 = 10/16 × 5/16 = 0.1953 and p2q1 = 6/16 × 11/16 = 0.2578
D’ = 0.1328 / 0.1953 = 0.68
r²12 = D² / (p1 p2 q1 q2) = 0.3503
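The worked example can be reproduced from the four haplotype counts. The sketch below is illustrative and its function name is hypothetical. It reports the absolute value of D’, uses the other pair of products for Dmax when D is negative, and computes the chi-square as r² times the number of chromosomes sampled, which for a 2x2 table equals the usual Pearson statistic.

```python
def ld_measures(n11, n12, n21, n22):
    """D, |D'|, r^2 and chi-square from counts of haplotypes A1B1, A1B2, A2B1, A2B2."""
    total = n11 + n12 + n21 + n22  # number of chromosomes sampled (2n)
    p1 = (n11 + n12) / total       # frequency of allele A1
    q1 = (n11 + n21) / total       # frequency of allele B1
    p2, q2 = 1 - p1, 1 - q1

    d = n11 / total - p1 * q1
    # The largest |D| compatible with the allele frequencies depends on the sign of D.
    d_max = min(p1 * q2, p2 * q1) if d >= 0 else min(p1 * q1, p2 * q2)
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    r2 = d * d / (p1 * p2 * q1 * q2)
    chi2 = r2 * total
    return {"D": d, "D_prime": d_prime, "r2": r2, "chi2": chi2}


example = ld_measures(n11=9, n12=1, n21=2, n22=4)
print({name: round(value, 4) for name, value in example.items()})
```

With the counts of the table this gives D of about 0.133, D’ of 0.68 and r² of about 0.35; the chi-square exceeds 3.84, so the association between the two loci is significant at the 5% level.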
Recombination and linkage disequilibrium
DAB(t + 1) = (1 - c) DAB(t)
c = recombination rate
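Iterating the recursion gives a geometric decay, D(t) = (1 - c)^t D(0). The sketch below illustrates it; the function names, the starting value D(0) = 0.1328 (borrowed from the worked example) and the recombination rates are assumptions chosen only for illustration.

```python
import math


def ld_after(d0, c, generations):
    """Linkage disequilibrium after t generations: D(t) = (1 - c)**t * D(0)."""
    return d0 * (1 - c) ** generations


def generations_to_halve(c):
    """Generations needed for D to fall to half of its initial value."""
    return math.log(0.5) / math.log(1 - c)


d0 = 0.1328  # hypothetical starting disequilibrium
for c in (0.5, 0.1, 0.01):  # unlinked loci down to tightly linked loci
    print(c, round(ld_after(d0, c, 10), 6), round(generations_to_halve(c), 1))
```

Unlinked loci (c = 0.5) lose half of their disequilibrium every generation, whereas tightly linked loci keep it for many generations, which is why linkage blocks persist over short physical distances.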
Linkage disequilibrium: linkage blocks & tag SNPs
[Figure: linkage disequilibrium distribution along the human lipoprotein lipase gene (LPL): 66 SNPs across about 10 kb in 142 chromosomes]
The golden age of the study of genetic variation
The golden age of the study of genetic variation
Description and explanation of patterns of nucleotide variation at a large scale
• Patterns of polymorphism and divergence
• Neutral vs selected variation (adaptive evolution)
Explanation of genome complexity from first principles of population genetics (sensu M. Lynch)
The golden age of the study of genetic variation
Comparative and Functional Genomics
• Role of conserved and fast-evolving non-coding regions
Catalogue of human genetic variation and genotype -> phenotype association studies
• HapMap, Biobanks, GWAS
• Personalized Genomics
Exercises
Exercise 1: Go to the web page Standard & Generalized McDonald-Kreitman test (http://mkt.uab.cat). Test the examples and try to understand the difference between the standard and the generalized MKT. What is the parameter alpha whose value is given in the results of the test? How would you estimate it?
Exercise 2: Manually estimate the most common measures or summary statistics of nucleotide variation (S/m, θw, π and the site frequency spectrum) and the three measures of linkage disequilibrium (D, D' and r²) for the 8 aligned sequences given below. The analyzed sequence length, m, is 100. Only variable sites are shown.
Sequence 1 … A … G … C … G …
Sequence 2 … A … G … T … G …
Sequence 3 … A … A … T … G …
Sequence 4 … T … G … T … T …
Sequence 5 … T … G … T … T …
Sequence 6 … T … G … C … T …
Sequence 7 … T … G … T … T …
Sequence 8 … T … G … T … T …
Exercises
+ Adaptive fixation + weakly negative selection: $\pi_i / \pi_{neut}$ may be >, = or < $k_i / k_{neut}$
Exercise 3: Suppose that both adaptive selection and weakly deleterious (weakly negative) selection are acting on a DNA sequence. If you want to perform an MKT, search for a statistical approach that accounts for weakly negative selection when detecting adaptive selection.
Readings
• Population genomics in Drosophila: T.F.C. Mackay*, S. Richards*, E.A. Stone*, A. Barbadilla*, M. Barrón, D. Castellano, P. Librado, M. Ràmia, J. Rozas et al. 2012. The Drosophila melanogaster Genetic Reference Panel: A Community Resource for Analysis of Population Genomics and Quantitative Traits. Nature 482: 173-178.
• 1000 Genomes Project: The 1000 Genomes Project Consortium. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56-65.