![Page 1: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/1.jpg)
COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS
Eran Halperin
November 10, 2009
![Page 2: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/2.jpg)
Environmental Factors
Genetic Factors
Complex disease
Multiple genes may affect the disease.
Therefore, the effect of every single gene may be negligible.
![Page 3: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/3.jpg)
April 05’
The Human Chromosomes
![Page 4: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/4.jpg)
………ACCAGGACGA……
………ACCAGGACGA……
Each chromosome ‘is’ a sequence over the alphabet {A,G,C,T} (base pairs)
Copy from mother
Copy from father
![Page 5: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/5.jpg)
Facts about our genome
23 pairs of chromosomes.
X and Y are the sex chromosomes (XX for women, XY for men).
3,300,000,000 base pairs in the human genome.
![Page 6: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/6.jpg)
The Human Genome Project
“What we are announcing today is that we have reached a milestone…that is, covering the genome in…a working draft of the human sequence.”
“But our work previously has shown… that having one genetic code is important, but it's not all that useful.”
“I would be willing to make a prediction that within 10 years, we will have the potential of offering any of you the opportunity to find out what particular genetic conditions you may be at increased risk for…”
Washington, DC, June 26, 2000
![Page 7: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/7.jpg)
The Vision of Personalized Medicine
Genetic and epigenetic variants + measurable environmental/behavioral factors would be used for personalized treatment and diagnosis.
![Page 8: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/8.jpg)
Paradigm shifts in medicine
![Page 9: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/9.jpg)
Example: Warfarin
An anticoagulant drug, useful in the prevention of thrombosis.
![Page 10: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/10.jpg)
Example: Warfarin
Warfarin was originally used as rat poison.
The optimal dose varies across the population.
Genetic variants (in VKORC1 and CYP2C9) affect the personalized optimal dose.
![Page 11: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/11.jpg)
Association Studies
![Page 12: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/12.jpg)
Where should we look first?
person 1: ….AAGCTAAATTTG….
person 2: ….AAGCTAAGTTTG….
person 3: ….AAGCTAAGTTTG….
person 4: ….AAGCTAAATTTG….
person 5: ….AAGCTAAGTTTG….
SNP = Single Nucleotide Polymorphism
Each common SNP has only two possible letters (alleles).
![Page 13: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/13.jpg)
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCCAGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCCAGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTCAGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCCAGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCCAGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCCAGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTCAGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCCAGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCCAGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC
Disease Association Studies
Cases:
Controls:
Associated SNP (high Relative Risk)
Associated SNP (lower Relative Risk)
![Page 14: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/14.jpg)
Preliminary Definitions
SNP – single nucleotide polymorphism. A genetic variant which may carry a different ‘value’ for different individuals.
Allele – the variant’s value: A, G, C, or T.
Most SNPs are bi-allelic: there are only two observed alleles in the population.
Risk allele – the allele which is more common in cases than in controls (denoted R).
Nonrisk allele – the allele which is more common in the controls (denoted N).
![Page 15: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/15.jpg)
Relative Risk
Risk allele (R) = G: chances of developing type II diabetes are 30%.
Nonrisk allele (N) = A: chances of developing type II diabetes are 20%.
Relative Risk: Pr(D|R)/Pr(D|N) = 0.30/0.20 = 1.5
![Page 16: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/16.jpg)
Other Structural Variants
Inversion
Deletion
Copy number variant
![Page 17: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/17.jpg)
Published Genome-Wide Associations through 6/2009: 439 published GWA at p < 5 x 10^-8
NHGRI GWA Catalog
www.genome.gov/GWAStudies
![Page 18: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/18.jpg)
![Page 19: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/19.jpg)
Public Genotype Data Growth (timeline axis: 2001 to 2006)
Daly et al., Nature Genetics: 103 SNPs, 40,000 genotypes
Gabriel et al., Science: 3,000 SNPs, 400,000 genotypes
TSC Data, Nucleic Acids Research: 35,000 SNPs, 4,500,000 genotypes
Perlegen Data, Science: 1,570,000 SNPs, 100,000,000 genotypes
NCBI dbSNP, Genome Research: 3,000,000 SNPs, 286,000,000 genotypes
HapMap Phase 2: 5,000,000+ SNPs, 600,000,000+ genotypes
![Page 20: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/20.jpg)
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCCAGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCCAGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTCAGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCCAGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCCAGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCCAGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTCAGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCCAGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCCAGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC
Chance or Real Association?
Cases:
Controls:
Associated SNP (lower Relative Risk)
![Page 21: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/21.jpg)
How does it work?
For every SNP we can construct a contingency table:
Group      R    N    Total
Cases      a    b    N
Controls   c    d    N
![Page 22: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/22.jpg)
Hypothesis testing
Null hypothesis: Pr(R|case) = Pr(R|control)
Alternative hypothesis: Pr(R|case) ≠ Pr(R|control)
The model assumes that all individuals are independent (unrelated), and therefore our sample is a random sample from a Binomial distribution:
Cases sampled from distribution X ~ B(n, Pr(R|cases))
Controls sampled from distribution Y ~ B(n, Pr(R|controls))
![Page 23: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/23.jpg)
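Hypothesis testing, cont.
When n is large, B(n,p) ~ N(np, np(1-p)). Under the null hypothesis:
$$X - Y \sim N\big(0,\ 2np(1-p)\big) \quad\Rightarrow\quad Z = \frac{X - Y}{\sqrt{2np(1-p)}} \sim N(0,1)$$
Set $\hat{p} = \frac{X+Y}{2n}$. Then we get:
$$Z = \frac{\sqrt{2n}\,(X - Y)}{\sqrt{(X+Y)(2n - X - Y)}} \sim N\Big(0,\ 1 + \frac{1}{2n-1}\Big)$$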
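The statistic above can be sketched in a few lines of Python. This is only an illustration: the function name `allele_z_score` and the example counts are hypothetical, and n is assumed to be the same number of sampled alleles in cases and in controls.

```python
import math


def allele_z_score(x_cases, y_controls, n):
    """Z = sqrt(2n) (X - Y) / sqrt((X + Y)(2n - X - Y)).

    x_cases, y_controls: risk-allele counts in cases and in controls.
    n: number of sampled alleles per group (assumed equal for both groups).
    """
    pooled = x_cases + y_controls
    denominator = math.sqrt(pooled * (2 * n - pooled))
    if denominator == 0:
        raise ValueError("the SNP is monomorphic in the sample; Z is undefined")
    return math.sqrt(2 * n) * (x_cases - y_controls) / denominator


# Hypothetical counts: 600 risk alleles among 1000 case alleles,
# 500 risk alleles among 1000 control alleles.
print(allele_z_score(600, 500, 1000))
```

A positive Z means the allele counted in X is more frequent in cases; swapping the two groups flips the sign.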
![Page 24: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/24.jpg)
P-value
Z is called a test statistic (a z-score in this case).
We can calculate Z* for our data, and then calculate (using the normal approximation):
p-value = Pr(|Z| > |Z*|)
Often we take $T = Z^2$, which is distributed as $\chi^2_1$ (chi-square with one degree of freedom) under the null hypothesis.
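As a sketch of the p-value computation, the two-sided normal tail can be written with the complementary error function from Python's standard library. The function names are hypothetical; the second function uses the fact that for one degree of freedom Pr(T > t) = Pr(|Z| > sqrt(t)), so both routes give the same p-value.

```python
import math


def two_sided_p_value(z_star):
    """Pr(|Z| > |z_star|) for a standard normal Z."""
    return math.erfc(abs(z_star) / math.sqrt(2.0))


def chi_square_1df_p_value(t_star):
    """Pr(T > t_star) for T distributed chi-square with 1 degree of freedom."""
    if t_star < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return math.erfc(math.sqrt(t_star / 2.0))


z_star = 4.49  # hypothetical observed z-score
print(two_sided_p_value(z_star))
print(chi_square_1df_p_value(z_star ** 2))  # same value, computed from T = Z^2
```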
![Page 25: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/25.jpg)
Results: Manhattan Plots
![Page 26: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/26.jpg)
The curse of dimensionality – correcting for multiple testing
In a typical Genome-Wide Association Study (GWAS), we test millions of SNPs.
If we set the p-value threshold for each test to be 0.05, by chance we will “find” about 5% of the SNPs to be associated with the disease.
This needs to be corrected.
![Page 27: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/27.jpg)
Bonferroni Correction
If the number of tests is n, we set the threshold to be 0.05/n.
A very conservative test.
If the tests are independent then it is reasonable to use it.
If the tests are correlated this could be bad.
Example: if all SNPs are identical, then we lose a lot of power; the false positive rate is reduced, but so is the power.
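The arithmetic behind the correction can be sketched as follows. The number of SNPs (one million) is an assumed round figure for illustration, and the function names are hypothetical.

```python
def expected_false_positives(alpha, num_tests):
    """Expected number of null SNPs that pass an uncorrected threshold alpha."""
    return alpha * num_tests


def bonferroni_threshold(alpha, num_tests):
    """Per-test threshold alpha / n used by the Bonferroni correction."""
    if num_tests < 1:
        raise ValueError("need at least one test")
    return alpha / num_tests


num_snps = 1_000_000  # assumed GWAS size, for illustration only
print(expected_false_positives(0.05, num_snps))
print(bonferroni_threshold(0.05, num_snps))
```

With one million tests the corrected threshold is about 5 x 10^-8, the same order as the genome-wide threshold quoted for the NHGRI catalog.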
![Page 28: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/28.jpg)
Population Substructure
Challenge 1
![Page 29: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/29.jpg)
Population Substructure
Imagine that all the cases are collected from Africa, and all the controls are from Europe.
Many association signals are going to be found.
The vast majority of them are false.
Why?
Different evolutionary forces: drift, selection, mutation, migration, population bottleneck.
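To see why such signals are false, consider a SNP with no effect on the disease whose allele frequency simply differs between the two populations. The sketch below plugs the expected allele counts into the z-score of the association test; the frequencies 0.3 and 0.5 and the sample size are made-up values, and the function names are hypothetical.

```python
import math


def allele_z_score(x_cases, y_controls, n):
    """Z = sqrt(2n) (X - Y) / sqrt((X + Y)(2n - X - Y))."""
    pooled = x_cases + y_controls
    return math.sqrt(2 * n) * (x_cases - y_controls) / math.sqrt(pooled * (2 * n - pooled))


def stratified_z(freq_case_population, freq_control_population, n):
    """Z at the expected allele counts when cases and controls are drawn from
    populations with different allele frequencies (no disease effect at all)."""
    x = round(n * freq_case_population)
    y = round(n * freq_control_population)
    return allele_z_score(x, y, n)


# Made-up frequencies: 0.3 in the case population, 0.5 in the control population.
print(stratified_z(0.3, 0.5, 1000))
```

The resulting |Z| is far beyond any genome-wide threshold even though the SNP is unrelated to the disease; the signal reflects ancestry only.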
![Page 30: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/30.jpg)
Evolution Theory
Mutations add to genetic variation.
Natural selection controls the frequency of certain traits and alleles.
Genetic drift.
![Page 31: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/31.jpg)
Mutations
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGA
AGAGCAGTCCACAGGTATAGCCTACATGAGATCGACATGAGA
Estimated probability of a mutation at a given position in a single generation is about 10^-8.
![Page 32: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/32.jpg)
Other ‘mutations’ - recombination
Copy 1
Copy 2
child chromosome
Probability ri (~10^-8) for recombination in position i.
![Page 33: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/33.jpg)
Natural Selection
Example: being lactose tolerant is advantageous in northern Europe, hence there is positive selection on the LCT gene.
different allele frequencies in LCT
![Page 34: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/34.jpg)
Genetic Drift
Even without selection, the allele frequencies in the population are not fixed across time.
Consider the case where we assume Hardy-Weinberg Equilibrium (HWE), that is, individuals are mating randomly in the population.
Suppose that at the first generation the allele frequencies are p_0 (of a) and q_0 = 1 - p_0 (of A).
Under HWE, E[p_{k+1}] = p_k, but Var[p_{k+1}] > 0, so the next generation will typically have p_{k+1} ≠ p_k.
![Page 35: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/35.jpg)
The rate of the drift
N – effective population size (if all individuals are entirely unrelated then N is the total population size).
Under an assumption of constant population size, if X_k counts the number of occurrences of a at generation k, then X_{k+1} ~ B(N, p_k).
E[p_{k+1}] = E[X_{k+1}]/N = p_k.
Var[p_{k+1}] = p_k(1 - p_k)/N.
The effect of genetic drift depends on the time and the effective population size. A small population increases the effect.
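The binomial resampling described above can be simulated directly. The following is a minimal sketch using only the standard library; the function names, the seed, and the parameter values are arbitrary choices for illustration.

```python
import random


def wright_fisher(p0, pop_size, generations, seed=0):
    """Simulate allele frequencies p_0, p_1, ..., with X_{k+1} ~ B(N, p_k)."""
    rng = random.Random(seed)
    p = p0
    trajectory = [p]
    for _generation in range(generations):
        # Each of the N allele copies in the next generation is 'a' with probability p_k.
        count = sum(1 for _copy in range(pop_size) if rng.random() < p)
        p = count / pop_size
        trajectory.append(p)
    return trajectory


def drift_variance(p, pop_size):
    """Var[p_{k+1}] = p_k (1 - p_k) / N."""
    return p * (1 - p) / pop_size


# Smaller populations drift faster: compare the one-generation variance.
print(drift_variance(0.5, 100), drift_variance(0.5, 10000))
print(wright_fisher(0.5, 100, 10, seed=1))
```

Once the frequency reaches 0 or 1 it stays there, since the binomial draw can no longer produce the lost allele.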
![Page 36: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/36.jpg)
Bottleneck effect
Figure axes: effective population size against time.
When the effective population size is small (the bottleneck), the rate of genetic drift is higher.
![Page 37: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/37.jpg)
Generation 1: allele frequency 1/9
The Wright-Fisher Model
![Page 38: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/38.jpg)
Generation 2: allele frequency 1/9
The Wright-Fisher Model
![Page 39: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/39.jpg)
Generation 3: allele frequency 1/9
The Wright-Fisher Model
![Page 40: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/40.jpg)
Generation 4: allele frequency 1/3
The Wright-Fisher Model
![Page 41: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/41.jpg)
The Wright-Fisher Model
![Page 42: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/42.jpg)
The Wright-Fisher Model
![Page 43: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/43.jpg)
Ancestral population
![Page 44: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/44.jpg)
Ancestral population
migration
![Page 45: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/45.jpg)
Ancestral population
Genetic drift
different allele frequencies
![Page 46: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/46.jpg)
Population Substructure
Imagine that all the cases are collected from Africa, and all the controls are from Europe.
Many association signals are going to be found.
The vast majority of them are false.
What can we do about it?
![Page 47: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/47.jpg)
Jakobsson et al., Nature 451: 998-1003 (2008)
![Page 48: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/48.jpg)
Principal Component Analysis
Dimensionality reduction.
Based on linear algebra (Singular Value Decomposition).
Intuition: find the ‘most important’ features of the data by projecting the data on the axis with the largest variance.
![Page 49: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/49.jpg)
Principal Component Analysis
Plotting the data on a one-dimensional line for which the spread is maximized.
![Page 50: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/50.jpg)
Principal Component Analysis
In our case, we want to look at two dimensions at a time.
The original data has many dimensions: each SNP corresponds to one dimension.
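To make the projection idea concrete, here is a small pure-Python sketch that finds the first principal axis of a genotype matrix (one row per individual, one column per SNP). It uses power iteration on X^T X rather than a full Singular Value Decomposition, which is a substitution made to keep the example dependency-free; it returns only the first axis, and the function name and toy data are hypothetical.

```python
import math


def first_principal_component(genotypes, iterations=200):
    """Return (axis, scores) for the direction of largest variance.

    genotypes: list of rows, one per individual, one numeric value per SNP.
    """
    n, m = len(genotypes), len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]

    def project(axis):
        return [sum(x * a for x, a in zip(row, axis)) for row in centered]

    # Deterministic, non-uniform starting direction (an arbitrary choice).
    axis = [float(j + 1) for j in range(m)]
    for _ in range(iterations):
        scores = project(axis)
        # Multiply by X^T X; its top eigenvector is the first principal axis.
        axis = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(a * a for a in axis))
        if norm == 0:
            raise ValueError("no variance along the starting direction")
        axis = [a / norm for a in axis]
    return axis, project(axis)


# Toy data: two groups of three individuals with different alleles at four SNPs.
toy = [[0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 2, 1],
       [2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 0, 1]]
axis, scores = first_principal_component(toy)
print(scores)  # the two groups fall on opposite sides of zero
```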
![Page 51: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/51.jpg)
Ancestry Inference
To what extent can population structure be detected from SNP data?
What can we learn from these inferences?
Can we build the tree of life?
How do we analyze complex (mixed) populations?
Novembre et al., Nature, 2008
![Page 52: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/52.jpg)
Modeling Correlation
Challenge 2
![Page 53: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/53.jpg)
A typical associated region
![Page 54: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/54.jpg)
Linkage Disequilibrium
![Page 55: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/55.jpg)
Haplotype Data in a Block
(Daly et al., 2001) Block 6 from Chromosome 5q31
![Page 56: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/56.jpg)
Phasing - haplotype inference
Cost-effective genotyping technology gives genotypes and not haplotypes.
Haplotypes (mother chromosome, father chromosome) and the resulting genotype:
Genotype: A, {G,T}, {A,C}, C, G, {A,C}
Possible phases:
ATCCGA AGACGC
ATACGA AGCCGC
AGACGA ATCCGC
….
![Page 57: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/57.jpg)
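Inferring Haplotypes From Trios
Assumption: No recombination
Genotypes (0 and 1 are homozygous, 2 is heterozygous):
Parent 1: 122112
Parent 2: 210022
Child: 120222
Haplotype pairs at successive stages of the inference (? = unresolved):
Parent 1: 1??11? 1??11? | Parent 2: ?100?? ?100?? | Child: 1?0??? 1?0???
Parent 1: 1??11? 1??11? | Parent 2: 1100?? 0100?? | Child: 1?0??? 1?0???
Parent 1: 10?11? 11?11? | Parent 2: 1100?? 0100?? | Child: 100??? 110???
Parent 1: 10011? 11111? | Parent 2: 11000? 01001? | Child: 10011? 11000?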
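The per-site logic for the child can be sketched as below. The function name is hypothetical, genotypes are assumed to be strings over 0, 1, 2 as on the slide, and the sketch assumes the trio is Mendelian-consistent (it does not check for genotyping errors).

```python
def phase_child_from_trio(parent1, parent2, child):
    """Resolve the child's two haplotypes site by site, assuming no recombination
    matters per site and no genotyping errors.

    Returns (haplotype inherited from parent 1, haplotype inherited from parent 2),
    with '?' where the site cannot be resolved.
    """
    flip = {"0": "1", "1": "0"}
    from_p1, from_p2 = [], []
    for g1, g2, gc in zip(parent1, parent2, child):
        if gc in ("0", "1"):        # homozygous child: both alleles are known
            a1 = a2 = gc
        elif g1 in ("0", "1"):      # parent 1 homozygous: it must transmit g1
            a1, a2 = g1, flip[g1]
        elif g2 in ("0", "1"):      # parent 2 homozygous: it must transmit g2
            a1, a2 = flip[g2], g2
        else:                       # all three heterozygous: ambiguous
            a1 = a2 = "?"
        from_p1.append(a1)
        from_p2.append(a2)
    return "".join(from_p1), "".join(from_p2)


# The trio from the slide.
print(phase_child_from_trio("122112", "210022", "120222"))
```

On the slide's trio this yields 10011? and 11000? for the child, leaving only the last site (heterozygous in all three individuals) unresolved.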
![Page 58: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/58.jpg)
Maximum Likelihood
Until now we discussed the case of two hypotheses (null, and alternative).
In some cases we are interested in many hypotheses and we search for the best one.
Normally a hypothesis will be defined by a set of parameters θ.
The likelihood of θ is $L(\theta; D) = \Pr(D \mid \theta)$. We are interested in the hypothesis that maximizes the likelihood.
![Page 59: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/59.jpg)
Soft assignment
Compute probabilities P={p_h} for all possible haplotypes.
For each genotype g, we do not assign one pair of haplotypes, but a distribution of possible pairs.
The set of pairs of haplotypes compatible with g is denoted as C(g).
In soft assignment, a pair $(h,h') \in C(g)$ is explaining g with probability
$$\frac{p_h\, p_{h'}}{\sum_{(h_1,h_2)\in C(g)} p_{h_1} p_{h_2}}$$
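A minimal sketch of the soft assignment for one genotype follows. The function name is hypothetical; the example pairs and frequencies are the ones shown for genotype 1 0 h h 1 in the first table of the iterative example later in the deck, and exact fractions are used so the result is easy to read.

```python
from fractions import Fraction


def soft_assignment(pairs, freqs):
    """Probability of each compatible pair (h, h') given haplotype frequencies:
    p_h * p_h' divided by the sum of that product over all pairs in C(g)."""
    weights = [freqs[h] * freqs[h2] for h, h2 in pairs]
    total = sum(weights)
    if total == 0:
        raise ValueError("no compatible pair has positive probability")
    return {pair: w / total for pair, w in zip(pairs, weights)}


# C(g) for genotype 1 0 h h 1, with frequencies 1/12, 2/12, 3/12, 1/12.
pairs = [("10001", "10111"), ("10011", "10101")]
freqs = {"10001": Fraction(1, 12), "10111": Fraction(2, 12),
         "10011": Fraction(3, 12), "10101": Fraction(1, 12)}
print(soft_assignment(pairs, freqs))
```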
![Page 60: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/60.jpg)
Phasing via Maximum Likelihood
Soft decision:
$$\log L(P=\{p_h\}; D) = \sum_{g\in G} \log\Big(\sum_{(h,h')\in C(g)} p_h\, p_{h'}\Big)$$
Hard decision:
$$\log L(Z, P=\{p_h\}; D) = \sum_h n_h \log(p_h)$$
![Page 61: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/61.jpg)
An iterative algorithm
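Data (h marks a heterozygous site) and the compatible haplotypes; each of the four haplotypes of a genotype starts with weight ¼:
1 0 h h 1: 1 0 0 0 1, 1 0 1 1 1, 1 0 0 1 1, 1 0 1 0 1
h 0 0 1 h: 0 0 0 1 0, 1 0 0 1 1, 0 0 0 1 1, 1 0 0 1 0
1 h h 1 1: 1 0 0 1 1, 1 1 1 1 1, 1 0 1 1 1, 1 1 0 1 1
Haplotype frequencies:
0 0 0 1 0: 1/12
0 0 0 1 1: 1/12
1 0 0 0 1: 1/12
1 0 0 1 0: 1/12
1 0 0 1 1: 3/12
1 0 1 0 1: 1/12
1 0 1 1 1: 2/12
1 1 0 1 1: 1/12
1 1 1 1 1: 1/12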
![Page 62: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/62.jpg)
An iterative algorithm
Data and compatible haplotype pairs, with the probability of each pair under the previous frequencies:
1 0 h h 1: (1 0 0 0 1, 1 0 1 1 1) 0.4; (1 0 0 1 1, 1 0 1 0 1) 0.6
h 0 0 1 h: (0 0 0 1 0, 1 0 0 1 1) 0.75; (0 0 0 1 1, 1 0 0 1 0) 0.25
1 h h 1 1: (1 0 0 1 1, 1 1 1 1 1) 0.6; (1 0 1 1 1, 1 1 0 1 1) 0.4
Haplotype frequencies (previous, updated):
0 0 0 1 0: 1/12, .125
0 0 0 1 1: 1/12, .042
1 0 0 0 1: 1/12, .067
1 0 0 1 0: 1/12, .042
1 0 0 1 1: 3/12, .325
1 0 1 0 1: 1/12, .1
1 0 1 1 1: 2/12, .133
1 1 0 1 1: 1/12, .067
1 1 1 1 1: 1/12, .1
![Page 63: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/63.jpg)
An iterative algorithm
Data and compatible haplotype pairs, with the probability of each pair at convergence:
1 0 h h 1: (1 0 0 0 1, 1 0 1 1 1) 0; (1 0 0 1 1, 1 0 1 0 1) 1
h 0 0 1 h: (0 0 0 1 0, 1 0 0 1 1) 1; (0 0 0 1 1, 1 0 0 1 0) 0
1 h h 1 1: (1 0 0 1 1, 1 1 1 1 1) 1; (1 0 1 1 1, 1 1 0 1 1) 0
Haplotype frequencies:
0 0 0 1 0: 1/6
0 0 0 1 1: 0
1 0 0 0 1: 0
1 0 0 1 0: 0
1 0 0 1 1: 1/2
1 0 1 0 1: 1/6
1 0 1 1 1: 0
1 1 0 1 1: 0
1 1 1 1 1: 1/6
![Page 64: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/64.jpg)
Expectation Maximization (EM)
D – given data
θ – parameters that need to be estimated
Z – latent (missing) variables
1. E-step: Compute $Q(\theta \mid \theta^n) = E_{Z \mid D, \theta^n}\big[\log \Pr(D, Z \mid \theta)\big]$
2. M-step: Find a $\theta^{n+1}$ which maximizes $Q(\theta \mid \theta^n)$
![Page 65: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/65.jpg)
EM rationale
Lemma: $\Pr(D \mid \theta^{n+1}) \ge \Pr(D \mid \theta^n)$.
Proof: First, note that
$$Q(\theta \mid \theta^n) = \sum_z \Pr(z \mid D, \theta^n)\,\log \Pr(z, D \mid \theta)$$
![Page 66: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/66.jpg)
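By Jensen's inequality (the logarithm is concave):
$$
\begin{aligned}
\log \Pr(D \mid \theta) &= \log \sum_z \Pr(D \mid z, \theta)\Pr(z \mid \theta) \\
&= \log \sum_z \Pr(z \mid D, \theta^n)\,\frac{\Pr(D \mid z, \theta)\Pr(z \mid \theta)}{\Pr(z \mid D, \theta^n)} \\
&\ge \sum_z \Pr(z \mid D, \theta^n)\,\log \frac{\Pr(D \mid z, \theta)\Pr(z \mid \theta)}{\Pr(z \mid D, \theta^n)} \\
&= \sum_z \Pr(z \mid D, \theta^n)\,\log \Pr(D, z \mid \theta) - \sum_z \Pr(z \mid D, \theta^n)\,\log \Pr(z \mid D, \theta^n) \\
&= Q(\theta \mid \theta^n) - R(\theta^n)
\end{aligned}
$$
where $R(\theta^n) := \sum_z \Pr(z \mid D, \theta^n)\,\log \Pr(z \mid D, \theta^n)$.
Define $f(\theta \mid \theta^n) := Q(\theta \mid \theta^n) - R(\theta^n)$. At $\theta = \theta^n$ the inequality is tight:
$$f(\theta^n \mid \theta^n) = Q(\theta^n \mid \theta^n) - R(\theta^n) = \log \Pr(D \mid \theta^n)$$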
![Page 67: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/67.jpg)
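Recall that
$$\log \Pr(D \mid \theta) \ge Q(\theta \mid \theta^n) - R(\theta^n), \qquad f(\theta \mid \theta^n) := Q(\theta \mid \theta^n) - R(\theta^n), \qquad f(\theta^n \mid \theta^n) = \log \Pr(D \mid \theta^n)$$
Therefore
$$\log \Pr(D \mid \theta^{n+1}) \ge f(\theta^{n+1} \mid \theta^n) = Q(\theta^{n+1} \mid \theta^n) - R(\theta^n) \ge Q(\theta^n \mid \theta^n) - R(\theta^n) = f(\theta^n \mid \theta^n) = \log \Pr(D \mid \theta^n)$$
QED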
![Page 68: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/68.jpg)
MLE from Incomplete Data
Finding MLE parameters: nonlinear optimization problem.
Expectation Maximization (EM): use the “current point” to construct an alternative function (which is “nice”).
Figure labels: log P(x|θ) and E_θ’[log P(x,y|θ)]
![Page 69: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/69.jpg)
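MLE from Incomplete Data
Figure labels: log P(x|θ), that is $\log L(\theta; D)$, and E_θ’[log P(x,y|θ)], that is $Q(\theta \mid \theta^0) = E_{Z \mid \theta^0}\big[\log L(\theta; D, Z)\big]$; the current point is $\theta^0$.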
![Page 70: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/70.jpg)
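EM for phasing
$$\log L(\theta=\{p_h\}; D) = \sum_{g\in G} \log\Big(\sum_{(h,h')\in C(g)} p_h\, p_{h'}\Big)$$
$$
\begin{aligned}
Q(\theta \mid \theta^n) &= E_{Z \mid \theta^n}\big[\log L(Z, \theta=\{p_h\}; D)\big] = E_{Z \mid \theta^n}\Big[\sum_h n_h(Z)\log(p_h)\Big] \\
&= E_{Z \mid \theta^n}\Big[\sum_{g\in G}\sum_{(h_1,h_2)\in C(g)} I_{Z(g)=(h_1,h_2)}\big(\log(p_{h_1}) + \log(p_{h_2})\big)\Big] \\
&= \sum_{g}\sum_{(h_1,h_2)\in C(g)} \frac{p^n_{h_1} p^n_{h_2}}{\sum_{(h,h')\in C(g)} p^n_h\, p^n_{h'}}\big(\log(p_{h_1}) + \log(p_{h_2})\big)
\end{aligned}
$$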
![Page 71: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/71.jpg)
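$$Q(\theta \mid \theta^n) = \sum_{g}\sum_{(h_1,h_2)\in C(g)} \frac{p^n_{h_1} p^n_{h_2}}{\sum_{(h,h')\in C(g)} p^n_h\, p^n_{h'}}\big(\log(p_{h_1}) + \log(p_{h_2})\big)$$
This is maximized for:
$$p_h \propto \sum_{g}\sum_{(h,h')\in C(g)} \frac{p^n_h\, p^n_{h'}}{\sum_{(h_1,h_2)\in C(g)} p^n_{h_1} p^n_{h_2}}$$
where the inner sum runs over the pairs in $C(g)$ that contain $h$, and the $p_h$ are normalized to sum to 1.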
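The update above can be sketched as a short EM loop. This is an illustrative sketch rather than a reference implementation: the function names are hypothetical, genotypes are assumed to be strings over 0, 1 and h (h = heterozygous) as in the iterative example, compatible pairs are enumerated by brute force (exponential in the number of heterozygous sites), and the starting frequencies are assumed uniform.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """Enumerate C(g): unordered haplotype pairs consistent with the genotype."""
    het_sites = [i for i, s in enumerate(genotype) if s == "h"]
    if not het_sites:
        return [(genotype, genotype)]
    pairs = []
    # Fix the first heterozygous site of the first haplotype to '0' so that
    # each unordered pair is listed once.
    for rest in product("01", repeat=len(het_sites) - 1):
        first, second = list(genotype), list(genotype)
        for site, allele in zip(het_sites, ("0",) + rest):
            first[site] = allele
            second[site] = "1" if allele == "0" else "0"
        pairs.append(("".join(first), "".join(second)))
    return pairs


def em_step(genotypes, freqs):
    """One E-step plus M-step: soft-assign pairs, then re-estimate frequencies."""
    expected_counts = defaultdict(float)
    for g in genotypes:
        pairs = compatible_pairs(g)
        weights = [freqs[a] * freqs[b] for a, b in pairs]
        total = sum(weights)  # assumed positive for every genotype
        for (a, b), w in zip(pairs, weights):
            expected_counts[a] += w / total
            expected_counts[b] += w / total
    n_haplotypes = 2 * len(genotypes)
    return {h: c / n_haplotypes for h, c in expected_counts.items()}


def em_phasing(genotypes, iterations=50):
    """Run EM from uniform frequencies over all compatible haplotypes."""
    haplotypes = sorted({h for g in genotypes for pair in compatible_pairs(g) for h in pair})
    freqs = {h: 1.0 / len(haplotypes) for h in haplotypes}
    for _ in range(iterations):
        freqs = em_step(genotypes, freqs)
    return freqs


genotypes = ["10hh1", "h001h", "1hh11"]  # the data of the iterative example
for hap, p in sorted(em_phasing(genotypes).items()):
    print(hap, round(p, 3))
```

Starting from uniform frequencies, the first step gives the 1/12 table of the iterative example, and the frequencies then move toward 1/2 for 10011 and 1/6 for 00010, 10101 and 11111.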
![Page 72: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/72.jpg)
Phasing summary
Expectation maximization is easy to implement and works reasonably well in practice.
We can use other models (tree models) to improve the accuracy of the phasing prediction.
![Page 73: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/73.jpg)
Human Genetics – where to?
We can typically explain 5%-15% of the heritability of common diseases.
Where is the missing heritability?
Rare variants
Gene-gene interactions
Gene-environment interactions
Creative computational methods are key to the discovery of the missing heritability.
![Page 74: COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS Eran Halperin November 10, 2009 1.](https://reader035.fdocuments.us/reader035/viewer/2022062517/56649f155503460f94c2a0e7/html5/thumbnails/74.jpg)
Course: Computational Human Genetics
Semester Bet (second semester)
More background in human genetics, statistics, and machine learning.
Studying genetics of human disease
Privacy and forensics
Analysis of new technologies (sequencing)
Population genetics – detecting selection, mutation rate, recombination rates, etc.
Reconstructing human history