Computational and Statistical Challenges in Association Studies
Eleazar Eskin
University of California, Los Angeles
The Human Genome Project
“What we are announcing today is that we have reached a milestone… that is, covering the genome in… a working draft of the human sequence.”
“I would be willing to make a prediction that within 10 years, we will have the potential of offering any of you the opportunity to find out what particular genetic conditions you may be at increased risk for…”
Washington, DC, June 26, 2000.
Human Genetics
Mother Father
Child
Disease Risk
• “Genetic” factors account for 20%-80% of disease risk.
• Many genes contribute to “complex” diseases.
Personalized Medicine
• Treatment decisions influenced by diagnostics.
Understanding Disease Biology
• New drug targets.
• Understanding of mechanism of disease.
Mother
Child
Risk Factors
Risk Factors
Where are the risk factors? (Genetic Basis of Disease)
Disease Association Studies: The search for genetic factors
Comparing the DNA contents of two populations:
• Cases - individuals carrying the disease.
• Controls - background population.
A difference within a gene between the two populations is evidence that the gene is involved in the disease.
Single Nucleotide Polymorphisms (SNPs)
AGAGCCGTCGACAGGTATAGCCTA
AGAGCCGTCGACATGTATAGTCTA
AGAGCAGTCGACAGGTATAGTCTA
AGAGCAGTCGACAGGTATAGCCTA
AGAGCCGTCGACATGTATAGCCTA
AGAGCAGTCGACATGTATAGCCTA
AGAGCCGTCGACAGGTATAGCCTA
AGAGCCGTCGACAGGTATAGCCTA
Human Variation
• Humans differ by 0.1% of their DNA.
• A significant fraction of this variation is accounted for by SNPs.
Single Nucleotide Polymorphisms: Association Analysis
[Figure: aligned DNA sequences for the two groups, with the associated SNP marked.]
Cases: (Individuals with the disease)
Controls: (Healthy individuals)
Associated SNP
Single Nucleotide Polymorphisms: Association Analysis
[Figure: aligned DNA sequences for cases and for controls.]
Challenges:
• Millions of common SNPs
• Correlations between SNPs
• SNP locations unknown
False Positives
The HapMap Project
• Successor to the Human Genome Project.
• International consortium that aims to genotype the genomes of 270 individuals from four different populations.
• Launched in 2002. First phase was finished in October (Nature, 2005).
• Collected genotypes for 3.9 million SNPs.
• Location and correlation structure of many common SNPs.
Public Genotype Data Growth (timeline, 2001–2006)
• Daly et al., Nature Genetics: 103 SNPs, 40,000 genotypes
• Gabriel et al., Science: 3,000 SNPs, 400,000 genotypes
• TSC Data, Nucleic Acids Research: 35,000 SNPs, 4,500,000 genotypes
• Perlegen Data, Science: 1,570,000 SNPs, 100,000,000 genotypes
• NCBI dbSNP, Genome Research: 3,000,000 SNPs, 286,000,000 genotypes
• HapMap Phase 2: 5,000,000+ SNPs, 600,000,000+ genotypes
More SNPs increase genome coverage in association studies.
More genotypes allow for discovery of weaker associations.
Some Computational Challenges
Genetics - identifying disease genes
• Haplotype phasing - preprocessing SNPs
• Association study design
• Association study analysis
• Population stratification
• Inferring evolutionary processes (recombination rates, selection, haplotype ancestry)
• Etc…
Genomics - functions of disease genes
• Predicting functional effect of variation
• Understanding disease effect on gene regulation
• Understanding disease effect on metabolic pathways
• Combining systems biology with genetics
• Etc…
HAP, WHAP, SAT Tagger
Haplotype Phasing using Imperfect Phylogeny
Haplotype Phasing
High-throughput, cost-effective sequencing technology gives genotypes and not haplotypes.
Haplotypes (mother chromosome / father chromosome):
ATCCGA
AGACGC
Genotype: A {T,G} {C,A} C G {A,C}
Possible phases:
ATACGA / AGCCGC
AGACGA / ATCCGC
….
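To make the ambiguity concrete, the sketch below enumerates every haplotype pair that is consistent with one genotype. The function name `possible_phases` and the set-per-site encoding are assumptions made for this illustration; the example genotype is the one implied by the two haplotypes shown on the slide (ATCCGA and AGACGC). It is not part of HAP.

```python
from itertools import product

def possible_phases(genotype):
    """Enumerate unordered haplotype pairs consistent with a genotype.

    `genotype` is a list of sites; each site is a set holding one allele
    (homozygous) or two alleles (heterozygous).
    """
    het_sites = [i for i, site in enumerate(genotype) if len(site) == 2]
    phases = set()
    # Each heterozygous site can send either allele to the first haplotype.
    for choice in product([0, 1], repeat=len(het_sites)):
        pick = dict(zip(het_sites, choice))
        h1, h2 = [], []
        for i, site in enumerate(genotype):
            alleles = sorted(site)
            if len(alleles) == 1:
                h1.append(alleles[0])
                h2.append(alleles[0])
            else:
                h1.append(alleles[pick[i]])
                h2.append(alleles[1 - pick[i]])
        # frozenset: the pair (h1, h2) is the same phase as (h2, h1).
        phases.add(frozenset(("".join(h1), "".join(h2))))
    return phases

genotype = [{"A"}, {"T", "G"}, {"C", "A"}, {"C"}, {"G"}, {"A", "C"}]
phases = possible_phases(genotype)
for pair in sorted(sorted(p) for p in phases):
    print(" / ".join(pair))
```

With k heterozygous sites there are 2^(k-1) distinct phases, so exhaustive enumeration is only feasible for very short regions.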
Haplotype Limited Diversity
Previous studies on local haplotype structure:
• (Daly et al., 2001) chromosome 5q31.
• (Patil et al., 2001) chromosome 21.
Study findings:
• The SNPs on each haplotype are correlated.
• SNPs can be separated into blocks of limited diversity.
• Local regions have few haplotypes.
Haplotype Data in a Block
(Daly et al., 2001) Block 6 from Chromosome 5q31
Example Phasing
Genotype coding: 2 = heterozygous (10), 1 = homozygous (11), 0 = homozygous (00).

Genotype   1st possible resolution       2nd possible resolution
22222222   11111110 (2)  00000001 (7)    11100110 (1)  00011001 (1)
22000001   11000001 (3)  00000001 (7)    10000001 (2)  01000001 (2)
22022002   11011000 (2)  00000001 (7)    01011001 (1)  10000000 (1)
22222222   11111110 (2)  00000001 (7)    10101110 (1)  01010001 (1)
22000001   11000001 (3)  00000001 (7)    11000001 (1)  00000001 (1)
22022002   11011000 (2)  00000001 (7)    11001000 (1)  00010001 (1)
22000001   11000001 (3)  00000001 (7)    01000001 (2)  10000001 (2)

(The number in parentheses is how many times that haplotype is used in the resolution.)
Which resolution? Maximum Likelihood Criterion.
Maximum likelihood haplotype inference is an NP-hard problem.
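The two resolutions of the example can be compared numerically. The sketch below is one simple reading of a maximum likelihood criterion, under assumptions that are not spelled out on the slides: haplotype frequencies are estimated from the resolution itself, chromosomes are treated as independent draws, and constant factors shared by both resolutions are ignored. The names `compatible` and `log_likelihood` are illustrative.

```python
from collections import Counter
from math import log

def compatible(pair, genotype):
    """Check that a haplotype pair explains a 0/1/2-coded genotype."""
    h1, h2 = pair
    for a, b, g in zip(h1, h2, genotype):
        if g == "2":
            if a == b:          # heterozygous site needs two different alleles
                return False
        elif not (a == b == g):  # homozygous site: both alleles equal the code
            return False
    return True

def log_likelihood(resolution):
    """Sum of log empirical frequencies over all haplotype copies."""
    counts = Counter(h for pair in resolution for h in pair)
    total = sum(counts.values())
    return sum(c * log(c / total) for c in counts.values())

genotypes = ["22222222", "22000001", "22022002", "22222222",
             "22000001", "22022002", "22000001"]
first = [("11111110", "00000001"), ("11000001", "00000001"),
         ("11011000", "00000001"), ("11111110", "00000001"),
         ("11000001", "00000001"), ("11011000", "00000001"),
         ("11000001", "00000001")]
second = [("11100110", "00011001"), ("10000001", "01000001"),
          ("01011001", "10000000"), ("10101110", "01010001"),
          ("11000001", "00000001"), ("11001000", "00010001"),
          ("01000001", "10000001")]

for name, res in (("first", first), ("second", second)):
    distinct = len({h for pair in res for h in pair})
    print(name, "distinct haplotypes:", distinct,
          "log-likelihood:", round(log_likelihood(res), 2))
```

Under these assumptions the first resolution, which reuses a few haplotypes many times, scores higher than the second, which needs many distinct haplotypes.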
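Narrowing the Search: Perfect Phylogeny
• A directed phylogenetic tree.
• {0,1} alphabet.
• Each site mutates at most once.
• No recombination.
[Figure: tree rooted at 00000 with one mutation per edge: 00000 →(site 2) 01000; 01000 →(site 1) 11000; 01000 →(site 5) 01001; 11000 →(site 3) 11100; 11100 →(site 4) 11110.]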
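Whether a set of binary haplotypes fits this model can be checked with a standard pairwise test: with an all-zero root, a perfect phylogeny exists exactly when, for every pair of sites, the sets of haplotypes carrying the mutated allele are either disjoint or nested. The sketch below applies that test to the haplotypes in the tree above; the function name `is_perfect_phylogeny` is an assumption for this example, and this is a check on haplotypes, not the PPH algorithm for genotypes.

```python
from itertools import combinations

def is_perfect_phylogeny(haplotypes):
    """True if binary haplotypes fit a perfect phylogeny rooted at all zeros."""
    n_sites = len(haplotypes[0])
    # For each site, the set of haplotypes (by index) carrying allele 1.
    carriers = [{i for i, h in enumerate(haplotypes) if h[j] == "1"}
                for j in range(n_sites)]
    for a, b in combinations(carriers, 2):
        overlapping = bool(a & b)
        nested = a <= b or b <= a
        if overlapping and not nested:
            return False
    return True

tree_haplotypes = ["00000", "01000", "11000", "01001", "11100", "11110"]
print(is_perfect_phylogeny(tree_haplotypes))
print(is_perfect_phylogeny(["10", "01", "11"]))
```

The second call fails the test because the two sites overlap on haplotype 11 without either carrier set containing the other, which would require a recurrent mutation or a recombination.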
The Perfect Phylogeny Haplotype Problem (PPH)
• Given genotypes over a short region, find compatible haplotypes which correspond to a perfect phylogeny tree [Gusfield '02].
• PPH deficiency: the data does not fit the model.
Solving PPH
• A very simple O(nm²) algorithm for the PPH problem. (Also Gusfield '02; Bafna et al., 2003.)
• But in practice, we do not expect to see perfect phylogeny in biological data.
• We extend our algorithms to the case where the data is almost perfect phylogeny.
Eskin, Halperin, Karp. "Large Scale Reconstruction of Haplotypes from Genotype Data." RECOMB 2003.
HAP Algorithm
• HAP Local Predictions: http://research.calit2.net/hap/
• Over 6,000 users of webserver.
• Main ideas: imperfect phylogeny, maximum likelihood criterion.
• Extremely efficient. Orders of magnitude faster than other algorithms.
Public Genotype Data Growth: HAP Timeline
(Datasets as in the timeline above, 2001–2006.)
• Eskin, Halperin, Karp, RECOMB 2003
Phasing Methods
HAP is one of many phasing algorithms:
• Clark, 1990
• Excoffier and Slatkin, 1995
• PHASE (Stephens et al., 2001)
• HAPLOTYPER (Niu et al., 2002)
• Gusfield, 2000
• Lancia et al., 2001
• Many more…
How do we phase entire chromosomes?
• Algorithms were designed for only 4-12 SNPs!
• HAP “tiling” extension: phasing for long regions.
• Leverages the speed of HAP.
• For each window we compute the haplotypes using HAP.
• We tile the windows using dynamic programming.
Scaling to Whole Genomes: HAP-TILE
[Figure: genotypes split into windows, each with local predictions.]
Haplotype Tiling Problem (ignoring homozygous positions)
0010000011011011111001
001000 110111 010000 101111 011111 100000 000101 111010 000011 111100 100110 011001
(minimum number of conflicts)
• NP-Hard Problem
• Dynamic Programming Solution
(Eskin et al. 2004.)
Phasing Running Time Comparison (Phaseoff Competition)
Marchini et al. American Journal of Human Genetics, 2006.
HAP is over 1000x faster than PHASE.
Public Genotype Data Growth: HAP Timeline
(Datasets as in the timeline above, 2001–2006.)
• RECOMB 2003 submission (Eskin, Halperin, Karp)
• Only 103 SNPs, 0.02% of the genome!
• Perlegen collaboration (12 hours)
• NCBI dbSNP collaboration (24 hours) (48 hours)
Weighted Haplotype Association
Association Statistics
Assume we are given N/2 case and N/2 control individuals.
Since each individual has 2 chromosomes, we have a total of N case chromosomes and N control chromosomes.
At SNP A, let $\hat{p}_A^+$ and $\hat{p}_A^-$ be the observed case and control frequencies, respectively, and $p_A^+$ and $p_A^-$ the true frequencies.
We know that:
\[ \hat{p}_A^+ \sim N\!\left(p_A^+,\; p_A^+(1-p_A^+)/N\right) \]
\[ \hat{p}_A^- \sim N\!\left(p_A^-,\; p_A^-(1-p_A^-)/N\right) \]
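Association Statistics
\[ \hat{p}_A^+ \sim N\!\left(p_A^+,\; p_A^+(1-p_A^+)/N\right) \]
\[ \hat{p}_A^- \sim N\!\left(p_A^-,\; p_A^-(1-p_A^-)/N\right) \]
\[ \hat{p}_A^+ - \hat{p}_A^- \sim N\!\left(p_A^+ - p_A^-,\; \frac{p_A^+(1-p_A^+) + p_A^-(1-p_A^-)}{N}\right) \]
We approximate
\[ p_A^+(1-p_A^+) + p_A^-(1-p_A^-) \approx 2\,\hat{p}_A(1-\hat{p}_A) \]
then if $p_A^+ = p_A^-$,
\[ S_A = \frac{\hat{p}_A^+ - \hat{p}_A^-}{\sqrt{2/N}\,\sqrt{\hat{p}_A(1-\hat{p}_A)}} \sim N(0,1) \]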
Association Statistic
Under the null hypothesis, $p_A^+ - p_A^- = 0$.
We compute the statistic $S_A$:
\[ S_A = \frac{\hat{p}_A^+ - \hat{p}_A^-}{\sqrt{2/N}\,\sqrt{\hat{p}_A(1-\hat{p}_A)}} \sim N(0,1) \]
If $S_A < \Phi^{-1}(\alpha/2)$ or $S_A > -\Phi^{-1}(\alpha/2)$, then the association is significant at level $\alpha$.
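As a worked illustration of the statistic, the sketch below computes $S_A$ from observed case and control allele frequencies and applies the two-sided threshold. The function names, the example frequencies, and the choice of the pooled frequency as the average of the two groups are assumptions made for this example.

```python
from math import sqrt
from statistics import NormalDist

def association_statistic(p_case, p_control, n):
    """S_A for observed allele frequencies with n chromosomes per group."""
    p_avg = (p_case + p_control) / 2  # pooled frequency (assumed: simple average)
    return (p_case - p_control) / sqrt(2 / n * p_avg * (1 - p_avg))

def is_significant(s, alpha):
    """Two-sided test: reject when s falls beyond +/- the alpha/2 quantile."""
    z = NormalDist().inv_cdf(alpha / 2)  # negative number, e.g. about -1.96
    return s < z or s > -z

# Hypothetical numbers: 1000 cases and 1000 controls, so n = 2000 chromosomes each.
s = association_statistic(p_case=0.30, p_control=0.25, n=2000)
print(round(s, 2), is_significant(s, alpha=0.05))
```

In a genome-wide scan the same computation is repeated at every SNP, which is why the significance level has to be chosen with the number of tests in mind.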
Association Power
Let us assume that SNP A is causal and $p_A^+ \neq p_A^-$.
Given the true $p_A^+$ and $p_A^-$, if we collect N individuals and compute the statistic $S_A$, the probability that $S_A$ is significant at level $\alpha$ is the power.
Power is the chance of detecting an association of a certain strength with a certain number of individuals.
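Association Statistic
Let us assume that $p_A^+ \neq p_A^-$. Then
\[ S_A = \frac{\hat{p}_A^+ - \hat{p}_A^-}{\sqrt{2/N}\,\sqrt{\hat{p}_A(1-\hat{p}_A)}} \sim N\!\left(\frac{p_A^+ - p_A^-}{\sqrt{2/N}\,\sqrt{p_A(1-p_A)}},\; 1\right) \]
\[ S_A = \frac{\hat{p}_A^+ - \hat{p}_A^-}{\sqrt{2/N}\,\sqrt{\hat{p}_A(1-\hat{p}_A)}} \sim N\!\left(\frac{(p_A^+ - p_A^-)\sqrt{N}}{\sqrt{2\,p_A(1-p_A)}},\; 1\right) \]
\[ S_A = \frac{\hat{p}_A^+ - \hat{p}_A^-}{\sqrt{2/N}\,\sqrt{\hat{p}_A(1-\hat{p}_A)}} \sim N\!\left(\lambda_A\sqrt{N},\; 1\right) \]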
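Association Power
\[ S_A = \frac{\hat{p}_A^+ - \hat{p}_A^-}{\sqrt{2/N}\,\sqrt{\hat{p}_A(1-\hat{p}_A)}} \sim N\!\left(\lambda_A\sqrt{N},\; 1\right) \]
[Figure labels: power of association test; threshold for significance; non-centrality parameter $\lambda_A\sqrt{N}$.]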
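Association Power
The statistical power of an association with N individuals, non-centrality parameter $\lambda\sqrt{N}$ and significance threshold $\alpha$ is
\[ P(\alpha, \lambda\sqrt{N}) = \Phi\!\left(\Phi^{-1}(\alpha/2) + \lambda\sqrt{N}\right) + 1 - \Phi\!\left(-\Phi^{-1}(\alpha/2) + \lambda\sqrt{N}\right) \]
Note that if $\lambda = 0$, power is always $\alpha$.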
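The power formula can be evaluated directly with the standard normal CDF and its inverse. The sketch below is a minimal transcription of the expression above; the function name `power` and the example values of λ and N are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power(lam, n, alpha):
    """P(alpha, lam * sqrt(n)) for a two-sided test on a N(lam * sqrt(n), 1) statistic."""
    phi = NormalDist()
    z = phi.inv_cdf(alpha / 2)   # lower significance threshold (negative)
    mu = lam * sqrt(n)           # non-centrality parameter
    return phi.cdf(z + mu) + 1 - phi.cdf(-z + mu)

# With lam = 0 the power collapses to the false positive rate alpha.
print(round(power(0.0, 1000, 0.05), 4))
# Power grows with the number of individuals for a fixed effect size.
for n in (500, 1000, 2000, 4000):
    print(n, round(power(0.05, n, 0.05), 3))
```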
Indirect Association
Now let us assume that we have 2 markers, A and B. Let us assume that marker B is the causal mutation, but we are observing marker A.
If we observed marker B directly, our statistic would be
\[ S_B \sim N\!\left(\lambda_B\sqrt{N},\; 1\right), \qquad \lambda_B = \frac{p_B^+ - p_B^-}{\sqrt{2\,p_B(1-p_B)}} \]
Indirect Association
However, we are observing A, where our statistic is
\[ S_A \sim N\!\left(\lambda_A\sqrt{N},\; 1\right), \qquad \lambda_A = \frac{p_A^+ - p_A^-}{\sqrt{2\,p_A(1-p_A)}} \]
What is the relation between $S_A$ and $S_B$?
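Indirect Association
We want to relate
\[ S_A \sim N\!\left(\lambda_A\sqrt{N},\; 1\right), \qquad \lambda_A = \frac{p_A^+ - p_A^-}{\sqrt{2\,p_A(1-p_A)}} \]
to
\[ S_B \sim N\!\left(\lambda_B\sqrt{N},\; 1\right), \qquad \lambda_B = \frac{p_B^+ - p_B^-}{\sqrt{2\,p_B(1-p_B)}} \]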
Indirect Association
We assume the conditional probability distributions are equal in case and control samples:
\[ p_A^+ = p_{AB}^+ + p_{Ab}^+ \]
\[ p_A^+ = p_B^+\,p_{A|B} + (1-p_B^+)\,p_{A|b} \]
\[ p_A^- = p_B^-\,p_{A|B} + (1-p_B^-)\,p_{A|b} \]
\[ p_A^+ - p_A^- = p_{A|B}\,(p_B^+ - p_B^-) - p_{A|b}\,(p_B^+ - p_B^-) \]
\[ p_A^+ - p_A^- = (p_B^+ - p_B^-)(p_{A|B} - p_{A|b}) \]
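Indirect Association
Then
\[ \lambda_A = \frac{p_A^+ - p_A^-}{\sqrt{2\,p_A(1-p_A)}} = \frac{(p_B^+ - p_B^-)(p_{A|B} - p_{A|b})}{\sqrt{2\,p_A(1-p_A)}} \]
\[ = \frac{(p_B^+ - p_B^-)(p_{A|B} - p_{A|b})}{\sqrt{2\,p_A(1-p_A)}} \cdot \frac{\sqrt{2\,p_B(1-p_B)}}{\sqrt{2\,p_B(1-p_B)}} \]
\[ = \frac{p_B^+ - p_B^-}{\sqrt{2\,p_B(1-p_B)}} \cdot \frac{(p_{A|B} - p_{A|b})\sqrt{2\,p_B(1-p_B)}}{\sqrt{2\,p_A(1-p_A)}} \]
\[ = \lambda_B\,\frac{(p_{A|B} - p_{A|b})\sqrt{2\,p_B(1-p_B)}}{\sqrt{2\,p_A(1-p_A)}} \]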
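Indirect Association
Note that
\[ \lambda_A = \lambda_B\,\frac{(p_{A|B} - p_{A|b})\sqrt{2\,p_B(1-p_B)}}{\sqrt{2\,p_A(1-p_A)}} \]
\[ = \lambda_B \left(\frac{p_{AB}}{p_B} - \frac{p_{Ab}}{1-p_B}\right) \frac{\sqrt{p_B(1-p_B)}}{\sqrt{p_A(1-p_A)}} \]
\[ = \lambda_B \left(\frac{p_{AB} - p_{AB}\,p_B - p_{Ab}\,p_B}{p_B(1-p_B)}\right) \frac{\sqrt{p_B(1-p_B)}}{\sqrt{p_A(1-p_A)}} \]
\[ = \lambda_B\,\frac{p_{AB} - p_A\,p_B}{\sqrt{p_A(1-p_A)\,p_B(1-p_B)}} = \lambda_B\sqrt{r^2} \]
where $r^2 = \frac{(p_{AB} - p_A p_B)^2}{p_A(1-p_A)\,p_B(1-p_B)}$ (the last equality holds up to the sign of $p_{AB} - p_A p_B$). In summary:
\[ \lambda_A = \lambda_B\sqrt{r^2} \]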
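Indirect Association
How many individuals, $N_A$, do we need to collect at marker A to achieve the same power as if we collected $N_B$ individuals at marker B?
\[ S_A \sim N\!\left(\lambda_A\sqrt{N_A},\; 1\right), \qquad S_B \sim N\!\left(\lambda_B\sqrt{N_B},\; 1\right) \]
\[ \lambda_A\sqrt{N_A} = \lambda_B\sqrt{N_B} \]
\[ \lambda_B\sqrt{r^2}\,\sqrt{N_A} = \lambda_B\sqrt{N_B} \]
\[ N_A = \frac{N_B}{r^2} \]
using $\lambda_A = \lambda_B\sqrt{r^2}$.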
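The sample size relation is easy to check with numbers. The sketch below computes r² between two markers from the haplotype frequency $p_{AB}$ and the allele frequencies $p_A$ and $p_B$, then the number of individuals needed at the proxy marker. The function names and the example frequencies are assumptions for illustration.

```python
def r_squared(p_ab, p_a, p_b):
    """Squared correlation between markers A and B from haplotype/allele frequencies."""
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

def individuals_needed(n_b, r2):
    """N_A giving the same power at proxy marker A as N_B at causal marker B."""
    return n_b / r2

# Hypothetical frequencies: both alleles at 0.5, and the AB haplotype at 0.45.
r2 = r_squared(p_ab=0.45, p_a=0.5, p_b=0.5)
print(round(r2, 2), individuals_needed(1000, r2))
```

For these numbers r² is 0.64, so roughly 1,560 individuals typed at A give the power of 1,000 individuals typed at B.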
Visualization in terms of Power
[Figure labels: power of association test; threshold for significance; non-centrality parameters $\lambda_B\sqrt{N}$ and $\lambda_A\sqrt{N}$.]
\[ \lambda_A = \lambda_B\sqrt{r^2} \]
Correlating Haplotypes with the Disease
The disease may be correlated with a SNP not in the panel.
The disease may be more correlated with a haplotype (group of SNPs) than with any single SNP in the panel.
Haplotype tests: Which haplotypes should we test? Which blocks should we pick?
Key Problem: Indirect Association
• We have the HapMap: information on 4,000,000 SNPs.
• The Affymetrix gene chip collects information on 500,000 SNPs. What about the remaining 3,500,000 SNPs?
• So far, we have designed studies by picking tag SNPs with high r².
• Can we use the HapMap when performing association? Multi-tag methods.
Haplotypes as Proxies for Hidden SNPs (de Bakker 2005)
Haplotypes   Freq.
1 2 3 4 5
A A A A A    .25
A G A G G    .15
A G A G A    .10
G A G G G    .25
G G G G G    .25
WHAP - Weighted Haplotypes
Haplotypes   Freq.
1 2 3 4 5
A A A A A    .25
A G A G G    .15
A G A G A    .10
G A G G G    .25
G G G G G    .25
A: 0.71 AA + 0.29 AG
Basic MultiMarker Method
• For each SNP in HapMap, find the haplotype over the genotyped SNPs that has the highest r² with that SNP.
• Perform association at each SNP and each added haplotype.
• Now instead of performing 500,000 tests, we perform 4,000,000 tests.
Weighted Haplotype Test
• For each haplotype h, we assign a weight $w_h$.
• We use a “weighted” allele frequency statistic:
\[ W_h = \sum_h w_h\,(p_h^+ - p_h^-) \]
• This statistic is the weighted numerator in $S_A$. What is the variance of this statistic?
• Complication: Haplotype frequencies are not independent!
Weighted Haplotype Example
• Assume we have 4 haplotypes AB, Ab, aB and ab.
• If we set the weights so that $w_{AB} = w_{Ab} = 1$ and $w_{aB} = w_{ab} = 0$, this is equivalent to looking at the single SNP A.
• If we set the weights so that $w_{AB} = 1$ and $w_{Ab} = w_{aB} = w_{ab} = 0$, this is equivalent to looking at the single haplotype AB.
• Other weights can be something in between.
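The Weighted Test
The test statistic, written $T(w)$ here, is
\[ T(w) = \frac{N\left(\sum_{h=1}^{k} w_h\,(p_h^{\text{case}} - p_h^{\text{control}})\right)^2}{2\left(\sum_{h=1}^{k} w_h^2\,p_h - \left(\sum_{h=1}^{k} w_h\,p_h\right)^2\right)} \]
• Each haplotype h is assigned a weight $w_h$.
• N is the number of individuals.
• $p_h$ is the probability of h in cases/controls, or their average.
• Under the null, the test statistic is $\chi^2$ distributed.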
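The weighted statistic can be sketched in a few lines. The code below follows the formula on this slide, taking $p_h$ as the average of the case and control frequencies (one of the two options the slide mentions). The function name `weighted_statistic` and the four haplotype frequencies are assumptions made for the example. It also checks the earlier claim that weights of 1 on AB and Ab and 0 elsewhere reduce to the single-SNP test at A.

```python
from math import sqrt

def weighted_statistic(weights, p_case, p_control, n):
    """Weighted haplotype statistic; chi-square with 1 df under the null."""
    p_avg = [(a + b) / 2 for a, b in zip(p_case, p_control)]
    diff = sum(w * (a - b) for w, a, b in zip(weights, p_case, p_control))
    mean = sum(w * p for w, p in zip(weights, p_avg))
    var = sum(w * w * p for w, p in zip(weights, p_avg)) - mean ** 2
    return n * diff ** 2 / (2 * var)

# Hypothetical frequencies for haplotypes AB, Ab, aB, ab.
p_case = [0.30, 0.20, 0.25, 0.25]
p_control = [0.25, 0.20, 0.25, 0.30]
n = 2000

# Weights (1, 1, 0, 0) collapse AB and Ab into allele A.
t_snp_a = weighted_statistic([1, 1, 0, 0], p_case, p_control, n)

# Single-SNP statistic S_A computed directly from allele A frequencies.
pa_case, pa_control = 0.50, 0.45
pa_avg = (pa_case + pa_control) / 2
s_a = (pa_case - pa_control) / sqrt(2 / n * pa_avg * (1 - pa_avg))

print(round(t_snp_a, 3), round(s_a ** 2, 3))
```

Both printed values agree, since the weighted variance term reduces to $p_A(1-p_A)$ when the weights are an indicator of allele A.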
Non-Centrality Parameter
Under weights $w_1, w_2, w_3, w_4$ and true case/control probabilities $p_1^+, p_2^+, p_3^+, p_4^+$ and $p_1^-, p_2^-, p_3^-, p_4^-$, $W_h$ is expected to be
\[ W_h = \sum_{i=1}^{4} w_i\,(p_i^+ - p_i^-) \]
When normalizing for the variance, the non-centrality parameter is
\[ \lambda_h\sqrt{N} = \frac{\sum_{i=1}^{4} w_i\,(p_i^+ - p_i^-)}{\sqrt{\frac{2}{N}\left(\sum_{i=1}^{4} w_i^2\,p_i - \left(\sum_{i=1}^{4} w_i\,p_i\right)^2\right)}} \]
$W_h$ and indirect association
• Let us assume that SNP C is causal with non-centrality parameter $\lambda_C$.
• If we perform weighted haplotype association, the non-centrality parameter is $\lambda_h$.
• How are they related? (That is, what is the power of the weighted haplotype association test?)
• Using the same technique, we can show that $\lambda_h = r_h\,\lambda_C$, where $r_h$ is the conceptual equivalent of r in the two-SNP case.
The Relation to Power
\[ r_h^2 = \frac{\left(\sum_{h=1}^{k} w_h\,(q_{hC} - q_{hc})\right)^2 p_C(1-p_C)}{\sum_h w_h^2\,p_h - \left(\sum_h w_h\,p_h\right)^2} \]
\[ q_{hC} = P(h \mid C), \qquad q_{hc} = P(h \mid c), \qquad p_C = P(C) \]
The power of detecting the SNP with N individuals is the same as using the tag SNPs with $N/r_h^2$ individuals.
Choosing the Weights
Haplotypes   Freq.
1 2 3 4 5
A A A A A    .25
A G A G G    .15
A G A G A    .10
G A G G G    .25
G G G G G    .25
Optimal weights:
\[ w_h(s_5) = P(s_5 = \text{‘A’} \mid h) = q_{Ah} \]
The Relation to Power
\[ r_h^2 = \frac{\sum_{h=1}^{k} q_{Ch}\,(p_{Ch} - p_C\,p_h)}{p_C(1-p_C)} \]
This is exactly r² in the case of one tag SNP.
WHAP always has at least as much power as:
• single SNP test
• single haplotype test
• haplotype group test
• χ² with k degrees of freedom.
• HapMap (4M SNPs): use as training dataset to get the weights. Tests: T1,…,T4M.
• Cases (0.5M SNPs) and controls (0.5M SNPs): apply tests T1,…,T4M.
• Positive results give evidence for a causal SNP; they can be verified by a follow-up/two-stage study.
How Many SNPs are Captured?
Tag Set Pop SNP HAP WHAP
Affy500 CEU 0.61 0.77 0.84
Affy500 CHB 0.62 0.76 0.83
Affy500 JPT 0.59 0.73 0.81
Affy500 YRI 0.37 0.61 0.74
Illumina CEU 0.88 0.97 0.98
Illumina CHB 0.80 0.91 0.94
Illumina JPT 0.78 0.90 0.95
Illumina YRI 0.52 0.83 0.92
Power Simulations
Pop SNP HAP WHAP
CEU 0.92 0.94 0.96
CHB 0.90 0.94 0.95
JPT 0.90 0.93 0.95
YRI 0.77 0.88 0.92
- Relative power to using all SNPs. - Tested on the ENCODE regions, Affy 500k tag SNPs.
Practical Issues
We assume we have the haplotype frequencies in the HapMap (not the phase).
We assume the case/control populations are coming from the same population as the HapMap.
Over-fitting: Train with half of the data, test the other half. No correlation between the haps and random SNPs.
[Figure: WHAP r² in a region. Red lines are collected SNPs. Blue lines are $r_h^2$ values.]
[Figure: Associations using WHAP. Red lines are associations at collected SNPs. Blue lines are associations at uncollected SNPs inferred by WHAP.]
Optimal Genome Wide Tagging by Reduction to SAT
Correlation Structure
Example r2 Matrix
Graph Representation
Satisfiability and SAT Solvers
• Boolean variables called literals
• Logical operators: AND ∧, OR ∨, NOT ¬
• Example: (s1 ∨ ¬s2) ∧ (s2 ∨ s3 ∨ s1) is satisfied by s1 = false; s2 = false; s3 = true
Negation Normal Form
[Figure (A. Darwiche): a rooted DAG (circuit) of "and" and "or" nodes over the variables A, B, C, D.]
CNF Form and Logical Solutions
NNF Form of Solutions
Local Single SNP r² Tagging
• Generate a clause for each SNP.
• The clause for SNP si contains all of its covers.
• Input the CNF as the conjunction of all clauses.
• Compile with minSAT solver.
• Find solutions by traversal of the NNF.
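The clause construction can be illustrated on a toy r² matrix. In the sketch below, each SNP gets one clause listing the SNPs that cover it (itself, plus any SNP whose r² with it meets a threshold), and a tag set is valid when every clause contains at least one chosen tag. The threshold of 0.8, the matrix values, and the function names are assumptions for this example, and brute-force search over subsets stands in for the SAT compilation and NNF traversal, which it does not reproduce.

```python
from itertools import combinations

def cover_clauses(r2, threshold=0.8):
    """One clause per SNP: the set of SNPs that can serve as its tag."""
    n = len(r2)
    return [{j for j in range(n) if j == i or r2[i][j] >= threshold}
            for i in range(n)]

def minimum_tag_sets(clauses):
    """All smallest tag sets that satisfy every clause (exhaustive search)."""
    n = len(clauses)
    for size in range(1, n + 1):
        solutions = [set(tags) for tags in combinations(range(n), size)
                     if all(clause & set(tags) for clause in clauses)]
        if solutions:
            return solutions
    return []

# Toy symmetric r^2 matrix for four SNPs (hypothetical values).
r2 = [
    [1.00, 0.90, 0.10, 0.10],
    [0.90, 1.00, 0.85, 0.10],
    [0.10, 0.85, 1.00, 0.20],
    [0.10, 0.10, 0.20, 1.00],
]
clauses = cover_clauses(r2)
print(clauses)
print(minimum_tag_sets(clauses))
```

Exhaustive search is exponential in the number of SNPs, which is the motivation for handing the same clauses to a solver on real data.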
Optimal Tagging
Whole Genome Tagging
MultiMarker Example
MultiMarker Tagging
UCLA: Adnan Darwiche, Arthur Choi, Knot Pipatswisawat
ICSI: Eran Halperin, Richard Karp
Perlegen Sciences: David Hinds, David Cox
Ph.D. Students: Buhm Han, Nils Homer, Hyun Min Kang, Sean O’Rourke, Jimmie Ye, Noah Zaitlen
Webserver Hosted By: