Statistical Genetics
Matt McQueen, Assistant Professor
Institute for Behavioral Genetics, University of Colorado at Boulder
Overview
Background and Introduction
Linkage and Linkage Disequilibrium
Population Genetics
Linkage Analysis
Association Analysis
Statistical Genetics
Otherwise known as:
- Genetic Epidemiology
- Genetic Statistics
By definition, “integrative”
- Combines epidemiological, statistical, clinical, genetic and molecular approaches
Genetic Discovery
Evidence for genetic effects? Familial aggregation
Mode of inheritance? Segregation Analysis
What chromosome / region? Linkage Analysis
Where in the region? Fine Mapping
What gene? Association Analysis
What is the effect of the gene? Characterization
Why Hunt for Genes?
Disease etiology
Refined diagnosis and/or prognosis
Drug development
Disease prediction
Challenges
Field is young and changes rapidly
- Technology drives the science
- We test because we can
Challenges
Literature can be difficult
- Statisticians writing genetic papers
- Geneticists writing statistical papers
Challenges
Software typically not well-tested or supported
- The cost of being “free”
- Use at your own risk!
Challenges
Methods are often oversold
- Consequence of high-pressure field
- Rapid development creates sense of urgency
Some Terminology
Locus
- A location in the genome
Gene
- A DNA segment characterized by sequence, transcription or homology
Allele
- Different forms of a gene: A, a; B, b
Polymorphism
- Allele present in the population with > 5% frequency
Mutation
- Allele present in the population with < 5% frequency
Some Terminology
Phenotype
- Any measurable outcome
Quantitative Trait Locus (QTL)
- A region (gene) that contributes to a quantitative phenotype
Penetrance (binary, disease phenotypes)
- Prob(Phenotype | Genotype)
Heritability (quantitative traits)
- Proportion of phenotypic variance explained by genetic factors
Mendelian Disorder
- Disease influenced by a single gene
Complex Trait
- Disease influenced by multiple genes and environment
Overview
A Brief Introduction
Linkage and Linkage Disequilibrium
Population Genetics
Linkage Analysis
Association Analysis
Linkage
General Idea:
- Describes the relationship between two loci
- If two loci are close in proximity:
  - “linked”
- If two loci are far apart (different chromosomes):
  - “not linked”
Recombination
Parental haplotypes: A1B1 and A2B2
θ = Recombination Rate
Gamete   Type              Probability
A1B1     non-recombinant   (1−θ)/2
A2B2     non-recombinant   (1−θ)/2
A2B1     recombinant       θ/2
A1B2     recombinant       θ/2
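The gamete probabilities on the recombination slide can be checked numerically. The sketch below is illustrative only: the function name `gamete_probabilities` and the example value θ = 0.10 are assumptions, not part of the original slides.

```python
def gamete_probabilities(theta):
    """Gamete probabilities for a parent with haplotypes A1B1 / A2B2.

    theta is the recombination rate between the two loci (0 to 0.5).
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    non_recombinant = (1.0 - theta) / 2.0
    recombinant = theta / 2.0
    return {
        "A1B1": non_recombinant,  # parental haplotype
        "A2B2": non_recombinant,  # parental haplotype
        "A2B1": recombinant,      # recombinant haplotype
        "A1B2": recombinant,      # recombinant haplotype
    }


probs = gamete_probabilities(0.10)  # hypothetical theta for illustration
for gamete, prob in probs.items():
    print(gamete, round(prob, 3))
```

With θ = 0.5 (unlinked loci) all four gametes are equally likely, which is what "not linked" means in practice.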
Genetic Distance
Definition:
- The expected number of crossover events between two loci
Units:
- Morgans
- 1 Morgan = 1 crossover event expected (1 Morgan = 100 centimorgans, cM)
Genetic Map
- A linearly arranged set of loci with genetic distances between them
- Human Autosomes ~ 3900 cM
Linkage Disequilibrium
General Idea:
- Describes the relationship between alleles at two loci
- If the alleles at two loci occur together more (or less) often than expected by chance:
  - “in linkage disequilibrium”
Linkage Disequilibrium
Gametes     A1B1   A1B2   A2B1   A2B2
Frequency   x1     x2     x3     x4

Allele      A1            A2            B1            B2
Frequency   pA1=x1+x2     pA2=x3+x4     pB1=x1+x3     pB2=x2+x4
D = Observed − Expected
D = x1 − pA1 pB1
D = x1 − (x1 + x2)(x1 + x3)
D = x1 x4 − x2 x3
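The two forms of D above are algebraically equal when the four gamete frequencies sum to 1. A minimal sketch to confirm this numerically follows; the function name `ld_coefficient` and the example frequencies are hypothetical.

```python
def ld_coefficient(x1, x2, x3, x4):
    """Return D computed two ways from gamete frequencies.

    x1..x4 are the frequencies of A1B1, A1B2, A2B1, A2B2.
    """
    total = x1 + x2 + x3 + x4
    if abs(total - 1.0) > 1e-9:
        raise ValueError("gamete frequencies must sum to 1")
    p_a1 = x1 + x2  # allele frequency of A1
    p_b1 = x1 + x3  # allele frequency of B1
    d_observed_minus_expected = x1 - p_a1 * p_b1
    d_cross_product = x1 * x4 - x2 * x3
    return d_observed_minus_expected, d_cross_product


# Hypothetical gamete frequencies chosen for illustration.
d_first, d_second = ld_coefficient(0.4, 0.1, 0.1, 0.4)
print(d_first, d_second)
```

When every gamete frequency equals the product of its allele frequencies (for example 0.25 each), both expressions return 0, i.e. linkage equilibrium.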
Reasons for LD
Mutation
Population Subdivision
Genetic Drift
Lack of Recombination
Selection
Non-Random Mating
Linkage and LD
After t generations of random mating…
D_t = (1 − θ)^t D_0
LD is a function of recombination and time (generations)
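To get a feel for how quickly LD decays, the relationship D_t = (1 − θ)^t D_0 can be evaluated directly. This is a sketch under the slide's random-mating assumption; the function names and the example values of θ and D_0 are hypothetical.

```python
import math


def ld_after_generations(d0, theta, t):
    """LD remaining after t generations of random mating."""
    return (1.0 - theta) ** t * d0


def generations_to_fraction(theta, fraction):
    """Smallest whole number of generations until D_t / D_0 <= fraction."""
    return math.ceil(math.log(fraction) / math.log(1.0 - theta))


# Hypothetical example: two loci with theta = 0.01 and initial D_0 = 0.25.
for t in (0, 10, 100, 500):
    print(t, round(ld_after_generations(0.25, 0.01, t), 4))

print("generations to halve D:", generations_to_fraction(0.01, 0.5))
```

Tightly linked loci (small θ) keep their LD for many generations, while unlinked loci (θ = 0.5) lose half of it every generation.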
Linkage and LD
Key Concepts…
- Linkage : Location
- LD : Alleles
- There can be Linkage without LD
- There can be LD without Linkage
Overview
Background and Introduction
Linkage and Linkage Disequilibrium
Population Genetics
Linkage Analysis
Association Analysis
DNA Variation
DNA
- Adenine (A)
- Guanine (G)
- Cytosine (C)
- Thymine (T)
DNA double helix
- A pairs with T and G pairs with C
Codons
- Triplets of bases
- 64 possible codons
- 20 amino acids
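The count of 64 codons follows from four bases taken three at a time (4 × 4 × 4). A short enumeration, written here only as an illustration, makes that concrete:

```python
from itertools import product

BASES = "ACGT"

# Every ordered triplet of bases is a possible codon.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

print(len(codons))   # number of possible codons
print(codons[:4])    # first few codons in enumeration order
```

Because 64 codons encode only 20 amino acids, several codons map to the same amino acid, which is why some point mutations leave the protein unchanged.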
Mutations
Point
- Substitute one base for another
Deletions
- Base removed entirely
Insertions
- Base inserted
Duplications
- Base and/or sequence duplicated
More on Point Mutations
Point Mutations
- Synonymous
  - No change in amino acid
- Nonsynonymous
  - Amino acid change
- Creates a new polymorphic site
  - “Single Nucleotide Polymorphism” (SNP)
Mutation Becomes Polymorphism
Infinite Sites Model
- Each mutation creates a unique polymorphic site
- Mutation rate ~ 10^-6
Life After Mutation
Mutation is neutral
- Random Genetic Drift
- Eventually, the allele will “drift” out
Mutation is harmful
- Selective Pressure
- Allele may quickly disappear
Mutation is beneficial
- Selective Pressure
- Allele frequency may increase rapidly
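Random genetic drift for a neutral allele can be illustrated with a small simulation. The sketch below assumes a Wright-Fisher style model (each generation resamples a fixed number of allele copies at the current frequency); that model, the function name `simulate_drift`, the population size and the seed are all assumptions made for illustration and are not taken from the slides.

```python
import random


def simulate_drift(n_copies, start_count, max_generations, seed=1):
    """Track the count of a neutral allele until it is lost or fixed.

    n_copies is the number of allele copies in the population
    (twice the number of diploid individuals).
    """
    rng = random.Random(seed)
    count = start_count
    trajectory = [count]
    for _ in range(max_generations):
        if count == 0 or count == n_copies:
            break  # allele lost or fixed; nothing more can change
        freq = count / n_copies
        # Each copy in the next generation is drawn at the current frequency.
        count = sum(1 for _ in range(n_copies) if rng.random() < freq)
        trajectory.append(count)
    return trajectory


# A single new mutation in a population of 100 allele copies.
traj = simulate_drift(n_copies=100, start_count=1, max_generations=10000)
print("generations simulated:", len(traj) - 1)
print("final count:", traj[-1])
```

Under this model a new neutral mutation is usually lost, but occasionally drifts all the way to fixation; rerunning with different seeds shows both outcomes.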
Who Are We?
All DNA sequences are derived from others
- Every sample has a genealogy
Eventually, all lineages coalesce
- Most Recent Common Ancestor (MRCA)
The “older” the genetic history…
- The less observed LD (African vs European populations)
The more isolated genetic history…
- The more observed LD (Mayan)
Overview
Background and Introduction
Linkage and Linkage Disequilibrium
Population Genetics
Linkage Analysis
Association Analysis
Linkage Analysis
Gene-Mapping
- Exploit the properties of linkage
- Using an observed locus (marker) to draw inferences about an unobserved locus (disease gene)
Family-Based Design
- Extended (grandparents, parents and kids)
- Nuclear (parents and kids)
- Sibling Pair (kids only, no parents)
Goal: Find genomic region “linked” to disease
Linkage Analysis
[Figure: genetic map spanning 0 to 70 cM (genetic distance) with genetic markers M1 to M8 and an unobserved disease gene located among them]
Linkage Analysis
Parametric
- Affected / Unaffected
- Observed recombination events
Non-Parametric
- Affected / Unaffected
- Identity-by-Descent (IBD)
“Semi-Parametric”
- Quantitative
- IBD
MCMC
- Any phenotype
- IBD
IBD Probabilities
Probability of Sharing IBD Alleles
Relative Pair             π0     π1     π2
MZ Twins                  0      0      1
Full Sibs                 0.25   0.50   0.25
Parent-Offspring          0      1      0
First Cousin              0.75   0.25   0
Grandparent-Grandchild    0.50   0.50   0
Half-Sibs                 0.50   0.50   0
Avuncular                 0.50   0.50   0
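The full-sib row of the IBD table (0.25, 0.50, 0.25) can be derived by enumerating which parental allele each sibling receives. The sketch below does that enumeration; the function name is hypothetical and the approach assumes Mendelian transmission with each parental allele equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def sib_pair_ibd_distribution():
    """Enumerate transmissions to two full sibs and count alleles shared IBD.

    Each parent carries two distinguishable alleles labelled 0 and 1;
    each sib receives one allele from each parent with probability 1/2.
    """
    counts = Counter()
    for pat1, mat1, pat2, mat2 in product((0, 1), repeat=4):
        # Sibs share the paternal allele IBD if both received the same one,
        # and likewise for the maternal allele.
        shared = int(pat1 == pat2) + int(mat1 == mat2)
        counts[shared] += 1
    total = sum(counts.values())
    return {k: Fraction(counts[k], total) for k in (0, 1, 2)}


ibd = sib_pair_ibd_distribution()
for n_shared, prob in ibd.items():
    print(n_shared, prob)
```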
IBD and Sibling Pairs
Use of Sibling Pairs in linkage analysis
- Affected Sibling Pair (ASP) Design
  - Binary Trait
- Unascertained Sibling Pair Design
  - Quantitative Traits
- Ascertained Sibling Pair Design
  - Quantitative Traits
We look for regions that show deviation of IBD from what is expected under the null
Linkage Analysis of Sibling Pairs
Basic Idea
- Sibling pairs sharing more alleles IBD than expected at a trait-influencing locus should have more similar phenotypes
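Affected Sibling Pairs
ASP (affected sib pair), DSP (discordant sib pair), USP (unaffected sib pair)
If there is a shared genetic component, IBD sharing at a linked locus deviates from the null expectation:
P(IBD=0, IBD=1, IBD=2) = 0.25, 0.50, 0.25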
Affected Sibling Pairs
Number of Alleles Shared IBD
            0     1     2     Total
Observed    20    45    35    100
Expected    25    50    25    100
H0: No Linkage
H1: Linkage
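One simple way to compare the observed and expected IBD counts above is a chi-square goodness-of-fit test with 2 degrees of freedom. This is an illustrative choice, not necessarily the statistic used in the lecture; dedicated ASP statistics are typically used in practice. The function name is hypothetical.

```python
import math


def ibd_goodness_of_fit(observed, null_probs=(0.25, 0.50, 0.25)):
    """Chi-square goodness-of-fit of IBD counts (0, 1, 2 alleles shared).

    Returns the statistic and its p-value on 2 degrees of freedom.
    """
    total = sum(observed)
    expected = [total * p for p in null_probs]
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 2 degrees of freedom the chi-square upper tail is exp(-x / 2).
    p_value = math.exp(-chi_square / 2.0)
    return chi_square, p_value


# Observed counts from the table: 20, 45 and 35 pairs sharing 0, 1 and 2 alleles.
chi_square, p_value = ibd_goodness_of_fit([20, 45, 35])
print(f"chi-square = {chi_square:.2f}, p = {p_value:.4f}")
```

The excess of pairs sharing 2 alleles IBD (35 observed vs 25 expected) contributes most of the statistic.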
Sibling Pairs (Quantitative Traits)
If there is a shared genetic component, IBD sharing at a linked locus deviates from the null expectation:
P(IBD=0, IBD=1, IBD=2) = 0.25, 0.50, 0.25
Quantitative Traits
Haseman-Elston Algorithm
- Calculate number of alleles shared IBD and the squared phenotype difference for each sibpair
- Regress squared differences against IBD sharing
- A significantly negative slope (β < 0) is evidence for linkage
E(∆^2) = α + βπ
∆ = trait difference between sibs
α = regression intercept
β = slope
π = IBD sharing
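The Haseman-Elston regression can be sketched with ordinary least squares on simulated sib pairs. Everything about the simulation is an assumption made for illustration: the variance components, the seed, and the model in which the variance of the sib difference shrinks linearly with IBD sharing. Only the regression of squared differences on π comes from the slide.

```python
import random


def haseman_elston(pi_hat, squared_diff):
    """Least-squares fit of squared_diff = alpha + beta * pi_hat."""
    n = len(pi_hat)
    mean_x = sum(pi_hat) / n
    mean_y = sum(squared_diff) / n
    sxx = sum((x - mean_x) ** 2 for x in pi_hat)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pi_hat, squared_diff))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


def simulate_sib_pairs(n_pairs, var_locus=1.0, var_residual=1.0, seed=7):
    """Simulate IBD sharing and squared trait differences (hypothetical model)."""
    rng = random.Random(seed)
    pi_hat, squared_diff = [], []
    for _ in range(n_pairs):
        # Proportion of alleles shared IBD: 0, 1/2 or 1 with null probabilities.
        pi = rng.choices([0.0, 0.5, 1.0], weights=[0.25, 0.50, 0.25])[0]
        # Assumed variance of the sib difference: smaller when more is shared.
        var_diff = 2.0 * var_residual + 2.0 * var_locus * (1.0 - pi)
        delta = rng.gauss(0.0, var_diff ** 0.5)
        pi_hat.append(pi)
        squared_diff.append(delta ** 2)
    return pi_hat, squared_diff


pi_hat, squared_diff = simulate_sib_pairs(5000)
alpha, beta = haseman_elston(pi_hat, squared_diff)
print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")
```

Under the simulated model the true intercept is 4 and the true slope is −2, so the fitted slope should come out clearly negative.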
The LOD Score
Morton (1955)
Log10 of the ODds for linkage
Essentially a Likelihood Ratio
- Likelihood of observed
- Likelihood of expected (no linkage, theta=0.5)
Developed in the context of parametric linkage
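As a worked illustration of the likelihood ratio, the sketch below computes a LOD score for the simplest textbook setting: phase-known, fully informative meioses in which recombinants can be counted directly. That setting, the function names and the example counts are assumptions for illustration; real pedigree likelihoods are more involved.

```python
import math


def lod_score(recombinants, meioses, theta):
    """LOD = log10[ L(theta) / L(theta = 0.5) ] for counted recombinants."""
    non_recombinants = meioses - recombinants
    log_lik_theta = (recombinants * math.log10(theta)
                     + non_recombinants * math.log10(1.0 - theta))
    log_lik_null = meioses * math.log10(0.5)
    return log_lik_theta - log_lik_null


def max_lod(recombinants, meioses):
    """Maximize the LOD at the observed recombination fraction (capped at 0.5)."""
    theta_hat = min(recombinants / meioses, 0.5)
    if theta_hat == 0.0:
        # No recombinants observed: the likelihood ratio is 1 / 0.5^n.
        return 0.0, meioses * math.log10(2.0)
    return theta_hat, lod_score(recombinants, meioses, theta_hat)


# Hypothetical data: 2 recombinants among 20 informative meioses.
theta_hat, lod = max_lod(2, 20)
print(f"theta_hat = {theta_hat:.2f}, LOD = {lod:.2f}")
```

At θ = 0.5 the two likelihoods coincide and the LOD is 0, which is the "no linkage" reference point.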
Common Nonparametric Statistics
Maximum LOD Score
- “MLS” (or MLOD)
- ASP design only
- GENEHUNTER, ASPEX
Nonparametric Linkage Score
- “NPL Score”
- Any family design
- GENEHUNTER
Kong and Cox LOD Score
- “K&C LOD Score”
- Derived from the NPL
- MERLIN, ALLEGRO
Interpreting Linkage Statistics
Traditional View…
- LOD > 3.0 for genome-wide significance
More Contemporary View…
- Simulate for empirically derived significance
Overview
Background and Introduction
Linkage and Linkage Disequilibrium
Population Genetics
Linkage Analysis
Association Analysis
Association Analysis
Gene-Mapping
- Exploit the properties of linkage disequilibrium
- Using an observed locus (marker) to draw inferences about an unobserved locus (disease gene)
Fine-Mapping
- Refine a linkage region
Candidate-Gene
- Evaluate the genetic variation as it relates to an outcome
Goal: Find genomic region and/or genes “associated” with disease
Association Analysis
Family-Based
- Parent/Offspring Trios
- Sibling Pairs
- Nuclear Families
- Extended Pedigrees
Population-Based
- Case-Control
- Cohort
Association Analysis
Key Concepts
- Genotype Coding
- Population Stratification
- Transmission Disequilibrium Test (TDT)
- Whole Genome Association
Coding Genotypes
Coding           aa       Aa       AA
Additive (A)     0        1        2
Genotype (A)     0,0,1    0,1,0    1,0,0
Dominant (A)     0        1        1
Recessive (A)    0        0        1
Genotype Coding
Marker Score = X
Additive : X = (0, 1 or 2)
Dominant : X = (0 or 1)
Recessive : X = (0 or 1)
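The coding schemes can be written as a small function that turns a genotype string into a marker score. The function name `code_genotype`, the string representation of genotypes, and the model labels are hypothetical choices for illustration.

```python
def code_genotype(genotype, model, risk_allele="A"):
    """Convert a genotype such as 'Aa' into a marker score X.

    model is one of 'additive', 'dominant', 'recessive' or 'genotype'.
    """
    count = genotype.count(risk_allele)  # copies of the coded allele (0, 1 or 2)
    if model == "additive":
        return count
    if model == "dominant":
        return int(count >= 1)
    if model == "recessive":
        return int(count == 2)
    if model == "genotype":
        # Indicator variables for AA, Aa, aa respectively.
        return tuple(int(count == k) for k in (2, 1, 0))
    raise ValueError(f"unknown model: {model}")


for g in ("aa", "Aa", "AA"):
    print(g, [code_genotype(g, m) for m in ("additive", "dominant", "recessive")])
```

The additive score counts copies of A, while the dominant and recessive scores collapse the three genotypes into two groups in different ways.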
Genetic Associations
Truth
- Causal locus (direct)
- In LD with causal locus (indirect)
Chance
- If you run 100 tests under the null, you’ll see ~5 with p < 0.05
- No causal underpinning
Bias
- Association is not causal
- e.g. Population stratification
Common Cause
G ← A → P
Ancestry (A) predicts Genotype (G)
Ancestry (A) predicts Phenotype (P)
a.k.a.… Population Stratification
Poor Epidemiologic Design
Source Population?
Two Necessary Components:
- Different prevalence (mean) of disease
- Different allele frequency
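The two necessary components can be demonstrated with expected counts. In the sketch below, neither subpopulation has any allele-disease association, yet pooling them produces an odds ratio well above 1. The allele frequencies, prevalences and sample sizes are invented for illustration, and the function names are hypothetical.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a = cases with allele, b = cases without,
    c = controls with allele, d = controls without.
    """
    return (a * d) / (b * c)


def expected_table(n_alleles, allele_freq, prevalence):
    """Expected allele counts when allele and disease are independent."""
    cases = n_alleles * prevalence
    controls = n_alleles * (1.0 - prevalence)
    return (cases * allele_freq, cases * (1.0 - allele_freq),
            controls * allele_freq, controls * (1.0 - allele_freq))


# Hypothetical subpopulations differing in BOTH allele frequency and prevalence.
pop1 = expected_table(10000, allele_freq=0.10, prevalence=0.05)
pop2 = expected_table(10000, allele_freq=0.50, prevalence=0.20)
pooled = tuple(x + y for x, y in zip(pop1, pop2))

print("OR in population 1:", round(odds_ratio(*pop1), 3))
print("OR in population 2:", round(odds_ratio(*pop2), 3))
print("OR in pooled sample:", round(odds_ratio(*pooled), 3))
```

If either the allele frequency or the prevalence were the same in both groups, the pooled odds ratio would return to 1, which is why both components are needed for confounding.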
Stratification Happens
Strategies to deal with it
- Self-Reported Ancestry
  - Match (design) or Adjust (analysis)
- Use other genetic markers (ancestry informative)
  - Genomic Control (Devlin – U of Pittsburgh)
  - STRUCTURE (Pritchard – U of Chicago)
  - Eigenstrat (Reich – Broad Institute/Harvard)
- Use a family-based design
Transmission Disequilibrium Test (TDT)
Parents: AB x AB; Offspring: AB
Father - “A” was transmitted and “B” wasn’t
Mother - “B” was transmitted and “A” wasn’t
Transmission Disequilibrium Test (TDT)
Parents: AB x AB; Offspring: AB
                 Offspring
Parent           AA     AB     BB
AAxAA
AAxAB
AAxBB
ABxAB            0      1      0
ABxBB
BBxBB
Transmission Disequilibrium Test (TDT)
                     Not Transmitted
                     A       B
Transmitted    A     nAA     nAB
               B     nBA     nBB
Transmission Disequilibrium Test (TDT)
Parents: AB x AB; Offspring: AB
                     Not Transmitted
                     A       B
Transmitted    A     0       1
               B     1       0
TDT
TDT = (nBA − nAB)^2 / (nBA + nAB) ~ χ^2 with 1 df
nAB = number of parents transmitting A and not B; nBA = number of parents transmitting B and not A
McNemar Test for Matched-Pair Data
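The TDT statistic is easy to compute once the discordant transmission counts are tallied. The sketch below uses only the standard library; the function name and the example counts (60 and 40) are hypothetical.

```python
import math


def tdt(n_ab, n_ba):
    """TDT (McNemar) statistic and p-value on 1 degree of freedom.

    n_ab = heterozygous parents transmitting A and not B,
    n_ba = heterozygous parents transmitting B and not A.
    """
    informative = n_ab + n_ba
    if informative == 0:
        raise ValueError("no informative (heterozygous) transmissions")
    statistic = (n_ba - n_ab) ** 2 / informative
    # Upper tail of a chi-square with 1 df, via the complementary error function.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts: A transmitted 60 times, B transmitted 40 times.
statistic, p_value = tdt(n_ab=60, n_ba=40)
print(f"TDT = {statistic:.2f}, p = {p_value:.4f}")
```

Only heterozygous parents contribute, since a homozygous parent transmits the same allele regardless; the single trio on the slides (nAB = 1, nBA = 1) gives a statistic of 0.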
Generalized Extensions
Multiple Offspring
Missing Parents
Non-Binary Phenotypes
- Quantitative, time-to-onset, ordinal…
Generalized Extensions
FBAT/PBAT (Laird/Lange - Harvard)
QTDT (Abecasis/Cardon - Michigan)
PDT (Monks/Kaplan - Duke)
Gene-Mapping
Monogenic ‘Mendelian’ Diseases
- Rare disease
- Rare variants
- Highly penetrant
Complex Disease
- Rare/Common disease
- Rare/Common variants
- Variable penetrance
Gene-Mapping
Monogenic ‘Mendelian’ Diseases: Linkage!
Complex Disease: Association
Genetic Discovery
Evidence for genetic effects? Familial aggregation
Mode of inheritance? Segregation Analysis
What chromosome / region? Linkage Analysis
Where in the region? Fine Mapping
What gene? Association Analysis
What is the effect of the gene? Characterization
Gene-Mapping
Where in the genome (1980s - 2005)?
- Linkage
Where in the genome (2006 - )?
- Association
Relative Power*
MAF     Prevalence    LINKAGE (NL)    ASSOCIATION (NA)
0.05    0.01          67,219          2,278
0.05    0.20          207,635         2,448
0.20    0.01          8,067           659
0.20    0.20          22,385          700
MAF = Minor allele frequency
NL = Number of affected sibling pairs
NA = Number of case-control pairs
Odds Ratio = 1.5
*Adapted from Roeder et al, Am J Hum Genet (2006)
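To make the contrast in the table concrete, the ratio of required sample sizes (NL / NA) can be computed for each row. The numbers are those in the table above; the variable names are hypothetical.

```python
# (MAF, prevalence, affected sib pairs for linkage, case-control pairs for association)
power_table = [
    (0.05, 0.01, 67219, 2278),
    (0.05, 0.20, 207635, 2448),
    (0.20, 0.01, 8067, 659),
    (0.20, 0.20, 22385, 700),
]

ratios = []
for maf, prevalence, n_linkage, n_association in power_table:
    ratio = n_linkage / n_association
    ratios.append(ratio)
    print(f"MAF={maf:.2f} prevalence={prevalence:.2f} NL/NA = {ratio:.1f}")
```

In every scenario shown, linkage needs more than ten times as many sib pairs as association needs case-control pairs for an odds ratio of 1.5.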
Rare Disease - Rare Variant (MAF 0.05, Prevalence 0.01): NL = 67,219; NA = 2,278
Common Disease - Rare Variant (MAF 0.05, Prevalence 0.20): NL = 207,635; NA = 2,448
Common Variant - Rare Disease (MAF 0.20, Prevalence 0.01): NL = 8,067; NA = 659
Common Disease - Common Variant (MAF 0.20, Prevalence 0.20): NL = 22,385; NA = 700
The “-omics” Age
c. 1996
- Pre-genomic era
- 100’s of markers
- STRs
c. 2007
- Post-genomic era
- 100,000’s of markers
- SNPs
Available Technology
Platforms available (or coming soon)
- 1 SNP
- Hundreds of SNPs
- Thousands of SNPs
- Hundreds of thousands of SNPs
- Millions of SNPs
Flexibility for Association
- Single Marker
- Candidate Gene
- Whole-Genome
What if we discover that genes have nothing to do with complex phenotypes?
Good News: We may not have to cross that bridge
Replicated Associations
Type II Diabetes
BMI / Obesity
Crohn’s Disease
Age-Related Macular Degeneration (AMD)
Prostate Cancer
Breast Cancer
Heart Disease
Framingham Heart Study and BMI
The SNP is close to (in LD with) INSIG2
- A plausible candidate for obesity
- Responds to insulin
- Involved in triglyceride synthesis
Framingham Heart Study and BMI
Replicated in 4 out of 5 studies
- Childhood sample
- African American sample
- Europe and North America
Wealth of Information
Whole Genome Association using SNPs
- Potentially use all of the data
- Covariates, interactions, effect size, etc.
- Statistical issues abound…
Multiple Comparisons
Which SNPs are “real”?
- 500K Chip
- 25,000 SNPs with p < 0.05 expected by chance alone
Multiple Phenotypes
- 10 Phenotypes, 500K chip
- 5,000,000 comparisons!!!!
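The arithmetic behind these numbers is straightforward to reproduce. The sketch below also shows a Bonferroni-corrected threshold as one common (and conservative) response to multiple testing; Bonferroni is not mentioned in the slides and is included here only as an illustrative assumption, as are the function names.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of tests with p < alpha when every null is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


n_snps = 500_000      # 500K chip
n_phenotypes = 10
total_tests = n_snps * n_phenotypes

print("expected p < 0.05 by chance (one phenotype):",
      expected_false_positives(n_snps))
print("total comparisons (ten phenotypes):", total_tests)
print("Bonferroni threshold, one phenotype:", bonferroni_threshold(n_snps))
print("Bonferroni threshold, ten phenotypes:", bonferroni_threshold(total_tests))
```

Correcting for 500,000 tests moves the per-SNP threshold from 0.05 to roughly one in ten million, and adding phenotypes tightens it further.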
“My name is Matt McQueen and I have a P-value problem”
The smallest p-values
- Most addictive
- We’ve been trained to focus on them
- What do they mean?
  - Truth
  - Chance
  - Bias
What is a phenotype?
If we asked a gene… / If we asked an environmental factor…
[Figure: GENE and ENV each linked to six traits, with the percentage attributed to each trait]
Trait      GENE    ENV
Trait 1    5%      10%
Trait 2    55%     10%
Trait 3    4%      30%
Trait 4    20%     5%
Trait 5    1%      5%
Trait 6    15%     40%
What is a Genotype?
We test SNPs for association because we can
What about epigenetic factors?
- Methylation
- Copy Number Variation
The $1000 Genome
NHGRI
RFA Number
- RFA-HG-06-020
Title
- “The $1000 Genome”
Goal
- Develop technology to enable investigators to sequence an entire human genome for $1000 within 10 years