Statistical Genetics - CU Population CenterStatistical Genetics Matt McQueen Assistant Professor...

196
Statistical Genetics Matt McQueen Assistant Professor Institute for Behavioral Genetics University of Colorado at Boulder

Transcript of Statistical Genetics - CU Population CenterStatistical Genetics Matt McQueen Assistant Professor...

Statistical Genetics

Matt McQueenAssistant Professor

Institute for Behavioral GeneticsUniversity of Colorado at Boulder

Why am I here?

Statistical Genetics - Biodemography

Perspectives…

Perspectives…

Epidemiology

Perspectives…

Biostatistics

Perspectives…

Health Policy

Perspectives…

Environmental Health

Perspectives…

Society, Human Development and Health

The View from Here…

GENES Outcome

Environment

OverviewBackground and Introduction

Linkage and Linkage Disequilibrium

Population Genetics

Linkage Analysis

Association Analysis

OverviewBackground and Introduction

Linkage and Linkage Disequilibrium

Population Genetics

Linkage Analysis

Association Analysis

Statistical GeneticsOtherwise known as:

- Genetic Epidemiology- Genetic Statistics

By definition, “integrative”- Combines epidemiological, statistical, clinical,

genetic and molecular approaches

Genetic Discovery

Evidence for genetic effects? Familial aggregation

Mode of inheritance? Segregation Analysis

Where in the region? Fine Mapping

What chromosome / region? Linkage Analysis

What gene? Association Analysis

What is the effect of the gene? Characterization

Why Hunt for Genes?

Why Hunt for Genes?Disease etiology

Why Hunt for Genes?Disease etiology

Refined diagnosis and/or prognosis

Why Hunt for Genes?Disease etiology

Refined diagnosis and/or prognosis

Drug development

Why Hunt for Genes?Disease etiology

Refined diagnosis and/or prognosis

Drug development

Disease prediction

Challenges

ChallengesField is young and changes rapidly

- Technology drives the science- We test because we can

ChallengesLiterature can be difficult

- Statisticians writing genetic papers- Geneticists writing statistical papers

ChallengesSoftware typically not well-tested or supported

- The cost of being “free”- Use at your own risk!

ChallengesMethods are often oversold

- Consequence of high-pressure field - Rapid development creates sense of urgency

Some TerminologyLocus

- A location in the genome

Gene- A DNA segment characterized by sequence, transcription or

homology

Allele- Different forms of a gene: A, a; B, b

Polymorphism- Allele present in the population with > 5% freq

Mutation- Allele present in the population with < 5% freq

Some TerminologyPhenotype

- Any measurable outcomeQuantitative Trait Locus (QTL)

- A region (gene) that contributes to a phenotypePenetrance (binary, disease phenotypes)

- Prob(Phenotype | Genotype)Heritability (quantitative traits)

- Variance explained by genetic factorsMendelian Disorder

- Diseases influenced by a single geneComplex Trait

- Disease influenced by multiple genes and environment

Pedigree Notation

Male

Female

Founders

Pedigree Notation

Male

Female

Affected

Pedigree Notation

Male

Female

Affected

Deceased

Mendel’s Laws

Mendel’s LawsMendel’s First Law

- Independent Segregation

ggGgg

GgGGGDad

gG

Mom

Mendel’s LawsMendel’s Second Law

- Independent Assortment

Mendel’s LawsWhat do peas have to do with people?

- Underlying principles of statistical genetics!

OverviewA Brief Introduction

Linkage and Linkage Disequilibrium

Population Genetics

Linkage Analysis

Association Analysis

Linkage

LinkageGeneral Idea:

- Describes the relationship between two loci- If two loci are close in proximity

- “linked”

- If two loci are far apart (different chromosomes):- “not linked”

Recombinant Events

RecombinationA1

B1

A2

B2

RecombinationA1

B1

A2

B2

gametes A1

B1

A2

B2

A2

B1

A1

B2

RecombinationA1

B1

A2

B2

gametes A1

B1

A2

B2

A2

B1

A1

B2

θ = Recombination Rate

1−θ2

1−θ2

θ2

θ2probability

No LinkageA1

B1

A2

B2

gametes A1

B1

A2

B2

A2

B1

A1

B2

14

14

14

14probability

θ ~ 0.5

Where have we seen this before?Mendel’s Second Law

- Independent Assortment

LinkageA1

B1

A2

B2

gametes A1

B1

A2

B2

θ ~ 0

12

12

Recombination EventsWhat predicts a recombination event?

What drives the recombination fraction?

Genetic DistanceDefinition:

- The expected number of crossover events between two loci

Units:- Morgans- 1 Morgan = 1 crossover event expected

Genetic Map- A linearly arranged set of loci with genetic distances between

them- Human Autosomes ~ 3900 cM

Linkage Disequilibrium

Linkage DisequilibriumGeneral Idea:

- Describes the relationship between alleles at two loci

- If the alleles at each loci are close in proximity:- “in linkage disequilibrium”

Linkage Disequilibrium

x4x3x2x1Frequency

A2B2A2B1A1B2A1B1Gametes

Linkage Disequilibrium

x4x3x2x1Frequency

A2B2A2B1A1B2A1B1Gametes

pB2=x2+x4pB1=x1+x3pA2=x3+x4pA1=x1+x2Frequency

B2B1A2A1Allele

Linkage Disequilibrium

x4x3x2x1Frequency

A2B2A2B1A1B2A1B1Gametes

pB2=x2+x4pB1=x1+x3pA2=x3+x4pA1=x1+x2Frequency

B2B1A2A1Allele

D = Observed - Expected

D = x1 − pA1pB1

D = x1 − (x1 + x2)(x1 + x3)D = x1x4 − x2x3

Another Common LD Metric

r2 =D2

pA1pA 2 pB1pB 2

Reasons for LDMutation

Population Subdivision

Genetic Drift

Lack of Recombination

Selection

Non-Random Mating

How does linkage relate to linkage disequilibrium?

Linkage and LD

Dt = (1−θ)t D0

After t generations of random mating…

LD is a function of recombination and time (generations)

Linkage and LD

Key Concepts…- Linkage : Location- LD : Alleles- There can be Linkage without LD- There can be LD without Linkage

OverviewBackground and Introduction

Linkage and Linkage Disequilibrium

Population Genetics

Linkage Analysis

Association Analysis

DNA VariationDNA

- Adenine (A)- Guanine (G)- Cytosine (C)- Thymine (T)

DNA double helix- A pairs with T and G pairs with C

Codons- Triplets of bases- 64 possible codons

- 20 amino acids

MutationsPoint

- Substitute one base for another

Deletions- Base removed entirely

Insertions- Base inserted

Duplications- Base and/or sequence duplicated

MutationsPoint

- Substitute one base for another

Deletions- Base removed entirely

Insertions- Base inserted

Duplications- Base and/or sequence duplicated

More on Point MutationsPoint Mutations

- Synonymous- No change in amino acid

- Nonsynonymous- Amino acid change

- Creates a new polymorphic site- “Single Nucleotide Polymorphism” (SNP)

Mutation Becomes PolymorphismInfinite Sites Model

- Each mutation creates a unique polymorphic site- Mutation rate ~ 10-6

Life After MutationMutation is neutral

- Random Genetic Drift- Eventually, the allele will “drift” out

Mutation is harmful- Selective Pressure

- Allele may quickly disappear

Mutation is beneficial- Selective Pressure

- Allele frequency may increase rapidly

Human Genetic History

Human Genetic History

National Geographic: The Genographic Project

Human Genetic History

National Geographic: The Genographic Project

Human Genetic History

National Geographic: The Genographic Project

Human Genetic History

National Geographic: The Genographic Project

Who Are We?

Tim

e

Sequences

Who Are We?

Tim

e

Sequences

Who Are We?

Tim

e

MRCA

Who Are We?All DNA sequences are derived from others

- Every sample has a genealogy

Eventually, all lineages coalesce- Most Recent Common Ancestor (MRCA)

The “older” the genetic history…- The less observed LD (Africans vs European)

The more isolated genetic history…- The more observed LD (Mayan)

Who Are We?What does this have to do with gene-mapping?

Balding (2006) Nat Rev Genet.

OverviewBackground and Introduction

Linkage and Linkage Disequilibrium

Population Genetics

Linkage Analysis

Association Analysis

Linkage AnalysisGene-Mapping

- Manipulate the Properties of Linkage- Using an observed locus (marker) to draw inferences about

an unobserved locus (disease gene)

Family-Based Design- Extended (grandparents, parents and kids)- Nuclear (parents and kids)

- Sibling Pair (no parents and kids)

Goal: Find genomic region “linked” to disease

Linkage Analysis

0 2010 30 40 50 60 70

cM

M1 M2 M3 M4 M5 M6 M7 M8

Disease Gene (unobserved)

Genetic Markers

Genetic Distance

Linkage Analysis

0 2010 30 40 50 60 70

cM

M1 M2 M3 M4 M5 M6 M7 M8

Disease Gene

Linkage Analysis

0 2010 30 40 50 60 70

cM

M1 M2 M3 M4 M5 M6 M7 M8

Disease Gene

Linkage Region

Linkage AnalysisParametric

- Affected / Unaffected- Observed recombination events

Non-Parametric- Affected / Unaffected- Identity-by-Descent (IBD)

“Semi-Parametric”- Quantitative- IBD

MCMC- Any phenotype- IBD

Linkage AnalysisParametric

- Affected / Unaffected- Observed recombination events

Non-Parametric- Affected / Unaffected- Identity-by-Descent (IBD)

“Semi-Parametric”- Quantitative- IBD

MCMC- Any phenotype- IBD

Linkage AnalysisKey Concepts

- Allele Sharing (IBS and IBD)- Linkage Statistics (LOD Score, etc.)

Allele Sharing

Identity by State (IBS)

ac bd

IBS = 0

How many alleles are in common?

Identity by State (IBS)

ac ad

IBS = 1

How many alleles are in common?

Identity by State (IBS)

ac ac

IBS = 2

How many alleles are in common?

Identity by Descent (IBD)

ab cd

ac bd

IBD = 0

How many alleles are common by descent?

Identity by Descent (IBD)

ab cd

ac ad

IBD = 1

How many alleles are common by descent?

Identity by Descent (IBD)

ab cd

ac ac

IBD = 2

How many alleles are common by descent?

IBS and IBD

ab cd

ac bd

IBS = 0IBD = 0

IBS and IBD

ab cd

ac ad

IBS = 1IBD = 1

IBS and IBD

ab cd

ac ac

IBS = 2IBD = 2

Ambiguous IBD

ab cb

bc ab

IBS = 1IBD = 0

IBD Probabilities

00.500.50Avuncular

00.500.50Half-Sibs

00.500.50Grandparent-Grandchild

00.250.75First Cousin

010Parent-Offspring

0.250.500.25Full Sibs

100MZ Twins

π2π1π0Relative PairProbability of Sharing IBD Alleles

IBD and Sibling Pairs

00.500.50Avuncular

00.500.50Half-Sibs

00.500.50Grandparent-Grandchild

00.250.75First Cousin

010Parent-Offspring

0.250.500.25Full Sibs

100MZ Twins

π2π1π0Relative PairProbability of Sharing IBD Alleles

IBD and Sibling PairsUse of Sibling Pairs in linkage analysis

- Affected Sibling Pair (ASP) Design- Binary Trait

- Unascertained Sibling Pair Design- Quantitative Traits

- Ascertained Sibling Pair Design- Quantitative Traits

We look for regions that show deviation of IBD from what is expected under the null

Linkage Analysis of Sibling PairsBasic Idea

- Sibling pairs sharing more alleles IBD than expected at a trait-influencing locus should have more similar phenotypes

Affected Sibling Pairs

ASP DSP USP

If there is a shared genetic component…

P(IBD=0, IBD=1, IBD=2) = 0.25, 0.50, 0.25

Affected Sibling Pairs

100255025Expected

100354520Observed

Total210

Number of Alleles Shared IBD

H0: No LinkageH1: Linkage

Sibling Pairs (Quantitative Traits)

If there is a shared genetic component…

P(IBD=0, IBD=1, IBD=2) = 0.25, 0.50, 0.25

Quantitative TraitsHaseman-Elston Algorithm

- Calculate number of alleles shared IBD and the squared phenotype difference for each sibpair

- Regress squared differences against IBD sharing

E(∆2) =α + βπ

∆ = trait difference between sibsα = regression interceptβ = slopeπ = IBD sharing

0

1

2

3

4

5

6

7

8

9

-0.1 0.4 0.9 1.4 1.9

IBD

Quantitative Traits

β=0

Quantitative Traits

0

1

2

3

4

5

6

7

8

9

-0.1 0.4 0.9 1.4 1.9

IBD

β<0

Linkage Analysis Statistics

The LOD ScoreMorton (1955)Log10 of the ODds for linkageEssentially a Likelihood Ratio

- Likelihood of observed- Likelihood of expected (no linkage, theta=0.5)

Developed in the context of parametric linkage

Common Nonparametric StatisticsMaximum LOD Score

- “MLS” (or MLOD)- ASP design only- GENEHUNTER, ASPEX

Nonparametric Linkage Score- “NPL Score”- Any family design- GENEHUNTER

Kong and Cox LOD Score- “K&C LOD Score”- Derived from the NPL- MERLIN, ALLEGRO

Interpreting Linkage StatisticsTraditional View…

- LOD > 3.0 for genome-wide significance

More Contemporary View…- Simulate for empirically derived significance

Examples from the LiteratureLinkage Analysis

Alcoholism

Reich et al (1998) Am J Med Genet (Neuropsychiatric Genetics)

Antisocial Drug Dependence

Stallings et al (2005) Archives of Gen Psychiatry

Bipolar Disorder

McQueen et al (2005) Am J Hum Genet

OverviewBackground and Introduction

Linkage and Linkage Disequilibrium

Population Genetics

Linkage Analysis

Association Analysis

Association Analysis

Association AnalysisGene-Mapping

- Manipulate the Properties of Linkage Disequilibrium- Using an observed locus (marker) to draw inferences about

an unobserved locus (disease gene)

Fine-Mapping- Refine a linkage region

Candidate-Gene- Evaluate the genetic variation as it relates to an outcome

Goal: Find genomic region and/or genes “associated”with disease

Association AnalysisFamily-Based

- Parent/Offspring Trios- Sibling Pairs- Nuclear Families- Extended Pedigrees

Population-Based- Case-Control- Cohort

Association AnalysisKey Concepts

- Genotype Coding- Population Stratification- Transmission Disequilibrium Test (TDT)- Whole Genome Association

Coding GenotypesAssume a biallelic marker (SNP)There are three possible genotypes

- AA- Aa- aa

Coding Genotypes

Genotype

100Recessive

(A)

110Dominant

(A)

1,0,00,1,00,0,1Genotype

(A)

210Additive

(A)

AAaAaa

Genotype Coding

Marker Score = XAdditive : X = (0, 1 or 2)Dominant : X = (0 or 1)Recessive : X = (0 or 1)

Additive Model

X0 1 2

Y

Dominant Model

X0 1 2

Y

Recessive Model

X0 1 2

Y

Population Stratification

Genetic AssociationsTruth

- Causal locus (direct)- In LD with causal locus (indirect)

Chance- If you test 100 times, you’ll see ~ 5 tests < 0.05- No causal underpinning

Bias- Association is not causal- e.g. Population stratification

StratificationEssentially a confounder!

How does it happen?

Common Cause

G P

A

Ancestry (A) predicts Genotype (G)

Ancestry (A) predicts Phenotype (P)

a.k.a.… Population Stratification

Poor Epidemiologic DesignSource Population?

Two Necessary Components:- Different prevalence (mean) of disease- Different allele frequency

Famous Example

Knowler et al (1988) Am J Hum Genet.

DRD2

Stratification HappensStrategies to deal with it

- Self-Reported Ancestry- Match (design) or Adjust (analysis)

- Use other genetic markers (ancestry informative)- Genomic Control (Devlin – U of Pittsburgh)- STRUCTURE (Pritchard – U of Chicago)- Eigenstrat (Reich – Broad Institute/Harvard)

- Use a family-based design

The TDT

Transmission Disequilibrium Test (TDT)

Transmission Disequilibrium Test (TDT)

AB AB

ABBBAABA

Transmission Disequilibrium Test (TDT)

AB AB

ABBBAABA

Under the null:Equally probable!

Transmission Disequilibrium Test (TDT)

AB AB

AB

Father - “A” was transmitted and “B” wasn’tMother - “B” was transmitted and “A” wasn’t

Transmission Disequilibrium Test (TDT)

AB AB

AB

Offspring

BBxBB

ABxBB

010ABxAB

AAxBB

AAxAB

AAxAA

BBABAAParent

Transmission Disequilibrium Test (TDT)

AB AB

AB

Offspring

BBxBB

ABxBB

010ABxAB

AAxBB

AAxAB

AAxAA

BBABAAParent

nBA

nAA

A

nBBB

nABA

B

Not Transmitted

�Tra

nsm

itted

Transmission Disequilibrium Test (TDT)

AB AB

AB

Offspring

BBxBB

ABxBB

010ABxAB

AAxBB

AAxAB

AAxAA

BBABAAParent

1

0

A

0B

1A

B

Not Transmitted

�Tra

nsm

itted

TDT

nBA

nAA

A

nBBBnABAB

Not Transmitted

�Tra

nsm

itted

TDT =(nBA − nAB )2

nBA + nAB

~ χ12

McNemar Test for Matched-Pair Data

Generalized ExtensionsMultiple OffspringMissing ParentsNon-Binary Phenotypes

- Quantitative, time-to-onset, ordinal…

Generalized ExtensionsFBAT/PBAT (Laird/Lange - Harvard)

QTDT (Abecasis/Cardon - Michigan)

PDT (Monks/Kaplan - Duke)

Population StratificationWhy are Family-Based Designs etc. robust to

population stratification?

Family-Based Data

G P

A

GP1 GP2

Family-Based Data

G P

A

GP1 GP2

Condition on parental genotypes

Family-Based Data

G P

A

GP1 GP2

Condition on parental genotypes

P(G|GP1,GP2,A) = P(G| GP1,GP2)

Paradigm ShiftFrom Linkage to Association

Gene-MappingMonogenic ‘Mendelian’ Diseases

- Rare disease- Rare variants

- Highly penetrant

Complex Disease- Rare/Common disease- Rare/Common variants

- Variable penetrance

Gene-MappingMonogenic ‘Mendelian’ Diseases

- Rare disease- Rare variants

- Highly penetrant

Complex Disease- Rare/Common disease- Rare/Common variants

- Variable penetrance

Linkage!

Gene-MappingMonogenic ‘Mendelian’ Diseases

- Rare disease- Rare variants

- Highly penetrant

Complex Disease- Rare/Common disease- Rare/Common variants

- Variable penetrance Association

Genetic Discovery

Evidence for genetic effects? Familial aggregation

Mode of inheritance? Segregation Analysis

Where in the region? Fine Mapping

What chromosome / region? Linkage Analysis

What gene? Association Analysis

What is the effect of the gene? Characterization

Genetic Discovery

Evidence for genetic effects? Familial aggregation

Mode of inheritance? Segregation Analysis

Where in the region? Fine Mapping

What chromosome / region? Linkage Analysis

What gene? Association Analysis

What is the effect of the gene? Characterization

Gene-MappingWhere in the genome (1980s - 2005)?

- Linkage

Where in the genome (2006 - )?- Association

Foreshadowing the Paradigm Shiftc. 1996

Linkage and Complex Disease

Linkage of Complex TraitsDismal and controversial picture

The Power of Linkage vs Association

Relative Power*

70022,3850.200.20

6598,0670.010.20

2,448207,6350.200.05

2,27867,2190.010.05

ASSOCIATION(NA)

LINKAGE(NL)PrevalenceMAF

MAF = Minor allele frequencyNL = Number of affected sibling pairsNA = Number of case-control pairsOdds Ratio = 1.5

*Adapted from Roeder et al, Am J Hum Genet (2006)

Rare Disease - Rare Variant

70022,3850.200.20

6598,0670.010.20

2,448207,6350.200.05

2,27867,2190.010.05

ASSOCIATION(NA)

LINKAGE(NL)PrevalenceMAF

MAF = Minor allele frequencyNL = Number of affected sibling pairsNA = Number of case-control pairsOdds Ratio = 1.5

*Adapted from Roeder et al, Am J Hum Genet (2006)

Common Disease - Rare Variant

70022,3850.200.20

6598,0670.010.20

2,448207,6350.200.05

2,27867,2190.010.05

ASSOCIATION(NA)

LINKAGE(NL)PrevalenceMAF

MAF = Minor allele frequencyNL = Number of affected sibling pairsNA = Number of case-control pairsOdds Ratio = 1.5

*Adapted from Roeder et al, Am J Hum Genet (2006)

Common Variant - Rare Disease

70022,3850.200.20

6598,0670.010.20

2,448207,6350.200.05

2,27867,2190.010.05

ASSOCIATION(NA)

LINKAGE(NL)PrevalenceMAF

MAF = Minor allele frequencyNL = Number of affected sibling pairsNA = Number of case-control pairsOdds Ratio = 1.5

*Adapted from Roeder et al, Am J Hum Genet (2006)

Common Disease - Common Variant

70022,3850.200.20

6598,0670.010.20

2,448207,6350.200.05

2,27867,2190.010.05

ASSOCIATION(NA)

LINKAGE(NL)PrevalenceMAF

MAF = Minor allele frequencyNL = Number of affected sibling pairsNA = Number of case-control pairsOdds Ratio = 1.5

*Adapted from Roeder et al, Am J Hum Genet (2006)

Why Now?

The “-omics” Agec. 1996

-Pre-genomic era-100’s of Markers

- STRs

c. 2007-Post-genomic era-100,000’s of markers

- SNPs

c. 2007

Available TechnologyPlatforms available (or coming soon)

- 1 SNP- Hundreds of SNPs- Thousands of SNPs- Hundreds of thousands of SNPs- Millions of SNPs

Flexibility for Association- Single Marker- Candidate Gene- Whole-Genome

Examples from the LiteratureWhole Genome Association

What if we discover that genes have nothing to do with complex phenotypes?

What if we discover that genes have nothing to do with complex phenotypes?

Good News: We may not have to cross that bridge

Replicated AssociationsType II DiabetesBMI / ObesityCrohn’s DiseaseAge-Related Macular Degeneration (AMD)Prostate CancerBreast CancerHeart Disease

Framingham Heart Study and BMI

Framingham Heart Study and BMI

The SNP is close (in LD) with INSIG2- A plausible candidate for obesity- Responds to insulin- Involved in trigylceride synthesis

Framingham Heart Study and BMI

Framingham Heart Study and BMIReplicated in 4 out 5 studies

- Childhood sample- African American Sample- Europe and North America

Hot off the press…

Hot off the press…

In Summary…WGA is starting off successful

- More replicated associations in one year…

Statistical GeneticsThe Challenges We Face

Analytic Challenges

Wealth of InformationWhole Genome Association using SNPs

- Potentially use all of the data- Covariates, interactions, effect size, etc.- Statistical issues abound…

Multiple Comparisons

Multiple Comparisons

The 500K People Chip

Multiple ComparisonsWhich SNPs are “real”?

- 500K Chip- 25,000 SNPs with p < 0.05

Multiple Phenotypes- 10 Phenotypes, 500K chip

- 5,000,000 comparisons!!!!

The P-Value Epidemic

“My name is Matt McQueen and I have a P-value problem”

The smallest p-values- Most addictive- We’ve been trained to focus on them- What do they mean?

- Truth- Chance- Bias

Replicated Associations…

Scott et al (2007) Science

The Phenotype Question

What is a phenotype?Depends on who you ask…

What is a phenotype?If we asked a gene…

GENE

Trait 1

Trait 2

Trait 3

Trait 4

Trait 5

Trait 6

5%

55%

4%

20%

1%

15%

What is a phenotype?If we asked an environmental factor…

Trait 1

Trait 2

Trait 3

Trait 4

Trait 5

Trait 6

10%

10%

30%

5%

5%

40%

ENV

What is a phenotype?

GENE

Trait 1

Trait 2

Trait 3

Trait 4

Trait 5

Trait 6

5%

55%

4%

20%

1%

15%

10%

10%

30%

5%

5%

40%

ENV

The Genotype Question

What is a Genotype?

We test SNPs for association because we can

What about epigenetic factors?- Methylation- Copy Number Variation

The Not-So Distant Future

The $1000 GenomeNHGRIRFA Number

- RFA-HG-06-020Title

- “The $1000 Genome”Goal

- Develop technology to enable investigators to sequence an entire human genome for $1000 within 10 years

June 2017Biodemography Short Course

Complete Genome Sequence Analysis