Genome-Wide Association Studies

Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis using 30-10-2014


Page 1: Genome-Wide Association Studies

Genome-Wide Association Studies

Caitlin Collins, Thibaut Jombart

MRC Centre for Outbreak Analysis and Modelling, Imperial College London

Genetic data analysis using 30-10-2014

Page 2: Genome-Wide Association Studies

Outline
• Introduction to GWAS
• Study design
  o GWAS design
  o Issues and considerations in GWAS
• Testing for association
  o Univariate methods
  o Multivariate methods
    • Penalized regression methods
    • Factorial methods

Page 3: Genome-Wide Association Studies


Genomics & GWAS

Page 4: Genome-Wide Association Studies

Genomics & GWAS

The genomics revolution
• Sequencing technology
  o 1977 – Sanger
  o 1995 – 1st bacterial genomes
    • < 10,000 bases per day per machine
  o 2003 – 1st human genome
    • > 10,000,000,000,000 bases per day per machine
• GWAS publications
  o 2005 – 1st GWAS
    • Age-related macular degeneration
  o 2014 – 1,991 publications
    • 14,342 associations

Page 5: Genome-Wide Association Studies

Genomics & GWAS

A few GWAS discoveries…

Page 6: Genome-Wide Association Studies

Genomics & GWAS

So what is GWAS?
• Genome-Wide Association Study
  o Looking for SNPs… associated with a phenotype.
• Purpose:
  o Explain
    • Understanding
    • Mechanisms
    • Therapeutics
  o Predict
    • Intervention
    • Prevention
    • Understanding not required

Page 7: Genome-Wide Association Studies

Genomics & GWAS

Association
• Definition
  o Any relationship between two measured quantities that renders them statistically dependent.
• Heritability
  o The proportion of variance explained by genetics
  o P = G + E + G*E
    • Heritability > 0

[Figure: data matrix of n individuals (cases and controls) by p SNPs]
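The heritability definition above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' method: it drops the G*E term, draws G and E independently with equal variance, and uses a hypothetical helper name (`broad_sense_heritability`).

```python
import random
import statistics


def broad_sense_heritability(genetic_values, phenotypes):
    """Return Var(G) / Var(P): the share of phenotypic variance due to genetics."""
    return statistics.pvariance(genetic_values) / statistics.pvariance(phenotypes)


random.seed(1)
n = 20_000

# Hypothetical simulation: G and E are independent with equal variance and
# there is no G*E interaction, so the expected heritability is 0.5.
G = [random.gauss(0.0, 1.0) for _ in range(n)]
E = [random.gauss(0.0, 1.0) for _ in range(n)]
P = [g + e for g, e in zip(G, E)]

h2 = broad_sense_heritability(G, P)
print(f"estimated heritability: {h2:.3f}")
```

In real data G is not observed directly, so heritability has to be estimated indirectly; the simulation only shows what the ratio means.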

Page 8: Genome-Wide Association Studies

Genomics & GWAS

Page 9: Genome-Wide Association Studies

Genomics & GWAS

Why?
• Environment, Gene-Environment interactions
• Complex traits, small effects, rare variants
• Gene expression levels
• GWAS methodology?

Page 10: Genome-Wide Association Studies


Study Design

Page 11: Genome-Wide Association Studies

Study Design

GWAS design
• Case-Control
  o Well-defined “case”
  o Known heritability
• Variations
  o Quantitative phenotypic data
    • E.g. height, biomarker concentrations
  o Explicit models
    • E.g. dominant or recessive
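The "explicit models" mentioned above usually come down to how a genotype is coded before testing. A minimal sketch, assuming genotypes are stored as the number of copies of the minor allele (0, 1 or 2) and that dominance is defined with respect to that allele; the function name `encode_genotype` is hypothetical.

```python
def encode_genotype(minor_allele_count, model="additive"):
    """Code a genotype (0, 1 or 2 copies of the minor allele) under a genetic model."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor_allele_count must be 0, 1 or 2")
    if model == "additive":
        # Each extra copy of the allele adds the same effect.
        return minor_allele_count
    if model == "dominant":
        # One copy is enough to show the effect.
        return 1 if minor_allele_count >= 1 else 0
    if model == "recessive":
        # Two copies are needed to show the effect.
        return 1 if minor_allele_count == 2 else 0
    raise ValueError(f"unknown model: {model}")


genotypes = [0, 1, 2, 1, 0]  # hypothetical genotypes at one SNP
coded = {
    model: [encode_genotype(g, model) for g in genotypes]
    for model in ("additive", "dominant", "recessive")
}
for model, values in coded.items():
    print(model, values)
```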

Page 12: Genome-Wide Association Studies

Study Design

Issues & Considerations
• Data quality
  o 1% rule
• Controlling for confounding
  o Sex, age, health profile
  o Correlation with other variables
• Population stratification*
• Linkage disequilibrium*

Page 13: Genome-Wide Association Studies

Study Design

Population stratification
• Definition
  o “Population stratification” = population structure
  o Systematic difference in allele frequencies between sub-populations…
    • … possibly due to different ancestry
• Problem
  o Violates assumed population homogeneity, independent observations
    • Confounding, spurious associations
  o Case population more likely to be related than Control population
    • Over-estimation of significance of associations

Page 14: Genome-Wide Association Studies

Study Design

Population stratification II
• Solutions
  o Visualise
    • Phylogenetics
    • PCA
  o Correct
    • Genomic Control
    • Regression on Principal Components of PCA
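Genomic Control, one of the corrections listed above, can be sketched in a few lines. The sketch assumes each SNP has a 1-degree-of-freedom chi-squared statistic, estimates the inflation factor as the median observed statistic divided by the median of the null distribution, and follows the common convention of never letting that factor drop below 1. The function names and input values are hypothetical.

```python
import statistics

# Median of a chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation_factor(chi2_stats):
    """Estimate lambda = median(observed statistics) / median(null distribution)."""
    lam = statistics.median(chi2_stats) / CHI2_1DF_MEDIAN
    return max(lam, 1.0)  # convention: do not inflate statistics when lambda < 1


def genomic_control(chi2_stats):
    """Divide every test statistic by the estimated inflation factor."""
    lam = genomic_inflation_factor(chi2_stats)
    return [stat / lam for stat in chi2_stats]


observed = [0.2, 0.9098728, 3.0, 0.5, 6.0]  # hypothetical per-SNP statistics
lam = genomic_inflation_factor(observed)
corrected = genomic_control(observed)
print(f"lambda = {lam:.3f}")
print([round(stat, 3) for stat in corrected])
```

A lambda well above 1 suggests that stratification (or another systematic bias) is inflating the test statistics across the genome.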

Page 15: Genome-Wide Association Studies

Study Design

Linkage disequilibrium (LD)
• Definition
  o Alleles at separate loci are NOT independent of each other
• Problem?
  o Too much LD is a problem
    • noise >> signal
  o Some (predictable) LD can be beneficial
    • enables use of “marker” SNPs
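Non-independence between two loci is commonly summarised by D (the observed haplotype frequency minus the frequency expected under independence) and by r². A minimal sketch, assuming phased haplotypes with alleles coded 0/1; the function name `ld_stats` and the example data are hypothetical.

```python
def ld_stats(haplotypes):
    """Return (D, r^2) for a list of (allele_at_locus_1, allele_at_locus_2) pairs coded 0/1."""
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n  # frequency of allele 1 at locus 1
    p_b = sum(b for _, b in haplotypes) / n  # frequency of allele 1 at locus 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # zero when the two loci are independent
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Two loci always inherited together: complete LD.
linked = [(1, 1)] * 5 + [(0, 0)] * 5
# Every allele combination equally frequent: no LD.
independent = [(1, 1), (1, 0), (0, 1), (0, 0)]

print(ld_stats(linked))
print(ld_stats(independent))
```

An r² close to 1 is what makes a genotyped "marker" SNP a useful proxy for an untyped neighbour.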

Page 16: Genome-Wide Association Studies


Testing for Association

Page 17: Genome-Wide Association Studies

Testing for Association

Methods for association testing
• Standard GWAS
  o Univariate methods
• Incorporating interactions
  o Multivariate methods
    • Penalized regression methods (LASSO)
    • Factorial methods (DAPC-based FS)

Page 18: Genome-Wide Association Studies

Testing for Association

Univariate methods
• Approach
  o Individual test statistics
  o Correction for multiple testing
• Variations
  o Testing
    • Fisher’s exact test, Cochran-Armitage trend test, Chi-squared test, ANOVA
    • Gold Standard: Fisher’s exact test
  o Correcting
    • Bonferroni
    • Gold Standard: FDR

[Figure: data matrix of n individuals (cases and controls) by p SNPs]
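The two steps of the univariate approach, one test per SNP followed by a multiple-testing correction, can be sketched as follows. The sketch assumes a 2×2 table of allele counts per SNP (cases vs. controls, minor vs. major allele) and uses Benjamini-Hochberg as one concrete FDR procedure, since the slide does not name one. All function names and numbers are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables at least as unlikely as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# One hypothetical SNP: cases carry 3 minor / 1 major allele, controls 1 / 3.
p_single = fisher_exact_two_sided(3, 1, 1, 3)

# Hypothetical p-values from four SNPs.
p_values = [0.01, 0.04, 0.03, 0.5]
print(round(p_single, 4))
print(bonferroni(p_values))
print([round(p, 4) for p in benjamini_hochberg(p_values)])
```

Bonferroni controls the chance of any false positive and is therefore stricter; FDR procedures tolerate a controlled fraction of false positives among the reported hits.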

Page 19: Genome-Wide Association Studies

Testing for Association

Univariate – Strengths & weaknesses

Strengths
• Straightforward
• Computationally fast
• Conservative
• Easy to interpret

Weaknesses
• Multivariate system, univariate framework
• Effect size of individual SNPs may be too small
• Marginal effects of individual SNPs ≠ combined effects

Page 20: Genome-Wide Association Studies

Testing for Association

What about interactions?

Page 21: Genome-Wide Association Studies

Testing for Association

Interactions
• Epistasis
  o “Deviation from linearity under a general linear model”

[Equations: two models contrasted (“vs.”); not preserved in the transcript]

• With p predictors, there are $\binom{p}{k}$ k-way interactions
• p = 1,000,000 → $\binom{p}{2} \approx 5 \times 10^{11}$
  o That’s 500 BILLION possible pair-wise interactions!
• Need some way to limit the number of pairwise interactions considered…
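The size of the interaction search space is easy to check directly. A short sketch using the binomial coefficient from Python's standard library; the helper name `n_interactions` is hypothetical. For p = 1,000,000 SNPs it gives roughly 5 × 10^11 (500 billion) pairs, and for p = 10,000,000 roughly 5 × 10^13.

```python
from math import comb


def n_interactions(p, k=2):
    """Number of distinct k-way interactions among p predictors: p choose k."""
    return comb(p, k)


for p in (1_000_000, 10_000_000):
    print(f"p = {p:,}: {n_interactions(p):,} pairwise interactions")

# Higher-order interactions grow even faster.
print(f"3-way interactions for p = 1,000,000: {n_interactions(1_000_000, 3):.2e}")
```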

Page 22: Genome-Wide Association Studies

Testing for Association

Multivariate methods

[Figure: overview diagram of multivariate methods. Category labels: Penalized Regression, Factorial Methods, Bayesian Approaches, Logic Trees, Neural Networks, Non-parametric Methods]

Methods named in the diagram:
• LASSO penalized regression
• Ridge regression
• The elastic net
• Supervised-PCA
• Sparse-PCA
• DAPC-based FS (snpzip)
• Bayesian Epistasis Association Mapping
• Bayesian partitioning
• Bayesian Logistic Regression with Stochastic Search Variable Selection
• Logic regression
• Monte Carlo Logic Regression
• Logic feature selection
• Modified Logic Regression-Gene Expression Programming
• Genetic Programming for Association Studies
• Genetic programming optimized neural networks
• Parametric decreasing method
• Multi-factor dimensionality reduction method
• Odds-ratio-based MDR
• Restricted partitioning method
• Combinatorial partitioning method
• Random forests
• Set association approach

Page 23: Genome-Wide Association Studies

Testing for Association

Multivariate methods
• Penalized regression methods
  o LASSO penalized regression
• Factorial methods
  o DAPC-based feature selection

Page 24: Genome-Wide Association Studies

Testing for Association

Penalized regression methods
• Approach
  o Regression models → multivariate association
  o Shrinkage estimation → feature selection
• Variations
  o LASSO, Ridge, Elastic net, Logic regression
    • Gold Standard: LASSO penalized regression

Page 25: Genome-Wide Association Studies

Testing for Association

LASSO penalized regression
• Regression
  o Generalized linear model (“glm”)
• Penalization
  o L1 norm
  o Coefficients → 0
  o Feature selection!
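How the L1 penalty pushes coefficients to exactly 0 can be seen in a toy example. This is a sketch under stated assumptions, not the method used in the talk: it swaps the slide's GLM for an ordinary linear model, fits it by coordinate descent with soft-thresholding, assumes centred predictors and response (so no intercept), and minimises (1/(2n))·Σ(y − Xβ)² + lam·Σ|β|. The names `soft_threshold`, `lasso_coordinate_descent` and `lam`, and the data, are hypothetical.

```python
def soft_threshold(rho, lam):
    """Shrink rho towards 0 by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=50):
    """Fit LASSO coefficients by cycling through one coefficient at a time."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho, z = 0.0, 0.0
            for i in range(n):
                # Prediction from every predictor except j.
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            beta[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return beta


# Toy data: three centred, mutually orthogonal predictors.
X = [
    [1, 1, 1],
    [1, -1, -1],
    [-1, 1, -1],
    [-1, -1, 1],
]
# y = 2 * x1 + 0.1 * x2 + 0 * x3: one strong effect, one weak, one absent.
y = [2.1, 1.9, -1.9, -2.1]

beta = lasso_coordinate_descent(X, y, lam=0.5)
print(beta)
```

The strong effect survives in shrunken form, while the weak and absent effects are set to exactly zero; that zeroing is the feature-selection step.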

Page 26: Genome-Wide Association Studies

Testing for Association

LASSO – Strengths & weaknesses

Strengths
• Stability
• Interpretability
• Likely to accurately select the most influential predictors
• Sparsity

Weaknesses
• Multicollinearity
• Not designed for high-p
• Computationally intensive
• Calibration of penalty parameters
  o User-defined variability
• Sparsity

Page 27: Genome-Wide Association Studies

Testing for Association

Factorial methods
• Approach
  o Place all variables (SNPs) in a multivariate space
  o Identify discriminant axis → best separation
  o Select variables with the highest contributions to that axis
• Variations
  o Supervised-PCA, Sparse-PCA, DA, DAPC-based FS
  o Our focus: DAPC with feature selection (snpzip)

Page 28: Genome-Wide Association Studies

Testing for Association

DAPC-based feature selection

[Figure: alleles a–e and individuals in multivariate space; density of individuals along the discriminant axis for Healthy (“controls”) and Diseased (“cases”); bar plot of each allele’s contribution to the discriminant axis]

Page 29: Genome-Wide Association Studies

DAPC-based feature selection

[Figure: density of individuals along the discriminant axis; bar plot of contribution to the discriminant axis for alleles a–e, with the selection threshold marked “?”]

• Where should we draw the line?
  o Hierarchical clustering

Page 30: Genome-Wide Association Studies

Testing for Association

Hierarchical clustering (FS)

[Figure: bar plot of contribution to the discriminant axis for alleles a–e]

Hooray!
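One way to "draw the line" with hierarchical clustering can be sketched as follows. The sketch assumes single linkage and a cut into two clusters, which for one-dimensional values is equivalent to splitting at the largest gap between sorted contributions; the linkage actually used in the talk is not stated. The contribution values and the function name `select_by_largest_gap` are hypothetical and not read from the figure.

```python
def select_by_largest_gap(contributions):
    """Return the names in the high-contribution cluster.

    With single linkage on one-dimensional data, cutting the tree into two
    clusters separates the values at the largest gap between sorted neighbours.
    """
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) < 2:
        return [name for name, _ in ranked]
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps))  # position of the widest gap
    return [name for name, _ in ranked[: cut + 1]]


# Hypothetical contributions of alleles a-e to the discriminant axis.
contributions = {"a": 0.05, "b": 0.40, "c": 0.10, "d": 0.35, "e": 0.10}
selected = select_by_largest_gap(contributions)
print(selected)
```

Because the threshold comes from the data rather than a fixed cut-off, the number of selected SNPs varies from one data set to the next.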

Page 31: Genome-Wide Association Studies

Testing for Association

DAPC – Strengths & weaknesses

Strengths
• More likely to catch all relevant SNPs (signal)
• Computationally quick
• Good exploratory tool
• Redundancy > sparsity

Weaknesses
• Sensitive to n.pca
• N.snps.selected varies
• No “p-value”
• Redundancy > sparsity

Page 32: Genome-Wide Association Studies

Conclusions
• Study design
  o GWAS design
  o Issues and considerations in GWAS
• Testing for association
  o Univariate methods
  o Multivariate methods
    • Penalized regression methods
    • Factorial methods

Page 33: Genome-Wide Association Studies


Thanks for listening!

Page 34: Genome-Wide Association Studies


Questions?