Imputation for GWAS 6 December 2012. Introduction Imputation describes the process of predicting...

Imputation for GWAS

6 December 2012

Introduction

• Imputation describes the process of predicting genotypes that have not been directly typed in a sample of individuals:
  • missing genotypes at typed variants;
  • genotypes at un-typed variants that are present in an external high-density “reference panel” of phased haplotypes.

• In silico genotypes can be tested for association within a standard generalised linear regression framework.

How does imputation work?

What is the purpose of imputation?

• Increased power. The reference panel is more likely to contain the causal variant (or a better tag) than a GWAS array.

• Fine-mapping. Imputation provides a high-resolution overview of an association signal across a locus.

• Meta-analysis. Imputation allows GWAS typed with different arrays to be combined up to variants in the reference panel.

Increased power and improved fine-mapping resolution

IMPUTEv2 and minimac

• Pre-phasing. Estimate haplotypes at variants typed in the study sample (scaffold).

• Haploid imputation. Study sample haplotypes are considered an unknown path through haplotypes from the reference panel.
  • Hidden Markov model (HMM).
  • Switch probability between reference haplotypes depends on recombination rate.

• Allelic mismatch between reference and observed haplotypes can be incorporated by allowing for a low rate of mutation.

• Less computationally demanding than diploid imputation, which phases and imputes jointly (IMPUTEv1 and MaCH).
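The haploid imputation step above can be illustrated with a toy forward-backward pass over a small reference panel. This is a minimal sketch under stated assumptions, not the IMPUTEv2 or minimac implementation: the function name `impute_haplotype`, the constant switch probability `switch` (a stand-in for a value that would depend on the local recombination rate) and the mismatch rate `err` are all hypothetical.

```python
def impute_haplotype(ref, obs, switch=0.01, err=0.001):
    """Toy copying-model HMM (illustrative only).

    ref: list of reference haplotypes, each a list of 0/1 alleles.
    obs: pre-phased study haplotype, with None at un-typed markers.
    Returns the posterior probability that the copied reference
    haplotype carries allele 1 at each marker.
    """
    n_ref, n_markers = len(ref), len(obs)

    def emit(k, m):
        # Un-typed markers carry no information about the copied haplotype.
        if obs[m] is None:
            return 1.0
        # A small mismatch rate allows for mutation or genotyping error.
        return 1.0 - err if ref[k][m] == obs[m] else err

    def transition(vec):
        # With probability `switch`, jump to a uniformly chosen haplotype
        # (possibly the same one); otherwise keep copying the current one.
        total = sum(vec)
        return [(1.0 - switch) * v + switch * total / n_ref for v in vec]

    def normalise(vec):
        total = sum(vec)
        return [v / total for v in vec]

    # Forward pass.
    fwd = [normalise([emit(k, 0) for k in range(n_ref)])]
    for m in range(1, n_markers):
        moved = transition(fwd[-1])
        fwd.append(normalise([moved[k] * emit(k, m) for k in range(n_ref)]))

    # Backward pass.
    bwd = [[1.0] * n_ref for _ in range(n_markers)]
    for m in range(n_markers - 2, -1, -1):
        weighted = [bwd[m + 1][k] * emit(k, m + 1) for k in range(n_ref)]
        bwd[m] = normalise(transition(weighted))

    # Posterior over copied haplotypes, converted to an allele probability.
    probs = []
    for m in range(n_markers):
        post = normalise([fwd[m][k] * bwd[m][k] for k in range(n_ref)])
        probs.append(sum(post[k] * ref[k][m] for k in range(n_ref)))
    return probs


# Hypothetical panel of three haplotypes; the third marker is un-typed.
ref = [[0, 1, 0, 0], [1, 0, 1, 1], [1, 1, 0, 0]]
obs = [1, 0, None, 1]
probs = impute_haplotype(ref, obs)
```

In this toy example the study haplotype matches the second reference haplotype at every typed marker, so the imputed probability of allele 1 at the un-typed marker is close to 1.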

Reference panels

• Large-scale genotyping and re-sequencing reference panels made available through HapMap Consortium and 1000 Genomes Project.
  • HapMap2. 60 CEU, 60 YRI and 90 CHB/JPT individuals typed for ~3M variants.
  • HapMap3. 1011 individuals from multiple ethnic groups typed for ~1.6M variants.
  • 1000 Genomes. Most recent release includes 1094 individuals from multiple ethnic groups typed for ~30M variants (including indels).

Choice of reference panel

• Imputation software is designed for use with 1000 Genomes reference panels, but analyses remain computationally demanding.

• Making use of the “all ancestries” reference panel (rather than an ethnic-specific reference panel) improves imputation accuracy for rare variants.

• Formatted reference panels for IMPUTEv2 and minimac can be downloaded from the software websites.

Factors affecting imputation accuracy

• Scaffold. Number of individuals and GWAS array used for genotyping (coverage of variation).

• Reference panel. Number of individuals and density of typing. Similarity of ancestry with study sample.

• Minor allele frequency.

• Pre-phasing or diploid imputation (minimal).

Imputation accuracy

Imputation quality control

• Pre-imputation. Essential that GWAS scaffold excludes poor quality variants. Common to exclude MAF<1% variants.

• Post imputation. Imputation quality assessed by “information measures” in range 0-1.
  • Information measure α in a scaffold of N individuals has equivalent power to αN perfectly genotyped individuals.
  • Typical to filter SNPs by α (exclude <0.8, <0.4).
  • IMPUTEv2 “info score” and minimac r̂².

• In loci identified through imputation, important to check quality of typed SNPs in the scaffold in the region by visual inspection of cluster plots.

Analysis of imputed genotypes

• For each individual, imputation provides probability distribution of possible genotypes at each un-typed variant from the reference panel.

• Using best guess genotype, or filtering on probability of best guess genotype can increase false positives and reduce power.

• Convert probabilities to “expected allele count”, i.e. p1+2p2.

• Fully take account of the uncertainty in the imputation in a “missing data likelihood”.

• Software: SNPTEST2 (for IMPUTEv2) and Mach2Dat (for minimac).
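The difference between an expected allele count and a thresholded best-guess genotype can be sketched as follows. The function names and the 0.9 calling threshold are assumptions for illustration; they are not taken from SNPTEST2 or Mach2Dat.

```python
def expected_allele_count(p0, p1, p2):
    """Expected allele count p1 + 2*p2 from genotype probabilities."""
    if abs(p0 + p1 + p2 - 1.0) > 1e-6:
        raise ValueError("genotype probabilities must sum to 1")
    return p1 + 2.0 * p2


def best_guess(p0, p1, p2, threshold=0.9):
    """Most likely genotype, or None if its probability is below the
    (hypothetical) calling threshold."""
    probs = (p0, p1, p2)
    g = max(range(3), key=lambda i: probs[i])
    return g if probs[g] >= threshold else None


# An uncertain call still contributes a dosage, but is dropped as a hard call.
dosage = expected_allele_count(0.1, 0.6, 0.3)
call = best_guess(0.1, 0.6, 0.3)
```

The expected allele count keeps every individual in the analysis, whereas thresholding discards the uncertain ones, which is one way power is lost with best-guess genotypes.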

Rare variants and complex disease

• Rare variants are likely to have arisen from founder effects in the last few generations.

• Rare variants are expected to have larger effects on complex traits than common variants.

• Statistical methods focus on the accumulation of minor alleles at rare variants (mutational load) within the same functional unit.

• Test of association of phenotype with proportion of rare variants at which individuals carry minor alleles.

• Model disease phenotype via regression on pi and any other covariates in GLM framework.

GRANVIL

Example (one individual, ten rare variants; 1 = carries a minor allele): 1 0 0 0 0 1 0 0 0 1, giving pi = 3/10

Reedik Magi: http://www.well.ox.ac.uk/GRANVIL/
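The burden statistic and its use as a regression covariate can be sketched in a few lines. This is not the GRANVIL code: the helper names, the toy case-control data and the plain Newton-Raphson logistic fit (intercept plus pi, no other covariates) are assumptions made for illustration.

```python
import math


def rare_allele_proportion(carrier_flags):
    """Proportion of rare variants at which an individual carries a
    minor allele (flags are 1 = carrier, 0 = non-carrier)."""
    return sum(carrier_flags) / len(carrier_flags)


def fit_logistic(x, y, iterations=25):
    """Newton-Raphson fit of logit P(y=1) = b0 + b1*x, one predictor."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu          # score for the intercept
            g1 += (yi - mu) * xi   # score for the slope
            h00 += w               # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# The ten carrier indicators from the example: pi = 3/10.
pi = rare_allele_proportion([1, 0, 0, 0, 0, 1, 0, 0, 0, 1])

# Hypothetical per-individual pi values and case (1) / control (0) status.
pis = [0.0, 0.0, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3]
status = [0, 1, 0, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(pis, status)
```

A positive `b1` means that disease risk increases with the proportion of rare variants carried; in practice the model would also include covariates such as principal components.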

Assaying rare genetic variation

• Gold-standard approach to assaying rare genetic variation is through re-sequencing, which is expensive on the scale of the whole genome.

• GWAS genotyping arrays are inexpensive, but are not designed to capture rare genetic variation.

• Increasing availability of large-scale reference panels of whole-genome re-sequencing data: 1000 Genomes Project and the UK10K Project.

• Impute into GWAS scaffolds up to these reference panels to recover genotypes at rare variants at no additional cost, other than computing.

• Test of association of phenotype with proportion of rare variants at which individuals carry minor alleles.

• Replace direct genotypes with posterior probability of heterozygous or rare homozygous call from imputation.

• Model disease phenotype via regression on pi and any other covariates in GLM framework.

GRANVIL: imputed variants

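Example (one individual, posterior probability of carrying a minor allele at each of ten rare variants): 0.9 0.1 0.2 0.1 0.1 0.8 0.1 0.1 0.1 0.6, giving pi = 3.1/10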
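For imputed variants, the carrier indicator is replaced by the posterior probability of a heterozygous or rare homozygous call. A minimal sketch, assuming genotype probabilities are ordered as (common homozygote, heterozygote, rare homozygote); the function name and the example probabilities are hypothetical.

```python
def imputed_rare_allele_proportion(genotype_probs):
    """Average, over rare variants, of the posterior probability that the
    individual carries at least one minor allele (p1 + p2)."""
    carrier_probs = [p1 + p2 for _, p1, p2 in genotype_probs]
    return sum(carrier_probs) / len(carrier_probs)


# Hypothetical genotype probabilities for one individual at four variants.
probs = [
    (0.10, 0.85, 0.05),
    (0.90, 0.10, 0.00),
    (0.80, 0.20, 0.00),
    (0.40, 0.60, 0.00),
]
pi_imputed = imputed_rare_allele_proportion(probs)
```

The resulting value is used in the regression in exactly the same way as pi computed from directly typed genotypes.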

Application to WTCCC

• GWAS of seven complex human diseases from the UK (2000 cases each and 3000 shared controls from the 1958 British Birth Cohort and the National Blood Service):
  • bipolar disorder (BD), coronary artery disease (CAD), Crohn’s disease (CD), hypertension (HT), rheumatoid arthritis (RA), type 1 diabetes (T1D) and type 2 diabetes (T2D).

• Individuals genotyped using the Affymetrix GeneChip 500K Mapping Array Set.

• After quality control, 16,179 samples and 391,060 autosomal SNPs (MAF>1%) carried forward for analysis.

Fine-scale UK population structure

• Fine-scale population structure may have greater impact on rare variants than on common SNPs because of recent founder effects.

• Utilised EIGENSTRAT to construct principal components representing axes of genetic variation across the UK, using 27,770 high-quality, LD-pruned (r²<0.2) common autosomal SNPs (MAF>5%).

Fine-scale UK population structure

Imputation

• SNPs mapped to NCBI build 37 of human genome.

• Samples imputed up to 1000 Genomes Phase 1 cosmopolitan reference panel (June 2011 interim release).

• 8.23M imputed autosomal rare variants (MAF<1%) polymorphic in WTCCC.

• 5.38M (65.3%) were “well-imputed” (i.e. Info score > 0.4) and carried forward for analysis.

• Mean info score was 0.618, and 17.3% had info score > 0.8.

Rare variant analysis

• Test for association of each disease with accumulation of rare variants (MAF<1%) within genes using GRANVIL.

• Gene boundaries defined from UCSC human genome database (build 37).

• Analyses adjusted for three principal components to account for fine-scale UK population structure.

• Genome-wide significance threshold p<1.7×10⁻⁶: Bonferroni adjustment for 30,000 genes.
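The threshold follows from dividing a family-wise error rate by the number of gene-level tests. The 0.05 error rate below is an assumption (it is not stated on the slide, but it is consistent with the quoted threshold), and the function name is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold controlling the family-wise
    error rate at alpha across n_tests tests."""
    return alpha / n_tests


# Assumed family-wise error rate of 0.05 across 30,000 genes.
threshold = bonferroni_threshold(0.05, 30000)
```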

No evidence of residual population structure

Rare variant association with T1D

• Genome-wide significant evidence of association of T1D with rare variants in multiple genes from the MHC.

• Strongest signal of association observed for HLA-DRA (p=2.0×10⁻¹³).
  • Gene contains 23 well imputed rare variants with mean MAF of 0.32%.
  • Accumulations of minor alleles across these variants were associated with decreased risk of disease: odds ratio 0.556 (0.476-0.650) per minor allele.
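As a rough consistency check, the odds ratio and its interval can be converted to a log-odds effect, a standard error and a Wald p-value. This sketch assumes the interval is a 95% confidence interval that is symmetric on the log scale (the slide does not say so) and the function name is hypothetical. Under those assumptions the p-value comes out on the same order of magnitude as the one reported for HLA-DRA.

```python
import math


def wald_from_odds_ratio(odds_ratio, lower, upper, z_crit=1.959964):
    """Recover the log-odds effect, its standard error, the Wald z
    statistic and a two-sided p-value from an odds ratio and its
    (assumed 95%) confidence interval."""
    beta = math.log(odds_ratio)
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return beta, se, z, p


beta, se, z, p = wald_from_odds_ratio(0.556, 0.476, 0.650)

# Under the per-allele model, carrying k minor alleles multiplies the
# odds of disease by 0.556 ** k.
odds_two_alleles = 0.556 ** 2
```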

T1D association across the MHC

[Figure: rare variant association with T1D across the MHC. Labelled genes: PBMUCL2, NCR3, EHMT2, SLC44A4, TNXA, PBX2, AGPAT1, C6orf10, HLA-DRB5, HLA-DRA.]

• Ten genes achieve genome-wide significant evidence of rare variant association with T1D.


[Figure: rare variant association with T1D across the MHC. Labelled genes: PBMUCL2, SKIVL2, EHMT2, SLC44A4, TNXB, PBX2, AGPAT1, HLA-DMA, HLA-DRB5, HLA-DRA.]

• After additional adjustment for the additive effect of the lead GWAS common variant from the MHC (rs9268645).

Comments

• GRANVIL assumes the same direction of effect on the trait of all rare variants within the functional unit.

• Methods allowing for different directions of effect of rare variants are well established for re-sequencing data, and are being generalised to allow for imputation.

• The most powerful rare variant test will depend on the underlying genetic architecture of the trait.
