Heritability analysis in the genome wide data - David Balding

Heritability analysis in the genome-wide data David Balding Schools of BioSciences and of Maths & Stats, University of Melbourne and UCL Genetics Institute, London CSIRO Genome to Phenome meeting University of Queensland, 27 March 2015 See also: Speed & Balding “Relatedness in the post-genomic era: is it still useful?” Nat Rev Genet Jan 2015


Page 1: Heritability analysis in the genome wide data - David Balding

Heritability analysis in the genome-wide data

David Balding

Schools of BioSciences and of Maths & Stats, University of Melbourne
and UCL Genetics Institute, London

CSIRO Genome to Phenome meeting
University of Queensland, 27 March 2015

See also: Speed & Balding “Relatedness in the post-genomic era: is it still useful?” Nat Rev Genet Jan 2015

Page 2: Heritability analysis in the genome wide data - David Balding

Relatedness: what is it? how do we measure it?

I Basic unit is simple: all relationships are made up of parent-child links.

I Informally we describe our relationships in terms of shortest path(s) of parent-child steps:
    I siblings are linked by 2 paths of length 2;
    I half-second cousins are linked by one path of length 6.

I Reality is more complex: many lineage paths.
    I different pairs of sibs have different levels of relatedness.

Relatedness (summarised as a one-number “kinship coefficient”) has been fundamental to quantitative genetics:

I Heritability is how much of observed phenotypic variation can be “explained” by kinships;

I Similar mathematics underlies phenotype prediction.

The notion of an exact measure of relatedness underlies much informal discussion and population genetics research.

Page 3: Heritability analysis in the genome wide data - David Balding

Pedigree-based kinship coefficients

I The classical “trick” used to obtain kinship coefficients is to limit attention to a small number of known relationships in a specified pedigree.

I Most important is coancestry θ(A,B), the probability that a random allele from A is Identical by Descent (IBD) with one from B, assuming Mendelian probabilities.

[Figure: pedigree diagram with individuals A, B and C]

$$\theta(A,B) = \sum_X (1 + f_X)\, 2^{-g_X}$$

The sum is over common ancestors $X$ of $A$ and $B$ within the pedigree, and $f_X = \theta(M(X), F(X))$.
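As a rough illustration of how a pedigree θ can be computed, the sketch below uses the standard recursive method for coancestry rather than the path-counting sum on the slide; the two give the same values on a pedigree whose founders are unrelated and non-inbred. The dictionary layout and the names `make_theta` and `ped` are assumptions made for this example only.

```python
from functools import lru_cache


def make_theta(pedigree):
    """Return a function theta(a, b) giving pedigree coancestry.

    `pedigree` maps each individual to a (mother, father) tuple, with
    (None, None) for founders. Assumption: parents are listed before
    their children, so dictionary order is a valid birth order.
    """
    order = {ind: k for k, ind in enumerate(pedigree)}

    @lru_cache(maxsize=None)
    def theta(a, b):
        if a is None or b is None:
            return 0.0  # unknown parent: treated as unrelated to everyone
        if a == b:
            mother, father = pedigree[a]
            return 0.5 * (1.0 + theta(mother, father))  # (1 + f_a) / 2
        if order[a] < order[b]:
            a, b = b, a  # always recurse on the later-born individual
        mother, father = pedigree[a]
        return 0.5 * (theta(mother, b) + theta(father, b))

    return theta


# Hypothetical pedigree: S1 and S2 are full sibs, H1 is their half-sib.
ped = {
    "P1": (None, None), "P2": (None, None), "P3": (None, None),
    "S1": ("P1", "P2"), "S2": ("P1", "P2"),
    "H1": ("P1", "P3"),
}
theta = make_theta(ped)
print(theta("S1", "S2"), theta("S1", "H1"))  # 0.25 0.125
```

Adding further ancestors above P1, P2 and P3 would change the values returned, which is the dependence on the available pedigree discussed on the next slide.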

Page 4: Heritability analysis in the genome wide data - David Balding

Problem 1: θ depends on the pedigree you happen to have available

I For diploids, there is no such thing as a complete pedigree.
    I As more ancestors are added, θ among original pedigree members can only increase and eventually converges to one;
    I so if a complete pedigree were possible, it would be useless.

I There is also no “ideal” pedigree in any other sense.

I Similarly for inbreeding (θ between parents): an inbreeding coefficient depends on the available pedigree, and always increases with increasing pedigree information.

This didn’t matter much in the past because we could only make use of close relatedness, but with genome-wide data we can now “see” relatives separated by 10 or more meioses.

Page 5: Heritability analysis in the genome wide data - David Balding

Problem 2: θ only captures expected, and not realised, genome-sharing

I θ for half-sibs is 0.125, but the 95% CI is (0.092, 0.158).
    I Just 6 parent-child transmissions can result in no DNA remaining from the first parent.
    I Two children may share no DNA from their common great-grandparent:
        I pedigree-related but not DNA-related.

I Conversely, θ = 0 for many pairs of individuals, yet the levels of genome-sharing among “unrelateds” can vary substantially; this has been exploited e.g. for prediction or to estimate SNP heritability.

Page 6: Heritability analysis in the genome wide data - David Balding

Genome sharing between pairs of “regular” relatives

[Figure: fraction of the genome identical by descent (x-axis, 0.00 to 0.35) for pairs of siblings, half-siblings, first cousins, half-cousins, second cousins, half-second cousins, third cousins, half-third cousins, ...]

Page 7: Heritability analysis in the genome wide data - David Balding

Statistics of IBD sharing (update of Donnelly 1983)

Relationship   G  A  θ(A,B) = E[IBD]/4  95% CI          P[IBD > 0]  E[# sr]  E[rl] (Mb)
Sibling        1  2  0.250              (0.204, 0.296)  1.000       85.3     31.3
1/2-sib        1  1  0.125              (0.092, 0.158)  1.000       42.6     ”
Cousin         2  2  0.063              (0.039, 0.089)  1.000       37.1     18.0
1/2-cuz        2  1  0.031              (0.012, 0.055)  1.000       18.5     ”
2nd-cuz        3  2  0.016              (0.004, 0.031)  1.000       13.2     12.6
1/2-2nd-cuz    3  1  0.008              (0.001, 0.020)  0.995        6.6     ”
3rd-cuz        4  2  0.004              (0.000, 0.012)  0.970        4.3      9.7
               5  2  0.001              (0.000, 0.005)  0.675        0.7      7.9
               7  2  (1/2)¹⁴            (0.000, 0.001)  0.098        0.1      5.5
               9  2  (1/2)¹⁸                            0.009        0.0      4.4

G: # generations (we consider a single lineage path of 2G steps); A: # ancestors; sr = shared regions; rl = region length.
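The θ column of the table can be checked by hand: each shared ancestor contributes one lineage path of 2G parent-child steps, and each such path contributes (1/2)^(2G+1) to θ. The short sketch below reproduces that arithmetic; the function name `expected_theta` is an assumption for illustration, and the formula applies only to the simple "regular" relationships listed, with non-inbred ancestors.

```python
def expected_theta(generations, ancestors):
    """Coancestry for relatives joined by `ancestors` lineage paths of 2G steps."""
    return ancestors * 0.5 ** (2 * generations + 1)


# (relationship, G, A) taken from the named rows of the table
rows = [
    ("Sibling", 1, 2), ("1/2-sib", 1, 1),
    ("Cousin", 2, 2), ("1/2-cuz", 2, 1),
    ("2nd-cuz", 3, 2), ("1/2-2nd-cuz", 3, 1),
    ("3rd-cuz", 4, 2),
]
for name, g, a in rows:
    print(f"{name:12s} G={g} A={a} theta={expected_theta(g, a):.4f}")
```

The unnamed rows follow the same rule: G = 7 with A = 2 gives 2 × (1/2)¹⁵ = (1/2)¹⁴, as tabulated.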

Page 8: Heritability analysis in the genome wide data - David Balding

Kinships based on unobserved pedigrees

[Figure: gene pool with allele fractions p and 1−p, from which alleles labelled A and C are sampled]

I Many pop gen models define kinship in terms of excess allele sharing, without reference to a pedigree.

I The excess allele sharing is attributed to an IBD parameter that equals the pedigree θ if individuals come from a finite pedigree with unrelated founders, and if allele probabilities in founders are known.

Pop gen textbooks and practice put much weight on this theory

I but the underpinning assumptions don’t hold;

I negative estimates are frequent yet θ is positive by definition.

Page 9: Heritability analysis in the genome wide data - David Balding

IBD genome segments

Homologous segments from two haploid genomes are (recombination-sense) IBD if there has been no recombination within the segment since their most recent common ancestor (MRCA); mutation is ignored.

Advantages:

I No need for an explicit pedigree and no founder population.

Problems:

I Recombinations cannot always be inferred.

I IBD segments are typically not identical.
    I Easy to identify if the shared segment is large, almost impossible if short; most shared segments are short, even for close relatives.
    I Some speak of distinguishing IBD from LD, but LD is IBD.

I Limited use as a measure of relatedness:
    I two haploid genomes are entirely IBD; relatedness is reflected in the distribution of IBD fragment lengths, which is hard to infer.

Page 10: Heritability analysis in the genome wide data - David Balding

Fragment lengths IBD from 1, 5, 9 and 11 generations ago

[Figure: four histograms of frequency against chunk length (Mb, 0 to 50). Panel titles: Generation 1, mean length 30.3; Generation 5, mean length 7.6; Generation 9, mean length 4.4; Generation 11, mean length 3.6. Caption: parameter estimates for fitted gamma distribution.]

Page 11: Heritability analysis in the genome wide data - David Balding

Distribution of TMRCA given IBD fragment length

[Figure: probability (y-axis, 0.0 to 1.0) against region length in Mb (x-axis, 0.1 to 150), with regions labelled G=1, G=2, G=3, G=4, G=5, G=6 and G>20]

Page 12: Heritability analysis in the genome wide data - David Balding

Consumer genetics and IBD

I Large consumer genetics companies have ∼10⁶ customers genotyped at ∼10⁶ SNPs.

I They are interested in identifying IBD segments in order to infer (remote) pedigree relationships.

I The relationship is usually expressed in terms of one short lineage path (e.g. 3rd cousin, path length = 8) but these cannot be distinguished from many other relationships involving multiple lineage paths.

I Why should a customer prefer a poorly-inferred pedigree relationship to a direct measure of genome similarity?

Similarly, inferred IBD is used for inferences of demographic parameters: but is it needed? Or is it an optimal approach?

Page 13: Heritability analysis in the genome wide data - David Balding

Where are we?

I Many textbook notions about relatedness and kinship coefficients are no longer useful;

I Pedigree-kinships are still regarded as the “gold standard”
    I but they aren’t even very good for many purposes;
    I they were a useful stand-in when we didn’t have genome data.

I Kinship parameter estimated from excess allele sharing suffers from interpretation problems.

I IBD concept can be useful but
    I the binary nature of IBD doesn’t adequately match reality;
    I practical problems with inferring short IBD segments;
    I doesn’t easily lead to a summary measure of kinship.

Only actual genome similarity matters for most purposes, not any pedigree-based concept.

I So how do we measure genome similarity?

Page 14: Heritability analysis in the genome wide data - David Balding

SNP-based measures of genomic similarity

There are many ways to measure the genetic similarity of two individuals from genome-wide genetic markers (SNPs):

I which one is the best?

I is there a natural SNP-based alternative to θ?

One difficulty in humans is that we are all closely related:

I Any two haploid human genomes share over 99.9% sequence identity due to shared ancestry.

I This isn’t evident for SNPs because they are highly polymorphic, but

I measures of similarity can depend sensitively on the Minor Allele Fraction (MAF) spectrum.
    I more low-MAF sites ⇒ more similarity.
    I depends on SNP chip and QC.
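To see why low-MAF sites inflate similarity, consider the simplest case: two alleles drawn independently from a population in which the minor allele has fraction p match with probability p² + (1−p)². The sketch below evaluates this for a few values; the function name is hypothetical, and independent sampling (no relatedness or population structure) is an assumption of the example.

```python
def allele_match_probability(maf):
    """Chance that two independently drawn alleles are the same type."""
    return maf ** 2 + (1.0 - maf) ** 2


for maf in (0.5, 0.1, 0.01):
    print(f"MAF={maf:<5} match probability={allele_match_probability(maf):.4f}")
```

The probability rises from 0.5 at MAF 0.5 towards 1 as MAF approaches 0, so a SNP panel enriched for rare variants makes every pair of individuals look more alike under a raw sharing measure.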

Page 15: Heritability analysis in the genome wide data - David Balding

SNP-based kinships

Two approaches:

I Average haplotype sharing. Useful in some settings, but small (e.g. < 1 Mb) shared fragments are informative yet hard to exploit.

I Genome-wide average of a single-SNP measure.

Single-SNP approach 1: Average allele-sharing

I Code SNP genotypes as 0, 1 and 2. Then

    (0, 0) or (2, 2)             → 1
    (0, 1), (1, 1) and (1, 2)    → 1/2
    (0, 2)                       → 0

I Disagreement about how to code heterozygotes: the highly-influential software PLINK codes (1,1) as 1 not 1/2.
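The scoring rule above can be sketched in a few lines of NumPy. This is an illustrative implementation only, not code from PLINK or any other package; the function name `allele_sharing` and the `het_pair_score` parameter are assumptions, with the parameter switching between the two conventions for a pair of heterozygotes described on the slide.

```python
import numpy as np


def allele_sharing(g_a, g_b, het_pair_score=0.5):
    """Average allele-sharing score for two individuals.

    g_a, g_b: genotypes coded 0/1/2 at the same SNPs.
    het_pair_score: score given to the (1, 1) pair; 0.5 follows the
    table above, 1.0 follows the alternative convention.
    """
    g_a = np.asarray(g_a)
    g_b = np.asarray(g_b)
    # 1 for equal genotypes, 1/2 for a difference of one, 0 for (0, 2)
    score = 1.0 - np.abs(g_a - g_b) / 2.0
    both_het = (g_a == 1) & (g_b == 1)
    score = np.where(both_het, het_pair_score, score)
    return float(score.mean())


a = [0, 1, 2, 1]
b = [0, 1, 0, 2]
print(allele_sharing(a, b))                      # 0.5
print(allele_sharing(a, b, het_pair_score=1.0))  # 0.625
```

Even on four SNPs the two conventions give different kinships, so values from different tools are not directly comparable.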

Page 16: Heritability analysis in the genome wide data - David Balding

Single-SNP approach 2: Average allelic correlation

I Write $G_{Ai}$ for the genotype of A at the $i$th SNP, then

$$\frac{1}{m} \sum_{i=1}^{m} (G_{Ai} - 2p_i)(G_{Bi} - 2p_i) \times [2p_i(1 - p_i)]^{\alpha}$$

when α = −1 this is a genome-wide average of single-SNP sample-size-1 correlation estimates.

I Animal/plant genetics, usually α = 0, human genetics α = −1

I Different values of α correspond to different assumptions about the genome-wide MAF–effect size relationship.

I So choose α to best fit the genetic architecture of the trait(s)?

I Or invent new ways to measure genome similarity that:
    I explain the most variance?
    I provide the best predictive performance?
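A minimal NumPy sketch of the allelic-correlation kinship for a chosen α is given below. It is not taken from any particular software; the function name is hypothetical, and two choices are assumptions of the example: the allele fractions p_i are estimated from the sample itself, and monomorphic SNPs are dropped because their weight is undefined when α is negative.

```python
import numpy as np


def allelic_correlation_kinship(genotypes, alpha=-1.0):
    """Kinship matrix from an (n_individuals, m_snps) array coded 0/1/2."""
    G = np.asarray(genotypes, dtype=float)
    p = G.mean(axis=0) / 2.0            # assumption: p_i estimated from the sample
    keep = (p > 0.0) & (p < 1.0)        # drop monomorphic SNPs
    G, p = G[:, keep], p[keep]
    centred = G - 2.0 * p               # G_Ai - 2 p_i
    weights = (2.0 * p * (1.0 - p)) ** alpha
    # (1/m) * sum_i (G_Ai - 2p_i)(G_Bi - 2p_i) [2p_i(1-p_i)]^alpha
    return (centred * weights) @ centred.T / G.shape[1]


G = np.array([[0, 1, 2, 1],
              [1, 1, 0, 2],
              [2, 0, 1, 1]])
for alpha in (-2.0, -1.0, 0.0, 1.0):
    K = allelic_correlation_kinship(G, alpha)
    print(alpha, np.round(np.diag(K), 3))
```

Because negative α up-weights low-MAF SNPs and positive α down-weights them, the same genotypes yield different kinship matrices, which is why the choice of α amounts to a modelling assumption.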

Page 17: Heritability analysis in the genome wide data - David Balding

Heritability and prediction for α = −2,−1, 0, 1

[Figure: four panels titled Simulation Power −2, Simulation Power −1, Simulation Power 0 and Simulation Power 1. In each panel the x-axis is How Kinship Generated (Ideal, −2, −1, 0, 1), the y-axis is Correlation Squared (0.0 to 0.8), and the two series are Heritability and Prediction.]

Page 18: Heritability analysis in the genome wide data - David Balding

Heritability of 139 mouse traits, various kinship matrices

[Figure: heritability (y-axis, 0.0 to 1.0) for each phenotype (x-axis, 0 to 140) under six kinship matrices. Legend: pow −2 (0.29), pow −1 (0.29), pow 0 (0.29), pow 1 (0.29), PLINK (0.32), IBD (0.31).]

Page 19: Heritability analysis in the genome wide data - David Balding

The mixed model and random-regression formulations

If kinships are of the form $XBX^{T}C$, with $X$ a genotype matrix, then the random-effect model underlying h² and BLUP is equivalent to a linear regression model with random coefficients

$$y_j = \sum_i \beta_i X_{ij} + \varepsilon_j$$

where the summation is over all SNPs, $X_{ij}$ is the genotype of individual $j$ at the $i$th SNP, and the regression coefficients $\beta$ are assumed to be iid Gaussian.

I Often just called “random regression”.

I Equivalent to ridge regression: parameter estimation via maximum penalised likelihood, with Gaussian penalty function.

I h² is the variance explained by the regression
    I many possible regression models and so no “correct” value;
    I no role for a kinship coefficient in this formulation.
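The equivalence between the kinship-based mixed model and ridge regression can be checked numerically. The sketch below is an illustration on simulated placeholder data, assuming the simplest kinship K = XXᵀ/m with centred genotypes and treating the variance components as known; none of the variable names or values come from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 50                                  # individuals, SNPs (toy sizes)
X = rng.integers(0, 3, size=(n, m)).astype(float)
X -= X.mean(axis=0)                            # centre each SNP
y = rng.normal(size=n)                         # placeholder phenotype
sigma2_beta, sigma2_e = 0.01, 0.5              # assumed variance components

# Random regression / ridge: penalised estimate of the SNP effects
lam = sigma2_e / sigma2_beta
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
g_ridge = X @ beta_hat

# Mixed model: BLUP of genetic values using the kinship matrix K
K = X @ X.T / m
sigma2_g = m * sigma2_beta                     # genetic variance implied by the SNP effects
g_blup = sigma2_g * K @ np.linalg.solve(sigma2_g * K + sigma2_e * np.eye(n), y)

print(np.allclose(g_ridge, g_blup))            # True
```

The two fitted vectors agree because X(XᵀX + λI)⁻¹Xᵀ = XXᵀ(XXᵀ + λI)⁻¹ for any λ > 0, so the kinship matrix is a computational device rather than an extra modelling ingredient.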

Page 20: Heritability analysis in the genome wide data - David Balding

SNP-based heritability: is it useful?

I Conclusion so far: h² has no special status; it doesn’t require any notion of kinship, it is just the variance explained by the model, and it depends on how the model is defined:
    I it no longer makes sense to speak of THE heritability of a trait.

I SNP-based h² is hard to interpret
    I it depends on tagging;
    I simplistic interpretations in terms of rare/common causals may be wrong.

But relative values of h² can be useful to assess the contributions of different genetic models and different genomic regions.

Page 21: Heritability analysis in the genome wide data - David Balding

Speed et al. Brain 2014: “heritability” analysis of epilepsy

I Estimated 26% of variance of the liability to “all epilepsy” is attributable to 4 million genotyped and imputed SNPs (after correction for population structure and genotyping errors).

I SNPs near previously-reported epilepsy loci explain only about 4% of variance.

I Can similarly attribute heritability to various functional classifications (up to a margin of error).

I Contribution from different large-scale genomic regions approximately uniform.

I From lack of genome-wide significant SNPs, inferred 100s and probably 1,000s of causal variants.

I Common genetic basis of focal and non-focal epilepsy estimated at around 50% of total:
    I imprecise estimate, but can exclude both 0 and 100%.

I Showed potential for useful prediction of disease progression in single-seizure cases.

Page 22: Heritability analysis in the genome wide data - David Balding

Prediction and kinship

Historically, prediction of phenotype was understood in terms of exploiting relatedness summarised by kinship coefficients:

I mathematically the standard formulation involved a matrix of kinship coefficients, usually understood to be uniquely defined.

Now we have many different kinship coefficients:

I we are free to tailor the kinship coefficient to match the genetic architecture of the trait;

I we don’t need any kinship coefficient, but it can still be useful conceptually and computationally to work with matrices that correspond to correlations;

I we can use multiple different kinship coefficients
    I for example corresponding to different genome regions.

Page 23: Heritability analysis in the genome wide data - David Balding

Prediction of 139 mouse traits, various kinship matrices

[Figure: prediction (r², y-axis, 0.0 to 1.0) for each phenotype (x-axis, 0 to 140) under six kinship matrices. Legend: pow −2 (0.16), pow −1 (0.17), pow 0 (0.17), pow 1 (0.17), PLINK (0.17), IBD (0.14).]

Page 24: Heritability analysis in the genome wide data - David Balding

MultiBLUP: (Speed & Balding, Genome Res, Dec 2014).

MultiBLUP generalises BLUP by using several kinship coefficients, each corresponding to different genomic regions.

[Figure: prediction performance (correlation, y-axis, 0.2 to 0.8) of BLUP and MultiBLUP under effect size scenarios 1, 2 and 3, shown separately for h² = 0.5 and h² = 0.8]
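The idea of combining several region-specific kinship matrices can be sketched as a BLUP with one variance component per region. This is not the MultiBLUP software or its algorithm: the function name, the toy data and the region split are hypothetical, and the variance components are assumed known, whereas in practice they would have to be estimated from the data.

```python
import numpy as np


def multi_kinship_blup(kinships, variances, sigma2_e, y):
    """Per-region and total BLUPs of genetic values.

    kinships:  list of (n, n) kinship matrices, one per genomic region
    variances: assumed genetic variance attached to each region
    sigma2_e:  assumed residual variance
    """
    n = len(y)
    V = sigma2_e * np.eye(n) + sum(s2 * K for s2, K in zip(variances, kinships))
    v_inv_y = np.linalg.solve(V, y)
    regional = [s2 * K @ v_inv_y for s2, K in zip(variances, kinships)]
    return regional, sum(regional)


rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(15, 40)).astype(float)
X -= X.mean(axis=0)                              # centre each SNP
y = rng.normal(size=15)                          # placeholder phenotype
regions = [X[:, :10], X[:, 10:]]                 # two hypothetical genomic regions
kinships = [R @ R.T / R.shape[1] for R in regions]

regional, total = multi_kinship_blup(kinships, variances=[0.3, 0.1], sigma2_e=0.6, y=y)
print(np.round(total[:3], 3))
```

Giving the 10-SNP region three times the variance of the 30-SNP region lets its SNPs carry larger effects; when every SNP is assigned the same variance the result collapses to ordinary single-kinship BLUP.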

Page 25: Heritability analysis in the genome wide data - David Balding

         |------------------------ Current methods ------------------------|
Trait    BLUP   Risk Score (−log10(P))   Stepwise Regression   BSLMM   MultiBLUP
BD       0.27   0.25 (1)                 0.02                  0.27    0.27
CAD      0.13   0.12 (1)                 0.08                  0.15    0.16
CD       0.32   0.28 (1)                 0.18                  0.34    0.36
Ht       0.15   0.14 (1)                 0.00                  0.14    0.17
RA       0.21   0.28 (3)                 0.32                  0.33    0.37
T1D      0.25   0.34 (5)                 0.54                  0.57    0.59
T2D      0.16   0.14 (1)                 0.10                  0.17    0.18
Av.      0.21   0.22                     0.18                  0.28    0.30

Prediction of disease traits from WTCCC 1 data: bold indicates highest predictive accuracy (correlation in cross-validation).

Page 26: Heritability analysis in the genome wide data - David Balding

Conclusions

I There is no “true” measure of kinship between two individuals and there seems no reason in principle e.g. to prefer allele-sharing kinships to allelic-correlation kinships or haplotype-sharing kinships.

I Genome similarity is the key concept.

I Is there a useful canonical definition of relatedness?
    I likely candidates are based on coalescent concepts e.g. genome-wide distribution of times since most recent common ancestor;
    I Rousset (2002): excess of TMRCA density at short times, where “excess” is based on an asymptotic fit; no marker-based estimator so no use in practice.

I We can choose whatever measure of genome similarity best suits our purpose;
    I e.g. choose to optimise model likelihood or predictive accuracy, but overfitting is potentially a serious issue.

I It’s all just a big regression model, and h² is variance explained by the regression.