Outline - University of Edinburgh Web viewIn this project I used Genomic Best Linear Unbiased...

15
Student: Marta Pérez Alcántara Supervisor: Mairead Bermingham Matriculation number: s1361097 Title: A comparison of body mass index and waist-hip ratio in the genomic prediction of obesity associated health risk. Course: PGBI11007 Programme title: MSc/Dip Quantitative Genetics and Genome Analysis 2013/14 Word count of main text: 2497

Transcript of Outline - University of Edinburgh Web viewIn this project I used Genomic Best Linear Unbiased...

Page 1: Outline - University of Edinburgh Web viewIn this project I used Genomic Best Linear Unbiased Predictor (G-BLUP) ... that are in linkage phase with common markers used in this ...

Student: Marta Pérez Alcántara

Supervisor: Mairead Bermingham

Matriculation number: s1361097

Title: A comparison of body mass index and waist-hip ratio in the genomic prediction of obesity as-sociated health risk.

Course: PGBI11007

Programme title: MSc/Dip Quantitative Genetics and Genome Analysis 2013/14

Word count of main text: 2497

Page 2: Outline - University of Edinburgh Web viewIn this project I used Genomic Best Linear Unbiased Predictor (G-BLUP) ... that are in linkage phase with common markers used in this ...

Abstract

Obesity, the abnormal accumulation of fat that that causes a health risk, has become an important concern. There are two main indices of obesity: Body Mass Index (BMI, the most common) and Waist-Hip Ratio (WHR, that reflects better than BMI the health risks). In this project I used Genomic Best Linear Unbiased Predictor (G-BLUP) to predict BMI and WHR within a Croatian population (2159 individuals) and into an independent replication UK population (805 individuals). A GWAS was also conducted using all data to unravel the architecture of the two obesity indices. Results for heritability estimates (with 95% confidence intervals [CI]) were 0.28 (0.14-0.42) and 0.58 (0.36-0.71) for BMI and 0.35 (0.21-0.49) and 0.48 (0.30-0.66) for WHR in the Croatian and UK data respectively. The prediction accuracies of BMI and WHR within the Croatian data were almost identical: 0.10 (0.07-0.14) for BMI and 0.10 (0.07-0.14) for WHR. However, prediction accuracy of WHR dropped when predicting into the UK replication data (0.02 (0.01-0.03)) compared to BMI (0.08 (0.07-0.09)). Our results have demonstrated that assuming the infinitesimal model WHR is as easy to predict as BMI within populations, but more difficult in independent replication studies. This dissimilarity may reflect differences in trait architecture of the two indices.

Introduction

Obesity has become an important health concern in the last three decades. Data from the World Health Organization from 2008 states that over 50% of European adults were overweight and approximately 20% were obese. Obesity is an independent risk factor for a variety of diseases, including: type 2 diabetes, cardiovascular disease and cancer (Van Baal et al., 2008).

Body mass index (BMI), which is the body weight in kilograms divided by the square of the height in centimeters, is the most commonly used index for obesity. BMI measures overall adiposity. However, it does not take into account muscle mass or bone density and higher values of both may cause that a person is misclassified as overweight or obese (Rothman, 2008). The second most common index of obesity is called waist-hip ratio (WHR), and is measured as the value of the circumference of the waist divided by that of the hip. WHR is a measure of abdominal adiposity, and has been independently associated with a higher risk of developing age-related diseases (Heid et al., 2010). Therefore, it could prove a more accurate index of obesity at the population level.

BMI and WHR are highly polygenic traits; influenced by many loci of small effect. Genome wide association studies (GWAS) have identified 14 genes for WHR and 32 for BMI (Speliotes et al, 2010; Heid et al, 2010). However, when GWAS hits are hits are fit in prediction models they accuracies obtained are far below theoretical expectation (square root of the heritability; de los Campos et al., 2010). In the case of BMI, identified GWAS hits account or only for 6-11% of all the genetic variation in that trait (Speliotes et al, 2010; Heid et al, 2010). It has therefore been suggested that all available genomic information should be fitted in prediction models (de los Campos et al., 2010).

Genomic best linear unbiased predictor (G-BLUP), a method that estimates genomic values using a relationship matrix calculated from all available genomic information of individuals (de los Campos et

Page 3: Outline - University of Edinburgh Web viewIn this project I used Genomic Best Linear Unbiased Predictor (G-BLUP) ... that are in linkage phase with common markers used in this ...

al., 2013). G-BLUP assumes an infinitesimal model (i.e. trait genetic variance is explained by a large number of loci of small effects) and that variance is distributed equally among the SNPs (Daetwyler et al., 2010). G-BLUP is therefore particularly suited for genomic prediction of highly polygenic traits such as BMI and WHR. However, prediction accuracies achieved for BMI to date are low (Bermingham et al, 2014). Highly polygenic and moderately heritable, WHR may be easier to predict from genome wide genomic data, and thus may provide a better index of obesity for stratified intervention programs at the individual or population level.

The aim of this project was therefore to determine whether the obesity index WHR is a better determinant for obesity than BMI in the prediction of health risk.

Methodology

Population data. Two datasets were used in this project: a Croatian and a UK replication population samples. The Croatian sample was composed of 2186 individuals from 4 isolated Croatian populations: two samples from the island of Vis, Komiža (809 individuals) and Vis (388 individuals); Korčula (816 individuals) and Split (473 individuals). The UK replication data had 810 individuals sampled from the Orkney Islands, Scotland.

Genotype data. Individuals were genotyped using a dense Illumina SNP array, HumanHap 300v1 for Vis, a mix of HumanHap 300v2 and 370CNV-Quad for ORCADES (Orkney Complex Disease Study), 370CNV-Quad for Korčula and 370CNV-Quadv3 for Split. All SNPs with marker call rate less than 98% and individual call rate lower than 97% were removed. Furthermore, SNPs that deviated from Hardy-Weinberg Equilibrium (p>1 x 10-6) and/or with a minor allele frequency of less than 1% were also excluded. Following quality control 263357 SNPs were available for inclusion in the analysis.Phenotype data. Phenotypic records were adjusted for sex, age and age2 at the population level. Individuals with outlying records (i.e. adjusted phenotypic records three standard deviations away from the mean) were excluded from subsequent analyses. Phenotypes were then standardised to have a mean of zero and variance of one. The normality of the standardised phenotypes was assessed using the Shapiro-Wilk test. The datasets were then combined to generate the final data sets of 2159 Croatian individuals and 805 ORCADES individuals.

Computation of the genomic relationship matrix. The kinship matrix was generated from the genotype

data of all individuals using the “ibs” (identity by state, weighted based on minor allele frequencies)

function in GenABEL (Aulchenko et al., 2007). The formula used to calculate pairwise kinship

coefficients between individuals i and j was:

n

k kk

kjkkikij pp

pgpgn

f1 )1(

))((1

,

where gik is the genotype of the i-th individual at the k-th SNP (coded as 0, 1/2 and 1, for minor allele

Page 4: Outline - University of Edinburgh Web viewIn this project I used Genomic Best Linear Unbiased Predictor (G-BLUP) ... that are in linkage phase with common markers used in this ...

homozygote, heterozygote and common homozygote, respectively), pk is the frequency of the major allele, and n is the number of SNPs (Aulchenko et al., 2007). The genomic relationship matrix (G) was calculated as two times the kinship matrix.

Accounting for population substructure in the genotype data. The data used in this study were from five independent population samples. To account for population substructure that may have arisen from merging the datasets, the top 24 principal components (PC) were extracted from the G using the “mds” (multiple dimensional scaling) function in GenABEL (Aulchenko et al, 2007).

Genome wide association study. Genome-wide rapid association using mixed model and regression (GRAMMAR) was used to test the association of markers to the BMI and WHR indices. This is a two-step approach. The first-step corrects for the population substructure and relatedness among individuals in the data by fitting the top 24 PC and genomic relationship matrix in a linear mixed model. In the second step residuals from the polygenic command were regressed onto SNP minor allele count for both traits using the “mmscore” (Chen and Abecasis, 2007) function in the package GenABEL (Aulchenko et al. 2007). The threshold for genome-wide significance was set to 5 × 10−8. A paired t-test was later performed to assess the difference in phenotypic variance (2pqα2) by the top 40 most highly associated SNPs.

Genomic best linear unbiased prediction (G-BLUP). G-BLUP was performed by fitting the mixed linear model:

y=μ+ Xb+Zu+e,where y is the vector of corrected phenotypes, μ is the mean, X (Z) are the incidence matrices of the PCs (random genetic effects), b is a vector containing PC regression coefficients, u is a vector containing additive genetic effects (u ~ N(0, Gσ2

u)), e is a vector containing the residuals (e ~ N(0, Iσ2e)), G is a

genomic relationship matrix, I is an identity matrix and σ2u (σ2

e) is the additive genetic (residual) variance. The whole genomic heritability (h2) was estimated as σ u

2/(σ a2+σe

2 ). Implementation of G-BLUP and heritability estimation was conducted in the statistical package ASReml (Gilmour et al. 2009).

Cross-validation. Cross-validation was used to evaluate the ability of G-BLUP to predict unobserved phenotypes of BMI and WHR within the Croatian population samples. In this project, I divided the discovery data into ten subsamples, which is a division large enough to ensure precision without but not too computationally demanding. In total, 10% of the Croatian individuals were sampled without replacement the population level, to create ten independent test data sets. The G-BLUP model was then trained on the remaining 90% of the Croatian data. To test the ability of G-BLUP to predict into the independent ORCADES replication data, I used the ten Croatian training datasets to predict the unobserved phenotypes of the ORCADES data. Performance was assessed based on the accuracy of prediction; calculated as the correlation between the observed and predicted phenotypes.

Results

Page 5: Outline - University of Edinburgh Web viewIn this project I used Genomic Best Linear Unbiased Predictor (G-BLUP) ... that are in linkage phase with common markers used in this ...

Data. The distribution of standardised phenotypes from the Croatian and ORCADES data (Figure 1) of BMI was slightly skewed to the right, whereas WHR had a relatively symmetrical distribution around zero. There was no evidence that the WHR distribution departed from normality (P>0.05), but there was for BMI (P<0.05). ORCADES data was more normal than that of Croatia for both phenotypes.

Figure 1. Histograms of standardised body mass index and waist-hip ratio records in Croatian dataset (left) and ORCADES (right).

Figure 2. Manhattan plot of –log10 P-values of association with body mass index (above) and waist-hip ratio (below) across all autosomal chromosomes.

Genome wide association study. There were no genome-wide significant associations with the two obesity indices (the threshold is 5 x 10-8; Figure 2). The quantile-quantile plots of GWAS results for BMI and WHR showed no inflation of P-values (Figure 3). The P-values of association with BMI fall on the line or below the line of expectation under the null hypothesis of no association. In contrast, some P-values of association with WHR exceeded expectation (Figure 3). The top 40 SNPs of WHR explain more phenotypic variance than those of BMI (Figure 4, P=1.46 x 10-9).

Page 6: Outline - University of Edinburgh Web viewIn this project I used Genomic Best Linear Unbiased Predictor (G-BLUP) ... that are in linkage phase with common markers used in this ...

Figure 3. Quantile-quantile plots for body mass index (left) and waist-hip ratio (right) that show the departure of observed vs. expected –log10 P-values under the null hypothesis of no association.

Figure 4. Phenotypic variance (2pqα2) explained by the top 100 SNPs from body mass index (red) and waist-hip ratio (blue).

Heritability estimates. The heritability estimates of the two obesity indices were similar in the Croatian and ORCADES data (95% CI were broadly overlapping); 0.28(0.14-0.42) and 0.58 (0.36-0.71) for BMI and 0.35(0.21-0.49) and 0.48(0.30-0.66) for WHR in the Croatian and UK data respectively.

Prediction performance. The cross validation dataset provided a good representation of the full data, producing relatively stable estimates of heritability across the folds (Figure 5). The prediction accuracy for BMI and WHR were similar (0.10 (0.07-0.14) for both indices) when predicting within in the Croatian data. Interestingly, when predicting into the ORCADES data, the prediction accuracy for BMI was not significantly different (0.08 (0.07-0.09)) from the estimate obtained when predicting within the Croatian data. However, the prediction accuracy of WHR dropped toward zero when predicting into the ORCADES data (0.02 (0.01-0.03)). There was more variance in the prediction across the folds for WHR than for BMI in the Croatian data (Figure 6). In ORCADES the prediction variance seemed more stable for the two indices. This difference may be due to varying degrees of relatedness between individuals in the test and the training data. For instance, estimates of fold 7 of WHR were almost identical (with a prediction accuracy estimate near zero) in Croatian and ORCADES data. This might reflect that

Page 7: Outline - University of Edinburgh Web viewIn this project I used Genomic Best Linear Unbiased Predictor (G-BLUP) ... that are in linkage phase with common markers used in this ...

individuals in fold 7 are less related than those in the training data, just by chance. There seemed to be little difference in the estimates of the prediction accuracy within Croatian and into ORCADES data for BMI, while for WHR there was a larger difference.

Figure 5. Heritability estimates for body mass index (left) and waist-hip ratio (right) from all Croatian data and the cross-validation ten folds.

Figure 6. Prediction accuracies for body mass index (left) and waist-hip ratio (right) in the ten folds when predicting within the Croatian (above) and into the ORCADES (below) data.

Page 8: Outline - University of Edinburgh Web viewIn this project I used Genomic Best Linear Unbiased Predictor (G-BLUP) ... that are in linkage phase with common markers used in this ...

Discussion

In this project, genomic prediction of the obesity indices BMI and WHR within Croatian and into ORCADES replication data was performed using G-BLUP. In this project I wanted to assess whether it is easier to predict WHR than BMI from available genome wide data. A number of genome-wide significant associations have been discovered for WHR and BMI (Speliotes et al., 2010; Heid et al., 2010). In this project however no association reached genome-wide significance. Furthermore, no association with BMI exceeded expectation under the null hypothesis of no association, whereas a number of associations with WHR did. Furthermore, the top 40 markers of WHR explained more phenotypic variance than those of BMI. These results seem to suggest the genetic architecture of the two obesity indices may be different.

The heritability estimates reported in this project are similar to previous estimates from the literature (Heid et al., 2010; Hemani et al., 2013). The heritability estimates were similar in the Croatian and replication ORCADES data sets. The heritabilities for the two indices were higher in the ORCADES data, which could be due to the reduced sample size and the higher degree of relatedness among individuals in the data (Visscher et al., 2008).

The prediction accuracies of the two obesity indices were low. This may be because the training data sample size that is not large enough to estimate individual effects with high accuracy (Wray et al., 2013). In accordance with theoretical expectation, WHR did not achieve a higher accuracy of prediction than BMI when predicting into the Croatian dataset. However, when predicting into the ORCADES data, the accuracy for BMI was relatively unchanged. Conversely, the accuracy of prediction of WHR plummeted when predicting into the replication ORCADES data. A reduction in accuracy is expected when predicting into the ORCADES data due to the lack of relatedness between the individuals from the training and the replication datasets (de los Campos et al., 2013). This does not explain however why the prediction accuracy of BMI was more or less sustained across the two population datasets. In accordance with the results from the GWAS, these results seem to support an argument for a difference in trait architecture between the two obesity indices.

The obesity indices BMI and WHR are highly polygenic traits (Speliotes et al., 2010; Hemani et al., 2013; Heid et al., 2010). Therefore, in this project G-BLUP was chosen, as it assumes the infinitesimal and should show good predictive performance when applied in the prediction of highly polygenic traits (Daetwyler et al., 2010). However, G-BLUP also assumes that variance distributed equally across all SNPs. Reality may contradict this assumption: as some SNPs have a large effect and the others none. Indeed, as in the case of BMI and WHR in this project. If the model we are assuming for our data is correct, G-BLUP will perform well when predicting within populations, however as in the case for WHR in this study the accuracies should drop dramatically when we try to predict into a replication population of unrelated individuals (de los Campos et al., 2010; Bermingham et al., 2014). In the future it may therefore be advisable to use methods that allow for departure from the infinitesimal model, i.e. models that give more variance to certain loci over others (such as methods like Bayes B or C; Clark et al., 2011). Indeed, Daetwyler and colleagues (2010) have demonstrated that Bayes B obtains higher accuracies than G-BLUP for less polygenic traits that are influenced by fewer variant of large effect. Another possibility is that WHR is controlled by population specific rare variants with large effects in the Croatian dataset

Page 9: Outline - University of Edinburgh Web viewIn this project I used Genomic Best Linear Unbiased Predictor (G-BLUP) ... that are in linkage phase with common markers used in this ...

that are in linkage phase with common markers used in this project. These markers therefore cannot contribute to the prediction accuracy in the ORCADES replication data. Future studies could include SNP with MAF below the 1% common SNP threshold applied in this project.

We can conclude that there is no evidence that WHR is easier to predict than BMI from genome-wide genotype data in the G-BLUP framework. However, the results from this study may have uncovered a difference in the trait architecture between the two obesity indices that has not been reported previously.

Page 10: Outline - University of Edinburgh Web viewIn this project I used Genomic Best Linear Unbiased Predictor (G-BLUP) ... that are in linkage phase with common markers used in this ...

References

Aulchenko, Y. S., Ripke, S., Isaacs, A., & Van Duijn, C. M. (2007). GenABEL: an R library for genome-wide association analysis. Bioinformatics, 23(10), 1294-1296.

Bermingham M.L, Pong-Wong R., Spiliopoulou S., Hayward C., Rudan I., Campbell H., Wright A.F., Wilson J.F., Agakov F.V., Navarro P. & Haley C.S. (2014). Application of high-dimensional feature selection: evaluation for genomic prediction in man. American Journal of Human Genetics (in preparation).

Chen, W. M., & Abecasis, G. R. (2007). Family-based association tests for genomewide association scans. The American Journal of Human Genetics,81(5), 913-926.

Clark, S. A., Hickey, J. M., & Van der Werf, J. H. (2011). Different models of genetic variation and their effect on genomic evaluation. Genet Sel Evol, 43, 18.

Daetwyler, H. D., Pong-Wong, R., Villanueva, B., & Woolliams, J. A. (2010). The impact of genetic ar-chitecture on genome-wide evaluation methods. Genetics, 185(3), 1021-1031.

de los Campos, G., Gianola, D., & Allison, D. B. (2010). Predicting genetic predisposition in humans: the promise of whole-genome markers. Nature Reviews Genetics, 11(12), 880-886.

de los Campos, G., Vazquez, A. I., Fernando, R., Klimentidis, Y. C., & Sorensen, D. (2013). Prediction of complex human traits using the genomic best linear unbiased predictor. PLoS genetics, 9(7), e1003608.

Gilmour, A. R., Gogel, B. J., Cullis, B. R., & Thompson, R. (2009). ASReml user guide release 3.0. VSN International Ltd, Hemel Hempstead, UK.

Heid, I. M., Jackson, A. U., Randall, J. C., Winkler, T. W., Qi, L., Steinthorsdottir, V., ... & Yang, J. (2010). Meta-analysis identifies 13 new loci associated with waist-hip ratio and reveals sexual dimorphism in the genetic basis of fat distribution. Nature genetics, 42(11), 949-960.

Hemani, G., Yang, J., Vinkhuyzen, A., Powell, J. E., Willemsen, G., Hottenga, J. J., ... & Visscher, P. M. (2013). Inference of the Genetic Architecture Underlying BMI and Height with the Use of 20,240 Sibling Pairs. The American Journal of Human Genetics, 93(5), 865-875.

Rothman, K. J. (2008). BMI-related errors in the measurement of obesity.International journal of obesity, 32, S56-S59.

Speliotes, E. K., Willer, C. J., Berndt, S. I., Monda, K. L., Thorleifsson, G., Jackson, A. U., ... & Hoesel, V. (2010). Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nature genetics, 42(11), 937-948.

Van Baal, P. H., Polder, J. J., de Wit, G. A., Hoogenveen, R. T., Feenstra, T. L., Boshuizen, H. C., ... & Brouwer, W. B. (2008). Lifetime medical costs of obesity: prevention no cure for increasing health expenditure. PLoS medicine,5(2), e29.

Page 11: Outline - University of Edinburgh Web viewIn this project I used Genomic Best Linear Unbiased Predictor (G-BLUP) ... that are in linkage phase with common markers used in this ...

Visscher, P. M., Hill, W. G., & Wray, N. R. (2008). Heritability in the genomics era—concepts and misconceptions. Nature Reviews Genetics, 9(4), 255-266.

World Health Organization. (2013) Overweight/Obesity: Obesity (body mass index > 30). Data by WHO region. Retrieved March, 2014 from http://apps.who.int/gho/data/view.main.2480?lang=en

World Health Organization. (2013) Overweight/Obesity: Overweight (body mass index > 25). Data by WHO region. Retrieved March, 2014 from http://apps.who.int/gho/data/view.main.2461?lang=en

Wray, N. R., Yang, J., Hayes, B. J., Price, A. L., Goddard, M. E., & Visscher, P. M. (2013). Pitfalls of predicting complex traits from SNPs. Nature Reviews Genetics, 14(7), 507-515.