Abstract - Spiral: Home · Web viewData were normalized using the R package normalize450K [17]...


Training a model for estimating leukocyte composition using whole blood DNA methylation and cell counts as reference

Jonathan A. Heiss,1 Lutz P. Breitling,1,2 Benjamin C. Lehne,3 Jaspal S. Kooner,4,5,6 John C. Chambers,3,4,5 Hermann Brenner1,7,8

1Division of Clinical Epidemiology and Aging Research, German Cancer Research Center (DKFZ), Heidelberg, Germany

2Pneumology and Respiratory Critical Care Medicine, Thorax Clinic, University of Heidelberg, Heidelberg, Germany

3Department of Epidemiology and Biostatistics, Imperial College London, London, UK

4Ealing Hospital NHS Trust, Middlesex, UK

5Imperial College Healthcare NHS Trust, London, UK

6National Heart and Lung Institute, Imperial College London, Hammersmith Hospital, London, UK

7Division of Preventive Oncology, National Center for Tumor Diseases (NCT) and German Cancer Research Center (DKFZ), Heidelberg, Germany

8German Cancer Consortium (DKTK), German Cancer Research Center (DKFZ), Heidelberg, Germany

Address correspondence to Jonathan Alexander Heiss, Division of Clinical Epidemiology and Aging Research, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 581, 69120 Heidelberg, Germany. Telephone 49-6221-421304. Email: [email protected]


Abstract

Aims: Whole blood DNA methylation depends on the underlying leukocyte composition, and the resulting confounding is a major concern in epigenome-wide association studies. Cell counts are often missing, or obtaining them may not be feasible in large-scale studies. Computational approaches can estimate leukocyte composition from DNA methylation based on reference datasets of purified leukocytes. We explored whether such a model can be trained on whole blood DNA methylation and cell counts alone, without the need for purification.

Materials & methods: Using whole blood DNA methylation measurements and corresponding 5-part cell counts from 2,445 participants of the LOLIPOP study, a model was trained on a subset of 175 subjects and evaluated on the remaining ones.

Results: Correlations between cell counts and estimated cell proportions in LOLIPOP were high for neutrophils (0.85), eosinophils (0.88) and lymphocytes (0.84), moderate for monocytes (0.55) and close to zero for basophils (0.02). Estimated cell proportions explained more variance in whole blood DNA methylation levels than cell counts.

Keywords: Leukocyte composition, White blood cell distribution, Estimation of cell proportions, DNA methylation, Infinium 450K, LOLIPOP, KAROLA

Introduction

Blood samples are the most commonly available source of DNA in epidemiological studies, and DNA extracted from whole blood samples is often used in epigenome-wide association studies (EWAS). DNA methylation (DNAm) has been linked to lifestyle factors such as smoking [1] and chronic diseases such as diabetes [2], and was shown to predict all-cause and cardiovascular mortality [3]. Leukocytes show subtype-specific DNAm profiles, and whole blood DNAm depends on the underlying leukocyte composition (LC) [4]. A major concern in EWAS based on whole blood DNAm is that discovered associations may not arise from genuine DNAm changes but may rather reflect shifts in LC. For example, cases of ovarian and head-and-neck cancer differed in their LC compared to cancer-free controls, and adjusting for LC in regression models changed the association between case/control status and whole blood DNAm [5]. Automated cell counting requires fresh blood samples and is therefore not an option for large-scale studies with banked samples collected over many years, especially cohort studies with thousands of participants and blood samples possibly decades old.

Houseman et al. developed an algorithm to infer cell proportions from whole blood DNAm profiles measured on the Illumina Infinium 27K platform, based on a reference set of purified leukocyte types [6]. The algorithm was adapted by Jaffe and Irizarry [7] to the newer 450K platform, again based on a dataset of purified cell types provided by Reinius et al. [4], and is implemented in the minfi R package [8]. Recently, the accuracy of the minfi model was improved by Koestler et al., who used a new algorithm for the selection of markers included in the model [9]. For both the 27K and 450K platforms it is common to use estimated cell proportions from these models to adjust for confounding by LC in statistical analysis [2, 5, 10, 11].

However, reports about the accuracy of these estimates vary widely [9, 12-15]. Whole blood DNAm levels can be thought of as a weighted average of leukocyte-specific methylation levels, which is the foundation for using purified cells as reference. While this linear relation will hold approximately for measurements of methylation levels from the mentioned platforms, which are commonly referred to as β-values, it might be distorted by background noise, an issue common to microarray technology: if we were to compare the β-values of a whole blood sample with the weighted average of β-values of its cell fractions, they would differ due to background noise. Training a model on whole blood β-values might better account for background noise. Our aim was to explore whether a model for LC estimation can be trained on a reference dataset of whole blood samples.
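The weighted-average view can be written out as a short formula. The notation is introduced here for illustration only and is not taken from the cited references: $K$ is the number of leukocyte types, $w_k$ the proportion of type $k$ in a sample, $\beta_{ik}$ the methylation level of probe $i$ in purified cells of type $k$, and $\varepsilon_i$ a term collecting background noise and other deviations from linearity.

```latex
\beta_i^{\mathrm{WB}} = \sum_{k=1}^{K} w_k\,\beta_{ik} + \varepsilon_i,
\qquad w_k \ge 0, \qquad \sum_{k=1}^{K} w_k = 1
```

Reference-based methods take the $\beta_{ik}$ from purified cells and solve for the $w_k$; the approach explored here instead learns the reference profiles from whole blood samples for which the $w_k$ have been measured.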

Methods

Study populations

DNA methylation profiles of whole blood samples and corresponding 5-part blood cell counts from 2,445 participants of a nested case-control study within the London Life Sciences Prospective Population Study (LOLIPOP) cohort were available. Details about this nested case-control study have been reported elsewhere [2]. In brief, all participants in this study from London, UK, were of Indian Asian descent, most were male (n=1,646, 67%) and mean age and standard deviation at enrollment were 50.9±10.1 years. Blood samples were taken at baseline. Cases with incident type 2 diabetes were identified at the 8-year follow-up and controls were matched for sex and age. Written informed consent was obtained from all participants. A second dataset of 37 subjects, recruited in the context of the KAROLA study, was used for external validation. Details on the KAROLA study have been reported elsewhere [16]. In brief, participants with stable coronary heart disease were recruited several weeks up to three months after myocardial infarction or coronary artery revascularization at two cooperating rehabilitation clinics in Southern Germany in 2009-2011. Participants were Caucasian, most were male (n=31) and mean age and standard deviation at enrollment were 63.2±6.7 years. Blood samples were taken at baseline. Written informed consent was obtained from all participants.

Laboratory measurements

DNA methylation of whole blood samples was measured on the Illumina Infinium 450K platform, which queries the methylation levels of 485,512 CpG sites. Data were normalized using the R package normalize450K [17] and methylation levels were expressed as β-values. Cell counts included proportions of neutrophils (NE), eosinophils (EO), basophils (BA), lymphocytes (LY) and monocytes (MO). Cell counts in LOLIPOP were performed using a Sysmex XE-2100 hematology analyzer; cell counts in KAROLA were performed using a Beckman Coulter LH 750 or an Abbott CELL-DYN Sapphire hematology analyzer. Average cell proportions and standard deviations stratified by cell counting device are given in Table 1 and were similar in both study populations, with neutrophils (close to 55%) and lymphocytes (30-35%) being by far the most common cell types.

Statistical analyses

First, we wanted to estimate how many of the 485,512 probes on the 450K chip were associated with LC. As cell proportions are compositional data (they represent relative quantities) [18], cell types were not tested individually for their association with methylation levels. Instead, two linear models were fitted for each probe: one including only an intercept, sex and batch (indicating the 96-well plate on which the sample was run) as independent variables (model A), the other additionally including cell counts of EO, BA, LY and MO (model B). NE were not included, as this proportion can be calculated from the other cell types (proportions sum to 1); including all cell types would falsely increase the degrees of freedom and hence lead to an underestimate of the number of probes associated with LC. As we were not interested in the regression coefficients in this step, it does not matter which cell type is left out, although there might be slight differences due to rounding errors. NE were left out, as this resulted in the model specification with the lowest variance inflation factors among the remaining variables, a useful property for later analysis steps. ANOVA was used to test the significance of the gain in R² from including cell counts. Based on the distribution of p-values from all 485,512 probes, the fraction of probes associated with LC was estimated using the function pi0.est from the siggenes package [19, 20]. This estimate reflects only associations with the main cell types. Although the included cell types can be divided further into subtypes that also show distinctive epigenetic profiles [4], our estimate still gives an impression of the possible magnitude of confounding. All 2,445 LOLIPOP samples were used in this step.
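To illustrate how a fraction of associated probes can be derived from a distribution of p-values, the following sketch applies the fixed-threshold estimator described by Storey and Tibshirani [20]: p-values above a threshold λ are assumed to come almost exclusively from null probes, whose p-values are uniformly distributed, so their density estimates the null fraction π0. This is a simplified stand-in and not necessarily the procedure that pi0.est applies; the function name `estimate_pi0`, the threshold of 0.5 and the toy p-values are assumptions made for illustration.

```python
def estimate_pi0(p_values, lam=0.5):
    """Fixed-lambda estimate of the null fraction pi0 (illustrative helper).

    Null p-values are uniform on [0, 1], so about pi0 * m * (1 - lam)
    of the m p-values are expected to exceed lam.
    """
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    m = len(p_values)
    if m == 0:
        raise ValueError("at least one p-value is required")
    n_above = sum(1 for p in p_values if p > lam)
    pi0 = n_above / (m * (1.0 - lam))
    return min(pi0, 1.0)  # a proportion cannot exceed 1


# Toy data (hypothetical): six small p-values plus four spread over [0, 1].
toy_p = [0.001, 0.002, 0.004, 0.01, 0.02, 0.03, 0.15, 0.45, 0.65, 0.90]
pi0_hat = estimate_pi0(toy_p)
fraction_associated = 1.0 - pi0_hat
```

With π0 estimated in this way, 1 − π0 is the estimated fraction of probes associated with LC.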

Next, we tested whether we could estimate the LC using whole blood DNAm and cell counts as reference. Two 96-well plates with a total of 175 samples from LOLIPOP were used for model training and the remaining samples as test set. Samples from the KAROLA study served as external validation without refitting of the model. We chose a small training set in order to see whether building a new reference set from scratch is a viable option. However, we also trained a model on half the LOLIPOP samples to see whether accuracy would differ. For all probes on the 450K chip, partial correlation coefficients of methylation levels with cell proportions were computed for EO, BA, LY and MO, each coefficient adjusted for the other cell types, sex and batch. NE proportions were not included as a covariate. For each of the four cell types, the 10 probes with the highest absolute partial correlation coefficients were selected (a list of these markers is provided in the supplement). Methylation levels of these 40 markers were regressed on the same set of variables as before (EO, BA, LY, MO, sex and batch; deviation coding was used for the categorical variables), and intercepts $\alpha_i$ and regression coefficients $\beta_i^{EO}$, $\beta_i^{BA}$, $\beta_i^{LY}$, $\beta_i^{MO}$ for $i = 1, \ldots, 40$ were recorded to construct a matrix as follows.

$$
M = \begin{bmatrix}
\alpha_1 & \beta_1^{EO} - \alpha_1 & \beta_1^{BA} - \alpha_1 & \beta_1^{LY} - \alpha_1 & \beta_1^{MO} - \alpha_1 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
\alpha_{40} & \beta_{40}^{EO} - \alpha_{40} & \beta_{40}^{BA} - \alpha_{40} & \beta_{40}^{LY} - \alpha_{40} & \beta_{40}^{MO} - \alpha_{40}
\end{bmatrix}
$$

Using $M$ and the methylation levels of the 40 markers, quadratic programming was applied as described by Houseman et al. [6], with the additional constraint that estimated proportions must sum to 1, to estimate the LC in the test and validation sets. We report Pearson correlation coefficients of measured and estimated cell proportions, as this is the most relevant metric when these estimates are included as covariates in linear models. 95% confidence intervals for correlations were obtained by bootstrapping.
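The constrained estimation step can be sketched as follows: given a reference matrix with one row per marker and one column per cell type, find non-negative proportions that sum to 1 and minimize the squared difference between observed and predicted methylation levels. The sketch below is a minimal illustration that uses projected gradient descent in place of a dedicated quadratic programming solver. The function names, the step-size rule, the iteration count and the toy reference matrix are hypothetical and are not taken from the implementation in normalize450K.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) == 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:
            theta = t  # the largest j meeting the condition sets the shift
    return [max(x - theta, 0.0) for x in v]


def estimate_proportions(reference, y, n_iter=2000):
    """Minimise ||reference * w - y||^2 with w constrained to the simplex.

    reference: list of rows (markers), each a list over cell types
    y:         observed methylation levels, one per marker
    """
    n_markers, n_types = len(reference), len(reference[0])
    # Conservative step: the squared Frobenius norm bounds the largest
    # eigenvalue of reference^T reference, so the iteration cannot diverge.
    step = 1.0 / sum(x * x for row in reference for x in row)
    w = [1.0 / n_types] * n_types  # start from equal proportions
    for _ in range(n_iter):
        residual = [sum(r * wk for r, wk in zip(row, w)) - yi
                    for row, yi in zip(reference, y)]
        gradient = [sum(reference[i][k] * residual[i] for i in range(n_markers))
                    for k in range(n_types)]
        w = project_to_simplex([wk - step * g for wk, g in zip(w, gradient)])
    return w


# Toy example (hypothetical values): 4 markers, 3 cell types.
reference = [
    [0.9, 0.1, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.1, 0.9],
    [0.8, 0.2, 0.5],
]
true_w = [0.6, 0.3, 0.1]
observed = [sum(r * w for r, w in zip(row, true_w)) for row in reference]
estimate = estimate_proportions(reference, observed)
```

In this noise-free toy setting the estimate recovers the proportions used to simulate the sample; with real data, measurement noise means the recovered proportions are only approximate.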

Furthermore, we estimated the proportion of variance of whole blood DNAm levels that could be explained by cell counts or cell proportion estimates. For each of the 485,512 probes we fitted three linear models using the data from the LOLIPOP test set. The first two models had the same specification as models A and B, and in a third model the cell counts in model B were replaced by their corresponding estimates (model C). We computed the increase in R² from model A to B and from model A to C, representing the proportion of variance that could be explained by cell counts and cell proportion estimates, respectively.
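In symbols, and using notation introduced here only for illustration, the two quantities compared for each probe $i$ are:

```latex
\Delta R^2_{i,\mathrm{counts}} = R^2_{i,B} - R^2_{i,A},
\qquad
\Delta R^2_{i,\mathrm{estimates}} = R^2_{i,C} - R^2_{i,A}
```

Since model C is described as model B with the cell counts replaced by their estimates, both models add the same number of covariates to model A, so the two increases can be compared directly for each probe.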

Recently, Koestler et al. reported an improved accuracy of cell proportion estimates by using a new algorithm for the selection of markers for their model [9]. We built a model according to the description provided by Koestler et al. (using the list of 300 markers provided by Koestler et al. and the data provided by Reinius et al. [4]) for comparison with our custom model. The output from the Koestler model provides proportions for granulocytes (GR), CD8+ T-cells, CD4+ T-cells, natural killer cells, CD19+ B lymphocytes and monocytes. To compare measured and estimated proportions, the estimated proportions for CD8+ T-cells, CD4+ T-cells, natural killer cells and CD19+ B lymphocytes were collapsed into a lymphocyte type, and the measured proportions for neutrophils, eosinophils and basophils were collapsed into a granulocyte type. We also created a granulocyte type from the estimated cell proportions of our custom model. Again, we computed Pearson correlation coefficients between measured and estimated cell proportions.
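The comparison step can be illustrated with a short sketch: subtype proportions are summed into a combined type, and measured and estimated proportions are then compared by Pearson correlation, with a percentile bootstrap interval as one common way to obtain confidence limits. The function names, the toy values, the number of resamples and the choice of the percentile method are assumptions for illustration; the text does not state which bootstrap variant was used.

```python
import math
import random


def collapse(sample, parts):
    """Sum the proportions of several cell types into one combined type."""
    return sum(sample[p] for p in parts)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci(x, y, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap interval for the Pearson correlation."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs, ys = [x[i] for i in idx], [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # correlation is undefined for a constant resample
        stats.append(pearson(xs, ys))
    stats.sort()
    lower = stats[int((1 - level) / 2 * n_boot)]
    upper = stats[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


# Toy data (hypothetical): measured 5-part counts and estimated granulocytes.
measured = [
    {"NE": 0.55, "EO": 0.03, "BA": 0.01},
    {"NE": 0.60, "EO": 0.02, "BA": 0.01},
    {"NE": 0.48, "EO": 0.05, "BA": 0.00},
    {"NE": 0.65, "EO": 0.01, "BA": 0.01},
    {"NE": 0.52, "EO": 0.04, "BA": 0.01},
    {"NE": 0.58, "EO": 0.02, "BA": 0.00},
]
estimated_gr = [0.60, 0.62, 0.55, 0.66, 0.56, 0.61]
measured_gr = [collapse(s, ("NE", "EO", "BA")) for s in measured]
r = pearson(measured_gr, estimated_gr)
ci_low, ci_high = bootstrap_ci(measured_gr, estimated_gr)
```

Collapsing before correlating keeps the comparison on a common footing when the two sources report different sets of cell types.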

An implementation of our model can be found in the R package normalize450K. Example R code to perform parts of this analysis on a publicly available dataset (GEO accession GSE53840) is provided in the supplement and can easily be adapted to other datasets.


Results

Using the distribution of p-values from ANOVA comparisons of the two linear models with and without cell counts as independent variables (see Figure 1), we estimated that ∼69% of the probes on the 450K chip were associated with leukocyte composition.

Pearson correlation coefficients between measured and estimated cell proportions from our custom model using 40 markers are listed in Table 2, stratified by the device used for cell counting (additional scatterplots are provided in Figure S1). Correlations for neutrophils and lymphocytes were high for all devices, but somewhat higher in KAROLA than in LOLIPOP. Correlations for basophils were close to or even below zero for all devices. Correlations for eosinophils were high for the XE-2100 and CELL-DYN hematology analyzers, but only moderate for the LH 750, whereas for monocytes a high correlation was found only for the LH 750. Using half the LOLIPOP samples to train the model did not improve results. Likewise, in another sensitivity analysis, results for a model using 100 markers for each cell type were overall very similar; only correlations for monocytes were lower (Table S1). Table 3 lists the corresponding correlations for the Koestler model. Overall, results were very similar to those of our model, with the exception of monocytes for CELL-DYN.

Figure 2 shows the proportion of variance of whole blood DNAm levels explained by cell counts and cell proportion estimates, respectively, for the top 10,000 probes when ranked by the former metric. Although this selection of probes favors cell counts, cell proportion estimates explained more variance than cell counts in all instances (on average 16 percentage points more).

Discussion

We explored whether a model for estimating LC can be built from a reference dataset of DNAm profiles of whole blood samples and cell counts. We found that such a model can provide accurate estimates, with performance on a par with a model trained on a dataset of purified cell types. Our model also provides proportions for eosinophils, a cell type relevant for inflammation that is not included in the minfi/Koestler model.

Estimating the LC based on whole blood DNAm has several advantages. It comes at virtually no cost (assuming that DNA methylation data are already available) and provides estimates even for blood samples stored under conditions that no longer allow cell counts by other means [21]. However, reports about the accuracy of these estimates vary widely. The Koestler model provided accurate predictions for granulocytes, lymphocytes and monocytes in our datasets (Table 3). Koestler et al. validated the original Houseman model [6] in a set of 94 peripheral blood mononuclear cell samples and found correlations of 0.61 and 0.60 between measured and estimated proportions of lymphocytes and monocytes, respectively [12]. Yousefi et al. tested the minfi model and found Spearman correlations ranging from low to high for 45 whole blood samples from 12-year-old children (GR 0.77, LY 0.75, MO 0.26), but no correlation at all for 111 samples from newborns (GR -0.05, LY -0.03, MO -0.01) [14]. This is likely due to the age distribution of the study populations: the minfi model is trained on a sample of six males with mean age 38±13.6 years [4], which is more similar to the samples of adults from LOLIPOP and KAROLA than to the samples of children and newborns. This is of great importance, as such LC estimates are already used in study populations of newborns [10, 11] (the most recent release of the minfi package now supports a dedicated reference dataset for cord blood [22]). In another dataset of whole blood samples from 22 pairs of monozygotic twins including counts of minor cell types, predictions of the minfi model were again good (GR 0.75, CD4+ T cells 0.75, CD8+ T cells 0.66, B lymphocytes 0.93, natural killer cells 0.82, MO 0.59), but little information was provided about the study population [15].

We hypothesized that a model trained on whole blood samples might perform better. While whole blood DNAm can be thought of as a weighted average of subtype-specific methylation levels, this is not true for the background noise that is always present in microarray data, and a model trained on whole blood DNAm might account better for this issue. Another possible issue with purification of cell types is that the process could distort DNAm. Our model provided accurate predictions in LOLIPOP even for eosinophils; only for basophils, which account for only approximately 1% of leukocytes, did it fail to provide reasonable predictions. Correlations between measured and estimated cell proportions in LOLIPOP for the two most frequent cell types, neutrophils and lymphocytes (standard deviations of 8.5% and 7.9%), were lower than for eosinophils (standard deviation of 2.6%). This could point to imprecise cell counts, which was even more evident in KAROLA, where correlations for eosinophils were close to 1 for one cell counting device, but only moderate for the other. An explanation for this pattern might be that correlations of measured with estimated cell proportions were limited by the correlations of measured with true cell proportions, meaning that for some device/leukocyte combinations the cell counts might be less precise than the estimates. This interpretation is further supported by the observations that correlations for neutrophils and lymphocytes were lower in the population from which the training set was sampled than in the external validation (where measurements for these two cell types would be more precise), and that in LOLIPOP more variance in the 450K data could be explained by the estimated than by the measured cell proportions. Therefore, both our custom model and the Koestler model might provide more precise proportions for some cell types than the 5-part blood cell counts from the hematology analyzers used in this study. In contrast, much better precision has been reported for the Sysmex XE-2100 hematology analyzer [23], and other devices or methods also outperform the estimates based on DNA methylation (see Fig. S8 and S9 in Accomando et al. [21]). In accordance with two evaluations of the XE-2100 hematology analyzer, which found that eosinophil proportions were more reproducible than monocyte proportions [24, 25], the correlation between estimates and cell counts in LOLIPOP was higher for eosinophils than for monocytes.
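The argument that the observed correlation is capped by the precision of the cell counts can be made explicit under a simple measurement model. The following is an illustrative sketch under assumptions that are not stated in the text: the measured proportion $m$ and the estimated proportion $e$ both depend linearly on the true proportion $t$, with error terms $\varepsilon_m$ and $\varepsilon_e$ that are uncorrelated with $t$ and with each other.

```latex
\begin{aligned}
m &= t + \varepsilon_m, \qquad e = a + b\,t + \varepsilon_e \quad (b > 0),\\
\rho(m, e) &= \frac{b\,\sigma_t^2}{\sigma_m\,\sigma_e}
            = \frac{\sigma_t}{\sigma_m}\cdot\frac{b\,\sigma_t}{\sigma_e}
            = \rho(m, t)\,\rho(t, e) \;\le\; \rho(m, t)
\end{aligned}
```

Under these assumptions, an imprecise cell count (low $\rho(m, t)$) limits the observable correlation regardless of how accurate the estimate is, which is consistent with the pattern described for the different device/leukocyte combinations.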

The LOLIPOP and KAROLA study populations included participants who developed type 2 diabetes or had coronary heart disease. It is unlikely that these conditions had a direct impact on prediction accuracy. For example, there were only 5 markers associated with incident type 2 diabetes in the LOLIPOP dataset [2], and none of them was included in our model or the Koestler model. Likewise, normalization can be ruled out as a nuisance factor: all analyses used raw 450K data as a starting point, and test and reference datasets were normalized together. However, we cannot exclude the possibility that there might be other clinical characteristics, or factors such as sample preparation and storage [26], that could have influenced results.

To train models for the estimation of leukocyte composition from DNAm, a dedicated reference dataset is required for each microarray platform. The 450K platform is now being replaced by the Illumina Infinium MethylationEPIC chip and eventually by whole-genome bisulfite sequencing. Building a reference dataset of purified leukocyte types for every new platform or technology is laborious and expensive. Our approach achieves results similar to those of the Koestler model, but does not require purification of cell types. Instead, we trained a model on DNAm data and cell counts from whole blood samples. While our model likely suffers from the same drawback as the minfi/Koestler model, namely that it has not been trained on a diverse study population and that performance in other groups might be far worse, it can easily be refitted to other study populations, for example when cell counts are available only for a subset of samples. With cell counts that include lymphocyte subtypes such as CD8+ T cells or CD4+ T cells, the model could be extended accordingly. A reference dataset that is heterogeneous with regard to sex, age, race and prevalent diseases of subjects might improve generalizability. Finally, estimates from both models might be combined, for example to additionally adjust for the proportion of eosinophils in EWAS. Compared to reference-free methods, which do not require any kind of reference data, our approach has the advantage of providing cell proportion estimates that have a direct biological interpretation [27].


Despite their limitations, it seems reasonable to assume that adjustment for LC estimates from current models will reduce confounding in many situations. However, in case-control studies investigating outcomes that are known to be associated with shifts in LC, or in studies including very young subjects, the accuracy of current models might still be insufficient. Highly accurate models would be of great importance for the growing number of epigenome-wide association studies and would have a large impact on the validity of findings.

Summary points

Leukocyte composition is an important confounder in epigenome-wide association studies investigating DNA methylation in whole blood samples, but cell counts are often missing.

Computational methods can estimate cell proportions based on whole blood DNA methylation profiles, but existing models require reference datasets of purified leukocyte subtypes.

We trained a model for estimating cell proportions based on a reference dataset of whole blood DNA methylation profiles and cell counts only, without the need for purification of leukocyte subtypes.

We show that the estimated cell proportions from our model explain more variance in whole blood DNA methylation levels than cell counts from common hematology analyzers, indicating that estimated cell proportions are more precise than those cell counts.

Our approach is flexible and our model can easily be trained on other datasets for which existing models do not provide reasonable estimates, e.g. study populations of different age or ethnic composition.

Acknowledgements

The LOLIPOP study is supported by the National Institute for Health Research (NIHR) Comprehensive Biomedical Research Centre Imperial College Healthcare NHS Trust, the British Heart Foundation (SP/04/002), the Medical Research Council (G0601966, G0700931), the Wellcome Trust (084723/Z/08/Z), the NIHR (RP-PG-0407-10371), European Union FP7 (EpiMigrant, 279143) and Action on Hearing Loss (G51). We thank the participants and research staff who made the study possible.


References

1. Zhang Y, Yang R, Burwinkel B, Breitling LP, Brenner H. F2RL3 methylation as a biomarker of current and lifetime smoking exposures. Environ. Health Perspect. 122(2), 131-137 (2014).

2. Chambers JC, Loh M, Lehne B et al. Epigenome-wide association of DNA methylation markers in peripheral blood from Indian Asians and Europeans with incident type 2 diabetes: a nested case-control study. Lancet Diabetes Endocrinol. 3(7), 526-534 (2015).

3. Zhang Y, Schöttker B, Florath I et al. Smoking-Associated DNA Methylation Biomarkers and Their Predictive Value for All-Cause and Cardiovascular Mortality. Environ. Health Perspect. 124(1), (2015).

4. Reinius LE, Acevedo N, Joerink M et al. Differential DNA methylation in purified human blood cells: implications for cell lineage and studies on disease susceptibility. PLoS One 7(7), e41361 (2012).

5. Langevin SM, Houseman EA, Accomando WP et al. Leukocyte-adjusted epigenome-wide association studies of blood from solid tumor patients. Epigenetics 9(6), 884-895 (2014).

6. Houseman EA, Accomando WP, Koestler DC et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics 13 86 (2012).

7. Jaffe AE, Irizarry RA. Accounting for cellular heterogeneity is critical in epigenome-wide association studies. Genome Biol. 15(2), R31 (2014).

8. Aryee MJ, Jaffe AE, Corrada-Bravo H et al. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics 30(10), 1363-1369 (2014).

9. Koestler DC, Jones MJ, Usset J et al. Improving cell mixture deconvolution by identifying optimal DNA methylation libraries (IDOL). BMC Bioinformatics 17 120 (2016).

10. Küpers LK, Xu X, Jankipersadsing SA et al. DNA methylation mediates the effect of maternal smoking during pregnancy on birthweight of the offspring. Int. J. Epidemiol. 44(4), 1224-1237 (2015).

11. Sharp GC, Lawlor DA, Richmond RC et al. Maternal pre-pregnancy BMI and gestational weight gain, offspring DNA methylation and later offspring adiposity: findings from the Avon Longitudinal Study of Parents and Children. Int. J. Epidemiol. 44(4), 1288-1304 (2015).

12. Koestler DC, Christensen B, Karagas MR et al. Blood-based profiles of DNA methylation predict the underlying distribution of cell types: a validation analysis. Epigenetics 8(8), 816-826 (2013).

13. Lehne B, Drong AW, Loh M et al. A coherent approach for analysis of the Illumina HumanMethylation450 BeadChip improves data quality and performance in epigenome-wide association studies. Genome Biol. 16 37 (2015).

14. Yousefi P, Huen K, Quach H et al. Estimation of blood cellular heterogeneity in newborns and children for epigenome-wide association studies. Environ. Mol. Mutagen. 56(9), 751-758 (2015).

15. Waite LL, Weaver B, Day K et al. Estimation of Cell-Type Composition Including T and B Cell Subtypes for Whole Blood Methylation Microarray Data. Front. Genet. 7 (2016).

16. Zhang Q-L, Brenner H, Koenig W, Rothenbacher D. Prognostic value of chronic kidney disease in patients with coronary heart disease: Role of estimating equations. Atherosclerosis 211(1), 342-347 (2010).

17. Heiss JA, Brenner H. Between-array normalization for 450K data. Front. Genet. 6 (2015).

18. Aitchison J. The statistical analysis of compositional data. Blackburn Press, Caldwell, N.J. (2003).

19. Schwender H. siggenes: Multiple testing using SAM and Efron's empirical Bayes approaches. R package version 1.46.40 (2012).

20. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U. S. A. 100(16), 9440-9445 (2003).


21. Accomando WP, Wiencke JK, Houseman E, Nelson HH, Kelsey KT. Quantitative reconstruction of leukocyte subsets using DNA methylation. Genome Biol. 15(3), R50 (2014).

22. Bakulski KM, Feinberg JI, Andrews SV et al. DNA methylation of cord blood cell types: Applications for mixed cell birth studies. Epigenetics 11(5), 354-362 (2016).

23. Park BG, Park C-J, Kim S et al. Comparison of the Cytodiff flow cytometric leucocyte differential count system with the Sysmex XE-2100 and Beckman Coulter UniCel DxH 800. Int. J. Lab. Hematol. 34(6), 584-593 (2012).

24. Tsuda I, Hino M, Takubo T et al. First basic performance evaluation of the XE-2100 haematology analyser. J. Auto. Meth. Manage. Chem. 21(4), 127-133 (1999).

25. Maciel TES, Comar SR, Beltrame MP. Performance evaluation of the Sysmex® XE-2100D automated hematology analyzer. J. Bras. Patol. Med. Lab. 50(1), 26-35 (2014).

26. Imeri F, Herklotz R, Risch L et al. Stability of hematological analytes depends on the hematology analyser used: A stability study with Bayer Advia 120, Beckman Coulter LH 750 and Sysmex XE 2100. Clin. Chim. Acta 397(1-2), 68-71 (2008).

27. Houseman EA, Kelsey KT, Wiencke JK, Marsit CJ. Cell-composition effects in the analysis of DNA methylation array data: a mathematical perspective. BMC Bioinformatics 16, 95 (2015).

Figure legends

Figure 1 Histogram of p-values from all 485,512 probes on the 450K chip testing the association of whole blood DNA methylation levels with leukocyte composition.

Figure 2 Proportion of variance of whole blood DNA methylation levels explained by cell counts (black) and cell proportion estimates (red), respectively, for the top 10,000 probes when ranked by the former metric. Cell proportion estimates explain more variance than cell counts in all 10,000 instances.

Table 1 Average cell proportions and standard deviations stratified by cell counting device.

                                   Cell proportions (%)
Study population  Device    n      NE        EO       BA       LY        MO
LOLIPOP           XE-2100   2,270  54.3±8.5  3.8±2.6  1.2±0.7  34.4±7.9  6.2±2.0
KAROLA            LH 750    22     53.6±8.6  4.1±2.7  0.8±0.9  32.1±7.2  9.6±2.6
KAROLA            CELL-DYN  15     55.5±7.4  4.9±2.5  0.6±0.3  30.1±5.7  8.9±1.2

Abbreviated cell types: NE neutrophils, EO eosinophils, BA basophils, LY lymphocytes, MO monocytes.

Table 2 Pearson correlation coefficients between measured and estimated cell proportions.

Study population  Device    NE                EO                 BA                  GR                LY                MO
LOLIPOP           XE-2100   0.85 (0.83,0.86)  0.88 (0.87,0.89)   0.02 (-0.02,0.06)   0.83 (0.81,0.85)  0.84 (0.82,0.86)  0.55 (0.51,0.58)
KAROLA            LH 750    0.90 (0.74,0.98)  0.47 (-0.03,0.91)  -0.14 (-0.47,0.35)  0.95 (0.90,0.98)  0.96 (0.91,0.99)  0.91 (0.86,0.97)
KAROLA            CELL-DYN  0.94 (0.87,0.98)  0.95 (0.82,0.99)   -0.08 (-0.50,0.42)  0.91 (0.80,0.97)  0.96 (0.89,0.98)  0.42 (0.06,0.74)

Pearson correlation coefficients (with 95% confidence intervals) between measured and estimated cell proportions from our custom model stratified by cell counting device. Correlations for GR are obtained by collapsing proportions of NE, EO and BA into one composite cell type. Abbreviated cell types: NE neutrophils, EO eosinophils, BA basophils, GR granulocytes, LY lymphocytes, MO monocytes.
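
The composite granulocyte (GR) correlation described in this legend can be illustrated with a minimal Python sketch. It assumes that collapsing means summing the NE, EO and BA proportions per sample, for measured and estimated values alike. The function names and the toy values are hypothetical and are not taken from the study, and the confidence intervals are not reproduced because the interval method is not stated here.

```python
import math


def collapse_granulocytes(ne, eo, ba):
    """Sum per-sample NE, EO and BA proportions into one GR proportion (assumed meaning of 'collapsing')."""
    return [n + e + b for n, e, b in zip(ne, eo, ba)]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy proportions (%) for four samples; not study data.
measured_gr = collapse_granulocytes(
    [54.0, 60.0, 48.0, 57.0], [4.0, 2.0, 5.0, 3.0], [1.0, 1.0, 2.0, 0.5]
)
estimated_gr = collapse_granulocytes(
    [55.0, 58.0, 50.0, 56.0], [3.5, 2.5, 4.5, 3.0], [1.0, 1.5, 1.5, 1.0]
)
print(round(pearson_r(measured_gr, estimated_gr), 3))
```

Correlating the summed proportions, rather than averaging the three per-cell-type correlations, is what lets GR be compared directly with models that report granulocytes as a single cell type.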

Table 3 Pearson correlation coefficients between measured and estimated cell proportions from the Koestler model.

Study population  Device    GR                LY                MO
LOLIPOP           XE-2100   0.85 (0.83,0.87)  0.85 (0.83,0.87)  0.50 (0.46,0.54)
KAROLA            LH 750    0.96 (0.88,0.99)  0.96 (0.89,0.99)  0.91 (0.64,0.97)
KAROLA            CELL-DYN  0.97 (0.93,0.99)  0.97 (0.94,0.99)  0.69 (0.26,0.87)

Pearson correlation coefficients (with 95% confidence intervals) between measured and estimated cell proportions from the Koestler model stratified by cell counting device. Abbreviated cell types: GR granulocytes, LY lymphocytes, MO monocytes.

Table S1 Pearson correlation coefficients between measured and estimated cell proportions based on a model using 100 markers for each cell type.

Study population  Device    NE                EO                 BA                  GR                LY                MO
LOLIPOP           XE-2100   0.85 (0.83,0.86)  0.88 (0.87,0.90)   0.01 (-0.03,0.05)   0.84 (0.82,0.86)  0.85 (0.83,0.87)  0.39 (0.35,0.43)
KAROLA            LH 750    0.90 (0.76,0.97)  0.52 (-0.02,0.94)  -0.25 (-0.66,0.02)  0.96 (0.89,0.99)  0.96 (0.88,0.99)  0.83 (0.65,0.94)
KAROLA            CELL-DYN  0.98 (0.94,0.99)  0.96 (0.87,0.99)   -0.05 (-0.56,0.54)  0.96 (0.88,0.99)  0.97 (0.92,0.99)  0.81 (0.38,0.95)

Pearson correlation coefficients (with 95% confidence intervals) between measured and estimated cell proportions from our custom model with 100 markers for each cell type (instead of 10 as in Table 2) stratified by cell counting device. Correlations for GR are obtained by collapsing proportions of NE, EO and BA into one composite cell type. Abbreviated cell types: NE neutrophils, EO eosinophils, BA basophils, GR granulocytes, LY lymphocytes, MO monocytes.
