3 rd Place Winning Project, 2009 USPROC Author: Kinjal Basu Sujayam Saha Sponsor Professor: S. Ghosh...

Estimation of Allele Frequencies from Quantitative Trait Data. 3rd Place Winning Project, 2009 USPROC. Authors: Kinjal Basu, Sujayam Saha. Sponsor Professors: S. Ghosh, A. K. Ghosh. Indian Statistical Institute, Kolkata, India


Page 1

Estimation of Allele Frequencies from Quantitative Trait Data

3rd Place Winning Project, 2009 USPROC

Authors: Kinjal Basu, Sujayam Saha

Sponsor Professors: S. Ghosh, A. K. Ghosh

Indian Statistical Institute, Kolkata, India

Page 2

Introduction

Statistics: An Integral Part of Genetic Research

• The problem is the localization of a bi-allelic gene controlling a quantitative trait.

• The (unknown) distribution of the trait data depends on genotype, i.e. we have a mixture of three distributions, each corresponding to a genotype.

Our quest is to estimate p, the frequency of allele A, from a mixture distribution with mixing proportions $p^2$, $2pq$ and $q^2$ (where $q = 1 - p$), due to genotypes AA, Aa and aa.
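The mixture described above can be simulated in a few lines. The following is a minimal sketch, not the authors' code: the function names, the Gaussian components, and the parameter values are all assumptions chosen for illustration.

```python
import random

def genotype_proportions(p):
    """Mixing proportions p^2, 2pq, q^2 for genotypes AA, Aa and aa."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}

def simulate_trait(n, p, means, sd=1.0, seed=0):
    """Draw n trait values from the three-component mixture.

    `means` maps each genotype to a hypothetical component mean. Gaussian
    components are an assumption made only for this illustration.
    """
    rng = random.Random(seed)
    props = genotype_proportions(p)
    genotypes = rng.choices(list(props), weights=list(props.values()), k=n)
    values = [rng.gauss(means[g], sd) for g in genotypes]
    return genotypes, values

# Illustrative parameter values only.
genotypes, values = simulate_trait(1000, 0.45, {"AA": 3.0, "Aa": 0.0, "aa": -3.0})
```

In practice the genotype labels are unobserved. Only `values` is available, which is why p has to be estimated from the mixture.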

Page 3

Methodology: The way the pieces fall into place!

• Cluster analysis gives us estimates that are used either on their own or as initial guesses for other methods.

• For the sake of algebraic simplicity, we begin by assuming that the data follow a Gaussian mixture model. We test two methods, based on EM and CEM respectively.

• We next investigate two categories of departure from normality:

a. Asymmetric Distributions

b. Heavy-tailed Distributions

Page 4

Cluster Analysis

Using the 3-means algorithm we find the three clusters. We then need to decide which cluster corresponds to which genotype. We assign the bigger of the two extreme clusters to AA and the smaller one to aa.

If $n_1$, $n_2$ and $n_3$ are the cluster sizes corresponding to the AA, Aa and aa genotypes respectively, then the MLE of p is given by

$$\hat{p} = \frac{2n_1 + n_2}{2(n_1 + n_2 + n_3)}$$
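The clustering step and the estimator above can be sketched as follows. This is an illustrative implementation under stated assumptions: the slides do not say how 3-means is initialised, so the min/median/max starting centres, the function names and the demo parameters are all hypothetical.

```python
import random

def three_means_1d(values, iters=100):
    """Lloyd's algorithm with k = 3 on one-dimensional data.

    Returns the three clusters ordered by increasing cluster mean.
    """
    xs = sorted(values)
    # Assumed initialisation: smallest value, median and largest value.
    centers = [xs[0], xs[len(xs) // 2], xs[-1]]
    clusters = [[], [], []]
    for _ in range(iters):
        clusters = [[], [], []]
        for x in xs:
            nearest = min(range(3), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    order = sorted(range(3), key=lambda j: centers[j])
    return [clusters[j] for j in order]

def estimate_p(values):
    """Allele-frequency estimate from the three cluster sizes."""
    low, mid, high = three_means_1d(values)
    # The bigger extreme cluster is labelled AA, the smaller one aa.
    n1 = max(len(low), len(high))
    n3 = min(len(low), len(high))
    n2 = len(mid)
    return (2 * n1 + n2) / (2 * (n1 + n2 + n3))

# Hypothetical data: p = 0.7, genotype means 4, 0, -4, unit variance.
rng = random.Random(1)
demo = []
for _ in range(2000):
    u = rng.random()
    mean = 4.0 if u < 0.49 else (0.0 if u < 0.91 else -4.0)
    demo.append(rng.gauss(mean, 1.0))
p_hat = estimate_p(demo)
```

Because the bigger extreme cluster is always labelled AA, this rule can only return estimates of at least 0.5. It implicitly treats A as the more frequent allele.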

Page 5

Gaussian Model Analysis

Example: a mixture of N(3,1), N(0,1) and N(-3,1) with p = 0.45

To analyze the data under an assumed Gaussian mixture distribution, we use the EM and CEM algorithms. The E-step uses the posterior expectations of the indicator variables given the data. The M-step uses the standard results for the Gaussian model, with the mean and variance interpreted as a weighted mean and variance whose weights are the indicator variables.
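A compact sketch of the EM and CEM iterations described above is given below. It is not the authors' implementation. The update for p, (2·w_AA + w_Aa)/(2n), is what maximising the expected complete-data log-likelihood gives under the constraint that the mixing proportions are p², 2pq and q², and it is stated here as an assumption because the slides do not spell it out. Function names, starting values and demo data are hypothetical.

```python
import math
import random

def normal_pdf(x, mu, sd):
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def em_allele_frequency(xs, p, mus, sds, iters=100, classify=False):
    """EM (or CEM when classify=True) for a three-component Gaussian mixture
    with mixing proportions p^2, 2pq, q^2 for genotypes AA, Aa, aa."""
    n = len(xs)
    mus, sds = list(mus), list(sds)
    for _ in range(iters):
        q = 1.0 - p
        props = [p * p, 2.0 * p * q, q * q]
        # E-step: posterior expectation of each genotype indicator.
        resp = []
        for x in xs:
            w = [props[k] * normal_pdf(x, mus[k], sds[k]) for k in range(3)]
            total = sum(w) or 1.0  # guard against numerical underflow
            w = [wk / total for wk in w]
            if classify:
                # C-step of CEM: hard assignment to the likeliest genotype.
                best = max(range(3), key=lambda k: w[k])
                w = [1.0 if k == best else 0.0 for k in range(3)]
            resp.append(w)
        # M-step: weighted mean and variance for each component.
        for k in range(3):
            wk = sum(r[k] for r in resp)
            if wk > 1e-12:
                mus[k] = sum(r[k] * x for r, x in zip(resp, xs)) / wk
                var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, xs)) / wk
                sds[k] = max(math.sqrt(var), 1e-3)
        # Assumed constrained update for the allele frequency.
        w_AA = sum(r[0] for r in resp)
        w_Aa = sum(r[1] for r in resp)
        p = (2.0 * w_AA + w_Aa) / (2.0 * n)
    return p, mus, sds

# Hypothetical data: p = 0.6, genotype means 3, 0, -3, unit variance.
rng = random.Random(7)
demo = []
for _ in range(1500):
    u = rng.random()
    mean = 3.0 if u < 0.36 else (0.0 if u < 0.84 else -3.0)
    demo.append(rng.gauss(mean, 1.0))
start = (0.5, [2.0, 0.0, -2.0], [1.0, 1.0, 1.0])
p_em, mus_em, sds_em = em_allele_frequency(demo, *start)
p_cem, mus_cem, sds_cem = em_allele_frequency(demo, *start, classify=True)
```

The only difference between the two variants in this sketch is the hard assignment inside the E-step. The starting values would normally come from a preliminary cluster analysis.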

Page 6

Results

Inference:

• As the separation between the means increases, the MSE decreases.

• EM gives better results than 3-means. CEM is unsatisfactory.

• As p approaches 1, the performance of all the methods deteriorates. This is probably because the cluster corresponding to $q^2$ vanishes at a quadratic rate.

| Original p | Method | µ=1 | µ=2 | µ=3 | µ=1 | µ=2 | µ=3 |
|---|---|---|---|---|---|---|---|
| 0.6 | 3-Means | 0.0066 | 0.0041 | 0.0006 | 0.0009 | 0.0006 | 0.0008 |
| | EM | 0.0067 | 0.0048 | 0.0004 | 0.0162 | 0.0078 | 0.0014 |
| | CEM | 0.0044 | 0.0034 | 0.0003 | 0.0104 | 0.0210 | 0.0046 |
| 0.7 | 3-Means | 0.0313 | 0.0203 | 0.0027 | 0.0144 | 0.0076 | 0.0028 |
| | EM | 0.0157 | 0.0106 | 0.0002 | 0.0132 | 0.0065 | 0.0033 |
| | CEM | 0.0203 | 0.0135 | 0.0003 | 0.0124 | 0.0201 | 0.0058 |
| 0.8 | 3-Means | 0.0739 | 0.0489 | 0.0339 | 0.0460 | 0.0326 | 0.0219 |
| | EM | 0.0471 | 0.0349 | 0.0003 | 0.0333 | 0.0296 | 0.0037 |
| | CEM | 0.0508 | 0.0361 | 0.0135 | 0.0349 | 0.0339 | 0.0138 |

Page 7

Multivariate Model Assumption

In multi-dimensional data, treating each variable separately means that information on the interdependencies between the variables is not used at all. Thus, a vector-valued estimation algorithm is called for. We choose the multivariate normal to model the data and use a multivariate analog of the theory in Slide 5 (Gaussian Model Analysis) to estimate p.

Result:

Overall, EM was better than the other two methods. EM and CEM mostly gave comparable MSE, but their superiority over 3-means was not evident in some cases, especially for p = 0.6.

Page 8

Deviations from Normality

i) Asymmetric Distributions

Box-Cox Transformations

• Here we transform the original asymmetric data into symmetric data by using an appropriate value of λ.

• The transformation is

$$y_{\text{original}} \mapsto \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln y, & \lambda = 0 \end{cases}$$

• Criterion for the choice of λ: maximizing the between-group to within-group variance ratio.

[Figure: Log-normal distribution transformed to a normal distribution with λ = 0]

[Figure: Chi-square distribution transformed to a normal distribution with λ = 0.5]
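The Box-Cox transformation and the selection criterion can be sketched as below. This is a hedged illustration: it assumes the groups are already available (for example from a preliminary clustering), and the function names, the grid and the toy data are hypothetical. The ratio is computed from sums of squares, which differs from a ratio of variances only by a constant factor and therefore selects the same λ.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive value y."""
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam

def variance_ratio(groups):
    """Between-group to within-group sum-of-squares ratio."""
    means = [sum(g) / len(g) for g in groups]
    total_n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / total_n
    between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return between / within  # assumes some within-group spread

def choose_lambda(groups, grid):
    """Pick the grid value of lambda with the largest variance ratio."""
    def score(lam):
        return variance_ratio([[box_cox(v, lam) for v in g] for g in groups])
    return max(grid, key=score)

# Toy positive-valued groups and a regular grid, for illustration only.
groups = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [100.0, 200.0, 300.0]]
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
best_lambda = choose_lambda(groups, grid)
```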

Page 9

Results

Using a regular grid of points for λ, we see that almost always (more than 95% of the time) the correct λ or a nearby value is chosen by the algorithm.

Inference:

• Performance across the different parameter values remains similar; however, there is a drop in performance due to the added variation from the choice of λ.

Log Normal

| Original p | Method | µ=1 | µ=2 | µ=3 | µ=1 | µ=2 | µ=3 |
|---|---|---|---|---|---|---|---|
| 0.6 | 3-Means | 0.0040 | 0.0023 | 0.0007 | 0.0040 | 0.0034 | 0.0016 |
| | EM | 0.0039 | 0.0059 | 0.0008 | 0.0109 | 0.0024 | 0.0019 |
| | CEM | 0.0114 | 0.0032 | 0.0004 | 0.0101 | 0.0131 | 0.0045 |
| 0.7 | 3-Means | 0.0188 | 0.0109 | 0.0038 | 0.0172 | 0.0136 | 0.0034 |
| | EM | 0.0170 | 0.0135 | 0.0040 | 0.0122 | 0.0068 | 0.0005 |
| | CEM | 0.0163 | 0.0060 | 0.0002 | 0.0174 | 0.0122 | 0.0026 |
| 0.8 | 3-Means | 0.0494 | 0.0444 | 0.0258 | 0.0565 | 0.0393 | 0.0256 |
| | EM | 0.0414 | 0.0252 | 0.0041 | 0.0392 | 0.0268 | 0.0060 |
| | CEM | 0.0333 | 0.0346 | 0.0105 | 0.0422 | 0.0282 | 0.0220 |

Page 10

Deviations from Normality

ii) Heavy-Tailed Distributions

• Many heavy-tailed distributions, such as the Cauchy and the t distribution with 2 degrees of freedom, do not have finite first two moments. In these cases we cannot use the sample mean and variance to estimate the location and scale parameters of the population.

• Instead we use the sample median and quartile deviation to estimate the location and scale parameters.

• Using quantiles instead of moments also helps make the algorithms more robust to outliers in the data. So this algorithm can also be used when robustness is required even though the distribution is not suspected to be heavy-tailed.
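The quantile-based estimates can be illustrated with a short sketch. The function names and the Cauchy example are assumptions for illustration. For a Cauchy distribution the quartiles lie one scale unit either side of the location, so the quartile deviation (half the interquartile range) targets the scale parameter directly.

```python
import math
import random
import statistics

def robust_location_scale(values):
    """Sample median and quartile deviation (half the interquartile range)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return statistics.median(values), (q3 - q1) / 2.0

def cauchy_sample(n, location, scale, seed=0):
    """Cauchy draws obtained by inverting the distribution function."""
    rng = random.Random(seed)
    return [location + scale * math.tan(math.pi * (rng.random() - 0.5))
            for _ in range(n)]

# Hypothetical example: location 2, scale 1.
sample = cauchy_sample(5000, location=2.0, scale=1.0)
loc_hat, scale_hat = robust_location_scale(sample)
```

The sample mean of the same draws would be dominated by a handful of extreme values, which is the motivation for replacing moments with quantiles here.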

Page 11

Comparing 3-Means and 3-Medoids

Using p = 0.5, the classification should have been 250, 500 and 250.

3-Means: The three clusters are of size 984, 15 and 1. The outlier is the single element of its cluster.

3-Medoids: The three clusters are of size 299, 421 and 280. The three clusters have comparable numbers of elements, and an actual classification has been achieved.

Thus, 3-medoids gives much better results in the presence of outliers.
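A one-dimensional 3-medoids procedure can be sketched as follows. The slides do not say which medoid algorithm was used, so this alternating assign-and-update variant, its quantile-based initialisation, the function name and the demo data are all assumptions. In one dimension a median element of a cluster minimises the total absolute distance to the other members, so it can serve as the medoid.

```python
import random
import statistics

def three_medoids_1d(values, iters=100):
    """Alternating k-medoids with k = 3 on one-dimensional data."""
    xs = sorted(values)
    n = len(xs)
    # Assumed initialisation at the 1/6, 1/2 and 5/6 sample quantiles.
    medoids = [xs[n // 6], xs[n // 2], xs[(5 * n) // 6]]
    clusters = [[], [], []]
    for _ in range(iters):
        clusters = [[], [], []]
        for x in xs:
            nearest = min(range(3), key=lambda j: abs(x - medoids[j]))
            clusters[nearest].append(x)
        # The low median is an actual data point, as a medoid must be.
        new_medoids = [statistics.median_low(c) if c else medoids[j]
                       for j, c in enumerate(clusters)]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters

# Hypothetical data: 250/500/250 points plus one extreme outlier.
rng = random.Random(3)
demo = ([rng.gauss(-3.0, 1.0) for _ in range(250)]
        + [rng.gauss(0.0, 1.0) for _ in range(500)]
        + [rng.gauss(3.0, 1.0) for _ in range(250)]
        + [1.0e4])
medoids, clusters = three_medoids_1d(demo)
sizes = [len(c) for c in clusters]
```

Because each medoid is a median rather than a mean, a single extreme value shifts it very little. That is one plausible reading of why the medoid-based cluster sizes reported above stay balanced.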

Page 12

Results

Inference:

• The robust algorithms protect us from outliers distorting the estimates too much, but at the cost of a loss of efficiency of the EM algorithm.

Cauchy Distribution

| Original p | Method | µ=1 | µ=2 | µ=3 | µ=1 | µ=2 | µ=3 |
|---|---|---|---|---|---|---|---|
| 0.6 | 3-Means | 0.0024 | 0.0035 | 0.0009 | 0.0079 | 0.0054 | 0.0052 |
| | EM | 0.0071 | 0.0030 | 0.0004 | 0.0031 | 0.0027 | 0.0046 |
| | CEM | 0.0529 | 0.0406 | 0.0017 | 0.0469 | 0.0439 | 0.0526 |
| 0.7 | 3-Means | 0.0229 | 0.0139 | 0.0059 | 0.0095 | 0.0231 | 0.0254 |
| | EM | 0.0133 | 0.0039 | 0.0007 | 0.0087 | 0.0233 | 0.0228 |
| | CEM | 0.0755 | 0.0781 | 0.0406 | 0.0802 | 0.0649 | 0.0580 |
| 0.8 | 3-Means | 0.0499 | 0.0477 | 0.0603 | 0.0409 | 0.0542 | 0.0536 |
| | EM | 0.0348 | 0.0188 | 0.0236 | 0.0533 | 0.0521 | 0.0439 |
| | CEM | 0.0387 | 0.0406 | 0.0266 | 0.0431 | 0.0440 | 0.0461 |

Page 13

Collection and Analysis of Real Data

Data were collected from an ongoing clinical survey on Type 2 Diabetes at the Madras Diabetes Research Foundation, Chennai, India, covering roughly 500 patients and 9 different fields.

Preliminary analysis revealed some perfect linear dependencies, which helped us reduce the dimensionality of the multivariate estimates.

We ran the data through the univariate algorithms, treating each variable separately, and also through the multivariate routine using 6 fields.

Page 14

Results

i) Results from multivariate analysis: 3-medoids: 0.6857, EM: 0.6477, CEM: 0.6576

The consistency of the results shows that the multivariate normal is a good fit for the data.

ii) Results from univariate analysis

| Data | 3-medoids | EM | CEM | Robust EM | Robust CEM |
|---|---|---|---|---|---|
| BMI | 0.5532 | 0.5589 | 0.5958 | 0.8250 | 0.5833 |
| FBS | 0.7028 | 0.7909 | 0.8394 | 0.7402 | 0.6647 |
| FBS-INS | 0.6325 | 0.7801 | 0.7400 | 0.5764 | 0.5733 |
| IR | 0.5321 | 0.9011 | 0.6496 | 0.5400 | 0.5562 |
| CHO | 0.5040 | 0.8911 | 0.5552 | 0.6632 | 0.5572 |
| TRI | 0.6938 | 0.9728 | 0.9829 | 0.5549 | 0.6446 |
| LDL | 0.5542 | 0.6659 | 0.7329 | 0.7354 | 0.7912 |
| HDL | 0.5994 | 0.9613 | 0.6566 | 0.5296 | 0.5783 |
| HBA1C | 0.6847 | 0.7524 | 0.8233 | 0.7095 | 0.8022 |

Page 15

Discussion

We see that for the phenotypes FBS-INS, IR, CHO, TRI and HDL, the estimate of p is almost consistent across methods, except for the EM and CEM algorithms. The reason is probably that the distribution does not follow a Gaussian model, or that the data contain extreme outliers.

For LDL, robust EM and robust CEM give consistent values, but the initial cluster analysis does not. This implies that although 3-medoids was not entirely accurate, its initial estimate still yielded a consistent solution.

For BMI and FBS, the EM and CEM algorithms give a consistent solution, but the sensitivity decreases under robustification. This implies that the underlying model is most likely Gaussian.

If some phenotypes return the same p, and we have prior biological knowledge that their controlling genes may be the same, it is probable that the same gene controls those phenotypes. This work should help considerably in identifying such phenotypes.

Page 16

Conclusion: Based on the simulation results, we propose the following as the best method for calculating the allele frequency:

We first run the 3-medoids algorithm to estimate the location and scale parameters of the 3 clusters, along with a crude estimate of p. Starting from these crude estimates, we run the EM algorithm over a grid of λ values and choose the λ with the maximum between-group to within-group variance ratio.

We then check graphically whether the data contain outliers. If they do, we use the robust EM; otherwise we use the usual EM to get the final estimate of p, the allele frequency.

Sources: Madras Diabetes Research Foundation, Chennai, India, http://www.mvdsc.org/mdrf/about.htm