Practical Applications and Properties of the Exponentially Modified Gaussian (EMG) Distribution A Thesis Submitted to the Faculty of Drexel University by Scott Haney in partial fulfillment of the requirements for the degree of Doctor of Philosophy March 23rd, 2011


Practical Applications and Properties of the Exponentially

Modified Gaussian (EMG) Distribution

A Thesis

Submitted to the Faculty

of

Drexel University

by

Scott Haney

in partial fulfillment of the

requirements for the degree

of

Doctor of Philosophy

March 23rd, 2011

© Copyright March 23rd, 2011 Scott Haney. All Rights Reserved.

Table of Contents

List of Tables
Abstract
1. Introduction
2. Background on Microarray Data Analysis
2.1 Gene Expression
2.2 Measuring Gene Expression
2.3 Affymetrix Microarrays
2.4 Experimental Errors and Data Preprocessing
3. Properties of the Exponentially Modified Gaussian (EMG) Distribution
3.1 Reparameterization of the EMG Distribution
3.2 EMG Quantile Bounds
3.3 Parameter Estimation
3.4 Shape Estimation
3.5 EMG Right Tail Approximation
4. Application of the EMG Distribution to Actual Affymetrix Microarray Perfect Match (PM) Probe Distributions
4.1 Comparing the Right Tail to a Shifted Exponential Distribution
4.2 Discrepancy in the Sample Quantile of the Sample Mean
5. Fitting the Right Tail of the Perfect Match (PM) Probe Data
5.1 Derivation of Functions That Decrease by a Common Ratio
5.2 Application of Functions that Decrease by a Common Ratio to the Right Tail
6. Practical Implementation of EMG Parameter Estimation Method and Properties
6.1 Proof of Consistency
6.2 Practical Considerations and Alterations
6.3 Summary of Final Parameter Estimation Method
6.4 Currently Available Methods
6.5 Maximum Likelihood Estimation Method
6.6 The Silver Method
6.7 Method of Moments
7. Comparison of Methods on Synthetic Data
8. Conclusion
Appendices
A. Derivation of pdf and cdf
A.1 Derivation of the Probability Density Function and the Cumulative Distribution Function
Bibliography


List of Figures

2.1 The steps of gene expression that lead to a protein product (taken from [25])
2.2 Affymetrix Chip Design (taken from [13])
2.3 Step by step procedure of a typical Affymetrix microarray experiment (taken from [9])
2.4 Several sources of error for a microarray experiment (taken from [35])
3.1 Plots of EMG distributions for different values of k
4.1 Plots of the sample pdf histograms for the PM probe distributions from five Affymetrix microarrays along with a plot of an EMG distribution with k = 1
5.1 Plot of the right tail of the sample pdf histogram for the PM probe data from T01 tumor.CEL fitted to a shifted version of f(x) = 3log2(x)


List of Tables

7.1 Synthetic data results for the new method
7.2 Synthetic data results for the method of moments
7.3 Synthetic data results for the Silver method


Abstract

Practical Applications and Properties of the Exponentially Modified Gaussian (EMG) Distribution

Scott Haney

Advisor: Moshe Kam, Ph.D.

The exponentially modified Gaussian (EMG) probability distribution is defined

as the convolution of an exponential distribution and a Gaussian distribution which

are independent of each other. Using a reparameterized form of the EMG cumulative

distribution function (cdf) several properties of the EMG distribution are derived.

These properties are used to test whether the distribution of the perfect match (PM)

probes from five Affymetrix microarrays follows an EMG distribution and to create

a new parameter estimation method. A commonly used method for preprocessing

Affymetrix microarray data, known as the robust multi-array average (RMA), as-

sumes that the distribution of the PM probes at least approximately follows an EMG

distribution. Using the results derived in this thesis it is found that the EMG distri-

bution is not a good fit for these sample data based on differences in the right tail of

the sample distribution. A new distribution that is very dissimilar to the right tail of

an EMG distribution is derived that more accurately fits the right tail of the sample

data. From the properties of the EMG distribution derived in this thesis it is further

shown that a new parameter estimation method can be created. This new parameter

estimation method is compared against two other methods from the literature namely

the method of moments and the Silver method (2009). From a theoretical perspec-

tive, the new parameter estimation method has the advantage that it is proven to

be consistent and to always return valid parameter estimates (such as the constraint

that the variance be positive). Neither the Silver method nor the method of moments

has both of these qualities. All three methods were compared on the same synthetic

data generated from EMG distributions and it was found that the performance of each method depended on the “shape” of the EMG distribution. It was also found that the Silver method appears to not be consistent for EMG distributions that are too “close” to being a Gaussian distribution.


1. Introduction

The EMG distribution is the convolution of a Gaussian distribution and an ex-

ponential distribution which are independent of each other. This distribution has

found practical applications in a variety of scientific disciplines such as chromatogra-

phy [17,20,23,29], cellular biology [14], radiotherapy [16], and microarray preprocess-

ing [18,30]. Many of these practical applications focus on the problem of curve fitting

of data points to a function which is an EMG pdf multiplied by a scaling parameter.

A large number of algorithms have been introduced in the literature to solve this

problem [3,11,12,36].

The focus of this thesis is to better understand the properties of the EMG distribution so that it can be determined whether or not the perfect match (PM) probe distributions from five Affymetrix microarrays approximately follow an EMG distribution. This is an important assumption made by a commonly used microarray

preprocessing method known as the robust multi-array average (RMA) [18]. Several

properties of the EMG distribution were derived and were used to show that the right

tails of the sample probability density function (pdf) were much “heavier” than would

be expected for an EMG distribution. By visual analysis of the sample pdf histograms

it was determined that the right tails of the sample pdfs approximately reduced in

height by one third whenever the value on the x-axis was doubled. This is a property

that the right tail of an EMG pdf does not come close to having. A function with

this property was derived and it was found to be a reasonable approximation for the

right tails of the sample pdfs. These results strongly challenge the assumption used

by the RMA method that the PM probes approximately follow an EMG distribution.

Using the derived properties it is also possible to create a parameter estimation

method that has some very desirable properties such as consistency and always being able to return “valid” parameter estimates, where “valid” refers to parameter estimates that satisfy all of the constraints of the original parameters. Several parameter

estimation methods already exist in the literature [22, 30] and the new parameter

estimation method is compared to two of these. The two methods selected were the

method provided in [30] (referred to as the “Silver method”) and the method of mo-

ments. All three methods were compared on synthetic data generated from EMG

distributions. The synthetic data trials distinguished between three scenarios which

were:

1. The EMG distribution is “close” to being a shifted exponential distribution

2. The EMG distribution is “close” to being a Gaussian distribution

3. The EMG distribution is neither “close” to a shifted exponential distribution

nor “close” to a Gaussian distribution

An EMG distribution is considered to be “close” to a shifted exponential distribution

when a large fraction of the variance of the EMG distribution is due to the variance

of the exponential component; an EMG distribution is considered to be “close” to a

Gaussian distribution when a large fraction of the variance of the EMG distribution

is due to the variance of the Gaussian component.
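The notion of “closeness” above can be made concrete with a small calculation. Because the Gaussian and exponential components are independent, the variance of the EMG distribution is σ² + 1/λ², so the share due to the exponential component is (1/λ²)/(σ² + 1/λ²). The following Python sketch computes this share and draws synthetic EMG samples as a Gaussian draw plus an exponential draw. The function names and the three parameter settings are illustrative assumptions, not the settings used in the synthetic data trials.

```python
import random
import statistics

def exponential_variance_fraction(sigma, lam):
    """Fraction of the EMG variance contributed by the exponential component."""
    gaussian_var = sigma ** 2
    exponential_var = 1.0 / lam ** 2
    return exponential_var / (gaussian_var + exponential_var)

def sample_emg(mu, sigma, lam, n, seed=0):
    """Draw n EMG variates as independent Gaussian plus exponential draws."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) + rng.expovariate(lam) for _ in range(n)]

# Hypothetical (mu, sigma, lam) settings, one per scenario.
scenarios = {
    "close to shifted exponential": (0.0, 0.1, 1.0),
    "close to Gaussian": (0.0, 1.0, 10.0),
    "neither": (0.0, 1.0, 1.0),
}
for name, (mu, sigma, lam) in scenarios.items():
    share = exponential_variance_fraction(sigma, lam)
    xs = sample_emg(mu, sigma, lam, 10_000)
    print(f"{name}: exponential share = {share:.3f}, "
          f"sample variance = {statistics.variance(xs):.3f}")
```

Algebraically the share equals 1/(1 + (λσ)²), so it depends on the parameters only through the product λσ.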

Both the Silver method and the method of moments were found to have distinct

disadvantages compared to the new parameter estimation method. The method of

moments failed to return valid parameter estimates at least 10 times out of 100 and

at most 61 times out of 100 in the synthetic data trials. For these failed runs the

method of moments returned at least one imaginary parameter estimate. The Silver

method appears to be converging to incorrect parameter estimates under the second

scenario. The average parameter estimates for the Silver method after applying it to

100 random samples of size 10,000 generated from a certain EMG distribution showed


that the parameter estimates were off by as much as 29 standard deviations.

With respect to accuracy, the results of the synthetic data trials showed that the

performance of the parameter estimation methods varied across the three scenarios.

In the first scenario it was found that the accuracy of the Silver method was noticeably

better in most cases than the accuracy of the new method and the accuracy of the

method of moments. In the second scenario it was found that the accuracy of the

method of moments and the accuracy of the new method were comparable, while

in most cases the accuracy of the Silver method was noticeably lower. In the third

scenario it was found that the accuracy of the method of moments and the accuracy

of the new method were comparable while in most cases the accuracy of the Silver

method was noticeably lower.

The organization of this thesis is as follows:

1. Background necessary for understanding the application of the EMG distribu-

tion to Affymetrix microarray data is described.

2. Properties of the EMG distribution that will be used in improving the application of the EMG distribution in practice are derived.

3. The assumption that the PM probe data from Affymetrix microarrays approx-

imately follows an EMG distribution is tested for data from five microarrays

and it is found that this assumption is unlikely to be true.

4. A new distribution is derived to fit the right tails of the PM probe distributions

from the five microarrays. This new distribution is found to visually fit the

sample data well and is not “close” to the right tail of an EMG distribution.

5. A new parameter estimation procedure is described and is proven to be consistent.


6. The new parameter estimation method is compared to two other parameter

estimation methods from the literature and is found to have several important

advantages over these two methods.


2. Background on Microarray Data Analysis

Within a single human being different cell types can have exactly the same DNA

yet be extraordinarily different. For example, skin cells and bone cells have the same

DNA yet they are not very similar in form or function [2]. Although these cells have the same DNA, certain subsequences of the DNA (known as genes) affect

the cellular environment in different ways. Perhaps the most commonly studied way

by which a gene can affect a cell is the process of gene expression.

2.1 Gene Expression

Gene expression is a multi-step process by which a gene product is created from

a gene. In humans the most common gene products are proteins, which are one or

more long chains of amino acids that are folded together. For simplicity it is assumed

that gene expression refers to gene expression where the gene product is a protein

since proteins are thought to be the primary reason for biological changes within the

cell. The steps of gene expression for protein products [2] are

1. DNA is transcribed into a complementary mRNA copy

2. Intron sequences are removed (or spliced) from the complementary mRNA copy

3. The spliced complementary mRNA sequence is translated into a chain of amino

acids

4. Posttranslational modifications are made to the chain of amino acids and the

final protein product is formed

These steps are shown pictorially in Figure 2.1.


Figure 2.1: The steps of gene expression that lead to a protein product (taken from [25])

Protein gene products are typically very complex and can affect the cell in different

ways depending on a variety of factors. Two common factors that impact the effect of proteins are the concentration of other proteins in the cellular environment and the

folded shape of the protein. Any change starting from gene expression and ending

with the final structure, form, and environment of the protein product can affect the

biology of the cell [2].

2.2 Measuring Gene Expression

Obtaining a meaningful measure of gene expression is not straightforward. A

single change in any step of the process can lead to different biological results. In

practice, the first step of the process of gene expression where the DNA is transcribed


into a complementary mRNA copy is the portion of the process that is measured.

Measuring this step provides an estimate of the total amount of protein product that

can be produced. This measurement, however, does not provide any estimate as to

how much of the protein is actually produced or give any idea as to the final physical

form of the protein in the cell.

There are several important reasons for focusing on this portion of the process

which are as follows:

1. Methods for measuring the presence of mRNA molecules are well established

2. Since the human genome is approximately 99.9% identical across individuals it

is reasonable to assume that the same mRNA molecules are being tested for

3. It is possible to simultaneously measure the presence of a large number of mRNA

molecules within the same sample

A number of testing devices are available for simultaneously measuring large numbers of mRNA molecules in a sample. One class of these testing devices, known as microarrays, is commonly used for this purpose in practice.

2.3 Affymetrix Microarrays

One of the most well known manufacturers of microarrays is Affymetrix [1].

Affymetrix microarrays are small chips that have their surfaces subdivided into a

rectangular grid. Each rectangle in the grid contains a large number of 25 nucleotide

base pair long DNA probes all having the same sequence. These DNA probes are

“standing straight up” on the surface of the chip with the bottom end of the probe

affixed to the surface of the chip and the top end of the probe being free to move

(Figure 2.2). This design allows any mRNA molecules to chemically bind to the DNA

probes on the surface of the microarray.


Figure 2.2: Affymetrix Chip Design (taken from [13])

Each subgrid contains either perfect match (PM) probes or mismatch (MM)

probes. A PM probe is designed to be complementary to an expected subsequence of

a specific mRNA. An MM probe is designed to match a PM probe sequence with the

exception that the 13th nucleotide base is switched. Every subgrid of PM probes has

a corresponding subgrid of MM probes. For every gene there are typically several PM

and MM probe subgrids. The entire collection of these subgrids is termed a probeset.
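The relationship between a PM probe and its MM partner can be sketched in a few lines of Python. This sketch assumes that “switched” means the 13th base is replaced by its Watson-Crick complement; the function name and the example sequence are hypothetical and do not correspond to any real probe.

```python
# Watson-Crick complements for the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def mismatch_probe(pm_probe, position=13):
    """Return the MM probe for a 25-base PM probe.

    The base at the given 1-indexed position is replaced by its
    complement (an assumption about what "switched" means).
    """
    if len(pm_probe) != 25:
        raise ValueError("expected a 25-base probe")
    i = position - 1
    return pm_probe[:i] + COMPLEMENT[pm_probe[i]] + pm_probe[i + 1:]

pm = "ACGTACGTACGTACGTACGTACGTA"  # made-up 25-mer for illustration
mm = mismatch_probe(pm)
print(pm)
print(mm)
```

The two sequences differ only in the middle base, which is why the MM probe is expected to capture binding that is not specific to the target sequence.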

Affymetrix microarrays “measure” mRNA levels by using basic principles of chem-

istry. Each DNA probe on the surface will prefer to be bound to other DNA that

is exactly complementary. In general, the “closer” a subsequence of DNA is to

being complementary to a probe sequence the more likely it will be to bind to the

corresponding probe. By using this principle it is thought that if a targeted sequence

is present in solution it will bind to its corresponding probe with high probability. Of

course, other DNA sequences in solution that have a subsequence which is “close” to

being complementary can also bind. It is thought that the MM probes can be used

to provide an estimate of this erroneous binding known as cross-hybridization.


An Affymetrix microarray experiment begins by extracting mRNA from a biological sample. The extracted mRNA then goes through a number of preparation steps

where it is labeled with some molecule that can be identified using a scanner and

the labeled mRNA is then applied to the surface of the microarray. After the chem-

ical reactions have had some time to take place the microarray is washed and only

the mRNA from the sample that is bound to probes should remain. Lastly, the mi-

croarray is put under a scanner and for each rectangular subgrid the intensity of the

labeling molecule is measured. A pictorial example of this process is given in Figure

2.3.

Figure 2.3: Step by step procedure of a typical Affymetrix microarray experiment (taken from [9])


2.4 Experimental Errors and Data Preprocessing

Affymetrix microarrays are subject to technical, chemical, and human errors. An

example of some of these errors can be seen in Figure 2.4. These errors have been

extensively studied in the literature [33,35,37]; however, they have yet to be convincingly modeled in practical Affymetrix microarray experiments. An understanding

of how these errors affect Affymetrix microarray data is essential for determining how

reliable the data are as well as for extracting a reasonable estimate for the “level” of

gene expression in the sample.

Figure 2.4: Several sources of error for a microarray experiment (Taken from [35])

Previous work has been completed towards estimating the mRNA concentration

in the presence of error and has met with some success. In one publication [21],

a method was developed that was capable of detecting known mRNA levels in the

presence of experimental error. At least two other authors determined differential


equation models that took into account error which worked well on the data tested

[5,35]. For the type of microarray experiment described in the previous section these

techniques have not had a very significant impact in practice. For a different type

of microarray experiment, a real-time microarray experiment [8], these techniques

are much more effective and practical. Most microarray data sets that are currently

available, however, are not real-time microarray experiments.

In practice it is common to handle microarray errors by using techniques that are

much simpler than the methods discussed in the previous paragraph. The typical first

step to microarray data analysis is to preprocess the data in order to remove error.

Some of the most commonly used microarray preprocessing techniques in practice

are those provided directly by Affymetrix (PLIER and MAS 5.0) [34] and the robust

multichip average (RMA) [18]. As of the time of this writing no single preprocessing method has been found to be generally preferable to the rest [19].

After the data are preprocessed it is usually assumed that the resulting data are

“error free.” Data analysis techniques are then applied to the preprocessed data to

find interesting results.

The preprocessing technique of primary interest in this thesis is RMA. At the

present time the original RMA publication has been cited over 3,000 times. This

technique makes the assumption that the distribution of PM probes from a microarray

approximately follows an EMG distribution [18]. RMA uses this assumption to model

observed values as signal (which follows an exponential distribution) plus noise (which

follows a Gaussian distribution). The signal value is then estimated by solving for

the expected value of signal given the value of signal plus noise. If the assumption

that the distribution of the PM probes follows an EMG distribution is incorrect, then the use of the EMG distribution in RMA is questionable. This assumption

is shown to be unlikely to be true based on the results of comparing the sample data


to certain properties of the EMG distribution.
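The signal-plus-noise model described above can be illustrated numerically. The sketch below assumes an observed value o = S + N, where the signal S is exponential with rate λ and the noise N is Gaussian with mean µ and standard deviation σ, and computes E[S | S + N = o] in two ways: by direct numerical integration over s ≥ 0, and from the mean of the truncated normal distribution that the model implies for S given o. It is not the RMA implementation; the function names, parameter values, and integration cutoff are illustrative assumptions, and the estimation of µ, σ, and λ from data is not shown.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def expected_signal_numeric(o, mu, sigma, lam, steps=20000):
    """E[S | S + N = o] by trapezoidal integration over s >= 0."""
    # Heuristic upper limit, chosen to lie far in the tail of the integrand.
    upper = max(o - mu, 0.0) + 10.0 * sigma + 10.0 / lam
    h = upper / steps
    num = den = 0.0
    for i in range(steps + 1):
        s = i * h
        w = 0.5 if i in (0, steps) else 1.0  # trapezoid end weights
        weight = w * lam * math.exp(-lam * s) * normal_pdf((o - s - mu) / sigma)
        num += s * weight
        den += weight
    return num / den

def expected_signal_closed_form(o, mu, sigma, lam):
    """Mean of a normal(a, sigma^2) truncated to s >= 0, a = o - mu - lam*sigma^2."""
    a = o - mu - lam * sigma ** 2
    return a + sigma * normal_pdf(a / sigma) / normal_cdf(a / sigma)

mu, sigma, lam = 1.0, 0.5, 0.5  # arbitrary illustrative values
for o in (0.5, 2.0, 5.0):
    print(o, expected_signal_numeric(o, mu, sigma, lam),
          expected_signal_closed_form(o, mu, sigma, lam))
```

The closed form follows from completing the square in the product of the exponential density and the Gaussian density; the estimate is always positive, even when the observed value lies below the noise mean.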


3. Properties of the Exponentially Modified Gaussian (EMG) Distribution

Due to the large reach of the EMG distribution in practical applications [14, 18, 30], a better understanding of the EMG distribution is worth pursuing. The cumulative distribution function (cdf) and probability density function (pdf) for an EMG distribution are given below as EMG(c; µ, σ, λ) and emg(c; µ, σ, λ), respectively (see Appendix A for derivations):

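\[
\mathrm{EMG}(c;\mu,\sigma,\lambda) = \frac{1}{2} - \frac{1}{2}\, e^{\lambda\left(\frac{\lambda\sigma^2}{2}+\mu-c\right)} \operatorname{erfc}\!\left(\frac{\sigma}{\sqrt{2}}\left(\lambda+\frac{\mu-c}{\sigma^2}\right)\right) + \frac{1}{2}\operatorname{erf}\!\left(\frac{1}{\sqrt{2}\,\sigma}(c-\mu)\right) \tag{3.1}
\]

\[
\mathrm{emg}(c;\mu,\sigma,\lambda) = \frac{\lambda}{2}\, e^{\lambda\left(\frac{\lambda\sigma^2}{2}+\mu-c\right)} \operatorname{erfc}\!\left(\frac{\sigma}{\sqrt{2}}\left(\lambda+\frac{\mu-c}{\sigma^2}\right)\right) \tag{3.2}
\]

where

\[
\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt
\]

\[
\operatorname{erfc}(x) = \frac{2}{\sqrt{\pi}}\int_x^\infty e^{-t^2}\,dt = 1 - \operatorname{erf}(x)
\]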
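Equations (3.1) and (3.2) can be evaluated directly with the error functions found in most numerical libraries. The following is a minimal Python sketch using only the standard library. The function names emg_cdf and emg_pdf and the parameter values are illustrative choices. The finite-difference comparison checks numerically that (3.2) is the derivative of (3.1).

```python
import math

SQRT2 = math.sqrt(2.0)

def _exp_erfc_term(c, mu, sigma, lam):
    """Factor exp(lam*(lam*sigma^2/2 + mu - c)) * erfc(...) shared by (3.1) and (3.2)."""
    expo = math.exp(lam * (lam * sigma ** 2 / 2.0 + mu - c))
    arg = (sigma / SQRT2) * (lam + (mu - c) / sigma ** 2)
    return expo * math.erfc(arg)

def emg_cdf(c, mu, sigma, lam):
    """Equation (3.1)."""
    return (0.5 - 0.5 * _exp_erfc_term(c, mu, sigma, lam)
            + 0.5 * math.erf((c - mu) / (SQRT2 * sigma)))

def emg_pdf(c, mu, sigma, lam):
    """Equation (3.2)."""
    return 0.5 * lam * _exp_erfc_term(c, mu, sigma, lam)

mu, sigma, lam = 0.0, 1.0, 1.0  # arbitrary illustrative values
h = 1e-5
for c in (-2.0, 0.0, 1.0, 3.0, 6.0):
    # Central finite difference of (3.1), to compare with (3.2).
    slope = (emg_cdf(c + h, mu, sigma, lam) - emg_cdf(c - h, mu, sigma, lam)) / (2.0 * h)
    print(f"c={c:5.1f}  cdf={emg_cdf(c, mu, sigma, lam):.6f}  "
          f"pdf={emg_pdf(c, mu, sigma, lam):.6f}  slope={slope:.6f}")
```

For inputs far below µ the exponential factor becomes very large while erfc becomes very small, so a production implementation would need a numerically safer form; this sketch is intended only for moderate inputs.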

In this chapter several properties of the EMG distribution are derived. The deriva-

tions predominantly rely on reparameterizing the input to the EMG cdf. These prop-

erties will later be used to challenge a current assumption that the PM probe data

from Affymetrix microarrays approximately follows an EMG distribution [18] as well

as to create a new parameter estimation method.


3.1 Reparameterization of the EMG Distribution

A carefully selected reparameterization of the input to the EMG cdf can be used

to show several useful properties of the EMG distribution. This result was discovered

by analyzing the mode of the EMG pdf, which occurs when the derivative of the EMG

pdf is equal to zero. This equation is given by

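\[
\lambda\sigma = \frac{1}{\sqrt{\pi}} \times \frac{e^{-p^2}}{\operatorname{erfc}(p)} \tag{3.3}
\]

where

\[
p = \frac{\lambda\sigma}{2} + \frac{\mu - c}{2\sigma}
\]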

Solving for c yields the mode of the EMG pdf.

The equation for the mode can be simplified somewhat by replacing c with a

reparameterization c1 which is given by

\[
c_1 = \mu + \lambda\sigma^2 - 2D\sigma \tag{3.4}
\]

where D ∈ R. Replacing c with c1 in (3.3) causes the equation to become

\[
\lambda\sigma = \frac{e^{-D^2}}{\sqrt{\pi}\,\operatorname{erfc}(D)}
\]

This equation shows that the mode of the EMG pdf can be written entirely in terms

of D and λσ. The term λσ will be used often throughout the rest of this thesis, and

from this point on this term will be denoted by k.

This reparameterization can be slightly generalized and can be used to simplify the

EMG cdf for some input values. This slightly updated reparameterization is denoted


by c2 and is given by

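\[
c_2 = \mu + C\lambda\sigma^2 + D\sigma
\]

where C, D ∈ R. Replacing c with c2 in (3.1), the EMG cdf reduces to

\[
\mathrm{EMG}_{c_2}(C,D,k) = \frac{1}{2}\times\left[1 - e^{\frac{k^2}{2}-Ck^2-Dk}\operatorname{erfc}\!\left(\frac{k-Ck-D}{\sqrt{2}}\right) + \operatorname{erf}\!\left(\frac{Ck+D}{\sqrt{2}}\right)\right] \tag{3.5}
\]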

where k = λσ. Using this equation it is possible to calculate any quantile that can

be represented in terms of c2 once k is known.
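The reduction to (3.5) can be checked numerically. The Python sketch below evaluates the cdf (3.1) at c2 for two different parameter sets that share the same value of k = λσ, and compares the result with (3.5). The function names and the parameter values are illustrative assumptions.

```python
import math

SQRT2 = math.sqrt(2.0)

def emg_cdf(c, mu, sigma, lam):
    """Equation (3.1)."""
    expo = math.exp(lam * (lam * sigma ** 2 / 2.0 + mu - c))
    arg = (sigma / SQRT2) * (lam + (mu - c) / sigma ** 2)
    return 0.5 - 0.5 * expo * math.erfc(arg) + 0.5 * math.erf((c - mu) / (SQRT2 * sigma))

def emg_cdf_c2(C, D, k):
    """Equation (3.5): the cdf at c2 = mu + C*lam*sigma^2 + D*sigma, with k = lam*sigma."""
    expo = math.exp(k * k / 2.0 - C * k * k - D * k)
    return 0.5 * (1.0 - expo * math.erfc((k - C * k - D) / SQRT2)
                  + math.erf((C * k + D) / SQRT2))

C, D = 0.4, -0.3  # arbitrary illustrative values
# Two hypothetical parameter sets with the same k = lam * sigma = 1.5.
for mu, sigma, lam in [(0.0, 1.0, 1.5), (7.0, 3.0, 0.5)]:
    k = lam * sigma
    c2 = mu + C * lam * sigma ** 2 + D * sigma
    print(f"mu={mu}, sigma={sigma}, lam={lam}: "
          f"(3.1) at c2 = {emg_cdf(c2, mu, sigma, lam):.12f}, "
          f"(3.5) = {emg_cdf_c2(C, D, k):.12f}")
```

Both parameter sets give the same cdf value at their respective c2, which illustrates that the quantile at c2 depends on the parameters only through k.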

Several important results that will be used in this writing are now explained in the

following sections. These results heavily rely on the term k and (3.5). From the work

in the following sections it will become evident that the term k provides a significant

amount of information about an EMG distribution.

3.2 EMG Quantile Bounds

Analysis of specific values of c2 revealed that at least some of the quantiles must

lie within certain bounds. This is accomplished by combining the constraint k > 0

(which is true because both σ and λ are greater than zero) with (3.5). Two such

bounds are given in the following paragraphs both as examples and for use later in

this thesis.

Perhaps the simplest example of a quantile bound is when C = D = 0. Under

these conditions it follows that c2 = µ and (3.5) reduces to a function that only

depends on k which is given by

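\[ \mathrm{EMG}_{c_2}(0, 0, k) = \mathrm{EMG}_{\mu}(k) = \frac{1}{2} \times \left[ 1 - e^{\frac{k^{2}}{2}}\,\operatorname{erfc}\!\left(\frac{k}{\sqrt{2}}\right) \right] \tag{3.6} \]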

Taking a derivative shows that the right hand side of EMGµ(k) is monotonically increasing for k ∈ (0,∞), with limiting values 0 as k approaches zero and 1/2 as k approaches ∞. Using this information it can be shown that

\[ 0 < \mathrm{EMG}_{\mu}(k) < \frac{1}{2} \tag{3.7} \]

for any EMG distribution.

It is also possible to determine a quantile bound on the mean m = µ + λ⁻¹ of an EMG distribution. The reparameterization c2 is equal to m when C = k⁻² and

D = 0. Under these conditions (3.5) reduces to a function that only depends on k

which is given by

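\[ \mathrm{EMG}_{m}(k) = \mathrm{EMG}_{c_2}(k^{-2}, 0, k) = \frac{1}{2} \times \left[ 1 - e^{\frac{k^{2}}{2} - 1}\,\operatorname{erfc}\!\left(\frac{k - k^{-1}}{\sqrt{2}}\right) + \operatorname{erf}\!\left(\frac{1}{k\sqrt{2}}\right) \right] \tag{3.8} \]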

Analysis of the derivative shows that the right hand side of EMGm(k) is monotonically decreasing for k ∈ (0,∞). Using this information it can be shown that

\[ \frac{1}{2} < \mathrm{EMG}_{m}(k) < 1 - e^{-1} \approx 0.632 \tag{3.9} \]

for any EMG distribution.
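The two quantile bounds can be checked numerically. The following is a minimal sketch that evaluates (3.6) and (3.8) on a small grid of k values; the grid and the function names are assumptions made for illustration.

```python
import math

SQRT2 = math.sqrt(2.0)

def emg_mu(k):
    """Equation (3.6): value of the EMG cdf at c = mu."""
    return 0.5 * (1.0 - math.exp(k * k / 2) * math.erfc(k / SQRT2))

def emg_m(k):
    """Equation (3.8): value of the EMG cdf at the mean m = mu + 1/lambda."""
    return 0.5 * (1.0
                  - math.exp(k * k / 2 - 1) * math.erfc((k - 1 / k) / SQRT2)
                  + math.erf(1 / (k * SQRT2)))

# Illustrative grid; very large k would overflow exp(k^2 / 2) in this naive form.
ks = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
for k in ks:
    print(f"k = {k:5.1f}   EMG_mu(k) = {emg_mu(k):.4f}   EMG_m(k) = {emg_m(k):.4f}")
```

On this grid the values of EMGµ(k) stay inside (0, 1/2) and the values of EMGm(k) stay inside (1/2, 1 − e⁻¹), in agreement with (3.7) and (3.9).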

3.3 Parameter Estimation

It is possible to completely define an EMG distribution in terms of k and two

quantiles rather than in terms of the three parameters µ, σ, and λ. Assuming that k

is known, one such procedure for determining the parameters is as follows:

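1. Determine µ from the quantile determined by the right hand side of

\[ \mathrm{EMG}_{\mu}(k) = \frac{1}{2} \times \left[ 1 - e^{\frac{k^{2}}{2}}\,\operatorname{erfc}\!\left(\frac{k}{\sqrt{2}}\right) \right] \]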

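2. Determine ml = µ + λ⁻¹ from the quantile determined by the right hand side of

\[ \mathrm{EMG}_{m}(k) = \frac{1}{2} \times \left[ 1 - e^{\frac{k^{2}}{2} - 1}\,\operatorname{erfc}\!\left(\frac{k - k^{-1}}{\sqrt{2}}\right) + \operatorname{erf}\!\left(\frac{1}{k\sqrt{2}}\right) \right] \]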

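3. Determine ms = µ + σ from the quantile determined by the right hand side of

\[ \mathrm{EMG}_{c_2}(0, 1, k) = \frac{1}{2} \times \left[ 1 - e^{\frac{k^{2}}{2} - k}\,\operatorname{erfc}\!\left(\frac{k - 1}{\sqrt{2}}\right) + \operatorname{erf}\!\left(\frac{1}{\sqrt{2}}\right) \right] \]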

4. Estimate λ by subtracting the estimate of µ from ml and then taking the multiplicative inverse of the result

5. Estimate σ by subtracting the estimate of µ from ms

The ability to define the EMG distribution in terms of k and two quantiles opens

up the possibility of a new type of parameter estimation method for an EMG distribution. Given a sample from an EMG distribution, if k can be estimated then it is

possible to estimate the parameters of the EMG distribution. In practice, a simple

way to estimate k is by estimating the sample quantile of the sample mean. This

estimate can then be substituted for the left hand side of (3.8) and an estimate for k

can be obtained by solving this equation for k. As long as the estimate for the sample

quantile of the sample mean satisfies (3.9) it will be possible to solve for k.
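One way to solve (3.8) for k is a simple bisection, which relies only on the fact that EMGm(k) is monotonically decreasing. The sketch below is an assumption about how this step might be carried out and is not the implementation used in this thesis; the bracket [0.001, 20] and the name solve_k are illustrative.

```python
import math

SQRT2 = math.sqrt(2.0)

def emg_m(k):
    """Equation (3.8): value of the EMG cdf at the mean."""
    return 0.5 * (1.0
                  - math.exp(k * k / 2 - 1) * math.erfc((k - 1 / k) / SQRT2)
                  + math.erf(1 / (k * SQRT2)))

def solve_k(q_mean, lo=1e-3, hi=20.0, tol=1e-10):
    """Solve emg_m(k) = q_mean for k by bisection on an assumed bracket [lo, hi]."""
    if not (0.5 < q_mean < 1.0 - math.exp(-1.0)):
        raise ValueError("q_mean must satisfy the bound (3.9)")
    if not (emg_m(hi) <= q_mean <= emg_m(lo)):
        raise ValueError("q_mean corresponds to a k outside the search bracket")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # emg_m is decreasing, so a value that is still too large means k is too small.
        if emg_m(mid) > q_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Round trip: start from a known k, compute the quantile of the mean, recover k.
k_true = 0.75
print(solve_k(emg_m(k_true)))
```

The explicit range checks mirror the requirement that the sample quantile of the sample mean must satisfy (3.9) before a solution for k exists.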

3.4 Shape Estimation

The value of k determines the overall “shape” of the EMG distribution. This can

be seen by analyzing the variance of an EMG distribution in terms of k which yields

the following:

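\begin{align}
\operatorname{Var}(\mathrm{EMG}(c;\mu,\sigma,\lambda)) &= \sigma^{2} + \lambda^{-2} \notag \\
&= \frac{k^{2} + 1}{\lambda^{2}} \tag{3.10} \\
&= \sigma^{2} + \sigma^{2}k^{-2} \tag{3.11}
\end{align}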

As k in (3.10) approaches zero the impact of the Gaussian component on the variance

becomes negligible. As k in (3.11) approaches ∞ the impact of the shifted exponential

component on the variance becomes negligible. As the variance of a component

becomes negligible, the EMG distribution will be “close” to the distribution of the

other component. These observations indicate that for values of k that are “large”

the EMG distribution is “close” to a Gaussian distribution and that for values of k

that are “small” the EMG distribution is “close” to a shifted exponential distribution.

In practice, it is likely that an EMG distribution which is very “close” to being

either a Gaussian distribution or a shifted exponential distribution will be treated

as a Gaussian distribution or a shifted exponential distribution respectively. Due to

this, it seems reasonable to assume that EMG distributions which arise in practice

are likely to have k values that are located within a certain bounded interval. The

variance relations which were discussed in the previous paragraph provide a way to

obtain a rough estimate for this bounded interval. By combining (3.10) and (3.11) it follows that

\[ \frac{k^{2} + 1}{\lambda^{2}} = \sigma^{2} + \sigma^{2}k^{-2} \]

Setting k = 1 results in σ² = λ⁻², which implies that the variances of the two components are equal. It seems reasonable to assume that a component will become negligible when its variance is less than a certain percentage of the other. It further seems reasonable

to assume that this percentage can be set to 1%, which results in the following bounds

on k

k ∈ [0.1, 10].

Several plots of EMG distributions for different values of k between 0.1 and 10 are

given in Figure 3.1.
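The endpoints of this interval follow from a short calculation. As a sketch, assuming that "negligible" means a variance ratio below the 1% threshold chosen above, the ratio of the Gaussian variance σ² to the exponential variance λ⁻² is k², so that

\[ \frac{\sigma^{2}}{\lambda^{-2}} = (\lambda\sigma)^{2} = k^{2}, \qquad k^{2} \geq 0.01 \iff k \geq 0.1, \qquad k^{-2} \geq 0.01 \iff k \leq 10. \]

The first condition keeps the Gaussian component from becoming negligible and the second condition does the same for the shifted exponential component.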

In practice, it seems unlikely that values of k outside of this interval will be encountered. If this is not the case then it will be very difficult to estimate the parameters

of the EMG distribution. The reason for this is that the closer the EMG distribution

becomes to either a Gaussian distribution or a shifted exponential distribution the

harder it will be to estimate the exact magnitude of the difference. In general, the

slighter the modification to a distribution the harder it will be to detect.

3.5 EMG Right Tail Approximation

It is possible to show that the EMG cdf is approximately the same as a shifted

exponential cdf in the right tail of the distribution. The cdf of a shifted exponential

distribution will be denoted by SED(c;λ, T ) and is defined to be

\[ \mathrm{SED}(c;\lambda, T) = 1 - e^{-\lambda(c - T)} \tag{3.12} \]

where T is the shift parameter and λ is the same shape parameter that is used in an

exponential distribution. The desired approximation will be derived by considering

the reparameterization

\[ c_3 = \mu + D\sigma \tag{3.13} \]

[Figure 3.1: Plots of EMG distributions for different values of k. Panels: (a) k = 0.1, (b) k = 0.5, (c) k = 1.0, (d) k = 2.0.]

where D ∈ R. Using this new reparameterization in place of c in the EMG cdf it

follows that

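\[ \mathrm{EMG}_{c_3}(D, k) = \frac{1}{2} \times \left[ 1 - e^{\frac{k^{2}}{2} - Dk}\,\operatorname{erfc}\!\left(\frac{k - D}{\sqrt{2}}\right) + \operatorname{erf}\!\left(\frac{D}{\sqrt{2}}\right) \right] \]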

To see what happens in the right tail the limit as c3 approaches infinity is consid-

ered. This limit can not immediately be determined because the right hand side of

EMGc3(D, k) does not directly include c3. The right hand side is written in terms of

D so a relation between the limiting value of c3 and D would allow the limit to be

easily evaluated. From the constraint that EMGµ(k) ∈ (0, 1/2) given by (3.7) and the constraint

that σ > 0 it is clear that if c3 is greater than the median then D > 0. Thus as c3

approaches infinity, D approaches infinity. Using this information it can be seen that

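\begin{align*}
\lim_{c_3 \to \infty} \mathrm{EMG}_{c_3}(D, k) &= \lim_{c_3 \to \infty} \frac{1}{2} \times \left[ 1 - e^{\frac{k^{2}}{2} - Dk}\,\operatorname{erfc}\!\left(\frac{k - D}{\sqrt{2}}\right) + \operatorname{erf}\!\left(\frac{D}{\sqrt{2}}\right) \right] \\
&= \lim_{D \to \infty} \frac{1}{2} \times \left[ 1 - e^{\frac{k^{2}}{2} - Dk}\,\operatorname{erfc}\!\left(\frac{k - D}{\sqrt{2}}\right) + \operatorname{erf}\!\left(\frac{D}{\sqrt{2}}\right) \right] \\
&= \frac{1}{2} \times \left[ 2 - 2 \lim_{D \to \infty} e^{\frac{k^{2}}{2} - Dk} \right] \\
&= 1 - \lim_{D \to \infty} e^{-k\left(D - \frac{k}{2}\right)}
\end{align*}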

where the expression inside the last limit has the form of the cdf of a shifted exponential distribution in D with shift T = k/2 and shape parameter λ = k. Both the erf and the erfc terms approach their limits at a much faster rate than does a term of the form e^(−kD), hence the cdf of the EMG

distribution should be approaching the cdf of a shifted exponential distribution.

To show that the right tail approximation is accurate in a more quantitative

manner it is first noted that if D ≥ D0 > 0 then the following bounds hold

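\begin{align*}
1 &> \operatorname{erf}\!\left(\frac{D}{\sqrt{2}}\right) \geq 1 - \alpha_1 \\
2 &> \operatorname{erfc}\!\left(\frac{k - D}{\sqrt{2}}\right) \geq 2 - \alpha_2
\end{align*}

where

\begin{align*}
\alpha_1 &= \operatorname{erfc}\!\left(\frac{D_0}{\sqrt{2}}\right) \\
\alpha_2 &= \operatorname{erfc}\!\left(\frac{D_0 - k}{\sqrt{2}}\right)
\end{align*}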

From the fact that k > 0 it must be that α1 < α2. Using these constraints along with

the inequality between α1 and α2 bounds can be put on EMGc3(D, k). The lower

bound is given by

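\begin{align*}
\mathrm{EMG}_{c_3}(D, k) &\geq \frac{1}{2} \times \left[ 1 - e^{\frac{k^{2}}{2} - Dk}\,\operatorname{erfc}\!\left(\frac{k - D}{\sqrt{2}}\right) + 1 - \alpha_1 \right] \\
&= \frac{1}{2} \times \left[ (2 - \alpha_1) - e^{\frac{k^{2}}{2} - Dk}\,\operatorname{erfc}\!\left(\frac{k - D}{\sqrt{2}}\right) \right] \\
&> \frac{1}{2} \times \left[ (2 - \alpha_1) - 2e^{\frac{k^{2}}{2} - Dk} \right] \\
&= \left(1 - \frac{\alpha_1}{2}\right) - e^{\frac{k^{2}}{2} - Dk} \\
&= 1 - e^{\frac{k^{2}}{2} - Dk} - \frac{\alpha_1}{2} \\
&= 1 - e^{\frac{k^{2}}{2} - Dk} - \frac{\operatorname{erfc}\!\left(\frac{D_0}{\sqrt{2}}\right)}{2}
\end{align*}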

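and, because erf(D/√2) < 1, the upper bound is given by

\begin{align*}
\mathrm{EMG}_{c_3}(D, k) &< \frac{1}{2} \times \left[ 1 - e^{\frac{k^{2}}{2} - Dk}(2 - \alpha_2) + 1 \right] \\
&= \frac{1}{2} \times \left[ 2 - (2 - \alpha_2)e^{\frac{k^{2}}{2} - Dk} \right] \\
&= 1 - e^{\frac{k^{2}}{2} - Dk} + \frac{\operatorname{erfc}\!\left(\frac{D_0 - k}{\sqrt{2}}\right)}{2}\,e^{\frac{k^{2}}{2} - Dk} \\
&\leq 1 - e^{\frac{k^{2}}{2} - Dk} + \frac{\operatorname{erfc}\!\left(\frac{D_0 - k}{\sqrt{2}}\right)}{2}\,e^{\frac{k^{2}}{2} - D_0 k} \\
&= 1 - e^{\frac{k^{2}}{2} - Dk} + \frac{\operatorname{erfc}\!\left(\frac{w(k)}{\sqrt{2}}\right)}{2}\,e^{\frac{w^{2}(k)}{2} - \frac{D_0^{2}}{2}}
\end{align*}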

where w(k) = (D0 − k). For these two bounds the error between EMGc3(D, k) and

the bounds are given by

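\begin{align}
L_e &= \frac{\operatorname{erfc}\!\left(\frac{D_0}{\sqrt{2}}\right)}{2} \tag{3.14} \\
U_e &= \frac{\operatorname{erfc}\!\left(\frac{w(k)}{\sqrt{2}}\right)}{2}\,e^{\frac{w^{2}(k)}{2} - \frac{D_0^{2}}{2}} \tag{3.15}
\end{align}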

where Le is the error in the lower bound and Ue is the error in the upper bound.

Both error terms approach zero much more rapidly than does the term e^(k²/2 − Dk) so

these approximations should be quite accurate as long as D is large enough relative

to k.
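The bounds can be verified numerically for particular values. The sketch below evaluates the lower bound 1 − e^(k²/2 − Dk) − Le and the upper bound 1 − e^(k²/2 − Dk) + Ue and compares them with EMGc3(D, k); the choices k = 1 and D0 = 2 and the function names are assumptions made for illustration.

```python
import math

SQRT2 = math.sqrt(2.0)

def emg_c3(D, k):
    """EMG cdf evaluated at c3 = mu + D*sigma."""
    return 0.5 * (1.0
                  - math.exp(k * k / 2 - D * k) * math.erfc((k - D) / SQRT2)
                  + math.erf(D / SQRT2))

def tail_bounds(D, D0, k):
    """Lower and upper bounds on emg_c3(D, k), valid for D >= D0 > 0."""
    sed = 1.0 - math.exp(k * k / 2 - D * k)        # shifted exponential approximation
    w = D0 - k
    lower_err = math.erfc(D0 / SQRT2) / 2          # L_e from (3.14)
    upper_err = math.erfc(w / SQRT2) / 2 * math.exp(w * w / 2 - D0 * D0 / 2)  # U_e from (3.15)
    return sed - lower_err, sed + upper_err

k, D0 = 1.0, 2.0   # illustrative values
for D in [2.0, 3.0, 4.0, 6.0]:
    lower, upper = tail_bounds(D, D0, k)
    print(f"D = {D:.1f}   {lower:.6f} < {emg_c3(D, k):.6f} < {upper:.6f}")
```

Both error terms are fixed by D0, so the bounds are loosest near D = D0 and remain valid (although not tight) further into the tail.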

The error in approximation can also be characterized in terms of the percentage

error. It is possible to show that the percentage error of the approximation monotonically decreases to zero for D > 0. The percentage error PE of approximating the value of EMG(c3) at D is given by

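\[ PE = \frac{\mathrm{EMG}_{c_3}(D, k) - \left(1 - e^{\frac{k^{2}}{2} - Dk}\right)}{\mathrm{EMG}_{c_3}(D, k)} = 1 - \frac{1 - e^{\frac{k^{2}}{2} - Dk}}{\mathrm{EMG}_{c_3}(D, k)} \tag{3.16} \]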

If the percentage error is monotonically decreasing to zero for D > 0 then it must be

the case that the second term in PE given by

\[ P_r = \frac{1 - e^{\frac{k^{2}}{2} - Dk}}{\mathrm{EMG}_{c_3}(D, k)} \]

is monotonically increasing to one for D > 0. Clearly the limit of Pr is one since both

the numerator and the denominator are valid cdfs. The derivative of Pr with respect

to D can be shown to be positive so it follows that Pr is monotonically increasing

with respect to D. The denominator of the derivative is always positive since it is

squared and the numerator of the derivative given by

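\[ -2k\,e^{\frac{k^{2}}{2} - Dk} \left( \operatorname{erf}\!\left(\frac{\sqrt{2}}{2}(D - k)\right) - \operatorname{erf}\!\left(\frac{\sqrt{2}}{2}D\right) \right) \]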

will be positive for all D > 0.

As an example of the accuracy of this approximation assume that k = 1 and

D = 2. Under these circumstances it follows that

\[ \frac{\mathrm{EMG}_{c_3}(D, k) - \left(1 - e^{\frac{k^{2}}{2} - Dk}\right)}{\mathrm{EMG}_{c_3}(D, k)} \approx 0.016 \]

which shows that the percentage error in the approximation is close to 1.6%. Because

the percentage error is monotonically decreasing for D > 0, the percent error in the

approximation will be no more than approximately 1.6% for all D ≥ 2.
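The example above can be reproduced with a few lines of code. The following sketch computes PE from (3.16) for k = 1 and D = 2 and then evaluates it at several larger values of D; the function names are illustrative.

```python
import math

SQRT2 = math.sqrt(2.0)

def emg_c3(D, k):
    """EMG cdf evaluated at c3 = mu + D*sigma."""
    return 0.5 * (1.0
                  - math.exp(k * k / 2 - D * k) * math.erfc((k - D) / SQRT2)
                  + math.erf(D / SQRT2))

def percentage_error(D, k):
    """Equation (3.16): relative error of the shifted exponential approximation."""
    emg = emg_c3(D, k)
    sed = 1.0 - math.exp(k * k / 2 - D * k)
    return (emg - sed) / emg

print(round(percentage_error(2.0, 1.0), 3))   # prints 0.016

# The error keeps shrinking as D moves further into the right tail.
for D in [2.0, 3.0, 4.0, 5.0]:
    print(D, percentage_error(D, 1.0))
```

For very large D the numerator of (3.16) is a difference of two nearly equal numbers, so a direct evaluation of this kind loses precision deep in the tail.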

4. Application of the EMG Distribution to Actual Affymetrix Microarray

Perfect Match (PM) Probe Distributions

Data from five Affymetrix microarrays described in [31] were downloaded from [7].

The five Affymetrix microarray data files that were selected were T01 tumor.CEL -

T05 tumor.CEL. It is found that the sample distributions of the PM probes for these

five Affymetrix microarrays are unlikely to follow an EMG distribution. First, it is

shown that the right tail of the sample pdf is not what would be expected for an EMG

distribution. Further, it is shown that the sample quantiles of the sample means for

all five distributions are larger than would be expected for an EMG distribution.

4.1 Comparing the Right Tail to a Shifted Exponential Distribution

From the results in 3.5 it is clear that the EMG cdf should be well approximated

by the cdf of a shifted exponential distribution in the right tail. In order to apply

this approximation in practice it will be necessary to know where to begin. It will

be shown that the start of the right tail can be reasonably approximated if an upper

bound kmax on k can be assumed. Once the right tail has been located, a slightly

modified ratio of two sample quantiles will be compared to the ratio that would be

expected if the distribution was a shifted exponential distribution. The results of

this test will show that the right tails of the PM probe distributions from the five

Affymetrix microarrays described at the beginning of this chapter are very different

from what would be expected for an EMG distribution.

4.1.1 Locating the Beginning of the Right Tail

To estimate the beginning of the right tail there must first be an estimate for the

upper bound on k denoted by kmax. From this estimate it is possible to determine

upper bounds on σ denoted by σmax and µ denoted by µmax. Using the result from 3.5

that the percentage error in the right tail approximation is monotonically decreasing

it is possible to select a value for D such that the percentage error is bounded. These

three estimates are then used to calculate c3 (3.13) which is the estimate for the

beginning of the right tail.

To estimate kmax it is not unreasonable to assume a value for kmax by eye-balling

the data given the insights from 3.4. From viewing the sample pdf histograms of the

five PM probe distributions (Figure 4.1) kmax = 1 seems like a safe estimate. Using

kmax, σmax can be obtained by rearranging (3.11) to obtain

\[ \sigma_{\max} \leq \frac{s}{\sqrt{1 + k_{\max}^{-2}}} \]

where s is the sample standard deviation. Substituting k = kmax, C = D = 0 into

(3.5) yields an estimate for µmax. Lastly a suitable value for D must be chosen so

that the percentage error between the actual EMG tail and the shifted exponential

tail is “small” enough. In 3.5 it was shown that for D ≥ 2 the percentage error in

the approximation was no more than roughly 1.6%. Given that this error seems to

be “small” enough it is assumed that the right tail begins at c3 = µmax + 2σmax.
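The procedure for locating the start of the right tail can be sketched in code as follows. This is a minimal illustration under stated assumptions: the sample is simulated rather than taken from microarray data, the order-statistic quantile convention is one of several possible choices, and the names sample_quantile and right_tail_start are hypothetical.

```python
import math
import random
import statistics

def emg_mu(k):
    """Equation (3.6), i.e. (3.5) with C = D = 0."""
    return 0.5 * (1.0 - math.exp(k * k / 2) * math.erfc(k / math.sqrt(2.0)))

def sample_quantile(sorted_x, p):
    """Order-statistic estimate of the p-th quantile (an assumed convention)."""
    idx = min(len(sorted_x) - 1, max(0, int(p * len(sorted_x))))
    return sorted_x[idx]

def right_tail_start(sample, k_max, D=2.0):
    """Return (mu_max, sigma_max, c3) with c3 = mu_max + D * sigma_max."""
    s = statistics.stdev(sample)
    sigma_max = s / math.sqrt(1.0 + k_max ** -2)       # rearranged (3.11)
    mu_max = sample_quantile(sorted(sample), emg_mu(k_max))
    return mu_max, sigma_max, mu_max + D * sigma_max

# Simulated EMG sample (Gaussian plus exponential) with k = 0.5, below k_max = 1.
random.seed(1)
sample = [random.gauss(5.0, 1.0) + random.expovariate(0.5) for _ in range(20000)]
mu_max, sigma_max, c3 = right_tail_start(sample, k_max=1.0)
print(mu_max, sigma_max, c3)
```

With kmax = 1 the bound on σ reduces to s/√2, so the estimated start of the right tail is the kmax-based estimate of µ plus √2 times the sample standard deviation.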

4.1.2 Testing the Right Tail

In order to test that the right tail is approximately a shifted exponential distribution it is necessary to use a test that will not be affected much by the error in the approximation. One such test is to slightly modify the ratio of two quantiles.

[Figure 4.1: Plots of the sample pdf histograms for the PM probe distributions from five Affymetrix microarrays along with a plot of an EMG distribution with k = 1. Panels: (a) T01 tumor.CEL, (b) T02 tumor.CEL, (c) T03 tumor.CEL, (d) T04 tumor.CEL, (e) T05 tumor.CEL, (f) EMG with k = 1.]

Estimates of quantiles tend to be fairly robust so this test should not be greatly affected by the approximation error between EMGc3(D, k) and the cdf of some shifted

exponential distribution.

In order to derive a test for the ratio of quantiles from a shifted exponential

distribution such a test was first created for an exponential distribution. This test is

then extended in a natural way to the shifted exponential distribution. The cdf for

an exponential distribution denoted by E(c;λ) is given by

\[ E(c;\lambda) = 1 - e^{-\lambda c} \]

For an exponential distribution, the ratio of any two quantiles is constant. To see

this suppose that E(x1;λ) = q and E(x2;λ) = p. Then it follows that

\[ \frac{x_1}{x_2} = \frac{\ln(1 - q)}{\ln(1 - p)} \]

where the ratio of the quantiles is clearly independent of λ. For a shifted exponential

distribution the only change that needs to be made is to shift the input by the value

of the shift parameter T . The shifted ratio of its quantile denoted by Sr is given by

\begin{align}
S_r &= \frac{x_1 - T}{x_2 - T} \tag{4.1} \\
&= \frac{\ln(1 - q)}{\ln(1 - p)} \tag{4.2}
\end{align}
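The independence of the shifted ratio from λ and T can be illustrated numerically. The sketch below inverts the shifted exponential cdf 1 − e^(−λ(c − T)) to obtain quantiles and then forms the ratio in (4.1); the parameter values and function names are chosen only for illustration.

```python
import math

def sed_quantile(p, lam, T):
    """Quantile of a shifted exponential: solves 1 - exp(-lam * (c - T)) = p for c."""
    return T - math.log(1.0 - p) / lam

def shifted_ratio(x1, x2, T):
    """Equation (4.1): ratio of two quantiles after removing the shift T."""
    return (x1 - T) / (x2 - T)

q, p = 0.50, 0.80
expected = math.log(1.0 - q) / math.log(1.0 - p)    # right hand side of (4.2)

# Illustrative (lam, T) pairs; the ratio should be the same for all of them.
for lam, T in [(0.5, 0.0), (2.0, 3.0), (10.0, -1.0)]:
    x1 = sed_quantile(q, lam, T)
    x2 = sed_quantile(p, lam, T)
    print(lam, T, shifted_ratio(x1, x2, T), expected)
```

For q = 0.50 and p = 0.80 the expected ratio is ln(0.5)/ln(0.2), which is roughly 0.43 regardless of the shape and shift parameters.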

The ratio test just derived for a shifted exponential distribution can be directly

applied to the experimental data being studied despite two possible problems. The

first possible problem is that the right tail of the distribution will not be a valid

probability distribution (because the area under the right tail is not equal to one).

Instead the right tail will be some constant multiple of a probability distribution that

is “close” to being a shifted exponential distribution. Fortunately, these constants

will cancel out by taking a ratio so the test is not affected. The second possible

problem is the approximation error. It is important to show that the approximation

error will not cause a “large” error in Sr. Bounds on the error in Sr caused by error

in the approximation will be derived. In the application of this test to the PM probe

distribution of Affymetrix microarrays these bounds will be used to show that the

error in Sr caused by approximation error is not significant.

Showing that the approximation does not “significantly” affect Sr requires some

extra work due to the shift parameter T being present in the ratio test. It was shown

in 3.5 that the error in approximation can be written in terms of percentage error and

that the percentage error can be made as small as desired by moving far enough to

the right. Shifting the actual quantile value along with its approximation changes the

percentage error so it is necessary to know how the percentage error changes in this

case. The percentage error (PE from (3.16)) and the shifted percentage error denoted

by PEs can be related as follows

\[ PE = PE_s \, \frac{\mathrm{EMG}_{c_3}(D, k) - T}{\mathrm{EMG}_{c_3}(D, k)} \]

From the last equation it follows that if the ratio of the shifted quantile to the actual

quantile is not too “large” then it will follow that if PE is “reasonably” small then PEs

will also be “reasonably” small. Applying this result back to the EMG cdf it follows

that the ratio of any two quantiles x1 = EMG(y1;µ, σ, λ) and x2 = EMG(y2;µ, σ, λ)

that are “far” enough into the right tail is bounded by

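\[ \left( \frac{1 - PE_s}{1 + PE_s} \right) < \left( \frac{\mathrm{EMG}(y_1;\mu,\sigma,\lambda) - T}{\mathrm{EMG}(y_2;\mu,\sigma,\lambda) - T} \right) \left( \frac{\mathrm{SED}(y_2) - T}{\mathrm{SED}(y_1) - T} \right) < \left( \frac{1 + PE_s}{1 - PE_s} \right) \]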

This shows that the quantile ratio assuming a shifted exponential distribution will be

approximately the same as the quantile ratio assuming an EMG cdf as long as PEs

is “small” enough.

This ratio test is now applied to the sample data from the five Affymetrix PM

probe distributions using the quantiles q = 0.50 and p = 0.80. For all five sample pdfs

it was assumed that kmax = 1 (see Figure 4.1 for a visual comparison) and that the

right tail could be assumed to start at D = 2 (see 4.1.1 for justification). Using these

assumptions it was found that even with the approximation error being taken into

account, varying the sample quantiles by even as much as five standard deviations

was not enough to match the ratio that would be expected. This result strongly

suggests that the right tail does not follow a shifted exponential distribution which

casts doubt on the assumption that this data follows an EMG distribution.

4.2 Discrepancy in the Sample Quantile of the Sample Mean

For all five data sets the sample quantile of the sample mean was much larger than the (1 − e⁻¹)th quantile. Since the quantile of the mean of an EMG distribution can not be larger than the (1 − e⁻¹)th quantile it seems likely that the sample data are not EMG distributed. To investigate this possibility a hypothesis test is created to determine whether or not the quantile of the mean of each distribution was larger than the (1 − e⁻¹)th quantile.

To create the hypothesis test it is assumed that both the sample mean and the

sample quantiles approximately follow a Gaussian distribution. Due to the fact that

the sample size was greater than 200,000 for all five sample distributions, these two

assumptions seem reasonable in light of the central limit theorem. Given these assumptions, the paired t-test can be used to determine if it is likely that the quantile of the mean is larger than the (1 − e⁻¹)th quantile.

After applying the paired t-test to all of the sample distributions it was found that

the difference between the sample quantile of the sample mean and the (1 − e⁻¹)th quantile was very high. For all five data sets, the difference between the sample quantile of the sample mean and the (1 − e⁻¹)th sample quantile was no less than 90, while the standard deviations for both estimates were less than 1. Given these values the null hypothesis that the quantile of the mean is less than the (1 − e⁻¹)th quantile

can easily be rejected at the α = 0.01 level for all five sample distributions.

Since the mean of the sample data occurs at such a large quantile it seems likely

that the best EMG fit for the data would be a distribution that is close to being

a shifted exponential distribution (small value of k). From viewing Figure 4.1 it is

clear that this sample pdf is not very similar to a shifted exponential distribution.

This result shows that it is unlikely that these sample distributions follow an EMG

distribution.

5. Fitting the Right Tail of the Perfect Match (PM) Probe Data

Given that the sample data are unlikely to follow an EMG distribution the next

question that should be asked is what distribution do these data follow? The previous

chapter showed that the right tails of the sample distributions were very different from

the right tail of an EMG distribution. From visual inspection (Figure 4.1) it appears

that the problem is due to the right tails of the sample pdf histograms being much

too “heavy.” In other words the right tails of the sample pdf histograms do not go to

zero as quickly as would be expected.

After further visual examination the right tails of the sample pdfs all seemed to

share the property that doubling the input to the sample pdf reduced the height of the sample pdf histogram to approximately one third. Taking this observation as

an assumption the problem of determining an appropriate distribution for the right

tail of the sample data becomes the problem of finding a function with this property.

Such a function will be derived in the next section and will be shown to fit the right

tails of the sample pdf histograms closely. The derivation of this function will then

be generalized to functions of a larger class.

5.1 Derivation of Functions That Decrease by a Common Ratio

It is assumed that a function f(x) such that

\[ \frac{f(x)}{f(2x)} = 3 \tag{5.1} \]

may be an appropriate distribution for modeling the right tails of the sample pdf

histograms. In order to determine the form of f(x), several common functional forms

were assumed for f(x) and the algebra was checked to see if the final result was valid.

After several attempts it was found that by assuming f(x) = g(x)^x it was possible to determine f(x).

By using the substitution f(x) = g(x)^x it follows that

\[ \frac{f(x)}{f(2x)} = \frac{g(x)^{x}}{g(2x)^{2x}} = 3 \tag{5.2} \]

It is possible to create a recurrence relation over some values of g(x). This is accomplished using the following modified form of (5.2)

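\begin{align*}
\frac{g(x)^{x}}{g(2x)^{2x}} &= 3 \\
x\log(g(x)) - 2x\log(g(2x)) &= \log(3) \\
\log\!\left(\frac{g(x)}{g(2x)^{2}}\right) &= \frac{\log(3)}{x} = \log\!\left(3^{x^{-1}}\right) \\
g(x) &= 3^{x^{-1}} g(2x)^{2} \\
g(2x) &= \sqrt{g(x)\,3^{-x^{-1}}}
\end{align*}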

If it is assumed that g(1) = 1 then the first six terms of the recurrence are as

follows:

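\begin{align*}
g(1) &= 1 \\
g(2) &= \sqrt{g(1)\,3^{-1}} = 3^{-\frac{1}{2}} \\
g(4) &= \sqrt{g(2)\,3^{-\frac{1}{2}}} = 3^{-\frac{1}{2}} \\
g(8) &= \sqrt{g(4)\,3^{-\frac{1}{4}}} = 3^{-\frac{3}{8}} \\
g(16) &= \sqrt{g(8)\,3^{-\frac{1}{8}}} = 3^{-\frac{4}{16}} \\
g(32) &= \sqrt{g(16)\,3^{-\frac{1}{16}}} = 3^{-\frac{5}{32}}
\end{align*}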

The last three terms in this list show that the numerator of the exponent is the log base two of the input and the denominator of the exponent is the input. This suggests that the function 3^(−log₂(x)/x) may work for g(x), which suggests that f(x) = g(x)^x = 3^(−log₂(x)).
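The recurrence and the conjectured closed form can be compared numerically. The following sketch iterates g(2x) = √(g(x) 3^(−1/x)) starting from g(1) = 1 and prints each term next to 3^(−log₂(x)/x); the function name g_closed is illustrative.

```python
import math

def g_closed(x):
    """Conjectured closed form g(x) = 3 ** (-log2(x) / x)."""
    return 3.0 ** (-math.log2(x) / x)

# Iterate the recurrence g(2x) = sqrt(g(x) * 3 ** (-1 / x)) from g(1) = 1.
x, g = 1, 1.0
terms = [(x, g)]
for _ in range(5):
    g = math.sqrt(g * 3.0 ** (-1.0 / x))
    x *= 2
    terms.append((x, g))

for x, g in terms:
    print(f"g({x}) = {g:.6f}   closed form = {g_closed(x):.6f}")
```

The two columns agree for x = 1, 2, 4, 8, 16, and 32, which supports the closed form at powers of two; the recurrence by itself says nothing about other inputs.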

To verify that f(x) = 3^(−log₂(x)) has the desired property this function is tested in (5.1).

\begin{align*}
\frac{3^{-\log_2(x)}}{3^{-\log_2(2x)}} &= r \\
\log_2(x^{-1}) - \log_2((2x)^{-1}) &= \log_3(r) \\
\log_2(2) &= \log_3(r) \\
3 &= r
\end{align*}

The last line of the algebra shows that f(x) has the desired property.

The format of this function suggests that it would be possible to generate functions

such that

\[ \frac{f(x)}{f(\alpha x)} = \beta \]

where α, β > 1 by using the function β^(−log_α(x)). Working out the same steps that were performed for f(x) in the previous paragraph it follows that

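\begin{align*}
\frac{\beta^{-\log_\alpha(x)}}{\beta^{-\log_\alpha(\alpha x)}} &= r \\
\log_\alpha(x^{-1}) - \log_\alpha((\alpha x)^{-1}) &= \log_\beta(r) \\
\log_\alpha(\alpha) &= \log_\beta(r) \\
\beta &= r
\end{align*}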

This algebra shows that this class of functions has the expected property.
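The common ratio property of this class of functions can also be checked numerically. The sketch below builds β^(−log_α(x)) for a few illustrative pairs (α, β) and evaluates f(x)/f(αx) at several inputs; the helper name common_ratio_fn is hypothetical.

```python
import math

def common_ratio_fn(alpha, beta):
    """Return f(x) = beta ** (-log_alpha(x)), which satisfies f(x) / f(alpha * x) = beta."""
    def f(x):
        return beta ** (-math.log(x, alpha))
    return f

# Illustrative (alpha, beta) pairs; (2, 3) is the case used for the sample data.
for alpha, beta in [(2.0, 3.0), (10.0, 2.0), (1.5, 7.0)]:
    f = common_ratio_fn(alpha, beta)
    ratios = [f(x) / f(alpha * x) for x in (0.5, 1.0, 4.0, 123.0)]
    print(alpha, beta, ratios)
```

Each printed ratio equals β up to floating point error, independently of x.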

5.2 Application of Functions that Decrease by a Common Ratio to the

Right Tail

Attempting to fit the right tails of the sample pdfs immediately yields encouraging

results. By fitting a shifted version of the function f(x) = 3^(−log₂(x)) to the right tail

of the sample pdf histogram from T01 tumor.CEL it can be seen that the shifted

version of f(x) and the right tail of the sample pdf histogram are very similar (Figure

5.1). It seems likely that the pdf for the sample data approaches a function that

decreases by a common ratio.

Figure 5.1: Plot of the right tail of the sample pdf histogram for the PM probe data from T01 tumor.CEL fitted to a shifted version of f(x) = 3^(−log₂(x)).

Comparing the right tail of an EMG pdf to f(x) shows that these two functions

are very different. Both functions are concave up and constantly decreasing, however,

the rate of decrease is very different. By definition the ratio f(x)/f(2x) = 3 is constant with respect to x. For an EMG distribution this ratio is not constant with respect to x

and can be very different from 3 depending on the value of x. As an example when

µ = 0, σ = 1, λ = 1, and x = 10 the ratio for the right tail of an EMG distribution is approximately 5,000. In general, the right tail of an EMG pdf converges to zero

much more quickly than does f(x).

6. Practical Implementation of EMG Parameter Estimation Method and

Properties

In section 3.3 it was shown that once the value of the variable k is known, it is possible to estimate the parameters of an EMG distribution using two sample quantiles. Using (3.8) it is possible to estimate k by replacing EMGm(k) with the sample quantile of the sample mean. Combining these two results constitutes a parameter estimation method which is given by

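1. Estimate k with ke where ke is calculated by replacing the left hand side of the following equation with the sample quantile of the sample mean and solving for ke

\[ \mathrm{EMG}_{m}(k_e) = \frac{1}{2} \times \left[ 1 - e^{\frac{k_e^{2}}{2} - 1}\,\operatorname{erfc}\!\left(\frac{k_e - k_e^{-1}}{\sqrt{2}}\right) + \operatorname{erf}\!\left(\frac{1}{k_e\sqrt{2}}\right) \right] \]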

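2. Determine µ from the quantile determined by the right hand side of

\[ \mathrm{EMG}_{\mu}(k_e) = \frac{1}{2} \times \left[ 1 - e^{\frac{k_e^{2}}{2}}\,\operatorname{erfc}\!\left(\frac{k_e}{\sqrt{2}}\right) \right] \]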

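3. Determine ml = µ + λ⁻¹ from the quantile determined by the right hand side of

\[ \mathrm{EMG}_{m}(k_e) = \frac{1}{2} \times \left[ 1 - e^{\frac{k_e^{2}}{2} - 1}\,\operatorname{erfc}\!\left(\frac{k_e - k_e^{-1}}{\sqrt{2}}\right) + \operatorname{erf}\!\left(\frac{1}{k_e\sqrt{2}}\right) \right] \]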

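4. Determine ms = µ + σ from the quantile determined by the right hand side of

\[ \mathrm{EMG}_{c_2}(0, 1, k_e) = \frac{1}{2} \times \left[ 1 - e^{\frac{k_e^{2}}{2} - k_e}\,\operatorname{erfc}\!\left(\frac{k_e - 1}{\sqrt{2}}\right) + \operatorname{erf}\!\left(\frac{1}{\sqrt{2}}\right) \right] \]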

5. Estimate λ by subtracting the estimate of µ from ml and then taking the multiplicative inverse of the result

6. Estimate σ by subtracting the estimate of µ from ms

Although this method will work in theory, several modifications are needed to make it practical.

It will be shown that several slight modifications to the parameter estimation procedure described above result in a practical implementation. The final implementation is consistent and always returns “valid” parameter estimates, where “valid” parameter estimates are those that satisfy all constraints on the original parameters (such as σ > 0). This new parameter estimation method is then compared to other parameter estimation methods for the EMG distribution from the literature. It is found that the new parameter estimation method has several advantages over other currently available methods.

6.1 Proof of Consistency

This section proves that the new parameter estimation method, as introduced at the beginning of this chapter, is consistent. The proof also applies to the final implementation because the modifications made do not affect consistency.

Theorem 6.1.1. The parameter estimation method introduced is consistent.

Proof. To prove the theorem it will first be proved that the estimate for k is consistent. Applying the same techniques used to show that the estimate for k is consistent, it can easily be shown that the consistency of the parameter estimates follows from the consistency of the estimate for k. To show that the estimate for k is consistent it will be shown that the sample quantile of the sample mean is a consistent estimate for the quantile of the mean. Given the continuity of the EMG cdf it will then follow that the estimate for k is consistent.

By definition, a consistent estimator $R_n$ of the parameter k must have the following property:

\[
\lim_{n \to \infty} \Pr(|R_n - k| > \varepsilon) = 0
\]

where ε is any real number greater than zero and n is the sample size. By the weak law of large numbers, the sample mean is a consistent estimator of the mean of any random variable whose mean is finite. The EMG distribution has the finite mean $\mu + \lambda^{-1}$, so the sample mean is a consistent estimator of the mean of an EMG distribution.

If the quantiles were completely known then it would immediately follow by the continuity of the EMG cdf that the quantile of the sample mean is a consistent estimate for the quantile of the mean. In the actual estimation method, sample quantiles are used instead of actual quantiles; however, this does not affect consistency. For any quantile value q (where 0 < q < 1), the number of sample values that fall at or below the actual q-quantile follows a binomial distribution with parameters n and q, so the corresponding sample proportion has variance q(1 − q)/n. Given any closed interval of quantile values (which does not include zero or one), this variance is bounded above uniformly over the interval. Using Chebyshev’s inequality it follows that no matter which quantile in this interval is chosen, the probability that the sample quantile will be more than ε units away from the actual quantile approaches zero as the sample size approaches infinity. This shows that the estimation error introduced by using sample quantiles in place of the actual quantiles can be made arbitrarily small with probability approaching one. Since the EMG cdf is continuous, the “small” deviations in the sample quantiles will cause “small” deviations in the sample quantile of the sample mean.
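The Chebyshev step can be written out explicitly. As a sketch, let $F_n$ denote the empirical cdf of the sample and $x_q$ the actual q-quantile (both symbols are introduced here for illustration and are not notation from the thesis). Then $n F_n(x_q)$ is binomial with parameters n and q, and

\[
\Pr\big( |F_n(x_q) - q| > \varepsilon \big) \;\le\; \frac{q(1-q)}{n \varepsilon^2} \;\le\; \frac{1}{4 n \varepsilon^2} \;\longrightarrow\; 0 \quad \text{as } n \to \infty .
\]

The right-hand bound does not depend on q, so it holds simultaneously for every quantile level in the interval under consideration.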

Combining these results it is clear that the estimator for k will be consistent.

Using the same techniques it is also possible to show that the rest of the estimates

used in this method are also consistent.

6.2 Practical Considerations and Alterations

Before describing the final form of the new parameter estimation method, several practical considerations that impact the method will be discussed. The first issue is that in order to estimate the parameters, (3.8) must be solved for k. If the estimated value of $\mathrm{EMG}_m(k)$ is outside of the interval $(0.5, 1 - e^{-1})$ then it seems reasonable to modify the estimate to be the nearest value within $(0.5, 1 - e^{-1})$. The problem is that the interval is open and therefore the nearest value does not exist. To rectify this, the estimated value of $\mathrm{EMG}_m(k)$ is rounded to the closest sample quantile within $(0.5, 1 - e^{-1})$. In order to avoid a discontinuous estimate, this rounding procedure is extended to any estimates of $\mathrm{EMG}_m(k)$ that are either smaller than the smallest sample quantile in $(0.5, 1 - e^{-1})$ or larger than the largest sample quantile in $(0.5, 1 - e^{-1})$.

For smaller sample sizes that have only one sample quantile within the interval $(0.5, 1 - e^{-1})$, the discontinuous estimate is more natural. If the continuous version of the estimate were used, then the estimate of k would always be the same for these sample sizes. For sample sizes that are greater than or equal to 15, there are always at least two sample quantiles within $(0.5, 1 - e^{-1})$, so the discontinuous estimate is an attractive option for sample sizes less than 15.

There are a handful of sample sizes that have no sample quantile within $(0.5, 1 - e^{-1})$. For these sample sizes the rounding procedure fails because there is no sample quantile to round to. The sample sizes for which this situation arises are one, two, three, four, and six. When the sample size is six it is reasonable to round to the sample quantile from a sample size of five that is within $(0.5, 1 - e^{-1})$. Choosing an appropriate quantile for sample sizes that are less than or equal to four is not quite

as straightforward. In this case the easiest remedy is to choose a completely arbitrary

quantile for rounding (such as the center of the interval). This will solve the problem

for all but the sample size of one which is ignored.
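The special sample sizes can be checked by enumeration. The sketch below assumes that the sample quantile levels of a sample of size n are i/n for i = 1, ..., n; this convention is an assumption made here, chosen because it reproduces the sample sizes listed in the text. The function name `levels_inside` is illustrative.

```python
import math

UPPER = 1.0 - math.exp(-1.0)  # right end of the open interval (0.5, 1 - e^{-1})

def levels_inside(n):
    """Levels i/n (i = 1..n) that lie strictly inside (0.5, 1 - e^{-1})."""
    # 2*i > n is the exact integer form of i/n > 0.5
    return [i / n for i in range(1, n + 1) if 2 * i > n and i / n < UPPER]

# Sample sizes (up to 200) with no level inside the interval.
empty = [n for n in range(1, 201) if not levels_inside(n)]
print("no level inside:", empty)           # [1, 2, 3, 4, 6]

# Smallest n0 such that every n >= n0 (up to 200) has at least two levels inside.
n0 = max(n for n in range(1, 201) if len(levels_inside(n)) < 2) + 1
print("at least two levels from n =", n0)  # 15
```

Under this convention the enumeration agrees with both statements in the text: the sizes with no usable quantile are one, two, three, four, and six, and every size from 15 upward has at least two.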

The rounding procedure just described will not affect consistency and will not cause the estimates to be invalid. By construction of the rounding procedure it is clear that the estimate for k must be valid. Next the estimate of λ is investigated. To show that the estimate for λ is always valid it is noted that the sample mean for a sample size of two or greater will (with probability one) be greater than the smallest sample value because the EMG distribution has a continuous pdf that is positive for all real values. Since the quantiles are estimated using linear interpolation and since the EMG cdf is monotonically increasing, it is clear that the estimated quantile of the mean (given by $\mu + \lambda^{-1}$) and the estimated quantile of µ will not be the same if both quantiles are estimated using the same value of k. This shows that the estimate of λ will be valid and therefore the estimate of σ will also be valid.

One slight modification remains to be made. The estimate for k is often not accurate, so if at all possible this estimate should be avoided for estimating anything other than a quantile. In the new estimation procedure, the estimate of k is used in such a manner to estimate σ. An alternative approach is to estimate µ + σ and µ from the sample quantiles. An estimate for σ can then be obtained by subtracting these two estimates. This alternative estimation procedure may fail if the estimated quantiles for both values are less than or equal to the smallest sample quantile. In this case σ will be estimated to be zero, and this invalid estimate can simply be discarded and replaced with the estimate from the previous estimation method for σ. By using the alternative estimate for σ, a slight gain in estimation accuracy can be achieved.

Several steps of the new estimation procedure require the evaluation of an erfc

term. As was discussed in section 6.4, this term can introduce numerical instability.

Although this was a major problem for the MLE parameter estimation method it is

only a minor issue for the new method. All of the erfc terms in the new method

depend on the value of k which is likely to occur in the interval [0.1, 10] for most

practical applications. If k occurs in this interval then there should be no issue with

the numerical instability of the erfc term. For applications that require values of

k that are far outside of this interval, it may be necessary to use software that can

calculate erfc to a high precision.
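The range claim can be probed in ordinary double precision. The sketch below evaluates the product exp(k²/2)·erfc(k/√2), which appears in the quantile level of µ, directly with the Python standard library. The function name `erfc_term` and the test value k = 40 are illustrative choices, not taken from the thesis.

```python
import math

def erfc_term(k):
    """Directly evaluate exp(k^2 / 2) * erfc(k / sqrt(2)) in double precision."""
    return math.exp(k * k / 2.0) * math.erfc(k / math.sqrt(2.0))

# Inside [0.1, 10] the direct product is finite and lies in (0, 1).
for k in (0.1, 1.0, 10.0):
    print(k, erfc_term(k))

# Far outside the interval the exponential factor alone exceeds the double range.
try:
    erfc_term(40.0)
    overflowed = False
except OverflowError:
    overflowed = True
print("overflow at k = 40:", overflowed)
```

For values of k this large, a direct evaluation is no longer usable, which is where higher-precision arithmetic (or a scaled form of erfc) would be needed.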

6.3 Summary of Final Parameter Estimation Method

A step-by-step procedure for estimating the parameters of an EMG distribution from a random sample $S = \{s_1, \ldots, s_n\}$ using the final form of the new method is given below for sample sizes greater than or equal to 15:

1. Calculate mq, where mq is the sample quantile of the mean of S.

2. Calculate minsq, where minsq is the smallest sample quantile which is greater than 0.5 and less than $1 - e^{-1}$.

3. Calculate maxsq, where maxsq is the largest sample quantile which is greater than 0.5 and less than $1 - e^{-1}$.

4. If mq ∉ [minsq, maxsq] then set mq to the closest endpoint of this interval.

5. Estimate k with $k_e$, where $k_e$ is obtained by solving

\[
mq = \frac{1}{2} \times \left[ 1 - e^{\frac{k_e^2}{2} - 1}\, \mathrm{erfc}\!\left( \frac{k_e^2 - 1}{k_e \sqrt{2}} \right) + \mathrm{erf}\!\left( \frac{1}{k_e \sqrt{2}} \right) \right]
\]

6. Calculate µqe, where

\[
\mu qe = \frac{1}{2} \times \left[ 1 - e^{\frac{k_e^2}{2}}\, \mathrm{erfc}\!\left( \frac{k_e}{\sqrt{2}} \right) \right]
\]

7. Estimate µ as the sample quantile of µqe using linear interpolation.

8. Calculate mlqe, where

\[
mlqe = \frac{1}{2} \times \left[ 1 - e^{\frac{k_e^2}{2} - 1}\, \mathrm{erfc}\!\left( \frac{k_e - k_e^{-1}}{\sqrt{2}} \right) + \mathrm{erf}\!\left( \frac{1}{k_e \sqrt{2}} \right) \right]
\]

9. Estimate ml as the sample quantile of mlqe using linear interpolation.

10. Estimate λ with $\lambda_e$, where $\lambda_e$ is obtained by subtracting the estimate for µ from ml and then taking the multiplicative inverse of the result.

11. Calculate msqe, where

\[
msqe = \frac{1}{2} \times \left[ 1 - e^{\frac{k_e^2}{2} - k_e}\, \mathrm{erfc}\!\left( \frac{k_e - 1}{\sqrt{2}} \right) + \mathrm{erf}\!\left( \frac{1}{\sqrt{2}} \right) \right]
\]

12. Estimate ms as the sample quantile of msqe using linear interpolation.

13. Estimate σ by subtracting the estimate for µ from ms.

14. If the estimate for σ is zero then estimate σ by $k_e$ divided by the estimate for λ.

For sample sizes less than 15, a few slight modifications are necessary. First, the rounding procedure must be changed to only round mq if mq occurs outside of $(0.5, 1 - e^{-1})$. Second, the sample quantile to round to when the sample size is six should be set to the sample quantile to round to when the sample size is five. Third, the sample quantile to round to for sample sizes of two, three, and four should be set to the center of $(0.5, 1 - e^{-1})$. Lastly, an error message should be returned when the sample size is one because this case is not supported.
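The fourteen steps for sample sizes of at least 15 can be sketched in code. This is a minimal illustration under stated assumptions, not the thesis implementation: the i-th order statistic is assigned the level i/n, levels and values are linearly interpolated between order statistics, $k_e$ is found by bisection on an assumed bracket [0.001, 30] (relying on the quantile level of the mean decreasing in k), and all function names are hypothetical. The modifications for sample sizes below 15 are not covered.

```python
import bisect
import math
import random

SQRT2 = math.sqrt(2.0)
UPPER = 1.0 - math.exp(-1.0)

def emg_mu(k):   # quantile level of mu
    return 0.5 * (1.0 - math.exp(k * k / 2.0) * math.erfc(k / SQRT2))

def emg_m(k):    # quantile level of the mean mu + 1/lambda
    return 0.5 * (1.0 - math.exp(k * k / 2.0 - 1.0) * math.erfc((k - 1.0 / k) / SQRT2)
                  + math.erf(1.0 / (k * SQRT2)))

def emg_ms(k):   # quantile level of mu + sigma
    return 0.5 * (1.0 - math.exp(k * k / 2.0 - k) * math.erfc((k - 1.0) / SQRT2)
                  + math.erf(1.0 / SQRT2))

def level_of(xs, v):
    """Level of value v in sorted xs (assumption: i-th order statistic has level i/n)."""
    n = len(xs)
    if v <= xs[0]:
        return 1.0 / n
    if v >= xs[-1]:
        return 1.0
    i = bisect.bisect_right(xs, v)            # xs[i-1] <= v < xs[i]
    return (i + (v - xs[i - 1]) / (xs[i] - xs[i - 1])) / n

def value_at(xs, p):
    """Sample quantile at level p by linear interpolation, clamped at the extremes."""
    n = len(xs)
    pos = p * n                               # fractional 1-based rank
    if pos <= 1.0:
        return xs[0]
    if pos >= n:
        return xs[-1]
    i = int(pos)
    return xs[i - 1] + (pos - i) * (xs[i] - xs[i - 1])

def solve_k(mq, lo=1e-3, hi=30.0):
    """Bisection for emg_m(k) = mq, assuming emg_m decreases in k on [lo, hi]."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if emg_m(mid) > mq:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def fit_emg(sample):
    xs = sorted(sample)
    n = len(xs)
    if n < 15:
        raise ValueError("this sketch only covers sample sizes >= 15")
    mq = level_of(xs, sum(xs) / n)                                   # step 1
    inside = [i / n for i in range(1, n + 1) if 2 * i > n and i / n < UPPER]
    mq = min(max(mq, inside[0]), inside[-1])                         # steps 2-4
    ke = solve_k(mq)                                                 # step 5
    mu = value_at(xs, emg_mu(ke))                                    # steps 6-7
    ml = value_at(xs, emg_m(ke))                                     # steps 8-9
    lam = 1.0 / (ml - mu)                                            # step 10
    sigma = value_at(xs, emg_ms(ke)) - mu                            # steps 11-13
    if sigma == 0.0:
        sigma = ke / lam                                             # step 14
    return mu, sigma, lam, ke

# Usage on synthetic data: Gaussian(1, 0.5) plus Exponential(rate 1), so k = 0.5.
random.seed(0)
data = [random.gauss(1.0, 0.5) + random.expovariate(1.0) for _ in range(2000)]
print(fit_emg(data))
```

The fallback in step 14 is what keeps the σ estimate strictly positive when both interpolated quantiles collapse onto the smallest sample value.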

6.4 Currently Available Methods

Three parameter estimation methods are considered: the maximum likelihood estimation (MLE) method, the method introduced in [30] (henceforth referred to as the Silver method), and the method of moments. Each method is described in detail in the following sections.

6.5 Maximum Likelihood Estimation Method

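Given a random sample $S = \{s_1, \ldots, s_n\}$, the MLE method for an EMG distribution returns the estimated values $\mu_e$, $\sigma_e$, $\lambda_e$ which maximize the following expression:

\[
LE(S; \mu, \sigma, \lambda) = \prod_{i=1}^{n} \frac{\lambda}{2}\, e^{\lambda \left( \frac{\lambda \sigma^2}{2} + \mu - s_i \right)}\, \mathrm{erfc}\!\left( \left( \frac{\sigma}{\sqrt{2}} \right) \left( \lambda + \frac{\mu - s_i}{\sigma^2} \right) \right) \qquad (6.1)
\]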

This method is proven to be consistent for an EMG distribution [27].

A major problem with the MLE method is the subtractive cancellation in the exponential term and the erfc term in (6.1). The exponential term is subject to subtractive cancellation when

\[
s_i \approx \frac{\lambda \sigma^2}{2} + \mu
\]

and the erfc term is subject to subtractive cancellation when the input to the erfc term becomes large and positive, due to the fact that erfc(x) = 1 − erf(x) and the fact that erf(x) ≈ 1 when x is large and positive. As an example of subtractive cancellation, erfc(35) < $10^{-530}$, which would underflow at many machine precisions.

Due to this issue the MLE is not a viable method for estimating the parameters

of an EMG distribution in general. If even a single underflow occurs the entire MLE

expression will reduce to zero causing the method to fail.
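The failure mode can be reproduced in double precision. The sketch below evaluates one factor of (6.1) directly and then forms the product over a small sample. The trial parameter values (µ = 1, σ = 0.01, λ = 1) and the three data points are illustrative assumptions, chosen so that one erfc argument exceeds 35; the function names are hypothetical.

```python
import math

def emg_pdf(s, mu, sigma, lam):
    """One factor of (6.1), evaluated directly in double precision."""
    expo = math.exp(lam * (lam * sigma ** 2 / 2.0 + mu - s))
    tail = math.erfc((sigma / math.sqrt(2.0)) * (lam + (mu - s) / sigma ** 2))
    return (lam / 2.0) * expo * tail

def likelihood(sample, mu, sigma, lam):
    """Direct product form of (6.1)."""
    prod = 1.0
    for s in sample:
        prod *= emg_pdf(s, mu, sigma, lam)
    return prod

print(math.erfc(35.0))                    # underflows to 0.0
print(emg_pdf(0.0, 0.0, 1.0, 1.0))        # an ordinary, well-behaved factor

# With a small trial sigma, the point s = 0.5 gives an erfc argument near 35.4,
# so that single factor underflows and the whole product becomes zero.
print(likelihood([1.2, 0.5, 2.0], 1.0, 0.01, 1.0))
```

An optimizer exploring parameter space can easily visit such trial values, at which point the likelihood surface is flat at zero and carries no information about which direction improves the fit.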

6.6 The Silver Method

The Silver method [30] circumvents the numerical stability issues by replacing the EMG pdf (3.2) with a saddlepoint approximation [10] and then approximating the solution of the maximum likelihood expression using the Nelder-Mead simplex algorithm [24]. The saddlepoint approximation is a simplified approximation to a distribution that uses information from several moments of the distribution. By replacing the EMG pdf with a simplified saddlepoint approximation, numerical stability is more easily controlled, and subtractive cancellation can be avoided in the newly formed approximation to the MLE expression. The Nelder-Mead simplex algorithm is then used to estimate the optimal value of this new expression. A software implementation of this method is the normexp.fit function (using the saddle method option) from the limma package version 3.6.9 for the R programming language [26].

6.7 Method of Moments

The method of moments [6] estimates the parameters of a distribution by setting

the moment equations of a random variable equal to the sample moments and then

solving the system of equations for the parameters. The number of equations needed

is equal to the number of parameters that are needed to describe the random variable.

For an EMG distribution the first three moment equations can be used to solve for

µ, σ, and λ. These three equations are given by

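\[
\begin{aligned}
m_1 &= \mu_e + \lambda_e^{-1} \\
m_2 &= \sigma_e^2 + \mu_e^2 + 2 \mu_e \lambda_e^{-1} + 2 \lambda_e^{-2} \\
m_3 &= 3 \sigma_e^2 \mu_e + 3 \sigma_e^2 \lambda_e^{-1} + \mu_e^3 + 3 \mu_e^2 \lambda_e^{-1} + 6 \mu_e \lambda_e^{-2} + 6 \lambda_e^{-3},
\end{aligned}
\]

where $m_1$, $m_2$, and $m_3$ are the first three sample moments.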
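The three moment equations can be solved in closed form by passing to central moments. This route is a standard identity rather than something spelled out in the text: the variance equals σ² + λ⁻² and the third central moment equals 2λ⁻³, since the Gaussian part contributes nothing to the third central moment. The sketch below uses hypothetical function names and returns `None` when no valid real solution exists.

```python
import math

def solve_moments(m1, m2, m3):
    """Solve the three raw-moment equations for (mu, sigma, lambda).

    Returns None when the solution would violate lambda > 0 or sigma > 0.
    """
    var = m2 - m1 ** 2                         # equals sigma^2 + 1/lambda^2
    k3 = m3 - 3.0 * m1 * m2 + 2.0 * m1 ** 3    # third central moment, 2/lambda^3
    if k3 <= 0.0:
        return None                            # lambda would not be positive
    lam = (2.0 / k3) ** (1.0 / 3.0)
    sigma_sq = var - 1.0 / lam ** 2
    if sigma_sq <= 0.0:
        return None                            # sigma would be zero or imaginary
    return m1 - 1.0 / lam, math.sqrt(sigma_sq), lam

def sample_moments(sample):
    """First three raw sample moments."""
    n = len(sample)
    return tuple(sum(s ** j for s in sample) / n for j in (1, 2, 3))

# Exact moments for mu = 1, sigma = 2, lambda = 2 recover approximately (1, 2, 2).
print(solve_moments(1.5, 6.5, 22.75))
# A sample with zero skewness has no valid solution.
print(solve_moments(0.0, 1.0, 0.0))
```

The two `None` branches correspond to the cases in which the method of moments cannot return valid parameter values, for example when the sample skewness is not positive.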

7. Comparison of Methods on Synthetic Data

To determine its usefulness the method described in the previous section was

applied to actual experimental data as well as synthetic data.

7.0.1 Synthetic Data

The new method was compared to the Silver method and the method of moments.

Synthetic data were generated for three different EMG distributions, each having

unique values of k (k = 0.1, 4, and 10). For each EMG distribution, 100 random

samples were generated for each of seven possible sample sizes. Each of the three

methods was then applied to the generated synthetic data. The average and standard

deviations of the estimates for each parameter along with the number of times that

valid parameters were not returned (denoted as “Fails”) were calculated. In order to

quickly compare results a goodness-of-fit metric is defined to show how closely the

estimates resembled the actual values. This metric is defined to be the largest of the

percent errors for the three parameters (denoted as “Error”). The results are given

in Tables 7.1 - 7.3.
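The “Error” metric can be expressed in a few lines. This sketch assumes that the metric is computed from the averaged estimates reported in the tables; the function name `max_percent_error` is illustrative.

```python
def max_percent_error(estimates, actual):
    """Largest percent error across the parameters (the "Error" column)."""
    return max(abs(e - a) / abs(a) * 100.0 for e, a in zip(estimates, actual))

# Silver method, k = 4, N = 1000: averages (0.6, 1.9, 1.2), actual (1, 2, 2).
print(max_percent_error((0.6, 1.9, 1.2), (1.0, 2.0, 2.0)))  # about 40
```

The result is about 40, against 42% in the corresponding table row; the small gap is presumably due to the tabulated averages being rounded to one decimal place.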

The performance of each method over the synthetic data samples is dependent on

the value of k. For all three methods, the parameter estimates become less accurate

as k approaches the higher end of the range defined in [0.1, 10]. The accuracy of the

method of moments (when it manages to return valid parameter values) appears to

be very similar to the accuracy of the new method for values of k that are in the

higher end of [0.1, 10]. The Silver method appears to be superior to the other two

methods when k is at the lower end of [0.1, 10] and the new method appears to be

superior to the other two methods when k is at the higher end of [0.1, 10].

The results for k = 4 (Table 7.3) suggest that the Silver method is not converging

to the correct parameter estimates. The parameter estimates for a sample size of

1,000 using the Silver method are µe = 0.6, σe = 1.9, and λe = 1.2 and the standard

deviation of each estimate is less than 0.15. This implies that the estimate for λ

(which is no greater than 1.25 when rounding is taken into account) is at least five

standard deviations away from the actual value. Since the standard error in the

estimates is so close to zero, it seems likely that the parameter estimates from the

Silver method are not consistent for this set of parameter values. This observation

casts doubt on the consistency of the Silver method.
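The five-standard-deviation figure follows from simple arithmetic, using the actual value λ = 2, the rounding bound of 1.25 on the estimate, and the bound of 0.15 on its standard deviation:

\[
\frac{\lambda - \lambda_e}{\mathrm{sd}(\lambda_e)} \;\ge\; \frac{2 - 1.25}{0.15} \;=\; 5 .
\]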

A similar analysis on the estimate for µ when k = 10 (Table 7.3) shows that the Silver method’s estimate is at least four standard deviations away from the actual value at a sample size of 1,000. This analysis is less convincing, however, because the standard deviation of the estimate for λ is not close to zero. To investigate further, the test was rerun using a sample size of 10,000. At this sample size all standard deviations were less than 0.15 and all three parameter estimates were at least seven standard deviations away from the actual parameter values (the estimate for λ was at least 29 standard deviations away from the actual parameter value). This result, combined with the result from the previous paragraph, seems to imply that the Silver method is not consistent for large values of k.

New Method

N       Avg (µe, σe, λe)    Sd (µe, σe, λe)     Fails (Error)

(µ = 1, σ = 0.1, λ = 1), k = 0.1
15      (1.1,0.30,1.6)      (0.1,0.21,0.7)      0 (205%)
25      (1.1,0.29,1.5)      (0.1,0.17,0.4)      0 (190%)
50      (1.1,0.23,1.2)      (0.1,0.13,0.2)      0 (135%)
100     (1.1,0.15,1.1)      (0.1,0.08,0.1)      0 (53%)
200     (1.0,0.15,1.1)      (0.1,0.06,0.1)      0 (54%)
500     (1.0,0.12,1.0)      (0.1,0.06,0.1)      0 (21%)
1000    (1.0,0.10,1.0)      (0.1,0.05,0.1)      0 (4%)

(µ = 1, σ = 2, λ = 2), k = 4
15      (-0.6,1.6,0.6)      (0.7,0.7,0.2)       0 (157%)
25      (-0.1,1.7,0.7)      (0.7,0.5,0.3)       0 (106%)
50      (0.2,1.7,0.7)       (0.4,0.4,0.2)       0 (83%)
100     (0.3,1.8,0.9)       (0.4,0.3,0.2)       0 (67%)
200     (0.6,1.9,1.1)       (0.3,0.2,0.3)       0 (43%)
500     (0.8,1.9,1.5)       (0.3,0.2,0.4)       0 (27%)
1000    (0.8,1.9,1.7)       (0.3,0.1,0.6)       0 (16%)

(µ = 1, σ = 5, λ = 2), k = 10
15      (-3.1,4.2,0.2)      (1.8,1.7,0.1)       0 (412%)
25      (-2.1,4.0,0.3)      (1.7,1.3,0.1)       0 (311%)
50      (-1.7,4.2,0.3)      (1.2,0.9,0.1)       0 (272%)
100     (-1.1,4.3,0.4)      (0.9,0.7,0.1)       0 (211%)
200     (-0.6,4.5,0.5)      (0.8,0.6,0.1)       0 (165%)
500     (-0.2,4.6,0.6)      (0.7,0.4,0.2)       0 (124%)
1000    (0.2,4.8,0.8)       (0.6,0.3,0.2)       0 (83%)

Table 7.1: Synthetic data results for the new method

Method of Moments

N       Avg (µe, σe, λe)    Sd (µe, σe, λe)     Fails (Error)

(µ = 1, σ = 0.1, λ = 1), k = 0.1
15      (1.2,0.47,1.7)      (0.2,0.19,0.8)      10 (366%)
25      (1.2,0.46,1.5)      (0.2,0.17,0.6)      26 (363%)
50      (1.2,0.42,1.3)      (0.1,0.14,0.3)      20 (321%)
100     (1.1,0.36,1.2)      (0.1,0.12,0.2)      25 (257%)
200     (1.1,0.34,1.1)      (0.1,0.12,0.1)      24 (239%)
500     (1.1,0.28,1.1)      (0.1,0.10,0.1)      37 (181%)
1000    (1.0,0.25,1.1)      (0.0,0.08,0.1)      42 (147%)

(µ = 1, σ = 2, λ = 2), k = 4
15      (0.2,1.5,1.1)       (0.7,0.4,0.6)       36 (76%)
25      (0.6,1.8,1.2)       (0.5,0.3,0.8)       45 (39%)
50      (0.6,1.8,1.3)       (0.4,0.3,1.0)       49 (41%)
100     (0.6,1.8,1.4)       (0.3,0.2,0.7)       33 (37%)
200     (0.7,1.9,1.4)       (0.3,0.1,0.6)       46 (30%)
500     (0.8,1.9,1.5)       (0.2,0.1,0.5)       39 (26%)
1000    (0.8,1.9,1.7)       (0.2,0.1,0.7)       37 (16%)

(µ = 1, σ = 5, λ = 2), k = 10
15      (-1.0,3.8,0.4)      (1.4,0.9,0.2)       53 (203%)
25      (-0.7,4.1,0.6)      (1.2,0.8,0.5)       45 (173%)
50      (-0.6,4.3,0.6)      (1.1,0.5,0.4)       49 (160%)
100     (-0.6,4.3,0.5)      (0.9,0.5,0.2)       61 (160%)
200     (-0.3,4.6,0.7)      (0.7,0.3,0.5)       49 (127%)
500     (-0.1,4.7,0.7)      (0.5,0.2,0.5)       47 (113%)
1000    (0.1,4.8,0.8)       (0.5,0.2,0.4)       54 (92%)

Table 7.2: Synthetic data results for the method of moments

Silver Method

N       Avg (µe, σe, λe)    Sd (µe, σe, λe)     Fails (Error)

(µ = 1, σ = 0.1, λ = 1), k = 0.1
15      (1.1,0.07,1.3)      (0.2,0.14,0.7)      0 (35%)
25      (1.0,0.05,1.1)      (0.1,0.08,0.3)      0 (55%)
50      (1.0,0.07,1.0)      (0.1,0.07,0.2)      0 (29%)
100     (1.0,0.08,1.0)      (0.0,0.05,0.1)      0 (17%)
200     (1.0,0.09,1.0)      (0.0,0.03,0.1)      0 (9%)
500     (1.0,0.09,1.0)      (0.0,0.02,0.1)      0 (7%)
1000    (1.0,0.09,1.0)      (0.0,0.01,0.0)      0 (8%)

(µ = 1, σ = 2, λ = 2), k = 4
15      (-0.3,1.0,8.5)      (1.3,0.9,21.4)      0 (326%)
25      (0.5,1.6,12.5)      (1.0,0.7,21.5)      0 (527%)
50      (0.7,1.8,9.9)       (0.7,0.4,19.0)      0 (393%)
100     (0.7,1.8,3.6)       (0.3,0.2,9.3)       0 (82%)
200     (0.6,1.9,1.2)       (0.2,0.1,0.1)       0 (41%)
500     (0.6,1.9,1.2)       (0.1,0.1,0.1)       0 (42%)
1000    (0.6,1.9,1.2)       (0.1,0.0,0.1)       0 (42%)

(µ = 1, σ = 5, λ = 2), k = 10
15      (-1.5,3.2,5.5)      (2.9,2.1,9.2)       0 (253%)
25      (-1.2,3.5,4.1)      (2.9,1.9,8.1)       0 (222%)
50      (-0.3,4.4,4.3)      (1.7,1.0,7.2)       0 (127%)
100     (-0.3,4.6,2.2)      (0.8,0.5,5.2)       0 (134%)
200     (-0.5,4.6,0.8)      (0.5,0.2,1.9)       0 (148%)
500     (-0.5,4.6,0.6)      (0.4,0.2,1.1)       0 (155%)
1000    (-0.5,4.6,0.6)      (0.3,0.2,0.9)       0 (151%)

Table 7.3: Synthetic data results for the Silver method

8. Conclusion

By using a linear reparameterization of the input to the exponentially modified Gaussian (EMG) cumulative distribution function (cdf), several important properties of the EMG distribution were derived. These properties showed that the product of the exponential distribution parameter λ and the standard deviation parameter σ of the Gaussian distribution, denoted by k = λσ, provides a large amount of information about an EMG distribution. This term can be used, for instance, to determine the relative “shape” of the EMG distribution, to calculate bounds on certain quantiles, and to estimate the parameters of an EMG distribution from sample values.

These properties were applied to a specific practical application of the EMG distribution: Affymetrix microarray preprocessing. The robust multiarray average (RMA) Affymetrix microarray preprocessing technique assumes that the distribution of the perfect match (PM) probes from an Affymetrix microarray at least approximately follows an EMG distribution. Five Affymetrix microarrays were downloaded from a public database and the properties derived in this thesis were used to create two tests for determining whether or not the sample data distributions were likely to follow an EMG distribution. Both tests agreed that the sample data distributions were not likely to follow an EMG distribution. The first test found that the sample quantiles of the sample means were much larger than would be expected for an EMG distribution, while the second test found that the right tails of the sample distributions were much “heavier” than would be expected for an EMG distribution. Using these results a new distribution f(x) was derived for fitting the right tail with pdf f(x) = 3log2(x) that seemed to fit the right tail of the data reasonably well. This fitting further challenges the assumption that the sample data follow an EMG distribution

because f(x) has a significantly “heavier” tail than does an EMG distribution.

The derived properties of the EMG distribution also revealed a new way to estimate the parameters of an EMG distribution from sample data. After a few slight modifications, a practical method for estimating the parameters of an EMG distribution was created that is proven to be consistent. This new method was shown to have distinct advantages over two other EMG parameter estimation methods, the Silver method and the method of moments. Compared to the Silver method [30], the new method (1) is simpler to implement; (2) is proven to be consistent (the Silver method does not appear to be consistent when applied to synthetic data); and (3) appears to more accurately estimate the parameters of EMG distributions with “large” values of k. The Silver method does, however, appear to return more accurate parameter estimates for EMG distributions with “small” values of k. Compared to the method of moments, the new method does not have the problem of returning imaginary parameter estimates. Overall the new method appears to be most useful for EMG distributions that have “large” values of k and the Silver method appears to be most useful for EMG distributions that have “small” values of k.

By better understanding the EMG distribution it was possible not only to adequately answer the practical problem being considered (the distribution of the PM probes from the five Affymetrix microarrays) but also to gain insight into a completely different application area (parameter estimation). Due to the nature of the derived properties it was further possible to show that the parameter estimation method that was created is consistent and to show how to determine the accuracy of parameter estimation based on the “shape” of the EMG distribution. This type of analysis was not completed for the Silver method, most likely because the numerical approximation techniques involved provided no actual insight into why the method may or may not be working. From the synthetic data trials given in the original publication

of the Silver method [30] it would seem that the Silver method would likely have no issues with consistency in practice. With the knowledge gained from the properties derived in this thesis it was possible to challenge this “reasonable” assumption. Any application that involves the EMG distribution is likely to benefit from a better understanding of its properties.

Appendix A. Derivation of pdf and cdf

A.1 Derivation of the Probability Density Function and the Cumulative Distribution Function

From the definition of the EMG distribution the cumulative distribution function

can be written as:

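\[
\begin{aligned}
\mathrm{EMG}(c; \mu, \sigma, \lambda) &= P\{E + G \le c\} \\
&= \int_{0}^{\infty} \int_{-\infty}^{c-x} \left( \lambda e^{-\lambda x} \right) \left( \frac{1}{\sqrt{2 \pi \sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}} \right) dy\, dx \qquad \text{(A.1)} \\
&= \int_{-\infty}^{c} \int_{0}^{c-y} \left( \lambda e^{-\lambda x} \right) \left( \frac{1}{\sqrt{2 \pi \sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}} \right) dx\, dy \qquad \text{(A.2)}
\end{aligned}
\]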

Integrating (A.2) yields

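\[
\begin{aligned}
\mathrm{EMG}(c; \mu, \sigma, \lambda) &= \int_{-\infty}^{c} \left( -e^{-\lambda (c-y)} + 1 \right) \left( \frac{1}{\sqrt{2 \pi \sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}} \right) dy \\
&= \int_{-\infty}^{c} \left( \frac{1}{\sqrt{2 \pi \sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}} \right) dy - \int_{-\infty}^{c} \left( \frac{1}{\sqrt{2 \pi \sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2} - \lambda (c-y)} \right) dy
\end{aligned}
\]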

The second integral can be simplified using the following integral from [15]:

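\[
\int_{0}^{\infty} e^{-\frac{x^2}{4\beta} - \gamma x}\, dx = \sqrt{\pi \beta}\, e^{\beta \gamma^2} \left( 1 - \mathrm{erf}\!\left( \gamma \sqrt{\beta} \right) \right) \qquad \text{(A.3)}
\]

where

\[
\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^2}\, dt
\]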

After some algebra and simplification the cumulative distribution function reduces

to:

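\[
\mathrm{EMG}(c; \mu, \sigma, \lambda) = \frac{1}{2} \times \left[ 1 - e^{\lambda \left( \frac{\lambda \sigma^2}{2} + \mu - c \right)}\, \mathrm{erfc}\!\left( \frac{\sigma}{\sqrt{2}} \left( \lambda + \frac{\mu - c}{\sigma^2} \right) \right) + \mathrm{erf}\!\left( \frac{1}{\sqrt{2}\, \sigma} (c - \mu) \right) \right]
\]

where

\[
\mathrm{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_{x}^{\infty} e^{-t^2}\, dt = 1 - \mathrm{erf}(x)
\]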

Using (A.1) and the fact that the probability density function is the derivative of

the cumulative distribution function it follows that:

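\[
\begin{aligned}
\mathrm{emg}(c; \mu, \sigma, \lambda) &= \frac{d}{dc}\, \mathrm{EMG}(c; \mu, \sigma, \lambda) \\
&= \frac{d}{dc} \int_{0}^{\infty} \int_{-\infty}^{c-x} \left( \lambda e^{-\lambda x} \right) \left( \frac{1}{\sqrt{2 \pi \sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}} \right) dy\, dx \\
&= \int_{0}^{\infty} \left( \lambda e^{-\lambda x} \right) \left( \frac{1}{\sqrt{2 \pi \sigma^2}}\, e^{-\frac{(c-x-\mu)^2}{2\sigma^2}} \right) dx
\end{aligned}
\]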

After using (A.3) and doing some algebra the probability density function simplifies

to:

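\[
\mathrm{emg}(c; \mu, \sigma, \lambda) = \frac{\lambda}{2}\, e^{\lambda \left( \frac{\lambda \sigma^2}{2} + \mu - c \right)}\, \mathrm{erfc}\!\left( \left( \frac{\sigma}{\sqrt{2}} \right) \left( \lambda + \frac{\mu - c}{\sigma^2} \right) \right)
\]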
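The closed forms above can be cross-checked numerically: the central-difference slope of the cdf should match the pdf. The sketch below uses only the Python standard library; the function names, the test point c = 0.5, and the step size h are illustrative choices.

```python
import math

SQRT2 = math.sqrt(2.0)

def emg_cdf(c, mu, sigma, lam):
    """Closed-form EMG cdf from this appendix."""
    expo = math.exp(lam * (lam * sigma ** 2 / 2.0 + mu - c))
    tail = math.erfc((sigma / SQRT2) * (lam + (mu - c) / sigma ** 2))
    return 0.5 * (1.0 - expo * tail + math.erf((c - mu) / (SQRT2 * sigma)))

def emg_pdf(c, mu, sigma, lam):
    """Closed-form EMG pdf from this appendix."""
    expo = math.exp(lam * (lam * sigma ** 2 / 2.0 + mu - c))
    tail = math.erfc((sigma / SQRT2) * (lam + (mu - c) / sigma ** 2))
    return 0.5 * lam * expo * tail

mu, sigma, lam, c, h = 0.0, 1.0, 1.0, 0.5, 1e-5
slope = (emg_cdf(c + h, mu, sigma, lam) - emg_cdf(c - h, mu, sigma, lam)) / (2.0 * h)
print(slope, emg_pdf(c, mu, sigma, lam))  # the two values should agree closely
```

The agreement reflects the cancellation in the derivative: the Gaussian terms produced by differentiating the erfc and erf parts offset each other, leaving only the pdf expression.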

Bibliography

[1] Affymetrix Website. GeneChip Overview: Activity #2 - Structure & Function of GeneChip Microarrays. http://media.affymetrix.com/about affymetrix/outreach/lesson plan/...downloads/student manual activities/activity2/activity2 structure function.pdf Accessed on: March 19th, 2011.

[2] Alberts, B. et al. (2002). Molecular Biology of the Cell (4th ed). Garland Science; New York, New York.

[3] Barber, W.E. and Carr, P.W. (1981). Graphical method for obtaining retention time and number of theoretical plates from tailed chromatographic peaks. Anal. Chem. 53 1939–1942.

[4] Bioconductor. affy [Software Package]: methods for affymetrix oligonucleotide arrays. http://www.bioconductor.org/help/bioc-views/2.5/bioc/html/affy.html Accessed on: November 29th, 2010.

[5] Bishop J. et al. (2008). Kinetics of Multiplex Hybridization: Mechanisms and Implications. Biophysical Journal 94 1726–1734.

[6] Breiman, L. (1973). Statistics: With a View Towards Applications. Houghton Mifflin Company; Boston, MA.

[7] Broad Institute Website. Cancer Program Data Sets. http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi Accessed on: November 29th, 2010.

[8] Chagovetz A. and Blair S. (2009). Real-time DNA microarrays: reality check. Biochemical Society Transactions 37(Pt 2) 471–475.

[9] Columbia University. DNA Microarrays in Health Care and Drug Discovery. http://www.columbia.edu/ bo8/undergraduate research/projects/...sahil mehta project/work.htm Accessed on: March 16th, 2011.

[10] Daniels, H.E. (1954). Saddlepoint approximations in statistics. Annals of Mathematical Statistics 25 631–650.

[11] Felinger, A. (2010). Estimation of chromatographic peak shape parameters in fourier domain. Talanta doi:10.1016/j.talanta.2010.10.001

[12] Foley, J.P. (1987). Equations for chromatographic peak modeling and calcu-lation of peak area. Anal. Chem. 59 1984–1987.

[13] GeceMexicogecemexico.comAccessed on: March 15th, 2011

[14] Golubev, A. (2010). Exponentially modified Gaussian (EMG) relevance todistributions related to cell proliferation and differentiation. Journal of TheoreticalBiology 262(2) 257–266.

[15] Gradshteyn, I.S. and Ryzhik, I.M. (1980). Table of Integrals, Series and Products: Corrected and Enlarged Edition (2nd ed). Academic Press; Orlando, Florida.

[16] Gunzert-Marx, K. et al. (2008). Secondary beam fragments produced by 200 MeV u⁻¹ ¹²C ions in water and their dose contributions in carbon ion radiotherapy. New Journal of Physics 10 1–21.

[17] Howerton, S.B. and McGuffin, V.L. (2003). Thermodynamic and kinetic characterization of polycyclic aromatic hydrocarbons in reversed-phase liquid chromatography. Anal. Chem. 75 3539–3548.

[18] Irizarry, R.A. et al. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4(2) 249–264.

[19] Irizarry, R.A. et al. (2006). Comparison of Affymetrix GeneChip expression. Bioinformatics 22(7) 789–794.

[20] Kong, H. et al. (2005). Deconvolution of overlapped peaks based on the exponentially modified Gaussian model in comprehensive two-dimensional gas chromatography. Journal of Chromatography A. 1086 160–164.

[21] Li, S. et al. (2008). A competitive hybridization model predicts probe signal intensity on high density DNA microarrays. Nucleic Acids Research 36(20) 6585–6591.

[22] McGee, M. and Chen, Z. (2006). Parameter estimation for the exponential-normal convolution model for background correction of Affymetrix GeneChip data. Statistical Applications in Genetics and Molecular Biology 5 Article 24.

[23] Naish, P.J. and Hartwell, S. (1988). Exponentially modified Gaussian functions - a good model for chromatographic peaks in isocratic HPLC? Chromatographia 26 285–296.

[24] Nelder, J.A. and Mead, R. (1965). A simplex algorithm for function minimization. Computer Journal 7 308–313.

[25] News Medical (2011). What is Gene Expression? http://www.news-medical.net/health/What-is-Gene-Expression.aspx Accessed on: March 15th, 2011

[26] R Development Core Team. R: A Language and Environment for Statistical Computing. http://www.R-project.org Accessed on: March 1st, 2011

[27] Roussas, G. (2003). Introduction to Probability and Statistical Inference. Academic Press; Orlando, Florida.

[28] Serfling, R.J. (1980). Approximation Theorems of Mathematical Statistics. John Wiley & Sons; New York, New York.

[29] Shao, X. et al. (2004). Extraction of mass spectra and chromatographic profiles from overlapping GC/MS signal with background. Anal. Chem. 76 5143–5148.

[30] Silver, J. et al. (2009). Microarray background correction: maximum likelihood estimation for the normal-exponential convolution model. Biostatistics 10(2) 352–363.

[31] Singh, D. et al. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1 203–209.

[32] Steffen, B. et al. (2005). A new mathematical procedure to evaluate peaks in complex chromatograms. Journal of Chromatography A. 1071 239–246.

[33] Suzuki, S. et al. (2007). Experimental optimization of probe length to increase the sequence specificity of high-density oligonucleotide microarrays. BMC Genomics 8:373.

[34] Therneau, T.M. and Ballman, K.V. (2008). What Does PLIER Really Do? Cancer Informatics 6 423–431.

[35] Vikalo, H. et al. (2008). Modeling and Estimation for Real-Time Microarrays. IEEE Journal of Selected Topics in Signal Processing 2(3) 286–296.

[36] Walsh, S. and Diamond, D. (1995). Non-linear curve fitting using Microsoft Excel Solver. Talanta 42(4) 561–572.

[37] Zakharkin, S. et al. (2005). Sources of variation in Affymetrix microarray experiments. BMC Bioinformatics 6:214.