DATA TRANSFORMATION and NORMALIZATION
Lecture Topic 4
DATA PRE-PROCESSING
• TRANSFORMATION
• NORMALIZATION
• SCALING
DATA TRANSFORMATION
• The difference between raw fluorescence intensities is a meaningless number
• Data are transformed:
– Ratio: allows immediate visualization of relative expression
– Log: brings multiplicative changes onto an additive scale
Why Log 2?
• Differences in expression intensity exist on a multiplicative scale; log transformation brings them onto an additive scale, where a linear model may apply.
• Ex. 4-fold repression = 0.25 (log2 = -2)
• Ex. 4-fold induction = 4 (log2 = 2)
• Ex. 16-fold induction = 16 (log2 = 4)
• Ex. 16-fold repression = 0.0625 (log2 = -4)
• Evens out highly skewed distributions
• Makes variation of intensities independent of absolute magnitude
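The fold-change examples above can be checked directly in Python (standard library only):

```python
import math

# log2 puts fold changes on a symmetric, additive scale:
# k-fold induction maps to +log2(k), k-fold repression to -log2(k)
fold_changes = [0.0625, 0.25, 1.0, 4.0, 16.0]
log2_values = [math.log2(fc) for fc in fold_changes]
print(log2_values)  # [-4.0, -2.0, 0.0, 2.0, 4.0]
```

Note that 16-fold induction (+4) and 16-fold repression (-4) are now equidistant from 0, which they are not on the raw scale (16 vs. 0.0625).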
Log Transformation: Makes the distribution less skewed
[Figure: Histograms of F635 Median (raw intensities, highly skewed) and of log2G (after log transformation, far less skewed); y-axis shows Frequency]
Example 2
[Figure: Histograms of F532 Median (raw intensities) and of log2R (after log transformation); y-axis shows Frequency]
Generalized Log Transform
• Idea (Sapir and Churchill 2000, Rocke and Durbin 2001): X = α + μe^η + ε
• At very low expression levels μe^η is close to 0, hence X is close to α + ε.
– X is normally distributed with mean α and variance σ_ε².
• At very high expression levels X ≈ μe^η.
– X is lognormally distributed with variance σ_η².
– log(X) ≈ log(μ) + η
• At moderate intensities it is a mixture distribution.
• Hence, it is better to use a transformation log(X + c).
The Transform
glog(X) = log( (X − α) + sqrt( (X − α)² + S² ) )
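A minimal sketch of a generalized-log transform of the kind described above; the offset `alpha` and the constant `s` are illustrative values, not estimates from the slides:

```python
import math

# Generalized log: behaves like log(x) for large x, but remains
# defined and well-behaved near zero (alpha and s are illustrative).
def glog(x, alpha=0.0, s=1.0):
    z = x - alpha
    return math.log(z + math.sqrt(z * z + s * s))

print(glog(0.0))     # 0.0: defined even where plain log(x) blows up
print(glog(1000.0))  # close to log(2*1000), i.e. log(x) + log(2)
```

For large x the transform tracks log(x) up to an additive constant, so the multiplicative-to-additive property of the plain log is preserved where it matters.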
Non-parametric Regression: the Loess Method
• LOWESS (= LOESS) is an acronym for LOcally WEighted Scatterplot Smoothing (Cleveland).
• For i = 1 to n, the ith measurement y_i of the response y and the corresponding measurement x_i of the vector x of p predictors are related by
• y_i = g(x_i) + ε_i,
• where g is the regression function and ε_i is a random error.
• Idea: g(x) can be locally approximated by a parametric function.
• Obtained by fitting a regression surface to the data points within a chosen neighborhood of the point x.
LOESS contd…
• In the LOESS (LOWESS) method, weighted least squares is used to fit linear or quadratic functions of the predictors at the centers of neighborhoods.
• The radius of each neighborhood is chosen so that the neighborhood contains a specified percentage of the data points. The fraction of the data in each local neighborhood, called the smoothing parameter, controls the smoothness of the estimated surface.
• Data points in a given local neighborhood are weighted by a smooth decreasing function of their distance from the center of the neighborhood.
Distance metrics used
• Finding the distance between the ith and hth points with 2 predictors:
• Distance between (X_i1, X_i2) and (X_h1, X_h2):
• Generally, Euclidean distance is used:
d_ih = sqrt( (X_i1 − X_h1)² + (X_i2 − X_h2)² )
• Weights are defined by a tri-cube function:
w_i = [1 − (d_i/d_q)³]³  if d_i < d_q
w_i = 0                   otherwise
• The choice of q is between 0 and 1, often between 0.4 and 0.6.
• Large q: smoother, but maybe too smooth
• Small q: too rough
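The tri-cube weighting is simple to sketch (a hypothetical helper, not library code):

```python
# Tri-cube weight for a point at distance d from the center of a
# neighborhood with radius d_q; weight falls smoothly from 1 to 0.
def tricube_weight(d, d_q):
    if d >= d_q:
        return 0.0
    u = d / d_q
    return (1.0 - u ** 3) ** 3

print(tricube_weight(0.0, 1.0))  # 1.0: full weight at the center
print(tricube_weight(0.5, 1.0))  # 0.669921875
print(tricube_weight(1.0, 1.0))  # 0.0: no weight at the edge
```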
Comments on LOESS
• Fitting is done at each point at which the regression surface is to be estimated.
• A faster computational procedure is to perform such local fitting at a selected sample of points and then to blend the local polynomials to obtain a regression surface.
• One can use the LOESS procedure to perform statistical inference provided the errors are i.i.d. normal random variables with mean 0.
• Using iterative reweighting, LOESS can also provide statistical inference when the error distribution is symmetric but not necessarily normal.
• By doing iterative reweighting, you can use the LOESS procedure to perform robust fitting in the presence of outliers in the data.
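To make the local-fitting idea concrete, here is a one-point LOESS evaluation using a local linear fit with tri-cube weights. It is a teaching sketch under simplifying assumptions (one predictor, no iterative reweighting), not a full implementation:

```python
# One-point LOESS sketch: weighted least-squares line fit near x0.
# frac plays the role of the smoothing parameter q from the slides.
def loess_at(x0, xs, ys, frac=0.5):
    n = len(xs)
    k = max(2, int(frac * n))                 # neighborhood size
    dists = sorted(abs(x - x0) for x in xs)
    d_q = dists[k - 1] or 1e-12               # neighborhood radius
    # tri-cube weights, zero outside the neighborhood
    w = [max(0.0, 1 - (abs(x - x0) / d_q) ** 3) ** 3 for x in xs]
    # weighted least squares for y = a + b*x
    sw = sum(w)
    swx = sum(wi * x for wi, x in zip(w, xs))
    swy = sum(wi * y for wi, y in zip(w, ys))
    swxx = sum(wi * x * x for wi, x in zip(w, xs))
    swxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    b = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    a = (swy - b * swx) / sw
    return a + b * x0

xs = [0, 1, 2, 3, 4, 5]
ys = [0.1, 1.0, 2.1, 2.9, 4.2, 5.0]           # roughly y = x
print(loess_at(2.5, xs, ys, frac=0.6))        # ≈ 2.5 for this near-linear data
```

With near-linear data the local fit at x0 = 2.5 lands very close to 2.5, as expected; a production analysis would use an existing LOESS routine rather than this sketch.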
Data “Normalization”
• To biologists, data normalization means "eliminating systematic noise" from the data.
• Noise is systematic variation: experimental variation, human error, variation of scanner technology, etc.
• This is variation in which we are NOT interested.
• We are interested in measuring true biological variation of genes across experiments, throughout time, etc.
• Normalization plays an important role in the earlier stages of microarray data analysis.
• Subsequent analyses are highly dependent on normalization.
• NORMALIZATION: adjusts for any bias which arises from microarray technology rather than biology.
Normalization: An Age-Old Statistical Idea
• Stands for removing bias due to experimental artifacts from the data.
• Stems back to Fisher's (1923) idea in setting up ANOVA.
• There is a thrust to use ANOVA for normalization, but for the most part it is still a stage-wise approach instead of a model taking out all sources of variation at once.
• We will need to look at:
– Spatial correction
– Background correction
– Dye-effect correction
– Within replicate rescaling
– Across replicate rescaling
• Within slide normalization
• Paired slide normalization for dye swap
• Multiple slide normalization
Systematic Experimental Variations
• Scanner properties
• Spatial effects (variations in printing, slides)
• Print-tip effects (spotting variations due to pin geometry)
• Labeling fluctuations of the dyes
• Diverse protocols
• Inconsistent washings
• Arrays produced at different times
• Microarray production errors
• Sample-to-sample fluctuations of the mRNA preparation
• Varying reverse transcription to cDNA
• Varying PCR amplification
• Filter inhomogeneities
• Cross-hybridization within gene families
• Background noise
• Image analysis saturation, spot shape variations
• Artifacts such as fingerprints, scratches, technician's deodorant particles
Spatial Normalization
Need arises from:
1. Washing chip unevenly
2. Inserting chip at an angle
3. Scanner issues
4. Edge effects from evaporation
5. Uneven wear of print-tip after hours of printing
Reference Designs with the use of ratio data has been popular since it helps with spatial normalization.
Model for spatial bias
• S_d(x,y) = g_d(x,y) · C(x,y)
– g_d(x,y): the true signal for dye d
– C(x,y): a complicated function of the spatial bias
• Taking the ratio cancels the common bias term: S_cy3(x,y)/S_cy5(x,y) = g_cy3(x,y)/g_cy5(x,y)
[Figures: Arrays 1–4, shown pre- and post-normalization]
Comments:
• Print-tip normalization is generally a good proxy for spatial effects
• Instead of LOESS one can use SPLINE to estimate the trend to subtract from the raw data.
BACKGROUND CORRECTION
• Idea:
• Signal = True Signal + Background
• So, an attractive idea seems to be to subtract BACKGROUND from the signal to get to the "TRUE" signal.
• The problem is that the actual BACKGROUND in a spot cannot be measured; what is measured is really an "estimate" of the background at places NEAR the spot.
• Criticism: the assumption in these models is that the background is additive.
– OFTEN WE SEE HIGH CORRELATION BETWEEN FOREGROUND AND BACKGROUND.
– GENERAL CONSENSUS THESE DAYS: NOT TO SUBTRACT LOCAL BACKGROUND, BUT POSSIBLY SUBTRACT A GLOBAL BACKGROUND (FROM EMPTY SPOTS OR BUFFERS).
Background Correction: more thoughts
• McClure and Wit (2004) suggest calculating the mean or median of the empty spots and estimating the signal as:
– Signal = max(observed signal − center(empty spots), 0)
• This ensures one never has the problem of negative "corrected signals".
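A sketch of this global background correction; the function name and intensity values are illustrative:

```python
# Subtract a global background estimate (median of the empty spots),
# clipping at zero so corrected signals are never negative.
def correct_signal(observed, empty_spots):
    s = sorted(empty_spots)
    n = len(s)
    center = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    return [max(x - center, 0) for x in observed]

print(correct_signal([100, 52, 40], [45, 50, 55]))  # [50, 2, 0]
```

The third spot, which fell below the empty-spot median, is clipped to 0 rather than reported as a negative intensity.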
Background Correction: Probabilistic Idea
• Irizarry et al. (2003)
• Looks at finding the conditional expectation of the TRUE signal given the observed signal (which is assumed to be the true signal plus noise):
• E(s_i | s_i + b_i)
• Here, s_i is assumed to follow an Exponential distribution with parameter λ.
• b_i is assumed to follow N(μ_e, σ_e²).
• Estimate μ_e and σ_e as the mean and standard deviation of the empty spots.
• Estimate λ by λ̂ = 1/(ȳ − μ̂_e).
Irizarry Approach contd…
• This allows the formula to be approximated by the following, where Φ and φ are the CDF and pdf of the standard normal distribution:
ŝ_i = a_i + σ_e · [ φ(a_i/σ_e) − φ((y_i − a_i)/σ_e) ] / [ Φ(a_i/σ_e) + Φ((y_i − a_i)/σ_e) − 1 ]
where a_i = y_i − μ_e − λσ_e²
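The conditional-expectation formula can be sketched with the standard library; the parameter values below (mu_e, sigma_e, lam) are illustrative, not estimates from real data:

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# E(s | y) when signal s ~ Exponential(lam) and
# background b ~ N(mu_e, sigma_e^2)
def expected_signal(y, mu_e, sigma_e, lam):
    a = y - mu_e - lam * sigma_e ** 2
    num = norm_pdf(a / sigma_e) - norm_pdf((y - a) / sigma_e)
    den = norm_cdf(a / sigma_e) + norm_cdf((y - a) / sigma_e) - 1.0
    return a + sigma_e * num / den

# Far above background, the correction approaches y - mu_e - lam*sigma_e^2:
print(expected_signal(1000.0, mu_e=100.0, sigma_e=10.0, lam=0.001))  # ≈ 899.9
```

Unlike plain background subtraction, this estimate stays positive even for observed intensities near or below the background mean.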
Normalization Approaches
• GLOBAL Normalization (G): Global (ARRAY) Mean or Median.
– NOT USED VERY OFTEN ANYMORE
• Intensity-dependent linear Normalization (L): by least-squares estimation
– AGAIN NOT USED AS MUCH
• Intensity dependent non-linear Normalization (N): Lowess curve (Robust scatter plot smoother)
– Under ideal experimental conditions: M=0 for the selected genes used for normalization
– THE MOST COMMONLY USED IDEA THESE DAYS.
Normalization: Historical Approaches
• Global normalization
– Sum method: Norm. coef. (k_j) =
Σ_{i=1}^{n} (I_1i − B_1) / Σ_{i=1}^{n} (I_2i − B_2)
where I_mi = intensity of gene i on Array m, m = 1, 2
B_m = background intensity on Array m, m = 1, 2
n = number of genes on the array
– Problem: validity of the assumption; stronger signals dominate the summation.
– Median (robust with respect to outliers):
Normalization coefficient (k_j) = median(I_1i − B_1) / median(I_2i − B_2)
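The median method in a few lines of Python; the array values are invented for illustration:

```python
# Median normalization coefficient: ratio of background-corrected
# medians of the two arrays (b1, b2 are per-array backgrounds).
def median(v):
    s = sorted(v)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def median_norm_coef(array1, array2, b1=0.0, b2=0.0):
    return median([x - b1 for x in array1]) / median([x - b2 for x in array2])

a1 = [100, 200, 300, 400, 500]
a2 = [50, 100, 150, 200, 250]
print(median_norm_coef(a1, a2))  # 300 / 150 = 2.0
```

Because the median ignores the tails, one very bright spot on either array barely moves the coefficient, which is the robustness the slide refers to.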
Normalization continued
• Housekeeping gene normalization
– Housekeeping genes are a set of genes whose expression levels are not affected by the treatment.
– The normalization coefficient is the ratio m_C/m_T, where m_C and m_T are the means of the selected housekeeping genes for control and treatment respectively.
– Problem: housekeeping genes sometimes change their expression level. The assumption doesn't hold.
• Trimmed mean normalization (adjusted global method)
– Trim off the 5% highest and lowest extreme values, then globally normalize the data. The normalization coefficient is:
k_i = m_Ci / m_Ti
where m_Ci and m_Ti are the trimmed means for the ith control and treatment respectively.
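A sketch of the trimmed-mean coefficient; the data are invented, and a 10% trim is used so this tiny example actually drops a value from each end (the slides specify 5%):

```python
# Trimmed-mean normalization coefficient (adjusted global method):
# trim a fraction off each end, average the rest, take the ratio.
def trimmed_mean(values, trim=0.05):
    s = sorted(values)
    k = int(len(s) * trim)              # values trimmed from each end
    core = s[k:len(s) - k] if k else s
    return sum(core) / len(core)

control = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]      # 100 is an outlier
treatment = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
k_i = trimmed_mean(control, trim=0.1) / trimmed_mean(treatment, trim=0.1)
print(k_i)  # 5.5 / 11 = 0.5
```

Trimming discards the outlier 100 before averaging, so the coefficient reflects the bulk of the spots rather than the brightest ones.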
Ideal Control Spots that should be on an array
As we saw in the previous slide, there can be many special probes spotted onto an array during its manufacture, collectively called control probes.
These include
• Blanks: places where water or nothing is spotted.
• Buffer: where the buffer solution without DNA is spotted.
• Negative: here there are DNA probes, but they shouldn’t be complementary to any target cDNA.
• Calibration: probes corresponding to DNA put in the hyb mix which should have equal signals in the two channels.
• Ratio: probes corresponding to DNA put in the hyb mix which should have known ratios between the two channels (e.g. 3:1,1:3, 10:1, 1:10).
Normalization Within and Across Conditions
• The Normalization WITHIN conditions is more common
• Idea: we want all the arrays that represent the SAME condition to be comparable.
• Take out the array effect, in other words.
• Many models for this:
– Factorial model (Kerr et al., Wolfinger et al.)
– Location Scale Model (Yang et al)
– Scaling (Affymetrix)
Consider the data to be:
xijk: ith spot, jth color, kth array
Quantile Normalization
Idea
• Ideally “replicate” microarrays should be similar
• In real life they are often NOT identically distributed
• Quantile normalization FORCES the same distribution on all the arrays for the same condition
Mathematical details: Quantile Normalization
• Let {x} represent the matrix of all p spot intensities on the n replicate arrays.
• Here, x_ik is the spot intensity of the ith spot on the kth array (i = 1,…,p; k = 1,…,n).
• Let x_(j) = the vector of the jth smallest spot intensities across the arrays.
• Let xbar_(j) be the mean/median of x_(j).
• The vector of the xbar_(j) represents the compromise distribution. Let {r} be the matrix of row ranks associated with matrix {x}.
• Then the quantile-normalized values are:
x_ik^norm = xbar_(r_ik)
Numerical Example
• Let us consider a situation where we have 5 spots on an array and two replicates for an array (numbers in brackets represent the ranks):
Spot:      1      2     3     4      5
Array 1:   16(5)  0(1)  9(3)  11(4)  7(2)
Array 2:   13(4)  3(1)  5(2)  14(5)  8(3)
• Order each array: Array 1: 0 7 9 11 16; Array 2: 3 5 8 13 14
• Average these, rank by rank: 1.5 6 8.5 12 15
• Replace each rank by the corresponding average. The normalized arrays are:
Array 1: 15 1.5 8.5 12 6.0
Array 2: 12 1.5 6.0 15 8.5
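The worked example can be reproduced with a short quantile-normalization routine (a sketch that assumes no tied ranks):

```python
# Quantile normalization: replace each value by the rank-wise average
# of the sorted arrays, forcing all arrays onto the same distribution.
def quantile_normalize(arrays):
    n = len(arrays[0])
    sorted_cols = [sorted(a) for a in arrays]
    # compromise distribution: mean across arrays at each rank
    means = [sum(col[j] for col in sorted_cols) / len(arrays) for j in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])   # indices by value
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = means[rank]
        result.append(out)
    return result

print(quantile_normalize([[16, 0, 9, 11, 7], [13, 3, 5, 14, 8]]))
# [[15.0, 1.5, 8.5, 12.0, 6.0], [12.0, 1.5, 6.0, 15.0, 8.5]]
```

After normalization both replicates contain exactly the same set of values {1.5, 6, 8.5, 12, 15}, only assigned to different spots according to each array's original ranks.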
Conclusion
• No unique normalization method for the same data. It depends on what kind of experiment you have and what the data look like.
• No absolute criteria for normalization. Basically, the normalized log ratio should be centered around 0.
• Nowadays the focus IS on using Nonparametric Regression methods to remove trend or spatial artifacts from the data
• Quantile normalization (though not liked by BIOLOGISTS) is catching on as well.