Probe analysis and data preprocessing


1. Affymetrix probe level analysis
   1) Normalization: constant, loess, rank invariant, quantile normalization
   2) Expression measure: MAS 4.0, Li-Wong (dChip), MAS 5.0, RMA
   3) Background adjustment: PM-MM, PM only, RMA, GC-RMA

2. Statistical analysis of cDNA array
   1) Image analysis
   2) Normalization
   3) Assess expression level (a case study with Bayesian hierarchical model)
   4) Experimental design: source of variations; calibration and replicate; choice of reference sample; design of two-color array

3. Preprocessing
   1) Data transformation
   2) Filtering (in all platforms)
   3) Missing value imputation (in all platforms)


From experiment to down-stream analysis


Statistical Issues in Microarray Analysis

[Flow diagram with the following boxes: Experimental design; Image analysis; Preprocessing (normalization, filtering, MV imputation); Data visualization; Identify differentially expressed genes; Regulatory network; Clustering; Classification; Pathway analysis; Integrative analysis & meta-analysis]


Data Preprocessing

Preliminary analyses extract and summarize information from the microarray experiments.
• These steps are not themselves biological discovery.
• They prepare the data for meaningful downstream analyses with targeted biological purposes (e.g. DE gene detection, classification, pathway analysis).

From scanned images
⇒ Image analysis (extract intensity values from the images)
⇒ Probe analysis (generate the data matrix of expression profiles)
⇒ Preprocessing (data transformation, gene filtering and missing value imputation)


1. Affymetrix probe level analysis


Overview of the technology

[Figure: hybridization, from Affymetrix Inc.]


Array Design

• 25-mer unique oligo
• Mismatch in the middle nucleotide
• Multiple probes (11~16) for each gene

(Figure from Affymetrix Inc.)


Array Probe Level Analysis

Steps: background adjustment, normalization, summarization.

• Give an expression measure for each probe set on each array (how to pool the information of 16 probes?).
• The result will greatly affect subsequent analysis (e.g. clustering and classification). If the probe-level data are not modeled properly => “Garbage in, garbage out”.

We will leave the discussion of background adjustment to the end because it involves the newest and most technical advances.


1.1 Normalization

The need for normalization:

           array1     array2
gene1      3308       4947.5
gene2      2334       3155.5
gene3      2518       3738
gene4      8882.5     18937
gene5      5041       12956.5
gene6      7314.5     19013.5
gene7      3508.5     8164
gene8      2183       5121.5
gene9      4790       8082
gene10     1645.5     1794.5
gene11     1772       1963
gene12     1802.5     2186.5
gene13     14846      35811
gene14     9986       25293
gene15     11640.5    21508
gene16     3860       6530
average    5339.5     11200.09

The intensities of array 2 are intrinsically larger than those of array 1 (about two-fold).


1.1. Normalization

Reasons:
1. Different labeling efficiency.
2. Different hybridization time or hybridization condition.
3. Different scanning sensitivity.
4. …

\[
\begin{aligned}
x_{g1} &= f_1(\theta_{g1}) && \text{(array 1)}\\
x_{g2} &= f_2(\theta_{g2}) && \text{(array 2)}\\
&\;\;\vdots\\
x_{gS} &= f_S(\theta_{gS}) && \text{(array } S\text{)}
\end{aligned}
\]

$x_{gs}$: the observed intensity
$\theta_{gs}$: the underlying expression level

Normalization is needed in any microarray platform (including Affy & cDNA).


1.1. Normalization

Constant scaling

• Distributions on each array are scaled to have identical mean.
• Applied in MAS 4.0 and MAS 5.0, but they perform the scaling after computing the expression measure.

Suppose array 1 is the reference array:
\[
x'_{gs} = x_{gs} \cdot \frac{\bar{x}_1}{\bar{x}_s}
\]
where $\bar{x}_s$ is the mean intensity of array $s$.
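As a rough illustration of constant scaling, the following Python sketch rescales array 2 of the 16-gene example so that its mean matches that of array 1 (taken as the reference). The function names are illustrative only and do not come from MAS 4.0/5.0 or any other package.

```python
# Intensities of gene1..gene16 from the two-array example.
array1 = [3308, 2334, 2518, 8882.5, 5041, 7314.5, 3508.5, 2183,
          4790, 1645.5, 1772, 1802.5, 14846, 9986, 11640.5, 3860]
array2 = [4947.5, 3155.5, 3738, 18937, 12956.5, 19013.5, 8164, 5121.5,
          8082, 1794.5, 1963, 2186.5, 35811, 25293, 21508, 6530]


def mean(values):
    return sum(values) / len(values)


def constant_scale(reference, target):
    """Scale `target` so that its mean equals the mean of `reference`.

    Returns the scaling factor (mean(reference) / mean(target)) and the
    scaled intensities.
    """
    factor = mean(reference) / mean(target)
    return factor, [factor * x for x in target]


factor, array2_scaled = constant_scale(array1, array2)
print("scaling factor:", round(factor, 4))
print("means after scaling:", mean(array1), round(mean(array2_scaled), 4))
```

The factor is a little under 0.5, in line with the observation that array 2 is about two-fold brighter than array 1.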

1.1. Normalization

Constant scaling: Underlying reasoning

Assumption (with $k_s$ an array-specific scaling constant):
\[
\begin{aligned}
x_{g1} &= f_1(\theta_{g1}) = k_1\,\theta_{g1} && \text{(array 1)}\\
x_{g2} &= f_2(\theta_{g2}) = k_2\,\theta_{g2} && \text{(array 2)}\\
&\;\;\vdots\\
x_{gS} &= f_S(\theta_{gS}) = k_S\,\theta_{gS} && \text{(array } S\text{)}
\end{aligned}
\]

Suppose array 1 is the reference array:
\[
x'_{gs} = \frac{k_1}{k_s}\, x_{gs}, \qquad s = 2, \ldots, S.
\]

$k_1 / k_s$ is not estimable, but it can be estimated as $\bar{x}_1 / \bar{x}_s$ if the overall expression levels of the arrays are roughly the same ($\bar{\theta}_1 \approx \bar{\theta}_s$).


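1.1 Normalization

M-A plot

\[
M_g = \log\frac{x_{gs}}{x_{g1}}, \qquad
A_g = \frac{\log\left(x_{gs}\, x_{g1}\right)}{2}
\]

[Figure: M-A plot, with A on the horizontal axis, M on the vertical axis and the reference line M = 0]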


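1.1 Normalization

The M-A plot shows the need for non-linear normalization: the normalization factor is a function of the expression level.

Under the constant-scaling assumption $x_{gs} = k_s\,\theta_{gs}$, for non-differentially expressed genes ($\theta_{gs} = \theta_{g1}$):
\[
M_g = \log\frac{x_{gs}}{x_{g1}}
    = \log\frac{k_s\,\theta_{gs}}{k_1\,\theta_{g1}}
    = \log\frac{k_s}{k_1}
    = \text{constant}
\]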


1.1 Normalization

Replicate arrays: the same pool of sample is applied.

$\hat{M} = \hat{f}(A)$: fit by the ‘Lowess’ function in S-Plus.

Normalized log ratio:
\[
\tilde{M} = M - \hat{M}
\]
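The fit-and-subtract step can be sketched in Python as follows. This is not the S-Plus Lowess routine: it is a plain tricube-weighted local linear fit without robustness iterations, written only to show how a fitted curve over A is subtracted from M. The base-2 logarithm, the `span` parameter and all function names are assumptions made for the sketch.

```python
import math


def ma_values(reference, target):
    """Return (M, A): M = log2(target/reference), A = average log2 intensity."""
    m = [math.log2(t / r) for r, t in zip(reference, target)]
    a = [(math.log2(t) + math.log2(r)) / 2 for r, t in zip(reference, target)]
    return m, a


def local_linear_fit(a, m, span=0.5):
    """Tricube-weighted local linear fit of M on A, evaluated at every A."""
    n = len(a)
    k = min(n, max(3, math.ceil(span * n)))  # neighbours used in each local fit
    fitted = []
    for a0 in a:
        dists = sorted(abs(ai - a0) for ai in a)
        h = dists[k - 1] or 1e-12            # bandwidth: distance to k-th neighbour
        w = [max(0.0, 1 - (abs(ai - a0) / h) ** 3) ** 3 for ai in a]
        sw = sum(w)
        a_bar = sum(wi * ai for wi, ai in zip(w, a)) / sw
        m_bar = sum(wi * mi for wi, mi in zip(w, m)) / sw
        sxx = sum(wi * (ai - a_bar) ** 2 for wi, ai in zip(w, a))
        sxy = sum(wi * (ai - a_bar) * (mi - m_bar) for wi, ai, mi in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(m_bar + slope * (a0 - a_bar))
    return fitted


def normalize_log_ratios(reference, target, span=0.5):
    """Return the normalized log ratios M - M_hat for one array pair."""
    m, a = ma_values(reference, target)
    m_hat = local_linear_fit(a, m, span)
    return [mi - fi for mi, fi in zip(m, m_hat)]
```

For two replicate arrays, the normalized log ratios should scatter around zero across the whole range of A.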


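1.1 Normalization

Non-linear scaling: Underlying reasoning

Assumption (the scaling factor $k_s(\cdot)$ now depends on the expression level):
\[
\begin{aligned}
x_{g1} &= f_1(\theta_{g1}) = k_1(\theta_{g1})\,\theta_{g1} && \text{(array 1)}\\
x_{g2} &= f_2(\theta_{g2}) = k_2(\theta_{g2})\,\theta_{g2} && \text{(array 2)}\\
&\;\;\vdots\\
x_{gS} &= f_S(\theta_{gS}) = k_S(\theta_{gS})\,\theta_{gS} && \text{(array } S\text{)}
\end{aligned}
\]

\[
\log\frac{x_{gs}}{x_{g1}}
  = \log\frac{\theta_{gs}}{\theta_{g1}} + \log\frac{k_s(\theta_{gs})}{k_1(\theta_{g1})}
  = \log\frac{\theta_{gs}}{\theta_{g1}} + h(\theta_{g1}, \theta_{gs}),
\]
where
\[
h(\theta_{g1}, \theta_{gs}) = \log\frac{k_s(\theta_{gs})}{k_1(\theta_{g1})}
\]
and $\log(\theta_{gs}/\theta_{g1})$ is the log relative expression level.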


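1.1 Normalization

Non-linear scaling: Underlying reasoning (cont’d)

Suppose we know the green genes are non-differentially expressed genes ($\theta_{gs} = \theta_{g1}$). For these genes
\[
\log\frac{x_{gs}}{x_{g1}} = h(\theta_{g1}, \theta_{gs}),
\qquad\text{where}\qquad
h(\theta_{g1}, \theta_{gs}) = \log\frac{k_s(\theta_{gs})}{k_1(\theta_{g1})},
\]
so $h$ can be estimated from them as a function $\hat{h}(x_{g1}, x_{gs})$ of the observed intensities. For every gene, the log relative expression level is then estimated as
\[
\log\widehat{\frac{\theta_{gs}}{\theta_{g1}}}
  = \log\frac{x_{gs}}{x_{g1}} - \hat{h}(x_{g1}, x_{gs}).
\]

The problem is that we usually don’t know which genes are constantly expressed!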


1.1. Normalization

Loess (Yang et al., 2002)

• Uses all genes to fit a non-linear normalization curve on the M-A plot scale (assuming that most genes are constantly expressed).
• Normalization is performed pairwise between arrays.
• Has been extended to normalize all arrays globally without selecting a baseline array, but this is time-consuming.


1.1. Normalization

Invariant set (dChip) (Tseng et al., 2001)

• Select a baseline array (the default is the one with the median average intensity).
• For each “treatment” array, identify a set of genes whose ranks are conserved between the baseline and the treatment array. This set of rank-invariant genes is considered to be non-differentially expressed.
• Each array is normalized against the baseline array by fitting a non-linear normalization curve to the invariant-gene set.

\[
G = \left\{ g : \left|\operatorname{rank}(x_{gs}) - \operatorname{rank}(x_{g1})\right| < d
\;\;\&\;\; l < \operatorname{Rank}\!\left(\frac{x_{gs} + x_{g1}}{2}\right) < G - l \right\}
\]
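A single pass of the rank-invariant selection can be sketched in Python as below. This is only an illustration of the set definition above, not dChip's implementation; the thresholds `d` and `l` are left to the caller, and breaking ties by position is an assumption of the sketch.

```python
def ranks(values):
    """Rank from 1 (smallest) to n (largest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0] * len(values)
    for r, i in enumerate(order, start=1):
        rank[i] = r
    return rank


def invariant_set(baseline, treatment, d, l):
    """Indices g with |rank(x_gs) - rank(x_g1)| < d and l < Rank(average) < G - l."""
    n_genes = len(baseline)
    r_base = ranks(baseline)
    r_treat = ranks(treatment)
    r_avg = ranks([(b + t) / 2 for b, t in zip(baseline, treatment)])
    return [g for g in range(n_genes)
            if abs(r_treat[g] - r_base[g]) < d and l < r_avg[g] < n_genes - l]


# Toy example: genes 2 and 6 change rank strongly and are excluded,
# genes 0 and 7 are dropped as the extremes of the average-intensity ranking.
baseline = [10, 20, 30, 40, 50, 60, 70, 80]
treatment = [12, 22, 200, 41, 52, 61, 5, 83]
print(invariant_set(baseline, treatment, d=2, l=1))   # → [1, 3, 4, 5]
```

The normalization curve would then be fitted through the (baseline, treatment) intensities of the selected genes only.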


1.1. Normalization

Invariant set (dChip)

Advantage:
More robust than fitting with all genes as in loess, especially when the expression distributions of the arrays are very different.

Disadvantage:
The selection of the baseline array is important.


1.1. Normalization

Quantile normalization (RMA) (Irizarry 2003)

1. Given n arrays of length p, form X of dimension p×n where each array is a column.
2. Sort each column of X to give Xsort.
3. Take the means across the rows of Xsort and assign this mean to each element in the row to get X'sort.
4. Get Xnormalized by rearranging each column of X'sort to have the same ordering as the original X.

Example:

X             Xsort         X'sort           Xnormalized
237  283      237  198      217.5  217.5     217.5  306
341  397      329  283      306    306       338    399
401  198      341  335      338    338       399    217.5
329  335      401  397      399    399       306    338
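The four steps can be written as a short Python sketch, applied here to the 4×2 example. It works on plain lists with one list per array; ties within a column are broken by position, which is a simplification assumed for the sketch rather than part of the published method.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length columns (one column per array)."""
    p = len(columns[0])
    # Step 2: sort each column.
    sorted_cols = [sorted(col) for col in columns]
    # Step 3: mean of each row of the sorted matrix.
    row_means = [sum(col[i] for col in sorted_cols) / len(columns)
                 for i in range(p)]
    # Step 4: put the row means back in the original order of each column.
    normalized = []
    for col in columns:
        order = sorted(range(p), key=lambda i: col[i])
        out = [0.0] * p
        for rank, i in enumerate(order):
            out[i] = row_means[rank]
        normalized.append(out)
    return normalized


x = [[237, 341, 401, 329],   # array 1
     [283, 397, 198, 335]]   # array 2
print(quantile_normalize(x))
# → [[217.5, 338.0, 399.0, 306.0], [306.0, 399.0, 217.5, 338.0]]
```

After normalization every array has exactly the same set of values, only in a different order.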


1.1. Normalization

Bolstad, B.M., Irizarry, R.A., Astrand, M., and Speed, T.P. (2003). A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance. Bioinformatics 19(2):185-193.

The paper carefully compares different normalization methods and concludes that quantile normalization generally performs best.


1.2. Summarize Expression Index

There are multiple probes for one gene (11 PM and 11 MM) in U133. How do we summarize the 22 intensity values into a meaningful expression intensity for the target gene?


1.2. Summarize Expression Index

MAS 4.0

For each probe set (I: # of arrays, J: # of probes):
\[
PM_{ij} - MM_{ij} = \theta_i + \varepsilon_{ij}, \qquad i = 1,\ldots,I,\; j = 1,\ldots,J
\]
$\theta_i$ is estimated by the average difference.

1. Negative expression values can occur.
2. Noisy for low expressed genes.
3. Does not account for probe affinity.


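1.2. Summarize Expression Index

dChip (DNA chips), Li and Wong (PNAS, 2001)

For each probe set (I: # of arrays, J: # of probes):
\[
\begin{aligned}
PM_{ij} &= \nu_j + \theta_i\alpha_j + \theta_i\phi_j + \varepsilon_{ij}\\
MM_{ij} &= \nu_j + \theta_i\alpha_j + \varepsilon_{ij}\\
PM_{ij} - MM_{ij} &= \theta_i\phi_j + \varepsilon_{ij}, \qquad i = 1,\ldots,I,\; j = 1,\ldots,J\\
\sum_j \phi_j^2 &= J, \qquad \varepsilon_{ij} \sim N(0, \sigma^2)
\end{aligned}
\]

1. Accounts for the probe affinity effect, $\phi_j$.
2. Outlier detection through multi-chip analysis.
3. Recommended for more than 10 arrays.

Multiplicative model: $PM_{ij} - MM_{ij} = \theta_i\phi_j + \varepsilon_{ij}$ (better)
Additive model: $PM_{ij} - MM_{ij} = \theta_i + \phi_j + \varepsilon_{ij}$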
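To make the multiplicative model concrete, here is a minimal Python sketch that fits $\theta_i$ and $\phi_j$ to a matrix of PM-MM differences by alternating least squares under the constraint $\sum_j \phi_j^2 = J$. It is an illustration under simplifying assumptions, not dChip's implementation: outlier detection and standard errors are omitted, and the function names and the fixed iteration count are invented for the example.

```python
import math


def _theta_given_phi(y, phi):
    """Least-squares theta_i for fixed phi: sum_j y_ij*phi_j / sum_j phi_j^2."""
    ss_phi = sum(p * p for p in phi)
    return [sum(y_ij * p for y_ij, p in zip(row, phi)) / ss_phi for row in y]


def fit_li_wong(y, n_iter=20):
    """Fit y[i][j] ~ theta[i] * phi[j] with sum_j phi[j]^2 = J.

    y[i][j] is the PM - MM difference of probe j on array i.
    """
    n_arrays, n_probes = len(y), len(y[0])
    phi = [1.0] * n_probes                     # start from equal affinities
    for _ in range(n_iter):
        theta = _theta_given_phi(y, phi)
        ss_theta = sum(t * t for t in theta)
        # Least-squares phi_j for fixed theta.
        phi = [sum(y[i][j] * theta[i] for i in range(n_arrays)) / ss_theta
               for j in range(n_probes)]
        # Rescale so that sum_j phi_j^2 = J.
        norm = math.sqrt(sum(p * p for p in phi) / n_probes)
        phi = [p / norm for p in phi]
    theta = _theta_given_phi(y, phi)           # final update with the rescaled phi
    return theta, phi
```

The returned `theta[i]` plays the role of the expression index of array i for this probe set; ratios such as `theta[1] / theta[0]` are unaffected by the scaling constraint on `phi`.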


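1.2. Summarize Expression Index

MAS 5.0

For each probe set (I: # of arrays, J: # of probes):
\[
\log(PM_{ij} - CT_{ij}) = \log(\theta_i) + \varepsilon_{ij}, \qquad i = 1,\ldots,I,\; j = 1,\ldots,J
\]
\[
CT_{ij} =
\begin{cases}
MM_{ij} & \text{if } MM_{ij} < PM_{ij}\\
\text{a value less than } PM_{ij} & \text{if } MM_{ij} \ge PM_{ij}
\end{cases}
\]
$\theta_i$ is estimated by a robust average (Tukey biweight).

1. No more negative expression values.
2. Taking the log adjusts for the dependence of the variance on the mean.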
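A robust average of the Tukey biweight type can be sketched in Python as follows. This is a one-step version: values are weighted by their distance from the median, measured in units of the median absolute deviation, and values far from the median get weight zero. The tuning constants `c=5` and `eps=1e-4` are assumptions of the sketch and should be checked against the Affymetrix documentation before being relied on.

```python
import statistics


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight average of a list of numbers."""
    m = statistics.median(values)
    mad = statistics.median([abs(v - m) for v in values])
    weights = []
    for v in values:
        u = (v - m) / (c * mad + eps)          # scaled distance from the median
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# log-scale probe values with one outlying probe (20.0)
probe_values = [8.0, 8.1, 7.9, 8.2, 7.8, 20.0]
print(tukey_biweight(probe_values))            # close to 8, unlike the plain mean of 10
```

The outlying probe receives zero weight, so the summary stays near the bulk of the probe values.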


1.2. Summarize Expression Index

RMA (Robust Multi-array Analysis), Irizarry et al. (NAR, 2003)

For each probe set (I: # of arrays, J: # of probes):
\[
\log(T(PM_{ij})) = \theta_i + \phi_j + \varepsilon_{ij}, \qquad i = 1,\ldots,I,\; j = 1,\ldots,J
\]
T is the transformation for background correction and normalization; $\varepsilon_{ij} \sim N(0, \sigma^2)$.

1. Log-scale additive model.
2. Suggests not using MM.
3. Fits the linear model robustly (median polish).
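Median polish itself is simple enough to sketch in a few lines of Python. The version below alternately sweeps row and column medians out of a table of log-scale, background-corrected and normalized PM values (arrays as rows, probes as columns). It illustrates the technique under those assumptions and is not the code used by RMA software; the fixed number of sweeps is arbitrary.

```python
import statistics


def median_polish(table, n_iter=10):
    """Fit table[i][j] ~ overall + row_eff[i] + col_eff[j] by sweeping medians."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [row[:] for row in table]
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep row medians into the row effects.
        for i in range(n_rows):
            med = statistics.median(resid[i])
            row_eff[i] += med
            resid[i] = [r - med for r in resid[i]]
        med = statistics.median(col_eff)
        overall += med
        col_eff = [c - med for c in col_eff]
        # Sweep column medians into the column effects.
        for j in range(n_cols):
            med = statistics.median([resid[i][j] for i in range(n_rows)])
            col_eff[j] += med
            for i in range(n_rows):
                resid[i][j] -= med
        med = statistics.median(row_eff)
        overall += med
        row_eff = [r - med for r in row_eff]
    return overall, row_eff, col_eff, resid


def expression_summary(table):
    """Per-array expression estimate: overall level plus the array (row) effect."""
    overall, row_eff, _, _ = median_polish(table)
    return [overall + r for r in row_eff]
```

Because medians are used instead of means, a single aberrant probe has little influence on the per-array summaries.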


Affymetrix Latin square data

[Figure from Irizarry et al. (NAR, 2003): three scatter-plot panels with axes labeled 1.25g and 20g; R2 = 0.85, R2 = 0.95 and R2 = 0.97]


Affymetrix Latin square data

[Figure from Irizarry et al. (NAR, 2003)]


1.3. Background Adjustment

Direct subtraction: PM-MM (MAS 4.0, dChip, MAS 5.0)

Assume the following deterministic model:
PM = O + N + S (O: optical noise, N: non-specific binding, S: signal)
MM = O + N
=> PM - MM = S > 0

Is it true?


Many MM > PM

Yeast sample hybridized to human chip:
• If MM measured the non-specific binding of PM well, we would see PM ≈ MM.
• R2 is only 0.5.
• MM does not measure the background noise of PM.

86 HG-U95A human chips, human blood extracts:
• Two-fork phenomenon at high abundance.
• 1/3 of the probes have MM > PM.


1.3. Background Adjustment

Ignore MM

Reasons MM should not be used:
1. MM contains non-specific binding information, but it also includes signal information and noise.
2. The non-specific binding mechanism is not well studied.
3. MM is costly (it takes up half the space of the array).

dChip has an option for a PM-only model.

In general, PM-only is preferred for both the dChip and RMA methods.


1.3. Background Adjustment

Consider sequence information (Naef & Magnasco, 2003)

1. 95% of the probes with MM > PM have a purine (A, G) as the middle base.
2. In the current protocol, only pyrimidines (C, T) carry biotin-labeled fluorescence.


1.3. Background Adjustment

Fit a simple linear model for the probe affinity $\alpha$ (Naef & Magnasco, 2003):

1. C > G ≈ T > A
2. Boundary effect


1.3. Background Adjustment

Some chemical explanation of the result:

PM (middle base)              | C       | G       | T       | A
MM (middle base)              | G       | C       | A       | T
Labeling                      | Yes (+) | No (-)  | Yes (+) | No (-)
Labeling impedes binding      | Yes (-) | No      | Yes (-) | No
Hydrogen bonds (see next page)| 3 (+)   | 3 (+)   | 2       | 2
Sequence-specific brightness  | High    | Average | Average | Low
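The PM/MM pairing in the table follows from how a mismatch probe is built: it is the PM 25-mer with its middle (13th) base replaced by the complementary base. A small Python sketch of that construction is given below; the probe sequence is made up for the example.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
PURINES = {"A", "G"}


def mismatch_probe(pm):
    """Return the MM probe: the PM 25-mer with its middle (13th) base complemented."""
    if len(pm) != 25:
        raise ValueError("expected a 25-mer")
    mid = len(pm) // 2                      # 0-based index 12 = 13th base
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]


def middle_base_is_purine(pm):
    """True if the middle base of the PM probe is a purine (A or G)."""
    return pm[len(pm) // 2] in PURINES


pm = "ACGTACGTACGT" + "C" + "ACGTACGTACGT"   # hypothetical probe, middle base C
mm = mismatch_probe(pm)
print(pm[12], "->", mm[12])                  # C -> G
```

Grouping probes by `middle_base_is_purine` is one way to check the observation that most probes with MM > PM have a purine in the middle of the PM sequence.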

1.3. Background Adjustment

Remember from the first lecture (double strand; from Lodish et al., Fig 4-4):

• G-C has three hydrogen bonds (stronger).
• A-T has two hydrogen bonds (weaker).


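1.3. Background Adjustment

GC-RMA (Wu et al., 2004 JASA)

\[
PM = O_{PM} + N_{PM} + S, \qquad MM = O_{MM} + N_{MM}
\]
\[
\begin{pmatrix} \log(N_{PM}) \\ \log(N_{MM}) \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} \mu_{PM} \\ \mu_{MM} \end{pmatrix},\;
\sigma^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}
\right),
\qquad
\mu_{PM} = h(\alpha_{PM}),\quad \mu_{MM} = h(\alpha_{MM})
\]

O: optical noise, log-normal distribution.
N: non-specific binding.
h: a smooth (almost linear) function.
$\alpha$: the sequence information weight computed from the simple linear model.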


Criteria to compare different methods

• Accuracy
  In a well-controlled experiment with spike-in genes (such as the Latin Square data), the concern is how close the estimated log-fold changes are to the underlying true log-fold changes.
  (Only available in data with spike-in genes.)
• Precision
  In data with replicates, the concern is the reproducibility (SD) of the same gene across replicates.
  (Available in data with replicates.)


GC-RMA


Software     | Fee        | GUI  | Flexibility to programming and mining | Method                           | Audience
MAS 4.0      | Commercial | Yes  | No                                    | Average difference               | Manufacturer default
dChip        | Free       | Yes  | Some extra tools                      | Li-Wong model                    | Biologists
MAS 5.0      | Commercial | Yes  | No                                    | Robust average of log difference | Manufacturer default
RMAExpress   | Free       | Yes  | No                                    | RMA                              | Biologists
Bioconductor | Free       | Some | Best                                  | All of above                     | Statistician, programmer
ArrayAssist  | Commercial | Yes  | No                                    | RMA, GC-RMA                      | Biologists


Probe level analysis in Bioconductor (affy package)

Background methods: none, rma/rma2, mas
Normalization methods: quantiles, loess, contrasts, constant, invariantset, qspline
PM correction methods: mas, pmonly, subtractmm
Summarization methods: avgdiff, liwong, mas, medianpolish, playerout


A Simple Case Study

Latin Square Data: 59 HG-U95A arrays, 14 spike-in genes in 14 experimental groups.
http://www.affymetrix.com/analysis/download_center2.affx

Expts \ Transcripts   1     2     3     4     5     6     7     8     9     10    11    12    13
A                     0     0.25  0.5   1     2     4     8     16    32    64    128   0     512
B                     0.25  0.5   1     2     4     8     16    32    64    128   256   0.25  1024
C                     0.5   1     2     4     8     16    32    64    128   256   512   0.5   0
D                     1     2     4     8     16    32    64    128   256   512   1024  1     0.25
E                     2     4     8     16    32    64    128   256   512   1024  0     2     0.5
F                     4     8     16    32    64    128   256   512   1024  0     0.25  4     1
G                     8     16    32    64    128   256   512   1024  0     0.25  0.5   8     2
H                     16    32    64    128   256   512   1024  0     0.25  0.5   1     16    4
I                     32    64    128   256   512   1024  0     0.25  0.5   1     2     32    8
J                     64    128   256   512   1024  0     0.25  0.5   1     2     4     64    16
K                     128   256   512   1024  0     0.25  0.5   1     2     4     8     128   32
L                     256   512   1024  0     0.25  0.5   1     2     4     8     16    256   64
M, N, O, P            512   1024  0     0.25  0.5   1     2     4     8     16    32    512   128
Q, R, S, T            1024  0     0.25  0.5   1     2     4     8     16    32    64    1024  256

M, N, O, P are replicates, and Q, R, S, T are another set of replicates.


A Simple Case Study

Take the following two replicate groups:

M  1521m99hpp_av06.CEL      Q  1521q99hpp_av06.CEL
N  1521n99hpp_av06.CEL      R  1521r99hpp_av06.CEL
O  1521o99hpp_av06.CEL      S  1521s99hpp_av06.CEL
P  1521p99hpp_av06.CEL      T  1521t99hpp_av06.CEL

Use Bioconductor to perform a simple evaluation of different probe analysis algorithms.

Note: This is only a simple demonstration. The evaluation result in this presentation is not conclusive.


A Simple Case Study

Average log intensities vs SD log intensities (M, N, O, P).

[Figure: six panels: MAS5.0, dChip (PM only), dChip (PM/MM), RMA, GC-RMA (PM/MM), GC-RMA (PM only)]


A Simple Case Study

Average log intensities vs SD log intensities (Q, R, S, T).

[Figure: six panels: MAS5.0, dChip (PM only), dChip (PM/MM), RMA, GC-RMA (PM/MM), GC-RMA (PM only)]


A Simple Case Study

Average pairwise correlations between replicates:

                    M, N, O, P    Q, R, S, T
MAS5                0.8930        0.9002
dChip (PM/MM)       0.9604        0.9621
dChip (PM-only)     0.9940        0.9966
RMA                 0.9978        0.9978
GC-RMA (PM/MM)      0.9988        0.9990
GC-RMA (PM-only)    0.9993        0.9994

Replicate correlation performance: GC-RMA (PM-only) > GC-RMA (PM/MM) > RMA > dChip (PM-only) >> dChip (PM/MM) >> MAS5
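The summary statistic in the table can be computed as sketched below in Python: the Pearson correlation of the expression values is taken for every pair of replicate arrays and the results are averaged. The case study itself was run in Bioconductor; this stand-alone version only illustrates the calculation, and the function names are invented for the example.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_pairwise_correlation(arrays):
    """Mean Pearson correlation over all pairs of replicate arrays."""
    pairs = list(combinations(arrays, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)
```

With four replicates (e.g. M, N, O, P) there are six pairs, so each entry of the table would be the average of six correlations.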


A Simple Case Study

RMA greatly improves on dChip (PM/MM), and the dChip (PM-only) model generally seems a little better than RMA.

However, the average replicate correlation of RMA (0.9978) is a little better than that of dChip (PM-only) (0.9940 & 0.9966).

dChip (PM-only) suffers from a number of outlying genes that do not fit the Li-Wong model.


Conclusion:

1. Technological advances have led to smaller probe sizes and better sequence selection algorithms, which reduce the number of probes in a probe set. This allows more biologically meaningful genes on a slide and reduces the cost.

2. Recent analysis advances have focused on understanding and modelling hybridization mechanisms. This will allow a better use of MM probes, or eventually suggest removing MMs from the array.

3. Probe analysis is relatively settled in the field. In the second lab session next Friday, we will introduce dChip and RMAExpress for Affymetrix probe analysis.


2. cDNA probe level analysis


From Y. Chen et al. (1997)

cDNA Microarray Review


cDNA Microarray Review

Probe (array) printing

1. 48 grids in a 12x4 pattern.
2. Each grid has 12x16 features.
3. Total 9216 features.
4. Each pin prints 3 grids.


cDNA Microarray Review

Probe design and printing

From Y. Chen et al. (1997)

cDNA Microarray Review


cDNA Microarray Review

Comparison of cDNA array and GeneChip

                    | cDNA                                                                      | GeneChip
Probe preparation   | Probes are cDNA fragments, usually amplified by PCR and spotted by robot. | Probes are short oligos synthesized using a photolithographic approach.
Colors              | Two-color (measures relative intensity)                                   | One-color (measures absolute intensity)
Gene representation | One probe per gene                                                        | 11-16 probe pairs per gene
Probe length        | Long, varying lengths (hundreds to 1K bp)                                 | 25-mers
Density             | Maximum of ~15000 probes                                                  | 38500 genes * 11 probes = 423500 probes


Advantages and disadvantages of cDNA array and GeneChip

cDNA microarray                                     | Affymetrix GeneChip
The data can be noisy and of variable quality.      | Specific and sensitive. Results very reproducible.
Cross (non-specific) hybridization can often happen.| Hybridization more specific.
May need an RNA amplification procedure.            | Can use a small amount of RNA.
More difficulty in image analysis.                  | Image analysis and intensity extraction are easier.
Need to search the database for gene annotation.    | More widely used. Better quality of gene annotation.
Cheap (both initial cost and per-slide cost).       | Expensive (~$400 per array + labeling and hybridization).
Can be custom made for special species.             | Only several popular species are available.
Do not need to know the exact DNA sequence.         | Need the DNA sequence for probe selection.


2.1. Image Analysis

Identify spot area:
1. Each spot contains around 100100 pixels.
2. The spot image may not be uniform or round.
3. Some software (like ScanAlyze or ImaGene) has algorithms to “help” place the grids and identify the spot and background areas locally.
4. Still semi-automatic: a very time-consuming job.

Extract intensities (data reduction):
1. Aim to extract the minimal, most informative statistics for further analysis. Usually the median signal minus the median background is used.
2. Some spot quality indexes (e.g. Stdev or CV) will be computed.


2.1. Image Analysis

ScanAlyze


2.1. Image Analysis

1. Input the number of rows and columns in each sector; input the approximate location of and distances between spots.
2. The grids may need to be tilted.
3. Some local adjustments may be needed.
4. Once the spot grids are close enough to the real physical spot locations, computer image algorithms help to find the optimal spot area (circular or irregular shapes) and background area.

May take 10~30 minutes per array. Usually the biologists will do it.

http://www.techfak.uni-bielefeld.de/ags/ai/projects/microarray/

2.1. Image Analysis


2.1. Image Analysis

Result file from image analysis

Summarized intensities for further analysis: median(spot intensities) - median(background intensities)
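The data-reduction step can be sketched in Python as follows, taking the pixel intensities of one spot and of its local background as plain lists. The CV-based quality index is one possible choice among the indexes mentioned earlier, and the function names are invented for the example; they are not taken from ScanAlyze or ImaGene.

```python
import statistics


def spot_intensity(spot_pixels, background_pixels):
    """Summarized intensity: median(spot pixels) - median(background pixels)."""
    return statistics.median(spot_pixels) - statistics.median(background_pixels)


def spot_cv(spot_pixels):
    """Coefficient of variation (stdev / mean) of the spot pixels."""
    return statistics.stdev(spot_pixels) / statistics.mean(spot_pixels)


spot = [1200, 1300, 1250, 5000, 1280]     # one bright artifact pixel (5000)
background = [100, 110, 90, 105]
print(spot_intensity(spot, background))   # → 1177.5
```

Using medians keeps a single bright artifact pixel from inflating the summarized intensity, while the CV flags the spot as uneven.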


2.2. Normalization

• Affymetrix
  Normalization is done across arrays.
  After normalization, the expression data matrix shows absolute expression intensities.

• cDNA
  Normalization is done between the two colors in an array.
  After normalization, the expression data matrix shows comparative expression intensities (log-ratios).


2.2. Normalization

\[
M = \log\left(\frac{Cy5}{Cy3}\right), \qquad
A = \frac{\log(Cy5) + \log(Cy3)}{2}
\]

Calibration: apply the same sample to both dyes (E. coli grown in glucose).

• Same sample on both dyes.
• Each point is a gene.
• Purple and orange represent two replicate slides: orange is one array and purple is another.


2.2. Normalization

General idea:
• Dye effect: Cy5 is usually more bleached than Cy3.
• Slide effect: the normalization factor is slide dependent.
• Usually we need to assume that most genes are not differentially expressed, or that up- and down-regulated genes roughly cancel out in their effect on expression.


2.2. Normalization

Current popular methods:
• House-keeping genes: select a set of non-differentially expressed genes based on experience, then use these genes to normalize.
• Constant normalization factor:
  Use the mean or median of each dye to normalize.
  ANOVA model (Churchill’s group).
• Average-intensity-dependent normalization:
  Robust nonlinear regression (Lowess) applied to the whole genome (Speed’s group).
  Select invariant genes computationally (rank-invariant method), then apply Lowess (Wong’s group).


2.2. Normalization

Loess Normalization:

Pin-wise normalization using all the genes. It requires the assumption that up- and down-regulated genes with similar average intensities (denoted A) roughly cancel out, or otherwise that most genes remain unchanged.

[Figure: M-A plot with the fitted normalization curves, from Dudoit et al. 2000]


Rank Invariant Normalization:

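Rank-invariant method (Schadt et al. 2001, Tseng et al. 2001):

$G = \{ g : \mathrm{abs}(\mathrm{rank}(Cy5_g) - \mathrm{rank}(Cy3_g)) < 5 \}$

$S = \{ g : |\mathrm{Rank}(Cy5_g) - \mathrm{Rank}(Cy3_g)| < p \cdot G \ \&\ l < \mathrm{Rank}((Cy5_g + Cy3_g)/2) < G - l \}$

Iterative selection:

$S_i = \{ g : g \in S_{i-1} \ \&\ |\mathrm{Rank}_{S_{i-1}}(Cy5_g) - \mathrm{Rank}_{S_{i-1}}(Cy3_g)| < p \cdot S_{i-1} \}$, starting from the initial set $S_0$.

(In the thresholds, $G$ and $S_{i-1}$ stand for the number of genes in the set.)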

Idea: If a particular gene is up- or down-regulated, then its Cy5 rank among the whole genome will be significantly different from its Cy3 rank.

Iterative selection helps to select a more conserved invariant set when the number of genes is large.
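The iterative idea can be sketched as follows: rank the genes within the current set in each channel, keep those whose two ranks are close, and repeat until the set stops shrinking. This is a simplified illustration only; it leaves out the trimming of the lowest and highest average ranks, and `p` and `max_iter` are illustrative defaults.

```python
def ranks(values):
    """Rank 1..n of each value (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def rank_invariant_set(cy5, cy3, p=0.02, max_iter=20):
    """Indices of genes whose Cy5 and Cy3 ranks differ by less than p * (current set size)."""
    keep = list(range(len(cy5)))
    for _ in range(max_iter):
        r5 = ranks([cy5[g] for g in keep])      # ranks recomputed within the current set
        r3 = ranks([cy3[g] for g in keep])
        limit = p * len(keep)
        new_keep = [g for g, x, y in zip(keep, r5, r3) if abs(x - y) < limit]
        if len(new_keep) == len(keep):          # converged: nothing removed
            break
        keep = new_keep
    return keep
```

The selected genes would then be used to fit the Lowess curve that normalizes all genes.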

2.2. Normalization


Rank Invariant Normalization:

Blue points are invariant genes selected by rank-invariant method.

Red curves are estimated by Lowess and extrapolation.

Data: E. coli chip, ~4000 genes, from Liao lab.

2.2. Normalization


Data Truncation

• In cDNA microarray, the intensity value is between 0 and $2^{16} = 65536$.
• Measurements of low-intensity genes are not stable.
• Extremely highly expressed genes can saturate.
• For example, we can truncate genes with intensity smaller than 200 or larger than 65000.


Approaches to assess expression level:

Single slide:
1. Normal model (Chen et al. 1997)
2. Gamma model with empirical Bayes approach (Newton et al. 2001)

With replicate slides:
• Traditional t-test
• ANOVA model (Kerr et al. 2000)
• Permutation t-test (Dudoit et al. 2000)

Hierarchical structure:
• Linear hierarchical model (Tseng et al. 2001)

2.3. Assess Expression Level


Single slide analysis:

1. Chen et al. (1997)

$Cy5 \sim N(\mu_R, \sigma_R^2)$, $Cy3 \sim N(\mu_G, \sigma_G^2)$, with $\sigma_R / \mu_R = \sigma_G / \mu_G = c$.

Infer $\mu_R / \mu_G$ by Cy5/Cy3 at known house-keeping genes in the experiment.

2. Newton et al. (1999)

Assume gamma distributions to avoid negative values.

Suggest to infer ratio conditional on absolute intensities.

2.3. Assess Expression Level


Case study: (Tseng et al. 2001)

125-gene project: each gene is spotted four times
  Calibration: E. coli grown in acetate vs. acetate: C1S1~2
               E. coli grown in glucose vs. glucose: C2S1~4, C3S1~2, C4S1~3
  Comparative: E. coli grown in acetate vs. glucose: R1S1~2, R2S1~2

4129-gene project: each gene is singly spotted
  Calibration: E. coli grown in acetate vs. acetate: C1S1~2, C2S1~2
  Comparative: E. coli grown in acetate vs. glucose: R1S1~2, R2S1~2

2.3. Assess Expression Level


[Figure: Hierarchical structure in experiment design. The original mRNA pool goes through reverse transcription and labeling to give experiments C1 and C2; each is hybridized onto different slides (C1S1, C1S2 and C2S1, C2S2).]

2.3. Assess Expression Level


2.3. Assess Expression Level


2.3. Assess Expression Level

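Basics of Bayesian Analysis

$x_1, \ldots, x_n \sim f(x \mid \theta)$

$\theta$: the unknown underlying parameter of interest
$x_1, \ldots, x_n$: observed data

Bayes rule:

$$g(\theta \mid x_1, \ldots, x_n) = \frac{f(x_1, \ldots, x_n \mid \theta)\, h(\theta)}{\int f(x_1, \ldots, x_n \mid \theta)\, h(\theta)\, d\theta}$$

$h(\theta)$ is a prior distribution (knowledge) of $\theta$.

$g(\theta \mid x_1, \ldots, x_n)$ is called the posterior distribution of $\theta$, meaning how much we can say about $\theta$ given the data.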
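A small numerical illustration of Bayes rule: the posterior on a grid of candidate parameter values is the likelihood times the prior, divided by their sum (a discrete stand-in for the integral). The normal likelihood with known standard deviation and the grid itself are assumptions made only for this sketch.

```python
import math

def posterior_on_grid(data, grid, prior, sigma=1.0):
    """Posterior weights on a grid of theta values for x_i ~ N(theta, sigma^2)."""
    def likelihood(theta):
        return math.exp(-sum((x - theta) ** 2 for x in data) / (2 * sigma ** 2))
    unnorm = [likelihood(t) * h for t, h in zip(grid, prior)]   # f(x | theta) * h(theta)
    total = sum(unnorm)                                         # normalizing constant
    return [u / total for u in unnorm]

# Flat prior over three candidate values; the data favor theta = 1.
post = posterior_on_grid([1.0, 0.8], [-1.0, 0.0, 1.0], [1 / 3, 1 / 3, 1 / 3])
```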


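Bayesian Hierarchical Model

$x_{seg} \sim N(\theta_{eg}, \sigma_g^2)$, $\sigma_g^2 \sim \tilde{\sigma}_g^2 \cdot k / \chi_k^2$

$\theta_{eg} \sim N(\mu_g, \tau_g^2)$, $\tau_g^2 \sim \tilde{\tau}_g^2 \cdot h / \chi_h^2$

$p(\mu_g) \propto 1$

$x$: normalized log ratios, $\log(Cy5/Cy3)$.
$e$: experiment, $s$: slide, $g$: gene.
$\mu_g$: underlying true expression level.
$\tau_g^2$: experimental (cultural) variation.
$\sigma_g^2$: slide variation.
$k$ and $h$ are adjustable parameters.

Note only the $x$'s are observed, e.g. ((0.75, 0.67), (0.45, 0.51)).

[Figure: the original mRNA pool is split into experiments C1 and C2 (variation $\tau^2$); each experiment is hybridized on slides C1S1, C1S2 and C2S1, C2S2 (variation $\sigma^2$).]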

2.3. Assess Expression Level


2.3. Assess Expression Level

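[Figure: graphical representation of the hierarchical model, with nodes $\tilde{\sigma}_g^2$, $\sigma_g^2$, $\tilde{\tau}_g^2$, $\tau_g^2$, $\mu_g$, $\theta_{eg}$, $x_{seg}$, $h$ and $k$.]

$x_{seg} \sim N(\theta_{eg}, \sigma_g^2)$, $\sigma_g^2 \sim \tilde{\sigma}_g^2 \cdot k / \chi_k^2$

$\theta_{eg} \sim N(\mu_g, \tau_g^2)$, $\tau_g^2 \sim \tilde{\tau}_g^2 \cdot h / \chi_h^2$, $p(\mu_g) \propto 1$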


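How to specify the prior?

Empirical Bayes:

Use empirical data to help specify the hyperparameters $(\tilde{\sigma}^2, \tilde{\tau}^2)$.

A common version of EB is usually achieved by integrating out intermediate parameters and maximizing the resulting marginal likelihood:

$$\max_{\tilde{\sigma}^2, \tilde{\tau}^2} p(\tilde{\sigma}^2, \tilde{\tau}^2 \mid X), \qquad p(\tilde{\sigma}^2, \tilde{\tau}^2 \mid X) = \int p(\tilde{\sigma}^2, \tilde{\tau}^2, \mu, \theta, \sigma^2, \tau^2 \mid X)\, d\mu\, d\theta\, d\sigma^2\, d\tau^2$$

It is hard to implement this in a three-layer hierarchical model.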

2.3. Assess Expression Level


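Another version of EB:

Estimate parameters in prior from empirical data.

$$\tilde{\sigma}^2 = \frac{\sum_{e,s,g} (y_{gse} - \bar{y}_{eg})^2}{G \cdot E \cdot (S-1)} \quad \text{(between slides variation)}$$

$$\tilde{\tau}^2 = \frac{\sum_{e,g} (\bar{y}_{eg} - \bar{y}_{g})^2}{G \cdot (E-1)} \quad \text{(between experiment variation)}$$

[Figure: the original mRNA pool is split into experiments C1 and C2; each experiment is hybridized on slides C1S1, C1S2 and C2S1, C2S2.]

Note:
• Since there are thousands of genes, the common problem of reusing data and getting an over-confident prior in EB is alleviated.
• The resulting estimate of the prior variance is more conservative.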

2.3. Assess Expression Level


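Getting prior distribution: (when we have calibration experiments)

With replicate slides, e.g. ((0.75, 0.67), (0.45, 0.51)):

$$\hat{\sigma}_A^2 = \frac{\sum_{e,s,g} (x_{gse} - \bar{x}_{eg})^2}{G \cdot E \cdot (S-1)}, \qquad \hat{\sigma}_g^2 = \frac{\sum_{e,s} (x_{gse} - \bar{x}_{eg})^2}{E \cdot (S-1)}$$

$$\tilde{\sigma}_g^2 = \frac{E \cdot (S-1) \cdot \hat{\sigma}_g^2 + \hat{\sigma}_A^2}{E \cdot (S-1) + 1}$$

Between experiments:

$$\hat{\tau}_A^2 = \frac{\sum_{e,g} \bar{x}_{eg}^2}{G \cdot E}, \qquad \hat{\tau}_g^2 = \frac{\sum_{e} \bar{x}_{eg}^2}{E}$$

$$\tilde{\tau}_g^2 = \frac{E \cdot \hat{\tau}_g^2 + \hat{\tau}_A^2}{E + 1}$$

[Figure: two copies of the hierarchical design (original mRNA pool, experiments C1 and C2, slides C1S1, C1S2, C2S1, C2S2, with variations $\tau^2$ and $\sigma^2$): one for Calibration (normal vs normal) and one for Comparative (cancer vs normal).]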
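A hedged sketch of one way to get gene-specific prior variances from calibration slides. It assumes, as a working reading of this slide, that each gene's own estimate is averaged with the all-gene estimate using weights E(S-1) and 1 for the slide variance and E and 1 for the experiment variance, and that the true log ratio in a calibration experiment is 0. The function name and data layout are hypothetical.

```python
def calibration_priors(x):
    """x[g][e][s]: normalized log ratios from calibration experiments.
    Returns (sigma2_tilde, tau2_tilde), each a list with one value per gene."""
    G, E, S = len(x), len(x[0]), len(x[0][0])
    xbar = [[sum(x[g][e]) / S for e in range(E)] for g in range(G)]
    # Gene-specific slide variance: spread of slides around their experiment mean.
    sig_g = [sum((x[g][e][s] - xbar[g][e]) ** 2 for e in range(E) for s in range(S))
             / (E * (S - 1)) for g in range(G)]
    # Gene-specific experiment variance: experiment means around the true value 0.
    tau_g = [sum(xbar[g][e] ** 2 for e in range(E)) / E for g in range(G)]
    sig_a = sum(sig_g) / G            # all-gene averages
    tau_a = sum(tau_g) / G
    w = E * (S - 1)
    sig_t = [(w * s + sig_a) / (w + 1) for s in sig_g]
    tau_t = [(E * t + tau_a) / (E + 1) for t in tau_g]
    return sig_t, tau_t

# First gene uses the example values from the slide; the second gene is made up.
sig_t, tau_t = calibration_priors([[[0.75, 0.67], [0.45, 0.51]],
                                   [[0.0, 0.0], [0.0, 0.0]]])
```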

2.3. Assess Expression Level


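MCMC for hierarchical model:

1. Compute $\theta_{eg}^{(0)} = \bar{x}_{eg}$.

2. $\tau_g^2 \mid \theta_{eg} \sim \left( h\,\tilde{\tau}_g^2 + \sum_{e=1}^{E} (\theta_{eg} - \bar{\theta}_g)^2 \right) \cdot \chi^{-2}_{h+E-1}$

3. $\mu_g \mid \theta_{eg}, \tau_g^2 \sim N\left( \bar{\theta}_g,\ \tau_g^2 / E \right)$

4. $\sigma_g^2 \mid x_{seg}, \theta_{eg} \sim \left( k\,\tilde{\sigma}_g^2 + \sum_{e=1}^{E} \sum_{s=1}^{s_e} (x_{seg} - \theta_{eg})^2 \right) \cdot \chi^{-2}_{k + \sum_{j=1}^{E} s_j}$

5. $\theta_{eg} \mid x_{seg}, \mu_g, \sigma_g^2, \tau_g^2 \sim N\left( \frac{s_e \bar{x}_{eg} / \sigma_g^2 + \mu_g / \tau_g^2}{s_e / \sigma_g^2 + 1 / \tau_g^2},\ \frac{1}{s_e / \sigma_g^2 + 1 / \tau_g^2} \right)$

($\bar{\theta}_g$: average of $\theta_{eg}$ over experiments; $s_e$: number of slides in experiment $e$.)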
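A sketch of a Gibbs sampler for one gene, under stated assumptions: normal likelihoods at both levels, scaled inverse chi-square priors on the slide and experiment variances, and a flat prior on the underlying expression level, so that every conditional draw is a standard conjugate update. The function names, the defaults for `k`, `h`, `n_iter`, `burn`, and the data layout are illustrative, not taken from the case study.

```python
import random

def scaled_inv_chi2(rng, df, scale_sum):
    """Draw scale_sum / chi2(df), using chi2(df) = Gamma(shape=df/2, scale=2)."""
    return scale_sum / rng.gammavariate(df / 2.0, 2.0)

def gibbs_gene(x, sig_t, tau_t, k=2.0, h=2.0, n_iter=2000, burn=500, seed=0):
    """x[e]: log ratios on the slides of experiment e (one gene).
    sig_t, tau_t: prior scales for slide and experiment variances.
    Returns posterior draws of the underlying expression level mu."""
    rng = random.Random(seed)
    E = len(x)
    xbar = [sum(xe) / len(xe) for xe in x]
    n_obs = sum(len(xe) for xe in x)
    theta = xbar[:]                                   # initial values
    draws = []
    for it in range(n_iter):
        tbar = sum(theta) / E
        ss_t = sum((t - tbar) ** 2 for t in theta)
        tau2 = scaled_inv_chi2(rng, h + E - 1, h * tau_t + ss_t)   # experiment variance
        mu = rng.gauss(tbar, (tau2 / E) ** 0.5)                    # underlying level
        ss_x = sum((v - t) ** 2 for xe, t in zip(x, theta) for v in xe)
        sig2 = scaled_inv_chi2(rng, k + n_obs, k * sig_t + ss_x)   # slide variance
        for e in range(E):                                         # experiment means
            prec = len(x[e]) / sig2 + 1.0 / tau2
            mean = (len(x[e]) * xbar[e] / sig2 + mu / tau2) / prec
            theta[e] = rng.gauss(mean, prec ** -0.5)
        if it >= burn:
            draws.append(mu)
    return draws
```

A 95% probability interval for the expression level can then be read off the sorted draws at the 2.5% and 97.5% positions.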

2.3. Assess Expression Level


2.3. Assess Expression Level

95% probability interval of the posterior distribution of the underlying expression level.


2.4. Experimental design

• Biological variation

Technical variations:

• Amplification

• Labeling

• Hybridization

• Pin effect

• Scanning


(i) Calibration: Use the same sample on both dyes for hybridization.

Calibration experiments help to validate experiment quality and gene-specific variability.

2.4.2. Calibration and replicate

Comparative: Tumor vs Ref

Calibration: Ref vs Ref


(ii) Replicates: (replicate spots, slides)

• Multiple-spotting helps to identify locally contaminated spots but reduces the number of genes in the study.
• Multi-stage strategy: use single-spotting to include as many genes as possible for a pilot study. Identify a subset of interesting genes and then use multiple-spotting.
• Replicate spots and slides help to verify reproducibility at the spot and slide level.

2.4.2. Calibration and replicate


2.4.2. Calibration and replicate

Biological replicate

From Yang, UCSF


Technical replicate

2.4.2. Calibration and replicate

From Yang, UCSF


(iii) Reverse labelling:

Advantages:
• Cancels out linear normalization scaling and simplifies the analysis. However, the linear assumption is often not true.
• Helps to cancel out gene-label interactions if they exist.

Sample A Sample B

2.4.2. Calibration and replicate


Different choices of reference sample:
a) Normal patient or time 0 sample in time course study
b) Pool all samples or all normal samples
c) Embryonic cells
d) Commercial kit

2.4.3. Choice of reference sample

Ideally we want all genes expressed at a constant moderate level in reference sample.


2.4.4. Design issue

From Yang, UCSF


Design issues:
(a) Reference design
(b) Loop design
(c) Balance design

2.4.4. Design issue

Reference sample is redundantly measured many times.


(c) v samples with v+2 experiments; v samples with 2v experiments

See Kerr et al. 2001


Conclusion of cDNA array

1. Affymetrix GeneChip is preferred if available.

2. Unlike GeneChip, cDNA array data is usually noisier, so careful quality control (replicates and calibration) is important. Still, custom arrays are occasionally needed for specific research.

3. Analysis of cDNA microarrays is also applicable to other two-color technologies such as array CGH and similar two-color oligo arrays.

4. The conservative “reference design” is usually more robust, although it is not statistically the most efficient.


3. Data preprocessing


3.1. Data Truncation and Transformation

Transformation

1. Logarithmic transformation (most commonly used)
   -- tends to give an approximately normal distribution
   -- should avoid negative or 0 intensity before transformation

2. Square root transformation
   -- a variance-stabilizing transformation under Poisson model

3. Box-Cox transformation family

4. Affine transformation

5. Generalized-log transformation

For details see chapter 6.1 in Lee’s book. Log10 or Log2 transformation is the most common practice.
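A few of these transformations written out as small functions, as a sketch. The `floor` used to avoid non-positive values and the constant `c` in the generalized log are illustrative choices, and the generalized-log form shown is one common variant rather than necessarily the one in Lee's book.

```python
import math

def log2_transform(x, floor=1.0):
    """Log base 2 after flooring, so that zero or negative intensities do not fail."""
    return math.log2(max(x, floor))

def sqrt_transform(x):
    """Square root; variance-stabilizing when counts are roughly Poisson."""
    return math.sqrt(max(x, 0.0))

def box_cox(x, lam):
    """Box-Cox family for x > 0: (x^lam - 1) / lam, with the log as the lam = 0 limit."""
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def glog2(x, c=1.0):
    """Generalized log: behaves like log2(x) for large x but is defined for x <= 0."""
    return math.log2((x + math.sqrt(x * x + c * c)) / 2.0)
```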


3.2. Filtering

Filtering is an important step in microarray analysis:

1. Without filtering, many genes are irrelevant to the biological investigation and will add noise to the analysis (among ~30,000 genes in the human genome, usually only around 6000 genes are expressed and varied in the experiment).

2. But filtering out too many genes runs the risk of eliminating important biomarkers.

3. Three common aspects of filtering:
   1. Genes of bad experimental quality.
   2. Genes that are not expressed.
   3. Genes that do not fluctuate across experiments.


3.2. Filtering

Filter out genes with bad quality in cDNA array: Outputs from imaging analysis usually have a quality index or flag to identify genes with bad quality image.

Three common sources of bad quality probes:
1. Problematic probes: probes with non-uniform intensities.
2. Low-intensity probes: genes with low intensities are known to have bad reproducibility and are hard to verify by RT-PCR. Normally genes with intensities less than 100 or 200 are filtered.
3. Saturated probes: genes with intensities reaching scanner limit (saturation) should also be filtered.

For Affymetrix and other platforms, each probe (set) also has a detection p-value, quality flag or present/absent call.


3.2. Filtering

Filtering by quality index: different array platform and image analysis have different format

low intensity


Filtering by quality index:

3.2. Filtering

            Array 1    Array 2    ...    Array S
Gene 1      NA         342.061           267.247
Gene 2      49.4225    72.2798           54.2583
Gene 3      58.7938    69.6987           73.8338
Gene 4      196.236    163.73            197.419
Gene 5      146.344    136.412           140.536
Gene 6      93.5549    131.405           96.128
:           :          :                 :
Gene G-2    768.63     763.445           936.445
Gene G-1    30.3535    NA                34.7477
Gene G      15.9003    12.5406           13.648

NA: not applicable
Missing values due to bad quality, low or saturated intensities


3.2. Filtering

Filter genes with low information content:
1. Small standard deviation (stdev)
2. Small coefficient of variation (CV: stdev/mean)

[Figure: intensity versus samples for two genes. Gene 1: intensities 25, 15, 30, 20; stdev=6.45, CV=0.29. Gene 2: intensities 125, 115, 130, 120; stdev=6.45, CV=0.053.]

Note: CV is more reasonable for original intensities. But for log-transformed intensities, stdev is enough.

Why?
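One way to see the answer numerically: for small relative variation, the standard deviation of log x is approximately stdev(x)/mean(x), so the stdev on the log scale already behaves like a CV. The sketch below uses the two example genes from this slide; the helper names are hypothetical.

```python
import math
import statistics

gene1 = [25, 15, 30, 20]
gene2 = [125, 115, 130, 120]

def sd_and_cv(values):
    """Sample standard deviation and coefficient of variation on the raw scale."""
    sd = statistics.stdev(values)
    return sd, sd / statistics.mean(values)

def sd_log2(values):
    """Sample standard deviation after log2 transformation."""
    return statistics.stdev([math.log2(v) for v in values])

sd1, cv1 = sd_and_cv(gene1)   # same raw stdev as gene 2, much larger CV
sd2, cv2 = sd_and_cv(gene2)
```

On the raw scale the two genes have identical stdev, yet after the log transformation gene 1 has a clearly larger stdev than gene 2, in line with their CVs.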


A simple gene filtering routine (I usually use) before down-stream analyses:

1. Take log (base 2) transformation.

2. Delete genes with more than 20% missing values among all samples.

3. Delete genes with average expression level less than, say, α=7 ($2^7=128$). For Affymetrix and most other platforms, intensities less than 100-200 are simply noise.

4. Delete genes with standard deviation smaller than, say, β=0.4 ($2^{0.4}=1.32$, i.e. a 32% fold change).

5. Might adjust β so that the number of remaining genes is computationally manageable in downstream analysis (e.g. around 5000 genes).

3.2. Filtering

Gene filtering
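The routine above can be sketched in a few lines. The data layout (a dict from gene name to a list of raw intensities, with `None` for missing values) and the function name are assumptions made for illustration; the thresholds are the ones quoted on the slide.

```python
import math
import statistics

def filter_genes(data, max_missing=0.2, alpha=7.0, beta=0.4):
    """Return {gene: log2 values} for genes that pass all filters."""
    kept = {}
    for gene, row in data.items():
        # Step 1: log2 transform; non-positive or missing values stay missing.
        logged = [math.log2(x) if x is not None and x > 0 else None for x in row]
        observed = [v for v in logged if v is not None]
        # Step 2: drop genes with more than 20% missing values.
        if len(row) - len(observed) > max_missing * len(row):
            continue
        if len(observed) < 2:
            continue
        # Step 3: drop genes with low average expression.
        if statistics.mean(observed) < alpha:
            continue
        # Step 4: drop genes that barely vary across samples.
        if statistics.stdev(observed) < beta:
            continue
        kept[gene] = logged
    return kept
```

Step 5 then amounts to raising or lowering `beta` until the number of kept genes is manageable.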


3.2. Filtering

Sample filtering (detecting problematic slides)

Compute correlation matrix of the samples:

1. Arrays of replicates should have high correlation. (m,n,o,p are replicates and q,r,s,t are another set of replicates)

2. A problematic array is often found to have low correlation with all the other arrays.

3. Heatmap is usually plotted for better visualization.
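A sketch of this diagnostic: compute the Pearson correlation between every pair of arrays and flag any array whose typical correlation with the others is low. Using the median as the summary and 0.8 as the cutoff are illustrative choices, not values from the slide.

```python
import math
import statistics

def pearson(u, v):
    """Pearson correlation between two equally long lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def correlation_matrix(arrays):
    """arrays: {array name: expression values over the same genes}."""
    names = list(arrays)
    return {a: {b: pearson(arrays[a], arrays[b]) for b in names} for a in names}

def flag_problem_arrays(corr, threshold=0.8):
    """Arrays whose median correlation with all other arrays is below the threshold."""
    return [a for a in corr
            if statistics.median(r for b, r in corr[a].items() if b != a) < threshold]
```

The nested dict returned by `correlation_matrix` is also what a heatmap routine would take as input.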


m,n,o,p

q,r,s,t

White: high correlation

Dark gray: low correlation

3.2. Filtering

Diagnostic plot by correlation matrix


3.3. Missing Value Imputation

Reasons of missing values in microarray:

• spotting problems (cDNA)
• dust
• finger prints
• poor hybridization
• inadequate resolution
• fabrication errors (e.g. scratches)
• image corruption

Many down-stream analyses require complete data.

“Imputation” is usually helpful.


            Array 1    Array 2    ...    Array S
Gene 1      NA         342.061           267.247
Gene 2      49.4225    72.2798           54.2583
Gene 3      58.7938    69.6987           73.8338
Gene 4      196.236    163.73            197.419
Gene 5      146.344    136.412           140.536
Gene 6      93.5549    131.405           96.128
:           :          :                 :
Gene G-2    768.63     763.445           936.445
Gene G-1    30.3535    NA                34.7477
Gene G      15.9003    12.5406           13.648

It is common to have ~5% MVs in a study: 5000 (genes) × 50 (arrays) × 5% = 12,500.

3.3. Missing Value Imputation


• Naïve approaches
  – Missing values = 0 or 1 (arbitrary signal)
  – Missing values = row (gene) average

• Smarter approaches have been proposed:
  – K-nearest neighbors (KNN)
  – Regression-based methods (OLS)
  – Singular value decomposition (SVD)
  – Local SVD (LSVD)
  – Partial least squares (PLS)
  – More (Bayesian Principal Component Analysis, Least Square Adaptive, Local Least Squares)

Assumption behind: Genes work cooperatively in groups. Genes with similar patterns will provide information in MV imputation.

3.3. Missing Value Imputation

Existing methods


[Figure: expression versus arrays for several genes, one of which has a randomly missing datum ("?").]

• choose k genes that are most “similar” to the gene with the missing value (MV)
• estimate MV as the weighted mean of the neighbors
• considerations:
  – number of neighbors (k)
  – distance metric
  – normalization step

3.3. Missing Value Imputation

KNN.e & KNN.c


• parameter k
  – 10 usually works (5-15)
• distance metric
  – euclidean distance (KNN.e)
  – correlation-based distance (KNN.c)
• normalization?
  – not necessary for euclidean neighbors
  – required for correlation neighbors

[Figure: expression versus arrays for several genes, one of which has a missing datum ("?").]

3.3. Missing Value Imputation

KNN.e & KNN.c
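A minimal sketch of the Euclidean variant (KNN.e). For each missing entry it finds the k genes closest to the target gene over the arrays both have observed, then averages their values on the missing array. Weighting neighbors by inverse distance is an assumption here, since the slides only say "weighted mean"; the function name and data layout are also hypothetical.

```python
import math

def knn_impute(data, k=10):
    """data: list of gene rows (lists), with None for missing values. Returns an imputed copy."""
    out = [row[:] for row in data]
    for gi, row in enumerate(data):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for gj, other in enumerate(data):
                if gj == gi or other[j] is None:
                    continue                      # a neighbor must be observed on array j
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda t: t[0])
            nearest = candidates[:k]
            if nearest:
                weights = [1.0 / (d + 1e-6) for d, _ in nearest]   # closer genes count more
                out[gi][j] = sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)
    return out
```

For the correlation-based variant (KNN.c) the distance line would be replaced by one minus the correlation, after normalizing each gene.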


• regression-based approach
• KNN+OLS
• algorithm:
  – choose k neighbors (euclidean or correlation; normalize or not)
  – the gene with the MV is regressed over the neighbor genes (one at a time, i.e. simple regression)
  – for each neighbor, MV is predicted from the regression model
  – MV is imputed as the weighted average of the k predictions

3.3. Missing Value Imputation

OLS.e & OLS.c


[Figure: expression versus arrays for a gene with a randomly missing datum ("?") and its neighbor genes.]

$y_1 = a_1 + b_1 x_1$

$y_2 = a_2 + b_2 x_2$

$y = w_1 y_1 + w_2 y_2$

3.3. Missing Value Imputation

OLS.e & OLS.c
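A sketch of the regression step, assuming the neighbors have already been chosen and are complete. Each neighbor gives one simple-regression prediction, and the predictions are averaged. Using the squared correlation as the weight is an illustrative assumption; the slides do not specify the weights.

```python
def ols_impute(target, neighbors, j):
    """Impute target[j] (None) from complete neighbor rows via simple regressions."""
    cols = [c for c, v in enumerate(target) if v is not None and c != j]
    y = [target[c] for c in cols]
    ybar = sum(y) / len(y)
    syy = sum((yi - ybar) ** 2 for yi in y)
    preds, weights = [], []
    for nb in neighbors:
        x = [nb[c] for c in cols]
        xbar = sum(x) / len(x)
        sxx = sum((xi - xbar) ** 2 for xi in x)
        sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
        if sxx == 0:
            continue                               # a flat neighbor carries no information
        b = sxy / sxx                              # slope of y = a + b x
        a = ybar - b * xbar                        # intercept
        preds.append(a + b * nb[j])                # prediction for the missing array
        weights.append(sxy * sxy / (sxx * syy) if syy > 0 else 1.0)   # r squared
    total = sum(weights)
    if not preds or total == 0:
        return ybar                                # fall back to the row average
    return sum(w * p for w, p in zip(weights, preds)) / total
```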


• Algorithm
  – set MVs to row average (need a starting point)
  – decompose expression matrix in orthogonal components, “eigengenes”
  – use the proportion, p, of eigengenes corresponding to largest eigenvalues to reconstruct the MVs from the original matrix (i.e. improve your estimate)
  – use EM approach to iteratively improve estimates of MVs until convergence

• Assumption:
  – The complete expression matrix can be well-decomposed by a smaller number of principal components.

3.3. Missing Value Imputation

SVD


• KNN+SVD
  – choose k neighbors (euclidean or correlation; normalize or not)
  – perform SVD on the k nearest neighbors and get a prediction of the missing value

3.3. Missing Value Imputation

LSVD.e & LSVD.c


• PLS: Select linear combinations of genes (PLS components) exhibiting high covariance with the gene having the MV.
  – The first linear combination of genes has the highest correlation with the target gene.
  – The second linear combination of genes has the greatest correlation with the target gene in the orthogonal space of the first linear combination.

• MVs are then imputed by regressing the target gene onto the PLS components.

3.3. Missing Value Imputation

PLS


3.3. Missing Value Imputation

Types of missing mechanism:

1. Missing completely at random (MCAR)
   Missingness is independent of the observed values and their own unobserved values.
   • Spot missing due to mis-printing or dust particle.
   • Spot missing due to scratches.

2. Missing at random (MAR)
   Missingness is independent of the unobserved data but depends on the observed data.

3. Missing not at random (MNAR)
   Missingness is dependent on the unobserved data.
   • Spots missing due to saturation or low expression.

Current imputation methods only work for MCAR, not MNAR.


Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes

Guy N. Brock, John R. Shaffer, Richard E. Blakesley, Meredith J. Lotz, George C. Tseng

BMC Bioinformatics, 2008


Data set | Full Dim. | Used Dim. | Category | Organism | Expression Profiles
Alizadeh (ALI) | 13412 x 40 | 5635 x 40 | multiple exposure | H. sapiens | diffuse large B-cell lymphoma
Alon (ALO) | 2000 x 62 | 2000 x 62 | multiple exposure | H. sapiens | colon cancer and normal colon tissue
Baldwin (BAL) | 16814 x 39 | 6838 x 39 | time series, non-cyclic | H. sapiens | epithelial cellular response to L. monocytogenes
Causton (CAU) | 4682 x 45 | 4616 x 45 | multiple exposure x time series | S. cerevisiae | response to changes in extracellular environment
Gasch (GAS) | 6152 x 174 | 2986 x 155 | multiple exposure x time series | S. cerevisiae | cellular response to DNA-damaging agents
Golub (GOL) | 7129 x 72 | 1994 x 72 | multiple exposure | H. sapiens | acute lymphoblastic leukemia
Ross (ROS) | 9706 x 60 | 2266 x 60 | multiple exposure | H. sapiens | NCI60 cancer cell lines
Spellman, AFA (SP.AFA) | 7681 x 18 | 4480 x 18 | time series, cyclic | S. cerevisiae | cell-cycle genes
Spellman, ELU (SP.ELU) | 7681 x 14 | 5766 x 14 | time series, cyclic | S. cerevisiae | cell-cycle genes

9 data sets: multiple exposure, time series or both
7 methods were compared: KNN, OLS, LSA, LLS, PLS, SVD, BPCA

3.3. MV imputation comparative study


Global-based methods (PLS, SVD, BPCA): Estimate the global structure of the data to impute MV.

Neighbor-based methods (KNN, OLS, LSA, LLS): Borrow information from correlated genes (neighbors).

Intuitively global-based methods require that dimension reduction of the data can be effectively performed.

We define an entropy measure for a given data set D to determine how well the dimension reduction of the data can be done ($\lambda_i$ are the eigenvalues):

$$e(D) = -\frac{\sum_{i=1}^{k} p_i \log p_i}{\log k}, \qquad p_i = \frac{\lambda_i}{\sum_{l=1}^{k} \lambda_l}.$$

Entropy low: the first few eigenvalues dominate and the data can be reduced to low-dimension effectively.
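Given the eigenvalues of a data set, the entropy measure takes a few lines. This sketch assumes the eigenvalues are already computed and non-negative; zero eigenvalues contribute nothing to the sum. The function name is hypothetical.

```python
import math

def entropy(eigenvalues):
    """Normalized entropy of the eigenvalue spectrum, between 0 and 1."""
    k = len(eigenvalues)
    total = sum(eigenvalues)
    p = [lam / total for lam in eigenvalues if lam > 0]    # share of each eigenvalue
    return -sum(pi * math.log(pi) for pi in p) / math.log(k)

even = entropy([1.0, 1.0, 1.0, 1.0])        # all components equally important
dominated = entropy([97.0, 1.0, 1.0, 1.0])  # one dominant component
```

A flat spectrum gives entropy 1, while a spectrum dominated by the first eigenvalue gives a value near 0.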

3.3. MV imputation comparative study


LRMSE is the performance measure, the lower the better.

KNN, OLS, LSA, LLS are neighbor-based methods and work better in high-entropy data sets.

PLS and SVD are global-based methods and work better in low-entropy data sets, where the dimension reduction they rely on is effective.

[Equation and figure: LRMSE of imputation method $M_j$ on data set $D_i$, plotted against the entropy $e(D_i)$.]

3.3. MV imputation comparative study


Data set | Entropy | Simulation II: Optimal | EBS | Accuracy | Simulation III: Optimal | STS | Accuracy
BAL | 0.819 | LSA (38), LLS (12) | LSA (50) | 76% | LSA (9), LLS (1) | LSA (10) | 90%
CAU | 0.838 | LLS (45), LSA (5) | LSA (50) | 10% | LLS (10) | LSA (10) | 0%
ALO | 0.872 | LSA (50) | LSA (50) | 100% | LSA (10) | LSA (10) | 100%
GOL | 0.876 | LSA (50) | LSA (50) | 100% | LSA (10) | LSA (10) | 100%
SP.ELU | 0.909 | LLS (41), BPCA (9) | LSA (50) | 0% | LLS (10) | BPCA (10) | 0%
GAS | 0.911 | LSA (50) | LSA (50) | 100% | LSA (10) | LSA (10) | 100%
SP.AFA | 0.94 | LLS (40), BPCA (10) | LSA (50) | 0% | LLS (9), BPCA (1) | BPCA (10) | 10%
ROS | 0.944 | LSA (50) | LSA (50) | 100% | LSA (10) | LSA (10) | 100%
ALI | 0.947 | LSA (50) | LSA (50) | 100% | LSA (10) | LSA (10) | 100%
Overall | | | | 65% | | | 67%

Three methods (LSA, LLS, BPCA) performed best but none dominated.

Performed two selection schemes (entropy-based scheme and self-training scheme) to select the best imputation method.

3.3. MV imputation comparative study