Quality control and normalization

Wolfgang Huber, European Bioinformatics Institute

marray.economia.unimi.it/2005/material/L4.pdf

Acknowledgements

Anja von Heydebreck (Darmstadt)
Robert Gentleman (Seattle)
Günther Sawitzki (Heidelberg)
Martin Vingron (Berlin)
Annemarie Poustka, Holger Sültmann, Andreas Buness, Markus Ruschhaupt (Heidelberg)
Rafael Irizarry (Baltimore)
Judith Boer (Leiden)
Anke Schroth (Heidelberg)
Friederike Wilmer (Hilden)

Which genes are differentially transcribed?

[Figure: two panels, “same-same” and “tumor-normal”; axis: log-ratio]

Statistics 101:

[Figure: axes labelled ← bias / accuracy → and ← precision / variance →]

Basic dogma of data analysis:

One can always increase sensitivity at the cost of specificity, or vice versa; the art is to find the best trade-off.

(It may also be possible to increase both through a better choice of method / model.)

ratios and fold changes

Fold changes are useful to describe continuous changes in expression:

  condition:  A      B             C
  intensity:  1000   1500 (x1.5)   3000 (x3)

But what if the gene is “off” (below detection limit) in one condition?

  condition:  A      B             C
  intensity:  0      200 (x ?)     3000 (x ?)


ratios and fold changes: the idea of the log-ratio (base 2)

 0: no change
+1: up by a factor of 2^1 = 2
+2: up by a factor of 2^2 = 4
-1: down by a factor of 2^-1 = 1/2
-2: down by a factor of 2^-2 = 1/4

A unit for measuring changes in expression: it assumes that a change from 1000 to 2000 units has a similar biological meaning to one from 5000 to 10000.

What about a change from 0 to 500?
- conceptually
- noise, measurement precision
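The log-ratio scale can be tried out numerically. The following is a minimal Python sketch; the function name `log2_ratio` and the example intensities are our own choices for illustration, not part of the slides. It shows that equal fold changes give equal log-ratios, and that a change starting from 0 has no value on this scale.

```python
import math

def log2_ratio(after, before):
    """Base-2 log-ratio of two positive intensities."""
    if after <= 0 or before <= 0:
        raise ValueError("log-ratio is undefined for non-positive intensities")
    return math.log2(after / before)

print(log2_ratio(2000, 1000))    # a doubling
print(log2_ratio(10000, 5000))   # also a doubling, same log-ratio
print(log2_ratio(250, 1000))     # down by a factor of 4

try:
    log2_ratio(500, 0)           # the gene is "off" in one condition
except ValueError as err:
    print("0 -> 500:", err)
```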


A complex measurement process lies between mRNA concentrations and intensities

o other array manufacturing-related issues

o hybridization efficiency and specificity

o DNA-support binding

o reverse transcription efficiency

o ‘background’ correction

o spotting efficiency

o amplification efficiency

o signal quantification

o PCR yield, contamination

o RNA degradation

o image segmentation

o clone identification and mapping

o tissue contamination

The problem is less that these steps are ‘not perfect’; it is that they vary from array to array, experiment to experiment.

Questions

♦ How to compare microarray intensities with each other?

♦ How to address measurement uncertainty (“variance”)?

♦ How to calibrate (“normalize”) for biases between samples?


Sources of variation

amount of RNA in the biopsy
efficiencies of
- RNA extraction
- reverse transcription
- labeling
- fluorescent detection
probe purity and length distribution
spotting efficiency, spot size
cross-/unspecific hybridization
stray signal

Systematic (→ Calibration)
o similar effect on many measurements
o corrections can be estimated from data

Stochastic (→ Error model)
o too random to be explicitly accounted for
o remain as “noise”

Error models

describe the possible outcomes of a set of measurements

Outcomes depend on:
- true value of the measured quantity (abundances of specific molecules in biological sample)
- measurement apparatus (cascade of biochemical reactions, optical detection system with laser scanner or CCD camera)

Purpose:

1. Data compression: summary statistic instead of full empirical distribution

2. Quality control

3. Statistical inference: appropriate parametric methods have better power than non-parametric (this has practical, financial, and ethical aspects)

The two-component model

measured intensity = offset + gain × true abundance

$y_{ik} = a_{ik} + b_{ik} x_k$

$a_{ik} = a_i + \varepsilon_{ik}$
  $a_i$: per-sample offset
  $\varepsilon_{ik} \sim N(0, b_i^2 s_1^2)$: “additive noise”

$b_{ik} = b_i b_k \exp(\eta_{ik})$
  $b_i$: per-sample normalization factor
  $b_k$: sequence-wise probe efficiency
  $\eta_{ik} \sim N(0, s_2^2)$: “multiplicative noise”
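To get a feeling for the two-component model, one can simulate it. The sketch below is only an illustration: all parameter values (offset 100, gain 1, s1 = 20, s2 = 0.1) are made up, the probe efficiency b_k is folded into a single gain `b`, and the function name `simulate_intensity` is hypothetical. It shows that the standard deviation of y is roughly constant at low abundance and grows roughly proportionally to the abundance at high abundance.

```python
import math
import random
import statistics

def simulate_intensity(x, a=100.0, b=1.0, s1=20.0, s2=0.1, rng=random):
    """Draw one measured intensity y = a + eps + b * x * exp(eta).

    eps ~ N(0, (b*s1)^2) is the additive noise, eta ~ N(0, s2^2) the
    multiplicative noise. All default values are illustrative assumptions.
    """
    eps = rng.gauss(0.0, b * s1)
    eta = rng.gauss(0.0, s2)
    return a + eps + b * x * math.exp(eta)

rng = random.Random(1)
for x in (0.0, 100.0, 1000.0, 10000.0):
    ys = [simulate_intensity(x, rng=rng) for _ in range(5000)]
    print(f"x = {x:7.0f}   mean(y) = {statistics.mean(ys):9.1f}   "
          f"sd(y) = {statistics.stdev(ys):7.1f}")
```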

The two-component model

[Figure: data on the raw scale and on the log scale, annotated with “additive” noise and “multiplicative” noise. B. Durbin, D. Rocke, JCB 2001]

Parameterization

$y = a + \varepsilon + b \cdot x \cdot (1 + \eta)$
$y = a + \varepsilon + b \cdot x \cdot e^{\eta}$

two practically equivalent forms ($\eta \ll 1$)

a   systematic background      same for all probes (per array x color), or per array x color x print-tip group
ε   random background          iid in whole experiment, or iid per array
b   systematic gain factor     per array x color, or per array x color x print-tip group
η   random gain fluctuations   iid in whole experiment, or iid per array

variance stabilizing transformations

$X_u$ a family of random variables with $\mathrm{E}X_u = u$, $\mathrm{Var}X_u = v(u)$. Define

$f(x) = \int^x \frac{du}{\sqrt{v(u)}}$

$\Rightarrow \mathrm{Var} f(X_u) \approx$ independent of $u$

derivation: linear approximation
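The linear approximation can be spelled out as follows. This is the standard delta-method argument, written as a sketch in the notation of this slide; it assumes that f is differentiable and that the fluctuations of X_u around u are small enough for a first-order expansion.

```latex
\begin{align*}
f(X_u) &\approx f(u) + f'(u)\,(X_u - u)
  && \text{(first-order Taylor expansion around } u = \mathrm{E}X_u\text{)} \\
\operatorname{Var} f(X_u) &\approx f'(u)^2 \operatorname{Var} X_u = f'(u)^2\, v(u) \\
f'(u) = \frac{1}{\sqrt{v(u)}} &\;\Longrightarrow\; \operatorname{Var} f(X_u) \approx 1
  \quad \text{for all } u .
\end{align*}
```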

variance stabilizing transformations

[Figure: the transformation f(x); horizontal axis: raw scale (0 to 60000); vertical axis: transformed scale (8.0 to 11.0)]

variance stabilizing transformations

$f(x) = \int^x \frac{du}{\sqrt{v(u)}}$

1.) constant variance (‘additive’): $v(u) = s^2 \Rightarrow f \propto u$

2.) constant CV (‘multiplicative’): $v(u) \propto u^2 \Rightarrow f \propto \log u$

3.) offset: $v(u) \propto (u + u_0)^2 \Rightarrow f \propto \log(u + u_0)$

4.) additive and multiplicative: $v(u) \propto (u + u_0)^2 + s^2 \Rightarrow f \propto \operatorname{arsinh}\frac{u + u_0}{s}$

the “glog” transformation

[Figure: f(x) = log(x) (dashed) and h_s(x) = asinh(x/s) (solid), plotted against intensity from -200 to 1000]

$\operatorname{arsinh}(x) = \log\left(x + \sqrt{x^2 + 1}\right)$

$\lim_{x \to \infty} \left(\operatorname{arsinh} x - \log x - \log 2\right) = 0$

P. Munson, 2001
D. Rocke & B. Durbin, ISMB 2002
W. Huber et al., ISMB 2002
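Both properties of arsinh stated on this slide are easy to check numerically. The snippet below is a small sketch using Python's `math.asinh`; the function name `glog` and the scale value s = 100 in the last line are our own illustrative choices.

```python
import math

def glog(x, s=1.0):
    """h_s(x) = arsinh(x / s); defined for zero and negative x as well."""
    return math.asinh(x / s)

# Identity: arsinh(x) = log(x + sqrt(x^2 + 1))
for x in (-2.0, 0.0, 5.0, 1000.0):
    closed_form = math.log(x + math.sqrt(x * x + 1.0))
    print(f"x = {x:7.1f}   asinh = {glog(x):.6f}   log form = {closed_form:.6f}")

# Limit: arsinh(x) - log(x) - log(2) tends to 0 for large x
for x in (1.0, 10.0, 100.0, 1000.0):
    print(f"x = {x:7.1f}   asinh(x) - log(2x) = {glog(x) - math.log(2.0 * x):.2e}")

# Unlike log, the transformation accepts non-positive intensities
print(glog(0.0), glog(-200.0, s=100.0))
```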

[Figure: three panels, raw scale, log, glog. The corresponding contrasts are the difference, the log-ratio and the generalized log-ratio. Variance annotation: constant part, proportional part]

the transformed model

$\operatorname{arsinh}\left(\frac{Y_{ki} - a_{si}}{b_{si}}\right) = \mu_k + \varepsilon_{ki}, \quad \varepsilon_{ki} \sim N(0, c^2)$

i: arrays
k: probes
s: probe strata (e.g. print-tip, region)

“usual” log-ratio:

$\log \frac{x_1}{x_2}$

'glog' (generalized log-ratio):

$\log \frac{x_1 + \sqrt{x_1^2 + c_1^2}}{x_2 + \sqrt{x_2^2 + c_2^2}}$

c1, c2 are experiment specific parameters (~level of background noise)
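The two contrasts can be compared directly. The following sketch implements both formulas as written on this slide; the values c1 = c2 = 50 and the example intensities are hypothetical, chosen only to show the behaviour at high and at low intensities.

```python
import math

def log_ratio(x1, x2):
    """'Usual' log-ratio; needs x1 > 0 and x2 > 0."""
    return math.log(x1 / x2)

def glog_ratio(x1, x2, c1, c2):
    """Generalized log-ratio: log((x1 + sqrt(x1^2 + c1^2)) / (x2 + sqrt(x2^2 + c2^2)))."""
    numerator = x1 + math.sqrt(x1 ** 2 + c1 ** 2)
    denominator = x2 + math.sqrt(x2 ** 2 + c2 ** 2)
    return math.log(numerator / denominator)

c = 50.0  # hypothetical background noise level
# High intensities: both contrasts agree.
print(log_ratio(20000.0, 10000.0), glog_ratio(20000.0, 10000.0, c, c))
# Near the noise level: the glog-ratio is pulled towards zero.
print(log_ratio(40.0, 20.0), glog_ratio(40.0, 20.0, c, c))
# Gene "off" in one condition: the glog-ratio stays finite.
print(glog_ratio(500.0, 0.0, c, c))
```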

Variance Bias Trade-Off

[Figure: estimated log-fold-change against signal intensity, for log and glog]

Variance-bias trade-off and shrinkage estimators

Shrinkage estimators: pay a small price in bias for a large decrease in variance, so that overall the mean squared error (MSE) is reduced.

Particularly useful if you have few replicates.

Generalized log-ratio: a shrinkage estimator for fold change

There are many possible choices; we chose “variance stabilization”:
+ interpretable even in cases where genes are off in some conditions
+ can subsequently use standard statistical methods (hypothesis testing, ANOVA, clustering, classification…) without the worries about low-level variability that are often warranted on the log scale
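The bias-for-variance exchange can be illustrated with the simplest shrinkage estimator, a sample mean multiplied by a factor lam < 1. This is a generic textbook example, not the generalized log-ratio itself; the true effect 0.5, the noise level 1.0 and lam = 0.7 are made-up values. The MSE is computed analytically as bias squared plus variance.

```python
def mse_of_shrunken_mean(mu, sigma, n, lam):
    """Mean squared error of lam * (mean of n replicates) as an estimator of mu.

    bias = (lam - 1) * mu, variance = lam^2 * sigma^2 / n.
    """
    bias = (lam - 1.0) * mu
    variance = lam ** 2 * sigma ** 2 / n
    return bias ** 2 + variance

mu, sigma = 0.5, 1.0   # hypothetical true log-fold-change and noise level
for n in (2, 3, 50):
    plain = mse_of_shrunken_mean(mu, sigma, n, lam=1.0)
    shrunk = mse_of_shrunken_mean(mu, sigma, n, lam=0.7)
    print(f"n = {n:2d}   MSE unshrunk = {plain:.4f}   MSE shrunk = {shrunk:.4f}")
```

With few replicates the shrunken estimator has the smaller MSE; with many replicates the variance is already small and the bias dominates.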

evaluation: effects of different data transformations

[Figure: difference red-green plotted against rank(average)]

Normality: QQ-plot

“Single color normalization”

n red-green arrays (R1, G1, R2, G2, …, Rn, Gn)

within/between slides:
  for (i = 1:n)
    calculate Mi = log(Ri/Gi), Ai = ½ log(Ri*Gi)
    normalize Mi vs Ai
  normalize M1…Mn

all at once:
  normalize the matrix of (R, G),
  then calculate log-ratios or any other contrast you like
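The M and A values of the within-slide approach can be computed as in the sketch below. Two assumptions are ours: the logarithm is taken to base 2, and "normalize" is replaced by a crude stand-in (subtracting the median of M), which is much simpler than an intensity-dependent normalization of M against A. The intensities are made up.

```python
import math
from statistics import median

def ma_values(red, green):
    """Return (M, A) per spot: M = log2(R/G), A = 1/2 * log2(R*G)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

def center_m(m):
    """Stand-in for 'normalize M': subtract the median log-ratio."""
    mid = median(m)
    return [value - mid for value in m]

red = [1200.0, 800.0, 4000.0, 150.0, 640.0]
green = [600.0, 400.0, 2000.0, 75.0, 160.0]
m, a = ma_values(red, green)
print("M:", m)
print("A:", [round(value, 2) for value in a])
print("centered M:", center_m(m))
```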

What about non-linear effects?

o Microarrays can be operated in a linear regime, where fluorescence intensity increases proportionally to target abundance (see e.g. Affymetrix dilution series)

Two reasons for non-linearity:

o At the high intensity end: saturation/quenching. This can and should be avoided experimentally - loss of data!

o At the low intensity end: background offsets. Instead of y = k·x we have y = k·x + x0, and in the log-log plot this can look curvilinear. But this is an affine-linear effect and can be corrected by affine normalization. Non-parametric methods (e.g. loess) risk overfitting and loss of power.
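That a pure offset looks curvilinear in the log-log plot can be verified with a few lines of arithmetic. In the sketch below, the gain k = 2 and the offset x0 = 100 are arbitrary illustrative values; the function computes the local slope of log(y) against log(x), which would be exactly 1 for a proportional response.

```python
import math

def local_loglog_slope(x, k=2.0, x0=100.0, step=1.01):
    """Slope of log(y) against log(x) near x, for the affine response y = k*x + x0."""
    y_low = k * x + x0
    y_high = k * (x * step) + x0
    return (math.log(y_high) - math.log(y_low)) / math.log(step)

for x in (1.0, 10.0, 100.0, 1000.0, 100000.0):
    print(f"x = {x:8.0f}   log-log slope = {local_loglog_slope(x):.3f}")

# Without the offset the response is proportional and the slope is 1 everywhere.
print(local_loglog_slope(5.0, x0=0.0))
```

The slope is close to 0 where the offset dominates and approaches 1 at high intensity, although the response is a straight line on the raw scale throughout.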

Non-linear or affine linear?

Definitions

[Figure panels: linear; affine linear; genuinely non-linear]

How to compare and assess different ‘normalization’ methods?

Normalization :=
1. correction for systematic experimental biases
2. provision of expression values that can subsequently be used for testing, clustering, classification, modelling…
3. provision of a measure of measurement uncertainty

Quality trade-off: the better the measurements, the less need for normalization. Need for “too much” normalization relates to a quality problem.

Variance-bias trade-off: how do you weigh measurements that have a low signal-to-noise ratio?
- just use anyway
- ignore
- shrink

Aesthetic criteria
Logarithm is more beautiful than arsinh.

Practical criteria
It takes forever to run method XX. Referees will only accept my paper if it uses the original MAS5.

Silly criteria
The best method is the one that makes all my scatterplots look like straight, slim cigars.

Physical criteria
Normalization calculations should be based on a physical/chemical model.

Economical/political criteria
Life would be so much easier if everybody were just using the same method, who cares which one.

Comparison against a ground truth
But you have millions of numbers, so you need to choose the metric that measures deviation from truth.
FN/FP: do you find all the differentially expressed genes, and do you not find non-d.e. genes?
qualitative/quantitative: how well do you estimate abundance, fold-change?

Spike-in and dilution series
… great, but how representative are they of other data?

Implicitly, from resampling / cross-validating with the actual experiment of interest
… but isn’t that too much like Münchhausen’s bootstrap?

evaluation: a benchmark for Affymetrix GeneChip expression measures

o Data:
  Spike-in series: from Affymetrix, 59 x HGU95A, 16 genes, 14 concentrations, complex background
  Dilution series: from GeneLogic, 60 x HGU95Av2, liver & CNS cRNA in different proportions and amounts

o Benchmark:
  15 quality measures regarding
  - reproducibility
  - sensitivity
  - specificity
  Put together by Rafael Irizarry (Johns Hopkins)
  http://affycomp.biostat.jhsph.edu

Quality assessment and control: an overview of some diagnostic plots and common artifacts

Scatterplot, colored by PCR plate
Two RZPD Unigene II filters (cDNA nylon membranes)

PCR plates

PCR plates: boxplots

array batches

print-tip effects

[Figure: empirical distribution functions F(q) of the log-ratio q = log(fg.green/fg.red), one curve per spotting pin (1:1 to 4:4); array 41 (a42-u07639vene.txt)]

spotting pin quality decline

[Figure panels: after delivery of 3x10^5 spots; after delivery of 5x10^5 spots. H. Sueltmann, DKFZ/MGA]

spatial effects

[Figure: spotted cDNA arrays, Stanford-type; panels R, Rb and R-Rb, color scale by rank. Another array: print-tip; color scale ~ log(G) and color scale ~ rank(G)]

Batches: array-to-array differences $d_{ij} = \operatorname{mad}_k(h_{ik} - h_{jk})$

arrays i = 1…63; roughly sorted by time

[Figure: matrix of d_ij for all pairs of hybridizations (both axes 1:nrhyb); color scale from 0.6 to 1.8]
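The distance d_ij can be computed as in the following sketch. It assumes that mad denotes the median absolute deviation from the median, used here without any scaling constant; the toy matrix `h` (three arrays, five probes) is made up. Blocks of small distances among arrays processed together would then point to batches.

```python
from statistics import median

def mad(values):
    """Median absolute deviation from the median (unscaled; an assumption)."""
    mid = median(values)
    return median(abs(v - mid) for v in values)

def array_distances(h):
    """h[i][k]: transformed intensity of probe k on array i.

    Returns the matrix d[i][j] = mad over probes k of (h[i][k] - h[j][k]).
    """
    n = len(h)
    return [[mad([hi - hj for hi, hj in zip(h[i], h[j])]) for j in range(n)]
            for i in range(n)]

h = [
    [8.0, 9.0, 10.0, 11.0, 12.0],
    [8.1, 9.1, 10.1, 11.1, 12.1],   # array 0 plus a constant shift
    [8.0, 9.5, 9.0, 12.0, 12.5],    # a noisier array
]
d = array_distances(h)
for row in d:
    print([round(value, 3) for value in row])
```

Note that a constant shift between two arrays gives a distance of zero: this measure reflects the spread of the probe-wise differences, not their overall offset.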

Density representation of the scatterplot
(76,000 clones, RZPD Unigene-II filters)

See: package hexbin; also, smoothscatter in package prada

Oligonucleotide chips

Affymetrix files

Main software from Affymetrix: MAS - MicroArray Suite.

DAT file: Image file, ~10^8 pixels, ~200 MB.
CEL file: probe intensities, ~10^6 numbers.
CDF file: Chip Description File. Describes which probes go in which probe sets (genes, gene fragments, ESTs).
1LQ file: Probe sequences and intended targets in the transcriptome.

Image analysis

DAT image files → CEL files
Each probe cell: 10x10 pixels.
Gridding: estimate location of probe cell centers.
Signal:
- Remove outer 36 pixels → 8x8 pixels.
- The probe cell signal, PM or MM, is the 75th percentile of the 8x8 pixel values.
Background: Average of the lowest 2% probe cells is taken as the background value and subtracted.
Compute also quality values.
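The signal step can be sketched in a few lines. The example below is an illustration only: the function name `probe_cell_signal` is hypothetical, and the interpolation rule used for the 75th percentile (Python's `statistics.quantiles` with `method="inclusive"`) is our assumption, since the slide does not specify one. The pixel values are synthetic.

```python
from statistics import quantiles

def probe_cell_signal(cell):
    """cell: 10x10 list of pixel values.

    Drops the outer ring of 36 pixels and returns the 75th percentile
    of the remaining 8x8 pixel values.
    """
    inner = [value for row in cell[1:-1] for value in row[1:-1]]
    # quantiles(..., n=4) returns the three quartile cut points;
    # index 2 is the 75th percentile.
    return quantiles(inner, n=4, method="inclusive")[2]

# Synthetic cell: bright border pixels (9999) around inner values 0..63.
cell = [[9999.0] * 10 for _ in range(10)]
counter = 0
for r in range(1, 9):
    for c in range(1, 9):
        cell[r][c] = float(counter)
        counter += 1

print(probe_cell_signal(cell))   # unaffected by the border pixels
```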

Data and notation

PM_ijg, MM_ijg = intensities for perfect match and mismatch probe j for gene g in chip i
i = 1,…, n   one to hundreds of chips
j = 1,…, J   usually 11 or 16 probe pairs
g = 1,…, G   6…30,000 probe sets

Tasks:
- calibrate (normalize) the measurements from different chips (samples)
- summarize for each probe set the probe level data, i.e., 11 PM and MM pairs, into a single expression measure
- compare between chips (samples) for detecting differential expression

expression measures: MAS 4.0

Affymetrix GeneChip MAS 4.0 software uses AvDiff, a trimmed mean:

o sort $d_j = PM_j - MM_j$
o exclude highest and lowest value
o $J$ := those pairs within 3 standard deviations of the average

$\mathrm{AvDiff} = \frac{1}{\#J} \sum_{j \in J} (PM_j - MM_j)$
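The three steps can be turned into code as follows. This is one possible reading of the slide, not a reimplementation of the MAS 4.0 software: we assume that the average and standard deviation are computed after excluding the highest and lowest difference, and that J is then selected from all pairs. The probe intensities are invented, with one outlying pair.

```python
from statistics import mean, stdev

def avdiff(pm, mm):
    """Trimmed mean of PM - MM differences (one reading of the AvDiff recipe)."""
    d = sorted(p - m for p, m in zip(pm, mm))
    trimmed = d[1:-1]                      # exclude highest and lowest value
    center, spread = mean(trimmed), stdev(trimmed)
    # J: pairs whose difference lies within 3 standard deviations of the average
    kept = [value for value in d if abs(value - center) <= 3.0 * spread]
    return mean(kept)

pm = [150.0, 160.0, 140.0, 155.0, 145.0, 150.0, 2050.0, 148.0]
mm = [50.0] * 8
print(avdiff(pm, mm))                                  # outlier pair excluded
print(mean(p - m for p, m in zip(pm, mm)))             # plain mean, for comparison
```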

Expression measures MAS 5.0

Instead of MM, use "repaired" version CT:
CT = MM if MM < PM
   = PM / "typical log-ratio" if MM >= PM

"Signal" = Tukey.Biweight(log(PM - CT))   (… ≈ median)

Tukey Biweight: B(x) = (1 - (x/c)^2)^2 if |x| < c, 0 otherwise
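A one-step Tukey biweight location estimate built on the weight function B can be sketched as below. Several details are assumptions on our part rather than statements from the slide: deviations are measured from the median and scaled by the median absolute deviation, the tuning constant is c = 5, and a small `eps` guards against division by zero.

```python
from statistics import median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight estimate of location.

    Each value gets the weight B(u) = (1 - u^2)^2 for |u| < 1 and 0 otherwise,
    where u is its distance from the median in units of c * MAD.
    """
    center = median(values)
    scale = median(abs(v - center) for v in values)   # median absolute deviation
    total_weight = 0.0
    weighted_sum = 0.0
    for v in values:
        u = (v - center) / (c * scale + eps)
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        total_weight += w
        weighted_sum += w * v
    return weighted_sum / total_weight

data = [1.0, 2.0, 3.0, 4.0, 100.0]     # one gross outlier
print(tukey_biweight(data))            # close to the median, outlier gets weight 0
print(sum(data) / len(data))           # the plain mean is dominated by the outlier
```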

Expression measures: Li & Wong

dChip fits a model for each gene:

$PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma^2)$

where
- $\theta_i$: expression index for the gene in chip i
- $\phi_j$: probe sensitivity

The maximum likelihood estimate of the MBEI (model-based expression index) is used as expression measure of the gene in chip i.

Need at least 10 or 20 chips.

Current version works with PMs only.

Expression measures RMA: Irizarry et al. (2002)

o Estimate one global background value b = mode(MM). No probe-specific background!

o Assume: PM = s_true + b. Estimate s ≥ 0 from PM and b as a conditional expectation E[s_true | PM, b].

o Use log2(s).

o Nonparametric nonlinear calibration ('quantile normalization') across a set of chips.
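The quantile normalization step can be illustrated with a small sketch. It is a plain version of the idea (every chip is given the same empirical distribution, namely the average of the sorted chips) and makes simplifying assumptions: all chips have the same number of probes, and ties are broken arbitrarily by sort order. The numbers are invented.

```python
def quantile_normalize(chips):
    """chips: list of equally long lists of intensities, one list per chip.

    Each value is replaced by the mean, across chips, of the values
    that share its rank.
    """
    n_probes = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    reference = [sum(sc[rank] for sc in sorted_chips) / len(chips)
                 for rank in range(n_probes)]
    normalized = []
    for chip in chips:
        order = sorted(range(n_probes), key=lambda idx: chip[idx])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

chips = [[5.0, 2.0, 3.0], [40.0, 10.0, 30.0]]
for chip in quantile_normalize(chips):
    print(chip)
```

After normalization all chips contain the same set of values; only the assignment of values to probes differs between chips.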

Expression measures RMA: Irizarry et al. (2002)

AvDiff-like:

$\mathrm{RMA} = \frac{1}{|A|} \sum_{j \in A} \log_2(PM_j - BG_j)$

with A a set of “suitable” pairs.

Li-Wong-like: additive model

$\log_2(PM_{ij} - BG) = a_i + b_j + \varepsilon_{ij}$

Estimate RMA = a_i for chip i using the robust method median polish (successively remove row and column medians, accumulate terms, until convergence). Works with d>=2
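Median polish, as described in parentheses above, can be sketched as follows. The layout is our choice: rows are chips i and columns are probes j, so that the row effects play the role of a_i. The iteration limit and tolerance are arbitrary, and the test matrix is an exactly additive toy example rather than real data.

```python
from statistics import median

def median_polish(table, max_iter=10, tol=1e-8):
    """Fit table[i][j] ~ overall + row_eff[i] + col_eff[j] by median polish."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [row[:] for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Remove row medians and accumulate them in the row effects.
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [value - row_med[i] for value in resid[i]]
        shift = median(col_eff)
        overall += shift
        col_eff = [value - shift for value in col_eff]
        # Remove column medians and accumulate them in the column effects.
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)
        overall += shift
        row_eff = [value - shift for value in row_eff]
        if max(abs(value) for value in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid

chip_level = [8.0, 10.0, 9.0]             # toy a_i
probe_effect = [-1.0, 0.0, 0.5, 1.0]      # toy b_j
table = [[a + b for b in probe_effect] for a in chip_level]
overall, row_eff, col_eff, resid = median_polish(table)
expression = [overall + effect for effect in row_eff]
print(expression)   # chip-level summaries; differences between chips match a_i
```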

Affymetrix: I_PM = I_MM + I_specific ?

[Figure: distribution of log(PM/MM), with 0 marked. From: R. Irizarry et al., Biostatistics 2002]

Sequence-dependent preprocessing

$\log Y = \log x + \sum_{i=1}^{25} w_i(s_i) + \varepsilon$

position- and sequence-specific effects $w_i(s)$: Naef et al., Phys Rev E 68 (2003)

Software for pre-processing Affymetrix data

• Bioconductor R package affy.
• Background estimation.
• Probe-level normalization.
• Expression measures.
• Two main functions: ReadAffy, expresso.
• See also: gcrma, tilingArray, vsn.

References

Bioinformatics and Computational Biology Solutions Using R and Bioconductor. R. Gentleman, V. Carey, W. Huber, R. Irizarry, S. Dudoit. Springer (2005).

Variance stabilization applied to microarray data calibration and to the quantification of differential expression. W. Huber, A. von Heydebreck, H. Sültmann, A. Poustka, M. Vingron. Bioinformatics 18 suppl. 1 (2002), S96-S104.

Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. R. Irizarry, B. Hobbs, F. Collins, …, T. Speed. Biostatistics 4 (2003) 249-264.

Error models for microarray intensities. W. Huber, A. von Heydebreck, M. Vingron. Encyclopedia of Genomics, Proteomics and Bioinformatics. John Wiley & Sons (2005).

Differential Expression with the Bioconductor Project. A. von Heydebreck, W. Huber, R. Gentleman. Encyclopedia of Genomics, Proteomics and Bioinformatics. John Wiley & Sons (2005).