Exploratory Data Analysis of High Density Oligonucleotide Array

Rafael A. Irizarry, Bridget Hobbs, Terry Speed
http://biosun01.biostat.jhsph.edu/~ririzarr/Raffy


Page 1: Exploratory Data Analysis of High Density Oligonucleotide Array

Exploratory Data Analysis of High Density Oligonucleotide Array

Rafael A. Irizarry, Bridget Hobbs, Terry Speed

http://biosun01.biostat.jhsph.edu/~ririzarr/Raffy

Page 2: Exploratory Data Analysis of High Density Oligonucleotide Array

Outline

• Review of technology
• Form of Data
• Description of Data
• Normalization
• Future/current work: Defining expression

Page 3: Exploratory Data Analysis of High Density Oligonucleotide Array

Probe Arrays

GeneChip Probe Array
1.28 cm
Hybridized Probe Cell
24 µm
Millions of copies of a specific oligonucleotide probe
>200,000 different complementary probes
Single stranded, labeled RNA target
Oligonucleotide probe
Image of Hybridized Probe Array

Compliments of D. Gerhold

Page 4: Exploratory Data Analysis of High Density Oligonucleotide Array

Image analysis

• About 100 pixels per probe cell

• These intensities are combined to form one number representing expression for the probe cell oligo

• What about genes?

Page 5: Exploratory Data Analysis of High Density Oligonucleotide Array

PM MM

Page 6: Exploratory Data Analysis of High Density Oligonucleotide Array

Data and notation

PMijn, MMijn = Intensity for perfect-match / mismatch probe cell j, in chip i, in gene n

i = 1,…, I (ranging from 1 to hundreds)
j = 1,…, J (usually 16 or 20)
n = 1,…, N (between 8,000 and 12,000)

Page 7: Exploratory Data Analysis of High Density Oligonucleotide Array

$64K Question

• How do we define expression? or

• What is the one number summary of the 20 PMs and 20 MMs that best quantifies expression?

• How about differential expression?

Page 8: Exploratory Data Analysis of High Density Oligonucleotide Array

Current default

• GeneChip® software uses Avg.diff

$$\text{Avg.diff} = \frac{1}{|A|} \sum_{j \in A} (PM_j - MM_j)$$

with A a set of “suitable” pairs chosen by software.
• Log ratio version is also used.
• For differential expression Avg.diffs are compared between chips.
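A minimal sketch of the Avg.diff computation, assuming equal weights over the pairs in A. The slide does not say how the software picks the "suitable" pairs, so the `suitable` argument and the function name `avg_diff` are illustrative, and the intensities are made up.

```python
def avg_diff(pm, mm, suitable=None):
    """Average of PM - MM over the probe pairs in the set A.

    pm, mm   : equal-length sequences of intensities for one probe set
    suitable : indices of the pairs in A; None means every pair is used
               (an assumption: the vendor's selection rule is not given)
    """
    if len(pm) != len(mm):
        raise ValueError("pm and mm must have the same length")
    if suitable is None:
        suitable = range(len(pm))
    suitable = list(suitable)
    if not suitable:
        raise ValueError("A must contain at least one probe pair")
    return sum(pm[j] - mm[j] for j in suitable) / len(suitable)


# Made-up intensities for a probe set with four pairs.
pm = [120.0, 300.0, 80.0, 1000.0]
mm = [100.0, 150.0, 90.0, 200.0]

all_pairs = avg_diff(pm, mm)                         # every pair in A
without_last = avg_diff(pm, mm, suitable=[0, 1, 2])  # extreme pair left out
```

In this toy probe set a single strongly hybridizing pair moves the summary from about 53 to 240, which is the kind of sensitivity the later slides question.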

Page 9: Exploratory Data Analysis of High Density Oligonucleotide Array

What is the evidence? Lockhart et al., Nature Biotechnology 14 (1996)

Page 10: Exploratory Data Analysis of High Density Oligonucleotide Array

• Chips used in Lockhart et al. contained around 1000 probes per gene
• Current chips contain 20 probes per gene
• These are different situations
• We haven’t seen a plot like the previous one, for current chips

Page 11: Exploratory Data Analysis of High Density Oligonucleotide Array

Possible problems

What if
• a small number of the probe pairs hybridize much better than the rest?
• removing the middle base does not make a difference for some probes?
• some MM are PM for some other gene?
• there is a need for normalization?

We explore these possibilities using data from 3 experiments

Page 12: Exploratory Data Analysis of High Density Oligonucleotide Array

Experiment 1

• 8 Rats, under 4 experimental conditions
  – Control NV21
  – Ventilation V21
  – Oxygen NV100
  – Oxygen and Ventilation V100
• 2 rats in each condition
• RNA is pooled and divided to form 2 technical replicates for each condition

Page 13: Exploratory Data Analysis of High Density Oligonucleotide Array

Notice

• Experimental condition is confounded with couples: we can’t distinguish between biological variability and variability due to experimental condition

• NV21, V21 and NV100,V100 processed in different scanners/fluidic stations: Oxygen effect confounded with scanner/fluidic station effect

Page 14: Exploratory Data Analysis of High Density Oligonucleotide Array

Experiment 2

• 6 Rats, under 3 experimental conditions
  – Control
  – ENOS
  – NNOS
• 2 rats in each condition
• RNA is pooled and divided to form 2 technical replicates

Page 15: Exploratory Data Analysis of High Density Oligonucleotide Array

Notice

• One of the chips for NNOS did not “work”
• Biological variability confounded with variability due to experimental condition
• About 1/5 of the probes on the chips used were defective.

Page 16: Exploratory Data Analysis of High Density Oligonucleotide Array

Experiment 3

• Five mice with different characteristics:
  – 4 week old female NOD (J4FD, R4FD)
  – 4 week old female NOD (J4FD)
  – 4 week old male NOD (J4MD)
  – 4 week old female homozygous transgenic mouse which can't get diabetes (R4FN)

Page 17: Exploratory Data Analysis of High Density Oligonucleotide Array

Notice

• Each of the 5 chips was scanned twice
• Two separate stains are used
• This gives us 10 sets of results

Page 18: Exploratory Data Analysis of High Density Oligonucleotide Array

Properties of Data that make defining expression hard

• There can be saturation
• log2(PM / MM) and PM-MM are noisy
• MM >> PM for many probes
• PMs of the same probe vary about 5 times less from chip to chip than from probe to probe within the same probe set.
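The per-pair quantities named in this list can be computed directly. This is a small sketch on made-up intensities; the function name `pair_summaries` is illustrative and not taken from the slides.

```python
import math


def pair_summaries(pm, mm):
    """Return (log2 ratios, differences, fraction of pairs with MM > PM)."""
    log_ratios = [math.log2(p / m) for p, m in zip(pm, mm)]
    diffs = [p - m for p, m in zip(pm, mm)]
    frac_mm_larger = sum(1 for p, m in zip(pm, mm) if m > p) / len(pm)
    return log_ratios, diffs, frac_mm_larger


# Made-up intensities for four probe pairs of one probe set.
pm = [400.0, 64.0, 50.0, 1024.0]
mm = [100.0, 128.0, 200.0, 256.0]

log_ratios, diffs, frac_mm_larger = pair_summaries(pm, mm)
```

Pairs with MM larger than PM give negative values on both scales, so a summary that averages them can be pulled below zero even when the gene is present.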

Page 19: Exploratory Data Analysis of High Density Oligonucleotide Array

Saturation problem

Probes reaching maximum in experiment 1

Scanner  Chip     PM     MM     Value
2        NV21a    354    25     46140
2        NV21b    564    57     46144
2        V21a     1004   83     46141
2        V21b     665    51     46139
1        NV100a   1917   328    46154
1        NV100b   1265   168    46160
1        V100a    3399   1085   46155
1        V100b    2267   446    46149
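A sketch of how such counts could be tabulated for one chip, assuming the PM and MM columns give the number of probes whose intensity equals the largest value seen on that chip. That reading of the table, the function name `count_at_maximum`, and the intensities are assumptions made for illustration.

```python
def count_at_maximum(pm, mm):
    """Return (n_pm, n_mm, top): how many PM and MM intensities equal
    the largest intensity found on the chip, and that largest value."""
    top = max(max(pm), max(mm))
    n_pm = sum(1 for v in pm if v == top)
    n_mm = sum(1 for v in mm if v == top)
    return n_pm, n_mm, top


# Made-up intensities; 46140 stands in for a scanner ceiling.
pm = [46140, 1200, 46140, 300, 46140]
mm = [800, 46140, 950, 120, 40]

n_pm, n_mm, top = count_at_maximum(pm, mm)
```

A single probe at the maximum is unremarkable; many probes sharing exactly the same top value is what points to a ceiling.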

Page 20: Exploratory Data Analysis of High Density Oligonucleotide Array

log2(PM/MM) for defective and normal probe sets in a chip from experiment 2

Page 21: Exploratory Data Analysis of High Density Oligonucleotide Array

The Good News

Section of Data            Bad Probes
All Data                   20%
Top 1% of PM / MM          11%
Top 1% of PM               14%
Bottom 1% of PM / MM       29%
Top 0.1% of PM / MM        7%
Top 0.1% of PM             10%
Bottom 0.1% of PM / MM     29%

Page 22: Exploratory Data Analysis of High Density Oligonucleotide Array

Histograms of log2(PM/MM) stratified by log2(PM×MM)/2 for one of the chips in experiment 1
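A sketch of the stratification behind these histograms: each probe pair is assigned to a stratum by its average log intensity log2(PM×MM)/2, and the log ratios are collected per stratum. The stratum width and the name `stratify_log_ratios` are illustrative choices; the slides do not give the cut points that were used.

```python
import math
from collections import defaultdict


def stratify_log_ratios(pm, mm, width=4.0):
    """Group log2(PM/MM) by stratum of average log intensity.

    A pair falls in stratum floor(A / width), where
    A = log2(PM * MM) / 2 = (log2(PM) + log2(MM)) / 2.
    """
    strata = defaultdict(list)
    for p, m in zip(pm, mm):
        a = (math.log2(p) + math.log2(m)) / 2
        strata[math.floor(a / width)].append(math.log2(p / m))
    return dict(strata)


# Made-up intensities: two dim pairs and two bright pairs.
pm = [16.0, 8.0, 1024.0, 2048.0]
mm = [4.0, 8.0, 256.0, 4096.0]

strata = stratify_log_ratios(pm, mm)
```

Each list in `strata` is the input to one histogram panel.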

Page 23: Exploratory Data Analysis of High Density Oligonucleotide Array

Histograms of log2(PM/MM) stratified by log2(PM×MM)/2 for a chip in experiment 2, for defective and normal probes

Page 24: Exploratory Data Analysis of High Density Oligonucleotide Array

Histograms of log2(PM/MM) stratified by log2(PM×MM)/2 for one of the chips in experiment 3

Page 25: Exploratory Data Analysis of High Density Oligonucleotide Array

ANOVA

Page 26: Exploratory Data Analysis of High Density Oligonucleotide Array

Normalization

• There are many sources of experimental variation:
  – During preparation: e.g. mRNA extraction, introduction of labeling
  – During manufacture of array: e.g. amount of oligos on cells
  – During hybridization: e.g. amount of sample applied, amount of target hybridized
  – After hybridization: e.g. optical measurements, label intensity, scanner
• Proper normalization is needed before intensities from different chips are compared

Page 27: Exploratory Data Analysis of High Density Oligonucleotide Array
Page 28: Exploratory Data Analysis of High Density Oligonucleotide Array
Page 29: Exploratory Data Analysis of High Density Oligonucleotide Array
Page 30: Exploratory Data Analysis of High Density Oligonucleotide Array
Page 31: Exploratory Data Analysis of High Density Oligonucleotide Array

Log ratio vs. average log intensity (MVA) plots of PM,MM
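For reference, a sketch of the two coordinates of an MVA plot for a pair of chips: M is the log ratio and A is the average log intensity of the same probe on the two chips. The function name `mva` and the intensities are illustrative.

```python
import math


def mva(x, y):
    """Return (M, A) for two equal-length lists of positive intensities.

    M = log2(x) - log2(y)        (log ratio, vertical axis)
    A = (log2(x) + log2(y)) / 2  (average log intensity, horizontal axis)
    """
    m_vals, a_vals = [], []
    for xi, yi in zip(x, y):
        lx, ly = math.log2(xi), math.log2(yi)
        m_vals.append(lx - ly)
        a_vals.append((lx + ly) / 2)
    return m_vals, a_vals


# Made-up intensities for the same three probes on two chips.
chip_1 = [256.0, 1024.0, 64.0]
chip_2 = [128.0, 1024.0, 256.0]

m_vals, a_vals = mva(chip_1, chip_2)
```

If two chips differed only by noise, the points would scatter around M = 0 at every value of A; a trend in M against A is what normalization tries to remove.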

Page 32: Exploratory Data Analysis of High Density Oligonucleotide Array

Log ratio vs. avg log intensity (MVA) plots for PM / MM

Page 33: Exploratory Data Analysis of High Density Oligonucleotide Array

Normalization

• Pair-wise normalization?
• Which chips do we compare?
• The following three plots show the 3 pairwise comparisons of chips Control A, ENOB, and NNOA

Page 34: Exploratory Data Analysis of High Density Oligonucleotide Array

Normalization based on combined PMs and MMs

Page 35: Exploratory Data Analysis of High Density Oligonucleotide Array

Cyclic algorithm (version 0.1)

• For chip j, with entries Xj, define the functions f1,…,fj-1,fj+1,…,fJ to be the results of smoothing the scatter plots {Xj-Xk , (Xj+Xk)/2}, k ≠ j
• Define the normalized chip as Xj’ = Xj - (f1+…+fj-1+fj+1+…+fJ)/J
• Chips X1,…,XJ are normalized in the same way
• We iterate until Xi’, Xi are very similar for all i.
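A minimal sketch of the cyclic algorithm on log2 intensities. The slides do not say which scatter plot smoother was used, so a simple running mean stands in for it here; `running_mean_smooth`, `cyclic_normalize`, `window`, `tol`, and `max_iter` are illustrative names and settings, and the division by J follows the formula on the slide.

```python
def running_mean_smooth(a, m, window=5):
    """Smooth m against a: each fitted value is the mean of m over the
    `window` points nearest in rank of a (fewer at the edges).
    A stand-in for whatever smoother is preferred."""
    order = sorted(range(len(a)), key=lambda i: a[i])
    half = window // 2
    fitted = [0.0] * len(a)
    for rank, i in enumerate(order):
        lo = max(0, rank - half)
        hi = min(len(order), rank + half + 1)
        fitted[i] = sum(m[order[r]] for r in range(lo, hi)) / (hi - lo)
    return fitted


def cyclic_normalize(chips, window=5, tol=1e-6, max_iter=50):
    """chips: list of J equal-length lists of log2 intensities."""
    chips = [list(c) for c in chips]
    n_chips, n = len(chips), len(chips[0])
    for _ in range(max_iter):
        updated = []
        for j in range(n_chips):
            correction = [0.0] * n
            for k in range(n_chips):
                if k == j:
                    continue
                # Scatter plot for the pair (j, k): difference vs. average.
                m = [chips[j][g] - chips[k][g] for g in range(n)]
                a = [(chips[j][g] + chips[k][g]) / 2 for g in range(n)]
                f_k = running_mean_smooth(a, m, window)
                correction = [c + v for c, v in zip(correction, f_k)]
            # Xj' = Xj - (sum of f_k over k != j) / J
            updated.append([chips[j][g] - correction[g] / n_chips
                            for g in range(n)])
        change = max(abs(u - c)
                     for new, old in zip(updated, chips)
                     for u, c in zip(new, old))
        chips = updated          # all chips are updated together
        if change < tol:         # stop once Xi' and Xi are very similar
            break
    return chips


# Toy data: three chips that differ only by a constant shift.
base = [float(g) for g in range(1, 21)]
shifted = [base, [v + 1.0 for v in base], [v - 1.0 for v in base]]
normalized = cyclic_normalize(shifted)
```

On this toy input the three chips coincide after one pass, because every pairwise curve is flat. With real data the curves depend on intensity, and several passes may be needed.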

Page 36: Exploratory Data Analysis of High Density Oligonucleotide Array

Before and after normalization

Page 37: Exploratory Data Analysis of High Density Oligonucleotide Array

Experiment 1

Page 38: Exploratory Data Analysis of High Density Oligonucleotide Array

Experiment 2

Page 39: Exploratory Data Analysis of High Density Oligonucleotide Array

Experiment 2: Combined PM and MM

Page 40: Exploratory Data Analysis of High Density Oligonucleotide Array

Experiment 2: PM / MM

Page 41: Exploratory Data Analysis of High Density Oligonucleotide Array

Experiment 2: PM – MM in a hybrid log scale

Page 42: Exploratory Data Analysis of High Density Oligonucleotide Array

Experiment 3: Combined PM and MM

Page 43: Exploratory Data Analysis of High Density Oligonucleotide Array

Competing definitions of expression

• Li and Wong fit a model

$$PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}, \quad \epsilon_{ij} \sim N(0, \sigma^2)$$

Consider $\theta_i$ the expression in chip i
• Efron et al. consider log PM – 0.5 log MM
• Another is second largest PM
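A rough sketch of the three competing summaries for one probe set. Several details are assumptions, not taken from the slides: the log base and the averaging over probes in `efron_score`, the plain alternating least-squares fit in `fit_li_wong` (the published procedure also handles outliers), and the scaling that fixes the sum of squared phi at J. All function names are illustrative.

```python
import math


def efron_score(pm, mm):
    """Mean over probes of log(PM) - 0.5 * log(MM).
    Natural log and averaging are assumptions made for this sketch."""
    return sum(math.log(p) - 0.5 * math.log(m)
               for p, m in zip(pm, mm)) / len(pm)


def second_largest_pm(pm):
    """Second largest PM intensity in the probe set."""
    return sorted(pm)[-2]


def fit_li_wong(diffs, n_iter=50):
    """Least-squares fit of diffs[i][j] ~ theta[i] * phi[j] by alternating
    updates, with phi rescaled so that sum(phi_j ** 2) == J.
    diffs[i][j] holds PM - MM for chip i and probe j."""
    n_chips, n_probes = len(diffs), len(diffs[0])
    phi = [1.0] * n_probes

    def update_theta(phi):
        ss = sum(p * p for p in phi)
        return [sum(diffs[i][j] * phi[j] for j in range(n_probes)) / ss
                for i in range(n_chips)]

    for _ in range(n_iter):
        theta = update_theta(phi)
        st = sum(t * t for t in theta)
        phi = [sum(diffs[i][j] * theta[i] for i in range(n_chips)) / st
               for j in range(n_probes)]
        scale = math.sqrt(n_probes / sum(p * p for p in phi))
        phi = [p * scale for p in phi]
    return update_theta(phi), phi


# Made-up, noise-free differences: 3 chips by 3 probes, exactly rank one.
true_theta = [10.0, 20.0, 40.0]
true_phi = [0.5, 1.0, 1.5]
diffs = [[t * p for p in true_phi] for t in true_theta]

theta, phi = fit_li_wong(diffs)
score = efron_score([4.0, 9.0], [16.0, 81.0])
runner_up = second_largest_pm([5.0, 9.0, 7.0])
```

The fitted `theta` values differ from `true_theta` by a common factor because of the scaling of phi; only the products theta[i] * phi[j] are identified by the model.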

Page 44: Exploratory Data Analysis of High Density Oligonucleotide Array

How do we compare?

• We want small variance, small bias.
• Up to now we don’t know the truth in any of our data sets, so it is hard to assess bias.
• One possibility is to assume some gene is differentially expressed in the experiments we study, find it, and look at its probe profile.

Page 45: Exploratory Data Analysis of High Density Oligonucleotide Array

Conclusion

• Features of the data suggest that Avg.diff can be improved upon as a definition of expression

• Normalization appears to be needed to remove experimental variation and make comparisons of data from different chips meaningful

Page 46: Exploratory Data Analysis of High Density Oligonucleotide Array

Acknowledgements

• JHU: Leslie Cope, Tom Coppola, Shwu-Fan Ma, Skip Garcia
• CNMC: Rehannah Borup, Josephine Chen, Eric Hoffman
• UC Berkeley: Ben Bolstad
• WEHI: Runa Daniel, Len Harrison