Pooling Data Across Microarray Experiments Using Different Versions of Affymetrix Oligonucleotide Arrays

Jeffrey S. Morris, Li Zhang, Chunlei Wu, Keith Baggerly, and Kevin Coombes
UT MD Anderson Cancer Center
Houston, TX, USA



Combining Information across Microarray Studies

- Many microarray data sets are publicly available
- Information can be combined across studies to:
  1. Validate results from individual studies
     - Find the intersection of differentially expressed genes
     - Build a model using one study, validate it using another
  2. Discover new biological insights through analyses pooling data across studies
     - Potential for increased statistical power
     - Important, since many individual studies are underpowered



Pooling Data across Studies

- Challenge: in general, microarray data from different studies are not comparable
- Clinical differences
  - Different study populations
- Technical differences
  - Laboratory differences: sample collection and storage, microarray protocol
  - Different platforms: cDNA/oligo, different versions of the same technology (e.g., Affy chips)



Pooling Data across Studies

Approaches in the existing literature:

1. Include study effects in the model
   - Gene-specific study effects
   - SVD, distance-weighted discrimination
   - Drawback: first-order corrections are not enough
2. Model unitless summary measures
   - Standardized log fold-change
   - t-statistics
   - Probabilities of +/0/- expression
   - Drawback: implicit assumptions about the comparability of clinical populations across studies



Pooling Data across Studies

- Sequence-related reasons why raw expression levels are not comparable across platforms:
  - Cross-hybridization
  - RNA degradation (near the 5' end)
  - Probe validity: do probes map to RefSeq?
  - Alternative splicing
- By taking these into account, it may be possible to obtain more comparable raw expression levels to use in pooled analyses
- Our focus: combining information across different versions of Affymetrix GeneChips



Overview of Affymetrix GeneChips

- Probes: 25-base sequences from the gene of interest
- Probesets: sets of probes corresponding to the same gene
  - Obtained from current sequence information in GenBank, UniGene, RefSeq
- Generations of human chips:
  - HuGeneFL: 5,600 genes, 20 probes/gene
  - U95Av2: 10,000 genes, 16 probes/gene
  - U133A: 14,500 genes, 11 probes/gene



Example: CAMDA Lung Cancer Data

- CAMDA: Critical Assessment of Microarray Data Analysis, an annual conference at Duke University
- CAMDA 2003: two studies relating gene expression data to survival in lung cancer patients
  1. Harvard (Bhattacharjee et al. 2001): 124 lung adenocarcinoma samples
  2. Michigan (Beer et al. 2002): 86 lung adenocarcinoma samples
- GOAL: pool data across studies to identify prognostic genes for lung cancer



Pooling Information across Studies

Harvard patients: worse prognosis



Pooling Information across Chip Types

- Michigan: HuGeneFL chip
  - 6,633 probe sets, 20 probe pairs each
- Harvard: HG_U95Av2 chip
  - 12,453 probe sets, 16 probe pairs each
- Problems:
  - Different genes
  - Incomparable expression levels



Example: CAMDA 2003 Data

- The two studies used different chip types:
  - Michigan: HuGeneFL, ~130,000 probes in 6,633 probesets
  - Harvard: U95Av2, ~200,000 probes in 12,625 probesets
- 34,428 matching probes combined into 4,101 partial probesets
- After preprocessing, 1,036 probesets considered in subsequent analyses
- We used the PDNN method (Zhang et al. 2003, Nature Biotech) for quantification



Partial Probeset Method

[Diagram: probes on HuGeneFL and HG_U95Av2; matching probes; partial probesets]

1. Identify matching probes
2. Recombine them into new probesets based on UniGene clusters, which we refer to as partial probesets
3. Eliminate any probesets containing just one or two probes

Note: Any quantification method can subsequently be used (MAS, dChip, RMA, PDNN)
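
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: each chip is assumed to be given as a dictionary from probe sequence to UniGene cluster ID, "matching" is taken to mean an identical probe sequence on both chips, and the function name, input layout, and toy sequences are all hypothetical.

```python
from collections import defaultdict

def build_partial_probesets(chip_a, chip_b, min_probes=3):
    """Group probes shared by two chips into partial probesets.

    chip_a, chip_b: dicts mapping probe sequence -> UniGene cluster ID
    (hypothetical input layout, assumed for this sketch).
    """
    # Step 1: a probe "matches" if the same sequence appears on both chips.
    matching = set(chip_a) & set(chip_b)

    # Step 2: recombine matching probes by UniGene cluster.
    # Assumption: both chips assign a shared probe to the same cluster,
    # so the cluster ID is read from chip_a.
    probesets = defaultdict(list)
    for seq in sorted(matching):
        probesets[chip_a[seq]].append(seq)

    # Step 3: eliminate probesets containing just one or two probes.
    return {c: p for c, p in probesets.items() if len(p) >= min_probes}

# Toy data: real probes are 25 bases long; these are shortened placeholders.
chip_a = {"AAAA": "Hs.1", "AAAC": "Hs.1", "AAAG": "Hs.1", "CCCC": "Hs.2", "GGGG": "Hs.3"}
chip_b = {"AAAA": "Hs.1", "AAAC": "Hs.1", "AAAG": "Hs.1", "CCCC": "Hs.2", "TTTT": "Hs.4"}
partial = build_partial_probesets(chip_a, chip_b)
print(partial)
```

In the toy data, cluster Hs.2 has only one matching probe and is dropped, while Hs.3 and Hs.4 have no matching probes at all.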



Pooling Information across Chip Types



Quantification of Expression Levels

- Gene expression quantified by applying Li's PDNN model to our partial probesets
- Uses probe sequence information to predict patterns of specific and nonspecific hybridization intensities
- Allows borrowing of strength across probe sets
- Model is not overparameterized: O(N probesets)
- See Zhang et al. (2003), Nature Biotech, for further details on the method and comparisons



Assessing Our Method for Combining Information across Chip Types

The partial probeset method appears to give comparable expression levels across chip types.



Assessing Our Method for Combining Information across Chip Types

- Median partial probeset size is 7, vs. 16 or 20
- Loss of precision?
- No evidence of significant precision loss



Assessing Partial Probeset Method

- Agreement in relative quantifications across samples
- Agreement is worse for less variable genes
- Eliminate genes with sd < 0.20 or r < 0.90
- 1,036 genes remain
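
A filter of this kind might look like the sketch below. The slide does not spell out how sd and r are computed, so the sketch makes assumptions: sd is taken as the sample standard deviation of a gene's partial-probeset quantification across samples, and r as the Pearson correlation between the full-probeset and partial-probeset quantifications of the same samples. The function names and toy numbers are hypothetical.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0

def keep_gene(full, partial, min_sd=0.20, min_r=0.90):
    """Keep a gene only if it varies enough and both quantifications agree.

    full, partial: log expression levels for the same samples, quantified
    from the full and the partial probeset (assumed meaning of sd and r).
    """
    return statistics.stdev(partial) >= min_sd and pearson(full, partial) >= min_r

# Toy genes (made-up numbers, four samples each).
concordant = keep_gene([5.0, 6.0, 7.0, 8.0], [5.1, 6.0, 7.2, 7.9])
flat = keep_gene([6.0, 6.1, 6.0, 6.1], [6.0, 6.05, 6.1, 6.0])
discordant = keep_gene([5.0, 6.0, 7.0, 8.0], [8.0, 5.0, 7.0, 6.0])
print(concordant, flat, discordant)
```

Only the first toy gene passes: the second fails the sd cutoff and the third fails the correlation cutoff.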



Identifying Prognostic Genes

1. Preprocess raw microarray data
   - Outlier detection, normalization, quantification, remove some genes?
   - Left with an n-by-p matrix of expression levels for p genes on n microarrays
2. Identify which genes are correlated with the outcome of interest
   - Perform a standard statistical test for each gene to obtain (permutation) p-values
   - Find a cutpoint on the p-values for declaring significance that accounts for multiplicities



Identifying Prognostic Genes

- After preprocessing: 1,036 genes, 200 samples
- Identify genes related to survival after adjusting for known clinical predictors
  - These provide prognostic information on survival above and beyond the clinical predictors



Identifying Prognostic Genes: Cox Regression Modeling

- Hazard: $\lambda(t) \sim \Pr(X < t + \Delta t \mid X > t)$
- Cox model: $\lambda_i(t) = \lambda_0(t) \exp(X_i \beta)$
  - $X_i$ = vector of covariates for subject $i$
  - $\beta$ = vector of regression coefficients
- Key assumption: proportional hazards
  - The hazard ratio between subjects with different covariates does not vary over time:
    $\lambda_i(t) / \lambda_k(t) = \exp\{(X_i - X_k)\beta\}$
  - $\exp(\beta)$ = multiplicative change in hazard per unit change in $X$




Cox Regression Modeling

Best clinical model:

Factor                                     β      exp(β)   Z      p
Study (Michigan = 0, Harvard = 1)          0.67   1.95     2.73   0.0062
Age                                        0.03   1.03     2.60   0.0094
Stage (Early (1-2) = 0, Late (3-4) = 1)    1.53   4.61     6.61   <0.000000001
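
To make the table concrete, the short sketch below recomputes hazard ratios from the rounded coefficients. Under the Cox model the baseline hazard cancels in the ratio of two subjects' hazards, leaving the exponential of the coefficient-weighted covariate differences. The two example patients are hypothetical, and because the coefficients are the rounded values from the table, the last digit of exp(β) may differ from the table's.

```python
import math

# Rounded coefficients copied from the best clinical model table.
BETA = {"study": 0.67, "age": 0.03, "stage": 1.53}

def hazard_ratio(x_i, x_k, beta=BETA):
    """Hazard of subject i relative to subject k under the Cox model.

    The baseline hazard cancels, so only covariate differences matter.
    """
    return math.exp(sum(beta[name] * (x_i[name] - x_k[name]) for name in beta))

# A one-unit change in a single covariate gives exp(beta) for that covariate.
for name, coef in BETA.items():
    print(name, round(math.exp(coef), 2))

# Hypothetical patients (not from the data): Harvard, age 70, late stage
# versus Michigan, age 60, early stage.
patient_a = {"study": 1, "age": 70, "stage": 1}
patient_b = {"study": 0, "age": 60, "stage": 0}
ratio = hazard_ratio(patient_a, patient_b)
print(round(ratio, 2))
```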



Identifying Prognostic Genes

- A series of 1,036 multivariable Cox models was fit to identify prognostic genes. Each model contained:
  - Study (Michigan = -1, Harvard = 1)
  - Age (continuous factor)
  - Stage (early = 0 / late = 1)
  - Probeset (log intensity value as continuous factor)
- Exact p-values for each probeset were computed using a permutation approach
- By using multivariable modeling, we search for genes offering prognostic information beyond the clinical predictors
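
The general shape of a permutation p-value can be sketched as follows. This is a simplified stand-in, not the procedure used in the talk: it estimates the p-value by Monte Carlo sampling of permutations rather than computing it exactly, and it swaps the Cox model statistic for the absolute covariance between a gene's expression and a numeric outcome, with no adjustment for clinical covariates. All names and the toy data are hypothetical.

```python
import random

def abs_covariance(xs, ys):
    """Stand-in test statistic: |sample covariance| of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return abs(sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1))

def permutation_p_value(expr, outcome, statistic=abs_covariance, n_perm=2000, seed=0):
    """Monte Carlo permutation p-value for one gene.

    Shuffling the expression values breaks any link with the outcome, so the
    shuffled statistics approximate the null distribution.
    """
    rng = random.Random(seed)
    observed = statistic(expr, outcome)
    shuffled = list(expr)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(shuffled, outcome) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)

# Toy gene whose expression is perfectly related to the outcome.
expr = [float(i) for i in range(20)]
outcome = [2.0 * x + 1.0 for x in expr]
p = permutation_p_value(expr, outcome)
print(p)
```

In a real analysis the statistic would come from the fitted Cox model for each probeset, and the permutation scheme would need to respect the clinical covariates included in the model.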




BUM Method

- No prognostic genes: p-values are Uniform
- Prognostic genes: smaller p-values
- Fit a Beta-Uniform mixture (BUM) to the histogram of p-values (Pounds and Morris, 2003, Bioinformatics)
- The method can be used to identify prognostic genes while controlling the FDR
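
As a rough illustration of how a fitted Beta-Uniform mixture turns into a p-value cutoff, the sketch below assumes the mixture density f(p) = λ + (1 - λ) a p^(a-1) with 0 < a < 1 and 0 < λ < 1, takes λ + (1 - λ)a as the upper bound on the proportion of null p-values, and solves for the cutoff at which the implied FDR equals a target α. The fitting step (estimating a and λ from the p-value histogram) is not shown, and the parameter values in the example are made up, not the values fitted in the talk.

```python
def bum_cdf(p, a, lam):
    """CDF of the beta-uniform mixture: lam * p + (1 - lam) * p**a."""
    return lam * p + (1.0 - lam) * p ** a

def bum_fdr_threshold(a, lam, alpha):
    """P-value cutoff at which the BUM-implied FDR equals alpha.

    Solves pi_ub * tau / F(tau) = alpha for tau, where pi_ub is the upper
    bound on the null (uniform) proportion and F is the mixture CDF.
    """
    pi_ub = lam + (1.0 - lam) * a
    ratio = (pi_ub - alpha * lam) / (alpha * (1.0 - lam))
    # Cap at 1.0: if alpha >= pi_ub, every p-value can be declared significant.
    return min(1.0, ratio ** (1.0 / (a - 1.0)))

# Hypothetical fitted parameters and a target FDR of 0.20.
a_hat, lam_hat, alpha = 0.5, 0.7, 0.20
tau = bum_fdr_threshold(a_hat, lam_hat, alpha)
implied_fdr = (lam_hat + (1.0 - lam_hat) * a_hat) * tau / bum_cdf(tau, a_hat, lam_hat)
print(tau, implied_fdr)
```

Probesets with p-values at or below the cutoff would be flagged; the check at the end confirms that the implied FDR at the cutoff equals the target.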



Results

- The histogram suggests there are some significant probesets
- FDR = 0.20 corresponds to a p-value cutoff of 0.0024 (BUM, Pounds and Morris 2003)
- 26 probesets flagged as significant



Selected Flagged Genes

Rank  Gene     β      p         Function
1     FCGRT   -2.07   <0.00001  Induced by IF-γ in treating SCLC
2     ENO2     1.46   0.00001   Marker of NSCLC
4     RRM1     1.81   0.00002   Linked to survival in NSCLC
8     CHKL    -1.43   0.00010   Marker of NSCLC
11    CPE      0.72   0.00031   Marker of SCLC
12    ADRBK1  -2.20   0.00044   Co-expressed with Cox-2 in lung ADC
16    CLU     -0.52   0.00109   Marker of SCLC
20    SEPW1   -1.29   0.00145   H2O2 cytotoxicity in NSCLC cell lines
21    FSCN1    0.66   0.00150   Marker of invasiveness in Stage 1 NSCLC
25    BTG2    -0.75   0.00232   Induced by p53 in SCLC cell lines












Selected Flagged Genes

Rank  Gene     β      p         pStage  Function
1     FCGRT   -2.07   <0.00001  0.154   Induced by IF-γ in treating SCLC
2     ENO2     1.46   0.00001   0.282   Marker of NSCLC
4     RRM1     1.81   0.00002   0.321   Linked to survival in NSCLC
8     CHKL    -1.43   0.00010   0.979   Marker of NSCLC
11    CPE      0.72   0.00031   0.088   Marker of SCLC
12    ADRBK1  -2.20   0.00044   0.484   Co-expressed with Cox-2 in PUC
16    CLU     -0.52   0.00109   0.014   Marker of SCLC
20    SEPW1   -1.29   0.00145   0.028   H2O2 cytotoxicity in NSCLC cell lines
21    FSCN1    0.66   0.00150   0.082   Marker of invasiveness in Stage 1 NSCLC















Selected Flagged Genes

Rank  Gene    β      p        pStage  Function
3     NFRKB  -2.81   0.00001  0.058   Amplified in AML
7     ATIC    1.81   0.00009  0.771   Fusion partner of ALK, which defines a subtype of ALCL
13    BCL9   -1.64   0.00069  0.057   Over-expressed in ALL
15    TPS1   -0.64   0.00107  0.882   Associated with pulmonary inflammation
25    BTG2   -0.75   0.00232  0.726   Inhibits cell proliferation in primary mouse embryo fibroblasts lacking functional p53



Results

Our gene list has almost no overlap with other publications of these data. Reasons:

1. We addressed a different research question
   - Us: identify genes offering prognostic information beyond clinical predictors
   - Michigan: univariate Cox models fit; results used to construct a dichotomous risk index
   - Harvard: cluster analysis done; clusters linked to survival; found genes driving the clustering
2. Pooling across studies yielded significant gains in statistical power
   - Most genes (17/26) in our study are not flagged if we analyze the 2 data sets separately (i.e., no pooling)



CAMDA 2003 Results

We Won!



Limitations of Partial Probeset Method

- Worked well for combining across HuGeneFL/U95Av2
  - ~25% of probes from HuGeneFL are on U95Av2, giving 4,101 probesets
- Not enough matching probes for use with U95Av2/U133A
  - ~6% of probes from U95Av2 are also on U133A, giving only 628 probesets
- Requiring matching probes is a strong criterion; maybe a weaker criterion would suffice?



Full-Length Transcript Based Probesets

- New probeset definition (FLTBP): probes match the same set of full-length mRNA sequences
- Procedure:
  1. Construct a comprehensive library of full-length mRNA transcript sequences from RefSeq and H-InvDB
  2. For each probe, identify all matching full-length transcripts using the BLAST program
     - U95Av2: 15% matched no sequence, 33% matched multiple sequences
     - U133A: 18% matched no sequence, 38% matched multiple sequences
  3. Group probes with the same matched target lists (FLTBPs)
     - U95Av2: 23,972 probesets; U133A: 14,148 probesets
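
Step 3 of the procedure amounts to grouping probes by the exact set of transcripts they hit. A minimal sketch, assuming the BLAST results have already been parsed into a dictionary from probe ID to matched transcript IDs (the parsing is not shown, and the IDs, function name, and input layout are hypothetical). The sketch also assumes that probes matching no transcript are simply left out, which the slides do not state explicitly.

```python
from collections import defaultdict

def group_by_transcript_set(probe_hits):
    """Group probes whose matched full-length transcript sets are identical.

    probe_hits: dict mapping probe ID -> iterable of matched transcript IDs.
    Returns a dict mapping frozenset of transcript IDs -> list of probe IDs.
    """
    groups = defaultdict(list)
    for probe, transcripts in probe_hits.items():
        key = frozenset(transcripts)  # order of hits does not matter
        if not key:
            continue  # assumption: probes matching no transcript are dropped
        groups[key].append(probe)
    return dict(groups)

# Toy hits: p3 and p4 match the same two transcripts, p5 matches nothing.
hits = {
    "p1": ["TX1"],
    "p2": ["TX1"],
    "p3": ["TX1", "TX2"],
    "p4": ["TX2", "TX1"],
    "p5": [],
}
fltbps = group_by_transcript_set(hits)
print(fltbps)
```

Because each group is keyed by its transcript set, probesets built this way on two chip types can be paired by looking for keys present in both groupings.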



Full-Length Transcript Based Probesets

- Matching across chip types:
  - 9,642 FLTBPs match across U95Av2 and U133A
  - Affymetrix has its own method for mapping probesets across arrays: 9,480 pairs of probesets (only about 1/2 map the same way as FLTBPs)
- Example: lung cancer cell line data
  - 28 cell lines, each hybridized onto both U95Av2 and U133A arrays
  - Paired design suggests any differences between paired measurements are due to technical, not biological, sources
  - Different quantification methods (PDNN, RMA, MAS, dChip)



Results

- Density estimate of chip-to-chip correlations for each gene
- Positive shift for FLTBP suggests better correlations
- Improvement greatest for PDNN
- Correlation still not perfect

[Figure: density of gene-gene correlation (x-axis: gene-gene correlation, -0.5 to 1.0; y-axis: density, 0.0 to 2.5). Panels: A, PDNN (P<0.00001); B, RMA (P<0.00031); C, MAS5 (P<0.00575); D, dChip (P<0.00005).]



Example: Sample Gene 2

[Figure: Ln(probe signal) plotted against sample (1 to 26) in two panels, with a U95Av2 vs. U133A comparison; R2 values shown: 0.87 and 0.25.]

Again, significantly higher correlation using FLTBP than using the Affymetrix definition



Results

- Boxplot of chip-to-chip correlations (over genes) for each sample
- PDNN resulted in higher correlations

[Figure: boxplots of correlations (y-axis 0.6 to 1.0) for FLTBP and AffyPS under each of PDNN, RMA, MAS5, and dChip.]



Conclusions

- New method for pooling information across studies using different versions of Affymetrix chips
  - Recombine matched probes into new probesets using UniGene clusters
  - Method appears to obtain comparable expression levels across chips without sacrificing much precision or significantly altering the relative ordering of the samples
  - Worked well combining information across HuGeneFL/U95Av2, but not U95Av2/U133A



Conclusions

- Discussed a new probeset definition based on full-length transcript sequences
  - Removes the effect of known alternative splicing
  - Yields stronger between-chip correlations than the standard Affymetrix definitions
- Pooling information across studies is difficult, and there is still more work to be done, but it is worth the effort



References

Morris JS, Yin G, Baggerly KA, Wu C, and Zhang L (2005). Pooling Information Across Different Studies and Oligonucleotide Microarray Chip Types to Identify Prognostic Genes for Lung Cancer. Methods of Microarray Data Analysis IV, eds. JS Shoemaker and SM Lin, pp. 51-66, New York: Springer-Verlag.

Wu C, Morris JS, Baggerly KA, Coombes KR, Minna JD, and Zhang L (2005). A probe-to-transcripts mapping method for cross-platform comparisons of microarray data taking into account the effects of alternative splicing. Under review.

Morris JS, Wu C, Coombes KR, Baggerly KA, Wang J, and Zhang L (2006). Alternative Probeset Definitions for Combining Microarray Data Across Studies Using Different Versions of Affymetrix Oligonucleotide Arrays. To appear in Meta-Analysis in Genetics, edited by Rudy Guerra and David Allison, Chapman
