Analysis of array CGH data - VUwvanwie/presentations/WNvanWieringen...1 Analysis of aCGH data...


Page 1: Analysis of array CGH data - VUwvanwie/presentations/WNvanWieringen...1 Analysis of aCGH data Department of Mathematics Vrije Universiteit Amsterdam Wessel van Wieringen wvanwie@few.vu.nl

1

Analysis of aCGH data

Department of Mathematics, Vrije Universiteit Amsterdam

Wessel van Wieringen, wvanwie@few.vu.nl

2

0. Contributors

Statistics:
- Kyung In Kim
- Aad van der Vaart
- Mark van de Wiel
- Wessel van Wieringen

Bioinformatics:
- Eskeatnaf Achame
- Jeroen Belien
- Kees Jong
- Sjoerd Vosse

Biology:
- Saskia Wilting
- Bauke Ylstra

3

0. Outline

Topics discussed

1. aCGH
2. Pre-processing
3. Dimension reduction
4. Hypothesis testing
5. Clustering
6. Integration with expression

4

1. aCGH

5

Chromosomes of a tumor cell

Technique: SKY

1. aCGH

6

tumor cell, normal cell

1. Classical CGH


7

normal cell, normal cell

hybridization

after hybridization

1. Classical CGH

8

tumor cell, normal cell

hybridization

after hybridization

1. Classical CGH

9

1. Classical CGH

CGH 5-10 Mb resolution vs array CGH 0.8 Mb resolution

10

BACs

chr. 1, chr. 2, chr. 3, chr. 4

1. arrayCGH

Array element: probe, clone, or oligo

Resolution

• Human genome is 3000 Mb (about 3 Gb),
• 30 000 BACs to cover the human genome,
• Max resolution is the size of the BAC, 100-150 kb.

11

1. arrayCGH

Hybridize

test sample, reference sample

12

1. arrayCGH

Hybridize

test sample, reference sample

13

1. arrayCGH

Hybridize

test sample, reference sample

Scan

14

1. arrayCGH

Scan

Image of the array

Image of a spot

Quantification of intensity: log2(G/R)-ratio

15

1. aCGH profiles

Log2-ratios plotted against the genomic order of clones.

[Figure: aCGH profile; x-axis: chromosomes 1, 2, 3, ..., X]

16

1. aCGH profiles

[Figure: aCGH profile with stretches labelled Gain, Loss, Amplification, and Normal]

17

1. aCGH profiles

Males have only one copy of the X-chromosome.

[Figure: aCGH profile along chromosomes 1, 2, 3, ..., X]

Corroboration

18

1. aCGH profiles

Deletion on 5p

Corroboration

19

1. aCGH profiles

New findings

20

2. aCGH

aCGH measures
• DNA copy number,
• genome-wide,
• with high resolution.

aCGH data are used to
- Test for chromosomal aberrations,
- Distinguish one tumor class from another (diagnosis),
- Screen for new drug targets,
- Predict clinical outcome,
- Find new subclasses,
- etc.

21

1. arrayCGH

PubMed: “Array CGH”

[Figure: bar chart of the number of PubMed items per year, 1997-2006 (y-axis 0-700); the count rises from 1 item in 1997 to 146 in 2004 and to several hundred per year in 2005-2006.]

22

2. Pre-processing
- Normalization
- Segmentation
- Calling

23

2. Normalization

Motivation for normalization

Log2-ratios from different hybridizations are compared.

Normalization aims to make log2-ratios from different hybridizations comparable.

Types of normalization
• Median normalization.
• Mode normalization.
• Spatial normalization.

24

2. Normalization

Shift median to zero
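
Median normalization as described here (shift the median of each hybridization to zero) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function name and the use of `None` for missing clones are assumptions.

```python
from statistics import median

def median_normalize(log2_ratios):
    """Shift the log2-ratios of one hybridization so that their median is zero."""
    observed = [v for v in log2_ratios if v is not None]  # None marks a missing clone (assumption)
    shift = median(observed)
    return [None if v is None else v - shift for v in log2_ratios]

# Toy profile of one array; the fifth clone mimics a gained region.
profile = [0.30, 0.10, None, 0.20, 1.40, 0.25]
normalized = median_normalize(profile)
```

Because the median is robust to a minority of aberrated clones, the shift is driven by the bulk of (presumably normal) clones.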


25

2. Normalization

Spatial effects are present!

Subtract loess curve

26

2. Pre-processing
- Normalization
- Segmentation
- Calling

27

Divide the genome into contiguous segments.

Clones that belong to the same segment are assumed to have the same underlying copy number.

Segmentation

Segmentation is also called smoothing.

[Figure: genome divided into segment 1, ..., segment i, ...]

2. Segmentation

28

Why segmentation?
• Noise reduction.
• Detection of aberration (loss, normal, gain).
• Breakpoint analysis.

Recurrent (over tumors) aberrations may indicate:
- an oncogene, or
- a tumor suppressor gene.

2. Segmentation

Difficulties for segmentation
• Measurements are relative to a reference sample.
• Printing, labeling and hybridization may be uneven.
• Tumor sample is inhomogeneous.

29

2. Segmentation

Copy numbers are integers:

“discrete smoothing”!

30

2. Segmentation


31

2. Segmentation

A segmentation can be described by:
- a number of breakpoints, and
- the corresponding levels (or states).

[Figure: segmented profile annotated with levels, breakpoints, variance, and segment]

32

2. Segmentation

Breakpoint detection

Identify possibly damaged genes:
- These genes will not be expressed any more.

Identify recurrent breakpoint locations:
- Indicates fragile pieces of the chromosome.

Accuracy is important:
- Important genes may be located in a region with (recurrent) breakpoints.

33

Problem formalization (Jong et al., 2004)

2. Segmentation

aCGH values: $x_1, \ldots, x_n$
Breakpoints: $0 < y_1 < \ldots < y_N < x_N$
Levels: $\mu_1, \ldots, \mu_N$
Error variances: $\sigma_1^2, \ldots, \sigma_N^2$
Likelihood:

- A fitness function scores each segmentation according to its fitness to the data.
- Find the segmentation with highest fitness.
- Assume data are Gaussian.
- Use the ML-criterion with a penalization term for model complexity.

34

2. Segmentation

Fitness function

- ML-estimators of $\mu$ and $\sigma^2$ can be found explicitly.
- Add a penalty to the log-likelihood to control the number N of breakpoints:

- Maximizing fitness is computationally hard.
- Use a genetic algorithm + local search to find an approximation to the optimum.
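
A penalized Gaussian fitness of the kind described above could be sketched as follows. The penalty formula did not survive on the slide, so the linear form `penalty * number_of_breakpoints`, the variance floor, and all names are assumptions made for illustration; the search itself (genetic algorithm plus local search) is not shown.

```python
import math

def segment_log_likelihood(values, var_floor=1e-6):
    """Gaussian log-likelihood of one segment at its ML mean and variance."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((v - mean) ** 2 for v in values)
    var = max(ss / n, var_floor)  # floor (assumption) avoids log(0) on constant segments
    return -0.5 * n * math.log(2 * math.pi * var) - ss / (2 * var)

def fitness(x, breakpoints, penalty=5.0):
    """Penalized log-likelihood; `breakpoints` are indices where a new segment starts."""
    edges = [0] + sorted(breakpoints) + [len(x)]
    loglik = sum(segment_log_likelihood(x[a:b]) for a, b in zip(edges, edges[1:]))
    return loglik - penalty * len(breakpoints)

# Toy profile with one true breakpoint at index 4.
x = [0.0, 0.1, -0.1, 0.05, 1.0, 1.1, 0.9, 1.05]
best = fitness(x, [4])
none = fitness(x, [])
over = fitness(x, [2, 4, 6])
```

In this toy example the single correct breakpoint scores higher than both the unsegmented profile and the over-segmented one, which is the behaviour the penalty is meant to produce.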

35

2. Segmentation

Results comparable to those produced by hand by local expert

[Figure: segmentation by expert vs. segmentation by algorithm]

36

2. Pre-processing
- Normalization
- Segmentation
- Calling


37

2. Calling

Calling

Calling is the process of categorizing the different segmentation states as ‘loss’, ‘normal’, ‘gain’, or ‘amplification’.

Why call? (Van Wieringen et al., 2007b)

- Nature of the data.
- Clearest interpretation.
- Necessary for inferences on copy number.
- Advantageous for down-stream analysis.

38

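2. Calling

Calling with a simple cut-off (normalized data)

[Figure: normalized log2-ratios with cut-offs separating gain, normal, and loss]

39

2. Calling

Calling with a simple cut-off (segmented data)

[Figure: segmented log2-ratios with cut-offs separating gain, normal, and loss]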
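
Calling with a simple cut-off amounts to thresholding each (normalized or segmented) log2-ratio. The sketch below uses the -1/0/1 coding for loss/normal/gain that appears later in the deck; the cut-off values of ±0.2 are purely illustrative assumptions.

```python
def call_with_cutoff(values, loss_cutoff=-0.2, gain_cutoff=0.2):
    """Call each log2-ratio as -1 (loss), 0 (normal) or 1 (gain) using fixed cut-offs."""
    calls = []
    for v in values:
        if v < loss_cutoff:
            calls.append(-1)
        elif v > gain_cutoff:
            calls.append(1)
        else:
            calls.append(0)
    return calls

# Segment levels of a toy profile.
calls = call_with_cutoff([-0.5, 0.05, 0.4, 0.1])
```

Applied to segmented data every clone in a segment receives the same call, whereas on merely normalized data neighbouring clones can flip between calls because of noise.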

40

2. Calling

CGHcall (Van de Wiel et al., 2007a)

CGHcall is a calling method that makes extensive use of biology and the knowledge of the aCGH data.

Biological assumptions

Six copy numbers: double loss (0), single loss (1), normal (2), single gain (3), double gain (4), amplification (≥5).

[Figure labels: all normal, levels differ; breakpoint locations; same segment, same copy number]

41

2. Calling

Formalization of biological assumptions

1. Independence between clones from different segments.
2. Tumors reasonably similar (otherwise cellularity correction).
3. Identifiability restrictions on parameters.
4. Variation constant within segments.
5. Log2-ratios are modelled by a hierarchical model of sample k, clone i, segment j, and underlying copy number l.

42

2. Calling

EM algorithm

Initialize unknown parameters; compute membership probabilities; compute expected log-likelihood EL with respect to hidden states; maximize EL with respect to unknown parameters; update membership probabilities; iterate.

The likelihood:

with


43

2. Calling

Output of calling methods

The results of a calling method are the call probabilities:

Chr.  Start BP  Clone  Loss   Normal  Gain   Ampl.
1     38099     2749   0.001  0.052   0.174  0.773

(the call probabilities of a clone sum to one)

CGHcall yields six probabilities.

44

2. Calling

Final calling

Call probabilities are used for generation of calls. With $P_{jk1}, \ldots, P_{jk6}$ the six call probabilities (double loss, single loss, normal, single gain, double gain, amplification):

Loss: $P_{jk1} + P_{jk2} > 0.5$
Normal: $P_{jk3} > 0.5$
Gain: $P_{jk4} + P_{jk5} + P_{jk6} > 0.5$ and $P_{jk6} < 0.5$
Amplification: $P_{jk6} > 0.5$

While the final calling returns 4 classes only, it is essential to distinguish single gains from double gains in the model.

Otherwise the double gain data will bias the results for the single gains.
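
One reading of the final-calling rule on this slide is: call a loss if the two loss probabilities together exceed 0.5, normal if the normal probability exceeds 0.5, amplification if the amplification probability exceeds 0.5, and gain if the three gain-type probabilities together exceed 0.5 while amplification alone does not. The sketch below implements that reading; the function name and the fallback for the case where no class exceeds 0.5 are assumptions.

```python
def final_call(probs):
    """Map six call probabilities to one of four classes.

    probs is ordered: double loss, single loss, normal, single gain,
    double gain, amplification.
    """
    p1, p2, p3, p4, p5, p6 = probs
    classes = (("loss", p1 + p2), ("normal", p3), ("gain", p4 + p5 + p6))
    if p6 > 0.5:
        return "amplification"
    for label, prob in classes:
        if prob > 0.5:
            return label
    # No class exceeds 0.5: fall back to the most probable class (assumption).
    return max(classes, key=lambda item: item[1])[0]

example_calls = [
    final_call([0.0, 0.001, 0.052, 0.100, 0.074, 0.773]),
    final_call([0.0, 0.0, 0.2, 0.5, 0.2, 0.1]),
    final_call([0.3, 0.4, 0.2, 0.1, 0.0, 0.0]),
    final_call([0.0, 0.1, 0.8, 0.1, 0.0, 0.0]),
]
```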

45

2. Calling

[Figure: log2(G/R) profile and the probability of an aberration along the genome, with regions marked Loss and Gain]

Probability > 0.5 → Loss
Probability > 0.5 → Gain

46

2. Calling

Wilting et al. (2006)

[Figure: two panels, “3 modes” and “6 modes”]

47

2. Calling

[Figure: amplifications on chromosome 17]

48

2. Calling

Alternative model

Original model: all chromosomal arms have equal probability of, say, a loss.

For some tumor types certain chromosomal arms are likely to contain more aberrations than others.

Incorporated in alternative model.


49

CGH-call

CGH-classify

2. Calling

50

2. Calling

Comparison: simulation

Willenbrock & Fridlyand (2005) simulation model

Method        Overall classification rate  True positive rate  True negative rate
CGHcall       98.6                         95.8                99.3
CGHclassify   90.8                         96.2                89.2
CLAC          95.1                         75.9                99.7
MergeLevels   88.0                         46.2                98.3
2sd           87.4                         36.3                99.9

51

2. Calling

Summary plot

89 oral carcinomas; Snijders et al. (2005)

52

2. Pre-processing

Pre-processed data

Chr.  Start BP  Clone  Array 1  Array 2  …  Array n
1     38099     1      0        0        …  0
1     614489    2      0        0        …  0
1     2413398   3      0        0        …  -1
…     …         …      …        …        …  …
18    731200    2384   1        0        …  2
18    3908046   2385   1        1        …  2
18    4825168   2386   1        1        …  2
…     …         …      …        …        …  …
23    204448    p      0        0        …  0

-1 = loss, 0 = normal, 1 = gain, 2 = amplification

53

3. Dimension reduction

54

3. Regions

CGHregions (Van de Wiel et al., 2007b)

Summarize clone data into region data, and use regions.

The ordinal nature of the data causes many genomic segments to have the same values over consecutive clones.

Regions

A region is a series of neighboring clones on the chromosome whose aCGH-signature is shared by all clones.

A region can be a chromosome arm, or a small amplification.

Regions capture the essential features of the data.


55

3. Regions

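[Figure: call matrix of 9 clones (rows, ordered by genomic location) by 6 samples (columns), grouped into regions]

         S1  S2  S3  S4  S5  S6
clone 1  0   1   0   0   1   1   Region 1
clone 2  0   1   0   0   1   1   Region 1
clone 3  0   1   0   0   1   1   Region 1
clone 4  2   2   2   2   1   2   Region 2
clone 5  0   0   0   0   0   0   Region 3
clone 6  0   0   0   0   0   0   Region 3
clone 7  0   0   0   0   0   0   Region 3
clone 8  0   0   0   0   0   0   Region 3
clone 9  0   0   0   0   0   0   Region 3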

56

          S1  S2  S3  S4  S5  S6
Region 1  0   1   0   0   1   1
Region 2  2   2   2   2   1   2
Region 3  0   0   0   0   0   0

Illustration of the principle.

But … too simple!
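
The principle illustrated here, merging consecutive clones whose signatures are identical, can be sketched as below. The toy matrix is chosen to be consistent with the three region signatures on this slide; the function and variable names are assumptions. This covers only the exact-match case that the slide calls too simple, not the CGHregions procedure with its distance threshold.

```python
def collapse_into_regions(calls):
    """Group consecutive clones with identical signatures.

    calls: list of per-clone signatures (one call per sample), in genomic order.
    Returns a list of regions, each with its signature and 1-based clone numbers.
    """
    regions = []
    for index, signature in enumerate(calls):
        if regions and regions[-1]["signature"] == list(signature):
            regions[-1]["clones"].append(index + 1)
        else:
            regions.append({"signature": list(signature), "clones": [index + 1]})
    return regions

# 9 clones x 6 samples (S1..S6): three identical rows, one amplified clone, five normal rows.
calls = [[0, 1, 0, 0, 1, 1]] * 3 + [[2, 2, 2, 2, 1, 2]] + [[0] * 6] * 5
regions = collapse_into_regions(calls)
```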

[Figure: call matrix of clones 1-9 (ordered by genomic location) for samples S1-S6, collapsed into regions 1-3]

3. Regions

57

3. Construction of regions

Notation

The call matrix is a C × N matrix whose entries are the calls for clone i of sample s.

A region is represented by a sub-matrix of the call matrix:

Definition of region

A sequence of clones that satisfies:

where d(•, •) is the L1-distance function.

58

3. Construction of regions

Rules

1. Between chromosomes.

2. At breakpoints with:

3. At the highest gradient of the region if it does not satisfy the definition. The gradient:

where $w_{ij} = 0$ if $i = j$, otherwise $w_{ij} = 1 / |i - j|$.

[Figure: two panels of adjacent clone distance against clones with threshold c, labelled “one region / another region” and “new region: nonetheless, d(•,•) > c”]

59

3. Construction of regions

The medoid is taken as the representative signature. The medoid is the signature with least average distance to the other signatures.

Representative signature
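
Picking the medoid of a region can be sketched as follows, using the L1 distance that the deck uses for signatures. Minimizing the total distance to all signatures is equivalent to minimizing the average distance to the others, since a signature has distance zero to itself. Names and the toy data are assumptions.

```python
def l1_distance(a, b):
    """L1 distance between two signatures of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))

def medoid(signatures):
    """Return the signature with the smallest total L1 distance to all signatures."""
    def total_distance(candidate):
        return sum(l1_distance(candidate, other) for other in signatures)
    return min(signatures, key=total_distance)

# Four clone signatures of one region (calls for three samples each).
region = [[1, 1, 0], [1, 1, 0], [1, 2, 0], [0, 1, -1]]
representative = medoid(region)
```

Unlike a mean, the medoid is always one of the observed signatures, so it stays on the ordinal call scale.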

[Figure: call matrix of clones 1-9 (ordered by genomic location) for samples S1-S6, with a few deviating calls inside the regions; the signatures of region 1, region 2, and region 3 are indicated]

60

3. Construction of regions

Choice of c

Threshold c is set equal to $c_{max}$, where

and T is the maximum sustained proportion of mis-predictions.

a(c) is a measure of prediction error:

where K is the set of clone regions with many aberrations, and $A_m$ is the medoid signature.


61

Colorectal tumor data from Douglas et al. (2004): from 3129 clones to 68 regions.

3. Application to Douglas et al. (2004)

62

T = 0.01

T = 0.025

3. Application to Douglas et al. (2004)

63

4. Hypothesis testing

64

Compare condition 1 and condition 2

                    Condition 1            Condition 2
Chr.  Pos.  Clone   #loss  #norm.  #gain   #loss  #norm.  #gain
1     …     1       3      7       2       1      11      0
1     …     2       3      7       2       1      11      0
…     …     …       …      …       …       …      …       …
18    …     2384    2      8       2       1      9       3
18    …     2385    2      8       2       1      9       3
…     …     …       …      …       …       …      …       …

• Normal distribution does not hold
• t-test results in WRONG p-values!

[Figure: density of the normal distribution with mean 0 and standard deviation 1]

4. Hypothesis testing

65

- Two-sample Wilcoxon test corrected for ties, deals with discreteness and natural ordering of the levels.

- A fast method to generate correct, exact p-values for multi-array called aCGH data.

                    Condition 1            Condition 2
Chr.  Pos.  Clone   #loss  #norm.  #gain   #loss  #norm.  #gain
1     …     1       3      7       2       1      11      0
1     …     2       3      7       2       1      11      0
…     …     …       …      …       …       …      …       …
18    …     2384    2      8       2       1      9       3
18    …     2385    2      8       2       1      9       3
…     …     …       …      …       …       …      …       …

CGHMultiArray (Van de Wiel et al., 2005)
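
To make the idea of an exact, tie-corrected Wilcoxon p-value concrete, the sketch below enumerates every possible split of the observed losses, normals, and gains over the two conditions. This brute-force enumeration is a stand-in chosen for clarity, not the fast algorithm used by CGHMultiArray; the two-sided definition and all names are assumptions.

```python
from math import comb

def exact_wilcoxon_pvalue(counts1, counts2):
    """Two-sided exact permutation p-value of the midrank Wilcoxon rank-sum statistic.

    counts1, counts2: (#loss, #normal, #gain) in condition 1 and condition 2.
    """
    totals = [a + b for a, b in zip(counts1, counts2)]
    n1, n = sum(counts1), sum(totals)

    # All samples in one category share the average (mid) rank of that category.
    midranks, below = [], 0
    for t in totals:
        midranks.append(below + (t + 1) / 2)
        below += t

    def rank_sum(table):
        return sum(c * r for c, r in zip(table, midranks))

    expected = n1 * (n + 1) / 2
    observed = abs(rank_sum(counts1) - expected)

    p_value = 0.0
    for loss in range(min(totals[0], n1) + 1):
        for normal in range(min(totals[1], n1 - loss) + 1):
            gain = n1 - loss - normal
            if gain > totals[2]:
                continue
            # Multivariate hypergeometric probability of this table under H0.
            prob = (comb(totals[0], loss) * comb(totals[1], normal)
                    * comb(totals[2], gain)) / comb(n, n1)
            if abs(rank_sum((loss, normal, gain)) - expected) >= observed - 1e-9:
                p_value += prob
    return p_value

p_clone = exact_wilcoxon_pvalue((3, 7, 2), (1, 11, 0))  # counts of clone 1 in the table above
p_flat = exact_wilcoxon_pvalue((0, 12, 0), (0, 12, 0))  # all samples normal
```

The null distribution depends on the column totals (total losses, normals, gains), which is why it has to be recomputed for each distinct data structure. When every sample is normal there is only one possible table, so the p-value is 1.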

4. Hypothesis testing

66

Maths behind CGHMultiArray

Q: “There exist one-line formulas to approximate the p-value, can’t we use these?”

A: “No, for this situation these approximations may be a factor 2 to 3 off!”

Q: “Couldn’t we generate the distribution of the test statistic just once and then read off the p-values for all clones?”

A: “No, the structure of the data (total number of losses, normals, and gains) is different for many clones; the p-value has to be re-computed for each clone with a previously unobserved structure.”

4. Hypothesis testing


67

4. Hypothesis testing

Loads of identical p-values.

68

Life scientist: “That’s nice, but those pessimistic statisticians always tell me to correct for ‘multiple testing’ which means multiplying the p-value by a huge factor!”

Life scientist: “Moreover, subsequent clones are highly correlated. I would like to make statements about chromosomal regions rather than about individual clones.”

Combine CGHMultiArray with dimension reduction, and apply special FDR-correction

4. Hypothesis testing

69

FDR control

Standard Benjamini-Hochberg far too conservative

Name               Chr  Pos               Condition 1           Condition 2           Condition 1 + 2
                                          #loss #norm #gains    #loss #norm #gains    #loss #norm #gains
BAC1432 - BAC1476  8    670946 - 8769451  0     12    0         0     12    0         0     24    0

Case above: minimal p-value equals 1!!!

Gilbert (2005): “For FDR correction for a particular case with p-value = P only include other cases that, based on the aggregated data, can reach a p-value ≤ P.”

Name             Chr  Pos                Condition 1           Condition 2           Significance
                                         #loss #norm #gains    #loss #norm #gains    p-val    FDR-BH  FDR-Gil
BAC738 - BAC755  8    1432258 - 1260024  0     2     10        1     10    1         0.00055  0.072   0.030

4. Hypothesis testing

70

4. Re-analysis of Douglas et al. (2004)

Original analysis

Experimental design (Douglas et al., 2004)
aCGH profiles of 7 MSI+ colorectal cancers, and
aCGH profiles of 30 CIN+ colorectal cancers.

Research question
Are there genomic differences between the two groups? If so, what is their location?

Results
Chromosome 20 and chromosome arms 8p, 17p, 18q are reported.

Problem
No statistical underpinning of the findings is provided!

71

4. Re-analysis of Douglas et al. (2004)

Clone data

Pre-processed (missing values, normalization, segmentation, calling) clone data are used for testing.

Analysis of the clone data reveals that the smallest FDR-value equals 0.3815.

Region data

Pre-processed data are transformed into regions.

Region data are used for testing.

72

Chr.  p-value  FDR
8     0.00166  0.013
20    0.00464  0.036
18    0.00618  0.036
18    0.00618  0.036
8     0.00753  0.036
8     0.01677  0.044
18    0.01961  0.045
18    0.02259  0.045
13    0.02264  0.045  (Extra)
17    0.09346  0.204  (Not found)
….

[Table columns Start BP (start and end of each region) and #clones: values garbled beyond recovery]

4. Re-analysis of Douglas et al. (2004)


73

5. Clustering

74

5. Introduction

Research question (Fridlyand et al., 2006)
Can DNA copy number profiles be used to divide ductal invasive breast cancers into subgroups?

Experimental design
DNA copy number profiles of 67 tumors.

Results
Cluster analysis identified 4 subgroups: known (BRCA), others not clinically recognized yet.

Conclusion
DNA copy number profiles can be used to identify subtypes of breast cancer.

75

5. Introduction

Motivation

Currently, aCGH data are clustered with techniques for expression data.

These do not take into account the nature of the data.

WECCA (Van Wieringen et al., 2007a)

WECCA: WEighted Clustering of Called aCGH data

A cluster method:
- tailor-made for the discrete nature of called aCGH data,
- that uses easily interpretable similarity measures,
- that allows certain clones to have more influence,
- that yields compact, well-separated clusters.

76

5. Introduction

Objective of cluster analysis

Cluster analysis seeks
• meaningful data-determined groupings of samples, such that
• samples are more “similar” within than across groups,
• this similarity in copy number is assumed to imply some form of regulatory or functional similarity of samples.

WECCA is a hierarchical cluster method that produces a nested sequence of clusters, represented by a dendrogram.

Central to WECCA is the notion of similarity.

77

5. Similarity

Similarity

Central to WECCA is the notion of similarity (or distance) between objects being clustered.

Properties of a similarity $S(x_i, x_j)$:
• $S(x_i, x_j)$ takes on values between 0 and 1,
• $S(x_i, x_j) = 0$ means that $x_i$ and $x_j$ are not similar at all.
• $S(x_i, x_j) = 1$ reflects maximum similarity.
• In particular, $S(x_i, x_i) = 1$.
• $S(x_i, x_j)$ is symmetric: $S(x_i, x_j) = S(x_j, x_i)$.

78

Agreement

The copy number of a clone of two samples is in agreement if they are identical.

Agreement similarity

The probability that the copy number of an arbitrary clone of samples $i_1$ and $i_2$ agree:

This can be unbiasedly estimated:
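
A natural estimate of the agreement probability is the fraction of clones on which the two samples carry the same call. The estimator formula itself did not survive on the slide, so this form is an assumption, as are the names and the toy profiles.

```python
def agreement_similarity(calls_a, calls_b):
    """Fraction of clones on which two called profiles are identical."""
    matches = sum(1 for a, b in zip(calls_a, calls_b) if a == b)
    return matches / len(calls_a)

# Toy called profiles (-1 = loss, 0 = normal, 1 = gain) of two samples.
s1 = [-1, -1, 1, 1, 0]
s2 = [-1, 0, 1, 1, 0]
similarity = agreement_similarity(s1, s2)
```

The value lies in [0, 1], equals 1 for a profile compared with itself, and is symmetric, which are the similarity properties listed earlier.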

[Figure: calls (-1, 0, 1) of clones 1, 2, 3, ..., p-1, p for samples S1 and S2]


79

5. Similarity

Properties of agreement similarity

The probability of agreement is a similarity, for:
• $P(X_{i_1 j} = X_{i_2 j}) \in [0,1]$.
• $P(X_{i_1 j} = X_{i_1 j}) = 1$.
• $P(X_{i_1 j} = X_{i_2 j}) = P(X_{i_2 j} = X_{i_1 j})$.

In addition, it is:
• Location invariant.
• Scale invariant.
• Consistent, that is, two samples are more similar if they stem from the same cluster than if they do not:
  $P(X_{i_1 j} = X_{i_2 j} \mid Z_{i_1} = Z_{i_2}) \geq P(X_{i_1 j} = X_{i_2 j} \mid Z_{i_1} \neq Z_{i_2})$.

80

Concordance

The copy numbers of two clones of two samples are in concordance if they agree on which clone has the largest copy number.

Concordance similarity

The probability that the copy numbers of two arbitrary clones of samples $i_1$ and $i_2$ are in concordance:
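
Concordance can be estimated by looking at all pairs of clones and counting the pairs that both samples order in the same way. The handling of ties below (a pair counts as concordant only if both samples tie, or both rank the same clone higher) is an assumption, as are the names; the slide's own formula did not survive.

```python
from itertools import combinations

def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)

def concordance_similarity(calls_a, calls_b):
    """Fraction of clone pairs that two called profiles order in the same way."""
    pairs = list(combinations(range(len(calls_a)), 2))
    concordant = sum(
        1 for j1, j2 in pairs
        if sign(calls_a[j1] - calls_a[j2]) == sign(calls_b[j1] - calls_b[j2])
    )
    return concordant / len(pairs)

# Toy called profiles (-1 = loss, 0 = normal, 1 = gain) of two samples.
s1 = [-1, -1, 1, 1, 0]
s2 = [-1, 0, 1, 1, 0]
similarity = concordance_similarity(s1, s2)
```

Because only the ordering within each sample matters, adding a constant to all calls of one sample leaves the concordance unchanged.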

81

5. Similarity

[Figure: calls of clones 1, 2, 3, ..., p-1, p for samples S1 and S2, shown twice: once marking pairs of clones that are in concordance, once marking pairs of clones that are in discordance]

82

5. Weighting

Weighting

Allow for the weighting of clones to give them a larger influence on the clustering.

Motivation, e.g.:
• Some chromosomal regions are more important.
• Some chromosomes have a higher gene density.

Types of weighting

Three types of weighting:
1) Per clone: researcher assigns weights to each clone.
2) Per region: data-driven weighting.
3) Combination of 1 and 2.

83

5. Weighting

Specify a weight wj ≥ 0 for each clone.

The estimator of the probability of agreement is modified straightforwardly:

Similarly, for the probability of concordance, e.g.:

The weighted estimators are also unbiased.

Weighting per clone
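
A weighted version of the agreement estimate might look like the sketch below, where each clone contributes its weight instead of 1 and the result is normalized by the total weight. The normalization and the names are assumptions, since the slide's formula did not survive.

```python
def weighted_agreement(calls_a, calls_b, weights):
    """Weighted fraction of clones on which two called profiles are identical."""
    agreeing = sum(w for a, b, w in zip(calls_a, calls_b, weights) if a == b)
    return agreeing / sum(weights)

s1 = [-1, -1, 1, 1, 0]
s2 = [-1, 0, 1, 1, 0]
equal = weighted_agreement(s1, s2, [1, 1, 1, 1, 1])
upweighted = weighted_agreement(s1, s2, [1, 10, 1, 1, 1])  # clone 2 counts ten-fold
```

With equal weights this reduces to the plain fraction of agreeing clones; up-weighting the one clone on which the samples disagree lowers the similarity.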

84

5. Weighting

Signature of region

          S1  S2  S3  S4  S5  S6
Region 1  0   1   0   0   1   1
Region 2  2   2   2   2   1   2
Region 3  0   0   0   0   0   0

All regions have equal weight in the clustering!

[Figure: call matrix of clones 1-9 (ordered by genomic location) for samples S1-S6, from which the region signatures are derived]


85

5. Linkage

[Figure: Cluster A and Cluster B with the dissimilarity used by each linkage]

Single linkage: minimum dissimilarity
Average linkage: average dissimilarity
Complete linkage: maximum dissimilarity

86

5. Linkage

Total linkage

Total linkage measures the overall (not just pairwise) similarity between the samples in the clusters.

E.g., the agreement similarity becomes the probability of all samples being in agreement, which is estimated by:

Total linkage produces more compact clusterings.

[Figure: total linkage as the overall similarity of Cluster A and Cluster B]

87

5. Linkage

Example: agreement with total linkage

[Figure: calls (-1, 0, 1) of clones 1, 2, 3, ..., p-1, p for samples S1-S6, divided over Cluster A and Cluster B]

88

5. Linkage

Illustration of several types of linkage, on oral squamous cell carcinoma data from Snijders et al. (2005).

Total

Average Complete

Effects of linkage

89

5. WECCA

WECCA consists of:

Step 1: Assign weights to each clone, or construct regions.

Step 2: Form initial clusters, each containing one sample.

Step 3: Calculate the similarity between all cluster pairs.

Step 4: Merge the two clusters with highest similarity.

Step 5: Iterate between step 3 and 4 until one final cluster remains.

Step 6: Determine the cut-off for the obtained dendrogram.

WECCA (WEighted Clustering of Called aCGH data)
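
Steps 2 to 5 above can be sketched as a plain agglomerative loop. The sketch uses agreement with total linkage (the fraction of clones on which all samples of the two clusters carry the same call) as the cluster similarity; weighting or region construction (step 1) and the choice of cut-off (step 6) are left out. It is an illustration under these assumptions, with hypothetical names, not the WECCA implementation.

```python
def total_agreement(samples):
    """Fraction of clones on which all given called profiles carry the same call."""
    n_clones = len(samples[0])
    same = sum(1 for j in range(n_clones) if len({s[j] for s in samples}) == 1)
    return same / n_clones

def wecca_sketch(profiles):
    """Agglomerative clustering of called profiles; returns the merge history."""
    clusters = [[name] for name in profiles]               # step 2: one sample per cluster
    merges = []
    while len(clusters) > 1:                               # step 5: iterate
        best = None
        for i in range(len(clusters)):                     # step 3: all cluster pairs
            for j in range(i + 1, len(clusters)):
                members = [profiles[n] for n in clusters[i] + clusters[j]]
                similarity = total_agreement(members)
                if best is None or similarity > best[0]:
                    best = (similarity, i, j)
        similarity, i, j = best                            # step 4: merge the most similar pair
        merges.append((tuple(clusters[i]), tuple(clusters[j]), similarity))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

profiles = {
    "S1": [-1, -1, 1, 1, 0],
    "S2": [-1, 0, 1, 1, 0],
    "S3": [0, 0, 0, 1, 1],
    "S4": [0, 0, 0, 0, 1],
}
history = wecca_sketch(profiles)
```

The merge history, with the similarity at which each merge happened, is the information a dendrogram displays.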

90

5. Simulation

• Set-up from Willenbrock & Fridlyand (2005).

• 100 datasets, 20 or 40 samples divided into two clusters.

• 1 chromosome, 500 clones.

• Log2 ratios are median normalized.

• Segmentation with DNAcopy (Olshen et al., 2004).

• Calls are made using CGHcall (Van de Wiel et al., 2007a).

• Region data are constructed from the called data.

• Continuous, called, and region data are hierarchically clustered.

• Six external validation measures are registered.

Specifics


91

5. Simulation

20 samples

Data   Sim.   Link.  Adj. R  F     NMI
cont.  pear.  aver.  0.91    0.97  0.91
cont.  pear.  comp.  0.90    0.97  0.90
reg.   conc.  aver.  0.95    0.98  0.94
reg.   conc.  comp.  0.93    0.98  0.92
reg.   conc.  total  0.86    0.97  0.86
call.  conc.  aver.  0.81    0.94  0.81
call.  conc.  comp.  0.78    0.93  0.78
call.  conc.  total  0.65    0.89  0.68

92

5. Real-life dataset

Breast cancer data

Breast cancer data from Fridlyand et al. (2006):
• 67 tumor profiles,
• 56 with survival data,
• 2041 clones per profile,
• 236 resulting regions.

Apply WECCA with:
• concordance,
• full linkage,
• unweighted region data.

93

5. Real-life dataset

Choosing the cut-off: 7 clusters

[Figure: choice of the cut-off]

94

5. Real-life dataset

Importance of feature for clustering

[Figure: max. pairwise KL-distance (y-axis) against chr. position (x-axis)]

95

5. Real-life dataset

Pollack et al. (2002) present a list of potential and known oncogenes.

The majority is on 8q, 15q, 17q, 20q.

Re-apply WECCA with a 10-fold weight for regions on these chromosomal arms.

Breast cancer data

96

5. Real-life dataset

Clustering

Disease Spec. Surv.Overall Surv.

I

0.0090.165

II

0.0060.073

III

0.0020.004

IV

0.0050.007

V

0.0020.000

p-values of log-rank test for difference in survival

I : clustering of Fridlyand et al. (3 clusters)II : unwe. conc., full linkage (3 clusters)III : unwe. conc., full linkage (7 clusters)IV : we. conc., full linkage (3 clusters)V : we. conc., full linkage (6 clusters)


97

6. Integration with expression

98

6. Central dogma of biology

arrayCGH measures chromosomal copy number changes

Expression arrays measure differences in RNA levels

MALDI-TOF measures protein compositions

99

6. DNA → RNA

ACE-it (Van Wieringen et al., 2006)

ACE-it: Array CGH Expression integration tool

The central dogma of biology suggests that aCGH and expression data can be used to investigate the relation between DNA and RNA.

ACE-it consists of three steps:
• Link chromosomal position of the two platforms,
• Decide which genes are to be tested, and
• Test for difference in expression between normal and aberrated copy number.

100

6. Link chromosomal position

[Figure: genes (Gene 1, Gene 2, ..., Gene i, Gene j, ...) placed along the genome]

A copy number at each genomic location

Missing data; overlapping clones; genomic segments not covered

Bioinformatics, calling

101

6. Gene selection

Assumption

A gene is either:
- Oncogene: only normals and gains.
- Tumor suppressor gene: only losses and normals.

Genes with losses, normals and gains are “biologically” uninterpretable.

ACE-it tests
loss ↔ normal, or
normal ↔ gain
in genes with a non-contaminated and representative copy number distribution.

102

6. Gene selection

Contamination

Contamination is defined as the number of samples in the group not used in the comparison.

The contamination should not exceed a threshold.

Take the contamination threshold equal to 5.

A contaminated copy number distribution (# losses = 17, # normals = 16, # gains = 17).

A non-contaminated copy number distribution (# losses = 2, # normals = 24, # gains = 24).


103

6. Gene selection

Representativeness

Representativeness is defined as the minimum number of samples in the groups used in the comparison.

The representativeness should exceed a threshold.

Take the representativeness threshold equal to 5.

A representative copy number distribution (# losses = 0, # normals = 25, # gains = 25).

A non-representative copy number distribution (# losses = 0, # normals = 48, # gains = 2).
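
The two gene-selection criteria can be combined in a small filter: for each candidate comparison, the contamination is the size of the unused group and the representativeness is the size of the smaller of the two groups being compared. The sketch below follows those definitions; preferring the more representative comparison when both qualify is an assumption, as are the names.

```python
def select_comparison(n_loss, n_normal, n_gain,
                      contamination_max=5, representativeness_min=5):
    """Return the comparison to test for a gene, or None if the gene is excluded."""
    candidates = [
        ("loss vs. normal", (n_loss, n_normal), n_gain),   # gains are the unused group
        ("normal vs. gain", (n_normal, n_gain), n_loss),   # losses are the unused group
    ]
    eligible = [
        (name, min(groups))
        for name, groups, unused in candidates
        if unused <= contamination_max and min(groups) > representativeness_min
    ]
    if not eligible:
        return None
    # If both comparisons qualify, take the more representative one (assumption).
    return max(eligible, key=lambda item: item[1])[0]

examples = [
    select_comparison(17, 16, 17),  # contaminated
    select_comparison(2, 24, 24),   # non-contaminated
    select_comparison(0, 25, 25),   # representative
    select_comparison(0, 48, 2),    # non-representative
]
```

The four calls use the copy number distributions quoted on the contamination and representativeness slides, with both thresholds equal to 5.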

104

6. Testing

The one-sided Wilcoxon rank sum test is used to test :

$H_0: \mathrm{median}_{gain} \leq \mathrm{median}_{normal}$,
or,
$H_0: \mathrm{median}_{normal} \leq \mathrm{median}_{loss}$.

The BH-multiplicity correction is used to control the False Discovery Rate.

Testing
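
The Benjamini-Hochberg correction mentioned here turns raw p-values into adjusted p-values that can be compared against the desired FDR level. A minimal sketch of the standard step-up adjustment follows; the function name is an assumption.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [a <= 0.20 for a in adjusted]  # alpha = 0.20, as in the ACE-it example
```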

105

6. Real-life dataset

Breast cancer data

Breast cancer data from Pollack et al. (2002):
• expression and aCGH profiles of
• 41 tumors, consisting of
• 5889 genes.

ACE-it settings:
• Contamination threshold = 9.
• Representativeness threshold = 9.
(results in 2450 genes taken along for analysis)
• α = 0.20.

106

6. Real-life dataset

Name          Chr  Start      End        Comp     #L  #N  #G  Raw.p     Adj.p
IMAGE:245198  1    117621906  117622206  n vs. g  6   25  10  0.000842  0.0368
IMAGE:432564  1    169737131  169737431  n vs. g  0   24  17  2.3e-05   0.0106
IMAGE:245015  1    175305900  175306200  n vs. g  0   19  22  0.000786  0.0368
IMAGE:878406  1    176394346  176394646  n vs. g  0   18  23  0.001146  0.0435
IMAGE:47665   1    176432278  176432578  n vs. g  0   22  19  0.000169  0.0238
IMAGE:267458  1    176895144  176895444  n vs. g  0   22  19  6e-06     0.0076
IMAGE:782718  1    177854451  177854751  n vs. g  0   30  11  0.000439  0.0326
IMAGE:244801  1    178324754  178325054  n vs. g  0   25  16  0.000648  0.0345
IMAGE:725630  1    181699421  181699721  n vs. g  2   21  18  0.000838  0.0368
IMAGE:795185  1    204750990  204751290  n vs. g  1   24  16  0.000237  0.0270
…

In total 354 genes are found whose expression is significantly affected by copy number.

107

6. Real-life dataset

Among the genes found by ACE-it is ERBB2.

ERBB2 is a well-known oncogene, which is amplified in 20 to 30% of the breast cancer cases.

108

6. Real-life dataset

[Figure: plot with axes labelled FDR-value and Proportion]

109

Acknowledgements

This work was in part supported by the Center for Medical Systems Biology (CMSB) established by the Netherlands Genomics Initiative / Netherlands Organization for Scientific Research (NGI/NWO).


110

References

Douglas et al. (2004), “Array Comparative genomic hybridization analysis of colorectal cancer cell lines and primary carcinomas”, Cancer Research, 64, 4817-4825.
Fridlyand et al. (2006), “Breast tumor copy number aberration phenotypes and genomic instability”, BMC Cancer.
Jong et al. (2004), “Breakpoint identification and smoothing of array Comparative Genomic Hybridization data”, Bioinformatics, 20, 3636-3637.
Gilbert et al. (2005), “A modified false discovery rate multiple comparisons procedure for discrete data, applied to human immunodeficiency virus genetics”, Applied Statistics, 54, 143-158.
Olshen et al. (2004), “Circular binary segmentation for the analysis of array-based DNA copy number data”, Biostatistics, 5, 557-572.
Pollack et al. (2002), “Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors”, PNAS, 99, 12963-12968.
Snijders et al. (2005), “Rare amplicons implicate frequent deregulation of cell fate specification pathways in oral squamous cell carcinoma”, Oncogene, 24, 4232-4242.
Van de Wiel (2001), “The split-up algorithm: a fast symbolic method for computing P-values of rank statistics”, Comput. Stat., 16, 519-538.
Van de Wiel et al. (2005), “CGHMultiArray: exact P-values for multi-array comparative genomic hybridization data”, Bioinformatics, 21, 3193-3194.
Van de Wiel et al. (2007a), “CGHcall: calling aberrations for array CGH tumor profiles”, Bioinformatics, to appear.
Van de Wiel et al. (2007b), “CGHregions: dimension reduction for array CGH data with minimal information loss”, Cancer Informatics, to appear.
Van Wieringen et al. (2006), “ACE-it: a tool for genome-wide integration of gene dosage and RNA expression data”, Bioinformatics, 22, 1919-1920.
Van Wieringen et al. (2007a), “Weighted clustering of called aCGH data”, technical report, Vrije Universiteit Amsterdam.
Van Wieringen et al. (2007b), “Normalized, segmented or called aCGH data”, technical report, Vrije Universiteit Amsterdam.
Willenbrock et al. (2005), “A comparison study: applying segmentation to array CGH data for downstream analyses”, Bioinformatics, 21, 4084-4091.
Wilting et al. (2006), “Increased gene copy numbers at chromosome 20q are frequent in both squamous cell carcinomas and adenocarcinomas of the cervix”, J. Pathol., 209, 220-230.