Analysis of array CGH data - VUwvanwie/presentations/WNvanWieringen...1 Analysis of aCGH data...
1
Analysis of aCGH data
Wessel van Wieringen
Department of Mathematics, Vrije Universiteit Amsterdam
2
0. Contributors
Statistics:
- Kyung In Kim
- Aad van der Vaart
- Mark van de Wiel
- Wessel van Wieringen
Bioinformatics:
- Eskeatnaf Achame
- Jeroen Belien
- Kees Jong
- Sjoerd Vosse
Biology:
- Saskia Wilting
- Bauke Ylstra
3
0. Outline
Topics discussed
1. aCGH
2. Pre-processing
3. Dimension reduction
4. Hypothesis testing
5. Clustering
6. Integration with expression
4
1. aCGH
5
Chromosomes of a tumor cell
Technique: SKY
1. aCGH
6
tumor cell, normal cell
1. Classical CGH
7
normal cell, normal cell
hybridization
after hybridization
1. Classical CGH
8
tumor cell, normal cell
hybridization
after hybridization
1. Classical CGH
9
1. Classical CGH
CGH 5-10 Mb resolution vs array CGH 0.8 Mb resolution
10
1. arrayCGH
Array element: probe, clone, oligo
BACs
chr. 1 chr. 2 chr. 3 chr. 4
Resolution
• Human genome is 3000 Mb (about 3 Gb),
• 30 000 BACs to cover the human genome,
• Max resolution is the size of the BAC, 100-150 kb.
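As a quick check of the numbers on this slide, here is a minimal sketch of the arithmetic: dividing the genome length by the BAC size gives the number of non-overlapping BACs needed to tile the genome. The function and variable names are illustrative only.

```python
GENOME_MB = 3000           # human genome length in Mb, as stated on the slide
BAC_SIZES_KB = (100, 150)  # BAC size range in kb, as stated on the slide

def bacs_to_tile(genome_mb: float, bac_kb: float) -> int:
    """Number of non-overlapping BACs of size bac_kb needed to cover genome_mb."""
    return round(genome_mb * 1000 / bac_kb)

for size_kb in BAC_SIZES_KB:
    print(f"{size_kb} kb BACs: {bacs_to_tile(GENOME_MB, size_kb)}")
```

With 100 kb BACs this gives the 30 000 quoted on the slide; with 150 kb BACs, 20 000 would suffice.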
11
1. arrayCGH
Hybridize
test sample, reference sample
12
1. arrayCGH
Hybridize
test sample, reference sample
13
1. arrayCGH
Hybridize
test sample, reference sample
Scan
14
1. arrayCGH
Scan
Image of the array
Image of a spot
log2(G/R)-ratio
Quantification of intensity
15
1. aCGH profiles
Log2-ratios plotted against the genomic order of clones.
Chromosomes: 1 2 3 4 5 6 7 8 … X
16
1. aCGH profiles
Gain
Loss
Amplification
Normal
17
1. aCGH profiles
Males have only one copy of the X-chromosome.
1 2 3 4 5 6 7 8 ……. X
Corroboration
18
1. aCGH profiles
Deletion on 5p
Corroboration
19
1. aCGH profiles
New findings
20
1. aCGH
aCGH measures
• DNA copy number,
• genome-wide,
• with high resolution.
aCGH data are used to
- Test for chromosomal aberrations,
- Distinguish one tumor class from another (diagnosis),
- Screen for new drug targets,
- Predict clinical outcome,
- Find new subclasses,
- etc.
21
1. arrayCGH
Figure: number of PubMed items matching “Array CGH” per year, 1997-2006 (vertical axis 0-700; bar labels 1, 3, 3, 10, 19, 41, 64, 146, 610, 266*).
22
2. Pre-processing
- Normalization
- Segmentation
- Calling
23
2. Normalization
Motivation for normalization
Log2-ratios from different hybridizations are compared.
Normalization aims to make log2-ratios from different hybridizations comparable.
Types of normalization
• Median normalization.
• Mode normalization.
• Spatial normalization.
24
2. Normalization
Shift median to zero
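Median normalization can be sketched in a few lines. This is a minimal illustration, assuming one hybridization is given as a plain list of log2-ratios; the function name and the example values are made up.

```python
from statistics import median

def median_normalize(log2_ratios):
    """Shift one hybridization's log2-ratios so that their median is zero."""
    m = median(log2_ratios)
    return [x - m for x in log2_ratios]

profile = [0.30, 0.10, 0.20, 0.90, 0.20]   # hypothetical log2-ratios
normalized = median_normalize(profile)
print(median(normalized))
```

Applying the same shift to every array puts their log2-ratios on a common baseline. Spatial effects are not handled by this step.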
25
2. Normalization
Spatial effects are present!
Subtract loess curve
26
2. Pre-processing
- Normalization
- Segmentation
- Calling
27
2. Segmentation
Segmentation
Divide the genome into contiguous segments.
Clones that belong to the same segment are assumed to have the same underlying copy number.
Segmentation is also called smoothing.
segment 1 … segment i …
28
2. Segmentation
Why segmentation?
• Noise reduction.
• Detection of aberration (loss, normal, gain).
• Breakpoint analysis.
Recurrent (over tumors) aberrations may indicate:
- an oncogene, or
- a tumor suppressor gene.
Difficulties for segmentation
• Measurements are relative to a reference sample.
• Printing, labeling and hybridization may be uneven.
• Tumor sample is inhomogeneous.
29
2. Segmentation
Copy numbers are integers:
“discrete smoothing”!
30
2. Segmentation
31
2. Segmentation
A segmentation can be described by:
- a number of breakpoints, and
- the corresponding levels (or states).
Levels
Breakpoints
Variance
Segment
32
2. Segmentation
Breakpoint detection
Identify possibly damaged genes:
- These genes will not be expressed any more.
Identify recurrent breakpoint locations:
- Indicates fragile pieces of the chromosome.
Accuracy is important:
- Important genes may be located in a region with (recurrent) breakpoints.
33
2. Segmentation
Problem formalization (Jong et al., 2004)
aCGH values: x_1, …, x_n
Breakpoints: 0 < y_1 < … < y_N < x_N
Levels: μ_1, …, μ_N
Error variances: σ_1², …, σ_N²
Likelihood:
- A fitness function scores each segmentation according to its fitness to the data.
- Find the segmentation with the highest fitness.
- Assume the data are Gaussian.
- Use the ML-criterion with a penalization term for model complexity.
34
2. Segmentation
Fitness function
- ML-estimators of μ and σ² can be found explicitly.
- Add a penalty to the log-likelihood to control the number N of breakpoints:
- Maximizing the fitness is computationally hard.
- Use a genetic algorithm + local search to find an approximation to the optimum.
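To make the idea of a penalized fitness concrete, here is a minimal sketch. It assumes a Gaussian model with a separate ML mean and variance per segment, and a penalty that is simply a constant times the number of breakpoints. The exact penalty term of Jong et al. (2004) is not shown in this transcript, so the `penalty` parameter and its default are hypothetical, as are the function names. The search step (genetic algorithm + local search) is not included.

```python
import math

def segment_loglik(values):
    """Gaussian log-likelihood of one segment at its ML mean and variance."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((v - mean) ** 2 for v in values)
    var = max(ss / n, 1e-6)  # floor keeps the logarithm finite (assumption)
    return -0.5 * n * math.log(2 * math.pi * var) - ss / (2 * var)

def fitness(x, breakpoints, penalty=3.0):
    """Penalized log-likelihood of a segmentation of x.

    breakpoints: sorted indices b with 0 < b < len(x); a new segment starts at each b.
    penalty: hypothetical cost per breakpoint.
    """
    edges = [0, *breakpoints, len(x)]
    loglik = sum(segment_loglik(x[a:b]) for a, b in zip(edges, edges[1:]))
    return loglik - penalty * len(breakpoints)

# Hypothetical profile with one true breakpoint at index 5.
x = [0.0, 0.1, -0.1, 0.05, -0.05, 1.0, 1.1, 0.9, 1.05, 0.95]
for bps in ([], [5], [2, 5, 7]):
    print(bps, round(fitness(x, bps), 2))
```

Without the penalty, adding breakpoints never lowers the likelihood; the penalty is what makes the single breakpoint at index 5 score best here.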
35
2. Segmentation
Results comparable to those produced by hand by a local expert.
Figure: segmentation by the expert and by the algorithm.
36
2. Pre-processing
- Normalization
- Segmentation
- Calling
37
2. Calling
Calling
Calling is the process of categorizing the different segmentation states as ‘loss’, ‘normal’, ‘gain’, or ‘amplification’.
Why call? (Van Wieringen et al., 2007b)
- Nature of the data.
- Clearest interpretation.
- Necessary for inferences on copy number.
- Advantageous for down-stream analysis.
38
2. Calling
Calling with a simple cut-off (normalized data)
gain
normal
loss
39
Calling with a simple cut-off (segmented data)
2. Calling
gain
normal
loss
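Calling with a simple cut-off can be sketched as follows. The cut-off values of ±0.2 are placeholders chosen for illustration; the slides do not state the values used. Calls are coded as -1 = loss, 0 = normal, 1 = gain.

```python
def call_by_cutoff(log2_ratio, loss_cutoff=-0.2, gain_cutoff=0.2):
    """Return -1 (loss), 0 (normal) or 1 (gain) for one log2-ratio.

    The default cut-offs are hypothetical.
    """
    if log2_ratio < loss_cutoff:
        return -1
    if log2_ratio > gain_cutoff:
        return 1
    return 0

segment_levels = [-0.45, 0.03, 0.31]   # hypothetical segment means
calls = [call_by_cutoff(v) for v in segment_levels]
print(calls)
```

Applied to segmented data, every clone in a segment receives the same call, because all clones in the segment share one level.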
40
2. Calling
CGHcall (Van de Wiel et al., 2007a)
CGHcall is a calling method that makes extensive use of biology and the knowledge of the aCGH data.
Biological assumptions
Six copy numbers: double loss (0), single loss (1), normal (2), single gain (3), double gain (4), amplification (≥5).
Figure labels: all normal, levels differ; breakpoint locations; same segment, same copy number.
41
2. Calling
Formalization of biological assumptions
1. Independence between clones from different segments.
2. Tumors reasonably similar (otherwise cellularity correction).
3. Identifiability restrictions on parameters.
4. Variation constant within segments.
5. Log2-ratios are modelled by a hierarchical model of sample k, clone i, segment j, and underlying copy number l.
42
2. Calling
EM algorithm
The likelihood:
with
Initialize unknown parameters; compute membership probabilities; compute expected log-likelihood EL with respect to hidden states; maximize EL with respect to unknown parameters; update membership probabilities; iterate.
43
2. Calling
Output of calling methods
The results of a calling method are the call probabilities:
Chr.  Start BP  Clone  Loss   Normal  Gain   Ampl.
1     38099     2749   0.001  0.052   0.174  0.773
(the four probabilities sum to one)
CGHcall yields six probabilities.
44
2. Calling
Final calling
Call probabilities are used for generation of calls.
Loss: P_{jk}^{1} + P_{jk}^{2} > 0.5
Normal: P_{jk}^{3} ≥ 0.5
Gain: P_{jk}^{4} + P_{jk}^{5} + P_{jk}^{6} > 0.5 and P_{jk}^{6} < 0.5
Amplification: P_{jk}^{6} ≥ 0.5
While the final calling returns 4 classes only, it is essential to distinguish single gains from double gains in the model.
Otherwise the double gain data will bias the results for the single gains.
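The final calling step can be sketched as a small function. It reads the thresholds on this slide as: loss if P1 + P2 > 0.5, normal if P3 ≥ 0.5, amplification if P6 ≥ 0.5, gain if P4 + P5 + P6 > 0.5 with P6 < 0.5, where P1 to P6 are the six call probabilities in copy-number order. That reading, the function name, and returning `None` when no rule fires are assumptions made for illustration; this is not the CGHcall implementation.

```python
def final_call(p):
    """Map six call probabilities to one of four calls.

    p: probabilities of (double loss, single loss, normal,
       single gain, double gain, amplification) for one clone in one sample.
    """
    p1, p2, p3, p4, p5, p6 = p
    if p1 + p2 > 0.5:
        return "loss"
    if p3 >= 0.5:
        return "normal"
    if p6 >= 0.5:
        return "amplification"
    if p4 + p5 + p6 > 0.5:
        return "gain"
    return None  # no rule fires; left uncalled here (assumption)

print(final_call([0.0, 0.1, 0.2, 0.4, 0.2, 0.1]))
```

Keeping single gain, double gain and amplification as separate states in the model, and merging them only at this last step, is the point made on the slide.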
45
2. Calling
Figure: Log2(G/R) profile and probability (of an aberration), with losses and gains marked.
Probability > 0.5 → Loss
Probability > 0.5 → Gain
46
2. Calling
Figure: 3 modes versus 6 modes (Wilting et al., 2006).
47
2. Calling
Figure: amplifications on chromosome 17.
48
2. Calling
Alternative model
Original model: all chromosomal arms have equal probability of, say, a loss.
For some tumor types certain chromosomal arms are likely to contain more aberrations than others.
This is incorporated in the alternative model.
49
2. Calling
Figure: panels labelled CGH-call and CGH-classify.
50
2. Calling
Comparison: simulation
Method        Overall classification rate  True positive rate  True negative rate
CGHcall       98.6                         95.8                99.3
CGHclassify   90.8                         96.2                89.2
CLAC          95.1                         75.9                99.7
MergeLevels   88.0                         46.2                98.3
2sd           87.4                         36.3                99.9
Willenbrock & Fridlyand (2005) simulation model
51
2. Calling
Summary plot
89 oral carcinomas; Snijders et al. (2005)
52
2. Pre-processing
Pre-processed data
Chr.  Start BP  Clone  Array 1  Array 2  …  Array n
1     38099     1      0        0        …  0
1     614489    2      0        0        …  0
1     2413398   3      0        0        …  -1
…     …         …      …        …        …  …
18    731200    2384   1        0        …  2
18    3908046   2385   1        1        …  2
18    4825168   2386   1        1        …  2
…     …         …      …        …        …  …
23    204448    p      0        0        …  0
-1 = loss, 0 = normal, 1 = gain, 2 = amplification
53
3. Dimension reduction
54
3. Regions
Regions
The ordinal nature of the data causes many genomic segments to have the same values over consecutive clones.
A region is a series of neighboring clones on the chromosome whose aCGH-signature is shared by all clones.
A region can be a chromosome arm, or a small amplification.
Regions capture the essential features of the data.
CGHregions (Van de Wiel et al., 2007b)
Summarize clone data into region data, and use regions.
55
3. Regions
Calls of samples S1-S6 at clones 1-9 (rows in order of genomic location):
          S1  S2  S3  S4  S5  S6
clone 1    0   1   0   0   1   1   Region 1
clone 2    0   1   0   0   1   1   Region 1
clone 3    0   1   0   0   1   1   Region 1
clone 4    2   2   2   2   1   2   Region 2
clone 5    0   0   0   0   0   0   Region 3
clone 6    0   0   0   0   0   0   Region 3
clone 7    0   0   0   0   0   0   Region 3
clone 8    0   0   0   0   0   0   Region 3
clone 9    0   0   0   0   0   0   Region 3
56
3. Regions
Figure: the calls of samples S1-S6 at clones 1-9 (in order of genomic location), reduced to one signature per region:
          S1  S2  S3  S4  S5  S6
Region 1   0   1   0   0   1   1
Region 2   2   2   2   2   1   2
Region 3   0   0   0   0   0   0
Illustration of the principle.
But … too simple!
57
3. Construction of regions
A sequence of clones that satisfies:
where d(•, •) is the L1-distance function.
Definition of region
is a C × N matrix with entries the call for clone i of sample s.
A region is represented by a sub-matrix of :
Notation
58
3. Construction of regions
Rules
1. Between chromosomes.
2. At breakpoints with:
3. At the highest gradient of the region if it does not satisfy the definition. The gradient:
where w_ij = 0 if i = j, otherwise w_ij = 1 / |i − j|.
Figure: two panels plotting adjacent clone distance against clones, with threshold c. Panel labels: “one region”, “another region”, “new region”; and “nonetheless, d(•,•) > c”.
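The splitting idea can be illustrated with a minimal sketch. It assumes that a new region starts wherever the L1-distance between the call signatures of two adjacent clones exceeds the threshold c. The exact condition of rule 2 and the gradient of rule 3 are not recoverable from this transcript, so this is an illustration of the principle, not the CGHregions algorithm. The function names and the unnormalized distance are assumptions.

```python
def l1_distance(a, b):
    """L1-distance between two call signatures (one call per sample)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def split_into_regions(calls, c=0):
    """Split clones of one chromosome into regions.

    calls: list of clone signatures in genomic order.
    Returns (start, end) index pairs, end exclusive.
    """
    regions, start = [], 0
    for i in range(1, len(calls)):
        if l1_distance(calls[i - 1], calls[i]) > c:  # assumed break condition
            regions.append((start, i))
            start = i
    regions.append((start, len(calls)))
    return regions

# Clones 1-9 of the six-sample illustration (S1..S6 per clone).
calls = [[0, 1, 0, 0, 1, 1]] * 3 + [[2, 2, 2, 2, 1, 2]] + [[0, 0, 0, 0, 0, 0]] * 5
print(split_into_regions(calls))
```

On the illustration data this recovers the three regions: clones 1-3, clone 4, and clones 5-9. Rule 1 (always break between chromosomes) amounts to calling the function once per chromosome.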
59
3. Construction of regions
Representative signature
The medoid is taken as the representative signature.
The medoid is the signature with the least average distance to the other signatures.
Figure: calls of samples S1-S6 at clones 1-9 (in order of genomic location), now with an isolated loss (-1) at clone 6 in one sample and at clone 8 in another; the signatures of regions 1, 2 and 3 are marked.
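Picking the medoid can be sketched as below, using the L1-distance between signatures. The example signatures mimic a region of five clones in which two clones carry an isolated loss; which samples carry those losses is made up for the example.

```python
def l1_distance(a, b):
    """L1-distance between two call signatures."""
    return sum(abs(x - y) for x, y in zip(a, b))

def medoid(signatures):
    """Signature with the smallest total (hence average) distance to the others."""
    return min(signatures, key=lambda s: sum(l1_distance(s, t) for t in signatures))

# Hypothetical clone signatures of one region (six samples per clone).
region = [
    [0, 0, 0, 0, 0, 0],
    [0, -1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, -1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(medoid(region))
```

Unlike a mean, the medoid is always one of the observed signatures, so the representative stays on the discrete call scale.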
60
3. Construction of regions
Choice of c
Threshold c is set equal to c_max, where
and T is the maximum sustained proportion of mis-predictions.
a(c) is a measure of prediction error:
where K is the set of clone regions with many aberrations, and A_m is the medoid signature.
61
3. Application to Douglas et al. (2004)
Colorectal tumor data from Douglas et al. (2004): from 3129 clones to 68 regions.
62
T = 0.01
T = 0.025
3. Application to Douglas et al. (2004)
63
4. Hypothesis testing
64
4. Hypothesis testing
Compare condition 1 and condition 2
                     Condition 1              Condition 2
Chr.  Pos.  Clone    #loss  #norm.  #gain     #loss  #norm.  #gain
1     …     1        3      7       2         1      11      0
1     …     2        3      7       2         1      11      0
…     …     …        …      …       …         …      …       …
18    …     2384     2      8       2         1      9       3
18    …     2385     2      8       2         1      9       3
…     …     …        …      …       …         …      …       …
• Normal distribution does not hold
• t-test results in WRONG p-values!
Figure: density of the normal distribution (mean 0, std. dev. 1), x from -5 to 5.
65
4. Hypothesis testing
CGHMultiArray (Van de Wiel et al., 2005)
- Two-sample Wilcoxon test corrected for ties; it deals with the discreteness and natural ordering of the levels.
- A fast method to generate correct, exact p-values for multi-array called aCGH data.
                     Condition 1              Condition 2
Chr.  Pos.  Clone    #loss  #norm.  #gain     #loss  #norm.  #gain
1     …     1        3      7       2         1      11      0
1     …     2        3      7       2         1      11      0
…     …     …        …      …       …         …      …       …
18    …     2384     2      8       2         1      9       3
18    …     2385     2      8       2         1      9       3
…     …     …        …      …       …         …      …       …
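To see what an exact, tie-corrected p-value means for called data, here is a brute-force sketch for a single clone. It enumerates all ways to distribute the observed losses, normals and gains over the two conditions and uses midranks for ties. This is not the CGHMultiArray algorithm (which is built for speed); the function names and the two-sided definition of "at least as extreme" are assumptions.

```python
from itertools import product
from math import comb

def midranks(totals):
    """Midrank shared by all samples in each ordered category (loss < normal < gain)."""
    ranks, below = [], 0
    for t in totals:
        ranks.append(below + (t + 1) / 2)
        below += t
    return ranks

def exact_rank_sum_p(counts1, counts2):
    """Two-sided exact p-value of the rank-sum statistic with midranks, for one clone."""
    totals = [a + b for a, b in zip(counts1, counts2)]
    n1, n = sum(counts1), sum(totals)
    ranks = midranks(totals)
    expected = n1 * (n + 1) / 2
    observed = sum(c * r for c, r in zip(counts1, ranks))
    extreme = 0
    # Every split of the category totals that gives condition 1 exactly n1 samples.
    for cfg in product(*(range(t + 1) for t in totals)):
        if sum(cfg) != n1:
            continue
        w = sum(c * r for c, r in zip(cfg, ranks))
        if abs(w - expected) >= abs(observed - expected) - 1e-9:
            ways = 1
            for c, t in zip(cfg, totals):
                ways *= comb(t, c)
            extreme += ways
    return extreme / comb(n, n1)

# (#loss, #norm., #gain) for condition 1 and condition 2, first table row.
p = exact_rank_sum_p((3, 7, 2), (1, 11, 0))
print(p)
```

The null distribution depends only on the category totals and the group sizes, which is why clones with the same structure share a p-value and clones with a new structure need a fresh computation.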
66
Maths behind CGHMultiArray
Q: “There exist one-line formulas to approximate the p-value, can’t we use these?”
A: “No, for this situation these approximations may be a factor 2 to 3 off!”
Q: “Couldn’t we generate the distribution of the test statistic just once and then read off the p-values for all clones?”
A: “No, the structure of the data (total number of losses, normals, and gains) is different for many clones; the p-value has to be re-computed for each clone with a previously unobserved structure.”
4. Hypothesis testing
67
4. Hypothesis testing
Loads of identical p-values.
68
Life scientist: “That’s nice, but those pessimistic statisticians always tell me to correct for ‘multiple testing’ which means multiplying the p-value by a huge factor!”
Life scientist: “Moreover, subsequent clones are highly correlated. I would like to make statements about chromosomal regions rather than about individual clones.”
Combine CGHMultiArray with dimension reduction, and apply special FDR-correction
4. Hypothesis testing
69
4. Hypothesis testing
FDR control
Standard Benjamini-Hochberg is far too conservative.
Name: BAC1432 - BAC1476; Chr: 8; Pos: 670946 - 8769451
                 # loss  # norm  # gains
Condition 1      0       12      0
Condition 2      0       12      0
Condition 1 + 2  0       24      0
Case above: minimal p-value equals 1!
Gilbert (2005): “For FDR correction for a particular case with p-value = P only include other cases that, based on the aggregated data, can reach a p-value ≤ P.”
Name: BAC738 - BAC755; Chr: 8; Pos: 1432258 - 1260024
                 # loss  # norm  # gains
Condition 1      0       2       10
Condition 2      1       10      1
Significance: p-val 0.00055, FDR-BH 0.072, FDR-Gil 0.030
70
4. Re-analysis of Douglas et al. (2004)
Original analysis
Experimental design (Douglas et al., 2004)
aCGH profiles of 7 MSI+ colorectal cancers, and aCGH profiles of 30 CIN+ colorectal cancers.
Research question
Are there genomic differences between the two groups? If so, what is their location?
Results
Chromosome 20 and chromosome arms 8p, 17p, 18q are reported.
Problem
No statistical underpinning of the findings is provided!
4. Re-analysis of Douglas et al. (2004)
Clone data
Pre-processed (missing values, normalization, segmentation, calling) clone data are used for testing.
Analysis of the clone data reveals that the smallest FDR-value equals 0.3815.
Region data
Pre-processed data are transformed into regions.
Region data are used for testing.
72
4. Re-analysis of Douglas et al. (2004)
Regions ordered by p-value (Start BP and #clones columns not shown):
Chr.  p-value  FDR
8     0.00166  0.013
20    0.00464  0.036
18    0.00618  0.036
18    0.00618  0.036
8     0.00753  0.036
8     0.01677  0.044
18    0.01961  0.045
18    0.02259  0.045
13    0.02264  0.045
…     …        …
17    0.09346  0.204
Figure labels: Extra; Not found.
73
5. Clustering
74
5. Introduction
Research question (Fridlyand et al., 2006)
Can DNA copy number profiles be used to divide ductal invasive breast cancers into subgroups?
Experimental design
DNA copy number profiles of 67 tumors.
Results
Cluster analysis identified 4 subgroups: known (BRCA), others not clinically recognized yet.
Conclusion
DNA copy number profiles can be used to identify subtypes of breast cancer.
75
5. Introduction
Motivation
Currently, aCGH data are clustered with techniques for expression data.
These do not take into account the nature of the data.
WECCA (Van Wieringen et al., 2007a): WEighted Clustering of Called aCGH data
A cluster method that:
- is tailor-made for the discrete nature of called aCGH data,
- uses easily interpretable similarity measures,
- allows certain clones to have more influence,
- yields compact, well-separated clusters.
76
5. Introduction
Objective of cluster analysis
Cluster analysis seeks
• meaningful data-determined groupings of samples, s.t.
• samples are more “similar” within than across groups,
• this similarity in copy number is assumed to imply some form of regulatory or functional similarity of samples.
WECCA is a hierarchical cluster method that produces a nested sequence of clusters, represented by a dendrogram.
Central to WECCA is the notion of similarity.
77
5. Similarity
Similarity
Central to WECCA is the notion of similarity (or distance) between objects being clustered.
Properties of a similarity S(x_i, x_j):
• S(x_i, x_j) takes on values between 0 and 1,
• S(x_i, x_j) = 0 means that x_i and x_j are not similar at all.
• S(x_i, x_j) = 1 reflects maximum similarity.
• In particular, S(x_i, x_i) = 1.
• S(x_i, x_j) is symmetric: S(x_i, x_j) = S(x_j, x_i).
78
5. Similarity
Agreement
The copy numbers of a clone in two samples are in agreement if they are identical.
Agreement similarity
The probability that the copy numbers of an arbitrary clone of samples i1 and i2 agree:
This can be unbiasedly estimated:
            S1   S2
clone 1     -1   -1
clone 2     -1    0
clone 3      1    1
...         ...  ...
clone p-1    1    1
clone p      0    0
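A natural estimator of the probability of agreement is the proportion of clones at which the two called profiles are identical. The estimator formula itself is not preserved in this transcript, so the sketch below is an assumption in that respect; the example uses only the five clones shown for S1 and S2.

```python
def agreement_similarity(calls_a, calls_b):
    """Proportion of clones at which two called profiles are identical."""
    matches = sum(a == b for a, b in zip(calls_a, calls_b))
    return matches / len(calls_a)

s1 = [-1, -1, 1, 1, 0]
s2 = [-1, 0, 1, 1, 0]
print(agreement_similarity(s1, s2))
```

The two profiles differ only at clone 2, so four of the five clones agree.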
79
5. Similarity
Properties of agreement similarity
The probability of agreement is a similarity, for:
• P(X_{i1,j} = X_{i2,j}) in [0,1].
• P(X_{i1,j} = X_{i1,j}) = 1.
• P(X_{i1,j} = X_{i2,j}) = P(X_{i2,j} = X_{i1,j}).
In addition, it is:
• Location invariant.
• Scale invariant.
• Consistent, that is, two samples are more similar if they stem from the same cluster than if they do not:
P(X_{i1,j} = X_{i2,j} | Z_{i1} = Z_{i2}) ≥ P(X_{i1,j} = X_{i2,j} | Z_{i1} ≠ Z_{i2}).
80
5. Similarity
Concordance
The copy numbers of two clones in two samples are in concordance if the samples agree on which clone has the largest copy number.
Concordance similarity
The probability that the copy numbers of two arbitrary clones of samples i1 and i2 are in concordance:
81
5. Similarity
Figure: calls of samples S1 and S2 at clones 1, 2, 3, …, p-1, p (S1: -1, -1, 1, 1, 0; S2: -1, 0, 1, 1, 0), shown twice: once marking pairs of clones that are in concordance, once marking pairs of clones that are in dis-concordance.
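Concordance can be estimated by looking at all pairs of clones and checking whether both samples order the pair in the same way. In this sketch a pair counts as concordant when the sign of the copy-number difference is the same in both samples, including the case where both samples have a tie. That treatment of ties is an assumption, since the estimator formula is not shown in this transcript.

```python
from itertools import combinations

def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def concordance_similarity(calls_a, calls_b):
    """Proportion of clone pairs that both samples order in the same way."""
    pairs = list(combinations(range(len(calls_a)), 2))
    concordant = sum(
        sign(calls_a[j] - calls_a[k]) == sign(calls_b[j] - calls_b[k])
        for j, k in pairs
    )
    return concordant / len(pairs)

s1 = [-1, -1, 1, 1, 0]
s2 = [-1, 0, 1, 1, 0]
print(concordance_similarity(s1, s2))
```

Of the ten pairs formed by the five clones shown, two are not concordant: both involve clone 2, where S1 has a loss and S2 is normal.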
82
5. Weighting
Weighting
Allow for the weighting of clones to give them a larger influence on the clustering.
Motivation, e.g.:
• Some chromosomal regions are more important.
• Some chromosomes have a higher gene density.
Types of weighting
Three types of weighting:
1) Per clone: researcher assigns weights to each clone.
2) Per region: data-driven weighting.
3) Combination of 1 and 2.
83
5. Weighting
Weighting per clone
Specify a weight w_j ≥ 0 for each clone.
The estimator of the probability of agreement is modified straightforwardly:
Similarly, for the probability of concordance, e.g.:
The weighted estimators are also unbiased.
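One straightforward way to bring clone weights into the agreement estimator is to replace the plain proportion by a weighted proportion. The normalization by the sum of the weights is an assumption; the weighted formulas themselves are not preserved in this transcript.

```python
def weighted_agreement(calls_a, calls_b, weights):
    """Weighted proportion of clones at which two called profiles are identical."""
    total = sum(weights)
    agreeing = sum(w for a, b, w in zip(calls_a, calls_b, weights) if a == b)
    return agreeing / total

s1 = [-1, -1, 1, 1, 0]
s2 = [-1, 0, 1, 1, 0]
print(weighted_agreement(s1, s2, [1, 1, 1, 1, 1]))   # equal weights
print(weighted_agreement(s1, s2, [1, 10, 1, 1, 1]))  # clone 2 counts ten-fold
```

Giving a ten-fold weight to clone 2, the only clone where the samples disagree, lowers the similarity from 4/5 to 4/14.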
84
5. Weighting
Figure: the calls of samples S1-S6 at clones 1-9 (in order of genomic location), with the signature of each region:
          S1  S2  S3  S4  S5  S6
Region 1   0   1   0   0   1   1
Region 2   2   2   2   2   1   2
Region 3   0   0   0   0   0   0
All regions have equal weight in the clustering!
85
5. Linkage
Figure: Cluster A and Cluster B under three types of linkage.
Single linkage: minimum dis-similarity
Average linkage: average dis-similarity
Complete linkage: maximum dis-similarity
86
5. Linkage
Total linkage
Total linkage measures the overall (not just pairwise) similarity between the samples in the clusters.
E.g., the agreement similarity becomes the probability of all samples being in agreement, which is estimated by:
Total linkage produces more compact clusterings.
Figure: Cluster A and Cluster B, total linkage: overall similarity.
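For the agreement similarity, total linkage can be sketched as the proportion of clones at which every sample of the two clusters carries the same call. The estimator formula is not preserved in this transcript, so this unweighted proportion is an assumption, and the example profiles are made up.

```python
def total_agreement(profiles):
    """Proportion of clones at which all given called profiles are identical."""
    n_clones = len(profiles[0])
    same = sum(len({p[j] for p in profiles}) == 1 for j in range(n_clones))
    return same / n_clones

cluster_a = [[-1, -1, 1, 1, 0], [-1, 0, 1, 1, 0]]   # hypothetical profiles
cluster_b = [[-1, 0, 1, 1, 0]]
print(total_agreement(cluster_a + cluster_b))
```

A single deviating sample at a clone is enough to break agreement there, which is why total linkage favours compact clusters.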
87
5. Linkage
Example
Agreement with total linkage
Figure: calls of samples S1-S6 at clones 1, 2, 3, …, p-1, p, divided into Cluster A and Cluster B.
88
5. Linkage
Illustration of several types of linkage, on oral squamous cell carcinoma data from Snijders et al. (2005).
Total
Average Complete
Effects of linkage
89
5. WECCA
WECCA (WEighted Clustering of Called aCGH data)
WECCA consists of:
Step 1: Assign weights to each clone, or construct regions.
Step 2: Form initial clusters, each containing one sample.
Step 3: Calculate the similarity between all cluster pairs.
Step 4: Merge the two clusters with highest similarity.
Step 5: Iterate between steps 3 and 4 until one final cluster remains.
Step 6: Determine the cut-off for the obtained dendrogram.
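Steps 2 to 5 form a standard agglomerative loop, sketched below. As stand-ins, the sketch uses the unweighted agreement similarity and average linkage; weighting (step 1) and the choice of cut-off (step 6) are left out. The function names and example profiles are illustrative and this is not the WECCA implementation.

```python
def agreement(a, b):
    """Proportion of clones at which two called profiles are identical."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def average_linkage(cluster_a, cluster_b, profiles):
    """Mean pairwise similarity between the samples of two clusters."""
    pairs = [(i, j) for i in cluster_a for j in cluster_b]
    return sum(agreement(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)

def agglomerate(profiles):
    """Return the merges (cluster_a, cluster_b, similarity) in the order they occur."""
    clusters = [(i,) for i in range(len(profiles))]       # step 2
    merges = []
    while len(clusters) > 1:                              # step 5
        sim, a, b = max(                                  # steps 3 and 4
            ((average_linkage(a, b, profiles), a, b)
             for k, a in enumerate(clusters) for b in clusters[k + 1:]),
            key=lambda t: t[0],
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, sim))
    return merges

profiles = [[1, 1, 0, 0], [1, 1, 0, 0], [-1, 0, 0, 2], [-1, 0, 0, 2]]
for merge in agglomerate(profiles):
    print(merge)
```

The list of merges carries the same information as the dendrogram: cutting it before the last merge leaves the two clusters {0, 1} and {2, 3}.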
90
5. Simulation
Specifics
• Set-up from Willenbrock & Fridlyand (2005).
• 100 datasets, 20 or 40 samples divided into two clusters.
• 1 chromosome, 500 clones.
• Log2 ratios are median normalized.
• Segmentation with DNAcopy (Olshen et al., 2004).
• Calls are made using CGHcall (Van de Wiel et al., 2007a).
• Region data are constructed from the called data.
• Continuous, called and region data are hierarchically clustered.
• Six external validation measures are registered.
91
5. Simulation
Data   Sim.   Link.  Adj. R  F     NMI
cont.  pear.  aver.  0.91    0.97  0.91
cont.  pear.  comp.  0.90    0.97  0.90
reg.   conc.  aver.  0.95    0.98  0.94
reg.   conc.  comp.  0.93    0.98  0.92
reg.   conc.  total  0.86    0.97  0.86
call.  conc.  aver.  0.81    0.94  0.81
call.  conc.  comp.  0.78    0.93  0.78
call.  conc.  total  0.65    0.89  0.68
20 samples
92
5. Real-life dataset
Breast cancer data
Breast cancer data from Fridlyand et al. (2006):
• 67 tumor profiles,
• 56 with survival data,
• 2041 clones per profile,
• 236 resulting regions.
Apply WECCA with:
• concordance,
• full linkage,
• unweighted region data.
93
5. Real-life dataset
Choosing the cut-off: 7 clusters
94
5. Real-life dataset
Importance of feature for clustering
Figure: max. pairwise KL-distance plotted against chr. position.
95
5. Real-life dataset
Pollack et al. (2002) present a list of potential and known oncogenes.
The majority is on 8q, 15q, 17q, 20q.
Re-apply WECCA with a 10-fold weight for regions on these chromosomal arms.
Breast cancer data
96
5. Real-life dataset
p-values of log-rank test for difference in survival
Clustering  Disease Spec. Surv.  Overall Surv.
I           0.009                0.165
II          0.006                0.073
III         0.002                0.004
IV          0.005                0.007
V           0.002                0.000
I: clustering of Fridlyand et al. (3 clusters)
II: unweighted concordance, full linkage (3 clusters)
III: unweighted concordance, full linkage (7 clusters)
IV: weighted concordance, full linkage (3 clusters)
V: weighted concordance, full linkage (6 clusters)
97
6. Integration with expression
98
6. Central dogma of biology
arrayCGH measures chromosomal copy number changes
Expression arrays measure differences in RNA levels
Maldi-Tof measures protein compositions
99
6. DNA → RNA
The central dogma of biology suggests that aCGH and expression data can be used to investigate the relation between DNA and RNA.
ACE-it (Van Wieringen et al., 2006)
Array CGH Expression integration tool
ACE-it consists of three steps:
• Link chromosomal position of the two platforms,
• Decide which genes are to be tested, and
• Test for difference in expression between normal and aberrated copy number.
100
6. Link chromosomal position
Figure: Gene 1, Gene 2, …, Gene i, Gene j, … along the genome.
A copy number at each genomic location
Labels: Missing data; Overlapping clones; Genomic segments not covered; Bioinformatics; Calling.
101
6. Gene selection
ACE-it tests
loss ↔ normal, or
normal ↔ gain
in genes with a non-contaminated and representative copy number distribution.
Assumption
A gene is either:
- Oncogene: only normals and gains.
- Tumor suppressor gene: only losses and normals.
Genes with losses, normals and gains are “biologically” uninterpretable.
102
6. Gene selection
Contamination
Contamination is defined as the number of samples in the group not used in the comparison.
The contamination should not exceed a threshold.
Take the contamination threshold equal to 5.
A contaminated copy number distribution (# losses = 17, # normals = 16, # gains = 17).
A non-contaminated copy number distribution (# losses = 2, # normals = 24, # gains = 24).
103
6. Gene selection
Representativeness
Representativeness is defined as the minimum number of samples in the groups used in the comparison.
The representativeness should exceed a threshold.
Take the representativeness threshold equal to 5.
A representative copy number distribution (# losses = 0, # normals = 25, # gains = 25).
A non-representative copy number distribution (# losses = 0, # normals = 48, # gains = 2).
104
6. Testing
Testing
The one-sided Wilcoxon rank sum test is used to test:
H0: median_gain ≤ median_normal,
or,
H0: median_normal ≤ median_loss.
The BH multiplicity correction is used to control the False Discovery Rate.
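The Benjamini-Hochberg correction mentioned here can be written in a few lines. This is a generic sketch of the standard step-up procedure, returning adjusted p-values in the input order; the example p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):            # from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.20]              # hypothetical raw p-values
print(benjamini_hochberg(raw))
```

A gene is declared significant when its adjusted p-value is at most the chosen FDR level α.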
105
6. Real-life dataset
Breast cancer data
Breast cancer data from Pollack et al. (2002):
• expression and aCGH profiles of
• 41 tumors, consisting of
• 5889 genes.
ACE-it settings:
• Contamination threshold = 9.
• Representativeness threshold = 9.
(results in 2450 genes taken along for analysis)
• α = 0.20.
106
6. Real-life dataset
Name          Chr  Start      End        Comp     #L  #N  #G  Raw.p     Adj.p
IMAGE:245198  1    117621906  117622206  n vs. g  6   25  10  0.000842  0.0368
IMAGE:432564  1    169737131  169737431  n vs. g  0   24  17  2.3e-05   0.0106
IMAGE:245015  1    175305900  175306200  n vs. g  0   19  22  0.000786  0.0368
IMAGE:878406  1    176394346  176394646  n vs. g  0   18  23  0.001146  0.0435
IMAGE:47665   1    176432278  176432578  n vs. g  0   22  19  0.000169  0.0238
IMAGE:267458  1    176895144  176895444  n vs. g  0   22  19  6e-06     0.0076
IMAGE:782718  1    177854451  177854751  n vs. g  0   30  11  0.000439  0.0326
IMAGE:244801  1    178324754  178325054  n vs. g  0   25  16  0.000648  0.0345
IMAGE:725630  1    181699421  181699721  n vs. g  2   21  18  0.000838  0.0368
IMAGE:795185  1    204750990  204751290  n vs. g  1   24  16  0.000237  0.0270
…
In total 354 genes are found whose expression is significantly affected by copy number.
107
6. Real-life dataset
Among the genes found by ACE-it is ERBB2.
ERBB2 is a well-known oncogene, which is amplified in 20 to 30% of the breast cancer cases.
108
6. Real-life dataset
Figure with axis labels: FDR-value; Proportion.
109
Acknowledgements
This work was in part supported by the Center for Medical Systems Biology (CMSB) established by the Netherlands Genomics Initiative / Netherlands Organization for Scientific Research (NGI/NWO).
110
References
Douglas et al. (2004), “Array comparative genomic hybridization analysis of colorectal cancer cell lines and primary carcinomas”, Cancer Research, 64, 4817-4825.
Fridlyand et al. (2006), “Breast tumor copy number aberration phenotypes and genomic instability”, BMC Cancer.
Gilbert et al. (2005), “A modified false discovery rate multiple comparisons procedure for discrete data, applied to human immunodeficiency virus genetics”, Applied Statistics, 54, 132-153.
Jong et al. (2004), “Breakpoint identification and smoothing of array Comparative Genomic Hybridization data”, Bioinformatics, 20, 3636-3637.
Olshen et al. (2004), “Circular binary segmentation for the analysis of array-based DNA copy number data”, Biostatistics, 5, 557-572.
Pollack et al. (2002), “Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors”, PNAS, 99, 12963-12968.
Snijders et al. (2005), “Rare amplicons implicate frequent deregulation of cell fate specification pathways in oral squamous cell carcinoma”, Oncogene, 24, 4232-4242.
Van de Wiel (2001), “The split-up algorithm: a fast symbolic method for computing P-values of rank statistics”, Comput. Stat., 16, 519-538.
Van de Wiel et al. (2005), “CGHMultiArray: exact P-values for multi-array comparative genomic hybridization data”, Bioinformatics, 21, 3193-3194.
Van de Wiel et al. (2007a), “CGHcall: calling aberrations for array CGH tumor profiles”, Bioinformatics, to appear.
Van de Wiel et al. (2007b), “CGHregions: dimension reduction for array CGH data with minimal information loss”, Cancer Informatics, to appear.
Van Wieringen et al. (2006), “ACE-it: a tool for genome-wide integration of gene dosage and RNA expression data”, Bioinformatics, 22, 1919-1920.
Van Wieringen et al. (2007a), “Weighted clustering of called aCGH data”, technical report, Vrije Universiteit Amsterdam.
Van Wieringen et al. (2007b), “Normalized, segmented or called aCGH data”, technical report, Vrije Universiteit Amsterdam.
Willenbrock et al. (2005), “A comparison study: applying segmentation to array CGH data for downstream analyses”, Bioinformatics, 21, 4084-4091.
Wilting et al. (2006), “Increased gene copy numbers at chromosome 20q are frequent in both squamous cell carcinomas and adenocarcinomas of the cervix”, J. Pathol., 209, 220-230.