Modelling of CGH arrays experiments

1
Modelling of CGH arrays experiments
• Philippe Broët, Faculté de Médecine, Université de Paris-XI
• Sylvia Richardson, Imperial College London
CGH = Comparative Genomic Hybridization

2
Outline
• Background
• Mixture model with spatial allocations
• Performance, comparison with CGH-Miner
• Analyses of CGH-array cancer data sets
• Extensions

3
The development of solid tumors is associated with the acquisition of complex genetic alterations that modify normal cell growth and survival.
Many of these changes involve gains and/or losses of parts of the genome: amplification of an oncogene and deletion of a tumor suppressor gene are considered important mechanisms of tumorigenesis.
Loss: tumor suppressor gene
Gain: oncogene
Aim: study genomic alterations in oncology

4
1. DNA extraction
2. Labelling (fluorescent)
3. Co-hybridization
4. Scanning
Case / Control
CGH = Comparative Genomic Hybridization
• Array containing short sequences of DNA bound to a glass slide
• Fluorescein-labeled normal and pathologic samples co-hybridised to the array

5
• Once hybridization has been performed, the signal intensities of the fluorophores are quantified
This provides a means to quantitatively measure DNA copy-number alterations and to map them directly onto the genomic sequence

6
MCF7 cell line investigated in Pollack et al. (2002)
23 chromosomes and 6691 cDNA sequences
Data log transformed: difference between MCF7 and reference

7
Types of alterations observed
• (Single) gain or deletion of sequences, occurring over contiguous regions
Low-level changes in the ratio (± log 2), but with attenuation (dye bias) the ratio ≈ ± 0.4
• Multiple gains (small regions)
High-level change, easy to pick up
Focus the modelling on the first, common type of alterations

8
[Figure: chromosome 1 profile, with regions marked "Deletion?", "Multiple gains?" and "Normal?"]

9
2 -- Mixture model

10
Specificity of CGH array experiment
A priori biological knowledge from conventional CGH:
• Limited number of states for a genomic sequence (GS): presence (modal), deletion, gain(s), corresponding to different intensity ratios on the array
Mixture model to capture the underlying discrete states
• GS located contiguously on chromosomes are likely to carry alterations of the same type
Use clone spatial location in the allocation model
3-component mixture model with spatial allocation

11
Mixture model
For chromosome k:
$Z_{gk}$: log ratio of measurement of normal versus tumoral change, for genomic sequence (GS) g on chromosome k
Dye bias is estimated from a reference array (normal/normal) and then subtracted from $Z_{gk}$
$Z_{gk} \sim w^{1}_{gk}\, N(\mu_1, \sigma_1^2) + w^{2}_{gk}\, N(\mu_2, \sigma_2^2) + w^{3}_{gk}\, N(\mu_3, \sigma_3^2)$
States: 1 = deletion, 2 = presence, 3 = gain
For unique labelling: $\mu_1 < 0$, $\mu_3 > 0$, $\mu_2 = 0$ (dye bias has been adjusted)
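As a rough illustration of how the three-component mixture separates the states, the sketch below computes, for each log ratio, the probability of each component given fixed weights, means and standard deviations. The function names and all numerical values are hypothetical and chosen only for illustration; in the model above these quantities are unknown and estimated.

```python
import numpy as np

def normal_pdf(z, mean, sd):
    """Density of N(mean, sd^2) evaluated at z."""
    return np.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def component_probabilities(z, weights, means, sds):
    """Probability of each of the 3 states for each log ratio in z.

    z: shape (n,); weights: shape (n, 3), rows summing to 1;
    means, sds: shape (3,), ordered deletion, presence, gain.
    """
    z = np.asarray(z, dtype=float)[:, None]                    # (n, 1)
    dens = normal_pdf(z, np.asarray(means), np.asarray(sds))   # (n, 3)
    joint = np.asarray(weights) * dens
    return joint / joint.sum(axis=1, keepdims=True)

# Illustrative values only (assumptions, not estimates from the slides).
means = [-0.4, 0.0, 0.4]
sds = [0.3, 0.3, 0.3]
weights = np.full((3, 3), 1.0 / 3.0)
probs = component_probabilities([-0.5, 0.0, 0.6], weights, means, sds)
print(probs.round(3))
```

With equal weights, each sequence is attributed mainly to the component whose mean is closest to its log ratio.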

12
Mixture model with spatial allocation
$Z_{gk} \sim w^{1}_{gk}\, N(\mu_1, \sigma_1^2) + w^{2}_{gk}\, N(\mu_2, \sigma_2^2) + w^{3}_{gk}\, N(\mu_3, \sigma_3^2)$
• Define mixture proportions to depend on the chromosomic location via a logistic model:
$w^{c}_{gk} = \exp(u^{c}_{gk}) / \sum_{m} \exp(u^{m}_{gk})$
Spatial structure on the weights (cf. Fernandez and Green, 2002):
• Introduce 3 centred Markov random fields $\{u^{m}_{gk}\}$, m = 1, 2, 3, with nearest neighbours along the chromosomes
This favours allocation of nearby GS to the same component
[Figure: spatial neighbours of GS g are g-1 and g+1]
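The logistic link between the random fields and the mixture weights can be sketched as follows. The function name `allocation_weights` and the field values are hypothetical; the only thing taken from the slide is the formula for the weights.

```python
import numpy as np

def allocation_weights(u):
    """Map fields u, shape (n_sequences, 3), to mixture weights via the logistic link.

    Subtracting the row maximum leaves the result unchanged but avoids overflow.
    """
    u = np.asarray(u, dtype=float)
    e = np.exp(u - u.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical field values for 4 contiguous sequences.
u = np.array([
    [0.0, 0.0, 0.0],
    [2.0, -1.0, -1.0],
    [2.1, -1.0, -1.1],
    [-1.0, 2.0, -1.0],
])
w = allocation_weights(u)
print(w.round(3))
```

Neighbouring sequences with similar field values (rows 2 and 3 here) receive similar weights, which is how smoothness of the fields translates into similar allocations.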

13
Prior structure
• $w^{c}_{gk} = \exp(u^{c}_{gk}) / \sum_{m} \exp(u^{m}_{gk})$
with Gaussian Conditional AutoRegressive (CAR) model:
$u^{c}_{gk} \mid u^{c}_{-gk} \sim N\left( \sum_{h} u^{c}_{hk} / n_g ,\ \sigma_{ck}^2 / n_g \right)$
for h = neighbour of g ($n_g$ = #h, one or two in this simple case), with constraint $\sum_{g} u^{c}_{gk} = 0$
• Variance parameters $\sigma_{ck}^2$ of the CAR act as a smoothing prior. They are indexed by chromosome: the 'switching structure' between the states can differ between chromosomes
• Means and variances ($\mu_c$, $\sigma_c^2$) of the mixture components are common to all chromosomes (borrowing information)
• Inverse gamma priors for the variances, uniform priors for the means
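The CAR conditional above can be illustrated for a single component on a single chromosome. This is a minimal sketch assuming the sequences are stored in genomic order, so that the neighbours of position g are g-1 and g+1; the function name and the values are hypothetical.

```python
import numpy as np

def car_conditional(u, g, sigma2):
    """Conditional mean and variance of u[g] given its neighbours on the chain.

    u: field values for one component along one chromosome, in genomic order
       (at least two sequences).
    sigma2: CAR variance parameter for that component and chromosome.
    """
    u = np.asarray(u, dtype=float)
    neighbours = [h for h in (g - 1, g + 1) if 0 <= h < len(u)]
    n_g = len(neighbours)                  # 1 at the chromosome ends, 2 elsewhere
    mean = u[neighbours].sum() / n_g
    var = sigma2 / n_g
    return mean, var

u = np.array([0.5, 1.0, -0.5, -1.0])       # hypothetical values, summing to zero
end_mean, end_var = car_conditional(u, 0, sigma2=1.0)
mid_mean, mid_var = car_conditional(u, 2, sigma2=1.0)
print(end_mean, end_var, mid_mean, mid_var)
```

A small `sigma2` pulls each value strongly towards the average of its neighbours, which is the sense in which the variance parameter acts as a smoothing prior.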

14
Posterior quantities of interest
• Bayesian inference via MCMC, implemented using WinBUGS
• In particular, latent allocations $L_{gk}$ of GS g on chromosome k to state c are sampled during the MCMC run
• Compute posterior allocation probabilities:
$p^{c}_{gk} = P(L_{gk} = c \mid \text{data})$, c = 1, 2, 3
• Probabilistic classification of each GS using a threshold on $p^{c}_{gk}$:
-- Assign g to a modified state, deletion (c=1) or gain (c=3), if the corresponding $p^{c}_{gk} > 0.8$
-- Otherwise allocate to the modal state
Subset S of genomic sequences classified as modified (this subset depends on the chosen threshold)
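The classification rule can be sketched from sampled allocations as follows. This assumes the MCMC output is available as an integer array of allocation draws (iterations by sequences), which is an assumption about the data layout rather than something stated on the slides; the function names and the draws are hypothetical.

```python
import numpy as np

def posterior_allocation_probs(draws, n_states=3):
    """Estimate allocation probabilities as relative frequencies of sampled states.

    draws: integer array (n_iterations, n_sequences) with values 1, 2, 3.
    Returns an array of shape (n_sequences, n_states).
    """
    draws = np.asarray(draws)
    return np.stack([(draws == c).mean(axis=0) for c in range(1, n_states + 1)], axis=1)

def classify(probs, threshold=0.8):
    """Label 1 (deletion) or 3 (gain) if that probability exceeds the threshold, else 2."""
    probs = np.asarray(probs)
    labels = np.full(probs.shape[0], 2)
    labels[probs[:, 0] > threshold] = 1
    labels[probs[:, 2] > threshold] = 3
    return labels

# Hypothetical draws: 5 iterations, 3 sequences.
draws = np.array([
    [1, 2, 3],
    [1, 2, 3],
    [1, 2, 3],
    [1, 3, 3],
    [1, 2, 3],
])
probs = posterior_allocation_probs(draws)
labels = classify(probs)
print(probs, labels)
```

The second sequence is sampled once as a gain, but its gain probability (0.2) is below the threshold, so it stays in the modal state.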

15
False Discovery Rate
• Using the posterior allocation probabilities, one can compute an estimate of the FDR for the list S:
• Bayes $\mathrm{FDR}(S) \mid \text{data} = \frac{1}{\operatorname{card}(S)} \sum_{g \in S} p^{2}_{gk}$
where $p^{2}_{gk}$ is the posterior probability of allocation to the modal (c=2) state
Note: the threshold can be adjusted to get a desired FDR, and vice versa
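The Bayes FDR estimate is simply the average posterior probability of the modal state over the selected list. A minimal sketch, with a hypothetical function name and made-up probabilities:

```python
import numpy as np

def bayes_fdr(p_modal, selected):
    """Average posterior probability of the modal state over the selected set S.

    p_modal: posterior probability of the modal (c=2) state for every sequence.
    selected: boolean mask marking the sequences classified as modified.
    """
    p_modal = np.asarray(p_modal, dtype=float)
    selected = np.asarray(selected, dtype=bool)
    if not selected.any():
        return 0.0   # convention chosen here for an empty list (an assumption)
    return float(p_modal[selected].mean())

# Made-up values for 5 sequences, 3 of which are classified as modified.
p_modal = np.array([0.05, 0.10, 0.90, 0.15, 0.95])
selected = np.array([True, True, False, True, False])
fdr = bayes_fdr(p_modal, selected)
print(fdr)
```

Raising the classification threshold shrinks S to sequences with smaller modal probabilities and so lowers this estimate.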

16
3 -- Performance

17
Simulation set-up
• 200 simulated GS with
$Z \sim N(0, 0.3^2)$, modal
$Z \sim N(\log 2, 0.3^2)$, deletion, a block of 30 GS
$Z \sim N(-\log 2, 0.3^2)$, gains, blocks of 20 and 10 GS
• Reference array with $Z \sim N(0, 0.3^2)$
• 50 replications
[Figure: layout Modal, Deletion, Modal, Gain, Modal, Gain, Modal; block sizes labelled 30, 10, 20]
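One replication of this set-up could be generated along the following lines. Several details are assumptions: the block start positions and the order of the two gain blocks are hypothetical (the slide gives only block sizes), `log 2` is taken as the natural logarithm, and the signs of the means follow the slide as written.

```python
import numpy as np

def simulate_profile(rng, sd=0.3, shift=np.log(2.0)):
    """Simulate 200 log ratios with one deletion block and two gain blocks.

    Block positions are hypothetical. Means follow the slide as written:
    +shift for the deletion block, -shift for the gain blocks.
    """
    n = 200
    state = np.full(n, 2)                 # 2 = modal
    state[30:60] = 1                      # deletion block, 30 sequences
    state[100:120] = 3                    # gain block, 20 sequences
    state[160:170] = 3                    # gain block, 10 sequences
    mean = np.zeros(n)
    mean[state == 1] = shift
    mean[state == 3] = -shift
    z = rng.normal(mean, sd)
    reference = rng.normal(0.0, sd, size=n)   # normal/normal array
    return z, reference, state

rng = np.random.default_rng(0)
z, reference, state = simulate_profile(rng)
print(z.shape, reference.shape)
```

Repeating the call 50 times with the same generator would give the 50 replications mentioned above.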

18
CGH-Miner
• Data mining approach to select gains and losses (Wang et al., 2005):
– Hierarchical clustering with a spatial constraint (i.e. only spatially adjacent clusters are joined)
– Subtree selection according to predefined rules, with a focus on selecting large consistent gain/loss regions and small (big spike) regions
– Implemented in the CGH-Miner Excel plug-in
• Estimation of FDR using a reference (normal/normal) array and the same set of rules to prune the tree. Declared target: 1%
• Simulation set-up is similar to that of Wang et al.

19
Classification obtained by CGH-Miner and CGHmix
[Figure: layout Modal, Deletion, Modal, Gain, Modal, Gain, Modal; block sizes labelled 30, 10, 20]

20
Posterior probabilities of allocation to the 3 components

21
Comparative performance between CGHmix and CGH-Miner (50 simulations)

                                   CGHmix     CGH-Miner
Realised false positive (mean)     1.9        16.4
Realised false positive (range)    0 -- 20    3 -- 39
Realised false negative (mean)     1.0        9.6
Realised false negative (range)    0 -- 4     0 -- 50
Realised FDR (%)                   2.8        23.7
Estimated FDR (%)                  1.3        1.2
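Realised error counts of this kind can be computed in a simulation, where the true states are known. The sketch below is one plausible way to count them, treating any non-modal call as "altered"; the slides do not spell out the exact counting rule, so this definition, the function name and the example labels are assumptions.

```python
import numpy as np

def realised_errors(true_state, called_state, modal=2):
    """Count false positives, false negatives and the realised FDR.

    A sequence is 'altered' if its state differs from the modal state.
    Deletion/gain confusions are not counted separately in this sketch.
    """
    true_altered = np.asarray(true_state) != modal
    called_altered = np.asarray(called_state) != modal
    false_pos = int(np.sum(called_altered & ~true_altered))
    false_neg = int(np.sum(~called_altered & true_altered))
    n_called = int(called_altered.sum())
    fdr = false_pos / n_called if n_called else 0.0
    return false_pos, false_neg, fdr

# Hypothetical truth and calls for 6 sequences.
true_state = [2, 2, 1, 1, 3, 2]
called_state = [2, 3, 1, 2, 3, 2]
fp, fn, fdr = realised_errors(true_state, called_state)
print(fp, fn, fdr)
```

Comparing the realised FDR with the estimated one across replications shows how well calibrated each method's own estimate is.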

22
4 -- Analyses of CGH-array cancer data sets

23
Breast cancer cell line MCF7
• Data from Pollack et al., 6691 GS on 23 chromosomes
• $\hat{\mu}_1 = -0.35$, $\hat{\sigma}_1 = 0.37$
• ($\mu_2 = 0$), $\hat{\sigma}_2 = 0.27$
• $\hat{\mu}_3 = 0.44$, $\hat{\sigma}_3 = 0.54$
• Estimated FDR CGHmix = 2.6%
• Estimated FDR CGH-Miner = 1.5%

24

25
Classification of GS obtained by CGHmix

26
[Figure legend: known alterations found by both methods; additional known alterations found by CGHmix]

27
Neuroblastoma KCNR cell line
Curie Institute CGH custom array for chromosome 1
• 190 genomic clones, mostly on the short arm
• 3 replicate spots for each
• $\hat{\mu}_1 = -0.49$, loss component
• $\hat{\mu}_3 = 0.04$, not plausible as a gain: no gain in this case
• Estimate FDR by regrouping the c=2 and c=3 classes
• Substantial number of deletions on the short arm
• No deletion found for the long arm by CGHmix, a result confirmed by classical cytogenetic information
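One plausible reading of the regrouping, stated here as an assumption rather than as the authors' formula: since the third component is not a credible gain, classes c=2 and c=3 are both treated as unaltered, and the FDR estimate for the list S of sequences classified as deleted sums the two corresponding probabilities.

```latex
\widehat{\mathrm{FDR}}(S) \;=\; \frac{1}{\operatorname{card}(S)} \sum_{g \in S} \left( p^{2}_{gk} + p^{3}_{gk} \right)
```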

28
Long arm

29
Extensions
• Account for variability in the case of repeated measurements: add a measurement model with GS-specific noise, with an exchangeable prior
• Refine the spatial model:
– Incorporate genomic sequence location in the neighbourhood definition of the CAR model (0-1 contiguity replaced by spatial weights)
– In particular, account for overlapping sequences by using weights that depend on the overlap
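A CAR prior with general spatial weights has the same conditional form as the simple case, with the neighbour average replaced by a weighted average and the variance divided by the total weight. The sketch below uses this standard form; the function name and the particular weights (which could, for example, reflect distance or overlap) are hypothetical and not taken from the slides.

```python
import numpy as np

def weighted_car_conditional(u, weights_g, sigma2):
    """Conditional mean and variance of one site under a CAR prior with general weights.

    weights_g[h] >= 0 is the weight between the site and sequence h
    (0 for the site itself and for non-neighbours); at least one must be positive.
    """
    u = np.asarray(u, dtype=float)
    w = np.asarray(weights_g, dtype=float)
    total = w.sum()
    return float(w @ u / total), sigma2 / total

u = np.array([0.5, 1.0, -0.5, -1.0])        # hypothetical field values

# Site g = 2 with 0-1 contiguity weights: recovers the simple nearest-neighbour case.
mean01, var01 = weighted_car_conditional(u, [0, 1, 0, 1], sigma2=1.0)

# Same site with unequal hypothetical weights (e.g. a closer left-hand neighbour).
mean_w, var_w = weighted_car_conditional(u, [0, 0.75, 0, 0.25], sigma2=1.0)
print(mean01, var01, mean_w, var_w)
```

With unequal weights the conditional mean moves towards the more heavily weighted neighbour.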