# Modelling of CGH arrays experiments


1

Modelling of CGH arrays experiments

• Philippe Broët, Faculté de Médecine, Université de Paris-XI

• Sylvia Richardson, Imperial College London

CGH = Competitive Genomic Hybridization

2

Outline

• Background

• Mixture model with spatial allocations

• Performance, comparison with CGH-Miner

• Analyses of CGH-array cancer data sets

• Extensions

3

The development of solid tumors is associated with the acquisition of complex genetic alterations that modify normal cell growth and survival.

Many of these changes involve gains and/or losses of parts of the genome: amplification of an oncogene and deletion of a tumor suppressor gene are considered important mechanisms of tumorigenesis.

Loss: tumor suppressor gene

Gain: oncogene

Aim: study genomic alterations in oncology

4

1. DNA extraction

2. Labelling (fluo)

3. Co-hybridization

4. Scanning

Case Control

CGH = Competitive Genomic Hybridization

• Array containing short sequences of DNA bound to a glass slide

• Fluorescein-labeled normal and pathologic samples co-hybridised to the array

5

• Once hybridization has been performed, the signal intensities of the fluorophores are quantified

Provides a means to quantitatively measure DNA copy-number alterations and to map them directly onto genomic sequence

6

MCF7 cell line investigated in Pollack et al. (2002)

23 chromosomes and 6691 cDNA sequences

Data log transformed: difference between MCF7 and reference

7

Types of alterations observed

• (Single) gain or deletion of sequences, occurring for contiguous regions

Low-level changes in the ratio: ± log 2, but with attenuation (dye bias) the ratio is ≈ ± 0.4

• Multiple gains (small regions)

High-level change, easy to pick up

Focus the modelling on the first, common type of alterations

8

[Figure: chromosome 1, with regions annotated "Deletion?", "Multiple gains?" and "Normal?"]

9

2 -- Mixture model

10

Specificity of CGH array experiment

A priori biological knowledge from conventional CGH:

• Limited number of states for a genomic sequence (GS):

- presence (modal)

- deletion

- gain(s)

corresponding to different intensity ratios on the array

Mixture model to capture the underlying discrete states

• GS located contiguously on chromosomes are likely to carry alterations of the same type

Use clone spatial location in the allocation model

3-component mixture model with spatial allocation

11

Mixture model

For chromosome $k$:

$Z_{gk}$: log ratio of the measurement of normal versus tumoral change, for genomic sequence (GS) $g$ on chromosome $k$

Dye bias is estimated by using a reference array (normal/normal) and then subtracted from $Z_{gk}$

$$Z_{gk} \sim w^{1}_{gk}\, N(\mu_1, \sigma_1^2) + w^{2}_{gk}\, N(\mu_2, \sigma_2^2) + w^{3}_{gk}\, N(\mu_3, \sigma_3^2)$$

Components: 1 = deletion, 2 = presence, 3 = gain

For unique labelling: $\mu_1 < 0$, $\mu_3 > 0$, $\mu_2 = 0$ (dye bias has been adjusted)
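To make the three-component mixture concrete, here is a minimal Python sketch that evaluates, for a single log ratio, the conditional probability of each state when the weights, means and standard deviations are treated as known. The parameter values and the function names are hypothetical, chosen only to respect the labelling constraint (negative mean for deletion, zero for presence, positive for gain); in the model itself these quantities are unknown and estimated.

```python
import math

def normal_pdf(z, mean, sd):
    """Density of N(mean, sd^2) evaluated at z."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def allocation_probabilities(z, weights, means, sds):
    """Probability of each state given z, for fixed mixture parameters.

    weights, means and sds are ordered as (deletion, presence, gain).
    """
    joint = [w * normal_pdf(z, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical values, not estimates from any data set.
means = (-0.4, 0.0, 0.4)
sds = (0.3, 0.3, 0.3)
weights = (0.2, 0.6, 0.2)

# A clearly negative log ratio is attributed mostly to the deletion component.
probs = allocation_probabilities(-0.6, weights, means, sds)
print(probs)
```

With fixed parameters this is only Bayes' rule applied to one observation; the spatial part of the model enters through weights that vary from one GS to the next.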

12

Mixture model with spatial allocation

$$Z_{gk} \sim w^{1}_{gk}\, N(\mu_1, \sigma_1^2) + w^{2}_{gk}\, N(\mu_2, \sigma_2^2) + w^{3}_{gk}\, N(\mu_3, \sigma_3^2)$$

Spatial structure on the weights (cf. Fernandez and Green, 2002):

• Introduce 3 centred Markov random fields $\{u^{m}_{gk}\}$, $m = 1, 2, 3$, with nearest neighbours along the chromosomes

Spatial neighbours of GS $g$: $g-1$ and $g+1$

• Define mixture proportions to depend on the chromosomic location via a logistic model:

$$w^{c}_{gk} = \frac{\exp(u^{c}_{gk})}{\sum_{m} \exp(u^{m}_{gk})}$$

favours allocation of nearby GS to the same component
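The logistic link between the random fields and the mixture weights can be sketched in a few lines of Python. The field values below are hypothetical and only illustrate that neighbouring GS with similar field values receive similar weights; subtracting the maximum before exponentiating is a standard numerical safeguard and not something stated on the slide.

```python
import math

def mixture_weights(u_values):
    """Map the three field values (u1, u2, u3) at one GS to weights summing to 1."""
    shift = max(u_values)  # avoids overflow; the ratio is unchanged
    exps = [math.exp(u - shift) for u in u_values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical field values for three consecutive GS on one chromosome.
u_fields = [
    (1.5, 0.0, -1.5),
    (1.4, 0.1, -1.5),
    (1.6, -0.1, -1.5),
]
weights = [mixture_weights(u) for u in u_fields]
for w in weights:
    print([round(x, 3) for x in w])
```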

13

Prior structure

• $w^{c}_{gk} = \exp(u^{c}_{gk}) / \sum_{m} \exp(u^{m}_{gk})$

with Gaussian Conditional AutoRegressive (CAR) model:

$$u^{c}_{gk} \mid u^{c}_{-gk} \sim N\left( \sum_{h} u^{c}_{hk} / n_g,\ \sigma_{ck}^2 / n_g \right)$$

for $h$ = neighbour of $g$ ($n_g = \#h$, one or two in this simple case), with constraint $\sum_{g} u^{c}_{gk} = 0$

• The variance parameters $\sigma_{ck}^2$ of the CAR act as a smoothing prior; they are indexed by the chromosome, so the 'switching structure' between the states can differ between chromosomes

• Means and variances $(\mu_c, \sigma_c^2)$ of the mixture components are common to all chromosomes (borrowing information)

• Inverse gamma priors for the variances, uniform priors for the means
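As an illustration of the nearest-neighbour CAR prior, the following Python sketch computes the conditional mean and variance of one field value given the others along a chromosome, together with the sum-to-zero centring. The function names and the numerical values are hypothetical; this only evaluates the conditional distribution and is not a sampler.

```python
def car_conditional(u, g, sigma2):
    """Conditional mean and variance of u[g] given the other values on a chain.

    Neighbours are g-1 and g+1 where they exist, so the number of neighbours
    is 1 at either end of the chromosome and 2 elsewhere.
    """
    neighbours = [h for h in (g - 1, g + 1) if 0 <= h < len(u)]
    n_g = len(neighbours)
    mean = sum(u[h] for h in neighbours) / n_g
    return mean, sigma2 / n_g

def centre(u):
    """Impose the sum-to-zero constraint by subtracting the mean."""
    m = sum(u) / len(u)
    return [x - m for x in u]

u = centre([0.2, 0.4, 1.0, 1.2, 0.2])   # hypothetical field values
print(car_conditional(u, 0, sigma2=0.5))  # end point: one neighbour
print(car_conditional(u, 2, sigma2=0.5))  # interior point: two neighbours
```

A small variance pulls each value towards the average of its neighbours, which is how the prior smooths the weights along the chromosome.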

14

Posterior quantities of interest

• Bayesian inference via MCMC, implemented using WinBUGS

• In particular, latent allocations $L_{gk}$ of GS $g$ on chromosome $k$ to state $c$ are sampled during the MCMC run

• Compute posterior allocation probabilities:

$$p^{c}_{gk} = P(L_{gk} = c \mid \text{data}), \quad c = 1, 2, 3$$

• Probabilistic classification of each GS using a threshold on $p^{c}_{gk}$:

-- Assign $g$ to a modified state, deletion ($c=1$) or gain ($c=3$), if the corresponding $p^{c}_{gk} > 0.8$

-- Otherwise allocate to the modal state

Subset $S$ of genomic sequences classified as modified (this subset depends on the chosen threshold)
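The classification rule can be sketched as follows in Python, assuming the sampled allocations for one GS are available as a list of state labels 1, 2 or 3. Estimating each posterior probability by the relative frequency of the state among the draws is the usual Monte Carlo estimate; the function names and the example draws are hypothetical.

```python
def posterior_from_draws(draws):
    """Estimate (p1, p2, p3) as the relative frequency of each state in the draws."""
    n = len(draws)
    return tuple(draws.count(c) / n for c in (1, 2, 3))

def classify(posterior, threshold=0.8):
    """Return 'deletion', 'gain' or 'modal' for one GS.

    posterior is (p1, p2, p3) for states 1 (deletion), 2 (modal), 3 (gain).
    """
    p_deletion, _, p_gain = posterior
    if p_deletion > threshold:
        return "deletion"
    if p_gain > threshold:
        return "gain"
    return "modal"

# Hypothetical sampled allocations for one GS over 10 MCMC iterations.
draws = [1, 1, 1, 1, 2, 1, 1, 1, 1, 1]
posterior = posterior_from_draws(draws)
print(posterior, classify(posterior))
```

Raising the threshold shrinks the subset of GS declared modified; lowering it enlarges that subset.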

15

False Discovery Rate

• Using the posterior allocation probabilities, can compute an estimate of the FDR for the list $S$:

$$\text{Bayes FDR}(S) \mid \text{data} = \frac{1}{\mathrm{card}(S)} \sum_{g \in S} p^{2}_{gk}$$

where $p^{2}_{gk}$ is the posterior probability of allocation to the modal ($c=2$) state (the superscript is the component index, not a power)

Note: the threshold can be adjusted to get a desired FDR, and vice versa
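A minimal Python sketch of this estimate: select the GS whose deletion or gain probability exceeds the threshold, then average their posterior probabilities of being modal. The posterior values are hypothetical, and returning `None` for an empty list is a choice made here for illustration.

```python
def bayes_fdr(posteriors, threshold=0.8):
    """Estimated FDR of the list S obtained with the given threshold.

    posteriors is a list of (p1, p2, p3) tuples, one per GS.
    Returns (fdr, size_of_S); fdr is None when S is empty.
    """
    selected = [p for p in posteriors if p[0] > threshold or p[2] > threshold]
    if not selected:
        return None, 0
    return sum(p[1] for p in selected) / len(selected), len(selected)

# Hypothetical posterior allocation probabilities for four GS.
posteriors = [
    (0.95, 0.05, 0.00),
    (0.85, 0.15, 0.00),
    (0.10, 0.90, 0.00),
    (0.00, 0.10, 0.90),
]
fdr, size = bayes_fdr(posteriors)
print(fdr, size)
```

Calling the function over a grid of thresholds gives one way to pick the threshold that meets a target FDR.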

16

3 -- Performance

17

Simulation set-up

• 200 fake GS with:

$Z \sim N(0, 0.3^2)$, modal

$Z \sim N(\log 2, 0.3^2)$, deletion, a block of 30 GS

$Z \sim N(-\log 2, 0.3^2)$, gains, blocks of 20 and 10 GS

• Reference array with $Z \sim N(0, 0.3^2)$

• 50 replications

Layout along the chromosome: Modal, Deletion, Modal, Gain, Modal, Gain, Modal (block sizes shown: 30, 10, 20)
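A simulation of this kind could be generated as in the Python sketch below. Several details are assumptions rather than facts from the slide: that 200 is the total number of GS, the lengths of the modal stretches, the order of the two gain blocks, and the sign convention (negative mean for the deletion block and positive for the gains, following the labelling constraint of the mixture model, whereas the slide lists the two means with the opposite signs).

```python
import math
import random

def simulate_profile(rng, sd=0.3):
    """One simulated chromosome: list of (true_state, z) pairs for 200 GS."""
    # Altered block sizes (30, 20, 10) come from the slide; the modal stretch
    # lengths and the order of the gain blocks are hypothetical.
    layout = [("modal", 40), ("deletion", 30), ("modal", 40),
              ("gain", 20), ("modal", 30), ("gain", 10), ("modal", 30)]
    # Assumed sign convention: deletion below zero, gain above zero.
    means = {"modal": 0.0, "deletion": -math.log(2), "gain": math.log(2)}
    profile = []
    for state, length in layout:
        for _ in range(length):
            profile.append((state, rng.gauss(means[state], sd)))
    return profile

rng = random.Random(1)  # fixed seed so the sketch is reproducible
replications = [simulate_profile(rng) for _ in range(50)]
print(len(replications), len(replications[0]))
```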

18

CGH-Miner

• Data mining approach to select gains and losses (Wang et al., 2005):

– Hierarchical clustering with a spatial constraint (i.e. only spatially adjacent clusters are joined)

– Subtree selection according to predefined rules: focus on selecting large consistent gain/loss regions and small (big spike) regions

– Implemented in the CGH-Miner Excel plug-in

• Estimation of FDR using a reference (normal/normal) array and the same set of rules to prune the tree. Declared target: 1%

• Simulation set-up is similar to Wang et al.

19

Classification obtained by CGH-Miner and CGHmix

[Figure: classifications along the simulated layout Modal, Deletion, Modal, Gain, Modal, Gain, Modal (block sizes shown: 30, 10, 20)]

20

Posterior probabilities of allocation to the 3 components

21

Comparative performance between CGHmix and CGH-Miner

| 50 simulations | CGHmix | CGH-Miner |
|---|---|---|
| Realised false positive (mean) | 1.9 | 16.4 |
| Realised false positive (range) | 0 -- 20 | 3 -- 39 |
| Realised false negative (mean) | 1.0 | 9.6 |
| Realised false negative (range) | 0 -- 4 | 0 -- 50 |
| Realised FDR (%) | 2.8 | 23.7 |
| Estimated FDR (%) | 1.3 | 1.2 |
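For a single simulated data set, the realised quantities in such a comparison could be computed as in the Python sketch below. The definitions are assumptions, since the slides do not spell them out: a false positive is taken to be a truly modal GS declared altered, a false negative a truly altered GS declared modal, and the realised FDR the share of false positives among all GS declared altered. The example labels are hypothetical.

```python
def realised_errors(truth, calls):
    """Return (false_positives, false_negatives, realised_fdr) for one data set.

    truth and calls are equal-length lists of 'modal', 'deletion' or 'gain'.
    """
    fp = sum(1 for t, c in zip(truth, calls) if t == "modal" and c != "modal")
    fn = sum(1 for t, c in zip(truth, calls) if t != "modal" and c == "modal")
    declared = sum(1 for c in calls if c != "modal")
    fdr = fp / declared if declared else 0.0
    return fp, fn, fdr

# Hypothetical truth and classification for ten GS.
truth = ["modal"] * 4 + ["deletion"] * 3 + ["gain"] * 3
calls = ["modal", "modal", "modal", "gain",
         "deletion", "deletion", "modal",
         "gain", "gain", "gain"]
fp, fn, fdr = realised_errors(truth, calls)
print(fp, fn, fdr)
```

Averaging these quantities over the replications would give summaries of the kind reported in the table.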

22

4 -- Analyses of CGH-array cancer data sets

23

Breast cancer cell line MCF7

• Data from Pollack et al., 6691 GS on 23 chromosomes

• $\hat{\mu}_1 = -0.35$, $\hat{\sigma}_1 = 0.37$

• ($\mu_2 = 0$), $\hat{\sigma}_2 = 0.27$

• $\hat{\mu}_3 = 0.44$, $\hat{\sigma}_3 = 0.54$

• Estimated FDR CGHmix = 2.6%

• Estimated FDR CGH-Miner = 1.5%

24

25

Classification of GS obtained by CGHmix

26

Known alterations found by both methods

Additional known alterations found by CGHmix

27

Neuroblastoma KCNR cell line

Curie Institute CGH custom array for chromosome 1

• 190 genomic clones, mostly on the short arm

• 3 replicate spots for each

• $\hat{\mu}_1 = -0.49$, loss component

• $\hat{\mu}_3 = 0.04$, not plausible: no gain in this case

• Estimate FDR by regrouping the c=2 and c=3 classes

• Substantial number of deletions on the short arm

• No deletion found for the long arm by CGHmix, a result confirmed by classical cytogenetic information

28

Long arm

29

Extensions

• Account for variability in the case of repeated measurements: add a measurement model with GS-specific noise, with exchangeable prior

• Refine the spatial model:

– Incorporate genomic sequence location in the neighbourhood definition of the CAR model: 0-1 contiguity, spatial weights

– In particular, account for overlapping sequences by using weights that depend on the overlap