Algorithms for Smoothing Array CGH data


1

Algorithms for Smoothing Array CGH data

Kees Jong (VU, CS and Mathematics)
Elena Marchiori (VU, Computer Science)
Aad van der Vaart (VU, Mathematics)
Gerrit Meijer (VUMC)
Bauke Ylstra (VUMC)
Marjan Weiss (VUMC)

2

Tumor Cell

Chromosomes of tumor cell:

3

CGH Data

Clones/Chromosomes

Copy#

4

Naïve Smoothing

5

“Discrete” Smoothing

Copy numbers are integers

6

Why Smoothing?

• Noise reduction

• Detection of Loss, Normal, Gain, Amplification

• Breakpoint analysis

Recurrent (over tumors) aberrations may indicate:

– an oncogene, or

– a tumor suppressor gene

7

Is Smoothing Easy?

Measurements are relative to a reference sample

Printing, labeling and hybridization may be uneven

Tumor sample is inhomogeneous

• The vertical scale is relative

• Expect only a few levels

8

Smoothing: example

9

Problem Formalization

A smoothing can be described by:

• a number of breakpoints

• corresponding levels

A fitness function scores each smoothing according to how well it fits the data

An algorithm finds the smoothing with the highest fitness score.
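As a rough illustration of the breakpoint-plus-levels description, the sketch below turns a list of breakpoints into a piecewise-constant smoothing. It assumes each level is simply the mean of its segment and that a breakpoint `y` means "a new segment starts after the first `y` clones"; the function name `smooth` is hypothetical and not taken from the slides.

```python
def smooth(x, breakpoints):
    """Replace every value by the level (here: the mean) of its segment.

    Assumption: breakpoints are positions 0 < y < len(x), and segment i
    covers x[y_{i-1}:y_i] in Python slice notation.
    """
    edges = [0] + sorted(breakpoints) + [len(x)]
    smoothed = []
    for start, end in zip(edges[:-1], edges[1:]):
        level = sum(x[start:end]) / (end - start)
        # Every clone in the segment gets the same level.
        smoothed.extend([level] * (end - start))
    return smoothed
```

With one breakpoint after the second clone, `smooth([1.0, 3.0, 10.0, 14.0], [2])` yields two flat segments.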

10

Smoothing

breakpoints

levels

variance

11

Fitness Function

We assume that the data are a realization of a Gaussian noise process and use the maximum likelihood criterion, adjusted with a penalty term that accounts for model complexity

We could use better models given insight into tumor pathogenesis

12

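Fitness Function (2)

CGH values: $x_1, \ldots, x_n$

breakpoints: $0 < y_1 < \cdots < y_N < n$

levels: $\mu_1, \ldots, \mu_{N+1}$

error variances: $\sigma_1^2, \ldots, \sigma_{N+1}^2$

likelihood: $\prod_{i=1}^{N+1} \prod_{j=y_{i-1}+1}^{y_i} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(x_j-\mu_i)^2}{2\sigma_i^2}\right)$, with $y_0 = 0$ and $y_{N+1} = n$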

13

Fitness Function (3)

Maximum likelihood estimators of μ and σ² can be found explicitly
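For reference, these are the standard Gaussian estimators, written under the assumption that segment $i$ covers clones $y_{i-1}+1$ through $y_i$, with $y_0 = 0$ and $y_{N+1} = n$ (an indexing convention assumed here, not stated on the slide):

```latex
\hat{\mu}_i = \frac{1}{y_i - y_{i-1}} \sum_{j = y_{i-1}+1}^{y_i} x_j,
\qquad
\hat{\sigma}_i^2 = \frac{1}{y_i - y_{i-1}} \sum_{j = y_{i-1}+1}^{y_i} \left(x_j - \hat{\mu}_i\right)^2
```

Substituting them back, the log likelihood of segment $i$ reduces to $-\tfrac{1}{2}(y_i - y_{i-1})\left(\log(2\pi\hat{\sigma}_i^2) + 1\right)$, so only the breakpoints remain to be optimized.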

Need to add a penalty to the log likelihood to control the number N of breakpoints

penalty
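A minimal sketch of such a penalized fitness is given below. The exact penalty from the slide is not recoverable from this transcript, so the linear form `lam * N` is an assumption, as are the names `fitness` and `segment_bounds` and the variance floor `min_var` (a hypothetical guard that keeps very short segments from receiving an unbounded likelihood).

```python
import math


def segment_bounds(n, breakpoints):
    """Return (start, end) slice bounds of the segments induced by the breakpoints."""
    edges = [0] + sorted(breakpoints) + [n]
    return list(zip(edges[:-1], edges[1:]))


def fitness(x, breakpoints, lam=1.0, min_var=1e-2):
    """Penalized Gaussian log likelihood of a piecewise-constant smoothing.

    Assumptions (not from the slides): penalty = lam * number of breakpoints,
    and per-segment variances are floored at min_var.
    """
    loglik = 0.0
    for start, end in segment_bounds(len(x), breakpoints):
        seg = x[start:end]
        m = len(seg)
        mu = sum(seg) / m                        # ML estimate of the level
        var = sum((v - mu) ** 2 for v in seg) / m  # ML estimate of the variance
        var = max(var, min_var)
        # Closed form of the segment log likelihood at the ML estimates
        # (exact when the floor is not active).
        loglik += -0.5 * m * (math.log(2 * math.pi * var) + 1)
    return loglik - lam * len(breakpoints)
```

On a toy profile with a clear jump in the middle, a breakpoint at the jump should score higher than no breakpoint or a misplaced one.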

14

Algorithms

Maximizing Fitness is computationally hard

Use a genetic algorithm + local search to find an approximation to the optimum

15

Algorithms: Local Search

choose N breakpoints at random

while (improvement)

- randomly select a breakpoint

- move the breakpoint one position to the left or to the right
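The loop above can be sketched as follows. This is one possible reading of the slide, not the authors' code: a move is accepted only if it increases fitness, candidate moves are tried in random order, and the search stops when no single-step move helps. The helper `gaussian_fitness`, its penalty `lam`, and the variance floor `min_var` are assumptions made so the example runs on its own.

```python
import math
import random


def gaussian_fitness(x, bps, lam=1.0, min_var=1e-2):
    """Assumed fitness: Gaussian log likelihood minus lam per breakpoint."""
    edges = [0] + list(bps) + [len(x)]
    total = 0.0
    for s, e in zip(edges[:-1], edges[1:]):
        seg = x[s:e]
        mu = sum(seg) / len(seg)
        var = max(sum((v - mu) ** 2 for v in seg) / len(seg), min_var)
        total -= 0.5 * len(seg) * (math.log(2 * math.pi * var) + 1)
    return total - lam * len(bps)


def local_search(x, bps, fitness=gaussian_fitness, rng=random):
    """Shift single breakpoints left or right while that improves fitness."""
    bps = sorted(bps)
    best = fitness(x, bps)
    improved = True
    while improved:
        improved = False
        # All (breakpoint index, direction) pairs, visited in random order.
        moves = [(i, d) for i in range(len(bps)) for d in (-1, 1)]
        rng.shuffle(moves)
        for i, d in moves:
            new_pos = bps[i] + d
            lo = bps[i - 1] if i > 0 else 0
            hi = bps[i + 1] if i + 1 < len(bps) else len(x)
            if not (lo < new_pos < hi):
                continue  # would collide with a neighbour or leave the data
            cand = bps[:i] + [new_pos] + bps[i + 1:]
            score = fitness(x, cand)
            if score > best:
                bps, best, improved = cand, score, True
                break
    return bps, best
```

Starting from a misplaced breakpoint on a toy profile with one jump, the search walks the breakpoint toward the jump.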

16

Genetic Algorithm

Given a “population” of candidate smoothings, create a new smoothing as follows:

- select two “parents” at random from the population

- generate “offspring” by combining the parents (e.g. “uniform crossover” or “union”)

- apply mutation to each offspring

- apply local search to each offspring

- replace the two worst individuals with the offspring
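The two recombination operators named above can be sketched on a bit-mask encoding, where bit `pos` is 1 if there is a breakpoint at position `pos`. The encoding and all function names here are assumptions for illustration; the slides do not say how candidate smoothings are represented.

```python
import random


def to_mask(bps, n):
    """Encode breakpoints 0 < y < n as a 0/1 list over positions 1..n-1."""
    chosen = set(bps)
    return [1 if pos in chosen else 0 for pos in range(1, n)]


def from_mask(mask):
    """Decode a 0/1 list back into a sorted list of breakpoint positions."""
    return [pos for pos, bit in enumerate(mask, start=1) if bit]


def uniform_crossover(a, b, rng=random):
    """Take each bit from parent a or parent b with equal probability."""
    return [p if rng.random() < 0.5 else q for p, q in zip(a, b)]


def union_crossover(a, b):
    """Keep every breakpoint present in either parent (bitwise OR)."""
    return [p | q for p, q in zip(a, b)]
```

Uniform crossover keeps any breakpoint the parents agree on and mixes the rest, while union can only add breakpoints, so it tends to need a later step that removes superfluous ones.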

17

Experiments

• Comparison of

– GLS

– GLSo

– Multi Start Local Search (mLS)

– Multi Start Simulated Annealing (mSA)

• GLS is significantly better than the other algorithms.

18

Comparison to Expert

expert

algorithm

19

Relating to Gene Expression

20

Relating to Gene Expression

23

Conclusion

• Breakpoint identification as model fitting: search for the most likely model given the data

• Genetic algorithms + local search perform well

• Results comparable to those produced by hand by the local expert

• Future work:

– Analyse the relationship between chromosomal aberrations and gene expression

24

Example of a-CGH Tumor

Clones/Chromosomes

Value

25

a-CGH vs. Expression

a-CGH

• DNA

– In Nucleus

– Same for every cell

• DNA on slide

• Measure Copy Number Variation

Expression

• RNA

– In Cytoplasm

– Different per cell

• cDNA on slide

• Measure Gene Expression

26

Breakpoint Detection

• Identify possibly damaged genes:

– These genes will not be expressed anymore

• Identify recurrent breakpoint locations:

– Indicates fragile pieces of the chromosome

• Accuracy is important:

– Important genes may be located in a region with (recurrent) breakpoints

27

Experiments

• Both GAs are robust:

– Over different randomly initialized runs, breakpoints are (mostly) placed at the same location

• Both GAs converge:

– The “individuals” in the pool are very similar

• The final result looks very much like the smoothing conducted by the local expert (mean error = 0.0513)

28

Genetic Algorithm 1 (GLS)

initialize population of candidate solutions randomly

while (termination criterion not satisfied)

- select two parents using roulette wheel

- generate offspring using uniform crossover

- apply mutation to each offspring

- apply local search to each offspring

- replace the two worst individuals with the offspring
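One iteration of this loop might look like the sketch below. It is written against caller-supplied operators, so `fitness`, `crossover`, `mutate` and `improve` are placeholders for whatever the real implementation uses. Two details are assumptions: fitness values are shifted before roulette-wheel selection (log-likelihood scores can be negative, and the wheel needs positive weights), and the two offspring come from applying the crossover twice to the same pair of parents.

```python
import random


def roulette_select(population, scores, rng=random):
    """Pick one individual with probability proportional to shifted fitness."""
    low = min(scores)
    # Shift so that all weights are positive (assumption, see lead-in).
    weights = [s - low + 1e-9 for s in scores]
    return rng.choices(population, weights=weights, k=1)[0]


def gls_step(population, fitness, crossover, mutate, improve, rng=random):
    """One steady-state generation: two offspring replace the two worst."""
    scores = [fitness(ind) for ind in population]
    p1 = roulette_select(population, scores, rng)
    p2 = roulette_select(population, scores, rng)
    # Crossover, then mutation, then local search on each offspring.
    offspring = [improve(mutate(crossover(p1, p2))) for _ in range(2)]
    # Indices sorted by ascending fitness; the first two are the worst.
    order = sorted(range(len(population)), key=scores.__getitem__)
    survivors = [population[i] for i in order[2:]]
    return survivors + offspring
```

Because only the two worst individuals are replaced, the best smoothing found so far always survives to the next iteration.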

29

Genetic Algorithm 2 (GLSo)

initialize population of candidate solutions randomly

while (termination criterion not satisfied)

- select 2 parents using roulette wheel

- generate offspring using OR crossover

- apply local search to offspring

- apply “join” to offspring

- replace worst individual with offspring

30

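Fitness function (2)

CGH values: $x_1, \ldots, x_n$

breakpoints: $0 < y_1 < \cdots < y_N < n$

likelihood: $\prod_{i=1}^{N+1} \prod_{j=y_{i-1}+1}^{y_i} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(x_j-\mu_i)^2}{2\sigma_i^2}\right)$, with $y_0 = 0$ and $y_{N+1} = n$

levels: $\mu_1, \ldots, \mu_{N+1}$

error variances: $\sigma_1^2, \ldots, \sigma_{N+1}^2$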