Interval Scores for Quality Annotated CGH Data



Doron Lipson (1), Anya Tsalenko (2), Zohar Yakhini (1,2) and Amir Ben-Dor (2)

(1) Technion, Haifa, Israel; (2) Agilent Laboratories, Palo Alto, CA

References

1. Barrett MT, Scheffer A, Ben-Dor A, Sampas N, Lipson D, Kincaid R, Tsang P, Curry B, Baird K, Meltzer PS, Yakhini Z, Bruhn L, and Laderman S. Comparative Genomic Hybridization using Oligonucleotide Microarrays and Total Genomic DNA. PNAS 2004; 101(51): 17765-17770.

2. Lipson D, Aumann Y, Ben-Dor A, Linial N, and Yakhini Z. Efficient Calculation of Interval Scores for DNA Copy Number Data Analysis. Ninth Annual International Conference on Research in Computational Molecular Biology, RECOMB 2005 (Cambridge, MA).

3. Pollack JR, Sorlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Borresen-Dale AL, and Brown PO. Microarray Analysis Reveals a Major Direct Role of DNA Copy Number Alteration in the Transcriptional Program of Human Breast Tumors. PNAS 2002; 99(20): 12963-12968.

4. Dehan E, Ben-Dor A, Liao W, Lipson D, Rienstein S, Simansky D, Krupsky M, Yaron P, Friedman E, Rechavi G, Perlman M, Aviram-Goldring A, Bittner M, Yakhini Z, and Kaminski N. Chromosomal Aberrations and Gene Expression Profiles in Non Small Cell Lung Cancer. In preparation.

Most human cancers arise as a result of an acquired genomic instability and the subsequent evolution of clonal populations of cells with accumulated genetic errors. Accordingly, most cancers and some premalignant tissues contain multiple genomic abnormalities not present in cells within the normal tissues from which the neoplasias arose. These abnormalities include gains and losses of chromosomal regions that vary extensively in size, up to and including whole chromosomes. Increases in genomic copy number can lead to overexpression of tumor promoter genes (oncogenes), while losses are associated with disruption of normal cell regulatory processes (e.g., through the loss of tumor suppressor genes).

The Cancer Genome

Normal Human Genome: Stable diploid copy number, even in most diseases (e.g., cardiovascular, neurological).

Cancer Genome: Multiple genome-wide chromosome aberrations, including copy number changes and rearrangements.

Array-based Comparative Genomic Hybridization (aCGH)

DNA copy number alterations have been measured using fluorescence in situ hybridization-based techniques. The development of a genome-wide technique, Comparative Genomic Hybridization (CGH), made it possible to measure jointly the multiple chromosomal alterations present in cancer cells. Differentially labeled tumor and normal DNA are co-hybridized to normal metaphase chromosomes, and the ratios between the two labels allow changes in DNA copy number to be quantified. In a more advanced method termed array CGH (aCGH), the metaphase chromosomes are replaced by a microarray of thousands of genomic BAC, cDNA or oligonucleotide probes, greatly enhancing the resolution at which changes in DNA copy number may be detected.

HT-29 colon carcinoma cell line [1]

The Interval Score

Let C = (c_1, …, c_n) be a vector of all log(R/G) measurements along some chromosome. If the target contains an aberration, then we expect to see many consecutive positive or negative entries in C. On the other hand, if the target is normal, we expect no localized effects. Intuitively, we look for intervals (sets of consecutive probes) where signal sums are significantly higher or lower than expected at random. As a null model we assume that no aberration is present in the target, and therefore the variation in C represents only the noise of the measurement.

Assuming that the measurement noise along the chromosome is independent for distinct probes and normally distributed, let µ and σ denote the mean and standard deviation of the normal genomic data. Given an interval I spanning k probes, we define its score as:

$$ S(I) = \frac{\sum_{i \in I} (c_i - \mu)}{\sigma \sqrt{|I|}} $$
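As a minimal sketch of this definition, the score of one interval can be computed directly from the log-ratio vector. The function name `interval_score`, the half-open index convention, the default values µ = 0 and σ = 1, and the toy data are assumptions made for illustration; they are not taken from the poster.

```python
import math

def interval_score(c, start, end, mu=0.0, sigma=1.0):
    """Score of the probes c[start:end] (end exclusive) under the null model.

    mu and sigma are the mean and standard deviation of normal genomic data;
    the defaults assume already-normalized log ratios (an assumption).
    """
    k = end - start
    if k <= 0:
        raise ValueError("interval must contain at least one probe")
    total = sum(x - mu for x in c[start:end])
    return total / (sigma * math.sqrt(k))

# Hypothetical log-ratio vector with an apparent gain at positions 2..5.
log_ratios = [0.0, 0.1, 1.0, 1.0, 1.0, 1.0, -0.1, 0.0]
gain = interval_score(log_ratios, 2, 6)    # the four elevated probes
whole = interval_score(log_ratios, 0, 8)   # the entire vector
```

Dividing by the square root of the interval length is what makes a short, consistently elevated run score higher than the same signal diluted over a longer interval.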

MaxInterval Algorithm I: LookAhead

Assume you are given:

• m: an upper bound for the value of a single element c_i

• t: a lower bound on the maximum score

If we are currently considering an interval I = [i, …, i+k-1] with a sum of s = Σ_{j∈I} c_j, then (taking µ = 0 and σ = 1) the score of I is:

$$ S(I) = \frac{s}{\sqrt{k}} $$

The score of an interval I' = [i, …, i+k+x-1] is then bounded by:

$$ S(I') \le \frac{s + mx}{\sqrt{k + x}} $$

Solve for the first x for which S(I') may exceed t.

[Figure: interval I (sum s, length k, score s/√k) extended to I' (sum at most s+mx, length k+x, score at most (s+mx)/√(k+x)).]

Complexity: Expected O(n^1.5) (unproved)
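The look-ahead idea can be sketched as follows, assuming normalized data (µ = 0, σ = 1), taking m as the largest element of the vector and t as the best score found so far. The function names, the prefix-sum bookkeeping, and the closed-form solution of the bound are illustrative choices and may differ from the authors' implementation. The sketch searches for the highest-scoring interval only; applying it to the negated vector would look for deletions.

```python
import math

def first_promising_extension(s, k, m, t):
    """Smallest integer x >= 1 with (s + m*x) / sqrt(k + x) > t.

    Assumes m > 0, t > 0 and s <= m*k, which makes the bound increasing in x.
    """
    def bound(x):
        return (s + m * x) / math.sqrt(k + x)

    # Larger root of (s + m*x)^2 = t^2 * (k + x), where the bound equals t.
    disc = t * t * (t * t + 4.0 * m * (m * k - s))
    root = (t * t - 2.0 * s * m + math.sqrt(max(disc, 0.0))) / (2.0 * m * m)
    x = max(1, math.floor(root) + 1)
    # Guard against floating-point error in the closed form.
    while x > 1 and bound(x - 1) > t:
        x -= 1
    while bound(x) <= t:
        x += 1
    return x

def max_interval_lookahead(c):
    """Return ((start, end), score), end exclusive, for a non-empty list c."""
    n = len(c)
    prefix = [0.0]
    for value in c:
        prefix.append(prefix[-1] + value)

    m = max(c)                          # upper bound on a single element
    first = c.index(m)
    best, t = (first, first + 1), m     # the best single probe scores m
    for i in range(n):
        k = 1
        while i + k <= n:
            s = prefix[i + k] - prefix[i]
            score = s / math.sqrt(k)
            if score > t:
                t, best = score, (i, i + k)
            # Skip every length whose upper bound cannot exceed t; when no
            # element is positive the bound is not used and we step by one.
            k += first_promising_extension(s, k, m, t) if m > 0 else 1
    return best, t
```

Skipped extensions are safe because each has a score no larger than its bound, and the bound for those lengths does not exceed the current t.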

Applications: Common Aberrations

Finding common aberrations in a set of samples can be performed directly by using variants of the interval score (see [2] for details).

[Figure: aberrations along chromosome 3 (position axis from 0 to 180) for 26 sample rows labeled 2001T-1 to 2099T-1; y-axis: Samples; legend: >0, <0.]

Chromosome 3 of 26 lung tumor samples on a mid-density cDNA array. Data from Dehan et al. [4]. Common deletion located in 3p21 and common amplification in 3q.

[Figure: aberrations along chromosomes 8 and 11; y-axis: Samples.]

Chromosomes 8 and 11 of 37 breast tumor samples on a mid-density cDNA array. Data from Pollack et al. [3]. Common deletion located in 8p and common amplification in 11q.

Applications: Single Samples

Chromosome 16 of the HCT116 colon carcinoma cell line on a high-density oligo array (n = 5,464). Data from Barrett et al. [1].

[Figure: Log2(ratio) (axis from -1 to 1) against position (0 to 75 Mbp); labeled loci: FRA16B, A2BP1.]

Chromosome 17 of several breast carcinoma cell lines on a mid-density cDNA array (n = 364). Data from Pollack et al. [3].

[Figure: four panels of Log2(ratio) (axis from 0 to 1) against position (0 to 75 Mbp); labeled locus: ERBB2.]

Quality Weighted Interval Scores

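For an interval I spanning k probes, compute a weighted mean:

$$ \mu(I) = \frac{\sum_{i \in I} c_i / q_i^2}{\sum_{i \in I} 1 / q_i^2} $$

Variance of individual loci:

$$ \sigma_{loci} = \sqrt{\frac{1}{\sum_{i \in I} 1 / q_i^2}} $$

Variance due to consistency within the interval:

$$ \sigma_{con} = \sqrt{\frac{k}{k-1} \cdot \frac{\sum_{i \in I} (c_i - \mu(I))^2 / q_i^2}{\sum_{i \in I} 1 / q_i^2}} $$

And finally, the interval score, in which α weights the two variance terms:

$$ \sigma(I) = \sqrt{\alpha\,\sigma_{loci}^2 + (1 - \alpha)\,\frac{1}{k}\,\sigma_{con}^2}, \qquad S(I) = \frac{\mu(I)}{\sigma(I)} $$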

Consider the vector V = ((c_1, q_1), (c_2, q_2), …, (c_n, q_n)), where at each locus i the number c_i is the measured log(R/G) and the number q_i is the standard deviation of that particular measurement. For every i, set w_i = q_i^(-2).
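A quality-weighted score of this kind can be sketched as below. The sketch assumes the reading in which each probe is weighted by w_i = 1/q_i², the interval variance mixes a single-locus term with a within-interval consistency term divided by k, and a parameter α controls the mix. The function name, the default α = 0.5, and the handling of single-probe intervals are assumptions for illustration.

```python
import math

def quality_weighted_score(c, q, alpha=0.5):
    """Quality-weighted score of one interval.

    c: log(R/G) values of the probes in the interval
    q: per-probe standard deviations (same length as c, all positive)
    alpha: weight of the single-locus variance term (hypothetical default)
    """
    k = len(c)
    if k == 0 or k != len(q):
        raise ValueError("c and q must be non-empty and of equal length")
    w = [1.0 / (qi * qi) for qi in q]                  # w_i = q_i^(-2)
    w_sum = sum(w)
    mu = sum(wi * ci for wi, ci in zip(w, c)) / w_sum  # weighted mean
    var_loci = 1.0 / w_sum
    if k > 1:
        spread = sum(wi * (ci - mu) ** 2 for wi, ci in zip(w, c)) / w_sum
        var_con = k / (k - 1) * spread
    else:
        var_con = 0.0                                  # undefined for k = 1; assumed 0
    sigma = math.sqrt(alpha * var_loci + (1.0 - alpha) * var_con / k)
    return mu / sigma                                  # fails if sigma is zero
```

With α = 1 and all q_i equal to σ, this reduces to the unweighted score Σ c_i / (σ √k) for µ = 0, which is a useful sanity check on the weighting.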

Chr. 17 of the MDA-MB-453 breast cancer cell line. Data from Barrett et al. [1].

Analysis using the simple interval score:

Analysis that accounts for the signal consistency within the interval (σ_con) and the single-locus variance (σ_loci):

Note the difference in the aberrations called for the genomic regions 58-75 Mbp and 8-15 Mbp.

Radii of the data points are proportional to w_i.

The MaxInterval Problem

For convenience of algorithmic analysis we define the MaxInterval problem of finding the maximal scoring interval. Other intervals with high scores may be found by recursively calling this function.

Input: A vector C = (c_1, …, c_n)

Output: An interval I ⊆ [1…n] that maximizes S(I)
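For reference, a straightforward exhaustive solution evaluates every interval with a running sum, which takes O(n²) time. This is a minimal sketch assuming normalized data (µ = 0, σ = 1) and half-open index pairs; the function name and the toy vector are illustrative. It maximizes the signed score, so deletions would be found by running it on the negated vector.

```python
import math

def max_interval_exhaustive(c):
    """Return ((start, end), score) maximising sum(c[start:end]) / sqrt(end - start)."""
    best_score, best = -math.inf, None
    for start in range(len(c)):
        s = 0.0
        for end in range(start + 1, len(c) + 1):
            s += c[end - 1]                      # running sum of c[start:end]
            score = s / math.sqrt(end - start)
            if score > best_score:
                best_score, best = score, (start, end)
    return best, best_score

# Hypothetical log-ratio vector with a gain at positions 2..5.
log_ratios = [0.0, 0.1, 1.0, 1.0, 1.0, 1.0, -0.1, 0.0]
interval, score = max_interval_exhaustive(log_ratios)
```

Further high-scoring intervals could then be sought by calling the same function on the segments to the left and right of the reported interval.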

Identification and Mapping of Genomic Alteration Events

A common first step in analyzing DNA copy number data consists of identifying aberrant (amplified or deleted) regions in each individual sample.

Given a series of log(R/G) measurements along some genomic region, e.g. a chromosome, we would like to identify intervals within this vector that consistently contain significantly high values (amplifications) or significantly low values (deletions).

[Figure: two panels, Deletion and Amplification, plotting Log2(R/G) (axis from -0.5 to 0.5) against genomic position.]

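MaxInterval Algorithm II: Geometric Family Approximation (GFA)

For ε > 0, define the following geometric family of intervals:

$$ \mathcal{I}(j) = \{\, [\, i\Delta_j,\; i\Delta_j + k_j - 1 \,] : 0 \le i \le (n - k_j)/\Delta_j \,\}, \qquad k_j = (1 + \varepsilon)^j, \quad \Delta_j = \varepsilon k_j $$

[Figure: the interval families I(j_1), I(j_2), I(j_3).]

Theorem [2]: Let I* be the optimal scoring interval. Let J be the leftmost longest interval of the family fully contained in I*. Then S(J) ≥ S(I*)/α, where α is an approximation factor determined by ε (see [2]).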

Complexity: O(n)
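A geometric-family search can be sketched as follows: interval lengths grow geometrically by a factor (1 + ε), and intervals of each length start at multiples of a stride proportional to that length, so for a fixed ε only O(n) intervals are scored. Rounding the lengths and strides down to integers, forcing a stride of at least one probe, the default ε = 0.1, and the function name are all assumptions of this sketch rather than details from the poster. Data are assumed normalized (µ = 0, σ = 1).

```python
import math

def max_interval_gfa(c, eps=0.1):
    """Best interval within a geometric family; returns ((start, end), score)."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    n = len(c)
    prefix = [0.0]
    for value in c:
        prefix.append(prefix[-1] + value)

    best_score, best = -math.inf, None
    seen_lengths = set()
    j = 0
    while True:
        k = math.floor((1.0 + eps) ** j)       # length k_j, rounded down (assumption)
        if k > n:
            break
        j += 1
        if k in seen_lengths:                  # small j can repeat a length
            continue
        seen_lengths.add(k)
        delta = max(1, math.floor(eps * k))    # stride, at least one probe
        for start in range(0, n - k + 1, delta):
            score = (prefix[start + k] - prefix[start]) / math.sqrt(k)
            if score > best_score:
                best_score, best = score, (start, start + k)
    return best, best_score
```

The result is an approximation: the reported score never exceeds the true maximum, and a smaller ε tightens the approximation at the cost of scoring more intervals.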


Benchmarking

Benchmarking results of the Exhaustive, LookAhead and GFA algorithms on synthetic vectors of varying lengths. Linear regression suggests that the complexities of the Exhaustive, LookAhead and GFA algorithms are O(n^2), O(n^1.5) and O(n), respectively.
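One common way to estimate such an exponent is a least-squares fit of log(cost) against log(n); the slope of the fitted line approximates the exponent. The sketch below illustrates the fit on a cost that is known exactly, the number of intervals an exhaustive search evaluates, n(n+1)/2, rather than on measured running times. The function name and the chosen vector lengths are assumptions, and the numbers are not the poster's benchmark data.

```python
import math

def loglog_slope(sizes, costs):
    """Least-squares slope of log(cost) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in costs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Hypothetical vector lengths; the cost is the exact interval count n(n+1)/2.
sizes = [100, 200, 400, 800, 1600, 3200]
exhaustive_cost = [n * (n + 1) // 2 for n in sizes]
slope = loglog_slope(sizes, exhaustive_cost)   # close to 2 for a quadratic cost
```

Replacing `exhaustive_cost` with measured running times for each algorithm would give the kind of empirical exponents reported above.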
