Post on 25-Jan-2016
Biology-Driven Clustering of Microarray Data: Applications to the NCI60 Data Set
K.R. Coombes, K.A. Baggerly, D.N. Stivers, J. Wang, D. Gold, H.G. Sung, and S.J. Lee
Introduction
Most analyses of microarray data proceed as though it were simply a large, unstructured matrix. Such analyses ignore substantial amounts of existing biological information. In the study of cancer, we already know many important genes through their involvement in specific biological processes, and we know that reproducible chromosomal abnormalities play an important role. We see a need for developing analytic strategies that exploit this biological information.
We analyzed the NCI60 data set by first determining the chromosomal location and biological function of the genes on the microarray. We performed separate analyses using genes on individual chromosomes and genes involved in different biological processes. The fundamental advantage of this approach is that it provides results that are immediately and directly interpretable without resorting to ex post facto rationalizations.
Methods
How many genes on the microarray have good annotations?
Number of Spots   Accession Numbers   Current UniGene Status
294               None                None (control spots)
128               Only 3’             Unknown to UniGene
1379              Only 3’             Known to UniGene
1                 Only 5’             Unknown to UniGene
6                 Only 5’             Known to UniGene
399               Both                Unknown to UniGene
763               Both                3’ known, 5’ unknown
291               Both                3’ unknown, 5’ known
646               Both                Both known, but disagree
6093              Both                Both known, and agree
Table 1: There are only 7478 spots (out of 10,000) on the array with valid, matching UniGene cluster IDs. Genes with unknown or conflicting annotations were eliminated before performing any further analysis.
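The total in the caption can be checked against the rows of Table 1. The sketch below assumes that the retained spots are those with a single accession number known to UniGene, plus those where both ends are known and agree. The poster does not state this rule explicitly, but it is the reading that reproduces the 7478 figure.

```python
# Spot counts copied from Table 1, keyed by (accession numbers, UniGene status).
table1 = {
    ("None", "None (control spots)"): 294,
    ("Only 3'", "Unknown to UniGene"): 128,
    ("Only 3'", "Known to UniGene"): 1379,
    ("Only 5'", "Unknown to UniGene"): 1,
    ("Only 5'", "Known to UniGene"): 6,
    ("Both", "Unknown to UniGene"): 399,
    ("Both", "3' known, 5' unknown"): 763,
    ("Both", "3' unknown, 5' known"): 291,
    ("Both", "Both known, but disagree"): 646,
    ("Both", "Both known, and agree"): 6093,
}

# Assumption: these are the statuses treated as "valid, matching".
kept_statuses = {"Known to UniGene", "Both known, and agree"}

total_spots = sum(table1.values())
kept_spots = sum(n for (_, status), n in table1.items() if status in kept_statuses)
print(total_spots, kept_spots)
```

Under this assumption, spots with only one of two ends known to UniGene are counted as eliminated.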
• Problem:
  – I.M.A.G.E. clone IDs and GenBank accession numbers are archival.
  – UniGene clusters, gene names, descriptions, etc., are changeable.
• Solution:
  – Download the latest version of UniGene (build 137) and LocusLink (July 2001) to update annotations, using the GenBank accession numbers describing both 3’ and 5’ ends of the genes spotted on the microarrays.
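The filtering that this update makes possible can be sketched as a small lookup. This is a minimal illustration, not the authors' code: the function name, the `acc_to_cluster` mapping, and the accession and cluster IDs in the usage lines are all hypothetical placeholders.

```python
from typing import Dict, Optional

def classify_spot(acc_3p: Optional[str], acc_5p: Optional[str],
                  acc_to_cluster: Dict[str, str]) -> Optional[str]:
    """Return the UniGene cluster ID for a spot, or None if it should be dropped.

    A spot is kept when every accession number it has maps to a cluster and,
    if both ends are present, the two clusters are the same.
    """
    accessions = [a for a in (acc_3p, acc_5p) if a is not None]
    if not accessions:                      # control spot, no accession numbers
        return None
    clusters = [acc_to_cluster.get(a) for a in accessions]
    if any(c is None for c in clusters):    # at least one end unknown to UniGene
        return None
    if len(set(clusters)) != 1:             # both ends known, but they disagree
        return None
    return clusters[0]

# Placeholder IDs, for illustration only.
lookup = {"A3": "cluster_1", "A5": "cluster_1", "B3": "cluster_2", "B5": "cluster_9"}
print(classify_spot("A3", "A5", lookup))    # both ends agree
print(classify_spot("B3", "B5", lookup))    # ends disagree
print(classify_spot("C3", None, lookup))    # accession unknown to the lookup
```

Rebuilding `acc_to_cluster` from each new UniGene build, while keeping the archival accession numbers fixed, is what lets the annotations be refreshed without touching the array design.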
Where are the genes located?
[Figure 1 plot: (Observed - Expected) / SD for each chromosome (1-22, X, Y); chi^2 = 148.8, p < 10^(-10)]
Figure 1: Distribution of the genes on the array by chromosome. Chromosomes 19 and Y are substantially underrepresented when compared to the numbers known to LocusLink; chromosomes 6 and 13 are overrepresented.
We compared the number of genes on the microarray that mapped to each chromosome with the number known to be on that chromosome, based on current figures from the NCBI. A chi-squared test was used to test whether the genes on the array were distributed across chromosomes in proportion to these known counts.
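The comparison described above can be sketched as a goodness-of-fit calculation. The poster does not say how the SD in Figure 1 was computed; this sketch assumes the square root of the expected count (a Pearson residual), so that the chi-squared statistic is the sum of the squared standardized differences. The function name and the counts are made up for illustration.

```python
import math

def chromosome_representation(observed, known):
    """Compare per-chromosome array counts with counts expected from known genes.

    observed: dict chromosome -> number of array genes mapped there
    known:    dict chromosome -> number of genes known on that chromosome
    Returns (chi2, degrees_of_freedom, z) where z[chrom] = (O - E) / sqrt(E).
    """
    n_array = sum(observed.values())
    n_known = sum(known.values())
    chi2 = 0.0
    z = {}
    for chrom, n in known.items():
        expected = n_array * n / n_known          # proportional allocation
        obs = observed.get(chrom, 0)
        z[chrom] = (obs - expected) / math.sqrt(expected)  # assumed SD = sqrt(E)
        chi2 += z[chrom] ** 2
    return chi2, len(known) - 1, z

# Made-up counts for three hypothetical chromosomes, for illustration only.
observed = {"chrA": 60, "chrB": 10, "chrC": 30}
known = {"chrA": 500, "chrB": 300, "chrC": 200}
chi2, dof, z = chromosome_representation(observed, known)
print(round(chi2, 2), dof, {c: round(v, 2) for c, v in z.items()})
```

A negative value of `z` marks a chromosome that is underrepresented on the array relative to its known gene count, and a positive value marks one that is overrepresented.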
How do we determine gene functions?
• Using our updated UniGene clusters, we followed the links from UniGene to LocusLink to GeneOntology.
• GeneOntology is a structured, hierarchical vocabulary to describe gene functions in three broad areas:
  – biological process (why)
  – molecular function (what)
  – cellular component (where)
• The 7478 good spots on the array corresponded to 6614 distinct genes, of which 5074 were known to LocusLink, and 2989 had at least one annotation in GeneOntology.
We focused on the biological process annotations in the GeneOntology vocabulary, since these had the most natural interpretation for application to the study of cancer. We counted the number of genes having annotations of functions at or below each level in the hierarchy, and selected a set of categories that each contained roughly one to a few hundred genes, with the categories as a whole accounting for more than 95% of all annotations (Table 2).
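Counting genes "at or below" a term amounts to pooling the genes annotated to that term with those annotated to all of its descendants. A minimal sketch, assuming the hierarchy is available as a parent-to-children mapping; the function name, term names, and gene IDs are toy placeholders rather than real GeneOntology content.

```python
def genes_at_or_below(children, direct_genes):
    """Count distinct genes annotated at each term or at any of its descendants.

    children:     dict term -> list of child terms
    direct_genes: dict term -> set of genes annotated directly to that term
    """
    cache = {}

    def collect(term):
        # Sets avoid double counting when a gene is annotated at several levels
        # or when a term is reachable through more than one parent.
        if term not in cache:
            genes = set(direct_genes.get(term, ()))
            for child in children.get(term, ()):
                genes |= collect(child)
            cache[term] = genes
        return cache[term]

    terms = set(children) | set(direct_genes)
    for kids in children.values():
        terms.update(kids)
    return {term: len(collect(term)) for term in terms}

# Toy hierarchy and annotations, for illustration only.
children = {"biological process": ["cell growth", "cell death"],
            "cell death": ["apoptosis"]}
direct_genes = {"cell growth": {"g1", "g2"},
                "cell death": {"g3"},
                "apoptosis": {"g3", "g4"}}
counts = genes_at_or_below(children, direct_genes)
print(counts)
```

Categories can then be chosen by scanning `counts` for terms whose totals fall in the desired size range.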
What functional categories are represented on the array?
Function                # Ann.  # Spots   Function             # Ann.  # Spots
Oncogenesis             140     180       Cell shape and size  78      101
Apoptosis               128     138       Protein traffic      157     188
Physiological proc.     180     210       Transport            146     136
Perc. of ext. stimuli   238     150       Cell proliferation   197     249
Ectoderm devel.         129     152       Stress response      599     372
Mesoderm devel.         92      102       Radiation response   147     136
Cell adhesion           111     140       Cell cycle           494     283
Cell-cell signaling     137     166       Nucleic acid met.    695     595
Signal transduction     222     228       Protein metabolism   471     567
Intracell sig cascade   110     110       Lipid metabolism     146     156
Cell motility           120     153       Carbohydrate met.    103     97
Cell organization       98      118       Energy pathways      88      98
Table 2: The number of annotations (Ann.) into and the number of spots on the array in various functional categories chosen from the biological process annotations from LocusLink into GeneOntology. Individual spots may have multiple annotations into the same category; individual genes may be represented by multiple spots.
[Figure 2 image: dendrogram of the 60 cell lines; axis ticks 0.0 to 0.6. Labels have the form tissue.line and are listed here in the order printed:]
breast: bt549, hs578t, mcf7, mdamb231, mdamb435, mdan, t47d
cns: sf295, sf268, sf539, snb19, u251
colon: ht29, hct116, hct15, km12, sw620, hcc2998, colo205
leukemia: k562, hl60, rpmi8226, srcl7019, molt4, ccrfcem
melanoma: loximvi, uacc577, m14, skmel2, skmel5, malme3m, skmel28, uacc62
nsclc: h322, hop62, h23, ekvx, h226, a549, h460, hop92, h522
ovarian: 4, 3, 8, 5, igrov1, skov3
prostate: du145, pc3
cns: snb75
renal: caki1, achn, tk10, sn12c, rxf393, uo31, 786o, a498
breast: unknown
Figure 2: Dendrogram using all genes with valid annotations and with expression levels above those of the blank spots.
How good is a dendrogram?
We introduced a quality grade, based on the dendrograms, to describe how well each set of genes used to produce a dendrogram classifies each kind of cancer:
• A = there is a cluster containing all samples of one kind of cancer, and only those
• B = all, with one or two extras
• C = all except one
• D = all except one, with extras
• E = all except two
• F = all except two, with extras
Grades for the dendrogram of Figure 2 are displayed in the following table.
Cancer B C L M N O P R S
Grade A A D F D C B
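The grading scheme can be written down as a small function over the clusters of a dendrogram. This is one possible reading, not the authors' procedure: the poster does not say how many "extras" are tolerated for grades D and F, so the sketch applies the "one or two" limit from grade B to all three (the `max_extras` parameter), and it takes the best grade over all candidate clusters. Function and sample names are hypothetical.

```python
def grade_cluster(cluster, cancer_of, cancer, max_extras=2):
    """Grade how well one cluster captures one kind of cancer (A-F, or None).

    cluster:   iterable of sample names in the cluster
    cancer_of: dict sample name -> cancer type, covering all samples
    """
    cluster = set(cluster)
    members = {s for s, c in cancer_of.items() if c == cancer}
    if not (cluster & members):
        return None                      # cluster contains none of this cancer
    missing = len(members - cluster)     # samples of this cancer left out
    extras = len(cluster - members)      # samples of other cancers included
    if missing > 2 or extras > max_extras:
        return None                      # assumed cut-off: leave the cell blank
    pure = "ACE"[missing]                # grades with no extras: A, C, E
    mixed = "BDF"[missing]               # grades with extras:    B, D, F
    return pure if extras == 0 else mixed

def best_grade(clusters, cancer_of, cancer, max_extras=2):
    """Best (alphabetically first) grade over all clusters, or None."""
    grades = [grade_cluster(c, cancer_of, cancer, max_extras) for c in clusters]
    grades = [g for g in grades if g is not None]
    return min(grades) if grades else None

# Toy labels, for illustration only.
cancer_of = {"colon.1": "colon", "colon.2": "colon", "colon.3": "colon",
             "renal.1": "renal", "renal.2": "renal"}
print(grade_cluster(["colon.1", "colon.2", "colon.3"], cancer_of, "colon"))
print(grade_cluster(["colon.1", "colon.2", "renal.1"], cancer_of, "colon"))
```

In practice `clusters` would be the set of leaves under each internal node of the dendrogram.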
Heterogeneity of different types of cancer
ch: grades (columns B C L M N O P R S)
1: B A D F D B
2: E C D D E D E
3: C E D E F
4: E E E E
5: A A D F E
6: C A D E E D
7: E A D E C E
8: E C D
9: B C C E E E
10: D E
11: E C C D
12: B C C E E E
13: D E
14: A A F
15: C B C F C
16: (blank)
17: A A D F E E
18: E D
19: D D
20: E C
21: (blank)
22: A E E
X: B A D E D
• Some cancers (colon, leukemia) are fairly homogeneous and easy to distinguish from others.
• Some (breast, lung) are so heterogeneous as to be nearly impossible to distinguish.
• Some chromosomes (1, 2, 6, 7, 9, 12, 17) can distinguish many types of cancer.
• Some (16, 21) cannot accurately distinguish any kind of cancer. The dendrograms using genes from these chromosomes are equivalent to a random scrambling of the cancer cell lines.
Table 3: Grades given to dendrograms that cluster samples by genes on specific chromosomes. Grades range from A to F, with blanks indicating no clustering for that type of sample. Abbreviations: B=breast, C=colon, L=leukemia, M=melanoma, N=non-small cell lung, O=ovarian, P=prostate, R=renal, S=central nervous system.
[Figure 3 image: dendrogram, panel title "Chromosome 2"; axis ticks 0.0 to 0.8; same 60 cell-line labels as Figure 2]
Figure 3: The genes on chromosome 2 do an excellent job of distinguishing cancer types. We can also locate specific clusters of genes on the chromosome with strong signatures identifying leukemia, melanoma, and colon cancer.
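A per-chromosome dendrogram of this kind can be sketched by restricting the expression data to the genes on one chromosome and then clustering the samples. The poster does not state the distance measure or linkage rule; this sketch assumes one minus the Pearson correlation and average linkage, and every name and number in it is a hypothetical placeholder.

```python
import math

def correlation_distance(x, y):
    """1 - Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)

def cluster_samples(expr, gene_chrom, chrom):
    """Average-linkage clustering of samples using genes on one chromosome.

    expr:       dict sample -> dict gene -> expression value
    gene_chrom: dict gene -> chromosome label
    Returns the merges in order as (cluster_a, cluster_b, distance),
    where each cluster is a frozenset of sample names.
    """
    genes = sorted(g for g, c in gene_chrom.items() if c == chrom)
    samples = sorted(expr)
    profile = {s: [expr[s][g] for g in genes] for s in samples}
    dist = {frozenset((a, b)): correlation_distance(profile[a], profile[b])
            for i, a in enumerate(samples) for b in samples[i + 1:]}

    def linkage(c1, c2):
        # Average of all pairwise sample distances between the two clusters.
        pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    clusters = [frozenset([s]) for s in samples]
    merges = []
    while len(clusters) > 1:
        d, c1, c2 = min(((linkage(a, b), a, b)
                         for i, a in enumerate(clusters) for b in clusters[i + 1:]),
                        key=lambda t: t[0])
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((c1, c2, d))
    return merges

# Toy data, for illustration only: three genes on "2" and one on "16".
expr = {
    "colon.a": {"g1": 1.0, "g2": 2.0, "g3": 3.0, "g9": 0.0},
    "colon.b": {"g1": 1.1, "g2": 2.1, "g3": 2.9, "g9": 5.0},
    "renal.a": {"g1": 3.0, "g2": 2.0, "g3": 1.0, "g9": 0.0},
    "renal.b": {"g1": 2.9, "g2": 2.2, "g3": 1.1, "g9": 5.0},
}
gene_chrom = {"g1": "2", "g2": "2", "g3": "2", "g9": "16"}
merges = cluster_samples(expr, gene_chrom, "2")
for a, b, d in merges:
    print(sorted(a), sorted(b), round(d, 4))
```

The same function applies to a functional category by passing a gene-to-category mapping in place of `gene_chrom`.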
[Figure 4 image: dendrogram, panel title "Chromosome 16"; axis ticks 0.0 to 0.8; same 60 cell-line labels as Figure 2]
Figure 4: Genes on chromosome 16 cannot reliably distinguish any single kind of cancer in this study. There are, nevertheless, strong gene signatures driving the clustering, which does not appear to match anything we know about the biology of the samples.
[Figure 5 image: dendrogram, panel title "Protein Metabolism" (protein metabolism and modification); axis ticks 0.0 to 0.8; same 60 cell-line labels as Figure 2]
Figure 5: The genes involved in protein metabolism do an excellent job of distinguishing cancer types. We can also locate specific clusters of genes with strong signatures identifying leukemia, colon cancer, lung cancer, and central nervous system cancer.
[Figure 6 image: dendrogram, panel title "Apoptosis" (death (apoptosis)); axis ticks 0.0 to 0.8; same 60 cell-line labels as Figure 2]
Figure 6: The genes involved in apoptosis do a poor job of distinguishing cancer types. This suggests that the mechanisms by which cancers overcome cell death cut across the normal biological lines drawn by histology.
Conclusions
• Multiple views into the data provide substantial insight into differences in cancer types and gene sets.
• Cancer types differ greatly in their degree of heterogeneity, ranging from homogeneous (colon, leukemia) through moderately heterogeneous (renal, melanoma) to extremely heterogeneous (breast and lung).
• Homogeneous cancers exhibit strong identifying signals across most views of the data, regardless of function or chromosome.
• There are large differences in the ability of genes on different chromosomes to distinguish cancer types. There are similar differences for genes involved in different biological processes (data not shown).
• Functional categories that are good at distinguishing cancers include signal transduction, cell cycle, cell proliferation, and protein metabolism. Some differences result from the histology of the underlying tissue. Others reflect differences in the way particular kinds of cancers overcome limits on cell growth.
• Categories that are poor at distinguishing cancers include energy pathways and apoptosis. The latter observation has potential implications for cancer therapies designed to trigger apoptosis, since it suggests that the mechanisms by which cancer cells avoid cell death are not linked to the general type of cancer but are either common across cancers or idiosyncratic.