![Page 1: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/1.jpg)
Tree Based Methods for Analyzing Tissue Microarray Data
Steve Horvath
Human Genetics and Biostatistics
University of California, Los Angeles
![Page 2: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/2.jpg)
Acknowledgements
• Horvath Lab
– Yunda Huang
– Xueli Liu Ph.D.
– Zeke Fang Ph.D.
– Tuyen Hoang
• UCLA Tissue Microarray Core
– David Seligson
– Aarno Palotie
• Clinicians
– Hyung Kim
– Arie Belldegrun
![Page 3: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/3.jpg)
Contents
• Statistical issues with tissue microarray (TMA) data
• Random forest (RF) predictors
• RF clustering
• Application of RF clustering to TMA data
• Supervised Learning Methods
![Page 4: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/4.jpg)
Background TMA data
![Page 5: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/5.jpg)
Description of TMA data
• Tissue microarrays are a high-throughput tool for validating biomarkers newly identified in genome-wide discovery studies
• Basic technique was summarized in Kononen et al. 1998
![Page 6: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/6.jpg)
Tissue Microarray (TMA) Technology
Kononen et al. Nature Medicine 1998
[Figure: donor block, array block, slide]
• Hundreds of tiny (typically 0.6 mm diameter) cylindrical tissue cores
– densely and precisely arrayed into a single histologic paraffin block.
• From this new array block, up to 300 serial sections, each 4-8 µm thick, may be produced.
• The sections serve as targets for fluorescence in situ hybridization (FISH) and for protein expression studies by immunohistochemistry.
![Page 7: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/7.jpg)
Pathologists score each spot by looking through a microscope. (Slide by David Seligson)
The resulting scores are non-normal and highly correlated.
![Page 8: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/8.jpg)
Several Spots per Pathology Case, Several “Scores” per Spot
• Maximum intensity = Max (1 – 4)
• Percent of cells staining = Pos (0 – 100)
• Percent of cells staining with the maximum intensity = PosMax (0 – 100)
• Spots have a spot grade: NL, 1, 2, ...
• Indicator of informativeness
• Each case is usually represented by 4 or more spots
– >3 malignant lesions, 1 matched normal
![Page 9: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/9.jpg)
Histogram of tumor marker expression scores: POS and MAX
[Figure: histograms for P53, CA9 and EpCam. Panels show the percent of cells staining (POS, 0 to 100) and the maximum intensity (MAX, 0 to 3).]
![Page 10: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/10.jpg)
P53 and Ki67: Max versus Pos
[Figure: scatterplot matrices of the Max and Pos scores. Variables: KiNuclMax and KiPos (Ki67), P5NuclMax and P5Pos (P53).]
![Page 11: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/11.jpg)
Characteristics of TMA data
• Non-normal, discrete, strongly correlated
• Mixed variable types
• Pooling (combining) spot measurements across every patient
– between 1 and 10 spots of different grade
– current strategy pools tumor spots and forms the median, mean, minimum or maximum
• Message: tumor marker intensity is measured by up to 12 highly correlated staining scores, hence multicollinearity
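The pooling strategy described above can be sketched in a few lines. This is a minimal illustration only: the record layout `(patient_id, spot_type, score)`, the function name `pool_tumor_spots` and the example values are assumptions made for the sketch, not part of the original analysis.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical spot-level records: (patient id, spot type, Pos score 0-100).
spots = [
    ("p1", "tumor", 80), ("p1", "tumor", 60), ("p1", "tumor", 100), ("p1", "normal", 5),
    ("p2", "tumor", 10), ("p2", "tumor", 30), ("p2", "normal", 0),
]

def pool_tumor_spots(records):
    """Pool the tumor spots of each patient into median, mean, min and max."""
    by_patient = defaultdict(list)
    for patient_id, spot_type, score in records:
        if spot_type == "tumor":          # matched normal spots are not pooled
            by_patient[patient_id].append(score)
    return {
        pid: {"median": median(s), "mean": mean(s), "min": min(s), "max": max(s)}
        for pid, s in by_patient.items()
    }

pooled = pool_tumor_spots(spots)
```

Each marker then contributes several pooled scores per patient, which is one way to see where the strong correlation among the staining scores comes from.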
![Page 12: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/12.jpg)
Our main tools are random forest predictors
• Unsupervised analysis of TMA data
– RF clustering
• Supervised analysis
– RF based pre-validation method
![Page 13: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/13.jpg)
Background random forest predictors
L. Breiman 1999
![Page 14: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/14.jpg)
Random Forests (RFs)
• RFs are a collection of tree predictors such that each tree depends on the values of an independently sampled random vector
![Page 15: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/15.jpg)
Classification and Regression Trees (CART)
by
– Leo Breiman, UC Berkeley
– Jerry Friedman, Stanford University
– Charles J. Stone, UC Berkeley
– Richard Olshen, Stanford University
![Page 16: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/16.jpg)
An example of CART
• Goal: for patients admitted to the ER, predict who is at higher risk of heart attack
• Training data set:
– # of subjects = 215
– Outcome variable = High/Low Risk determined
– 19 noninvasive clinical and lab variables were used as the predictors
![Page 17: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/17.jpg)
CART construction
[Figure: classification tree for the heart attack example. Split questions, each with Yes/No branches: "Is BP <= 91?", "Is age <= 62.5?", "Is ST present?". Node class proportions shown in the figure (High/Low): 17%/83%, 12%/88%, 70%/30%, 11%/89%, 50%/50%, 2%/98%, 23%/77%. Each terminal node is classified as high risk or low risk.]
![Page 18: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/18.jpg)
CART Construction
BINARY RECURSIVE PARTITIONING
• Binary: split parent node into two child nodes
• Recursive: each child node can be treated as parent node
• Partitioning: data set is partitioned into mutually exclusive subsets in each split
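Binary recursive partitioning can be illustrated with a small sketch. The Gini impurity criterion, the depth limit, the function names and the toy data below are all assumptions chosen for the illustration; CART as published also includes pruning and other refinements that are not shown here.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature index, threshold) of the binary split 'x[f] <= t' with
    the lowest weighted Gini impurity, or None if no split improves on the node."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows})[:-1]:   # candidate thresholds
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow(rows, labels, depth=0, max_depth=3):
    """Recursively partition the data; each child node is treated as a parent."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:                                  # terminal node: majority class
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f, "threshold": t,
        "yes": grow([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "no": grow([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

def predict(tree, row):
    """Follow the yes/no answers down to a terminal node."""
    while isinstance(tree, dict):
        tree = tree["yes"] if row[tree["feature"]] <= tree["threshold"] else tree["no"]
    return tree

# Made-up toy data: (blood pressure, age) -> risk class.
X = [(85, 70), (88, 50), (120, 40), (130, 55), (125, 70), (118, 75)]
y = ["high", "high", "low", "low", "high", "high"]
tree = grow(X, y)
```

Every split sends each observation to exactly one of the two children, so the terminal nodes form mutually exclusive subsets of the training data.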
![Page 19: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/19.jpg)
RF Construction
…
![Page 20: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/20.jpg)
Prediction by plurality voting
• The forest consists of N trees.
• Class prediction:
– Each tree votes for a class; the predicted class for an observation is the plurality, $\arg\max_C \sum_{k=1}^{N} I[f_k(x,T) = C]$
• Regression random forest:
– predicted value is the average prediction
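The two aggregation rules can be sketched as follows. The "trees" here are hypothetical stand-ins (plain functions of an observation), and the names `forest_classify` and `forest_regress` are assumptions made for the illustration.

```python
from collections import Counter
from statistics import mean

def forest_classify(trees, x):
    """Plurality vote: the class predicted by the largest number of trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

def forest_regress(trees, x):
    """Regression forest: average of the individual tree predictions."""
    return mean(tree(x) for tree in trees)

# Three hypothetical 'trees', each a simple rule on x = (bp, age).
class_trees = [
    lambda x: "high" if x[0] <= 91 else "low",
    lambda x: "high" if x[1] > 62.5 else "low",
    lambda x: "low",
]
reg_trees = [lambda x: 1.0, lambda x: 2.0, lambda x: 6.0]

vote = forest_classify(class_trees, (85, 70))     # two of three trees vote "high"
average = forest_regress(reg_trees, (85, 70))     # (1 + 2 + 6) / 3
```

In this sketch a tie is resolved in favor of the class that received its first vote earliest; a real implementation may break ties differently.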
![Page 21: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/21.jpg)
Clustering with random forest predictors
![Page 22: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/22.jpg)
Intrinsic Proximity Measure
• Terminal tree nodes contain few observations
• If case i and case j both land in the same terminal node, increase the proximity between i and j by 1.
• At the end of the run divide by 2* no. of trees.
• Dissimilarity=sqrt(1-Proximity)
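A minimal sketch of the proximity computation, assuming the terminal node reached by every case in every tree is already available as a nested list (`leaf_ids`, a hypothetical input). The sketch counts each pair once per tree and therefore divides by the number of trees; the normalizer of 2 × the number of trees on the slide presumably corresponds to a convention in which each co-occurrence is counted twice.

```python
from math import sqrt

def rf_proximity(leaf_ids):
    """leaf_ids[t][i] is the terminal node that case i reaches in tree t.
    Returns the matrix of proximities: the fraction of trees in which two
    cases share a terminal node (each pair is counted once per tree)."""
    n_trees, n = len(leaf_ids), len(leaf_ids[0])
    prox = [[0.0] * n for _ in range(n)]
    for leaves in leaf_ids:
        for i in range(n):
            for j in range(n):
                if leaves[i] == leaves[j]:
                    prox[i][j] += 1.0
    return [[p / n_trees for p in row] for row in prox]

def rf_dissimilarity(prox):
    """Dissimilarity = sqrt(1 - proximity), as on the slide."""
    return [[sqrt(1.0 - p) for p in row] for row in prox]

# Hypothetical terminal-node assignments of 4 cases in 3 trees.
leaf_ids = [
    [1, 1, 2, 2],
    [5, 5, 5, 7],
    [3, 4, 4, 4],
]
prox = rf_proximity(leaf_ids)
dissim = rf_dissimilarity(prox)
```

With this normalization a case has proximity 1 to itself, so its dissimilarity to itself is 0, and two cases that never share a terminal node have dissimilarity 1.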
![Page 23: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/23.jpg)
Casting an unsupervised problem into a supervised RF problem
• Key Idea (Breiman 1999)
– Label observed data as class 1
– Generate synthetic observations and label them as class 2
– Construct a RF predictor to distinguish class 1 from class 2
– Use the resulting dissimilarity measure in unsupervised analysis
![Page 24: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/24.jpg)
How to generate synthetic observations
• Synthetic observations are simulated to contain no clusters
– e.g. by randomly sampling from the product of the empirical marginal distributions of the inputs.
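Sampling from the product of the empirical marginals amounts to drawing each variable independently from its own observed values. The sketch below illustrates this under stated assumptions: the function names, the choice of generating as many synthetic rows as observed rows, and the example data are all hypothetical.

```python
import random

def synthetic_from_marginals(data, n_synthetic, seed=0):
    """Sample each variable independently (with replacement) from its observed
    values, i.e. from the product of the empirical marginal distributions.
    This removes the dependence between variables."""
    rng = random.Random(seed)
    columns = list(zip(*data))                       # one tuple per variable
    return [tuple(rng.choice(col) for col in columns) for _ in range(n_synthetic)]

def label_for_rf(data, seed=0):
    """Observed rows get class 1, an equal number of synthetic rows class 2."""
    synthetic = synthetic_from_marginals(data, len(data), seed)
    rows = list(data) + synthetic
    labels = [1] * len(data) + [2] * len(synthetic)
    return rows, labels

# Hypothetical observed data: (Pos, Max) scores for five tumors.
observed = [(90, 3), (80, 3), (85, 3), (10, 1), (5, 1)]
rows, labels = label_for_rf(observed)
```

The labeled rows could then be passed to any random forest implementation; the forest can only separate the two classes by exploiting the dependence structure that the synthetic rows lack.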
![Page 25: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/25.jpg)
RF clustering
• Compute distance matrix from RF
– distance matrix = sqrt(1-proximity matrix)
• Compute the first 2 or 3 classical multi-dimensional scaling coordinates based on the distance matrix
• Conduct partitioning around medoids (PAM) clustering analysis
– input parameter = no. of clusters k
– use the Euclidean distance between the resulting scaling points
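The final clustering step can be sketched as below. The multi-dimensional scaling step is omitted, so the scaling coordinates are hypothetical inputs, and the sketch implements only the swap phase of PAM from a naive starting point (PAM proper first selects initial medoids with a greedy build step). It is an illustration under these assumptions, not the implementation used for the results in these slides.

```python
from math import dist   # Euclidean distance, Python 3.8+

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k):
    """Swap phase of PAM: start from the first k points as medoids (a
    simplification) and keep swapping a medoid with a non-medoid for as
    long as a swap lowers the total cost."""
    medoids = list(range(k))
    cost = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = total_cost(points, trial)
                if trial_cost < cost:
                    medoids, cost, improved = trial, trial_cost, True
    # Assign every point to the cluster of its nearest medoid.
    labels = [min(range(k), key=lambda c: dist(p, points[medoids[c]])) for p in points]
    return medoids, labels

# Hypothetical 2-D scaling coordinates forming two groups.
coords = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
medoids, labels = pam(coords, k=2)
```

Because medoids are actual observations, each cluster is represented by a real patient rather than by an average point.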
![Page 26: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/26.jpg)
Theoretical Study of RF Clustering
Ref: Using random forest proximity for unsupervised learning, BIOKDD-CBGI'03, 7th Joint Conference on Information Sciences, Cary, North Carolina.
![Page 27: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/27.jpg)
Applying Random Forest Clustering to Tissue Microarray Data: Application to Kidney Cancer
Tao Shi and Steve Horvath
![Page 28: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/28.jpg)
Scientific Question: Can one discover cancer subtypes based on the protein expression patterns of tumor markers?
![Page 29: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/29.jpg)
Why use RF clustering for TMA data?
• no need to transform the often highly skewed features
– based on ranks of features
• natural way of weighting tumor marker contributions to the dissimilarity
• elegant way to deal with missing covariates
• intrinsic proximity matrix handles mixed variable types well
![Page 30: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/30.jpg)
Kidney Multi-marker Data
• 366 patients with Renal Cell Carcinoma (RCC) admitted to UCLA between 1989 and 2000.
• Immuno-histological measures of a total of 8 tumor markers were obtained from tissue microarrays constructed from the tumor samples of these patients.
![Page 31: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/31.jpg)
MDS plot of clear cell patients
• Labeled and colored by their RF cluster
[Figure: classical MDS plot ("cmd plot"), coordinate 1 versus coordinate 2. Each patient is plotted as its RF cluster label (1, 2 or 3).]
![Page 32: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/32.jpg)
Interpreting the clusters in terms of survival
[Figure: Kaplan-Meier (K-M) survival curves by cluster. x-axis: Time to death (Months); y-axis: Survival. Log rank p value = 0.00037423]

| Clustering label | Non-clear cell patients | Clear cell patients |
|---|---|---|
| 1 | 0 | 92 |
| 2 | 20 | 215 |
| 3 | 30 | 9 |
![Page 33: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/33.jpg)
Hierarchical clustering with Euclidean distance leads to less satisfactory results
[Figure: cluster dendrogram, hclust (*, "average") applied to dist(KidneyRF). y-axis: Height. Leaves are labeled 0 or 1.]

| Clustering label | Non-clear cell patients | Clear cell patients |
|---|---|---|
| 1 | 9 (20) | 286 (307) |
| 2 | 41 (30) | 30 (9) |

* RF clustering grouping in parentheses (shown in red on the slide)
![Page 34: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/34.jpg)
Euclidean vs. RF Distance
[Figure: scatterplot of RF distance (y-axis) versus Euclidean distance (x-axis)]
![Page 35: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/35.jpg)
Molecular grouping vs. Pathological grouping
Message: molecular grouping is superior to pathological grouping
[Figure: two Kaplan-Meier plots, survival versus time to death (years), one per grouping. Molecular grouping: 327 patients in clusters 1 and 2 versus 39 patients in cluster 3. Pathological grouping: 316 clear cell patients versus 50 non-clear cell patients. p values shown on the plots: p = 0.0229 and p = 9.03e-05.]
![Page 36: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/36.jpg)
Identify “irregular” patients
| Clustering label | Non-clear cell patients | Clear cell patients |
|---|---|---|
| 1 | 0 | 92 |
| 2 | 20 | 215 |
| 3 | 30 | 9 |

Message: molecular grouping can be used to refine the clear cell definition.
[Figure: Kaplan-Meier curves, survival versus time to death (years), for 9 irregular clear cell patients, 307 regular clear cell patients and 50 non-clear cell patients; p = 0.00522]
![Page 37: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/37.jpg)
Detect novel cancer subtypes
• Group clear cell grade 2 patients into two clusters with significantly different survival.
[Figure: K-M curves for the two clusters, survival versus time to death (years); p value = 0.0125]
![Page 38: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/38.jpg)
Results TMA clustering
• Clusters reproduce well known clinical subgroups
– Ex: global expression differences between clear cell and non-clear cell patients
– RF clustering works better than clustering based on the Euclidean distance for TMA data
• RF clustering allows one to identify “outlying” tumor samples.
• Can detect previously unknown sub-groups
![Page 39: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/39.jpg)
Boxplots of tumor marker expression vs. cluster
[Figure: boxplots of tumor marker expression scores by cluster (1, 2, 3), one panel per score, each with a p value:]

| Marker score | p value |
|---|---|
| CA9MemPosMn | 9.95e-28 |
| CA12MemPosMn | 4.61e-15 |
| Ki67PosMn | 3.51e-13 |
| GePosHarriMn | 3.33e-21 |
| p53PosMn | 1.7e-10 |
| EpDctPosMn | 1.64e-14 |
| pTENPosMn | 1.43e-27 |
| VimPos | 7.97e-14 |

Message: clusters can be explained in terms of tumor marker expression values, i.e. in terms of biological pathways.
![Page 40: Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649e905503460f94b94adf/html5/thumbnails/40.jpg)
Conclusions
• There is a need to develop tailor-made data mining methods for TMA data
– Major differences:
• highly non-normal data
• Euclidean distance metrics seem to be sub-optimal for TMA data
• tree or forest based methods work well for kidney and prostate TMA data